limetech Posted June 27, 2013 Used nars/WeeboTech scripts and it seems solid to me, but I have only done limited testing.
madburg Posted June 27, 2013 Can you share some details (e.g. kernel version, changes, etc. in RC16a from RC15a) for those of us that are not testing? When you have a moment.
limetech Posted June 27, 2013 Can you share some details (e.g. kernel version, changes, etc. in RC16a from RC15a) for those of us that are not testing? When you have a moment. Everything is the same as -15a except: the "patch" obtained from the reiserfs maintainer and put into -rc15 is not applied in -rc16. Instead I developed my own "patch" to fix this problem and created -rc16. But this did not completely solve the problem, so after learning more of the reiserfs code I have tweaked my patch and created -rc16a. I believe this patch will solve the problem, but I'm not sure it's the most elegant solution; I think a better solution exists, but I don't understand enough of the reiserfs code to make that change. So my plan is this: if -rc16a works, then I will release 5.0 and move on to bigger and better things, meanwhile monitoring the reiserfs-devel mailing list for a better solution. If this does not work correctly, then I will spend more time and learn more of the reiserfs code.
nars Posted June 27, 2013 I do have good news. Same kernel, 3.9.6; I guess only the patch changed? But I would also like to know details... I guess Tom may have had to go to San Quentin. Well, I ran 2 tests so far, one with 512MB and one with 1GB (with any previous problematic rc version I know I would have reproduced this by now, as for some reason I could always reproduce it too easily...) and I can't reproduce it with rc16a!! I will do more extensive tests, but so far I think it's really sorted. Many thanks Tom for this!
madburg Posted June 27, 2013 Sounds good. Anything you feel one should look out for, since you're "not sure it's the most elegant solution"? E.g. don't stop and start the array too much, or random disks may spin up, or you may see more reads/writes once a day at a given hour (I am making crap up as an example). Unless you can share what your approach is (if it's not public info, that's no problem), which could possibly give some of us a better sense of what to look out for.
limetech Posted June 27, 2013 I do have good news. Same kernel, 3.9.6; I guess only the patch changed? Correct. But I would also like to know details... I guess Tom may have had to go to San Quentin. LOL, thought about it - but he's in Mule Creek and I doubt he's thought about file systems for a long time. The original changes put into 3.5 added a work queue and the notion of queuing "work" every 5 seconds to flush dirty blocks (this emulates the old kernel thread that did this in 3.4 and below). The implementation of this had a bug in that each 'flush' resulted in additional disk i/o which, as a result, scheduled another 'flush' to take place 5 seconds later. Hence we got the "continuous write" issue. What my patch does, essentially, is maintain all this but change the 'flush' to not issue disk i/o if it doesn't need to. But note that a "sync" command will still cause disk i/o. I say the change is not "elegant" because it maintains this 5-sec polling loop (which, btw, is totally and completely harmless), which they are trying to get rid of (because Linux is being run on phones and they don't want anything running that doesn't have to, in the interest of saving battery power - which we obviously don't care about). A better solution is to implement a different way of scheduling the flush of the journal, but I don't know enough of this code to make that change.
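To make the failure mode Tom describes concrete, here is a toy shell model of the rescheduling logic; this is purely illustrative (the real fix lives in the kernel's reiserfs journal code, in C): the buggy flush always issues i/o, which re-arms the 5-second timer forever, while the fixed flush skips the i/o when nothing is dirty.

```shell
#!/bin/bash
# Toy model (NOT the real kernel patch): count how many of 10 simulated
# 5-second ticks actually issue disk i/o under each behavior.

simulate() {                 # $1 = "buggy" or "fixed"
  local mode=$1 dirty=1 io_count=0 tick
  for tick in $(seq 1 10); do
    if [ "$dirty" -eq 1 ]; then
      io_count=$((io_count + 1))   # the flush issues disk i/o
      if [ "$mode" = buggy ]; then
        dirty=1                    # bug: the flush's own i/o re-dirties state,
                                   # so another flush is scheduled 5s later
      else
        dirty=0                    # fix: flush leaves nothing dirty behind
      fi
    fi
  done
  echo "$mode: $io_count flushes issued i/o"
}

simulate buggy    # every tick flushes: the "continuous write" symptom
simulate fixed    # only the first tick flushes
```

Running it prints "buggy: 10 flushes issued i/o" versus "fixed: 1 flushes issued i/o", which is the difference between the -rc15 symptom and the -rc16a behavior described above.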
WeeboTech Posted June 27, 2013 On an idle system I've passed the nars test: 4G/ESX 5.1 to physical disks, copied to 2 other disks and read back. I've got a parity check going on now; I'm going to do the same test. FWIW, I captured the following information to ensure I reset the machine while it 'should' have been flushed:

Jun 26 23:48:16 unRAID2 watchmeminfo[1261]: LowFree: 763544 HighFree: 2591648 Dirty: 151064 Writeback: 7208
Jun 26 23:48:17 unRAID2 watchmeminfo[1261]: LowFree: 762676 HighFree: 2543412 Dirty: 146760 Writeback: 3144
Jun 26 23:48:18 unRAID2 watchmeminfo[1261]: LowFree: 761932 HighFree: 2506336 Dirty: 122164 Writeback: 8196
Jun 26 23:48:19 unRAID2 watchmeminfo[1261]: LowFree: 761784 HighFree: 2506336 Dirty: 78784 Writeback: 3448
Jun 26 23:48:20 unRAID2 watchmeminfo[1261]: LowFree: 761816 HighFree: 2506336 Dirty: 47392 Writeback: 0
Jun 26 23:48:21 unRAID2 watchmeminfo[1261]: LowFree: 762156 HighFree: 2506336 Dirty: 47684 Writeback: 0
Jun 26 23:48:22 unRAID2 watchmeminfo[1261]: LowFree: 760684 HighFree: 2408624 Dirty: 112680 Writeback: 8208
Jun 26 23:48:23 unRAID2 watchmeminfo[1261]: LowFree: 759652 HighFree: 2348360 Dirty: 143100 Writeback: 5156
Jun 26 23:48:24 unRAID2 watchmeminfo[1261]: LowFree: 758404 HighFree: 2298760 Dirty: 146756 Writeback: 3988
Jun 26 23:48:25 unRAID2 watchmeminfo[1261]: LowFree: 757908 HighFree: 2259824 Dirty: 148732 Writeback: 4
Jun 26 23:48:26 unRAID2 watchmeminfo[1261]: LowFree: 757412 HighFree: 2229196 Dirty: 120932 Writeback: 872
Jun 26 23:48:27 unRAID2 watchmeminfo[1261]: LowFree: 757420 HighFree: 2229196 Dirty: 79256 Writeback: 20
Jun 26 23:48:28 unRAID2 watchmeminfo[1261]: LowFree: 757420 HighFree: 2229196 Dirty: 46236 Writeback: 0
Jun 26 23:48:29 unRAID2 watchmeminfo[1261]: LowFree: 757420 HighFree: 2229196 Dirty: 46220 Writeback: 0
Jun 26 23:48:31 unRAID2 last message repeated 2 times
Jun 26 23:48:32 unRAID2 watchmeminfo[1261]: LowFree: 757404 HighFree: 2229196 Dirty: 46324 Writeback: 0
Jun 26 23:48:33 unRAID2 watchmeminfo[1261]: LowFree: 757528 HighFree: 2229196 Dirty: 13832 Writeback: 0
Jun 26 23:48:34 unRAID2 watchmeminfo[1261]: LowFree: 757544 HighFree: 2229196 Dirty: 13804 Writeback: 0
Jun 26 23:48:51 unRAID2 last message repeated 16 times
Jun 26 23:48:52 unRAID2 watchmeminfo[1261]: LowFree: 757544 HighFree: 2229072 Dirty: 13816 Writeback: 0
Jun 26 23:48:56 unRAID2 last message repeated 4 times
Jun 26 23:48:57 unRAID2 watchmeminfo[1261]: LowFree: 757504 HighFree: 2228824 Dirty: 13816 Writeback: 0
Jun 26 23:49:01 unRAID2 last message repeated 4 times
Jun 26 23:49:02 unRAID2 watchmeminfo[1261]: LowFree: 757504 HighFree: 2229032 Dirty: 13816 Writeback: 0
Jun 26 23:49:03 unRAID2 watchmeminfo[1261]: LowFree: 757512 HighFree: 2229072 Dirty: 4 Writeback: 52
Jun 26 23:49:04 unRAID2 watchmeminfo[1261]: LowFree: 757512 HighFree: 2229072 Dirty: 12 Writeback: 0
Jun 26 23:49:05 unRAID2 watchmeminfo[1261]: LowFree: 757512 HighFree: 2228824 Dirty: 12 Writeback: 0
Jun 26 23:49:06 unRAID2 watchmeminfo[1261]: LowFree: 757504 HighFree: 2228824 Dirty: 12 Writeback: 0
Jun 26 23:49:07 unRAID2 watchmeminfo[1261]: LowFree: 757504 HighFree: 2228824 Dirty: 12 Writeback: 0
Jun 26 23:49:08 unRAID2 watchmeminfo[1261]: LowFree: 757504 HighFree: 2229032 Dirty: 12 Writeback: 0
Jun 26 23:49:09 unRAID2 watchmeminfo[1261]: LowFree: 757512 HighFree: 2228948 Dirty: 12 Writeback: 0
Jun 26 23:49:10 unRAID2 watchmeminfo[1261]: LowFree: 757512 HighFree: 2228948 Dirty: 12 Writeback: 0
Jun 26 23:49:11 unRAID2 watchmeminfo[1261]: LowFree: 757480 HighFree: 2228948 Dirty: 12 Writeback: 0
Jun 26 23:49:12 unRAID2 watchmeminfo[1261]: LowFree: 757480 HighFree: 2228948 Dirty: 12 Writeback: 0
Jun 26 23:49:13 unRAID2 watchmeminfo[1261]: LowFree: 757480 HighFree: 2228824 Dirty: 12 Writeback: 0
Jun 26 23:49:14 unRAID2 watchmeminfo[1261]: LowFree: 757472 HighFree: 2228824 Dirty: 12 Writeback: 0
Jun 26 23:49:20 unRAID2 last message repeated 6 times
Jun 26 23:49:21 unRAID2 watchmeminfo[1261]: LowFree: 757472 HighFree: 2228824 Dirty: 0 Writeback: 0
Jun 26 23:49:25 unRAID2 last message repeated 3 times
Jun 26 23:49:26 unRAID2 watchmeminfo[1261]: LowFree: 757480 HighFree: 2228928 Dirty: 0 Writeback: 0
Jun 26 23:49:27 unRAID2 watchmeminfo[1261]: LowFree: 757448 HighFree: 2228948 Dirty: 0 Writeback: 0
Jun 26 23:49:28 unRAID2 watchmeminfo[1261]: LowFree: 757448 HighFree: 2228948 Dirty: 0 Writeback: 0
Jun 26 23:49:29 unRAID2 watchmeminfo[1261]: LowFree: 757456 HighFree: 2228948 Dirty: 0 Writeback: 0

So it looks like you need a minimum of 30 seconds before all data should be flushed. I've checked this multiple times; there are times when the dirty buffer sits there for 30 seconds before it drops to 0. And for the record, I'm thrilled we have bash 4.1: we have associative arrays and co-processes. For those interested, here's the script. It's got some interesting techniques. (Ask questions in another thread.)

#!/bin/bash
[ ${DEBUG:=0} -gt 0 ] && set -x -v

P=${0##*/}       # basename of program
R=${0%%$P}       # dirname of program
P=${P%.*}        # strip off after last . character
TMPFILE=/tmp/${P}.$$

declare -a FIELDS
declare -A MEMINFO

coproc LOGGER {
  logger -s -t${P}[$$]
}
# ps -fp ${LOGGER_PID}

exec 11>&1               # Link file descriptor #11 with stdout.
exec 12>&2               # Link file descriptor #12 with stderr.
exec 1>&${LOGGER[1]}     # redirect stdout to logger co-process.
exec 2>&${LOGGER[1]}     # redirect stderr to logger co-process.

capture_meminfo() {
  while read -a FIELDS
  do
    FIELDS[0]="${FIELDS[0]%:}"
    FIELDS[0]="${FIELDS[0]/(/_}"
    FIELDS[0]="${FIELDS[0]/)/_}"
    MEMINFO["${FIELDS[0]}"]=${FIELDS[1]}
    # echo "${FIELDS[0]} ${FIELDS[1]}"
  done < /proc/meminfo >&12
}

# echo "1 this is STDOUT" >&1
# echo "2 this is STDERR" >&2

while true
do
  capture_meminfo
  echo "LowFree: ${MEMINFO[LowFree]} HighFree: ${MEMINFO[HighFree]} Dirty: ${MEMINFO[Dirty]} Writeback: ${MEMINFO[Writeback]}"
  sleep 1
done

exec 1>&11 11>&-   # Restore stdout and close file descriptor #11.
exec 2>&12 12>&-   # Restore stderr and close file descriptor #12.
eval "exec ${LOGGER[0]}<&-"
eval "exec ${LOGGER[1]}>&-"
# echo "3 this is STDOUT" >&1
# echo "4 this is STDERR" >&2
limetech Posted June 27, 2013 So it looks like you need a minimum of 30 seconds before all data should be flushed. I've checked this multiple times; there are times when the dirty buffer sits there for 30 seconds before it drops to 0. Looking at meminfo tells you there are dirty buffers, but not which file systems those buffers are for; e.g., some of those are certainly for the root file system. I believe it should only be necessary to wait 5 seconds after the last disk i/o has completed (by looking at the disk activity LEDs and/or looking at the i/o counters on the Main page) before data should be safe. Actually, this "5 seconds" is programmable: vm.dirty_writeback_centisecs
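The tunable Tom mentions is expressed in hundredths of a second and can be read or changed at runtime. A minimal read-only sketch (the 500-centisec fallback is the stock default seen later in this thread; the write commands are shown but commented out since they require root):

```shell
#!/bin/bash
# Report how often the kernel's periodic writeback wakes up.
# Falls back to 500 centisecs (5s) if /proc isn't readable here.
wb=$(cat /proc/sys/vm/dirty_writeback_centisecs 2>/dev/null || echo 500)
echo "writeback flusher wakes every $((wb / 100)) seconds"

# To change it (as root), either of these would work:
#   sysctl -w vm.dirty_writeback_centisecs=500
#   echo 500 > /proc/sys/vm/dirty_writeback_centisecs
```

Note this only controls how often the flusher wakes, not how old data must be before it is written out; that is a separate tunable (dirty_expire_centisecs), covered further down the thread.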
WeeboTech Posted June 27, 2013 I wanted to be absolutely sure I wasn't going to taint the test. I know you're working so hard on this; the last thing you need is a test that triggers what looks like a bug.
limetech Posted June 27, 2013 I wanted to be absolutely sure I wasn't going to taint the test. I know you're working so hard on this; the last thing you need is a test that triggers what looks like a bug.
WeeboTech Posted June 27, 2013 I've passed the nars test while a parity check is in progress. It's midnight here, so I'm going to resume my 512 parallel copy tomorrow after the parity check completes. I think after that I might do one final test: remove parity and create new parity, then do the parallel copy while the new parity generation is in progress (just for kicks). So far I feel much better about this version. Also, FWIW, I'm getting 170MB/s parity check speeds. I can't believe it!!!
mejutty Posted June 27, 2013 My test system is about to have 22 disks put into it tonight as a test for another apparent limitation of ESXi and 3 M1015 cards. I am going to modify some of the test scripts floating around to try to get every disk in the array written to in parallel at the same time, then cut power after 15 seconds and see what happens.
limetech Posted June 27, 2013 I've passed the nars test while a parity check is in progress. It's midnight here, so I'm going to resume my 512 parallel copy tomorrow after the parity check completes. I think after that I might do one final test: remove parity and create new parity, then do the parallel copy while the new parity generation is in progress (just for kicks). So far I feel much better about this version. Also, FWIW, I'm getting 170MB/s parity check speeds. I can't believe it!!! Thanks for your help! I very much appreciate it.
limetech Posted June 27, 2013 Also, with this release, please try Stop/Start array a few times to make sure this kernel-crash bug is still gone too.
nars Posted June 27, 2013 What my patch does, essentially, is maintain all this but change the 'flush' to not issue disk i/o if it doesn't need to. But note that a "sync" command will still cause disk i/o. I say the change is not "elegant" because it maintains this 5-sec polling loop (which, btw, is totally and completely harmless), which they are trying to get rid of. A better solution is to implement a different way of scheduling the flush of the journal, but I don't know enough of this code to make that change. Thanks for the explanation, really appreciate it. And it sounds good to me... keep everything "as is" and just fix the unneeded flush i/o. If you are absolutely sure that it only stops unneeded i/o (nothing from the tests points to the opposite), then it seems perfect to me. I would not care about getting rid of that 5-sec polling loop; it will likely not help unRAID users at all, and it is just not your task. I would just mail Jeff and/or post your findings to the reiserfs newsgroups and leave such improvements to them. This is my humble opinion.
Let's just hope the crashes when stopping the array don't come back, as you did bring back some of the rc13 behavior; but those could surely be related to other changes you made and not to this? Btw, I repeated the tests a few more times now and so far it seems perfect to me. I also monitored the dirty buffers and writeback values while testing and found the same thing about them going to 0 only at 30 seconds. Also, if you read my previous description of the HDD activity status LEDs, it's exactly when the last small LED blink happens, 30 seconds after the write completes, that the dirty buffers go to 0.
nars Posted June 27, 2013 Also, with this release, please try Stop/Start array a few times to make sure this kernel-crash bug is still gone too. Thought about it... and will do... but I was never able to reproduce it on rc13, so I don't think I'm the right guy to test it. Maybe it would be a good idea to include in the test someone who was able to easily/always reproduce it on rc13?
garycase Posted June 27, 2013 Okay, I've run the same test I could reliably create problems with before (on rc15a), and the "nars test" (which also failed on rc15a), and both work perfectly on 16a. Note that on my system these also worked fine on rc16, so I can't really tell whether 16a improved things or not. I'll try the parallel write test on this tomorrow, but I suspect the 22-drive version that mejutty plans to run is a FAR better test than anything on my little test box.
WeeboTech Posted June 27, 2013 then cut power after 15 seconds and see what happens This isn't a totally fair test, at least for the bug encountered previously. I would suggest monitoring the dirty buffers and waiting until they reach 0; it should be around 30 seconds. While it would be interesting to see what happens when there are some dirty buffers left, it may skew the test a bit.
jumperalex Posted June 27, 2013 For comparison, how long would similar test(s) using ext3/4 take to clear the buffers? Or, put another way, how quickly could you pull the plug on an ext3/4 system and observe / not observe corruption? I ask because surely we can't expect better resiliency than that without specifically designing for it (read: instant commits). Or is that not an appropriate benchmark?
WeeboTech Posted June 27, 2013 For comparison, how long would similar test(s) using ext3/4 take to clear the buffers? Or, put another way, how quickly could you pull the plug on an ext3/4 system and observe / not observe corruption? Since we are trying to resolve the reiserfs problem, doing the ext3/4 test within this context is probably not the best addition to the suite of tests. However, if you repost that question in another thread, I'm sure there will be people who would help test it. In the meantime, here's a dump of rc16a's /proc/sys/vm/dirty* values. According to what I'm reading, 30 seconds seems right: 5-second commits, and data older than 30 seconds is written out.

root@unRAID2:/proc/sys/vm# more /proc/sys/vm/dirty*
::::::::::::::
/proc/sys/vm/dirty_background_bytes
::::::::::::::
0
::::::::::::::
/proc/sys/vm/dirty_background_ratio
::::::::::::::
10
::::::::::::::
/proc/sys/vm/dirty_bytes
::::::::::::::
0
::::::::::::::
/proc/sys/vm/dirty_expire_centisecs
::::::::::::::
3000
::::::::::::::
/proc/sys/vm/dirty_ratio
::::::::::::::
20
::::::::::::::
/proc/sys/vm/dirty_writeback_centisecs
::::::::::::::
500

And information about them, according to https://www.kernel.org/doc/Documentation/sysctl/vm.txt:

dirty_background_bytes

Contains the amount of dirty memory at which the pdflush background writeback daemon will start writeback.

Note: dirty_background_bytes is the counterpart of dirty_background_ratio. Only one of them may be specified at a time. When one sysctl is written it is immediately taken into account to evaluate the dirty memory limits and the other appears as 0 when read.
==============================================================

dirty_background_ratio

Contains, as a percentage of total system memory, the number of pages at which the pdflush background writeback daemon will start writing out dirty data.

==============================================================

dirty_bytes

Contains the amount of dirty memory at which a process generating disk writes will itself start writeback.

Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be specified at a time. When one sysctl is written it is immediately taken into account to evaluate the dirty memory limits and the other appears as 0 when read.

Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any value lower than this limit will be ignored and the old configuration will be retained.

==============================================================

dirty_expire_centisecs

This tunable is used to define when dirty data is old enough to be eligible for writeout by the pdflush daemons. It is expressed in 100'ths of a second. Data which has been dirty in-memory for longer than this interval will be written out next time a pdflush daemon wakes up.

==============================================================

dirty_ratio

Contains, as a percentage of total system memory, the number of pages at which a process which is generating disk writes will itself start writing out dirty data.

==============================================================

dirty_writeback_centisecs

The pdflush writeback daemons will periodically wake up and write `old' data out to disk. This tunable expresses the interval between those wakeups, in 100'ths of a second. Setting this to zero disables periodic writeback altogether.
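Taken together, the two time-based defaults above are consistent with the ~30-second observation in this thread: data becomes eligible for writeout after dirty_expire_centisecs (3000 = 30s), and the flusher only wakes every dirty_writeback_centisecs (500 = 5s), so the worst case before untouched dirty data starts hitting disk is roughly the sum. A small sketch of that arithmetic, using the default values quoted above rather than live system values:

```shell
#!/bin/bash
# Defaults from the /proc dump above (centisecs = 1/100 of a second)
expire_cs=3000      # dirty_expire_centisecs: age at which data becomes eligible
writeback_cs=500    # dirty_writeback_centisecs: flusher wakeup interval

# Worst case: data becomes eligible just after a wakeup, so it waits
# out the full expire interval plus nearly one more wakeup period.
worst_case=$(( (expire_cs + writeback_cs) / 100 ))
echo "worst-case age before writeout starts: ~${worst_case}s"
```

This is only about when writeout *starts*; the actual disk i/o still has to complete after that, which fits the "last small LED blink after 30 seconds" behavior nars described.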
jumperalex Posted June 27, 2013 Since we are trying to resolve the reiserfs problem, doing the ext3/4 test within this context is probably not the best addition to the suite of tests. However, if you repost that question in another thread, I'm sure there will be people who would help test it. Well, all I was saying, in line with your post about fair tests, is that if we wanted a benchmark for what is reasonable to expect from reiserfs, then maybe that benchmark can be found by looking at another FS. But I acknowledge that I only know enough to be dangerous, so maybe it is not a valid comparison.
WeeboTech Posted June 27, 2013 Well, all I was saying is that if we wanted a benchmark for what is reasonable to expect from reiserfs, then maybe that benchmark can be found by looking at another FS. Hmmm, OK, I get the idea. I think we were just looking for valid data to be present on reiserfs after an unclean shutdown. According to the kernel values and what I observed by monitoring the dirty buffer values, it seems to be around 30 seconds. nars had tested ext3 and did not report any corruption. In our other tests, we found that with reiserfs you could wait 3 minutes, 5 minutes, 10 minutes, 30 minutes and/or 2 hours and still find corruption and/or missing files on the reiserfs.
mejutty Posted June 27, 2013 OK, I'm not far away from testing with my 20-drive Frankenstein. One build limitation tonight has been eliminated, so on to the next.
WeeboTech Posted June 27, 2013 OK, I'm not far away from testing with my 20-drive Frankenstein. Please consider watching the Dirty memory and Writeback values; it should take about 30 seconds for them to drop to 0. Then power off.

watch grep -e Dirty: -e Writeback: /proc/meminfo

After that test proves successful (or not), you can lower that time threshold to see what happens.
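WeeboTech's manual watch can also be scripted: poll Dirty and Writeback and only proceed once both read zero, or give up after a timeout. This is a sketch of that idea, not an unRAID feature; the function name and timeout are invented here, and it takes the meminfo path as a parameter so it can be exercised against a canned file instead of the live /proc/meminfo.

```shell
#!/bin/bash
# Wait until Dirty: and Writeback: in a meminfo-format file both hit 0.
# Returns 0 when flushed, 1 on timeout. (Hypothetical helper.)
wait_for_flush() {
  local meminfo=${1:-/proc/meminfo} timeout=${2:-120} elapsed=0 dirty wb
  while [ "$elapsed" -lt "$timeout" ]; do
    dirty=$(awk '/^Dirty:/ {print $2}' "$meminfo")
    wb=$(awk '/^Writeback:/ {print $2}' "$meminfo")
    if [ "${dirty:-1}" -eq 0 ] && [ "${wb:-1}" -eq 0 ]; then
      return 0      # nothing left to write back
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1          # still dirty after $timeout seconds
}

# Usage for the power-cut test described above:
#   wait_for_flush && echo "safe to cut power"
```

For the "unfair" variant garycase tried, you would simply skip the wait (or shorten the timeout) and cut power while Dirty is still non-zero.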
garycase Posted June 27, 2013 Just for grins, I ran the "unfair" test ==> killed power after 15 seconds... and indeed, I can cause problems with that. If I watch the disk activity light, with the "nars test" it takes ~18 seconds to stop flashing, and if I cut power at that point all is well; with my set of files it's about 2 seconds different, but same result: if I kill power immediately after it stops flashing, everything's fine. So it certainly seems that RC16a has fixed the "bug"... although I anxiously await the results of mejutty's Frankenstein parallel write test.