copy files, wait 30min, hard reboot = data loss [thread -rc16a]



Can you share some details (e.g. kernel version, changes, etc. in RC16a vs. RC15a) for those of us who are not testing?

When you have a moment.

Everything is the same as -15a except:

 

The "patch" obtained from reiserfs maintainer and put into -rc15, is not applied in -rc16.  Instead I developed my own "patch" to fix this problem and created -rc16.  But this did not completely solve the problem, so now after learning more of the reiserfs code I have tweaked my patch and created -rc16a.

 

I believe this patch will solve the problem, but I'm not sure it's the most elegant solution  :o  I think a better solution exists, but I don't understand enough of the reiserfs code to make that change.  So my plan is this: if -16a works, then I will release 5.0 and move on to bigger and better things, and meanwhile monitor the reiserfs-devel mailing list for a better solution.  If it does not work correctly, then I will spend more time and learn more of the reiserfs code.  :P

Link to comment

I do have good news :)

 

Same kernel, 3.9.6, guess only the patch changed? But I would also like to know details... I guess Tom may have had to go to San Quentin ;D

 

Well, I've run 2 tests so far, one with 512MB and one with 1GB (with any previous problematic rc version I know I would have reproduced it by now... for some reason I could always reproduce it too easily...) and I can't reproduce it with rc16a!!

 

I will do more extensive tests but so far I think it's really sorted and many thanks Tom for this!

Link to comment

Sounds good. Is there anything you feel one should look out for, given you're "not sure it's the most elegant solution"? E.g. don't stop and start the array too much, or random disks may spin up, or you may see more reads/writes once a day at a given hour (I am making crap up as an example). Unless you can share what your approach is (if it's not public info, that's no problem), which could possibly give some of us a better sense of what to look out for.

 

Link to comment

I do have good news :)

 

Same kernel, 3.9.6, guess only the patch changed?

Correct.

 

But I would also like to know details... I guess Tom may have had to go to San Quentin ;D

LOL thought about it - but he's in Mule Creek and I doubt he's thought about file systems for a long time.  ;)

 

The original changes put in 3.5 added a work queue and the notion of queuing "work" every 5 seconds to flush dirty blocks (this emulates the old kernel thread that did this in 3.4 and below).  The implementation of this had a bug in that each 'flush' resulted in additional disk i/o which, in turn, scheduled another 'flush' to take place 5 seconds later.  Hence we got the "continuous write" issue.  What my patch essentially does is maintain all this but change the 'flush' so that it does not issue disk i/o if it doesn't need to.  But note that a "sync" command will still cause disk i/o.

 

I say the change is not "elegant" because it maintains this 5-sec polling loop (which, by the way, is totally and completely harmless), which they are trying to get rid of (because Linux is being run on phones and they don't want anything running that doesn't have to, in the interest of saving battery power - which we obviously don't care about).  A better solution is to implement a different way of scheduling the flush of the journal, but I don't know enough of this code to make this change.
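
If anyone wants to double-check the "continuous write" behaviour on their own box, here's a minimal sketch (not part of the patch itself; the disk name sdb and the 5-second sample interval are just placeholders) that samples the sectors-written counter from /sys/block.  On the older rc's an otherwise idle array disk kept picking up writes every cycle; on -rc16a the count should settle to 0 shortly after the last flush.

#!/bin/bash
# Print how many sectors a disk writes in each 5-second window.
# Pass one of your own array disks; 'sdb' is only a placeholder.
DISK=${1:-sdb}

prev=$(awk '{print $7}' /sys/block/${DISK}/stat)   # field 7 = sectors written
while true
do
    sleep 5
    cur=$(awk '{print $7}' /sys/block/${DISK}/stat)
    echo "$(date '+%T')  ${DISK}: $((cur - prev)) sectors written in last 5s"
    prev=$cur
done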

Link to comment

 

On an idle system I've passed the nars test: 4G, ESX 5.1 to physical disks, copied to 2 other disks and read back.

 

I've got a parity check going on now. I'm going to do the same test.

 

FWIW, I captured the following information to ensure I reset the machine while it 'should' have been flushed:

 

Jun 26 23:48:16 unRAID2 watchmeminfo[1261]: LowFree: 763544 HighFree: 2591648  Dirty: 151064 Writeback: 7208
Jun 26 23:48:17 unRAID2 watchmeminfo[1261]: LowFree: 762676 HighFree: 2543412  Dirty: 146760 Writeback: 3144
Jun 26 23:48:18 unRAID2 watchmeminfo[1261]: LowFree: 761932 HighFree: 2506336  Dirty: 122164 Writeback: 8196
Jun 26 23:48:19 unRAID2 watchmeminfo[1261]: LowFree: 761784 HighFree: 2506336  Dirty: 78784 Writeback: 3448
Jun 26 23:48:20 unRAID2 watchmeminfo[1261]: LowFree: 761816 HighFree: 2506336  Dirty: 47392 Writeback: 0
Jun 26 23:48:21 unRAID2 watchmeminfo[1261]: LowFree: 762156 HighFree: 2506336  Dirty: 47684 Writeback: 0
Jun 26 23:48:22 unRAID2 watchmeminfo[1261]: LowFree: 760684 HighFree: 2408624  Dirty: 112680 Writeback: 8208
Jun 26 23:48:23 unRAID2 watchmeminfo[1261]: LowFree: 759652 HighFree: 2348360  Dirty: 143100 Writeback: 5156
Jun 26 23:48:24 unRAID2 watchmeminfo[1261]: LowFree: 758404 HighFree: 2298760  Dirty: 146756 Writeback: 3988
Jun 26 23:48:25 unRAID2 watchmeminfo[1261]: LowFree: 757908 HighFree: 2259824  Dirty: 148732 Writeback: 4
Jun 26 23:48:26 unRAID2 watchmeminfo[1261]: LowFree: 757412 HighFree: 2229196  Dirty: 120932 Writeback: 872
Jun 26 23:48:27 unRAID2 watchmeminfo[1261]: LowFree: 757420 HighFree: 2229196  Dirty: 79256 Writeback: 20
Jun 26 23:48:28 unRAID2 watchmeminfo[1261]: LowFree: 757420 HighFree: 2229196  Dirty: 46236 Writeback: 0
Jun 26 23:48:29 unRAID2 watchmeminfo[1261]: LowFree: 757420 HighFree: 2229196  Dirty: 46220 Writeback: 0
Jun 26 23:48:31 unRAID2 last message repeated 2 times
Jun 26 23:48:32 unRAID2 watchmeminfo[1261]: LowFree: 757404 HighFree: 2229196  Dirty: 46324 Writeback: 0
Jun 26 23:48:33 unRAID2 watchmeminfo[1261]: LowFree: 757528 HighFree: 2229196  Dirty: 13832 Writeback: 0
Jun 26 23:48:34 unRAID2 watchmeminfo[1261]: LowFree: 757544 HighFree: 2229196  Dirty: 13804 Writeback: 0
Jun 26 23:48:51 unRAID2 last message repeated 16 times
Jun 26 23:48:52 unRAID2 watchmeminfo[1261]: LowFree: 757544 HighFree: 2229072  Dirty: 13816 Writeback: 0
Jun 26 23:48:56 unRAID2 last message repeated 4 times
Jun 26 23:48:57 unRAID2 watchmeminfo[1261]: LowFree: 757504 HighFree: 2228824  Dirty: 13816 Writeback: 0
Jun 26 23:49:01 unRAID2 last message repeated 4 times
Jun 26 23:49:02 unRAID2 watchmeminfo[1261]: LowFree: 757504 HighFree: 2229032  Dirty: 13816 Writeback: 0
Jun 26 23:49:03 unRAID2 watchmeminfo[1261]: LowFree: 757512 HighFree: 2229072  Dirty: 4 Writeback: 52
Jun 26 23:49:04 unRAID2 watchmeminfo[1261]: LowFree: 757512 HighFree: 2229072  Dirty: 12 Writeback: 0
Jun 26 23:49:05 unRAID2 watchmeminfo[1261]: LowFree: 757512 HighFree: 2228824  Dirty: 12 Writeback: 0
Jun 26 23:49:06 unRAID2 watchmeminfo[1261]: LowFree: 757504 HighFree: 2228824  Dirty: 12 Writeback: 0
Jun 26 23:49:07 unRAID2 watchmeminfo[1261]: LowFree: 757504 HighFree: 2228824  Dirty: 12 Writeback: 0
Jun 26 23:49:08 unRAID2 watchmeminfo[1261]: LowFree: 757504 HighFree: 2229032  Dirty: 12 Writeback: 0
Jun 26 23:49:09 unRAID2 watchmeminfo[1261]: LowFree: 757512 HighFree: 2228948  Dirty: 12 Writeback: 0
Jun 26 23:49:10 unRAID2 watchmeminfo[1261]: LowFree: 757512 HighFree: 2228948  Dirty: 12 Writeback: 0
Jun 26 23:49:11 unRAID2 watchmeminfo[1261]: LowFree: 757480 HighFree: 2228948  Dirty: 12 Writeback: 0
Jun 26 23:49:12 unRAID2 watchmeminfo[1261]: LowFree: 757480 HighFree: 2228948  Dirty: 12 Writeback: 0
Jun 26 23:49:13 unRAID2 watchmeminfo[1261]: LowFree: 757480 HighFree: 2228824  Dirty: 12 Writeback: 0
Jun 26 23:49:14 unRAID2 watchmeminfo[1261]: LowFree: 757472 HighFree: 2228824  Dirty: 12 Writeback: 0
Jun 26 23:49:20 unRAID2 last message repeated 6 times
Jun 26 23:49:21 unRAID2 watchmeminfo[1261]: LowFree: 757472 HighFree: 2228824  Dirty: 0 Writeback: 0
Jun 26 23:49:25 unRAID2 last message repeated 3 times
Jun 26 23:49:26 unRAID2 watchmeminfo[1261]: LowFree: 757480 HighFree: 2228928  Dirty: 0 Writeback: 0
Jun 26 23:49:27 unRAID2 watchmeminfo[1261]: LowFree: 757448 HighFree: 2228948  Dirty: 0 Writeback: 0
Jun 26 23:49:28 unRAID2 watchmeminfo[1261]: LowFree: 757448 HighFree: 2228948  Dirty: 0 Writeback: 0
Jun 26 23:49:29 unRAID2 watchmeminfo[1261]: LowFree: 757456 HighFree: 2228948  Dirty: 0 Writeback: 0

 

So it looks like you need to wait a minimum of 30 seconds before all data is flushed.

I've checked this multiple times; there are times when the dirty buffer count sits there for 30 seconds before it drops to 0.

 

And for the record, I'm thrilled we have bash 4.1. We have associative arrays and co-processes.

For those interested, here's the script. It's got some interesting techniques. (Ask questions in another thread)

 

#!/bin/bash

[ ${DEBUG:=0} -gt 0 ] && set -x -v

P=${0##*/}              # basename of program
R=${0%%$P}              # dirname of program
P=${P%.*}               # strip off after last . character
TMPFILE=/tmp/${P}.$$

declare -a FIELDS
declare -A MEMINFO

coproc LOGGER {
    logger -s -t${P}[$$]
}

# ps -fp ${LOGGER_PID}

exec 11>&1           # Link file descriptor #11 with stdout.
exec 12>&2           # Link file descriptor #12 with stderr.
exec 1>&${LOGGER[1]} # redirect stdout to logger co-process.
exec 2>&${LOGGER[1]} # redirect stderr to logger co-process.

capture_meminfo()
{
    while read -a FIELDS
    do
         FIELDS[0]="${FIELDS[0]%:}"
         FIELDS[0]="${FIELDS[0]/(/_}"
         FIELDS[0]="${FIELDS[0]/)/_}"
         MEMINFO["${FIELDS[0]}"]=${FIELDS[1]}
         # echo "${FIELDS[0]} ${FIELDS[1]}"
    done < /proc/meminfo >&12
}

# echo "1 this is STDOUT" >&1
# echo "2 this is STDERR" >&2

while true
do
    capture_meminfo
    echo "LowFree: ${MEMINFO[LowFree]} HighFree: ${MEMINFO[HighFree]}  Dirty: ${MEMINFO[Dirty]} Writeback: ${MEMINFO[Writeback]}"
    sleep 1
done

exec 1>&11 11>&-      # Restore stdout and close file descriptor #11.
exec 2>&12 12>&-      # Restore stderr and close file descriptor #12.

eval "exec ${LOGGER[0]}<&-"
eval "exec ${LOGGER[1]}>&-"

# echo "3 this is STDOUT" >&1
# echo "4 this is STDERR" >&2

Link to comment

So it looks like you need to wait a minimum of 30 seconds before all data is flushed.

I've checked this multiple times; there are times when the dirty buffer count sits there for 30 seconds before it drops to 0.

 

Looking at meminfo tells you there are dirty buffers, but not which file systems those buffers belong to; e.g., some of those are surely for the root file system.  I believe it should only be necessary to wait 5 seconds after the last disk i/o has completed (judging by the disk activity LEDs and/or the i/o counters on the Main page) before the data should be safe.  Actually, this "5 seconds" is programmable: vm.dirty_writeback_centisecs
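
For anyone curious, that tunable can be inspected and changed at runtime via sysctl or /proc.  A quick sketch; the value of 100 below is purely illustrative, not a recommended unRAID setting, and it reverts on reboot anyway:

# Current writeback wakeup interval, in centiseconds (500 = 5 seconds)
sysctl vm.dirty_writeback_centisecs
cat /proc/sys/vm/dirty_writeback_centisecs

# Illustration only: shorten the interval to 1 second for a test run
sysctl -w vm.dirty_writeback_centisecs=100
# or, equivalently:
echo 100 > /proc/sys/vm/dirty_writeback_centisecs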

Link to comment

I've passed the nars test while a parity check is in progress.

 

It's midnight here, so I'm going to resume my 512 parallel copy tomorrow after the parity check completes.

I think after that I might do one final test: remove parity and create new parity, then do the parallel copy while the new parity generation is in progress (just for kicks).

 

So far I feel much better about this version.

 

 

Also fwiw, I'm getting 170MB/s parity check speeds. I can't believe it!!!

Link to comment

My test system is about to have 22 disks put into it tonight as a test for another apparent limitation of ESXi and 3 M1015 cards.  I am going to look at modifying some of the test scripts floating around to try to get every disk in the array being written to in parallel at the same time, then cut power after 15 seconds and see what happens :)

Link to comment

I've passed the nars test while a parity check is in progress.

 

It's midnight here, so I'm going to resume my 512 parallel copy tomorrow after the parity check completes.

I think after that I might do one final test: remove parity and create new parity, then do the parallel copy while the new parity generation is in progress (just for kicks).

 

So far I feel much better about this version.

 

 

Also fwiw, I'm getting 170MB/s parity check speeds. I can't believe it!!!

 

Thanks for your help!  I very much appreciate it.

Link to comment

I do have good news :)

 

Same kernel, 3.9.6, guess only the patch changed?

Correct.

 

But I would also like to know details... I guess Tom may have had to go to San Quentin ;D

LOL thought about it - but he's in Mule Creek and I doubt he's thought about file systems for a long time.  ;)

 

The original changes put in 3.5 added a work queue and the notion of queuing "work" every 5 seconds to flush dirty blocks (this emulates the old kernel thread that did this in 3.4 and below).  The implementation of this had a bug in that each 'flush' resulted in additional disk i/o which, in turn, scheduled another 'flush' to take place 5 seconds later.  Hence we got the "continuous write" issue.  What my patch essentially does is maintain all this but change the 'flush' so that it does not issue disk i/o if it doesn't need to.  But note that a "sync" command will still cause disk i/o.

 

I say the change is not "elegant" because it maintains this 5-sec polling loop (which, by the way, is totally and completely harmless), which they are trying to get rid of (because Linux is being run on phones and they don't want anything running that doesn't have to, in the interest of saving battery power - which we obviously don't care about).  A better solution is to implement a different way of scheduling the flush of the journal, but I don't know enough of this code to make this change.

 

Thanks for the explanation, really appreciate it. And it sounds good to me... keep everything "as is" and just fix the unneeded flush i/o... if you are absolutely sure that it only stops unneeded i/o (nothing from the tests points to the opposite), then I would say it seems perfect to me. I would not care about getting rid of that 5 sec polling loop... it will likely not help unRAID users at all, and... it is just not your task... I would just mail Jeff and/or post your findings to the reiserfs newsgroups and leave such improvements to them. This is my humble opinion.

 

Let's just hope the crashes when stopping the array don't come back, as you did bring back some of the behavior from rc13, but those could surely be related to other changes you made and not to this?

 

Btw, I've repeated the tests a few more times now and so far it seems perfect to me.

 

I also monitored the dirty buffers and writeback values while testing and found the same thing about them only going to 0 after 30 seconds. Also, if you read my previous description of the HDD activity status LEDs, it's exactly when the last small LED blink happens, 30 seconds after the write completes, that the dirty buffers go to 0.

Link to comment

Also, with this release, please try Stop/Start of the array a few times to make sure this kernel-crash bug is still gone too.  ::)

Thought about it... and will do... but I was never able to reproduce it on rc13, so I don't think I'm the right guy to test it... maybe it would be a good idea to include in the test someone who was able to easily/always reproduce it on rc13?

Link to comment

Okay, I've run the same test I could reliably create problems with before (on rc15a), and the "nars test" (which also failed on rc15a), and both work perfectly on 16a.    Note that on my system these also worked fine on rc16 ... so I can't really tell whether 16a improved things or not.

 

I'll try the parallel write test on this tomorrow, but I suspect the 22-drive version that mejutty plans to run is a FAR better test than anything on my little test box.

 

Link to comment

then cut power after 15 seconds and see what happens :)

 

 

This isn't a totally fair test, at least for the bug encountered previously.

I would suggest monitoring the dirty buffers and then waiting until they hit 0.

Should be around 30 seconds.

 

 

While it would be interesting to see what happens when there are some dirty buffers left,

it may skew the test a bit.

Link to comment

For comparison, how long would a similar test(s) using ext3/4 take to clear the buffers?  Or put another way, how quickly could you pull the plug on a ext3/4 system and observe / not observe corruption?  I ask because surely we can't expect better resiliency than that without specifically designing it (read: instant commits).

 

Or is that not an appropriate benchmark?

Link to comment

For comparison, how long would a similar test(s) using ext3/4 take to clear the buffers?  Or put another way, how quickly could you pull the plug on a ext3/4 system and observe / not observe corruption?  I ask because surely we can't expect better resiliency than that without specifically designing it (read: instant commits).

 

Or is that not an appropriate benchmark?

 

Since we are trying to resolve the reiserfs problem, doing the ext3/4 test within this context is probably not the best addition to the suite of tests. However, if you repost that question in another thread, I'm sure there will be people who would help test it.

 

In the meantime here's a dump of rc16a's /proc/sys/vm/dirty* values

 

According to what I'm reading, 30 seconds seems to be right:

5-second commits.

Data older than 30 seconds is written out.

 

root@unRAID2:/proc/sys/vm#  more /proc/sys/vm/dirty*
::::::::::::::
/proc/sys/vm/dirty_background_bytes
::::::::::::::
0
::::::::::::::
/proc/sys/vm/dirty_background_ratio
::::::::::::::
10
::::::::::::::
/proc/sys/vm/dirty_bytes
::::::::::::::
0
::::::::::::::
/proc/sys/vm/dirty_expire_centisecs
::::::::::::::
3000
::::::::::::::
/proc/sys/vm/dirty_ratio
::::::::::::::
20
::::::::::::::
/proc/sys/vm/dirty_writeback_centisecs
::::::::::::::
500

 

And here is the information about them, according to https://www.kernel.org/doc/Documentation/sysctl/vm.txt

 

dirty_background_bytes

Contains the amount of dirty memory at which the pdflush background writeback
daemon will start writeback.

Note: dirty_background_bytes is the counterpart of dirty_background_ratio. Only
one of them may be specified at a time. When one sysctl is written it is
immediately taken into account to evaluate the dirty memory limits and the
other appears as 0 when read.

==============================================================

dirty_background_ratio

Contains, as a percentage of total system memory, the number of pages at which
the pdflush background writeback daemon will start writing out dirty data.

==============================================================

dirty_bytes

Contains the amount of dirty memory at which a process generating disk writes
will itself start writeback.

Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be
specified at a time. When one sysctl is written it is immediately taken into
account to evaluate the dirty memory limits and the other appears as 0 when
read.

Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any
value lower than this limit will be ignored and the old configuration will be
retained.

==============================================================

dirty_expire_centisecs

This tunable is used to define when dirty data is old enough to be eligible
for writeout by the pdflush daemons.  It is expressed in 100'ths of a second.
Data which has been dirty in-memory for longer than this interval will be
written out next time a pdflush daemon wakes up.

==============================================================

dirty_ratio

Contains, as a percentage of total system memory, the number of pages at which
a process which is generating disk writes will itself start writing out dirty
data.

==============================================================

dirty_writeback_centisecs

The pdflush writeback daemons will periodically wake up and write `old' data
out to disk.  This tunable expresses the interval between those wakeups, in
100'ths of a second.

Setting this to zero disables periodic writeback altogether.
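
Putting the two relevant values together explains the ~30 seconds we keep observing: data becomes eligible for writeout once it has been dirty for dirty_expire_centisecs (3000 = 30 s), and it gets picked up at the next wakeup, which fires every dirty_writeback_centisecs (500 = 5 s), so the worst case before writeout even starts is roughly 35 s.  A small sketch that just does that arithmetic on the live values (nothing more authoritative than reading the sysctls above):

#!/bin/bash
# Rough worst-case estimate of when dirty data starts being written out,
# from the live vm.dirty_* values (centiseconds -> seconds).
expire=$(cat /proc/sys/vm/dirty_expire_centisecs)     # age at which data becomes "old"
wakeup=$(cat /proc/sys/vm/dirty_writeback_centisecs)  # interval between writeback wakeups

echo "dirty_expire_centisecs:    $expire  (~$((expire / 100)) s)"
echo "dirty_writeback_centisecs: $wakeup  (~$((wakeup / 100)) s)"
echo "worst case before writeout starts: ~$(( (expire + wakeup) / 100 )) s after the write"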

 

 

Link to comment

For comparison, how long would a similar test(s) using ext3/4 take to clear the buffers?  Or put another way, how quickly could you pull the plug on a ext3/4 system and observe / not observe corruption?  I ask because surely we can't expect better resiliency than that without specifically designing it (read: instant commits).

 

Or is that not an appropriate benchmark?

 

Since we are trying to resolve the reiserfs problem, doing the ext3/4 test within this context is probably not the best addition to the suite of tests. However, if you repost that question in another thread, I'm sure there will be people who would help test it.

 

 

Well, all I was saying, in line with your post about fair tests, is that if we wanted a benchmark for what is reasonable to expect from reiserfs, then maybe that benchmark can be found by looking at another FS.  But I acknowledge that I only know enough to be dangerous, so maybe it is not a valid comparison.

Link to comment

For comparison, how long would a similar test(s) using ext3/4 take to clear the buffers?  Or put another way, how quickly could you pull the plug on a ext3/4 system and observe / not observe corruption?  I ask because surely we can't expect better resiliency than that without specifically designing it (read: instant commits).

 

Or is that not an appropriate benchmark?

 

Since we are trying to resolve the reiserfs problem, doing the ext3/4 test within this context is probably not the best addition to the suite of tests. However, if you repost that question in another thread, I'm sure there will be people who would help test it.

 

 

Well, all I was saying, in line with your post about fair tests, is that if we wanted a benchmark for what is reasonable to expect from reiserfs, then maybe that benchmark can be found by looking at another FS.  But I acknowledge that I only know enough to be dangerous, so maybe it is not a valid comparison.

 

Hmmm, OK, I get the idea. I think we were just looking for valid data to be present with reiserfs after an unclean shutdown.  According to the kernel values and what I observed by monitoring the dirty buffer values, it seems to be around 30 seconds.

 

nars had tested EXT3 and did not report any corruption.

 

In our other tests, we found that with reiserfs you could wait 3 minutes, 5 minutes, 10 minutes, 30 minutes and/or 2 hours and still find corruption and/or missing files on the reiserfs file system.

 

Link to comment

OK, I'm not far away from testing with my 20-drive Frankenstein.  One build limitation was eliminated tonight, so on to the next.

 

Please consider watching the Dirty memory and Writeback values.

They should take about 30 seconds to drop to 0.

Then power off.

 

watch grep -e Dirty: -e Writeback: /proc/meminfo

 

Once that test proves successful (or not), you can lower the time threshold to see what happens.
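
If you'd rather not eyeball the watch output, here's a minimal sketch (the 1-second poll is arbitrary) that simply blocks until both counters hit 0:

#!/bin/bash
# Poll /proc/meminfo once a second and return only when both Dirty and
# Writeback have dropped to 0 -- i.e. when it should be safe to cut power
# for this particular test.
while true
do
    dirty=$(awk '/^Dirty:/     {print $2}' /proc/meminfo)
    wback=$(awk '/^Writeback:/ {print $2}' /proc/meminfo)
    echo "Dirty: ${dirty} kB  Writeback: ${wback} kB"
    [ "$dirty" -eq 0 ] && [ "$wback" -eq 0 ] && break
    sleep 1
done
echo "All flushed."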

Link to comment

Just for grins, I ran the "unfair" test ==> killed power after 15 seconds ... and indeed, I can cause problems with that  :)

 

If I watch the disk activity light, with the "nars test" it takes ~ 18 seconds to stop flashing ... and if I cut power at that point all is well;  with my set of files, it's about 2 seconds different, but same result:  if I kill power immediately after it stops flashing everything's fine.

 

So it certainly seems that RC16a has fixed the "bug" ... although I anxiously await the results of mejutty's Frankenstein parallel write test  8)

Link to comment