copy files, wait 30min, hard reboot = data loss



Cool Tom, do you mind showing us the patch? :)

He wants to get approval from the reiserfs maintainer at SUSE before putting out to the public.

He may also want the patch to come from the official maintainer at SUSE, so there are no differences between how Tom incorporates it and how it is eventually incorporated in the official code.  If the fix did not have a single "official" patch, subsequent patches from the same SUSE maintainer might not apply cleanly, and all kinds of failing-patch problems could result.

 

Joe L.

Link to comment

He may also want the patch to come from the official maintainer at SUSE, so there are no differences between how Tom incorporates it and how it is eventually incorporated in the official code.  If the fix did not have a single "official" patch, subsequent patches from the same SUSE maintainer might not apply cleanly, and all kinds of failing-patch problems could result.

 

Absolutely !!  #1 rule of Configuration Management !!!

 

 

Link to comment

Only because of that... don't take me wrong, but... after he told Tom to just reverse these changes without thinking about or looking at the consequences (and no one else on that reiserfs newsgroup talking about it - in fact no replies at all to Jeff's posts), I'm starting to think reiserfs support may be getting a bit strange... well, SUSE moved away from reiserfs as the default FS years ago, so that may explain something...

 

If there is a poll in the future for what to add in unRaid 5.1, 5.2, ... I think I will vote to allow other filesystems, if such an option is on the table... I would really like to test ext3/4 on the array filled with lots of files to check the performance difference, etc., but the "direct io" issue doesn't allow me to... I have even been searching for some way to disable it, some mount option or something, but can't find a way :(

 

Sorry for the 'venting' and going a bit off-topic... just posting what is going on in my head :)

Link to comment

Only because of that... don't take me wrong, but... after he told Tom to just reverse these changes without thinking about or looking at the consequences (and no one else on that reiserfs newsgroup talking about it - in fact no replies at all to Jeff's posts), I'm starting to think reiserfs support may be getting a bit strange... well, SUSE moved away from reiserfs as the default FS years ago, so that may explain something...

 

If there is a poll in the future for what to add in unRaid 5.1, 5.2, ... I think I will vote to allow other filesystems, if such an option is on the table... I would really like to test ext3/4 on the array filled with lots of files to check the performance difference, etc., but the "direct io" issue doesn't allow me to... I have even been searching for some way to disable it, some mount option or something, but can't find a way :(

 

Sorry for the 'venting' and going a bit off-topic... just posting what is going on in my head :)

 

We've been asking for 'other' filesystem support for years. I'm with you on this.

While reiserfs has proven reliable in the past, and quite resilient, it may not be as good moving forward.

I would expect, given the recent findings, that the priority for other filesystem support will be bumped up.

 

Is btrfs support in the kernel and can we manually format and mount that without the direct IO issue?

Link to comment
Is btrfs support in the kernel and can we manually format and mount that without the direct IO issue?

Support is in the stock kernel; the tools are missing but easy to get... out of curiosity I just tested it, and it doesn't work well on the array: it appears to mount fine, but doing a sync or umount freezes (I also tested it outside the array with no problem at all). I guess it's the direct io issue again, as it behaves similarly to when I tested ext3/4...
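Just for context, a rough sketch of the kind of manual test described above (the device name and mount point are only placeholders, and it assumes the btrfs-progs tools have been added):

# WARNING: mkfs destroys whatever is on the device - spare/test disk only
mkfs.btrfs /dev/md1
mkdir -p /mnt/btrfs_test
mount -t btrfs /dev/md1 /mnt/btrfs_test
cp /boot/bzroot /mnt/btrfs_test/    # write some test data
sync                                # sync/umount is where the freeze was seen
umount /mnt/btrfs_test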

Link to comment

I did test rc16, repeating the test a few times to make sure... and I always waited at least 2 minutes before doing the hard reset... and I'm sorry to say that although I no longer get ALL files/folders missing (like on rc15a), I still get some files missing, and one of the non-missing ones is corrupted after reboot :(

 

I confirm that sync doesn't cause the periodic writes, and I can't reproduce any crash on stopping the array, tried a lot of times, but then again I also didn't reproduce that with rc13.

 

Edit: I found that apparently, if I give the unRaid VM less RAM, the problem seems more noticeable (more files missing); the worst I tried was with 512MB, where after the hard reset I only got 28 files (of 61), while with 1GB I got 48 (of 61).  Also, according to 'mejutty', in his tests it was also easier to reproduce with a big number of files/folders.

 

Just to be sure, I also tried doing a 'sync' before the hard reset on rc16, and that was the only way I got all the files and no corruption at all after the hard reset.

Link to comment

I did test rc16, repeating the test a few times to make sure... and I always waited at least 2 minutes before doing the hard reset... and I'm sorry to say that although I no longer get ALL files/folders missing (like on rc15a), I still get some files missing, and one of the non-missing ones is corrupted after reboot :(

 

I confirm that sync doesn't cause the periodic writes, and I can't reproduce any crash on stopping the array, tried a lot of times, but then again I also didn't reproduce that with rc13.

 

Edit: I found that apparently, if I give the unRaid VM less RAM, the problem seems more noticeable (more files missing); the worst I tried was with 512MB, where after the hard reset I only got 28 files (of 61), while with 1GB I got 48 (of 61).  Also, according to 'mejutty', in his tests it was also easier to reproduce with a big number of files/folders.

 

Just to be sure, I also tried doing a 'sync' before the hard reset on rc16, and that was the only way I got all the files and no corruption at all after the hard reset.

Can you repeat this exact series of tests with -rc14?  -rc14 uses the 3.4.48 kernel, which uses the old method of flushing.  I'd like to know if you can get it to fail in the same way.

Link to comment

What's considered acceptable under these conditions?

 

I.e., after a massive file write, how long before the outstanding writes are guaranteed to be complete and corruption should NOT occur?

 

In my experience, shutting a machine down within a minute or two of massive writes where caching is present has always been fraught with possible issues.

 

At some point, some time factor, this should no longer be an issue.

Is this at 2 minutes, 3, 5, 10?

 

Could adjusting some of these values help?

 

[pre]
- dirty_background_bytes
- dirty_background_ratio
- dirty_bytes
- dirty_expire_centisecs
- dirty_ratio
- dirty_writeback_centisecs
[/pre]
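For reference, a rough sketch of how those knobs can be inspected and changed at runtime; the numbers below are purely illustrative assumptions, not recommended values:

# show the current writeback tunables
sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_expire_centisecs vm.dirty_writeback_centisecs

# example only: make dirty pages expire sooner and flush more often
sysctl -w vm.dirty_expire_centisecs=1000     # pages count as "old" after 10s
sysctl -w vm.dirty_writeback_centisecs=300   # flusher thread wakes every 3s

# equivalent via /proc
echo 1000 > /proc/sys/vm/dirty_expire_centisecs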

 

 

Link to comment
Can repeat this exact series of tests with -rc14?  -rc14 uses 3.4.48 kernel which uses old method of flushing.  I'd to know if you can get it to fail in the same way.

 

I did try it the other day along with some of the older rc's I mentioned before... anyway, I repeated it once again now for rc14 and found that the RAM amount also makes a difference to the issue: I still had the VM set to 512MB RAM from the last tests today on rc16, tried the same exact test files with rc14 three times, and was unable to reproduce the problem... BUT... then I remembered that for all the tests the other day I had the VM set to 1GB RAM, so I changed it back to 1GB and repeated the test 3 more times, and on all of them I was now able to reproduce the problem on rc14, with these results:

- 1st test got only 57 (of 61) files and 1 corrupted

- 2nd test got 24 (of 61) files and 1 corrupted.

- 3rd test got 61 (of 61) files but... 1 corrupted.

So in sum: yes, I'm able to reproduce it in a similar way on rc14, though not with the same number of surviving files; it also doesn't seem to be exactly the same on every test, and it depends on the RAM amount.

Also, for all the tests I did today, I waited exactly 2 minutes after disk activity stopped following the file copy.

Link to comment

Can repeat this exact series of tests with -rc14?  -rc14 uses 3.4.48 kernel which uses old method of flushing.  I'd to know if you can get it to fail in the same way.

 

I did try it the other day along with some of the older rc's I mentioned before... anyway, I repeated it once again now for rc14 and found that the RAM amount also makes a difference to the issue: I still had the VM set to 512MB RAM from the last tests today on rc16, tried the same exact test files with rc14 three times, and was unable to reproduce the problem... BUT... then I remembered that for all the tests the other day I had the VM set to 1GB RAM, so I changed it back to 1GB and repeated the test 3 more times, and on all of them I was now able to reproduce the problem on rc14, with these results:

- 1st test got only 57 (of 61) files and 1 corrupted

- 2nd test got 24 (of 61) files and 1 corrupted.

- 3rd test got 61 (of 61) files but... 1 corrupted.

So in sum: yes, I'm able to reproduce it in a similar way on rc14, though not with the same number of surviving files; it also doesn't seem to be exactly the same on every test, and it depends on the RAM amount.

Also, for all the tests I did today, I waited exactly 2 minutes after disk activity stopped following the file copy.

Thank you, that was valuable info.  I guess you have disk activity LED's, correct?  If so, one more test: set up the same transfers and then watch the disk activity LED.  I'm going to guess it stops blinking in a matter of seconds after the transfers are complete on the client side, correct?  Now here's the test: at this point, when activity LED's are off, type the 'sync' command and then let me know if you see "a lot" of disk activity, or just "a little bit" as defined by:

 

"a little bit" is what you would see if you typed 'sync' and you already know there is nothing buffered - just a couple blips.

 

"a lot" - is more more blinking than the above  ;)

 

I realize this is not very objective, but let's give it a try please.

Link to comment
Thank you, that was valuable info.  I guess you have disk activity LED's, correct?  If so, one more test: set up the same transfers and then watch the disk activity LED.  I'm going to guess it stops blinking in a matter of seconds after the transfers are complete on the client side, correct?  Now here's the test: at this point, when activity LED's are off, type the 'sync' command and then let me know if you see "a lot" of disk activity, or just "a little bit" as defined by:

 

"a little bit" is what you would see if you typed 'sync' and you already know there is nothing buffered - just a couple blips.

 

"a lot" - is more more blinking than the above  ;)

 

I realize this is not very objective, but let's give it a try please.

 

Re activity LEDs, I looked at them with some more attention on rc14 and rc16, and what I see is: after the file transfer is complete (and the network activity LED turns off at the same time), the HDD LEDs (data and parity, obviously) stay on for nearly 8 seconds (same for both rc14/rc16), then turn off **, and after about 5 seconds they turn on again for 3 seconds on rc14 (6 seconds on rc16), then turn off again; on rc14 they do not turn on again anymore, but on rc16 they blink for half a second exactly 30 seconds after the ** above.

 

I can also say I remember seeing on other rc's, maybe the other ones with the 3.9.x kernel (I can check if that interests you), that same behavior of a very small blink 30 seconds after it finishes writing once the copy has completed.

 

On either of the two versions (and I think I can say on any version I tested, really), manually doing 'sync' always produces a very small blink, less than 1 second, on all HDD activity LEDs.

 

So far, the ONLY version out of the ones I tested (rc8a, rc11a, rc14, rc13, rc15a and rc16) where I was not able to reproduce any data loss/corruption at all (and I just retested it again today with both 512MB and 1GB RAM) is the non-patched rc13 (and I certainly didn't issue any sync before doing any testing, since that would trigger the periodic flushes... I just copied data right after booting the stock version). Also, talking about activity LEDs, this rc13 shows slightly different behavior than the others - it doesn't have that second "long" write after 5 seconds, just a continuous LED ON while copying files and for 16 seconds more after the file copy completes (with RAM=1GB, or 8 seconds more if RAM=512MB; note that the tests on the other rc's above were with 1GB, so this is a really longer write than on the other versions), and then the small blink after 30 seconds as explained above. However, please note that even if I hard reset before that "after 30 seconds" small blink, I still don't get any data loss/corruption on rc13, so I'm not sure what that small blink is exactly...

 

Not sure if this LED info can help you at all, but you asked for it... I hope so... :) and I hope I could explain it clearly, as my English is maybe not the best, as you may have already noticed :)

Link to comment

I have just redone my tests with rc16 and I cannot reproduce the errors.  I used both the large files and the folder with 1250 smaller files and folders from the same test, and I waited 30 seconds between the last file being written and the reset.  All passed CRC fine and were always there.

Link to comment

I have just redone my tests with rc16 and I cannot reproduce the errors.  I used both the large files and the folder with 1250 smaller files and folders from the same test, and I waited 30 seconds between the last file being written and the reset.  All passed CRC fine and were always there.

Could you re-test with less RAM, like 1GB or even 512MB, as from my tests it seems to influence the problem? Are you still using 4GB?

Link to comment

FWIW, I just did 4 tests with RC16 ... and all were error-free.

 

(1)  Copied the same set of files I had used in my last test (which produced errors using 15a); waited 1 minute, killed power.    On reboot everything was there and good (confirmed with FolderMatch).

 

(2)  Repeated the above, but killed the power after 30 seconds.  Same result.

 

(3)  Was going to repeat it again and kill power in 5 seconds ... but disk activity light was blinking away, so I felt it was unfair to do so, since clearly UnRAID was syncing.  Light went out ~ 5 seconds later ... so I killed power then (total ~ 15 seconds after the writes).  On reboot all was well.

 

(4)  Repeated test with ~ 2000 small files (jpegs).  Killed power 20 seconds after the writes.  All okay.

 

Note:  After each of the tests except the last one I aborted the automatic parity check that was started.    I let the last one complete ... and it completed with all zeroes (as I would expect).

 

Interesting that nars still has data loss issues with this release, but it certainly resolved it on my little test system.

 

Link to comment

I have just redone my tests with rc16 and I cannot reproduce the errors.  I used both the large files and the folder with 1250 smaller files and folders from the same test, and I waited 30 seconds between the last file being written and the reset.  All passed CRC fine and were always there.

Could you re-test with less RAM, like 1GB or even 512MB, as from my tests it seems to influence the problem? Are you still using 4GB?

 

Just tried with 512MB and, sorry, I could not reproduce it: all files transferred and no CRC errors.  Will run another 5 times for both large and small files.

Link to comment

I pulled a memory stick and re-ran the tests with 512MB and they still work perfectly.

 

Then tried with a few large files (the same test I had originally done with RC15a) ... and it also works perfectly.

 

Bottom line:  From my perspective RC16 definitely resolved the problem !!

 

The only thing I can think of is that the fact that nars is using a virtualized UnRAID must have something to do with the errors he's getting.    Perhaps some interaction between UnRAID and the hypervisor.    But both my tests and mejutty's certainly seem to show that the problem RC15a had with deferred writes not being committed in time has been resolved.

 

 

Link to comment

I pulled a memory stick and re-ran the tests with 512MB and they still work perfectly.

 

Then tried with a few large files (the same test I had originally done with RC15a) ... and it also works perfectly.

 

Bottom line:  From my perspective RC16 definitely resolved the problem !!

 

The only thing I can think of is that the fact that nars is using a virtualized UnRAID must have something to do with the errors he's getting.    Perhaps some interaction between UnRAID and the hypervisor.    But both my tests and mejutty's certainly seem to show that the problem RC15a had with deferred writes not being committed in time has been resolved.

 

My tests are with ESXi 5.1 as well, using passthrough M1015 drives.  I have no errors.  Tempted to reload an earlier rc and retest, then check 15a just to confirm I can still generate the errors.

Link to comment

Well, here I am thread crapping; not sure if Tom wants results here or via email, I should have asked.

 

I retested on my setup with rc14: no errors found, could not fault it waiting 15 seconds before the reset.

I then loaded up rc15a; it failed every time I ran the test, without exception.

 

Which brings me to the next question.  Nars, how is your VM set up?  Are you using passthrough disks, RDM, or have you set up virtual disks for your unRAID drives on real disks??  I mean, don't get me wrong, it's great this issue was found, but I am a little puzzled by your other results on releases that should not have the error.  For me rc16 seems to be good; during my tests I have stopped and started it at least 50 times with no hang-ups and have not had a single file transfer failure.

Link to comment

I haven't done 50 tests ... but I've done ~ 20 now [basically one test every commercial while my wife and I watched TV tonight  :) ]

 

... and 100% success with RC16.    Just for grins, I went back to RC15a to be sure I wasn't dreaming about the failures ... and ran 3 tests with 3 failures in a row  ==> so there definitely WAS a problem; and it's definitely been FIXED with RC16 ... at least from my test system's perspective.    [bare metal -- old Dell with 3 IDE drives]

 

Link to comment

Man, I can't tell you guys how much I appreciate all this testing. :-*

 

I'm 99% confident my patch does the trick, but have not heard back from any kernel folks yet.  If there is a file/data loss bug it's been in reiserfs for a while - I have a theory, but need confirmation from someone who knows the code.

Link to comment

I can confirm, by the way, that if there's a bug, it's certainly not something that "bites" in normal use.    I'm a backup fanatic, so everything I write to my UnRAID server is backed up to a set of disks I store offline.    As these get filled, I do a complete FolderMatch compare with the array -- and have NEVER found any issues.      On the rare occasion that I get a sync error with a parity check (probably 3 times in 4+ years), I pull the entire set of backup disks out and, one-at-a-time, do yet-another full comparison (this takes ~ a week these days).  Again, I've never had an error.

 

And of course both of my systems are UPS-protected, so the likelihood of encountering the bug we've discussed in this thread is VERY small.    But it IS clearly a bug -- so should definitely be fixed !!

 

Link to comment

Just to make sure it wasn't an SMB/cache/network layer issue.

 

 

I logged in locally.

I cp'ed bzroot to one of the array drives in parallel 512 times.

It took a very long time.

Then I ran an md5sum check on all the files: all good. (This took over an hour.)

I crashed the system, rebooted, and ran the md5sum again, and 25% of the 512 files were corrupt.

No missing files but segments of files were missing.

I.e., sizes varied.

 

I think this is a good test, and if you could send me a PM with the link to rc16, I can test that way too.

Or someone else could test it.

 

 

Here's my scriptlet

Uncomment as needed.

 

 

One command copies all the files in parallel.

The other line takes one md5 hash and prints it out for all the files, so you can do an md5sum check on the destination directory.

You'll need to do an md5sum on one of the bzroots to capture the line for duplication.

Then uncomment it and save it to a file.

It's relative to the directory, so you'll need to be IN the crashtestdummy directory to validate the md5sum.

 

 


#!/bin/bash

# Uncomment the cp line to copy bzroot 512 times in parallel, and/or the
# echo line to print an md5sum manifest line for each copy.
while [[ ${LOOP:=0} -lt 512 ]]
do (( LOOP=LOOP+1 ))
   # copy one instance in the background (all 512 copies run concurrently):
   # cp -v /boot/bzroot /mnt/disk2/crashtestdummy/bzroot.${LOOP} &
   # print one manifest line per copy (hash taken from an md5sum of bzroot):
   # echo "90c6721bf36ecbf562e4267d18693a13  bzroot.${LOOP}"
done
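For what it's worth, a minimal sketch of how that manifest can then be used to verify the copies after the crash, assuming the echo output was redirected into a file (the crashtest.md5 name below is just a placeholder, not part of the original scriptlet):

cd /mnt/disk2/crashtestdummy
# re-hash every copy and compare against the recorded manifest;
# any missing or corrupted bzroot.N shows up as FAILED
md5sum -c crashtest.md5 | grep -v ': OK$'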

 

Link to comment

Man, I can't tell you guys how much I appreciate all this testing. :-*

 

I'm 99% confident my patch does the trick, but have not heard back from any kernel folks yet.  If there is a file/data loss bug it's been in reiserfs for a while - I have a theory, but need confirmation from someone who knows the code.

 

Tom ----

 

It might be a good idea to release -rc16 IF you haven't heard back from the kernel folks fairly quickly.  There are a lot of us running this RC version and, if I understand the problem, any improper shutdown can cause data loss.  Your patch may introduce some other problem, but the odds are it will be less catastrophic than this particular problem.  Plus, wider usage might reveal any new problem much more quickly than waiting and then introducing the patch (whether yours or the kernel folks') at some later point.

 

After all, all of us who are using -rc candidates are aware that there might be issues, and we are willing to assume some risk.  However, I, for one, become uncomfortable running a release for a long period of time with a problem for which a solution exists.

 

I also have the terrible feeling that the 32-bit kernel has become the ugly stepchild.  Active development and thorough testing have apparently stopped.  Work is done reluctantly only when someone complains...

 

EDIT:  If you are reluctant to do a general release of the current patch as -rc16, you could do a -rc15b...

Link to comment