copy files, wait 30min, hard reboot = data loss



I have been doing some tests on a VM to find out what happens on a non-clean shutdown of an unRAID system, and I'm a bit scared by the results...

 

In my tests I copied (via SMB) a folder of files (to be exact, 61 files, 270MB total; most files around 1MB, plus 3 bigger ones) to an unRAID disk share (not a user share), then waited some time (the longest I waited was 30 minutes) and did a hard reboot. Checking for the files after the reboot, the newly copied files are missing (reiserfsck, run after the filesystem was mounted again following the hard reboot, shows no corruption found).

 

I repeated the test several times with rc15a, and even with rc14 and rc11a (I wondered whether it could be related to the latest kernel patch to fix the sync issue), but I could reproduce it with all the versions I tested. With rc15a all files disappeared after the hard reboot in every test, while on the older versions usually only part of the files disappeared (not always the same number of files, but a big part of them), and most times one of the surviving files got corrupted (same file size, but with "half" of its data replaced by 0x00 bytes).

 

Of course I understand that an unclean shutdown is not a good thing, and that a UPS is recommended, OK... but I still find it VERY strange that after the system has been idle for so long (like 30 minutes), the data doesn't get fully committed to the HDDs!?

 

If I manually run a 'sync' before the hard reboot, then no data loss happens.
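For reference, that manual workaround is just the standard sync(1) command, which does not return until the kernel has finished writing out dirty data. A minimal sketch of the pre-reset step:

```shell
# Force all dirty file data and metadata out to the disks; sync(1)
# blocks until the writeback has completed, so a hard reset immediately
# afterwards should lose nothing.
sync
```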

 

I also tried setting the array not to auto-start; after the hard reboot I manually started the array in maintenance mode and ran reiserfsck. It does show journal transactions being replayed... but there is still data loss after mounting the filesystem. Also, if I don't run reiserfsck manually, the journal replay seems to be done automatically when the filesystem mounts (there are lines about it in the syslog after unclean shutdowns), so running reiserfsck before mounting the "unclean" filesystem doesn't seem to make any difference at all.

Link to comment

Just to be clear:  You're copying these files to a disk share (NOT a user share);  with no cache drive; and waiting for 30 minutes after the copy has completed ???

 

... and is this to a disk that's on a pass-through controller in ESXi ??

 

Yes, a disk share, not a user share (which is provided by shfs), and no cache drive. And yes, definitely waiting after the copy completed, with no HDD activity at all.

 

No pass-through controller and no ESXi; I just tested it on VMware with 3 virtual HDDs: one for parity, two for data (plus one more HDD with a FAT partition to boot unRAID from, but that doesn't matter). I can't (don't want to...) try it on a real system, but I guess the same will happen, as it really looks to me like a filesystem journal commit issue (commits not happening) or something of that sort... anyway, if someone can try it on a real test system...

Link to comment

Did you power-down/reboot the physical machine ... or did you "tell" VMware to reset the virtual machine?

 

Reason I ask is that if you did it to the physical machine, VMware may not have closed the open VMDKs ... and the issue may not be related to UnRAID, but to the virtual hard drives.

Link to comment

Very interesting results.

 

If I get a chance in the next couple weeks, I'll set up a spare UnRAID with 3 drives and repeat these tests on bare metal.    Hard to believe your findings ... but I don't doubt that's what happened !!

 

... I'd think 30 SECONDS would be enough idle time for any pending writes to be committed - let alone 30 minutes !!

 

Link to comment

Tom -- is this an issue with previous releases as well??

 

... or is it just related to the new kernel?

 

[Glad you posted when you did ... I had just configured a USB key with UnRAID and was going to set up an old PC with 3 80GB drives to test this.]

 

Link to comment

Tom -- is this an issue with previous releases as well??

 

... or is it just related to the new kernel?

It's an issue with all computer systems in general, because they all cache file data.  The default amount of time that Linux will let dirty blocks remain unflushed in our kernel is 5 seconds.  I verified that after waiting several minutes and then hard-resetting the server, files are not written and/or not completely written.  This is almost certainly being caused by "the patch", and I have sent email to the maintainer to confirm.
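The 5-second figure Tom mentions corresponds to the kernel's periodic writeback wakeup. These are standard Linux sysctls, not unRAID-specific; as a sketch, they can be inspected like this (the values in the comments are the usual mainline defaults, which a patched kernel may have changed):

```shell
# How often the writeback/flusher threads wake up, in centiseconds
# (500 = every 5 seconds):
cat /proc/sys/vm/dirty_writeback_centisecs

# How old dirty data may get before it must be written out
# (3000 = 30 seconds):
cat /proc/sys/vm/dirty_expire_centisecs
```

With those defaults, everything written should normally be on disk well under a minute after the copy finishes, which is why a 30-minute wait losing data points at the flusher itself being broken.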

 

[Glad you posted when you did ... I had just configured a USB key with UnRAID and was going to set up an old PC with 3 80GB drives to test this.]

Never hurts to have additional confirmation.

Link to comment

It's an issue with all computer systems in general, because they all cache file data.  The default amount of time that Linux will let dirty blocks remain unflushed in our kernel is 5 seconds.  I verified that after waiting several minutes and then hard-resetting the server, files are not written and/or not completely written.  This is almost certainly being caused by "the patch", and I have sent email to the maintainer to confirm.

 

Yes, I'm well aware of the delay between writes and commits ... but 30 minutes was clearly a bit long  :)    A few seconds is much more normal.    Hopefully you'll get a quick "patch for the patch" !!  8)

 

 

Never hurts to have additional confirmation.

 

Just for grins I'll go ahead and set up a small system to try this on.  On our way out for a couple hours, so it'll be later today;  but I'll confirm the same results later today, and if you get a modification ready I'll also test it to confirm the issue disappears.

 

Nars ==>  I don't know about Tom, but I'm sure glad you decided to test that !!

 

Link to comment

limetech: note that I mentioned in my first post that I tested with rc11a and rc14 and got the same problem. I was also wondering whether the patch was the cause; in fact, I decided to run this test precisely because of the patch, to check whether removing those periodic flushes would cause something like this. However, I can also reproduce the problem with older RCs on the 3.4.x kernel... please test it if you get a chance, and thank you for looking at it.

Link to comment

If I manually run a 'sync' before the hard reboot, then no data loss happens.

 

Here is the key statement for proving or disproving that there is a problem.

I would expect that after 30 minutes of inactivity, the buffer cache would have been flushed.
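One way to check that expectation directly is a standard Linux counter, read on the server just before the hard reset:

```shell
# Unflushed file data still held in RAM. After a long idle period both
# counters would normally read 0 kB (or very close to it); a large
# Dirty value after 30 idle minutes would confirm the flusher is stuck.
grep -E '^(Dirty|Writeback):' /proc/meminfo
```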

 

 

Nars, have you made any other kernel tunings?

Tom, are there any that have changed in the recent kernels that could affect the flushing of the buffer cache?

 

 

Link to comment

Is this something that's been present in all the rc's and never been found or something that's come along recently?

 

I just got a spare server, an N54L, on a nice cash-back offer, but no spare/test HDDs yet :(

 

I'll have to get a few CHEAP small HDDs from eBay or something so I've got a test rig.

Link to comment

@WebboTech: I'm testing with all-stock unRAID versions: no kernel tunings, no extras, no plugins, default go file. I can test with another filesystem soon; the only pity is that I can't test ext3/4 on the unRAID md devices due to the direct I/O issue, but I can test outside the array, and also ReiserFS outside the array just to make sure...

Link to comment

@WebboTech: I'm testing with all-stock unRAID versions: no kernel tunings, no extras, no plugins, default go file. I can test with another filesystem soon; the only pity is that I can't test ext3/4 on the unRAID md devices due to the direct I/O issue, but I can test outside the array, and also ReiserFS outside the array just to make sure...

 

That's a good test. Thanks!

Link to comment

I just tried it on an extra virtual HDD outside the unRAID md array. Results:

 

- with a ReiserFS partition, mounted with the same params as unRAID, i.e.:

mount -t reiserfs -o user_xattr,acl,noatime,nodiratime /dev/hdd1 /mnt/d2

I confirm the exact same problem I reported above: hard reset = data loss.

I also tried adding the commit=10 option, and it doesn't seem to make any difference.

 

- with ext3 partition mounted as:

mount -t ext3 -o user_xattr,acl,noatime,nodiratime /dev/hdd1 /mnt/d2

I copied the same files exactly as in the other tests. I should note that some 15 or 20 seconds after the copy finished, there was some disk activity again (I guess it was committing data); I waited a bit more (less than 2 minutes in total) and did a hard reset... mounted it again (it took some 10 seconds to mount, while ReiserFS was always immediate), checked, and all the files are there (nothing in the lost+found folder, I should add), and I checked file integrity: no corruption at all.

Link to comment

Just set up an old Dell with a pair of 20GB drives and a 40GB parity (didn't have anything smaller).    Doing a parity sync now ... running slow 'cause I forgot to replace the optical drive cables I used with 80-wire cables -- but should still finish in 20 min or so.

 

Then I'll copy a few files and confirm this issue ... although Tom already did this, so I don't expect to find anything different.

 

Link to comment

Just set up an old Dell with a pair of 20GB drives and a 40GB parity (didn't have anything smaller).    Doing a parity sync now ... running slow 'cause I forgot to replace the optical drive cables I used with 80-wire cables -- but should still finish in 20 min or so.

 

Then I'll copy a few files and confirm this issue ... although Tom already did this, so I don't expect to find anything different.

 

You don't need to set up a parity HDD; just set it up without parity and you should be able to reproduce it, as it doesn't seem related to unRAID parity at all, just a filesystem thing.

Link to comment

Good test NARS; that's what I would have expected. Looks like we have some ReiserFS issues. *sigh*

 

I wonder what happens if you drop the cache before the power down... if that will force a sync (of only dirty buffers).

 

echo 3 >  /proc/sys/vm/drop_caches

 

 

Perhaps there is really an issue with superblock updates.

 

Link to comment

You don't need to set up a parity HDD; just set it up without parity and you should be able to reproduce it, as it doesn't seem related to unRAID parity at all, just a filesystem thing.

 

Yes, I thought about that after I'd already started the sync.    May as well let it run ... I've got to make a Home Depot run for the next 30-45 minutes anyway, and it'll be long done by then.

 

I'll post the results when I get back.    I'm sure glad you found this NOW ... it sounds like it could be a pretty significant data-loss issue that's been "hiding" undetected for quite a while => and since it only surfaces on unexpected, unclean shutdowns, it doesn't impact folks except when something else is already wrong, so the correlation may not have been evident for those who lost data.    Nasty bug !!    [Also have to wonder just how far back it goes in terms of releases]

 

Link to comment

Good test NARS, That's what I would have expected, looks like we have some reiserfs issues. *sigh*

 

I wonder what happens if you drop the cache before the power down... if that will force a sync (of only dirty buffers).

 

echo 3 >  /proc/sys/vm/drop_caches

 

 

Perhaps there is really an issue with superblock updates.

This issue has only to do with the 'patch'.

The Linux docs warn that you should 'sync' before dropping caches.  Doing a 'sync' before shutdown solves this (which the unRAID software does, btw); un-mounting before shutdown would solve it too.  But yes, a hard reset or power-off will cause this problem, and it will be fixed ASAP.
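Per that warning in the kernel documentation, the safe sequence would look like this (the drop_caches write needs root, and only discards clean cache pages; it is the sync that actually writes the dirty data out):

```shell
# 1) Write out all dirty data first...
sync
# 2) ...then drop the now-clean pagecache, dentries and inodes.
#    (Requires root; purely a cache discard, never a data write.)
echo 3 > /proc/sys/vm/drop_caches
```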

Link to comment

I just tried rc13 (with the non-patched 3.9.3 kernel), and indeed I can't reproduce the problem there; I tried 3 times, checked file integrity, and there's no problem at all!

 

However, as I said before, I can reproduce a similar problem on rc8a, rc11a and rc14 (with 3.4.x kernels), so those apparently had a somewhat similar problem despite not using that patch...

 

Tom: did you ever find out whether the reason for the array-stop crashes reported for rc13 (I never reproduced them myself) was the code that Jeff's patch removed, or something else? I'm asking because I think you also made some other changes... like removing "superfluous syncs"... not sure if any of those were related to solving these crashes?

Link to comment

Well ... FWIW I can't reproduce this issue on my small Basic setup I created just for this purpose !!!

 

I did the following two tests:

 

Test #1

 

(a)  Copied 30 files totaling 179MB to a share called "CleanWrites", and also directly to the disk1 share.    Then shut down the array, rebooted it, confirmed that they were all there, and used FolderMatch to confirm all were good.

 

(b)  Copied the same 30 files to a share called "DirtyWrites", and also directly to the disk2 share.    Waited 2 minutes; then turned off the powerstrip ... so the PC simply lost power (about as "dirty" a shutdown as you can get).    Rebooted the PC ... UnRAID indicated it had detected the unclean shutdown ... started the array (a parity check started of course) ... and then looked at the "DirtyWrites" folder and also disk2.  BOTH looked fine ... so I ran FolderMatch against them and the original source => and all was perfect.

 

 

Test #2

 

Basically repeated both parts of #1, but this time used a set of smaller files => 552 files in 30 folders totaling 260MB ... AND I cut power 15 seconds after the copies had finished in the "dirty" part of the test.    Again, everything was just fine !!

 

 

I was wondering WHY this wouldn't reproduce what should be a "known issue".    The main reason I can think of is that it's an old Dell (4600) with 3 IDE drives ... are these perhaps handled differently than SATA with regard to caching writes?

 

I canceled the parity check after the first test;  but am going to let the second one run to completion just to see if there are any errors as a result of the shutdowns.  I also plan to run reiserfsck on the drives after the parity check is done to see if it finds anything.

 

Link to comment

Well ... FWIW I can't reproduce this issue on my small Basic setup I created just for this purpose !!!

 

I did the following two tests:

 

Test #1

 

(a)  Copied 30 files totaling 179MB to a share called "CleanWrites", and also directly to the disk1 share.    then shut down the array; rebooted it; and confirmed that they were all there and used FolderMatch to confirm all were good.

 

(b)  Copied the same 30 files to a share called "DirtyWrites", and also directly to the disk2 share.    Waited 2 minutes; then turned off the powerstrip ... so the PC simply lost power (about as "dirty" a shutdown as you can get).    Rebooted the PC ... UnRAID indicated it had detected the unclean shutdown ... started the array (a parity check started of course) ... and then looked at the "DirtyWrites" folder and also disk2.  BOTH looked fine ... so I ran FolderMatch against them and the original source => and all was perfect.

 

 

Test #2

 

Basically repeated both parts of #1, but this time used a set of smaller files => 552 files in 30 folders totaling 260MB ... AND I cut power 15 seconds after the copies had finished in the "dirty" part of the test.    Again, everything was just fine !!

 

 

I was thinking of WHY this wouldn't reproduce what should be a "known issue".    The main reason I can think of is that it's an old Dell (4600) with 3 IDE drives ... are these perhaps handled differently than SATA with regard to caching the writes? 

 

I canceled the parity check after the first test;  but am going to let the second one run to completion just to see if there are any errors as a result of the shutdowns.  I also plan to run reiserfsck on the drives after the parity check is done to see if it finds anything.

 

Maybe something about writing to a User Share is preventing you from reproducing it.

 

How about if you copy just to a Disk Share, say disk1 (and disk2 if you want to compare)?

 

Link to comment
This topic is now closed to further replies.