copy files, wait 30min, hard reboot = data loss



Yes, I had the same thought.    So I just deleted everything on the disks;  disabled shares altogether;  and wrote a bunch of stuff directly to disk1.

 

... waited 15 seconds; then killed power.    Rebooted ... and everything is there !!

 

Ran FolderMatch and it all compares perfectly.

 

I can only assume this has something to do with using old IDE drives => they must somehow be treated differently than more modern SATA units.

 

FWIW, the configuration is a Dell 4600, 2 20GB WD200's and a 40GB WD400.

 

By the way, the parity check last time completed with zero errors ... and I suspect this one will too (I'm just letting it run).

 

I guess that little test system isn't useful for this particular issue.

 

Link to comment
@garycase:

 

Maybe a larger number of files triggers it more easily? No idea... One difference is that I copied a whole folder, which also creates a new folder; not sure whether that makes any difference. Also, right after the copy finished I didn't browse it or access the emhttp page; I just waited... and rebooted.

 

I was also able to reproduce it by cp'ing (instead of copying over SMB) like this:

cp -r /mnt/disk1/fotos_sgs2 /mnt/disk2

(having 'good', i.e. surely committed files already on disk1, and disk2 completely empty before doing it)

just wait... hard reboot.

 

Were you on rc15a, with no plugins or anything extra running?

Link to comment

Yes, it was a brand new RC15a flash drive -- no plugins or anything else.  Just 3 old disks I had in my "junk box" on an old Dell I hadn't yet donated to anyone.  (I have 8 old Dells at the moment I just haven't gotten rid of yet)

 

I didn't "look" at the files via Explorer, and did nothing on the Web GUI ... just copied the files; then walked back to the system and killed the power (turned the powerstrip switch off).

 

Since Tom's well aware of this issue ... and clearly understands what's wrong, I'm not going to spend a lot of time trying to replicate it.  It's very interesting that this old system doesn't have the problem, however.

 

Link to comment

I am running 15a on an ESXi install with 2 M1015s in passthrough mode. I repeated a test 5 times: I copied 5 files ranging from 60 MB to 770 MB to a disk share, then repeated the copy to a user share (movies), waited only a minute, and did a reset on the VM. Each time it booted back up and went straight into a parity check, which I cancelled, and then I did a CRC check. Every time, all the files appeared to be there, and only 1 time out of the 10 tests (5 to the disk share, 5 to the user share) did a file fail CRC (it was the 770 MB file).

 

I'm going to run some more tests with a folder and smaller files.  See what happens.

 

Update:::

 

Completely different story with a folder containing 1250 files and 250 folders totalling 193 MB (a backup of the flash drive from my live unRAID server). First copy: waited 1 minute, reset, all gone. Second test, as I thought I was seeing things: reset after 1 minute, I see the folder and think "ahh, I was seeing things", but alas the folder is empty. So it seems to make a difference how and what you are copying. My test array is 2 disks: 1 data, 1 parity.

Link to comment


Repeated the test, this time with a folder containing the files from the first test, where there was only 1 failure. Copied the folder, waited 1 minute, and reset. After the reboot the folder was there with 4 of the 5 files in it and 1 missing; on the CRC check 3 passed and 1 failed. Repeated the test again, this time noting the order in which the files were copied. Reset the server: again 4 files present, 1 missing, 1 failed CRC. The missing file was the last one copied, and the file that failed CRC was the second to last.

Link to comment

I guess I should double check the files I've copied over recently.

 

This problem should only occur if there is a sudden power loss; if unRAID is shut down correctly, the sync command is issued before the array is unmounted.

 

Sent from a phone, sorry for any typos

 

 

Link to comment

I guess I should double check the files I've copied over recently.

 

No reason to panic here=> There is NO problem with normal operation of UnRAID.    As long as you've got a UPS and always do normal shutdowns, there's no problem.    Read this thread from the beginning -- the issue is with abnormal shutdowns ... some writes may not have been committed, so you can lose data.    We're TRYING to replicate the problem, so we can help Tom resolve it AND can thoroughly test the fix when it's available.    So we're writing data; then killing power and/or doing a reset without shutting down UnRAID first ... things you should NEVER do in normal operation.

 

But for a normal, operational UnRAID with good UPS protection, there's very little likelihood of any problem.    If you do a lot of writes and want to be sure you don't have any possibility of encountering this issue, just Stop your array, which will guarantee all pending writes are flushed and everything's synced.

 

Link to comment

"Problem" is relative.  Buffers should get flushed when the system is idle, if for no other reason than to free up cache so it won't have to flush when the system is not idle.

 

Certainly agree the buffers should be flushed automatically AND quickly.  Tom indicated the Linux kernel is supposed to do that within 5 seconds ... that's what this whole issue is about.  He's looking into this, and I'm sure it will be resolved very quickly.

 

Link to comment

While I understand a sudden loss of power will leave you with the possibility of corrupt files, what concerns me is that Nars reported that a reset after 30 minutes left him with missing and corrupt files.

 

Now you may laugh, but whenever I do a large copy or large maintenance, I almost always manually sync the disks.

This is why I was wigging with the manual sync constant write problem.

 

So now I can manually sync, but if I happen to forget, there's a potential risk.

 

I'll try some tests on my microserver this weekend and see what I turn up. 

 

If you do a lot of writes and want to be sure you don't have any possibility of encountering this issue, just Stop your array, which will guarantee all pending writes are flushed and everything's synced.

 

Nars reported a manual sync handles this. So you may not need to stop your array, but you do need to manually issue a sync.
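
For anyone scripting their copies, here is a minimal Python sketch of the "copy, then sync" habit described above. The function names `copy_and_sync` and `fsync_tree` are made up for illustration. `os.sync()` asks the kernel to flush everything, much like running `sync` from the shell; `fsync_tree` is a narrower per-file alternative, and whether it would also have avoided the problem in this thread is an assumption nobody here has tested.

```python
import os
import shutil


def copy_and_sync(src, dst):
    """Copy a directory tree, then ask the kernel to flush all dirty buffers."""
    shutil.copytree(src, dst)
    os.sync()  # system-wide flush, like running `sync` from the shell
    return dst


def fsync_tree(root):
    """fsync every file and directory under root; return how many were flushed.

    Untested assumption: this narrower flush is enough for the files just
    copied. Only a full `sync` was reported to work in this thread.
    """
    flushed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        # "." is the directory itself, so new directory entries are flushed too
        for name in filenames + ["."]:
            fd = os.open(os.path.join(dirpath, name), os.O_RDONLY)
            try:
                os.fsync(fd)
            finally:
                os.close(fd)
            flushed += 1
    return flushed
```

A call such as `copy_and_sync("/mnt/disk1/fotos_sgs2", "/mnt/disk2/fotos_sgs2")` would mirror the `cp -r` test from earlier in the thread, with the sync added before any reset.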

 

The need for some kind of historical md5 database seems to be even more important now. 
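
As a rough idea of what such an md5 database could look like, here is a small Python sketch. It is only an illustration, not an existing unRAID feature; the names `build_manifest` and `verify_against` are invented here. It records an MD5 per file under a root folder and later reports which files are missing or no longer match.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its MD5."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_md5(full)
    return manifest


def verify_against(root, manifest):
    """Return (missing, corrupt) lists of relative paths under root."""
    missing, corrupt = [], []
    for rel, expected in sorted(manifest.items()):
        full = os.path.join(root, rel)
        if not os.path.isfile(full):
            missing.append(rel)
        elif file_md5(full) != expected:
            corrupt.append(rel)
    return missing, corrupt
```

For the tests in this thread, the manifest would be built from the known-good source folder before the copy, and `verify_against` would be run on the destination after the hard reboot.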

Link to comment

"Problem" is relative.  Buffers should get flushed when the system is idle, if for no other reason than to free up cache so it won't have to flush when the system is not idle.

 

Agreed, but Nars reported that even after 30 minutes there was an issue.

 

Sure, I understand that an unclean shutdown is not a good thing... a UPS is recommended, OK... Still, I find it VERY strange that after the system has been idle for so long (like 30 min) data doesn't get fully committed to the HDDs!? If I manually do a 'sync' before the hard reboot, no data loss happens.
Link to comment

But a sync spins up all disks. (or at least it did... was that a bug that was quashed?)

 

My point was that I don't mind long holds of data in buffers, if the system is busy.  I'm happy going several minutes with volatile data in buffers when flushing would slow down other ops.  Unfortunately there isn't a tunable (that I know of) that says delay flushing buffers up to xx seconds if I/O wait is above some threshold.
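
For reference, the writeback tunables that Linux does expose live under `/proc/sys/vm`. The sketch below (the helper name `read_vm_tunables` is made up) just reads the ones that exist on the running kernel. They are time-based or ratio-based; none of them is conditioned on I/O wait. Whether changing them has any effect on the behaviour discussed in this thread is an open assumption.

```python
import os

# Standard Linux sysctl names for page-cache writeback.
WRITEBACK_TUNABLES = (
    "dirty_expire_centisecs",     # age at which dirty data becomes eligible for writeout
    "dirty_writeback_centisecs",  # how often the kernel flusher threads wake up
    "dirty_ratio",
    "dirty_background_ratio",
)


def read_vm_tunables(base="/proc/sys/vm"):
    """Return {name: int value} for each tunable that is present and readable."""
    values = {}
    for name in WRITEBACK_TUNABLES:
        try:
            with open(os.path.join(base, name)) as fh:
                values[name] = int(fh.read().strip())
        except (OSError, ValueError):
            continue  # missing on this kernel, unreadable, or not an integer
    return values
```

On many systems `dirty_expire_centisecs` defaults to 3000 (30 seconds) and `dirty_writeback_centisecs` to 500 (5 seconds), but check your own values rather than relying on that.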

Link to comment

Yep -- absolutely !!    Seeing your result, I just created a test folder with all smaller files and folders ... about 600 files in 40 folders.    Copied it all to the disk2 share; waited 30 seconds; and killed power.  Booted back up and ran FolderMatch ... there are 64 files missing, 1 folder missing, and 1 folder with different contents (missing a file).    Seems this old system exhibits the same issue.  I'm going to keep this set of files/folders handy, and will use the same test when Tom has a fix to confirm all is okay  :)
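
If anyone wants a repeatable set of small files for this kind of test, something like the following Python sketch would do. The function name, file names, and sizes are all invented for illustration; the defaults (40 folders of 15 files) only roughly match the "about 600 files in 40 folders" set above. Using a fixed seed means the same tree can be regenerated later to test a fix.

```python
import os
import random


def make_test_tree(root, folders=40, files_per_folder=15, seed=0):
    """Create folders of small pseudo-random files; return the created paths."""
    rng = random.Random(seed)  # fixed seed: same contents on every run
    created = []
    for d in range(folders):
        folder = os.path.join(root, f"dir{d:03d}")
        os.makedirs(folder, exist_ok=True)
        for f in range(files_per_folder):
            path = os.path.join(folder, f"file{f:03d}.bin")
            size = rng.randint(1, 64 * 1024)  # 1 byte to 64 KiB
            data = rng.getrandbits(size * 8).to_bytes(size, "little")
            with open(path, "wb") as fh:
                fh.write(data)
            created.append(path)
    return created
```

Generate the tree on a disk whose contents are already safely committed, copy it to the disk under test, wait, kill power, and compare after the reboot.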

Link to comment

Here's a point to consider: corruption after one minute, well, I could understand that a bit.

It's corruption after tens of minutes that's a major concern.

If after tens of minutes you are losing files and/or seeing corruption on an idle system with a sudden power loss, it means that some basic housekeeping is not being done.

Link to comment

This issue is due to the "patch" that "fixed" an issue caused by a bigger "patch" in the linux kernel.  I'm working on a work-around until those guys can figure it out.

Link to comment

LOL!!! I figured as much!

Link to comment

I have fixed this problem and have an -rc16 release available.  If you want to test this first before I make it "generally" available, please email me at tomm@lime-technology.com.  What I'm looking for is some testing to:

- verify that after copying files to the array, waiting at least 5 seconds after I/O stops, and then resetting or powering down, the files are still there.

- verify 'sync' command does not result in continuous writes

- verify server does not crash with "kernel oops" as a result of Stopping the array

 

Here's a summary of what the problem is.

- Starting with kernel 3.5, developers decided to get rid of a system kernel thread that wakes up periodically (5 sec default) to call the "sync_fs" method of each registered file system.  In its place, each file system is responsible for syncing when it deems it needs to.  For reiserfs, this means scheduling "work" on a "work queue" when necessary to execute sync_fs.  (They wanted to avoid unnecessarily waking a thread up to do nothing, among other things.)

- The code changes for reiserfs were not exactly right.  A 'bug' was causing "continuous writes" as a result of "sync".

- The first patch, created by the reiserfs maintainer at SUSE, simply reversed out the changes described above.  But the problem is that there is no longer a dedicated system kernel thread that wakes up every 5 sec to sync the file system (doh!).  So the effect for reiserfs is that the journal isn't written out: no journal = no metadata update = files don't appear if the system is reset before the journal is processed.

- I studied the reiserfs code and patches to determine what the problem is and developed my own "patch" to fix the bug.  I have sent this patch to the reiserfs maintainer for his "blessing" (or for him to point out any problems in my patch).

- So I don't want to release -rc16 until I hear back from him, but in the meantime I want to get some more testing done.
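
To make the summary above easier to picture, here is a toy Python model of the three situations it describes. It is not reiserfs or kernel code; the class, the mode names, and the one-second time steps are all invented for illustration. "periodic" stands for the old thread that wakes every 5 seconds, "self_scheduled" for a file system that queues its own delayed flush when something is written, and "none" for the state after the revert, where neither mechanism is left.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyFilesystem:
    """Toy model only: pending entries sit in RAM until something flushes them."""
    mode: str                      # "periodic", "self_scheduled", or "none"
    interval: int = 5              # seconds, matching the 5 sec default above
    pending: list = field(default_factory=list)
    committed: list = field(default_factory=list)
    flush_due: Optional[int] = None  # time of the queued flush, if any

    def write(self, now, name):
        self.pending.append(name)
        # Self-scheduling: queue one delayed flush when data first becomes dirty.
        if self.mode == "self_scheduled" and self.flush_due is None:
            self.flush_due = now + self.interval

    def tick(self, now):
        periodic = self.mode == "periodic" and now % self.interval == 0
        queued = self.flush_due is not None and now >= self.flush_due
        if periodic or queued:
            self.committed.extend(self.pending)
            self.pending.clear()
            self.flush_due = None


def survivors(mode, write_times, crash_at):
    """Return the entries that reached 'disk' before a crash at crash_at seconds."""
    fs = ToyFilesystem(mode)
    for now in range(crash_at):
        for i, t in enumerate(write_times):
            if t == now:
                fs.write(now, f"file{i}")
        fs.tick(now)
    return fs.committed
```

In this model, writes at 1, 2 and 3 seconds survive a crash at 30 minutes in the "periodic" and "self_scheduled" modes, while in "none" mode nothing is ever committed no matter how long the system sits idle, which is the symptom reported in this thread.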

 

Link to comment

Just sent it as well. However, regarding testing the kernel crash when stopping the array: I was never able to reproduce it on rc13, so I'm probably not the right guy to test that... it may be worth picking someone who was able to reproduce it easily on rc13.

 

Just out of curiosity, Tom, can you comment a bit on what your patch consists of? From the tests you are asking for, can I guess that you are back to using the original kernel code that Jeff's patch removed, with just some fix to avoid the 'superfluous' continuous writes?

Link to comment