copy files, wait 30min, hard reboot = data loss



nars' test, with the writes done in parallel.

 


#!/bin/bash -e
# Seed /mnt/disk1 with a folder of random-data files, launching all the dd writes in parallel.
cd /mnt/disk1
mkdir nars_test_folder
cd nars_test_folder
while read CMD
do      echo ${CMD}
        ${CMD} &        # background each dd so the writes overlap
done <<-EOF
dd if=/dev/urandom of=nars_test_file_01.dat count=1 bs=55288564
dd if=/dev/urandom of=nars_test_file_02.dat count=1 bs=510183
dd if=/dev/urandom of=nars_test_file_03.dat count=1 bs=2458188
dd if=/dev/urandom of=nars_test_file_04.dat count=1 bs=2978245
dd if=/dev/urandom of=nars_test_file_05.dat count=1 bs=2643817
dd if=/dev/urandom of=nars_test_file_06.dat count=1 bs=2761289
dd if=/dev/urandom of=nars_test_file_07.dat count=1 bs=2787313
dd if=/dev/urandom of=nars_test_file_08.dat count=1 bs=2633474
dd if=/dev/urandom of=nars_test_file_09.dat count=1 bs=2610714
dd if=/dev/urandom of=nars_test_file_10.dat count=1 bs=2919871
dd if=/dev/urandom of=nars_test_file_11.dat count=1 bs=2350149
dd if=/dev/urandom of=nars_test_file_12.dat count=1 bs=509644
dd if=/dev/urandom of=nars_test_file_13.dat count=1 bs=482331
dd if=/dev/urandom of=nars_test_file_14.dat count=1 bs=2302861
dd if=/dev/urandom of=nars_test_file_15.dat count=1 bs=2476117
dd if=/dev/urandom of=nars_test_file_16.dat count=1 bs=2439426
dd if=/dev/urandom of=nars_test_file_17.dat count=1 bs=2770128
dd if=/dev/urandom of=nars_test_file_18.dat count=1 bs=2346394
dd if=/dev/urandom of=nars_test_file_19.dat count=1 bs=2417321
dd if=/dev/urandom of=nars_test_file_20.dat count=1 bs=2498691
dd if=/dev/urandom of=nars_test_file_21.dat count=1 bs=6160690
dd if=/dev/urandom of=nars_test_file_22.dat count=1 bs=2887331
dd if=/dev/urandom of=nars_test_file_23.dat count=1 bs=1477755
dd if=/dev/urandom of=nars_test_file_24.dat count=1 bs=20340436
dd if=/dev/urandom of=nars_test_file_25.dat count=1 bs=1307032
dd if=/dev/urandom of=nars_test_file_26.dat count=1 bs=2828348
dd if=/dev/urandom of=nars_test_file_27.dat count=1 bs=2534757
dd if=/dev/urandom of=nars_test_file_28.dat count=1 bs=1038573
dd if=/dev/urandom of=nars_test_file_29.dat count=1 bs=1097624
dd if=/dev/urandom of=nars_test_file_30.dat count=1 bs=1009367
dd if=/dev/urandom of=nars_test_file_31.dat count=1 bs=1265181
dd if=/dev/urandom of=nars_test_file_32.dat count=1 bs=1151772
dd if=/dev/urandom of=nars_test_file_33.dat count=1 bs=1126012
dd if=/dev/urandom of=nars_test_file_34.dat count=1 bs=887565
dd if=/dev/urandom of=nars_test_file_35.dat count=1 bs=1543955
dd if=/dev/urandom of=nars_test_file_36.dat count=1 bs=1552538
dd if=/dev/urandom of=nars_test_file_37.dat count=1 bs=1564416
dd if=/dev/urandom of=nars_test_file_38.dat count=1 bs=1558199
dd if=/dev/urandom of=nars_test_file_39.dat count=1 bs=1537904
dd if=/dev/urandom of=nars_test_file_40.dat count=1 bs=1536752
dd if=/dev/urandom of=nars_test_file_41.dat count=1 bs=1595042
dd if=/dev/urandom of=nars_test_file_42.dat count=1 bs=1618269
dd if=/dev/urandom of=nars_test_file_43.dat count=1 bs=1488836
dd if=/dev/urandom of=nars_test_file_44.dat count=1 bs=1503179
dd if=/dev/urandom of=nars_test_file_45.dat count=1 bs=1493448
dd if=/dev/urandom of=nars_test_file_46.dat count=1 bs=1601083
dd if=/dev/urandom of=nars_test_file_47.dat count=1 bs=1600324
dd if=/dev/urandom of=nars_test_file_48.dat count=1 bs=1145707
dd if=/dev/urandom of=nars_test_file_49.dat count=1 bs=1176492
dd if=/dev/urandom of=nars_test_file_50.dat count=1 bs=1286722
dd if=/dev/urandom of=nars_test_file_51.dat count=1 bs=1301486
dd if=/dev/urandom of=nars_test_file_52.dat count=1 bs=1536094
dd if=/dev/urandom of=nars_test_file_53.dat count=1 bs=1295588
dd if=/dev/urandom of=nars_test_file_54.dat count=1 bs=226047
dd if=/dev/urandom of=nars_test_file_55.dat count=1 bs=2128774
dd if=/dev/urandom of=nars_test_file_56.dat count=1 bs=2209284
dd if=/dev/urandom of=nars_test_file_57.dat count=1 bs=2201158
dd if=/dev/urandom of=nars_test_file_58.dat count=1 bs=1795643
dd if=/dev/urandom of=nars_test_file_59.dat count=1 bs=1670813
dd if=/dev/urandom of=nars_test_file_60.dat count=1 bs=1647307
dd if=/dev/urandom of=nars_test_file_61.dat count=1 bs=100544443
EOF


# wait for all the background dd writes to finish, then checksum the seed copy
wait


md5sum * > ../nars_test_folder.md5


# cp -r /mnt/disk1/nars_test_folder /mnt/disk2
# cd /mnt/disk2/nars_test_folder/ && md5sum -c /mnt/disk1/nars_test_folder.md5
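
The commented copy/verify step above was then repeated for each target data disk. The loop below is a reconstruction based on the results further down (both disk2 and disk3 were checked), not the exact commands used - adjust the disk numbers to your array:

for d in /mnt/disk2 /mnt/disk3
do
        cp -r /mnt/disk1/nars_test_folder "$d"
        ( cd "$d/nars_test_folder" && md5sum -c /mnt/disk1/nars_test_folder.md5 )
done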

 

 

Double-checked the md5sums for each directory: GOOD.

Waited 180 seconds (sleep 180).

HARD BOOT.

 

cd /mnt/disk1/nars_test_folder/ && md5sum -c /mnt/disk1/nars_test_folder.md5

ALL GOOD

 

cd /mnt/disk2/nars_test_folder/ && md5sum -c /mnt/disk1/nars_test_folder.md5

 

nars_test_file_60.dat: OK

nars_test_file_61.dat: FAILED

md5sum: WARNING: 1 of 61 computed checksums did NOT match

 

cd /mnt/disk3/nars_test_folder/ && md5sum -c /mnt/disk1/nars_test_folder.md5

 

nars_test_file_61.dat: FAILED

md5sum: WARNING: 1 of 61 computed checksums did NOT match

 

 

root@unRAID2:/mnt/disk1/nars_test_folder# ls -l /mnt/disk1/nars_test_folder/nars_test_file_61.dat

-rw-rw-rw- 1 root root 93381270 2013-06-25 19:21 /mnt/disk1/nars_test_folder/nars_test_file_61.dat

 

root@unRAID2:/mnt/disk1/nars_test_folder# ls -l /mnt/disk2/nars_test_folder/nars_test_file_61.dat

-rw-rw-rw- 1 root root 84008960 2013-06-25 19:22 /mnt/disk2/nars_test_folder/nars_test_file_61.dat

 

root@unRAID2:/mnt/disk1/nars_test_folder# ls -l /mnt/disk3/nars_test_folder/nars_test_file_61.dat

-rw-rw-rw- 1 root root 61370368 2013-06-25 19:22 /mnt/disk3/nars_test_folder/nars_test_file_61.dat
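
Since the copies on disk2 and disk3 simply come up short, a quick extra check (not part of the test above) is to confirm they are clean truncations of the seed rather than corrupted mid-file, by comparing only the bytes that made it to disk. A minimal sketch, assuming GNU stat/cmp and that the seed on disk1 is intact:

SEED=/mnt/disk1/nars_test_folder/nars_test_file_61.dat
for COPY in /mnt/disk2/nars_test_folder/nars_test_file_61.dat /mnt/disk3/nars_test_folder/nars_test_file_61.dat
do
        N=$(stat -c %s "$COPY")      # size of the (shorter) copy
        cmp -n "$N" "$SEED" "$COPY" && echo "$COPY: first $N bytes match the seed"
done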

Link to comment

I need to go out for some time, but just want to post that I have been looking for a non-3.4.x unRAID rc. I picked rc5 (kernel 3.0.35) for some tests, and apparently I can't reproduce any data loss/corruption with this version. I will test it a bit more later to make sure, but if that holds, I can only reproduce the problem on versions with the 3.4.x kernel and with the patched 3.9.x (not with the non-patched one).

 

@WeeboTech: I did also see, many times in a few of my tests, only file 61 failing, though changing the memory amount on my VM would make that worse (files missing, and another file corrupted - always the last non-missing one). One thing you may also test, if you want, is lowering memory - easy with the mem= kernel param (see the sketch below)...

 

Edit: did some more tests and apparently can't really reproduce it on rc5.
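
For anyone who wants to try the lower-memory suggestion: the mem= kernel parameter caps usable RAM at boot. A minimal sketch of a syslinux boot entry, assuming a stock unRAID config on the flash drive (the label and file names may differ on your install) - here capping the box at 2 GB:

label unRAID OS
  kernel /bzimage
  append mem=2048M initrd=/bzroot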

Link to comment

I think I may wait for Tom to weigh in on this for now.

 

I know we want to move forward, and I want to see what Tom has to say.

I would not want to be testing something in the past when there is a possible working fix for the very near future.

 

At the very least we've created some nice regression testing to shake out the bugs.

Link to comment

It does seem that the primary bug is resolved by the RC16 patch, but clearly there's still SOMETHING that can cause data loss when there's a lot of parallel activity.

 

I'd agree with waiting for Tom to weigh in ... otherwise we're all spinning our wheels running tests that may have long been OBE.

 

Link to comment

It does seem that the primary bug is resolved by the RC16 patch, but clearly there's still SOMETHING that can cause data loss when there's a lot of parallel activity.

 

I'd agree with waiting for Tom to weigh in ... otherwise we're all spinning our wheels running tests that may have long been OBE.

 

I don't consider the primary bug resolved. A system may not trigger it, but others do.

In fact, I hit it twice, with two different scenarios.

 

In nars' test the cp from disk to disk was sequential.

Both md5sum checks after the hard boot of that sequential cp failed.

 

On my test the parallel writes showed corruption but the sequential copies were successful.

 

Keep in mind I did an md5sum of each directory - the seed and both copies - before the 3-minute wait and hard boot.

 

Something is not being written out or tracked correctly in the tables and flushed out.

I expect some form of corruption if the hard boot happens within seconds of the write.

I think 3 minutes is enough time for things to have been put in order.

 

If I can reveal corruption with 20 minutes of programming doing very basic Unix things, there is a big problem; we can't yet say the bug is resolved.

 

Writing 60 files in parallel is something that can happen easily just by using the server as a real file server.

Compile the kernel in parallel.  Download a huge torrent of the Billboard Hot 100 hits, where you get about 120 files coming in at breakneck speed in all sorts of bits and pieces.  Nothing is sequential in torrent downloads. Even my reorganization of the mp3s in my DJ folder is done in parallel while ripping the newest Queensrÿche CD from Amazon.com.

 

If nars's test is exhibiting the same problems on two different systems, we can't claim the bug is resolved.

Link to comment

Agree.  In fact, I suspect you may have found something much more serious than the not-yet-committed writes that nars identified.  I have to wonder just how long this has been an issue ... and whether it would be true of other file systems as well, or is restricted to Reiser.

 

Link to comment

Again, we are speculating here, but Tom has hinted at there being 2 distinct possible bugs. One was introduced with the new kernel, when the code that caused endless writes on sync was removed (it also caused the crashing on array stop); this is a by-product of the new kernel no longer being responsible for flushing the cache, that responsibility having been passed back to each of the file systems. The second has not been elaborated on, but Tom thinks there has been another underlying bug in the file system code for some time. Until we get some clarification/explanation we are all just clutching at straws. I think the testing we have all done, with the information we have, is about as far as we can go, and it has provided Tom with a heap of data. Now we sit back and wait :)

Link to comment

So the best bet right now would be to use rc13, am I correct with that? Since nars wrote

I just tried rc13 (with the non-patched 3.9.3 kernel), and indeed I can't reproduce the problem - tried 3 times, checked file integrity, and no problem at all!

 

I'm currently on 5.0-beta4 and wanted to upgrade to a newer version; that's why I'm asking.

Link to comment

I have only just seen this thread. Can I ask something.... can someone define "hard reset" or "hard reboot". I know this sounds silly, but it has not been clearly defined throughout this thread and is fundamental to testing this issue. Are you literally talking about pulling the power cord from the server?

 

Link to comment

It's a power loss to the unRAID server or an unclean shutdown.  Apparently if you stop the array, you'll invoke a sync command and this will prevent potential data loss if you've written to a disk share (not a user share) and a hard reset (power loss, unclean shutdown) occurs.

 

Edit: it does affect user shares as well... I was mistaken above.

Link to comment

It's a power loss to the unRAID server or an unclean shutdown.... 

 

Ok thanks.

So as I understand it, you copy XYZ to server, wait 30 minutes, kill power and you get data loss. OK. What is the maximum amount of time someone has tried waiting to see if data loss is still present?

Link to comment

I think any wait time beyond several seconds is irrelevant.  nars arbitrarily picked 30 minutes because that was deemed way beyond the amount of time it would take for the copy to complete (for any cached file data to finish writing to the disk).

Link to comment

I have only just seen this thread. Can I ask something.... can someone define "hard reset" or "hard reboot". I know this sounds silly, but it has not been clearly defined throughout this thread and is fundamental to testing this issue. Are you literally talking about pulling the power cord from the server?

 

hard reset - is where you press the Reset button connected to the motherboard, resulting in immediate system reset.

hard reboot - same as hard reset

kill power - literally yank the power cord out of the wall or flip the power switch off on the PSU

 

I believe the issue is a regression introduced who-knows-when in the reiserfs file system code.

Link to comment

nars reported earlier that ext3 did not exhibit the corruption and an earlier version of the release candidates did not either.

nars also reported that reiserfs filesystems mounted outside of parity protection exhibited the corruption issue.

 

Exactly, but we are not fully sure that the problem you are hitting with parallel writes is really the same one that I can reproduce... If you get a chance you could maybe confirm that, by repeating your test on a reiserfs disk outside the array and on ext3...
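
For the ext3 half of that comparison, a minimal sketch using a spare disk outside the array - /dev/sdX1 is a placeholder, this wipes whatever is on that partition, and it assumes mkfs.ext3 (e2fsprogs) is available on the box:

# format and mount a spare partition as ext3 for an out-of-array test
mkfs.ext3 /dev/sdX1
mkdir -p /mnt/ext3_test
mount -t ext3 /dev/sdX1 /mnt/ext3_test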

 

 

So the best bet right now would be to use rc13, am I correct with that? Since nars wrote

I just tried rc13 (with the non-patched 3.9.3 kernel), and indeed I can't reproduce the problem - tried 3 times, checked file integrity, and no problem at all!

 

I'm currently on 5.0-beta4 and wanted to upgrade to a newer version; that's why I'm asking.

 

The problem is that rc13 has some other issues... including crashing when stopping the array on some systems (I couldn't reproduce it, though). I'm not fully sure if that was fixed in rc15 just because of the reiserfs patch OR due to some other changes Tom made, like "remove superfluous 'sync's" and "ensure md-devices not removed before last umount completes".

 

From my tests I could also not reproduce it on rc5, as you can read above. So in your situation, if you have been OK on 5.0-beta4 for this long and you are concerned about this problem, then I would just wait a bit longer until this all clarifies before updating.

Link to comment

I have only just seen this thread. Can I ask something.... can someone define "hard reset" or "hard reboot". I know this sounds silly, but it has not been clearly defined throughout this thread and is fundamental to testing this issue. Are you literally talking about pulling the power cord from the server?

 

hard reset - is where you press the Reset button connected to the motherboard, resulting in immediate system reset.

hard reboot - same as hard reset

kill power - literally yank the power cord out of the wall or flip the power switch off on the PSU

 

I believe the issue is a regression introduced who-knows-when in the reiserfs file system code.

 

Sorry to labour this question, but am I right in thinking that if your array has an unclean shutdown, either by a hard reset or reboot, then at the moment you could in theory lose all data written to the array since the last sync - which in most cases is when you last had a clean shutdown of the array?

If this is true, then is it possible to force a manual sync from the CLI as a temporary solution for now?

Link to comment

@binhex: Yes, from our tests, depending on version, etc... there is the possibility to lose all or partial data written since last 'sync' or since last array stop. To make absolute sure data is committed you may either telnet the server and use the 'sync' command or just stop/start the array on the web interface.
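
If you want to see whether a manual sync left anything pending, you can also watch the kernel's dirty/writeback counters from the same telnet session - a quick sketch using standard Linux /proc/meminfo fields (nothing unRAID-specific):

sync
grep -E 'Dirty|Writeback' /proc/meminfo     # both should drop to (near) zero once everything is flushed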

Link to comment

@binhex: Yes, from our tests, depending on version, etc... there is the possibility to lose all or partial data written since last 'sync' or since last array stop. To make absolute sure data is committed you may either telnet the server and use the 'sync' command or just stop/start the array on the web interface.

 

Thanks nars, I was afraid you might say that, I'm now officially scared!!  :o

Link to comment

I think any wait time beyond several seconds is irrelevant.  nars arbitrarily picked 30 minutes because that was deemed way beyond the amount of time it would take for the copy to complete (for any cached file data to finish writing to the disk share).

 

 

One test of mine was in a two hour window.

Link to comment

@binhex: Yes, from our tests, depending on version, etc... there is the possibility to lose all or partial data written since last 'sync' or since last array stop. To make absolute sure data is committed you may either telnet the server and use the 'sync' command or just stop/start the array on the web interface.

 

Thanks nars, I was afraid you might say that, I'm now officially scared!!  :o

 

 

This isn't scare tactics and you should be... concerned.  ;)

Link to comment

So as an example: say on the 1st of January 2013 user X performed their last sync of the array; anything they have written directly to a disk (not via a user share) since that date is not safely written to the array until another sync is performed?

Link to comment

I have only just seen this thread. Can I ask something.... can someone define "hard reset" or "hard reboot". I know this sounds silly, but it has not been clearly defined throughout this thread and is fundamental to testing this issue. Are you literally talking about pulling the power cord from the server?

 

hard reset - is where you press the Reset button connected to the motherboard, resulting in immediate system reset.

hard reboot - same as hard reset

kill power - literally yank the power cord out of the wall or flip the power switch off on the PSU

 

I believe the issue is a regression introduced who-knows-when in the reiserfs file system code.

 

Sorry to labour this question, but am I right in thinking that if your array has an unclean shutdown, either by a hard reset or reboot, then at the moment you could in theory lose all data written to the array since the last sync - which in most cases is when you last had a clean shutdown of the array?

If this is true, then is it possible to force a manual sync from the CLI as a temporary solution for now?

Sure, just log in and type

sync

on the command line.

 

If the drives have spun down they will be spun up first and the data (if any) written.

 

Joe L.
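
As a purely hypothetical stopgap (my own suggestion, not something from this thread), that same sync could be scheduled from root's crontab until a fixed release is out - for example, flushing every five minutes:

# add to root's crontab (crontab -e):
*/5 * * * * /bin/sync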

Link to comment

So as an example: say on the 1st of January 2013 user X performed their last sync of the array; anything they have written directly to a disk (not via a user share) since that date is not safely written to the array until another sync is performed?

 

Not likely... First, you would probably not have the array started for that much continuous time (or if you do, then maybe you also have a UPS protecting you from unclean shutdowns?). Second, as more and more data is written, it's likely that some of it (the earliest) will eventually get committed, though our tests are not extensive enough to confirm that... Most likely the problem can happen (that doesn't mean it will happen for sure... it depends on version and other conditions) only with the last data written before an unclean shutdown/reset. Also, it does not matter whether it was written directly to a disk or via a user share... it should be the same, since a user share is just a kind of "virtual file system": when you write something to it, it subsequently writes to the disks, just as you can do directly...

Link to comment
This topic is now closed to further replies.