copy files, wait 30min, hard reboot = data loss



The only thing I can think of is that the fact that nars is using a virtualized UnRAID must have something to do with the errors he's getting. Perhaps some interaction between UnRAID and the hypervisor. But both my tests and mejutty's certainly seem to show that the problem RC15a had, with deferred writes not being committed in time, has been resolved.

I already thought the same... but if we consider that I never get corruption when doing a sync before the hard reset, and that outside the array, with just a manually mounted reiserfs partition, I can also reproduce it, while with an ext3 partition I cannot, it would be strange for it to be related to anything other than the filesystem, I think?

 

Which brings me to the next question: nars, how is your VM set up? Are you using passthrough disks, RDM, or have you set up virtual disks for your unRAID drives on real disks? I mean, don't get me wrong, it's great this issue was found, but I am a little puzzled by your other results on releases that should not have the error. For me rc16 seems to be good: during my tests I have stopped and started it at least 50 times with no hang-ups, and I have not had a single file transfer failure.

I'm using just virtual HDDs in my VM for testing: no passthrough, no real HDDs.

 

I understand perfectly that, being the only person able to reproduce the problem with rc16, and only in a virtual environment, it's a complicated situation... we may just conclude that it is really hard to trigger, timing-related, etc. Not sure what to say :( And seeing that I can also reproduce it with previous RCs, which many people have been running, then surely rc16 is OK to go live, looking at it this way.

 

Also, although I can still reproduce the problem on rc16, I do feel that rc16 (as well as all RCs with 3.4.x kernels) is less problematic than rc15a (where I usually lost ALL the files, or even the whole folder, regardless of RAM size; i.e. it felt "easier to trigger").

 

 

Btw, garycase, mejutty, limetech, if you are willing to do one last test... this is the last thing I will ask you to try:

 

(I suppose you all have at least two data disks; otherwise you may need to make some small changes to the procedure...)

 

Please use this small script to create a test folder on disk1 with 61 test files of exactly the same sizes as the test files I have been using for all my tests:

 

#!/bin/bash -e
# Create 61 random-filled test files with fixed sizes, then write an md5 manifest.
cd /mnt/disk1
mkdir nars_test_folder
cd nars_test_folder
dd if=/dev/urandom of=nars_test_file_01.dat count=1 bs=55288564
dd if=/dev/urandom of=nars_test_file_02.dat count=1 bs=510183
dd if=/dev/urandom of=nars_test_file_03.dat count=1 bs=2458188
dd if=/dev/urandom of=nars_test_file_04.dat count=1 bs=2978245
dd if=/dev/urandom of=nars_test_file_05.dat count=1 bs=2643817
dd if=/dev/urandom of=nars_test_file_06.dat count=1 bs=2761289
dd if=/dev/urandom of=nars_test_file_07.dat count=1 bs=2787313
dd if=/dev/urandom of=nars_test_file_08.dat count=1 bs=2633474
dd if=/dev/urandom of=nars_test_file_09.dat count=1 bs=2610714
dd if=/dev/urandom of=nars_test_file_10.dat count=1 bs=2919871
dd if=/dev/urandom of=nars_test_file_11.dat count=1 bs=2350149
dd if=/dev/urandom of=nars_test_file_12.dat count=1 bs=509644
dd if=/dev/urandom of=nars_test_file_13.dat count=1 bs=482331
dd if=/dev/urandom of=nars_test_file_14.dat count=1 bs=2302861
dd if=/dev/urandom of=nars_test_file_15.dat count=1 bs=2476117
dd if=/dev/urandom of=nars_test_file_16.dat count=1 bs=2439426
dd if=/dev/urandom of=nars_test_file_17.dat count=1 bs=2770128
dd if=/dev/urandom of=nars_test_file_18.dat count=1 bs=2346394
dd if=/dev/urandom of=nars_test_file_19.dat count=1 bs=2417321
dd if=/dev/urandom of=nars_test_file_20.dat count=1 bs=2498691
dd if=/dev/urandom of=nars_test_file_21.dat count=1 bs=6160690
dd if=/dev/urandom of=nars_test_file_22.dat count=1 bs=2887331
dd if=/dev/urandom of=nars_test_file_23.dat count=1 bs=1477755
dd if=/dev/urandom of=nars_test_file_24.dat count=1 bs=20340436
dd if=/dev/urandom of=nars_test_file_25.dat count=1 bs=1307032
dd if=/dev/urandom of=nars_test_file_26.dat count=1 bs=2828348
dd if=/dev/urandom of=nars_test_file_27.dat count=1 bs=2534757
dd if=/dev/urandom of=nars_test_file_28.dat count=1 bs=1038573
dd if=/dev/urandom of=nars_test_file_29.dat count=1 bs=1097624
dd if=/dev/urandom of=nars_test_file_30.dat count=1 bs=1009367
dd if=/dev/urandom of=nars_test_file_31.dat count=1 bs=1265181
dd if=/dev/urandom of=nars_test_file_32.dat count=1 bs=1151772
dd if=/dev/urandom of=nars_test_file_33.dat count=1 bs=1126012
dd if=/dev/urandom of=nars_test_file_34.dat count=1 bs=887565
dd if=/dev/urandom of=nars_test_file_35.dat count=1 bs=1543955
dd if=/dev/urandom of=nars_test_file_36.dat count=1 bs=1552538
dd if=/dev/urandom of=nars_test_file_37.dat count=1 bs=1564416
dd if=/dev/urandom of=nars_test_file_38.dat count=1 bs=1558199
dd if=/dev/urandom of=nars_test_file_39.dat count=1 bs=1537904
dd if=/dev/urandom of=nars_test_file_40.dat count=1 bs=1536752
dd if=/dev/urandom of=nars_test_file_41.dat count=1 bs=1595042
dd if=/dev/urandom of=nars_test_file_42.dat count=1 bs=1618269
dd if=/dev/urandom of=nars_test_file_43.dat count=1 bs=1488836
dd if=/dev/urandom of=nars_test_file_44.dat count=1 bs=1503179
dd if=/dev/urandom of=nars_test_file_45.dat count=1 bs=1493448
dd if=/dev/urandom of=nars_test_file_46.dat count=1 bs=1601083
dd if=/dev/urandom of=nars_test_file_47.dat count=1 bs=1600324
dd if=/dev/urandom of=nars_test_file_48.dat count=1 bs=1145707
dd if=/dev/urandom of=nars_test_file_49.dat count=1 bs=1176492
dd if=/dev/urandom of=nars_test_file_50.dat count=1 bs=1286722
dd if=/dev/urandom of=nars_test_file_51.dat count=1 bs=1301486
dd if=/dev/urandom of=nars_test_file_52.dat count=1 bs=1536094
dd if=/dev/urandom of=nars_test_file_53.dat count=1 bs=1295588
dd if=/dev/urandom of=nars_test_file_54.dat count=1 bs=226047
dd if=/dev/urandom of=nars_test_file_55.dat count=1 bs=2128774
dd if=/dev/urandom of=nars_test_file_56.dat count=1 bs=2209284
dd if=/dev/urandom of=nars_test_file_57.dat count=1 bs=2201158
dd if=/dev/urandom of=nars_test_file_58.dat count=1 bs=1795643
dd if=/dev/urandom of=nars_test_file_59.dat count=1 bs=1670813
dd if=/dev/urandom of=nars_test_file_60.dat count=1 bs=1647307
dd if=/dev/urandom of=nars_test_file_61.dat count=1 bs=100544443
md5sum * > ../nars_test_folder.md5

 

After it completes, please do a proper unRAID restart (to make sure the test files on disk1, which will be used as the reference, get committed, and so that in the next session you really do nothing but the cp: no SMB, no emhttp connection, nothing. Make sure the array is set to autostart, though). Upon startup, just cp the folder to disk2 like:

cp -r /mnt/disk1/nars_test_folder /mnt/disk2

Wait for it to complete, wait one or two extra minutes, then do a hard reset. Just that.

After the restart, check the files on disk2 with something like:

cd /mnt/disk2/nars_test_folder/ && md5sum -c /mnt/disk1/nars_test_folder.md5

 

I followed these exact steps to make sure I could reproduce it; these are the exact steps, not involving external SMB clients, etc. Just try it two or three times, if possible with different amounts of RAM (I tried with 512MB and 1GB). If you can't reproduce it at all, then you can't really reproduce it; don't "hammer" on it.
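To be clear, the sync-before-reset variant (the case where I never see corruption) just adds an explicit flush between the copy and the hard reset. A minimal sketch of that sequence, using /tmp stand-in paths (on the real test the source and destination are /mnt/disk1/nars_test_folder and /mnt/disk2):

```shell
#!/bin/bash
# Copy, then flush, then verify. With the sync in place, a hard reset
# immediately afterwards should not lose the copied files.
# Paths are illustrative stand-ins, not the real array paths.
mkdir -p /tmp/src_folder /tmp/dst
echo "payload" > /tmp/src_folder/f1
cp -r /tmp/src_folder /tmp/dst
sync    # commit all dirty pages and metadata before any hard reset
cmp /tmp/src_folder/f1 /tmp/dst/src_folder/f1 && echo "copy verified"
```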

 

WeeboTech's idea of trying parallel writes may also be a good way to stress it, but I don't need to try it as I'm already able to reproduce the problem without that :)

Link to comment

Nars, are you on ESX (which version?), VMware Workstation, or Player?

What controller is configured for the virtual drives?

 

I'm running ESX 5.1 on an HP MicroServer; the VM has 4GB of RAM.

I am copying to one physical disk with RDM passthrough.

I think I have it configured to use the paravirtual SCSI controller.

 

I didn't even do the random-data thing, although that was a good idea.

I copied bzroot so I knew I had something with a fixed size and a known md5.

 

On rc15a, the sizes were smaller on the corrupt files.

Also keep in mind, the process took over an hour to do the parallel cp and then md5sum-check them all.

There was about an hour of idle time.

Then I crashed the machine.

 

After that it was another hour or so to come up and md5sum-check them all.

 

When I get home I'll check with RC16, then drop down to bare metal and test again.

Link to comment


 

VMware Workstation. I'm using three virtual IDE HDDs of 512MB each, if that matters, and one 300MB disk with a FAT partition to boot unRAID from.

 

Your idea of doing parallel copies may be really good, as it may stress things a bit more and make them slower, thus eventually triggering timing issues... though maybe not so many iterations are needed...

 

In the small script I posted, I used random data just to fill the files; the main idea was simply to re-create files with exactly the same sizes as the test files I have been testing with from the beginning, so that others can test with similar files.

Link to comment
It might be a good idea to release -rc16 IF you haven't heard back from the kernel folks fairly quickly. There are a lot of us running this RC version and, if I understand the problem, any improper shutdown will cause data loss.

 

In the interim, you can ensure your data is safe after copying new files by stopping and restarting the array. Stopping issues the sync command, which commits the changes to the drives.

Link to comment

I'll add a second data disk, as I am currently running with one data and one parity, and try your script, nars, to see what results I come up with. All my tests have been done with transfers from a W7 machine. Actually, to save me calculating it here: what size is the resulting folder? I may just use the flash instead of disk1 so I can give it a crack before adding in the second drive.

Link to comment


 

You can manually telnet in and issue the sync also.

Would be nice to have a SYNC button (grin) it would satisfy my normal habit of syncing the disks after a massive transfer or move.
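In the meantime, a tiny "sync and report" helper does the same job from the console. A sketch (the timing printout is my own addition for reassurance, not anything unRAID provides):

```shell
#!/bin/bash
# Flush all dirty pages and filesystem metadata to disk, reporting how
# long the flush took (a long sync means a lot of data was still pending).
start=$(date +%s)
sync
end=$(date +%s)
echo "sync completed in $((end - start))s"
```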

 

Link to comment

Would be nice to have a SYNC button (grin) it would satisfy my normal habit of syncing the disks after a massive transfer or move.

 

While it's not a "Sync" button, doesn't the "Stop" button followed by a "Start" do the same thing??    With, of course, the glaring exception that if anyone else was using the array you'd disrupt them  :)

Link to comment


 

Yes, of course: a sync is issued and the filesystems are unmounted and remounted, thus accomplishing the sync/flush.

 

For me that will break a lot of things in my network.  ;)

Besides, my uTorrent machine is always writing to the unRAID server.

 

The one positive thing about this is that you can rehash the torrents and then fill in the missing pieces.

But now that I think about it, this might be why I have to do it so often. Many torrents that finish fine end up corrupted a few weeks later for no reason, and I have to re-download the missing pieces.

I thought there were issues on the network, but how can that be? It's only a beefy laptop and a MicroServer on its own network. (bare metal unRAID)

Link to comment


 

Ya know, I'm not gonna claim that I suffered from this bug, but I did indeed have some torrents that inexplicably showed up with missing pieces. Like after downloading 8 of 9 individual episodes and then combining them into a single season's torrent and forcing a recheck, a few of the episodes showed failed CRC checks for some pieces. I just shrugged it off, but it did happen a good amount. What I can't say, though, is that my server had any crashes that would explain the missing pieces. :-\

Link to comment

It DOES seem that if (a) your unRAID server has a UPS (a MANDATORY accessory in my opinion), and (b) the APC UPS and CleanPowerdown packages do their jobs correctly and cleanly shut down the server in the event of a power failure... then this bug should never be an issue.

 

That would seem to protect against the issue impacting any downloaded torrents that hadn't yet been synced, but it does assume you have UPS support on all your systems.

 

Obviously, getting this fixed is a better idea, however  8)

Link to comment

Which version of Workstation?

 

I have 9 at home.

 

I can try both scenarios.

 

8.0.4 here, but you can probably still test on 9.

Also note that I have all the virtual disk files on the same physical HDD; it may make a difference, as it will surely make things slower...

 

 

I'll add a second data disk, as I am currently running with one data and one parity, and try your script, nars, to see what results I come up with. All my tests have been done with transfers from a W7 machine. Actually, to save me calculating it here: what size is the resulting folder? I may just use the flash instead of disk1 so I can give it a crack before adding in the second drive.

 

Yes, you can surely test from the flash, total size for the folder will be 283656666 bytes.

Link to comment

It DOES seem that if (a) your unRAID server has a UPS (a MANDATORY accessory in my opinion), and (b) the APC UPS and CleanPowerdown packages do their jobs correctly and cleanly shut down the server in the event of a power failure... then this bug should never be an issue.

 

That would seem to protect against the issue impacting any downloaded torrents that hadn't yet been synced, but it does assume you have UPS support on all your systems.

 

Obviously, getting this fixed is a better idea, however  8)

 

We call a UPS a mandatory accessory, but there's no real support for it.

There's a kludge right now that has to be refined shortly after this.

 

With the current powerdown, when the filesystems are unmounted I see emhttp complaining.

However, if we call the emhttp powerdown instead, it may never be able to umount the filesystems, thus causing a hang. The possibility of losing data is higher now.

 

Other than that, so far my systems have not had power issues.

There's something else going on, and this reveals a big possibility.

 

All the more reason now to have some kind of scan & checksum database.

 

In any event, when I get home today, I'll start a battery of tests to see what I find.

 

I may have to take out those new 4TB drives and install the 250s that came with my MicroServers.

Or at least set the HPA on some other drives.
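The scan-and-checksum database mentioned above can start as little more than find piped into md5sum. A sketch using /tmp example paths (on unRAID the scan root would be something like /mnt/disk1, with the database kept somewhere persistent such as /boot; both paths here are stand-ins):

```shell
#!/bin/bash
# Build an md5 database for every regular file under DISK, then verify it.
DISK=/tmp/cksum_demo        # stand-in for e.g. /mnt/disk1
DB=/tmp/cksum_demo.md5      # stand-in for e.g. /boot/checksums.disk1.md5
mkdir -p "$DISK"
echo "some data" > "$DISK/file1"
find "$DISK" -type f -exec md5sum {} + > "$DB"   # scan: record checksums
md5sum -c "$DB"                                  # verify: re-hash and compare
```

Re-running the verify step after an unclean shutdown would flag any file whose contents changed or were lost.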

Link to comment

Btw, garycase, mejutty, limetech, if you are willing to do one last test... this is the last thing I will ask you to try:

 

I'd be glad to try this, but I'm a complete Linux idiot, so am not sure how to run the script.

 

I copied the script to a file (called it "NarsTest") and put it on the flash drive.

 

Then from the UnRAID console, I did /Boot/NarsTest and it gives me an "Invalid Parameter"

 

I copied everything from the "#!/bin/bash -e" line to the end.

 

So ... what are the EXACT steps I need to do to run this?

 

Link to comment


I copied the script to a file (called it "NarsTest") and put it on the flash drive.

 

Then from the UnRAID console, I did /Boot/NarsTest and it gives me an "Invalid Parameter"

 

I copied everything from the "#!/bin/bash -e" line to the end.

 

So ... what are the EXACT steps I need to do to run this?

 

Hmm, maybe you created the script with a Windows text editor and the problem is the CR/LF line endings? Try this:

fromdos </boot/NarsTest >/boot/NarsTest2

then try to run NarsTest2.
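(If fromdos isn't on the system, stripping the carriage returns with tr achieves the same fix. A self-contained demo using a /tmp file; the real paths are the /boot/NarsTest ones above:)

```shell
#!/bin/bash
# Simulate a script saved with Windows CRLF line endings, then strip the
# carriage returns so bash can run it (the same fix fromdos applies).
printf 'echo hello\r\n' > /tmp/crlf_script
tr -d '\r' < /tmp/crlf_script > /tmp/lf_script
bash /tmp/lf_script    # prints: hello
```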

Link to comment

Would be nice to have a SYNC button (grin) it would satisfy my normal habit of syncing the disks after a massive transfer or move.

 

While it's not a "Sync" button, doesn't the "Stop" button followed by a "Start" do the same thing??    With, of course, the glaring exception that if anyone else was using the array you'd disrupt them  :)

 

True... and so is the suggestion to telnet in and issue a 'sync' command. The problem as I see it is that a lot of the people who are using -rc15a probably aren't following this thread. And most likely they are the very ones who will not be using a UPS. I was just suggesting to Tom that, if the kernel folks are slow in responding to his request, it might be better to release a fix that he is 99% sure solves the problem than to have several users end up with corrupted filesystems while an easily implemented solution that should prevent the problem exists.

Link to comment

Hmm, maybe you created the script with a Windows text editor and the problem is the CR/LF line endings? Try this:

fromdos </boot/NarsTest >/boot/NarsTest2

then try to run NarsTest2.

 

That was obviously the problem (I had just saved your text with Notepad). It works fine now.

I've got to run some errands at the moment, but I will do the testing when I return, now that I have the appropriate set of files on disk1.

 

Link to comment


True... and so is the suggestion to telnet in and issue a 'sync' command. The problem as I see it is that a lot of the people who are using -rc15a probably aren't following this thread. And most likely they are the very ones who will not be using a UPS. I was just suggesting to Tom that, if the kernel folks are slow in responding to his request, it might be better to release a fix that he is 99% sure solves the problem than to have several users end up with corrupted filesystems while an easily implemented solution that should prevent the problem exists.

 

I agree.

Link to comment

Decided to quickly run one test to see if it has the same issues you have had.

 

Ran the copy; waited 1 minute and cut power; booted up; and then compared the "nars_test_folder" on disk1 and disk2 ==> they are identical!!

 

When I get back in a couple of hours I'll repeat that test with a couple of smaller intervals between the copy and killing power (perhaps 30 seconds and 15 seconds)... but it certainly looks like my test system does NOT have the same issue, and that with RC16 it works just fine.

 

Link to comment


 

Thank you for testing. Btw, I don't think it is worth trying smaller intervals, as I always waited nearly 2 minutes; waiting longer doesn't really matter. If you want to try a different amount of memory (if you are back to 1GB and want to decrease it, you don't need to physically remove RAM; you can just add the mem parameter in syslinux.cfg, like the one used to limit to 4GB that was discussed a lot on the forum). Beyond that, trying anything else is probably pointless; you just can't really reproduce it.
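For reference, the mem limit goes on the append line in syslinux.cfg; something like this (a sketch only; the label and the rest of the append line must match whatever your existing syslinux.cfg already has):

```
label unRAID OS
  kernel bzimage
  append mem=512M initrd=bzroot
```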

Link to comment

Ok folks, sorry, but I don't have good news for the first test.

 

I took the rc16 bzroot as a seed file and cp'ed it in parallel to disk1.

 

Then I did a cp of all the files from disk1 to disk2.

Then I did a cp of all the files from disk2 to disk3.

 

I waited 3 minutes then reset the virtual machine.

 

ESX 5.1, 4GB memory, unRAID 5.0-rc16, writing to 3 physical disks.

I am not using the user shares.

 

root@unRAID2:/mnt/disk1/crashtestdummy# df -vH
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           135M  177k  135M   1% /var/log
/dev/sda1        16G  2.1G   14G  13% /boot
/dev/md1        4.1T  4.6G  4.0T   1% /mnt/disk1
/dev/md2        3.1T  4.2G  3.0T   1% /mnt/disk2
/dev/md3        3.1T  4.2G  3.0T   1% /mnt/disk3
shfs             11T   13G   10T   1% /mnt/user

 

What is odd to me is that 76 of the 120 files do not have the proper size and fail the md5sum check.

These are the FIRST set of files I made (in parallel).

 

What baffles me more is that the sequential cp from disk1 to disk2 is good, and the sequential cp from disk2 to disk3 is good.

While it could certainly be my system, RAM, or hardware of some sort, it's a bit baffling.

 

In addition, accessing the files via md5sum is slow as molasses on the 4TB, possibly due to fragmentation?

It's really slow; both tests on the other disks finished miles ahead of the 4TB disk.

 

I could add manual syncs in here to force the flush and make sure it's not something else.

I have RC12a on my other server; I could pull those images over and try them, unless someone can suggest a better old image to test.

 

I'm quite concerned.

 

Let me add that I feel I left more than enough time for the cache and superblock to be flushed.

The parallel cp took quite some time. We're talking tens of minutes.

Then I did the md5sum again, more than enough time for syncing the superblock.

The sequential cp to the other disks was pretty fast, maybe 3 minutes each, each doing an md5sum, then an idle sleep for 3 minutes.

 


root@unRAID2:/mnt/disk1/crashtestdummy# uname -a
Linux unRAID2 3.9.6p-unRAID #35 SMP Mon Jun 24 11:31:07 PDT 2013 i686 AMD Turion(tm) II Neo N54L Dual-Core Processor AuthenticAMD GNU/Linux

root@unRAID2:/mnt/disk1/crashtestdummy# cat /etc/unraid-version
version=5.0-rc16

root@unRAID2:/mnt/disk1/crashtestdummy# cat /boot/crashtestdummy
#!/bin/bash

FILE=bzroot.516
DIR=/mnt/disk1/crashtestdummy
MD5SUM=`md5sum /boot/${FILE} | cut -f1 -d' '`

SLEEPTIME=180
LOOPMAX=${LOOPMAX:=120}

rm -f ${DIR}/md5sum

rm   -fr ${DIR}
mkdir -p ${DIR}

while [[ ${LOOP:=0} -lt ${LOOPMAX:=10} ]]
do (( LOOP=LOOP+1 ))
   cp -v /boot/${FILE} ${DIR}/${FILE}.${LOOP} &
   echo "${MD5SUM}  ${FILE}.${LOOP}" >> ${DIR}/md5sum
done

echo "`date '+%b %d %T'` $0[$$]: Waiting for background cp to finish."
wait

echo "`date '+%b %d %T'` $0[$$]: Checking md5sum"
( cd ${DIR} ; md5sum -c md5sum )

DESTDIR=/mnt/disk2/crashtestdummy
echo "`date '+%b %d %T'` $0[$$]: copying to ${DESTDIR}"
rm -r  ${DESTDIR}
mkdir  ${DESTDIR}
cp -av ${DIR}/* ${DESTDIR}
( cd ${DESTDIR}; md5sum -c md5sum )

DIR=${DESTDIR}
DESTDIR=/mnt/disk3/crashtestdummy
echo "`date '+%b %d %T'` $0[$$]: copying to: ${DESTDIR}"
rm -r  ${DESTDIR}
mkdir  ${DESTDIR}
cp -av ${DIR}/* ${DESTDIR}
( cd ${DESTDIR}; md5sum -c md5sum )

echo "`date '+%b %d %T'` $0[$$]: sleeping ${SLEEPTIME}"
sleep ${SLEEPTIME}
echo "`date '+%b %d %T'` $0[$$]: ok crash the system now"

 

 


md5sum: WARNING: 76 of 120 computed checksums did NOT match

root@unRAID2:/mnt/disk1/crashtestdummy# ls -l
total 3915729
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:45 bzroot.516.1*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.10*
-rwxrwxrwx 1 root root 33153024 2013-06-25 17:47 bzroot.516.100*
-rwxrwxrwx 1 root root 32141312 2013-06-25 17:47 bzroot.516.101*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:47 bzroot.516.102*
-rwxrwxrwx 1 root root 32362496 2013-06-25 17:47 bzroot.516.103*
-rwxrwxrwx 1 root root 32608256 2013-06-25 17:47 bzroot.516.104*
-rwxrwxrwx 1 root root 32698368 2013-06-25 17:47 bzroot.516.105*
-rwxrwxrwx 1 root root 33419264 2013-06-25 17:47 bzroot.516.106*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:47 bzroot.516.107*
-rwxrwxrwx 1 root root 33746944 2013-06-25 17:47 bzroot.516.108*
-rwxrwxrwx 1 root root 31952896 2013-06-25 17:47 bzroot.516.109*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:44 bzroot.516.11*
-rwxrwxrwx 1 root root 32219136 2013-06-25 17:47 bzroot.516.110*
-rwxrwxrwx 1 root root 33247232 2013-06-25 17:47 bzroot.516.111*
-rwxrwxrwx 1 root root 33292288 2013-06-25 17:47 bzroot.516.112*
-rwxrwxrwx 1 root root 33329152 2013-06-25 17:47 bzroot.516.113*
-rwxrwxrwx 1 root root 33480704 2013-06-25 17:47 bzroot.516.114*
-rwxrwxrwx 1 root root 33140736 2013-06-25 17:47 bzroot.516.115*
-rwxrwxrwx 1 root root 33415168 2013-06-25 17:47 bzroot.516.116*
-rwxrwxrwx 1 root root 33013760 2013-06-25 17:47 bzroot.516.117*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:47 bzroot.516.118*
-rwxrwxrwx 1 root root 32890880 2013-06-25 17:47 bzroot.516.119*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.12*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.120*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:45 bzroot.516.13*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:45 bzroot.516.14*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:45 bzroot.516.15*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:45 bzroot.516.16*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.17*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.18*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.19*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.2*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.20*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.21*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.22*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.23*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.24*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.25*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.26*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.27*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.28*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.29*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.3*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.30*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.31*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.32*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.33*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.34*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.35*
-rwxrwxrwx 1 root root 32735232 2013-06-25 17:47 bzroot.516.36*
-rwxrwxrwx 1 root root 32481280 2013-06-25 17:47 bzroot.516.37*
-rwxrwxrwx 1 root root 33333248 2013-06-25 17:47 bzroot.516.38*
-rwxrwxrwx 1 root root 33574912 2013-06-25 17:47 bzroot.516.39*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:45 bzroot.516.4*
-rwxrwxrwx 1 root root 33636352 2013-06-25 17:47 bzroot.516.40*
-rwxrwxrwx 1 root root 32837632 2013-06-25 17:47 bzroot.516.41*
-rwxrwxrwx 1 root root 33075200 2013-06-25 17:47 bzroot.516.42*
-rwxrwxrwx 1 root root 33574912 2013-06-25 17:47 bzroot.516.43*
-rwxrwxrwx 1 root root 32948224 2013-06-25 17:47 bzroot.516.44*
-rwxrwxrwx 1 root root 32968704 2013-06-25 17:47 bzroot.516.45*
-rwxrwxrwx 1 root root 32960512 2013-06-25 17:47 bzroot.516.46*
-rwxrwxrwx 1 root root 33026048 2013-06-25 17:47 bzroot.516.47*
-rwxrwxrwx 1 root root 33316864 2013-06-25 17:47 bzroot.516.48*
-rwxrwxrwx 1 root root 33681408 2013-06-25 17:47 bzroot.516.49*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:45 bzroot.516.5*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:47 bzroot.516.50*
-rwxrwxrwx 1 root root 33460224 2013-06-25 17:47 bzroot.516.51*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:47 bzroot.516.52*
-rwxrwxrwx 1 root root 33099776 2013-06-25 17:47 bzroot.516.53*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:47 bzroot.516.54*
-rwxrwxrwx 1 root root 32972800 2013-06-25 17:47 bzroot.516.55*
-rwxrwxrwx 1 root root 32804864 2013-06-25 17:47 bzroot.516.56*
-rwxrwxrwx 1 root root 32608256 2013-06-25 17:47 bzroot.516.57*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:47 bzroot.516.58*
-rwxrwxrwx 1 root root 33288192 2013-06-25 17:47 bzroot.516.59*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:45 bzroot.516.6*
-rwxrwxrwx 1 root root 33792000 2013-06-25 17:47 bzroot.516.60*
-rwxrwxrwx 1 root root 33656832 2013-06-25 17:47 bzroot.516.61*
-rwxrwxrwx 1 root root 32661504 2013-06-25 17:47 bzroot.516.62*
-rwxrwxrwx 1 root root 33333248 2013-06-25 17:47 bzroot.516.63*
-rwxrwxrwx 1 root root 33304576 2013-06-25 17:47 bzroot.516.64*
-rwxrwxrwx 1 root root 33529856 2013-06-25 17:47 bzroot.516.65*
-rwxrwxrwx 1 root root 33161216 2013-06-25 17:47 bzroot.516.66*
-rwxrwxrwx 1 root root 33722368 2013-06-25 17:47 bzroot.516.67*
-rwxrwxrwx 1 root root 33882112 2013-06-25 17:47 bzroot.516.68*
-rwxrwxrwx 1 root root 33357824 2013-06-25 17:47 bzroot.516.69*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:45 bzroot.516.7*
-rwxrwxrwx 1 root root 33214464 2013-06-25 17:47 bzroot.516.70*
-rwxrwxrwx 1 root root 32821248 2013-06-25 17:47 bzroot.516.71*
-rwxrwxrwx 1 root root 33259520 2013-06-25 17:47 bzroot.516.72*
-rwxrwxrwx 1 root root 32821248 2013-06-25 17:47 bzroot.516.73*
-rwxrwxrwx 1 root root 33112064 2013-06-25 17:47 bzroot.516.74*
-rwxrwxrwx 1 root root 33226752 2013-06-25 17:47 bzroot.516.75*
-rwxrwxrwx 1 root root 33890304 2013-06-25 17:47 bzroot.516.76*
-rwxrwxrwx 1 root root 33628160 2013-06-25 17:47 bzroot.516.77*
-rwxrwxrwx 1 root root 33521664 2013-06-25 17:47 bzroot.516.78*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:47 bzroot.516.79*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:45 bzroot.516.8*
-rwxrwxrwx 1 root root 32555008 2013-06-25 17:47 bzroot.516.80*
-rwxrwxrwx 1 root root 33226752 2013-06-25 17:47 bzroot.516.81*
-rwxrwxrwx 1 root root 33312768 2013-06-25 17:47 bzroot.516.82*
-rwxrwxrwx 1 root root 33411072 2013-06-25 17:47 bzroot.516.83*
-rwxrwxrwx 1 root root 32763904 2013-06-25 17:47 bzroot.516.84*
-rwxrwxrwx 1 root root 33091584 2013-06-25 17:47 bzroot.516.85*
-rwxrwxrwx 1 root root 33521664 2013-06-25 17:47 bzroot.516.86*
-rwxrwxrwx 1 root root 32653312 2013-06-25 17:47 bzroot.516.87*
-rwxrwxrwx 1 root root 32358400 2013-06-25 17:47 bzroot.516.88*
-rwxrwxrwx 1 root root 32550912 2013-06-25 17:47 bzroot.516.89*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:45 bzroot.516.9*
-rwxrwxrwx 1 root root 32649216 2013-06-25 17:47 bzroot.516.90*
-rwxrwxrwx 1 root root 33316864 2013-06-25 17:47 bzroot.516.91*
-rwxrwxrwx 1 root root 33579008 2013-06-25 17:47 bzroot.516.92*
-rwxrwxrwx 1 root root 31932416 2013-06-25 17:47 bzroot.516.93*
-rwxrwxrwx 1 root root 32768000 2013-06-25 17:47 bzroot.516.94*
-rwxrwxrwx 1 root root 32772096 2013-06-25 17:47 bzroot.516.95*
-rwxrwxrwx 1 root root 32387072 2013-06-25 17:47 bzroot.516.96*
-rwxrwxrwx 1 root root 32874496 2013-06-25 17:47 bzroot.516.97*
-rwxrwxrwx 1 root root 32272384 2013-06-25 17:47 bzroot.516.98*
-rwxrwxrwx 1 root root 32452608 2013-06-25 17:47 bzroot.516.99*
-rw-rw-rw- 1 root root     5772 2013-06-25 17:38 md5sum

 

 

 

 

When responding, please don't quote my whole message; I know what I wrote and it's pretty long, sorry.

Quote what's important, to keep the repetition down. Thanks!!

Link to comment

@WeeboTech: If the problem you are seeing is related to the problem I can more easily reproduce, it may mean that timing (i.e. slowing things down) makes a difference in how easily it appears; that's maybe why it shows up with your parallel writes. If you got good data on the subsequent copies to disk2 and disk3, it probably means the data on disk1 was good before the hard reset, so it may be a commit issue. Then again, it may be some different problem (I hope not...).

 

I would first try to see if you can reproduce it again with a lower number of parallel writes, to make it simpler, faster, and easier to run further tests. Then I would try doing a sync before the hard reset and see if you still get problems with the data written in parallel. Then I would try the same with rc13 and its unpatched 3.9 kernel and see what happens, as that is the only version, of the ones I have tested so far, on which I was not able to reproduce 'my' problem.

Link to comment

Doing narstest to disk1, issue sync, reboot gracefully.

 

cp -r disk1 to disk2

do md5sum(good)

cp -r disk1 to disk3

do md5sum(good)

 

wait 2 minutes.

CRASH the testdummy.

 

boot unRAID

cd /mnt/disk2/nars_test_folder && md5sum -c /mnt/disk1/nars_test_folder.md5  GOOD

cd /mnt/disk3/nars_test_folder && md5sum -c /mnt/disk1/nars_test_folder.md5  GOOD

Link to comment

Interesting that we get a variety of results on nars test.

 

I just did it again -- twice -- and both were (again) perfect, with no errors. 

 

Once with 30 seconds until I killed power after the cp.

 

Then the same thing with only 512MB of RAM ... same 30 second delay.

 

So at least on my bare-metal system, RC16 works just fine with that set of test files.

 

Link to comment
Interesting that we get a variety of results on nars test.

Note that the test where WeeboTech actually got corruption was not with my test files; he was trying a different testing approach, with parallel writes. He was also unable to reproduce any problem with my test files.

Link to comment
This topic is now closed to further replies.