nars Posted June 25, 2013 Author Share Posted June 25, 2013

The only thing I can think of is that the fact nars is using a virtualized unRAID must have something to do with the errors he's getting. Perhaps some interaction between unRAID and the hypervisor. But both my tests and mejutty's certainly seem to show that the problem RC15a had with deferred writes not being committed in time has been resolved.

I already thought the same... but consider that when I do a sync before the hard reset, I never get corruption... and that outside the array, with just a manually mounted reiserfs partition, I can also reproduce it, while with an ext3 partition I cannot. It would be strange for it to be related to anything other than the filesystem, I think?

Which brings me to the next question. Nars, how is your VM set up? Are you using passthrough disks, RDM, or have you set up virtual disks for your unRAID drives on real disks? I mean, don't get me wrong, it's great this issue was found, but I am a little puzzled by your other results on releases that should not have the error. For me RC16 seems to be good; during my tests I have stopped and started it at least 50 times with no hang-ups, and I have not had a single file transfer failure.

I'm using just virtual HDDs on my VM for testing, no passthrough, no real HDDs. I understand perfectly that being the only person able to reproduce the problem with RC16, and only in a virtual environment, makes this a complex thing... we may just conclude it is very hard to trigger, timing-related, etc. I'm not sure what to say, but seeing that I can also reproduce it with previous RCs, which many people have been running, then surely RC16 is OK to go live, looking at it this way. Also, although I can still reproduce the problem on RC16, I do feel that RC16 (as well as all RCs with 3.4.x kernels) is less problematic than RC15a, where I usually lost ALL the files, or even the whole folder, independently of RAM size; i.e. it felt "easier to happen".
Btw, garycase, mejutty, limetech, if you are willing to do one last test, this is the last thing I will ask you to try. (I suppose you all have at least 2 data disks; otherwise you may need to make some small changes to the procedure.) Please use this small script to create on disk1 a test folder with 61 test files with the same exact sizes as the test files I have been using for all my tests:

#!/bin/bash -e
cd /mnt/disk1
mkdir nars_test_folder
cd nars_test_folder
dd if=/dev/urandom of=nars_test_file_01.dat count=1 bs=55288564
dd if=/dev/urandom of=nars_test_file_02.dat count=1 bs=510183
dd if=/dev/urandom of=nars_test_file_03.dat count=1 bs=2458188
dd if=/dev/urandom of=nars_test_file_04.dat count=1 bs=2978245
dd if=/dev/urandom of=nars_test_file_05.dat count=1 bs=2643817
dd if=/dev/urandom of=nars_test_file_06.dat count=1 bs=2761289
dd if=/dev/urandom of=nars_test_file_07.dat count=1 bs=2787313
dd if=/dev/urandom of=nars_test_file_08.dat count=1 bs=2633474
dd if=/dev/urandom of=nars_test_file_09.dat count=1 bs=2610714
dd if=/dev/urandom of=nars_test_file_10.dat count=1 bs=2919871
dd if=/dev/urandom of=nars_test_file_11.dat count=1 bs=2350149
dd if=/dev/urandom of=nars_test_file_12.dat count=1 bs=509644
dd if=/dev/urandom of=nars_test_file_13.dat count=1 bs=482331
dd if=/dev/urandom of=nars_test_file_14.dat count=1 bs=2302861
dd if=/dev/urandom of=nars_test_file_15.dat count=1 bs=2476117
dd if=/dev/urandom of=nars_test_file_16.dat count=1 bs=2439426
dd if=/dev/urandom of=nars_test_file_17.dat count=1 bs=2770128
dd if=/dev/urandom of=nars_test_file_18.dat count=1 bs=2346394
dd if=/dev/urandom of=nars_test_file_19.dat count=1 bs=2417321
dd if=/dev/urandom of=nars_test_file_20.dat count=1 bs=2498691
dd if=/dev/urandom of=nars_test_file_21.dat count=1 bs=6160690
dd if=/dev/urandom of=nars_test_file_22.dat count=1 bs=2887331
dd if=/dev/urandom of=nars_test_file_23.dat count=1 bs=1477755
dd if=/dev/urandom of=nars_test_file_24.dat count=1 bs=20340436
dd if=/dev/urandom of=nars_test_file_25.dat count=1 bs=1307032
dd if=/dev/urandom of=nars_test_file_26.dat count=1 bs=2828348
dd if=/dev/urandom of=nars_test_file_27.dat count=1 bs=2534757
dd if=/dev/urandom of=nars_test_file_28.dat count=1 bs=1038573
dd if=/dev/urandom of=nars_test_file_29.dat count=1 bs=1097624
dd if=/dev/urandom of=nars_test_file_30.dat count=1 bs=1009367
dd if=/dev/urandom of=nars_test_file_31.dat count=1 bs=1265181
dd if=/dev/urandom of=nars_test_file_32.dat count=1 bs=1151772
dd if=/dev/urandom of=nars_test_file_33.dat count=1 bs=1126012
dd if=/dev/urandom of=nars_test_file_34.dat count=1 bs=887565
dd if=/dev/urandom of=nars_test_file_35.dat count=1 bs=1543955
dd if=/dev/urandom of=nars_test_file_36.dat count=1 bs=1552538
dd if=/dev/urandom of=nars_test_file_37.dat count=1 bs=1564416
dd if=/dev/urandom of=nars_test_file_38.dat count=1 bs=1558199
dd if=/dev/urandom of=nars_test_file_39.dat count=1 bs=1537904
dd if=/dev/urandom of=nars_test_file_40.dat count=1 bs=1536752
dd if=/dev/urandom of=nars_test_file_41.dat count=1 bs=1595042
dd if=/dev/urandom of=nars_test_file_42.dat count=1 bs=1618269
dd if=/dev/urandom of=nars_test_file_43.dat count=1 bs=1488836
dd if=/dev/urandom of=nars_test_file_44.dat count=1 bs=1503179
dd if=/dev/urandom of=nars_test_file_45.dat count=1 bs=1493448
dd if=/dev/urandom of=nars_test_file_46.dat count=1 bs=1601083
dd if=/dev/urandom of=nars_test_file_47.dat count=1 bs=1600324
dd if=/dev/urandom of=nars_test_file_48.dat count=1 bs=1145707
dd if=/dev/urandom of=nars_test_file_49.dat count=1 bs=1176492
dd if=/dev/urandom of=nars_test_file_50.dat count=1 bs=1286722
dd if=/dev/urandom of=nars_test_file_51.dat count=1 bs=1301486
dd if=/dev/urandom of=nars_test_file_52.dat count=1 bs=1536094
dd if=/dev/urandom of=nars_test_file_53.dat count=1 bs=1295588
dd if=/dev/urandom of=nars_test_file_54.dat count=1 bs=226047
dd if=/dev/urandom of=nars_test_file_55.dat count=1 bs=2128774
dd if=/dev/urandom of=nars_test_file_56.dat count=1 bs=2209284
dd if=/dev/urandom of=nars_test_file_57.dat count=1 bs=2201158
dd if=/dev/urandom of=nars_test_file_58.dat count=1 bs=1795643
dd if=/dev/urandom of=nars_test_file_59.dat count=1 bs=1670813
dd if=/dev/urandom of=nars_test_file_60.dat count=1 bs=1647307
dd if=/dev/urandom of=nars_test_file_61.dat count=1 bs=100544443
md5sum * > ../nars_test_folder.md5

After it completes, please do a proper unRAID restart (to make sure the test files on disk1, which will be used as the reference, get committed, and so that on the next session you really do only a cp and nothing more: no SMB, no emhttp connection, nothing. Make sure the array is set to autostart, though). Upon startup, just cp the folder to disk2:

cp -r /mnt/disk1/nars_test_folder /mnt/disk2

Wait for it to complete... wait 1 or 2 extra minutes... do a hard reset... just this. After restart, check the files on disk2 with something like:

cd /mnt/disk2/nars_test_folder/ && md5sum -c /mnt/disk1/nars_test_folder.md5

I followed these exact steps to make sure I could reproduce it. These are the exact steps, not involving external SMB clients, etc. Just try it 2 or 3 times, if possible with different amounts of RAM (I tried with 512MB and 1GB). If you can't reproduce it at all, then you can't really reproduce it; don't 'hammer' on it. WeeboTech's idea to try parallel writes may also be a good way to stress it, but I don't need to try that as I'm already able to reproduce without it. Link to comment
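The create/copy/verify procedure above can be exercised end-to-end in miniature before committing to the full test. This is only an illustrative sketch: it uses temp directories instead of the array disks, one small file instead of the 61, and a sync where the real test does a hard reset.

```shell
#!/bin/bash -e
# Miniature dry-run of the procedure above, using temp dirs instead of
# /mnt/disk1 and /mnt/disk2 (illustrative paths only; the real test must
# run against the array, with a hard reset in place of the sync).
src=$(mktemp -d); dst=$(mktemp -d)
mkdir "$src/nars_test_folder"
dd if=/dev/urandom of="$src/nars_test_folder/f1.dat" count=1 bs=4096 2>/dev/null
( cd "$src/nars_test_folder" && md5sum * > ../nars_test_folder.md5 )
cp -r "$src/nars_test_folder" "$dst"
sync    # in the real test this is where the hard reset happens instead
( cd "$dst/nars_test_folder" && md5sum -c "$src/nars_test_folder.md5" )
```

The same manifest (`nars_test_folder.md5`) verifies both the source and the copy, which is exactly why the script above writes it one directory level up from the test files.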
WeeboTech Posted June 25, 2013 Share Posted June 25, 2013

Nars, are you on ESX (which version?), VMware Workstation, or Player? What controller is configured for the virtual drives? I'm running ESX 5.1 on an HP MicroServer; the VM has 4GB of RAM. I am copying to 1 physical disk with RDM passthrough. I think I have it configured to use the paravirtual SCSI controller. I didn't even do the random-data thing, although that was a good idea. I copied bzroot so I knew I had something with a fixed size and md5. On rc15a, sizes were smaller on the corrupt files. Also keep in mind, the process took over an hour to do the parallel cp and then md5sum-check them all. There was about an hour of idle time. Then I crashed the machine. After that it was another hour or so to come back up and md5sum-check them all. When I get home I'll check with RC16, then drop down to bare metal and test again. Link to comment
nars Posted June 25, 2013 Author Share Posted June 25, 2013

Nars, are you on ESX (which version?), VMware Workstation, or Player? What controller is configured for the virtual drives? I'm running ESX 5.1 on an HP MicroServer; the VM has 4GB of RAM. I am copying to 1 physical disk with RDM passthrough. I think I have it configured to use the paravirtual SCSI controller. I didn't even do the random-data thing, although that was a good idea. I copied bzroot so I knew I had something with a fixed size and md5. On rc15a, sizes were smaller on the corrupt files. Also keep in mind, the process took over an hour to do the parallel cp and then md5sum-check them all. There was about an hour of idle time. Then I crashed the machine. After that it was another hour or so to come back up and md5sum-check them all. When I get home I'll check with RC16, then drop down to bare metal and test again.

VMware Workstation. I'm using 3 virtual IDE HDDs of 512MB each, if that matters, and one 300MB disk with a FAT partition to boot unRAID from. Your idea to do parallel copies may be really good, as it may stress things a bit more and make things slower, possibly triggering timing issues... though perhaps not so many iterations are needed. The small script I posted uses random data just to fill the files; the main idea was simply to re-create files with the same exact sizes as the test files I have been testing with from the beginning, so that others can test with similar files. Link to comment
WeeboTech Posted June 25, 2013 Share Posted June 25, 2013 Which version of Workstation? I have 9 at home. I can try both scenarios. Link to comment
jbartlett Posted June 25, 2013 Share Posted June 25, 2013 It might be a good idea to release -rc16 if you haven't heard back from the kernel folks fairly quickly. There are a lot of us who are running this RC version and, if I understand the problem, any improper shutdown can cause data loss. In the interim, you can ensure your data is safe after copying new files by stopping the array and restarting it. Stopping issues the sync command, which commits the changes to the drives. Link to comment
mejutty Posted June 25, 2013 Share Posted June 25, 2013 I'll add a second data disk, as I am currently running with 1 data and 1 parity, and try your script, nars, and see what results I come up with. All my tests have been done with transfers from a W7 machine. Actually, to save me calculating it here, what size is the resulting folder? I may just use the flash instead of disk1 so I can give it a crack before adding in the second drive. Link to comment
WeeboTech Posted June 25, 2013 Share Posted June 25, 2013

It might be a good idea to release -rc16 if you haven't heard back from the kernel folks fairly quickly. There are a lot of us who are running this RC version and, if I understand the problem, any improper shutdown can cause data loss. In the interim, you can ensure your data is safe after copying new files by stopping the array and restarting it. Stopping issues the sync command, which commits the changes to the drives.

You can also manually telnet in and issue the sync. Would be nice to have a SYNC button (grin); it would satisfy my normal habit of syncing the disks after a massive transfer or move. Link to comment
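For anyone following along, the manual flush WeeboTech describes is just the standard sync command; nothing unRAID-specific is needed. The /proc/meminfo check below is a generic Linux detail added here for illustration, not something from the thread:

```shell
# Flush all dirty page-cache data and filesystem metadata to disk.
# sync blocks until writeback completes; safe to run at any time.
sync
# Generic Linux sanity check: the dirty/writeback counters should be at
# or near zero right after a sync on an otherwise idle box.
grep -E '^(Dirty|Writeback):' /proc/meminfo
```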
garycase Posted June 25, 2013 Share Posted June 25, 2013

Would be nice to have a SYNC button (grin); it would satisfy my normal habit of syncing the disks after a massive transfer or move.

While it's not a "Sync" button, doesn't the "Stop" button followed by a "Start" do the same thing?? With, of course, the glaring exception that if anyone else was using the array you'd disrupt them. Link to comment
WeeboTech Posted June 25, 2013 Share Posted June 25, 2013

Would be nice to have a SYNC button (grin); it would satisfy my normal habit of syncing the disks after a massive transfer or move.

While it's not a "Sync" button, doesn't the "Stop" button followed by a "Start" do the same thing?? With, of course, the glaring exception that if anyone else was using the array you'd disrupt them.

Yes, of course; a sync is issued and the filesystems are unmounted and remounted, thus accomplishing the sync/flush. For me that will break a lot of things in my network. Besides, my uTorrent machine is always writing to the unRAID server. The one positive thing about this is that you can rehash the torrents and then fill in the missing pieces. But now that I think about it... this might be why I have to do that so often. Many torrents that finish fine end up being corrupted a few weeks later for no reason, and I have to re-download the missing pieces. I thought there were issues on the network, but how can that be? It's only a beefy laptop and a microserver on its own network (bare-metal unRAID). Link to comment
jumperalex Posted June 25, 2013 Share Posted June 25, 2013

Would be nice to have a SYNC button (grin); it would satisfy my normal habit of syncing the disks after a massive transfer or move.

While it's not a "Sync" button, doesn't the "Stop" button followed by a "Start" do the same thing?? With, of course, the glaring exception that if anyone else was using the array you'd disrupt them.

Yes, of course; a sync is issued and the filesystems are unmounted and remounted, thus accomplishing the sync/flush. For me that will break a lot of things in my network. Besides, my uTorrent machine is always writing to the unRAID server. The one positive thing about this is that you can rehash the torrents and then fill in the missing pieces. But now that I think about it... this might be why I have to do that so often. Many torrents that finish fine end up being corrupted a few weeks later for no reason, and I have to re-download the missing pieces. I thought there were issues on the network, but how can that be? It's only a beefy laptop and a microserver on its own network (bare-metal unRAID).

Ya know, I'm not gonna claim that I suffered from this bug, but I did indeed have some torrents that inexplicably showed up with missing pieces. Like after downloading 8 of 9 individual episodes and then combining them into a single season's torrent and forcing a recheck, a few of the episodes showed failed CRC checks for some pieces. I just shrugged it off, but it happened a fair amount. What I can't say is whether my server had any crashes that would explain the missing pieces. Link to comment
garycase Posted June 25, 2013 Share Posted June 25, 2013 It DOES seem that if (a) your unRAID server has a UPS (a MANDATORY accessory in my opinion), and (b) the APC UPS and CleanPowerdown packages do their jobs correctly and cleanly shut down the server in the event of a power failure... then this bug should never be an issue. That would seem to protect against the issue impacting any downloaded torrents that hadn't yet been sync'd => but it does assume you have UPS support on all your systems. Obviously, getting this fixed is a better idea, however. Link to comment
nars Posted June 25, 2013 Author Share Posted June 25, 2013

Which version of Workstation? I have 9 at home. I can try both scenarios.

8.0.4 here, but you can probably still test on 9. Also note that I have all the virtual disk files on the same physical HDD; that may make some difference, as it will surely make things slower.

I'll add a second data disk, as I am currently running with 1 data and 1 parity, and try your script, nars, and see what results I come up with. All my tests have been done with transfers from a W7 machine. Actually, to save me calculating it here, what size is the resulting folder? I may just use the flash instead of disk1 so I can give it a crack before adding in the second drive.

Yes, you can surely test from the flash; the total size of the folder will be 283656666 bytes. Link to comment
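That byte total can be double-checked on the server once the script has run. Here is the general pattern, demonstrated in a temp dir with made-up sizes (1000 + 2345 bytes); on the array you would point it at /mnt/disk1/nars_test_folder instead:

```shell
#!/bin/bash -e
# Pattern for verifying a folder's byte total (GNU find, as on unRAID).
# Demo files of 1000 and 2345 bytes, so the expected total is 3345.
d=$(mktemp -d)
dd if=/dev/zero of="$d/a.dat" count=1 bs=1000 2>/dev/null
dd if=/dev/zero of="$d/b.dat" count=1 bs=2345 2>/dev/null
find "$d" -type f -printf '%s\n' | awk '{t+=$1} END{print t}'   # prints 3345
```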
WeeboTech Posted June 25, 2013 Share Posted June 25, 2013

It DOES seem that if (a) your unRAID server has a UPS (a MANDATORY accessory in my opinion), and (b) the APC UPS and CleanPowerdown packages do their jobs correctly and cleanly shut down the server in the event of a power failure... then this bug should never be an issue. That would seem to protect against the issue impacting any downloaded torrents that hadn't yet been sync'd => but it does assume you have UPS support on all your systems. Obviously, getting this fixed is a better idea, however.

We call a UPS a mandatory accessory, but there's no real support for it. There's a kludge right now that will have to be refined shortly after this. With the current powerdown, when the filesystems are unmounted I see emhttp complaining. However, if we call the emhttp powerdown instead, it may never be able to umount the filesystems, thus causing a hang. The possibility of losing data is higher now. Other than that, so far my systems have not had power issues. There's something else going on, and this reveals a big possibility. All the more reason now to have some kind of scan & checksum database. In any event, when I get home today, I'll start a battery of tests and see what I find. I may have to take out those new 4TB drives and install the 250s that came with my MicroServers. Or at least set the HPA on some other drives. Link to comment
garycase Posted June 25, 2013 Share Posted June 25, 2013

Btw, garycase, mejutty, limetech, if you are willing to do one last test... this is the last thing I will ask you to try:

I'd be glad to try this, but I'm a complete Linux idiot, so I'm not sure how to run the script. I copied the script to a file (called it "NarsTest") and put it on the flash drive. Then from the unRAID console I ran /Boot/NarsTest, and it gave me an "Invalid Parameter" error. I copied everything from the "#!/bin/bash -e" line to the end. So... what are the EXACT steps I need to follow to run this? Link to comment
nars Posted June 25, 2013 Author Share Posted June 25, 2013

I'd be glad to try this, but I'm a complete Linux idiot, so I'm not sure how to run the script. I copied the script to a file (called it "NarsTest") and put it on the flash drive. Then from the unRAID console I ran /Boot/NarsTest, and it gave me an "Invalid Parameter" error. I copied everything from the "#!/bin/bash -e" line to the end. So... what are the EXACT steps I need to follow to run this?

Hmm, maybe you created the script with a Windows text editor and the problem is the CR/LF line endings? Try this:

fromdos </boot/NarsTest >/boot/NarsTest2

then try to run NarsTest2. Link to comment
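If fromdos happens not to be available, tr can do the same CR-stripping. A self-contained demo of the failure mode and fix, with a temp file standing in for /boot/NarsTest:

```shell
#!/bin/bash -e
# Demo: a script saved with DOS (CR/LF) line endings misbehaves under
# bash until the carriage returns are stripped. tr stands in for fromdos.
f=$(mktemp)
printf '#!/bin/bash\r\necho ok\r\n' > "$f"   # simulate a Notepad-saved file
tr -d '\r' < "$f" > "$f.unix"                # fromdos equivalent
bash "$f.unix"                               # now runs cleanly, prints: ok
```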
Frank1940 Posted June 25, 2013 Share Posted June 25, 2013

Would be nice to have a SYNC button (grin); it would satisfy my normal habit of syncing the disks after a massive transfer or move.

While it's not a "Sync" button, doesn't the "Stop" button followed by a "Start" do the same thing?? With, of course, the glaring exception that if anyone else was using the array you'd disrupt them.

True... and so is the suggestion to telnet in and issue a 'sync' command. The problem, as I see it, is that a lot of the people who are using -rc15a probably aren't following this thread. And most likely, they are the very ones who will not be using a UPS. I was just suggesting to Tom that, if the kernel folks are slow in responding to his request, it might be better to release a fix that he is 99% sure solves the problem than to have several users end up with corrupted filesystems while there is an easily implemented solution that should prevent it. Link to comment
garycase Posted June 25, 2013 Share Posted June 25, 2013

Hmm, maybe you created the script with a Windows text editor and the problem is the CR/LF line endings? Try this: fromdos </boot/NarsTest >/boot/NarsTest2 then try to run NarsTest2.

That was obviously the problem (I had just saved your text with Notepad). It works fine now. I've got to run some errands at the moment, but I will do the testing when I return, now that I have the appropriate set of files on disk1. Link to comment
WeeboTech Posted June 25, 2013 Share Posted June 25, 2013

True... and so is the suggestion to telnet in and issue a 'sync' command. The problem, as I see it, is that a lot of the people who are using -rc15a probably aren't following this thread. And most likely, they are the very ones who will not be using a UPS. I was just suggesting to Tom that, if the kernel folks are slow in responding to his request, it might be better to release a fix that he is 99% sure solves the problem than to have several users end up with corrupted filesystems while there is an easily implemented solution that should prevent it.

I agree. Link to comment
garycase Posted June 25, 2013 Share Posted June 25, 2013 Decided to quickly run one test to see if it has the same issues you have had. Ran the copy; waited 1 minute and cut power; booted up; and then compared the "nars_test_folder" on disk1 and disk2 ==> they are identical!! When I get back in a couple of hours I'll repeat that test with a couple of smaller intervals between the copy and killing power (perhaps 30 seconds and 15 seconds)... but it certainly looks like my test system does NOT have the same issue, and that with RC16 it works just fine. Link to comment
nars Posted June 25, 2013 Author Share Posted June 25, 2013

Decided to quickly run one test to see if it has the same issues you have had. Ran the copy; waited 1 minute and cut power; booted up; and then compared the "nars_test_folder" on disk1 and disk2 ==> they are identical!! When I get back in a couple of hours I'll repeat that test with a couple of smaller intervals between the copy and killing power (perhaps 30 seconds and 15 seconds)... but it certainly looks like my test system does NOT have the same issue, and that with RC16 it works just fine.

Thank you for testing. Btw, I don't think it is worth trying smaller intervals, as I always waited close to 2 minutes; waiting longer makes no real difference in my case. If you wish to try a different memory amount (if you are back to 1GB and want to decrease it, you don't need to physically remove RAM; you can just add a mem parameter in syslinux.cfg, like the one that used to be suggested on the forum to limit machines to 4GB). Beyond that, trying anything else is probably pointless; you simply can't reproduce it. Link to comment
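For reference, the mem= trick nars mentions is just an extra kernel parameter on the append line of syslinux.cfg on the flash drive. A hypothetical example limiting boot RAM to 512MB (the label and file names here are typical, but your config may differ):

```
label unRAID OS
  kernel bzimage
  append mem=512M initrd=bzroot
```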
WeeboTech Posted June 25, 2013 Share Posted June 25, 2013

Ok folks, sorry, but I don't have good news for the first test. I took the RC16 bzroot as a seed file and cp'ed it in parallel to disk1. Then I did a cp of all the files from disk1 to disk2, then a cp of all the files from disk2 to disk3. I waited 3 minutes, then reset the virtual machine. ESX 5.1, 4GB memory, unRAID 5.0-RC16, writing to 3 physical disks. I am not using the user shares.

root@unRAID2:/mnt/disk1/crashtestdummy# df -vH
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           135M  177k  135M   1% /var/log
/dev/sda1        16G  2.1G   14G  13% /boot
/dev/md1        4.1T  4.6G  4.0T   1% /mnt/disk1
/dev/md2        3.1T  4.2G  3.0T   1% /mnt/disk2
/dev/md3        3.1T  4.2G  3.0T   1% /mnt/disk3
shfs             11T   13G   10T   1% /mnt/user

What is odd to me is that 76 of the 120 files do not have the proper size and fail the md5sum check. These are the FIRST set of files I made (in parallel). What baffles me more is that the sequential cp of disk1 to disk2 is good, and the sequential cp of disk2 to disk3 is good. While it certainly could be my system, RAM, or hardware of some sort, it's a bit baffling. In addition, accessing the files via md5sum is slow as molasses on the 4TB disk. Possibly due to fragmentation? It's really slow. Both tests on the other disks finished miles ahead of the 4TB disk. I could add manual syncs in here to force the flush and make sure it's not something else. I have RC12a on my other server; I could pull those images over and try them, unless someone can suggest a better old image to test. I'm quite concerned. Let me add that I feel I left more than enough time for the cache and superblock to be flushed. The parallel cp took quite some time; we're talking tens of minutes. Then I did the md5sum again, so there was more than enough time for syncing the superblock. The sequential cp to the other disks was pretty fast, maybe 3 minutes each including an md5sum, then an idle sleep for 3 minutes.
root@unRAID2:/mnt/disk1/crashtestdummy# uname -a
Linux unRAID2 3.9.6p-unRAID #35 SMP Mon Jun 24 11:31:07 PDT 2013 i686 AMD Turion(tm) II Neo N54L Dual-Core Processor AuthenticAMD GNU/Linux
root@unRAID2:/mnt/disk1/crashtestdummy# cat /etc/unraid-version
version=5.0-rc16
root@unRAID2:/mnt/disk1/crashtestdummy# cat /boot/crashtestdummy
#!/bin/bash
FILE=bzroot.516
DIR=/mnt/disk1/crashtestdummy
MD5SUM=`md5sum /boot/${FILE} | cut -f1 -d' '`
SLEEPTIME=180
LOOPMAX=${LOOPMAX:=120}
rm ${DIR}/md5sum
rm -fr ${DIR}
mkdir -p ${DIR}
while [[ ${LOOP:=0} -lt ${LOOPMAX:=10} ]]
do
  (( LOOP=LOOP+1 ))
  cp -v /boot/${FILE} ${DIR}/${FILE}.${LOOP} &
  echo "${MD5SUM}  ${FILE}.${LOOP}" >> ${DIR}/md5sum
done
echo "`date '+%b %d %T'` $0[$$]: Waiting for background cp to finish."
wait
echo "`date '+%b %d %T'` $0[$$]: Checking md5sum"
( cd ${DIR} ; md5sum -c md5sum )
DESTDIR=/mnt/disk2/crashtestdummy
echo "`date '+%b %d %T'` $0[$$]: copying to ${DESTDIR}"
rm -r ${DESTDIR}
mkdir ${DESTDIR}
cp -av ${DIR}/* ${DESTDIR}
( cd ${DESTDIR}; md5sum -c md5sum )
DIR=${DESTDIR}
DESTDIR=/mnt/disk3/crashtestdummy
echo "`date '+%b %d %T'` $0[$$]: copying to: ${DESTDIR}"
rm -r ${DESTDIR}
mkdir ${DESTDIR}
cp -av ${DIR}/* ${DESTDIR}
( cd ${DESTDIR}; md5sum -c md5sum )
echo "`date '+%b %d %T'` $0[$$]: sleeping ${SLEEPTIME}"
sleep ${SLEEPTIME}
echo "`date '+%b %d %T'` $0[$$]: ok crash the system now"

md5sum: WARNING: 76 of 120 computed checksums did NOT match

root@unRAID2:/mnt/disk1/crashtestdummy# ls -l
total 3915729
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:45 bzroot.516.1* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.10* -rwxrwxrwx 1 root root 33153024 2013-06-25 17:47 bzroot.516.100* -rwxrwxrwx 1 root root 32141312 2013-06-25 17:47 bzroot.516.101* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:47 bzroot.516.102* -rwxrwxrwx 1 root root 32362496 2013-06-25 17:47 bzroot.516.103* -rwxrwxrwx 1 root root 32608256 2013-06-25 17:47 bzroot.516.104* -rwxrwxrwx 1 root root 32698368 2013-06-25 17:47
bzroot.516.105* -rwxrwxrwx 1 root root 33419264 2013-06-25 17:47 bzroot.516.106* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:47 bzroot.516.107* -rwxrwxrwx 1 root root 33746944 2013-06-25 17:47 bzroot.516.108* -rwxrwxrwx 1 root root 31952896 2013-06-25 17:47 bzroot.516.109* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:44 bzroot.516.11* -rwxrwxrwx 1 root root 32219136 2013-06-25 17:47 bzroot.516.110* -rwxrwxrwx 1 root root 33247232 2013-06-25 17:47 bzroot.516.111* -rwxrwxrwx 1 root root 33292288 2013-06-25 17:47 bzroot.516.112* -rwxrwxrwx 1 root root 33329152 2013-06-25 17:47 bzroot.516.113* -rwxrwxrwx 1 root root 33480704 2013-06-25 17:47 bzroot.516.114* -rwxrwxrwx 1 root root 33140736 2013-06-25 17:47 bzroot.516.115* -rwxrwxrwx 1 root root 33415168 2013-06-25 17:47 bzroot.516.116* -rwxrwxrwx 1 root root 33013760 2013-06-25 17:47 bzroot.516.117* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:47 bzroot.516.118* -rwxrwxrwx 1 root root 32890880 2013-06-25 17:47 bzroot.516.119* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.12* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.120* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:45 bzroot.516.13* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:45 bzroot.516.14* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:45 bzroot.516.15* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:45 bzroot.516.16* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.17* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.18* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.19* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.2* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.20* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.21* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.22* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.23* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.24* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 
bzroot.516.25* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.26* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.27* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.28* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.29* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.3* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.30* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.31* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.32* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.33* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.34* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:46 bzroot.516.35* -rwxrwxrwx 1 root root 32735232 2013-06-25 17:47 bzroot.516.36* -rwxrwxrwx 1 root root 32481280 2013-06-25 17:47 bzroot.516.37* -rwxrwxrwx 1 root root 33333248 2013-06-25 17:47 bzroot.516.38* -rwxrwxrwx 1 root root 33574912 2013-06-25 17:47 bzroot.516.39* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:45 bzroot.516.4* -rwxrwxrwx 1 root root 33636352 2013-06-25 17:47 bzroot.516.40* -rwxrwxrwx 1 root root 32837632 2013-06-25 17:47 bzroot.516.41* -rwxrwxrwx 1 root root 33075200 2013-06-25 17:47 bzroot.516.42* -rwxrwxrwx 1 root root 33574912 2013-06-25 17:47 bzroot.516.43* -rwxrwxrwx 1 root root 32948224 2013-06-25 17:47 bzroot.516.44* -rwxrwxrwx 1 root root 32968704 2013-06-25 17:47 bzroot.516.45* -rwxrwxrwx 1 root root 32960512 2013-06-25 17:47 bzroot.516.46* -rwxrwxrwx 1 root root 33026048 2013-06-25 17:47 bzroot.516.47* -rwxrwxrwx 1 root root 33316864 2013-06-25 17:47 bzroot.516.48* -rwxrwxrwx 1 root root 33681408 2013-06-25 17:47 bzroot.516.49* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:45 bzroot.516.5* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:47 bzroot.516.50* -rwxrwxrwx 1 root root 33460224 2013-06-25 17:47 bzroot.516.51* -rwxrwxrwx 1 root root 33930684 2013-06-25 17:47 bzroot.516.52* -rwxrwxrwx 1 root root 33099776 2013-06-25 17:47 bzroot.516.53* 
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:47 bzroot.516.54*
-rwxrwxrwx 1 root root 32972800 2013-06-25 17:47 bzroot.516.55*
-rwxrwxrwx 1 root root 32804864 2013-06-25 17:47 bzroot.516.56*
-rwxrwxrwx 1 root root 32608256 2013-06-25 17:47 bzroot.516.57*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:47 bzroot.516.58*
-rwxrwxrwx 1 root root 33288192 2013-06-25 17:47 bzroot.516.59*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:45 bzroot.516.6*
-rwxrwxrwx 1 root root 33792000 2013-06-25 17:47 bzroot.516.60*
-rwxrwxrwx 1 root root 33656832 2013-06-25 17:47 bzroot.516.61*
-rwxrwxrwx 1 root root 32661504 2013-06-25 17:47 bzroot.516.62*
-rwxrwxrwx 1 root root 33333248 2013-06-25 17:47 bzroot.516.63*
-rwxrwxrwx 1 root root 33304576 2013-06-25 17:47 bzroot.516.64*
-rwxrwxrwx 1 root root 33529856 2013-06-25 17:47 bzroot.516.65*
-rwxrwxrwx 1 root root 33161216 2013-06-25 17:47 bzroot.516.66*
-rwxrwxrwx 1 root root 33722368 2013-06-25 17:47 bzroot.516.67*
-rwxrwxrwx 1 root root 33882112 2013-06-25 17:47 bzroot.516.68*
-rwxrwxrwx 1 root root 33357824 2013-06-25 17:47 bzroot.516.69*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:45 bzroot.516.7*
-rwxrwxrwx 1 root root 33214464 2013-06-25 17:47 bzroot.516.70*
-rwxrwxrwx 1 root root 32821248 2013-06-25 17:47 bzroot.516.71*
-rwxrwxrwx 1 root root 33259520 2013-06-25 17:47 bzroot.516.72*
-rwxrwxrwx 1 root root 32821248 2013-06-25 17:47 bzroot.516.73*
-rwxrwxrwx 1 root root 33112064 2013-06-25 17:47 bzroot.516.74*
-rwxrwxrwx 1 root root 33226752 2013-06-25 17:47 bzroot.516.75*
-rwxrwxrwx 1 root root 33890304 2013-06-25 17:47 bzroot.516.76*
-rwxrwxrwx 1 root root 33628160 2013-06-25 17:47 bzroot.516.77*
-rwxrwxrwx 1 root root 33521664 2013-06-25 17:47 bzroot.516.78*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:47 bzroot.516.79*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:45 bzroot.516.8*
-rwxrwxrwx 1 root root 32555008 2013-06-25 17:47 bzroot.516.80*
-rwxrwxrwx 1 root root 33226752 2013-06-25 17:47 bzroot.516.81*
-rwxrwxrwx 1 root root 33312768 2013-06-25 17:47 bzroot.516.82*
-rwxrwxrwx 1 root root 33411072 2013-06-25 17:47 bzroot.516.83*
-rwxrwxrwx 1 root root 32763904 2013-06-25 17:47 bzroot.516.84*
-rwxrwxrwx 1 root root 33091584 2013-06-25 17:47 bzroot.516.85*
-rwxrwxrwx 1 root root 33521664 2013-06-25 17:47 bzroot.516.86*
-rwxrwxrwx 1 root root 32653312 2013-06-25 17:47 bzroot.516.87*
-rwxrwxrwx 1 root root 32358400 2013-06-25 17:47 bzroot.516.88*
-rwxrwxrwx 1 root root 32550912 2013-06-25 17:47 bzroot.516.89*
-rwxrwxrwx 1 root root 33930684 2013-06-25 17:45 bzroot.516.9*
-rwxrwxrwx 1 root root 32649216 2013-06-25 17:47 bzroot.516.90*
-rwxrwxrwx 1 root root 33316864 2013-06-25 17:47 bzroot.516.91*
-rwxrwxrwx 1 root root 33579008 2013-06-25 17:47 bzroot.516.92*
-rwxrwxrwx 1 root root 31932416 2013-06-25 17:47 bzroot.516.93*
-rwxrwxrwx 1 root root 32768000 2013-06-25 17:47 bzroot.516.94*
-rwxrwxrwx 1 root root 32772096 2013-06-25 17:47 bzroot.516.95*
-rwxrwxrwx 1 root root 32387072 2013-06-25 17:47 bzroot.516.96*
-rwxrwxrwx 1 root root 32874496 2013-06-25 17:47 bzroot.516.97*
-rwxrwxrwx 1 root root 32272384 2013-06-25 17:47 bzroot.516.98*
-rwxrwxrwx 1 root root 32452608 2013-06-25 17:47 bzroot.516.99*
-rw-rw-rw- 1 root root 5772 2013-06-25 17:38 md5sum

When responding, please don't quote my whole message -- I know what I wrote, and it's pretty long. Sorry. Quote only what's important, to keep the duplication down. Thanks!!
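The md5sum file at the end of that listing suggests the usual checksum workflow: hash every copy before the crash test, then verify after the reboot. A minimal self-contained sketch of that pattern (run here against small dummy files in a temp dir rather than the real bzroot copies):

```shell
#!/bin/bash -e
# Sketch of the checksum workflow implied by the listing above: create
# checksums before the hard reset, verify after reboot. Demoed in a temp
# dir with dummy files; on a real test the bzroot.516.* copies would be
# hashed in place.
work=$(mktemp -d); cd "$work"
for i in 1 2 3; do
  dd if=/dev/urandom of="bzroot.516.$i" count=1 bs=1024 2>/dev/null
done
md5sum bzroot.516.* > md5sum   # before the hard reset
md5sum -c md5sum               # after reboot: every line should end in ": OK"
```

Any copy corrupted by uncommitted writes would show up as a "FAILED" line (and a non-zero exit status) in the `md5sum -c` step.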
nars Posted June 25, 2013 (Author) @WeeboTech: If the problem you are seeing is related to the one I can reproduce more easily, it may mean that timing (i.e. slowing things down) makes it appear more readily -- perhaps that's why it shows up with your parallel writes. If you got good data on the subsequent copies to disk2 and disk3, that probably means the data on disk1 was good before the hard reset, which points to a commit issue; still, it may be a different problem (I hope not...). I would first try to reproduce it again with a lower number of parallel writes, to make it simpler, faster, and easier to run further tests. Then I would do a sync before the hard reset and see if you still get problems with the data written in parallel. Then I would try the same on rc13 with the unpatched 3.9 kernel and see what happens, since that is the only version, of the ones I have tested so far, on which I could not reproduce 'my' problem.
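The reduced test nars proposes can be sketched as a couple of parallel writes followed by an explicit sync before the hard reset. The D1/D2 paths are assumptions (they default to temp dirs here so the sketch runs anywhere; point them at two real array disks for the actual test):

```shell
#!/bin/bash -e
# Sketch of the reduced parallel-write test: a small number of writes in
# parallel, then an explicit sync before the hard reset. D1/D2 are
# assumed paths -- export D1=/mnt/disk1 D2=/mnt/disk2 on a real array.
D1=${D1:-$(mktemp -d)}
D2=${D2:-$(mktemp -d)}
dd if=/dev/urandom of="$D1/par_test_1.dat" count=1 bs=2500000 2>/dev/null &
dd if=/dev/urandom of="$D2/par_test_2.dat" count=1 bs=2500000 2>/dev/null &
wait
sync   # flush deferred writes; only hard-reset after this returns
```

If the files survive the crash intact when the sync is included, but not without it, that would support the deferred-commit explanation.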
WeeboTech Posted June 25, 2013

Doing the nars test to disk1, issuing sync, then rebooting gracefully:

cp -r disk1 to disk2, run md5sum (good)
cp -r disk1 to disk3, run md5sum (good)
wait 2 minutes, then CRASH the test dummy
boot unRAID
cd /mnt/disk2/nars_test_folder && md5sum -c /mnt/disk1/nars_test_folder.md5 -> GOOD
cd /mnt/disk3/nars_test_folder && md5sum -c /mnt/disk1/nars_test_folder.md5 -> GOOD
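The verify step above boils down to running `md5sum -c` from inside each copied folder against the checksum file kept next to the source. A self-contained demo of that pattern, using a temp dir in place of the real array disks (substitute /mnt/disk1 ... /mnt/disk3 on a real test):

```shell
#!/bin/bash -e
# Demo of the copy-and-verify step described above: copy the test folder,
# then check the copy against the checksum file stored beside the source.
# Runs in a temp dir; the disk1/disk2 subdirs stand in for array disks.
work=$(mktemp -d)
mkdir -p "$work/disk1/nars_test_folder" "$work/disk2"
dd if=/dev/urandom of="$work/disk1/nars_test_folder/nars_test_file_01.dat" count=1 bs=4096 2>/dev/null
( cd "$work/disk1/nars_test_folder" && md5sum *.dat > "$work/disk1/nars_test_folder.md5" )
cp -r "$work/disk1/nars_test_folder" "$work/disk2/"   # the "cp -r disk1 to disk2" step
( cd "$work/disk2/nars_test_folder" \
  && md5sum -c "$work/disk1/nars_test_folder.md5" )   # each intact file prints ": OK"
```

Running the check from inside the copy (rather than the source) is what makes the relative filenames in the .md5 file resolve against the copied data.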
garycase Posted June 25, 2013 Interesting that we get a variety of results on nars' test. I just ran it again -- twice -- and both runs were (again) perfect, with no errors: once waiting 30 seconds after the cp before killing power, then the same thing with only 512MB of RAM and the same 30-second delay. So at least on my bare-metal system, RC16 works just fine with that set of test files.
nars Posted June 25, 2013 (Author) Interesting that we get a variety of results on nars test. Note that the test where WeeboTech actually got corruption did not use my test files; he is trying a different approach with parallel writes. He was also unable to reproduce any problem with my test files.