February 11, 2013
"I have 13 data hdds. How would I find the cause of the 5 errors?"

Use checksums based on the original content. Other than that, there is no way I am aware of. The cause could be any disk, disk controller, or memory. It could be intermittent or not, and the affected bits could represent data files, file-system structure, or unused space on the disk. Other than the logical block (recorded in the system log), you have nothing else to use as a clue. I know of no utility to translate a block on the disk to whatever might be using it.

Joe L.
February 11, 2013 Author
Unfortunately the original content is huge! Would you think 5 errors is negligible? I am running the NOCORRECT option right now and the parity check is unfinished. Should I cancel it and run a CORRECT check instead?
February 11, 2013
"Unfortunately the original content is huge! Would you think 5 errors is negligible? I am running the NOCORRECT option right now and the parity check is unfinished. Should I cancel it and run a CORRECT check instead?"

No, let it run to completion, then capture the errors in the syslog. Then run a second NOCORRECT check and see if the errors are in the same blocks.

If yes, and if they are in low locations on the disk (early in the check process), then the best bet is probably to correct them, especially if you had a hard shutdown and the parity disk might not have been updated.

If they are in different locations, you have a random, intermittent error from somewhere (disk, memory, disk controller, noisy power supply, etc.). Those types of problems cause hair loss. Step 1 in resolving them is an overnight memory test (since memory is the easiest candidate to eliminate). Then this technique outlined in the wiki comes next: http://lime-technology.com/wiki/index.php/FAQ#How_To_Troubleshoot_Recurring_Parity_Errors

Joe L.
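A minimal sketch of the comparison suggested above, assuming the syslog from each NOCORRECT run has been saved to its own file. The function names and the file names in the usage note are hypothetical; only the `md: parity incorrect: N` message format comes from this thread.

```shell
#!/bin/sh
# Print the sector numbers found in "md: parity incorrect: N" lines of a
# saved syslog file, sorted numerically with duplicates removed.
parity_sectors() {
    sed -n 's/.*md: parity incorrect: \([0-9][0-9]*\).*/\1/p' "$1" | sort -n -u
}

# Compare the sectors logged by two parity-check runs.
# Returns 0 if both runs logged the same sectors, 1 otherwise.
compare_parity_runs() {
    run_a=$(parity_sectors "$1")
    run_b=$(parity_sectors "$2")
    if [ "$run_a" = "$run_b" ]; then
        echo "same sectors in both runs"
    else
        echo "sector lists differ"
        return 1
    fi
}
```

With the two captures saved as, say, `run1.txt` and `run2.txt`, calling `compare_parity_runs run1.txt run2.txt` reports whether the errors repeat. Only the logged sectors are compared, so this says nothing about errors that never reached the syslog.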
February 12, 2013
I had ~5 sync errors with 1 of my drives when I first started out with unRAID. Turns out it was just a SATA cable at fault. I swapped it with a new one and the errors were gone.
February 12, 2013 Author
I am running a nocorrect check again and so far the same 5 errors at the same blocks have come up. Looks like I will be running a correct check.
February 12, 2013
"I am running a nocorrect check again and so far the same 5 errors at the same blocks have come up. Looks like I will be running a correct check."

That is far easier to deal with than random errors.
February 12, 2013 Author
How big are the errors (bits, bytes, MB, etc.)?

Feb 12 18:12:21 Unraid kernel: md: parity incorrect: 1565565768
Feb 12 18:12:21 Unraid kernel: md: parity incorrect: 1565565776
Feb 12 18:12:21 Unraid kernel: md: parity incorrect: 1565565784
Feb 12 18:12:21 Unraid kernel: md: parity incorrect: 1565565792
Feb 12 18:12:21 Unraid kernel: md: parity incorrect: 1565565800
February 12, 2013
"How big are the errors (bits, bytes, MB, etc.)?

Feb 12 18:12:21 Unraid kernel: md: parity incorrect: 1565565768
Feb 12 18:12:21 Unraid kernel: md: parity incorrect: 1565565776
Feb 12 18:12:21 Unraid kernel: md: parity incorrect: 1565565784
Feb 12 18:12:21 Unraid kernel: md: parity incorrect: 1565565792
Feb 12 18:12:21 Unraid kernel: md: parity incorrect: 1565565800"

I think those are the logical sector numbers. So basically, the problem is somewhere in those blocks of bits across all your drives. They all seem to be grouped together. That might be just a single "write" that did not get made to the parity disk in the event of a non-clean stop of the server.

It will only print the first 20 sectors, so don't get fooled:

    if (sb->sync_errs <= 20)
        printk("md: parity incorrect: %llu\n", sector);

Joe L.
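To put numbers on "grouped together", here is a small arithmetic sketch. It assumes the logged values count 512-byte units (an assumption at this point in the thread; `SECTOR_BYTES` and both function names are made up for illustration).

```shell
#!/bin/sh
# Assumption: each logged sector number counts 512-byte units.
SECTOR_BYTES=512

# Convert a logged sector number to a byte offset.
sector_to_offset() {
    echo $(( $1 * SECTOR_BYTES ))
}

# Print the distance in bytes between each consecutive pair of
# sector numbers given as arguments.
sector_gaps() {
    prev=""
    for s in "$@"; do
        if [ -n "$prev" ]; then
            echo $(( ($s - $prev) * SECTOR_BYTES ))
        fi
        prev=$s
    done
}
```

Under that assumption, `sector_to_offset 1565565768` gives 801569673216 (roughly 800 GB in), and `sector_gaps` on the five logged numbers prints 4096 four times: the reports are evenly spaced 8 sectors apart, which fits the idea of one small contiguous region rather than five unrelated faults.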
February 13, 2013
"How big are the errors (bits, bytes, MB, etc.)?

Feb 12 18:12:21 Unraid kernel: md: parity incorrect: 1565565768
Feb 12 18:12:21 Unraid kernel: md: parity incorrect: 1565565776
Feb 12 18:12:21 Unraid kernel: md: parity incorrect: 1565565784
Feb 12 18:12:21 Unraid kernel: md: parity incorrect: 1565565792
Feb 12 18:12:21 Unraid kernel: md: parity incorrect: 1565565800"

The block size that unRAID is talking about in these parity messages is 512 bytes. unRAID 4.7 also talks about 1K byte block sizes elsewhere, see: http://lime-technology.com/forum/index.php?topic=12371.msg117750#msg117750 (this might have been changed with the 5.0 series).

So in your case the amount of corruption could be between 1 bit and 512 bytes for each of the 5 errors. Most likely this is only happening on a single disk (there's nothing to rule out the same block number having issues on several disks at once, but that would be highly improbable!). There's also nothing to say whether the problem is with the contents of a data drive or the parity drive. All we can say is that the current contents of the parity drive for these 5 blocks do not match the results of the parity calculation from the data drives' contents.

Now if you had checksums (MD5, SHA or similar) of all your files, and if you had a utility that could tell you which files occupy those blocks on each data disk, then you could recompute the checksums on the individual files and:

1. If you found all the checksums were still good, you could conclude that the parity disk was incorrect and just rebuild it.
2. If you found that the checksums of the files on ONE of the disks were wrong, then you could rebuild that disk and recheck the file checksums. In this case there is really no guarantee that the rebuild will fix the issue. The idea is that if there was only one error event affecting each block, then parity is likely to be correct and a rebuild will fix it.

Regards,
Stephen
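A rough sketch of the checksum bookkeeping described above, assuming GNU coreutils `md5sum` (its `--quiet` option suppresses the per-file OK lines). The function names are hypothetical, and the manifest path must be absolute and live outside the directory being checksummed. This does not solve the block-to-file mapping problem; it simply rechecks every file.

```shell
#!/bin/sh
# make_manifest DIR MANIFEST
# Record an MD5 checksum for every regular file under DIR.
# MANIFEST must be an absolute path outside DIR.
make_manifest() {
    ( cd "$1" && find . -type f -exec md5sum {} + ) > "$2"
}

# verify_manifest DIR MANIFEST
# Recompute the checksums and print only the files that no longer match.
# Returns non-zero if any file fails verification.
verify_manifest() {
    ( cd "$1" && md5sum --quiet -c "$2" )
}
```

For example, `make_manifest /mnt/disk1 /boot/disk1.md5` while the data is known to be good, then `verify_manifest /mnt/disk1 /boot/disk1.md5` after a parity error turns up (both paths are illustrative). A clean result on every data disk would point at the parity disk, as in case 1.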
February 13, 2013
If the disks themselves aren't showing any errors (i.e. the "Errors" column is all zeroes), then it's almost certain that the sync errors are simply due to failed parity disk writes. Just run a correcting parity check; then run it again and you should be error-free.

If you're concerned that some of the files may have been corrupted, just run a compare with your backups, but I've never found a mismatch when doing this (the first couple of times I had sync errors, I spent a week or so running FolderMatch comparisons of my backup disks, but never found any mismatches).
February 13, 2013 Author
The files I have are Blu-ray ISOs. I have spent some time converting some of them to MKV, which is not an easy task. Maybe one of the films was not converted properly, but there are no errors in the HDD error columns. I'm not really going to worry about 5 x 512 bytes, as that is tiny compared to the sizes of the Blu-ray ISOs!
February 14, 2013
"The files I have are Blu-ray ISOs. I have spent some time converting some of them to MKV, which is not an easy task. Maybe one of the films was not converted properly, but there are no errors in the HDD error columns. I'm not really going to worry about 5 x 512 bytes, as that is tiny compared to the sizes of the Blu-ray ISOs!"

And if small errors of that size do affect a video file, they usually just cause a slight glitch in playback, say a dropped frame or a brief bit of pixelization on part of the display. So you'll probably not be affected much by this and may never notice the issue. But if a small glitch hit a tax return file you might become due a refund :-)

Regards,
Stephen
March 22, 2013 Author
In Feb I ran a correcting parity check, but I can't remember if I re-checked the parity afterwards. Today I am running a nocorrect parity check and the SAME 5 errors came up. This time I cancelled the check and now I am running a correcting check. Strange!
December 1, 2013 Author
I am on the 5.0.3 software and I am running a CORRECT parity check, and the same errors have come up:

Dec 1 11:44:26 Unraid kernel: md: correcting parity, sector=1565565768
Dec 1 11:44:26 Unraid kernel: md: correcting parity, sector=1565565776
Dec 1 11:44:26 Unraid kernel: md: correcting parity, sector=1565565784
Dec 1 11:44:26 Unraid kernel: md: correcting parity, sector=1565565792
Dec 1 11:44:26 Unraid kernel: md: correcting parity, sector=1565565800

WTF!??!?! But in previous months these have not appeared.
December 2, 2013
Suppose you have one data disk which has 5 particular sectors that return random garbage. First, try to find out if such a disk exists.

If you're lucky (those 5 sectors return random garbage ALL the time), then you can catch it easily. Restart the server WITHOUT emhttp (don't mount any disks) and do something like this:

    passes=2
    for i in /dev/[sh]d? ; do
      (
        # Use the bare device name (e.g. sda) for the output file;
        # /root/${i}.md5 would expand to /root//dev/sda.md5, which does not exist.
        name=${i##*/}
        for j in $(seq 1 $passes) ; do
          md5sum "$i" >>"/root/${name}.md5"
        done
        echo "$i" ; cat "/root/${name}.md5"
      ) &
    done
    wait

Observe whether the md5 sums change between the passes. If they do, then you've caught the offending disk. The next step would be to try to overwrite the 5 sectors on that disk, and see if the reallocated sector count changes before and after the overwriting.

Note that the above script may take a few days to finish, depending on the number and the size of the disks in the server, so you may want to run the script in `screen`, or at the console.

Now, if you're less lucky (those 5 sectors return random garbage only some of the time), then the offending disk will be much harder to catch. You may have to increase the number of passes in the script, and that can take a really, really long time. Or you may not be able to catch such a disk, and the problem may turn out to be something completely different, but for lack of any better ideas this exercise is worth a shot.

Good luck!

Edit: On further thought, since you have the numbers of those 5 sectors, you may write a simpler script which repeatedly reads only those 5 sectors on each disk, instead of reading the whole disks, and that will finish the job much faster. But it still doesn't hurt to do a whole read if you have the time.
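One possible shape for the simpler script mentioned in the edit above, offered as a sketch rather than a tested procedure. It assumes the logged sector numbers are 512-byte offsets into each disk's data partition (e.g. `/dev/sdb1`); if they are relative to something else, the `skip` value would need adjusting. The function names and the `DD_EXTRA` variable are invented for illustration.

```shell
#!/bin/sh
# hash_region DEVICE START_SECTOR SECTOR_COUNT
# Read SECTOR_COUNT 512-byte sectors starting at START_SECTOR and print
# their MD5. DD_EXTRA (hypothetical) can carry extra dd operands.
hash_region() {
    dd if="$1" bs=512 skip="$2" count="$3" ${DD_EXTRA:-} 2>/dev/null \
        | md5sum | cut -d' ' -f1
}

# region_is_stable DEVICE START_SECTOR SECTOR_COUNT PASSES
# Return 0 if every pass reads back identical data, 1 on the first mismatch.
region_is_stable() {
    first=$(hash_region "$1" "$2" "$3")
    pass=1
    while [ "$pass" -lt "$4" ]; do
        [ "$(hash_region "$1" "$2" "$3")" = "$first" ] || return 1
        pass=$(($pass + 1))
    done
    return 0
}

# Example (not run here): 40 sectors cover 1565565768 up to 1565565807.
#   for dev in /dev/[sh]d?1; do
#       region_is_stable "$dev" 1565565768 40 10 || echo "$dev: reads differ"
#   done
```

Repeated reads of the same region can be served from the kernel's page cache instead of the disk, which would hide an intermittent fault. With GNU dd, setting `DD_EXTRA=iflag=direct` is one way to request uncached reads on a block device.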