Pinpoint reason for sync errors

October 17, 2012

Hello everybody

I have been away from my unRAID server for almost a year because of some personal problems. I recently managed to restart it, and the first thing I did was run a parity check (without correction). Unfortunately, so far (at 75%) I have about 400 sync errors. Since I have never had so many, I need to find out what is causing them. I guess I have to follow the instructions here:

http://lime-technology.com/wiki/index.php/FAQ#Why_am_I_getting_repeated_parity_errors.3F

(memtest, smart reports for all disks, reiserfsck for all disks)

But I was wondering: when there is a sync error, there is a problem either with a data drive or with the parity drive. The only way to be absolutely sure is to check whether the respective stored files are OK. Is there a way (a tool maybe?) to pinpoint the files on each disk that correspond to each parity error? For example, if there is a parity error on bit 345666567:

1. Find the files on each of my 12 disks that contain this bit.

2. Check whether those files work.

3a. If they work, the problem is on the parity drive and I just need to run a new parity check with sync correction.

3b. If there is a problem with a specific drive, files on this drive will be problematic.

I ask because I cannot think of a way to be sure that everything is OK when fixing parity errors. Correcting parity could simply perpetuate errors that are really on the data disks.
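To illustrate why a sync error on its own cannot name the faulty drive, here is a minimal Python sketch of single-parity XOR. It is a toy model, not unRAID's actual implementation: the disk contents and the helper names `compute_parity` and `parity_mismatches` are made up for illustration, and the only assumption is that parity is the byte-wise XOR of all data disks.

```python
from functools import reduce

def compute_parity(disks):
    """Byte-wise XOR across equally sized data disks (toy model)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*disks))

def parity_mismatches(disks, parity):
    """Return the byte offsets where stored parity disagrees with the data."""
    expected = compute_parity(disks)
    return [i for i, (p, e) in enumerate(zip(parity, expected)) if p != e]

# Three hypothetical 4-byte "disks" and their correct parity.
disks = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]), bytes([9, 10, 11, 12])]
parity = compute_parity(disks)

# Case A: one bit flipped on a data disk at offset 2.
damaged_data = [disks[0], bytes([5, 6, 7 ^ 0x10, 8]), disks[2]]

# Case B: the same bit flipped on the parity disk instead.
damaged_parity = bytes([parity[0], parity[1], parity[2] ^ 0x10, parity[3]])

case_a = parity_mismatches(damaged_data, parity)
case_b = parity_mismatches(disks, damaged_parity)
print(case_a, case_b)
```

Both cases report a mismatch at the same offset, so the check by itself says only that the stripe is inconsistent, not which disk holds the bad byte.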

Thanks in advance for any answers


October 17, 2012

Hello everybody

But I was wondering: when there is a sync error, there is either a problem with a data drive or with the parity drive. The only way to be absolutely sure is to check if the respective stored files are ok. Is there a way (a tool maybe?) to pinpoint the files on each disk that correspond to each parity error?

Unfortunately there is no such tool.

One test you should do is to run a second (or third) parity check (still of the non-correcting kind) and look at the system log file to see if it is listing the same block numbers as having problems. If the errors move around then the issue might be due to bad memory or faulty power.
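The comparison described above can be sketched in Python. This assumes the syslog contains lines of the form `md: parity incorrect: <number>`, as in the excerpt posted later in this thread; the function names and the two sample log snippets are hypothetical.

```python
import re

# Matches the kernel message format seen in this thread's syslog excerpt.
PARITY_RE = re.compile(r"md: parity incorrect: (\d+)")

def parity_addresses(log_text):
    """Collect the set of addresses reported as 'parity incorrect'."""
    return {int(m.group(1)) for m in PARITY_RE.finditer(log_text)}

def compare_runs(first_log, second_log):
    """Split the reported addresses into repeated and non-repeated ones."""
    first = parity_addresses(first_log)
    second = parity_addresses(second_log)
    return {
        "both": sorted(first & second),
        "only_first": sorted(first - second),
        "only_second": sorted(second - first),
    }

# Made-up excerpts from two non-correcting checks.
first_run = """\
Oct 18 09:10:00 Tower kernel: md: parity incorrect: 1000
Oct 18 09:12:00 Tower kernel: md: parity incorrect: 1008
"""
second_run = """\
Oct 18 21:10:00 Tower kernel: md: parity incorrect: 1000
Oct 18 21:13:00 Tower kernel: md: parity incorrect: 1016
"""

result = compare_runs(first_run, second_run)
print(result)
```

Addresses that show up in only one run are the ones that "move around" and point toward memory or power rather than a fixed bad spot on a disk.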

Another test you should run is the memory test.

You should also get smart reports from all your drives.

Regards,

Stephen


October 18, 2012

Author

Unfortunately there is no such tool.

Pity. I guess it would also be useful if, once an error is found, that location were checked a second time. That would confirm or rule out bad memory or faulty power as the cause, wouldn't it?

One test you should do is to run a second (or third) parity check (still of the non-correcting kind) and look at the system log file to see if it is listing the same block numbers as having problems. If the errors move around then the issue might be due to bad memory or faulty power.

Another test you should run is the memory test.

I just finished a second parity check. It appears to have found the same number of errors. That probably indicates the errors are not an artifact of some faulty component. I will still run memtest, but I ran it recently and everything was OK.

You should also get smart reports from all your drives.

The SMART reports show no problems. Assuming that memtest shows no problems either, should I proceed to fix the parity errors?


October 18, 2012

I just finished a second parity check. It appears to have found the same number of errors. That probably indicates the errors are not an artifact of some faulty component. I will still run memtest, but I ran it recently and everything was OK.

If the first run was a non-correcting parity check and the second was a correcting one, the second would find exactly the same sectors and correct them, so it would report exactly the same errors. A third check (correcting or not) should find no errors.

If the first run was a non-correcting parity check and the second was also non-correcting, the second would find exactly the same sectors and report exactly the same errors. A third, correcting check should find the same errors again and fix them. A fourth check should then find no errors.


October 18, 2012

Author

If the first run was a non-correcting parity check and the second was a correcting one, the second would find exactly the same sectors and correct them, so it would report exactly the same errors. A third check (correcting or not) should find no errors.

If the first run was a non-correcting parity check and the second was also non-correcting, the second would find exactly the same sectors and report exactly the same errors. A third, correcting check should find the same errors again and fix them. A fourth check should then find no errors.

I performed two non-correcting parity checks. Both found 373 errors. I found the following messages in the syslog. They appear in both parity checks and are identical:

Oct 18 21:19:00 Tower kernel: md: parity incorrect: 1856549616
Oct 18 21:59:18 Tower kernel: md: parity incorrect: 2013348976
Oct 18 21:59:18 Tower kernel: md: parity incorrect: 2013348984
Oct 18 21:59:18 Tower kernel: md: parity incorrect: 2013348992
Oct 18 21:59:18 Tower kernel: md: parity incorrect: 2013349000
Oct 18 21:59:18 Tower kernel: md: parity incorrect: 2013349008
Oct 18 21:59:18 Tower kernel: md: parity incorrect: 2013349016
Oct 18 21:59:18 Tower kernel: md: parity incorrect: 2013349024
Oct 18 21:59:18 Tower kernel: md: parity incorrect: 2013349032
Oct 18 21:59:18 Tower kernel: md: parity incorrect: 2013349040
Oct 18 21:59:18 Tower kernel: md: parity incorrect: 2013349048
Oct 18 21:59:18 Tower kernel: md: parity incorrect: 2013349056
Oct 18 21:59:18 Tower kernel: md: parity incorrect: 2013349064
Oct 18 21:59:18 Tower kernel: md: parity incorrect: 2013349072
Oct 18 21:59:18 Tower kernel: md: parity incorrect: 2013349080
Oct 18 21:59:18 Tower kernel: md: parity incorrect: 2013349088
Oct 18 21:59:18 Tower kernel: md: parity incorrect: 2013349096
Oct 18 21:59:18 Tower kernel: md: parity incorrect: 2013349104
Oct 18 21:59:18 Tower kernel: md: parity incorrect: 2013349112
Oct 18 21:59:18 Tower kernel: md: parity incorrect: 2013349120

I don't know what it means that those numbers are clustered (if they represent bits, maybe I bumped my server at the moment this part was being written, or there was a small power fluctuation). But shouldn't I find 373 such messages? (Again, assuming the numbers represent bits.)
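One way to look at the clustering is to group the reported addresses into contiguous runs. The sketch below uses the 20 addresses from the syslog excerpt above and the stride of 8 that is visible between them. What the numbers actually count is not confirmed in this thread; if they were 512-byte sector offsets, a stride of 8 would correspond to 4 KiB steps, but that is an assumption. The function name `group_runs` is hypothetical.

```python
def group_runs(addresses, stride=8):
    """Group addresses into runs where neighbours differ by exactly `stride`.

    Returns a list of (first_address, last_address, count) tuples.
    """
    runs = []
    for addr in sorted(addresses):
        if runs and addr - runs[-1][1] == stride:
            runs[-1][1] = addr          # extend the current run
        else:
            runs.append([addr, addr])   # start a new run
    return [(start, end, (end - start) // stride + 1) for start, end in runs]

# The 20 addresses from the syslog excerpt: one isolated value plus a
# contiguous stretch from 2013348976 to 2013349120 in steps of 8.
addresses = [1856549616] + list(range(2013348976, 2013349120 + 1, 8))

runs = group_runs(addresses)
for start, end, count in runs:
    print(f"{start} .. {end}: {count} message(s)")
```

On this excerpt the result is one isolated address and one run of 19 consecutive addresses, which suggests a single damaged stretch rather than errors scattered across the array.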


October 18, 2012

Hello everybody

But I was wondering: when there is a sync error, there is either a problem with a data drive or with the parity drive. The only way to be absolutely sure is to check if the respective stored files are ok. Is there a way (a tool maybe?) to pinpoint the files on each disk that correspond to each parity error?

Unfortunately there is no such tool.

"Unfortunately" is an understatement. In my opinion, this is a flagrant omission in a product such as unRAID.

It might be that Reiser FS lacks such a facility; if that is the case, then that, alone, would make it a bad choice to be limited to. (Or, if there [really] were sufficient advantages to Reiser FS, otherwise, then unRAID should have implemented the missing feature itself.)

Even the very first release of Unix (1973) had an (admittedly obtuse) mechanism for this (using ncheck & icheck). I know that BSD/SunOS provided a mechanism. I'm not familiar enough with ext[234], but I'd be very surprised if they lacked a (straightforward) mechanism. (There is a bass-ackwards hack that starts with find, but ... )

OP: even if you could determine those (up to 12) files [on some drives, the sector in question might be unused, or part of FS metadata], you could still be stuck figuring out which one was actually damaged.

This is a situation where hashing/checksumming comes into play. Some people use ZFS; others SnapRAID; others roll their own.

Also, to OP: just because a possibly suspect file "works" does not exonerate it from being damaged (hence, mistakenly pointing the blame on the parity drive). A single bit-flip (in a single file) will correctly provoke a parity check error report, but that single bit may be totally extraneous to the correct functioning of the file containing it.

'Tis a tangled web we (must un-)weave ...


October 19, 2012

You can run hashdeep but it takes a long time so it's hard to keep it current if the data changes. It works for archive disks.
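For anyone who prefers to roll their own, the same idea can be sketched with the Python standard library. This is a stand-in for the approach, not hashdeep itself: the function names (`file_sha256`, `build_manifest`, `changed_files`) and the temporary demo file are hypothetical, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib
import os
import tempfile

def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_sha256(full)
    return manifest

def changed_files(old, new):
    """Files present in both manifests whose digests differ."""
    return sorted(p for p in old if p in new and old[p] != new[p])

# Demo on a throwaway directory: hash, damage one byte, hash again.
with tempfile.TemporaryDirectory() as root:
    path = os.path.join(root, "movie.bin")
    with open(path, "wb") as fh:
        fh.write(b"\x00" * 1024)
    before = build_manifest(root)

    with open(path, "r+b") as fh:   # overwrite a single byte in place
        fh.seek(100)
        fh.write(b"\x01")
    after = build_manifest(root)

    damaged = changed_files(before, after)

print(damaged)
```

In practice the first manifest would be saved while the data is known to be good and compared against a fresh one after a parity problem. As noted above, this only stays meaningful on disks whose files rarely change.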


January 7, 2013

So, what does one do in this situation? I have the same problem - with only 5 parity sync errors. I ran two non-correcting checks and I know exactly the five bits in question. If we can't rely on the parity check to tell us whether the data drive or the parity drive is in error, then what do we do?

If we are just looking at drive reports and finding no errors, and then assuming the parity drive is incorrect, then just what is the point of having the parity drive? Only to be able to rebuild a disk when it is replaced?

I am assuming at this point that what I am supposed to do is run a correcting parity check and assume my data is correct and intact. I was doing this in order to replace the parity drive though, so I will not do this and just replace the parity drive and build parity fresh.

However, I am uncomfortable with this outcome. Is it realistic to assume that the parity drive can just suddenly lose a bit but the data drives can't? I'm not sure what analogy to use here . . . hope you understand my point.


January 7, 2013

Check all SMART reports for non-zero current_pending_sector RAW_VALUE.
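A minimal Python sketch of that check, assuming the usual attribute table printed by `smartctl -A` (ten columns ending in RAW_VALUE). The sample text below is made up, and the function name `raw_value` is hypothetical. On a real system the text would come from running `smartctl -A` against each drive.

```python
def raw_value(smart_output, attribute="Current_Pending_Sector"):
    """Return the RAW_VALUE of a SMART attribute, or None if it is absent.

    Assumes the column layout:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    for line in smart_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] == attribute:
            return int(fields[9])
    return None

# Made-up excerpt in the smartctl attribute-table layout.
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
"""

pending = raw_value(sample)
if pending:
    print(f"Current_Pending_Sector RAW_VALUE is {pending}: investigate this drive")
else:
    print("No pending sectors reported")
```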


January 7, 2013

That makes sense. Based on my reading yesterday of a number of similar threads, I did this, and all of my drives have zeros for this value, so I am fairly comfortable just correcting (actually, I put in a new parity drive and created parity). I am just uncomfortable with the thought that among my 11 healthy drives, somehow a few bits got scrambled.

Thanks for the advice. I think my system is okay. I am just wondering how to approach this maintenance issue in the long run.
