
Parity Check finished with Errors, not sure how to proceed


jmos1277

Recommended Posts

I've been having some issues with my server lately and can't seem to get them straightened out.  Several months ago I had a disk that was disabled and emulated.  I rebuilt my data onto that disk and everything seemed to be okay.  About a week ago, that same disk was once again disabled.  For a second time, I rebuilt the data onto the disk and everything seemed okay.  That thread is located here: 

 

 

Now the SMART info for that disk is reporting 1806 reallocated sector count.  Current pending sector count is 0.  

 

My plan is to replace the disk.  Before replacing the disk, I decided to manually run a parity check ("Write corrections to parity" was checked).  My parity check reported that it found 256 errors.  About a week ago, my parity check reported the same number of errors (diagnostics from last week's run are in the thread linked above).  Diagnostics for today's run are attached to this thread.

 

Last week's syslog showed quite a few "disk8 write error" messages.  It also showed "recovery thread: Q corrected" messages.

Today's syslog only seems to show "recovery thread: Q corrected" messages.

I'm not sure if this is meaningful in any way.

 

My UPS reported a power failure in the middle of my parity check.  My server didn't shut down, but I'm not sure if this would have an impact on the results of my parity check.

 

I'm not really sure how to proceed.

Besides the high reallocated sector count on my disk8, it seems to be behaving normally.

 

I have so many questions.  I don't even know if they all make sense.

1) Since I ran the parity check with "Write corrections to parity" checked, does that mean that my parity data might now be incorrect?  If the file on my "bad" disk was corrupt, did running a parity check in this way just corrupt my parity?

2) Or rather, is it possible that the failed parity check from June 5th updated my parity with incorrect parity information because it failed to read my "bad" disk, and now my latest parity check (with no disk errors) just corrected it?  Both of these parity checks reported the same number of errors.

3) Is there a way to check the file system on my "bad" disk to see if there are any corrupt files?

4) Is there a way to see what file might be corrupt based on the sector number?  It would be great to know which file might be corrupt.

5) In my parity check from June 5th, I saw a lot of "recovery thread: Q corrected" messages.  I assume that means my parity data was modified.  If my parity was incorrect, shouldn't it have updated both my P and my Q parity?  Why no "recovery thread: P corrected" messages?  I understand that P might not need correction in all the same places that Q needs correction, but wouldn't I expect at least some P corrected messages?
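(Background for question 5: unRAID's dual parity is commonly described as following the standard RAID-6 scheme, where P is a plain XOR across the data disks and Q is a Reed-Solomon syndrome over GF(2^8).  The toy sketch below illustrates that scheme; it is NOT unRAID's actual code.)

```python
# Toy P/Q dual-parity math in the standard RAID-6 style (P = plain XOR,
# Q = Reed-Solomon syndrome over GF(2^8)) -- a sketch, NOT unRAID's code.

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) with the RAID-6 polynomial 0x11d."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11d
    return p

def gf_pow2(i):
    """Return 2**i in GF(2^8) (the RAID-6 generator raised to i)."""
    r = 1
    for _ in range(i):
        r = gf_mul(r, 2)
    return r

def parity(data):
    """Compute (P, Q) for one byte position across the data disks."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d                        # P: simple XOR of all data bytes
        q ^= gf_mul(gf_pow2(i), d)    # Q: each byte weighted by 2^i
    return p, q

disks = [0x12, 0x34, 0x56]            # one byte from each of 3 data disks
p0, q0 = parity(disks)
disks[1] ^= 0xFF                      # flip every bit of disk 2's byte
p1, q1 = parity(disks)
print(p0 != p1, q0 != q1)             # True True: a data change flips BOTH
```

Since both P and Q are linear functions of the same data, any sector whose contents changed should show up in both, which is what makes Q-only corrections puzzling.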

 

 

 

media-diagnostics-20170612-1921.zip

Screen Shot 2017-06-12 at 9.48.55 PM.png

Link to comment
10 minutes ago, jmos1277 said:

One more question.  Why were there "disk8 write error" messages during a parity check?  

Shouldn't a parity check just need to read, and not write? 

Any time a read error occurs, whatever operation is taking place, the data that should have been read is calculated from parity, and a write is attempted to put the data back to the disk that gave the read error. If the write succeeds, the drive error counter is incremented, and operations proceed as normal. If the write fails, the drive is red balled, and no further attempts are made to access the drive. All subsequent operations to that drive slot are emulated by parity.
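That behavior could be sketched roughly like this (a hypothetical simplification, not unRAID's actual md driver; all names here are invented):

```python
# Hypothetical sketch of the read-error handling described above --
# NOT unRAID's actual md driver; all names here are invented.

class Disk:
    def __init__(self, fail_writes=False):
        self.fail_writes = fail_writes   # simulate a dying drive
        self.red_balled = False
        self.error_count = 0
        self.sectors = {}

    def write(self, sector, data):
        if self.fail_writes:
            raise IOError("write failed")
        self.sectors[sector] = data

def handle_read_error(disk, sector, reconstructed):
    """On a read error: rebuild the data from parity, try to write it back."""
    try:
        disk.write(sector, reconstructed)  # attempt to repair the bad sector
    except IOError:
        disk.red_balled = True             # write failed: drive is disabled;
        return reconstructed               # the slot is now emulated by parity
    disk.error_count += 1                  # write OK: just bump the error count
    return reconstructed

good, dying = Disk(), Disk(fail_writes=True)
handle_read_error(good, 100, b"rebuilt-from-parity")
handle_read_error(dying, 100, b"rebuilt-from-parity")
print(good.error_count, good.red_balled)    # 1 False
print(dying.error_count, dying.red_balled)  # 0 True
```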

Link to comment
54 minutes ago, jonathanm said:

Any time a read error occurs, whatever operation is taking place, the data that should have been read is calculated from parity, and a write is attempted to put the data back to the disk that gave the read error. If the write succeeds, the drive error counter is incremented, and operations proceed as normal. If the write fails, the drive is red balled, and no further attempts are made to access the drive. All subsequent operations to that drive slot are emulated by parity.

 

jonathanm, thanks for the info.  

How does the "Write corrections to parity" option affect the behavior your described in your post above?  If the box is checked, does unRAID write to the parity disk instead of the data disk?

Link to comment

Unless you're expecting sync errors, e.g. after an unclean shutdown, a parity check should always be non-correcting.  It's a known issue that if there are disk errors during a correcting check, parity can be wrongly updated, making it corrupt.  I'm pretty sure this is what happened on the June 5th check: there were errors on disk8, and immediately after those, parity was updated (no clue why only Q parity was corrupted).  So in this last check, the same sectors that were previously corrupted were corrected.

 

Parity should be correct now.  Remember to change your scheduled parity checks to non-correcting.

Link to comment
9 hours ago, jmos1277 said:

 

jonathanm, thanks for the info.  

How does the "Write corrections to parity" option affect the behavior your described in your post above?  If the box is checked, does unRAID write to the parity disk instead of the data disk?

My gut feeling says it doesn't affect it, because the parity check reads all the disks, so any read error on any of the disks would be handled the same way no matter what generated the read.  Writing corrections vs. not writing them doesn't come into play if the data you are working with is reconstructed from parity to begin with.

 

Tom @limetech would be the one to ask for a definitive answer.  Thinking it through, I have a question about what happens in the code: during the read cycle that generates the corrected data to be written, what happens if another disk read error is detected?  Theoretically, with dual parity you would still be OK, but I have no clue what happens with single parity.  And if more than two disks generate read errors, like when a controller drops offline, there is no valid data to attempt a write, so does a multiple-disk read error red-ball all affected drives without any writes being attempted?  My gut says the array should be immediately stopped to avoid further damage, but I didn't write the code.

 

Inquiring minds want to know. (Me at least :))

Link to comment
44 minutes ago, jonathanm said:

My gut feeling says it doesn't affect it

 

I also believe it has no effect on this case.

 

45 minutes ago, jonathanm said:

what happens if another disk read error is detected?

 

IIRC, if there are read errors on more devices than the available redundancy, unRAID will continue the check without attempting to write those sectors, and will log "multiple read errors" or something similar.

Link to comment

Dug up an old log with multiple errors, it's logged like so:
 

Feb 28 04:10:26 unRaid kernel: md: disk3 read error, sector=6329638664
Feb 28 04:10:26 unRaid kernel: md: disk4 read error, sector=6329638664
Feb 28 04:10:26 unRaid kernel: md: recovery thread: multiple disk errors, sector=6329638664

Note however that in these cases, e.g. when a controller crashes and unRAID loses connection with all disks, it will still disable the first disk it tries to write to (or the first two if the user has dual parity):
 

Feb 28 04:10:26 unRaid kernel: md: disk5 read error, sector=4659722456
Feb 28 04:10:26 unRaid kernel: md: disk4 write error, sector=6329636912
Feb 28 04:10:26 unRaid kernel: md: md_do_sync: got signal, exit...
Feb 28 04:10:26 unRaid kernel: md: disk4 write error, sector=6329636920

These come after the read errors; in this case it disabled disk4.

Link to comment
1 minute ago, johnnie.black said:

Note however that in these cases, e.g. when a controller crashes and unRAID loses connection with all disks, it will still disable the first disk it tries to write to (or the first two if the user has dual parity):

What is it trying to write? With multiple read errors, it has no way of knowing what data should be there.

Link to comment

So I ran another parity check to see how things would turn out and to see if idea #2 from my original post was a possibility.  The parity check passed :-)

 

I plan to replace the 4TB disk8, but didn't want to purchase another 4TB disk.  Instead, I bought 2 10TB disks (WD Reds).  They've both been pre-cleared.  I'm in the process of rebuilding parity onto those new 10TB disks.  After my parity drives rebuild successfully, I plan to pre-clear and repurpose one of my old 4TB parity disks to replace the dying disk8.

 

I know, it was a risky move to rebuild my parity drives with a disk that seems to be failing.  It's even more risky because I decided to rebuild both parity drives at the same time.  If the failing disk causes a problem while rebuilding my parity, I can always put my old 4TB parity disks back into the array and recover from there.  The old 4TB parity drives have not been cleared and should still contain valid parity data (I think).  I have not written anything to my array since installing the new 10TB parity disks.  However, the Main tab of my unRAID interface does show that each of my drives has between 10 and 20 writes in the Writes column.... Hmmm, I wonder why that is?  Maybe my 4TB parity is not valid anymore O.o

 

 

Link to comment

Thought I would chime in here with an observation.  The kinds of thrashings that occur as a disk fails can foul parity, often with the user's help by running correcting checks.  I know many live under the belief that parity and dual parity protect from single and dual drive failures, but this assumes that drives die in an orderly way.  In their thrashing about, they can do things like cause controllers to drop offline, and parity can get fouled, at least slightly.  And when parity gets fouled, dual parity gets fouled too.

 

Rebuilding can still be accomplished even when there has been some parity corruption, but you have no idea what file or files were impacted.  Maybe it happened in unused space.  Maybe it happened in a movie, and when it plays there is an almost imperceptible flicker.  Maybe it happened in a zip file, which is now corrupt.  Or a photograph which no longer renders.  You don't know and have no way to know.  Dual parity does not help!  What fouls one parity typically fouls both.  I'll mention that if hard disks failed like light bulbs, this would not be an issue.  One goes blink, parity rebuilds it.  Two go blink, dual parity rebuilds them.  But drives don't fail like light bulbs.  And in their failure, they can affect parity's effectiveness.

 

Creating md5s or similar checksums is needed.  It won't fix the problems, but it will at least give you the tools to figure out what files, if any, were corrupted.  You can restore those files from backup (if you have one), re-rip, re-download, ... or bemoan the loss of a file that was important to you ... whatever, because you know the distinct file or files that were affected.  Imagine knowing that somewhere on your 8T disk that rebuilt there is likely some corruption, but having no way, short of watching or listening to thousands of hours of media for perceptible issues, to know the extent.  Blissful ignorance?  I think not.
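In case it helps anyone reading later, generating and checking such a manifest can be as simple as this (the paths are just examples; adjust to your mounts):

```shell
# Build an MD5 manifest for every file on a data disk.
# /mnt/disk8 and /boot/disk8.md5 are example paths -- adjust to taste.
find /mnt/disk8 -type f -exec md5sum {} + > /boot/disk8.md5

# Later (e.g. after a rebuild), re-check everything against the manifest.
# Only files that FAIL are printed; exit status is non-zero on any mismatch.
md5sum -c --quiet /boot/disk8.md5
```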

 

I once did an experiment and created par2 blocks for entire 2T disks. The block size needed to be large, but I didn't much care as I was dealing with large media files. It took forever. But it did work! Took a while to check par integrity, but I was able to recover from corrupted, even deleted files, so long as I had sufficient blocks. I wish there were a way to do something similar in real time to provide a file oriented redundancy to supplement the disk level protection of parity. But I know of none. Par blocks might work on a full volume you never intend to update.
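The par2 workflow described above looks roughly like this (example paths; the -r flag sets the redundancy percentage, and exact behaviour may vary between par2cmdline versions):

```shell
# Create recovery data covering ~10% of the files (example path and names).
par2 create -r10 /mnt/disk2/movies.par2 /mnt/disk2/Movies/*.mkv

# Check the files against the recovery set.
par2 verify /mnt/disk2/movies.par2

# If files are corrupt (or even deleted) and enough blocks survive, repair.
par2 repair /mnt/disk2/movies.par2
```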

 

Not trying to be negative. Parity is a huge amount of protection. And dual parity provides extra protection albeit in a narrow band of use cases, but you still need to understand the limitations to design your strategy in case a real world thrashing disk failure occurs and parity is not perfect at rebuild time.

Link to comment

bjp999, thanks for your thoughts on my situation.  As I understand it, regardless of how I decided to approach my situation, there was really no way of knowing if any of my data was corrupt (short of creating file hashes or par blocks).  Given that my 6/5/2017 parity check showed 256 errors and my 6/12/2017 parity check also showed 256 errors, I'd like to stick my head in the sand and assume #2 from my original post above (i.e. all data is okay).

 

Would it have been better to rebuild one parity disk at a time?  Certainly it would seem to be less risky, just in case my disk8 flakes out during the rebuild process.  

My parity rebuild onto the new 10TB disks has about one more hour before it hits the 40% mark.  At that point, I no longer need my disk8 (or any of my disks) to finish the rebuild process (currently my largest data disk is 4TB).

Link to comment

johnnie.black, I will definitely do non-correcting parity checks in the future.  I feel like I've always done correcting parity checks in the past because the default setting under "Array Operation" of my unRAID interface always had the box checked.  At least I think it did.  Maybe I checked the box a long time ago and have just forgotten ¬¬  I have been running my unRAID box for almost seven years now!

Link to comment

@jmos1277 - If you have to run a parity check on an array that has disk issues, non-correcting is the way to go. But you never know - a disk problem can happen while the check is running. You could run a read-only check after a hard reboot, have it find some sync errors, run a correcting check, and ONLY THEN a disk could start to fail and corrupt parities. Avoiding the risk completely is not possible even for @johnnie.black ;-)

Link to comment

Archived

This topic is now archived and is closed to further replies.
