What happens when a NOCORRECT parity check results in errors?


Solved by trurl


My monthly parity check completed with 5 errors. This is in my syslog:

kernel: md: recovery thread: P incorrect, sector=1962934168
kernel: md: recovery thread: P incorrect, sector=1962934176
kernel: md: recovery thread: P incorrect, sector=1962934184
kernel: md: recovery thread: P incorrect, sector=1962934192
kernel: md: recovery thread: P incorrect, sector=1962934200
[...]
kernel: md: sync done. time=82684sec
kernel: md: recovery thread: exit status: 0


There are no other messages in my syslog related to the parity check.
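For anyone wanting to summarise such entries, here is a minimal Python sketch that pulls the sector numbers out of lines like the ones quoted above and groups neighbouring sectors into runs. The function names are hypothetical, the step of 8 is simply the spacing seen in this log, and the size estimate in the comment assumes 512-byte sectors, which may not match every setup.

```python
import re

# Matches syslog lines of the form quoted in the post above.
P_INCORRECT = re.compile(r"md: recovery thread: P incorrect, sector=(\d+)")


def incorrect_sectors(log_text):
    """Return the sector numbers reported as 'P incorrect', in log order."""
    return [int(m.group(1)) for m in P_INCORRECT.finditer(log_text)]


def group_runs(sectors, step=8):
    """Group sector numbers into (first, last) runs spaced `step` apart.

    The default step of 8 is taken from the spacing in the quoted log.
    """
    runs = []
    for sector in sorted(sectors):
        if runs and sector - runs[-1][1] == step:
            # Extend the current run.
            runs[-1] = (runs[-1][0], sector)
        else:
            # Start a new run.
            runs.append((sector, sector))
    return runs


sample = """\
kernel: md: recovery thread: P incorrect, sector=1962934168
kernel: md: recovery thread: P incorrect, sector=1962934176
kernel: md: recovery thread: P incorrect, sector=1962934184
kernel: md: recovery thread: P incorrect, sector=1962934192
kernel: md: recovery thread: P incorrect, sector=1962934200
"""

sectors = incorrect_sectors(sample)
# Five entries 8 sectors apart form one contiguous run
# (about 20 KiB if sectors are 512 bytes, which is an assumption).
print(group_runs(sectors))  # prints [(1962934168, 1962934200)]
```

A single short run like this suggests one small region of mismatch rather than errors scattered across the disk, but it does not say which disk holds the bad data.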

 

I don't actually know what happens now. Unraid says there were 5 errors, but all my disks are still active, nothing's simulated or whatever.

What action did Unraid take once these errors appeared? As it was non-correcting, did nothing happen? It just reports them and that's it? Which files are these in particular - is there a way to know? Do I need to know?

 

I don't know what action I should take once errors appear in a non-correcting parity check.

12 hours ago, JorgeB said:

using btrfs

I am using btrfs - how do I find out if the errors are in data or parity?
I understand that a correcting parity check will update parity, but what if the errors are in the data, not the parity? How would I fix the data using the parity information?

41 minutes ago, cybersteel8 said:

How would I fix the data using the parity information?

You would have to force Unraid to rebuild a complete drive - you cannot do anything at a lower level of granularity with parity.

 

It is much easier to restore any affected files from your backups.

On 5/28/2022 at 3:29 PM, itimpi said:

You would have to force Unraid to rebuild a complete drive - you cannot do anything at a lower level of granularity with parity.

 

It is much easier to restore any affected files from your backups.

I still don't know how to determine which files are affected. I don't even know which drive I would want to rebuild, if I took this approach. How do I know which files are affected?

 

On 5/28/2022 at 5:06 PM, JorgeB said:

Run a scrub on all array drives, if no corruptions are found run a correcting check.

Thanks. I have done this on all my array drives, and no errors were found on any of them. Does this mean the error was on the parity drive and not on my data drives? Do I know this with confidence?
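For reference, a scrub of every array disk can be scripted. The sketch below is only an illustration: it assumes the data disks are mounted at paths such as `/mnt/disk1` (an assumption about the setup, adjust as needed), and `scrub_command` and `scrub_all` are hypothetical helper names. It relies on `btrfs scrub start -B`, which runs the scrub in the foreground.

```python
import subprocess


def scrub_command(mount_point):
    """Build a foreground btrfs scrub command for one mounted filesystem."""
    # -B keeps the scrub in the foreground, so the call returns when it ends.
    return ["btrfs", "scrub", "start", "-B", mount_point]


def scrub_all(mount_points, run=subprocess.run):
    """Scrub each mount point in turn and return {mount_point: exit_code}.

    `run` is injectable so the logic can be exercised without btrfs installed.
    """
    results = {}
    for mount_point in mount_points:
        completed = run(scrub_command(mount_point))
        results[mount_point] = completed.returncode
    return results
```

Calling `scrub_all(["/mnt/disk1", "/mnt/disk2"])` as root would scrub both disks one after the other. A non-zero exit code for a disk means the scrub failed or reported a problem, and that disk deserves a closer look with `btrfs scrub status`.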

  • Solution

Just to make sure there isn't something else causing sync errors, you should run a correcting check, followed by a non-correcting check, all without rebooting. If there are still sync errors after the non-correcting check then diagnostics would be needed to compare the checks in case you have some other hardware (RAM or something) causing these.

On 5/30/2022 at 5:01 PM, JorgeB said:

If no checksum errors were found you can have confidence that the data is OK, so just correct parity.

Thanks. As part of the test suggested below, I have now run a correcting parity check, and it corrected the 5 errors.
 

On 5/30/2022 at 9:39 PM, trurl said:

Just to make sure there isn't something else causing sync errors, you should run a correcting check, followed by a non-correcting check, all without rebooting.

I have done this and the results seem successful. After the correcting check, I ran a non-correcting check immediately without rebooting at all. There were no errors detected in the non-correcting check.

Would the diagnostics still be useful, considering the second parity check resulted in no errors? I am unsure if it is safe to assume that my hardware is fault-free, but I suppose if I keep getting parity errors, diagnostics would become more useful. I think this is something I will keep an eye on, but not worry about. I'd like your opinion on this.
--
It seems the answer to my original question is that, when a non-correcting parity check results in errors, no action is taken by Unraid. A second parity check will show the same errors. The action that ought to be taken by the user is to run a btrfs scrub on all of the array drives to determine if there are any problems with the data on the drive. If the btrfs scrubs result in no errors, then the errors are on the parity drive, so a correcting parity check is appropriate. Another non-correcting parity check after correcting parity is appropriate, to ensure that parity is indeed in a valid state.

"If there are errors as a result of the scrub, then that drive should be rebuilt from parity, as the parity mismatch is from the data not the parity" - This is my assumption. I would appreciate a comment on this to know if my assumption is correct.

2 minutes ago, cybersteel8 said:

"If there are errors as a result of the scrub, then that drive should be rebuilt from parity, as the parity mismatch is from the data not the parity" - This is my assumption. I would appreciate a comment on this to know if my assumption is correct.

At one level that is correct, but if you have backups it would normally be MUCH quicker to simply restore the files identified by the scrub as being corrupt.
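The decision flow discussed in this thread can be written down as a small Python sketch. It is only a summary of the advice given here, not an official Unraid procedure, and `next_step` with its input format is a hypothetical illustration.

```python
def next_step(scrub_errors_by_disk):
    """Suggest a next step from a mapping of {disk_name: scrub_error_count}."""
    bad_disks = sorted(
        disk for disk, errors in scrub_errors_by_disk.items() if errors > 0
    )
    if bad_disks:
        # Data is suspect: restoring the affected files from backup is usually
        # quicker than rebuilding a whole disk from parity.
        return "restore affected files from backup (or rebuild): " + ", ".join(bad_disks)
    # All scrubs clean: the data is taken to be good, so parity gets corrected
    # and then verified.
    return "run a correcting check, then a non-correcting check without rebooting"
```

For example, `next_step({"disk1": 0, "disk2": 0})` returns the advice to correct and then re-check parity, which is the path taken in this thread. If sync errors remain after that second check, diagnostics are the next stop, since something else (RAM, for instance) may be causing them.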

2 hours ago, itimpi said:

At one level that is correct, but if you have backups it would normally be MUCH quicker to simply restore the files identified by the scrub as being corrupt.

Ah, so a btrfs scrub actually documents the names of the files that are corrupt? That's incredibly useful to know, thanks!
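As far as I understand it, the file names show up in the kernel log (syslog) during the scrub rather than in the scrub summary itself. The Python sketch below collects them, assuming warning lines shaped like `BTRFS warning (device md1): checksum error at ... (path: some/file)`. That message shape, the device name and the sample path are assumptions for illustration, and the exact wording may differ between kernel versions.

```python
import re

# Assumed shape of a btrfs data checksum warning in the kernel log.
CHECKSUM_WARNING = re.compile(
    r"BTRFS (?:warning|error) \(device ([^)]+)\): checksum error .*\(path: (.+)\)\s*$"
)


def corrupt_paths(log_text):
    """Return sorted, de-duplicated (device, path) pairs from checksum warnings."""
    found = set()
    for line in log_text.splitlines():
        match = CHECKSUM_WARNING.search(line)
        if match:
            found.add((match.group(1), match.group(2)))
    return sorted(found)


# Hypothetical log line, for illustration only.
example = (
    "kernel: BTRFS warning (device md1): checksum error at logical 123456 "
    "on dev /dev/md1, physical 123456, root 5, inode 257, offset 0, "
    "length 4096, links 1 (path: Movies/example.mkv)"
)
print(corrupt_paths(example))  # prints [('md1', 'Movies/example.mkv')]
```

The paths are relative to the filesystem on that device, so they would need to be matched up with the corresponding disk before restoring from backup.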
