autumnwalker Posted October 9, 2019 Just ran my monthly parity check. It found exactly one error. Thoughts?
JonathanM Posted October 9, 2019 Is it a correcting check? If so, run a non-correcting check and see if it's still there. If it's a non-correcting check, verify the health of all the array drives then run a correcting check. Follow up with a non-correcting check to see if it goes away. Don't run a correcting check with a known bad drive involved.
autumnwalker Posted October 10, 2019 It was a non-correcting check. By validate health do you mean "green balls"? If so, each is showing healthy. None have failed SMART.
JonathanM Posted October 10, 2019 12 minutes ago, autumnwalker said: By validate health do you mean "green balls"? That's one sign of health. When you click on each drive in the main GUI it takes you to a page that lists SMART attributes and allows you to run SMART tests and view the results. In the ideal world you would want each drive to have recently passed a long SMART test. However, those take several hours to complete, and don't need to be run regularly since Unraid reads each sector during a parity check. Drive health is a complicated subject.
autumnwalker Posted October 10, 2019 Understood. Should I trust the SMART information that Unraid is currently displaying? Run short tests?
JorgeB Posted October 10, 2019 16 hours ago, autumnwalker said: Just ran my monthly parity check. It found exactly one error. Thoughts? Assuming it was non-correcting, run another one to see if it still finds the same error. A single error could be from a bit flip if you're not using ECC RAM. If it was a correcting check, still run again; if it was a bit flip it should also find 1 error. Also check or post the syslog from around the time the error happened; if there are no disk-related entries, the disks are unlikely to be the reason.
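To see why a single RAM bit flip shows up as exactly one parity error: single parity (P) is effectively a byte-wise XOR across the data disks, so one bit corrupted in one buffer while parity is being computed produces exactly one mismatching position on the next check. A minimal Python sketch of the idea, using small simulated in-memory buffers rather than Unraid's actual md driver:

```python
from functools import reduce

def xor_parity(blocks):
    """Byte-wise XOR across equal-sized data blocks, like single parity P."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Three simulated 16-byte "sectors", one per data disk
disks = [bytes([v] * 16) for v in (0x11, 0x22, 0x33)]
parity = xor_parity(disks)              # parity as originally written

# Simulate a RAM bit flip corrupting one byte of one disk's buffer
# during a later parity computation
corrupted = bytearray(disks[1])
corrupted[5] ^= 0x04                    # a single flipped bit
bad_parity = xor_parity([disks[0], bytes(corrupted), disks[2]])

# A parity check comparing the two finds exactly one mismatch
mismatches = [i for i in range(16) if parity[i] != bad_parity[i]]
print(mismatches)  # → [5]
```

One transient flip, one bad position; a real disk problem would more typically show many errors or syslog I/O entries alongside them.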
autumnwalker Posted October 10, 2019 (edited) Log from start to finish of the parity check (non-correcting) here:

Oct 8 22:58:37 nas01 kernel: mdcmd (64): check nocorrect
Oct 8 22:58:37 nas01 kernel: md: recovery thread: check P ...
Oct 8 22:58:37 nas01 kernel: md: using 1536k window, over a total of 3907018532 blocks.
Oct 9 02:26:14 nas01 afpd[5310]: Reading IPC header failed (-1 of 14 bytes read): Connection reset by peer
Oct 9 10:33:18 nas01 kernel: md: recovery thread: P incorrect, sector=4646561192
Oct 9 11:18:28 nas01 kernel: md: recovery thread: completion status: 0

System is using non-ECC RAM. Edited October 10, 2019 by autumnwalker
JorgeB Posted October 10, 2019 Very unlikely to be disk related.
autumnwalker Posted October 10, 2019 So run another non-correcting parity check and see if I get the same error as below?

Oct 9 10:33:18 nas01 kernel: md: recovery thread: P incorrect, sector=4646561192
JorgeB Posted October 10, 2019 6 hours ago, autumnwalker said: So run another non-correcting parity check and see if I get the same error as below? Correct. If the first one was also non-correcting and you don't get an error, it was likely a memory bit flip; if it was correcting, you should now get the same error.
autumnwalker Posted October 10, 2019 Running now. Both initial and current run are non-correcting. Will report back with results.
autumnwalker Posted October 11, 2019 The non-correcting check returned the same error:

Oct 10 22:12:58 nas01 kernel: md: recovery thread: P incorrect, sector=4646561192
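Incidentally, the sector number in that log line can be turned into a rough byte offset, assuming the 512-byte sector units the Linux kernel conventionally uses in md messages (an assumption on my part about Unraid's driver):

```python
sector = 4646561192            # from the syslog: "P incorrect, sector=4646561192"
offset_bytes = sector * 512    # assuming 512-byte sector units
offset_tib = offset_bytes / 2**40

print(offset_bytes)            # byte offset into the parity-protected region
print(round(offset_tib, 2))    # → 2.16, i.e. roughly 2.16 TiB into the array
```

The same offset mismatching on two consecutive checks is what makes a transient RAM flip during the check itself an unlikely explanation: something at that fixed location on one of the disks disagrees with parity.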
autumnwalker Posted October 11, 2019 What's my next step here?
JonathanM Posted October 11, 2019 Correcting check.
John_M Posted October 11, 2019 As I see it, if all your disks are healthy, your only real option is to run a correcting parity check. That will update the parity disk to match what's on the data disks, making the assumption that the error is in the parity, since there is no way of knowing which disk is actually in error. How valid that assumption is is debatable. Before you run a correcting parity check, though, you might want to do either or both of the following:

Post your diagnostics, or check the SMART data of your disks yourself
Run a MemTest to check for bad RAM

You might want to consider the use of a checksumming technique to detect data corruption going forward. It won't help with your current situation but it might help in the future. I use the Dynamix File Integrity plugin. Some people use btrfs as the format on their data disks.
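The checksumming idea John_M describes boils down to: record a digest for every file, rescan later, and flag any file whose digest changed even though you never modified it. A hypothetical Python sketch of that workflow (not how the Dynamix File Integrity plugin is actually implemented), demonstrated on a throwaway directory standing in for a data disk:

```python
import hashlib
import os
import tempfile

def hash_file(path, algo="sha256", chunk=1 << 20):
    """Stream a file through a hash so large media files never load into RAM."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def build_manifest(root):
    """Map relative path -> digest for every file under root."""
    return {
        os.path.relpath(os.path.join(dirpath, name), root):
            hash_file(os.path.join(dirpath, name))
        for dirpath, _, names in os.walk(root)
        for name in names
    }

with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "movie.mkv"), "wb") as f:
        f.write(b"original data")
    manifest = build_manifest(root)      # record digests after writing

    # Simulate silent single-byte corruption the filesystem won't report
    with open(os.path.join(root, "movie.mkv"), "r+b") as f:
        f.seek(3)
        f.write(b"\x00")

    changed = [p for p, d in build_manifest(root).items() if manifest[p] != d]
    print(changed)  # → ['movie.mkv']
```

With a manifest like this, a future parity mismatch becomes answerable: rescan, and either some file's digest changed (the data disk is wrong) or none did (correcting parity is the safe fix).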
autumnwalker Posted October 11, 2019 SMART is coming back clean on each disk. Is there any way to MemTest with the system live? (AFAIK there is not.) I will check out the File Integrity plugin, but I'm still on reiserfs so I cannot use that until I migrate. Yet another reason for me to migrate now.
John_M Posted October 11, 2019 Running MemTest86 involves a reboot, I'm afraid.
JorgeB Posted October 11, 2019 Like mentioned, the best option now is to run a correcting check. If it was bad RAM it wouldn't keep finding an error on the exact same sector, so it's not so clear what caused the error. Hopefully it was a single event and no more will be detected in the near future; this is one of those situations where having checksums is valuable.
autumnwalker Posted October 12, 2019 Correcting check ran, fixed one error. Just ordered a new drive to start migrating away from reiserfs and I'll look at checksums. I suspect this was related to my failing PSU which was causing my SATA card to crap out (remember that?).
autumnwalker Posted October 12, 2019 Thanks all!
John_M Posted October 12, 2019 2 hours ago, autumnwalker said: I suspect this was related to my failing PSU Failing PSUs can be responsible for the most obscure problems. I think that's a very plausible explanation.