Uncorrectable parity errors?

MAM59 · August 4, 2022

Running checks multiple times always gives me entries like these:

Aug 4 08:16:21 F kernel: mdcmd (36): check correct
Aug 4 08:16:21 F kernel: md: recovery thread: check P ...
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917656
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917664
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917672
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917680
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917688

These 5 errors are always the same; there is no other message, no read errors, no SMART errors.

It says "P corrected" but that's obviously a lie.

What to do?

f-diagnostics-20220804-0908.zip

JorgeB · August 4, 2022

That suggests a problem with a controller or disk, unfortunately no easy way to tell which without testing one by one.

MAM59 · August 4, 2022

Now, on the 5th run (thank god it only takes an hour to get to the range where the "errors" showed up), the messages have vanished (at least so far, and the region has already been passed).

But why did it take 4 failed runs to correct them?

And why did they show up at all? There was no power outage or improper shutdown.

It's a bit strange...

1 hour ago, JorgeB said:

a problem with a controller or disk

I've ordered some new 18TB drives to replace an old and "slow as a dog" 10TB one. But I have my doubts that this will solve the problem with the parity.

The controller is not the problem either: the box has 4 separate SATA controllers, and each one has only one drive attached so far. I will put the new drives onto one of the other controllers and stop using the 4th for now.

trurl · August 4, 2022

disk3 has only passed short self-test, no other disks have had self-tests.

MAM59 · August 4, 2022

20 minutes ago, trurl said:

disk3 has only passed short self-test, no other disks have had self-tests.

This is only an optical illusion. I do my long-term burn-in tests for new or reused drives in a separate machine, so normal operation is not interrupted.

If the disks do not show any problems, they can move over to the production machine.

Disk3 showed some slowdowns recently, so I ran some short tests.

It will be replaced soon. The replacements should arrive tomorrow and after a day of testing or so, d3 will be pulled out and a new drive will be put in.

(But besides those slowdowns, which have also vanished for now (very mystical...), there were no read errors, no seek errors, and no sector reassignments.)

And the main question is still unsolved: why was the parity not corrected even if UNRAID tells me that it has happened?

JorgeB · August 4, 2022

8 minutes ago, MAM59 said:

why was the parity not corrected even if UNRAID tells me that it has happened?

There is no evidence it wasn't corrected, just that there were errors again in the next check. This was common, for example, for some users with a SAS2LP controller and certain disks: after every check there were the same 5 sync errors.

MAM59 · August 4, 2022

Hmm, I don't have any SAS2LP controller, just plain SATA ones (2 onboard and 2 simple 4-port ones taken from the recommendations on this board).

The only SAS-like thing here is the backplane; it uses 4 SAS connectors (4 drives each) with "reverse SATA to SAS" cables.

For now I will forget about those "errors", but I will monthly check if they are reappearing.

trurl · August 4, 2022

3 hours ago, MAM59 said:

This is only an optical illusion.

Apparently not the same as the self-tests the drive firmware does, since your tests are not logged in the SMART report. And burn-in tests of course don't say anything about how things are working currently.

15 minutes ago, MAM59 said:

For now I will forget about those "errors", but I will monthly check if they are reappearing.

Exactly zero sync errors is the desired result. If parity isn't all correct how can it be expected to rebuild all of a disk correctly?

MAM59 · August 4, 2022

1 minute ago, trurl said:

Exactly zero sync errors is the desired result.

The current run has zero errors.

For a long period I had zero errors on every monthly check.

The 5 errors just appeared again this week and did not vanish after the first three reruns.

Now in the fourth run they are gone again.

I already had these mystical 5 errors (I don't remember whether they were the same sectors) early last year. And, like now, they went away after a few retries and stayed away for almost a year.

This is rather strange behaviour, I think, so I asked here.

But of course, all this may be Murphy's Law #452 ("Shit happens") and just random...

MAM59 · September 1, 2022

and here we go again 😞

New month, old errors:

Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917656
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917664
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917672
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917680
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917688

The same sectors as before are flagged. As usual, if I stop the run ("corrected") and start from scratch, they are gone, but next month they are back again.

What's wrong?

(again, there are no disk errors, not even warnings, it looks like UNRAID is producing the errors by itself!)
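
The claim that both months flag exactly the same sectors can be checked mechanically. The following is a minimal sketch that parses the "P corrected" lines quoted in this thread and compares the two runs. The helper name `corrected_sectors` and the 512-byte unit are assumptions for illustration, not something stated in the thread.

```python
import re

# Matches the syslog lines quoted in this thread, e.g.
#   Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917656
SECTOR_RE = re.compile(r"md: recovery thread: P corrected, sector=(\d+)")

def corrected_sectors(log_text):
    """Return the sorted, de-duplicated sector numbers reported as 'P corrected'."""
    return sorted({int(m.group(1)) for m in SECTOR_RE.finditer(log_text)})

august = """\
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917656
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917664
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917672
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917680
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917688
"""

september = """\
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917656
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917664
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917672
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917680
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917688
"""

aug = corrected_sectors(august)
sep = corrected_sectors(september)

# Spacing between consecutive reported sectors.
gaps = {b - a for a, b in zip(aug, aug[1:])}

# ASSUMPTION: sector numbers are in 512-byte units (not stated in the thread).
SECTOR_BYTES = 512
start_offset = aug[0] * SECTOR_BYTES
region_bytes = len(aug) * max(gaps) * SECTOR_BYTES

print("same sectors in both runs:", aug == sep)
print("spacing between reports:", gaps)
print("start offset in bytes:", start_offset)
print("size of affected region in bytes:", region_bytes)
```

Both runs report the identical five sectors, spaced 8 apart. If the 512-byte assumption holds, that is one contiguous 20 KiB region reported in 4 KiB steps, which fits a localized problem rather than random noise.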

JorgeB · September 1, 2022

Most likely a controller or one of the disks, unfortunately no easy way to test unless you start swapping them.

MAM59 · September 1, 2022

but it does not even tell WHERE the error should be?

Swapping is not really an option; I do not have any spare 18TB drives at the moment, and I do not want to take one out of the backup server.

In general, the "errors" do not worry me much, they seem to be a fata morgana.

What makes me angry is that they show up again after a long time of normal operation (there were no outages, no read/write errors, or anything else in between; I think I booted the box once in that period).

JorgeB · September 1, 2022

42 minutes ago, MAM59 said:

but it does not even tell WHERE the error should be?

It's not possible due to how parity works, it's just possible to know that it's wrong.
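
To illustrate why a mismatch cannot be pinned on a particular disk, here is a minimal sketch that assumes single parity (the "P" in the log) is a plain bytewise XOR across the data disks. The block contents and the helper names `xor_parity` and `syndrome` are made up for the example.

```python
from functools import reduce

def xor_parity(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def syndrome(data_blocks, parity_block):
    """XOR of all data blocks and parity; all zero bytes means 'in sync'."""
    return xor_parity(list(data_blocks) + [parity_block])

# Three hypothetical data disks, one 4-byte block each.
data = [b"\x10\x20\x30\x40", b"\x0f\x0e\x0d\x0c", b"\xaa\xbb\xcc\xdd"]
parity = xor_parity(data)

clean = syndrome(data, parity)          # all zero: parity matches the data

# Flip the same single bit on each device in turn and record what a check sees.
flip = b"\x00\x00\x01\x00"
observed = []
for i in range(len(data)):
    damaged = list(data)
    damaged[i] = xor_parity([data[i], flip])
    observed.append(syndrome(damaged, parity))

# Same bit flip, but on the parity disk instead of a data disk.
observed.append(syndrome(data, xor_parity([parity, flip])))

print(clean)
print(observed)
```

Every case produces the same nonzero result, so a check can only tell that the stripe is inconsistent, not which device returned the odd data. Under this assumption, a correcting check simply rewrites P to match whatever the data disks returned.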

MAM59 · September 1, 2022

11 minutes ago, JorgeB said:

It's not possible due to how parity works, it's just possible to know that it's wrong.

Yeah, that's clear, but still very unsatisfying...

Obviously there must have been a wrong write to these specific sectors (I don't think there was ANY hardware problem with disk, cable or controller). Maybe the way the parity is calculated is .... hmm... not deterministic?

(which still does not answer whether it happens on writes or reads)

But always the same sector numbers? This cannot be accidental.

trurl · September 1, 2022

3 hours ago, MAM59 said:

Maybe the way the parity is calculated is .... hmm... not deterministic?

It's deterministic and a very simple calculation. It must be getting different input data due to hardware issue.
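
A toy model of that statement, assuming P is a bytewise XOR and that a correcting check rewrites P from whatever the data disks return. The "glitched" read is purely hypothetical and stands in for any hardware path that returns different bytes for the same sector; it is not a diagnosis of this system.

```python
from functools import reduce

def xor_parity(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def correcting_check(read_blocks, stored_parity):
    """Return (mismatch_found, parity_to_store) for one stripe."""
    expected = xor_parity(read_blocks)
    return expected != stored_parity, expected

# What is actually stored on three hypothetical data disks.
on_disk = [b"\x01\x02", b"\x10\x20", b"\xa0\xb0"]
parity = xor_parity(on_disk)

# Deterministic: identical input always gives identical parity.
deterministic = xor_parity(on_disk) == xor_parity(on_disk)

# Hypothetical bad read: the second disk returns one wrong bit on this pass only.
glitched = [on_disk[0], b"\x10\x21", on_disk[2]]

results = []
for blocks in (glitched, on_disk, on_disk):
    mismatch, parity = correcting_check(blocks, parity)
    results.append(mismatch)

print(deterministic)  # True
print(results)        # [True, True, False]
```

In this model the first check "corrects" P to match the bad read, the second check finds P wrong again and restores it, and only the third check is clean. The calculation is identical every time; only the input differs.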

MAM59 · September 1, 2022

1 hour ago, trurl said:

It must be getting different input data due to hardware issue.

Almost impossible. But then...

Another try:

What is the difference between running the check automatically and starting it manually?

(Yeah, I know, there SHOULD be NONE, but if I wait for the autostart every month I get these 5 errors, while starting a run manually gives me ZERO errors...)

JorgeB · September 1, 2022

It should be the same. Disable the scheduled check and run a manual one next month. Note that some errors of this type, when they are controller related, have been known to only happen after a reboot, i.e., if you run two consecutive checks without rebooting there won't be any errors, but if you reboot there will be.

MAM59 · September 1, 2022

1 minute ago, JorgeB said:

have been known to only happen after a reboot

hmm... sounds not really convincing... but ok, I will disable the scheduler and launch it manually next time.

JorgeB · September 1, 2022

12 minutes ago, MAM59 said:

sounds not really convincing

https://forums.unraid.net/topic/50698-monthly-5-parity-errors/?do=findComment&comment=499236
