Uncorrectable parity errors?

August 4, 2022

Community Expert

Running checks multiple times always gives me entries like these:

Aug 4 08:16:21 F kernel: mdcmd (36): check correct
Aug 4 08:16:21 F kernel: md: recovery thread: check P ...
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917656
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917664
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917672
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917680
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917688

These 5 errors are always the same; there is no other message, no read errors, no SMART errors.

It says "P corrected" but that's obviously a lie.
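A quick sanity check on those sector numbers, as a sketch only: assuming the log reports 512-byte sector units (an assumption, not something confirmed in this thread), the five entries are 8 sectors apart, which would make them five consecutive 4 KiB blocks forming one contiguous 20 KiB region about 408 GB into the disk.

```python
# Assumption: the sector numbers in the log are in 512-byte units.
SECTOR_BYTES = 512

# Sector numbers copied from the log lines above.
sectors = [796917656, 796917664, 796917672, 796917680, 796917688]

# Distance between consecutive reported sectors.
steps = [b - a for a, b in zip(sectors, sectors[1:])]

# Size of one reported block and position/length of the whole region.
block_bytes = steps[0] * SECTOR_BYTES
start_byte = sectors[0] * SECTOR_BYTES
span_bytes = (sectors[-1] - sectors[0] + steps[0]) * SECTOR_BYTES

print("steps between entries:", steps)
print("block size in bytes:", block_bytes)
print("region starts at byte:", start_byte)
print("region length in bytes:", span_bytes)
```

If that assumption holds, the "5 errors" are really a single small contiguous region, not five unrelated locations.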

What to do?

f-diagnostics-20220804-0908.zip

Quote

August 4, 2022

Community Expert

That suggests a problem with a controller or disk; unfortunately there is no easy way to tell which without testing them one by one.

Quote

August 4, 2022

Author
Community Expert

Now, on the 5th run (thank god it only takes an hour to get to the range where the "errors" showed up), the messages have vanished (at least so far, but that region has already been passed).

But why did it take 4 failed runs to correct them?!

And why did they show up at all? (There was no power outage or improper shutdown.)

It's a bit strange...

1 hour ago, JorgeB said:

a problem with a controller or disk

I've ordered some new 18TB drives to replace an old and "slow as a dog" 10TB one. But I doubt that this will solve the problem with the parity.

The controller is not a problem either: the box has 4 separate SATA controllers, and so far each one has only one drive attached. I will put the new drives on one of the other controllers and stop using the 4th for now.

Quote

August 4, 2022

Community Expert

disk3 has only passed short self-test, no other disks have had self-tests.

Quote

August 4, 2022

Author
Community Expert

20 minutes ago, trurl said:

disk3 has only passed short self-test, no other disks have had self-tests.

This is only an optical illusion. I do my long-term burn-in tests for new or reused drives in a separate machine, so normal operation is not interrupted.

If the disks do not show any problems, they can move over to the production machine.

Disk3 showed some slowdowns recently, therefore I ran some short tests.

It will be replaced soon. The replacements should arrive tomorrow, and after a day or so of testing, d3 will be pulled out and a new drive will be put in.

(But apart from those slowdowns, which have also vanished for now (very mysterious...), there were no read errors, no seek errors and no sector reassignments.)

And the main question is still unsolved: why was the parity not corrected even if UNRAID tells me that it has happened?

Quote

August 4, 2022

Community Expert

8 minutes ago, MAM59 said:

why was the parity not corrected even if UNRAID tells me that it has happened?

There is no evidence that it wasn't corrected, just that there were errors again in the next check. This was, for example, common for some users with a SAS2LP controller and certain disks: after every check there were the same 5 sync errors.

Quote

August 4, 2022

Author
Community Expert

Hmm, I don't have any SAS2LP controller, just plain SATA ones (2 onboard and 2 simple 4-port ones taken from the recommendations of this board).

The only SAS-like thing here is the backplane; it uses 4 SAS connectors (4 drives each) running with "reverse SATA to SAS" cables.

For now I will forget about those "errors", but I will monthly check if they are reappearing.

Quote

August 4, 2022

Community Expert

3 hours ago, MAM59 said:

This is only an optical illusion.

Apparently those are not the same as the self-tests the drive firmware does, since your tests are not logged in the SMART report. And burn-in tests of course don't say anything about how things are working currently.

15 minutes ago, MAM59 said:

For now I will forget about those "errors", but I will monthly check if they are reappearing.

Exactly zero sync errors is the desired result. If parity isn't all correct how can it be expected to rebuild all of a disk correctly?
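To illustrate why even a handful of sync errors matters, here is a minimal sketch, assuming plain single XOR parity across three made-up data blocks (the disk contents and names are hypothetical). It rebuilds one "disk" from the others plus parity, once with correct parity and once with a single wrong parity bit.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical contents of the same stripe on three data disks.
disk1 = bytes([0x11, 0x22, 0x33, 0x44])
disk2 = bytes([0xA0, 0xB0, 0xC0, 0xD0])
disk3 = bytes([0x0F, 0x0E, 0x0D, 0x0C])

good_parity = xor_blocks([disk1, disk2, disk3])

# Rebuild disk2 from parity plus the surviving disks.
rebuilt_ok = xor_blocks([good_parity, disk1, disk3])

# Same rebuild, but with one wrong bit in the parity block.
bad_parity = bytes([good_parity[0] ^ 0x01]) + good_parity[1:]
rebuilt_bad = xor_blocks([bad_parity, disk1, disk3])

print("rebuild with good parity matches:", rebuilt_ok == disk2)
print("rebuild with bad parity matches: ", rebuilt_bad == disk2)
```

The wrong parity bit ends up, unchanged, in the rebuilt data: the rebuild completes without complaint but the result is silently wrong in exactly that position.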

Quote

August 4, 2022

Author
Community Expert

1 minute ago, trurl said:

Exactly zero sync errors is the desired result.

the current run has zero errors.

For a long period I had zero errors on every monthly check.

The 5 errors just appeared again this week and did not vanish after the first three reruns.

Now in the fourth run they are gone again.

I already had these mysterious 5 errors (I no longer know whether they were the same sectors) early last year. And, like now, they went away after a few retries and stayed away for almost a year.

This is rather strange behaviour I think, so I asked here.

But of course, all this may be Murphy's Law #452 ("Shit happens") and just random...

Quote

September 1, 2022

Author
Community Expert

and here we go again 😞

New month, old errors:

Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917656
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917664
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917672
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917680
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917688

The same sectors as before are flagged. As usual, if I stop the run ("corrected") and start from scratch, they are gone, but next month they are back again.

What's wrong?
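Comparing the sector lists by eye gets tedious over several months. A small sketch of one way to do it: pull the sector numbers out of two syslog excerpts and compare them as sets. The regular expression is written against the log lines quoted in this thread; the function name `corrected_sectors` is made up for illustration.

```python
import re

# Matches lines like: "md: recovery thread: P corrected, sector=796917656"
PATTERN = re.compile(r"recovery thread: P corrected, sector=(\d+)")

def corrected_sectors(log_text):
    """Return the set of sector numbers reported as corrected in a log excerpt."""
    return {int(match.group(1)) for match in PATTERN.finditer(log_text)}

august = """\
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917656
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917664
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917672
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917680
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917688
"""

september = """\
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917656
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917664
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917672
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917680
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917688
"""

aug = corrected_sectors(august)
sep = corrected_sectors(september)

print("same sectors in both runs:", aug == sep)
print("only in August:", sorted(aug - sep))
print("only in September:", sorted(sep - aug))
```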

(again, there are no disk errors, not even warnings, it looks like UNRAID is producing the errors by itself!)

Quote

September 1, 2022

Community Expert

Most likely a controller or one of the disks; unfortunately there is no easy way to test unless you start swapping them.

Quote

September 1, 2022

Author
Community Expert

but it does not even tell WHERE the error should be?

Swapping is not really an option; I do not have any spare 18TB drives currently and I do not want to take one out of the backup server.

In general, the "errors" do not worry me much, they seem to be a fata morgana.

What makes me angry is that they show up again after a long period of normal operation. (There were no outages, no read/write errors or anything else in between. I guess I rebooted the box once in that period.)

Quote

September 1, 2022

Community Expert

42 minutes ago, MAM59 said:

but it does not even tell WHERE the error should be?

It's not possible due to how parity works, it's just possible to know that it's wrong.
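A minimal sketch of why that is, assuming single XOR parity and made-up block contents: a check XORs all data blocks and the parity block together and expects all zeros. Flip the same bit on any one member, data disk or parity, and the result of the check is identical, so nothing in it points at the culprit.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def flip_first_bit(block):
    """Return a copy of the block with the lowest bit of its first byte flipped."""
    return bytes([block[0] ^ 0x01]) + block[1:]

# Hypothetical contents of one stripe on three data disks.
disks = [
    bytes([0x11, 0x22, 0x33, 0x44]),
    bytes([0xA0, 0xB0, 0xC0, 0xD0]),
    bytes([0x0F, 0x0E, 0x0D, 0x0C]),
]
parity = xor_blocks(disks)

# A consistent stripe XORs to all zeros.
clean_check = xor_blocks(disks + [parity])

# Corrupt each member in turn (three data disks, then parity) and re-check.
check_results = []
for i in range(len(disks) + 1):
    members = disks + [parity]
    members[i] = flip_first_bit(members[i])
    check_results.append(xor_blocks(members))

print("clean stripe:", clean_check.hex())
for i, result in enumerate(check_results):
    print("member", i, "corrupted ->", result.hex())
```

Every corrupted case prints the same non-zero result, which is all a single-parity check ever sees. A correcting check can therefore only rewrite parity to match whatever the data disks returned on that pass.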

Quote

September 1, 2022

Author
Community Expert

11 minutes ago, JorgeB said:

It's not possible due to how parity works, it's just possible to know that it's wrong.

Yeah, that's clear but still very unsatisfying...

Obviously there must have been a wrong write to these specific sectors (I don't think there was ANY hardware problem with disk, cable or controller). Maybe the way the parity is calculated is .... hmm... not deterministic?

(which still does not answer whether it happens on writes or reads)

But always the same sector numbers? This cannot be accidental.

Quote

September 1, 2022

Community Expert

3 hours ago, MAM59 said:

Maybe the way the parity is calculated is .... hmm... not deterministic?

It's deterministic and a very simple calculation. It must be getting different input data due to hardware issue.

Quote

September 1, 2022

Author
Community Expert

1 hour ago, trurl said:

It must be getting different input data due to hardware issue.

Almost impossible. But then...

Another try:

What is the difference between running the check automatically and starting it manually?

(Yeah, I know, there SHOULD be NONE, but if I wait for the autostart every month I do get these 5 errors, while starting a run manually gives me ZERO errors...)

Quote

September 1, 2022

Community Expert

It should be the same. Disable the scheduled check and run a manual one next month. Note that some errors of this type, when they are controller related, have been known to only happen after a reboot, i.e., if you run two consecutive checks without rebooting there won't be any errors, but if you reboot there will be.

Quote

September 1, 2022

Author
Community Expert

1 minute ago, JorgeB said:

have been known to only happen after a reboot

hmm... sounds not really convincing... but ok, I will disable the scheduler and launch it manually next time.

Quote

September 1, 2022

Community Expert

12 minutes ago, MAM59 said:

sounds not really convincing

https://forums.unraid.net/topic/50698-monthly-5-parity-errors/?do=findComment&comment=499236

Quote
