
Uncorrectable parity errors?



Running checks multiple times always gives me entries like these:

 

Aug  4 08:16:21 F kernel: mdcmd (36): check correct
Aug  4 08:16:21 F kernel: md: recovery thread: check P ...
Aug  4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917656
Aug  4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917664
Aug  4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917672
Aug  4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917680
Aug  4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917688

 

These 5 errors are always the same, there is no other message, no read errors, no SMART error.

It says "P corrected" but that's obviously a lie.

 

What to do?

 

f-diagnostics-20220804-0908.zip

Link to comment

Now, on the 5th run (thank god, it only takes an hour to reach the range where the "errors" showed up), the messages have vanished (at least so far, but that region has already been passed).

But why did it take 4 failed runs to correct them?!

 

And why did they show up at all? (There was no power outage or improper shutdown.)

 

It's a bit strange...

 

1 hour ago, JorgeB said:

a problem with a controller or disk

I've ordered some new 18TB drives to replace an old and "slow as a dog" 10TB one. But I have my doubts that this will solve the problem with the parity.

The controller is not the problem either: the box has 4 separate SATA controllers, and so far each one has only one drive attached. I will put the new drives on one of the other controllers and stop using the 4th for now.

 

Link to comment
20 minutes ago, trurl said:

disk3 has only passed short self-test, no other disks have had self-tests.

This is only an optical illusion. I do my long-term burn-in tests for new or reused drives in a separate machine, so normal operation is not interrupted.

If the disks do not show any problems, they can move over to the production machine.

Disk3 showed some slowdowns recently, therefore I ran some short tests.

It will be replaced soon. The replacements should arrive tomorrow and after a day of testing or so, d3 will be pulled out and a new drive will be put in.

 

(But besides those slowdowns (which also have vanished currently, very mystical...) there were no read errors, no seek errors and no sector reassignments)

 

And the main question is still unsolved: why was the parity not corrected even if UNRAID tells me that it has happened?

 

 

Link to comment
8 minutes ago, MAM59 said:

why was the parity not corrected even if UNRAID tells me that it has happened?

There is no evidence it wasn't corrected, just that there were errors again in the next check. This was, for example, common for some users of a SAS2LP controller with some disks: after every check there were the same 5 sync errors.

Link to comment

Hmm, I don't have any SAS2LP controller, just plain SATA ones (2 onboard and 2 simple 4-port ones taken from the recommendations of this board).

The only SAS-like thing here is the backplane: it uses 4 SAS connectors (4 drives each) running with "reverse SATA to SAS" cables.

 

For now I will forget about those "errors", but I will monthly check if they are reappearing.

 

Link to comment
3 hours ago, MAM59 said:

This is only an optical illusion.

Apparently those are not the same as the self-tests the drive firmware does, since your tests are not logged in the SMART report. And burn-in tests of course don't say anything about how things are working currently.

 

15 minutes ago, MAM59 said:

For now I will forget about those "errors", but I will monthly check if they are reappearing.

Exactly zero sync errors is the desired result. If parity isn't all correct how can it be expected to rebuild all of a disk correctly?
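
To illustrate the point about rebuilds, here is a minimal sketch. It assumes plain XOR single parity, and the three data blocks are made-up values, not anything from this array. It shows that a rebuild from parity with one wrong byte silently returns wrong data for that byte.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild(missing_index, disks, parity):
    """Reconstruct one missing disk from the surviving disks plus parity."""
    survivors = [d for i, d in enumerate(disks) if i != missing_index]
    return xor_blocks(survivors + [parity])


# Three hypothetical data disks, one 4-byte block each (illustrative values).
data = [b"\x11\x22\x33\x44", b"\xa0\xb0\xc0\xd0", b"\x0f\x0e\x0d\x0c"]
good_parity = xor_blocks(data)

# Simulate a sync error: the first parity byte is wrong.
bad_parity = bytes([good_parity[0] ^ 0xFF]) + good_parity[1:]

rebuilt_ok = rebuild(1, data, good_parity)
rebuilt_bad = rebuild(1, data, bad_parity)

print(rebuilt_ok == data[1])   # True: correct parity rebuilds the disk exactly
print(rebuilt_bad == data[1])  # False: the wrong parity byte corrupts the rebuild
```

Only the bytes covered by the bad parity are affected, which is why even 5 sync errors matter if a disk ever has to be rebuilt.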

Link to comment
1 minute ago, trurl said:

Exactly zero sync errors is the desired result.

the current run has zero errors.

For a long period I had zero errors on every monthly check.

The 5 errors just appeared again this week and did not vanish after the first three reruns.

Now in the fourth run they are gone again.

I had these mystical 5 errors (dunno anymore if they were the same sectors) early last year already. And, like now, they went away after a few retries. And stayed away for almost a year.

This is rather strange behaviour I think, so I asked here.

But of course, all this may be Murphy's Law #452 ("Shit happens") and just random...

 

Link to comment
  • 4 weeks later...

and here we go again 😞

New month, old errors:

Sep  1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917656
Sep  1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917664
Sep  1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917672
Sep  1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917680
Sep  1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917688

 

The same sectors as before are flagged. As usual, if I stop the run ("corrected") and start from scratch, they are gone, but next month they are back again.
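
A quick way to confirm that both runs flag the same sectors is to parse the log lines instead of comparing them by eye. The sketch below uses only the syslog lines quoted in this thread. The helper name `corrected_sectors`, the 512-byte sector size, and the idea that each logged entry covers one stride of sectors are assumptions for illustration.

```python
import re

# Matches the "P corrected" lines as they appear in the syslog excerpts above.
SECTOR_RE = re.compile(r"P corrected, sector=(\d+)")


def corrected_sectors(log_text):
    """Return the sector numbers from all 'P corrected' lines, in log order."""
    return [int(m.group(1)) for m in SECTOR_RE.finditer(log_text)]


august = """\
Aug  4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917656
Aug  4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917664
Aug  4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917672
Aug  4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917680
Aug  4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917688
"""

september = """\
Sep  1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917656
Sep  1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917664
Sep  1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917672
Sep  1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917680
Sep  1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917688
"""

aug = corrected_sectors(august)
sep = corrected_sectors(september)

same_sectors = aug == sep
strides = {b - a for a, b in zip(aug, aug[1:])}

SECTOR_BYTES = 512  # assumption: sector numbers count 512-byte units
# Assumption: each logged entry covers one full stride of sectors.
span_bytes = (aug[-1] - aug[0] + max(strides)) * SECTOR_BYTES

print("identical sector lists:", same_sectors)
print("stride between entries:", strides)
print("affected span in bytes:", span_bytes)
```

Under those assumptions the five entries form one contiguous 20 KiB region, which fits the observation that this is always the same small spot and not scattered damage.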

 

What's wrong???

(again, there are no disk errors, not even warnings, it looks like UNRAID is producing the errors by itself!)

 

Link to comment

But it does not even tell me WHERE the error is supposed to be?

Swapping is not really an option: I do not have any spare 18TB drives currently and I do not want to take one out of the backup server.

In general, the "errors" do not worry me much, they seem to be a fata morgana.

What makes me angry is that they show up again after a long time of normal operation (there were no outages, no read/write errors, or anything else in between; I guess I rebooted the box once in that period).

 

Link to comment
11 minutes ago, JorgeB said:

It's not possible due to how parity works, it's just possible to know that it's wrong.

Yeah, that's clear but still very unsatisfying...

Obviously there must have been a wrong write to these specific sectors (I don't think there was ANY hardware problem with disk, cable or controller). Maybe the way the parity is calculated is .... hmm... not deterministic?

(which still does not answer whether it happens on writes or reads)

But always the same sector numbers? This cannot be accidental.
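
The quoted point, that single parity can show a stripe is inconsistent but not which device is wrong, can be sketched as follows. This assumes plain XOR parity; the block values and the helper names `parity_of`, `syndrome` and `flip_first_bit` are hypothetical. XOR itself is deterministic: the same inputs always give the same parity, so a changing result means the inputs read from the devices changed.

```python
from functools import reduce


def parity_of(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def syndrome(data_blocks, parity):
    """XOR of all data blocks and parity; all zero bytes means consistent."""
    return parity_of(list(data_blocks) + [parity])


def flip_first_bit(block):
    """Return a copy of the block with the lowest bit of byte 0 flipped."""
    return bytes([block[0] ^ 0x01]) + block[1:]


# Three hypothetical data disks, one 4-byte block each (illustrative values).
data = [b"\x11\x22\x33\x44", b"\xa0\xb0\xc0\xd0", b"\x0f\x0e\x0d\x0c"]
parity = parity_of(data)

clean = syndrome(data, parity)
bad_disk0 = syndrome([flip_first_bit(data[0]), data[1], data[2]], parity)
bad_disk2 = syndrome([data[0], data[1], flip_first_bit(data[2])], parity)
bad_parity = syndrome(data, flip_first_bit(parity))

print(clean == bytes(4))                    # True: no mismatch
print(bad_disk0 == bad_disk2 == bad_parity)  # True: the mismatch looks identical
```

All three faulty cases leave exactly the same mismatch, so a correcting check can only rewrite parity to match whatever the data disks returned on that read.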

 

Link to comment
1 hour ago, trurl said:

It must be getting different input data due to hardware issue.

Almost impossible. But then...

 

Another try:

  • What is the difference between running the check automatically and starting it manually?

(Yeah, I know, there SHOULD be NONE, but if I wait for the autostart every month I get these 5 errors, while starting a run manually gives me ZERO errors...)

 

Link to comment

It should be the same. Disable the scheduled check and run a manual one next month. Note that some of these types of errors, when they are controller related, have been known to happen only after a reboot: if you run two consecutive checks without rebooting there won't be any errors, but if you reboot there will be.

Link to comment
