MAM59 Posted August 4, 2022
Running the check multiple times always gives me entries like these:
Aug 4 08:16:21 F kernel: mdcmd (36): check correct
Aug 4 08:16:21 F kernel: md: recovery thread: check P ...
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917656
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917664
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917672
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917680
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917688
These 5 errors are always the same; there is no other message, no read errors, no SMART errors. It says "P corrected", but that's obviously a lie. What to do?
Attachment: f-diagnostics-20220804-0908.zip
JorgeB Posted August 4, 2022
That suggests a problem with a controller or disk; unfortunately there's no easy way to tell which without testing one by one.
MAM59 (Author) Posted August 4, 2022
Now, on the 5th run (thank god it only takes an hour to reach the range where the "errors" showed up), the messages have vanished (at least so far, and that region has already been passed). But why did it take 4 failed runs to correct them? And why did they show up at all? There was no power outage or improper shutdown. It's a bit strange...
1 hour ago, JorgeB said: "a problem with a controller or disk"
I've ordered some new 18 TB drives to replace an old and "slow as a dog" 10 TB one, but I doubt this will solve the parity problem. The controller shouldn't be the problem either: the box has 4 separate SATA controllers, and so far each one has only one drive attached. I will put the new drives on one of the other controllers and stop using the 4th for now.
trurl Posted August 4, 2022
disk3 has only passed a short self-test; no other disks have had self-tests.
MAM59 (Author) Posted August 4, 2022
20 minutes ago, trurl said: "disk3 has only passed short self-test, no other disks have had self-tests."
This is only an optical illusion. I do my long-term burn-in tests for new or reused drives in a separate machine, so normal operation is not interrupted. If the disks show no problems, they move over to the production machine. Disk3 showed some slowdowns recently, so I ran some short tests on it. It will be replaced soon: the replacements should arrive tomorrow, and after a day or so of testing, disk3 will be pulled and a new drive put in. (Besides those slowdowns, which have also vanished for now, very mysterious, there were no read errors, no seek errors and no sector reassignments.)
And the main question is still unanswered: why was the parity not corrected even though UNRAID tells me it was?
JorgeB Posted August 4, 2022
8 minutes ago, MAM59 said: "why was the parity not corrected even if UNRAID tells me that it has happened?"
No evidence it wasn't corrected, just that there were errors again in the next check. This was, for example, common for some users of a SAS2LP controller with some disks: after every check there were the same 5 sync errors.
MAM59 (Author) Posted August 4, 2022
Hmm, I don't have any SAS2LP controller, just plain SATA ones (2 onboard and 2 simple 4-port ones taken from the recommendations on this board). The only SAS-like thing here is the backplane; it uses 4 SAS connectors (4 drives each) running with "reverse SATA to SAS" cables. For now I will forget about those "errors", but I will check monthly whether they reappear.
trurl Posted August 4, 2022
3 hours ago, MAM59 said: "This is only an optical illusion."
Apparently those are not the same as the self-tests the drive firmware runs, since your tests are not logged in the SMART report. And burn-in tests of course say nothing about how things are working currently.
15 minutes ago, MAM59 said: "For now I will forget about those "errors", but I will monthly check if they are reappearing."
Exactly zero sync errors is the desired result. If parity isn't entirely correct, how can it be expected to rebuild all of a disk correctly?
MAM59 (Author) Posted August 4, 2022
1 minute ago, trurl said: "Exactly zero sync errors is the desired result."
The current run has zero errors. For a long period I had zero errors on every monthly check. The 5 errors just appeared again this week and did not vanish after the first three reruns. Now, in the fourth run, they are gone again. I already had these mysterious 5 errors early last year (I don't remember whether they were the same sectors). And, like now, they went away after a few retries and stayed away for almost a year. This is rather strange behaviour, I think, so I asked here. But of course all this may be Murphy's Law #452 ("Shit happens") and just random...
MAM59 (Author) Posted September 1, 2022
And here we go again 😞 New month, old errors:
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917656
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917664
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917672
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917680
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917688
The same sectors as before are flagged. As usual, if I stop the run ("corrected") and start from scratch, they are gone, but next month they are back again. What's wrong? (Again, there are no disk errors, not even warnings; it looks like UNRAID is producing the errors by itself!)
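To make the "same sectors every time" observation checkable rather than eyeballed, here is a minimal Python sketch that pulls the sector numbers out of the two log excerpts quoted in this thread and compares them. The regular expression is written only against the line format shown above, and the 512-byte sector size used for the span calculation is an assumption, not something the logs state.

```python
import re

# Matches the "P corrected" lines exactly as they are quoted in this thread.
CORRECTED = re.compile(r"md: recovery thread: P corrected, sector=(\d+)")


def corrected_sectors(log_text):
    """Return the sorted, de-duplicated sector numbers reported as corrected."""
    return sorted({int(m.group(1)) for m in CORRECTED.finditer(log_text)})


august = """\
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917656
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917664
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917672
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917680
Aug 4 08:51:21 F kernel: md: recovery thread: P corrected, sector=796917688
"""

september = """\
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917656
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917664
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917672
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917680
Sep 1 00:26:24 F kernel: md: recovery thread: P corrected, sector=796917688
"""

aug = corrected_sectors(august)
sep = corrected_sectors(september)

# Distance between neighbouring reported sectors.
steps = {b - a for a, b in zip(aug, aug[1:])}

# ASSUMPTION: 512-byte sectors; the logs do not state the sector size.
SECTOR_BYTES = 512
span_bytes = (aug[-1] - aug[0] + 8) * SECTOR_BYTES

print(aug == sep)   # True: both checks flagged the identical sectors
print(steps)        # {8}: the entries are evenly spaced, 8 sectors apart
print(span_bytes)   # 20480 bytes, i.e. one contiguous 20 KiB region
```

Under that assumption the five entries describe a single contiguous 20 KiB region, which fits the observation that the errors are not random: the same small area is flagged on both dates.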
JorgeB Posted September 1, 2022
Most likely a controller or one of the disks; unfortunately there's no easy way to test unless you start swapping them.
MAM59 (Author) Posted September 1, 2022
But it doesn't even tell me WHERE the error is? Swapping is not really an option: I don't have any spare 18 TB drives at the moment and I don't want to take one out of the backup server. In general, the "errors" don't worry me much; they seem to be a fata morgana. What makes me angry is that they show up again after a long time of normal operation (there were no outages, no read/write errors or anything else in between; I think I rebooted the box once in that period).
JorgeB Posted September 1, 2022
42 minutes ago, MAM59 said: "but it does not even tell WHERE the error should be?"
It's not possible because of how parity works; it's only possible to know that it's wrong.
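A small Python sketch of why a parity mismatch cannot point at a specific disk. It assumes plain single XOR parity (the "P" in the log lines) and uses made-up 4-byte blocks for three hypothetical data disks; none of these values come from the actual array. It also shows that the calculation itself is deterministic: the same inputs always give the same parity.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR equally sized data blocks byte by byte to get the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def flip_first_bit(block):
    """Simulate a single-bit error in the first byte of a block."""
    return bytes([block[0] ^ 0x01]) + block[1:]


def mismatch(stored_parity, data_blocks):
    """What a check sees: stored parity XOR parity recomputed from the data."""
    recomputed = xor_parity(data_blocks)
    return bytes(p ^ r for p, r in zip(stored_parity, recomputed))


# Hypothetical contents of the same stripe on three data disks.
disk1 = bytes([0x11, 0x22, 0x33, 0x44])
disk2 = bytes([0xA0, 0xB0, 0xC0, 0xD0])
disk3 = bytes([0x0F, 0x1E, 0x2D, 0x3C])

parity = xor_parity([disk1, disk2, disk3])

# Same inputs, same result: the calculation is deterministic.
print(parity == xor_parity([disk1, disk2, disk3]))  # True

# Case A: disk2 returns one wrong bit, parity on disk is fine.
case_a = mismatch(parity, [disk1, flip_first_bit(disk2), disk3])

# Case B: all data disks are fine, the parity disk returns one wrong bit.
case_b = mismatch(flip_first_bit(parity), [disk1, disk2, disk3])

print(case_a.hex())      # 01000000
print(case_a == case_b)  # True: the check cannot tell the two cases apart
```

Both cases produce the identical mismatch, so all a check can report is that the stripe does not add up. A correcting check then rewrites parity to match whatever the data disks returned during that pass.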
MAM59 (Author) Posted September 1, 2022
11 minutes ago, JorgeB said: "It's not possible due to how parity works, it's just possible to know that it's wrong."
Yeah, that's clear, but still very unsatisfying... Obviously there must have been a bad write to these specific sectors (I don't think there was ANY hardware problem with disk, cable or controller). Maybe the way the parity is calculated is... hmm... not deterministic? (Which still doesn't answer whether it happens on writes or reads.) But always the same sector numbers? That cannot be a coincidence.
trurl Posted September 1, 2022
3 hours ago, MAM59 said: "Maybe the way the parity is calculated is .... hmm... not deterministic?"
It's deterministic and a very simple calculation. It must be getting different input data due to a hardware issue.
MAM59 (Author) Posted September 1, 2022
1 hour ago, trurl said: "It must be getting different input data due to hardware issue."
Almost impossible. But then... another try: what is the difference between running the check automatically and starting it manually? (Yeah, I know, there SHOULD be NONE, but if I wait for the autostart every month I get these 5 errors, while starting a run manually gives me ZERO errors...)
JorgeB Posted September 1, 2022
It should be the same. Disable the scheduled check and run a manual one next month. Note that some errors of this type, when they are controller related, have been known to only happen after a reboot, i.e., if you run two consecutive checks without rebooting there won't be any errors; if you reboot there will be.
MAM59 (Author) Posted September 1, 2022
1 minute ago, JorgeB said: "have been known to only happen after a reboot"
Hmm... that doesn't sound really convincing... but OK, I will disable the scheduler and launch it manually next time.
JorgeB Posted September 1, 2022
12 minutes ago, MAM59 said: "sounds not really convincing"
https://forums.unraid.net/topic/50698-monthly-5-parity-errors/?do=findComment&comment=499236