Help with Disk/Parity issues and errors... maybe?



Hoping some others can take a look at my diag and give me some pointers on what might be going on.

 

It started 3-4 months ago, when I accidentally hit the side of the Unraid server. Over the next few days, 3 disks went "bad", showing tens of thousands of errors. Huh... I pulled the disks out, tested them (everything OK), removed them from the config, re-added them, recovered data from my offsite unit, etc., and got things back to where they were. Along the way I also found out about the file system check, which I ran on the disks I was having issues with.

 

Now, though, my monthly parity checks are correcting 10K+ errors (38K+ on the last one), which I know isn't good. I've run short SMART tests on all disks and they all show no errors. In the logs, though, I see a ton of "exception... frozen" errors, which I'm sure is not good either.
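For anyone wanting to see which ports those "exception... frozen" lines come from, here is a rough Python sketch that counts them per ATA port. The regex assumes the usual libata wording (`ataN.NN: exception Emask ... frozen`), and the sample lines are made up for illustration, not taken from the attached diagnostics.

```python
import re
from collections import Counter

# Assumed libata wording, e.g.
# "ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen"
FROZEN_RE = re.compile(r"\b(ata\d+(?:\.\d+)?): exception .*\bfrozen\b")

def count_frozen(lines):
    """Return a Counter of 'exception ... frozen' events keyed by ATA port."""
    counts = Counter()
    for line in lines:
        match = FROZEN_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# Hypothetical sample lines, for illustration only.
sample = [
    "Apr  3 10:00:01 Tower kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen",
    "Apr  3 10:00:02 Tower kernel: ata5.00: failed command: READ DMA EXT",
    "Apr  3 10:05:09 Tower kernel: ata7.00: exception Emask 0x10 SAct 0x0 SErr 0x400000 action 0x6 frozen",
    "Apr  3 10:09:30 Tower kernel: ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen",
]
frozen_counts = count_frozen(sample)
```

To use it on a real log, pass an open file instead of `sample`, e.g. `count_frozen(open("syslog.txt"))`, where `syslog.txt` is a placeholder path. If one port dominates the count, that points at a specific cable, slot or controller channel rather than at the disks in general.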

 

Most of the SMART reports look OK, other than those for the two parity disks, probably due to all the corrections. Both parity disks are pretty much brand new.

 

Otherwise, everything runs fine: disk shares are good, the VM runs fine, dockers run fine. It just seems there is an underlying disk/filesystem/hardware issue I need to deal with.

Thanks for looking and any advice!
 

tower-diagnostics-20210403-1501.zip


Disk WDC_WD6001FFWX-68Z39N0_WD-WXN1H841RHFN dropped offline. Maybe that was because its connections were disturbed or maybe because you're using Marvell SAS HBAs, which are known for dropping disks and haven't been recommended for a number of years. Since the disk is offline there is no SMART report. Long term, I would consider replacing them with LSI HBAs. Short term, shut down, check cables and see if it has a good SMART report when you power up again.

 

Short SMART self-tests are of limited use. They just test the basic functioning of the electronics. Long self-tests read the whole of the disk surface and give a much better idea of its health, but take several hours to complete.
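If you prefer the command line to the GUI, the long test can be started with `smartctl` from smartmontools. A minimal Python sketch, assuming `smartctl` is installed and you run as root; the helper names and the device path in the comments are placeholders:

```python
import subprocess

def long_test_cmd(device):
    """Command that starts an extended (long) SMART self-test."""
    return ["smartctl", "-t", "long", device]

def selftest_log_cmd(device):
    """Command that prints the drive's self-test log (results of past tests)."""
    return ["smartctl", "-l", "selftest", device]

def run(cmd):
    """Run a command and return its stdout. Needs root and smartmontools."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout

# Example usage (replace /dev/sdX with the real device):
#   print(run(long_test_cmd("/dev/sdX")))     # starts the test and returns at once
#   print(run(selftest_log_cmd("/dev/sdX")))  # check again some hours later
```

The test runs inside the drive itself, so the start command returns immediately; the result only appears in the self-test log once the test has finished.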

 

The other disks look ok. Don't worry about the IronWolfs - Seagate drives report some parameters differently from WD. All hard drives have very high raw error rates and high levels of error correction. It's just that WD ones report the errors that remain after correction (so hopefully zero) while Seagate ones report the rate before correction.

 

I'm not sure what you did in the intervening months to address this long-standing problem. Your logs go back approximately one month and there have been problems throughout. Your account of "pulling disks out, testing ok, etc" doesn't convince me that you fixed the original problem, and your monthly parity checks seem to me to be doing more harm than good. When a disk drops offline I expect to see it as disabled and emulated, but you've tried to fix it by doing a New Config, so I'm going to stop there and not try to guess and make matters worse.

 


I've just recently got some LSI cards in IT mode, so that is what I will work on, swapping out those SuperMicro cards. Hopefully that will bring some stability. All the disks themselves never show any issues any more. That one dropping offline, never shows up on the dashboard. I'll swap controllers and keep an eye on it. Thanks @John_M

 

 

20 hours ago, Whaler_99 said:

my monthly parity checks, it's correcting like 10K plus errors

My guess is with all that you did you invalidated parity and it hasn't been valid since.

 

Are you actually doing CORRECTING parity checks? Normally monthly checks are non-correcting. Until you correct (or rebuild) parity you will have parity errors.

 

After running a correcting parity check, you should follow that with a non-correcting parity check to verify that you have no parity errors. Exactly zero sync errors is the only acceptable result, and until you get there you still have work to do.

2 hours ago, Whaler_99 said:

In the scheduler, I have it configured for monthly and the "Write corrections to parity.." set to yes.


It is normally recommended that you set the monthly parity checks to be non-correcting. The rationale behind this is that you do not want a disk that might be playing up to end up corrupting parity, but you do want to know if there is a mismatch so you can investigate why. You then only run a correcting check after the cause of the problem has been identified and (hopefully) resolved.
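To illustrate why, here is a toy Python model of single parity as a byte-wise XOR across the data disks. It is only a sketch of the idea, not how Unraid implements it (and a second parity disk uses a different calculation); all names and the tiny "disks" are invented for the example.

```python
from functools import reduce

def xor_parity(disks):
    """Byte-wise XOR across all given disks (toy single-parity model)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*disks))

def parity_check(disks, parity, correct):
    """Count positions where stored parity disagrees with computed parity.

    With correct=True the stored parity is overwritten to match whatever
    the data disks returned, whether those reads were right or wrong.
    """
    computed = xor_parity(disks)
    errors = sum(1 for p, c in zip(parity, computed) if p != c)
    return errors, (computed if correct else parity)

def rebuild(other_disks, parity):
    """Reconstruct a missing disk from the remaining disks plus parity."""
    return xor_parity(other_disks + [parity])

disk1, disk2, disk3 = b"unraid", b"parity", b"checks"
parity = xor_parity([disk1, disk2, disk3])

# disk2 is playing up and returns one wrong byte during the check.
bad_read = b"parXty"

# Non-correcting check: the mismatch is reported, parity is left alone.
errors_nc, parity_nc = parity_check([disk1, bad_read, disk3], parity, correct=False)

# Correcting check: the bad read gets written into parity.
errors_c, parity_c = parity_check([disk1, bad_read, disk3], parity, correct=True)

# Later disk3 fails while disk2 reads correctly again.
rebuilt_good = rebuild([disk1, disk2], parity_nc)  # equals disk3
rebuilt_bad = rebuild([disk1, disk2], parity_c)    # differs from disk3
```

Both checks report the same single error, but only the correcting one changes parity. When a different disk later has to be rebuilt, the untouched parity reproduces its contents exactly, while the "corrected" parity yields corrupt data.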


Replaced the two SM cards with the LSI ones I got, and also had to replace a cable that was a bit short. Just finished the correcting parity check and it found zero errors. I'll now run a non-correcting check as suggested to make sure there are 0 errors.

