6.3.5: Disk Died; Replaced; ParityChk; 166k Errors on Other Disk

wheel · March 26, 2020

Some history on this tower:

So the LSI controllers have been in since December and doing fine. I’ve successfully upgraded at least 3 (maybe more) drives from 4TB to 6TB in the time since, always running a parity check before and a non-correcting parity check after.

This is the first time since the LSI cards came in that I had a disk die on me (Disk 12, 2 sync errors on the GUI and drive automatically disconnected). I was maybe 2 days max away from upgrading a random 4 to 6 for space reasons anyway, so I went ahead and put the 6 in and started the rebuilding process. I’d been steadily adding files over the past month and a half or so since my last upgrade (and last parity check), but weirdly not many to the disk that’s now showing 166k errors (Disk 13).

The first half or so of the parity check had zero errors. I checked it with about 3 hours left and saw the 166k errors, but let the check run to completion. No more errors popped up in the last 3 hours of the check, the sync error disk (13) isn’t disabled or marked in any negative way outside of the error count, and all files (including the ones added to that disk during the ~45 days of no parity checks) seem to open fine still.

With all these factors in play, any suggestions on next steps here? Got a feeling hardware replacements are going to be a pain in this environment, but I’m swimming in free time if there are some time-intensive steps I can take to figure out what’s going wrong here and get things back to normal.

Thanks in advance for any help or guidance!

tower-diagnostics-20200326-0913.zip

JorgeB · March 26, 2020

There were read errors on disk13 during the rebuild, which means disk12 is partially corrupt.

Any special reason for being on such an old release? Those diags are not as complete as the latest ones, so it's not as easy to say if it was a disk problem or not, but from what I can see the disk looks OK.

wheel · March 26, 2020

Damn.

No special reason on the old version; vaguely remember planning to upgrade around 6.6.6(?) but read about some weird stuff going on and decided to hold off for a future version. Time flew by in between then and now (unraid’s mostly a set-and-forget thing for me).

So I’m out of 6TBs but can upgrade one in another tower to an 8TB and get another 6TB to use and replace 13’s 6TB if needed.

I’m guessing these are my next steps:

(1) Confirm file integrity on D12 and D13

(2) Identify whether disk 13 has a problem or if it’s related to hotswap cage or wires or whatever (NOT sure on this one)

(3) Upgrade to last stable Unraid release OR

(3) Replace D13 and upgrade to last stable Unraid release

On the right track? Thanks for the swift help, JB!

Edited March 26, 2020 by wheel

JorgeB · March 26, 2020

Seems like a plan. If you still have the old disk12 you can run a checksum compare between both, for example with rsync.
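For anyone who would rather script the comparison than use rsync, here is a minimal Python sketch of the same idea: hash every file under the old disk and check it against the file at the same relative path on the rebuilt disk. The mount points in the usage comment are hypothetical, and SHA-256 is just one reasonable choice of checksum.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(old_root, new_root):
    """Compare every file under old_root with its counterpart under new_root.

    Returns (missing, mismatched): sorted lists of relative paths that are
    absent from new_root or whose contents differ.
    """
    old_root, new_root = Path(old_root), Path(new_root)
    missing, mismatched = [], []
    for old_file in sorted(p for p in old_root.rglob("*") if p.is_file()):
        relative = old_file.relative_to(old_root)
        new_file = new_root / relative
        if not new_file.is_file():
            missing.append(str(relative))
        elif file_digest(old_file) != file_digest(new_file):
            mismatched.append(str(relative))
    return missing, mismatched


# Hypothetical usage; both mount points are assumptions, not paths from this thread:
# missing, mismatched = compare_trees("/mnt/old_disk12", "/mnt/disk12")
```

Any path that lands in the mismatched list is a file on the rebuilt disk that should be restored from the old one. Files that exist only on the rebuilt disk are not reported by this sketch.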

Would also recommend converting all reiserfs disks; it hasn't been recommended for some time now.

wheel · March 26, 2020

Old disk 12 is still in the exact same shape, and I have an eSATA caddy on another tower I can hopefully easily use for the checksum compare on the two 12s over the network (about to do some reading on that).

Also looking into the reiserfs thing - definitely news to me, and feeling like I should be better safe than sorry on all towers during this mess. (EDIT: File juggling is going to be tough until I can get some more drives in the mail. Hopefully their being reiserfs won’t screw me too hard during the crisis if external hard drives keep getting their shipment times pushed back as non-essential.)

Any recommendations on how to confirm whether D13 needs replacing now with the unraid version still sitting at 6.3.5?

Thanks again!

Edited March 26, 2020 by wheel

JorgeB · March 26, 2020

19 minutes ago, wheel said:

Any recommendations on how to confirm whether D13 needs replacing now with the unraid version still sitting at 6.3.5?

There's a recent extended SMART test and it passed; run another one, and if it's still OK the disk should be fine.

wheel · March 27, 2020

Well, the short SMART test on D13 came back fine, but the extended's been sitting on 10% for over 2 hours now, which feels weird on a 6tb. I'm going to let it keep rolling for awhile, but I feel like this doesn't bode well for that 6tb having much life left in it.

Am I safer off replacing that 6tb (if Extended SMART fails) before upgrading unraid to a newer version? If so, since I just ran a non-correcting parity check, is any of the (now-corrupted) D12 data repairable through the old parity I haven't "corrected" yet? Or should I run a correcting parity check before replacing that 6tb?

JorgeB · March 27, 2020

8 hours ago, wheel said:

is any of the (now-corrupted) D12 data repairable through the old parity I haven't "corrected" yet? Or should I run a correcting parity check before replacing that 6tb?

Either way you'll need to check the data on disk12, so whatever you prefer.

The extended test takes several hours (2 to 3 hours per TB) and can sometimes appear stuck.
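As a rough worked example of that rule of thumb (the 2 to 3 hours per TB figure is the one quoted above; the function name is only illustrative), a 6TB disk works out to somewhere around 12 to 18 hours, so sitting at 10% after a couple of hours isn't out of line:

```python
def extended_test_hours(capacity_tb, hours_per_tb=(2, 3)):
    """Estimate (low, high) hours for an extended SMART test.

    Uses the 2 to 3 hours per TB rule of thumb quoted above; real drives vary.
    """
    low, high = hours_per_tb
    return capacity_tb * low, capacity_tb * high


low, high = extended_test_hours(6)
print(f"6TB extended test: roughly {low} to {high} hours")
```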

wheel · March 27, 2020

Extended test's at 50% now, so - holding off!

Been spot-checking D12, and already found a few files that won't open properly. Going to be a hunt, but I've got time for it.

Thanks a ton for your patience and advice in such a weird time for everyone, JB.

wheel · March 27, 2020

So Disk13 completed the extended SMART self-test without error.

Since I'm probably going to end up upgrading a handful of other disks during the course of this mess, my new concern is why Disk13 threw up read errors during the Disk12 rebuild - and how to prevent that from happening again the next time I rebuild a disk.

Any guidance on how best to trace that problem to its source and stop it from recurring would be greatly appreciated!

JorgeB · March 27, 2020

31 minutes ago, wheel said:

my new concern is why Disk13 threw up read errors during the Disk12 rebuild

Most likely a connection issue. I recommend replacing the cables (or swapping them with another disk) to rule them out; that way, if it happens again to the same disk, it's likely a disk problem despite the healthy SMART.

wheel · May 12, 2020

Back to the game - but this time, with a fully-updated unraid 6.8.3 diagnostics set!

I was writing to the rebuilt Disk 12 last night when the disk disabled itself with write errors. Had a hot spare 6TB sitting and'll be out of town later this week, so figured I'd go ahead and replace it now with a known-good 6TB.

Rebuild seemed to go fine, and I'm running the non-correcting parity check now - bam, at some point, picked up 216 sync errors. Just jumped to 217 while I was typing this. None of the errors are associated with any specific disk on the main page, but showing up at the summary at the bottom.

Diagnostics attached; should I stop the noncorrecting parity check? Any new info from the new diagnostics from an updated unraid version?

Thanks for any help!

Edit - 268 now, steadily growing a few errors at a time.

tower-diagnostics-20200511-1913.zip

Edited May 12, 2020 by wheel

JorgeB · May 12, 2020

Sync errors are likely the result of the previous issues. Run another check after this one finishes, without rebooting, and post new diags if there are more sync errors.

wheel · May 12, 2020

Sounds like a plan: check's almost done and about to start another one. Presuming it's best to run a non-correcting one to be safe - or should I run this one as correcting, then run another to see if new (vs additional) sync errors appear?

Edit: the sync errors stopped growing after they hit 299. Looks like they've stayed stable there overnight and the check's almost done, so definitely a lower volume of errors than last time Disk 12 (or its hotswap slot) started going screwy.

Edited May 12, 2020 by wheel

trurl · May 12, 2020

3 minutes ago, wheel said:

Presuming it's best to run a non-correcting one to be safe

If the correcting check doesn't fix all parity errors there is no point in another correcting check since you have some other problem to diagnose. So you follow a correcting check with a non-correcting check to verify the correcting check fixed all parity errors.

wheel · May 12, 2020

That makes sense - trick is, I haven't run a correcting check since the one back in March described above.

The check I ran after installing the replacement drive on Sunday/Monday was non-correcting, and that's the same one that's finishing up right now.

It does sound like now's the time to run a correcting parity check, with a plan to run a non-correcting check after that check (two checks total, starting this morning) to make sure I don't have a bigger issue specific to Disk 12's hotswap cage considering the consistent issues across disks that may or may not be coincidentally occurring there.

(Really, really hope I don't need to replace a middle-of-the-tower hotswap cage in a pandemic, but technically easier than moving everything to a new build...)

Thank you both for the help and guidance, JB & trurl!

trurl · May 12, 2020

2 hours ago, wheel said:

It does sound like now's the time to run a correcting parity check, with a plan to run a non-correcting check after that check (two checks total, starting this morning)

Be sure to get diagnostics in case the non-correcting check still has errors. Comparing the logs of the parity checks can sometimes help troubleshoot the problem.
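One way to do that comparison, sketched in Python: pull the sector numbers out of each check's syslog and see which ones repeat. The `sector=` pattern is an assumption about how the relevant lines look in the syslog, so adjust the regex to whatever the real lines contain. The file paths in the usage comment are hypothetical too.

```python
import re

# Assumption: the relevant log lines carry the position as "sector=<number>".
SECTOR_PATTERN = re.compile(r"sector=(\d+)")


def error_sectors(log_text, pattern=SECTOR_PATTERN):
    """Return the set of sector numbers mentioned in a syslog excerpt."""
    return {int(match.group(1)) for match in pattern.finditer(log_text)}


def compare_checks(first_log, second_log):
    """Split sectors into those seen in both checks and those seen in only one."""
    first, second = error_sectors(first_log), error_sectors(second_log)
    return {
        "both": sorted(first & second),
        "first_only": sorted(first - second),
        "second_only": sorted(second - first),
    }


# Hypothetical usage with syslogs extracted from two diagnostics zips:
# with open("check1/syslog.txt") as a, open("check2/syslog.txt") as b:
#     print(compare_checks(a.read(), b.read()))
```

Roughly speaking, the same sectors turning up in both checks suggests errors that were simply never corrected, while a different set each time points more toward a hardware problem somewhere in the chain.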

wheel · May 12, 2020

Added to the plan: extra diagnostics sets. I'll report back here with those in ~48 hours or so. Thanks a ton!

wheel · May 14, 2020

5/12 (Diagnostics After 299 Sync Errors Non-Correcting Check)

5/13 (Diagnostics After Correcting Check)

5/14 (Diagnostics After Final, Non-Correcting Check)

Hope these help figure out what's going on with the 12 slot (if anything!)

tower-diagnostics-20200514-0549-FINAL-NONC-CHK.zip tower-diagnostics-20200513-0732-AFTER-CORR-CHK.zip tower-diagnostics-20200512-1054-AFTER-299ERROR-NONC-CHK.zip

JorgeB · May 14, 2020

There were no errors on the last check so everything is fine for now. You only need to worry if you get more errors on a future check (without any unclean shutdown or other issues).

wheel · May 14, 2020

Nice. So the seeming disk-after-disk issues associated with slot #12 are probably just coincidental? Both the 166k error drive from March and the swiftly-disabled disk this month were pretty old (the latter being a white label I got maybe 4 years ago?), so it makes sense, but the recurrence of #12 issues definitely caught my attention in a single-parity setup.

JorgeB · May 14, 2020

Sync errors after disk12 rebuild are likely a result of this:

On 3/27/2020 at 6:00 PM, wheel said:

my new concern is why Disk13 threw up read errors during the Disk12 rebuild

After that all looks normal; if there are any more issues we'll need the diags from when the problem happens.

wheel · May 14, 2020

Got it. Thanks a ton!

wheel · May 31, 2020

Weird Disk12 happenings again.

I had an unclean shutdown with someone accidentally hitting the power button on my UPS that powered two unraid boxes.

One booted back up and prompted me to parity check. One (this one) weirdly gave me the option for a clean shutdown, which I took, then started back up. No visible issues, but felt paranoid, so ran a non correcting parity check before modifying any files. ~200 read errors on Disk 12. Ran correcting parity check. Tried collecting diagnostics at every possible opportunity to help see if anything weird turned up that anyone else might notice:

5-27: right after "unclean" / clean shutdown

5-29: after non-correcting parity check

5-30: after correcting parity check

tower-diagnostics-20200530-2053.zip tower-diagnostics-20200529-2000.zip tower-diagnostics-20200527-1017.zip

JorgeB · May 31, 2020

Those look like an actual disk problem, though an intermittent one; you should keep an eye on that disk.

Also not recommended to run a correcting check unless sync errors (not disk errors) are expected, and you should never run one when disk errors are expected.
