6.3.5: Disk Died; Replaced; ParityChk; 166k Errors on Other Disk


Some history on this tower:

So the LSI controllers have been in since December and doing fine. I've successfully upgraded at least three (maybe more) 4TB drives to 6TB in the time since, always running a parity check before and a non-correcting parity check after.

 

This is the first time since the LSI cards came in that I had a disk die on me (Disk 12, 2 sync errors in the GUI and the drive automatically disconnected). I was maybe 2 days max away from upgrading a random 4TB to a 6TB for space reasons anyway, so I went ahead and put the 6TB in and started the rebuild process. I'd been steadily adding files over the past month and a half or so since my last upgrade (and last parity check), but weirdly not many to the disk that's now showing 166k errors (Disk 13).

 

The first half or so of the parity check had zero errors. I checked it with about 3 hours left and saw the 166k errors, but let the check run to completion. No more errors popped up in the last 3 hours; the sync-error disk (13) isn't disabled or marked in any negative way beyond the error count, and all files (including the ones added to that disk during the ~45 days without a parity check) still seem to open fine.

 

With all these factors in play, any suggestions on next steps here? Got a feeling hardware replacements are going to be a pain in this environment, but I’m swimming in free time if there are some time-intensive steps I can take to figure out what’s going wrong here and get things back to normal.

 

Thanks in advance for any help or guidance!

tower-diagnostics-20200326-0913.zip

There were read errors on disk13 during the rebuild, which means disk12 is partially corrupt.

 

Any special reason for being on such an old release? Those diags aren't as complete as the ones from current releases, so it's hard to say for sure whether it was a disk problem, but from what I can see the disk looks OK.
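To see why read errors on disk13 corrupt the rebuild: with single parity, the replacement disk12 is reconstructed as the bytewise XOR of parity and every other data disk, so whatever disk13 returned during the rebuild, good or garbled, gets written into the new disk12. A tiny sketch with made-up byte values:

```shell
# Single-parity rebuild: missing disk = parity XOR all other data disks.
# The byte values below are made up purely for illustration.
d12=$((0x5A)); d13=$((0x3C))
parity=$((d12 ^ d13))       # parity byte written during normal operation
rebuilt=$((parity ^ d13))   # good read of d13 -> original d12 byte back
corrupt=$((parity ^ 0x00))  # garbled read of d13 -> wrong byte in new d12
printf 'rebuilt=%#04x corrupt=%#04x\n' "$rebuilt" "$corrupt"
```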


Damn.

 

No special reason for staying on the old version; I vaguely remember planning to upgrade around 6.6.6(?), but I read about some weird stuff going on and decided to hold off for a future version. Time flew by between then and now (Unraid's mostly a set-and-forget thing for me).

 

So I'm out of spare 6TBs, but I can upgrade a drive in another tower to an 8TB, freeing up a 6TB to replace Disk 13's if needed.

 

I'm guessing these are my next steps:

(1) Confirm file integrity on Disk 12 and Disk 13

(2) Figure out whether Disk 13 itself has a problem, or whether it's the hotswap cage, wiring, or something else (not sure how to check this one)

(3) Either upgrade straight to the latest stable Unraid release, OR replace Disk 13 first and then upgrade

 

On the right track? Thanks for the swift help, JB!

Edited by wheel


Seems like a plan. If you still have the old disk12 you can run a checksum compare between the two, for example with rsync.
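For example (the real mount points depend on where you attach the old disk, so the /mnt paths in the comment are placeholders; the demo below runs against two scratch directories so the behavior is visible):

```shell
# Checksum-only compare: -r recurse, -c compare file contents by checksum
# instead of size/date, -n dry run (report only, change nothing).
# Real use would look like: rsync -rcn --out-format='%n' /mnt/old12/ /mnt/disk12/
# Demo on two throwaway directories:
old=$(mktemp -d); new=$(mktemp -d)
printf 'good\n' > "$old/intact.mkv";  cp "$old/intact.mkv" "$new/intact.mkv"
printf 'good\n' > "$old/damaged.mkv"; printf 'bad\n' > "$new/damaged.mkv"
# --out-format='%n' prints the name of each file whose contents differ
out=$(rsync -rcn --out-format='%n' "$old"/ "$new"/)
echo "$out"
rm -rf "$old" "$new"
```

Only files whose checksums differ show up in the output, so a clean run means the rebuilt disk matches the old one.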

 

Would also recommend converting all reiserfs disks; that filesystem hasn't been recommended for some time now.


Old disk 12 is still in the exact same shape, and I have an eSATA caddy on another tower I can hopefully easily use for the checksum compare on the two 12s over the network (about to do some reading on that).

 

Also looking into the reiserfs thing - definitely news to me, and feeling like I should be better safe than sorry on all towers during this mess. (EDIT: File juggling is going to be tough until I can get some more drives in the mail. Hopefully their being reiserfs won’t screw me too hard during the crisis if external hard drives keep getting their shipment times pushed back as non-essential.)

 

Any recommendations on how to confirm whether D13 needs replacing now with the unraid version still sitting at 6.3.5?

 

Thanks again!

Edited by wheel

19 minutes ago, wheel said:

Any recommendations on how to confirm whether D13 needs replacing now with the unraid version still sitting at 6.3.5?

There's a recent extended SMART test in the diags and it passed; run another one, and if it still passes the disk should be fine.
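In case it helps while waiting: the test can also be started and polled from the console with smartctl (the /dev/sdX device name is a placeholder, check your assignments). The remaining-percentage line can be pulled out of the status output like this; a canned sample of smartctl output is inlined so the parsing is visible:

```shell
# Start and poll an extended SMART self-test (device name is an assumption):
#   smartctl -t long /dev/sdX    # kick off the extended self-test
#   smartctl -c /dev/sdX         # status section reports percent remaining
# Extracting the progress line, demonstrated on sample smartctl output:
sample='Self-test execution status:      ( 249) Self-test routine in
                                        progress...
                                        90% of test remaining.'
progress=$(printf '%s\n' "$sample" | grep -o '[0-9]\+% of test remaining')
echo "$progress"
```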


Well, the short SMART test on D13 came back fine, but the extended test has been sitting at 10% for over 2 hours now, which feels weird on a 6TB. I'm going to let it keep rolling for a while, but I feel like this doesn't bode well for that drive having much life left in it.

 

Am I better off replacing that 6TB (if the extended SMART fails) before upgrading Unraid to a newer version? If so, since I just ran a non-correcting parity check, is any of the (now-corrupted) D12 data repairable through the old parity I haven't "corrected" yet? Or should I run a correcting parity check before replacing that 6TB?

8 hours ago, wheel said:

is any of the (now-corrupted) D12 data repairable through the old parity I haven't "corrected" yet? Or should I run a correcting parity check before replacing that 6tb?

Either way you'll need to check the data on disk12, so whatever you prefer.

 

The extended test takes several hours (2 to 3 hours per TB) and can sometimes appear stuck.


Extended test's at 50% now, so - holding off!

 

Been spot-checking D12, and already found a few files that won't open properly. Going to be a hunt, but I've got time for it.

 

Thanks a ton for your patience and advice in such a weird time for everyone, JB.


So Disk13 completed the extended SMART self-test without error.

 

Since I'm probably going to end up upgrading a handful of other disks during the course of this mess, my new concern is why Disk13 threw up read errors during the Disk12 rebuild - and how to prevent that from happening again the next time I rebuild a disk.

 

Any guidance on how best to trace that problem to its source and stop it from recurring would be greatly appreciated!

31 minutes ago, wheel said:

my new concern is why Disk13 threw up read errors during the Disk12 rebuild

Most likely a connection issue; I recommend replacing the cables (or swapping slots with another disk) to rule them out. That way, if it happens again on the same disk, it's likely a disk problem despite the healthy SMART.
