SMART errors - drive failing?


Hello all,

I have a 12TB drive in my array, shucked from a WD Easystore, that started giving me errors in the webui yesterday. I ran a short SMART test and it passed. Then this morning, I had a red X. I know that means the drive is disabled and usually that's a catastrophic failure (well, as far as unRAID is concerned, I think.) Anyway, I ran an extended SMART test and it's showing me UDMA CRC errors. The drive isn't that old, and who knows where the heck I put the enclosure. What are my next steps? Also, has anyone had any luck RMAing these drives? I've heard mixed results. They are OEM, so technically the warranty only applies with the enclosure, but I've also heard WD takes them back anyway.

In either case, what should I do? Order a replacement drive and get ready to swap it out? I'm a little panicked. On the bright side, there are some good Black Friday deals going on, so the timing is probably the best I can get.

 

*EDIT* I forgot to attach the Diagnostics file. Sorry about that.

blackpearl-diagnostics-20201123-1017.zip

Edited by LittleMike
Forgot attachment
Link to comment
4 minutes ago, itimpi said:

You may find this section of the online documentation on CRC errors to be of interest.

Interesting. So what that says is it's (generally) a power or cabling issue and not the drive itself. Did you happen to check my SMART logs? From what I could tell, the drive looked okay aside from those UDMA CRC errors, so it's quite possible the problem does NOT lie with the drive. That would be good news.
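To illustrate why a rising UDMA CRC count usually points at the link rather than the drive itself: each transfer over the interface carries a checksum, and the receiving end recomputes it. A mismatch means the bits changed in transit, which is what a marginal cable, connector, or backplane would cause. The sketch below uses Python's `zlib.crc32` purely as a stand-in for the SATA link CRC (the real polynomial and framing differ), and the `send_over_link` and `crc_ok` helpers and the payload are hypothetical.

```python
import zlib
from typing import Optional, Tuple


def send_over_link(payload: bytes, flip_bit: Optional[int] = None) -> Tuple[bytes, int]:
    """Return (received_payload, crc_sent); optionally flip one bit in transit."""
    crc_sent = zlib.crc32(payload)          # checksum computed by the sender
    received = bytearray(payload)
    if flip_bit is not None:
        # Simulate noise on the cable: invert a single bit of the data.
        received[flip_bit // 8] ^= 1 << (flip_bit % 8)
    return bytes(received), crc_sent


def crc_ok(received: bytes, crc_sent: int) -> bool:
    """The receiver recomputes the checksum and compares it to the one sent."""
    return zlib.crc32(received) == crc_sent


payload = b"sector data read correctly from the platter"
clean, crc_clean = send_over_link(payload)
noisy, crc_noisy = send_over_link(payload, flip_bit=13)

print(crc_ok(clean, crc_clean))   # True: nothing changed in transit
print(crc_ok(noisy, crc_noisy))   # False: this is what bumps the CRC counter
```

In both cases the data leaving the sender was fine; only the copy that arrived differs, which is why the counter by itself says little about the health of the disk.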

Though the drives are attached to a backplane in a Supermicro 2RU chassis with one of those slim platinum PSUs, so fixing that problem may actually be more of a pain. Haha.

Link to comment
44 minutes ago, LittleMike said:

I know that means the drive is disabled and usually that's a catastrophic failure (well, as far as unRAID is concerned, I think.)

Based on what usually comes to our attention, it is seldom "catastrophic" unless the user does the wrong thing.

The emulated disk is mounted and showing about half full, so I expect rebuilding to the same disk will be fine provided you don't have any problems with other disks. Are any other disks showing SMART warnings on the Dashboard page?

You can acknowledge those CRC errors by clicking on the SMART warning icon for that disk on the Dashboard, and it will warn again if the count increases.

That isn't really a lot of CRC errors. Did you notice if it had any before you had this problem?

Do you have Notifications set up to alert you immediately by email or other agent as soon as a problem is detected?
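
The "warn again if it increases" behaviour amounts to comparing the raw value of SMART attribute 199 (UDMA_CRC_Error_Count) against the last acknowledged count. Here is a minimal sketch of that comparison, assuming the attribute table layout printed by `smartctl -A`. The sample text and its numbers are made up for illustration and are not taken from the attached diagnostics, and `crc_error_count` and `needs_warning` are hypothetical helper names.

```python
from typing import Optional

# Illustrative excerpt in the layout of `smartctl -A` output (values invented).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       5120
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       12
"""


def crc_error_count(table: str) -> Optional[int]:
    """Return the raw value of attribute 199, or None if it is not listed."""
    for line in table.splitlines():
        fields = line.split()
        # Attribute rows have ten columns; the raw value is the last one.
        if len(fields) >= 10 and fields[0] == "199":
            return int(fields[9])
    return None


def needs_warning(current: int, acknowledged: int) -> bool:
    """Warn only when the count has grown since it was last acknowledged."""
    return current > acknowledged


count = crc_error_count(SAMPLE)
print(count)                       # 12
print(needs_warning(count, 12))    # False: nothing new since the acknowledgement
print(needs_warning(count, 10))    # True: two new errors since then
```

The counter never resets, so a value that stays flat after reseating a cable suggests the fix worked, while one that keeps climbing suggests the link is still marginal.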

 

Link to comment
I don't have email notification, but I do have immediate dashboard notification (on the webgui) set.

No issues on any other drives and no issues before today.

I clicked the thumb icon to acknowledge it last night and everything seemed peachy again, then I woke up to the drive being disabled this morning, so of course I panicked.

Link to comment
8 minutes ago, LittleMike said:

I don't have email notification, but I do have immediate dashboard notification (on the webgui) set.

Many people take a "set it and forget it" approach, so depending on not forgetting to check the webUI wouldn't be nearly good enough. Especially if you are running headless and don't see the webUI without opening it in a browser somewhere.

Link to comment

Safest approach is to rebuild to a new disk and keep the original in case there is a problem rebuilding. But of course that requires a spare disk. You can rebuild to the same disk if you don't have a spare and don't want to buy one. If you don't have any other hardware issues such as bad connections then rebuilding to the same disk usually is fine.

Link to comment
5 minutes ago, trurl said:

Many people take a "set it and forget it" approach, so depending on not forgetting to check the webUI wouldn't be nearly good enough. Especially if you are running headless and don't see the webUI without opening it in a browser somewhere.

I never not have the webgui up, honestly. I hadn't even thought about it. It's one of my pinned browser tabs and I never close my browser unless I need to. Haha.

Okay, I powered down the server, checked all the cable connections, reseated all the drives in their hotswap bays, and powered everything back up. It still showed disabled, so I'm going to stop the array, unassign the drive, start the array, stop the array, re-assign the drive, and try to rebuild. Cross your fingers!

*EDIT* @trurl your second reply came in after I hit send. I don't have another on hand so I'm going to attempt to rebuild to the same drive.

I do have a 14TB and a 12TB in my cart right now (the 14TB is $189 at BB right now for Black Friday). I may still purchase both and keep an eye on this one over the next few days. Then I may try to RMA it. Ugh, this is stressful.

Edited by LittleMike
reply came in while I was responding
Link to comment
Just now, trurl said:

Might even be a backplane issue.

If it was a backplane issue, all the drives would have an issue. It's a server chassis with a SATA/SAS backplane, with 5 disks total in an 8 bay chassis, all of them set to spin down at idle. I can't see it being a power issue given that a) each disk consumes 6.4W under max load, so all 5 together draw 32W, and b) none of the other drives are having an issue. Since it's a backplane all fed off the same lines, I can't see it being a cabling issue for the same reason: none of the other drives are giving me a single error.
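
The power arithmetic above, written out as a quick check. The 6.4 W per-disk figure and the count of five disks come from the post. The PSU capacity is not stated, so the `psu_watts` value below is a placeholder assumption, not the actual rating of the supply in this chassis. The calculation covers steady-state load only and ignores the higher draw while the drives spin up.

```python
WATTS_PER_DISK = 6.4   # max-load figure quoted in the post
DISK_COUNT = 5         # disks installed in the 8-bay chassis


def total_draw(watts_per_disk: float, disks: int) -> float:
    """Combined steady-state draw of all disks, in watts."""
    return watts_per_disk * disks


def headroom(psu_watts: float, load_watts: float) -> float:
    """Capacity left over after the disk load, in watts."""
    return psu_watts - load_watts


load = total_draw(WATTS_PER_DISK, DISK_COUNT)
psu_watts = 500.0      # placeholder assumption; substitute the real PSU rating

print(f"{load:.1f} W for {DISK_COUNT} disks")          # 32.0 W for 5 disks
print(f"{headroom(psu_watts, load):.1f} W headroom")   # 468.0 W headroom
```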

Link to comment
5 minutes ago, trurl said:

You could try different disk in that same bay and see if it also had CRC errors.

True. I doubt it would make a difference because of the way the controllers are designed: if one port was bad, multiple ports would be bad. Unless it's a bent pin or something stupid like that. It's a hotswap bay, so that's unlikely, but it's something I could easily confirm. Will swapping disks between bays confuse unRAID about which drives belong where in the array? I don't know how it assigns them under the hood.

Link to comment