Can't read SMART data from disabled parity drive


drmit

Hi all,

 

First off thanks to everyone who contributes to this community. I started with Unraid less than two years ago after never having used a Linux-based system before, and while I haven't contributed to these forums much, I've found them immensely helpful.

 

My Unraid server is currently running 6.12.3 with three 10TB WD Red Plus drives in a single-parity array (two 10TB data disks and one 10TB parity drive). All drives are attached directly to the motherboard, so no controller cards are in use. I run a parity check on the first of every month. Last month the parity drive ended up disabled, and a SMART check of the drive showed one CRC error. I assumed it might be a one-off, so I checked my cabling, which all looked fine, and rebuilt parity onto the same drive. A SMART extended test then completed without problems, apart from the CRC error count of 1. All seemed fine until this month's parity check. This time the parity drive was disabled again, but I can't read its SMART data (possibly because I can't figure out how to get it to spin up). When I run smartctl -a /dev/sdg from the terminal, I get this response:

 

Short INQUIRY response, skip product id
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

 

I tried stopping the array, removing the disk from the array, restarting in maintenance mode, stopping the array again, and re-adding the disk as the parity drive. It appears in the drop-down list initially (as sdg), but after I select it the UI refreshes and sdg is gone from the list. The drive is listed under Unassigned Devices (so I was able to copy out the disk log information), but I can't seem to do anything with it.

 

Could anyone provide a suggestion on what my next step should be? The drives are only about 1.5 yrs old so should still be under warranty, but if there is something simple I'm missing here I'd like to get to the bottom of it rather than leave my array unprotected while the RMA runs its course (I'm in New Zealand and bought the drives from Amazon, so who knows how long the RMA will take). Could a SATA/power cable or SATA port on the motherboard be at fault here (they appear fine visually)?

 

I've attached the syslog and disk log but if any other data would be useful I can add that too.

 

Thanks for your help!

syslog.txt disk log information sdg.txt

No power splitters, and only using 3 max of the 4 SATA power connectors on any one branch from the PSU (which is new as well, Seasonic Prime PX-650). Will attempt to check the cables with a multimeter, though those pins are quite narrow. Any idea on expected resistance?

SATA power and data cables to the problem drive now checked and all seem to have good continuity on all of the pins. Resistance varies but no one pin seemed much worse than another.

 

I was finally able to figure out how to 'detach' the drive from the Settings menu in Unassigned Devices. I then re-attached, after which I was able to add it back into the array as a parity drive (it auto-populated the first parity drive slot once I re-attached it). I then rebuilt parity overnight and the check completed with no errors.

 

After 're-attaching' the drive and starting the array I was able to download the SMART log (attached), which states that the CRC error count is now 2 (it was 1 after the last failed parity check). For both errors it also says: "When the command that caused the error occurred, the device was doing SMART Offline or Self-test." Why would it say that when the error occurred during a parity check, not a SMART test?

 

Another extended SMART test is now underway.

 

I'm a bit unsure what to do next. Should I RMA the drive? Replace the cables with new ones and hope parity continues to remain valid? My data is all backed up, but ironically my backup server (TrueNAS) seems to be having hardware issues right now, so I'd prefer to not take any chances.

 

What is anyone's experience in RMAing a drive with a few CRC errors? Will WD just replace with new, or would they reject it if one of their tests seems to indicate the drive is fine?

WDC_WD101EFBX-68B0AN0_VCJW6NVP-20230904-0806.txt

Edited by drmit
added bit about extended SMART test underway
13 minutes ago, drmit said:

What is anyone's experience in RMAing a drive with a few CRC errors

They may not accept it, as CRC errors are connection errors and rarely indicate a disk problem. As long as the value is not constantly increasing, you can ignore it. The value never resets to 0, so a steady value is fine.
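
If it helps to keep an eye on whether the count is climbing, here is a minimal Python sketch that pulls the raw value of attribute 199 (usually reported as UDMA_CRC_Error_Count on SATA drives) out of `smartctl -A` output and compares it with a previously noted value. The function names and the sample text are made up for illustration (the sample is not taken from the attached log), and it assumes the standard ten-column attribute table that smartctl prints for ATA drives.

```python
import re
import subprocess

# Attribute 199 is the interface CRC error counter on most SATA drives (assumption).
CRC_ATTRIBUTE_ID = "199"


def parse_crc_count(smartctl_output):
    """Return the raw value of attribute 199 from `smartctl -A` text, or None if absent."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows begin with the numeric ID; RAW_VALUE is the tenth column.
        if len(fields) >= 10 and fields[0] == CRC_ATTRIBUTE_ID:
            match = re.match(r"\d+", fields[9])
            return int(match.group()) if match else None
    return None


def crc_increased(previous, current):
    """True when the counter has grown since the last reading."""
    return current > previous


def read_crc_count(device):
    """Run `smartctl -A <device>` and parse the result (needs smartctl and root access)."""
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True,
        text=True,
        check=False,  # smartctl uses non-zero exit codes as status bits
    )
    return parse_crc_count(result.stdout)


# Illustrative sample only, not real output from the drive in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       2
"""

current = parse_crc_count(SAMPLE)
print("CRC count:", current, "increased since last check:", crc_increased(1, current))
```

On the server itself you would call `read_crc_count("/dev/sdg")` (or whatever device name the drive has at the time) after each parity check and note the number. If it stays the same from month to month, that matches the "steady is fine" case above.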
