Help determining if drive is dying



Yesterday morning I had one of my drives get disabled due to errors. How can I determine the cause of the errors, and if this means there's an issue with the drive, or an issue elsewhere?

 

The disabled drive is fairly old, so I wouldn't be surprised if it's on its way out, but I want to be sure before doing something about it.

 

I've attached my diagnostics zip, but it's missing the SMART data for the disabled drive (due to it being disabled). I've now uploaded the SMART result from the failed drive after a restart. The SMART test has passed.

 


tower-diagnostics-20201209-0850.zip

WDC_WD4000F9YZ-09N20L0_WD-WCC5D0010919-20201209-1103.txt

Edited by thingie2
Added additional SMART data
  • 4 weeks later...

A follow-up on this: I've tried swapping SATA cables and changing the SATA port the drive is connected to, but the errors have come back a couple of times since.

 

The only thing I haven't tried yet is a different PWR cable. There are other drives on the same chain, so I think it's unlikely to be that, but I'm going to try it to rule it out.

 

I've attached the diagnostics for the last couple of times it's had the problem. Do these give any further idea on what might be the issue?

tower-diagnostics-20201231-2033.zip tower-diagnostics-20201227-1150.zip

On 12/9/2020 at 3:54 AM, thingie2 said:

I've attached my diagnostics zip, but it's missing the SMART data for the disabled drive (due to it being disabled).

Diagnostics includes SMART for all attached disks whether disabled or not. If it can't get SMART then there is some reason other than disabled.

 

SMART for disk3 doesn't appear in either of those latest diagnostics. Check connections.

14 hours ago, trurl said:

Diagnostics includes SMART for all attached disks whether disabled or not. If it can't get SMART then there is some reason other than disabled.

 

SMART for disk3 doesn't appear in either of those latest diagnostics. Check connections.

I thought I read somewhere that if a drive is disabled, you can't get the SMART report for that drive until the array has been re-started, which is why I didn't worry about there not being a SMART report in that log. I'm going to open it up, check connections & swap the HDD PWR cable today, it just seems odd to me that if it is a loose connection, it's the same drive that's coming loose every time.

1 hour ago, thingie2 said:

I thought I read somewhere that if a drive is disabled, you can't get the SMART report for that drive until the array has been re-started, which is why I didn't worry about there not being a SMART report in that log. I'm going to open it up, check connections & swap the HDD PWR cable today, it just seems odd to me that if it is a loose connection, it's the same drive that's coming loose every time.

A drive being disabled does not mean no SMART report is available.   The normal reason for no SMART report is the drive dropped offline (and this is perhaps why it got disabled).   However after a reboot the drive will still show as disabled (pending you taking recovery action) but the drive can now be online and thus provide a SMART report.

 

8 minutes ago, itimpi said:

A drive being disabled does not mean no SMART report is available.   The normal reason for no SMART report is the drive dropped offline (and this is perhaps why it got disabled).   However after a reboot the drive will still show as disabled (pending you taking recovery action) but the drive can now be online and thus provide a SMART report.

 

I've just finished checking connections & changing the PWR cable for the drive. All connections seemed fine. Here's the SMART report for the drive now that it's available; it doesn't look like there's anything to worry about in it.

tower-smart-20210101-1352.zip
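
For anyone reading a SMART report like this one, here is a minimal Python sketch that pulls the raw values out of the attribute table (the layout printed by `smartctl -A`) and lists a few attributes that are commonly checked first. The choice of watched attributes and the function names are my own assumptions, not something taken from the attached report, and a clean result here does not prove the drive is healthy.

```python
# Attributes commonly looked at first; this selection is an assumption,
# not an official list. Non-zero reallocated/pending counts tend to point
# at the media, while CRC errors more often point at cabling.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
)


def parse_smart_attributes(report_text):
    """Return {attribute_name: raw_value} from a smartctl attribute table."""
    attrs = {}
    for line in report_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10
        # columns; the raw value is the 10th column.
        if len(fields) >= 10 and fields[0].isdigit():
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                # Some raw values are not plain integers; skip those rows.
                continue
    return attrs


def nonzero_watched(attrs):
    """Return only the watched attributes whose raw value is above zero."""
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}
```

Passing the text of the saved SMART file to `parse_smart_attributes` and then to `nonzero_watched` gives an empty dict when none of the watched counters have moved.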


I'm thinking more & more that it's an issue with the drive. I've just had another look through the logs after noticing something at machine startup today. I get the following error during boot:

ata8: link is slow to respond, please be patient (ready=0)

Then the following errors just prior to the disk errors yesterday:

Dec 31 14:54:54 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 31 14:54:54 Tower kernel: ata8.00: failed command: FLUSH CACHE EXT
Dec 31 14:54:54 Tower kernel: ata8.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 22
Dec 31 14:54:54 Tower kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 31 14:54:54 Tower kernel: ata8.00: status: { DRDY }
Dec 31 14:54:54 Tower kernel: ata8: hard resetting link
Dec 31 14:55:04 Tower kernel: ata8: softreset failed (device not ready)
Dec 31 14:55:04 Tower kernel: ata8: hard resetting link
Dec 31 14:55:14 Tower kernel: ata8: softreset failed (device not ready)
Dec 31 14:55:14 Tower kernel: ata8: hard resetting link
Dec 31 14:55:25 Tower kernel: ata8: link is slow to respond, please be patient (ready=0)
Dec 31 14:55:49 Tower kernel: ata8: softreset failed (device not ready)
Dec 31 14:55:49 Tower kernel: ata8: limiting SATA link speed to 3.0 Gbps
Dec 31 14:55:49 Tower kernel: ata8: hard resetting link
Dec 31 14:55:55 Tower kernel: ata8: softreset failed (device not ready)
Dec 31 14:55:55 Tower kernel: ata8: reset failed, giving up
Dec 31 14:55:55 Tower kernel: ata8.00: disabled

Looking at the previous logs, I get similar errors; however, the errors moved to a different ata# when I moved the drive to a different port.
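
One way to check whether the errors follow the drive rather than the port is to tally the ATA events per port in each syslog and compare the results. The following is a rough Python sketch under a few assumptions: the regular expression and event labels are mine, built only from the message texts in the excerpt above, and the function name is hypothetical. It only counts lines that carry the `kernel:` prefix.

```python
import re
from collections import Counter, defaultdict

# Matches kernel lines such as:
#   "Dec 31 14:55:55 Tower kernel: ata8.00: disabled"
# and captures the port ("ata8") and the message after it.
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)$")

# Message fragments from the excerpt, mapped to short labels (labels are
# made up for this sketch).
EVENTS = {
    "failed command": "failed_command",
    "hard resetting link": "hard_reset",
    "softreset failed": "softreset_failed",
    "link is slow to respond": "slow_link",
    "limiting SATA link speed": "speed_limited",
    "reset failed, giving up": "gave_up",
    "disabled": "disabled",
}


def summarize_ata_events(lines):
    """Count known ATA error events per port from syslog lines."""
    summary = defaultdict(Counter)
    for line in lines:
        match = ATA_LINE.search(line)
        if not match:
            continue
        port, message = match.groups()
        for fragment, label in EVENTS.items():
            if fragment in message:
                summary[port][label] += 1
                break
    return {port: dict(counts) for port, counts in summary.items()}
```

Running `summarize_ata_events` over the lines of each syslog (for example, via `open(path).read().splitlines()`) and comparing which port shows the resets before and after the drive was moved would show whether the problem stays with the drive.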

 

I'm no expert with this though, so I'm unsure whether this means it can't find the drive (as if it's disconnected), or it knows the drive is there but the drive isn't responding properly. If it's the latter, that would point to the connections all being fine and the issue being with the drive. Does anyone have any experience/knowledge in this area that could shed some light on this?

On 1/2/2021 at 12:18 PM, JorgeB said:

Connect that drive to the onboard SATA controller (swap it with another one if needed); it might be a compatibility issue with the controller.

I can try that. However, of the 6 data drives in my server, I only have 2 different models (4x4TB WD SE drives & 2x8TB WD Red drives). The drive that keeps coming up with an issue is one of the 4TB drives, and I also have another one of those drives on the same expansion card. If it was a compatibility issue with the controller, wouldn't the other drive of the same type on the same controller have the same issue (or am I oversimplifying)?

 

I think I'm going to wait until (if) the drive fails again, so I can determine if the change of PWR cable has any effect (rather than changing 2 things together & not being sure which is the cause), then if it does fail, I'll try swapping it onto the onboard controller.
