thingie2 Posted December 9, 2020 (edited)
Yesterday morning one of my drives was disabled due to errors. How can I determine the cause of the errors, and whether this means there's an issue with the drive or an issue elsewhere? The disabled drive is fairly old, so I wouldn't be surprised if it's on its way out, but I want to be sure before doing something about it.
I've attached my diagnostics zip, but it's missing the SMART data for the disabled drive (due to it being disabled). I've now uploaded the SMART result from the failed drive after a restart. The SMART test has passed.
Attached: tower-diagnostics-20201209-0850.zip, WDC_WD4000F9YZ-09N20L0_WD-WCC5D0010919-20201209-1103.txt
Edited December 9, 2020 by thingie2: Added additional SMART data
JorgeB Posted December 9, 2020
Looks more like a connection problem. Swap both cables with another drive and see if the problem follows the drive.
thingie2 Posted December 9, 2020
Thanks, that's what I was hoping, but I wasn't sure how to determine/check. I'll add the drive back into the array and rebuild, then swap the cables round or replace them with others, and hopefully that'll prevent it happening again.
thingie2 Posted December 31, 2020
So, a follow-up on this. I've tried swapping SATA cables and changing the SATA port the drive is connected to, but I've had errors a couple of times since. The only thing I haven't tried yet is a different power cable. There are other drives on the same chain, so I think it's unlikely to be that, but I'm going to try it to determine for sure.
I've attached the diagnostics for the last couple of times it's had the problem. Do these give any further idea of what the issue might be?
Attached: tower-diagnostics-20201231-2033.zip, tower-diagnostics-20201227-1150.zip
trurl Posted December 31, 2020
On 12/9/2020 at 3:54 AM, thingie2 said: "I've attached my diagnostics zip, but it's missing the SMART data for the disabled drive (due to it being disabled)."
Diagnostics includes SMART for all attached disks, whether disabled or not. If it can't get SMART, then there is some reason other than the disk being disabled. SMART for disk3 doesn't appear in either of those latest diagnostics. Check connections.
thingie2 Posted January 1, 2021
14 hours ago, trurl said: "Diagnostics includes SMART for all attached disks whether disabled or not. If it can't get SMART then there is some reason other than disabled. SMART for disk3 doesn't appear in either of those latest diagnostics. Check connections."
I thought I'd read somewhere that if a drive is disabled, you can't get the SMART report for it until the array has been restarted, which is why I didn't worry about there being no SMART report in that log.
I'm going to open it up, check connections and swap the HDD power cable today. It just seems odd to me that, if it is a loose connection, it's the same drive that's coming loose every time.
itimpi Posted January 1, 2021
1 hour ago, thingie2 said: "I thought I read somewhere that if a drive is disabled, you can't get the SMART report for that drive until the array has been re-started, which is why I didn't worry about there not being a SMART report in that log. I'm going to open it up, check connections & swap the HDD PWR cable today, it just seems odd to me that if it is a loose connection, it's the same drive that's coming loose every time."
A drive being disabled does not mean no SMART report is available. The normal reason for no SMART report is that the drive dropped offline (and this is perhaps why it got disabled). However, after a reboot the drive will still show as disabled (pending you taking recovery action), but the drive can now be online and thus provide a SMART report.
thingie2 Posted January 1, 2021
8 minutes ago, itimpi said: "A drive being disabled does not mean no SMART report is available. The normal reason for no SMART report is the drive dropped offline (and this is perhaps why it got disabled). However after a reboot the drive will still show as disabled (pending you taking recovery action) but the drive can now be online and thus provide a SMART report."
I've just finished checking connections and changing the power cable for the drive. All connections seemed fine. Here's the SMART report for the drive now that it's available; it doesn't look like there's anything to worry about in it.
Attached: tower-smart-20210101-1352.zip
thingie2 Posted January 1, 2021
I'm thinking more and more that it's an issue with the drive. I've just had another look through the logs after noticing something at machine startup today. I get the following error during boot:
ata8: link is slow to respond, please be patient (ready=0)
Then the following errors just prior to the disk errors yesterday:
Dec 31 14:54:54 Tower kernel: ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Dec 31 14:54:54 Tower kernel: ata8.00: failed command: FLUSH CACHE EXT
Dec 31 14:54:54 Tower kernel: ata8.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 22
Dec 31 14:54:54 Tower kernel: res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Dec 31 14:54:54 Tower kernel: ata8.00: status: { DRDY }
Dec 31 14:54:54 Tower kernel: ata8: hard resetting link
Dec 31 14:55:04 Tower kernel: ata8: softreset failed (device not ready)
Dec 31 14:55:04 Tower kernel: ata8: hard resetting link
Dec 31 14:55:14 Tower kernel: ata8: softreset failed (device not ready)
Dec 31 14:55:14 Tower kernel: ata8: hard resetting link
Dec 31 14:55:25 Tower kernel: ata8: link is slow to respond, please be patient (ready=0)
Dec 31 14:55:49 Tower kernel: ata8: softreset failed (device not ready)
Dec 31 14:55:49 Tower kernel: ata8: limiting SATA link speed to 3.0 Gbps
Dec 31 14:55:49 Tower kernel: ata8: hard resetting link
Dec 31 14:55:55 Tower kernel: ata8: softreset failed (device not ready)
Dec 31 14:55:55 Tower kernel: ata8: reset failed, giving up
Dec 31 14:55:55 Tower kernel: ata8.00: disabled
Looking at the previous logs, I get similar errors; however, the errors moved with the ata# when I moved the drive. I'm no expert with this, so I'm unsure whether this means it can't find the drive (as if it's disconnected), or it knows the drive is there but the drive isn't responding properly. If it's the latter, that would point to the connections all being fine, but an issue with the drive.
Does anyone have any experience/knowledge in this area that could shed some light on this?
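When comparing several diagnostics, it can help to condense kernel messages like the ones quoted above into a per-port tally. The following is only a minimal sketch in Python, assuming syslog lines in the same format as the excerpt; the function name summarize_ata_events and the event labels are invented for illustration and are not part of Unraid or the kernel.

```python
import re
from collections import defaultdict

# Matches libata kernel messages such as "ata8: hard resetting link"
# or "ata8.00: failed command: FLUSH CACHE EXT".
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)$")

# Substrings copied from the log excerpt in the post, mapped to
# illustrative labels (the labels themselves are an assumption).
PATTERNS = {
    "failed command": "failed_command",
    "hard resetting link": "hard_reset",
    "softreset failed": "softreset_failed",
    "link is slow to respond": "slow_link",
    "limiting SATA link speed": "speed_limited",
    "reset failed, giving up": "gave_up",
    "disabled": "device_disabled",
}


def summarize_ata_events(lines):
    """Count notable libata events per port (ata8, ata9, ...)."""
    counts = defaultdict(lambda: defaultdict(int))
    for line in lines:
        match = ATA_LINE.search(line)
        if not match:
            continue  # not an ataN message (e.g. the "res ..." continuation line)
        port, message = match.groups()
        for needle, label in PATTERNS.items():
            if needle in message:
                counts[port][label] += 1
                break
    return {port: dict(events) for port, events in counts.items()}


if __name__ == "__main__":
    sample = [
        "Dec 31 14:54:54 Tower kernel: ata8.00: failed command: FLUSH CACHE EXT",
        "Dec 31 14:54:54 Tower kernel: ata8: hard resetting link",
        "Dec 31 14:55:04 Tower kernel: ata8: softreset failed (device not ready)",
        "Dec 31 14:55:55 Tower kernel: ata8: reset failed, giving up",
        "Dec 31 14:55:55 Tower kernel: ata8.00: disabled",
    ]
    print(summarize_ata_events(sample))
```

Running it over the syslog from each diagnostics zip makes it easier to see whether the same sequence of events moves from one ata port to another when the drive is moved.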
trurl Posted January 2, 2021
Run an extended SMART test on it.
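For anyone who prefers the command line to the web UI, an extended test can also be started with smartctl from smartmontools. The sketch below is a hedged illustration rather than the procedure used in this thread: it wraps `smartctl -t long` (start the extended self-test) and `smartctl -l selftest` (show the self-test log) in Python. The device path /dev/sdX is a placeholder, and the helper names are hypothetical.

```python
import subprocess


def smart_commands(device):
    """Build the smartctl command lines for an extended self-test.

    `device` is a placeholder such as "/dev/sdX"; substitute the real
    device node of the drive being tested.
    """
    return {
        # Starts the extended (long) self-test; the drive runs it in the background.
        "start_test": ["smartctl", "-t", "long", device],
        # Prints the drive's self-test log, including the result once finished.
        "show_results": ["smartctl", "-l", "selftest", device],
    }


def run_smart_command(device, name):
    """Run one of the commands above (needs root and smartmontools installed)."""
    command = smart_commands(device)[name]
    return subprocess.run(command, capture_output=True, text=True, check=False)


if __name__ == "__main__":
    # Only prints the commands; call run_smart_command() to execute them.
    for name, command in smart_commands("/dev/sdX").items():
        print(name, " ".join(command))
```

An extended test on a large drive can take many hours, so checking the self-test log afterwards is usually more practical than waiting on the command.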
thingie2 Posted January 2, 2021
11 hours ago, trurl said: "Run an extended SMART test on it."
I set one to run overnight last night. The test has completed successfully; see the attached log.
Attached: tower-smart-20210102-1200.zip
JorgeB Posted January 2, 2021
Connect that drive to the onboard SATA controller, swapping it with another one if needed. It might be a compatibility issue with the controller.
thingie2 Posted January 3, 2021
On 1/2/2021 at 12:18 PM, JorgeB said: "Connect that drive to the onboard SATA controller, swap with another one if needed, might be a compatibility issue with the controller."
I can try that. However, of the 6 data drives in my server, I only have 2 different models (4x 4TB WD SE drives and 2x 8TB WD Red drives). The drive that keeps coming up with an issue is one of the 4TB drives, and I also have another one of those drives on the same expansion card. If it were a compatibility issue with the controller, wouldn't the other drive of the same type on the same controller have the same issue (or am I oversimplifying)?
I think I'm going to wait until (if) the drive fails again, so I can determine whether the change of power cable has any effect (rather than changing 2 things together and not being sure which is the cause). Then, if it does fail, I'll try swapping it onto the onboard controller.