Device is disabled - understanding the SMART report

Gog · January 24, 2020

I just noticed a disabled drive but I don't know if I have a cable issue or if the drive is dying.

Can someone with SMART knowledge read my disk report and tell me whether I should change the cable and reseat the drive, or trash the drive?

Thanks

WDC_WD40EFRX-68WT0N0_WD-WCC4EF24PX5J-20200124-0804.txt

trurl · January 24, 2020

SMART for that disk looks OK. But you should always go to Tools - Diagnostics and attach the complete diagnostics zip file to your NEXT post.

Diagnostics include SMART for all disks, syslog that might give a better idea of what happened (if you haven't rebooted), and many other things that give a more complete understanding of your situation.

I will wait on the diagnostics before making any recommendations about how to proceed.

Gog · January 25, 2020

Thanks for the reply, complete diagnostics attached.

I've had a number of CRC errors on two disks, but not this one.

tower-diagnostics-20200124-1948.zip

trurl · January 25, 2020

Most of your disks are very full, and some are still ReiserFS. Why are you logging Mover?

3 hours ago, Gog said:

I just noticed a disabled drive

Jan 15 03:53:42 Tower kernel: md: disk3 write error, sector=1953336760

Looks like disk3 got disabled Jan 15. Do you not have Notifications set up to alert you immediately by email or other agent as soon as a problem is detected?

You also have problems communicating with cache and disk1. Are these all on the same controller? Disk3 needs to be rebuilt of course, especially since it has been out of sync for more than a week. But I'm not confident about the rebuild with these other issues.

Jan  8 19:31:51 Tower kernel: ata1.00: ATA-10: KINGSTON SA400S37480G, 50026B778227C383, SBFK71B1, max UDMA/133
Jan  8 19:32:09 Tower emhttpd: import 30 cache device: (sdi) KINGSTON_SA400S37480G_50026B778227C383
Jan 23 07:08:31 Tower kernel: ata1.00: exception Emask 0x10 SAct 0x1800000 SErr 0x280100 action 0x6 frozen
Jan 23 07:08:31 Tower kernel: ata1: hard resetting link
...
Jan  8 19:31:51 Tower kernel: ata2.00: ATA-9: HGST HDN726060ALE614, K8H5GNMD, APGNW7JH, max UDMA/133
Jan  8 19:32:08 Tower kernel: md: import disk1: (sdj) HGST_HDN726060ALE614_K8H5GNMD size: 5860522532 
Jan 23 03:38:34 Tower kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen
Jan 23 03:38:34 Tower kernel: ata2: hard resetting link
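
For anyone who wants to pull the same lines out of their own syslog, here is a minimal Python sketch of the grouping done by eye above: it ties each `ataN` port to the drive model the kernel reported, then counts exceptions and link resets per port. The sample text is copied from the excerpt above; the regular expressions and the `summarize` helper are illustrative assumptions, not part of Unraid or the diagnostics zip, and real syslogs may contain line variants they do not match.

```python
import re
from collections import defaultdict

# Sample lines copied from the syslog excerpt in this post.
SYSLOG = """\
Jan  8 19:31:51 Tower kernel: ata1.00: ATA-10: KINGSTON SA400S37480G, 50026B778227C383, SBFK71B1, max UDMA/133
Jan 23 07:08:31 Tower kernel: ata1.00: exception Emask 0x10 SAct 0x1800000 SErr 0x280100 action 0x6 frozen
Jan 23 07:08:31 Tower kernel: ata1: hard resetting link
Jan  8 19:31:51 Tower kernel: ata2.00: ATA-9: HGST HDN726060ALE614, K8H5GNMD, APGNW7JH, max UDMA/133
Jan 23 03:38:34 Tower kernel: ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x280100 action 0x6 frozen
Jan 23 03:38:34 Tower kernel: ata2: hard resetting link
"""

# Hypothetical patterns written for the three line shapes shown above.
IDENT = re.compile(r"kernel: (ata\d+)\.\d+: ATA-\d+: ([^,]+),")
EXCEPTION = re.compile(r"kernel: (ata\d+)\.\d+: exception .*SErr (0x[0-9a-fA-F]+)")
RESET = re.compile(r"kernel: (ata\d+): hard resetting link")


def summarize(text):
    """Group drive model, exception count, and reset count by ata port."""
    ports = defaultdict(
        lambda: {"model": None, "exceptions": 0, "resets": 0, "serr": set()}
    )
    for line in text.splitlines():
        if m := IDENT.search(line):
            ports[m.group(1)]["model"] = m.group(2).strip()
        elif m := EXCEPTION.search(line):
            ports[m.group(1)]["exceptions"] += 1
            ports[m.group(1)]["serr"].add(m.group(2))
        elif m := RESET.search(line):
            ports[m.group(1)]["resets"] += 1
    return dict(ports)


summary = summarize(SYSLOG)
for port, info in sorted(summary.items()):
    print(port, info["model"], info["exceptions"], info["resets"], sorted(info["serr"]))
```

A port that keeps showing exceptions and hard resets while the drive's SMART attributes look fine usually points at the link (cable, connector, port) rather than the disk itself.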

Let's see if @johnnie.black is still awake and if he has anything to say about your controller or suggestions about how to proceed.

Gog · January 25, 2020

Quote

Most of your disks are very full, and some are still ReiserFS

Yes, new drives are xfs but I'm not actively migrating data. I just remove the smallest drive when I add a new one.

Quote

Why are you logging Mover?

I was tracking an odd behavior a while ago and forgot to mute the mover.

Quote

Looks like disk3 got disabled Jan 15. Do you not have Notifications set up to alert you immediately by email or other agent as soon as a problem is detected?

I do but missed that email. I did an inbox cleanup and here we are

Quote

You also have problems communicating with cache and disk1

Yes, these are the CRC errors I mentioned.

Quote

Are these all the same controller?

Not sure, I'll have to power down the server to pull the drawers to verify

trurl · January 25, 2020

12 minutes ago, Gog said:

I do but missed that email. I did an inbox cleanup and here we are

Set the Array status notification to run every day. You would have gotten a new email every day that told you "Array Health FAIL".

JorgeB · January 25, 2020

Disk1 and cache need a new SATA cable

Jan  9 03:23:49 Tower kernel: ata1: SError: { UnrecovData 10B8B BadCRC }
...
Jan 12 11:40:04 Tower kernel: ata2: SError: { UnrecovData 10B8B BadCRC }
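
The flag names in those SError lines are the decoded form of the `SErr 0x280100` value in the exception lines quoted earlier. As a rough sketch of that decoding, the Python below maps set bits to names. The bit table is an assumption based on my recollection of the Linux libata definitions and should be checked against the kernel source before relying on it; `decode_serr` is a hypothetical helper, not a kernel or Unraid tool.

```python
# Assumed SError bit positions and the short names libata prints for them.
SERR_BITS = {
    0: "RecovData", 1: "RecovComm",
    8: "UnrecovData", 9: "Persist", 10: "Proto", 11: "HostInt",
    16: "PHYRdyChg", 17: "PHYInt", 18: "CommWake", 19: "10B8B",
    20: "Dispar", 21: "BadCRC", 22: "Handshk", 23: "LinkSeq",
    24: "TrStaTrns", 25: "UnrecFIS", 26: "DevExch",
}


def decode_serr(value):
    """Return the names of all bits set in a 32-bit SError value."""
    names = []
    for bit in range(32):
        if value & (1 << bit):
            # Unknown bits are reported by position instead of guessed at.
            names.append(SERR_BITS.get(bit, f"bit{bit}"))
    return names


print(decode_serr(0x280100))  # ['UnrecovData', '10B8B', 'BadCRC']
```

The 10B8B and BadCRC flags are link-level encoding and CRC errors, which is why they point at the cable or connector first.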

It's also highly recommended to update the LSI to the latest firmware, p20.00.07.00. All earlier p20 releases have known issues, and that is possibly what got disk3 disabled.

Gog · January 26, 2020

On 1/25/2020 at 2:54 AM, johnnie.black said:
Disk1 and cache need a new SATA cable
Jan  9 03:23:49 Tower kernel: ata1: SError: { UnrecovData 10B8B BadCRC }
...
Jan 12 11:40:04 Tower kernel: ata2: SError: { UnrecovData 10B8B BadCRC }

I replaced those cables

Disk 3 is on the LSI but disk 1 and cache were not.

Quote

It's also highly recommended to update the LSI to latest firmware p20.00.07.00, all earlier p20 releases have known issues and possibly what got disk3 disabled.

I'm on p20.00.02.00 and trying to get p20.00.07.00 from a reliable source, but Supermicro's FTP is refusing connections from my IP.
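
As a small aside, comparing these firmware strings is safer done numerically than by eye. This is a minimal Python sketch assuming the version always looks like `pNN.NN.NN.NN`; `parse_fw` is a hypothetical helper, not something shipped with the LSI tools.

```python
def parse_fw(version):
    """Turn a string like 'p20.00.07.00' into a tuple such as (20, 0, 7, 0)."""
    return tuple(int(part) for part in version.lstrip("pP").split("."))


installed = parse_fw("p20.00.02.00")    # version reported in this post
recommended = parse_fw("p20.00.07.00")  # version recommended above

# Tuples compare element by element, so this is a proper version comparison.
needs_update = installed < recommended
print(needs_update)  # True
```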

These instructions are bang on except I can't get the binaries: https://www.ixsystems.com/community/threads/flashing-the-lsi2308-firmware-on-a-supermicro-x10sl7-f-motherboard.38884/

I found https://www.mediafire.com/?py9c1w5u56xytw2, which gives a procedure to upgrade the LSI SAS 9211-8i to p20.00.07.00. Do you know if the same firmware works on my controller (LSI 2308)? The Broadcom website is not really helpful.

JorgeB · January 27, 2020

You can get the package from Broadcom's support site, under legacy controllers.
