[Solved] Help: Drive gone bad, what now?

Videodr0me · November 14, 2018

One of my drives has apparently gone bad:

[Screenshot: grafik.png]

I do not want to do anything hasty, so I left the system running. What should I do now? The following questions come to mind:

1) If I just take the array offline, what options do I have? Could I maybe rebuild the drive from parity (if the drive is not completely broken and this was just a one-time read error that could get reallocated)?

2) If I restart the server, what would happen? Would Unraid know the drive has gone bad, or would it just start up normally in an inconsistent state?

3) How would I replace the drive? Shut down first, pull out the drive, and just put in a new one? And if I then want to try the old drive again (formatting it, adding it to the array), is that possible, or would Unraid just see that it's a bad drive and refuse to do anything?

The log showed this when the drive went bad:

Nov 13 10:45:02 Tower crond[1725]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null
Nov 13 14:48:30 Tower sshd[19005]: Accepted password for root from 192.168.178.37 port 54021 ssh2
Nov 13 14:52:26 Tower kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Nov 13 14:52:26 Tower kernel: ata10.00: failed command: READ DMA EXT
Nov 13 14:52:26 Tower kernel: ata10.00: cmd 25/00:10:c0:22:01/00:00:00:02:00/e0 tag 10 dma 8192 in
Nov 13 14:52:26 Tower kernel: res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Nov 13 14:52:26 Tower kernel: ata10.00: status: { DRDY }
Nov 13 14:52:26 Tower kernel: ata10: hard resetting link
Nov 13 14:52:36 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:52:36 Tower kernel: ata10: hard resetting link
Nov 13 14:52:46 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:52:46 Tower kernel: ata10: hard resetting link
Nov 13 14:52:51 Tower sshd[19005]: syslogin_perform_logout: logout() returned an error
Nov 13 14:53:21 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:53:21 Tower kernel: ata10: limiting SATA link speed to 3.0 Gbps
Nov 13 14:53:21 Tower kernel: ata10: hard resetting link
Nov 13 14:53:26 Tower kernel: ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Nov 13 14:53:26 Tower kernel: ata10.00: link online but device misclassified
Nov 13 14:53:32 Tower kernel: ata10.00: qc timeout (cmd 0xec)
Nov 13 14:53:32 Tower kernel: ata10.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 13 14:53:32 Tower kernel: ata10.00: revalidation failed (errno=-5)
Nov 13 14:53:32 Tower kernel: ata10: hard resetting link
Nov 13 14:53:42 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:53:42 Tower kernel: ata10: hard resetting link
Nov 13 14:53:52 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:53:52 Tower kernel: ata10: hard resetting link
Nov 13 14:54:27 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:54:27 Tower kernel: ata10: limiting SATA link speed to 1.5 Gbps
Nov 13 14:54:27 Tower kernel: ata10: hard resetting link
Nov 13 14:54:32 Tower kernel: ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Nov 13 14:54:32 Tower kernel: ata10.00: link online but device misclassified
Nov 13 14:54:42 Tower kernel: ata10.00: qc timeout (cmd 0xec)
Nov 13 14:54:42 Tower kernel: ata10.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 13 14:54:42 Tower kernel: ata10.00: revalidation failed (errno=-5)
Nov 13 14:54:42 Tower kernel: ata10: hard resetting link
Nov 13 14:54:52 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:54:52 Tower kernel: ata10: hard resetting link
Nov 13 14:55:02 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:55:02 Tower kernel: ata10: hard resetting link
Nov 13 14:55:37 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:55:37 Tower kernel: ata10: hard resetting link
Nov 13 14:55:42 Tower kernel: ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Nov 13 14:55:42 Tower kernel: ata10.00: link online but device misclassified
Nov 13 14:56:13 Tower kernel: ata10.00: qc timeout (cmd 0xec)
Nov 13 14:56:13 Tower kernel: ata10.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 13 14:56:13 Tower kernel: ata10.00: revalidation failed (errno=-5)
Nov 13 14:56:13 Tower kernel: ata10.00: disabled
Nov 13 14:56:13 Tower kernel: ata10: hard resetting link
Nov 13 14:56:23 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:56:23 Tower kernel: ata10: hard resetting link
Nov 13 14:56:33 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:56:33 Tower kernel: ata10: hard resetting link
Nov 13 14:57:08 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:57:08 Tower kernel: ata10: hard resetting link
Nov 13 14:57:13 Tower kernel: ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Nov 13 14:57:13 Tower kernel: ata10.00: link online but device misclassified
Nov 13 14:57:13 Tower kernel: ata10: EH complete
Nov 13 14:57:13 Tower kernel: sd 11:0:0:0: [sdn] tag#11 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Nov 13 14:57:13 Tower kernel: sd 11:0:0:0: [sdn] tag#11 CDB: opcode=0x88 88 00 00 00 00 02 00 01 22 c0 00 00 00 10 00 00
Nov 13 14:57:13 Tower kernel: print_req_error: I/O error, dev sdn, sector 8590009024
Nov 13 14:57:13 Tower kernel: sd 11:0:0:0: [sdn] tag#12 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Nov 13 14:57:13 Tower kernel: sd 11:0:0:0: [sdn] tag#12 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Nov 13 14:57:13 Tower kernel: sd 11:0:0:0: [sdn] tag#13 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Nov 13 14:57:13 Tower kernel: sd 11:0:0:0: [sdn] tag#13 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Nov 13 14:57:13 Tower kernel: md: disk5 read error, sector=8590008960
Nov 13 14:57:13 Tower kernel: md: disk5 read error, sector=8590008968
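
The sector numbers in this log can be cross-checked against each other. The sketch below decodes the LBA from the libata `cmd` line and from the SCSI READ(16) CDB, then compares both with the sector the kernel reported. The function names are made up for illustration, and the 64-sector difference between the `sdn` sector and the `md` sector is read off the numbers in the log, on the assumption that it is the start offset of the data partition.

```python
def lba_from_ata_taskfile(cmd: str) -> tuple[int, int]:
    """Return (lba, sector_count) from a libata 'cmd' field for a 48-bit command.

    Field order as printed by the kernel:
    command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
    """
    _command, low, high, _device = cmd.split("/")
    _feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    _hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":")
    )
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    count = (hob_nsect << 8) | nsect
    return lba, count


def lba_from_read16_cdb(cdb_hex: str) -> int:
    """Return the 64-bit LBA from a SCSI READ(16) CDB given as hex bytes."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if cdb[0] != 0x88:
        raise ValueError("not a READ(16) CDB")
    # Bytes 2..9 hold the big-endian logical block address.
    return int.from_bytes(cdb[2:10], "big")


# Values copied from the log above.
ata_lba, ata_count = lba_from_ata_taskfile("25/00:10:c0:22:01/00:00:00:02:00/e0")
cdb_lba = lba_from_read16_cdb("88 00 00 00 00 02 00 01 22 c0 00 00 00 10 00 00")

DEVICE_SECTOR = 8590009024   # "I/O error, dev sdn, sector ..."
MD_SECTOR = 8590008960       # "md: disk5 read error, sector=..."

print(ata_lba == cdb_lba == DEVICE_SECTOR)  # True
print(ata_count * 512)                      # 8192, matching "dma 8192 in"
print(DEVICE_SECTOR - MD_SECTOR)            # 64 (assumed partition start offset)
```

All three sources point at the same 16-sector read, so the log describes one failed request that timed out, not a series of scattered bad sectors.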

All input welcome.

JorgeB · November 14, 2018

Download current diagnostics. Since the disk dropped offline, power down and check connections, then power back up and post both the pre-reboot diags and the current ones.

Videodr0me · November 14, 2018

3 hours ago, johnnie.black said:

Download current diagnostics. Since the disk dropped offline, power down and check connections, then power back up and post both the pre-reboot diags and the current ones.

Yes, I did save diagnostics, but I was unsure whether a reboot without removing the drive is a good idea. If you are positive that Unraid will not get confused with the drive still in the system, then I would do a reboot now. In the best of all worlds, Unraid would see on reboot that the drive previously dropped from the array and offer some options on how to proceed. Maybe you can give me some info on what these options are, or whether it is safe to reboot?

JorgeB · November 14, 2018

The disk dropped offline, so you need to reboot or power cycle to see if it comes back online. Checking connections is also a good idea. Unraid doesn't get confused when doing this.

Videodr0me · November 14, 2018

Thanks! I know it might seem trivial to users who have already gone through this experience, but it is better to make sure that a reboot is OK than to mess up.

OK, did the reboot. I did not autostart the array (I usually start it manually). Unraid correctly shows a red X next to disk 5. From the SMART diagnostics, it seems the drive is back online. I attached the pre-reboot and post-reboot SMART diagnostics. What now? Can I somehow rebuild onto that drive? Should I test the drive before attempting that?

Thanks for helping!

POST_REBOOT_ST8000AS0002-1NA17Z_Z840Z3QH-20181114-1629 disk5 (sdn) - DISK_DSBL.txt

PRE_REBOOT_ST8000AS0002-1NA17Z_Z840Z3QH-20181114-1313 disk5 (sdn) - DISK_DSBL.txt

JorgeB · November 14, 2018

You should post the full diagnostics, not just the SMART report, but based on that the disk looks fine and you can rebuild on top. Just make sure the emulated disk is mounting before rebuilding, or better yet use a spare to rebuild in case something goes wrong.

JorgeB · November 14, 2018

Just now, johnnie.black said:

disk looks fine

Forgot to say, disk is getting too hot:

Lifetime    Min/Max Temperature:     21/65 Celsius

You might want to improve cooling, disks should stay below 40C, 45C tops.
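
As a rough illustration of that check, the sketch below pulls the lifetime minimum and maximum out of the SMART line quoted above and compares the maximum with the 45C ceiling mentioned here. The function name and the `LIMIT_C` constant are hypothetical, and the regular expression only assumes the line layout shown in this thread.

```python
import re

LIMIT_C = 45  # upper bound suggested above ("below 40C, 45C tops")


def lifetime_temps(smart_line: str) -> tuple[int, int]:
    """Return (min, max) lifetime temperature in Celsius from a SMART report line."""
    match = re.search(
        r"Lifetime\s+Min/Max Temperature:\s+(\d+)/(\d+)\s+Celsius", smart_line
    )
    if match is None:
        raise ValueError("no lifetime temperature found in line")
    return int(match.group(1)), int(match.group(2))


line = "Lifetime    Min/Max Temperature:     21/65 Celsius"
low, high = lifetime_temps(line)
print(high > LIMIT_C)   # True
print(high - LIMIT_C)   # 20
```

The lifetime maximum only says the disk reached 65C at some point; it does not say when or for how long.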

Videodr0me · November 14, 2018

35 minutes ago, johnnie.black said:
Forgot to say, disk is getting too hot:
Lifetime    Min/Max Temperature:     21/65 Celsius
You might want to improve cooling, disks should stay below 40C, 45C tops.

This temperature resulted from a failed fan control script. This should not happen again.
I am running an extended SMART test, with no errors so far.

I mounted the array and it's running in emulated mode.

So are there any more tests I can run in Unraid on that drive?

And how exactly do I rebuild on top? I can't seem to find an option.

And thanks again! I am getting a lot calmer.

JorgeB · November 14, 2018

6 minutes ago, Videodr0me said:

So are there any more tests i can run in unraid on that drive?

If you ran an extended SMART test and it passed it should be OK.

6 minutes ago, Videodr0me said:

And how exactly do i rebuild on top? I can't seem to find an option?

https://wiki.unraid.net/Troubleshooting#Re-enable_the_drive

Videodr0me · November 16, 2018

Extended SMART test passed with no errors.

Rebuilt disk 5 as per your linked instructions.

Took 18h, now everything is back to normal.
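
For a sense of scale, the average rebuild rate can be estimated from the duration. This is only back-of-the-envelope arithmetic: it assumes the disk is a nominal 8 TB (8 × 10^12 bytes), inferred from the ST8000AS0002 model number in the attachment names, and that the 18 hours cover the whole disk. The function name is made up for illustration.

```python
def average_rate_mb_s(capacity_bytes: float, hours: float) -> float:
    """Average throughput in MB/s (decimal megabytes) over a full-disk pass."""
    return capacity_bytes / (hours * 3600) / 1e6


CAPACITY_BYTES = 8e12   # assumed nominal capacity, from the model number
REBUILD_HOURS = 18      # from the post above

rate = average_rate_mb_s(CAPACITY_BYTES, REBUILD_HOURS)
print(f"{rate:.0f} MB/s")  # 123 MB/s
```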

Again, thank you very much!
