
[Solved] Help: Drive gone bad, what now?


Videodr0me


One of my drives has apparently gone bad:

[Screenshot: grafik.png]

I do not want to do anything hasty, so I left the system running. What should I do now? The following questions come to mind:

1) If I just take the array offline, what options do I have? Could I perhaps rebuild the drive from parity (in case the drive is not completely broken and this was just a one-time read error whose sector could get reallocated)?

2) If I restart the server, what would happen? Would Unraid know the drive has gone bad, or would it just start up normally in an inconsistent state?

3) How would I replace the drive? Shut down first, pull out the drive, and just put in a new one? And if I then want to try the old drive again (formatting it, adding it to the array), is that possible, or would Unraid just see that it is a bad drive and refuse to do anything?

 

The log showed this when the drive went bad:

Nov 13 10:45:02 Tower crond[1725]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null
Nov 13 14:48:30 Tower sshd[19005]: Accepted password for root from 192.168.178.37 port 54021 ssh2
Nov 13 14:52:26 Tower kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Nov 13 14:52:26 Tower kernel: ata10.00: failed command: READ DMA EXT
Nov 13 14:52:26 Tower kernel: ata10.00: cmd 25/00:10:c0:22:01/00:00:00:02:00/e0 tag 10 dma 8192 in
Nov 13 14:52:26 Tower kernel:         res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Nov 13 14:52:26 Tower kernel: ata10.00: status: { DRDY }
Nov 13 14:52:26 Tower kernel: ata10: hard resetting link
Nov 13 14:52:36 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:52:36 Tower kernel: ata10: hard resetting link
Nov 13 14:52:46 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:52:46 Tower kernel: ata10: hard resetting link
Nov 13 14:52:51 Tower sshd[19005]: syslogin_perform_logout: logout() returned an error
Nov 13 14:53:21 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:53:21 Tower kernel: ata10: limiting SATA link speed to 3.0 Gbps
Nov 13 14:53:21 Tower kernel: ata10: hard resetting link
Nov 13 14:53:26 Tower kernel: ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Nov 13 14:53:26 Tower kernel: ata10.00: link online but device misclassified
Nov 13 14:53:32 Tower kernel: ata10.00: qc timeout (cmd 0xec)
Nov 13 14:53:32 Tower kernel: ata10.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 13 14:53:32 Tower kernel: ata10.00: revalidation failed (errno=-5)
Nov 13 14:53:32 Tower kernel: ata10: hard resetting link
Nov 13 14:53:42 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:53:42 Tower kernel: ata10: hard resetting link
Nov 13 14:53:52 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:53:52 Tower kernel: ata10: hard resetting link
Nov 13 14:54:27 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:54:27 Tower kernel: ata10: limiting SATA link speed to 1.5 Gbps
Nov 13 14:54:27 Tower kernel: ata10: hard resetting link
Nov 13 14:54:32 Tower kernel: ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Nov 13 14:54:32 Tower kernel: ata10.00: link online but device misclassified
Nov 13 14:54:42 Tower kernel: ata10.00: qc timeout (cmd 0xec)
Nov 13 14:54:42 Tower kernel: ata10.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 13 14:54:42 Tower kernel: ata10.00: revalidation failed (errno=-5)
Nov 13 14:54:42 Tower kernel: ata10: hard resetting link
Nov 13 14:54:52 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:54:52 Tower kernel: ata10: hard resetting link
Nov 13 14:55:02 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:55:02 Tower kernel: ata10: hard resetting link
Nov 13 14:55:37 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:55:37 Tower kernel: ata10: hard resetting link
Nov 13 14:55:42 Tower kernel: ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Nov 13 14:55:42 Tower kernel: ata10.00: link online but device misclassified
Nov 13 14:56:13 Tower kernel: ata10.00: qc timeout (cmd 0xec)
Nov 13 14:56:13 Tower kernel: ata10.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 13 14:56:13 Tower kernel: ata10.00: revalidation failed (errno=-5)
Nov 13 14:56:13 Tower kernel: ata10.00: disabled
Nov 13 14:56:13 Tower kernel: ata10: hard resetting link
Nov 13 14:56:23 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:56:23 Tower kernel: ata10: hard resetting link
Nov 13 14:56:33 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:56:33 Tower kernel: ata10: hard resetting link
Nov 13 14:57:08 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:57:08 Tower kernel: ata10: hard resetting link
Nov 13 14:57:13 Tower kernel: ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Nov 13 14:57:13 Tower kernel: ata10.00: link online but device misclassified
Nov 13 14:57:13 Tower kernel: ata10: EH complete
Nov 13 14:57:13 Tower kernel: sd 11:0:0:0: [sdn] tag#11 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Nov 13 14:57:13 Tower kernel: sd 11:0:0:0: [sdn] tag#11 CDB: opcode=0x88 88 00 00 00 00 02 00 01 22 c0 00 00 00 10 00 00
Nov 13 14:57:13 Tower kernel: print_req_error: I/O error, dev sdn, sector 8590009024
Nov 13 14:57:13 Tower kernel: sd 11:0:0:0: [sdn] tag#12 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Nov 13 14:57:13 Tower kernel: sd 11:0:0:0: [sdn] tag#12 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Nov 13 14:57:13 Tower kernel: sd 11:0:0:0: [sdn] tag#13 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Nov 13 14:57:13 Tower kernel: sd 11:0:0:0: [sdn] tag#13 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Nov 13 14:57:13 Tower kernel: md: disk5 read error, sector=8590008960
Nov 13 14:57:13 Tower kernel: md: disk5 read error, sector=8590008968
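
As a cross-check on the log above, the failing read can be decoded by hand. The following is a minimal Python sketch, assuming the standard READ(16) layout (opcode 0x88, an 8-byte big-endian LBA in bytes 2-9, a 4-byte block count in bytes 10-13) and 512-byte logical sectors. The variable names are illustrative only.

```python
# Decode the SCSI READ(16) CDB logged for [sdn] tag#11.
# Assumed layout: byte 0 = opcode, bytes 2-9 = LBA, bytes 10-13 = block count.
cdb_hex = "88 00 00 00 00 02 00 01 22 c0 00 00 00 10 00 00"
cdb = bytes.fromhex(cdb_hex)

opcode = cdb[0]
lba = int.from_bytes(cdb[2:10], "big")
blocks = int.from_bytes(cdb[10:14], "big")

SECTOR_SIZE = 512  # assumption: logical sector size in bytes

print(f"opcode=0x{opcode:02x} lba={lba} blocks={blocks} bytes={blocks * SECTOR_SIZE}")

# Sector numbers copied from the log: device level (print_req_error)
# and array level (first "md: disk5 read error" line).
dev_sector = 8590009024
md_sector = 8590008960
partition_offset = dev_sector - md_sector
print(f"device sector - md sector = {partition_offset}")
```

The decoded LBA equals the sector in the `print_req_error` line, and 16 blocks of 512 bytes equal the `dma 8192 in` of the failed READ DMA EXT command, so all of these lines describe the same single read. The 64-sector difference to the md sector would be consistent with the data partition starting at sector 64 of the device, but that is an inference from these two numbers, not something the log states.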

 

All input welcome.

3 hours ago, johnnie.black said:

Download current diagnostics. Since the disk dropped offline, power down and check connections, then power back up and post both the pre-reboot diags and the current ones.

Yes, I did save diagnostics, but I was unsure whether a reboot without removing the drive is a good idea. If you are positive that Unraid will not get confused with the drive still in the system, then I would do a reboot now. In the best of all worlds, Unraid would see on reboot that the drive had previously dropped from the array and offer some options on how to proceed. Could you give me some info on what those options are, or whether it is safe to reboot?


Thanks! I know it might seem trivial to users who have already been through this. But it is better to make sure that a reboot is OK than to mess up.

 

OK, I did the reboot. I did not autostart the array (I usually start it manually). Unraid correctly shows a red X next to disk 5. From the SMART diagnostics, it seems the drive is back online. I attached pre-reboot and post-reboot SMART diagnostics. What now? Can I somehow rebuild onto that drive? Should I test the drive before attempting that?

 

Thanks for helping!

POST_REBOOT_ST8000AS0002-1NA17Z_Z840Z3QH-20181114-1629 disk5 (sdn) - DISK_DSBL.txt

PRE_REBOOT_ST8000AS0002-1NA17Z_Z840Z3QH-20181114-1313 disk5 (sdn) - DISK_DSBL.txt

35 minutes ago, johnnie.black said:

Forgot to say, disk is getting too hot:


Lifetime    Min/Max Temperature:     21/65 Celsius

You might want to improve cooling, disks should stay below 40C, 45C tops.
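
The temperature check described in the quote can be sketched in a few lines of Python. This is only an illustration: the regular expression is written against the single SMART line quoted above, the two thresholds are the 40C and 45C figures from the same quote, and `parse_lifetime_temps` and `classify` are hypothetical helper names.

```python
import re

# Line copied from the SMART report quoted above.
SMART_LINE = "Lifetime    Min/Max Temperature:     21/65 Celsius"

# Thresholds taken from the quoted advice: below 40C preferred, 45C tops.
PREFERRED_MAX_C = 40
HARD_MAX_C = 45


def parse_lifetime_temps(line):
    """Return (min_c, max_c) from a 'Lifetime Min/Max Temperature' line, or None."""
    match = re.search(r"Min/Max Temperature:\s*(\d+)/(\d+)\s*Celsius", line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def classify(temp_c):
    """Hypothetical helper: label a temperature against the two thresholds."""
    if temp_c < PREFERRED_MAX_C:
        return "ok"
    if temp_c <= HARD_MAX_C:
        return "warm"
    return "too hot"


temps = parse_lifetime_temps(SMART_LINE)
if temps is not None:
    print(temps, classify(temps[1]))
```

For the quoted line, the lifetime maximum of 65 falls well above the 45C ceiling.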

This temperature resulted from a failure of a fan control script. This should not happen again.
I did an extended SMART test, with no errors so far.

I mounted the array and it's running in emulated mode.

So are there any more tests I can run in Unraid on that drive?

And how exactly do I rebuild on top? I can't seem to find an option.

 

And thanks again! I am getting a lot calmer.


