Videodr0me Posted November 14, 2018

One of my drives has apparently gone bad. I do not want to do anything hasty, so I left the system running. What should I do now? The following questions come to mind:

1) If I just take the array offline, what options do I have? Maybe regenerate the drive from parity (if the drive is not completely broken and this was just a one-time read error, which could get reallocated)?
2) If I restart the server, what would happen? Would Unraid know the drive has gone bad, or would it just start up normally in an inconsistent state?
3) How would I replace the drive? Shut down first, pull out the drive, and just put in the new one? And if I then want to try the old drive again (formatting, adding it to the array), is that possible, or would Unraid just see that it is a bad drive and refuse to do anything?

The log showed this when the drive went bad:

Nov 13 10:45:02 Tower crond[1725]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null
Nov 13 14:48:30 Tower sshd[19005]: Accepted password for root from 192.168.178.37 port 54021 ssh2
Nov 13 14:52:26 Tower kernel: ata10.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Nov 13 14:52:26 Tower kernel: ata10.00: failed command: READ DMA EXT
Nov 13 14:52:26 Tower kernel: ata10.00: cmd 25/00:10:c0:22:01/00:00:00:02:00/e0 tag 10 dma 8192 in
Nov 13 14:52:26 Tower kernel: res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Nov 13 14:52:26 Tower kernel: ata10.00: status: { DRDY }
Nov 13 14:52:26 Tower kernel: ata10: hard resetting link
Nov 13 14:52:36 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:52:36 Tower kernel: ata10: hard resetting link
Nov 13 14:52:46 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:52:46 Tower kernel: ata10: hard resetting link
Nov 13 14:52:51 Tower sshd[19005]: syslogin_perform_logout: logout() returned an error
Nov 13 14:53:21 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:53:21 Tower kernel: ata10: limiting SATA link speed to 3.0 Gbps
Nov 13 14:53:21 Tower kernel: ata10: hard resetting link
Nov 13 14:53:26 Tower kernel: ata10: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Nov 13 14:53:26 Tower kernel: ata10.00: link online but device misclassified
Nov 13 14:53:32 Tower kernel: ata10.00: qc timeout (cmd 0xec)
Nov 13 14:53:32 Tower kernel: ata10.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 13 14:53:32 Tower kernel: ata10.00: revalidation failed (errno=-5)
Nov 13 14:53:32 Tower kernel: ata10: hard resetting link
Nov 13 14:53:42 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:53:42 Tower kernel: ata10: hard resetting link
Nov 13 14:53:52 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:53:52 Tower kernel: ata10: hard resetting link
Nov 13 14:54:27 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:54:27 Tower kernel: ata10: limiting SATA link speed to 1.5 Gbps
Nov 13 14:54:27 Tower kernel: ata10: hard resetting link
Nov 13 14:54:32 Tower kernel: ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Nov 13 14:54:32 Tower kernel: ata10.00: link online but device misclassified
Nov 13 14:54:42 Tower kernel: ata10.00: qc timeout (cmd 0xec)
Nov 13 14:54:42 Tower kernel: ata10.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 13 14:54:42 Tower kernel: ata10.00: revalidation failed (errno=-5)
Nov 13 14:54:42 Tower kernel: ata10: hard resetting link
Nov 13 14:54:52 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:54:52 Tower kernel: ata10: hard resetting link
Nov 13 14:55:02 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:55:02 Tower kernel: ata10: hard resetting link
Nov 13 14:55:37 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:55:37 Tower kernel: ata10: hard resetting link
Nov 13 14:55:42 Tower kernel: ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Nov 13 14:55:42 Tower kernel: ata10.00: link online but device misclassified
Nov 13 14:56:13 Tower kernel: ata10.00: qc timeout (cmd 0xec)
Nov 13 14:56:13 Tower kernel: ata10.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Nov 13 14:56:13 Tower kernel: ata10.00: revalidation failed (errno=-5)
Nov 13 14:56:13 Tower kernel: ata10.00: disabled
Nov 13 14:56:13 Tower kernel: ata10: hard resetting link
Nov 13 14:56:23 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:56:23 Tower kernel: ata10: hard resetting link
Nov 13 14:56:33 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:56:33 Tower kernel: ata10: hard resetting link
Nov 13 14:57:08 Tower kernel: ata10: softreset failed (1st FIS failed)
Nov 13 14:57:08 Tower kernel: ata10: hard resetting link
Nov 13 14:57:13 Tower kernel: ata10: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Nov 13 14:57:13 Tower kernel: ata10.00: link online but device misclassified
Nov 13 14:57:13 Tower kernel: ata10: EH complete
Nov 13 14:57:13 Tower kernel: sd 11:0:0:0: [sdn] tag#11 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Nov 13 14:57:13 Tower kernel: sd 11:0:0:0: [sdn] tag#11 CDB: opcode=0x88 88 00 00 00 00 02 00 01 22 c0 00 00 00 10 00 00
Nov 13 14:57:13 Tower kernel: print_req_error: I/O error, dev sdn, sector 8590009024
Nov 13 14:57:13 Tower kernel: sd 11:0:0:0: [sdn] tag#12 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Nov 13 14:57:13 Tower kernel: sd 11:0:0:0: [sdn] tag#12 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Nov 13 14:57:13 Tower kernel: sd 11:0:0:0: [sdn] tag#13 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=0x00
Nov 13 14:57:13 Tower kernel: sd 11:0:0:0: [sdn] tag#13 CDB: opcode=0x85 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e5 00
Nov 13 14:57:13 Tower kernel: md: disk5 read error, sector=8590008960
Nov 13 14:57:13 Tower kernel: md: disk5 read error, sector=8590008968

All input welcome.
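For anyone reading the log above, the failed request can be decoded by hand. The sketch below is not from the thread; it assumes the standard READ(16) layout (opcode 0x88, an 8-byte big-endian LBA in bytes 2-9, a 4-byte transfer length in bytes 10-13) and uses the CDB string printed for tag#11. The function name `decode_read16_cdb` is made up for illustration.

```python
from typing import Tuple


def decode_read16_cdb(hex_bytes: str) -> Tuple[int, int]:
    """Return (starting LBA, transfer length in sectors) for a READ(16) CDB."""
    cdb = bytes.fromhex(hex_bytes)  # whitespace between byte pairs is ignored
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("expected a 16-byte CDB with opcode 0x88 (READ(16))")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2..9: logical block address
    length = int.from_bytes(cdb[10:14], "big")  # bytes 10..13: number of blocks
    return lba, length


# CDB copied from the tag#11 line of the log
CDB_TAG11 = "88 00 00 00 00 02 00 01 22 c0 00 00 00 10 00 00"
lba, sectors = decode_read16_cdb(CDB_TAG11)

print(lba, sectors)      # 8590009024 16
print(lba - 8590008960)  # 64
```

The decoded LBA matches the `print_req_error` sector, and 16 sectors of 512 bytes is the 8192-byte transfer shown in the `cmd` line. The 64-sector gap to the first `md: disk5 read error` sector would be consistent with the data partition starting at sector 64 of the device, and the two md sectors are 8 apart (two 4 KiB blocks), but both readings are inferences rather than something stated in the thread.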
JorgeB Posted November 14, 2018

Download the current diagnostics. Since the disk dropped offline, power down and check connections, then power back up and post both the pre-reboot diagnostics and the current ones.
Videodr0me Posted November 14, 2018

3 hours ago, johnnie.black said:
Download current diagnostics, since the disk dropped offline power down and check connections, power back up and post both the pre-reboot diags and current ones.

Yes, I did save diagnostics, but I was unsure whether rebooting without removing the drive is a good idea. If you are positive that Unraid will not get confused with the drive still in the system, then I will reboot now. In the best of all worlds, Unraid would see on reboot that the drive previously dropped from the array and offer some options on how to proceed. Can you tell me what those options are, and whether it is safe to reboot?
JorgeB Posted November 14, 2018

The disk dropped offline, so you need to reboot or power cycle to see if it comes back online. Checking connections is also a good idea. Unraid doesn't get confused when you do this.
Videodr0me Posted November 14, 2018

Thanks! I know it might seem trivial to users who have already gone through this, but it is better to make sure that a reboot is OK than to mess up.

OK, I did the reboot. I did not autostart the array (I usually do that manually). Unraid correctly shows a red X next to disk 5. From the SMART diagnostics, it seems the drive is back online. I attached pre-reboot and post-reboot SMART diagnostics.

What now? Can I somehow rebuild onto that drive? Should I test the drive before attempting that? Thanks for helping!

POST_REBOOT_ST8000AS0002-1NA17Z_Z840Z3QH-20181114-1629 disk5 (sdn) - DISK_DSBL.txt
PRE_REBOOT_ST8000AS0002-1NA17Z_Z840Z3QH-20181114-1313 disk5 (sdn) - DISK_DSBL.txt
JorgeB Posted November 14, 2018

You should post the full diagnostics, not just the SMART report. Based on the SMART report, though, the disk looks fine and you can rebuild on top of it. Just make sure the emulated disk is mounting before rebuilding, or better yet, rebuild onto a spare in case something goes wrong.
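To make "emulated disk" and "rebuild" a bit more concrete, here is a toy sketch of single-parity reconstruction. It assumes plain XOR parity across the data disks, which is how Unraid's first parity disk is generally described; the disk names and byte values are invented, and this is an illustration of the idea, not Unraid's actual code.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical three-disk array with one small block per disk
data_disks = {
    "disk1": b"\x10\x20\x30",
    "disk5": b"\xaa\xbb\xcc",
    "disk6": b"\x01\x02\x03",
}

# Parity is the XOR of all data disks
parity = xor_blocks(list(data_disks.values()))

# With disk5 disabled, its contents are emulated from parity plus the other disks
surviving = [block for name, block in data_disks.items() if name != "disk5"]
emulated_disk5 = xor_blocks(surviving + [parity])

print(emulated_disk5 == data_disks["disk5"])  # True
```

A rebuild writes the emulated contents back to a physical disk, which is why checking that the emulated disk mounts first matters: whatever the emulation shows is what the rebuilt disk will contain.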
JorgeB Posted November 14, 2018

Just now, johnnie.black said:
disk looks fine

Forgot to say, the disk is getting too hot:

Lifetime Min/Max Temperature: 21/65 Celsius

You might want to improve cooling. Disks should stay below 40C, 45C tops.
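The temperature check above can be sketched as a small script. The regular expression is written against the single report line quoted in this post, and the 40C and 45C limits are the ones given here; the function names and the three labels are made up for illustration.

```python
import re


def parse_lifetime_temps(line):
    """Extract (min, max) in Celsius from a 'Lifetime Min/Max Temperature' line."""
    match = re.search(
        r"Lifetime\s+Min/Max Temperature:\s*(-?\d+)/(-?\d+)\s*Celsius", line
    )
    if match is None:
        raise ValueError("no lifetime temperature found in line")
    return int(match.group(1)), int(match.group(2))


def classify(temp_c, target=40, ceiling=45):
    """Label a temperature using the 'below 40C, 45C tops' rule of thumb."""
    if temp_c < target:
        return "ok"
    if temp_c <= ceiling:
        return "warm"
    return "too hot"


low, high = parse_lifetime_temps("Lifetime Min/Max Temperature: 21/65 Celsius")
print(low, classify(low))    # 21 ok
print(high, classify(high))  # 65 too hot
```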
Videodr0me Posted November 14, 2018

35 minutes ago, johnnie.black said:
Forgot to say, disk is getting too hot: Lifetime Min/Max Temperature: 21/65 Celsius You might want to improve cooling, disks should stay below 40C, 45C tops.

That temperature resulted from a failure of a fan control script. This should not happen again.

I started an extended SMART test, with no errors so far. I mounted the array and it is running in emulated mode. Are there any more tests I can run in Unraid on that drive? And how exactly do I rebuild on top? I can't seem to find an option.

And thanks again! I am getting a lot calmer.
JorgeB Posted November 14, 2018

6 minutes ago, Videodr0me said:
So are there any more tests i can run in unraid on that drive?

If you ran an extended SMART test and it passed, it should be OK.

6 minutes ago, Videodr0me said:
And how exactly do i rebuild on top? I can't seem to find an option?

https://wiki.unraid.net/Troubleshooting#Re-enable_the_drive
Videodr0me Posted November 16, 2018

The extended SMART test passed with no errors. I rebuilt disk 5 as per your linked instructions. It took 18 hours, and now everything is back to normal. Again, thank you very much!
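As a rough sanity check on the 18-hour figure, the average rebuild rate can be worked out as below. The 8 TB capacity is an assumption taken from the drive model in the attachment names (ST8000AS0002), using decimal terabytes; the thread does not state the capacity or the rate.

```python
def average_rate_mb_s(capacity_bytes, hours):
    """Average throughput in MB/s (decimal megabytes) over a full-disk rebuild."""
    return capacity_bytes / (hours * 3600) / 1e6


# Assumed nominal capacity: 8 TB = 8e12 bytes; duration from the post: 18 hours
print(round(average_rate_mb_s(8e12, 18), 1))  # 123.5
```

Roughly 120 MB/s averaged over the whole surface is a plausible sustained rate for a drive of this class, so the rebuild time gives no hint of further trouble.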
This topic is now archived and is closed to further replies.