[SOLVED] Drive disabled after moving to a different slot

gnollo · May 19, 2020

The drive was fine, but it has now been disabled because of read errors after I moved it to another slot in my Norco caddy. I moved the drive after stopping the array, but without turning off the system.

Is there a way to tell Unraid to enable it without having to rebuild it?

Diagnostics attached.

tower-diagnostics-20200519-0921.zip

JorgeB · May 19, 2020

You could do a new config, but that would require a parity sync/check, and since the emulated disk is mounting correctly you might as well rebuild it instead.

gnollo · May 21, 2020

Now I have a different problem. I removed the drive after stopping and starting the array, then rebooted, and the boot process goes into a loop.

I can see the usual process flashing by (PCI Devices listing...) until the line "SYSLINUX 6.03 EDD ...." appears on the screen, then it flashes back to the boot screen and starts again a few times, then the process stops at that line, and I have to hard reboot (CTRL/ALT/DEL does not appear to work anymore).

Why o why did I move that drive. Why...

gnollo · May 21, 2020

It seems to be a flash problem. I formatted another flash drive with Rufus and installed a fresh copy of Unraid Server, and it boots with no issues.

I inserted the current flash drive in my laptop and it found errors. I scanned and fixed it, but it still doesn't work.

I copied off all the files I could see after the fix, and now I am reformatting it with Rufus. I unticked fast format, so now it's taking ages.

This must be connected with the messages about the flash being unreadable in the past, but it was rebooting fine and the message was appearing only from time to time. I guess I should have listened and replaced the flash straight away?

Edited May 21, 2020 by gnollo

gnollo · May 21, 2020

Mmmh. Turns out it's not strictly a flash problem. I copied the backed-up contents from the flash drive that is failing to boot over to the flash drive that booted, and that one now exhibits the same boot loop.

So I created a fresh install of Unraid with a plain config on the original flash drive that failed yesterday, copied across only the license key, and it boots fine.

Now I will try to copy across the last version of the flash I saved in December, and see if that works too.

gnollo · May 21, 2020

It booted but it has the wrong configuration of drives, as it predates my update of parity to a bigger drive.

I guess I can still do a new config and then run a parity sync/check?

gnollo · May 21, 2020

Booted fine with new configuration, running parity check now. Two drives (including 7) now show UDMA CRC error counts (4 and 55).

gnollo · May 21, 2020

Emby server is not running though. It looks like it is trying to run 3.5.3.0.

gnollo · May 21, 2020

1 hour ago, gnollo said:

Emby server is not running though. It looks like it is trying to run 3.5.3.0.

Fixed that too. I deleted library.db and now it's up and running.

gnollo · May 24, 2020

Parity check completed

Total size: 10 TB
Elapsed time: 2 days, 1 hour, 27 minutes
Current position: 10.0 TB (100.0 %)
Estimated speed: 45.0 MB/sec
Estimated finish: completed
Sync errors corrected: 488083827

Speed a bit on the low side, perhaps because of the number of sync errors?

JorgeB · May 24, 2020

24 minutes ago, gnollo said:

perhaps because of the number of sync errors?

They won't help. Were so many errors expected?

gnollo · May 24, 2020

Last time I checked parity in 2019:

Event: Unraid Parity check
Subject: Notice [TOWER] - Parity check finished (47 errors)
Description: Duration: 21 hours, 50 minutes, 34 seconds. Average speed: 101.8 MB/s
Importance: warning
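
The two runs can be compared by working out the averages from the figures quoted in this thread. This is only a rough sketch: it assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and that the reported "10 TB" total is exact. The function name is made up for illustration.

```python
from datetime import timedelta

def average_speed_mb_s(size_bytes, elapsed):
    """Average throughput in MB/s (decimal) over the whole run."""
    return size_bytes / elapsed.total_seconds() / 1e6

# 2020 run: 10 TB in 2 days, 1 hour, 27 minutes (figures from the GUI summary).
run_2020 = timedelta(days=2, hours=1, minutes=27)
speed_2020 = average_speed_mb_s(10e12, run_2020)

# 2019 run: 101.8 MB/s average over 21 h 50 min 34 s (figures from the notification).
# Multiplying back gives the amount of data that check must have covered.
run_2019 = timedelta(hours=21, minutes=50, seconds=34)
implied_size_tb = 101.8e6 * run_2019.total_seconds() / 1e12

print(f"2020 average speed: {speed_2020:.1f} MB/s")
print(f"2019 implied parity size: {implied_size_tb:.2f} TB")
```

On these assumptions the 2020 run averages roughly 56 MB/s, a little over half the 2019 figure, and the 2019 numbers imply a parity size of about 8 TB, which fits with parity having been moved to a bigger drive since then. The 45.0 MB/sec in the summary is labelled "Estimated speed", so it may not be the whole-run average.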

gnollo · May 24, 2020

4 minutes ago, johnnie.black said:

They won't help. Were so many errors expected?

I have no idea. Drive 7 was disabled at one point, and I will have made changes to the drives, so perhaps the corrections reflect the fact that that drive did not change? I will do another check in a week and see how that works out in terms of the number of sync errors.

gnollo · May 24, 2020

Also, I was still running the Emby docker, which would have impacted performance. I will turn that off next time I run a parity check.

gnollo · May 28, 2020

Things have got a lot worse since the parity check.

I could not connect to the network drives, so I logged on to the Unraid GUI:

- 1 drive disabled (drive2)

- 2 other drives with errors

- Emby server stopped

- parity has over three and a half billion errors

- CPU almost at 100%

- tried to stop the array but it's stuck on "stopping"

- won't shut down, even with a telnet connection and poweroff

The one thing the affected drives have in common is that they all reside in one of the Norco SS500 cages.

I am only using one power connector for all three Norcos in use. Should I change that?

Also on that Norco connector, I have a drive (parity) where I had to apply tape to one of the power connectors so that Unraid would recognise it on boot (I took it out of an external drive cage).

Diagnostics attached. I am thinking of a hard power-off at the mains, swapping the power connector on the affected drive, and rebooting.

I think rebuilding drive 2 is very dangerous, as I don't feel I can trust parity. I am more inclined to try to force the system to mark all the drives as good and recalculate parity from scratch (not sure how to do that though).

The problems on my server continue, it seems.

tower-diagnostics-20200528-1733.zip

gnollo · May 28, 2020

Rebooted, using two Molex connectors for each Norco cage this time.
Disk2 is unmountable, no file system.


gnollo · May 28, 2020

Array started with disk 2 disabled.
Not sure what to do next, I don't want to lose any of the data on disk2


gnollo · May 28, 2020

Fresh diagnostics

tower-diagnostics-20200528-2354.zip

JorgeB · May 29, 2020

You had multiple disks in different controllers going offline at the same time:

May 28 11:07:50 Tower kernel: sd 9:0:1:0: device_block, handle(0x000a)
May 28 11:07:53 Tower kernel: sd 9:0:1:0: device_unblock and setting to running, handle(0x000a)
May 28 11:07:53 Tower kernel: sd 9:0:1:0: [sdj] Synchronizing SCSI cache
May 28 11:07:53 Tower kernel: sd 9:0:1:0: [sdj] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
May 28 11:07:53 Tower kernel: mpt2sas_cm0: removing handle(0x000a), sas_addr(0x4433221106000000)
May 28 11:07:53 Tower kernel: mpt2sas_cm0: enclosure logical id(0x500605b004dce890), slot(5)
May 28 11:07:53 Tower rc.diskinfo[28375]: SIGHUP received, forcing refresh of disks info.
May 28 11:08:07 Tower kernel: ata5.00: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
May 28 11:08:07 Tower kernel: ata5: SError: { PHYRdyChg }
May 28 11:08:07 Tower kernel: ata5.00: failed command: WRITE DMA EXT
May 28 11:08:07 Tower kernel: ata5.00: cmd 35/00:08:c0:21:cf/00:00:b8:01:00/e0 tag 0 dma 4096 out
May 28 11:08:07 Tower kernel:         res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x14 (ATA bus error)
May 28 11:08:07 Tower kernel: ata5.00: status: { DRDY }
May 28 11:08:07 Tower kernel: ata5: hard resetting link
May 28 11:08:08 Tower kernel: ata5: SATA link down (SStatus 0 SControl 300)
May 28 11:08:14 Tower kernel: ata5: hard resetting link
May 28 11:08:14 Tower kernel: ata5: SATA link down (SStatus 0 SControl 300)
May 28 11:08:19 Tower kernel: ata5: hard resetting link
May 28 11:08:20 Tower kernel: ata5: SATA link down (SStatus 0 SControl 300)
May 28 11:08:20 Tower kernel: ata5.00: disabled
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] tag#0 Sense Key : 0x5 [current]
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] tag#0 ASC=0x21 ASCQ=0x4
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] tag#0 CDB: opcode=0x8a 8a 00 00 00 00 01 b8 cf 21 c0 00 00 00 08 00 00
May 28 11:08:20 Tower kernel: print_req_error: I/O error, dev sdf, sector 7395549632
May 28 11:08:20 Tower kernel: md: disk2 write error, sector=7395549568
May 28 11:08:20 Tower kernel: sd 4:0:0:0: rejecting I/O to offline device
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] killing request
May 28 11:08:20 Tower kernel: sd 4:0:0:0: rejecting I/O to offline device
### [PREVIOUS LINE REPEATED 1 TIMES] ###
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
May 28 11:08:20 Tower kernel: sd 4:0:0:0: rejecting I/O to offline device
May 28 11:08:20 Tower kernel: ata5: EH complete
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] CDB: opcode=0x88 88 00 00 00 00 00 03 f5 30 a8 00 00 00 08 00 00
May 28 11:08:20 Tower kernel: print_req_error: I/O error, dev sdf, sector 66400424
May 28 11:08:20 Tower kernel: ata5.00: detaching (SCSI 4:0:0:0)
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] Synchronizing SCSI cache
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] Synchronize Cache(10) failed: Result: hostbyte=0x04 driverbyte=0x00
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] Stopping disk
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] Start/Stop Unit failed: Result: hostbyte=0x04 driverbyte=0x00
May 28 11:08:20 Tower rc.diskinfo[28375]: SIGHUP received, forcing refresh of disks info.
May 28 11:08:21 Tower kernel: ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
May 28 11:08:21 Tower kernel: ata6: SError: { PHYRdyChg }
May 28 11:08:21 Tower kernel: ata6.00: failed command: READ DMA EXT
May 28 11:08:21 Tower kernel: ata6.00: cmd 25/00:08:f0:58:df/00:00:2a:00:00/e0 tag 0 dma 4096 in
May 28 11:08:21 Tower kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x14 (ATA bus error)
May 28 11:08:21 Tower kernel: ata6.00: status: { DRDY }
May 28 11:08:21 Tower kernel: ata6: hard resetting link
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=7294697728
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=7294697736
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=7294697744
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=7294697752
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=7294697760
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=7294697768
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=7294697776
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=7294697784
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=11995447336
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=66400360
May 28 11:08:21 Tower kernel: ata6: SATA link down (SStatus 0 SControl 300)
May 28 11:08:27 Tower kernel: ata6: hard resetting link
May 28 11:08:27 Tower kernel: ata6: SATA link down (SStatus 0 SControl 300)
May 28 11:08:33 Tower kernel: ata6: hard resetting link
May 28 11:08:33 Tower kernel: ata6: SATA link down (SStatus 0 SControl 300)
May 28 11:08:33 Tower kernel: ata6.00: disabled

This suggests a power/connection problem.
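
As a rough illustration of how this pattern can be spotted in a syslog, the sketch below scans for "device dropped" messages and checks whether several different ports or controllers dropped within a short window. The sample lines are copied from the excerpt above. The function names, the choice of patterns, the 60-second window and the assumed year are illustrative assumptions, not anything Unraid provides.

```python
import re
from datetime import datetime, timedelta

# A few lines copied from the excerpt above; a real run would read the whole syslog.
SAMPLE = """\
May 28 11:07:53 Tower kernel: mpt2sas_cm0: removing handle(0x000a), sas_addr(0x4433221106000000)
May 28 11:08:08 Tower kernel: ata5: SATA link down (SStatus 0 SControl 300)
May 28 11:08:20 Tower kernel: ata5.00: disabled
May 28 11:08:21 Tower kernel: ata6: SATA link down (SStatus 0 SControl 300)
May 28 11:08:33 Tower kernel: ata6.00: disabled
"""

# Messages treated as "a device dropped" (an illustrative, non-exhaustive choice).
DROP_PATTERNS = [
    re.compile(r"(mpt2sas_cm\d+): removing handle"),
    re.compile(r"(ata\d+)\.\d+: disabled"),
]

# Timestamp, hostname, then the kernel message.
LINE_RE = re.compile(r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: (.*)$")

def drop_events(text, year=2020):
    """Return (timestamp, port_or_controller) for each drop message.

    Syslog lines carry no year, so one has to be assumed.
    """
    events = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        stamp = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
        for pattern in DROP_PATTERNS:
            hit = pattern.search(m.group(2))
            if hit:
                events.append((stamp, hit.group(1)))
    return events

def clustered(events, window=timedelta(seconds=60)):
    """True if two or more distinct ports/controllers dropped within `window`."""
    events = sorted(events)
    for i, (start, _) in enumerate(events):
        names = {name for when, name in events[i:] if when - start <= window}
        if len(names) >= 2:
            return True
    return False

events = drop_events(SAMPLE)
for when, name in events:
    print(when.time(), name)
print("multiple devices dropped together:", clustered(events))
```

Several unrelated ports dropping within seconds of each other points towards something they share (power, backplane, cabling) rather than a single failing disk, though a script like this can only flag the coincidence, not identify the cause.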

gnollo · May 29, 2020

I agree, something affected that Norco controller, most likely power. I have now doubled up on power connectors for each of the two Norco SS500s.

What do I do with the disabled drive? I had to do an unclean shutdown as the tower would not stop. Did that cause filesystem corruption?

JorgeB · May 29, 2020

Unassign the disabled disk and start the array to check whether the emulated disk is mounting correctly and the contents look OK.

gnollo · May 29, 2020

The array has already started, and the drive is already emulated. Do you want me to stop the array first, then unassign the disk and restart?

JorgeB · May 29, 2020

No need if it's already started. Check that the data looks OK, and if it does you can rebuild on top.

gnollo · May 29, 2020

I stopped the array anyway, unassigned the disk and restarted. Although the content is emulated, I cannot get to disk2 as I usually do via \\tower\disk2.

And now I get read errors from disk7, which is not on the same Norco unit that was affected before.
