[SOLVED] Drive disabled after moving to a different slot



Now I have a different problem. I removed the drive after stopping and starting the array, rebooted, and the boot process goes in a loop.

I can see the usual startup output flashing by (PCI device listing...) until the line "SYSLINUX 6.03 EDD ...." appears on the screen, then it flashes back to the boot screen and starts again a few times, then the process stops at that line and I have to hard reboot (CTRL/ALT/DEL no longer appears to work).

Why oh why did I move that drive. Why...


It seems to be a flash problem. I formatted another flash drive with Rufus and installed a fresh copy of Unraid Server, and it boots with no issues.

Inserted the current flash drive in my laptop; it found errors, so I scanned and fixed them, but it still doesn't work.

Copied off all the files I could see after the fix, and now I am reformatting it with Rufus with quick format unticked, so it's taking ages.

This must be connected to the past messages about the flash being unreadable, but it was rebooting fine at the time and the message only appeared from time to time. I guess I should have listened and replaced the flash straight away?

Edited by gnollo

Mmmh. Turns out it's not strictly a flash problem. I copied the backed-up contents of the flash that fails to boot over to the flash that booted, and now that one exhibits the same boot loop.

So I created a fresh plain-config install of Unraid on the original flash that failed yesterday, copied across the registration key alone, and it boots fine.

Now I will try to copy across the last version of the flash I saved in December and see if that works too.
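The flow described above (fresh install, copy only the key, then try restoring the full config) can be sketched roughly as follows. This is a hedged illustration, not the exact procedure: temp directories stand in for the real mount points (the backup and the freshly written flash), and "Pro.key" is an assumed key filename.

```shell
# Sketch of the flash restore flow, with temp dirs in place of the real
# flash drives so it is safe to run anywhere. "Pro.key" is an assumption.
set -e
backup=$(mktemp -d)      # stands in for the December backup of the old flash
new_flash=$(mktemp -d)   # stands in for the freshly written Unraid flash
mkdir -p "$backup/config" "$new_flash/config"
echo "dummy-key" > "$backup/config/Pro.key"

# Step 1: copy across the registration key alone, then test booting.
cp "$backup"/config/*.key "$new_flash/config/"

# Step 2 (only if step 1 boots): restore the rest of the saved config.
cp -r "$backup/config/." "$new_flash/config/"
ls "$new_flash/config"
```

Restoring in two stages like this narrows down whether the boot loop is caused by the hardware or by something in the saved configuration.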


Parity check completed

Total size: 10 TB
Elapsed time: 2 days, 1 hour, 27 minutes
Current position: 10.0 TB (100.0 %)
Estimated speed: 45.0 MB/sec
Estimated finish: completed
Sync errors corrected: 488083827

 

Speed is a bit on the low side; perhaps because of the number of sync errors?
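As a sanity check on those figures: the simple average implied by the total size and elapsed time comes out around 56 MB/s, a bit above the displayed 45 MB/s (which is an estimate taken near the end, and checks typically slow down as they progress towards the inner tracks), so the run as a whole was not quite as slow as that figure suggests.

```python
# Back-of-envelope average throughput for the parity check quoted above.
size_bytes = 10e12                           # Total size: 10 TB
elapsed_s = 2 * 86400 + 1 * 3600 + 27 * 60   # 2 days, 1 hour, 27 minutes
avg_mb_per_s = size_bytes / elapsed_s / 1e6  # simple average in MB/s
print(round(avg_mb_per_s, 1))
```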

4 minutes ago, johnnie.black said:

They won't help, were so many errors expected?

I have no idea. Drive 7 was disabled at one point, and I made changes to the drives in the meantime, so perhaps the corrections reflect that that drive did not change? I will do another check in a week and see how that works out in terms of sync errors.
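For intuition on why a drive that fell out of sync produces corrections: with single parity, the parity block is the XOR of the corresponding blocks on every data drive, so any address written while a drive was disabled (and not subsequently rebuilt) leaves parity inconsistent there, and a correcting check rewrites it. A toy sketch, with made-up block values:

```python
# Toy single-parity model: stored parity is the XOR of all data drives.
drives = [0b1010, 0b0110, 0b1100]   # toy blocks from three data drives
parity = 0
for block in drives:
    parity ^= block                 # parity as stored for this address

drives[1] = 0b0001                  # a block changed while "disabled"
recomputed = drives[0] ^ drives[1] ^ drives[2]
sync_error = recomputed != parity   # correcting check counts and fixes this
print(sync_error)
```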


Things have got a lot worse since the parity check.

I could not connect to the network drives, so I logged on to the Unraid GUI:

- 1 drive disabled (drive2)

- 2 other drives with errors

- emby server stopped

- parity has over three and a half billion errors

- CPU almost at 100%

- tried to stop the array but it's stuck on "stopping"

- won't shut down, even with a telnet connection and poweroff

 

The one thing the affected drives have in common is that they all reside in one of the Norco SS-500 cages.

I am only using one power connector for all three Norcos in use; should I change that?

Also in that Norco cage I have a drive (parity) on which I had to tape over part of the power connector to get Unraid to recognise it on boot (I took it out of an external drive cage).

 

Diagnostics attached. I am thinking: hard power-off at the mains, swap the power connector on the affected drive, and reboot.

I think rebuilding drive 2 is very dangerous, as I don't feel I can trust parity. I am more inclined to try to force the system to mark all the drives as good and recalculate parity from scratch (not sure how to do that though).

The problems on my server continue, it seems.

 

tower-diagnostics-20200528-1733.zip


You had multiple disks on different controllers going offline at the same time:

 

May 28 11:07:50 Tower kernel: sd 9:0:1:0: device_block, handle(0x000a)
May 28 11:07:53 Tower kernel: sd 9:0:1:0: device_unblock and setting to running, handle(0x000a)
May 28 11:07:53 Tower kernel: sd 9:0:1:0: [sdj] Synchronizing SCSI cache
May 28 11:07:53 Tower kernel: sd 9:0:1:0: [sdj] Synchronize Cache(10) failed: Result: hostbyte=0x01 driverbyte=0x00
May 28 11:07:53 Tower kernel: mpt2sas_cm0: removing handle(0x000a), sas_addr(0x4433221106000000)
May 28 11:07:53 Tower kernel: mpt2sas_cm0: enclosure logical id(0x500605b004dce890), slot(5)
May 28 11:07:53 Tower rc.diskinfo[28375]: SIGHUP received, forcing refresh of disks info.
May 28 11:08:07 Tower kernel: ata5.00: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
May 28 11:08:07 Tower kernel: ata5: SError: { PHYRdyChg }
May 28 11:08:07 Tower kernel: ata5.00: failed command: WRITE DMA EXT
May 28 11:08:07 Tower kernel: ata5.00: cmd 35/00:08:c0:21:cf/00:00:b8:01:00/e0 tag 0 dma 4096 out
May 28 11:08:07 Tower kernel:         res 40/00:00:00:00:00/00:00:00:00:00/40 Emask 0x14 (ATA bus error)
May 28 11:08:07 Tower kernel: ata5.00: status: { DRDY }
May 28 11:08:07 Tower kernel: ata5: hard resetting link
May 28 11:08:08 Tower kernel: ata5: SATA link down (SStatus 0 SControl 300)
May 28 11:08:14 Tower kernel: ata5: hard resetting link
May 28 11:08:14 Tower kernel: ata5: SATA link down (SStatus 0 SControl 300)
May 28 11:08:19 Tower kernel: ata5: hard resetting link
May 28 11:08:20 Tower kernel: ata5: SATA link down (SStatus 0 SControl 300)
May 28 11:08:20 Tower kernel: ata5.00: disabled
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] tag#0 Sense Key : 0x5 [current]
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] tag#0 ASC=0x21 ASCQ=0x4
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] tag#0 CDB: opcode=0x8a 8a 00 00 00 00 01 b8 cf 21 c0 00 00 00 08 00 00
May 28 11:08:20 Tower kernel: print_req_error: I/O error, dev sdf, sector 7395549632
May 28 11:08:20 Tower kernel: md: disk2 write error, sector=7395549568
May 28 11:08:20 Tower kernel: sd 4:0:0:0: rejecting I/O to offline device
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] killing request
May 28 11:08:20 Tower kernel: sd 4:0:0:0: rejecting I/O to offline device
### [PREVIOUS LINE REPEATED 1 TIMES] ###
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] UNKNOWN(0x2003) Result: hostbyte=0x01 driverbyte=0x00
May 28 11:08:20 Tower kernel: sd 4:0:0:0: rejecting I/O to offline device
May 28 11:08:20 Tower kernel: ata5: EH complete
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] CDB: opcode=0x88 88 00 00 00 00 00 03 f5 30 a8 00 00 00 08 00 00
May 28 11:08:20 Tower kernel: print_req_error: I/O error, dev sdf, sector 66400424
May 28 11:08:20 Tower kernel: ata5.00: detaching (SCSI 4:0:0:0)
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] Synchronizing SCSI cache
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] Synchronize Cache(10) failed: Result: hostbyte=0x04 driverbyte=0x00
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] Stopping disk
May 28 11:08:20 Tower kernel: sd 4:0:0:0: [sdf] Start/Stop Unit failed: Result: hostbyte=0x04 driverbyte=0x00
May 28 11:08:20 Tower rc.diskinfo[28375]: SIGHUP received, forcing refresh of disks info.
May 28 11:08:21 Tower kernel: ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
May 28 11:08:21 Tower kernel: ata6: SError: { PHYRdyChg }
May 28 11:08:21 Tower kernel: ata6.00: failed command: READ DMA EXT
May 28 11:08:21 Tower kernel: ata6.00: cmd 25/00:08:f0:58:df/00:00:2a:00:00/e0 tag 0 dma 4096 in
May 28 11:08:21 Tower kernel:         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x14 (ATA bus error)
May 28 11:08:21 Tower kernel: ata6.00: status: { DRDY }
May 28 11:08:21 Tower kernel: ata6: hard resetting link
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=7294697728
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=7294697736
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=7294697744
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=7294697752
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=7294697760
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=7294697768
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=7294697776
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=7294697784
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=11995447336
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=66400360
May 28 11:08:21 Tower kernel: ata6: SATA link down (SStatus 0 SControl 300)
May 28 11:08:27 Tower kernel: ata6: hard resetting link
May 28 11:08:27 Tower kernel: ata6: SATA link down (SStatus 0 SControl 300)
May 28 11:08:33 Tower kernel: ata6: hard resetting link
May 28 11:08:33 Tower kernel: ata6: SATA link down (SStatus 0 SControl 300)
May 28 11:08:33 Tower kernel: ata6.00: disabled

This suggests a power/connection problem.
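When a log runs to thousands of lines, the relevant events can be picked out with a simple filter like the sketch below; the pattern list covers only the failure signatures seen in this excerpt and is an illustration, not an exhaustive set.

```python
import re

# Filter a syslog excerpt for the disk-failure signatures shown above.
# The pattern list is illustrative, not exhaustive.
syslog = """\
May 28 11:08:07 Tower kernel: ata5: SError: { PHYRdyChg }
May 28 11:08:20 Tower kernel: ata5.00: disabled
May 28 11:08:20 Tower kernel: md: disk2 write error, sector=7395549568
May 28 11:08:21 Tower kernel: md: disk2 read error, sector=7294697728
May 28 11:08:21 Tower kernel: ata6: SATA link down (SStatus 0 SControl 300)
"""
pattern = re.compile(r"SError|disabled|read error|write error|link down")
hits = [line for line in syslog.splitlines() if pattern.search(line)]
print(len(hits))
```

Grouping the hits by device (ata5, ata6, sdf, disk2) makes it easy to see that several links dropped within seconds of each other, which points to a shared cause such as power rather than individual drive failures.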

 


I agree, something affected that Norco cage, likely power. I have now doubled up on power connectors for each of the two Norco SS-500s.

What do I do with the disabled drive? I had to do an unclean shutdown as the tower would not stop. Did that cause filesystem corruption?


I stopped the array anyway, unassigned the disk, and restarted. Although the contents are emulated, I cannot get to disk2 as I usually do via \\tower\disk2.

And now I get read errors from disk7, which is not in the same Norco unit that was affected before.

  • JorgeB changed the title to [SOLVED] Drive disabled after moving to a different slot
