[6.8.3] Stuck in loop "retry unmounting disk shares" (SOLVED: dead RAID card)

jaso · June 3, 2020

My unraid server had some trouble earlier:

Unraid Cache disk message: 03-06-2020 19:06

Warning [TOWER] - Cache pool BTRFS missing device(s)
Samsung_SSD_860_EVO_500GB_S4BENG0KC05104W (sdg)

+

Unraid Disk 6 error: 03-06-2020 19:07

Alert [TOWER] - Disk 6 in error state (disk dsbl)
WDC_WD40EZRX-00SPEB0_WD-WCC4E52UR3RJ (sdh)

+

Unraid array errors: 03-06-2020 19:07

Warning [TOWER] - array has errors
Array has 1 disk with read errors

I used Tools > Diagnostics > Download to grab all the logs and config, then tried to stop the array to do some troubleshooting. Unfortunately, I am now stuck in a constant loop of "Array Stopping • Retry unmounting disk share(s)...".

From the syslog:

Jun  3 17:55:35 Tower kernel: mdcmd (772): spindown 6
Jun  3 19:05:39 Tower kernel: ata5.00: exception Emask 0x52 SAct 0xfc0 SErr 0xffffffff action 0xe frozen
Jun  3 19:05:39 Tower kernel: ata5: SError: { RecovData RecovComm UnrecovData Persist Proto HostInt PHYRdyChg PHYInt CommWake 10B8B Dispar BadCRC Handshk LinkSeq TrStaTrns UnrecFIS DevExch }
Jun  3 19:05:39 Tower kernel: ata5.00: failed command: READ FPDMA QUEUED
Jun  3 19:05:39 Tower kernel: ata5.00: cmd 60/20:30:40:d9:08/00:00:16:00:00/40 tag 6 ncq dma 16384 in
Jun  3 19:05:39 Tower kernel:         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x56 (ATA bus error)
Jun  3 19:05:39 Tower kernel: ata5.00: status: { DRDY }
Jun  3 19:05:39 Tower kernel: ata5.00: failed command: READ FPDMA QUEUED
Jun  3 19:05:39 Tower kernel: ata5.00: cmd 60/08:38:d8:b5:6c/00:00:04:00:00/40 tag 7 ncq dma 4096 in
Jun  3 19:05:39 Tower kernel:         res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x56 (ATA bus error)
Jun  3 19:05:39 Tower kernel: ata5.00: status: { DRDY }

then a bit later in the syslog:

Jun  3 19:05:39 Tower kernel: ata5.00: status: { DRDY }
Jun  3 19:05:39 Tower kernel: ata5: hard resetting link
Jun  3 19:05:39 Tower kernel: ahci 0000:02:00.0: AHCI controller unavailable!
Jun  3 19:05:40 Tower kernel: ata5: failed to resume link (SControl FFFFFFFF)
Jun  3 19:05:40 Tower kernel: ata5: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
Jun  3 19:05:46 Tower kernel: ata5: hard resetting link
Jun  3 19:05:46 Tower kernel: ahci 0000:02:00.0: AHCI controller unavailable!
Jun  3 19:05:47 Tower kernel: ata5: failed to resume link (SControl FFFFFFFF)
Jun  3 19:05:47 Tower kernel: ata5: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
Jun  3 19:05:47 Tower kernel: ata5: limiting SATA link speed to <unknown>
Jun  3 19:05:52 Tower kernel: ata5: hard resetting link
Jun  3 19:05:52 Tower kernel: ahci 0000:02:00.0: AHCI controller unavailable!
Jun  3 19:05:53 Tower kernel: ata5: failed to resume link (SControl FFFFFFFF)
Jun  3 19:05:53 Tower kernel: ata5: SATA link down (SStatus FFFFFFFF SControl FFFFFFFF)
Jun  3 19:05:53 Tower kernel: ata5.00: disabled
Jun  3 19:05:53 Tower kernel: ahci 0000:02:00.0: AHCI controller unavailable!
Jun  3 19:05:53 Tower kernel: sd 4:0:0:0: [sdg] tag#6 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jun  3 19:05:53 Tower kernel: sd 4:0:0:0: [sdg] tag#6 Sense Key : 0x5 [current] 
Jun  3 19:05:53 Tower kernel: sd 4:0:0:0: [sdg] tag#6 ASC=0x21 ASCQ=0x4 
Jun  3 19:05:53 Tower kernel: sd 4:0:0:0: [sdg] tag#6 CDB: opcode=0x28 28 00 16 08 d9 40 00 00 20 00
Jun  3 19:05:53 Tower kernel: print_req_error: I/O error, dev sdg, sector 369678656
Jun  3 19:05:53 Tower kernel: BTRFS error (device sdg1): bdev /dev/sdg1 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
Jun  3 19:05:53 Tower kernel: sd 4:0:0:0: [sdg] tag#7 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jun  3 19:05:53 Tower kernel: sd 4:0:0:0: [sdg] tag#7 Sense Key : 0x5 [current] 
Jun  3 19:05:53 Tower kernel: sd 4:0:0:0: [sdg] tag#7 ASC=0x21 ASCQ=0x4 
Jun  3 19:05:53 Tower kernel: sd 4:0:0:0: [sdg] tag#7 CDB: opcode=0x28 28 00 04 6c b5 d8 00 00 08 00
Jun  3 19:05:53 Tower kernel: print_req_error: I/O error, dev sdg, sector 74233304
Jun  3 19:05:53 Tower kernel: BTRFS error (device sdg1): bdev /dev/sdg1 errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
Jun  3 19:05:53 Tower kernel: sd 4:0:0:0: [sdg] tag#8 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Jun  3 19:05:53 Tower kernel: sd 4:0:0:0: [sdg] tag#8 Sense Key : 0x5 [current] 
Jun  3 19:05:53 Tower kernel: sd 4:0:0:0: [sdg] tag#8 ASC=0x21 ASCQ=0x4 
Jun  3 19:05:53 Tower kernel: sd 4:0:0:0: [sdg] tag#8 CDB: opcode=0x28 28 00 05 20 56 88 00 00 08 00
Jun  3 19:05:53 Tower kernel: print_req_error: I/O error, dev sdg, sector 86005384

and then a little bit later:

Jun  3 19:05:53 Tower kernel: sd 4:0:0:0: [sdg] tag#11 CDB: opcode=0x2a 2a 00 01 de 5e 08 00 02 00 00
Jun  3 19:05:53 Tower kernel: print_req_error: I/O error, dev sdg, sector 31350280
Jun  3 19:05:53 Tower kernel: BTRFS error (device sdg1): bdev /dev/sdg1 errs: wr 3, rd 2, flush 0, corrupt 0, gen 0
Jun  3 19:05:53 Tower kernel: ata5: EH complete
Jun  3 19:05:53 Tower kernel: sd 4:0:0:0: rejecting I/O to offline device
Jun  3 19:05:53 Tower kernel: print_req_error: I/O error, dev sdg, sector 86005384
Jun  3 19:05:53 Tower kernel: BTRFS error (device sdg1): bdev /dev/sdg1 errs: wr 3, rd 3, flush 0, corrupt 0, gen 0
Jun  3 19:05:53 Tower kernel: sd 4:0:0:0: rejecting I/O to offline device
Jun  3 19:05:53 Tower kernel: print_req_error: I/O error, dev sdg, sector 75279920
Jun  3 19:05:53 Tower kernel: BTRFS error (device sdg1): bdev /dev/sdg1 errs: wr 4, rd 3, flush 0, corrupt 0, gen 0
Jun  3 19:05:53 Tower kernel: ata5.00: detaching (SCSI 4:0:0:0)
Jun  3 19:05:53 Tower kernel: print_req_error: I/O error, dev sdg, sector 75281280
Jun  3 19:05:53 Tower kernel: BTRFS error (device sdg1): bdev /dev/sdg1 errs: wr 5, rd 3, flush 0, corrupt 0, gen 0
Jun  3 19:05:53 Tower kernel: print_req_error: I/O error, dev sdg, sector 27140976
Jun  3 19:05:53 Tower kernel: BTRFS error (device sdg1): bdev /dev/sdg1 errs: wr 6, rd 3, flush 0, corrupt 0, gen 0
Jun  3 19:05:53 Tower kernel: BTRFS error (device sdg1): bdev /dev/sdg1 errs: wr 7, rd 3, flush 0, corrupt 0, gen 0
Jun  3 19:05:53 Tower kernel: BTRFS: error (device sdg1) in btrfs_commit_transaction:2267: errno=-5 IO failure (Error while writing out transaction)
Jun  3 19:05:53 Tower kernel: BTRFS info (device sdg1): forced readonly
Jun  3 19:05:53 Tower kernel: BTRFS warning (device sdg1): Skipping commit of aborted transaction.
Jun  3 19:05:53 Tower kernel: BTRFS: error (device sdg1) in cleanup_transaction:1860: errno=-5 IO failure
Jun  3 19:05:53 Tower kernel: BTRFS info (device sdg1): delayed_refs has NO entry
Jun  3 19:05:53 Tower kernel: loop: Write error at byte offset 14237696, length 4096.
Jun  3 19:05:53 Tower kernel: loop: Write error at byte offset 20107264, length 4096.
Jun  3 19:05:53 Tower kernel: loop: Write error at byte offset 2207744000, length 4096.
Jun  3 19:05:53 Tower kernel: BTRFS warning (device loop2): chunk 13631488 missing 1 devices, max tolerance is 0 for writeable mount
Jun  3 19:05:53 Tower kernel: BTRFS: error (device loop2) in write_all_supers:3716: errno=-5 IO failure (errors while submitting device barriers.)
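
For anyone reading along with a similar log, a rough way to see where the errors cluster is to tally them per device or controller. This is only a sketch: the regular expressions are copied from the log lines above, and the function name and category labels are made up for illustration.

```python
import re
from collections import Counter

# Patterns lifted from the syslog excerpts in this post.
# The category names on the left are illustrative labels, not kernel terms.
PATTERNS = {
    "controller_unavailable": re.compile(r"ahci (\S+): AHCI controller unavailable"),
    "link_down": re.compile(r"(ata\d+): SATA link down"),
    "io_error": re.compile(r"I/O error, dev (\w+), sector \d+"),
    "btrfs_error": re.compile(r"BTRFS:? error \(device (\w+)\)"),
}


def summarise(lines):
    """Count (category, subject) pairs, e.g. ('io_error', 'sdg')."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                counts[(name, match.group(1))] += 1
    return counts
```

Assuming the log lives at /var/log/syslog, `summarise(open("/var/log/syslog", errors="replace")).most_common()` lists the noisiest sources first. In the excerpt above, every line that would land in `controller_unavailable` names 0000:02:00.0, which points at the controller rather than at a single drive.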

I grabbed the syslog again, in an attempt to see what was causing the "unmounting loop":

Jun  3 20:06:47 Tower kernel: print_req_error: I/O error, dev loop2, sector 2969408
Jun  3 20:06:50 Tower emhttpd: Unmounting disks...
Jun  3 20:06:50 Tower emhttpd: shcmd (91679): umount /mnt/disk4
Jun  3 20:06:50 Tower root: umount: /mnt/disk4: target is busy.
Jun  3 20:06:50 Tower emhttpd: shcmd (91679): exit status: 32
Jun  3 20:06:50 Tower emhttpd: shcmd (91680): umount /mnt/cache
Jun  3 20:06:50 Tower root: umount: /mnt/cache: target is busy.
Jun  3 20:06:50 Tower emhttpd: shcmd (91680): exit status: 32
Jun  3 20:06:50 Tower emhttpd: Retry unmounting disk share(s)...
Jun  3 20:06:52 Tower kernel: btrfs_dev_stat_print_on_error: 110 callbacks suppressed
Jun  3 20:06:52 Tower kernel: BTRFS error (device sdg1): bdev /dev/sdg1 errs: wr 42, rd 38010, flush 0, corrupt 0, gen 0
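
"target is busy" means something still holds the mount open. A minimal sketch for finding out what, assuming a Linux /proc layout and root privileges: it walks each process's `cwd`, `root` and open file descriptors and reports those that point under the mount. The function name `holders` and the `proc_root` parameter are illustrative. It does not catch memory-mapped files, nested mounts, or loop devices backed by a file on the disk, so an empty result does not prove the mount is free.

```python
import os


def holders(mount_point, proc_root="/proc"):
    """Return {pid: [paths]} for processes with cwd, root or open fds under mount_point."""
    mount_point = os.path.normpath(mount_point)
    prefix = mount_point + os.sep
    found = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric entries are processes
        base = os.path.join(proc_root, entry)
        links = [os.path.join(base, "cwd"), os.path.join(base, "root")]
        fd_dir = os.path.join(base, "fd")
        try:
            links += [os.path.join(fd_dir, fd) for fd in os.listdir(fd_dir)]
        except OSError:
            pass  # process exited, or we lack permission to list its fds
        for link in links:
            try:
                target = os.readlink(link)
            except OSError:
                continue
            if target == mount_point or target.startswith(prefix):
                found.setdefault(int(entry), []).append(target)
    return found
```

Calling `holders("/mnt/cache")` and `holders("/mnt/disk4")` would cover the two mounts that refuse to unmount here. Tools such as `lsof` or `fuser -vm` do the same job more thoroughly where they are installed.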

I'd prefer a graceful shutdown rather than a hard restart. Has anyone got any ideas on how to unmount disk4 and my cache?

Kind Regards,

jaso

Edited June 3, 2020 by jaso
update title to mark as solved

itimpi · June 3, 2020

I do not think you can achieve a graceful shutdown: you are getting hardware errors on some drives (quite likely cable-type connection issues), so those drives will never finish unmounting.

jaso · June 3, 2020

1 minute ago, itimpi said:

I do not think you can achieve a graceful shutdown: you are getting hardware errors on some drives (quite likely cable-type connection issues), so those drives will never finish unmounting.

Thanks itimpi.

Hard reboot time :-(

jaso · June 3, 2020

Figured out what the problem was. Dead RAID card.

/mnt/disk6 and /mnt/cache were both being served by a generic 2-port SATA card. It just up and died after 5 years of top-notch service.
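
In hindsight, a quick way to spot that two failing disks share a controller is to group them by the PCI device they hang off. A rough sketch, assuming the usual Linux sysfs layout in which the resolved path of /sys/block/sdX contains the PCI addresses of the bridge and controller in front of the disk; the function names are made up for illustration.

```python
import os
import re

# PCI address as it appears in sysfs paths, e.g. 0000:02:00.0
PCI_ADDR = re.compile(r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]")


def controller_of(sysfs_path):
    """Return the last PCI address in a resolved sysfs path (the device nearest the disk)."""
    matches = PCI_ADDR.findall(sysfs_path)
    return matches[-1] if matches else None


def disks_by_controller(sys_block="/sys/block"):
    """Group sd* block devices by the PCI address of their controller."""
    groups = {}
    for name in sorted(os.listdir(sys_block)):
        if not name.startswith("sd"):
            continue  # skip loop, md, nvme and similar devices
        real = os.path.realpath(os.path.join(sys_block, name))
        groups.setdefault(controller_of(real), []).append(name)
    return groups
```

If the disks behind both alerts show up under the same address as the one in the "AHCI controller unavailable" lines, the card is a more likely culprit than either drive or its cable.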

I'll have to wait a few days for a new RAID card to arrive. In the meantime, I've moved my cache SSD to another SATA port, and /mnt/disk6 is being emulated for now...

Cheers,

jaso

Edited June 7, 2020 by jaso
typo
