Unraid 6.12.3 crashing, docker service is unavailable


jdndm
Solved by JorgeB

Hi,

 

My unraid 6.12.3 server is crashing at random intervals. Sometimes it crashes and brings down the UI, but I've managed to catch the last couple of crashes and extract diagnostic files. I suspect that my cache drive is causing the errors, but I'm really not sure. So far I have replaced the power and SATA cables, tried 2 different SATA ports on the mobo, and connected the cache drive to the HBA. I'm at a point where I'm sure something is failing and needs to be replaced, but I'm not sure what. I suspect it's the cache drive, the HBA or the onboard SATA controller.

 

Hardware:
- Supermicro X8DTL
- 8GB RAM
- LSI SAS2008 HBA
- 2 x 16TB Seagate Ironwolf 
- 2 x 8TB WD Red
- 2 x 4TB WD Red
- 1 x 3TB WD Green
- 1 x 500GB MX500 (cache)
- 1 x 320GB Seagate Momentus

 

Here's the section of the logs from when the server crashed:

Sep 23 06:29:59 Tower kernel: md: sync done. time=114533sec
Sep 23 06:29:59 Tower kernel: md: recovery thread: exit status: 0
Sep 24 00:15:26 Tower kernel: ata2.00: exception Emask 0x0 SAct 0x41 SErr 0x0 action 0x6 frozen
Sep 24 00:15:26 Tower kernel: ata2.00: failed command: READ FPDMA QUEUED
Sep 24 00:15:26 Tower kernel: ata2.00: cmd 60/40:00:e0:1a:1f/00:00:00:00:00/40 tag 0 ncq dma 32768 in
Sep 24 00:15:26 Tower kernel:         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 24 00:15:26 Tower kernel: ata2.00: status: { DRDY }
Sep 24 00:15:26 Tower kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Sep 24 00:15:26 Tower kernel: ata2.00: cmd 61/40:30:40:48:51/00:00:00:00:00/40 tag 6 ncq dma 32768 out
Sep 24 00:15:26 Tower kernel:         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 24 00:15:26 Tower kernel: ata2.00: status: { DRDY }
Sep 24 00:15:26 Tower kernel: ata2: hard resetting link
Sep 24 00:15:31 Tower kernel: ata2: link is slow to respond, please be patient (ready=0)
Sep 24 00:15:36 Tower kernel: ata2: COMRESET failed (errno=-16)
Sep 24 00:15:36 Tower kernel: ata2: hard resetting link
Sep 24 00:15:41 Tower kernel: ata2: link is slow to respond, please be patient (ready=0)
Sep 24 00:15:46 Tower kernel: ata2: COMRESET failed (errno=-16)
Sep 24 00:15:46 Tower kernel: ata2: hard resetting link
Sep 24 00:15:51 Tower kernel: ata2: link is slow to respond, please be patient (ready=0)
Sep 24 00:16:21 Tower kernel: ata2: COMRESET failed (errno=-16)
Sep 24 00:16:21 Tower kernel: ata2: limiting SATA link speed to 1.5 Gbps
Sep 24 00:16:21 Tower kernel: ata2: hard resetting link
Sep 24 00:16:26 Tower kernel: ata2: COMRESET failed (errno=-16)
Sep 24 00:16:26 Tower kernel: ata2: reset failed, giving up
Sep 24 00:16:26 Tower kernel: ata2.00: disable device
Sep 24 00:16:26 Tower kernel: ata2: EH complete
Sep 24 00:16:26 Tower kernel: sd 2:0:0:0: [sdb] tag#6 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=0s
Sep 24 00:16:26 Tower kernel: sd 2:0:0:0: [sdb] tag#7 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=0s
Sep 24 00:16:26 Tower kernel: sd 2:0:0:0: [sdb] tag#6 CDB: opcode=0x28 28 00 13 3f 37 60 00 00 08 00
Sep 24 00:16:26 Tower kernel: sd 2:0:0:0: [sdb] tag#7 CDB: opcode=0x28 28 00 13 41 ca 40 00 00 08 00
Sep 24 00:16:26 Tower kernel: sd 2:0:0:0: [sdb] tag#12 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=0s
Sep 24 00:16:26 Tower kernel: I/O error, dev sdb, sector 322910048 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
Sep 24 00:16:26 Tower kernel: I/O error, dev sdb, sector 323078720 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
Sep 24 00:16:26 Tower kernel: sd 2:0:0:0: [sdb] tag#13 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=0s
Sep 24 00:16:26 Tower kernel: sd 2:0:0:0: [sdb] tag#4 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=90s
Sep 24 00:16:26 Tower kernel: sd 2:0:0:0: [sdb] tag#16 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=0s
Sep 24 00:16:26 Tower kernel: sd 2:0:0:0: [sdb] tag#13 CDB: opcode=0x28 28 00 12 8b 3b 88 00 00 08 00
Sep 24 00:16:26 Tower kernel: sd 2:0:0:0: [sdb] tag#4 CDB: opcode=0x93 93 08 00 00 00 00 05 9f bb f8 00 00 f4 08 00 00
Sep 24 00:16:26 Tower kernel: sd 2:0:0:0: [sdb] tag#12 CDB: opcode=0x28 28 00 0f 3a 25 68 00 00 08 00
Sep 24 00:16:26 Tower kernel: I/O error, dev sdb, sector 94354424 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 2
Sep 24 00:16:26 Tower kernel: I/O error, dev sdb, sector 5326912 op 0x1:(WRITE) flags 0x1800 phys_seg 8 prio class 2
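For anyone reading the log above, the `CDB:` lines can be matched to the `I/O error` lines by decoding the command bytes. The following is only a minimal sketch: `decode_cdb` is a hypothetical helper name, it handles just the two opcodes that appear in this log (0x28 READ(10) and 0x93 WRITE SAME(16)), and it assumes the drive uses 512-byte logical sectors so that the LBA equals the sector number the kernel prints.

```python
def decode_cdb(hex_bytes):
    """Decode opcode, starting LBA and block count from a CDB hex string.

    Hypothetical helper: only covers the two opcodes seen in the log above.
    """
    cdb = bytes.fromhex(hex_bytes)  # whitespace between bytes is ignored
    opcode = cdb[0]
    if opcode == 0x28:
        # READ(10): LBA in bytes 2-5, transfer length in bytes 7-8 (big-endian)
        return {
            "command": "READ(10)",
            "lba": int.from_bytes(cdb[2:6], "big"),
            "blocks": int.from_bytes(cdb[7:9], "big"),
        }
    if opcode == 0x93:
        # WRITE SAME(16): LBA in bytes 2-9, block count in bytes 10-13;
        # bit 3 of byte 1 is the UNMAP flag (the request is a discard/trim)
        return {
            "command": "WRITE SAME(16)",
            "lba": int.from_bytes(cdb[2:10], "big"),
            "blocks": int.from_bytes(cdb[10:14], "big"),
            "unmap": bool(cdb[1] & 0x08),
        }
    raise ValueError(f"opcode 0x{opcode:02x} is not handled by this sketch")


# CDBs copied from tag#6 and tag#4 in the log above
read_cmd = decode_cdb("28 00 13 3f 37 60 00 00 08 00")
discard_cmd = decode_cdb("93 08 00 00 00 00 05 9f bb f8 00 00 f4 08 00 00")
print(read_cmd["lba"], discard_cmd["lba"], discard_cmd["unmap"])
```

The decoded LBAs line up with the sectors in the READ and DISCARD `I/O error` lines, which suggests the errors are simply the queued commands being failed once the device was disabled, not bad sectors at those addresses.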

 

The first symptom I notice when it crashes is that my docker containers are unavailable. I have found that I need to shut down the server and then power it back up to bring it online. If I attempt a reboot, I get this message on boot up (screenshot attached).

 

Controller Bus#00, Device#1F, Function#02: 06 Ports, 01 Devices:
	Port-00: No device detected

 

I haven't seen any of the drives reporting errors.

 

Thanks in advance for any light you might be able to shed on the situation.

Cheers,

jdndn

 

 

 

 

20230924_210950.jpg

tower-diagnostics-20230924-1845.zip

  • Solution

First issue, cache device is dropping offline:

 

Sep 25 04:53:45 Tower kernel: ata2: hard resetting link
Sep 25 04:53:50 Tower kernel: ata2: link is slow to respond, please be patient (ready=0)
Sep 25 04:54:20 Tower kernel: ata2: COMRESET failed (errno=-16)
Sep 25 04:54:20 Tower kernel: ata2: limiting SATA link speed to 1.5 Gbps
Sep 25 04:54:20 Tower kernel: ata2: hard resetting link
Sep 25 04:54:25 Tower kernel: ata2: COMRESET failed (errno=-16)
Sep 25 04:54:25 Tower kernel: ata2: reset failed, giving up
Sep 25 04:54:25 Tower kernel: ata2.00: disable device
Sep 25 04:54:25 Tower kernel: ata2: EH complete

 

Check/replace cables and since it's an MX500 also see here.
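To check whether the same port keeps dropping across several crashes, the syslog can be summarized per ATA port. This is a rough sketch, not an Unraid tool: `summarize_ata_events` and the regular expression are assumptions written against the message format shown in the excerpt above, and the sample lines are copied from it.

```python
import re
from collections import defaultdict

# Matches kernel lines such as "Tower kernel: ata2: COMRESET failed (errno=-16)"
# and "Tower kernel: ata2.00: disable device" (format assumed from the log above).
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)$")


def summarize_ata_events(lines):
    """Count link resets and COMRESET failures per port; flag disabled ports."""
    summary = defaultdict(
        lambda: {"hard_resets": 0, "comreset_failures": 0, "disabled": False}
    )
    for line in lines:
        match = ATA_LINE.search(line)
        if not match:
            continue
        port, message = match.groups()
        if message.startswith("hard resetting link"):
            summary[port]["hard_resets"] += 1
        elif message.startswith("COMRESET failed"):
            summary[port]["comreset_failures"] += 1
        elif message.startswith("disable device"):
            summary[port]["disabled"] = True
    return dict(summary)


sample = [
    "Sep 25 04:53:45 Tower kernel: ata2: hard resetting link",
    "Sep 25 04:53:50 Tower kernel: ata2: link is slow to respond, please be patient (ready=0)",
    "Sep 25 04:54:20 Tower kernel: ata2: COMRESET failed (errno=-16)",
    "Sep 25 04:54:20 Tower kernel: ata2: limiting SATA link speed to 1.5 Gbps",
    "Sep 25 04:54:20 Tower kernel: ata2: hard resetting link",
    "Sep 25 04:54:25 Tower kernel: ata2: COMRESET failed (errno=-16)",
    "Sep 25 04:54:25 Tower kernel: ata2: reset failed, giving up",
    "Sep 25 04:54:25 Tower kernel: ata2.00: disable device",
    "Sep 25 04:54:25 Tower kernel: ata2: EH complete",
]
result = summarize_ata_events(sample)
print(result)
```

To run it on a real log, pass an open file instead of `sample`, for example the syslog text file from the diagnostics zip. A port that ends up with `disabled` set to true is one where the kernel gave up on the link, which is what happened to the cache device here.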
