jdndm Posted September 24, 2023

Hi,

My Unraid 6.12.3 server is crashing at random intervals. Sometimes a crash brings down the UI, but I've managed to catch the last couple of crashes and extract diagnostic files.

I suspect that my cache drive is causing the errors, but I'm really not sure. So far I have replaced the power and SATA cables, tried 2 different SATA ports on the motherboard, and tried connecting the cache drive to the HBA. I'm at a point where I'm sure something is failing and needs to be replaced, but I'm not sure what it could be. I suspect it's the cache drive, the HBA or the onboard SATA controller.

Hardware:
- Supermicro X8DTL
- 8GB RAM
- LSI SAS2008 HBA
- 2 x 16TB Seagate Ironwolf
- 2 x 8TB WD Red
- 2 x 4TB WD Red
- 1 x 3TB WD Green
- 1 x 500GB MX500 (cache)
- 1 x 320GB Seagate Momentus

Here's the section of the logs from when the server crashed:

Sep 23 06:29:59 Tower kernel: md: sync done. time=114533sec
Sep 23 06:29:59 Tower kernel: md: recovery thread: exit status: 0
Sep 24 00:15:26 Tower kernel: ata2.00: exception Emask 0x0 SAct 0x41 SErr 0x0 action 0x6 frozen
Sep 24 00:15:26 Tower kernel: ata2.00: failed command: READ FPDMA QUEUED
Sep 24 00:15:26 Tower kernel: ata2.00: cmd 60/40:00:e0:1a:1f/00:00:00:00:00/40 tag 0 ncq dma 32768 in
Sep 24 00:15:26 Tower kernel: res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 24 00:15:26 Tower kernel: ata2.00: status: { DRDY }
Sep 24 00:15:26 Tower kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Sep 24 00:15:26 Tower kernel: ata2.00: cmd 61/40:30:40:48:51/00:00:00:00:00/40 tag 6 ncq dma 32768 out
Sep 24 00:15:26 Tower kernel: res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 24 00:15:26 Tower kernel: ata2.00: status: { DRDY }
Sep 24 00:15:26 Tower kernel: ata2: hard resetting link
Sep 24 00:15:31 Tower kernel: ata2: link is slow to respond, please be patient (ready=0)
Sep 24 00:15:36 Tower kernel: ata2: COMRESET failed (errno=-16)
Sep 24 00:15:36 Tower kernel: ata2: hard resetting link
Sep 24 00:15:41 Tower kernel: ata2: link is slow to respond, please be patient (ready=0)
Sep 24 00:15:46 Tower kernel: ata2: COMRESET failed (errno=-16)
Sep 24 00:15:46 Tower kernel: ata2: hard resetting link
Sep 24 00:15:51 Tower kernel: ata2: link is slow to respond, please be patient (ready=0)
Sep 24 00:16:21 Tower kernel: ata2: COMRESET failed (errno=-16)
Sep 24 00:16:21 Tower kernel: ata2: limiting SATA link speed to 1.5 Gbps
Sep 24 00:16:21 Tower kernel: ata2: hard resetting link
Sep 24 00:16:26 Tower kernel: ata2: COMRESET failed (errno=-16)
Sep 24 00:16:26 Tower kernel: ata2: reset failed, giving up
Sep 24 00:16:26 Tower kernel: ata2.00: disable device
Sep 24 00:16:26 Tower kernel: ata2: EH complete
Sep 24 00:16:26 Tower kernel: sd 2:0:0:0: [sdb] tag#6 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=0s
Sep 24 00:16:26 Tower kernel: sd 2:0:0:0: [sdb] tag#7 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=0s
Sep 24 00:16:26 Tower kernel: sd 2:0:0:0: [sdb] tag#6 CDB: opcode=0x28 28 00 13 3f 37 60 00 00 08 00
Sep 24 00:16:26 Tower kernel: sd 2:0:0:0: [sdb] tag#7 CDB: opcode=0x28 28 00 13 41 ca 40 00 00 08 00
Sep 24 00:16:26 Tower kernel: sd 2:0:0:0: [sdb] tag#12 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=0s
Sep 24 00:16:26 Tower kernel: I/O error, dev sdb, sector 322910048 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
Sep 24 00:16:26 Tower kernel: I/O error, dev sdb, sector 323078720 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 2
Sep 24 00:16:26 Tower kernel: sd 2:0:0:0: [sdb] tag#13 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=0s
Sep 24 00:16:26 Tower kernel: sd 2:0:0:0: [sdb] tag#4 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=90s
Sep 24 00:16:26 Tower kernel: sd 2:0:0:0: [sdb] tag#16 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=0s
Sep 24 00:16:26 Tower kernel: sd 2:0:0:0: [sdb] tag#13 CDB: opcode=0x28 28 00 12 8b 3b 88 00 00 08 00
Sep 24 00:16:26 Tower kernel: sd 2:0:0:0: [sdb] tag#4 CDB: opcode=0x93 93 08 00 00 00 00 05 9f bb f8 00 00 f4 08 00 00
Sep 24 00:16:26 Tower kernel: sd 2:0:0:0: [sdb] tag#12 CDB: opcode=0x28 28 00 0f 3a 25 68 00 00 08 00
Sep 24 00:16:26 Tower kernel: I/O error, dev sdb, sector 94354424 op 0x3:(DISCARD) flags 0x800 phys_seg 1 prio class 2
Sep 24 00:16:26 Tower kernel: I/O error, dev sdb, sector 5326912 op 0x1:(WRITE) flags 0x1800 phys_seg 8 prio class 2

The first symptom I notice after a crash is that my Docker containers are unavailable. I have found that I need to shut down the server and then power it back up to bring it online. If I attempt a reboot, I get this message on boot-up (screenshot attached):

Controller Bus#00, Device#1F, Function#02: 06 Ports, 01 Devices: Port-00: No device detected

I haven't seen any of the drives reporting errors.

Thanks in advance for any light you might be able to shed on the situation.

Cheers,
jdndn

tower-diagnostics-20230924-1845.zip
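For anyone reading the log above: the CDB lines and the I/O error lines describe the same failed requests. A minimal sketch, assuming the standard SCSI READ(10) layout (opcode 0x28, big-endian LBA in bytes 2-5, transfer length in bytes 7-8) and a drive with 512-byte logical sectors; the function name `decode_read10` is made up for illustration.

```python
def decode_read10(cdb_hex: str) -> tuple[int, int]:
    """Decode a SCSI READ(10) CDB given as space-separated hex bytes.

    Returns (logical block address, transfer length in blocks).
    """
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is ignored
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: LBA, big-endian
    length = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length
    return lba, length


# CDB copied from the tag#6 line in the log above.
lba, length = decode_read10("28 00 13 3f 37 60 00 00 08 00")
print(lba, length)  # 322910048 8
```

The decoded LBA, 322910048, is the sector reported in the first "I/O error, dev sdb" line, so these are ordinary small reads that failed because the device had already been disabled, not reads of one particular bad region.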
jdndm Posted September 25, 2023 (edited)

Just had another crash; diagnostics attached. The UI is still up, so if there's any further investigation I can do, please let me know.

tower-diagnostics-20230925-1040.zip

Edited September 25, 2023 by jdndm (edited for clarity)
JorgeB Posted September 25, 2023 (Solution)

First issue, the cache device is dropping offline:

Sep 25 04:53:45 Tower kernel: ata2: hard resetting link
Sep 25 04:53:50 Tower kernel: ata2: link is slow to respond, please be patient (ready=0)
Sep 25 04:54:20 Tower kernel: ata2: COMRESET failed (errno=-16)
Sep 25 04:54:20 Tower kernel: ata2: limiting SATA link speed to 1.5 Gbps
Sep 25 04:54:20 Tower kernel: ata2: hard resetting link
Sep 25 04:54:25 Tower kernel: ata2: COMRESET failed (errno=-16)
Sep 25 04:54:25 Tower kernel: ata2: reset failed, giving up
Sep 25 04:54:25 Tower kernel: ata2.00: disable device
Sep 25 04:54:25 Tower kernel: ata2: EH complete

Check/replace the cables and, since it's an MX500, also see here.
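The pattern JorgeB points out (repeated hard resets, COMRESET failures, then "disable device") can be picked out of a long syslog mechanically. A rough sketch, not an Unraid tool: the function name `summarize_ata_events` and the event labels are invented for illustration, and the sample lines are copied from the excerpt above.

```python
import re
from collections import defaultdict

# Matches kernel lines such as "Tower kernel: ata2: COMRESET failed (errno=-16)"
# and "Tower kernel: ata2.00: disable device"; the ".00" device suffix is optional.
ATA_LINE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")


def summarize_ata_events(lines):
    """Count link-reset related events per ATA port (labels are illustrative)."""
    events = defaultdict(lambda: defaultdict(int))
    for line in lines:
        m = ATA_LINE.search(line)
        if not m:
            continue
        port, msg = m.groups()
        if msg.startswith("hard resetting link"):
            events[port]["hard_reset"] += 1
        elif msg.startswith("COMRESET failed"):
            events[port]["comreset_failed"] += 1
        elif msg.startswith("reset failed, giving up"):
            events[port]["gave_up"] += 1
        elif msg.startswith("disable device"):
            events[port]["disabled"] += 1
    return {port: dict(counts) for port, counts in events.items()}


SAMPLE = """\
Sep 25 04:53:45 Tower kernel: ata2: hard resetting link
Sep 25 04:53:50 Tower kernel: ata2: link is slow to respond, please be patient (ready=0)
Sep 25 04:54:20 Tower kernel: ata2: COMRESET failed (errno=-16)
Sep 25 04:54:20 Tower kernel: ata2: limiting SATA link speed to 1.5 Gbps
Sep 25 04:54:20 Tower kernel: ata2: hard resetting link
Sep 25 04:54:25 Tower kernel: ata2: COMRESET failed (errno=-16)
Sep 25 04:54:25 Tower kernel: ata2: reset failed, giving up
Sep 25 04:54:25 Tower kernel: ata2.00: disable device
Sep 25 04:54:25 Tower kernel: ata2: EH complete
"""

print(summarize_ata_events(SAMPLE.splitlines()))
```

A port that shows both "gave_up" and "disabled" counts is one where the kernel stopped talking to the attached device, which is consistent with the cache drive dropping offline as described above.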
jdndm Posted September 25, 2023

Thanks, I checked the link and I had one of the affected firmware versions. The firmware has been upgraded to M3CR046. I'll monitor for a few days and, if it's stable, I will mark your answer as the solution.