Drive Error - Array won't stop


Go to solution Solved by JorgeB,


Hi All,

 

I'm running Unraid 6.12.6 on an ASRock X570D4U with a 4650G Pro and 64GB ECC RAM.

 

Sadly I woke up today to a disabled drive warning:

image.png.36668654806733e2f7bd4f72f9a2d4c8.png

 

I had a look in my syslog and found these errors:

Feb 13 00:00:06 Tower Plugin Auto Update: Community Applications Plugin Auto Update finished
Feb 13 04:00:26 Tower kernel: md: disk29 read error, sector=3907594640
Feb 13 04:00:26 Tower emhttpd: read SMART /dev/sdh
Feb 13 04:00:26 Tower emhttpd: read SMART /dev/sdf
Feb 13 04:00:38 Tower emhttpd: read SMART /dev/sdd
Feb 13 04:00:38 Tower emhttpd: read SMART /dev/sde
Feb 13 04:00:38 Tower emhttpd: read SMART /dev/sdb
Feb 13 04:00:38 Tower emhttpd: read SMART /dev/sdc
Feb 13 04:00:38 Tower emhttpd: read SMART /dev/sdi
Feb 13 04:00:49 Tower kernel: md: disk29 write error, sector=3907594640
Feb 13 04:01:01 Tower sSMTP[12879]: Creating SSL connection to host
Feb 13 04:01:01 Tower sSMTP[12879]: SSL connection using TLS_AES_256_GCM_SHA384
Feb 13 04:01:04 Tower sSMTP[12879]: Sent mail for [censored] (221 2.0.0 closing connection y12-20020a056000108c00b0033b40a3f92asm8358077wrw.25 - gsmtp) uid=0 username=root outbytes=770
Feb 13 04:02:26 Tower root: /etc/libvirt: 125.7 MiB (131817472 bytes) trimmed on /dev/loop3
Feb 13 04:02:26 Tower root: /var/lib/docker: 3.7 GiB (3952398336 bytes) trimmed on /dev/loop2
Feb 13 04:02:26 Tower root: /mnt/cache: 314.3 GiB (337479950336 bytes) trimmed on /dev/mapper/nvme1n1p1
Feb 13 04:30:39 Tower emhttpd: spinning down /dev/sdd
Feb 13 04:30:39 Tower emhttpd: spinning down /dev/sde
Feb 13 04:30:39 Tower emhttpd: spinning down /dev/sdb
Feb 13 04:30:39 Tower emhttpd: spinning down /dev/sdc
Feb 13 04:30:39 Tower emhttpd: spinning down /dev/sdi
Feb 13 04:30:58 Tower emhttpd: spinning down /dev/sdh
Feb 13 04:30:58 Tower emhttpd: spinning down /dev/sdf
Feb 13 09:00:11 Tower emhttpd: read SMART /dev/sdh
Feb 13 09:00:21 Tower emhttpd: read SMART /dev/sdd
Feb 13 09:27:31 Tower emhttpd: read SMART /dev/sdc
Feb 13 09:28:00 Tower emhttpd: read SMART /dev/sdi
Feb 13 09:28:13 Tower emhttpd: read SMART /dev/sdb
Feb 13 09:30:30 Tower emhttpd: spinning down /dev/sdh
Feb 13 09:54:28 Tower webGUI: Successful login user root from 192.168.1.175
Feb 13 09:54:29 Tower emhttpd: error: hotplug_devices, 1713: No such file or directory (2): tagged device WDC_WUH721816ALE6L4_2BJ90ZUN was (sdg) is now (sdj)
Feb 13 09:54:29 Tower emhttpd: read SMART /dev/sdj
Feb 13 09:54:29 Tower kernel: emhttpd[6175]: segfault at 67c ip 000055776d0f197b sp 00007ffc12d22d60 error 4 in emhttpd[55776d0dc000+25000] likely on CPU 11 (core 6, socket 0)
Feb 13 09:54:29 Tower kernel: Code: e0 44 01 00 48 89 45 f8 48 8d 05 05 32 01 00 48 89 45 f0 e9 79 01 00 00 8b 45 ec 89 c7 e8 eb 87 ff ff 48 89 45 d8 48 8b 45 d8 <8b> 80 7c 06 00 00 85 c0 0f 94 c0 0f b6 c0 89 45 d4 48 8b 45 e0 48
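
For anyone skimming a similar syslog, the lines that matter in the excerpt above are the `md:` read/write errors, the `hotplug_devices` rename and the `emhttpd` segfault. Below is a minimal Python sketch for pulling those out of a saved log. The function name `key_events` and the pattern list are illustrative assumptions written against this excerpt only; they are not part of Unraid.

```python
import re

# Patterns hand-written from the excerpt above (assumption: other Unraid
# versions may word these messages differently).
PATTERNS = {
    "md_error": re.compile(
        r"kernel: md: (disk\d+) (read|write) error, sector=(\d+)"),
    "device_renamed": re.compile(
        r"emhttpd: error: hotplug_devices.*was \((\w+)\) is now \((\w+)\)"),
    "segfault": re.compile(r"kernel: (\S+)\[\d+\]: segfault"),
}

def key_events(lines):
    """Return (label, captured groups) for each line matching a pattern."""
    events = []
    for line in lines:
        for label, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                events.append((label, match.groups()))
                break
    return events

# Lines copied from the syslog excerpt above.
sample = [
    "Feb 13 04:00:26 Tower kernel: md: disk29 read error, sector=3907594640",
    "Feb 13 04:00:26 Tower emhttpd: read SMART /dev/sdh",
    "Feb 13 04:00:49 Tower kernel: md: disk29 write error, sector=3907594640",
    "Feb 13 09:54:29 Tower emhttpd: error: hotplug_devices, 1713: No such file or directory (2): tagged device WDC_WUH721816ALE6L4_2BJ90ZUN was (sdg) is now (sdj)",
    "Feb 13 09:54:29 Tower kernel: emhttpd[6175]: segfault at 67c ip 000055776d0f197b sp 00007ffc12d22d60 error 4 in emhttpd[55776d0dc000+25000] likely on CPU 11 (core 6, socket 0)",
]

for event in key_events(sample):
    print(event)
```

To run it against a real log, pass the lines of the syslog file from the diagnostics zip instead of `sample`.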

 

Nobody else in the house, so aside from the potential of some unwanted rodent guests (I hope not) there is nobody who could have touched the server.

For some reason, I now can't stop the array.

Diags attached. Can anyone help me understand what's happened and/or what I need to do? Really hoping this isn't an actual drive fault as these suckers ain't cheap.

 

Thanks,

tower-diagnostics-20240213-1002.zip


The drive logs

Now disconnected sdg:

Feb 12 13:31:32 Tower emhttpd: read SMART /dev/sdg
Feb 12 13:33:33 Tower kernel: sd 6:0:0:0: [sdg] Synchronizing SCSI cache
Feb 12 14:01:09 Tower emhttpd: spinning down /dev/sdg
Feb 12 14:01:09 Tower emhttpd: sdspin /dev/sdg down: 2
Feb 13 09:54:29 Tower emhttpd: error: hotplug_devices, 1713: No such file or directory (2): tagged device WDC_WUH721816ALE6L4_2BJ90ZUN was (sdg) is now (sdj)

 

sdj in unassigned drives:

Feb  9 19:28:49 Tower kernel: ata6: SATA max UDMA/133 abar m2048@0xfc400000 port 0xfc400180 irq 87
Feb  9 19:28:49 Tower kernel: ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb  9 19:28:49 Tower kernel: ata6.00: ATA-11: WDC  WUH721816ALE6L4, PCGNW232, max UDMA/133
Feb  9 19:28:49 Tower kernel: ata6.00: 31251759104 sectors, multi 16: LBA48 NCQ (depth 32), AA
Feb  9 19:28:49 Tower kernel: ata6.00: Features: NCQ-sndrcv NCQ-prio
Feb  9 19:28:49 Tower kernel: ata6.00: configured for UDMA/133
Feb 12 13:31:23 Tower kernel: ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb 12 13:31:32 Tower kernel: ata6.00: configured for UDMA/133
Feb 12 13:33:13 Tower kernel: ata6: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb 12 13:33:18 Tower kernel: ata6.00: qc timeout after 5000 msecs (cmd 0xec)
Feb 12 13:33:18 Tower kernel: ata6.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Feb 12 13:33:18 Tower kernel: ata6.00: revalidation failed (errno=-5)
Feb 12 13:33:23 Tower kernel: ata6: link is slow to respond, please be patient (ready=0)
Feb 12 13:33:28 Tower kernel: ata6: found unknown device (class 0)
Feb 12 13:33:28 Tower kernel: ata6: SATA link down (SStatus 1 SControl 300)
Feb 12 13:33:28 Tower kernel: ata6: limiting SATA link speed to <unknown>
Feb 12 13:33:30 Tower kernel: ata6: SATA link down (SStatus 1 SControl 3F0)
Feb 12 13:33:30 Tower kernel: ata6.00: disable device
Feb 12 13:33:33 Tower kernel: ata6: SATA link down (SStatus 1 SControl 300)
Feb 12 13:33:33 Tower kernel: ata6.00: detaching (SCSI 6:0:0:0)
Feb 12 13:33:35 Tower kernel: ata6: SATA link down (SStatus 1 SControl 300)
Feb 12 13:33:35 Tower kernel: ata6: limiting SATA link speed to 1.5 Gbps
Feb 12 13:33:37 Tower kernel: ata6: SATA link down (SStatus 1 SControl 310)
Feb 12 13:33:37 Tower kernel: ata6: limiting SATA link speed to 1.5 Gbps
Feb 12 13:33:39 Tower kernel: ata6: SATA link down (SStatus 1 SControl 310)
Feb 12 13:33:39 Tower kernel: ata6: limiting SATA link speed to 1.5 Gbps
Feb 12 13:33:41 Tower kernel: ata6: SATA link down (SStatus 1 SControl 310)
Feb 12 13:33:41 Tower kernel: ata6: limiting SATA link speed to 1.5 Gbps
Feb 12 13:33:42 Tower kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Feb 12 13:33:43 Tower kernel: ata6.00: ATA-11: WDC  WUH721816ALE6L4, PCGNW232, max UDMA/133
Feb 12 13:33:43 Tower kernel: ata6.00: 31251759104 sectors, multi 16: LBA48 NCQ (depth 32), AA
Feb 12 13:33:43 Tower kernel: ata6.00: Features: NCQ-sndrcv NCQ-prio
Feb 12 13:33:43 Tower kernel: ata6.00: configured for UDMA/133
Feb 12 13:33:43 Tower kernel: sd 6:0:0:0: [sdj] 31251759104 512-byte logical blocks: (16.0 TB/14.6 TiB)
Feb 12 13:33:43 Tower kernel: sd 6:0:0:0: [sdj] 4096-byte physical blocks
Feb 12 13:33:43 Tower kernel: sd 6:0:0:0: [sdj] Write Protect is off
Feb 12 13:33:43 Tower kernel: sd 6:0:0:0: [sdj] Mode Sense: 00 3a 00 00
Feb 12 13:33:43 Tower kernel: sd 6:0:0:0: [sdj] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 12 13:33:43 Tower kernel: sd 6:0:0:0: [sdj] Preferred minimum I/O size 4096 bytes
Feb 12 13:33:43 Tower kernel: sdj: sdj1
Feb 12 13:33:43 Tower kernel: sd 6:0:0:0: [sdj] Attached SCSI disk
Feb 12 13:33:43 Tower kernel: ata6.00: exception Emask 0x10 SAct 0x1f0 SErr 0x280100 action 0x6 frozen
Feb 12 13:33:43 Tower kernel: ata6.00: irq_stat 0x08000000, interface fatal error
Feb 12 13:33:43 Tower kernel: ata6: SError: { UnrecovData 10B8B BadCRC }
Feb 12 13:33:43 Tower kernel: ata6.00: failed command: READ FPDMA QUEUED
Feb 12 13:33:43 Tower kernel: ata6.00: cmd 60/70:20:00:fe:bf/00:00:46:07:00/40 tag 4 ncq dma 57344 in
Feb 12 13:33:43 Tower kernel: ata6.00: status: { DRDY }
Feb 12 13:33:43 Tower kernel: ata6.00: failed command: READ FPDMA QUEUED
Feb 12 13:33:43 Tower kernel: ata6.00: cmd 60/80:28:78:fe:bf/00:00:46:07:00/40 tag 5 ncq dma 65536 in
Feb 12 13:33:43 Tower kernel: ata6.00: status: { DRDY }
Feb 12 13:33:43 Tower kernel: ata6.00: failed command: READ FPDMA QUEUED
Feb 12 13:33:43 Tower kernel: ata6.00: cmd 60/78:30:08:ff:bf/00:00:46:07:00/40 tag 6 ncq dma 61440 in
Feb 12 13:33:43 Tower kernel: ata6.00: status: { DRDY }
Feb 12 13:33:43 Tower kernel: ata6.00: failed command: READ FPDMA QUEUED
Feb 12 13:33:43 Tower kernel: ata6.00: cmd 60/38:38:88:ff:bf/00:00:46:07:00/40 tag 7 ncq dma 28672 in
Feb 12 13:33:43 Tower kernel: ata6.00: status: { DRDY }
Feb 12 13:33:43 Tower kernel: ata6.00: failed command: READ FPDMA QUEUED
Feb 12 13:33:43 Tower kernel: ata6.00: cmd 60/28:40:c8:ff:bf/00:00:46:07:00/40 tag 8 ncq dma 20480 in
Feb 12 13:33:43 Tower kernel: ata6.00: status: { DRDY }
Feb 12 13:33:43 Tower kernel: ata6: hard resetting link
Feb 12 13:33:43 Tower kernel: ata6: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Feb 12 13:33:43 Tower kernel: ata6.00: configured for UDMA/133
Feb 12 13:33:43 Tower kernel: sd 6:0:0:0: [sdj] tag#4 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=0s
Feb 12 13:33:43 Tower kernel: sd 6:0:0:0: [sdj] tag#4 Sense Key : 0x5 [current]
Feb 12 13:33:43 Tower kernel: sd 6:0:0:0: [sdj] tag#4 ASC=0x21 ASCQ=0x4
Feb 12 13:33:43 Tower kernel: sd 6:0:0:0: [sdj] tag#4 CDB: opcode=0x88 88 00 00 00 00 07 46 bf fe 00 00 00 00 70 00 00
Feb 12 13:33:43 Tower kernel: I/O error, dev sdj, sector 31251758592 op 0x0:(READ) flags 0x80700 phys_seg 14 prio class 2
Feb 12 13:33:43 Tower kernel: sd 6:0:0:0: [sdj] tag#5 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=0s
Feb 12 13:33:43 Tower kernel: sd 6:0:0:0: [sdj] tag#5 Sense Key : 0x5 [current]
Feb 12 13:33:43 Tower kernel: sd 6:0:0:0: [sdj] tag#5 ASC=0x21 ASCQ=0x4
Feb 12 13:33:43 Tower kernel: sd 6:0:0:0: [sdj] tag#5 CDB: opcode=0x88 88 00 00 00 00 07 46 bf fe 78 00 00 00 80 00 00
Feb 12 13:33:43 Tower kernel: I/O error, dev sdj, sector 31251758712 op 0x0:(READ) flags 0x80700 phys_seg 15 prio class 2
Feb 12 13:33:43 Tower kernel: sd 6:0:0:0: [sdj] tag#6 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=0s
Feb 12 13:33:43 Tower kernel: sd 6:0:0:0: [sdj] tag#6 Sense Key : 0x5 [current]
Feb 12 13:33:43 Tower kernel: sd 6:0:0:0: [sdj] tag#6 ASC=0x21 ASCQ=0x4
Feb 12 13:33:43 Tower kernel: sd 6:0:0:0: [sdj] tag#6 CDB: opcode=0x88 88 00 00 00 00 07 46 bf ff 08 00 00 00 78 00 00
Feb 12 13:33:43 Tower kernel: I/O error, dev sdj, sector 31251758856 op 0x0:(READ) flags 0x80700 phys_seg 15 prio class 2
Feb 12 13:33:43 Tower kernel: sd 6:0:0:0: [sdj] tag#7 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=0s
Feb 12 13:33:43 Tower kernel: sd 6:0:0:0: [sdj] tag#7 Sense Key : 0x5 [current]
Feb 12 13:33:43 Tower kernel: sd 6:0:0:0: [sdj] tag#7 ASC=0x21 ASCQ=0x4
Feb 12 13:33:43 Tower kernel: sd 6:0:0:0: [sdj] tag#7 CDB: opcode=0x88 88 00 00 00 00 07 46 bf ff 88 00 00 00 38 00 00
Feb 12 13:33:43 Tower kernel: I/O error, dev sdj, sector 31251758984 op 0x0:(READ) flags 0x80700 phys_seg 7 prio class 2
Feb 12 13:33:43 Tower kernel: sd 6:0:0:0: [sdj] tag#8 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=0s
Feb 12 13:33:43 Tower kernel: sd 6:0:0:0: [sdj] tag#8 Sense Key : 0x5 [current]
Feb 12 13:33:43 Tower kernel: sd 6:0:0:0: [sdj] tag#8 ASC=0x21 ASCQ=0x4
Feb 12 13:33:43 Tower kernel: sd 6:0:0:0: [sdj] tag#8 CDB: opcode=0x88 88 00 00 00 00 07 46 bf ff c8 00 00 00 28 00 00
Feb 12 13:33:43 Tower kernel: I/O error, dev sdj, sector 31251759048 op 0x0:(READ) flags 0x80700 phys_seg 5 prio class 2
Feb 12 13:33:43 Tower kernel: ata6: EH complete
Feb 12 13:33:45 Tower unassigned.devices: Partition '/dev/sdj1' does not have a file system and cannot be mounted.
Feb 13 09:54:29 Tower emhttpd: error: hotplug_devices, 1713: No such file or directory (2): tagged device WDC_WUH721816ALE6L4_2BJ90ZUN was (sdg) is now (sdj)
Feb 13 09:54:29 Tower emhttpd: read SMART /dev/sdj
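
The `CDB:` lines above can be cross-checked against the `I/O error ... sector` lines. In a SCSI READ(16) command (opcode 0x88), bytes 2 to 9 hold the starting LBA and bytes 10 to 13 hold the transfer length in blocks, both big-endian. The Python sketch below decodes the tag#4 command; `decode_read16` is an illustrative helper name, and the two constants are copied from the log rather than queried from the drive.

```python
def decode_read16(cdb_hex):
    """Decode a SCSI READ(16) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2-9: starting LBA
    blocks = int.from_bytes(cdb[10:14], "big")  # bytes 10-13: transfer length
    return lba, blocks

LOGICAL_BLOCK = 512          # "512-byte logical blocks" in the log
TOTAL_SECTORS = 31251759104  # "ata6.00: 31251759104 sectors" in the log

# CDB copied from the tag#4 line above.
lba, blocks = decode_read16("88 00 00 00 00 07 46 bf fe 00 00 00 00 70 00 00")
print("start LBA:", lba)
print("length:", blocks, "blocks =", blocks * LOGICAL_BLOCK, "bytes")
print("sectors before end of disk:", TOTAL_SECTORS - lba)
```

The decoded LBA is 31251758592, matching the sector in the tag#4 `I/O error` line, and the length of 112 blocks (57344 bytes) matches the `ncq dma 57344 in` entry. The read starts 512 sectors before the end of the disk.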

 


So although the array wouldn't stop, the system did shut down safely. It's now booted up and running an extended SMART test on that drive, but if anyone has any clue what went wrong, or how to find out what went wrong, I would really appreciate it.

29 minutes ago, B1scu1T said:

I'll change the cable but just a bit baffled that there would be an error without anything being touched. (correct me if wrong)

 

Power or SATA cables can sometimes work themselves slightly loose simply due to vibration or thermal effects.  I think the SATA cable is particularly prone to this as it is not a very robust connector.
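
A loose or marginal connection would be consistent with the link-level details in the sdj log earlier in the thread, where the link renegotiates from 6.0 Gbps down to 1.5 Gbps and the kernel reports `SError: { UnrecovData 10B8B BadCRC }`. CRC and 10b/8b decode errors are often associated with signal problems on the cable or connector, although they do not rule out other causes. As a rough Python sketch, the `SStatus` and `SErr` hex values can be unpacked as follows. The field layout (DET in bits 0-3, SPD in bits 4-7, IPM in bits 8-11) follows the SATA register definition; the helper names are illustrative, and the SError table is deliberately limited to the three bits named in that log.

```python
# Negotiated interface speed, from the SPD field of SStatus.
SPEEDS = {0: "no link", 1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}

# Subset of SError bits: only the three flags that appear in the log.
SERROR_BITS = {8: "UnrecovData", 19: "10B8B", 21: "BadCRC"}

def decode_sstatus(hex_value):
    """Split an SStatus value (printed in hex by the kernel) into fields."""
    value = int(hex_value, 16)
    return {
        "det": value & 0xF,                               # device detection
        "speed": SPEEDS.get((value >> 4) & 0xF, "unknown"),
        "ipm": (value >> 8) & 0xF,                        # power management
    }

def decode_serror(hex_value):
    """Name the known SError bits set in a hex value such as 0x280100."""
    value = int(hex_value, 16)
    return [name for bit, name in sorted(SERROR_BITS.items())
            if value & (1 << bit)]

# Values copied from the sdj log.
for sstatus in ("133", "1", "113"):
    print(sstatus, decode_sstatus(sstatus))
print(decode_serror("0x280100"))
```

Here DET 3 means a device is detected with the link established, while DET 1 means a device is sensed but the link is not up, which is what the repeated `SATA link down (SStatus 1 ...)` lines show.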

4 minutes ago, itimpi said:

 

Power or SATA cables can sometimes work themselves slightly loose simply due to vibration or thermal effects.  I think the SATA cable is particularly prone to this as it is not a very robust connector.

Yeah, it's a branded StarTech reverse SAS breakout and I suppose it has been in and out of a lot of builds. I'm going to perhaps try with an HBA and a straight SAS cable before I buy anything, mostly because that's what I already own.
