Dead drive or controller issue?


I had disk5 in my array throw a random smattering of read errors earlier this week, and I wrote it off as nothing significant after the disk passed both a short and a long SMART test.

 

Well, it just dropped off out of the blue, and it almost looks like the controller took the link down rather than the drive dying.

 

Jul 15 13:18:47 Node kernel: 
Jul 15 13:19:21 Node kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jul 15 13:19:21 Node kernel: ata1.00: configured for UDMA/133
Jul 15 14:13:09 Node kernel: ata1: COMRESET failed (errno=-32)
Jul 15 14:13:09 Node kernel: ata1: reset failed (errno=-32), retrying in 8 secs
Jul 15 14:13:17 Node kernel: ata1: limiting SATA link speed to 3.0 Gbps
Jul 15 14:13:19 Node kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Jul 15 14:13:19 Node kernel: ata1.00: configured for UDMA/133
Jul 15 17:20:11 Node emhttpd: spinning down /dev/sdg
Jul 15 17:20:53 Node kernel: mdcmd (60): set md_write_method 0
Jul 15 17:20:53 Node kernel: 
Jul 15 17:33:31 Node kernel: ata1: SATA link down (SStatus 0 SControl 320)
Jul 15 17:33:31 Node kernel: ata1: SATA link down (SStatus 0 SControl 320)
Jul 15 17:33:31 Node kernel: ata1.00: link offline, clearing class 1 to NONE
Jul 15 17:33:32 Node kernel: ata1: SATA link down (SStatus 0 SControl 320)
Jul 15 17:33:32 Node kernel: ata1.00: link offline, clearing class 1 to NONE
Jul 15 17:33:32 Node kernel: ata1.00: disabled
Jul 15 17:33:32 Node kernel: ata1.00: detaching (SCSI 1:0:0:0)
Jul 15 17:33:32 Node kernel: sd 1:0:0:0: [sdb] Synchronizing SCSI cache
Jul 15 17:33:32 Node kernel: sd 1:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=0x04 driverbyte=0x00
Jul 15 17:33:32 Node kernel: sd 1:0:0:0: [sdb] Stopping disk
Jul 15 17:33:32 Node kernel: sd 1:0:0:0: [sdb] Start/Stop Unit failed: Result: hostbyte=0x04 driverbyte=0x00
Jul 15 17:33:32 Node rc.diskinfo[12031]: SIGHUP received, forcing refresh of disks info.
Jul 15 17:33:38 Node kernel: ata1: link is slow to respond, please be patient (ready=0)
Jul 15 17:33:41 Node kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jul 15 17:33:41 Node kernel: ata1.00: ATA-9: WDC WD60EFRX-68L0BN1,      WD-WX21DC74DH70, 82.00A82, max UDMA/133
Jul 15 17:33:41 Node kernel: ata1.00: 11721045168 sectors, multi 0: LBA48 NCQ (depth 32), AA
Jul 15 17:33:41 Node kernel: ata1.00: configured for UDMA/133
Jul 15 17:33:41 Node kernel: scsi 1:0:0:0: Direct-Access     ATA      WDC WD60EFRX-68L 0A82 PQ: 0 ANSI: 5
Jul 15 17:33:41 Node kernel: sd 1:0:0:0: Attached scsi generic sg1 type 0
Jul 15 17:33:41 Node kernel: sd 1:0:0:0: [sdm] 11721045168 512-byte logical blocks: (6.00 TB/5.46 TiB)
Jul 15 17:33:41 Node kernel: sd 1:0:0:0: [sdm] 4096-byte physical blocks
Jul 15 17:33:41 Node kernel: sd 1:0:0:0: [sdm] Write Protect is off
Jul 15 17:33:41 Node kernel: sd 1:0:0:0: [sdm] Mode Sense: 00 3a 00 00
Jul 15 17:33:41 Node kernel: sd 1:0:0:0: [sdm] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Jul 15 17:33:41 Node kernel: sdm: sdm1
Jul 15 17:33:41 Node kernel: sd 1:0:0:0: [sdm] Attached SCSI disk
Jul 15 17:33:41 Node rc.diskinfo[12031]: SIGHUP received, forcing refresh of disks info.
Jul 15 17:33:47 Node emhttpd: read SMART /dev/sdg
Jul 15 17:33:49 Node kernel: md: disk5 read error, sector=1743040
Jul 15 17:33:49 Node kernel: md: disk5 read error, sector=1743048
Jul 15 17:33:49 Node kernel: md: disk5 read error, sector=1743056
Jul 15 17:33:49 Node kernel: md: disk5 write error, sector=1743040
Jul 15 17:33:49 Node kernel: md: disk5 write error, sector=1743048
Jul 15 17:33:49 Node kernel: md: disk5 write error, sector=1743056
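
For anyone reading along: the SStatus and SControl numbers in the log above are hexadecimal dumps of the SATA link registers, and decoding them shows the link negotiating down from 6.0 to 3.0 Gbps and then losing the device entirely. Below is a minimal Python sketch of that decoding. The field layout (DET in bits 3:0, SPD in bits 7:4, IPM in bits 11:8) is from the SATA specification; the function names and the wording of the descriptions are made up for illustration.

```python
# Decode the SStatus / SControl values that libata prints in hex.
# Field layout (SATA spec): bits 3:0 = DET, bits 7:4 = SPD, bits 11:8 = IPM.
# Function names and description strings are illustrative only.

DET = {
    0: "no device, no PHY link",
    1: "device present, PHY link not established",
    3: "device present, PHY link established",
    4: "PHY offline",
}
SPD = {0: "none", 1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}


def split_fields(value):
    """Return the (DET, SPD, IPM) nibbles of a link register value."""
    return value & 0xF, (value >> 4) & 0xF, (value >> 8) & 0xF


def decode_sstatus(text):
    """Decode an SStatus value given as the hex string from the log."""
    det, spd, ipm = split_fields(int(text, 16))
    return {
        "det": DET.get(det, "reserved"),
        "speed": SPD.get(spd, "reserved"),
        "ipm": ipm,
    }


def decode_scontrol(text):
    """Decode an SControl value; SPD here is a speed limit, 0 = no limit."""
    det, spd, ipm = split_fields(int(text, 16))
    limit = "no limit" if spd == 0 else SPD.get(spd, "reserved")
    return {"speed_limit": limit, "ipm_disallowed": ipm}


# The three (SStatus, SControl) pairs that appear in the log above.
for sstatus, scontrol in [("133", "300"), ("123", "320"), ("0", "320")]:
    print(sstatus, decode_sstatus(sstatus), scontrol, decode_scontrol(scontrol))
```

Read this way, `SStatus 133` is a healthy 6.0 Gbps link, `SStatus 123 SControl 320` is the link after the kernel capped it at 3.0 Gbps, and `SStatus 0` means the port no longer sees a device at all.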

 

node-diagnostics-20210715-1742.zip

 

 

Is this a disk problem or a controller/cabling problem? The disk appears to be on the onboard Intel SATA controller. I had one other disk (disk9) on the same controller throw some read errors recently, but I went ahead and replaced it and I haven't seen any more.

 

What are my options at this point? Since a write to it failed and the disk is now disabled, I'm going to have to rebuild, either onto the same disk or onto a new one.


EDIT: OK, so taking another look at this, it looks like UnRAID now sees the same disk as /dev/sdm? So maybe this is a controller or cabling issue after all, since it "lost" and then "found" the disk again?
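
The rename is visible in the syslog itself: the same SCSI address (1:0:0:0) is first reported as [sdb] and then, after the link comes back, as [sdm]. A rough Python sketch for pulling that out of a log is below. The regex and the function name are my own for illustration, and the sample lines are copied from the log posted above.

```python
import re
from collections import defaultdict

# Matches kernel lines such as "sd 1:0:0:0: [sdb] Stopping disk".
SD_LINE = re.compile(r"sd (\d+:\d+:\d+:\d+): \[(sd[a-z]+)\]")


def device_names_by_address(lines):
    """Map each SCSI address to the device names it had, in order seen."""
    names = defaultdict(list)
    for line in lines:
        match = SD_LINE.search(line)
        if not match:
            continue
        address, name = match.groups()
        # Record a name only when it differs from the previous one.
        if not names[address] or names[address][-1] != name:
            names[address].append(name)
    return dict(names)


# Sample lines copied from the syslog excerpt in this post.
sample = [
    "Jul 15 17:33:32 Node kernel: sd 1:0:0:0: [sdb] Synchronizing SCSI cache",
    "Jul 15 17:33:32 Node kernel: sd 1:0:0:0: [sdb] Stopping disk",
    "Jul 15 17:33:41 Node kernel: sd 1:0:0:0: [sdm] Write Protect is off",
    "Jul 15 17:33:41 Node kernel: sd 1:0:0:0: [sdm] Attached SCSI disk",
]

renames = device_names_by_address(sample)
for address, history in renames.items():
    if len(history) > 1:
        print(f"{address}: {' -> '.join(history)}")
```

An address with more than one name in its history was detached and re-attached while the system was running, which is what happened to this disk.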

 

[Attached screenshot: image.png]

 

I'm probably going to get a new disk on order anyway, but it might be worth having someone check and re-seat the cables tomorrow.

 

If the cabling seems good, should I try a rebuild on top of the old disk? The emulated content is present and accounted for, and I have another long SMART test of the drive running.

Edited by weirdcrap
7 hours ago, JorgeB said:

Assuming SMART is OK (there's no report in the diags posted), this is usually a connection/power problem; replace/swap cables/slot.

It passed SMART.

 

I believe this disk runs directly off the PSU power, so no splitters or anything. I can try having someone replace the SATA cable, but I am honestly terrified that, if this is a cabling/power issue, swapping cables with another drive is going to drop a different disk from the array, breaking parity altogether.

 

If I try to rebuild to the same disk and it drops off again will UnRAID just re-disable the disk? Or will it break parity?

 

This is a remote server, so infuriatingly I cannot troubleshoot this problem myself without a 6-hour round trip that I just made last weekend.

On 7/16/2021 at 6:08 AM, JorgeB said:

It will remain disabled.

Cables replaced, power cable seating checked, rebuilding to the same disk now. So far so good, fingers crossed that was it.

 

EDIT: Rebuild completed successfully. I will monitor and report back if this becomes a problem again.

  • weirdcrap changed the title to Dead drive or controller issue?

Didn't last long; I'm back at it, with just some read errors for now. Last time the disk dropped only a few days after the read errors started, so I imagine it's going to drop again soon.

 

Jul 17 18:13:53 Node kernel: sd 1:0:0:0: [sdb] tag#26 ASC=0x11 ASCQ=0x4
Jul 17 18:13:53 Node kernel: sd 1:0:0:0: [sdb] tag#26 CDB: opcode=0x88 88 00 00 00 00 02 47 d0 00 d0 00 00 00 08 00 00
Jul 17 18:13:53 Node kernel: blk_update_request: I/O error, dev sdb, sector 9794748624 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jul 17 18:13:53 Node kernel: md: disk5 read error, sector=9794748560
Jul 17 18:13:53 Node kernel: ata1: EH complete
Jul 17 18:14:01 Node sSMTP[7263]: Creating SSL connection to host
Jul 17 18:14:01 Node sSMTP[7263]: SSL connection using TLS_AES_256_GCM_SHA384
Jul 17 18:14:03 Node sSMTP[7263]: Sent mail for snip (221 2.0.0 closing connection b25sm7843287ios.36 - gsmtp) uid=0 username=root outbytes=819
Jul 17 18:14:34 Node kernel: ata1.00: exception Emask 0x0 SAct 0x10000 SErr 0x0 action 0x0
Jul 17 18:14:34 Node kernel: ata1.00: irq_stat 0x40000008
Jul 17 18:14:34 Node kernel: ata1.00: failed command: READ FPDMA QUEUED
Jul 17 18:14:34 Node kernel: ata1.00: cmd 60/08:80:a8:89:2a/00:00:e9:00:00/40 tag 16 ncq dma 4096 in
Jul 17 18:14:34 Node kernel: res 41/40:00:a8:89:2a/00:00:e9:00:00/00 Emask 0x409 (media error) <F>
Jul 17 18:14:34 Node kernel: ata1.00: status: { DRDY ERR }
Jul 17 18:14:34 Node kernel: ata1.00: error: { UNC }
Jul 17 18:14:34 Node kernel: ata1.00: configured for UDMA/133
Jul 17 18:14:34 Node kernel: sd 1:0:0:0: [sdb] tag#16 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=7s
Jul 17 18:14:34 Node kernel: sd 1:0:0:0: [sdb] tag#16 Sense Key : 0x3 [current]
Jul 17 18:14:34 Node kernel: sd 1:0:0:0: [sdb] tag#16 ASC=0x11 ASCQ=0x4
Jul 17 18:14:34 Node kernel: sd 1:0:0:0: [sdb] tag#16 CDB: opcode=0x88 88 00 00 00 00 00 e9 2a 89 a8 00 00 00 08 00 00
Jul 17 18:14:34 Node kernel: blk_update_request: I/O error, dev sdb, sector 3911879080 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jul 17 18:14:34 Node kernel: md: disk5 read error, sector=3911879016
Jul 17 18:14:34 Node kernel: ata1: EH complete
Jul 17 18:14:59 Node kernel: ata1.00: exception Emask 0x0 SAct 0x2000 SErr 0x0 action 0x0
Jul 17 18:14:59 Node kernel: ata1.00: irq_stat 0x40000008
Jul 17 18:14:59 Node kernel: ata1.00: failed command: READ FPDMA QUEUED
Jul 17 18:14:59 Node kernel: ata1.00: cmd 60/08:68:e0:75:cb/00:00:47:02:00/40 tag 13 ncq dma 4096 in
Jul 17 18:14:59 Node kernel: res 41/40:00:e0:75:cb/00:00:47:02:00/00 Emask 0x409 (media error) <F>
Jul 17 18:14:59 Node kernel: ata1.00: status: { DRDY ERR }
Jul 17 18:14:59 Node kernel: ata1.00: error: { UNC }
Jul 17 18:14:59 Node kernel: ata1.00: configured for UDMA/133
Jul 17 18:14:59 Node kernel: sd 1:0:0:0: [sdb] tag#13 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=7s
Jul 17 18:14:59 Node kernel: sd 1:0:0:0: [sdb] tag#13 Sense Key : 0x3 [current]
Jul 17 18:14:59 Node kernel: sd 1:0:0:0: [sdb] tag#13 ASC=0x11 ASCQ=0x4
Jul 17 18:14:59 Node kernel: sd 1:0:0:0: [sdb] tag#13 CDB: opcode=0x88 88 00 00 00 00 02 47 cb 75 e0 00 00 00 08 00 00
Jul 17 18:14:59 Node kernel: blk_update_request: I/O error, dev sdb, sector 9794450912 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
Jul 17 18:14:59 Node kernel: md: disk5 read error, sector=9794450848
Jul 17 18:14:59 Node kernel: ata1: EH complete
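
As a cross-check, the failing sector can be recovered from the raw command bytes in these log lines. The sketch below decodes the LBA from the SCSI READ(16) CDB (opcode 0x88, LBA in bytes 2 to 9) and from the libata `cmd` taskfile string, and compares both with the sector numbers the kernel printed. The function names are made up for illustration. The constant 64 between the `blk_update_request` sector and the `md` sector is only inferred from the three pairs in this log; it would be consistent with the data partition starting at sector 64, but that is an assumption.

```python
def lba_from_read16_cdb(cdb_hex):
    """Return (lba, sector_count) from a READ(16) CDB given as hex text."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # bytes 2..9: 64-bit LBA
    count = int.from_bytes(cdb[10:14], "big")   # bytes 10..13: transfer length
    return lba, count


def lba_from_ncq_taskfile(cmd):
    """Return (lba, tag) from a libata 'cmd' string for an NCQ command.

    Layout: command/feature:nsect:lbal:lbam:lbah/
            hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
    """
    _command, low, high, _device = cmd.split("/")
    _feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    _hf, _hn, hob_lbal, hob_lbam, hob_lbah = (int(x, 16) for x in high.split(":"))
    lba = (
        (hob_lbah << 40) | (hob_lbam << 32) | (hob_lbal << 24)
        | (lbah << 16) | (lbam << 8) | lbal
    )
    tag = nsect >> 3  # for NCQ commands the tag sits in bits 7:3 of nsect
    return lba, tag


# Values copied from the log above.
cdb_lba, cdb_count = lba_from_read16_cdb(
    "88 00 00 00 00 02 47 d0 00 d0 00 00 00 08 00 00"
)
tf_lba, tf_tag = lba_from_ncq_taskfile("60/08:80:a8:89:2a/00:00:e9:00:00/40")

# Offset between the whole-disk sector and the md (array) sector,
# taken from the log pairs; treating it as a partition start is an assumption.
MD_OFFSET = 9794748624 - 9794748560

print(cdb_lba, cdb_count)   # whole-disk sector and length of the first failed read
print(tf_lba, tf_tag)       # sector and NCQ tag of the 18:14:34 failure
print(tf_lba - MD_OFFSET)   # corresponding md sector
```

Both decodes agree with the `blk_update_request` lines, so the UNC errors point at specific LBAs on the disk (around sectors 3911879080 and 9794450912) rather than at random addresses.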

 

I'll try moving it to a different SATA port on the mobo next and see if that makes any difference...

node-diagnostics-20210717-1829.zip

On 7/18/2021 at 3:37 AM, JorgeB said:

Also good to replace/swap power cable just to rule that out.

Yeah, I'm trying to take a step-by-step approach to identify the issue. I changed SATA ports today; if it is still acting up, we'll swap power connections.

 

I had my tech confirm this is the one drive on a Molex-to-SATA adapter, because the last SATA power plug couldn't reach it. So that may be the culprit, and I'll try replacing it next.

  • 2 weeks later...

An update on this: moving the drive to a new port (SATA II, since that is all that was left open) seems to have resolved the issue so far. I have a parity check scheduled in 2 days, so that will give it a bit of an extra stress test, but it SEEMS to be solved...

 

I'll have to put another drive on that SATA III port next and see whether the issue shows up on a different drive.
