Odd disk problem with new-ish 4TB Seagate drive


Freppe

So I've been having some issues with my latest drive, which I bought in November. My other drives are a bunch of older 1TB drives and some newer 3TB drives, but when I had to replace a failed drive last time I bought a 4TB Seagate drive and installed it as parity 2. The drive seemed to work perfectly fine, but after some weeks it failed with errors. I pulled the drive and tested it in my Windows computer with a full read-write test, which passed without any errors. I reinstalled the drive and it worked fine in Unraid again, at least for a while, and then it failed again. This has repeated multiple times, but every time I test the drive in another computer I'm unable to find any issues. I've replaced SATA cables, thinking that might be the problem, but no luck there.

I even recently purchased a proper LSI SAS 9201-16i, since from what I can read online the SATA cards I've been using are apparently not recommended for Unraid use, and I thought they might be the cause. But when I connect the disk to that card it fails even faster: the disk is recognized fine, but as soon as I start the parity sync I immediately get write errors and it fails.

I have attached three diagnostics. The first is from when the disk last failed, before my attempt to use the LSI HBA. The second is with the 4TB drive connected to the LSI HBA. The third is from my currently running system, where the 4TB drive is connected to a SATA port on the motherboard and the log is full of errors related to that drive. The parity sync is currently running in that configuration, but it seems to be very slow.

I'm at a bit of a loss for what I should do here. Since I'm unable to find any issues with the drive when it's connected to another system, I feel it would be tricky to send it in for warranty replacement. Is there any way to tell from the logs whether the problem is with my system or with the drive? Any way to properly prove that the drive is faulty so I can send it in under warranty?
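
For what it's worth, I assume the kind of evidence a warranty claim would want is the usual smartctl output, something like this from the console (sdX is just a placeholder for whatever letter the drive currently has, so take this as a sketch of what I could collect rather than anything conclusive):

smartctl -x /dev/sdX              # full SMART report, including the error and self-test logs
smartctl -t long /dev/sdX         # start an extended (long) self-test
smartctl -l selftest /dev/sdX     # check the result once the test has finished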

The kind of log entries I see right now look like this:

Jan 12 12:41:37 Tower kernel: sd 50:0:3:0: No reference found at driver, assuming scmd(0x00000000dd847d05) might have completed
Jan 12 12:41:37 Tower kernel: sd 50:0:3:0: task abort: SUCCESS scmd(0x00000000dd847d05)
Jan 12 12:41:37 Tower kernel: sd 50:0:3:0: attempting task abort!scmd(0x0000000096ed5524), outstanding for 34207 ms & timeout 30000 ms
Jan 12 12:41:37 Tower kernel: sd 50:0:3:0: [sdu] tag#1318 CDB: opcode=0x8a 8a 00 00 00 00 00 00 44 69 e0 00 00 04 00 00 00
Jan 12 12:41:37 Tower kernel: scsi target50:0:3: handle(0x000b), sas_address(0x4433221103000000), phy(3)
Jan 12 12:41:37 Tower kernel: scsi target50:0:3: enclosure logical id(0x5000000080000000), slot(3) 

tower-diagnostics-20221230-1239.zip tower-diagnostics-20230112-1138.zip tower-diagnostics-20230112-1245.zip

The logs from the initial failure looked like this:


Dec 30 04:16:17 Tower kernel: sd 11:0:0:0: [sdm] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=26s
Dec 30 04:16:17 Tower kernel: sd 11:0:0:0: [sdm] tag#1 Sense Key : 0x5 [current] 
Dec 30 04:16:17 Tower kernel: sd 11:0:0:0: [sdm] tag#1 ASC=0x21 ASCQ=0x4 
Dec 30 04:16:17 Tower kernel: sd 11:0:0:0: [sdm] tag#1 CDB: opcode=0x88 88 00 00 00 00 01 31 8f e1 a0 00 00 05 40 00 00
Dec 30 04:16:17 Tower kernel: I/O error, dev sdm, sector 5126480288 op 0x0:(READ) flags 0x0 phys_seg 168 prio class 0
Dec 30 04:16:17 Tower kernel: md: disk29 read error, sector=5126480224
Dec 30 04:16:17 Tower kernel: md: disk29 read error, sector=5126480232
Dec 30 04:16:17 Tower kernel: md: disk29 read error, sector=5126480240
Dec 30 04:16:17 Tower kernel: md: disk29 read error, sector=5126480248
Dec 30 04:16:17 Tower kernel: md: disk29 read error, sector=5126480256
Dec 30 04:16:17 Tower kernel: md: disk29 read error, sector=5126480264
Dec 30 04:16:17 Tower kernel: md: disk29 read error, sector=5126480272
Dec 30 04:16:17 Tower kernel: md: disk29 read error, sector=5126480280

And the logs when the drive was attached to the LSI HBA looked like this:


Jan 12 11:33:50 Tower kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
Jan 12 11:33:50 Tower kernel: sd 17:0:0:0: [sdq] tag#39 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=3s
Jan 12 11:33:50 Tower kernel: sd 17:0:0:0: [sdq] tag#39 Sense Key : 0x2 [current] 
Jan 12 11:33:50 Tower kernel: sd 17:0:0:0: [sdq] tag#39 ASC=0x4 ASCQ=0x0 
Jan 12 11:33:50 Tower kernel: sd 17:0:0:0: [sdq] tag#39 CDB: opcode=0x8a 8a 00 00 00 00 00 0b a6 2f 18 00 00 01 00 00 00
Jan 12 11:33:50 Tower kernel: I/O error, dev sdq, sector 195440408 op 0x1:(WRITE) flags 0x800 phys_seg 32 prio class 0
Jan 12 11:33:50 Tower kernel: md: disk29 write error, sector=195440344
Jan 12 11:33:50 Tower kernel: md: disk29 write error, sector=195440352
Jan 12 11:33:50 Tower kernel: md: disk29 write error, sector=195440360
Jan 12 11:33:50 Tower kernel: md: disk29 write error, sector=195440368
Jan 12 11:33:50 Tower kernel: md: disk29 write error, sector=195440376
Jan 12 11:33:50 Tower kernel: md: disk29 write error, sector=195440384
Jan 12 11:33:50 Tower kernel: md: disk29 write error, sector=195440392
Jan 12 11:33:50 Tower kernel: md: disk29 write error, sector=195440400
Jan 12 11:33:50 Tower kernel: md: disk29 write error, sector=195440408

 


Good call there: I had connected it to one of the onboard SAS ports on my motherboard (an Asus KGPE-D16 with a PIKE RAID card installed). I have now moved the disk back to the old SATA card and started the array, and the parity sync is running at a much more normal speed compared to the very slow speed with the previous configuration.

 

I'm a bit reluctant to move all the other drives around, since I don't want to risk damaging the data on them. Could it be a case where the disk just doesn't work with SAS cards? But that doesn't explain why it still fails (just more slowly) when connected to the SATA card. Can the SAS card (and the onboard SAS ports) make the disk fail faster?

 

Fresh diagnostics attached.

tower-diagnostics-20230112-1434.zip


OK, thanks for looking at it. I will wait for the parity sync to complete (20h left), and once that's done I will try moving another drive over to the new LSI HBA, as I would prefer to run everything on that card; the popular opinion seems to be that HBAs like that are what should be used with Unraid.

 

I have previously swapped cables multiple times, but the drive sits in an Icy Dock cage, so I will also try moving it to another slot to see if that does anything.

 

Would it seem strange if I manage to get my other drives working on the LSI card but not this specific drive? I do have other drives running on the on-board SAS ports, and they seem to be working fine.


Parity sync has suddenly slowed down considerably, now with a new fancy error:

Jan 12 15:52:47 Tower kernel: ata19.00: exception Emask 0x10 SAct 0x40001000 SErr 0x4010000 action 0xe frozen
Jan 12 15:52:47 Tower kernel: ata19.00: irq_stat 0x80400040, connection status changed
Jan 12 15:52:47 Tower kernel: ata19: SError: { PHYRdyChg DevExch }
Jan 12 15:52:47 Tower kernel: ata19.00: failed command: WRITE FPDMA QUEUED
Jan 12 15:52:47 Tower kernel: ata19.00: cmd 61/40:60:10:42:6b/05:00:20:00:00/40 tag 12 ncq dma 688128 out
Jan 12 15:52:47 Tower kernel:         res 40/00:70:38:ff:6d/00:00:20:00:00/40 Emask 0x10 (ATA bus error)
Jan 12 15:52:47 Tower kernel: ata19.00: status: { DRDY }
Jan 12 15:52:47 Tower kernel: ata19.00: failed command: WRITE FPDMA QUEUED
Jan 12 15:52:47 Tower kernel: ata19.00: cmd 61/38:f0:80:33:6d/05:00:20:00:00/40 tag 30 ncq dma 684032 out
Jan 12 15:52:47 Tower kernel:         res 40/00:70:38:ff:6d/00:00:20:00:00/40 Emask 0x10 (ATA bus error)
Jan 12 15:52:47 Tower kernel: ata19.00: status: { DRDY }
Jan 12 15:52:47 Tower kernel: ata19: hard resetting link
Jan 12 15:52:48 Tower kernel: ata19: SATA link up 6.0 Gbps (SStatus 133 SControl 310)
Jan 12 15:52:48 Tower kernel: ata19.00: configured for UDMA/33
Jan 12 15:52:48 Tower kernel: ata19: EH complete

 

tower-diagnostics-20230112-1553.zip


ata19 (from system/lsscsi and also syslog)

[19:0:0:0]   disk    ATA      TOSHIBA HDWD130  ACF0  /dev/sdm   /dev/sg13
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/19:0:0:0  [/sys/devices/pci0000:00/0000:00:0d.0/0000:02:00.0/ata18/host19/target19:0:0/19:0:0:0]

sdm (from system/vars and also smart folder)

            [name] => disk12
            [device] => sdm
            [id] => TOSHIBA_HDWD130_Z06316VAS

 


And now the parity sync stopped with the 4TB drive disabled:

Jan 12 16:12:55 Tower kernel: ata19: SATA link down (SStatus 0 SControl 310)
Jan 12 16:13:01 Tower kernel: ata19: hard resetting link
Jan 12 16:13:01 Tower kernel: ata19: SATA link down (SStatus 0 SControl 310)
Jan 12 16:13:06 Tower kernel: ata19: hard resetting link
Jan 12 16:13:06 Tower kernel: ata34: SATA link down (SStatus 0 SControl 300)
Jan 12 16:13:06 Tower kernel: ata33: SATA link down (SStatus 0 SControl 300)
Jan 12 16:13:07 Tower kernel: ata19: SATA link down (SStatus 0 SControl 310)
Jan 12 16:13:07 Tower kernel: ata19.00: disable device
Jan 12 16:13:07 Tower kernel: sd 20:0:0:0: [sdn] tag#4 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=24s
Jan 12 16:13:07 Tower kernel: sd 20:0:0:0: [sdn] tag#4 Sense Key : 0x3 [current] 
Jan 12 16:13:07 Tower kernel: sd 20:0:0:0: [sdn] tag#4 ASC=0x13 ASCQ=0x0 
Jan 12 16:13:07 Tower kernel: sd 20:0:0:0: [sdn] tag#4 CDB: opcode=0x8a 8a 00 00 00 00 00 22 92 b8 80 00 00 05 40 00 00
Jan 12 16:13:07 Tower kernel: I/O error, dev sdn, sector 580040832 op 0x1:(WRITE) flags 0x4000 phys_seg 168 prio class 0
Jan 12 16:13:07 Tower kernel: md: disk29 write error, sector=580040768
Jan 12 16:13:07 Tower kernel: md: disk29 write error, sector=580040776
Jan 12 16:13:07 Tower kernel: md: disk29 write error, sector=580040784
Jan 12 16:13:07 Tower kernel: md: disk29 write error, sector=580040792
Jan 12 16:13:07 Tower kernel: md: disk29 write error, sector=580040800
Jan 12 16:13:07 Tower kernel: md: disk29 write error, sector=580040808
Jan 12 16:13:07 Tower kernel: md: disk29 write error, sector=580040816
Jan 12 16:13:07 Tower kernel: md: disk29 write error, sector=580040824
Jan 12 16:13:07 Tower kernel: md: disk29 write error, sector=580040832
Jan 12 16:13:07 Tower kernel: md: disk29 write error, sector=580040840

 

tower-diagnostics-20230112-1615.zip


The move to onboard SATA was mostly because that type of error is logged more clearly. They do look like a power/connection problem, but since you've already replaced cables and the same thing happens with different controllers, it points to a disk problem. Is there any way you can replace that disk with a different one?


I'm considering simply getting a new disk, just so I can hopefully get back to having two parity drives before I do more testing. I will try switching drive slots first though, in case that is the issue. The problem is that I end up without parity protection if another drive fails, which feels a bit unsafe.


New test started, which will be very interesting. All the talk of potential power problems made me double-check the power going to the Icy Dock cage with 5 drives in it, and I noticed that the two middle black wires in one Molex connector had been pushed out of place. That connector fed two of the three power input ports on the cage. I've pushed them back in, so we'll see if things are better now, but if there are still problems I may also have to redo some other power wiring, as I have quite a lot of my drives hanging off one output from the power supply.

25 minutes ago, Freppe said:

as I have quite a lot of my drives hanging off one output from the power supply

 

Try to keep it to a max of 4 drives per connector, avoid splitters (if you absolutely must use them, go for crimped-style connections, not molded), and definitely add another run to the PSU if at all possible.

I've found Toshiba drives to be particularly sensitive to voltage sag when too many drives are on one connector.
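
As a rough back-of-the-envelope illustration (assuming about 2 A on the 12 V rail per 3.5" drive at spin-up and roughly 0.5 A once spinning, ballpark figures rather than specs for any particular model):

5 drives x ~2 A at spin-up   = ~10 A peak through one cable run and its connector pins
5 drives x ~0.5 A spinning   = ~2.5 A steady state, which is far more comfortable

That spin-up peak on a single daisy-chained run is exactly where the 12 V line can sag enough for a drive to drop off the bus.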


If this turns out to be the solution, then I guess what tricked me was that it only ever affected this one drive, even though other drives powered through the same disk cage were fine. Getting different errors when the disk was connected through different adapters also made me think the problem was with the disk rather than the power.

 

If this parity sync runs through, then I think I'll look at the power connectors and try to balance things a bit. Splitters are probably still necessary, but I could do a better job of spreading the disks across the different lines from the PSU to balance the load. I have also tried to find adapters from PCIe power connectors to Molex or SATA, since that would let me use that cable from the PSU as well (there's no graphics card in this machine), but I haven't found any appropriate adapters.

