Freppe Posted January 12 (edited)

So I've been having issues with my latest drive, which I bought in November. My other drives are a mix of older 1TB drives and some newer 3TB drives, but when I last had to replace a failed drive I bought a 4TB Seagate and installed it as parity 2. The drive seemed to work perfectly fine, but after a few weeks it failed with errors. I pulled the drive and ran a full read-write test on it in my Windows computer, which passed without any errors. I reinstalled the drive and it worked fine in Unraid again, at least for a while, and then it failed again. This has repeated multiple times, but every time I test the drive in another computer I can't find any issues.

I've replaced SATA cables, thinking that might be the problem, but no luck there. I also recently bought a proper LSI SAS 9201-16i, since I thought the issue might be the SATA cards I'm using, which from what I can read online are not recommended for Unraid. But when I connect the disk to that card it fails even faster: the disk is recognized fine, but as soon as I start the parity sync I get write errors and it fails.

I have attached three diagnostics. The first is from when the disk last failed, before my attempt to use the LSI HBA. The second is with the 4TB drive connected to the LSI HBA. The third is from my currently running system, where the 4TB drive is connected to a SATA port on the motherboard and the logs are full of errors related to that drive. The parity sync is currently running there, but it seems to be very slow.

I'm at a bit of a loss for what to do here. Since I can't find any issues with the drive when it's connected to another system, I suspect it would be tricky to send it in for warranty replacement. Is there any way to tell from the logs whether the problem is with my system or with the drive? Any way to properly prove that the drive is faulty so I can send it in under warranty?
The type of logs I see right now looks like this:

Quote:
Jan 12 12:41:37 Tower kernel: sd 50:0:3:0: No reference found at driver, assuming scmd(0x00000000dd847d05) might have completed
Jan 12 12:41:37 Tower kernel: sd 50:0:3:0: task abort: SUCCESS scmd(0x00000000dd847d05)
Jan 12 12:41:37 Tower kernel: sd 50:0:3:0: attempting task abort!scmd(0x0000000096ed5524), outstanding for 34207 ms & timeout 30000 ms
Jan 12 12:41:37 Tower kernel: sd 50:0:3:0: [sdu] tag#1318 CDB: opcode=0x8a 8a 00 00 00 00 00 00 44 69 e0 00 00 04 00 00 00
Jan 12 12:41:37 Tower kernel: scsi target50:0:3: handle(0x000b), sas_address(0x4433221103000000), phy(3)
Jan 12 12:41:37 Tower kernel: scsi target50:0:3: enclosure logical id(0x5000000080000000), slot(3)

The logs from the initial failure were something like this:

Quote:
Dec 30 04:16:17 Tower kernel: sd 11:0:0:0: [sdm] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=26s
Dec 30 04:16:17 Tower kernel: sd 11:0:0:0: [sdm] tag#1 Sense Key : 0x5 [current]
Dec 30 04:16:17 Tower kernel: sd 11:0:0:0: [sdm] tag#1 ASC=0x21 ASCQ=0x4
Dec 30 04:16:17 Tower kernel: sd 11:0:0:0: [sdm] tag#1 CDB: opcode=0x88 88 00 00 00 00 01 31 8f e1 a0 00 00 05 40 00 00
Dec 30 04:16:17 Tower kernel: I/O error, dev sdm, sector 5126480288 op 0x0:(READ) flags 0x0 phys_seg 168 prio class 0
Dec 30 04:16:17 Tower kernel: md: disk29 read error, sector=5126480224
Dec 30 04:16:17 Tower kernel: md: disk29 read error, sector=5126480232
Dec 30 04:16:17 Tower kernel: md: disk29 read error, sector=5126480240
Dec 30 04:16:17 Tower kernel: md: disk29 read error, sector=5126480248
Dec 30 04:16:17 Tower kernel: md: disk29 read error, sector=5126480256
Dec 30 04:16:17 Tower kernel: md: disk29 read error, sector=5126480264
Dec 30 04:16:17 Tower kernel: md: disk29 read error, sector=5126480272
Dec 30 04:16:17 Tower kernel: md: disk29 read error, sector=5126480280

And the logs when the drive was attached to the LSI HBA were this:

Quote:
Jan 12 11:33:50 Tower kernel: mpt2sas_cm0: log_info(0x31110d00): originator(PL), code(0x11), sub_code(0x0d00)
Jan 12 11:33:50 Tower kernel: sd 17:0:0:0: [sdq] tag#39 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=3s
Jan 12 11:33:50 Tower kernel: sd 17:0:0:0: [sdq] tag#39 Sense Key : 0x2 [current]
Jan 12 11:33:50 Tower kernel: sd 17:0:0:0: [sdq] tag#39 ASC=0x4 ASCQ=0x0
Jan 12 11:33:50 Tower kernel: sd 17:0:0:0: [sdq] tag#39 CDB: opcode=0x8a 8a 00 00 00 00 00 0b a6 2f 18 00 00 01 00 00 00
Jan 12 11:33:50 Tower kernel: I/O error, dev sdq, sector 195440408 op 0x1:(WRITE) flags 0x800 phys_seg 32 prio class 0
Jan 12 11:33:50 Tower kernel: md: disk29 write error, sector=195440344
Jan 12 11:33:50 Tower kernel: md: disk29 write error, sector=195440352
Jan 12 11:33:50 Tower kernel: md: disk29 write error, sector=195440360
Jan 12 11:33:50 Tower kernel: md: disk29 write error, sector=195440368
Jan 12 11:33:50 Tower kernel: md: disk29 write error, sector=195440376
Jan 12 11:33:50 Tower kernel: md: disk29 write error, sector=195440384
Jan 12 11:33:50 Tower kernel: md: disk29 write error, sector=195440392
Jan 12 11:33:50 Tower kernel: md: disk29 write error, sector=195440400
Jan 12 11:33:50 Tower kernel: md: disk29 write error, sector=195440408

tower-diagnostics-20221230-1239.zip
tower-diagnostics-20230112-1138.zip
tower-diagnostics-20230112-1245.zip

Edited January 12 by Freppe: Added sample of logs
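(For reference: one way to gather SMART-level evidence for a warranty claim is a quick check from the Unraid console, as in the sketch below. /dev/sdX is a placeholder for whatever device letter the 4TB drive currently has in the syslog/diagnostics.)

```
# Sketch: collect SMART evidence for a possible RMA (replace /dev/sdX with the
# drive's current device node)
smartctl -H -A /dev/sdX        # overall health plus the attribute table; watch
                               # Reallocated_Sector_Ct, Current_Pending_Sector, UDMA_CRC_Error_Count
smartctl -t long /dev/sdX      # start the drive's own extended self-test
smartctl -l selftest /dev/sdX  # read the self-test log once the test has finished
```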
JorgeB Posted January 12 (Solution)

The same disk failing on different controllers suggests a disk problem. Did you use a different power cable (or backplane slot) as well? In the last diags the disk is connected to an LSI, and you have two of those; connect it to the onboard SATA and post new diags.
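(Side note: if the webGUI is hard to reach while the disk is erroring, the same diagnostics zip can be generated from a terminal, assuming the standard Unraid CLI tool; by default it should end up on the flash drive under /boot/logs.)

```
# Sketch: create a diagnostics zip from the console/SSH instead of the webGUI
diagnostics
ls -lt /boot/logs | head   # the newest tower-diagnostics-*.zip should be listed first
```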
Freppe Posted January 12 (Author)

Good call there; I had connected it to one of the onboard SAS ports on my motherboard (an Asus KGPE-D16 with a PIKE RAID card installed). I have now moved the disk back to the old SATA card and started the array, and the parity sync is running at a more normal speed compared to the very slow speed of the previous configuration. I'm a bit reluctant to move the other drives around, since I don't want to risk damaging the data on them. Could it be a case of the disk just not working with SAS cards? But that doesn't explain why it also fails (more slowly) when connected to the SATA card. Can a SAS card (and the onboard SAS ports) cause the disk to fail faster?

Fresh diagnostics attached.
tower-diagnostics-20230112-1434.zip
JorgeB Posted January 12

No issues so far, let's see how it goes. Post new diags if it starts erroring out, and don't forget to replace/swap the power cable/slot to rule that out as well.
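(One way to catch the next failure the moment it happens is to watch the syslog live while the sync runs; a minimal sketch, where the grep pattern is only an example and should be adjusted to the drive's current device letter.)

```
# Sketch: follow the syslog in real time and surface entries resembling the earlier failures
tail -f /var/log/syslog | grep --line-buffered -Ei "sdu|I/O error|task abort|read error|write error"
```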
Freppe Posted January 12 (Author)

OK, thanks for looking at it. I will wait for the parity sync to complete (20h left), and once that's done I will try moving another drive over to the new LSI HBA, as I would prefer to run everything on that card, given that the popular opinion seems to be that cards like it are what should be used with Unraid. I have previously swapped cables multiple times, but the drive sits in an Icy Dock cage, so I will also try moving it to another slot to see if that changes anything. Would it seem strange if I managed to get my other drives working on the LSI card but not this specific drive? I do have other drives running on the onboard SAS ports, and they seem to be working fine.
Freppe Posted January 12 (Author)

Parity sync has suddenly slowed down considerably, now with a new fancy error:

Quote:
Jan 12 15:52:47 Tower kernel: ata19.00: exception Emask 0x10 SAct 0x40001000 SErr 0x4010000 action 0xe frozen
Jan 12 15:52:47 Tower kernel: ata19.00: irq_stat 0x80400040, connection status changed
Jan 12 15:52:47 Tower kernel: ata19: SError: { PHYRdyChg DevExch }
Jan 12 15:52:47 Tower kernel: ata19.00: failed command: WRITE FPDMA QUEUED
Jan 12 15:52:47 Tower kernel: ata19.00: cmd 61/40:60:10:42:6b/05:00:20:00:00/40 tag 12 ncq dma 688128 out
Jan 12 15:52:47 Tower kernel: res 40/00:70:38:ff:6d/00:00:20:00:00/40 Emask 0x10 (ATA bus error)
Jan 12 15:52:47 Tower kernel: ata19.00: status: { DRDY }
Jan 12 15:52:47 Tower kernel: ata19.00: failed command: WRITE FPDMA QUEUED
Jan 12 15:52:47 Tower kernel: ata19.00: cmd 61/38:f0:80:33:6d/05:00:20:00:00/40 tag 30 ncq dma 684032 out
Jan 12 15:52:47 Tower kernel: res 40/00:70:38:ff:6d/00:00:20:00:00/40 Emask 0x10 (ATA bus error)
Jan 12 15:52:47 Tower kernel: ata19.00: status: { DRDY }
Jan 12 15:52:47 Tower kernel: ata19: hard resetting link
Jan 12 15:52:48 Tower kernel: ata19: SATA link up 6.0 Gbps (SStatus 133 SControl 310)
Jan 12 15:52:48 Tower kernel: ata19.00: configured for UDMA/33
Jan 12 15:52:48 Tower kernel: ata19: EH complete

tower-diagnostics-20230112-1553.zip
trurl Posted January 12

Quote (Freppe, 14 minutes ago): "new fancy error"

Typical connection problem log entries.
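(A quick way to gauge how often that link is dropping is to search the syslog, either inside the diagnostics zip or at /var/log/syslog on the running server, for the reset messages; a sketch using the ata port from the log above.)

```
# Sketch: count link resets and list the most recent link-state events for ata19
grep -c "ata19: hard resetting link" /var/log/syslog
grep -E "ata19.*(SATA link|PHYRdyChg|hard resetting)" /var/log/syslog | tail -n 20
```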
trurl Posted January 12

ata19 (from system/lsscsi and also the syslog):

[19:0:0:0]  disk  ATA  TOSHIBA HDWD130  ACF0  /dev/sdm  /dev/sg13
  state=running queue_depth=32 scsi_level=6 type=0 device_blocked=0 timeout=30
  dir: /sys/bus/scsi/devices/19:0:0:0 [/sys/devices/pci0000:00/0000:00:0d.0/0000:02:00.0/ata18/host19/target19:0:0/19:0:0:0]

sdm (from system/vars and also the smart folder):

[name] => disk12
[device] => sdm
[id] => TOSHIBA_HDWD130_Z06316VAS
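(The same mapping can be reproduced from a console; a minimal sketch, where the device letter and serial number are simply the ones quoted above and will differ on other systems.)

```
# Sketch: map a kernel "ataN"/"sdX" error back to a specific physical drive
lsscsi -l -v                       # [H:C:T:L], model, /dev/sdX and the sysfs path containing the ata port
smartctl -i /dev/sdm               # confirm model and serial number for that device node
ls -l /dev/disk/by-id/ | grep sdm  # serial-number symlinks pointing at the same device letter
```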
Freppe Posted January 12 (Author)

And now the parity sync has stopped, with the 4TB drive disabled:

Quote:
Jan 12 16:12:55 Tower kernel: ata19: SATA link down (SStatus 0 SControl 310)
Jan 12 16:13:01 Tower kernel: ata19: hard resetting link
Jan 12 16:13:01 Tower kernel: ata19: SATA link down (SStatus 0 SControl 310)
Jan 12 16:13:06 Tower kernel: ata19: hard resetting link
Jan 12 16:13:06 Tower kernel: ata34: SATA link down (SStatus 0 SControl 300)
Jan 12 16:13:06 Tower kernel: ata33: SATA link down (SStatus 0 SControl 300)
Jan 12 16:13:07 Tower kernel: ata19: SATA link down (SStatus 0 SControl 310)
Jan 12 16:13:07 Tower kernel: ata19.00: disable device
Jan 12 16:13:07 Tower kernel: sd 20:0:0:0: [sdn] tag#4 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=DRIVER_OK cmd_age=24s
Jan 12 16:13:07 Tower kernel: sd 20:0:0:0: [sdn] tag#4 Sense Key : 0x3 [current]
Jan 12 16:13:07 Tower kernel: sd 20:0:0:0: [sdn] tag#4 ASC=0x13 ASCQ=0x0
Jan 12 16:13:07 Tower kernel: sd 20:0:0:0: [sdn] tag#4 CDB: opcode=0x8a 8a 00 00 00 00 00 22 92 b8 80 00 00 05 40 00 00
Jan 12 16:13:07 Tower kernel: I/O error, dev sdn, sector 580040832 op 0x1:(WRITE) flags 0x4000 phys_seg 168 prio class 0
Jan 12 16:13:07 Tower kernel: md: disk29 write error, sector=580040768
Jan 12 16:13:07 Tower kernel: md: disk29 write error, sector=580040776
Jan 12 16:13:07 Tower kernel: md: disk29 write error, sector=580040784
Jan 12 16:13:07 Tower kernel: md: disk29 write error, sector=580040792
Jan 12 16:13:07 Tower kernel: md: disk29 write error, sector=580040800
Jan 12 16:13:07 Tower kernel: md: disk29 write error, sector=580040808
Jan 12 16:13:07 Tower kernel: md: disk29 write error, sector=580040816
Jan 12 16:13:07 Tower kernel: md: disk29 write error, sector=580040824
Jan 12 16:13:07 Tower kernel: md: disk29 write error, sector=580040832
Jan 12 16:13:07 Tower kernel: md: disk29 write error, sector=580040840

tower-diagnostics-20230112-1615.zip
JorgeB Posted January 12

The move to the onboard SATA was mostly because that type of error is logged more clearly there. They do look like a power/connection problem, but since you've already replaced cables and the same thing happens on different controllers, it suggests a disk problem. Is there any way you can replace that disk with a different one?
Freppe Posted January 12 (Author)

I'm considering simply getting a new disk, just so I can hopefully get back to having two parity drives before I do more testing. I will try switching drive slots first, though, in case that is the issue. The problem is that I end up without parity protection if another drive fails, which feels a bit unsafe.
Freppe Posted January 12 (Author)

New test started, which will be very interesting. All the talk of potential power problems made me double-check the power going to the Icy Dock cage with the 5 drives, and I noticed that the two middle black wires in one Molex connector had been pushed out of place. That connector fed two of the three power inputs on the Icy Dock cage. I've pushed them back in, so we'll see if things are better now. If there are still problems, I may also have to redo some other power wiring, as I have quite a lot of drives hanging off one output from the power supply.
Michael_P Posted January 12 (edited)

Quote (Freppe, 25 minutes ago): "as I have quite a lot of drives hanging off one output from the power supply"

Try to keep at most 4 drives per connector, avoid splitters (use crimped-style connections, not molded, if you absolutely must), and definitely add another run to the PSU if at all possible. I've found Toshiba drives to be particularly sensitive to voltage sag when too many drives are on one connector.

Edited January 12 by Michael_P
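(As a rough sanity check, using assumed typical figures rather than numbers from this thread: a 3.5" drive can draw on the order of 2 A from the 12 V rail at spin-up, so five drives spinning up together on one daisy-chained run is roughly 5 x 2 A = 10 A, around 120 W, through a single cable and its connectors. That is exactly the kind of load where voltage sag and intermittent drive dropouts start to appear, so check the actual drive datasheets when planning the wiring.)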
Freppe Posted January 12 (Author)

If this turns out to be the solution, then I guess what tricked me was that it only affected this one drive, even though other drives powered through the same disk cage were fine. The fact that I got different errors when the disk was connected through different adapters also made me think the problem was with the disk rather than the power. If this parity sync runs through, I'll look at the power connectors and try to balance things a bit. Splitters are probably unavoidable, but I could do a better job of spreading the disks across the different lines from the PSU to balance the load. I have also tried to find adapters from PCIe power connectors to Molex or SATA, since that would let me use that cable from the PSU as well (there's no graphics card in this machine), but I haven't found any suitable ones.
Michael_P Posted January 12

Quote (Freppe, 3 hours ago): "I have also tried to find adapters from PCIe power connectors to Molex"

I made my own using this style of connector, by adding them to unused SATA power cables that came with the PSU. Simple enough, and you can add as many as you need.
https://www.moddiy.com/products/DIY-IDE-Molex-Power-EZ-Crimp-Connector-%2d-Black.html