February 5, 2019 tower-diagnostics-20190205-1216.zip Hello, I'm having some very frustrating problems with disks dropping from the array with read errors. The problems are specific to a set of brand new drives I purchased; each drive was precleared before entering the array, with no SMART errors. I purchased 10 Toshiba 8TB drives to begin replacing some aging disks. As soon as I started adding the disks to the array I had problems: initially the XFS filesystem became corrupt and could not be repaired. I removed the offending drives and used UFS Explorer to recover all the data successfully. I have now backed up all my data on separate drives and have started a completely new array. From trial and error while restoring my files from backups, it seems that as soon as one of the drives fills up to 4.03TB I start getting disk read errors. I have managed to get one drive to fill past the 4.03TB mark, but only by transferring directly to the disk share rather than the user share (again found through trial and error). Obviously this isn't the intended use case, but I am trying to understand what the root cause is.
One of the disks I am pre-clearing is also being limited to 1.5 Gbps, with these errors showing:
Feb 4 03:15:36 Tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb 4 03:15:36 Tower kernel: ata3.00: ATA-10: TOSHIBA MG05ACA800E, Z6LGK004FXJD, GX0R, max UDMA/100
Feb 4 03:15:36 Tower kernel: ata3.00: 15628053168 sectors, multi 16: LBA48 NCQ (depth 32), AA
Feb 4 03:15:36 Tower kernel: ata3.00: configured for UDMA/100
Feb 4 03:15:36 Tower kernel: sd 3:0:0:0: [sdk] 15628053168 512-byte logical blocks: (8.00 TB/7.28 TiB)
Feb 4 03:15:36 Tower kernel: sd 3:0:0:0: [sdk] 4096-byte physical blocks
Feb 4 03:15:36 Tower kernel: sd 3:0:0:0: [sdk] Write Protect is off
Feb 4 03:15:36 Tower kernel: sd 3:0:0:0: [sdk] Mode Sense: 00 3a 00 00
Feb 4 03:15:36 Tower kernel: sd 3:0:0:0: [sdk] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
Feb 4 03:15:36 Tower kernel: ata3.00: exception Emask 0x50 SAct 0x20000 SErr 0xb0802 action 0xe frozen
Feb 4 03:15:36 Tower kernel: ata3.00: irq_stat 0x00400000, PHY RDY changed
Feb 4 03:15:36 Tower kernel: ata3: SError: { RecovComm HostInt PHYRdyChg PHYInt 10B8B }
Feb 4 03:15:36 Tower kernel: ata3.00: failed command: READ FPDMA QUEUED
Feb 4 03:15:36 Tower kernel: ata3.00: cmd 60/08:88:00:00:00/00:00:00:00:00/40 tag 17 ncq dma 4096 in
Feb 4 03:15:36 Tower kernel: ata3.00: status: { DRDY }
Feb 4 03:15:36 Tower kernel: ata3: hard resetting link
Feb 4 03:15:42 Tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb 4 03:15:42 Tower kernel: ata3.00: configured for UDMA/100
Feb 4 03:15:42 Tower kernel: ata3: EH complete
Feb 4 03:15:42 Tower kernel: sdk: sdk1
Feb 4 03:15:42 Tower kernel: sd 3:0:0:0: [sdk] Attached SCSI disk
Feb 4 03:15:44 Tower kernel: ata3: SATA link down (SStatus 0 SControl 300)
Feb 4 03:15:56 Tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb 4 03:15:56 Tower kernel: ata3.00: configured for UDMA/100
Feb 4 03:16:03 Tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb 4 03:16:03 Tower kernel: ata3.00: configured for UDMA/100
Feb 4 03:16:04 Tower kernel: ata3: limiting SATA link speed to 3.0 Gbps
Feb 4 03:16:10 Tower kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Feb 4 03:16:10 Tower kernel: ata3.00: configured for UDMA/100
Feb 4 03:16:17 Tower kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Feb 4 03:16:17 Tower kernel: ata3.00: configured for UDMA/100
Feb 4 03:16:24 Tower kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Feb 4 03:16:24 Tower kernel: ata3.00: configured for UDMA/100
Feb 4 03:16:31 Tower kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Feb 4 03:16:31 Tower kernel: ata3.00: configured for UDMA/100
Feb 4 03:16:32 Tower kernel: ata3: limiting SATA link speed to 1.5 Gbps
Feb 4 03:16:39 Tower kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Feb 4 03:16:39 Tower kernel: ata3.00: configured for UDMA/100
Feb 4 03:16:47 Tower kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Feb 4 03:16:47 Tower kernel: ata3.00: configured for UDMA/100
Again, this disk is brand new and passes SMART scans. I did read that there were some Linux kernel bugs with certain drives and SATA controllers, but I have tried both my motherboard's onboard SATA ports and my LSI 9260-8i, with the same results on each. My motherboard is a Gigabyte X399 board with a Threadripper 1900X. I have also seated and re-seated the cables and tried different cables. I have attached full diagnostics to this post. I'm hoping someone will point out where I have done something incredibly stupid and help me resolve the issue, because at this point I am ready to give up. Edited February 5, 2019 by GoudaK
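For anyone reading a long syslog like the one in this post, the link-speed step-down (6.0 to 3.0 to 1.5 Gbps) can be pulled out with a few lines of Python. This is only a rough sketch: the function name `link_speed_history` is made up for illustration, and it assumes the kernel messages keep the "SATA link up N Gbps" wording shown in the excerpt.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   "... kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)"
LINK_UP = re.compile(r"kernel: (ata\d+): SATA link up ([\d.]+) Gbps")

def link_speed_history(log_text):
    """Return {port: [negotiated speeds in order]}, collapsing consecutive repeats."""
    history = defaultdict(list)
    for port, speed in LINK_UP.findall(log_text):
        speed = float(speed)
        # Only record a speed when it differs from the previous one on this port.
        if not history[port] or history[port][-1] != speed:
            history[port].append(speed)
    return dict(history)

# A few lines copied from the excerpt in this post.
sample = """\
Feb 4 03:15:36 Tower kernel: ata3: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Feb 4 03:16:04 Tower kernel: ata3: limiting SATA link speed to 3.0 Gbps
Feb 4 03:16:10 Tower kernel: ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Feb 4 03:16:32 Tower kernel: ata3: limiting SATA link speed to 1.5 Gbps
Feb 4 03:16:39 Tower kernel: ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
"""

print(link_speed_history(sample))
```

A port whose list has more than one entry has renegotiated at a lower speed at least once, which is usually the first thing worth checking when comparing ports or controllers.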
February 5, 2019 Community Expert Have you checked the power cables and connections from the disks all the way back to the power supply?
February 5, 2019 Author Thank you for your reply. I have checked that, and have also unplugged and replugged them a number of times. At one point I had 2 rows of 4 drives. One row had 2 Toshiba drives, 1 Seagate and 1 HGST; the Toshiba drives were still causing issues, while the other 2 drives have not had any issues at all. The other row has 4 Toshiba drives. I am using Molex to 4x SATA breakout cables, each running off a separate rail of the PSU. The PSU is a 1200W Silverstone Gold PSU.
February 5, 2019 Author I should also note that as soon as a drive failed I precleared it a second time, making sure not to physically move it; the resulting preclear and post-read were successful.
February 5, 2019 Community Expert Disable spin down for disk4. Unraid can't currently spin down SAS disks, and it's spamming your log with related errors, which makes the log much harder to analyze and leaves it missing some time. After that, reboot and work normally until you get some errors, then please post new diags.
February 6, 2019 Author Thanks Johnnie, I realised too late that it was causing issues. I've disabled it and will reboot and report back.
February 6, 2019 Author So I have rebooted and now keep getting SATA link resets...
Feb 6 12:10:54 Tower kernel: ata6: SError: { RecovComm HostInt PHYRdyChg PHYInt }
Feb 6 12:10:54 Tower kernel: ata6.00: failed command: READ DMA EXT
Feb 6 12:10:54 Tower kernel: ata6.00: cmd 25/00:f8:08:04:1b/00:03:00:00:00/e0 tag 19 dma 520192 in
Feb 6 12:10:54 Tower kernel: res 50/00:00:07:04:1b/00:00:00:00:00/e0 Emask 0x50 (ATA bus error)
Feb 6 12:10:54 Tower kernel: ata6.00: status: { DRDY }
Feb 6 12:10:54 Tower kernel: ata6: hard resetting link
Feb 6 12:10:55 Tower kernel: ata6: SATA link down (SStatus 0 SControl 320)
Feb 6 12:10:57 Tower kernel: ata6: hard resetting link
Feb 6 12:10:57 Tower kernel: ata6: SATA link down (SStatus 0 SControl 320)
Feb 6 12:10:59 Tower kernel: ata6: hard resetting link
Feb 6 12:11:04 Tower kernel: ata6: link is slow to respond, please be patient (ready=0)
Feb 6 12:11:07 Tower kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Feb 6 12:11:07 Tower kernel: ata6.00: configured for UDMA/100
Feb 6 12:11:07 Tower kernel: ata6: EH complete
Feb 6 12:11:07 Tower kernel: ata7.00: exception Emask 0x50 SAct 0x0 SErr 0x30802 action 0xe frozen
Feb 6 12:11:07 Tower kernel: ata7.00: irq_stat 0x00400000, PHY RDY changed
Feb 6 12:11:07 Tower kernel: ata7: SError: { RecovComm HostInt PHYRdyChg PHYInt }
Feb 6 12:11:07 Tower kernel: ata7.00: failed command: READ DMA EXT
Feb 6 12:11:07 Tower kernel: ata7.00: cmd 25/00:08:08:30:1b/00:04:00:00:00/e0 tag 6 dma 528384 in
Feb 6 12:11:07 Tower kernel: res 50/00:00:07:30:1b/00:00:00:00:00/e0 Emask 0x50 (ATA bus error)
Feb 6 12:11:07 Tower kernel: ata7.00: status: { DRDY }
Feb 6 12:11:07 Tower kernel: ata7: hard resetting link
Feb 6 12:11:08 Tower kernel: ata7: SATA link down (SStatus 0 SControl 310)
Feb 6 12:11:10 Tower kernel: ata7: hard resetting link
Feb 6 12:11:10 Tower kernel: ata7: SATA link down (SStatus 0 SControl 310)
Feb 6 12:11:11 Tower kernel: ata7: hard resetting link
Feb 6 12:11:20 Tower kernel: ata7: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Feb 6 12:11:20 Tower kernel: ata7.00: configured for UDMA/33
Feb 6 12:11:20 Tower kernel: ata7: EH complete
Feb 6 12:11:20 Tower kernel: ata7.00: exception Emask 0x50 SAct 0x0 SErr 0x30802 action 0xe frozen
Feb 6 12:11:20 Tower kernel: ata7.00: irq_stat 0x00400000, PHY RDY changed
Feb 6 12:11:20 Tower kernel: ata7: SError: { RecovComm HostInt PHYRdyChg PHYInt }
Feb 6 12:11:20 Tower kernel: ata7.00: failed command: READ DMA EXT
Feb 6 12:11:20 Tower kernel: ata7.00: cmd 25/00:f8:08:d8:1b/00:03:00:00:00/e0 tag 8 dma 520192 in
Feb 6 12:11:20 Tower kernel: res 50/00:00:07:d8:1b/00:00:00:00:00/e0 Emask 0x50 (ATA bus error)
Feb 6 12:11:20 Tower kernel: ata7.00: status: { DRDY }
Feb 6 12:11:20 Tower kernel: ata7: hard resetting link
Feb 6 12:11:21 Tower kernel: ata7: SATA link down (SStatus 0 SControl 310)
Feb 6 12:11:23 Tower kernel: ata7: hard resetting link
Feb 6 12:11:23 Tower kernel: ata7: SATA link down (SStatus 0 SControl 310)
Feb 6 12:11:24 Tower kernel: ata7: hard resetting link
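To see at a glance which ports are flapping in a log like this, the reset, link-down and link-up messages can be tallied per `ataN` port. The following is a hedged sketch, not a diagnostic tool: `count_link_events` is a hypothetical helper name, and the regex only covers the three message wordings that appear in the excerpt.

```python
import re
from collections import Counter, defaultdict

# Port-level link events as they are worded in the excerpt.
EVENT = re.compile(
    r"kernel: (ata\d+): (hard resetting link|SATA link down|SATA link up)"
)

def count_link_events(log_text):
    """Return {port: {event: count}} for reset / link down / link up messages."""
    counts = defaultdict(Counter)
    for port, event in EVENT.findall(log_text):
        counts[port][event] += 1
    return {port: dict(c) for port, c in counts.items()}

# A subset of lines copied from the excerpt in this post.
sample = """\
Feb 6 12:10:54 Tower kernel: ata6: hard resetting link
Feb 6 12:10:55 Tower kernel: ata6: SATA link down (SStatus 0 SControl 320)
Feb 6 12:10:57 Tower kernel: ata6: hard resetting link
Feb 6 12:10:57 Tower kernel: ata6: SATA link down (SStatus 0 SControl 320)
Feb 6 12:10:59 Tower kernel: ata6: hard resetting link
Feb 6 12:11:07 Tower kernel: ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Feb 6 12:11:07 Tower kernel: ata7: hard resetting link
Feb 6 12:11:08 Tower kernel: ata7: SATA link down (SStatus 0 SControl 310)
"""

for port, events in sorted(count_link_events(sample).items()):
    print(port, events)
```

Running the same tally on diagnostics taken before and after moving the disks to another controller gives a quick way to check whether the resets follow the disks or stay with the ports.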
February 6, 2019 Community Expert Grab those diagnostics, then update the LSI firmware to the latest, 20.00.07.00, since the one you're using has known issues. Connect both disks to the LSI controller (if you don't know which ones, post the diags) and work for a little while; if there are errors, post both diagnostics.
February 7, 2019 Author I'll grab full diagnostics when I get home this arvo. I should have noted that all disks are currently on the motherboard to rule out the LSI card. Should I update the firmware and then move the 2 offending disks back to the LSI card?
February 7, 2019 Community Expert 6 hours ago, GoudaK said: should I update firmware and then move the 2 offending disks back to the LSI card? Yes, to see if the problems stay with the disks.
February 7, 2019 Author Well, for some reason I came home to find the errors in the log had stopped and the parity build humming along nicely... I thought I had stopped it but clearly hadn't... I will wait for the parity rebuild to complete (5 hours left), update the firmware and go from there. Thanks for the help so far.