
Drive shuts down - Failed drive or controller?



Hi all,

 

A week ago I ran into a problem with my parity drive failing while I was writing a large amount of data to the array. I think it simply shut down, because it was no longer available, not even in the BIOS.

 

So I tried a few things: I swapped the SATA cable, switched the port, and removed two drives that were still unused and not in the array.

After a few minutes of running the system, just doing a SMART test that showed no errors, the drive disappeared again. The array was not even started. I rebooted into the BIOS: drive gone. I suspected a loose cable, checked everything, booted, and all was fine. After some minutes of idling it was gone again, and I heard a loud click, like a shutdown sound, from the case.

I suspected the power supply and used another connector (from a different line) of the 650W PSU. It booted, all looked fine. Extended SMART test ran and everything looked good.

I started the array and rebuilt parity. All fine. I have used it lightly for the last few days.

 

Until yesterday, when I started moving more data onto the array. After some hours it failed again, and it is still gone.

 

Things I found in the logs prior to the failure, which I also saw last week on the same drive without further issues:

 

May 29 18:09:07 Tower kernel: ata2: SATA link down (SStatus 0 SControl 300)
May 29 18:09:14 Tower kernel: ata2: link is slow to respond, please be patient (ready=0)
May 29 18:09:17 Tower kernel: ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
May 29 18:09:17 Tower kernel: ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.SAT0.SPT1._GTF.DSSP], AE_NOT_FOUND (20200925/psargs-330)
May 29 18:09:17 Tower kernel: ACPI Error: Aborting method \_SB.PCI0.SAT0.SPT1._GTF due to previous error (AE_NOT_FOUND) (20200925/psparse-529)
May 29 18:09:17 Tower kernel: ACPI BIOS Error (bug): Could not resolve symbol [\_SB.PCI0.SAT0.SPT1._GTF.DSSP], AE_NOT_FOUND (20200925/psargs-330)
May 29 18:09:17 Tower kernel: ACPI Error: Aborting method \_SB.PCI0.SAT0.SPT1._GTF due to previous error (AE_NOT_FOUND) (20200925/psparse-529)
May 29 18:09:17 Tower kernel: ata2.00: configured for UDMA/133
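For anyone reading along, the SStatus values in these lines can be decoded by hand. Below is a minimal Python sketch, assuming the kernel prints the register in hex and that it follows the usual SATA layout (DET in bits 0-3, SPD in bits 4-7, IPM in bits 8-11). The function name `decode_sstatus` is made up for illustration.

```python
# Field meanings of the SATA SStatus register (assumed standard layout).
DET = {
    0: "no device detected, PHY communication not established",
    1: "device detected, PHY communication not established",
    3: "device detected, PHY communication established",
    4: "PHY offline",
}
SPD = {
    0: "no negotiated speed",
    1: "Gen1 (1.5 Gbps)",
    2: "Gen2 (3.0 Gbps)",
    3: "Gen3 (6.0 Gbps)",
}
IPM = {
    0: "device not present or no communication",
    1: "active",
    2: "partial power state",
    6: "slumber power state",
}


def decode_sstatus(hex_text):
    """Split an SStatus value (hex string as shown in the log) into its fields."""
    value = int(hex_text, 16)
    return {
        "det": DET.get(value & 0xF, "reserved"),
        "spd": SPD.get((value >> 4) & 0xF, "reserved"),
        "ipm": IPM.get((value >> 8) & 0xF, "reserved"),
    }


if __name__ == "__main__":
    for raw in ("0", "133"):
        print(raw, decode_sstatus(raw))
```

Under these assumptions, `SStatus 0` means the port sees no device at all, while `SStatus 133` is a healthy, active 6.0 Gbps link, which matches the "link down" and "link up 6.0 Gbps" messages next to them.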

 

And then later it failed:

 

May 29 22:11:57 Tower kernel: ata2.00: exception Emask 0x10 SAct 0x1e00000 SErr 0x4890000 action 0xe frozen
May 29 22:11:57 Tower kernel: ata2.00: irq_stat 0x08400040, interface fatal error, connection status changed
May 29 22:11:57 Tower kernel: ata2: SError: { PHYRdyChg 10B8B LinkSeq DevExch }
May 29 22:11:57 Tower kernel: ata2.00: failed command: WRITE FPDMA QUEUED
May 29 22:11:57 Tower kernel: ata2.00: cmd 61/b8:a8:98:e6:9a/00:00:37:05:00/40 tag 21 ncq dma 94208 out
May 29 22:11:57 Tower kernel:         res 40/00:b0:50:e7:9a/00:00:37:05:00/40 Emask 0x10 (ATA bus error)
May 29 22:11:57 Tower kernel: ata2.00: status: { DRDY }
May 29 22:11:57 Tower kernel: ata2.00: failed command: WRITE FPDMA QUEUED
May 29 22:11:57 Tower kernel: ata2.00: cmd 61/88:b0:50:e7:9a/04:00:37:05:00/40 tag 22 ncq dma 593920 out
May 29 22:11:57 Tower kernel:         res 40/00:b0:50:e7:9a/00:00:37:05:00/40 Emask 0x10 (ATA bus error)
May 29 22:11:57 Tower kernel: ata2.00: status: { DRDY }
May 29 22:11:57 Tower kernel: ata2.00: failed command: WRITE FPDMA QUEUED
May 29 22:11:57 Tower kernel: ata2.00: cmd 61/40:b8:d8:eb:9a/05:00:37:05:00/40 tag 23 ncq dma 688128 out
May 29 22:11:57 Tower kernel:         res 40/00:b0:50:e7:9a/00:00:37:05:00/40 Emask 0x10 (ATA bus error)
May 29 22:11:57 Tower kernel: ata2.00: status: { DRDY }
May 29 22:11:57 Tower kernel: ata2.00: failed command: WRITE FPDMA QUEUED
May 29 22:11:57 Tower kernel: ata2.00: cmd 61/e0:c0:b8:e5:9a/00:00:37:05:00/40 tag 24 ncq dma 114688 out
May 29 22:11:57 Tower kernel:         res 40/00:b0:50:e7:9a/00:00:37:05:00/40 Emask 0x10 (ATA bus error)
May 29 22:11:57 Tower kernel: ata2.00: status: { DRDY }
May 29 22:11:57 Tower kernel: ata2: hard resetting link
May 29 22:11:58 Tower kernel: ata2: SATA link down (SStatus 0 SControl 300)
May 29 22:12:03 Tower kernel: ata2: hard resetting link
May 29 22:12:04 Tower kernel: ata2: SATA link down (SStatus 0 SControl 300)
May 29 22:12:09 Tower kernel: ata2: hard resetting link
May 29 22:12:09 Tower kernel: ata2: SATA link down (SStatus 0 SControl 300)
May 29 22:12:09 Tower kernel: ata2.00: disabled
May 29 22:12:09 Tower kernel: sd 2:0:0:0: [sdc] tag#21 UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08 cmd_age=12s
May 29 22:12:09 Tower kernel: sd 2:0:0:0: [sdc] tag#21 Sense Key : 0x5 [current] 
May 29 22:12:09 Tower kernel: sd 2:0:0:0: [sdc] tag#21 ASC=0x21 ASCQ=0x4 
May 29 22:12:09 Tower kernel: sd 2:0:0:0: [sdc] tag#21 CDB: opcode=0x8a 8a 00 00 00 00 05 37 9a e6 98 00 00 00 b8 00 00
May 29 22:12:09 Tower kernel: blk_update_request: I/O error, dev sdc, sector 22407734936 op 0x1:(WRITE) flags 0x0 phys_seg 23 prio class 0
May 29 22:12:09 Tower kernel: md: disk0 write error, sector=22407734872
May 29 22:12:09 Tower kernel: md: disk0 write error, sector=22407734880
May 29 22:12:09 Tower kernel: md: disk0 write error, sector=22407734888
May 29 22:12:09 Tower kernel: md: disk0 write error, sector=22407734896
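The `SErr 0x4890000` value in the exception line can be checked against the flags the kernel prints in the `SError: { ... }` line. This is a small Python sketch, assuming the bit-to-label table below matches what the Linux libata driver uses (I wrote it from memory of the SATA SError layout, so treat it as an assumption and check it against your kernel source). `decode_serror` is a hypothetical helper name.

```python
# Assumed mapping of SError register bits to the short labels libata prints.
SERR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serror(value):
    """Return the labels of all bits set in an SError register value."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


if __name__ == "__main__":
    print(decode_serror(0x4890000))
    # → ['PHYRdyChg', '10B8B', 'LinkSeq', 'DevExch']
```

All four flags are link-level events (PHY ready change, 10b/8b decode error, link sequence error, device exchange) rather than media errors. That would fit a drive dropping off the link, though on its own it does not say whether the cause is power, cabling, the port, or the drive's own electronics.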

 

I mean, the error pattern points me toward a defective cable or PSU. But the other drives are working without issues.

The board has two 6 Gb/s SATA ports and four slower ones. I switched the cable AND connected the drive to one of the faster 6 Gb/s ports.

 

I have three Toshiba HDDs in the array: one older 4TB and two new 12TB Toshiba MG07ACA12TE. One of the 12TB drives is the parity drive and is the one failing. The other is working fine and holds most, if not all, of the data.

 

Any ideas how I can pinpoint the error further, or should I just claim warranty on the failing drive? Could it be a failing controller on the mainboard? But then I would expect occasional failures on other drives as well...
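One way to check whether only this port is affected is to tally link events per ATA port across the whole syslog. Here is a rough Python sketch, assuming log lines in the same format as the excerpts above; the event names and the `tally_ata_events` helper are made up for illustration, and the syslog path in the comment is an assumption about where the log lives.

```python
import re
from collections import Counter, defaultdict

# Matches "kernel: ata2: ..." as well as "kernel: ata2.00: ..." lines.
LINE_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")

# Substrings that mark the events of interest (labels are hypothetical).
EVENTS = {
    "link_down": "SATA link down",
    "hard_reset": "hard resetting link",
    "exception": "exception Emask",
    "disabled": "disabled",
}


def tally_ata_events(lines):
    """Count link-related events per ATA port in an iterable of log lines."""
    counts = defaultdict(Counter)
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        port, message = match.groups()
        for label, needle in EVENTS.items():
            if needle in message:
                counts[port][label] += 1
    return {port: dict(counter) for port, counter in counts.items()}


if __name__ == "__main__":
    # Assumed location; adjust to wherever your syslog is stored.
    # with open("/var/log/syslog") as handle:
    #     print(tally_ata_events(handle))
    sample = [
        "May 29 22:11:57 Tower kernel: ata2: hard resetting link",
        "May 29 22:11:58 Tower kernel: ata2: SATA link down (SStatus 0 SControl 300)",
        "May 29 22:12:09 Tower kernel: ata2.00: disabled",
    ]
    print(tally_ata_events(sample))
```

If only one port ever shows resets and link drops, and the events follow the drive when it is moved to another port, that points more toward the drive (or its power feed) than toward the onboard controller.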

 

I attached the diagnostics and a separate SMART report of the drive from last week, as it is now offline again and not included in the diagnostics zip.

tower-diagnostics-20220530-0723.zip
TOSHIBA_MG07ACA12TE_X1B0A0GXF95G-20220525-1009 parity (sdf) - DISK_DSBL.txt


Thanks for your input. I did exactly that, and the drive was still not recognized by the BIOS. I turned the machine off and on again, and there it was. After the next reboot, it was gone again.

The other drives are still fine. Cables didn't matter. I think I'll have to RMA it.

The only strange thing is that it ran fine for a few months under light usage.
