Potential ATA issues / drive falls out of array


Dec 30 19:50:23 Tower kernel: ata4.00: exception Emask 0x10 SAct 0x80000000 SErr 0x4090000 action 0xe frozen
Dec 30 19:50:23 Tower kernel: ata4.00: irq_stat 0x00400040, connection status changed
Dec 30 19:50:23 Tower kernel: ata4: SError: { PHYRdyChg 10B8B DevExch }
Dec 30 19:50:23 Tower kernel: ata4.00: failed command: READ FPDMA QUEUED
Dec 30 19:50:23 Tower kernel: ata4.00: cmd 60/20:f8:a0:00:00/00:00:00:02:00/40 tag 31 ncq dma 16384 in
Dec 30 19:50:23 Tower kernel:         res 40/00:f8:a0:00:00/00:00:00:02:00/40 Emask 0x10 (ATA bus error)
Dec 30 19:50:23 Tower kernel: ata4.00: status: { DRDY }
Dec 30 19:50:23 Tower kernel: ata4: hard resetting link
Dec 30 19:50:26 Tower kernel: ata1: link is slow to respond, please be patient (ready=0)
Dec 30 19:50:26 Tower kernel: ata2: link is slow to respond, please be patient (ready=0)
Dec 30 19:50:29 Tower kernel: ata4: link is slow to respond, please be patient (ready=0)
Dec 30 19:50:30 Tower kernel: ata1: COMRESET failed (errno=-16)
Dec 30 19:50:30 Tower kernel: ata2: COMRESET failed (errno=-16)
Dec 30 19:50:30 Tower kernel: ata2: hard resetting link
Dec 30 19:50:31 Tower kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 320)
Dec 30 19:50:31 Tower kernel: ata1.00: configured for UDMA/133
Dec 30 19:50:33 Tower kernel: ata4: COMRESET failed (errno=-16)
Dec 30 19:50:33 Tower kernel: ata4: hard resetting link
Dec 30 19:50:33 Tower kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Dec 30 19:50:33 Tower kernel: ata2.00: configured for UDMA/133
Dec 30 19:50:33 Tower kernel: ata2: EH complete
Dec 30 19:50:38 Tower kernel: ata4: link is slow to respond, please be patient (ready=0)
Dec 30 19:50:43 Tower kernel: ata4: COMRESET failed (errno=-16)
Dec 30 19:50:43 Tower kernel: ata4: hard resetting link
Dec 30 19:50:46 Tower kernel: ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Dec 30 19:50:46 Tower kernel: ata4.00: configured for UDMA/133
Dec 30 19:50:46 Tower kernel: ata4: EH complete
Dec 30 19:50:47 Tower kernel: ata2.00: exception Emask 0x10 SAct 0x700027 SErr 0x4890000 action 0xe frozen
Dec 30 19:50:47 Tower kernel: ata2.00: irq_stat 0x0c400040, interface fatal error, connection status changed
Dec 30 19:50:47 Tower kernel: ata2: SError: { PHYRdyChg 10B8B LinkSeq DevExch }
Dec 30 19:50:47 Tower kernel: ata2.00: failed command: READ FPDMA QUEUED
Dec 30 19:50:47 Tower kernel: ata2.00: cmd 60/00:00:20:58:7d/04:00:85:01:00/40 tag 0 ncq dma 524288 in
Dec 30 19:50:47 Tower kernel:         res 40/00:10:60:5d:7d/00:00:85:01:00/40 Emask 0x10 (ATA bus error)
Dec 30 19:50:47 Tower kernel: ata2.00: status: { DRDY }
Dec 30 19:50:47 Tower kernel: ata2.00: failed command: READ FPDMA QUEUED
Dec 30 19:50:47 Tower kernel: ata2.00: cmd 60/40:08:20:5c:7d/01:00:85:01:00/40 tag 1 ncq dma 163840 in
Dec 30 19:50:47 Tower kernel:         res 40/00:10:60:5d:7d/00:00:85:01:00/40 Emask 0x10 (ATA bus error)
Dec 30 19:50:47 Tower kernel: ata2.00: status: { DRDY }
Dec 30 19:50:47 Tower kernel: ata2.00: failed command: READ FPDMA QUEUED
Dec 30 19:50:47 Tower kernel: ata2.00: cmd 60/d0:10:60:5d:7d/03:00:85:01:00/40 tag 2 ncq dma 499712 in
Dec 30 19:50:47 Tower kernel:         res 40/00:10:60:5d:7d/00:00:85:01:00/40 Emask 0x10 (ATA bus error)
Dec 30 19:50:47 Tower kernel: ata2.00: status: { DRDY }
Dec 30 19:50:47 Tower kernel: ata2.00: failed command: READ FPDMA QUEUED
Dec 30 19:50:47 Tower kernel: ata2.00: cmd 60/00:28:30:61:7d/04:00:85:01:00/40 tag 5 ncq dma 524288 in
Dec 30 19:50:47 Tower kernel:         res 40/00:10:60:5d:7d/00:00:85:01:00/40 Emask 0x10 (ATA bus error)
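
For anyone trying to read the SErr values in the log above, here is a rough Python sketch that decodes the hex value into the flag names the kernel prints in braces. The bit positions follow the SATA SError register layout. Only the four names that appear in this log (PHYRdyChg, 10B8B, LinkSeq, DevExch) are confirmed by the thread itself; the other short names are recalled from the kernel's libata source and should be treated as an assumption. `decode_serr` is a made-up helper name.

```python
# Bit number -> short name, as printed by the kernel in "SError: { ... }".
# Names other than PHYRdyChg, 10B8B, LinkSeq and DevExch are an assumption
# (recalled from libata, not shown anywhere in this thread).
SERR_BITS = {
    0: "RecovData",
    1: "RecovComm",
    8: "UnrecovData",
    9: "Persist",
    10: "Proto",
    11: "HostInt",
    16: "PHYRdyChg",
    17: "PHYInt",
    18: "CommWake",
    19: "10B8B",
    20: "Dispar",
    21: "BadCRC",
    22: "Handshk",
    23: "LinkSeq",
    24: "TrStaTrns",
    25: "UnrecFIS",
    26: "DevExch",
}


def decode_serr(value: int) -> list[str]:
    """Return the names of all set SError bits, lowest bit first."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


if __name__ == "__main__":
    # The two SErr values from the log excerpt.
    for raw in (0x4090000, 0x4890000):
        print(hex(raw), decode_serr(raw))
```

Both values decode to flags about the link itself (PHY ready change, 10b/8b decode error, device exchanged), which tends to point toward cabling, power, or the port rather than the disk surface, though that is a hint and not proof.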

 

I'm seeing a bunch of ATA errors in the syslog, but I'm not sure exactly what's triggering them. I removed the cable attached to the port labeled SATA3_2 on my motherboard, but the errors continue. While troubleshooting I tried to reboot, but I couldn't reboot from the GUI and had to use the button. Unraid then couldn't unmount all the drives, so I forced an unclean shutdown. When the server came back up, Disk 17 had fallen out of the array with no prior warning.
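
One way to narrow down what is triggering the errors is to count the events per ATA port. The following is a minimal Python sketch, assuming syslog lines in the same format as the excerpt above; `tally`, `EVENTS`, and `SAMPLE` are hypothetical names, and the sample lines are copied from the log in this post.

```python
import re
from collections import Counter, defaultdict

# Matches "kernel: ata4.00: <message>" and "kernel: ata4: <message>".
LINE_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)$")

# Label -> substring to look for in the message part of the line.
EVENTS = {
    "exception": "exception Emask",
    "failed_command": "failed command:",
    "hard_reset": "hard resetting link",
    "comreset_failed": "COMRESET failed",
    "link_up": "SATA link up",
}


def tally(lines):
    """Count each event type per ATA port (ata1, ata2, ...)."""
    counts = defaultdict(Counter)
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # continuation lines such as "res 40/00:..." have no port
        port, message = match.groups()
        for label, needle in EVENTS.items():
            if needle in message:
                counts[port][label] += 1
    return counts


SAMPLE = """\
Dec 30 19:50:23 Tower kernel: ata4.00: exception Emask 0x10 SAct 0x80000000 SErr 0x4090000 action 0xe frozen
Dec 30 19:50:23 Tower kernel: ata4: hard resetting link
Dec 30 19:50:30 Tower kernel: ata1: COMRESET failed (errno=-16)
Dec 30 19:50:30 Tower kernel: ata2: COMRESET failed (errno=-16)
Dec 30 19:50:33 Tower kernel: ata4: COMRESET failed (errno=-16)
Dec 30 19:50:43 Tower kernel: ata4: COMRESET failed (errno=-16)
""".splitlines()

if __name__ == "__main__":
    for port, events in sorted(tally(SAMPLE).items()):
        print(port, dict(events))
```

In the excerpt, ata1, ata2, and ata4 all report resets within the same half minute. That may hint at something the ports share (controller, power, backplane) rather than a single cable, but the log alone doesn't settle it.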

 

When I look at the SMART attributes for Disk 17, it shows a high raw read error rate and a high seek error rate, but I'm not sure whether that's caused by bad cables or by some hardware issue other than the disk itself.

 

I think I may also be triggering it when running mover, but I can't tell.

 

Any thoughts?

tower-diagnostics-20231230-1952.zip

1 hour ago, trurl said:

Check filesystem on disk17

 

Oof.

 

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
agf_freeblks 9058410, counted 9058399 in ag 2
agi_count 1984, counted 2048 in ag 2
agi_freecount 21, counted 13 in ag 2
agi_freecount 21, counted 13 in ag 2 finobt
sb_icount 34880, counted 35744
sb_ifree 506, counted 455
sb_fdblocks 603095616, counted 616701578
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
Metadata corruption detected at 0x438a03, xfs_inode block 0xa63aa00/0x4000
Metadata corruption detected at 0x438a03, xfs_inode block 0xa63aa20/0x4000
bad CRC for inode 174303744
bad magic number 0x16 on inode 174303744
bad version number 0xffffffaa on inode 174303744
bad next_unlinked 0xc0fc21c3 on inode 174303744
inode identifier 9179762597904482375 mismatch on inode 174303744
bad CRC for inode 174303745
bad magic number 0xccd3 on inode 174303745
bad version number 0xffffffba on inode 174303745
inode identifier 3371903399051595482 mismatch on inode 174303745
bad CRC for inode 174303746
bad magic number 0xdbe0 on inode 174303746
bad version number 0xffffffdf on inode 174303746
inode identifier 50872455103499597 mismatch on inode 174303746
bad CRC for inode 174303747
bad magic number 0xdd24 on inode 174303747
bad version number 0xffffffbd on inode 174303747
bad next_unlinked 0x9f522d11 on inode 174303747
inode identifier 9340838863122723239 mismatch on inode 174303747
bad CRC for inode 174303748
bad magic number 0x2043 on inode 174303748
bad version number 0xffffffa2 on inode 174303748
bad next_unlinked 0xf10ffca3 on inode 174303748
inode identifier 1184165229794778217 mismatch on inode 174303748
bad CRC for inode 174303749
bad magic number 0x66b7 on inode 174303749
bad version number 0x79 on inode 174303749
bad next_unlinked 0xb51219b6 on inode 174303749
inode identifier 14679859918268388760 mismatch on inode 174303749

 

Lots more of the bad CRC, bad magic number, bad version number, bad next_unlinked, and inode identifier lines.

 

Several of these:

imap claims a free inode 1155669479 is in use, would correct imap and clear inode

 

A few of these with various folder names:

entry "[FOLDER NAME]" at block 0 offset 152 in directory inode 6600634561 references free inode 1155669489
	would clear inode number in entry at offset 152...

 

These as well:

entry "[FOLDER NAME]" in shortform directory 32911946759 references free inode 2600817779
would have junked entry "[FOLDER NAME]" in directory inode 32911946759

 

Many of both of these:

disconnected dir inode 4888060274, would move to lost+found

and

would have reset inode 6600634561 nlinks from 164 to 140
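
Since the full output is too long to read line by line, a small script can summarize it. This is a rough Python sketch that counts the message types shown above and collects the distinct inode numbers involved. The patterns are written only against the lines quoted in this post, so other xfs_repair messages will simply be skipped; `summarize`, `PATTERNS`, and `SAMPLE` are hypothetical names.

```python
import re
from collections import Counter

# Patterns derived from the xfs_repair -n lines quoted in this post only.
PATTERNS = [
    ("bad_crc", re.compile(r"^bad CRC for inode (\d+)")),
    ("bad_magic", re.compile(r"^bad magic number \S+ on inode (\d+)")),
    ("bad_version", re.compile(r"^bad version number \S+ on inode (\d+)")),
    ("bad_next_unlinked", re.compile(r"^bad next_unlinked \S+ on inode (\d+)")),
    ("identifier_mismatch", re.compile(r"^inode identifier \d+ mismatch on inode (\d+)")),
    ("disconnected_dir", re.compile(r"^disconnected dir inode (\d+)")),
    ("nlinks_reset", re.compile(r"^would have reset inode (\d+) nlinks")),
]


def summarize(lines):
    """Return (count per message type, set of inode numbers mentioned)."""
    kinds = Counter()
    inodes = set()
    for line in lines:
        text = line.strip()
        for label, pattern in PATTERNS:
            match = pattern.match(text)
            if match:
                kinds[label] += 1
                inodes.add(int(match.group(1)))
                break
    return kinds, inodes


SAMPLE = """\
bad CRC for inode 174303744
bad magic number 0x16 on inode 174303744
bad version number 0xffffffaa on inode 174303744
bad next_unlinked 0xc0fc21c3 on inode 174303744
inode identifier 9179762597904482375 mismatch on inode 174303744
bad CRC for inode 174303745
disconnected dir inode 4888060274, would move to lost+found
would have reset inode 6600634561 nlinks from 164 to 140
""".splitlines()

if __name__ == "__main__":
    kinds, inodes = summarize(SAMPLE)
    print(dict(kinds))
    print(len(inodes), "distinct inodes")
```

Running it over the saved output gives a quick sense of how many inodes are damaged and how many directories would end up in lost+found.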

 

19 hours ago, trurl said:

Do it again without -n, if it asks for it use -L. Post the results.

 

Sorry, I didn't save the results. I ran the repair and the filesystem looked clean afterwards. About 30 GB ended up in lost+found.

 

However, a parity check started overnight, and now I have tons of sync errors.

1 hour ago, privateer said:

tons of sync errors

Did you do the filesystem check from the command line? Sounds like you may have gotten the command wrong and invalidated parity. Better to use the webUI; it will use the correct command.

2 hours ago, trurl said:

If you did the check on the sd device and not the md device, then that would invalidate parity. Checking the md device keeps parity in sync with changes.

 

I checked the md device, per the instructions in the Unraid docs.
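
For anyone wondering why the sd vs md distinction matters, here is a toy Python model of the mechanism described in the quote. It assumes a single parity disk computed as the XOR of the data disks, and it counts mismatches per byte, which is not how a real parity check reports sync errors. `ToyArray`, `write_md`, and `write_sd` are made-up names for illustration, not Unraid internals.

```python
from functools import reduce


class ToyArray:
    """Toy single-parity array: each parity byte is the XOR of the data bytes at that offset."""

    def __init__(self, n_disks: int, size: int):
        self.disks = [bytearray(size) for _ in range(n_disks)]
        self.parity = bytearray(size)  # all zeros, consistent with empty disks

    def write_md(self, disk: int, offset: int, data: bytes) -> None:
        """Write through the array layer: parity is updated with (old XOR new)."""
        for i, new in enumerate(data):
            old = self.disks[disk][offset + i]
            self.parity[offset + i] ^= old ^ new
            self.disks[disk][offset + i] = new

    def write_sd(self, disk: int, offset: int, data: bytes) -> None:
        """Write straight to the member disk: parity is never updated."""
        self.disks[disk][offset:offset + len(data)] = data

    def sync_errors(self) -> int:
        """Count byte offsets where stored parity differs from the recomputed XOR."""
        errors = 0
        for i in range(len(self.parity)):
            expected = reduce(lambda a, b: a ^ b, (d[i] for d in self.disks), 0)
            if expected != self.parity[i]:
                errors += 1
        return errors


if __name__ == "__main__":
    array = ToyArray(n_disks=3, size=16)
    array.write_md(0, 0, b"repair")   # a repair done through the md device
    print(array.sync_errors())        # parity still matches
    array.write_sd(1, 4, b"fix")      # a repair done on the raw sd device
    print(array.sync_errors())        # parity is now stale at those offsets
```

This only illustrates the mechanism from the quote. Since the check here was run from the UI against the md device, it doesn't explain the sync errors in this case.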

 

2 hours ago, trurl said:

Did you do the filesystem check from the command line? Sounds like you may have gotten the command wrong and invalidated parity. Better to use the webUI; it will use the correct command.

 

I used the UI, not the command line.
