
[6.9.2] Cache disk errors, thought bad SATA cable, not sure now



Hi! I think one of my cache SSDs might be failing, but I'd appreciate a second set of eyes on the info. (It's not even a year old, so hopefully I can RMA it if necessary.)

I'd been noticing "READ FPDMA QUEUED" and "WRITE FPDMA QUEUED" errors popping up in the logs for this particular disk, but I thought it might be a bad SATA cable (I was just using whatever old cables I had on hand). I replaced it with a brand new cable as soon as possible, and also moved the drive to a different SATA port in the process, but during a BTRFS scrub the disk is still getting the same errors (full log from the scrub attached):

Jan 24 20:38:03 boxserver kernel: ata12.00: exception Emask 0x0 SAct 0xffffffff SErr 0x0 action 0x0
Jan 24 20:38:03 boxserver kernel: ata12.00: irq_stat 0x40000008
Jan 24 20:38:03 boxserver kernel: ata12.00: failed command: READ FPDMA QUEUED
Jan 24 20:38:03 boxserver kernel: ata12.00: cmd 60/08:f8:f0:5f:2f/00:00:07:00:00/40 tag 31 ncq dma 4096 in
Jan 24 20:38:03 boxserver kernel: res 41/40:08:f0:5f:2f/00:00:07:00:00/00 Emask 0x409 (media error) <F>
Jan 24 20:38:03 boxserver kernel: ata12.00: status: { DRDY ERR }
Jan 24 20:38:03 boxserver kernel: ata12.00: error: { UNC }
Jan 24 20:38:03 boxserver kernel: ata12.00: supports DRM functions and may not be fully accessible
Jan 24 20:38:03 boxserver kernel: ata12.00: supports DRM functions and may not be fully accessible
Jan 24 20:38:03 boxserver kernel: ata12.00: configured for UDMA/133
Jan 24 20:38:03 boxserver kernel: ata12: EH complete
Jan 24 20:38:03 boxserver kernel: ata12.00: Enabling discard_zeroes_data
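
For anyone wanting to tally these entries across the whole attached syslog, here is a minimal sketch in Python. It assumes the log lines follow the same layout as the excerpt above; the function name `summarize_ata_errors` and the two regular expressions are my own illustration, not part of any Unraid tool.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... kernel: ata12.00: failed command: READ FPDMA QUEUED"
FAILED_CMD = re.compile(r"kernel: (ata\d+\.\d+): failed command: (.+)$")
# Matches lines such as:
#   "... kernel: ata12.00: error: { UNC }"
ERROR_FLAGS = re.compile(r"kernel: (ata\d+\.\d+): error: \{ (.+?) \}")


def summarize_ata_errors(lines):
    """Count failed commands and error flags per ATA device (hypothetical helper)."""
    failed = Counter()  # (device, command) -> count
    flags = Counter()   # (device, flag) -> count
    for line in lines:
        line = line.rstrip()
        match = FAILED_CMD.search(line)
        if match:
            failed[(match.group(1), match.group(2))] += 1
            continue
        match = ERROR_FLAGS.search(line)
        if match:
            for flag in match.group(2).split():
                flags[(match.group(1), flag)] += 1
    return failed, flags


# Three lines copied from the excerpt above.
sample = [
    "Jan 24 20:38:03 boxserver kernel: ata12.00: failed command: READ FPDMA QUEUED",
    "Jan 24 20:38:03 boxserver kernel: ata12.00: status: { DRDY ERR }",
    "Jan 24 20:38:03 boxserver kernel: ata12.00: error: { UNC }",
]
failed, flags = summarize_ata_errors(sample)
for (device, command), count in failed.items():
    print(device, command, count)
for (device, flag), count in flags.items():
    print(device, flag, count)
```

To run it on the real log, pass an open file instead of `sample`, e.g. `summarize_ata_errors(open("syslog.txt"))`. The `UNC` flag together with "media error" means the drive itself reported an uncorrectable read, whereas cable or port problems more typically show up as interface CRC errors, so a tally dominated by `UNC` would be consistent with a failing drive rather than a bad cable.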

The scrub reported 17 read errors, though 0 corrected/uncorrected/unverified errors.
The SMART status for that drive also shows "Errors occurred - Check SMART report" (SMART report attached).
Since the issue persists, I now think it's the drive itself, sadly. Is there anything else I should check?
Build is here, and the cache SSDs are currently in RAID 0 (appdata gets backed up to the array regularly, and the VMs don't hold important data).

EDIT: I've moved all the data from the cache to the array (using mover), and during that there were many of these same errors, all from /dev/sdg (the same drive). All the files seemed to make it over, though, so I removed sdg from the pool, reformatted sdf (the other cache drive) as a single-drive pool, and transferred the cache contents back to sdf with no problems. Signs point to that drive being bad :(

syslog.txt Samsung_SSD_870_EVO_1TB_S6PTNZ0R608029R-20220124-2050.txt
 

Edited by wolfinabox
9 hours ago, JorgeB said:

SMART test is failing so device needs to be replaced.

Gotcha, thought so, just wanted to be sure. I suppose since those results are from the disk itself, they can't really be caused by the interface/cable anyway, makes sense!
Will work on replacing that, ty! Now to find out which identical SSD it is in my tower...
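
One way to tell the identical SSDs apart is by serial number, which is also printed on each drive's label. The sketch below is only an illustration: it assumes a Linux system where `/dev/disk/by-id` contains symlinks whose names include the drive serial, it takes the serial from the attached SMART report's filename, and `find_devices_by_serial` is a hypothetical helper name.

```python
import os


def find_devices_by_serial(serial, by_id_dir="/dev/disk/by-id"):
    """Map by-id link names containing `serial` to the device they point to."""
    matches = {}
    for name in sorted(os.listdir(by_id_dir)):
        if serial in name:
            # Each entry is a symlink such as ../../sdg; resolve it to a full path.
            matches[name] = os.path.realpath(os.path.join(by_id_dir, name))
    return matches


if __name__ == "__main__" and os.path.isdir("/dev/disk/by-id"):
    # Serial taken from the SMART report filename attached to the first post.
    for link_name, device in find_devices_by_serial("S6PTNZ0R608029R").items():
        print(link_name, "->", device)
```

This only confirms which device node belongs to the failing serial; matching it to a physical drive in the case still means reading the serial off the label.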
