[6.9.2] Cache disk errors, thought bad SATA cable, not sure now

wolfinabox · January 25, 2022

Hi! I think one of my cache SSDs might be failing, but I'd appreciate a second set of eyes on the info. (It's not even a year old, so hopefully I can RMA it if necessary.)

I'd been noticing "READ FPDMA QUEUED" and "WRITE FPDMA QUEUED" errors in the logs for this particular disk, but I thought the cause might be a bad SATA cable (I was just using whatever old cables I had on hand). I replaced it with a brand-new cable as soon as possible and moved the drive to a different SATA port in the process, but the disk still gets the same errors during a BTRFS scrub:
(Full log from the scrub is attached.)

Jan 24 20:38:03 boxserver kernel: ata12.00: exception Emask 0x0 SAct 0xffffffff SErr 0x0 action 0x0
Jan 24 20:38:03 boxserver kernel: ata12.00: irq_stat 0x40000008
Jan 24 20:38:03 boxserver kernel: ata12.00: failed command: READ FPDMA QUEUED
Jan 24 20:38:03 boxserver kernel: ata12.00: cmd 60/08:f8:f0:5f:2f/00:00:07:00:00/40 tag 31 ncq dma 4096 in
Jan 24 20:38:03 boxserver kernel: res 41/40:08:f0:5f:2f/00:00:07:00:00/00 Emask 0x409 (media error) <F>
Jan 24 20:38:03 boxserver kernel: ata12.00: status: { DRDY ERR }
Jan 24 20:38:03 boxserver kernel: ata12.00: error: { UNC }
Jan 24 20:38:03 boxserver kernel: ata12.00: supports DRM functions and may not be fully accessible
Jan 24 20:38:03 boxserver kernel: ata12.00: supports DRM functions and may not be fully accessible
Jan 24 20:38:03 boxserver kernel: ata12.00: configured for UDMA/133
Jan 24 20:38:03 boxserver kernel: ata12: EH complete
Jan 24 20:38:03 boxserver kernel: ata12.00: Enabling discard_zeroes_data
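
For a long syslog it can help to tally these entries rather than read them one by one. The following is a minimal sketch, assuming the log keeps the libata line format shown above; the function name `summarize_ata_errors` and the sample list are illustrative and not part of any tool. It counts failed commands and decoded error flags per ATA port. A flag such as UNC (uncorrectable error) is reported by the drive about its own media, whereas interface-level flags such as ICRC more often implicate cabling.

```python
import re
from collections import Counter

# Matches libata "failed command" lines, e.g.
# "ata12.00: failed command: READ FPDMA QUEUED"
FAILED_CMD = re.compile(r"(ata\d+)\.\d+: failed command: (.+?)\s*$")
# Matches the decoded error-register line, e.g. "ata12.00: error: { UNC }"
ERROR_FLAGS = re.compile(r"(ata\d+)\.\d+: error: \{ (.*?) \}")


def summarize_ata_errors(lines):
    """Count failed commands and error flags, keyed by (port, name)."""
    commands = Counter()
    flags = Counter()
    for line in lines:
        match = FAILED_CMD.search(line)
        if match:
            commands[(match.group(1), match.group(2))] += 1
            continue
        match = ERROR_FLAGS.search(line)
        if match:
            # One line may carry several flags separated by spaces.
            for flag in match.group(2).split():
                flags[(match.group(1), flag)] += 1
    return commands, flags


# Three lines copied from the excerpt above.
sample = [
    "Jan 24 20:38:03 boxserver kernel: ata12.00: failed command: READ FPDMA QUEUED",
    "Jan 24 20:38:03 boxserver kernel: ata12.00: status: { DRDY ERR }",
    "Jan 24 20:38:03 boxserver kernel: ata12.00: error: { UNC }",
]

commands, flags = summarize_ata_errors(sample)
print(dict(commands))
print(dict(flags))
```

To run it against a saved log, pass an open file instead of the sample list, for example `summarize_ata_errors(open("syslog.txt"))`.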

The scrub reported 17 read errors, though 0 corrected/uncorrected/unverified.
The SMART report for that drive also shows "Errors occurred - Check SMART report" (SMART report attached).
Since the issue persists, I now think it's the drive itself, sadly. Is there anything else I should check?
Build is here, and the cache SSDs are currently in RAID 0 (appdata gets backed up to the array regularly, and the VMs don't hold important data).

EDIT: I've moved all the data from the cache to the array (using mover), and during that there were many of these same errors, all from /dev/sdg (the same drive). All files seemed to make it over, though, so I removed sdg from the pool, reformatted sdf (the other cache drive) as a single-drive pool, and transferred the cache contents back to sdf with no problems. All signs point to that drive being bad.

syslog.txt Samsung_SSD_870_EVO_1TB_S6PTNZ0R608029R-20220124-2050.txt

Edited January 25, 2022 by wolfinabox

JorgeB · January 25, 2022

SMART test is failing so device needs to be replaced.

wolfinabox · January 25, 2022

9 hours ago, JorgeB said:

SMART test is failing so device needs to be replaced.

Gotcha, thought so, just wanted to be sure. I suppose since those results are from the disk itself, they can't really be caused by the interface/cable anyway, makes sense!
Will work on replacing that, ty! Now to find out which identical SSD it is in my tower...
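
One way to tell identical drives apart is by serial number, which is normally printed on the drive label. The following is a minimal sketch, assuming a Linux system where udev populates /dev/disk/by-id with links whose names include the model and serial; the function name `find_devices_by_serial` is illustrative. It resolves each matching link to the device node it points at, so a serial can be tied to a name like /dev/sdg.

```python
import os


def find_devices_by_serial(serial, by_id_dir="/dev/disk/by-id"):
    """Return {link name: resolved device path} for links containing the serial."""
    matches = {}
    for name in sorted(os.listdir(by_id_dir)):
        # Skip partition links such as "...-part1"; only whole disks are wanted.
        if serial in name and "-part" not in name:
            matches[name] = os.path.realpath(os.path.join(by_id_dir, name))
    return matches
```

The serial of the failing drive appears to be the last part of the attached SMART report's filename, so a call such as `find_devices_by_serial("S6PTNZ0R608029R")` should list the link or links for that drive and the device node they resolve to.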

JorgeB · January 25, 2022

8 minutes ago, wolfinabox said:

I suppose since those results are from the disk itself, they can't really be caused by the interface/cable anyway, makes sense!

Correct. A full device write might fix it, at least for some time, but if it does, it's difficult to predict for how long.
