Hi! I think one of my cache SSDs might be failing, but I'd appreciate a second set of eyes on the info. (It's not even a year old, so hopefully I can RMA it if necessary.)
I'd been noticing some "READ FPDMA QUEUED" and "WRITE FPDMA QUEUED" errors popping up in the logs for this particular disk, but I thought it might be a bad SATA cable (I was just using whatever old cables I had on hand). I replaced it with a brand new cable as soon as possible and also moved the drive to a different SATA port in the process, but during a BTRFS scrub the disk is still getting the same errors:
(Full log from the scrub is attached)
Jan 24 20:38:03 boxserver kernel: ata12.00: exception Emask 0x0 SAct 0xffffffff SErr 0x0 action 0x0
Jan 24 20:38:03 boxserver kernel: ata12.00: irq_stat 0x40000008
Jan 24 20:38:03 boxserver kernel: ata12.00: failed command: READ FPDMA QUEUED
Jan 24 20:38:03 boxserver kernel: ata12.00: cmd 60/08:f8:f0:5f:2f/00:00:07:00:00/40 tag 31 ncq dma 4096 in
Jan 24 20:38:03 boxserver kernel: res 41/40:08:f0:5f:2f/00:00:07:00:00/00 Emask 0x409 (media error) <F>
Jan 24 20:38:03 boxserver kernel: ata12.00: status: { DRDY ERR }
Jan 24 20:38:03 boxserver kernel: ata12.00: error: { UNC }
Jan 24 20:38:03 boxserver kernel: ata12.00: supports DRM functions and may not be fully accessible
Jan 24 20:38:03 boxserver kernel: ata12.00: supports DRM functions and may not be fully accessible
Jan 24 20:38:03 boxserver kernel: ata12.00: configured for UDMA/133
Jan 24 20:38:03 boxserver kernel: ata12: EH complete
Jan 24 20:38:03 boxserver kernel: ata12.00: Enabling discard_zeroes_data
The scrub reported 17 read errors, though it shows 0 corrected/uncorrected/unverified.
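For anyone who wants to double-check which sector a failed read hit, here is a rough Python sketch that decodes the "cmd"/"res" lines from the log above. It assumes the usual libata layout for those twelve bytes (command or status, feature or error, sector count, three low LBA bytes, then the five high-order "HOB" bytes, then device). The function name and bit tables are my own, so treat it as a sanity check rather than an authoritative decoder.

```python
import re

# Assumed libata layout of the 12 hex bytes on a "cmd" or "res" line:
#   CMD|STATUS / FEAT|ERR : NSECT : LBAL : LBAM : LBAH /
#   HOB_FEAT : HOB_NSECT : HOB_LBAL : HOB_LBAM : HOB_LBAH / DEVICE
TASKFILE = re.compile(r"\b(cmd|res)\s+((?:[0-9a-f]{2}[/:]){11}[0-9a-f]{2})")

# ATA status and error register bits (only the common ones).
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}


def decode_taskfile(line):
    """Return the 48-bit LBA (and, for 'res' lines, the flag names) or None."""
    m = TASKFILE.search(line)
    if m is None:
        return None
    kind = m.group(1)
    b = [int(x, 16) for x in re.split(r"[/:]", m.group(2))]
    # Low 24 bits come from LBAL/LBAM/LBAH, high 24 bits from the HOB bytes.
    lba = (b[10] << 40) | (b[9] << 32) | (b[8] << 24) | (b[5] << 16) | (b[4] << 8) | b[3]
    out = {"kind": kind, "lba": lba}
    if kind == "res":
        out["status"] = [name for bit, name in STATUS_BITS.items() if b[0] & bit]
        out["error"] = [name for bit, name in ERROR_BITS.items() if b[1] & bit]
    return out


line = "ata12.00: res 41/40:08:f0:5f:2f/00:00:07:00:00/00 Emask 0x409 (media error)"
result = decode_taskfile(line)
print(hex(result["lba"]), result["status"], result["error"])
```

On the "res" line above this gives the same flags the kernel prints (DRDY ERR, UNC), and the "cmd" and "res" lines decode to the same LBA, which is what I'd expect if the drive is reporting an unreadable sector rather than a link problem.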
The SMART report for that drive also shows "Errors occurred - Check SMART report" (SMART report attached).
Since the issue is persisting, I'm now thinking that it's sadly the drive itself. Is there anything else I should check?
Build is here, and the cache SSDs are currently in RAID 0 (appdata gets backed up to the array regularly, and the VMs don't hold important data).
EDIT: I've pulled all the data from the cache to the array (using mover), and during that there were many of these same errors, only from /dev/sdg (the same drive). All files seemed to make it over though, so I removed sdg from the pool, reformatted sdf (the other cache drive) into a single-drive pool, and transferred the cache contents back over to sdf without any problems. Signs point to that drive being bad.
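In case it helps anyone checking a similar log, this is a small Python sketch for tallying the "failed command" lines per ATA device, to confirm the errors really do come from a single drive. The regex is based only on the message format shown in the excerpt above, and the function name is my own. It reports the kernel's ataN.NN name, not the /dev/sdX name, so matching the two up still has to be done separately.

```python
import re
from collections import Counter

# Matches lines like:
#   "... kernel: ata12.00: failed command: READ FPDMA QUEUED"
FAILED = re.compile(r"kernel: (ata\d+\.\d+): failed command: (.+?)\s*$")


def count_failed_commands(lines):
    """Count failed commands, keyed by (ata device, command name)."""
    counts = Counter()
    for line in lines:
        m = FAILED.search(line)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts


# Sample input taken from the excerpt above; for a whole log, pass the
# lines of the attached syslog.txt instead.
sample = [
    "Jan 24 20:38:03 boxserver kernel: ata12.00: irq_stat 0x40000008",
    "Jan 24 20:38:03 boxserver kernel: ata12.00: failed command: READ FPDMA QUEUED",
    "Jan 24 20:38:03 boxserver kernel: ata12: EH complete",
]
for (device, command), n in count_failed_commands(sample).items():
    print(device, command, n)
```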
syslog.txt Samsung_SSD_870_EVO_1TB_S6PTNZ0R608029R-20220124-2050.txt