Posted January 24, 2023

I'm looking for some secondary support with an issue I've been trying to solve for a few weeks, as it takes some time to occur.

Some history: the server was running a cache setup with two 240GB SSDs in BTRFS RAID1. About 2 months ago I upgraded my cache to two new disks to have more cache space: 2x 1TB (CT1000MX500SSD1). As far as I'm aware I properly swapped these one by one to preserve the mirrored cache data.

It seems I'm losing a disk after about 1 to 1.5 weeks with the following errors:

Jan 24 02:07:53 Mountain kernel: ata1.00: exception Emask 0x0 SAct 0xffffffff SErr 0x0 action 0x6 frozen
Jan 24 02:07:53 Mountain kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jan 24 02:07:53 Mountain kernel: ata1.00: cmd 61/20:00:30:b4:3a/00:00:08:00:00/40 tag 0 ncq dma 16384 out
Jan 24 02:07:53 Mountain kernel: res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
[...]
Jan 24 02:07:53 Mountain kernel: ata1.00: status: { DRDY }
Jan 24 02:07:53 Mountain kernel: ata1: hard resetting link
Jan 24 02:07:59 Mountain kernel: ata1: found unknown device (class 0)
Jan 24 02:08:03 Mountain kernel: ata1: softreset failed (device not ready)
Jan 24 02:08:03 Mountain kernel: ata1: hard resetting link
Jan 24 02:08:09 Mountain kernel: ata1: found unknown device (class 0)
Jan 24 02:08:13 Mountain kernel: ata1: softreset failed (device not ready)
Jan 24 02:08:13 Mountain kernel: ata1: hard resetting link
Jan 24 02:08:19 Mountain kernel: ata1: found unknown device (class 0)
Jan 24 02:08:24 Mountain kernel: ata1: link is slow to respond, please be patient (ready=0)
Jan 24 02:08:48 Mountain kernel: ata1: softreset failed (device not ready)
Jan 24 02:08:48 Mountain kernel: ata1: limiting SATA link speed to 3.0 Gbps
Jan 24 02:08:48 Mountain kernel: ata1: hard resetting link
Jan 24 02:08:53 Mountain kernel: ata1: found unknown device (class 0)
Jan 24 02:08:54 Mountain kernel: ata1: softreset failed (device not ready)
Jan 24 02:08:54 Mountain kernel: ata1: reset failed, giving up
Jan 24 02:08:54 Mountain kernel: ata1.00: disable device
Jan 24 02:08:54 Mountain kernel: ata1: EH complete
Jan 24 02:08:54 Mountain kernel: sd 2:0:0:0: [sdc] tag#12 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=0s
Jan 24 02:08:54 Mountain kernel: sd 2:0:0:0: [sdc] tag#20 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=0s
Jan 24 02:08:54 Mountain kernel: sd 2:0:0:0: [sdc] tag#12 CDB: opcode=0x2a 2a 00 08 38 76 98 00 00 40 00
Jan 24 02:08:54 Mountain kernel: sd 2:0:0:0: [sdc] tag#20 CDB: opcode=0x28 28 00 0b 13 24 e0 00 00 20 00

Initially the logs seemed unclear on which exact disk it was, so I tried the following in sequence.

Configuration: 2x 1TB (CT1000MX500SSD1)
- Switched the SATA ports on the motherboard to different ports.
- Switched the SATA cables to new cables.

After marking the specific disk causing the issue:
- Switched the SATA power between the reported error disk and the other disk.
- Switched out the specific SATA cable of the reported error disk and moved it to a different port.

-----------------------------------------------------------------------------------------------------------------------------------------------------

Believing it was a single-disk failure, I removed the error disk and changed the hardware configuration.

Configuration: 2x 1TB (CT1000MX500SSD1 and Samsung_SSD_870_EVO_1TB)
- The error occurred again, this time on the other disk: the CT1000MX500SSD1 (sdc).

So I'm out of most logical ideas. Should I also swap out this CT1000MX, or would you reckon it's something else? Now I'm thinking of the following possibilities:

- I had two bad disks, but that seems very unlikely?
- The motherboard is having issues, either in general or specifically with this type of disk?
- A software misconfiguration that can cause this disk access issue?

I'm hoping someone can help, or at least suggest a few other options to attempt. Attached are the diagnostics in case any more specific information is needed (syslog1 is probably the most useful).

mountain-diagnostics-20230124-1848.zip
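For anyone trying to pin down which physical disk sits behind an ataN error like the one above, it can usually be mapped back to a device name and serial number with a few commands. A minimal sketch, assuming the pool is mounted at /mnt/cache (the usual Unraid cache path) and the suspect device is /dev/sdc as reported in the log, so adjust to your own setup:

# Map the failing ATA port (ata1 in the log above) to a block device;
# the sysfs path of each sdX contains the ataN port it sits behind.
ls -l /sys/block/ | grep 'ata1/'

# Per-device error counters for the BTRFS pool (read/write/flush/corruption).
btrfs device stats /mnt/cache

# SMART attributes plus model and serial number, to match the device
# name to a physical drive.
smartctl -a /dev/sdc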
January 24, 2023 Community Expert

Looks more like a power/cable problem, but if it only happens with the Crucial and not with the Samsung, the board/controller might not like them.
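One way to test the cable theory without waiting for the next drop-out: interface CRC errors caused by a marginal SATA cable or connector usually show up in SMART attribute 199 (UDMA_CRC_Error_Count), while a genuinely failing drive tends to show reallocation or pending-sector counters climbing instead. A rough check, assuming the suspect disk is still /dev/sdc (exact attribute names vary by vendor):

# CRC error counter; a value that keeps climbing points at the cable or
# connector rather than the drive itself.
smartctl -A /dev/sdc | grep -i crc

# Reallocation and pending counters, which point at the drive instead.
smartctl -A /dev/sdc | grep -i -E 'realloc|pending'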
January 27, 2023 Author

Just to update this topic, in case someone in the future stumbles across it. For now I have replaced the Crucial disk with a different brand (WD Blue SSD 1TB). Based on the power-supply remark from JorgeB (thanks, that was a good thing to change as well), I also moved it to another rail of the power supply. Now the waiting game starts for another 1 to 1.5 weeks to see if the error shows up again.

As I mostly want to ensure an operational server rather than bug-hunt the exact cause, I can't be sure whether it was the Crucial in this specific configuration, or a potentially bad supply rail or SATA connector, if it turns out to be resolved now. In other words: if I don't reply here within another 2-3 weeks, the issue is probably resolved.
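One small thing that can shorten the waiting game: the per-device BTRFS error counters can be zeroed after a hardware change and re-checked now and then, so a recurrence shows up as a climbing counter before the disk drops off the bus entirely. A minimal sketch, again assuming the pool is mounted at /mnt/cache:

# Zero the per-device counters after the hardware swap.
btrfs device stats --reset /mnt/cache

# Re-check periodically; any non-zero read/write/flush errors would
# flag the problem coming back.
btrfs device stats /mnt/cache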