Losing cache disk(s) on a regular basis.

Rafarion · January 24, 2023

I'm looking for a second opinion on an issue I've been trying to solve for a few weeks; it is slow to diagnose because it takes some time to recur.

Some history:
The server was running a cache setup with two 240GB SSDs in BTRFS RAID1.
About 2 months ago I upgraded the cache to two new disks to get more cache space: 2x 1TB (CT1000MX500SSD1).
As far as I'm aware, I swapped these properly, one by one, to preserve the mirrored cache data.

It seems I'm losing a disk after about 1 to 1.5 weeks, with the following errors:

Jan 24 02:07:53 Mountain kernel: ata1.00: exception Emask 0x0 SAct 0xffffffff SErr 0x0 action 0x6 frozen
Jan 24 02:07:53 Mountain kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jan 24 02:07:53 Mountain kernel: ata1.00: cmd 61/20:00:30:b4:3a/00:00:08:00:00/40 tag 0 ncq dma 16384 out
Jan 24 02:07:53 Mountain kernel:         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)

[...]

Jan 24 02:07:53 Mountain kernel: ata1.00: status: { DRDY }
Jan 24 02:07:53 Mountain kernel: ata1: hard resetting link
Jan 24 02:07:59 Mountain kernel: ata1: found unknown device (class 0)
Jan 24 02:08:03 Mountain kernel: ata1: softreset failed (device not ready)
Jan 24 02:08:03 Mountain kernel: ata1: hard resetting link
Jan 24 02:08:09 Mountain kernel: ata1: found unknown device (class 0)
Jan 24 02:08:13 Mountain kernel: ata1: softreset failed (device not ready)
Jan 24 02:08:13 Mountain kernel: ata1: hard resetting link
Jan 24 02:08:19 Mountain kernel: ata1: found unknown device (class 0)
Jan 24 02:08:24 Mountain kernel: ata1: link is slow to respond, please be patient (ready=0)
Jan 24 02:08:48 Mountain kernel: ata1: softreset failed (device not ready)
Jan 24 02:08:48 Mountain kernel: ata1: limiting SATA link speed to 3.0 Gbps
Jan 24 02:08:48 Mountain kernel: ata1: hard resetting link
Jan 24 02:08:53 Mountain kernel: ata1: found unknown device (class 0)
Jan 24 02:08:54 Mountain kernel: ata1: softreset failed (device not ready)
Jan 24 02:08:54 Mountain kernel: ata1: reset failed, giving up
Jan 24 02:08:54 Mountain kernel: ata1.00: disable device
Jan 24 02:08:54 Mountain kernel: ata1: EH complete
Jan 24 02:08:54 Mountain kernel: sd 2:0:0:0: [sdc] tag#12 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=0s
Jan 24 02:08:54 Mountain kernel: sd 2:0:0:0: [sdc] tag#20 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=0s
Jan 24 02:08:54 Mountain kernel: sd 2:0:0:0: [sdc] tag#12 CDB: opcode=0x2a 2a 00 08 38 76 98 00 00 40 00
Jan 24 02:08:54 Mountain kernel: sd 2:0:0:0: [sdc] tag#20 CDB: opcode=0x28 28 00 0b 13 24 e0 00 00 20 00
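
For anyone who would rather tally these messages per port than read them by eye, here is a minimal Python sketch. It assumes the message wording matches the excerpt above; `summarize`, `INTERESTING` and `SAMPLE` are names made up for illustration, and the sample lines are copied from the log above. It only counts messages; it does not work out which ataN port belongs to which sdX device.

```python
import re
from collections import Counter

# Patterns are based only on the kernel messages quoted in this thread.
ATA_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.*)")
SD_RE = re.compile(r"kernel: sd [\d:]+ \[(sd[a-z]+)\]")

# Message fragments that point at trouble on a port (taken from the excerpt).
INTERESTING = (
    "failed command",
    "hard resetting link",
    "softreset failed",
    "reset failed, giving up",
    "disable device",
)

def summarize(lines):
    """Count error-related messages per ATA port and per sd device."""
    ata_events = Counter()
    sd_events = Counter()
    for line in lines:
        match = ATA_RE.search(line)
        if match:
            port, message = match.groups()
            for fragment in INTERESTING:
                if fragment in message:
                    ata_events[(port, fragment)] += 1
            continue
        match = SD_RE.search(line)
        if match:
            sd_events[match.group(1)] += 1
    return ata_events, sd_events

# A few lines copied from the log excerpt above, used as demo input.
SAMPLE = """\
Jan 24 02:07:53 Mountain kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jan 24 02:07:53 Mountain kernel: ata1: hard resetting link
Jan 24 02:08:03 Mountain kernel: ata1: softreset failed (device not ready)
Jan 24 02:08:54 Mountain kernel: ata1: reset failed, giving up
Jan 24 02:08:54 Mountain kernel: ata1.00: disable device
Jan 24 02:08:54 Mountain kernel: sd 2:0:0:0: [sdc] tag#12 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=0s
"""

ata_events, sd_events = summarize(SAMPLE.splitlines())
for (port, fragment), count in sorted(ata_events.items()):
    print(f"{port}: {fragment!r} x{count}")
for device, count in sorted(sd_events.items()):
    print(f"{device}: {count} SCSI message line(s)")
```

To run it over a full log, pass an open file object to `summarize` instead of the sample lines; this assumes the syslog is plain text with one message per line.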

Initially the logs were unclear about which exact disk was failing, so I tried the following in sequence.
Configuration: 2x 1TB (CT1000MX500SSD1)
- Switched the SATA ports on the motherboard to different ports.
- Replaced the SATA cables with new cables.
After identifying the specific disk that was causing the issue:
- Switched the SATA power connection for the disk reporting the errors.
- Replaced the SATA cable of the disk reporting the errors and moved it to another port.

-----------------------------------------------------------------------------------------------------------------------------------------------------
Believing this was a single disk failure, I removed the disk reporting the errors and changed the hardware configuration.
Configuration: 2x 1TB (CT1000MX500SSD1 and Samsung_SSD_870_EVO_1TB)
- The error occurred again, this time on the other CT1000MX500SSD1 (sdc).

So I'm out of the most logical ideas. Should I also swap out this CT1000MX, or is it more likely something else?

Now I'm considering the following possibilities:
- I had two bad disks, but that seems very unlikely.
- The motherboard is having issues, possibly only with this type of disk.
- A software misconfiguration is causing this disk access issue.

I'm hoping someone can help, or at least suggest a few other things to try.

Attached are the diagnostics in case more specific information is needed (syslog1 is probably the most useful).

mountain-diagnostics-20230124-1848.zip

JorgeB · January 24, 2023

Looks more like a power/cable problem, but if it only happens with the Crucial and not with the Samsung the board/controller might not like them.
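
One way to look for evidence of a cabling problem is the SMART attribute 199 (usually reported as UDMA_CRC_Error_Count), which counts interface CRC errors. Below is a minimal Python sketch that reads it from `smartctl -A` output. It assumes smartmontools is installed and the script runs with enough privileges; `parse_smart_attributes` and `crc_error_count` are hypothetical helper names, and the device path is whatever your disk is called. Note that the log above shows command timeouts rather than CRC errors, so a count that stays flat does not rule out a power problem.

```python
import subprocess

def parse_smart_attributes(text):
    """Parse the attribute table printed by `smartctl -A`.

    Returns {attribute_id: (attribute_name, raw_value_string)}.
    """
    attributes = {}
    for line in text.splitlines():
        # Table rows have 10 columns; RAW_VALUE is last and may contain spaces.
        fields = line.split(None, 9)
        if len(fields) == 10 and fields[0].isdigit():
            attributes[int(fields[0])] = (fields[1], fields[9].strip())
    return attributes

def crc_error_count(device):
    """Return (name, raw value) of attribute 199 for a device such as /dev/sdc."""
    # smartctl uses its exit status as a bitmask, so a non-zero code is not
    # treated as a failure here.
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True,
        text=True,
        check=False,
    )
    attributes = parse_smart_attributes(result.stdout)
    return attributes.get(199, ("(not reported)", None))
```

Calling `crc_error_count("/dev/sdc")` before and after a cable or port swap and comparing the raw values shows whether the count is still rising.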

Rafarion · January 27, 2023

Just to update this thread, in case someone stumbles across it in the future.
For now I have replaced the Crucial disk with one from a different brand (WD Blue SSD 1TB).
Based on JorgeB's remark about power (thanks, that was a good thing to change as well), I also moved it to another rail on the power supply.

Now the waiting game starts for another 1 to 1.5 weeks to see if it shows up again.

I mostly want an operational server rather than to hunt down the exact cause. So if the issue is resolved now, I won't be able to tell whether it was the Crucial in this specific configuration or a bad supply rail or SATA power connector.

This means that if I don't reply here within the next 2-3 weeks, the issue is probably resolved.
