
Losing cache disk(s) on a regular basis.



I'm looking for a second opinion on an issue I've been trying to solve for a few weeks; progress is slow because the problem takes a while to reappear.

 

Some history:
The server was running a cache pool of two 240GB SSDs in BTRFS RAID1.
About 2 months ago I upgraded the cache to two new, larger disks: 2x 1TB (CT1000MX500SSD1).
As far as I'm aware, I swapped them properly, one at a time, to preserve the mirrored cache data.


It seems I'm losing a disk after about 1 to 1.5 weeks, with the following errors:

Jan 24 02:07:53 Mountain kernel: ata1.00: exception Emask 0x0 SAct 0xffffffff SErr 0x0 action 0x6 frozen
Jan 24 02:07:53 Mountain kernel: ata1.00: failed command: WRITE FPDMA QUEUED
Jan 24 02:07:53 Mountain kernel: ata1.00: cmd 61/20:00:30:b4:3a/00:00:08:00:00/40 tag 0 ncq dma 16384 out
Jan 24 02:07:53 Mountain kernel:         res 40/00:01:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)

[...]

Jan 24 02:07:53 Mountain kernel: ata1.00: status: { DRDY }
Jan 24 02:07:53 Mountain kernel: ata1: hard resetting link
Jan 24 02:07:59 Mountain kernel: ata1: found unknown device (class 0)
Jan 24 02:08:03 Mountain kernel: ata1: softreset failed (device not ready)
Jan 24 02:08:03 Mountain kernel: ata1: hard resetting link
Jan 24 02:08:09 Mountain kernel: ata1: found unknown device (class 0)
Jan 24 02:08:13 Mountain kernel: ata1: softreset failed (device not ready)
Jan 24 02:08:13 Mountain kernel: ata1: hard resetting link
Jan 24 02:08:19 Mountain kernel: ata1: found unknown device (class 0)
Jan 24 02:08:24 Mountain kernel: ata1: link is slow to respond, please be patient (ready=0)
Jan 24 02:08:48 Mountain kernel: ata1: softreset failed (device not ready)
Jan 24 02:08:48 Mountain kernel: ata1: limiting SATA link speed to 3.0 Gbps
Jan 24 02:08:48 Mountain kernel: ata1: hard resetting link
Jan 24 02:08:53 Mountain kernel: ata1: found unknown device (class 0)
Jan 24 02:08:54 Mountain kernel: ata1: softreset failed (device not ready)
Jan 24 02:08:54 Mountain kernel: ata1: reset failed, giving up
Jan 24 02:08:54 Mountain kernel: ata1.00: disable device
Jan 24 02:08:54 Mountain kernel: ata1: EH complete
Jan 24 02:08:54 Mountain kernel: sd 2:0:0:0: [sdc] tag#12 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=0s
Jan 24 02:08:54 Mountain kernel: sd 2:0:0:0: [sdc] tag#20 UNKNOWN(0x2003) Result: hostbyte=0x04 driverbyte=DRIVER_OK cmd_age=0s
Jan 24 02:08:54 Mountain kernel: sd 2:0:0:0: [sdc] tag#12 CDB: opcode=0x2a 2a 00 08 38 76 98 00 00 40 00
Jan 24 02:08:54 Mountain kernel: sd 2:0:0:0: [sdc] tag#20 CDB: opcode=0x28 28 00 0b 13 24 e0 00 00 20 00
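
For anyone reading along: the two CDB lines at the end of the excerpt can be decoded by hand to see which commands were in flight when the device was disabled. Below is a minimal sketch in Python, assuming the standard 10-byte READ(10)/WRITE(10) layout (opcode, flags, 4-byte big-endian LBA, group, 2-byte transfer length, control). The function name `decode_cdb10` is made up for illustration and is not part of any tool.

```python
import re

# Opcodes of the 10-byte SCSI read/write commands seen in the excerpt.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)"}

# Matches e.g. "[sdc] tag#12 CDB: opcode=0x2a 2a 00 08 38 76 98 00 00 40 00"
CDB_RE = re.compile(
    r"\[(?P<dev>sd[a-z]+)\] tag#(?P<tag>\d+) CDB: "
    r"opcode=0x[0-9a-f]{2} (?P<bytes>(?:[0-9a-f]{2} ?)+)"
)

def decode_cdb10(line):
    """Decode a kernel 'CDB:' log line; return None if it is not a 10-byte read/write."""
    m = CDB_RE.search(line)
    if m is None:
        return None
    raw = bytes.fromhex(m.group("bytes"))
    if len(raw) != 10 or raw[0] not in OPCODES:
        return None
    return {
        "device": m.group("dev"),
        "tag": int(m.group("tag")),
        "command": OPCODES[raw[0]],
        "lba": int.from_bytes(raw[2:6], "big"),     # starting logical block
        "blocks": int.from_bytes(raw[7:9], "big"),  # transfer length in blocks
    }
```

Applied to the two lines above, this reports one write and one read on sdc, so both directions were failing once the link was given up on, which fits a device that dropped off the bus rather than a single bad sector.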


Initially the logs didn't make clear which exact disk was affected, so I tried the following in sequence.
Configuration: 2x 1TB (CT1000MX500SSD1)
- Moved the disks to different SATA ports on the motherboard.
- Replaced the SATA cables with new ones.
After marking the specific disk that was causing the issue:
- Swapped the SATA power connector on the disk reporting errors.
- Replaced the SATA cable of the disk reporting errors and moved it to another port.

-----------------------------------------------------------------------------------------------------------------------------------------------------
Believing it was a single-disk failure, I removed the disk reporting errors and changed the hardware configuration.
Configuration: 2x 1TB (CT1000MX500SSD1 and Samsung_SSD_870_EVO_1TB)
- The error occurred again, this time on the other CT1000MX500SSD1 (sdc).
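
Since the question is whether the errors follow the disk or the port, it can help to tally the error-handling messages per ATA port across the whole syslog. This is a rough sketch in Python, assuming log lines shaped like the excerpt earlier in this post. The helper name `tally_ata_events` and the list of message fragments are my own choices for illustration. Keep in mind that `ataN` identifies a controller port, not a disk, so the number changes whenever a disk is moved to another port.

```python
import re
from collections import Counter, defaultdict

# Captures the port ("ata1", with or without a ".00" device suffix) and the message.
ATA_RE = re.compile(r"kernel: (ata\d+)(?:\.\d+)?: (.+)$")

# Message fragments copied from the log excerpt; extend as needed.
MARKERS = (
    "failed command",
    "hard resetting link",
    "softreset failed",
    "disable device",
)

def tally_ata_events(lines):
    """Count error-handling messages per ATA port from an iterable of syslog lines."""
    counts = defaultdict(Counter)
    for line in lines:
        m = ATA_RE.search(line.strip())
        if m is None:
            continue
        port, message = m.groups()
        for marker in MARKERS:
            if marker in message:
                counts[port][marker] += 1
    return counts

# Usage sketch (the path is a placeholder for the syslog in the diagnostics zip):
#   with open("syslog1.txt") as f:
#       for port, events in tally_ata_events(f).items():
#           print(port, dict(events))
```

If the same port keeps showing resets after a different disk has been attached to it, that points at the port, cable, or power side; if the resets move with the disk, the disk itself is the more likely suspect.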


So I'm out of the most logical ideas. Should I swap out this CT1000MX500SSD1 as well, or is it more likely something else?

Now I'm thinking of the following possibilities:
I had two bad disks, but that seems very unlikely?
The motherboard is having issues, or has a problem specifically with this type of disk?
A software misconfiguration that causes this disk access issue?

I'm hoping someone can help, or at least suggest a few other things to try.

The diagnostics are attached in case more specific information is needed (syslog1 is probably the most useful).

mountain-diagnostics-20230124-1848.zip


Just to update this thread, in case someone stumbles across it in the future.
For now I've replaced the Crucial disk with a different brand (WD Blue SSD 1TB).
Based on JorgeB's remark about the power supply (thanks, that was a good thing to change as well), I also moved the disk to a different rail on the power supply.

Now the waiting game starts: another 1 to 1.5 weeks to see if the error shows up again.

I mostly want an operational server rather than to bug-hunt the exact cause. So if the issue is resolved now, I can't be sure whether it was the Crucial disk in combination with this specific configuration, or a bad power supply rail or SATA connector.

This means that if I don't reply here within the next 2-3 weeks, the issue is probably resolved.
