Server crashes with read errors every week or so.


I am running into an issue where, after a period of uptime, my server fails with a bunch of read errors. Here are the logs from one instance.

 

May 10 12:35:18 Tower kernel: ahci 0000:01:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000c address=0x95e7e000 flags=0x0000]
May 10 12:35:19 Tower kernel: ata4.00: exception Emask 0x10 SAct 0x400045f SErr 0x0 action 0x6 frozen
May 10 12:35:19 Tower kernel: ata4.00: irq_stat 0x08000000, interface fatal error
May 10 12:35:19 Tower kernel: ata4.00: failed command: WRITE FPDMA QUEUED
May 10 12:35:19 Tower kernel: ata4.00: cmd 61/08:00:b8:ce:27/00:00:2d:00:00/40 tag 0 ncq dma 4096 out
May 10 12:35:19 Tower kernel:         res 40/00:00:61:d6:27/00:00:2d:00:00/40 Emask 0x10 (ATA bus error)
May 10 12:35:19 Tower kernel: ata4.00: status: { DRDY }
May 10 12:35:19 Tower kernel: ata4.00: failed command: WRITE FPDMA QUEUED
May 10 12:35:19 Tower kernel: ata4.00: cmd 61/08:08:48:d1:27/00:00:2d:00:00/40 tag 1 ncq dma 4096 out
May 10 12:35:19 Tower kernel:         res 40/00:00:61:d6:27/00:00:2d:00:00/40 Emask 0x10 (ATA bus error)
May 10 12:35:19 Tower kernel: ata4.00: status: { DRDY }
May 10 12:35:19 Tower kernel: ata4.00: failed command: WRITE FPDMA QUEUED
May 10 12:35:19 Tower kernel: ata4.00: cmd 61/08:10:f8:d2:27/00:00:2d:00:00/40 tag 2 ncq dma 4096 out
May 10 12:35:19 Tower kernel:         res 40/00:00:61:d6:27/00:00:2d:00:00/40 Emask 0x10 (ATA bus error)
May 10 12:35:19 Tower kernel: ata4.00: status: { DRDY }
May 10 12:35:19 Tower kernel: ata4.00: failed command: WRITE FPDMA QUEUED
May 10 12:35:19 Tower kernel: ata4.00: cmd 61/08:18:f0:d3:27/00:00:2d:00:00/40 tag 3 ncq dma 4096 out
May 10 12:35:19 Tower kernel:         res 40/00:00:61:d6:27/00:00:2d:00:00/40 Emask 0x10 (ATA bus error)
May 10 12:35:19 Tower kernel: ata4.00: status: { DRDY }
May 10 12:35:19 Tower kernel: ata4.00: failed command: WRITE FPDMA QUEUED
May 10 12:35:19 Tower kernel: ata4.00: cmd 61/08:20:c0:d4:27/00:00:2d:00:00/40 tag 4 ncq dma 4096 out
May 10 12:35:19 Tower kernel:         res 40/00:00:61:d6:27/00:00:2d:00:00/40 Emask 0x10 (ATA bus error)
May 10 12:35:19 Tower kernel: ata4.00: status: { DRDY }
May 10 12:35:19 Tower kernel: ata4.00: failed command: WRITE FPDMA QUEUED
May 10 12:35:19 Tower kernel: ata4.00: cmd 61/08:30:a0:d5:27/00:00:2d:00:00/40 tag 6 ncq dma 4096 out
May 10 12:35:19 Tower kernel:         res 40/00:00:61:d6:27/00:00:2d:00:00/40 Emask 0x10 (ATA bus error)
May 10 12:35:19 Tower kernel: ata4.00: status: { DRDY }
May 10 12:35:19 Tower kernel: ata4.00: failed command: WRITE FPDMA QUEUED
May 10 12:35:19 Tower kernel: ata4.00: cmd 61/08:50:f8:d5:27/00:00:2d:00:00/40 tag 10 ncq dma 4096 out
May 10 12:35:19 Tower kernel:         res 40/00:00:61:d6:27/00:00:2d:00:00/40 Emask 0x10 (ATA bus error)
May 10 12:35:19 Tower kernel: ata4.00: status: { DRDY }
May 10 12:35:19 Tower kernel: ata4.00: failed command: WRITE FPDMA QUEUED
May 10 12:35:19 Tower kernel: ata4.00: cmd 61/08:d0:38:ce:27/00:00:2d:00:00/40 tag 26 ncq dma 4096 out
May 10 12:35:19 Tower kernel:         res 40/00:00:61:d6:27/00:00:2d:00:00/40 Emask 0x10 (ATA bus error)
May 10 12:35:19 Tower kernel: ata4.00: status: { DRDY }
May 10 12:35:19 Tower kernel: ata4: hard resetting link
May 10 12:35:29 Tower kernel: ata4: softreset failed (1st FIS failed)
May 10 12:35:29 Tower kernel: ata4: hard resetting link
May 10 12:35:39 Tower kernel: ata4: softreset failed (1st FIS failed)
May 10 12:35:39 Tower kernel: ata4: hard resetting link
May 10 12:35:49 Tower kernel: ata3.00: exception Emask 0x0 SAct 0x600000 SErr 0x0 action 0x6 frozen
May 10 12:35:49 Tower kernel: ata3.00: failed command: READ FPDMA QUEUED
May 10 12:35:49 Tower kernel: ata3.00: cmd 60/80:a8:18:eb:2f/00:00:53:00:00/40 tag 21 ncq dma 65536 in
May 10 12:35:49 Tower kernel:         res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
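
For anyone reading a long syslog like the one above, a small script can show at a glance which ports are failing and why. This is only a rough sketch: the two regular expressions are my own guess at the line layout based on the excerpt in this post, `summarize` is a hypothetical helper name, and the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches kernel lines such as "ata4.00: failed command: WRITE FPDMA QUEUED".
FAILED_CMD = re.compile(r"kernel: (ata\d+)\.\d+: failed command: (.+)$")
# Matches the parenthesised reason after an error mask, e.g. "Emask 0x10 (ATA bus error)".
EMASK = re.compile(r"Emask 0x[0-9a-f]+ \(([^)]+)\)")


def summarize(lines):
    """Count failed commands per (port, command) and error reasons overall."""
    failures = Counter()
    reasons = Counter()
    for line in lines:
        line = line.rstrip()
        match = FAILED_CMD.search(line)
        if match:
            failures[(match.group(1), match.group(2))] += 1
            continue
        match = EMASK.search(line)
        if match:
            reasons[match.group(1)] += 1
    return failures, reasons


# Sample lines taken from the log excerpt in this post.
sample = [
    "May 10 12:35:19 Tower kernel: ata4.00: failed command: WRITE FPDMA QUEUED",
    "May 10 12:35:19 Tower kernel:         res 40/00:00:61:d6:27/00:00:2d:00:00/40 Emask 0x10 (ATA bus error)",
    "May 10 12:35:49 Tower kernel: ata3.00: failed command: READ FPDMA QUEUED",
    "May 10 12:35:49 Tower kernel:         res 40/00:01:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)",
]

failures, reasons = summarize(sample)
for (port, command), count in sorted(failures.items()):
    print(port, command, count)
for reason, count in sorted(reasons.items()):
    print(reason, count)
```

To run it against a full log, pass an open file instead of the sample list, for example `summarize(open(path))` with `path` pointing at a saved copy of the syslog.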

 

Hardware Info:

Model: Custom

M/B: Gigabyte Technology Co., Ltd. X470 AORUS GAMING 5 WIFI-CF Version Default string - s/n: Default string

BIOS: American Megatrends International, LLC. Version F63a. Dated: 02/17/2022

CPU: AMD Ryzen 7 2700X Eight-Core @ 3700 MHz

HVM: Enabled

IOMMU: Enabled

Cache: 768 KiB, 4 MB, 16 MB

Memory: 32 GiB DDR4 (max. installable capacity 128 GiB)

Network: bond0: fault-tolerance (active-backup), mtu 1500
 eth0: 1000 Mbps, full duplex, mtu 1500

Kernel: Linux 5.10.28-Unraid x86_64

OpenSSL: 1.1.1j

Uptime: 0 days, 03:25:40 // after restart

RAM is running at 2133 MHz (4 × 8 GB sticks).

8 hours ago, JorgeB said:

Problems with the onboard SATA controller, quite common with some Ryzen boards, look for a BIOS update or use an add-on controller.

 

Unfortunately, I have already tried updating the BIOS to the latest version. Is there anything else I can do besides buying a PCI card?

On 5/11/2022 at 10:03 AM, JorgeB said:

Not that I know of, other than using a different model board, ideally Intel based.

 

So I purchased an LSI controller and everything is working great so far. However, I am now getting this error:

 

fstrim: /mnt/cache: FITRIM ioctl failed: Remote I/O error

 

Based on some other posts, it looks like this is related to the LSI card not supporting fstrim.
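
One way to check whether the kernel thinks a drive can accept discard requests on its current controller is to read the `discard_max_bytes` attribute under `/sys/block/<device>/queue/`. As far as I know, a value of 0 there means discard is not available for that device. The sketch below assumes that interpretation; `discard_supported` is a hypothetical helper name and `sdb` is a placeholder for the real device name of the cache drive.

```python
from pathlib import Path


def discard_supported(device, sysfs_root="/sys/block"):
    """Return True if the kernel reports a non-zero discard limit for the device.

    Returns None when the attribute cannot be read (for example, no such device).
    """
    attr = Path(sysfs_root) / device / "queue" / "discard_max_bytes"
    try:
        return int(attr.read_text().strip()) > 0
    except (OSError, ValueError):
        return None


if __name__ == "__main__":
    # "sdb" is a placeholder; substitute the cache drive's actual device name.
    print(discard_supported("sdb"))
```

Comparing the result with the drive on the LSI card and then on an onboard port would show whether the controller is what is blocking trim.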

 

Should I move my cache drive back to the onboard SATA ports, since I moved it to the controller as part of this change? Or is that likely to cause more issues with the SATA controller?

 

I could also experiment with changing the firmware version, but that isn't ideal.
