[SOLVED] Cascading SATA disk errors on a Ryzen system, B550 Tomahawk, Unraid 6.8.3


Mihail


The setup

Ryzen 7 3700X

MAG B550 TOMAHAWK - BIOS A53 (latest beta BIOS with ComboAM4PIV2 1.1.9.0) - updated after the second failure

128GB DDR4 2666MHz

550W PSU


BIOS version: 7C91vA53 (beta), release date 2020-12-30, file size 17.96 MB

Drives

Parity WDC_WD102KRYZ-01A5AB0_VCH9Z3KP - 10 TB

Disk1 WDC_WD4002FYYZ-01B7CB0_K3G42TLB - 4 TB

Disk2 WDC_WD102KRYZ-01A5AB0_VCH8VSTP - 10 TB

Disk3 WDC_WD4002FYYZ-01B7CB0_K3G4VK1B - 4 TB

 

Cache Samsung_SSD_860_EVO_2TB_S4X1NJ0N702274P - 2 TB

Cache Samsung_SSD_860_EVO_2TB_S4X1NJ0N702273Y - 2 TB

 

All drives are SATA and are plugged into the onboard SATA controller

 

Running Unraid 6.8.3

 

 

The problem

 

After running for several days (2-14), the system hits an error that causes cascading read errors across multiple disks. Typically Unraid marks Disk1 as disabled and either puts the whole array into read-only mode or locks up the virtual machines completely.

 

The errors seemed to point to communication issues with the drives so the first step was to sacrifice the SATA cables to the IT gods and replace them with new ones. This did not fix the issue.

 

After rebooting the server, everything returns to normal operation. I have tried re-adding Disk1 to the array twice. The array rebuild completes without issue, as does the extended SMART test. The system, along with its virtual machines, then works fine for several days before hitting a similar issue.

 

On the last round of failures, I did not re-add the 4 TB Disk1 that was marked disabled, to try to rule it out of the equation. The same cascading failure happened two days later, and the system was working normally after a reboot.

 

I suspect that this could be a kernel issue or a hardware issue with the SATA controller.

 

Most of the time the errors start like this:

Jan 12 08:13:59 vbarum kernel: ahci 0000:02:00.1: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000cafd0000 flags=0x0000]
Jan 12 08:13:59 vbarum kernel: ata4.00: exception Emask 0x10 SAct 0x38000 SErr 0x0 action 0x6 frozen
Jan 12 08:13:59 vbarum kernel: ata4.00: irq_stat 0x08000000, interface fatal error
Jan 12 08:13:59 vbarum kernel: ata4.00: failed command: WRITE FPDMA QUEUED
Jan 12 08:13:59 vbarum kernel: ata4.00: cmd 61/08:78:e8:79:4c/00:00:3c:00:00/40 tag 15 ncq dma 4096 out
Jan 12 08:13:59 vbarum kernel:         res 40/00:88:90:7a:4c/00:00:3c:00:00/40 Emask 0x10 (ATA bus error)
Jan 12 08:13:59 vbarum kernel: ata4.00: status: { DRDY }
Jan 12 08:13:59 vbarum kernel: ata4.00: failed command: WRITE FPDMA QUEUED
Jan 12 08:13:59 vbarum kernel: ata4.00: cmd 61/08:80:20:7a:4c/00:00:3c:00:00/40 tag 16 ncq dma 4096 out
Jan 12 08:13:59 vbarum kernel:         res 40/00:88:90:7a:4c/00:00:3c:00:00/40 Emask 0x10 (ATA bus error)
Jan 12 08:13:59 vbarum kernel: ata4.00: status: { DRDY }
Jan 12 08:13:59 vbarum kernel: ata4.00: failed command: WRITE FPDMA QUEUED
Jan 12 08:13:59 vbarum kernel: ata4.00: cmd 61/08:88:90:7a:4c/00:00:3c:00:00/40 tag 17 ncq dma 4096 out
Jan 12 08:13:59 vbarum kernel:         res 40/00:88:90:7a:4c/00:00:3c:00:00/40 Emask 0x10 (ATA bus error)
Jan 12 08:13:59 vbarum kernel: ata4.00: status: { DRDY }
Jan 12 08:13:59 vbarum kernel: ata4: hard resetting link
Jan 12 08:14:00 vbarum kernel: ata4: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
Jan 12 08:14:05 vbarum kernel: ata4.00: qc timeout (cmd 0xec)
Jan 12 08:14:05 vbarum kernel: ata4.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Jan 12 08:14:05 vbarum kernel: ata4.00: revalidation failed (errno=-5)
Jan 12 08:14:05 vbarum kernel: ata4: hard resetting link
Jan 12 08:14:15 vbarum kernel: ata4: softreset failed (1st FIS failed)
Jan 12 08:14:15 vbarum kernel: ata4: hard resetting link
Jan 12 08:14:25 vbarum kernel: ata4: softreset failed (1st FIS failed)
Jan 12 08:14:25 vbarum kernel: ata4: hard resetting link

The AHCI error is not always the first one and the kernel reports read errors on multiple disks.
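
To check whether the AHCI page fault really precedes the ATA errors in a given syslog, the log can be summarized with a short script. The following is only a sketch: the regular expressions are modeled on the excerpt above, the function name `summarize` is made up for illustration, and other kernel versions may format these lines differently.

```python
import re
from collections import Counter

# IOMMU event line for a PCI device, modeled on the excerpt above, e.g.
# "ahci 0000:02:00.1: Event logged [IO_PAGE_FAULT ... address=0x... ]"
PAGE_FAULT = re.compile(
    r"kernel: (?P<driver>\S+) "
    r"(?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"Event logged \[IO_PAGE_FAULT .*?address=(?P<addr>0x[0-9a-f]+)"
)

# libata lines naming a failed command or a link reset on a port (ataN or ataN.MM).
ATA_EVENT = re.compile(
    r"kernel: (?P<port>ata\d+)(?:\.\d+)?: "
    r"(?P<what>failed command|hard resetting link|softreset failed)"
)

def summarize(lines):
    """Return (page_faults, ata_event_counts) for an iterable of syslog lines."""
    faults = []          # list of (pci_address, faulting_address), in log order
    events = Counter()   # (ata_port, event_kind) -> number of occurrences
    for line in lines:
        m = PAGE_FAULT.search(line)
        if m:
            faults.append((m["pci"], m["addr"]))
            continue
        m = ATA_EVENT.search(line)
        if m:
            events[(m["port"], m["what"])] += 1
    return faults, events

# Lines copied from the excerpt in this post.
SAMPLE = [
    "Jan 12 08:13:59 vbarum kernel: ahci 0000:02:00.1: Event logged "
    "[IO_PAGE_FAULT domain=0x0000 address=0x00000000cafd0000 flags=0x0000]",
    "Jan 12 08:13:59 vbarum kernel: ata4.00: failed command: WRITE FPDMA QUEUED",
    "Jan 12 08:13:59 vbarum kernel: ata4: hard resetting link",
    "Jan 12 08:14:15 vbarum kernel: ata4: softreset failed (1st FIS failed)",
]

if __name__ == "__main__":
    faults, events = summarize(SAMPLE)
    print(faults)
    for (port, what), count in sorted(events.items()):
        print(port, what, count)
```

Running `summarize` over a full syslog (for example `summarize(open("syslog"))`) should show which ports are affected and how often, which makes it easier to compare one failure with the next.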

 

Please help me with fixing the issue. I was unable to make sense of the logs, but I hope that there are more knowledgeable people here.

 

I have attached full syslogs and hardware profile to this post. Hopefully they are useful in diagnosing the issue.

 

 

 

syslog.zip hwprofile.txt

50 minutes ago, Mihail said:

Jan 12 08:13:59 vbarum kernel: ahci 0000:02:00.1: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x00000000cafd0000 flags=0x0000]

 

This is a rather common issue with the SATA controller on Ryzen boards. The newer kernel in v6.9 helps in some cases, and a newer BIOS might also help, but you already did that.

1 minute ago, JorgeB said:

 

This is a rather common issue with the SATA controller on Ryzen boards. The newer kernel in v6.9 helps in some cases, and a newer BIOS might also help, but you already did that.

I ran into some posts about this error with NVMe drives, and some that talk about disabling IOMMU grouping, but none involving SATA drives.
Would you advise moving to the beta channel to get a fix through a newer kernel?
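
For anyone who wants to see which IOMMU-related kernel parameters a machine booted with before changing anything, a minimal sketch follows. It assumes a standard Linux `/proc/cmdline`; the helper names are made up for illustration, and `iommu`, `amd_iommu` and `intel_iommu` are existing kernel parameters, but nothing in this thread confirms that changing them fixes the SATA case.

```python
def iommu_params(cmdline: str) -> dict:
    """Return the IOMMU-related parameters found in a kernel command line."""
    found = {}
    for token in cmdline.split():
        # Parameters look like "key=value"; flags without "=" get an empty value.
        key, _, value = token.partition("=")
        if key in ("iommu", "amd_iommu", "intel_iommu"):
            found[key] = value
    return found

def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read().strip()

if __name__ == "__main__":
    print(iommu_params(read_cmdline()))
```

An empty result means no such parameter was passed at boot, so the kernel is using its defaults.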

  • 2 weeks later...
  • JorgeB changed the title to [SOLVED] Cascading SATA disk errors on a Ryzen system, B550 Tomahawk, Unraid 6.8.3
  • 2 weeks later...

Still broken: I got critical disk errors last night and all VMs went down because of it.

 

Please help me figure it out, since having to reboot and rebuild the parity every 10 days or so is getting tiresome.

 

I attached new diagnostic dumps from before and after rebooting. The dump taken before the reboot does not include SMART data. I also attached the syslog covering the last boot up to the system failure.
 

vbarum-diagnostics-20210217-2315-beforereboot.zip vbarum-diagnostics-20210217-2341-after reboot.zip syslog

14 minutes ago, JorgeB said:

It's again the typical AMD controller problem. If there's no newer BIOS, you basically have 3 options: wait for a newer Unraid release with a newer kernel and hope it helps, use an add-on HBA/controller, or try a different board model.

 

Do you think an LSI card would fix this, or something in the same realm?
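
Whichever controller ends up being used, it helps to confirm which PCI device each disk actually hangs off, so the disks can be matched against the `0000:02:00.1` AHCI device from the log. Below is a rough sketch that derives this from the resolved `/sys/block/<dev>` path. The example paths in the comments are hypothetical, the function names are made up for illustration, and the assumption that disks behind a SAS HBA show no `ataN` component should be checked on the actual system.

```python
import os
import re

PCI_ADDR = re.compile(r"[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]")
ATA_PORT = re.compile(r"/(ata\d+)/")

def controller_of(sysfs_path: str):
    """Return (pci_address, ata_port) for a resolved /sys/block/<dev> path.

    The last PCI address in the path is the controller the disk is attached to;
    ata_port is None when the path has no ataN component.
    """
    pci = PCI_ADDR.findall(sysfs_path)
    port = ATA_PORT.search(sysfs_path)
    return (pci[-1] if pci else None, port.group(1) if port else None)

def list_disks(root: str = "/sys/block"):
    """Map each sdX device to its (pci_address, ata_port)."""
    result = {}
    for name in sorted(os.listdir(root)):
        if name.startswith("sd"):
            resolved = os.path.realpath(os.path.join(root, name))
            result[name] = controller_of(resolved)
    return result

# Hypothetical resolved path for a disk on the onboard AHCI controller:
# /sys/devices/pci0000:00/0000:00:01.2/0000:02:00.1/ata4/host3/target3:0:0/3:0:0:0/block/sdd

if __name__ == "__main__":
    for dev, (pci, port) in list_disks().items():
        print(dev, pci, port)
```

After moving drives to an add-on controller, the PCI address reported for those disks should change, which gives a quick check that they are no longer behind the onboard controller.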

On 2/18/2021 at 3:34 PM, Mihail said:

Still broken: I got critical disk errors last night and all VMs went down because of it.

Please help me figure it out, since having to reboot and rebuild the parity every 10 days or so is getting tiresome.

I attached new diagnostic dumps from before and after rebooting. The dump taken before the reboot does not include SMART data. I also attached the syslog covering the last boot up to the system failure.

vbarum-diagnostics-20210217-2315-beforereboot.zip vbarum-diagnostics-20210217-2341-after reboot.zip syslog

Quick question: is any part of your system running at PCIe 4.0, such as a Samsung 980 Pro or an RTX 3000 series graphics card?

MAG B550 TOMAHAWK - latest BIOS is 7C91vA5 AGESA 1.2.0.0. Did you try it?

Any overclocking done there?
