convergence Posted March 11, 2022 (edited)

Hi, I'd appreciate any help understanding what is going on and how to recover from this. I have some experience with single disks failing (usually due to dodgy SATA cables) and rebuilding the array, but having all array disks and both cache drives report errors at once is new to me.

The first thing I noticed was that my Windows VM froze. I went into the Unraid GUI expecting to restart the VM, or the whole machine if that didn't work. Instead I was greeted by two notifications: the first warned that 3 disks had read errors, the second that 4 disks had read errors. The notifications were dated one minute apart, around the time the VM became sluggish and froze.

I haven't dared to reboot or try anything else. The Docker service seems to have stopped without any action on my part. After some time the VM was also stopped (not by me) and has disappeared from the VMs tab.

All my array disks report the same small number of errors: initially 32 per disk, now up to 129. The Main tab doesn't show any errors for my cache devices, but the syslog is full of errors relating to both cache devices.

Things leading up to "the event":
- Windows update on the VM the day before, forcing me to reboot the whole machine because the GPU wouldn't initialize
- overnight parity check with 0 errors (not initiated by me or by schedule; due to the machine reboot?)
- casual Windows VM usage; not much load from Docker containers

System/hardware:
- motherboard: Gigabyte AX370-Gaming K5
- BIOS: American Megatrends Inc., version F3c, dated 06/02/2017
- CPU: AMD Ryzen 7 1700 Eight-Core @ 2725 MHz
- RAM: 32 GiB DDR4
- parity: Toshiba 8 TB
- data disks: 3x HGST 4 TB
- cache: Samsung SSD 840 Pro 256 GB + 860 Evo 250 GB
- Unraid version 6.9.2
- Radeon 5700 XT passed through (IOMMU) to the Windows VM

Which actions should I take? Reboot? Stop the array? I'm going to try to get some rest. It's been a long day.
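Simultaneous errors on every array disk plus both cache devices usually point at something the drives share (controller, power, cabling) rather than the disks themselves. As a hedged sketch, a summary like the one below can show whether the syslog errors cluster on specific ATA links; the helper name and the match patterns are my own assumptions, and on Unraid the syslog normally lives at /var/log/syslog:

```shell
# Summarize ATA/SATA error lines by device, reading a syslog on stdin.
# The function name and keyword patterns are assumptions, not an official tool.
ata_error_summary() {
  grep -E 'ata[0-9]+' \
    | grep -Ei 'error|failed|reset|timeout' \
    | grep -oE 'ata[0-9]+(\.[0-9]+)?' \
    | sort | uniq -c | sort -rn
}

# Typical usage (path is the usual Unraid syslog location):
# ata_error_summary < /var/log/syslog
```

If one ata link dominates the counts it is likely a single cable or port; roughly even counts across all links fit a shared controller problem.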
pierre-diagnostics-20220311-1728.anon.zip

Edited May 11, 2022 by convergence: redacted logs
JorgeB Posted March 12, 2022 (Solution)

This looks like a problem with the onboard SATA controller, quite common with some Ryzen boards, usually under load. Look for a BIOS update, and if that doesn't help, use an add-on controller or a different board.
convergence Posted March 12, 2022 (Author)

Thank you for your reply. A SATA controller issue seems plausible. Luckily it didn't do much damage (yet). A reboot got my server up and running again, for now. I'm at 45% of the parity check that started on boot, with zero sync errors so far; only a bit more to go, and hopefully it finishes with zero errors. I guess there could still be some data corruption, but I wouldn't know where to look for it.

I'll go the BIOS update route after this, since my motherboard otherwise seems fine. I'm still on such an old BIOS version because I haven't had stability issues with this system, and I've had some bad fortune with updates breaking things on older systems. I'm 13 versions behind the latest release, so I think I'll have to do incremental updates.

Any recommendations to make the repeated BIOS update process go smoothly and to reduce risk? There don't seem to be many forum posts referencing that kind of process, or I did a bad job searching for them. I'm thinking of things like stopping the array, disabling Docker, and setting the VM not to start on boot. Or should I leave all of that running as-is, to improve my chances of catching issues that might occur at a particular BIOS version? I'm probably overthinking all this, but I'm a bit paranoid about changing things on machines that have been running stably for years.
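On the "where to look for corruption" question: a clean parity check only proves the array is internally consistent, not that file contents are intact. One hedged, low-tech approach is to keep a checksum manifest of important shares and re-verify it after an event like this. The sketch below demonstrates on a temporary directory; all paths are placeholders (in practice SHARE would point at something like /mnt/user/important):

```shell
# Demo on a temporary directory; in practice point SHARE at the real share.
SHARE=$(mktemp -d)
echo "some data" > "$SHARE/file.txt"

# 1) Build a checksum manifest of every file in the share.
(cd "$SHARE" && find . -type f -print0 | xargs -0 md5sum) > /tmp/manifest.md5

# 2) After the incident, list only files whose contents changed.
echo "corrupted" > "$SHARE/file.txt"   # simulate silent corruption
(cd "$SHARE" && md5sum -c /tmp/manifest.md5 2>/dev/null) | grep -v ': OK$'
# → ./file.txt: FAILED
```

This only helps for data that existed when the manifest was built, which is why it works best as a routine rather than an after-the-fact check.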
JorgeB Posted March 13, 2022

There shouldn't be any issues after updating the BIOS, though of course it might also not help with the controller issue.
convergence Posted March 26, 2022 (Author)

A couple of updates on how this went:
- updated BIOS to version F31
- verified that Unraid still booted (array down, though: the additional USB drive took me over the device limit for my license)
- updated BIOS to F51d (and removed the USB drive)
- after boot things looked mostly good: Docker containers ran as normal, but the Windows VM wouldn't boot due to changed IOMMU groups

After this I struggled for a couple of hours. While I was trying to fix the VM config, my machine lost all network connectivity. I tried restarting networking but didn't manage to get an IP address again. I tried to reboot via the local tty, which didn't work: the reboot hung on the graceful shutdown of nginx, or on something that wasn't printed to the screen. I had to do a dirty shutdown after 30 minutes of waiting.

Having seen some forum posts about issues related to C-states, I disabled C-states in my BIOS before booting into Unraid again. (I'm not sure whether C-states were disabled before the BIOS updates.) The machine proceeded to boot without issues and had networking again.

I was able to re-assign my video card and sound card to the Windows VM using the form view. The VM still wouldn't boot because some references to the now-missing PCIe devices were left in the XML. After removing all references to missing devices from the XML, the VM started successfully.

I hope the new BIOS improves the stability of my onboard SATA controller. Next project: finally trying ich777's fix for the AMD reset bug on my 5700 XT.
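The stale-passthrough cleanup described above amounts to deleting every `<hostdev>` block (libvirt's element for a passed-through PCI device) from the domain XML before re-assigning the devices. A minimal sketch, with an assumed sample XML and placeholder file path; on Unraid the same edit can be made in the VM's XML view, or via `virsh edit`:

```shell
# Create a sample libvirt domain XML; in practice this would come from
# `virsh dumpxml <vm-name>` (VM name omitted here on purpose).
cat > /tmp/domain.xml <<'EOF'
<domain>
  <devices>
    <hostdev mode='subsystem' type='pci'>
      <source/>
    </hostdev>
    <disk type='file'/>
  </devices>
</domain>
EOF

# Delete every <hostdev>...</hostdev> block (stale passthrough entries).
sed -i '/<hostdev /,/<\/hostdev>/d' /tmp/domain.xml

# The cleaned definition could then be re-imported with:
#   virsh define /tmp/domain.xml
```

The remaining devices (disks, network interfaces) survive untouched, which is why stripping only the `<hostdev>` blocks is safer than recreating the VM from scratch.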
convergence Posted May 11, 2022 (Author)

New issues, possibly related: I'm thinking of the SATA controller and RAM as possible causes. More info (and updates to follow) in the new topic.