slize Posted June 7, 2021

The issue: One of the two disks in the array randomly goes into "error state", which crashes all VMs on the server. It happens about once per month.

I checked the disks and they are totally fine. They are about 10 months old, and I replaced one of them last month (just to make sure that the issue is not caused by a broken disk). I also swapped the SATA cables and the disks' positions in the server backplane.

The last time the error occurred was on 01.06.2021, 14 hours and 2 minutes after a parity check started. The "failed" disk was the new/replaced disk.

Event: Unraid Parity disk error
Subject: Alert [SRVUNR1] - Parity disk in error state (disk dsbl)
Description: WDC_WD80EDAZ-11TA3A0_VG033ZLG (sdg)
Importance: alert

I added some screenshots from the GUI after the crash and the diagnostics.

To get the system back up and running I have to:
#1 Reboot the system
#2 Remove the "error state" disk
#3 Start the array
#4 Stop the array
#5 Add the disk
#6 Start a parity rebuild/resync, depending on which disk got corrupted (parity disk/data disk)

How can I stop this from happening?

srvunr1-diagnostics-20210601-1839.zip
srvunr1-smart-20210601-1839.zip
srvunr1-smart-20210607-1814 (1).zip
srvunr1-smart-20210607-1814.zip
JorgeB Posted June 7, 2021

Problem with the onboard SATA controller:

Jun 1 14:00:59 SRVUNR1 kernel: ahci 0000:03:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0xe7d60000 flags=0x0000]

Unfortunately this is quite common with some Ryzen boards. A BIOS update might help, as might a newer Unraid release when one is available, because of the newer kernel. Failing that, the best bet is to use an add-on controller (or a different model board).
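To check whether the same event shows up in other syslogs (for example around earlier crashes), a small script can scan for it. This is only a minimal sketch: the regular expression is modelled on the single log line quoted above, so other kernel versions may format the message differently, and the function name `find_page_faults` and the idea of passing the syslog path on the command line are assumptions made for illustration.

```python
import re
import sys

# Pattern modelled on the one log line quoted in this thread (assumption:
# other kernels may word or order the fields differently).
PAGE_FAULT = re.compile(
    r"kernel: (?P<driver>\S+) "
    r"(?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]): "
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT "
    r"domain=(?P<domain>0x[0-9a-f]+) "
    r"address=(?P<address>0x[0-9a-f]+) "
    r"flags=(?P<flags>0x[0-9a-f]+)\]"
)


def find_page_faults(lines):
    """Return one dict per IO_PAGE_FAULT event found in an iterable of syslog lines."""
    hits = []
    for line in lines:
        match = PAGE_FAULT.search(line)
        if match:
            hits.append(match.groupdict())
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: pass the path of a syslog file as the first argument.
    with open(sys.argv[1], errors="replace") as handle:
        for hit in find_page_faults(handle):
            print(hit["driver"], hit["pci"], hit["address"])
```

If the `ahci` driver and the same PCI address appear each time a disk drops out, that supports the controller diagnosis rather than a disk fault.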
slize Posted June 7, 2021 (edited)

47 minutes ago, JorgeB said:
Problem with the onboard SATA controller: Jun 1 14:00:59 SRVUNR1 kernel: ahci 0000:03:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0xe7d60000 flags=0x0000] Unfortunately quite common with some Ryzen boards, BIOS update might help, or using a newer Unraid release when available due to the newer kernel, failing that best bet is to use an add-on controller (or a different model board).

I am using an ASRock Rack X470 with the latest BIOS. Well, that's sad when you pay 300€ for a board just to get such errors. I will get a cheap 50€ SATA HBA; that should be enough for this system. Thank you very much!
JorgeB Posted June 7, 2021

You're welcome, you can take a look here for some recommended models: