One disk randomly drops out of the array



The issue:
One of the two disks in the array randomly goes into an "error state", which crashes all VMs on the server. It happens about once per month. I checked the disks and they are totally fine. They are about 10 months old, and I replaced one of them last month (just to make sure that the issue is not caused by a broken disk). I also swapped the SATA cables and moved the disks to different slots in the server backplane. The last time the error occurred was on 01.06.2021, 14 hours and 2 minutes after a parity check started, and the "failed" disk was the new replacement disk.

Quote

Event: Unraid Parity disk error
Subject: Alert [SRVUNR1] - Parity disk in error state (disk dsbl)
Description: WDC_WD80EDAZ-11TA3A0_VG033ZLG (sdg)
Importance: alert


I added some screenshots from the GUI after the crash, plus the diagnostics. To get the system back up and running I have to:
#1 Reboot the system
#2 Remove the "error state" disk
#3 Start the array
#4 Stop the array
#5 Add the disk again
#6 Start a parity rebuild/resync, depending on which disk was affected (parity disk/data disk)

How can I stop this from happening?

Unbenannt.png

Unbenannt1.png

Unbenannt2.png

srvunr1-diagnostics-20210601-1839.zip srvunr1-smart-20210601-1839.zip srvunr1-smart-20210607-1814 (1).zip srvunr1-smart-20210607-1814.zip


Problem with the onboard SATA controller:

 

Jun  1 14:00:59 SRVUNR1 kernel: ahci 0000:03:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0xe7d60000 flags=0x0000]

 

Unfortunately, this is quite common with some Ryzen boards. A BIOS update might help, as might a newer Unraid release when one is available, because of the newer kernel. Failing that, the best bet is to use an add-on controller (or a different model board).
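If you want to check how often this happens and which device is involved, a small script can scan the syslog for these events. This is only a rough sketch: the regular expression is modelled on the single log line quoted above, the function names are made up for illustration, and the default path /var/log/syslog is an assumption about where your log lives.

```python
import re
import sys
from collections import Counter

# Pattern modelled on the one log line quoted above; other kernels may
# format AMD-Vi events differently, so treat this as an assumption.
FAULT_RE = re.compile(
    r"kernel: (?P<driver>\S+) "
    r"(?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT (?P<fields>[^\]]*)\]"
)


def find_page_faults(lines):
    """Return one dict per IO_PAGE_FAULT line found in `lines`."""
    events = []
    for line in lines:
        match = FAULT_RE.search(line)
        if match is None:
            continue
        # Split "domain=0x000f address=0x... flags=0x0000" into key/value pairs.
        fields = dict(
            pair.split("=", 1)
            for pair in match.group("fields").split()
            if "=" in pair
        )
        events.append(
            {
                "driver": match.group("driver"),
                "pci": match.group("pci"),
                "address": fields.get("address"),
                "line": line.rstrip("\n"),
            }
        )
    return events


def count_by_device(events):
    """Count events per (driver, PCI address) pair."""
    return Counter((event["driver"], event["pci"]) for event in events)


if __name__ == "__main__":
    # Assumed default log location; pass another path as the first argument.
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/syslog"
    with open(path, errors="replace") as handle:
        faults = find_page_faults(handle)
    for (driver, pci), count in count_by_device(faults).items():
        print(f"{driver} {pci}: {count} IO_PAGE_FAULT event(s)")
```

If every hit points at the same ahci device, that supports the onboard controller being the culprit rather than the disks.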

47 minutes ago, JorgeB said:

Problem with the onboard SATA controller:

Jun  1 14:00:59 SRVUNR1 kernel: ahci 0000:03:00.1: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0xe7d60000 flags=0x0000]

Unfortunately quite common with some Ryzen boards, BIOS update might help, or using a newer Unraid release when available due to the newer kernel, failing that best bet is to use an add-on controller (or a different model board).

I am using an ASRock Rack X470 with the latest BIOS. Well, that's sad when you pay 300€ for a board just to get such errors. I will get a cheap 50€ SATA HBA; that should be enough for this system. Thank you very much!

Edited by slize