Read error during parity check - Device disabled


Hi all,

 

I have been wrestling with this server since I built it: "random" reboots and a whole host of other annoying issues.

Starting to get worried about my data now tbh. Specs:

ASRock Rack X570-D4U

AMD Ryzen 5 PRO 4650G

2x 32GB DDR4 ECC Unbuffered RAM

NVidia T400

 

I had a power cut this morning and it of course coincided with my UPS being offline for a battery replacement... typical.

Just to double up on my misfortune, I had also made a mistake in a .conf file while setting up an external rsyslog server to help me trace the root cause of my general server issues, so after all that effort I didn't even get any off-board log data... typical.

 

I have been suspicious that the problems could be related to the chipset on the motherboard, so as a troubleshooting measure I installed an LSI 9211-8i, which is connected via SAS cables to a SAS backplane.

 

The power cut of course meant I had to run a parity check. While it was running (it probably didn't make it to 10%) I got an email from the server saying there was a disk error. When I logged in, I noticed that the drive was disabled and all the other drives had read errors.

 

Out of panic, I just stopped the parity check, as I noticed that I had left "write corrections" enabled and was concerned I was going to be writing garbage to my array.

 

I have no idea where I stand at the moment, so open to suggestions to check my data and/or test my hardware.

 

I have run a short SMART test, and am currently running an extended test on the drive that Unraid says has failed. I have also ordered a replacement drive, which will arrive tomorrow, although I don't actually think there is anything wrong with the existing one.

 

Diagnostics attached.

tower-diagnostics-20230313-1415.zip

Link to comment
3 minutes ago, JorgeB said:

Diags are after rebooting so we can't see what happened, but the disk looks healthy.

I was concerned that might be a problem. A couple of questions for moving forward:

1. Is there anything other than an external syslog server that can help me trace my reboots?

2. Is there anything I can do to test the validity of the data on my array?

3. How do I see where the "fail" flag is and "acknowledge" it so the array can be started with this disk?

 

Due to the power failure and subsequent failed parity check, I am genuinely not confident in my parity data to rebuild the array from.

Link to comment

Start the array to see if the emulated disk mounts and its contents look correct. If yes, you can rebuild on top. Another option, assuming nothing was written to that disk once it got disabled, would be to do a new config and re-sync parity instead.
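To illustrate why the emulated disk is only as trustworthy as parity: with single parity, a missing disk is reconstructed by XOR-ing the parity with every remaining data disk, so a stale or wrong parity byte shows up directly in the emulated contents. A toy sketch in Python, where the byte strings are made-up stand-ins for disk sectors (nothing here comes from the diagnostics):

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Made-up 4-byte "sectors" standing in for three data disks.
disk1 = bytes([0x11, 0x22, 0x33, 0x44])
disk2 = bytes([0xA0, 0xB0, 0xC0, 0xD0])
disk3 = bytes([0x0F, 0x0E, 0x0D, 0x0C])

# Single parity is the XOR of all data disks.
parity = xor_blocks([disk1, disk2, disk3])

# With valid parity, the disabled disk2 is emulated exactly.
emulated_disk2 = xor_blocks([parity, disk1, disk3])

# Flip the first parity byte to simulate parity that is out of sync.
bad_parity = bytes([parity[0] ^ 0xFF]) + parity[1:]

# The same byte position in the emulated disk is now wrong.
bad_emulated_disk2 = xor_blocks([bad_parity, disk1, disk3])
```

This is why checking that the emulated disk mounts and looks sane comes before rebuilding on top of it.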

 

35 minutes ago, B1scu1T said:

Is there anything I can do to test the validity of the data on my array?

Only if you have pre-existing checksums for all data (or are using btrfs).
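For future reference, a minimal sketch of what "pre-existing checksums" could look like, using only the Python standard library. The function names are made up for illustration, and this only helps if the manifest was created while the data was still known to be good:

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large media files do not fill RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """Return {relative_path: "missing" | "changed"} for files that no longer match."""
    root = Path(root)
    problems = {}
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file():
            problems[rel] = "missing"
        elif sha256_of(target) != expected:
            problems[rel] = "changed"
    return problems
```

The idea would be to run `build_manifest` against a disk or share while it is healthy, save the result somewhere off the array (for example with `json.dump`), and run `verify_manifest` after an incident like this one. An empty result means every file listed in the manifest still matches. Files that were legitimately edited after the manifest was made will also show up as "changed", so it is best suited to data that rarely changes.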

Link to comment

I should probably point out that the reason the log is wiped is that the server rebooted itself when I stopped the array.

 

Guess I will have to have a think. About ready to chuck this POS motherboard in the bin, never had as many issues as this with my Supermicro Intel systems.

Link to comment

Extended SMART test on that disk came back OK.

 

I decided that I have more trust in the disks' data than in the parity, given that I saw writes going to parity while one of the disks was disabled. So I created a new config and tried to build new parity, but it didn't complete.

 

Sadly I don't think my remote syslog helps us, but I'm grateful for any feedback.

unraid_155.zip tower-diagnostics-20230314-0942.zip

Edited by B1scu1T
Link to comment
40 minutes ago, JorgeB said:

There's nothing relevant logged, this and the symptoms to me point to a hardware issue.

Definitely. Of this I am already pretty sure, but I was hoping there would be an indicator here as to what it is.

 

I ran memtest for a few days when I first built this, so I have no clue where the hardware issues are coming from.

Edited by B1scu1T
Link to comment
6 minutes ago, JorgeB said:

Going by the symptoms, board, RAM or PSU would be the main suspects. I would start by trying with just one of the RAM sticks; if it still crashes, try with just the other one. That would basically rule out the RAM.

The PSU is pretty new, and it's a decent unit (Corsair TX850M). I could swap in my older CS450M though; that ran for years without fault, so it would at least remove one of the variables.

IPMI is also of absolutely no help and isn't reporting any issues other than clock sync, but it also isn't dropping out when the server drops out, which probably suggests there isn't a significant power issue.

I have grabbed the debug logs from there. There are some errors around the same time as the reboot, but they don't, IMO, point to anything useful... a bit beyond my expertise at this point.

 

The issue with tracing the individual components is that I have fiddled with settings previously, thinking I had figured out the problem (e.g. RAM settings/voltage, CPU settings, GPU settings), and then had the system stable for literally months with no crashes or reboots, so then I presume it is ok until it randomly happens again. At the moment it appears to be totally borked at exactly the wrong moment.

 

I'm at breaking point tbh, I need this server to run at some capacity.

I think I'm going to pick up a v6 E3 Xeon and supporting board from eBay, then I can experiment with this X570 and get to the bottom of its issues as a separate project (yippee). Might even just sell the board on and/or discuss it with the retailer if it looks like there is a problem with it.

var.zip

Link to comment

So on the one hand, I ran the 4-pass memtest with ECC polling enabled twice and got no errors... so I don't think it's that.

 

On the other hand, the ASRock Rack support team supplied me with some SOC flash tools, but they don't recognise the BMC image to fix my IPMI.

Link to comment
  • 1 month later...

I managed to get the board alive again through what seems to be nothing but sheer luck, however I didn't get to the bottom of why there were constant crashes.

 

I just bit the bullet and bought new hardware. Now running Intel and Supermicro and it's been flawless. Despite the lack of Memtest-reported failures, I'm inclined to lean towards it being a memory issue, with a slight possibility that there is something wrong with the 4650G CPU or the UEFI settings.

Edited by B1scu1T
Link to comment
