Freeze and crash during parity-check


Solved by JorgeB

Hello, I'm having an issue with my server where every time I try to do a parity check it gets between 5 and 50 GB in (on a 12 TB server) and then completely locks up. I can't do anything after it freezes and have to force a shutdown by pressing the power button.

 

I was running 6.9.2 but decided to restore the previous version, which was 6.8.3. After a few failed checks, I tried another parity check and it spewed out a bunch of USB read/write errors when it crashed. I swapped the USB for a new one and while it's not giving me any USB errors it's still crashing the same with the parity checks.

 

When it crashes it completely locks up, so I don't have the true syslog files. However, I did set up a syslog server to try to get an idea of what is going on. Two crashes are logged in the attached Crash 1.txt and Crash 2.txt files, in addition to the diagnostics files and a syslog file that was saved before starting a parity check (as I can't download it after the lockup).

 

I'm at a loss for what is going on. I ran a memory test and everything came back clean. None of the drives have changed from when it was last running. This all started happening after a power outage took the server offline.

 

If any other info is needed please let me know and I'll see what I can do.

Crash 1.txt Crash 2.txt tower-diagnostics-20220508-2242.zip tower-syslog-20220509-2232.zip

8 hours ago, JorgeB said:

Posted logs seem incomplete, is that all it caught or it's just partial, would like to see the beginning of the problem.

Sorry about that, that was all I could get at the time. After a few more attempts I've managed to get the parity check to fail without the entire server bricking itself. The check is still completely stalled and hasn't moved for an hour. This set of logs should be more complete (though it doesn't capture the entire time the server has been on, it's just repeating the same errors over and over anyway).

tower-syslog-20220510-1521.zip

23 minutes ago, JorgeB said:

Unraid driver is crashing, this happens sometimes to a few users, usually with a specific kernel/hardware combination, try upgrading to v6.10.0-rc7.

So far so good. I've updated to v6.10.0-rc7, and the parity check is at 1% right now. I'll report back if it fails again or if it's successful.

 

Thanks for the help.


Unfortunately it turns out that there are still issues. Something was plaguing it for a while and causing crashes, but I couldn't get any logs of it. Now it's just stuck. It's been 42.4 GB into the parity check for about 6 hours now. However, it hasn't crashed and the server is still responsive, though the logs show it throwing errors over and over (logs attached).

tower-syslog-20220512-0332.zip

8 hours ago, JorgeB said:

Several things crashing, looks more like a hardware issue, or some compatibility issue with the kernel, but since it happens with very different kernels I would think hardware.

Well that sucks. I can't imagine what hardware though, this server was running for months just fine. I've run memory tests, drive tests, nothing has ever shown any issues.

  • 4 weeks later...

Had to take a break from working on this due to work stuff, but I'm back on it now. Still no solution. I've got a new set of logs though; this time it failed at 0.2%, which is a bit better than normal. The server isn't bricked, so I'm able to download the logs. I also downloaded the diagnostics. Unlike normal, the elapsed time is still counting 15 minutes into the check, though the speed keeps dropping. It's at about 7 MB/s now and I can't actually tell if it's making any progress.

 

Gonna let it sit until it crashes. Still no idea as to the cause, though I did see someone mention crashes during a parity check that turned out to be caused by a damaged PSU. Not sure if my errors are representative of that, but if/when this crashes I might try swapping the PSU.

tower-syslog-20220603-2309.zip tower-diagnostics-20220603-1916.zip

18 hours ago, trurl said:

Have you done memtest?

 

I have performed memtests with no errors.

11 hours ago, JorgeB said:

Unraid driver is still crashing. It's unclear to me from your first post whether it now also crashes with v6.8 or v6.9. If you see this same issue with different Unraid releases, I would really suspect a hardware problem.

 

I've tested a few different versions with similar errors. Last night I began testing on 6.10.2-rc3, and it has not crashed yet 19 hours into the check. However, about 3 minutes in it got to 28.1 GB (of 12 TB) and stalled there, and it is still there. Unlike previous cases, the server has not crashed, and the elapsed time (which usually freezes right before it crashes) has continued to count up. The estimated speed has continued to drop and is now at 85.1 KB/s.
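For a sense of scale, here is a rough back-of-the-envelope sketch of what that reported speed would imply if it were sustained. It assumes decimal units (1 TB = 10^12 bytes, 1 KB = 10^3 bytes) and a constant rate, neither of which is confirmed by the Unraid UI figures; the function name is purely illustrative.

```python
def remaining_days(total_bytes, done_bytes, speed_bytes_per_s):
    """Days left at a constant speed (assumption: speed does not change)."""
    remaining_seconds = (total_bytes - done_bytes) / speed_bytes_per_s
    return remaining_seconds / 86_400  # seconds per day


# Assumed decimal units; the figures are the ones quoted in the post.
TB, GB, KB = 10**12, 10**9, 10**3

days = remaining_days(12 * TB, 28.1 * GB, 85.1 * KB)
print(f"about {days:.0f} days, or roughly {days / 365:.1f} years")
```

At that rate the check would take years rather than hours, so the check is effectively stalled even though the server itself stays up.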

 

The issue is that nothing hardware-wise has changed in the server from when it was previously running to now, and memtests and drive health checks have all come back clean. I swapped the USB based on some previous advice, but the issue persists.

 

Is there any easy way for me to use a different USB to boot a completely clean system on the same hardware (minus the drives)? So it'd be the same CPU, motherboard, RAM, etc., just with different drives so I don't ruin my data, and a different USB to handle the separate install. Presumably, if it managed to build an array in that case, then the problem is either the drives or the software.

 

I've added new logs, though only two lines have been added since I began yesterday (logs of which were posted in my previous reply).

tower-syslog-20220604-1835.zip


So I've confirmed two things. Firstly, the power supply is not the problem: I swapped it out for another I had and the issues still persist, but it was worth a try. Secondly, if I check without writing corrections, the check will hang at some point like normal but it won't crash the server. If I do select it to write corrections, it will instead crash when it hangs. Also, where it hangs is inconsistent; I've had it stop anywhere between 2 and 180 GB into the check. None of this really sheds any light on what the problem is, but it's progress, I guess.

13 hours ago, Lime1028 said:

The issue is that nothing hardware-wise has changed in the server from when it was previously running to now

Test with a known working Unraid release, one that you know was working before, if it now also crashes it will point to a hardware issue, any piece of hardware can go bad at any time, doesn't really matter it was working before.

20 hours ago, JorgeB said:

Test with a known working Unraid release, one that you know was working before, if it now also crashes it will point to a hardware issue, any piece of hardware can go bad at any time, doesn't really matter it was working before.

I've got a really weird jerry-rigged setup going on right now with a different motherboard, CPU, and RAM. It's currently running a parity check, so if it passes (it's estimating about 19 hours) then that will confirm the issue is in one of those three components. I also tried updating the BIOS on the original motherboard, but the newest BIOS just refused to boot Unraid no matter how I adjusted the boot settings. Restoring the older BIOS brought it back to booting, and to crashing on parity check as normal.

 

I am still curious as to which component is the issue. Unraid runs fine, and I can do anything I want within the interface; it's only when it begins a parity check that it crashes. Maybe something to do with the SATA ports on the motherboard?

 

Either way I'll send an update crash or pass.

On 6/6/2022 at 12:43 AM, Lime1028 said:

Either way I'll send an update crash or pass.

 

Well, life stuff got in the way of validating everything early in the week. In the end you guys were correct: hardware error, seemingly the motherboard to be exact. I'm not entirely sure what on the board is causing it, as everything works fine until I start a parity check. It might be the SATA controller; since Unraid runs off a USB drive, it doesn't need to touch anything SATA-related until the check starts. Either way, the board still had a few months of warranty left on it, so it's currently being shipped off to MSI and we'll see what they find. I've got a loaner board from a friend in there right now and everything is back up and running.

 

Jorge, and trurl, thank you both for your help with these issues and your patience with me. It's all much appreciated.
