
Issues with Unraid crashing



I've been having ongoing issues with Unraid crashing. Initially Unraid ran great (GPU passthrough issues aside). Sometime in the first half of this year, around the time I updated to 6.9.2, I started having nothing but problems with crashes. Within 24-48 hours the server would hang: I couldn't ping it, SSH in, or navigate to the webGUI. I could also force a crash by running a parity check, which wouldn't run more than ~12 minutes before crashing.

 

Also, around the time that I started seeing crashes, I went from a Synology RT2600ac as my router/switch/WiFi to a Netgate SG-3100 running pfSense for my router/switch. I'm not sure whether this could be a contributing factor, but since I believe Docker's macvlan was the primary culprit, it's been in the back of my mind.

 

I'm currently on 6.10.0-rc1 and stability has improved. However, I'm still getting crashes. I can complete a parity check, and the server will stay up for a few days. I've had 3 crashes so far on 6.10.0-rc1. The first was 60% of the way through the first parity check (much longer than the ~12 minutes mentioned above), so when I ran another parity check successfully, I thought perhaps it was a fluke. The second was a few days later, on the 18th; armed with my false sense of security, I had neglected to set up a syslog server, so I have no idea what happened. And now we're up to the present day, where I caused a crash when I logged into the server. The syslog server was live this time.

 

Hardware

CPU: Ryzen 5 3600x

Motherboard: Gigabyte X570 AORUS Elite (rev 1.0)

RAM: 2x16GB G.SKILL Ripjaws V Series DDR4 3200

PSU: Corsair RM650i

GPU: Gigabyte Radeon RX 5500 XT 4GB

HDD: 2x Seagate IronWolf 4TB, 2x Seagate IronWolf 3TB

SSD: 2x 1TB WD Blue 3D NAND

 

Steps I've taken to solve this.

I've run MemTest (Passed)

I've switched out the RAM with Ballistix Sport LT 2x16GB DDR4 3200.

Turned off XMP. (Current setting)

I've disabled C-State, and changed Power Supply Idle Control to Typical Current Idle. (Current setting)

I've updated the Motherboard BIOS (multiple times, and will be updating again after I post).

I've tried each of the following Unraid versions: 6.8.3, 6.9.0, 6.9.1, 6.9.2 and now 6.10.0-rc1.

I've changed the thumbdrive from a USB 3.0 Samsung to a USB 2.0 Sandisk. (Currently on 2.0)

I've offloaded my data and completely wiped the system, starting from new, testing out each version with no dockers or apps installed.

Changed Docker from macvlan to ipvlan (6.10.0-rc1).

 

I've included two files. One is from the most recent crash; the other is from when I reverted to 6.8.3 months ago. I believe I induced the 6.8.3 crash with a parity check, as mentioned above. I gave up on troubleshooting this, and the box had been collecting dust until I saw that 6.10.0-rc1 allowed the use of ipvlan in Docker.

 

syslog_server-6.10.0-rc1.log syslog_server-6.8.3.log
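
For anyone skimming captured logs like the two attached above, here is a minimal sketch of a filter that pulls out lines that usually surround a kernel crash. It is only an illustration: the `MARKERS` list and the function name `find_kernel_events` are my own assumptions, not anything Unraid ships, and the list is far from exhaustive.

```python
# Substrings the Linux kernel commonly prints around a crash.
# ASSUMPTION: this list is illustrative only, not a complete catalogue.
MARKERS = (
    "kernel panic",
    "call trace",
    "bug:",
    "oops",
    "general protection fault",
    "machine check",
)


def find_kernel_events(lines):
    """Return (line_number, marker, text) for each line containing a marker.

    Matching is case-insensitive, and only the first matching marker
    is reported for any given line.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        for marker in MARKERS:
            if marker in lowered:
                hits.append((number, marker, line.rstrip()))
                break
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python scan_syslog.py <logfile> [<logfile> ...]
    for path in sys.argv[1:]:
        with open(path, errors="replace") as handle:
            for number, marker, text in find_kernel_events(handle):
                print(f"{path}:{number}: [{marker}] {text}")
```

Running it over logs from several crashes makes it easier to see whether they share a signature or look different every time.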

Link to comment

I updated the BIOS to the newest version, then went back over the settings to make sure C-State was off, Power Supply Idle Control was on Typical Current Idle, and XMP was turned off. I now seem to have even more instability. It crashed once during a parity check, roughly 3.5 hours in, and then a second time with no parity check running. Because the first crash didn't produce any syslog output, I kept my desktop running with htop and the log open. Attached is a screencap of htop; once again, no syslog was created during the crash.

Screenshot from 2021-08-22 19-55-34.png
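
For anyone else trying to pin down exactly when a box like this stops responding, here is a rough sketch of a probe that could be left running on another machine. It is not something from the posts above: the function names, the log file name, and the choice of a TCP connect (rather than an ICMP ping) are all assumptions for illustration.

```python
import socket
import time
from datetime import datetime


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def watch(host, port, interval=30.0, log_path="reachability.log"):
    """Append a timestamped line whenever the host changes state.

    Runs until interrupted. log_path is a hypothetical file name.
    """
    last = None
    while True:
        state = is_reachable(host, port)
        if state != last:
            with open(log_path, "a") as log:
                status = "up" if state else "down"
                log.write(f"{datetime.now().isoformat()} {host}:{port} {status}\n")
            last = state
        time.sleep(interval)
```

Calling something like `watch("192.168.1.10", 22)` (a made-up address, with SSH assumed to be enabled) would record the moment the server went quiet, which can then be lined up against whatever the syslog server did or didn't capture.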

Link to comment
15 hours ago, Bjarndyr said:

I believe I induced the 6.8.3 crash with a parity check as mentioned above.

If it was stable with v6.8.3 and now it's crashing with multiple versions, including v6.8.3, that suggests a hardware problem. It's difficult to say which component, though; you'd need to try with different RAM, board, CPU, etc.

Link to comment

After my last crash, I'm beginning to lean towards hardware as well. I previously swapped the RAM because it's the easiest, and I also took out the HBA and connected all the drives via SATA cables, but this weekend I'm going to swap the CPU, PSU and motherboard one at a time. Given that in my last crash I didn't see cores maxing out and getting hung up on tasks, and that crashes almost always happen during a parity check (extra stress on the PSU), I'm leaning towards it being power related. Previously, with versions before 6.10.0-rc1 and macvlan on Docker, cores would max out one by one before ultimately crashing, even with no parity check. To me this suggests I may have had two concurrent issues, and that 6.10.0-rc1 with Docker set to ipvlan solved one of them.

 

One concern I have is that the 'donor' PC (which is my main rig *cries in geek*) has the exact same PSU and motherboard. So if there is a compatibility issue resulting from the upgraded BIOS and/or changes in the kernel, I will have a difficult time proving it. This seems an unlikely scenario, given that this build was stable at one time, from 6.8.2 through 6.9, yet those same versions weren't stable when I reverted to them.

 

That said, as a sanity check, if anyone reading this has the same Motherboard (Gigabyte X570 AORUS Elite (rev 1.0)) on the latest BIOS, and/or PSU (Corsair RM650i), please let me know if they're working for you on any version >= 6.8.3.

Edited by Bjarndyr
Typo corrections and clarifications
Link to comment

I swapped the power supply out, and I'm not sure what to make of the results.

 

On the first boot, I had a crash while running a parity check. But the system was running super slow, and some errors were being thrown because the BIOS had for some reason reverted to defaults, so virtualization was turned off. I set the BIOS settings again, rebooted, and then started a parity check.

 

That said, this morning there was a similar Kernel panic that would've crashed the system previously... but hasn't. However, my parity check, that should be nearly done, is saying it'll take another 7 hours to complete.

 

I've got some new cases coming in soon, so I'm going to swap my other motherboard and CPU in next, to see if some of this goes away. Given that the system didn't crash when it normally would have, but I'm still getting some anomalies, I think I may be onto something with power being a potential culprit. Perhaps something has been degrading on either the motherboard or the PSU, and the one started affecting the other, but with a different PSU in, it's picking up the slack.

 

Syslog attached.

unraid-server-syslog-20210825-1540.zip

Link to comment
  • 5 weeks later...

It looks like it may have been a bad processor. I swapped in my 3900x for a week and had zero issues. So I RMA'd the 3600x, which I got back and installed on Monday. It's been running without issue so far. BIOS is back at default, except I turned on XMP.

 

Through my research, I found that the 3600 series has been known to have issues with the IMC (Integrated Memory Controller). If that was the culprit, it explains why the kernel panics I had were never the same.

 

If I run into any more issues, I will update the thread.

 

 

UPDATE 10/5/21: Just shy of 2 weeks in, and it's rock solid. Had a parity check run, and didn't even notice. So yeah, I'm VERY confident it was the CPU.

Edited by Bjarndyr
Update on status of issue.
Link to comment
