Need help solving random crashing.

Magma · May 4, 2018

I started using Unraid 2 weeks ago and am currently on version 6.5.1. I have had random crashes the entire time, sometimes multiple times a day. The longest uptime was just over 3 days. I'm only running Dockers for Plex, Sonarr, Radarr, sabnzbd, and nzbHydra2.

I'm running it on an i7 970 with an EVGA X58 FTW3 motherboard with the latest BIOS. The temperatures were somewhat high, so yesterday I replaced the stock heat sink with something better. This greatly improved the CPU temps, but the crashing appeared to increase in frequency. Possibly just a coincidence.

Today I went home during lunch and saw that it had crashed again. There is never any error message on the screen. This time, however, when I went to reboot it, it hangs on loading bzroot. I left it running memtest to see if it detects any errors.

I have installed the Fix Common Problems plugin and have attached the latest logs generated from troubleshooting mode.

If anyone can provide any insight it would be greatly appreciated. I'm at a total loss currently.

FCPsyslog_tail.txt

tower-diagnostics-20180504-1120.zip

Edited May 9, 2018 by Magma
Added more details.

Magma · May 4, 2018

Got home. Memtest had completed 2 passes with 0 errors. Reset BIOS settings to default and managed to boot this time.

Going to be running a memtest overnight later.

Magma · May 7, 2018

Unfortunately I'm still at a loss. Memtest overnight provided 0 errors.

It lasted all day but crashed somewhere between 2:23 and 2:53 in the morning. I have attached the syslog and diagnostics again.

Not exactly sure what else to be looking for, honestly. One of my drives displayed read errors during a parity check. Nothing is stored on that drive at present, however, and it is typically spun down. Could that be causing the crashes?

tower-diagnostics-20180507-0223.zip

FCPsyslog_tail.txt

JorgeB · May 7, 2018

Disk2 needs to be replaced, and yes, it may cause the server to crash, or more likely to become unresponsive and appear to have crashed.

Magma · May 9, 2018

Another update. I have removed the dying hard drive and have replaced the RAM, which has passed overnight memtests.

The crashing continues. It appears to be tied to high CPU usage. It happened when sabnzbd was unpacking a 30 GB download and it happens nightly when Plex does its scheduled maintenance.

I rebooted today and started troubleshooting mode. I then started the scheduled maintenance and after about 30 minutes it all locked up. Temperatures were only at 38 °C.

I have attached the latest diagnostics and syslog tail from this crash.

I have a new 600 W PSU with a single 50 A +12 V rail on the way.

Any other suggestions on what could possibly be the issue or what I could try in the meantime?

FCPsyslog_tail.txt

tower-diagnostics-20180509-1148.zip

JorgeB · May 9, 2018

Try running in safe mode for a while, and with dockers/VMs stopped.

Magma · May 9, 2018

Ok I will give that a shot. How long should I run it in safe mode?

What exactly will this determine though? The server would essentially be idling with all of the drives spun down.

John_M · May 9, 2018

15 minutes ago, Magma said:

What exactly will this determine though? The server would essentially be idling with all of the drives spun down.

You're looking for stability. If it's stable in this basic operating mode you can start re-enabling things (just one at a time, ideally) to see if any of them cause it to fall over. It's much easier to find the culprit this way.
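As a rough illustration of that one-at-a-time approach, here is a minimal Python sketch. It is not an Unraid tool or API: `find_first_unstable` and the `is_stable_with` callback are hypothetical names, and in practice the "check" would simply be you running the server with that set of things enabled for long enough to trust the result.

```python
from typing import Callable, Iterable, List, Optional


def find_first_unstable(
    services: Iterable[str],
    is_stable_with: Callable[[List[str]], bool],
) -> Optional[str]:
    """Re-enable services one at a time and return the first one whose
    addition makes the system unstable, or None if every step is stable.

    is_stable_with is a hypothetical callback: it receives the list of
    currently enabled services and reports whether the server stayed up.
    """
    enabled: List[str] = []

    # Baseline: nothing extra enabled (safe mode, dockers/VMs stopped).
    if not is_stable_with(list(enabled)):
        raise RuntimeError(
            "unstable with nothing enabled; look at hardware or the base system"
        )

    for name in services:
        enabled.append(name)
        # Pass a copy so the callback cannot modify our running list.
        if not is_stable_with(list(enabled)):
            return name
    return None


if __name__ == "__main__":
    # Simulated check for demonstration only: pretend Plex is the culprit.
    dockers = ["Sonarr", "Radarr", "sabnzbd", "Plex", "nzbHydra2"]
    print(find_first_unstable(dockers, lambda on: "Plex" not in on))
```

Because the crashes here are intermittent, each step needs to run noticeably longer than the longest uptime seen so far before it counts as stable.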

JorgeB · May 9, 2018

21 minutes ago, Magma said:

The server would essentially be idling with all of the drives spun down.

You can still use it as a basic NAS.

Magma · May 29, 2018

I am at a complete loss so here's an update.

I disabled all of my dockers and enabled them one at a time. Crashes were still happening though not as often.

The event that most closely coincided with the crashes was Plex's overnight maintenance, although it wasn't uncommon for there to be no crash. I have also had it crash in the middle of the day with no real active usage.

My longest uptime was 2.5 days. I have replaced the power supply and RAM. It has passed overnight memtest and Prime95 tests.

I have tried using only 1 stick of RAM at a time and swapping them out.

My only guess is that I need to replace the motherboard, which I would prefer to avoid because at that point I should probably just replace the RAM and CPU too.

I have attached the most recent syslog and diagnostics from the last crash.

FCPsyslog_tail.txt

tower-diagnostics-20180528-1112.zip

BobPhoenix · June 2, 2018

Two suggestions for you - first is most likely and I had nasty problems with it but I've had problems with the second before as well.

1. Turn off in BIOS if on MB, or remove, the Marvell 88SE9123 controller card. I had my Marvell 9230 controller passed through to a VM and got dropped drives and had to reboot the server to get them back. It didn't cause unRAID crashes for me, but if you are using array drives on yours this is likely your problem. I turned my MB 9230 off in the BIOS so I wouldn't be tempted to use it and haven't had any problems with that server any more.
2. Turn off in BIOS if on MB, or remove, the NEC USB 3.0 controller card. Since I stopped using my Fresco USB 3.0 card in another server it has been up for 15 days, and I believe it would have been longer but I had to reboot for an unRAID upgrade. Before that I got random crashes that I couldn't figure out with it installed.

I really think the 1st one is the most likely cause, but doing either or both of the above is where I would start to troubleshoot since you have already tried some other hardware changes.
