Random Freezes/Crashes

KLAK · March 19, 2020

A few days ago I moved to the latest stable build and things started to go downhill from there. First up, my specs:

MOBO: Asus ROG Strix x99 Gaming

RAM: G.Skill 64GB DDR4

CPU: Xeon e5 2673-v4

PSU: 1050W Thermaltake platinum plus

Water cooled

GPU: EVGA GTX 760 sc 4gb x2

PCI Cards: Fresco Logic 5 port USB cards x2

HDD: 6 WD Red 8TB, 2 WD Red 10TB (parity)

I have been experiencing random freezes and crashes for no apparent reason. The system locks up and I lose all VMs, web access, VPN access, and Docker containers, as if it were dead. I have the syslog mirrored to flash, which is posted below.

Things I have tried:

Updated BIOS to newest along with new Unraid

Downgraded BIOS to 1903 with new Unraid

Downgraded BIOS to 1903 and last stable build of Unraid

BIOS to 1801 and last stable build of unraid

I have tried all I can think of but am still having issues. Please let me know what you can see or think of for me to try. This setup ran for over a year with no issues across the different Unraid versions.

syslog klak-diagnostics-20200315-1855.zip

JorgeB · March 19, 2020

Downgrade back to the last Unraid release that was stable for you, to confirm whether or not it is upgrade related.

KLAK · March 19, 2020

Already have. Currently on 6.8.2 and the old BIOS.

Edited March 19, 2020 by KLAK

JorgeB · March 19, 2020

Then it's likely a hardware problem.

KLAK · March 19, 2020

Anything in the logs pointing to this?

JorgeB · March 19, 2020

There are multiple call traces, and if it still crashes with the same Unraid release it was working with before, then they are likely hardware related. Start by running memtest.
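For anyone wanting to check their own mirrored syslog for the same kind of evidence, here is a minimal Python sketch that counts lines containing kernel call traces and a few related markers. The pattern list is an assumption about what is worth flagging, not an exhaustive set, and the helper names `scan_syslog` and `scan_file` are made up for this example.

```python
import re
from collections import Counter

# Assumed markers of kernel trouble; extend or trim to suit your own logs.
PATTERNS = {
    "call_trace": re.compile(r"Call Trace:"),
    "machine_check": re.compile(r"mce: |Machine [Cc]heck"),
    "oops_or_bug": re.compile(r"kernel BUG|Oops:|general protection fault"),
}


def scan_syslog(lines):
    """Return (counts per label, list of (line number, label, text)) for matching lines."""
    counts = Counter()
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[label] += 1
                hits.append((lineno, label, line.rstrip()))
    return counts, hits


def scan_file(path):
    """Scan a syslog file on disk; undecodable bytes are replaced rather than raising."""
    with open(path, errors="replace") as handle:
        return scan_syslog(handle)
```

Calling `scan_file` on the syslog copy from the flash drive and printing the hits gives a quick list of line numbers to read in context. A non-zero count only says where to look; it does not by itself tell hardware faults from software ones.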

KLAK · March 19, 2020

Fired it up tonight with the old BIOS and old Unraid; going to let it run through the night to see what it does. How long should I let memtest run for?

JorgeB · March 19, 2020

Ideally 24H, but usually a few minutes/hours are enough for any serious problem to be found.

AgentXXL · April 5, 2020

@KLAK Just curious to know how you've progressed with this issue? I'm also experiencing random 'temporary' freezes on my media unRAID system. They last anywhere from 5 - 45 seconds, during which the unRAID webgui becomes largely unresponsive and Docker containers like Plex freeze. Any content being played usually stops until the system resumes normal operation.

I'm using a Supermicro X8DTN+ motherboard (came with the CSE-847 36 bay enclosure) with dual x5650 Xeons and 32GB of RAM. The system originally had 16GB of RAM but about 2 weeks ago I started getting 'Machine Check Events' (MCE) reported by the Fix Common Problems plugin. As the 16GB was 2 mis-matched DIMMs (different manufacturers) I replaced it with 32GB that's all from the same batch and manufacturer.

I haven't seen an MCE since replacing the RAM but I am still seeing random freezes in unRAID. When this happens the dashboard shows many of the cores (physical and hyper-threaded) maxed at 100%, as shown in the picture below. I try to keep an instance of TOP running to try and determine what leads to these lock-ups, but as you can see from the image, nothing appears to be utilizing all those CPU resources.

I've gone through my system log and at the time of the freezes there are no indications of any issues. I've done a reboot and am monitoring the system log for errors, but so far I have no indication of what's causing these random freezes. I am also running 6.8.3 and the only things that have changed on my system are the plugin and Docker updates for CA, UD, etc.

Note that I saw similar random freezes when running unRAID on an i7-6700K system (4 core, 8 thread) but I assumed that was due to the lack of cores/threads in the CPU. Now that I have 12 cores/24 threads, I wasn't expecting to see these lockups. The most annoying part is that neither TOP nor the syslog give me any clues as to the cause. Hopefully you've made some progress diagnosing the issue and can share your solution if you have one.
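One possibility, not confirmed anywhere in this thread, is that the cores are waiting on I/O rather than running a process, which would not show up as any single busy entry in TOP's process list. A minimal Python sketch for checking this takes two samples of the aggregate `cpu` line in `/proc/stat` and reports how the elapsed time divides between user, system, idle, iowait and the other states. The function names and the 5 second default interval are assumptions for illustration.

```python
import time

# Order of the first eight counters on the "cpu" lines of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_total(stat_text):
    """Extract the aggregate 'cpu' counters from the text of /proc/stat."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def breakdown(before, after):
    """Percentage of elapsed CPU time spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def sample_proc_stat(interval_s=5.0, path="/proc/stat"):
    """Read /proc/stat twice, interval_s seconds apart, and return the breakdown."""
    with open(path) as handle:
        before = parse_cpu_total(handle.read())
    time.sleep(interval_s)
    with open(path) as handle:
        after = parse_cpu_total(handle.read())
    return breakdown(before, after)
```

Running `sample_proc_stat()` while a freeze is in progress and seeing a large `iowait` share would point toward storage or the HBA rather than a runaway process; a large `user` or `system` share would point back at software. The `wa` figure on TOP's summary line reports the same counter.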

Dale

KLAK · April 17, 2020

@AgentXXL Sorry for the late reply; it's been nuts at work, so I have not been on here much. This is what I did:

Pulled all ram and ran memtest one by one, all checked ok

Installed one stick at a time and ran memtest, if I got errors I stopped and swapped in another stick, did this until all 8 were ok and stable

Still had issues, so I turned to my GPUs, which I suspected were dying. I pulled them, tried them on a bare-metal Windows machine, and found a bad one

Pulled the bad one and fired server back up, all dockers and vms running... went for 4 days pushing it and no issues.

Long story short I swapped the GTX 760s out for Strix GTX970s and all is well in the neighborhood again.

May not be your issue, but it seems to have fixed mine, since it has been running for some time now and all errors have gone away.

Now on latest Unraid and newest BIOS as well.

Current Specs:

MOBO: Asus ROG Strix x99 Gaming

RAM: G.Skill 64GB DDR4

CPU: Xeon e5 2673-v4 20 core/40 thread

PSU: 1050W Thermaltake platinum plus

Thermaltake Floe Ring 360 AIO

GPU: ASUS Strix GTX 970 4gb x2

PCI Cards: Fresco Logic 5 port USB cards x2

HDD: 6 WD Red 8TB, 2 WD Red 10TB (parity)

Edited April 17, 2020 by KLAK

AgentXXL · April 17, 2020

1 hour ago, KLAK said:

@AgentXXL Sorry for the late reply been nuts at work so I have not been on here much. This is what I did:

Pulled all ram and ran memtest one by one, all checked ok

.

.

Current Specs:

MOBO: Asus ROG Strix x99 Gaming

RAM: G.Skill 64GB DDR4

CPU: Xeon e5 2673-v4 20 core/40 thread

PSU: 1050W Thermaltake platinum plus

Thermaltake Floe Ring 360 AIO

GPU: ASUS Strix GTX 970 4gb x2

PCI Cards: Fresco Logic 5 port USB cards x2

HDD: 6 WD Red 8TB, 2 WD Red 10TB (parity)

Thanks for the reply. I no longer use a GPU in my unRAID setup (other than the onboard graphics from the Supermicro x8DTN+) so that's likely not my cause. As I've replaced the RAM and my 'machine check events' went away, I'm now down to it being a problem with the CPUs, the motherboard/chipset or the LSI2008 HBA.

I just had another lockup about 20 minutes ago and it lasted for about 2 minutes. Still reviewing the syslog but so far nothing is obvious. And my background TOP window also showed no specific process(es) that were loading down the CPU cores/threads. It's definitely puzzling. I'm planning/budgeting to replace the Supermicro x8DTN+ with a new Threadripper based system eventually.

As the motherboard space in the Supermicro CSE-847 will only accept low-profile expansion cards, I'm likely going to replace it with an IPMI JBOD adapter and convert it to a DAS. Then my new Threadripper setup will be in a decent separate enclosure that will allow full height expansion cards. Obviously I'll have to get a new HBA with external ports or at least some internal to external mini-SAS adapter plates and the external mini-SAS cables.

I'll report back if I happen to come across a solution before I make the changes/upgrade.
