Server keeps crashing

clowncracker · March 22, 2023

I initially installed an M.2 Google Coral and the Coral Accelerator Module Driver plugin, but my server started crashing. Assuming that this was the issue, I removed the Google Coral and uninstalled the plugin. My server is still crashing and I have no idea why. I'm hoping someone could look at the diagnostics and let me know.

Fix Common Problems reports a hardware issue:

Your server has detected hardware errors. You should install mcelog via the NerdPack plugin, post your diagnostics and ask for assistance on the Unraid forums (which is what I did).

Edited April 1, 2023 by clowncracker

JorgeB · March 22, 2023

5 hours ago, clowncracker said:

Your server has detected hardware errors.

This usually suggests just that, a hardware problem, start by running memtest.

clowncracker · March 22, 2023

12 hours ago, JorgeB said:

This usually suggests just that, a hardware problem, start by running memtest.

Memtest has completed four passes with no issues.

The weird thing is that it isn't an instant crash. The server is fine for 3ish hours and then the UI just stops working. I cannot access it from the webpage and I have to manually restart the computer.

Edited March 22, 2023 by clowncracker

clowncracker · March 22, 2023

Just crashed again during a parity check, after being online for about 3 hours and 45 minutes. I needed to manually restart my server to get it to be responsive again.

I have a notification popup that says Parity check finished (0 errors) with a duration of over 19 hours, even though the server was online for less than four hours.

I'd like to note that Fix Common Problems (and the syslog) no longer indicate that this is a hardware issue. I've attached the syslog.

Edited April 1, 2023 by clowncracker

clowncracker · March 23, 2023

19 hours ago, JorgeB said:

This usually suggests just that, a hardware problem, start by running memtest.

Another update, I've had it running in safe mode with all VMs and Dockers disabled for 7 hours with no crashes yet.

MrGrey · March 23, 2023

Is the RAM ECC?... Clutching at straws

13 hours ago, clowncracker said:

The server is fine for 3ish hours and then the UI just stops working.

Again, clutching at straws... Does the server keep working and/while the UI stops working?

Hope it helps.

MGrey.

clowncracker · March 23, 2023

8 minutes ago, MrGrey said:

Is the RAM ECC?... Clutching at straws

Again, clutching at straws... Does the server keep working and/while the UI stops working?

Hope it helps.

MGrey.

All of the VMs and Dockers stop working. I think it just crashes, but the computer doesn't turn off.

Not ECC RAM. Considering it's been working for about 8 hours at this point in safe mode with no Dockers/VMs running, I'm fairly certain the hardware error was a false alarm.

This all started when I installed the M.2 Google Coral and the driver plugin, so I think the driver plugin messed something up. Even after I uninstalled the plugin and removed the Google Coral, the issue persisted.

Anon · March 23, 2023

Seems you've got a sort-of stable working mode now. This means you can experiment to see what exactly causes the error. It's going to be a lot of effort, as you have to wait many hours each time, but you can at least start re-enabling things bit by bit and see how the server reacts.

Otherwise: maybe try limiting every Docker and VM to just one CPU core via pinning and check again. Maybe it's just one Docker going berserk and taking up 100% CPU on all cores, causing nothing else to work anymore?
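
The bit-by-bit approach can be shortened by re-enabling half of the remaining candidates at a time. What follows is only a rough sketch of that idea, not something from the thread: `find_culprit` and `crashes_with` are hypothetical names, and it assumes exactly one Docker or VM is responsible and that the crash reproduces reliably. In practice each `crashes_with` call stands for "enable only these, wait several hours, and see whether the server hangs".

```python
from typing import Callable, List, Sequence


def find_culprit(
    services: Sequence[str],
    crashes_with: Callable[[List[str]], bool],
) -> str:
    """Narrow down a single misbehaving service by halving the candidates.

    Assumption: exactly one service causes the crash and the crash is
    reproducible. If neither holds, the result is meaningless.
    """
    if not services:
        raise ValueError("need at least one service to test")

    candidates = list(services)
    while len(candidates) > 1:
        # Enable only the first half and observe the server.
        half = candidates[: len(candidates) // 2]
        if crashes_with(half):
            candidates = half
        else:
            # The culprit must be in the half that stayed disabled.
            candidates = candidates[len(half):]
    return candidates[0]
```

With five candidates this needs at most three test runs instead of five. If the server still crashes with everything disabled, the cause lies elsewhere and this procedure does not apply.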

clowncracker · March 24, 2023

Is there nothing in the system log or diagnostics that might help determine the cause?

@JorgeB any chance you can look at the latest diagnostic and system log I attached? Nothing has changed in my config at this point and I have no idea how to diagnose this issue.

Edited April 1, 2023 by clowncracker

JorgeB · March 24, 2023

There are call traces and segfaults logged, but those by themselves don't point to a culprit; they just suggest a hardware problem. RAM and/or board would be my main suspects.
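
For anyone who wants to check their own syslog for the same kind of entries, here is a minimal sketch that counts lines mentioning kernel call traces, segfaults, and hardware error reports. The pattern list and the function name `count_events` are assumptions made for illustration; the counts only show that such events were logged, not which component is at fault.

```python
import re
from collections import Counter
from typing import Iterable

# Substrings the Linux kernel commonly writes for these events.
# The selection is an assumption for illustration, not an exhaustive list.
PATTERNS = {
    "call_trace": re.compile(r"Call Trace", re.IGNORECASE),
    "segfault": re.compile(r"segfault", re.IGNORECASE),
    "hardware_error": re.compile(r"Machine check|\[Hardware Error\]", re.IGNORECASE),
}


def count_events(lines: Iterable[str]) -> Counter:
    """Count how many lines match each pattern (one count per line per pattern)."""
    counts: Counter = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

To use it, pass an open file, for example `count_events(open("syslog.txt", errors="replace"))`, where `syslog.txt` is a placeholder for wherever the saved syslog lives.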

clowncracker · March 24, 2023

5 hours ago, JorgeB said:

There are call traces and segfaults logged, but those by themselves don't point to a culprit; they just suggest a hardware problem. RAM and/or board would be my main suspects.

I believe the server crashes when the CPU gets near 100% utilization. If memtest didn't give me any errors, do you think that means it's the motherboard?
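
One way to test the CPU-load suspicion is to log utilization to disk so that the last reading before a hang survives the reboot. The sketch below is only an illustration, assuming a Linux host: it reads the aggregate `cpu` line from `/proc/stat`, and the function names and the `log_path` argument are hypothetical. The log should go somewhere that persists across a restart.

```python
import os
import time
from typing import Tuple


def parse_cpu_line(line: str) -> Tuple[int, int]:
    """Return (total_ticks, idle_ticks) from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # user, nice, system, idle, iowait, irq, softirq, steal
    values = [int(v) for v in fields[1:9]]
    idle = values[3] + (values[4] if len(values) > 4 else 0)
    return sum(values), idle


def busy_percent(prev_line: str, curr_line: str) -> float:
    """CPU utilization in percent between two samples of the 'cpu' line."""
    total0, idle0 = parse_cpu_line(prev_line)
    total1, idle1 = parse_cpu_line(curr_line)
    elapsed = total1 - total0
    if elapsed <= 0:
        return 0.0
    return 100.0 * (1.0 - (idle1 - idle0) / elapsed)


def read_cpu_line() -> str:
    with open("/proc/stat") as stat:
        return stat.readline()


def log_cpu(log_path: str, interval_s: float = 5.0) -> None:
    """Append a timestamped utilization reading every interval_s seconds."""
    prev = read_cpu_line()
    with open(log_path, "a") as log:
        while True:
            time.sleep(interval_s)
            curr = read_cpu_line()
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"{stamp} {busy_percent(prev, curr):.1f}\n")
            # Force the line to disk so it is not lost if the server hangs.
            log.flush()
            os.fsync(log.fileno())
            prev = curr
```

If the final entries before each hang consistently sit near 100, that supports the theory; if the server hangs at low load too, it points away from it.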

JorgeB · March 24, 2023

Could be, can never say for sure.

clowncracker · April 1, 2023

The issue ended up being the motherboard.
