clowncracker Posted March 22, 2023 (edited April 1, 2023)
I initially installed an M.2 Google Coral and the Coral Accelerator Module Driver plugin, but my server started crashing. Assuming that this was the issue, I removed the Google Coral and uninstalled the plugin. My server is still crashing and I have no idea why. I'm hoping someone could look at the diagnostics and let me know. Fix Common Problems states that there is a hardware issue: "Your server has detected hardware errors. You should install mcelog via the NerdPack plugin, post your diagnostics and ask for assistance on the Unraid forums" (which is what I did).
JorgeB Posted March 22, 2023
5 hours ago, clowncracker said: "Your server has detected hardware errors."
This usually suggests just that, a hardware problem. Start by running memtest.
clowncracker (Author) Posted March 22, 2023 (edited March 22, 2023)
12 hours ago, JorgeB said: "This usually suggests just that, a hardware problem, start by running memtest."
Memtest has completed four passes with no issues. The weird thing is that it isn't an instant crash. The server is fine for 3ish hours and then the UI just stops working. I cannot access it from the webpage and I have to manually restart the computer.
clowncracker (Author) Posted March 22, 2023 (edited April 1, 2023)
Just crashed again during a parity check, after being online for about 3 hours and 45 minutes. I needed to manually restart my server to get it to be responsive again. I have a notification popup that says "Parity check finished (0 errors)" with a duration of over 19 hours, even though the server was online for less than four hours. I'd like to note that Fix Common Problems (and the syslog) no longer indicate that this is a hardware issue. I've attached the syslog.
clowncracker (Author) Posted March 23, 2023
19 hours ago, JorgeB said: "This usually suggests just that, a hardware problem, start by running memtest."
Another update: I've had it running in safe mode with all VMs and Dockers disabled for 7 hours with no crashes yet.
MrGrey Posted March 23, 2023
Is the RAM ECC? Clutching at straws...
13 hours ago, clowncracker said: "The server is fine for 3ish hours and then the UI just stops working."
Again, clutching at straws... Does the server keep working while the UI stops working? Hope it helps. MGrey.
clowncracker (Author) Posted March 23, 2023
8 minutes ago, MrGrey said: "Is the RAM ECC?... Clutching at straws. Again, clutching at straws... Does the server keep working and/while the UI stops working? Hope it helps. MGrey."
All of the VMs and Dockers stop working. I think it just crashes but the computer doesn't turn off. Not ECC RAM. Considering it's been working for about 8 hours at this point in safe mode with no Dockers/VMs running, I'm fairly certain the hardware error was a false flag. This all started when I installed the M.2 Google Coral and the driver plugin, so I think the driver plugin messed something up. Even after I uninstalled the plugin and removed the Google Coral, the issue persisted.
Anon Posted March 23, 2023
Seems you've got a sort of working stable mode now. This means you can experiment to see what exactly causes the error. It's going to be a lot of effort, as you have to wait many hours, but you can at least start activating things again bit by bit and see how the server reacts. Otherwise: maybe try limiting every Docker and VM to just one CPU core via pinning and check again. Maybe it's just one Docker going berserk and taking up 100% CPU on all cores, so that nothing else can work anymore?
clowncracker (Author) Posted March 24, 2023 (edited April 1, 2023)
Is there nothing in the system log or diagnostics that might help determine the cause? @JorgeB any chance you can look at the latest diagnostics and system log I attached? Nothing has changed in my config at this point and I have no idea how to diagnose this issue.
JorgeB Posted March 24, 2023
There are call traces and segfaults logged, but those by themselves don't point to a culprit, they just suggest a hardware problem. RAM and/or board would be my main suspects.
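For anyone following along with their own logs, the kind of entries mentioned above can be pulled out of a saved syslog with a short script. This is only a minimal sketch: the search strings ("Call Trace", "segfault", and the machine-check markers) are assumptions about how a typical Linux kernel words these messages, and the function and variable names are made up for illustration. As noted in the post, finding such lines does not identify a culprit; it only shows when they occur.

```python
import re

# Assumed marker strings; adjust them to whatever your own syslog contains.
SUSPECT_PATTERNS = {
    "call_trace": re.compile(r"Call Trace", re.IGNORECASE),
    "segfault": re.compile(r"segfault", re.IGNORECASE),
    "machine_check": re.compile(r"\bmce:|Machine Check|Hardware Error", re.IGNORECASE),
}


def find_suspect_lines(lines):
    """Return (line_number, label, text) for every line matching a pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in SUSPECT_PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label, line.rstrip("\n")))
                break  # one label per line is enough
    return hits


# Example use on a saved log (the file name is hypothetical):
#   with open("syslog.txt", errors="replace") as handle:
#       for number, label, text in find_suspect_lines(handle):
#           print(number, label, text)
```

Comparing the timestamps of the hits with the time the server became unresponsive can help decide whether they are related to the hang at all.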
clowncracker (Author) Posted March 24, 2023
5 hours ago, JorgeB said: "There are call traces and segfaults logged, but those by themselves don't point to a culprit, just suggest a hardware problem, RAM and/or board would be my main suspects."
I believe the server crashes when the CPU gets near 100% utilization. If memtest didn't give me any errors, do you think that means it's the motherboard?
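One way to check the "crashes near 100% CPU" suspicion is to log CPU utilization to a file that survives a reboot and look at the last entries after the next hang. The following is a rough sketch, not something from the thread: it assumes a Linux system exposing `/proc/stat`, and the function names and the log path passed to `log_cpu` are hypothetical.

```python
import os
import time


def parse_cpu_line(line):
    """Return (idle, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    # Fields after the label: user nice system idle iowait irq softirq steal
    fields = [int(value) for value in line.split()[1:9]]
    idle = fields[3] + fields[4]  # count iowait as idle time
    return idle, sum(fields)


def busy_percent(prev, curr):
    """CPU busy percentage between two (idle, total) samples."""
    idle_delta = curr[0] - prev[0]
    total_delta = curr[1] - prev[1]
    if total_delta <= 0:
        return 0.0
    return 100.0 * (total_delta - idle_delta) / total_delta


def read_sample():
    """Read the current aggregate CPU counters."""
    with open("/proc/stat") as handle:
        return parse_cpu_line(handle.readline())


def log_cpu(path, interval=5.0, samples=None):
    """Append timestamped CPU usage to path, syncing after every line."""
    prev = read_sample()
    count = 0
    with open(path, "a") as log:
        while samples is None or count < samples:
            time.sleep(interval)
            curr = read_sample()
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log.write(f"{stamp} cpu_busy={busy_percent(prev, curr):.1f}%\n")
            log.flush()
            os.fsync(log.fileno())  # push to disk so a hard hang loses little
            prev = curr
            count += 1


# Example (hypothetical path on persistent storage): log_cpu("/path/to/cpu.log")
```

If the last lines before a hang show the CPU pinned near 100%, that supports the theory; if utilization was low, the load is probably not the trigger.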
JorgeB Posted March 24, 2023
Could be, can never say for sure.
clowncracker (Author, Solution) Posted April 1, 2023
The issue ended up being the motherboard.