Random Server Crashes, no idea how/what to troubleshoot

So, for a while now I have been experiencing random server crashes.

Sometimes the server runs fine for a week or two, sometimes only for a day.

Most of the crashes happen at night, when I'm not actively using the server but background tasks are running.

When I say crash, I mean the server is completely unreachable: no WebUI, no SSH.

I recently swapped most of the hardware (MoBo, CPU, RAM), and the crashes happen on both sets of hardware, so I suspect it's either software-related or a problem with the drives.

On the other hardware I had VGA output, and the last messages on screen were about a kernel panic. I don't have the exact error, though, because I've switched back to my old setup, which has no working VGA output.

I have gathered:

- Diagnostics taken after I restarted the server

- SMART logs for the three drives that show warnings (parity + 2 array drives)

- LOG.txt: the last few hours of the syslog saved on the USB flash drive BEFORE the restart

I have enabled the Syslog Server, but I'm not sure where the syslog is saved besides the flash drive, since Settings -> Syslog -> Location just says <custom>.

I have also enabled Mover logging, since the drives and the software are the only things the two setups have in common.
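
When the syslog is persisted (mirrored to flash or sent to a syslog server), the most useful lines are usually the last few written before each reboot. Below is a minimal Python sketch for pulling those out. It assumes the log file survives reboots and that each new boot can be recognised by the kernel's "Linux version" banner line. That marker is an assumption, and a few logger start-up lines may appear just before it. The function name is hypothetical.

```python
from collections import deque
from typing import Iterable, List

# Assumption: every new boot writes a kernel line containing "Linux version",
# so this string marks where one boot's log ends and the next one begins.
BOOT_MARKER = "Linux version"


def last_lines_before_reboots(lines: Iterable[str], context: int = 5) -> List[List[str]]:
    """For each reboot found in a persisted syslog, return the `context`
    lines logged just before it (the last messages before the crash)."""
    recent: deque = deque(maxlen=context)  # rolling window of the latest lines
    blocks: List[List[str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if BOOT_MARKER in line and recent:
            # A new boot started: snapshot what was logged right before it.
            blocks.append(list(recent))
            recent.clear()
        recent.append(line)
    return blocks
```

Usage would be something like `with open("path/to/syslog", errors="replace") as f: print(last_lines_before_reboots(f))`, where the path is a placeholder for wherever the mirrored log ends up. An empty result for a given crash matches the situation described in this thread, where nothing relevant gets logged before the server dies.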

What I have done so far:

A while ago I started a Reddit thread, but I wasn't at home, so I couldn't follow through with all the suggestions in real time:

- Switched the Docker network type from macvlan to ipvlan

- Kept all my Docker containers updated (at least once a week)

My current hardware:

M/B: Supermicro X8DT3 Version 2.0 - s/n: OM15S33389

BIOS: American Megatrends Inc. Version 2.1. Dated: 03/17/2012

CPU: Intel® Xeon® CPU E5640 @ 2.67GHz

HVM: Enabled

IOMMU: Enabled

Cache: 256 KiB, 1 MB, 12 MB, 256 KiB, 1 MB, 12 MB

Memory: 80 GiB DDR3 Multi-bit ECC (max. installable capacity 384 GiB)

Network: bond0: fault-tolerance (active-backup), mtu 1500

Kernel: Linux 5.19.14-Unraid x86_64

OpenSSL: 1.1.1q

UNRAID: 6.11.1 2022-10-06

I hope I've gathered the right information to get some help :) Feel free to ask for more.

Attachments: LOG.txt, tower-diagnostics-20230502-1138.zip, tower-smart-20230502-1129.zip, tower-smart-20230502-1132.zip, tower-smart-20230502-1135.zip

Edited by Random.Name

Unfortunately there's nothing relevant logged, which usually points to a hardware issue. One thing you can try is to boot the server in safe mode with Docker and VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.
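
The elimination approach in the reply above comes down to some simple bookkeeping. Once plain safe mode has proven stable, re-enable one service (or group of services) at a time and note the date. The first crash then points at whatever was enabled most recently before it. Here is a small Python sketch of that logic. It assumes you keep a list of (service, date enabled) pairs. The function name and the service names in the example schedule are hypothetical.

```python
from datetime import date
from typing import List, Optional, Tuple


def suspect_service(enabled: List[Tuple[str, date]],
                    first_crash: Optional[date]) -> Optional[str]:
    """Return the service re-enabled most recently on or before the first crash.

    `enabled` holds (service name, date it was turned back on) pairs.
    Returns None when there has been no crash yet, or when the crash happened
    before any service was re-enabled (plain safe mode, which per the advice
    above points towards hardware).
    """
    if first_crash is None:
        return None
    suspect: Optional[str] = None
    # Walk the services in the order they were enabled (the stable sort keeps
    # the original order for services enabled on the same day).
    for name, when in sorted(enabled, key=lambda item: item[1]):
        if when > first_crash:
            break
        suspect = name
    return suspect


# Hypothetical schedule: one step per week after a stable safe-mode run.
schedule = [
    ("plugins", date(2023, 5, 9)),
    ("docker-container-A", date(2023, 5, 16)),
    ("VMs", date(2023, 5, 23)),
]
print(suspect_service(schedule, date(2023, 5, 20)))  # docker-container-A
```

Since the server sometimes runs for a week or two between crashes, each step probably needs to run at least that long before it can be counted as clean. Otherwise a crash caused by an earlier service could be blamed on a later one.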

1 hour ago, JorgeB said:

It's unlikely that a problem with one of the drives would make Unraid crash without anything being logged; the PSU would be a possibility.

Oh... OK... how would I go about checking that? Just buy a new PSU and hope that fixes it?


I have it running in safe mode now and will keep it that way for a week or so. But how do I go about turning the services on one by one? Can I do this from safe mode, or what would be the ideal way?
And by services, do you mean plugins/Docker/VMs as a whole, or individual plugins and Docker containers? Just trying to make sure I get everything right ;)
