My server has crashed 3 times in the last month and I don't know how to find the cause

BlueSialia · July 31, 2022

So basically the title. Three mornings now I've woken up to find my server down. The computer is powered on, but the Web GUI is unreachable and my Plex is down too. I can't even ping the machine.

I've restarted the machine and downloaded the diagnostics but the logs contain only entries since the last boot. So basically only a few minutes. As far as I can see I won't find any error log of the crash there. What can I do?

unblue-diagnostics-20220731-2137.zip

BlueSialia · July 31, 2022

After noticing the word "diagnostics" turned into a hyperlink, I discovered the persistent logs feature of Unraid. Neat trick of the forums!

Hopefully that will tell me more next time it happens.

BlueSialia · August 1, 2022

Things got weird. I don't know why but my VM settings changed.

The VM manager tells me my default VM storage path does not exist. It's set to `/mnt/user/domains/` and, in fact, that share does not exist. I used `/mnt/user/vdisks/` (Yup, I got into this because of Linus Tech Tips).

Setting it back to vdisks fixes it but the fact that my configuration just changed scares me. I feel like the system is slowly dying.

JorgeB · August 1, 2022

Enable the syslog server and post that after a crash.

BlueSialia · August 17, 2022

So it happened again. It happened on the 12th and this is every log entry from that day:

Aug 12 01:01:34 UnBlue emhttpd: read SMART /dev/sdd
Aug 12 01:02:23 UnBlue root: /etc/libvirt: 27.6 MiB (28966912 bytes) trimmed on /dev/loop3
Aug 12 01:02:23 UnBlue root: /var/lib/docker: 2.9 GiB (3095732224 bytes) trimmed on /dev/loop2
Aug 12 01:02:23 UnBlue root: /mnt/cache: 242.4 GiB (260238061568 bytes) trimmed on /dev/nvme0n1p1
Aug 12 02:00:01 UnBlue crond[1341]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null
Aug 12 02:03:55 UnBlue emhttpd: spinning down /dev/sdd
Aug 12 02:40:55 UnBlue emhttpd: read SMART /dev/sdd
Aug 12 03:42:45 UnBlue emhttpd: spinning down /dev/sdd
Aug 12 17:52:17 UnBlue emhttpd: read SMART /dev/sdd
Aug 12 18:52:40 UnBlue emhttpd: spinning down /dev/sdd

As far as I can tell, there is nothing weird there. The next log entry is from today (I was away so I couldn't restart the server earlier).
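When the syslog shows nothing unusual, the most it can do is bound the window in which the machine stopped logging. The following is a minimal Python sketch of that idea, not an Unraid tool: it scans syslog lines in the format shown above and reports pairs of consecutive entries separated by a long silence. The function names, the six-hour threshold, and the assumption that every line starts with a 15-character `Mon DD HH:MM:SS` timestamp are all illustrative choices.

```python
from datetime import datetime, timedelta


def parse_timestamp(line, year):
    """Parse the leading 'Aug 12 01:01:34' timestamp of a classic syslog line."""
    # Classic syslog timestamps omit the year, so the caller has to supply it.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def find_gaps(lines, year, min_gap=timedelta(hours=6)):
    """Return (last_entry_before, first_entry_after) pairs at least min_gap apart."""
    gaps = []
    previous = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            current = parse_timestamp(line, year)
        except ValueError:
            # Skip lines that do not start with a recognizable timestamp.
            continue
        if previous is not None and current - previous >= min_gap:
            gaps.append((previous, current))
        previous = current
    return gaps


# Three of the entries quoted above, used as sample input.
SAMPLE = [
    "Aug 12 03:42:45 UnBlue emhttpd: spinning down /dev/sdd",
    "Aug 12 17:52:17 UnBlue emhttpd: read SMART /dev/sdd",
    "Aug 12 18:52:40 UnBlue emhttpd: spinning down /dev/sdd",
]

if __name__ == "__main__":
    for before, after in find_gaps(SAMPLE, year=2022):
        print(f"silent from {before} until {after} ({after - before})")
```

An idle server can legitimately be quiet for many hours, so a gap is only a hint. The useful one is the gap that ends at the first entry of the next boot: the crash happened somewhere after the last line logged before it.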

There is one thing that did catch my eye. On restart I have the notification: unassigned.devices.plg: An update is available. Click here to install version 2022.08.12. I can't say for sure, but I believe every time my server has crashed I've found a pending update for that same plugin.

Edited August 17, 2022 by BlueSialia
I used markdown syntax thinking the forums accepted it

JorgeB · August 17, 2022

Check here to make sure you're using the correct "power supply idle control" setting.

trurl · August 17, 2022

On 8/1/2022 at 3:22 AM, BlueSialia said:

my configuration just changed

Everything about your configuration (any settings from webUI) is in config folder on flash. Did you do anything that would have reset config?

BlueSialia · August 17, 2022

4 minutes ago, JorgeB said:

Check here to make sure you're using the correct "power supply idle control" setting.

I'll check the BIOS settings. Although, would it even make sense that this is the cause when I've used the same BIOS settings for years and this crashing started happening a month and a half ago?

3 minutes ago, trurl said:

Everything about your configuration (any settings from webUI) is in config folder on flash. Did you do anything that would have reset config?

Nope, nothing. At least not while knowing that I was doing so. I actually don't remember changing anything in Unraid for a long while. I have my Plex, my VMs that I start with Wake Up on LAN... I have the Dashboard opened all the time, but just because I like looking at it haha.

JorgeB · August 17, 2022

10 minutes ago, BlueSialia said:

I'll check the BIOS settings. Although, would it even make sense that this is the cause when I've used the same BIOS settings for years and this crashing started happening a month and a half ago?

If the BIOS is correctly set and the crashes started out of the blue without anything relevant logged in the syslog it points to a hardware issue.

BlueSialia · August 17, 2022

7 minutes ago, JorgeB said:

If the BIOS is correctly set and the crashes started out of the blue without anything relevant logged in the syslog it points to a hardware issue.

I was suspecting that. Especially because of my config change, which makes me think the flash drive may be dying. How would I go about determining that? It's not possible to run any of the tests that can be executed on a cache or array drive on the flash drive, is it?

JorgeB · August 17, 2022

Unlikely that a dying flash drive would make Unraid crash without anything being logged, but you can replace it if you want:

https://wiki.unraid.net/Manual/Changing_The_Flash_Device

BlueSialia · March 13, 2023

I still face this issue. I replaced the USB drive, reformatted the drives and applied new settings. I basically set up Unraid from zero again.

I cannot get an uptime past 15 days. It ends up crashing. I am now on vacation and I cannot access my Plex or any other services hosted on it because it is down. I connect to my home network with a VPN and can't even ping the server.

I have no idea how to debug this. There is nothing in the syslog server.

trurl · March 13, 2023

Have you done memtest?

BlueSialia · March 13, 2023

4 hours ago, trurl said:

Have you done memtest?

Yes. Passed without any error or warning.

trurl · March 14, 2023

On 3/13/2023 at 8:09 AM, BlueSialia said:

There is nothing in the syslog server.

On 8/1/2022 at 4:23 AM, JorgeB said:

Enable the syslog server and post that after a crash.

Doesn't look like you ever posted it.

BlueSialia · March 14, 2023

10 hours ago, trurl said:

Doesn't look like you ever posted it.

Yes I did. Just after the post you quoted.

JonathanM · March 15, 2023

I only see one file posted in this thread, the diagnostic zip file in the first post. The syslog server if configured correctly will save a copy of the syslog up until the crash, and I don't see it posted.

BlueSialia · March 15, 2023

I didn't upload the entire file. Just posted the log entries of one of the days the server crashed.

BlueSialia · July 6, 2023

One year since this started happening to me and the issue is still the same. At some point, the server just dies somehow. The hardware is still on but the system is not even part of the network.

And I have no idea how to debug this problem. I think the only thing I can do is turn off every docker and VM that is not critical, then wait and see if it exhibits the same behavior. If by any chance it manages to reach an uptime of a month, then I can slowly re-enable what I turned off. Like one every 2 weeks or something. Even if this works, it'll mean months of having most of my things disabled. I'm just hopeless.

BlueSialia · October 16, 2023

I finally found a solution, but a very undesirable one. The issue was this. My Ryzen CPU doesn't support Unraid's (Linux) C-states. I disabled them in the BIOS. Now, as expected, my server has a very high power consumption even at idle.

JorgeB · October 17, 2023

See here: if there's a "Power Supply Idle Control" setting in the BIOS, you should be able to leave global C-states enabled.
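To see what a BIOS change did from the operating system's side, you can look at the idle states the kernel exposes. The Python sketch below assumes the standard Linux cpuidle sysfs layout (`/sys/devices/system/cpu/cpu0/cpuidle/stateN/` with `name`, `usage` and `disable` files); whether a given Unraid build and CPU driver populates that directory is an assumption, and the function name is made up for illustration.

```python
from pathlib import Path


def _read(path):
    """Return the stripped contents of a sysfs file, or None if it is unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def read_idle_states(cpuidle_dir="/sys/devices/system/cpu/cpu0/cpuidle"):
    """List the idle states the kernel exposes for one CPU (empty list if none)."""
    base = Path(cpuidle_dir)
    if not base.is_dir():
        return []
    # Keep only directories named state<N> and order them by N.
    dirs = [p for p in base.glob("state*") if p.name[5:].isdigit()]
    states = []
    for state_dir in sorted(dirs, key=lambda p: int(p.name[5:])):
        usage = _read(state_dir / "usage")
        states.append({
            "index": int(state_dir.name[5:]),
            "name": _read(state_dir / "name"),
            # Number of times this state has been entered since boot.
            "usage": int(usage) if usage and usage.isdigit() else None,
            "disabled": _read(state_dir / "disable") == "1",
        })
    return states


if __name__ == "__main__":
    states = read_idle_states()
    if not states:
        print("No cpuidle states exposed on this system.")
    for s in states:
        print(f"state{s['index']}: {s['name']} usage={s['usage']} disabled={s['disabled']}")
```

If the deeper states disappear from this list after changing the BIOS, or their usage counters stop growing, the change took effect. That would also account for the higher idle power draw.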
