Server hangs completely at random times

Daniel Finch · August 1, 2018

UnRAID version: 6.5.3

Plugins:

Community Applications: 2018.07.22

Fix Common Problems: 2018.07.28

Nerd Tools: 2018.02.17 (screen installed)

Unassigned Devices: 2018.06.01a

Docker apps:

binhex-krusader (not started)

deluge

nginx

nzbget

PlexMediaServer

radarr

sonarr

Hardware:

Motherboard: ASRock H110M-ITX

CPU: i3 6100T (stock cooler, not overclocked)

RAM: 2x 4GB 2133MHz

Hard drives:

* 3x 4TB Seagate Barracuda 3.5" (1 being used for parity)

* 1x 4TB Seagate IronWolf

No GPU installed

---

Since I first set my server up I've been seeing random and complete server hangs. None of the Docker containers are available, nor is the GUI, and I'm unable to log in via the console: I enter the username and never get a password prompt. I have to perform a hard shutdown and turn it back on. It seems to happen every 3-4 days or so. The last time it happened I put it into Diagnostics mode, so I've got the .zip and the syslog attached. Usually I'm not using the machine (it sits on a desk somewhere in my house untouched), so I don't often realise it's happened until I go to do something with one of the Docker apps.

My network sits behind a pfSense device, so the only way to access the server is via VPN or by being physically on the local network. As far as I can see, there are never any errors shown in the console; the IP address and username prompt are always the last things displayed.

I have not yet tried safe mode, and I haven't found any reproduction steps yet (sorry!).
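Because the hang often goes unnoticed for a while, one way to narrow down when it happened is a small heartbeat file. The following is only a sketch under assumptions: it assumes Python 3 is installed (it is not part of a stock unRAID install), and the function names and the log path in the usage note are hypothetical. The file has to live on storage that survives a reboot, and frequent writes to a USB flash drive add wear, so the interval is worth choosing conservatively.

```python
import os
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def write_heartbeat(path: Path) -> None:
    """Append the current UTC time as one ISO-8601 line and force it to disk."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(stamp + "\n")
        fh.flush()
        os.fsync(fh.fileno())  # make sure the line is not left in a cache


def last_heartbeat(path: Path) -> Optional[datetime]:
    """Return the newest parseable timestamp in the file, or None."""
    try:
        lines = path.read_text(encoding="utf-8").split()
    except FileNotFoundError:
        return None
    for line in reversed(lines):
        try:
            return datetime.fromisoformat(line)
        except ValueError:
            continue  # skip a line that was cut off mid-write
    return None
```

Calling `write_heartbeat(Path("/boot/logs/heartbeat.log"))` from a scheduled job every few minutes (path is an assumption) and then checking `last_heartbeat(...)` after the next hard reset gives a rough upper bound on when the server stopped responding.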

FCPsyslog_tail.txt

htpc-diagnostics-20180730-1836.zip

TechMed · August 1, 2018

Hi @Daniel Finch,

Welcome to the forum! LOTS of great help here and folks are friendly.

So, I realize this will not be a direct answer, but it has been my experience that when things go "randomly" awry, it is almost always hardware related.

I recently had a similar situation on a new build and discovered that the USB port I had my flash drive in was defective.

System "appeared" to boot, but all kinds of weirdness, like you are talking about.

Until one of the Pros gets a chance to review your Diags file, you may want to have a look see at some of the hardware.

Again, this is not a point and shoot answer, just experience in general saying, look to the hardware first when there is randomness in the error(s).

I'm sure it will all work itself out once the Pros get a chance to chime in, they really are great.

Altheran · August 3, 2018

In my experience since Skylake : Hangups = C States. Since I disabled C States in the BIOS, no more worries
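For anyone who wants to see which idle states the kernel is actually using before and after changing the BIOS setting, here is a minimal sketch. It assumes Python 3 is available and that the kernel exposes the standard Linux cpuidle sysfs directory for `cpu0`; the function name `read_cstates` is made up for this example. Whether the list changes after the BIOS toggle can depend on the idle driver in use, so treat the output as a hint rather than proof.

```python
from pathlib import Path
from typing import List, Tuple

# Standard Linux sysfs location for cpu0's idle states (may be absent).
CPUIDLE_DIR = Path("/sys/devices/system/cpu/cpu0/cpuidle")


def read_cstates(cpuidle_dir: Path = CPUIDLE_DIR) -> List[Tuple[str, int, int]]:
    """Return (name, times entered, total time in microseconds) per idle state."""
    states: List[Tuple[str, int, int]] = []
    if not cpuidle_dir.is_dir():
        return states
    # Sort numerically so that state10 comes after state2.
    state_dirs = sorted(cpuidle_dir.glob("state[0-9]*"),
                        key=lambda p: int(p.name[len("state"):]))
    for state_dir in state_dirs:
        name = (state_dir / "name").read_text().strip()
        usage = int((state_dir / "usage").read_text())
        time_us = int((state_dir / "time").read_text())
        states.append((name, usage, time_us))
    return states


for name, usage, time_us in read_cstates():
    print(f"{name:10s} entered {usage:>12d} times, {time_us / 1e6:12.1f} s total")
```

Running it twice a few minutes apart and comparing the counters shows which states the CPU is really entering while the server idles.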

Daniel Finch · August 3, 2018

8 hours ago, Altheran said:

In my experience since Skylake : Hangups = C States. Since I disabled C States in the BIOS, no more worries

How interesting, I'll try disabling that and see how it goes. Thanks!

perPLEXed · August 3, 2018

I had random crashes and reboots on my Ryzen 1800 desktop system running Windows 10. It could be idling with nothing running, or sometimes in the middle of a graphics-intensive game, and it would just reboot or crash. I was unable to reproduce it consistently. I stress tested it for hours and sometimes it would crash and sometimes not. I started monitoring and logging CPU temperatures. I had recorded temperatures for the system a year ago and noticed the current CPU temperatures were slightly higher at idle and significantly higher under load. So I pulled my CPU heat sink and applied new thermal paste. The temperatures went down slightly at idle and were much lower under load.

I no longer have any random reboots or crashes now. Thermal paste degrades over time? Misaligned CPU Heat-sink? Not sure but it fixed it.
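The same kind of temperature logging can be done on a Linux box such as the server in this thread. This is a rough sketch, not what was used on the Windows machine above: it assumes Python 3 and the standard Linux `/sys/class/thermal` interface, which reports millidegrees Celsius, and the function names are invented for illustration. Which zone corresponds to the CPU varies by board.

```python
from datetime import datetime
from pathlib import Path
from typing import Dict

# Standard Linux sysfs location for thermal zones (may be absent).
THERMAL_DIR = Path("/sys/class/thermal")


def read_temperatures(thermal_dir: Path = THERMAL_DIR) -> Dict[str, float]:
    """Map each readable thermal zone name to its temperature in degrees C."""
    temps: Dict[str, float] = {}
    if not thermal_dir.is_dir():
        return temps
    for zone in sorted(thermal_dir.glob("thermal_zone*")):
        try:
            millidegrees = int((zone / "temp").read_text())
        except (OSError, ValueError):
            continue  # zone without a usable reading
        temps[zone.name] = millidegrees / 1000.0
    return temps


def format_log_line(temps: Dict[str, float], when: datetime) -> str:
    """Build one CSV-style line: timestamp followed by zone=value pairs."""
    readings = ",".join(f"{zone}={value:.1f}" for zone, value in sorted(temps.items()))
    return f"{when.isoformat(timespec='seconds')},{readings}"


print(format_log_line(read_temperatures(), datetime.now()))
```

Appending one such line per minute to a file gives idle and load baselines to compare against later, which is how the drift described above was spotted.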

pwm · August 3, 2018

43 minutes ago, perPLEXed said:

Thermal paste degrades over time? Misaligned CPU Heat-sink?

Some thermal paste degrades a lot over time, but I think most thermal paste works quite well for the expected lifetime of the system - many industrial systems are expected to work well for 10-20 years without need for replacing any thermal paste.

It's more likely that there was an alignment issue or that not all of the chip had thermal paste. Either of these could result in a big variation in temperature between different parts of the chip potentially making one part hot enough that it becomes unstable while the temperature sensor still sees a temperature that does not require throttling.

perPLEXed · August 3, 2018

45 minutes ago, pwm said:

It's more likely that there was an alignment issue or that not all of the chip had thermal paste.

I agree that was probably it.

Daniel Finch · September 8, 2018

Just wanted to come back to this thread and give an update. Since disabling C-States I have had no further hangs and the server is now sitting at 32 days uptime. Thanks everyone for your help!
