Unexpected clean shutdowns

VaiZore · March 10, 2022

Hi all,

I've been using Unraid since late 2021 and I love it. In January of this year I upgraded my server from an old Xeon E3-1230 v2 to an i5-10400 on an MSI B560 board. For the first month or so it was perfectly stable. I was very happy with the performance.

For about a month now I've been struggling with an odd issue. My server will unexpectedly shut down or restart - cleanly. As in, the logs show the system going down in a planned fashion, and when it comes back up there's no parity check needed. There doesn't seem to be any pattern to the shutdowns. It can be after 2 minutes of uptime or 10 days. When it happens, there's always a switch to runlevel 0 (an "init 0") in the logs, followed by a shutdown or a restart:

<snip>

Mar 9 07:20:52 <servername> init: Switching to runlevel: 0
Mar 9 07:20:52 <servername> init: Trying to re-exec init

<snip>

The server is on a UPS, so the first thing I tried was to bypass the UPS and run on mains power. I also disabled APCUPSD and unplugged the UPS cable. That didn't fix it. I replaced the PSU with a known good unit; that didn't fix the issue. I uninstalled a bunch of plugins; that didn't fix it either. In fact, when I uninstalled "My Servers" and was asked to provide feedback, I reported the issues I'd been having and a support engineer recommended I try rebooting in Safe Mode. I did, but this didn't fix the issue.

I've updated the BIOS.

I even unplugged the case's "power on" cable from the motherboard's pin header, in case it was shorting and sending the equivalent of a button press to the motherboard. Still didn't fix it.

I've upgraded from 6.9.2 to 6.10rc2, but the issue persists.

I'm running a syslog server locally now, so have full logs of the shutdown, but nothing jumps out at me as the cause.
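For anyone wanting to review captured logs the same way, here is a minimal sketch (not part of the original troubleshooting) that pulls out the lines leading up to each runlevel-0 switch. The marker string comes from the excerpt above; the function name and the `before` parameter are hypothetical, and the log path is whatever your syslog server writes to.

```python
from collections import deque
from typing import Iterable, List

# Marker text copied from the syslog excerpt in this thread.
SHUTDOWN_MARKER = "init: Switching to runlevel: 0"


def shutdown_context(lines: Iterable[str], before: int = 5) -> List[List[str]]:
    """For each runlevel-0 switch, return up to `before` preceding lines plus the marker line."""
    recent = deque(maxlen=before)  # rolling window of the most recent lines
    events: List[List[str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if SHUTDOWN_MARKER in line:
            events.append(list(recent) + [line])
        recent.append(line)
    return events
```

Passing an open file object works, since the function only iterates over lines. If the same process or message shows up just before every event, that is a lead; if the preceding lines are unrelated each time, the trigger is more likely something that never reaches the log.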

Is there anything else I can try before I RMA the board & CPU? If I have to RMA, should I stay with Intel iGPU processors? After this bad experience I'm tempted to go AMD instead (all of my work/personal desktops are AMD, and rock solid) but I use hardware transcoding so I'd have to invest in a GPU with NVENC which will increase the cost and power draw... I'd really prefer to stay Intel for Quicksync if I can.

The only thing I haven't explored is disabling XMP on the memory. It's quite high speed memory (3600MT/s, usually sold as "3600MHz") - is that too much for the 10400's memory controller? Or the B560 platform? I can drop the speed down to the basic 2133MT/s spec, but I don't really want to lose the performance, and I struggle to see how a memory error would cause a graceful shutdown, which is why I've been reluctant to try...

Really grateful for any/all input, I'm tearing my (remaining) hair out!

EDIT: Dropped RAM down to 2933 and testing now...

timewavezero-diagnostics-20220310-1129.zip

Edited March 10, 2022 by VaiZore

ChatNoir · March 10, 2022

2 hours ago, VaiZore said:

The only thing I haven't explored is disabling XMP on the memory. It's quite high speed memory (3600MT/s, usually sold as "3600MHz") - is that too much for the 10400's memory controller? Or the B560 platform? I can drop the speed down to the basic 2133MT/s spec, but I don't really want to lose the performance, and I struggle to see how a memory error would cause a graceful shutdown, which is why I've been reluctant to try...

Really grateful for any/all input, I'm tearing my (remaining) hair out!

EDIT: Dropped RAM down to 2933 and testing now...

The CPU's memory controller is only rated to 2666, so you are still overclocking at 2933.

https://ark.intel.com/content/www/us/en/ark/products/199271/intel-core-i510400-processor-12m-cache-up-to-4-30-ghz.html

JonathanM · March 10, 2022

Are you forwarding anything to the outside world? In the past, there have been instances where the GUI or console was accessed remotely and servers were getting hacked or shut down.

VaiZore · March 10, 2022

3 hours ago, JonathanM said:

Are you forwarding anything to the outside world? In the past, there have been instances where the GUI or console was accessed remotely and servers were getting hacked or shut down.

Server is accessible via My Servers, not publishing any services.

VaiZore · March 10, 2022

3 hours ago, ChatNoir said:

The CPU's memory controller is only rated to 2666, so you are still overclocking at 2933.

https://ark.intel.com/content/www/us/en/ark/products/199271/intel-core-i510400-processor-12m-cache-up-to-4-30-ghz.html

Understood. I've dropped the memory speed in case the IMC can't handle it. If I still experience shutdowns at 2933, I'll try 2666.

I would have expected a memory speed issue to cause instability/crash/unclean shutdowns though rather than clean...

ChatNoir · March 10, 2022

40 minutes ago, VaiZore said:

I would have expected a memory speed issue to cause instability/crash/unclean shutdowns though rather than clean...

Might not be what causes your problem. But for a server, I would rather keep things within spec.

I am not sure 1-2% extra performance is worth risking your data.

VaiZore · March 10, 2022

7 minutes ago, ChatNoir said:

Might not be what causes your problem. But for a server, I would rather keep things within spec.

I am not sure 1-2% extra performance is worth risking your data.

That's a fair point. Appreciate the input.

Frank1940 · March 10, 2022

One other point is that pushing the power button on the computer case will start a clean shutdown of your Unraid server. There have been cases in the past where small (and perhaps some not so small) children and pets have been attracted to the LED lights that are often part of this button and have pushed it. (Cats were the pet most often mentioned...)

EDIT: One-to-two-second push = Clean shutdown

Five-second push = Forced (dirty) shutdown

Edited March 10, 2022 by Frank1940

JonathanM · March 10, 2022

1 hour ago, Frank1940 said:

pushing the power button on the computer case will start a clean shutdown of your Unraid server.

10 hours ago, VaiZore said:

I even unplugged the case's "power on" cable from the motherboard's pin header, in case it was shorting and sending the equivalent of a button press to the motherboard. Still didn't fix it.

VaiZore · March 11, 2022

12 hours ago, Frank1940 said:

One other point is that pushing the power button on the computer case will start a clean shutdown of your Unraid server. There have been cases in the past where small (and perhaps some not so small) children and pets have been attracted to the LED lights that are often part of this button and have pushed it. (Cats were the pet most often mentioned...)

EDIT: One-to-two-second push = Clean shutdown

Five-second push = Forced (dirty) shutdown

Appreciate the input - as JonathanM mentioned, I have isolated that as a potential cause. I do have kids, but the server is in a locked office, and the issue persisted even when I disconnected the case's power-on switch cable.

VaiZore · March 16, 2022

This looks to have been a hardware issue. I am RMAing the CPU & memory. Thank you all for your input!

VaiZore · March 24, 2022

Update - received new CPU/RAM/motherboard and the system still shuts down!!!

Managed to update to 6.10rc4 but that hasn't fixed it either....

JonathanM · March 25, 2022

Unfortunately the next troubleshooting steps will severely limit your server usage, so hopefully the shutdowns have gotten more consistent.

1. Boot in safe mode and wait to see whether it shuts down.

2. If it still shuts down, boot in normal mode and disconnect the network cable. Again, wait to see whether it shuts down.

Depending on the results of those tests, we can make further recommendations for narrowing down the cause.

VaiZore · March 25, 2022

7 hours ago, JonathanM said:

Unfortunately the next troubleshooting steps will severely limit your server usage, so hopefully the shutdowns have gotten more consistent.

1. Boot in safe mode and wait to see whether it shuts down.

2. If it still shuts down, boot in normal mode and disconnect the network cable. Again, wait to see whether it shuts down.

Depending on the results of those tests, we can make further recommendations for narrowing down the cause.

Thank you for the response! It's a relief to know we're not out of options.

The shutdowns are such a problem that the system is currently unusable, and I've migrated essential services to another PC temporarily. So, yes, I'm ready for downtime.

I will try the steps you suggest, and report back. I have previously tried running in Safe Mode, and still had the clean shutdowns. But will try again.

Other things I've tried in the meantime:

1) Connected the case's top USB 2.0 ports to the motherboard's USB header and booted from those ports (to force USB 2.0; my USB stick is 2.0 but was in a 3.0 port) - still a clean shutdown after a few hours

2) Ordered new USB 2.0 stick, arriving tomorrow - hopefully I can run for 30 days on a new stick with old config without transferring license... Will transfer license if it solves the issue, of course.

3) Going to replace CPU heatsink fan. Intel box cooler wasn't cutting it under load, so I installed a Cooler Master Hyper 212 from an older build. Temps went from 70+ to max 40C. But the fan is old, and I wonder if it's either stopping or reporting 0 RPM to the Motherboard which is then sending a "Halt" to the OS... It's a long shot but I've got to investigate every possibility.

JonathanM · March 25, 2022

2 hours ago, VaiZore said:

hopefully I can run for 30 days on a new stick with old config without transferring license.

You can either set up a new trial, which will require reassigning your drives, or you can transfer your existing config, which will require transferring the license to start the array.

At this point, I recommend setting up the new stick for a fresh trial, but DON'T yet assign any drives or transfer anything from the old stick. Just boot into Unraid and let it sit unconfigured and see if it shuts down.

Have you set the CPU max power to a realistic figure, or left it to max power or whatever the board defaults to?

VaiZore · March 25, 2022

6 minutes ago, JonathanM said:

Have you set the CPU max power to a realistic figure, or left it to max power or whatever the board defaults to?

During first entry to the BIOS, you have to specify a cooler type, and that sets the TDP. I chose the lowest option ("Box cooler, 65W").

With the tower cooler installed, I've never seen the chip (i5-10400) go above 40C (as measured by Dynamix System Temp plugin and displayed on the GUI Dashboard).

I suppose it's possible that the reading is incorrect, or that the fan is failing, but a CPU overheat (or fan reading 0 RPM) would result in an immediate power off rather than a prolonged, clean shutdown. The shutdowns take 90+ seconds and in that time unRAID is cleanly shutting down libvirt & docker, stopping the array, and unmounting disks. It doesn't feel like a panicking system powering off due to overheat!

VaiZore · April 28, 2022

Hi all. I RMA'd the second 10400 and bought an 11400. RAM at 2933 in Gear 1, within spec. Now stable for 6 days without issue. Must have been a bad batch of 10400s...

Thanks all for help!

VaiZore · May 16, 2022

Unfortunately this problem has re-appeared.

I gave up on Intel and dusted off an unused Ryzen 3700X. Different motherboard, different RAM, different PSU, and installed a GPU (RTX 2060) as no iGPU on my CPU. The only hardware that is the same from both builds is the case, the HDDs, and the SSDs.

I still experience unexpected clean and dirty shutdowns. Latest diag attached, and a few syslog snippets below:

May 11 14:35:45 TimeWaveZero webGUI: Successful login user root from <snip>
May 11 15:24:19 TimeWaveZero emhttpd: read SMART /dev/sdd
May 11 15:59:39 TimeWaveZero shutdown[8673]: shutting down for system halt
May 11 15:59:39 TimeWaveZero init: Switching to runlevel: 0
May 11 15:59:39 TimeWaveZero init: Trying to re-exec init
May 11 15:59:41 TimeWaveZero kernel: mdcmd (36): nocheck cancel
May 11 15:59:42 TimeWaveZero emhttpd: Spinning up all drives...
May 11 15:59:42 TimeWaveZero emhttpd: spinning up /dev/sdc
May 11 15:59:42 TimeWaveZero emhttpd: read SMART /dev/sdd
May 11 15:59:42 TimeWaveZero emhttpd: read SMART /dev/sdb
May 11 15:59:42 TimeWaveZero emhttpd: read SMART /dev/sdc
May 11 15:59:42 TimeWaveZero emhttpd: read SMART /dev/nvme0n1
May 11 15:59:42 TimeWaveZero emhttpd: read SMART /dev/sda
May 11 15:59:48 TimeWaveZero emhttpd: Stopping services...

May 11 16:00:07 TimeWaveZero emhttpd: shcmd (103006): rm -f /etc/avahi/services/smb.service
May 11 16:00:07 TimeWaveZero avahi-daemon[3852]: Files changed, reloading.
May 11 16:00:07 TimeWaveZero avahi-daemon[3852]: Service group file /services/smb.service vanished, removing services.
May 11 16:00:07 TimeWaveZero emhttpd: Stopping mover...
May 11 16:00:07 TimeWaveZero emhttpd: shcmd (103008): /usr/local/sbin/mover stop
May 11 16:00:07 TimeWaveZero root: mover: not running
May 11 16:00:07 TimeWaveZero emhttpd: Sync filesystems...
May 11 16:00:07 TimeWaveZero emhttpd: shcmd (103009): sync

May 11 16:02:28 TimeWaveZero root: Delaying execution of fix common problems scan for 10 minutes
May 11 16:02:28 TimeWaveZero emhttpd: /usr/local/emhttp/plugins/user.scripts/backgroundScript.sh "/tmp/user.scripts/tmpScripts/my_script/script" >/dev/null 2>&1
May 11 16:02:28 TimeWaveZero emhttpd: Starting services...

Anyone have any ideas? At this point I'm prepared to accept that it's Aliens or Solar Flares... Except all other machines on my network are stable...
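For reference, a minimal sketch of how the timing of the sequence in an excerpt like the one above could be measured. It assumes every non-blank line starts with a `Mon D HH:MM:SS` stamp, that month names parse under an English locale, and that the year (which syslog omits) is supplied by the caller. The function names are hypothetical.

```python
from datetime import datetime
from typing import Iterable, List, Tuple

# Marker text copied from the syslog excerpt in this thread.
HALT_MARKER = "shutting down for system halt"


def parse_stamp(line: str, year: int) -> datetime:
    """Parse the leading 'Mon D HH:MM:SS' of a syslog line; the year is an assumption."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def seconds_since_halt(lines: Iterable[str], year: int = 2022) -> List[Tuple[int, str]]:
    """Return (seconds since the halt request, line) for the halt line and every line after it."""
    start = None
    timeline: List[Tuple[int, str]] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank separator lines
        if start is None:
            if HALT_MARKER not in line:
                continue  # ignore everything before the halt request
            start = parse_stamp(line, year)
        elapsed = int((parse_stamp(line, year) - start).total_seconds())
        timeline.append((elapsed, line))
    return timeline
```

Run against the excerpt above, the `sync` line lands 28 seconds after the halt request and `Starting services...` 169 seconds after it, which fits an orderly stop followed by a reboot rather than an abrupt power cut.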

timewavezero-diagnostics-20220516-1153.zip

unriadmidwest · May 16, 2022

I had something similar to this happen to me. Every day, a clean shutdown would be started at random times during the day and night. I reviewed logs daily and fought with this for about 2 weeks. I finally started by replacing the PSU. I replaced it and still had the same issues. Felt very defeated. With the server on, while tinkering in the case, I bumped a wire and a clean shutdown happened. Started to dig further and the wire from the power button on the case to the motherboard was loose. I think it would vibrate enough to "short" and act as if the power button on the case had been quickly pressed. I re-seated it and never had the problem again. Might not be your situation, but might be worth looking into.

Good luck!

Frank1940 · May 16, 2022

3 hours ago, VaiZore said:

At this point I'm prepared to accept that it's Aliens or Solar Flares... Except all other machines on my network are stable...

Might you have a local clown on your network??? 🙄 😏

EDIT 1: --- IF you can, double check your wireless connections to see what users have been accessing it. (The clown could be remote...)

EDIT 2: --- If your router will allow it, create a VLAN for IOT devices and isolate them from your home LAN. Add a second VLAN for all 'Guests' to whom you want to provide WAN access which is also isolated from your LAN.

Edited May 16, 2022 by Frank1940

VaiZore · May 16, 2022

1 hour ago, unriadmidwest said:

I had something similar to this happen to me. Every day, a clean shutdown would be started at random times during the day and night. I reviewed logs daily and fought with this for about 2 weeks. I finally started by replacing the PSU. I replaced it and still had the same issues. Felt very defeated. With the server on, while tinkering in the case, I bumped a wire and a clean shutdown happened. Started to dig further and the wire from the power button on the case to the motherboard was loose. I think it would vibrate enough to "short" and act as if the power button on the case had been quickly pressed. I re-seated it and never had the problem again. Might not be your situation, but might be worth looking into.

Good luck!

I really appreciate the input. This had occurred to me, so I now run with the case's power button cable disconnected from the motherboard pin headers. I power on manually when I need to. Issue persists. Thanks anyway!

VaiZore · May 16, 2022

1 hour ago, Frank1940 said:

Might you have a local clown on your network??? 🙄 😏

EDIT 1: --- IF you can, double check your wireless connections to see what users have been accessing it. (The clown could be remote...)

EDIT 2: --- If your router will allow it, create a VLAN for IOT devices and isolate them from your home LAN. Add a second VLAN for all 'Guests' to whom you want to provide WAN access which is also isolated from your LAN.

Thanks for the suggestions. I'm only using the ISP provided router, which certainly isn't the highest security available! I will check the router. Presumably in order to shut down my server the clown would still need my root credentials, which I have changed multiple times during troubleshooting?

No IOT devices yet

Frank1940 · May 16, 2022

30 minutes ago, VaiZore said:

Presumably in order to shut down my server the clown would still need my root credentials, which I have changed multiple times during troubleshooting?

That is true as far as I know. Just make sure your password is reasonably secure-- ten characters at a minimum. See here for why:

https://www.hivesystems.io/blog/are-your-passwords-in-the-green

VaiZore · June 13, 2022

Hi all. I consider this resolved now. Thank you to all who offered assistance.

The solution - I moved the equipment into a new case... (Fractal Define 7). 100% uptime since the change (weeks, now).

Since I had already ruled out the power & reset switches in the old case, I have no idea what the problem could have been. The metal panel in the old case (also a Fractal, a much older one) that supported the motherboard might have been slightly bent outwards (convex), but it didn't look like it was far enough out for the metal to make contact with any components or traces on the back of the motherboard, or any of the 'keep out' zones. Plus, I used all of the correct standoffs for the form factor, so it's unlikely to have been a short-circuit. Maybe there was some other defect that wasn't obvious, poorly shielded header cables, or some improper grounding or something... I'm not an electrical engineer, so take these guesses with a grain of salt!

Either way, it wasn't the fault of UNRAID, which I can now enjoy uninterrupted. I have since bought a second license for a backup server, which is also rock-solid. I'm very much enjoying the product again.

Thanks to @ChatNoir and @JonathanM and all of the members who contributed their assistance.
