Unexpected clean shutdowns


Go to solution Solved by VaiZore,


Hi all,

 

I've been using Unraid since late 2021 and I love it. In January of this year I upgraded my server from an old Xeon E3-1230 v2 to an i5-10400 on an MSI B560 board. For the first month or so it was perfectly stable, and I was very happy with the performance.

 

For about a month now I've been struggling with an odd issue. My server will unexpectedly shut down or restart, cleanly. That is, the logs show the system going down in a planned fashion, and when it comes back up no parity check is needed. There doesn't seem to be any pattern to the shutdowns: it can happen after 2 minutes of uptime or after 10 days. When it happens, there's always an "init 0" (a switch to runlevel 0) in the logs, followed by a shutdown or a restart:

 

<snip>

Mar  9 07:20:52 <servername> init: Switching to runlevel: 0
Mar  9 07:20:52 <servername> init: Trying to re-exec init

<snip>

 

The server is on a UPS, so the first thing I tried was to bypass the UPS and run on mains power. I also disabled APCUPSD and unplugged the UPS cable. That didn't fix it. I replaced the PSU with a known-good unit; that didn't fix it either. I uninstalled a bunch of plugins, with no change. In fact, when I uninstalled "My Servers" and was asked to provide feedback, I reported the issues I'd been having, and a support engineer recommended I try booting in Safe Mode. I did, but that didn't fix the issue.

 

I've updated the BIOS.

 

I even unplugged the case's "power on" cable from the motherboard's pin header, in case it was shorting and sending the equivalent of a button press to the motherboard. Still didn't fix it.

 

I've upgraded from 6.9.2 to 6.10rc2, but the issue persists.

 

I'm running a syslog server locally now, so have full logs of the shutdown, but nothing jumps out at me as the cause.
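Since every shutdown is logged, one way to look for a pattern is to pull each runlevel switch out of the collected syslog and list when it happened and whether it was a halt or a reboot. Below is a minimal Python sketch, assuming the log lines look like the two quoted above. The function name and the sample list are illustrative only, and the mapping of runlevel 0 to halt and 6 to reboot is the usual SysV-style convention rather than something confirmed in this thread.

```python
import re
from typing import Iterable, List, Tuple

# Matches the init message quoted in this thread, for example:
# "Mar  9 07:20:52 <servername> init: Switching to runlevel: 0"
SWITCH_RE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) .*init: Switching to runlevel: (\d)"
)


def list_runlevel_switches(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (timestamp, runlevel) for every runlevel switch found."""
    events = []
    for line in lines:
        match = SWITCH_RE.search(line)
        if match:
            events.append((match.group(1), int(match.group(2))))
    return events


# The two lines quoted in the post; a real run would iterate over the
# syslog file collected by the local syslog server instead.
SAMPLE = [
    "Mar  9 07:20:52 <servername> init: Switching to runlevel: 0",
    "Mar  9 07:20:52 <servername> init: Trying to re-exec init",
]

for stamp, level in list_runlevel_switches(SAMPLE):
    # Assumed convention: 0 = halt, 6 = reboot.
    kind = {0: "halt", 6: "reboot"}.get(level, "other")
    print(f"{stamp}  runlevel {level} ({kind})")
```

Passing an open file object instead of `SAMPLE` works the same way, since the function only needs an iterable of lines.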

 

Is there anything else I can try before I RMA the board & CPU? If I have to RMA, should I stay with Intel iGPU processors? After this bad experience I'm tempted to go AMD instead (all of my work/personal desktops are AMD, and rock solid) but I use hardware transcoding so I'd have to invest in a GPU with NVENC which will increase the cost and power draw... I'd really prefer to stay Intel for Quicksync if I can.

 

The only thing I haven't explored is disabling XMP on the memory. It's quite high-speed memory (3600MT/s, usually marketed as "3600MHz") - is that too much for the 10400's memory controller? Or the B560 platform? I can drop the speed down to the basic 2133MT/s spec, but I don't really want to lose the performance, and I struggle to see how a memory error would cause a graceful shutdown, which is why I've been reluctant to try...

 

Really grateful for any/all input, I'm tearing my (remaining) hair out!

 

 

EDIT: Dropped RAM down to 2933 and testing now...

 

 

timewavezero-diagnostics-20220310-1129.zip

Edited by VaiZore
Link to comment
2 hours ago, VaiZore said:

The only thing I haven't explored is disabling XMP on the memory. It's quite high-speed memory (3600MT/s, usually marketed as "3600MHz") - is that too much for the 10400's memory controller? Or the B560 platform? I can drop the speed down to the basic 2133MT/s spec, but I don't really want to lose the performance, and I struggle to see how a memory error would cause a graceful shutdown, which is why I've been reluctant to try...

 

Really grateful for any/all input, I'm tearing my (remaining) hair out!

 

 

EDIT: Dropped RAM down to 2933 and testing now...

The CPU's memory controller is only rated to 2666, so you are still overclocking at 2933.

https://ark.intel.com/content/www/us/en/ark/products/199271/intel-core-i510400-processor-12m-cache-up-to-4-30-ghz.html
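To put rough numbers on that, here is a small Python sketch of the arithmetic. It takes the 2666 rating quoted above as given and compares it with the speeds mentioned in this thread. The function names are made up for illustration, and the clock conversion relies only on the fact that DDR memory transfers data twice per clock cycle.

```python
RATED_MTS = 2666  # rated memory speed for the i5-10400, as quoted in this thread


def percent_over_rated(configured_mts: float, rated_mts: float = RATED_MTS) -> float:
    """How far a configured transfer rate sits above (or below) the rating, in percent."""
    return (configured_mts - rated_mts) / rated_mts * 100


def memory_clock_mhz(transfer_rate_mts: float) -> float:
    """DDR moves data on both clock edges, so the clock is half the MT/s figure."""
    return transfer_rate_mts / 2


# Speeds mentioned in the thread: XMP profile, the reduced setting,
# the rated speed, and the base spec.
for speed in (3600, 2933, 2666, 2133):
    print(
        f"{speed} MT/s: clock {memory_clock_mhz(speed):.1f} MHz, "
        f"{percent_over_rated(speed):+.1f}% relative to rated"
    )
```

By this measure 2933 is still about 10% above the rating, and the 3600 XMP profile is about 35% above it.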

Link to comment
3 hours ago, JonathanM said:

Are you forwarding anything to the outside world? In the past, there have been instances where the GUI or console was accessed remotely and servers were getting hacked or shut down.

Server is accessible via My Servers, not publishing any services.

Link to comment
3 hours ago, ChatNoir said:

The CPU's memory controller is only rated to 2666, so you are still overclocking at 2933.

https://ark.intel.com/content/www/us/en/ark/products/199271/intel-core-i510400-processor-12m-cache-up-to-4-30-ghz.html

Understood. I've dropped the memory speed in case the IMC can't handle it. If I still experience shutdowns at 2933, I'll try 2666.

 

I would have expected a memory speed issue to cause instability/crash/unclean shutdowns though rather than clean...

Link to comment
40 minutes ago, VaiZore said:

I would have expected a memory speed issue to cause instability/crash/unclean shutdowns though rather than clean...

It might not be what causes your problem, but for a server I would rather keep things within spec.

I am not sure 1-2% extra performance is worth risking your data.

Link to comment

One other point is that pushing the power button on the computer case will start a clean shutdown of your Unraid server. There have been cases in the past where small (and perhaps some not so small) children and pets have been attracted to the LED lights that are often part of this button and pushed it. (Cats were the pet most often mentioned...)

 

EDIT:  One-to-two-second push = Clean shutdown

          Five-second push  =  Forced (dirty) shutdown

Edited by Frank1940
Link to comment
1 hour ago, Frank1940 said:

pushing the power button on the computer case will start a clean shutdown of your Unraid server. 

 

10 hours ago, VaiZore said:

 

I even unplugged the case's "power on" cable from the motherboard's pin header, in case it was shorting and sending the equivalent of a button press to the motherboard. Still didn't fix it.

 

Link to comment
12 hours ago, Frank1940 said:

One other point is that pushing the power button on the computer case will start a clean shutdown of your Unraid server. There have been cases in the past where small (and perhaps some not so small) children and pets have been attracted to the LED lights that are often part of this button and pushed it. (Cats were the pet most often mentioned...)

 

EDIT:  One-to-two-second push = Clean shutdown

          Five-second push  =  Forced (dirty) shutdown

Appreciate the input - as JonathanM pointed out, I have already ruled that out as a potential cause. I do have kids, but the server is in a locked office, and the issue persisted even when I disconnected the case's power-on switch cable.

Link to comment
  • 2 weeks later...

Unfortunately the next troubleshooting steps will severely limit your server usage, so hopefully the shutdowns have gotten more consistent.

 

1. Boot in safe mode and wait to see whether it shuts down.

2. If it still shuts down, boot in normal mode and disconnect the network cable. Again, wait to see whether it shuts down.

 

Depending on the results of those tests, we can make further recommendations for narrowing down the cause.

Link to comment
7 hours ago, JonathanM said:

Unfortunately the next troubleshooting steps will severely limit your server usage, so hopefully the shutdowns have gotten more consistent.

 

1. Boot in safe mode and wait to see whether it shuts down.

2. If it still shuts down, boot in normal mode and disconnect the network cable. Again, wait to see whether it shuts down.

 

Depending on the results of those tests, we can make further recommendations for narrowing down the cause.

Thank you for the response! It's a relief to know we're not out of options.

 

The shutdowns are such a problem that the system is currently unusable, and I've migrated essential services to another PC temporarily. So, yes, I'm ready for downtime :)

 

I will try the steps you suggest, and report back. I have previously tried running in Safe Mode, and still had the clean shutdowns. But will try again.

 

Other things I've tried in the meantime:

1) Connected the case's top USB 2.0 ports to the motherboard's USB header and booted from those ports (to force USB 2.0; my USB stick is 2.0 but was in a 3.0 port) - still got a clean shutdown after a few hours

2) Ordered a new USB 2.0 stick, arriving tomorrow - hopefully I can run for 30 days on a new stick with old config without transferring license... I will transfer the license if it solves the issue, of course.

3) Going to replace the CPU heatsink fan. The Intel box cooler wasn't cutting it under load, so I installed a Cooler Master Hyper 212 from an older build. Temps went from 70C+ to a max of 40C. But the fan is old, and I wonder if it's either stopping or reporting 0 RPM to the motherboard, which is then sending a "Halt" to the OS... It's a long shot, but I've got to investigate every possibility.

 

 

Link to comment
2 hours ago, VaiZore said:

hopefully I can run for 30 days on a new stick with old config without transferring license.

You can either set up a new trial, which will require reassigning your drives, or you can transfer your existing config, which will require transferring the license to start the array.

 

At this point, I recommend setting up the new stick for a fresh trial, but DON'T yet assign any drives or transfer anything from the old stick. Just boot into Unraid and let it sit unconfigured and see if it shuts down.

 

Have you set the CPU max power to a realistic figure, or left it to max power or whatever the board defaults to?

Link to comment
6 minutes ago, JonathanM said:

Have you set the CPU max power to a realistic figure, or left it to max power or whatever the board defaults to?

 

During first entry to the BIOS, you have to specify a cooler type, and that sets the TDP. I chose the lowest option ("Box cooler, 65W").

 

With the tower cooler installed, I've never seen the chip (i5-10400) go above 40C (as measured by Dynamix System Temp plugin and displayed on the GUI Dashboard).

 

I suppose it's possible that the reading is incorrect, or that the fan is failing, but I would expect a CPU overheat (or a fan reading 0 RPM) to result in an immediate power-off rather than a prolonged, clean shutdown. The shutdowns take 90+ seconds, and in that time Unraid is cleanly shutting down libvirt and docker, stopping the array, and unmounting disks. It doesn't feel like a panicking system powering off due to overheating!

 

 

Link to comment
  • 1 month later...
  • 3 weeks later...

Unfortunately this problem has re-appeared.

 

I gave up on Intel and dusted off an unused Ryzen 3700X. Different motherboard, different RAM, different PSU, and I installed a GPU (RTX 2060) since the CPU has no iGPU. The only hardware common to both builds is the case, the HDDs, and the SSDs.

 

I still experience unexpected clean and dirty shutdowns. Latest diag attached, and a few syslog snippets below:

 

May 11 14:35:45 TimeWaveZero webGUI: Successful login user root from <snip>
May 11 15:24:19 TimeWaveZero emhttpd: read SMART /dev/sdd
May 11 15:59:39 TimeWaveZero shutdown[8673]: shutting down for system halt
May 11 15:59:39 TimeWaveZero init: Switching to runlevel: 0
May 11 15:59:39 TimeWaveZero init: Trying to re-exec init
May 11 15:59:41 TimeWaveZero kernel: mdcmd (36): nocheck cancel
May 11 15:59:42 TimeWaveZero emhttpd: Spinning up all drives...
May 11 15:59:42 TimeWaveZero emhttpd: spinning up /dev/sdc
May 11 15:59:42 TimeWaveZero emhttpd: read SMART /dev/sdd
May 11 15:59:42 TimeWaveZero emhttpd: read SMART /dev/sdb
May 11 15:59:42 TimeWaveZero emhttpd: read SMART /dev/sdc
May 11 15:59:42 TimeWaveZero emhttpd: read SMART /dev/nvme0n1
May 11 15:59:42 TimeWaveZero emhttpd: read SMART /dev/sda
May 11 15:59:48 TimeWaveZero emhttpd: Stopping services...

<snip bunch of service shutdown spam>

May 11 16:00:07 TimeWaveZero emhttpd: shcmd (103006): rm -f /etc/avahi/services/smb.service
May 11 16:00:07 TimeWaveZero avahi-daemon[3852]: Files changed, reloading.
May 11 16:00:07 TimeWaveZero avahi-daemon[3852]: Service group file /services/smb.service vanished, removing services.
May 11 16:00:07 TimeWaveZero emhttpd: Stopping mover...
May 11 16:00:07 TimeWaveZero emhttpd: shcmd (103008): /usr/local/sbin/mover stop
May 11 16:00:07 TimeWaveZero root: mover: not running
May 11 16:00:07 TimeWaveZero emhttpd: Sync filesystems...
May 11 16:00:07 TimeWaveZero emhttpd: shcmd (103009): sync

<reboot>

May 11 16:02:28 TimeWaveZero root: Delaying execution of fix common problems scan for 10 minutes
May 11 16:02:28 TimeWaveZero emhttpd: /usr/local/emhttp/plugins/user.scripts/backgroundScript.sh "/tmp/user.scripts/tmpScripts/my_script/script" >/dev/null 2>&1
May 11 16:02:28 TimeWaveZero emhttpd: Starting services...
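For anyone reading a log like this, two questions tend to matter: what was logged just before the runlevel switch, and how long the orderly part of the shutdown ran before the log went quiet. The Python sketch below answers both for the snippet above. It is only an illustration: the function names, the 60-second gap threshold, and the three-line context window are assumptions, and the sample uses a handful of the lines quoted in this post.

```python
from collections import deque
from datetime import datetime
from typing import Iterable, List, Optional, Tuple


def parse_ts(line: str) -> datetime:
    # Syslog stamps carry no year, so strptime falls back to 1900. That is fine
    # for differences within one log, but not across a year boundary or Feb 29.
    return datetime.strptime(line[:15], "%b %d %H:%M:%S")


def analyse_shutdown(
    lines: Iterable[str], context: int = 3, min_gap_s: float = 60
) -> Optional[Tuple[List[str], float, float]]:
    """Return (lines before the switch, seconds logged after it, silent gap in seconds)."""
    recent = deque(maxlen=context)  # rolling window of lines seen before the switch
    start = prev = None
    before: List[str] = []
    for line in lines:
        ts = parse_ts(line)
        if start is None:
            if "init: Switching to runlevel:" in line:
                start, before = ts, list(recent)
            recent.append(line)
        elif (ts - prev).total_seconds() >= min_gap_s:
            # First long silence after the switch: treat it as the downtime.
            return before, (prev - start).total_seconds(), (ts - prev).total_seconds()
        prev = ts
    return None


SAMPLE = [
    "May 11 14:35:45 TimeWaveZero webGUI: Successful login user root from <snip>",
    "May 11 15:24:19 TimeWaveZero emhttpd: read SMART /dev/sdd",
    "May 11 15:59:39 TimeWaveZero shutdown[8673]: shutting down for system halt",
    "May 11 15:59:39 TimeWaveZero init: Switching to runlevel: 0",
    "May 11 15:59:48 TimeWaveZero emhttpd: Stopping services...",
    "May 11 16:00:07 TimeWaveZero emhttpd: shcmd (103009): sync",
    "May 11 16:02:28 TimeWaveZero emhttpd: Starting services...",
]

result = analyse_shutdown(SAMPLE)
if result is not None:
    before, logged_s, gap_s = result
    print("Last lines before the switch:")
    for entry in before:
        print("  " + entry)
    print(f"Shutdown activity logged for {logged_s:.0f} s, then silent for {gap_s:.0f} s")
```

On these lines it reports 28 seconds of logged shutdown activity followed by a 141-second silence, and the last line before the switch is the `shutdown[8673]` message.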

 

 

Anyone have any ideas? At this point I'm prepared to accept that it's Aliens or Solar Flares... Except all other machines on my network are stable...

timewavezero-diagnostics-20220516-1153.zip

Link to comment

I had something similar happen to me. Every day, a clean shutdown would start at random times during the day and night. I reviewed logs daily and fought with this for about 2 weeks. I finally started by replacing the PSU, but still had the same issues. Felt very defeated. With the server on, while tinkering in the case, I bumped a wire and a clean shutdown happened. I started to dig further and found that the wire from the power button on the case to the motherboard was loose. I think it would vibrate enough to "short" and act as if the power button on the case was quickly pressed. I re-seated it and never had the problem again. Might not be your situation, but might be worth looking into.

 

Good luck!

Link to comment
3 hours ago, VaiZore said:

At this point I'm prepared to accept that it's Aliens or Solar Flares... Except all other machines on my network are stable...

Might you have a local clown on your network???   🙄   😏

 

EDIT 1: --- IF you can, double check your wireless connections to see what users have been accessing it.  (The clown could be remote...)

 

EDIT 2: --- If your router will allow it, create a VLAN for IOT devices and isolate them from your home LAN.  Add a second VLAN for all 'Guests' to whom you want to provide WAN access which is also isolated from your LAN.

Edited by Frank1940
Link to comment
1 hour ago, unriadmidwest said:

I had something similar happen to me. Every day, a clean shutdown would start at random times during the day and night. I reviewed logs daily and fought with this for about 2 weeks. I finally started by replacing the PSU, but still had the same issues. Felt very defeated. With the server on, while tinkering in the case, I bumped a wire and a clean shutdown happened. I started to dig further and found that the wire from the power button on the case to the motherboard was loose. I think it would vibrate enough to "short" and act as if the power button on the case was quickly pressed. I re-seated it and never had the problem again. Might not be your situation, but might be worth looking into.

 

Good luck!

I really appreciate the input. This had occurred to me, so I now run with the case's power button cable disconnected from the motherboard pin headers. I power on manually when I need to. The issue persists. Thanks anyway!

Link to comment
1 hour ago, Frank1940 said:

Might you have a local clown on your network???   🙄   😏

 

EDIT 1: --- IF you can, double check your wireless connections to see what users have been accessing it.  (The clown could be remote...)

 

EDIT 2: --- If your router will allow it, create a VLAN for IOT devices and isolate them from your home LAN.  Add a second VLAN for all 'Guests' to whom you want to provide WAN access which is also isolated from your LAN.

Thanks for the suggestions. I'm only using the ISP-provided router, which certainly isn't the highest security available! I will check the router. Presumably, in order to shut down my server, the clown would still need my root credentials, which I have changed multiple times during troubleshooting?

No IOT devices yet :)

Link to comment
  • 4 weeks later...
  • Solution

Hi all. I consider this resolved now. Thank you to all who offered assistance.

 

The solution - I moved the equipment into a new case... (Fractal Define 7). 100% uptime since the change (weeks, now).

 

Since I had already ruled out the power and reset switches in the old case, I have no idea what the problem could have been. The metal panel in the old case (also a Fractal, a much older one) that supported the motherboard might have been slightly bent outwards (convex), but it didn't look like it was far enough out for the metal to make contact with any components or traces on the back of the motherboard, or any of the 'keep out' zones. Plus, I used all of the correct standoffs for the form factor, so it's unlikely to have been a short circuit. Maybe there was some other defect that wasn't obvious: poorly shielded header cables, improper grounding, or something similar... I'm not an electrical engineer, so take these guesses with a grain of salt!

 

Either way, it wasn't the fault of Unraid, which I can now enjoy uninterrupted. I have since bought a second license for a backup server, which is also rock-solid. I'm very much enjoying the product again.

Thanks to @ChatNoir and @JonathanM and all of the members who contributed their assistance.
