Why is my server crashing?


Solved by Howboys

So over the last 24 hours or so, it seems like my server is just crashing.

 

I have no idea why. Once it crashed and rebooted in the middle of the night, and once just a few mins ago.

 

In both cases, when it started back up, the nvme cache drive was missing (probably due to it being in a bad state after an unclean shutdown). Rebooting didn't help but clean shutdown + start up does. When it crashed a few mins ago and restarted back up, I grabbed the attached diagnostic report (but didn't do a shutdown so my cache only has one SSD now).

 

I'm not really pushing the limits of my server or anything, so I don't know why it's been restarting.

tower-diagnostics-20220808-1935.zip


OK, it happened again just now. Here's the syslog around the time of the crash (11:36 local time).

 

What the attached file doesn't show is what came right before:

 

Aug 14 11:28:52 Tower kernel: docker0: port 1(veth63b1366) entered blocking state
Aug 14 11:28:52 Tower kernel: docker0: port 1(veth63b1366) entered disabled state
Aug 14 11:28:52 Tower kernel: device veth63b1366 entered promiscuous mode
Aug 14 11:28:53 Tower kernel: eth0: renamed from veth2a1387c
Aug 14 11:28:53 Tower kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth63b1366: link becomes ready
Aug 14 11:28:53 Tower kernel: docker0: port 1(veth63b1366) entered blocking state
Aug 14 11:28:53 Tower kernel: docker0: port 1(veth63b1366) entered forwarding state
Aug 14 11:29:19 Tower sshd[27775]: Connection from $ANOTHER_LOCAL_MACHINE_IP port 34804 on $UNRAID_IP port 22 rdomain ""
Aug 14 11:29:19 Tower sshd[27775]: error: kex_exchange_identification: Connection closed by remote host
Aug 14 11:29:19 Tower sshd[27775]: Connection closed by $ANOTHER_LOCAL_MACHINE_IP port 34804
Aug 14 11:30:19 Tower sshd[29895]: Connection from $ANOTHER_LOCAL_MACHINE_IP port 35046 on $UNRAID_IP port 22 rdomain ""
Aug 14 11:30:19 Tower sshd[29895]: error: kex_exchange_identification: Connection closed by remote host
Aug 14 11:30:19 Tower sshd[29895]: Connection closed by $ANOTHER_LOCAL_MACHINE_IP port 35046
Aug 14 11:31:19 Tower sshd[31937]: Connection from $ANOTHER_LOCAL_MACHINE_IP port 35272 on $UNRAID_IP port 22 rdomain ""
Aug 14 11:31:19 Tower sshd[31937]: error: kex_exchange_identification: Connection closed by remote host
Aug 14 11:31:19 Tower sshd[31937]: Connection closed by $ANOTHER_LOCAL_MACHINE_IP port 35272
Aug 14 11:32:19 Tower sshd[1650]: Connection from $ANOTHER_LOCAL_MACHINE_IP port 35494 on $UNRAID_IP port 22 rdomain ""
Aug 14 11:32:19 Tower sshd[1650]: error: kex_exchange_identification: Connection closed by remote host
Aug 14 11:32:19 Tower sshd[1650]: Connection closed by $ANOTHER_LOCAL_MACHINE_IP port 35494
Aug 14 11:33:19 Tower sshd[3711]: Connection from $ANOTHER_LOCAL_MACHINE_IP port 35720 on $UNRAID_IP port 22 rdomain ""
Aug 14 11:33:19 Tower sshd[3711]: error: kex_exchange_identification: Connection closed by remote host
Aug 14 11:33:19 Tower sshd[3711]: Connection closed by $ANOTHER_LOCAL_MACHINE_IP port 35720
Aug 14 11:36:19 Tower kernel: Linux version 5.15.46-Unraid (root@Develop) (gcc (GCC) 11.2.0, GNU ld version 2.37-slack15) #1 SMP Fri Jun 10 11:08:41 PDT 2022
Aug 14 11:36:19 Tower kernel: Command line: BOOT_IMAGE=/bzimage initrd=/bzroot

 

From what I can tell, there are basically no useful logs preceding the crash. The logs I've attached seem to correspond to the server rebooting after the crash.

 

So, what else can I provide or enable for more visibility?

 

Just to provide more info:

 

- My build: Gigabyte B450 Aorus M (BIOS F63a), Ryzen 7 1700X

- I've already run memtest and verified the memory is okay (I ran it when I got the RAM a month or so ago)

- I have `rcu_nocbs=0-15` added to my flash config (https://forums.unraid.net/topic/61129-ryzen-freezes/)

- I'm not running any VMs, just Docker containers and storage shares

- There's no significant load on the server at the time of the crashes. In fact, things are totally fine any time I run a CPU or GPU stress test.

 

 



Since nothing relevant was logged, it's likely some hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.


I guess it could be my memory though I'm not sure.

 

According to the BIOS it's running at 2133 MHz (though it's rated for 3000 MHz), and I've got two sticks in the 2nd and 4th slots away from the CPU. According to the internet, that should be the right configuration?

 

[screenshots attached]


So I was just double-checking my Unraid config and noticed that I had not added `rcu_nocbs=0-15` to the append line, but had instead put it on a new line. I've now added it to the append line, so hopefully that fixes it?
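For reference, this is roughly what the corrected stanza in `syslinux.cfg` should look like (the label text is the typical Unraid default and may differ on other installs; the key part is that `rcu_nocbs=0-15` sits on the `append` line itself):

```
label Unraid OS
  menu default
  kernel /bzimage
  append rcu_nocbs=0-15 initrd=/bzroot
```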

 

[screenshot attached]

 

The other change I've made is to enable global C-states but disable C6 using zenstates in /config/go.
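Concretely, that's just a couple of lines appended to the `go` file (I'm using the ZenStates-Linux script; the path below is just where I happened to put it, so adjust for your setup):

```shell
# appended to /boot/config/go:
# disable the C6 state only, leaving global C-states enabled in BIOS
python /boot/config/zenstates.py --c6-disable
```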

 

[screenshot attached]

 

If after all that the server crashes, I'll disable global C-states and see if that does it.

 

I'm pretty confident that correctly applying `rcu_nocbs=0-15` will resolve this, but only time will tell.


It happened again, and here are the logs:

 

Aug 16 19:51:52 Tower shfs: share cache full
Aug 16 19:51:53 Tower shfs: share cache full
Aug 16 19:51:53 Tower shfs: share cache full
Aug 16 19:51:53 Tower shfs: share cache full
Aug 16 19:51:53 Tower shfs: share cache full
Aug 16 19:51:53 Tower shfs: share cache full
Aug 16 19:51:55 Tower shfs: share cache full
Aug 16 19:51:55 Tower shfs: share cache full
Aug 16 19:51:55 Tower shfs: share cache full
Aug 16 19:51:55 Tower shfs: share cache full
Aug 16 19:51:55 Tower shfs: share cache full
Aug 16 19:53:03 Tower kernel: Linux version 5.15.46-Unraid (root@Develop) (gcc (GCC) 11.2.0, GNU ld version 2.37-slack15) #1 SMP Fri Jun 10 11:08:41 PDT 2022
Aug 16 19:53:03 Tower kernel: Command line: BOOT_IMAGE=/bzimage rcu_nocbs=0-15 initrd=/bzroot
Aug 16 19:53:03 Tower kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Aug 16 19:53:03 Tower kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Aug 16 19:53:03 Tower kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Aug 16 19:53:03 Tower kernel: x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256

 

The cache is definitely not full (though these messages seem connected to the crash?):

 

[screenshot attached]

 

Attaching diagnostics.

tower-diagnostics-20220816-2000.zip


From:

 

> "Share cache full" means the share to where you are transferring is hitting the minimum free space setting for that share, not the cache, so the transfer would bypass your cache device and go directly to the array

 

That does seem to be the case. One of my shares has "Minimum free space" set to 200GB:

 

[screenshot of share settings attached]
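If I'm reading the quoted explanation right, the decision works something like this (my own sketch to check my understanding; the names are made up, and this isn't actual shfs code):

```python
def write_target(cache_free_bytes, min_free_bytes):
    """Hypothetical sketch of the decision shfs appears to make.

    'Minimum free space' is a per-share setting: once the cache pool's
    free space drops below it, new writes for that share skip the cache
    and land directly on the array, and shfs logs "share cache full".
    """
    if cache_free_bytes < min_free_bytes:
        return "array"  # cache bypassed -> "share cache full" in syslog
    return "cache"


GB = 10**9
# With "Minimum free space" = 200GB, a cache with 350GB free still takes
# writes, while one with only 150GB free gets bypassed.
print(write_target(350 * GB, 200 * GB))  # cache
print(write_target(150 * GB, 200 * GB))  # array
```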

 

Even if that's the logic, I don't understand why bypassing the cache would crash the server.

On 8/15/2022 at 7:49 AM, JorgeB said:

Since nothing relevant was logged, it's likely some hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

 

So I'm now basically in a crash loop: every 20 minutes or so the server crashes and restarts (though the original symptom of the missing nvme SSD is gone).

 

Do you have any idea what hardware might be the issue? Could it be the RAM or the CPU or something else given the symptoms?

  • Solution

I think I may have this solved.

 

I bought a new PSU and new RAM, and when I went to replace the PSU, I noticed that one of my current RAM sticks was not seated all the way. I also noticed that the ATX 24-pin cable was a tiny bit loose. Anyway, after plugging both in properly, my current uptime is almost 2 days with no hiccups.

 

If the server stays up for a few more days, I'll upgrade to a bigger case so my SATA power cables aren't pushing against the ATX port (which seems to be what was happening).

 

Consider this solved (I'll post if I notice issues again).

