Why is my server crashing?


Solved by Howboys

So over the last 24 hours or so, it seems like my server is just crashing.

 

I have no idea why. Once it crashed and rebooted in the middle of the night, and once just a few mins ago.

 

In both cases, when it started back up, the nvme cache drive was missing (probably due to it being in a bad state after an unclean shutdown). Rebooting didn't help but clean shutdown + start up does. When it crashed a few mins ago and restarted back up, I grabbed the attached diagnostic report (but didn't do a shutdown so my cache only has one SSD now).

 

I'm not really pushing the limits of my server or anything, so I don't know why it's been restarting.

tower-diagnostics-20220808-1935.zip


OK, it happened again just now. Here's the syslog around the time of the crash (11:36 local time).

 

What the attached file doesn't show is what came right before:

 

Aug 14 11:28:52 Tower kernel: docker0: port 1(veth63b1366) entered blocking state
Aug 14 11:28:52 Tower kernel: docker0: port 1(veth63b1366) entered disabled state
Aug 14 11:28:52 Tower kernel: device veth63b1366 entered promiscuous mode
Aug 14 11:28:53 Tower kernel: eth0: renamed from veth2a1387c
Aug 14 11:28:53 Tower kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth63b1366: link becomes ready
Aug 14 11:28:53 Tower kernel: docker0: port 1(veth63b1366) entered blocking state
Aug 14 11:28:53 Tower kernel: docker0: port 1(veth63b1366) entered forwarding state
Aug 14 11:29:19 Tower sshd[27775]: Connection from $ANOTHER_LOCAL_MACHINE_IP port 34804 on $UNRAID_IP port 22 rdomain ""
Aug 14 11:29:19 Tower sshd[27775]: error: kex_exchange_identification: Connection closed by remote host
Aug 14 11:29:19 Tower sshd[27775]: Connection closed by $ANOTHER_LOCAL_MACHINE_IP port 34804
Aug 14 11:30:19 Tower sshd[29895]: Connection from $ANOTHER_LOCAL_MACHINE_IP port 35046 on $UNRAID_IP port 22 rdomain ""
Aug 14 11:30:19 Tower sshd[29895]: error: kex_exchange_identification: Connection closed by remote host
Aug 14 11:30:19 Tower sshd[29895]: Connection closed by $ANOTHER_LOCAL_MACHINE_IP port 35046
Aug 14 11:31:19 Tower sshd[31937]: Connection from $ANOTHER_LOCAL_MACHINE_IP port 35272 on $UNRAID_IP port 22 rdomain ""
Aug 14 11:31:19 Tower sshd[31937]: error: kex_exchange_identification: Connection closed by remote host
Aug 14 11:31:19 Tower sshd[31937]: Connection closed by $ANOTHER_LOCAL_MACHINE_IP port 35272
Aug 14 11:32:19 Tower sshd[1650]: Connection from $ANOTHER_LOCAL_MACHINE_IP port 35494 on $UNRAID_IP port 22 rdomain ""
Aug 14 11:32:19 Tower sshd[1650]: error: kex_exchange_identification: Connection closed by remote host
Aug 14 11:32:19 Tower sshd[1650]: Connection closed by $ANOTHER_LOCAL_MACHINE_IP port 35494
Aug 14 11:33:19 Tower sshd[3711]: Connection from $ANOTHER_LOCAL_MACHINE_IP port 35720 on $UNRAID_IP port 22 rdomain ""
Aug 14 11:33:19 Tower sshd[3711]: error: kex_exchange_identification: Connection closed by remote host
Aug 14 11:33:19 Tower sshd[3711]: Connection closed by $ANOTHER_LOCAL_MACHINE_IP port 35720
Aug 14 11:36:19 Tower kernel: Linux version 5.15.46-Unraid (root@Develop) (gcc (GCC) 11.2.0, GNU ld version 2.37-slack15) #1 SMP Fri Jun 10 11:08:41 PDT 2022
Aug 14 11:36:19 Tower kernel: Command line: BOOT_IMAGE=/bzimage initrd=/bzroot

 

From what I can tell, there are basically no useful logs preceding the crash. The logs I've attached seem to correspond to the server rebooting after the crash.

 

So, what else can I provide or enable for more visibility?

 

Just to provide more info:

 

- My build: Gigabyte B450 Aorus M (BIOS F63a), Ryzen 7 1700X

- I've already run memtest and verified the memory is okay (I ran it when I got the RAM a month or so ago)

- I have `rcu_nocbs=0-15` added to my flash config (https://forums.unraid.net/topic/61129-ryzen-freezes/)

- I'm not running any VMs, just Docker containers and storage shares

- There's no significant load on the server at the time of the crashes. In fact, things are totally fine any time I run a CPU or GPU stress test.

 

 



Since nothing relevant was logged, it's likely some hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.


I guess it could be my memory though I'm not sure.

 

According to the BIOS it's running at 2133 MHz (though it's rated for 3000 MHz), and I've got two sticks in the 2nd and 4th slots away from the CPU. According to the internet, that should be the right configuration?

 

[screenshots attached]


So I was just double-checking my Unraid config and noticed that I had not added `rcu_nocbs=0-15` to the append line, but had instead put it on a new line. I've now added it to the append line, so hopefully that fixes it?
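For reference, this is roughly what the corrected stanza in `syslinux.cfg` should look like (the label text is the typical Unraid default and may differ on other installs; the key part is that `rcu_nocbs=0-15` sits on the `append` line itself):

```
label Unraid OS
  menu default
  kernel /bzimage
  append rcu_nocbs=0-15 initrd=/bzroot
```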

 

[screenshot attached]

 

The other change I've made is to enable global C-states but disable C6 using zenstates in /config/go.
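Concretely, that's just a couple of lines appended to the `go` file (I'm using the ZenStates-Linux script; the path below is just where I happened to put it, so adjust for your setup):

```shell
# appended to /boot/config/go:
# disable the C6 state only, leaving global C-states enabled in BIOS
python /boot/config/zenstates.py --c6-disable
```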

 

[screenshot attached]

 

If after all that the server crashes, I'll disable global C-states and see if that does it.

 

I'm pretty confident that correctly applying `rcu_nocbs=0-15` will resolve this, but only time will tell.


It happened again, and here are the logs:

 

Aug 16 19:51:52 Tower shfs: share cache full
Aug 16 19:51:53 Tower shfs: share cache full
Aug 16 19:51:53 Tower shfs: share cache full
Aug 16 19:51:53 Tower shfs: share cache full
Aug 16 19:51:53 Tower shfs: share cache full
Aug 16 19:51:53 Tower shfs: share cache full
Aug 16 19:51:55 Tower shfs: share cache full
Aug 16 19:51:55 Tower shfs: share cache full
Aug 16 19:51:55 Tower shfs: share cache full
Aug 16 19:51:55 Tower shfs: share cache full
Aug 16 19:51:55 Tower shfs: share cache full
Aug 16 19:53:03 Tower kernel: Linux version 5.15.46-Unraid (root@Develop) (gcc (GCC) 11.2.0, GNU ld version 2.37-slack15) #1 SMP Fri Jun 10 11:08:41 PDT 2022
Aug 16 19:53:03 Tower kernel: Command line: BOOT_IMAGE=/bzimage rcu_nocbs=0-15 initrd=/bzroot
Aug 16 19:53:03 Tower kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Aug 16 19:53:03 Tower kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Aug 16 19:53:03 Tower kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Aug 16 19:53:03 Tower kernel: x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256

 

The cache is definitely not full (though these messages seem connected to the crash?):

 

[screenshot attached]

 

Attaching diagnostics.

tower-diagnostics-20220816-2000.zip


From:

 

> "Share cache full" means the share to where you are transferring is hitting the minimum free space setting for that share, not the cache, so the transfer would bypass your cache device and go directly to the array

 

That does seem to be the case. One of my shares has "Minimum free space" set to 200GB:

 

[screenshot of share settings attached]
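If I'm reading the quoted explanation right, the decision works something like this (my own sketch to check my understanding; the names are made up, and this isn't actual shfs code):

```python
def write_target(cache_free_bytes, min_free_bytes):
    """Hypothetical sketch of the decision shfs appears to make.

    'Minimum free space' is a per-share setting: once the cache pool's
    free space drops below it, new writes for that share skip the cache
    and land directly on the array, and shfs logs "share cache full".
    """
    if cache_free_bytes < min_free_bytes:
        return "array"  # cache bypassed -> "share cache full" in syslog
    return "cache"


GB = 10**9
# With "Minimum free space" = 200GB, a cache with 350GB free still takes
# writes, while one with only 150GB free gets bypassed.
print(write_target(350 * GB, 200 * GB))  # cache
print(write_target(150 * GB, 200 * GB))  # array
```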

 

Even if that's the logic, I don't understand why bypassing the cache would crash the server.

On 8/15/2022 at 7:49 AM, JorgeB said:

Since nothing relevant was logged, it's likely some hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.

 

So I'm now basically in a crash loop: every 20 minutes or so the server crashes and restarts (though the original symptom of the missing nvme SSD is gone).

 

Do you have any idea what hardware might be the issue? Could it be the RAM or the CPU or something else given the symptoms?

  • Solution

I think I may have this solved.

 

I bought a new PSU and new RAM, and when I went to replace the PSU, I noticed that one of my current RAM sticks was not seated all the way. I also noticed that the ATX 24-pin cable was a tiny bit loose. Anyway, after plugging both in properly, my current uptime is almost 2 days with no hiccups.

 

If the server stays up for a few more days, I'll upgrade to a bigger case so my SATA power cables aren't pushing against the ATX port (which seems to be what was happening).

 

Consider this solved (I'll post if I notice issues again).

