Howboys Posted August 9, 2022 (edited)

Over the last 24 hours or so, my server has been crashing and I have no idea why. It crashed and rebooted once in the middle of the night, and once just a few minutes ago. In both cases, when it started back up, the NVMe cache drive was missing (probably because it was left in a bad state after the unclean shutdown). Rebooting didn't help, but a clean shutdown followed by a start-up does.

When it crashed a few minutes ago and restarted, I grabbed the attached diagnostic report (but I didn't do a shutdown, so my cache only has one SSD now).

I'm not really pushing the limits of my server or anything, so I don't know why it's been restarting.

tower-diagnostics-20220808-1935.zip

Edited August 9, 2022 by Howboys
JorgeB Posted August 9, 2022

Enable the syslog server and post that after a crash.
Howboys Posted August 9, 2022

Ok, it happened again last night, but I didn't have the syslog server enabled then. This time the server didn't restart itself, though the PC's power light was on and the disks were spinning; there was no output/signal on the monitor.

I've enabled the syslog server now.
Howboys Posted August 14, 2022 (edited)

Ok, it happened just now. Here's the syslog around the time of the crash (11:36 local time). What's not in the attached file is what comes right before:

Aug 14 11:28:52 Tower kernel: docker0: port 1(veth63b1366) entered blocking state
Aug 14 11:28:52 Tower kernel: docker0: port 1(veth63b1366) entered disabled state
Aug 14 11:28:52 Tower kernel: device veth63b1366 entered promiscuous mode
Aug 14 11:28:53 Tower kernel: eth0: renamed from veth2a1387c
Aug 14 11:28:53 Tower kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth63b1366: link becomes ready
Aug 14 11:28:53 Tower kernel: docker0: port 1(veth63b1366) entered blocking state
Aug 14 11:28:53 Tower kernel: docker0: port 1(veth63b1366) entered forwarding state
Aug 14 11:29:19 Tower sshd[27775]: Connection from $ANOTHER_LOCAL_MACHINE_IP port 34804 on $UNRAID_IP port 22 rdomain ""
Aug 14 11:29:19 Tower sshd[27775]: error: kex_exchange_identification: Connection closed by remote host
Aug 14 11:29:19 Tower sshd[27775]: Connection closed by $ANOTHER_LOCAL_MACHINE_IP port 34804
Aug 14 11:30:19 Tower sshd[29895]: Connection from $ANOTHER_LOCAL_MACHINE_IP port 35046 on $UNRAID_IP port 22 rdomain ""
Aug 14 11:30:19 Tower sshd[29895]: error: kex_exchange_identification: Connection closed by remote host
Aug 14 11:30:19 Tower sshd[29895]: Connection closed by $ANOTHER_LOCAL_MACHINE_IP port 35046
Aug 14 11:31:19 Tower sshd[31937]: Connection from $ANOTHER_LOCAL_MACHINE_IP port 35272 on $UNRAID_IP port 22 rdomain ""
Aug 14 11:31:19 Tower sshd[31937]: error: kex_exchange_identification: Connection closed by remote host
Aug 14 11:31:19 Tower sshd[31937]: Connection closed by $ANOTHER_LOCAL_MACHINE_IP port 35272
Aug 14 11:32:19 Tower sshd[1650]: Connection from $ANOTHER_LOCAL_MACHINE_IP port 35494 on $UNRAID_IP port 22 rdomain ""
Aug 14 11:32:19 Tower sshd[1650]: error: kex_exchange_identification: Connection closed by remote host
Aug 14 11:32:19 Tower sshd[1650]: Connection closed by $ANOTHER_LOCAL_MACHINE_IP port 35494
Aug 14 11:33:19 Tower sshd[3711]: Connection from $ANOTHER_LOCAL_MACHINE_IP port 35720 on $UNRAID_IP port 22 rdomain ""
Aug 14 11:33:19 Tower sshd[3711]: error: kex_exchange_identification: Connection closed by remote host
Aug 14 11:33:19 Tower sshd[3711]: Connection closed by $ANOTHER_LOCAL_MACHINE_IP port 35720
Aug 14 11:36:19 Tower kernel: Linux version 5.15.46-Unraid (root@Develop) (gcc (GCC) 11.2.0, GNU ld version 2.37-slack15) #1 SMP Fri Jun 10 11:08:41 PDT 2022
Aug 14 11:36:19 Tower kernel: Command line: BOOT_IMAGE=/bzimage initrd=/bzroot

From what I can tell, there are basically no useful logs preceding the crash. The logs I've attached seem to correspond to the server reboot after the crash.

So, what else can I provide or enable for more visibility?

Just to provide more info:
- My build is a Gigabyte B450 Aorus M (BIOS F63a) with a Ryzen 7 1700X
- I've already run memtest and made sure the memory is okay (I did the memtest when I got the RAM a month or so ago)
- I have `rcu_nocbs=0-15` added to my flash config (https://forums.unraid.net/topic/61129-ryzen-freezes/)
- I'm not running any VMs, just Docker containers and storage shares
- There's no significant load on the server at the time of the crash. In fact, things are totally fine any time I do a CPU or GPU stress test.

sys

Edited August 15, 2022 by Howboys
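When a syslog contains nothing obvious, it can help to pull out the last few lines written before each boot banner, since those are the final entries that made it to the log before the crash. The following is only a minimal sketch: the function name `lines_before_boots` and the choice of `kernel: Linux version` as the boot marker are assumptions made for illustration, based on the log excerpt in this post.

```python
from typing import List, Tuple

# Assumption: a fresh boot is identified by the kernel version banner,
# as seen in the excerpt above ("Tower kernel: Linux version ...").
BOOT_MARKER = "kernel: Linux version"


def lines_before_boots(log_lines: List[str], context: int = 3) -> List[Tuple[int, List[str]]]:
    """For every boot banner, return its index and the `context` lines before it."""
    results = []
    for i, line in enumerate(log_lines):
        if BOOT_MARKER in line:
            start = max(0, i - context)
            results.append((i, log_lines[start:i]))
    return results


if __name__ == "__main__":
    # Two lines taken from the excerpt above, plus the boot banner.
    sample = [
        "Aug 14 11:33:19 Tower sshd[3711]: error: kex_exchange_identification: Connection closed by remote host",
        "Aug 14 11:33:19 Tower sshd[3711]: Connection closed by $ANOTHER_LOCAL_MACHINE_IP port 35720",
        "Aug 14 11:36:19 Tower kernel: Linux version 5.15.46-Unraid (root@Develop)",
    ]
    for index, before in lines_before_boots(sample, context=2):
        print(f"boot banner at line {index}; last entries before it:")
        for entry in before:
            print("   ", entry)
```

In the excerpt above, the last entry before the banner is at 11:33:19 and the banner is at 11:36:19, so the crash happened somewhere in that three-minute window without anything being logged.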
JorgeB Posted August 15, 2022

13 hours ago, Howboys said:
> Ryzen 7 1700X

See here: https://forums.unraid.net/topic/46802-faq-for-unraid-v6/?do=findComment&comment=819173
Howboys Posted August 15, 2022

I should've mentioned I've already done all that when I built the server.
JorgeB Posted August 15, 2022

In that case, and since nothing relevant was logged, it's likely a hardware issue. One thing you can try is to boot the server in safe mode with all Docker containers/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.
Howboys Posted August 15, 2022 (edited)

I guess it could be my memory, though I'm not sure. According to the BIOS it's running at 2133 MHz (though it's rated for 3000 MHz), and I've got 2 sticks in the 2nd and 4th slots away from the CPU. According to the Internet, that should be the right configuration?

Edited August 15, 2022 by Howboys
Howboys Posted August 15, 2022 (edited)

So I was just double-checking my Unraid config and noticed that I had added `rcu_nocbs=0-15` on a new line instead of on the append line. I've now added it to the append line, so hopefully that fixes it?

The other change I've made is to enable global C-states but disable C6 using zenstates in /config/go. If the server still crashes after all that, I'll disable global C-states and see if that does it.

I'm pretty confident that correctly applying `rcu_nocbs=0-15` will resolve this, but only time will tell.

Edited August 15, 2022 by Howboys
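One way to confirm that a boot parameter such as `rcu_nocbs=0-15` really took effect is to look at the kernel command line, which Linux exposes in `/proc/cmdline` and also prints in the syslog at boot ("Command line: ..."). The snippet below is a minimal sketch under that assumption; the helper name `has_kernel_param` is hypothetical and not part of any Unraid tooling.

```python
def has_kernel_param(cmdline: str, name: str) -> bool:
    """Return True if `name` appears as a parameter (with or without a value)."""
    for token in cmdline.split():
        key = token.split("=", 1)[0]
        if key == name:
            return True
    return False


if __name__ == "__main__":
    # Command line logged in this thread before the fix (Aug 14 boot).
    before_fix = "BOOT_IMAGE=/bzimage initrd=/bzroot"
    print("before fix:", has_kernel_param(before_fix, "rcu_nocbs"))

    # On a running Linux system, check the live command line instead.
    try:
        with open("/proc/cmdline", "r", encoding="utf-8") as handle:
            live = handle.read().strip()
        print("live system:", has_kernel_param(live, "rcu_nocbs"))
    except OSError:
        print("/proc/cmdline is not available on this system")
```

The Aug 14 boot log in this thread shows `Command line: BOOT_IMAGE=/bzimage initrd=/bzroot`, which is consistent with the parameter having been placed on the wrong line of the flash config at that point.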
Howboys Posted August 17, 2022

It happened again and here are the logs:

Aug 16 19:51:52 Tower shfs: share cache full
Aug 16 19:51:53 Tower shfs: share cache full
Aug 16 19:51:53 Tower shfs: share cache full
Aug 16 19:51:53 Tower shfs: share cache full
Aug 16 19:51:53 Tower shfs: share cache full
Aug 16 19:51:53 Tower shfs: share cache full
Aug 16 19:51:55 Tower shfs: share cache full
Aug 16 19:51:55 Tower shfs: share cache full
Aug 16 19:51:55 Tower shfs: share cache full
Aug 16 19:51:55 Tower shfs: share cache full
Aug 16 19:51:55 Tower shfs: share cache full
Aug 16 19:53:03 Tower kernel: Linux version 5.15.46-Unraid (root@Develop) (gcc (GCC) 11.2.0, GNU ld version 2.37-slack15) #1 SMP Fri Jun 10 11:08:41 PDT 2022
Aug 16 19:53:03 Tower kernel: Command line: BOOT_IMAGE=/bzimage rcu_nocbs=0-15 initrd=/bzroot
Aug 16 19:53:03 Tower kernel: x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Aug 16 19:53:03 Tower kernel: x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Aug 16 19:53:03 Tower kernel: x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Aug 16 19:53:03 Tower kernel: x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256

Cache is definitely not full (which seems to have caused the crash?):

Attaching diagnostics.

tower-diagnostics-20220816-2000.zip
Howboys Posted August 17, 2022

From

> "Share cache full" means the share to where you are transferring is hitting the minimum free space setting, for that share, not cache, so the transfer would bypass your cache device and go directly to the array

That does seem to be the case. One of my shares has "Minimum free space" set to 200GB:

I don't understand what the logic is here for cache and cache bypass, or why it would cause a server crash.
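The behaviour described in the quote can be pictured with a toy model. This is only a sketch of that description, not Unraid's actual implementation: the function `choose_write_target`, its parameters, and the comparison it makes are all assumptions made for illustration. The idea is that a new write goes to the cache only while the cache has more free space than the share's "Minimum free space" setting; otherwise it bypasses the cache and lands on the array.

```python
def choose_write_target(cache_free_gb: int, share_min_free_gb: int) -> str:
    """Toy model: decide where a new file for a cached share would be written.

    Assumption (from the quoted explanation, not from Unraid's source):
    once free space on the cache drops below the share's minimum free
    space setting, new writes skip the cache and go straight to the array.
    """
    if cache_free_gb < share_min_free_gb:
        return "array"
    return "cache"


if __name__ == "__main__":
    # The share in this thread uses a 200GB minimum free space setting.
    # The cache free-space figures below are made-up example values.
    for cache_free in (500, 150):
        target = choose_write_target(cache_free, share_min_free_gb=200)
        print(f"cache free: {cache_free} GB -> write goes to {target}")
```

Under this model, a "share cache full" message would only mean that writes are being redirected to the array, which by itself would not explain a crash.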
Howboys Posted August 17, 2022

On 8/15/2022 at 7:49 AM, JorgeB said:
> Then and since nothing relevant was logged it's likely some hardware issue, one thing you can try is to boot the server in safe mode with all docker/VMs disable, let it run as a basic NAS for a few days, if it still crashes it's likely a hardware problem, if it doesn't start turning on the other services one by one.

So I'm now in basically a crash loop - every 20 minutes or so the server just crashes and restarts (though the original symptom of the NVMe SSD going missing is not present anymore).

Do you have any idea what hardware might be the issue? Could it be the RAM or the CPU or something else, given the symptoms?
JorgeB Posted August 17, 2022

PSU, board or RAM would be my main suspects, but it could be something else.
Solution: Howboys Posted August 20, 2022

I think I may have this solved.

I bought a new PSU and new RAM, and when I went to replace the PSU, I noticed that one of my current RAM sticks was not seated all the way. I also noticed that the ATX 24-pin cable was a teeny bit loose.

Anyways, after plugging them in right, my current uptime has been almost 2 days with no hiccups. If the server stays up for a few more days, I'll upgrade to a bigger case so my SATA power cables aren't pushing against the ATX port (which is what seems to be happening).

Consider this solved (I'll post if I notice issues again).