kstrike155 Posted September 24, 2023 (edited)

General system specs are:

AMD Ryzen 5 1600
ASRock B450 Pro4
2x8 GB DDR4-2400 ECC RAM
3x data + 1x parity drives (all 4TB, mix of WD Red and Blue) and a single NVMe cache disk

Been running Unraid for years without much issue, but recently my system has been becoming completely unresponsive: no console, no SSH, no nothing. It happens about once a week on average. I have to hard reboot using the power button. I've attached diagnostics from shortly after a reboot.

I did have a heck of a time upgrading to 6.12.4 and had to rebuild my USB stick. Then the My Servers recovery didn't image the stick properly, so I had to start over with my own backup. Not sure if it's related, though it seems like the hangs might have started after the upgrade.

I added an old video card to the machine while debugging the upgrade (so I could use a monitor), and left it in the machine. I am going to try removing it to see if that fixes anything, but wondering if there are any other ideas?

I also noticed I've been receiving some errors on my parity disk, which I will replace, but I would hope a disk failure wouldn't bring my whole server down.
Here's syslog output from before and after the reboot (full syslog after reboot in the diagnostics); nothing of note really:

Sep 23 23:27:48 homer emhttpd: read SMART /dev/sdd
Sep 23 23:27:57 homer emhttpd: read SMART /dev/sdc
Sep 24 00:01:33 homer root: /mnt/cache: 858.9 GiB (922254512128 bytes) trimmed on /dev/nvme0n1p1
Sep 24 00:28:06 homer emhttpd: spinning down /dev/sdc
Sep 24 00:39:56 homer sshd[16907]: Received disconnect from 192.168.1.228 port 55478:11: Normal Shutdown
Sep 24 00:39:56 homer sshd[16907]: Disconnected from user root 192.168.1.228 port 55478
Sep 24 00:55:24 homer emhttpd: spinning down /dev/sdd
Sep 24 02:02:56 homer emhttpd: read SMART /dev/sdc
=================================== REBOOT ========================================
Sep 24 07:42:55 homer kernel: Linux version 6.1.49-Unraid (root@Develop-612) (gcc (GCC) 12.2.0, GNU ld version 2.40-slack151) #1 SMP PREEMPT_DYNAMIC Wed Aug 30 09:42:35 PDT 2023
Sep 24 07:42:55 homer kernel: Command line: BOOT_IMAGE=/bzimage initrd=/bzroot
Sep 24 07:42:55 homer kernel: BIOS-provided physical RAM map:
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009d3ff] usable
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x000000000009d400-0x000000000009ffff] reserved
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x0000000000100000-0x0000000009d01fff] usable
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x0000000009d02000-0x0000000009ffffff] reserved
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x000000000a000000-0x000000000a1fffff] usable
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x000000000a200000-0x000000000a20afff] ACPI NVS
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x000000000a20b000-0x000000000affffff] usable
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x000000000b000000-0x000000000b01ffff] reserved
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x000000000b020000-0x00000000db1c8fff] usable

Attachment: homer-diagnostics-20230924-0800.zip

Edited September 24, 2023 by kstrike155

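For anyone reading a syslog like the one above, a quick way to see how long the box was silent before each boot is to find the last timestamped line ahead of every kernel boot banner. This is only a rough sketch: it assumes the classic "Mon DD HH:MM:SS" syslog prefix seen in the excerpt, treats "kernel: Linux version" as the boot marker, and needs the year passed in because the log lines don't carry one. The function names are made up for illustration.

```python
from datetime import datetime

# Assumption: the first kernel line of each boot contains this text,
# as it does in the excerpt posted above.
BOOT_MARKER = "kernel: Linux version"


def parse_stamp(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' of a syslog line; None if absent."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def silent_gaps(lines, year):
    """Yield (last line before a boot, time between that line and the boot)."""
    previous = None  # (timestamp, text) of the most recent timestamped line
    for raw in lines:
        line = raw.rstrip("\n")
        stamp = parse_stamp(line, year)
        if stamp is None:
            continue  # separators and other unstamped lines are skipped
        if BOOT_MARKER in line and previous is not None:
            yield previous[1], stamp - previous[0]
        previous = (stamp, line)
```

Run over the excerpt above with year 2023, this reports a gap of 5:39:59 between the last "read SMART /dev/sdc" entry and the boot banner. A long quiet stretch like that suggests the hang happened without anything being logged locally. The sketch does not handle a log that crosses a year boundary.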
kstrike155 (Author) Posted September 24, 2023

Actually, I reset the BIOS while troubleshooting booting off the USB after the upgrade. I bet this is a C-State thing; will check and reply with findings.

Alintya Posted September 24, 2023 (edited)

I am experiencing similar issues with a Ryzen 5600G machine, but they started before 6.12.4. I didn't have any trouble updating, though. The system becomes almost completely unresponsive except for WireGuard, which seems to be unaffected. My logs aren't really helpful either. This happens around once a week for me as well.

Edited September 24, 2023 by Alintya

kstrike155 (Author) Posted September 24, 2023

I went to the BIOS and Global C-State was set to Auto, but I actually still have my c6-disable flag on during boot:

/usr/local/sbin/zenstates --c6-disable

What I did notice is that Power Supply Idle Control was set to Auto as well. In my research it sounds like that may actually be what's causing system hangs, so I set that to Typical and left the C-State setting enabled. I also pulled my GPU. Hopefully these two things resolve the issue!

HeNotSatisfied Posted September 24, 2023 (edited)

I have the same issues with my server and an "ASRock H510M-HVS R2.0" (Intel). Hard locks every couple of days. I will try the Power Supply Idle Control setting; if this finally solves my issue, you are my hero!

Edit: Power Supply Idle Control must be something only for AMD. I turned off all C-States (sad German noises); let's see if it helps.

Edited September 24, 2023 by HeNotSatisfied (new information)

wes.crockett Posted September 27, 2023

Any updates on the C6 change? I'm running an old Xeon machine and having similar issues.

kstrike155 (Author) Posted September 27, 2023

I already had the C6 change in place from before, so really the only changes I made were removing the GPU and setting Power Supply Idle Control to Typical. That BIOS option is specific to AMD, so it won't help you, sorry. With that said, so far so good; time will tell if this remains stable.

kstrike155 (Author) Posted October 3, 2023

Welp, it happened again. This time I noticed that the HDD activity light was blinking when Unraid started to hang. But it wasn't blinking like normal activity; it was a steady, consistent on/off blink about 2-3 times per second. I then looked inside my case and saw the LED blinking on my cache drive, which is an HP EX920 M.2 NVMe SSD (see here for a similar video from someone else). Surely this must mean something is wrong with my cache drive and Unraid isn't handling it well?

kstrike155 (Author) Posted October 7, 2023

That didn't seem to resolve the problem either. I've been running without the SSD for a few days now (running everything off the array, no cache: woof). The system came to a halt again not too long ago. I haven't done a full memtest yet, only 20%, but it has come back clean so far. It seems to happen under load; maybe a power supply issue?

I read a pretty interesting thread on Reddit; it looks like this is a common issue with 6.12. I uninstalled the Unassigned Devices plugin as mentioned in that thread, just for giggles.

ljm42 Posted October 10, 2023

The diags were taken pretty soon after a reboot so they don't show evidence of the problem. But I'm guessing the system is crashing due to macvlan call traces. The quick fix is to go to Settings > Docker, switch to advanced view, and change the "Docker custom network type" from macvlan to ipvlan. For more info, see the 6.12.4 release notes: https://docs.unraid.net/unraid-os/release-notes/6.12.4/#fix-for-macvlan-call-traces

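Since the posted diagnostics don't cover the period before the hang, one way to check this guess is to scan a syslog that does cover it (for example one saved on another machine) for anything that looks like a kernel call trace or mentions macvlan. The sketch below is only a coarse filter under stated assumptions: it assumes such traces contain the kernel's usual "Call Trace:" text or the word "macvlan", and the function name is hypothetical.

```python
import re

# Assumption: lines of interest mention "macvlan" or the kernel's
# "Call Trace:" header. This is a coarse filter, not a diagnosis.
HINT = re.compile(r"macvlan|call trace", re.IGNORECASE)


def find_trace_hints(lines):
    """Return (line_number, text) for each line that matches HINT."""
    hits = []
    for number, raw in enumerate(lines, start=1):
        if HINT.search(raw):
            hits.append((number, raw.rstrip("\n")))
    return hits
```

An empty result doesn't rule the theory out, because a hard hang can happen before anything reaches the log. Any hits are worth reading in full context.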
kstrike155 (Author) Posted October 11, 2023

Interesting you mention that. I am running the @mbentley docker-timemachine image with a custom MAC address and IP so that it can show up on the network using a dedicated IP. I just happened to disable that container on Saturday because I'm running array-only with spinning disks right now (while my SSD is being replaced) and need all of the drive speed I can get for my Docker containers. Will let it run for a while and see if it's stable (no other containers are using the macvlan driver).

ljm42 Posted October 11, 2023

Containers with custom IPs work well on ipvlan too, but if you want to keep using macvlan just follow the instructions in the 6.12.4 release notes and you'll be fine too.

kstrike155 (Author) Posted October 13, 2023 (edited)

Looks like it is hanging again. It's not a TOTAL hard lock; I am able to ping the machine. I was also able to (very slowly) SSH into the machine, but I couldn't run anything. After a few minutes I can't SSH to it at all. It's almost as if the system is TOTALLY overloaded. Unfortunately I am out of town and am not able to reset the machine remotely. I can still ping it, though! I can also run nc commands to check open ports, and they seem to be responding as expected, so I don't think the machine is in a kernel panic.

Edited October 13, 2023 by kstrike155

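The nc port checks described above can be approximated in a few lines of Python, with one extra step that may help separate "the kernel still answers" from "the service itself is alive". A TCP connect is completed by the kernel's network stack, so it can succeed even when the listening process is starved. Actually receiving bytes, such as the identification string an SSH server sends on connect, requires the service process to run. This is a sketch with hypothetical function names, not a replacement for nc.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_bytes(host, port, timeout=3.0, size=64):
    """Return the first bytes the service sends, or None if it stays silent.

    Only meaningful for services that speak first (SSH does); None means the
    connect failed, the peer closed, or nothing arrived before the timeout.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            return conn.recv(size) or None
    except OSError:
        return None
```

If `port_is_open(host, 22)` is true but `first_bytes(host, 22)` returns None, the port is accepting connections while sshd is not getting far enough to greet the client. That would fit an overloaded or stalled system better than a kernel panic.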
Kev600 Posted October 13, 2023

What's showing in your syslog? If you can't access it via the console, set up a syslog server.

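The point of a remote syslog server is that log lines written just before a hang survive on another machine. If no dedicated syslog daemon is handy, a bare-bones UDP receiver is enough to capture them. The sketch below assumes the server has been configured to forward its syslog over UDP to the machine running this code; the function names are hypothetical, and the port is the caller's choice (514 is the standard syslog UDP port, but binding it usually requires elevated privileges, so an unprivileged port may be easier).

```python
import socket


def open_sink(host, port):
    """Bind a UDP socket that syslog datagrams can be sent to."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def drain(sock, log_path, max_messages=None):
    """Append received datagrams to log_path, one line each.

    Runs until max_messages have been written, or forever if it is None.
    Each line is prefixed with the sender's IP address.
    """
    written = 0
    with open(log_path, "a", encoding="utf-8") as log:
        while max_messages is None or written < max_messages:
            data, (sender, _sender_port) = sock.recvfrom(8192)
            text = data.decode("utf-8", errors="replace").rstrip()
            log.write(f"{sender} {text}\n")
            log.flush()  # keep the file current in case this side dies too
            written += 1
    return written
```

A typical use would be `drain(open_sink("0.0.0.0", 5514), "remote-syslog.log")` left running on a second machine, where 5514 is just an example port. UDP delivery is best effort, so the last few messages before a hard lock can still be lost.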
kstrike155 (Author) Posted October 20, 2023

My syslog is in the first post of the thread and doesn't show much. I switched to the ipvlan network driver now, so hopefully that resolves things.