6.12.4 System Hard Locks Often

kstrike155 · September 24, 2023

General system specs are:

AMD Ryzen 5 1600
ASRock B450 Pro4
2x8 GB DDR4-2400 ECC RAM
3 data drives + 1 parity drive (all 4 TB, a mix of WD Red and Blue) and a single NVMe cache disk

Been running Unraid for years without much issue, but recently my system has been becoming completely unresponsive: no console, no SSH, no nothing. It happens probably about once a week on average. I have to hard reboot using the power button. I've attached diagnostics from shortly after a reboot.

I did have a heck of a time upgrading to 6.12.4 and had to rebuild my USB stick. Then the My Servers recovery didn't image the stick properly, so I had to start over with my own backup. Not sure if it's related, though it seems like the lockups might have started after the upgrade. I added an old video card to the machine while debugging the upgrade (so I could use a monitor) and left it in. I am going to try removing it to see if that fixes anything, but does anyone have other ideas? I also noticed I've been getting some errors on my parity disk, which I will replace, but I would hope a disk failure wouldn't bring my whole server down.

Here's syslog output from before and after the reboot (full syslog after reboot in the diagnostics); nothing of note really:

Sep 23 23:27:48 homer emhttpd: read SMART /dev/sdd
Sep 23 23:27:57 homer emhttpd: read SMART /dev/sdc
Sep 24 00:01:33 homer root: /mnt/cache: 858.9 GiB (922254512128 bytes) trimmed on /dev/nvme0n1p1
Sep 24 00:28:06 homer emhttpd: spinning down /dev/sdc
Sep 24 00:39:56 homer sshd[16907]: Received disconnect from 192.168.1.228 port 55478:11: Normal Shutdown
Sep 24 00:39:56 homer sshd[16907]: Disconnected from user root 192.168.1.228 port 55478
Sep 24 00:55:24 homer emhttpd: spinning down /dev/sdd
Sep 24 02:02:56 homer emhttpd: read SMART /dev/sdc
=================================== REBOOT ========================================
Sep 24 07:42:55 homer kernel: Linux version 6.1.49-Unraid (root@Develop-612) (gcc (GCC) 12.2.0, GNU ld version 2.40-slack151) #1 SMP PREEMPT_DYNAMIC Wed Aug 30 09:42:35 PDT 2023
Sep 24 07:42:55 homer kernel: Command line: BOOT_IMAGE=/bzimage initrd=/bzroot
Sep 24 07:42:55 homer kernel: BIOS-provided physical RAM map:
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009d3ff] usable
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x000000000009d400-0x000000000009ffff] reserved
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x0000000000100000-0x0000000009d01fff] usable
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x0000000009d02000-0x0000000009ffffff] reserved
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x000000000a000000-0x000000000a1fffff] usable
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x000000000a200000-0x000000000a20afff] ACPI NVS
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x000000000a20b000-0x000000000affffff] usable
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x000000000b000000-0x000000000b01ffff] reserved
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x000000000b020000-0x00000000db1c8fff] usable
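
When a syslog spans a hard reset, the useful part is usually the last few lines logged before the next boot banner. Here is a minimal sketch of pulling those out automatically. It assumes a syslog file that survived the reboot and that every boot starts with a kernel line containing "Linux version", as in the excerpt above. The function name and the file name in the usage comment are hypothetical.

```python
import re

# Assumption: each boot is marked by a kernel "Linux version" banner line,
# as seen in the excerpt above.
BOOT_MARKER = re.compile(r"kernel: Linux version ")


def lines_before_each_boot(lines, context=5):
    """For every boot banner found, return the `context` lines logged just before it."""
    results = []
    for i, line in enumerate(lines):
        if BOOT_MARKER.search(line):
            start = max(0, i - context)
            results.append(lines[start:i])
    return results


# Usage (hypothetical file name):
#   with open("syslog.txt", encoding="utf-8", errors="replace") as f:
#       for chunk in lines_before_each_boot(f.read().splitlines(), context=10):
#           print("\n".join(chunk))
#           print("--- boot ---")
```

If the last line before the banner is hours older than the boot itself, as it is here (02:02 versus 07:42), the system most likely stopped logging well before the reset, so the cause may not be in the log at all.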

homer-diagnostics-20230924-0800.zip

Edited September 24, 2023 by kstrike155

kstrike155 · September 24, 2023

Actually, I reset the BIOS while troubleshooting booting from the USB stick after the upgrade. I bet this is a C-State thing; I will check and reply with findings.

Alintya · September 24, 2023

I am experiencing similar issues with a Ryzen 5600G machine, but they started before 6.12.4. I didn't have any trouble updating, though.

The system becomes almost completely unresponsive except for WireGuard, which seems to be unaffected.

My logs aren't really helpful either.

Frequency of this happening is around once a week for me as well.

Edited September 24, 2023 by Alintya

kstrike155 · September 24, 2023

I went to the BIOS and Global C-State was set to Auto, but I actually still have my c6-disable flag on during boot:

/usr/local/sbin/zenstates --c6-disable

What I did notice is that Power Supply Idle Control was set to Auto as well. In my research it sounds like that may actually be what's causing system hangs, so I set that to Typical and left the C-State setting enabled. I also pulled my GPU. Hopefully these two things resolve the issue!

HeNotSatisfied · September 24, 2023

I have the same issues with my server and an "ASRock H510M-HVS R2.0" (Intel). Hard locks every couple of days.
I will try the Power Supply Idle Control setting. If this finally solves my issue, you are my hero!

Edit:
Power Supply Idle Control must be an AMD-only setting.
I turned off all C-States (sad German noises); let's see if it helps.

Edited September 24, 2023 by HeNotSatisfied
new information

wes.crockett · September 27, 2023

Any updates on the C6 change? I'm running an old Xeon machine and having similar issues.

kstrike155 · September 27, 2023

I already had the C6 change in place from before, so really the only changes I made were removing the GPU and setting Power Supply Idle Control to Typical. That BIOS option is specific to AMD, so it won't help you, sorry.

With that said, so far so good, time will tell if this remains stable.

kstrike155 · October 3, 2023

Welp, it happened again. This time I noticed that the HDD activity light was blinking when Unraid started to hang. But it wasn't blinking like normal activity; it was a steady/consistent blink on/off about 2-3 times per second. I then looked inside my case and saw the LED blinking on my cache drive which is an HP EX920 M.2 NVMe SSD (see here for a similar video from someone else).

Surely this must mean something is wrong with my cache drive and Unraid isn't handling it well?

kstrike155 · October 7, 2023

That didn't seem to resolve the problem either. I've been running without the SSD for a few days now (running everything off the array, no cache: woof). System just came to a halt not too long ago.

I didn't do a full memtest yet, only 20%, but it came back clean thus far.

It seems to happen under load, so maybe a power supply issue? I read a pretty interesting thread on Reddit, and it looks like this is a common issue with 6.12. I uninstalled the Unassigned Devices plugin as mentioned in that thread, just for giggles.

ljm42 · October 10, 2023

The diags were taken pretty soon after a reboot so they don't show evidence of the problem. But I'm guessing the system is crashing due to macvlan call traces.

The quick fix is to go to Settings > Docker, switch to advanced view, and change the "Docker custom network type" from macvlan to ipvlan.

For more info, see the 6.12.4 release notes: https://docs.unraid.net/unraid-os/release-notes/6.12.4/#fix-for-macvlan-call-traces
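
To check whether a saved syslog actually contains macvlan call traces before changing the network type, something like the sketch below may help. It assumes kernel traces are introduced by a line containing "Call Trace" and that the stack frames following it mention "macvlan" somewhere. Both the function name and the 40-line window are arbitrary choices for illustration, not anything from Unraid.

```python
def find_macvlan_traces(lines, window=40):
    """Return indices of 'Call Trace' lines that are followed, within
    `window` lines, by a line mentioning macvlan (case-insensitive)."""
    hits = []
    for i, line in enumerate(lines):
        if "Call Trace" in line:
            following = lines[i + 1 : i + 1 + window]
            if any("macvlan" in entry.lower() for entry in following):
                hits.append(i)
    return hits


# Usage (hypothetical file name):
#   with open("syslog.txt", encoding="utf-8", errors="replace") as f:
#       lines = f.read().splitlines()
#   for idx in find_macvlan_traces(lines):
#       print(lines[idx])
```

An empty result does not rule macvlan out: if the log was captured after a reboot, as with the diagnostics in this thread, the traces from the previous boot are simply not in it.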

kstrike155 · October 11, 2023

Interesting you mention that. I am running the @mbentley docker-timemachine image with a custom MAC address and IP so that it can show up on the network using a dedicated IP. I just happened to disable that container on Saturday because I'm running array-only with spinning disks right now (while my SSD is being replaced) and need all of the drive speed I can get for my Docker containers. Will let it run for a while and see if it's stable (no other containers are using the macvlan driver).

ljm42 · October 11, 2023

Containers with custom IPs work well on ipvlan too, but if you want to keep using macvlan just follow the instructions in the 6.12.4 release notes and you'll be fine too.

kstrike155 · October 13, 2023

Looks like it is hanging again. It's not a TOTAL hard lock, I am able to ping the machine. I was also able to (very slowly) SSH into the machine but I can't run anything.

After a few minutes I can't SSH to it at all now. It's almost as if the system is TOTALLY overloaded. Unfortunately I am out of town and am not able to reset the machine remotely now.

Can still ping, though! Can also run nc commands to check open ports and they seem to be responding as expected, so I don't think the machine is in kernel panic.
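
The nc-style port check described above can be sketched in Python as follows. The helper name, the example IP address, and the port list in the usage comment are assumptions for illustration.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port is accepted within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Usage (hypothetical address and ports):
#   for port in (22, 80, 443):
#       print(port, port_is_open("192.168.1.10", port))
```

Note that a successful connect only shows that the kernel's network stack completed the TCP handshake. The service behind the port can still be hung, which would be consistent with ping and port checks passing while SSH sessions never become usable.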

Edited October 13, 2023 by kstrike155

Kev600 · October 13, 2023

What's showing in your syslog? If you can't access it via the console, set up a syslog server.
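
For anyone without a syslog server handy, here is a rough sketch of a throwaway UDP receiver to run on another machine, so that log lines sent just before a hang are kept off the server itself. It assumes the server can forward its syslog over UDP to a host and port of your choosing. Port 5514 and the log file name are placeholders (the standard syslog port, 514, normally requires root), and this is not a replacement for a real syslog daemon.

```python
import socketserver


class SyslogHandler(socketserver.BaseRequestHandler):
    """Append each received UDP datagram to a log file, prefixed with the sender's IP."""

    def handle(self):
        data, _sock = self.request  # for UDP servers, request is (bytes, socket)
        message = data.decode("utf-8", errors="replace").rstrip()
        with open(self.server.log_path, "a", encoding="utf-8") as f:
            f.write(f"{self.client_address[0]} {message}\n")


def make_server(host="0.0.0.0", port=5514, log_path="remote-syslog.log"):
    """Create (but do not start) a UDP server that logs to `log_path`."""
    server = socketserver.UDPServer((host, port), SyslogHandler)
    server.log_path = log_path  # custom attribute read by the handler
    return server


# Usage: run until interrupted with Ctrl+C.
#   with make_server() as srv:
#       srv.serve_forever()
```

Opening the file for every message is slow, but it means each line is on disk as soon as it arrives, which is the point when the sender may lock up at any moment.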

kstrike155 · October 20, 2023

My syslog is in the first post of the thread and doesn't show much. I have now switched to the ipvlan network driver, so hopefully that resolves things.
