6.12.4 System Hard Locks Often


kstrike155


General system specs are:

  • AMD Ryzen 5 1600
  • ASRock B450 Pro4
  • 2x8 GB DDR4-2400 ECC RAM
  • 3x data + 1x parity drives (all 4 TB, mix of WD Red and Blue) and a single NVMe cache disk

 

Been running Unraid for years without much issue, but recently my system has been becoming completely unresponsive: no console, no SSH, no nothing. It happens probably about once a week on average. I have to hard reboot using the power button. I've attached diagnostics from shortly after a reboot.

 

I did have a heck of a time upgrading to 6.12.4 and had to rebuild my USB stick. Then the My Servers recovery didn't image the stick properly, so I had to start over with my own backup. Not sure if it's related, though it seems like it might have started happening after the upgrade. I added an old video card to the machine during my debugging of the upgrade (so I could use a monitor), and left that in the machine. I am going to try removing it to see if it fixes anything, but I'm wondering if there are any other ideas? I also noticed I've been receiving some errors on my parity disk, which I will replace, but I would hope a disk failure wouldn't bring my whole server down.

 

Here's syslog output from before and after the reboot (full syslog after reboot in the diagnostics); nothing of note really:

 

Sep 23 23:27:48 homer emhttpd: read SMART /dev/sdd
Sep 23 23:27:57 homer emhttpd: read SMART /dev/sdc
Sep 24 00:01:33 homer root: /mnt/cache: 858.9 GiB (922254512128 bytes) trimmed on /dev/nvme0n1p1
Sep 24 00:28:06 homer emhttpd: spinning down /dev/sdc
Sep 24 00:39:56 homer sshd[16907]: Received disconnect from 192.168.1.228 port 55478:11: Normal Shutdown
Sep 24 00:39:56 homer sshd[16907]: Disconnected from user root 192.168.1.228 port 55478
Sep 24 00:55:24 homer emhttpd: spinning down /dev/sdd
Sep 24 02:02:56 homer emhttpd: read SMART /dev/sdc
=================================== REBOOT ========================================
Sep 24 07:42:55 homer kernel: Linux version 6.1.49-Unraid (root@Develop-612) (gcc (GCC) 12.2.0, GNU ld version 2.40-slack151) #1 SMP PREEMPT_DYNAMIC Wed Aug 30 09:42:35 PDT 2023
Sep 24 07:42:55 homer kernel: Command line: BOOT_IMAGE=/bzimage initrd=/bzroot
Sep 24 07:42:55 homer kernel: BIOS-provided physical RAM map:
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009d3ff] usable
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x000000000009d400-0x000000000009ffff] reserved
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x0000000000100000-0x0000000009d01fff] usable
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x0000000009d02000-0x0000000009ffffff] reserved
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x000000000a000000-0x000000000a1fffff] usable
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x000000000a200000-0x000000000a20afff] ACPI NVS
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x000000000a20b000-0x000000000affffff] usable
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x000000000b000000-0x000000000b01ffff] reserved
Sep 24 07:42:55 homer kernel: BIOS-e820: [mem 0x000000000b020000-0x00000000db1c8fff] usable
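
For what it's worth, the useful number in an excerpt like this is how long the log was silent before the boot line. Here is a minimal Python sketch that computes it, assuming classic `Mon DD HH:MM:SS` syslog timestamps and that a line containing `kernel: Linux version` marks a boot. The names `parse_ts` and `silent_gap` and the `year` parameter are hypothetical (syslog lines carry no year), not anything Unraid ships.

```python
from datetime import datetime

# Assumption: the first kernel line of a boot contains this text.
BOOT_MARKER = "kernel: Linux version"


def parse_ts(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp, or return None if absent."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def silent_gap(lines, year=2023):
    """Return the time between the last line before a boot and the boot line.

    Lines without a parseable timestamp (blank lines, separators) are skipped.
    Returns None if no boot line with a timestamped predecessor is found.
    """
    last_ts = None
    for line in lines:
        ts = parse_ts(line, year)
        if ts is None:
            continue
        if BOOT_MARKER in line and last_ts is not None:
            return ts - last_ts
        last_ts = ts
    return None
```

On the excerpt above this gives 5:39:59 (02:02:56 to 07:42:55). That is only an upper bound on how long the box was hung before the manual reboot, since an idle system may simply have had nothing to log.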

 

homer-diagnostics-20230924-0800.zip

Edited by kstrike155
Link to comment

I am experiencing similar issues with a Ryzen 5600G machine, but they started before 6.12.4. I didn't have any trouble updating, though.

The system becomes almost completely unresponsive except for WireGuard, which seems to be unaffected.

My logs aren't really helpful either.

This happens around once a week for me as well.

Edited by Alintya
Link to comment

I went to the BIOS and Global C-State was set to Auto, but I actually still have my c6-disable flag on during boot:

 

/usr/local/sbin/zenstates --c6-disable

 

What I did notice is that Power Supply Idle Control was set to Auto as well. In my research it sounds like that may actually be what's causing system hangs, so I set that to Typical and left the C-State setting enabled. I also pulled my GPU. Hopefully these two things resolve the issue!

Link to comment

I have the same issues with my server and an "ASRock H510M-HVS R2.0" (Intel). Hard locks every couple of days.
I will try the Power Supply Idle Control setting. If this finally solves my issue, you are my hero!

 

Edit:
Power Supply Idle Control must be an AMD-only setting.
I turned off all C-States (sad German noises); let's see if it helps.

Edited by HeNotSatisfied
new information
Link to comment

Welp, it happened again. This time I noticed that the HDD activity light was blinking when Unraid started to hang. But it wasn't blinking like normal activity; it was a steady/consistent blink on/off about 2-3 times per second. I then looked inside my case and saw the LED blinking on my cache drive which is an HP EX920 M.2 NVMe SSD (see here for a similar video from someone else).

 

Surely this must mean something is wrong with my cache drive and Unraid isn't handling it well?

Link to comment

That didn't seem to resolve the problem either. I've been running without the SSD for a few days now (running everything off the array, no cache: woof). System just came to a halt not too long ago.

 

I didn't do a full memtest yet, only 20%, but it came back clean thus far.

 

It seems to happen under load, so maybe it's a power supply issue? I read a pretty interesting thread on Reddit, and it looks like this is a common issue with 6.12. I uninstalled the Unassigned Devices plugin as mentioned in that thread, just for giggles.

Link to comment

The diags were taken pretty soon after a reboot so they don't show evidence of the problem. But I'm guessing the system is crashing due to macvlan call traces.

 

The quick fix is to go to Settings > Docker, switch to advanced view, and change the "Docker custom network type" from macvlan to ipvlan. 

 

For more info, see the 6.12.4 release notes: https://docs.unraid.net/unraid-os/release-notes/6.12.4/#fix-for-macvlan-call-traces 
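
If you do have a syslog that covers the hang (an assumption here, since the diagnostics above were taken after a reboot), a rough way to check for this is to look for kernel `Call Trace` blocks that mention `macvlan`. Below is a minimal Python sketch. `find_macvlan_traces`, `scan_file`, and the 40-line window are hypothetical choices for illustration, and the exact trace text varies by kernel version, so treat a hit as a hint rather than proof.

```python
def find_macvlan_traces(lines, window=40):
    """Return 0-based indexes of 'Call Trace' lines followed by a macvlan mention.

    Assumption: the frames of a trace appear within `window` lines of its header.
    """
    hits = []
    for i, line in enumerate(lines):
        if "Call Trace" not in line:
            continue
        # Look at the header plus the lines that follow it.
        block = lines[i:i + window]
        if any("macvlan" in entry for entry in block):
            hits.append(i)
    return hits


def scan_file(path, window=40):
    """Read a saved syslog from `path` (wherever yours lives) and scan it."""
    with open(path, errors="replace") as f:
        return find_macvlan_traces(f.read().splitlines(), window)
```

An empty result does not rule macvlan out; it may only mean the log does not reach back to the hang.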

Link to comment

Interesting you mention that. I am running the @mbentley docker-timemachine image with a custom MAC address and IP so that it can show up on the network using a dedicated IP. I just happened to disable that container on Saturday because I'm running array-only with spinning disks right now (while my SSD is being replaced) and need all of the drive speed I can get for my Docker containers. Will let it run for a while and see if it's stable (no other containers are using the macvlan driver).

Link to comment

Looks like it is hanging again. It's not a TOTAL hard lock, since I am able to ping the machine. I was also able to (very slowly) SSH into the machine, but I can't run anything.

 

After a few minutes I can't SSH to it at all now. It's almost as if the system is TOTALLY overloaded. Unfortunately I am out of town and am not able to reset the machine remotely now.

 

Can still ping, though! I can also run nc commands to check open ports, and they seem to be responding as expected, so I don't think the machine is in a kernel panic.
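
For anyone who wants to repeat that kind of port check from another machine, here is a minimal Python sketch of the same idea as an `nc`-style connect test. The function names `port_open` and `probe` and the 3-second timeout are hypothetical choices, and the host and ports you pass in are whatever applies to your own server.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def probe(host: str, ports: list[int], timeout: float = 3.0) -> dict[int, bool]:
    """Check several ports on one host and map each port to its result."""
    return {port: port_open(host, port, timeout) for port in ports}
```

One caveat: the kernel completes the TCP handshake for a listening socket on its own, so an "open" result only shows that the network stack is alive. The service behind the port can still be stalled, which would fit a box that answers ping and port checks but won't give you a usable SSH session.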

Edited by kstrike155
Link to comment
