System Randomly Locking Up - v6.11.5


Isorikk
Solved by Isorikk

Hi all,

 

For the past couple months I've been trying to determine the cause of my Unraid system randomly becoming unreachable. Sometimes it'll run with no apparent issues for weeks, and other times it'll last a day or two before freezing again. This system ran flawlessly for years until I made two major changes:

 

Change 1: Upgraded the hardware. Basically a full transplant of the drives to a new system with a new CPU, new motherboard, and new RAM. I also moved the drives from SATA splitters to SAS HBAs and added an Intel A770 GPU for future video encoding endeavors.

Change 2: After determining there were no issues with the hardware upgrade, I upgraded from V6.9.2 to V6.11.5 in hopes that it might play better with the A770, which is technically unsupported on Linux 5.x kernels (I've since been able to pass the GPU through to a VM with no issues).

 

I suspect the issues I'm having are related to Change 1, but I haven't been able to determine the specific cause. The errors I'm encountering in the logs are beyond my ability to troubleshoot, and Google has not been helpful. I've run multiple memtests and reseated everything. Another, possibly unrelated, symptom is that the Win10 VM I'm running for the A770 only runs for about two hours before locking up and pinning the CPU at 100% until I force the VM to shut down. I did not run a VM on the system prior to Changes 1 and 2.

 

I've attached the syslog from today where it most recently froze and the system diagnostics. Thanks for any help you can provide!

 

syslog.txt gemininas-diagnostics-20230219-1305.zip

7 hours ago, JorgeB said:

There have been other users with issues with Ryzen 7xxx, try disabling C-states, also XMP on the RAM.

 

I will try disabling the C-states. As for XMP on the RAM, I originally had it disabled (I saw no useful purpose for overclocking RAM on a storage server), but I was getting a different kernel panic with data corruption. It appeared that bits were sometimes flipping in memory, and enabling XMP seems to have resolved it.

 

I will follow up in a few days with the results of disabling C-states.
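
For anyone who wants to confirm from the running system whether deeper C-states are actually in use, here is a minimal Python sketch. It assumes the kernel exposes the standard Linux cpuidle sysfs tree under /sys/devices/system/cpu/cpu0/cpuidle, which may not be present on every build or BIOS configuration. The function name list_cstates is hypothetical.

```python
from pathlib import Path


def list_cstates(cpu_dir="/sys/devices/system/cpu/cpu0"):
    """Return (state, name, usage_count) tuples for one CPU.

    Returns an empty list if the cpuidle directory is absent
    (assumption: that means no idle states are exposed to the OS).
    """
    states = []
    idle_dir = Path(cpu_dir) / "cpuidle"
    if not idle_dir.is_dir():
        return states
    # Sort numerically so state10 comes after state2
    state_dirs = sorted(idle_dir.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
    for state_dir in state_dirs:
        name = (state_dir / "name").read_text().strip()
        usage = int((state_dir / "usage").read_text().strip())
        states.append((state_dir.name, name, usage))
    return states


if __name__ == "__main__":
    for state, name, usage in list_cstates():
        print(f"{state}: {name} entered {usage} times")
```

If the usage counters for the deeper states keep growing after the BIOS change, the setting probably did not take effect.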


An update to my previous post:

 

After disabling Global C-State Control in the BIOS, I went to restart the system a little later and for some reason it wouldn't POST anymore. It booted exactly once with C-States disabled and then never again. To get it to POST, I either had to unplug the drives from the SAS HBAs or, as I ended up doing, flash the BIOS with a newer version, which appears to have reset the config.

 

I'm going to let it run for a few days with C-States re-enabled after the update, but I have a feeling that the issue will persist...


The system froze again with Global C-States disabled and the XMP profile turned off. The log doesn't really show anything; this is all it has from just before the system went unresponsive:

 

Quote

Feb 25 16:32:39 GeminiNAS kernel: vfio-pci 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
Feb 25 16:32:39 GeminiNAS kernel: br0: port 2(vnet2) entered blocking state
Feb 25 16:32:39 GeminiNAS kernel: br0: port 2(vnet2) entered disabled state
Feb 25 16:32:39 GeminiNAS kernel: device vnet2 entered promiscuous mode
Feb 25 16:32:39 GeminiNAS kernel: br0: port 2(vnet2) entered blocking state
Feb 25 16:32:39 GeminiNAS kernel: br0: port 2(vnet2) entered forwarding state
Feb 25 16:32:41 GeminiNAS  avahi-daemon[7613]: Joining mDNS multicast group on interface vnet2.IPv6 with address fe80::fc54:ff:fed4:45f8.
Feb 25 16:32:41 GeminiNAS  avahi-daemon[7613]: New relevant interface vnet2.IPv6 for mDNS.
Feb 25 16:32:41 GeminiNAS  avahi-daemon[7613]: Registering new address record for fe80::fc54:ff:fed4:45f8 on vnet2.*.
Feb 25 16:32:42 GeminiNAS  acpid: input device has been disconnected, fd 6
Feb 25 16:32:42 GeminiNAS  acpid: input device has been disconnected, fd 7
Feb 25 16:32:42 GeminiNAS  acpid: input device has been disconnected, fd 8
Feb 25 16:32:53 GeminiNAS kernel: usb 1-9: reset full-speed USB device number 7 using xhci_hcd

 

I've attached the full syslog for today, with it going unresponsive at approximately 4:55pm.

syslog.txt
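
Since a hard freeze usually leaves no error in the log, one way to narrow down when it happened is to look for long silent periods between consecutive syslog lines. Below is a minimal Python sketch, assuming classic syslog timestamps like the ones quoted above (no year field, so the year is passed in as an assumption). The function name find_gaps and the 10-minute threshold are hypothetical choices.

```python
import sys
from datetime import datetime, timedelta


def find_gaps(lines, min_gap=timedelta(minutes=10), year=2023):
    """Yield (last_line_before_gap, first_line_after_gap, gap_length)."""
    prev_time, prev_line = None, None
    for line in lines:
        line = line.rstrip("\n")
        try:
            # The stamp is the first 15 characters, e.g. "Feb 25 16:32:39"
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if prev_time is not None and stamp - prev_time >= min_gap:
            yield prev_line, line, stamp - prev_time
        prev_time, prev_line = stamp, line


if __name__ == "__main__":
    # Usage: python find_gaps.py syslog.txt
    with open(sys.argv[1], errors="replace") as fh:
        for before, after, gap in find_gaps(fh):
            print(f"Silent for {gap}:\n  last:  {before}\n  next:  {after}")
```

The last line before a long gap is the closest thing to a time of death for the freeze, although it is not necessarily related to the cause.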

8 hours ago, JorgeB said:

Try switching to ipvlan (Settings -> Docker Settings -> Docker custom network type -> ipvlan (advanced view must be enabled, top right))

 

I've updated the configuration to ipvlan. I'm not sure how it got set to macvlan in the first place, but good catch! So far I've encountered no issues. I will let it run for a few days again to see if the issue persists.
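
One way to double-check which driver a Docker network is really using after the change is to read the "Driver" field from the JSON that `docker network inspect` prints. This is a minimal Python sketch, assuming the Docker CLI is on the PATH; the network name "br0" is an assumption and may differ on other systems, and the helper names are hypothetical.

```python
import json
import subprocess


def network_drivers(inspect_json):
    """Map network name -> driver from `docker network inspect` JSON output."""
    return {net["Name"]: net["Driver"] for net in json.loads(inspect_json)}


def inspect_network(name):
    """Run the Docker CLI for one network and return its name/driver mapping."""
    result = subprocess.run(
        ["docker", "network", "inspect", name],
        capture_output=True, text=True, check=True,
    )
    return network_drivers(result.stdout)


if __name__ == "__main__":
    # "br0" is an assumed custom network name
    print(inspect_network("br0"))
```

If the driver still shows macvlan, the network may not have been recreated yet.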


A new development:

 

The system has not frozen or gone unresponsive, but this morning I discovered that none of the Docker and plugin services could reach out to check for updates. I suspect this is directly related to changing from macvlan to ipvlan. I investigated further and determined that all of the services could reach my local network, including the gateway, but could not reach the internet. The Unraid system itself could reach the internet just fine; only the plugin/Docker services seemed to be blocked.
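
The kind of check described above (gateway reachable, internet not) can be sketched in Python and run from inside a container to tell a routing problem from a DNS problem. Everything specific here is an assumption for illustration: the gateway address 192.168.1.1 with a web interface on port 80, 1.1.1.1 on port 443 as an external IP, and example.com as a hostname to resolve. The function names are hypothetical.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refused connections, unreachable routes
        return False


def can_resolve(name):
    """Return True if the hostname resolves to at least one address."""
    try:
        return bool(socket.getaddrinfo(name, 443))
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def diagnose(gateway_ok, internet_ip_ok, dns_ok):
    """Turn the three probe results into a short verdict."""
    if not gateway_ok:
        return "no local connectivity"
    if not internet_ip_ok:
        return "local network reachable, but no route to the internet"
    if not dns_ok:
        return "internet reachable by IP, but DNS resolution fails"
    return "connectivity looks fine"


if __name__ == "__main__":
    print(diagnose(
        can_connect("192.168.1.1", 80),   # assumed gateway address and port
        can_connect("1.1.1.1", 443),      # assumed external IP
        can_resolve("example.com"),       # assumed hostname
    ))
```

Running the same script on the host and inside a container makes it easier to see whether only the container network is affected.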

 

A restart resolved the issue. I'm wondering if the issue may be caused by the network card...

 

I've attached more logs from this morning. It looks like a lot of weird stuff is happening with br0 but I'm not certain if this was just a one-time bug or a symptom of a larger issue.

syslog.txt

  • Solution

I believe my problems were caused by trying to run the VM on top of whatever else is going on. Disabling VMs has eliminated all the weird bugs. It's possible the culprit was the unsupported video card. I have since tried an unofficial kernel that adds drivers for the video card, and it has been playing well with Docker containers thus far. I'm going to go ahead and mark this thread as resolved, with the solution being: don't use VMs with new hardware!
