February 19, 2023
Hi all,
For the past couple of months I've been trying to determine why my Unraid system randomly becomes unreachable. Sometimes it runs with no apparent issues for weeks; other times it lasts only a day or two before freezing again. This system ran flawlessly for years until I made two major changes:
Change 1: Upgraded the hardware. This was basically a full transplant of the drives into a new system with a new CPU, motherboard, and RAM. I also moved the drives from SATA splitters to SAS HBAs and added an Intel A770 GPU for future video encoding endeavors.
Change 2: After determining there were no issues with the hardware upgrade, I upgraded Unraid from v6.9.2 to v6.11.5 in hopes that it would play better with the A770, which is technically unsupported on the Linux 5.x kernel. (I've since been able to pass the GPU through to a VM with no issues.)
I suspect the issues I'm having are related to Change 1, but I haven't been able to pin down the specific cause. The errors in the logs are beyond my knowledge to troubleshoot, and Google has not been helpful. I've run multiple memtests and reseated everything.
Another, possibly unrelated, symptom: the Win10 VM I'm running for the A770 only runs for about two hours before locking up and pinning the CPU at 100% until I force it to shut down. I did not run a VM on this system before Changes 1 and 2.
I've attached the syslog from today, when it most recently froze, along with the system diagnostics. Thanks for any help you can provide!
syslog.txt
gemininas-diagnostics-20230219-1305.zip
Edited February 19, 2023 by Isorikk
February 20, 2023 (Community Expert)
Other users have reported issues with Ryzen 7xxx CPUs. Try disabling C-states, and also XMP on the RAM.
February 20, 2023 (Author)
7 hours ago, JorgeB said: There have been other users with issues with Ryzen 7xxx, try disabling C-states, also XMP on the RAM.
I will try disabling the C-states. As for XMP, I originally had it disabled (I saw no point in overclocking RAM on a storage server), but I was getting a different kernel panic along with data corruption. It appeared that bits were sometimes flipping in memory, and enabling XMP seems to have resolved that. I will follow up in a few days with the results of disabling C-states.
February 23, 2023 (Author)
An update to my previous post: After disabling Global C-State Control in the BIOS, I went to restart the system a little later and, for some reason, it wouldn't POST anymore. It booted exactly once with C-States disabled and then never again. To get it to POST, I either had to unplug the drives from the SAS HBAs or, as I ended up doing, flash a newer BIOS version, which appears to have reset the config. I'm going to let it run for a few days with C-States re-enabled after the update, but I have a feeling that the issue will persist...
Edited February 23, 2023 by Isorikk
February 23, 2023 (Community Expert)
The system should always boot with C-States disabled; sounds like a buggy BIOS.
February 23, 2023 (Author)
20 minutes ago, JorgeB said: System should still always boot with C-States disabled, sounds like a buggy BIOS.
Tell me about it... 🙃
February 26, 2023 (Author)
The system froze again with Global C-States disabled and the XMP profile turned off. The log doesn't show much; this is everything it recorded before the system went unresponsive:
Quote
Feb 25 16:32:39 GeminiNAS kernel: vfio-pci 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
Feb 25 16:32:39 GeminiNAS kernel: br0: port 2(vnet2) entered blocking state
Feb 25 16:32:39 GeminiNAS kernel: br0: port 2(vnet2) entered disabled state
Feb 25 16:32:39 GeminiNAS kernel: device vnet2 entered promiscuous mode
Feb 25 16:32:39 GeminiNAS kernel: br0: port 2(vnet2) entered blocking state
Feb 25 16:32:39 GeminiNAS kernel: br0: port 2(vnet2) entered forwarding state
Feb 25 16:32:41 GeminiNAS avahi-daemon[7613]: Joining mDNS multicast group on interface vnet2.IPv6 with address fe80::fc54:ff:fed4:45f8.
Feb 25 16:32:41 GeminiNAS avahi-daemon[7613]: New relevant interface vnet2.IPv6 for mDNS.
Feb 25 16:32:41 GeminiNAS avahi-daemon[7613]: Registering new address record for fe80::fc54:ff:fed4:45f8 on vnet2.*.
Feb 25 16:32:42 GeminiNAS acpid: input device has been disconnected, fd 6
Feb 25 16:32:42 GeminiNAS acpid: input device has been disconnected, fd 7
Feb 25 16:32:42 GeminiNAS acpid: input device has been disconnected, fd 8
Feb 25 16:32:53 GeminiNAS kernel: usb 1-9: reset full-speed USB device number 7 using xhci_hcd
I've attached the full syslog for today; the system went unresponsive at approximately 4:55 pm.
syslog.txt
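To see how long the log stayed silent before the reported freeze, a small Python sketch can parse the leading timestamp of each line and report the last entry recorded before a given time. It assumes the classic `Mon DD HH:MM:SS host process: message` prefix seen in the excerpt above; because those lines carry no year, it borrows the year from the freeze time (2023, per the post date). The 16:55 freeze time is the approximate one given in the post, and the helper names `parse_timestamp` and `last_entry_before` are hypothetical.

```python
from datetime import datetime
from typing import Optional, Tuple

# Three lines copied from the excerpt above. Classic syslog lines start with a
# 15-character "Mon DD HH:MM:SS" stamp and carry no year.
SAMPLE = """\
Feb 25 16:32:39 GeminiNAS kernel: br0: port 2(vnet2) entered forwarding state
Feb 25 16:32:42 GeminiNAS acpid: input device has been disconnected, fd 8
Feb 25 16:32:53 GeminiNAS kernel: usb 1-9: reset full-speed USB device number 7 using xhci_hcd
"""


def parse_timestamp(line: str, year: int) -> Optional[datetime]:
    """Parse the leading 'Mon DD HH:MM:SS' stamp; return None if the line has none."""
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def last_entry_before(log_text: str, freeze: datetime) -> Optional[Tuple[datetime, str]]:
    """Return (timestamp, line) of the latest entry at or before `freeze`, or None."""
    last = None
    for line in log_text.splitlines():
        ts = parse_timestamp(line, freeze.year)  # assumption: log is from the freeze's year
        if ts is not None and ts <= freeze and (last is None or ts >= last[0]):
            last = (ts, line)
    return last


if __name__ == "__main__":
    freeze = datetime(2023, 2, 25, 16, 55)  # "approximately 4:55 pm" from the post
    found = last_entry_before(SAMPLE, freeze)
    if found is None:
        print("no entries before the freeze time")
    else:
        ts, line = found
        print(f"last entry: {line}")
        print(f"silent for: {freeze - ts}")  # prints 0:22:07 for the sample above
```

Against the excerpt, this reports roughly 22 minutes of silence before the approximate freeze time. That only shows the log stopped well before the system became unreachable; on its own it does not identify the cause.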
February 26, 2023 (Community Expert)
Try switching to ipvlan (Settings -> Docker Settings -> Docker custom network type -> ipvlan (advanced view must be enabled, top right))
February 26, 2023 (Author)
8 hours ago, JorgeB said: Try switching to ipvlan (Settings -> Docker Settings -> Docker custom network type -> ipvlan (advanced view must be enabled, top right))
I've switched the configuration to ipvlan. I'm not sure how it got set to macvlan in the first place, but good catch! So far I've encountered no issues. I will let it run for a few days again to see if the issue persists.
February 27, 2023 (Author)
A new development: The system has not frozen or gone unresponsive, but this morning I discovered that none of the Docker containers or plugins could reach out to check for updates. I suspect this is directly related to switching from macvlan to ipvlan. After some further investigation, I determined that all of the services could reach my local network, including the gateway, but not the internet. The Unraid host itself had no trouble reaching the internet; only the plugin/Docker services seemed to be blocked. A restart resolved the issue. I'm wondering if the network card may be the cause...
I've attached more logs from this morning. It looks like a lot of odd activity is happening on br0, but I'm not certain whether this was a one-time bug or a symptom of a larger issue.
syslog.txt
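The reachability check described above (local network and gateway reachable, internet not) can be scripted so it is easy to repeat from inside a container the next time it happens. This is a minimal Python sketch, assuming Python is available in the container (many images don't ship it). It uses plain TCP connects rather than ping, and the addresses and ports in `TARGETS` are hypothetical placeholders: replace the gateway with your own and pick ports each target actually listens on.

```python
import socket

# Hypothetical targets: placeholders, not values from this thread. Replace the
# gateway address with your own and pick ports the targets actually listen on.
TARGETS = {
    "gateway (LAN)": ("192.168.1.1", 80),
    "internet by IP": ("1.1.1.1", 443),
    "internet by name": ("example.com", 443),
}


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        # create_connection resolves the name (if any) and connects; DNS
        # failures, refusals, and timeouts all raise subclasses of OSError.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def summarize(results: dict[str, bool]) -> str:
    """Group probe results into one human-readable line."""
    reachable = sorted(name for name, ok in results.items() if ok)
    unreachable = sorted(name for name, ok in results.items() if not ok)
    return (
        f"reachable: {', '.join(reachable) or 'none'}; "
        f"unreachable: {', '.join(unreachable) or 'none'}"
    )


if __name__ == "__main__":
    results = {name: can_connect(host, port) for name, (host, port) in TARGETS.items()}
    print(summarize(results))
```

Probing an external host both by IP address and by name helps separate a DNS failure from a routing failure, since a failed name lookup is also reported as unreachable here.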
March 3, 2023 (Author, Solution)
I believe my problems were caused by trying to run the VM on top of whatever else is going on. Disabling VMs has eliminated all of the weird bugs. It's possible the culprit was the unsupported video card. I have since tried an unofficial kernel that adds drivers for the video card, and it has been playing well with Docker containers thus far. I'm going to go ahead and mark this thread as resolved, with the solution being: don't use VMs with new hardware!