Isorikk Posted February 19, 2023 (edited)
Hi all, For the past couple of months I've been trying to determine why my Unraid system randomly becomes unreachable. Sometimes it runs with no apparent issues for weeks, and other times it lasts a day or two before freezing again. This system ran flawlessly for years until I made two major changes:
Change 1: Upgraded the hardware. Basically a full transplant of the drives to a new system with a new CPU, new motherboard, and new RAM. I also moved the drives from SATA splitters to SAS HBAs and added an Intel A770 GPU for future video encoding endeavors.
Change 2: After determining there were no issues with the hardware upgrade, I upgraded from V6.9.2 to V6.11.5 in hopes that it might play better with the A770, which is technically unsupported on Linux 5 (I've since been able to get the GPU to pass through to a VM with no issues).
I suspect the issues I'm having are related to Change 1, but I haven't been able to determine the specific cause. The errors I'm encountering in the logs are beyond my knowledge to troubleshoot, and Google has not been helpful. I've run multiple memtests and reseated everything.
Another, possibly unrelated, symptom is that the Win10 VM I'm running for the A770 only runs for about two hours before locking up and pinning the CPU at 100% until I force the VM to shut down. I did not run a VM on the system prior to Changes 1 and 2.
I've attached the system diagnostics and the syslog from today, when it most recently froze. Thanks for any help you can provide!
Attachments: syslog.txt, gemininas-diagnostics-20230219-1305.zip
JorgeB Posted February 20, 2023
There have been other users with issues with Ryzen 7xxx, try disabling C-states, also XMP on the RAM.
Isorikk (Author) Posted February 20, 2023
7 hours ago, JorgeB said: There have been other users with issues with Ryzen 7xxx, try disabling C-states, also XMP on the RAM.
I will try disabling the C-states. As for XMP on the RAM, I originally had it disabled (I saw no useful purpose for overclocking RAM on a storage server), but I was getting a different kernel panic with data corruption. It appeared that bits were sometimes flipping in memory, and enabling XMP seems to have resolved it. I will follow up in a few days with the results of disabling C-states.
Isorikk (Author) Posted February 23, 2023 (edited)
An update to my previous post: After disabling Global C-State Control in the BIOS, I went to restart the system a little later and for some reason it wouldn't POST anymore. It booted exactly one time with C-States disabled and then never again. To get it to POST, I either had to unplug the drives from the SAS HBAs or, as I ended up doing, flash the BIOS with a newer version, which appears to have reset the config. I'm going to let it run for a few days with C-States re-enabled after the update, but I have a feeling that the issue will persist...
JorgeB Posted February 23, 2023
System should still always boot with C-States disabled, sounds like a buggy BIOS.
Isorikk (Author) Posted February 23, 2023
20 minutes ago, JorgeB said: System should still always boot with C-States disabled, sounds like a buggy BIOS.
Tell me about it... 🙃
Isorikk (Author) Posted February 26, 2023
System froze again with Global C-States disabled and XMP profile turned off. Log doesn't really have anything, this is all it has at the time before it went unresponsive:
Quote:
Feb 25 16:32:39 GeminiNAS kernel: vfio-pci 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
Feb 25 16:32:39 GeminiNAS kernel: br0: port 2(vnet2) entered blocking state
Feb 25 16:32:39 GeminiNAS kernel: br0: port 2(vnet2) entered disabled state
Feb 25 16:32:39 GeminiNAS kernel: device vnet2 entered promiscuous mode
Feb 25 16:32:39 GeminiNAS kernel: br0: port 2(vnet2) entered blocking state
Feb 25 16:32:39 GeminiNAS kernel: br0: port 2(vnet2) entered forwarding state
Feb 25 16:32:41 GeminiNAS avahi-daemon[7613]: Joining mDNS multicast group on interface vnet2.IPv6 with address fe80::fc54:ff:fed4:45f8.
Feb 25 16:32:41 GeminiNAS avahi-daemon[7613]: New relevant interface vnet2.IPv6 for mDNS.
Feb 25 16:32:41 GeminiNAS avahi-daemon[7613]: Registering new address record for fe80::fc54:ff:fed4:45f8 on vnet2.*.
Feb 25 16:32:42 GeminiNAS acpid: input device has been disconnected, fd 6
Feb 25 16:32:42 GeminiNAS acpid: input device has been disconnected, fd 7
Feb 25 16:32:42 GeminiNAS acpid: input device has been disconnected, fd 8
Feb 25 16:32:53 GeminiNAS kernel: usb 1-9: reset full-speed USB device number 7 using xhci_hcd
I've attached the full syslog for today, with it going unresponsive at approximately 4:55pm.
Attachment: syslog.txt
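When a freeze leaves nothing useful in the log, one way to pin down when the system stopped responding is to look for long silences between consecutive syslog entries. The following is only a minimal sketch, assuming the classic "Mon DD HH:MM:SS" timestamp prefix seen in the excerpt above. The function names, the 10-minute threshold, and the default year are illustrative assumptions, not anything Unraid provides.

```python
from datetime import datetime, timedelta


def parse_timestamp(line, year):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a syslog line.

    Classic syslog stamps carry no year, so the caller supplies one
    (an assumption of this sketch). Returns None if the line does not
    start with a parsable timestamp.
    """
    try:
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def find_gaps(lines, min_gap=timedelta(minutes=10), year=2023):
    """Return (last_seen, next_seen, last_line) for every silence >= min_gap.

    The last line logged before a long silence is usually the best
    starting point when a system hangs without logging an error.
    """
    gaps = []
    prev_ts, prev_line = None, None
    for line in lines:
        ts = parse_timestamp(line, year)
        if ts is None:
            continue  # skip continuation lines and anything unparsable
        if prev_ts is not None and ts - prev_ts >= min_gap:
            gaps.append((prev_ts, ts, prev_line))
        prev_ts, prev_line = ts, line
    return gaps
```

For example, `find_gaps(open("syslog.txt"))` would list each silence of ten minutes or more in the attached log, along with the last line written before it. A quiet but healthy system can also produce such gaps, so the threshold needs tuning per machine.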
JorgeB Posted February 26, 2023
Try switching to ipvlan (Settings -> Docker Settings -> Docker custom network type -> ipvlan (advanced view must be enabled, top right))
Isorikk (Author) Posted February 26, 2023
8 hours ago, JorgeB said: Try switching to ipvlan (Settings -> Docker Settings -> Docker custom network type -> ipvlan (advanced view must be enabled, top right))
I've updated the configuration to ipvlan, not sure how it got set to macvlan in the first place, but good catch! So far I've encountered no issues. I will let it run for a few days again to see if the issue persists.
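To confirm from the command line which Docker networks still use macvlan after changing the setting, the JSON printed by `docker network inspect` can be filtered on its `Driver` field. This is a rough sketch rather than an Unraid tool: the helper names are made up for illustration, and it assumes the `docker` CLI is on the PATH of the host where it runs.

```python
import json
import subprocess


def networks_by_driver(inspect_json, driver="macvlan"):
    """Return the names of networks whose Driver matches `driver`.

    `inspect_json` is the JSON array printed by `docker network inspect`.
    """
    networks = json.loads(inspect_json)
    return [net.get("Name") for net in networks if net.get("Driver") == driver]


def inspect_all_networks():
    """Run `docker network inspect` over every network ID on this host.

    Assumes the docker CLI is installed; raises CalledProcessError if a
    docker command fails.
    """
    ids = subprocess.run(
        ["docker", "network", "ls", "-q"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    if not ids:
        return "[]"
    return subprocess.run(
        ["docker", "network", "inspect", *ids],
        capture_output=True, text=True, check=True,
    ).stdout
```

Calling `networks_by_driver(inspect_all_networks())` on the server should return an empty list once the custom network type has been switched to ipvlan and the Docker service restarted.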
Isorikk (Author) Posted February 27, 2023
A new development: The system has not frozen or gone unresponsive, but this morning I discovered that none of the Docker and plugin services could reach out to check for updates. I suspect this is directly related to changing from macvlan to ipvlan. I did some further investigation and determined that all of the services were able to reach my local network, including the gateway, but could not reach the internet. The Unraid system itself could reach the internet just fine; only the plugin/Docker services seemed to be blocked. A restart resolved the issue. I'm wondering if the issue may be caused by the network card...
I've attached more logs from this morning. It looks like a lot of weird stuff is happening with br0, but I'm not certain whether this was a one-time bug or a symptom of a larger issue.
Attachment: syslog.txt
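When containers can reach the gateway but not the internet, it helps to separate a DNS failure from a routing failure. The sketch below, meant to be run from inside an affected container, tests name resolution and an outbound TCP connection independently. The function names and the example targets in the usage note are assumptions chosen for illustration, not values taken from this thread.

```python
import socket


def can_resolve(hostname):
    """Return True if the system resolver can look up `hostname`."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused, unreachable, and timed-out attempts
        return False


def classify(dns_ok, tcp_ok):
    """Summarise the two checks as a rough diagnosis."""
    if dns_ok and tcp_ok:
        return "internet reachable"
    if tcp_ok:
        return "routing works, DNS failing"
    if dns_ok:
        return "DNS works, outbound TCP failing"
    return "no outbound connectivity"
```

A possible invocation is `classify(can_resolve("example.com"), can_connect("1.1.1.1", 443))`, where both targets are example values only. Connecting by IP address keeps the TCP check independent of DNS, which is the point of running the two tests separately.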
Isorikk (Author) Posted March 3, 2023 (Solution)
I believe my problems were caused by trying to run the VM on top of whatever else is going on. Disabling VMs has eliminated all the weird bugs. It's possible the culprit was the unsupported video card. I have since tried an unofficial kernel that adds drivers for the video card, and it has been playing well with Docker containers thus far. I'm going to go ahead and mark this thread as resolved, with the solution being: don't use VMs with new hardware!