November 18, 2024

Over the last three days or so, my Unraid server has been going unresponsive once or twice a day, and occasionally it reboots on its own after recovering. I suspected the macvlan issue, but my Docker is set to ipvlan, with bridging enabled and Host access to custom networks enabled.

This came on very suddenly. Before this, the server had no trouble exceeding 30 days of uptime, and in the last week I have not made any configuration changes that I'd suspect to be problematic, if any at all (I've checked against backups of my flash, and everything in Docker and Networking is more or less the same).

I've attached diagnostics from the last time the server rebooted itself. There's nothing obvious that I can see in the syslog or syslog-previous that suggests what is happening, or even that the server restarted at all. Any help is appreciated!

Attached hardware for reference. All of the memory is detected, and I'm not seeing any resource usage spikes.

tower-diagnostics-20241118-1836.zip
November 19, 2024 · Community Expert

Silicon lottery. You appear to be using similar hardware (seen in signature), which I would recommend to anyone. I will review your diag file soon.
November 19, 2024 · Community Expert

Some BIOS settings? XMP/overclocking? Reboot and run a mem test...

From reviewing the syslog, your unRAID server's unresponsiveness and sudden reboots appear to be symptomatic of either a hardware-related issue or a potential software configuration conflict. Here's a breakdown of what to check and possible next steps.

Confirm the processor heatsink is secure and its mounting pressure is good.

1. Common Hardware-Related Issues

a. Memory Problems
Faulty or improperly seated RAM can cause random reboots.
Action: Run a Memtest from the unRAID boot menu to check for memory errors. Check that all RAM sticks are securely seated.

b. Power Supply Problems
Sudden reboots may indicate that the power supply is failing or insufficient.
Action: Verify that your power supply unit (PSU) is adequate for the system's power draw. Inspect PSU cables and connections.

c. Overheating
Overheating of the CPU, motherboard, or other components can cause reboots.
Action: Monitor system temperatures using the Dynamix System Temperature plugin. Check for dust in fans or heatsinks and ensure adequate airflow.

2. Network and Docker Issues

I prefer macvlan myself...

a. MACVLAN vs. IPVLAN
Even with IPVLAN configured, certain network activities can cause kernel panics or reboots if there are conflicting settings.
Action: Verify that no Docker container on a custom network conflicts with the server's main IP address. Test disabling Docker temporarily to rule out Docker-related issues: /etc/rc.d/rc.docker stop
*Enable bonding... disable bridging...

b. Host Access to Custom Networks
Enabling Host Access to Custom Networks can sometimes introduce issues.
Action: Temporarily disable this setting in Settings > Docker and monitor for stability.

3. BIOS and Firmware Updates
Outdated BIOS or hardware firmware can lead to instability.
Action: Check for BIOS updates for your motherboard and apply the latest stable version. Ensure all other firmware (NICs, HBAs, etc.) is updated.

4. Plugin and Software Conflicts
Outdated or incompatible plugins can cause unRAID instability.
Action: Review and disable unnecessary or unused plugins. Update all installed plugins via the unRAID web GUI.
*At the boot menu, choose Safe Mode and test in that environment (no plugins...).

5. Disk-Related Issues
Failing or marginal disks can also cause system instability.
Action: Run SMART tests on all drives from Main > Devices. Look for any disks with high reallocated sectors, pending sectors, or other errors.
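For the disk check at the end of that list, the same attributes can also be read from a terminal. The following is a minimal sketch, assuming smartmontools' `smartctl` is available and the drive reports the classic ATA attribute table; NVMe drives and some SSDs use different layouts or attribute names, so an empty result is not proof of a healthy disk. The function name `smart_flag_attrs` and the device `/dev/sdX` are placeholders, not anything from the thread.

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints any
# reallocated/pending/uncorrectable sector attribute with a non-zero raw value.
# Assumes the ATA attribute table layout, where RAW_VALUE is the 10th column.
smart_flag_attrs() {
    awk '
        $1 ~ /^[0-9]+$/ &&
        ($2 == "Reallocated_Sector_Ct" ||
         $2 == "Current_Pending_Sector" ||
         $2 == "Offline_Uncorrectable") &&
        $10 + 0 > 0 { print $2 "=" $10 }
    '
}

# Example use (replace /dev/sdX with a real device):
#   smartctl -A /dev/sdX | smart_flag_attrs
```

No output means none of the three attributes had a non-zero raw value. Anything printed is worth a closer look in the drive's full SMART report.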
November 19, 2024 · Author

Thanks for the info, I'm working through these over the course of today. I've included the steps I took this morning below, and will stability test over the next day or two. The server was previously crashing or going unresponsive within 24 hours of uptime, so I'll know pretty quickly.

Next steps if still unstable: replace USB, re-seat CPU, and begin testing with all/selective dockers and plugins disabled. The other possible(?) suspect is an old cache SSD, which is the only non-new component in this build.

I'll update as I figure out more, and include another diag package if it goes unresponsive again.

---

Tower Troubleshooting, Nov 2024

- Memory tests run for ~12 hours, passing
- BIOS was a build from 2022, flashed to latest (2024-09-05)
- no overclocking set in BIOS

INTERNALS
- re-seat memory
- dust
- PSU is rated for 750W, we're still well under that (UPS is reporting peaks under 300W)
- re-seat CPU: need thermal compound, will try later
- check power cable seating
- vacuum behind case to make sure airflow is fine
- new USB on the way, just in case it's failing. It's plugged in to the internal USB 3 header, which has been fine, but I can try USB 2 or a port on the back of the case

DOCKER
- changed back to macvlan: this should be fine as of the latest OS version, which I'm on. If it was the culprit there would probably have been logs or traces
- bridging disabled
- host access to custom networks enabled

PLUGINS
- nvidia driver plugin was broken and failing to install new version
- uninstalled, rebooted, reinstalled, and so on...
November 19, 2024 · Community Expert · Solution

Also make sure this has been taken care of: https://forums.unraid.net/topic/46802-faq-for-unraid-v6/page/2/#findComment-819173
November 19, 2024 · Author

System is up following the BIOS C-state change; now waiting to see if it goes down again. I'll upload diagnostics again if that happens.

I'd just like to add a note: I'm very hesitant to re-paste and re-seat the CPU, partly because it's been working for nearly a year without issue, but also because temps in the case are pretty chilly by my standards and don't suggest any kind of overheating.
November 19, 2024 · Community Expert

Using the power supply idle control setting is usually preferable to completely disabling C-states, but for testing purposes it should do the same.
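To see what a BIOS C-state change actually did from the OS side, the Linux cpuidle sysfs interface lists the idle states the kernel exposes per CPU. This is a minimal sketch, assuming the kernel provides `/sys/devices/system/cpu/cpu0/cpuidle`; the function name `list_cstates` is hypothetical, and state names and counts vary by CPU and idle driver.

```shell
# Hypothetical helper: print each idle state for one CPU with its name,
# whether it is disabled, and how many times it has been entered.
# The directory argument defaults to cpu0's cpuidle directory.
# The body runs in a subshell so its variables stay local.
list_cstates() (
    dir="${1:-/sys/devices/system/cpu/cpu0/cpuidle}"
    for state in "$dir"/state*; do
        # Skip if the glob matched nothing or the state is unreadable.
        [ -r "$state/name" ] || continue
        printf '%s name=%s disabled=%s usage=%s\n' \
            "${state##*/}" \
            "$(cat "$state/name")" \
            "$(cat "$state/disable" 2>/dev/null || echo '?')" \
            "$(cat "$state/usage" 2>/dev/null || echo '?')"
    done
)

# Example use:
#   list_cstates
```

Comparing the output before and after the BIOS change shows whether the deeper states disappeared or simply stopped being entered (their usage counters stop increasing).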
November 20, 2024 · Author

We've topped 24h of uptime, so I'm going to go ahead and say it was the C-states stuff causing the issue out of nowhere. Confusing, considering the machine has been running for the better part of a year without this happening, but I'll take it! Thank you both for your time!
November 20, 2024

40 minutes ago, sethwv said:
"I'm going to go ahead and say it was the C-states stuff causing the issue out of nowhere"

This is the only way I've ever had Unraid crash on me (regularly and repeatedly) since 2018. I put the BIOS on that machine back to its defaults and plan to never touch those settings again.
November 20, 2024 · Author

1 minute ago, Espressomatic said:
"This is the only way I've ever had Unraid crash on me"

Darn! Live and learn! Just to confirm, were your defaults C-states OFF or ON? Because DISABLING them outright is what (seemingly) solved my issue here.
November 21, 2024

I'll have to check on that specific machine again, but I believe it defaults to ON with some settings preventing it from achieving the higher states. Enabling those is what caused the system to lock up about once every 12-24 hours. It was on the machine I've dedicated to routing/firewall, so I didn't have the patience to do much testing once I set it back.