September 17, 2025

Hello,

I upgraded from 6.12.15 to 7.1.4 before doing some maintenance on my host. I needed to swap both parity disks for larger ones so I can add more disks to the array. As part of this I also removed a secondary HDD BTRFS cache pool, as well as the parts of the config related to a Win10 gaming VM with a passed-through NIC, GPU, USB controller, NVMe drive, etc. (huge pages were enabled; ACS override is still enabled). This config has been rock solid since I built it in 2019. It never crashed a single time, through quarterly parity checks and heavy CPU, disk, and GPU workloads. After my upgrade to 7.1.4, I also installed the Nvidia drivers, because I intend to use the GPU for Docker and other VMs now instead of having it passed through to one VM 100% of the time.

Following the upgrade, and in preparation for the parity drive swap, I ran a parity check. I woke up to an alert that the parity check had finished, but it was expected to take over a day with my new 20TB disks. I logged in to see the array was stopped and, after checking, learned the server had rebooted. Looking at the logs, I saw a machine check event DURING BOOT! I found that very odd, as I had never seen that before. I brushed it off as a fluke and initiated the check again; it finished. I initiated the first swap; it finished. I left it for a week to test stability, with no issues. I wrote some data to the array, initiated another parity check, and it crashed again. At this point I enabled the syslog server to try to catch any errors. After more than another week of stability, I kicked off the parity sync for the second parity disk, and it crashed overnight again.

I am very much leaning towards rolling back to 6.12.15 to rule out an OS issue, but I want to give the community and devs a chance to look into this and maybe help others avoid the same problems in the future.
I have found similar threads to mine, but they have gone dark with no resolution.

Unfortunately, last night the syslog server didn't catch anything. The last entries I see are:

Sep 17 00:00:01 xxx Plugin Auto Update: Checking for available plugin updates
Sep 17 00:00:03 xxx Plugin Auto Update: Community Applications Plugin Auto Update finished

The email I received about the parity check being finished arrived around 12:32 AM, so presumably the server crashed sometime between 12:20 and 12:30.

In the meantime I have another parity sync running to get my array back to healthy.

zenhammer-diagnostics-20250917-0939.zip
September 17, 2025 Community Expert

30 minutes ago, hammsandwich said: "learned the server had rebooted"

Server rebooting by itself is almost always a hardware issue. Since it's a Ryzen, make sure this has been taken care of, even if it wasn't a problem before:
https://forums.unraid.net/topic/46802-faq-for-unraid-v6/page/2/#findComment-819173
September 17, 2025 Author

2 minutes ago, JorgeB said: "Server rebooting by itself is almost always a hardware issue. Since it's a Ryzen, make sure this has been taken care of, even if it wasn't a problem before: https://forums.unraid.net/topic/46802-faq-for-unraid-v6/page/2/#findComment-819173"

I would tend to agree with you. I do indeed have C-states disabled, although ideally I was looking to re-enable them eventually to try to reduce power consumption. At this point, C-states have been disabled for the better part of 6 years. The only other thing I can think of is that I swapped out the CPU cooler, so I may try reseating the CPU. I am currently tailing the logs while it's rebuilding parity to try to catch the error. Will post what I find.
September 17, 2025 Community Expert

See the part in the link about spontaneous reboots for 5000 series CPUs, you may need to increase the voltage a little.
September 17, 2025 Author

Interesting. It makes some sense that the newer kernel could be causing these issues. I previously had an all-core static clock set, and I have dropped the clock speed while leaving the voltage alone for now. My intent was to move to a power-efficient PBO setup, with negative CO and C-states, to try to save more power.

I am looking at the bug report linked in the post, and some people noted that a BIOS update helped with stability. My board is an ASRock X470 Taichi, currently on BIOS 4.73, which no longer appears on the ASRock site as it was a beta BIOS. I will start looking into updating the BIOS and which version might be my best bet moving forward.
September 17, 2025 Author

OK, I'm getting the following after a hard crash:

Sep 17 15:16:12 zenhammer kernel: mce: [Hardware Error]: Machine check events logged
Sep 17 15:16:12 zenhammer kernel: mce: [Hardware Error]: CPU 9: Machine Check: 0 Bank 5: bea0000000000108
Sep 17 15:16:12 zenhammer kernel: mce: [Hardware Error]: TSC 0 ADDR ffffff81afa9e2 MISC d012000100000000 SYND 4d000000 IPID 500b000000000
Sep 17 15:16:12 zenhammer kernel: mce: [Hardware Error]: PROCESSOR 2:a20f10 TIME 1758136507 SOCKET 0 APIC 3 microcode a20102d

Again, these only appear in the syslog following the reboot, which is odd. I ran an extended preclear on the new parity disks, but is it possible one of them is bad?

I will try giving the CPU more voltage, but I also think upgrading to BIOS 8.01 with AGESA 1.2.0.8 might be a viable solution.
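For anyone reading the MCE lines in the post above, here is a minimal sketch of how the upper flag bits of the `Bank 5` status value can be decoded. It covers only the architectural x86 flag bits (63 down to 57); the vendor-specific fields and the low error-code bits are deliberately left out, and the names `STATUS_FLAGS` and `decode_mce_status` are hypothetical, chosen for illustration.

```python
# Architectural flag bits in the upper part of an x86 MCi_STATUS value.
# Vendor-specific bits and the low-order error code are not decoded here.
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # MISC register contents are valid
    58: "ADDRV",  # ADDR register contents are valid
    57: "PCC",    # processor context corrupt
}


def decode_mce_status(status: int) -> list[str]:
    """Return the names of the flag bits set in a 64-bit status value."""
    return [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]


if __name__ == "__main__":
    # Value copied from the "Bank 5" line in the syslog excerpt.
    print(decode_mce_status(0xBEA0000000000108))
    # → ['VAL', 'UC', 'EN', 'MISCV', 'ADDRV', 'PCC']
```

The `UC` and `PCC` flags indicate an uncorrected error with corrupted processor context, which fits a hard crash rather than an error the system logged and recovered from.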
September 18, 2025 Author

Increasing the voltage did not help. The system crashed 3 more times during the parity rebuild. I found some additional threads indicating my AGESA version is incompatible with the new AMD P-State driver in the newer kernels. I took the gamble and upgraded my BIOS to 8.01, which unfortunately also reset everything to default and will not let me reload my old BIOS profiles. However, after getting the standard settings reconfigured, Unraid has been rebuilding for over 12 hours so far. I will keep updating as I test further. Interesting that the parity workload seems to be the triggering factor.
September 18, 2025 Community Expert

You can also use a kernel option to revert to the old ACPI driver if needed.
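The post does not name the option. As a minimal sketch, assuming the standard Linux cpufreq sysfs layout, the snippet below reports which frequency-scaling driver is active, which is a quick way to confirm a driver change took effect after a reboot. The function name `scaling_driver` and its `sysfs_root` parameter are hypothetical, added for illustration. The kernel documents an `amd_pstate=disable` boot parameter as one way to fall back to `acpi-cpufreq`; whether that is the option meant here is an assumption.

```python
from pathlib import Path
from typing import Optional


def scaling_driver(cpu: int = 0, sysfs_root: str = "/sys") -> Optional[str]:
    """Return the cpufreq scaling driver for one CPU, or None if unavailable."""
    path = (
        Path(sysfs_root)
        / "devices/system/cpu"
        / f"cpu{cpu}"
        / "cpufreq/scaling_driver"
    )
    try:
        return path.read_text().strip()
    except OSError:
        # No cpufreq support exposed for this CPU, or the file is unreadable.
        return None


if __name__ == "__main__":
    # Prints the driver name reported by the kernel for cpu0, if any.
    print(scaling_driver())
```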
September 18, 2025 Author

The fact that my old BIOS was taken down from the download page ended up being a deciding factor in moving to a newer BIOS. I would like to take advantage of newer kernel features and drivers, and updating my BIOS seemed like a logical step. I may still try reverting to the old driver at some point to test idle power efficiency.
October 1, 2025 Author Solution

I just want to close this out and say that the new kernel in Unraid 7 has more aggressive voltage/frequency curves, and my negative CO on the CPU became unstable in the new release. Testing in Linux distros with older kernels and in Windows confirms this finding; everything was 100% stable with my old settings in those other environments. I basically had to remove my PBO settings and run the CPU at stock, and things have been 100% stable since.