PioneerX Posted October 6, 2022 I recently decided to convert my ultra-reliable TrueNAS machine to UNRAID, since it needed a disk upgrade and UNRAID's ability to add single disks on the fly was very appealing. Since converting to UNRAID (with brand new IronWolf 8TB drives), a parity check has kicked in every day. I finally worked out that it is triggered by an unscheduled restart, but I cannot find what is causing the restart. Diagnostics are attached. Since I know the normal syslog does not persist across a restart, I had the syslog sent to the array, so that is also attached. Any ideas? Thanks very much for any help you can provide. emporium-diagnostics-20221006-1118.zip syslog.log
JorgeB Posted October 6, 2022 Nothing obvious is logged. One thing you can try is to boot the server in safe mode with all Docker/VMs disabled and let it run as a basic NAS for a few days. If it still crashes, it's likely a hardware problem; if it doesn't, start turning the other services back on one by one.
PioneerX Posted October 6, 2022 Author Thanks @JorgeB. Docker is already disabled and there are no VMs running at all. I have a separate 4-machine cluster that holds my VMs, so UNRAID was only ever intended to be a NAS. I have rebooted in safe mode and will keep an eye on it. NOTE: The only hardware that changed between TrueNAS and UNRAID was the disks (4x Dell Enterprise 3TB drives replaced with 3x 8TB IronWolf). All other hardware is the same as it was under TrueNAS, and that ran for 5 years without falling over at all. The whole thing is also on a 1500VA UPS.
JorgeB Posted October 6, 2022 Try v6.10.3, in case it doesn't like the newer kernel. If the result is the same, try blacklisting the iGPU driver from loading; there have been some stability issues with that driver.
PioneerX Posted October 13, 2022 Author Update: The problem is not solved and the machine still restarts 1-2 times in any 24h period (always at random). I have done all of the following:
- Downgraded to v6.10.3
- Found some hibernate settings in the BIOS and removed them
- Reset the BIOS to default
- Found an old, unused HBA inside the machine and removed it in case of compatibility problems
- Blacklisted the iGPU
- Moved the machine to different power feeds
- Replaced the PSU
...and yet the problem persists. The logs still do not contain any relevant information about the reboot, so I have to assume that the hardware in use is just not compatible and needs to be replaced (the board is https://www.newegg.com/biostar-nm70i-847-mini-itx/p/N82E16813138368, with 2x 8GB DDR3 installed). Like I said, this machine was originally running FreeNAS/TrueNAS Core for almost a decade before I changed to UNRAID ;). I will be ordering replacement hardware, since I can't entrust data to a machine that reboots randomly every few hours. I might sound negative about UNRAID but I'm 100% not. It's a great product and is used very successfully by many people; it just appears my hardware is not compatible.
itimpi Posted October 13, 2022 32 minutes ago, PioneerX said: might sound negative about UNRAID but I'm 100% not. It's a great product and is used very successfully by many people; it just appears my hardware is not compatible.
Very little hardware is incompatible. I would think that the most likely reason for random restarts is something like the CPU overheating. Have you checked the CPU fan and the thermal paste on its heatsink?
PioneerX Posted October 13, 2022 Author 3 hours ago, itimpi said: Very little hardware is incompatible. I would think that the most likely reason for random restarts is something like the CPU overheating. Have you checked the CPU fan and the thermal paste on its heatsink?
I thought the same, but even under an average load of 75% the CPU sits at a solid 52-53C. I checked the fan and renewed the paste anyway. One odd thing I have recently noticed: when the restart happens, the parity check kicks in (which I believe is normal), but the system NEVER restarts while a parity check is underway. The restart only happens when the system is at low load. As the system is only a NAS, the load when a parity check is not running is extremely low, but it has never restarted with the parity check running.
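Since the syslog shows nothing before each restart and the restarts only happen at low load, one way to gather more evidence is to record fan and temperature readings to persistent storage right up to the moment of failure. The following is a minimal sketch, not something used in this thread. It assumes the board's sensor driver is loaded so that readings appear under the standard Linux hwmon sysfs tree (`/sys/class/hwmon`); the function names and the log location are hypothetical.

```python
import glob
import os
import re
import time

# Standard hwmon attribute names: fan speed (RPM), PWM duty (0-255),
# and temperature (millidegrees Celsius).
SENSOR_FILE = re.compile(r"(fan\d+_input|pwm\d+|temp\d+_input)")


def read_int(path):
    """Return the integer stored in a sysfs file, or None if it is unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def sample_hwmon(root="/sys/class/hwmon"):
    """Collect one reading per fan, PWM and temperature attribute found."""
    readings = {}
    for chip in sorted(glob.glob(os.path.join(root, "hwmon*"))):
        for entry in sorted(os.listdir(chip)):
            if not SENSOR_FILE.fullmatch(entry):
                continue
            value = read_int(os.path.join(chip, entry))
            if value is None:
                continue
            key = f"{os.path.basename(chip)}/{entry}"
            # Convert millidegrees to degrees for temperatures only.
            readings[key] = value / 1000 if entry.startswith("temp") else value
    return readings


def log_samples(log_path, interval_s=10, count=None, root="/sys/class/hwmon"):
    """Append timestamped samples; fsync each line so it survives a power cut."""
    written = 0
    with open(log_path, "a") as log:
        while count is None or written < count:
            fields = " ".join(f"{k}={v}" for k, v in sample_hwmon(root).items())
            log.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {fields}\n")
            log.flush()
            os.fsync(log.fileno())
            written += 1
            if count is None or written < count:
                time.sleep(interval_s)
```

Calling `log_samples()` with a path on the array or flash drive (a location of your choosing) and leaving `count` unset keeps sampling until the machine goes down. The last lines of the file then show what the fans and temperatures were doing just before the restart.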
PioneerX Posted October 20, 2022 Author Solution Just to close this out for everyone: I finally figured out what is going on, and it's something I have never come across in my decades of IT experience. When the system is under load (read/write/parity rebuild) it's very stable, but once the load comes off it fails. The actual reason turned out to be very obscure. Under load, the motherboard fan control runs the fans up to around midway to deal with the increased case temperature. Once the load is removed, the case temperature starts to come down, so the fan controller reduces the fan speed. When the temperature has dropped enough (around 10 minutes after the load is removed), the fan controller hits its minimum setting and shorts the fan leads out. This causes the PSU to trigger short-circuit protection, which cuts power to the machine. Once the machine comes back online, UNRAID detects an unclean shutdown and starts a parity check. The check creates system load, which raises the temperature and therefore the fan speed, which keeps the system stable until the parity check is either cancelled or completes. Then, once the load is gone and the fan speed falls off, the controller shorts again and the cycle repeats. I have never seen a fan controller short out in this manner before, but I have confirmed it with an oscilloscope: the PWM duty cycle drops off as expected, but on reaching zero the power and ground pins get shorted together (this is what causes the PSU to fire short-circuit protection). I'm guessing the controller IC or the power MOSFET it's controlling has gone bad. I will be replacing the MB/CPU/RAM, since I can't replace the integrated fan controller, and replacing the MB means replacing all the other parts as well. Thanks for everyone's help along this road.
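The cycle described in the solution post can be sketched as a toy model to show why the restarts never coincided with a parity check. This is an illustration only, not a measurement of the hardware. The 10-minute cooldown comes from the post; the 60-minute parity check duration and the function name `simulate` are made-up assumptions chosen to keep the example short.

```python
def simulate(minutes, parity_check_min=60, cooldown_min=10):
    """Return the minutes at which the PSU trips in a toy model of the fault.

    parity_check_min is a hypothetical duration; cooldown_min is the roughly
    10 minutes the post reports between load removal and the short.
    """
    trips = []
    load_left = parity_check_min  # minutes of parity-check load remaining
    idle = 0                      # minutes since the load ended
    for minute in range(minutes):
        if load_left > 0:
            # Under load the fans stay above minimum, so nothing fails.
            load_left -= 1
            idle = 0
            continue
        idle += 1
        if idle >= cooldown_min:
            # Duty cycle reaches zero, the fan leads short, the PSU trips.
            trips.append(minute)
            # The unclean shutdown triggers a new parity check on boot.
            load_left = parity_check_min
            idle = 0
    return trips


print(simulate(200))  # [69, 139]
```

In this model every trip lands exactly `cooldown_min` minutes after a parity check ends and never during one, which matches the pattern reported earlier in the thread.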