Random Restarts


Solved by PioneerX

I recently decided to convert my ultra-reliable TrueNAS machine to UNRAID since it needed a disk upgrade and UNRAID's ability to add single disks on the fly was very appealing.

 

Since converting over to UNRAID (with brand new IronWolf 8TB drives), the parity check has kicked in every day. I finally worked out that it's kicking in because of an unscheduled restart, but I cannot find what is causing the restart.

 

Diagnostics are attached. Since I know the normal syslog is not persistent over the restart I had the syslog sent to the array so that's also attached.
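
For anyone chasing a similar symptom, here is a minimal sketch of how a saved syslog could be scanned for boots. It assumes each boot starts with the kernel's "Linux version" banner and simply reports the last line logged before it, so you can see whether anything was written just before the machine went down. The function name `find_boots` and the sample lines are made up for illustration; they are not taken from the attached log.

```python
from typing import Iterable, List, Optional, Tuple

# Assumption: the kernel banner is the first kernel message of every boot.
BOOT_MARKER = "kernel: Linux version"


def find_boots(lines: Iterable[str]) -> List[Tuple[Optional[str], str]]:
    """Return (last line before the boot, boot banner line) for each boot found."""
    boots: List[Tuple[Optional[str], str]] = []
    previous: Optional[str] = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        if BOOT_MARKER in line:
            boots.append((previous, line))
        previous = line
    return boots


# Invented sample lines, only to show the shape of the result.
sample = [
    "Oct  6 03:10:01 emporium crond[1234]: routine job",
    "Oct  6 03:42:17 emporium kernel: Linux version x.y.z",
    "Oct  6 03:42:17 emporium kernel: Command line: ...",
]
for before, banner in find_boots(sample):
    print("last entry before boot:", before)
    print("boot banner:           ", banner)
```

If the line before each banner is ordinary routine logging with no shutdown messages, that points at a sudden power loss or hard reset rather than an orderly reboot. To run it on a real file, pass the open file object to `find_boots` instead of `sample`.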

 

2022-10-06%2011_33_53-emporium_Main.png

 

Any ideas? Thanks very much for any help you can provide.

 

 

emporium-diagnostics-20221006-1118.zip syslog.log


Thanks @JorgeB.

 

Docker is already disabled and there are no VMs running at all. I have a separate 4-machine cluster that holds my VMs, so UNRAID was only ever intended to be a NAS. I have rebooted in safe mode and will keep an eye on it.

 

NOTE: The only hardware that was changed between TrueNAS and UNRAID was the disks (4x Dell Enterprise 3TB drives replaced with 3x 8TB IronWolf). All other hardware is the same as it was under TrueNAS, and that ran for 5 years without falling over at all. The whole thing is also on a 1500VA UPS.


Update:

 

The problem is not solved and the machine still restarts 1-2 times in any 24h period (always at random times). I have done all of the following:

 

  • Downgrade to v6.10.3
  • Found some hibernate settings in the BIOS that have now been removed
  • Reset the BIOS to default
  • Found an old, unused HBA inside the machine and removed it in case of compatibility problems
  • Blacklisted iGPU
  • Moved the machine to different power feeds
  • Replaced the PSU

...and yet the problem persists. The logs still do not contain any relevant information about the reboot, so I have to assume that the hardware in use is just not compatible and needs to be replaced (the board is https://www.newegg.com/biostar-nm70i-847-mini-itx/p/N82E16813138368 with 2x 8GB DDR3 installed). Like I said, this machine was running FreeNAS/TrueNAS Core for almost a decade before I changed to UNRAID ;). I will be ordering replacement hardware since I can't entrust data to a machine that reboots randomly every few hours :(

 

I might sound negative about UNRAID but I'm 100% not. It's a great product and used very successfully by many people; it just appears my hardware is not compatible.

32 minutes ago, PioneerX said:

I might sound negative about UNRAID but I'm 100% not. It's a great product and used very successfully by many people; it just appears my hardware is not compatible.

Very little hardware is incompatible.  I would think that the most likely reason for random restarts is something like the CPU overheating.   Have you checked the CPU fan and the thermal paste on its heatsink?

3 hours ago, itimpi said:

Very little hardware is incompatible.  I would think that the most likely reason for random restarts is something like the CPU overheating.   Have you checked the CPU fan and the thermal paste on its heatsink?

I thought the same, but even under an average load of 75% the CPU sticks at a solid 52-53C. I checked the fan and renewed the paste anyway.

 

One of the odd things I have recently noticed is that when the restart happens the parity check kicks in (which I believe is normal), but the system NEVER restarts while a parity check is underway. The restart only happens when the system is at low load. As the system is only a NAS, the load when a parity check is not running is extremely low.

  • Solution

Just to close this out for everyone: I finally figured out what is going on, and it's something I have never come across in my decades of IT experience.

 

When the system is under load (read/write/parity rebuild) it's very stable, but once the load comes off it fails. The actual reason turned out to be very obscure. Under load, the motherboard fan control runs the fans up to around midway to deal with the increased case temp. Once the load is removed, the case temp starts to come down and the fan controller reduces the fan speed. When the temp has dropped far enough (around 10 minutes after the load is removed), the fan controller hits its minimum setting and shorts the fan leads out, which triggers the PSU's short circuit protection and cuts power to the machine. Once the machine comes back online, UNRAID detects an unclean shutdown and starts a parity check. The check creates system load, which raises the temp and therefore the fan speed, and that keeps the system stable until the parity check is either cancelled or completes. Then the load is gone, the fan speed falls off, the controller shorts again and the cycle repeats.
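
The loop described above can be sketched as a toy simulation. This is only an illustration of the cycle, not a model of the actual board: every number in it (temperatures, heating and cooling rates, the 60 minute parity check) is a made-up assumption, chosen so that the fault hits about 10 minutes after the load comes off.

```python
from typing import List


def fan_duty(temp_c: int, fault_temp_c: int, full_temp_c: int) -> int:
    """Map case temperature linearly to a 0-100% duty cycle (hypothetical curve)."""
    span = full_temp_c - fault_temp_c
    duty = (temp_c - fault_temp_c) * 100 // span
    return max(0, min(100, duty))


def simulate(total_minutes: int,
             check_minutes: int = 60,   # assumed parity check duration
             idle_temp_c: int = 30,     # assumed case temp at rest
             load_temp_c: int = 45,     # assumed case temp under load
             fault_temp_c: int = 35,    # assumed temp where duty reaches 0%
             heat_per_min: int = 3,
             cool_per_min: int = 1) -> List[int]:
    """Return the minutes at which the simulated machine loses power."""
    temp = load_temp_c   # start warm: the load has just been removed
    load_left = 0        # minutes of parity check remaining
    restarts: List[int] = []
    for minute in range(1, total_minutes + 1):
        if load_left > 0:
            temp = min(load_temp_c, temp + heat_per_min)
            load_left -= 1
        else:
            temp = max(idle_temp_c, temp - cool_per_min)
        if fan_duty(temp, fault_temp_c, load_temp_c) == 0:
            # Duty hit zero: the faulty controller shorts the fan header,
            # the PSU trips, and the reboot starts a parity check.
            restarts.append(minute)
            load_left = check_minutes
    return restarts


print(simulate(200))
```

With these assumed numbers the machine only ever goes down while idle, never during the parity check, which matches the pattern seen in practice.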

 

I have never seen a fan controller short out in this manner before, but I have confirmed it with an oscilloscope: the PWM duty cycle drops off as expected, but on reaching zero the power and ground pins get shorted together (this is what causes the PSU to fire short circuit protection). I'm guessing the controller IC or the power MOSFET it's controlling has gone bad.

 

I will be replacing the MB/CPU/RAM since I can't replace the integrated fan controller, and replacing the MB means replacing the other parts as well.

 

Thanks for everyone's help along this road.
