unRAID Server (6.6.3) Shutting Down Frequently



Hi all.

 

I was wondering if someone might be able to help me troubleshoot an issue I have been seeing since upgrading from 6.5.3. Every now and then (usually after a day or two of uptime), my server shuts itself down.

 

I've been having this problem since moving from 6.5.3. The server ran just fine for weeks, and when 6.6.x was released I upgraded using the GUI. Since then, the shutdowns have been occurring all the time. I've kept upgrading as new releases have come out, but the issue persists. Most recently I deployed a new USB key built directly on 6.6.3, but the same thing happens.

 

What's odd is that it seems to be doing a shutdown or a hard crash, as the server will be in a powered-off state, not sitting at a console or panic screen. I've done everything I can think of to rule out hardware: temperature monitoring through IPMI and the BIOS, Memtest, resetting BIOS settings, reseating the power/reset headers on the motherboard, checking the IPMI event log, etc. Nothing appears to be wrong. I should note that this hardware ran NAS4Free for about 4 years with no issues prior to migrating to unRAID a month or two ago.
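
For reference, this is roughly how I've been checking the BMC from the OS side rather than through the web interface. It's just a sketch: ipmitool isn't part of stock unRAID, so I'm assuming it has been added via a plugin or similar, and the same information is visible in the IPMI web page anyway.

# Current sensor readings (temperatures, fan speeds, voltages)
ipmitool sensor

# System event log in readable form - look for thermal trips,
# voltage faults or power events around the time of the crash
ipmitool sel elist

# Chassis power state and last power event
ipmitool chassis status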

 

Are there any diagnostics I can perform on the OS? The diagnostics tool in the web GUI seems to reset after a power loss, so it doesn't show anything meaningful.
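
One idea I'm toying with is copying the syslog to the flash drive on a schedule, since /var/log lives in RAM on unRAID and gets wiped on every boot. A rough, untested sketch of what I mean; the destination path is just my assumption, and frequent writes will add some wear to the USB key:

# Copy the in-RAM syslog to the flash drive every 5 minutes so the
# last entries before a crash survive the reboot (/boot is the USB key)
mkdir -p /boot/logs
while true; do
  cp /var/log/syslog /boot/logs/syslog-last.txt
  sleep 300
done

Run in the background (e.g. from a screen session), that should leave a copy of the log that is at most a few minutes stale when the machine dies.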

 

Here are the stats for my server. I'm running only a minimal set of Docker containers (Plex, Sonarr, etc.) and CPU/memory utilisation is very low.

 

Model: Custom

M/B: Supermicro - X9SCL/X9SCM

CPU: Intel® Xeon® CPU E3-1230 V2 @ 3.30GHz

HVM: Enabled

IOMMU: Enabled

Cache: 256 kB, 1024 kB, 8192 kB

Memory: 32 GB Single-bit ECC (max. installable capacity 32 GB)

Network: eth0: 1000 Mb/s, full duplex, mtu 1500
 eth3: 1000 Mb/s, full duplex, mtu 1500

Kernel: Linux 4.18.15-unRAID x86_64

OpenSSL: 1.1.0i

Uptime: 0 days, 00:06:45

Link to comment

Is it possible that something could be hitting the power switch on the case? (I can recall a couple of instances of this in the past. One time it was a cat; children are another possibility.)

 

Is it possible that someone is having some fun with you and shutting it down remotely via the GUI? (Try changing the server password.)

 

Are you getting a parity check on restart (indicating a hard shutdown), or no parity check (indicating a normal shutdown)?

 

@jonathanm has a good possibility with the PS suggestion.

Link to comment

Are you saying the server is being powered off (rather than crashing but staying powered on)? This distinction is important, as powering down always needs a specific trigger to activate it, and typically that trigger would be hardware related. If the server is staying powered on but unresponsive (i.e. crashed), then a software problem also becomes a likely candidate.

 

In the short term, the first step is to install the “Fix Common Problems” plugin and put it into troubleshooting mode (note that this mode gets unset if you reboot), which might help in capturing some useful log information.

Link to comment
2 minutes ago, jonathanm said:

This sort of issue is usually caused by a failing power supply.

 

What model is it, and how old? How many hard drives, and what models are they? Graphics cards? VM passthrough for graphics?

Quite possibly. However, during the Memtest, which I ran for over a full day, there were no outages. Since typing the message above, it has gone down again! It was up for only about 30 minutes this time.

 

The PSU is an EVGA BQ modular 500 W, if I recall correctly. Graphics: none, and all VM functionality is disabled. I'm very much using this as a traditional NAS server. Nothing fancy.

Link to comment
1 minute ago, Frank1940 said:

Is it possible that something could be hitting the power switch on the case? (I can recall a couple of instances of this in the past. One time it was a cat; children are another possibility.)

 

Is it possible that someone is having some fun with you and shutting it down remotely via the GUI? (Try changing the server password.)

 

Are you getting a parity check on restart (indicating a hard shutdown), or no parity check (indicating a normal shutdown)?

 

@jonathanm has a good possibility with the PS suggestion.

It's a fair suggestion, but unless a cat could get into my garage, open my locked rack, climb up about 30U and power it off, then I'm going to say no. Although I don't ever trust a cat :D

 

In seriousness though, it's definitely not someone powering it down. It is locked away and secure. The GUI shutdown is interesting, but I don't think it could be that: I'm behind a firewall which I monitor, and the server itself is on a constrained VLAN with limited connectivity, so I think I'm OK from that standpoint. Strong password too.

 

A parity check does seem to kick in on the reboot. Most times it completes with zero issues, but as I mentioned in my last reply, this latest stop happened after about 30 minutes, so it would never have finished.

 

Oh, I forgot to mention: four 2 TB HDDs, two of which are parity, and one 500 GB SSD cache.

Link to comment
5 minutes ago, itimpi said:

Are you saying the server is being powered off (rather than crashing but staying powered on)? This distinction is important, as powering down always needs a specific trigger to activate it, and typically that trigger would be hardware related. If the server is staying powered on but unresponsive (i.e. crashed), then a software problem also becomes a likely candidate.

 

In the short term, the first step is to install the “Fix Common Problems” plugin and put it into troubleshooting mode (note that this mode gets unset if you reboot), which might help in capturing some useful log information.

Full power off. What is interesting is that when I power it back up the first time, the server doesn't boot from the USB key; I get the 'no OS' warning. Power off and on again and it boots fine. This is new since 6.6.3. I thought it could be a bad stick, which is why I built a new one.

 

I've got the FCP app. I'll try to get it into troubleshooting mode; I hadn't checked that option yet!

Link to comment

I would bet on the PS. Cheap (and quick) to test (especially with only four HDDs). If you have none in the house, try borrowing one from a friend (or purchase one from a vendor with a 30-day return policy).

 

EDIT: I had a PS go bad, but it would reboot the server. In my case, it was an easy find: I had replaced a dual-rail supply with a single-rail one a week earlier, so the PS was the last thing changed in the system...

Link to comment

When it comes back up, does it start an automatic parity check?

 

If so, that suggests an unexpected hardware power-off. I would then be looking at the power supply, and at items like the cooling system (in case the CPU is triggering a thermally induced power-down).

 

If a parity check is not initiated on powering up, then that indicates a tidy shutdown. The only candidate I can think of off-hand that might trigger that would be a UPS-related event. Are you using a UPS?

Link to comment
18 minutes ago, Frank1940 said:

I would bet on the PS. Cheap (and quick) to test (especially with only four HDDs). If you have none in the house, try borrowing one from a friend (or purchase one from a vendor with a 30-day return policy).

 

EDIT: I had a PS go bad, but it would reboot the server. In my case, it was an easy find: I had replaced a dual-rail supply with a single-rail one a week earlier, so the PS was the last thing changed in the system...

I can certainly give it a go; I have a spare kicking around here, so there's no harm in trying. I'll post back with my findings.

Link to comment
19 minutes ago, itimpi said:

When it comes back up, does it start an automatic parity check?

 

If so, that suggests an unexpected hardware power-off. I would then be looking at the power supply, and at items like the cooling system (in case the CPU is triggering a thermally induced power-down).

 

If a parity check is not initiated on powering up, then that indicates a tidy shutdown. The only candidate I can think of off-hand that might trigger that would be a UPS-related event. Are you using a UPS?

That's interesting to know; it certainly seems like a dirty shutdown then. Thermals are no problem: IPMI monitoring of temperatures and the event logs indicate no issues. It's also cold in the UK at the moment; I reckon I could run this thing passively with no problems :D

 

Joking aside, that was my first hunch, but I am certain that it's thermally sound. I'm going to try swapping the PSU as you and others have suggested. I'll keep you all posted!

Link to comment
23 hours ago, FunkadelicRelic said:

Thanks everyone for the help so far. Going to try the PSU; the responses have been fairly unanimous, so we'll see how that goes.

OK, the new PSU is in. Replaced with a spare 450 W EVGA CX modular PSU. First power-up is fine and a parity check is in progress. I'll find out soon enough whether that was the culprit. Keep you posted.

Link to comment
1 hour ago, FunkadelicRelic said:

OK, the new PSU is in. Replaced with a spare 450 W EVGA CX modular PSU. First power-up is fine and a parity check is in progress. I'll find out soon enough whether that was the culprit. Keep you posted.

Well, the server just lost power again with the new PSU. It must have stayed up for an hour or so this time.

 

Anyone else have any other suggestions?

Link to comment
16 hours ago, Frank1940 said:

Do you have a UPS, and are you using the UPS shutdown app? You can easily check a UPS by shutting your server down and connecting a lamp load of a couple of hundred watts to it. Pull the power plug and see how long the battery lasts.

Unfortunately not.

 

I'm now working on the assumption that something is shorting out. The rackmount chassis has been in place for years, so I wouldn't imagine it's the motherboard, but I do have questions about the backplane and disk trays. I'm going to swap everything into a spare case with no backplane or caddies and see what happens. Guess that's my weekend project sorted.

Link to comment
3 weeks later...

So I seem to have resolved this, although I'm not entirely sure what was causing the issue, as I changed quite a few things in the process. To summarise, it must have been environmental.

 

I switched the PSU and changed out the CPU fan and TIM. It still crashed. In the end, I had a spare 4U case of identical make, so I decided to lift and shift the components into the spare case, backplane included. When moving the motherboard, I noticed the board had quite a lot of thick black gunk (I thought it was scorch marks to start with!) around some of the capacitors. I gave it a good clean with 99% isopropyl alcohol and it cleaned right up. Reassembled, and everything seems to be OK. It's now been up for over 24 hours, so fingers crossed.

 

If I were to guess, I'd imagine it was a fan cable or one of the case headers causing a short.

 

Anyway, I appreciate everyone's help trying to diagnose this.

 

Regards.

 

Tom.

Link to comment

I have the same issue.  Going back to 6.5.3 seems to be stable.

I tried 6.6.0, 6.6.1 and 6.6.2. I have since jumped to 6.6.4 and 6.6.5 and it has been good.

 

I thought it was possibly related to a Realtek LAN card, but because my system hard-stops, that seems different from those symptoms. Neither my console nor my syslog showed anything.

 

Are you still running 6.6.3, or did you upgrade to 6.6.4/6.6.5?

Or maybe it's even a plugin that got an update!? Just throwing wild ideas out now.

Link to comment
