
Help! My Unraid server is stopping!


Roscoe62


Went to check some content on my Unraid server only to find that it was offline - I mean COMPLETELY offline - no power.

 

I keep it down in our garage, where the temps are slightly cooler. I'm running Unraid server v3.x in a Cooler Master CM Stacker (the newer one with space for only one power supply), with a total of seven 500GB Seagate SATA HDDs in my array, all in Promise hot-swap cases. The power supply is, IIRC, an Antec TruePower 550W. Oh yeah, I keep this thing running 24/7.

 

This is probably the third time this has happened in around 6 months, which I guess isn't too bad, but I'd like to know what is causing the server to power down and deal with it. I'm not using a UPS yet, but we don't really have many problems with power here. The last time we had a power cut was 13 months ago - the night my son was born!

 

Can you guys think of any way to narrow down what might be causing this? I'm grateful for any suggestions of how to solve this little mystery.

 

Thanks!!

Link to comment

Sorry, I should have been clearer.

 

I try to access the content on the server remotely and I cannot - I get an error message. So I go down to check on the server and all the lights are out and the power is off. It comes online again simply by pressing the power button.

Link to comment

A voltage drop could cause it, I'm sure - and you'd usually never notice one. Does your BIOS allow for power on/off scenarios? I think my BIOS had all these options disabled. One option allowed the computer to reboot if such an event took place (compared to staying off if the power failed). A UPS is certainly the best long-term solution to power fluctuations (if that's what it is).
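
If you want to pin down exactly when it goes off, here's a rough sketch of one way to do it. This is purely my own assumption, not anything built into Unraid: have the server (or another machine that pings it) append one ISO timestamp per minute to a log file, then look for gaps afterwards. The names `parse_log` and `find_gaps` and the one-minute interval are made up for illustration.

```python
from datetime import datetime, timedelta


def parse_log(lines):
    """Parse one ISO-8601 timestamp per line, skipping blank lines."""
    return [datetime.fromisoformat(s.strip()) for s in lines if s.strip()]


def find_gaps(timestamps, expected=timedelta(minutes=1), tolerance=3):
    """Return (last_seen, next_seen) pairs where heartbeats stopped.

    A gap is reported when two consecutive heartbeats are more than
    `tolerance` times the expected interval apart (hypothetical defaults).
    """
    ordered = sorted(timestamps)
    limit = expected * tolerance
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > limit]
```

Each pair tells you when the box was last seen alive and when it came back. If the "last seen" times cluster around the hottest part of the day, that points towards heat; if they look random, power is the more likely suspect.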

Link to comment

I can think of two reasons why your server might be completely offline.

 

The first, as mentioned by flambot, is a power fluctuation - particularly if your BIOS isn't set to automatically boot up after a power outage. As he mentioned, a UPS is the best solution for this problem.

 

The second is heat. If you're using a fairly new MB, it probably comes equipped with thermal protection "smarts" built into the BIOS that try to protect your equipment if it gets too hot. Check to make sure the CPU heatsink and fan are still firmly attached and working properly, as well as the case fans and/or PSU fan. If any of these are faulty, the box could easily overheat.

Link to comment

Hmmm....I'm using an Asus P5PE-VM for a m/board, and I'm pretty sure there is a BIOS setting which will allow the machine to boot up automatically after a power failure. I'll check this out.  :)

 

As far as heat goes, well I can do a quick visual inspection but apart from that I don't think Unraid has any temp monitoring tools in its toolbox, so I'm not sure what else I can do there.

 

Also, I'm interested in the idea of using a UPS. Are there any makes/models that are recommended by Unraid or its core of users?

Link to comment

Quote: "Hmmm....I'm using an Asus P5PE-VM for a m/board, and I'm pretty sure there is a BIOS setting which will allow the machine to boot up automatically after a power failure. I'll check this out.  :)"

 

Well, you may be in luck.  That's the same motherboard I use.  Mine's not set to power up automatically (unlike my firewall), but I believe you'll find the correct BIOS setting under Power, APM Configuration, Restore on AC Power Loss.

 

Quote: "As far as heat goes, well I can do a quick visual inspection but apart from that I don't think Unraid has any temp monitoring tools in its toolbox, so I'm not sure what else I can do there."

 

UnRAID doesn't, but I know your motherboard does.  ;D  If you look in the BIOS under Advanced, CPU Configuration, CPU Internal Thermal Control, I suspect you'll see that it's turned on (by default). This setting tells the motherboard to shut down to protect your equipment in case it gets too hot. So...shut down your unit, take off the side (or top) cover and look inside. Is it full of dust bunnies? Dust can clog up a CPU or case fan and/or block airflow around a heatsink, either rendering them useless or greatly reducing their efficiency. If so, take your rig outside and use a can of compressed air to blow out all the bunnies.  ;)  If not, check to make sure the CPU heatsink/fan is still mounted properly (i.e., there's no side-to-side motion when you gently try to rock it). Now fire up your PC and make sure all the fans are spinning.

 

If everything's working, try removing and re-seating your memory modules. I found a posting on the Asus website from a user who indicated that memory problems could cause this phenomenon.

 

Quote: "Also, I'm interested in the idea of using a UPS. Are there any makes/models that are recommended by Unraid or its core of users?"

Two popular brands are APC and Belkin. I've had good luck with both, although Belkin is a little cheaper. I know Tom is considering including the necessary Linux drivers/code to automate a graceful server shutdown on power failure, but I'm not sure if both manufacturers adhere to a common API standard. Maybe others can weigh in here.

Link to comment

Hi Billk,

 

Thank you for your comprehensive reply.  :)

The server has shut down once more since then so I'm starting to suspect something other than power fluctuations.

 

Your procedures sound like a good weekend job, so I'll get on to that this weekend. Hopefully that will sort any issues out. Either way, I'll post back with what I find. I guess you never know when someone else will have a similar issue.

Link to comment

Hmmm....well the jury is still out I guess.  :-\

 

I powered the server down, opened it up and had a good look inside. There really wasn't much dust to speak of so I didn't bother having a really good clean out because everything looked pretty clean.

 

I connected up a monitor and a keyboard, powered up and changed the BIOS setting to allow the server to power up after a power failure, and then I carefully inspected all of the fans. All are spinning without any problems.

 

The only thing I did notice was a reasonable number of flying insect carcases lying on the bottom of the case. Each of these is a bit smaller than a common ant, but there were quite a few there. At the back of the case I have one space where I couldn't fit a case fan because of the excessively large CPU fan I run, so I just have the fan grill and some metal mesh bolted in place there. Inspecting this closely, I see there is sufficient space for these very small insects to get in. I had some acoustic foam lying around, so I cut a circle just large enough to cover this opening and I've now sealed it up just in case. After putting the server back in place I noticed some similar metal mesh material at the bottom of the case, so it's possible the insects can get in there as well. Depending on how things go I may have to seal this opening up too.

 

Anyway, when I booted the server up again and checked status via the web server, I noticed there is now an orange indicator on the parity drive, so I'm letting it complete the full parity sync job to see how that goes. It's probably around 30% of the way through now so I've got a bit of a wait in store.

 

If the problem remains after completing the parity sync, I guess I'm going to have to fork out for a new parity drive? Could this be the problem? Could the tiny insects be the problem? Something else....power supply?

 

Any comments etc welcome!

Link to comment

Insects in the case should be no problem.

They will probably not look for food in any of the PCI slots or RAM slots, and they are usually decent enough to go outside the case when feeling the urge to relieve themselves. :-))

 

Seriously though, lots of insects have acidic droppings, and that is not healthy for your server, and if they are small enough you can forget about trying to keep them out.

It is probably a good idea to take out all cards and RAM, blow out the sockets and reseat everything.

 

or.... delete that classic British punk/pop track: Adam and the Ants, "Antmusic".

 

Link to comment

Antmusic?! OK, you're showing your age now!!!

 

...but then, so am I!  ;D

 

Server seems to be OK at the moment. After it finished its parity sync everything seemed OK. Prior to me investigating the case, the server had powered down 3 times in the last 4 days. Since then it hasn't skipped a beat.

Link to comment

Archived

This topic is now archived and is closed to further replies.
