Server Randomly Turning Off?



Whenever one of our servers in the data center has started randomly rebooting, the cause has usually been a faulty power supply. The voltage drops and the system behaves as if someone hit the reset button. We start there and usually end there.

 

If that doesn't fix it, memory has been the second most common culprit. Heat has never been a problem for us.

 

+1 for power supply. I know you've heard it a bunch of times already, but that's what I'd swap out first. I know it can be a pain in the ass to start replacing parts till the problem is cured, but sometimes that's the only solution. If you don't want to buy a new one before you know, maybe harvest one from another system just to test...

 

As far as the CPU is concerned, if there's no problem with the fan, it's possible the heatsink isn't seated properly.... Any chance you swapped out the processor or the heatsink and fan lately?

Link to comment

The PSU takes a ton of abuse in this forum. If the PSU is severely underpowered, it should power itself down in an overload state when trying to spin up all of your drives on power up, or when doing some dramatic spinups in unRAID (stopping the array or starting a parity check will spin up all drives if they are spun down). If it is not severely underpowered, the ebbs and flows of power demand would likely still allow all the drives to power up, albeit a bit more slowly than usual.
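To put a rough number on "underpowered at spin-up", here is a minimal back-of-the-envelope sketch. The function name and all of its parameters are hypothetical, and the figures in the usage line are made-up placeholders, not measurements. Real values have to come from the PSU label (12 V rail rating) and the drive datasheets (peak spin-up current).

```python
def spinup_headroom_amps(rail_amps, drive_count, spinup_amps_per_drive,
                         other_load_amps=0.0):
    """Return the 12 V current left over if every drive spins up at once.

    All inputs are supplied by the caller (assumptions, not real data):
      rail_amps             - rated 12 V rail capacity of the PSU
      drive_count           - number of drives spinning up together
      spinup_amps_per_drive - peak 12 V draw of one drive during spin-up
      other_load_amps       - everything else on the 12 V rail (CPU, fans, ...)

    A negative result means the simultaneous spin-up exceeds the rail rating.
    """
    demand = drive_count * spinup_amps_per_drive + other_load_amps
    return rail_amps - demand


# Placeholder numbers for illustration only.
headroom = spinup_headroom_amps(rail_amps=30.0, drive_count=12,
                                spinup_amps_per_drive=2.0,
                                other_load_amps=5.0)
print(f"12 V headroom at spin-up: {headroom:.1f} A")
```

If the result is comfortably positive, a simultaneous spin-up is unlikely to be what trips the supply. If it is negative or close to zero, the overload shutdown described above becomes plausible.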

 

If a PSU dies, my experience (and I've had my share), is that they die like a light bulb.  There is (or used to be) an internal fuse in the PSU that blows.  I've tried replacing that fuse and never had luck reviving a dead one for more than 30 seconds.

 

I'm not saying yours might not be doing flakey things, but this has just been my experience in building and troubleshooting my own machines (the first PC I built was an 8MHz 8088). Random power downs when the server is basically idling are not normally a PSU problem.

 

Cabling problems are the #1 greatest cause of flakey behavior.  My first step would be to unplug and replug, securely, every connection.  If it keeps happening, my money is on a failing capacitor on the motherboard.

Link to comment
If a PSU dies, my experience (and I've had my share), is that they die like a light bulb.

 

I'm not saying that it's the case here, but I've known PSUs to suffer a variety of 'soft' failures. As an example, in one case, the ethernet interface was intermittently failing to initialise while, in all other respects, the machine appeared to be operating normally. Replacing the PSU solved the problem with the ethernet interface, and the machine has continued to operate normally for several years. That machine (still with its original mobo/processor), purchased in 2002, is still going strong on its third PSU (the second one did die a 'sudden death' shortly after its arrival in the Philippines).

 

Oh, and talking of first 'PC' built, mine was in 1978: a 4MHz INS8060 with 256 bytes of memory. Input was from a hexadecimal keypad, and output was on an 8-digit 7-segment display.

Link to comment

I like ruling out the PSU first because:

  • It's often the culprit
  • A compatible, working tester PSU can be found in many other systems, unlike CPUs, RAM, etc.

 

If one is available, it doesn't hurt to swap it in and confirm or rule out the PSU as the issue. A flaky PSU can fail whether the system is at idle or under load (though I agree it tends to happen under load... Mind you, the OP *does* state that the issue was usually noticed while recording TV or ripping video...)

 

Link to comment

set_rtc_mmss: errors

 

http://nixcraft.com/centos-rhel-fedora/13458-set_rtc_mmss-cant-update-59-0-error.html

 

What does set_rtc_mmss: can't update from 54 to 5 mean?

 

The function set_rtc_mmss() updates minutes and seconds of the CMOS clock from system time. It does not update the hour or date to avoid problems with timezones.[1] The message shown was added to make users and implementers aware of the problem that not all time updates will succeed.

 

Imagine the system time is 17:56:23 while the CMOS clock is already at 18:03:45. Updating just minutes and seconds would set the hardware clock to 18:56:23, a wrong value. The solution for this problem is either to wait a few minutes, or to install a kernel patch that fixes the problem. Normally a wrong time in the hardware clock will not show up until after reboot, or maybe after APM slowed down your system.
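The example above can be reproduced in a few lines. This is only an illustration of the arithmetic, not the kernel's actual code: `mmss_only_update` is a hypothetical helper, and the calendar date is an arbitrary placeholder, since only the time of day matters here.

```python
from datetime import datetime, timedelta


def mmss_only_update(cmos: datetime, system: datetime) -> datetime:
    """Copy only minutes and seconds from system time into the CMOS time.

    The hour and date of the CMOS clock are deliberately left untouched,
    mirroring the behaviour described in the text.
    """
    return cmos.replace(minute=system.minute, second=system.second)


# Times from the example; the date is an arbitrary placeholder.
system_time = datetime(2000, 1, 1, 17, 56, 23)
cmos_time = datetime(2000, 1, 1, 18, 3, 45)

updated = mmss_only_update(cmos_time, system_time)

print("CMOS before update:", cmos_time.time())        # 18:03:45
print("CMOS after update: ", updated.time())          # 18:56:23
print("Error before update:", cmos_time - system_time)  # 0:07:22
print("Error after update: ", updated - system_time)    # 1:00:00
```

The CMOS clock starts out about seven minutes fast and ends up a full hour fast, which is why the update is refused and the message is logged instead.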

 

 

 

Link to comment

A couple of random shutdowns today, and no syslog was running. But I noticed that after restarting the server, my PC gets a blue screen of death. I wonder if something in my network is causing this? I have a D-Link DIR-655 router feeding a 5-port and an 8-port gigabit switch, which go to each room and the server.

Link to comment
