Server Randomly Turning Off?

DoeBoye · April 18, 2011

When one of our servers in the data center has randomly rebooted, the cause has usually been a faulty power supply. The voltage drops and the system thinks someone hit the reset button. We start there and usually end there.

If that doesn't fix it, memory is the second most likely culprit. Heat has never been a problem for us.

+1 for power supply. I know you've heard it a bunch of times already, but that's what I'd swap out first. I know it can be a pain in the ass to start replacing parts till the problem is cured, but sometimes that's the only solution. If you don't want to buy a new one before you know, maybe harvest one from another system just to test...

As far as the CPU is concerned, if there's no problem with the fan, it's possible the heatsink isn't seated properly.... Any chance you swapped out the processor or the heatsink and fan lately?

Mailman74 · April 18, 2011

This syslog appears to be taken after a restart. Try to get one before a restart.

That is weird, because I had the syslog running before my server crashed.

dgaschk · April 18, 2011

The syslog shows journal transactions being replayed. This happens after an unclean shutdown. Try running reiserfsck on the drives /dev/md1 through /dev/md3.

SSD · April 18, 2011

The PSU takes a ton of abuse in this forum. If the PSU is severely underpowered, it should power itself down in an overload state when trying to spin up all of your drives on power up, or during a dramatic spinup in unRAID (stopping the array or starting a parity check, for example, spins up every drive that is spun down). If it is not severely underpowered, the ebbs and flows of power demand would likely allow all the drives to power up, albeit a bit more slowly than usual.

If a PSU dies, my experience (and I've had my share), is that they die like a light bulb. There is (or used to be) an internal fuse in the PSU that blows. I've tried replacing that fuse and never had luck reviving a dead one for more than 30 seconds.

I'm not saying yours might not be doing flaky things, but this has just been my experience in building and troubleshooting my own machines (the first PC I built was an 8MHz 8088). Random power downs while the server is basically idling are not normally a PSU problem.

Cabling problems are the #1 cause of flaky behavior. My first step would be to unplug and securely replug every connection. If it keeps happening, my money is on a failing capacitor on the motherboard.

PeterB · April 18, 2011

If a PSU dies, my experience (and I've had my share), is that they die like a light bulb.

I'm not saying that it's the case here, but I've known PSUs to suffer a variety of 'soft' failures. As an example, in one case, the ethernet interface was intermittently failing to initialise while, in all other respects, the machine appeared to be operating normally. Replacing the PSU solved the problem with the ethernet interface, and the machine has continued to operate normally for several years. That machine (still with original mobo/processor), purchased in 2002, is still going strong on its third PSU (the second one did die a 'sudden death' shortly after its arrival in the Philippines).

Oh, and talking of first 'PC' built - mine was in 1978 - a 4MHz INS8060, with 256 bytes of memory. Input was from a hexadecimal keypad, and output was on an 8-digit 7-segment display.

DoeBoye · April 18, 2011

I like ruling out the PSU first because:

It's often the culprit
A compatible, working tester PSU can be found in many other systems, unlike CPUs, RAM, etc.

If one is available, it doesn't hurt to swap it in and confirm or rule out the PSU as the issue. A flaky PSU can still fail whether the system is at idle or under load (though I agree, it does usually happen under load... Mind you, the OP *does* state that the issue was usually noticed while recording TV or ripping video...)

Mailman74 · April 18, 2011

This syslog has been running all day.

syslog2.txt

dgaschk · April 18, 2011

What do these commands show:

ntpq -p

hwclock;date

mbryanr · April 18, 2011

set_rtc_mmss: errors

http://nixcraft.com/centos-rhel-fedora/13458-set_rtc_mmss-cant-update-59-0-error.html

What does set_rtc_mmss: can't update from 54 to 5 mean?

The function set_rtc_mmss() updates minutes and seconds of the CMOS clock from system time. It does not update the hour or date to avoid problems with timezones.[1] The message shown was added to make users and implementers aware of the problem that not all time updates will succeed.

Imagine the system time is 17:56:23 while the CMOS clock is already at 18:03:45. Updating just minutes and seconds would set the hardware clock to 18:56:23, a wrong value. The solution for this problem is either to wait a few minutes, or to install a kernel patch that fixes the problem. Normally a wrong time in the hardware clock will not show up until after reboot, or maybe after APM slowed down your system.
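
The arithmetic in that example can be checked with a short Python sketch. This is only a simplified model of the behaviour described above, not the kernel's actual code: the function names, the calendar date, and the 30-minute threshold in the guard are assumptions made for illustration.

```python
from datetime import datetime


def update_minutes_seconds(cmos: datetime, system: datetime) -> datetime:
    """Copy only the minute and second fields from system time into CMOS time.

    The hour and date of the CMOS clock are left untouched, as described above.
    """
    return cmos.replace(minute=system.minute, second=system.second)


def can_update(cmos: datetime, system: datetime, max_gap_minutes: int = 30) -> bool:
    """Hypothetical guard: refuse the update when the minute fields are far apart.

    The 30-minute default is an assumption for this sketch.
    """
    return abs(system.minute - cmos.minute) < max_gap_minutes


# Times from the quoted example; the date itself is arbitrary.
system_time = datetime(2011, 4, 18, 17, 56, 23)
cmos_time = datetime(2011, 4, 18, 18, 3, 45)

naive = update_minutes_seconds(cmos_time, system_time)
print(naive.strftime("%H:%M:%S"))           # 18:56:23, an hour ahead of system time
print(can_update(cmos_time, system_time))   # False: minute fields differ by 53
```

Under these assumptions, the blind update leaves the hardware clock exactly one hour off, which is the kind of situation the "can't update" message is warning about.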

Mailman74 · April 24, 2011

A couple of random shutdowns today, and no syslog running. But I noticed that after restarting the server, my PC gets a blue screen of death. I wonder if something in my network is causing this? I have a DLink DIR 655 router that goes to a 5-port and an 8-port gigabit switch, which go to each room and the server.
