April 18, 2011
Every time we've encountered a problem with one of our servers in the data center randomly rebooting, it was usually due to a faulty power supply. The voltage drops and the system thinks someone hit the reset button. We start there and usually end there. If that doesn't fix it, memory is the second culprit. Heat has, for us, never been a problem.

+1 for power supply. I know you've heard it a bunch of times already, but that's what I'd swap out first. I know it can be a pain in the ass to start replacing parts till the problem is cured, but sometimes that's the only solution. If you don't want to buy a new one before you know, maybe harvest one from another system just to test...

As far as the CPU is concerned, if there's no problem with the fan, it's possible the heatsink isn't seated properly. Any chance you swapped out the processor or the heatsink and fan lately?
April 18, 2011 (Author)
"This syslog appears to be taken after a restart. Try to get one before a restart."

That is weird, because I had the syslog running before my server crashed.
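For anyone trying to get a syslog from before a crash: one option is to keep copying the live log to storage that survives a reboot. Below is a minimal Python sketch of that idea, not a tested unRAID tool. The function name `append_new_log_data` and both paths in the usage comment are assumptions for illustration; adjust them to wherever your log and persistent storage actually live.

```python
import os


def append_new_log_data(src_path, dst_path, offset=0):
    """Append bytes written to src_path since `offset` onto dst_path.

    Returns the new offset to pass to the next call.
    """
    size = os.path.getsize(src_path)
    if size < offset:
        # The source log shrank (rotated or truncated), so start over.
        offset = 0
    with open(src_path, "rb") as src, open(dst_path, "ab") as dst:
        src.seek(offset)
        chunk = src.read()
        dst.write(chunk)
        dst.flush()
        # Push the data to disk now so it is still there after a hard reset.
        os.fsync(dst.fileno())
    return offset + len(chunk)


# Hypothetical usage (both paths are assumptions, not confirmed locations):
#
#   import time
#   offset = 0
#   while True:
#       offset = append_new_log_data("/var/log/syslog",
#                                    "/boot/syslog-copy.txt", offset)
#       time.sleep(5)
```

The copy is append-only and synced after every pass, so whatever was logged up to a few seconds before the crash should still be in the copy after the restart.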
April 18, 2011
The syslog shows journal transactions being replayed. This happens after an unclean shutdown. Try running reiserfsck on the drives /dev/md1 through /dev/md3.
April 18, 2011
The PSU takes a ton of abuse in this forum. If the PSU is severely underpowered, it should power itself down in an overload state when trying to spin up all of your drives at power-up, or during a dramatic spin-up in unRAID (stopping the array or starting a parity check will spin up all drives if they are spun down). If it is not severely underpowered, the ebbs and flows of power demand would likely allow all the drives to power up, albeit a bit more slowly than usual.

If a PSU dies, my experience (and I've had my share) is that they die like a light bulb. There is (or used to be) an internal fuse in the PSU that blows. I've tried replacing that fuse and never had luck reviving a dead one for more than 30 seconds. I'm not saying yours might not be doing flaky things, but this has just been my experience in building and troubleshooting my own (the first PC I built was an 8MHz 8088).

Random power-downs when the server is basically idling are not normally a PSU problem. Cabling problems are the #1 cause of flaky behavior. My first step would be to unplug and securely replug every connection. If it keeps happening, my money is on a failing capacitor on the motherboard.
April 18, 2011
"If a PSU dies, my experience (and I've had my share), is that they die like a light bulb."

I'm not saying that it's the case here, but I've known PSUs to suffer a variety of 'soft' failures. As an example, in one case, the ethernet interface was intermittently failing to initialise while, in all other respects, the machine appeared to be operating normally. Replacing the PSU solved the problem with the ethernet interface, and the machine has continued to operate normally for several years. That machine (still with original mobo/processor), purchased in 2002, is still going strong on its third PSU (the second one did die a 'sudden death' shortly after its arrival in the Philippines).

Oh, and talking of first 'PC' built: mine was in 1978, a 4MHz INS8060 with 256 bytes of memory. Input was from a hexadecimal keypad, and output was on an 8-digit 7-segment display.
April 18, 2011
I like ruling out the PSU first because:
- It's often the culprit.
- A compatible working tester PSU can be found in many other systems, unlike CPUs, RAM, etc.

If one is available, it doesn't hurt to swap it in and confirm or rule out the PSU as the issue. A flaky PSU can still fail whether the system is at idle or under load. (Though I agree, it does tend to happen under load... Mind you, the OP *does* state that the issue was usually noticed while recording TV or ripping video...)
April 18, 2011
set_rtc_mmss: errors
http://nixcraft.com/centos-rhel-fedora/13458-set_rtc_mmss-cant-update-59-0-error.html

What does "set_rtc_mmss: can't update from 54 to 5" mean?

The function set_rtc_mmss() updates minutes and seconds of the CMOS clock from system time. It does not update the hour or date to avoid problems with timezones.[1] The message shown was added to make users and implementers aware of the problem that not all time updates will succeed. Imagine the system time is 17:56:23 while the CMOS clock is already at 18:03:45. Updating just minutes and seconds would set the hardware clock to 18:56:23, a wrong value. The solution for this problem is either to wait a few minutes, or to install a kernel patch that fixes the problem. Normally a wrong time in the hardware clock will not show up until after reboot, or maybe after APM slowed down your system.
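To make the quoted example concrete, here is a small Python sketch of what "updating just minutes and seconds" does to the clock. It only reproduces the arithmetic from the explanation above and is not the kernel's actual code; the helper name `update_minutes_seconds` and the calendar date are assumptions picked for illustration.

```python
from datetime import datetime


def update_minutes_seconds(cmos_time, system_time):
    """Copy only the minute and second fields from system time.

    The hour and date of the hardware clock are left untouched, which is
    the behaviour described in the explanation above.
    """
    return cmos_time.replace(minute=system_time.minute,
                             second=system_time.second)


# Times from the example above; the date itself is arbitrary.
system_time = datetime(2011, 4, 18, 17, 56, 23)
cmos_time = datetime(2011, 4, 18, 18, 3, 45)

result = update_minutes_seconds(cmos_time, system_time)
print(result.time())  # 18:56:23, an hour ahead of the real system time
```

Because the two clocks sit on opposite sides of an hour boundary, keeping the CMOS hour (18) while taking the system's minutes (56) lands the hardware clock about an hour off, which is the situation the warning is flagging.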
April 24, 2011 (Author)
A couple of random shutdowns today, and no syslog running. But I noticed that after restarting the server, my PC gets a blue screen of death. I wonder if there is something in my network causing this? I have a D-Link DIR-655 router feeding 5-port and 8-port gigabit switches that go to each room and the server.