random reboots 14b



i only recently updated to v6 (apr 7th) and i'm getting random reboots on 14b. is this a known issue? it's happened twice now, the first time about 2 days ago.

 

knowing the syslog gets wiped on a restart, since the last time it happened i've kept a telnet session open with a tail of the syslog running (not sure that actually captures anything useful, but i would assume so), and there's nothing in there at all indicating the server going down or about to go down, just normal spindowns. any ideas?

 

i don't think it's a power issue as i have a UPS, but i actually haven't tested it since moving to v6.
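
 

if it reboots again, i might also try mirroring the syslog to the flash so it survives the restart -- something along these lines from the telnet session (assuming the flash is mounted at /boot like it normally is; it writes to the flash constantly, so only as a temporary measure):

# mirror the live syslog to the flash so it survives a crash
# (/boot is assumed to be the unraid flash; kill the tail once the problem is found)
tail -f /var/log/syslog >> /boot/syslog-mirror.txt &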

Link to comment

Two thoughts ...

 

(1)  Before spending any time troubleshooting this, upgrade to Beta 15.  No reason to chase a problem unless you're sure it's actually an issue on the latest version.

 

(2)  If the server is plugged in to the UPS, try bypassing it.  If it's not on the UPS, plug it in.  In other words, be sure you know whether or not power is contributing to the problem.

 

Link to comment

 

i don't think it's a power issue as i have a UPS, but i actually haven't tested it since moving to v6.

 

Test the UPS by shutting down the server and unplugging it from the UPS.  Now, plug a 100 watt lamp into the UPS and make sure the lamp is on.  Then unplug the UPS from the power source.  The lamp should stay lit for at least five minutes; if it doesn't, you probably need to replace the battery.

Link to comment

I recently saw a system with this issue that had a problem with the PFC circuit in the power supply.  I believe the cause was a low-quality UPS feeding a power supply with active PFC.  The solution was to replace both the power supply and the UPS ... I had the client buy a sine-wave UPS from CyberPower (they make the least expensive true sine-wave UPS units I'm comfortable recommending).

 

Link to comment

No, not a known issue. I would think this is hardware related. Have you run a memory test? When the array comes back up, does it kick off a parity check?

 

been a while since i've run a memtest as it's not something i do regularly. do you suggest i do one? also, when the array comes back, it does do a parity check. that's actually how i knew it had rebooted the first time, because i got the parity check started email. the 2nd time it happened i was actually sitting next to the server; i would've thought i'd hear the fans spin down and back up again when it restarted, but i didn't.

 

Two thoughts ...

 

(1)  Before spending any time troubleshooting this, upgrade to Beta 15.  No reason to chase a problem unless you're sure it's actually an issue on the latest version.

 

(2)  If the server is plugged in to the UPS, try bypassing it.  If it's not on the UPS, plug it in.  In other words, be sure you know whether or not power is contributing to the problem.

 

Test the UPS by shutting down the server and unplugging it from the UPS.  Now, plug a 100 watt lamp into the UPS and make sure the lamp is on.  Then unplug the UPS from the power source.  The lamp should stay lit for at least five minutes; if it doesn't, you probably need to replace the battery.

 

as soon as this parity check finishes, i'll upgrade to the latest beta and then go through the process of testing my ups.

 

thanks guys! will report back.

Link to comment

parity check finished with no errors, and i didn't have any lamp handy (and, lazy), so i decided to just unplug the UPS from the wall, and it kicked in perfectly. i didn't have it unplugged for that long, just enough to test the theory that a brief power hit caused the reboot, and that does not seem to be the case. could the UPS still be the issue, or could i rule that out now? or do i need it to run on battery for an extended amount of time?

 

i am getting ready to upgrade to 6b15 as we speak.

Link to comment

parity check finished with no errors, and i didn't have any lamp handy (and, lazy), so i decided to just unplug the UPS from the wall, and it kicked in perfectly. i didn't have it unplugged for that long, just enough to test the theory that a brief power hit caused the reboot, and that does not seem to be the case. could the UPS still be the issue, or could i rule that out now? or do i need it to run on battery for an extended amount of time?

 

i am getting ready to upgrade to 6b15 as we speak.

 

The reason I suggested a test using a light bulb is that the battery has to have enough capacity to supply power until your server completely shuts down.  If it doesn't, then when you restart, a parity check is triggered.  That 'estimated' run time in apcupsd is a complete bit of fiction.  It is close when the battery is new, but the actual run time drops as the battery ages!  I had a battery (years ago) that would last less than two seconds with a 100 watt computer/monitor load.  (I found that out the hard way!)
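
 

If you're running apcupsd, you can at least see what it currently reports for the battery -- just keep in mind those figures are estimates, not guarantees:

# dump the fields apcupsd reports for charge, estimated runtime, load and battery date
# (requires apcupsd to be installed and talking to the UPS)
apcaccess status | grep -E 'BCHARGE|TIMELEFT|LOADPCT|BATTDATE'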

Link to comment

If the issue was a surge when the inverter kicked in, then unplugging the UPS would have shown that.

But if it's a problem with the power supply's PFC circuitry, then getting the UPS completely out of the circuit will likely eliminate most of the restarts.  Depending on the failure mode, you may still see a few, but not as regularly.  The only really dependable way to absolutely confirm that is to (a) replace the power supply ... and confirm the issue "goes away" long enough that you're convinced it's gone (a week??);  and then (b) buy a new UPS with a sine wave output  [Not absolutely necessary ... but if that was in fact the issue, it could cause the same problem on your new power supply over time].

 

Link to comment

hey guys, it seems the plot thickens.. i am now on beta15, current uptime about 1d 17h. no reboots yet.

 

last night i had some errors on my cache drive. after some googling, it sounds like this can also indicate a power issue. unfortunately i have not yet had a chance to run memtest overnight, possibly tonight if i get a chance.

 

anyways, if the memtest comes up clean, what would you guys do first if you were me? new PSU? new cache drive? no drives are redballed in the gui.

 

Apr 21 19:05:45 Tower kernel: ata4.00: exception Emask 0x50 SAct 0x0 SErr 0x90a02 action 0xe frozen
Apr 21 19:05:45 Tower kernel: ata4.00: irq_stat 0x00400000, PHY RDY changed
Apr 21 19:05:45 Tower kernel: ata4: SError: { RecovComm Persist HostInt PHYRdyChg 10B8B }
Apr 21 19:05:45 Tower kernel: ata4.00: failed command: FLUSH CACHE EXT
Apr 21 19:05:45 Tower kernel: ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 12
Apr 21 19:05:45 Tower kernel:         res 40/00:5c:5f:1f:ee/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Apr 21 19:05:45 Tower kernel: ata4.00: status: { DRDY }
Apr 21 19:05:45 Tower kernel: ata4: hard resetting link
Apr 21 19:05:51 Tower kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 21 19:05:51 Tower kernel: ata4.00: configured for UDMA/133
Apr 21 19:05:51 Tower kernel: ata4.00: retrying FLUSH 0xea Emask 0x50
Apr 21 19:05:51 Tower kernel: ata4: EH complete

syslog.txt

Link to comment

If Memtest is error-free, I'd try a higher quality power supply.

 

One thing you might look at is the age of your UPS's battery.  The UPS is a good sine-wave unit, so I doubt it's a problem, but if the battery is more than 3 years old, that could be causing problems when you switch to the inverter.

 

But if you continue to get the random reboots, it's pretty likely it's a power issue.  I'd get a high quality Corsair (HX series or above -- NOT the CX units) or Seasonic (X series) power supply.    This would be a good choice:

http://www.newegg.com/Product/Product.aspx?Item=N82E16817139012

 

 

Link to comment

If Memtest is error-free, I'd try a higher quality power supply.

 

One thing you might look at is the age of your UPS's battery.  The UPS is a good sine-wave unit, so I doubt it's a problem, but if the battery is more than 3 years old, that could be causing problems when you switch to the inverter.

 

But if you continue to get the random reboots, it's pretty likely it's a power issue.  I'd get a high quality Corsair (HX series or above -- NOT the CX units) or Seasonic (X series) power supply.    This would be a good choice:

http://www.newegg.com/Product/Product.aspx?Item=N82E16817139012

 

my ups isn't that old, maybe a year, two at most. i unplugged the ups from the wall and it kicked in perfectly and was able to keep the server running, so i don't think that's causing the random reboots. although i didn't test it for very long, so i'll have to revisit that.

 

i am just curious why you don't think it's the cache drive? if it was a power issue causing this and the random reboots, i would think other drives would have exceptions too, since i believe they share the same backplane.

Link to comment

Strange and random reboots can be a lot of things => a memory "glitch" (which MemTest may or may not find);  a power bus spike (not the supply power ... the power buses from the PSU);  a disk drive that's causing a power spike (bad actuator that randomly "pulls down" the power bus);  etc.

 

It's very difficult to isolate exactly which without either very expensive test gear or simply swapping components.

 

From what you've outlined, if it's not a memory glitch, I'd think it's most likely power ... but there's no guarantee.

 

I see your system has 8GB of RAM =>  is that 2 x 4GB or 4 x 2GB ??  If you have 4 modules installed, remove 2 of them (one from each channel) ... and see if you still get the issue.    Clearly I'd test this before buying a new power supply.    If that doesn't help, and you're still getting the random issues, I'd try a new higher quality power supply, as I noted earlier.

 

Link to comment

my plan for the random reboots was to run memtest overnight when no one's accessing the server; if it's clean but i still get random reboots (so far i haven't had any on 15b, but it's barely been 2 days), then i would try a new psu.

 

now i have the cache drive error, so that's adding another variable.. if i don't get random reboots for a week (originally i had no random reboots from apr 7 until 2 days before i made this thread), but i still get cache drive errors, only then should i replace the cache drive? are we thinking the same psu issue is causing the cache drive errors?

 

in any case, i'm assuming i should try to fix the random reboots first before worrying about the cache drive.

Link to comment

Is the cache simply used as a cache, or is it also an application drive that's needed in the system?

 

If the former, you could simply remove it from the system and see if that changes things.

 

it is an application drive as well. i also think the ram is 2 x 4GB, so that's something to look into.

 

the plan of action for now i guess is to do the following, with some time in between to see if i'm still having issues..

 

1. memtest overnight

2. test ups

3. extended memtest, possibly replace anyway, ram is cheap

4. replace psu

5. replace cache drive

6. throw unraid server in the garbage lol

Link to comment

By the way, if you run a very extended MemTest (8-10 full cycles) and don't have any failures, I'd replace the power supply -- not the memory.    I've definitely seen memory issues that took a LONG time for MemTest to hit an error (8-10 hours) ... but if you go 12 hours or more without any issue, it's very unlikely that changing the memory would help.    You could remove a module to reduce the memory bus loading ... just to see if that might make a difference  (I'd definitely do this if you have 4 modules; but if you have 2 that's less likely to be a problem).

Link to comment

hey guys, it seems the plot thickens.. i am now on beta15, current uptime about 1d 17h. no reboots yet.

 

last night i had some errors on my cache drive. after some googling, it sounds like this can also indicate a power issue. unfortunately i have not yet had a chance to run memtest overnight, possibly tonight if i get a chance.

 

anyways, if the memtest comes up clean, what would you guys do first if you were me? new PSU? new cache drive? no drives are redballed in the gui.

 

Apr 21 19:05:45 Tower kernel: ata4.00: exception Emask 0x50 SAct 0x0 SErr 0x90a02 action 0xe frozen
Apr 21 19:05:45 Tower kernel: ata4.00: irq_stat 0x00400000, PHY RDY changed
Apr 21 19:05:45 Tower kernel: ata4: SError: { RecovComm Persist HostInt PHYRdyChg 10B8B }
Apr 21 19:05:45 Tower kernel: ata4.00: failed command: FLUSH CACHE EXT
Apr 21 19:05:45 Tower kernel: ata4.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 12
Apr 21 19:05:45 Tower kernel:         res 40/00:5c:5f:1f:ee/00:00:00:00:00/40 Emask 0x50 (ATA bus error)
Apr 21 19:05:45 Tower kernel: ata4.00: status: { DRDY }
Apr 21 19:05:45 Tower kernel: ata4: hard resetting link
Apr 21 19:05:51 Tower kernel: ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 21 19:05:51 Tower kernel: ata4.00: configured for UDMA/133
Apr 21 19:05:51 Tower kernel: ata4.00: retrying FLUSH 0xea Emask 0x50
Apr 21 19:05:51 Tower kernel: ata4: EH complete

 

I believe those error flags indicate a momentary loss of contact (PHYRdyChg; the others indicate corrupted transmission).  That could be related to a power spike or a very brief power outage, or it could be a loose cable connection, particularly a power cable.  I've seen similar when a user had a poor quality power splitter, not uncommon with us unRAID'ers trying to find enough power connections.  In his case, the wye part of the splitter had loosely connected wires.  Check all of the cables and connectors, both power and SATA.
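
 

If you want to see whether those link resets are leaving a trail on the drive itself, the SMART counters are worth a look -- something along these lines, where sdX is just a placeholder for whatever device the cache drive actually is:

# CRC errors in the SMART attributes generally point at the cable or connector rather than the drive
smartctl -a /dev/sdX | grep -i crc
# the drive's own ATA error log, for comparison with the syslog timestamps
smartctl -l error /dev/sdX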

Link to comment

I believe those error flags indicate a momentary loss of contact (PHYRdyChg; the others indicate corrupted transmission).  That could be related to a power spike or a very brief power outage, or it could be a loose cable connection, particularly a power cable.  I've seen similar when a user had a poor quality power splitter, not uncommon with us unRAID'ers trying to find enough power connections.  In his case, the wye part of the splitter had loosely connected wires.  Check all of the cables and connectors, both power and SATA.

 

thanks robj, i'll add that to my list. currently running memtest overnight; possibly i will leave it until after i get home from work tomorrow, so that should be almost 20 hours of memtest.

 

one thing i did notice, though, that i'd never seen before.. i had to connect a monitor so i could see what's going on with memtest, and when i powered down and back up, it bounced back to the BIOS screen a number of times before it actually brought me to the unraid boot menu where i select unraid, safe mode, memtest, etc. i'd never noticed this because i don't normally have a monitor connected. i restarted a couple more times and it always kicks back to the BIOS screen a few times before it's able to get to unraid. it does always get there eventually though.

 

does this momentary boot loop support our bad PSU theory? or is my flash drive also going bad now? jeez, this server is falling apart on me! lol
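
 

i suppose i could also pull the flash and give it a read-only check from another linux box, just to rule out the filesystem -- something like this (sdX1 is just a placeholder for whatever partition the stick shows up as there, and it shouldn't be mounted):

# read-only (no changes) filesystem check of the unraid flash on another machine
fsck.vfat -n /dev/sdX1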

Link to comment

It sounds like you don't have stable power initially, but it's eventually settling down -- and then it may on occasion have spikes that cause your reboot issue.

 

Could be the power supply; or could be power regulation on the motherboard.    Look at the motherboard VERY carefully (with a flashlight) to see if there are any signs of leaking or bulging capacitors.

 

Link to comment
