Server shuts itself down during parity check?



Hi all, I need some help figuring out how to troubleshoot this.

 

Twice now, I started a parity check on the array and after about an hour or so (wasn't really paying attention), I came back and found the entire system powered off. Upon reboot, it says it did not complete the parity check (duh), but I can't discern any reason why it would shut itself off. My first thought was to check the logs, but apparently logs aren't preserved through a power off.
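
One way to keep log lines across an unexpected power-off is to copy them to persistent storage as they are written. Below is a minimal sketch of that idea, not an official unRAID tool. It assumes Python is available on the server, that the live log is at `/var/log/syslog`, and that the flash drive is mounted at `/boot`; both paths and the function names `copy_new_bytes` and `follow` are illustrative assumptions.

```python
import os
import time
from pathlib import Path

# Assumed locations; adjust for your own system.
SOURCE = Path("/var/log/syslog")            # live log, lost on power-off
DEST = Path("/boot/logs/syslog-mirror.txt")  # persistent copy (assumed flash mount)


def copy_new_bytes(src, dst, offset):
    """Append everything in src past offset to dst and return the new offset."""
    with open(src, "rb") as fin:
        fin.seek(offset)
        chunk = fin.read()
    if chunk:
        with open(dst, "ab") as fout:
            fout.write(chunk)
            fout.flush()
            os.fsync(fout.fileno())  # ask the OS to push the data to the device
    return offset + len(chunk)


def follow(src=SOURCE, dst=DEST, interval=5.0):
    """Poll src forever, mirroring new log data to dst every interval seconds."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    offset = 0
    while True:
        if src.stat().st_size < offset:
            offset = 0  # the log was rotated or truncated; start over
        offset = copy_new_bytes(src, dst, offset)
        time.sleep(interval)
```

Calling `follow()` before starting the parity check would leave the last few seconds of log on the flash drive after a sudden shutdown. Frequent writes add wear to a USB stick, so this is something to run only while troubleshooting.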

 

I've started the check again and am "monitoring" progress through the web GUI. Hard drive temps look fine (mid 30s C), throughput looks fine (~215 MB/s), and the estimated time is in the ballpark of what it said when it worked a few weeks ago (8 hr).

 

Any ideas what would cause it to shut down the whole system during parity check?

Link to comment

This is a hardware issue, nothing to do with unRaid. 

 

Could be PSU, memory, failing MB, bad controller, overclocking the CPU or memory. Heat often plays a part.

 

Always hard to isolate. The fact it takes 30 minutes of parity checking to cause the problem makes it harder. Do you have another computer to swap parts to help diagnose?

Link to comment

I thought it might be heat related as well, maybe some thermal shutoff. But it's chugging along fine now and nothing's changed since the last restart.

 

The system had been powered down for a few weeks, since the 17th when I went on vacation, so maybe it just needs time to "get in the groove" again... stupid computers...

 

I'll keep an eye on it, assuming it finishes the check tonight, and if nothing else happens, I guess we'll call it hardware gremlins.

Edited by darckhart
Link to comment

I would suggest opening it up and cleaning all of the dust and dirt out of the case, fans and cooling fins on the heat sinks.  Then check that the fans are not having any issues spinning up by stopping them and watching what happens when you release the blades.  (In the past, I have had case fans, PS fans and CPU fans that would not want to spin up, but once they did, they would run fine.)

Link to comment
4 hours ago, Frank1940 said:

I would suggest opening it up and cleaning all of the dust and dirt out of the case, fans and cooling fins on the heat sinks.  Then check that the fans are not having any issues spinning up by stopping them and watching what happens when you release the blades.  (In the past, I have had case fans, PS fans and CPU fans that would not want to spin up, but once they did, they would run fine.)

 

Yea I've got it all opened up and all fans are running. I even brought in a desk fan to point directly at things to make sure they're cool as can be.

 

Tried again this morning and it shut down in about the same place as last night: about 5 hrs in, roughly 3.33 TB checked of 6 TB total, so it looks like it gets stuck at about 50-55%. Sigh. HDDs were at about 45-48C, which I guess makes sense after running parity for 5 hrs.
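
To compare shutdown points between runs, it helps to turn those figures into a percentage and an average throughput. This is only a back-of-the-envelope sketch using the numbers quoted above; the function names are made up for illustration, and it assumes decimal terabytes (1 TB = 10^12 bytes).

```python
def percent_done(checked_tb, total_tb):
    """Fraction of the parity check completed, as a percentage."""
    return 100.0 * checked_tb / total_tb


def average_mb_per_s(checked_tb, hours):
    """Average throughput in MB/s, assuming decimal units (1 TB = 1e12 bytes)."""
    return checked_tb * 1e12 / (hours * 3600) / 1e6


# Figures from the post: about 3.33 TB of 6 TB checked after about 5 hours.
print(round(percent_done(3.33, 6.0), 1))    # 55.5
print(round(average_mb_per_s(3.33, 5.0)))   # 185
```

An average of roughly 185 MB/s over five hours is in line with the ~215 MB/s seen early in the check, so the run does not appear to have been slowing down abnormally before the shutdown.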

 

Link to comment

An open case will not lower drive temperatures.  Good air flow is essential for unRAID servers.  Most folks pull the air in at the front, draw it over the drives first, and use the case fans to exhaust it at the rear.  By the way, having the case open may actually increase temperatures IF the air flow is set up properly!  (Both of my servers have three fans installed at the rear, side and top, all sucking air out of the case.)

Link to comment

Put Fix Common Problems into troubleshooting mode. This will save the logs to your flash drive, just in case there's something software related that it might catch.

 

Also, if you're noticing that it's shutting down about the same amount of time into the parity check, it could be that there is a memory problem that's being tripped by the precise combination of data right at that spot in the parity check. Running Memtest86+ (look for it in the boot menu when the server starts) for a good 24 hours or so (minimum of 1 pass, 2 or 3 would be better) should at least point to that as the problem or rule it out.

 

51 minutes ago, Frank1940 said:

An open case will not lower drive temperatures.

I agree with this 100%! Pulling the side off the case makes sense, but it doesn't work. Usually. If you have very poor air flow inside the case, it might help. My setup is similar to Frank's - 3 fans in the front, pushing "cool" (room temp) air in over the drives; 2 fans at the top pulling hot air out; 1 fan at the top rear of the case pulling heat out; 1 fan in the PSU blowing out to ventilate that.

 

I use compressed air to blow off the front of the front fans, then I blow through them across the drives inside. Then I blow off the top grills, then through the top fans, then I blow off the back grills then blow through the fans. I do this monthly right before (or, if I forget, during) the parity check. I do all that while the server is up and running. Any time I have to shut the box down, I'll take the side off and blow out the CPU fan and make sure that the cards, cables, etc are pretty clean, as well. (Big ugly dust clouds - the wife hates that!)

Link to comment
On 8/31/2017 at 8:30 PM, xaudzai said:

I had a similar problem, though not a shutdown. Sometimes it would run almost 24 hrs and then stop; it was not consistent at any certain point. It turned out my Seasonic power supply was going bad.

 

I almost hate to say anything at this point.  However, since you brought up your experience, I will note mine.  About three weeks ago, I replaced a dual rail PS with a NEW single rail one.  After a couple of weeks, I had a shutdown that looked like a power outage (parity check on restart).  But there were no other signs of one (no blinking clocks).  I checked out the UPS to make sure it was working and the battery was still good.  No issues there.  Then three days after the restart, it happened again.  I turned on the troubleshooting mode in the Fix Common Problems plugin and it happened again.  There were no signs of any hardware or software issues in the extra logs, and I knew almost exactly when it went down.

 

I replaced that PS with a new one from a different manufacturer and it has been up now for 2 days, 13 hours and 47 minutes.  I have my fingers crossed, but I am reasonably certain it was the PS with some sort of issue.

 

EDIT:  The server has now been up for 12 days, 13 hours and 38 minutes.  I am fairly certain now that it was a PS issue!!!   (I am posting this additional information for anyone else who happens to find this thread while searching for possible causes of an otherwise unexplainable server shutdown.)
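
For anyone who wants to pin down when an unexplained shutdown happened, a heartbeat file is one simple approach: append a timestamp to persistent storage at a fixed interval, then read the last entry after the reboot. The sketch below is only an illustration of that idea, not what the Fix Common Problems plugin does. It assumes Python is available on the server, and the function names and the idea of keeping the file on the flash drive are assumptions.

```python
import os
import time
from datetime import datetime
from pathlib import Path


def write_heartbeat(path, now=None):
    """Append one timestamp line to path and force it onto the device."""
    stamp = (now or datetime.now()).isoformat(timespec="seconds")
    with open(path, "a") as f:
        f.write(stamp + "\n")
        f.flush()
        os.fsync(f.fileno())
    return stamp


def last_heartbeat(path):
    """Return the most recent parsable timestamp in path, or None."""
    for line in reversed(Path(path).read_text().splitlines()):
        try:
            return datetime.fromisoformat(line.strip())
        except ValueError:
            continue  # skip blank or unparsable lines
    return None


def run(path, interval=60):
    """Write a heartbeat every interval seconds until the machine stops."""
    while True:
        write_heartbeat(path)
        time.sleep(interval)
```

After a reboot, `last_heartbeat(path)` gives the last time the server was known to be alive, accurate to within one interval, which can then be matched against UPS logs or room events.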

Edited by Frank1940
Link to comment

Thanks all! Great comments! 

 

This is an old Dell rack server recommissioned for unRAID duty. Reading through the posts, I feel my HDD temps are high-ish but fine, CPU temps are fine (parity check seems to be pretty low CPU use), and I ran memtest overnight and that came up with zero issues too. I'll also use the FCP plugin so I can see any logs it saves to the flash drive, though I ran a tail over telnet and it never showed any warning messages before the connection dropped when the server shut itself off. But the PSU, hmm, that might be a good one to check. Mine has redundant PSUs, so could either one trigger a shutdown if it got hot? I suppose I'll have to purchase a spare and see.

 

 

Link to comment
37 minutes ago, darckhart said:

Mine has the redundant PSUs so could either one trigger a shutdown if it got hot? I suppose I'll have to purchase a spare and see.

Redundant should mean either has enough juice to run the server on its own, so pull one and run for a while, see what happens. No need to purchase a spare just for testing when you have the luxury of redundancy.

Link to comment
