Random shutdowns after powercut



Hi all,

 

I have been searching all day on the forum, but haven't found the answer to my question, although I did see a few topics on this and followed them, but didn't find a solution.

So early this morning, we had heavy rain, which tripped our breaker and took out all the power in the house, including the unraid server.  It had been running for a long time without any issues, although nowadays I really only use it to run Plex.

 

I turned the power back on in the whole house, then turned the unraid on again. I did get a message in which I had to select the boot method, which I did, then I selected the blue screen Unraid OS boot up or whatever it says and all was good.

I waited for it to boot up, logged in, input the encryption key and it was up and running.  While I was there, I noticed some old dockers had updates (duckdns, mariadb and nextcloud), so I updated those.  All of a sudden (probably 15-30 min after bootup), I couldn't connect to it anymore, so I went to check: it was off again.  Puzzled, I powered it back up, logged in, entered the encryption key and started the array again.  I didn't really think about it anymore, but when I came back from an errand, it was offline again...

I even took an air compressor and gently blew out the dust etc, which has accumulated over the last year or so. Booted it up again, logged in, encryption and started array and immediately disabled the dockers I am not currently using and disabled their autostart too.  

After about 20-30 min, it went offline again.  

 

I am not really a champ at interpreting the diagnostic logs, so hope someone can have a check what might be going on..

Also strange it happened after the powercut.

 

Thank you in advance,

best regards,

 

DrBobke

leonore-diagnostics-20240126-1343.zip

Link to comment

If this behavior continues, consider setting up the Syslog Server using the "Mirror syslog to flash:" method. After the next shutdown, pull the flash drive, get the file from the /logs directory/folder, and upload it in a new post in this thread.

EDIT: One more thing, make sure that all fans are running.

Another thing. Is a parity check running? Do you run periodic parity checks? If so, how often?

Edited by Frank1940
Link to comment
3 hours ago, JorgeB said:

This is most likely a hardware issue, or power.

Yeah, that is what I read in the other topics too, but it's weird that all was working fine and then nothing. I do understand this can happen after a powercut, but still.. Not really an answer.. ;-) 

Link to comment
2 hours ago, Frank1940 said:

If this behavior continues, consider setting up the Syslog Server using the "Mirror syslog to flash:" method. After the next shutdown, pull the flash drive, get the file from the /logs directory/folder, and upload it in a new post in this thread.

EDIT: One more thing, make sure that all fans are running.

Another thing. Is a parity check running? Do you run periodic parity checks? If so, how often?

Thanks, I will try this after dinner to set it up and see what happens.

I do run parity check, every week.  Parity also activates after the system is up and running again (as expected), but then the system shuts itself down after 15-30 min or so I think.

Link to comment
6 minutes ago, DrBobke1 said:

but it's weird that all was working fine and then nothing.

It's not that weird. You can try enabling the syslog server, but unless you have some plugin or something else shutting down the server, it's likely a hardware issue, and if that is the case there won't be anything relevant logged.

Link to comment
4 hours ago, Frank1940 said:

If this behavior continues, consider setting up the Syslog Server using the "Mirror syslog to flash:" method. After the next shutdown, pull the flash drive, get the file from the /logs directory/folder, and upload it in a new post in this thread.

EDIT: One more thing, make sure that all fans are running.

Another thing. Is a parity check running? Do you run periodic parity checks? If so, how often?

Here are the logs, as requested. It was online for probably 20-30 min before it went offline again.

By the way - cpu is watercooled, but despite being off for hours, when I turned on and went into the BIOS, I saw CPU was at 86°C, which doesn't seem right after MAYBE being on for like 2 min or so?

leonore-diagnostics-20240126-1343.zip

Link to comment

I have checked the logs, but nothing really jumps out for me (but I also don't know what to look for). I do find it strange that I have run unraid at 18:40 or so, but the logs state 13:42 etc, so seems to be 5 hours off, should that mean anything...
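For anyone else unsure what to look for, a minimal sketch of a first pass is to filter the syslog for lines that mention heat or power. The helper name `scan_syslog`, the keyword list, and the default path `/boot/logs/syslog` are all assumptions for illustration (the path assumes the flash drive is mounted at /boot and the mirrored file is called "syslog"). As noted above, a hardware-triggered shutdown may leave nothing in the log, so an empty result does not rule hardware out.

```shell
# Hypothetical helper: list syslog lines that hint at heat or power trouble.
# The default path is an assumption; pass the real location as an argument.
scan_syslog() {
    file="${1:-/boot/logs/syslog}"
    if [ ! -r "$file" ]; then
        echo "cannot read $file" >&2
        return 1
    fi
    # -i ignore case, -n show line numbers, -E extended regex
    grep -inE 'thermal|temperature|overheat|throttl|shutdown|shutting down|power' "$file" \
        || echo "no matching lines in $file"
}
```

Run it as `scan_syslog /path/to/syslog` against the file pulled from the flash drive or extracted from the diagnostics zip.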

 

If it's a hardware issue, are there points that are often a good first step? Ideally, I don't want to buy all new stuff. I don't mind some small things, but anything major, and I think I'll scrap my whole unraid and go to another system... 🙂

 

In case it helps, I am running Unraid version 6.12.6; I updated it a few weeks ago.

Edited by DrBobke1
Link to comment
23 minutes ago, trurl said:

Are you sure your server has the correct date and time? You can check at command line with

date

 

Have you done memtest?

Just turned it on again, date is the correct date and time running the command line 'date'.

I haven't done a memtest yet, but will do so now.  Thank you for your help!
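Regarding the 5-hour gap between the log timestamps and the wall clock, one quick check (a sketch only, the function name `show_clock` is made up here) is to print local time next to UTC with the configured offset. If the offset accounts for the gap, the logs and the clock are simply in different time zones rather than the clock being wrong.

```shell
# Print local time, UTC time, and the configured UTC offset side by side.
show_clock() {
    echo "local:  $(date '+%Y-%m-%d %H:%M:%S %Z')"
    echo "utc:    $(date -u '+%Y-%m-%d %H:%M:%S %Z')"
    echo "offset: $(date '+%z')"
}

show_clock
```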

Link to comment
46 minutes ago, DrBobke1 said:

By the way - cpu is watercooled, but despite being off for hours, when I turned on and went into the BIOS, I saw CPU was at 86°C, which doesn't seem right after MAYBE being on for like 2 min or so?

This could be worth looking into.   If your system has been running for months, most likely the cooling pumps and fans have been running all the time.  It could be that when they stopped this time, they could not restart.  Remember that the water in the cooling block probably takes a long time to cool back down to room temp if it does not get circulated through the cooling fins.

Link to comment
46 minutes ago, DrBobke1 said:

Just turned it on again, date is the correct date and time running the command line 'date'.

I haven't done a memtest yet, but will do so now.  Thank you for your help!

Memtest done *PASS* without problems.  But I did see the CPU temp going up to 95°C.  I then noticed the fan of the watercooler was not working. I did hear the water running through the system, but the casing, piping and cooler on the CPU felt hot to the touch (skin-blazingly hot). During the memtest, I unplugged the CPU fan from the mobo and plugged it in again, but the fan still didn't work. However, my mobo has 4 pins for the CPU cooler, whereas my Corsair cooler has a 3-pin plug. I unplugged it again and plugged it back in shifted by one pin, and it started working again. I saw the temp come down to the 60s °C (the memtest finished), so I ended the test and went to power up again.

 

Funny thing is - I am 1000% convinced, I never changed the wire since I setup my system (which is probably 3 years or so). So if it would've been incorrectly inserted, it has always worked that way and I have never noticed the (CPU)fan not working. The other fans have been working all the time for sure.

 

I am eager to see if that is maybe why my system was shutting down for no reason.. But I am stumped, as I know for sure the wire has always been inserted in the way it has been, so if that really is the issue, that is strange to say the least.

Link to comment

You have single parity, right?  And the only thing running was a parity check or a RAM test...   If that is the case, I would not expect to see those temps on an Intel(R) Core(TM) i7-4930K CPU @ 3.40GHz.  (You can run a single parity check on a single core processor from 1998 at full speed.  That is how little CPU power it requires!)

Link to comment

I used to have 2 parity drives, but because unraid is complicated and getting right answers even harder, I switched a lot of my system to a Synology that I use for work-purposes too. So I scavenged a parity drive off the system, to use on the Synology.  Anyway, it has been months like that now...

Since it went down again, I am now running another memtest and see if I can let it run through the night. In that case, I should be able to be pretty confident that it's not the RAM or the CPU causing the issue for making unraid go down, right?

 

So far, it is going down after 20-30 min (running unraid). The memtest I did earlier (which passed) ran for longer than that, and I then let it run for 5-10 min more before exiting, thinking the cause might have been the CPU water cooler fan not working.  Sadly, not the case! When I just started it up again (memtest, after it had been off for at least 45 min), the temp was around 50°, but it went up quite rapidly to 95 again (though the fan body, pipes etc. felt cool to the touch).

 

Anything jumping out from the syslog?

 

EDIT: just went to check, it went down again during the memtest this time - does that mean anything in 'finding the culprit'?

Thanks a lot for the help!

Edited by DrBobke1
Link to comment
12 hours ago, DrBobke1 said:

EDIT: just went to check, it went down again during the memtest this time - does that mean anything in 'finding the culprit'?

It does confirm it's a hardware problem as expected, if you confirmed it's not a CPU overheating issue, I would start by trying it with a different PSU, next suspect would be the board.

Link to comment
1 hour ago, JorgeB said:

It does confirm it's a hardware problem as expected, if you confirmed it's not a CPU overheating issue, I would start by trying it with a different PSU, next suspect would be the board.

How can I confirm it is not the CPU overheating? As stated before, the temperature goes up quite a bit and the CPU fan/watercooling wasn't working, so potentially - could the CPU overheat from continuous use without the fan running? Would that be likely, or would the system 'protect itself' before that could happen?

Link to comment

Start by Googling "max cpu temp" for your processor.  Then figure out if the temperature shown by the plugin is accurate.  You could also google the water cooler you are using and see if others have had any problems with it and what those problems were.   (A water cooler is a much more complex system than an air cooling setup and thus has more points of failure.)  You might consider pulling the cooler off and renewing the thermal paste.    (I am assuming that you have not overclocked this system.)
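One way to cross-check the plugin reading is to read the kernel's own sensor files. This is a rough sketch, assuming the standard Linux hwmon layout under /sys/class/hwmon, where each `temp*_input` file holds millidegrees Celsius. The function name `read_temps` and the `HWMON_ROOT` variable are made up for this example, and which sensor corresponds to the CPU depends on the board and drivers.

```shell
# Sketch: print every hwmon temperature sensor in whole degrees Celsius.
# HWMON_ROOT can be overridden (assumption: sensors live under /sys/class/hwmon).
HWMON_ROOT="${HWMON_ROOT:-/sys/class/hwmon}"

read_temps() {
    for f in "$HWMON_ROOT"/hwmon*/temp*_input; do
        [ -r "$f" ] || continue              # skip if no sensor matched
        raw=$(cat "$f" 2>/dev/null) || continue
        [ -n "$raw" ] || continue
        name=$(cat "$(dirname "$f")/name" 2>/dev/null || echo unknown)
        # sysfs reports millidegrees Celsius, so divide by 1000
        echo "$(date '+%H:%M:%S') $name $(basename "$f") $((raw / 1000))C"
    done
}

read_temps
```

To capture the last readings before a shutdown, something like `while true; do read_temps >> /boot/logs/temps.log; sync; sleep 10; done` could be left running; the log path is only an assumption, and any location that survives a power loss would do.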

Link to comment
20 minutes ago, JorgeB said:

You can boot a live windows install from a flash drive and then run a stress test with Prime95 while monitoring the temps with for example CoreTemp.

Should I install a win version on a thumb drive then, and try to boot from there? Won't that result in data loss on my array? 

Link to comment
14 minutes ago, Frank1940 said:

Start by Googling "max cpu temp" for your processor.  Then figure out if the temperature shown by the plugin is accurate.  You could also google the water cooler you are using and see if others have had any problems with it and what those problems were.   (A water cooler is a much more complex system than an air cooling setup and thus has more points of failure.)  You might consider pulling the cooler off and renewing the thermal paste.    (I am assuming that you have not overclocked this system.)

I haven't overclocked this system indeed, and I haven't touched the CPU/watercooler in many years. It is actually my old PC that I cleared and then turned into the unraid system.  I built my new desktop PC myself, but this one was built by a store when the CPU/motherboard were brand new.  So that is likely around 2015 (or even before?). As far as I know, I haven't pulled it apart in all those years (except to change over HDDs etc).

Link to comment
3 minutes ago, DrBobke1 said:

I haven't overclocked this system indeed, and I haven't touched the CPU/watercooler in many years. It is actually my old PC that I cleared and then turned into the unraid system.  I built my new desktop PC myself, but this one was built by a store when the CPU/motherboard were brand new.  So that is likely around 2015 (or even before?). As far as I know, I haven't pulled it apart in all those years (except to change over HDDs etc).

 

I can recall hearing stories of pumps in dishwasher and clothes washers where the blades on the impellers in the pumps were almost completely eroded away!   Things do wear out and everything ever made to do 'work' will eventually fail-- requiring repair or replacement.  One can make statements about the probability of failure for large populations of an item but they do not apply to a single sample of that item.  You are best to assume that anything and everything in that computer is a suspect until you have verified that it is working properly.

 

I am not saying that anything is defective but an indication that the CPU temps are hitting 90C is a red flag that I would be pursuing.  If that temperature is correct, the prime suspect would be a failure somewhere in the cooling setup...

Link to comment
