Abrupt shutdown during parity check (I suspect temperature). How to confirm?



Hi All,

 

My parity checks have become a source of pain since last month.  Unraid will start a parity check and then shut down without any notice.  I think it might be temperature related, but that should be a clean shutdown, right?  I have clean powerdown installed.  The next time I start the array, it detects an unclean shutdown.  Is there any way for me to confirm the cause?

 

To see if there were any logs, I ran "tail -F /var/log/syslog" over ssh to try and capture any output.  Unfortunately, the only output I see after starting the parity check and before the shutdown is this:

 

Feb  1 18:05:13 Tower emhttp: shcmd (80): /usr/sbin/hdparm -y /dev/sdg &> /dev/null

 

sdg is my cache drive.  Would the rest of the log help?  I have what I captured from syslog over ssh.

 

Is there anything I can do to track down the cause?

 

Thanks so much!

Link to comment

Thanks so much!  That is actually giving me a lot to think about.  The only reason I suspected HD overtemp was that the very first time this happened, I got an email with an HD temp warning, because I do have notifications set up.  Then I noticed the server had shut down.  Ever since then I do not get a warning email; the server just shuts down.  And I just assumed it was the same issue.  I will install the Dynamix System Temperature plugin and check to see what's happening.  I have a cheap AMD Sempron 145 in the system because I didn't want anything fancy.  But if it's overheating, maybe it's time for an upgrade :(

 

This is what I get when I follow https://lime-technology.com/wiki/index.php/Setting_up_CPU_and_board_temperature_sensing:

 

k10temp-pci-00c3
Adapter: PCI adapter
temp1:        +53.5°C  (high = +70.0°C)

nct6776-isa-0290
Adapter: ISA adapter
Vcore:          +1.06 V  (min =  +0.00 V, max =  +1.74 V)
in1:            +1.86 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
AVCC:           +3.38 V  (min =  +2.98 V, max =  +3.63 V)
+3.3V:          +3.38 V  (min =  +2.98 V, max =  +3.63 V)
in4:            +1.62 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in5:            +1.72 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in6:            +0.96 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
3VSB:           +3.46 V  (min =  +2.98 V, max =  +3.63 V)
Vbat:           +3.38 V  (min =  +2.70 V, max =  +3.63 V)
fan1:           804 RPM  (min =    0 RPM)
fan2:          3183 RPM  (min =    0 RPM)
fan3:             0 RPM  (min =    0 RPM)
fan4:             0 RPM  (min =    0 RPM)
fan5:             0 RPM  (min =    0 RPM)
SYSTIN:         +38.0°C  (high =  +0.0°C, hyst =  +0.0°C)  ALARM  sensor = thermistor
CPUTIN:         +54.0°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
AUXTIN:          +3.5°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
PCH_CHIP_TEMP:   +0.0°C  
PCH_CPU_TEMP:    +0.0°C  
PCH_MCH_TEMP:    +0.0°C  
intrusion0:    ALARM
intrusion1:    ALARM
beep_enable:   disabled

 

Should I be keeping an eye on CPUTIN?  54 seems very high for idle, does it not?  I haven't started the array yet since it will start a parity check.  I will run sensors over ssh every 30 seconds to see how the readings change up to the point where it shuts down, with:

root@Tower:~# while true; do date; sensors; sleep 30; done
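
The loop above prints the full sensors output each time.  A minimal sketch of a variant that keeps only the temperature lines and timestamps each one, assuming the output format shown above.  The function name filter_temps and the log path are hypothetical, not part of Unraid or lm-sensors:

```shell
#!/bin/bash
# Hypothetical helper: read `sensors` output on stdin, keep only the
# temperature lines (temp1, SYSTIN, CPUTIN, AUXTIN in the output above),
# and prefix each one with the timestamp passed as the first argument.
filter_temps() {
    local stamp="$1"
    grep -E '^(temp[0-9]+|[A-Z]+TIN):' | while IFS= read -r line; do
        printf '%s  %s\n' "$stamp" "$line"
    done
}

# Hypothetical usage (not run here): sample every 30 seconds and append to a
# file on storage that survives a reboot. The path is a placeholder. Calling
# sync after each sample pushes the latest readings to disk in case the
# machine powers off abruptly.
#   while true; do
#       sensors | filter_temps "$(date '+%F %T')" >> /path/to/temps.log
#       sync
#       sleep 30
#   done
```

With one line per sensor per sample, the last few lines of the file show what the temperatures were just before an unclean shutdown.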

 

I hope it's not the power supply though.  It's a 1000w supply that is supposed to be good :(

 

This is just a very odd issue.  It will shut down a few times, and then complete the parity check without shutting down.  No idea why.

 

I will be back.  Thanks so much for the hints!

Link to comment

Well, it's definitely that .... :(

When I started parity check:

Mon Feb  1 22:07:02 EST 2016
k10temp-pci-00c3
Adapter: PCI adapter
temp1:        +56.0°C  (high = +70.0°C)

nct6776-isa-0290
Adapter: ISA adapter
Vcore:          +1.23 V  (min =  +0.00 V, max =  +1.74 V)
in1:            +1.86 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
AVCC:           +3.38 V  (min =  +2.98 V, max =  +3.63 V)
+3.3V:          +3.36 V  (min =  +2.98 V, max =  +3.63 V)
in4:            +1.63 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in5:            +1.72 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in6:            +1.00 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
3VSB:           +3.46 V  (min =  +2.98 V, max =  +3.63 V)
Vbat:           +3.38 V  (min =  +2.70 V, max =  +3.63 V)
fan1:           809 RPM  (min =    0 RPM)
fan2:          3176 RPM  (min =    0 RPM)
fan3:             0 RPM  (min =    0 RPM)
fan4:             0 RPM  (min =    0 RPM)
fan5:             0 RPM  (min =    0 RPM)
SYSTIN:         +40.0°C  (high =  +0.0°C, hyst =  +0.0°C)  ALARM  sensor = thermistor
CPUTIN:         +56.0°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
AUXTIN:          +0.5°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
PCH_CHIP_TEMP:   +0.0°C  
PCH_CPU_TEMP:    +0.0°C  
PCH_MCH_TEMP:    +0.0°C  
intrusion0:    ALARM
intrusion1:    ALARM
beep_enable:   disabled

 

When I decided to chicken out:

Mon Feb  1 22:10:33 EST 2016
k10temp-pci-00c3
Adapter: PCI adapter
temp1:        +74.6°C  (high = +70.0°C)

nct6776-isa-0290
Adapter: ISA adapter
Vcore:          +1.06 V  (min =  +0.00 V, max =  +1.74 V)
in1:            +1.85 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
AVCC:           +3.38 V  (min =  +2.98 V, max =  +3.63 V)
+3.3V:          +3.36 V  (min =  +2.98 V, max =  +3.63 V)
in4:            +1.64 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in5:            +1.72 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in6:            +1.02 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
3VSB:           +3.46 V  (min =  +2.98 V, max =  +3.63 V)
Vbat:           +3.38 V  (min =  +2.70 V, max =  +3.63 V)
fan1:           817 RPM  (min =    0 RPM)
fan2:          3176 RPM  (min =    0 RPM)
fan3:             0 RPM  (min =    0 RPM)
fan4:             0 RPM  (min =    0 RPM)
fan5:             0 RPM  (min =    0 RPM)
SYSTIN:         +41.0°C  (high =  +0.0°C, hyst =  +0.0°C)  ALARM  sensor = thermistor
CPUTIN:         +71.0°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
AUXTIN:          -0.5°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
PCH_CHIP_TEMP:   +0.0°C  
PCH_CPU_TEMP:    +0.0°C  
PCH_MCH_TEMP:    +0.0°C  
intrusion0:    ALARM
intrusion1:    ALARM
beep_enable:   disabled

 

Now:

Mon Feb  1 22:17:44 EST 2016
k10temp-pci-00c3
Adapter: PCI adapter
temp1:        +62.0°C  (high = +70.0°C)

nct6776-isa-0290
Adapter: ISA adapter
Vcore:          +1.06 V  (min =  +0.00 V, max =  +1.74 V)
in1:            +1.86 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
AVCC:           +3.38 V  (min =  +2.98 V, max =  +3.63 V)
+3.3V:          +3.38 V  (min =  +2.98 V, max =  +3.63 V)
in4:            +1.65 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in5:            +1.72 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in6:            +1.05 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
3VSB:           +3.46 V  (min =  +2.98 V, max =  +3.63 V)
Vbat:           +3.38 V  (min =  +2.70 V, max =  +3.63 V)
fan1:           817 RPM  (min =    0 RPM)
fan2:          3183 RPM  (min =    0 RPM)
fan3:             0 RPM  (min =    0 RPM)
fan4:             0 RPM  (min =    0 RPM)
fan5:             0 RPM  (min =    0 RPM)
SYSTIN:         +42.0°C  (high =  +0.0°C, hyst =  +0.0°C)  ALARM  sensor = thermistor
CPUTIN:         +62.0°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
AUXTIN:          -3.0°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
PCH_CHIP_TEMP:   +0.0°C  
PCH_CPU_TEMP:    +0.0°C  
PCH_MCH_TEMP:    +0.0°C  
intrusion0:    ALARM
intrusion1:    ALARM
beep_enable:   disabled

 

The CPU temp is going up, but what is going up even more is the PCI device, which is my SATA controller card (I think; I have a PCIe SATA card and a PCI video card and that's it).  It doesn't have a fan.  But I guess it's time for me to put a fan on the side vent to cool it and see what happens!
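
To compare how far each reading climbs without eyeballing the snapshots, here is a minimal sketch that reads captured sensors output and prints the first reading, the peak, and the rise for each temperature sensor.  It assumes the output format shown above; the function name temp_rise is hypothetical:

```shell
#!/bin/bash
# Hypothetical helper: read captured `sensors` output on stdin and print, for
# each temperature sensor, the first reading, the peak reading, and the rise.
temp_rise() {
    awk '
        /^(temp[0-9]+|[A-Z]+TIN):/ {
            name = $1
            sub(/:$/, "", name)
            value = $2
            gsub(/[^0-9.+-]/, "", value)   # strip the degree sign and unit
            value += 0                     # force numeric comparison
            if (!(name in first)) {
                first[name] = value
                peak[name] = value
                order[++n] = name          # remember first-seen order
            }
            if (value > peak[name]) peak[name] = value
        }
        END {
            for (i = 1; i <= n; i++) {
                s = order[i]
                printf "%s first=%.1f peak=%.1f rise=%.1f\n", s, first[s], peak[s], peak[s] - first[s]
            }
        }
    '
}

# Hypothetical usage: temp_rise < captured-sensors.txt
```

Fed the three snapshots above, starting from the 22:07:02 one, this reports a rise of 18.6 for temp1 and 15.0 for CPUTIN.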

 

Is there even a remote possibility that this might be an Unraid issue?  I have had the same setup for about 8 months or so, and only recently upgraded to Unraid 6, which is when this started happening.

 

Thanks so much!!

Link to comment

I would suggest opening up the case and inspecting the inside.  Look at the CPU cooling fins.  Are they filled up with dust?  Start up the server and look at all of the fans.  Are they running?  Are the fan blades clogged with dust and dirt?  Are the inlets on the case clogged with dust?

 

It would not hurt to take the case outside (or some place where you aren't worried it gets really, really dirty) and blow out all of the accumulated dust and dirt. 

 

My Semprons run at about 38C... and I have a speed controller on the case fans so they are not running at full speed.

 

If, after you have cleaned everything, you still have a cooling problem, I would suggest researching strategies for proper case cooling for servers.  The problems and solutions are a bit different than for gaming systems. 

Link to comment
