Help! Server crashing/restarting every few (minutes to hours)

zleef · February 14, 2017

Hi all, I'm losing my mind trying to solve this. Any suggestions for debugging this would be greatly appreciated!

Here's the summary: my server crashes and restarts somewhere between every few minutes and every few hours without any obvious (to me) errors.

I'm embarrassed to say it's always had some form of this issue, although it previously happened much less frequently (every few days to every few months), so it was a low-priority thing I never got around to fixing, other than spending a few hours on it here and there. But lately (the last week or so, right around when I upgraded to unRAID 6.3.0), something has changed and it's happening at an alarming rate.

I should say, I'm not a hardware guy. I know enough to get myself in trouble (ahem, see this post), but am far from having any hardware expertise.

Here's what I have tried:

- Tailing the syslog from the IPMI console before/after the server crashes. No errors, it just restarts.

- I've replaced the memory, SATA cables, motherboard, and PSU over time. Each time, the issue still returns. The only things left are the hard drives and CPU.

- A few days ago I bought a new USB flash drive (Lexar JumpDrive) and tried reinstalling unRAID on it. A fresh install with the default config (no license, so disks weren't mounted) didn't immediately cause a crash as before. However, when I copied over my existing config, the server crashed pretty quickly. I didn't let the server run for long with the default config (maybe 2 hours with no crash). I'm open to leaving that running for longer, but because it was running in such a minimal capacity it didn't feel like a good test.

My current theory is that one of the disks is having problems (one of the disks in the array is ~5 years old), but none of the SMART reports seem to indicate failure (although this could very well be my ignorance in reading those reports). I'm open to suggestions.

Let me know what other information I can provide. I've attached a diagnostics dump from earlier today.

tower-diagnostics-20170214-0844.zip

testdasi · February 14, 2017

I think you can use the debug mode of the Fix Common Problems plugin, and it will constantly log things until a crash. That will probably help more with diagnosis.

zleef · February 14, 2017

Thanks testdasi, I installed Fix Common Problems and started "Troubleshooting Mode". Unfortunately I'm not seeing anything jump out in the persisted syslog, but I've attached it just in case. I tried to run the extended scan, but the server restarted minutes after starting it.

FCPsyslog_tail.txt

Frank1940 · February 14, 2017

I looked through your diagnostics file and didn't see anything. Your disks appear to be fine. Have you cleaned out the inside of the case recently and verified that all fans are running and the CPU cooling fins are not clogged with dirt? Have you installed Dynamix System Temperature? Have you run Memtest for 24 hours? (I know you replaced memory...)

What plugins, Dockers or VMs have you installed? Most of what I saw were common plugins that most folks have, so that should not be an issue here.

Saw your second post about it locking up while running the troubleshooting scan. I would be looking at the CPU overheating as a cause...

zleef · February 14, 2017

> Have you cleaned out the inside of the case recently and verified that all fans are running and the CPU cooling fins are not clogged with dirt?

Yes, just cleaned (after your post). The CPU cooling fan is not clogged. It lasted a little longer initially after cleaning (maybe 1.5 hours), but it's now back to restarting every ~10 minutes or so.

> Have you run Memtest for 24 hours?

It's been a while since I last did it. I don't believe I've done it with the "new" memory, so I'll go ahead and do that just to rule it out (I'll probably wait until tonight to start).

> What plugins, Dockers or VMs have you installed? Most of what I saw were common plugins that most folks have, so that should not be an issue here.

Plugins (listing all, including default):

- CA Auto Update Application

- CA Backup

- CA Cleanup Appdata

- Community Applications

- Dynamix System Temperature (installed today)

- Dynamix webGui

- Fix Common Problems (installed today)

- Preclear Disks

- Speedtest Command Line Tool

- unRAID server OS

Docker:

- I have several containers, but to help narrow down the problem I've disabled Docker completely (and still have the problem).

> Saw your second post about it locking up while running the troubleshooting scan. I would be looking at the CPU overheating as a cause...

Pardon my ignorance here, but is it possible for the reported temperatures to be incorrect?

Running the `sensors` command from the command line, I get the following:

coretemp-isa-0000
Adapter: ISA adapter
CPU Temp:     +32.0°C  (high = +86.0°C, crit = +96.0°C)
Core 0:       +32.0°C  (high = +86.0°C, crit = +96.0°C)
Core 1:       +30.0°C  (high = +86.0°C, crit = +96.0°C)
Core 2:       +30.0°C  (high = +86.0°C, crit = +96.0°C)
Core 3:       +31.0°C  (high = +86.0°C, crit = +96.0°C)

Even if we assume the temperature doubles in the minutes after I look it up and before the server crashes, it's still below the high or critical level.
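As a quick sanity check of that arithmetic, here is a small Python sketch using the core temperatures and the high limit from the readings above. The function name `stays_below_high` and the doubling factor are just my own illustration, not anything from unRAID or lm-sensors:

```python
# Core temperatures (°C) and the "high" limit copied from the sensors output above.
HIGH_LIMIT_C = 86.0
CORE_TEMPS_C = {"Core 0": 32.0, "Core 1": 30.0, "Core 2": 30.0, "Core 3": 31.0}


def stays_below_high(temps, factor=2.0, high=HIGH_LIMIT_C):
    """Return True if every reading, scaled by `factor`, is still below `high`.

    `factor` is a hypothetical worst-case multiplier, not a measured value.
    """
    return all(temp * factor < high for temp in temps.values())


if __name__ == "__main__":
    worst = max(CORE_TEMPS_C.values()) * 2.0
    print(f"Hottest core doubled: {worst:.1f}°C (high limit {HIGH_LIMIT_C:.1f}°C)")
    print("Still below high limit:", stays_below_high(CORE_TEMPS_C))
```

This only covers the readings at the moment I sampled them; it says nothing about how quickly the temperature could climb under load.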

Just to show the full output:

$ sensors
nct6776-isa-0a30
Adapter: ISA adapter
Vcore:          +1.46 V  (min =  +1.02 V, max =  +1.69 V)
in1:            +1.84 V  (min =  +1.55 V, max =  +2.02 V)
AVCC:           +3.34 V  (min =  +2.90 V, max =  +3.66 V)
+3.3V:          +3.34 V  (min =  +2.83 V, max =  +3.66 V)
in4:            +1.50 V  (min =  +0.97 V, max =  +1.65 V)
in5:            +1.27 V  (min =  +1.07 V, max =  +1.39 V)
in6:            +1.47 V  (min =  +0.89 V, max =  +1.23 V)  ALARM
3VSB:           +3.34 V  (min =  +2.83 V, max =  +3.66 V)
Vbat:           +3.15 V  (min =  +2.50 V, max =  +3.60 V)
fan1:             0 RPM  (min =  712 RPM)  ALARM
fan2:          2142 RPM  (min =  712 RPM)
fan3:           948 RPM  (min =  712 RPM)
fan4:             0 RPM  (min =  712 RPM)  ALARM
fan5:          2177 RPM  (min =  712 RPM)
SYSTIN:         +32.0°C  (high = +85.0°C, hyst = +80.0°C)  sensor = thermistor
CPUTIN:         +27.5°C  (high = +85.0°C, hyst = +80.0°C)  sensor = thermistor
AUXTIN:          +1.0°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
PECI Agent 0:    +0.0°C  (high = +80.0°C, hyst = +75.0°C)  ALARM
                         (crit = +100.0°C)
PCH_CHIP_TEMP:   +0.0°C
PCH_CPU_TEMP:    +0.0°C
PCH_MCH_TEMP:    +0.0°C
intrusion0:    ALARM
intrusion1:    OK
beep_enable:   enabled

acpitz-virtual-0
Adapter: Virtual device
temp1:        +27.8°C  (crit = +101.0°C)
MB Temp:      +29.8°C  (crit = +101.0°C)

coretemp-isa-0000
Adapter: ISA adapter
CPU Temp:     +32.0°C  (high = +86.0°C, crit = +96.0°C)
Core 0:       +32.0°C  (high = +86.0°C, crit = +96.0°C)
Core 1:       +30.0°C  (high = +86.0°C, crit = +96.0°C)
Core 2:       +30.0°C  (high = +86.0°C, crit = +96.0°C)
Core 3:       +31.0°C  (high = +86.0°C, crit = +96.0°C)

By the way, I just noticed the *ALARM* warning on several of the sensors. Any clues there?

I should mention that ever since I got this board, the fans have run at what I believe is full speed. I've never spent enough time to figure out the problem, but looking at the sensor readings, I wonder if it's because it's expecting fan1 and fan4 to have a minimum of 712 RPM, but I don't have any fans hooked up to those? :-\
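For what it's worth, the voltage and fan ALARM flags seem to line up with readings that fall outside their configured min/max limits. Here is a rough Python sketch that checks a few lines pasted from the output above. The regexes and the `out_of_range` name are my own assumptions for illustration, not part of lm-sensors, and it only understands `min =` / `max =` style limits:

```python
import re

# A few lines copied verbatim from the `sensors` output above.
SAMPLE = """\
Vcore:          +1.46 V  (min =  +1.02 V, max =  +1.69 V)
in6:            +1.47 V  (min =  +0.89 V, max =  +1.23 V)  ALARM
fan1:             0 RPM  (min =  712 RPM)  ALARM
fan2:          2142 RPM  (min =  712 RPM)
"""

NUMBER = r"[+-]?\d+(?:\.\d+)?"
READING = re.compile(rf"^(?P<name>[^:]+):\s+(?P<value>{NUMBER})")
MIN_LIMIT = re.compile(rf"min\s*=\s*({NUMBER})")
MAX_LIMIT = re.compile(rf"max\s*=\s*({NUMBER})")


def out_of_range(text):
    """Return (sensor, reason) pairs for readings outside their min/max limits."""
    problems = []
    for line in text.splitlines():
        reading = READING.match(line)
        if not reading:
            continue  # adapter headers, blank lines, continuation lines
        value = float(reading.group("value"))
        low = MIN_LIMIT.search(line)
        high = MAX_LIMIT.search(line)
        if low and value < float(low.group(1)):
            problems.append((reading.group("name"), "below min"))
        elif high and value > float(high.group(1)):
            problems.append((reading.group("name"), "above max"))
    return problems


if __name__ == "__main__":
    for name, reason in out_of_range(SAMPLE):
        print(f"{name}: {reason}")
```

On the sample lines this flags in6 (above its max) and fan1 (0 RPM, below its 712 RPM minimum), which matches the unconnected fan header theory for fan1 but leaves in6 unexplained.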

Frank1940 · February 15, 2017

From what you have posted, I don't think you have a heat-related issue either. But I would be concerned about that in6 voltage in the sensors report being high. Has this MB been used previously in another computer setup where someone might have been attempting to overclock?

I hope someone else can jump in and give you some insight here.

zleef · February 15, 2017

The MB was purchased new, so to my knowledge it shouldn't have ever been overclocked.

John_M · February 15, 2017

From your syslog:

Feb 14 08:35:40 Tower kernel: mce: [Hardware Error]: Machine check events logged

Run that Memtest that you were planning to run and run it for a good long time. If the memory proves to be good then finding out what is actually wrong is not going to be easy.
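For anyone who wants to check their own syslog for the same kind of entry, here is a minimal Python sketch. The function name `find_mce_lines` is hypothetical, and it simply matches the `mce: [Hardware Error]` text from the line quoted above; the second sample line is a made-up placeholder, not from the real log:

```python
def find_mce_lines(syslog_text):
    """Return every syslog line that contains a kernel machine-check report."""
    marker = "mce: [Hardware Error]"
    return [line for line in syslog_text.splitlines() if marker in line]


# First line is the entry quoted above; the second is a placeholder for contrast.
SAMPLE_SYSLOG = """\
Feb 14 08:35:40 Tower kernel: mce: [Hardware Error]: Machine check events logged
Feb 14 08:35:41 Tower kernel: (placeholder line, not from the real log)
"""

if __name__ == "__main__":
    for line in find_mce_lines(SAMPLE_SYSLOG):
        print(line)
```

To run it against a saved log, read the file into a string first (for example the syslog from a diagnostics zip) and pass that to `find_mce_lines`. The entry only says that an event was logged, not which component caused it.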

zleef · February 15, 2017

What would you consider a sufficient amount of time? 24 hours? 48?

In the likely case (based on my luck) that the memory is good, does anyone have suggestions for next steps?

John_M · February 15, 2017

I'd do it for 48 hours.

If the RAM is good, I'd use the Nerd Tools to install mcelog and see if that reveals anything. But do the Memtest first. Better than the version you select from the boot menu is the newer version you can download and run from its own USB device.

zleef · February 15, 2017

Thanks for the suggestions John_M. I've been running memtest from the boot menu, but will download and run the newer version just to be thorough. I'll report back after 48 hours of Memtest.

Thanks!

John_M · February 15, 2017

The stand-alone version needs its own USB stick. It dual boots: BIOS boot runs version 4 and UEFI boot runs version 7. If you can, run the later version.

gubbgnutten · February 15, 2017

...and please do interpret "If you can, run the later version" as "Run version 7 unless it is completely impossible".

zleef · February 15, 2017

Haha, thanks for the laugh.

Per your gentle suggestion, MemTest86 V7.2 is now running! 48 hours to go..

gubbgnutten · February 15, 2017

You're welcome! I'm sure your memory modules will enjoy the hammer tests.

Never really know what memory test result to hope for when there is an obscure problem present...

zleef · February 16, 2017

So I just went to check in on MemTest, and it said something like "all tests passed, press any key to continue".

I'm attaching the screenshot of the next page, which showed the summary.

I was assuming that this would just continue to run until I stopped it. Do I need to configure it somewhere to keep running, or should I just continue to restart the test every ~6 hours?
