Help! Server crashing/restarting every few (minutes to hours)


zleef


Hi all, I'm losing my mind trying to solve this. Any suggestions for debugging this would be greatly appreciated!

 

Here's the summary: my server crashes and restarts somewhere between every few minutes and every few hours without any obvious (to me) errors.

 

I'm embarrassed to say, it's always had some form of this issue, although it previously happened much less frequently (every few days/months), so it was a low-priority thing I never got around to fixing, other than spending a few hours on it here and there. But lately (the last week or so, right around when I upgraded to unRAID 6.3.0), something has changed and it's happening at an alarming rate.

 

I should say, I'm not a hardware guy. I know enough to get myself in trouble (ahem, see this post), but am far from having any hardware expertise.

 

Here's what i have tried:

- Tailing syslog from the IPMI console before/after the server crashes. No errors, it just restarts.

- I've replaced the memory, SATA cables, motherboard, and PSU over time. Each time, the issue returns. The only things left are the hard drives and CPU.

- A few days ago I bought a new USB flash drive (Lexar JumpDrive) and tried reinstalling unRAID on it. A fresh install with the default config (no license, so the disks weren't mounted) didn't immediately cause the crash as before. However, when I copied over my existing config, the server crashed pretty quickly. I didn't let the server run for long with the default config (maybe 2 hours with no crash). I'm open to leaving that running for longer, but because it was running in such a minimal capacity it didn't feel like a good test.
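
One reason tailing the syslog may show nothing is that the last lines before a hard reset might never reach storage that survives the reboot. Here is a minimal sketch of one way around that, assuming Python 3 is installed on the server: it copies new syslog bytes to a second file and calls fsync after every write. The function names and the paths mentioned afterwards are assumptions for illustration, not anything unRAID provides.

```python
import os
import time


def copy_new_lines(src, dst, offset=0):
    """Append everything in src past byte `offset` to dst and return the new offset."""
    if os.path.getsize(src) < offset:
        offset = 0  # source was truncated or rotated, so start again from the top
    with open(src, "rb") as fin:
        fin.seek(offset)
        data = fin.read()
    if data:
        with open(dst, "ab") as fout:
            fout.write(data)
            fout.flush()
            os.fsync(fout.fileno())  # push the bytes to the device before a possible crash
    return offset + len(data)


def follow(src, dst, interval=1.0):
    """Poll src forever, mirroring new content into dst (hypothetical helper)."""
    offset = 0
    while True:
        offset = copy_new_lines(src, dst, offset)
        time.sleep(interval)
```

Calling `follow("/var/log/syslog", "/boot/logs/syslog-persist.txt")` would keep running until the crash. Both paths are guesses for a typical setup (syslog held in RAM, flash drive mounted at /boot), and frequent writes add wear to the flash drive, so this is only meant for a short troubleshooting window.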

 

My current theory is that one of the disks is having problems (one of the disks in the array is ~5 years old), but none of the SMART reports seem to indicate failure (although this very well could be my ignorance of reading those reports). I'm open to suggestions.

 

Let me know what other information I can provide. I've attached a diagnostics dump from earlier today.

tower-diagnostics-20170214-0844.zip


I think you can use the debug mode of the Fix Common Problems plugin; it will log continuously until a crash. That will probably help more with diagnosis.

 

Thanks testdasi, I installed Fix Common Problems and started "Troubleshooting Mode". Unfortunately I'm not seeing anything jump out in the persisted syslog, but I've attached it just in case. I tried to run the extended scan, but the server restarted minutes after starting that :(

FCPsyslog_tail.txt

I looked through your diagnostics file and didn't see anything. Your disks appear to be fine. Have you cleaned out the inside of the case recently and verified that all fans are running and the CPU cooling fins are not clogged with dirt? Have you installed Dynamix System Temperature? Have you run Memtest for 24 hours? (I know you replaced the memory...)

 

What plugins, Dockers or VMs have you installed? Most of what I saw were common plugins that most folks have, so that should not be an issue here.

 

Saw your second post about it locking up while running the troubleshooting scan. I would be looking at the CPU overheating as a cause...

Have you cleaned out the inside of the case recently and verified that all fans are running and the CPU cooling fins are not clogged with dirt?

Yes, just cleaned (after your post). CPU cooling fan is not clogged. It lasted a little longer initially after cleaning (maybe 1.5 hours), but now back to restarting every ~10 minutes or so.

 

Have you run Memtest for 24 hours?

It's been a while since I last did it. I don't believe I've done it with the "new" memory, so I'll go ahead and do that just to rule it out (I'll probably wait until tonight to start).

 

What plugins, Dockers or VMs have you installed? Most of what I saw were common plugins that most folks have, so that should not be an issue here.

 

Plugins (listing all, including default):

- CA Auto Update Application

- CA Backup

- CA Cleanup Appdata

- Community Applications

- Dynamix System Temperature (installed today)

- Dynamix webGui

- Fix Common Problems (installed today)

- Preclear Disks

- Speedtest Command Line Tool

- unRAID server OS

 

Docker:

- I have several containers, but to help narrow down the problem i've disabled docker completely (and still have the problem).

 

 

Saw your second post about it locking up while running the troubleshooting scan. I would be looking at the CPU overheating as a cause...

Pardon my ignorance here, but is it possible for the temperatures reported to be incorrect?

 

Running the `sensors` command from the command line, i get the following:

 

coretemp-isa-0000
Adapter: ISA adapter
CPU Temp:     +32.0°C  (high = +86.0°C, crit = +96.0°C)
Core 0:       +32.0°C  (high = +86.0°C, crit = +96.0°C)
Core 1:       +30.0°C  (high = +86.0°C, crit = +96.0°C)
Core 2:       +30.0°C  (high = +86.0°C, crit = +96.0°C)
Core 3:       +31.0°C  (high = +86.0°C, crit = +96.0°C)

 

Even if we assume the temp doubles in the minutes after I look up the temperature and before the server crashes, it's still below the high or critical level.

 

Just to show the full output:

$ sensors
nct6776-isa-0a30
Adapter: ISA adapter
Vcore:          +1.46 V  (min =  +1.02 V, max =  +1.69 V)
in1:            +1.84 V  (min =  +1.55 V, max =  +2.02 V)
AVCC:           +3.34 V  (min =  +2.90 V, max =  +3.66 V)
+3.3V:          +3.34 V  (min =  +2.83 V, max =  +3.66 V)
in4:            +1.50 V  (min =  +0.97 V, max =  +1.65 V)
in5:            +1.27 V  (min =  +1.07 V, max =  +1.39 V)
in6:            +1.47 V  (min =  +0.89 V, max =  +1.23 V)  ALARM
3VSB:           +3.34 V  (min =  +2.83 V, max =  +3.66 V)
Vbat:           +3.15 V  (min =  +2.50 V, max =  +3.60 V)
fan1:             0 RPM  (min =  712 RPM)  ALARM
fan2:          2142 RPM  (min =  712 RPM)
fan3:           948 RPM  (min =  712 RPM)
fan4:             0 RPM  (min =  712 RPM)  ALARM
fan5:          2177 RPM  (min =  712 RPM)
SYSTIN:         +32.0°C  (high = +85.0°C, hyst = +80.0°C)  sensor = thermistor
CPUTIN:         +27.5°C  (high = +85.0°C, hyst = +80.0°C)  sensor = thermistor
AUXTIN:          +1.0°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
PECI Agent 0:    +0.0°C  (high = +80.0°C, hyst = +75.0°C)  ALARM
                         (crit = +100.0°C)
PCH_CHIP_TEMP:   +0.0°C
PCH_CPU_TEMP:    +0.0°C
PCH_MCH_TEMP:    +0.0°C
intrusion0:    ALARM
intrusion1:    OK
beep_enable:   enabled

acpitz-virtual-0
Adapter: Virtual device
temp1:        +27.8°C  (crit = +101.0°C)
MB Temp:      +29.8°C  (crit = +101.0°C)

coretemp-isa-0000
Adapter: ISA adapter
CPU Temp:     +32.0°C  (high = +86.0°C, crit = +96.0°C)
Core 0:       +32.0°C  (high = +86.0°C, crit = +96.0°C)
Core 1:       +30.0°C  (high = +86.0°C, crit = +96.0°C)
Core 2:       +30.0°C  (high = +86.0°C, crit = +96.0°C)
Core 3:       +31.0°C  (high = +86.0°C, crit = +96.0°C)

 

 

By the way, i just noticed the *ALARM* warning on several of the sensors.  Any clues there?

 

I should mention, ever since I got this board the fans have run at what I believe is full speed. I've never spent enough time to figure out the problem, but looking at the sensor readings, I wonder if it's because it's expecting fan1 and fan4 to have a min RPM of 712, but I don't have any fans hooked up to those? :-\
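
To see which of those ALARM flags are explained simply by a reading sitting outside its configured limits, here is a rough sketch that parses `sensors` text and compares each value with the min/max/high/crit figures printed on the same line. The regular expressions and the function name are my own assumptions about the output format shown above, not part of lm-sensors, and the sample lines are copied from the output in this post.

```python
import re

# A reading looks like "in6:   +1.47 V  (min = +0.89 V, max = +1.23 V)  ALARM"
READING = re.compile(r"^(?P<name>[^:]+):\s+\+?(?P<value>-?\d+(?:\.\d+)?)\s*(?:V|RPM|°C)")
LIMIT = re.compile(r"(min|max|high|crit)\s*=\s*\+?(-?\d+(?:\.\d+)?)")


def out_of_range(sensors_text):
    """Return (name, value, reason) for every reading outside its printed limits."""
    problems = []
    for line in sensors_text.splitlines():
        m = READING.match(line.strip())
        if not m:
            continue  # adapter headers, blank lines, limit-only continuation lines
        name = m.group("name").strip()
        value = float(m.group("value"))
        limits = {kind: float(lim) for kind, lim in LIMIT.findall(line)}
        if "min" in limits and value < limits["min"]:
            problems.append((name, value, "below min"))
        for kind in ("crit", "high", "max"):  # report the most severe limit exceeded
            if kind in limits and value > limits[kind]:
                problems.append((name, value, "above " + kind))
                break
    return problems


SAMPLE = """\
Vcore:          +1.46 V  (min =  +1.02 V, max =  +1.69 V)
in6:            +1.47 V  (min =  +0.89 V, max =  +1.23 V)  ALARM
fan1:             0 RPM  (min =  712 RPM)  ALARM
Core 0:       +32.0°C  (high = +86.0°C, crit = +96.0°C)
"""

if __name__ == "__main__":
    for name, value, reason in out_of_range(SAMPLE):
        print(name, value, reason)
```

Against the full output above this would flag in6 (1.47 V against a max of 1.23 V) and fan1/fan4 (0 RPM against a min of 712 RPM). It says nothing about whether the limits themselves are sensible for this board, and it does not explain the PECI Agent 0 or intrusion0 alarms.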


From what you have posted, I don't think you have a heat-related issue either. But I would be concerned about the in6 voltage in the sensors report being high. Has this MB been used previously in another computer setup where someone might have been attempting to overclock?

 

I hope someone else can jump in and give you some insight here.

The MB was purchased new, so to my knowledge it shouldn't have ever been overclocked.


From your syslog:

 

Feb 14 08:35:40 Tower kernel: mce: [Hardware Error]: Machine check events logged

 

Run that Memtest that you were planning to run and run it for a good long time. If the memory proves to be good then finding out what is actually wrong is not going to be easy.
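
Since a line like that is easy to miss in a long syslog, here is a small sketch that pulls out every kernel line tagged `[Hardware Error]` along with its timestamp, so the hits can be compared against the times of the restarts. The regular expression assumes the syslog line format quoted above, and the function name is hypothetical. It only finds the lines; it does not decode the machine check itself.

```python
import re
import sys

# Matches e.g. "Feb 14 08:35:40 Tower kernel: mce: [Hardware Error]: ..."
HW_ERROR = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:.*\[Hardware Error\]"
)


def hardware_errors(lines):
    """Return (timestamp, full line) for each kernel hardware-error line."""
    hits = []
    for line in lines:
        m = HW_ERROR.match(line)
        if m:
            hits.append((m.group("stamp"), line.strip()))
    return hits


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: pass the path of a saved syslog file as the first argument.
    with open(sys.argv[1], errors="replace") as fh:
        for stamp, text in hardware_errors(fh):
            print(stamp, "|", text)
```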

 

What would you consider a sufficient amount of time? 24 hours? 48?

 

In the likely case (based on my luck) that the memory is good, does anyone have suggestions for next steps?


So i just went to check in on MemTest, and it said something like "all tests passed, press any key to continue".

 

I'm attaching the screenshot of the next page which showed the summary.

 

I was assuming that this would just continue to run until I stopped it. Do I need to configure it somewhere to keep running, or should I just keep restarting the test every ~6 hours?

Screen_Shot_2017-02-15_at_5_10.12_PM.png
