bumblebee21 Posted January 30, 2017

Running 6.2.4, with one Windows 10 VM using VGA passthrough. Full list of plugins, Dockers, and hardware at the bottom of this post.

I've had a couple of instances lately where I've found my server randomly non-responsive. The machine itself is on, but I cannot telnet in and my VM is down (nothing on the screen). The only way to recover is a hard reboot, after which everything seems to run fine. Since I can't access the server once it becomes non-responsive, I started outputting the log to another PC on the network via tail. Unfortunately, there doesn't seem to be much useful info in it (see below). It looks like mover ran, then an hour later the server went offline. I don't see any other tasks scheduled for around that time on either the server or the VM. I'm sort of at a loss. Any ideas?

Jan 29 03:40:34 Tower root: mover finished
Jan 29 04:30:01 Tower root: Fix Common Problems Version 2017.01.24
Jan 29 04:30:07 Tower root: Fix Common Problems: Error: unclean shutdown detected of your server
Jan 29 04:30:07 Tower sSMTP[23043]: Creating SSL connection to host
Jan 29 04:30:07 Tower sSMTP[23043]: SSL connection using ECDHE-RSA-AES128-GCM-SHA256
Jan 29 04:30:09 Tower sSMTP[23043]: Sent mail for [REDACTED] (221 2.0.0 closing connection w41sm8871776qtw.34 - gsmtp) uid=0 username=root outbytes=789
Jan 29 04:40:01 Tower rsyslogd: [origin software="rsyslogd" swVersion="8.16.0" x-pid="1479" x-info="http://www.rsyslog.com"] rsyslogd was HUPed

PLUGINS
CA Auto Update Applications
CA Backup / Restore Appdata
CA Cleanup Appdata
Community Applications
Dynamix Cache Directories
Dynamix File Integrity
Dynamix SSD TRIM
Dynamix System Buttons
Dynamix System Information
Dynamix System Statistics
Dynamix System Temperature
Dynamix webGui
Fix Common Problems
Nerd Tools
Unassigned Devices

DOCKERS
cadvisor
CouchPotato
Netdata
PlexMediaServer
sickrage
transmission

HARDWARE
ASRock Z170 Extreme4
Intel i5-6600K (running stock at the moment)
32GB G.Skill DDR4-2133
1x 2TB WD Red parity drive
4x 1TB WD Red data drives
1x 256GB Samsung 850 EVO cache drive
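Tailing the log to another machine is the right instinct. When reading a saved syslog after the fact, one quick way to bracket the moment of a freeze is to scan for unusually long pauses between consecutive timestamps. A minimal sketch, assuming the standard syslog timestamp format; the 30-minute threshold and the sample lines are illustrations only, not part of unRAID:

```python
import re
from datetime import datetime, timedelta

def find_gaps(lines, threshold=timedelta(minutes=30), year=2017):
    """Return (last_seen, next_seen) pairs wherever the log goes silent
    for longer than `threshold` -- a long gap usually brackets the
    moment the machine froze."""
    ts_re = re.compile(r"^(\w{3}) +(\d+) (\d\d:\d\d:\d\d)")
    gaps, prev = [], None
    for line in lines:
        m = ts_re.match(line)
        if not m:
            continue  # skip lines without a leading timestamp
        stamp = datetime.strptime(
            f"{year} {m.group(1)} {m.group(2)} {m.group(3)}",
            "%Y %b %d %H:%M:%S",
        )
        if prev is not None and stamp - prev > threshold:
            gaps.append((prev, stamp))
        prev = stamp
    return gaps

# Sample lines from the excerpt above
log = [
    "Jan 29 03:40:34 Tower root: mover finished",
    "Jan 29 04:30:01 Tower root: Fix Common Problems Version 2017.01.24",
    "Jan 29 04:40:01 Tower rsyslogd: rsyslogd was HUPed",
]
for start, end in find_gaps(log):
    print(f"log silent from {start} to {end}")
```

Run against the excerpt above, it flags the ~50-minute silence between mover finishing and the next entry, which is roughly where the freeze must have happened. Note the year has to be supplied, since classic syslog timestamps omit it.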
John_M Posted January 30, 2017

Since you have the Fix Common Problems plugin installed, you ought to reboot and run it in troubleshooting mode until your server becomes unresponsive again.
bumblebee21 Posted February 1, 2017

Yeah, I should have thought to do that earlier. At any rate, I enabled troubleshooting mode shortly after your reply and just got a crash. Attached is the syslog; diagnostics will be in the next post. I took a quick glance and didn't see anything, but hopefully you folks might notice something.

syslog.zip
bumblebee21 Posted February 1, 2017

Diagnostics attached.

tower-diagnostics-20170201-1520.zip
bumblebee21 Posted February 2, 2017

Anyone have thoughts?
John_M Posted February 3, 2017

Can you confirm that the syslog.zip file (114.83 kB) you attached to Reply #2 is from your boot flash, as mentioned here:

Feb 1 15:20:37 Tower root: Fix Common Problems: Capturing diagnostics. When uploading diagnostics to the forum, also upload /config/logs/syslog.txt on the flash drive
bumblebee21 Posted February 3, 2017

> Can you confirm that the syslog.zip file (114.83 kB) you attached to Reply #2 is from your boot flash, as mentioned here:

John, thanks for the reply. The syslog I posted was in the /logs/ directory of the boot flash, along with all the diagnostics. I looked in the /config/ directory but did not see a logs directory. Screen cap below.
Squid Posted February 3, 2017

> John, thanks for the reply. The syslog I posted was in the /logs/ directory of the boot flash, along with all the diagnostics. I looked in the /config/ directory but did not see a logs directory.

You posted correctly. That message in the log is a typo of mine. :'(
John_M Posted February 3, 2017

Thanks Squid. That explains it. I'll spend a bit more time looking at the log now.
John_M Posted February 3, 2017

I haven't been able to find anything that might indicate what the problem is. Have you tested the memory recently?
bumblebee21 Posted February 4, 2017

> I haven't been able to find anything that might indicate what the problem is. Have you tested the memory recently?

Thanks for looking. I haven't run a memtest in a while, but will plan to do that tonight.
John_M Posted February 4, 2017

When it becomes unresponsive does it stay that way or does it sometimes recover?
bumblebee21 Posted February 4, 2017

> When it becomes unresponsive does it stay that way or does it sometimes recover?

It never recovers. It's sat unresponsive for at least 10 or 12 hours without coming back.
bumblebee21 Posted February 4, 2017

First DIMM/slot ran for 13 hours: 6 passes, 0 errors. Running the other DIMM/slot now.
John_M Posted February 4, 2017

I realised late yesterday that the title of this thread is Random Shutdowns, which is actually rather different from the description in your OP of your server being non-responsive. I understand a shutdown to be power off, while frozen and non-responsive mean power on. Is it actually shutting down, or freezing, or becoming sluggish, or something else?
bumblebee21 Posted February 4, 2017

> Is it actually shutting down or freezing or becoming sluggish or something else?

Yeah, shutdown was a bad title. What happens is this: the VM goes dark (nothing on the screen, no response from mouse/keyboard), the web-based GUI does not load (unreachable), telnet loses its connection and cannot reconnect, and the onscreen output from unRAID through the iGPU is still there but no longer updates. Meanwhile the rig itself is still on: fans spinning, lights on, etc.

Any other ideas on hardware that might cause this? The only component that isn't relatively new is my PSU, which is a solid, reputable unit but pushing 5-6 years old. At the same time, I would think a failing PSU would shut the system down completely, not just make it go unresponsive. Thanks again for your help.
John_M Posted February 4, 2017

Failing PSUs cause all sorts of weird errors, so it can't be ruled out. A cheaper thing to try first would be to run without any VMs for a while. That's what I'd do next: stop the VMs and also stop the VM service in Settings -> VM Manager. Run like that for a few days and see how it goes. It helps to break the problem down into more manageable pieces.
bumblebee21 Posted February 7, 2017

Welp, I may have made things worse. I'd really like to avoid losing my main VM for weeks just to see whether the system crashes again, so I swapped out the PSU for a new one a buddy had handy. At the same time, I also upgraded to 6.3.0. Since then, the system has yet to crash (though I haven't had enough uptime to call it stable), but I'm now getting repeated "lost rtc interrupts" messages in the syslog, specifically messages like:

kernel: hpet1: lost 522 rtc interrupts

Any thoughts? I only found a few mentions of this error on the unRAID forums.
John_M Posted February 7, 2017

Changing so many things at one time makes it difficult to follow what's going on and gives the impression of panic rather than logical thought. RTC = Real Time Clock. It might be worth changing the battery. Or maybe it's a BIOS bug. Diagnostics might reveal something.
bumblebee21 Posted February 7, 2017

Yeah, definitely not the best for troubleshooting. I wanted to upgrade for the security patches, not necessarily to fix issues; the PSU swap is the thing I'm hoping may actually help. At any rate, I found a few references in other Linux distros to turning off ACPI in the BIOS to address the hpet issues. Sure enough, with ACPI off, I no longer see those interrupt messages. So I guess now I'll leave it in troubleshooting mode and wait for another lock-up. Thanks very much for your help, John.
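If turning off ACPI in the BIOS ever causes side effects, the same hpet complaints are sometimes worked around from the kernel side instead, by forcing a different clock source on the boot line. On unRAID that would mean editing the append line in /boot/syslinux/syslinux.cfg, roughly like the sketch below. This is an assumption, not a tested fix for this machine: clocksource=tsc is a generic Linux kernel parameter, and the surrounding label/kernel lines may differ from your actual file.

```
label unRAID OS
  menu default
  kernel /bzimage
  append clocksource=tsc initrd=/bzroot
```

Either approach (BIOS setting or boot parameter) should silence the hpet messages; the boot parameter has the advantage of being trivially reversible from the flash drive.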
bumblebee21 Posted February 8, 2017

The saga continues. Less than 24 hours after booting up the rig with the new PSU (and unRAID 6.3.0), I got another lock-up. Syslog and diagnostics attached. Again, I don't see anything in them that presages the failure or lock-up. I'd really like to avoid going without my primary VM, but that may be my only option at this point.

FCPsyslog_tail.zip
tower-diagnostics-20170208-0213.zip
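One thing that might make the next crash more informative: a tail file only captures what made it to disk, and the final seconds before a hard freeze often don't. Since the server is already running rsyslogd (version 8.16.0 appears in the first log excerpt), it can forward every message to another machine in real time, so the last lines survive the freeze on a different box. The classic forwarding rule is a one-liner; the target IP here is a placeholder, and the exact location to add it on unRAID (e.g. whether /etc/rsyslog.conf persists across reboots) is an assumption to verify:

```
# forward all messages over UDP to a listening PC (single @ = UDP, @@ = TCP)
*.* @192.168.1.50:514
```

The receiving PC then just needs a syslog daemon configured to listen on UDP 514.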
jbrodriguez Posted February 8, 2017

> Failing PSUs cause all sorts of weird errors so it can't be ruled out.

I can't stress this enough, from personal experience: my firewall (a Supermicro board based on an Intel Atom 2550 SoC) was doing just fine for about 3 years, then it started rebooting every now and then out of the blue. I never thought about the UPS, which was definitely dying. It's worth your time to check yours.
John_M Posted February 8, 2017

I see this

Feb 7 16:48:05 Tower kernel: smpboot: Max logical packages: 1
Feb 7 16:48:05 Tower kernel: DMAR: Host address width 39
Feb 7 16:48:05 Tower kernel: DMAR: DRHD base: 0x000000fed90000 flags: 0x0
Feb 7 16:48:05 Tower kernel: DMAR: dmar0: reg_base_addr fed90000 ver 1:0 cap 1c0000c40660462 ecap 7e3ff0505e
Feb 7 16:48:05 Tower kernel: DMAR: DRHD base: 0x000000fed91000 flags: 0x1
Feb 7 16:48:05 Tower kernel: DMAR: dmar1: reg_base_addr fed91000 ver 1:0 cap d2008c40660462 ecap f050da
Feb 7 16:48:05 Tower kernel: DMAR: RMRR base: 0x00000027542000 end: 0x00000027561fff
Feb 7 16:48:05 Tower kernel: DMAR: RMRR base: 0x00000028800000 end: 0x00000038ffffff
Feb 7 16:48:05 Tower kernel: DMAR-IR: IOAPIC id 2 under DRHD base 0xfed91000 IOMMU 1
Feb 7 16:48:05 Tower kernel: DMAR-IR: HPET id 0 under DRHD base 0xfed91000
Feb 7 16:48:05 Tower kernel: DMAR-IR: x2apic is disabled because BIOS sets x2apic opt out bit.
Feb 7 16:48:05 Tower kernel: DMAR-IR: Use 'intremap=no_x2apic_optout' to override the BIOS setting.
Feb 7 16:48:05 Tower kernel: DMAR-IR: Enabled IRQ remapping in xapic mode
Feb 7 16:48:05 Tower kernel: x2apic: IRQ remapping doesn't support X2APIC mode
Feb 7 16:48:05 Tower kernel: mce: [Hardware Error]: Machine check events logged
Feb 7 16:48:05 Tower kernel: ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
Feb 7 16:48:05 Tower kernel: TSC deadline timer enabled
Feb 7 16:48:05 Tower kernel: smpboot: CPU0: Intel® Core i5-6600K CPU @ 3.50GHz (family: 0x6, model: 0x5e, stepping: 0x3)

very early in your boot sequence, when the machine is still running on only one processor core. I don't know what it means, but I don't think it's a good sign.
bumblebee21 Posted February 8, 2017

Good catch. I installed mcelog to check it out. The logs reported it to be an "internal parity error." From googling around, it looks like this is actually a benign error (a false positive); Intel has released an erratum saying that these errors may be falsely reported but can be safely ignored.
John_M Posted February 9, 2017

I went back and checked your earlier syslog and the same error is reported there too, but if Intel acknowledges it then it must be safe to ignore. I don't see anything else of note. I think you're going to have to choose between living with the crashes or living without your VM for a while. I'd disable the VM service in unRAID and turn off IOMMU (VT-d) in the BIOS... but if you want an excuse for not doing that just yet, version 6.3.1 is now available, so it has to be worth a try.