Random Shutdowns



Running unRAID 6.2.4. Full list of plugins and Dockers at the bottom of this post. Also running one Windows 10 VM with VGA passthrough. Hardware also listed below.

 

I've had a couple of instances lately where I've found my server to be randomly non-responsive. The machine itself is on, but I cannot telnet in and my VM is down (nothing on the screen). The only way to recover is to do a hard reboot, after which everything seems to run fine.

 

Since I can't access the server once it becomes non-responsive, I started outputting the log to another PC on the network via tail. Unfortunately, there doesn't seem to be much useful info in the log (see below). It looks like mover ran, then an hour later, the server went offline. I don't see any other tasks scheduled for around that time on either the server or the VM.

 

I'm sort of at a loss. Any ideas?

 

Jan 29 03:40:34 Tower root: mover finished
Jan 29 04:30:01 Tower root: Fix Common Problems Version 2017.01.24
Jan 29 04:30:07 Tower root: Fix Common Problems: Error: unclean shutdown detected of your server
Jan 29 04:30:07 Tower sSMTP[23043]: Creating SSL connection to host
Jan 29 04:30:07 Tower sSMTP[23043]: SSL connection using ECDHE-RSA-AES128-GCM-SHA256
Jan 29 04:30:09 Tower sSMTP[23043]: Sent mail for [REDACTED] (221 2.0.0 closing connection w41sm8871776qtw.34 - gsmtp) uid=0 username=root outbytes=789
Jan 29 04:40:01 Tower rsyslogd: [origin software="rsyslogd" swVersion="8.16.0" x-pid="1479" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
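One way to narrow down when the server stopped logging is to look for unusually long silences between consecutive syslog entries. The following is only a minimal Python sketch, assuming classic syslog timestamps without a year (the year is passed in) and a 30-minute threshold chosen arbitrarily; the function names `parse_ts` and `find_gaps` are illustrative and not part of unRAID or any plugin.

```python
from datetime import datetime, timedelta

def parse_ts(line, year=2017):
    """Parse the 'Mon DD HH:MM:SS' prefix of a classic syslog line."""
    # The first 15 characters hold the timestamp, e.g. 'Jan 29 03:40:34'.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")

def find_gaps(lines, threshold=timedelta(minutes=30)):
    """Return (previous_line, next_line, gap) for every gap above threshold."""
    gaps = []
    prev_line, prev_ts = None, None
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        try:
            ts = parse_ts(line)
        except ValueError:
            continue  # skip lines without a parsable timestamp
        if prev_ts is not None and ts - prev_ts > threshold:
            gaps.append((prev_line, line, ts - prev_ts))
        prev_line, prev_ts = line, ts
    return gaps

# Lines taken from the log above (the rsyslogd line is abbreviated).
sample = [
    "Jan 29 03:40:34 Tower root: mover finished",
    "Jan 29 04:30:01 Tower root: Fix Common Problems Version 2017.01.24",
    "Jan 29 04:30:07 Tower sSMTP[23043]: Creating SSL connection to host",
    "Jan 29 04:40:01 Tower rsyslogd: rsyslogd was HUPed",
]
gaps = find_gaps(sample)
for before, after, gap in gaps:
    print(f"{gap} of silence after: {before}")
```

On a full syslog saved from the tail session, the last entry before the longest gap is a reasonable place to start looking, although a silent stretch on its own does not prove the server was down.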

 

 

 

 

PLUGINS

CA Auto Update Applications

CA Backup / Restore Appdata

CA Cleanup Appdata

Community Applications

Dynamix Cache Directories

Dynamix File Integrity

Dynamix SSD TRIM

Dynamix System Buttons

Dynamix System Information

Dynamix System Statistics

Dynamix System Temperature

Dynamix webGui

Fix Common Problems

Nerd Tools

Unassigned Devices

 

DOCKERS

cadvisor

CouchPotato

Netdata

PlexMediaServer

sickrage

transmission

 

HARDWARE

ASRock Z170 Extreme4

Intel i5-6600K (running stock at the moment)

32 GB G.Skill DDR4 2133 MHz

1 x 2 TB WD Red parity drive

4 x 1 TB WD Red data drives

1 x 256 GB Samsung 850 Evo cache drive

Link to comment

Can you confirm that the syslog.zip file (114.83 kB) you attached to Reply #2 is from your boot flash, as mentioned here:

 

Feb  1 15:20:37 Tower root: Fix Common Problems: Capturing diagnostics.  When uploading diagnostics to the forum, also upload /config/logs/syslog.txt on the flash drive

 

Link to comment

John, thanks for the reply. The syslog I posted was in the /logs/ directory of the boot flash, along with all the diagnostics. I looked in the /config/ directory but did not see a logs directory. Screen cap below.

 

[Image: Eje7DvB.png]

 

Link to comment

You posted correctly.  That message in the log is a typo of mine.  :'(
Link to comment

I realised late yesterday that the title of this thread is Random Shutdowns, which is actually rather different from the description in your OP of your server being non-responsive. I understand a shutdown to be power off, while frozen and non-responsive mean power on. Is it actually shutting down or freezing or becoming sluggish or something else?

 

Link to comment

Yeah, shutdown was a bad title. What happens is this: the VM goes dark (nothing on the screen, no response from mouse/keyboard), the web-based GUI is unreachable, Telnet loses its connection and cannot reconnect, and the onscreen output from unRAID through the iGPU is still there but no longer updates. Meanwhile the rig itself is still on, fans spinning, lights on, etc.

 

Any other ideas on hardware that might cause this? The only component that isn't relatively new is my PSU, which is a solid, reputable unit, but pushing 5-6 years old. At the same time, I would think a failing PSU would totally shut the system down, not just make it go unresponsive.

 

Thanks again for your help.

Link to comment

Failing PSUs cause all sorts of weird errors so it can't be ruled out. A cheaper thing to try first would be to run without any VMs for a while. That's what I'd do next - stop VMs and also stop the VM service in Settings -> VM Manager. Run like that for a few days and see how it goes. It helps to break the problem down into more manageable pieces.

 

Link to comment

Welp, I may have made things worse. I'd really like to avoid losing my main VM for weeks to see whether the system crashes again. So, I swapped out the PSU with a new one a buddy had handy. At the same time, I also upgraded to 6.3.0.

 

Since then, the system has yet to crash (though I haven't had enough uptime to say that it's stable), but I'm now getting repeated 'lost rtc interrupts' messages in the syslog, specifically messages like 'kernel: hpet1: lost 522 rtc interrupts'.
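To get a feel for how often these messages appear, the syslog can be scanned for them and the reported counts added up. This is a rough Python sketch, assuming the message always has the form quoted above; the helper name `summarize_lost_rtc` is made up for illustration, and only the count of 522 comes from the actual log.

```python
import re

# Matches messages such as 'kernel: hpet1: lost 522 rtc interrupts'.
LOST_RTC = re.compile(r"kernel: hpet\d+: lost (\d+) rtc interrupts")

def summarize_lost_rtc(lines):
    """Return (number_of_messages, total_lost_interrupts)."""
    counts = []
    for line in lines:
        match = LOST_RTC.search(line)
        if match:
            counts.append(int(match.group(1)))
    return len(counts), sum(counts)

sample = [
    "Tower kernel: hpet1: lost 522 rtc interrupts",  # count quoted in the post
    "Tower kernel: hpet1: lost 10 rtc interrupts",   # hypothetical test data
    "Tower root: mover finished",                    # unrelated line, ignored
]
messages, total = summarize_lost_rtc(sample)
print(f"{messages} messages, {total} interrupts lost in total")
```

Running it against a syslog captured before and after a BIOS or kernel change gives a simple before/after comparison.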

 

Any thoughts? I only found a few mentions of this error on the unraid forums.

Link to comment

Yeah, definitely not the best for troubleshooting. I wanted to upgrade given the security patches it had, not necessarily to fix issues. The PSU I'm hoping may actually help.

 

At any rate, I found a few references from other Linux distros to turning off ACPI in the BIOS to address the hpet issues. Sure enough, with ACPI off, I no longer see those lost-interrupt messages.

 

So, I guess now I'll leave it in troubleshooting mode and wait for another lock up.

 

Thanks very much for your help, John.

Link to comment

Failing PSUs cause all sorts of weird errors so it can't be ruled out. A cheaper thing to try first would be to run without any VMs for a while. That's what I'd do next - stop VMs and also stop the VM service in Settings -> VM Manager. Run like that for a few days and see how it goes. It helps to break the problem down into more manageable pieces.

 

I can't stress this enough due to personal experience ... my firewall (a Supermicro board based on an Intel Atom 2550 SoC) was doing just fine for about 3 years, then it started rebooting every now and then out of the blue ... I never thought about the UPS (which was definitely dying) ... guess it's worth your time to check it :)

Link to comment

I see this

 

Feb  7 16:48:05 Tower kernel: smpboot: Max logical packages: 1

Feb  7 16:48:05 Tower kernel: DMAR: Host address width 39

Feb  7 16:48:05 Tower kernel: DMAR: DRHD base: 0x000000fed90000 flags: 0x0

Feb  7 16:48:05 Tower kernel: DMAR: dmar0: reg_base_addr fed90000 ver 1:0 cap 1c0000c40660462 ecap 7e3ff0505e

Feb  7 16:48:05 Tower kernel: DMAR: DRHD base: 0x000000fed91000 flags: 0x1

Feb  7 16:48:05 Tower kernel: DMAR: dmar1: reg_base_addr fed91000 ver 1:0 cap d2008c40660462 ecap f050da

Feb  7 16:48:05 Tower kernel: DMAR: RMRR base: 0x00000027542000 end: 0x00000027561fff

Feb  7 16:48:05 Tower kernel: DMAR: RMRR base: 0x00000028800000 end: 0x00000038ffffff

Feb  7 16:48:05 Tower kernel: DMAR-IR: IOAPIC id 2 under DRHD base  0xfed91000 IOMMU 1

Feb  7 16:48:05 Tower kernel: DMAR-IR: HPET id 0 under DRHD base 0xfed91000

Feb  7 16:48:05 Tower kernel: DMAR-IR: x2apic is disabled because BIOS sets x2apic opt out bit.

Feb  7 16:48:05 Tower kernel: DMAR-IR: Use 'intremap=no_x2apic_optout' to override the BIOS setting.

Feb  7 16:48:05 Tower kernel: DMAR-IR: Enabled IRQ remapping in xapic mode

Feb  7 16:48:05 Tower kernel: x2apic: IRQ remapping doesn't support X2APIC mode

Feb  7 16:48:05 Tower kernel: mce: [Hardware Error]: Machine check events logged

Feb  7 16:48:05 Tower kernel: ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1

Feb  7 16:48:05 Tower kernel: TSC deadline timer enabled

Feb  7 16:48:05 Tower kernel: smpboot: CPU0: Intel® Core i5-6600K CPU @ 3.50GHz (family: 0x6, model: 0x5e, stepping: 0x3)

 

very early in your boot sequence, when the machine is still running on only one processor core. I don't know what it means but I don't think it's a good sign.
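To check whether the same message shows up in other saved syslogs, the files can be scanned for lines tagged `[Hardware Error]`. This is only a small Python sketch using lines from the excerpt above; the function name `find_hardware_errors` is hypothetical, and matching on that tag alone is an assumption about what is worth flagging.

```python
def find_hardware_errors(lines):
    """Return (line_number, text) for every line tagged '[Hardware Error]'."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if "[Hardware Error]" in line:
            hits.append((number, line.strip()))
    return hits

# Three consecutive entries from the boot log quoted above.
sample = [
    "Feb  7 16:48:05 Tower kernel: x2apic: IRQ remapping doesn't support X2APIC mode",
    "Feb  7 16:48:05 Tower kernel: mce: [Hardware Error]: Machine check events logged",
    "Feb  7 16:48:05 Tower kernel: TSC deadline timer enabled",
]
hits = find_hardware_errors(sample)
for number, text in hits:
    print(f"line {number}: {text}")
```

For a real file, passing `open(path)` instead of the sample list works the same way, since the function only iterates over lines.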

 

Link to comment

I went back and checked your earlier syslog and the same thing is reported there too, but if Intel acknowledges it then it must be safe to ignore. I don't see anything else of note. I think you're going to have to choose between living with the crashes or living without your VM for a while. I'd disable the VM service in unRAID and turn off IOMMU (VT-d) in the BIOS... but if you want an excuse for not doing that just yet, version 6.3.1 is now available, so it has to be worth a try  ;)

 

Link to comment
