bumblebee21 Posted January 30, 2017

Running 6.2.4, with one Windows 10 VM using VGA passthrough. Full list of plugins, Dockers, and hardware at the bottom of this post.

I've had a couple of instances lately where I've found my server randomly non-responsive. The machine itself is on, but I cannot telnet in and my VM is down (nothing on the screen). The only way to recover is a hard reboot, after which everything seems to run fine. Since I can't access the server once it becomes non-responsive, I started outputting the log to another PC on the network via tail. Unfortunately, there doesn't seem to be much useful info in it (see below). It looks like mover ran, then an hour later the server went offline. I don't see any other tasks scheduled for around that time on either the server or the VM. I'm sort of at a loss. Any ideas?

Jan 29 03:40:34 Tower root: mover finished
Jan 29 04:30:01 Tower root: Fix Common Problems Version 2017.01.24
Jan 29 04:30:07 Tower root: Fix Common Problems: Error: unclean shutdown detected of your server
Jan 29 04:30:07 Tower sSMTP[23043]: Creating SSL connection to host
Jan 29 04:30:07 Tower sSMTP[23043]: SSL connection using ECDHE-RSA-AES128-GCM-SHA256
Jan 29 04:30:09 Tower sSMTP[23043]: Sent mail for [REDACTED] (221 2.0.0 closing connection w41sm8871776qtw.34 - gsmtp) uid=0 username=root outbytes=789
Jan 29 04:40:01 Tower rsyslogd: [origin software="rsyslogd" swVersion="8.16.0" x-pid="1479" x-info="http://www.rsyslog.com"] rsyslogd was HUPed

PLUGINS
CA Auto Update Applications
CA Backup / Restore Appdata
CA Cleanup Appdata
Community Applications
Dynamix Cache Directories
Dynamix File Integrity
Dynamix SSD TRIM
Dynamix System Buttons
Dynamix System Information
Dynamix System Statistics
Dynamix System Temperature
Dynamix webGui
Fix Common Problems
Nerd Tools
Unassigned Devices

DOCKERS
cadvisor
CouchPotato
Netdata
PlexMediaServer
sickrage
transmission

HARDWARE
ASRock Z170 Extreme4
Intel i5-6600K (running stock at the moment)
32GB G.Skill DDR4-2133
1x 2TB WD Red parity drive
4x 1TB WD Red data drives
1x 256GB Samsung 850 EVO cache drive
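Tailing the log to another machine is the right instinct. When reading a saved syslog after the fact, one quick way to bracket the moment of a freeze is to scan for unusually long pauses between consecutive timestamps. A minimal sketch, assuming the standard syslog timestamp format; the 30-minute threshold and the sample lines are illustrations only, not part of unRAID:

```python
import re
from datetime import datetime, timedelta

def find_gaps(lines, threshold=timedelta(minutes=30), year=2017):
    """Return (last_seen, next_seen) pairs wherever the log goes silent
    for longer than `threshold` -- a long gap usually brackets the
    moment the machine froze."""
    ts_re = re.compile(r"^(\w{3}) +(\d+) (\d\d:\d\d:\d\d)")
    gaps, prev = [], None
    for line in lines:
        m = ts_re.match(line)
        if not m:
            continue  # skip lines without a leading timestamp
        stamp = datetime.strptime(
            f"{year} {m.group(1)} {m.group(2)} {m.group(3)}",
            "%Y %b %d %H:%M:%S",
        )
        if prev is not None and stamp - prev > threshold:
            gaps.append((prev, stamp))
        prev = stamp
    return gaps

# Sample lines from the excerpt above
log = [
    "Jan 29 03:40:34 Tower root: mover finished",
    "Jan 29 04:30:01 Tower root: Fix Common Problems Version 2017.01.24",
    "Jan 29 04:40:01 Tower rsyslogd: rsyslogd was HUPed",
]
for start, end in find_gaps(log):
    print(f"log silent from {start} to {end}")
```

Run against the excerpt above, it flags the ~50-minute silence between mover finishing and the next entry, which is roughly where the freeze must have happened. Note the year has to be supplied, since classic syslog timestamps omit it.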
John_M Posted January 30, 2017

Since you have the Fix Common Problems plugin installed, you ought to reboot and run it in troubleshooting mode until your server becomes unresponsive again.
bumblebee21 Posted February 1, 2017

Yeah, I should have thought to do that earlier. At any rate, I enabled troubleshooting mode shortly after your reply and just got a crash. Attached is the syslog; diagnostics will be in the next post. I took a quick glance and didn't see anything, but hopefully you folks might notice something.

syslog.zip
bumblebee21 Posted February 1, 2017

Diagnostics attached.

tower-diagnostics-20170201-1520.zip
bumblebee21 Posted February 2, 2017

Anyone have thoughts?
John_M Posted February 3, 2017

Can you confirm that the syslog.zip file (114.83 kB) you attached to Reply #2 is from your boot flash, as mentioned here:

Feb 1 15:20:37 Tower root: Fix Common Problems: Capturing diagnostics. When uploading diagnostics to the forum, also upload /config/logs/syslog.txt on the flash drive
bumblebee21 Posted February 3, 2017

> Can you confirm that the syslog.zip file (114.83 kB) you attached to Reply #2 is from your boot flash, as mentioned here:

John, thanks for the reply. The syslog I posted was in the /logs/ directory of the boot flash, along with all the diagnostics. I looked in the /config/ directory but did not see a logs directory. Screen cap below.
Squid Posted February 3, 2017

> John, thanks for the reply. The syslog I posted was in the /logs/ directory of the boot flash, along with all the diagnostics. I looked in the /config/ directory but did not see a logs directory.

You posted correctly. That message in the log is a typo of mine. :'(
John_M Posted February 3, 2017

Thanks Squid. That explains it. I'll spend a bit more time looking at the log now.
John_M Posted February 3, 2017

I haven't been able to find anything that might indicate what the problem is. Have you tested the memory recently?
bumblebee21 Posted February 4, 2017

> I haven't been able to find anything that might indicate what the problem is. Have you tested the memory recently?

Thanks for looking. I haven't run a memtest in a while, but will plan to do that tonight.
John_M Posted February 4, 2017

When it becomes unresponsive does it stay that way or does it sometimes recover?
bumblebee21 Posted February 4, 2017

> When it becomes unresponsive does it stay that way or does it sometimes recover?

It never recovers. It's sat unresponsive for at least 10 or 12 hours without coming back.
bumblebee21 Posted February 4, 2017

First DIMM/slot ran for 13 hours: 6 passes, 0 errors. Running the other DIMM/slot now.
John_M Posted February 4, 2017

I realised late yesterday that the title of this thread is Random Shutdowns, which is actually rather different from the description in your OP of your server being non-responsive. I understand a shutdown to be power off, while frozen and non-responsive mean power on. Is it actually shutting down, or freezing, or becoming sluggish, or something else?
bumblebee21 Posted February 4, 2017

> Is it actually shutting down or freezing or becoming sluggish or something else?

Yeah, shutdown was a bad title. What happens is this: the VM goes dark (nothing on the screen, no response from mouse/keyboard), the web-based GUI does not load (unreachable), telnet loses its connection and cannot reconnect, and the onscreen output from unRAID through the iGPU is still there but no longer updates. Meanwhile the rig itself is still on: fans spinning, lights on, etc.

Any other ideas on hardware that might cause this? The only component that isn't relatively new is my PSU, which is a solid, reputable unit but pushing 5-6 years old. At the same time, I would think a failing PSU would shut the system down completely, not just make it go unresponsive. Thanks again for your help.
John_M Posted February 4, 2017

Failing PSUs cause all sorts of weird errors, so it can't be ruled out. A cheaper thing to try first would be to run without any VMs for a while. That's what I'd do next: stop the VMs and also stop the VM service in Settings -> VM Manager. Run like that for a few days and see how it goes. It helps to break the problem down into more manageable pieces.
bumblebee21 Posted February 7, 2017

Welp, I may have made things worse. I'd really like to avoid losing my main VM for weeks just to see whether the system crashes again, so I swapped out the PSU for a new one a buddy had handy. At the same time, I also upgraded to 6.3.0. Since then, the system has yet to crash (though I haven't had enough uptime to call it stable), but I'm now getting repeated "lost rtc interrupts" messages in the syslog, specifically messages like:

kernel: hpet1: lost 522 rtc interrupts

Any thoughts? I only found a few mentions of this error on the unRAID forums.
John_M Posted February 7, 2017

Changing so many things at one time makes it difficult to follow what's going on and gives the impression of panic rather than logical thought. RTC = Real Time Clock. It might be worth changing the battery. Or maybe it's a BIOS bug. Diagnostics might reveal something.
bumblebee21 Posted February 7, 2017

Yeah, definitely not the best for troubleshooting. I wanted to upgrade for the security patches, not necessarily to fix issues; the PSU swap is the thing I'm hoping may actually help. At any rate, I found a few references in other Linux distros to turning off ACPI in the BIOS to address the hpet issues. Sure enough, with ACPI off, I no longer see those interrupt messages. So I guess now I'll leave it in troubleshooting mode and wait for another lock-up. Thanks very much for your help, John.
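If turning off ACPI in the BIOS ever causes side effects, the same hpet complaints are sometimes worked around from the kernel side instead, by forcing a different clock source on the boot line. On unRAID that would mean editing the append line in /boot/syslinux/syslinux.cfg, roughly like the sketch below. This is an assumption, not a tested fix for this machine: clocksource=tsc is a generic Linux kernel parameter, and the surrounding label/kernel lines may differ from your actual file.

```
label unRAID OS
  menu default
  kernel /bzimage
  append clocksource=tsc initrd=/bzroot
```

Either approach (BIOS setting or boot parameter) should silence the hpet messages; the boot parameter has the advantage of being trivially reversible from the flash drive.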
bumblebee21 Posted February 8, 2017

The saga continues. Less than 24 hours after booting up the rig with the new PSU (and unRAID 6.3.0), I got another lock-up. Syslog and diagnostics attached. Again, I don't see anything in them that presages the failure or lock-up. I'd really like to avoid going without my primary VM, but that may be my only option at this point.

FCPsyslog_tail.zip
tower-diagnostics-20170208-0213.zip
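One thing that might make the next crash more informative: a tail file only captures what made it to disk, and the final seconds before a hard freeze often don't. Since the server is already running rsyslogd (version 8.16.0 appears in the first log excerpt), it can forward every message to another machine in real time, so the last lines survive the freeze on a different box. The classic forwarding rule is a one-liner; the target IP here is a placeholder, and the exact location to add it on unRAID (e.g. whether /etc/rsyslog.conf persists across reboots) is an assumption to verify:

```
# forward all messages over UDP to a listening PC (single @ = UDP, @@ = TCP)
*.* @192.168.1.50:514
```

The receiving PC then just needs a syslog daemon configured to listen on UDP 514.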
jbrodriguez Posted February 8, 2017

> Failing PSUs cause all sorts of weird errors so it can't be ruled out.

I can't stress this enough, from personal experience: my firewall (a Supermicro board based on an Intel Atom 2550 SoC) was doing just fine for about 3 years, then it started rebooting every now and then out of the blue. I never thought about the UPS, which was definitely dying. It's worth your time to check yours.
John_M Posted February 8, 2017

I see this

Feb 7 16:48:05 Tower kernel: smpboot: Max logical packages: 1
Feb 7 16:48:05 Tower kernel: DMAR: Host address width 39
Feb 7 16:48:05 Tower kernel: DMAR: DRHD base: 0x000000fed90000 flags: 0x0
Feb 7 16:48:05 Tower kernel: DMAR: dmar0: reg_base_addr fed90000 ver 1:0 cap 1c0000c40660462 ecap 7e3ff0505e
Feb 7 16:48:05 Tower kernel: DMAR: DRHD base: 0x000000fed91000 flags: 0x1
Feb 7 16:48:05 Tower kernel: DMAR: dmar1: reg_base_addr fed91000 ver 1:0 cap d2008c40660462 ecap f050da
Feb 7 16:48:05 Tower kernel: DMAR: RMRR base: 0x00000027542000 end: 0x00000027561fff
Feb 7 16:48:05 Tower kernel: DMAR: RMRR base: 0x00000028800000 end: 0x00000038ffffff
Feb 7 16:48:05 Tower kernel: DMAR-IR: IOAPIC id 2 under DRHD base 0xfed91000 IOMMU 1
Feb 7 16:48:05 Tower kernel: DMAR-IR: HPET id 0 under DRHD base 0xfed91000
Feb 7 16:48:05 Tower kernel: DMAR-IR: x2apic is disabled because BIOS sets x2apic opt out bit.
Feb 7 16:48:05 Tower kernel: DMAR-IR: Use 'intremap=no_x2apic_optout' to override the BIOS setting.
Feb 7 16:48:05 Tower kernel: DMAR-IR: Enabled IRQ remapping in xapic mode
Feb 7 16:48:05 Tower kernel: x2apic: IRQ remapping doesn't support X2APIC mode
Feb 7 16:48:05 Tower kernel: mce: [Hardware Error]: Machine check events logged
Feb 7 16:48:05 Tower kernel: ..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
Feb 7 16:48:05 Tower kernel: TSC deadline timer enabled
Feb 7 16:48:05 Tower kernel: smpboot: CPU0: Intel® Core i5-6600K CPU @ 3.50GHz (family: 0x6, model: 0x5e, stepping: 0x3)

very early in your boot sequence, when the machine is still running on only one processor core. I don't know what it means, but I don't think it's a good sign.
bumblebee21 Posted February 8, 2017

Good catch. I installed mcelog to check it out. The logs reported it to be an "internal parity error." From googling around, it looks like this is actually a benign error (a false positive); Intel has released an erratum saying that these errors may be falsely reported but can be safely ignored.
John_M Posted February 9, 2017

I went back and checked your earlier syslog and the same error is reported there too, but if Intel acknowledges it then it must be safe to ignore. I don't see anything else of note. I think you're going to have to choose between living with the crashes or living without your VM for a while. I'd disable the VM service in unRAID and turn off IOMMU (VT-d) in the BIOS... but if you want an excuse for not doing that just yet, version 6.3.1 is now available, so it has to be worth a try.