snowmirage Posted November 19, 2018

The server has been up and running for several months, but in the last week or so I started seeing issues I've never come across before. I'm running Unraid 6.6.5.

I struggled to even see logs of the crashes until recently, when I read a post mentioning that you can turn on troubleshooting mode in the Fix Common Problems plugin, which then writes the logs to the flash drive. After enabling that feature last night, I let the server run all night without even starting the array. The only task I started was the preclear script on 4 new 6TB drives I was getting ready to install.

Here are some screenshots of the crashes. Prior to last night, I'd find the host unresponsive: no keyboard input worked and I couldn't access the web GUI, forcing me to do a hard reset.

This morning I again found a similar error, but this time the system was responsive to both local keyboard input and the web GUI. I was able to scp down the syslog file stored on the flash drive by the Fix Common Problems plugin: FCPsyslog_tail.txt

I tried to grab the full diagnostic file, but after 40-odd minutes the page still hasn't generated it. When I scp'd the syslog file down I saw a number of diagnostics .zip files in the same directory but nothing for today, so it appears it's failing to generate the file, not just failing to let me download it via the GUI.

The other oddity I noticed is that Unassigned Devices fails to finish loading its data on the Main page.

If anyone has some advice it would be greatly appreciated. I'm going to try a clean reboot and grab the diagnostic file again shortly.
snowmirage Posted November 19, 2018

After a reboot the full diagnostic file generated just as it should: phoenix-diagnostics-20181119-1130.zip

Not sure what else to check yet, so I'm going to try a memory test.
John_M Posted November 19, 2018

snowmirage said: "Not sure what else to check yet so I'm going to try a memory test."

That's a useful thing to do. Let it run for 24 hours or more.

You have a complex syslinux config with ACS override and unsafe interrupts enabled, both of which are best avoided unless absolutely necessary. Have you tried booting into a simpler mode for a while? You could try GUI mode (if you haven't edited it to be the same) or Safe mode. The aim would be to achieve stability of the basic NAS functions before enabling the more complex features, such as VMs.

Your OP suggests that all was well when you ran an earlier version of Unraid. Have you checked for a newer BIOS, as that might provide better compatibility with the newer kernel?
snowmirage Posted November 19, 2018

That's a great idea, thank you! (Avoiding the ACS override and unsafe interrupts configs.) I'll give that a shot after the memory test. I'm attempting to pass through GPUs and a USB card to VMs, and that's why those changes were added.

Regarding BIOS updates, I'm a bit out of luck unfortunately. It's an old (though amazing) motherboard, the EVGA SR-2. These issues are happening at stock speeds and settings in the BIOS.

It's been years since I've had to run memtest. When I started memtest via the Unraid boot menu (where you can select GUI mode, Safe mode, etc.) I noticed a brief flash with an option to force-enable SMP mode. I imagine that if I have memory issues the test should still report errors even if I did not enable SMP mode, correct? I tried searching for an answer, and most of what I could find just discussed the differences between "memtest" and "memtest86+" and that SMP mode for the latter had some bugs years ago.
John_M Posted November 19, 2018

You might find the downloadable (but still free!) version 7.5 of MemTest86 more useful than the built-in one, assuming your motherboard supports UEFI booting. Use it to make a separate bootable USB stick. https://www.memtest86.com/download.htm
snowmirage Posted November 21, 2018

I was able to run the memtest for over 24 hours. It completed 4 passes with no errors.

After rebooting, under Settings > VM Manager I changed "VFIO allow unsafe interrupts" from "Yes" to "No" and changed "PCIe ACS override" from "Downstream" to "Disabled". I rebooted again, then let the server run with the array started since this afternoon. Coming home at about 11pm, I found the server had rebooted at some point about 4 hours earlier.

In the course of all those reboots I think troubleshooting mode in Fix Common Problems may have been turned off, as the last logs I see in the previous file I pulled from the thumb drive are from the time frame of my first post (1-2 days ago).

I've attached the latest diagnostic file. I'm going to re-enable troubleshooting mode in Fix Common Problems and hopefully catch something interesting in those logs.

Any other ideas / advice on what might be going on here?

phoenix-diagnostics-20181120-2313.zip
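One way to confirm after a reboot that the two VM Manager settings really are off is to look at the kernel command line. The sketch below is only an illustration: it assumes the settings correspond to the boot parameters `pcie_acs_override` and `vfio_iommu_type1.allow_unsafe_interrupts` (nobody in this thread posted the actual syslinux append line), and the function name `risky_flags` is made up for the example.

```python
# Boot parameters assumed to correspond to "PCIe ACS override" and
# "VFIO allow unsafe interrupts". This mapping is an assumption; the
# thread never shows the actual syslinux append line.
RISKY_FLAGS = (
    "pcie_acs_override",
    "vfio_iommu_type1.allow_unsafe_interrupts",
)


def risky_flags(cmdline: str) -> dict:
    """Return the watched flags present in a kernel command line, with their values."""
    found = {}
    for token in cmdline.split():
        # Split "key=value"; a bare flag with no "=" gets an empty value.
        key, _, value = token.partition("=")
        if key in RISKY_FLAGS:
            found[key] = value
    return found
```

On the server, the string to check could be read from /proc/cmdline. An empty result would suggest that neither parameter was passed at boot.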
snowmirage Posted November 21, 2018

Crashed again last night around 2am. In the syslog in this diagnostic file I see the system booting back up at about 2:11am on the 21st: phoenix-diagnostics-20181121-1034.zip

And here I seem to see about the same thing: FCPsyslog_tail.txt

I'm not sure what else to try here, other than pointing a camera at the screen for 8 hours and hoping I catch it crashing.
John_M Posted November 21, 2018

snowmirage said: "Any other ideas / advice on what might be going on here? phoenix-diagnostics-20181120-2313.zip"

Hardware errors during CPU start-up?

    Nov 20 18:37:11 phoenix kernel: smp: Bringing up secondary CPUs ...
    Nov 20 18:37:11 phoenix kernel: x86: Booting SMP configuration:
    Nov 20 18:37:11 phoenix kernel: .... node #0, CPUs: #1 #2
    Nov 20 18:37:11 phoenix kernel: mce: [Hardware Error]: Machine check events logged
    Nov 20 18:37:11 phoenix kernel: #3
    Nov 20 18:37:11 phoenix kernel: mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 5: be00000000800400
    Nov 20 18:37:11 phoenix kernel: mce: [Hardware Error]: TSC 0 ADDR 3fff8162a61c MISC 7fff
    Nov 20 18:37:11 phoenix kernel: mce: [Hardware Error]: PROCESSOR 0:206c2 TIME 1542756960 SOCKET 0 APIC 4 microcode 1f
    Nov 20 18:37:11 phoenix kernel: #4 #5
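For anyone reading a machine check line like the one quoted above, here is a rough sketch of how the 64-bit status word could be split into its architectural fields. The flag bit positions follow the IA32_MCi_STATUS layout in the Intel Software Developer's Manual; the function name `decode_mce` and the regular expression are made up for this example, and the sketch does not try to interpret the model-specific part of the code. Note that "CPU 2" in the log is the kernel's logical CPU number, and the same record reports SOCKET 0, so it may not correspond to the second physical processor on a dual-socket board.

```python
import re

# Architectural flag bits of IA32_MCi_STATUS, as named in the Intel SDM.
STATUS_FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context may be corrupted
}

# Matches the kernel's "CPU n: Machine Check: x Bank b: <16 hex digits>" line.
MCE_LINE = re.compile(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-fA-F]{16})")


def decode_mce(line: str):
    """Split one mce log line into CPU, bank, set flags and error codes (None if no match)."""
    m = MCE_LINE.search(line)
    if m is None:
        return None
    status = int(m.group(3), 16)
    return {
        "cpu": int(m.group(1)),    # logical CPU number as printed by the kernel
        "bank": int(m.group(2)),
        "flags": [name for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
                  if (status >> bit) & 1],
        "mca_code": status & 0xFFFF,             # bits 15:0, architectural error code
        "model_code": (status >> 16) & 0xFFFF,   # bits 31:16, model-specific code
    }
```

Fed the "Bank 5" line above, this reports the UC and PCC flags as set, meaning the hardware logged an uncorrected error with possibly corrupted processor context. That is consistent with treating it as a hardware fault rather than a software one.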
snowmirage Posted November 21, 2018

O... wow... I did not see that. Maybe it's really a bad CPU. They are quite old. I'll remove CPU 2 and see if it's stable on just CPU 1.

Thank you
snowmirage Posted November 27, 2018

I tried using the jumpers on my EVGA SR-2 motherboard to disable each CPU, but when either is disabled Unraid won't boot.

Before going through the headache of troubleshooting further, I ordered another X5680 for about $47. When that gets here I'll try swapping the 2nd CPU out and see if it still crashes, then swap the other CPU. It's hard to know whether the replacement CPU is good or not, as the ones I have seemed to be good for months.

Replacing the motherboard is going to run between $500 and $1500, which is crazy to spend on such old hardware. Hopefully a CPU swap fixes whatever my issue is.
John_M Posted November 27, 2018

Reseating the CPUs in their sockets, replacing the thermal compound and cleaning the dust out of the heatsinks - all part of the swap - are good things. I'd reseat the CPU power cables too.
snowmirage Posted December 17, 2018

Reporting back: thankfully the problem appears to be solved so far with a CPU swap.

One of the errors I saw (posted above) seemed to indicate an issue with CPU #2. It's possible all it needed was to be reseated, but this being a fully custom water-cooled system I didn't want to have to open everything up twice, so for $47 with express shipping a new X5680 went in.

That stayed up and stable for 5 days. After that I reconnected two GTX 980s I had unplugged while troubleshooting, and it's now been up for another 5 days. No indications at all of the previous problems so far!

My next step is to re-enable VFIO allow unsafe interrupts and PCIe ACS override, then make sure it's still stable.

But so far things are looking great. Thanks for the assistance!