PeteB Posted May 22, 2018 (edited)

Had the following problem with Unraid 6.5.2:
* Unraid not responding to ping
* No access via console

Ended up having to hard reset the server. Parity check is running now. I have troubleshooting mode enabled, because the machine locked up yesterday as well, so I have diagnostics and syslog (attached). Any help would be much appreciated.

Pete

FCPsyslog_tail.txt moose-diagnostics-20180522-1822.zip
PeteB Posted June 7, 2018 (edited)

OK, had another occurrence of the lockup. This time there is an MCE on the console. I've installed mcelog from Nerd Tools to capture future events if they occur. I've attached a photo of what was on the screen and the diagnostics/syslog taken by FCP troubleshooting mode. Hope one of you gurus could have a look and see if there are any clues as to what's going on.

Some more information which may be useful: my system was completely stable with 6.3.5, but my problems started immediately after I upgraded to 6.4.1. It is happening less often now on 6.5.2, but that may all be a coincidence.

Thanks,
Pete

FCPsyslog_tail.txt moose-diagnostics-20180606-1815.zip
PeteB Posted June 8, 2018

Bump. Wondering if someone wouldn't mind having a quick look to see if there is anything of interest in the above screen grab or the diagnostics. I'm thinking that my next step is to upgrade the BIOS to the latest version. I am about 5 versions behind, and various ones talk about "improving system stability" or "improving memory stability". The other thing is that there is an unused PCI-E TV card in the server. I'll probably remove that if a BIOS upgrade doesn't help. Thoughts?
JorgeB Posted June 8, 2018

It's a hardware error, so try updating the BIOS (though I doubt it helps) and running memtest. If nothing is detected, it may be a failing board, PSU, etc.
PeteB Posted June 8, 2018

Thanks very much for the reply Johnnie. I'll do that. I just came across a bunch of posts about RCU_PREEMPT stalls with Unraid. Here's a link: It's quite old, but in that instance it was a software problem related to specific hardware configurations.

What's interesting is that my problems only came about after I upgraded from 6.3.5 to 6.4.0 (could be a coincidence of course). The history is that it was happening frequently on 6.4.0 and 6.4.1, but on 6.5.2 it's happening infrequently. Before that the system was completely stable. So I'm wondering if there is a possibility it may be something with more recent versions?
Dev Null Posted June 8, 2018

11 minutes ago, PeteB said: Bump. Wondering if someone wouldn't mind having a quick look and see if there is anything of interest in the above screen grab or the diagnostics. I'm thinking that my next step is to upgrade the BIOS to the latest version. I am about 5 versions behind, and various ones talk about "improving system stability" or "improving memory stability". The other thing is that there is an unused PCI-E TV card in the server. I'll probably remove that if a BIOS upgrade doesn't help.

Hi Pete, I can't look at attachments at the moment, but focusing on the stall warning in that screenshot, I came across this kernel document detailing possible causes: https://www.kernel.org/doc/Documentation/RCU/stallwarn.txt

There are plenty of possibilities, and in general your idea to update the BIOS and remove unnecessary hardware is a good one, but in this case I don't think that either of those will help (please do update the BIOS though; it's almost always a good thing to have at the latest version). What might be worth trying after updating is to reset the BIOS to its default parameters, and then go through and alter only the options you actually need (saving your current BIOS config first would be a good idea if your BIOS supports that option).

The screenshot shows that the stall is on CPU 0, which I understand to be the CPU primarily used by uNRAID. I would suggest that you check any active VMs to make sure that you do not have that core bound to them.

Also make sure that your CPU temperatures are OK. Thermal throttling can cause this, though I suspect that's an edge case. It doesn't do any harm to make sure that your cooling solution is working though (i.e. clean your CPU radiator of dust, make sure fans etc. are working, check CPU temps in the web UI).

Last idea is that you check those log files to see if the stall warning is in there (as I mentioned, I am not able to look at them currently). The stallwarn document linked above advises: "To diagnose the cause of the stall, inspect the stack traces. The offending function will usually be near the top of the stack." ...so there may be some additional clues. Best of luck.
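For anyone wanting to do the log check suggested above without reading the whole file by eye, here is a rough Python sketch. It assumes the syslog is a plain text file (for example the FCPsyslog_tail.txt attachment) and that stall reports look like the examples in stallwarn.txt, e.g. "INFO: rcu_preempt detected stalls on CPUs/tasks:" or "INFO: rcu_sched self-detected stall on CPU". The function names and the 15-line context window are made up for illustration; adjust the pattern if your kernel words the message differently.

```python
import re

# Assumed wording of kernel RCU stall reports (see stallwarn.txt), e.g.
#   "INFO: rcu_preempt detected stalls on CPUs/tasks:"
#   "INFO: rcu_sched self-detected stall on CPU"
STALL_RE = re.compile(r"rcu_\w+ (?:self-)?detected stalls? on")


def find_stalls(lines, context=15):
    """Return (line_number, block) for each stall report found.

    block is the matching line plus up to `context` following lines,
    which is where the stack trace normally appears.
    """
    lines = [line.rstrip("\n") for line in lines]
    hits = []
    for i, line in enumerate(lines):
        if STALL_RE.search(line):
            hits.append((i + 1, lines[i:i + 1 + context]))
    return hits


def scan_file(path, context=15):
    """Hypothetical helper: print every stall report found in a syslog file."""
    with open(path, errors="replace") as fh:
        for lineno, block in find_stalls(fh, context):
            print(f"--- stall report at line {lineno} ---")
            print("\n".join(block))
```

Calling `scan_file("FCPsyslog_tail.txt")` prints each stall report with the lines that follow it, so the function near the top of the stack trace is easy to spot. If nothing is printed, the stall was probably never written to disk before the lockup.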
PeteB Posted June 8, 2018 (edited)

Thanks very much for the reply Dev Null. Interestingly, most of the causes seem to be software related. Might be wishful thinking on my part (avoiding the cost of an upgrade would be nice!). I appreciate the tips. I'm able to exclude the VM tip though, as I don't run any VMs on my system.

I'm thinking that I'll do the following steps one at a time, so that if the problem goes away I can feed back what changed:
1. Set the BIOS to be as vanilla as possible with regards to configuration. (May be very little to change here as I don't recall changing the config much at all.)
2. Do a BIOS upgrade.
3. Remove the PCI-E TV tuner card.

The frequency of my lockups is down to about one per fortnight, so it may be a while before I can say the system is stable (if at all).
PeteB Posted June 8, 2018 (edited)

When I checked the BIOS settings there were almost none that were not default, so I have made the following changes:
1. Upgraded the BIOS to the latest version.
2. Set the flash drive to UEFI boot.
3. Left "Intel Virtualization Technology" at its default of Disabled. Previously I had it set to Enabled, but I don't run any VMs.
4. Set the BIOS boot menu to boot UEFI rather than Legacy.
PeteB Posted July 1, 2018

Not sure if it's an improvement or another symptom of the same problem. Once again Unraid crashed. Here is a photo of the console. I'd appreciate it if someone can review and advise.
PeteB Posted September 19, 2018

I had a run of 79 days without a lockup and then had the following MCE message appear on the screen. My reading of the MCE website says this is an "internal parity error" (?). Interestingly, upon reboot (after the BIOS splash screen) there were no messages at all, i.e. the screen was completely blank (no Unraid menu at all). I've run memtest overnight (4 passes) without issue. Hoping for some pointers as to the next step of investigation.
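As a rough illustration of how an "internal parity error" reading can be cross-checked by hand, the sketch below decodes the STATUS value that an MCE report prints. It assumes the bit layout of IA32_MCi_STATUS as described in the machine-check chapter of Intel's Software Developer's Manual (flag bits 63 to 57, MCA error code in the low 16 bits) and only covers the simple error codes. The status value in the usage note is hypothetical and is not taken from the screen photo in this thread; `decode_status` is just an illustrative name.

```python
# Flag bits of IA32_MCi_STATUS (layout assumed from the Intel SDM).
FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register holds extra information
    58: "ADDRV",  # ADDR register holds an address
    57: "PCC",    # processor context may be corrupt
}

# Simple (non-compound) MCA error codes in bits 15:0.
SIMPLE_CODES = {
    0x0000: "No error",
    0x0001: "Unclassified",
    0x0002: "Microcode ROM parity error",
    0x0003: "External error",
    0x0004: "FRC error",
    0x0005: "Internal parity error",
}


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into flag names and MCA error code."""
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    code = status & 0xFFFF
    meaning = SIMPLE_CODES.get(code, "compound or model-specific code (not decoded here)")
    return {"flags": flags, "mca_code": code, "meaning": meaning}
```

For example, `decode_status(0xB200000000000005)` (a made-up value) reports the flags VAL, UC, EN and PCC with the meaning "Internal parity error". An uncorrected error with PCC set originates inside the processor package, which is consistent with memtest passing cleanly, though it does not by itself rule out the board or PSU.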