Unraid 6.5.2 Locked Up



Had the following problem with Unraid 6.5.2. 

* Unraid not responding to ping

* No access via console

 

Ended up having to hard reset the server. Parity check is running now.

 

I have troubleshooting mode enabled, because the machine locked up yesterday as well, so I have diagnostics and syslog (attached).

 

Any help would be much appreciated

 

Pete

FCPsyslog_tail.txt

moose-diagnostics-20180522-1822.zip

  • 3 weeks later...

OK had another occurrence of the lockup. This time there is an MCE on the console. I've installed mcelog from Nerd Tools to capture future events if they occur.

 

I've attached a photo of what was on the screen and diagnostics/syslog taken by FCP troubleshooting mode.

 

Hope one of you gurus could have a look and see if there are any clues as to what's going on. Some more information which may be useful: my system was completely stable with 6.3.5, but my problems started immediately after I upgraded to 6.4.1. It is happening less often now on 6.5.2, but that may all be a coincidence.

 

Thanks,

Pete

 

 

20180606_190719.jpg

FCPsyslog_tail.txt

moose-diagnostics-20180606-1815.zip


Bump. Wondering if someone wouldn't mind having a quick look and see if there is anything of interest in the above screen grab or the diagnostics.

 

I'm thinking that my next step is to upgrade the BIOS to the latest version. I am about 5 versions behind, and various ones talk about "improving system stability" or "improving memory stability". The other thing is that there is an unused PCI-E TV card in the server. I'll probably remove that if a BIOS upgrade doesn't help.

 

Thoughts?


Thanks very much for the reply Johnnie. I'll do that.

 

I just came across a bunch of posts about RCU_PREEMPT stalls with unraid. Here's a link:

 

It's quite old, but in that instance it was a software problem related to specific hardware configurations. What's interesting is that my problems only came about after I upgraded from 6.3.5 to 6.4.0 (could be a coincidence, of course). The history is that it was happening frequently on 6.4.0 and 6.4.1, but on 6.5.2 it's happening infrequently. Before that the system was completely stable. So I'm wondering if there is a possibility it may be something with more recent versions?

11 minutes ago, PeteB said:

Bump. Wondering if someone wouldn't mind having a quick look and see if there is anything of interest in the above screen grab or the diagnostics.

 

I'm thinking that my next step is to upgrade the BIOS to the latest version. I am about 5 versions behind, and various ones talk about "improving system stability" or "improving memory stability". The other thing is that there is an unused PCI-E TV card in the server. I'll probably remove that if a BIOS upgrade doesn't help.

 

 

Hi Pete,

 

I can't look at attachments at the moment, but focusing on the stall warning in that screenshot, I came across this kernel document detailing possible causes.

 

https://www.kernel.org/doc/Documentation/RCU/stallwarn.txt

 

Now there are plenty of possibilities, and in general your idea to update the BIOS and remove unnecessary hardware is a good one, but in this case I don't think that either of those will help (please do update the BIOS though - that's almost always a good thing to have at the latest version). What might be worth trying (regarding the BIOS) is, after updating, to reset to default parameters and then go through and make sure that only the needed options are altered (saving your current BIOS config first would be a good idea if your BIOS supports that option).

 

The screenshot shows that the stall is on CPU 0, which I understand to be the CPU primarily used by uNRAID. I would suggest that you check any active VMs to make sure that you do not have that core bound to them. Also make sure that your CPU temperatures are OK. Thermal throttling can cause this, though I suspect that's an edge case. It doesn't do any harm to make sure that your cooling solution is working though (i.e. clean your CPU radiator of dust, make sure fans etc. are working, check CPU temps in the web UI).
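If the web UI isn't showing temperatures, a rough way to read them from a shell is sketched below. This is only a minimal sketch: it assumes the kernel exposes the standard Linux `/sys/class/thermal/thermal_zone*` entries (with `temp` reported in millidegrees Celsius), which may not be the case on every board or Unraid build, and the function name `read_thermal_zones` is just an illustrative choice.

```python
from pathlib import Path


def read_thermal_zones(base="/sys/class/thermal"):
    """Return {"zone (type)": degrees_celsius} for each readable thermal zone.

    Assumes the usual sysfs layout: <base>/thermal_zoneN/type and
    <base>/thermal_zoneN/temp, with temp in millidegrees Celsius.
    """
    readings = {}
    for zone in sorted(Path(base).glob("thermal_zone*")):
        try:
            kind = (zone / "type").read_text().strip()
            millideg = int((zone / "temp").read_text().strip())
        except (OSError, ValueError):
            # Zone is missing a file or reports something unparsable; skip it.
            continue
        readings[f"{zone.name} ({kind})"] = millideg / 1000.0
    return readings


if __name__ == "__main__":
    zones = read_thermal_zones()
    if not zones:
        print("No thermal zones found (sensor drivers may not be loaded).")
    for name, temp in zones.items():
        print(f"{name}: {temp:.1f} C")
```

Running it a few times while the server is under load should give a rough idea of whether the CPU is getting anywhere near its throttling temperature.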

 

Last idea is that you check those log files to see if the stall warning is in there (as I mentioned, I am not able to look at them currently). The stallwarn document linked above advises:

 

Quote

To diagnose the cause of the stall, inspect the stack traces.
The offending function will usually be near the top of the stack.

 

...so there may be some additional clues.
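As a starting point for that, here is a small sketch that scans a saved syslog for RCU stall warnings and prints the lines that follow each one, which is where the stack trace would normally be. The regular expression is an assumption based on the message wording described in the stallwarn document (e.g. "rcu_preempt detected stalls on CPUs/tasks" or "self-detected stall on CPU"), and the exact text can differ between kernel versions. The function name and the 15-line context window are arbitrary choices for illustration.

```python
import re
import sys

# Assumed wording of the kernel's stall warning; adjust if your log differs.
STALL_RE = re.compile(r"rcu_\w+ (?:self-)?detected stalls? on", re.IGNORECASE)


def find_rcu_stalls(lines, context=15):
    """Return [(line_number, block)] for each stall warning found.

    block is the warning line itself plus up to `context` following lines,
    which is where the stack trace would be expected.
    """
    lines = [line.rstrip("\n") for line in lines]
    hits = []
    for i, line in enumerate(lines):
        if STALL_RE.search(line):
            hits.append((i + 1, lines[i:i + 1 + context]))
    return hits


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: python find_stalls.py <syslog file>")
    with open(sys.argv[1], errors="replace") as fh:
        for lineno, block in find_rcu_stalls(fh):
            print(f"--- stall warning at line {lineno} ---")
            print("\n".join(block))
```

It could be pointed at the FCPsyslog_tail.txt captured by troubleshooting mode, or at the syslog inside the diagnostics zip. If nothing is printed, the warning probably never made it to disk before the lockup.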

 

Best of luck.


Thanks very much for the reply, Dev Null. Interestingly, most of the causes seem to be software related. Might be wishful thinking on my part (avoiding the cost of an upgrade would be nice!).

 

I appreciate the tips. I'm able to rule out the VM tip though, as I don't run any VMs on my system.

 

I'm thinking that I'll do the following steps one at a time, so that if the problem goes away I can report back what changed:

1. Set the BIOS to be as vanilla as possible with regards to configuration. (May be very little to change here as I don't recall changing the config much at all).

2. Do a BIOS upgrade

3. Remove the PCI-E TV Tuner card.

 

The frequency of my lockups is down to about one per fortnight so it may be a while before I can say the system is stable (if at all).


When I checked the BIOS settings there were almost none that were not default, so I have made the following changes:

1. Upgraded the BIOS to the latest version

2. Set the FLASH drive to UEFI Boot

3. Left "Intel Virtualization Technology" at its default of Disabled. Previously I had it set to Enabled, but I don't run any VMs.

4. Set the BIOS Boot menu to boot UEFI rather than Legacy BOOT.

 

 

  • 4 weeks later...
  • 2 months later...

I had a run of 79 days without a lockup and then had the following MCE message appear on the screen. My reading of the MCE website says this is an "internal parity error" (?). Interestingly, upon reboot (after the BIOS splash screen) there were no messages at all, i.e. the screen was completely blank (no Unraid menu at all).

 

I've run memtest overnight (4 passes) without issue. Hoping for some pointers as to next plan of investigation.

 

 

20180912_175943.jpg

