Resource temporarily unavailable - locking up system

1activegeek · July 14, 2018

This is now the second time in a few days that this has happened, and it forces me to restart the whole array because I can't do much troubleshooting: almost every command I try to run kicks out the following line:

/bin/sh: fork: retry: Resource temporarily unavailable

I start getting a steady stream of emails from scheduled scripts that are supposed to run, and those are likely piling up as well. I honestly have no idea where to start identifying what is happening here. The cache does not appear to be completely full, all my disks are healthy, and the Docker containers "were" running with no issues until this started to build up and resources ran out. This time I started to get messages about Docker image utilization; last time I did not, I just woke up to a fairly unresponsive system. Almost every command I attempt results in that error being spit out.
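For context, that shell message means fork() returned EAGAIN. Per the fork(2) man page, that happens when a process/thread limit is hit: the per-user RLIMIT_NPROC soft limit, the system-wide /proc/sys/kernel/threads-max, /proc/sys/kernel/pid_max, or a cgroup pids.max. Below is a minimal sketch, assuming Python 3 is available on the host, that compares the current process and thread counts against the first three of those limits (it does not check cgroup limits). The script and its function names are illustrative, not an Unraid tool.

```python
import os
import resource


def read_int(path):
    """Return the integer stored in a /proc file, or None if it can't be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def count_tasks(proc_root="/proc"):
    """Count processes (numeric /proc entries) and threads (entries in each task/ dir)."""
    processes = threads = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        processes += 1
        try:
            threads += len(os.listdir(os.path.join(proc_root, entry, "task")))
        except OSError:
            pass  # the process exited while we were scanning
    return processes, threads


def usage_percent(used, limit):
    """Percentage of a limit in use, or None if the limit is unknown or not positive."""
    if limit is None or limit <= 0:
        return None
    return round(100.0 * used / limit, 1)


if __name__ == "__main__":
    procs, threads = count_tasks()
    # Every thread consumes a PID, so compare the thread count to both kernel limits.
    threads_max = read_int("/proc/sys/kernel/threads-max")
    pid_max = read_int("/proc/sys/kernel/pid_max")
    soft_nproc, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    nproc = "unlimited" if soft_nproc == resource.RLIM_INFINITY else soft_nproc
    print(f"processes: {procs}  threads: {threads}")
    print(f"threads-max: {threads_max} ({usage_percent(threads, threads_max)}% used)")
    print(f"pid_max: {pid_max} ({usage_percent(threads, pid_max)}% used)")
    # RLIMIT_NPROC applies per real user ID, not system-wide.
    print(f"RLIMIT_NPROC soft limit for this user: {nproc}")
```

Running it while the box is healthy gives a baseline. Once fork() is already failing, even starting the interpreter may fail, so it is more useful scheduled periodically with its output appended to a log.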

I'm hoping someone can point me to logs I may have, or anything else that can help me pin down what's happening if/when this strikes again. Last time it happened around the time of the nightly backup and mover jobs. This time it started roughly 6 hours after those had already completed.

Squid · July 14, 2018

16 minutes ago, 1activegeek said:

I'm hoping someone can point me to logs

Just post the diagnostics next time it happens.

1activegeek · July 14, 2018

What if the GUI is not loading? This time it might have worked, but the previous time I could load the UI and nothing really seemed to function. Is there a CLI option to run the diagnostics? It may not run from there either, so I'm just trying to understand which options I can try.

Squid · July 14, 2018

7 minutes ago, 1activegeek said:

Is there a CLI option to run the diagnostics?

diagnostics

8 minutes ago, 1activegeek said:

It may not run from there either

It will (should) always complete. However, some of the commands it uses may take quite a while (20+ minutes) to run if the system is completely hosed.

1activegeek · November 18, 2018

OK, it's been a while, but this has struck once or twice more since I originally posted. This time, though, I happened to be sitting at my computer when the first email came in, so I immediately jumped onto the box, ran the diagnostics, and copied the archive off with SCP before restarting. Should I just post the diagnostics archive here? Or is there possibly sensitive info I should share privately?

Squid · November 18, 2018

7 hours ago, 1activegeek said:

Should I just post the diagnostics archive here?

this

1activegeek · November 18, 2018

OK, here it is. To stop the issue from progressing and locking up the system completely, I restarted. After the restart, everything returns to normal.

atlantis-diagnostics-20181114-2232.zip

1activegeek · November 20, 2018

@Squid, anything notable in here?

1activegeek · November 29, 2018

Bump? Is anybody able to check out the diagnostics here and help me understand what the matter could be?

JorgeB · November 29, 2018

Since you're using NFS, the first step is to update to v6.6.5, since there were some NFS issues in the previous releases.

1activegeek · November 29, 2018

How far back did that NFS problem stretch? I've had this issue as far back as v6.5, and possibly 6.4. It usually happened in the middle of the night, when I wasn't around to catch the system while it was still somewhat responsive. This time I was lucky enough to get the diagnostics output.

Edited November 29, 2018 by 1activegeek

JorgeB · November 29, 2018

Then it's not related to NFS, but you should still upgrade.

1activegeek · November 30, 2018

I'd like to do that, but for troubleshooting purposes I feel it's best not to change what I'm currently running. Obviously updates could bring new fixes, but they could also make it harder to identify something specific to the current setup that the Limetech folks can diagnose.

JorgeB · November 30, 2018

If the issue started with v6.4 or v6.5, I don't see the point of staying on that release, but best of luck; I hope you find the problem.

JonathanM · November 30, 2018

9 hours ago, 1activegeek said:

I'd like to do that, but for troubleshooting purposes I feel it's best not to change what I'm currently running. Obviously updates could bring new fixes, but they could also make it harder to identify something specific to the current setup that the Limetech folks can diagnose.

Limetech is only going to spend time diagnosing issues with the current release and newer.

1activegeek · November 30, 2018

What counts as current and newer? I'm on 6.6.3. I was just pointing out that this issue has occurred as far back as 6.4 or 6.5, but it most recently happened on 6.6.3, as I posted on 11/18. I see that was 10 days after 6.6.5 came out, but I make a habit of not updating to the newest version immediately. It seems that should still qualify as the current version, since it's only a minor release behind.

Edited November 30, 2018 by 1activegeek

itimpi · November 30, 2018

I understand being reluctant to rush into applying an update. The point releases are aimed at fixing reported errors rather than adding new features. Unless the problem is reproduced on the latest release, it may not be obvious whether you are hitting an issue that has already been fixed or something else.

1activegeek · November 30, 2018

@itimpi, makes sense. The problem I have, though, is that this only seems to happen after the system has been up for extended periods. With updates coming out almost monthly, it's hard to keep the system running long enough to hit this issue.

I guess it comes down to this: who can tell me whether they will look into the diagnostics I've posted? Or, if I leave the system running on another version and hit the same issue, will someone dig into it then? I'd like to save everyone the time and effort of responding to this thread if nobody is going to do anything about it; if so, I'll just move along and get pissed every time it happens. Like I mentioned, this has been happening for an extended period of time. I'd imagine the Lime folks would know whether any of the fixes in the latest minor release "should" resolve my issue. I'd also imagine most bug fixes and patches have a known root cause, which should be discoverable in the diagnostics?

Forgive me if my tone seems rude; this is just frustrating, because I've FINALLY captured diagnostics while this was happening, and now it sounds like they're useless because nobody will look at them. For reference, note that the original post was on 7/14, and that was only after it had happened 4-5 times and I decided I needed to do something, since new version updates were not the solution. It's been almost 5 months since I posted, and probably at least 8-10 months since this first happened.

JorgeB · November 30, 2018

Forgive me if my tone seems rude; this is just frustrating, because I've FINALLY captured diagnostics while this was happening, and now it sounds like they're useless because nobody will look at them.

I did look at your diagnostics but didn't see anything out of the ordinary, which is why I suggested upgrading. If the issue keeps happening, it might also be worth booting in safe mode to see if the problem persists. Since the problem has been happening across several releases, it's not likely an Unraid issue; it's more likely related to your hardware or some plugin/Docker container you're running.

1activegeek · December 27, 2018

So I'm back. It's happening again. I'm hoping 6.6.6 counts as the current version, since unRAID hasn't told me to upgrade to anything else? Diagnostics attached. What else can I do here?

atlantis-diagnostics-20181227-1521.zip

Delarius · December 27, 2018

I think the issue is that you're running out of available processes. Note that your 'top' listing shows PID 7530 running very hard. When you correlate that with the ps listing, you can see hundreds if not thousands of 'defunct' processes (though not that one), all from fairly basic commands such as awk, tail, cut, sort and uniq.
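The 'defunct' entries are zombies: processes that have exited but whose parent has not yet reaped them with wait(), and each one still occupies a PID until it is reaped. Below is a minimal sketch, assuming Python 3 on the host, that tallies zombies by command name by reading the state field (Z) from /proc/<pid>/stat directly instead of eyeballing the ps listing. The function names are hypothetical.

```python
import os
from collections import Counter


def parse_stat(text):
    """Parse the first fields of /proc/<pid>/stat into (pid, comm, state, ppid).

    comm is wrapped in parentheses and may itself contain spaces or ')',
    so split on the LAST ')' rather than on whitespace.
    """
    left, _, right = text.rpartition(")")
    pid_str, _, comm = left.partition(" (")
    fields = right.split()
    return int(pid_str), comm, fields[0], int(fields[1])


def read_all_stats(proc_root="/proc"):
    """Yield parsed stat tuples for every process still visible under /proc."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                yield parse_stat(f.read())
        except OSError:
            continue  # the process exited mid-scan


def zombies_by_command(stats):
    """Count processes in state 'Z' (shown as <defunct> by ps), grouped by command name."""
    return Counter(comm for _pid, comm, state, _ppid in stats if state == "Z")


if __name__ == "__main__":
    for comm, n in zombies_by_command(read_all_stats()).most_common(10):
        print(f"{n:6d}  {comm}")
```

A large count for awk, tail, cut, sort or uniq here would match the pattern described above.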

I also note in other logs that you're manually creating a local user. It's hard to say what impact this might have, but it could be related to the problem. It looks like that user is also manually configured to run some sudo commands; again, as this is not standard, it's hard to say whether it's related.

I'd also check what's in /etc/rc.d/rc.diskinfo (and, obviously, the corresponding file in /boot/config on the USB stick).

It's almost like a service is failing, and thus keeps forking new processes which also fail.

1activegeek · January 1, 2019

Thanks for the response and info, Delarius. I'll keep this in mind when I look into it the next time it inevitably happens. I agree that it all centers on running out of available processes, as the errors being kicked out indicate. It's good to know someone sees something out of the ordinary.

@Delarius, any tips on spotting or cornering the process that's causing the issue when it happens? That way I can hopefully identify what's behind it.
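One way to corner the culprit: zombies can't be killed directly, only reaped by their parent, so the parent PID that has accumulated the most defunct children is the process to look at. Below is a minimal sketch, assuming Python 3 and procps `ps` are available on the host; the helper names are hypothetical. It uses the `pid=,ppid=,stat=,comm=` format, where the trailing `=` suppresses each column header.

```python
import subprocess
from collections import Counter


def parse_ps(output):
    """Parse `ps -eo pid=,ppid=,stat=,comm=` output into (pid, ppid, stat, comm) tuples."""
    rows = []
    for line in output.splitlines():
        # comm is last and may contain spaces, so split at most 3 times.
        parts = line.split(None, 3)
        if len(parts) == 4:
            pid, ppid, stat, comm = parts
            rows.append((int(pid), int(ppid), stat, comm))
    return rows


def zombie_parents(rows):
    """Count zombie children (STAT starting with 'Z') per parent PID."""
    return Counter(ppid for _pid, ppid, stat, _comm in rows if stat.startswith("Z"))


def describe(rows, pid):
    """Return the command name ps reported for a PID, or '?' if it isn't listed."""
    for p, _ppid, _stat, comm in rows:
        if p == pid:
            return comm
    return "?"


if __name__ == "__main__":
    out = subprocess.run(
        ["ps", "-eo", "pid=,ppid=,stat=,comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = parse_ps(out)
    for ppid, n in zombie_parents(rows).most_common(5):
        print(f"parent {ppid} ({describe(rows, ppid)}) has {n} defunct children")
```

If one parent stands out, killing or restarting that parent (not the zombies) lets init adopt and reap the orphaned children. Running this from a cron job every few minutes and appending the output to a log would leave a trail even if the box is unreachable by the time you notice.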
