
[6.12.4] "Cannot allocate memory" error, system half-hanging


Pri

The problem:

For about a month now, since I upgraded to the 6.12.x branch (.3 and then .4 only; I did not use .0, .1 or .2), my server has been randomly half-crashing anywhere from every 1 to 16 days. Sometimes it happens twice in a day, sometimes only once a week; it seems random. I did not have any issues like this on 6.11, which was rock solid.

 

What I mean by half-crashing:

My disk shares stop working, one of my Windows Server 2022 virtual machines crashes (the others do not), the WebUI becomes very slow to respond but still functions, and all my disks (NVMe SSDs and SATA hard drives) enter a sleep state and won't wake up. Only a reboot fixes it.

 

So far what I've tried:

1. Brand new USB stick to boot from. Some people previously said based on my log it looked like my USB stick was failing.

2. Disabling C-States in the BIOS

3. Uninstalling the Unassigned Devices plugin. Some thought it could be related, as others on Reddit are having the same crashes as I am (see this thread: https://www.reddit.com/r/unRAID/comments/16yz0gd/possible_fix_for_people_crashing_on_612/), but it appears to be unrelated.

4. Running memtest86. It passed after 2 hours, and I let it run for a total of 7 hours and 20 minutes without errors.

 

Specs of my server / other hardware related details:

3rd Gen EPYC Milan with 7 NVMe devices and 11 hard disk drives. The HBA I'm using is a 9500-8i. I do not have a dedicated GPU installed, and I'm not doing any kind of fancy graphics stuff with my VMs either. All the firmware and BIOS versions are up to date (motherboard, HBA, network cards, NVMe drives etc).

 

Diagnostics taken today during the latest half-crash:

hyepyc-diagnostics-20231021-1828.zip

 

I did make a thread about this earlier, which I marked as resolved because everyone on Discord told me to get a new USB key, as my current one seemed broken. So that is what I did. However, that hasn't resolved the issue, so I'm making the topic again with fresh diagnostics. The only real differences between then and now are that I have the new USB key in and that I uninstalled the Unassigned Devices plugin before this most recent crash.

 

Any help is greatly appreciated. Also, you may see a lot of SSH logins in my log. These come from external software I run to automate Docker, since unRAID doesn't have an official API. I have since disabled this in case there is some kind of SSH-related out-of-memory bug in play.

Edited by Pri

Someone suggested I run memtest86, so I have done that. After two hours it passed with no errors of any kind. I am letting it continue to run for the next 10 hours or so just to be sure, though. I will add this as an edit to my post above as well.

 

EDIT: It kept passing for 7 hours and 20 minutes before I turned it off, as I needed the server up to do some work.

5 minutes ago, JorgeB said:

There are some strange errors and still some apparent flash drive issues:

oct 21 17:08:40 HYEPYC emhttpd: Unregistered - flash device error (ENOFLASH7)

But I'm not seeing any USB errors, try booting in safe mode.

 

Yeah, it looks like that, but I've completely changed both the USB port and the USB stick. The flash error seems to be a symptom rather than the cause: some memory allocation failure happens first, and that sets off a cascade of issues in which the USB error shows up.
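
One way to check that ordering in a saved syslog is to look for the first line of each kind and compare their positions. The following is only a minimal sketch: the two search strings come from this thread (the error in the title and the emhttpd line quoted above), while the function names and the idea of reading the log as a list of lines are my own assumptions.

```python
import re

# Search strings taken from this thread: the error in the title and the
# emhttpd flash message quoted above. Adjust them to match your own log.
PATTERNS = {
    "memory": re.compile(r"Cannot allocate memory"),
    "flash": re.compile(r"flash device error"),
}


def first_occurrences(lines):
    """Return {label: index of the first matching line} for each pattern found."""
    found = {}
    for index, line in enumerate(lines):
        for label, pattern in PATTERNS.items():
            if label not in found and pattern.search(line):
                found[label] = index
    return found


def memory_error_comes_first(lines):
    """True only if both errors appear and the memory error is logged first."""
    found = first_occurrences(lines)
    return (
        "memory" in found
        and "flash" in found
        and found["memory"] < found["flash"]
    )
```

To use it, read the syslog from the diagnostics zip into a list of lines (for example with `open(path).read().splitlines()`, where `path` is wherever you extracted it) and pass that list to `memory_error_comes_first`. A `True` result would be consistent with the flash error being a downstream symptom, though it would not prove it.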

 

I checked my old USB key and ran all kinds of diagnostics on it, and it seems to be perfectly fine. I'm at a loss to explain what is going on.

 

Regarding safe mode, are there any downsides to using it, and is there anything you want me to provide once I'm running in safe mode?

  • 8 months later...
  • Solution

I was just made aware that I didn't follow up on this thread with a solution.

 

So I narrowed the issue down to Docker. Specifically, when dockers are restarted (instead of being shut down and started back up), there appears to be a memory leak related to system resource allocation. Some fixed memory pool gets used up a little more each time a docker restarts, until there's none left. At that point, unRAID goes haywire.

 

Storage devices can't be accessed (including the unRAID USB device), the WebUI may fail to load, SSH won't connect reliably, etc.

 

Stopping the docker that you've restarted multiple times releases these resources back to the system and everything instantly begins working again. Since learning of this, I changed how I interact with my dockers so they aren't restarted but are completely stopped then started and I've experienced no more problems since then.
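
For anyone scripting their containers, the same habit can be sketched as a small helper that issues `docker stop` followed by `docker start` instead of a single `docker restart`. This is only an illustration of the workaround described above, not a confirmed fix: the function names and the container name in the usage note are hypothetical, and I have not verified whether a CLI restart hits the same leak as a restart from the WebUI.

```python
import subprocess


def stop_then_start_commands(container):
    """Build the two docker CLI calls that replace a single `docker restart`."""
    return [
        ["docker", "stop", container],
        ["docker", "start", container],
    ]


def stop_then_start(container, run=subprocess.run):
    """Fully stop the container, then start it again.

    `run` is injectable so the helper can be exercised without Docker installed.
    """
    for command in stop_then_start_commands(container):
        # check=True raises CalledProcessError on a non-zero exit code,
        # so a failed stop is never followed by a start.
        run(command, check=True)
```

Calling `stop_then_start("my-container")` (a placeholder name) runs the two commands in order and raises if either one fails.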


I've had it happen with a few containers that I made myself (which I was restarting as part of my normal use of them) and also with the Mysterium container (the official one from the app's repo).

 

Since I rarely, if ever, need to restart my normal dockers (Plex, Grafana, qBittorrent etc), I've not witnessed the issue with those.
