[6.12.10] Cannot fork: Resource temporarily unavailable


    Hendrik112
    • Urgent

Hi guys, my Unraid server had been running 24/7 for a few months when Docker services suddenly started to fail and the Docker daemon became unavailable. I tried logging in to the WebUI, but it was completely frozen. Sadly, I wasn't able to copy diagnostics, since SSH was also unavailable.

Using IPMI, I managed to get the following screenshot while initiating an ACPI shutdown. The server was completely unresponsive for a few minutes until it finally shut down.
Upon restart, a parity check was automatically triggered, which indicates to me that the “soft shutdown” wasn't performed correctly.

The same thing happened again a day later: the same messages over IPMI, and the WebUI barely usable, indicating "no flash drive". I'm not sure whether this was just a frontend bug or the flash drive actually having issues.

     

I have been using Unraid for 5 years now and have never seen anything like this before.

The diagnostics file was generated after the restart was performed.

     

I searched around for the error message but didn't find anything related to my case. My system has 128GB of ECC RAM, so it's unlikely to be an out-of-memory situation like on a 2GB system.

     

    unraid.png

    anton-diagnostics-20240609-0925.zip

     

    Edit:

     

These errors are likely caused by flash drive issues, beginning with this log message:

    emhttpd: Unregistered Flash device error (ENOFLASH7)

     

    Followed by this:

    emhttpd: Plus key detected, GUID: 0781-5581-0000-100124105314 FILE: /boot/config/Plus.key
    emhttpd: error: device_read_smart, 9567: Cannot allocate memory (12): device_spinup: stream did not open: nvme1n1

     

After multiple of these flash drive errors, followed by more memory allocation errors, the whole system becomes increasingly unresponsive, with Docker containers unable to allocate resources and growing more and more unstable.

This results in the original error message:

    Cannot fork: Resource temporarily unavailable

At this point, the Unraid WebGUI is barely responsive. This stage can continue for about 1-3 hours, during which I am still able to gracefully shut down the server using the IPMI KVM and the powerdown command.

Uptime Kuma is the only service still able to push out a Discord notification that a ping process has failed at this stage.

If I miss this window, the server locks up completely, so that even an ACPI shutdown doesn't get through. The only way to shut it down at that point is by turning off the PSU.
     





    Recommended Comments



    Jun 11 21:55:38 Anton emhttpd: Unregistered Flash device error (ENOFLASH7)
    Jun 11 21:55:40 Anton emhttpd: Plus key detected, FILE: /boot/config/Plus.key
    Jun 11 21:56:05 Anton emhttpd: Unregistered Flash device error (ENOFLASH7)
    Jun 11 21:56:06 Anton emhttpd: Plus key detected, FILE: /boot/config/Plus.key
    Jun 11 21:56:08 Anton emhttpd: Unregistered Flash device error (ENOFLASH7)
    Jun 11 21:56:10 Anton emhttpd: Plus key detected, FILE: /boot/config/Plus.key

     

You are having issues with the flash drive. Try using a different USB port, ideally USB 2.0; if the issue remains, try a different flash drive.


This board only has USB 3.2 Gen1 ports. I have been using that port for about 5 months now, and the flash drive has been running Unraid for 5 years.

     

I will try replacing the flash drive and hope that someday we will be able to properly install Unraid.


    After replacing the flash drive, it took about 1.5 days until the server froze up completely again.

    This is the syslog from the whole time period:

     

I will bump this to Urgent now, as this keeps happening and is rendering my server completely unusable.

FYI, I have not performed any BIOS updates or anything else that could have caused this. The only thing I did was add two more HDDs about a week ago.

    syslog-192.168.1.10 (1).log


The flash drive is still disappearing, but there are no associated USB errors, so that's strange; it could be a config issue or a hardware issue. First, I would redo the flash drive and restore just the bare minimum: the key, super.dat, and the pools folder for the assignments; also copy the docker user templates folder. If all works, you can then reconfigure the server, or restore a few config files at a time from the backup to see if you can find the culprit. If that doesn't help, look in the board BIOS for any USB-related settings and try toggling those.
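A rough sketch of that minimal restore, assuming a freshly created flash plus a backup of the old one; the BACKUP path is hypothetical and must be adjusted:

# Hypothetical backup location -- point this at your own flash backup
BACKUP=/mnt/user/backups/flash
FLASH=/boot   # mount point of the Unraid flash drive

cp "$BACKUP"/config/*.key     "$FLASH/config/"   # license key
cp "$BACKUP"/config/super.dat "$FLASH/config/"   # disk assignments
cp -r "$BACKUP"/config/pools  "$FLASH/config/"   # pool assignments
# docker user templates (standard dockerMan plugin path)
mkdir -p "$FLASH/config/plugins/dockerMan"
cp -r "$BACKUP"/config/plugins/dockerMan/templates-user "$FLASH/config/plugins/dockerMan/"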


Bad news. I did the following:
- Fresh install using the Unraid USB Creator 2.1, copying over only the following configuration:
    - Plus.key
    - super.dat
    - pools dir
    - docker user template (only Portainer)
    - docker.cfg (using a custom macvlan docker network)
    - network.cfg (using VLANs)
- Ran only the following compose stacks:
    - Traefik without Uptime Kuma
    - Authelia
    - Jellyfin without the nvidia runtime
    - Nextcloud
- Set disk spin-down to 3h
- Configured shares to use the caches
- Enabled local syslog

After an uptime of about 2 days and 21 hours, the system displayed the same error. This time I was able to run the diagnostics command from the CLI, as shown below. After rebooting, there wasn't even a logs folder on the flash drive.
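For reference, this is just the stock Unraid CLI command; the resulting zip should land in the logs folder on the flash drive:

# Collect diagnostics from the shell when the WebUI is unresponsive;
# the zip is written to /boot/logs/ on the flash drive
diagnostics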

I will check the UEFI settings next, but I hadn't changed anything before this problem occurred, and I would be surprised if there were suddenly an issue with the CPU, RAM, or motherboard; that seems really unlikely to me. One thing that could potentially be causing this is the PSU: it's good but old, and I am not sure about the fan.

    syslog-192.168.1.10 (2).log

    Jun 17 23:15:35 Anton emhttpd: Unregistered Flash device error (ENOFLASH7)
    Jun 17 23:15:35 Anton emhttpd: read SMART /dev/sdh
    Jun 17 23:15:35 Anton emhttpd: error: device_read_smart, 9552: Cannot allocate memory (12): device_spinup: stream did not open: sdh

     

There is also a Cannot allocate memory error logged at the same time. Depending on the Unraid logging backend, which I'm really not familiar with, the order of events might not be accurate. I'm restoring all the other configs now and running memtest for a day or two. I also checked the motherboard's USB settings; I have no idea what they do, everything is still set to Auto, and there are no descriptions provided.


    I was able to get the diagnostics saved this time around.

     

Switched to a brand-new Seasonic power supply and basically rewired the whole system.

The RAM has no issues; that was unlikely anyway, since it's original Samsung ECC RAM, and I tested it with the latest memtest version. I guess I will now just go backwards through the Unraid versions and see if that changes something.

    I remembered updating the OS before adding two more disks.

    anton-diagnostics-20240620-0646.zip

    1 hour ago, Hendrik112 said:

I will now just go backwards through the Unraid versions and see if that changes something.

Worth a try, and I also don't think this is RAM related.


    Where can I download Unraid 6.12.8? That's the last version I used (skipped 6.12.9).

It's not showing in the USB Creator tool, and it's also not available in the download archive.

     

The new documentation is completely useless, btw. The headline states "Manual upgrade or downgrade", but the subsections don't mention downgrading at all. Link
Links to the docs from other forum posts don't work, or they point to a random article, including yours from your first response.


    The manual option can be used to upgrade and downgrade.

     

    v6.12.8 attached.

     


    Sadly, downgrading to 6.12.8 didn't help either.

     

Since 6.12.8 was running fine for months, I don't think it makes sense to downgrade further.

I am now looking into USB signaling issues. I switched the flash drive to the front panel ports (also USB 3.0), which are connected to the X470 chipset instead of directly to the CPU.

     

Is there any way support can reactivate an already-used flash drive, in case I have to try another one?

     


     

    43 minutes ago, Hendrik112 said:

Is there any way support can reactivate an already-used flash drive, in case I have to try another one?

Never heard of this being possible, but the only way to be certain would be to contact support.

    16 hours ago, Hendrik112 said:

Since 6.12.8 was running fine for months, I don't think it makes sense to downgrade further.

I agree, and it suggests the OS is not the problem. You could try recreating the flash drive from stock and restoring only the bare minimum: the key, super.dat, and the pools folder for the assignments; also copy the docker user templates folder. If you still get the flash drive related errors in the log after that, it basically confirms it's a hardware issue.

    1 minute ago, JorgeB said:

I agree, and it suggests the OS is not the problem. You could try recreating the flash drive from stock and restoring only the bare minimum: the key, super.dat, and the pools folder for the assignments; also copy the docker user templates folder. If you still get the flash drive related errors in the log after that, it basically confirms it's a hardware issue.

    You already suggested that, and I already did that like a week ago. Unless I am missing something?


I searched the Unraid forums as well as the official Unraid Discord for the ENOFLASH7 error and categorized the results.

In conclusion, most posts did not contain a solution.

The slowly failing system is consistent across the posts.

The issue seems to be present in all Unraid versions from 6.11 to 6.12.

Nobody was able to point out what exactly caused the issues.

One user replaced their USB flash drive, seemingly without ever resolving the issue, while others just recreated their flash drive and were fine.

     

Fixed by disabling C-States on Ryzen CPUs (will try next):
    - https://forums.unraid.net/topic/100870-solved-multiple-errors-unregistered-flash-device-error-and-fork-errors/

     

Fixed by stopping Shinobi / stopping a failing docker container (checking):
- https://forums.unraid.net/topic/122636-solved-server-freeze-every-morning-requires-hard-reboot-fork-resource-unavailable/
- https://forums.unraid.net/topic/138316-help-solve-repeatedly-failing-unraid-server/

     

    Probably fixed by recreating the flash drive:
    - https://forums.unraid.net/topic/132557-got-on-this-am-to-a-mostly-hung-server/

- https://forums.unraid.net/topic/134416-unregistered-flash-device-error-enoflash7/

     

    No response from author:
    - https://forums.unraid.net/topic/146773-6124-cannot-allocate-memory-error-system-half-hanging/
    - https://forums.unraid.net/topic/116184-enoflash7-everyday/
    - https://forums.unraid.net/topic/125302-enoflash7-in-random-intervals/
    - https://forums.unraid.net/topic/141213-unraid-612-vm-crashing-enoflash7/
    - https://forums.unraid.net/topic/135679-periodic-freeze-until-reboot-unregistered-key-detected-unregestered/
    - https://forums.unraid.net/topic/118573-display-”no-flash“-for-the-first-ten-minutes/

    1 hour ago, Hendrik112 said:

    You already suggested that, and I already did that like a week ago. Unless I am missing something?

You restored some more stuff; this time it is just to be sure it isn't, for example, a container causing issues. Restore the minimum and run it without any containers for a couple of days, but I really doubt that is the problem.


    Sure, that's worth a try.

Now that I am thinking about it, it's actually not that unlikely, since docker.cfg sets the Docker storage type to "directory", which will use the Docker zfs storage driver if the share is on a ZFS pool.

The Docker zfs storage driver could very well be causing these issues.
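A quick way to confirm which storage driver the daemon is actually running with, using the standard Docker CLI:

# Print the storage driver the running Docker daemon selected
# (e.g. overlay2, btrfs, or zfs)
docker info --format '{{.Driver}}'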

     

    The system should fail again tomorrow anyway, so I can try that.


    Hehe, I figured it out :)

     

    This is not a hardware issue!

     

Unraid OS is vulnerable to Docker fork bombs. From my limited understanding of how the Linux kernel and Docker work, this should usually be prevented by the cgroups PIDs limit.
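As a stopgap, Docker can enforce exactly such a limit per container; a minimal sketch, with 64 as an arbitrary example limit:

# Fork attempts beyond the limit fail inside the container
# ("Resource temporarily unavailable") instead of exhausting the host's
# global PID space; compose files support the same via pids_limit.
docker run --rm --pids-limit 64 alpine sh -c \
  'for i in $(seq 1 100); do sleep 30 & done; wait'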

Given the number of issues related to this problem, and people throwing away perfectly fine flash drives, I see a strong case for investigating this further and fixing it in the OS. One guy even replaced his mainboard just because of this. The container mentioned below is not the first to slowly fork bomb a system, nor will it be the last.

     

How I was finally able to find the root cause:

After roughly 30 hours of uptime, I ran the docker stats command, which revealed that the Authelia container was using 3318 PIDs.

The number of PIDs was also steadily increasing, once every 10 seconds. That turned out to be the health check interval, which I had modified from 30s to 10s.
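For anyone who wants to run the same check, the PIDS column of the stock docker stats output is what gave it away:

# One-shot snapshot of each container's PID count
docker stats --no-stream --format 'table {{.Name}}\t{{.PIDs}}'

# Poll every 10 seconds to see whether a container's count only ever grows
watch -n 10 "docker stats --no-stream --format 'table {{.Name}}\t{{.PIDs}}'"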

The Authelia Docker container, if it has been configured with TLS certs, causes the Docker health check script, which uses the Alpine wget command, to create zombie processes. After checking with James, one of the lead devs of Authelia, whether this could cause system crashes, I decreased the health check interval to 0.1s. This greatly sped up the process and caused my system to freeze after around 1.5h instead of days. This behavior is obviously not intended by Authelia and will soon be fixed.
I was expecting the system to crash roughly at the PID limit, but it actually crashed at about half that (presumably because thread IDs are allocated from the same pool, so counting processes alone undercounts the real usage).

     

    cat /proc/sys/kernel/pid_max
    32768

     

    ps -e | wc -l
    bash: fork: retry: Resource temporarily unavailable
    bash: fork: retry: Resource temporarily unavailable
    16157
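The zombie accumulation itself is visible with a plain ps, nothing Unraid-specific assumed:

# List zombie (defunct) processes; the PPID column shows which parent
# is failing to reap its children
ps -eo stat,ppid,pid,comm | awk '$1 ~ /^Z/'

# Count them; a steadily growing number matches the pattern above
ps -eo stat= | grep -c '^Z'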

Glad that you've found the issue, but just to confirm: were the logged flash drive issues also related to that?

    11 minutes ago, JorgeB said:

Glad that you've found the issue, but just to confirm: were the logged flash drive issues also related to that?

    Oh yes of course!

A system running out of PIDs will fail unpredictably. ENOFLASH7 and the memory allocation errors all stem from that.
With a real fork bomb, like the classic one in bash, the system halts almost instantly.

Because this was happening very slowly, the system was able to keep killing processes and reassigning PIDs until there weren't any left.

The kernel probably killed whatever process runs the USB and license check. I'm not sure whether the actual USB drivers would be killed for the kernel to survive, and I really don't want to find out.

    Never ever fork bomb a system you still want to use. There is a high chance you will have massive problems down the line.
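For reference, the classic bash fork bomb mentioned above, shown only so it can be recognized; do not run it on anything you care about:

# DO NOT RUN: defines a function ':' that pipes into itself and backgrounds
# the result, doubling the process count until pid_max is exhausted.
:(){ :|:& };: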

    50 minutes ago, Hendrik112 said:

    Oh yes of course!

Thanks for confirming. I did find it strange that there weren't any associated USB errors.






