• 6.12-rc5 - Server at 100% CPU usage


    samsausages
    • Minor

    Process "lsof" is pegging out the CPU at 100% on the server.  When viewing Tower/Main page the CPU on the host is seeing high utilization as well. 


The log was spammed with the following about an hour ago, though it isn't spamming right now.

    Quote

May 6 23:16:48 bertha nginx: 2023/05/06 23:16:48 [crit] 12671#12671: ngx_slab_alloc() failed: no memory
May 6 23:16:48 bertha nginx: 2023/05/06 23:16:48 [error] 12671#12671: shpool alloc failed
May 6 23:16:48 bertha nginx: 2023/05/06 23:16:48 [error] 12671#12671: nchan: Out of shared memory while allocating message of size 28112. Increase nchan_max_reserved_memory.
May 6 23:16:48 bertha nginx: 2023/05/06 23:16:48 [error] 12671#12671: *21081963 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/devices?buffer_length=1 HTTP/1.1", host: "localhost"
May 6 23:16:48 bertha nginx: 2023/05/06 23:16:48 [error] 12671#12671: MEMSTORE:00: can't create shared message for channel /devices

     




    User Feedback

    Recommended Comments

    4 hours ago, JorgeB said:

    Diagnostics please.


I attached it. Looks like the log got cleared when I rebooted after the system hung last night. When I checked the log last night, before it locked up completely, it was all spammed with what I posted above.
I have now enabled the syslog server to try to retain the logs if it happens again.

    bertha-diagnostics-20230507-0728.zip


I just had one of my cache pools disconnect as a failed drive, but a reboot brought it right back. So I'm going to test my drives thoroughly, as a few of them are new and have only been in there for a few days with little utilization. I'm going to take them out and run some tests on my desktop to eliminate them as a variable.


@JorgeB So I have been doing lots of testing... I had 2 SMART errors pop up, one each time this happened and Unraid went to 100% CPU usage (requiring a reboot).

The SMART error is as follows:
     

    Quote

Entry 0:
ErrCount (Error Count): 2
SQId (Submission Queue Identifier): 4
CmdId (Command Identifier): not available
Status: 0xc00c
PELoc (Parameter Error Location): not available
LBA (Logical Block Address): 0
NSID (Namespace Identifier): not available
VS (Vendor Specific): not available

Entry 1:
ErrCount (Error Count): 1
SQId (Submission Queue Identifier): 8
CmdId (Command Identifier): not available
Status: 0xc00c
PELoc (Parameter Error Location): not available
LBA (Logical Block Address): 0
NSID (Namespace Identifier): not available
VS (Vendor Specific): not available

     

Deep diving into the error, status code 0xc00c indicates a "Completion Queue Invalid" error. From what I've read, this error typically occurs when there's an issue with the completion queue head pointer or doorbell. It might be a firmware or driver issue, or even a hardware problem with the NVMe drive.

I removed the drive from the system and ran several full SMART checks on a Windows system, with no issues (2x).
I then used the Intel utility to perform several full diagnostic read/write tests on the drive (3x).

I then reinstalled the drive in my Unraid server. I wrote a script that copies files to the drive and runs a battery of read/write tests.
So far I have not been able to recreate the error/issue, and I have completed 4 full drive writes and checksum tests on the written data. The error has not popped back up.

So I'm still a bit unsure what caused it; I'm going to tweak my script a bit to throw more variety at the drive. The one thing I'm doing differently than before is that I had the "docker" folder on this drive and set Docker to use the ZFS file system (not an image). I haven't tested whether the ZFS Docker file system integration has anything to do with it.
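In case it's useful, here is roughly what the test script does. This is a simplified sketch rather than the exact script; the /mnt/user/testdata and /mnt/testpool/burnin paths and the pass count are placeholders.

#!/bin/bash
# Sketch of a write-and-verify burn-in loop (paths and pass count are examples)
SRC=/mnt/user/testdata      # files to copy onto the drive under test
DST=/mnt/testpool/burnin    # mount point of the drive/pool being tested
PASSES=4

for i in $(seq 1 "$PASSES"); do
  mkdir -p "$DST/pass$i"
  rsync -a "$SRC/" "$DST/pass$i/"
  # checksum the source and the copy, then compare
  (cd "$SRC" && find . -type f -exec sha256sum {} + | sort -k2) > /tmp/src.sha
  (cd "$DST/pass$i" && find . -type f -exec sha256sum {} + | sort -k2) > /tmp/dst.sha
  if diff -q /tmp/src.sha /tmp/dst.sha >/dev/null; then
    echo "pass $i: OK"
  else
    echo "pass $i: CHECKSUM MISMATCH"
  fi
done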

    But I wanted to give an update and see if you had anything to add to it.


It is happening again right now, so I can provide more info and logs. I hope this helps narrow it down, as it's requiring a reboot every few days.
Check point #7 below, when I stop the array.
If I had to guess, it's related to the nchan webGUI device-list processes; it looks like there are a lot of them and they keep building up.

I went several days and it was working fine. I ran extensive tests on my individual devices and the pool, both in my workstation and in the server: writing varying types of data 10x to each individual drive and then another 10x when configured as a ZFS pool, each time covering the full size of the disks and verifying checksums. I feel pretty good about my hardware being solid. I suspect the SMART warning was logged because Unraid became unresponsive and the drive ran into what it saw as a driver error.


    While I haven't been able to recreate the issue on demand, I can expand on the symptoms and provide logs while it is happening:
1. CPU usage keeps increasing until it hits 100% and freezes the system. It seems like more threads are being utilized as time goes on, but I'm not sure what triggers it. It can sit at 40% usage for some time, then suddenly go to 100%.
2. The "Main" page causes 100% browser CPU usage on the client. The MB/s speed info on the page updates several times per second.
3. On the "Main" page you cannot click on any of the disks or browser links to bring up Attributes or SMART info. (Other pages are fine, and I can still stop the array from the Main page. Putting the disk-attributes URL directly into the browser works; it's just the links on the Main page that are broken.)
4. No errors in the log specific enough for me to know what's happening.
5. Usually no single process accounts for the 40-100% CPU usage, but I frequently see 'lsof' make the list, sometimes more than once. It's reflected in the htop process list below, along with lots of nchan device-list processes (see the sketch after this list for how I'm counting them).
6. Stopping Docker has no effect.
7. Stopping the array does! CPU usage goes down, but stays high at about 10%. Upon stopping the array, the log is flooded with the errors below:

    May 13 16:18:09 bertha nginx: 2023/05/13 16:18:09 [crit] 253614#253614: ngx_slab_alloc() failed: no memory
    May 13 16:18:09 bertha nginx: 2023/05/13 16:18:09 [error] 253614#253614: shpool alloc failed
    May 13 16:18:09 bertha nginx: 2023/05/13 16:18:09 [error] 253614#253614: nchan: Out of shared memory while allocating message of size 26778. Increase nchan_max_reserved_memory.
    May 13 16:18:09 bertha nginx: 2023/05/13 16:18:09 [error] 253614#253614: *10550928 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/devices?buffer_length=1 HTTP/1.1", host: "localhost"
    May 13 16:18:09 bertha nginx: 2023/05/13 16:18:09 [error] 253614#253614: MEMSTORE:00: can't create shared message for channel /devices


8. Restarting the array brings the issue back immediately, requiring a reboot. I guess I could try restarting specific services like nginx, but I haven't tried that.

9. No new SMART warning logged; I believe this is because I rebooted before the CPU went to 100% and the system became totally unresponsive.
10. I'm uploading 2 sets of logs. #1 is from the server running with high CPU usage but no errors flooding the log. #2 is with the array stopped and the log flooded.
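For anyone who wants to compare numbers while theirs is climbing, these are the kinds of commands I'm using to count the nchan publisher processes and watch the top CPU consumers. The grep pattern just matches what shows up in my htop; adjust it to whatever yours shows.

ps -ef | grep "[n]chan" | wc -l
watch -n 5 'ps -eo pid,pcpu,comm --sort=-pcpu | head -n 12'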

Example of top 10 processes with top CPU usage: (screenshot attached)


Example of htop: (screenshot attached)

    babel-diagnostics-20230506-1546.zip bertha-diagnostics-20230513-1621.zip


I think I have traced this to something filling up the tmpfs mounted on /run.
I haven't figured out what is filling it yet, but it looks like the issues begin when that tmpfs fills up.

     

Going to try to trace down the exact cause of it filling up this weekend.
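In case anyone else wants to watch for the same thing, these are the standard commands I'm using to see how full the /run tmpfs is and what is growing inside it (nothing Unraid-specific):

df -h /run
du -ah /run 2>/dev/null | sort -rh | head -n 20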


I really believe that I am having the same issue after upgrading as you!
Upgraded to v6.12.0-rc6, and CPU and RAM usage have been much higher than before!

CPU is now always around 80%+, and the find and shfs services are now always at the top!

At one time I had to cut the power because it stayed unresponsive for hours.

    Diag file attached.

     


    I am getting pretty much the same thing. See my post in this thread as well - 

     

I was hoping @samsausages may have had some luck tracking down the issue? I am getting it more or less every day, maybe every second day if I am lucky! It is getting very frustrating, to say the least...

     

     

    On 7/1/2023 at 8:33 AM, DrSpaldo said:

I was hoping @samsausages may have had some luck tracking down the issue?


I have not found the root cause. I do know it's Docker, but I haven't spent the time tracking it down because I just used a temporary fix and was hoping it gets worked out. I figured it's related to the Docker changes in Unraid and will be patched eventually.

    I found these temporary fixes in another thread:

     

    Solution 1

If you have the scripts plugin installed, you can use this command, adapted for log files.

     

    find /run/docker/containerd/daemon/io.containerd.runtime.v2.task/moby/ -maxdepth 999999 -noleaf -type f -name "log.json" -exec rm -v "{}" \;

     

This runs on my server every 24 hours, and I have not had the issue since. The Docker daemon will recreate the log so that it keeps logging the health status of your applications, i.e. any Docker application that shows a health status.

     

    Solution 2

The other option is to remove the health check of the running Docker image by adding this to the container's Extra Parameters field:

--no-healthcheck

     

    Solution 3

A third option is to increase the size of the tmpfs mounted on /run with the command below, though at some point it can still fill up. This command sets it to 85MB from the default 32MB:

     

    mount -t tmpfs tmpfs /run -o remount,size=85M


Thanks for the post @samsausages. I will set up a user script for solution 1 every 24 hours and another at first array start for solution 3. As for solution 2, I will just have to see if I can work out which Dockers may be causing issues; I have a few...
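For reference, a user script covering solutions 1 and 3 together could look roughly like this. The find path and the 85M size are taken straight from the post above; the rest is just a sketch, and since the remount does not survive a reboot it needs to run again at first array start.

#!/bin/bash
# Solution 3: grow the /run tmpfs (re-run after every boot / first array start)
mount -t tmpfs tmpfs /run -o remount,size=85M

# Solution 1: clear the containerd health-check logs; the Docker daemon
# recreates log.json as needed
find /run/docker/containerd/daemon/io.containerd.runtime.v2.task/moby/ -maxdepth 999999 -noleaf -type f -name "log.json" -exec rm -v "{}" \;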




