Report Comments posted by thespooler

  1. @bonienl I appreciate you commenting on this.  I'm happy to hear there is a change going forward that might address this. 

     

    While it's true we might not all be suffering this issue for the same reasons, I see some people hit it after just a few days, so I'm thankful mine only happens after 20+ days of uptime, which I appreciate also makes it harder to understand what is happening. It feels like a slow memory leak to me.

     

    In regards to memory: in my use case, when things start misbehaving I always hit the Dashboard to verify the log is at 100%, and conveniently Memory Utilization is (going from my own memory) the only other stat still working. So I always get a sense of memory use afterwards, and it was never anything that alarmed me. But here's the memory picture from diagnostics once the system is toast:

     

                  total        used        free      shared  buff/cache   available
    Mem:           11Gi       3.6Gi       1.7Gi       1.0Gi       6.1Gi       6.6Gi
    Swap:            0B          0B          0B
    Total:         11Gi       3.6Gi       1.7Gi

     

    nginx was still running with the same process ID indicated in the logs, so I don't think any additional memory was suddenly freed up once it started throwing signal 6s. Without a notification that the log is filling up, it's hard to know when this is happening and catch it in real time, and then there's such a short window before it's all over.
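
    For anyone trying to catch that window, the Log gauge on the Dashboard is (as far as I can tell) just the fill level of the /var/log tmpfs, so something like this sketch can watch it from a shell, assuming the stock Unraid layout with syslog written to a small tmpfs at /var/log:

    # How full the log filesystem is right now (the same number the Dashboard shows)
    df -h /var/log

    # Re-check every 60 seconds and show the newest syslog lines while waiting
    watch -n 60 'df -h /var/log; tail -n 5 /var/log/syslog'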

     

    After the thousands of lone signal 6 errors, memory errors start mixing in with the signal 6s:

     

    Jan  4 00:21:43 Tera nginx: 2022/01/04 00:21:43 [crit] 14673#14673: ngx_slab_alloc() failed: no memory
    Jan  4 00:21:43 Tera nginx: 2022/01/04 00:21:43 [error] 14673#14673: shpool alloc failed
    Jan  4 00:21:43 Tera nginx: 2022/01/04 00:21:43 [error] 14673#14673: nchan: Out of shared memory while allocating message of size 7391. Increase nchan_max_reserved_memory.
    Jan  4 00:21:43 Tera nginx: 2022/01/04 00:21:43 [error] 14673#14673: *8142824 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/disks?buffer_length=1 HTTP/1.1", host: "localhost"
    Jan  4 00:21:43 Tera nginx: 2022/01/04 00:21:43 [error] 14673#14673: MEMSTORE:00: can't create shared message for channel /disks
    Jan  4 00:21:44 Tera nginx: 2022/01/04 00:21:44 [alert] 16163#16163: worker process 14673 exited on signal 6

     

    And these signal 6s (if not all of them) are coming from nchan's process ID.
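
    If anyone wants to double-check that against their own diagnostics, something like this sketch should do it, assuming the syslog sits at /var/log/syslog as in the diagnostics zip:

    # Which worker PIDs are dying with signal 6, and how often
    grep 'exited on signal 6' /var/log/syslog | grep -o 'process [0-9]*' | sort | uniq -c | sort -rn

    # Which PIDs are logging the nchan / shared-memory errors; the same PIDs
    # showing up in both lists is what points the signal 6s at nchan
    grep -E 'nchan|ngx_slab_alloc' /var/log/syslog | grep -oE '[0-9]+#[0-9]+' | sort | uniq -c | sort -rn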

     

    And finally, to tie it all back to old logins: the thousands of signal 6s come after this chunk in the logs. Though these two entries are 20+ minutes apart, they sit back to back in the log:

     

    Jan  3 18:02:18 Tera nginx: 2022/01/03 18:02:18 [error] 15731#15731: *8026641 limiting requests, excess: 20.409 by zone "authlimit", client: 192.168.0.101, server: , request: "PROPFIND /login HTTP/1.1", host: "tera"
    Jan  3 18:27:53 Tera nginx: 2022/01/03 18:27:53 [alert] 16163#16163: worker process 15731 exited on signal 6

     

    Does the GUI use WebDAV calls like PROPFIND? That IP is my main desktop, and there's nothing WebDAV-ish that I use. I don't know if Brave might try some discovery of its own, but it seems like a stretch.
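
    For what it's worth, something like this should list exactly which clients and request methods are tripping the authlimit zone, again assuming the syslog lives at /var/log/syslog:

    # Summarize client IP, method and path for every rate-limited request
    grep 'limiting requests' /var/log/syslog \
        | sed -E 's/.*client: ([0-9.]+).*request: "([A-Z]+) ([^ ]+).*/\1 \2 \3/' \
        | sort | uniq -c | sort -rn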

     

    Not sure if your third suggestion is referring to plugins. Everything extravagant runs in Docker. The majority of my plugins are yours or Squid's.

     

    With my 20+ days of uptime, I'm going to avoid going into safe mode for now.

  2. I have experienced this same problem, and it's frustrating to see the same issue recurring across multiple threads for many years. I don't think I've seen anything from the Unraid devs on this other than a suggestion to run a memory check. Multiple years, dozens of users, and near silence.

     

    My uptime lasts about 20-25 days, then the logs start filling up, and once the log hits 100% the GUI has major issues. Most of the web UI works in the sense that you can move around sluggishly, but as noted the websockets fail completely; the only time I look at the Dashboard is to confirm the log has hit 100% and it's time to restart. After which I would suffer a forced multi-day parity check (I recently discovered that was caused by restarting with an SSH session still open, so I'm glad that problem at least is resolved).

     

    Usually I know the logs are full when the docker menu starts misbehaving.   

     

    My first suggestion is to send a series of notifications as the log nears capacity, the way we already get them for drive usage and temperature. This is an absolute necessity. It won't give a big window for reacting to the issue, but it's something.
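
    Until something like that exists, a small script on a schedule could approximate it. This is only a sketch: it assumes the stock /var/log tmpfs and Unraid's bundled notify helper at /usr/local/emhttp/webGui/scripts/notify, so double-check both on your own box:

    #!/bin/bash
    # Raise an Unraid notification once /var/log passes a fill threshold.
    # Intended for cron or the User Scripts plugin, e.g. every 15 minutes.
    THRESHOLD=80
    USED=$(df --output=pcent /var/log | tail -1 | tr -dc '0-9')
    if [ "$USED" -ge "$THRESHOLD" ]; then
        /usr/local/emhttp/webGui/scripts/notify \
            -e "Log space" \
            -s "Log is ${USED}% full" \
            -d "/var/log is filling up; nginx/nchan failures usually follow" \
            -i "warning"
    fi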

     

    In my most recent experience it took 6 hours and 7,500+ of these errors, roughly one every 2 seconds. Not sure how usable the system is during this stage.

     

    line 2829: Jan  3 18:27:53 Tera nginx: 2022/01/03 18:27:53 [alert] 16163#16163: worker process 15731 exited on signal 6
    ....
    line 10504: Jan  4 00:21:40 Tera nginx: 2022/01/04 00:21:40 [alert] 16163#16163: worker process 14537 exited on signal 6

     

    Then nginx and nchan start failing with memory allocation errors and things get much worse; that stage only lasts 8 minutes before the log is completely full and it all just stops.

     

    Jan  4 00:21:41 Tera nginx: 2022/01/04 00:21:41 [crit] 14620#14620: ngx_slab_alloc() failed: no memory
    ...
    Jan  4 00:29:57 Tera nginx: 2022/01/04 00:29:57 [error] 12073#12073: *8145363 nchan: error publishing message (HTTP status code 507), client: unix:, server: , request: "POST /pub/cpuload?buffer_length=1 HTTP/1.1", hos
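
    Since that log line points straight at nchan_max_reserved_memory, it might be worth seeing what the shipped config actually sets it to. A sketch, assuming the nginx config lives under /etc/nginx:

    # See whether, and where, nchan's shared memory pool size is configured.
    # Raising it would presumably only buy time if something keeps filling the
    # pool, but a bigger window before failure would still help.
    grep -Rn 'nchan_max_reserved_memory' /etc/nginx/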

     

    And then 3 days later, on Jan 7, I finally discover Unraid is having issues, because there were zero notifications to tell me sooner.

     

    In my use case I don't use the web terminals. I think that failure is just another indicator of the log being full, like the Docker menu and Dashboard failures: simple HTML templates still function, but anything more advanced fails.

     

    What does ring a bell though is the idea behind stale tabs.  I have a desktop and a laptop that I rotate through, and each one has a Brave tab open to Unraid for admin convenience.  I usually sit on Docker or Main tabs.  I don't think there should be anything wrong with that.  But as a workaround for now I will stop this behavior or see if I can find an extension to reload the tab each day.  I remember the days I only went down for an Unraid upgrade.  Those days have been sorely missed for years.  

     

    I find it interesting that quite a few people are just restarting nginx and life goes on.  What are you doing about the logs being at 100%?  Just ignoring it, or is there a service to restart that as well?
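
    For anyone landing here later, the restart-instead-of-reboot workaround I keep seeing posted looks roughly like the sketch below. I haven't tried it myself yet, and the paths are assumptions from my own box:

    # Restart only nginx (and with it nchan's websocket publishing) instead of rebooting
    /etc/rc.d/rc.nginx restart

    # The Log gauge tracks the /var/log tmpfs, so it stays at 100% until old
    # entries are cleared out, for example by emptying the syslog
    truncate -s 0 /var/log/syslog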