• 6.12-rc5 - Server at 100% CPU usage


    samsausages
    • Minor

    Process "lsof" is pegging out the CPU at 100% on the server.  When viewing Tower/Main page the CPU on the host is seeing high utilization as well. 


The log was spammed with the following about an hour ago, though it isn't spamming right now.

    Quote

May 6 23:16:48 bertha nginx: 2023/05/06 23:16:48 [crit] 12671#12671: ngx_slab_alloc() failed: no memory
May 6 23:16:48 bertha nginx: 2023/05/06 23:16:48 [error] 12671#12671: shpool alloc failed
May 6 23:16:48 bertha nginx: 2023/05/06 23:16:48 [error] 12671#12671: nchan: Out of shared memory while allocating message of size 28112. Increase nchan_max_reserved_memory.
May 6 23:16:48 bertha nginx: 2023/05/06 23:16:48 [error] 12671#12671: *21081963 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/devices?buffer_length=1 HTTP/1.1", host: "localhost"
May 6 23:16:48 bertha nginx: 2023/05/06 23:16:48 [error] 12671#12671: MEMSTORE:00: can't create shared message for channel /devices

     




    User Feedback

    Recommended Comments

    4 hours ago, JorgeB said:

    Diagnostics please.


I attached it. Looks like the log got cleared when I rebooted after the system hung last night. When I checked the log last night, before it locked up completely, it was all spammed with what I posted above.
I have now enabled the syslog server to try to retain the logs if it happens again.

    bertha-diagnostics-20230507-0728.zip


I just had one of my cache pools disconnect as a failed drive, but a reboot brought it right back. So I'm going to test my drives thoroughly, as a few of them are new and have only been in there for a few days with little utilization. I'm going to take them out and run some tests on my desktop to eliminate them as a variable.


@JorgeB So I have been doing lots of testing... I had 2 SMART errors pop up, one each time this happened and Unraid went to 100% CPU usage (requiring a reboot).

The SMART error is as follows:
     

    Quote

Entry 0:
ErrCount (Error Count): 2
SQId (Submission Queue Identifier): 4
CmdId (Command Identifier): not available
Status: 0xc00c
PELoc (Parameter Error Location): not available
LBA (Logical Block Address): 0
NSID (Namespace Identifier): not available
VS (Vendor Specific): not available

Entry 1:
ErrCount (Error Count): 1
SQId (Submission Queue Identifier): 8
CmdId (Command Identifier): not available
Status: 0xc00c
PELoc (Parameter Error Location): not available
LBA (Logical Block Address): 0
NSID (Namespace Identifier): not available
VS (Vendor Specific): not available

     

Deep diving into the error, status code 0xc00c indicates a "Completion Queue Invalid" error. From what I've read, this error typically occurs when there's an issue with the completion queue head pointer or doorbell. It might be a firmware or driver issue, or even a hardware problem with the NVMe drive.

I removed the drive from the system and ran several full SMART checks on a Windows system, with no issues (2x).
I then used the Intel utility to perform several full diagnostic read/write tests on the drive (3x).

I then reinstalled the drive in my Unraid server. I wrote a script that copies files to the drive and runs a battery of read/write tests.
So far I have not been able to recreate the error/issue, and I have completed 4 full drive writes and checksum tests on the written data. The error has not popped back up.

So I'm still a bit unsure what caused it; I'm going to tweak my script a bit to throw more variety at the drive. The one thing I'm doing differently than before is that I had the "docker" folder on this drive and set Docker to use the ZFS file system (not an image). I haven't tested whether the ZFS Docker file system integration has anything to do with it.
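In case it's useful, here is roughly what the test script does. This is a simplified sketch rather than the exact script; the /mnt/user/testdata and /mnt/testpool/burnin paths and the pass count are placeholders.

#!/bin/bash
# Sketch of a write-and-verify burn-in loop (paths and pass count are examples)
SRC=/mnt/user/testdata      # files to copy onto the drive under test
DST=/mnt/testpool/burnin    # mount point of the drive/pool being tested
PASSES=4

for i in $(seq 1 "$PASSES"); do
  mkdir -p "$DST/pass$i"
  rsync -a "$SRC/" "$DST/pass$i/"
  # checksum the source and the copy, then compare
  (cd "$SRC" && find . -type f -exec sha256sum {} + | sort -k2) > /tmp/src.sha
  (cd "$DST/pass$i" && find . -type f -exec sha256sum {} + | sort -k2) > /tmp/dst.sha
  if diff -q /tmp/src.sha /tmp/dst.sha >/dev/null; then
    echo "pass $i: OK"
  else
    echo "pass $i: CHECKSUM MISMATCH"
  fi
done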

    But I wanted to give an update and see if you had anything to add to it.


It is happening again right now, so I can provide more info and logs. I hope this helps narrow it down, as it's requiring a reboot every few days.
Check point #7 below, when I stop the array.
If I had to guess, it's related to the nchan webGUI device-list processes; it looks like there are a lot of them and they keep building up.

I went several days and it was working fine. I ran extensive tests on my individual devices and the pool, both in my workstation and in the server: writing varying types of data 10x to each individual drive and then another 10x when configured as a ZFS pool, each time covering the full size of the disks and verifying checksums. I feel pretty good about my hardware being solid. I suspect the SMART warning was logged because Unraid became unresponsive and the drive ran into what it saw as a driver error.


    While I haven't been able to recreate the issue on demand, I can expand on the symptoms and provide logs while it is happening:
1. CPU usage keeps increasing until it hits 100% and freezes the system. It seems like more threads are being utilized as time goes on, but I'm not sure what triggers it. It can sit at 40% usage for some time, then suddenly go to 100%.
2. The "Main" page causes 100% browser CPU usage on the client. The MB/s speed info on the page updates several times per second.
3. On the "Main" page you cannot click on any of the disks or browser links to bring up Attributes or SMART info. (Other pages are fine, and I can still stop the array from the Main page. Putting the disk-attributes URL directly into the browser works; it's just the links on the Main page that are broken.)
4. No errors in the log specific enough for me to know what's happening.
5. Usually no single process accounts for the 40-100% CPU usage, but I frequently see 'lsof' make the list, sometimes more than once. It's reflected in the htop process list below, along with lots of nchan device-list processes (see the sketch after this list for how I'm counting them).
6. Stopping Docker has no effect.
7. Stopping the array does! CPU usage goes down, but stays high at about 10%. Upon stopping the array, the log is flooded with the errors below:

    May 13 16:18:09 bertha nginx: 2023/05/13 16:18:09 [crit] 253614#253614: ngx_slab_alloc() failed: no memory
    May 13 16:18:09 bertha nginx: 2023/05/13 16:18:09 [error] 253614#253614: shpool alloc failed
    May 13 16:18:09 bertha nginx: 2023/05/13 16:18:09 [error] 253614#253614: nchan: Out of shared memory while allocating message of size 26778. Increase nchan_max_reserved_memory.
    May 13 16:18:09 bertha nginx: 2023/05/13 16:18:09 [error] 253614#253614: *10550928 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/devices?buffer_length=1 HTTP/1.1", host: "localhost"
    May 13 16:18:09 bertha nginx: 2023/05/13 16:18:09 [error] 253614#253614: MEMSTORE:00: can't create shared message for channel /devices


8. Restarting the array brings the issue back immediately, requiring a reboot. I guess I could try restarting specific services like nginx, but I haven't tried that.

9. No new SMART warning logged; I believe this is because I rebooted before the CPU went to 100% and the system became totally unresponsive.
10. I'm uploading 2 sets of logs. #1 is from the server running with high CPU usage but no errors flooding the log. #2 is with the array stopped and the log flooded.
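For anyone who wants to compare numbers while theirs is climbing, these are the kinds of commands I'm using to count the nchan publisher processes and watch the top CPU consumers. The grep pattern just matches what shows up in my htop; adjust it to whatever yours shows.

ps -ef | grep "[n]chan" | wc -l
watch -n 5 'ps -eo pid,pcpu,comm --sort=-pcpu | head -n 12'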

Example of top 10 processes with top CPU usage: (screenshot attached)


Example of htop: (screenshot attached)

    babel-diagnostics-20230506-1546.zip bertha-diagnostics-20230513-1621.zip


I think I have traced this to something filling up the tmpfs mounted on /run.
I haven't figured out what is filling it yet, but it looks like the issues begin when that tmpfs fills up.

     

Going to try to trace down the exact cause of it filling up this weekend.
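In case anyone else wants to watch for the same thing, these are the standard commands I'm using to see how full the /run tmpfs is and what is growing inside it (nothing Unraid-specific):

df -h /run
du -ah /run 2>/dev/null | sort -rh | head -n 20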


I really believe that I am having the same issue after upgrading as you!
Upgraded to v6.12.0-rc6, and CPU and RAM usage have been much higher than before!

CPU is now always around 80%+, and the find and shfs services are now always at the top!

At one time I had to cut the power because it stayed unresponsive for hours.

    Diag file attached.

     


    I am getting pretty much the same thing. See my post in this thread as well - 

     

I was hoping @samsausages may have had some luck tracking down the issue? I am getting it more or less every day, maybe every second day if I am lucky! It is getting very frustrating, to say the least...

     

     

    On 7/1/2023 at 8:33 AM, DrSpaldo said:

I was hoping @samsausages may have had some luck tracking down the issue?


I have not found the root cause. I do know it's Docker, but I haven't spent the time tracking it down because I just used a temporary fix and was hoping it gets worked out. I figured it's related to the Docker changes in Unraid and will be patched eventually.

    I found these temporary fixes in another thread:

     

    Solution 1

If you have the scripts plugin installed, you can use this command, adapted for log files.

     

    find /run/docker/containerd/daemon/io.containerd.runtime.v2.task/moby/ -maxdepth 999999 -noleaf -type f -name "log.json" -exec rm -v "{}" \;

     

This runs on my server every 24 hours, and I have not had the issue since. The Docker daemon will recreate the log so that it keeps logging the health status of your applications, i.e. any Docker application that shows a health status.

     

    Solution 2

The other option is to remove the health check of the running Docker image by adding this to the container's Extra Parameters field:

--no-healthcheck

     

    Solution 3

A third option is to increase the size of the tmpfs mounted on /run with the command below, though at some point it can still fill up. This command sets it to 85MB from the default 32MB:

     

    mount -t tmpfs tmpfs /run -o remount,size=85M


Thanks for the post @samsausages. I will set up a user script for solution 1 every 24 hours and another at first array start for solution 3. As for solution 2, I will just have to see if I can work out which Dockers may be causing issues; I have a few...
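For reference, a user script covering solutions 1 and 3 together could look roughly like this. The find path and the 85M size are taken straight from the post above; the rest is just a sketch, and since the remount does not survive a reboot it needs to run again at first array start.

#!/bin/bash
# Solution 3: grow the /run tmpfs (re-run after every boot / first array start)
mount -t tmpfs tmpfs /run -o remount,size=85M

# Solution 1: clear the containerd health-check logs; the Docker daemon
# recreates log.json as needed
find /run/docker/containerd/daemon/io.containerd.runtime.v2.task/moby/ -maxdepth 999999 -noleaf -type f -name "log.json" -exec rm -v "{}" \;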




