
100% CPU Usage, Server Unresponsive



Hi everyone,

 

I recently added a new HDD to the array and happened to update the OS to 6.11.5 (and more recently to 6.12.1). At the same time, I started having problems with high CPU usage: after a few minutes of uptime, CPU usage would spike into the high 90s, eventually reach 100%, and the server would become unresponsive. Docker apps stopped working and the GUI was unreachable; I just had to shut the server down manually and power it back on.

 

I checked Netdata and apparently IOWAIT is constantly consuming the most resources. When running htop in a terminal I saw... well, I have no idea what I was looking at. I'm attaching the pictures and diagnostics.
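For anyone who wants to reproduce the check: this is roughly what I ran over SSH to confirm the iowait and to see which processes were stuck (iostat comes from the sysstat tools, which as far as I know aren't part of stock Unraid):

```
# per-device utilization every 5s; high %util/await points at a struggling disk
iostat -x 5
# list processes in uninterruptible I/O sleep (state "D") -- these drive IOWAIT
ps axo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
```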

 

My CPU is an Intel® Core™ i7-4770K @ 3.50GHz. I'm completely at a loss here.

 

Could someone help me solve this problem?

netdata cpu report.png

htop_report.png

mushu-diagnostics-20230624-1310.zip


Hello 

 

I have the same problem on my Unraid server.

Initially it was bouncing to 100% very often. I stopped the array first, and the usage changed to a constant 80-100%.

I disabled Docker and it is still a constant 80-100%.

I am trying to gather the diagnostics, but the page keeps crashing.
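I will try collecting them from the console instead; if I understand correctly, something like this should work from the local console or SSH:

```
# writes a diagnostics zip to /boot/logs on the flash drive
diagnostics
ls /boot/logs/
```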

Is there any known bug I can follow to fix my Unraid server?


I am experiencing the same issue here.

My server had been fine for the previous couple of months. I hadn't upgraded to 6.12 over stability concerns, so I'm still on 6.11.5. Recently I added the `CA Mover Tuning` plugin to make better use of my cache drives. Since then my server randomly gets stuck: the WebGUI, Docker, and SMB all stop working. CPU usage soars to close to 100% (there is a specific pattern where some CPU threads sit at 100% and a couple at ~0%, and the whole system just freezes until about 5 minutes later).

Since I manually assigned specific threads to my Docker instances, I originally assumed one of the containers had messed up, but when I checked the usage of all the Dockers, they all seemed fine. One more thing to add: sometimes overall CPU usage is only 25%, but the pattern is the same, 25% of the threads at 100% while the others sit at ~0%. The whole system still freezes in that case, which surprises me because it looks less like a lack of performance and more like the system flow is stuck somewhere very close to the Docker service.

Then I used the `top` command to check CPU usage. It isn't reported in real time while the hang is happening, but right after the system goes back to normal I noticed `shfs`, `dockerd` and `java` had the most CPU usage, and all of them seem connected to Docker.
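Since `top` is useless in real time during the hang, I'm now logging it in batch mode from a second SSH session so the culprit gets captured anyway; a rough sketch (the log path on the flash drive is just my choice):

```
# snapshot the top CPU consumers every 10s; the log survives even if the GUI dies
while true; do
  date >> /boot/logs/top.log
  top -b -n 1 | head -n 20 >> /boot/logs/top.log
  sleep 10
done
```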

I tried disabling Docker, but no good.

I thought this was just me, but then I found this post. It's quite possible that a lot of users are experiencing the same performance issue.

Edited by ezm
32 minutes ago, ezm said:

I am experiencing the same issue here. [...]

Hi guys, I just found an ancient post whose main idea is to change every Docker appdata folder path setting from /mnt/user/appdata to /mnt/cache/appdata, assuming you have a cache drive and want to store appdata on it. I now think the issue is that the CPU is constantly checking files between these two paths and moving them, and until that redundant moving finishes, the whole Docker service is just paused.
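To find the containers that still reference the FUSE path, something like this should work (`sonarr` is just an example name from my setup):

```
# host-side sources of one container's bind mounts
docker inspect --format '{{range .Mounts}}{{.Source}}{{"\n"}}{{end}}' sonarr
# or scan every container for lingering /mnt/user/appdata references
docker ps -q | xargs docker inspect \
  --format '{{.Name}}: {{range .Mounts}}{{.Source}} {{end}}' | grep /mnt/user/appdata
```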

At least it seems to have solved my problem for now. I'll update if anything changes.


Those two paths are just different views of the same file(s). The /mnt/user one goes through the Unraid FUSE layer that handles user shares (which means any files on array drives or other pools are also included), while the /mnt/cache one bypasses it, going directly to the cache drive and showing just what is on the 'cache' pool. There is no file movement involved.
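You can see the difference from a terminal; roughly:

```
df -h /mnt/user/appdata    # filesystem shows as "shfs" -- the FUSE user-share layer
df -h /mnt/cache           # the pool's real filesystem (btrfs/xfs)
ls /mnt/user/appdata       # merged view: cache plus anything on array drives
ls /mnt/cache/appdata      # only what physically lives on the 'cache' pool
```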

16 hours ago, itimpi said:

Those two paths are just different views of the same file(s). [...]

Thanks for your explanation! Unfortunately, the problem hasn't gone away; the 100% CPU usage still happens randomly, so the method I described above isn't actually helping.

Struggling to debug this...


I also noticed that transferring big files, like 3 TV episodes of 7GB each, easily triggers the 100% CPU usage and the server freeze. I copy from a Windows machine directly to the array via SMB.
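To test whether the FUSE layer (shfs) is the bottleneck, my plan is to write a big file both ways while watching the CPU; a rough sketch (the share name is just an example):

```
# ~8GB written directly to the cache pool, bypassing shfs
dd if=/dev/zero of=/mnt/cache/testfile bs=1M count=8192 oflag=direct status=progress
# the same write through the user-share FUSE layer
dd if=/dev/zero of=/mnt/user/Media/testfile bs=1M count=8192 status=progress
rm /mnt/cache/testfile /mnt/user/Media/testfile
```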

What I did: scrubbed the docker.img file, which corrected some errors, and stopped Sonarr, since it looked like it was causing problems with a bad indexer setting.
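(For reference, the scrub can also be run from the CLI; a sketch, assuming the default btrfs docker.img mounted at /var/lib/docker:)

```
btrfs scrub start -B /var/lib/docker   # -B runs in the foreground and prints stats
btrfs scrub status /var/lib/docker
```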

The trigger of this server hang remains unknown. I did find some posts about the C6 power-saving setting in the BIOS. I vaguely recall being able to change the power-related settings in the BIOS, but right now it's hard for me to verify due to my hardware placement. Other than this clue, I have no idea.


I found the problem comes from the m.2 cache drives I connected via a PCIe slot. At one point I happened to fill the cache completely and tried to fix it. Since the "read error corrected" message shows up right after the server recovers from a freeze, I think I need to run a full test on the m.2 drives to fix the issue.

Screenshot 2023-06-27 184514.png
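My plan for the full test, roughly (the device name will differ per system, and the long self-test needs a drive/firmware that supports it):

```
smartctl -a /dev/nvme0        # SMART health, error counters and log
smartctl -t long /dev/nvme0   # kick off an extended self-test
```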


Hello all, and thank you for looking into this problem.

I have checked my Unraid server and here is my current configuration:

1 array of 3 disks

1 cache pool of 2 SSDs

1 cache pool of 1 SSD

1 cache pool of 1 disk

 

I don't think my server errors are triggered by the m.2 drives.

After my server rebooted, I noticed an error I hadn't had before.

The error refers to vfio-pci.

I have attached a screenshot for a better view.
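For reference, this is roughly how I checked what is bound to vfio-pci (the log file is where Unraid puts it on my box, if present):

```
cat /var/log/vfio-pci               # Unraid's VFIO bind log, if present
lspci -nnk | grep -B 3 'vfio-pci'   # devices whose kernel driver in use is vfio-pci
```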

 

Screenshot 2023-06-27 171902.png

1 month later...

I am also having this issue, on 6.12.3 with a Ryzen 3950X. I tried thor2002ro's updated kernel (mostly for Arc GPU compatibility, but also to troubleshoot this), but I see the same behavior. It gets to the point where all 32 threads hit 100%, the web GUI stops responding, and all services running in both VMs and Docker containers cease to operate.

 

EDIT: After viewing the syslog live, I caught it locking up: many instances of "rcu_preempt self-detected stall on CPU," "rcu_preempt detected expedited stalls on CPUs/tasks," and "native_queued_spin_lock_slowpath" for CPU 20, with "PID: dockerd Tainted: P O."
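(For anyone wanting to catch it the same way: I followed the log from a separate SSH session, roughly like this:)

```
# follow the syslog live so the stall traces survive even if the GUI dies
tail -f /var/log/syslog | grep -Ei 'rcu|stall|dockerd'
```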

 

Investigating further as it seems something either Docker or a container is doing is locking things up.

 

EDIT 2: After dealing with intel_gpu_top spawning ~30 processes each using 100% of a CPU thread and fighting the VM Manager locking up the system afterwards, I've determined that in this case, my issue came down to a VM with corrupted data. After moving my entire VM cache drive to the array, running a parity check, and then updating every component of the VM's OS and installed software, things have stabilized. For now, at least.
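In case anyone else hits the runaway intel_gpu_top processes, this is roughly how I spotted and cleared them:

```
ps -C intel_gpu_top -o pid,pcpu,etime,args   # list the runaway instances and their CPU use
pkill -f intel_gpu_top                       # kill them all
```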

Edited by LeeNeighoff
