Unraid 6.8.3: Random bouts of 100% CPU usage, making server useless


Hey guys, this is the second time this has happened this week. The wife and kids are watching Plex and suddenly it starts to stutter and eventually stops playing. I remote into the server from work and see 100% CPU usage on all cores of my Ryzen 5 2600.

 

The first time, I assumed the parity check that was running was causing the issue, so I stopped it and rescheduled it for later, and after a reboot everything was fine. But this time there was no parity check running and the mover wasn't running, so I'm not sure what's causing it. I decided NOT to reboot this time, and instead downloaded the diagnostics (attached) and let it run its course. It did eventually stop and go back to normal CPU usage, but this shouldn't happen to begin with, and I don't know what's causing it.

 

UPDATE: Wife just told me it's been fine all day until around 2:00-2:30 this afternoon. It's currently 5:13 where I am as I'm typing this.

 

 

serverus-diagnostics-20200612-1702.zip

Link to comment

The "top" log has all of your CPU usage in it. You can also run "top" (lowercase) in the terminal to view it live the next time you're maxing out your CPU. Here are the top entries from your log:


  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 8279 root      20   0 2978932 399240 127800 S  43.8   1.2 913:52.27 Web Conte+
 8143 root      20   0 4245108   1.8g 153688 S  18.8   5.8 226:54.19 firefox
 8326 root       0 -20       0      0      0 R  18.8   0.0  58:38.44 loop2
   10 root      20   0       0      0      0 I   6.2   0.0   6:17.75 rcu_sched
  855 root      20   0       0      0      0 S   6.2   0.0  75:47.48 kswapd0
 3740 root      20   0       0      0      0 I   6.2   0.0   0:49.65 kworker/u+
 6814 root      20   0  187516  64380  50764 S   6.2   0.2  98:08.30 Xorg
 7865 root      20   0       0      0      0 I   6.2   0.0   0:01.89 kworker/u+
15190 root      20   0    6788   3092   2180 R   6.2   0.0   0:00.01 top
15471 root      20   0 1631980 832512      0 S   6.2   2.5  59:29.26 xteve
20238 root      20   0   81144  11156   3840 D   6.2   0.0  18:19.88 Xvfb
31898 root      20   0       0      0      0 I   6.2   0.0   0:17.14 kworker/u+

The COMMAND column on the far right shows what is using the CPU.

Link to comment

Hmm, OK. I just SSHed in and ran "top". I'm seeing several things with usage in the 20s, but I'm not sure what to do about any of them. They don't appear to be any of the containers I have running. One of them is Firefox; could that be because I'm booted in GUI mode?

Interestingly enough, if I run htop instead my CPU usage looks normal. But that doesn't change the fact that something is clearly wrong: my server is so overloaded that I'm going to have to reboot soon or no one can watch TV. I really need to figure out why this keeps happening. It's so bad that not only is Plex not working, but the Unraid web UI is just barely responsive.

Link to comment
6 hours ago, relink said:

One of them is Firefox; could that be because I'm booted in GUI mode?

Yes, but the main culprit appears to be this one:

 

nobody   20324 47.4  3.5 10036260 1181400 ?    SNsl Jun09 2054:52  |   |       |   \_ /usr/local/crashplan/bin/CrashPlanService

 

Try shutting down CrashPlan when it happens again.

Link to comment

I believe you may have been right. I had to try something, so I just started shutting down containers until my CPU usage dropped. When I saw how big a difference CrashPlan made, I pinned it to a single core. Now that single core is maxed out, and the rest of my CPU looks normal.

 

But I had this problem once before with CrashPlan, probably close to 2 years ago. I fixed it back then and it has not been an issue since. Any idea why it would suddenly become a problem again?

Link to comment

Ok, I guess I spoke too soon. The issue just crept back up within the last hour. My son was watching a movie and I noticed it just stopped playing and when I checked the server, sure enough 100% usage on all cores. I attached an updated diag.

 

Here's the kicker though: I went into the CPU pinning screen and set every single container and VM to a specific number of cores, and there is not one single thing running on here that is able to use all the CPU cores. Most things are limited to 2-4 cores; Plex has the most at 10 out of 12 cores.

 

Luckily I have learned that stopping and restarting the array seems to fix the issue, so at least I don't have to perform a full reboot. But I have to get this fixed. Unfortunately I'm not sure what's causing it, especially since "top" and "htop" don't appear to be showing the whole picture.

serverus-diagnostics-20200615-2106.zip

Link to comment
On 6/20/2020 at 2:41 AM, johnnie.black said:

You could try disabling all dockers and let it run for a few days, if all OK then start enabling one by one.

Ouch. There must be a better way to find out what's causing this.

Is there not a more accurate task manager that could show what's causing the 100% CPU usage? Also, the last time around I noticed near 100% RAM usage too.

Link to comment

Hi there,

 

Saw your email to support and wanted to chime in on your thread here. Unfortunately johnnie.black is right in that you're going to need to take the "one at a time" approach to figure out the root cause. The main problem here is that there wasn't some "event" that occurred prior to these issues that we can point to. Everything was fine until it wasn't. When issues like that happen, 99 times out of 100 it's because of something amiss with the hardware or a plugin/container update that broke something. Do you have your containers set to auto-update or do you manually update them?

 

You can absolutely check out htop from the command line (just type htop in a terminal session) to see more detailed process reporting. Even then, you will likely still have to resort to shutting down all your containers, letting the system run for a while to see if the CPU usage spikes on its own, and if not, slowly turning containers back on one by one until you find the culprit. I wish I had better advice for you, but again, when the issues come out of nowhere like this and there wasn't some event that occurred right before they manifested, there is just no other way to narrow it down.

Link to comment

As of the crash yesterday, I now only have the bare essential containers running and no VMs. If I can go a few days without another issue then I will start re-enabling things. If I crash again, then I will disable all containers and see what happens. 
 

The part that I find confusing about this is that there is not a single container or VM in my system that has access to all CPU threads. Plex has access to the most, and even it's capped at 10 out of 12; everything else is limited to between 2 and 4.

Link to comment

I haven't read the thread fully, so apologies for that, but I'm curious:

 

Have you seen a 100% CPU crash from top/htop, or just from the GUI?

I ask because the GUI also takes into account iowait in the CPU usage. This will spike any time the system is waiting on I/O (ie, disks), so I'm wondering if you've got a dodgy HBA or similar causing crazy latency on your disks. This can look like high CPU, because you'll see the graphs max out, and everything will slow to a crawl, but it's actually just that nothing can pull the data it needs.
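One way to check this from a terminal is to look at iowait directly; top reports it as "wa" in its CPU summary line. The sketch below is only an illustration and not something posted in the thread: it assumes the standard Linux /proc/stat layout, and the function names iowait_pct and sample_iowait are made up for this example.

```shell
# Print the share of CPU time spent in iowait between two samples of the
# aggregate "cpu" line from /proc/stat.
# Fields after the "cpu" label: user nice system idle iowait irq softirq steal ...
iowait_pct() {
    # $1 = first "cpu ..." line, $2 = second "cpu ..." line
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= 9; i++) { t1 += $i }; w1 = $6 }
        NR == 2 { for (i = 2; i <= 9; i++) { t2 += $i }; w2 = $6 }
        END {
            dt = t2 - t1
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (w2 - w1) / dt
        }'
}

# Sample the live system over a number of seconds (default 5).
sample_iowait() {
    local a b
    a=$(grep '^cpu ' /proc/stat)
    sleep "${1:-5}"
    b=$(grep '^cpu ' /proc/stat)
    iowait_pct "$a" "$b"
}
```

Running sample_iowait 10 while the server is struggling gives a rough iowait percentage. A high number alongside modest per-process CPU in htop would be consistent with the disk-latency explanation rather than a runaway process.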

Link to comment

Try this:

 

# screen (install it from nerdools if you don't have it)

 

#screen

#cd /

# while true; do ps -eo comm,pcpu | grep -Ev ' 0\.0 *$|%CPU' >> cpu.log; echo "do a little dance, get down tonight"; sleep 1; done &

 

If the server dies and you reboot, log back in and run:

# cd /

# tail -f cpu.log 

# cat cpu.log | more    (and look for the app taking the most CPU)

 

But the process of elimination suggested here is the way to go: turn everything off and then turn each docker/container/VM on one at a time.
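Paging through cpu.log by hand gets tedious, so here is a rough sketch of how the log could be summarized instead. It is an illustration under assumptions, not part of the original suggestion: the function name summarize_cpu_log is made up, and it assumes each line ends with the %CPU value as produced by ps -eo comm,pcpu. Two caveats: ps reports %CPU averaged over each process's lifetime rather than for the last second, and Unraid's root filesystem lives in RAM, so a log written under / would likely not survive a reboot (pointing it at persistent storage is an assumption left to the reader).

```shell
# Summarize a log of "COMMAND %CPU" lines: total and peak %CPU per command.
# Command names may contain spaces, so the last field is taken as %CPU.
summarize_cpu_log() {
    # $1 = log file (defaults to /cpu.log), $2 = number of rows to show
    awk '
        {
            pct = $NF + 0            # force numeric comparison
            $NF = ""                 # drop the %CPU field, keep the command
            sub(/[ \t]+$/, "")       # trim the trailing separator
            sum[$0] += pct
            if (pct > max[$0]) max[$0] = pct
        }
        END {
            for (cmd in sum) printf "%.1f\t%.1f\t%s\n", sum[cmd], max[cmd], cmd
        }' "${1:-/cpu.log}" | sort -rn | head -n "${2:-10}"
}
```

The output columns are summed %CPU, peak %CPU, and command name, with the heaviest commands first, for example summarize_cpu_log /cpu.log 5.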

Edited by johnwhicker
Link to comment

So I think I managed to catch things as they were falling apart this time. It seems that the issue is coming from running out of RAM. I don't know how Unraid handles that. Does it have a swap file? If so, where is it?

 

Anyway, I immediately SSHed into Unraid and ran htop, and simply didn't see anything using that much RAM. Same with top. Despite this, even with all containers and VMs stopped, the RAM usage never dropped below 54%. After restarting the array with all my main containers running, I haven't gone over 19% RAM usage.
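Memory held by the page cache, tmpfs, or kernel slab caches is not charged to any process, so it would not show up in top or htop. A quick way to see where it went is /proc/meminfo. The sketch below is only an illustration: the function name mem_breakdown is made up, and it assumes a kernel that reports MemAvailable.

```shell
# Show how much RAM is in use overall and how much sits in buckets that are
# not attributed to any process (page cache, tmpfs/shared memory, kernel slab).
mem_breakdown() {
    # $1 = meminfo-style file (defaults to /proc/meminfo); values are in kB
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        $1 == "Cached:"       { cached = $2 }
        $1 == "Shmem:"        { shmem = $2 }
        $1 == "Slab:"         { slab = $2 }
        END {
            if (total == 0) exit 1
            printf "used_pct=%.1f\n", 100 * (total - avail) / total
            printf "cached_mib=%d\n", cached / 1024
            printf "shmem_mib=%d\n", shmem / 1024
            printf "slab_mib=%d\n", slab / 1024
        }' "${1:-/proc/meminfo}"
}
```

Running mem_breakdown with no arguments reads the live system. A large shmem_mib would point at files sitting in RAM-backed filesystems, while a large slab_mib would point at kernel caches.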

 

I have attached 2 diags this time. The first one is from before I restarted the array, with everything stopped except Pi-hole and Unbound. The other is from after restarting the array, with my main containers running.

serverus-diagnostics-20200629-2017.zip serverus-diagnostics-20200629-2013.zip

Link to comment

Relink,

 

Hello... I'm not going to be a big help to you here... I can only share what happens with my system... I have the Ryzen 2300G and 8 GB of RAM...

 

I run there dockers only... and after nearly a year running without issues I started to notice that my RAM usage was 80%+. I would shut down/reboot my dockers and it would bring things back in line to about 50%... and within a few days it would be back up to 80%...

 

So I don't know if it's some kind of memory creep in Unraid or not... but I just elected to buy more RAM, and since DDR4 prices have dropped so much I bought 16 GB more...

 

So what you're describing is a bit more extreme than my situation, but I hope this helps.

Link to comment
14 hours ago, -Daedalus said:

Have you seen a 100% CPU crash from top/htop, or just from the GUI?

This is exactly what I see. I only see the 100% usage in the GUI. In htop everything looks normal. But that still doesn't stop Docker from becoming completely unresponsive.

 

I have actually had an issue with either my HBA or expander, I'm not sure which. But it's an issue I've had for quite a while now, and the problem I'm having now is fairly new. Anyway, any time I go to reboot my Unraid server I will generally have to reboot a minimum of 1-2 times to actually get all my disks to show up. On the first boot I'm guaranteed to have several disks missing from the array. However, once I get all the disks to show up again, everything has always seemed to run OK.

Link to comment

I'm checking up on my server this morning and I'm already seeing the RAM usage getting up to 72%. However, htop shows the process using the most RAM is Plex at only 8.7%, and the Plex dashboard confirms this number. CPU usage is between 20-30%, which for the current load is only slightly above average and isn't anything that would freak me out.

Link to comment

So I've been going through every single line and setting on every single page of my Unraid server trying to see if anything jumps out at me. One thing did: I have a plugin installed called "Dynamix Cache Directories". I don't remember if this comes with Unraid or if I installed it. Anyway, I read up on what it does and decided to try disabling it. It is also by far the oldest plugin on my system, with the most current version showing as "2018.12.04".

 

Since disabling it, which was only 2 days ago, I haven't crashed, and I've had RAM usage in the 50% range instead of 80+%, and CPU usage seems to be staying around or under 20%.

Link to comment

I'm beginning to think it may be related to some disk problems I have been having. I have looked through my syslog server and see pretty consistent CRC errors from all of my drives. So I have all new cables on the way for my HBA and SAS expander.

I noticed a crash happened a couple of minutes after adding a new series to Sonarr, so it locked up just as the new episodes began flooding into the array. That's what it seemed like anyway.

Cables will be here Wednesday, I guess I'll see what happens.
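To tell whether new cables actually help with the CRC errors mentioned above, the per-drive counter can be compared before and after the swap. On many SATA drives it is reported as SMART attribute 199 in the output of smartctl -A. The sketch below is only an illustration: the function name crc_count and the device path in the usage note are made up, and it assumes the usual ten-column attribute table with the raw value in the last column.

```shell
# Read "smartctl -A" output on stdin and print the raw value of SMART
# attribute 199 (interface CRC error count on many SATA drives).
# Exits non-zero if the attribute is not present.
crc_count() {
    awk '
        $1 == 199 { print $10; found = 1 }
        END { if (!found) exit 1 }'
}
```

Usage would look like smartctl -A /dev/sdb | crc_count, repeated for each drive. The raw count never resets, so what matters is whether it keeps climbing after the cables are replaced.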

Link to comment
