
How do I identify the cause of very high CPU usage?



I've noticed several times in the last few weeks that my server is showing very high CPU utilization:

 

[screenshots: Dashboard CPU utilization]

Obviously, the numbers bounce around, but it will stay at this level for a while.

 

I fired up an SSH session and ran top and it shows me something along these lines:

[screenshot: top output]

The CPU percentages here add up to around 45% (with some rounding).

 

What are good ways to track CPU utilization over time so I can determine what is causing these spikes?
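
For reference, the simplest thing I can think of is logging top in batch mode from an SSH session so there's a history to look back through; this is just a sketch - the interval and output path are arbitrary:

    # Append a timestamp plus the top of the process list once a minute.
    while true; do
        date >> /tmp/cpu-history.txt
        top -b -n 1 | head -n 20 >> /tmp/cpu-history.txt
        sleep 60
    done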


The CPU is generally at low usage, even with the plugin enabled. On occasion I see it nearly pinned, and I'd like to know what to do to try to track down what's causing it.

 

I agree that it could be FIP causing the issue, and I think b2sum is part of the FIP, but I'm not certain. However, what the GUI was showing in those screenshots and what top was showing me don't seem to come close to adding up. It could well be my misunderstanding of how *nix works, and I'm just not interpreting the results of top correctly.
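
For what it's worth, part of the mismatch may simply be that top's per-process %CPU is measured against a single core while the dashboard averages across all cores; a per-core view makes the two easier to compare. This assumes mpstat is available (it comes from the sysstat package, so it may need installing); pressing 1 inside top shows a similar breakdown:

    # Per-core utilization, refreshed every 5 seconds (requires sysstat's mpstat).
    mpstat -P ALL 5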

 

The server does generally seem sluggish when the GUI shows high utilization (the sluggishness is usually what sends me to the dashboard in the first place), so the GUI's representation does seem to be accurate.

  • 9 months later...

I never found a resolution to this issue, and now it seems to be back. I do get times of incredibly high CPU utilization on occasion, but generally they last a few minutes. This time, it has been going on for basically 24 hours now, and I haven't a clue how to track it down.

 

Using the Ultimate UNRAID Dashboard, showing CPU utilization over the last 24 hours, I grabbed these two graphs this morning:

[screenshots: UUD CPU utilization graphs over the last 24 hours]

Based on these screenshots, it appears that the utilization drops off during the overnight hours (22:00 - 04:00 it seems reasonably "normal"), which would indicate that it's end-user created. However, this is 99% a media server with some Dockers running and no VMs. I'm not doing anything (that I'm aware of) that would execute on the server. I do have the Handbrake docker installed, but it's not even currently running, and that's about the only thing I'm aware of where I'd intentionally be putting a heavy load on the server.

 

I just grabbed this screenshot of top running:

[screenshot: top output]

Here are the diagnostics I downloaded this morning (grabbed right after I took the graph screenshots): nas-diagnostics-20210513-0634.zip

 

As you can see here, the CPU has been pinned nearly all day today (about a 150 minute overlap with the graphs above):

[screenshot: Dashboard showing CPU pinned]

The only activity that the server would have had during the day (that I'm aware of) would have been a few file downloads. Nobody was even watching anything via Emby or Kodi, so it wasn't doing any transcoding. I was working (from home) all day today, doing nothing on the server. I occasionally woke up my personal Win10 desktop machine just to check on the server, but, other than looking in on the WebGUI & UUD, I wasn't doing anything that should have kicked off any significant processing on the machine.

 

I've spent time watching the CPU utilization on the Dockers page. Generally they're running at 0.0x%, with occasional spikes as high as 2%, and I've seen telegraf spike momentarily to 20%, though I've never seen it above about 7% in top. Radarr & Readarr occasionally spike to 7-9% CPU in top, but I've not seen them reporting that high on the Docker page of the official GUI.
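
In case it's useful for comparing against the Dockers page, the same per-container numbers can be pulled on the command line (nothing Unraid-specific here):

    # One-shot snapshot of CPU and memory per running container.
    docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"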

 

What tools are there at my disposal, beyond top, to determine what's going on? I'm still running cache_dirs, and you can see it in top at about 0.7% CPU, so I find it hard to believe that all of this is its fault. I just froze the screen output (using Greenshot) to take a look, and it took three tries to even get cache_dirs to show up in the rather tall list of output from top. When it did, it reported 0.7% CPU utilization again, just like in the screenshot I included. I'm still running the File Integrity plugin using the BLAKE2 hash, but I don't see anything in top at all that mentions "File", "FIP" or "BLAKE" in those top 20-30 threads, so I don't think that's the issue.
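
A couple of other things I can think to try from the console (treat iotop as optional - as far as I know it isn't part of the stock install and may need to be added via a plugin):

    # Top 15 processes by CPU, without top's constant refresh.
    ps -eo pid,comm,%cpu,%mem --sort=-%cpu | head -n 15

    # Per-process disk I/O, for when CPU numbers alone don't explain the load.
    iotop -o -b -n 1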

 

At the moment, this is making watching TV shows/movies (the server is primarily a media server) exceedingly difficult - many shows keep hanging on my Kodi box (separate hardware) because the server can't keep the stream going. I'm getting tempted to restart the server in the hopes that whatever is causing this won't restart, but I'd rather find the cause than just slap on a band-aid.

 

 


I stopped dockers one at a time, giving about a minute between each shutdown. It had essentially zero effect on CPU utilization.
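
(A rough sketch of how that could be scripted, in case anyone wants to repeat it - the 60-second pause is arbitrary:)

    # Stop each running container in turn, pausing to watch the effect on CPU.
    for c in $(docker ps --format '{{.Names}}'); do
        echo "Stopping $c"
        docker stop "$c"
        sleep 60
    done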

 

I finally rebooted the server. After an initial burst of 100% utilization, it seems to have settled into a more normal usage pattern. I hope it remains this way and doesn't spike again.

 

Unfortunately, while this has returned the server to usability, it doesn't explain what happened or why.

  • 4 weeks later...

This seems to be happening again.

 

CPU is pinned:

[screenshot: Dashboard showing CPU pinned]

 

And it has been that way more often than not since about 9pm last night:

[screenshot: CPU utilization graph since last night]

 

Access to the server via the WebGUI and various dockers interacting through web services seems to be mostly fine; however, SMB access from a Win10 client or via my Kodi box (on dedicated hardware) is nearly impossible.

 

Here's the latest nas-diagnostics-20210609-1350.zip

 

Here are 3 screen grabs of `top` over a 3-4 minute period when the CPU utilization seemed highest:

[screenshots: three top captures]

 

Simply trying to browse (using Win Explorer or xplorer2 LITE) from my freshly rebooted Win10 desktop to the server is basically impossible, as is just about any other access to the server from my desktop machine. I can, however, open the Web UI to Emby and fire up a movie in a browser window on this same machine without any issue whatsoever.
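
To see whether it's specifically the Samba side that's wedged, this is about all I know to check (smbstatus ships with Samba, so it should already be on the server):

    # List current SMB sessions, open files, and locks.
    smbstatus

    # CPU being used by the Samba and user-share processes right now.
    top -b -n 1 | grep -E 'smbd|shfs'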

 

It seems that "wa" (I/O wait) may be what's causing the issue, but I'm not sure how to track down what's behind that. I am still running the File Integrity plugin, but the blake2 process doesn't appear in any of these top screen grabs, so I don't think that's the cause, though I'm more than happy to be proven wrong. Sorry to ping you directly, @limetech, but does anyone have any suggestions on what to look at next? This has been happening on and off for nearly a year now, and it's the only complaint I've had in about 10 years of otherwise very happy UNRAID use.
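
To confirm whether "wa" really is the problem, and which device is behind it, something like this should show it (vmstat is in the base system; iostat comes with sysstat and may need installing):

    # CPU breakdown including iowait ('wa'), one line every 5 seconds.
    vmstat 5

    # Per-device utilization and wait times, to spot a saturated disk.
    iostat -x 5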



Something is seriously wrong!

[screenshot: parity check progress]

 

I've never had a parity check run anywhere near this slowly. I mean, I don't have the fastest setup in the world, but my last 21 checks (what fits on the first page of the parity check display) averaged 106MB/sec. Now it's running at 19.4MB/sec? It slows down as it nears the 4TB mark because of a couple of older 4TB drives in the mix, but never that slow...

 

Wondering if I'm getting some sort of disk error that's not been reported yet, I've paused the parity check and I'm running extended SMART checks on all drives.
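
For reference, the same tests can be started and checked from the console as well; /dev/sdb here is just a placeholder for each drive:

    # Kick off an extended (long) self-test on one drive; repeat per device.
    smartctl -t long /dev/sdb

    # Check progress and results later.
    smartctl -a /dev/sdb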


Interesting. The extended SMART tests completed on the 3 SSDs in the cache pool, but they seem to have terminated early on all the spinning array drives. Additionally, none of the spinning drives are showing any SMART history at all, except for Disk 8, which shows that it was "Aborted by host".

[screenshot: SMART self-test status per drive]
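
The self-test log itself should record why each test stopped (again, the device name is a placeholder):

    # Show only the SMART self-test history, including any 'Aborted by host' entries.
    smartctl -l selftest /dev/sdc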

 

New diagnostics: nas-diagnostics-20210610-1333.zip

 

Any thoughts? I'm going to resume the parity check since the SMART processes aborted themselves. I figure it will be comforting to know parity is good.

1 hour ago, JorgeB said:

Disable spin down or it can interrupt the SMART test, but I would run diskspeed first.

Running diskspeed now. It's been a long time since I last used it. I've got 2 controllers, 1 on board and 1 on a card; IIRC, it tests one drive at a time on each controller, correct? It really seems to be struggling with one drive in particular. Either that drive's got issues, or it's because I haven't stopped all my dockers and influx/telegraf keep writing to it.

 

I'll let the test run to completion as a baseline, then I'll shut down all dockers and run it again, to see if it makes a difference.
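
As a rough cross-check outside the DiskSpeed docker, a quick raw read test on the suspect drive might help; the device name is a placeholder, and the drive should otherwise be idle:

    # Non-destructive sequential read benchmark (cached and buffered reads).
    hdparm -tT /dev/sdd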


OK. I had issues with speed gaps on Drive 3 yesterday, to the point where it stopped the test and wouldn't continue. I shut down the DiskSpeed docker, watched some TV, and resumed the parity check overnight.

 

This morning, I shut down all dockers, fired up DiskSpeed, and ran a test again. Everything seems normal to me:

[screenshots: DiskSpeed results for all drives]

 

The small dip at ~4.5TB/60% is on Disk4, but doesn't seem particularly significant:

[screenshot: Disk4 speed graph]

 

FYI, drive/controller arrangement (I'm sure there's a more compact text representation somewhere, but this was quick & easy):

[screenshot: drive/controller assignments]

 

Despite the fact that all the SMART tests seemed to have aborted, it appears that the extended SMART test completed on Disks 3 & 4 without error:

[screenshots: SMART results for Disks 3 & 4]

 

Note that the power-on hours are within a few hours of current - the test #1 results are the runs from yesterday afternoon. I am, however, starting an extended test on all drives again, just to get complete data.

 

Anything else I should be looking into?

 

 


I've got the extended SMART tests running now.

I'll be gone most of the rest of the day, but if that finishes before I leave, I'll resume the parity check before I go.

 

UPDATE: Eh, I decided to resume the parity check since the SMART tests are only at about 10% completion. As of now, it's running at ~110 MB/s. It'll be late this evening before I'm back home to check on it (though I'll touch base via ControlR & WireGuard, when possible). All dockers are shut down for now, so that may well be helping, too.


Extended SMART tests finally completed... No errors on any HDD or SSD.

nas-diagnostics-20210612-1050.zip

 

The correcting parity check from the unclean shutdown has been resumed with about 3.6TB to complete. Since the initial pause of this check:

On 6/10/2021 at 10:52 AM, FreeMan said:

I've never had a parity check run anywhere near this slowly. I mean, I don't have the fastest setup in the world, but my last 21 checks (what fits on the first page of the parity check display) averaged 106MB/sec. Now it's running at 19.4MB/sec? It slows down as it nears the 4TB mark because of a couple of older 4TB drives in the mix, but never that slow...

The check seems to be running at a more normal speed.

 

Any other suggestions of what to do or where to look to determine what the issue may be?

 

I've got a pre-cleared 8TB drive that's ready to go in as a replacement for either Drive 3 or 4 (whichever has had the most spinning hours - they're close). Would it make sense at this point to do the disk replacement (after the parity check completes), or should I try to sort this out before risking anything?


The parity check finally completed. It's reporting 600MB/s because the final stretch of the run was only 3 hours and was on a set of 8TB disks. The maths are a bit off... :/ It doesn't seem to account for the fact that I manually paused/resumed several times.

 

This did pop up within the last hour or so of the parity check:

[screenshot: error notification]

 

I know I can acknowledge it, but is that something to be concerned about? It's the only error that came out of the parity check. nas-diagnostics-20210612-1956.zip

16 hours ago, FreeMan said:

The check seems to be running at a more normal speed.

 

Any other suggestions of what to do or where to look to determine what the issue may be?

If it's working correctly now, it's difficult to say, but maybe it was some docker running in the background. If it happens again (the parity check being very slow), post diags grabbed at that time.

6 hours ago, JorgeB said:

If it's working correctly now, it's difficult to say, but maybe it was some docker running in the background. If it happens again (the parity check being very slow), post diags grabbed at that time.

I appreciate that and will do so. That's also what I've done for the initial issue in this thread, the very high CPU utilization.

 

I seem to be able to work around these issues (rebooting the server fixes the high CPU, pausing the parity check fixes the very slow speed), but that does nothing to track down the actual cause or come up with some sort of resolution.

  • 2 weeks later...

OK, I don't have the CPU completely pinned at the moment, but the server's becoming unusable.

 

I've got 2 torrents downloading at ~1Mb/s or less, and I'm trying to play a video on my Kodi box, but it simply won't start - it spins for 30-60 seconds then goes right back to the menu like nothing ever happened. I did just talk to my son, who is upstairs watching videos on his laptop. He's using a web-based Emby client, watching over WiFi with absolutely no problems whatsoever.

 

I'm seeing a lot of this in top:

[screenshot: top output]

 

Here are the diagnostics that I just pulled while this is going on: nas-diagnostics-20210621-2313.zip

 

Any suggestions?

  • FreeMan changed the title to How do I identify the cause of very high CPU usage?

Again this afternoon, I've run into this:

[screenshot: Dashboard showing CPU pegged]

and it's been like that for several minutes - no jumping around, just pegged.

I grabbed a few screenshots from top showing shfs taking a fair amount of CPU:

[screenshots: top output with shfs near the top]

 

and I grabbed diagnostics again.

nas-diagnostics-20210628-1344.zip

 

I discovered this page, which seems to indicate that shfs may no longer be relevant. I don't recall what distro UNRAID is based on, but Arch, at least, seems to have deprecated it.
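
As far as I understand, the shfs showing up in top is the FUSE process behind Unraid's /mnt/user shares rather than the old SSH filesystem package of the same name, so one thing that might narrow it down is comparing a read through the user share against the same file read straight from the disk share. The paths below are placeholders, and the cache drop just keeps RAM out of the comparison:

    # Flush the page cache so neither read is served from RAM.
    sync; echo 3 > /proc/sys/vm/drop_caches

    # Same large file, read via the user share (shfs/FUSE) and then directly
    # from the disk it lives on; a big gap would point at the FUSE layer.
    dd if=/mnt/user/Media/example.mkv of=/dev/null bs=1M
    sync; echo 3 > /proc/sys/vm/drop_caches
    dd if=/mnt/disk3/Media/example.mkv of=/dev/null bs=1M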

 

Also, the server's been quite busy most of the day today:

[screenshot: CPU utilization graph for today]

 

Any insight whatsoever into what may be causing this, or how to figure out what's causing it, would be most appreciated!

 

 

