
100% CPU on one core, waiting on something, can't run diagnostics, can't access Docker



I've seen similar reports like this one, but mine is slightly different in that there are no Docker containers running, so I'm unsure what to look for.

 

Here are the details:

 

UNRAID Dashboard shows one core @ 100%

[Screenshot: Unraid Dashboard showing one core at 100%]

 

When running top there's nothing obvious in the %CPU column; however, I can see 24.7 wa, which would indicate 1 of the 4 cores is at 100% waiting on something:

 

top - 20:57:30 up 2 days,  6:14,  1 user,  load average: 319.27, 319.06, 318.26
Tasks: 608 total,   1 running, 607 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us,  0.2 sy,  0.0 ni, 74.7 id, 24.7 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   7866.9 total,    236.3 free,    981.9 used,   6648.7 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   5966.1 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                      
19097 root      20   0    6976   3464   2408 R   0.7   0.0   0:00.04 top                                                                                          
 6029 root      20   0  149904   8548   3780 S   0.3   0.1   1:06.58 nginx                                                                                        
15089 root      20   0  690796  32716  18916 S   0.3   0.4   3:52.37 containerd                                                                                   
18930 root      20   0  104952  12676   6936 S   0.3   0.2   0:00.05 php-fpm                                                                                      
    1 root      20   0    2468   1740   1632 S   0.0   0.0   0:15.82 init                                                                                         
    2 root      20   0       0      0      0 S   0.0   0.0   0:00.02 kthreadd                                                                                     
    3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp                                                                                       
    4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp                                                                                   
    6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/0:0H-kblockd                                                                         
    8 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_percpu_wq                                                                                 
    9 root      20   0       0      0      0 S   0.0   0.0   0:02.71 ksoftirqd/0             
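
In case it's useful: with iowait that high and a load average in the hundreds despite only 1 running task, my guess is that a pile of processes are stuck in uninterruptible sleep (D state). A rough check I could run from the console (assuming these ps/awk options behave the same on Unraid) would be:

ps -eo pid,stat,wchan:32,args | awk '$2 ~ /D/'    # list processes stuck in D state and the kernel function they are waiting in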

 

I can't successfully run diagnostics; I simply get no response:

root@Tower:~# 
root@Tower:~# diagnostics
Starting diagnostics collection... 
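
If it hangs like that again, I could presumably see where it's blocked with something like this (<PID> is a placeholder, and I'm assuming /proc exposes the wait channel on this kernel):

pgrep -af diagnostics          # find the PID of the hung diagnostics run
cat /proc/<PID>/wchan; echo    # a value like io_schedule or reiserfs_* would mean it's blocked on disk I/O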

 

I can't see much of the Docker page in the GUI; however, I know I have 2 Docker containers installed and none running. There were none running yesterday, when I could still access this screen:

[Screenshot: Docker page]
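
For what it's worth, container state can also be checked from the console even when the GUI Docker page won't load; a minimal check (which may itself hang if the Docker daemon is blocked on the stuck disk) would be:

docker ps -a    # lists all containers, running or stopped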

 

The Apps screen is stuck in a never-ending loading loop:

[Screenshot: Apps page stuck loading]

 

I can access the server over the network, but I don't seem to be able to access disk3 (the other disks work fine). I was copying files to disk3 yesterday, and it looks like it ran out of space (I'm curious whether that caused this):

[Screenshot: disk3 share]
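
A quick way to confirm from the console whether disk3 really is full (assuming the standard Unraid mount point):

df -h /mnt/disk3    # size, used, and available space for disk3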

 

I also know I tried to run the Fix Common Problems plugin yesterday and it stalled about 3% of the way through.

 

Attached are the results of ps -x; this is about as far as my investigation can take me.

 

Any thoughts / advice on next steps? What caused this? What is the process waiting on? What's hanging? What do I need to kill to resolve this, and how do I avoid causing it next time?

ps-x.rtf

Just now, freezingkiwis said:

Whoa.

Whoa.

Whoa.

Seems an obvious culprit could indeed be the drive(s). I'm running 4 of the things. Oh damn.

I'm more inclined to blame ReiserFS, having experienced almost identical symptoms several years ago.

 

Also, just because those drives had a far higher failure rate than similar drives from that period doesn't mean the failure rate is 100%. In my experience, drives that survive several years are more likely to die relatively normal deaths, typically preceded by SMART warnings.

 

Doesn't mean those drives are good, but without a SMART report you can't pass judgement just yet.
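
If you want actual data rather than guesses, a SMART report can be pulled from the console; a hedged example, with sdX as a placeholder for the real device:

smartctl -a /dev/sdX    # full SMART report; pay attention to Reallocated_Sector_Ct and Current_Pending_Sector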

1 minute ago, itimpi said:


as far as I know you never get any sort of file system maintenance that is not initiated by the user.

Poor choice of words. ReiserFS can get super slow sorting out file fragments when asked to write to a drive that's full. That's what I meant by maintenance; it's likely just stuck working out where to put things.
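
If you want to rule the file system in or out later, a read-only check with the array stopped (or started in Maintenance mode) would look roughly like this; /dev/md3 is my assumption for disk3's array device:

reiserfsck --check /dev/md3    # read-only ReiserFS check of disk3; do NOT run --rebuild-tree unless the check tells you to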


OK, I just browsed to the Settings page to run FCP again, and now the UI seems to be hanging on every single page; I can't load a single thing.

 

Here's the tail of my syslog:

 

Feb 12 21:02:53 Tower nginx: 2021/02/12 21:02:53 [error] 6029#6029: *305660 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.178.60, server: , request: "POST /plugins/community.applications/scripts/notices.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", host: "192.168.178.25", referrer: "http://192.168.178.25/Apps"
Feb 12 21:02:53 Tower nginx: 2021/02/12 21:02:53 [error] 6029#6029: *305299 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.178.60, server: , request: "POST /plugins/community.applications/include/exec.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", host: "192.168.178.25", referrer: "http://192.168.178.25/Apps"
Feb 12 21:07:26 Tower nginx: 2021/02/12 21:07:26 [error] 6029#6029: *305996 upstream timed out (110: Connection timed out) while reading upstream, client: 192.168.178.60, server: , request: "GET /Docker HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock:", host: "192.168.178.25", referrer: "http://192.168.178.25/Dashboard"
Feb 12 21:14:26 Tower nginx: 2021/02/12 21:14:26 [error] 6029#6029: *306628 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.178.60, server: , request: "POST /plugins/community.applications/scripts/notices.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", host: "192.168.178.25", referrer: "http://192.168.178.25/Settings"
Feb 13 03:40:01 Tower root: mover: cache not present, or only cache present
Feb 13 10:10:46 Tower php-fpm[6004]: [WARNING] [pool www] server reached max_children setting (50), consider raising it
Feb 13 10:10:48 Tower login[26586]: ROOT LOGIN  on '/dev/pts/1'
Feb 13 10:12:46 Tower nginx: 2021/02/13 10:12:46 [error] 6029#6029: *352732 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.178.123, server: , request: "POST /plugins/community.applications/scripts/notices.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", host: "192.168.178.25", referrer: "http://192.168.178.25/Dashboard"
Feb 13 10:12:46 Tower nginx: 2021/02/13 10:12:46 [error] 6029#6029: *352729 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.178.123, server: , request: "POST /webGui/include/DashboardApps.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", host: "192.168.178.25", referrer: "http://192.168.178.25/Dashboard"
Feb 13 10:59:13 Tower php-fpm[6004]: [WARNING] [pool www] server reached max_children setting (50), consider raising it
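
I'm guessing those max_children warnings just mean all the php-fpm workers are stuck behind the hung disk, rather than the web server itself being broken. A rough way I could check that (assuming standard ps options) would be:

ps -C php-fpm -o pid,stat,wchan:32    # lots of workers in D state would confirm they're all blocked on I/O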

 

Attached are the last 100 lines of my syslog.

 

Unfortunately I'm losing my house power this morning, so I need to switch off the server. It's been running for at least 2 days since the issue started, on either the evening of 10 Feb or the morning of 11 Feb (I left a backup running to disk3 overnight).

 

 

syslog.txt


Sounds like a forced shutdown is your only option.

 

The good news is that ReiserFS is very capable of handling that sort of thing, but it will likely take hours for the drives to mount cleanly as ReiserFS replays the transactions.

 

Once you have stable power guaranteed, boot the server and retrieve diagnostics, then start the array and wait for all the drives to finish mounting (which will probably take hours), then collect diagnostics again.
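
While the drives are mounting you can watch the journal replay in the syslog; something like this should work (path is the Unraid default):

tail -f /var/log/syslog | grep -i reiser    # mount / journal-replay messages appear here as each disk comes up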


OK, post-reboot, diagnostics collected almost immediately: "20210213-1447-after-reboot-before-ReiserFS".

 

A parity check was started, so I allowed it to run for ~11 hours; it found no errors:

[Screenshot: parity check completed with no errors]

 

Processor and RAM usage are nice and low; nothing in particular is happening on the server now:

[Screenshot: Dashboard showing low CPU and RAM usage]

 

And I've attached diagnostics again, after all of the above: "20210214-1415-after-Parity-check".

 

I've successfully run FCP; no errors or warnings were found.

 

I can successfully connect to my server over the network, and browse all drives.

 

I'm about to start copying some files to the server, probably to all 4 drives (this server is primarily my backup), and will see how things pan out.

 

I'm not expecting to run out of space on a hard drive, but should I give it a go, or avoid it?

 

Anything else I should track / tail / keep an eye out for???
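
My own plan, for what it's worth (just a sketch, run in separate shells), is to keep an eye on iowait and free space while the copies run:

vmstat 5                                                         # the "wa" column shows iowait; sustained high values mean the disks are the bottleneck
watch -n 60 df -h /mnt/disk1 /mnt/disk2 /mnt/disk3 /mnt/disk4    # make sure none of the disks fill up again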

tower-diagnostics-20210213-1447-after-reboot-before-ReiserFS.zip tower-diagnostics-20210214-1415-after-Parity-check.zip

Edited by freezingkiwis
better filename/attachment descriptions.
