
100% CPU on one core, waiting on something, can't run diagnostics, can't access Docker



I've seen similar reports like this one, but mine is slightly different in that there are no Docker containers running, so I'm unsure what to look for.

 

Here are the details:

 

UNRAID Dashboard shows one core @ 100%

[Screenshot: Unraid Dashboard showing one core at 100%]

 

When running top there's nothing obvious in the %CPU column; however, I can see 24.7 wa, which would indicate 1 of the 4 cores is at 100% waiting on something:

 

top - 20:57:30 up 2 days,  6:14,  1 user,  load average: 319.27, 319.06, 318.26
Tasks: 608 total,   1 running, 607 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us,  0.2 sy,  0.0 ni, 74.7 id, 24.7 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   7866.9 total,    236.3 free,    981.9 used,   6648.7 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   5966.1 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                      
19097 root      20   0    6976   3464   2408 R   0.7   0.0   0:00.04 top                                                                                          
 6029 root      20   0  149904   8548   3780 S   0.3   0.1   1:06.58 nginx                                                                                        
15089 root      20   0  690796  32716  18916 S   0.3   0.4   3:52.37 containerd                                                                                   
18930 root      20   0  104952  12676   6936 S   0.3   0.2   0:00.05 php-fpm                                                                                      
    1 root      20   0    2468   1740   1632 S   0.0   0.0   0:15.82 init                                                                                         
    2 root      20   0       0      0      0 S   0.0   0.0   0:00.02 kthreadd                                                                                     
    3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp                                                                                       
    4 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_par_gp                                                                                   
    6 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 kworker/0:0H-kblockd                                                                         
    8 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 mm_percpu_wq                                                                                 
    9 root      20   0       0      0      0 S   0.0   0.0   0:02.71 ksoftirqd/0             
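
In case it's useful: with iowait that high and a load average in the hundreds despite only 1 running task, my guess is that a pile of processes are stuck in uninterruptible sleep (D state). A rough check I could run from the console (assuming these ps/awk options behave the same on Unraid) would be:

ps -eo pid,stat,wchan:32,args | awk '$2 ~ /D/'    # list processes stuck in D state and the kernel function they are waiting in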

 

I can't successfully run diagnostics; I simply get no response:

root@Tower:~# 
root@Tower:~# diagnostics
Starting diagnostics collection... 
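
If it hangs like that again, I could presumably see where it's blocked with something like this (<PID> is a placeholder, and I'm assuming /proc exposes the wait channel on this kernel):

pgrep -af diagnostics          # find the PID of the hung diagnostics run
cat /proc/<PID>/wchan; echo    # a value like io_schedule or reiserfs_* would mean it's blocked on disk I/O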

 

I can't see much of the Docker page in the GUI; however, I know I have 2 Docker containers installed and none running. There were none running yesterday, when I could still access this screen:

[Screenshot: Docker page]
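
For what it's worth, container state can also be checked from the console even when the GUI Docker page won't load; a minimal check (which may itself hang if the Docker daemon is blocked on the stuck disk) would be:

docker ps -a    # lists all containers, running or stopped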

 

The Apps screen is stuck in a never-ending loading loop:

[Screenshot: Apps page stuck loading]

 

I can access the server over the network, but I don't seem to be able to access disk3 (the other disks work fine). I was copying files to disk3 yesterday, and it looks like it ran out of space (I'm curious whether that caused this):

[Screenshot: disk3 share]
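
A quick way to confirm from the console whether disk3 really is full (assuming the standard Unraid mount point):

df -h /mnt/disk3    # size, used, and available space for disk3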

 

I also know I tried to run the Fix Common Problems plugin yesterday and it stalled about 3% of the way through.

 

Attached are the results of ps -x; this is about as far as my investigation can take me.

 

Any thoughts / advice on next steps? What caused this? What is the process waiting on? What's hanging? What do I need to kill to resolve this, and how do I avoid causing it next time?

ps-x.rtf

Just now, freezingkiwis said:

Whoa.

Whoa.

Whoa.

Seems an obvious culprit could indeed be the drive(s). I'm running 4 of the things. Oh damn.

I'm more inclined to blame ReiserFS, having experienced almost identical symptoms several years ago.

 

Also, just because those drives had a far higher failure rate than similar drives from that period doesn't mean the failure rate is 100%. In my experience, drives that survive several years are more likely to die relatively normal deaths, typically preceded by SMART warnings.

 

Doesn't mean those drives are good, but without a SMART report you can't pass judgement just yet.
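
If you want actual data rather than guesses, a SMART report can be pulled from the console; a hedged example, with sdX as a placeholder for the real device:

smartctl -a /dev/sdX    # full SMART report; pay attention to Reallocated_Sector_Ct and Current_Pending_Sector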

1 minute ago, itimpi said:


as far as I know you never get any sort of file system maintenance that is not initiated by the user.

Poor choice of words. ReiserFS can get super slow sorting out file fragments when asked to write to a drive that's full. That's what I meant by maintenance; it's likely just stuck working out where to put things.
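
If you want to rule the file system in or out later, a read-only check with the array stopped (or started in Maintenance mode) would look roughly like this; /dev/md3 is my assumption for disk3's array device:

reiserfsck --check /dev/md3    # read-only ReiserFS check of disk3; do NOT run --rebuild-tree unless the check tells you to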


OK, I just browsed to the Settings page to run FCP again, and now the UI seems to be hanging on every single page; I can't load a single thing.

 

Here's the tail of my syslog:

 

Feb 12 21:02:53 Tower nginx: 2021/02/12 21:02:53 [error] 6029#6029: *305660 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.178.60, server: , request: "POST /plugins/community.applications/scripts/notices.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", host: "192.168.178.25", referrer: "http://192.168.178.25/Apps"
Feb 12 21:02:53 Tower nginx: 2021/02/12 21:02:53 [error] 6029#6029: *305299 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.178.60, server: , request: "POST /plugins/community.applications/include/exec.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", host: "192.168.178.25", referrer: "http://192.168.178.25/Apps"
Feb 12 21:07:26 Tower nginx: 2021/02/12 21:07:26 [error] 6029#6029: *305996 upstream timed out (110: Connection timed out) while reading upstream, client: 192.168.178.60, server: , request: "GET /Docker HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock:", host: "192.168.178.25", referrer: "http://192.168.178.25/Dashboard"
Feb 12 21:14:26 Tower nginx: 2021/02/12 21:14:26 [error] 6029#6029: *306628 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.178.60, server: , request: "POST /plugins/community.applications/scripts/notices.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", host: "192.168.178.25", referrer: "http://192.168.178.25/Settings"
Feb 13 03:40:01 Tower root: mover: cache not present, or only cache present
Feb 13 10:10:46 Tower php-fpm[6004]: [WARNING] [pool www] server reached max_children setting (50), consider raising it
Feb 13 10:10:48 Tower login[26586]: ROOT LOGIN  on '/dev/pts/1'
Feb 13 10:12:46 Tower nginx: 2021/02/13 10:12:46 [error] 6029#6029: *352732 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.178.123, server: , request: "POST /plugins/community.applications/scripts/notices.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", host: "192.168.178.25", referrer: "http://192.168.178.25/Dashboard"
Feb 13 10:12:46 Tower nginx: 2021/02/13 10:12:46 [error] 6029#6029: *352729 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 192.168.178.123, server: , request: "POST /webGui/include/DashboardApps.php HTTP/1.1", upstream: "fastcgi://unix:/var/run/php5-fpm.sock", host: "192.168.178.25", referrer: "http://192.168.178.25/Dashboard"
Feb 13 10:59:13 Tower php-fpm[6004]: [WARNING] [pool www] server reached max_children setting (50), consider raising it
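
I'm guessing those max_children warnings just mean all the php-fpm workers are stuck behind the hung disk, rather than the web server itself being broken. A rough way I could check that (assuming standard ps options) would be:

ps -C php-fpm -o pid,stat,wchan:32    # lots of workers in D state would confirm they're all blocked on I/O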

 

Attached are the last 100 lines of my syslog.

 

Unfortunately I'm losing my house power this morning, so I need to switch off the server. It's been running for at least 2 days since the issue started, on either the evening of 10 Feb or the morning of 11 Feb (I left a backup running to disk3 overnight).

 

 

syslog.txt


Sounds like a forced shutdown is your only option.

 

The good news is that ReiserFS is very capable of handling that sort of thing, but it will likely take hours for the drives to mount cleanly as ReiserFS replays the transactions.

 

Once you have stable power guaranteed, boot the server and retrieve diagnostics, then start the array and wait for all the drives to finish mounting (which will probably take hours), then collect diagnostics again.
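
While the drives are mounting you can watch the journal replay in the syslog; something like this should work (path is the Unraid default):

tail -f /var/log/syslog | grep -i reiser    # mount / journal-replay messages appear here as each disk comes up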


OK, post-reboot, diagnostics collected almost immediately: "20210213-1447-after-reboot-before-ReiserFS".

 

A parity check was started, so I allowed it to run for ~11 hours; it found no errors:

[Screenshot: parity check completed with no errors]

 

Processor and RAM usage are nice and low; nothing in particular is happening on the server now:

[Screenshot: Dashboard showing low CPU and RAM usage]

 

And I've attached diagnostics again, after all of the above: "20210214-1415-after-Parity-check".

 

I've successfully run FCP; no errors or warnings were found.

 

I can successfully connect to my server over the network, and browse all drives.

 

I'm about to start copying some files to the server, probably to all 4 drives (this server is primarily my backup), and will see how things pan out.

 

I'm not expecting to run out of space on a hard drive, but should I give it a go, or avoid it?

 

Anything else I should track / tail / keep an eye out for???
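
My own plan, for what it's worth (just a sketch, run in separate shells), is to keep an eye on iowait and free space while the copies run:

vmstat 5                                                         # the "wa" column shows iowait; sustained high values mean the disks are the bottleneck
watch -n 60 df -h /mnt/disk1 /mnt/disk2 /mnt/disk3 /mnt/disk4    # make sure none of the disks fill up again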

tower-diagnostics-20210213-1447-after-reboot-before-ReiserFS.zip tower-diagnostics-20210214-1415-after-Parity-check.zip

Edited by freezingkiwis
better filename/attachment descriptions.
