Crazy Load Average of 53, GUI unresponsive, am part way through a rebuild. What to do?


-C-

I was updating some docker containers through the Docker GUI page when the page froze.

 

Checked top via SSH and got this:

top - 17:58:46 up 1 day, 21:02,  1 user,  load average: 53.92, 53.54, 53.07
Tasks: 1107 total,   3 running, 1104 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.9 us,  2.6 sy,  0.0 ni, 54.9 id, 41.4 wa,  0.0 hi,  0.2 si,  0.0 st
MiB Mem :  31872.3 total,   5800.0 free,  11092.2 used,  14980.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  19566.8 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
24398 root      20   0   34316  32632   1960 R  21.5   0.1   0:05.21 find
12074 root      20   0  974656 409920    532 S  10.9   1.3 215:55.56 shfs
18749 nobody    20   0  455592 102776  80756 S   5.3   0.3   0:02.04 php-fpm82
21905 nobody    20   0  386252  77244  61756 R   4.6   0.2   0:00.31 php-fpm82
18604 nobody    20   0  455788 105656  83604 S   4.0   0.3   0:02.86 php-fpm82
11495 root       0 -20       0      0      0 S   3.3   0.0  41:37.86 z_rd_int_0
11496 root       0 -20       0      0      0 S   3.3   0.0  41:39.55 z_rd_int_1
11497 root       0 -20       0      0      0 S   3.3   0.0  41:38.24 z_rd_int_2
 7491 nobody    20   0 2711124 259320  24452 S   1.7   0.8   0:11.99 mariadbd

The thing is, I'm part way through an array rebuild after replacing a failed HDD. Usually I would restart the server when the GUI becomes unresponsive, but is it safe to do so in this case? (I have the Parity Check Tuning plugin installed.) Or is there a CLI command I can try to bring things back?
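
For anyone reading along: on Linux the load average counts tasks in uninterruptible sleep (state D, usually waiting on disk I/O) as well as runnable ones, so a load of 53 with the CPU 54.9% idle and 41.4% in wa points at blocked I/O rather than CPU work. A minimal sketch for tallying process states, assuming a procps-style ps; the function name count_states is made up for illustration:

```shell
# Tally processes by the first letter of their state (R, S, D, ...).
# Reads "STATE COMMAND" lines on stdin and prints "STATE COUNT" per state.
count_states() {
    awk '{ n[substr($1, 1, 1)]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# On the server itself (assumes procps ps):
#   ps -e -o state= -o comm= | count_states
# To list only the blocked (D) tasks:
#   ps -e -o state= -o pid= -o comm= | awk '$1 ~ /^D/'
```

A large D count alongside a small R count would suggest the load is coming from processes stuck waiting on storage.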

 

 

I checked the logs and found the crash happened around here:

Sep  3 17:00:54 Tower webGUI: Successful login user root from 192.168.34.42
Sep  3 17:01:25 Tower php-fpm[7836]: [WARNING] [pool www] server reached max_children setting (50), consider raising it
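
The max_children warning means all 50 php-fpm workers in the www pool were busy at once, which would fit a frozen GUI. A rough sketch for seeing how often the warning fires, counting occurrences per minute; the function name is hypothetical and the log path in the usage comment is an assumption:

```shell
# Count php-fpm "max_children" warnings per minute in syslog-format input.
# Reads log lines on stdin and prints "Mon DD HH:MM COUNT".
max_children_per_minute() {
    grep 'server reached max_children' |
        awk '{ print $1, $2, substr($3, 1, 5) }' |
        sort | uniq -c |
        awk '{ print $2, $3, $4, $1 }'
}

# On the server (assumes the log lives at /var/log/syslog):
#   max_children_per_minute < /var/log/syslog
```

A single hit may only be a brief spike; repeated hits minute after minute would suggest the workers are stuck rather than merely busy.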

 

Doing some further digging on that error, I found a post in which the poster traced the issue to the GPU Statistics plugin.

I had just installed that a couple of days ago, so it would seem that this is likely the cause of my problem too.

 

I successfully removed the plugin via CLI with

plugin remove gpustat.plg

...but after a few minutes the sys load remains high and still no GUI.

 

It's looking like a reboot is my only option, but the rebuild is only part way through:

Status:  Parity Sync/Data Rebuild  (65.3% completed)

Load is still climbing:

load average: 57.57, 57.49, 57.00

Looks like it could be related to Docker:

[screenshot not preserved]

 

root@Tower:/mnt/user/system# umount /var/lib/docker
umount: /var/lib/docker: target is busy.

 

Parity rebuild seems to be going much slower than it should be, guess it's due to the high load.

So I ran

parity.check stop

After doing so the GUI is loading fine again, but I'm getting a "Retry unmounting user share(s)" message in the GUI footer.

 

I tried a reboot but it's hung.

 

Via SSH I tried stopping the Docker service, but that doesn't seem to be what's blocking the unmount:

root@Tower:/mnt/disks# umount /var/lib/docker
umount: /var/lib/docker: not mounted.

I left it (wasn't sure what else to try) and eventually it restarted and things seem to be back to normal.

 

I've now discovered that the Parity Tuning plugin doesn't/ can't continue a Parity Sync/Data Rebuild in the same way that it can a correcting parity check, so it's back to the beginning with that.

 

I'm going to avoid touching anything until the rebuild's finished.


Just tried logging into the Unraid GUI and am now getting this:

[screenshot not preserved]

 

Load is pegged again:

top - 12:24:40 up 1 day, 12:32,  1 user,  load average: 52.86, 52.47, 52.31
Tasks: 1152 total,   1 running, 1151 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.9 us,  5.2 sy,  0.0 ni, 82.7 id,  9.1 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :  31872.3 total,   6157.1 free,  12252.2 used,  13463.0 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  18476.1 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
11805 root      20   0  975184 412108    516 S  93.7   1.3 158:06.42 /usr/local/bin/shfs /mnt/user -disks 31 -o default_permissions,allow_other,noatime -o remember=0
29298 nobody    20   0  226844 109888  32124 S  22.5   0.3  12:33.84 /usr/lib/plexmediaserver/Plex Media Server
12138 nobody    20   0  386324  70936  55404 S   6.0   0.2   0:00.28 php-fpm: pool www
 9015 root      20   0       0      0      0 S   4.3   0.0 137:11.80 [unraidd0]
13271 nobody    20   0  386216  65896  50496 S   4.3   0.2   0:00.16 php-fpm: pool www
22798 nobody    20   0  386744  79348  63384 S   4.3   0.2   0:17.22 php-fpm: pool www
 7495 root      20   0       0      0      0 D   1.0   0.0  26:23.74 [mdrecoveryd]

 

Here's a list of installed plugins:

root@Tower:~# ls /var/log/plugins/
Python3.plg@                 dynamix.cache.dirs.plg@      dynamix.system.temp.plg@  open.files.plg@           unRAIDServer.plg@             zfs.master.plg@
appdata.backup.plg@          dynamix.file.integrity.plg@  dynamix.unraid.net.plg@   parity.check.tuning.plg@  unassigned.devices-plus.plg@
community.applications.plg@  dynamix.file.manager.plg@    file.activity.plg@        qnap-ec.plg@              unassigned.devices.plg@
disklocation-master.plg@     dynamix.s3.sleep.plg@        fix.common.problems.plg@  tips.and.tweaks.plg@      unbalance.plg@
dynamix.active.streams.plg@  dynamix.system.autofan.plg@  intel-gpu-top.plg@        unRAID6-Sanoid.plg@       user.scripts.plg@

 

I can access files on the array OK over the network, rebuild is still running, albeit very slowly:

root@Tower:~# parity.check status
Status:  Parity Sync/Data Rebuild  (65.2% completed)

Any advice on what I can try to get the load back down?

 

On 9/4/2023 at 12:10 AM, -C- said:

I've now discovered that the Parity Tuning plugin doesn't/ can't continue a Parity Sync/Data Rebuild in the same way that it can a correcting parity check, so it's back to the beginning with that.


Are you sure? I am sure it used to. I will have to check this out again.

2 hours ago, JorgeB said:

Probably only a reboot will help.

In which case I'm stuck in a loop for now. I rebooted the first time it happened and everything was stable for a day or so before it happened again, with no cause that I can find.

 

What's painful is that the rebuild is so slow. When I could last access the GUI I was getting around 10-30 MB/s, so I'll likely be stuck without a GUI for at least another day. I've not had a disk fail without warning before, so I've never had to rebuild from parity like this and am not sure whether that speed is normal. It's certainly running a lot slower than a correcting check.
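
To put a number on "another day at least", the remaining time is roughly the unrebuilt portion of the disk divided by the rebuild speed. A back-of-envelope sketch; the disk size isn't stated anywhere in this thread, so the 8 TB in the usage lines is purely hypothetical, as is the function name:

```shell
# Estimate hours left in a rebuild.
# Args: disk size in TB (decimal), percent completed, speed in MB/s.
rebuild_eta_hours() {
    awk -v tb="$1" -v pct="$2" -v mbs="$3" 'BEGIN {
        remaining_mb = tb * 1000000 * (100 - pct) / 100
        printf "%.1f\n", remaining_mb / mbs / 3600
    }'
}

# Hypothetical 8 TB disk at 65.2% completed:
#   rebuild_eta_hours 8 65.2 10
#   rebuild_eta_hours 8 65.2 30
```

For that assumed disk size, the 10-30 MB/s range works out to somewhere between about 26 and 77 hours, so the estimate is very sensitive to whether the speed recovers.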

 

38 minutes ago, itimpi said:


Are you sure? I am sure it used to. I will have to check this out again.

That's what happened when I rebooted part way through the rebuild yesterday. Not sure if that's normal though.

I've disabled mover as the rebuild was stopping for the daily move and not restarting afterwards.

Just now, itimpi said:

I wonder if you got an unclean shutdown (or the plugin erroneously thought one had happened) as that would stop anything being restarted.

Certainly possible. The system wasn't happy when I rebooted it (which was the reason for the reboot) and it may have killed hung processes in order to reboot. It certainly took longer than usual.

 

(I used powerdown -r to restart in case that makes any difference.)
