Crazy Load Average of 53, GUI unresponsive, am part way through a rebuild. What to do?

-C- · September 3, 2023

I was updating some docker containers through the Docker GUI page when the page froze.

Checked top via SSH and got this:

top - 17:58:46 up 1 day, 21:02,  1 user,  load average: 53.92, 53.54, 53.07
Tasks: 1107 total,   3 running, 1104 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.9 us,  2.6 sy,  0.0 ni, 54.9 id, 41.4 wa,  0.0 hi,  0.2 si,  0.0 st
MiB Mem :  31872.3 total,   5800.0 free,  11092.2 used,  14980.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  19566.8 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
24398 root      20   0   34316  32632   1960 R  21.5   0.1   0:05.21 find
12074 root      20   0  974656 409920    532 S  10.9   1.3 215:55.56 shfs
18749 nobody    20   0  455592 102776  80756 S   5.3   0.3   0:02.04 php-fpm82
21905 nobody    20   0  386252  77244  61756 R   4.6   0.2   0:00.31 php-fpm82
18604 nobody    20   0  455788 105656  83604 S   4.0   0.3   0:02.86 php-fpm82
11495 root       0 -20       0      0      0 S   3.3   0.0  41:37.86 z_rd_int_0
11496 root       0 -20       0      0      0 S   3.3   0.0  41:39.55 z_rd_int_1
11497 root       0 -20       0      0      0 S   3.3   0.0  41:38.24 z_rd_int_2
 7491 nobody    20   0 2711124 259320  24452 S   1.7   0.8   0:11.99 mariadbd

Thing is, I'm part way through an array rebuild, having replaced a failed HDD. Usually I would restart the server if the GUI becomes snafu, but is it safe to do so in this case? (I have the Parity Check Tuning plugin installed.) Or is there a CLI command I can try to bring things back?

-C- · September 3, 2023

I checked the logs and found the crash happened around here:

Sep  3 17:00:54 Tower webGUI: Successful login user root from 192.168.34.42
Sep  3 17:01:25 Tower php-fpm[7836]: [WARNING] [pool www] server reached max_children setting (50), consider raising it
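The warning above means all 50 workers in the `www` pool were busy at once, which would fit GUI requests piling up behind something slow rather than php-fpm itself being at fault. A minimal sketch for checking how often the warning has fired; the function name is made up for illustration, and `/var/log/syslog` as the log location is an assumption:

```shell
#!/bin/sh
# Hypothetical helper: count php-fpm "max_children" warnings in log text on stdin.
count_max_children_warnings() {
    # grep -c prints the number of matching lines (0 if there are none).
    grep -c 'reached max_children setting'
}

# On a live system (log path is an assumption):
#   count_max_children_warnings < /var/log/syslog
```

A count that keeps growing while the GUI is frozen would suggest the workers are stuck, not just briefly busy.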

Doing some further digging on that error, I found a post in which the poster found the issue was due to the GPU Statistics plugin.

I had just installed that a couple of days ago, so it would seem that this is likely the cause of my problem too.

I successfully removed the plugin via CLI with

plugin remove gpustat.plg

...but after a few minutes the sys load remains high and still no GUI.

Looking like a reboot's my only option, but

Status:  Parity Sync/Data Rebuild  (65.3% completed)

Edited September 3, 2023 by -C-

-C- · September 3, 2023

Load is still climbing:

load average: 57.57, 57.49, 57.00

Looks like it could be related to Docker:

root@Tower:/mnt/user/system# umount /var/lib/docker
umount: /var/lib/docker: target is busy.
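When `umount` reports "target is busy", something still has files open on that filesystem. A rough sketch of how one might look for the culprit, assuming `fuser` (from psmisc) is available on the server; both function names are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: succeed if PATH appears as a mount point in a mounts
# table (defaults to /proc/mounts; the second argument is for testing).
is_mounted() {
    awk -v target="$1" '$2 == target { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

# Hypothetical helper: list processes using the filesystem mounted at PATH,
# or report that nothing is mounted there.
show_holders() {
    if is_mounted "$1"; then
        # -m: treat the argument as a mounted filesystem; -v: verbose listing
        fuser -vm "$1"
    else
        echo "$1: not mounted"
    fi
}

# On a live system:
#   show_holders /var/lib/docker
```

The mount check comes first because `umount` reports "not mounted" once the Docker service has been stopped, in which case there is nothing for `fuser` to inspect.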

Parity rebuild seems to be going much slower than it should be, guess it's due to the high load.

So I ran

parity.check stop

After doing so, the GUI's now loading fine, but I'm getting a "Retry unmounting user share(s)" message in the GUI footer.

I tried a reboot but it's hung.

Via SSH I tried stopping Docker service, but it doesn't seem to be that:

root@Tower:/mnt/disks# umount /var/lib/docker
umount: /var/lib/docker: not mounted.

I left it (wasn't sure what else to try) and eventually it restarted and things seem to be back to normal.

I've now discovered that the Parity Tuning plugin doesn't/can't continue a Parity Sync/Data Rebuild in the same way that it can a correcting parity check, so it's back to the beginning with that.

I'm going to avoid touching anything until the rebuild's finished.

-C- · September 5, 2023

Just tried logging into the Unraid GUI and am now getting this:

[image: image.png]

Load is pegged again:

top - 12:24:40 up 1 day, 12:32,  1 user,  load average: 52.86, 52.47, 52.31
Tasks: 1152 total,   1 running, 1151 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.9 us,  5.2 sy,  0.0 ni, 82.7 id,  9.1 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :  31872.3 total,   6157.1 free,  12252.2 used,  13463.0 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  18476.1 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
11805 root      20   0  975184 412108    516 S  93.7   1.3 158:06.42 /usr/local/bin/shfs /mnt/user -disks 31 -o default_permissions,allow_other,noatime -o remember=0
29298 nobody    20   0  226844 109888  32124 S  22.5   0.3  12:33.84 /usr/lib/plexmediaserver/Plex Media Server
12138 nobody    20   0  386324  70936  55404 S   6.0   0.2   0:00.28 php-fpm: pool www
 9015 root      20   0       0      0      0 S   4.3   0.0 137:11.80 [unraidd0]
13271 nobody    20   0  386216  65896  50496 S   4.3   0.2   0:00.16 php-fpm: pool www
22798 nobody    20   0  386744  79348  63384 S   4.3   0.2   0:17.22 php-fpm: pool www
 7495 root      20   0       0      0      0 D   1.0   0.0  26:23.74 [mdrecoveryd]
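The output above shows a load of about 52 while the CPU is 82.7% idle, and `mdrecoveryd` is in state `D`. On Linux the load average counts tasks in uninterruptible sleep (state `D`, usually waiting on I/O) as well as runnable ones, so a high load with an idle CPU often means many blocked tasks. A small sketch for listing them; the function name is made up for illustration:

```shell
#!/bin/sh
# Hypothetical helper: from "STATE PID COMMAND" lines on stdin, print only
# the tasks whose state starts with D (uninterruptible sleep).
list_blocked() {
    awk '$1 ~ /^D/ { print }'
}

# On a live system (procps ps; the empty "=" suppresses the column headers):
#   ps -e -o state= -o pid= -o comm= | list_blocked
```

If most of the roughly 50 tasks behind the load figure turn out to be php-fpm or Docker processes in state `D`, that would point at storage I/O as the bottleneck rather than CPU.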

Here's a list of installed plugins:

root@Tower:~# ls /var/log/plugins/
Python3.plg@                 dynamix.cache.dirs.plg@      dynamix.system.temp.plg@  open.files.plg@           unRAIDServer.plg@             zfs.master.plg@
appdata.backup.plg@          dynamix.file.integrity.plg@  dynamix.unraid.net.plg@   parity.check.tuning.plg@  unassigned.devices-plus.plg@
community.applications.plg@  dynamix.file.manager.plg@    file.activity.plg@        qnap-ec.plg@              unassigned.devices.plg@
disklocation-master.plg@     dynamix.s3.sleep.plg@        fix.common.problems.plg@  tips.and.tweaks.plg@      unbalance.plg@
dynamix.active.streams.plg@  dynamix.system.autofan.plg@  intel-gpu-top.plg@        unRAID6-Sanoid.plg@       user.scripts.plg@

I can access files on the array OK over the network, rebuild is still running, albeit very slowly:

root@Tower:~# parity.check status
Status:  Parity Sync/Data Rebuild  (65.2% completed)

Any advice on what I can try to get the load back down?

JorgeB · September 5, 2023

Probably only a reboot will help.

itimpi · September 5, 2023

On 9/4/2023 at 12:10 AM, -C- said:

I've now discovered that the Parity Tuning plugin doesn't/can't continue a Parity Sync/Data Rebuild in the same way that it can a correcting parity check, so it's back to the beginning with that.

Are you sure? I am sure it used to. I will have to check this out again.

-C- · September 5, 2023

2 hours ago, JorgeB said:

Probably only a reboot will help.

In which case I'm stuck in a loop for now. I rebooted the first time it happened and everything was stable for a day or so before it happened again, without a reason I can find.

What's painful is that the rebuild is happening slowly. When I could last access the GUI I was getting around 10-30 MB/s, so I'll likely be stuck without a GUI for another day at least. I've not had a disk fail without warning before, so I've not had to rebuild from parity like this and am not sure whether that's normal. It's certainly running a lot slower than a correcting check.

38 minutes ago, itimpi said:

Are you sure? I am sure it used to. I will have to check this out again.

That's what happened when I rebooted part way through the rebuild yesterday. Not sure if that's normal though.

I've disabled mover as the rebuild was stopping for the daily move and not restarting afterwards.

itimpi · September 5, 2023

11 minutes ago, -C- said:

That's what happened when I rebooted part way through the rebuild yesterday. Not sure if that's normal though.

I wonder if you got an unclean shutdown (or the plugin erroneously thought one had happened) as that would stop anything being restarted.

-C- · September 5, 2023

Just now, itimpi said:

I wonder if you got an unclean shutdown (or the plugin erroneously thought one had happened) as that would stop anything being restarted.

Certainly possible. The system wasn't happy when I rebooted it (which was the reason for the reboot) and it may have killed hung processes in order to reboot. It certainly took longer than usual.

(I used powerdown -r to restart in case that makes any difference.)
