WebUI is semi-unresponsive, load average > 400

Rodael · July 6, 2020

Hello,

top - 10:00:03 up 1 day, 11:54,  1 user,  load average: 447.11, 443.73, 433.41
Tasks: 954 total,   2 running, 949 sleeping,   0 stopped,   3 zombie
%Cpu(s):  0.2 us,  0.3 sy,  0.0 ni, 15.6 id, 84.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  64418.8 total,    928.1 free,  11473.8 used,  52016.9 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  51404.9 avail Mem
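A note on reading the header above: CPU use is close to zero while I/O wait is at 84%, and on Linux the load average also counts tasks stuck in uninterruptible sleep (state `D`), so a load of 400+ here most likely means hundreds of tasks are blocked on I/O rather than busy on the CPU. Below is a minimal sketch, assuming a standard Linux `/proc` layout, for listing those blocked tasks. The function names `parse_stat_line` and `blocked_tasks` are made up for illustration, and the scan only looks at each process's main thread, not the per-thread entries under `/proc/<pid>/task`.

```python
import os


def parse_stat_line(line):
    """Split one /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lparen = line.index("(")
    rparen = line.rindex(")")
    pid = int(line[:lparen].strip())
    comm = line[lparen + 1:rparen]
    state = line[rparen + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """Return sorted (pid, comm) pairs for processes in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'meminfo'
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan or the line was unreadable
        if state == "D":
            blocked.append((pid, comm))
    return sorted(blocked)


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

If most of the names printed belong to the same storage path (for example the processes that use the cache pool), that narrows down which device the system is waiting on.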

I have been having some trouble lately: seemingly at random, the server gets overloaded and stops responding to most of my connections. I can, however, open the WebUI, and some of the content loads (including the attached diagnostics zip).

The Docker/VM views are unresponsive, and the dashboard doesn't load. I can see the array in the Main tab, though. I tried a force shutdown on my Gitlab-CE virtual machine, but that changed nothing. This all started after I began hosting my own Gitlab instance.

If I reboot the machine it runs perfectly well again.

Edit:

Version 6.8.3

Ryzen 9 3950X

ASRock Rack X470D4U

64 GB DDR4 ECC

2x 970 EVO 1 TB (cache)
2x 970 EVO 500 GB (1 for Plex, 1 empty)

11x 3/4 TB WD Red for data

Running a bunch of dockers, from memory:

NginxProxyManager

Plex (with its own ssd)

QbittorrentVPN

SickChill

Gitlab-Runner

Two VMs, both Ubuntu 18.04/20.04

Gitlab-CE VM (I ran it in Docker first, then moved it to a VM in case it was causing the high load)

Backup VM (basic VM running a shell script that takes backups of a MySQL server)

Attached a screenshot of my plugins

ryzen-diagnostics-20200706-0950.zip

Edited July 6, 2020 by Rodael

Rodael · July 6, 2020

Upon closer inspection, it seems like my second cache drive may be faulty? dmesg returns this:

https://pastebin.com/raw/U1pzFRSW

I cut the log off at the point where it started spewing errors.

Should a faulty drive in a mirrored configuration take down the entire system?

JorgeB · July 6, 2020

There are a lot of checksum errors, which suggests a hardware issue causing data corruption, like bad RAM. Alternatively, the NVMe devices are dropping.
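One way to tell those two possibilities apart is to look at the per-device error counters that btrfs logs next to its errors, in lines of the form `bdev <device> errs: wr N, rd N, flush N, corrupt N, gen N`. Here is a minimal sketch that pulls the most recent counters for each device out of `dmesg` text. The function name `latest_btrfs_errs` is made up for illustration, and the sketch assumes the log uses the line format above; it has not been run against the pastebin in this thread.

```python
import re
import sys

# Matches the cumulative per-device counters btrfs prints with each error.
ERRS_RE = re.compile(
    r"bdev (\S+) errs: wr (\d+), rd (\d+), flush (\d+), corrupt (\d+), gen (\d+)"
)
FIELDS = ("wr", "rd", "flush", "corrupt", "gen")


def latest_btrfs_errs(log_text):
    """Return {device: {counter: value}} using the last line seen per device."""
    latest = {}
    for match in ERRS_RE.finditer(log_text):
        device = match.group(1)
        values = [int(v) for v in match.groups()[1:]]
        # Counters are cumulative, so a later line replaces an earlier one.
        latest[device] = dict(zip(FIELDS, values))
    return latest


if __name__ == "__main__":
    # Usage: dmesg | python3 this_script.py
    for device, counters in latest_btrfs_errs(sys.stdin.read()).items():
        print(device, counters)
```

As a rough rule of thumb rather than a diagnosis: write and read errors piling up on a single device fit the "device is dropping" explanation, while corruption counts with no write or read errors are more in line with bad RAM.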

Rodael · July 6, 2020

8 minutes ago, johnnie.black said:

There are a lot of checksum errors, which suggests a hardware issue causing data corruption, like bad RAM. Alternatively, the NVMe devices are dropping.

Yeah, I'm currently trying to shut the server down, but it's stuck at "Forcing shutdown". I'm gonna run memtest86+ on it for a while to verify.
