July 6, 2020

Hello,

top - 10:00:03 up 1 day, 11:54, 1 user, load average: 447.11, 443.73, 433.41
Tasks: 954 total, 2 running, 949 sleeping, 0 stopped, 3 zombie
%Cpu(s): 0.2 us, 0.3 sy, 0.0 ni, 15.6 id, 84.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 64418.8 total, 928.1 free, 11473.8 used, 52016.9 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 51404.9 avail Mem

Lately I have been having trouble where, seemingly at random, the server becomes overloaded and stops responding to most of my connections. I can still open the web UI, and some of its content loads (including the diagnostics zip attached below). The Docker and VM views are unresponsive and the Dashboard doesn't load, but I can see the array on the Main tab. I tried a force shutdown of my Gitlab-CE virtual machine, but that changed nothing. This all started after I began hosting my own instance. If I reboot the machine, it runs perfectly well again.

Edit:
Version 6.8.3
Ryzen 3950X
ASRock Rack X470D4U
64 GB DDR4 ECC
2x 970 EVO 1 TB (cache)
2x 970 EVO 500 GB (1 for Plex, 1 empty)
11x 3/4 TB WD Red for data

Running a bunch of Docker containers, from memory:
NginxProxyManager
Plex (with its own SSD)
QbittorrentVPN
SickChill
Gitlab-Runner

Two VMs, both Ubuntu 18.04/20.04:
Gitlab-CE VM (I tried running it in Docker first, but moved it to a VM in case Docker was causing the high loads)
Backup VM (a basic VM running a shell script that takes backups of a MySQL server)

Attached a screenshot of my plugins.
ryzen-diagnostics-20200706-0950.zip

Edited July 6, 2020 by Rodael
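The top header already hints at where the time is going: apart from 84% iowait the CPUs are nearly idle, yet the 1-minute load average is around 14 per logical CPU. On Linux the load average also counts tasks in uninterruptible sleep (typically blocked on disk I/O), so this pattern points at storage rather than CPU work. Below is a small sketch that pulls those numbers out of the pasted lines. The helper names are hypothetical, and the figure of 32 logical CPUs is an assumption based on the 3950X having 16 cores / 32 threads.

```python
import re

# Summary lines copied verbatim from the top output in the first post.
TOP_UPTIME_LINE = ("top - 10:00:03 up 1 day, 11:54, 1 user, "
                   "load average: 447.11, 443.73, 433.41")
TOP_CPU_LINE = "%Cpu(s): 0.2 us, 0.3 sy, 0.0 ni, 15.6 id, 84.0 wa, 0.0 hi, 0.0 si, 0.0 st"

# Assumption: a Ryzen 9 3950X exposes 32 logical CPUs (16 cores, SMT on).
LOGICAL_CPUS = 32


def parse_cpu_line(line: str) -> dict[str, float]:
    """Map each CPU-state field of a top '%Cpu(s):' line (us, sy, id, wa, ...) to its percentage."""
    _, _, fields = line.partition(":")
    return {name: float(value)
            for value, name in re.findall(r"([\d.]+)\s+([a-z]+)", fields)}


def parse_load_average(line: str) -> tuple[float, float, float]:
    """Return the 1-, 5- and 15-minute load averages from a top header line."""
    _, _, tail = line.partition("load average:")
    one, five, fifteen = (float(x) for x in tail.split(","))
    return one, five, fifteen


cpu = parse_cpu_line(TOP_CPU_LINE)
load_1m, _, _ = parse_load_average(TOP_UPTIME_LINE)

# High load plus high iowait with little user/system time suggests many tasks
# waiting on I/O (uninterruptible sleep), not a CPU-bound workload.
print(f"iowait: {cpu['wa']}%  user+system: {cpu['us'] + cpu['sy']:.1f}%")
print(f"1-min load per logical CPU: {load_1m / LOGICAL_CPUS:.1f}")
```

Running `ps -eo state,pid,comm` on the live system and looking for processes in state `D` would be the usual way to confirm which tasks are stuck on I/O.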
July 6, 2020 · Author

Upon closer inspection, it looks like my second cache drive may be faulty. dmesg returns this: https://pastebin.com/raw/U1pzFRSW (I cut the log off where it started spewing errors.)

Should a faulty drive in a mirrored configuration take down the entire system?
July 6, 2020 · Community Expert

There are a lot of checksum errors, which suggest a hardware issue causing data corruption, such as bad RAM or the NVMe devices alternately dropping out.
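One way to narrow down where the checksum errors come from is btrfs's per-device error counters, which `btrfs device stats` prints (for example `btrfs device stats /mnt/cache`). This assumes the mirrored cache pool is btrfs: Unraid uses btrfs for multi-device cache pools, but the thread does not show the filesystem. The sketch below groups the counters by device. The sample text is hypothetical and is not taken from these diagnostics. As a rough heuristic, errors confined to one NVMe device point at that drive or its slot, while `corruption_errs` (checksum failures) on both mirrors fits a shared cause such as RAM somewhat better.

```python
from collections import defaultdict

# Hypothetical sample in the format printed by `btrfs device stats <mountpoint>`.
# Device names and counter values are invented for illustration only.
SAMPLE_STATS = """\
[/dev/nvme0n1p1].write_io_errs    0
[/dev/nvme0n1p1].read_io_errs     0
[/dev/nvme0n1p1].flush_io_errs    0
[/dev/nvme0n1p1].corruption_errs  0
[/dev/nvme0n1p1].generation_errs  0
[/dev/nvme1n1p1].write_io_errs    12
[/dev/nvme1n1p1].read_io_errs     3
[/dev/nvme1n1p1].flush_io_errs    0
[/dev/nvme1n1p1].corruption_errs  1045
[/dev/nvme1n1p1].generation_errs  0
"""


def parse_device_stats(text: str) -> dict[str, dict[str, int]]:
    """Group btrfs per-device error counters by device path."""
    stats: dict[str, dict[str, int]] = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("["):
            continue  # skip blank or unrelated lines
        key, value = line.split()
        # "[/dev/nvme1n1p1].corruption_errs" -> ("/dev/nvme1n1p1", "corruption_errs")
        device, counter = key[1:].split("].", 1)
        stats[device][counter] = int(value)
    return dict(stats)


def devices_with_errors(stats: dict[str, dict[str, int]]) -> list[str]:
    """Return, sorted, the devices that have at least one non-zero error counter."""
    return sorted(dev for dev, counters in stats.items() if any(counters.values()))


stats = parse_device_stats(SAMPLE_STATS)
for device in devices_with_errors(stats):
    print(device, stats[device])
```

In practice the real command output would be passed to `parse_device_stats` instead of the sample string.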
July 6, 2020 · Author

8 minutes ago, johnnie.black said:
"There are a lot of checksum errors, that suggest a hardware issue causing data corruption, like bad RAM, or the NVMe devices are dropping alternatively."

Yeah, I'm currently trying to shut the server down, but it's stuck at "Forcing shutdown". I'm gonna run memtest86+ on it for a while to verify.
Archived
This topic is now archived and is closed to further replies.