Mizerka Posted December 18, 2019

Hey, bit of a strange one. I've been on and off upgrading and working on unraid (in the middle of encrypting the entire array), and over the last week it crashed twice on me. I say crashed, but it's actually sitting at 80-100% CPU usage across all cores, which it never does (24 cores, only running a few dockers), and even more interestingly it's using 100% of RAM. Yesterday I tried a few things like killing docker, force restarting, shutting down the array etc, but nothing worked; the webUI was somewhat usable and I could console onto it as well.

Output of top:

top - 19:23:29 up 22:10,  1 user,  load average: 154.26, 151.66, 147.51
Tasks: 1062 total,   3 running, 1059 sleeping,   0 stopped,   0 zombie
%Cpu(s): 19.1 us,  4.9 sy,  0.1 ni,  7.8 id, 67.4 wa,  0.0 hi,  0.8 si,  0.0 st
MiB Mem :  96714.1 total,    520.3 free,  94856.5 used,   1337.3 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.    223.3 avail Mem

  PID USER   PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
 1809 472    20   0  153908  20228     0 D 501.0  0.0 301:48.74 grafana-server
 1835 root   39  19       0      0     0 S  16.5  0.0  17:38.78 kipmi0
25784 root    0 -20       0      0     0 D  10.2  0.0  42:59.54 loop2
  914 root   20   0       0      0     0 S   6.6  0.0  17:07.15 kswapd0
11948 root   20   0    3424   2308  1680 R   5.0  0.0   0:00.15 lsof
 5922 root   20   0 5413656  61964     0 S   4.6  0.1 253:53.78 influxd
 6737 root   20   0  175536  18744     0 S   3.0  0.0 283:15.09 telegraf
  915 root   20   0       0      0     0 S   2.0  0.0   8:00.16 kswapd1
12001 root   20   0       0      0     0 I   2.0  0.0   1:44.95 kworker/u50:0-btrfs-endio
17414 root   20   0       0      0     0 I   1.7  0.0   0:10.87 kworker/u49:0-btrfs-endio
18371 root   20   0       0      0     0 I   1.7  0.0   0:36.69 kworker/u49:3-btrfs-endio
25464 root   20   0       0      0     0 I   1.7  0.0   0:06.19 kworker/u49:5-btrfs-endio
 9124 nobody 20   0  150628  11848  6372 S   1.3  0.0   4:15.60 nginx
 9652 root   20   0       0      0     0 I   1.3  0.0   0:33.14 kworker/u50:7-btrfs-endio
16159 root   20   0       0      0     0 I   1.3  0.0   0:27.58 kworker/u50:6-btrfs-endio
23079 root   20   0       0      0     0 R   1.3  0.0   0:39.64 kworker/u49:4+btrfs-endio
10860 root   20   0       0      0     0 I   1.0  0.0   0:03.74 kworker/u49:11-btrfs-endio
10955 root   20   0       0      0     0 I   1.0  0.0   0:02.17 kworker/u49:13-btrfs-endio
11390 root   20   0    9776   4396  2552 R   1.0  0.0   0:00.10 top
 2533 root   20   0       0      0     0 I   0.7  0.0   0:20.58 kworker/u49:7-btrfs-endio
 2621 nobody 20   0 7217200 113880     0 S   0.7  0.1   6:06.13 jackett
 7894 root   22   2  113580  24708 19188 S   0.7  0.0   8:01.97 php
 9093 root   20   0  283668   3948  3016 S   0.7  0.0   8:15.71 emhttpd
25333 root   20   0 1927136 120228   976 S   0.7  0.1 171:10.80 shfs
31883 nobody 20   0 4400376 491120     4 S   0.7  0.5  15:38.17 Plex Media Serv
  147 root   20   0       0      0     0 I   0.3  0.0   0:29.49 kworker/14:1-events
  936 root   20   0  113748  13228  7672 S   0.3  0.0   0:00.22 php-fpm
 1662 root   20   0   36104    988     0 D   0.3  0.0   0:07.78 openvpn
 1737 root    0 -20       0      0     0 I   0.3  0.0   0:02.95 kworker/12:1H-kblockd
 2530 root   20   0 8677212 232344     0 S   0.3  0.2   1:09.19 java
 2794 nobody 20   0  197352  51372     0 D   0.3  0.1   0:44.64 python
 6629 root   20   0       0      0     0 I   0.3  0.0   0:27.33 kworker/8:2-events
 7607 root   20   0    3656    232   196 D   0.3  0.0   0:00.01 bash
14099 root   20   0   33648  14988    52 D   0.3  0.0   0:13.68 supervisord
18350 root   20   0       0      0     0 I   0.3  0.0   0:42.51 kworker/u50:3-btrfs-endio
21018 nobody 20   0 3453756 581936     4 S   0.3  0.6  51:39.60 mono
21121 nobody 20   0 2477100 532884     4 S   0.3  0.5   8:34.30 mono
22859 root   20   0       0      0     0 I   0.3  0.0   0:12.78 kworker/u50:5-btrfs-endio-meta
25853 root   20   0 2649020  45064 19816 S   0.3  0.0  81:46.26 containerd
31808 root   20   0   76984    664   412 D   0.3  0.0   0:01.23 php7.0
32288 nobody 20   0  429092   1852     0 S   0.3  0.0   0:21.01 Plex Tuner Serv
32302 root   20   0       0      0     0 I   0.3  0.0   0:30.08 kworker/19:1-xfs-buf/md9
    1 root   20   0    2460   1700  1596 S   0.0  0.0   0:13.40 init
    2 root   20   0       0      0     0 S   0.0  0.0   0:00.07 kthreadd
    3 root    0 -20       0      0     0 I   0.0  0.0   0:00.00 rcu_gp
    4 root    0 -20       0      0     0 I   0.0  0.0   0:00.00 rcu_par_gp
    6 root    0 -20       0      0     0 I   0.0  0.0   0:00.00 kworker/0:0H-kblockd
    9 root    0 -20       0      0     0 I   0.0  0.0   0:00.00 mm_percpu_wq
   10 root   20   0       0      0     0 S   0.0  0.0   0:27.84 ksoftirqd/0
   11 root   20   0       0      0     0 I   0.0  0.0   2:03.96 rcu_sched
   12 root   20   0       0      0     0 I   0.0  0.0   0:00.00 rcu_bh

Clearly grafana is having some fun there, managing 500% CPU (it goes up to ~900% sometimes). But even after trying to kill it, it doesn't work; by that I mean it refuses to die, even when force-killing the entire docker.

The only thing I can think of recently is that I've started to run some youtube-dl scripts as part of the recent YT changes, to archive some channels, but that's hardly doing anything destructive imo. It writes some temp files, then remuxes the parts into single mkvs etc, but that's about it, all done locally by another client as well.

Attached diagnostics; unraid is still running atm, but I'll probably kill it before the end of the night. Any help is appreciated.

nekounraid-diagnostics-20191218-1910.zip
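For reference, the archiving runs are nothing exotic; a rough sketch of what they look like (channel URL, archive file and output template here are placeholders, not the actual script):

```shell
# Archive a channel: youtube-dl skips anything already recorded in
# archive.txt, downloads the rest, and remuxes the streams into a
# single mkv per video.
youtube-dl \
  --download-archive archive.txt \
  --merge-output-format mkv \
  -o "%(uploader)s/%(upload_date)s - %(title)s.%(ext)s" \
  "https://www.youtube.com/channel/CHANNEL_ID"
```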
Mizerka Posted December 19, 2019 (Author)

Guess who's back, back again. Array's dead, dead again. I've isolated one of the cores after a forced reboot, so now at least the webGUI is usable (I guess that means isolated from everything but the unraid OS itself? okay), despite every other core sitting at 100%. Dockers are mostly dead due to lack of CPU time, but sometimes respond with a webpage or some output. Shares are working almost normally as well. Nothing useful in the logs again.

After removing plugins one by one, the array returned to normal after killing the IPMI or temperature sensor plugins. So that's interesting, that a plugin would brick unraid out of nowhere... oh well, we'll see tomorrow.
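For anyone wanting to keep a core free for the webGUI the same way: on unraid this kind of isolation comes down to the `isolcpus` kernel parameter in the boot config (the fragment below is illustrative only; core numbers depend on your topology, and newer 6.x builds expose the same thing under Settings > CPU Pinning):

```
label unRAID OS
  menu default
  kernel /bzimage
  append isolcpus=0 initrd=/bzroot
```

The isolated core is then ignored by the general scheduler, so even with every other core pegged at 100% the OS and webGUI still get CPU time there.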
soja Posted December 19, 2019

Bazarr is using ~90GB of RAM in your diagnostics output.
Mizerka Posted December 19, 2019 (Author)

Oh, you're right, I missed that:

nobody 13716 8.7 92.7 92190640 91863428 ? Sl 06:34 66:18 | | \_ /usr/bin/python -u /app/bazarr/bazarr/main.py --no-update --config /config

Okay, killing it for now then. I guess it's some memory leak; I've never seen it use that much. Thanks.
szymon Posted December 22, 2019

Hi, I recently had the exact same issue with the exact same scenario. I explained it here, but it seems that my assumptions were incorrect.

I was going through the exercise of encrypting the array disks one by one, and my unraid would stall, consuming 100% CPU and 100% RAM, and would constantly read from one of the cache SSDs at 200 MB/s. I isolated the issue to docker itself, but was not able to find out which container was responsible. Once stalled, the GUI was unusable, docker stats would never load, and I could not kill the dockers. The only way out of it was to disable the docker service or reboot.

Then I ran all the containers with a limited memory parameter and set docker not to shut a container down when it runs out of memory. A few hours later I could see bazarr consuming 100% of its allocated memory. Unraid was fortunately still responsive because of the limited RAM, and I was able to do some diagnostics. It turns out that python within the bazarr container is using all the RAM. I don't know if the encryption is at fault, but I have disabled it for now. I will recreate bazarr from scratch with a new config when I have some spare time, and I will post the result here.
S1dney Posted December 23, 2019

Also, you might want to run "btrfs balance start -mconvert=raid1 /mnt/cache" against your pool, cause your setup isn't that redundant at the moment 🙂

           Data      Metadata  System
Id Path    RAID1     single(!) single(!) Unallocated
-- --------- --------- --------- --------- -----------
 1 /dev/sdt1 250.00GiB   2.01GiB   4.00MiB   213.75GiB
 2 /dev/sdu1 250.00GiB   2.00GiB         -   213.76GiB
-- --------- --------- --------- --------- -----------
   Total     250.00GiB   4.01GiB   4.00MiB   427.51GiB
   Used      219.61GiB   1.96GiB  64.00KiB

If one of your drives fails now, you're in bad luck. See:
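To see what profiles a pool is actually using before (and after) the balance, something along these lines works (assuming the pool is mounted at the usual unraid cache path):

```shell
# Show how data / metadata / system chunks are allocated; "single"
# metadata on a two-device pool means the metadata has no redundancy.
btrfs filesystem df /mnt/cache

# Convert the metadata to RAID1 (system chunks normally follow the
# metadata filter) without rewriting the data chunks.
btrfs balance start -mconvert=raid1 /mnt/cache
```

Re-running the df command afterwards should show Metadata and System as RAID1.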
szymon Posted December 23, 2019

Thanks @S1dney, I was affected by the cache pool bug and just fixed it! Cheers!
S1dney Posted December 23, 2019 Share Posted December 23, 2019 12 minutes ago, szymon said: Thanks @S1dney, I was affected by the cache pool bug and just fixed it! Cheers! You're welcome! I think a lot of us are.. Without knowing. If you're not troubleshooting the cache you'll likely not notice this, unless you're questioned about why the metrics are way off between disks Quote Link to comment
Mizerka Posted December 26, 2019 (Author)

On 12/23/2019 at 11:22 AM, S1dney said:
Also, you might want to run "btrfs balance start -mconvert=raid1 /mnt/cache" against your pool, cause your setup isn't that redundant at the moment 🙂 [...] If one of your drives fails now, you're in bad luck.

Thanks for flagging this, wasn't aware of it.
S1dney Posted December 27, 2019

18 hours ago, Mizerka said:
Thanks for flagging this, wasn't aware of it.

No problem! Cheers.