stoutpanda Posted October 5, 2018

Hello guys, I'm new to Unraid and the community. This past month I built a new EPYC server and migrated my homelab over from an older Proxmox server to Unraid. Today is the second time I've experienced an issue where Unraid becomes completely inaccessible via the network; I believe it centers on Docker, but I'm not 100% certain. Last time I hard-reset the server via IPMI, but I wanted to see if there is any chance of bringing it back online with your help, or at least of doing a clean shutdown so I don't have to run another parity check.

I'm using XFS across all data drives, with two 1TB cache SSDs and a few more SSDs as unassigned devices. However, the issue also happened before I added the cache and unassigned devices, while running rc4.

I am unable to access the server by any method except the IPMI console (SSH not responding, Samba/NFS not responding, web UI not responding). I'm unable to stop or force-quit the php-fpm process, or kill it with kill -9 or killall php-fpm; stop and force-quit just sit until they fail. When I list my containers with docker container ls, my pihole/pihole:latest container shows a status of unhealthy. I am unable to stop or kill any of my Docker containers; the console just sits there. I've tried stopping/killing even the ones in a healthy status with no luck, and I cannot force-quit, stop, or kill -9 the Docker service, just like php-fpm. When trying to stop a container or php-fpm I occasionally see the message "INFO: rcu_sched detected expedited stalls". I'm posting a screen capture of the error, as I am not able to connect to get it in text format, nor can I push data off the server via the network. The CPU is currently at 42C per the Supermicro IPMI temperature reading.

I'd appreciate any advice or further troubleshooting ideas.

EDIT: I am attempting to run the diagnostics command now, but do not know how to get the output off the server without causing a parity re-check.
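For reference, this is roughly the sequence I've been running from the IPMI console (a sketch of what I tried, not an exact transcript):

docker container ls           # pihole shows "unhealthy", everything else "healthy"
docker stop -t 10 pihole      # just sits there until it fails
docker kill pihole            # also hangs
kill -9 $(pidof php-fpm)      # no effect
killall -9 php-fpm            # no effect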
trurl Posted October 5, 2018

Does this happen immediately after a hard reset, or can you access it normally for a little while? I don't have anything except a hard reset to recommend, then a memtest. If that passes, and if you can get diagnostics immediately after booting, that might give us more to work with.
stoutpanda Posted October 5, 2018

Thanks trurl. It has run fine for several days; current uptime is 3.5 days, but I think I got close to two weeks before. I can definitely run the diagnostics and upload them after a hard reset if there is nothing else to try.
stoutpanda Posted October 5, 2018

Ah crud, the server is not running diagnostics either, just more of the same CPU errors when trying to run the diagnostics script.
stoutpanda Posted October 5, 2018

Unfortunately the diagnostics would never complete, so I've hard-reset it as suggested. I was able to copy the full syslog (attached as syslog-2018-10-05.zip) before the hard reset, however, and I ran diagnostics (sekhmet-diagnostics-20181005-1246.zip) after a quick boot-up to get the syslog file. After grabbing those I rebooted the server into memtest, as suggested.

syslog-2018-10-05.zip
sekhmet-diagnostics-20181005-1246.zip
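In case it helps anyone else stuck in this state: Unraid keeps /var/log in RAM, so the syslog is lost on reset unless you copy it somewhere persistent first. This is roughly what I did from the IPMI console before resetting (a sketch; the filename is just my own naming):

cp /var/log/syslog /boot/syslog-2018-10-05.txt   # /boot is the flash drive, so the copy survives the reset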
trurl Posted October 5, 2018

Why do you have a 60G docker image? I ask because I have seen people with docker problems increase that size hoping to fix the problem instead of actually getting to the bottom of a bad container configuration. 20G should be more than enough.
stoutpanda Posted October 5, 2018

This was due to an earlier issue with a deluge docker that wasn't configured correctly and was writing data into the docker image. I have since removed that image/container and gone with the binhex-deluge suggested in a thread on the forums here, but I never went back and resized the docker image.

Edit: I am willing to shrink that down, of course, if needed.
trurl Posted October 5, 2018

One of the reasons I recommend a smaller docker image is that it makes it more apparent when you have an out-of-control container; with a large image, the container can run for a while before it breaks. Just changing from some other deluge to binhex isn't really the fix for an application writing into the image. What causes writes into the image is a path in the application which you have set to somewhere that isn't mapped storage.
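As a concrete illustration (the paths here are hypothetical, not your actual template): only paths that are mapped as volumes reach the array, and anything else is written into the container layer inside docker.img.

# host path /mnt/user/downloads is mapped to /data inside the container
docker run -d --name binhex-deluge -v /mnt/user/downloads:/data binhex/arch-deluge
# Inside deluge, a download location of /data lands on the array;
# an unmapped path like /completed is written inside docker.img and eventually fills it.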
stoutpanda Posted October 5, 2018

13 minutes ago, trurl said: One of the reasons I recommend a smaller docker image is that it makes it more apparent when you have an out-of-control container; with a large image, the container can run for a while before it breaks. Just changing from some other deluge to binhex isn't really the fix for an application writing into the image. What causes writes into the image is a path in the application which you have set to somewhere that isn't mapped storage.

Yes, you are correct. That was the issue, and it was also before I found the CA plugin, figured out the /mnt/user shares, and had a better understanding of how things in Unraid are organized. I actually ran into the FAQ by Squid that you have linked in your signature. I mentioned that I got rid of the image just to let you know that that image in particular shouldn't be causing any further trouble. I can reboot the server and will gladly provide any info regarding the docker configurations for further review, and I greatly appreciate your time and help.
stoutpanda Posted October 5, 2018

Here was the usage. I'm recreating it as a 20GB file now, but posting this before I change anything.

Filesystem      Size  Used  Avail  Use%  Mounted on
/dev/loop2       60G  3.9G    55G    7%  /var/lib/docker

root@Sekhmet:~# docker ps -s
CONTAINER ID  IMAGE                             COMMAND                 CREATED      STATUS                   PORTS                   NAMES             SIZE
571f9dd9722e  linuxserver/unifi                 "/init"                 2 days ago   Up 25 seconds                                    unifi             101MB (virtual 559MB)
0a5ad4b8f382  mlebjerg/steamcachebundle:latest  "/scripts/bootstrap.…"  3 days ago   Up 41 seconds                                    SteamCacheBundle  3.65MB (virtual 27MB)
390ebd87726a  linuxserver/mariadb               "/init"                 5 days ago   Up 28 seconds            0.0.0.0:3306->3306/tcp  mariadb           340kB (virtual 362MB)
a62d83fdcae4  linuxserver/duplicati             "/init"                 5 days ago   Up 30 seconds            0.0.0.0:8200->8200/tcp  duplicati         301kB (virtual 596MB)
00ca105186bf  pihole/pihole:latest              "/s6-init"              7 days ago   Up 48 seconds (healthy)                          pihole            10.5MB (virtual 356MB)
8d2543550e7c  binhex/arch-deluge                "/usr/bin/tini -- /b…"  13 days ago  Up 35 seconds                                    binhex-deluge     1.15MB (virtual 1.04GB)

Edit: Docker file resized to 20GB (deleted, recreated containers from templates).

root@Sekhmet:/mnt/user/system/docker# ls -lh
total 20G
-rw-rw-rw- 1 nobody users 20G Oct  5 14:24 docker.img

/dev/loop2       20G  3.4G    15G   19%  /var/lib/docker
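For anyone following along, the resize itself went roughly like this (a sketch of my steps, not official instructions; the exact menu labels may differ between Unraid versions):

# Settings -> Docker: disable the Docker service, then:
rm /mnt/user/system/docker/docker.img    # delete the oversized image
# set the image size to 20G, re-enable Docker,
# then re-add each container from its saved template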
stoutpanda Posted October 6, 2018

Well it passed the first pass of the memtest.
stoutpanda Posted October 6, 2018

Started the memtest over with all cores and ran it last night; so far no errors. I've found some similar errors for Ryzen CPUs on various Ubuntu kernels, but haven't found much for EPYC. https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1690085
stoutpanda Posted October 8, 2018

Well, I ran the memtest through this morning with no issues/errors. I'm going to bring the server back online into Unraid now and see how long until it happens again. I am hopeful that it may be a bug similar to the Ryzen one in the thread above and that future kernel updates may fix it, unless anyone else has ideas.
stoutpanda Posted October 11, 2018

Well, it just happened again after 4 days 16 hours of uptime following the memtest. This time I wasn't able to access my VMs once it happened; they dropped my VNC/RDP sessions and then the whole server became unresponsive except from the console itself. It was even more difficult to work with this time, as I couldn't even get it to list the status of the dockers without the command just sitting there. I could not shut down or destroy VMs from the console (device or resource busy). Again I was unable to run diagnostics before power-cycling, but I was able to copy the syslog. Ran diagnostics after boot again and am attaching those.

sekhmet-diagnostics-20181011-1515.zip
syslog-2018-10-11.zip
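For the record, this is roughly what I tried from the console to bring the VMs down (a sketch; the VM name is a placeholder, since Unraid manages VMs through libvirt and virsh):

virsh list --all          # show all defined VMs and their state
virsh shutdown MyVM       # ask the guest to shut down cleanly: hung
virsh destroy MyVM        # hard power-off: returned "device or resource busy"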
stoutpanda Posted October 11, 2018

While reviewing the syslogs I did notice the following, which happened right before the rcu-sched errors started this time, but I'm having trouble determining what program may have caused it: BUG: unable to handle kernel NULL pointer dereference at 0000000000000030

Oct 11 14:22:28 Sekhmet kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
Oct 11 14:22:28 Sekhmet kernel: PGD 0 P4D 0
Oct 11 14:22:28 Sekhmet kernel: Oops: 0000 [#1] SMP NOPTI
Oct 11 14:22:28 Sekhmet kernel: CPU: 4 PID: 16770 Comm: CPU 0/KVM Tainted: G W 4.18.10-unRAID #2
Oct 11 14:22:28 Sekhmet kernel: Hardware name: Supermicro Super Server/H11SSL-NC, BIOS 1.0b 04/27/2018
Oct 11 14:22:28 Sekhmet kernel: RIP: 0010:drop_spte+0x4a/0x75 [kvm]
Oct 11 14:22:28 Sekhmet kernel: Code: 48 01 d8 72 09 ba ff ef 00 00 48 c1 e2 1f 48 01 d0 ba f5 ff 7f 00 48 89 de 48 c1 e8 0c 48 c1 e2 29 48 c1 e0 06 48 8b 54 10 28 <48> 2b 72 30 48 89 d7 48 c1 fe 03 e8 e9 dd ff ff 48 89 ef 48 89 c6
Oct 11 14:22:28 Sekhmet kernel: RSP: 0018:ffffc90006983c68 EFLAGS: 00010212
Oct 11 14:22:28 Sekhmet kernel: RAX: 0000000030c9f680 RBX: ffff880c327da758 RCX: 0000000000000001
Oct 11 14:22:28 Sekhmet kernel: RDX: 0000000000000000 RSI: ffff880c327da758 RDI: 7fffc40082e34fec
Oct 11 14:22:28 Sekhmet kernel: RBP: ffffc90007015000 R08: 0000000000000001 R09: 0000000000000000
Oct 11 14:22:28 Sekhmet kernel: R10: ffff880424c10008 R11: ffffc90006983ce8 R12: ffffc90007015000
Oct 11 14:22:28 Sekhmet kernel: R13: ffff880424c10000 R14: ffff880424c10008 R15: 0000000300000001
Oct 11 14:22:28 Sekhmet kernel: FS: 0000153c0d02a700(0000) GS:ffff88080f600000(0000) knlGS:0000000000000000
Oct 11 14:22:28 Sekhmet kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 11 14:22:28 Sekhmet kernel: CR2: 0000000000000030 CR3: 0000000c372f8000 CR4: 00000000003406e0
Oct 11 14:22:28 Sekhmet kernel: Call Trace:
Oct 11 14:22:28 Sekhmet kernel: kvm_zap_rmapp+0x39/0x5b [kvm]
Oct 11 14:22:28 Sekhmet kernel: kvm_unmap_rmapp+0x5/0x9 [kvm]
Oct 11 14:22:28 Sekhmet kernel: kvm_handle_hva_range+0x117/0x154 [kvm]
Oct 11 14:22:28 Sekhmet kernel: ? kvm_zap_rmapp+0x5b/0x5b [kvm]
Oct 11 14:22:28 Sekhmet kernel: kvm_mmu_notifier_invalidate_range_start+0x43/0x7a [kvm]
Oct 11 14:22:28 Sekhmet kernel: __mmu_notifier_invalidate_range_start+0x64/0x69
stoutpanda Posted October 12, 2018

Another hang-up today. This time I was able to access the web page (though anything related to docker, plugins, or the Apps page wasn't loading) and SSH, so I was able to get diagnostics. I was unable to stop dockers via stop or kill commands. Tried a shutdown via the web page and it hung; then tried a powerdown command from the console and it stuck again at "ccp 0000:06:00.2: disabled".

sekhmet-diagnostics-20181012-1524.zip
JorgeB Posted October 12, 2018

This looks to me like a hardware problem:

Oct 11 17:59:51 Sekhmet kernel: BUG: Bad page map in process docker pte:ffff88039e8b0318 pmd:39e4f3067

Not sure how well these AMD EPYC boards work with Unraid/Linux. Have you checked the board's system event log for any errors there?
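If you haven't, the system event log can be read from the Supermicro IPMI web interface, or from another machine with ipmitool, something like this (a sketch; host and credentials are placeholders):

ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> sel elist   # list logged hardware events with timestamps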
stoutpanda Posted October 13, 2018

2 hours ago, johnnie.black said: This looks to me like a hardware problem: Oct 11 17:59:51 Sekhmet kernel: BUG: Bad page map in process docker pte:ffff88039e8b0318 pmd:39e4f3067 Not sure how well these AMD EPYC boards work with Unraid/Linux. Have you checked the board's system event log for any errors there?

No events other than powering on / powering off, fortunately/unfortunately.
stoutpanda Posted October 15, 2018

Updated to 6.6.2 hoping that the newer kernel may alleviate some of this!
stoutpanda Posted October 23, 2018

Uptime: 8 days, 1 hour, 5 minutes, with no further issues. I'm feeling very hopeful that the crashes I was having have been resolved by the newer kernel. To anyone with EPYC having similar errors: Linux kernel 4.18.14 and Unraid 6.6.2 appear to have resolved everything here.