stoutpanda Posted October 5, 2018

Hello guys, I'm new to Unraid and the community. This past month I built a new EPYC server and migrated my homelab over from an older Proxmox server to Unraid. Today is the second time I've experienced an issue where Unraid becomes completely inaccessible via the network; I believe it centers on Docker, but I'm not 100% certain. Last time I hard-reset the server via IPMI, but I wanted to see if there is any chance of bringing it back online with your help, or at least of doing a clean shutdown so I don't have to run another parity check.

I'm using XFS across all data drives, with two 1TB cache SSDs and a few more SSDs as unassigned devices. However, the issue also happened before I added the cache and unassigned devices, while running rc4.

I am unable to access the server by any method except the IPMI console (SSH not responding, Samba/NFS not responding, web UI not responding). I'm unable to stop or force-quit the php-fpm process, or kill it with kill -9 or killall php-fpm; stop and force-quit just sit until they fail. When I list my containers with docker container ls, my pihole/pihole:latest container shows a status of unhealthy. I am unable to stop or kill any of my Docker containers; the console just sits there. I've tried stopping/killing even the ones in a healthy status with no luck, and I cannot force-quit, stop, or kill -9 the Docker service, just like php-fpm. When trying to stop a container or php-fpm I occasionally see the message "INFO: rcu_sched detected expedited stalls". I'm posting a screen capture of the error, as I am not able to connect to get it in text format, nor can I push data off the server via the network. The CPU is currently at 42C per the Supermicro IPMI temperature reading.

I'd appreciate any advice or further troubleshooting ideas.

EDIT: I am attempting to run the diagnostics command now, but do not know how to get the output off the server without causing a parity re-check.
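For reference, this is roughly the sequence I've been running from the IPMI console (a sketch of what I tried, not an exact transcript):

docker container ls           # pihole shows "unhealthy", everything else "healthy"
docker stop -t 10 pihole      # just sits there until it fails
docker kill pihole            # also hangs
kill -9 $(pidof php-fpm)      # no effect
killall -9 php-fpm            # no effect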
trurl Posted October 5, 2018

Does this happen immediately after a hard reset, or can you access it normally for a little while? I don't have anything except a hard reset to recommend, then a memtest. If that passes, and if you can get diagnostics immediately after booting, that might give us more to work with.
stoutpanda Posted October 5, 2018

Thanks trurl. It has run fine for several days; current uptime is 3.5 days, but I think I got close to two weeks before. I can definitely run the diagnostics and upload them after a hard reset if there is nothing else to try.
stoutpanda Posted October 5, 2018

Ah crud, the server is not running diagnostics either, just more of the same CPU errors when trying to run the diagnostics script.
stoutpanda Posted October 5, 2018

Unfortunately the diagnostics would never complete, so I've hard-reset it as suggested. I was able to copy the full syslog (attached as syslog-2018-10-05.zip) before the hard reset, however, and I ran diagnostics (sekhmet-diagnostics-20181005-1246.zip) after a quick boot-up to get the syslog file. After grabbing those I rebooted the server into memtest, as suggested.

syslog-2018-10-05.zip
sekhmet-diagnostics-20181005-1246.zip
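In case it helps anyone else stuck in this state: Unraid keeps /var/log in RAM, so the syslog is lost on reset unless you copy it somewhere persistent first. This is roughly what I did from the IPMI console before resetting (a sketch; the filename is just my own naming):

cp /var/log/syslog /boot/syslog-2018-10-05.txt   # /boot is the flash drive, so the copy survives the reset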
trurl Posted October 5, 2018

Why do you have a 60G docker image? I ask because I have seen people with docker problems increase that size hoping to fix the problem instead of actually getting to the bottom of a bad container configuration. 20G should be more than enough.
stoutpanda Posted October 5, 2018

This was due to an earlier issue with a deluge docker that wasn't configured correctly and was writing data into the docker image. I have since removed that image/container and gone with the binhex-deluge suggested in a thread on the forums here, but I never went back and resized the docker image.

Edit: I am willing to shrink that down, of course, if needed.
trurl Posted October 5, 2018

One of the reasons I recommend a smaller docker image is that it makes it more apparent when you have an out-of-control container; with a large image, the container can run for a while before it breaks. Just changing from some other deluge to binhex isn't really the fix for an application writing into the image. What causes writes into the image is a path in the application which you have set to somewhere that isn't mapped storage.
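As a concrete illustration (the paths here are hypothetical, not your actual template): only paths that are mapped as volumes reach the array, and anything else is written into the container layer inside docker.img.

# host path /mnt/user/downloads is mapped to /data inside the container
docker run -d --name binhex-deluge -v /mnt/user/downloads:/data binhex/arch-deluge
# Inside deluge, a download location of /data lands on the array;
# an unmapped path like /completed is written inside docker.img and eventually fills it.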
stoutpanda Posted October 5, 2018

13 minutes ago, trurl said: One of the reasons I recommend a smaller docker image is that it makes it more apparent when you have an out-of-control container; with a large image, the container can run for a while before it breaks. Just changing from some other deluge to binhex isn't really the fix for an application writing into the image. What causes writes into the image is a path in the application which you have set to somewhere that isn't mapped storage.

Yes, you are correct. That was the issue, and it was also before I found the CA plugin, figured out the /mnt/user shares, and had a better understanding of how things in Unraid are organized. I actually ran into the FAQ by Squid that you have linked in your signature. I mentioned that I got rid of the image just to let you know that that image in particular shouldn't be causing any further trouble. I can reboot the server and will gladly provide any info regarding the docker configurations for further review, and I greatly appreciate your time and help.
stoutpanda Posted October 5, 2018

Here was the usage. I'm recreating it as a 20GB file now, but posting this before I change anything.

Filesystem      Size  Used  Avail  Use%  Mounted on
/dev/loop2       60G  3.9G    55G    7%  /var/lib/docker

root@Sekhmet:~# docker ps -s
CONTAINER ID  IMAGE                             COMMAND                 CREATED      STATUS                   PORTS                   NAMES             SIZE
571f9dd9722e  linuxserver/unifi                 "/init"                 2 days ago   Up 25 seconds                                    unifi             101MB (virtual 559MB)
0a5ad4b8f382  mlebjerg/steamcachebundle:latest  "/scripts/bootstrap.…"  3 days ago   Up 41 seconds                                    SteamCacheBundle  3.65MB (virtual 27MB)
390ebd87726a  linuxserver/mariadb               "/init"                 5 days ago   Up 28 seconds            0.0.0.0:3306->3306/tcp  mariadb           340kB (virtual 362MB)
a62d83fdcae4  linuxserver/duplicati             "/init"                 5 days ago   Up 30 seconds            0.0.0.0:8200->8200/tcp  duplicati         301kB (virtual 596MB)
00ca105186bf  pihole/pihole:latest              "/s6-init"              7 days ago   Up 48 seconds (healthy)                          pihole            10.5MB (virtual 356MB)
8d2543550e7c  binhex/arch-deluge                "/usr/bin/tini -- /b…"  13 days ago  Up 35 seconds                                    binhex-deluge     1.15MB (virtual 1.04GB)

Edit: Docker file resized to 20GB (deleted, recreated containers from templates).

root@Sekhmet:/mnt/user/system/docker# ls -lh
total 20G
-rw-rw-rw- 1 nobody users 20G Oct  5 14:24 docker.img

/dev/loop2       20G  3.4G    15G   19%  /var/lib/docker
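For anyone following along, the resize itself went roughly like this (a sketch of my steps, not official instructions; the exact menu labels may differ between Unraid versions):

# Settings -> Docker: disable the Docker service, then:
rm /mnt/user/system/docker/docker.img    # delete the oversized image
# set the image size to 20G, re-enable Docker,
# then re-add each container from its saved template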
stoutpanda Posted October 6, 2018

Well it passed the first pass of the memtest.
stoutpanda Posted October 6, 2018

Started the memtest over with all cores and ran it last night; so far no errors. I've found some similar errors for Ryzen CPUs on various Ubuntu kernels, but haven't found much for EPYC. https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1690085
stoutpanda Posted October 8, 2018

Well, I ran the memtest through this morning with no issues/errors. I'm going to bring the server back online into Unraid now and see how long until it happens again. I am hopeful that it may be a bug similar to the Ryzen one in the thread above and that future kernel updates may fix it, unless anyone else has ideas.
stoutpanda Posted October 11, 2018

Well, it just happened again after 4 days 16 hours of uptime following the memtest. This time I wasn't able to access my VMs once it happened; they dropped my VNC/RDP sessions and then the whole server became unresponsive except from the console itself. It was even more difficult to work with this time, as I couldn't even get it to list the status of the dockers without the command just sitting there. I could not shut down or destroy VMs from the console (device or resource busy). Again I was unable to run diagnostics before power-cycling, but I was able to copy the syslog. Ran diagnostics after boot again and am attaching those.

sekhmet-diagnostics-20181011-1515.zip
syslog-2018-10-11.zip
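For the record, this is roughly what I tried from the console to bring the VMs down (a sketch; the VM name is a placeholder, since Unraid manages VMs through libvirt and virsh):

virsh list --all          # show all defined VMs and their state
virsh shutdown MyVM       # ask the guest to shut down cleanly: hung
virsh destroy MyVM        # hard power-off: returned "device or resource busy"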
stoutpanda Posted October 11, 2018

While reviewing the syslogs I did notice the following, which happened right before the rcu-sched errors started this time, but I'm having trouble determining what program may have caused it: BUG: unable to handle kernel NULL pointer dereference at 0000000000000030

Oct 11 14:22:28 Sekhmet kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
Oct 11 14:22:28 Sekhmet kernel: PGD 0 P4D 0
Oct 11 14:22:28 Sekhmet kernel: Oops: 0000 [#1] SMP NOPTI
Oct 11 14:22:28 Sekhmet kernel: CPU: 4 PID: 16770 Comm: CPU 0/KVM Tainted: G W 4.18.10-unRAID #2
Oct 11 14:22:28 Sekhmet kernel: Hardware name: Supermicro Super Server/H11SSL-NC, BIOS 1.0b 04/27/2018
Oct 11 14:22:28 Sekhmet kernel: RIP: 0010:drop_spte+0x4a/0x75 [kvm]
Oct 11 14:22:28 Sekhmet kernel: Code: 48 01 d8 72 09 ba ff ef 00 00 48 c1 e2 1f 48 01 d0 ba f5 ff 7f 00 48 89 de 48 c1 e8 0c 48 c1 e2 29 48 c1 e0 06 48 8b 54 10 28 <48> 2b 72 30 48 89 d7 48 c1 fe 03 e8 e9 dd ff ff 48 89 ef 48 89 c6
Oct 11 14:22:28 Sekhmet kernel: RSP: 0018:ffffc90006983c68 EFLAGS: 00010212
Oct 11 14:22:28 Sekhmet kernel: RAX: 0000000030c9f680 RBX: ffff880c327da758 RCX: 0000000000000001
Oct 11 14:22:28 Sekhmet kernel: RDX: 0000000000000000 RSI: ffff880c327da758 RDI: 7fffc40082e34fec
Oct 11 14:22:28 Sekhmet kernel: RBP: ffffc90007015000 R08: 0000000000000001 R09: 0000000000000000
Oct 11 14:22:28 Sekhmet kernel: R10: ffff880424c10008 R11: ffffc90006983ce8 R12: ffffc90007015000
Oct 11 14:22:28 Sekhmet kernel: R13: ffff880424c10000 R14: ffff880424c10008 R15: 0000000300000001
Oct 11 14:22:28 Sekhmet kernel: FS: 0000153c0d02a700(0000) GS:ffff88080f600000(0000) knlGS:0000000000000000
Oct 11 14:22:28 Sekhmet kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 11 14:22:28 Sekhmet kernel: CR2: 0000000000000030 CR3: 0000000c372f8000 CR4: 00000000003406e0
Oct 11 14:22:28 Sekhmet kernel: Call Trace:
Oct 11 14:22:28 Sekhmet kernel: kvm_zap_rmapp+0x39/0x5b [kvm]
Oct 11 14:22:28 Sekhmet kernel: kvm_unmap_rmapp+0x5/0x9 [kvm]
Oct 11 14:22:28 Sekhmet kernel: kvm_handle_hva_range+0x117/0x154 [kvm]
Oct 11 14:22:28 Sekhmet kernel: ? kvm_zap_rmapp+0x5b/0x5b [kvm]
Oct 11 14:22:28 Sekhmet kernel: kvm_mmu_notifier_invalidate_range_start+0x43/0x7a [kvm]
Oct 11 14:22:28 Sekhmet kernel: __mmu_notifier_invalidate_range_start+0x64/0x69
stoutpanda Posted October 12, 2018

Another hang-up today. This time I was able to access the web page (though anything related to docker, plugins, or the Apps page wasn't loading) and SSH, so I was able to get diagnostics. I was unable to stop dockers via stop or kill commands. Tried a shutdown via the web page and it hung; then tried a powerdown command from the console and it stuck again at "ccp 0000:06:00.2: disabled".

sekhmet-diagnostics-20181012-1524.zip
JorgeB Posted October 12, 2018

This looks to me like a hardware problem:

Oct 11 17:59:51 Sekhmet kernel: BUG: Bad page map in process docker pte:ffff88039e8b0318 pmd:39e4f3067

Not sure how well these AMD EPYC boards work with Unraid/Linux. Have you checked the board's system event log for any errors there?
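If you haven't, the system event log can be read from the Supermicro IPMI web interface, or from another machine with ipmitool, something like this (a sketch; host and credentials are placeholders):

ipmitool -I lanplus -H <bmc-ip> -U <user> -P <password> sel elist   # list logged hardware events with timestamps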
stoutpanda Posted October 13, 2018

2 hours ago, johnnie.black said: This looks to me like a hardware problem: Oct 11 17:59:51 Sekhmet kernel: BUG: Bad page map in process docker pte:ffff88039e8b0318 pmd:39e4f3067 Not sure how well these AMD EPYC boards work with Unraid/Linux. Have you checked the board's system event log for any errors there?

No events other than powering on / powering off, fortunately/unfortunately.
stoutpanda Posted October 15, 2018

Updated to 6.6.2 hoping that the newer kernel may alleviate some of this!
stoutpanda Posted October 23, 2018

Uptime: 8 days, 1 hour, 5 minutes, with no further issues. I'm feeling very hopeful that the crashes I was having have been resolved by the newer kernel. To anyone with EPYC having similar errors: Linux kernel 4.18.14 and Unraid 6.6.2 appear to have resolved everything here.