Unraid 6.6.1 - Server Only Accessible via a Console



Hello guys, I'm new to Unraid and the community. This past month I built a new EPYC server and migrated my homelab over from an older Proxmox server to Unraid.

 

Today is the second time I've experienced an issue where Unraid becomes completely inaccessible via the network; I believe it centers around docker, but I am not 100% certain. Last time I hard reset the server via IPMI, but I wanted to see if there is any chance of bringing it back online with your help, or at least doing a clean shutdown so I don't have to run another parity check.

 

I'm using XFS across all data drives, with two 1TB cache SSDs as well as a few more SSDs as unassigned devices. However, the issue also happened before I added the cache and unassigned devices, while running rc4.

 

I am unable to access the server via any method except my IPMI console (SSH not responding, Samba/NFS not responding, web UI not responding).

 

I'm unable to stop or force-quit the php-fpm process, or kill it with kill -9 or killall php-fpm. The stop or force-quit just sits there until it fails.

 

When I list my docker containers with docker container ls, I see that my pihole/pihole:latest container has a status of unhealthy. I am unable to stop or kill any of my docker containers; the console just sits there. I've tried stopping/killing even the ones in a healthy status, with no luck.

 

I am unable to force-quit, stop, or kill -9 the docker service, just like with php-fpm.
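
Roughly the commands I've been trying from the console (from memory, so the exact invocations may be slightly off; the container name is just an example):

killall -9 php-fpm                  # also tried kill -9 with the PID from ps
docker stop pihole                  # hangs indefinitely
docker kill pihole                  # same
/etc/rc.d/rc.docker stop            # attempt to stop the docker service itself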

 

When trying to stop a container or php-fpm, I occasionally see the following message: rcu-sched detected expedited stalls. I am posting a screen capture of the error as I am not able to connect to get it in text format, nor can I push data out of the server via the network. The CPU is currently at 42°C according to the Supermicro IPMI console temperature reading.

 

 

I'd appreciate any advice or further troubleshooting ideas. 

 

EDIT: I am attempting to run the diagnostics command now, but do not know how to get them off of the server without triggering another parity check.
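
My current thought is to copy whatever I can to the flash drive so it survives a hard reset, roughly like this (assuming /boot is the USB flash device, as it normally is on Unraid):

mkdir -p /boot/logs                 # /var/log lives in RAM and is lost on a reset
cp /var/log/syslog /boot/logs/      # the flash drive persists across a hard reset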

 


capture.jpg

capture2.jpg


Does this happen immediately after a hard reset or can you access it normally for a little while?

 

I don't have anything except a hard reset to recommend. Then a memtest.

 

If that passes then if you can get Diagnostics immediately after booting that might give us more to work with.


Unfortunately the diagnostics would never complete, so I've hard reset it as suggested.

 

I was able to copy the full syslog (attached as syslog-2018-10-05.zip) before the hard reset, however.

 

I did run diagnostics (sekhmet-diagnostics-20181005-1246.zip) after a quick boot up to get the syslog file off the server.

 

After grabbing those, I rebooted the server and am now running memtest as suggested.

syslog-2018-10-05.zip

sekhmet-diagnostics-20181005-1246.zip


This was due to an earlier issue with a deluge docker that wasn't configured correctly and was writing data to the docker image.

 

I have since removed that docker image/container and went with the binhex-deluge container suggested in a thread on the forums here, but I never went back and resized the docker image.

 

Edit: I am willing to shrink that down of course if needed.


One of the reasons I recommend a smaller docker image is because it will make it more apparent when you have an out-of-control container. If you have a large image it can run for a while before it breaks.
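
A quick way to keep an eye on usage from the console (the loop device mount point may differ on your system):

df -h /var/lib/docker               # overall docker.img usage
docker system df                    # breakdown by images, containers, and volumes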

 

Just changing from some other deluge to binhex isn't really the fix for an application writing into the image. What causes writes into the image is a path in the application which you have set to somewhere that isn't mapped storage.
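
As an illustration only (the paths here are made up, not your actual settings): a container started with something like

docker run -d --name deluge -v /mnt/user/downloads:/data binhex/arch-deluge

writes anything under /data to the array/cache, but if the application is configured to download to, say, /incomplete, which has no host mapping, those writes land inside docker.img.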

13 minutes ago, trurl said:

One of the reasons I recommend a smaller docker image is because it will make it more apparent when you have an out-of-control container. If you have a large image it can run for a while before it breaks.

 

Just changing from some other deluge to binhex isn't really the fix for an application writing into the image. What causes writes into the image is a path in the application which you have set to somewhere that isn't mapped storage.

Yes, you are correct. That was the issue, and it was before I found the CA plugin, figured out the /mnt/user shares, and had a better understanding of how things in Unraid are organized. I actually ran into the FAQ by Squid that you have linked in your signature.

 

I mentioned getting rid of that image just to let you know that it, in particular, shouldn't be causing any further trouble.

 

I can reboot the server, however, and will gladly provide any info regarding the docker configurations for further review. I greatly appreciate your time and help.


Here is the usage. I'm recreating it as a 20GB file now, but wanted to show it before changing anything.

 

 

 

Filesystem      Size  Used Avail Use% Mounted on
/dev/loop2       60G  3.9G   55G   7% /var/lib/docker
 

 

root@Sekhmet:~# docker ps -s
CONTAINER ID        IMAGE                              COMMAND                  CREATED             STATUS                    PORTS                    NAMES               SIZE
571f9dd9722e        linuxserver/unifi                  "/init"                  2 days ago          Up 25 seconds                                      unifi               101MB (virtual 559MB)
0a5ad4b8f382        mlebjerg/steamcachebundle:latest   "/scripts/bootstrap.…"   3 days ago          Up 41 seconds                                      SteamCacheBundle    3.65MB (virtual 27MB)
390ebd87726a        linuxserver/mariadb                "/init"                  5 days ago          Up 28 seconds             0.0.0.0:3306->3306/tcp   mariadb             340kB (virtual 362MB)
a62d83fdcae4        linuxserver/duplicati              "/init"                  5 days ago          Up 30 seconds             0.0.0.0:8200->8200/tcp   duplicati           301kB (virtual 596MB)
00ca105186bf        pihole/pihole:latest               "/s6-init"               7 days ago          Up 48 seconds (healthy)                            pihole              10.5MB (virtual 356MB)
8d2543550e7c        binhex/arch-deluge                 "/usr/bin/tini -- /b…"   13 days ago         Up 35 seconds                                      binhex-deluge       1.15MB (virtual 1.04GB)
 

 

Edit: Docker image resized to 20GB (deleted and recreated the containers from templates).


root@Sekhmet:/mnt/user/system/docker# ls -lh
total 20G
-rw-rw-rw- 1 nobody users 20G Oct  5 14:24 docker.img
 

/dev/loop2       20G  3.4G   15G  19% /var/lib/docker
 


Well, memtest ran through this morning with no issues/errors. I'm going to bring the server back online into Unraid now and see how long it takes to happen again.

 

I am hopeful that it may be a bug similar to the Ryzen thread above, and that future kernel updates may fix it, unless anyone else has ideas.


Well, it just happened again after 4 days, 16 hours of uptime following the memtest.

 

This time I wasn't able to access my VMs once it happened; they dropped my VNC/RDP sessions and then the whole server became unresponsive except from the console itself again. It was even more difficult to work with this time, as I couldn't even get it to list the status of the dockers; the console just sat there waiting for the command to run.

 

I could not shut down or destroy the VMs from the console (device or resource busy).
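
Roughly what I tried (VM names as shown by virsh list; exact output from memory):

virsh list --all                    # confirm VM names and state
virsh shutdown <vm-name>            # graceful shutdown attempt, just hung
virsh destroy <vm-name>             # force off; this is what returned "device or resource busy"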

 

I was again unable to run diagnostics before power cycling, but I was able to copy the syslog. I ran diagnostics after boot again and am attaching both.

 

 

sekhmet-diagnostics-20181011-1515.zip

syslog-2018-10-11.zip


While reviewing the syslogs I did notice the following, which happened right before the rcu-sched errors started this time, but I'm having trouble determining what program may have caused it.

 

 

BUG: unable to handle kernel NULL pointer dereference at 0000000000000030

Quote

Oct 11 14:22:28 Sekhmet kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
Oct 11 14:22:28 Sekhmet kernel: PGD 0 P4D 0 
Oct 11 14:22:28 Sekhmet kernel: Oops: 0000 [#1] SMP NOPTI

Oct 11 14:22:28 Sekhmet kernel: CPU: 4 PID: 16770 Comm: CPU 0/KVM Tainted: G        W         4.18.10-unRAID #2
Oct 11 14:22:28 Sekhmet kernel: Hardware name: Supermicro Super Server/H11SSL-NC, BIOS 1.0b 04/27/2018
Oct 11 14:22:28 Sekhmet kernel: RIP: 0010:drop_spte+0x4a/0x75 [kvm]
Oct 11 14:22:28 Sekhmet kernel: Code: 48 01 d8 72 09 ba ff ef 00 00 48 c1 e2 1f 48 01 d0 ba f5 ff 7f 00 48 89 de 48 c1 e8 0c 48 c1 e2 29 48 c1 e0 06 48 8b 54 10 28 <48> 2b 72 30 48 89 d7 48 c1 fe 03 e8 e9 dd ff ff 48 89 ef 48 89 c6 
Oct 11 14:22:28 Sekhmet kernel: RSP: 0018:ffffc90006983c68 EFLAGS: 00010212
Oct 11 14:22:28 Sekhmet kernel: RAX: 0000000030c9f680 RBX: ffff880c327da758 RCX: 0000000000000001
Oct 11 14:22:28 Sekhmet kernel: RDX: 0000000000000000 RSI: ffff880c327da758 RDI: 7fffc40082e34fec
Oct 11 14:22:28 Sekhmet kernel: RBP: ffffc90007015000 R08: 0000000000000001 R09: 0000000000000000
Oct 11 14:22:28 Sekhmet kernel: R10: ffff880424c10008 R11: ffffc90006983ce8 R12: ffffc90007015000
Oct 11 14:22:28 Sekhmet kernel: R13: ffff880424c10000 R14: ffff880424c10008 R15: 0000000300000001
Oct 11 14:22:28 Sekhmet kernel: FS:  0000153c0d02a700(0000) GS:ffff88080f600000(0000) knlGS:0000000000000000
Oct 11 14:22:28 Sekhmet kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 11 14:22:28 Sekhmet kernel: CR2: 0000000000000030 CR3: 0000000c372f8000 CR4: 00000000003406e0
Oct 11 14:22:28 Sekhmet kernel: Call Trace:
Oct 11 14:22:28 Sekhmet kernel: kvm_zap_rmapp+0x39/0x5b [kvm]
Oct 11 14:22:28 Sekhmet kernel: kvm_unmap_rmapp+0x5/0x9 [kvm]
Oct 11 14:22:28 Sekhmet kernel: kvm_handle_hva_range+0x117/0x154 [kvm]
Oct 11 14:22:28 Sekhmet kernel: ? kvm_zap_rmapp+0x5b/0x5b [kvm]
Oct 11 14:22:28 Sekhmet kernel: kvm_mmu_notifier_invalidate_range_start+0x43/0x7a [kvm]
Oct 11 14:22:28 Sekhmet kernel: __mmu_notifier_invalidate_range_start+0x64/0x69

 


Another hang up today.

 

This time I was able to access the web page (though anything related to docker, plugins, or the Apps page wasn't loading) and SSH, so I was able to get diagnostics.

 

I was unable to stop dockers via the stop or kill commands. I tried a shutdown via the web page and it hung, then tried a powerdown command from the console and it got stuck again at ccp 0000:06:00.2: disabled.

 

 

sekhmet-diagnostics-20181012-1524.zip

2 hours ago, johnnie.black said:

This looks to me like a hardware problem:


Oct 11 17:59:51 Sekhmet kernel: BUG: Bad page map in process docker  pte:ffff88039e8b0318 pmd:39e4f3067

Not sure how well these AMD EPYC boards work with Unraid/Linux, have you checked the board's system event log for any errors there?

No events other than power on / power off, fortunately/unfortunately.
