Unraid 6.9.2 Becomes Unresponsive Roughly Once per Week

SigmaInigma · June 4, 2021

I've been using Unraid for over four years and it has always been amazing, but recently I have been experiencing system instability and could use some help figuring out what's going on. I'm not sure exactly when this started, but it's been roughly two months. About once per week my Unraid system becomes unresponsive and I am unable to access any of my Docker containers or even the Unraid web UI. To recover, I typically hold the power button and reboot, but this is happening often enough that it has become very annoying.

Since the syslog gets wiped on reboot, I have had trouble diagnosing the issue, but I recently enabled writing the syslog to the flash drive to persist the logs. The system has hung twice since then. Here are the last few lines logged before the system became unresponsive each time:

First Occurrence

May 22 03:00:12 Tower Docker Auto Update: Community Applications Docker Autoupdate finished
May 22 03:40:01 Tower crond[1996]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null

Second Occurrence

Jun  3 00:00:02 Tower Plugin Auto Update: Community Applications Plugin Auto Update finished
Jun  3 02:00:55 Tower kernel: XFS (nvme0n1p1): Metadata corruption detected at xfs_dinode_verify+0xa3/0x581 [xfs], inode 0x3042f953 dinode
Jun  3 02:00:55 Tower kernel: XFS (nvme0n1p1): Unmount and run xfs_repair
Jun  3 02:00:55 Tower kernel: XFS (nvme0n1p1): First 128 bytes of corrupted metadata buffer:
Jun  3 02:00:55 Tower kernel: 00000000: 49 4e 81 a4 03 02 00 00 00 00 00 63 00 00 00 64  IN.........c...d
Jun  3 02:00:55 Tower kernel: 00000010: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00  ................
Jun  3 02:00:55 Tower kernel: 00000020: 60 9c 7e 83 15 70 32 44 60 9c 7e 83 15 70 32 44  `.~..p2D`.~..p2D
Jun  3 02:00:55 Tower kernel: 00000030: 60 9c 7e 83 15 70 32 44 00 00 00 00 00 02 83 70  `.~..p2D.......p
Jun  3 02:00:55 Tower kernel: 00000040: 00 00 00 00 00 00 00 29 00 00 00 00 00 00 00 01  .......)........
Jun  3 02:00:55 Tower kernel: 00000050: 00 00 00 02 00 00 00 00 00 00 00 00 b4 55 9d fd  .............U..
Jun  3 02:00:55 Tower kernel: 00000060: ff ff ff ff 1e 67 d2 f9 00 00 00 00 00 00 00 07  .....g..........
Jun  3 02:00:55 Tower kernel: 00000070: 00 00 00 34 00 00 a3 b4 00 00 00 00 00 00 00 00  ...4............
Jun  3 03:00:01 Tower Docker Auto Update: Community Applications Docker Autoupdate running
Jun  3 03:00:01 Tower Docker Auto Update: Checking for available updates
Jun  3 03:00:04 Tower Docker Auto Update: No updates will be installed
Jun  3 03:40:01 Tower crond[1924]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null
Jun  3 05:00:34 Tower root: /etc/libvirt: 125.7 MiB (131846144 bytes) trimmed on /dev/loop3
Jun  3 05:00:34 Tower root: /var/lib/docker: 2.8 GiB (2966016000 bytes) trimmed on /dev/loop2
Jun  3 05:00:34 Tower root: /mnt/cache: 448.4 GiB (481491525632 bytes) trimmed on /dev/nvme0n1p1

This line seems to be the common link between the two occurrences, but I'm not sure whether the corrupted metadata buffer from the second occurrence is a concern as well:

Jun  3 03:40:01 Tower crond[1924]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null
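For anyone comparing crashes the same way, here is a minimal Python sketch that pulls the last few lines logged before each boot out of a persisted syslog. It assumes each boot begins with a kernel banner line containing "kernel: Linux version"; the function name and the marker constant are illustrative and not part of Unraid.

```python
from typing import List

# Assumption: every boot starts with a kernel banner line containing this text.
BOOT_MARKER = "kernel: Linux version"


def tails_before_boots(lines: List[str], n: int = 5) -> List[List[str]]:
    """Return the last n lines that precede each boot banner.

    A banner on the very first line has nothing before it and is skipped.
    """
    tails = []
    for i, line in enumerate(lines):
        if BOOT_MARKER in line and i > 0:
            # Everything just before a new boot is what was logged
            # before the previous session ended (cleanly or not).
            tails.append(lines[max(0, i - n):i])
    return tails
```

To use it, read the persisted file (commonly /boot/logs/syslog when mirroring to flash is enabled, though that path is an assumption here) with `open(path).read().splitlines()` and pass the result in. Lines that show up in every returned tail are candidates for a common link, but they can also just be routine scheduled jobs.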

Here is my system info:

Unraid Version: 6.9.2
Model: Custom
M/B: Micro-Star International Co., Ltd. Z370 GAMING M5 (MS-7B58) Version 1.0 - s/n: I316865908
BIOS: American Megatrends Inc. Version 1.A0. Dated: 06/08/2020
CPU: Intel® Core™ i7-8700K CPU @ 3.70GHz
HVM: Enabled
IOMMU: Disabled
Cache: 384 KiB, 1536 KiB, 12 MB
Memory: 16 GiB DDR4 (max. installable capacity 64 GiB)
Network: eth0: 10000 Mbps, full duplex, mtu 1500
 eth1: interface down
Kernel: Linux 5.10.28-Unraid x86_64
OpenSSL: 1.1.1j

Any idea what could be going on? I've uploaded the syslog from my flash drive with the persisted information, as well as the diagnostics zip file. The zip file is from after the reboot, so the syslog in it does not contain what happened prior to the reboot.

One other thing to note: I recently switched to a different server to run Unraid. The issues I mentioned did not start immediately after the switch, but I'm wondering whether it might have anything to do with it.

Let me know if there is any more information I can provide. Thanks!

tower-diagnostics-20210604-1140.zip syslog (3)

JorgeB · June 4, 2021

Macvlan call traces are usually the result of having dockers with a custom IP address. More info below:

https://forums.unraid.net/topic/70529-650-call-traces-when-assigning-ip-address-to-docker-containers/

See also here:

https://forums.unraid.net/bug-reports/stable-releases/690691-kernel-panic-due-to-netfilter-nf_nat_setup_info-docker-static-ip-macvlan-r1356/

5 minutes ago, SigmaInigma said:

XFS (nvme0n1p1): Metadata corruption detected at xfs_dinode_verify+0xa3/0x581

Also, this means you need to run a filesystem check on this device. The corruption is likely the result of the unclean shutdowns.
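As a rough illustration of that advice, the Python sketch below extracts the device names from XFS "Metadata corruption detected" lines and builds a read-only check command for each. The helper names are hypothetical. `xfs_repair -n` is the no-modify mode, and the `/dev/<name>` path is an assumption taken from the device name in the log; on Unraid the check is normally run with the filesystem unmounted (for example with the array in maintenance mode), so treat this as a sketch rather than the exact procedure.

```python
import re
from typing import Iterable, List, Set

# Matches kernel lines such as:
#   "... kernel: XFS (nvme0n1p1): Metadata corruption detected at ..."
XFS_CORRUPTION = re.compile(r"XFS \(([^)]+)\): Metadata corruption detected")


def corrupted_xfs_devices(lines: Iterable[str]) -> Set[str]:
    """Collect the device names named in XFS metadata-corruption lines."""
    devices = set()
    for line in lines:
        match = XFS_CORRUPTION.search(line)
        if match:
            devices.add(match.group(1))
    return devices


def dry_run_command(device: str) -> List[str]:
    """Build a no-modify xfs_repair command (assumes the /dev/<name> path)."""
    return ["xfs_repair", "-n", f"/dev/{device}"]
```

The command is only built, not executed, so the sketch is safe to run against a copy of the syslog on any machine.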

SigmaInigma · June 4, 2021

49 minutes ago, JorgeB said:

Macvlan call traces are usually the result of having dockers with a custom IP address. More info below:

https://forums.unraid.net/topic/70529-650-call-traces-when-assigning-ip-address-to-docker-containers/

See also here:

https://forums.unraid.net/bug-reports/stable-releases/690691-kernel-panic-due-to-netfilter-nf_nat_setup_info-docker-static-ip-macvlan-r1356/

Also, this means you need to run a filesystem check on this device. The corruption is likely the result of the unclean shutdowns.

Did you see any Macvlan call traces in my log? I might be missing it, but I didn't see anything like that. I do have one of my dockers on the br0 network with a custom IP, though, so maybe that is causing the issue?

Container           Network    IP Address        Port(s)
ApacheGuacamole     br0        192.168.117.11    8080
CUPS                bridge     192.168.117.10    631
Dolphin             bridge     192.168.117.10    8080
EAPcontroller       host       192.168.117.10    ???
heimdall            proxynet   192.168.117.10    8143, 8180
letsencrypt         proxynet   192.168.117.10    180, 1443
ombi                proxynet   192.168.117.10    3579
plex                host       192.168.117.10    1900, 3005, 5353, 8324, 32400, 32410, 32412, 32413, 32414, 32469
radarr              proxynet   192.168.117.10    787
radarr-uhd          proxynet   192.168.117.10    7879
sabnzbd             proxynet   192.168.117.10    7070, 9090
sonarr              proxynet   192.168.117.10    8989
unifi-controller    bridge     192.168.117.10    3478, 8080, 8443, 8843, 8880, 10001

JorgeB · June 4, 2021

8 minutes ago, SigmaInigma said:

Did you see any Macvlan call traces in my log?

It's near the top of the log, around line 10.
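A trace like this is easy to miss when reading a long log by eye. A minimal Python sketch for locating such entries by line number follows; the function name and the keyword list are illustrative assumptions, based on kernel traces being introduced by a "Call Trace:" line and macvlan-related traces mentioning "macvlan" in the surrounding lines.

```python
from typing import Iterable, List, Sequence, Tuple


def find_keyword_lines(
    lines: Iterable[str],
    keywords: Sequence[str] = ("call trace", "macvlan"),
) -> List[Tuple[int, str]]:
    """Return (1-based line number, text) for lines containing any keyword.

    Matching is case-insensitive; trailing newlines are stripped.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        lowered = line.lower()
        if any(keyword in lowered for keyword in keywords):
            hits.append((number, line.rstrip("\n")))
    return hits
```

Passing an open file object works directly, for example `find_keyword_lines(open("syslog"))`, where the file name is whatever copy of the log is being inspected.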

SigmaInigma · June 4, 2021

Got it, I'll try moving this docker to the bridge network instead and see if that fixes it. Thanks!

John_M · June 4, 2021

4 hours ago, SigmaInigma said:


Jun 3 03:40:01 Tower crond[1924]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null

This is nothing to worry about. It's the standard log entry when the Mover finds nothing to move. Having a non-zero exit status makes it look like an error, but it isn't.
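When scanning a syslog for real failures, it can help to filter out this known-benign entry. The Python sketch below lists crond "exit status" lines while skipping the Mover's exit status 1; the regular expression, the function name, and the `benign` default are assumptions made for illustration, not an Unraid feature.

```python
import re
from typing import Iterable, List, Sequence, Tuple

# Matches lines such as:
#   "... crond[1924]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null"
CRON_EXIT = re.compile(r"crond\[\d+\]: exit status (\d+) from user (\S+) (.+)$")


def cron_failures(
    lines: Iterable[str],
    benign: Sequence[Tuple[str, int]] = (("/usr/local/sbin/mover", 1),),
) -> List[Tuple[str, int]]:
    """Return (program, exit status) for cron jobs not listed as benign."""
    failures = []
    for line in lines:
        match = CRON_EXIT.search(line)
        if not match:
            continue
        status = int(match.group(1))
        # The first token of the command is the program path.
        program = match.group(3).split()[0]
        if (program, status) not in benign:
            failures.append((program, status))
    return failures
```

With the default `benign` list, the Mover line quoted above produces no entry, while any other non-zero cron exit is still reported.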
