Unraid crashing after some time


AquaWolf


Hello there. I previously had some issues with Docker, which are now solved, but my Unraid server is still unstable after one or two days of use.

Shortly before the last log line, my system log is full of:

Tower rsyslogd: action 'action-0-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check for additional error messages before this one. [v8.2002.0 try https://www.rsyslog.com/e/2027 ]

and these:

Feb  4 04:36:34 Tower kernel: rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
Feb  4 04:36:34 Tower kernel: rcu: #01112-....: (8 GPs behind) idle=1f6/1/0x4000000000000002 softirq=17084346/17084346 fqs=6963800 
Feb  4 04:36:34 Tower kernel: #011(detected by 0, t=29760827 jiffies, g=60378281, q=18075856)
Feb  4 04:36:34 Tower kernel: Sending NMI from CPU 0 to CPUs 12:
Feb  4 04:36:34 Tower kernel: NMI backtrace for cpu 12
Feb  4 04:36:34 Tower kernel: CPU: 12 PID: 0 Comm: swapper/12 Tainted: G        W         5.10.28-Unraid #1
Feb  4 04:36:34 Tower kernel: Hardware name: Micro-Star International Co., Ltd. MS-7B09/X399 SLI PLUS (MS-7B09), BIOS A.70 11/14/2018
Feb  4 04:36:34 Tower kernel: RIP: 0010:nf_ct_key_equal+0x4/0x5d [nf_conntrack]
Feb  4 04:36:34 Tower kernel: Code: 48 33 56 1c 48 09 d1 75 19 8b 57 24 8b 46 24 81 e2 ff ff ff 00 25 ff ff ff 00 39 c2 0f 94 c0 0f b6 c0 83 e0 01 c3 49 89 f9 55 <48> 89 f7 48 89 d5 49 8d 71 10 49 89 cb e8 9b ff ff ff 45 31 d2 84
Feb  4 04:36:34 Tower kernel: RSP: 0018:ffffc900069f4978 EFLAGS: 00000206
Feb  4 04:36:34 Tower kernel: RAX: 00000001119963e5 RBX: ffff888a12629448 RCX: ffffffff8210b440
Feb  4 04:36:34 Tower kernel: RDX: ffff888a1262a6cc RSI: ffffc900069f49e8 RDI: ffff888a12629448
Feb  4 04:36:34 Tower kernel: RBP: ffff888a1262a6c0 R08: ffff888a12629400 R09: ffff888a12629448
Feb  4 04:36:34 Tower kernel: R10: 0000000000000001 R11: ffffffff8210b440 R12: ffffffff8210b440
Feb  4 04:36:34 Tower kernel: R13: ffffc900069f49e8 R14: ffff888a1262a6cc R15: ffff888a12629400
Feb  4 04:36:34 Tower kernel: FS:  0000000000000000(0000) GS:ffff888c66d00000(0000) knlGS:0000000000000000
Feb  4 04:36:34 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Feb  4 04:36:34 Tower kernel: CR2: 000000c000867000 CR3: 000000012ca00000 CR4: 00000000003506e0
Feb  4 04:36:34 Tower kernel: Call Trace:
Feb  4 04:36:34 Tower kernel: <IRQ>
Feb  4 04:36:34 Tower kernel: nf_conntrack_tuple_taken+0xb9/0x144 [nf_conntrack]
Feb  4 04:36:34 Tower kernel: nf_nat_used_tuple+0x2e/0x49 [nf_nat]
Feb  4 04:36:34 Tower kernel: nf_nat_setup_info+0x332/0x6aa [nf_nat]
Feb  4 04:36:34 Tower kernel: ? ipt_do_table+0x4bb/0x5c0 [ip_tables]
Feb  4 04:36:34 Tower kernel: ? ipt_do_table+0x570/0x5c0 [ip_tables]
Feb  4 04:36:34 Tower kernel: __nf_nat_alloc_null_binding+0x5f/0x76 [nf_nat]
Feb  4 04:36:34 Tower kernel: nf_nat_inet_fn+0x91/0x183 [nf_nat]
Feb  4 04:36:34 Tower kernel: ? br_handle_frame_finish+0x351/0x351
Feb  4 04:36:34 Tower kernel: nf_nat_ipv4_pre_routing+0x1e/0x4a [nf_nat]
Feb  4 04:36:34 Tower kernel: nf_hook_slow+0x39/0x8e
Feb  4 04:36:34 Tower kernel: ? br_nf_forward_finish+0xd0/0xd0 [br_netfilter]
Feb  4 04:36:34 Tower kernel: NF_HOOK+0xb7/0xf7 [br_netfilter]
Feb  4 04:36:34 Tower kernel: ? br_nf_forward_finish+0xd0/0xd0 [br_netfilter]
Feb  4 04:36:34 Tower kernel: br_nf_pre_routing+0x229/0x239 [br_netfilter]
Feb  4 04:36:34 Tower kernel: ? br_nf_forward_finish+0xd0/0xd0 [br_netfilter]
Feb  4 04:36:34 Tower kernel: br_handle_frame+0x25e/0x2a6
Feb  4 04:36:34 Tower kernel: ? br_pass_frame_up+0xda/0xda
Feb  4 04:36:34 Tower kernel: __netif_receive_skb_core+0x335/0x4e7
Feb  4 04:36:34 Tower kernel: __netif_receive_skb_list_core+0x78/0x104
Feb  4 04:36:34 Tower kernel: netif_receive_skb_list_internal+0x1bf/0x1f2
Feb  4 04:36:34 Tower kernel: ? dev_gro_receive+0x55d/0x578
Feb  4 04:36:34 Tower kernel: gro_normal_list+0x1d/0x39
Feb  4 04:36:34 Tower kernel: napi_complete_done+0x79/0x104
Feb  4 04:36:34 Tower kernel: bnx2x_poll+0x100c/0x1285 [bnx2x]
Feb  4 04:36:34 Tower kernel: ? resched_cpu+0x14/0x58
Feb  4 04:36:34 Tower kernel: ? enqueue_task_fair+0x101/0x156
Feb  4 04:36:34 Tower kernel: net_rx_action+0xf4/0x29d
Feb  4 04:36:34 Tower kernel: __do_softirq+0xc4/0x1c2
Feb  4 04:36:34 Tower kernel: asm_call_irq_on_stack+0x12/0x20
Feb  4 04:36:34 Tower kernel: </IRQ>
Feb  4 04:36:34 Tower kernel: do_softirq_own_stack+0x2c/0x39
Feb  4 04:36:34 Tower kernel: __irq_exit_rcu+0x45/0x80
Feb  4 04:36:34 Tower kernel: common_interrupt+0x119/0x12e
Feb  4 04:36:34 Tower kernel: asm_common_interrupt+0x1e/0x40
Feb  4 04:36:34 Tower kernel: RIP: 0010:arch_local_irq_enable+0x7/0x8
Feb  4 04:36:34 Tower kernel: Code: 00 48 83 c4 28 4c 89 e0 5b 5d 41 5c 41 5d 41 5e 41 5f c3 9c 58 0f 1f 44 00 00 c3 fa 66 0f 1f 44 00 00 c3 fb 66 0f 1f 44 00 00 <c3> 55 8b af 28 04 00 00 b8 01 00 00 00 45 31 c9 53 45 31 d2 39 c5
Feb  4 04:36:34 Tower kernel: RSP: 0018:ffffc90006443ea0 EFLAGS: 00000246
Feb  4 04:36:34 Tower kernel: RAX: ffff888c66d22380 RBX: 0000000000000002 RCX: 000000000000001f
Feb  4 04:36:34 Tower kernel: RDX: 0000000000000000 RSI: 0000000021af2900 RDI: 0000000000000000
Feb  4 04:36:34 Tower kernel: RBP: ffff88869fa0dc00 R08: 0000f1be712f7038 R09: 0000000000000000
Feb  4 04:36:34 Tower kernel: R10: 0000000000002e7b R11: 071c71c71c71c71c R12: 0000f1be712f7038
Feb  4 04:36:34 Tower kernel: R13: ffffffff820c8c40 R14: 0000000000000002 R15: 0000000000000000
Feb  4 04:36:34 Tower kernel: cpuidle_enter_state+0x101/0x1c4
Feb  4 04:36:34 Tower kernel: cpuidle_enter+0x25/0x31
Feb  4 04:36:34 Tower kernel: do_idle+0x1a6/0x214
Feb  4 04:36:34 Tower kernel: cpu_startup_entry+0x18/0x1a
Feb  4 04:36:34 Tower kernel: secondary_startup_64_no_verify+0xb0/0xbb
Feb  4 04:36:34 Tower rsyslogd: file '/var/log/syslog'[9] write error - see https://www.rsyslog.com/solving-rsyslog-write-errors/ for help OS error: No space left on device [v8.2002.0 try https://www.rsyslog.com/e/2027 ]
Feb  4 04:36:34 Tower rsyslogd: action 'action-0-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check for additional error messages before this one. [v8.2002.0 try https://www.rsyslog.com/e/2027 ]
Feb  4 04:36:34 Tower rsyslogd: file '/var/log/syslog'[9] write error - see https://www.rsyslog.com/solving-rsyslog-write-errors/ for help OS error: No space left on device [v8.2002.0 try https://www.rsyslog.com/e/2027 ]
Feb  4 04:36:34 Tower rsyslogd: action 'action-0-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check for additional error messages before this one. [v8.2002.0 try https://www.rsyslog.com/e/2027 ]
Feb  4 04:36:34 Tower rsyslogd: file '/var/log/syslog'[9] write error - see https://www.rsyslog.com/solving-rsyslog-write-errors/ for help OS error: No space left on device [v8.2002.0 try https://www.rsyslog.com/e/2027 ]
Feb  4 04:36:34 Tower rsyslogd: action 'action-0-builtin:omfile' (module 'builtin:omfile') message lost, could not be processed. Check for additional error messages before this one. [v8.2002.0 try https://www.rsyslog.com/e/2027 ]
Feb  4 04:36:34 Tower rsyslogd: rsyslogd[internal_messages]: 560 messages lost due to rate-limiting (500 allowed within 5 seconds)
Feb  7 06:17:29 Tower nginx: 2022/02/07 06:17:29 [error] 7456#7456: MEMSTORE:00: can't create shared message for channel /disks
Feb  7 06:17:30 Tower nginx: 2022/02/07 06:17:30 [crit] 7456#7456: ngx_slab_alloc() failed: no memory
Feb  7 06:17:30 Tower nginx: 2022/02/07 06:17:30 [error] 7456#7456: shpool alloc failed
Feb  7 06:17:30 Tower nginx: 2022/02/07 06:17:30 [error] 7456#7456: nchan: Out of shared memory while allocating message of size 7386. Increase nchan_max_reserved_memory.
Feb  7 06:17:30 Tower nginx: 2022/02/07 06:17:30 [error] 7456#7456: *135964 nchan: error publishing message (HTTP status code 500), client: unix:, server: , request: "POST /pub/disks?buffer_length=1 HTTP/1.1", host: "localhost"
Feb  7 06:17:30 Tower nginx: 2022/02/07 06:17:30 [error] 7456#7456: MEMSTORE:00: can't create shared message for channel /disks

 

I have now rebooted. The diagnostics taken after the reboot are attached.

 

I would be really thankful for some help :)

Kind regards

tower-diagnostics-20220216-0730.zip


See if this applies to you. If it does, upgrading to v6.10 and switching to ipvlan might fix it (Settings -> Docker Settings -> Docker custom network type -> ipvlan; advanced view must be enabled, top right), or see below for more info:

 

https://forums.unraid.net/topic/70529-650-call-traces-when-assigning-ip-address-to-docker-containers/

See also here:

https://forums.unraid.net/bug-reports/stable-releases/690691-kernel-panic-due-to-netfilter-nf_nat_setup_info-docker-static-ip-macvlan-r1356/
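
As a rough way to check whether a syslog matches the traces discussed in those threads, here is a minimal sketch that looks for the NAT and bridge netfilter symbols visible in the call trace from the first post. The choice of these three symbols as a "signature" and the function names are my own assumptions for illustration; this is not an official Unraid check.

```python
from collections import Counter

# Kernel symbols taken from the trace posted above. Treating these three as
# a "signature" is an assumption for illustration, not an official check.
SIGNATURE_SYMBOLS = (
    "nf_nat_setup_info",
    "nf_conntrack_tuple_taken",
    "br_nf_pre_routing",
)


def count_signature_symbols(lines):
    """Count how often each signature symbol appears in kernel log lines."""
    counts = Counter()
    for line in lines:
        if " kernel: " not in line:
            continue  # skip rsyslogd, nginx and other non-kernel messages
        for symbol in SIGNATURE_SYMBOLS:
            if symbol in line:
                counts[symbol] += 1
    return counts


def looks_like_nat_bridge_trace(lines):
    """True only if every signature symbol shows up at least once."""
    counts = count_signature_symbols(lines)
    return all(counts[symbol] > 0 for symbol in SIGNATURE_SYMBOLS)


# Three lines copied from the syslog excerpt in the first post.
SAMPLE = [
    "Feb  4 04:36:34 Tower kernel: nf_conntrack_tuple_taken+0xb9/0x144 [nf_conntrack]",
    "Feb  4 04:36:34 Tower kernel: nf_nat_setup_info+0x332/0x6aa [nf_nat]",
    "Feb  4 04:36:34 Tower kernel: br_nf_pre_routing+0x229/0x239 [br_netfilter]",
]

print(looks_like_nat_bridge_trace(SAMPLE))
```

To run it against a real log, read the file into a list of lines (for example with `open(path, errors="replace").readlines()`) and pass that list in place of `SAMPLE`. A match only suggests the trace resembles the one above; it does not prove the cause.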


I upgraded to rc2.

Docker is not starting anymore. It seems to be a problem with the Docker image:

Feb 16 13:32:59 Tower emhttpd: shcmd (68): /usr/local/sbin/mount_image '/mnt/user/system/docker/docker.img' /var/lib/docker 100
Feb 16 13:32:59 Tower kernel: loop2: detected capacity change from 0 to 209715200
Feb 16 13:32:59 Tower kernel: BTRFS: device fsid b7558f03-4ad1-4884-b9b7-9fe63d3900b9 devid 1 transid 2830106 /dev/loop2 scanned by mount (4507)
Feb 16 13:32:59 Tower kernel: BTRFS info (device loop2): flagging fs with big metadata feature
Feb 16 13:32:59 Tower kernel: BTRFS info (device loop2): using free space tree
Feb 16 13:32:59 Tower kernel: BTRFS info (device loop2): has skinny extents
Feb 16 13:32:59 Tower kernel: BTRFS info (device loop2): bdev /dev/loop2 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0
Feb 16 13:32:59 Tower kernel: BTRFS info (device loop2): enabling ssd optimizations
Feb 16 13:32:59 Tower kernel: BTRFS info (device loop2): cleaning free space cache v1
Feb 16 13:32:59 Tower kernel: BTRFS warning (device loop2): checksum verify failed on 28567109632 wanted 0x92b333b4 found 0x5563533e level 0
Feb 16 13:32:59 Tower kernel: BTRFS warning (device loop2): checksum verify failed on 28567109632 wanted 0x92b333b4 found 0x5563533e level 0
Feb 16 13:32:59 Tower kernel: BTRFS: error (device loop2) in btrfs_set_free_space_cache_v1_active:3992: errno=-5 IO failure
Feb 16 13:32:59 Tower kernel: BTRFS error (device loop2): commit super ret -30
Feb 16 13:32:59 Tower root: mount: /var/lib/docker: can't read superblock on /dev/loop2.
Feb 16 13:32:59 Tower kernel: BTRFS error (device loop2): open_ctree failed
Feb 16 13:32:59 Tower root: mount error
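
For anyone reading along, the `errs:` line in the mount log above carries the per-device error counters that BTRFS reports. A minimal sketch for pulling the non-zero counters out of such a line follows; the regular expression and the function names are assumptions written against the exact line format shown above, not a general-purpose parser.

```python
import re

# Matches lines of the form seen in the log above, e.g.
# "BTRFS info (device loop2): bdev /dev/loop2 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0"
# The pattern is an assumption based on that single example.
ERRS_RE = re.compile(
    r"BTRFS info \(device (?P<dev>[^)]+)\): bdev \S+ errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)


def parse_btrfs_errs(line):
    """Return (device, counters) for an 'errs:' line, or None if no match."""
    match = ERRS_RE.search(line)
    if match is None:
        return None
    counters = {
        name: int(value)
        for name, value in match.groupdict().items()
        if name != "dev"
    }
    return match.group("dev"), counters


def nonzero_counters(counters):
    """Keep only the counters that are greater than zero."""
    return {name: value for name, value in counters.items() if value > 0}


# Line copied from the mount log above.
LINE = (
    "Feb 16 13:32:59 Tower kernel: BTRFS info (device loop2): "
    "bdev /dev/loop2 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0"
)

device, counters = parse_btrfs_errs(LINE)
print(device, nonzero_counters(counters))
```

Here the only non-zero counter is `corrupt`, which lines up with the checksum verification failures reported a few lines later in the same log.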

 

My VMs with GPU passthrough are not working anymore either. I'm getting just a few lines in the syslog:

Feb 16 13:38:13 Tower kernel: br0: port 2(vnet1) entered blocking state
Feb 16 13:38:13 Tower kernel: br0: port 2(vnet1) entered disabled state
Feb 16 13:38:13 Tower kernel: device vnet1 entered promiscuous mode
Feb 16 13:38:13 Tower kernel: br0: port 2(vnet1) entered blocking state
Feb 16 13:38:13 Tower kernel: br0: port 2(vnet1) entered forwarding state
Feb 16 13:38:15 Tower avahi-daemon[4465]: Joining mDNS multicast group on interface vnet1.IPv6 with address fe80::fc54:ff:fe30:3c6e.
Feb 16 13:38:15 Tower avahi-daemon[4465]: New relevant interface vnet1.IPv6 for mDNS.
Feb 16 13:38:15 Tower avahi-daemon[4465]: Registering new address record for fe80::fc54:ff:fe30:3c6e on vnet1.*.
Feb 16 13:38:19 Tower kernel: vfio-pci 0000:43:00.0: vfio_ecap_init: hiding ecap 0x19@0x900
Feb 16 13:38:19 Tower kernel: vfio-pci 0000:43:00.0: No more image in the PCI ROM
Feb 16 13:38:19 Tower kernel: vfio-pci 0000:41:00.0: vfio_ecap_init: hiding ecap 0x1e@0x110
Feb 16 13:38:19 Tower kernel: vfio-pci 0000:41:00.0: vfio_ecap_init: hiding ecap 0x19@0x300

 

