System stuck / Kernel Panic / cache drive input-output errors



Hi,

 

After the system ran for weeks without a single problem, issues have been piling up over the last few days, and I don't know where to start looking.

 

After I repaired my Plex database and tried to browse the media library, the Unraid web interface was no longer accessible. I could still connect via SSH, but I couldn't really do anything anymore. Even the diagnostics command returned nothing, or just ran without ever finishing.

 

The whole system seems to hang: I can't stop the Docker service and I can't access the web interface of any Docker container.

 

Attached are the diagnostics from yesterday. Today, unfortunately, I could not create them, either in the web interface or via the CLI.

 

I could only copy the complete syslog from the flash drive. Here is a short excerpt from it that may be relevant.

 

 

Aug 29 15:25:04 Tower kernel: ------------[ cut here ]------------
Aug 29 15:25:04 Tower kernel: WARNING: CPU: 0 PID: 20516 at net/netfilter/nf_nat_core.c:594 nf_nat_setup_info+0x8c/0x7d1 [nf_nat]
Aug 29 15:25:04 Tower kernel: Modules linked in: vhost_net vhost tap kvm_intel kvm md_mod xt_mark xt_comment bluetooth ecdh_generic ecc udp_diag macvlan veth xt_nat nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 tun vhost_iotlb nvidia_uvm(PO) xfs cmac cifs asn1_decoder cifs_arc4 cifs_md4 oid_registry dns_resolver dm_crypt dm_mod zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) i915 drm_buddy i2c_algo_bit ttm drm_display_helper intel_gtt agpgart tcp_diag inet_diag ip6table_filter ip6_tables iptable_filter ip_tables x_tables efivarfs af_packet 8021q garp mrp bridge stp llc nvidia_drm(PO) nvidia_modeset(PO) intel_rapl_msr intel_rapl_common iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp nvidia(PO) crct10dif_pclmul crc32_pclmul crc32c_intel
Aug 29 15:25:04 Tower kernel: ghash_clmulni_intel video sha512_ssse3 aesni_intel drm_kms_helper crypto_simd cryptd rapl drm mpt3sas i2c_i801 intel_cstate mei_me backlight i2c_smbus intel_wmi_thunderbolt mxm_wmi ahci syscopyarea raid_class sysfillrect input_leds sysimgblt intel_uncore e1000e i2c_core mei led_class libahci scsi_transport_sas fb_sys_fops wmi button unix [last unloaded: kvm]
Aug 29 15:25:04 Tower kernel: CPU: 0 PID: 20516 Comm: kworker/u48:19 Tainted: P      D W  O       6.1.38-Unraid #2
Aug 29 15:25:04 Tower kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99 Extreme4, BIOS P3.80 04/06/2018
Aug 29 15:25:04 Tower kernel: Workqueue: events_unbound macvlan_process_broadcast [macvlan]
Aug 29 15:25:04 Tower kernel: RIP: 0010:nf_nat_setup_info+0x8c/0x7d1 [nf_nat]
Aug 29 15:25:04 Tower kernel: Code: a8 80 75 26 48 8d 73 58 48 8d 7c 24 20 e8 18 1b 43 00 48 8d 43 0c 4c 8b bb 88 00 00 00 48 89 44 24 18 eb 54 0f ba e0 08 73 07 <0f> 0b e9 75 06 00 00 48 8d 73 58 48 8d 7c 24 20 e8 eb 1a 43 00 48
Aug 29 15:25:04 Tower kernel: RSP: 0018:ffffc90000003c78 EFLAGS: 00010282
Aug 29 15:25:04 Tower kernel: RAX: 0000000000000180 RBX: ffff8887bbcb5300 RCX: ffff888173a62900
Aug 29 15:25:04 Tower kernel: RDX: 0000000000000000 RSI: ffffc90000003d5c RDI: ffff8887bbcb5300
Aug 29 15:25:04 Tower kernel: RBP: ffffc90000003d40 R08: 00000000900a0a0a R09: 0000000000000000
Aug 29 15:25:04 Tower kernel: R10: 0000000000000098 R11: 0000000000000000 R12: ffffc90000003d5c
Aug 29 15:25:04 Tower kernel: R13: 0000000000000000 R14: ffffc90000003e40 R15: 0000000000000001
Aug 29 15:25:04 Tower kernel: FS:  0000000000000000(0000) GS:ffff88905f800000(0000) knlGS:0000000000000000
Aug 29 15:25:04 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 29 15:25:04 Tower kernel: CR2: 000014d41cb5069c CR3: 000000000420a002 CR4: 00000000001706f0
Aug 29 15:25:04 Tower kernel: Call Trace:
Aug 29 15:25:04 Tower kernel: <IRQ>
Aug 29 15:25:04 Tower kernel: ? __warn+0xab/0x122
Aug 29 15:25:04 Tower kernel: ? report_bug+0x109/0x17e
Aug 29 15:25:04 Tower kernel: ? nf_nat_setup_info+0x8c/0x7d1 [nf_nat]
Aug 29 15:25:04 Tower kernel: ? handle_bug+0x41/0x6f
Aug 29 15:25:04 Tower kernel: ? exc_invalid_op+0x13/0x60
Aug 29 15:25:04 Tower kernel: ? asm_exc_invalid_op+0x16/0x20
Aug 29 15:25:04 Tower kernel: ? nf_nat_setup_info+0x8c/0x7d1 [nf_nat]
Aug 29 15:25:04 Tower kernel: ? nf_nat_setup_info+0x44/0x7d1 [nf_nat]
Aug 29 15:25:04 Tower kernel: ? xt_write_recseq_end+0xf/0x1c [ip_tables]
Aug 29 15:25:04 Tower kernel: ? __local_bh_enable_ip+0x56/0x6b
Aug 29 15:25:04 Tower kernel: ? ipt_do_table+0x57a/0x5bf [ip_tables]
Aug 29 15:25:04 Tower kernel: ? __wake_up_common_lock+0x88/0xbb
Aug 29 15:25:04 Tower kernel: ? xt_write_recseq_end+0xf/0x1c [ip_tables]
Aug 29 15:25:04 Tower kernel: __nf_nat_alloc_null_binding+0x66/0x81 [nf_nat]
Aug 29 15:25:04 Tower kernel: nf_nat_inet_fn+0xc0/0x1a8 [nf_nat]
Aug 29 15:25:04 Tower kernel: nf_nat_ipv4_local_in+0x2a/0xaa [nf_nat]
Aug 29 15:25:04 Tower kernel: nf_hook_slow+0x3d/0x96
Aug 29 15:25:04 Tower kernel: ? ip_protocol_deliver_rcu+0x164/0x164
Aug 29 15:25:04 Tower kernel: NF_HOOK.constprop.0+0x79/0xd9
Aug 29 15:25:04 Tower kernel: ? ip_protocol_deliver_rcu+0x164/0x164
Aug 29 15:25:04 Tower kernel: __netif_receive_skb_one_core+0x77/0x9c
Aug 29 15:25:04 Tower kernel: process_backlog+0x8c/0x116
Aug 29 15:25:04 Tower kernel: __napi_poll.constprop.0+0x2b/0x124
Aug 29 15:25:04 Tower kernel: net_rx_action+0x159/0x24f
Aug 29 15:25:04 Tower kernel: __do_softirq+0x129/0x288
Aug 29 15:25:04 Tower kernel: do_softirq+0x7f/0xab
Aug 29 15:25:04 Tower kernel: </IRQ>
Aug 29 15:25:04 Tower kernel: <TASK>
Aug 29 15:25:04 Tower kernel: __local_bh_enable_ip+0x4c/0x6b
Aug 29 15:25:04 Tower kernel: netif_rx+0x52/0x5a
Aug 29 15:25:04 Tower kernel: macvlan_broadcast+0x10a/0x150 [macvlan]
Aug 29 15:25:04 Tower kernel: ? _raw_spin_unlock+0x14/0x29
Aug 29 15:25:04 Tower kernel: macvlan_process_broadcast+0xbc/0x12f [macvlan]
Aug 29 15:25:04 Tower kernel: process_one_work+0x1ab/0x295
Aug 29 15:25:04 Tower kernel: worker_thread+0x18b/0x244
Aug 29 15:25:04 Tower kernel: ? rescuer_thread+0x281/0x281
Aug 29 15:25:04 Tower kernel: kthread+0xe7/0xef
Aug 29 15:25:04 Tower kernel: ? kthread_complete_and_exit+0x1b/0x1b
Aug 29 15:25:04 Tower kernel: ret_from_fork+0x22/0x30
Aug 29 15:25:04 Tower kernel: </TASK>
Aug 29 15:25:04 Tower kernel: ---[ end trace 0000000000000000 ]---
Aug 29 15:27:15 Tower kernel: CIFS: VFS: \\10.10.10.21\backup Close unmatched open for MID:1046853
Aug 29 15:29:47 Tower kernel: CIFS: VFS: \\10.10.10.21\backup Close unmatched open for MID:1048894
Aug 29 15:31:23 Tower sshd[9323]: Connection closed by 10.10.10.77 port 51686
Aug 29 15:31:23 Tower sshd[9323]: Close session: user root from 10.10.10.77 port 51686 id 0
Aug 29 15:31:23 Tower sshd[9323]: pam_unix(sshd:session): session closed for user root
Aug 29 15:31:23 Tower sshd[9323]: pam_elogind(sshd:session): Failed to release session: Interrupted system call
Aug 29 15:31:23 Tower sshd[9323]: Transferred: sent 218648, received 11400 bytes
Aug 29 15:31:23 Tower sshd[9323]: Closing connection to 10.10.10.77 port 51686
Aug 29 15:32:18 Tower kernel: CIFS: VFS: \\10.10.10.21\backup Close unmatched open for MID:1051096
Aug 29 15:32:48 Tower webGUI: Successful login user root from 10.10.10.155
Aug 29 15:34:25 Tower sshd[1560]: Connection from 10.10.10.77 port 52543 on 10.10.10.11 port 22 rdomain ""
Aug 29 15:34:28 Tower sshd[1560]: Postponed keyboard-interactive for root from 10.10.10.77 port 52543 ssh2 [preauth]
Aug 29 15:34:30 Tower sshd[1560]: Postponed keyboard-interactive/pam for root from 10.10.10.77 port 52543 ssh2 [preauth]
Aug 29 15:34:30 Tower sshd[1560]: Accepted keyboard-interactive/pam for root from 10.10.10.77 port 52543 ssh2
Aug 29 15:34:30 Tower sshd[1560]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0)
Aug 29 15:34:30 Tower sshd[1560]: Starting session: shell on pts/2 for root from 10.10.10.77 port 52543 id 0
Aug 29 15:37:22 Tower kernel: CIFS: VFS: \\10.10.10.21\backup Close unmatched open for MID:1054479
Aug 29 15:38:57 Tower dnsmasq[12988]: exiting on receipt of SIGTERM
Aug 29 15:42:26 Tower kernel: CIFS: VFS: \\10.10.10.21\backup Close unmatched open for MID:1057614
Aug 29 15:43:50 Tower shutdown[25317]: shutting down for system reboot
Aug 29 15:44:11 Tower sshd[1560]: Connection closed by 10.10.10.77 port 52543
Aug 29 15:44:11 Tower sshd[1560]: Close session: user root from 10.10.10.77 port 52543 id 0
Aug 29 15:44:11 Tower sshd[1560]: pam_unix(sshd:session): session closed for user root
Aug 29 15:44:11 Tower sshd[1560]: Transferred: sent 39928, received 6680 bytes
Aug 29 15:44:11 Tower sshd[1560]: Closing connection to 10.10.10.77 port 52543
Aug 29 15:47:29 Tower kernel: CIFS: VFS: \\10.10.10.21\backup Close unmatched open for MID:1060811
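
The warning above runs through macvlan_process_broadcast. For anyone who wants to check a full syslog for the same pattern, here is a rough Python sketch. It only assumes that each warning starts with the "cut here" marker and ends with an "end trace" marker, as in the excerpt; the function name find_macvlan_traces is made up for illustration and is not part of any Unraid tool.

```python
import re

# Markers that open and close a kernel warning block, as seen in the excerpt.
CUT_HERE = "------------[ cut here ]------------"
END_TRACE = re.compile(r"---\[ end trace [0-9a-f]+ \]---")


def find_macvlan_traces(lines):
    """Return (start_index, end_index) pairs for every warning block whose
    workqueue or call trace lines reference the [macvlan] module."""
    hits = []
    start = None
    saw_macvlan = False
    for i, line in enumerate(lines):
        if CUT_HERE in line:
            # A new warning block begins; reset the state.
            start, saw_macvlan = i, False
        elif start is not None:
            # "[macvlan]" with brackets only appears on trace/workqueue lines,
            # not in the plain "Modules linked in" list.
            if "[macvlan]" in line:
                saw_macvlan = True
            if END_TRACE.search(line):
                if saw_macvlan:
                    hits.append((start, i))
                start = None
    return hits
```

Calling it with the lines of a syslog file (for example `find_macvlan_traces(open(path).read().splitlines())`, where `path` is wherever you copied the syslog to) returns the index range of each matching block, so you can see how often the warning occurs.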

 

And here is another part of the log with errors:

 

Quote

Aug 29 16:14:35 Tower kernel: 00000000: 90 33 9e 18 90 88 ff ff 00 01 8c fb 65 c0 00 04 .3..........e...
Aug 29 16:14:35 Tower kernel: XFS (dm-12): Internal error xfs_trans_cancel at line 1097 of file fs/xfs/xfs_trans.c. Caller xfs_setattr_size+0x242/0x328 [xfs]
Aug 29 16:14:35 Tower kernel: CPU: 16 PID: 14856 Comm: truncate Tainted: P O 6.1.38-Unraid #2
Aug 29 16:14:35 Tower kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X99 Extreme4, BIOS P3.80 04/06/2018
Aug 29 16:14:35 Tower kernel: Call Trace:
Aug 29 16:14:35 Tower kernel: <TASK>
Aug 29 16:14:35 Tower kernel: dump_stack_lvl+0x44/0x5c
Aug 29 16:14:35 Tower kernel: xfs_trans_cancel+0xbd/0x114 [xfs]
Aug 29 16:14:35 Tower kernel: xfs_setattr_size+0x242/0x328 [xfs]
Aug 29 16:14:35 Tower kernel: xfs_vn_setattr+0x75/0xf4 [xfs]
Aug 29 16:14:35 Tower kernel: notify_change+0x24e/0x397
Aug 29 16:14:35 Tower kernel: ? do_truncate+0x89/0xc1
Aug 29 16:14:35 Tower kernel: do_truncate+0x89/0xc1
Aug 29 16:14:35 Tower kernel: do_sys_ftruncate+0xb2/0xf0
Aug 29 16:14:35 Tower kernel: do_syscall_64+0x6b/0x81
Aug 29 16:14:35 Tower kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd
Aug 29 16:14:35 Tower kernel: RIP: 0033:0x14f2a42036b7
Aug 29 16:14:35 Tower kernel: Code: 77 01 c3 48 8b 15 61 c7 0d 00 f7 d8 64 89 02 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 b8 4d 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 31 c7 0d 00 f7 d8 64 89 02 b8
Aug 29 16:14:35 Tower kernel: RSP: 002b:00007ffc981bbde8 EFLAGS: 00000206 ORIG_RAX: 000000000000004d
Aug 29 16:14:35 Tower kernel: RAX: ffffffffffffffda RBX: 00007ffc981bc010 RCX: 000014f2a42036b7
Aug 29 16:14:35 Tower kernel: RDX: 00007ffc981bbe20 RSI: 0000002580000000 RDI: 0000000000000004
Aug 29 16:14:35 Tower kernel: RBP: 00007ffc981bcf0a R08: 0000002580000000 R09: 0000000000000000
Aug 29 16:14:35 Tower kernel: R10: 000014f2a4102358 R11: 0000000000000206 R12: 0000000000000841
Aug 29 16:14:35 Tower kernel: R13: 0000000000000001 R14: 0000000000000004 R15: 0000000000000002
Aug 29 16:14:35 Tower kernel: </TASK>
Aug 29 16:14:35 Tower root: truncate: failed to truncate '/mnt/cache/system/docker/docker.img' at 161061273600 bytes: Structure needs cleaning
Aug 29 16:14:35 Tower kernel: XFS (dm-12): Corruption of in-memory data (0x8) detected at xfs_trans_cancel+0xd6/0x114 [xfs] (fs/xfs/xfs_trans.c:1098). Shutting down filesystem.
Aug 29 16:14:35 Tower kernel: XFS (dm-12): Please unmount the filesystem and rectify the problem(s)
Aug 29 16:14:35 Tower root: mount: /var/lib/docker: /mnt/cache/system/docker/docker.img is not a block device, and stat(2) fails?.
Aug 29 16:14:35 Tower root: dmesg(1) may have more information after failed mount system call.
Aug 29 16:14:35 Tower root: mount error
Aug 29 16:14:35 Tower kernel: squashfs: Unknown parameter 'space_cache'
Aug 29 16:14:35 Tower emhttpd: shcmd (277): exit status: 1
Aug 29 16:14:35 Tower kernel: mdcmd (37): check correct
Aug 29 16:14:35 Tower kernel: md: recovery thread: check P Q ...
Aug 29 16:14:35 Tower avahi-daemon[14795]: Service "Tower" (/services/ssh.service) successfully established.
Aug 29 16:14:35 Tower avahi-daemon[14795]: Service "Tower" (/services/smb.service) successfully established.
Aug 29 16:14:35 Tower avahi-daemon[14795]: Service "Tower" (/services/sftp-ssh.service) successfully established.

 

After a hard reboot I got the error: "/mnt/cache/system/docker/docker.img" Path does not exist.

Running cd /mnt/cache and then ls gives the following error: /bin/ls: cannot open directory '.': Input/output error

 

Trying to repair the XFS filesystem reports:

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed. Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair. If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

 

tower-diagnostics-20230828-1151.zip tower-diagnostics-20230829-1624.zip

46 minutes ago, JorgeB said:

Try switching to ipvlan (Settings -> Docker Settings -> Docker custom network type -> ipvlan (advanced view must be enabled, top right)).

 

For the cache issue, use -L with xfs_repair.

 

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - zero log...
ALERT: The filesystem has valuable metadata changes in a log which is being
destroyed because the -L option was used.
        - scan filesystem freespace and inode maps...
clearing needsrepair flag and regenerating metadata
sb_ifree 18672, counted 24267
sb_fdblocks 118931100, counted 100300833
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
inode 47039578 - bad extent starting block number 4494803742481198, offset 2280176868672639
correcting nextents for inode 47039578
bad data fork in inode 47039578
cleared inode 47039578
        - agno = 1
        - agno = 2
        - agno = 3
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
entry "docker.img" in shortform directory 47039576 references free inode 47039578
junking entry "docker.img" in directory inode 47039576
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
Maximum metadata LSN (995:311495) is ahead of log (1:2).
Format log to cycle 998.
done
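
To see at a glance what the repair threw away, the output above can be summarized with a small Python sketch. It assumes the output format shown here ("cleared inode N" and "junking entry "name" in directory inode N" lines); summarize_repair is a hypothetical helper name, not something shipped with xfsprogs.

```python
import re

# Line patterns taken from the xfs_repair output pasted above.
CLEARED = re.compile(r"^cleared inode (\d+)")
JUNKED = re.compile(r'^junking entry "([^"]+)" in directory inode (\d+)')


def summarize_repair(output):
    """Collect the inodes that were cleared and the directory entries that
    were junked from a block of xfs_repair output."""
    cleared, junked = [], []
    for raw in output.splitlines():
        line = raw.strip()
        m = CLEARED.match(line)
        if m:
            cleared.append(int(m.group(1)))
            continue
        m = JUNKED.match(line)
        if m:
            # (entry name, inode of the directory that contained it)
            junked.append((m.group(1), int(m.group(2))))
    return {"cleared_inodes": cleared, "junked_entries": junked}
```

For the run above this lists inode 47039578 as cleared and the "docker.img" entry in directory inode 47039576 as junked, which matches the containers disappearing afterwards.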

 

Now I can ls /mnt/cache again, but all my Docker containers are gone.

 

Is there any chance to reverse the change or to restore everything to the state before the corruption?

 

docker.img itself is in the correct place.

Edited by aurevo
1 hour ago, JorgeB said:

If you have appdata (or a backup) and the containers templates you can just recreate:

(or

 

I have now rebuilt the image and am currently reinstalling the containers.

 

Can you tell from the logs or other information why the system hung, or could this also be related to the corrupt Docker image?
