bobo89 Posted December 26, 2023

Chasing two things here.

The server started locking up once a day after I added a new VLAN on the NIC and added that VLAN to Docker. (I'm unsure whether that is the cause, but it's the last major change I made.) All I can capture is this on the screen. The server is locked up hard and can't be interacted with. Any thoughts on what is causing this? macvlan?

The second thing: after a couple of hard reboots, drive6 started reading "Unmountable: unsupported or no file system". Following the instructions here, I started the array in maintenance mode: https://docs.unraid.net/unraid-os/manual/storage-management/#drive-shows-as-unmountable

root@Tower:/mnt# xfs_repair -n -L -v /dev/md6p1
Phase 1 - find and verify superblock...
        - block cache size set to 6118448 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 3003760 tail block 3002304
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used. Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
sb_fdblocks 98313526, counted 111938173
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
inode 11297473053 - bad extent starting block number 4503567551246457, offset 0
correcting nextents for inode 11297473053
bad data fork in inode 11297473053
would have cleared inode 11297473053
        - agno = 6
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 6
        - agno = 1
        - agno = 2
        - agno = 4
        - agno = 5
        - agno = 7
        - agno = 3
entry "00141-capture.jpg" at block 1 offset 608 in directory inode 11297472846 references free inode 11297473053
        would clear inode number in entry at offset 608...
inode 11297473053 - bad extent starting block number 4503567551246457, offset 0
correcting nextents for inode 11297473053
bad data fork in inode 11297473053
would have cleared inode 11297473053
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
entry "00141-capture.jpg" in directory inode 11297472846 points to free inode 11297473053, would junk entry
bad hash table for directory inode 11297472846 (no data entry): would rebuild
would rebuild directory inode 11297472846
        - agno = 6
        - agno = 7
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Mon Dec 25 23:12:52 2023

Phase           Start           End             Duration
Phase 1:        12/25 23:10:44  12/25 23:10:44
Phase 2:        12/25 23:10:44  12/25 23:10:45  1 second
Phase 3:        12/25 23:10:45  12/25 23:11:53  1 minute, 8 seconds
Phase 4:        12/25 23:11:53  12/25 23:11:54  1 second
Phase 5:        Skipped
Phase 6:        12/25 23:11:54  12/25 23:12:52  58 seconds
Phase 7:        12/25 23:12:52  12/25 23:12:52

Total run time: 2 minutes, 8 seconds

root@Tower:/mnt# xfs_repair -v /dev/md6p1
Phase 1 - find and verify superblock...
        - block cache size set to 6118448 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 3003760 tail block 3002304
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed. Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair. If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

Basically, do I want to run it with -L now or not?

tower-diagnostics-20231225-2303.zip
itimpi Posted December 26, 2023

26 minutes ago, bobo89 said: Basically do I want to run it with the -L now or not?

Yes
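For anyone scripting the same check, the decision above hinges on a single message in the xfs_repair output. The following is only a minimal sketch, assuming a POSIX shell and grep; the function name `has_dirty_log` is hypothetical, and the matched phrase is the one quoted from the output in this thread.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when xfs_repair output read from
# stdin contains the dirty-log message quoted in this thread.
has_dirty_log() {
    grep -q 'valuable metadata changes in a log'
}

# Example (commented out so nothing touches a disk by accident; the device
# name is the one from this thread and will differ on other systems):
#   if xfs_repair -n /dev/md6p1 2>&1 | has_dirty_log; then
#       echo "Dirty log: try mounting first; use -L only if mounting fails."
#   fi
```

xfs_repair itself recommends attempting a mount before using -L, because destroying the log discards the pending metadata changes. Here the disk was already unmountable, which is why -L was the next step.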
bobo89 Posted December 26, 2023 (Author)

35 minutes ago, itimpi said: Yes

Seems to have fixed it, thanks! Any thoughts on the call traces?
JorgeB Posted December 26, 2023

Enable the syslog server and post that after a crash.
bobo89 Posted December 26, 2023 (Author)

10 hours ago, JorgeB said: Enable the syslog server and post that after a crash.

The server locks up so hard I can't get onto it. I have enabled remote syslogging to another server, as well as offloading to the USB key. I'll see if I can use that to catch another crash. It usually takes about 24 hours.

Edit: Caught one within a couple of minutes of starting the array. Diagnostics are attached, and here is the syslog.

root@temp:/var/log# tail -f syslog
Dec 26 16:56:19 Tower nmbd[31928]:
Dec 26 16:56:19 Tower nmbd[31928]: Samba name server TOWER is now a local master browser for workgroup WORKGROUP on subnet 192.168.2.118
Dec 26 16:56:19 Tower nmbd[31928]:
Dec 26 16:56:19 Tower nmbd[31928]: *****
Dec 26 16:56:22 Tower kernel: Bluetooth: Core ver 2.22
Dec 26 16:56:22 Tower kernel: NET: Registered PF_BLUETOOTH protocol family
Dec 26 16:56:22 Tower kernel: Bluetooth: HCI device and connection manager initialized
Dec 26 16:56:22 Tower kernel: Bluetooth: HCI socket layer initialized
Dec 26 16:56:22 Tower kernel: Bluetooth: L2CAP socket layer initialized
Dec 26 16:56:22 Tower kernel: Bluetooth: SCO socket layer initialized
Dec 26 16:56:33 Tower kernel: docker0: port 1(veth04efed0) entered disabled state
Dec 26 16:56:33 Tower kernel: veth898bf67: renamed from eth0
Dec 26 16:56:33 Tower kernel: docker0: port 1(veth04efed0) entered disabled state
Dec 26 16:56:33 Tower kernel: device veth04efed0 left promiscuous mode
Dec 26 16:56:33 Tower kernel: docker0: port 1(veth04efed0) entered disabled state
Dec 26 16:56:36 Tower kernel: NET: Registered PF_PACKET protocol family
Dec 26 16:56:38 Tower mergerfs[17304]: running basic garbage collection
Dec 26 16:56:38 Tower mergerfs[17304]: threadpool (fuse.read): spawning 24 threads w/ max queue depth 24
Dec 26 16:56:38 Tower mergerfs[17304]: read-thread-count=24; process-thread-count=-1; process-thread-queue-depth=-1; pin-threads=false;
Dec 26 21:56:44 temp systemd[1]: systemd-timedated.service: Deactivated successfully.
Dec 26 16:58:41 Tower kernel: ------------[ cut here ]------------
Dec 26 16:58:41 Tower kernel: WARNING: CPU: 7 PID: 6258 at net/netfilter/nf_conntrack_core.c:1210 __nf_conntrack_confirm+0xa4/0x2b0 [nf_conntrack]
Dec 26 16:58:41 Tower kernel: Modules linked in: af_packet bluetooth ecdh_generic ecc nvidia_uvm(PO) xt_connmark xt_mark xt_comment iptable_raw wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libchacha veth xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle vhost_net tun vhost vhost_iotlb xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype br_netfilter xfs md_mod zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) tcp_diag inet_diag nct6775 nct6775_core hwmon_vid ip6table_filter ip6_tables iptable_filter ip_tables x_tables macvtap macvlan tap bridge 8021q garp mrp stp llc mlx4_en mlx4_core igb i2c_algo_bit nvidia_drm(PO) nvidia_modeset(PO) edac_mce_amd edac_core intel_rapl_msr intel_rapl_common iosf_mbi kvm_amd nvidia(PO) kvm video drm_kms_helper
Dec 26 16:58:41 Tower kernel: crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 aesni_intel crypto_simd cryptd wmi_bmof mxm_wmi drm mpt3sas rapl backlight k10temp i2c_piix4 nvme syscopyarea raid_class sysfillrect ccp input_leds scsi_transport_sas ahci i2c_core sysimgblt joydev led_class fb_sys_fops nvme_core libahci wmi button acpi_cpufreq unix [last unloaded: mlx4_core]
Dec 26 16:58:41 Tower kernel: CPU: 7 PID: 6258 Comm: kworker/u64:7 Tainted: P O 6.1.49-Unraid #1
Dec 26 16:58:41 Tower kernel: Hardware name: Micro-Star International Co., Ltd. MS-7B78/X470 GAMING PRO CARBON (MS-7B78), BIOS 2.E0 06/10/2020
Dec 26 16:58:41 Tower kernel: Workqueue: events_unbound macvlan_process_broadcast [macvlan]
Dec 26 16:58:41 Tower kernel: RIP: 0010:__nf_conntrack_confirm+0xa4/0x2b0 [nf_conntrack]
Dec 26 16:58:41 Tower kernel: Code: 44 24 10 e8 e2 e1 ff ff 8b 7c 24 04 89 ea 89 c6 89 04 24 e8 7e e6 ff ff 84 c0 75 a2 48 89 df e8 9b e2 ff ff 85 c0 89 c5 74 18 <0f> 0b 8b 34 24 8b 7c 24 04 e8 18 dd ff ff e8 93 e3 ff ff e9 72 01
Dec 26 16:58:41 Tower kernel: RSP: 0018:ffffc900003b0d98 EFLAGS: 00010202
Dec 26 16:58:41 Tower kernel: RAX: 0000000000000001 RBX: ffff8881e2698900 RCX: 5703bb9def20d4f0
Dec 26 16:58:41 Tower kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8881e2698900
Dec 26 16:58:41 Tower kernel: RBP: 0000000000000001 R08: 5da7c85202080faa R09: d0edd303cd3e3a8a
Dec 26 16:58:41 Tower kernel: R10: ca240b8a0ce8c507 R11: ffffc900003b0d60 R12: ffffffff82a11d00
Dec 26 16:58:41 Tower kernel: R13: 000000000000ba78 R14: ffff88953c5e8800 R15: 0000000000000000
Dec 26 16:58:41 Tower kernel: FS: 0000000000000000(0000) GS:ffff889f9e9c0000(0000) knlGS:0000000000000000
Dec 26 16:58:41 Tower kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 26 16:58:41 Tower kernel: CR2: 000014ad5540b020 CR3: 0000000108a1c000 CR4: 0000000000350ee0
Dec 26 16:58:41 Tower kernel: Call Trace:
Dec 26 16:58:41 Tower kernel: <IRQ>
Dec 26 16:58:41 Tower kernel: ? __warn+0xab/0x122
Dec 26 16:58:41 Tower kernel: ? report_bug+0x109/0x17e
Dec 26 16:58:41 Tower kernel: ? __nf_conntrack_confirm+0xa4/0x2b0 [nf_conntrack]
Dec 26 16:58:41 Tower kernel: ? handle_bug+0x41/0x6f
Dec 26 16:58:41 Tower kernel: ? exc_invalid_op+0x13/0x60
Dec 26 16:58:41 Tower kernel: ? asm_exc_invalid_op+0x16/0x20
Dec 26 16:58:41 Tower kernel: ? __nf_conntrack_confirm+0xa4/0x2b0 [nf_conntrack]
Dec 26 16:58:41 Tower kernel: ? __nf_conntrack_confirm+0x9e/0x2b0 [nf_conntrack]
Dec 26 16:58:41 Tower kernel: ? nf_nat_inet_fn+0xc0/0x1a8 [nf_nat]
Dec 26 16:58:41 Tower kernel: nf_conntrack_confirm+0x25/0x54 [nf_conntrack]
Dec 26 16:58:41 Tower kernel: nf_hook_slow+0x3d/0x96
Dec 26 16:58:41 Tower kernel: ? ip_protocol_deliver_rcu+0x164/0x164
Dec 26 16:58:41 Tower kernel: NF_HOOK.constprop.0+0x79/0xd9
Dec 26 16:58:41 Tower kernel: ? ip_protocol_deliver_rcu+0x164/0x164
Dec 26 16:58:41 Tower kernel: __netif_receive_skb_one_core+0x77/0x9c
Dec 26 16:58:41 Tower kernel: process_backlog+0x8c/0x116
Dec 26 16:58:41 Tower kernel: __napi_poll.constprop.0+0x2b/0x124
Dec 26 16:58:41 Tower kernel: net_rx_action+0x159/0x24f
Dec 26 16:58:41 Tower kernel: __do_softirq+0x129/0x288
Dec 26 16:58:41 Tower kernel: do_softirq+0x7f/0xab
Dec 26 16:58:41 Tower kernel: </IRQ>
Dec 26 16:58:41 Tower kernel: <TASK>
Dec 26 16:58:41 Tower kernel: __local_bh_enable_ip+0x4c/0x6b
Dec 26 16:58:41 Tower kernel: netif_rx+0x52/0x5a
Dec 26 16:58:41 Tower kernel: macvlan_broadcast+0x10a/0x150 [macvlan]
Dec 26 16:58:41 Tower kernel: ? _raw_spin_unlock+0x14/0x29
Dec 26 16:58:41 Tower kernel: macvlan_process_broadcast+0xbc/0x12f [macvlan]
Dec 26 16:58:41 Tower kernel: process_one_work+0x1ab/0x295
Dec 26 16:58:41 Tower kernel: worker_thread+0x18b/0x244
Dec 26 16:58:41 Tower kernel: ? rescuer_thread+0x281/0x281
Dec 26 16:58:41 Tower kernel: kthread+0xe7/0xef
Dec 26 16:58:41 Tower kernel: ? kthread_complete_and_exit+0x1b/0x1b
Dec 26 16:58:41 Tower kernel: ret_from_fork+0x22/0x30
Dec 26 16:58:41 Tower kernel: </TASK>
Dec 26 16:58:41 Tower kernel: ---[ end trace 0000000000000000 ]---

In case this picture helps, this was displayed on the screen when I got to it. (It was taken during a hard crash and is not related to the text dump above.)

tower-diagnostics-20231226-1702.zip
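When a remote syslog runs to thousands of lines, it can help to post only the warning blocks. A minimal sketch, assuming a POSIX shell and sed; the function name `extract_traces` and the example path are hypothetical. It relies only on the "[ cut here ]" and "[ end trace" markers visible in the log above.

```shell
#!/bin/sh
# Hypothetical helper: print only the kernel warning blocks from a syslog
# stream on stdin, from each "[ cut here ]" line through the matching
# "[ end trace" line (both markers included).
extract_traces() {
    sed -n '/\[ cut here ]/,/\[ end trace /p'
}

# Example (the path is an assumption; use wherever your syslog is stored):
#   extract_traces < /var/log/syslog > traces.txt
```

This keeps the module list, registers, and stack frames together while dropping the unrelated lines around them.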
itimpi Posted December 26, 2023 (Solution)

That call trace looks like it could be macvlan-related. If so, you need to either switch Docker to ipvlan or, to continue using macvlan, disable bridging on eth0.
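A quick way to check a captured log for the same signature is to look for stack frames tagged with the macvlan module. This is a rough heuristic sketch, assuming a POSIX shell and grep; the function name `trace_mentions_macvlan` is hypothetical, and a match only suggests macvlan involvement rather than proving it.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when kernel log text on stdin contains a
# frame tagged "[macvlan]". The fixed-string match deliberately ignores the
# bare module name in a "Modules linked in:" list, which appears whenever
# the module is loaded, whether or not it is involved in the trace.
trace_mentions_macvlan() {
    grep -qF '[macvlan]'
}

# Example (the path is an assumption):
#   trace_mentions_macvlan < /var/log/syslog && echo "macvlan frames found"
```

In the trace posted above, the Workqueue line and the macvlan_broadcast and macvlan_process_broadcast frames all carry that tag.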
bobo89 Posted December 27, 2023 (Author)

13 hours ago, itimpi said: That call trace looks like it could be macvlan related. If so you need to either switch docker to using ipvlan, or alternatively disable bridging on eth0 to continue using macvlan.

Switched to ipvlan. Not only does that seem to have fixed the issue (no more traces in the logs for about 9 hours), but Docker networking responsiveness also seems to have improved.