Call trace and xfs_repair

bobo89 · December 26, 2023

Chasing two things here. The server started locking up about once a day after I added a new VLAN on the NIC and added that VLAN to Docker. (I'm not sure that is the cause, but it's the last major change I made.) All I can capture is this on the screen. The server locks up hard and can't be interacted with. Any thoughts on what is causing this issue? macvlan?

The second thing: after a couple of hard reboots, drive6 started reading "Unmountable: Unsupported or no file system".

Following the instructions here, I started the array in maintenance mode:

https://docs.unraid.net/unraid-os/manual/storage-management/#drive-shows-as-unmountable



root@Tower:/mnt# xfs_repair -n -L -v /dev/md6p1
Phase 1 - find and verify superblock...
        - block cache size set to 6118448 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 3003760 tail block 3002304
ALERT: The filesystem has valuable metadata changes in a log which is being
ignored because the -n option was used.  Expect spurious inconsistencies
which may be resolved by first mounting the filesystem to replay the log.
        - scan filesystem freespace and inode maps...
sb_fdblocks 98313526, counted 111938173
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
inode 11297473053 - bad extent starting block number 4503567551246457, offset 0
correcting nextents for inode 11297473053
bad data fork in inode 11297473053
would have cleared inode 11297473053
        - agno = 6
        - agno = 7
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 6
        - agno = 1
        - agno = 2
        - agno = 4
        - agno = 5
        - agno = 7
        - agno = 3
entry "00141-capture.jpg" at block 1 offset 608 in directory inode 11297472846 references free inode 11297473053
        would clear inode number in entry at offset 608...
inode 11297473053 - bad extent starting block number 4503567551246457, offset 0
correcting nextents for inode 11297473053
bad data fork in inode 11297473053
would have cleared inode 11297473053
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
entry "00141-capture.jpg" in directory inode 11297472846 points to free inode 11297473053, would junk entry
bad hash table for directory inode 11297472846 (no data entry): would rebuild
would rebuild directory inode 11297472846
        - agno = 6
        - agno = 7
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.

        XFS_REPAIR Summary    Mon Dec 25 23:12:52 2023

Phase           Start           End             Duration
Phase 1:        12/25 23:10:44  12/25 23:10:44
Phase 2:        12/25 23:10:44  12/25 23:10:45  1 second
Phase 3:        12/25 23:10:45  12/25 23:11:53  1 minute, 8 seconds
Phase 4:        12/25 23:11:53  12/25 23:11:54  1 second
Phase 5:        Skipped
Phase 6:        12/25 23:11:54  12/25 23:12:52  58 seconds
Phase 7:        12/25 23:12:52  12/25 23:12:52

Total run time: 2 minutes, 8 seconds


root@Tower:/mnt# xfs_repair -v /dev/md6p1
Phase 1 - find and verify superblock...
        - block cache size set to 6118448 entries
Phase 2 - using internal log
        - zero log...
zero_log: head block 3003760 tail block 3002304
ERROR: The filesystem has valuable metadata changes in a log which needs to
be replayed.  Mount the filesystem to replay the log, and unmount it before
re-running xfs_repair.  If you are unable to mount the filesystem, then use
the -L option to destroy the log and attempt a repair.
Note that destroying the log may cause corruption -- please attempt a mount
of the filesystem before doing this.

Basically, do I want to run it with -L now or not?

tower-diagnostics-20231225-2303.zip

itimpi · December 26, 2023

26 minutes ago, bobo89 said:

Basically, do I want to run it with -L now or not?

Yes
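
For anyone reading later, the order of operations that the xfs_repair message itself describes can be written down as a small decision helper. This is only a sketch in Python: the function name `next_step`, its `mount_succeeded` parameter, and the returned labels are made up for illustration and are not part of xfs_repair or Unraid. The assumption here is that a disk already showing as unmountable counts as a failed mount attempt, which is the case where the message points to -L.

```python
from typing import Optional

# Phrase that appears in both the -n ALERT and the ERROR shown in this thread.
LOG_DIRTY = "valuable metadata changes in a log"


def next_step(repair_output: str, mount_succeeded: Optional[bool] = None) -> str:
    """Suggest the next action based on xfs_repair output (hypothetical helper).

    mount_succeeded is None if no mount has been attempted yet.
    """
    # Collapse whitespace so the phrase is found even if the output is re-wrapped.
    text = " ".join(repair_output.split())
    if LOG_DIRTY not in text:
        return "repair"  # no dirty log reported: plain xfs_repair
    if mount_succeeded is None:
        return "try_mount"  # the message asks for a mount attempt first
    if mount_succeeded:
        return "unmount_then_repair"  # log was replayed: unmount, re-run xfs_repair
    return "repair_with_-L"  # mount failed: -L destroys the log, last resort


error_text = (
    "ERROR: The filesystem has valuable metadata changes in a log which needs to\n"
    "be replayed.  Mount the filesystem to replay the log, and unmount it before\n"
    "re-running xfs_repair."
)
print(next_step(error_text, mount_succeeded=False))
```

The point of the ordering is that -L throws away the pending log entries, so it only makes sense once replaying the log by mounting is off the table.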

bobo89 · December 26, 2023

35 minutes ago, itimpi said:

Yes

Seems to have fixed it, thanks!

Any thoughts on the Call traces?

JorgeB · December 26, 2023

Enable the syslog server and post that after a crash.

bobo89 · December 26, 2023

10 hours ago, JorgeB said:

Enable the syslog server and post that after a crash.

The server locks up so hard I can't get onto it.

I have enabled remote syslogging to another server, as well as offloading to the USB key. I'll see if I can use that to catch another crash. It usually takes about 24 hours.

Edit: Caught one within a couple of minutes of starting the array. Attached are the diagnostics, and here is the syslog.

root@temp:/var/log# tail -f syslog
Dec 26 16:56:19 Tower nmbd[31928]:
Dec 26 16:56:19 Tower nmbd[31928]:   Samba name server TOWER is now a local master browser for workgroup WORKGROUP on subnet 192.168.2.118
Dec 26 16:56:19 Tower nmbd[31928]:
Dec 26 16:56:19 Tower nmbd[31928]:   *****
Dec 26 16:56:22 Tower kernel: Bluetooth: Core ver 2.22
Dec 26 16:56:22 Tower kernel: NET: Registered PF_BLUETOOTH protocol family
Dec 26 16:56:22 Tower kernel: Bluetooth: HCI device and connection manager initialized
Dec 26 16:56:22 Tower kernel: Bluetooth: HCI socket layer initialized
Dec 26 16:56:22 Tower kernel: Bluetooth: L2CAP socket layer initialized
Dec 26 16:56:22 Tower kernel: Bluetooth: SCO socket layer initialized
Dec 26 16:56:33 Tower kernel: docker0: port 1(veth04efed0) entered disabled state
Dec 26 16:56:33 Tower kernel: veth898bf67: renamed from eth0
Dec 26 16:56:33 Tower kernel: docker0: port 1(veth04efed0) entered disabled state
Dec 26 16:56:33 Tower kernel: device veth04efed0 left promiscuous mode
Dec 26 16:56:33 Tower kernel: docker0: port 1(veth04efed0) entered disabled state
Dec 26 16:56:36 Tower kernel: NET: Registered PF_PACKET protocol family
Dec 26 16:56:38 Tower mergerfs[17304]: running basic garbage collection
Dec 26 16:56:38 Tower mergerfs[17304]: threadpool (fuse.read): spawning 24 threads w/ max queue depth 24
Dec 26 16:56:38 Tower mergerfs[17304]: read-thread-count=24; process-thread-count=-1; process-thread-queue-depth=-1; pin-threads=false;
Dec 26 21:56:44 temp systemd[1]: systemd-timedated.service: Deactivated successfully.
Dec 26 16:58:41 Tower kernel: ------------[ cut here ]------------
Dec 26 16:58:41 Tower kernel: WARNING: CPU: 7 PID: 6258 at net/netfilter/nf_conntrack_core.c:1210 __nf_conntrack_confirm+0xa4/0x2b0 [nf_conntrack]
Dec 26 16:58:41 Tower kernel: Modules linked in: af_packet bluetooth ecdh_generic ecc nvidia_uvm(PO) xt_connmark xt_mark xt_comment iptable_raw wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libchacha veth xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle vhost_net tun vhost vhost_iotlb xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype br_netfilter xfs md_mod zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) tcp_diag inet_diag nct6775 nct6775_core hwmon_vid ip6table_filter ip6_tables iptable_filter ip_tables x_tables macvtap macvlan tap bridge 8021q garp mrp stp llc mlx4_en mlx4_core igb i2c_algo_bit nvidia_drm(PO) nvidia_modeset(PO) edac_mce_amd edac_core intel_rapl_msr intel_rapl_common iosf_mbi kvm_amd nvidia(PO) kvm video drm_kms_helper
Dec 26 16:58:41 Tower kernel: crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 aesni_intel crypto_simd cryptd wmi_bmof mxm_wmi drm mpt3sas rapl backlight k10temp i2c_piix4 nvme syscopyarea raid_class sysfillrect ccp input_leds scsi_transport_sas ahci i2c_core sysimgblt joydev led_class fb_sys_fops nvme_core libahci wmi button acpi_cpufreq unix [last unloaded: mlx4_core]
Dec 26 16:58:41 Tower kernel: CPU: 7 PID: 6258 Comm: kworker/u64:7 Tainted: P           O       6.1.49-Unraid #1
Dec 26 16:58:41 Tower kernel: Hardware name: Micro-Star International Co., Ltd. MS-7B78/X470 GAMING PRO CARBON (MS-7B78), BIOS 2.E0 06/10/2020
Dec 26 16:58:41 Tower kernel: Workqueue: events_unbound macvlan_process_broadcast [macvlan]
Dec 26 16:58:41 Tower kernel: RIP: 0010:__nf_conntrack_confirm+0xa4/0x2b0 [nf_conntrack]
Dec 26 16:58:41 Tower kernel: Code: 44 24 10 e8 e2 e1 ff ff 8b 7c 24 04 89 ea 89 c6 89 04 24 e8 7e e6 ff ff 84 c0 75 a2 48 89 df e8 9b e2 ff ff 85 c0 89 c5 74 18 <0f> 0b 8b 34 24 8b 7c 24 04 e8 18 dd ff ff e8 93 e3 ff ff e9 72 01
Dec 26 16:58:41 Tower kernel: RSP: 0018:ffffc900003b0d98 EFLAGS: 00010202
Dec 26 16:58:41 Tower kernel: RAX: 0000000000000001 RBX: ffff8881e2698900 RCX: 5703bb9def20d4f0
Dec 26 16:58:41 Tower kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8881e2698900
Dec 26 16:58:41 Tower kernel: RBP: 0000000000000001 R08: 5da7c85202080faa R09: d0edd303cd3e3a8a
Dec 26 16:58:41 Tower kernel: R10: ca240b8a0ce8c507 R11: ffffc900003b0d60 R12: ffffffff82a11d00
Dec 26 16:58:41 Tower kernel: R13: 000000000000ba78 R14: ffff88953c5e8800 R15: 0000000000000000
Dec 26 16:58:41 Tower kernel: FS:  0000000000000000(0000) GS:ffff889f9e9c0000(0000) knlGS:0000000000000000
Dec 26 16:58:41 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 26 16:58:41 Tower kernel: CR2: 000014ad5540b020 CR3: 0000000108a1c000 CR4: 0000000000350ee0
Dec 26 16:58:41 Tower kernel: Call Trace:
Dec 26 16:58:41 Tower kernel: <IRQ>
Dec 26 16:58:41 Tower kernel: ? __warn+0xab/0x122
Dec 26 16:58:41 Tower kernel: ? report_bug+0x109/0x17e
Dec 26 16:58:41 Tower kernel: ? __nf_conntrack_confirm+0xa4/0x2b0 [nf_conntrack]
Dec 26 16:58:41 Tower kernel: ? handle_bug+0x41/0x6f
Dec 26 16:58:41 Tower kernel: ? exc_invalid_op+0x13/0x60
Dec 26 16:58:41 Tower kernel: ? asm_exc_invalid_op+0x16/0x20
Dec 26 16:58:41 Tower kernel: ? __nf_conntrack_confirm+0xa4/0x2b0 [nf_conntrack]
Dec 26 16:58:41 Tower kernel: ? __nf_conntrack_confirm+0x9e/0x2b0 [nf_conntrack]
Dec 26 16:58:41 Tower kernel: ? nf_nat_inet_fn+0xc0/0x1a8 [nf_nat]
Dec 26 16:58:41 Tower kernel: nf_conntrack_confirm+0x25/0x54 [nf_conntrack]
Dec 26 16:58:41 Tower kernel: nf_hook_slow+0x3d/0x96
Dec 26 16:58:41 Tower kernel: ? ip_protocol_deliver_rcu+0x164/0x164
Dec 26 16:58:41 Tower kernel: NF_HOOK.constprop.0+0x79/0xd9
Dec 26 16:58:41 Tower kernel: ? ip_protocol_deliver_rcu+0x164/0x164
Dec 26 16:58:41 Tower kernel: __netif_receive_skb_one_core+0x77/0x9c
Dec 26 16:58:41 Tower kernel: process_backlog+0x8c/0x116
Dec 26 16:58:41 Tower kernel: __napi_poll.constprop.0+0x2b/0x124
Dec 26 16:58:41 Tower kernel: net_rx_action+0x159/0x24f
Dec 26 16:58:41 Tower kernel: __do_softirq+0x129/0x288
Dec 26 16:58:41 Tower kernel: do_softirq+0x7f/0xab
Dec 26 16:58:41 Tower kernel: </IRQ>
Dec 26 16:58:41 Tower kernel: <TASK>
Dec 26 16:58:41 Tower kernel: __local_bh_enable_ip+0x4c/0x6b
Dec 26 16:58:41 Tower kernel: netif_rx+0x52/0x5a
Dec 26 16:58:41 Tower kernel: macvlan_broadcast+0x10a/0x150 [macvlan]
Dec 26 16:58:41 Tower kernel: ? _raw_spin_unlock+0x14/0x29
Dec 26 16:58:41 Tower kernel: macvlan_process_broadcast+0xbc/0x12f [macvlan]
Dec 26 16:58:41 Tower kernel: process_one_work+0x1ab/0x295
Dec 26 16:58:41 Tower kernel: worker_thread+0x18b/0x244
Dec 26 16:58:41 Tower kernel: ? rescuer_thread+0x281/0x281
Dec 26 16:58:41 Tower kernel: kthread+0xe7/0xef
Dec 26 16:58:41 Tower kernel: ? kthread_complete_and_exit+0x1b/0x1b
Dec 26 16:58:41 Tower kernel: ret_from_fork+0x22/0x30
Dec 26 16:58:41 Tower kernel: </TASK>
Dec 26 16:58:41 Tower kernel: ---[ end trace 0000000000000000 ]---

In case this picture helps, this was displayed on the screen when I got to it (during a hard crash, not related to the previous text dump).

tower-diagnostics-20231226-1702.zip

itimpi · December 26, 2023

That call trace looks like it could be macvlan-related. If so, you need to either switch Docker to ipvlan or disable bridging on eth0 to continue using macvlan.
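
The hint in the trace is the `Workqueue: events_unbound macvlan_process_broadcast [macvlan]` line plus the `[macvlan]` frames near the bottom. If you want to check a saved syslog for the same pattern, something like the following Python sketch could do it. The names `find_traces` and `MODULE_RE` are made up for this example, and it assumes the kernel marks each trace with "cut here" and "end trace" lines as in the log posted above.

```python
import re

CUT = "[ cut here ]"
END = "---[ end trace"
# Stack frames end with the owning module in brackets, e.g. "... [macvlan]".
MODULE_RE = re.compile(r"\[([a-z0-9_]+)\]\s*$")


def find_traces(lines):
    """Return one dict per kernel trace: the WARNING text and the modules seen."""
    traces, current = [], None
    for line in lines:
        if CUT in line:
            current = {"warning": None, "modules": set()}
            continue
        if current is None:
            continue  # not inside a trace
        if END in line:
            traces.append(current)
            current = None
            continue
        if "WARNING:" in line and current["warning"] is None:
            current["warning"] = line.split("WARNING:", 1)[1].strip()
        match = MODULE_RE.search(line)
        if match:
            current["modules"].add(match.group(1))
    return traces


# A few lines copied from the syslog posted earlier in this thread.
sample = [
    "Dec 26 16:58:41 Tower kernel: ------------[ cut here ]------------",
    "Dec 26 16:58:41 Tower kernel: WARNING: CPU: 7 PID: 6258 at net/netfilter/nf_conntrack_core.c:1210 __nf_conntrack_confirm+0xa4/0x2b0 [nf_conntrack]",
    "Dec 26 16:58:41 Tower kernel: Workqueue: events_unbound macvlan_process_broadcast [macvlan]",
    "Dec 26 16:58:41 Tower kernel: macvlan_broadcast+0x10a/0x150 [macvlan]",
    "Dec 26 16:58:41 Tower kernel: ---[ end trace 0000000000000000 ]---",
]
traces = find_traces(sample)
macvlan_traces = [t for t in traces if "macvlan" in t["modules"]]
print(len(macvlan_traces), "trace(s) involve macvlan")
```

To run it against a real file, pass the open file object instead of `sample`, for example `find_traces(open("syslog"))`, where the path is whatever your syslog server writes to.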

bobo89 · December 27, 2023

13 hours ago, itimpi said:

That call trace looks like it could be macvlan-related. If so, you need to either switch Docker to ipvlan or disable bridging on eth0 to continue using macvlan.

Switched to ipvlan. Not only does that seem to have fixed the issue (no more traces in the logs for about 9 hours), but Docker networking responsiveness seems to have improved.
