
[6.12.10] Server crashing with call trace and BTRFS errors after macvlan migration


Go to solution Solved by JorgeB,


Posted (edited)

Hey everyone,

 

I'm currently in the process of migrating my network and separating all services onto different VLANs. Since I'm using Ubiquiti networking gear, I had to switch back to using macvlan for my Docker containers based on the recommendations in the Unraid 6.12.4 release notes. The network part of the migration is working fine, and I followed all the required changes listed in the 6.12.4 changelog to use macvlan successfully:

  • Disabled bridging on eth0 (main interface)
  • Kept bridging enabled on eth1 with my Docker VLANs
  • Enabled "Host access to custom networks" in Docker settings

 

However, I'm still running into stability issues after some hours of uptime. The server crashes and becomes unreachable, with call trace errors showing up in the logs. This seems to also cause other issues like Docker randomly shutting down and restarting, as well as BTRFS errors appearing in the syslog.

I noticed that in my docker.cfg file, only "br1" is listed as the custom Docker network. When I try to change this, it reverts back after restarting the Docker service.

grafik.png

 

 

I've attached the syslog from the last few days capturing the crashes and call traces, along with the latest diagnostics package and my current network.cfg and docker.cfg files. Any help or suggestions on resolving these crashes and call trace issues would be greatly appreciated! Let me know if you need any other information from me.

 

 

Thanks in advance!

 

 

May 27 21:56:08 server kernel: ------------[ cut here ]------------
May 27 21:56:08 server kernel: WARNING: CPU: 1 PID: 25449 at net/netfilter/nf_nat_core.c:594 nf_nat_setup_info+0x8c/0x7d1 [nf_nat]
May 27 21:56:08 server kernel: Modules linked in: vhost_net vhost kvm_intel kvm tun ipvlan xt_mark xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle vhost_iotlb veth xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype br_netfilter xfs dm_crypt dm_mod md_mod nfsd auth_rpcgss oid_registry lockd grace sunrpc zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) bluetooth ecdh_generic ecc tcp_diag inet_diag nct6775 nct6775_core hwmon_vid wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libchacha ip6table_filter ip6_tables iptable_filter ip_tables x_tables efivarfs bridge macvtap macvlan tap 8021q garp mrp stp llc e1000e r8169 realtek i915 intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp iosf_mbi drm_buddy i2c_algo_bit ttm
May 27 21:56:08 server kernel: crct10dif_pclmul crc32_pclmul crc32c_intel drm_display_helper ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 drm_kms_helper aesni_intel crypto_simd drm cryptd mei_hdcp mei_pxp i2c_i801 intel_gtt rapl wmi_bmof intel_wmi_thunderbolt intel_cstate mpt3sas agpgart nvme mei_me apex(O) i2c_smbus syscopyarea ahci raid_class nvme_core i2c_core mei gasket(O) scsi_transport_sas libahci intel_uncore input_leds joydev cdc_acm led_class sysfillrect sysimgblt intel_pch_thermal fb_sys_fops video wmi backlight intel_pmc_core acpi_tad acpi_pad button unix [last unloaded: kvm]
May 27 21:56:08 server kernel: CPU: 1 PID: 25449 Comm: kworker/u40:1 Tainted: P        W  O       6.1.79-Unraid #1
May 27 21:56:08 server kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z490 PG Velocita, BIOS P1.70 01/04/2021
May 27 21:56:08 server kernel: Workqueue: events_unbound macvlan_process_broadcast [macvlan]
May 27 21:56:08 server kernel: RIP: 0010:nf_nat_setup_info+0x8c/0x7d1 [nf_nat]
May 27 21:56:08 server kernel: Code: a8 80 75 26 48 8d 73 58 48 8d 7c 24 20 e8 18 bb fd ff 48 8d 43 0c 4c 8b bb 88 00 00 00 48 89 44 24 18 eb 54 0f ba e0 08 73 07 <0f> 0b e9 75 06 00 00 48 8d 73 58 48 8d 7c 24 20 e8 eb ba fd ff 48
May 27 21:56:08 server kernel: RSP: 0018:ffffc900001f0c78 EFLAGS: 00010282
May 27 21:56:08 server kernel: RAX: 0000000000000180 RBX: ffff8882094d2900 RCX: ffff88814de4f140
May 27 21:56:08 server kernel: RDX: 0000000000000000 RSI: ffffc900001f0d5c RDI: ffff8882094d2900
May 27 21:56:08 server kernel: RBP: ffffc900001f0d40 R08: 000000000114000a R09: 0000000000000000
May 27 21:56:08 server kernel: R10: 0000000000000098 R11: 0000000000000001 R12: ffffc900001f0d5c
May 27 21:56:08 server kernel: R13: 0000000000000000 R14: ffffc900001f0e40 R15: 0000000000000001
May 27 21:56:08 server kernel: FS:  0000000000000000(0000) GS:ffff88901f440000(0000) knlGS:0000000000000000
May 27 21:56:08 server kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 27 21:56:08 server kernel: CR2: 000014f2bf7cb508 CR3: 0000000121dae001 CR4: 00000000007726e0
May 27 21:56:08 server kernel: PKRU: 55555554
May 27 21:56:08 server kernel: Call Trace:
May 27 21:56:08 server kernel: <IRQ>
May 27 21:56:08 server kernel: ? __warn+0xab/0x122
May 27 21:56:08 server kernel: ? report_bug+0x109/0x17e
May 27 21:56:08 server kernel: ? nf_nat_setup_info+0x8c/0x7d1 [nf_nat]
May 27 21:56:08 server kernel: ? handle_bug+0x41/0x6f
May 27 21:56:08 server kernel: ? exc_invalid_op+0x13/0x60
May 27 21:56:08 server kernel: ? asm_exc_invalid_op+0x16/0x20
May 27 21:56:08 server kernel: ? nf_nat_setup_info+0x8c/0x7d1 [nf_nat]
May 27 21:56:08 server kernel: ? nf_nat_setup_info+0x44/0x7d1 [nf_nat]
May 27 21:56:08 server kernel: ? xt_write_recseq_end+0xf/0x1c [ip_tables]
May 27 21:56:08 server kernel: ? __local_bh_enable_ip+0x56/0x6b
May 27 21:56:08 server kernel: ? ipt_do_table+0x575/0x5ba [ip_tables]
May 27 21:56:08 server kernel: ? ip_route_input_slow+0x6d6/0x86c
May 27 21:56:08 server kernel: __nf_nat_alloc_null_binding+0x66/0x81 [nf_nat]
May 27 21:56:08 server kernel: nf_nat_inet_fn+0xc0/0x1a8 [nf_nat]
May 27 21:56:08 server kernel: nf_nat_ipv4_local_in+0x2a/0xaa [nf_nat]
May 27 21:56:08 server kernel: nf_hook_slow+0x3a/0x96
May 27 21:56:08 server kernel: ? ip_protocol_deliver_rcu+0x164/0x164
May 27 21:56:08 server kernel: NF_HOOK.constprop.0+0x79/0xd9
May 27 21:56:08 server kernel: ? ip_protocol_deliver_rcu+0x164/0x164
May 27 21:56:08 server kernel: __netif_receive_skb_one_core+0x77/0x9c
May 27 21:56:08 server kernel: process_backlog+0x8c/0x116
May 27 21:56:08 server kernel: __napi_poll.constprop.0+0x28/0x124
May 27 21:56:08 server kernel: net_rx_action+0x159/0x24f
May 27 21:56:08 server kernel: __do_softirq+0x126/0x288
May 27 21:56:08 server kernel: do_softirq+0x7f/0xab
May 27 21:56:08 server kernel: </IRQ>
May 27 21:56:08 server kernel: <TASK>
May 27 21:56:08 server kernel: __local_bh_enable_ip+0x4c/0x6b
May 27 21:56:08 server kernel: netif_rx+0x52/0x5a
May 27 21:56:08 server kernel: macvlan_broadcast+0x10a/0x150 [macvlan]
May 27 21:56:08 server kernel: ? _raw_spin_unlock+0x14/0x29
May 27 21:56:08 server kernel: macvlan_process_broadcast+0xbc/0x12f [macvlan]
May 27 21:56:08 server kernel: process_one_work+0x1a8/0x295
May 27 21:56:08 server kernel: worker_thread+0x18b/0x244
May 27 21:56:08 server kernel: ? rescuer_thread+0x281/0x281
May 27 21:56:08 server kernel: kthread+0xe4/0xef
May 27 21:56:08 server kernel: ? kthread_complete_and_exit+0x1b/0x1b
May 27 21:56:08 server kernel: ret_from_fork+0x1f/0x30
May 27 21:56:08 server kernel: </TASK>
May 27 21:56:08 server kernel: ---[ end trace 0000000000000000 ]---
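For anyone checking their own syslog for the same pattern, here is a minimal Python sketch. It assumes the syslog line layout shown in the trace above, and the function name `find_warning_workqueues` is made up for illustration. It pairs each kernel `WARNING:` line with the `Workqueue:` line that follows it, which is where `macvlan_process_broadcast` shows up in this trace.

```python
import re

# Patterns are based only on the trace lines quoted above.
WARNING_RE = re.compile(r"kernel: WARNING: .* at (\S+) (\S+)")
WORKQUEUE_RE = re.compile(r"kernel: Workqueue: (\S+) (\S+)")


def find_warning_workqueues(lines):
    """Return (source_location, workqueue_function) pairs, one per WARNING block."""
    results = []
    pending = None  # source location of a WARNING still waiting for its Workqueue line
    for line in lines:
        m = WARNING_RE.search(line)
        if m:
            pending = m.group(1)
            continue
        m = WORKQUEUE_RE.search(line)
        if m and pending is not None:
            results.append((pending, m.group(2)))
            pending = None
    return results


sample = [
    "May 27 21:56:08 server kernel: WARNING: CPU: 1 PID: 25449 at "
    "net/netfilter/nf_nat_core.c:594 nf_nat_setup_info+0x8c/0x7d1 [nf_nat]",
    "May 27 21:56:08 server kernel: Workqueue: events_unbound "
    "macvlan_process_broadcast [macvlan]",
]
for location, function in find_warning_workqueues(sample):
    print(location, function)
```

To run it against a real log, pass an open file instead of `sample`, for example `find_warning_workqueues(open("syslog"))`, where the file name is a placeholder.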

 

 

grafik.png

grafik.png

 

 

EDIT: Uploaded new diagnostics file

 

 

syslog-previous

server-diagnostics-20240528-1314.zip

Edited by CryPt00n
Uploaded new diagnostics file
Posted (edited)
27 minutes ago, JorgeB said:

If docker is using eth1 you need to disable bridging there.

But when I disable the bridge, I can no longer use the VLAN that is tagged on eth1.

 

EDIT: Solved it by tagging the VLAN on the switch port and setting the IP settings on eth1 to match the VLAN. I will let the server run for some time now; hopefully it no longer crashes. Thanks for the quick answer :)

 

grafik.png

Edited by CryPt00n
Posted

Hi @JorgeB, just an update on this case.

 

The network problem seems to be fixed, thanks again for the help.

 

But I'm still running into BTRFS issues, and to me it looks like my drive is broken. For some days now I have been finding the following error in my logs, which forces the filesystem into read-only mode. I'm using BTRFS on both of my cache pools. Now, even after a restart, the pool is forced to read-only almost instantly.

May 30 04:37:01 server kernel: BTRFS warning (device dm-3): checksum verify failed on logical 10832904192 mirror 1 wanted 0x526f9829 found 0x1e8b5ba6 level 0
May 30 04:37:01 server kernel: BTRFS error (device dm-3): failed to run delayed ref for logical 38892613632 num_bytes 229376 type 178 action 1 ref_mod 1: -5
May 30 04:37:01 server kernel: BTRFS: error (device dm-3: state A) in btrfs_run_delayed_refs:2150: errno=-5 IO failure
May 30 04:37:01 server kernel: BTRFS info (device dm-3: state EA): forced readonly

 

 

I also cannot finish a scrub; it fails with status code -5. I attached two diagnostics: one from before restarting, and one from after restarting and running a filesystem check and a scrub.

 

Filesystem check Drive 2

[1/7] checking root items
checksum verify failed on 10832904192 wanted 0x526f9829 found 0x1e8b5ba6
checksum verify failed on 10832904192 wanted 0x526f9829 found 0x1e8b5ba6
Csum didn't match
ERROR: failed to repair root items: Input/output error
[2/7] checking extents
checksum verify failed on 10832904192 wanted 0x526f9829 found 0x1e8b5ba6
checksum verify failed on 10832904192 wanted 0x526f9829 found 0x1e8b5ba6
Csum didn't match
checksum verify failed on 10832887808 wanted 0xed101370 found 0x8d79d567
checksum verify failed on 10832887808 wanted 0xed101370 found 0x8d79d567
Csum didn't match
checksum verify failed on 10832936960 wanted 0x11cbcdce found 0xeb337b06
checksum verify failed on 10832936960 wanted 0x11cbcdce found 0xeb337b06
bad tree block 10832936960, bytenr mismatch, want=10832936960, have=4256055004336886150
checksum verify failed on 10832986112 wanted 0x2ac57ae3 found 0x3c9143a3
checksum verify failed on 10832986112 wanted 0x2ac57ae3 found 0x3c9143a3
Csum didn't match
checksum verify failed on 10833002496 wanted 0x47f261cc found 0xa493f25a
checksum verify failed on 10833002496 wanted 0x47f261cc found 0xa493f25a
Csum didn't match
owner ref check failed [10832887808 16384]
owner ref check failed [10832904192 16384]
owner ref check failed [10832936960 16384]
owner ref check failed [10832986112 16384]
owner ref check failed [10833002496 16384]

[...]

data extent[38901858304, 65536] referencer count mismatch (root 5 owner 158556278 offset 22189772800) wanted 0 have 1
backpointer mismatch on [38901858304 65536]
ERROR: errors found in extent allocation tree or chunk allocation
[3/7] checking free space tree
checksum verify failed on 10832904192 wanted 0x526f9829 found 0x1e8b5ba6
checksum verify failed on 10832904192 wanted 0x526f9829 found 0x1e8b5ba6
Csum didn't match
[4/7] checking fs roots
checksum verify failed on 10832936960 wanted 0x11cbcdce found 0xeb337b06
checksum verify failed on 10832936960 wanted 0x11cbcdce found 0xeb337b06
bad tree block 10832936960, bytenr mismatch, want=10832936960, have=4256055004336886150
checksum verify failed on 10832887808 wanted 0xed101370 found 0x8d79d567
checksum verify failed on 10832887808 wanted 0xed101370 found 0x8d79d567
Csum didn't match
[5/7] checking only csums items (without verifying data)
checksum verify failed on 10832904192 wanted 0x526f9829 found 0x1e8b5ba6
checksum verify failed on 10832904192 wanted 0x526f9829 found 0x1e8b5ba6
Csum didn't match
Error looking up extent record -5
csum exists for 38877519872-38877720576 but there is no extent record
checksum verify failed on 10832904192 wanted 0x526f9829 found 0x1e8b5ba6
checksum verify failed on 10832904192 wanted 0x526f9829 found 0x1e8b5ba6
Csum didn't match
Error looking up extent record -5

[...]

bad tree block 10832936960, bytenr mismatch, want=10832936960, have=4256055004336886150
Error going to next leaf -5
ERROR: errors found in csum tree
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
Opening filesystem to check...
Checking filesystem on /dev/mapper/sdf1
UUID: e10f68fc-6db1-4421-862e-3832a94ff001
found 660951715840 bytes used, error(s) found
total csum bytes: 508863876
total tree bytes: 839008256
total fs tree bytes: 189693952
total extent tree bytes: 49577984
btree space waste bytes: 128288893
file data blocks allocated: 660112625664
 referenced 660111429632


 

 

server-diagnostics-20240530-1410.zip server-diagnostics-20240530-1451.zip

Posted

Btrfs is detecting data corruption on that pool:

 

May 29 18:16:17 server kernel: BTRFS info (device dm-3): bdev /dev/mapper/sdf1 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0

 

Run a correcting scrub and post the results.
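As a rough aid for reading the `errs:` line above: `wr`, `rd` and `flush` count I/O errors, `corrupt` counts checksum mismatches, and `gen` counts generation mismatches, the same five counters that `btrfs device stats` reports. Below is a minimal Python sketch that pulls them out of such a line. The function name and the spelled-out counter names are only illustrative.

```python
import re

# Matches the counter list in lines such as:
#   "... bdev /dev/mapper/sdf1 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0"
ERRS_RE = re.compile(
    r"bdev (\S+) errs: wr (\d+), rd (\d+), flush (\d+), corrupt (\d+), gen (\d+)"
)
FIELDS = ("write", "read", "flush", "corruption", "generation")


def parse_device_errors(line):
    """Return (device, {counter_name: value}), or None if the line has no counters."""
    m = ERRS_RE.search(line)
    if m is None:
        return None
    counters = dict(zip(FIELDS, map(int, m.groups()[1:])))
    return m.group(1), counters


line = (
    "May 29 18:16:17 server kernel: BTRFS info (device dm-3): "
    "bdev /dev/mapper/sdf1 errs: wr 0, rd 0, flush 0, corrupt 4, gen 0"
)
device, counters = parse_device_errors(line)
print(device, counters)
```

Here only the corruption counter is non-zero, so the device completed its reads and writes, but the data it returned did not match the stored checksums.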

Posted (edited)

Scrub is failing on this disk

 

EDIT: Should I try a Filesystem Repair?

 

May 30 15:01:34 server kernel: BTRFS info (device dm-3): scrub: not finished on devid 1 with status: -5
UUID:             e10f68fc-6db1-4421-862e-3832a94ff001
Scrub started:    Thu May 30 15:00:44 2024
Status:           aborted
Duration:         0:00:50
Total to scrub:   615.56GiB
Rate:             502.08MiB/s
Error summary:    verify=1 csum=4
  Corrected:      0
  Uncorrectable:  5
  Unverified:     0
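A small Python sketch for reading the scrub summary above, assuming the `key: value` layout shown there (the function name `parse_scrub_summary` is hypothetical). It shows that all errors found (verify plus csum) ended up as uncorrectable, i.e. nothing was repaired.

```python
import re


def parse_scrub_summary(text):
    """Extract status and error counters from scrub status output."""
    summary = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "key: value" line
        key, value = key.strip(), value.strip()
        if key == "Status":
            summary["status"] = value
        elif key == "Error summary":
            # e.g. "verify=1 csum=4" -> {"verify": 1, "csum": 4}
            summary["errors"] = {k: int(v) for k, v in re.findall(r"(\w+)=(\d+)", value)}
        elif key in ("Corrected", "Uncorrectable", "Unverified"):
            summary[key.lower()] = int(value)
    return summary


sample = """\
Status:           aborted
Duration:         0:00:50
Error summary:    verify=1 csum=4
  Corrected:      0
  Uncorrectable:  5
  Unverified:     0
"""
summary = parse_scrub_summary(sample)
total_errors = sum(summary["errors"].values())
print(summary["status"], total_errors, summary["corrected"], summary["uncorrectable"])
```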

 

Edited by CryPt00n
Posted (edited)
1 hour ago, JorgeB said:

Look in the syslog for the list of corrupt files, delete those or restore from a backup, then re-run the scrub to confirm no more errors.

 

Do I have to enable a debug mode or something similar to see the files? I can only see messages like these in the syslog/dmesg, and I wasn't able to find any more information about this.

May 30 16:10:08 server kernel: BTRFS error (device dm-3): bdev /dev/mapper/sdf1 errs: wr 0, rd 0, flush 0, corrupt 38, gen 0
May 30 16:10:08 server kernel: BTRFS warning (device dm-3): tree block 10833002496 mirror 0 has bad csum, has 0x47f261cc want 0xa493f25a
May 30 16:10:08 server kernel: BTRFS warning (device dm-3): checksum error at logical 10833002496 on dev /dev/mapper/sdf1, physical 10833002496: metadata leaf (level 0) in tree 7
May 30 16:10:08 server kernel: BTRFS warning (device dm-3): checksum error at logical 10833002496 on dev /dev/mapper/sdf1, physical 10833002496: metadata leaf (level 0) in tree 7
May 30 16:10:08 server kernel: BTRFS error (device dm-3): bdev /dev/mapper/sdf1 errs: wr 0, rd 0, flush 0, corrupt 39, gen 0
May 30 16:10:38 server kernel: BTRFS warning (device dm-3): checksum verify failed on logical 10832887808 mirror 1 wanted 0xed101370 found 0x8d79d567 level 0

 

EDIT: For example,

find /mnt/ssd_pool -inum 10832887808

gives me no result

 

 

Edited by CryPt00n
  • Solution
Posted

That suggests the corruption is in the metadata. In that case I would recommend backing up what you can from the pool, reformatting, and restoring the data.
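To make the metadata point concrete: the numbers in the BTRFS warnings are logical byte addresses inside the filesystem, not inode numbers, so `find -inum` is not expected to match them. The scrub warnings quoted earlier describe each bad block as `metadata leaf (level 0) in tree 7`, with no file path. As far as I know, tree 7 is the checksum tree, which fits the `errors found in csum tree` line from the filesystem check. Below is a minimal Python sketch that collects the unique metadata blocks named in such warnings. The regex only covers the wording seen in this thread, and the function name is made up for illustration.

```python
import re

# Based on the warning format quoted in this thread:
#   "checksum error at logical N on dev D, physical P: metadata leaf (level L) in tree T"
CSUM_RE = re.compile(
    r"checksum error at logical (\d+) on dev (\S+), physical (\d+): "
    r"metadata (\w+) \(level (\d+)\) in tree (\d+)"
)


def metadata_checksum_errors(lines):
    """Return sorted unique (logical_address, tree_id) pairs for metadata checksum errors."""
    seen = set()
    for line in lines:
        m = CSUM_RE.search(line)
        if m:
            seen.add((int(m.group(1)), int(m.group(6))))
    return sorted(seen)


log = [
    "May 30 16:10:08 server kernel: BTRFS warning (device dm-3): checksum error at "
    "logical 10833002496 on dev /dev/mapper/sdf1, physical 10833002496: "
    "metadata leaf (level 0) in tree 7",
    "May 30 16:10:08 server kernel: BTRFS warning (device dm-3): checksum error at "
    "logical 10833002496 on dev /dev/mapper/sdf1, physical 10833002496: "
    "metadata leaf (level 0) in tree 7",
]
print(metadata_checksum_errors(log))
```

If every checksum warning in the log matches this pattern, there is no corrupt file to delete, which is why a backup, reformat and restore is the suggested route.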
