Hi Folks,
Over the past few days I've been troubleshooting some BTRFS errors that have appeared in my syslog. Previously, a simple restart would restore my ability to mount the filesystem and run my Docker containers, but as of this afternoon, the filesystem is unmountable.
I've attached diagnostics from this afternoon (the 21st) and yesterday (the 20th). From what I've found searching online, it seems that either a memory module is faulty and has led to corruption, or there is a problem with one of my drives.
My cache pool is a BTRFS RAID 1 of an NVMe drive and an SSD (in my logs, /dev/nvme0n1 and /dev/sdc1, respectively).
Notably, on September 20th (diagnostics attached), I see the following:
Sep 20 00:45:28 Unraid kernel: BTRFS critical (device sdc1): corrupt leaf: root=5 block=1161025175552 slot=143 ino=20996272 file_offset=65536, invalid type for file extent, have 164 expect range [0, 2]
Sep 20 00:45:28 Unraid kernel: BTRFS info (device sdc1): leaf 1161025175552 gen 1029363 total ptrs 197 free space 749 owner 5
Today, I see the following:
Sep 21 04:03:17 Unraid kernel: ------------[ cut here ]------------
Sep 21 04:03:17 Unraid kernel: WARNING: CPU: 16 PID: 34855 at fs/btrfs/extent-tree.c:3061 __btrfs_free_extent+0x466/0xc02
Sep 21 04:03:17 Unraid kernel: Modules linked in: veth wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libchacha xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle vhost_net tun vhost vhost_iotlb xt_nat xt_tcpudp xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype br_netfilter nvidia_uvm(PO) xfs nfsd auth_rpcgss oid_registry lockd grace sunrpc md_mod zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) ip6table_filter ip6_tables iptable_filter ip_tables x_tables macvtap macvlan tap bridge 8021q garp mrp stp llc ixgbe xfrm_algo mdio e1000e nvidia_drm(PO) nvidia_modeset(PO) intel_rapl_msr intel_rapl_common iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nvidia(PO) crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 video aesni_intel
Sep 21 04:03:17 Unraid kernel: drm_kms_helper crypto_simd ipmi_ssif cryptd drm rapl isci i2c_i801 intel_cstate nvme backlight mei_me i2c_smbus ahci cp210x apex(O) libsas syscopyarea input_leds sysfillrect sysimgblt i2c_core joydev led_class usbserial mei intel_uncore nvme_core scsi_transport_sas libahci gasket(O) fb_sys_fops acpi_ipmi wmi ipmi_si button unix [last unloaded: xfrm_algo]
Sep 21 04:03:17 Unraid kernel: CPU: 16 PID: 34855 Comm: kworker/u82:5 Tainted: P W O 6.1.49-Unraid #1
Sep 21 04:03:17 Unraid kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./EP2C602-4L/D16, BIOS P1.90 04/11/2018
Sep 21 04:03:17 Unraid kernel: Workqueue: events_unbound btrfs_preempt_reclaim_metadata_space
Sep 21 04:03:17 Unraid kernel: RIP: 0010:__btrfs_free_extent+0x466/0xc02
Sep 21 04:03:17 Unraid kernel: Code: 41 b4 01 45 89 e0 89 d9 ba f0 0b 00 00 41 83 e0 01 e9 93 00 00 00 31 db 45 8b 75 40 e9 dc 00 00 00 83 f8 fe 0f 85 93 00 00 00 <0f> 0b 49 8b 7d 00 e8 48 45 00 00 48 c7 c6 c4 8c 10 82 41 56 4c 8b
Sep 21 04:03:17 Unraid kernel: RSP: 0018:ffffc90038d57bd8 EFLAGS: 00010246
Sep 21 04:03:17 Unraid kernel: RAX: 00000000fffffffe RBX: 00000000fffffffe RCX: 0000000000000000
Sep 21 04:03:17 Unraid kernel: RDX: 0000000000000000 RSI: ffff88943ed4434b RDI: ffffc90038d57b80
Sep 21 04:03:17 Unraid kernel: RBP: ffff88a0c7a97000 R08: 0000000000001000 R09: 0000160000000000
Sep 21 04:03:17 Unraid kernel: R10: ffff888000000000 R11: ffff88a50da34c70 R12: 0000000000000000
Sep 21 04:03:17 Unraid kernel: R13: ffff88a0afe7d620 R14: 000000012cfaf000 R15: 0000000000000361
Sep 21 04:03:17 Unraid kernel: FS: 0000000000000000(0000) GS:ffff88bfff980000(0000) knlGS:0000000000000000
Sep 21 04:03:17 Unraid kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 21 04:03:17 Unraid kernel: CR2: 000000c000422000 CR3: 000000000220a006 CR4: 00000000001726e0
Sep 21 04:03:17 Unraid kernel: Call Trace:
Sep 21 04:03:17 Unraid kernel: <TASK>
Sep 21 04:03:17 Unraid kernel: ? __warn+0xab/0x122
Sep 21 04:03:17 Unraid kernel: ? report_bug+0x109/0x17e
Sep 21 04:03:17 Unraid kernel: ? __btrfs_free_extent+0x466/0xc02
Sep 21 04:03:17 Unraid kernel: ? handle_bug+0x41/0x6f
Sep 21 04:03:17 Unraid kernel: ? exc_invalid_op+0x13/0x60
Sep 21 04:03:17 Unraid kernel: ? asm_exc_invalid_op+0x16/0x20
Sep 21 04:03:17 Unraid kernel: ? __btrfs_free_extent+0x466/0xc02
Sep 21 04:03:17 Unraid kernel: ? sysfs_slab_add+0x14d/0x1c8
Sep 21 04:03:17 Unraid kernel: __btrfs_run_delayed_refs+0x698/0xbe2
Sep 21 04:03:17 Unraid kernel: btrfs_run_delayed_refs+0x65/0x146
Sep 21 04:03:17 Unraid kernel: flush_space+0x37e/0x491
Sep 21 04:03:17 Unraid kernel: ? newidle_balance+0x289/0x30a
Sep 21 04:03:17 Unraid kernel: ? _raw_spin_lock+0x13/0x1c
Sep 21 04:03:17 Unraid kernel: ? _raw_spin_unlock+0x14/0x29
Sep 21 04:03:17 Unraid kernel: ? btrfs_bg_type_to_factor+0xa/0x1c
Sep 21 04:03:17 Unraid kernel: ? calc_available_free_space.isra.0+0x33/0x59
Sep 21 04:03:17 Unraid kernel: btrfs_preempt_reclaim_metadata_space+0xe8/0x172
Sep 21 04:03:17 Unraid kernel: process_one_work+0x1ab/0x295
Sep 21 04:03:17 Unraid kernel: worker_thread+0x18b/0x244
Sep 21 04:03:17 Unraid kernel: ? rescuer_thread+0x281/0x281
Sep 21 04:03:17 Unraid kernel: kthread+0xe7/0xef
Sep 21 04:03:17 Unraid kernel: ? kthread_complete_and_exit+0x1b/0x1b
Sep 21 04:03:17 Unraid kernel: ret_from_fork+0x22/0x30
Sep 21 04:03:17 Unraid kernel: </TASK>
Sep 21 04:03:17 Unraid kernel: ---[ end trace 0000000000000000 ]---
Sep 21 04:03:17 Unraid kernel: BTRFS info (device sdc1): leaf 1958068224 gen 1031631 total ptrs 179 free space 2321 owner 2
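In case it helps anyone going through the attached syslogs, here is a rough sketch of how the BTRFS lines above could be pulled out and counted per device and severity. The regex and the names `parse_btrfs_lines` and `count_by_level` are my own assumptions based on the line format shown above; they are not part of Unraid or btrfs-progs, and other log formats would likely need a different pattern.

```python
import re
from collections import Counter

# Matches syslog lines such as:
#   Sep 20 00:45:28 Unraid kernel: BTRFS critical (device sdc1): corrupt leaf: ...
# The pattern is an assumption derived from the lines quoted in this post.
BTRFS_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s[\d:]{8})\s+\S+\s+kernel:\s+"
    r"BTRFS\s+(?P<level>\w+)\s+\(device\s+(?P<device>[^)]+)\):\s+(?P<msg>.*)$"
)


def parse_btrfs_lines(lines):
    """Return one dict (stamp, level, device, msg) per matching BTRFS line."""
    events = []
    for line in lines:
        match = BTRFS_LINE.match(line.strip())
        if match:
            events.append(match.groupdict())
    return events


def count_by_level(events):
    """Count events per (device, level) pair."""
    return Counter((e["device"], e["level"]) for e in events)


# Sample input copied from the log excerpts above; the WARNING line has no
# "BTRFS <level> (device ...)" prefix, so it is skipped by the pattern.
SAMPLE = [
    "Sep 20 00:45:28 Unraid kernel: BTRFS critical (device sdc1): corrupt leaf: "
    "root=5 block=1161025175552 slot=143 ino=20996272 file_offset=65536, "
    "invalid type for file extent, have 164 expect range [0, 2]",
    "Sep 20 00:45:28 Unraid kernel: BTRFS info (device sdc1): leaf 1161025175552 "
    "gen 1029363 total ptrs 197 free space 749 owner 5",
    "Sep 21 04:03:17 Unraid kernel: WARNING: CPU: 16 PID: 34855 at "
    "fs/btrfs/extent-tree.c:3061 __btrfs_free_extent+0x466/0xc02",
]

if __name__ == "__main__":
    for (device, level), count in sorted(count_by_level(parse_btrfs_lines(SAMPLE)).items()):
        print(f"{device} {level}: {count}")
```

To run it against a real file, the `SAMPLE` list could be replaced with the lines of the syslog from the diagnostics zip, for example by opening the file and passing the file object to `parse_btrfs_lines`.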
I've run extended SMART tests on both /dev/sdc and /dev/nvme0n1, and both passed.
Any guidance on what I can look at next? The cache is currently unmountable -- do I have to format it again and restore from backups (appdata for Docker containers, VMs), or is there a way to mount the filesystem from just a single device (e.g., the NVMe drive)? If I do have to rebuild, I'm concerned that I haven't found the root cause and will be back in this situation shortly.
TIA for any thoughts
unraid-diagnostics-20230921-1640.zip unraid-diagnostics-20230920-0837.zip Samsung_SSD_970_EVO_500GB_S466NX0K904535J-20230921-1905.txt Crucial_CT512MX100SSD1_14240D1D6D98-20230921-1907.txt