Help! Not sure where to start troubleshooting BTRFS error - XFS (md2): Metadata corruption detected at xfs_dinode

mc_866 · July 25, 2020

I'm not sure what to focus on next so I'm looking for some help.

I started having issues over a month ago on a Ryzen 1600 system with B450 board. My cache drive kept filling up and for some reason, I'm not certain it was related the server would hard lock randomly. I attempted to do at least 3 parity checks after those lock ups but it would see to crash before the 2 day check could complete on my 10TB parity.

So I decided to change up some hardware thinking that may be the issue. My previous cache drive was a 1TB nvme, I changed that out for a 2TB nvme drive. After making that change I continued to have the hard locks and unclean shutdowns with the Ryzen setup. I did a memtest per a recommendation I had in another post I made and that came back clean after 4 passes so I thought it may have something to do with the Ryzen platform. I then decided to do a motherboard swap and get back to something in the Intel camp and something a bit more "enterprise" so I grabbed a supermicro Petium-D board with integrated 10Gbe and 32GB of ECC. I also grabbed a couple new 12 TB drives to expand my storage.

I installed the new board and memory without much issue. I had some trouble with the static IP but I think I'm past that but I do want to revisit the bonding settings because I now have 2 10Gbe ports and 2 1G ports.

I added one of the new drives as a parity and that add/rebuild went great. Because my old cache drive was filling up oddly with my old setup the next step I took was to delete my docker image and reinstall the dockers I was using. I was able to complete that yesterday. Once that was done I used my old parity drive to replace a 3TB drive in my array. That rebuild is happening now.

During that rebuild I started to see a good amount of errors in the log and also noticed some of my dockers weren't working correctly. I also noticed that the cache drive seems to show write only.

Lots to unpack here but the errors that I've been seeing in the log look like they point to the cache drive and maybe disk 2.

Unraid kernel: BTRFS error (device loop2): bad tree block start, want 6996787200 have 0

Also

Unraid kernel: XFS (md2): Metadata corruption detected at xfs_dinode_verify+0xa5/0x52e [xfs], inode 0x10d8ed11d dinode

I've attached the full diagnostics!

Edited December 11, 2022 by mc_866

trurl · July 25, 2020

Corruption on disk2 and cache and docker image.

Go to Settings - Docker, disable dockers then delete the docker image from that same page. Leave dockers disabled until you get your other problems fixed.

Then let disk1 rebuild complete and post new diagnostics.

Do you have any VMs?

mc_866 · July 25, 2020

Thanks!

I did have one VM back on the Ryzen system but deleted it when I first started trouble shooting issues on that system. I believe it was my VM that was overrunning my cache drive. I thought I had set it to only use a single disk and not cache but that wasn't the case. So no VM's presently.

mc_866 · July 26, 2020

Cache drive now says Unmountable: No file system

JorgeB · July 27, 2020

There are checksum errors on cache:

Jul 26 15:41:37 Unraid kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 3, gen 0

This suggests a hardware issue, you'll need to delete the corrupt files are restore form backups, scrub will identify them all on the syslog.

Also your CPU is overheating, check cooling.

mc_866 · July 27, 2020

I did a reformat on the cache nvme and added some active cooling for the CPU. The new proc was passively cooled.

I then restored my dockers and everything is back up and running. So far so good now for the past ~12 hours.

mc_866 · July 29, 2020

After adding back my containers I also added my second new drive.

I was adding one to parity and then the old parity to the array. This was scheduled to take 2+ days because it was a 12Tb drive.

I was roughly ~6 hours from being done last night and the system froze again. Unfortunately I don't think I caught it on the logs. Here is the latest diagnostic from after my hard restart last night.

I've already changed the mobo, proc and memory thinking it was all tied to the AMD platform but issues persist. Could it be my USB drive? Any other items to consider?

Actually as I type this and look to add the diagnostics zips I realized I can't reach the webui and the shares seem to be down so I crashed AGAIN!

JorgeB · July 29, 2020

1 minute ago, mc_866 said:

AMD platform but issues persist

Did you see this?

trurl · July 29, 2020

3 minutes ago, mc_866 said:

system froze again. Unfortunately I don't think I caught it on the logs.

mc_866 · July 29, 2020

17 minutes ago, johnnie.black said:

Did you see this?

Yes I did thank you. I moved off the Ryzen build entirely to an Intel based Pentium D which is basically a embedded Xeon on a Supermicro board.

So no longer running AMD.

But yes had read through that thread and up until a few weeks ago had been fairly stable for some months.

Edited July 29, 2020 by mc_866

mc_866 · July 29, 2020

8 minutes ago, trurl said:

Thanks, I did set that up as mirror to flash. Trouble has been when I copy from flash using MC in terminal the zip is read only. I think I was able to grab a couple logs and will post them as soon as I can get the shares back online. Right now with a directly connected monitor and keyboard I typed the command diagnostics in. Hoping that will complete but I can't presently access the web gui or shares.

JorgeB · July 29, 2020

5 minutes ago, mc_866 said:

So no longer running AMD.

Sorry, misread the post.

mc_866 · July 29, 2020

5 minutes ago, johnnie.black said:

Sorry, misread the post.

No worries, my build has been all over the place lately as I replace things that I think are causing the issue.

Right now I'm at the spot where the only things I haven't changed are the controller card and the USB drive.

That's why I'm thinking something here has to be a software issue because the trouble persists with new hardware.

JorgeB · July 29, 2020

The symptoms look more like a hardware problem, but you can try running in safe mode without any dockers/VMs for a couple of days to rule out most other issues, if it's not stable as a basic NAS it's most likely hardware problem.

mc_866 · July 29, 2020

I don't disagree that it looks like hardware. Just not sure what else to test to prove out hardware vs software issue.

Also I just visited the KB/M console and it appears that the machine is still running and I'm attempting to pull diagnostics. Still can't access GUI or shares.

Edited July 29, 2020 by mc_866

trurl · July 30, 2020

23 hours ago, johnnie.black said:

try running in safe mode without any dockers/VMs for a couple of days

Doesn't look like you did this

mc_866 · July 30, 2020

I'm allowing the rebuild to go right now, all dockers stopped.

Presently 1 day 3 hr runtime, 6 hours left for rebuild. I don't want to jinx it but it seems like this is the window when it has been locking up.

If it runs for a couple more days with stability without the dockers, does it point to a docker issue?

trurl · July 30, 2020

There were lots of these in your previous logs:

Jul 29 10:34:14 Unraid kernel: WARNING: CPU: 0 PID: 24751 at mm/truncate.c:449 truncate_inode_pages_range+0x57e/0x644
Jul 29 10:34:14 Unraid kernel: Modules linked in: macvlan xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat nf_nat_ipv6 iptable_mangle ip6table_filter ip6_tables xt_nat ipt_MASQUERADE iptable_filter iptable_nat nf_nat_ipv4 nf_nat ip_tables xfs md_mod bonding ixgbe(O) igb(O) sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp ipmi_ssif crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd mxm_wmi glue_helper intel_cstate intel_uncore intel_rapl_perf mpt3sas i2c_i801 i2c_core ahci libahci nvme raid_class scsi_transport_sas button acpi_pad pcc_cpufreq nvme_core wmi ipmi_si [last unloaded: tun]
Jul 29 10:34:14 Unraid kernel: CPU: 0 PID: 24751 Comm: shfs Tainted: G        W  O      4.19.107-Unraid #1
Jul 29 10:34:14 Unraid kernel: Hardware name: Supermicro Super Server/X10SDV-2C-TP4F, BIOS 2.1 11/08/2019
Jul 29 10:34:14 Unraid kernel: RIP: 0010:truncate_inode_pages_range+0x57e/0x644

One google search result among others:

mc_866 · July 30, 2020

Thanks for sharing!

It's like chasing a ghost. Stability has been good with no dockers.

Also wondering if PSU would be a factor. It's connected to UPS. Would that be an issue?

JorgeB · July 31, 2020

Very clean.

mc_866 · August 2, 2020

1 hour ago, mc_866 said:

Uptime 4 days and 5 hours. Just finished normal parity check.

Going to try enabling a docker or two.

Here is the log before doing so. Still looks clean.

unraid-diagnostics-20200802-1603.zip 179.72 kB · 0 downloads

OK turned on my Unifi-controller docker and within an hour I got one of the CPU tainted errors with call trace. This was the only docker running.

This docker is running with a defined static IP. Is that an issue. I thought this controller app would be very low risk but appears it may be causing my issues.

Should I not run this with a static IP?

Aug 2 16:30:08 Unraid kernel: WARNING: CPU: 2 PID: 0 at net/netfilter/nf_conntrack_core.c:945 __nf_conntrack_confirm+0xa0/0x69e Aug 2 16:30:08 Unraid kernel: Modules linked in: xt_nat macvlan xt_CHECKSUM ipt_REJECT ip6table_mangle ip6table_nat nf_nat_ipv6 iptable_mangle ip6table_filter ip6_tables vhost_net tun vhost tap ipt_MASQUERADE iptable_filter iptable_nat nf_nat_ipv4 nf_nat ip_tables xfs md_mod bonding ixgbe(O) igb(O) sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm ipmi_ssif crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper mxm_wmi intel_cstate intel_uncore intel_rapl_perf mpt3sas i2c_i801 i2c_core nvme raid_class ahci libahci scsi_transport_sas nvme_core pcc_cpufreq wmi ipmi_si acpi_pad button [last unloaded: ixgbe] Aug 2 16:30:08 Unraid kernel: CPU: 2 PID: 0 Comm: swapper/2 Tainted: G O 4.19.107-Unraid #1 Aug 2 16:30:08 Unraid kernel: Hardware name: Supermicro Super Server/X10SDV-2C-TP4F, BIOS 2.1 11/08/2019 Aug 2 16:30:08 Unraid kernel: RIP: 0010:__nf_conntrack_confirm+0xa0/0x69e Aug 2 16:30:08 Unraid kernel: Code: 04 e8 56 fb ff ff 44 89 f2 44 89 ff 89 c6 41 89 c4 e8 7f f9 ff ff 48 8b 4c 24 08 84 c0 75 af 48 8b 85 80 00 00 00 a8 08 74 26 <0f> 0b 44 89 e6 44 89 ff 45 31 f6 e8 95 f1 ff ff be 00 02 00 00 48 Aug 2 16:30:08 Unraid kernel: RSP: 0018:ffff88885fb038d0 EFLAGS: 00010202 Aug 2 16:30:08 Unraid kernel: RAX: 0000000000000188 RBX: ffff888219cd2600 RCX: ffff8888064a31d8 Aug 2 16:30:08 Unraid kernel: RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffffffff81e090bc Aug 2 16:30:08 Unraid kernel: RBP: ffff8888064a3180 R08: 00000000b609fd86 R09: ffffffff81c8aa80 Aug 2 16:30:08 Unraid kernel: R10: 0000000000000158 R11: ffffffff81e91080 R12: 0000000000000cdc Aug 2 16:30:08 Unraid kernel: R13: ffffffff81e91080 R14: 0000000000000000 R15: 000000000000e2af Aug 2 16:30:08 Unraid kernel: FS: 0000000000000000(0000) GS:ffff88885fb00000(0000) knlGS:0000000000000000 Aug 2 16:30:08 Unraid kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Aug 2 16:30:08 Unraid kernel: CR2: 000000000053ce00 CR3: 0000000001e0a001 CR4: 00000000003606e0 Aug 2 16:30:08 Unraid kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 Aug 2 16:30:08 Unraid kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Aug 2 16:30:08 Unraid kernel: Call Trace: Aug 2 16:30:08 Unraid kernel: <IRQ> Aug 2 16:30:08 Unraid kernel: ipv4_confirm+0xaf/0xb9 Aug 2 16:30:08 Unraid kernel: nf_hook_slow+0x3a/0x90 Aug 2 16:30:08 Unraid kernel: ip_local_deliver+0xad/0xdc Aug 2 16:30:08 Unraid kernel: ? ip_sublist_rcv_finish+0x54/0x54 Aug 2 16:30:08 Unraid kernel: ip_sabotage_in+0x38/0x3e Aug 2 16:30:08 Unraid kernel: nf_hook_slow+0x3a/0x90 Aug 2 16:30:08 Unraid kernel: ip_rcv+0x8e/0xbe Aug 2 16:30:08 Unraid kernel: ? ip_rcv_finish_core.isra.0+0x2e1/0x2e1 Aug 2 16:30:08 Unraid kernel: __netif_receive_skb_one_core+0x53/0x6f Aug 2 16:30:08 Unraid kernel: netif_receive_skb_internal+0x79/0x94 Aug 2 16:30:08 Unraid kernel: br_pass_frame_up+0x128/0x14a Aug 2 16:30:08 Unraid kernel: ? br_port_flags_change+0x29/0x29 Aug 2 16:30:08 Unraid kernel: br_handle_frame_finish+0x342/0x383 Aug 2 16:30:08 Unraid kernel: ? br_pass_frame_up+0x14a/0x14a Aug 2 16:30:08 Unraid kernel: br_nf_hook_thresh+0xa3/0xc3 Aug 2 16:30:08 Unraid kernel: ? br_pass_frame_up+0x14a/0x14a Aug 2 16:30:08 Unraid kernel: br_nf_pre_routing_finish+0x24a/0x271 Aug 2 16:30:08 Unraid kernel: ? br_pass_frame_up+0x14a/0x14a Aug 2 16:30:08 Unraid kernel: ? br_handle_local_finish+0xe/0xe Aug 2 16:30:08 Unraid kernel: ? nf_nat_ipv4_in+0x1e/0x62 [nf_nat_ipv4] Aug 2 16:30:08 Unraid kernel: ? br_handle_local_finish+0xe/0xe Aug 2 16:30:08 Unraid kernel: br_nf_pre_routing+0x31c/0x343 Aug 2 16:30:08 Unraid kernel: ? br_nf_forward_ip+0x362/0x362 Aug 2 16:30:08 Unraid kernel: nf_hook_slow+0x3a/0x90 Aug 2 16:30:08 Unraid kernel: br_handle_frame+0x27e/0x2bd Aug 2 16:30:08 Unraid kernel: ? br_pass_frame_up+0x14a/0x14a Aug 2 16:30:08 Unraid kernel: __netif_receive_skb_core+0x4a7/0x7b1 Aug 2 16:30:08 Unraid kernel: ? udp_gro_receive+0x4b/0x136 Aug 2 16:30:08 Unraid kernel: __netif_receive_skb_one_core+0x35/0x6f Aug 2 16:30:08 Unraid kernel: netif_receive_skb_internal+0x79/0x94 Aug 2 16:30:08 Unraid kernel: napi_gro_receive+0x44/0x7b Aug 2 16:30:08 Unraid kernel: ixgbe_poll+0xb97/0xce4 [ixgbe] Aug 2 16:30:08 Unraid kernel: net_rx_action+0x107/0x26c Aug 2 16:30:08 Unraid kernel: __do_softirq+0xc9/0x1d7 Aug 2 16:30:08 Unraid kernel: irq_exit+0x5e/0x9d Aug 2 16:30:08 Unraid kernel: do_IRQ+0xb2/0xd0 Aug 2 16:30:08 Unraid kernel: common_interrupt+0xf/0xf Aug 2 16:30:08 Unraid kernel: </IRQ> Aug 2 16:30:08 Unraid kernel: RIP: 0010:cpuidle_enter_state+0xe8/0x141 Aug 2 16:30:08 Unraid kernel: Code: ff 45 84 f6 74 1d 9c 58 0f 1f 44 00 00 0f ba e0 09 73 09 0f 0b fa 66 0f 1f 44 00 00 31 ff e8 7a 8d bb ff fb 66 0f 1f 44 00 00 <48> 2b 2c 24 b8 ff ff ff 7f 48 b9 ff ff ff ff f3 01 00 00 48 39 cd Aug 2 16:30:08 Unraid kernel: RSP: 0018:ffffc900031d3e98 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffd9 Aug 2 16:30:08 Unraid kernel: RAX: ffff88885fb1fac0 RBX: ffff88885fb2a200 RCX: 000000000000001f Aug 2 16:30:08 Unraid kernel: RDX: 0000000000000000 RSI: 000000003a2e90d6 RDI: 0000000000000000 Aug 2 16:30:08 Unraid kernel: RBP: 00014d89d0db269e R08: 00014d89d0db269e R09: 0000000000000354 Aug 2 16:30:08 Unraid kernel: R10: 0000000000376ec0 R11: 071c71c71c71c71c R12: 0000000000000003 Aug 2 16:30:08 Unraid kernel: R13: ffffffff81e5b120 R14: 0000000000000000 R15: ffffffff81e5b258 Aug 2 16:30:08 Unraid kernel: ? cpuidle_enter_state+0xbf/0x141 Aug 2 16:30:08 Unraid kernel: do_idle+0x17e/0x1fc Aug 2 16:30:08 Unraid kernel: cpu_startup_entry+0x6a/0x6c Aug 2 16:30:08 Unraid kernel: start_secondary+0x197/0x1b2 Aug 2 16:30:08 Unraid kernel: secondary_startup_64+0xa4/0xb0 Aug 2 16:30:08 Unraid kernel: ---[ end trace 35a5cd2fd10ccce9 ]--- Aug 2 17:01:01 Unraid kernel: device br0 left promiscuous mode

Edited August 2, 2020 by mc_866

Squid · August 2, 2020

macvlan traces I *believe* have been minimized with 6.9 beta25+

Also, since you mentioned "tainted", that word means something not quite as ominous as it sounds. Basically, it means that you're not running a stock linux kernel (ie: you're running unRaid).

mc_866 · August 2, 2020

2 hours ago, Squid said:

macvlan traces I *believe* have been minimized with 6.9 beta25+

Also, since you mentioned "tainted", that word means something not quite as ominous as it sounds. Basically, it means that you're not running a stock linux kernel (ie: you're running unRaid).

Nothing to be concerned with for those irq traces then?

I'm working to try and isolate if I can what's causing my lock ups.

From my most recent effort it seems to be tied to dockers.

Edited August 3, 2020 by mc_866

mc_866 · August 3, 2020

Turned on my plex docker. BTRFS errors re-appeared right away.

(regular) error at logical 224809541632 on dev /dev/nvme0n1p1 Aug 3 15:01:39 Unraid kernel: BTRFS warning (device nvme0n1p1): checksum error at logical 224809545728 on dev /dev/nvme0n1p1, physical 202227412992, root 5, inode 1290252, offset 19296256, length 4096, links 1 Aug 3 15:01:39 Unraid kernel: BTRFS error (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 0, rd 0, flush 0, corrupt 6, gen 0 Aug 3 15:01:39 Unraid kernel: BTRFS error (device nvme0n1p1): unable to fixup (regular) error at logical 224809545728 on dev /dev/nvme0n1p1

Turn off plex, no errors found.

Unfortunately during covid Plex is a tier 1 app so the family depends on it.

Edited August 3, 2020 by mc_866

mc_866 · August 5, 2020

Made it 5 days of uptime.

Hard lock last night. Left the house and when I returned the server was unresponsive, didn't even have any code on the screen.

I don't think I was able to recover the crash logs cause the system was locked.

Not sure where to go from here. Any guidance?

Help! Not sure where to start troubleshooting BTRFS error - XFS (md2): Metadata corruption detected at xfs_dinode

Recommended Posts

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Link to comment

Join the conversation