(SOLVED) Filesystem failures when system is under load


Laptop765
Solved by John_M

Hello, I've been debugging my machine after going through dozens of forum posts and other research, and I'm at a complete loss. Whenever my server is under heavy load something goes horribly wrong and the filesystem becomes corrupt. I finally found a way to reproduce it consistently just by starting a specific VM. As soon as I do, BTRFS errors start flying and the disk goes missing from Unraid. Attached are two sets of diagnostics from the same boot: one from before starting the VM and one from after the errors begin. Stopping and starting the array followed by a BTRFS scrub fixes things up no problem, but stopping the array first is a requirement.
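For reference, my recovery cycle looks roughly like this from the command line (a sketch only: the mountpoint `/mnt/cache` is an assumption based on my layout, and the array stop/start itself happens in the Unraid GUI):

```shell
# After stopping and restarting the array in the Unraid GUI,
# run a scrub against the cache pool (adjust the path to your layout)
btrfs scrub start -B /mnt/cache   # -B blocks until the scrub completes
btrfs scrub status /mnt/cache     # summary of errors found/corrected
btrfs device stats /mnt/cache     # per-device cumulative error counters
```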

 

At one point I had the same problem on the disk array with XFS and noticed that if I manually mounted /dev/sdX1 the files and filesystem were fine but mounting the corresponding /dev/mdY showed a corrupt filesystem that couldn't be repaired.
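A sketch of that side-by-side check, in case anyone wants to repeat it (X and Y are placeholders for the real device numbers; the `-n` flag keeps `xfs_repair` strictly read-only):

```shell
# Mount the raw partition read-only and eyeball the files
mount -o ro /dev/sdX1 /mnt/test
ls /mnt/test
umount /mnt/test

# Dry-run repair on both views of the same disk; -n reports
# problems without changing anything on disk
xfs_repair -n /dev/sdX1
xfs_repair -n /dev/mdY
```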

 

Main Hardware:

  • ASRock B450 Pro4 Motherboard
  • Ryzen 9 3950X CPU
  • 4x Corsair Vengeance LPX 16GB DDR4 RAM
  • 2x Samsung 970 Evo Plus 2TB
  • 3x Seagate IronWolf NAS 8TB
  • 1x GeForce 1080Ti
  • Corsair CX500 PSU

 

Things I've tried:

  • Replacing both cache drives (moved from SATA -> M.2)
  • Replacing SATA cables
  • Swapping SATA ports around on the motherboard and eventually plugging them all into PCIe cards
  • Memtest86+ showed one failure but after reseating all the RAM sticks it went away
  • Turning IOMMU off, but that prevents me from starting the VM to reproduce the issue; I read that with B450 and the Ryzen 3950X this could be an issue, and trying various combinations of iommu={pt,soft} and pci=noats didn't make a difference
  • Booting the VM without any passthrough (previously a GPU was assigned)
  • Turning PCIe ACS override on and off
  • Unassigning all devices from VFIO
  • Upgrading to BIOS 5.00
  • Upgrading to Unraid 6.10.3

And apologies if I'm leaving out some other things I've tried, I've lost track of it all at this point.

 

I don't mind buying replacement hardware if needed but I've already replaced a bunch and would really like to narrow down what it is before spending more money.

 

Thank you so much in advance!

 

EDIT: I just noticed Unraid 6.11.0 and gave it a try with the same results. Attaching a second before/after set of diagnostics.

jarvis-diagnostics-20220925-1612.zip jarvis-diagnostics-20220925-1614.zip

jarvis-diagnostics-20220925-1711.zip jarvis-diagnostics-20220925-1710.zip

3 hours ago, Laptop765 said:

Memtest86+ showed one failure but after reseating all the RAM sticks it went away

 

I'd go back and re-test the RAM. It isn't clear from your post whether your problem started before or after re-seating the RAM but unless you're sure it's good you're wasting your time doing anything else. The RAM modules themselves seem to be a slightly odd choice. Being DDR4-3600 they're not the best match for your CPU but you're clocking them at 2133 MT/s so at least that's within spec. You should be able to run them at 2666 MT/s - have you reduced the speed to see if it would fix the problem?
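If it helps, the rated versus currently configured speed of each module can be read without rebooting into the BIOS (needs root, and the exact field names vary between dmidecode versions):

```shell
# Shows the rated "Speed" and the "Configured Memory Speed" (older
# dmidecode versions call it "Configured Clock Speed") for each DIMM
dmidecode -t memory | grep -iE 'part number|speed'
```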

 

ASRock's website has been down for a couple of days now so I can't check on the specifics of your motherboard, but is the B450 Pro 4's CPU power delivery really adequate for the 3950X? What's special about the VM you use to reproduce the problem? Note that there are pre-existing errors on your NVMe cache before you start the VM:

 

Sep 25 17:09:48 Jarvis kernel: BTRFS info (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 38166417, rd 30944114, flush 3780, corrupt 102327, gen 246

 

I can't tell whether subsequent errors reported in the log are "new" or simply manifestations of the pre-existing errors.

 


I used to have a similar experience with my Ryzen 3600 CPU on a Gigabyte B450M DS3H motherboard running VMware ESXi. Basically I could run 3 VM guests fine, but as soon as I enabled another pair of cores the ESXi host would grind to a halt and often just crash, requiring a hard power-off. I think I saw a video on LTT that explained how the cache works on Zen 2. I've since given that system to my son to game on, so he's happy, and I upgraded. I never even tried Unraid on it, as I had already made up my mind that Zen 2 was not good for virtual machines.

 

I'm going to put that down to the Zen2 L3 cache system having to traverse the Infinity Fabric caching system and taking 'forever' to access the cache.

 

Again, just a guess. Keep it in mind, and maybe try to 'pin' the CPUs to specific Docker containers/VMs.


Thanks to both of you for the replies!

 

@John_M I could reproduce the failure before re-seating the RAM and also after. I'm not sure I fully follow the RAM/CPU mismatch; I haven't manually changed anything in the BIOS with respect to clocks on either the CPU or the RAM, so it's running at whatever the default is. Are you saying that the RAM is either too fast or too slow? RAM is one of the things I never quite got the hang of, so I try not to mess with it.

 

I'm a bit confused by your wording. I'm running at 2133MT/s and I should be able to run at 2666MT/s but you're suggesting I try lower like <=2000MT/s?

 

I'm also not sure about CPU power delivery - are you suggesting that the motherboard might not be up to spec for the CPU I have?

 

From my understanding reading through these forums, BTRFS reports all cumulative failures as INFO on boot until they're manually reset. So the system boots in a functional state and running a BTRFS check returns a clean filesystem. Everything seems fine until I boot the VM. As far as I know there's nothing special about it, it's just a Win11 VM with ~12GB of RAM and ~8 reserved cores (I don't remember exact numbers off the top of my head). It's also possible it's not the VM and just anything that puts the system under a lot of load at once; the VM is just how I know to reproduce it reliably.

 

Just to clarify the ...1710.zip and ...1711.zip are before and after booting the VM on a boot with a clean Memtest86+ run on all 4 sticks and with a clean BTRFS check before reproducing the issue.

 

@DarthKegRaider Good data points to have, thanks. I do have my VM cores reserved and pinned but not for anything else on the system. Why would taking 'forever' to execute something cause filesystem corruption instead of just being slow?

 


I ran a few more tests today but none of them revealed a solution:

  • Took the video card out in case there wasn't enough power going to the system
  • Bumped RAM speed down to 1866
  • Selected RAM profile 3200
  • Set all RAM back to Auto and turned off C-States and set Power Supply Idle Control to Typical Current Idle
  • Downloaded a newer version of Memtest86+ (6.00b3) and everything passed

Doing some more reading it sounds like maybe it's the B450 being problematic and I'm considering getting an X570.


Hey @Laptop765,

 

Looking at your 20220925-1711 zipped log file, I can see in syslog.txt that the BTRFS errors are plentiful for your NVMe device.

device: (nvme0n1) Samsung_SSD_970_EVO_Plus_2TB_S6S2NS0T630395J
device: (nvme1n1) Samsung_SSD_970_EVO_Plus_2TB_S6S2NS0T630373E

17:08:02 - Your system starts spitting out a few SIGTERM issues around the 'wsdd2' service. I compared against my own dual-Xeon Unraid machine and I get the same SIGTERM messages with wsdd2, so let's ignore that one.

 

At 17:10:52, BTRFS starts complaining about a duplicate device (nvme2n2p1) then gets disabled.

 

I'd be more inclined to think that perhaps the NVME port on your MB associated with your first nvme drive (S6S2NS0T630395J) might not be up to the task.

 

https://www.umart.com.au/product/asrock-b450-pro4-atx-am4-motherboard-45293

The only data I could find on the motherboard (ASRock's site won't load for me here) indicates that one of the M.2 slots only operates at Gen3 x2 speeds (16 Gb/s) whereas the primary is Gen3 x4 (32 Gb/s). The mismatch in speeds may be the reason.

 

You could try a PCIe-to-NVMe adaptor, but personally I'd look at a completely new motherboard.

On 9/26/2022 at 6:23 AM, DarthKegRaider said:

I'm going to put that down to the Zen2 L3 cache system having to traverse the Infinity Fabric caching system and taking 'forever' to access the cache.

 

The L3 cache is on the same chiplet as the L2 cache, L1 cache and CPU cores. The Infinity Fabric is not involved.

  • Solution
On 9/26/2022 at 7:20 AM, Laptop765 said:

Are you saying that the RAM is either too fast or too slow?

 

The reason I said the RAM was a slightly odd choice is that it's specced at DDR4-3600 but the maximum your CPU can run it at is 3200 MT/s. That figure is derated further by your particular configuration - you have two DIMMs per channel and they are dual-rank. Because that's a lot of physical chips connected across the bus, the recommended maximum speed for a 3000-series CPU is 2666 MT/s. So you might have paid more for faster RAM when slightly slower RAM would suffice.

 

On 9/26/2022 at 7:20 AM, Laptop765 said:

I'm a bit confused by your wording. I'm running at 2133MT/s and I should be able to run at 2666MT/s but you're suggesting I try lower like <=2000MT/s?

 

2133 MT/s is fine. I was asking why you weren't running it at 2666 MT/s and if the reason is that you had decided to slow it down in an attempt to avoid the errors.

 

On 9/26/2022 at 7:20 AM, Laptop765 said:

Just to clarify the ...1710.zip and ...1711.zip are before and after booting the VM on a boot with a clean Memtest86+ run on all 4 sticks and with a clean BTRFS check before reproducing the issue.

 

The BTRFS errors I pointed out are real errors - hardware errors, I believe, and they are present in both sets of diagnostics - i.e. before you start the VM. Check the timestamp. The system runs fine before you start the VM but since the VM makes heavy use of the cache pool the problems begin when you start the VM.

 

On the question about power delivery, the Pro 4 series has weaker VRMs than more gaming orientated motherboards and the B450 series was designed for the 2000-series of CPUs. Your 3950X is rather more power hungry than the top CPU in the 2000 series (the 2700X). I don't think it's an issue in your case but it is why I asked what your VM was doing, thinking you might be using it to thrash the CPU.

 

If you're happy with the RAM the next thing to address is the NVMe cache.

 


I appreciate the explanations!

 

On the RAM front, I didn't do enough research to know this system would be capped at 3200 MT/s. I've done a lot of reading since your first post and have learned a bunch. I was running at 2133 MT/s because that was the "Auto" default in the BIOS; I just assumed it would have set it to the fastest.

 

Quote

The BTRFS errors I pointed out are real errors - hardware errors, I believe, and they are present in both sets of diagnostics - i.e. before you start the VM.

 

Looking at the ...1710.zip file (before booting the VM) I don't see any BTRFS errors. Assuming you're talking about the line below, I'm pretty sure that's historical error information; those counters are cumulative and persist even after repairing the filesystem. It's logged as info rather than warning or error. As far as I can tell the filesystem is healthy before booting the VM. I don't think it's specific to what the VM is doing, because the errors start almost immediately after clicking "Start", before the OS has even booted.

 

Sep 25 17:09:48 Jarvis kernel: BTRFS info (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 38166417, rd 30944114, flush 3780, corrupt 102327, gen 246
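Those counters can be read and zeroed directly with `btrfs device stats -z <mountpoint>` (the mountpoint depends on your layout). As a quick sanity check, the numbers can also be pulled straight out of a logged line like the one above (the sample text is copied from this thread):

```shell
# Cumulative BTRFS error-counter line as logged by the kernel
line='Sep 25 17:09:48 Jarvis kernel: BTRFS info (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 38166417, rd 30944114, flush 3780, corrupt 102327, gen 246'

# Extract individual counters from the line
wr=$(printf '%s\n' "$line" | sed -n 's/.*errs: wr \([0-9]*\),.*/\1/p')
corrupt=$(printf '%s\n' "$line" | sed -n 's/.*corrupt \([0-9]*\),.*/\1/p')
echo "write errors: $wr, corruption events: $corrupt"
```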

 

Quote

If you're happy with the RAM the next thing to address is the NVMe cache.

 

RAM seems pretty stable, running Memtest86+ for over 24h with no errors. I've had these problems pre-NVMe cache, when I had 2 SATA SSDs; the NVMe drives are actually new, bought in an attempt to solve the problem. Between NVMe, RAM, power delivery, etc., all signs are pointing pretty strongly to the B450 not being up to the task, so I ordered an X570 to give it a try.


I think that I've finally reached a good state. Here's a summary from where I left off for the record in case it helps anyone else.

 

Swapped the B450 for the Asus ROG Strix X570-E Gaming WiFi II. I started with a straight motherboard swap but was still hitting the filesystem issue when booting the VM. Fiddling with the same BIOS settings from earlier in the thread didn't change anything. Ultimately I copied all the data off the NVMe drives and, instead of just removing them from the cache pool and changing the filesystem so Unraid would re-format them, I ran blkdiscard against both drives to really make sure I was starting from scratch. This seemed to fix things.
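For anyone repeating the wipe step, it looked roughly like this (the device names are from my system; blkdiscard irreversibly destroys everything on the target, so triple-check before running):

```shell
# Confirm which device is which before discarding anything
lsblk -o NAME,MODEL,SERIAL,SIZE

# Discard every block on both cache drives -- DESTROYS ALL DATA
blkdiscard /dev/nvme0n1
blkdiscard /dev/nvme1n1
```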

 

After that I still had some stability problems that didn't seem filesystem related. Again, ran through the same BIOS settings but nothing seemed to help. Updating the BIOS didn't help either. Strangely, whenever the machine hung our entire house's network stopped working until it was powered off!

 

Eventually I was able to catch a kernel panic:

 

Oct  1 17:44:33 Jarvis kernel: general protection fault, probably for non-canonical address 0x9780b23d8b23bb2a: 0000 [#1] PREEMPT SMP NOPTI
Oct  1 17:44:33 Jarvis kernel: CPU: 1 PID: 316 Comm: kswapd0 Not tainted 5.19.9-Unraid #1
Oct  1 17:44:33 Jarvis kernel: Hardware name: System manufacturer System Product Name/ROG STRIX X570-E GAMING WIFI II, BIOS 4404 05/30/2022
Oct  1 17:44:33 Jarvis kernel: RIP: __x86_return_thunk 0010:__x86_return_thunk+0x0/0x8
Oct  1 17:44:33 Jarvis kernel: Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc f6 <c3> cc 0f ae e8 eb f9 cc 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Oct  1 17:44:33 Jarvis kernel: RSP: 0018:ffffc9000007ced8 EFLAGS: 00010246
Oct  1 17:44:33 Jarvis kernel: RAX: 9780b23d8b23bb2a RBX: ffff888fee86cd40 RCX: 0000000000220004
Oct  1 17:44:33 Jarvis kernel: RDX: 0000000000000000 RSI: ffffea00128ef800 RDI: ffff888423be72b8
Oct  1 17:44:33 Jarvis kernel: RBP: ffff888423be72b8 R08: ffff8884a3be6cc0 R09: 0000000000220004
Oct  1 17:44:33 Jarvis kernel: R10: ffff8884a3be6cc0 R11: 0000000000030b00 R12: 0000079534a5afdf
Oct  1 17:44:33 Jarvis kernel: R13: 0000000000000851 R14: 0000000000002710 R15: ffff8881073e0fc0
Oct  1 17:44:33 Jarvis kernel: FS:  0000000000000000(0000) GS:ffff888fee840000(0000) knlGS:0000000000000000
Oct  1 17:44:33 Jarvis kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct  1 17:44:33 Jarvis kernel: CR2: 000015542c8f9820 CR3: 00000001060e2000 CR4: 0000000000350ee0
Oct  1 17:44:33 Jarvis kernel: Call Trace:
Oct  1 17:44:33 Jarvis kernel: <IRQ>
Oct  1 17:44:33 Jarvis kernel: ? rcu_do_batch+0x23a/0x46c
Oct  1 17:44:33 Jarvis kernel: ? rcu_core+0x265/0x2ac
Oct  1 17:44:33 Jarvis kernel: ? timekeeping_get_ns+0x19/0x33
Oct  1 17:44:33 Jarvis kernel: ? __do_softirq+0x129/0x288
Oct  1 17:44:33 Jarvis kernel: ? __irq_exit_rcu+0x79/0xb8
Oct  1 17:44:33 Jarvis kernel: ? sysvec_apic_timer_interrupt+0x85/0xa6
Oct  1 17:44:33 Jarvis kernel: </IRQ>
Oct  1 17:44:33 Jarvis kernel: <TASK>
Oct  1 17:44:33 Jarvis kernel: ? asm_sysvec_apic_timer_interrupt+0x16/0x20
Oct  1 17:44:33 Jarvis kernel: ? hlist_bl_lock+0x14/0x41
Oct  1 17:44:33 Jarvis kernel: ? hlist_bl_lock+0xe/0x41
Oct  1 17:44:33 Jarvis kernel: ? ___d_drop+0x3b/0x62
Oct  1 17:44:33 Jarvis kernel: ? __d_drop+0x15/0x2a
Oct  1 17:44:33 Jarvis kernel: ? __dentry_kill+0x56/0x131
Oct  1 17:44:33 Jarvis kernel: ? shrink_dentry_list+0xaa/0xba
Oct  1 17:44:33 Jarvis kernel: ? prune_dcache_sb+0x51/0x73
Oct  1 17:44:33 Jarvis kernel: ? super_cache_scan+0xf4/0x17c
Oct  1 17:44:33 Jarvis kernel: ? do_shrink_slab+0x18b/0x2a0
Oct  1 17:44:33 Jarvis kernel: ? shrink_slab+0x113/0x265
Oct  1 17:44:33 Jarvis kernel: ? shrink_node+0x327/0x542
Oct  1 17:44:33 Jarvis kernel: ? balance_pgdat+0x294/0x426
Oct  1 17:44:33 Jarvis kernel: ? kswapd+0x2fa/0x33d
Oct  1 17:44:33 Jarvis kernel: ? _raw_spin_rq_lock_irqsave+0x20/0x20
Oct  1 17:44:33 Jarvis kernel: ? balance_pgdat+0x426/0x426
Oct  1 17:44:33 Jarvis kernel: ? kthread+0xe7/0xef
Oct  1 17:44:33 Jarvis kernel: ? kthread_complete_and_exit+0x1b/0x1b
Oct  1 17:44:33 Jarvis kernel: ? ret_from_fork+0x22/0x30
Oct  1 17:44:33 Jarvis kernel: </TASK>
Oct  1 17:44:33 Jarvis kernel: Modules linked in: xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle vhost_iotlb xt_nat xt_tcpudp veth xt_conntrack nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter xfs md_mod iptable_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libchacha ip6table_filter ip6_tables iptable_filter ip_tables x_tables af_packet 8021q garp mrp bridge stp llc ipv6 igb i2c_algo_bit r8169 realtek wmi_bmof asus_ec_sensors edac_mce_amd edac_core crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd nvme i2c_piix4 rapl input_leds led_class ahci k10temp i2c_core nvme_core libahci wmi button acpi_cpufreq unix [last unloaded: ccp]
Oct  1 17:44:33 Jarvis kernel: ---[ end trace 0000000000000000 ]---
Oct  1 17:44:33 Jarvis kernel: RIP: 0010:__x86_return_thunk+0x0/0x8
Oct  1 17:44:33 Jarvis kernel: Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc f6 <c3> cc 0f ae e8 eb f9 cc 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Oct  1 17:44:33 Jarvis kernel: RSP: 0018:ffffc9000007ced8 EFLAGS: 00010246
Oct  1 17:44:33 Jarvis kernel: RAX: 9780b23d8b23bb2a RBX: ffff888fee86cd40 RCX: 0000000000220004
Oct  1 17:44:33 Jarvis kernel: RDX: 0000000000000000 RSI: ffffea00128ef800 RDI: ffff888423be72b8
Oct  1 17:44:33 Jarvis kernel: RBP: ffff888423be72b8 R08: ffff8884a3be6cc0 R09: 0000000000220004
Oct  1 17:44:33 Jarvis kernel: R10: ffff8884a3be6cc0 R11: 0000000000030b00 R12: 0000079534a5afdf
Oct  1 17:44:33 Jarvis kernel: R13: 0000000000000851 R14: 0000000000002710 R15: ffff8881073e0fc0

 

I couldn't pinpoint this exactly, but some other Unraid threads pointed to it possibly being network-related, which fit with the house-wide network failure mentioned above.

 

Remembering the new board had both 2.5G and 1G network ports, I swapped from the former to the latter and that seemed to solve the problem.
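In case it helps anyone checking the same thing, which driver and negotiated speed each port is using can be confirmed like this (interface names such as eth0 are assumptions; substitute your own):

```shell
# List interfaces with their state and MAC addresses
ip -br link

# The kernel driver behind an interface helps map it to a physical port
ethtool -i eth0 | grep '^driver'

# Negotiated link speed on the active port
ethtool eth0 | grep -E 'Speed|Link detected'
```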

 

Things seem to be mostly stable so I'm going to consider this solved and will re-open or post a new topic if something else comes up in the future. Thank you so much for all the help!
