Laptop765

  1. I think that I've finally reached a good state. Here's a summary from where I left off for the record in case it helps anyone else.

     Swapped the B450 for the Asus ROG Strix X570-E Gaming WiFi II. Started by doing a straight motherboard swap but was still running into the filesystem issue when booting the VM. Fiddling with the same BIOS settings earlier in the thread didn't change anything. Ultimately I copied all the data from the NVMe drives and instead of just removing them from the cache pool and changing the filesystem so they got re-formatted by Unraid I ran blkdiscard against both drives to really make sure I was starting from scratch. This seemed to fix things.

     After that I still had some stability problems that didn't seem filesystem related. Again, ran through the same BIOS settings but nothing seemed to help. Updating the BIOS didn't help either. Strangely, whenever the machine hung our entire house's network stopped working until it was powered off! Eventually I was able to catch a kernel panic:

     Oct 1 17:44:33 Jarvis kernel: general protection fault, probably for non-canonical address 0x9780b23d8b23bb2a: 0000 [#1] PREEMPT SMP NOPTI
     Oct 1 17:44:33 Jarvis kernel: CPU: 1 PID: 316 Comm: kswapd0 Not tainted 5.19.9-Unraid #1
     Oct 1 17:44:33 Jarvis kernel: Hardware name: System manufacturer System Product Name/ROG STRIX X570-E GAMING WIFI II, BIOS 4404 05/30/2022
     Oct 1 17:44:33 Jarvis kernel: RIP: __x86_return_thunk 0010:__x86_return_thunk+0x0/0x8
     Oct 1 17:44:33 Jarvis kernel: Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc f6 <c3> cc 0f ae e8 eb f9 cc 00 00 00 00 00 00 00 00 00 00 00 00 00 00
     Oct 1 17:44:33 Jarvis kernel: RSP: 0018:ffffc9000007ced8 EFLAGS: 00010246
     Oct 1 17:44:33 Jarvis kernel: RAX: 9780b23d8b23bb2a RBX: ffff888fee86cd40 RCX: 0000000000220004
     Oct 1 17:44:33 Jarvis kernel: RDX: 0000000000000000 RSI: ffffea00128ef800 RDI: ffff888423be72b8
     Oct 1 17:44:33 Jarvis kernel: RBP: ffff888423be72b8 R08: ffff8884a3be6cc0 R09: 0000000000220004
     Oct 1 17:44:33 Jarvis kernel: R10: ffff8884a3be6cc0 R11: 0000000000030b00 R12: 0000079534a5afdf
     Oct 1 17:44:33 Jarvis kernel: R13: 0000000000000851 R14: 0000000000002710 R15: ffff8881073e0fc0
     Oct 1 17:44:33 Jarvis kernel: FS: 0000000000000000(0000) GS:ffff888fee840000(0000) knlGS:0000000000000000
     Oct 1 17:44:33 Jarvis kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
     Oct 1 17:44:33 Jarvis kernel: CR2: 000015542c8f9820 CR3: 00000001060e2000 CR4: 0000000000350ee0
     Oct 1 17:44:33 Jarvis kernel: Call Trace:
     Oct 1 17:44:33 Jarvis kernel: <IRQ>
     Oct 1 17:44:33 Jarvis kernel: ? rcu_do_batch+0x23a/0x46c
     Oct 1 17:44:33 Jarvis kernel: ? rcu_core+0x265/0x2ac
     Oct 1 17:44:33 Jarvis kernel: ? timekeeping_get_ns+0x19/0x33
     Oct 1 17:44:33 Jarvis kernel: ? __do_softirq+0x129/0x288
     Oct 1 17:44:33 Jarvis kernel: ? __irq_exit_rcu+0x79/0xb8
     Oct 1 17:44:33 Jarvis kernel: ? sysvec_apic_timer_interrupt+0x85/0xa6
     Oct 1 17:44:33 Jarvis kernel: </IRQ>
     Oct 1 17:44:33 Jarvis kernel: <TASK>
     Oct 1 17:44:33 Jarvis kernel: ? asm_sysvec_apic_timer_interrupt+0x16/0x20
     Oct 1 17:44:33 Jarvis kernel: ? hlist_bl_lock+0x14/0x41
     Oct 1 17:44:33 Jarvis kernel: ? hlist_bl_lock+0xe/0x41
     Oct 1 17:44:33 Jarvis kernel: ? ___d_drop+0x3b/0x62
     Oct 1 17:44:33 Jarvis kernel: ? __d_drop+0x15/0x2a
     Oct 1 17:44:33 Jarvis kernel: ? __dentry_kill+0x56/0x131
     Oct 1 17:44:33 Jarvis kernel: ? shrink_dentry_list+0xaa/0xba
     Oct 1 17:44:33 Jarvis kernel: ? prune_dcache_sb+0x51/0x73
     Oct 1 17:44:33 Jarvis kernel: ? super_cache_scan+0xf4/0x17c
     Oct 1 17:44:33 Jarvis kernel: ? do_shrink_slab+0x18b/0x2a0
     Oct 1 17:44:33 Jarvis kernel: ? shrink_slab+0x113/0x265
     Oct 1 17:44:33 Jarvis kernel: ? shrink_node+0x327/0x542
     Oct 1 17:44:33 Jarvis kernel: ? balance_pgdat+0x294/0x426
     Oct 1 17:44:33 Jarvis kernel: ? kswapd+0x2fa/0x33d
     Oct 1 17:44:33 Jarvis kernel: ? _raw_spin_rq_lock_irqsave+0x20/0x20
     Oct 1 17:44:33 Jarvis kernel: ? balance_pgdat+0x426/0x426
     Oct 1 17:44:33 Jarvis kernel: ? kthread+0xe7/0xef
     Oct 1 17:44:33 Jarvis kernel: ? kthread_complete_and_exit+0x1b/0x1b
     Oct 1 17:44:33 Jarvis kernel: ? ret_from_fork+0x22/0x30
     Oct 1 17:44:33 Jarvis kernel: </TASK>
     Oct 1 17:44:33 Jarvis kernel: Modules linked in: xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle vhost_iotlb xt_nat xt_tcpudp veth xt_conntrack nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter xfs md_mod iptable_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libchacha ip6table_filter ip6_tables iptable_filter ip_tables x_tables af_packet 8021q garp mrp bridge stp llc ipv6 igb i2c_algo_bit r8169 realtek wmi_bmof asus_ec_sensors edac_mce_amd edac_core crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd nvme i2c_piix4 rapl input_leds led_class ahci k10temp i2c_core nvme_core libahci wmi button acpi_cpufreq unix [last unloaded: ccp]
     Oct 1 17:44:33 Jarvis kernel: ---[ end trace 0000000000000000 ]---
     Oct 1 17:44:33 Jarvis kernel: RIP: 0010:__x86_return_thunk+0x0/0x8
     Oct 1 17:44:33 Jarvis kernel: Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc f6 <c3> cc 0f ae e8 eb f9 cc 00 00 00 00 00 00 00 00 00 00 00 00 00 00
     Oct 1 17:44:33 Jarvis kernel: RSP: 0018:ffffc9000007ced8 EFLAGS: 00010246
     Oct 1 17:44:33 Jarvis kernel: RAX: 9780b23d8b23bb2a RBX: ffff888fee86cd40 RCX: 0000000000220004
     Oct 1 17:44:33 Jarvis kernel: RDX: 0000000000000000 RSI: ffffea00128ef800 RDI: ffff888423be72b8
     Oct 1 17:44:33 Jarvis kernel: RBP: ffff888423be72b8 R08: ffff8884a3be6cc0 R09: 0000000000220004
     Oct 1 17:44:33 Jarvis kernel: R10: ffff8884a3be6cc0 R11: 0000000000030b00 R12: 0000079534a5afdf
     Oct 1 17:44:33 Jarvis kernel: R13: 0000000000000851 R14: 0000000000002710 R15: ffff8881073e0fc0

     I couldn't pinpoint this exactly but some other Unraid threads pointed to possibly being network related and that fit with the house-wide network failure mentioned above. Remembering the new board had both 2.5G and 1G network ports, I swapped from the former to the latter and that seemed to solve the problem.

     Things seem to be mostly stable so I'm going to consider this solved and will re-open or post a new topic if something else comes up in the future. Thank you so much for all the help!
  2. I appreciate the explanations!

     On the RAM front, I didn't do enough research to know this system would be capped at 3200 MT/s. I've done a lot of reading since your first post and have learned a bunch. I was running at 2133 MT/s because that was the "Auto" default in the BIOS; I just assumed it would have set it to the fastest speed.

     Looking at the ...1710.zip file (before booting the VM) I don't see any BTRFS errors. Assuming you're talking about the line below, I'm pretty sure that's historical error information: those counters are cumulative and persist even after repairing the filesystem. It's logged as info rather than warning or error. As far as I can tell the filesystem is healthy before booting the VM. I don't think it's specific to what the VM is doing because the errors start almost immediately after clicking "Start", before the OS is even booted.

     Sep 25 17:09:48 Jarvis kernel: BTRFS info (device nvme0n1p1): bdev /dev/nvme0n1p1 errs: wr 38166417, rd 30944114, flush 3780, corrupt 102327, gen 246

     RAM seems pretty stable, running Memtest86+ for over 24h with no errors.

     I had these problems before the NVMe cache too, when I had 2 SATA SSDs. The NVMe drives are actually new, bought in an attempt to solve the problem.

     Looking at NVMe, RAM, power delivery, etc., all signs are pointing pretty strongly to the B450 not being up to the task, so I ordered an X570 to give it a try.
  3. Thanks for that info! I've ordered an X570 board to try out.
  4. I ran a few more tests today but none of them revealed a solution:

     • Took the video card out in case there wasn't enough power going to the system
     • Bumped RAM speed down to 1866
     • Selected RAM profile 3200
     • Set all RAM back to Auto and turned off C-States and set Power Supply Idle Control to Typical Current Idle
     • Downloaded a newer version of Memtest86+ (6.00b3) and everything passed

     Doing some more reading it sounds like maybe it's the B450 being problematic and I'm considering getting an X570.
  5. Thanks to both of you for the replies!

     @John_M I could reproduce the failure both before and after re-seating the RAM. I'm not sure I fully follow the RAM/CPU mismatch. I haven't manually changed anything in the BIOS with respect to clocks on either the CPU or the RAM, so it's running at whatever the default is. Are you saying that the RAM is either too fast or too slow? RAM is one of the things I never quite got the hang of so I try not to mess with it. I'm a bit confused by your wording: I'm running at 2133 MT/s and I should be able to run at 2666 MT/s, but you're suggesting I try lower, like <=2000 MT/s? I'm also not sure about CPU power delivery. Are you suggesting that the motherboard might not be up to spec for the CPU I have?

     From my understanding reading through these forums, BTRFS reports all cumulative failures as INFO on boot until manually reset. So the system boots in a functional state and running BTRFS check returns a clean filesystem. Everything seems fine until I boot the VM. As far as I know there's nothing special about it; it's just a Win11 VM with ~12GB of RAM and ~8 reserved cores (I don't remember exact numbers off the top of my head). It's also possible it's not the VM and just anything that puts the system under a lot of load at once; the VM is just how I know to reproduce it reliably.

     Just to clarify, the ...1710.zip and ...1711.zip are before and after booting the VM, on a boot with a clean Memtest86+ run on all 4 sticks and with a clean BTRFS check before reproducing the issue.

     @DarthKegRaider Good data points to have, thanks. I do have my VM cores reserved and pinned but not for anything else on the system. Why would taking 'forever' to execute something cause filesystem corruption instead of just being slow?
  6. Hello,

     I've been debugging my machine after going through dozens and dozens of forum posts and other research and am at a complete loss. It seems that whenever my server is under a lot of load something goes horribly wrong and the filesystem becomes corrupt. I finally found a way to reproduce it consistently just by starting a specific VM. As soon as I do, BTRFS errors start flying around and the disk goes missing from Unraid. Attached are two sets of diagnostics from the same boot: one before the VM and one after the errors start. Stopping and starting the array followed by a BTRFS scrub fixes things up no problem, but stopping is a requirement.

     At one point I had the same problem on the disk array with XFS and noticed that if I manually mounted /dev/sdX1 the files and filesystem were fine, but mounting the corresponding /dev/mdY showed a corrupt filesystem that couldn't be repaired.

     Main Hardware:

     • ASRock B450 Pro4 Motherboard
     • Ryzen 9 3950X CPU
     • 4x Corsair Vengeance LPX 16GB DDR4 RAM
     • 2x Samsung 970 Evo Plus 2TB
     • 3x Seagate IronWolf NAS 8TB
     • 1x GeForce 1080Ti
     • Corsair CX500 PSU

     Things I've tried:

     • Replacing both cache drives (moved from SATA -> M.2)
     • Replacing SATA cables
     • Swapping SATA ports around on the motherboard and eventually plugging them all into PCIe cards
     • Memtest86+ showed one failure but after reseating all the RAM sticks it went away
     • Turning IOMMU off, but that prevents me from turning on the VM to reproduce the issue; I read that with B450 and Ryzen 3950X this could be an issue, and trying various combinations of iommu={pt,soft} and pci=noats didn't make a difference
     • Booting the VM without any passthrough (previously a GPU was assigned)
     • Turning PCIe ACS override on and off
     • Unassigning all devices from VFIO
     • Upgrading to BIOS 5.00
     • Upgrading to Unraid 6.10.3

     And apologies if I'm leaving out some other things I've tried; I've lost track of it all at this point. I don't mind buying replacement hardware if needed but I've already replaced a bunch and would really like to narrow down what it is before spending more money.

     Thank you so much in advance!

     EDIT: I just noticed Unraid 6.11.0 and gave it a try with the same results. Attaching a second before/after set of diagnostics.

     jarvis-diagnostics-20220925-1612.zip
     jarvis-diagnostics-20220925-1614.zip
     jarvis-diagnostics-20220925-1711.zip
     jarvis-diagnostics-20220925-1710.zip
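Since the thread leans on the point that the BTRFS "errs:" counters are cumulative, one way to tell old errors from new ones is to compare the counter line from two different mounts. Below is a minimal sketch of that idea in Python. The function names (parse_btrfs_errs, counter_deltas) are made up for illustration, and the pattern is only written against the one syslog line quoted in post 2, so it is an assumption that other kernel versions log the same layout.

```python
import re

# Pattern written against the single line quoted in post 2 (assumption:
# other kernels use the same layout for the per-device counter summary).
ERRS_RE = re.compile(
    r"BTRFS \w+ \(device (?P<device>[^)]+)\): bdev (?P<bdev>\S+) errs: "
    r"wr (?P<wr>\d+), rd (?P<rd>\d+), flush (?P<flush>\d+), "
    r"corrupt (?P<corrupt>\d+), gen (?P<gen>\d+)"
)

COUNTERS = ("wr", "rd", "flush", "corrupt", "gen")


def parse_btrfs_errs(line):
    """Return the five counters as ints, or None if the line has no summary."""
    match = ERRS_RE.search(line)
    if match is None:
        return None
    return {name: int(match.group(name)) for name in COUNTERS}


def counter_deltas(before, after):
    """Per-counter difference between two parsed summaries (after - before)."""
    return {name: after[name] - before[name] for name in COUNTERS}


# The line from post 2, copied verbatim.
sample = (
    "Sep 25 17:09:48 Jarvis kernel: BTRFS info (device nvme0n1p1): "
    "bdev /dev/nvme0n1p1 errs: wr 38166417, rd 30944114, flush 3780, "
    "corrupt 102327, gen 246"
)

if __name__ == "__main__":
    print(parse_btrfs_errs(sample))
```

If the same line from a later mount parses to identical numbers, counter_deltas returns all zeros and the counts are historical; any non-zero delta means new errors were recorded in between.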