I recently upgraded my Unraid server to use a spare Ryzen 7 1700 CPU on an MSI B550 Tomahawk motherboard.
Since then I have been seeing frequent crashes and lockups for a variety of reasons.
My most recent crash is:
Apr 1 20:51:10 Tower kernel: ------------[ cut here ]------------
Apr 1 20:51:10 Tower kernel: WARNING: CPU: 7 PID: 7435 at kernel/exit.c:725 do_exit+0x4b/0x8eb
Apr 1 20:51:10 Tower kernel: Modules linked in: xt_nat veth xt_CHECKSUM ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle nf_tables vhost_net tun vhost vhost_iotlb tap xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter xfs nfsd lockd grace sunrpc md_mod ip6table_filter ip6_tables iptable_filter ip_tables x_tables r8169 realtek sr_mod cdrom edac_mce_amd kvm_amd kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd r8125(O) cryptd glue_helper wmi_bmof i2c_piix4 rapl input_leds ccp ahci i2c_core wmi k10temp led_class cdc_acm libahci acpi_cpufreq button [last unloaded: realtek]
Apr 1 20:51:10 Tower kernel: CPU: 7 PID: 7435 Comm: unraidd10 Tainted: G S D O 5.10.28-Unraid #1
Apr 1 20:51:10 Tower kernel: Hardware name: Micro-Star International Co., Ltd. MS-7C91/MAG B550 TOMAHAWK (MS-7C91), BIOS A.80 12/16/2021
Apr 1 20:51:10 Tower kernel: RIP: 0010:do_exit+0x4b/0x8eb
Apr 1 20:51:10 Tower kernel: Code: 65 48 8b 1c 25 c0 7b 01 00 48 8b 83 e8 06 00 00 48 85 c0 74 17 48 8b 10 48 39 d0 75 0d 48 8b 50 10 48 83 c0 10 48 39 c2 74 02 <0f> 0b 65 8b 0d ec 40 fc 7e 89 c8 48 c7 c7 2e 61 d7 81 25 00 ff ff
Apr 1 20:51:10 Tower kernel: RSP: 0018:ffffc90000a7fee8 EFLAGS: 00010012
Apr 1 20:51:10 Tower kernel: RAX: ffffc90000a7fe40 RBX: ffff8881055a3800 RCX: 0000000000000027
Apr 1 20:51:10 Tower kernel: RDX: ffff88813be8c348 RSI: 0000000000000001 RDI: 0000000000000009
Apr 1 20:51:10 Tower kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000ffffdfff
Apr 1 20:51:10 Tower kernel: R10: ffffc90000a7f958 R11: ffffc90000a7f950 R12: 0000000000000009
Apr 1 20:51:10 Tower kernel: R13: 0000000000000009 R14: 0000000000000046 R15: 0000000000000000
Apr 1 20:51:10 Tower kernel: FS: 0000000000000000(0000) GS:ffff888fee9c0000(0000) knlGS:0000000000000000
Apr 1 20:51:10 Tower kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 1 20:51:10 Tower kernel: CR2: 0000000000000000 CR3: 0000000383a90000 CR4: 00000000003506e0
Apr 1 20:51:10 Tower kernel: Call Trace:
Apr 1 20:51:10 Tower kernel: ? md_seq_show+0x69e/0x69e [md_mod]
Apr 1 20:51:10 Tower kernel: ? kthread+0xe5/0xea
Apr 1 20:51:10 Tower kernel: rewind_stack_do_exit+0x17/0x17
Apr 1 20:51:10 Tower kernel: RIP: 0000:0x0
Apr 1 20:51:10 Tower kernel: Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
Apr 1 20:51:10 Tower kernel: RSP: 0000:0000000000000000 EFLAGS: 00000000 ORIG_RAX: 0000000000000000
Apr 1 20:51:10 Tower kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Apr 1 20:51:10 Tower kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Apr 1 20:51:10 Tower kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
Apr 1 20:51:10 Tower kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
Apr 1 20:51:10 Tower kernel: R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
Apr 1 20:51:10 Tower kernel: ---[ end trace 558ddcf995bf62ba ]---
I'm getting frustrated at being unable to find the root cause. I know I have a bad cache device, and that was causing the mover sequence to lock up or crash. Can anyone assist or point me in the right direction?
My diagnostics are attached:
tower-diagnostics-20220401-2103.zip