• 6.12.8 - Segfaults and call traces


    curled
    • Closed

    After upgrading from 6.12.6 to 6.12.8, I started getting segfaults and call traces from various processes. The WebUI would hang after a while, and the entire system would become unresponsive (ssh, kvm, etc.). Safe mode without plugins was similarly affected.

     

    Unfortunately, I was not able to capture the syslog of every crash. I didn't want to enable writing the syslog to flash, because I had to force a shutdown of the server each time and didn't want to deal with a corrupted flash drive on top of that. These are some of the occurrences I managed to capture.

     

    Mar  3 01:00:01 Tower root: mover: started
    Mar  3 01:00:02 Tower root: mover: finished
    Mar  3 01:04:31 Tower kernel: traps: python3[12337] general protection fault ip:14b7635f7b2c sp:14b70994f730 error:0 in libpython3.11.so.1.0[14b763548000+1d3000]
    Mar  3 01:12:39 Tower kernel: python3[21664]: segfault at 590c5402 ip 000014a8678c8585 sp 00007ffd90b8faf0 error 6 in libpython3.9.so.1.0[14a86772a000+201000] likely on CPU 12 (core 24, socket 0)
    Mar  3 01:12:39 Tower kernel: Code: 24 10 4c 8b 44 24 08 44 89 ea 48 8b 0c 24 48 8d 35 95 4b 0b 00 e8 5b de ff ff c7 85 30 03 00 00 00 00 00 00 e9 2c ff ff ff 8b <87> a8 02 00 00 39 87 ac 02 00 00 7f 10 8b 87 90 02 00 00 39 87 94

     

    This was not isolated to python3 (which is not a standard part of Unraid); smartctl, php-fpm and other processes were affected as well.
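
    To see at a glance which processes are faulting, the user-space fault lines can be tallied per process. This is only a minimal sketch: the regular expression covers just the two line formats that appear in this report, and `tally_faults` is a hypothetical helper name, not part of any Unraid tooling.

```python
import re
from collections import Counter

# Covers only the two user-space fault formats seen in this report:
#   "kernel: traps: python3[12337] general protection fault ..."
#   "kernel: python3[21664]: segfault at ..."
FAULT_RE = re.compile(
    r"kernel: (?:traps: )?(?P<comm>[^\s\[]+)\[(?P<pid>\d+)\]:? "
    r"(?P<kind>general protection fault|segfault)"
)


def tally_faults(lines):
    """Count faults per (process name, fault kind) in an iterable of syslog lines."""
    counts = Counter()
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            counts[(match.group("comm"), match.group("kind"))] += 1
    return counts


# Three lines copied from the syslog excerpts in this report.
sample = [
    "Mar  3 01:04:31 Tower kernel: traps: python3[12337] general protection fault "
    "ip:14b7635f7b2c sp:14b70994f730 error:0 in libpython3.11.so.1.0[14b763548000+1d3000]",
    "Mar  3 01:12:39 Tower kernel: python3[21664]: segfault at 590c5402 "
    "ip 000014a8678c8585 sp 00007ffd90b8faf0 error 6 in libpython3.9.so.1.0[14a86772a000+201000]",
    "Mar  3 05:03:30 Tower kernel: traps: cache_dirs[22836] general protection fault "
    "ip:4e932f sp:7ffe6185a8a0 error:0 in bash[426000+c5000]",
]

for (comm, kind), n in sorted(tally_faults(sample).items()):
    print(f"{comm}: {kind} x{n}")
```

    Kernel oops lines (such as "BUG: kernel NULL pointer dereference") deliberately do not match, since they name the process on a separate "Comm:" line.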

     

    Example of call traces:

    Mar  3 04:00:01 Tower root: mover: finished
    Mar  3 04:32:44 Tower kernel: BUG: kernel NULL pointer dereference, address: 0000000000000038
    Mar  3 04:32:44 Tower kernel: #PF: supervisor read access in kernel mode
    Mar  3 04:32:44 Tower kernel: #PF: error_code(0x0000) - not-present page
    Mar  3 04:32:44 Tower kernel: PGD 52c52a067 P4D 52c52a067 PUD 4a43f3067 PMD 0 
    Mar  3 04:32:44 Tower kernel: Oops: 0000 [#2] PREEMPT SMP NOPTI
    Mar  3 04:32:44 Tower kernel: CPU: 13 PID: 8018 Comm: smartctl_type Tainted: P      D    O       6.1.74-Unraid #1
    Mar  3 04:32:44 Tower kernel: Hardware name: ASUSTeK COMPUTER INC. System Product Name/Pro WS W680-ACE IPMI, BIOS 3302 02/21/2024
    Mar  3 04:32:44 Tower kernel: RIP: 0010:memcg_slab_free_hook+0x28/0xcf
    Mar  3 04:32:44 Tower kernel: Code: cc cc 41 57 41 56 49 89 d6 41 55 41 54 55 48 89 f5 53 48 89 fb 48 83 ec 10 89 4c 24 0c e8 5a e1 ff ff 84 c0 0f 84 94 00 00 00 <4c> 8b 65 38 49 83 fc 03 0f 86 86 00 00 00 49 83 e4 fc 45 31 ed 41
    Mar  3 04:32:44 Tower kernel: RSP: 0018:ffffc90030997ca0 EFLAGS: 00010202
    Mar  3 04:32:44 Tower kernel: RAX: 0000000000000001 RBX: ffff888100045a00 RCX: 0000000000000001
    Mar  3 04:32:44 Tower kernel: RDX: ffffc90030997cf0 RSI: 0000000000000000 RDI: ffff888100045a00
    Mar  3 04:32:44 Tower kernel: RBP: 0000000000000000 R08: ffff8889a6b4d300 R09: ffffffff8184e49c
    Mar  3 04:32:44 Tower kernel: R10: ffff8889a6b4d300 R11: ffff888aa3934100 R12: 0000000000000000
    Mar  3 04:32:44 Tower kernel: R13: ffff8889a6b4d500 R14: ffffc90030997cf0 R15: 0000000000000071
    Mar  3 04:32:44 Tower kernel: FS:  0000000000000000(0000) GS:ffff889fffb40000(0000) knlGS:0000000000000000
    Mar  3 04:32:44 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Mar  3 04:32:44 Tower kernel: CR2: 0000000000000038 CR3: 0000000648b1e000 CR4: 0000000000750ee0
    Mar  3 04:32:44 Tower kernel: PKRU: 55555554
    Mar  3 04:32:44 Tower kernel: Call Trace:
    Mar  3 04:32:44 Tower kernel: <TASK>
    Mar  3 04:32:44 Tower kernel: ? __die_body+0x1a/0x5c
    Mar  3 04:32:44 Tower kernel: ? page_fault_oops+0x329/0x376
    Mar  3 04:32:44 Tower kernel: ? do_user_addr_fault+0x12e/0x48d
    Mar  3 04:32:44 Tower kernel: ? exc_page_fault+0xfb/0x11d
    Mar  3 04:32:44 Tower kernel: ? asm_exc_page_fault+0x22/0x30
    Mar  3 04:32:44 Tower kernel: ? mas_destroy+0xa8/0xbb
    Mar  3 04:32:44 Tower kernel: ? memcg_slab_free_hook+0x28/0xcf
    Mar  3 04:32:44 Tower kernel: kmem_cache_free+0xb7/0x154
    Mar  3 04:32:44 Tower kernel: ? mas_destroy+0xa8/0xbb
    Mar  3 04:32:44 Tower kernel: mas_destroy+0xa8/0xbb
    Mar  3 04:32:44 Tower kernel: mmap_region+0x457/0x61e
    Mar  3 04:32:44 Tower kernel: ? preempt_latency_start+0x1e/0x46
    Mar  3 04:32:44 Tower kernel: do_mmap+0x3bc/0x428
    Mar  3 04:32:44 Tower kernel: vm_mmap_pgoff+0xbb/0x112
    Mar  3 04:32:44 Tower kernel: ksys_mmap_pgoff+0x138/0x166
    Mar  3 04:32:44 Tower kernel: do_syscall_64+0x68/0x81
    Mar  3 04:32:44 Tower kernel: entry_SYSCALL_64_after_hwframe+0x64/0xce
    Mar  3 04:32:44 Tower kernel: RIP: 0033:0x15215c49fe33
    Mar  3 04:32:44 Tower kernel: Code: 1f 84 00 00 00 00 00 4c 89 23 31 c0 48 c7 43 08 00 04 00 00 eb e2 90 41 89 ca 41 f7 c1 ff 0f 00 00 75 14 b8 09 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 1d c3 0f 1f 40 00 c7 05 36 34 01 00 16 00 00
    Mar  3 04:32:44 Tower kernel: RSP: 002b:00007fff326694c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000009
    Mar  3 04:32:44 Tower kernel: RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 000015215c49fe33
    Mar  3 04:32:44 Tower kernel: RDX: 0000000000000001 RSI: 0000000000162000 RDI: 0000152158bef000
    Mar  3 04:32:44 Tower kernel: RBP: 00007fff32669860 R08: 0000000000000004 R09: 000000000004f000
    Mar  3 04:32:44 Tower kernel: R10: 0000000000000812 R11: 0000000000000246 R12: 00007fff32669540
    Mar  3 04:32:44 Tower kernel: R13: 0000152158d78690 R14: 00007fff32669900 R15: 0000152158ba0000
    Mar  3 04:32:44 Tower kernel: </TASK>
    Mar  3 04:32:44 Tower kernel: Modules linked in: veth xt_nat xt_CHECKSUM ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle vhost_net tun vhost vhost_iotlb xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype br_netfilter bridge nvidia_uvm(PO) xfs dm_crypt dm_mod nfsd auth_rpcgss oid_registry lockd grace sunrpc md_mod zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) tcp_diag inet_diag ipmi_devintf nct6775 nct6775_core hwmon_vid ip6table_filter ip6_tables iptable_filter ip_tables x_tables efivarfs macvtap macvlan tap 8021q garp mrp stp llc igc nvidia_drm(PO) nvidia_modeset(PO) intel_rapl_msr intel_rapl_common iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nvidia(PO) crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel ast drm_vram_helper i2c_algo_bit
    Mar  3 04:32:44 Tower kernel: drm_ttm_helper crypto_simd ttm cryptd drm_kms_helper mei_hdcp mei_pxp i2c_i801 rapl intel_cstate drm ipmi_ssif mpt3sas agpgart mei_me ahci cdc_ether wmi_bmof i2c_smbus nvme tpm_crb syscopyarea input_leds raid_class sr_mod usbnet sysfillrect intel_uncore sysimgblt i2c_core mei joydev led_class nvme_core cdrom libahci mii scsi_transport_sas vmd acpi_ipmi fb_sys_fops thermal fan video tpm_tis tpm_tis_core ipmi_si wmi backlight tpm intel_pmc_core acpi_tad acpi_pad button unix [last unloaded: igc]
    Mar  3 04:32:44 Tower kernel: CR2: 0000000000000038
    Mar  3 04:32:44 Tower kernel: ---[ end trace 0000000000000000 ]---
    Mar  3 04:32:44 Tower kernel: RIP: 0010:do_dentry_open+0x206/0x304
    Mar  3 04:32:44 Tower kernel: Code: 43 44 a8 04 74 11 48 8b 53 28 48 83 7a 08 00 75 06 83 e0 fb 89 43 44 48 8b 8b d0 00 00 00 48 8b 81 90 00 00 00 48 85 c0 74 0e <48> 83 78 58 00 74 07 81 4b 44 00 00 40 00 8b 53 40 89 d0 25 3f fc
    Mar  3 04:32:44 Tower kernel: RSP: 0018:ffffc90034467cd8 EFLAGS: 00010282
    Mar  3 04:32:44 Tower kernel: RAX: c350ffff8881e30d RBX: ffff8887885be300 RCX: ffff8881e30dc2c2
    Mar  3 04:32:44 Tower kernel: RDX: ffffffffa485a140 RSI: 0000000000000000 RDI: 00000000ffffffff
    Mar  3 04:32:44 Tower kernel: RBP: 0000000000000000 R08: ffffffffa4820098 R09: ffffffffa482090f
    Mar  3 04:32:44 Tower kernel: R10: 0000000000000000 R11: ffff88814f199268 R12: ffff8881e30dc138
    Mar  3 04:32:44 Tower kernel: R13: ffff8887885be310 R14: ffffffffa4824635 R15: 0000000000000000
    Mar  3 04:32:44 Tower kernel: FS:  0000000000000000(0000) GS:ffff889fffb40000(0000) knlGS:0000000000000000
    Mar  3 04:32:44 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Mar  3 04:32:44 Tower kernel: CR2: 0000000000000038 CR3: 0000000648b1e000 CR4: 0000000000750ee0
    Mar  3 04:32:44 Tower kernel: PKRU: 55555554
    Mar  3 04:32:44 Tower kernel: note: smartctl_type[8018] exited with irqs disabled
    Mar  3 05:03:30 Tower kernel: traps: cache_dirs[22836] general protection fault ip:4e932f sp:7ffe6185a8a0 error:0 in bash[426000+c5000]

     

    And another one:

    Mar  3 02:00:01 Tower root: mover: finished
    Mar  3 02:13:49 Tower kernel: general protection fault, probably for non-canonical address 0xc350ffff8881e365: 0000 [#1] PREEMPT SMP NOPTI
    Mar  3 02:13:49 Tower kernel: CPU: 12 PID: 32677 Comm: find Tainted: P           O       6.1.74-Unraid #1
    Mar  3 02:13:49 Tower kernel: Hardware name: ASUSTeK COMPUTER INC. System Product Name/Pro WS W680-ACE IPMI, BIOS 3302 02/21/2024
    Mar  3 02:13:49 Tower kernel: RIP: 0010:do_dentry_open+0x206/0x304
    Mar  3 02:13:49 Tower kernel: Code: 43 44 a8 04 74 11 48 8b 53 28 48 83 7a 08 00 75 06 83 e0 fb 89 43 44 48 8b 8b d0 00 00 00 48 8b 81 90 00 00 00 48 85 c0 74 0e <48> 83 78 58 00 74 07 81 4b 44 00 00 40 00 8b 53 40 89 d0 25 3f fc
    Mar  3 02:13:49 Tower kernel: RSP: 0018:ffffc90034467cd8 EFLAGS: 00010282
    Mar  3 02:13:49 Tower kernel: RAX: c350ffff8881e30d RBX: ffff8887885be300 RCX: ffff8881e30dc2c2
    Mar  3 02:13:49 Tower kernel: RDX: ffffffffa485a140 RSI: 0000000000000000 RDI: 00000000ffffffff
    Mar  3 02:13:49 Tower kernel: RBP: 0000000000000000 R08: ffffffffa4820098 R09: ffffffffa482090f
    Mar  3 02:13:49 Tower kernel: R10: 0000000000000000 R11: ffff88814f199268 R12: ffff8881e30dc138
    Mar  3 02:13:49 Tower kernel: R13: ffff8887885be310 R14: ffffffffa4824635 R15: 0000000000000000
    Mar  3 02:13:49 Tower kernel: FS:  000014c2faaeb740(0000) GS:ffff889fffb00000(0000) knlGS:0000000000000000
    Mar  3 02:13:49 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Mar  3 02:13:49 Tower kernel: CR2: 0000000000475048 CR3: 0000000455dbe000 CR4: 0000000000750ee0
    Mar  3 02:13:49 Tower kernel: PKRU: 55555554
    Mar  3 02:13:49 Tower kernel: Call Trace:
    Mar  3 02:13:49 Tower kernel: <TASK>
    Mar  3 02:13:49 Tower kernel: ? __die_body+0x1a/0x5c
    Mar  3 02:13:49 Tower kernel: ? die_addr+0x38/0x51
    Mar  3 02:13:49 Tower kernel: ? exc_general_protection+0x30f/0x345
    Mar  3 02:13:49 Tower kernel: ? asm_exc_general_protection+0x22/0x30
    Mar  3 02:13:49 Tower kernel: ? xfs_dir_fsync+0x61/0x61 [xfs]
    Mar  3 02:13:49 Tower kernel: ? xfs_buf_readahead_map+0x5/0x50 [xfs]
    Mar  3 02:13:49 Tower kernel: ? xfs_buf_get_map+0x66c/0x804 [xfs]
    Mar  3 02:13:49 Tower kernel: ? do_dentry_open+0x206/0x304
    Mar  3 02:13:49 Tower kernel: ? do_dentry_open+0x192/0x304
    Mar  3 02:13:49 Tower kernel: path_openat+0x8f4/0xa4d
    Mar  3 02:13:49 Tower kernel: do_filp_open+0x55/0xb8
    Mar  3 02:13:49 Tower kernel: ? getname_flags+0x29/0x152
    Mar  3 02:13:49 Tower kernel: ? kmem_cache_alloc+0x122/0x14d
    Mar  3 02:13:49 Tower kernel: ? _raw_spin_unlock+0x14/0x29
    Mar  3 02:13:49 Tower kernel: do_sys_openat2+0x6c/0xd9
    Mar  3 02:13:49 Tower kernel: do_sys_open+0x3a/0x5a
    Mar  3 02:13:49 Tower kernel: do_syscall_64+0x68/0x81
    Mar  3 02:13:49 Tower kernel: entry_SYSCALL_64_after_hwframe+0x64/0xce
    Mar  3 02:13:49 Tower kernel: RIP: 0033:0x14c2fabf19ef
    Mar  3 02:13:49 Tower kernel: Code: 89 4c 24 58 f6 c2 40 75 32 89 d0 45 31 d2 25 00 00 41 00 3d 00 00 41 00 74 21 80 3d f2 cb 0e 00 00 74 45 b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 85 00 00 00 48 83 c4 78 c3 48 8d 84 24 80
    Mar  3 02:13:49 Tower kernel: RSP: 002b:00007fff682c9fc0 EFLAGS: 00000202 ORIG_RAX: 0000000000000101
    Mar  3 02:13:49 Tower kernel: RAX: ffffffffffffffda RBX: 00007fff682ca13c RCX: 000014c2fabf19ef
    Mar  3 02:13:49 Tower kernel: RDX: 00000000000b0900 RSI: 000000000045d2d0 RDI: 000000000000000e
    Mar  3 02:13:49 Tower kernel: RBP: 000000000045d1d0 R08: 0000000000000073 R09: 0000000000000000
    Mar  3 02:13:49 Tower kernel: R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
    Mar  3 02:13:49 Tower kernel: R13: 0000000000000000 R14: 0000000000444c90 R15: 0000000000000004
    Mar  3 02:13:49 Tower kernel: </TASK>
    Mar  3 02:13:49 Tower kernel: Modules linked in: veth xt_nat xt_CHECKSUM ipt_REJECT nf_reject_ipv4 xt_tcpudp ip6table_mangle ip6table_nat iptable_mangle vhost_net tun vhost vhost_iotlb xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 xt_addrtype br_netfilter bridge nvidia_uvm(PO) xfs dm_crypt dm_mod nfsd auth_rpcgss oid_registry lockd grace sunrpc md_mod zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) tcp_diag inet_diag ipmi_devintf nct6775 nct6775_core hwmon_vid ip6table_filter ip6_tables iptable_filter ip_tables x_tables efivarfs macvtap macvlan tap 8021q garp mrp stp llc igc nvidia_drm(PO) nvidia_modeset(PO) intel_rapl_msr intel_rapl_common iosf_mbi x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nvidia(PO) crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 sha256_ssse3 sha1_ssse3 aesni_intel ast drm_vram_helper i2c_algo_bit
    Mar  3 02:13:49 Tower kernel: drm_ttm_helper crypto_simd ttm cryptd drm_kms_helper mei_hdcp mei_pxp i2c_i801 rapl intel_cstate drm ipmi_ssif mpt3sas agpgart mei_me ahci cdc_ether wmi_bmof i2c_smbus nvme tpm_crb syscopyarea input_leds raid_class sr_mod usbnet sysfillrect intel_uncore sysimgblt i2c_core mei joydev led_class nvme_core cdrom libahci mii scsi_transport_sas vmd acpi_ipmi fb_sys_fops thermal fan video tpm_tis tpm_tis_core ipmi_si wmi backlight tpm intel_pmc_core acpi_tad acpi_pad button unix [last unloaded: igc]
    Mar  3 02:13:49 Tower kernel: ---[ end trace 0000000000000000 ]---
    Mar  3 02:13:49 Tower kernel: RIP: 0010:do_dentry_open+0x206/0x304
    Mar  3 02:13:49 Tower kernel: Code: 43 44 a8 04 74 11 48 8b 53 28 48 83 7a 08 00 75 06 83 e0 fb 89 43 44 48 8b 8b d0 00 00 00 48 8b 81 90 00 00 00 48 85 c0 74 0e <48> 83 78 58 00 74 07 81 4b 44 00 00 40 00 8b 53 40 89 d0 25 3f fc
    Mar  3 02:13:49 Tower kernel: RSP: 0018:ffffc90034467cd8 EFLAGS: 00010282
    Mar  3 02:13:49 Tower kernel: RAX: c350ffff8881e30d RBX: ffff8887885be300 RCX: ffff8881e30dc2c2
    Mar  3 02:13:49 Tower kernel: RDX: ffffffffa485a140 RSI: 0000000000000000 RDI: 00000000ffffffff
    Mar  3 02:13:49 Tower kernel: RBP: 0000000000000000 R08: ffffffffa4820098 R09: ffffffffa482090f
    Mar  3 02:13:49 Tower kernel: R10: 0000000000000000 R11: ffff88814f199268 R12: ffff8881e30dc138
    Mar  3 02:13:49 Tower kernel: R13: ffff8887885be310 R14: ffffffffa4824635 R15: 0000000000000000
    Mar  3 02:13:49 Tower kernel: FS:  000014c2faaeb740(0000) GS:ffff889fffb00000(0000) knlGS:0000000000000000
    Mar  3 02:13:49 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Mar  3 02:13:49 Tower kernel: CR2: 0000000000475048 CR3: 0000000455dbe000 CR4: 0000000000750ee0
    Mar  3 02:13:49 Tower kernel: PKRU: 5555555
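
    The 02:13:49 general protection fault can be cross-checked with a little arithmetic. The following is a minimal sketch under two assumptions: that the machine uses 48-bit (4-level paging) virtual addresses, and that the marked instruction bytes `48 83 78 58 00` in the Code: line decode to `cmpq $0x0,0x58(%rax)`. `is_canonical` is a hypothetical helper name.

```python
def is_canonical(addr: int) -> bool:
    """True if bits 63..47 are all equal (assumes 48-bit virtual addresses)."""
    top = addr >> 47  # the 17 most significant bits of a 64-bit address
    return top == 0 or top == 0x1FFFF


rax = 0xC350FFFF8881E30D    # RAX from the register dump
fault = 0xC350FFFF8881E365  # address reported in the "general protection fault" line
disp = 0x58                 # assumption: displacement decoded from bytes 48 83 78 58 00

print(hex(rax + disp))                   # address the instruction would touch
print(rax + disp == fault)               # does it match the reported address?
print(is_canonical(fault))               # is the reported address canonical?
print(is_canonical(0xFFFF8887885BE300))  # RBX, for comparison
```

    Under those assumptions the reported address is exactly RAX + 0x58, which suggests the invalid value was already in RAX when the instruction ran. That alone does not say whether the value was corrupted by software or by hardware.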

     

    Rolling back to 6.12.6 has resolved the issue. There are no call traces or segfaults there, with the rest of the system in the exact same state, including plugins, configuration, Docker, etc.

     

    System information:

    Model:	Custom
    M/B:	ASUSTeK COMPUTER INC. Pro WS W680-ACE IPMI Version Rev 1.xx
    BIOS:	American Megatrends Inc. Version 3302 Dated 02/21/2024
    CPU:	13th Gen Intel® Core™ i9-13900K @ 5445 MHz
    HVM:	Enabled
    IOMMU:	Enabled
    Cache:	L1 Cache: 384 KiB, L1 Cache: 256 KiB, L2 Cache: 16 MiB, L3 Cache: 36 MiB, L1 Cache: 512 KiB, L1 Cache: 1 MiB, L2 Cache: 16 MiB, L3 Cache: 36 MiB
    Memory:	96 GiB DDR5 Single-bit ECC (max. installable capacity 256 GiB)
    Network:	eth0: 1000 Mbps, full duplex, mtu 1500
    Kernel:	Linux 6.1.64-Unraid x86_64
    OpenSSL:	1.1.1v

     

    It appears that this is a global issue affecting multiple users:

     

    As such, this is likely an issue in the kernel shipped with 6.12.8, and it requires urgent attention.





    @trurl Unfortunately, collecting diagnostics was not possible, as the issue rendered the system unresponsive over every kind of connection (GUI, ssh and kvm). As such, only syslog collection was possible up to a point it would become unresponsive - which is in the OP.

    14 hours ago, curled said:

    only syslog collection was possible up to a point it would become unresponsive - which is in the OP.

    Collect diagnostics before the issue occurs; they contain system profile and settings information that may be relevant. It's not a universal issue that affects everyone, so anything that may help find commonalities is needed.


    Do you happen to have the Disk Location plugin? It appears that's what was causing the issue for me.


    I did have the Disk Location plugin installed. However, I have since moved my Unraid installation to different hardware (motherboard, CPU and RAM) and the issue has been resolved. So it appears this was a hardware issue.

     

    Since memtest showed no errors for 16 passes, I suspect it was either motherboard or CPU.

    8 minutes ago, curled said:

    Since memtest showed no errors for 16 passes, I suspect it was either motherboard or CPU.

    Possibly, but please note that memtest is only definitive when it finds errors; there are several accounts of users confirming RAM was the problem without memtest finding anything. When there are multiple RAM sticks, it's possible to test with just one, and if there are still issues, try a different one. That will basically rule out bad RAM.


    Personally, having been mentioned here, I could already identify one bad RAM module through memtest (I got several errors at the exact same memory address and bit, so it is very likely the RAM and nothing else). So definitely don't take my post as evidence that there's an issue in Unraid itself.

     

    I also had to run the tests for a very long time: it took about 30 hours to get the first error, and I'm now taking days to test each module individually. Memtest can take a long time, and dozens of passes, to find errors.

    Edited by river_system

    Yes, it appears this is a hardware issue.

     

    I've discovered more errors while stress testing with different RAM sticks. I get sporadic errors like the following, along with random segfaults and freezes:

    [   28.797207] SQUASHFS error: xz decompression failed, data probably corrupt
    [   28.797209] SQUASHFS error: Failed to read block 0x12eefd3c: -5
    [   28.797210] SQUASHFS error: Unable to read fragment cache entry [12eefd3c]
    [   28.797215] SQUASHFS error: Unable to read fragment cache entry [12eefd3c]
    [   28.797215] SQUASHFS error: Unable to read page, block 12eefd3c, size 94e8
    [   28.797217] SQUASHFS error: Unable to read fragment cache entry [12eefd3c]
    [   28.797217] SQUASHFS error: Unable to read page, block 12eefd3c, size 94e8

     

    I've tried at least 6 USB sticks, including one that works properly in another server, on all ports of the motherboard. Since all 4 RAM sticks cleared memtest and all of them cause these issues (even with reduced frequency / non-XMP), I can probably narrow this down to the CPU or the motherboard.

    Edited by curled

    So I've been testing this machine with various configurations and narrowed it down to a couple of BIOS settings that made it stable:

     

    Asus performance enhancement: Disabled

    DRAM Frequency: 4800

    DRAM Timings: 40-39-39-77 (I don't trust this motherboard's auto configuration, so that's why I entered manually, according to RAM manufacturer spec)

    CPU C-States: Disabled

    CPU Short & Long power: 125 W

     

    I think it's mostly about the last setting. As soon as I change it to auto or 253 W, the system becomes unstable. For example, even without call traces, with PL1 and PL2 set to 253 W, running a read-only (no-modify) xfs_repair in a loop would throw errors on a verified-correct filesystem after a few iterations:

     

    $ xfs_repair -n /dev/mapper/md1p1; while true; do xfs_repair -n /dev/mapper/md1p1 2>&1 | grep -E 'rewrite|CRC'; done
    Phase 1 - find and verify superblock...
    Phase 2 - using internal log
            - zero log...
            - scan filesystem freespace and inode maps...
            - found root inode chunk
    Phase 3 - for each AG...
            - scan (but don't clear) agi unlinked lists...
            - process known inodes and perform inode discovery...
            - agno = 0
            - agno = 1
            - agno = 2
            - agno = 3
            - process newly discovered inodes...
    Phase 4 - check for duplicate blocks...
            - setting up duplicate extent list...
            - check for inodes claiming duplicate blocks...
            - agno = 0
            - agno = 2
            - agno = 1
            - agno = 3
    No modify flag set, skipping phase 5
    Phase 6 - check inode connectivity...
            - traversing filesystem ...
            - traversal finished ...
            - moving disconnected inodes to lost+found ...
    Phase 7 - verify link counts...
    No modify flag set, skipping filesystem flush and exiting.
    bad CRC for inode 268674054, would rewrite
    bad CRC for inode 402653313, would rewrite
    Metadata CRC error detected at 0x469f20, xfs_agi block 0x15d4dd22/0x200
    agi has bad CRC for ag 3
    bad CRC for inode 134, would rewrite
    bad CRC for inode 402653361, would rewrite
    bad CRC for inode 173, would rewrite
    Metadata CRC error detected at 0x44228d, xfs_bnobt block 0xe8de8c8/0x1000
    bad CRC for inode 402653314, would rewrite
    bad CRC for inode 402653340
    bad CRC for inode 134217866, would rewrite
    bad CRC for inode 128, would rewrite
    bad CRC for inode 402653323, would rewrite
    bad CRC for inode 402653327, would rewrite
    Metadata CRC error detected at 0x47c68f, xfs_sb block 0xe8de8c0/0x200
    superblock has bad CRC for ag 2
    bad CRC for inode 268674052, would rewrite

     

    The same errors disappear under the 125 W power limit, along with all call traces.
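
    The idea behind the xfs_repair loop (repeat a deterministic, read-only check and treat any differing result as a sign of instability) can also be sketched without touching a disk. The following is a minimal illustration only, not a replacement for memtest or a proper stress tool. It assumes that re-hashing an in-memory buffer puts enough load on the CPU to be interesting, and `count_hash_mismatches` is a hypothetical helper name.

```python
import hashlib
import os


def count_hash_mismatches(data: bytes, rounds: int) -> int:
    """Hash the same buffer repeatedly and count results that differ from the first.

    On stable hardware the count is always 0, because SHA-256 is deterministic.
    """
    reference = hashlib.sha256(data).digest()
    mismatches = 0
    for _ in range(rounds):
        if hashlib.sha256(data).digest() != reference:
            mismatches += 1
    return mismatches


if __name__ == "__main__":
    # Buffer size and round count are arbitrary choices for illustration.
    buf = os.urandom(8 * 1024 * 1024)
    bad = count_hash_mismatches(buf, rounds=20)
    print(f"{bad} mismatching rounds out of 20")
```

    A non-zero count would point at the machine rather than at the data, in the same way that CRC errors which come and go on an unchanged filesystem do.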

     

    I have tested this with 3 power supplies, so I can rule the PSU out. I have also tested 4 DRAM modules one by one, and the only thing that made the system stable was setting the power limit to 125 W. I'm not completely sure why that's happening: possibly a faulty CPU, or something in the Linux kernel that makes it unstable at turbo boost frequencies?

    Edited by curled

    I have RMA'd the CPU, and the replacement does not have the issue. It turns out it was a hardware issue after all.




