Random Server Lockups


aeleos
Solved by JorgeB


I have been battling random lockups on my usually stable server. The only change I made recently was adding some Docker containers that have relatively high CPU and memory usage. I was able to capture some logs using a remote syslog server, since nothing gets written locally once the lockup happens. I noticed some entries with 'shfs invoked oom-killer: ' errors, but I was seeing them both before and during the lockup. The logs from around the time of the crash contain the trace below, followed by a series of OOM errors, with the full lockup occurring around 45 minutes later. I can post the full remote syslog if someone thinks it would help, but I would need to anonymize it first. I have run out of ideas other than maybe replacing the 32GB of RAM with 64GB, or getting a new CPU/motherboard, as I can only suspect it's a hardware issue.

 

Jan 24 09:46:51 Tower kernel: BUG: kernel NULL pointer dereference, address: 0000000000000116
Jan 24 09:46:51 Tower kernel: #PF: supervisor read access in kernel mode
Jan 24 09:46:51 Tower kernel: #PF: error_code(0x0000) - not-present page
Jan 24 09:46:51 Tower kernel: PGD 164e69067 P4D 164e69067 PUD 59bea3067 PMD 0 
Jan 24 09:46:51 Tower kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Jan 24 09:46:51 Tower kernel: CPU: 10 PID: 27565 Comm: traefik Tainted: G           O      5.19.17-Unraid #2
Jan 24 09:46:51 Tower kernel: Hardware name: Gigabyte Technology Co., Ltd. X470 AORUS GAMING 5 WIFI/X470 AORUS GAMING 5 WIFI-CF, BIOS F63a 02/17/2022
Jan 24 09:46:51 Tower kernel: RIP: 0010:folio_try_get_rcu+0x0/0x21
Jan 24 09:46:51 Tower kernel: Code: e8 9d fd 67 00 48 8b 84 24 80 00 00 00 65 48 2b 04 25 28 00 00 00 74 05 e8 c1 35 69 00 48 81 c4 88 00 00 00 5b e9 ef 59 a6 00 <8b> 57 34 85 d2 74 10 8d 4a 01 89 d0 f0 0f b1 4f 34 74 04 89 c2 eb
Jan 24 09:46:51 Tower kernel: RSP: 0000:ffffc90001dd7cc0 EFLAGS: 00010246
Jan 24 09:46:51 Tower kernel: RAX: 00000000000000e2 RBX: 00000000000000e2 RCX: 00000000000000e2
Jan 24 09:46:51 Tower kernel: RDX: 0000000000000001 RSI: ffff88830490afe8 RDI: 00000000000000e2
Jan 24 09:46:51 Tower kernel: RBP: 0000000000000000 R08: 000000000000003c R09: ffffc90001dd7cd0
Jan 24 09:46:51 Tower kernel: R10: ffffc90001dd7cd0 R11: ffffc90001dd7d48 R12: 0000000000000000
Jan 24 09:46:51 Tower kernel: R13: ffff888186926f38 R14: 0000000000004dfe R15: ffff888186926f40
Jan 24 09:46:51 Tower kernel: FS:  000000c000570090(0000) GS:ffff88881ea80000(0000) knlGS:0000000000000000
Jan 24 09:46:51 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 24 09:46:51 Tower kernel: CR2: 0000000000000116 CR3: 00000002fb6ce000 CR4: 00000000003506e0
Jan 24 09:46:51 Tower kernel: Call Trace:
Jan 24 09:46:51 Tower kernel: <TASK>
Jan 24 09:46:51 Tower kernel: __filemap_get_folio+0x98/0x1ff
Jan 24 09:46:51 Tower kernel: filemap_fault+0x6e/0x524
Jan 24 09:46:51 Tower kernel: __do_fault+0x30/0x6e
Jan 24 09:46:51 Tower kernel: __handle_mm_fault+0x9a5/0xc7d
Jan 24 09:46:51 Tower kernel: ? __fget_light+0x3d/0x4c
Jan 24 09:46:51 Tower kernel: handle_mm_fault+0x113/0x1d7
Jan 24 09:46:51 Tower kernel: do_user_addr_fault+0x36a/0x514
Jan 24 09:46:51 Tower kernel: exc_page_fault+0xfc/0x11e
Jan 24 09:46:51 Tower kernel: asm_exc_page_fault+0x22/0x30
Jan 24 09:46:51 Tower kernel: RIP: 0033:0x45f173
Jan 24 09:46:51 Tower kernel: Code: 94 24 08 01 00 00 48 39 c6 0f 8e d8 0b 00 00 4c 89 9c 24 00 01 00 00 4d 89 e0 4c 8b a4 24 08 03 00 00 4c 8b 9c 24 10 03 00 00 <41> 83 7c 24 14 00 0f 84 b1 0b 00 00 4c 89 9c 24 f0 02 00 00 4c 89
Jan 24 09:46:51 Tower kernel: RSP: 002b:000000c00058d8e0 EFLAGS: 00010206
Jan 24 09:46:51 Tower kernel: RAX: 0000000000000005 RBX: 0000000000000000 RCX: 000000c000bfa4e0
Jan 24 09:46:51 Tower kernel: RDX: 0000000000c08500 RSI: 000000007fffffff RDI: 0000000000000000
Jan 24 09:46:51 Tower kernel: RBP: 000000c00058dc40 R08: 0000000000000000 R09: 000000000043ce36
Jan 24 09:46:51 Tower kernel: R10: 000000c000c08498 R11: 0000000005d7be80 R12: 00000000051fe940
Jan 24 09:46:51 Tower kernel: R13: 0000000000000000 R14: 000000c000561380 R15: 0000000000000000
Jan 24 09:46:51 Tower kernel: </TASK>
Jan 24 09:46:51 Tower kernel: Modules linked in: vhost_net tun vhost tap kvm_amd ccp kvm xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle vhost_iotlb veth xt_nat xt_tcpudp xt_conntrack nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter xfs dm_crypt dm_mod dax md_mod it87 hwmon_vid efivarfs iptable_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libchacha ip6table_filter ip6_tables iptable_filter ip_tables x_tables af_packet 8021q garp mrp bridge stp llc bonding tls mpt3sas igb btusb btrtl btbcm raid_class gigabyte_wmi wmi_bmof mxm_wmi edac_mce_amd edac_core crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd rapl btintel k10temp bluetooth nvme i2c_algo_bit i2c_piix4 apex(O) scsi_transport_sas gasket(O) i2c_core ahci nvme_core ecdh_generic ecc libahci thermal
Jan 24 09:46:51 Tower kernel: tpm_crb tpm_tis tpm_tis_core tpm wmi button unix [last unloaded: tun]
Jan 24 09:46:51 Tower kernel: CR2: 0000000000000116
Jan 24 09:46:51 Tower kernel: ---[ end trace 0000000000000000 ]---
Jan 24 09:46:51 Tower kernel: RIP: 0010:folio_try_get_rcu+0x0/0x21
Jan 24 09:46:51 Tower kernel: Code: e8 9d fd 67 00 48 8b 84 24 80 00 00 00 65 48 2b 04 25 28 00 00 00 74 05 e8 c1 35 69 00 48 81 c4 88 00 00 00 5b e9 ef 59 a6 00 <8b> 57 34 85 d2 74 10 8d 4a 01 89 d0 f0 0f b1 4f 34 74 04 89 c2 eb
Jan 24 09:46:51 Tower kernel: RSP: 0000:ffffc90001dd7cc0 EFLAGS: 00010246
Jan 24 09:46:51 Tower kernel: RAX: 00000000000000e2 RBX: 00000000000000e2 RCX: 00000000000000e2
Jan 24 09:46:51 Tower kernel: RDX: 0000000000000001 RSI: ffff88830490afe8 RDI: 00000000000000e2
Jan 24 09:46:51 Tower kernel: RBP: 0000000000000000 R08: 000000000000003c R09: ffffc90001dd7cd0
Jan 24 09:46:51 Tower kernel: R10: ffffc90001dd7cd0 R11: ffffc90001dd7d48 R12: 0000000000000000
Jan 24 09:46:51 Tower kernel: R13: ffff888186926f38 R14: 0000000000004dfe R15: ffff888186926f40
Jan 24 09:46:51 Tower kernel: FS:  000000c000570090(0000) GS:ffff88881ea80000(0000) knlGS:0000000000000000
Jan 24 09:46:51 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 24 09:46:51 Tower kernel: CR2: 0000000000000116 CR3: 00000002fb6ce000 CR4: 00000000003506e0
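For anyone sifting through a long remote syslog like this one, a rough sketch of a filter that counts which processes invoked the oom-killer and lists the kernel BUG lines with their timestamps might look like the following. The line layout ("Mon DD HH:MM:SS host kernel: message") is assumed from the excerpt above, and the function name `summarize` is made up for illustration; a syslog server that writes a different format would need a different pattern.

```python
import re
from collections import Counter

# Assumed layout, based on the excerpt above:
#   "Jan 24 09:46:51 Tower kernel: <message>"
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+kernel:\s+(.*)$"
)
# Matches messages such as "shfs invoked oom-killer: ..."
OOM_RE = re.compile(r"(\S+) invoked oom-killer")
# Matches messages such as "BUG: kernel NULL pointer dereference, ..."
BUG_RE = re.compile(r"^BUG: (.*)$")


def summarize(lines):
    """Return (oom_counts, bug_events) for an iterable of syslog lines.

    oom_counts maps a process name to how often it invoked the oom-killer.
    bug_events is a list of (timestamp, description) for kernel BUG lines.
    Lines that are not kernel messages in the assumed layout are skipped.
    """
    oom_counts = Counter()
    bug_events = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if not match:
            continue
        timestamp, _host, message = match.groups()
        oom = OOM_RE.search(message)
        if oom:
            oom_counts[oom.group(1)] += 1
        bug = BUG_RE.match(message)
        if bug:
            bug_events.append((timestamp, bug.group(1)))
    return oom_counts, bug_events
```

Passing an open file object for the saved syslog to `summarize` gives a quick view of whether the OOM events cluster before the first kernel BUG or only appear afterwards. It does not diagnose anything on its own.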

 

tower-diagnostics-20230124-1604.zip
