Balor Posted June 15, 2023
Hello, my server has been stable for months, but lately I'm encountering a crash every week, or even every 12 hours. It's hard to understand what is actually causing it. I was finally able to capture the log at the time of the crash, and it's a kernel panic caused by dockerd:

Jun 15 12:51:46 Tower kern kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
Jun 15 12:51:46 Tower kern kernel: #PF: supervisor write access in kernel mode
Jun 15 12:51:46 Tower kern kernel: #PF: error_code(0x0002) - not-present page
Jun 15 12:51:46 Tower kern kernel: PGD 17fd79067 P4D 17fd79067 PUD 17fd7c067 PMD 0
Jun 15 12:51:46 Tower kern kernel: Oops: 0002 [#1] PREEMPT SMP NOPTI
Jun 15 12:51:46 Tower kern kernel: CPU: 1 PID: 4858 Comm: dockerd Not tainted 5.19.17-Unraid #2
Jun 15 12:51:46 Tower kern kernel: Hardware name: HC Technology.,Ltd. HCAR357-NR/HCAR357-NR, BIOS 5.14 09/09/2021
Jun 15 12:51:46 Tower kern kernel: RIP: 0010:do_raw_spin_lock+0x7/0x1a
Jun 15 12:51:46 Tower kern kernel: Code: c1 07 e9 11 c1 b4 00 31 c0 48 81 ff 78 ac 83 81 72 0c 31 c0 48 81 ff 70 b1 83 81 0f 92 c0 e9 f5 c0 b4 00 31 c0 ba 01 00 00 00 <f0> 0f b1 17 74 08 89 c6 e8 b3 03 00 00 90 e9 db c0 b4 00 8b 07 31
Jun 15 12:51:46 Tower kern kernel: RSP: 0018:ffffc90001a07d60 EFLAGS: 00010046
Jun 15 12:51:46 Tower kern kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Jun 15 12:51:46 Tower kern kernel: RDX: 0000000000000001 RSI: ffffc90001a07dc8 RDI: 0000000000000000
Jun 15 12:51:46 Tower kern kernel: RBP: ffffc90001a07dc8 R08: 0000000000000010 R09: 0000000000000000
Jun 15 12:51:46 Tower kern kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
Jun 15 12:51:46 Tower kern kernel: R13: ffff8881e7cfe300 R14: 0000000000000246 R15: 0000000000000008
Jun 15 12:51:46 Tower kern kernel: FS: 00001524017eb700(0000) GS:ffff888800c40000(0000) knlGS:0000000000000000
Jun 15 12:51:46 Tower kern kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 15 12:51:46 Tower kern kernel: CR2: 0000000000000000 CR3: 0000000155bd8000 CR4: 00000000003506e0
Jun 15 12:51:46 Tower kern kernel: Call Trace:
Jun 15 12:51:46 Tower kern kernel: <TASK>
Jun 15 12:51:46 Tower kern kernel: _raw_spin_lock_irqsave+0x2c/0x37
Jun 15 12:51:46 Tower kern kernel: prepare_to_wait_event+0x19/0xa0
Jun 15 12:51:46 Tower kern kernel: pipe_read+0x229/0x33e
Jun 15 12:51:46 Tower kern kernel: ? _raw_spin_rq_lock_irqsave+0x20/0x20
Jun 15 12:51:46 Tower kern kernel: new_sync_read+0x7c/0xb3
Jun 15 12:51:46 Tower kern kernel: ? 0xffffffff81000000
Jun 15 12:51:46 Tower kern kernel: vfs_read+0xc6/0x10c
Jun 15 12:51:46 Tower kern kernel: ksys_read+0x76/0xc2
Jun 15 12:51:46 Tower kern kernel: ? fpregs_assert_state_consistent+0x1d/0x41
Jun 15 12:51:46 Tower kern kernel: do_syscall_64+0x6b/0x81
Jun 15 12:51:46 Tower kern kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd
Jun 15 12:51:46 Tower kern kernel: RIP: 0033:0x4baa7b
Jun 15 12:51:46 Tower kern kernel: Code: e8 2a e5 fa ff eb 88 cc cc cc cc cc cc cc cc e8 db 2c fb ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
Jun 15 12:51:46 Tower kern kernel: RSP: 002b:000000c0183f6ad8 EFLAGS: 00000212 ORIG_RAX: 0000000000000000
Jun 15 12:51:46 Tower kern kernel: RAX: ffffffffffffffda RBX: 000000c00005e500 RCX: 00000000004baa7b
Jun 15 12:51:46 Tower kern kernel: RDX: 0000000000000008 RSI: 000000c0183f6bc0 RDI: 0000000000000096
Jun 15 12:51:46 Tower kern kernel: RBP: 000000c0183f6b28 R08: 0000000000000001 R09: 0000000000000000
Jun 15 12:51:46 Tower kern kernel: R10: 0000000000000008 R11: 0000000000000212 R12: 00000000004af3ed
Jun 15 12:51:46 Tower kern kernel: R13: 0000000000000000 R14: 000000c0107fcd00 R15: ffffffffffffffff
Jun 15 12:51:46 Tower kern kernel: </TASK>
Jun 15 12:51:46 Tower kern kernel: Modules linked in: xt_connmark xt_comment iptable_raw wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libchacha nfsv3 nfs xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle iptable_mangle vhost_net vhost vhost_iotlb tap veth xt_nat xt_tcpudp xt_conntrack nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter zstd zram zsmalloc xfs nfsd auth_rpcgss oid_registry lockd grace sunrpc md_mod xt_MASQUERADE xt_mark iptable_nat ip6table_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 tun tcp_diag inet_diag efivarfs ip6table_filter ip6_tables iptable_filter ip_tables x_tables af_packet 8021q garp mrp bridge stp llc bonding tls r8169 realtek amdgpu edac_mce_amd edac_core kvm_amd kvm gpu_sched i2c_algo_bit drm_ttm_helper ttm drm_display_helper drm_kms_helper crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd drm rapl k10temp ahci libahci
Jun 15 12:51:46 Tower kern kernel: i2c_piix4 agpgart i2c_core syscopyarea sysfillrect sysimgblt ccp fb_sys_fops nvme nvme_core thermal tpm_crb tpm_tis video tpm_tis_core backlight tpm button acpi_cpufreq unix [last unloaded: realtek]
Jun 15 12:51:46 Tower kern kernel: CR2: 0000000000000000
Jun 15 12:51:46 Tower kern kernel: ---[ end trace 0000000000000000 ]---
Jun 15 12:51:46 Tower kern kernel: RIP: 0010:do_raw_spin_lock+0x7/0x1a
Jun 15 12:51:46 Tower kern kernel: Code: c1 07 e9 11 c1 b4 00 31 c0 48 81 ff 78 ac 83 81 72 0c 31 c0 48 81 ff 70 b1 83 81 0f 92 c0 e9 f5 c0 b4 00 31 c0 ba 01 00 00 00 <f0> 0f b1 17 74 08 89 c6 e8 b3 03 00 00 90 e9 db c0 b4 00 8b 07 31
Jun 15 12:51:46 Tower kern kernel: RSP: 0018:ffffc90001a07d60 EFLAGS: 00010046
Jun 15 12:51:46 Tower kern kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Jun 15 12:51:46 Tower kern kernel: RDX: 0000000000000001 RSI: ffffc90001a07dc8 RDI: 0000000000000000
Jun 15 12:51:46 Tower kern kernel: RBP: ffffc90001a07dc8 R08: 0000000000000010 R09: 0000000000000000
Jun 15 12:51:46 Tower kern kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
Jun 15 12:51:46 Tower kern kernel: R13: ffff8881e7cfe300 R14: 0000000000000246 R15: 0000000000000008
Jun 15 12:51:46 Tower kern kernel: FS: 00001524017eb700(0000) GS:ffff888800c40000(0000) knlGS:0000000000000000
Jun 15 12:51:46 Tower kern kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 15 12:51:46 Tower kern kernel: CR2: 0000000000000000 CR3: 0000000155bd8000 CR4: 00000000003506e0
Jun 15 12:51:46 Tower kern kernel: note: dockerd[4858] exited with preempt_count 1

I'm attaching the diagnostics that I captured after a hard restart of the server following this crash. Has anybody encountered issues like this?
tower-diagnostics-20230615-1311.zip
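For anyone triaging a similar trace, here is a minimal sketch of pulling the headline fields out of an oops like the one above. The function name `summarize_oops` and its regular expressions are illustrative assumptions, not part of any Unraid or kernel tooling. Keep in mind that `Comm:` only names the task that was running when the fault hit, which is not necessarily the root cause.

```python
import re

def summarize_oops(log_text: str) -> dict:
    """Extract the fault type, address, task and faulting kernel symbol
    from a kernel oops captured in syslog (hypothetical helper)."""
    summary = {}

    # e.g. "BUG: kernel NULL pointer dereference, address: 0000000000000000"
    m = re.search(r"BUG: (.+?), address: ([0-9a-f]+)", log_text)
    if m:
        summary["bug"] = m.group(1)
        summary["address"] = m.group(2)

    # e.g. "CPU: 1 PID: 4858 Comm: dockerd Not tainted ..."
    m = re.search(r"PID: (\d+) Comm: (\S+)", log_text)
    if m:
        summary["pid"] = int(m.group(1))
        summary["comm"] = m.group(2)

    # First RIP in the kernel code segment (0010), e.g.
    # "RIP: 0010:do_raw_spin_lock+0x7/0x1a"; the user-space RIP (0033) is skipped.
    m = re.search(r"RIP: 0010:(\w+)\+", log_text)
    if m:
        summary["kernel_symbol"] = m.group(1)

    return summary


# A few lines copied from the log in this post.
SAMPLE = """\
Jun 15 12:51:46 Tower kern kernel: BUG: kernel NULL pointer dereference, address: 0000000000000000
Jun 15 12:51:46 Tower kern kernel: CPU: 1 PID: 4858 Comm: dockerd Not tainted 5.19.17-Unraid #2
Jun 15 12:51:46 Tower kern kernel: RIP: 0010:do_raw_spin_lock+0x7/0x1a
Jun 15 12:51:46 Tower kern kernel: RIP: 0033:0x4baa7b
"""

if __name__ == "__main__":
    print(summarize_oops(SAMPLE))
```

Comparing these summaries across several crashes can help: if the task and faulting symbol change from one crash to the next, that points away from a single piece of software.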
Balor Posted June 15, 2023
Could this be a symptom of my cache drive (SSD) dying? There is nothing in the SMART data that would explain this...
JorgeB Posted June 16, 2023
12 hours ago, Balor said: Could this be a symptom of my Cache Drive (SSD) dying ?
Unlikely. You could try to find out whether it's related to a specific container by starting them one at a time and letting each run for however long is needed to confirm.
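The one-at-a-time approach suggested above could be scripted roughly as follows. This is only a sketch under assumptions: it reads the suggestion as cumulative (each container stays running while the next is added), the names `staged_start` and `docker_start` and the container names in the usage example are hypothetical, and the only external command used is `docker start <name>`.

```python
import subprocess
import time
from typing import Callable, Iterable, List


def docker_start(name: str) -> None:
    """Start an existing container by name using the Docker CLI."""
    subprocess.run(["docker", "start", name], check=True)


def staged_start(
    names: Iterable[str],
    soak_seconds: float,
    start: Callable[[str], None] = docker_start,
    sleep: Callable[[float], None] = time.sleep,
    log: Callable[[str], None] = print,
) -> List[str]:
    """Start containers one at a time, waiting soak_seconds after each.

    If the server crashes during a soak period, the most recently started
    container is the first suspect (assumption: earlier ones already
    survived their own soak period).
    """
    started: List[str] = []
    for name in names:
        start(name)
        started.append(name)
        log(f"started {name}; now running: {', '.join(started)}")
        sleep(soak_seconds)
    return started


if __name__ == "__main__":
    # Hypothetical container names; soak each for 24 hours.
    staged_start(["plex", "nextcloud", "wireguard-pia"], soak_seconds=24 * 3600)
```

Since a hard crash takes the script down with it, the log output should go somewhere that survives a reset, otherwise the record of which container was started last is lost.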
Balor Posted June 16, 2023
Right now I've changed the filesystem of the cache disk to XFS and changed Docker to use a directory on it. We'll see if that helps in any way.
Balor Posted June 21, 2023
Well, it was neither BTRFS nor Docker. It was the power supply overheating and making the server unstable. I've improved the airflow and replaced the power supply with a better-quality one.
Balor Posted June 21, 2023
And it crashed again, or rather froze, which is probably the right word... I'll look into it when I'm home.
Balor Posted July 2, 2023
After trying everything, I simply rolled back the latest container I had set up: https://github.com/thrnz/docker-wireguard-pia
Using Docker containers that redirect their traffic through another container was causing my stability issues. Since I removed this container and the network setup, everything has been stable again.
Balor Posted July 25, 2023 (Solution)
Well, in the end it was a hardware issue: the memory was failing. I replaced those sticks with good ones from Corsair, and since then there have been no more issues. The previous sticks look like they were overheating (I can see that the server temperatures have dropped drastically since I replaced them).