
Unraid Kernel Crashing After 6.11.# Upgrade


Solved by JorgeB


Recently, after installing Unraid 6.11.0 (and subsequently 6.11.1 in hopes of fixing this issue), my server started reporting a kernel NULL pointer dereference after about 1-2 days of uptime. This causes:

  • The WebUI to become unresponsive (loading indefinitely after "successful login")
  • Some (but not all) of my dockers to become unresponsive
    • Dockers like plex and the *arrs seem to stay alive
    • Others become unresponsive via their web interfaces but are reported as up via a 'docker container ls' check.
  • Non-functional powerdown and poweroff commands (in an attempt to reboot without a parity check)


It does not seem to impact:

  • Operation of a VM I have running before, during, and after the error is reported (I'm typing on that VM right now and I had this issue pop up earlier this AM).
  • Connection to the server via SSH

 

For full transparency, I did upgrade my server's hardware (new CPU and added RAM) 2 weeks before the 6.11.0 upgrade, while still running 6.10.3. I ran memtest86 on all the RAM for 3 passes after installation, so I'm pretty sure that's OK. And 2 weeks of stable operation on Unraid 6.10.3 leads me to believe this is unrelated to my hardware upgrade, but I cannot be 100% sure.

 

Here is the relevant section of the syslog. Diagnostics attached (cogsworth-diagnostics-20221011-0736.zip). I have other diagnostics showing the same issue on both 6.11.0 and 6.11.1.

 

Any help from the gurus appreciated! 🙏

 

Quote

Oct 11 05:02:02 Cogsworth kernel: BUG: kernel NULL pointer dereference, address: 0000000000000076
Oct 11 05:02:02 Cogsworth kernel: #PF: supervisor read access in kernel mode
Oct 11 05:02:02 Cogsworth kernel: #PF: error_code(0x0000) - not-present page
Oct 11 05:02:02 Cogsworth kernel: PGD 0 P4D 0 
Oct 11 05:02:02 Cogsworth kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Oct 11 05:02:02 Cogsworth kernel: CPU: 3 PID: 29837 Comm: Disk Tainted: P           O      5.19.14-Unraid #1
Oct 11 05:02:02 Cogsworth kernel: Hardware name: System manufacturer System Product Name/ROG STRIX X570-E GAMING, BIOS 4403 04/27/2022
Oct 11 05:02:02 Cogsworth kernel: RIP: 0010:folio_try_get_rcu+0x0/0x21
Oct 11 05:02:02 Cogsworth kernel: Code: e8 8e 61 63 00 48 8b 84 24 80 00 00 00 65 48 2b 04 25 28 00 00 00 74 05 e8 9e 9b 64 00 48 81 c4 88 00 00 00 5b c3 cc cc cc cc <8b> 57 34 85 d2 74 10 8d 4a 01 89 d0 f0 0f b1 4f 34 74 04 89 c2 eb
Oct 11 05:02:02 Cogsworth kernel: RSP: 0000:ffffc90002f97cc0 EFLAGS: 00010246
Oct 11 05:02:02 Cogsworth kernel: RAX: 0000000000000042 RBX: 0000000000000042 RCX: 0000000000000042
Oct 11 05:02:02 Cogsworth kernel: RDX: 0000000000000001 RSI: ffff888292672da0 RDI: 0000000000000042
Oct 11 05:02:02 Cogsworth kernel: RBP: 0000000000000000 R08: 0000000000000014 R09: ffffc90002f97cd0
Oct 11 05:02:02 Cogsworth kernel: R10: ffffc90002f97cd0 R11: ffffc90002f97d48 R12: 0000000000000000
Oct 11 05:02:02 Cogsworth kernel: R13: ffff888041111178 R14: 000000000011dd97 R15: ffff888041111180
Oct 11 05:02:02 Cogsworth kernel: FS:  0000149ac3e84b38(0000) GS:ffff88900e8c0000(0000) knlGS:0000000000000000
Oct 11 05:02:02 Cogsworth kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 11 05:02:02 Cogsworth kernel: CR2: 0000000000000076 CR3: 0000000bc4606000 CR4: 0000000000750ee0
Oct 11 05:02:02 Cogsworth kernel: PKRU: 55555554
Oct 11 05:02:02 Cogsworth kernel: Call Trace:
Oct 11 05:02:02 Cogsworth kernel: <TASK>
Oct 11 05:02:02 Cogsworth kernel: __filemap_get_folio+0x98/0x1ff
Oct 11 05:02:02 Cogsworth kernel: filemap_fault+0x6e/0x524
Oct 11 05:02:02 Cogsworth kernel: __do_fault+0x2d/0x6e
Oct 11 05:02:02 Cogsworth kernel: __handle_mm_fault+0x9a5/0xc7d
Oct 11 05:02:02 Cogsworth kernel: handle_mm_fault+0x113/0x1d7
Oct 11 05:02:02 Cogsworth kernel: do_user_addr_fault+0x36a/0x514
Oct 11 05:02:02 Cogsworth kernel: exc_page_fault+0xfc/0x11e
Oct 11 05:02:02 Cogsworth kernel: asm_exc_page_fault+0x22/0x30
Oct 11 05:02:02 Cogsworth kernel: RIP: 0033:0x149ac6f427b5
Oct 11 05:02:02 Cogsworth kernel: Code: 8b 48 08 48 8b 32 48 8b 00 48 39 f0 73 09 48 8d 14 08 48 39 d6 eb 0c 48 39 c6 73 0b 48 8d 14 0e 48 39 d0 73 02 0f 0b 48 89 c7 <f3> a4 66 48 8d 3d 59 b7 22 00 66 66 48 e8 d9 d8 f6 ff 48 89 28 48
Oct 11 05:02:02 Cogsworth kernel: RSP: 002b:0000149ac3e83960 EFLAGS: 00010216
Oct 11 05:02:02 Cogsworth kernel: RAX: 00001474db777770 RBX: 0000149ac3e83ad0 RCX: 0000000000004000
Oct 11 05:02:02 Cogsworth kernel: RDX: 000014730499b866 RSI: 0000147304997866 RDI: 00001474db777770
Oct 11 05:02:02 Cogsworth kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000149ac3e83778
Oct 11 05:02:02 Cogsworth kernel: R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
Oct 11 05:02:02 Cogsworth kernel: R13: 0000149ac3e83b40 R14: 0000149ac40dc3d0 R15: 0000149ac3e83ac0
Oct 11 05:02:02 Cogsworth kernel: </TASK>
Oct 11 05:02:02 Cogsworth kernel: Modules linked in: xt_CHECKSUM ipt_REJECT nf_reject_ipv4 ip6table_mangle ip6table_nat iptable_mangle vhost_net vhost vhost_iotlb tap tun nvidia_uvm(PO) veth xt_nat xt_tcpudp xt_conntrack nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype br_netfilter xfs dm_crypt dm_mod dax md_mod nct6775 nct6775_core hwmon_vid iptable_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 wireguard curve25519_x86_64 libcurve25519_generic libchacha20poly1305 chacha_x86_64 poly1305_x86_64 ip6_udp_tunnel udp_tunnel libchacha ip6table_filter ip6_tables iptable_filter ip_tables x_tables bridge stp llc bonding tls ipv6 igb i2c_algo_bit r8169 realtek nvidia_drm(PO) nvidia_modeset(PO) mxm_wmi wmi_bmof asus_ec_sensors nvidia(PO) edac_mce_amd edac_core kvm_amd kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd rapl k10temp i2c_piix4 drm_kms_helper ccp drm mpt3sas backlight syscopyarea sysfillrect ahci sysimgblt fb_sys_fops
Oct 11 05:02:02 Cogsworth kernel: libahci corsair_cpro i2c_core raid_class scsi_transport_sas tpm_crb tpm_tis tpm_tis_core tpm wmi button acpi_cpufreq unix [last unloaded: i2c_algo_bit]
Oct 11 05:02:02 Cogsworth kernel: CR2: 0000000000000076
Oct 11 05:02:02 Cogsworth kernel: ---[ end trace 0000000000000000 ]---
Oct 11 05:02:02 Cogsworth kernel: RIP: 0010:folio_try_get_rcu+0x0/0x21
Oct 11 05:02:02 Cogsworth kernel: Code: e8 8e 61 63 00 48 8b 84 24 80 00 00 00 65 48 2b 04 25 28 00 00 00 74 05 e8 9e 9b 64 00 48 81 c4 88 00 00 00 5b c3 cc cc cc cc <8b> 57 34 85 d2 74 10 8d 4a 01 89 d0 f0 0f b1 4f 34 74 04 89 c2 eb
Oct 11 05:02:02 Cogsworth kernel: RSP: 0000:ffffc90002f97cc0 EFLAGS: 00010246
Oct 11 05:02:02 Cogsworth kernel: RAX: 0000000000000042 RBX: 0000000000000042 RCX: 0000000000000042
Oct 11 05:02:02 Cogsworth kernel: RDX: 0000000000000001 RSI: ffff888292672da0 RDI: 0000000000000042
Oct 11 05:02:02 Cogsworth kernel: RBP: 0000000000000000 R08: 0000000000000014 R09: ffffc90002f97cd0
Oct 11 05:02:02 Cogsworth kernel: R10: ffffc90002f97cd0 R11: ffffc90002f97d48 R12: 0000000000000000
Oct 11 05:02:02 Cogsworth kernel: R13: ffff888041111178 R14: 000000000011dd97 R15: ffff888041111180
Oct 11 05:02:02 Cogsworth kernel: FS:  0000149ac3e84b38(0000) GS:ffff88900e8c0000(0000) knlGS:0000000000000000
Oct 11 05:02:02 Cogsworth kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 11 05:02:02 Cogsworth kernel: CR2: 0000000000000076 CR3: 0000000bc4606000 CR4: 0000000000750ee0
Oct 11 05:02:02 Cogsworth kernel: PKRU: 55555554
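
For anyone comparing their own syslog against this one, here is a minimal Python sketch that pulls the key facts out of an oops like the one above: the fault address, the faulting kernel symbol, the RDI register, and the call trace. The helper name `summarize_oops` and the `SAMPLE` string are hypothetical; the sample is just a few lines copied from the trace above, and the parsing assumes the same "date host kernel: message" layout. The last step relies on my reading of the marked bytes `<8b> 57 34` in the Code line as a 32-bit load from RDI + 0x34, which would explain why the fault address is 0x76 (0x42 + 0x34) rather than 0.

```python
import re

# A few lines copied from the trace above (assumption: same syslog prefix layout).
SAMPLE = """\
Oct 11 05:02:02 Cogsworth kernel: BUG: kernel NULL pointer dereference, address: 0000000000000076
Oct 11 05:02:02 Cogsworth kernel: RIP: 0010:folio_try_get_rcu+0x0/0x21
Oct 11 05:02:02 Cogsworth kernel: RAX: 0000000000000042 RBX: 0000000000000042 RCX: 0000000000000042
Oct 11 05:02:02 Cogsworth kernel: RDX: 0000000000000001 RSI: ffff888292672da0 RDI: 0000000000000042
Oct 11 05:02:02 Cogsworth kernel: Call Trace:
Oct 11 05:02:02 Cogsworth kernel: <TASK>
Oct 11 05:02:02 Cogsworth kernel: __filemap_get_folio+0x98/0x1ff
Oct 11 05:02:02 Cogsworth kernel: filemap_fault+0x6e/0x524
Oct 11 05:02:02 Cogsworth kernel: </TASK>
"""

SYMBOL_RE = re.compile(r"[\w.]+\+0x[0-9a-f]+/0x[0-9a-f]+$")


def summarize_oops(text):
    """Hypothetical helper: extract fault address, kernel RIP, RDI and call trace."""
    summary = {"address": None, "rip": None, "rdi": None, "trace": []}
    in_trace = False
    for line in text.splitlines():
        # Strip the "date host kernel: " prefix, keeping only the message.
        msg = line.split("kernel: ", 1)[-1].strip()

        m = re.search(r"NULL pointer dereference, address: ([0-9a-f]+)", msg)
        if m:
            summary["address"] = int(m.group(1), 16)

        # Segment 0010 is the kernel code segment; keep the first occurrence only.
        m = re.match(r"RIP: 0010:(\S+)", msg)
        if m and summary["rip"] is None:
            summary["rip"] = m.group(1)

        m = re.search(r"RDI: ([0-9a-f]+)", msg)
        if m and summary["rdi"] is None:
            summary["rdi"] = int(m.group(1), 16)

        if msg == "<TASK>":
            in_trace = True
        elif msg == "</TASK>":
            in_trace = False
        elif in_trace and SYMBOL_RE.match(msg):
            summary["trace"].append(msg)
    return summary


info = summarize_oops(SAMPLE)
# Distance between the faulting address (CR2) and the pointer held in RDI.
offset = info["address"] - info["rdi"]
print(f"fault at {info['address']:#x} in {info['rip']}")
print(f"RDI = {info['rdi']:#x}, offset from RDI = {offset:#x}")
print("call trace:", " <- ".join(info["trace"]))
```

If that reading is right, the kernel was handed a small bogus pointer (0x42) while looking up a page cache entry, not a literal NULL. That alone does not say whether the cause is software or hardware, but it gives a concrete signature to compare across diagnostics.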

 


I'm having the same issue on 6.11.1. I never had it at any time on 6.10 through 6.10.3.

 

Can confirm some of the things you posted.

  • WebUI doesn't want to load (sometimes). It's weird: first I notice some of my docker WebUIs won't load, and then it usually happens to Unraid's WebUI as well. I get "execution error" if I try to stop or restart the dockers.
  • SSH still works. This is how I reboot when the webui doesn't load.

Unlike you, I haven't changed any of my hardware in over a year while being on 6.9.x - 6.10.x

 

Here are 3 of my diagnostics. I hope they can help get this issue resolved.

impulse-diagnostics-20221011-2255.zip impulse-diagnostics-20221011-1334.zip impulse-diagnostics-20221002-1448.zip


@JorgeB Thanks for raising this issue. I will mark this thread as solved while it is tracked on the bug report you created.

 

As an aside, @trurl pointed a couple of us to his Unraid 6 FAQ in another thread regarding stability issues with the Ryzen processor family. I ran a Ryzen 7 3800X for 3 years without issue before my upgrade to a Ryzen 9 5950X. Still, to be thorough and check the boxes, I underclocked my memory (to 2133 MHz, which is what my board's Auto setting selects) and changed the C-State setting from the default Auto, which worked on my 3800X, to "typical current idle". The same bug showed up, so I believe this is unrelated.

 

I am currently letting my server run a 4th parity check (due to all the dirty shutdowns) in Safe Mode, with the RAM set to 2133 MHz and the C-State setting back on Auto. I'll let it sit like that for a couple more days (I'm thinking 60-72 hrs of uptime) to see if the bug shows up in Safe Mode. If not, I will start working through the plugins @JorgeB itemized in the bug report.
