  • [6.12.x] Call traces on modprobe i915 with Intel 13th gen, Gigabyte Z690 UD DDR4 board with VT-d enabled


    MellowB
    • Urgent

    I recently tried upgrading from 6.11.5 to 6.12.6 and got call traces on boot. After boot, GPU Statistics could not read anything from the iGPU: it would detect the AlderLake iGPU, but that's it. Running modprobe i915 in the console would freeze the task, as would powertop --auto-tune or manually changing the BAD/GOOD state of the iGPU in powertop itself. The system also freezes on reboot; I'm not sure where, but it's probably also iGPU related.

    Call trace here:

    Dec  3 10:54:57 Tower kernel: i915 0000:00:02.0: [drm] VT-d active for gfx access
    Dec  3 10:54:57 Tower kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
    Dec  3 10:54:57 Tower kernel: #PF: supervisor read access in kernel mode
    Dec  3 10:54:57 Tower kernel: #PF: error_code(0x0000) - not-present page
    Dec  3 10:54:57 Tower kernel: PGD 10670e067 P4D 10670e067 PUD 10670f067 PMD 0 
    Dec  3 10:54:57 Tower kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
    Dec  3 10:54:57 Tower kernel: CPU: 16 PID: 1063 Comm: udevd Tainted: G           O       6.1.64-Unraid #1
    Dec  3 10:54:57 Tower kernel: Hardware name: Gigabyte Technology Co., Ltd. Z690 UD DDR4/Z690 UD DDR4, BIOS F27 09/12/2023
    Dec  3 10:54:57 Tower kernel: RIP: 0010:kernfs_root+0x0/0x14
    Dec  3 10:54:57 Tower kernel: Code: 89 cb 44 8b 6c 24 30 e8 24 32 fb ff 48 8b bd 58 02 00 00 48 89 da 5b 48 89 c6 5d 4c 89 e1 45 89 e8 41 5c 41 5d e9 66 ff ff ff <48> 8b 47 08 48 85 c0 48 0f 45 f8 48 8b 47 50 c3 cc cc cc cc 0f 1f
    Dec  3 10:54:57 Tower kernel: RSP: 0018:ffffc900007e3a88 EFLAGS: 00010286
    Dec  3 10:54:57 Tower kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00000000ffffffff
    Dec  3 10:54:57 Tower kernel: RDX: 0000000000000000 RSI: ffffffff81e987e8 RDI: 0000000000000000
    Dec  3 10:54:57 Tower kernel: RBP: ffffffff81e986a0 R08: 0000000000000000 R09: ffffffff829513f0
    Dec  3 10:54:57 Tower kernel: R10: 00003fffffffffff R11: fefefefefefefeff R12: ffffffff82335da0
    Dec  3 10:54:57 Tower kernel: R13: ffff888101826000 R14: ffff888103fc9b50 R15: ffff8881018260d0
    Dec  3 10:54:57 Tower kernel: FS:  000014e96b2f8240(0000) GS:ffff88907fa00000(0000) knlGS:0000000000000000
    Dec  3 10:54:57 Tower kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    Dec  3 10:54:57 Tower kernel: CR2: 0000000000000008 CR3: 0000000102f00000 CR4: 0000000000750ee0
    Dec  3 10:54:57 Tower kernel: PKRU: 55555554
    Dec  3 10:54:57 Tower kernel: Call Trace:
    Dec  3 10:54:57 Tower kernel: <TASK>
    Dec  3 10:54:57 Tower kernel: ? __die_body+0x1a/0x5c
    Dec  3 10:54:57 Tower kernel: ? page_fault_oops+0x329/0x376
    Dec  3 10:54:57 Tower kernel: ? do_user_addr_fault+0x12e/0x48d
    Dec  3 10:54:57 Tower kernel: ? exc_page_fault+0xfb/0x11d
    Dec  3 10:54:57 Tower kernel: ? asm_exc_page_fault+0x22/0x30
    Dec  3 10:54:57 Tower kernel: ? kernfs_vfs_xattr_set+0x41/0x41
    Dec  3 10:54:57 Tower kernel: ? notifier_call_chain+0x35/0x5a
    Dec  3 10:54:57 Tower kernel: kernfs_find_and_get_ns+0x1c/0x5c
    Dec  3 10:54:57 Tower kernel: sysfs_unmerge_group+0x16/0x4d
    Dec  3 10:54:57 Tower kernel: dpm_sysfs_remove+0x1e/0x52
    Dec  3 10:54:57 Tower kernel: device_del+0xa4/0x31d
    Dec  3 10:54:57 Tower kernel: ? i915_ggtt_probe_hw+0x593/0x5be [i915]
    Dec  3 10:54:57 Tower kernel: platform_device_del+0x21/0x70
    Dec  3 10:54:57 Tower kernel: platform_device_unregister+0xf/0x19
    Dec  3 10:54:57 Tower kernel: sysfb_disable+0x2b/0x54
    Dec  3 10:54:57 Tower kernel: aperture_remove_conflicting_pci_devices+0x1e/0x82
    Dec  3 10:54:57 Tower kernel: i915_driver_probe+0x83f/0xc19 [i915]
    Dec  3 10:54:57 Tower kernel: ? slab_free_freelist_hook.constprop.0+0x3b/0xaf
    Dec  3 10:54:57 Tower kernel: local_pci_probe+0x3d/0x81
    Dec  3 10:54:57 Tower kernel: pci_device_probe+0x197/0x1eb
    Dec  3 10:54:57 Tower kernel: ? sysfs_do_create_link_sd+0x71/0xb7
    Dec  3 10:54:57 Tower kernel: really_probe+0x115/0x282
    Dec  3 10:54:57 Tower kernel: __driver_probe_device+0xc0/0xf2
    Dec  3 10:54:57 Tower kernel: driver_probe_device+0x1f/0x77
    Dec  3 10:54:57 Tower kernel: ? __device_attach_driver+0x97/0x97
    Dec  3 10:54:57 Tower kernel: __driver_attach+0xd7/0xee
    Dec  3 10:54:57 Tower kernel: ? __device_attach_driver+0x97/0x97
    Dec  3 10:54:57 Tower kernel: bus_for_each_dev+0x6e/0xa7
    Dec  3 10:54:57 Tower kernel: bus_add_driver+0xd8/0x1d0
    Dec  3 10:54:57 Tower kernel: driver_register+0x99/0xd7
    Dec  3 10:54:57 Tower kernel: i915_init+0x1f/0x7f [i915]
    Dec  3 10:54:57 Tower kernel: ? 0xffffffffa0813000
    Dec  3 10:54:57 Tower kernel: do_one_initcall+0x82/0x19f
    Dec  3 10:54:57 Tower kernel: ? kmalloc_trace+0x43/0x52
    Dec  3 10:54:57 Tower kernel: do_init_module+0x4b/0x1d4
    Dec  3 10:54:57 Tower kernel: __do_sys_init_module+0xb6/0xf9
    Dec  3 10:54:57 Tower kernel: do_syscall_64+0x68/0x81
    Dec  3 10:54:57 Tower kernel: entry_SYSCALL_64_after_hwframe+0x64/0xce
    Dec  3 10:54:57 Tower kernel: RIP: 0033:0x14e96b80adfa
    Dec  3 10:54:57 Tower kernel: Code: 48 8b 0d 21 20 0d 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ee 1f 0d 00 f7 d8 64 89 01 48
    Dec  3 10:54:57 Tower kernel: RSP: 002b:00007ffc9e41b3d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
    Dec  3 10:54:57 Tower kernel: RAX: ffffffffffffffda RBX: 0000000000468c70 RCX: 000014e96b80adfa
    Dec  3 10:54:57 Tower kernel: RDX: 000014e96b8ffaad RSI: 00000000004b1868 RDI: 000014e96ac60010
    Dec  3 10:54:57 Tower kernel: RBP: 000014e96b8ffaad R08: 0000000000000007 R09: 0000000000464e80
    Dec  3 10:54:57 Tower kernel: R10: 0000000000000005 R11: 0000000000000246 R12: 000014e96ac60010
    Dec  3 10:54:57 Tower kernel: R13: 0000000000000000 R14: 000000000047c9f0 R15: 0000000000000000
    Dec  3 10:54:57 Tower kernel: </TASK>
    Dec  3 10:54:57 Tower kernel: Modules linked in: i915(+) mei_hdcp mei_pxp intel_rapl_msr gigabyte_wmi mxm_wmi wmi_bmof drm_buddy i2c_algo_bit ttm x86_pkg_temp_thermal intel_powerclamp coretemp drm_display_helper drm_kms_helper kvm_intel kvm drm crct10dif_pclmul processor_thermal_device_pci crc32_pclmul processor_thermal_device crc32c_intel btusb ghash_clmulni_intel sha512_ssse3 processor_thermal_rfim btrtl i2c_i801 sha256_ssse3 btbcm sha1_ssse3 aesni_intel crypto_simd cryptd rapl btintel intel_cstate intel_uncore intel_gtt bluetooth r8125(O) processor_thermal_mbox i2c_smbus mei_me nvme ahci processor_thermal_rapl agpgart ecdh_generic mei intel_rapl_common ecc libahci i2c_core int340x_thermal_zone tpm_crb nvme_core syscopyarea sysfillrect sysimgblt iosf_mbi video tpm_tis fb_sys_fops tpm_tis_core thermal(+) fan int3400_thermal tpm wmi backlight acpi_thermal_rel intel_pmc_core acpi_tad acpi_pad button unix
    Dec  3 10:54:57 Tower kernel: CR2: 0000000000000008
    Dec  3 10:54:57 Tower kernel: ---[ end trace 0000000000000000 ]---

     

    Full diagnostics are attached below.

    Disabling VT-d in BIOS lets the system boot and work, but I need VT-d, so I'm not sure what to do at this point.
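
    For readers trying to follow the call trace above, here is a minimal Python sketch (not part of the original report) that pulls the stack frames out of syslog lines like the ones posted. The regular expression, the `parse_call_trace` name, and the shortened `SAMPLE` excerpt are assumptions made for illustration. Frames prefixed with `?` are ones the kernel marks as unreliable guesses, so the sketch flags them separately. Read outermost-first, the reliable frames run from `i915_driver_probe` through `aperture_remove_conflicting_pci_devices` and `sysfb_disable` down to the sysfs/kernfs removal where the NULL pointer dereference is reported, which suggests (but does not prove) that the crash happens while i915 is removing the firmware framebuffer device.

```python
import re

# One call-trace frame, e.g.
#   "... kernel: ? i915_ggtt_probe_hw+0x593/0x5be [i915]"
# A leading "?" means the kernel considers the frame an unreliable guess.
FRAME_RE = re.compile(
    r"kernel:\s+(?P<guess>\?\s+)?(?P<symbol>[A-Za-z_][\w.]*)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<module>\w+)\])?\s*$"
)

# Shortened excerpt of the trace from the post (illustration only).
SAMPLE = """\
Dec  3 10:54:57 Tower kernel: <TASK>
Dec  3 10:54:57 Tower kernel: ? notifier_call_chain+0x35/0x5a
Dec  3 10:54:57 Tower kernel: kernfs_find_and_get_ns+0x1c/0x5c
Dec  3 10:54:57 Tower kernel: sysfs_unmerge_group+0x16/0x4d
Dec  3 10:54:57 Tower kernel: sysfb_disable+0x2b/0x54
Dec  3 10:54:57 Tower kernel: aperture_remove_conflicting_pci_devices+0x1e/0x82
Dec  3 10:54:57 Tower kernel: i915_driver_probe+0x83f/0xc19 [i915]
Dec  3 10:54:57 Tower kernel: </TASK>
"""


def parse_call_trace(log_text):
    """Return (symbol, module, reliable) for each frame between <TASK> and </TASK>."""
    frames = []
    in_trace = False
    for line in log_text.splitlines():
        if "</TASK>" in line:
            break
        if "<TASK>" in line:
            in_trace = True
            continue
        if not in_trace:
            continue
        match = FRAME_RE.search(line)
        if match:
            frames.append(
                (match.group("symbol"), match.group("module"), match.group("guess") is None)
            )
    return frames


if __name__ == "__main__":
    # Print outermost caller first, innermost (closest to the fault) last.
    for symbol, module, reliable in reversed(parse_call_trace(SAMPLE)):
        origin = module or "core kernel"
        note = "" if reliable else " (unreliable frame)"
        print(f"{symbol} [{origin}]{note}")
```

    Register dumps, `Code:` lines, and raw addresses do not match the frame pattern, so the full syslog excerpt can be passed in unchanged.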

    tower-diagnostics-20231203-1100.zip




    User Feedback

    Recommended Comments

    For testing, I created a whole new USB stick with a trial of 6.12.6, and that starts fine with VT-d enabled.

    I installed some basic plugins as well as nerdtools/powertop and corefreq, and the system still boots and works without issues. With that stick, modprobe i915 no longer crashes and the "GPU Statistics" plugin is able to display iGPU information.

    It seems my Unraid install is borked. Any clue what could be crashing things that "early" in the boot?

    Booting in Safe Mode does NOT help; the issues remain.

    Link to comment

    OK, I managed to resolve my own issue. "Resolved" may not be the proper wording, but I have no more call traces and the system starts and works (so far) without any issues on 6.12.6. Just a reminder, though: this config had been working fine with 6.11.5 for months.

    What finally got me on the right track was removing the vfio-pci.cfg. It only binds my PCIe GPU (RTX 3060) with its audio device, as well as an NVMe SSD, at boot. The next thing I remembered was that the initial display output in the BIOS was set to PCIe1. There is also a display connected to the PCIe1 GPU that shows BIOS and Unraid output up to the point where the GPU is bound by VFIO, presumably.

    Switching the initial display setting in the BIOS to the iGPU seems to resolve the issue.

    I can properly bind my PCIe GPU via VFIO and also get proper output up to the login prompt on the iGPU. (There is usually no display connected to it, since the PCIe GPU is used with a Windows VM and the rest of the server is headless/webui only.) The PCIe1 GPU works in the VM, as does the NVMe SSD.

    I'm not sure why any of this would crash the iGPU modprobe in Unraid, since the iGPU is never bound via VFIO or used by anything besides a Plex docker (which is obviously not loaded at that point), but here you go.
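
    Anyone wanting to check the same combination on their own box (which GPU the firmware used as the initial display, and which driver each GPU ended up bound to) could use something like the Python sketch below. It is not from the original poster. It reads the standard Linux sysfs attributes `class`, `boot_vga`, and the `driver` symlink under `/sys/bus/pci/devices`. The function name `list_display_devices` and the idea of passing the sysfs root as a parameter are assumptions made for illustration, and whether Python is available on a given Unraid install is also an assumption.

```python
import os


def _read(path):
    """Return the stripped contents of a sysfs attribute, or None if unreadable."""
    try:
        with open(path) as handle:
            return handle.read().strip()
    except OSError:
        return None


def list_display_devices(sysfs_root="/sys/bus/pci/devices"):
    """List PCI display controllers with their bound driver and boot_vga flag."""
    devices = []
    if not os.path.isdir(sysfs_root):
        return devices
    for address in sorted(os.listdir(sysfs_root)):
        device_dir = os.path.join(sysfs_root, address)
        pci_class = _read(os.path.join(device_dir, "class"))
        # PCI base class 0x03 is "display controller" (VGA, 3D, other).
        if pci_class is None or not pci_class.startswith("0x03"):
            continue
        driver_link = os.path.join(device_dir, "driver")
        driver = None
        if os.path.islink(driver_link):
            # The symlink target ends in the driver name, e.g. i915 or vfio-pci.
            driver = os.path.basename(os.readlink(driver_link))
        devices.append(
            {
                "address": address,
                "driver": driver,
                # boot_vga is "1" for the device the firmware used for boot output.
                "boot_vga": _read(os.path.join(device_dir, "boot_vga")) == "1",
            }
        )
    return devices


if __name__ == "__main__":
    for dev in list_display_devices():
        role = "boot display" if dev["boot_vga"] else "secondary"
        print(f"{dev['address']}: driver={dev['driver'] or 'none'} ({role})")
```

    In the failing setup described above, one would expect the PCIe GPU to be the boot display and bound to vfio-pci, with the iGPU at 0000:00:02.0 as the secondary device. That expectation is an inference from the post, not something the diagnostics here confirm.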

    Link to comment

    Your fix doesn't seem relevant in my case, as I have iGPU only, so no vfio-pci.cfg that I can tell.

     

    I didn't try disabling VT-d, and don't have the time or inclination to create a trial USB to test. I just rolled back to 6.12.4 and everything started working again. iGPU detected and usable in Plex docker, reboots work again, no system hangs.

    Link to comment




