• [6.9.x - 6.11.x] intel i915 module causing system hangs with no report in syslog (not alder lake)


    Tristankin
    • Minor

    Since the 5.x kernel based releases many users have been reporting system hangs every few days once the i915 module is loaded.

    With reports from a few users detailed in the thread below we have worked out that the issue is caused by the i915 module and is a persistent issue with both the 6.9.x release and 6.10 release candidates.


    The system does not need to be actively transcoding for the hang to occur. 6.8.3 does not have this issue, so the problem is not hardware related. Unloading the i915 module stops the hangs. Hangs are still present in 6.10.0-rc2. I can provide a list of similar reports if required.




    User Feedback

    Recommended Comments



    52 minutes ago, Hoopster said:

    No, I am at 50+ days of uptime since removing the Intel-GPU-top and GPU Statistics plugins. 

     

    I also recently removed the CoreFreq plugin as there have been several reports of it locking up servers.  This was not in response to a crash, just an extra precaution.

    What CPU are you running?  Are you using iGPU passthrough to Plex for transcoding?

    Link to comment
    32 minutes ago, NightOps said:

    What CPU are you running?  Are you using iGPU passthrough to Plex for transcoding?

    i5-7600k

    yes, I am using the iGPU for transcoding in my plex docker

    Link to comment
    38 minutes ago, flyize said:

    I had a friend who's much more knowledgeable with linux write me up a walkthrough to patch 6.10. I haven't tried it yet, but will report back if I do.

    What does it patch in 6.10?

    Link to comment

    same for me. 

    6.10 RC2

    Intel Core i9 12900k. 

    Intel GPU Top installed and IGPU passthrough to Plex Container. 

    Sometimes the whole system is unresponsive via HTTP and SSH. Other times the transcoding handler just doesn't stop and uses 100 percent CPU on 12 threads; the WebGUI still works then, but that's all.

    When I do a fresh boot and test transcoding, nothing happens; it just runs as expected.

    After a day or two the Server becomes completely unresponsive.

    I have /boot/config/modprobe.d/i915.conf with content blacklist i915 and Intel GPU Top Installed.

    In go file I have just 

    chmod -R 777 /dev/dri 

    Seems I am not using i915.force_probe=4680, or I am just too stupid to find it.

    So I am not really sure if my settings are correct, to be honest.

     

    Link to comment

    I have no files in my boot/config/modprobe.d folder. I have plex installed but using software transcoding for now, no GPU monitoring tools/plugins (did have some in the past) and still getting crashes on 6.9.2 with i5-11600k.

    Link to comment

    6.10 rc2

    i5-12600k

    64 GB G.Skill 3200 MHz CL16 running XMP profile

    Asus Prime Z690-P D4 on BIOS Version 1008 dated: Thu 13 Jan 2022 12:00:00 AM EST

    Intel GPU Top Plugin installed, no other modifications performed

    Plex using software transcoding, not passing /dev/dri device through

     

    Every morning at 2am I get 2 segfaults from Plex Media Scanner.  I've posted in Plex's forum, but no response yet.

     

    No crashes in 18 days, running rock stable.  I did reboot when installing the new BIOS, so my current uptime is 9 days at the moment... At 10 days that will be the longest I've had the computer booted without a lockup.

     

    Link to comment
    20 hours ago, NightOps said:

    What CPU are you running?  Are you using iGPU passthrough to Plex for transcoding?

    Xeon E-2288G and yes I am using the iGPU (/dev/dri) for Plex and HandBrake transcoding.  Server never locked up when transcoding was happening.  It always seemed to happen during idle time. 

     

    I am not saying Intel-GPU-Top was the problem, but, I have not had a crash since removing it and GPU Statistics.  Running unRAID 6.9.2 currently.  Lockups previously happened on that version plus the two 6.10.0 RCs.

    Link to comment
    8 minutes ago, Hoopster said:

    Xeon E-2288G and yes I am using the iGPU (/dev/dri) for Plex and HandBrake transcoding.  Server never locked up when transcoding was happening.  It always seemed to happen during idle time. 

     

    I am not saying Intel-GPU-Top was the problem, but, I have not had a crash since removing it and GPU Statistics.  Running unRAID 6.9.2 currently.  Lockups previously happened on that version plus the two 6.10.0 RCs.

     

    That Xeon is Coffee Lake (9th-gen Intel), which Unraid is certified for. The issue is on 11th/12th-gen Intel chips, which Unraid is technically not certified for, at least not until the 6.10 release.

    Edited by snailtrails
    Link to comment
    3 minutes ago, snailtrails said:

    That Xeon is coffee lake (9th gen intel) which Unraid is certified for

    If you look at the hardware specs for the user that started this bug report as well as others in the thread he linked, you will see that there are many reports of i915 lockups with 9th generation Intel CPUs/iGPUs.  It may be more prevalent in the 11th and 12th generations that don't have official iGPU support in the i915 drivers, but it seems not to be limited to those generations.

    Link to comment


    I had hangs every week on an i5-8600 over the last 3 months. Only a hard reset helped. My server is a remote machine, so each hang is a headache.
    Downgraded to 6.8.3: no problems at all in the last 3 weeks.
    I'm not using any transcoding, iGPU, etc.

    Please fix it.

    Link to comment
    On 2/21/2022 at 5:21 PM, snailtrails said:

    Do you have tips and tweaks enabled with power saving turned on?

    If you are asking me, then my answer is: yes, but this is not a problem in 6.8.3.

    Link to comment

    I am also having the same problem with an Intel i7 9700.

    I was also using iGPU passthrough to Jellyfin for transcoding, but have not used it for a long time now.

    I also use powertop --auto-tune to optimize power consumption.

    Freezes happened at various times, between 10 and 33 days of uptime.

    The server and its services are no longer reachable. Ping still works, but I need to power cycle the server.

    Nothing in the logs. I also set up a separate syslog server, but it gave no hint.

     

    Link to comment

    I ran a quick test tonight after upgrading to 6.10.0-rc3 and was able to lock up the server on 3 separate occasions when transcoding via Plex. Unlike previously, I was actually able to see something in my syslog this time (see below). I am assuming this is the same issue, unless something else is going on that is causing this. I am running a 12600k.

     

    Quote

    Mar 11 00:38:41 Jarvis kernel: i915 0000:00:02.0: [drm] Resetting vcs0 for preemption time out
    Mar 11 00:38:41 Jarvis kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:28fffffd, in Plex Transcoder [22251]
    Mar 11 00:38:56 Jarvis kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 12:0:00000000
    Mar 11 00:38:56 Jarvis kernel: i915 0000:00:02.0: [drm] Resetting vcs1 for stopped heartbeat on vcs1
    Mar 11 00:38:56 Jarvis kernel: i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on vcs1
    Mar 11 00:38:56 Jarvis kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:28fffffc, in Plex Transcoder [22251]
    Mar 11 00:38:56 Jarvis kernel: [drm:__uc_sanitize [i915]] *ERROR* Failed to reset GuC, ret = -110
    Mar 11 00:38:56 Jarvis kernel: i915 0000:00:02.0: [drm] *ERROR* Failed to reset chip
    Mar 11 00:38:56 Jarvis kernel: i915 0000:00:02.0: [drm:add_taint_for_CI [i915]] CI tainted:0x9 by intel_gt_reset+0x276/0x29b [i915]
    Mar 11 00:38:56 Jarvis kernel: [drm:__uc_sanitize [i915]] *ERROR* Failed to reset GuC, ret = -110
    Mar 11 00:38:56 Jarvis kernel: i915 0000:00:02.0: [drm] Plex Transcoder[22251] context reset due to GPU hang
    Mar 11 00:38:56 Jarvis kernel: i915 0000:00:02.0: [drm] *ERROR* rcs0 TLB invalidation did not complete in 4ms!
    Mar 11 00:38:56 Jarvis kernel: i915 0000:00:02.0: [drm] *ERROR* bcs0 TLB invalidation did not complete in 4ms!
    Mar 11 00:38:56 Jarvis kernel: i915 0000:00:02.0: [drm] *ERROR* rcs0 TLB invalidation did not complete in 4ms!
    Mar 11 00:38:56 Jarvis kernel: i915 0000:00:02.0: [drm] *ERROR* bcs0 TLB invalidation did not complete in 4ms!
    Mar 11 00:39:00 Jarvis kernel: Fence expiration time out i915-0000:00:02.0:Plex Transcoder[22251]:1e5a!
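
    For anyone comparing logs across servers, here is a minimal Python sketch that pulls the GPU HANG entries out of syslog text like the excerpt above. The regular expression is an assumption inferred only from the lines quoted in this thread, and `find_gpu_hangs` is a hypothetical helper name, not part of any Unraid or i915 tooling.

```python
import re

# Pattern inferred from the "GPU HANG" lines quoted in this thread.
# This is an assumption, not a documented i915 log format.
HANG_RE = re.compile(
    r"i915 (?P<pci>[0-9a-f:.]+): \[drm\] GPU HANG: ecode (?P<ecode>[0-9a-f:]+)"
    r"(?:, in (?P<process>.+?) \[(?P<pid>\d+)\])?"
)


def find_gpu_hangs(log_text):
    """Return one dict per GPU HANG line found in log_text."""
    hangs = []
    for line in log_text.splitlines():
        match = HANG_RE.search(line)
        if match is None:
            continue
        pid = match.group("pid")
        hangs.append({
            "pci": match.group("pci"),
            "ecode": match.group("ecode"),
            # process and pid are None when the line names no process
            "process": match.group("process"),
            "pid": int(pid) if pid is not None else None,
        })
    return hangs
```

    Feeding it the full syslog text gives a quick count of hangs and which process triggered each one, which may help when comparing reports between different CPUs.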
     

     

    Link to comment
    3 hours ago, MadMatt337 said:

    I ran a quick test tonight after upgrading to 6.10.0-rc3 and I was able to lock up the server on 3 separate occasions when transcoding via plex. Now I was actually able to see something on my syslog this time around (see below) unlike previously, I am assuming this is the same issue unless I have something else going on that is causing this? I am running a 12600k.

     

     

    It should be noted that the release notes about the kernel are:

    Linux Kernel

    Upgrade to [rc3] Linux 5.15.27 kernel which includes so-called Sequoia and Dirty Pipe vulnerability mitigations.

    …. So it’s not yet on 5.16.x, which I believe is what contains the supported iGPU drivers for 12th gen.  Let’s not get our hopes up yet if that’s the case.

    Link to comment
    4 hours ago, MadMatt337 said:

    Mar 11 00:38:56 Jarvis kernel: [drm:__uc_sanitize [i915]] *ERROR* Failed to reset GuC, ret = -110

    Can you try to add:

    i915.enable_guc=0

    to your syslinux.cfg and test if it crashes again. If yes, please edit this entry and try:

    i915.enable_guc=2

    and report back if it's the same.
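
    A minimal Python sketch of the edit being asked for, assuming the parameter goes on the `append` line of the boot entry, as is usual for syslinux. `set_kernel_param` is a hypothetical helper, and the `append initrd=/bzroot` line in the usage note is only an example value; check your own file before changing anything.

```python
def set_kernel_param(append_line, key, value):
    """Return append_line with key=value set exactly once.

    An existing "key=..." token is replaced in place; otherwise the new
    token is added at the end. Leading indentation is not preserved.
    """
    new_token = f"{key}={value}"
    result = []
    replaced = False
    for token in append_line.split():
        if token.split("=", 1)[0] == key:
            if not replaced:
                result.append(new_token)
                replaced = True
            # any later duplicate of the same key is dropped
        else:
            result.append(token)
    if not replaced:
        result.append(new_token)
    return " ".join(result)
```

    For example, `set_kernel_param("append initrd=/bzroot", "i915.enable_guc", 0)` adds the parameter, and calling it again on the result with the value 2 swaps the value instead of appending a second copy.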

     

    On 2/21/2022 at 3:19 PM, Nuke said:

    Were hangs every week on i5-8600 last 3 months. Only hard reset helped a lot. But my server is remote machine so each hang is headache.

    On 2/20/2022 at 8:20 PM, Tristankin said:

    Yep, can confirm 9th gen, i3 9100. No go with all 6.9.x and 6.10-RC

    On 2/17/2022 at 9:09 PM, bearcat2004 said:

    i5-7600k

    Have you guys connected a monitor or an HDMI dummy plug to your iGPU?

    It is strongly recommended to attach at least an HDMI dummy plug to avoid kernel panics with newer kernels.

     

     

    On 2/18/2022 at 9:34 AM, feraay said:

    I have /boot/config/modprobe.d/i915.conf with content blacklist i915 and Intel GPU Top Installed.

    In go file I have just 

    chmod -R 777 /dev/dri 

    Seems I am not using i915.force_probe=4680 or just too stupid to find it.

    Please don't do this. Intel-GPU-TOP will handle the force probe, so just blacklist the i915 module. You don't even need to do a chmod -R 777 /dev/dri, because Unraid does this itself if the path is found.

    Link to comment

    Had my first crash with 6.10 RC3.

     

    So on the advice of @Ich777 I did the following:

    Plugged an HDMI dummy into the onboard HDMI.

    Removed the chmod -R 777 /dev/dri from the go file.

    Installed Intel GPU TOP.

    Created the /boot/config/modprobe.d/i915.conf file.

     

    OK, so after 3 crashes in a row and a damaged Plex config, I can also see this in the logs:

    Mar 11 15:16:13 Mycroft kernel: [drm:__uc_sanitize [i915]] *ERROR* Failed to reset GuC, ret = -110 
    Mar 11 15:16:13 Mycroft kernel: i915 0000:00:02.0: [drm] *ERROR* Failed to reset chip 
    Mar 11 15:16:13 Mycroft kernel: i915 0000:00:02.0: [drm:add_taint_for_CI [i915]] CI tainted:0x9 by intel_gt_reset+0x276/0x29b [i915] 
    Mar 11 15:16:13 Mycroft kernel: [drm:__uc_sanitize [i915]] *ERROR* Failed to reset GuC, ret = -110

     

    So I added i915.enable_guc=0; we will see.

    OK, i915.enable_guc=0 also resulted in a crash. I will change it to 2 and test again.

    I was not able to get a log with guc=0; the crash was faster ^^

     

     

    with i915.enable_guc=2:

     

    Mar 11 16:08:33 Mycroft kernel: i915 0000:00:02.0: [drm] Resetting vcs0 for preemption time out
    Mar 11 16:08:33 Mycroft kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:28fffffd, in Plex Transcoder [18057]
    Mar 11 16:08:44 Mycroft kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:28fffffd, in Plex Transcoder [18057]
    Mar 11 16:08:44 Mycroft kernel: i915 0000:00:02.0: [drm] Resetting vcs0 for stopped heartbeat on vcs0
    Mar 11 16:08:44 Mycroft kernel: i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on vcs0
    Mar 11 16:08:45 Mycroft kernel: [drm:__uc_sanitize [i915]] *ERROR* Failed to reset GuC, ret = -110
    Mar 11 16:08:45 Mycroft kernel: i915 0000:00:02.0: [drm] *ERROR* Failed to reset chip
    Mar 11 16:08:45 Mycroft kernel: i915 0000:00:02.0: [drm:add_taint_for_CI [i915]] CI tainted:0x9 by intel_gt_reset+0x276/0x29b [i915]
    Mar 11 16:08:45 Mycroft kernel: [drm:__uc_sanitize [i915]] *ERROR* Failed to reset GuC, ret = -110
    Mar 11 16:08:45 Mycroft kernel: i915 0000:00:02.0: [drm] Plex Transcoder[18057] context reset due to GPU hang
    Mar 11 16:08:45 Mycroft kernel: i915 0000:00:02.0: [drm] *ERROR* rcs0 TLB invalidation did not complete in 4ms!
    Mar 11 16:08:45 Mycroft kernel: i915 0000:00:02.0: [drm] *ERROR* bcs0 TLB invalidation did not complete in 4ms!
    Mar 11 16:08:45 Mycroft kernel: i915 0000:00:02.0: [drm] *ERROR* rcs0 TLB invalidation did not complete in 4ms!
    Mar 11 16:08:45 Mycroft kernel: i915 0000:00:02.0: [drm] *ERROR* bcs0 TLB invalidation did not complete in 4ms!
    Mar 11 16:08:45 Mycroft kernel: i915 0000:00:02.0: [drm] *ERROR* rcs0 TLB invalidation did not complete in 4ms!
    Mar 11 16:08:45 Mycroft kernel: i915 0000:00:02.0: [drm] *ERROR* bcs0 TLB invalidation did not complete in 4ms!
    Mar 11 16:08:45 Mycroft kernel: i915 0000:00:02.0: [drm] *ERROR* rcs0 TLB invalidation did not complete in 4ms!
    Mar 11 16:08:45 Mycroft kernel: i915 0000:00:02.0: [drm] *ERROR* bcs0 TLB invalidation did not complete in 4ms!
    Mar 11 16:08:45 Mycroft kernel: i915 0000:00:02.0: [drm] *ERROR* rcs0 TLB invalidation did not complete in 4ms!
    Mar 11 16:08:45 Mycroft kernel: i915 0000:00:02.0: [drm] *ERROR* bcs0 TLB invalidation did not complete in 4ms!

     

    what about 3?

    GuC submission and power management is enabled by setting the kernel module parameter: i915.enable_guc=1

    HuC authentication only is enabled by setting the kernel module parameter: i915.enable_guc=2

    Combine for both features together: i915.enable_guc=3
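
    The three values listed above combine like bit flags: 1 and 2 each select one feature, and 3 is both together. A small Python sketch of that reading, based only on the values quoted in this post. `describe_enable_guc` is a hypothetical helper, and rejecting anything outside 0 to 3 is an assumption of the sketch, not a statement about what the driver accepts.

```python
GUC_SUBMISSION = 1  # per the post: GuC submission and power management
HUC_AUTH = 2        # per the post: HuC authentication


def describe_enable_guc(value):
    """List the features selected by an i915.enable_guc value of 0 to 3."""
    if value < 0 or value > 3:
        # Other values are outside what this thread describes.
        raise ValueError(f"value not covered by this sketch: {value}")
    features = []
    if value & GUC_SUBMISSION:
        features.append("GuC submission")
    if value & HUC_AUTH:
        features.append("HuC authentication")
    return features
```

    With this reading, 0 selects nothing, 2 selects only HuC authentication, and 3 selects both features.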

     

     

    Edited by feraay
    Link to comment
    6 hours ago, feraay said:

    what about 3?

    From what I see this won't change anything, but at least you can try it.

    In your case (or I think for all Alder Lake chips) the iGPU doesn't reset properly with the current Kernel.

    On what Unraid version are you? Can you share your Diagnostics?

    Link to comment

    Just a small update.

    I uninstalled Intel GPU TOP,

    created just an empty i915.conf in /boot/config/modprobe.d,

    and set i915.force_probe=4680 i915.enable_guc=2 in syslinux.cfg.

    Two transcodes have been running for 15 minutes with no GPU HANG in the logs so far.

    With Intel GPU TOP installed, the error appeared almost immediately.

    Maybe it's just luck, I am not sure.

     

    ok it happened 

    Mar 15 16:11:42 Mycroft kernel: i915 0000:00:02.0: [drm] Resetting vcs1 for preemption time out
    Mar 15 16:11:42 Mycroft kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:28fffffd, in Plex Transcoder [17657]
    Mar 15 16:11:53 Mycroft kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:28fffffd, in Plex Transcoder [17657]
    Mar 15 16:11:53 Mycroft kernel: i915 0000:00:02.0: [drm] Resetting vcs1 for stopped heartbeat on vcs1
    Mar 15 16:11:53 Mycroft kernel: i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on vcs1
    Mar 15 16:11:53 Mycroft kernel: [drm:__uc_sanitize [i915]] *ERROR* Failed to reset GuC, ret = -110
    Mar 15 16:11:53 Mycroft kernel: i915 0000:00:02.0: [drm] *ERROR* Failed to reset chip
    Mar 15 16:11:53 Mycroft kernel: i915 0000:00:02.0: [drm:add_taint_for_CI [i915]] CI tainted:0x9 by intel_gt_reset+0x276/0x29b [i915]
    Mar 15 16:11:53 Mycroft kernel: [drm:__uc_sanitize [i915]] *ERROR* Failed to reset GuC, ret = -110
    Mar 15 16:11:53 Mycroft kernel: i915 0000:00:02.0: [drm] Plex Transcoder[17657] context reset due to GPU hang
    Mar 15 16:11:53 Mycroft kernel: i915 0000:00:02.0: [drm] *ERROR* rcs0 TLB invalidation did not complete in 4ms!
    Mar 15 16:11:53 Mycroft kernel: i915 0000:00:02.0: [drm] *ERROR* bcs0 TLB invalidation did not complete in 4ms!
    Mar 15 16:11:53 Mycroft kernel: i915 0000:00:02.0: [drm] *ERROR* rcs0 TLB invalidation did not complete in 4ms!
    Mar 15 16:11:53 Mycroft kernel: i915 0000:00:02.0: [drm] *ERROR* bcs0 TLB invalidation did not complete in 4ms!
    Mar 15 16:11:58 Mycroft kernel: Fence expiration time out i915-0000:00:02.0:Plex Transcoder[17657]:7cfe!

     

    One transcode died, but the server is still responsive and didn't crash, and the second transcode is still running.

    The second transcode also crashed and the Plex docker crashed, but the server is still responsive.

    The WebGUI was still accessible, but the server did not respond anymore.

    I will go with CPU transcoding and test again with kernel 5.16.

     

    Edited by feraay
    Link to comment




