• [6.9.x - 6.11.x] Intel i915 module causing system hangs with no report in syslog (not Alder Lake)


    Tristankin
    • Minor

    Since the 5.x kernel based releases, many users have been reporting system hangs every few days once the i915 module is loaded.

    With reports from a few users detailed in the thread below, we have worked out that the issue is caused by the i915 module and persists in both the 6.9.x releases and the 6.10 release candidates.


    The system does not need to be actively transcoding for the hang to occur. 6.8.3 does not have this issue, so the problem is not hardware related. Unloading the i915 module stops the hangs. Hangs are still present in 6.10.0rc2. I can provide a list of similar reports if required.
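
    For anyone who wants to confirm whether the module really is loaded before and after unloading it, here is a minimal Python sketch. It assumes a Linux system where /proc/modules lists one loaded module per line with the module name in the first column; the helper names `loaded_modules` and `is_module_loaded` are made up for illustration and are not part of Unraid.

```python
from pathlib import Path


def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            # The first whitespace-separated column is the module name.
            names.add(fields[0])
    return names


def is_module_loaded(name, proc_modules_text):
    """True if `name` appears as a loaded module in the given text."""
    return name in loaded_modules(proc_modules_text)


if __name__ == "__main__":
    # Assumption: /proc/modules exists (Linux only); skip quietly otherwise.
    path = Path("/proc/modules")
    if path.exists():
        print("i915 loaded:", is_module_loaded("i915", path.read_text()))
```

    Running it once with the module loaded and once after unloading it gives a quick sanity check that a hang-free period really was a period without i915.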




    User Feedback




    1 hour ago, Tristankin said:

    Ah yeah OK, the previews. Did not think of that. 

    5.x kernel Intel drivers have been known to be a bit shit, and I have also heard reports that the dummy plug has become more important since the 5.x releases (somewhere in this thread, from memory).

    What also makes it all very hard to identify is that nothing is ever logged in syslog. That makes the failure look like a hardware one, but I am pretty darn sure it is an unlogged GPU fault.

    I actually have this issue and had to uninstall Intel GPU top. I do have a monitor connected and I believe I have taken a photo of the issue. I leave the monitor connected in my rack, and it's on so I don't think it's the dongle issue.

     

    I'm running Alder Lake. At the time of the issues I had the most up to date BIOS. Running stock clocks. I tried with and without XMP profiles.

     

    I'm pretty sure in one of the threads (probably this one) I attached a syslog (from USB) and a screenshot but it's been quite a while since then.

     

    I have next week off work. If ich777 wants me to, I can follow his suggestions to debug this then.

    Link to comment
    20 minutes ago, Earendur said:

    I'm running Alder Lake. At the time of the issues I had the most up to date BIOS. Running stock clocks. I tried with and without XMP profiles.

    This thread is not about Alder Lake!!!

     

    Alder Lake is a whole different story and will be fixed to some extent in the 6.11.0 beta series, although Plex has to fix its stuff first: on Jellyfin everything works with the newer kernels, and only Plex crashes the servers.

    Link to comment
    2 minutes ago, ich777 said:

    This thread is not about Alder Lake!!!

     

    Alder Lake is a whole different story and will be fixed to some extent in the 6.11.0 beta series, although Plex has to fix its stuff first: on Jellyfin everything works with the newer kernels, and only Plex crashes the servers.

    I understand that, however I was experiencing the exact same issue on the reported versions of Unraid here - a system hang with no information in syslog and I confirmed it was caused by the iGPU.
     

    So while I am on Alder Lake, I was getting the exact same issue reported by Tristankin.  Could the issue be different due to us being on different CPUs? Maybe. But I'm not convinced it's not the same issue.

    Link to comment
    3 minutes ago, Earendur said:

    I understand that, however I was experiencing the exact same issue on the reported versions of Unraid here - a system hang with no information in syslog and I confirmed it was caused by the iGPU.

    Currently Alder Lake won't work with transcoding and especially not with Plex.

    Alder Lake is simply not supported by the kernel that Unraid 6.10.2 uses, and that is what causes the crashes.

     

    Alder Lake is a completely different story.

    Link to comment
    1 minute ago, ich777 said:

    Currently Alder Lake won't work with transcoding and especially not with Plex.

    Alder Lake is simply not supported by the kernel that Unraid 6.10.2 uses, and that is what causes the crashes.

     

    Alder Lake is a completely different story.

    Okay that's fine. I'll leave my issue out until we get more releases.

    I do have a kernel panic that randomly happens - sometimes multiple times a week, other times it'll go 40 days - and I strongly suspect it's the docker networking. BUT I will start another thread if that issue gets to be more than just a minor annoyance like it is now. I figure the kernel updates should solve this issue over time.

    Thanks

    Link to comment
    9 minutes ago, Earendur said:

    and I strongly suspect it's the docker networking.

    Have you tried switching from MACVLAN to IPVLAN on 6.10.2 yet?

    What containers are you running...?

    Maybe tag me in the other thread about that and mention me from time to time; I won't be around much for the next two months.

    Link to comment
    On 6/8/2022 at 2:22 PM, ich777 said:

    Have you tried switching from MACVLAN to IPVLAN on 6.10.2 yet?

    What containers are you running...?

    Maybe tag me in the other thread about that and mention me from time to time; I won't be around much for the next two months.

    I am running MACVLAN. 

    Pihole is running on br0 with its own IP address, but all the rest are on a custom Docker network.

    What's the risk if I switch to IPVLAN?

     

    Edited by Earendur
    Link to comment
    11 minutes ago, Earendur said:

    What's the risk if I switch to IPVLAN?

    From my perspective, none. I have been running it since it was introduced.

    Keep in mind that some routers have issues with this, especially the Fritzbox from AVM. Nothing bad should happen, though, and you can always go back to MACVLAN.

    Link to comment
    15 hours ago, ich777 said:

    @Tristankin & @Akilae keep in mind that the iGPU may also be used to generate previews when you import new media; I think this applies at least in Plex.

     

    It also seems suspicious to me that on 6.10.0rc4 everything worked with an HDMI dummy plug.

     

    I suspect something other than this module; it must be some weird bug...

    @Tristankin Have you tried blacklisting it and installing the Intel-GPU-TOP plugin yet?

     

    @Akilae do you have your Diagnostics somewhere?

     

    Greetings from Lunz am See... :D

    Hey,

     

    @ich777, here are my Diagnostics.

    I'm also not sure if there wasn't an issue using 6.10.0rc4. It was simply a release where I had no freezes for about 20 days until rc5 was released.

     

    Also, I tried to use ipvlan and everything "seems" normal up to the point where Docker can no longer connect to GitHub, container updates fail, etc.

    I doubt that my Sophos Firewall is the issue here tho.

     

    nas01-diagnostics-20220609-0835.zip

    Edited by Akilae
    Link to comment
    4 hours ago, Lee Kim Tatt said:

    Guys... 6.10.2 seems to have fixed i915. I'm running a J3455 Celeron, and it has been stable with hardware transcoding for days.

    I also use J3455, but after repeated testing in 6.10.2, the system crash still exists during hardware decoding.

    Link to comment
    On 6/10/2022 at 11:03 AM, airlychee said:

    I also use J3455, but after repeated testing in 6.10.2, the system crash still exists during hardware decoding.

    Weird... I've been running Jellyfin with hardware acceleration for days, and it seems stable to me. Did you add any guc setting to the kernel parameters? I am using 0.

    Link to comment
    7 minutes ago, flyize said:

    Can someone explain to a layman what the GUC setting does?

    I think the simplest way to describe it is here.

     

    Quote

    The Graphics micro (µ) Controller (GuC) is designed to offload some of the functionality usually run on the host driver. This functionality includes:
        - Authentication of the HEVC/H.265 micro (µ) Controller (HuC)
          Enables use of HuC codec acceleration extensions by the iHD Intel media driver (described below).
        - Low level graphics context scheduling
          GuC context scheduling operations will include determining which context to run next, submitting a context to a command streamer for a next available engine and pre-empting and resubmitting existing contexts as required. With the GuC selecting which context to submit and the actual engine instance to submit to, it is also responsible for detecting hangs and initiating engine resets.
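
    As a rough illustration of what the numeric setting means: the upstream i915 driver describes its `enable_guc` module parameter as a bitmask (-1 = auto, 0 = disabled, 1 = GuC submission, 2 = HuC load). The small Python sketch below only decodes such a value into words. `describe_enable_guc` is a hypothetical helper, and the values a particular kernel accepts should be checked against the output of `modinfo i915`.

```python
def describe_enable_guc(value):
    """Decode an i915 enable_guc value into a human-readable summary.

    Assumed meaning (from the upstream parameter description):
    -1 = auto, bit 0 (value 1) = GuC submission, bit 1 (value 2) = HuC load.
    """
    if value == -1:
        return "auto (driver chooses the platform default)"
    if not 0 <= value <= 3:
        raise ValueError("expected -1 or a bitmask between 0 and 3")
    guc = "on" if value & 1 else "off"
    huc = "on" if value & 2 else "off"
    return f"GuC submission {guc}, HuC loading {huc}"


if __name__ == "__main__":
    for v in (-1, 0, 1, 2, 3):
        print(v, "->", describe_enable_guc(v))
```

    So the "0" mentioned earlier in the thread corresponds to both GuC submission and HuC loading being switched off.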

     

    It may also be worth mentioning that kernels 5.18.2 and 5.18.3 in combination with Unraid work fine with Alder Lake and HW transcoding (currently tested with Jellyfin). I think Plex has even fixed their custom version of FFmpeg that caused crashes with Alder Lake.

     

    ...please also note that this is not an Alder Lake bug thread!!!

    Link to comment
    1 hour ago, flyize said:

    I wasn't asking about Alder Lake :P

    But this is mainly an Alder Lake, or more precisely a Tiger Lake+, issue, and it needs to be enabled there so that everything works properly... ;)

    Link to comment
    On 6/12/2022 at 5:00 PM, Lee Kim Tatt said:

    Weird... I've been running Jellyfin with hardware acceleration for days, and it seems stable to me. Did you add any guc setting to the kernel parameters? I am using 0.

    How did you add guc to the kernel parameters? In syslinux.conf or in modprobe.d/i915?

    Link to comment
    19 minutes ago, airlychee said:

    How did you add guc to the kernel parameters? In syslinux.conf or in modprobe.d/i915?

    What CPU are you using? Keep in mind this thread is not for Alder Lake!

    On 11th Gen+ GuC is enabled automatically.
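
    For the two places asked about, the setting takes a slightly different form in each: a modprobe options file uses `options i915 enable_guc=N`, while the kernel command line uses `i915.enable_guc=N`. The Python sketch below only builds those strings and edits an `append` line in memory. The helper names are hypothetical, and the file locations mentioned in the comments (/boot/config/modprobe.d/i915.conf and /boot/syslinux/syslinux.cfg) are assumptions about a typical Unraid flash layout, so check your own system before writing anything.

```python
def modprobe_option_line(value):
    """Line for a modprobe options file.

    Assumed location on Unraid: /boot/config/modprobe.d/i915.conf
    """
    return f"options i915 enable_guc={value}"


def kernel_cmdline_param(value):
    """Parameter for the kernel command line.

    Assumed location on Unraid: the 'append' line in /boot/syslinux/syslinux.cfg
    """
    return f"i915.enable_guc={value}"


def add_to_append_line(append_line, param):
    """Return the append line with `param` set exactly once.

    Any existing token with the same key is dropped first, so the result
    never carries two conflicting values. Leading indentation is not kept.
    """
    key = param.split("=")[0]
    tokens = [t for t in append_line.split() if t.split("=")[0] != key]
    tokens.append(param)
    return " ".join(tokens)


if __name__ == "__main__":
    print(modprobe_option_line(0))
    print(add_to_append_line("append initrd=/bzroot", kernel_cmdline_param(0)))
```

    Only one of the two places is needed, and a reboot is required either way for the module to pick up the new value.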
     

    Read this:

     

    Link to comment
    Just now, ich777 said:

    What CPU are you using? Keep in mind this thread is not for Alder Lake!

    On 11th Gen+ GuC is enabled automatically.
     

    Read this:

    I am using J3455-ITX, and the transcoding worked fine in 6.8.3. But since 6.9.x, there has been a crash during the transcoding process.

     

    Link to comment
    1 hour ago, airlychee said:

    I am using J3455-ITX, and the transcoding worked fine in 6.8.3. But since 6.9.x, there has been a crash during the transcoding process.

    Have you tried 6.10.2 yet?

     

    What container are you using for transcoding? If Plex, have you tried anything else so far?

    Link to comment
    6 hours ago, ich777 said:

    Have you tried 6.10.2 yet?

     

    What container are you using for transcoding? If Plex, have you tried anything else so far?

    I usually use Plex. I just tested Jellyfin in 6.10.2 and everything seems to be working fine, but Plex still crashes Unraid...

    Edited by airlychee
    Link to comment
    3 minutes ago, airlychee said:

    I usually use Plex. I just tested Jellyfin in 6.10.2 and everything seems to be working fine, but Plex still crashes Unraid...

    Then this is a pure Plex issue and I would recommend that you post on the Plex forums or in the corresponding support thread for the container.

    Link to comment
    38 minutes ago, ich777 said:

    Then this is a pure Plex issue and I would recommend that you post on the Plex forums or in the corresponding support thread for the container.

    FWIW, Plex has 'fixed' it. The fix has cleared QA and should reach us Docker folks soon. If you run Ubuntu, the fix is already available in this thread:

     

    https://forums.plex.tv/t/ubuntu-22-04-packaging-development-preview/793355/101?u=fly

     

    edit: Now enough Alder Lake! :P

    Edited by flyize
    Link to comment
    12 minutes ago, flyize said:

    edit: Now enough Alder Lake! :P

    Don't know how I should react to this post, with this [image] or this [image]. :D

    Link to comment

    I've been following this thread for some time (and made my own thread a year ago when 6.9.0 came out) and just wanted to add a data point. I run an i7-8700 and use the iGPU for transcoding (other system details and diagnostics are in the linked thread for anyone wanting to look). Rock solid on 6.8.3. Upgrading to 6.9.x caused crashes every day or two with nothing in the remote syslog, just like the OP of this thread.

     

    I will look into BIOS settings when I get a chance.  Truth be told, I didn't try the last recommendation in my own thread to disable C States as I had already reverted back to 6.8.3.

    Link to comment





  • Status Definitions

    Open = Under consideration.
    Solved = The issue has been resolved.
    Solved version = The issue has been resolved in the indicated release version.
    Closed = Feedback or opinion better posted on our forum for discussion. Also for reports we cannot reproduce or need more information. In this case just add a comment and we will review it again.
    Retest = Please retest in latest release.

    Priority Definitions

    Minor = Something not working correctly.
    Urgent = Server crash, data loss, or other showstopper.
    Annoyance = Doesn't affect functionality but should be fixed.
    Other = Announcement or other non-issue.