• [6.9.x - 6.11.x] Intel i915 module causing system hangs with no report in syslog (not Alder Lake)


    Tristankin
    • Minor

    Since the releases based on the 5.x kernel, many users have been reporting system hangs every few days once the i915 module is loaded.

    Based on reports from a few users, detailed in the thread below, we have worked out that the issue is caused by the i915 module and persists in both the 6.9.x releases and the 6.10 release candidates.


    The system does not need to be actively transcoding for the hang to occur. 6.8.3 does not have this issue, so the problem is not hardware related. Unloading the i915 module stops the hangs. Hangs are still present in 6.10.0 RC2. I can provide a list of similar reports if required.

    • Like 8
    • Thanks 1
    • Haha 1



    User Feedback

    Recommended Comments



    When these system hangs happen for me (started on 6.9.2 and have continued with 6.10.0 RC), there is nothing meaningful in the syslog around the time of the hang but this is sometimes reported in the IPMI log.

     

    [Image: screenshot of the IPMI log]

     

    An OS Stop/Shutdown sounds like a kernel/driver issue to me.  It's as if the rest of the system is working but the OS just decided to shut down.

     

    FWIW, I have never seen these shutdowns as a result of hardware transcoding.

    Edited by Hoopster
    • Like 2
    Link to comment

    Is there any extra info the devs need for this? I assume you are going to set up a test intel system with the i915 module loaded to see if you can replicate the fault?

    Link to comment

    Edit:

    Looks like my issue was caused by a faulty CPU. I replaced the RAM, motherboard, and PSU with known good hardware and it still persisted. Intel agreed to RMA the 11500 I had and issued me a refund. I purchased an 11700K as a replacement and it's been working fine ever since.

    My CPU would only ever crash my system when transcoding; p95/furmark were stable, which led me to mistake it for a software-level issue.

    Edited by CaptainPeech
    Link to comment

    I have been using Unraid since 6.1.8, and I believe I've been experiencing a similar issue with 6.9.2

    This year in June I upgraded my motherboard & CPU (built in 2015: Asrock H97 Pro4S, Xeon E3-1231v3 24GB 1600 ram, upgraded this year to: Asrock Z170S, i5-7600k, 16GB 3200 ram) and I had upgraded specifically for the Intel QuickSync hardware transcoding. On the Xeon hardware, the server was perfectly stable, but it was at its limit in terms of software transcoding, hence the upgrade.

    After the upgrade, I started noticing crashes occurring more and more frequently -- at first once per month, then once per week, and now every few days.

     

    I thought I had successfully troubleshooted (troubleshot?) the issue when I noticed the NTP server was causing a kernel panic. I found this after I set up a syslog server on a raspberry pi I had lying around. Unfortunately, even after disabling NTP entirely, the crashes persisted, but now there was no trace of any issues in the syslog. My raspberry pi is also a pihole, so I am able to see when the DNS requests cease, and that the server crashes most frequently in the early hours of the morning, but beyond that I'm stumped. 

     

    This morning, I found the threads posted by @Tristankin, and rather than downgrade back to 6.8.3, I decided to experiment by commenting out the

    # modprobe i915

    line in my go file. 
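A minimal sketch of that edit, for anyone who would rather script it than open the file by hand. The default path `/boot/config/go` and the function name `disable_i915_modprobe` are assumptions for illustration, not something confirmed in this thread; pass your own path if your go file lives elsewhere.

```shell
#!/bin/bash
# Sketch: comment out an active "modprobe i915" line in a go file.
# The default path /boot/config/go is an assumption; pass your own path as $1.
disable_i915_modprobe() {
    local go_file="${1:-/boot/config/go}"
    # Keep a backup next to the original before changing anything.
    cp "$go_file" "$go_file.bak" || return 1
    # Rewrite the file from the backup, prefixing matching lines with "# ".
    # Lines that are already commented out do not match and stay unchanged.
    sed -E 's/^([[:space:]]*)(modprobe[[:space:]]+i915([[:space:]].*)?)$/\1# \2/' \
        "$go_file.bak" > "$go_file"
}
```

Calling `disable_i915_modprobe` with no argument edits the assumed default path; the change only takes effect the next time the go file runs, so a reboot is still needed.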

     

    I'm hoping that the server continues to function, and I think I'll be able to make do with software transcoding for the time being, but I would like very much for this issue to be resolved. Happy to provide any details as necessary. 

    Edited by bearcat2004
    • Like 3
    Link to comment

    UPDATE:  I rolled back to 6.9.2 from 6.10.0 rc2 and still had random system hangs.  16 days ago I did the following and have not had an issue since then:

    • recreated i915.conf with the 'touch' method
    • uninstalled the Intel GPU Top plugin
    • uninstalled the GPU Statistics plugin
    • enabled Turbo Boost in the Tips and Tweaks plugin (it had been disabled because heavy sustained workloads had been causing crashes with Turbo Boost)

     

    UPDATE: A few days ago, I also uninstalled the CoreFreq  plugin.  I had not had problems with it installed but I see that it has been recommended to others with server lockups to uninstall that plugin.

     

    I am not declaring victory yet nor do I think the above is necessarily a solution.  This just happens to be the longest I have ever gone without a system hang since I started having the problems last July.

     

    - Over five months of running 6.9.2 and 6.10.x with no hangs and many QSV transcodes.  The only thing that appears to cause a hang is when Turbo Boost is enabled and there is high CPU load.  This only happened again after upgrading to 6.10.0.  I have Turbo Boost off for now.

    Edited by Hoopster
    • Like 1
    Link to comment

     

    I've had two crashes -- about one per week -- since 12/27. Admittedly, despite my previous post where I had mentioned I could make do with software transcoding, I had pretty quickly re-enabled hardware transcoding after hitting a limit with my patience while watching a 4K HDR movie on Plex, and forgot to disable it when I was done. 

     

    I am going to try uninstalling the Intel GPU Top & GPU Statistics plugins as well to see if that changes things. I didn't realize that they automatically configure the i915 kernel driver when they're installed. I suppose I'll have to reboot and reconfigure my Plex docker to reconfigure the kernel driver/module.

     

    2 hours ago, Hoopster said:
    • recreated i915.conf with the 'touch' method

    Where would I create this file if I were to create this file the same way?

     

    Lastly, after my most recent crash, I did see one odd line in my logs when the system came back online, something i915/GPU-related that I hadn't noticed before:

    Jan  9 23:38:15 Lighthouse kernel: i915 0000:00:02.0: [drm] HPD interrupt storm detected on connector HDMI-A-1: switching from hotplug detection to polling

    The only thing I could think of to try to remedy this was to connect an HDMI dummy device to the motherboard. I had an old HDMI-to-VGA converter lying around, so I connected that to see if it might have any effect.
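To check whether the same message shows up on another system, a small search over the log is enough. This is only a sketch: the default log path `/var/log/syslog` and the function name `find_hpd_storms` are assumptions, and the search string is taken from the log line quoted above.

```shell
#!/bin/bash
# Sketch: print every syslog line that reports an i915 HPD interrupt storm.
# The default log path is an assumption; pass another file as $1.
find_hpd_storms() {
    local log_file="${1:-/var/log/syslog}"
    # -F treats the pattern as a fixed string rather than a regex.
    grep -F 'HPD interrupt storm detected' "$log_file"
}
```

`find_hpd_storms | wc -l` gives a quick count; no output means the message was not logged in that file.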

     

    Still wish we had answers to this puzzle!

    Edited by bearcat2004
    Link to comment
    Just now, bearcat2004 said:

    Where would I create this file if I were to create this file the same way?

    From the terminal type 'touch /boot/config/modprobe.d/i915.conf'

     

    Intel GPU Top tries to load i915 as well and I thought it might be interfering so I removed it.

    • Thanks 1
    Link to comment

    Ok, so I loaded the i915 kernel driver using the method from @Hoopster above, and interestingly the 'HPD interrupt storm detected' line in my previous post is absent from the system log.

    I googled that error and it led to a bunch of bug reports from 2018 for the Linux kernel wherein the X server would freeze after loading the i915 module (sounds familiar).

    Some examples:

    https://issues.hyperbola.info/index.php?do=details&task_id=741

    https://bugs.freedesktop.org/show_bug.cgi?id=106675

     

    I think it's a good sign that this error has not appeared in my logs thus far.  

     

    I hope that I get at least 16 days of stable uptime too!

     

    • Like 1
    Link to comment
    10 minutes ago, Arkhad said:

    Hi, I appear to have the same issue.

    @Hoopster What did you put in the i915.conf file ?

    Nothing, the touch command will create an empty file.

    Unraid checks on startup if a file is present or not. The content is irrelevant.
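For reference, the 'touch' method and a quick check that the module is actually loaded can be sketched as below. The function names and the parameterised directories are illustrative assumptions; the default path comes from the touch command quoted earlier in this thread.

```shell
#!/bin/bash
# Sketch: create an empty i915.conf the same way 'touch' does.
# Default directory is the one from the touch command in this thread.
ensure_i915_conf() {
    local conf_dir="${1:-/boot/config/modprobe.d}"
    mkdir -p "$conf_dir" || return 1
    # touch creates an empty file if none exists and leaves an
    # existing file's content untouched.
    touch "$conf_dir/i915.conf"
}

# Sketch: succeed if the i915 module is currently loaded.
# A loaded module has a directory under /sys/module; the root is a
# parameter here only so the check can be exercised against a test tree.
i915_loaded() {
    [ -d "${1:-/sys/module}/i915" ]
}
```

After a reboot, `i915_loaded && echo loaded` is one way to confirm the driver came up without any go file entry.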

    • Like 2
    • Thanks 1
    Link to comment

    Well, Hoopster's method doesn't work for me. The server still crashes shortly after the array is up. The weird thing is that the crash doesn't happen if the parity check is running.

    Edited by Arkhad
    Link to comment

    I believe I'm experiencing this same issue.  I have been discussing it in the 12th gen Alder Lake thread here:

    I touched the i915.conf file and have Intel GPU Top and GPU Statistics installed. The hangs lock up the system entirely, but the terminal is still accessible directly on the machine itself.  Shutdown commands don't shut the system down. Syslog says nothing.  Crashes happened whether people were streaming or not, sometimes several times an hour, other times 4-6 hours apart.  I'm on 6.10.0 RC2.

    The fix for me was to remove the /dev/dri device from the Plex container.  I still have the i915.conf, Intel GPU Top, and GPU Statistics installed, but the crashes stopped immediately when I removed /dev/dri from the container.

    I was about to test whether pinning the container to the performance cores would help, but based on the people experiencing the same issue here, it appears the issue is not limited to 12th gen Intel CPUs.

    Edited by Earendur
    Link to comment
    3 hours ago, Earendur said:

    the crashes stopped immediately when I removed /dev/dri from the container.

    When you remove /dev/dri from the plex container, you switch to software transcoding though, right? Or are you able to confirm it is using hardware transcoding?

    • Like 1
    Link to comment

    I’ve had this issue when transcoding, when not transcoding, when running handbrake, and when saving a simple change to the app running in a docker.

     

    Asus Z690-P D4 (latest BIOS)

    i5-12600k

    4x 16GB DDR4-3200 CL16

    • Like 1
    Link to comment
    4 hours ago, muzo178 said:

    When you remove /dev/dri from the plex container, you switch to software transcoding though, right? Or are you able to confirm it is using hardware transcoding?


    Originally, I simply turned off hardware transcoding in Plex itself, but the crashes still occurred, which is consistent with the problem reported here: the drivers don't even need to be actively in use to cause the issue.

    I tried a number of troubleshooting steps to figure out the issue, from running Memtest to rule out a RAM issue to updating the BIOS.  It wasn't until I deleted the /dev/dri device from the Docker container configuration, effectively making it impossible for Plex to access the drivers for hardware transcoding, that the crashes stopped.

    There is definitely an instability in the iGPU drivers.  While testing hardware transcoding, I was sometimes able to use it for several hours before it crashed the system.  Other times, it would crash 3 times in an hour, even when no one was streaming from the Plex server at all.

    I knew hardware transcoding was working because the Plex dashboard was showing (hw) and so was Tautulli.  Since the removal of the device, no transcoding operations show (hw).
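One way to confirm whether a container still has the device mapped is to look at its device list with `docker inspect`. The sketch below assumes the device was passed as a Docker device mapping (so it appears under `.HostConfig.Devices`) rather than as a volume; the function names and the container name `plex` are placeholders.

```shell
#!/bin/bash
# Sketch: succeed if the JSON device list on stdin contains a /dev/dri mapping.
devices_include_dri() {
    grep -q '"PathOnHost":"/dev/dri'
}

# Sketch: check a container by name. Assumes the docker CLI is available
# and that the device was added as a device mapping, not a bind mount.
container_has_dri() {
    docker inspect --format '{{json .HostConfig.Devices}}' "$1" | devices_include_dri
}
```

For example, `container_has_dri plex && echo "iGPU still mapped"`; if nothing is printed, the container has no /dev/dri device mapping and cannot use hardware transcoding through it.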

    Edited by Earendur
    Link to comment

    Well, it seems I also have this issue.

    I am missing 2 hours, as you can see in the log I write to the app pool:

     

    Jan 18 08:02:41 Plex emhttpd: read SMART /dev/sdg
    Jan 18 08:31:09 Plex webGUI: Successful login user root from 
    Jan 18 10:02:17 Plex kernel: mdcmd (36): set md_write_method 1
    Jan 18 10:02:17 Plex kernel: 

     

    So the only fix seems to be disabling i915 entirely?
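Since these hangs leave nothing in the syslog, a silent gap between timestamps like the one in the excerpt above is often the only trace. A rough sketch for spotting such gaps follows; the function name `find_log_gaps` and the one-hour default threshold are assumptions, and it only compares lines that fall on the same day, so a gap spanning midnight is not reported.

```shell
#!/bin/bash
# Sketch: report silent periods in a syslog-style file ("Mon DD HH:MM:SS ...").
# $1 = log file, $2 = minimum gap in seconds (default 3600, an arbitrary choice).
find_log_gaps() {
    local log_file="$1" min_gap="${2:-3600}"
    awk -v min_gap="$min_gap" '
        # Skip lines whose third field is not an HH:MM:SS timestamp.
        $3 !~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ { next }
        {
            split($3, t, ":")
            now = t[1] * 3600 + t[2] * 60 + t[3]
            day = $1 " " $2
            # Only compare consecutive lines from the same day.
            if (seen && day == prev_day && now - prev >= min_gap)
                printf "%s %s -> %s (%d s)\n", day, prev_time, $3, now - prev
            seen = 1; prev = now; prev_day = day; prev_time = $3
        }' "$log_file"
}
```

Run against a saved copy of the log, e.g. `find_log_gaps /path/to/syslog 1800`, it prints one line per gap with the last timestamp before the silence and the first one after it.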

    Link to comment
    27 minutes ago, Tristankin said:

    Devs, is the plan to wait till there is a stable kernel? Or a patched one?

     

    I did some research some time ago and the i915 driver is very troublesome.  There has been some work done in 6.10 with this, but there just isn't a lot that Limetech can do since it appears to be a driver issue.  I've been using it on 6.10 and have not had any issues.  On 6.10, the driver is loaded by Unraid and you don't need to put it in your go file.

     

    Of course this is the problem with the i915 driver.  Certain things work for some, and not for others.  It's all pretty hit and miss.

     

    There was recently an issue with Plex not handling the driver correctly, but they released an updated version that fixed the problem with Plex.

     

    So I can't say it has been definitely fixed, but it has been worked on.

    • Like 5
    Link to comment

    Once 6.10 is released, let's revisit this.  6.10rc3 (not released yet) is currently on Linux Kernel 5.15.15.  This latest Kernel may see this fixed.  Feel free to PM me if I don't get back to this issue.

    • Like 2
    Link to comment

    When rc3 is released do the following before rebooting:

    • Remove any references to i915 from your go file.
    • Remove any i915 files from the /flash/modprobe.d folder.

     

    The driver is loaded by Unraid without you having to do anything.  I wanted to give you a heads up because you would probably upgrade before I could post here.
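Before upgrading, the two clean-up steps above can be preceded by a quick review of what is actually there. The sketch below only lists candidates and deletes nothing. The function name is illustrative, and both default paths are assumptions: `/boot/config/go` for the go file and `/boot/config/modprobe.d` (the path from the touch command earlier in this thread) for the modprobe.d folder.

```shell
#!/bin/bash
# Sketch: list leftover i915 references so they can be reviewed before removal.
# $1 = go file, $2 = modprobe.d folder; both defaults are assumptions.
list_i915_leftovers() {
    local go_file="${1:-/boot/config/go}"
    local modprobe_dir="${2:-/boot/config/modprobe.d}"
    if [ -f "$go_file" ]; then
        # Show matching go-file lines with their line numbers
        # (commented-out lines are listed too).
        grep -n 'i915' "$go_file" || true
    fi
    if [ -d "$modprobe_dir" ]; then
        # Show any i915 files sitting directly in the folder.
        find "$modprobe_dir" -maxdepth 1 -type f -name '*i915*'
    fi
}
```

An empty result suggests there is nothing left to remove in those two locations.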

    Link to comment




