• [6.9.x - 6.11.x] Intel i915 module causing system hangs with no report in syslog (not Alder Lake)


    Tristankin
    • Minor

    Since the releases based on the 5.x kernel, many users have been reporting system hangs every few days once the i915 module is loaded.

    With reports from a few users detailed in the thread below we have worked out that the issue is caused by the i915 module and is a persistent issue with both the 6.9.x release and 6.10 release candidates.


    The system does not need to be actively transcoding for the hang to occur. 6.8.3 does not have this issue, so it is not hardware related. Unloading the i915 module stops the hangs. Hangs are still present in 6.10.0-rc2. I can provide a list of similar reports if required.




    User Feedback




    My system has a monitor plugged into the HDMI port. But I think I had a panic crash in 6.8.3. All I have is this shot of the screen before I restarted the system. [Attached image: 20220412_134934.jpg]

    Link to comment
    6 hours ago, dchamb said:

    All I have is this shot of the screen before I restarted the system.

    Can you post your Diagnostics?

     

    Have you tried upgrading to 6.9.2 yet?

    It seems like something else is wrong. Can you run a Memtest on your system?

    Link to comment
    On 4/12/2022 at 10:48 PM, ich777 said:

    Can you post your Diagnostics?

     

    Have you tried upgrading to 6.9.2 yet?

    It seems like something else is wrong. Can you run a Memtest on your system?

    I'm not sure diagnostics will be helpful since I captured them after restarting the system. It was dead in the water with only the screen capture displayed on the console. 

     

    A few months ago I tried 6.9.2, but it would crash constantly, so I rolled back to 6.8.3. Before attempting 6.9.2, my system stayed up for almost 500 days with no issues.

     

    I can try doing a Memtest. 

     

    The only change I made that could possibly be a cause is that I enabled iGPU hardware transcoding in Plex about a week before this crash.

    I'm beginning to think I should stick an NVIDIA card in this thing and abandon Intel hardware transcoding.

    Link to comment
    14 minutes ago, dchamb said:

    I'm beginning to think I should stick an NVIDIA card in this thing and abandon Intel hardware transcoding.

    I'm having no issues whatsoever with HW transcoding on my 10th gen Intel i5-10600 (with an HDMI 1080p dummy plug plugged into my iGPU). I also find the transcoding quality from Intel a bit better than NVENC, but that is always subjective... :)

     

    It's really suspicious to me that you have had issues on 6.8.3 too, and of course also on newer versions.

    Link to comment

    After having (at least I thought) solved this issue in rc5 (I had continuous uptime since the release of rc5) by adding an HDMI dummy to my i7-6700K setup, I had the first crash after only 1 day of uptime on rc7. Did anything change regarding that matter in rc7?

    Link to comment
    2 minutes ago, Akilae said:

    After having (at least I thought) solved this issue in rc5 (I had continuous uptime since the release of rc5) by adding an HDMI dummy to my i7-6700K setup, I had the first crash after only 1 day of uptime on rc7. Did anything change regarding that matter in rc7?

    Nope, but are you sure that your iGPU was the cause of the issue?

    Link to comment

    No, I'm not sure. I am sure that my GPU was the issue in the past, up until the HDMI dummy. I literally changed everything, including RMAing the memory sticks and memtesting the new ones for 2 days. The longest uptime before a crash without any log entry was about 3-4 days. Then I read your recommendation about the HDMI dummy, added it to the system, and instantly reached many days of uptime without any crashes. Now I installed rc7 without any changes to my system and it crashed after only 1 day, again without any log entry. That's why I instantly thought the issues were back.

    Link to comment
    2 minutes ago, Akilae said:

    Now I installed rc7 without any changes to my system and it crashed after only 1 day, again without any log entry.

    Do you run the syslog server on your machine?

     

    3 minutes ago, Akilae said:

    That's why I instantly thought the issues were back.

    Were you transcoding something with your iGPU before it crashed?

     

    3 minutes ago, Akilae said:

    No, I'm not sure. I am sure that my GPU was the issue in the past, up until the HDMI dummy.

    Only a fix for Ice Lake CPUs was added but nothing that should affect your Skylake CPU.

    Link to comment
    4 minutes ago, ich777 said:

    Do you run the syslog server on your machine?

     

    Were you transcoding something with your iGPU before it crashed?

     

    Only a fix for Ice Lake CPUs was added but nothing that should affect your Skylake CPU.

     

    Yes, syslog logs to a share and is capped at 500 MB.

     

    Tautulli shows a transcoding action within 4 minutes of the last available syslog entry before my hard reset. Also, the user was only able to watch 2% of the show, so I guess that was when the crash happened.

     

    I see, thank you for the information. If I can provide any additional information just tell me.

    Link to comment
    2 minutes ago, Akilae said:

    Tautulli shows a transcoding action within 4 minutes of the last available syslog entry before my hard reset. Also, the user was only able to watch 2% of the show, so I guess that was when the crash happened.

    Please first check whether this was just a false positive by letting the server run for a few days. Maybe also try powering off the server, pulling the power cord from the wall, pressing the on/off and reset buttons a few times (to completely drain the capacitors), and then turning it on again.

     

    3 minutes ago, Akilae said:

    I see, thank you for the information. If I can provide any additional information just tell me.

    Feel free to get back to me if the server crashes again and we can diagnose further... I really hope this was just a false positive and that your server runs without a hitch now.

    Link to comment

    I have a 9700k, running RC7 and added a dummy plug as well to no avail. If I keep the module from loading/disable quicksync in plex my system does not freeze. :( I can provide any logs that would be of help. 

    Link to comment
    4 minutes ago, ich777 said:

    Please first check whether this was just a false positive by letting the server run for a few days. Maybe also try powering off the server, pulling the power cord from the wall, pressing the on/off and reset buttons a few times (to completely drain the capacitors), and then turning it on again.

     

    Feel free to get back to me if the server crashes again and we can diagnose further... I really hope this was just a false positive and that your server runs without a hitch now.

    Will do that. Thank you.

    Link to comment
    6 minutes ago, mechmess said:

    If I keep the module from loading/disable quicksync in plex my system does not freeze. :( I can provide any logs that would be of help.

    Please post your Diagnostics.

    Link to comment
    On 5/6/2022 at 2:24 PM, ich777 said:

    Please post your Diagnostics.

    Wanted to wait for another crash before sending. 

     

    Last crash was ~23:00 on May 9th. Plex transcoding was happening at the time of crash.

     

    I have a remote syslog server set up and included the log from that. The last few lines before the crash are:

     

    May  9 03:07:54 Tower CA Backup/Restore: Backup / Restore Completed
    May  9 08:02:33 Tower webGUI: Successful login user root from 192.168.7.153
    May  9 08:04:52 Tower kernel: mdcmd (36): check 
    May  9 08:04:52 Tower kernel: md: recovery thread: check P ...
    May  9 08:05:05 Tower flash_backup: adding task: /usr/local/emhttp/plugins/dynamix.my.servers/scripts/UpdateFlashBackup update
    May  9 13:13:07 Tower flash_backup: adding task: /usr/local/emhttp/plugins/dynamix.my.servers/scripts/UpdateFlashBackup update
    May  9 15:51:27 Tower kernel: curl[26998]: segfault at d ip 0000151904658751 sp 00007ffd2fb6afc0 error 4 in ld-2.31.so[15190464e000+23000]
    May  9 15:51:27 Tower kernel: Code: 68 4c 8b 6c 24 70 49 89 c0 48 8b ac 24 d0 00 00 00 4c 8b a4 24 e8 00 00 00 4c 8b 5c 24 78 0f 1f 00 48 83 bc 24 f8 00 00 00 00 <0f> 84 61 01 00 00 41 0f b6 40 05 83 e0 03 83 e8 01 83 f8 01 0f 86

    tower-diagnostics-20220510-0902.zip syslog-192.168.7.224.log

    Edited by mechmess
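    A minimal sketch, assuming the classic "Mon DD HH:MM:SS host message" syslog layout shown in the excerpt above, of how one might measure the silent gap between the last logged entry and a crash. The function names and the year (2022, taken from the post dates, since these syslog lines carry no year) are assumptions for illustration and not part of Unraid or any tool mentioned in this thread.

```python
from datetime import datetime


def parse_syslog_line(line, year=2022):
    """Split a 'Mon DD HH:MM:SS host message' line into (timestamp, host, message).

    Returns None for lines that do not match this layout. The year is an
    assumption because classic syslog timestamps do not include one.
    """
    # split() with no separator collapses the double space used for
    # single-digit days ("May  9") and ignores leading indentation.
    parts = line.split(None, 4)
    if len(parts) < 5:
        return None
    month, day, clock, host, message = parts
    try:
        stamp = datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None
    return stamp, host, message


def silence_before_crash(lines, crash_time, year=2022):
    """Return (last_entry, gap): the latest entry logged at or before
    crash_time and the time elapsed between that entry and the crash."""
    last = None
    for line in lines:
        entry = parse_syslog_line(line, year)
        if entry is None or entry[0] > crash_time:
            continue
        if last is None or entry[0] >= last[0]:
            last = entry
    if last is None:
        return None, None
    return last, crash_time - last[0]


if __name__ == "__main__":
    sample = [
        "May  9 13:13:07 Tower flash_backup: adding task: ...",
        "May  9 15:51:27 Tower kernel: curl[26998]: segfault at d ...",
    ]
    # Crash time of roughly 23:00 on May 9th, as reported in the post.
    entry, gap = silence_before_crash(sample, datetime(2022, 5, 9, 23, 0, 0))
    print(entry[0], gap)  # 2022-05-09 15:51:27 7:08:33
```

    With the two sample lines, the last entry is roughly seven hours before the reported crash time, which illustrates how little the log says about the hang itself.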
    Link to comment

    Under 6.10.0-rc7, is this now the correct method to enable the Quick Sync GPU encoder for the Plex Docker container?

    - Make sure the iGPU is enabled in the BIOS

    - Install Intel GPU TOP from the store

    - Open Plex settings > go to Transcoder > check both hardware transcoding boxes

     

    Is manually enabling the i915 driver no longer needed, given it is baked into Unraid now?

     

    Link to comment
    1 hour ago, mechmess said:

    Last crash was ~23:00 on May 9th. Plex transcoding was happening at the time of crash.

    This doesn't seem to be caused by the iGPU.

    Can you maybe create an issue for this on the bug tracker?

    Link to comment
    23 minutes ago, snailtrails said:

    Is manually enabling the i915 driver no longer needed, given it is baked into Unraid now?

    You don't have to use Intel GPU TOP anymore because the iGPU is activated on boot by Unraid itself. However, you still have to install it if you want to use the iGPU with the GPU Statistics plugin; otherwise GPU Statistics won't work.

    Link to comment
    30 minutes ago, snailtrails said:

    Under 6.10.0-rc7, is this now the correct method to enable the Quick Sync GPU encoder for the Plex Docker container?

    - Make sure the iGPU is enabled in the BIOS

    - Install Intel GPU TOP from the store

    - Open Plex settings > go to Transcoder > check both hardware transcoding boxes

     

    Is manually enabling the i915 driver no longer needed, given it is baked into Unraid now?

     

    You still need to add /dev/dri to the container as well.
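    A minimal sketch, assuming the usual Linux naming of DRM device nodes (`card*` and `renderD*`), of how one might confirm on the host that /dev/dri exists before passing it to a container. The helper name `list_dri_nodes` is hypothetical, and the check says nothing about whether the container itself has been given the device.

```python
import os


def list_dri_nodes(dri_dir="/dev/dri"):
    """Return the sorted card*/renderD* node names found in dri_dir.

    An empty list means the directory is missing or holds no such nodes,
    which usually suggests that no GPU driver has created them.
    """
    if not os.path.isdir(dri_dir):
        return []
    return sorted(
        name for name in os.listdir(dri_dir)
        if name.startswith(("card", "renderD"))
    )


if __name__ == "__main__":
    nodes = list_dri_nodes()
    if nodes:
        print("DRI nodes found:", ", ".join(nodes))
    else:
        print("No DRI nodes found under /dev/dri; the i915 module may not be loaded.")
```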

    Link to comment
    28 minutes ago, ich777 said:

    This doesn't seem to be caused by the iGPU.

    Can you maybe create an issue for this on the bug tracker?

    I’ll do so. However, my system is stable for weeks on end without the driver loading. I have been re-enabling it with every RC release. Could just be a coincidence though!
     

    Thanks for taking a look!! 

    Link to comment
    3 minutes ago, mechmess said:

    I’ll do so. However, my system is stable for weeks on end without the driver loading. I have been re-enabling it with every RC release. Could just be a coincidence though!
     

    Thanks for taking a look!! 

     

    I don't think it's a coincidence. I've got a 10th Gen i7 with the EXACT same issue. If I blacklist the i915 module from loading, my system is perfectly stable. But as soon as I allow i915 to load, even if nobody is actively streaming or transcoding, it will eventually crash after anywhere from a few days to almost a month, without a hint of what happened being written to the syslog. This happens with either an HDMI monitor or a dummy attached. I travel a lot and use my NAS for a lot of stuff, including home security, so these random crashes can be a huge inconvenience if I'm out of town.

     

    Somewhat off-topic, but because of this issue I'm actually in the process of getting myself acclimated to a new TrueNAS Scale system I built, to see if it's feasible for me to make the switch from Unraid. I'm not a fan of how it uses Kubernetes for the docker apps, but I'm getting used to it, and I used a new docker-compose app they have that basically allows you to easily use docker-compose inside a native Truecharts app, which is awesome. That made it straightforward to set up a cloudflared docker to run an Argo tunnel into the Truecharts Traefik app and reverse proxy back to my various apps, just like I do in Unraid. Now I'm getting my "arrs" stack configured how I like, and it's working out okay so far.

     

    I'll be running both systems in parallel for a while.  Assuming Quicksync hardware transcoding works and remains stable, I think TrueNAS will be an okay alternative for me.
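    A minimal sketch of the blacklisting approach described above, assuming the standard modprobe.d syntax (`blacklist <module>`) and the `/proc/modules` listing of loaded modules. The helper names are hypothetical, and the directory where Unraid expects such files is not stated in this thread; `/boot/config/modprobe.d` is an assumption here and should be verified before writing anything.

```python
from pathlib import Path


def is_module_loaded(name, proc_modules="/proc/modules"):
    """Return True if `name` is listed as a loaded kernel module.

    Each line of /proc/modules starts with the module name; a missing or
    unreadable file is treated as "not loaded".
    """
    try:
        text = Path(proc_modules).read_text()
    except OSError:
        return False
    return any(
        line.split()[0] == name
        for line in text.splitlines()
        if line.strip()
    )


def write_blacklist(name, conf_dir):
    """Write '<conf_dir>/<name>.conf' containing a modprobe blacklist entry
    and return the path. conf_dir must already exist."""
    conf = Path(conf_dir) / f"{name}.conf"
    conf.write_text(f"blacklist {name}\n")
    return conf


if __name__ == "__main__":
    # Only reports the current state; nothing is written here.
    state = "loaded" if is_module_loaded("i915") else "not loaded"
    print(f"i915 is currently {state}")
    # Assumed location on Unraid (verify before use):
    # write_blacklist("i915", "/boot/config/modprobe.d")
```

    A blacklist entry of this kind normally takes effect on the next boot, so a reboot would be needed before comparing stability with and without the module.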

    Link to comment
    5 minutes ago, RogerWilco486 said:

    I don't think it's a coincidence. I've got a 10th Gen i7 with the EXACT same issue. If I blacklist the i915 module from loading, my system is perfectly stable. But as soon as I allow i915 to load, even if nobody is actively streaming or transcoding, it will eventually crash after anywhere from a few days to almost a month, without a hint of what happened being written to the syslog. This happens with either an HDMI monitor or a dummy attached. I travel a lot and use my NAS for a lot of stuff, including home security, so these random crashes can be a huge inconvenience if I'm out of town.

    Please share your Diagnostics.

     

    I really can't imagine why it works just fine on my system (i5-10600) and on @alturismo's system (i9-10850K) but not on yours. I'm using Emby/Jellyfin and he uses Plex/Emby. I don't know if @sonic6 transcodes that much, but he is running an i3-10400 and, from what I know, it works there with Plex too.

     

    I think a lot of users are running hardware transcoding on 10th Gen without any issues so far, including on the latest RC versions.

     

    8 minutes ago, RogerWilco486 said:

    I'll be running both systems in parallel for a while.  Assuming Quicksync hardware transcoding works and remains stable, I think TrueNAS will be an okay alternative for me.

    But what is different on TrueNAS compared to Unraid? What kernel version does your TrueNAS box run?

    I can also imagine that it has something to do with TrueNAS running Kubernetes instead of Docker/containerd/runc.

    Link to comment
    7 minutes ago, ich777 said:

    @RogerWilco486 & @mechmess which container (repo) from Plex are you using on Unraid?

    I don't use Plex, I'm actually using your Jellyfin-AMD-Intel-Nvidia container with acceleration disabled.

     

    Re diagnostics, I've had i915 blacklisted so long it's been a while since I've experienced a crash.  But I did provide diagnostics a few pages back in this thread before it was suggested to try disabling i915 which solved the crashes for me.

     

    I just consoled into my TrueNAS, the kernel is "5.10.109+truenas" so considerably older than Unraid's kernel.

    Link to comment
    Just now, RogerWilco486 said:

    Re diagnostics, I've had i915 blacklisted so long it's been a while since I've experienced a crash.  But I did provide diagnostics a few pages back in this thread before it was suggested to try disabling i915 which solved the crashes for me.

    Do you have a display constantly plugged into the HDMI port of your iGPU?

    So far I have had a few people with crashing servers on the 6.10.0-rc series (not counting 12th Gen, of course, because that will be fixed in kernel versions 5.18+), and plugging in an HDMI dummy plug or a monitor solved the crashing.

     

    Please also keep in mind that, from my perspective, Intel has made a huge mess of its Linux drivers from kernel version 5.10 up to now.

     

    1 minute ago, RogerWilco486 said:

    I just consoled into my TrueNAS, the kernel is "5.10.109+truenas" so considerably older than Unraid's kernel.

    Ah wait, TrueNAS is based on BSD or am I wrong?

    Link to comment
    1 minute ago, ich777 said:

    Do you have a display constantly plugged into the HDMI port of your iGPU?

     

    Yep, I have a HDMI dummy plugged in when not using a monitor.  I suppose it's possible the dummy is bad...how likely is that though?

     

    1 minute ago, ich777 said:

    Ah wait, TrueNAS is based on BSD or am I wrong?

     

    TrueNAS Scale is based on Linux; it is their "hyper-converged" stack.

    Link to comment




