• [6.9.x - 6.11.x] intel i915 module causing system hangs with no report in syslog (not alder lake)


    Tristankin
    • Minor

    Since the releases based on the 5.x kernel, many users have been reporting system hangs every few days once the i915 module is loaded.

    Based on reports from a few users, detailed in the thread below, we have worked out that the issue is caused by the i915 module and persists in both the 6.9.x releases and the 6.10 release candidates.

    The system does not need to be actively transcoding for the hang to occur. 6.8.3 does not have this issue, so the problem is not hardware related. Unloading the i915 module stops the hangs. Hangs are still present in 6.10.0RC2. I can provide a list of similar reports if required.
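
    For anyone who wants to confirm whether the module is actually loaded before and after a test, here is a minimal sketch. It assumes a standard Linux /proc/modules, where the first field of each line is the module name. The function names and the sample text are made up for illustration and are not from any system in this thread.

```python
def loaded_modules(proc_modules_text):
    """Return the set of module names listed in /proc/modules-style text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            # The first whitespace-separated field is the module name.
            names.add(fields[0])
    return names


def is_module_loaded(name, proc_modules_path="/proc/modules"):
    """Check whether a kernel module is currently loaded on this machine."""
    with open(proc_modules_path) as handle:
        return name in loaded_modules(handle.read())


# Illustrative sample only (hypothetical sizes and addresses).
SAMPLE = (
    "i915 3051520 5 - Live 0x0000000000000000\n"
    "drm_kms_helper 163840 1 i915, Live 0x0000000000000000\n"
)
print("i915 loaded in sample:", "i915" in loaded_modules(SAMPLE))
```

    On a live server, is_module_loaded("i915") reads the real /proc/modules instead of the sample.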




    User Feedback

    Recommended Comments



    5 minutes ago, RogerWilco486 said:

    I suppose it's possible the dummy is bad...how likely is that though?

    I don't know. If the dummy doesn't actually work, you will get a notification when you boot the server (usually beep codes indicating that no VGA is plugged in), and on the Unraid RC versions you can see an issue with the i915 module almost immediately at boot, when it initializes...

     

    5 minutes ago, RogerWilco486 said:

    TrueNAS Scale is based on Linux to be their "hyper-converged" stack.

    Ah okay, the more you know... :)

    Then TrueNAS Scale is on almost the same Kernel version as Unraid 6.9.2 (5.10.28). Do they have a beta channel too with a newer Kernel, something like 5.15.38 (latest LTS Kernel), so that you can check whether you run into exactly the same issues there as on Unraid?

    Link to comment

    My conclusion from all this, combined with my own experience, led me to put an NVIDIA card in my motherboard and completely abandon the iGPU. 

     

    I tried unRaid 6.9.2 and it crashed every other day with nothing about the crash in the logs. I tried 6.8.3 with software transcoding and it works fine, although slower. I tried 6.8.3 with hardware decoding with iGPU and it crashed about every week or so (but Plex performance is better).

     

    Has anyone made the jump to NVIDIA without an issue in hardware transcoding whether it is 6.8.3 or 6.9.2 (and above)?

     

    Dale

    Edited by dchamb
    Link to comment
    3 hours ago, ich777 said:

    You don't have to use Intel GPU TOP anymore because the iGPU is activated on boot by Unraid itself. However, you do have to install it if you want to use the GPU Statistics plugin, otherwise GPU Statistics won't work.

    If I don't have GPU TOP installed, /dev/dri does not populate. Is there something else I need to change? I currently have GPU TOP installed, so maybe that is causing some sort of issue compared with just letting it run with the defaults. 
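
    A quick way to see whether /dev/dri has populated is to list its device nodes. This is only a rough sketch: list_dri_nodes is a made-up helper name, and it assumes the usual card*/renderD* naming of DRM device nodes on Linux.

```python
import os


def list_dri_nodes(dri_dir="/dev/dri"):
    """Return the card*/renderD* entries in dri_dir, or [] if it is missing."""
    if not os.path.isdir(dri_dir):
        return []
    return sorted(
        entry for entry in os.listdir(dri_dir)
        if entry.startswith(("card", "renderD"))
    )


nodes = list_dri_nodes()
if nodes:
    print("DRM device nodes found:", ", ".join(nodes))
else:
    print("No DRM device nodes found; the GPU driver may not be loaded.")
```

    An empty result only says that no nodes exist right now; it does not say why the driver did not create them.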

    Link to comment
    3 hours ago, ich777 said:

    (of course not 12th Gen because this will be fixed in versions Kernel 5.18+)

    Hopefully. Still not fixed in Unraid with 5.18RC5...

    Link to comment
    2 hours ago, dchamb said:

    Has anyone made the jump to NVIDIA without an issue in hardware transcoding whether it is 6.8.3 or 6.9.2 (and above)?

     

    I have spare Quadro boards and I would do this except for the system in question is ITX and the single PCIe slot is used for the LSI HBA (necessary since I'm using SAS drives).

    Link to comment
    11 minutes ago, RogerWilco486 said:

     

    I have spare Quadro boards and I would do this except for the system in question is ITX and the single PCIe slot is used for the LSI HBA (necessary since I'm using SAS drives).

    So if you had a second PCIe slot, you would be fine with it? I have to check mine too since I'm using a PCIe slot for my SATA drives.

    Link to comment

    I can only confirm what @ich777 pointed out: no issues here at all with my server, nor with the server(s) I maintain that use i915.

    The HDMI dummy is not what causes crashes; it is just there to make reboots safe with no monitor attached (here at least).

    What always leads to the random crashes described here is (sadly) the corefreq plugin, for me. Without adjusting anything, as soon as it is installed the server will randomly crash after a day, a week, a month ... I never tested the corefreq plugin without the iGPU, so I can't say anything about that combination; this is just a side note.

    Link to comment
    7 hours ago, flyize said:

    Hopefully. Still not fixed in Unraid with 5.18RC5...

    I don't completely understand… Do you have a custom Kernel installed on RC5?

    Keep in mind that Kernel 5.18 is still RC and not stable.

    As Tom pointed out, Unraid 6.10.x stable will stay on the latest LTS Kernel, which is 5.15.38 at the time of writing. Unraid 6.11.x will then have the latest stable Kernel, which is 5.17.6 at the time of writing.

    It is possible that 5.18 will have become stable by then and that 6.11 will have Kernel 5.18+.

    Intel made a real mess here lately, and that's why I always recommend making sure that bleeding-edge hardware works on Linux before buying.

    Link to comment
    7 hours ago, RogerWilco486 said:

    I have spare Quadro boards and I would do this except for the system in question is ITX and the single PCIe slot is used for the LSI HBA (necessary since I'm using SAS drives).

    My test server has a Nvidia T400 built in and I test HW transcoding for Nvidia there and it is working fine since 6.8.2 up to the latest RC version.

    Link to comment
    55 minutes ago, ich777 said:

    It is possible that 5.18 will have become stable by then and that 6.11 will have Kernel 5.18+.

    Watching the 5.18 RC releases, if it continues as smoothly as it has so far, they say we are looking at a release on the 22nd of May, so hopefully it can make it into the 6.11 RC series. I'm guessing (looking at past Unraid releases) that 6.11 RC1 isn't going to be released in May, so perhaps there is still a chance 🙂

    Edited by Titan84
    Link to comment
    8 hours ago, ich777 said:

    I don't completely understand… Do you have a custom Kernel installed on RC5?

    Keep in mind that Kernel 5.18 is still RC and not stable.

    As Tom pointed out, Unraid 6.10.x stable will stay on the latest LTS Kernel, which is 5.15.38 at the time of writing. Unraid 6.11.x will then have the latest stable Kernel, which is 5.17.6 at the time of writing.

    It is possible that 5.18 will have become stable by then and that 6.11 will have Kernel 5.18+.

    Intel made a real mess here lately, and that's why I always recommend making sure that bleeding-edge hardware works on Linux before buying.

    That's what I said. :P Running 5.18RC5 (with Unraid 6.10RC7) and the iGPU still doesn't work. Hopefully they will *eventually* fix it.

    Link to comment
    1 minute ago, flyize said:

    That's what I said. :P Running 5.18RC5 (with Unraid 6.10RC7) and the iGPU still doesn't work. Hopefully they will *eventually* fix it.

    Haha, Intel really messed up lately with their iGPU drivers.

    Can you maybe send me the Diagnostics from your system with Kernel 5.18RC5 (I think @thor2002ro creates those builds or am I wrong?).

    Link to comment
    3 minutes ago, ich777 said:

    Haha, Intel really messed up lately with their iGPU drivers.

    Can you maybe send me the Diagnostics from your system with Kernel 5.18RC5 (I think @thor2002ro creates those builds or am I wrong?).

    Yep, it's from @thor2002ro! I'd be happy to shoot them over to ya. The good thing is that it doesn't hard crash the server, only the Plex container. Do you want a PM, or should I post 'em here?

    Link to comment

    So from my findings I think the problem is not 100% kernel related. I found this very helpful thread in the media-driver repo: https://github.com/intel/media-driver/issues/1342 which explains (if you're patient enough to read it) that the system hang is solved by kernel 5.17+, but the actual hardware transcoding crash is related to ffmpeg (https://github.com/intel/media-driver/issues/1342#issuecomment-1106171903). 

    Plex doesn't push anything to syslog for me, but I've tried Emby and managed to get a log identical to the one described in that thread. 
     

    Jun  1 15:02:41 Skippy kernel: i915 0000:00:02.0: [drm] Resetting vcs0 for preemption time out
    Jun  1 15:02:41 Skippy kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:28fffffd, in ffmpeg [16514]
    Jun  1 15:02:49 Skippy kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:28fffffd, in ffmpeg [16514]
    Jun  1 15:02:49 Skippy kernel: i915 0000:00:02.0: [drm] Resetting vcs0 for stopped heartbeat on vcs0
    Jun  1 15:02:49 Skippy kernel: i915 0000:00:02.0: [drm] Resetting chip for stopped heartbeat on vcs0
    Jun  1 15:02:49 Skippy elogind-uaccess-command[17064]: Failed to reset ACL on /dev/dri/card0: Operation not supported
    Jun  1 15:02:49 Skippy elogind-uaccess-command[17065]: Failed to reset ACL on /dev/dri/card0: Operation not supported
    Jun  1 15:02:49 Skippy kernel: [drm:intel_gt_reset [i915]] *ERROR* Failed to reset GuC, ret = -110
    Jun  1 15:02:49 Skippy kernel: i915 0000:00:02.0: [drm] *ERROR* Failed to reset chip
    Jun  1 15:02:49 Skippy kernel: i915 0000:00:02.0: [drm:intel_gt_reset [i915]] CI tainted:0x9 by intel_gt_handle_error+0x343/0x530 [i915]
    Jun  1 15:02:49 Skippy kernel: [drm:__intel_gt_set_wedged [i915]] *ERROR* Failed to reset GuC, ret = -110
    Jun  1 15:02:49 Skippy kernel: i915 0000:00:02.0: [drm] ffmpeg[16514] context reset due to GPU hang
    Jun  1 15:02:52 Skippy ntpd[1407]: kernel reports TIME_ERROR: 0x41: Clock Unsynchronized
    Jun  1 15:02:54 Skippy kernel: Fence expiration time out i915-0000:00:02.0:ffmpeg[16514]:3aa2!
    Jun  1 15:04:27 Skippy kernel: i915 0000:00:02.0: [drm] *ERROR* rcs0 TLB invalidation did not complete in 4ms!
    Jun  1 15:04:27 Skippy kernel: i915 0000:00:02.0: [drm] *ERROR* bcs0 TLB invalidation did not complete in 4ms!
    Jun  1 15:04:27 Skippy kernel: i915 0000:00:02.0: [drm] *ERROR* rcs0 TLB invalidation did not complete in 4ms!
    Jun  1 15:04:27 Skippy kernel: i915 0000:00:02.0: [drm] *ERROR* bcs0 TLB invalidation did not complete in 4ms!
    Jun  1 15:04:27 Skippy kernel: i915 0000:00:02.0: [drm] *ERROR* rcs0 TLB invalidation did not complete in 4ms!
    Jun  1 15:04:27 Skippy kernel: i915 0000:00:02.0: [drm] *ERROR* bcs0 TLB invalidation did not complete in 4ms!
    Jun  1 15:04:27 Skippy kernel: i915 0000:00:02.0: [drm] *ERROR* rcs0 TLB invalidation did not complete in 4ms!
    Jun  1 15:04:27 Skippy kernel: i915 0000:00:02.0: [drm] *ERROR* bcs0 TLB invalidation did not complete in 4ms!
    Jun  1 15:04:27 Skippy kernel: i915 0000:00:02.0: [drm] *ERROR* rcs0 TLB invalidation did not complete in 4ms!
    Jun  1 15:04:27 Skippy kernel: i915 0000:00:02.0: [drm] *ERROR* bcs0 TLB invalidation did not complete in 4ms!

     

    Although Plex has its own proprietary transcoder, it is still based on ffmpeg, so my thinking is that the same behaviour is happening there. 

    So luckily, by upgrading the Unraid kernel to 5.18 I've solved the freeze problem. Hardware transcoding itself still stops working with the above error, and I think we need to wait until Plex applies the ffmpeg fix to their own server. 

    I'm using an i7-12700K and Unraid 6.10.1 with the unofficial 5.18 kernel.
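
    If anyone wants to check their own syslog for the same pattern, here is a small sketch that picks out the "GPU HANG" lines. The regular expression is only an assumption based on the lines I posted above, not an official i915 log format, and find_gpu_hangs is a made-up name.

```python
import re

# Pattern modelled on lines such as:
#   "... kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:28fffffd, in ffmpeg [16514]"
GPU_HANG = re.compile(
    r"i915 (?P<pci>[0-9a-f:.]+): \[drm\] GPU HANG: "
    r"ecode (?P<ecode>[0-9a-f:]+), in (?P<process>\S+) \[(?P<pid>\d+)\]"
)


def find_gpu_hangs(lines):
    """Return one dict (pci, ecode, process, pid) per 'GPU HANG' line."""
    hangs = []
    for line in lines:
        match = GPU_HANG.search(line)
        if match:
            hangs.append(match.groupdict())
    return hangs


# Three lines copied from the log in this post.
SAMPLE_LOG = [
    "Jun  1 15:02:41 Skippy kernel: i915 0000:00:02.0: [drm] Resetting vcs0 for preemption time out",
    "Jun  1 15:02:41 Skippy kernel: i915 0000:00:02.0: [drm] GPU HANG: ecode 12:4:28fffffd, in ffmpeg [16514]",
    "Jun  1 15:02:49 Skippy kernel: i915 0000:00:02.0: [drm] *ERROR* Failed to reset chip",
]

for hang in find_gpu_hangs(SAMPLE_LOG):
    print(hang["process"], hang["pid"], hang["ecode"])
```

    To scan a real log, pass an open file object instead of SAMPLE_LOG, since the function only iterates over lines.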

    Link to comment
    On 5/6/2022 at 9:18 PM, ich777 said:

    Please first check whether this was just a false positive and let the server run for a few days. Maybe also try to power off the server, pull the power cord from the wall, press the on/off and reset buttons a few times (to completely empty the caps) and then turn it on again.

    Feel free to get back to me if the server crashes again and we can diagnose further... I really hope this was just a false positive and that your server runs without a hitch now.

    Hello,

     

    I just wanted to follow up on my earlier post in this thread: the error is back. Unfortunately it wasn't just a hiccup, and I had to remove /dev/dri from Plex completely to stop the freezes.

    What can we do from here? I have read some reports that Kernel 5.18 will fix the issues. Is that just for Alder Lake, or could it be a fix for Skylake as well?

     

    Best regards,

    Link to comment

    For what it's worth I wiped Unraid from my Comet Lake NAS and rebuilt with Fedora Server/OpenZFS/Samba/Docker/libvirt and it's been rock solid.  I'm using the Intel Media VAAPI driver with the 5.17 kernel and transcoding has been working flawlessly too. 

     

    When running Unraid on this same exact hardware it would randomly lock up anytime I allowed the i915 module to load.

    Link to comment
    1 hour ago, RogerWilco486 said:

    with the 5.17 kernel

    Unraid stable is on 5.15.43. Some fixes were introduced in 5.17, and in 5.18 everything should be fixed, even for Alder Lake. Intel really messed up with Alder Lake and its implementation in the Kernel.
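
    To check which side of 5.18 a given system is on, something like the sketch below can help. It assumes the release string starts with major.minor[.patch], as in "5.15.43-Unraid" (the exact suffix is an assumption), and the helper names are made up. The 5.18 threshold is just the expectation stated in this thread, not a confirmed fix version.

```python
import re


def parse_kernel_release(release):
    """Extract (major, minor, patch) from a string such as '5.15.43-Unraid'."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def is_at_least(release, minimum):
    """Compare major.minor against a (major, minor) tuple.

    Suffixes such as '-rc5' are ignored, so a release candidate of 5.18
    counts as 5.18 here.
    """
    return parse_kernel_release(release)[:2] >= tuple(minimum)


# Hypothetical release strings for illustration.
for release in ("5.15.43-Unraid", "5.18.0-rc5"):
    print(release, is_at_least(release, (5, 18)))
```

    On a running system the release string can be taken from platform.release() or from the output of uname -r.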

    Link to comment
    1 hour ago, ich777 said:

    Unraid stable is on 5.15.43. Some fixes were introduced in 5.17, and in 5.18 everything should be fixed, even for Alder Lake. Intel really messed up with Alder Lake and its implementation in the Kernel.

    I was running Unraid RC and therefore on the 5.17 kernel--- still had the lockups. 

    Link to comment
    2 minutes ago, RogerWilco486 said:

    I was running Unraid RC and therefore on the 5.17 kernel--- still had the lockups. 

    Unraid is only on the 5.15 kernel, even for the RCs, unless you installed a custom kernel.

    Edited by SimonF
    Link to comment
    23 minutes ago, RogerWilco486 said:

    I was running Unraid RC and therefore on the 5.17 kernel--- still had the lockups. 

    Was the Kernel already stable at the time you tried it or was it a RC Kernel?

    Link to comment

    Since the new Kernel 5.18.x will likely solve this issue, what are the chances that @limetech could release, say, a version 6.11-RC1 for us that has the latest Kernel? In this RC1 version the only change from 6.10.2 would be the new Kernel and nothing else.

    I'm aware that the team has a lot going on right now with the NIC issue, and I'm not sure what is involved in combining Unraid with the new Kernel, so this might be totally out of the question, but I thought I'd ask ;-)

    I'm in the same boat as a lot of other people with a 12900K, but I don't want to use a custom Kernel mod if I can help it, as I don't want to mess things up 🙂

    Link to comment
    1 hour ago, Titan84 said:

    say a version 6.11-RC1 for us that has the latest Kernel.

    I would expect any 6.11 release to have significant new functionality.    If it was only a kernel upgrade I would expect it to be a point release within the 6.10 series.

    Link to comment




