• [6.9.x - 6.11.x] intel i915 module causing system hangs with no report in syslog (not alder lake)


    Tristankin
    • Minor

    Since the releases based on the 5.x kernel, many users have been reporting system hangs every few days once the i915 module is loaded.

    With reports from a few users detailed in the thread below, we have worked out that the hangs are caused by the i915 module and persist in both the 6.9.x release and the 6.10 release candidates.


    The system does not need to be actively transcoding for the hang to occur. 6.8.3 does not have this issue, so the problem does not appear to be hardware related. Unloading the i915 module stops the hangs. Hangs are still present in 6.10.0RC2. I can provide a list of similar reports if required.







    And again today

     

    Apr  1 08:33:04 Firefly  crond[1067]: exit status 1 from user root /usr/local/sbin/mover &> /dev/null
    Apr  1 10:33:38 Firefly  emhttpd: spinning down /dev/sde
    Apr  1 10:33:45 Firefly  emhttpd: spinning down /dev/sdb
    Apr  1 10:35:39 Firefly  emhttpd: spinning down /dev/sdh
    Apr  1 10:35:39 Firefly  emhttpd: spinning down /dev/sdg
    Apr  1 10:35:39 Firefly  emhttpd: spinning down /dev/sdc
    Apr  1 11:42:40 Firefly  emhttpd: read SMART /dev/sdh
    Apr  1 11:42:40 Firefly  emhttpd: read SMART /dev/sdc
    Apr  1 13:26:28 Firefly  emhttpd: spinning down /dev/sdf
    Apr  1 13:41:59 Firefly  emhttpd: spinning down /dev/sdj
    Apr  1 13:52:28 Firefly  emhttpd: read SMART /dev/sde
    Apr  1 14:09:10 Firefly  emhttpd: read SMART /dev/sdf
    Apr  1 15:37:59 Firefly  emhttpd: spinning down /dev/sdh
    Apr  1 15:37:59 Firefly  emhttpd: spinning down /dev/sdc
    Apr  1 16:25:51 Firefly kernel: microcode: microcode updated early to revision 0xf0, date = 2021-11-12
    Apr  1 16:25:51 Firefly kernel: Linux version 5.19.17-Unraid (root@Develop) (gcc (GCC) 12.2.0, GNU ld version 2.39-slack151) #2 SMP PREEMPT_DYNAMIC Wed Nov 2 11:54:15 PDT 2022
    Apr  1 16:25:51 Firefly kernel: Command line: BOOT_IMAGE=/bzimage initrd=/bzroot

     

     

    Attached diagnostics just in case.

    I am really starting to run out of ideas...

    firefly-diagnostics-20230401-1628.zip
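    Since nothing is logged at the moment of the hang, one way to at least bracket when it happened is to pull out the last line written before each boot banner. The following is a minimal sketch, not an Unraid tool: the function name `last_line_before_boot` is made up for illustration, and it assumes the classic `Mon DD HH:MM:SS` syslog prefix and the `kernel: Linux version` banner seen in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: print the last syslog line logged before each boot.
# Lines sharing the boot banner's timestamp (e.g. the early microcode message)
# are skipped, so the output is the last entry from before the hang/reset.
last_line_before_boot() {
    # $1: path to a syslog file (for example a copy mirrored to the flash drive)
    awk '
        {
            ts = $1 " " $2 " " $3              # month, day, time
            if (ts != cur_ts) {                # timestamp changed:
                before = last                  # remember last line of the previous group
                cur_ts = ts
            }
            last = $0
        }
        /kernel: Linux version/ && before != "" { print before }
    ' "$1"
}
```

    Called as `last_line_before_boot /var/log/syslog` on a log that spans a reboot, it would print the final pre-boot entry; with the excerpt above that is the 15:37:59 spin-down line, meaning the hang occurred somewhere in the following 48 minutes.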


    If you read upthread, I think someone suggested attaching a monitor to it so you can see the error on screen.

    2 hours ago, flyize said:

    If you read upthread, I think someone suggested attaching a monitor to it so you can see the error on screen.

    Personally, I did, and the screen showed nothing in particular. Maybe someone else will have some more luck.

     

    @Tristankin, do you happen to have multiple NICs on your motherboard by any chance? I ask because I have the same problems as you, and it seems I may have regained stability (30 days uptime and still no crash) by disabling my BR1 network interface (which I used for VMs on a separate VLAN), leaving only BR0 enabled. I know it seems very random and unrelated, but without anything useful in the logs (like you) I am just trying random things now...

    Edited by Opawesome
    6 hours ago, Opawesome said:

    Do you happen to have multiple NICs on your motherboard by any chance?


    Only the single physical NIC, 2 x docker bridges, one for swag reverse proxy.

    image.thumb.png.6026101fe08e27f50b820520f6229865.png

     

    Quote

    someone suggested attaching a monitor to it so you can see the error on screen.

     

    You are right, I have to give this a go. I am just wary of the power usage having a monitor on 24/7. 

    On 4/2/2023 at 4:04 AM, flyize said:

    If you read upthread, I think someone suggested attaching a monitor to it so you can see the error on screen.

     

    Turns out I didn't have to wait too long


    IMG_20230404_223056.thumb.jpg.cc6c2992d944b8f190571cf37d7ce924.jpg

     

    Nothing. No response to input devices. Reset button doesn't work. Long press of the power button is the only thing that shuts it down.
     

    root@Firefly:~# cat /sys/class/graphics/*/modes
    U:1920x1080p-0
    root@Firefly:~# cat /sys/class/graphics/*/virtual_size
    1920,1080

     

    I checked the res on the box after reboot and putting the dummy plug back in. I assume this means the dummy plug is fine.

    Anything else I should be checking?

     

     


    maybe tail the log on the screen so that if an error is being logged, you can see it on the screen.

     

    tail -f /var/log/syslog

     

    5 minutes ago, muzo178 said:

    maybe tail the log on the screen so that if an error is being logged, you can see it on the screen.

     

    tail -f /var/log/syslog

     


    Does it have to be written to the log before tail can work? I'm getting nothing in the syslog to flash. Do you think that the log in ram isn't written to the flash quickly enough?


    Unlikely... it should be writing to flash. But if you enter the tail command you will see the log live on your monitor. I doubt you will get anything though; I never got anything when my box was crashing. Worth a try though.

    28 minutes ago, Tristankin said:

    Nothing. No response to input devices. Reset button doesn't work. Long press of the power button is the only thing that shuts it down.

    Are you sure that this isn't a hardware fault?

    6 minutes ago, ich777 said:

    Are you sure that this isn't a hardware fault?


    The system is 100% stable on 6.8.3: 6 months while it was the latest version, and then another 6 months after I tried upgrading to 6.9.x and downgraded to 6.8.3 again. So unless it is a hardware fault that only occurs with a 5.x kernel, no.

    Edited by Tristankin
    11 minutes ago, ich777 said:

    Are you sure that this isn't a hardware fault?

     

    Would it be possible to run a 4.19 kernel on the latest build? Could help identify the cause?

    38 minutes ago, Tristankin said:

    The system is 100% stable on 6.8.3: 6 months while it was the latest version, and then another 6 months after I tried upgrading to 6.9.x and downgraded to 6.8.3 again.

    But what about the other people out there with the same hardware (at least motherboard and CPU) who have zero issues?

     

    32 minutes ago, Tristankin said:

    Would it be possible to run a 4.19 kernel on the latest build? Could help identify the cause?

    No and no.

    1 minute ago, ich777 said:

    But what about the other people out there with the same hardware (at least motherboard and CPU) who have zero issues?

     

    Good for them I guess. What about the people in this thread that are reporting the same issue?

    So my options now are to run out-of-date software or replace my hardware.

    38 minutes ago, Tristankin said:

     

    Good for them I guess. What about the people in this thread that are reporting the same issue?

    So my options now are to run out-of-date software or replace my hardware.

    For fun, you could try the 6.12 beta.

    1 hour ago, flyize said:

    For fun, you could try the 6.12 beta.


    I have tried 6.9.x, 6.10.x, and 6.11.x with the same result. If something was fixed I would have thought it would have happened across those versions.

    I have just changed from deluge to binhex/arch-qbittorrentvpn:4.3.9-2-01 after seeing the freeze reports under stable releases.

    I have been trying to find a solution to this issue for the past 2 years. None of this is particularly fun.

    Edited by Tristankin
    20 minutes ago, Tristankin said:

    I have just changed from deluge to binhex/arch-qbittorrentvpn:4.3.9-2-01 after seeing the freeze reports under general.

    Have you tried disabling torrenting for a while to see whether it is causing your issues? libtorrent is known to crash systems (not only Unraid).

    10 minutes ago, ich777 said:

    Have you tried disabling torrenting for a while to see whether it is causing your issues? libtorrent is known to crash systems (not only Unraid).


    I have moved to a completely different client and rolled back to the 4.3.9 version as listed here:
    https://forums.unraid.net/bug-reports/stable-releases/crashes-since-updating-to-v611x-for-qbittorrent-and-deluge-users-r2153/page/8/#comments

    They seem to get more in their logs though....

    Edited by Tristankin

    @Tristankin I signed up here to contribute my findings with the same issue. I have not yet tried the Unraid OS, but I believe this is actually a Linux Kernel issue, as I am experiencing the exact same problem of hard freezes on my Intel NUC 6CAYH (Intel Celeron J3455) but on the Debian OS. Specifically Debian 10 and 11, running Kernel 5.10 or Kernel 6.1 respectively. I have not yet tried downgrading Debian back to 4.19, but I might do a fresh debian 10 install on a USB-Stick just to test this out.

    Curiously, the Debian-based Ubuntu 22.04 on Kernel 5.19 is NOT experiencing the same freezes, so I am really unsure what is going on. It might be that Kernel 5.19 is the magic one-off kernel that just works, or maybe Canonical did something special with the Ubuntu kernel that prevents the hard freezes. I am somewhat puzzled by this, and it seems the Linux community as a whole is too.

     

    A more thorough description of what I tried and tested can be found here (in German): https://debianforum.de/forum/viewtopic.php?t=186674

     

    Edit: Small update: It appears that disabling Intel VT-d solved or at least improved the situation for me. I am currently typing this on a Debian 11 Kernel 5.10 machine running 4 simultaneous transcodes and the system is not showing any signs of freezing (it usually froze within 2-3 minutes).

    I picked this up somewhere in this thread, so thank you for your input! Now to find out why it crashes with VT-d enabled for specific Linux distros and not for others....

     

    Edited by clang
    Added info about VT-d resolving crash for me.
    10 hours ago, clang said:

    Now to find out why it crashes with VT-d enabled for specific linux distros and not for others....

    One possible difference is the IOMMU mode. For some time now Unraid has used pass-through mode; before that it used DMA translation mode, which IIRC Ubuntu still uses.

    On 4/27/2023 at 6:12 AM, clang said:

    @Tristankin I signed up here to contribute my findings with the same issue. I have not yet tried the Unraid OS, but I believe this is actually a Linux Kernel issue, as I am experiencing the exact same problem of hard freezes on my Intel NUC 6CAYH (Intel Celeron J3455) but on the Debian OS. Specifically Debian 10 and 11, running Kernel 5.10 or Kernel 6.1 respectively. I have not yet tried downgrading Debian back to 4.19, but I might do a fresh debian 10 install on a USB-Stick just to test this out.

    Curiously, the Debian-based Ubuntu 22.04 on Kernel 5.19 is NOT experiencing the same freezes, so I am really unsure what is going on. It might be that Kernel 5.19 is the magic one-off kernel that just works, or maybe Canonical did something special with the Ubuntu kernel that prevents the hard freezes. I am somewhat puzzled by this, and it seems the Linux community as a whole is too.

     

    A more thorough description of what I tried and tested can be found here (in German): https://debianforum.de/forum/viewtopic.php?t=186674

     

    Edit: Small update: It appears that disabling Intel VT-d solved or at least improved the situation for me. I am currently typing this on a Debian 11 Kernel 5.10 machine running 4 simultaneous transcodes and the system is not showing any signs of freezing (it usually froze within 2-3 minutes).

    I picked this up somewhere in this thread, so thank you for your input! Now to find out why it crashes with VT-d enabled for specific Linux distros and not for others....

     


    VT-d did work for a bit, definitely decreased the number of freezes but did not stop them outright.

    Now that I am back on 6.8.3 I have had 3 weeks uptime without a single crash. This is a repeat of what happened after the upgrade to 6.9.2 was causing the same crashes and I went back to 6.8.3 for 6 months, again without a single crash.

    The issue now is that the Docker version on 6.8.3 is so old that I am getting incompatibilities with containers, forcing me to switch to different versions or roll back to ones ~6 months old (specifically the *arrs). If someone could repackage 6.8.3 with an updated version of Docker I am sure I could get a few more years out of the system (maybe call it 6.8.4?) :)

    So it seems now I have the choice between stability and security/new features.

     

    On 4/27/2023 at 4:49 PM, JorgeB said:
    On 4/27/2023 at 6:12 AM, clang said:

    Now to find out why it crashes with VT-d enabled for specific linux distros and not for others....

    One possible difference is the IOMMU mode. For some time now Unraid has used pass-through mode; before that it used DMA translation mode, which IIRC Ubuntu still uses.


    Is there anything I can do to test this? I imagine this is a kernel level adjustment?

    6 hours ago, Tristankin said:

    Is there anything I can do to test this?

    If it still crashes with VT-d disabled, this is unlikely to help. Also, v6.9 still uses DMA translation mode; the change to pass-through only came with v6.10.
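    For anyone who still wants to see which IOMMU mode a given system booted with, the kernel normally reports its default domain type in the boot log. The sketch below is only an illustration: the helper name `iommu_mode` is made up, and it assumes the `iommu: Default domain type: ...` message that recent kernels print, whose exact wording may differ between kernel versions.

```shell
#!/bin/sh
# Hypothetical helper: classify the IOMMU default domain type from kernel
# log text supplied on stdin. Assumes the "iommu: Default domain type: ..."
# boot message; prints "unknown" if no such line is found.
iommu_mode() {
    awk '
        /iommu: Default domain type: Passthrough/ { mode = "passthrough" }
        /iommu: Default domain type: Translated/  { mode = "translated" }
        END { print (mode == "" ? "unknown" : mode) }
    '
}
```

    Typical use would be `dmesg | iommu_mode` shortly after boot, before the ring buffer wraps. The upstream kernel documents `iommu.passthrough=0` (translate DMA) and `iommu=pt` (pass-through) as command-line switches for this; whether changing the mode affects these hangs is untested here.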

    On 4/29/2023 at 1:56 AM, Tristankin said:


    VT-d did work for a bit, definitely decreased the number of freezes but did not stop them outright.

    Now that I am back on 6.8.3 I have had 3 weeks uptime without a single crash. This is a repeat of what happened after the upgrade to 6.9.2 was causing the same crashes and I went back to 6.8.3 for 6 months, again without a single crash.

    The issue now is that the Docker version on 6.8.3 is so old that I am getting incompatibilities with containers, forcing me to switch to different versions or roll back to ones ~6 months old (specifically the *arrs). If someone could repackage 6.8.3 with an updated version of Docker I am sure I could get a few more years out of the system (maybe call it 6.8.4?) :)

    So it seems now I have the choice between stability and security/new features.

     


    Is there anything I can do to test this? I imagine this is a kernel level adjustment?

    There have been issues with the 6.12 RCs on some 11th Gen systems, and one of the affected users found that this resolved their crashing issue.

     

    in /boot/config/modprobe.d/i915.conf

    try adding

    options i915 enable_dc=0

    to see if that helps.
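    As a minimal sketch of that change, the line can be appended only when it is not already present, so repeating the step does not duplicate it. The function name `ensure_modprobe_option` is hypothetical; the file path and option line are the ones given above.

```shell
#!/bin/sh
# Hypothetical helper: add a modprobe option line to a config file exactly once.
ensure_modprobe_option() {
    conf=$1   # e.g. /boot/config/modprobe.d/i915.conf
    line=$2   # e.g. "options i915 enable_dc=0"
    mkdir -p "$(dirname "$conf")"
    touch "$conf"
    # -x matches the whole line, -F treats the pattern as a fixed string
    grep -qxF -- "$line" "$conf" || printf '%s\n' "$line" >> "$conf"
}
```

    It would be called as `ensure_modprobe_option /boot/config/modprobe.d/i915.conf "options i915 enable_dc=0"`. Module options are read when the module loads, so the setting should only take effect after a reboot (or after the i915 module is reloaded).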

    2 hours ago, SimonF said:

    There have been issues with the 6.12 RCs on some 11th Gen systems, and one of the affected users found that this resolved their crashing issue.

     

    in /boot/config/modprobe.d/i915.conf

    try adding

    options i915 enable_dc=0

    to see if that helps.


    I'm running 9th gen so it probably won't help. That, and I have given up trying to fix the issue and am sticking with 6.8.3 for as long as I can keep my docker containers going. I put up with 4 months of crashes (so far this year, and many more months on 6.9.x) while being told it's my hardware, so I will have to budget for a platform upgrade some time in the future. (Funny how many Unraid systems are running on apparently bad hardware.)


    FWIW

     

    options i915 enable_dc=0

     

    seems to have fixed my crashing issues on a 9900k. It's not definitive yet, but I've been running stable for several days now, whereas I was crashing almost hourly before making that change.

     

    Update 2023-10-26: still stable after making this change

    Edited by ghostserverd

    Thought I would give 6.12.6 a go as I am starting to worry about the age and vulnerability of some of the containers I have to hold back to keep compatibility with an ancient version of docker on 6.8.3. 

    I got about 7 hours before the machine froze up. 6.8.3 has not frozen once on me, including the latest stint from April till now.

    I added the options i915 enable_dc=0 to the i915.conf file. I do not have intel GPU top installed.

    cat /sys/module/i915/parameters/enable_dc returns 0

    Should I be looking at disabling power well?

    Attached are the latest syslog and diagnostics.

    syslog-previous firefly-diagnostics-20231222-2218.zip
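    To confirm which power-related i915 settings are actually in effect, the loaded module's parameters can be read back from sysfs, as was done for `enable_dc` above. This is a rough sketch with a made-up function name; it assumes the `disable_power_well` parameter that the i915 driver exposes next to `enable_dc`, and it makes no claim that changing either one fixes the hangs.

```shell
#!/bin/sh
# Hypothetical helper: print selected i915 module parameters as name=value.
# The optional first argument overrides the sysfs directory (useful for testing).
show_i915_params() {
    dir=${1:-/sys/module/i915/parameters}
    for name in enable_dc disable_power_well; do
        if [ -r "$dir/$name" ]; then
            printf '%s=%s\n' "$name" "$(cat "$dir/$name")"
        else
            printf '%s=<not available>\n' "$name"
        fi
    done
}
```

    Running `show_i915_params` with no argument reads the live values. If power wells are to be tested as well, the corresponding line in i915.conf would be `options i915 enable_dc=0 disable_power_well=0`, where 0 is documented by the driver as keeping the power wells always on.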
