• 6.12-rc3 crashes in ~half hour after boot (i915 related?)


    vojtagrec
    • Urgent

    After I updated from rc2 to rc3, the server crashes on me usually within half an hour after booting. The system freezes, web UI unresponsive, sometimes I manage to log in via SSH but in the end I have to hard-reset the server (more specifically, I have to power off the server with a long press and then start again – pressing reset or short-pressing power button does nothing, as does running "reboot" in SSH console).

     

    Because of the crash causing web UI to stop responding, I was unable to get diagnostics at that moment – the attached diagnostics are right after a boot with rc3 (before the crash). One time I managed to be logged in via SSH when it crashed and got the "dmesg" output (attached). Seems to be related to "i915" and the symptoms in general seem to be similar to this old issue.

     

    I did the upgrade in the middle of the night, so even though I have a few Docker containers set to autostart, nobody was really using the server (= no transcoding in Plex etc.), it was just idling and still crashed nevertheless.

     

    I did change some BIOS settings (enabled some power optimizations and set fan curves) before booting into rc3 but now that I reverted back to rc2 everything seems to run fine with the same BIOS setup (running for 20h without a crash now), so I don't think my BIOS changes play a role in this issue.

     

    Please let me know if you need me to run some more tests.

     

    nibbler-diagnostics-20230416-0122.zip dmesg.txt







    Chiming in, symptoms sound similar to what I've been experiencing. In my case, I'm using an Intel NUC 11 ATKPE - Pentium Jasper Lake N6005-based, but I think that is a fairly similar generation to 11th series Core processors. No diagnostics, I've only just turned on Syslog Server to start digging deeper into this.

     

    Symptoms include loss of availability of services and unresponsiveness on the network (including Docker services, SSH, SMB) after a few days of uptime. I've just updated the BIOS, updated to RC4.1 today, and ran a memory test (the free memtest86, downloaded today) which passed, so I'll see if it continues.

     

    For what it's worth, I had been running this machine with Proxmox for a few months prior to trying out Unraid with no issues. Briefly tried TrueNAS Scale which didn't exhibit these issues. Since I was starting from scratch with Unraid I decided to go straight to RC3 so can't compare to previous versions. Also, I initially tried using btrfs on my pool drive but it reverted to read-only quite quickly (within a week of usage) and trying ZFS ended up giving me other stability issues, so both pool and array are on XFS now.

     

    Hope RC4.1 helps! Will update if it doesn't.

    Link to comment
    14 minutes ago, fneb said:

    Symptoms include loss of availability of services and unresponsiveness on the network (including Docker services, SSH, SMB) after a few days of uptime.

    See if the mirrored syslog shows anything, but it looks unrelated, since this issue happens after a few minutes.

    Link to comment

    Has anybody run into this issue again on rc5? I rolled back to stable but if rc5 is close to the final release I'm curious if this is solved.

    Link to comment

    @ich777 I also tried the different i915 flags. With i915.disable_power_well=1 it crashes in the same manner (diagnostics attached).

     

    With i915.enable_dc=0, it seems to not crash (uptime over 1 hour now without crash, hope I don't jinx it). I purposefully kept PiKVM display open & active (so that I can make sure the display is not asleep etc.).

    nibbler-diagnostics-20230504-2305-disable_power_well.zip

    Edited by vojtagrec
    Link to comment
    5 hours ago, vojtagrec said:

    @ich777 I also tried the different i915 flags. With i915.disable_power_well=1 it crashes in the same manner (diagnostics attached).

     

    With i915.enable_dc=0, it seems to not crash (uptime over 1 hour now without crash, hope I don't jinx it). I purposefully kept PiKVM display open & active (so that I can make sure the display is not asleep etc.).

    nibbler-diagnostics-20230504-2305-disable_power_well.zip

    Still running without crash?

    Link to comment
    4 hours ago, JorgeB said:

    If anyone else can test this please do to confirm the fix is not an isolated case.

    I'll try this morning and post my results. 

    Link to comment
    2 hours ago, menos said:

    I'll try this morning and post my results. 

    So far, it has been up for over two hours without any crashes or weirdness. The KVM is connected and has been displaying the whole time. It looks like i915.enable_dc=0 may have worked. What are the long-term drawbacks of leaving it set like that?

    Link to comment
    13 minutes ago, menos said:

    What are the long-term drawbacks of leaving it set like that?

     

    Quote

    i915.enable_dc=0 disables GPU power management. This does solve random hangs on certain Intel systems, notably Goldmont and Kaby Lake Refresh chips. Using this parameter does result in higher power use and shorter battery life on laptops/notebooks.

    https://wiki.archlinux.org/title/intel_graphics

     

    You can retry without that option every time there's a new Unraid release; a newer kernel/driver might fix it.

    Link to comment
    On 5/1/2023 at 1:01 PM, JorgeB said:

    See if the mirrored syslog shows anything, but it looks unrelated, since this issue happens after a few minutes.

    Mixed info from the mirrored syslog - I've made a new topic with my issue. Thanks!

    Link to comment

    @ich777 @JorgeB Is there an updated guide on how to build a kernel for Unraid? I just found this outdated one.

    I think the bug might be caused by the same commit as this one (+ on FreeDesktop) and would like to try a kernel with the commit reverted, or at least try bisecting the issue if it turns out to be something else. I would probably also report it back to mainline, given that 6.1 is an LTS release and will live on for years.

     

    I’m a software developer with basic working knowledge of C and modest experience with Linux, so just pointing out the Unraid peculiarities might help (but ofc some ready-made script/VM/Docker image would be ideal). Thanks!

    Link to comment
    On 5/5/2023 at 10:35 AM, JorgeB said:

    If anyone else can test this please do to confirm the fix is not an isolated case.

    Hello, I have an i5 11500 and have been experiencing the same issue since 6.12 rc3. Yesterday I added the flag i915.enable_dc=0 and now it's OK.

    Thank you

    Link to comment

    So does the crash happen because Docker/VM/system processes generate log entries that eventually fill up a tmpfs mount, which is held in memory?

    That's one of the cases I have encountered: the system works fine for hours or days, then crashes when a tmpfs fills up.

    FYI for people troubleshooting similar crashes: keep an eye on tmpfs usage by running df -h and looking for the tmpfs entries. If one fills up, you know something is throwing a lot of errors and you need to find out what it is.
    The most common culprits are system logs, Docker containers, web server data and KVM temporary files.
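
    The df check above can be narrowed to tmpfs mounts that are nearly full. This is only a rough sketch, assuming bash and awk; the function name and the 90% default are made up for illustration, and it reads `df -P` style output on stdin.

```shell
#!/bin/bash
# Print tmpfs mounts whose usage is at or above a threshold (default 90%).
# Input: POSIX-format df output on stdin (header line plus one line per mount).
tmpfs_over_threshold() {
    local limit="${1:-90}"
    awk -v limit="$limit" '
        NR > 1 && $1 == "tmpfs" {
            use = $5
            sub(/%/, "", use)              # strip the percent sign
            if (use + 0 >= limit) print $6, $5
        }'
}

# Typical use on a live system:
#   df -P | tmpfs_over_threshold 90
```

    Mount points containing spaces would not be handled correctly by this simple field split.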

    Link to comment
    1 hour ago, samsausages said:

    So does the crash happen because Docker/VM/system processes generate log entries that eventually fill up a tmpfs mount, which is held in memory?

    That's one of the cases I have encountered: the system works fine for hours or days, then crashes when a tmpfs fills up.

    FYI for people troubleshooting similar crashes: keep an eye on tmpfs usage by running df -h and looking for the tmpfs entries. If one fills up, you know something is throwing a lot of errors and you need to find out what it is.
    The most common culprits are system logs, Docker containers, web server data and KVM temporary files.

    No, this specific error is related to power management of the Intel iGPU.

    Link to comment

    I am experiencing this on an 11700K (also running a PiKVM) and 6.12.0-rc6 with the error:

     

    WARNING: CPU: 6 PID: 8994 at drivers/gpu/drm/i915/display/intel_display_power_well.c:271 hsw_wait_for_power_well_enable+0xc9/0xd8

     

    Hangs every 20-30 mins. Sometimes it's just really slow to respond but eventually completely hangs. System is still on.

    I don't actually have a monitor that I can easily use but I have plugged in a ghost display into the DisplayPort that I had from a previous build.
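
    For anyone who wants to check whether they are hitting the same warning, a small sketch follows, assuming bash and grep. The function name is made up; it counts WARNING lines that mention hsw_wait_for_power_well_enable in a saved log or in piped dmesg output.

```shell
#!/bin/bash
# Count i915 power-well WARNING lines in a kernel log.
# Pass one saved log file as the argument, or pipe dmesg output on stdin.
count_power_well_warnings() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c 'WARNING: .*hsw_wait_for_power_well_enable' "$@" || true
}

# Typical use:
#   dmesg | count_power_well_warnings
#   count_power_well_warnings dmesg.txt
```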

     

    sakaar-diagnostics-20230601-2112.zip

    Edited by Craig Dennis
    Link to comment
    1 hour ago, Craig Dennis said:

    I am experiencing this on an 11700K (also running a PiKVM) and 6.12.0-rc6 with the error:

     

    WARNING: CPU: 6 PID: 8994 at drivers/gpu/drm/i915/display/intel_display_power_well.c:271 hsw_wait_for_power_well_enable+0xc9/0xd8

     

    Hangs every 20-30 mins. Sometimes it's just really slow to respond but eventually completely hangs. System is still on.

    I don't actually have a monitor that I can easily use but I have plugged in a ghost display into the DisplayPort that I had from a previous build.

     

    sakaar-diagnostics-20230601-2112.zip

    Have you tried the i915.enable_dc=0 option?

    Link to comment
    8 hours ago, menos said:

    Have you tried the i915.enable_dc=0 option?


    I wanted to try them one at a time to ensure I know what worked. 
     

    With the ghost monitor installed I have 9 hours uptime. I will now try the i915 flag and report back.

    Link to comment
    On 6/2/2023 at 10:47 AM, Craig Dennis said:

    @menos i915.enable_dc=0 did not work for me. Server just crashed with PiKVM connected and no ghost monitor.

     

    @Craig Dennis On which RC are you? I just upgraded to rc7 and got a crash too. It looks like there is some regression, I had the enable_dc=0 applied via /boot/config/modprobe.d/i915.conf and it worked perfectly fine with rc6 but it seems to not work with rc7. I booted rc7 after the crash and checked /sys/module/i915/parameters/enable_dc and indeed it was "-1" (auto). When I added the kernel param to "Syslinux Configuration" it seems to work (I just tested with my server, current uptime 30+ min and it always crashed around ~20 min after boot for me).
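
    The check described above can be scripted. A minimal sketch, assuming bash: the function names are made up for illustration, the sysfs path is the one mentioned in this post, and /proc/cmdline is assumed to reflect what was added under "Syslinux Configuration".

```shell
#!/bin/bash
# Sketch: confirm whether enable_dc=0 actually reached the i915 driver.

# Print the value the loaded module is using, or "unknown" if unreadable.
i915_enable_dc() {
    local param_file="${1:-/sys/module/i915/parameters/enable_dc}"
    if [ -r "$param_file" ]; then
        cat "$param_file"
    else
        echo "unknown"
        return 1
    fi
}

# Succeed if the kernel command line carries i915.enable_dc=0.
cmdline_has_enable_dc0() {
    local cmdline_file="${1:-/proc/cmdline}"
    grep -qw -- 'i915\.enable_dc=0' "$cmdline_file"
}

# Typical use:
#   i915_enable_dc
#   cmdline_has_enable_dc0 && echo "set via kernel command line"
```

    If the first function prints -1 while the option is present in modprobe.d, the setting did not take effect, which matches the rc7 behaviour reported here.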

     

    FYI @ich777 @JorgeB the workaround proposed in release notes (via modprobe.d) does not work with rc7, see above.

    Link to comment
    27 minutes ago, vojtagrec said:

    FYI @ich777 @JorgeB the workaround proposed in release notes (via modprobe.d) does not work with rc7, see above.

    Let's see if other users can confirm; anyone else affected, please re-test with rc7.

    Link to comment
    14 hours ago, vojtagrec said:

    @Craig Dennis Eh sorry, I just noticed you posted before rc7 was released, so my comment is probably irrelevant to your case...


    Yeah I was on RC6 but there’s a chance I put the flag in the wrong location (not in modprobe). 
     

    If I get a chance I’ll test the correct location, then upgrade and test again. 

    Link to comment
    23 hours ago, vojtagrec said:

    @Craig Dennis On which RC are you? I just upgraded to rc7 and got a crash too. It looks like there is some regression, I had the enable_dc=0 applied via /boot/config/modprobe.d/i915.conf and it worked perfectly fine with rc6 but it seems to not work with rc7. I booted rc7 after the crash and checked /sys/module/i915/parameters/enable_dc and indeed it was "-1" (auto). When I added the kernel param to "Syslinux Configuration" it seems to work (I just tested with my server, current uptime 30+ min and it always crashed around ~20 min after boot for me).

    First of all, I completely missed that you've mentioned me here.

     

    Can you please test this:

    Remove the iGPU from the bus with:

    echo "1" > /sys/devices/pci0000\:00/0000\:00\:02.0/remove

    (in this case I'm assuming that the iGPU is on the PCI bus on: 00:02.0)

    After that unload the module with:

    rmmod i915

    Load the module again with enable_dc=0 with:

    modprobe i915 enable_dc=0

    Then rescan the PCIe bus to again get your iGPU into the system with:

    echo "1" > /sys/bus/pci/rescan

    After that issue:

    cat /sys/module/i915/parameters/enable_dc

    And enable_dc should be at 0 again

     

     

    Please note that none of these commands should print an error; rmmod, for example, should display nothing, and modprobe also doesn't display anything.
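
    The steps above could also be wrapped in a small script that stops at the first failing command. This is only a sketch, assuming bash and the same 00:02.0 bus address; run_step, reload_i915, PCI_DEV and DRY_RUN are names invented here. It defaults to a dry run that only prints the commands.

```shell
#!/bin/bash
# Reload i915 with a chosen enable_dc value, stopping at the first failure.
# Assumption: the iGPU sits at 00:02.0, as in the commands above.
PCI_DEV="${PCI_DEV:-/sys/devices/pci0000:00/0000:00:02.0}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the steps, 0 = really run them

run_step() {
    # $1 = description, remaining arguments = command to run
    local desc="$1"
    shift
    echo "== $desc: $*"
    if [ "$DRY_RUN" = "1" ]; then
        return 0
    fi
    "$@" || { echo "FAILED: $desc" >&2; return 1; }
}

reload_i915() {
    local dc="${1:-0}"
    run_step "remove iGPU from bus" sh -c "echo 1 > '$PCI_DEV/remove'" &&
    run_step "unload module" rmmod i915 &&
    run_step "load module" modprobe i915 "enable_dc=$dc" &&
    run_step "rescan PCI bus" sh -c "echo 1 > /sys/bus/pci/rescan" &&
    run_step "read back value" cat /sys/module/i915/parameters/enable_dc
}

# Typical use, as root, once the dry-run output looks right:
#   DRY_RUN=0; reload_i915 0
```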

     

    Please let me know if that is working on your platform.

    Maybe also try different power states for your iGPU to see whether enable_dc=2 works; even enable_dc=3 might work:

    Quote

    enable_dc:Enable power-saving display C-states. (-1=auto [default]; 0=disable; 1=up to DC5; 2=up to DC6; 3=up to DC5 with DC3CO; 4=up to DC6 with DC3CO) (int)
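
    To make the values in the quote above easier to read back while testing, here is a small lookup sketch, assuming bash. The function name is made up; the descriptions are taken from the parameter text quoted above.

```shell
#!/bin/bash
# Translate an enable_dc value into the meaning listed in the module help text.
describe_enable_dc() {
    case "$1" in
        -1) echo "auto (default)" ;;
        0)  echo "disabled" ;;
        1)  echo "up to DC5" ;;
        2)  echo "up to DC6" ;;
        3)  echo "up to DC5 with DC3CO" ;;
        4)  echo "up to DC6 with DC3CO" ;;
        *)  echo "unknown value: $1"; return 1 ;;
    esac
}

# Typical use:
#   describe_enable_dc "$(cat /sys/module/i915/parameters/enable_dc)"
```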

     

    Link to comment


