6.12-rc3 crashes in ~half hour after boot (i915 related?)

fneb · May 1, 2023

Chiming in, symptoms sound similar to what I've been experiencing. In my case, I'm using an Intel NUC 11 ATKPE - Pentium Jasper Lake N6005-based, but I think that is a fairly similar generation to 11th series Core processors. No diagnostics, I've only just turned on Syslog Server to start digging deeper into this.

Symptoms include loss of availability of services, unresponsiveness on network etc. (including Docker services, SSH, SMB) after a few days of uptime. I've just updated the BIOS and have updated to RC4.1 today, and run a memory test (memtest86 free downloaded today) which passed, so will see if it continues.

For what it's worth, I had been running this machine with Proxmox for a few months prior to trying out Unraid with no issues. Briefly tried TrueNAS Scale which didn't exhibit these issues. Since I was starting from scratch with Unraid I decided to go straight to RC3 so can't compare to previous versions. Also, I initially tried using btrfs on my pool drive but it reverted to read-only quite quickly (within a week of usage) and trying ZFS ended up giving me other stability issues, so both pool and array are on XFS now.

Hope RC4.1 helps! Will update if it doesn't.

JorgeB · May 1, 2023

14 minutes ago, fneb said:

Symptoms include loss of availability of services, unresponsiveness on network etc. (including Docker services, SSH, SMB) after a few days of uptime.

See if the mirrored syslog shows anything, but it looks unrelated, since this issue happens after a few minutes.

menos · May 4, 2023

Has anybody run into this issue again on rc5? I rolled back to stable but if rc5 is close to the final release I'm curious if this is solved.

vojtagrec · May 4, 2023

@menos @ich777 Just tested on rc5, still crashing. Diagnostics after the crash attached. Will try some of the i915 module flags. To me it looks like it must be some regression in kernels newer than 6.1.20 (that was in rc2).

nibbler-diagnostics-20230504-1730.zip

vojtagrec · May 4, 2023

@ich777 I also tried the different i915 flags. With i915.disable_power_well=1 it crashes in the same manner (diagnostics attached).

With i915.enable_dc=0, it seems to not crash (uptime over 1 hour now without crash, hope I don't jinx it). I purposefully kept PiKVM display open & active (so that I can make sure the display is not asleep etc.).

nibbler-diagnostics-20230504-2305-disable_power_well.zip

Edited May 4, 2023 by vojtagrec

menos · May 5, 2023

5 hours ago, vojtagrec said:

@ich777 I also tried the different i915 flags. With i915.disable_power_well=1 it crashes in the same manner (diagnostics attached).

With i915.enable_dc=0, it seems to not crash (uptime over 1 hour now without crash, hope I don't jinx it). I purposefully kept PiKVM display open & active (so that I can make sure the display is not asleep etc.).

nibbler-diagnostics-20230504-2305-disable_power_well.zip

Still running without crash?

vojtagrec · May 5, 2023

3 hours ago, menos said:

Still running without crash?

@menos Yep, current uptime 9h45m.

JorgeB · May 5, 2023

If anyone else can test this please do to confirm the fix is not an isolated case.

menos · May 5, 2023

4 hours ago, JorgeB said:

If anyone else can test this please do to confirm the fix is not an isolated case.

I'll try this morning and post my results.

menos · May 5, 2023

2 hours ago, menos said:

I'll try this morning and post my results.

So far, up over two hours without any crashes or weirdness. KVM is connected and has been displaying the whole time. It looks like i915.enable_dc=0 may have worked. What are the long term drawbacks to leaving it set like that?

JorgeB · May 5, 2023

13 minutes ago, menos said:

What are the long term drawbacks to leaving it set like that?

Quote

i915.enable_dc=0 disables GPU power management. This does solve random hangs on certain Intel systems, notably Goldmont and Kaby Lake Refresh chips. Using this parameter does result in higher power use and shorter battery life on laptops/notebooks.

https://wiki.archlinux.org/title/intel_graphics

You can retry without that option every time there's a new Unraid release, newer kernel/driver might fix it.

fneb · May 7, 2023

On 5/1/2023 at 1:01 PM, JorgeB said:

See if the mirrored syslog shows anything, but it looks unrelated, since this issue happens after a few minutes.

Mixed info from the mirrored syslog - I've made a new topic with my issue. Thanks!

vojtagrec · May 9, 2023

@ich777 @JorgeB Is there an updated guide on how to build a kernel for Unraid? I just found this outdated one.

I think the bug might be caused by the same commit as this one (+ on FreeDesktop) and would like to try a kernel with the commit reverted. Or at least try bisecting the issue if it turns out to be something else. I would probably also report it back to mainline, given that 6.1 is an LTS release and will live on for years.

I’m a software developer with basic working knowledge of C and modest experience with Linux, so just pointing out the Unraid peculiarities might help (but ofc some ready-made script/VM/Docker image would be ideal). Thanks!

rachid596 · May 21, 2023

On 5/5/2023 at 10:35 AM, JorgeB said:

If anyone else can test this please do to confirm the fix is not an isolated case.

Hello, I have an i5 11500 and have been experiencing the same issue since 6.12 rc3. Yesterday I added the flag i915.enable_dc=0 and now it's OK.

Thank you

samsausages · May 22, 2023

So does the crash happen because docker/vm/system are generating log entries that eventually fill up the "tmpfs" directory mounted to memory?

That's one of the instances I have encountered where the system works fine for hours/days, but then crashes when tmpfs gets filled up.

FYI for people that are trying to troubleshoot similar crashes, keep an eye on "tmpfs" storage space by doing a df -h and looking for tmpfs. If it fills up on you, then you know something is throwing a lot of errors and you need to find out what it is.
Most common culprits are System Logs, Docker Containers, Web Server Data and KVM temporary files.
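
A minimal sketch of that check, assuming GNU df (for the -t option) and POSIX awk. The function name tmpfs_over_threshold and the 90% default are made up for illustration; note that df reports the filesystem type as tmpfs.

```shell
# Hypothetical helper: list tmpfs mounts at or above a usage threshold.
# Reads `df -P`-style output on stdin (header line plus one line per mount;
# column 5 is the usage percentage, column 6 the mount point).
tmpfs_over_threshold() {
    threshold="${1:-90}"   # percent; 90 is an arbitrary default
    awk -v t="$threshold" 'NR > 1 {
        use = $5
        sub(/%/, "", use)              # strip the trailing percent sign
        if (use + 0 >= t) print $6 " " $5
    }'
}

# On a live system (GNU df):
#   df -P -t tmpfs | tmpfs_over_threshold 90
```

If this prints a mount such as /var/log, something is writing a lot of data there and is worth tracking down.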

menos · May 22, 2023

1 hour ago, samsausages said:

So does the crash happen because docker/vm/system are generating log entries that eventually fill up the "tmpfs" directory mounted to memory?

That's one of the instances I have encountered where the system works fine for hours/days, but then crashes when tmpfs gets filled up.

FYI for people that are trying to troubleshoot similar crashes, keep an eye on "tmpfs" storage space by doing a df -h and looking for tmpfs. If it fills up on you, then you know something is throwing a lot of errors and you need to find out what it is.
Most common culprits are System Logs, Docker Containers, Web Server Data and KVM temporary files.

No, this specific error is related to power management of the Intel iGPU.

Craig Dennis · June 1, 2023

I am experiencing this on an 11700K (also running a PiKVM) and 6.12.0-rc6 with the error:

WARNING: CPU: 6 PID: 8994 at drivers/gpu/drm/i915/display/intel_display_power_well.c:271 hsw_wait_for_power_well_enable+0xc9/0xd8

Hangs every 20-30 mins. Sometimes it's just really slow to respond but eventually completely hangs. System is still on.
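
For anyone wanting to check whether they are hitting the same warning, a rough sketch that counts matching lines in a log file. The function name count_power_well_warnings is hypothetical, and /var/log/syslog as the default path is an assumption; point it at a mirrored syslog or an extracted diagnostics file if the log lives elsewhere.

```shell
# Hypothetical helper: count i915 power-well warnings in a log file.
# Prints the number of lines mentioning hsw_wait_for_power_well_enable.
count_power_well_warnings() {
    log_file="${1:-/var/log/syslog}"   # assumed default location
    # grep -c prints the count; it exits non-zero when the count is 0
    grep -c 'hsw_wait_for_power_well_enable' "$log_file"
}

# Example:
#   count_power_well_warnings /var/log/syslog
```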

I don't actually have a monitor that I can easily use but I have plugged in a ghost display into the DisplayPort that I had from a previous build.

sakaar-diagnostics-20230601-2112.zip

Edited June 1, 2023 by Craig Dennis

menos · June 1, 2023

1 hour ago, Craig Dennis said:

I am experiencing this on an 11700K (also running a PiKVM) and 6.12.0-rc6 with the error:

WARNING: CPU: 6 PID: 8994 at drivers/gpu/drm/i915/display/intel_display_power_well.c:271 hsw_wait_for_power_well_enable+0xc9/0xd8

Hangs every 20-30 mins. Sometimes it's just really slow to respond but eventually completely hangs. System is still on.

I don't actually have a monitor that I can easily use but I have plugged in a ghost display into the DisplayPort that I had from a previous build.

sakaar-diagnostics-20230601-2112.zip

Have you tried the i915.enable_dc=0 option?

Craig Dennis · June 2, 2023

8 hours ago, menos said:

Have you tried the i915.enable_dc=0 option?

I wanted to try them one at a time to ensure I know what worked.

With the ghost monitor installed I have 9 hours uptime. I will now try the i915 flag and report back.

Craig Dennis · June 2, 2023

@menos i915.enable_dc=0 did not work for me. Server just crashed with PiKVM connected and no ghost monitor.

vojtagrec · June 6, 2023

On 6/2/2023 at 10:47 AM, Craig Dennis said:

@menos i915.enable_dc=0 did not work for me. Server just crashed with PiKVM connected and no ghost monitor.

@Craig Dennis On which RC are you? I just upgraded to rc7 and got a crash too. It looks like there is some regression, I had the enable_dc=0 applied via /boot/config/modprobe.d/i915.conf and it worked perfectly fine with rc6 but it seems to not work with rc7. I booted rc7 after the crash and checked /sys/module/i915/parameters/enable_dc and indeed it was "-1" (auto). When I added the kernel param to "Syslinux Configuration" it seems to work (I just tested with my server, current uptime 30+ min and it always crashed around ~20 min after boot for me).

FYI @ich777 @JorgeB the workaround proposed in release notes (via modprobe.d) does not work with rc7, see above.
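
To make the two ways of setting the parameter concrete, here is a sketch. The comments show what each config entry would typically look like (the append line assumes the stock Unraid syslinux.cfg layout with initrd=/bzroot), and the hypothetical check_enable_dc function reads the live value back, since that is what revealed the problem on rc7.

```shell
# Option 1, module option file /boot/config/modprobe.d/i915.conf:
#   options i915 enable_dc=0
#
# Option 2, kernel command line (Syslinux Configuration), assuming the
# stock "append initrd=/bzroot" line:
#   append i915.enable_dc=0 initrd=/bzroot

# Hypothetical helper: report whether the workaround is actually in effect.
# Returns 0 if enable_dc is 0, 1 if it has another value, 2 if unreadable.
check_enable_dc() {
    param_file="${1:-/sys/module/i915/parameters/enable_dc}"
    if [ ! -r "$param_file" ]; then
        echo "i915 not loaded or parameter not readable: $param_file"
        return 2
    fi
    value="$(cat "$param_file")"
    if [ "$value" = "0" ]; then
        echo "enable_dc=0 (display C-states disabled)"
        return 0
    fi
    echo "enable_dc=$value (workaround not applied)"
    return 1
}

# Example after a reboot:
#   check_enable_dc
```

Checking after every upgrade is cheap and catches the case where a config file is silently ignored.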

vojtagrec · June 6, 2023

@Craig Dennis Eh sorry, I just noticed you posted before rc7 was released, so my comment is probably irrelevant to your case...

JorgeB · June 6, 2023

27 minutes ago, vojtagrec said:

FYI @ich777 @JorgeB the workaround proposed in release notes (via modprobe.d) does not work with rc7, see above.

Let's see if other users can confirm; anyone else affected, please re-test with rc7.

Craig Dennis · June 7, 2023

14 hours ago, vojtagrec said:

@Craig Dennis Eh sorry, I just noticed you posted before rc7 was released, so my comment is probably irrelevant to your case...

Yeah I was on RC6 but there’s a chance I put the flag in the wrong location (not in modprobe).

If I get a chance I’ll test the correct location, then upgrade and test again.

ich777 · June 7, 2023

23 hours ago, vojtagrec said:

@Craig Dennis On which RC are you? I just upgraded to rc7 and got a crash too. It looks like there is some regression, I had the enable_dc=0 applied via /boot/config/modprobe.d/i915.conf and it worked perfectly fine with rc6 but it seems to not work with rc7. I booted rc7 after the crash and checked /sys/module/i915/parameters/enable_dc and indeed it was "-1" (auto). When I added the kernel param to "Syslinux Configuration" it seems to work (I just tested with my server, current uptime 30+ min and it always crashed around ~20 min after boot for me).

First of all, I completely missed that you've mentioned me here.

Can you please test this:

Remove the iGPU from the bus with:

echo "1" > /sys/devices/pci0000\:00/0000\:00\:02.0/remove

(in this case I'm assuming that the iGPU is on the PCI bus on: 00:02.0)

After that unload the module with:

rmmod i915

Load the module again with enable_dc=0 with:

modprobe i915 enable_dc=0

Then rescan the PCIe bus to again get your iGPU into the system with:

echo "1" > /sys/bus/pci/rescan

After that issue:

cat /sys/module/i915/parameters/enable_dc

And enable_dc should be at 0 again

Please note that none of these commands should print an error; rmmod, for example, should display nothing, and modprobe also doesn't display anything.

Please let me know if that is working on your platform.

Maybe also try to play around with different power states for your iGPU to see whether enable_dc=2 works; even enable_dc=3 might work:

Quote

enable_dc:Enable power-saving display C-states. (-1=auto [default]; 0=disable; 1=up to DC5; 2=up to DC6; 3=up to DC5 with DC3CO; 4=up to DC6 with DC3CO) (int)
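
The steps above could be wrapped roughly like this, which also makes it easier to try the other enable_dc values. This is only a sketch: the function names and the DRY_RUN switch are made up for illustration, the iGPU is assumed to sit at 0000:00:02.0 as in the post, and the real run needs root and may fail if something is still using the GPU.

```shell
# Hypothetical helpers. With DRY_RUN=1 the commands are printed, not run.
write_sysfs() {   # write_sysfs VALUE PATH
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "echo $1 > $2"
    else
        echo "$1" > "$2"
    fi
}

run_cmd() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# reload_i915 [ENABLE_DC_VALUE] [PCI_ADDRESS]
reload_i915() {
    dc_value="${1:-0}"
    pci_dev="${2:-0000:00:02.0}"   # assumed iGPU address, as in the post
    # Remove the iGPU from the bus, then unload the driver
    write_sysfs 1 "/sys/devices/pci0000:00/$pci_dev/remove" || return 1
    run_cmd rmmod i915 || return 1
    # Reload the driver with the requested display C-state setting
    run_cmd modprobe i915 "enable_dc=$dc_value" || return 1
    # Rescan so the iGPU is picked up again, then show the live value
    write_sysfs 1 /sys/bus/pci/rescan || return 1
    run_cmd cat /sys/module/i915/parameters/enable_dc
}

# Preview the commands for enable_dc=2 without touching the system:
#   DRY_RUN=1; reload_i915 2
```

The function stops at the first failing step, which matches the expectation that none of the commands should print an error.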
