[Support]: Intel iGPU Utilization Stats into InfluxDB for use with Grafana - intel-gpu-telegraf



Hi all,

 

I figured I would share the container I put together; maybe someone else will find it useful.

 

The goal: See the utilization of the Intel iGPU in my Grafana dashboard.

The how: Create a container running a tiny script to manipulate the output of intel_gpu_top and have Telegraf send it to an InfluxDB instance.
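Conceptually, the glue is just Telegraf's exec input calling the wrapper script, feeding a standard InfluxDB output. The relevant parts of the container's telegraf.conf look roughly like this (the option names are stock Telegraf; the exact values shipped in the image may differ):

[[inputs.exec]]
  # run the wrapper script and parse its JSON output
  commands = ["/opt/intel-gpu-telegraf/get_intel_gpu_status.sh"]
  timeout = "10s"
  data_format = "json"

[[outputs.influxdb]]
  # plain InfluxDB v1 output; host and database here are placeholders
  urls = ["http://<influxdb-host>:8086"]
  database = "telegraf"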

The result:

[Screenshot: Grafana dashboard panels showing Intel iGPU utilization]

Docker template repo to add to your Unraid repo list: https://github.com/brianmiller/docker-templates (the container is called 'intel-gpu-telegraf')

Docker Hub repo: https://hub.docker.com/r/theoriginalbrian/intel-gpu-telegraf

Docker GitHub: https://github.com/brianmiller/docker-intel-gpu-telegraf

 

Currently, the container looks for the "Video/0" engine within the iGPU.  If there's a desire to pull this into the Unraid template UI and make it editable, let me know.  It shouldn't be difficult to do.
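For reference, intel_gpu_top -J emits JSON samples whose engines block looks roughly like the one below (field names are from memory of recent igt-gpu-tools releases, so treat them as approximate); the script simply reports the busy value of the Video/0 entry:

"engines": {
        "Render/3D/0":    { "busy": 12.3, "sema": 0.0, "wait": 0.0, "unit": "%" },
        "Blitter/0":      { "busy": 0.0, "sema": 0.0, "wait": 0.0, "unit": "%" },
        "Video/0":        { "busy": 45.6, "sema": 0.0, "wait": 0.0, "unit": "%" },
        "VideoEnhance/0": { "busy": 0.0, "sema": 0.0, "wait": 0.0, "unit": "%" }
}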

 

-Brian


I may be an edge case but in beta35 this (very handy) docker fills up my syslog with the following error until the system's overloaded.

Nov 23 10:00:10 NAS kernel: bad: scheduling from the idle thread!
Nov 23 10:00:10 NAS kernel: CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.8.18-Unraid #1
Nov 23 10:00:10 NAS kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./J5005-ITX, BIOS P1.40 08/06/2018
Nov 23 10:00:10 NAS kernel: Call Trace:
Nov 23 10:00:10 NAS kernel: dump_stack+0x6b/0x83
Nov 23 10:00:10 NAS kernel: dequeue_task_idle+0x21/0x2a
Nov 23 10:00:10 NAS kernel: __schedule+0x135/0x49e
Nov 23 10:00:10 NAS kernel: ? __mod_timer+0x215/0x23c
Nov 23 10:00:10 NAS kernel: schedule+0x77/0xa0
Nov 23 10:00:10 NAS kernel: schedule_timeout+0xa7/0xe0
Nov 23 10:00:10 NAS kernel: ? __next_timer_interrupt+0xaf/0xaf
Nov 23 10:00:10 NAS kernel: msleep+0x13/0x19
Nov 23 10:00:10 NAS kernel: pci_raw_set_power_state+0x185/0x257
Nov 23 10:00:10 NAS kernel: pci_restore_standard_config+0x35/0x3b
Nov 23 10:00:10 NAS kernel: pci_pm_runtime_resume+0x29/0x7b
Nov 23 10:00:10 NAS kernel: ? pci_pm_default_resume+0x1e/0x1e
Nov 23 10:00:10 NAS kernel: ? pci_pm_default_resume+0x1e/0x1e
Nov 23 10:00:10 NAS kernel: __rpm_callback+0x6b/0xcf
Nov 23 10:00:10 NAS kernel: ? pci_pm_default_resume+0x1e/0x1e
Nov 23 10:00:10 NAS kernel: rpm_callback+0x50/0x66
Nov 23 10:00:10 NAS kernel: ? pci_pm_default_resume+0x1e/0x1e
Nov 23 10:00:10 NAS kernel: rpm_resume+0x2e2/0x3d6
Nov 23 10:00:10 NAS kernel: ? __schedule+0x47d/0x49e
Nov 23 10:00:10 NAS kernel: __pm_runtime_resume+0x55/0x71
Nov 23 10:00:10 NAS kernel: __intel_runtime_pm_get+0x15/0x4a [i915]
Nov 23 10:00:10 NAS kernel: i915_pmu_enable+0x53/0x147 [i915]
Nov 23 10:00:10 NAS kernel: i915_pmu_event_add+0xf/0x20 [i915]
Nov 23 10:00:10 NAS kernel: event_sched_in+0xd3/0x18f
Nov 23 10:00:10 NAS kernel: merge_sched_in+0xb4/0x1de
Nov 23 10:00:10 NAS kernel: visit_groups_merge.constprop.0+0x174/0x3ad
Nov 23 10:00:10 NAS kernel: ctx_sched_in+0x11e/0x13e
Nov 23 10:00:10 NAS kernel: perf_event_sched_in+0x49/0x6c
Nov 23 10:00:10 NAS kernel: ctx_resched+0x6d/0x7c
Nov 23 10:00:10 NAS kernel: __perf_install_in_context+0x117/0x14b
Nov 23 10:00:10 NAS kernel: remote_function+0x19/0x43
Nov 23 10:00:10 NAS kernel: flush_smp_call_function_queue+0x103/0x1a4
Nov 23 10:00:10 NAS kernel: flush_smp_call_function_from_idle+0x2f/0x3a
Nov 23 10:00:10 NAS kernel: do_idle+0x20f/0x236
Nov 23 10:00:10 NAS kernel: cpu_startup_entry+0x18/0x1a
Nov 23 10:00:10 NAS kernel: start_kernel+0x4af/0x4d1
Nov 23 10:00:10 NAS kernel: secondary_startup_64+0xa4/0xb0

 


That was exactly the plugin I was looking for, many thanks!

 

Just a note: maybe add an env for influx_username (for auth), as my existing DB requires influx_password AND a username.

I modified the telegraf.conf directly, but it would be a great addition for other users.
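Until it's exposed in the template, the auth options in a standard Telegraf InfluxDB v1 output section look like this (URL, database, and credentials below are placeholders):

[[outputs.influxdb]]
  urls = ["http://<influxdb-host>:8086"]
  database = "telegraf"
  username = "<influx_username>"
  password = "<influx_password>"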

 

Thanks again!

 

On 11/17/2020 at 4:06 AM, tronyx said:

This is awesome! Thanks so much for putting it together. I don't suppose you'd be willing to share your dashboard for this for easy replication of the above panels?

Certainly!  I'll see if I can update the GitHub repo with the dashboards I use. I'll link them here once they're uploaded.

On 12/1/2020 at 12:46 PM, Lunz_ said:

That was exactly the plugin I was looking for, many thanks!

Just a note: maybe add an env for influx_username (for auth), as my existing DB requires influx_password AND a username.

I modified the telegraf.conf directly, but it would be a great addition for other users.

Thanks again!

 

This should be easy enough.  I'll take a look.

On 11/23/2020 at 8:38 AM, CS01-HS said:

I may be an edge case but in beta35 this (very handy) docker fills up my syslog with the following error until the system's overloaded.


Nov 23 10:00:10 NAS kernel: bad: scheduling from the idle thread!
[... full kernel call trace quoted above ...]

 

I'm glad it's useful.  I haven't seen these errors before.  Did they start after you installed the intel-gpu-telegraf container or after the Unraid upgrade?

12 minutes ago, TheBrian said:

I'm glad it's useful.  I haven't seen these errors before.  Did they start after you installed the intel-gpu-telegraf container or after the Unraid upgrade?

I upgraded to beta35, then installed intel-gpu-telegraf for the first time. I'll try again and report back.

On 12/6/2020 at 8:47 AM, CS01-HS said:

I managed to cause the same error (and freeze my server) playing around with intel_gpu_top in the intel-gpu-tools container, so my problem's at a lower level; your container's fine.

After several freezes (which caused unclean shutdowns) in HandBrake with the hardware encoder and also while monitoring, plus random corrupted encodes, I got my machine stable with the following changes:

  1. Added intel_iommu=on,igfx_off to my syslinux config (this may be optional; see the example below)
  2. Added a dummy HDMI plug to my headless server (J5005)

It's been stable now for several days despite continuous hardware encoding in HandBrake with no corruption.
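For reference, after change 1 the boot entry in /boot/syslinux/syslinux.cfg ends up looking roughly like this (only the append line changes; the rest is the stock Unraid entry):

label Unraid OS
  menu default
  kernel /bzimage
  append intel_iommu=on,igfx_off initrd=/bzroot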


Hey

 

This is exactly what I've been looking for, but for some reason I'm unable to get it working. I'm clearly being thick (been a long few weeks). I've installed the docker, but I'm getting "[agent] Error writing to outputs.influxdb: could not write any address". Any help would be much appreciated.

 

Cheers, Tinni

 


First, thanks for this container, very handy.

 

One suggestion:

 

I was running batch conversions in HandBrake and couldn't figure out why my iGPU wasn't fully utilized:

[Screenshot: Grafana panel showing low reported iGPU utilization during batch conversions]

 

It turns out it was, but it was maxing out the Render/3D load (95%) while the reporting script (get_intel_gpu_status.sh) grabs the Video load (9%):

[Screenshot: intel_gpu_top showing Render/3D at ~95% and Video at ~9%]

 

So I tweaked the script to grab whatever's highest:

#!/bin/bash

#This is so messy...

#Beat intel_gpu_top into submission
JSON=$(/usr/bin/timeout -k 3 3 /usr/bin/intel_gpu_top -J)
#Numeric sort on the value after "busy": so the highest load wins (a plain sort compares strings)
VIDEO_UTIL=$(echo "$JSON"|grep "busy"|sort -t ":" -k2 -n|tail -1|cut -d ":" -f2|cut -d "," -f1|cut -d " " -f2)

#Spit out something telegraf can work with
echo "[{\"time\": `date +%s`, \"intel_gpu_util\": "$VIDEO_UTIL"}]"

#Exit cleanly
exit 0

 

I overwrite the container's version with the following Post Argument, where /utils is a new mapped path to the folder containing my tweaked version:

&& docker exec intel-gpu-telegraf sh -c '/usr/bin/cp -f /utils/get_intel_gpu_status.sh /opt/intel-gpu-telegraf/; chmod a+x /opt/intel-gpu-telegraf/get_intel_gpu_status.sh'

 

(Full path to cp is necessary because cp is aliased to cp -i)

 

Now the display reflects full utilization:

[Screenshot: Grafana panel now showing full iGPU utilization]


Getting a strange error when trying this container:

2022-01-01T14:21:20Z E! [inputs.exec] Error in plugin: invalid character '}' looking for beginning of value

To try to trace it, I mapped both the telegraf.conf and the shell script separately into Unraid's folders so I can modify each of them. Unfortunately I can't work out the cause of the error; all the {} brackets seem to be correctly paired, with no strays.

 

Any idea on this one? The conf and shell file are in @TheBrian's GitHub above...

On 1/1/2022 at 8:27 AM, mishmash- said:

Getting a strange error when trying this container:

2022-01-01T14:21:20Z E! [inputs.exec] Error in plugin: invalid character '}' looking for beginning of value

To try to trace it, I mapped both the telegraf.conf and the shell script separately into Unraid's folders so I can modify each of them. Unfortunately I can't work out the cause of the error; all the {} brackets seem to be correctly paired, with no strays.

Any idea on this one? The conf and shell file are in @TheBrian's GitHub above...

 

Facing the exact same issue - did you ever find a fix? 

On 8/24/2023 at 4:45 PM, Shu said:

 

Facing the exact same issue - did you ever find a fix? 

 

I never managed to trace the issue. I ended up modifying the container slightly to run it in Proxmox; I now virtualise Unraid and give the iGPU to a Plex container in Proxmox.

The iGPU is monitored from inside the Proxmox Plex container. You can see the files here; maybe some mods on your side will help your implementation:

https://sorrento-lab.github.io/9_1_PlexLXC.html
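For anyone else hitting the inputs.exec "invalid character '}'" error above, one thing worth checking (a guess, not something confirmed in this thread): run the status script by hand inside the container and make sure it prints valid JSON. If intel_gpu_top exits without emitting a busy sample within the script's timeout, the utilization variable is empty and the echo line produces "intel_gpu_util": }], which is exactly the kind of input that makes a JSON parser stop at an unexpected '}':

docker exec -it intel-gpu-telegraf /opt/intel-gpu-telegraf/get_intel_gpu_status.sh
# good output looks like: [{"time": 1641046880, "intel_gpu_util": 3.50}]
# bad output looks like:  [{"time": 1641046880, "intel_gpu_util": }]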

 

 
