maxse Posted May 9, 2019

5 hours ago, Xaero said:
Where are you looking to see the utilization? The default nvidia-smi screen shows fan speed, GPU core, and GPU error-correcting percentages, which aren't applicable to the nvenc/nvdec pipelines. You'll want to use nvidia-smi dmon to get the columns of percentages, and pay attention in particular to the enc and dec columns. If you watch them for the duration of a short video, you should see how it works: it fills a buffer rapidly with video, and then idles. With multiple streams it will simply use the idle durations from the other streams to buffer a new stream. You'd need about a dozen or more streams before they have to double up the duty cycle, and that's where you MIGHT start seeing decreased performance. Of course, this also means you would need to be able to DECODE fast enough for 12+ streams simultaneously encoding, which is probably more of a problem.

I have used watch nvidia-smi and looked under GPU-Util on the right side of the box. Right below that is a percentage. Is that something else?
Xaero Posted May 10, 2019

18 hours ago, maxse said:
I have used watch nvidia-smi and looked under GPU-Util on the right side of the box. Right below that is a percentage. Is that something else?

That's the GPU core utilization. It shouldn't really increase with nvenc or nvdec usage; that's the whole point of nvenc and nvdec: giving gamers a way to accelerate video encoding without tying up the GPU core. For a streaming site like Netflix, it doesn't really make sense to transcode video live for each stream; they just store the video pre-encoded for streaming. A gamer streaming to Twitch, on the other hand, needs a way to encode video on the fly without impacting rendering performance. Enter nvenc and nvdec. The spike you are seeing is kind of suspicious, but the important metrics to look at are the enc and dec columns of nvidia-smi dmon.
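To watch the encode/decode pipelines rather than the GPU core, nvidia-smi dmon is the tool mentioned above. Here is a small sketch of pulling the enc/dec fields out of its output. The sample line is illustrative only, and column order can vary between driver versions, so check the header row on your own system first:

```shell
# Illustrative nvidia-smi dmon output; on a real system you would pipe
# `nvidia-smi dmon -c 1` into the same awk. Check your header row first:
# column positions are not guaranteed across driver versions.
sample='# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
    0    35    55     -    12     8    45    60  3504  1506'

# Pull the enc and dec percentages (fields 7 and 8 in this sample) from the data row.
printf '%s\n' "$sample" | awk '!/^#/ {print "enc=" $7 "% dec=" $8 "%"}'
```

On an idle card both columns should sit at 0; during a transcode you should see them burst high and fall back as the buffer fills, which is the duty-cycle behaviour described above.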
Xaero Posted May 10, 2019

On 5/8/2019 at 8:07 PM, Chad Kunsman said:
By the way, I have half solved this problem after more testing. The card can and will clock itself down while idle UNTIL it has to perform transcoding work. Once the transcode job is finished, it will never clock itself down again until I restart the Plex docker container. See the moment I restarted the container here: http://i.imgur.com/zeXKiFZ.png See a thread from another user who seems to have the exact same issue here: https://forums.plex.tv/t/stuck-in-p-state-p0-after-transcode-finished-on-nvidia/387685/2 I really hope there's a solution as this is about $40 a year down the drain in extra power usage just due to the card not idling as it should.

This information could prove useful for creating an (albeit hacky) solution, if need be. I don't currently have any hardware set up where I can test anything (my server is in boxes at the moment).

One thing you may try is enabling persistence mode:

nvidia-smi -pm 1

This will result in a couple of watts of idle usage, but will force the drivers to stay loaded even when no job is running. It's possible the drivers are exiting as soon as the transcode jobs finish, without changing the power state back to idle. Or, if it's already enabled, you could try disabling it:

nvidia-smi -pm 0

A hacky, scripted solution would be to monitor the nvidia-smi output for a condition of LOW GPU, NVENC, and NVDEC utilization combined with a HIGH power state, and issue nvidia-smi --gpu-reset, which would reset the GPU and allow it to idle again.

Both of these are hacky workarounds. I too would echo the Plex team on this: post on the NVIDIA developer forums (https://devtalk.nvidia.com/) with this information, and in particular point out the use of the new nvidia docker blobs, as the issue could quite possibly be there.

Once I have my server up and running again, I'll have a poke at seeing if I can replicate and/or resolve this issue.
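The scripted workaround described above could look something like the sketch below. The decision logic is kept in a small function so it can be exercised with canned input; everything here is a hypothetical illustration, not a tested fix, and note that nvidia-smi --gpu-reset has restrictions (it will refuse to run on a GPU that is in use):

```shell
#!/bin/bash
# Hypothetical sketch of the watchdog idea: if the card reports a high
# power state (P0) while utilization is zero, it is "stuck" and a reset
# would let it idle again.

# Decide, from a "pstate, utilization" CSV line (the format produced by
# `nvidia-smi --query-gpu=pstate,utilization.gpu --format=csv,noheader,nounits`),
# whether the stuck condition holds.
should_reset() {
    local pstate util
    pstate=$(echo "$1" | cut -d',' -f1 | tr -d '[:space:]')
    util=$(echo "$1" | cut -d',' -f2 | tr -d '[:space:]')
    [ "$pstate" = "P0" ] && [ "$util" -eq 0 ]
}

# Canned examples; on a real box you would feed in the live query instead:
#   line=$(nvidia-smi --query-gpu=pstate,utilization.gpu --format=csv,noheader,nounits)
#   should_reset "$line" && nvidia-smi --gpu-reset
should_reset "P0, 0"  && echo "stuck: would issue gpu-reset"
should_reset "P8, 0"  || echo "already idle, nothing to do"
should_reset "P0, 45" || echo "busy transcoding, leave it alone"
```

A real cron-driven version would probably want the condition to hold across several consecutive samples before resetting, to avoid false positives on a stream that is momentarily between buffer fills.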
Chad Kunsman Posted May 10, 2019

14 hours ago, Xaero said:
One thing you may try is enabling persistence mode: nvidia-smi -pm 1 [...] A hacky, scripted solution would be to monitor the nvidia smi output for a condition of both LOW GPU, NVENC and NVDEC performance and a HIGH power state, and issue the nvidia-smi --gpu-reset which would reset the GPU allowing it to idle again. [...]

"GPU Reset couldn't run because GPU 00000000:01:00.0 is the primary GPU." is the message I get when I try to reset. Persistence mode on or off makes no difference; only restarting the Plex docker container causes it to downclock. The hacky scripted solution may come into play if nothing else works. The 'something else' I'm trying is going to be giving a P400 a shot to see if its behavior is any different. I know for a fact its general power usage will be lower; maybe it will also not be affected by this bug. If it is, I'll give the script a try and definitely echo these concerns to the Plex team. Thank you!
Xaero Posted May 11, 2019

6 hours ago, Chad Kunsman said:
"GPU Reset couldn't run because GPU 00000000:01:00.0 is the primary GPU." is the message I get when I try to reset. Persistence mode on or off makes no difference. [...]

I think the concerns are best directed at the NVIDIA team, specifically the nvidia docker team. Plex can't really do anything about what the driver or kernel decides to do with the card when it's done using it. The kernel, or driver, should be telling the card to enter a different P-state when it's idle, and that's not happening.
mattcoughlin Posted May 11, 2019

I recently installed this "plugin". Now 6.7 official is out. I tried searching for it but couldn't find it. How do I upgrade the Unraid OS in the future? I assume through the plugin, but 6.7 isn't available. If that's the case, is there typically a turnaround time for updates?
CHBMB Posted May 11, 2019

1 hour ago, mattcoughlin said:
I recently installed this "plugin". Now 6.7 official is out. [...]

We have to sleep. Turnaround time is when I get to it, and if it builds successfully, which at the moment it is not..... ETA: God only knows..... Please remember that this is not official and relies on someone (me) manually running the builds, waiting for them to compile (typically around 30 mins using 32GB RAM and 24 cores), and uploading them to the server. Delays can come from my other commitments: family, job, other life things. And as it states in the original post, we rely on a lot of upstream work, so there are no guarantees that we will release any given version; it's very dependent on things just working, as we have no control over the upstream.
trurl Posted May 11, 2019

1 minute ago, CHBMB said:
God only knows.....

...how many times you will have to answer this same question.
CHBMB Posted May 11, 2019

Just now, trurl said:
how many times you will have to answer this same question.

I'm not going to answer it again. If people can't be bothered to read a couple of posts before they post, then I'm not going to be bothered answering.... The community can self-police this one. That was the first person to ask, though; I make some allowances, as I honestly believe a lot of people have no real idea how much work goes into stuff like this.
ezhik Posted May 11, 2019

3 hours ago, CHBMB said:
We have to sleep. Turnaround time is when I get to it, and if it builds successfully, which at the moment, it is not..... ETA: God only knows..... [...]

Can we compile it?
saarg Posted May 11, 2019

7 minutes ago, ezhik said:
Can we compile it?

Of course. Nothing stops you 🙂
CHBMB Posted May 11, 2019

16 minutes ago, ezhik said:
Can we compile it?

Yeah, 'cos that's the bit I'm stuck on...... For goodness' sake, if I could compile the damn thing, there wouldn't be a problem. But by all means, try and build it from source.
CHBMB Posted May 11, 2019

OK, announcement. Any stupid posts asking why this isn't released, whether you can build it yourself, etc. etc.: prepare to hear my wrath. We're not complete noobs at this; @bass_rock and I wrote this, and when it's broken we'll do our best to fix it. Any amount of asking is not going to speed it up. If anyone thinks they can do better, then by all means write your own version, but as far as I can remember nobody else did, which is why we did it. We're working on it. ETA: I DON'T KNOW. When that changes I'll update the thread. My working theory on why this isn't building is that there's been a major GCC version upgrade, from v8.3 to v9.1, so I'm working on downgrading GCC. That's difficult, as I can't find a Slackware package for it, so I'm trying to build it from source and make some Slackware packages I can keep as static sources, which is not as easy as I'd hoped.
ezhik Posted May 11, 2019

50 minutes ago, CHBMB said:
OK, announcement. [...] We're working on it. ETA: I DON'T KNOW. When that changes I'll update the thread. [...]

Thank you @CHBMB, your work is greatly appreciated! We will patiently await the updates.
ezhik Posted May 11, 2019

To @limetech and the community dev team:
MisterLas Posted May 11, 2019

@CHBMB I was wondering when the new.... jk. Keep up the good work; this plugin is awesome. I work for a rather large software support company, so I feel your pain on the releases sometimes. Most of the community should realize you're all doing this great work, mostly in your free time, and have lives as well. Take your time, imho. Kudos! @limetech, thanks for the awesome releases as well.
ezhik Posted May 11, 2019

@CHBMB when this thing compiles cleanly, let me know. Beer is on me.
acozad1 Posted May 11, 2019

I know I'm going to catch shit for this, but I tried the new upgrade and it messed me up. I was unable to run it and had to downgrade to 6.6.7, and now I'm having a problem with my Plex docker. I think I've isolated it to the extra parameter --runtime=nvidia; it causes the docker to halt. Any insight on how to fix this would be greatly appreciated.
Jhp612 Posted May 11, 2019

2 minutes ago, acozad1 said:
I know, I am going to catch shit for this. But I tried the new upgrade and it messed me up. [...]

You need the new driver build, which has been discussed A BILLION TIMES ALREADY.
CHBMB Posted May 11, 2019

8 minutes ago, acozad1 said:
I know, I am going to catch shit for this. But I tried the new upgrade and it messed me up. [...]

Upgrading and downgrading when you're using the Nvidia plugin needs to be done via the plugin; using LT stock builds or the built-in Unraid downgrade tools won't work.
acozad1 Posted May 11, 2019

I have installed the nvidia driver and have installed the proper build for 6.6.7.
sittingmongoose Posted May 11, 2019

Just for clarity's sake: do I only upgrade via the Nvidia plugin? Or do I upgrade Unraid first and then upgrade the Nvidia driver (when it's available)?
saarg Posted May 11, 2019

2 minutes ago, acozad1 said:
I have installed the nvidia driver and have installed the proper build for 6.6.7

You have an extra character at the end of the NVIDIA_VISIBLE_DEVICES variable.
acozad1 Posted May 11, 2019

Oh shit, great catch. I will edit that and try it again. Thank you for noticing and helping me out; I appreciate it. That was it, you nailed it. Thank you again....
trurl Posted May 11, 2019

5 minutes ago, sittingmongoose said:
Just for clarity sake. Do I only upgrade via the nvidia plugin? [...]

The plugin is actually a custom build of Unraid that includes the Nvidia drivers. If you upgrade Unraid directly, you have removed the custom build. If you upgrade via the plugin, you have installed a new custom build of Unraid.