[Plugin] Linuxserver.io - Unraid Nvidia



5 hours ago, Xaero said:

Where are you looking to see the utilization?
The default nvidia-smi screen shows fan speed, GPU core utilization and error-correction percentages, which aren't applicable to the nvenc/nvdec pipelines.
You'll want to use nvidia-smi dmon to get the columns of percentages, and pay attention in particular to the last two columns. If you watch them for the duration of a short video, you should see how it works. It fills a buffer rapidly with video, and then idles. With multiple streams it will simply use the idle durations from the other streams to buffer a new stream. You'd have to have about a dozen or more streams for them to need to double up the duty cycle, and that's where you MIGHT start seeing decreased performance. Of course, this also means you would need to be able to DECODE fast enough for 12+ streams simultaneously encoding. Which is probably more of a problem.

I have used watch nvidia-smi and looked under GPU-Util on the right side of the box. Right below that is a percentage. Is that something else? 

18 hours ago, maxse said:

I have used watch nvidia-smi and looked under GPU-Util on the right side of the box. Right below that is a percentage. Is that something else? 

That's the GPU core utilization. It shouldn't really increase with nvenc or nvdec usage; that's the whole point of nvenc and nvdec: giving gamers a way to accelerate video encoding for streaming. For a streaming site like Netflix, it doesn't really make sense to transcode video live for each stream when it can just be stored pre-encoded. For a gamer streaming to Twitch, they need a way to encode video on the fly without impacting rendering performance; enter nvenc and nvdec. The spike you are seeing is kind of suspicious, but the important metrics to look at are the enc and dec columns of
nvidia-smi dmon
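
For example, to watch just the utilization columns, something like this should work (the -s/-d flags are standard nvidia-smi dmon options, though the exact column layout can vary between driver versions):

# utilization group only (sm, mem, enc, dec), refreshed every second
nvidia-smi dmon -s u -d 1

# or query the engines directly, alongside the GPU core figure you were looking at
nvidia-smi --query-gpu=utilization.gpu,utilization.encoder,utilization.decoder --format=csv -l 1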

On 5/8/2019 at 8:07 PM, Chad Kunsman said:

 

By the way, I have half solved this problem after more testing. 

 

The card can and will clock itself down while idle UNTIL it has to perform transcoding work. Once the transcode job is finished, it will never clock itself down again until I restart the Plex docker container. 

 

See the moment I restarted the container here: http://i.imgur.com/zeXKiFZ.png

 

See a thread from another user who seems to have the exact same issue here: https://forums.plex.tv/t/stuck-in-p-state-p0-after-transcode-finished-on-nvidia/387685/2

 

I really hope there's a solution as this is about $40 a year down the drain in extra power usage just due to the card not idling as it should. 

This information could prove useful for creating an (albeit hacky) solution, if need be.
I don't currently have any hardware set up where I can test anything (my server is in boxes at the moment).

One thing you may try is enabling persistence mode:
nvidia-smi -pm 1

This will result in a couple of watts of idle usage, but will force the drivers to stay loaded, even when no job is running. It's possible the drivers are exiting as soon as the transcode jobs finish, and not changing the power state back to idle.

 

Or, if it's already enabled, you could try disabling it:
nvidia-smi -pm 0
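
If you're not sure which mode the card is currently in, you can check before toggling it (field name taken from nvidia-smi's documented query options):

# shows Enabled/Disabled for persistence mode
nvidia-smi --query-gpu=persistence_mode --format=csv
# or grep it out of the full report
nvidia-smi -q | grep -i persistence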



A hacky, scripted solution would be to monitor the nvidia-smi output for a condition of both LOW GPU, NVENC and NVDEC utilization and a HIGH power state, and then issue nvidia-smi --gpu-reset, which would reset the GPU and allow it to idle again. Both of these are hacky workarounds.
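
A rough, untested sketch of what that watcher could look like (the device index, thresholds and sleep interval are placeholders to tune, and --gpu-reset may refuse to run on some cards):

#!/bin/bash
# Untested sketch: if GPU 0 sits in P0 with no core/encode/decode activity
# for several consecutive checks, reset it so it can drop back to an idle state.
CHECKS_NEEDED=5    # consecutive idle samples before acting
SLEEP_SECS=30      # time between samples
idle=0

while true; do
    # e.g. "P0 0 0 0" -> performance state, gpu%, enc%, dec%
    read -r pstate gpu enc dec < <(nvidia-smi -i 0 \
        --query-gpu=pstate,utilization.gpu,utilization.encoder,utilization.decoder \
        --format=csv,noheader,nounits | tr -d ',')

    if [[ "$pstate" == "P0" && "$gpu" -eq 0 && "$enc" -eq 0 && "$dec" -eq 0 ]]; then
        idle=$((idle + 1))
    else
        idle=0
    fi

    if (( idle >= CHECKS_NEEDED )); then
        # May fail with "is the primary GPU" on some systems
        nvidia-smi -i 0 --gpu-reset && idle=0
    fi

    sleep "$SLEEP_SECS"
done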
I too would echo the Plex team on this: post this information on the Nvidia developer forums, and in particular point out the use of the new nvidia-docker blobs, as the issue could quite possibly lie there. Once I have my server up and running again, I'll have a poke at seeing if I can replicate and/or resolve this issue.
https://devtalk.nvidia.com/

14 hours ago, Xaero said:

This information could prove useful for creating an (albeit hacky) solution, if need be.
I don't currently have any hardware set up where I can test anything (my server is in boxes at the moment).

One thing you may try is enabling persistence mode:
nvidia-smi -pm 1

This will result in a couple of watts of idle usage, but will force the drivers to stay loaded, even when no job is running. It's possible the drivers are exiting as soon as the transcode jobs finish, and not changing the power state back to idle.

 

Or, if it's already enabled, you could try disabling it:
nvidia-smi -pm 0



A hacky, scripted solution would be to monitor the nvidia-smi output for a condition of both LOW GPU, NVENC and NVDEC utilization and a HIGH power state, and then issue nvidia-smi --gpu-reset, which would reset the GPU and allow it to idle again. Both of these are hacky workarounds.
I too would echo the Plex team on this: post this information on the Nvidia developer forums, and in particular point out the use of the new nvidia-docker blobs, as the issue could quite possibly lie there. Once I have my server up and running again, I'll have a poke at seeing if I can replicate and/or resolve this issue.
https://devtalk.nvidia.com/

"GPU Reset couldn't run because GPU 00000000:01:00.0 is the primary GPU." is the message I get when I try to reset. 

 

Persistence mode on or off makes no difference. Only resetting the Plex docker container causes it to downclock. 

 

The hacky scripted solution may come into play if nothing else works. The 'something else' I'm trying is giving a P400 a shot to see if its behavior is any different. I know for a fact its general power usage will be lower. Maybe it will also not be affected by this bug. If it is, I'll give the script a try and definitely echo these concerns to the Plex team.

 

Thank you!

 

 

6 hours ago, Chad Kunsman said:

"GPU Reset couldn't run because GPU 00000000:01:00.0 is the primary GPU." is the message I get when I try to reset. 

 

Persistence mode on or off makes no difference. Only resetting the Plex docker container causes it to downclock. 

 

The hacky scripted solution may come into play if nothing else works. The 'something else' I'm trying is giving a P400 a shot to see if its behavior is any different. I know for a fact its general power usage will be lower. Maybe it will also not be affected by this bug. If it is, I'll give the script a try and definitely echo these concerns to the Plex team.

 

Thank you!

 

 

I think the concerns are best directed at the Nvidia team, specifically the nvidia-docker team. Plex can't really do anything about what the driver or kernel decides to do with the card when it's done using it. The kernel or driver should be telling the card to enter a different P-state when it's idle, and that's not happening.
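
If you do raise it on the Nvidia forums, a simple log of the power state over time would probably help make the case (query fields per nvidia-smi; just a sketch):

# log timestamp, performance state, power draw and SM clock every 5 seconds
nvidia-smi --query-gpu=timestamp,pstate,power.draw,clocks.sm --format=csv -l 5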


I recently installed this "plugin". Now 6.7 official is out. I tried searching but couldn't find it. How do I upgrade the Unraid OS in the future? I assume through the plugin, but 6.7 isn't available. If that's the case, is there typically a turnaround time for updates?

1 hour ago, mattcoughlin said:

I recently installed this "plugin". Now 6.7 official is out. I tried searching but couldn't find it. How do I upgrade the Unraid OS in the future? I assume through the plugin, but 6.7 isn't available. If that's the case, is there typically a turnaround time for updates?

We have to sleep.  Turnaround time is when I get to it, and if it builds successfully, which at the moment, it is not.....

 

ETA:  God only knows.....

 

Please remember that this is not official and relies on someone (me) manually running the builds, waiting for them to compile (typically around 30 mins using 32GB RAM and 24 cores) and uploading them to the server. Delays can come from my other commitments: family, job, other life things. As it states in the original post, we rely on a lot of upstream work, so there are no guarantees that we will release any given version; it's very dependent on things just working, as we have no control over the upstream.

Just now, trurl said:

how many times you will have to answer this same question.

I'm not going to answer it again.  If people can't be bothered to read a couple of posts before they post, then I'm not going to be bothered answering.... :D

 

The community can self-police this one.

 

That was the first person to ask, though. I make some allowances, as I honestly believe a lot of people have no real idea how much work goes into stuff like this.

3 hours ago, CHBMB said:

We have to sleep.  Turnaround time is when I get to it, and if it builds successfully, which at the moment, it is not.....

 

ETA:  God only knows.....

 

Please remember that this is not official and relies on someone (me) manually running the builds, waiting for them to compile (typically around 30 mins using 32GB RAM and 24 cores) and uploading them to the server. Delays can come from my other commitments: family, job, other life things. As it states in the original post, we rely on a lot of upstream work, so there are no guarantees that we will release any given version; it's very dependent on things just working, as we have no control over the upstream.

Can we compile it? :)

16 minutes ago, ezhik said:

Can we compile it? :)

Yeah, cos that's the bit I'm stuck on......

 

For goodness sake, if I could compile the damn thing, there wouldn't be a problem.

But by all means try and build it from source.


OK, announcement.

 

Anyone making stupid posts asking why this isn't released, whether they can build it themselves, etc. etc.: prepare to hear my wrath.

 

We're not complete noobs at this. @bass_rock and I wrote this, and when it's broken we'll do our best to fix it; any amount of asking is not going to speed it up.

 

If anyone thinks they can do better, then by all means write your own version, but as far as I can remember nobody else did, which is why we did it.

 

We're working on it. 

 

ETA: I DON'T KNOW

 

When that changes I'll update the thread.

 

My working theory for why this isn't building is that there's been a major GCC version upgrade between v8.3 and v9.1, so I'm trying to downgrade GCC. That's proving difficult, as I can't find a Slackware package for it, so I'm building it from source and making some Slackware packages so I can keep them as static sources, which is not as easy as I'd hoped.
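
For anyone curious what that involves, the rough shape is something like this (versions, paths and configure flags here are illustrative guesses, not the exact recipe I'm using):

# grab an older GCC release and its prerequisites
wget https://ftp.gnu.org/gnu/gcc/gcc-8.3.0/gcc-8.3.0.tar.xz
tar xf gcc-8.3.0.tar.xz
(cd gcc-8.3.0 && ./contrib/download_prerequisites)

# GCC prefers an out-of-tree build
mkdir gcc-build && cd gcc-build
../gcc-8.3.0/configure --prefix=/usr --enable-languages=c,c++ --disable-multilib
make -j24

# install into a staging directory and roll it into a Slackware package
make install DESTDIR=/tmp/gcc-pkg
cd /tmp/gcc-pkg && makepkg -l y -c n /tmp/gcc-8.3.0-x86_64-1.txz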

50 minutes ago, CHBMB said:

OK, announcement.

 

Anyone making stupid posts asking why this isn't released, whether they can build it themselves, etc. etc.: prepare to hear my wrath.

 

We're not complete noobs at this. @bass_rock and I wrote this, and when it's broken we'll do our best to fix it; any amount of asking is not going to speed it up.

 

If anyone thinks they can do better, then by all means write your own version, but as far as I can remember nobody else did, which is why we did it.

 

We're working on it. 

 

ETA: I DON'T KNOW

 

When that changes I'll update the thread.

 

My working theory for why this isn't building is that there's been a major GCC version upgrade between v8.3 and v9.1, so I'm trying to downgrade GCC. That's proving difficult, as I can't find a Slackware package for it, so I'm building it from source and making some Slackware packages so I can keep them as static sources, which is not as easy as I'd hoped.

 

Thank you @CHBMB, your work is greatly appreciated! We will patiently await the updates. 


@CHBMB I was wondering when the new.... jk.   Keep up the good work, this plugin is awesome.   I work for a rather large software support company, so I feel your pain on the releases sometimes.  Most of the community should realize you all are doing this great work, mostly on free time, and have lives as well.  Take your time, imho.   Kudos!

 

@limetech, thanks for the awesome releases as well.


I know, I am going to catch shit for this. But I tried the new upgrade and it messed me up. I was unable to run it and had to downgrade to 6.6.7, and now I am having a problem with my Plex docker. I think I've isolated it to the extra parameter --runtime=nvidia; it causes the docker to halt. Any insight on how to fix this would be greatly appreciated.

Screen Shot 2019-05-11 at 9.43.50 AM.png

2 minutes ago, acozad1 said:

I know, I am going to catch shit for this. But I tried the new upgrade and it messed me up. I was unable to run it and had to downgrade to 6.6.7, and now I am having a problem with my Plex docker. I think I've isolated it to the extra parameter --runtime=nvidia; it causes the docker to halt. Any insight on how to fix this would be greatly appreciated.

Screen Shot 2019-05-11 at 9.43.50 AM.png

 

You need the new driver build, which has been discussed A BILLION TIMES ALREADY.

8 minutes ago, acozad1 said:

I know, I am going to catch shit for this. But I tried the new upgrade and it messed me up. I was unable to run it and had to downgrade to 6.6.7, and now I am having a problem with my Plex docker. I think I've isolated it to the extra parameter --runtime=nvidia; it causes the docker to halt. Any insight on how to fix this would be greatly appreciated.

Screen Shot 2019-05-11 at 9.43.50 AM.png

Upgrading and downgrading when you're using the Nvidia plugin needs to be done via the plugin; using LT stock builds or the built-in Unraid downgrade tools won't work.

5 minutes ago, sittingmongoose said:

Just for clarity's sake: do I only upgrade via the Nvidia plugin? Or do I upgrade Unraid first and then upgrade the Nvidia driver (when it's available)?

The plugin is actually a custom build of Unraid that includes nvidia drivers.

 

If you upgrade Unraid, then you have removed the custom build.

 

If you upgrade the plugin, you have installed a new custom build of Unraid.
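
If you're ever unsure which build you're actually booted into, it's easy to check from a terminal (a quick sketch, assuming the usual /etc/unraid-version file; nvidia-smi is only present on the Nvidia builds):

# running Unraid version string
cat /etc/unraid-version

# prints the driver/GPU summary on the Nvidia build,
# "command not found" on a stock build
nvidia-smi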

This topic is now closed to further replies.