[Plugin] Nvidia-Driver


ich777


32 minutes ago, ich777 said:

Tools -> Diagnostics -> Download

 

As I said, this is only a guess, but as you probably know, such things can sometimes happen after you upgrade or change something. Again, it is only a guess and doesn't have to be the case here.

 

Please try switching to Legacy, since UEFI often causes trouble with Nvidia cards...

 

Please try removing the monitor cable from the card, since another user reported that this caused trouble on his build.

 

This is really easy: I've created a container for it that does everything for you, but it would be the last thing I'd recommend. :)

I understand that this sounds like a bit much, but it's really easy, trust me. As I said, though, this is the last thing I'd recommend trying.

 

As said above, please try booting with Legacy or CSM (whatever it is called in your BIOS ;) ) and remove the HDMI cable to your screen (I think it was THIS post).

 

 

I'll give these suggestions a try and get back to you! 

Link to comment

I'm seeing my card always reporting as power level P0 with 6.9.0-rc2. However, when I was using the nvidia-plugin from LinuxServer.io, it would always go back to P8 when idle.

 

It looks like even trying to force power save mode with `nvidia-smi -pm 1` doesn't switch the card to P8.

 

How do I get my card into P8 mode since it should be spending most of its time idle?


 

# nvidia-smi -pm 1
Enabled persistence mode for GPU 00000000:03:00.0.
All done.
# nvidia-smi
Sat Feb 27 16:31:38 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.45.01    Driver Version: 455.45.01    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P2200        On   | 00000000:03:00.0 Off |                  N/A |
| 57%   44C    P0    20W /  75W |      0MiB /  5058MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

 

Link to comment
4 hours ago, Jagadguru said:

I would like to be able to use UnRAID GUI at the same time as this.

Be sure to set the onboard graphics, not the PCIe slot, as your primary boot display in the BIOS, and connect the display to the onboard graphics.

 

4 hours ago, tkenn1s said:

nvidia-smi -pm 1

This is persistence mode, and it may already be deprecated; Nvidia mentioned a long time ago that this command would be deprecated at some point.

Are you sure that the card is in P0 and not P8? nvidia-smi has a lot of problems with older cards now; for example, my GTX 1050Ti is also reported as being in P0 when it is actually in P8.

 

What you can also try is to force a transcode and then see if the card falls back to P8 after the transcode ends.

 

Also, this has to do with the driver and not the plugin itself.
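For anyone who wants to watch whether the card drops back to P8 after a transcode, here is a minimal Python sketch. It assumes `nvidia-smi` is on the PATH and relies on its `--query-gpu` CSV output; the helper names (`build_query_command`, `parse_query_line`, `read_gpu_state`) are made up for this illustration and are not part of the plugin. Since the reported state may be wrong on some cards, as mentioned above, treat the result as a hint only.

```python
import subprocess

# Fields understood by `nvidia-smi --query-gpu`; the selection is only an example.
QUERY_FIELDS = ["pstate", "utilization.gpu", "power.draw"]


def build_query_command(fields=QUERY_FIELDS):
    """Build the nvidia-smi command line that prints one CSV row per GPU."""
    return [
        "nvidia-smi",
        "--query-gpu=" + ",".join(fields),
        "--format=csv,noheader,nounits",
    ]


def parse_query_line(line, fields=QUERY_FIELDS):
    """Turn one CSV row such as 'P8, 0, 6.52' into a dict keyed by field name."""
    values = [value.strip() for value in line.split(",")]
    return dict(zip(fields, values))


def read_gpu_state():
    """Run nvidia-smi once and return a list with one dict per GPU."""
    result = subprocess.run(
        build_query_command(), capture_output=True, text=True, check=True
    )
    return [parse_query_line(row) for row in result.stdout.splitlines() if row.strip()]
```

Calling `read_gpu_state()` on the server before, during, and a minute or two after a transcode should show whether the `pstate` value ever leaves P0.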

Link to comment
On 2/25/2021 at 10:56 PM, SiRMarlon said:

 

 

I'll give these suggestions a try and get back to you! 

 

Okay, so it looks like it ran great for a day and a half, and then my server just crashed. So I took the opportunity to move the card to another slot and boot the server in Legacy mode instead of UEFI. I've attached the diagnostics that I pulled once the server came back up. Hopefully you can find something in there that may lead to finding out why the card is crashing.

 

 

plex-unraid-diagnostics-20210227-2232.zip

Link to comment

I just noticed that the card is not working anymore, so I've attached the diagnostic logs again. This time the server hasn't hung up. It's still working.

 


plex-unraid-diagnostics-20210228-1628.zip

 

Looking at system devices I can still see the card listed there ... 

 

 

 

 


 

Okay, so looking at the system log, it's full of nothing but this ...

 

'Plex-Unraid kernel: NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 0'
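To get a feel for how often this message appears, and for which PCI address, the log can be scanned with a few lines of Python. This is only a sketch: the regular expression is modelled on the single line quoted above, and the function name `count_init_failures` is hypothetical.

```python
import re
from collections import Counter

# Pattern modelled on the one log line quoted in this post.
NVRM_FAILURE = re.compile(
    r"NVRM: GPU (?P<bus>[0-9a-fA-F:.]+): rm_init_adapter failed"
)


def count_init_failures(log_lines):
    """Count rm_init_adapter failures per PCI bus address."""
    counts = Counter()
    for line in log_lines:
        match = NVRM_FAILURE.search(line)
        if match:
            counts[match.group("bus")] += 1
    return counts
```

On the server this could be fed with an open log file, for example `count_init_failures(open("/var/log/syslog"))`; that path is an assumption and may differ on your system.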

 


 

 

Edited by SiRMarlon
Link to comment
4 hours ago, SiRMarlon said:

This time the server hasn't hung up. It's still working.

What have you done now? Have you disconnected the monitor from the card, and are you booting with Legacy?

 

May I ask if you are able to pull the card out of the server and test it in a desktop system?

If you can do that, please be sure to install the drivers, since basic display output always works. After that, run something like FurMark for a few minutes and see if it crashes.

 

The last thing I can recommend is to go with a build that has the drivers integrated; I think I have one lying around somewhere.

Link to comment
6 minutes ago, ich777 said:

What have you done now? Have you disconnected the monitor from the card, and are you booting with Legacy?

 

May I ask if you are able to pull the card out of the server and test it in a desktop system?

If you can do that, please be sure to install the drivers, since basic display output always works. After that, run something like FurMark for a few minutes and see if it crashes.

 

The last thing I can recommend is to go with a build that has the drivers integrated; I think I have one lying around somewhere.

 

 

The only difference this time is that I did boot the server in Legacy mode and there has been no monitor hooked up to the server. I am just accessing the server via the web browser from my main PC. I know the card doesn't have any issues because this is the same card that I was running on my 6.8.3 server for more than a year with no issues. So for it to fail by just switching it to another system seems highly unlikely to me.

 

When you say go with another build, are you referring to a 6.9.0-rc2 build that has the drivers already in it, so that I won't need to run the NVIDIA plugin?

Edited by SiRMarlon
Link to comment
Just now, SiRMarlon said:

When you say go with another build, are you referring to a 6.9.0-rc2 build that has the drivers already in it, so that I won't need to run the NVIDIA plugin?

Exactly.

 

1 minute ago, SiRMarlon said:

So for it to fail by just switching it to another system seems highly unlikely to me. 

Yes, I know, but I've already had a few people who experienced this with a P2000; one even had a P2000 that was already completely broken and hadn't noticed it on 6.8.3.

 

It's really difficult for me to debug this, because everything you find online for this error points to a hardware failure or something related to one.

 

I will look for the build and contact you again. Give me a few minutes; it's really early in the morning.

Link to comment
3 minutes ago, SiRMarlon said:

Do you have a 6.8.3 stable one with the drivers by any chance? That is what I had before on the other system.

Yes, I also have one in my Unraid-Kernel-Helper thread, but I think it would be better to first work out why it isn't working on 6.9.0-rc2, because I think you will have the same problem when 6.9.0 is released...

 

I already sent you the link for 6.9.0-rc2 with the Nvidia drivers included in a PM.

 

 

Anyway, here is the link to the prebuilt 6.8.3 with Nvidia integrated (at the bottom of the first post; you will also need the Unraid-Kernel-Helper plugin from the CA App to see your GPU UUID):

 

Link to comment

I have an issue with my 1660 Super. When the GPU is under load, it goes up to 60°C and the fan hovers around 40%. Some time after the load has dropped to 0%, when the temperature is back to a normal 36°C and the card is in the P8 power state, the fan decides to spin up to 100-110% and stay there for a while. After a shorter or longer while (currently 10 minutes and counting) it settles down again, but shortly afterwards it ramps up again.

 

I have the latest driver, v455.45.01, installed and am wondering if there is a way to install a previous version just to rule out any issues with the driver itself.

 

I'm running Unraid 6.9.0-rc2, the GPU is at 8 lanes if that makes any difference. I would restart but I have a few pre-clears running which will still take a few hours.

Link to comment
39 minutes ago, Fribb said:

I have the latest driver, v455.45.01, installed and am wondering if there is a way to install a previous version just to rule out any issues with the driver itself.

What exactly do you mean by this? Which previous version?

 

39 minutes ago, Fribb said:

I have an issue with my 1660 Super. When the GPU is under load, it goes up to 60°C and the fan hovers around 40%. Some time after the load has dropped to 0%, when the temperature is back to a normal 36°C and the card is in the P8 power state, the fan decides to spin up to 100-110% and stay there for a while. After a shorter or longer while (currently 10 minutes and counting) it settles down again, but shortly afterwards it ramps up again.

Is this a new card that you've installed in the server, or did you also have it running on previous versions?

This seems like an idle problem with the card itself that is mainly down to the BIOS of the card and not the driver.

You can try to boot into GUI mode with the card and see if this helps.

Are you booting with UEFI or Legacy? If you are booting in UEFI mode, try booting with Legacy, since this solves most problems on most Linux machines with Nvidia cards.

 

You can also try enabling persistence mode with the command 'nvidia-smi -pm 1', but this command will soon be deprecated by Nvidia.

Link to comment
1 minute ago, ich777 said:

What exactly do you mean by this? Which previous version?

 

Is this a new card that you've installed in the server, or did you also have it running on previous versions?

This seems like an idle problem with the card itself that is mainly down to the BIOS of the card and not the driver.

You can try to boot into GUI mode with the card and see if this helps.

Are you booting with UEFI or Legacy? If you are booting in UEFI mode, try booting with Legacy, since this solves most problems on most Linux machines with Nvidia cards.

 

You can also try enabling persistence mode with the command 'nvidia-smi -pm 1', but this command will soon be deprecated by Nvidia.

 

I meant a previous driver version, because I didn't have this issue a couple of weeks ago. I don't know when v455.45.01 was released, so this might just be me grasping at straws.

 

This is a GPU I have used for a year now in the same server, and the problem only appeared recently, when I did my DAS server expansion and upgraded to 6.9. I did try it with a different board while testing the DAS hardware and the problem didn't turn up there for some reason (also Unraid 6.9 and v455.45.01). I did run it in GUI mode for a while and it didn't seem to happen there, but I did not test it as extensively as I did today, when I ran a few ffmpeg transcodes.

 

I boot with UEFI mode.

 

I already had persistence mode active and also disabled and enabled it a couple of times trying to figure this out, without any luck.

 

Now that I stopped the array, the fan suddenly stopped ramping, and after a couple of minutes it started again. I disabled the 2 Docker containers that have the GPU passed through, and the fan speed dipped briefly but then stayed at max.

 

I have now rebooted in GUI mode and the fan has settled down again. I have to do some tests first to be sure that it is "working", though I don't know if that should be the solution...

Link to comment
5 minutes ago, Fribb said:

I meant a previous driver version, because I didn't have this issue a couple of weeks ago. I don't know when v455.45.01 was released, so this might just be me grasping at straws.

The driver version didn't change in any way...

 

6 minutes ago, Fribb said:

I already had persistence mode active and also disabled and enabled it a couple of times trying to figure this out, without any luck.

Persistence mode doesn't work with all cards, and Nvidia has also stated that it will be deprecated and no longer supported in the near future.

 

7 minutes ago, Fribb said:

I boot with UEFI mode.

Please try booting with Legacy; this may solve the issue.

 

8 minutes ago, Fribb said:

Now that I stopped the array, the fan suddenly stopped ramping, and after a couple of minutes it started again. I disabled the 2 Docker containers that have the GPU passed through, and the fan speed dipped briefly but then stayed at max.

 

I have now rebooted in GUI mode and the fan has settled down again. I have to do some tests first to be sure that it is "working", though I don't know if that should be the solution...

This actually has to do with the BIOS of the card itself and how the manufacturer implemented the "idle" mode or the reset of the card. I think something is resetting the card after some time, so the fans spin at 100% until something uses the card or puts load on it; then the default fan curve kicks in.

If you have an X server (GUI) running, there is some load on the GPU and the BIOS should tell the card to use the default fan curve.

Hope this explains it a bit.
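The pattern described here (card idle in P8 with 0% load, yet the fan at or near 100%) can be flagged automatically while you test. Below is a rough Python sketch; it assumes the rows come from `nvidia-smi --query-gpu=pstate,utilization.gpu,temperature.gpu,fan.speed --format=csv,noheader,nounits`, and both the names (`GpuSample`, `parse_sample`, `is_idle_fan_runaway`) and the 90% threshold are arbitrary choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    pstate: str
    utilization_pct: int
    temperature_c: int
    fan_pct: int


def parse_sample(csv_line):
    """Parse one row in the order pstate, utilization.gpu, temperature.gpu, fan.speed."""
    pstate, util, temp, fan = [part.strip() for part in csv_line.split(",")]
    return GpuSample(pstate, int(util), int(temp), int(fan))


def is_idle_fan_runaway(sample, fan_threshold_pct=90):
    """True when the card is idle in P8 but the fan is at or above the threshold."""
    return (
        sample.pstate == "P8"
        and sample.utilization_pct == 0
        and sample.fan_pct >= fan_threshold_pct
    )
```

Cards without a controllable fan report `[N/A]` for `fan.speed`, which this sketch does not handle; logging one sample every few seconds in both boot modes should be enough to compare them.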

Link to comment
2 hours ago, ich777 said:

The driver version didn't change in any way...

 

Persistence mode doesn't work with all cards, and Nvidia has also stated that it will be deprecated and no longer supported in the near future.

 

Please try booting with Legacy; this may solve the issue.

 

This actually has to do with the BIOS of the card itself and how the manufacturer implemented the "idle" mode or the reset of the card. I think something is resetting the card after some time, so the fans spin at 100% until something uses the card or puts load on it; then the default fan curve kicks in.

If you have an X server (GUI) running, there is some load on the GPU and the BIOS should tell the card to use the default fan curve.

Hope this explains it a bit.

So, I booted in Legacy mode with the GUI and let some ffmpeg transcodes run. The GPU went up to 62°C and stayed there while the fan was at 40%. So far (hopefully I don't jinx it by saying that) the GPU has stayed quiet at 35% fan speed.

 

What is weird to me is that I didn't have this behaviour over the last year, only recently after upgrading to 6.9 (and switching motherboards). I already have the latest BIOS on the MB.

Link to comment
11 minutes ago, Fribb said:

after upgrading to 6.9 (and switching Motherboards), I already have the latest bios on the MB.

Are you sure that you also booted in UEFI mode with the old motherboard and 6.8.3?

 

11 minutes ago, Fribb said:

with GUI

Can you try booting into Legacy without the GUI, start a transcode, and then see if it actually stays at that fan speed?

 

Also, while you are at it, you can try enabling persistence mode from the beginning. Legacy solves so many problems with Linux and Nvidia.

I also see this on my main machine with Debian and the official Nvidia drivers; updating the drivers in UEFI is a real pain...

Link to comment
4 minutes ago, ich777 said:

Are you sure that you also booted in UEFI mode with the old motherboard and 6.8.3?

 

Can you try booting into Legacy without the GUI, start a transcode, and then see if it actually stays at that fan speed?

 

Also, while you are at it, you can try enabling persistence mode from the beginning. Legacy solves so many problems with Linux and Nvidia.

I also see this on my main machine with Debian and the official Nvidia drivers; updating the drivers in UEFI is a real pain...

I'm pretty sure, yes. 

 

So, the transcodes are through: no GUI mode, persistence mode 1, and booted with Legacy. It seems to run as it should, but I will keep an eye/ear on it; hopefully it will stay that way. Still, why did it work before and not now anymore?

Link to comment
14 minutes ago, Fribb said:

Still, why did it work before and not now anymore? 

That's a thing I can't answer...

Are you really, really sure that you booted into UEFI before? I think UEFI became the default boot mode when you upgraded to 6.9.0-rc2, because an old Sandy Bridge machine gave me problems when I upgraded it to 6.9.0-rc2 (UEFI boot was deactivated in the BIOS entirely, and Unraid wouldn't boot any longer after I rebooted to finish the upgrade process).
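If you are unsure which mode the running system was booted in, Linux itself can tell you: the kernel only exposes the `/sys/firmware/efi` directory when it was started through UEFI. A minimal Python sketch, assuming the kernel was built with EFI support (the function name `boot_mode` is made up for this example):

```python
import os


def boot_mode(efi_dir="/sys/firmware/efi"):
    """Report the boot mode based on whether the kernel exposes the EFI directory."""
    return "UEFI" if os.path.isdir(efi_dir) else "Legacy (BIOS/CSM)"
```

Running `print(boot_mode())` on the server after each reboot removes the guesswork about what the BIOS actually did. It only reports on the current boot, so it cannot tell you how the old 6.8.3 installation was started.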

Link to comment
7 hours ago, SiRMarlon said:

Just sent you a DM @ich777 I had to roll back to your 6.8.3 w/ integrated Nvidia drivers as 6.9.0-rc2 w/ integrated Nvidia drivers became real unstable on my system. 

Which version of the images are you using now? The one from my thread?

 

I think it has something to do with the combination of your hardware.

I don't know if you have seen it already but 6.9.0 stable was released.

Link to comment
13 hours ago, ich777 said:

That's a thing I can't answer...

Are you really, really sure that you booted into UEFI before? I think UEFI became the default boot mode when you upgraded to 6.9.0-rc2, because an old Sandy Bridge machine gave me problems when I upgraded it to 6.9.0-rc2 (UEFI boot was deactivated in the BIOS entirely, and Unraid wouldn't boot any longer after I rebooted to finish the upgrade process).

Yes, I'm really really sure that I booted into UEFI before. I think I even read somewhere that you should use UEFI because that is the new standard?!

I had it on the old board running 6.8.3, and on the new board running 6.8.3 and 6.9-rc2. On 6.8.3, though, I had the "old" Nvidia approach, which I can't reproduce anymore because of missing drivers etc.; otherwise I would have tried to see whether the problem shows up there too and my GPU itself has some issues. While testing the new components for my DAS, I had the old board and even plugged the GPU into it and ran some transcodes without that problem turning up. Whatever the case, running in Legacy seems to have fixed it for whatever reason.

Link to comment

I updated to 6.9 stable today. Previously I was using the old nvidia plugin on 6.8.3 for Emby HW transcoding and all was good.

 

Emby wouldn't start because the old plugin and settings were still there. I deleted the plugin and cleared all the settings, and got Emby working without it.

 

Then I downloaded the new Nvidia plugin and set up all the settings again. It works... but it doesn't seem as good as before?

Maybe my memory is a little fuzzy, but I usually test by choosing a 4k movie and streaming it to a 1080p device, in this case Emby in Chrome. It starts and I can see hardware transcoding going, but before long the stream freezes. After a couple of reboots, looking around for other options, and testing other 4k movies, it still freezes. I tried a different device, a Samsung Note 10; it played a little longer but still froze.

 

When I change the settings in Emby to turn off HW transcoding, my CPU can transcode the whole 4k movie for a 1080p device with no problems.

 

Ultimately this shouldn't be a real issue for me, as I keep 4k in a separate library and don't share it out to my users. And I will only play 4k movies on a 4k device locally.

 

Anyway, I was just wondering if anyone else is having a similar issue?

 

Micro-Star International Co., Ltd. X399 SLI PLUS

AMD Ryzen Threadripper 2920X 12-Core @ 4150 MHz

32GB  DDR4

NVIDIA GeForce GTX 1660

Link to comment
