[Plugin] Nvidia-Driver


ich777


23 hours ago, ich777 said:

Then roll back to an even older version, the drivers won't change and the package is always the same since they are precompiled.

I've just tried v525.78.01 as well and the issue still appears after around 5 hours at idle.

 

1 hour ago, ich777 said:

If it was working before, something in terms of hardware or BIOS must have changed, since otherwise the older driver would work as expected. I think you get the point of what I'm trying to say.

Maybe the power supply is dying or something similar. If you haven't changed anything, you've rolled back the driver, and you have the exact same issue with the previous driver, there must be something different now...

The only thing that changed is that I switched out the RAM about a month ago to 4x32GB sticks running on their XMP profile at 3200MHz. I have noticed since this issue appeared, though, that when I shut down the system and turn it back on, the EZ Debug CPU LED is red (MSI B550 Gaming Gen3 motherboard), which apparently means the CPU was not detected or the initial checks failed at start; but if I press the Reset button, it boots up fine... I know the CPU is alright, and temps have never gotten high, so I'm starting to think possible PSU or Motherboard failure...

 

1 hour ago, ich777 said:

Can you please describe what you are doing? Do you update the BIOS on Unraid itself?

I just updated the GPU's VBIOS through a bootable USB to enable Resizable BAR support, using ASUS' tool (the GPU is an ASUS 3060 Ti Dual OC). I was trying to do it through Unraid before but ended up using their provided EXE since it's safer.

 

I also updated the motherboard BIOS right after the issue started to occur, thinking it might have been an issue with the new driver and the old BIOS.

Link to comment
7 minutes ago, OrneryTaurus said:

 

Hey there. Here are the diagnostics for the system currently running in case it provides anything. The card was removed. I'll install the card again and run new diagnostics shortly.

smalls-diagnostics-20230510-1146.zip

 

Hi @ich777 - I switched the PCIe port and it's now working with the latest driver. Plex transcoding is also working. Thanks for the kick in the rear :D I've attached diagnostics in case you need to review them.

smalls-diagnostics-20230510-1209.zip

Link to comment
1 hour ago, alexdac99 said:

so I'm starting to think possible PSU or Motherboard failure...

Maybe it's a PSU issue, but I can't tell for sure; maybe try switching PCIe slots if there are any others that are suitable for the card.

 

If the card fell off the bus, this usually indicates a motherboard or PSU issue.

Link to comment
On 5/10/2023 at 12:28 AM, ich777 said:

As said above, I think you have to add some parameters in ESXi so that newer cards work; this is not the first time I've seen that.

 

Also please remove these two files from your modprobe.d folder: blacklist-nouveau.conf and nvidia.conf (why did you even do that?)

 

This is the part that shows why it doesn't work:

0b:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU116 [GeForce GTX 1660 SUPER] [10de:21c4] (rev a1)
    DeviceName: pciPassthru0
    Subsystem: Shenzhen Colorful Yugong Technology and Development Co. TU116 [GeForce GTX 1660 SUPER] [7377:0000]
    Kernel driver in use: nvidia
    Kernel modules: nvidia_drm, nvidia

ESXi does something to the device names and I think that was the issue the last time too.
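
For reference, the listing above is plain "lspci -nnk" output, so you can re-check it on the guest at any time with, for example:

lspci -nnk | grep -iA4 nvidia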

 

The last user who reported that switched from ESXi to a native installation and it worked after that. Please also keep in mind that some hardware combinations and the 1660 SUPER series are complicated... at least on Linux.

 

Please look into the Virtualizing Unraid subforum; there you should find what you have to add to the ESXi host so that the card works without a kernel panic.

 

EDIT: I found that:

 

But there is nothing I can do about that since you are virtualizing Unraid and this could cause issues. I also really can't help since I'm not familiar with ESXi (just search for ESXi in this thread and you will find a few posts, some even stating that drivers >=460 don't work with ESXi).

 

(screenshots attached)

Thank you!

blacklist-nouveau.conf blocks the nouveau driver that comes with ESXi.

nvidia.conf sets PCI passthrough as the first choice.

After a few days of testing, I found out that this is not an ESXi problem. It is not a question of whether to add parameters; I guess it is some kind of limitation in the NVIDIA driver.
I don't know how to use Unraid to compile and install the driver, so I used ESXi to virtualize a Debian 11 guest (kernel 5.10.0-22-amd64), downloaded countless NVIDIA drivers to compile and install against that kernel, and found that the minimum version that works normally is 470.82.01.
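
For reference, each of those per-kernel builds on Debian looked roughly like this (a sketch rather than my exact commands; the driver version is just an example):

# as root: install the toolchain and the headers for the running kernel
apt install build-essential linux-headers-$(uname -r)
# the .run installer then compiles the kernel module against the running kernel
chmod +x NVIDIA-Linux-x86_64-470.82.01.run
./NVIDIA-Linux-x86_64-470.82.01.run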

Excuse me, is it possible to compile a 470.82.01 driver for us ESXi users?

If not, is my only solution to fall back to Unraid 6.9.2?

Link to comment
2 hours ago, ceozero said:

If not, is my only solution to fall back to Unraid 6.9.2?

Yes and no.

 

2 hours ago, ceozero said:

Excuse me, is it possible to compile a 470.82.01 driver for us ESXi users?

You can do it yourself, you'll find how I do it in my GitHub repo for the driver here: Click

Link to comment
16 hours ago, ich777 said:

You can do it yourself, you'll find how I do it in my GitHub repo for the driver here: Click

 

I don't know how to use this script; is there any documentation for reference? Can I run and compile it directly under Unraid in the ich777/debian-bullseye Docker container?

Or are the parameters I assigned wrong?

 

DATA_DIR='/data'
NV_DRV_V='470.82.01'
UNAME='5.19.17'
LIBNVIDIA_CONTAINER_V='1.13.1'
CONTAINER_TOOLKIT_V='1.13.1'

 

thx

Link to comment
1 hour ago, ceozero said:

I don't know how to use this script; is there any documentation for reference? Can I run and compile it directly under Unraid in the ich777/debian-bullseye Docker container?

No, this is not possible in the debian-bullseye container, or at least not how it ships OOB, because it is not intended for that use case; but you could maybe do it if you install everything that is needed.

 

You have to install all dependencies/packages to compile the kernel first; after that you have to compile the driver, then pack it, and finally place it in the correct directory on the USB boot device.
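
Very roughly, and only as an illustration (the real steps are in the repo scripts; the package name and path below are guesses and need to be verified against the plugin):

# as root: toolchain and common kernel build dependencies
apt install build-essential bc flex bison libelf-dev
# build the NVIDIA module against the exact Unraid kernel (here 5.19.17) and
# pack it as a package -- the repo's scripts automate these steps
# then place the package on the USB boot device so the plugin can pick it up:
cp nvidia-470.82.01-5.19.17.txz /boot/config/plugins/nvidia-driver/packages/5.19.17/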

Please also note that you have to do this each time you upgrade your Unraid version.

 

The point that I'm trying to make is that this is not practicable for most users.

Link to comment
3 hours ago, ich777 said:

No, this is not possible in the debian-bullseye container, or at least not how it ships OOB, because it is not intended for that use case; but you could maybe do it if you install everything that is needed.

 

You have to install all dependencies/packages to compile the kernel first; after that you have to compile the driver, then pack it, and finally place it in the correct directory on the USB boot device.

Please also note that you have to do this each time you upgrade your Unraid version.

 

The point that I'm trying to make is that this is not practicable for most users.

 

I can't do anything about the compilation myself.

Today I upgraded the Debian kernel under ESXi to 6.1 and tried the open-source installation of 525.116.04; it worked. Just use "./NVIDIA-Linux-x86_64-525.116.04.run -m=kernel-open". Unfortunately, I can't use the latest 530 or 525 driver even after upgrading Unraid to the beta version. Same kernel panic.
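
In case it helps anyone checking the same thing, a generic way to verify which module flavor is actually loaded:

cat /proc/driver/nvidia/version    # shows the loaded driver version
modinfo nvidia | grep -i license   # usually "Dual MIT/GPL" for the open module, "NVIDIA" for the proprietary one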

 

 

For now I can only temporarily move Emby to Debian and keep watching this thread, until one day some great master solves this driver problem.

(screenshot attached)

 

Link to comment
3 hours ago, ceozero said:

For now I can only temporarily move Emby to Debian and keep watching this thread, until one day some great master solves this driver problem.

I could of course compile this driver but I'm not so sure if it compiles against kernel versions newer than 5.18.

 

In my personal opinion this is something that ESXi should solve, or at least work around.

Link to comment

Has anybody else had their x265 transcoding suddenly stop working? I have an NVIDIA T600.

 

Running the latest driver (530.41.03) with the BinHex container. Currently on the RC5 build of 6.12.0, which I think may be related since the problem seemed to coincide with the upgrade?

 

What's strange is that my GPU dashboard tool shows that a PlexTranscoder process running inside the Docker container is holding onto a GPU transcode instance when I try to transcode a 4K x265 file - but the amount of RAM being used is much less than normal for a 4K video (30MB vs. the usual 300MB).

 

Here are the logs from Plex that show the error:

 

May 12, 2023 12:03:43.404 [22944562506552] ERROR - [Req#8228/Transcode] [FFMPEG] - Failed to initialise VAAPI connection: -1 (unknown libva error).
May 12, 2023 12:03:43.404 [22944562506552] DEBUG - [Req#8228/Transcode] Codecs: hardware transcoding: opening hw device failed - probably not supported by this system, error: I/O error

 

log.txt

Edited by ich777
put log into file
Link to comment
2 hours ago, Einsteinjr said:

Running the latest driver (530.41.03) with the BinHex container. Currently on the RC5 build of 6.12.0, which I think may be related since the problem seemed to coincide with the upgrade?

Not that I'm aware of; you are the first one reporting that.

 

Please post your Diagnostics.

 

Have you tried the official Plex container yet, to see if that works?

 

2 hours ago, Einsteinjr said:

What's strange is that my GPU dashboard tool shows that a PlexTranscoder process running inside the Docker container is holding onto a GPU transcode instance when I try to transcode a 4K x265 file - but the amount of RAM being used is much less than normal for a 4K video (30MB vs. the usual 300MB).

Do you have screenshots from nvidia-smi, the Dashboard and so on?
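
If it's easier, the same information is available from the command line (the container name is just an example - adjust it to yours):

nvidia-smi                               # on the host: driver version, VRAM usage, process list
docker exec -it binhex-plex nvidia-smi   # the same view from inside the container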

Link to comment
On 5/4/2023 at 11:10 PM, supawiz6991 said:

I uninstalled the plugin, rebooted, reinstalled the plugin (it installed driver 530.41.03 during install), rebooted again, tried to set the production branch option, and got the error again:

---Can't find Nvidia Driver vlatest_prb for your Kernel v5.19.17 falling back to latest Nvidia Driver v530.41.03---

 

Using the manual driver selection I am able to successfully install the correct driver of 525.116.03.

 


@ich777, I got a similar error now. The production branch does not work; it always falls back to latest.

 

---Can't find Nvidia Driver vlatest_prb for your Kernel v5.19.17 falling back to latest Nvidia Driver v530.41.03---
--------Nothing to do, Nvidia Driver v530.41.03 already downloaded!---------
------------------------------Verifying CHECKSUM!------------------------------
----------------------------------CHECKSUM OK!---------------------------------

 

Link to comment
2 hours ago, emrepolat7 said:

Sometimes I want to use my graphics card for a VM.

Then make sure that nothing on the host (Docker) is using the card and start the VM.
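
A quick generic check before starting the VM:

nvidia-smi    # the process table at the bottom should be empty
nvidia-smi --query-compute-apps=pid,process_name --format=csv    # or list only compute processes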

 

2 hours ago, emrepolat7 said:

is there any option to disable the plugin without uninstalling it?

Even if you uninstall the plugin, you have to reboot to fully remove the plugin/driver.

Link to comment
51 minutes ago, lincolnliu said:

Just installed a Tesla P4 GPU in my Unraid server. I see it in System Devices, but the Nvidia Driver plugin doesn't see it; running Unraid 6.11.5 with the v530.41.03 driver.

Please try to boot with Legacy (CSM) instead of UEFI.

Also make sure that you've enabled Above 4G Decoding and Resizable BAR Support.

 

The error won't tell much about what's going on, but this usually indicates some kind of hardware incompatibility, an issue with the card, or a BIOS setting:

May 26 23:35:51 Anaconda kernel: NVRM: GPU 0000:0c:00.0: RmInitAdapter failed! (0x31:0xffff:2465)
May 26 23:35:51 Anaconda kernel: NVRM: GPU 0000:0c:00.0: rm_init_adapter failed, device minor number 0
May 26 23:35:51 Anaconda kernel: NVRM: GPU 0000:0c:00.0: RmInitAdapter failed! (0x31:0xffff:2465)
May 26 23:35:51 Anaconda kernel: NVRM: GPU 0000:0c:00.0: rm_init_adapter failed, device minor number 0
May 26 23:35:52 Anaconda kernel: NVRM: GPU 0000:0c:00.0: RmInitAdapter failed! (0x31:0xffff:2465)
May 26 23:35:52 Anaconda kernel: NVRM: GPU 0000:0c:00.0: rm_init_adapter failed, device minor number 0
May 26 23:35:52 Anaconda kernel: NVRM: GPU 0000:0c:00.0: RmInitAdapter failed! (0x31:0xffff:2465)
May 26 23:35:52 Anaconda kernel: NVRM: GPU 0000:0c:00.0: rm_init_adapter failed, device minor number 0

 

The driver is loaded properly from what I can see here:

0c:00.0 3D controller [0302]: NVIDIA Corporation GP104GL [Tesla P4] [10de:1bb3] (rev a1)
    Subsystem: NVIDIA Corporation GP104GL [Tesla P4] [10de:11d8]
    Kernel driver in use: nvidia
    Kernel modules: nvidia_drm, nvidia

 

Link to comment

Thanks @ich777 for the super quick response! I updated the BIOS to enable Above 4G Decoding, set Resizable BAR to Auto, and changed to legacy boot. Unfortunately the Nvidia Driver plugin still cannot detect the Tesla P4. Can you take a look at the new diagnostics to see if additional error info is available? Out of curiosity, which diagnostics file did you use to see the NVIDIA driver load errors?

 

 

anaconda-diagnostics-20230527-0957.zip

Link to comment
15 minutes ago, lincolnliu said:

Can you take a look at the new diagnostics to see if additional error info is available?

I don't see anything obvious. Since this is an AMD system, can you try to disable C-States in the BIOS?

Also make sure that you are on the latest BIOS version available for your motherboard (it seems that you are a version behind from what I can see in the diagnostics; you are on version 4402 and the latest available one is 4501: Click <- please double check that it's the correct motherboard).

 

19 minutes ago, lincolnliu said:

Out of curiosity, which diagnostics file did you use to see the NVIDIA driver load errors?

You can see it in the syslog: simply search for "NVRM" and you will get all messages from the driver; right at the bottom you can find the errors for the Tesla P4.
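
On a running server that boils down to:

grep NVRM /var/log/syslog    # all kernel messages from the NVIDIA driver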

 

Oh wait, now that I think about it, you can actually try moving the cards around the PCIe slots. Some motherboards can't handle more than two GPUs at a time, and since you have three installed that could be the issue, but this is only a vague guess.

 

BTW, you can also use your Tesla as the primary GPU (for the Unraid console); even if you then use it for Docker containers, the console use won't hurt performance.

Link to comment

Thanks for the super quick responses! I only have 2 GPUs, and I have already moved the Tesla P4 to the first PCIe slot.

I am going to try disabling C-States, and will also try passing the card through to my Windows 11 VM to see if I can get it working there, to rule out a hardware issue. Will report back.

 

I am not sure what you mean by using the P4 as the console GPU, because it doesn't have any I/O ports - which is why I put in a separate GPU for the BIOS.

Edited by lincolnliu
Link to comment
