[Plugin] Nvidia-Driver


ich777


36 minutes ago, frodr said:

What are chances that the Nvidia Driver plugin will work on Nvidia Tesla K80? I plan to use it for transcoding.

Maybe it will work, but I really can't tell for sure since it's not officially listed. I've got reports of Tesla P4 cards working with the desktop driver, and even some K series cards. You can try it, and if it doesn't work, please post your Diagnostics and I will take a look into it.

 

But in general the plugin only supports consumer cards.

  • Thanks 1
Link to comment

Better question: is it worth it? The K80 is a power hog with a very early encoding chip, takes up 2 PCIe slots, and may or may not work. A second-hand Quadro P400 for $50 USD will sip power and likely work better. For a little more money, you get better performance from a Quadro T400.

  • Like 1
Link to comment
9 minutes ago, sohailoo said:

Will uninstalling the plugin also uninstall the drivers, or is there something else I should do to remove them?

Reboot afterwards; this should also be shown in the plugin removal window.
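
For reference, the CLI route would look something like this (a sketch; it assumes Unraid's built-in plugin manager command, so double-check it on your version before relying on it):

   plugin remove nvidia-driver.plg
   # reboot afterwards so the driver packages are actually gone from the running system
   reboot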

Link to comment
1 hour ago, ConnerVT said:

Better question: is it worth it? The K80 is a power hog with a very early encoding chip, takes up 2 PCIe slots, and may or may not work. A second-hand Quadro P400 for $50 USD will sip power and likely work better. For a little more money, you get better performance from a Quadro T400.

 

Good question. I was thinking of removing the Quadro RTX 4000 and selling it; that Quadro card is more modern. The K80 is 2 x 2496 cores. Yeah, I think I'll bury this idea.

  • Like 1
Link to comment

Hello, not sure what happened on my end, but I haven't been able to successfully download the updated driver. I've been stuck on 520.56.06 for a bit. I keep the window open, so it keeps "Trying to redownload the Nvidia Driver v530.41.03", and I can keep it open for days; no dice.

Any way to manually download that driver and update it via the CLI instead of this plugin? To be clear, all of my Nvidia dockers & VMs work great; love that I can do several streams and use it concurrently across multiple containers/VMs!

 

Thank you.

 

**Edit** I tried just deleting it from /config/plugins/nvidia-driver/packages/5.19.17 and letting it re-download. It quickly downloaded the same file, but it's still stuck on the same "Downloading... This could take some time..." window.

 

Edited by OneMeanRabbit
Link to comment
2 hours ago, OneMeanRabbit said:

but I haven't been able to successfully download the updated driver.  I've been stuck on 520.56.06 for a bit

What kind of firewall/router are you using? Do you run anything like PiHole or AdGuard for ad blocking on your network?
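
If you want to rule that out, you can test name resolution and HTTPS reachability from the Unraid terminal; a quick sketch (GitHub is only an illustrative host here, check the plugin settings page or its .plg file for the real download URL):

   nslookup github.com
   curl -sI https://github.com | head -n 1   # expect an HTTP 200 if nothing is filtering the connection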

 

2 hours ago, OneMeanRabbit said:

love that I can do several streams and use it concurrently across multiple containers/VMs!

But you can't do that at the same time; the card can be used by your containers or passed through to a VM, not both at once.

 

Please note, a newer driver won't do anything in terms of transcoding speed...

 

2 hours ago, OneMeanRabbit said:

**Edit** I tried just deleting it from /config/plugins/nvidia-driver/packages/5.19.17 and letting it re-download. It quickly downloaded the same file, but it's still stuck on the same "Downloading... This could take some time..." window.

Please post your Diagnostics, and don't delete anything from the plugin directory, so that I can see what's going on.

Link to comment
20 hours ago, ich777 said:

I will look into that, but I may drop some features from the plugin anyway.

 

Can you try to uninstall the plugin, reboot, reinstall the plugin and see if that helps?

I uninstalled the plugin, rebooted, reinstalled the plugin (it installed driver 530.41.03 during install), rebooted again, tried to set the production branch option, and got the error again:

---Can't find Nvidia Driver vlatest_prb for your Kernel v5.19.17 falling back to latest Nvidia Driver v530.41.03---

 

Using the manual driver selection, I am able to successfully install the correct driver, 525.116.03.

 

Link to comment

I apparently posted this in the wrong place (I wasn't going to assume it was an issue with this plugin), so here it is again, as requested...

Hey folks,

I have a strange issue. I have 2 GPUs installed in my system, a Quadro P600 and a GTX 1050, and both are recognised by the Nvidia Driver plugin. I have the correct ID set up for the P600 in both the TDARR and Plex docker containers, but they both still insist on using the 1050.

Any Ideas?
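
(Adding a sanity check here for anyone debugging the same thing; the container name is just an example:)

   docker exec plex nvidia-smi -L   # lists the GPU(s) this container can actually see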

 

PlexSetup.jpg

Nvidia Driver.jpg

SMI.jpg

gandalf-diagnostics-20230505-2131.zip

Link to comment
13 hours ago, NeoDude said:

Any Ideas?

Since when has this been happening?

Have you tried assigning the other UUID yet? (Although this shouldn't make a difference, since the UUIDs are correct from what I can see in the Diagnostics.)

 

I would also recommend that you check your VFIO bindings, since the system tries to bind 04:00 (which is your GTX 1050) but fails because the hardware/vendor ID doesn't match (/boot/config/vfio-pci.cfg, see the sketch after the log):

Processing 0000:04:00.0 1002:67df
Error: Vendor:Device 1002:67df not found at 0000:04:00.0, unable to bind device
---
Processing 0000:04:00.1 1002:aaf0
Error: Vendor:Device 1002:aaf0 not found at 0000:04:00.1, unable to bind device
---

 

Can you please also post the docker run command?

 

I would strongly recommend that you disable Privileged mode. Why did you even enable it? It's a huge security risk and may even be causing this issue.

Link to comment

I've deleted the unneeded VFIO bindings. They weren't checked in the GUI, so I don't know why they were in there. I have also disabled Privileged mode (that was a recent change, to see if it made a difference). After a reboot, Plex is now using the correct GPU, but TDARR is not. Here's the docker run for TDARR...

 

   docker run
  -d
  --name='tdarr'
  --net='br0.50'
  --ip='172.16.50.250'
  --cpuset-cpus='2,3,4,5,18,19,20,21'
  -e TZ="Europe/London"
  -e HOST_OS="Unraid"
  -e HOST_HOSTNAME="Gandalf"
  -e HOST_CONTAINERNAME="tdarr"
  -e 'serverIP'='172.16.50.250'
  -e 'TCP_PORT_8266'='8266'
  -e 'TCP_PORT_8265'='8265'
  -e 'PUID'='99'
  -e 'PGID'='100'
  -e 'internalNode'='true'
  -e 'NVIDIA_VISIBLE DEVICES'='GPU-04dd732e-60ad-a070-80b2-a0c4f284a9c1'
  -e 'NVIDIA_DRIVER_CAPABILITIES'='all'
  -e 'nodeIP'='0.0.0.0'
  -e 'nodeID'='Gandalf'
  -e 'TCP_PORT_8264'='8264'
  -l net.unraid.docker.managed=dockerman
  -l net.unraid.docker.webui='http://[IP]:[PORT:8265]'
  -l net.unraid.docker.icon='https://raw.githubusercontent.com/selfhosters/unRAID-CA-templates/master/templates/img/tdarr.png'
  -v '/mnt/user/appdata/tdarr/server':'/app/server':'rw'
  -v '/mnt/user/appdata/tdarr/configs':'/app/configs':'rw'
  -v '/mnt/user/appdata/tdarr/logs':'/app/logs':'rw'
  -v '/mnt/user0/media/':'/media':'rw'
  -v '/mnt/cache/appdata/tdarr/temp/':'/temp':'rw'
  --runtime=nvidia 'haveagitgat/tdarr_acc:dev'
2f5017a5896ff9f586419bf25d1a736256d750b6e2e8c97a2fb2f96b22597c2a

 

Link to comment
1 hour ago, NeoDude said:

After a reboot, Plex is now using the correct GPU, but TDARR is not. Here's the Docker Run for TDARR...

I'm not entirely sure, but if TDARR changes an environment variable on container start to use all GPUs, that could be the reason. This is only a guess and I don't know if that's the case here, but something like that could be causing the issue.

Link to comment
On 5/6/2023 at 2:38 PM, ich777 said:

I'm not entirely sure, but if TDARR changes an environment variable on container start to use all GPUs, that could be the reason. This is only a guess and I don't know if that's the case here, but something like that could be causing the issue.


Think I found the issue. There was a missing underscore in the "NVIDIA_VISIBLE_DEVICES" key. Not sure if this is the default in the container or something I accidentally did myself, probably the latter :P
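
For anyone who lands here with the same symptom, a quick sanity check (a sketch; the UUID shown is the one from the posts above):

   nvidia-smi -L   # lists every GPU with its UUID
   # the variable name must be exact, including both underscores:
   -e NVIDIA_VISIBLE_DEVICES='GPU-04dd732e-60ad-a070-80b2-a0c4f284a9c1'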

  • Like 1
Link to comment

Having issues recently since the last driver update. For some reason my GPU falls off the bus. I've tried reseating it multiple times, cleaning the connector (no riser), updating the motherboard BIOS, replacing the CMOS battery, and uninstalling and reinstalling the nvidia-driver plugin, including clearing the plugin kernel folder and then downloading and updating the driver again. When I reboot it sometimes seems to connect; other times I start to get:

 

Atlantis kernel: NVRM: GPU 0000:2b:00.0: RmInitAdapter failed! (0x22:0x56:760)
Atlantis kernel: NVRM: GPU 0000:2b:00.0: rm_init_adapter failed, device minor number 0

 

I've attached the diagnostics as well as the nvidia-bug-report.log, if that's of any help. There are also no power-saving options for the PCIe slots in my motherboard's BIOS. I have Re-Size BAR Support, Above 4G Decoding, and IOMMU enabled in the BIOS, and I disabled Secure Boot to see if that would help. At the end of the syslog is the stack trace from the GPU driver issue.

atlantis-diagnostics-20230508-2114.zip nvidia-bug-report.log.gz

Link to comment
4 hours ago, alexdac99 said:

Having issues recently since the last driver update.

What do you mean exactly: the driver itself or the plugin?

Have you tried rolling back to the previous driver that was working?

 

Have you changed anything in terms of hardware, or did you maybe update your BIOS or change some settings in the BIOS?

Have you tried disabling C-States in the BIOS?

 

4 hours ago, alexdac99 said:

including clearing the plugin kernel folder and then downloading and updating the driver again.

Please don't do this manually unless advised, since the plugin does this on its own.

Link to comment

After upgrading from an rtx750 to a 1660s, the card can no longer be driven normally, and it still doesn't work after reinstalling the plugin or using a fresh system.

 

My Unraid has been running under ESXi for two years. I have always used the rtx750 for Emby; after upgrading to the 1660s the card can no longer be driven. nvidia-smi crashes outright and the command freezes, so it is also impossible to generate the bug report, because it relies on nvidia-smi.

 

 

Quote

May 9 16:43:36 DELL-UNRAID root: plugin: nvidia-driver.plg installed
May 9 16:44:13 DELL-UNRAID kernel: ACPI Warning: \_SB.PCI0.PE50.S1F0._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20220331/nsarguments-61)
May 9 16:44:13 DELL-UNRAID kernel: NVRM: GPU 0000:0b:00.0: RmInitAdapter failed! (0x26:0x56:1474)
May 9 16:44:13 DELL-UNRAID kernel: BUG: unable to handle page fault for address: 0000000000004628
May 9 16:44:13 DELL-UNRAID kernel: NVRM: GPU 0000:0b:00.0: rm_init_adapter failed, device minor number 0
May 9 16:44:13 DELL-UNRAID kernel: #PF: supervisor read access in kernel mode
May 9 16:44:13 DELL-UNRAID kernel: #PF: error_code(0x0000) - not-present page
May 9 16:44:13 DELL-UNRAID kernel: PGD 0 P4D 0
May 9 16:44:13 DELL-UNRAID kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
May 9 16:44:13 DELL-UNRAID kernel: CPU: 41 PID: 8145 Comm: nv_queue Tainted: P O 5.19.17-Unraid #2
May 9 16:44:13 DELL-UNRAID kernel: Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.17369862.B64.2012240522 12/24/2020
May 9 16:44:13 DELL-UNRAID kernel: RIP: 0010:_nv010655rm+0x3b/0xb0 [nvidia]
May 9 16:44:13 DELL-UNRAID kernel: Code: 93 be dd 02 48 8b bb 68 01 00 00 e8 7f cc 5a 00 85 c0 74 0b 48 83 c4 08 5b 41 5c c3 0f 1f 00 44 89 e7 e8 98 67 b6 ff 48 89 c7 <8b> 80 28 46 00 00 83 f8 01 74 38 80 bf 71 07 00 00 00 74 49 80 bf
May 9 16:44:13 DELL-UNRAID kernel: RSP: 0018:ffffc9000ed77de0 EFLAGS: 00010246
May 9 16:44:13 DELL-UNRAID kernel: RAX: 0000000000000000 RBX: ffff888109322808 RCX: 0000000000000000
May 9 16:44:13 DELL-UNRAID kernel: RDX: ffffc9000ea11008 RSI: 0000000000000000 RDI: 0000000000000000
May 9 16:44:13 DELL-UNRAID kernel: RBP: ffff88a0c971b000 R08: 0000000000000000 R09: 0000000000000000
May 9 16:44:13 DELL-UNRAID kernel: R10: ffffc9000ed77e88 R11: 00000000645a07dd R12: 0000000000000000
May 9 16:44:13 DELL-UNRAID kernel: R13: ffff88a0c9718000 R14: ffff88810bf2d650 R15: ffff88a0c8890000
May 9 16:44:13 DELL-UNRAID kernel: FS: 0000000000000000(0000) GS:ffff88c0bd9c0000(0000) knlGS:0000000000000000
May 9 16:44:13 DELL-UNRAID kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 9 16:44:13 DELL-UNRAID kernel: CR2: 0000000000004628 CR3: 000000001320a002 CR4: 00000000007706e0
May 9 16:44:13 DELL-UNRAID kernel: PKRU: 55555554
May 9 16:44:13 DELL-UNRAID kernel: Call Trace:
May 9 16:44:13 DELL-UNRAID kernel: <TASK>
May 9 16:44:13 DELL-UNRAID kernel: ? rm_execute_work_item+0xed/0x130 [nvidia]
May 9 16:44:13 DELL-UNRAID kernel: ? _raw_q_schedule+0x69/0x69 [nvidia]
May 9 16:44:13 DELL-UNRAID kernel: ? os_execute_work_item+0x48/0x88 [nvidia]
May 9 16:44:13 DELL-UNRAID kernel: ? _main_loop+0xf1/0x115 [nvidia]
May 9 16:44:13 DELL-UNRAID kernel: ? kthread+0xe4/0xef
May 9 16:44:13 DELL-UNRAID kernel: ? kthread_complete_and_exit+0x1b/0x1b
May 9 16:44:13 DELL-UNRAID kernel: ? ret_from_fork+0x1f/0x30
May 9 16:44:13 DELL-UNRAID kernel: </TASK>
May 9 16:44:13 DELL-UNRAID kernel: Modules linked in: nvidia(PO) drm backlight xt_MASQUERADE ip6table_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nfsd auth_rpcgss oid_registry lockd grace sunrpc md_mod tcp_diag inet_diag efivarfs ip6table_filter ip6_tables iptable_filter ip_tables x_tables 8021q garp mrp bridge stp llc bonding tls crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd rapl intel_cstate intel_uncore nvme i2c_piix4 intel_agp input_leds led_class e1000e intel_gtt nvme_core vmxnet3 i2c_core vmw_pvscsi agpgart ata_piix ac button unix
May 9 16:44:13 DELL-UNRAID kernel: CR2: 0000000000004628
May 9 16:44:13 DELL-UNRAID kernel: ---[ end trace 0000000000000000 ]---
May 9 16:44:13 DELL-UNRAID kernel: RIP: 0010:_nv010655rm+0x3b/0xb0 [nvidia]
May 9 16:44:13 DELL-UNRAID kernel: Code: 93 be dd 02 48 8b bb 68 01 00 00 e8 7f cc 5a 00 85 c0 74 0b 48 83 c4 08 5b 41 5c c3 0f 1f 00 44 89 e7 e8 98 67 b6 ff 48 89 c7 <8b> 80 28 46 00 00 83 f8 01 74 38 80 bf 71 07 00 00 00 74 49 80 bf
May 9 16:44:13 DELL-UNRAID kernel: RSP: 0018:ffffc9000ed77de0 EFLAGS: 00010246
May 9 16:44:13 DELL-UNRAID kernel: RAX: 0000000000000000 RBX: ffff888109322808 RCX: 0000000000000000
May 9 16:44:13 DELL-UNRAID kernel: RDX: ffffc9000ea11008 RSI: 0000000000000000 RDI: 0000000000000000
May 9 16:44:13 DELL-UNRAID kernel: RBP: ffff88a0c971b000 R08: 0000000000000000 R09: 0000000000000000
May 9 16:44:13 DELL-UNRAID kernel: R10: ffffc9000ed77e88 R11: 00000000645a07dd R12: 0000000000000000
May 9 16:44:13 DELL-UNRAID kernel: R13: ffff88a0c9718000 R14: ffff88810bf2d650 R15: ffff88a0c8890000
May 9 16:44:13 DELL-UNRAID kernel: FS: 0000000000000000(0000) GS:ffff88c0bd9c0000(0000) knlGS:0000000000000000
May 9 16:44:13 DELL-UNRAID kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 9 16:44:13 DELL-UNRAID kernel: CR2: 0000000000004628 CR3: 00000020d4442006 CR4: 00000000007706e0
May 9 16:44:13 DELL-UNRAID kernel: PKRU: 55555554

 

 


Link to comment
1 hour ago, ceozero said:

After upgrading from an rtx750 to a 1660s, the card can no longer be driven normally, and it still doesn't work after reinstalling the plugin or using a fresh system.

Have you tried installing the legacy driver version 470.xx, rebooting, and seeing if that version works for you?

 

Can you please post the full Diagnostics?

 

When virtualizing via ESXi, I think you have to add a parameter so that newer cards will work, if I'm not mistaken.

Link to comment
2 hours ago, ich777 said:

Have you tried installing the legacy driver version 470.xx, rebooting, and seeing if that version works for you?

 

Can you please post the full Diagnostics?

 

When virtualizing via ESXi, I think you have to add a parameter so that newer cards will work, if I'm not mistaken.

Sorry, I can't generate Diagnostics with the plugin installed, because the diagnostics collection calls nvidia-smi, and anything that calls nvidia-smi just hangs, so the Diagnostics never finish.
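
(A sketch that might still capture partial output: wrap the call in coreutils' timeout so a hung nvidia-smi doesn't block the shell:)

   timeout 10 nvidia-smi || echo "nvidia-smi hung or failed"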

dell-unraid-diagnostics-20230509-2128.zip

Link to comment
7 hours ago, ceozero said:

After upgrading from an rtx750 to a 1660s, the card can no longer be driven normally, and it still doesn't work after reinstalling the plugin or using a fresh system.

As said above, I think you have to add some parameters in ESXi so that newer cards work; this is not the first time I've seen this.

 

Also, please remove these two files from your modprobe.d folder: blacklist-nouveau.conf and nvidia.conf (why did you even create those?)

 

This is the part that shows why it doesn't work:

0b:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU116 [GeForce GTX 1660 SUPER] [10de:21c4] (rev a1)
    DeviceName: pciPassthru0
    Subsystem: Shenzhen Colorful Yugong Technology and Development Co. TU116 [GeForce GTX 1660 SUPER] [7377:0000]
    Kernel driver in use: nvidia
    Kernel modules: nvidia_drm, nvidia

ESXi does something to the device names and I think that was the issue the last time too.

 

The last user who reported this switched from ESXi to a native installation, and it worked after that. Please also keep in mind that some hardware combinations with the 1660s series are complicated... at least on Linux.

 

Please look into the Virtualizing Unraid subforum; there you should find what you have to add to the ESXi host so that the card works without a kernel panic.

 

EDIT: I found that:

 

But there is nothing I can do about that, since you are virtualizing Unraid and that can cause issues. I also really can't help here since I'm not familiar with ESXi (just search for ESXi in this thread and you will find a few posts, some even stating that drivers >=460 don't work with ESXi).

Link to comment
12 hours ago, ich777 said:

What do you mean exactly: the driver itself or the plugin?

Have you tried rolling back to the previous driver that was working?

 

Have you changed anything in terms of hardware, or did you maybe update your BIOS or change some settings in the BIOS?

Have you tried disabling C-States in the BIOS?

 

Please don't do this manually unless advised, since the plugin does this on its own.

 

The driver itself, I think. I tried rolling back to v525.116.04 (production branch) and still had the same issues. I tried disabling Global C-States today, as well as adding pcie_aspm=off to the boot options (some people recommended that on the Nvidia forums), and the card still falls off the bus, even while the GPU is just idle.
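
For reference, on Unraid those kernel options go on the append line in /boot/syslinux/syslinux.cfg rather than a grub config (a sketch of a stock boot entry with the flag added):

   label Unraid OS
     menu default
     kernel /bzimage
     append pcie_aspm=off initrd=/bzroot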

 

I also changed nothing in terms of hardware, and I only updated my BIOS after the issue started, to see if that would help. I didn't change any BIOS settings until after the issue started either.

 

EDIT: I just noticed that the VBIOS on my 3060 Ti isn't updated to support Resizable BAR. I'm trying to update it with the tool from Nvidia, but when I rmmod the kernel drivers, the nvidia module gets reinitialized; I'm guessing by the plugin?

Edited by alexdac99
Forgot to include details about hardware; Also found info about VBIOS
Link to comment
1 hour ago, alexdac99 said:

The driver itself, I think. I tried rolling back to v525.116.04 (production branch) and still had the same issues.

Then roll back to an even older version; the drivers themselves won't change, and each package is always the same since they are precompiled.

 

If it was working before, something in terms of hardware or BIOS must have changed, since otherwise the older driver would work as expected; I think you get the point of what I'm trying to say.

Maybe the power supply is dying or something similar. If you haven't changed anything, you've rolled back the driver, and you still have the exact same issue with the previous driver, then something must be different now...

 

1 hour ago, alexdac99 said:

but when I rmmod the kernel drivers, the nvidia module gets reinitialized; I'm guessing by the plugin?

Can you please describe what you are doing? Are you updating the VBIOS on Unraid itself?

No, the plugin won't reinitialize the driver. I would recommend closing all browser windows when doing that, because the card gets reinitialized if anything calls nvidia-smi: the GPU Statistics plugin, for example, calls nvidia-smi frequently, and visiting the plugin page also calls nvidia-smi once.
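
If you need the module unloaded for a VBIOS flash, a sketch of the usual sequence (run from a local console with the WebGUI and anything that calls nvidia-smi closed; the exact module list can vary by driver version):

   lsmod | grep nvidia                 # see what is loaded and the use counts
   fuser -v /dev/nvidia* 2>/dev/null   # anything still holding the device nodes?
   rmmod nvidia_drm nvidia_modeset nvidia_uvm nvidia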

  • Like 1
Link to comment
On 3/28/2023 at 7:53 PM, teslap4 said:

Signed up just so I could post here for others' reference, since I've seen multiple people ask with conflicting answers:
I was able to get a Tesla P4 working with docker containers such as Plex using the default Nvidia driver plugin. I was running Unraid 6.11.5 and using driver 530.30.02.

 

One issue I had is that the official Plex docker was not properly accepting hardware transcoding. I was able to remedy this by switching to the linuxserver docker for Plex. I was also able to do hardware transcoding with the P4 in Jellyfin.

 

If you are having issues getting hardware transcoding working in Plex, test with a different Plex docker.

 

Hey there. Can you tell me what model and VBIOS version you have on that Tesla P4 card? I recently picked one up and it isn't working. I'm thinking it might be tied to certain VBIOS versions working and others being patched.

Edited by OrneryTaurus
  • Like 2
Link to comment
12 minutes ago, OrneryTaurus said:

Hey there. Can you tell me what model and VBIOS version you have on that Tesla P4 card? I recently picked one up and it isn't working. I'm thinking it might be tied to certain VBIOS versions working and others being patched.

Can you please post your Diagnostics so that I can tell what's going on? Usually it's not tied to the VBIOS.

Link to comment
