[Plugin] Nvidia-Driver


ich777


23 minutes ago, Big-G said:

Following the success above, we went to CA and installed the plugin again:


I also cleared it and tried this manually with the amended script while having the second GPU vfio-bound, and it worked 100%:

mkdir -p /tmp/nvdrv && cd /tmp/nvdrv

wget https://github.com/ich777/unraid-nvidia-driver/releases/download/5.15.43-Unraid/nvidia-515.43.04-5.15.43-Unraid-1.txz

installpkg nvidia-515.43.04-5.15.43-Unraid-1.txz

depmod -a

modprobe nvidia

rm -rf /tmp/nvdrv

nvidia-smi

Sat May 28 11:00:42 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.43.04    Driver Version: 515.43.04    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   44C    P0    39W / 180W |      0MiB /  8192MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
 

The previous issue seems to be a mismatch between the downloaded driver version and the running kernel; changing to the correct version fixes it.
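Since the failure was a kernel/driver version mismatch, the download URL can be derived from the running kernel instead of being hard-coded. A minimal sketch, assuming the release naming scheme shown in the wget line above (`<kernel>/nvidia-<driver>-<kernel>-1.txz`); the helper name is mine:

```shell
#!/bin/bash
# Build the package URL from a kernel string and a driver version,
# following the naming scheme of the release URL quoted above (assumption).
nvdrv_url() {
  local kernel="$1" driver="$2"
  echo "https://github.com/ich777/unraid-nvidia-driver/releases/download/${kernel}/nvidia-${driver}-${kernel}-1.txz"
}

# On the server itself, "$(uname -r)" would supply the kernel string,
# so the downloaded package can never disagree with the running kernel.
nvdrv_url "5.15.43-Unraid" "515.43.04"
```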

 

Link to comment
15 minutes ago, Big-G said:

I also cleared it and tried this manually with the amended script while having the second GPU vfio-bound, and it worked 100%. […] The previous issue seems to be a mismatch between the downloaded driver version and the running kernel; changing to the correct version fixes it.

 

Further to the previous comments: once the plugin is installed successfully, leaving it on Latest causes failures on server reboot; however, if you pin it to a specific version, it works without issue after a reboot:


Link to comment
1 hour ago, Big-G said:

Further to the previous comments: once the plugin is installed successfully, leaving it on Latest causes failures on server reboot; however, if you pin it to a specific version, it works without issue after a reboot:

I will test this after the weekend.

Have to partially redo the download routine in the plugin anyways.

Link to comment
7 hours ago, minhquan07 said:

Hello, I've upgraded to 6.10.2 and for some reason the nvidia driver isn't detecting the 2 GPUs I have, although they are detected under System Devices. Any advice on how to fix this?

Something crashed really hard but I think it's related to IOMMU:

May 29 09:28:02 Tower kernel: DMAR: ERROR: DMA PTE for vPFN 0x8f4d8 already set (to 8f4d8003 not cbc784803)

 

Please go to your BIOS, disable IOMMU and/or Intel VT-d, and see if that fixes the issue.

Link to comment

I have an Nvidia card:

 

VGA compatible controller: NVIDIA Corporation GP108 [GeForce GT 1030] (rev a1)

 

I have tried the different drivers, the latest v515.43.03, v470.94, and v495.46, but I am getting the same results.

I have also confirmed on the Nvidia website that this card is supported by these drivers.

 

With an uninstall and reinstall of the nvidia plugin, I am seeing this during the install:

 

May 30 14:45:31 UNRAID root: plugin: running: anonymous
May 30 14:45:31 UNRAID root: plugin: creating: /boot/config/plugins/nvidia-driver/nvidia-driver-2022.05.06.txz - downloading from URL https://github.com/ich777/unraid-nvidia-driver/raw/master/packages/nvidia-driver-2022.05.06.txz
May 30 14:45:42 UNRAID root: plugin: checking: /boot/config/plugins/nvidia-driver/nvidia-driver-2022.05.06.txz - MD5
May 30 14:45:42 UNRAID root: plugin: running: /boot/config/plugins/nvidia-driver/nvidia-driver-2022.05.06.txz
May 30 14:45:42 UNRAID root: plugin: creating: /usr/local/emhttp/plugins/nvidia-driver/README.md - from INLINE content
May 30 14:45:42 UNRAID root: plugin: running: anonymous
May 30 14:48:23 UNRAID kernel: nvidia: loading out-of-tree module taints kernel.
May 30 14:48:23 UNRAID kernel: nvidia: module license 'NVIDIA' taints kernel.
May 30 14:48:23 UNRAID kernel: Disabling lock debugging due to kernel taint
May 30 14:48:23 UNRAID kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 245
May 30 14:48:23 UNRAID kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
May 30 14:48:23 UNRAID kernel: NVRM: BAR0 is 0M @ 0x0 (PCI:0000:03:00.0)
May 30 14:48:23 UNRAID kernel: nvidia: probe of 0000:03:00.0 failed with error -1
May 30 14:48:23 UNRAID kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).
May 30 14:48:23 UNRAID kernel: NVRM: None of the NVIDIA devices were initialized.
May 30 14:48:23 UNRAID kernel: nvidia-nvlink: Unregistered the Nvlink Core, major device number 245

 

Looking at the BIOS, I have confirmed that the Above 4G feature is enabled, and I have also confirmed that CSM is disabled.

 

I am just trying to understand what is going on. I believe the PCI I/O region issue is the source of the problem, but I am not finding anything that makes sense of it, so I am wondering if anyone has seen this before.

 

 

nvidia-05-30-2022-cli.png

2022-05-30-pluginshot.png

bryunraid-diagnostics-20220530-1542.zip

Link to comment

Is it possible to change the download directory for the driver file? Currently my server is running off a 1GB flash drive, and when there is an update it almost completely fills my OS drive. I'm guessing it cannot be on the array, as it would need to load before that starts. Could it be loaded to a different flash drive?

Link to comment
38 minutes ago, Alex.vision said:

Is it possible to change the download directory for the driver file? Currently my server is running off a 1GB flash drive, and when there is an update it almost completely fills my OS drive. I'm guessing it cannot be on the array, as it would need to load before that starts. Could it be loaded to a different flash drive?

No, that's not possible, because everything else is mounted only after the array has started; in fact, the Nvidia driver is installed even before Unassigned Devices.

 

I would recommend that, if you upgrade to a newer Unraid version, you move the files from /boot/previous to another directory to free up some space.

 

Also keep in mind that the Nvidia driver keeps getting bigger and bigger; just for comparison:

Driver version 455.45.01 = 113MB

Driver version 515.48.07 = 246MB
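Given those sizes, a quick way to see whether a boot device can even hold an update is to compare free space against the package size. A sketch (the helper is hypothetical, and the 260000 KiB threshold is an assumption derived from the 246MB figure above):

```shell
#!/bin/bash
# Return success if the given mount point has at least the requested
# free space in KiB; on Unraid you would check /boot.
space_ok() {
  local mount="$1" need_kb="$2" free_kb
  free_kb=$(df -Pk "$mount" | awk 'NR==2 {print $4}')
  [ "$free_kb" -ge "$need_kb" ]
}

space_ok /boot 260000 2>/dev/null || echo "not enough free space on /boot for a driver update"
```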

 

Isn't it possible for you to upgrade to a bigger USB boot device? In your case, I can highly recommend, for example, a Transcend JetFlash 600 32GB.

 

I test nearly every Nvidia driver, and for these tests I use such a USB flash drive, plus a SanDisk Cruzer Blade 16GB on a second server, and they have never failed me (knock on wood... :D ).

Link to comment
24 minutes ago, ich777 said:

No, that's not possible, because everything else is mounted only after the array has started; in fact, the Nvidia driver is installed even before Unassigned Devices.

 

I thought this might be the case, figured I would ask.

 

Yes, I can upgrade to a newer, larger flash drive. I have three other Unraid servers that have bigger drives; this one has just been old faithful. I think it was my first thumb drive, back when I first started with Unraid in January of 2011.

 

Thanks for the info @ich777 I appreciate it. I will look at grabbing one of those flash drives soon.

Link to comment

 

Hello everyone,

 

First of all, I would like to apologize for my bad English. Secondly, it is a great plugin. Unfortunately, I have installed a GeForce 710 in my unRAID server and there is no suitable driver for it in the list. Is it possible to use an older driver here so that the Nvidia driver and the GPU Stats plugin can be used together?

Link to comment
1 hour ago, KinGSiZ3 said:

GeForce 710 in my unRAID server; unfortunately there is no suitable driver for it in the list. Is it possible to use an older driver here so that the Nvidia driver and the GPU Stats plugin can be used together?

There is for sure a suitable driver for the GeForce 710 in the list.

What Unraid version are you on?

If you are still on Unraid 6.9.2, I would recommend that you upgrade to 6.10.2.

 

You simply can't see Nvidia driver version 470.94 on Unraid 6.9.2 because I only list the last 8 drivers that are available; as said above, I would highly recommend upgrading to Unraid 6.10.2.

Link to comment
On 5/30/2022 at 11:45 PM, whynot88 said:

I have an Nvidia GeForce GT 1030 [GP108] and have tried the latest v515.43.03, v470.94, and v495.46 drivers, but I am getting the same results. […]

May 30 14:48:23 UNRAID kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
May 30 14:48:23 UNRAID kernel: NVRM: BAR0 is 0M @ 0x0 (PCI:0000:03:00.0)
May 30 14:48:23 UNRAID kernel: nvidia: probe of 0000:03:00.0 failed with error -1

[…]

bryunraid-diagnostics-20220530-1542.zip 132.65 kB · 0 downloads

In the terminal what do you get when you execute "modprobe nvidia"?

Link to comment
On 5/31/2022 at 12:45 AM, whynot88 said:

I am just trying to understand what is going on. I believe the PCI I/O region issue is the source of the problem, but I am not finding anything that makes sense of it, so I am wondering if anyone has seen this before.

Somehow I completely missed your Diagnostics... :P

 

Is the issue resolved now? Or rather, have you tried to enable CSM, boot in Legacy mode, and enable Resizable BAR & Above 4G Support in your BIOS?

 

Also make sure that you are on the latest BIOS version.

Link to comment

 Hi,

 

I noticed that my containers which have GPUs passed through fail to restart after a CA auto-update this morning (15/06/22). One container which didn't stop during the night "appeared" to be working but was not transcoding a queued video. When I restarted this container manually I got the same error below.

 

The error I get when starting any GPU-passed container is a "Bad parameter" pop-up. When I edit the container config to see the compile error I get this:

docker: Error response from daemon: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: open failed: /proc/sys/kernel/overflowuid: permission denied: unknown.

I have the nvidia-driver plugin installed and have reinstalled most versions of the drivers, from v470.94 to v515.43.04. nvidia-smi shows my GPUs and the driver version correctly. 

I am not sure what the cause of the error is, whether it's Docker, the nvidia plugin, or something else.
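For reference, GPU passthrough on Unraid boils down to a handful of extra docker parameters that every affected template is effectively passing. A sketch (the environment variable names follow the plugin's setup instructions; the helper function itself is hypothetical):

```shell
#!/bin/bash
# Assemble the extra "docker run" flags for Nvidia passthrough on Unraid.
# gpu_id is "all" or a specific GPU UUID as reported by nvidia-smi -L.
gpu_docker_args() {
  local gpu_id="$1"
  echo "--runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=${gpu_id} -e NVIDIA_DRIVER_CAPABILITIES=all"
}

# Example invocation (image name is just an illustration):
#   docker run -d $(gpu_docker_args all) jellyfin/jellyfin
gpu_docker_args all
```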

 

I noticed this issue occurring since upgrading the unraid OS from 6.9 to 6.10 (which included a nvidia driver update).

 

I have tried:

  • a fresh docker.img, with previously configured container templates redownloaded.
  • restoring the appdata folder from a backup.
  • checking GPU usage: only the containers are set up to use the GPU (I even tried with a single container), and I made sure not to use the OS GUI mode. Currently nothing is using the GPU, according to nvidia-smi.
  • downgrading the OS from 6.10.2 to 6.10.1.

 

Here is the diagnostics file (I had to use the command line, as the GUI method doesn't do anything).

Any help will be greatly appreciated. 

 

dailynas-diagnostics-20220615-0956.zip

Link to comment
53 minutes ago, raf802 said:

I noticed this issue occurring since upgrading the unraid OS from 6.9 to 6.10 (which included a nvidia driver update).

What are the contents of nvidia-powersave/script?

 

53 minutes ago, raf802 said:

The error I get when starting any GPU-passed container is a "Bad parameter" pop-up. When I edit the container config to see the compile error I get this:

Is only one container affected, or are multiple containers affected?

If it is only one container, please post the container configuration (Docker template).

 

53 minutes ago, raf802 said:

I noticed this issue occurring since upgrading the unraid OS from 6.9 to 6.10 (which included a nvidia driver update).

I test every stable driver release for every new Unraid version and I have had no issues so far on 6.10.0+ with Jellyfin utilizing NVENC.

 

Please open up a terminal and post the output from:

cat /proc/sys/kernel/overflowuid

(screenshot preferred)
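For context, the kernel's overflow UID defaults to 65534 ("nobody"), and the error message suggests the container runtime could not read that file. A sketch of the check being requested, with the comparison split out as a hypothetical helper so it can be tested on its own:

```shell
#!/bin/bash
# 65534 is the kernel's default overflowuid ("nobody"); anything else,
# or an unreadable file, would be suspicious given the
# nvidia-container-cli "permission denied" error above.
overflowuid_ok() {
  [ "$1" = "65534" ]
}

overflowuid_ok "$(cat /proc/sys/kernel/overflowuid 2>/dev/null)" \
  && echo "overflowuid looks normal" \
  || echo "overflowuid is unexpected or unreadable"
```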

 

 

What also catches my eye are the modifications to the go file; do you still need them?

Link to comment

Thank you for your quick reply!

 

12 minutes ago, ich777 said:

What are the contents from: nvidia-powersave/script?

This is the script from Spaceinvader One; I added it after the issue originally occurred. Here are the contents:

#!/bin/bash
# check for driver
command -v nvidia-smi &> /dev/null || { echo >&2 "nvidia driver is not installed you will need to install this from community applications ... exiting."; exit 1; }
echo "Nvidia drivers are installed"
echo
echo "I can see these Nvidia gpus in your server"
echo
nvidia-smi --list-gpus
echo
echo "-------------------------------------------------------------"
# set persistence mode for gpus (when persistence mode is enabled the NVIDIA driver remains loaded even when there are no active processes,
# which stops modules being unloaded and therefore stops settings changing when modules are reloaded)
nvidia-smi --persistence-mode=1
# query power state
gpu_pstate=$(nvidia-smi --query-gpu="pstate" --format=csv,noheader)
# query running processes by pid using gpu
gpupid=$(nvidia-smi --query-compute-apps="pid" --format=csv,noheader)
# check if pstate is P0 and no processes are running by checking if any pid is in the string
if [ "$gpu_pstate" == "P0" ] && [ -z "$gpupid" ]; then
    echo "No pid in string so no processes are running"
    fuser -kv /dev/nvidia*
    echo "Power state is"
    echo "$gpu_pstate" # show what power state is
else
    echo "Power state is"
    echo "$gpu_pstate" # show what power state is
fi
echo
echo "-------------------------------------------------------------"
echo
echo "Power draw is now"
# check current power draw of GPU
nvidia-smi --query-gpu=power.draw --format=csv
exit
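The decision the script makes can be isolated into a tiny predicate: GPU handles are only killed when the card reports P0 and no compute process owns it. A sketch (the function name is mine, not part of the original script):

```shell
#!/bin/bash
# True only when the GPU is in performance state P0 with no compute
# processes attached, i.e. the only case in which the script above
# runs "fuser -kv /dev/nvidia*".
safe_to_reset() {
  local pstate="$1" pids="$2"
  [ "$pstate" = "P0" ] && [ -z "$pids" ]
}

# In the script above, the two inputs come from:
#   nvidia-smi --query-gpu=pstate --format=csv,noheader
#   nvidia-smi --query-compute-apps=pid --format=csv,noheader
```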

 

12 minutes ago, ich777 said:

Is only one container affected or there are multiple container affected?

If it is only one container, please post the container configuration (Docker template).

It affects any container that I try to pass a GPU to. I have tried other Plex/Emby/Jellyfin containers which I had never used before, and they too had the issue.

 

12 minutes ago, ich777 said:

Please open up a terminal and post the output from:

cat /proc/sys/kernel/overflowuid

(screenshot preferred)

Please see attached 

cat.JPG

Link to comment
11 minutes ago, raf802 said:

This is the script from Spaceinvader One; I added it after the issue originally occurred.

Please disable this script for now, since it includes an outdated command: nvidia-persistenced is now a dedicated application, and nvidia-smi --persistence-mode should not be used anymore because it will be deprecated in the future.

 

13 minutes ago, raf802 said:

It affects any container that I try to pass a GPU to. I have tried other plex/emby/jellyfin containers which I have never used and they too had the issue. 

Please try to remove that script, reboot your server and try again.

 

13 minutes ago, raf802 said:

Please see attached 

The output seems fine to me…

 

Please keep me updated; it could be that I won't answer very quickly in the next two days…

Link to comment
21 minutes ago, ich777 said:

Please disable this script for now, since it includes an outdated command: nvidia-persistenced is now a dedicated application, and nvidia-smi --persistence-mode should not be used anymore because it will be deprecated in the future. […]

Thank you again, Ich.

I disabled that powersave script and everything is working after the reboot. 

The GPUs are idling in power state P0.

When I was getting the issue, their power state was P8.

 

I will upgrade the OS and drivers to see if the GPUs keep working.

 

Link to comment
