[Plugin] Nvidia-Driver


ich777

Hi all,

A few months ago my transcoding GPU stopped working properly: about 6 months after I set up transcoding, the Nvidia driver stopped recognizing it.

I'm now getting the error message: "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running."

I have tried everything: turning Docker on and off, changing BIOS settings, reinstalling the plugin, testing a bunch of different driver versions, and even downgrading the Unraid build, but nothing seems to fix it.

Unraid is able to see the device as a Quadro P1000 (which is correct), and booting directly into Windows or passing the card through works fine. I can benchmark it without any issues.
Just to make sure that the GPU wasn't the issue, I also tried a different one and it behaved in exactly the same way.

I'm attaching part of the logs from while the machine was booting. It's clear that the driver is trying to communicate with the GPU, but the connection doesn't work.

I'm not sure what else to try at this point, please give me a hand.

Thanks.

 

 

Aug 28 03:13:25 Tower kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 241

Aug 28 03:13:25 Tower kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:

Aug 28 03:13:25 Tower kernel: NVRM: BAR0 is 0M @ 0x0 (PCI:0000:02:00.0)

Aug 28 03:13:25 Tower kernel: nvidia: probe of 0000:02:00.0 failed with error -1

Aug 28 03:13:25 Tower kernel: NVRM: The NVIDIA probe routine was not called for 1 device(s).

Aug 28 03:13:25 Tower kernel: NVRM: This can occur when a driver such as:

Aug 28 03:13:25 Tower kernel: NVRM: nouveau, rivafb, nvidiafb or rivatv

Aug 28 03:13:25 Tower kernel: NVRM: was loaded and obtained ownership of the NVIDIA device(s).

Aug 28 03:13:25 Tower kernel: NVRM: Try unloading the conflicting kernel module (and/or

Aug 28 03:13:25 Tower kernel: NVRM: reconfigure your kernel without the conflicting

Aug 28 03:13:25 Tower kernel: NVRM: driver(s)), then try loading the NVIDIA kernel module

Aug 28 03:13:25 Tower kernel: NVRM: again.

Aug 28 03:13:25 Tower kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).

Aug 28 03:13:25 Tower kernel: NVRM: None of the NVIDIA devices were initialized.

Aug 28 03:13:25 Tower kernel: nvidia-nvlink: Unregistered Nvlink Core, major device number 241

Aug 28 03:13:25 Tower kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 241

Aug 28 03:13:25 Tower kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:

Aug 28 03:13:25 Tower kernel: NVRM: BAR0 is 0M @ 0x0 (PCI:0000:02:00.0)

Aug 28 03:13:25 Tower kernel: nvidia: probe of 0000:02:00.0 failed with error -1

Aug 28 03:13:25 Tower kernel: NVRM: The NVIDIA probe routine was not called for 1 device(s).

Aug 28 03:13:25 Tower kernel: NVRM: This can occur when a driver such as:

Aug 28 03:13:25 Tower kernel: NVRM: nouveau, rivafb, nvidiafb or rivatv

Aug 28 03:13:25 Tower kernel: NVRM: was loaded and obtained ownership of the NVIDIA device(s).

Aug 28 03:13:25 Tower kernel: NVRM: Try unloading the conflicting kernel module (and/or

Aug 28 03:13:25 Tower kernel: NVRM: reconfigure your kernel without the conflicting

Aug 28 03:13:25 Tower kernel: NVRM: driver(s)), then try loading the NVIDIA kernel module

Aug 28 03:13:25 Tower kernel: NVRM: again.

Aug 28 03:13:25 Tower kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).

Aug 28 03:13:25 Tower kernel: NVRM: None of the NVIDIA devices were initialized.

Aug 28 03:13:25 Tower kernel: nvidia-nvlink: Unregistered Nvlink Core, major device number 241

Aug 28 03:14:11 Tower webGUI: Successful login user root from 192.168.1.101

Aug 28 03:14:17 Tower kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 241

Aug 28 03:14:17 Tower kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:

Aug 28 03:14:17 Tower kernel: NVRM: BAR0 is 0M @ 0x0 (PCI:0000:02:00.0)

Aug 28 03:14:17 Tower kernel: nvidia: probe of 0000:02:00.0 failed with error -1

Aug 28 03:14:17 Tower kernel: NVRM: The NVIDIA probe routine was not called for 1 device(s).

Aug 28 03:14:17 Tower kernel: NVRM: This can occur when a driver such as:

Aug 28 03:14:17 Tower kernel: NVRM: nouveau, rivafb, nvidiafb or rivatv

Aug 28 03:14:17 Tower kernel: NVRM: was loaded and obtained ownership of the NVIDIA device(s).

Aug 28 03:14:17 Tower kernel: NVRM: Try unloading the conflicting kernel module (and/or

Aug 28 03:14:17 Tower kernel: NVRM: reconfigure your kernel without the conflicting

Aug 28 03:14:17 Tower kernel: NVRM: driver(s)), then try loading the NVIDIA kernel module

Aug 28 03:14:17 Tower kernel: NVRM: again.

Aug 28 03:14:17 Tower kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).

Aug 28 03:14:17 Tower kernel: NVRM: None of the NVIDIA devices were initialized.

Aug 28 03:14:17 Tower kernel: nvidia-nvlink: Unregistered Nvlink Core, major device number 241

Aug 28 03:14:17 Tower kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 241

Aug 28 03:14:17 Tower kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:

Aug 28 03:14:17 Tower kernel: NVRM: BAR0 is 0M @ 0x0 (PCI:0000:02:00.0)

Aug 28 03:14:17 Tower kernel: nvidia: probe of 0000:02:00.0 failed with error -1

Aug 28 03:14:17 Tower kernel: NVRM: The NVIDIA probe routine was not called for 1 device(s).

Aug 28 03:14:17 Tower kernel: NVRM: This can occur when a driver such as:

Aug 28 03:14:17 Tower kernel: NVRM: nouveau, rivafb, nvidiafb or rivatv

Aug 28 03:14:17 Tower kernel: NVRM: was loaded and obtained ownership of the NVIDIA device(s).

Aug 28 03:14:17 Tower kernel: NVRM: Try unloading the conflicting kernel module (and/or

Aug 28 03:14:17 Tower kernel: NVRM: reconfigure your kernel without the conflicting

Aug 28 03:14:17 Tower kernel: NVRM: driver(s)), then try loading the NVIDIA kernel module

Aug 28 03:14:17 Tower kernel: NVRM: again.

Aug 28 03:14:17 Tower kernel: NVRM: The NVIDIA probe routine failed for 1 device(s).

Aug 28 03:14:17 Tower kernel: NVRM: None of the NVIDIA devices were initialized.

Aug 28 03:14:17 Tower kernel: nvidia-nvlink: Unregistered Nvlink Core, major device number 241

Link to comment
5 minutes ago, Andrea Nizzola said:

Thank you for the reply, here is the diagnostic

Have you maybe updated the BIOS at some point?

 

Please make sure that you have the following BIOS options enabled:

  • Above 4G decoding
  • Resizable BAR Support

 

This issue is usually caused by something misconfigured in the BIOS or by a hardware incompatibility.

Have you tried reseating the card or putting it in another PCIe slot yet?
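
For anyone who wants to check their own syslog for the same symptom, here is a minimal Python sketch. It assumes the log lines have the same format as the ones posted above; the function name `find_unassigned_bar0` is made up for this example and is not part of the plugin or the driver. It only reports PCI addresses whose BAR0 is logged as `0M @ 0x0`, i.e. the device never got a memory region assigned.

```python
import re

# Matches the NVRM line that reports the size and address of BAR0,
# in the format seen in the syslog excerpt above.
BAR0_RE = re.compile(
    r"NVRM: BAR0 is (?P<size>\d+)M @ (?P<addr>0x[0-9a-fA-F]+) "
    r"\(PCI:(?P<pci>[0-9a-fA-F:.]+)\)"
)


def find_unassigned_bar0(log_lines):
    """Return the sorted PCI addresses whose BAR0 is reported as 0M @ 0x0."""
    bad = set()
    for line in log_lines:
        m = BAR0_RE.search(line)
        if m and int(m.group("size")) == 0 and int(m.group("addr"), 16) == 0:
            bad.add(m.group("pci"))
    return sorted(bad)


if __name__ == "__main__":
    sample = [
        "Aug 28 03:13:25 Tower kernel: NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:",
        "Aug 28 03:13:25 Tower kernel: NVRM: BAR0 is 0M @ 0x0 (PCI:0000:02:00.0)",
    ]
    print(find_unassigned_bar0(sample))
```

If the list is not empty, the BIOS options and slot checks mentioned above are the first things to look at, since the driver cannot do anything with a device that has no valid BAR0.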

Link to comment

I am getting the error: "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running." I need to be able to install driver 377.83 for my Quadro 2000. How do I install this driver?

Link to comment

Another option I have is to use an Intel i5-9400 9th Gen cpu which has integrated video. I have one of these and a compatible MB which is currently not in use. Would this work with the Nvidia driver and unRAID?

Edited by Erich
Link to comment
17 minutes ago, Erich said:

Another option I have is to use an Intel i5-9400 9th Gen cpu which has integrated video. I have one of these and a compatible MB which is currently not in use. Would this work with the Nvidia driver and unRAID?

I don't think I understand...

 

You want to use the iGPU for transcoding but want to use the Nvidia Driver for that?

If you want to use the iGPU for transcoding, that will of course work, but you don't need the Nvidia Driver for that. Instead, I would recommend that you install the Intel GPU TOP plugin and pass through the device /dev/dri to the container (this is how it's done on Intel platforms).

 

However, if you need help with that, I would recommend that you create a dedicated post in the corresponding support thread.

Link to comment

Yes, I was thinking that instead of buying another GPU, I could just use my Intel i5-9400. It would save me from having to purchase another card. I'm not sure which one would be easier to set up though: installing a new GPU and the Nvidia driver, or using the iGPU and passing through the device. Thanks for clarifying that I won't need the Nvidia driver if I go the iGPU route.

Edited by Erich
Link to comment
58 minutes ago, Erich said:

installing a new GPU and the Nvidia driver, or using the iGPU and pass through the device.

You simply pass through the Device (not Path) /dev/dri to the container and you are done, it's as easy as that (and of course enable HW transcoding in the application which you are using). ;)
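
To illustrate what "passing through the device" means in plain Docker terms, here is a small Python sketch. It is written under the assumption that a Device entry in the container template ends up as Docker's `--device` option; the image name `example/media-server` and the helper names are hypothetical and only there for illustration. It checks that /dev/dri exists on the host and builds the matching `docker run` command without executing it.

```python
import os


def dri_device_args(dri_path="/dev/dri"):
    """Return the docker arguments that pass the DRI device directory
    through to a container, or an empty list if it does not exist."""
    if not os.path.isdir(dri_path):
        return []
    return ["--device", f"{dri_path}:{dri_path}"]


def build_docker_command(image, name, dri_path="/dev/dri"):
    """Build (but do not run) a docker command line with the iGPU passed through."""
    return ["docker", "run", "-d", "--name", name, *dri_device_args(dri_path), image]


if __name__ == "__main__":
    # "example/media-server" is a placeholder image name, not a real recommendation.
    print(" ".join(build_docker_command("example/media-server", "media")))
```

If /dev/dri is missing on the host, no `--device` argument is added, which usually means the iGPU driver is not loaded and the container would fall back to software transcoding.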

Link to comment

Ok, Thank you for your help. I am really new to unRAID and Linux. In fact, I have never used either before, so this is all quite a learning curve. I really don't know what it means or how to "pass through the Device to the container". I just know that this enables the application to use the hardware for transcoding. Just not sure exactly what steps I need to do. :)

Link to comment
9 minutes ago, Erich said:

Ok, Thank you for your help. I am really new to unRAID and Linux. In fact, I have never used either before, so this is all quite a learning curve. I really don't know what it means or how to "pass through the Device to the container". I just know that this enables the application to use the hardware for transcoding. Just not sure exactly what steps I need to do. :)

Please have a look at this video. As said above, this is the Nvidia support thread and not the one for Intel iGPUs:

 

Link to comment

Good morning, I'm having an issue with this. I've gone through it step by step a few times and searched for the issue, but maybe I'm missing something. I'm on 6.10.3 and have a 780 Ti card. Looking through the list, I confirmed that the latest supported driver for this card is v470.129.06. I've selected it and tried to install it multiple times, but it doesn't seem to work.

 

I've confirmed the card is functional in another machine, and it does appear in the hardware list on my server. The only thing I can come up with is that either I've missed something or the card is incompatible with my mobo. Attached are diagnostics and a screengrab of what I'm seeing.

 

Main reason for doing this is to use hardware encoding on Plex. The dual Xeon CPU setup starts to struggle transcoding 4K and bottlenecks on any more than one so I was hoping the Nvidia encoding would help. 

 

Any help would be appreciated. 

Capture.PNG

urhomeserver-diagnostics-20220903-1108.zip

Link to comment
20 minutes ago, Leadin said:

but maybe I'm missing something

I think you can see why it's not working:

04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK110B [GeForce GTX 780 Ti] [10de:100a] (rev a1)
    Subsystem: eVga.com. Corp. GK110B [GeForce GTX 780 Ti] [3842:2884]
    Kernel driver in use: vfio-pci
    Kernel modules: nvidia_drm, nvidia

 

I don't know why, but you have bound the card to VFIO in your syslinux.cfg.
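
A quick way to spot this yourself is to look at the "Kernel driver in use" line for the card. The following Python sketch parses text in the format shown above (it looks like the output of `lspci -nnk`, which is an assumption on my side) and maps every PCI slot to the driver bound to it. The function name `drivers_in_use` is made up for this example.

```python
import re


def drivers_in_use(lspci_output):
    """Map each PCI slot to the kernel driver currently bound to it."""
    result = {}
    slot = None
    for line in lspci_output.splitlines():
        if line and not line[0].isspace():
            # An unindented line starts a new device; the first field is the slot.
            slot = line.split()[0]
        else:
            m = re.match(r"\s*Kernel driver in use:\s*(\S+)", line)
            if m and slot:
                result[slot] = m.group(1)
    return result


if __name__ == "__main__":
    sample = (
        "04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GK110B "
        "[GeForce GTX 780 Ti] [10de:100a] (rev a1)\n"
        "    Subsystem: eVga.com. Corp. GK110B [GeForce GTX 780 Ti] [3842:2884]\n"
        "    Kernel driver in use: vfio-pci\n"
        "    Kernel modules: nvidia_drm, nvidia\n"
    )
    for slot, driver in drivers_in_use(sample).items():
        note = " (reserved for VM passthrough, not usable by the Nvidia driver)" if driver == "vfio-pci" else ""
        print(f"{slot}: {driver}{note}")
```

A card that shows `vfio-pci` here is reserved for VM passthrough, so the Nvidia driver on the host cannot claim it until the binding is removed and the server is rebooted.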

 

22 minutes ago, Leadin said:

780ti

This is a pretty bad choice if you want to use it for Plex HW transcoding...

This card isn't even capable of h265 (HEVC).

 

I would rather recommend that you look into something like an NVIDIA T400. This card is Turing based, has a maximum power draw of 35W, doesn't need external power, and you can get it for about $120 brand new.

Link to comment
4 hours ago, Leadin said:

Good morning, I'm having an issue with this. I've gone through it step by step a few times and searched for the issue, but maybe I'm missing something. I'm on 6.10.3 and have a 780 Ti card. Looking through the list, I confirmed that the latest supported driver for this card is v470.129.06. I've selected it and tried to install it multiple times, but it doesn't seem to work.

 

I've confirmed the card is functional in another machine, and it does appear in the hardware list on my server. The only thing I can come up with is that either I've missed something or the card is incompatible with my mobo. Attached are diagnostics and a screengrab of what I'm seeing.

 

Main reason for doing this is to use hardware encoding on Plex. The dual Xeon CPU setup starts to struggle transcoding 4K and bottlenecks on any more than one so I was hoping the Nvidia encoding would help. 

 

Any help would be appreciated. 

Capture.PNG

urhomeserver-diagnostics-20220903-1108.zip

Same issue here since installing 6.11.0-RCx.

Link to comment
49 minutes ago, PSYCHOPATHiO said:

also the whole system froze & had to do a hard reset.

But then it's not the same issue as the one above. The user above also had to use the legacy driver to make the card visible to the system, because that card is simply "old".

 

You are on the newest driver version from what I see in your log and the card is initialized fine:

Sep  4 08:10:59 Vidas kernel: nvidia: loading out-of-tree module taints kernel.
Sep  4 08:10:59 Vidas kernel: nvidia: module license 'NVIDIA' taints kernel.
Sep  4 08:10:59 Vidas kernel: Disabling lock debugging due to kernel taint
Sep  4 08:10:59 Vidas kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 242
Sep  4 08:10:59 Vidas kernel: nvidia 0000:0b:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
Sep  4 08:10:59 Vidas kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  515.65.01  Wed Jul 20 14:00:58 UTC 2022
Sep  4 08:10:59 Vidas kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  515.65.01  Wed Jul 20 13:43:59 UTC 2022
Sep  4 08:10:59 Vidas kernel: [drm] [nvidia-drm] [GPU ID 0x00000b00] Loading driver
Sep  4 08:10:59 Vidas kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:0b:00.0 on minor 0

 

What is the output from nvidia-smi? Please also post a screenshot from the plugin page.

Do you have C-States enabled in your BIOS? Which BIOS version are you on? Also make sure that you enable Above 4G Decoding.

 

Is this the first time you are installing the driver?

Link to comment
41 minutes ago, ich777 said:

 

What is the output from nvidia-smi? Please also post a screenshot from the plugin page.

 

Do you have C-States enabled in your BIOS? Which BIOS version are you on? Also make sure that you enable Above 4G Decoding.

 

Is this the first time you are installing the driver?

 

C-States are disabled. Above 4G Decoding is disabled by default, but I will enable it later today.

I have had the plugin installed since it was released a long time ago, but I tried removing the plugin and driver several times and reinstalling them, with multiple reboots.

 

nvidia-smi shows "No devices were found". The message changed from the one in the post above; now it just shows "No devices were found".

 

firefox_7Q81dzjz57.thumb.png.45c5caa363e7511c0bfed722b4f0d275.png

Link to comment
