[Plugin] Nvidia-Driver


ich777


On 2/28/2023 at 1:57 AM, ich777 said:

First of all, I would strongly recommend that you remove the nvidia.conf file in the modprobe.d directory; you are not using the open source driver module...

 

Have you tried booting with CSM (Legacy) instead of UEFI mode yet? Please also make sure that you are on the latest BIOS version and that you've enabled Above 4G Decoding and Resizable BAR Support in the BIOS.

I would also try switching from MACVLAN to IPVLAN in the Docker settings first.

Is this only happening with Tdarr? If yes, IIRC it is nothing new that Tdarr can crash your server, but TBH I really don't know whether that has been fixed already.

Have you tried disabling Tdarr yet to see whether this also happens with Emby/Jellyfin/Plex?

Can you test the card in another system? Install the drivers and put some 3D load on it for about 10 minutes; something like FurMark should do the job just fine.

I have removed the nvidia.conf file, I was planning to use the open source drivers in the future. I didn't think it would hurt leaving it in there.

 

I just changed it to CSM boot and it's working well. I am on the latest BIOS version and have already enabled Above 4G Decoding and Resizable BAR.

 

When the crashes happened, TDARR was not transcoding any videos and was not using the GPU at all. I just had another crash now, the TDARR docker was launched but not doing anything. The GPU utilization was 0. Should I still try to stop the TDARR docker and test?

 

It would be difficult, but I could try in a while from now. I was hoping I could pass the GPU through to a Windows 10 VM and stress test it that way. The load on the GPU has been relatively low: the most I've ever seen is 20% utilization and maybe 80 W. Dumb question, but even if this was caused by the GPU and the GPU went haywire, wouldn't at most the Docker containers using the GPU crash, like TDARR? Why would the kernel crash when it doesn't (or shouldn't) rely on the GPU, since it's running in headless mode?

 

During the crash that happened again, I stopped all docker containers and attempted to do a soft shutdown using the `powerdown` command and the `poweroff` command but nothing happened after waiting 15 minutes. I'm not good at Linux so I had to do a hard reboot.

I have a new diagnostics log for you, taken right after the crash.

dragon-diagnostics-20230302-0307.zip

Link to comment
25 minutes ago, LimesKey said:

When the crashes happened, TDARR was not transcoding any videos and was not using the GPU at all. I just had another crash now, the TDARR docker was launched but not doing anything. The GPU utilization was 0. Should I still try to stop the TDARR docker and test?

Then the crash is most certainly not related to the Nvidia Driver.

 

Have you changed MACVLAN to IPVLAN in your Docker settings yet? MACVLAN is notorious for crashing servers with similar kernel panics.

 

27 minutes ago, LimesKey said:

I have removed the nvidia.conf file, I was planning to use the open source drivers in the future. I didn't think it would hurt leaving it in there.

May I ask why? The open source module has no real benefit...

Link to comment

I'm currently having trouble trying to update the NVIDIA driver. I am running the newest plugin, 2023.03.02.

When I select "latest" or manually select "v530.30.02" it still acts like it is trying to download v470.141.03 and spits out "Can't download Nvidia Driver Package v470.141.03". Is this a plugin error or something with my server?

Screenshot 2023-03-02 101652.png

Screenshot 2023-03-02 101612.png

Link to comment
1 hour ago, tusculumgolfer said:

When I select "latest" or manually select "v530.30.02" it still acts like it is trying to download v470.141.03 and spits out "Can't download Nvidia Driver Package v470.141.03". Is this a plugin error or something with my server?

I can't reproduce this on my end, I've even tried to downgrade to the driver version 470.141.03 and then set it again to latest and it properly downloaded the latest (v530.30.02) driver version.
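For reference, the version strings in this thread (470.141.03, 525.89.02, 530.30.02) order correctly when each dot-separated field is compared as a number. The snippet below is only a sketch of that comparison and is not the plugin's actual selection logic; the names `parse_version` and `pick_latest` are made up for illustration.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn 'v530.30.02' or '530.30.02' into (530, 30, 2)."""
    return tuple(int(part) for part in version.lstrip("v").split("."))


def pick_latest(versions: list[str]) -> str:
    """Return the highest version from a list of version strings."""
    if not versions:
        raise ValueError("no versions given")
    # Compare numerically field by field, not as plain strings.
    return max(versions, key=parse_version)


if __name__ == "__main__":
    # Versions mentioned in this thread.
    available = ["v470.141.03", "v525.89.02", "v530.30.02"]
    print(pick_latest(available))
```

If "latest" resolves to something other than the highest entry, the stored selection rather than the comparison is the more likely culprit.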

Link to comment

@ich777 Thanks a lot for your answer before. I was able to compile and load the vGPU guest drivers, and they're working fine. However, I have some questions:

1) Can I post the scripts and steps to make the packages in this thread? (the user must provide the driver package from nvidia, the scripts do not download any drivers)

2) I'm using your plugin to load and manage the GUID of the GPU for docker, however I'm having an issue with the GUID: it resets every time the vm does a cold boot, which is an inconvenience, and I'm not sure about the cause of this.
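
To confirm that the ID really changes across a cold boot, it can be recorded before and after. The following is a minimal sketch, assuming the "GUID" here is the GPU UUID that `nvidia-smi` reports; the helper names `parse_gpu_list` and `query_gpus` are hypothetical and not part of the plugin.

```python
import subprocess


def parse_gpu_list(output: str) -> dict[str, str]:
    """Map GPU index -> UUID from 'index, uuid' CSV lines."""
    gpus = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        index, uuid = (field.strip() for field in line.split(",", 1))
        gpus[index] = uuid
    return gpus


def query_gpus() -> dict[str, str]:
    """Ask nvidia-smi for the index and UUID of every GPU it can see."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,uuid", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_list(result.stdout)


if __name__ == "__main__":
    for index, uuid in query_gpus().items():
        print(f"GPU {index}: {uuid}")
```

Running it once before shutting the VM down and once after the cold boot, then comparing the two outputs, shows whether the UUID itself changed or only the stored setting was lost.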

Edited by midi
Link to comment
25 minutes ago, midi said:

1) Can I post the scripts and steps to make the packages in this thread? (the user must provide the driver package from nvidia, the scripts do not download any drivers)

I don't think anyone can come after you for posting the scripts to build the non-public drivers, as long as you also explain how to legally obtain a license and the vGPU binaries that are needed to do so.

 

26 minutes ago, midi said:

2) I'm using your plugin to load and manage the GUID of the GPU for docker, however I'm having an issue with the GUID: it resets every time the vm does a cold boot, which is an inconvenience, and I'm not sure about the cause of this.

Sorry, I really can't help here because I simply don't have a card that is capable of vGPU and I don't even know how this all works.

Link to comment
1 minute ago, ich777 said:

I don't think anyone can come after you for posting the scripts to build the non-public drivers, as long as you also explain how to legally obtain a license and the vGPU binaries that are needed to do so.

Yeah, technically those who are into this already know where to get the drivers legally (the Nvidia Enterprise Portal). The goal of the scripts is just to patch and repack the drivers, not to explain how to get them, but I will point out how to get those.

 

3 minutes ago, ich777 said:

Sorry, I really can't help here because I simply don't have a card that is capable of vGPU and I don't even know how this all works.

There is a community project that unlocks the drivers to enable vGPU capabilities on consumer graphics cards, if you want to test. It also works with natively supported cards (like Tesla P4/P40...), where it just lifts some Nvidia limitations. Yes, it is against Nvidia's EULA, which is why I'm not sharing any files here, but anyone can get the files legally from Nvidia's Enterprise Portal.

Link to comment
8 hours ago, ich777 said:

I can't reproduce this on my end, I've even tried to downgrade to the driver version 470.141.03 and then set it again to latest and it properly downloaded the latest (v530.30.02) driver version.

I've tried restarting the server and it is still doing the same thing. Do you have any suggestions for a resolution?

Link to comment
3 hours ago, tusculumgolfer said:

I've tried restarting the server and it is still doing the same thing. Do you have any suggestions for a resolution?

Sounds a bit complicated but can you do the following:

  1. Uninstall the Plugin
  2. Reboot
  3. Install the Plugin again
  4. Reboot once more or restart the Docker service

I've now tried it again on another machine running 6.11.5 and it is working there as it should, too.

I will ask another user whether he can test that for me too.

Link to comment
5 hours ago, midi said:

There is a community project that unlocks the drivers to enable vGPU capabilities on consumer graphics cards, if you want to test

No thank you…

 

5 hours ago, midi said:

technically those who are into this already know where to get the drivers legally (the Nvidia Enterprise Portal). The goal of the scripts is just to patch and repack the drivers, not to explain how to get them

Sure thing, I know that too, but redistribution of the drivers is not allowed AFAIK, and that's why I have not created a plugin for that (and I really don't want them to take down my GitHub).

So sharing the scripts here on the forums or even in a Git repo of yours should be fine.

Link to comment
13 hours ago, tusculumgolfer said:

When I select "latest" or manually select "v530.30.02" it still acts like it is trying to download v470.141.03 and spits out "Can't download Nvidia Driver Package v470.141.03". Is this a plugin error or something with my server?

 

I guess it's more likely something wrong with your flash drive ... it's also working fine here.

Just as a simple test run: downgraded to 470, then up to 525, and now 530, all fine.

 

Link to comment
10 hours ago, midi said:

@ich777 Thanks a lot for your answer before. I was able to compile and load the vGPU guest drivers, and they're working fine. However, I have some questions:

1) Can I post the scripts and steps to make the packages in this thread? (the user must provide the driver package from nvidia, the scripts do not download any drivers)

2) I'm using your plugin to load and manage the GUID of the GPU for docker, however I'm having an issue with the GUID: it resets every time the vm does a cold boot, which is an inconvenience, and I'm not sure about the cause of this.

I have the same issues with GUID resets with cold boot as well.

Link to comment
5 hours ago, galloglypg said:

Can someone confirm this will allow the use of one Nvidia card for Docker and one Nvidia card for VMs?

Yes, this is of course possible.

 

5 hours ago, galloglypg said:

Also does this allow the non vm card to run the unraid gui in addition to docker. I am running a ryzen 5900x so no igpu.

Yes, but you may have to switch slots, because you are forced to use the card that shows the BIOS screen and Unraid's console output on boot.

You may also have to change a line in the config (if you only get a blinking cursor in GUI mode), but that is something for later.

Link to comment

Hi,

 

Probably a VERY stupid question! (Sorry).

 

But what is the benefit of keeping the driver up to date if v470.141.03 is working as expected ?

 

I use an Nvidia GTX 1650 for Plex transcoding in Docker.

 

What are the benefits to moving to say v525.89.02 ?

 

Thanks

 

D.

 

 

Link to comment
8 hours ago, BigDanT said:

But what is the benefit of keeping the driver up to date if v470.141.03 is working as expected ?

No benefits at all as long as you are using it for transcoding.

 

8 hours ago, BigDanT said:

What are the benefits to moving to say v525.89.02 ?

No benefits at all as long as you are using it for transcoding.

 

The answer changes a bit if you use it for things other than transcoding, because you get performance improvements for CUDA-accelerated workloads and when using the card with the Steam-Headless container.

Link to comment
On 3/2/2023 at 10:55 PM, ich777 said:

Sounds a bit complicated but can you do the following:

  1. Uninstall the Plugin
  2. Reboot
  3. Install the Plugin again
  4. Reboot once more or restart the Docker service

I've now tried it again on another machine running 6.11.5 and it is working there as it should, too.

I will ask another user whether he can test that for me too.

This worked. Thank you!

Link to comment
9 hours ago, ich777 said:

No benefits at all as long as you are using it for transcoding.

 

No benefits at all as long as you are using it for transcoding.

 

The answer changes a bit if you use it for things other than transcoding, because you get performance improvements for CUDA-accelerated workloads and when using the card with the Steam-Headless container.

Thanks @ich777 keep up the good work !

 

It's good to know I'm not missing out on anything. My instinct is to always keep up with the latest updates, be that Unraid or your drivers. It's comforting to know that in this instance, if it ain't broke, don't update :)

Link to comment

Hey @ich777 I was pointed in your direction for this question: 
I'm currently still running 6.8.2 with the old linuxserver Nvidia build. I'm looking to finally upgrade to the latest stable Unraid; however, in the past, I've always reverted to the vanilla build before upgrading. The old plugin no longer works and is no longer supported, so I am curious whether you would suggest trying to get my hands on a vanilla build of 6.8.2 before attempting to upgrade to 6.11, and whether there are any other gotchas you can think of regarding my current situation?

Thanks

Link to comment

I have been having some issues recently with the plugin. From time to time I come home and find that my server GUI is showing a 500 Internal Server Error. SSH into the server works fine, and all Docker containers and VMs are running, but the web GUI just fails to load. If I SSH in and run "/etc/rc.d/rc.php-fpm restart", the GUI comes back, sometimes for a short time and sometimes for a while, but eventually I find myself back at the 500 Internal Server Error. I have this plugin as well as GPU Statistics by b3rs3rk installed. The other weird thing is that when I do get the GUI back, the Statistics on the main page for my GPU are all blank.

I have an ASUS Strix 1050 Ti installed, along with the latest official driver from the plugin. I only use the GPU for Tdarr transcoding and have had no issues while the GPU is doing that task. The output files also look fine, so I don't think that my GPU is dying. Any thoughts?
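
The manual workaround from this post (restart php-fpm when the GUI answers with a 500) could be scripted roughly as follows. This is only a sketch that automates the workaround and does nothing about the underlying cause; the URL is a placeholder, and the function names `gui_status` and `needs_restart` are made up for illustration. Only the restart command itself is taken from the post.

```python
import subprocess
import urllib.error
import urllib.request

# Restart command exactly as quoted in the post.
RESTART_CMD = ["/etc/rc.d/rc.php-fpm", "restart"]


def gui_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of the web GUI, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses are raised as HTTPError
    except (urllib.error.URLError, OSError):
        return 0  # no HTTP answer at all (connection refused, timeout, ...)


def needs_restart(status: int) -> bool:
    """Only a 5xx answer counts as the failure described in the post."""
    return 500 <= status <= 599


if __name__ == "__main__":
    # Placeholder URL (assumption); replace with the server's own address.
    status = gui_status("http://localhost/")
    if needs_restart(status):
        subprocess.run(RESTART_CMD, check=True)
```

An unreachable GUI (status 0) is deliberately not treated as a reason to restart, since that points to a different problem than the 500 error.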

eos-diagnostics-20230101-1425.zip

Edited by Tithonius
added diagnostics
Link to comment
6 hours ago, Tithonius said:

The other weird thing is that when I do get the GUI back, the Statistics on the main page for my GPU are all blank.

I don't see an Nvidia GPU connected to your system in the attached Diagnostics; I don't even see the Nvidia Driver plugin installed...

The only thing that I see is an Intel iGPU. May I ask why you aren't using the iGPU for HW transcoding?

Link to comment
