
[Plugin] Nvidia-Driver


ich777


32 minutes ago, qw3r7yju4n said:

Any ideas? I expect this was a problem in the past.

Please uninstall the driver, reboot, install the driver from the CA App and reboot again.

 

You are also most likely one of the users who didn't see the line about waiting for the notification that it's safe to reboot when upgrading to the newer Unraid version.

Link to comment
31 minutes ago, ich777 said:

Please uninstall the driver, reboot, install the driver from the CA App and reboot again.

 

You are also most likely one of the users who didn't see the line about waiting for the notification that it's safe to reboot when upgrading to the newer Unraid version.

I cannot uninstall it. It's not there to uninstall in the UI, but it is clearly there according to the error. Is there a manual CLI option?

Link to comment
18 minutes ago, qw3r7yju4n said:

I cannot uninstall it. It's not there to uninstall in the UI, but it is clearly there according to the error. Is there a manual CLI option?

I found a way to remove it. Rebooting soon, thanks for your help.

Link to comment
8 hours ago, qw3r7yju4n said:

I found a way to remove it. Rebooting soon, thanks for your help.

What way? Can you share it with us?

 

If you cannot uninstall it, it is not installed.

Are you sure that it is not in the Plugins Error tab on your Plugins page?

 

Usually, when it's not installed, it is enough to reboot and then install it again.

If it's in the Plugins Error tab, then remove it from there, reboot and install it again.

Link to comment
9 hours ago, ich777 said:

What way? Can you share it with us?

 

If you cannot uninstall it, it is not installed.

Are you sure that it is not in the Plugins Error tab on your Plugins page?

 

Usually, when it's not installed, it is enough to reboot and then install it again.

If it's in the Plugins Error tab, then remove it from there, reboot and install it again.

You're exactly right. It was an errored plugin. I removed it and will reboot when I can. I do not use the driver currently; it is a backup for Intel QSV. Thanks for your help and concern. I will come back to this thread if your solution does not work, but it will be a few days.

Link to comment

Is there a way I can re-prioritize the gpu order in nvidia-smi?

 

~# nvidia-smi
Mon Mar  4 07:50:18 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        On  |   00000000:01:00.0 Off |                  N/A |
|  0%   37C    P2             96W /  350W |    1164MiB /  24576MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA RTX A5000               On  |   00000000:82:00.0 Off |                  Off |
| 30%   33C    P8              6W /  230W |    2435MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+


What I am actually trying to do: I would like to use the A5000 before the 3090 for containers (but I want to be able to use the 3090 when I need to)

I am using this user script to bring power states down hourly, but what I really need is for my containers to prefer the A5000.

Right now I pass `NVIDIA_VISIBLE_DEVICES=all` to each container. If I pass a GPU UUID list with the A5000 first, would that impact priority?

Thanks for letting me pick your brains.

Link to comment
9 minutes ago, paperblankets said:

What I am actually trying to do: I would like to use the A5000 before the 3090 for containers (but I want to be able to use the 3090 when I need to)

AFAIK no since it is tied to the bus number and therefore you would need to physically swap the cards on your motherboard.

 

10 minutes ago, paperblankets said:

I am using this user script to bring power states down hourly

Please don't use that script.

Simply put that in your go file:

nvidia-persistenced

 

As long as you don't use any of the two cards in a VM this will do the same and you only have to issue it once.
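
For anyone who wants to script that step, here is a minimal sketch that appends the command to the go file only if it is not already present. It assumes the go file is at the standard Unraid location `/boot/config/go`; the function name `add_to_go_file` is made up for illustration and is not part of the plugin.

```shell
#!/bin/bash
# Sketch: append a command to the Unraid go file only if it is not already there.
# GO_FILE defaults to /boot/config/go (assumed standard location);
# add_to_go_file is a hypothetical helper name.
GO_FILE="${GO_FILE:-/boot/config/go}"

add_to_go_file() {
  local line="$1" file="${2:-$GO_FILE}"
  # -x matches the whole line, -F treats the pattern as a fixed string
  if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
    printf '%s\n' "$line" >> "$file"
  fi
}
```

Calling `add_to_go_file nvidia-persistenced` once should be enough. The go file only runs at boot, so you would still have to run `nvidia-persistenced` by hand once or reboot for it to take effect.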

Link to comment
1 hour ago, ich777 said:

AFAIK no since it is tied to the bus number and therefore you would need to physically swap the cards on your motherboard.

 

Please don't use that script.

Simply put that in your go file:

nvidia-persistenced

 

As long as you don't use any of the two cards in a VM this will do the same and you only have to issue it once.

I removed the old user scripts. I also found weird behavior in the unraid frigate template, setting `NVIDIA_VISIBLE_DEVICES` is not respected by the image, but using `CUDA_VISIBLE_DEVICES` works as expected.
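
In case it helps someone else, this is a rough sketch of how the value for `CUDA_VISIBLE_DEVICES` could be built with the preferred card first. CUDA applications enumerate devices in the order given in that variable, so the first UUID becomes device 0 inside the container; whether a given image actually prefers device 0 depends on the application. The function name `preferred_gpu_list` is hypothetical, and the input is assumed to be the output of `nvidia-smi --query-gpu=name,uuid --format=csv,noheader`.

```shell
#!/bin/bash
# Sketch: build a comma-separated GPU UUID list with a preferred card first.
# Input (stdin): lines as printed by
#   nvidia-smi --query-gpu=name,uuid --format=csv,noheader
# e.g. "NVIDIA RTX A5000, GPU-xxxx". preferred_gpu_list is a hypothetical name.
preferred_gpu_list() {
  local preferred="$1"
  awk -F', ' -v want="$preferred" '
    # Cards whose name contains the preferred string go to the front
    index($1, want) { first = (first == "" ? $2 : first "," $2); next }
    # All other cards keep their original order behind them
    { rest = (rest == "" ? $2 : rest "," $2) }
    END {
      sep = (first != "" && rest != "") ? "," : ""
      print first sep rest
    }
  '
}
```

Usage would look like `nvidia-smi --query-gpu=name,uuid --format=csv,noheader | preferred_gpu_list A5000`, with the result pasted into the container's `CUDA_VISIBLE_DEVICES` variable.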

 

image.png.2faeeb22185a42c028aa94ae2e8433e1.png

Edited by paperblankets
Link to comment
22 minutes ago, paperblankets said:

`NVIDIA_VISIBLE_DEVICES` is not respected by the image, but using `CUDA_VISIBLE_DEVICES` works as expected.

You should report that in the Frigate thread and not here, since I don't know anything about the image and, strictly speaking, it is not tied to the Nvidia Driver plugin.

I also don't know what kind of workload Frigate is putting on the cards or what other variables you are using in the template.

 

It would also be beneficial if you included your Diagnostics, since I really can't tell what's going on on your server, nor do I know which driver you are using, and so on.

Link to comment
1 minute ago, alitech said:

Something wrong with configuration? unable to update

Please remove the Nvidia Driver plugin, reboot, install it from the CA App and reboot or restart the Docker service.

 

What did you do? Did you upgrade your Unraid version or did the plugin notify you about a new driver update?

Link to comment

Many thanks for your reply.

 

No, this happened by itself; it was a problem before the upgrade. I just wanted to get an answer on this.

If I remove this, will the settings I made elsewhere (I don't remember where) remain intact? Do I just reinstall and everything should be back and running as before?

Link to comment
1 hour ago, alitech said:

No, this happened by itself; it was a problem before the upgrade. I just wanted to get an answer on this.

Please post your Diagnostics before doing all of that and also the file: /boot/config/plugins/nvidia-driver/settings.cfg

 

1 hour ago, alitech said:

If I remove this, will the settings I made elsewhere (I don't remember where) remain intact? Do I just reinstall and everything should be back and running as before?

I assume you are referring to Docker containers?

Yes. The containers won't be able to start after the first reboot, while the driver is removed; after the second restart you should be up and running again just fine.

 

 

May I ask why you clicked the Download button in the first place? Was something not working, or did you simply try to change the driver?

Link to comment

I'm having an issue where, consistently, when I use the Nvidia drivers for anything like an LLM or even Hashcat for testing, Unraid crashes and I have to restart.
 

root@Tower:~# nvidia-smi -l
Wed Mar  6 11:26:35 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.40.07              Driver Version: 550.40.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:65:00.0 Off |                  N/A |
|  0%   62C    P0            119W /  420W |       0MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+


 

tower-diagnostics-20240306-1128.zip

Link to comment
1 hour ago, rainmanjam said:

Unraid crashes and I have to restart.

Are you sure that your power supply is up to the task? Most of the time this has to do with the power supply.

Do you have a display connected to actually see what's going on? It would be really cool if you had a display connected and could take a picture of what's happening on screen when it actually crashes.

 

I assume your machine is not automatically restarting?

 

I see nothing obvious in your syslog; the driver loads fine and it should in theory be working.

Link to comment
30 minutes ago, ich777 said:

Are you sure that your power supply is up to the task? Most of the time this has to do with the power supply.

Do you have a display connected to actually see what's going on? It would be really cool if you had a display connected and could take a picture of what's happening on screen when it actually crashes.

 

I assume your machine is not automatically restarting?

 

I see nothing obvious from your syslog, the driver loads fine and it should in theory be working.

No display connected so I can't see what's going on. I can connect one up.
I tailed the syslog via SSH but nothing stands out before crashing.
I have a 1200w power supply.
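
When the syslog shows nothing before a crash, one thing that might help is logging the GPU power draw to a file, so the last samples before the crash can be looked at afterwards. This is only a sketch: the function name `log_gpu_power` and the `NVIDIA_SMI` override variable are made up for illustration, and the log should go to a path that survives a reboot, since `/var/log` on Unraid lives in RAM.

```shell
#!/bin/bash
# Sketch: log GPU power draw once per second to a file so the last samples
# before a crash can be inspected afterwards.
# NVIDIA_SMI is a hypothetical override for the binary; log_gpu_power is a
# hypothetical helper name. The log path is whatever you pass in.
NVIDIA_SMI="${NVIDIA_SMI:-nvidia-smi}"

log_gpu_power() {
  local logfile="$1"
  # -l 1 repeats the query every second until the process is stopped
  "$NVIDIA_SMI" \
    --query-gpu=timestamp,name,power.draw,temperature.gpu,pstate \
    --format=csv -l 1 >> "$logfile"
}
```

Start it in the background before the workload, for example `log_gpu_power /path/on/persistent/storage/gpu-power.csv &`, and check the last lines after the restart. On a hard power cut the final second or two may not have reached the disk.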
 

Link to comment

Hi again ich777,
I'm having some trouble with my Nvidia card, which worked properly for a few weeks after you helped me get it working.
Sadly, it now keeps falling off the bus. I've made some changes to the BIOS and realloc, but that didn't improve the situation...
Do you have any ideas on how I could fix this or things I could try, or is the GPU itself the problem?
 

Mar  6 22:09:42 JJ-SILVERSTONE kernel: NVRM: GPU at PCI:0000:01:00: GPU-33d616df-a0e8-4c9a-3c11-0cad75613c6e
Mar  6 22:09:42 JJ-SILVERSTONE kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Mar  6 22:09:42 JJ-SILVERSTONE kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Mar  6 22:09:42 JJ-SILVERSTONE kernel: NVRM: A GPU crash dump has been created. If possible, please run
Mar  6 22:09:42 JJ-SILVERSTONE kernel: NVRM: nvidia-bug-report.sh as root to collect this data before
Mar  6 22:09:42 JJ-SILVERSTONE kernel: NVRM: the NVIDIA kernel module is unloaded.
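
A small sketch for pulling just the Xid events out of a syslog, which makes it easier to see how often the card drops and with which code. It assumes the log lines look like the ones above; the function name `xid_events` is hypothetical.

```shell
#!/bin/bash
# Sketch: summarise NVRM Xid events from syslog text read on stdin.
# Prints "PCI-address Xid-code" for every matching line.
# xid_events is a hypothetical helper name.
xid_events() {
  sed -nE 's/.*NVRM: Xid \(PCI:([^)]*)\): ([0-9]+),.*/\1 \2/p'
}
```

For example, `xid_events < /var/log/syslog` should print `0000:01:00 79` for the log above.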

 

I've also documented some of the things I tried to fix it on this topic:

 

jj-silverstone-diagnostics-20240306-2257.zip

Edited by KingHawk
Link to comment
3 hours ago, ich777 said:

Are you sure that your power supply is up to the task? Most of the time this has to do with the power supply.

Do you have a display connected to actually see what's going on? It would be really cool if you had a display connected and could take a picture of what's happening on screen when it actually crashes.

 

I assume your machine is not automatically restarting?

 

I see nothing obvious in your syslog; the driver loads fine and it should in theory be working.

When I was hands on and watched it happen, the power just cut out and restarted the server

The power supply was fine. The power CABLE doesn't meet the specs for it. 

https://www.evga.com/support/faq/FAQdetails.aspx?faqid=59690

Waiting on a 12 AWG cable to arrive. Sometimes you just need to get your eyes on it to figure out what's going on.

Link to comment
6 hours ago, KingHawk said:

Do you have any ideas on how I could fix this or things I could try, or is the GPU itself the problem?

check

 

BIOS (latest version)

disable powersaving features

power supply enough

try another slot on the board

...

 

That's overall a hardware issue ... frustrating, but it comes down to trial and error to find where the fault is.

Link to comment
8 hours ago, KingHawk said:
Mar  6 22:09:42 JJ-SILVERSTONE kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.

As @alturismo already pointed out, Xid 79 is a pretty generic error, but most of the time it is related to the card itself. You can get more information here.

Link to comment
5 hours ago, rainmanjam said:

When I was hands on and watched it happen, the power just cut out and restarted the server

Then this seems like a power-related issue, since otherwise you would get a kernel panic.

 

If the power cuts out and the server restarts or simply freezes, then most likely the power supply can't quite cover a peak. That may be caused by the GPU in combination with the CPU, since these are pretty intensive workloads and both can be hit quite heavily initially, causing a spike that trips the over-current protection.

Link to comment

Just installed a P400 Quadro to help out my old ryzen 1600 af with Plex transcoding. I set it up per the instructions, but every time I try to go into the nvidia-driver plugin settings, my entire server gui locks up and never recovers. I can't open the settings or get any other pages to load. I have to ssh in and reboot. I've attached my diagnostics. Any ideas what could be causing this issue?

 

Plex docker settings: image.thumb.png.4a2cf8ab0bb6972b95516a5cf1dd8fba.png

 

Transcoding settings inside plex:

image.thumb.png.7dea3cc8b99cbc5f8fe171dd7cd3f17c.png

 

 

 

unraid-diagnostics-20240307-1746.zip

Edited by colev14
added images of Plex config
Link to comment
8 hours ago, colev14 said:

my old ryzen 1600

IIRC there were quite a few issues with these old Ryzen CPUs and PCIe cards, especially Nvidia.

 

I don't quite remember, but I think you have to disable C-States and maybe force PCIe Gen3 in the BIOS. It may also be a good idea to enable Above 4G Decoding and Resizable BAR support.

Also make sure that you are on the latest BIOS version.

 

I see nothing obvious in your Diagnostics. Can you attach a display to the card, see what happens when you start a transcode, and maybe take a picture?

As said above, I remember quite a few issues with those old Ryzen CPUs; I think @SiRMarlon had similar issues.

Link to comment
15 hours ago, ich777 said:

IIRC there were quite a few issues with these old Ryzen CPUs and PCIe cards, especially Nvidia.

 

I don't quite remember, but I think you have to disable C-States and maybe force PCIe Gen3 in the BIOS. It may also be a good idea to enable Above 4G Decoding and Resizable BAR support.

Also make sure that you are on the latest BIOS version.

 

I see nothing obvious in your Diagnostics. Can you attach a display to the card, see what happens when you start a transcode, and maybe take a picture?

As said above, I remember quite a few issues with those old Ryzen CPUs; I think @SiRMarlon had similar issues.

I had C-States disabled before, I changed the top slot to PCIe Gen3, and I enabled Above 4G Decoding. I don't think the 1600 AF supports Resizable BAR; it didn't show up in my BIOS settings, and my BIOS is updated. I was getting the same issues. Then I decided to try it on my phone and it works. So I tried it on another phone and the Windows client and it works. It seems like the problem is with the Plex web client. I don't think anybody on my server uses that, so I think I am good to go now. Thanks!

 

image.thumb.png.f51f94143c121791aebe64ab1983af12.png

Link to comment
