[Plugin] Nvidia-Driver


ich777

On 2/10/2023 at 12:22 AM, ich777 said:

Do you use nvidia-persistenced?

How fast is it transcoding when in P2?

One or two people have reported that already but AFAIK this caused no issues for them.

There is not much, if anything, that I can do about it, since this is an issue in the driver itself and out of my control.

No nvidia-persistenced.

Not sure if it is slower or not.

Just switched to a 3050 and it is also always showing P2.

Link to comment
14 minutes ago, Jackal24 said:

Not sure if it is slower or not.

With how many FPS is it transcoding?

What kind of content are you transcoding?

I tested it yesterday and got ~1200 FPS (720p source file).

 

Have you tried 3 simultaneous transcodes yet? It should switch to a higher power state if needed, but I don't think that is even necessary for transcoding.

Link to comment
12 minutes ago, Jackal24 said:

Looks like it is possibly an issue with the driver. Apparently Nvidia locks you at p2 if you are doing any CUDA work on consumer cards

But I don't think that this is an issue...

Why would the card go into a higher power state than is needed?

Link to comment
16 minutes ago, Jackal24 said:

I'll play with it some more tomorrow and see. Doing 4k transcoding to 1080p

According to "nvidia-smi -q -d SUPPORTED_CLOCKS | more" my max clocks on my 3050 should be 7001 and 2130. Under power state p2, I am only getting 6800 and 1807, so it is definitely being slowed. If I try to change it using "nvidia-smi -ac 7001,2130", it tells me that it isn't supported.
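
To put a number on how far the P2 clocks fall short, here is a minimal Python sketch. It assumes the two pairs above are memory and graphics clocks in that order (the order `nvidia-smi -ac` takes), and it uses the standard `--query-gpu` fields `pstate`, `clocks.mem`, `clocks.gr`, `clocks.max.mem` and `clocks.max.gr`. The function names are made up for illustration.

```python
import subprocess

# Standard nvidia-smi --query-gpu fields: power state, current and max clocks.
QUERY_FIELDS = "pstate,clocks.mem,clocks.gr,clocks.max.mem,clocks.max.gr"


def parse_clock_line(line):
    """Parse one line produced with --format=csv,noheader,nounits."""
    pstate, mem, gr, max_mem, max_gr = [part.strip() for part in line.split(",")]
    return {
        "pstate": pstate,
        "mem_mhz": int(mem),
        "gr_mhz": int(gr),
        "max_mem_mhz": int(max_mem),
        "max_gr_mhz": int(max_gr),
    }


def clock_shortfall_percent(current_mhz, max_mhz):
    """How far the current clock is below the maximum, in percent."""
    return round(100.0 * (max_mhz - current_mhz) / max_mhz, 1)


def query_gpu_clocks():
    """Ask nvidia-smi for the clocks of every GPU (needs the driver loaded)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return [parse_clock_line(line) for line in result.stdout.splitlines() if line.strip()]


# Values reported in the post above (3050 in P2).
print(clock_shortfall_percent(6800, 7001))  # memory clock
print(clock_shortfall_percent(1807, 2130))  # graphics clock
```

With the figures from the post this works out to roughly 3% on the memory clock and roughly 15% on the graphics clock. Whether that matters for NVENC/NVDEC throughput is a separate question that only FPS numbers can answer.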

Edited by Jackal24
Link to comment
6 minutes ago, Jackal24 said:

Under power state p2, I am only getting 6800 and 1807, so it is definitely being slowed.

For sure, but don't forget that you are using NVENC/NVDEC, and the card may not need to be in a higher power state, or even clock as high, as it would if you put some 3D load on it.

 

8 minutes ago, Jackal24 said:

If I try to change it using "nvidia-smi -ac 7001,2130", it tells me that it isn't supported.

Yes, because that's not supported on Unraid and there is nothing I can do about it.

 

6 minutes ago, Jackal24 said:

When I was running tdarr, it was hitting the card at 100% utilization, but running it slower than it would have had it been at p0

Do you have any numbers to compare with? It's a little hard for me to troubleshoot with no real numbers.

I've now tried transcoding with Unmanic and I'm at P0 on my T400.
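
One way to collect comparable numbers is to pull the FPS readings out of the transcoder's log. This is only a sketch: it assumes the tool prints ffmpeg-style progress lines containing `fps=...`, and the function names are hypothetical.

```python
import re

# Matches the "fps=" field of an ffmpeg-style progress line, e.g. "fps= 45" or "fps=118.5".
FPS_PATTERN = re.compile(r"\bfps=\s*([0-9]+(?:\.[0-9]+)?)")


def extract_fps(log_text):
    """Return every fps reading found in the log text, in order."""
    return [float(value) for value in FPS_PATTERN.findall(log_text)]


def average_fps(log_text):
    """Average the fps readings so two runs can be compared with one number."""
    samples = extract_fps(log_text)
    if not samples:
        raise ValueError("no fps samples found in log")
    return sum(samples) / len(samples)
```

Running the same source file once per setup and comparing the averages next to the power state reported by `nvidia-smi` would give something concrete to troubleshoot with.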

Link to comment
Quote

For sure, but don't forget that you are using NVENC/NVDEC, and the card may not need to be in a higher power state, or even clock as high, as it would if you put some 3D load on it.

True.

 

Quote

Yes, because that's not supported on Unraid and there is nothing I can do about it.

Understood.

 

Quote

Do you have any numbers to compare with? It's a little hard for me to troubleshoot with no real numbers.

I'll try to get some soon.

 

Quote

I've now tried transcoding with Unmanic and I'm at P0 on my T400.

I believe the T400 is a non-consumer card and doesn't have the P2 lock on it for CUDA workloads. The weird thing is that this happens even with only Plex, which I didn't think was considered a CUDA workload. (Although I don't know enough about it.)

Link to comment
5 hours ago, yuntong said:

There's my Diagnostics.

This seems like some kind of bug in the BIOS of your motherboard or, more precisely, your UEFI firmware, and I really can't help resolve it.

 

The card is recognized fine and should work. Are you sure that the message wasn't there before you put the card in the system?

Please make sure that you've enabled Resizable BAR Support and Above 4G Decoding in your BIOS.

Do you experience any issues or just this message in the syslog?

 

Just to double check, are you sure this is not caused by any other VFIO mapping that you've set up in your system:

BIND=0000:00:17.0|8086:7ae2

From what I see this is your SATA controller correct?
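
For anyone checking their own mappings, a small Python sketch that splits such a `BIND=` line into the PCI address and the vendor:device ID. It assumes each entry has the `address|vendor:device` shape shown above and that multiple entries, if present, are separated by spaces; the function name is made up.

```python
import re

# One entry looks like "0000:00:17.0|8086:7ae2": PCI address, then vendor:device ID.
BIND_PATTERN = re.compile(
    r"(?P<address>[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7])"
    r"\|(?P<vendor>[0-9a-fA-F]{4}):(?P<device>[0-9a-fA-F]{4})"
)


def parse_vfio_bind(line):
    """Return a list of {address, vendor, device} dicts from a BIND= line."""
    _, _, value = line.partition("=")
    return [match.groupdict() for match in BIND_PATTERN.finditer(value)]


print(parse_vfio_bind("BIND=0000:00:17.0|8086:7ae2"))
```

Vendor ID `8086` is Intel, so this entry is an Intel device at `00:17.0`. The address can be cross-checked against `lspci -nn -s 00:17.0` to confirm which device is actually bound.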

 

Have you tried booting in Legacy Mode instead of UEFI yet?

Link to comment
On 2/10/2023 at 10:29 PM, ich777 said:

It looks like the plugin is causing a Kernel panic, but that is usually caused by a hardware defect or something else hardware related.

 

Please try to re-seat the card in the PCIe slot and check that external power is connected properly.

 

I would also recommend that you let the Parity sync finish first so that your data is safe, and after that we can try to get the card working again. What I'm really curious about is whether you can test the card in another system, install the drivers and put a 3D load (preferably FurMark) on it for about 30 minutes to see if it is working properly.

This would not be the first time I've seen a Quadro dying; these cards are pretty old now...

Thanks for these tips.

 

Did a parity sync first, completed successfully.

 

Then I took the Nvidia P2000 GPU out of the Unraid server and gave it a good run through FurMark in another system. The GPU seems to be working as it should; no issues or crashes popped up.

 

I put the GPU back in the server and checked that it is seated properly, yet the Unraid system again becomes unresponsive the moment I try to access anything GPU related (Nvidia Driver plugin or GPU Statistics plugin). It doesn't even let me shut down gracefully; I have to force a shutdown every time, which makes testing this over and over an unpleasant process.

 

Anything else you'd recommend me trying out as a possible fix?

 

Thanks a lot

Link to comment
1 hour ago, Massimo Platteau Beltrami said:

Thanks for these tips.

 

Did a parity sync first, completed successfully.

 

Then I took the Nvidia P2000 GPU out of the Unraid server and gave it a good run through FurMark in another system. The GPU seems to be working as it should; no issues or crashes popped up.

 

I put the GPU back in the server and checked that it is seated properly, yet the Unraid system again becomes unresponsive the moment I try to access anything GPU related (Nvidia Driver plugin or GPU Statistics plugin). It doesn't even let me shut down gracefully; I have to force a shutdown every time, which makes testing this over and over an unpleasant process.

 

Anything else you'd recommend me trying out as a possible fix?

 

Thanks a lot

 

Problem solved!

 

There seems to have been an issue between the system BIOS and the Nvidia drivers (I'm guessing). I updated the motherboard BIOS, which was 2 years old, changed some BIOS settings, and after a couple of reboots Unraid recognized the Nvidia GPU again. I am able to access the Nvidia Driver plugin with no issues now.

Link to comment

I recently upgraded from 6.9.2 to 6.11.5 and my Nvidia Tesla M60 randomly disappears with the following message in the syslog:

 

NVRM: GPU 0000:03:00.0: request_irq() failed (-22)

 

No VMs are using the card. The Nvidia plugin sees my card as two GPUs, and Plex (Docker) is using one of them. I've attached my diagnostics just in case. Edit: Had to revert back to 6.9.2. Once the card disappears, the box responds to pings, but the UI doesn't.
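
To see how often and for which GPU this happens, a minimal Python sketch that scans syslog text for the `request_irq() failed` message quoted above and translates the code. The kernel reports errors as negative errno values, so `-22` is `EINVAL` (invalid argument). The function name is hypothetical.

```python
import errno
import re

# Matches lines like: NVRM: GPU 0000:03:00.0: request_irq() failed (-22)
NVRM_IRQ_PATTERN = re.compile(
    r"NVRM: GPU (?P<address>[0-9a-fA-F:.]+): request_irq\(\) failed \((?P<code>-?\d+)\)"
)


def find_irq_failures(log_text):
    """Return (pci_address, code, errno_name) for each failure in the log."""
    failures = []
    for match in NVRM_IRQ_PATTERN.finditer(log_text):
        code = int(match.group("code"))
        name = errno.errorcode.get(abs(code), "unknown")
        failures.append((match.group("address"), code, name))
    return failures


print(find_irq_failures("NVRM: GPU 0000:03:00.0: request_irq() failed (-22)"))
```

Since the plugin shows the card as two GPUs, the PCI address in each hit shows whether it is always the same half of the card that drops.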

 

 

extremis-diagnostics-20230213-1457.zip

Edited by Doogs
Link to comment
2 hours ago, Doogs said:

Once the card disappears, the box responds to pings, but the UI doesn't.

Have you tried an older version of the driver itself yet?

Seems like a bug with newer Kernels.

Please also try to re-seat the card and also check if power is connected properly.

Did you change anything else while upgrading to 6.11.5 like updating the BIOS, adding hardware/drives or something like that?

Link to comment
14 minutes ago, ich777 said:

Have you tried an older version of the driver itself yet?

Seems like a bug with newer Kernels.

Please also try to re-seat the card and also check if power is connected properly.

Did you change anything else while upgrading to 6.11.5 like updating the BIOS, adding hardware/drives or something like that?

I'm staying at 6.9.2 at the moment until a parity finishes because last Saturday I moved all the things to a new case (card was reseated during install and again when said card was blocking access to onboard SATA) and added a bunch of drives.  I did a new config and I should have rebuilt the parity instead of telling unraid the parity is valid.  I don't think that would cause the issues I'm having, but let me rule out the parity rebuild as a potential issue and I'll upgrade/try an older driver version.  

 

Only thing from my syslog output remotely related  from Google-fu was here - back in 2020.  I also don't have a normal "over the counter" GPU so that could be it as well.  I will update you and thank you for responding!

Link to comment

Update: Parity finished this morning and I did an upgrade to 6.11.5.

 

Driver:

  1. 525.89.02 - Card Drops
  2. 525.85.05 - Card Drops
  3. 525.78.01 - Card Drops
  4. 525.60.13 - Card Drops

Every time I downgrade the firmware and hit reboot, the server doesn't reboot and I have to manually power cycle the box. 😞

Link to comment
23 hours ago, Doogs said:

Every time I downgrade the firmware and hit reboot, the server doesn't reboot and I have to manually power cycle the box. 😞

You are talking about the driver, correct, not the firmware...?

 

Does the server freeze after you click Update & Download, or after you click reboot?

Are you really sure that the issue is actually the Nvidia Driver plugin? Do you maybe have an SSH session open somewhere? That will also prevent the reboot most of the time.

 

Have you tried not installing the driver and rebooting the server, just to double check that the server reboots at all?

From a technical standpoint, nothing is different after clicking Update & Download: it only downloads the driver, nothing more. The driver is actually installed when the server boots.

Link to comment

I have been using Unraid for two weeks. I discovered Plex transcoding, installed the Nvidia driver, and switched to the second-latest release, where my 900-series GTX 970 is supported. After a restart the server did not turn on; I rebooted manually a few times and it came up, but now I cannot connect to the web UI. Every time I boot, it goes through installing 3rd-party drivers and shows tons of "no space left on device" errors. [See pictures attached,](https://imgur.com/a/edaTTZh). I cannot SSH in, by the way, so I don't know how to get logs other than pictures.

Any help is appreciated, blessings ~ Bosh

Edited by Bosh
Link to comment
2 hours ago, Bosh said:

switched to the second-latest release, where my 900-series GTX 970 is supported.

Even the latest release will/should work.

 

2 hours ago, Bosh said:

Every time I boot, it goes through installing 3rd-party drivers and shows tons of "no space left on device" errors.

This happens because you have run out of space in your RAM. How much RAM do you have installed? You need at least 8GB of RAM, and even that is the bare minimum if you want to use the Nvidia Driver on your system.

But I also see that your server installs other packages too. If you are installing big third-party packages (like Python), this will fill your RAM even more and you will need even more RAM.
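
A quick way to check this on a system you can still reach is sketched below in Python. It reads `MemTotal` from `/proc/meminfo` and the free space on the root filesystem; on Linux `MemTotal` is usually a bit lower than the installed amount, so treat the comparison with 8GB as approximate. The function names are made up.

```python
import shutil


def mem_total_gib(meminfo_text):
    """Extract MemTotal (reported in kB) from /proc/meminfo text, in GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])
            return kib / (1024 * 1024)
    raise ValueError("MemTotal not found")


def root_free_mib(path="/"):
    """Free space on the filesystem holding `path`, in MiB."""
    return shutil.disk_usage(path).free / (1024 * 1024)


def read_meminfo(path="/proc/meminfo"):
    """Read the kernel's memory report (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

Calling `mem_total_gib(read_meminfo())` and `root_free_mib()` right after boot shows how much headroom is left before the driver and other packages are unpacked.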

 

2 hours ago, Bosh said:

[See pictures attached,](https://imgur.com/a/edaTTZh)

BTW you can copy/paste pictures here and you don't need to use a third party site.

 

2 hours ago, Bosh said:

Any help is appreciated, blessings ~ Bosh

Please pull the USB Boot device from your server, plug the USB Boot device into a computer, navigate to "/config/plugins/", delete the file "nvidia-driver.plg" and the folder "nvidia-driver", plug the USB Boot device back into your server and boot it back again.
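
If you would rather script that cleanup on the other computer, here is a minimal Python sketch. It only assumes the paths named above (`config/plugins/nvidia-driver.plg` and `config/plugins/nvidia-driver`) relative to wherever the USB device is mounted; the function name and the mount point are placeholders.

```python
from pathlib import Path
import shutil


def remove_nvidia_plugin(usb_root):
    """Delete the plugin file and plugin folder from the mounted USB boot device.

    usb_root is wherever the USB device is mounted (hypothetical example: "/mnt/usb").
    Returns the list of paths that were actually removed.
    """
    plugins = Path(usb_root) / "config" / "plugins"
    removed = []

    plugin_file = plugins / "nvidia-driver.plg"
    if plugin_file.is_file():
        plugin_file.unlink()
        removed.append(plugin_file)

    plugin_folder = plugins / "nvidia-driver"
    if plugin_folder.is_dir():
        shutil.rmtree(plugin_folder)
        removed.append(plugin_folder)

    return removed
```

Nothing else under `config/plugins` is touched, so other plugins stay installed.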

Link to comment
On 2/13/2023 at 3:14 AM, ich777 said:

This seems like some kind of bug in the BIOS of your motherboard or, more precisely, your UEFI firmware, and I really can't help resolve it.

 

The card is recognized fine and should work. Are you sure that the message wasn't there before you put the card in the system?

Please make sure that you've enabled Resizable BAR Support and Above 4G Decoding in your BIOS.

Do you experience any issues or just this message in the syslog?

 

Just to double check, are you sure this is not caused by any other VFIO mapping that you've set up in your system:

BIND=0000:00:17.0|8086:7ae2

From what I see this is your SATA controller correct?

 

Have you tried booting in Legacy Mode instead of UEFI yet?

The VFIO mapping is for my old NAS RAID disks, and it worked well. When I removed the Nvidia card, the errors disappeared. I tried updating the BIOS and using Legacy boot, but it's not working. My board is an ASUS H610M-A D4.

Link to comment
30 minutes ago, yuntong said:

I tried updating the BIOS and using Legacy boot, but it's not working. My board is an ASUS H610M-A D4.

This seems like some kind of hardware incompatibility issue to me. Can't really do much about it.

I assume you don't have a second board on hand where you could test the card in combination with Unraid...?

Link to comment
1 hour ago, WenzelComputing said:

Is there a way to view the GUI with iDRAC with the driver installed? It just goes to a black screen with a prompt that you can't type in.

Does it work without the driver installed?

 

I just wonder how these things would be related ... assuming you mean the Dell MM iDRAC.

Link to comment
