[Plugin] Nvidia-Driver


ich777

54 minutes ago, ConnerVT said:

 

Why would you think an Nvidia driver would be used to utilize an AMD GPU?

Because I know just enough to be dangerous and don't recall who makes which GPUs. My bad. I used to obsess over this stuff, but I've got too many other priorities in life to worry about keeping up. Now I just dabble when I can.

 

Tucks tail and scurries away

Link to comment
9 hours ago, TealNerd said:

The diag from the other thread was with legacy and no pcie options

I see that the driver is loaded for the device:

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA104GL [RTX A4000] [10de:24b0] (rev a1)
    Subsystem: Dell GA104GL [RTX A4000] [1028:14ad]
    Kernel driver in use: nvidia
    Kernel modules: nvidia_drm, nvidia

...but it still fails:

Jan 30 12:49:00 montero kernel: NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x25:0x65:1457)
Jan 30 12:49:00 montero kernel: NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

 

My last two recommendations are that you try to force the PCIe slot where the card is connected to PCIe Gen3 and that you disable C-States in your BIOS.

 

I now have a few users with issues on AMD B-series chipset boards that are not working correctly; it seems to me that this is a bad BIOS implementation or some kind of hardware incompatibility.
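
For anyone who wants to repeat the driver check quoted above on their own system, here is a minimal sketch that pulls the "Kernel driver in use" value for one PCI slot out of lspci-style text. The function name kernel_driver_in_use is hypothetical, and I am assuming the text comes from something like `lspci -nnk`; the sample is the output quoted in this post.

```python
def kernel_driver_in_use(lspci_text, slot):
    """Return the 'Kernel driver in use' value for the given PCI slot, or None."""
    in_device = False
    for line in lspci_text.splitlines():
        if line and not line[0].isspace():
            # A non-indented line starts a new device entry.
            in_device = line.startswith(slot)
        elif in_device and "Kernel driver in use:" in line:
            return line.split(":", 1)[1].strip()
    return None


# Sample copied from the post above.
sample = """01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA104GL [RTX A4000] [10de:24b0] (rev a1)
    Subsystem: Dell GA104GL [RTX A4000] [1028:14ad]
    Kernel driver in use: nvidia
    Kernel modules: nvidia_drm, nvidia
"""

print(kernel_driver_in_use(sample, "01:00.0"))
```

A loaded driver only shows that the kernel module is bound to the device; as the RmInitAdapter lines show, initialization can still fail afterwards.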

Link to comment

Hello,

 

I have recently encountered an issue with this driver and an old GeForce GTX 970. I had it set up and working for several days, transcoding my media library with no issues.

 

However, today, completely out of the blue (I hadn't been changing settings or fiddling with it), it stopped working and is no longer recognised by your driver.

 

System Devices does show it:

 

IOMMU group 16:[10de:13c2] 03:00.0 VGA compatible controller: NVIDIA Corporation GM204 [GeForce GTX 970] (rev a1)

 

The output from nvidia-smi - No devices found

 

The server has been rebooted and I have confirmed this isn't a hardware problem (no cables are unplugged or loose, and the fan on the GPU is spinning).

 

I am an Unraid novice, so please be gentle :P

 

Unraid version 6.11.5

I was using the production branch of the GeForce driver.

 

Thanks in advance

Link to comment

I am getting the "no devices were found" error, but I am not sure when it started. I know it used to work.

I have a Quadro P2000.

Dug through this topic and found the solution from September:

On 9/23/2022 at 2:01 PM, ich777 said:

Please do the following (this is only necessary if you upgraded before I recompiled the driver):

  1. Open up a Unraid terminal and execute:
    rm -f /boot/config/plugins/nvidia-driver/packages/5.19.9/*
  2. Close the terminal
  3. Go to the Nvidia-Driver Plugin page
  4. Click on the button "Update & Download" (please wait until the download has finished and the Done button is displayed)
  5. Reboot

 

But it didn't fix it.

Attaching diagnostics.  Would greatly appreciate any help.  Thank you so much.

tower-diagnostics-20230201-0906.zip

Link to comment
1 hour ago, ich777 said:

Please try this:

  1. Uninstall GPU-Statistics
  2. Reboot
  3. In your BIOS enable above 4G Decoding and Resizable BAR Support

 

See if the issue is fixed.

This worked. Thank you!

Quick question: can I reinstall GPU-Statistics?

I did have to make one other change in my BIOS and I'll record that here in case anyone else runs into this.
Original problem: Nvidia Quadro P2000 stopped showing up in Nvidia Driver with "no devices were found".

After following @ich777's instructions (quoted above), Unraid would not boot.

What I discovered is that enabling Resizable BAR Support changed the boot mode from legacy to UEFI. I found this post that explained that I needed to turn off Secure Boot and also go into the USB drive and remove the hyphen from the EFI- folder. Once I did those two things, it booted back up and the graphics card was recognized.

Thanks again for the help!

Link to comment
27 minutes ago, Burkey said:

Unticked the box and rebooted; no settings changes were required in the BIOS and it booted fine. However, the GPU is still not showing.

I see that your card falls off the bus; this usually indicates a connection issue with the card.

Please double check that the card is seated properly and that everything, including external power, is connected.

 

Feb  1 14:39:56 Burkey-Server kernel: NVRM: GPU at PCI:0000:03:00: GPU-68d7626d-e36a-0ab9-b351-29859f83f7ec
Feb  1 14:39:56 Burkey-Server kernel: NVRM: Xid (PCI:0000:03:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Feb  1 14:39:56 Burkey-Server kernel: NVRM: GPU 0000:03:00.0: GPU has fallen off the bus.

 

Please also check whether you are on the latest BIOS version; I know the board is pretty old, but it is worth checking.

 

Try the card in another PCIe slot. Sometimes switching the PCIe slot to a fixed PCIe generation solves the issue too instead of leaving it on AUTO.
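
To check a syslog for this kind of event yourself, here is a minimal sketch that scans log lines for NVRM Xid entries and returns the bus ID and Xid code. The function name find_xid_events and the regular expression are my own assumptions, written against the log lines quoted above; they are not part of the plugin or the driver.

```python
import re

# Matches lines such as:
#   NVRM: Xid (PCI:0000:03:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
XID_PATTERN = re.compile(r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<code>\d+),")


def find_xid_events(lines):
    """Return (bus_id, xid_code) tuples for every NVRM Xid line found."""
    events = []
    for line in lines:
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group("bus"), int(match.group("code"))))
    return events


# Sample lines copied from the log excerpt above.
sample = [
    "Feb  1 14:39:56 Burkey-Server kernel: NVRM: GPU at PCI:0000:03:00: GPU-68d7626d-e36a-0ab9-b351-29859f83f7ec",
    "Feb  1 14:39:56 Burkey-Server kernel: NVRM: Xid (PCI:0000:03:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.",
    "Feb  1 14:39:56 Burkey-Server kernel: NVRM: GPU 0000:03:00.0: GPU has fallen off the bus.",
]

print(find_xid_events(sample))
```

On a live system the same function could be fed the lines of the syslog file instead of the sample list; where that file lives depends on your setup.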

Link to comment
Hello Mate,

 

I reseated the graphics card and it was picked up immediately. Very weird how it just dropped off.

Thank you for your help!

 

Link to comment

Hi all!

I've got the Nvidia plugin installed to pass a GPU through to BOINC for crunching work units.

 

The GPU is an NVIDIA RTX 2060.

After some time, the GPU Statistics plugin stops reporting information.

The fans run at full throttle.

nvidia-smi gives the following:

 

Unable to determine the device handle for GPU0000:01:00.0: Unknown Error

 

Logs have this:

 

Feb  4 10:08:07 Tower kernel: NVRM: GPU at PCI:0000:01:00: GPU-8adb5607-69d7-4620-d2f3-69565c4d9e04
Feb  4 10:08:07 Tower kernel: NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
Feb  4 10:08:07 Tower kernel: NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
Feb  4 10:08:07 Tower kernel: NVRM: A GPU crash dump has been created. If possible, please run
Feb  4 10:08:07 Tower kernel: NVRM: nvidia-bug-report.sh as root to collect this data before
Feb  4 10:08:07 Tower kernel: NVRM: the NVIDIA kernel module is unloaded.
Feb  4 10:08:07 Tower kernel: xhci_hcd 0000:01:00.2: Unable to change power state from D3hot to D0, device inaccessible
Feb  4 10:08:07 Tower kernel: xhci_hcd 0000:01:00.2: Unable to change power state from D3cold to D0, device inaccessible
Feb  4 10:08:07 Tower kernel: xhci_hcd 0000:01:00.2: Controller not ready at resume -19
Feb  4 10:08:07 Tower kernel: xhci_hcd 0000:01:00.2: PCI post-resume error -19!
Feb  4 10:08:07 Tower kernel: xhci_hcd 0000:01:00.2: HC died; cleaning up

 

I'm unable to stop the BOINC Docker container.

I have to restart the system, which triggers an immediate parity check on the next startup.

 

Thank you for any pointers.

Edited by acastellab
Link to comment
On 11/24/2022 at 6:05 PM, Stan_ said:

Hello,

Can I get the plugin file from your GitHub repository and install it from the Plugins --> Install Plugin page? I can't use CA because it shows an SSL verification failure. 😔

After using this method, it shows:

915855750_ZB@0818XJR1@V6LGJL5L4T.png.4a693255241bc1e25c83b3094fcb4e6b.png

I clicked the Done button, rebooted the server, and switched Docker off and on. Then this showed up:

2073493075_1JU)7VNPOADFRNCG1UJ)9.png.1105cefe77a225372ab06d6cfb8f6523.png

As I rebooted, my KVM output showed this line:

1642698619_PBP5W0TV(OTNUN5ECPG.png.d27e6f19426f3b429a3f1b4fbb031876.png

Is my method wrong? Here are my diagnostics.

By the way, my card is a Tesla P4 🙂

stannas-diagnostics-20221125-0948.zip

 

https://unraid.net/zh/购买-正版-unraid-许可证

 

If you believe this is an error, please contact support.

Link to comment
5 hours ago, Jackal24 said:

1660 Super. It always stays in power mode P2 and never goes to P0 even under high load. I know of at least one other person with this issue.

Do you use nvidia-persistenced?

How fast is it transcoding when in P2?

One or two people have reported that already but AFAIK this caused no issues for them.

There is not much, if anything, that I can do about it, since this is an issue in the driver itself and out of my control.
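
If you want to watch the performance state while a transcode is running, here is a minimal sketch that asks nvidia-smi for the index, name and pstate of each GPU and parses the CSV rows. The function names parse_pstates and query_pstates are hypothetical, and the sample row is written by hand for illustration; it is not output from the system discussed here.

```python
import subprocess


def parse_pstates(csv_text):
    """Parse 'index, name, pstate' CSV rows into a list of dicts."""
    gpus = []
    for row in csv_text.strip().splitlines():
        parts = [field.strip() for field in row.split(",")]
        # The first field is the index and the last is the pstate;
        # anything in between belongs to the GPU name.
        gpus.append({
            "index": int(parts[0]),
            "name": ", ".join(parts[1:-1]),
            "pstate": parts[-1],
        })
    return gpus


def query_pstates():
    """Run nvidia-smi and return the parsed performance states."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name,pstate", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return parse_pstates(result.stdout)


# Hypothetical sample row, written by hand for illustration only.
sample = "0, GeForce GTX 1660 SUPER, P2\n"
print(parse_pstates(sample))
```

Calling query_pstates() a few times under load would show whether the card ever leaves P2.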

Link to comment

Hi, after replacing a faulty hard drive with a new one in my Unraid server, I am able to start up Unraid, change the drive in the GUI, and start the rebuild process. Right after the rebuild has started, however, the Unraid GUI stalls and stops working completely. From the syslogs it appears that the Nvidia driver is causing the system to misbehave and become unresponsive.

Here's the original post where I enquired about this issue:

 

 

As recommended in this previous post, I rebooted in safe mode. I tried installing the Nvidia driver plugin again, but it behaves exactly the same every time: it says it is installed properly, but then refuses to open the Nvidia Driver plugin page and freezes the GUI.

After quitting my web browser and logging back in to the server, the GUI seems to work again, but the server no longer reacts to the reboot or shutdown buttons. Other services like the terminal don't work either. I am forced to hard shut down the server every time after trying to install the Nvidia driver plugin to be able to access the server again.

 

Any idea why the Nvidia driver plugin is breaking the server like this? And any idea of a possible fix to get it to work again?

 

 

I've attached the most recent syslogs to this post in case they are of any use.

 

Thanks for the help

Fluxion_Syslog_08Feb2023.txt

Link to comment
11 hours ago, Massimo Platteau Beltrami said:

Any idea why the Nvidia driver plugin is breaking the server like this? And any idea of a possible fix to get it to work again?

It looks like the plugin is causing a kernel panic, but that is usually caused by a hardware defect or something else hardware related.

 

Please try to re-seat the card in the PCIe slot, check if external power is connected properly.

 

I would also recommend that you let the parity sync finish first so that your data is safe; after that we can try to get the card working again. What I'm really curious about is whether you can test the card in another system: install the drivers and put a 3D load (preferably FurMark) on it for about 30 minutes to see if it is working properly.

This would not be the first time I have seen a Quadro die; these cards are pretty old now...

Link to comment
Feb 12 09:08:00 DiskStation kernel: ACPI Warning: \_SB.PC00.PEG1.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20220331/nsarguments-61)
Feb 12 09:08:00 DiskStation kernel: ACPI BIOS Error (bug): Failure creating named object [\_SB.PC00.PEG1.PEGP._DSM.USRG], AE_ALREADY_EXISTS (20220331/dsfield-184)
Feb 12 09:08:00 DiskStation kernel: ACPI Error: AE_ALREADY_EXISTS, CreateBufferField failure (20220331/dswload2-477)
Feb 12 09:08:00 DiskStation kernel: ACPI Error: Aborting method \_SB.PC00.PEG1.PEGP._DSM due to previous error (AE_ALREADY_EXISTS) (20220331/psparse-529)
Feb 12 09:08:00 DiskStation kernel: ACPI BIOS Error (bug): Failure creating named object [\_SB.PC00.PEG1.PEGP._DSM.USRG], AE_ALREADY_EXISTS (20220331/dsfield-184)
Feb 12 09:08:00 DiskStation kernel: ACPI Error: AE_ALREADY_EXISTS, CreateBufferField failure (20220331/dswload2-477)
Feb 12 09:08:00 DiskStation kernel: ACPI Error: Aborting method \_SB.PC00.PEG1.PEGP._DSM due to previous error (AE_ALREADY_EXISTS) (20220331/psparse-529)
Feb 12 09:08:00 DiskStation kernel: ACPI BIOS Error (bug): Failure creating named object [\_SB.PC00.PEG1.PEGP._DSM.USRG], AE_ALREADY_EXISTS (20220331/dsfield-184)
Feb 12 09:08:00 DiskStation kernel: ACPI Error: AE_ALREADY_EXISTS, CreateBufferField failure (20220331/dswload2-477)
Feb 12 09:08:00 DiskStation kernel: ACPI Error: Aborting method \_SB.PC00.PEG1.PEGP._DSM due to previous error (AE_ALREADY_EXISTS) (20220331/psparse-529)
Feb 12 09:08:00 DiskStation kernel: ACPI BIOS Error (bug): Failure creating named object [\_SB.PC00.PEG1.PEGP._DSM.USRG], AE_ALREADY_EXISTS (20220331/dsfield-184)
Feb 12 09:08:00 DiskStation kernel: ACPI Error: AE_ALREADY_EXISTS, CreateBufferField failure (20220331/dswload2-477)
Feb 12 09:08:00 DiskStation kernel: ACPI Error: Aborting method \_SB.PC00.PEG1.PEGP._DSM due to previous error (AE_ALREADY_EXISTS) (20220331/psparse-529)
Feb 12 09:08:00 DiskStation kernel: ACPI BIOS Error (bug): Failure creating named object [\_SB.PC00.PEG1.PEGP._DSM.USRG], AE_ALREADY_EXISTS (20220331/dsfield-184)
Feb 12 09:08:00 DiskStation kernel: ACPI Error: AE_ALREADY_EXISTS, CreateBufferField failure (20220331/dswload2-477)
Feb 12 09:08:00 DiskStation kernel: ACPI Error: Aborting method \_SB.PC00.PEG1.PEGP._DSM due to previous error (AE_ALREADY_EXISTS) (20220331/psparse-529)
Feb 12 09:08:00 DiskStation kernel: ACPI BIOS Error (bug): Failure creating named object [\_SB.PC00.PEG1.PEGP._DSM.USRG], AE_ALREADY_EXISTS (20220331/dsfield-184)
Feb 12 09:08:00 DiskStation kernel: ACPI Error: AE_ALREADY_EXISTS, CreateBufferField failure (20220331/dswload2-477)
Feb 12 09:08:00 DiskStation kernel: ACPI Error: Aborting method \_SB.PC00.PEG1.PEGP._DSM due to previous error (AE_ALREADY_EXISTS) (20220331/psparse-529)
Feb 12 09:08:00 DiskStation kernel: ACPI BIOS Error (bug): Failure creating named object [\_SB.PC00.PEG1.PEGP._DSM.USRG], AE_ALREADY_EXISTS (20220331/dsfield-184)
Feb 12 09:08:00 DiskStation kernel: ACPI Error: AE_ALREADY_EXISTS, CreateBufferField failure (20220331/dswload2-477)
Feb 12 09:08:00 DiskStation kernel: ACPI Error: Aborting method \_SB.PC00.PEG1.PEGP._DSM due to previous error (AE_ALREADY_EXISTS) (20220331/psparse-529)
Feb 12 09:08:00 DiskStation kernel: ACPI BIOS Error (bug): Failure creating named object [\_SB.PC00.PEG1.PEGP._DSM.USRG], AE_ALREADY_EXISTS (20220331/dsfield-184)
Feb 12 09:08:00 DiskStation kernel: ACPI Error: AE_ALREADY_EXISTS, CreateBufferField failure (20220331/dswload2-477)
Feb 12 09:08:00 DiskStation kernel: ACPI Error: Aborting method \_SB.PC00.PEG1.PEGP._DSM due to previous error (AE_ALREADY_EXISTS) (20220331/psparse-529)
Feb 12 09:08:00 DiskStation kernel: ACPI BIOS Error (bug): Failure creating named object [\_SB.PC00.PEG1.PEGP._DSM.USRG], AE_ALREADY_EXISTS (20220331/dsfield-184)
Feb 12 09:08:00 DiskStation kernel: ACPI Error: AE_ALREADY_EXISTS, CreateBufferField failure (20220331/dswload2-477)

I installed the driver correctly for the Tesla P4, and the syslog shows these messages. How can I solve this? Thanks for any reply.

屏幕截图 2023-02-12 091445.jpg

Link to comment
