[Plugin] Nvidia-Driver


ich777

Recommended Posts

I am still unable to get a second GPU working in my computer. Contacting ASRock support has led to them telling me that two Nvidia cards are not supported, only two AMD cards.

 

It is a B460 board, which does not support SLI. I have insisted to them many times that I do not need SLI, but am I wrong to assume that SLI is what they are saying is unsupported? Between having a niche use case and a language barrier, I don't think I am getting the right answers from them.

 

As a follow-up question: if I do switch motherboards, does anyone have any recommendations for LGA-1200 motherboards that are known to work with two Nvidia cards?

Link to comment
10 hours ago, ich777 said:

This is completely subjective, I have to say, but I have never had problems with ASUS or MSI.

 

Maybe it's also some BIOS or hardware compatibility issue...

 

Something interesting (I had started the RMA process), but then I decided to go back to Unraid 6.9.2: I just shut the system down, reflashed the boot drive, copied across the config, and fired the system up (I had it running all day today with both cards in, but only one was properly recognised). Both cards are now recognised.

 

My first crash was a couple of weeks ago, but I am sure I had been running 6.10-rc1 for quite some time before that.

 

 

bothcards.png

Link to comment
29 minutes ago, Wingede said:

reflashed the boot drive, copied across the config, and fired the system up

Just as a note: it would be easier if you only replace the bz* files on the USB boot device, so you don't have to create a whole new stick and copy everything back.
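Roughly, the in-place update looks like this (a sketch only; /tmp/unraid is an assumed extraction path for the release zip, and backing up the current files first is optional but sensible):

# with the new release zip extracted to /tmp/unraid and the flash mounted at /boot
mkdir -p /boot/backup && cp /boot/bz* /boot/backup/   # keep the old kernel files, just in case
cp /tmp/unraid/bz* /boot/                             # bzimage, bzroot, bzfirmware, bzmodules + their .sha256 files
reboot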

 

30 minutes ago, Wingede said:

Both cards are now recognised.

Can you please share your diagnostics again with everything working?

Link to comment
Just now, ich777 said:

Just as a note: it would be easier if you only replace the bz* files on the USB boot device, so you don't have to create a whole new stick and copy everything back.

 

Can you please share your diagnostics again with everything working?

 

Thanks for the hint about replacing the bz* files, good to know.

 

Diagnostics attached with everything working. I just fired up one of the VMs using the second card to get some load across both, and all seems fine. The primary card is used for Docker-based stuff.

 

Somewhat wondering if I should push the update to 6.10-rc1/rc2 and see if it breaks again?

av4-diagnostics-20211109-1907.zip

Link to comment
17 minutes ago, Wingede said:

The primary card is used for Docker-based stuff.

Oh, I thought you were using both for Docker containers.

Have you tried binding the second card to VFIO, or even stubbing it, and then passing it through to a VM on 6.10.0?

What error code do you get in Windows, or wasn't it picked up by the VM at all?
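(For reference, on 6.9+ the binding is usually done from Tools > System Devices, which writes /boot/config/vfio-pci.cfg; the older stubbing route is a kernel parameter instead. The PCI address and device ID below are placeholders only:)

# /boot/config/vfio-pci.cfg, as written by Tools > System Devices
BIND=0000:02:00.0|10de:13bc

# or, old-style stubbing via the kernel line in /boot/syslinux/syslinux.cfg:
# append vfio-pci.ids=10de:13bc initrd=/bzroot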

 

19 minutes ago, Wingede said:

Somewhat wondering if I should push the update to 6.10-rc1/rc2 and see if it breaks again?

I'm also interested in that; please report back with your findings.

Link to comment
1 minute ago, ich777 said:

Oh, I thought you were using both for Docker containers.

Have you tried binding the second card to VFIO, or even stubbing it, and then passing it through to a VM on 6.10.0?

What error code do you get in Windows, or wasn't it picked up by the VM at all?

 

I'm also interested in that; please report back with your findings.

 

My configuration has always been card 0 for Docker and card 1 for a Windows-based VM. This had been running on 6.9.x for roughly 6-8 weeks before upgrading to 6.10-rc1, then recently rc2. Same hardware, etc.

 

Where I mentioned Windows seeing the card but having the Error 43 issue was when I booted the entire system into a native Windows installation with just a single card in the primary slot. This was an attempt to see whether it was Linux/Unraid related, or to narrow down the hardware. I haven't booted native Windows since.

 

I will look to update back to rc2 and see what happens. Perhaps there is some compatibility issue between my motherboard and kernel 5.14. What I don't understand is that it had been running rc2 for at least a couple of days before it all went pear-shaped. I will report back later in the week, as I need the system running and am curious to see if it is stable for a few days.

Link to comment
15 minutes ago, Wingede said:

My configuration has always been card 0 for Docker and card 1 for a Windows-based VM. This had been running on 6.9.x for roughly 6-8 weeks before upgrading to 6.10-rc1, then recently rc2. Same hardware, etc.

May I ask what containers you are using the second card for? Only for transcoding, or for something different too?

 

On which card does the unRAID console get displayed, or is it displayed on the iGPU?

 

You can also remove this line from your go file on 6.10.0:

modprobe i915
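(For context, the go file lives on the flash drive and is usually only a few lines; a typical one looks roughly like this, so it's just a matter of deleting that one line:)

#!/bin/bash
# /boot/config/go -- runs once at boot
# modprobe i915   <- no longer needed on 6.10.0
/usr/local/sbin/emhttp &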

 

I would also recommend that you use the Intel GPU TOP plugin (not the container) to enable the iGPU; this will also allow you to see the usage on the unRAID Dashboard in combination with the GPU Statistics plugin.

 

Somebody else also has the exact same issue with two cards and 6.10.0:

 

I'm really curious to find out what the root of this issue is.

Link to comment
11 minutes ago, ich777 said:

May I ask what containers you are using the second card for? Only for transcoding, or for something different too?

 

On which card does the unRAID console get displayed, or is it displayed on the iGPU?

 

The unRAID console uses the iGPU; Docker specifically uses the first K1200 for doing transcodes for Emby or Plex. The second K1200 doesn't do a heck of a lot, but is passed through to a Windows VM which I use for different things, some CAD stuff when remote/being lazy ;). I probably could just use the iGPU for Emby transcodes, but I have the cards.

11 minutes ago, ich777 said:

 

You can also remove this line from your go file on 6.10.0:

modprobe i915

 

I would also recommend that you use the Intel GPU TOP plugin (not the container) to enable the iGPU; this will also allow you to see the usage on the unRAID Dashboard in combination with the GPU Statistics plugin.

 

Thanks for the advice, will do.

11 minutes ago, ich777 said:

 

Somebody else also has the exact same issue with two cards and 6.10.0:

 

I'm really curious to find out what the root of this issue is.

 

Interesting!! I'll make it a priority to update back to rc2 tomorrow and see if it breaks. It's starting to get a bit late here and I need to wind down a bit, but I will do that tomorrow and post feedback, since you have spent a fair amount of time helping me.

Link to comment
1 minute ago, Wingede said:

Interesting!! I'll make it a priority to update back to rc2 tomorrow and see if it breaks.

Can you maybe try to Disable Above 4G Decoding and see if it works?

 

I can only think of a memory allocation issue with the newer kernel.

 

If you have time, please give me the output from:

dmesg | grep "root bus resource"

Once with 4G Decoding enabled and once with it disabled, please.
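(What to look for, as a rough guide: with Above 4G Decoding enabled there should be at least one memory window above the 32-bit boundary, i.e. beyond 0xffffffff. The lines below are illustrative only; the actual addresses will differ:)

# pci_bus 0000:00: root bus resource [mem 0xc0000000-0xfebfffff window]      <- below 4G
# pci_bus 0000:00: root bus resource [mem 0x4000000000-0x7fffffffff window]  <- above 4G, only with 4G Decoding on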

Link to comment
7 hours ago, yakboyslim said:

I am still unable to get a second GPU working in my computer. Contacting ASRock support has led to them telling me that two Nvidia cards are not supported,


 

But I have a Z590 board with a 10850K CPU, and I also had to VFIO-bind the cards to get them working (even with the iGPU set as primary). When using the CPU-based slots I also had to use the "both" ACS override to get them properly separated into IOMMU groups.
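(For reference, the "both" option corresponds to a kernel parameter that Unraid's VM Manager settings append to syslinux.cfg; shown here only as an illustration:)

# /boot/syslinux/syslinux.cfg, on the append line:
append pcie_acs_override=downstream,multifunction initrd=/bzroot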

 

As the 1050 is for desktop usage only, I settled it in the PCH-based x16 slot (running in x4 mode), which leaves the gaming machine untouched with x16 and rBAR active. I haven't tested further yet whether there is a collision with rBAR when using both GPUs in CPU-based PCIe slots.

 

Maybe post a screenshot of your IOMMU groups

 

[screenshot: IOMMU groups, as an example]

 

and drop a note about exactly which board you have and which slots you tested. Also, as an example: I have to disable the Z590 WiFi device so it won't break the USB slots, etc. You may want to disable any devices you don't need in the BIOS too.

Link to comment
6 hours ago, alturismo said:

drop a note about exactly which board you have and which slots you tested. Also, as an example: I have to disable the Z590 WiFi device so it won't break the USB slots, etc. You may want to disable any devices you don't need in the BIOS too.

 

It's an ASRock B460 Phantom Gaming 4. I currently have the Quadro P400 in the upper x16 slot (CPU lanes) and the GTX 1060 in the lower x16 slot (PCH, running at x4 I believe). But that is just the arrangement I stopped troubleshooting on; I have tried both slots, with every combination of VFIO binding I can think of.

 


Currently the GTX 1060 is bound to VFIO, since the hope is to one day pass that card through to a VM while the P400 is used by Docker containers. I have tried a few different ACS override settings with no effect. When I try to start my VM (currently an Ubuntu 18.04 server) it fails to start with the error "Execution error: internal error: qemu unexpectedly closed the monitor".

 


 

If need be I can rerun the diagnostics, but I have included the diagnostics for how it is currently set up (GTX 1060 with VFIO binding).

 

@ich777 Thanks for the help again. Also attached are the diagnostics from a few weeks ago, before I did any VFIO binding: codraid-diagnostics-20211026-2318.zip

codraid-diagnostics-20211109-0808.zip

Link to comment
13 minutes ago, yakboyslim said:

But that is just the arrangement I stopped troubleshooting on

And what exactly is not working?

The P400 is recognized by your system and you should be able to use it in Docker containers.

The 1060 is also recognized by your system and should likewise be ready for use in a VM.

 

Do you want to pass the card through to an Ubuntu server?

 

Please remember that if you bind the card to VFIO, it won't show up in the Nvidia Driver plugin.
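(A quick way to check from the console which driver has actually claimed each card, regardless of what the plugin shows; the grep pattern is only an example:)

lspci -nnk | grep -EiA3 'vga|3d controller'
# the "Kernel driver in use:" line shows nvidia, vfio-pci, or nothing for each card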

 

May I ask what connects to your server that often through SSH? Your syslog is full of messages that something connected.

Link to comment

The connections are all from gravity-sync. I'm running Pi-hole in a Docker container as well as on an actual Raspberry Pi and using gravity-sync to keep them in sync, but it runs way too often and I am still working on that.

 

The problem is that it doesn't work with the VM. It gives the qemu error, which from my (possibly inaccurate) googling is related to a memory allocation issue, which is similar to what you said was the probable cause earlier.

 

When I don't bind it to VFIO, it doesn't appear in the Nvidia Driver plugin either; only the first card appears there. If I VFIO-bind the first card, then neither card appears in the Nvidia Driver plugin.

Everything points to a motherboard issue, but I don't want to just throw money at another one without having an idea of what the problem actually is.

Link to comment
32 minutes ago, alturismo said:

OK, and have you also added a vBIOS to the VM?

I have tried without one. I have also tried with a vBIOS from TechPowerUp, and I have tried following this guide:

 

When I try to dump the vBIOS, it says it succeeded and creates a ROM file, but the script also throws the "qemu unexpectedly closed the monitor" error, so I don't think it is actually working. Regardless, I tried with that ROM output and still got the same result: "qemu unexpectedly closed the monitor".
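(For reference, pointing the VM at a dumped ROM happens in the VM's XML; the path and filename below are examples only:)

# virsh edit <vm-name>, then inside the GPU's <hostdev> block add:
#   <rom file='/mnt/user/isos/vbios/gtx1060-dump.rom'/>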

 

I am going to try unbinding it from VFIO just to show that the second card still does not appear in nvidia-smi.
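(A quick check once it's unbound; -L simply lists the GPUs the driver sees:)

nvidia-smi -L
# prints one line per GPU the Nvidia driver has claimed, e.g. "GPU 0: Quadro P400 (UUID: ...)"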

Link to comment
38 minutes ago, yakboyslim said:

Everything points to a motherboard issue, but I don't want to just throw money at another one without having an idea of what the problem actually is.

I found various reports on the Nvidia forums about the same issue; it may be related to a BIOS bug.

 

Have you tried to Enable/Disable Above 4G Decoding?

Link to comment
13 hours ago, ich777 said:

Can you maybe try to Disable Above 4G Decoding and see if it works?

 

I can only think of a memory allocation issue with the newer kernel.

 

If you have time, please give me the output from:

dmesg | grep "root bus resource"

Once with 4G Decoding enabled and once with it disabled, please.

 

My system doesn't boot when 4G Decoding is enabled, and I have to reset the BIOS to get things working again.

 

I did the update from 6.9.2 to 6.10-rc2 and the primary K1200 disappears. Diagnostics and dmesg output attached.

 

 

busres-4gdisable.png

av4-diagnostics-20211110-0917.zip

Link to comment

Nvidia is advising that I try driver version 470.82.00. This is not an option for me in the Unraid Nvidia Driver plugin. They included some instructions for manually installing the drivers on Slackware and disabling nouveau. Is it advisable to follow those instructions, and will it play nice with the Unraid Nvidia Driver plugin?

 

Quote

Please try the following: Nvidia manual driver installation.

 

First please remove previous driver installation attempts.

If you installed using packages, remove with the package removal tool and reboot.

If you installed manually, remove with: sudo /usr/bin/nvidia-uninstall and reboot.

 

Then use the instructions here: https://docs.slackware.com/howtos:hardware:proprietary_graphics_drivers 

 

 

search for 'Installation via the nVIDIA Binary'

 

Follow the instructions to disable nouveau by creating a text file in /etc/modprobe.d/, and reboot before installing.
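(The file in question typically looks something like this; the filename is arbitrary:)

# /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0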

 

Link to comment
1 hour ago, Wingede said:

My system doesn't boot when 4G Decoding is enabled, and I have to reset the BIOS to get things working again.

Then it seems there is something wrong with the BIOS if it doesn't boot when Above 4G Decoding is turned on, because that is exactly what 4G Decoding is for... Some BIOSes also describe this setting as "Above 4G Decoding / Crypto Mining".

Link to comment
4 minutes ago, ich777 said:

Then it seems there is something wrong with the BIOS if it doesn't boot when Above 4G Decoding is turned on, because that is exactly what 4G Decoding is for...

 

Admittedly, I'd never had that setting enabled until you mentioned it. I did perform a BIOS update from F2 to F5 as part of this troubleshooting. With that setting enabled it did expose a resizable BAR size setting, which I left alone, but the system didn't boot. I had no video output from the iGPU; I have left my dongles at work (which I can't get to due to lockdown) in case it somehow switched video output to the cards. I did leave the system powered on to see if Unraid came up, but after 5 minutes I still had no ping response.

 

I will update my ticket with Gigabyte to see if they can share something. 6.9.2 seems fine at the moment.

 

 

Edited by Wingede
Link to comment
8 minutes ago, yakboyslim said:

I am on 6.9.2. You rock, thanks so much for all the help

Please go to the Nvidia Driver plugin page, where you should now see driver 470.82.00. If not, wait at least 5 minutes, then go to the Nvidia Driver plugin page again and it should appear in the list.

 

I would recommend these steps:

  1. Select driver 470.82.00
  2. Click "Update"
  3. Then click on "Download" (keep in mind this can take some time; wait for it to say that it's finished)
  4. After the download has finished, reboot your server
Link to comment
