[Plugin] Nvidia-Driver


ich777

Recommended Posts

Not sure how to troubleshoot. Previously I had an Nvidia NVS 300 installed, tried the plugin, realized it didn't work with that card, and uninstalled the plugin. Much later I installed a GTX 1050. The plugin recognized the card and the dashboard plugin works; I added the three things to a couple of Dockers (Handbrake and Folding@home), but neither one is using the card. I get no errors in the Docker logs. I did disable and re-enable Docker and rebooted a few times, and I changed the driver to production. Now some of my reboots come up dirty with the array: when I first installed the card it rebooted during the first boot-up (which resulted in a parity check), and when I changed the driver to production from latest, it did the same thing. What is my next step in diagnosis? I'm not sure if step 7 (from the first post, about the daemon) has been completed, but I don't know how to check it.

cvg02-diagnostics-20211103-0604.zip

Link to comment
1 hour ago, wildfire305 said:

Now some of my reboots come up dirty with the array.

This could have various causes and may not be related to the plugin at all.

 

1 hour ago, wildfire305 said:

handbrake

Which Handbrake container are you using? Keep in mind that the container, or more precisely the application inside it, has to be built with NVENC support for this to work; as far as I know, the Handbrake container currently available in the CA App doesn't support NVENC at all.

 

1 hour ago, wildfire305 said:

folding at home

Which container are you using? I've never used Folding@home, but maybe I can try it and see if it works on my system.

 

1 hour ago, wildfire305 said:

When I changed the driver to production from latest, it did the same thing. What is my next step in diagnosis?

I just tried switching the drivers back and forth on my development server, with reboots in between, and I don't get any parity checks at all. I'm currently on unRAID v6.10.0-rc2, but that shouldn't matter at all.

 

1 hour ago, wildfire305 said:

I'm not sure if step 7

If you can pass the argument '--runtime=nvidia' in the Extra Parameters of the template and the container still starts up fine, then everything went well.

You could also run 'nvidia-smi' as root in a container console window and check the output (keep in mind that this may not work in every container).
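
For example, from the unRAID terminal something like this should do it (the container name 'HandBrake' is only an example here, use whatever your container is actually called):

# open nvidia-smi as root inside the running container
docker exec -it -u root HandBrake nvidia-smi

If that prints the GTX 1050 and the driver version, the card itself is visible inside the container.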

Link to comment
1 hour ago, ich777 said:

This could have various causes and may not be related to the plugin at all.

 

Which Handbrake container are you using? Keep in mind that the container, or more precisely the application inside it, has to be built with NVENC support for this to work; as far as I know, the Handbrake container currently available in the CA App doesn't support NVENC at all.

Well, that explains it; I am using the default from CA. How do I change that? What is recommended? I've also used Tdarr on this server, but haven't tried the Nvidia part on that Docker yet. Do I just need to add the commands to the client?

1 hour ago, ich777 said:

 

Which container are you using? I've never used Folding@home, but maybe I can try it and see if it works on my system.

The default Folding@home container from CA. Maybe it doesn't support it either.

1 hour ago, ich777 said:

 

I just tried switching the drivers back and forth on my development server, with reboots in between, and I don't get any parity checks at all. I'm currently on unRAID v6.10.0-rc2, but that shouldn't matter at all.

 

If you can pass the argument '--runtime=nvidia' in the Extra Parameters of the template and the container still starts up fine, then everything went well.

You could also run 'nvidia-smi' as root in a container console window and check the output (keep in mind that this may not work in every container).

The nvidia-smi command yields results, and adding the Extra Parameters line still allows the Docker to start.

 

It seems my tests may have been invalid and I need to try different Dockers. I have Plex and Jellyfin running; I may try those, or a Tdarr node, or Handbrake again if anyone can point me to how to choose a different Docker for it.

 

Thanks for your help and taking the time to be thorough. I'm not sure why the array was coming up dirty. The server is headless and I wasn't watching what was actually happening on a screen; I could just tell that it was getting halfway started and then rebooting. Do you know if I need to enable UEFI for these cards? I currently have it set up as legacy. It is a Dell T5810 workstation.

Link to comment
25 minutes ago, wildfire305 said:

Well, that explains it; I am using the default from CA. How do I change that?

The maintainer of the container has to change that; Handbrake, for example, needs to be compiled with NVENC support.

Here is an alternative to the container in the CA App: Click (as far as I know it should be compatible with the existing template from the CA App, you only need to switch the repository).

 

25 minutes ago, wildfire305 said:

I've also used Tdarr on this server, but haven't tried the Nvidia part on that Docker yet. Do I just need to add the commands to the client?

Sorry, I'm not very familiar with Tdarr... What exactly do you mean by "commands to the client"?

 

25 minutes ago, wildfire305 said:

The default Folding@home container from CA. Maybe it doesn't support it either.

Have you looked at the description? Maybe it mentions what you have to do.

 

Please also see the second post for what you need to add to the template.
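
Roughly speaking, those entries correspond to something like this docker run command (only a sketch; in the unRAID GUI they go into Extra Parameters and two container Variables, and the GPU UUID and image name here are placeholders):

# Extra Parameters:  --runtime=nvidia
# Variable:          NVIDIA_VISIBLE_DEVICES     -> your GPU UUID from the plugin page (or 'all')
# Variable:          NVIDIA_DRIVER_CAPABILITIES -> all
docker run -d \
  --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES='GPU-xxxxxxxx' \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  your-image:latest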

 

25 minutes ago, wildfire305 said:

The nvidia-smi command yields results, and adding the Extra Parameters line still allows the Docker to start.

In the container console? If yes, then it basically works in the container, but if the application was compiled without NVENC/CUDA support it can't make use of it.

Hope that makes sense to you.

 

I also got a few reports about Plex where the container didn't work on the first try; one user solved the issue by deleting the container and installing it from scratch with the necessary Nvidia entries.

 

The same also happened with Emby a few posts above: Click

 

25 minutes ago, wildfire305 said:

I have Plex and Jellyfin running

These should support Nvidia transcoding; I also run Plex (official repo) and Jellyfin (my repo) and transcoding works just fine.
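
A quick way to check whether a transcode is really hitting the GPU is to watch nvidia-smi on the host while something is playing, for example:

# refresh every second; encoder/decoder usage and the transcoder process should show up
watch -n 1 nvidia-smi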

 

25 minutes ago, wildfire305 said:

Do you know if I need to enable UEFI for these cards? I currently have it set up as legacy. It is a Dell T5810 workstation.

No, you don't need UEFI turned on; I also run my server in Legacy (CSM) mode and it works just fine. If anything, with older drivers I sometimes had issues in UEFI mode where the card wasn't properly recognized by the driver itself.

Anyway, both UEFI and Legacy (CSM) should work fine.

Link to comment
1 hour ago, ich777 said:

The maintainer of the container has to change that; Handbrake, for example, needs to be compiled with NVENC support.

Here is an alternative to the container in the CA App: Click (as far as I know it should be compatible with the existing template from the CA App, you only need to switch the repository).

Changed it and Handbrake worked perfectly: 700 fps on an H.264 encode! Only 45% loaded, so it can probably do two or more encodes at a time!

1 hour ago, ich777 said:

 

Sorry, I'm not very familiar with Tdarr... What exactly do you mean by "commands to the client"?

Typo, I meant worker node, not client. I'll try adding the three parameters to the node Docker and see if it works. Tdarr is used for massive multi-node (or single-node) transcoding. I've used it for two weeks and gained 2 TB by switching some things from H.264 and AVI to H.265. You can use the server Docker and nodes (Windows/Mac/Linux/Docker) to distribute the processing across your network. We now have three GTX-class cards in the network that can make the process go even faster. The beauty of Tdarr is that it fully manages the workload, handles any failures and retries, and tests the output to confirm it did a complete job. That's a lot of work that I don't have to manage.

 

1 hour ago, ich777 said:

 

Have you looked at the description? Maybe it mentions what you have to do.

 

Please also see the second post for what you need to add to the template.

 

In the container console? If yes, then it basically works in the container, but if the application was compiled without NVENC/CUDA support it can't make use of it.

Hope that makes sense to you.

Understood, thanks for the clarification. I don't think I ran the nvidia-smi command in the container console; I ran it from the main terminal. If it doesn't work in the container console, does that mean the container probably doesn't support it (or that I don't have the container configured properly)?

1 hour ago, ich777 said:

No, you don't need UEFI turned on; I also run my server in Legacy (CSM) mode and it works just fine. If anything, with older drivers I sometimes had issues in UEFI mode where the card wasn't properly recognized by the driver itself.

Anyway, both UEFI and Legacy (CSM) should work fine.

Good to know. My card came from a UEFI system; maybe it was just grumpy on its first startup in a new system. I'll try some more reboots, hook up a screen, and see if I am missing something.

Link to comment
18 minutes ago, wildfire305 said:

I ran it from the main terminal. If it doesn't work in the container console, does that mean the container probably doesn't support it (or that I don't have the container configured properly)?

No, it's only an indicator that the card is properly recognized in the container if the command is available; but also keep in mind that you may get a different error when running nvidia-smi in the container because of missing libraries.

In other words, if nvidia-smi is available in the container, you've passed the '--runtime=nvidia' argument through properly.

 

It's a question I always ask because sometimes this argument is missing from the Extra Parameters in the container template.
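
If you want to double check from the unRAID terminal whether the argument actually made it in, something like this should tell you (container name is again only an example):

# prints 'nvidia' if the container was created with --runtime=nvidia, otherwise 'runc'
docker inspect --format '{{.HostConfig.Runtime}}' HandBrake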

Link to comment

Looks like it's working perfectly in Jellyfin. I appreciate all the help; I was just giving it a bad test (Dockers I didn't know didn't support it). It appears to be working fine: CPU load is low and the GPU is at about 8% transcoding an H.265 playback over the internet to my Android tablet. I'm a happy camper. Now I get to play with more power. Got room in this old tank for a second GPU...

Link to comment

The card is working great. I got it to work in Folding@home by modifying the config.xml file. I really wanted to try something that I knew would push this card's thermals, since it is used. However, out of two more tries, changing the driver resulted in a dirty-array reboot both times, while regular rebooting works fine. It could just be my configuration or card: I have an MSI GeForce GTX 1050 2GB. I can live with the problem, but if you want to know more or diagnose it, I'm willing to provide as much information and testing as needed. Otherwise, everything is awesome.
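
For reference, GPU folding is usually enabled in the Folding@home config.xml by turning on the gpu option and adding a GPU slot, roughly like this (only a sketch from memory; the slot id depends on your existing config):

<!-- enable GPU folding and add a dedicated GPU slot -->
<gpu v='true'/>
<slot id='1' type='GPU'/>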

Link to comment
2 hours ago, wildfire305 said:

It could just be my configuration or card: I have an MSI GeForce GTX 1050 2GB. I can live with the problem, but if you want to know more or diagnose it, I'm willing to provide as much information and testing as needed. Otherwise, everything is awesome.

I will try changing the drivers on my main rig and will report back if I have dirty shutdowns. Please give me a few hours, I can't reboot it instantly...

Link to comment
40 minutes ago, wildfire305 said:

No rush

Tested it now and had no parity checks whatsoever.

 

Downgraded from 495.44 to 470.82.00:

[screenshot: 470_82_00.png]

 

Then downgraded from 470.82.00 to 470.74:

[screenshot: 470_74.png]

 

And lastly upgraded again from 470.74 to 495.44:

[screenshot: 495_44.png]

 

Here is also the Parity History (please ignore the canceled ones that I did about half a month ago, I did that by accident... :D ) :

[screenshot: parity check history]

 

 

Are you sure that you don't have an SSH session or something similar open? That could also lead to an unclean shutdown.

Link to comment
20 minutes ago, ich777 said:

Are you sure that you don't have an SSH session or something similar open? That could also lead to an unclean shutdown.

Open, but not running anything... Will that do it? If so, that could be the answer to all of my unclean shutdowns. I see the message that the server is shutting down, and then I just let the PuTTY session fail.

 

I'm a bit of a noob: if I comb the logs, what would be a good search term for finding the reason for the unclean shutdown? I set up the syslog server last week, so I should have it all.

Link to comment
Open, but not running anything... Will that do it? If so, that could be the answer to all of my unclean shutdowns. I see the message that the server is shutting down, and then I just let the PuTTY session fail.

I'm a bit of a noob: if I comb the logs, what would be a good search term for finding the reason for the unclean shutdown? I set up the syslog server last week, so I should have it all.
In the past I experienced this when I left an SSH session open, and the shutdown also took forever.
Try to close every SSH session to the server before shutting it down. I think the second parity check that you see in the screenshot was also caused by an SSH session that I left running while rebooting or shutting down the server.

But as you can see from the screenshots, no parity check was triggered between the reboots.
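
As a search term, something along these lines against the syslog should narrow it down (the path is only an example for a remote-syslog file, point it at wherever your syslog server writes):

# look for unclean-shutdown, unmount and forced-shutdown messages around the reboot
grep -iE 'unclean|unmount|shutdown|forced' /mnt/user/syslog/syslog-*.log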

Sent from my C64


Link to comment
7 hours ago, Skyshroud said:

can we ever expect a release that would allow us to use the same Nvidia graphics card for both VMs and Docker containers?

I think I don't understand what you mean by that.

Do you want to use it in a VM and in a container at the same time? If yes, that is not possible: once you hand the card over to a VM, the VM has full access to and control over the card, and it is basically no longer available on the host, i.e. unRAID.

 

It is possible to use it in a container while the VM is turned off, but very bad things can happen if you start the VM while a container is using the card, or if a VM is using the card and a container tries to use it.

The VM crashes, Docker crashes, or, most likely of all, the server locks up entirely.

 

That's why I don't recommend using one card for both VMs and Docker.

There is nothing I can do about that, since this is a limitation of the hardware/software, not of the drivers or the plugin itself.

Link to comment
On 11/6/2021 at 2:39 AM, ich777 said:

I think I don't understand what you mean by that.

Do you want to use it in a VM and in a container at the same time? If yes, that is not possible: once you hand the card over to a VM, the VM has full access to and control over the card, and it is basically no longer available on the host, i.e. unRAID.

 

It is possible to use it in a container while the VM is turned off, but very bad things can happen if you start the VM while a container is using the card, or if a VM is using the card and a container tries to use it.

The VM crashes, Docker crashes, or, most likely of all, the server locks up entirely.

 

That's why I don't recommend using one card for both VMs and Docker.

There is nothing I can do about that, since this is a limitation of the hardware/software, not of the drivers or the plugin itself.

 

I figured as much - just thought I would ask... wishful thinking, I suppose.

 

Thank you very much for your reply, and for all of the hard work you do here. It is greatly appreciated.

Link to comment

Another update: Windows recognises the card but with error 43; I updated the drivers and reflashed the card and it's still not working. Put it in the other slot and it's working, but this doesn't help long term.

 

 

Update: I noticed it was recognising the card in the second slot... removing that card and putting it in the primary slot, the Nvidia driver isn't recognising it. Did a full BIOS reset to defaults; lspci is showing it but the Nvidia driver isn't working... thinking it is something to do with the slot itself, since the card is not faulty?

 

After a random crash today on my unRAID server, I powered back up to find one of my graphics cards not being recognised. I've had this setup running for several months without any issues whatsoever.

 

The Nvidia plugin is showing only a single card, but lspci and unRAID's System Devices show both. nvidia-smi only shows one card. I have swapped in a spare card, as I originally thought the card had failed, but that's not the case. Updated the BIOS to the latest as of today, also with no success.

 

Some hardware background

Motherboard: Gigabyte Z590 UD AC (F2 original bios, updated to F5)

2 x Nvidia Quadro K1200s

Unraid Version: 6.10.0-rc2 (has been working without any issues for several weeks on this version and rc1 prior to that)

Nvidia driver version: Was running latest (v495.44, tried downgrading to v470.82.00) 

 

Nvidia plugin output (attached)

System tools/devices (attached card 1 and 2)

 

I'm not sure what caused the crash; this has happened twice in the last couple of weeks, but this setup has been fine for months.

 

/proc/(etc) only shows a single card, which matches what the Nvidia plugin shows.

 

Any suggestions would be greatly appreciated.

 

 

 

nvidia-plugin.png

card1.png

card2.png

Edited by Wingede
Link to comment
3 hours ago, Wingede said:

Another update: Windows recognises the card but with error 43; I updated the drivers and reflashed the card and it's still not working. Put it in the other slot and it's working, but this doesn't help long term.

So the second K1200 is not working in Windows, if I understand that correctly? Have you tried it in a VM or in another physical computer?

 

3 hours ago, Wingede said:

The Nvidia plugin is showing only a single card, but lspci and unRAID's System Devices show both.

Yes, but this is from your logs:

Nov  8 20:13:22 aV4 kernel: nvidia 0000:01:00.0: enabling device (0000 -> 0003)
Nov  8 20:13:22 aV4 kernel: nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
Nov  8 20:13:22 aV4 kernel: NVRM: The NVIDIA GPU 0000:01:00.0
Nov  8 20:13:22 aV4 kernel: NVRM: (PCI ID: 10de:13bc) installed in this system has
Nov  8 20:13:22 aV4 kernel: NVRM: fallen off the bus and is not responding to commands.
Nov  8 20:13:22 aV4 kernel: nvidia: probe of 0000:01:00.0 failed with error -1

 

Seems like your card is not responding; from the driver's point of view it has basically fallen off the bus.
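
You can also cross check that from the unRAID terminal with something along these lines:

# both K1200s should show up on the PCI bus
lspci | grep -i nvidia
# only the cards the driver bound successfully are listed here
nvidia-smi -L
# look for more NVRM / 'fallen off the bus' messages
dmesg | grep -i nvrm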

 

3 hours ago, Wingede said:

I've had this setup running for several months without any issues whatsoever.

Also keep in mind that hardware can fail over time. I don't want to say that the card is defective, but it is something to keep in mind.

 

3 hours ago, Wingede said:

Nvidia driver version: Was running latest (v495.44, tried downgrading to v470.82.00) 

This should make no difference, because your card is supported by the latest drivers anyway.

 

3 hours ago, Wingede said:

Any suggestions would be greatly appreciated.

My suggestion would be to put the card in another computer and see whether it works there. Keep in mind that you have to install the Nvidia drivers on that other machine: the basic display output works most of the time without a problem, and only after installing the driver and putting a 3D load on it will you see whether the card is fully working.

Link to comment
50 minutes ago, ich777 said:

 

My suggestion would be to put the card in another computer and see whether it works there. Keep in mind that you have to install the Nvidia drivers on that other machine: the basic display output works most of the time without a problem, and only after installing the driver and putting a 3D load on it will you see whether the card is fully working.

 

Thanks for going through my logs.  

 

With the card in that primary PCIe slot, both Windows and unRAID can always see it, but it's not functional: Windows shows error 43 and you've seen the picture from unRAID. I have three of these cards and thought it might be a faulty card (as they are getting old). Moving that initial card to another slot, everything worked fine, but I have the issue with that primary slot regardless of which card is used.

 

Going down to just a single K1200 in the system hasn't made any difference, even with the BIOS update and reset to defaults etc. Moving that particular card to any other slot, everything works as expected (tested some transcoding operations).

 

I think it is pointing to a motherboard issue (this is the second one since purchasing; bad luck, as the first one had a faulty LAN controller on it... well, not faulty, dead is more appropriate). I just hope I haven't pushed the motherboard past its limits by using two K1200s (I would think not).

 

Regards, and thanks again for parsing my logs.

Link to comment
10 minutes ago, Wingede said:

I think it is pointing to a motherboard issue (this is the second one since purchasing; bad luck, as the first one had a faulty LAN controller on it... well, not faulty, dead is more appropriate). I just hope I haven't pushed the motherboard past its limits by using two K1200s (I would think not).

I only have a few suggestions that you can try:

  • Enable Above 4G decoding in the BIOS
  • Change from UEFI boot to Legacy, or vice versa, depending on what mode you are booting with now
  • Manually set the PCIe generation for slot one to Gen3 or Gen2 (this would not make much of a difference in performance for most cards anyway)

 

I also saw that you are using a Gigabyte motherboard, of which I am not a huge fan because they have given me trouble in the past too, and I have recently read a few posts about faulty motherboards, HPET timers not working (or not working correctly), various BIOS issues... :/

Link to comment
52 minutes ago, ich777 said:

I only have a few suggestions that you can try:

  • Enable Above 4G decoding in the BIOS
  • Change from UEFI boot to Legacy, or vice versa, depending on what mode you are booting with now
  • Manually set the PCIe generation for slot one to Gen3 or Gen2 (this would not make much of a difference in performance for most cards anyway)

 

I also saw that you are using a Gigabyte motherboard, of which I am not a huge fan because they have given me trouble in the past too, and I have recently read a few posts about faulty motherboards, HPET timers not working (or not working correctly), various BIOS issues... :/

 

Tried the above changes and no difference. I will go through the RMA process with the supplier again ;(

 

What motherboard manufacturer do you recommend (for future reference)?

Link to comment
3 minutes ago, Wingede said:

What motherboard manufacturer do you recommend (for future reference)?

This is completely subjective, I have to say, but I have never had problems with ASUS or MSI.

 

4 minutes ago, Wingede said:

Tried the above changes and no difference.

Maybe it's also a BIOS or hardware compatibility issue...

Link to comment
