[Plugin] Nvidia-Driver


ich777


I have a 1050 Ti installed, and the latest driver installed. I'll do another diagnostic and post it; I took that one while a crash was happening, and I'd be willing to bet that might be why that was missing. The iGPU in the system is not nearly as fast as the 1050 Ti, at least in my testing. I have almost 30TB of data I need to transcode through, so I was looking for the higher FPS while transcoding.


In lspci.txt I can see the GPU there.

 

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] [10de:1c82] (rev a1)
        Subsystem: ASUSTeK Computer Inc. GP107 [GeForce GTX 1050 Ti] [1043:85d1]
        Kernel driver in use: nvidia
        Kernel modules: nvidia_drm, nvidia

 

But I also see in the log that it's unable to bind the device? I'm confused.
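One quick way to double-check which driver has actually claimed the card is to pull the "Kernel driver in use" field out of the lspci output. A small sketch, using the output line quoted above as a canned string (on a live system you'd pipe `lspci -nnk -s 01:00.0` into the same grep):

```shell
# Sketch: extract the "Kernel driver in use" field from lspci output.
# If it prints "vfio-pci" instead of "nvidia", the card is bound to VFIO
# and the plugin won't see it. (Line condensed from the post above.)
lspci_out='01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] [10de:1c82] (rev a1) Kernel driver in use: nvidia'
driver=$(printf '%s\n' "$lspci_out" | grep -o 'Kernel driver in use: [a-z_-]*' | awk '{print $NF}')
echo "$driver"   # nvidia
```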

 

 

1 hour ago, Tithonius said:

But I also see in the log that it's unable to bind the device? I'm confused.

 

 

You may want to delete the VFIO bind. It looks like you had a VFIO bind active from another device (possibly another GPU from earlier times?) for VM passthrough.

 

A VFIO bind should only be activated for devices you want to pass through to a VM...

46 minutes ago, alturismo said:

 

You may want to delete the VFIO bind. It looks like you had a VFIO bind active from another device (possibly another GPU from earlier times?) for VM passthrough.

 

A VFIO bind should only be activated for devices you want to pass through to a VM...

I did have an AMD GPU I did some testing with, but I never did anything more than plug it in, really.


Okay, did that, but would that really be the reason for the 500 internal server error?

 

Also, just to clarify, I have uninstalled the app, rebooted, reinstalled the app, and rebooted again, and I haven't had the error since, so maybe it was just a weird thing. But I was just curious if there was anything in the diagnostic that was alarming.

11 minutes ago, Tithonius said:

But I was just curious if there was anything in the diagnostic that was alarming.

I don't see anything concerning in the last Diagnostics; the card is properly recognized and should be ready for use.

If this happens again please issue:

diagnostics

from the terminal and post the Diagnostics here afterwards (the Diagnostics will be saved in the logs folder on your USB boot device).
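The zips in that logs folder are timestamped, so a glob plus `ls -t` picks out the newest one. A sketch, using a temp directory and made-up filenames to stand in for /boot/logs:

```shell
# Sketch: `diagnostics` drops a timestamped zip in /boot/logs on the flash
# drive; `ls -t | head -n1` grabs the newest. A temp dir and hypothetical
# filenames stand in for the real thing here.
LOGS=$(mktemp -d)        # stand-in for /boot/logs
touch -t 202303010900 "$LOGS/tower-diagnostics-20230301-0900.zip"
touch -t 202303020900 "$LOGS/tower-diagnostics-20230302-0900.zip"
newest=$(ls -t "$LOGS"/*-diagnostics-*.zip | head -n1)
echo "$newest"           # prints the path of the 20230302 zip (the newest)
```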

 

It could also be the case that something else was crashing your server, but it seems rather strange to me that you were still able to connect through SSH; usually everything crashes and the server needs a hard reboot.

The next thing could also be that the card dropped from the bus, but these are all assumptions.


Hello,

I'm having trouble getting the plugin to recognize my GTX 960.

I think there might be a kernel driver conflict according to the logs, but I can't figure out how to get rid of the conflicting driver. What can I do?

 

I think the existing driver is vfio, since I'm running Unraid as a VM under Proxmox.

 

[log screenshot attached]

5 minutes ago, alex01763 said:

Here are my diagnostics.

From what I know, you may have to do some configuration on Proxmox if you are virtualizing Unraid, but I can't help with that. However, you have bound your card to VFIO for whatever reason in your syslinux.conf:

BOOT_IMAGE=/bzimage initrd=/bzroot intel_iommu=on iommu=pt vfio-pci.ids=10de:1401,10de:0fba video=efifb:off

Please remove the part:

vfio-pci.ids=10de:1401,10de:0fba video=efifb:off

(I would recommend binding a PCI device through the System Devices menu in Unraid and not through the syslinux.conf)

 

Please note that if you bind it to VFIO the card won't be recognized by the plugin anyway... If you want to use it with the plugin, please unbind the card from VFIO.
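The edit boils down to stripping those two parameters from the kernel append line. A sketch of the equivalent text change (the append line is the one quoted above; in practice you'd make the edit in the syslinux configuration on the flash device rather than with sed):

```shell
# Sketch: remove the vfio-pci.ids=... and video=efifb:off parameters from
# the kernel append line (same line as quoted above).
append='BOOT_IMAGE=/bzimage initrd=/bzroot intel_iommu=on iommu=pt vfio-pci.ids=10de:1401,10de:0fba video=efifb:off'
cleaned=$(printf '%s\n' "$append" | sed -e 's/ vfio-pci\.ids=[^ ]*//' -e 's/ video=efifb:off//')
echo "$cleaned"   # BOOT_IMAGE=/bzimage initrd=/bzroot intel_iommu=on iommu=pt
```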

 

10 minutes ago, alex01763 said:

also, I got this alert too:

Ignore this for now; I don't know why it alerted you about that, because from what I see the latest driver was downloaded.

On 2/24/2023 at 10:41 AM, Doogs said:

Hey ich - Following up with you. I've removed the card and it (Unraid) has been stable. The nVidia M60 is powered by an 8-pin CPU cable, and I don't think the stock 8-pin CPU cable will suffice, since the official adapter takes 2 PCI-E cables and 'must carry 200+W....'.

TL;DR - I don't think it's a plugin issue.  I'll follow up again once I get the correct adapter.  I can only surmise the card was working previously with the stock 8-pin CPU cable because my previous build was 7 drives and my latest build has double that.  As always, thank you for your continued support and prompt responses!

 

Cheers!

Hey ich,

 

Installed the card with the correct power cable and I'm still having issues.  The last thing I can do is try another card - I have access to 4 more of these.  Any suggestions on what else I could do?

 

EDIT - I'll pop the card back in, wait for the issue to happen and grab diag


So my problem has progressed now. As a recap, I have a 1050 Ti installed with the latest updates to everything in my server. All drivers, OS updates and such. I have a problem where randomly the server will work fine for days at a time, then just crash. I used to just have the GUI crash, but now am getting entire server lockups. I enabled syslog logging earlier, but when the problem happened again I didn't see anything in the syslog that was concerning, and the helpful peeps over on the unRAID discord didn't see anything either. The diagnostic doesn't look any different than it did last time... I can't get a diagnostic while it's crashing anymore due to the server being hard locked. When I reboot and do diagnostics everything goes back to being normal. This has only been happening ever since I added the GPU to the system so it has to be related to this somehow. Is there another way for me to get different logs or something that will help me narrow down the issue?


The GPU was pulled from a working gaming rig, has never been abused, and doesn't seem to be failing. Its transcodes all look great, and I'm having no other issues with it while it is working.

 

The card is also only showing a Gen 3 x2 connection in GPU Statistics even though it is installed in a Gen 3 x16 slot... could that be pointing to a potential problem? The only other card in the system is my LSI HBA, so it can't be a CPU lane issue; there should be plenty available. I may try reseating or moving the GPU to a different slot, but I think that may be the only slot on that board it will fit in.

 

I guess if it passes the memtest that is running atm, then I'll try reseating the GPU to see if that's the problem. I already replaced all the memory in the system and upgraded to 64GB, so it's not a memory space issue, and I replaced all the RAM after these problems started occurring as a troubleshooting step. I am running a memtest just to be sure, but early signs are showing that the memory is fine.

 

I'm kinda at my wits end, and really am having trouble as I have never had problems with this machine until I added the gpu. It's super discouraging... :/

 

When I get home from work hopefully the memtest will be done and I'll boot the system again and do another diagnostic as well as post my syslog just for y'all to look over. I'm also going to make sure that some idle power stuff in bios isn't the problem as I see that can be an issue for others here. 

 

idk tbh I'm kinda just guessing here...

1 hour ago, Tithonius said:

So my problem has progressed now. As a recap, I have a 1050 Ti installed with the latest updates to everything in my server. All drivers, OS updates and such. I have a problem where randomly the server will work fine for days at a time, then just crash. I used to just have the GUI crash, but now am getting entire server lockups. I enabled syslog logging earlier, but when the problem happened again I didn't see anything in the syslog that was concerning, and the helpful peeps over on the unRAID discord didn't see anything either. The diagnostic doesn't look any different than it did last time... I can't get a diagnostic while it's crashing anymore due to the server being hard locked. When I reboot and do diagnostics everything goes back to being normal. This has only been happening ever since I added the GPU to the system so it has to be related to this somehow. Is there another way for me to get different logs or something that will help me narrow down the issue?

Does your card disappear from nvidia-settings as well?  My card does and if I leave it that way, my server will entirely lock up as well.  Trying to see if there is a pattern here.

7 hours ago, Doogs said:

The last thing I can do is try another card - I have access to 4 more of these.  Any suggestions on what else I could do?

Not really, I can only think of a hardware compatibility issue... Other than that, if you made sure that you are on the latest BIOS version and enabled Above 4G Decoding and Resizable BAR support, I don't have any more recommendations besides booting in legacy mode (CSM).

You can try to set the PCIe Gen to Gen2 or even Gen3 in the BIOS manually to see if that makes a difference (that won't affect performance in the case of these cards, at least not noticeably).

3 hours ago, Tithonius said:

I didn't see anything in the syslog that was concerning, and the helpful peeps over on the unRAID discord didn't see anything either. The diagnostic doesn't look any different than it did last time... I can't get a diagnostic while it's crashing anymore due to the server being hard locked. When I reboot and do diagnostics everything goes back to being normal.

This is caused by your server crashing really hard; even the syslog server won't work anymore when the crash happens. Have you tried connecting a monitor to the GPU to see the console output? The only thing you can really do to capture what's happening is to connect a screen, wait for a crash to happen, and then take a picture of the output.

 

Maybe it's related to MACVLAN; do you have Docker containers on br0 and Host Access enabled? Maybe try switching to IPVLAN in the Docker settings (with Advanced View enabled).

 

3 hours ago, Tithonius said:

The card is also only showing a Gen 3 x2 connection in GPU Statistics even though it is installed in a Gen 3 x16 slot... could that be pointing to a potential problem?

Usually that should be no issue.

 

3 hours ago, Tithonius said:

I'm kinda at my wits end, and really am having trouble as I have never had problems with this machine until I added the gpu. It's super discouraging... :/

Yes I know, many people have such issues but mostly because of hardware incompatibility issues or a bad implementation in the BIOS.

 

Maybe try to set the PCIe Gen to a fixed value instead of leaving it at Auto if you have the option for that.

3 hours ago, ich777 said:

This is caused by your server crashing really hard; even the syslog server won't work anymore when the crash happens. Have you tried connecting a monitor to the GPU to see the console output? The only thing you can really do to capture what's happening is to connect a screen, wait for a crash to happen, and then take a picture of the output.

 

 

 

Okay, so I have a screen attached now and I can see the login prompt. I assume that I just log in to the non-GUI console on the screen and then just leave it? Or do I have to have it show me console output somehow?

7 hours ago, Tithonius said:

Okay, so I have a screen attached now and I can see the login prompt

I completely forgot that you have to execute this command so that the screen doesn't go blank, or better said, go to sleep:

setterm --blank 0

 

7 hours ago, Tithonius said:

I assume that I just log in to the non-GUI console on the screen and then just leave it?

You don't have to be logged in, just log in once, execute the command and maybe log out again (but you don't have to log out strictly speaking).
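Since the setting doesn't survive a reboot, one option is to append the command to Unraid's go script so it runs at every boot. A sketch of the idempotent-append pattern, with a temp file standing in for /boot/config/go (whether it takes effect from the go script can depend on which console it runs against, so treat this as a sketch):

```shell
# Sketch: run `setterm --blank 0` at every boot by appending it (once) to
# the go script. A temp file stands in for /boot/config/go here.
GO_FILE=$(mktemp)
grep -qxF 'setterm --blank 0' "$GO_FILE" || echo 'setterm --blank 0' >> "$GO_FILE"
grep -qxF 'setterm --blank 0' "$GO_FILE" || echo 'setterm --blank 0' >> "$GO_FILE"  # re-run adds nothing
cat "$GO_FILE"   # setterm --blank 0   (exactly once)
```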

11 hours ago, ich777 said:

I completely forgot that you have to execute this command so that the screen doesn't go blank, or better said, go to sleep:

setterm --blank 0

 

You don't have to be logged in, just log in once, execute the command and maybe log out again (but you don't have to log out strictly speaking).

Okay, so last night I set the PCIe gen to 3 instead of auto, reseated the GPU in the slot, and reseated all the memory. It passed a 13-hour memory test with 0 errors and seems to be okay. When I woke up again this morning I saw that it had hard crashed again... The screen went to sleep, so when I get home from work today I'll set the screen to not sleep like you said and see if the console can show me what's going on. I assume the command I want to run is the "tail -f /var/log/syslog" command, yeah? But that would seem to just show me what the syslog shows, and the syslog didn't have anything last time...

 

Or is there a command that will show me a more verbose output? I wouldn't be opposed to having the most info possible if that's a thing. 

 

What command should I be running to see the console output?

 

I also set the docker network to IPVLAN and restarted docker, but didn't reboot after that change. Should I have rebooted?

 

On the plus side the GPU is showing at a x16 lane now so that's progress I guess... Still crashing though.

