[Plugin] Nvidia-Driver


ich777


42 minutes ago, alturismo said:

 

nvidia-persistenced

 

this should take your card to sleep ...

 

 

this may also not be the proper command ;) if your card is the 1st one, it would be ...

 

nvidia-smi -i 0 -pm 1

 

but be aware, this feature may be dropped sooner or later; nvidia-persistenced is the current approach

Thanks

Does that mean I need to replace nvidia-smi --persistence-mode=1 in the script with nvidia-smi --persistenced-mode=1?

 

And by the way, I tested with my UPS: the card is actually going to sleep; the issue is that nvidia reports an incorrect idle power state

Same GPU as mine - https://forum-en.msi.com/index.php?threads/rtx-4060-ventus-high-power-draw-when-idle.387442/

13 minutes ago, J05u said:

Does that mean I need to replace nvidia-smi --persistence-mode=1 in the script with nvidia-smi --persistenced-mode=1?

No, this means you need to replace this whole line with:

nvidia-persistenced

 

14 minutes ago, J05u said:

And by the way, I tested with my UPS: the card is actually going to sleep; the issue is that nvidia reports an incorrect idle power state

This is nothing new to me, software readings are most certainly wrong.

1 minute ago, ich777 said:

No, this means you need to replace this whole line with:

nvidia-persistenced

 

This is nothing new to me, software readings are most certainly wrong.

Sorry for being dumb.

So the line nvidia-smi --persistence-mode=1 in the script should be replaced with nvidia-persistenced, and that's it?
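For reference, a minimal sketch of what such a corrected user script could look like (this is an illustration, not the poster's actual script; the GPU index 0 in the optional check is an assumption):

```shell
#!/bin/bash
# Optional: show whether persistence mode is currently enabled on GPU 0
nvidia-smi -i 0 --query-gpu=persistence_mode --format=csv,noheader

# Start the persistence daemon; this replaces the deprecated
# 'nvidia-smi --persistence-mode=1' call
nvidia-persistenced
```

The daemon keeps the driver state loaded even when no client is using the GPU, which lets the card settle into its low idle power state.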

31 minutes ago, zachlovescoffee said:

So sorry! I always forget diagnostics. Here you go along with my thanks for all your excellent support of our community.

Do you have any custom scripts installed?

I would recommend that you remove PowerTOP for now and see if that helps; it seems that your GPU fails to initialize:

Jul 31 22:15:37 darkstar kernel: NVRM: GPU at PCI:0000:01:00: GPU-ea06c034-1c38-9a48-60ea-e0779d32f6a1
Jul 31 22:15:37 darkstar kernel: NVRM: GPU Board Serial Number: 1322722038690
Jul 31 22:15:37 darkstar kernel: NVRM: Xid (PCI:0000:01:00): 62, pid='<unknown>', name=<unknown>, 20c2(2bc4) 00000000 00000000
Jul 31 22:15:47 darkstar kernel: NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x23:0x65:1426)
Jul 31 22:15:47 darkstar kernel: NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
Jul 31 22:15:51 darkstar kernel: NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x23:0x65:1426)
Jul 31 22:15:51 darkstar kernel: NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

 

Does the card work or does it not work at all?

45 minutes ago, ich777 said:

Do you have any custom scripts installed?

I would recommend that you remove PowerTOP for now and see if that helps; it seems that your GPU fails to initialize:

Jul 31 22:15:37 darkstar kernel: NVRM: GPU at PCI:0000:01:00: GPU-ea06c034-1c38-9a48-60ea-e0779d32f6a1
Jul 31 22:15:37 darkstar kernel: NVRM: GPU Board Serial Number: 1322722038690
Jul 31 22:15:37 darkstar kernel: NVRM: Xid (PCI:0000:01:00): 62, pid='<unknown>', name=<unknown>, 20c2(2bc4) 00000000 00000000
Jul 31 22:15:47 darkstar kernel: NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x23:0x65:1426)
Jul 31 22:15:47 darkstar kernel: NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
Jul 31 22:15:51 darkstar kernel: NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x23:0x65:1426)
Jul 31 22:15:51 darkstar kernel: NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

 

Does the card work or does it not work at all?

No custom scripts installed to my knowledge. The GPU was working before when transcoding was necessary. I cannot find any content that requires transcoding right now to test it, but the card appears to be persistently 'offline'. I tried uninstalling PowerTop from NerdTools by toggling it to 'off', but it's not uninstalling. Any ideas?

Screenshot 2023-08-01 at 8.26.21 AM.png

50 minutes ago, zachlovescoffee said:

The GPU was working before when transcoding was necessary. I cannot find any content that requires transcoding right now to test it but the card appears to be persistently 'offline'.

You have to reboot; the GPU is now in a state where it is not available to the system. You can see in your syslog that it failed to initialize the GPU.

Did you reboot already? You have to reboot to get your GPU into a working state.

 

I will try this later too when I'm home because I've got the same GPU as you.

 

50 minutes ago, zachlovescoffee said:

NerdTools but toggling it to 'off' but it's not uninstalling. Any ideas?

I would report that in the NerdTools support thread.

11 minutes ago, ich777 said:

You have to reboot; the GPU is now in a state where it is not available to the system. You can see in your syslog that it failed to initialize the GPU.

Did you reboot already? You have to reboot to get your GPU into a working state.

 

I will try this later too when I'm home because I've got the same GPU as you.

 

I would report that in the NerdTools support thread.

I've rebooted and PowerTop seems to still be installed. Since it resets every cycle, it's not doing its 'mojo' to power down the system. The GPU is functional again in the system, but as you can see from my post just above, nvidia-smi is constantly pinging the CPU.

1 hour ago, zachlovescoffee said:

nvidia-smi is constantly pinging the CPU.

This is due to the GPU Statistics plugin; it polls nvidia-smi to get the readings, and this was always the case.
I would recommend that you set a lower polling rate in the GPU Statistics plugin or simply don't visit the Dashboard; this will stop the polling and also the spikes. ;)
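As an illustration of what that polling looks like, a dashboard plugin typically runs a one-shot query like the one below on every refresh (the exact fields GPU Statistics queries are an assumption):

```shell
# Each dashboard refresh spawns a short-lived nvidia-smi process like this one;
# every invocation wakes the driver briefly, which shows up as short CPU spikes
nvidia-smi --query-gpu=utilization.gpu,power.draw,temperature.gpu --format=csv,noheader
```

Lowering the polling rate simply runs this query less often, so the spikes become less frequent.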

 

EDIT: I just tested it and transcoding is working with Plex and Jellyfin just fine with the latest Nvidia driver on unRAID 6.12.3

On 7/29/2023 at 2:07 PM, raftrider said:

I can confirm I had a similar issue with the latest nvidia driver, and reverting to v530.41.03 solved almost all issues, except that web player transcoding now just hangs with no errors in the browser. The card is a P2000.

@raftrider were you able to get any help on this over on the Plex forums?


@Jase I can still confirm it's working flawlessly here, now tested with both cards

 

GTX1060 card 1

[screenshot]

 

RTX3080ti card 0

[screenshot]

 

h264 and hevc tested, also live TV ... so overall I wouldn't say it's a plugin issue, nor even a Plex issue ...

 

running the plexinc Docker and, as you can see above, the latest nvidia driver package ...

 

[screenshot]


Hi,

My system has 4 GPUs, all Nvidia:

two GT 1030s, one RTX 2070, and one RTX 3060.

My server recognized all 4 when I was running 6.11.5 (I have a witness for that).

I updated to 6.12.3 and the server initially showed all 4 GPUs. Everything worked fine. As far as I am concerned, I did not do anything to mess with the server itself (I had one VM running). I went away from my computer for a couple of hours, came back, and the VM did not work anymore. It was paused (by itself). I tried to start it but got an error, so I checked the Nvidia Drivers page.

The Nvidia Drivers page showed:

[screenshot]

So the 2070 was not recognized (I had this problem when I showed my server to @SpaceInvaderOne), and we fixed it by changing the cards around on the mainboard.

I was a bit devastated because I thought that my beloved 2070 was just faulty/broken.

Again, I did not do anything afterwards. I literally left the server alone (I was cooking a rather large meal for my family, to be exact). When I came back, what I wanted to do was change the GPU of the VM (it used to be the 2070) to the 3060 so that I could at least work with my VM.

I checked the Nvidia drivers page again, and to my surprise it now shows this....

[screenshot]

This is weirding me out. Sometimes the driver recognizes the 2070, sometimes it recognizes the 3060. Apparently sometimes it does both.

 

server-diagnostics-20230810-1254.zip

1 hour ago, Zeze21 said:

I updated to 6.12.3 and the server showed initally all 4 GPUs.

The driver sees 3 GPUs:

Attached GPUs                             : 3
...
GPU 00000000:08:00.0
    Product Name                          : NVIDIA GeForce RTX 2070 SUPER
...
GPU 00000000:09:00.0
    Product Name                          : NVIDIA GeForce GT 1030
...
GPU 00000000:44:00.0
    Product Name                          : NVIDIA GeForce GT 1030

 

1 hour ago, Zeze21 said:

(i had one vm running)

Is one GPU passed through to the VM or are more GPUs passed through to VMs?

Please don't forget that if you pass a GPU through to a VM and the VM is running, it doesn't show on the plugin page...

 

However I see this in your lspci:

08:00.0 VGA compatible controller [0300]: NVIDIA Corporation TU104 [GeForce RTX 2070 SUPER] [10de:1e84] (rev a1)
	Subsystem: NVIDIA Corporation TU104 [GeForce RTX 2070 SUPER] [10de:1e84]
	Kernel driver in use: nvidia
	Kernel modules: nvidia_drm, nvidia
08:00.1 Audio device [0403]: NVIDIA Corporation TU104 HD Audio Controller [10de:10f8] (rev a1)
	Subsystem: NVIDIA Corporation TU104 HD Audio Controller [10de:1e84]
08:00.2 USB controller [0c03]: NVIDIA Corporation TU104 USB 3.1 Host Controller [10de:1ad8] (rev a1)
	Subsystem: NVIDIA Corporation TU104 USB 3.1 Host Controller [10de:1e84]
	Kernel driver in use: xhci_hcd
08:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU104 USB Type-C UCSI Controller [10de:1ad9] (rev a1)
	Subsystem: NVIDIA Corporation TU104 USB Type-C UCSI Controller [10de:1e84]
09:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP108 [GeForce GT 1030] [10de:1d01] (rev a1)
	Subsystem: ASUSTeK Computer Inc. GP108 [GeForce GT 1030] [1043:85f4]
	Kernel driver in use: nvidia
	Kernel modules: nvidia_drm, nvidia
09:00.1 Audio device [0403]: NVIDIA Corporation GP108 High Definition Audio Controller [10de:0fb8] (rev a1)
	Subsystem: ASUSTeK Computer Inc. GP108 High Definition Audio Controller [1043:85f4]
...
43:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA106 [GeForce RTX 3060 Lite Hash Rate] [10de:2504] (rev a1)
	Subsystem: Gigabyte Technology Co., Ltd GA106 [GeForce RTX 3060 Lite Hash Rate] [1458:4096]
	Kernel driver in use: vfio-pci
	Kernel modules: nvidia_drm, nvidia
43:00.1 Audio device [0403]: NVIDIA Corporation GA106 High Definition Audio Controller [10de:228e] (rev a1)
	Subsystem: Gigabyte Technology Co., Ltd Device [1458:4096]
	Kernel driver in use: vfio-pci
44:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP108 [GeForce GT 1030] [10de:1d01] (rev a1)
	Subsystem: Micro-Star International Co., Ltd. [MSI] GP108 [GeForce GT 1030] [1462:8c98]
	Kernel driver in use: nvidia
	Kernel modules: nvidia_drm, nvidia
44:00.1 Audio device [0403]: NVIDIA Corporation GP108 High Definition Audio Controller [10de:0fb8] (rev a1)
	Subsystem: Micro-Star International Co., Ltd. [MSI] GP108 High Definition Audio Controller [1462:8c98]

 

Nvidia RTX2070 Super: nvidia driver loaded

Nvidia GT1030: nvidia driver loaded

Nvidia RTX3060: bound to VFIO

Nvidia GT1030: nvidia driver loaded

 

In other words, everything seems completely fine, and the Nvidia Driver plugin lists the cards correctly.

 

In this Diagnostics I see that you have bound the RTX3060 to VFIO, so the Nvidia Driver plugin can't see it.

I don't know how you've bound it to VFIO, because I don't see any indication in your Diagnostics why it should be bound to VFIO, but the syslog shows that it was bound to VFIO on boot:

Aug 10 12:19:48 Server kernel: vfio-pci 0000:43:00.0: vgaarb: changed VGA decodes: olddecodes=none,decodes=io+mem:owns=none

 

Again, please note that your RTX 2070 Super won't show up if you use it in a running VM; that is the default behaviour.
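If you want to double-check this yourself, a quick way to see which kernel driver each Nvidia device is currently bound to (a sketch; the exact output layout varies with the lspci version):

```shell
# Show each Nvidia VGA controller together with its bound kernel driver;
# 'Kernel driver in use: vfio-pci' means the card is reserved for passthrough
# and will not appear in the Nvidia Driver plugin
lspci -nnk | grep -A 3 -i 'vga.*nvidia'
```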


I have a T600 and the plugin was working great until I upgraded to unRAID 6.12.X. Along with my CPU fan dying a slow death, macvlan errors that triggered a cache drive failure and a restore of my appdata, and some other "joys," I was also seeing my T600 "disappear" and getting GPU errors in my logs similar to what others have posted.

 

Oddly, I tried turning off the netdata docker, and also uninstalled/reinstalled the GPU Stats plugin, and so far my T600 seems more stable. Previously it would fail and "disappear" from unraid until a reboot if I ran Plex or something that hit the GPU. It's only been 24 hours but wanted to share in case this helps users and/or the plugin dev.

27 minutes ago, Slappy said:

Previously it would fail and "disappear" from unraid until a reboot if I ran Plex or something that hit the GPU. It's only been 24 hours but wanted to share in case this helps users and/or the plugin dev.

Please always pull Diagnostics from your system before you reboot if you encounter an error, and always include them in your post.

