[Plugin] Linuxserver.io - Unraid Nvidia



For a while I have been able to run two different Plex dockers (one without the GPU, and one with it) so that I can still use my VM.

I recently moved house, and since then I have not been able to run the docker that utilises my 1660 Super; whenever I try, it just tells me there was a "server error".

I was already on the beta25 version and everything was working fine, but for some reason the Unraid Nvidia plugin is no longer recognising my GPU, even after reinstalling it twice more over a couple of days.

My GPU still works fine when passed through to my VM, and I can still use the non-GPU version of Plex; I just can't run a version that hardware transcodes when I am not using my VM.

Error Message: "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running."

 

Current Setup:

Mobo: Asus B550 E Gaming

CPU: Ryzen 7 3700X

GPU 1: MSI 1660 Super

GPU 2: Asus RX570 (For Mac VM only)

RAM: 32GB Trident Z Neo

     

 

[Screenshot attached: nvidiaunraid.PNG]

 

Link to comment

Hi community,

 

I was having some crashing issues, so I had the server powered off for a few days while I was doing some research. Since powering it back on I've been keeping the log viewer open to keep an eye on things. Over the last several days I have noticed weird messages in the logs.

 

First, my system:

CPU: AMD Ryzen 7 2700 (8 cores)

Mobo: Asus ROG Strix B450-F Gaming

RAM: 16 GB

Asus Radeon HD 6450 1 GB (passed through to a VM)

GTX 1080 Ti (used for Plex transcoding)

 

Running on Unraid 6.8.3 and linuxserver.io's Unraid Nvidia plugin version 2019-06-23.

 

At first, the log was getting spammed with the same error message every 10 seconds or so (it flooded past what my syslog viewer could show at a time, so I have no idea how long it went on for). Unfortunately I did not save diagnostics or take a screenshot, but it was:

 

"NVRM: GPU RmInitAdapter failed!

NVRM: rm_init_adapter failed for device bearing minor number 0."

 

Rebooting the server seemed to fix things, at least temporarily. I could watch things on Plex and it would use hardware transcoding just fine, with no errors in the log. However, the next day syslog would be flooded with the above messages again. I saw a post on Reddit recommending going back to stock 6.8.3 on the Unraid Nvidia plugin and then redoing the Nvidia 6.8.3 build. This seemed to work, and there were no errors when I woke up this morning. However, tonight when I checked the logs before bed I saw this:

 

Aug 21 20:34:12 SPAMFAM kernel: NVRM: Xid (PCI:0000:09:00): 79, pid=17083, GPU has fallen off the bus.
Aug 21 20:34:12 SPAMFAM kernel: NVRM: GPU 0000:09:00.0: GPU has fallen off the bus.
Aug 21 20:34:12 SPAMFAM kernel: NVRM: GPU 0000:09:00.0: GPU is on Board .
Aug 21 20:34:12 SPAMFAM kernel: NVRM: A GPU crash dump has been created. If possible, please run
Aug 21 20:34:12 SPAMFAM kernel: NVRM: nvidia-bug-report.sh as root to collect this data before
Aug 21 20:34:12 SPAMFAM kernel: NVRM: the NVIDIA kernel module is unloaded.

 

This time I have the diagnostics file saved if it's needed. If any other information is needed, please let me know.

Link to comment
6 hours ago, braydination said:

For a while I have been able to run two different Plex dockers (one without the GPU, and one with it) so that I can still use my VM. [...] for some reason the Unraid Nvidia plugin is no longer recognising my GPU, even after reinstalling it twice more over a couple of days. [...]

Error Message: "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running."

 

The card is probably bound to the vfio driver, so the Nvidia driver modules are not loaded. The output of lspci -k will show which driver is loaded.
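For example, a card that has been grabbed by vfio will show something like this (the 01:00.0 address is just an example; use whatever address your 1660 Super shows under Tools -> System Devices):

lspci -k -s 01:00.0

01:00.0 VGA compatible controller: NVIDIA Corporation TU116 [GeForce GTX 1660 SUPER] (rev a1)
        Kernel driver in use: vfio-pci

If "Kernel driver in use" reports nvidia instead of vfio-pci, the plugin should be able to see the card.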

Link to comment

Hey Guys, 

I'm having a strange issue. I have a GTX 970 and a GTX 1660 Ti. I installed Unraid Nvidia 6.8.3 and it loaded the 440.59 driver. Everything worked fine: I copied my GPU ID (GTX 970) and added it to Plex, and it is using it for transcoding fine; I can see it working in "watch nvidia-smi". I decided to switch to the GTX 1660 Ti to get some benchmarks, but when I went back to the Nvidia page all my GPU info was gone. If I restart my server, all the info is there again.

 

So far it has been a pretty awesome plugin; I am able to pass both GPUs through to dockers and a VM as long as I copy the GPU ID before I leave the page.
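In case it helps anyone else, "added it to Plex" here just means the usual plugin setup — if I'm reading my template right, these are the relevant bits in the Plex docker template (the GPU UUID placeholder is whatever the plugin page shows for your card):

Extra Parameters: --runtime=nvidia
NVIDIA_VISIBLE_DEVICES=<your GPU UUID>
NVIDIA_DRIVER_CAPABILITIES=all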

[Screenshot attached: Error Capture.PNG]

crimson-diagnostics-20200823-2154.zip

Link to comment
On 8/23/2020 at 12:18 AM, fpoa said:

I was having some crashing issues, so I had the server powered off for a few days while I was doing some research. [...] At first, the log was getting spammed with the same error message every 10 seconds or so: "NVRM: GPU RmInitAdapter failed! NVRM: rm_init_adapter failed for device bearing minor number 0." [...] However, tonight when I checked the logs before bed I saw this:

Aug 21 20:34:12 SPAMFAM kernel: NVRM: Xid (PCI:0000:09:00): 79, pid=17083, GPU has fallen off the bus.

Not sure if this is related to the above issue, but just saw some new errors:

Quote

Aug 29 19:42:15 SPAMFAM kernel: Modules linked in: nvidia_uvm(O) macvlan xt_CHECKSUM ipt_REJECT xt_nat ip6table_mangle ip6table_nat nf_nat_ipv6 iptable_mangle ip6table_filter ip6_tables vhost_net tun vhost tap veth ipt_MASQUERADE iptable_filter iptable_nat nf_nat_ipv4 nf_nat ip_tables xfs md_mod bonding rsnvme(PO) sr_mod cdrom nvidia_drm(PO) nvidia_modeset(PO) nvidia(PO) btusb btrtl btbcm btintel bluetooth ecdh_generic drm_kms_helper edac_mce_amd wmi_bmof mxm_wmi crc32_pclmul pcbc aesni_intel aes_x86_64 glue_helper crypto_simd ghash_clmulni_intel cryptd drm kvm_amd kvm syscopyarea sysfillrect sysimgblt fb_sys_fops igb(O) k10temp agpgart i2c_piix4 ahci ccp i2c_core nvme libahci usblp crct10dif_pclmul nvme_core crc32c_intel wmi button pcc_cpufreq acpi_cpufreq
Aug 29 19:42:15 SPAMFAM kernel: CPU: 2 PID: 31159 Comm: kworker/2:0 Tainted: P O 4.19.107-Unraid #1
Aug 29 19:42:15 SPAMFAM kernel: Hardware name: System manufacturer System Product Name/ROG STRIX B450-F GAMING, BIOS 2008 03/04/2019
Aug 29 19:42:15 SPAMFAM kernel: Workqueue: events macvlan_process_broadcast [macvlan]
Aug 29 19:42:15 SPAMFAM kernel: RIP: 0010:__nf_conntrack_confirm+0xa0/0x69e
Aug 29 19:42:15 SPAMFAM kernel: Code: 04 e8 56 fb ff ff 44 89 f2 44 89 ff 89 c6 41 89 c4 e8 7f f9 ff ff 48 8b 4c 24 08 84 c0 75 af 48 8b 85 80 00 00 00 a8 08 74 26 <0f> 0b 44 89 e6 44 89 ff 45 31 f6 e8 95 f1 ff ff be 00 02 00 00 48
Aug 29 19:42:15 SPAMFAM kernel: RSP: 0018:ffff88842e683d90 EFLAGS: 00010202
Aug 29 19:42:15 SPAMFAM kernel: RAX: 0000000000000188 RBX: ffff88842b6d0100 RCX: ffff888286597618
Aug 29 19:42:15 SPAMFAM kernel: RDX: 0000000000000001 RSI: 0000000000000081 RDI: ffffffff81e08b90
Aug 29 19:42:15 SPAMFAM kernel: RBP: ffff8882865975c0 R08: 00000000896aacaa R09: ffff8883531b31c0
Aug 29 19:42:15 SPAMFAM kernel: R10: 0000000000000000 R11: ffff8883532c8000 R12: 0000000000008481
Aug 29 19:42:15 SPAMFAM kernel: R13: ffffffff81e91080 R14: 0000000000000000 R15: 000000000000f964
Aug 29 19:42:15 SPAMFAM kernel: FS: 0000000000000000(0000) GS:ffff88842e680000(0000) knlGS:0000000000000000
Aug 29 19:42:15 SPAMFAM kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 29 19:42:15 SPAMFAM kernel: CR2: 00005621bb0b1018 CR3: 0000000001e0a000 CR4: 00000000003406e0
Aug 29 19:42:15 SPAMFAM kernel: Call Trace:
Aug 29 19:42:15 SPAMFAM kernel: <IRQ>
Aug 29 19:42:15 SPAMFAM kernel: ipv4_confirm+0xaf/0xb9
Aug 29 19:42:15 SPAMFAM kernel: nf_hook_slow+0x3a/0x90
Aug 29 19:42:15 SPAMFAM kernel: ip_local_deliver+0xad/0xdc
Aug 29 19:42:15 SPAMFAM kernel: ? ip_sublist_rcv_finish+0x54/0x54
Aug 29 19:42:15 SPAMFAM kernel: ip_rcv+0xa0/0xbe
Aug 29 19:42:15 SPAMFAM kernel: ? ip_rcv_finish_core.isra.0+0x2e1/0x2e1
Aug 29 19:42:15 SPAMFAM kernel: __netif_receive_skb_one_core+0x53/0x6f
Aug 29 19:42:15 SPAMFAM kernel: process_backlog+0x77/0x10e
Aug 29 19:42:15 SPAMFAM kernel: net_rx_action+0x107/0x26c
Aug 29 19:42:15 SPAMFAM kernel: __do_softirq+0xc9/0x1d7
Aug 29 19:42:15 SPAMFAM kernel: do_softirq_own_stack+0x2a/0x40
Aug 29 19:42:15 SPAMFAM kernel: </IRQ>
Aug 29 19:42:15 SPAMFAM kernel: do_softirq+0x4d/0x5a
Aug 29 19:42:15 SPAMFAM kernel: netif_rx_ni+0x1c/0x22
Aug 29 19:42:15 SPAMFAM kernel: macvlan_broadcast+0x111/0x156 [macvlan]
Aug 29 19:42:15 SPAMFAM kernel: ? __switch_to_asm+0x41/0x70
Aug 29 19:42:15 SPAMFAM kernel: macvlan_process_broadcast+0xea/0x128 [macvlan]
Aug 29 19:42:15 SPAMFAM kernel: process_one_work+0x16e/0x24f
Aug 29 19:42:15 SPAMFAM kernel: worker_thread+0x1e2/0x2b8
Aug 29 19:42:15 SPAMFAM kernel: ? rescuer_thread+0x2a7/0x2a7
Aug 29 19:42:15 SPAMFAM kernel: kthread+0x10c/0x114
Aug 29 19:42:15 SPAMFAM kernel: ? kthread_park+0x89/0x89
Aug 29 19:42:15 SPAMFAM kernel: ret_from_fork+0x22/0x40
Aug 29 19:42:15 SPAMFAM kernel: ---[ end trace 4067e0319717aeb0 ]---
Aug 29 19:56:05 SPAMFAM kernel: NVRM: GPU 0000:09:00.0: RmInitAdapter failed! (0x23:0x56:515)
Aug 29 19:56:05 SPAMFAM kernel: NVRM: GPU 0000:09:00.0: rm_init_adapter failed, device minor number 0

 

Link to comment

Hello, I have just installed Unraid Nvidia (following Spaceinvaderone's guide) and got to the part where you need to reboot the server. When it came back up there was a problem with the networking. I first realized this when the GUI would not load, either on the actual machine or over the network. I tried another reboot and noticed during boot it said "eth0 not found"; I had a look in network.cfg and it's empty. I am very new to Unraid and have little to no experience with the command line, so without the GUI I'm up a creek without a paddle. If anyone can provide assistance, that would be very appreciated.

Link to comment

Hoping for some help here. I have a Dell T7500 running the Nvidia Unraid build 6.8.3 with a GTX 1050 Ti and an RTX 2060 installed. Both are working for VMs, and I am able to use the 1050 Ti for Plex when I'm not using that specific VM. I was looking to pass the RTX 2060 through as a device for Handbrake, and noticed that it is not listed under the Unraid Nvidia plugin.

 

[Screenshot attached: Screen Shot 2020-09-02 at 5.57.22 PM.png]

 

Using the watch nvidia-smi query only shows the 1050 Ti as well.

 

[Screenshot attached: Screen Shot 2020-09-02 at 5.58.31 PM.png]

 

It does show up under System Devices, and again I can pass it through to a Windows 10 VM. I don't believe the card is stubbed (I watched SpaceInvader's guide to setting up the VM, and I don't recall editing the config file on the USB).

 

Any help would be appreciated! 

Link to comment
8 hours ago, NullZeroNobody said:

Hoping for some help here. I have a Dell T7500 running the Nvidia Unraid build 6.8.3 with a GTX 1050 Ti and an RTX 2060 installed. [...] I was looking to pass the RTX 2060 through as a device for Handbrake, and noticed that it is not listed under the Unraid Nvidia plugin. [...] It does show up under System Devices, and again I can pass it through to a Windows 10 VM. I don't believe the card is stubbed.

Your card is most likely stubbed. Post the relevant output from the command lspci -k

It will most likely say it's using the vfio module.

Link to comment
9 hours ago, saarg said:

Your card is most likely stubbed. Post the relevant output from the command lspci -k

It will most likely say it's using the vfio module.

[Screenshot attached: Screen Shot 2020-09-03 at 12.10.59 PM.png]

 

Thanks Saarg, you are correct. How do I fix this? And if I make any changes, will it affect the VM I have it passed through to? I would shut off the VM when using the GPU for Handbrake Docker encoding, and vice versa. 

Link to comment
5 hours ago, NullZeroNobody said:

Thanks Saarg, you are correct. How do I fix this? And if I make any changes, will it affect the VM I have it passed through to?

I'm not sure if the card is released from vfio-pci automatically or whether you have to force it somehow. You might have to remove it from the VM template and reboot.

I haven't played around with this, but maybe someone else has and can chime in.
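If you want to try it without a reboot, the generic sysfs route would be something along these lines (untested by me, with the VM shut down first; 0000:0b:00.0 is just a placeholder for the 2060's actual address):

# detach the card from vfio-pci
echo "0000:0b:00.0" > /sys/bus/pci/drivers/vfio-pci/unbind
# ask the kernel to re-probe the device so the Nvidia driver can claim it
echo "0000:0b:00.0" > /sys/bus/pci/drivers_probe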

Link to comment
7 hours ago, saarg said:

I'm not sure if the card is released from vfio-pci automatically or whether you have to force it somehow. You might have to remove it from the VM template and reboot.

I haven't played around with this, but maybe someone else has and can chime in.

I went into Settings -> VFIO-PCI Config, unchecked the bind for the 2060, and was able to assign it to the Handbrake docker and run it successfully. If I stop the docker, the VM with the 2060 attached boots without issue, which is what I was looking for.
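For anyone else doing the same thing, a quick way to confirm the driver can see the card again afterwards is:

nvidia-smi -L

which should list every GPU the driver can reach along with its UUID; a card that is still bound to vfio simply won't appear.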

 

Thanks for your help!

Link to comment

Since 6.8.3 I haven't been able to pass my GPU through to a VM without crashing the server, requiring a power cycle.

The only other things that have changed are that I swapped my GUI-mode GPU from an old AMD card to a GT710 and added a 10GbE NIC.

 

I haven't been able to suss out the reason, hoping something sticks out in my config to someone here.

Note: passthrough works fine on vanilla 6.8.3.

fortytwo-diagnostics-20200906-0802.zip

Link to comment
10 hours ago, tjb_altf4 said:

Since 6.8.3 I haven't been able to pass my GPU through to a VM without crashing the server, requiring a power cycle. [...] Note: passthrough works fine on vanilla 6.8.3.

If the GPU is being used by something, you will get a crash. If it's not in use, something probably changed in either the kernel or Nvidia driver.

You have to remember that swapping it around like that isn't a normal thing to do. You should first unload the Nvidia modules and then bind it to vfio.

Link to comment
1 hour ago, saarg said:

If the GPU is being used by something, you will get a crash. If it's not in use, something probably changed in either the kernel or Nvidia driver.

You have to remember that swapping it around like that isn't a normal thing to do. You should first unload the Nvidia modules and then bind it to vfio.

Definitely nothing is using the GTX 1060 (the card being passed through); the GT710 is for the host (GUI), and no VMs or dockers were running this time.

No point binding the card, as then I won't be able to use it for docker, and vanilla Unraid works just fine for that use case.

Link to comment
32 minutes ago, tjb_altf4 said:

Definitely nothing is using the GTX 1060 (the card being passed through); the GT710 is for the host (GUI), and no VMs or dockers were running this time.

No point binding the card, as then I won't be able to use it for docker, and vanilla Unraid works just fine for that use case.

 

Vanilla Unraid doesn't load the drivers for the GPU at all, so it's no wonder there aren't any issues passing it through to the VM whenever you want to.

You have to unbind the card from the Nvidia modules before you use it in your VM. There is no way around it if Unraid crashes when you start the VM.

Link to comment
6 hours ago, tjb_altf4 said:

Can that be done on the fly, or only at boot time?

If you look at the instructions about dumping the BIOS here on the forum, you should find a command to unbind the card from its modules. Hopefully that is enough.

I don't remember which thread it was in, so you will need to search for it; Google is probably the quickest way to find it.
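For reference, the generic form of that unbind is just a sysfs write along these lines (swap in the 1060's actual address):

echo "0000:03:00.0" > /sys/bus/pci/devices/0000:03:00.0/driver/unbind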

Link to comment

Hi all! I'm getting this set of errors when going to the Nvidia plugin page. Not sure what has happened, but something is up, as I've never received this set of errors before.

 

Screenshot: https://i.ibb.co/sJ7cJcb/Screen-Shot-2020-09-11-at-7-30-20-AM.png

Link to comment
4 minutes ago, de la trevie said:

Hi all! I'm getting this set of errors when going to the Nvidia plugin page. Not sure what has happened, but something is up, as I've never received this set of errors before.

Screenshot: https://i.ibb.co/sJ7cJcb/Screen-Shot-2020-09-11-at-7-30-20-AM.png

Your link didn't work. We prefer attachments instead of links to external sites anyway. Now that you have been approved you should be able to attach the image to your NEXT post.

Link to comment

Hey there, I've got issues with the following card:
Palit GeForce GTX 1650 KalmX (https://www.palit.com/palit/vgapro.php?id=3494&lang=en)

I tried several Unraid Nvidia builds that are available from the WebGUI; nothing seems to work.

Result of running "lspci":

04:00.0 VGA compatible controller: NVIDIA Corporation TU117 [GeForce GTX 1650] (rev a1)

An old GeForce GT 730 works just fine. Does somebody have an idea how to get this card working?

Link to comment
13 hours ago, blackbunt said:

I've got issues with the following card: Palit GeForce GTX 1650 KalmX [...] I tried several Unraid Nvidia builds that are available from the WebGUI; nothing seems to work. [...] An old GeForce GT 730 works just fine.

Not sure what your problem is, but I have one of those cards installed in one of my servers and it is working fine with the current Nvidia builds. I specifically went for that card because it is fanless and therefore silent.

Link to comment
1 hour ago, itimpi said:

Not sure what your problem is, but I have one of those cards installed in one of my servers and it is working fine with the current Nvidia builds. I specifically went for that card because it is fanless and therefore silent.

I am running the exact same card in my main server without any issues.

I popped the working GTX 1650 SUPER out of my main server, and I have the same issues with that one in my backup server.

Weirdly, it shows up under System Devices as follows:

 

IOMMU group 13:[10de:2187] 04:00.0 VGA compatible controller: NVIDIA Corporation TU116 [GeForce GTX 1650 SUPER] (rev a1)

[10de:1aeb] 04:00.1 Audio device: NVIDIA Corporation TU116 High Definition Audio Controller (rev a1)

[10de:1aec] 04:00.2 USB controller: NVIDIA Corporation Device 1aec (rev a1)

[10de:1aed] 04:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU116 [GeForce GTX 1650 SUPER] (rev a1)

Is there a way to force the driver to use the GPU?

Link to comment
10 hours ago, blackbunt said:

I am running the exact same card in my main server without any issues. [...] Weirdly, it shows up under System Devices as follows:

IOMMU group 13: [10de:2187] 04:00.0 VGA compatible controller: NVIDIA Corporation TU116 [GeForce GTX 1650 SUPER] (rev a1) [...]

Is there a way to force the driver to use the GPU?

The GPU is probably stubbed. Check which module is loaded with lspci -k
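Since your output above shows the card at 04:00.0, something like this will show just that device and whichever driver currently owns it:

lspci -k -s 04:00.0

If it reports "Kernel driver in use: vfio-pci", the card is stubbed and the Nvidia driver never gets a chance to attach to it.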

Link to comment
  • trurl locked this topic
This topic is now closed to further replies.