[Plugin] Linuxserver.io - Unraid Nvidia


2468 posts in this topic Last Reply


On 4/16/2020 at 6:58 PM, Fiservedpi said:

Same here: even with the GPU Stats plugin removed, the logs get slammed with 

CMD: > /var/log/syslog to truncate it for now, since the log was 422,000 Bytes

There's something going on here, on a kernel level I think

nvidia-container-runtime-hook.log 35.61 kB · 2 downloads

I have been running the LinuxServer.io Folding@home docker recently, which has been keeping my GPU very busy fighting COVID-19 💪!  With the GPU under load, I haven't noticed this issue come up at all, with or without the GPU Statistics plugin installed.

 

When I stop utilizing the GPU and query it with nvidia-smi or the GPU Statistics plugin, I immediately see this error in the log.  So my latest hypothesis is that there is something wrong with the way the NVIDIA plugin interfaces with the GPU when it is queried in the low power (P0/throttled) state.  @linuxserver.io, any ideas?

 

-JesterEE
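
One way to test this hypothesis would be to record which performance state the card is in each time it is queried, then line those timestamps up with the warnings in syslog. The following is only a minimal sketch: it assumes `nvidia-smi` accepts the `pstate` and `power.draw` query fields, and the function name `tag_gpu_state` is made up for illustration.

```shell
# Hypothetical helper: reads "pstate, power" CSV lines on stdin, i.e. the
# format printed by
#   nvidia-smi --query-gpu=pstate,power.draw --format=csv,noheader
# and prints each one prefixed with a syslog-style timestamp.
tag_gpu_state() {
    local now pstate power
    now=$(date '+%b %e %H:%M:%S')
    while IFS=',' read -r pstate power; do
        power=${power# }   # drop the space that follows the comma
        printf '%s pstate=%s power=%s\n' "$now" "$pstate" "$power"
    done
}

# Example (on the server itself):
#   nvidia-smi --query-gpu=pstate,power.draw --format=csv,noheader | tag_gpu_state
```

Appending that output to a file every few minutes would show whether the log entries only appear while the card is idle.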

Link to post

9 hours ago, JesterEE said:

I have been running the LinuxServer.io Folding@home docker recently, which has been keeping my GPU very busy fighting COVID-19 💪!  With the GPU under load, I haven't noticed this issue come up at all, with or without the GPU Statistics plugin installed.

 

When I stop utilizing the GPU and query it with nvidia-smi or the GPU Statistics plugin, I immediately see this error in the log.  So my latest hypothesis is that there is something wrong with the way the NVIDIA plugin interfaces with the GPU when it is queried in the low power (P0/throttled) state.  @linuxserver.io, any ideas?

 

-JesterEE

The plugin doesn't query the gpu. Getting the UUID is done once at boot and that is all.

I don't think there is much we can do about the issue as it's most likely a combination of kernel, driver, GPU and bios versions. Hopefully it's solved in a later build.

Link to post

Hello,

 

I'm trying to get my 2 Grid K2 video cards to show up in Unraid-Nvidia, but they don't appear.  I'm assuming it's a driver issue.  How do I add the drivers for my cards if they aren't included in any of the 400 series drivers?  I can find them in the 340.108 version.

 

IOMMU group 18:[10de:11bf] 05:00.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K2] (rev a1)

IOMMU group 19:[10de:11bf] 06:00.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K2] (rev a1)

image.png.f9ead24c17265859be33008bb0c1a455.png

Link to post
14 hours ago, Gregory said:

Hello,

 

I'm trying to get my 2 Grid K2 video cards to show up in Unraid-Nvidia, but they don't appear.  I'm assuming it's a driver issue.  How do I add the drivers for my cards if they aren't included in any of the 400 series drivers?  I can find them in the 340.108 version.

 

IOMMU group 18:[10de:11bf] 05:00.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K2] (rev a1)

IOMMU group 19:[10de:11bf] 06:00.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K2] (rev a1)

image.png.f9ead24c17265859be33008bb0c1a455.png

You can't, as each build only includes the latest driver available at the time it is built.

Link to post
21 hours ago, Gregory said:

Hello,

 

I'm trying to get my 2 Grid K2 video cards to show up in Unraid-Nvidia, but they don't appear.  I'm assuming it's a driver issue.  How do I add the drivers for my cards if they aren't included in any of the 400 series drivers?  I can find them in the 340.108 version.

 

IOMMU group 18:[10de:11bf] 05:00.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K2] (rev a1)

IOMMU group 19:[10de:11bf] 06:00.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K2] (rev a1)

image.png.f9ead24c17265859be33008bb0c1a455.png

Cards are so old they aren't even listed on nvidia's matrix for anything usable.

 

Why would you waste the power even using them for this?

Link to post
On 4/25/2020 at 8:00 AM, Gregory said:

Hello,

 

I'm trying to get my 2 Grid K2 video cards to show up in Unraid-Nvidia, but they don't appear.  I'm assuming it's a driver issue.  How do I add the drivers for my cards if they aren't included in any of the 400 series drivers?  I can find them in the 340.108 version.

 

IOMMU group 18:[10de:11bf] 05:00.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K2] (rev a1)

IOMMU group 19:[10de:11bf] 06:00.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K2] (rev a1)

image.png.f9ead24c17265859be33008bb0c1a455.png

Grid K2 cards aren't compatible with anything other than VMware. They don't have standard Linux drivers; it's a VMware ESX-only card.

Edited by beardymcgee
Link to post

Any ideas how to start debugging an issue I'm having where the GPU just disappears?

 

Basic Scenario

- LinuxServer.io - Plex using GPU HW Encoding

- Unraid Nvidia Plugin (for 6.8.3)

- Brand new Asus 1660 Super (power led is white indicating all is well)

- GPU Statistics Plugin

 

Initially I had the GPU setup and encoding in Plex within minutes, having followed all the nice guides on here.

The issue is that every day or so the GPU will just disappear (the webui Dashboard GPU Stats has no numbers, just '/' against each stat).

Running nvidia-smi in a terminal gives me:

"Unable to determine the device handle for GPU 0000:09:00.0: Unknown Error"

The GPU itself has the fans at max, as if it has crashed. I have to reboot the system, after which it works again for a day or so.

 

I was checking remotely this morning, and GPU Stats was showing sensible numbers until around 10:30 this morning, but as can be seen in the syslog, it gets spammed before/after this time with:

Apr 27 10:50:10 DIG-NAS-UR001 kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]
Apr 27 10:50:10 DIG-NAS-UR001 kernel: caller _nv000908rm+0x1bf/0x1f0 [nvidia] mapping multiple BARs

(hundreds of entries)

 

and once the card's disappeared, I get a small number of entries:

Apr 27 14:21:17 DIG-NAS-UR001 kernel: NVRM: GPU 0000:09:00.0: request_irq() failed (-22)
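
As a starting point for debugging, it may help to tally these two kinds of messages and note when the first `request_irq()` failure was logged, since that roughly marks when the card dropped out. This is a rough sketch only; the function name `summarise_nvidia_log` is hypothetical, and the patterns are simply the message fragments quoted above.

```shell
# Hypothetical helper: summarise the NVIDIA-related kernel messages in a
# syslog file.
# Usage: summarise_nvidia_log /var/log/syslog
summarise_nvidia_log() {
    local log="$1"
    # grep -c prints the number of matching lines (0 when nothing matches)
    printf 'mapping multiple BARs: %s\n' "$(grep -c 'mapping multiple BARs' "$log")"
    printf 'request_irq() failed: %s\n' "$(grep -c 'request_irq() failed' "$log")"
    # Timestamp (first three syslog fields) of the first request_irq failure, if any
    grep -m 1 'request_irq() failed' "$log" |
        awk '{print "first failure at:", $1, $2, $3}'
}
```

Comparing the "first failure at" time with the point where GPU Stats stopped showing numbers would indicate whether the two events coincide.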

 

I just don't know where to start, I have grabbed the diagnostics just in case it's useful (will upload) but just want to get advice on where to start/if anyone can help.


The GPU with PLEX is working fantastically (when it works) I can transcode my recently ripped UHD movies and HW encode to 1080p/720p with no issues, so would love to get this working 'full time'.. :)


Thanks for any help!

 

edit - just to confirm, a remote 'restart' of unraid gets it going again.

 

 

 

 

Edited by Snubbers
Link to post
1 hour ago, aptalca said:

I'm kinda seeing a trend here. Most if not all of the people experiencing these issues are also using the gpu stats plugin. Did you try without it?

That's a very good observation, I'll uninstall that plugin, reboot to ensure it's 'clean' and then report back! :)

Link to post

Hi all, I am new to this forum and unsure what information will be helpful, so if you need any more information, please let me know and I will post it :)

I have the Unraid Nvidia version installed, but my Quadro P400 does not show in the UNRAID NVIDIA area of settings. Yet I am able to see it in System Devices and pass it through to a VM.

I have not tried passing it through to a docker, as I do not know what the UUID is, since it does not show up in the UNRAID NVIDIA area of settings.

Please can you help with this? If you need any more information, please let me know :)

 

My server is a Dell Poweredge t310 btw!

Link to post
43 minutes ago, Solverz said:

Hi all, I am new to this forum and unsure what information will be helpful, so if you need any more information, please let me know and I will post it :)

I have the Unraid Nvidia version installed, but my Quadro P400 does not show in the UNRAID NVIDIA area of settings. Yet I am able to see it in System Devices and pass it through to a VM.

I have not tried passing it through to a docker, as I do not know what the UUID is, since it does not show up in the UNRAID NVIDIA area of settings.

Please can you help with this? If you need any more information, please let me know :)

 

My server is a Dell Poweredge t310 btw!

The problem is that you are passing it through to a VM. That means the nvidia driver is not loaded, and the plugin doesn't see the card.

Link to post
13 minutes ago, saarg said:

The problem is that you are passing it through to a VM. That means the nvidia driver is not loaded, and the plugin doesn't see the card.

Sorry I forgot to state that I only tested if it could be passed through to a VM after seeing that it was not visible in NVIDIA UNRAID settings.

 

It is currently not being passed to a VM at all and is visible in System Devices; however, it still does not show in the NVIDIA UNRAID settings.

Link to post
1 hour ago, Solverz said:

Sorry I forgot to state that I only tested if it could be passed through to a VM after seeing that it was not visible in NVIDIA UNRAID settings.

 

It is currently not being passed to a VM at all and is visible in System Devices; however, it still does not show in the NVIDIA UNRAID settings.

Is it chosen in the VM template? If so, Unraid automatically binds the GPU to vfio, so the Nvidia driver can't be used.

Link to post
1 minute ago, saarg said:

Is it chosen in the VM template? If so, Unraid automatically binds the GPU to vfio, so the Nvidia driver can't be used.

No, it is not chosen in any VM templates; in fact, I don't have any VM templates at all.

 

I also noticed that when I remove the GPU and go into the Nvidia Unraid settings, it states the driver could not be loaded. But when I have the GPU installed, it just lists the nvidia driver version but no information about the GPU at all.

Link to post

I just got notified that the plugin is not known to the community apps plugin or the fix common problems plugin, and when I search for it now, it doesn't show up in the list of plugins that can be downloaded....  When I search for "NVIDIA" in the community apps plugin, it no longer shows up...  Rebooting the server, and uninstalling the GPU statistics plugin changed nothing...

 

Did this plugin just go unsupported? or is there a glitch in the community apps plugin?  if it is now unsupported, I will be very sad to see it go... 😢

 

Still thank you to the author no matter which is the case...

Link to post
2 hours ago, Warrentheo said:

I just got notified that the plugin is not known to the community apps plugin or the fix common problems plugin, and when I search for it now, it doesn't show up in the list of plugins that can be downloaded....  When I search for "NVIDIA" in the community apps plugin, it no longer shows up...  Rebooting the server, and uninstalling the GPU statistics plugin changed nothing...

 

Did this plugin just go unsupported? or is there a glitch in the community apps plugin?  if it is now unsupported, I will be very sad to see it go... 😢

 

Still thank you to the author no matter which is the case...

Glitch on github

  • Like 2
Link to post
23 hours ago, aptalca said:

I'm kinda seeing a trend here. Most if not all of the people experiencing these issues are also using the gpu stats plugin. Did you try without it?

Thanks for the help.

 I uninstalled the GPU Stats plugin and rebooted.

The issue happened again within 10 minutes (Checking with nvidia-smi I get the same GPU Lost message)

I rebooted, and it maybe lasted 4 hours or so before happening again.

 

I've used the nvidia-bug-report.sh that is mentioned when nvidia-smi loses the GPU and also carefully checked the syslog

1. Despite my GTX 1660 Super being on the 440.59 supported list (checked on nvidia.com), the nvidia-bug-report.log file states "WARNING: You do not appear to have an NVIDIA GPU supported by the 440.59 NVIDIA Linux graphics driver installed in this system".

2. Trawling through various logs, I found the error code Xid 79 just before the GPU went missing on one occasion. According to the Nvidia developer site, this unfortunately can be attributable to pretty much anything: HW error, driver error, temperature, etc.

3. I've been checking the temperatures / HW state of the card. After boot it's in P0 (12W out of 125W) @ 33C; it then occasionally bumps up to P0 (26W/125W) @ 44C. So even when Plex uses the card, at 44C it is barely ticking over, so I'm pretty sure it's not temperature.

4. I think (looking at logs) there could possibly be some correlation between drives spinning down and the GPU crashing (or it may well be coincidence), I would like to try bulk spinning down/up the drives to see if power spikes might be upsetting the GPU, as I know HDD's draw the most power when they are spinning up.. 

[edit] - I found example user scripts to spin down/up all disks and tried those several times, whilst the GPU is idle and whilst transcoding a 4k HDR, no issues found.

5. I did at some point (more of a quick trial) have some User Scripts to 'tweak the driver for obvious reasons' and also to bump the card back to its lowest power setting. I haven't had these enabled for some time, so I've deleted the scripts entirely and re-installed the unRAID-NVidia 6.8.3 from the plugin just to 'clear' things out.

6. With 100% repeatability, I can trigger the "caller _nv000908rm+0x1bf/0x1f0 [nvidia] mapping multiple BARs" and the associated memory spanning message just by running nvidia-smi to check the GPU is still there. I did this every 5-10 minutes over lunch, and every time I got an associated message in syslog.
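
The repeatable trigger described in item 6 could be checked mechanically by counting matching syslog lines before and after a single query. The sketch below is an illustration under assumptions: `new_log_hits` is a made-up name, and the one-second pause is an arbitrary guess at how long the kernel message takes to reach the log file.

```shell
# Hypothetical helper: run a command and report how many new lines matching
# a pattern appeared in a log file while it ran.
# Usage: new_log_hits LOGFILE PATTERN COMMAND [ARGS...]
new_log_hits() {
    local log="$1" pattern="$2" before after
    shift 2
    # grep -c still prints 0 when nothing matches; "|| true" hides its exit status
    before=$(grep -c -- "$pattern" "$log" || true)
    "$@" >/dev/null 2>&1 || true
    sleep 1   # give the kernel message a moment to reach the log
    after=$(grep -c -- "$pattern" "$log" || true)
    echo $((after - before))
}

# Example (on the server itself):
#   new_log_hits /var/log/syslog 'mapping multiple BARs' nvidia-smi
```

A result of 1 or more on every run would confirm that each query produces a fresh warning.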

 

So nothing conclusive yet, some observations, some clutching at straws, but I sense maybe some experimentation and discussion might prompt something of note.. :)
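
For the Xid code mentioned in item 2, a quick tally of every Xid the driver has logged can show whether 79 is the only one or just the last in a series. This is a sketch, assuming the log lines follow the usual `NVRM: Xid (PCI:...): <code>, ...` layout; the function name `list_xid_codes` is hypothetical.

```shell
# Hypothetical helper: list the distinct NVIDIA Xid codes found in a log
# file, most frequent first, as "<count> <code>" lines.
# Usage: list_xid_codes /var/log/syslog
list_xid_codes() {
    # Keep only the numeric code that follows "NVRM: Xid (<bus id>): "
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\).*/\1/p' "$1" |
        sort -n | uniq -c | sort -rn
}
```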

 

 

One test I 'may' do is to go back to the normal unRAID build, and pass the GPU through to my windows 10 VM (it's only spun up once in a blue moon) and run something GPU intensive on that and see if it ever loses the GPU, whilst this is changing a few too many variables at once, it would at least indicate the HW itself is OK (Power/Temperature concerns etc)..

 

 

Edited by Snubbers
Link to post

Hey All,

Trying to sort out an issue I'm experiencing after moving to some new hardware.

I've got a P2000 I had been utilizing in my old system. I just recently upgraded the board/cpu/memory and am re-using some of the PCIe cards in the new configuration, one of which is the P2000.

 

The new build is a 3950X in an ASRock Rack X470D4U

 

Everything appears to be working correctly as far as the PCIe cards go, however I can't seem to get the P2000 to pull back in like it had been previously. Prior to swapping everything out I rolled back to 6.8.2 stock, and removed the additional configurations in the docker container that were pointing to the NVIDIA device.

 

My system boots, and I can see the P2000 under Tools > System Devices

2020_04_28_13_37_57_unRAID_SysDevs.png.45bd4afdb06ebace9b67744f4ced7256.png

 

I have tried reinstalling both 6.8.2 as well as 6.8.3 with the NVIDIA drivers and I see the same error for both:

2020_04_28_13_37_43_unRAID_Unraid_Nvidia.thumb.png.7d6e0f3e7a170215c5368307e3619ee5.png

 

Running 'nvidia-smi' I see the same error in the console and the below error in the system log:

2020_04_28_13_38_18_System_Log.png.de8636776c85e4176913e1d11647a4b0.png

 

The one thing I haven't tried is passing it to a VM yet, though I know this bypasses the need for the NVIDIA driver as it relates to the docker container so it may be a moot point to test.

 

Reading back a bit, it seems a few people have had similar issues, but either they haven't resolved them or I'm missing their fix. If anyone is able to provide a little further insight into this, I'd appreciate it.

 

Also attaching a diagnostic

 

unraid-diagnostics-20200428-1346.zip

Link to post
21 hours ago, Solverz said:

No it is not chosen in any vm templates in fact I don't have any vm templates at all.

 

I also noticed that when I remove the GPU and go into the Nvidia Unraid settings, it states the driver could not be loaded. But when I have the GPU installed, it just lists the nvidia driver version but no information about the GPU at all.

You were the one that said you passed it through to a VM. I assume you removed the VM template then. Have you rebooted after you remove the VM template?

If you haven't rebooted, do so and post the output of lspci -k. If you have rebooted, still post the output of the above command.
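
The part of the `lspci -k` output that matters here is the "Kernel driver in use" line under each NVIDIA device: `vfio-pci` would mean the card is reserved for VM passthrough, while `nvidia` means the driver has claimed it. A rough sketch for pulling out just those lines follows; the function name `nvidia_driver_in_use` is made up, and it assumes the standard `lspci -k` layout of one unindented line per device followed by indented detail lines.

```shell
# Hypothetical helper: reads `lspci -k` output on stdin and prints
# "<slot> <driver>" for every NVIDIA device, or "(none)" if no driver is bound.
nvidia_driver_in_use() {
    awk '
        /^[^[:space:]]/ {                 # a new device block starts at column 0
            if (slot != "") print slot, driver
            slot = ""; driver = "(none)"
            if ($0 ~ /NVIDIA/) slot = $1
        }
        /Kernel driver in use:/ && slot != "" { driver = $NF }
        END { if (slot != "") print slot, driver }
    '
}

# Example (on the server itself):
#   lspci -k | nvidia_driver_in_use
```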

Link to post
On 4/28/2020 at 7:35 PM, saarg said:

You were the one that said you passed it through to a VM. I assume you removed the VM template then. Have you rebooted after you remove the VM template?

If you haven't rebooted, do so and post the output of lspci -k. If you have rebooted, still post the output of the above command.

Apologies for the confusion. All I did was click the + icon to add a new VM, and I was then able to select the GPU from there; I did not actually create a VM. I just checked whether the option to pass the GPU through to the VM was available.

 

I have attached the results of lspci -k in the .txt file.

 

Appreciate your help!!!

 

Edited by Solverz
Link to post
On 4/25/2020 at 6:18 AM, saarg said:

You can't, as each build only includes the latest driver available at the time it is built.

So even though the 340.108 version came out in Dec of 2019, you are only using the 400 series drivers?

With your experience do you think there is another way for me to make these work?

 

LINUX X64 (AMD64/EM64T) DISPLAY DRIVER

 

Version:340.108

Release Date:2019.12.23

Operating System:Linux 64-bit

Language:English (US)

File Size:66.92 MB

Link to post
On 4/25/2020 at 5:40 PM, beardymcgee said:

Grid K2 cards aren't compatible with anything other than VMware. They don't have standard Linux drivers; it's a VMware ESX-only card.

Nvidia does have drivers for linux for these cards. I'm just having issues trying to get them added.

 

LINUX X64 (AMD64/EM64T) DISPLAY DRIVER

 

Version:340.108

Release Date:2019.12.23

Operating System:Linux 64-bit

Language:English (US)

File Size:66.92 MB

Link to post
43 minutes ago, Solverz said:

Apologies for the confusion. All I did was click the + icon to add a new VM, and I was then able to select the GPU from there; I did not actually create a VM. I just checked whether the option to pass the GPU through to the VM was available.

 

I have attached the results of lspci -k in the .txt file.

 

Appreciate your help!!!

Results.txt 7.2 kB · 0 downloads

The correct modules are loaded, so it should work.

Is the card recognized if you run the command nvidia-smi on the command line?

If it is, then it's just the command the plugin runs at boot to find the UUID of the card that fails for some reason. Earlier in this thread, chbmb posted a command you can run to get the UUID; you can try that.

Link to post

Hi all, 

 

So, I've recently upgraded my server. The RAM I put in turned out to be bad (failed Memtest miserably, oof) and all hell kinda broke loose with the server (cache drive wouldn't mount, got weird bugs when plugged into the monitor, network connection with web ui and telnet were super unstable, other stuff). Anyways, I reinstalled the old RAM and things are functioning as normal except my syslog reveals the following repeated bit of text: 

 

Apr 29 13:33:51 TheShire kernel: caller _nv000908rm+0x1bf/0x1f0 [nvidia] mapping multiple BARs

Apr 29 13:33:53 TheShire kernel: resource sanity check: requesting [mem 0x000c0000-0x000fffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000dffff window]

 

Now, it was happening every 2 seconds when I had GPU stats installed. It stopped when I uninstalled. But I ran a "watch nvidia-smi" in terminal and it reappeared (bc of course it did). I've poked around the forums but haven't seen anything conclusive on: (a) what this bit of text means or (b) how to fix it. I'm hoping to find answers to both. Thanks in advance for your responses, syslog attached (from both today and yesterday when stuff was going haywire, just in case).  

 

p.s. Maybe better for a different thread, but here goes: given the snafu I had yesterday, would it be advisable to backup my stuff and do a clean install of unraid? I'd like to avoid that if possible. 

 

 

SysLog04292020.rtf theshire-diagnostics-20200428-2239.zip

Link to post
On 4/28/2020 at 11:11 PM, saarg said:

The correct modules are loaded, so it should work.

Is the card recognized if you run the command nvidia-smi on the command line?

If it is, then it's just the command the plugin runs at boot to find the UUID of the card that fails for some reason. Earlier in this thread, chbmb posted a command you can run to get the UUID; you can try that.

Sorry for the late response. I have just run the command "nvidia-smi" and the result is "No devices were found".

 

If the command below is the one from chbmb that you meant, then it gave the same result: "No devices were found".

 

nvidia-smi --query-gpu=gpu_name,gpu_bus_id,gpu_uuid --format=csv,noheader | sed -e s/00000000://g | sed 's/\,\ /\n/g'
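
For reference, the two `sed` stages in that command only reformat whatever `nvidia-smi` prints, so when the driver reports "No devices were found" there is nothing for them to work on. The sketch below restates the formatting step as a function so it can be tried on sample text; `format_gpu_info` is a hypothetical name, and it uses `tr` for the line split instead of the original second `sed`, which is a substitution on my part.

```shell
# Hypothetical helper: reads "name, bus id, uuid" CSV lines on stdin and
# prints one field per line, with the PCI domain prefix removed.
format_gpu_info() {
    sed -e 's/00000000://g' |   # drop the PCI domain prefix from the bus ID
        tr ',' '\n' |           # one field per line
        sed -e 's/^ //'         # remove the space that followed each comma
}

# Example (on the server itself):
#   nvidia-smi --query-gpu=gpu_name,gpu_bus_id,gpu_uuid --format=csv,noheader | format_gpu_info
```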

Link to post
  • trurl locked this topic
Guest
This topic is now closed to further replies.