Instability with dual nvidia gpu 7.2.3

January 23

Community Expert

Good morning all

I'm having weirdness, and I am not sure where exactly to go, as the problem could be coming from many things. So here we go.

I am running Unraid 7.2.3 on an ASUS Z790-AYW motherboard. I have just upgraded my GPU from an NVIDIA 3050 8GB card to an NVIDIA 3060 12GB. The 3060 is more than just more memory; it moves from the 4 lanes of my 3050 to 16 lanes. I currently have the 580 NVIDIA driver, and had the same problem on 590.

I have two M.2 devices installed in the M2_2 and M2_3 slots.

The goal was to have both cards installed to combine the vram in ollama to get 20gb to load larger models.

The computer works as expected with either card installed on its own.

For testing, I installed the 3060 in pcie-1 and the 3050 in pcie-3.

The computer functions as desired up to that point. I am able to reboot with both cards and get into Unraid just fine. Before I do anything else, both cards are reported by nvidia-smi -L.

I can load up the majority of my docker containers and there seems to be no problem, including starting Jellyfin with hardware transcoding.

Next I try to start the ollama instance. This is where things go bad rapidly. ollama (with --gpus=all) tries to enumerate the cards, and it fails. The error is basically a timeout while searching for the cards. It then attempts to fall back to CPU only, and the container gets caught in some molasses here. Stopping the container at that point is particularly hard, and nothing seems to kill it (I tried the GUI, the CLI and dockhand); I have to go and completely disable the Docker service before it closes down.

At this point, trying to run nvidia-smi has the app just stall and never return the card info. I can only regain control by forcing a reboot. This is where it gets even more ridiculous.

When I try to reboot at that point, the server gets into a loop, unable to unmount /mnt/cache. I have tried closing anything that might be on the drive and using lsof to see what might be open, and I still cannot manually unmount the drive. It also gets stuck because the docker image stays loaded as /dev/loop2 and cannot be killed, so the Docker service wouldn't restart anyway.

After forcing it, it'll reboot just fine.
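The nvidia-smi stall described above can at least be detected without losing the terminal. A minimal sketch, assuming GNU coreutils `timeout` is available on the host; `probe` is a hypothetical helper name, not something from the thread. One caveat: a process stuck in an uninterruptible kernel wait may ignore signals, in which case `timeout` itself might not return either.

```shell
#!/bin/sh
# probe SECONDS COMMAND [ARGS...]
# Runs COMMAND with a time limit and prints one word describing the outcome.
# `probe` is a hypothetical helper for illustration only.
probe() {
  limit=$1
  shift
  timeout "$limit" "$@" >/dev/null 2>&1
  case $? in
    0)   echo ok ;;       # command finished in time and succeeded
    124) echo stalled ;;  # GNU timeout exits with 124 when the limit is hit
    *)   echo failed ;;   # command finished in time but returned an error
  esac
}

# Intended use on the host (left as a comment so the sketch runs anywhere):
#   probe 10 nvidia-smi -L
```

Running the probe before and after starting the container would show whether the driver stops answering only once GPU discovery has been attempted.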

I somewhat suspect that one of the M.2 slots is sharing lanes with pcie-3, but I can't verify it. I checked the info here: https://www.asus.com/us/support/faq/1037507/ to read up on bifurcation in case that is the problem. From that, and from inside the UEFI, it seems that bifurcation can only be enabled on pcie-1, which is obviously not what I need.

I have reset the bios to defaults.

I am unable to try gpu2 in the lower pcie slot, as the heatsink and whatnot come up against the header for the front usb ports, and a couple of the sata ports, making it practically impossible to place the card there.

I'm just sorta lost. At first glance, it seems like a hardware issue, but things seem to work until ollama tries to search for gpus. However the hang on /mnt/cache happens ONLY after the init attempt.

So I'm just not sure which direction to head towards for help, so I started here.

Does anyone have any thoughts on the problem? Even if it's just to suggest that I need to check around the docker sphere, or...Well I dunno.

A hearty thanks in advance for any thoughts you might have, no matter how small or out there. It's just weird. THANKS :)

January 24

Community Expert

Check the BIOS setting that lets the 2nd M.2 slot use SATA 5 and 6. You may be hitting a shared PCIe lane, which could be preventing the M.2 or the graphics card from loading.

January 24

Author
Community Expert

Thanks for the reply! I'll say I was assuming a lane thing. Got me to upgrade the bios and root around a little more. Fixed some other random annoyances while I was in there.

However, that doesn't seem to be the problem. I changed what I could, and moved the M.2 drives around. After hours of move, boot, blah, nothing changed.

I confirmed that I can use either card individually when exposed one by one to the container. No fail on 0 or 1. It's only when I try to expose both cards with --gpus=all that things fail. So I guess I've narrowed it down to a problem with that particular container and whatever it is doing. I saw suggestions to edit a systemd file, but if that's inside the container and not in appdata, it's ephemeral and just poofs out on restart of the container. Right?

January 24

Community Expert

Ephemeral, kind of. While that's true, the container also needs the software to run and use....

If you are using the full Docker variables for the NVIDIA card to target one card over the other...

Review the first/second page of the plugin support thread:
https://forums.unraid.net/topic/98978-plugin-nvidia-driver/

Please post a diagnostics file.
What's the output of nvidia-smi

lspci -v

Do you see both graphics cards?

What does the Unraid driver plugin say? You will need the GPU string to select the NVIDIA card to pass into the container...

Similarly, once the card is passed through, if you console into the container and run lspci, do you see the graphics card?

There are some additional settings and things that you may be missing... Unraid is Slackware Linux and uses system init scripts, not systemd... containers can have systemd but normally don't run it or use daemons, due to the nature of Docker.

January 25

Author
Community Expert

Heya again

I've attached the diagnostics as requested. I've read the plugin help, and I believe everything is set up correctly. I am able to use the NVIDIA setup for hardware transcoding in a Jellyfin container and a few others. Additionally, I can get ollama to work with one card, so the basics seem correct.

When I tried NVIDIA_VISIBLE_DEVICES, I couldn't even get the container to start. Using --gpus '"device=0"' worked fine and detects the VRAM etc., as does device=1, each on its own. --gpus=all results in the container throwing a timeout while detecting the GPUs, and that's where things crash. I am also able to target a card directly by UUID with: --gpus device=GPU-609dbfeb-61f6-2d39-4166-99cb10673e54. I was unsuccessful in getting it to load when I tried to define both cards by UUID. Maybe a syntax problem on my end.

I know Slackware doesn't use systemd, but inside the container it seems to be used. The container seems to be based on DISTRIB_DESCRIPTION="Ubuntu 24.04.3 LTS". I referenced the systemd file based on this website:
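On the two-UUID syntax: a minimal sketch, assuming the quoting pattern Docker documents for several devices (`--gpus '"device=0,2"'`) also applies to UUIDs. `gpu_uuid_list` is a hypothetical helper, and whether Unraid's Extra Parameters field preserves the nested quotes is an assumption, not something confirmed in this thread. The sample text is the nvidia-smi -L output quoted in this post.

```shell
#!/bin/sh
# Join every GPU UUID found in `nvidia-smi -L` style text (read from stdin)
# into one comma-separated list. Hypothetical helper for illustration.
gpu_uuid_list() {
  grep -o 'GPU-[0-9a-f-]*' | paste -sd, -
}

# Sample text copied from the nvidia-smi -L output in this thread.
sample='GPU 0: NVIDIA GeForce RTX 3060 (UUID: GPU-609dbfeb-61f6-2d39-4166-99cb10673e54)
GPU 1: NVIDIA GeForce RTX 3050 (UUID: GPU-29500f42-d1df-ba5c-32f2-a32e97a0cfcf)'

uuids=$(printf '%s\n' "$sample" | gpu_uuid_list)

# The inner double quotes keep the comma-separated value together as one
# device list; the outer single quotes protect them from the shell.
gpus_flag="--gpus '\"device=${uuids}\"'"
printf '%s\n' "$gpus_flag"
```

On a live host the sample would be replaced by `nvidia-smi -L | gpu_uuid_list`. This only addresses the syntax question; it does not by itself explain the discovery timeout.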

https://markaicode.com/multi-gpu-ollama-setup-large-model-inference/

Under the "Essential Environment Variables" section. Now, as I read that site, they are obviously not using a container; it seems they are just running it directly from the terminal, which suggested to me that if it needed that file, it would be within the root fs inside the container. Honestly, I never really used Docker before playing with it on Unraid, so my experience and knowledge of it is pretty slim.

nvidia-smi attached as txt to enhance readability

nvidia-smi -L reports :

GPU 0: NVIDIA GeForce RTX 3060 (UUID: GPU-609dbfeb-61f6-2d39-4166-99cb10673e54)

GPU 1: NVIDIA GeForce RTX 3050 (UUID: GPU-29500f42-d1df-ba5c-32f2-a32e97a0cfcf)

lspci -v is also attached (only the NVIDIA parts) as txt. Note that it does show nouveau, as I tried several versions of the driver and have just left it at the open version right now.

From inside the container this is what I get, log wise, from using one card:

time=2026-01-25T03:40:23.012Z level=INFO source=types.go:42 msg="inference compute" id=GPU-609dbfeb-61f6-2d39-4166-99cb10673e54 filter_id="" library=CUDA compute=8.6 name=CUDA0 description="NVIDIA GeForce RTX 3060" libdirs=ollama,cuda_v13 driver=13.0 pci_id=0000:01:00.0 type=discrete total="12.0 GiB" available="11.6 GiB"

time=2026-01-25T03:40:23.012Z level=INFO source=routes.go:1725 msg="entering low vram mode" "total vram"="12.0 GiB" threshold="20.0 GiB"

That's the same for the 3050, other than 8GB. When both cards are defined, it gets as far as this line:

time=2026-01-25T03:40:18.867Z level=INFO source=runner.go:67 msg="discovering available GPUs..."

After that there is no contact to the cards, and it reports that discovery timed out before completing. I can get that exact error later, if I try it right now it'll crash things, and people are staying warm watching some media from it now :)

lspci doesn't seem to be available inside the container. At least I couldn't load it.

uname -a indicates it's using the Unraid kernel: "Linux 5284ec90a273 6.12.54-Unraid #1 SMP PREEMPT_DYNAMIC Tue Oct 21 15:58:46 PDT 2025 x86_64 x86_64 x86_64 GNU/Linux"

Running nvidia-smi (-L) inside the container displays the same information, but only for the card that is exposed to it.

So it only seems to have trouble with both cards. Strange. It doesn't throw an error to the unraid syslog

But thanks for trying still :)

diagnostic.zip nvidiasmi.txt lspci.txt

January 25

Author
Community Expert

Just an update - and I apologize for my confusion. I had no docker experience outside of unraid, and unraid makes it entirely too easy and user friendly lol.. I had previously only virtualized entire systems with kvm.

Regarding those env options I referenced that it wanted in the systemd file, I assume I just missed something. They want you to include things such as:

Environment="OLLAMA_GPU_LAYERS=-1"

I assume that I could deal with that by adding it as a variable under add path/variable etc., or maybe by just adding -e OLLAMA_GPU_LAYERS=-1 to the extra parameters.

Assuming that's correct, I feel stupid for that bit lol.
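The translation described above, from a systemd Environment= line to a container flag, can be sketched as follows. `env_line_to_flag` is a hypothetical helper; the variable name is simply the one from the linked article, and nothing in this thread confirms that the ollama container actually reads it.

```shell
#!/bin/sh
# Turn one systemd-style line such as   Environment="KEY=value"
# into the equivalent `-e KEY=value` flag for docker run.
# Hypothetical helper for illustration; prints nothing for other lines.
env_line_to_flag() {
  printf '%s\n' "$1" |
    sed -n 's/^[[:space:]]*Environment="\{0,1\}\([^"]*\)"\{0,1\}[[:space:]]*$/-e \1/p'
}

env_line_to_flag 'Environment="OLLAMA_GPU_LAYERS=-1"'
# prints: -e OLLAMA_GPU_LAYERS=-1
```

Either route (a variable added in the Unraid template, or -e in the extra parameters) ends up as an environment variable on the container, so the systemd file inside the image does not need to be edited.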

January 25

Author
Community Expert

Well looks like maybe I'm in the wrong spot. While the CA ollama container causes the above referenced crash, I was able to install the LocalAI container and it shows both cards without any issue.

So it appears that it's specific to how ollama is trying to detect the gpus. I tried asking in reddit r/ollama but seems that most have either not tried this, didn't have troubles, or were just not interested. So thanks for the assistance :)

April 24

Author
Community Expert
Solution

Hey guys, sorry for the long delay. Yeah, I resolved this.

It turned out to be a problem with the motherboard. It decided on its own to enable some overclocking for some reason. When I backed all of that off, everything started working.

Stupid Asus lol

Solved by enesha