March 12

This is the support thread for the NVIDIA NIM Single Unraid template. I made it for my own use so I could run NVIDIA NIM on Unraid alongside my AnythingLLM container.

This guide explains how to run NVIDIA NIM containers on Unraid using a consumer NVIDIA GPU.
NIM provides optimized inference servers with an OpenAI-compatible API, making it easy to connect tools like AnythingLLM or Open WebUI.

Tested environment:

RTX 3060 12 GB
Unraid 6.12+
NIM 1.10.1

Prerequisites

You will need the following:

Unraid 6.12 or later
NVIDIA GPU (Turing architecture or newer)
Examples: RTX 20xx, RTX 30xx, RTX 40xx
NVIDIA drivers installed in Unraid
- Community Applications → Nvidia-Driver plugin (GPU Statistics is optional, for monitoring)
Free NVIDIA NGC account
https://build.nvidia.com
NGC API key generated from your NGC dashboard

Model Selection

NIM uses pre-optimized engine profiles, which are primarily designed for data center GPUs.
Consumer GPUs require smaller models and reduced context windows.

Example models

| Model | VRAM Required | Fits 12 GB GPU |
|---|---|---|
| meta/llama-3.2-3b-instruct | ~6 GB | ✅ Recommended |
| microsoft/phi-3-mini-4k-instruct | ~8 GB | ✅ Yes |
| nvidia/Llama-3.1-Nemotron-Nano-4B-v1.1 | ~10 GB | ✅ Yes |
| mistralai/mistral-7b-instruct-v0.3 | ~14 GB fp16 | ❌ OOM |
| meta/llama-3.1-8b-instruct | ~22 GB bf16 | ❌ OOM |
| meta/llama-3.1-70b-instruct | ~80 GB | ❌ Multi-GPU |

If you want to run 7B+ models on a 12 GB GPU, consider Ollama, which supports quantized weights.
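To check a model against your own card, one option is to compare the VRAM figure from the table with what `nvidia-smi` reports. This is only a rough sketch: `fits_gpu` and `gpu_total_mib` are hypothetical helper names, and the 1024 MiB headroom is an assumed safety margin, not a number from NVIDIA.

```shell
# Hypothetical helper: succeeds if a model needing REQUIRED_GB of VRAM fits on
# a GPU with TOTAL_MIB of memory while leaving HEADROOM_MIB free (assumed margin).
fits_gpu() {
  local required_gb="$1" total_mib="$2" headroom_mib="${3:-1024}"
  [ $(( required_gb * 1024 + headroom_mib )) -le "$total_mib" ]
}

# Total memory of the first GPU in MiB, as reported by the driver.
gpu_total_mib() {
  nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -n 1
}

# Usage (requires the NVIDIA driver to be installed):
#   fits_gpu 6 "$(gpu_total_mib)" && echo "fits" || echo "too large"
```

With a 12 GB card reporting 12288 MiB, this agrees with the table: 6, 8 and 10 GB pass, 14 GB does not.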

NGC Registry Login

⚠️ This must be done before Unraid can pull NIM images.

NIM images are hosted on NVIDIA's private registry (nvcr.io), not Docker Hub.

Run this one time in the Unraid terminal:

docker login nvcr.io

Username: $oauthtoken (type this literal string; it is not a shell variable)
Password: YOUR_NGC_API_KEY

⚠️ Important:

docker login → allows pulling the container image
NGC_API_KEY → allows downloading model weights at runtime

Both are required.
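If you would rather not type the key at a prompt, the login can also be done non-interactively. A minimal sketch, assuming your key is already in an `NGC_API_KEY` shell variable; `ngc_login` is just a hypothetical wrapper name.

```shell
# Log in to nvcr.io without an interactive prompt.
# The username is the literal string $oauthtoken (single-quoted so the shell
# does not expand it). The key is passed on stdin, not on the command line.
ngc_login() {
  local api_key="$1"
  printf '%s' "$api_key" | docker login nvcr.io --username '$oauthtoken' --password-stdin
}

# Usage, after setting NGC_API_KEY in your shell:
#   ngc_login "$NGC_API_KEY"
```

Passing the key on stdin keeps it out of the process list and the shell history.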

Cache Directory Permissions

Before starting the container, create the cache directory with the correct permissions.

NIM runs inside the container as UID/GID 1000.
If the cache directory is owned by root, the container will fail to start.

Run:

mkdir -p /mnt/user/appdata/nvidia-nim/cache
chown -R 1000:1000 /mnt/user/appdata/nvidia-nim/cache
chmod 775 /mnt/user/appdata/nvidia-nim/cache
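To confirm the ownership took effect before starting the container, something like the following can help. `cache_owner_ok` is a hypothetical helper, and it relies on GNU `stat` (the `-c` option).

```shell
# Hypothetical helper: succeeds if DIR is owned by the expected UID:GID.
# Defaults to 1000:1000, the user NIM runs as inside the container.
cache_owner_ok() {
  local dir="$1" expected="${2:-1000:1000}"
  [ "$(stat -c '%u:%g' "$dir")" = "$expected" ]
}

# Usage:
#   cache_owner_ok /mnt/user/appdata/nvidia-nim/cache || echo "fix ownership first"
```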

Environment Variables

| Variable | Value | Notes |
|---|---|---|
| NGC_API_KEY | your_ngc_api_key | Required. Used to download model weights |
| NIM_MODEL_NAME | meta/llama-3.2-3b-instruct | Must match the model in the container image (Repository field) |
| NIM_MAX_MODEL_LEN | 16384 | Required for consumer GPUs |
| NIM_CACHE_PATH | /opt/nim/.cache | Cache directory inside the container |
| CUDA_VISIBLE_DEVICES | 0 | Use `0` for single GPU |
| PYTORCH_CUDA_ALLOC_CONF | expandable_segments:True | Reduces memory fragmentation |
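For anyone who wants to see roughly what these settings look like outside the Unraid template, here is a sketch that prints (does not run) a plain `docker run` command. The `--runtime=nvidia` flag, the container name, and the host-to-container port mapping are my assumptions and may not match the template exactly; `nim_run_cmd` is a hypothetical helper.

```shell
# Print (not run) a docker command built from the variables above.
# Assumptions: --runtime=nvidia, the container name and the port mapping are
# illustrative and not taken from the template; adjust them to your setup.
nim_run_cmd() {
  local model="${1:-meta/llama-3.2-3b-instruct}"
  local host_port="${2:-8000}"
  local cache="${3:-/mnt/user/appdata/nvidia-nim/cache}"
  # "-e NGC_API_KEY" with no value passes the variable through from your shell.
  printf '%s ' \
    docker run -d --name "nim-${model##*/}" \
    --runtime=nvidia \
    -e NGC_API_KEY \
    -e "NIM_MODEL_NAME=${model}" \
    -e NIM_MAX_MODEL_LEN=16384 \
    -e NIM_CACHE_PATH=/opt/nim/.cache \
    -e CUDA_VISIBLE_DEVICES=0 \
    -e PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
    -v "${cache}:/opt/nim/.cache" \
    -p "${host_port}:8000" \
    "nvcr.io/nim/${model}:latest"
  printf '\n'
}

nim_run_cmd
```

Printing the command first makes it easy to compare against what the template generates before anything is started.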

First Run

On the first startup, NIM downloads the model weights to the cache directory.

Example size:

~6 GB for llama-3.2-3b

This can take several minutes depending on your internet connection.

You can verify the container is running with:

curl http://localhost:8000/v1/models
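Since the first start can take a while, a small polling loop saves re-running that command by hand. This is a sketch only: `wait_for_nim` is a hypothetical helper, and the 600 second timeout and 5 second interval are arbitrary defaults.

```shell
# Hypothetical helper: poll URL until it answers or TIMEOUT seconds have passed.
wait_for_nim() {
  local url="${1:-http://localhost:8000/v1/models}"
  local timeout="${2:-600}" interval="${3:-5}"
  local waited=0
  # -f makes curl fail on HTTP errors; -s silences the progress output.
  until curl -fs -o /dev/null "$url"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "NIM not ready after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
  echo "NIM is ready"
}

# Usage:
#   wait_for_nim http://localhost:8000/v1/models 900
```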

Connecting Clients

NIM exposes an OpenAI-compatible API, so most AI clients work out of the box.

Connection settings

| Setting | Value |
|---|---|
| Docs | `http://[unraid-ip]:8000/docs` |
| Base URL | `http://[unraid-ip]:8000/v1` |
| API Key | Any non-empty string (e.g. `nim`) |
| Model | meta/llama-3.2-3b-instruct |
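Before pointing a client at the server, a quick request from the terminal confirms these settings work. A minimal sketch using the standard OpenAI-style `/chat/completions` route; `nim_chat` is a hypothetical helper, the prompt is a placeholder, and the IP address in the usage line is only an example.

```shell
# Send one chat request to the OpenAI-compatible endpoint.
# The prompt is a fixed placeholder; the API key can be any non-empty string.
nim_chat() {
  local base_url="$1" model="${2:-meta/llama-3.2-3b-instruct}"
  curl -s "${base_url}/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer nim" \
    -d "{\"model\": \"${model}\", \"messages\": [{\"role\": \"user\", \"content\": \"Say hello in one sentence.\"}], \"max_tokens\": 64}"
}

# Usage (replace the example address with your Unraid IP):
#   nim_chat http://192.168.1.10:8000/v1
```

If this returns a JSON response with a `choices` array, the same Base URL and Model values should work in AnythingLLM or Open WebUI.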

Compatible clients

AnythingLLM
Open WebUI
LangChain
LlamaIndex
Cursor (custom OpenAI base URL)
Any application with configurable OpenAI endpoints

Switching Models

Currently, the template uses model-specific container images, so each model has its own image.

To switch models:

Stop the container
Change the Repository field to the new model's image, for example:

nvcr.io/nim/microsoft/phi-3-mini-4k-instruct:latest

Update the model variable:

NIM_MODEL_NAME=microsoft/phi-3-mini-4k-instruct

Start the container

To run multiple models, create additional containers, each on a different host port:

8000
8001
8002

They can share the same cache directory; each model's weights are stored only once.
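With several containers running, it helps to see at a glance which ports are answering. A rough sketch; `check_nim_ports` is a hypothetical helper and the 5 second timeout is arbitrary.

```shell
# Hypothetical helper: query /v1/models on each host port and report its status.
check_nim_ports() {
  local host="$1" port
  shift
  for port in "$@"; do
    if curl -fs -o /dev/null --max-time 5 "http://${host}:${port}/v1/models"; then
      echo "${port}: up"
    else
      echo "${port}: down"
    fi
  done
}

# Usage:
#   check_nim_ports localhost 8000 8001 8002
```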

Troubleshooting

Cache Permission Error

If the container fails with:

PermissionError: [Errno 13] Permission denied: '/opt/nim/.cache/local_cache'

Run:

chown -R 1000:1000 /mnt/user/appdata/nvidia-nim/cache
chmod 775 /mnt/user/appdata/nvidia-nim/cache

KV Cache Size Error

Example error:

ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache

Fix:

NIM_MAX_MODEL_LEN=16384

If the error persists, lower the value further.

Consumer GPUs cannot handle the full context window used by data center profiles.
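Some back-of-the-envelope arithmetic shows why the full window does not fit. The sketch below estimates the KV cache for a single sequence; the model shape (28 layers, 8 KV heads, head dimension 128, 2 bytes per element) is what I assume for Llama 3.2 3B in fp16, and real servers allocate memory differently, so treat the numbers as a rough guide. `kv_cache_mib` is a hypothetical helper.

```shell
# KV cache size in MiB for one sequence:
# 2 (keys and values) x layers x KV heads x head dim x tokens x bytes per element.
kv_cache_mib() {
  local layers="$1" kv_heads="$2" head_dim="$3" seq_len="$4" bytes="${5:-2}"
  echo $(( 2 * layers * kv_heads * head_dim * seq_len * bytes / 1048576 ))
}

# Assumed Llama 3.2 3B shape: 28 layers, 8 KV heads, head dim 128, fp16.
kv_cache_mib 28 8 128 131072   # full 131072-token window -> 14336
kv_cache_mib 28 8 128 16384    # NIM_MAX_MODEL_LEN=16384   -> 1792
```

Under these assumptions the full window alone needs about 14 GiB, more than the whole card, while 16384 tokens needs under 2 GiB on top of the model weights.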

Common Errors

| Error | Cause | Fix |
|---|---|---|
| 401 Unauthorized | Not logged into nvcr.io | Run `docker login nvcr.io` |
| ValueError: invalid literal 'all' | CUDA_VISIBLE_DEVICES=all | Change to `0` |
| PermissionError on .cache | Wrong permissions | Fix cache directory permissions |
| max seq len > KV cache | Context window too large | Set `NIM_MAX_MODEL_LEN=16384` |
| CUDA out of memory | Model too large | Use a smaller model |
| No compatible profiles | GPU too old | Requires RTX 20xx or newer |
| nvfp4 unsupported warning | Consumer GPU limitation | Safe to ignore |
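If you prefer not to scan the logs by eye, the table can be turned into a small filter. This is only a sketch: `diagnose_nim_log` is a hypothetical helper, the patterns are taken from the error strings listed in this post, and the exact wording in your logs may differ between NIM versions.

```shell
# Hypothetical helper: read container log lines on stdin and print the
# suggested fix for each line that matches a known error pattern.
diagnose_nim_log() {
  local line
  while IFS= read -r line; do
    case "$line" in
      *"401 Unauthorized"*)         echo "Run: docker login nvcr.io" ;;
      *"invalid literal"*"all"*)    echo "Set CUDA_VISIBLE_DEVICES=0" ;;
      *"PermissionError"*".cache"*) echo "Fix cache directory permissions" ;;
      *"max seq len"*)              echo "Set NIM_MAX_MODEL_LEN=16384" ;;
      *"CUDA out of memory"*)       echo "Use a smaller model" ;;
      *"No compatible profiles"*)   echo "GPU too old: RTX 20xx or newer required" ;;
    esac
  done
}

# Usage (the container name is an example):
#   docker logs nvidia-nim 2>&1 | diagnose_nim_log | sort -u
```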

XML Template

This repository includes an Unraid Community Applications-compatible template:

nvidia-nim-single.xml

To install manually, copy the file to:

/boot/config/plugins/dockerMan/templates-user/

After copying the file there, it will appear in the Unraid Docker template list.
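From the Unraid terminal, the copy step might look like the sketch below. `install_template` is a hypothetical helper, and it assumes the XML file has already been downloaded to the current folder.

```shell
# Hypothetical helper: copy the template into Unraid's user template folder.
install_template() {
  local src="${1:-nvidia-nim-single.xml}"
  local dest="${2:-/boot/config/plugins/dockerMan/templates-user}"
  mkdir -p "$dest" && cp "$src" "$dest/"
}

# Usage, from the folder containing the downloaded XML:
#   install_template
```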


Edited March 13 by PikkonMG


March 12

Author

Placeholder: reserved for future posts if needed.


[SUPPORT] NVIDIA NIM on Unraid – GPU AI Inference Server
