OpenWebUI, Ollama and local AI

February 12, 2025

Does anyone use Open WebUI and Ollama for their own local AI on their server?

What models are you able to run on your Unraid system?

What are your specs?


Edited February 12, 2025 by Maximo101


February 12, 2025

Author

I recently upgraded my server with an ASUS Prime Z890M-Plus WiFi motherboard, an Intel Core Ultra 5 245K and 64GB of DDR5 RAM, but no GPU (I'm going to add one in the future).

Is there any way to utilise the iGPU from the CPU in Ollama?

I ran deepseek-r1:8b [~4.7GB]. It took about a minute before the response started, then it wrote at about 1-3 words per second for the duration of the response.

It maxed out the CPU while it was thinking and responding.


February 15, 2025

Author

I just installed the Ollama integration in Home Assistant and it downloaded the llama3.2 model, which has 3B parameters and is 1.87GB in size.

This runs much faster (since it doesn't think and is smaller) and responds pretty quickly (similar to a ChatGPT response time).


March 17, 2025

Author

Is anyone else testing any AI models locally?

I just installed Google's free, open source Gemma 3 (4B) on Ollama; it's really fast and uses fewer resources. Their top-end model is meant to be right up there in the benchmarks.

Many other models are getting smaller and better, like Qwen QwQ 32B, which is meant to be comparable to DeepSeek R1 671B and OpenAI o3-mini.

I am installing LiteLLM, as suggested by NetworkChuck, to control the models I run locally (and maybe add some cloud services via API).


March 17, 2025

Author

Chatbot Arena is an open platform for crowdsourced AI benchmarking, developed by researchers at UC Berkeley SkyLab and LMArena. With over 1,000,000 user votes, the platform ranks best LLM and AI chatbots using the Bradley-Terry model to generate live leaderboards.

https://lmarena.ai/?leaderboard
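For anyone curious how a Bradley-Terry ranking comes out of pairwise votes, here is a minimal sketch in Python. It is not the leaderboard's actual code: the function name `bradley_terry`, the (winner, loser) vote format and the fixed iteration count are all assumptions made for illustration, and ties are simply ignored.

```python
from collections import defaultdict


def bradley_terry(votes, iterations=100):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Uses the classic iterative (minorise-maximise) update. Hypothetical
    helper for illustration only; ties are not modelled.
    """
    wins = defaultdict(int)    # total wins per model
    games = defaultdict(int)   # comparisons per unordered pair of models
    models = set()
    for winner, loser in votes:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    # Start with equal strength for every model.
    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(iterations):
        updated = {}
        for m in models:
            denom = 0.0
            for other in models:
                n = games.get(frozenset((m, other)), 0)
                if other != m and n:
                    denom += n / (strength[m] + strength[other])
            updated[m] = wins[m] / denom if denom else strength[m]
        total = sum(updated.values())
        # Normalise so the strengths sum to 1.
        strength = {m: s / total for m, s in updated.items()}
    return strength


# Example: model "A" wins 3 of 4 head-to-head votes against model "B".
example = bradley_terry([("A", "B")] * 3 + [("B", "A")])
```

Under this model the chance that A beats B is strength[A] / (strength[A] + strength[B]), so sorting by strength gives the ranking.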

Edited March 17, 2025 by Maximo101


April 12, 2025

I haven't yet, but I bought a 3090 that's on the way and I got a new PSU so I can run two of them in the future, so I'm ready to go. I'm just researching Open WebUI and it seems really cool, so I can't wait to set it up. Do you know if I can go ahead and set it up and add the GPU later? Did you need a VM, or can you just use the container straight up on Unraid? I've run local models on my main PC but I can't wait to do it on my server.


May 2, 2025

Author

Sorry for the delayed reply @StayThePath, I didn't realize I wasn't following the topic.

Yes, you can run on the CPU if you have enough normal RAM.

I have been running Gemma 4B and Qwen2.5 5B perfectly with no GPU.

I also just bought an RTX 5070 Ti with 16GB of VRAM, so once I get a PCIe 5 riser cable I'll install that and play around with bigger models locally.


June 8, 2025

Author

Is anyone else running local AI models?

Now that I have a GPU I have installed a few other models, but I haven't really tested them in depth, as I am using the free online ones for the harder tasks (Gemini, Claude, Grok, ChatGPT).

I installed gemma3:12b-it-qat and this is the draw on the GPU when it's responding. The response speed is pretty quick (I was also running a parity check at the same time).


August 6, 2025

Author

OpenAI just released their open source models, which can be downloaded via Ollama (upgrade to the latest version required).
https://openai.com/open-models/
We’re releasing two flavors of the open models:
gpt-oss-120b — for production, general purpose, high reasoning use cases that fits into a single H100 GPU (117B parameters with 5.1B active parameters)
gpt-oss-20b — for lower latency, and local or specialized use cases (21B parameters with 3.6B active parameters)

While the 120b version needs some decent hardware to run locally, the smaller 20b works great on laptops and moderate configs. They are said to be close to OpenAI's o3 models and on par with Alibaba's Qwen3.

There is a playground to test the models too, no account required:

https://gpt-oss.com/

Just added to my AI models on my server.


August 9, 2025

I'm running Ollama on a Windows machine that has a better GPU than my Unraid server. I'm happy with gemma3:4b-it-qat.

I'm running Lobe-Chat and Open WebUI on Unraid, and frankly Lobe-Chat works better for me.


August 10, 2025

Author

I ran some benchmark questions across these open source models in Ollama, using a script created by Gemini/Grok/Claude and run on my Unraid OS v7 home server.

My server specs are in my signature.

You can see the results here; Claude made them into a webpage to view.


February 9

Author

Some new benchmarks, if anyone is interested:

https://claude.ai/public/artifacts/21b88a74-5087-402f-8a93-a833218a6529

Also, there are rumours that Gemma 4 will be out, if not by the end of February, then in the next month or so (judging by when each Gemma model was released in relation to each Gemini model).


March 4

Author

I’ve been running Ollama in a Docker container on my Unraid server for a while now and put together a couple of bash scripts to make managing and testing local models a bit easier. I put them up on GitHub and figured I'd share them here in case anyone else finds them useful for their own setups.

There are two main scripts in the repository:

ollama_benchmark_v1.sh: Tests your installed models against a few standard prompts and outputs a CSV with your tokens/second, latency, and a VRAM/RAM split. It specifically does a hard VRAM purge and an OS cache drop between runs so you get accurate "cold boot" metrics rather than testing from the cache.
update_ollama_models.sh: Just a handy maintenance script that iterates through your library, pulls the latest layers for all your models, and generates a status report CSV.
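
To give an idea of how tokens/second can be derived, here is a rough Python sketch. It is not taken from the scripts in the repository. It assumes the documented timing fields of an Ollama /api/generate response (`eval_count`, `eval_duration`, `load_duration` and `prompt_eval_duration`, with durations in nanoseconds); the function names are made up for illustration.

```python
NANOSECONDS_PER_SECOND = 1e9


def tokens_per_second(response):
    """Generation speed from an Ollama /api/generate response dict."""
    eval_count = response["eval_count"]            # tokens generated
    eval_duration_ns = response["eval_duration"]   # time spent generating, in ns
    if eval_duration_ns <= 0:
        return 0.0
    return eval_count / (eval_duration_ns / NANOSECONDS_PER_SECOND)


def startup_latency_seconds(response):
    """Approximate delay before the first token: model load + prompt processing."""
    load_ns = response.get("load_duration", 0)
    prompt_ns = response.get("prompt_eval_duration", 0)
    return (load_ns + prompt_ns) / NANOSECONDS_PER_SECOND


# Example with made-up numbers: 100 tokens generated in 2 seconds.
sample = {
    "eval_count": 100,
    "eval_duration": 2_000_000_000,
    "load_duration": 1_500_000_000,
    "prompt_eval_duration": 500_000_000,
}
speed = tokens_per_second(sample)
latency = startup_latency_seconds(sample)
```

On a cold run the load time dominates the latency figure, which is why purging the cache between runs changes the numbers so much.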

You can grab them here: https://github.com/Maximo101/ollama-benchmark-suite

Hopefully, someone finds these helpful for tracking performance or keeping things updated.


March 9

I am waiting for UNRAID 7.3 so I can use my Arc B60 with Ollama.


March 16

Having spent all weekend and this evening on it, I can say yes, I'm using OpenWebUI, Ollama and local AI. Now if there were just some way for Ollama to not use my CPU and use my new RTX 5090 instead, I would be something approaching a happy man. I've watched every video I can find, added numerous suggested parameters and restarted more times than I can remember. It's still the same: the driver app sees the card, and Ollama goes, "feck that, I'm having your CPU cores."

For the love of all that is holy, have you seen a post or a guide that forces it to use my GPU? I'm using the open driver, latest version of unraid and the last of my will to live.


March 17

2 hours ago, marcus.glen said:
I'm using the open driver, latest version of unraid and the last of my will to live.

[attached image: image.png]

You need to install this, then tell the Ollama docker to use your nvidia card.


March 17

Thank you MrSir for the reply. It was late in the day and I should have been clearer and said I'm using the open source driver from the Nvidia driver app. Like I said, the system sees the card but Ollama will only use the CPU. Here's the full config. Can you see something blindingly obvious?

I was just thinking that, as you both have got it working, you may have been through the same pain and the solution is something simple rather than wiping it all and starting again. Thanks for looking anyway.


March 17

Hello friend, in the Extra Parameters of the Ollama docker, put --gpus=all
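
One way to check whether a loaded model actually landed on the GPU is to look at what Ollama reports for running models. Below is a minimal Python sketch, assuming Ollama is reachable on its default port 11434 and that the /api/ps endpoint returns `size` and `size_vram` for each loaded model. The helper names and the base URL are assumptions for illustration, so adjust them to your own container setup.

```python
import json
import urllib.request


def gpu_fraction(model_entry):
    """Share of a loaded model that sits in VRAM (0.0 = CPU only, 1.0 = all GPU)."""
    size = model_entry.get("size", 0)
    size_vram = model_entry.get("size_vram", 0)
    if size <= 0:
        return 0.0
    return size_vram / size


def loaded_models(base_url="http://localhost:11434"):
    """Fetch the currently loaded models from a running Ollama instance."""
    with urllib.request.urlopen(base_url + "/api/ps") as resp:
        return json.loads(resp.read()).get("models", [])


def report(models):
    """One line per model, e.g. 'gemma3:4b: 100% in VRAM'."""
    return [f"{m.get('name', '?')}: {gpu_fraction(m):.0%} in VRAM" for m in models]


# Example with a made-up entry; call report(loaded_models()) against a live server.
example = report([{"name": "gemma3:4b", "size": 4_000_000_000, "size_vram": 0}])
```

If a model shows 0% in VRAM after a prompt has been sent, the container is most likely not seeing the card at all.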


March 17

Author

9 hours ago, MrSir said:
Hello friend, in the Extra Parameters of the Ollama docker, put --gpus=all

Add this Extra Parameter as MrSir has said.

I also suggest increasing the default context length from 4096 to 32768 (this is crucial if you want to use Ollama for OpenClaw).
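
The context length can also be set per request instead of changing the default. Here is a minimal Python sketch, assuming the `num_ctx` option of Ollama's /api/generate endpoint and the default port 11434. The function names, the model tag and the base URL are placeholders rather than anything from my own config.

```python
import json
import urllib.request


def build_generate_request(model, prompt, num_ctx=32768):
    """Build an /api/generate payload that asks for a larger context window."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # return one JSON object instead of a stream
        "options": {"num_ctx": num_ctx},  # context length in tokens
    }


def generate(payload, base_url="http://localhost:11434"):
    """Send the payload to a running Ollama instance and return the parsed reply."""
    request = urllib.request.Request(
        base_url + "/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read())


# Example payload; pass it to generate() when the server is up.
payload = build_generate_request("gemma3:4b", "Summarise this thread.")
```

Keep in mind that a bigger context window needs more memory, so a model that only just fits in VRAM at 4096 may spill into RAM at 32768.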

I also have my AI models stored on the cache, to reduce disk reads/writes to the array and to increase inference speed from the SSD cache.


March 18

Author

With local Ollama models, I find it's best to use a model that fully fits within your VRAM. It will give you the highest tokens per second compared to models that offload part of themselves into RAM.

E.g. with my 16GB of VRAM, anything of roughly 12B parameters or less will fully fit and produce 80 - 200 tokens per second (gpt-oss:20b is an exception, as its actual size is 13GB and it fully fits in VRAM).

Anything bigger will spill over from the GPU into RAM and produce 3 - 20 tokens per second.
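
As a rough rule of thumb, the fit check can be sketched in Python like this. The 2GB headroom for context/KV cache and the 4 bits per weight are assumptions, not measured values, and both function names are made up for illustration.

```python
def estimate_model_size_gb(params_billion, bits_per_weight=4):
    """Very rough on-disk size of a quantised model (assumes 4-bit weights by default)."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(model_size_gb, vram_gb, headroom_gb=2.0):
    """True if the model plus an assumed headroom for context fits on the GPU."""
    return model_size_gb + headroom_gb <= vram_gb


# gpt-oss:20b is about 13GB, so it fits on a 16GB card under these assumptions.
gpt_oss_fits = fits_in_vram(13, 16)

# A 12B model at roughly 4 bits per weight comes out at about 6GB.
twelve_b_size = estimate_model_size_gb(12)
```

The actual file size shown in the Ollama library is a better input than the estimate, since quantisation formats vary.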

Also look at the capabilities of the model for what you want to use it for.

E.g. vision if you want it to read images, thinking if you want better reasoning, tools if you want it for agentic use.

I am hoping they update Ollama (Docker version) to make better use of MoE models, which can have a higher total parameter count (B) because only a smaller number are actively used, e.g. 120B models which use 12B active parameters.

Feel free to try the benchmark script I added to GitHub in the link above to see what latency and tokens per second you get with the models you have downloaded on your GPU.

I also have a script to auto-update the Ollama models, and I use it with the User Scripts plugin to run automatically once a week.

You can grab them here: https://github.com/Maximo101/ollama-benchmark-suite

Also note that you can download models not only directly from the Ollama library but also from https://huggingface.co/ if you get the GGUF file (if the architecture is MoE it might not work); otherwise just filter for ones that have the Ollama link.

You can make an account and add your GPU, and it will show in green which quantisations will fit on your GPU.

Hope that's helpful!


April 4

Thank you everyone. This was a classic case of not seeing the wood for the trees. I didn't see the advanced toggle and was just trying to add parameters myself. Totally fixed in 30 seconds once I saw that. Feeling a bit of a fool, but it's good to be learning something completely new again.
