My article explains how to install Ollama and Open WebUI through Docker. To make the assistant genuinely useful, you also need to give it web search capability and feed it relevant docs.

I will be starting research into Docker and SearXNG so I can write more guides and maybe eventually develop an open source app.

Most tutorials online set this up in extremely insecure ways.

When you’re running a model, run `ollama ps` or `docker exec ollama ollama ps` to see how much GPU/CPU it’s using. Models that fit entirely in VRAM run at 40+ tokens per second. Models that offload to CPU/RAM are *much* slower, around 8-20 tokens per second. You want the output to show the model as 100% loaded on the GPU.
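A quick check looks something like this (the columns are real `ollama ps` columns, but the model name, ID, size, and time shown are only illustrative):

```sh
# Run inside the container; drop the "docker exec ollama" prefix if Ollama runs on the host
docker exec ollama ollama ps

# Illustrative output -- the PROCESSOR column is the one that matters:
# NAME          ID              SIZE      PROCESSOR    UNTIL
# gemma3:12b    f4031aab637d    10 GB     100% GPU     4 minutes from now
#
# "100% GPU" means the whole model fits in VRAM.
# Something like "24%/76% CPU/GPU" means part of it spilled over into system RAM.
```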

But I haven’t messed much with AI for code. I assume Qwen3, Gemma 3, and GPT-oss 20b are all good. GPT-oss 20b is a mixture-of-experts model, meaning it only ever has 3.6b active parameters, so it takes something like 14gb of RAM. You can probably even run it on CPU; it is extremely good. You do need RAG, though.


Discussion

But yeah, this whole project is focused on making an assistant that’s as close to Gemini 3 Pro in helpfulness as possible, if not better.

Great article, thanks for sharing!

Not using docker personally for ollama, just running it in a shell tab locally on my linux box. I have more than enough vram, still bad results... might be me doing something stupid.

Did any other articles help you out in your learning journey?

Running Ollama directly may introduce security vulnerabilities. From my research, it’s best to run it through Docker. Performance should be the same.

I haven’t found many good guides. I wrote mine because none of the guides I followed worked without exposing either app to the host network.
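As a rough sketch of what I mean (the container names, host port, and volume names here are just examples, adjust for your setup): put both containers on a private Docker network and only publish Open WebUI, bound to localhost.

```sh
# Private bridge network shared by the two containers
docker network create llm-net

# Ollama: no -p flag, so its API (port 11434) is only reachable from containers on llm-net
docker run -d --name ollama --network llm-net \
  --gpus=all \
  -v ollama:/root/.ollama \
  ollama/ollama

# Open WebUI: published only on 127.0.0.1, talks to Ollama over llm-net by container name
docker run -d --name open-webui --network llm-net \
  -p 127.0.0.1:3000:8080 \
  -e OLLAMA_BASE_URL=http://ollama:11434 \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
```

With that, the UI is reachable at http://localhost:3000 on your machine and Ollama itself is never exposed to the host network.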

My guide was inspired by this video, which might help. His setup didn’t work for me, though:

https://youtu.be/qY1W1iaF0yA

I will be updating the guide as I learn how to improve my process. I might switch to Docker Compose, or I might make a startup script that sets everything up and hardens it for security. I might take this so far as to develop a full app, so people stop accidentally exposing their machines to the internet just to run local AI.

You probably don’t have the GPU configured correctly. I recommend just starting over lol

And remember that models take more memory than their weights alone. So if a model has 7gb of weights, it still might not fit on an 8gb VRAM card, because it needs extra memory for the context (the prompt and KV cache) and other overhead. For example, an 8gb model like gemma3:12b actually needs around 10gb.

Run `ollama ps` to see if the model is loaded to your CPU or GPU (or both)
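A quick way to see that overhead for yourself (exact output format varies a bit by Ollama version):

```sh
# Size of the weights on disk
ollama list

# Size actually loaded into memory (weights plus context) and where it landed
ollama ps

# The SIZE column in `ollama ps` is usually a couple of GB larger than the one in
# `ollama list` for the same model -- that gap is the context and other overhead.
```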

I have a Framework Desktop with 128GB of VRAM.

Even the 120b-parameter gpt-oss:120b model runs with like half my VRAM still free.

I don't think it's a raw hardware problem, but the tooling around it seems to break more. Like once the model calls a tool I lose all context... it's strange.

Ahhhh that explains everything. You don’t have a discrete graphics card. Your computer uses an integrated GPU on the APU.

Having 128gb of actual VRAM would be extraordinary. An RTX 5090 has 32gb of VRAM. An H200 (a ~$40k AI GPU) has 141gb.

Your computer uses unified memory. So it uses regular RAM (LPDDR5x I assume) as VRAM.

This is efficient and lets you load much larger models, but it’s slow compared to a dedicated graphics card, roughly 2 to 4 times slower: LPDDR5x has up to four times less memory bandwidth than GDDR7 and up to two times less than GDDR6. You should still be able to run at usable speeds, around reading speed (slow reading speed in some cases).

I expect 8-20 tokens per second when I’m offloading to RAM and 30-80 tokens per second in VRAM (I have DDR5 RAM and GDDR6 VRAM). 10 tps is about reading speed, and 80 is more like whole paragraphs appearing at once. I haven’t tried running a fast model like gemma3:4b purely on CPU/RAM. You might be able to go faster than I do on CPU, considering that’s what your machine is built for. For reference, I have a Ryzen 7 7700X.
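If you want to measure your actual speed instead of guessing, `ollama run --verbose` prints timing stats after each response, including the generation rate in tokens per second (gemma3:4b here is just an example; use whatever model you have pulled):

```sh
# --verbose prints stats like "eval rate: NN tokens/s" after the reply
ollama run gemma3:4b --verbose "Explain unified memory in two sentences."
```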

I’m not sure about the tooling and context thing.