My article explains how to install Ollama and Open WebUI through Docker. To make the setup genuinely useful, you also need to give the model web search capability and feed it relevant docs.
I'm starting to research Docker and SearXNG in more depth so I can write more guides and maybe eventually develop an open source app.
Most tutorials online are extremely insecure: they publish Ollama and the web UI on every network interface with no authentication, so anything on your network (or the internet, if you port-forward) can reach them.
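As a sketch of what a more locked-down setup could look like (the container names, host ports, and volume names here are my assumptions, not a tested recipe), you can bind both services to `127.0.0.1` so they're only reachable from the host itself:

```shell
# Ollama: publish the API on localhost only. The common tutorial form
# `-p 11434:11434` binds to all interfaces instead.
docker run -d --name ollama \
  --gpus=all \
  -p 127.0.0.1:11434:11434 \
  -v ollama:/root/.ollama \
  ollama/ollama

# Open WebUI: same idea, localhost-only, pointed at the Ollama container.
docker run -d --name open-webui \
  -p 127.0.0.1:3000:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
```

`--gpus=all` assumes an NVIDIA card with the NVIDIA Container Toolkit installed; drop it for a CPU-only box.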
When you’re running a model, run `ollama ps` (or `docker exec ollama ollama ps` if Ollama lives in a container) to see how it's split between GPU and CPU. Models that fit entirely in VRAM run at 40+ tokens per second; models that offload to CPU/system RAM are *much* slower, around 8-20 tokens per second. You want the PROCESSOR column of that command's output to show the model is 100% loaded on the GPU.
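For instance, a quick sketch of pulling that column out with `awk` (the sample output below is an assumption of mine, and the exact column layout may differ between Ollama versions):

```shell
# Assumed sample of `ollama ps` output; a partially offloaded model would
# show something like "48%/52% CPU/GPU" in the PROCESSOR column instead.
sample='NAME           ID              SIZE     PROCESSOR    UNTIL
gpt-oss:20b    aa4295ac10c3    14 GB    100% GPU     4 minutes from now'

# Print the PROCESSOR column (fields 5 and 6) for the first model row.
echo "$sample" | awk 'NR==2 { print $5, $6 }'
```

If that prints anything other than `100% GPU`, the model is spilling into system RAM and will be slow.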
I haven’t messed much with AI code yet, but Qwen3, Gemma 3, and GPT-oss 20b all seem like good choices. GPT-oss 20b is a mixture-of-experts model: of its roughly 21b total parameters, only about 3.6b are active per token, and the whole thing takes something like 14 GB of RAM. It's extremely good, and you can probably run it on CPU alone. To feed it your own docs, you also need RAG (retrieval-augmented generation).
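That memory figure roughly checks out on the back of an envelope. GPT-oss ships MXFP4-quantized weights; assuming an effective ~4.25 bits per parameter (my assumption), the weights alone land around 11 GB, and the KV cache plus runtime overhead push the total toward that ~14 GB:

```shell
# Back-of-envelope weight size for GPT-oss 20b.
awk 'BEGIN {
  params = 21e9   # total parameters (21b, even though only ~3.6b are active)
  bits   = 4.25   # assumed effective bits per parameter for MXFP4 weights
  printf "%.1f GB\n", params * bits / 8 / 1e9   # bits -> bytes -> GB
}'
```

Note the active-parameter count governs speed, not memory: all 21b parameters still have to sit in RAM/VRAM, but only ~3.6b are read per token, which is why it's tolerable on CPU.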