I'm trying a few different ones at the moment: llama3.2:3b, qwen2.5:3b, mistral:7b, deepseek-r1:7b and deepseek-r1:14b. Right now qwen2.5:3b (Q4_K_M) is doing around 144 TPS (according to Claude Code).
Deepseek-r1:14b was too much for it.
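If you want to sanity-check that TPS number independently of what Claude Code reports, Ollama's own API returns generation stats you can compute it from. A minimal sketch, assuming Ollama is serving on its default port 11434 (the prompt is just a placeholder):

```python
# Measure tokens/sec straight from Ollama's generate endpoint.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwen2.5:3b",
        "prompt": "Explain a Lightning channel in one paragraph.",
        "stream": False,
    },
).json()

# eval_count = generated tokens; eval_duration is in nanoseconds
tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
print(f"{tps:.1f} tokens/sec")
```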
Here’s what I’m working on:
https://github.com/knowall-ai/Nod.ie
It's an agent that runs on your machine, all locally, that you can just talk to, and that can ultimately call MCP servers to operate a Bitcoin Lightning node. Under the hood it uses a fork of Unmute for the backend services, but I've added MCP tool-calling capability and "thinking".
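To give a feel for the MCP side: a rough sketch of the kind of tool server the agent could call, using the official MCP Python SDK (pip install "mcp[cli]"). The get_channel_balance tool and the lnd REST endpoint/macaroon paths are illustrative assumptions, not Nod.ie's actual code:

```python
# Hypothetical MCP server exposing one Lightning-node tool over stdio.
import requests
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("lightning-node")

@mcp.tool()
def get_channel_balance() -> str:
    """Return the node's total channel balance in satoshis."""
    # Illustrative call to a local lnd REST API; macaroon/cert paths assumed.
    r = requests.get(
        "https://localhost:8080/v1/balance/channels",
        headers={"Grpc-Metadata-macaroon": open("readonly.macaroon.hex").read().strip()},
        verify="tls.cert",
    )
    return f'{r.json()["balance"]} sats'

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```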
However, this is also running a lot of other GPU workloads, such as Speech-to-Text (STT) and Text-to-Speech (TTS), and I was even trying to add video processing (agent facial animation) using MuseTalk, but it can't quite all squeeze onto a single RTX 3090 (24GB). Looking at options to run a second graphics card (until then, animated AI is on hold) :-)
I don't think it will replace Claude Code (I use that a lot to build it) because of the clever way it grabs context from a project's code base and the like. If you're referring to Claude Desktop, I'm using Open WebUI and it seems pretty comparable (at least to OpenAI's): you can even get results from your local Ollama instance and compare them to results from the OpenAI API, which is pretty cool.