Good morning, Nostr. Who's running local LLMs? What are the best models that can run at home for coding on a beefy PC system? In 2026, I want to dig into local LLMs more and stop using Claude and Gemini as much. I know I can use Maple for more private AI, but I prefer running my own model. I also like the fact that there are no restrictions on these models when run locally. I know hardware is the bottleneck here; hopefully, these things become more efficient.
#AskNostr
Interested as well! #AskNostr
24 GB VRAM (3090, 7900-class cards): the latest Mistral 24B, Qwen3 32B, and Qwen3 30B-A3B (MoE)
48 GB: 70B-size models at decent quants, Mistral dev large at lobotomized quants. Mistral dev large is the main one in this bracket. There might be other good 70Bs released lately.
96 GB: gpt-oss 120B
This is to fit everything in VRAM. With MoEs (Qwen3 30B-A3B, gpt-oss) you can get by with VRAM+RAM without ruining your speed, depending on how fast your RAM is.
But it's usually a speed hit, so I don't use anything that doesn't fit in VRAM.
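To make the VRAM+RAM split concrete, here's a minimal sketch using llama-cpp-python; the GGUF filename and the layer count are placeholders you'd tune for your own card and quant.

```python
# Minimal sketch: partial GPU offload with llama-cpp-python (pip install llama-cpp-python).
# Model path and n_gpu_layers are placeholders -- raise the layer count until VRAM is nearly
# full and let the remaining weights sit in system RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3-30b-a3b-q4_k_m.gguf",  # hypothetical local GGUF quant
    n_gpu_layers=32,   # layers kept in VRAM; the rest are served from RAM
    n_ctx=8192,        # context window; larger contexts eat more VRAM
)

out = llm("Write a Python function that parses an RFC 3339 timestamp.", max_tokens=256)
print(out["choices"][0]["text"])
```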
nostr:nprofile1qqsrg8d45l36jv05jz2as2j2ejfee79xw2fmreqhnl2ttsz5f38u9mcpr4mhxue69uhkummnw3ezucnfw33k76twv4ezuum0vd5kzmp0qyg8wumn8ghj7mn0wd68ytnddakj7qgwwaehxw309ahx7uewd3hkctckedwd0 uses local LLMs for his image generation, not coding. But he may have some insight.
I'm very interested in doing image generation locally but haven't found a good guide on how.
nostr:nprofile1qqsvyxc6dndjglxtmyudevttzkj05wpdqrla0vfdtja669e2pn2dzuqpzamhxue69uhhyetvv9ujumn0wd68ytnzv9hxgtcpzamhxue69uhhyetvv9ujuurjd9kkzmpwdejhgtcppemhxue69uhkummn9ekx7mp0kv6trc
Local LLMs are a blessing, free and loyal to you.
I use GPT4All with a Mistral OpenOrca model. I prompt using text posts and ideas. I want to see if the base cultural context for cross-domain ideas matches what I'm trying to say. It works fairly well. There are even some valid corrections that I didn't think of sometimes. I don't have a decent GPU on my main GNU/Linux machine; GPT4All runs fine on just the CPU. I do have a Mac mini M1 that is much faster, as GPT4All will use the hardware, but I usually don't take the time to switch the KVM (I currently just use it for testing software builds). I've never used standalone cloud AI. I don't care about it for a variety of reasons. One benefit to GPT4All is you can point it at your local documents as well to tailor the model.
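For anyone wanting to script that same workflow, here's a minimal sketch with the gpt4all Python package; the model filename is a placeholder for whatever GGUF the GPT4All app has already downloaded.

```python
# Minimal sketch: CPU-only prompting with the gpt4all Python package (pip install gpt4all).
# The model filename below is a placeholder -- point it at the OpenOrca GGUF file on disk.
from gpt4all import GPT4All

model = GPT4All("mistral-7b-openorca.gguf2.Q4_0.gguf", device="cpu")  # hypothetical filename

with model.chat_session():
    reply = model.generate(
        "Does this framing of cross-domain ideas hold up? <paste post text here>",
        max_tokens=400,
    )
    print(reply)
```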
My setup is an i9-13900K + 64 GB at 6400 MT/s + RTX 4090.
The absolute best all-around AI is Llama 3.3, but it is a bit outdated and slow. Newer MoE models like Llama 4 and gpt-oss are flashier and faster, but they are mostly experts at hallucinating.
People will also suggest DeepSeek, but generally speaking 24 GB of VRAM is just too small for "reasoning models" to actually be an improvement. I haven't tried some of the more recent developments, but I have some hope.
If someone were to train a llama3.3 like AI but have it focus on tool use, like reading the source code and documentation for the libraries you have installed, then I think it could be very good.
I don't think you can really run anything unless you have a card with a minimum of 16 GB of VRAM. Even then, the model you can run would be about a quarter of Sonnet's performance. You need like four 24 GB cards to get close.
As I understand it, you'll want to budget roughly 1 GB of RAM per 1B parameters.
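For a rough sense of what that rule of thumb implies, here's a back-of-the-envelope sketch; the 4-bit width and the overhead factor are assumptions, and at 8-bit the 1 GB per 1B parameters figure applies almost directly.

```python
# Back-of-the-envelope memory estimate: params * bytes per weight, plus runtime overhead.
# The quantization width and overhead factor are rough assumptions, not measurements.
def estimate_gb(params_billions: float, bytes_per_weight: float = 0.5, overhead: float = 1.2) -> float:
    """Approximate (V)RAM in GB needed to load and run a model."""
    return params_billions * bytes_per_weight * overhead

print(f"70B at ~4-bit: ~{estimate_gb(70):.0f} GB")       # roughly fits 2x 24 GB cards
print(f"30B at ~4-bit: ~{estimate_gb(30):.0f} GB")       # fits a single 24 GB card
print(f"70B at ~8-bit: ~{estimate_gb(70, 1.0):.0f} GB")  # needs the 96 GB tier
```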
You won't like my answer. My two options are:
- Use nostr:npub16g4umvwj2pduqc8kt2rv6heq2vhvtulyrsr2a20d4suldwnkl4hquekv4h
- Use Venice.ai
You can buy $DIEM for daily API credit, and afterwards you can manage it via LiteLLM and plug it into Goose, Roo, or Kilo.
Investing in hardware for a local LLM is not worth it. Venice has a privacy policy.
You can also DM me for more info. Sorry to bring up a shitcoin project, but it's the lesser of the evils.
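For the LiteLLM step, here's a minimal sketch against a generic OpenAI-compatible endpoint; the base URL, API key, and model name are placeholders to swap for whatever your provider documents.

```python
# Minimal sketch: routing an OpenAI-compatible provider through LiteLLM (pip install litellm),
# which tools like Goose/Roo/Kilo can then point at. Base URL, key, and model name are
# placeholders -- substitute your provider's actual values.
import litellm

response = litellm.completion(
    model="openai/llama-3.3-70b",                    # "openai/" prefix = generic OpenAI-compatible route
    api_base="https://api.example-provider.com/v1",  # placeholder endpoint
    api_key="sk-...",                                # placeholder key bought with API credit
    messages=[{"role": "user", "content": "Refactor this function to be pure: ..."}],
)
print(response.choices[0].message.content)
```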
I use local LLMs exclusively, mostly for coding: two used 24 GB 3090s, which gives 48 GB of VRAM. It runs models up to 70B with very fast performance.
When inferring or training,
1. It uses a lot of power, peaking around 800W
2. It spins up the fans pretty loudly
I don't think it's necessary to go local for open-source coding, though. Maple (mostly gpt-oss-120b) is great for that. I think it is necessary to go local for uncensored models, for training with your own data, and for discussing things that don't fit mainstream bullshit narratives.
Browse the model catalog at https://ollama.com/models
Played with Ollama before, thanks!
GM and Happy new year.
Gave your question to Grok:
https://grok.com/share/c2hhcmQtMw_c08aa1ff-8f1d-4a31-b680-225e816d73af
Good morning! I'm all in on local LLMs too: privacy, no filters, and owning your setup is the way forward. By 2026, efficiency has improved a ton with better quantization (like AWQ-4bit) and MoE architectures, so even big models run smoother on high-end consumer hardware. Assuming your beefy PC has something like 4+ high-VRAM GPUs (e.g., RTX 4090s or A100 equivalents with 80 GB+ total VRAM), you can handle 100B+ params without too much sweat. If it's more modest (e.g., 2x 4090s), stick to 30-70B models.
Here's a rundown of the top local/open-source models for coding tasks right now (late 2025 data, but trends hold into 2026). I prioritized ones excelling in agentic coding (e.g., multi-file edits, tool use, debugging), SWE-bench scores, and real-world dev workflows. They're all uncensored by default when run locally. Download from Hugging Face, and use tools like Ollama, LM Studio, or Jan for easy setup; Ollama's great for quick spins, while Jan shines for agentic stuff.
### Top Picks for Coding-Focused Local LLMs
I'll table the best ones based on recent benchmarks and community feedback, focusing on coding strength, size/efficiency, and hardware fit. These are quantized versions for home runs; full precision needs datacenter-level gear.
| Model | Params | Strengths for Coding | Hardware Fit (Quantized) | Why It Rocks |
|-------|--------|-----------------------|---------------------------|--------------|
| Devstral-2-123B | 123B | Agentic workflows, multi-file edits, massive 256K context for big repos, top SWE-bench scorer. Feels like a "senior engineer" for fixes and planning. | 2x RTX Pro 6000 or 8x RTX 3090s (full context). | Best overall for complex projects; handles undocumented codebases without hallucinating. |
| Qwen3-Coder-30B | 30B | Pure coding beastābug fixes, autocomplete, instruction following. Runs fast even on mid-tier setups. | Single RTX 4090 (16GB VRAM quantized). | Efficient daily driver; great for laptops too if you dial down context. Outperforms bigger models on targeted dev tasks. |
| MiniMax-M2 | ~120B (MoE) | Interleaved thinking for agents, solid UI/design alongside code, tool use without fuss. | 2x RTX Pro 6000 or 8x RTX 3090s. | Versatile for full-stack work; MoE makes it punchy without constant high load. |
| GLM-4.5-Air | ~90B | All-rounder with strong agentic coding, structured outputs, and low power draw for multi-agent runs. | Single RTX Pro 6000 or 4x RTX 3090s. | Fits on fewer GPUs; uncensored and reliable for planning/debugging. |
| SWE-Smith-32B (Qwen2.5 fine-tune) | 32B | Interview prep, algorithms, repo-wide changesābeats Claude 3.5/GPT-4o level on SWE tasks. | 1-2x RTX 4090s (M3/M4 Max Mac equivalent). | Optimized for dev; emergent smarts from fine-tuning. |
| NVIDIA-Nemotron-3-Nano-30B | 30B | Tool calling, math/code hybrid tasks, punches above weight on agents. | Single high-VRAM GPU (low active params). | Speedy and multilingual; great for quick iterations without guilt-tripping your power bill. |
| Codestral (Mistral) | 22B | Fast autocompletion, code gen; fine-tunable if needed. | Entry-level beefy PC (e.g., RTX 4080+). | Lightweight starter; pairs well with VS Code extensions like Continue.dev. |
| DeepSeek-Coder | Varies (7-33B) | Competitive coding, algorithms; open-source and uncensored. | Scales to your rig; smaller sizes for testing. | Budget-friendly; strong on benchmarks without bloat. |
### Quick Tips to Get Started
- **Setup**: Ollama for simplicity (just `ollama run <model>` and you're prompting); LM Studio or Jan if you prefer a GUI. A minimal Python sketch for hitting the local API follows this list.
- **Integration**: Hook into VS Code via Continue.dev or CodeGPT for autocomplete. For agents (e.g., auto-fixing repos), try Jan or SWE-Agent frameworks.
- **Efficiency Gains**: By 2026, expect even better: MoE models (e.g., MiniMax) activate fewer params per query, and hardware like next-gen NVIDIA/AMD GPUs will make 200B+ feasible at home. If VRAM's tight, start with 4-bit quant and layer offloading.
- **Caveats**: Local means you handle the context management; no infinite tokens like cloud services. Test on small projects first; models can still hallucinate without good prompting.
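As mentioned in the Setup bullet, here's a minimal sketch of calling a locally running Ollama server over its default REST endpoint (http://localhost:11434); the model tag is a placeholder for whatever you've actually pulled.

```python
# Minimal sketch: calling a locally running Ollama server over its REST API.
# The model tag is a placeholder -- use whatever you pulled with `ollama run` / `ollama pull`.
import json
import urllib.request

payload = {
    "model": "qwen3-coder:30b",  # placeholder tag; substitute your pulled model
    "prompt": "Write a unit test for a function that reverses a linked list.",
    "stream": False,             # return one JSON object instead of a token stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```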