Good morning, Nostr. Who's running local LLMs? What are the best models that can run at home for coding on a beefy PC system? In 2026, I want to dig into local LLMs more and stop using Claude and Gemini as much. I know I can use Maple for more private AI, but I prefer running my own model. I also like the fact that there are no restrictions on these models when run locally. I know hardware is the bottleneck here; hopefully, these things become more efficient.
#AskNostr
Interested as well! #AskNostr
24 GB VRAM (3090, 7900-class cards): the latest Mistral 24B, Qwen3 32B, and Qwen3 30B-A3B (MoE)
48 GB: 70B-size models at decent quants, Mistral dev large at lobotomized quants. Mistral dev large is the main one in this bracket. There might be other good 70Bs released lately.
96 GB: gpt-oss 120B
This is to fit everything in VRAM. With MoEs (Qwen3 30B-A3B, gpt-oss) you can get by with VRAM+RAM without ruining your speed, depending on how fast your RAM is.
But it's usually a speed hit, so I don't use anything that doesn't fit in VRAM.
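To make the VRAM+RAM split concrete, here's a minimal sketch using llama-cpp-python; the GGUF filename and the layer count are placeholders you'd tune for your own card and quant.

```python
# Minimal sketch: partial GPU offload with llama-cpp-python (pip install llama-cpp-python).
# Model path and n_gpu_layers are placeholders -- raise the layer count until VRAM is nearly
# full and let the remaining weights sit in system RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen3-30b-a3b-q4_k_m.gguf",  # hypothetical local GGUF quant
    n_gpu_layers=32,   # layers kept in VRAM; the rest are served from RAM
    n_ctx=8192,        # context window; larger contexts eat more VRAM
)

out = llm("Write a Python function that parses an RFC 3339 timestamp.", max_tokens=256)
print(out["choices"][0]["text"])
```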
nostr:nprofile1qqsrg8d45l36jv05jz2as2j2ejfee79xw2fmreqhnl2ttsz5f38u9mcpr4mhxue69uhkummnw3ezucnfw33k76twv4ezuum0vd5kzmp0qyg8wumn8ghj7mn0wd68ytnddakj7qgwwaehxw309ahx7uewd3hkctckedwd0 uses local LLMs for his image generation, not coding. But he may have some insight.
I'm very interested in doing image generation locally but haven't found a good guide on how.
nostr:nprofile1qqsvyxc6dndjglxtmyudevttzkj05wpdqrla0vfdtja669e2pn2dzuqpzamhxue69uhhyetvv9ujumn0wd68ytnzv9hxgtcpzamhxue69uhhyetvv9ujuurjd9kkzmpwdejhgtcppemhxue69uhkummn9ekx7mp0kv6trc
Local LLMs are a blessing, free and loyal to you.
I use GPT4All with a Mistral OpenOrca model. I prompt using text posts and ideas. I want to see if the base cultural context for cross-domain ideas matches what I'm trying to say. It works fairly well. There are even some valid corrections that I didn't think of sometimes. I don't have a decent GPU on my main GNU/Linux machine; GPT4All runs fine on just the CPU. I do have a Mac mini M1 that is much faster, as GPT4All will use the hardware, but I usually don't take the time to switch the KVM (I currently just use it for testing software builds). I've never used standalone cloud AI. I don't care about it for a variety of reasons. One benefit to GPT4All is you can point it at your local documents as well to tailor the model.
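For anyone wanting to script that same workflow, here's a minimal sketch with the gpt4all Python package; the model filename is a placeholder for whatever GGUF the GPT4All app has already downloaded.

```python
# Minimal sketch: CPU-only prompting with the gpt4all Python package (pip install gpt4all).
# The model filename below is a placeholder -- point it at the OpenOrca GGUF file on disk.
from gpt4all import GPT4All

model = GPT4All("mistral-7b-openorca.gguf2.Q4_0.gguf", device="cpu")  # hypothetical filename

with model.chat_session():
    reply = model.generate(
        "Does this framing of cross-domain ideas hold up? <paste post text here>",
        max_tokens=400,
    )
    print(reply)
```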
My setup is an i9-13900K + 64 GB at 6400 MT/s + RTX 4090.
The absolute best all-around AI is Llama 3.3, but it is a bit outdated and slow. Newer MoE models like Llama 4 and gpt-oss are flashier and faster, but they are mostly experts at hallucinating.
People will also suggest DeepSeek, but generally speaking 24 GB of VRAM is just too small for "reasoning models" to actually be an improvement. I haven't tried some of the more recent developments, but I have some hope.
If someone were to train a llama3.3 like AI but have it focus on tool use, like reading the source code and documentation for the libraries you have installed, then I think it could be very good.
I don't think you can really run anything unless you have a card with a minimum of 16 GB of VRAM. Even then, the model you can run would be about a quarter of Sonnet's performance. You need like four 24 GB cards to get close.
As I understand it, you'll want to budget roughly 1 GB of RAM per 1B parameters.
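For a rough sense of what that rule of thumb implies, here's a back-of-the-envelope sketch; the 4-bit width and the overhead factor are assumptions, and at 8-bit the 1 GB per 1B parameters figure applies almost directly.

```python
# Back-of-the-envelope memory estimate: params * bytes per weight, plus runtime overhead.
# The quantization width and overhead factor are rough assumptions, not measurements.
def estimate_gb(params_billions: float, bytes_per_weight: float = 0.5, overhead: float = 1.2) -> float:
    """Approximate (V)RAM in GB needed to load and run a model."""
    return params_billions * bytes_per_weight * overhead

print(f"70B at ~4-bit: ~{estimate_gb(70):.0f} GB")       # roughly fits 2x 24 GB cards
print(f"30B at ~4-bit: ~{estimate_gb(30):.0f} GB")       # fits a single 24 GB card
print(f"70B at ~8-bit: ~{estimate_gb(70, 1.0):.0f} GB")  # needs the 96 GB tier
```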
You won't like my answer. My two options are:
- Use nostr:npub16g4umvwj2pduqc8kt2rv6heq2vhvtulyrsr2a20d4suldwnkl4hquekv4h
- Use Venice.ai
You can buy $DIEM for daily API credit, and afterwards you can manage it via LiteLLM and plug it into Goose, Roo, or Kilo.
Investing in hardware for a local LLM is not worth it. Venice has a privacy policy.
You can also DM me for more info. Sorry to bring up a shitcoin project, but it's the lesser of the evils.
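For the LiteLLM step, here's a minimal sketch against a generic OpenAI-compatible endpoint; the base URL, API key, and model name are placeholders to swap for whatever your provider documents.

```python
# Minimal sketch: routing an OpenAI-compatible provider through LiteLLM (pip install litellm),
# which tools like Goose/Roo/Kilo can then point at. Base URL, key, and model name are
# placeholders -- substitute your provider's actual values.
import litellm

response = litellm.completion(
    model="openai/llama-3.3-70b",                    # "openai/" prefix = generic OpenAI-compatible route
    api_base="https://api.example-provider.com/v1",  # placeholder endpoint
    api_key="sk-...",                                # placeholder key bought with API credit
    messages=[{"role": "user", "content": "Refactor this function to be pure: ..."}],
)
print(response.choices[0].message.content)
```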
I use local LLMs exclusively, mostly for coding: two used 24 GB 3090s, which gives 48 GB of VRAM. It runs models up to 70B with very fast performance.
When inferring or training,
1. It uses a lot of power, peaking around 800W
2. It spins up the fans pretty loudly
I don't think it's necessary to go local for open-source coding, though. Maple (mostly gpt-oss-120b) is great for that. I think it is necessary to go local for uncensored models, for training with your own data, and for discussing things that don't fit mainstream bullshit narratives.
Browse the model catalog at https://ollama.com/models
Played with Ollama before, thanks!
GM and Happy new year.
Gave your question to Grok:
https://grok.com/share/c2hhcmQtMw_c08aa1ff-8f1d-4a31-b680-225e816d73af
Good morning! I'm all in on local LLMs too: privacy, no filters, and owning your setup is the way forward. By 2026, efficiency has improved a ton with better quantization (like AWQ-4bit) and MoE architectures, so even big models run smoother on high-end consumer hardware. Assuming your beefy PC has something like 4+ high-VRAM GPUs (e.g., RTX 4090s or A100 equivalents with 80 GB+ total VRAM), you can handle 100B+ params without too much sweat. If it's more modest (e.g., 2x 4090s), stick to 30-70B models.
Here's a rundown of the top local/open-source models for coding tasks right now (late 2025 data, but trends hold into 2026). I prioritized ones excelling in agentic coding (e.g., multi-file edits, tool use, debugging), SWE-bench scores, and real-world dev workflows. They're all uncensored by default when run locally. Download from Hugging Face, and use tools like Ollama, LM Studio, or Jan for easy setup; Ollama's great for quick spins, while Jan shines for agentic stuff.
### Top Picks for Coding-Focused Local LLMs
I'll table the best ones based on recent benchmarks and community feedback, focusing on coding strength, size/efficiency, and hardware fit. These are quantized versions for home runs; full precision needs datacenter-level gear.
| Model | Params | Strengths for Coding | Hardware Fit (Quantized) | Why It Rocks |
|-------|--------|-----------------------|---------------------------|--------------|
| Devstral-2-123B | 123B | Agentic workflows, multi-file edits, massive 256K context for big repos, top SWE-bench scorer. Feels like a "senior engineer" for fixes and planning. | 2x RTX Pro 6000 or 8x RTX 3090s (full context). | Best overall for complex projects; handles undocumented codebases without hallucinating. |
| Qwen3-Coder-30B | 30B | Pure coding beastābug fixes, autocomplete, instruction following. Runs fast even on mid-tier setups. | Single RTX 4090 (16GB VRAM quantized). | Efficient daily driver; great for laptops too if you dial down context. Outperforms bigger models on targeted dev tasks. |
| MiniMax-M2 | ~120B (MoE) | Interleaved thinking for agents, solid UI/design alongside code, tool use without fuss. | 2x RTX Pro 6000 or 8x RTX 3090s. | Versatile for full-stack work; MoE makes it punchy without constant high load. |
| GLM-4.5-Air | ~90B | All-rounder with strong agentic coding, structured outputs, and low power draw for multi-agent runs. | Single RTX Pro 6000 or 4x RTX 3090s. | Fits on fewer GPUs; uncensored and reliable for planning/debugging. |
| SWE-Smith-32B (Qwen2.5 fine-tune) | 32B | Interview prep, algorithms, repo-wide changesābeats Claude 3.5/GPT-4o level on SWE tasks. | 1-2x RTX 4090s (M3/M4 Max Mac equivalent). | Optimized for dev; emergent smarts from fine-tuning. |
| NVIDIA-Nemotron-3-Nano-30B | 30B | Tool calling, math/code hybrid tasks, punches above weight on agents. | Single high-VRAM GPU (low active params). | Speedy and multilingual; great for quick iterations without guilt-tripping your power bill. |
| Codestral (Mistral) | 22B | Fast autocompletion, code gen; fine-tunable if needed. | Entry-level beefy PC (e.g., RTX 4080+). | Lightweight starter; pairs well with VS Code extensions like Continue.dev. |
| DeepSeek-Coder | Varies (7-33B) | Competitive coding, algorithms; open-source and uncensored. | Scales to your rig; smaller sizes for testing. | Budget-friendly; strong on benchmarks without bloat. |
### Quick Tips to Get Started
- **Setup**: Ollama for simplicity (just `ollama run <model>` and you're prompting); LM Studio or Jan if you prefer a GUI. A minimal Python sketch for hitting the local API follows this list.
- **Integration**: Hook into VS Code via Continue.dev or CodeGPT for autocomplete. For agents (e.g., auto-fixing repos), try Jan or SWE-Agent frameworks.
- **Efficiency Gains**: By 2026, expect even better: MoE models (e.g., MiniMax) activate fewer params per query, and hardware like next-gen NVIDIA/AMD GPUs will make 200B+ feasible at home. If VRAM's tight, start with 4-bit quant and layer offloading.
- **Caveats**: Local means you handle the context management; no infinite tokens like cloud services. Test on small projects first; models can still hallucinate without good prompting.
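As mentioned in the Setup bullet, here's a minimal sketch of calling a locally running Ollama server over its default REST endpoint (http://localhost:11434); the model tag is a placeholder for whatever you've actually pulled.

```python
# Minimal sketch: calling a locally running Ollama server over its REST API.
# The model tag is a placeholder -- use whatever you pulled with `ollama run` / `ollama pull`.
import json
import urllib.request

payload = {
    "model": "qwen3-coder:30b",  # placeholder tag; substitute your pulled model
    "prompt": "Write a unit test for a function that reverses a linked list.",
    "stream": False,             # return one JSON object instead of a token stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```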