Both the 21B and the 120B.
Huh. That's kinda crazy. Haven't tried them myself yet.
Guess that's the advantage of running 10 models in tandem and choosing the best response 😅
120B? you must have a pretty sweet little GPU there
Just a 4090. Most of it has to run on my CPU. The model is quantized, so it's "only" 65 GB.
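For reference, that GPU/CPU split is basically one knob if you run it through llama-cpp-python. Just a sketch; the file name and layer count here are made up, not my actual setup:

```python
# Sketch: running a quantized GGUF model split between GPU and CPU with
# llama-cpp-python. File name and layer count are hypothetical.
from llama_cpp import Llama

llm = Llama(
    model_path="some-120b.Q4_K_M.gguf",  # hypothetical ~65 GB quantized file
    n_gpu_layers=20,  # however many layers fit in the 4090's 24 GB of VRAM
    n_ctx=4096,       # context window
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello."}]
)
print(out["choices"][0]["message"]["content"])
```

Whatever doesn't fit in VRAM just stays on the CPU, which is why it runs at all, slowly.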
ah, yeah, that's about 50% more grunty than my RX 7800 XT. it has a 256-bit memory bus and 16 GB. runs 22B codestral fine though.
these free models from hugging face are quite a jumble of hit and miss tho. took me a while to find a good one, and then someone put me onto codestral, which also seems quite good and has more parameters than the 14B qwen 3 i was using before. haven't really evaluated it though, because most of my work uses claude 3.7 in the cloud as a coding agent. i'm looking forward to eventually being able to point the agent at my local LLMs (see the sketch below). i just don't see the point in using a remote service, and i don't like teaching those fuckers my process; that's what i'm really concerned about, because i know copilot is already eating all of my output on github.
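fwiw, pointing an agent at a local model is mostly just swapping the endpoint, since LM Studio serves an OpenAI-compatible API (default http://localhost:1234/v1). rough sketch; the model id and key are placeholders:

```python
# rough sketch: point an OpenAI-compatible client (what most coding agents
# speak) at a local LM Studio server instead of a cloud API.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio's default local endpoint
    api_key="lm-studio",                  # placeholder; the local server ignores it
)

resp = client.chat.completions.create(
    model="codestral-22b",  # example id; use whatever your server actually lists
    messages=[{"role": "user", "content": "explain what this does: def f(x): return x * x"}],
)
print(resp.choices[0].message.content)
```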
Sitting comfortably between y'all with a 4080 Super because I take video games very seriously
And I still use cloud LLMs lol.
meh, my 16gb video card runs LLMs fine
i don't think there are even cloud services that run the models i use anyway
i don't trust cloud hosting at all, in any way, whatsoever. that's why i'm a nostr relay dev. because i want little people to run internet services. they are more likely to be honorable.
I was messing with LM Studio and Ollama and Roocode and stuff recently. Choosing a model (in general) is a bit confusing to me. I tried a 7B model which was fucking memes. Haven't tried a 70B yet.
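If it helps, a quick way to compare sizes once Ollama is running is to throw the same prompt at a couple of models. A sketch with the Ollama Python client; the tags are just examples and assume you've pulled them with `ollama pull`:

```python
# Sketch: fire the same prompt at two model sizes via the Ollama Python
# client. Assumes `pip install ollama` and that both tags have been pulled;
# the tags are examples, not recommendations.
import ollama

PROMPT = "Write a Python function that reverses a linked list."

for tag in ["qwen2.5-coder:7b", "llama3.3:70b"]:
    resp = ollama.chat(model=tag, messages=[{"role": "user", "content": PROMPT}])
    print(f"--- {tag} ---")
    print(resp["message"]["content"])
```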
LM Studio is the one i use, the first one i got to actually work on linux, after i finally got the AMD ROCm compute libraries installed (needed ubuntu 24 to make it work)
idk what kind of thing you want to do but so far i've found Qwen and Codestral models are both good for understanding code
The only model I recommend locally is llama 3.3. Qwen and DeepSeek get a lot of hype, but they are overall worse. What they are better at is looking like they are doing something. But they all basically ape conversation. The Turing test is really a test of the user.
llama 3.3 wins by being the least pretentious. That means more parameters can be used for actual knowledge rather than performance art.