Which model? So far llama3.3 is the only local model I find tolerable, but it's throttled by how fast my RAM can feed the remaining 18GB of weights to my CPU. So mostly I'm talking to my CPU, I guess, even though the GPU is doing 4/7 of the work.
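
For anyone curious, here's the rough back-of-envelope I'm going by; the ~60 GB/s figure is just an assumption for dual-channel DDR5, not a measurement, so plug in your own numbers:

```python
# Back-of-envelope only: every generated token has to stream the CPU-resident
# weights out of system RAM once, so RAM bandwidth sets the ceiling for the
# offloaded layers no matter how fast the CPU itself is.
cpu_resident_gb = 18      # the chunk of llama3.3 that doesn't fit in VRAM
ram_bandwidth_gbps = 60   # assumed dual-channel DDR5; substitute your own figure

seconds_per_token = cpu_resident_gb / ram_bandwidth_gbps  # CPU-side time, bandwidth-bound
print(f"~{1 / seconds_per_token:.1f} tok/s ceiling from RAM bandwidth alone")  # ≈ 3.3
```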

Discussion

I get about 3.5 tokens/sec. A 5090 is tempting simply because the extra VRAM would leave roughly half as much of the model in system RAM, cutting the RAM-bound part of each token about in half.
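
Rough math, reusing the bandwidth estimate above and assuming the current card is a 24GB one versus the 5090's 32GB (so roughly 8GB less of the model left in system RAM):

```python
ram_bandwidth_gbps = 60  # same assumed dual-channel DDR5 figure as in the note above
cpu_resident_gb = {"current card": 18, "hypothetical 5090": 18 - 8}

for card, gb in cpu_resident_gb.items():
    # tokens/sec ceiling set by streaming the CPU-resident weights once per token
    print(f"{card}: ~{ram_bandwidth_gbps / gb:.1f} tok/s ceiling")
```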

Yeah, llama is what I use. It's fast.

At 70B parameters? 3.3 t/s is about as fast as a quick human typist, but not so fast that I don't pick and choose what to ask it. It works pretty well with Continue AI, but the default context window in Ollama is kinda small if I need it to look at more than a few files.
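
If the small default window is the bottleneck, Ollama does accept a per-request num_ctx option. A minimal sketch with the official Python client (assumes `pip install ollama`, that your RAM can absorb the larger KV cache, and 8192 is just an example value):

```python
import ollama  # official Ollama Python client

resp = ollama.chat(
    model="llama3.3",
    messages=[{"role": "user", "content": "Summarize the files below..."}],
    options={"num_ctx": 8192},  # raise the context window above the small default
)
print(resp["message"]["content"])
```

The same setting can be baked into a Modelfile with `PARAMETER num_ctx` if you'd rather not pass it per request.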

Oh wait. Maybe you have a Mac with tons of unified memory?

I use 3.1 on my 8GB VRAM GPU.