ollama seems to load as much of the model as it can into VRAM and the rest into system RAM. Llama 3.1 70b runs a lot slower than 8b on a 4090, but it's usable. The ollama library has a bunch of different versions that appear to be quantized: https://ollama.com/library/llama3.1
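
If you want to see how a loaded model actually got split between the GPU and system RAM, running ollama ps after the model is up prints a PROCESSOR column with the CPU/GPU percentages. Rough example (the tag here is just the q2_k variant from the library page; any model you've pulled works):

$ ollama run llama3.1:70b-instruct-q2_k "hello"
$ ollama ps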

Discussion

How much storage space do I need for the 70B model?

$ ollama list
NAME                          SIZE
llama3.1:70b-instruct-q2_k    26 GB
llama3.1:70b                  39 GB
codellama:13b                 7.4 GB
llama3.1:8b                   4.7 GB
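
If you just want to see how much disk the downloaded models are using, they're stored under ~/.ollama/models on a default install (or wherever OLLAMA_MODELS points, if you've changed it):

$ du -sh ~/.ollama/models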

Thanks, appreciate it!