Sure. Mac Mini M2 with 32GB RAM. The quantization is purely a RAM constraint. Quantization actually loses a lot less quality than most people think (which is amazing: from 16 bits per weight down to 2 bits per weight and it still works!).
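
To make the RAM constraint concrete, here's a rough back-of-the-envelope sketch. The 34B parameter count is just an illustrative assumption, and real quantized files carry some per-block overhead on top of this:

```python
# Approximate weight memory for a hypothetical 34B-parameter model
# (illustrative numbers only; ignores KV cache, activations, and
# quantization block overhead).

def model_ram_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Weight memory in GB: params * bits per weight / 8."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(model_ram_gb(34, 16))  # ~68 GB  -> won't fit in 32 GB
print(model_ram_gb(34, 4))   # ~17 GB  -> fits
print(model_ram_gb(34, 2))   # ~8.5 GB -> fits comfortably
```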

I interface with ollama, which runs llama.cpp using MPS. When they switch to CoreML, it should be around 3x faster, but the RAM requirement stays basically the same: the model still needs to fit in memory.
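
For reference, this is roughly how one talks to a local ollama server over its REST API (the model tag below is just an example of a quantized variant, not necessarily what I'm running):

```python
# Minimal sketch of querying a local ollama server over its REST API.
# Assumes ollama is running on the default port 11434.
import json
import urllib.request

payload = json.dumps({
    "model": "llama2:13b-chat-q2_K",  # hypothetical quantized model tag
    "prompt": "Explain MPS in one sentence.",
    "stream": False,
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```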
