https://PremAI.io can help!


Discussion

We were already using 1B models, but they were too slow. We need faster speeds from smaller models.

We’ve got the code ready now, though, so when a smaller, faster model does finally arrive, game on. 🦉

Which inference engine and tech stack, if I may ask? Have you used WebGPU and/or Metal on Mac/iOS?

Also, which model are you using?

It was only on an M3 CPU so far, no GPUs yet. We’ll do that benchmark later, but it might be more expensive for relay operators, and some users may not have GPUs.

gemma3:1b

llama3.2:1b
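
For anyone wanting to reproduce the CPU-side numbers, here is a minimal sketch (mine, not from the thread) that times generation speed for the two models named above via Ollama's local HTTP API. The endpoint and the eval_count/eval_duration fields are standard Ollama; the prompt and timeout are arbitrary, and it assumes the models have already been pulled.

```python
# Rough tokens/sec timing via Ollama's local API (http://localhost:11434).
# Assumes `ollama pull gemma3:1b` and `ollama pull llama3.2:1b` were run first.
import requests

MODELS = ["gemma3:1b", "llama3.2:1b"]
PROMPT = "Summarize the benefits of small language models in two sentences."

for model in MODELS:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=300,
    )
    data = resp.json()
    # eval_count is tokens generated; eval_duration is in nanoseconds.
    tok_per_sec = data["eval_count"] / (data["eval_duration"] / 1e9)
    print(f"{model}: {tok_per_sec:.1f} tokens/sec")
```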

I mean, it’s like night and day; these models need a GPU, no point otherwise. AFAIK the memory is unified, so you should be able to use it with Metal kernels.
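
On the unified-memory point: because CPU and GPU share the same memory on Apple Silicon, offloading layers to Metal doesn't require copying weights. A minimal sketch of what that looks like, assuming llama.cpp via its llama-cpp-python bindings (my assumption for the stack; the thread doesn't name the engine), with a placeholder GGUF filename:

```python
# Assumes the bindings were built with Metal support, e.g.:
#   CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./gemma-3-1b-it-Q4_K_M.gguf",  # placeholder; any 1B GGUF works
    n_gpu_layers=-1,  # offload every layer to the Metal backend
    n_ctx=2048,
)

out = llm(
    "Explain unified memory on Apple Silicon in one sentence.",
    max_tokens=64,
)
print(out["choices"][0]["text"])
```

With n_gpu_layers=-1 the whole model runs on the GPU; setting it to 0 gives a CPU-only baseline for the same comparison discussed above.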