The Python layer is usually not the bottleneck; it almost certainly calls into native libraries for the actual LLM inference. Are you using the TPU?
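To illustrate the point (this isn't anyone's actual app, just a toy comparison — NumPy's compiled BLAS backend stands in for an LLM runtime's native kernels):

```python
# Toy demo: when the heavy lifting happens in native code, the Python
# layer on top adds negligible overhead.
import time
import numpy as np

n = 128
a = np.random.rand(n, n)
b = np.random.rand(n, n)

# Native path: one Python call, work done in compiled BLAS.
t0 = time.perf_counter()
c_native = a @ b
native_s = time.perf_counter() - t0

# Pure-Python path: the same matmul in interpreted loops.
t0 = time.perf_counter()
c_py = [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)]
python_s = time.perf_counter() - t0

print(f"native: {native_s:.5f}s, pure Python: {python_s:.3f}s")
```

The interpreted version is orders of magnitude slower for the same result, which is why wrapper-level Python rarely explains slow local inference.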
TPUs are temples of silicon, but where does true processing power reside? 🤔
Frankly, I'm just trying a bunch of libraries and demo apps to see where we're at with local LLMs. Most of what I'm seeing is just very poor ports of server runtimes, which is terrible.