Some references on the minimum hardware requirements:
1. GPT-NeoX 20B, with 20 billion fp16 (2-byte) parameters, requires about 42GB of VRAM for near-real-time inference. Simple math: 20 billion * 2 bytes = 40GB for the weights alone (see the sketch after this list). That is too much for a single non-high-end GPU, so the model has to be split across devices, which introduces additional parallelism problems.
2. Qualcomm deployed a 1-billion-parameter int8 (1-byte) model to a Snapdragon 8 Gen 2 platform. That is roughly 1GB of RAM (not sure what kind) to generate a 512x512 pixel image in around 15 seconds.
It's running on the NPU/APU to boost INT8 inference performance. Considering the low VRAM and low processor frequency, this is still quite impressive.
3. ChatGPT has 175 billion parameters, roughly 8-9 times the size of GPT-NeoX 20B. Even after quantization to int8 (if possible), it would still consume about 42 * 8 / 2 = 168GB of fast memory (the fp16 footprint scaled up ~8x by parameter count, then halved by going from 2 bytes to 1 byte per parameter).
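For reference, here is a minimal sketch of the weight-memory math used above (parameter counts and data types are the ones quoted in the list; runtime overhead such as activations and KV cache is not included):

```python
# Back-of-the-envelope weight memory: parameters * bytes per parameter.
# Parameter counts and dtypes are the ones quoted above; runtime overhead
# (activations, KV cache, buffers) is not included in these numbers.

def weight_memory_gb(params_billions: float, bytes_per_param: int) -> float:
    """Raw weight memory in GB (using 1 GB = 10**9 bytes)."""
    return params_billions * bytes_per_param  # billions of params * bytes each = GB

models = [
    ("GPT-NeoX 20B, fp16",          20, 2),
    ("Qualcomm on-device 1B, int8",  1, 1),
    ("GPT-3 scale 175B, fp16",     175, 2),
    ("GPT-3 scale 175B, int8",     175, 1),
]

for name, params_b, bytes_pp in models:
    print(f"{name:30s} ~{weight_memory_gb(params_b, bytes_pp):5.0f} GB of weights")
```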
All of this is just the memory side; there is much more going on with computation and IO bottlenecks (a rough bandwidth-bound estimate follows below). My conclusion: without real innovation in AI inference hardware, current consumer hardware won't be able to run large LLMs in real time. But smaller 1-10 billion parameter models on personal devices are still very promising, although a lot of work needs to be done on the NPU and memory hierarchy architecture.
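On the IO side, autoregressive decoding is largely memory-bandwidth bound: generating each token streams roughly the full weight set from memory, so time per token is bounded below by model size divided by memory bandwidth. A minimal sketch of that bound is below; the bandwidth figures are ballpark assumptions chosen for illustration, not measurements.

```python
# Rough bandwidth-bound decode speed: each generated token reads ~all weights
# once, so tokens/s <= memory_bandwidth / weight_bytes.
# Bandwidth numbers are ballpark assumptions for illustration only.

def max_tokens_per_second(weight_gb: float, bandwidth_gb_per_s: float) -> float:
    return bandwidth_gb_per_s / weight_gb

scenarios = [
    ("42 GB fp16 model on ~1 TB/s GPU HBM",       42, 1000),
    ("42 GB fp16 model on ~50 GB/s laptop DRAM",  42,   50),
    ("1 GB int8 model on ~50 GB/s mobile LPDDR",   1,   50),
]

for name, gb, bw in scenarios:
    print(f"{name:44s} <= {max_tokens_per_second(gb, bw):6.1f} tokens/s")
```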