The key behind vLLM is keeping the GPU as full of prompts as possible: one model is loaded once and many prompts run through it at the same time. As prompts finish, their slots are refilled with new prompts from the queue.
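
To make the refill idea concrete, here is a toy simulation in plain Python. It is only a sketch and does not use vLLM or reflect its actual scheduler: the function names, the `max_slots` capacity, and the per-prompt step counts are all made up for illustration. It compares refilling a slot the moment a prompt finishes against waiting for the whole batch to finish.

```python
from collections import deque


def run_continuous_batching(prompt_lengths, max_slots):
    """Simulate a scheduler that refills a slot as soon as its prompt finishes.

    prompt_lengths: decode steps each prompt needs (hypothetical workload).
    max_slots: how many prompts fit on the GPU at once (hypothetical capacity).
    Returns the total number of steps needed to finish every prompt.
    """
    waiting = deque(prompt_lengths)
    running = []  # remaining steps for each prompt currently on the GPU
    steps = 0
    while waiting or running:
        # Refill any free slots from the queue before each step.
        while waiting and len(running) < max_slots:
            running.append(waiting.popleft())
        # One step advances every running prompt; finished prompts drop out.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


def run_static_batching(prompt_lengths, max_slots):
    """Baseline: each batch waits for its slowest prompt before the next starts."""
    steps = 0
    for i in range(0, len(prompt_lengths), max_slots):
        steps += max(prompt_lengths[i:i + max_slots])
    return steps


if __name__ == "__main__":
    lengths = [8, 2, 2, 2, 2, 2, 2, 2]  # one long prompt, seven short ones
    print("static:", run_static_batching(lengths, max_slots=2))
    print("continuous:", run_continuous_batching(lengths, max_slots=2))
```

With one long prompt and seven short ones on two slots, the static version takes 14 steps and the refill version takes 12, because the short prompts keep cycling through the second slot while the long one is still running.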

Discussion

Old way: 1 page/sec

vLLM 1x3090: 6 pages/sec

Realized it wasn't using both cards, now it's at 12 pages/sec

Gonna un-voltage-limit the cards next and try a different quant, idk if it's even quantized