vLLM is nutty

Before, I was doing naive queuing with text-gen-webui for parsing documents and getting 1 page/sec, thinking that was good (way better than human speed)

An 8B model on a 3090 handles 60 requests/second with vLLM


Discussion

The key behind vLLM is stuffing the GPU as full of prompts as possible: one model, many in-flight prompts. As individual prompts finish, their slots are immediately refilled with new ones (this is what vLLM calls continuous batching)
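To see why refilling slots as they free up beats waiting for a whole batch, here's a toy step-counting simulation (no GPU, pure Python; the batch size and per-request lengths are made-up numbers for illustration, not vLLM's actual scheduler):

```python
import heapq

def static_batching_steps(lengths, slots):
    # Static batching: the whole batch waits for its longest request
    # before any slot can be refilled.
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps

def continuous_batching_steps(lengths, slots):
    # Continuous batching: the moment a request finishes, its slot is
    # handed to the next pending request.
    pending = list(lengths)
    active = []  # min-heap of finish times
    now = 0
    while pending or active:
        while pending and len(active) < slots:
            heapq.heappush(active, now + pending.pop(0))
        now = heapq.heappop(active)  # advance to the next completion
    return now

lengths = [10, 2, 3, 1, 8, 2, 2, 4]  # decode steps per request (made up)
print(static_batching_steps(lengths, 4))      # → 18
print(continuous_batching_steps(lengths, 4))  # → 10
```

Same total work, roughly half the wall time, because short requests stop stalling behind long ones.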

Old way: 1 page/sec

vLLM, 1x3090: 6 pages/sec

Realized it wasn't using both cards; now it's at 12 pages/sec

Gonna un-voltage-limit the cards next and try a different quant; not even sure the current model is quantized
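For anyone wanting to reproduce the two-card setup: vLLM splits a model across GPUs with tensor parallelism, and quantization is picked at load time. A sketch of the serve command (the model name is a placeholder; `--quantization awq` assumes you actually have an AWQ checkpoint):

```shell
# Shard the model across both 3090s; use an AWQ-quantized checkpoint.
vllm serve <your-model-here> \
    --tensor-parallel-size 2 \
    --quantization awq
```

With `--tensor-parallel-size 1` (the default) only one card gets used, which is likely what happened before the 6 → 12 pages/sec jump.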