Question: People often look at the number of trainable parameters (weights) as a proxy for LLM complexity (and memory requirements).

Isn't the number of attention heads also an important parameter to consider in evaluating language models?
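A minimal sketch, assuming the standard Transformer convention where each head has dimension d_model / n_heads: under that convention the head count doesn't actually change the number of attention weights at all, it only changes how the computation is partitioned, which is one reason parameter counts alone don't capture this design choice.

```python
# Sketch under the assumption d_head = d_model / n_heads (standard multi-head attention).
# Shows that the weight count of one attention layer is independent of the head count.

def mha_param_count(d_model: int, n_heads: int) -> int:
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_head = d_model // n_heads
    # Q, K, V projections: each maps d_model -> n_heads * d_head (= d_model)
    qkv_params = 3 * d_model * (n_heads * d_head)
    # Output projection: maps n_heads * d_head back to d_model
    out_params = (n_heads * d_head) * d_model
    return qkv_params + out_params  # biases omitted for simplicity

for h in (1, 8, 64, 512):
    print(h, mha_param_count(d_model=4096, n_heads=h))
# All four head counts give the same total: 4 * 4096 * 4096 = 67,108,864 weights.
```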


Discussion

Seems so. What happens if you have, say, 512 of them?

There usually aren't that many distinct things worth attending to, so my guess is that a model with that many heads would be bulky, hard to train, and impractical.
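A quick back-of-the-envelope check of that intuition, again assuming the d_head = d_model / n_heads convention from the sketch above: with 512 heads, each head is squeezed into a very small subspace.

```python
# Illustrative only: with 512 heads, each head gets a tiny slice of the hidden dimension.
n_heads = 512
for d_model in (1024, 4096, 8192):
    print(f"d_model={d_model}: each of the {n_heads} heads works in a "
          f"{d_model // n_heads}-dimensional subspace")
# e.g. d_model=4096 leaves only 8 dimensions per head, which is likely too few
# to represent useful query/key relationships -- consistent with the guess above.
```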