Question: People often look at the number of trainable parameters (weights) as a proxy for LLM complexity (and memory requirements).

Isn't the number of attention heads also an important parameter to consider in evaluating language models?
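A minimal sketch, assuming the standard Transformer convention where each head has dimension d_model / n_heads: under that convention the head count doesn't actually change the number of attention weights at all, it only changes how the computation is partitioned, which is one reason parameter counts alone don't capture this design choice.

```python
# Sketch under the assumption d_head = d_model / n_heads (standard multi-head attention).
# Shows that the weight count of one attention layer is independent of the head count.

def mha_param_count(d_model: int, n_heads: int) -> int:
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_head = d_model // n_heads
    # Q, K, V projections: each maps d_model -> n_heads * d_head (= d_model)
    qkv_params = 3 * d_model * (n_heads * d_head)
    # Output projection: maps n_heads * d_head back to d_model
    out_params = (n_heads * d_head) * d_model
    return qkv_params + out_params  # biases omitted for simplicity

for h in (1, 8, 64, 512):
    print(h, mha_param_count(d_model=4096, n_heads=h))
# All four head counts give the same total: 4 * 4096 * 4096 = 67,108,864 weights.
```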


Discussion

Seems so. What happens if you have, say, 512 of them?

There usually aren't that many distinct things worth attending to, so my guess is that a model with that many heads would be bulky, hard to train, and impractical.
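A quick back-of-the-envelope check of that intuition, again assuming the d_head = d_model / n_heads convention from the sketch above: with 512 heads, each head is squeezed into a very small subspace.

```python
# Illustrative only: with 512 heads, each head gets a tiny slice of the hidden dimension.
n_heads = 512
for d_model in (1024, 4096, 8192):
    print(f"d_model={d_model}: each of the {n_heads} heads works in a "
          f"{d_model // n_heads}-dimensional subspace")
# e.g. d_model=4096 leaves only 8 dimensions per head, which is likely too few
# to represent useful query/key relationships -- consistent with the guess above.
```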