"We find that the shared circuitry increases with model scale, with Claude 3.5 Haiku sharing more than twice the proportion of its features between languages as compared to a smaller model"[https://www.anthropic.com/research/tracing-thoughts-language-model]

Evidence that capability comes from how much a model "groks" (generalizes), not from raw parameter count. Larger models have more room for memorization, yet they're apparently using *less* space per concept: by sharing features across languages, a concept gets represented once rather than once per language. Once we figure out how to get smaller models to grok more, we'll get Claude 3.7-level capabilities out of local models.
