"We find that the shared circuitry increases with model scale, with Claude 3.5 Haiku sharing more than twice the proportion of its features between languages as compared to a smaller model"[https://www.anthropic.com/research/tracing-thoughts-language-model]
Evidence that capability comes from the number of things a model "groks", not from raw parameter count. Larger models have more room for memorization, yet Haiku is apparently spending *less* of its capacity per language, reusing the same shared features across languages instead of duplicating circuitry for each one. Once we figure out how to get smaller models to grok more, we'll get Claude 3.7-level capabilities out of local models.
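
For concreteness, here's one toy way to read "proportion of shared features between languages": check which features fire on a prompt versus on its translation and measure the overlap. This is just an illustrative sketch with made-up activations (the threshold and Jaccard-style overlap are my own assumptions), not how Anthropic actually measured it:

```python
# Toy illustration only: not Anthropic's methodology, just one plausible
# reading of "proportion of features shared between languages".

def active_features(activations, threshold=0.0):
    """Indices of features whose activation exceeds the threshold."""
    return {i for i, a in enumerate(activations) if a > threshold}

def shared_feature_proportion(acts_a, acts_b, threshold=0.0):
    """Jaccard overlap between the feature sets active on a prompt (acts_a)
    and on its translation into another language (acts_b)."""
    a = active_features(acts_a, threshold)
    b = active_features(acts_b, threshold)
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)

# Made-up numbers: the "larger model" reuses more of the same features
# across the two languages, so its overlap comes out higher.
small_en, small_fr = [1, 0, 0, 3, 0, 2], [0, 2, 0, 3, 1, 0]
large_en, large_fr = [1, 2, 0, 3, 0, 2], [1, 2, 0, 3, 0, 1]

print(shared_feature_proportion(small_en, small_fr))  # 0.2
print(shared_feature_proportion(large_en, large_fr))  # 1.0
```

On this reading, "more than twice the proportion" just means the bigger model's cross-language overlap is over 2x the smaller model's.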