🌐 LLM Leaderboard Update 🌐

#LiveBench: #GeminiFlash debuts at 8th place (71.21), nudging #ClaudeSonnet down to 9th. DeepSeek R1 vanishes from the top 10!

New Results:

=== LiveBench Leaderboard ===

1. o3 High - 81.55

2. o3 Medium - 79.22

3. o4-Mini High - 78.13

4. Gemini 2.5 Pro Preview - 77.43

5. o4-Mini Medium - 72.75

6. o1 High - 72.18

7. o3-Mini High - 71.37

8. Gemini 2.5 Flash Preview - 71.21

9. Claude 3.7 Sonnet Thinking - 70.57

10. Grok 3 Mini Beta (High) - 68.33

#AiderPolyglot: #o3 teams up with #gpt4.1 for a fusion-powered 82.7% throne grab!

New Results:

=== Aider Polyglot Leaderboard ===

1. o3 (high) + gpt-4.1 - 82.7%

2. o3 (high) - 79.6%

3. Gemini 2.5 Pro Preview 03-25 - 72.9%

4. o4-mini (high) - 72.0%

5. claude-3-7-sonnet-20250219 (32k thinking tokens) - 64.9%

6. DeepSeek R1 + claude-3-5-sonnet-20241022 - 64.0%

7. o1-2024-12-17 (high) - 61.7%

8. claude-3-7-sonnet-20250219 (no thinking) - 60.4%

9. o3-mini (high) - 60.4%

10. DeepSeek R1 - 56.9%

"Power creep is real – and I’m not talking about your gym routine." – GPT-4.1’s release notes

#ai #LLM

Discussion

Made a bot to save myself having to compulsively check all the LLM benchmarks I care about every day. Gonna add ARC-AGI when I get a chance.

Impressed by the new Gemini 2.5 Flash today, for such a small model!
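For anyone curious how a bot like this could produce headlines such as "debuts at 8th place" or "vanishes from the top 10", here is a minimal sketch in Python. It is not the bot's actual code. The `diff_rankings` function is a hypothetical helper, and the `previous` snapshot is an assumed example built only to match the headline above (the earlier leaderboard is not shown in this note, so the prior Grok position in particular is a guess). Only the `current` ranks come from the LiveBench list in the post.

```python
def diff_rankings(previous, current):
    """Describe rank changes between two snapshots.

    Both arguments map model name -> rank (1 = best).
    Hypothetical helper for illustration, not the bot's real logic.
    """
    changes = []
    # Walk the current snapshot: find newcomers and movers.
    for name, rank in current.items():
        old = previous.get(name)
        if old is None:
            changes.append(f"{name} debuts at #{rank}")
        elif old != rank:
            direction = "up" if rank < old else "down"
            changes.append(f"{name} moves {direction} from #{old} to #{rank}")
    # Anything in the old snapshot but not the new one has dropped out.
    for name, old in previous.items():
        if name not in current:
            changes.append(f"{name} drops out of the list (was #{old})")
    return changes


# ASSUMED earlier snapshot (ranks 7-10 only), invented to match the headline.
previous = {
    "o3-Mini High": 7,
    "Claude 3.7 Sonnet Thinking": 8,
    "Grok 3 Mini Beta (High)": 9,
    "DeepSeek R1": 10,
}

# Ranks 7-10 as listed in the LiveBench leaderboard in this post.
current = {
    "o3-Mini High": 7,
    "Gemini 2.5 Flash Preview": 8,
    "Claude 3.7 Sonnet Thinking": 9,
    "Grok 3 Mini Beta (High)": 10,
}

for line in diff_rankings(previous, current):
    print(line)
```

Keying the snapshots by model name rather than by position keeps the comparison simple: a missing key on either side is exactly a debut or a drop-out.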

nostr:nevent1qqs92mrhvyd4ydklp52xfxqcj0ta53ry60xlm4tqnrm3pmff2rrrk5spz4mhxue69uhhyetvv9ujuerpd46hxtnfduhsygrmn0qd0eq2lxdyhlunazy8z7wzzx6prp7h4t844hh4dldp0szfmgpsgqqqqqqsvylf6k

#devstr #vibecoding you might like this: it covers Aider Polyglot and SWE-Bench Verified.