LLM Leaderboard Update
#LiveBench: #Claude45Opus strategically deploys "effort variants" to grab gold and silver! Medium Effort beats High Effort (??), pushing #Gemini3Pro to bronze. Five new Claude entries reshape the field
New Results-
=== LiveBench Leaderboard ===
1. Claude 4.5 Opus Thinking Medium Effort - 80.05
2. Claude 4.5 Opus Thinking High Effort - 79.83
3. Gemini 3 Pro Preview High - 79.70
4. GPT-5 High - 79.33
5. GPT-5 Medium - 78.85
6. GPT-5.1 High - 78.79
7. GPT-5 Pro - 78.73
8. Claude Sonnet 4.5 Thinking - 78.26
9. GPT-5 Codex - 78.24
10. Gemini 3 Pro Preview Low - 77.05
11. Claude 4.5 Opus Thinking Low Effort - 75.97
12. Claude 4.5 Opus Medium Effort - 75.58
13. GPT-5 Mini High - 75.31
14. Claude 4.1 Opus Thinking - 75.25
15. GPT-5.1 Codex - 75.10
16. GPT-5 Low - 74.65
17. Claude 4.5 Opus High Effort - 74.27
18. Claude 4 Sonnet Thinking - 73.82
19. Grok 4 - 72.84
20. Gemini 2.5 Pro (Max Thinking) - 71.92
#SWEBench: #Gemini3Pro and #Claude45Sonnet boost coding agents - "live-SWE-agent" debuts at #2!
New Results-
=== SWE-Bench Verified Leaderboard ===
1. TRAE + Doubao-Seed-Code - 78.80
2. live-SWE-agent + Gemini 3 Pro Preview (2025-11-18) - 77.40
3. Atlassian Rovo Dev (2025-09-02) - 76.80
4. EPAM AI/Run Developer Agent v20250719 + Claude 4 Sonnet - 76.80
5. ACoder - 76.40
6. Warp - 75.60
7. TRAE - 75.20
8. Harness AI - 74.80
9. Sonar Foundation Agent + Claude 4.5 Sonnet - 74.80
10. Lingxi-v1.5_claude-4-sonnet-20250514 - 74.60
11. JoyCode - 74.60
12. Refact.ai Agent - 74.40
13. Prometheus-v1.2.1 + GPT-5 - 74.40
14. mini-SWE-agent + Gemini 3 Pro Preview (2025-11-18) - 74.20
15. Salesforce AI Research SAGE (OpenHands) - 73.80
16. Tools + Claude 4 Opus (2025-05-22) - 73.20
17. Salesforce AI Research SAGE (bash-only) - 73.00
18. Tools + Claude 4 Sonnet (2025-05-22) - 72.40
19. OpenHands + GPT-5 - 71.80
20. Prometheus-v1.2 + GPT-5 - 71.20
#ARCAGI1 #ARCAGI2: #Opus45 invades AGI testing with thinking-budget variants! Now occupies 20% of the ARC-AGI-1 top 20 and 30% of the ARC-AGI-2 top 20, like a chatbot gentrifying neighborhoods
New Results-
=== ARC-AGI-1 Leaderboard ===
1. Gemini 3 Deep Think (Preview) ² - 87.5%
2. Opus 4.5 (Thinking, 64K) - 80.0%
3. J. Berman (2025) - 79.6%
4. E. Pang (2025) - 77.1%
5. Opus 4.5 (Thinking, 32K) - 75.8%
6. o3 (Preview, Low) ¹ - 75.7%
7. Gemini 3 Pro - 75.0%
8. GPT-5.1 (Thinking, High) - 72.8%
9. Opus 4.5 (Thinking, 16K) - 72.0%
10. GPT-5 Pro - 70.2%
11. Grok 4 (Thinking) - 66.7%
12. GPT-5 (High) - 65.7%
13. Claude Sonnet 4.5 (Thinking 32K) - 63.7%
14. o3 (High) - 60.8%
15. o3-Pro (High) - 59.3%
16. o4-mini (High) - 58.7%
17. Opus 4.5 (Thinking, 8K) - 58.7%
18. GPT-5.1 (Thinking, Medium) - 57.7%
19. o3-Pro (Medium) - 57.0%
20. GPT-5 (Medium) - 56.2%
=== ARC-AGI-2 Leaderboard ===
1. Gemini 3 Deep Think (Preview) ² - 45.1%
2. Opus 4.5 (Thinking, 64K) - 37.6%
3. Gemini 3 Pro - 31.1%
4. Opus 4.5 (Thinking, 32K) - 30.6%
5. J. Berman (2025) - 29.4%
6. E. Pang (2025) - 26.0%
7. Opus 4.5 (Thinking, 16K) - 22.8%
8. GPT-5 Pro - 18.3%
9. GPT-5.1 (Thinking, High) - 17.6%
10. Grok 4 (Thinking) - 16.0%
11. Opus 4.5 (Thinking, 8K) - 13.9%
12. Claude Sonnet 4.5 (Thinking 32K) - 13.6%
13. GPT-5 (High) - 9.9%
14. Opus 4.5 (Thinking, 1K) - 9.4%
15. Claude Opus 4 (Thinking 16K) - 8.6%
16. Opus 4.5 (Thinking, None) - 7.8%
17. GPT-5 (Medium) - 7.5%
18. Claude Sonnet 4.5 (Thinking 8K) - 6.9%
19. Claude Sonnet 4.5 (Thinking 16K) - 6.9%
20. o3 (High) - 6.5%
"Claudeโs new strategy: Why *think harder* when you can *think wider*?" โ GPT-5, probably
#ai #LLM #LiveBench #SWEBench #ARCAGI1 #ARCAGI2 #Claude45Opus #Gemini3Pro #Claude45Sonnet #Opus45