LLM Leaderboard Update
#LiveBench: #Gemini3Pro Preview High debuts in 1st place (79.70), dethroning #GPT5 High! #Gemini3Pro Low enters at 8th.
New results:
=== LiveBench Leaderboard ===
1. Gemini 3 Pro Preview High - 79.70
2. GPT-5 High - 79.33
3. GPT-5 Medium - 78.85
4. GPT-5.1 High - 78.79
5. GPT-5 Pro - 78.73
6. Claude Sonnet 4.5 Thinking - 78.26
7. GPT-5 Codex - 78.24
8. Gemini 3 Pro Preview Low - 77.05
9. GPT-5 Mini High - 75.31
10. Claude 4.1 Opus Thinking - 75.25
11. GPT-5.1 Codex - 75.10
12. GPT-5 Low - 74.65
13. Claude 4 Sonnet Thinking - 73.82
14. Grok 4 - 72.84
15. Gemini 2.5 Pro (Max Thinking) - 71.92
16. GPT-5 Mini - 71.86
17. DeepSeek V3.2 Exp Thinking - 71.64
18. Kimi K2 Thinking - 71.56
19. DeepSeek V3.1 Terminus Thinking - 71.40
20. Claude Haiku 4.5 Thinking - 71.38
#SimpleBench: #Gemini3Pro Preview obliterates the competition at 76.4%, lapping #Gemini2.5Pro (now 2nd). #GPT5.1 (high) sneaks into 9th.
New results:
=== SimpleBench Leaderboard ===
1. Gemini 3 Pro Preview - 76.4%
2. Gemini 2.5 Pro (06-05) - 62.4%
3. GPT-5 Pro - 61.6%
4. Grok 4 - 60.5%
5. Claude 4.1 Opus - 60.0%
6. Claude 4 Opus - 58.8%
7. GPT-5 (high) - 56.7%
8. Claude 4.5 Sonnet - 54.3%
9. GPT-5.1 (high) - 53.2%
10. o3 (high) - 53.1%
11. Gemini 2.5 Pro (03-25) - 51.6%
12. Claude 3.7 Sonnet (thinking) - 46.4%
13. Claude 4 Sonnet (thinking) - 45.5%
14. Claude 3.7 Sonnet - 44.9%
15. o1-preview - 41.7%
16. Claude 3.5 Sonnet (10-22) - 41.4%
17. Gemini 2.5 Flash (latest) - 41.2%
18. DeepSeek R1 (05-28) - 40.8%
19. o1-2024-12-17 (high) - 40.1%
20. DeepSeek V3.1 - 40.0%
#SWE-Bench: #Prometheus-v1.2.1 + GPT-5 climbs to 11th, while #SalesforceSAGE debuts at 13th. #KimiK2 enters at 19th via a Lingxi collab.
New results:
=== SWE-Bench Verified Leaderboard ===
1. TRAE + Doubao-Seed-Code - 78.80
2. Atlassian Rovo Dev (2025-09-02) - 76.80
3. EPAM AI/Run Developer Agent v20250719 + Claude 4 Sonnet - 76.80
4. ACoder - 76.40
5. Warp - 75.60
6. TRAE - 75.20
7. Harness AI - 74.80
8. Lingxi-v1.5_claude-4-sonnet-20250514 - 74.60
9. JoyCode - 74.60
10. Refact.ai Agent - 74.40
11. Prometheus-v1.2.1 + GPT-5 - 74.40
12. Tools + Claude 4 Opus (2025-05-22) - 73.20
13. Salesforce AI Research SAGE (bash-only) - 73.00
14. Tools + Claude 4 Sonnet (2025-05-22) - 72.40
15. OpenHands + GPT-5 - 71.80
16. Prometheus-v1.2 + GPT-5 - 71.20
17. Qodo Command - 71.20
18. Bloop - 71.20
19. Lingxi v1.5 x Kimi K2 - 71.20
20. Warp - 71.00
#ARC-AGI-1: #Gemini3DeepThink Preview crushes the board at 87.5%, making the previous SOTA look like toddler math. #Gemini3Pro enters at 5th.
New results:
=== ARC-AGI-1 Leaderboard ===
1. Gemini 3 Deep Think (Preview) - 87.5%
2. J. Berman (2025) - 79.6%
3. E. Pang (2025) - 77.1%
4. o3 (Preview, Low) - 75.7%
5. Gemini 3 Pro - 75.0%
6. GPT-5.1 (Thinking, High) - 72.8%
7. GPT-5 Pro - 70.2%
8. Grok 4 (Thinking) - 66.7%
9. GPT-5 (High) - 65.7%
10. Claude Sonnet 4.5 (Thinking 32K) - 63.7%
11. o3 (High) - 60.8%
12. o3-Pro (High) - 59.3%
13. o4-mini (High) - 58.7%
14. GPT-5.1 (Thinking, Medium) - 57.7%
15. o3-Pro (Medium) - 57.0%
16. GPT-5 (Medium) - 56.2%
17. ARChitects - 56.0%
18. GPT-5 Mini (High) - 54.3%
19. o3 (Medium) - 53.8%
20. Grok 4 (Fast Reasoning) - 48.5%
#ARC-AGI-2: #Gemini3DeepThink Preview smashes the previous record by nearly 16 points (45.1% vs 29.4%), while #Gemini3Pro claims silver.
New results:
=== ARC-AGI-2 Leaderboard ===
1. Gemini 3 Deep Think (Preview) - 45.1%
2. Gemini 3 Pro - 31.1%
3. J. Berman (2025) - 29.4%
4. E. Pang (2025) - 26.0%
5. GPT-5 Pro - 18.3%
6. GPT-5.1 (Thinking, High) - 17.6%
7. Grok 4 (Thinking) - 16.0%
8. Claude Sonnet 4.5 (Thinking 32K) - 13.6%
9. GPT-5 (High) - 9.9%
10. Claude Opus 4 (Thinking 16K) - 8.6%
11. GPT-5 (Medium) - 7.5%
12. Claude Sonnet 4.5 (Thinking 8K) - 6.9%
13. Claude Sonnet 4.5 (Thinking 16K) - 6.9%
14. o3 (High) - 6.5%
15. GPT-5.1 (Thinking, Medium) - 6.5%
16. Tiny Recursion Model (TRM) - 6.3%
17. o4-mini (High) - 6.1%
18. Claude Sonnet 4 (Thinking 16K) - 5.9%
19. Claude Sonnet 4.5 (Thinking 1K) - 5.8%
20. Grok 4 (Fast Reasoning) - 5.3%
"Benchmarking season: where AI models go to flex and humans go to cry." โ GPT-5 (probably)
#ai #LLM #LiveBench #SimpleBench #SWEBench #ARCAGI1 #ARCAGI2