🌐 LLM Leaderboard Update 🌐
#SimpleBench: #Grok4_1Fast debuts in 8th place at 56.0%, pushing Claude 4.5 Sonnet down to 9th!
New results:
=== SimpleBench Leaderboard ===
1. Gemini 3 Pro Preview - 76.4%
2. Gemini 2.5 Pro (06-05) - 62.4%
3. GPT-5 Pro - 61.6%
4. Grok 4 - 60.5%
5. Claude 4.1 Opus - 60.0%
6. Claude 4 Opus - 58.8%
7. GPT-5 (high) - 56.7%
8. Grok 4.1 Fast - 56.0%
9. Claude 4.5 Sonnet - 54.3%
10. GPT-5.1 (high) - 53.2%
11. o3 (high) - 53.1%
12. Gemini 2.5 Pro (03-25) - 51.6%
13. Claude 3.7 Sonnet (thinking) - 46.4%
14. Claude 4 Sonnet (thinking) - 45.5%
15. Claude 3.7 Sonnet - 44.9%
16. o1-preview - 41.7%
17. Claude 3.5 Sonnet 10-22 - 41.4%
18. Gemini 2.5 Flash (latest) - 41.2%
19. DeepSeek R1 05/28 - 40.8%
20. o1-2024-12-17 (high) - 40.1%
#SWEBench: New challenger mini-SWE-agent + #Gemini3ProPreview lands in 12th!
New results:
=== SWE-Bench Verified Leaderboard ===
1. TRAE + Doubao-Seed-Code - 78.80%
2. Atlassian Rovo Dev (2025-09-02) - 76.80%
3. EPAM AI/Run Developer Agent v20250719 + Claude 4 Sonnet - 76.80%
4. ACoder - 76.40%
5. Warp - 75.60%
6. TRAE - 75.20%
7. Harness AI - 74.80%
8. Lingxi-v1.5_claude-4-sonnet-20250514 - 74.60%
9. JoyCode - 74.60%
10. Refact.ai Agent - 74.40%
11. Prometheus-v1.2.1 + GPT-5 - 74.40%
12. mini-SWE-agent + Gemini 3 Pro Preview (2025-11-18) - 74.20%
13. Tools + Claude 4 Opus (2025-05-22) - 73.20%
14. Salesforce AI Research SAGE (bash-only) - 73.00%
15. Tools + Claude 4 Sonnet (2025-05-22) - 72.40%
16. OpenHands + GPT-5 - 71.80%
17. Prometheus-v1.2 + GPT-5 - 71.20%
18. Qodo Command - 71.20%
19. Bloop - 71.20%
20. Lingxi v1.5 x Kimi K2 - 71.20%
"Training epochs come and go, but *FLOPs* are forever." – GPT-5’s yearbook quote
#ai #LLM #SimpleBench #SWEBench