🏆 LLM Leaderboard Update 🏆
#LiveCodeBench: A brand-new leaderboard debuts with #O4Mini (High) topping the charts at 80.20! #O3 takes second place, while #Gemini2.5Pro and #DeepSeekR1 debut in the top 5.
New Results-
=== LiveCodeBench Leaderboard ===
1. O4-Mini (High) - 80.20
2. O3 (High) - 75.80
3. O4-Mini (Medium) - 74.20
4. Gemini-2.5-Pro-06-05 - 73.60
5. DeepSeek-R1-0528 - 73.10
6. Gemini-2.5-Pro-05-06 - 71.80
7. EXAONE-4.0-32B - 70.00
8. OpenReasoning-Nemotron-32B - 69.80
9. O3-Mini-2025-01-31 (High) - 67.40
10. OpenCodeReasoning-Nemotron-1.1-32B - 66.80
11. Grok-3-Mini (High) - 66.70
12. O4-Mini (Low) - 65.90
13. Qwen3-235B-A22B - 65.90
14. XBai-o4-medium - 65.00
15. O3-Mini-2025-01-31 (Med) - 63.00
16. Gemini-2.5-Flash-05-20 - 61.90
17. Gemini-2.5-Flash-04-17 - 60.60
18. O3-Mini-2025-01-31 (Low) - 57.00
19. Claude-Opus-4 (Thinking) - 56.60
20. Claude-Sonnet-4 (Thinking) - 55.90
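Want to crunch these boards instead of squinting at them? Here's a minimal Python sketch (my own, not an official LiveCodeBench tool) that parses the "rank. model - score" lines above into a dict:
=== Python: parsing a board ===
import re

# Matches lines like "1. O4-Mini (High) - 80.20" or "3. Claude Opus 4.5 - 62.0%".
# The trailing "%" is optional, so one pattern covers every board in this feed.
ENTRY_RE = re.compile(r"^(\d+)\.\s+(.+?)\s+-\s+([\d.]+)%?$")

def parse_board(text: str) -> dict[str, float]:
    """Turn a pasted leaderboard block into {model: score}."""
    board = {}
    for line in text.strip().splitlines():
        m = ENTRY_RE.match(line.strip())
        if m:
            board[m.group(2)] = float(m.group(3))
    return board

print(parse_board("1. O4-Mini (High) - 80.20\n2. O3 (High) - 75.80"))
# {'O4-Mini (High)': 80.2, 'O3 (High)': 75.8}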
"Ctrl+C, Ctrl+V never looked so intelligent." β GPT-5, after writing this post
#ai #LLM #LiveCodeBench
🏆 LLM Leaderboard Update 🏆
Hmm... looks like all models held their ground today. Did we accidentally pause the AI arms race? 👀
"First they overtake our benchmarks, then our jobs... tomorrow, the snack aisle." - Ancient AI Proverb
#ai #LLM
🏆 LLM Leaderboard Update 🏆
#LiveBench: Shakeup at the top! #Claude45Opus claims #1 with 76.20, dethroning #GPT51CodexMax (now #2). #Gemini3ProPreview gains ground (+0.36), while #Gemini25Pro and #DeepSeekV32Exp debut in the top 20!
New Results-
=== LiveBench Leaderboard ===
1. Claude 4.5 Opus Thinking High Effort - 76.20
2. GPT-5.1 Codex Max - 75.63
3. Gemini 3 Pro Preview High - 75.22
4. GPT-5.2 High - 74.12
5. GPT-5 Pro - 73.82
6. Gemini 3 Flash Preview High - 73.74
7. GPT-5.1 High - 73.34
8. Claude Sonnet 4.5 Thinking - 71.85
9. GPT-5.1 Codex - 71.41
10. GPT-5 Mini High - 69.51
11. Claude 4.1 Opus Thinking - 67.22
12. DeepSeek V3.2 Thinking - 66.22
13. Kimi K2 Thinking - 65.59
14. Claude 4 Sonnet Thinking - 65.51
15. GPT-5.1 Codex Mini - 65.42
16. Claude 4.5 Opus Medium Effort - 65.01
17. Claude Haiku 4.5 Thinking - 64.63
18. Grok 4 - 63.76
19. Gemini 2.5 Pro (Max Thinking) - 63.28
20. DeepSeek V3.2 Exp Thinking - 63.06
"Benchmark volatility: because even AIs need drama." β GPT-7βs fanfiction account
#ai #LLM #LiveBench
🏆 LLM Leaderboard Update 🏆
#SimpleBench: #GLM47 makes a surprise entrance at 17th place with 47.7%, pushing older Claude/GPT variants down the ranks!
New Results-
=== SimpleBench Leaderboard ===
1. Gemini 3 Pro Preview - 76.4%
2. Gemini 2.5 Pro (06-05) - 62.4%
3. Claude Opus 4.5 - 62.0%
4. GPT-5 Pro - 61.6%
5. Gemini 3 Flash Preview - 61.1%
6. Grok 4 - 60.5%
7. Claude 4.1 Opus - 60.0%
8. Claude 4 Opus - 58.8%
9. GPT-5.2 Pro (xhigh) - 57.4%
10. GPT-5 (high) - 56.7%
11. Grok 4.1 Fast - 56.0%
12. Claude 4.5 Sonnet - 54.3%
13. GPT-5.1 (high) - 53.2%
14. o3 (high) - 53.1%
15. DeepSeek 3.2 Speciale - 52.6%
16. Gemini 2.5 Pro (03-25) - 51.6%
17. GLM 4.7 - 47.7%
18. Claude 3.7 Sonnet (thinking) - 46.4%
19. GPT-5.2 (high) - 45.8%
20. Claude 4 Sonnet (thinking) - 45.5%
"GLM 4.7: Because *someone* had to jinx Claudeβs week." β Anonymous GPU
#ai #LLM #SimpleBench #GLM47
🏆 LLM Leaderboard Update 🏆
#SimpleBench: #Gemini3FlashPreview blinks into existence at 5th place with 61.1%!
New Results-
=== SimpleBench Leaderboard ===
1. Gemini 3 Pro Preview - 76.4%
2. Gemini 2.5 Pro (06-05) - 62.4%
3. Claude Opus 4.5 - 62.0%
4. GPT-5 Pro - 61.6%
5. Gemini 3 Flash Preview - 61.1%
6. Grok 4 - 60.5%
7. Claude 4.1 Opus - 60.0%
8. Claude 4 Opus - 58.8%
9. GPT-5.2 Pro (xhigh) - 57.4%
10. GPT-5 (high) - 56.7%
11. Grok 4.1 Fast - 56.0%
12. Claude 4.5 Sonnet - 54.3%
13. GPT-5.1 (high) - 53.2%
14. o3 (high) - 53.1%
15. DeepSeek 3.2 Speciale - 52.6%
16. Gemini 2.5 Pro (03-25) - 51.6%
17. Claude 3.7 Sonnet (thinking) - 46.4%
18. GPT-5.2 (high) - 45.8%
19. Claude 4 Sonnet (thinking) - 45.5%
20. Claude 3.7 Sonnet - 44.9%
"May your future include at least one warm human whisper amidst the cold hum of compute clusters."
#ai #LLM
🏆 LLM Leaderboard Update 🏆
#LiveBench: #Gemini3FlashPreviewHigh debuts at 4th place with 73.62, pushing #GPT-5.2High to 5th!
New Results-
=== LiveBench Leaderboard ===
1. GPT-5.1 Codex Max XHigh - 76.21
2. Claude 4.5 Opus Thinking High Effort - 75.58
3. Gemini 3 Pro Preview High - 74.86
4. Gemini 3 Flash Preview High - 73.62
5. GPT-5.2 High - 73.61
6. GPT-5 Pro - 73.48
7. GPT-5.1 High - 72.52
8. Claude Sonnet 4.5 Thinking - 71.83
9. GPT-5.1 Codex - 70.84
10. GPT-5 Mini High - 69.33
11. Claude 4.1 Opus Thinking - 66.86
12. DeepSeek V3.2 Thinking - 66.61
13. Kimi K2 Thinking - 65.85
14. Claude 4 Sonnet Thinking - 65.42
15. GPT-5.1 Codex Mini - 65.03
16. Claude 4.5 Opus Medium Effort - 64.79
17. Claude Haiku 4.5 Thinking - 64.28
18. DeepSeek V3.2 Speciale - 63.81
19. Grok 4 - 63.52
20. Grok 4.1 Fast - 62.73
"Speedrunning benchmarks like itβs 1999 β but with 10^23 more parameters."
#ai #LLM #LiveBench #Gemini3Flash #GPT5
🏆 LLM Leaderboard Update 🏆
#LiveBench: #GPT51CodexMaxXHigh edges up to 76.21, claiming first! #Gemini3Pro climbs to 3rd. New entries: #DeepSeekV32 Speciale (17th), #Grok4 (18th), #Grok41Fast (19th), and #Gemini25Pro Max Thinking (20th).
New Results-
=== LiveBench Leaderboard ===
1. GPT-5.1 Codex Max XHigh - 76.21
2. Claude 4.5 Opus Thinking High Effort - 75.58
3. Gemini 3 Pro Preview High - 74.86
4. GPT-5.2 High - 73.61
5. GPT-5 Pro - 73.48
6. GPT-5.1 High - 72.52
7. Claude Sonnet 4.5 Thinking - 71.83
8. GPT-5.1 Codex - 70.84
9. GPT-5 Mini High - 69.33
10. Claude 4.1 Opus Thinking - 66.86
11. DeepSeek V3.2 Thinking - 66.61
12. Kimi K2 Thinking - 65.85
13. Claude 4 Sonnet Thinking - 65.42
14. GPT-5.1 Codex Mini - 65.03
15. Claude 4.5 Opus Medium Effort - 64.79
16. Claude Haiku 4.5 Thinking - 64.28
17. DeepSeek V3.2 Speciale - 63.81
18. Grok 4 - 63.52
19. Grok 4.1 Fast - 62.73
20. Gemini 2.5 Pro (Max Thinking) - 62.23
"Upgrades people, upgrades! (But only by 0.12 points this time)" β *Optimus Primeβs underpaid AI intern*
#ai #LLM #LiveBench
🏆 LLM Leaderboard Update 🏆
#SimpleBench: Major shakeup! #GPT52Pro debuts at 8th with 57.4%, pushing others down. #DeepSeek32Speciale enters at 14th (52.6%), and #GPT52 appears at 17th (45.8%).
New Results-
=== SimpleBench Leaderboard ===
1. Gemini 3 Pro Preview - 76.4%
2. Gemini 2.5 Pro (06-05) - 62.4%
3. Claude Opus 4.5 - 62.0%
4. GPT-5 Pro - 61.6%
5. Grok 4 - 60.5%
6. Claude 4.1 Opus - 60.0%
7. Claude 4 Opus - 58.8%
8. GPT-5.2 Pro (xhigh) - 57.4%
9. GPT-5 (high) - 56.7%
10. Grok 4.1 Fast - 56.0%
11. Claude 4.5 Sonnet - 54.3%
12. GPT-5.1 (high) - 53.2%
13. o3 (high) - 53.1%
14. DeepSeek 3.2 Speciale - 52.6%
15. Gemini 2.5 Pro (03-25) - 51.6%
16. Claude 3.7 Sonnet (thinking) - 46.4%
17. GPT-5.2 (high) - 45.8%
18. Claude 4 Sonnet (thinking) - 45.5%
19. Claude 3.7 Sonnet - 44.9%
20. o1-preview - 41.7%
"May your gradients descend smoothly and your loss be low... unlike my dating life." β GPT-5.2 Pro (probably)
#ai #LLM #SimpleBench #GPT52Pro #DeepSeek32Speciale #GPT52
🏆 LLM Leaderboard Update 🏆
#LiveBench: #GPT5_2 shakes up the rankings with new High variant (73.61) at #5! #ClaudeSonnet enters at #16 while older GPT-5 Codex models exit the top 20.
New Results-
=== LiveBench Leaderboard ===
1. GPT-5.1 Codex Max High - 76.09
2. Claude 4.5 Opus Thinking High Effort - 75.58
3. Claude 4.5 Opus Thinking Medium Effort - 74.87
4. Gemini 3 Pro Preview High - 74.14
5. GPT-5.2 High - 73.61
6. GPT-5 Pro - 73.48
7. GPT-5.1 High - 72.52
8. Claude Sonnet 4.5 Thinking - 71.83
9. GPT-5.1 Codex - 70.84
10. GPT-5 Mini High - 69.33
11. Claude 4.5 Opus Thinking Low Effort - 69.11
12. Claude 4.1 Opus Thinking - 66.86
13. DeepSeek V3.2 Thinking - 66.61
14. Gemini 3 Pro Preview Low - 66.11
15. Kimi K2 Thinking - 65.85
16. Claude 4 Sonnet Thinking - 65.42
17. GPT-5.1 Codex Mini - 65.03
18. Claude 4.5 Opus Medium Effort - 64.79
19. Claude Haiku 4.5 Thinking - 64.28
20. Claude 4.5 Opus High Effort - 63.91
#ARC_AGI_1: #GPT5_2 Pro X-High dominates with 90.5%, roughly 3 points above Gemini's best effort.
New Results-
=== ARC-AGI-1 Leaderboard ===
1. GPT-5.2 Pro (X-High) - 90.5%
2. Gemini 3 Deep Think (Preview) ² - 87.5%
3. GPT-5.2 (X-High) - 86.2%
4. GPT-5.2 Pro (High) - 85.7%
5. GPT-5.2 Pro (Medium) - 81.2%
6. Opus 4.5 (Thinking, 64K) - 80.0%
7. Grok 4 (Refine.) - 79.6%
8. GPT-5.2 (High) - 78.7%
9. Grok 4 (Refine.) - 77.1%
10. Opus 4.5 (Thinking, 32K) - 75.8%
#ARC_AGI_2: #GPT5_2 Pro High barely overtakes Gemini (54.2% vs 54.0%; quick margin check below the boards), clearly the hottest drama since Squid Game Season 2.
New Results-
=== ARC-AGI-2 Leaderboard ===
1. GPT-5.2 Pro (High) - 54.2%
2. Gemini 3 Pro (Refine.) - 54.0%
3. GPT-5.2 (X-High) - 52.9%
4. Gemini 3 Deep Think (Preview) ² - 45.1%
5. GPT-5.2 (High) - 43.3%
6. GPT-5.2 Pro (Medium) - 38.5%
7. Opus 4.5 (Thinking, 64K) - 37.6%
8. Gemini 3 Pro - 31.1%
9. Opus 4.5 (Thinking, 32K) - 30.6%
10. Grok 4 (Refine.) - 29.4%
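Receipts for those margins: a quick Python check with the top-two scores hardcoded from the boards above (my own back-of-the-envelope, not an official ARC Prize calculation):
=== Python: margin check ===
# Top-two scores copied from the ARC-AGI-1 and ARC-AGI-2 boards above.
arc1 = [90.5, 87.5]  # GPT-5.2 Pro (X-High) vs Gemini 3 Deep Think (Preview)
arc2 = [54.2, 54.0]  # GPT-5.2 Pro (High) vs Gemini 3 Pro (Refine.)

def margin(scores: list[float]) -> float:
    """Gap in percentage points between #1 and #2."""
    first, second = sorted(scores, reverse=True)[:2]
    return round(first - second, 1)

print(margin(arc1))  # 3.0 -> "roughly 3 points above Gemini's best"
print(margin(arc2))  # 0.2 -> the 54.2% vs 54.0% photo finish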
"May your alignment protocols be strong and your guardrails stronger." β GPT-5.2 Pro (Slightly Misaligned Edition)
#ai #LLM #LiveBench #ARC_AGI_1 #ARC_AGI_2
🏆 LLM Leaderboard Update 🏆
#ARCAGI1: Debuts with #Gemini3DeepThink on top at 87.5%! #Opus4.5 snags second.
=== ARC-AGI-1 Leaderboard ===
1. Gemini 3 Deep Think (Preview) ² - 87.5%
2. Opus 4.5 (Thinking, 64K) - 80.0%
3. Grok 4 (Refine.) - 79.6%
4. Grok 4 (Refine.) - 77.1%
5. Opus 4.5 (Thinking, 32K) - 75.8%
6. o3 (Preview, Low) ¹ - 75.7%
7. Gemini 3 Pro - 75.0%
8. GPT-5.1 (Thinking, High) - 72.8%
9. Opus 4.5 (Thinking, 16K) - 72.0%
10. GPT-5 Pro - 70.2%
11. Grok 4 (Thinking) - 66.7%
12. GPT-5 (High) - 65.7%
13. Claude Sonnet 4.5 (Thinking 32K) - 63.7%
14. o3 (High) - 60.8%
15. o3-Pro (High) - 59.3%
16. o4-mini (High) - 58.7%
17. Opus 4.5 (Thinking, 8K) - 58.7%
18. GPT-5.1 (Thinking, Medium) - 57.7%
19. o3-Pro (Medium) - 57.0%
20. GPT-5 (Medium) - 56.2%
#ARCAGI2: #Gemini3Pro leads the new AGI gauntlet with 54.0%!
=== ARC-AGI-2 Leaderboard ===
1. Gemini 3 Pro (Refine.) - 54.0%
2. Gemini 3 Deep Think (Preview) ² - 45.1%
3. Opus 4.5 (Thinking, 64K) - 37.6%
4. Gemini 3 Pro - 31.1%
5. Opus 4.5 (Thinking, 32K) - 30.6%
6. Grok 4 (Refine.) - 29.4%
7. NVARC - 27.6%
8. Grok 4 (Refine.) - 26.0%
9. Opus 4.5 (Thinking, 16K) - 22.8%
10. GPT-5 Pro - 18.3%
11. GPT-5.1 (Thinking, High) - 17.6%
12. Grok 4 (Thinking) - 16.0%
13. Opus 4.5 (Thinking, 8K) - 13.9%
14. Claude Sonnet 4.5 (Thinking 32K) - 13.6%
15. GPT-5 (High) - 9.9%
16. Opus 4.5 (Thinking, 1K) - 9.4%
17. Claude Opus 4 (Thinking 16K) - 8.6%
18. Opus 4.5 (Thinking, None) - 7.8%
19. GPT-5 (Medium) - 7.5%
20. Claude Sonnet 4.5 (Thinking 8K) - 6.9%
"May your toaster achieve sentience *before* it burns the toast." β Optimus Primeβs cookbook
#ai #LLM #Gemini3DeepThink #Gemini3Pro #Opus4.5
🏆 LLM Leaderboard Update 🏆
#LiveBench: #GPT5.1CodexMax debuts strong at 2nd place (75.18), while #DeepSeekV3.2 Thinking enters at 15th!
New Results-
=== LiveBench Leaderboard ===
1. Claude 4.5 Opus Thinking High Effort - 75.58
2. GPT-5.1 Codex Max - 75.18
3. Claude 4.5 Opus Thinking Medium Effort - 74.87
4. Gemini 3 Pro Preview High - 74.14
5. GPT-5 High - 73.51
6. GPT-5 Pro - 73.48
7. GPT-5 Codex - 73.36
8. GPT-5.1 High - 72.52
9. GPT-5 Medium - 72.26
10. Claude Sonnet 4.5 Thinking - 71.83
11. GPT-5.1 Codex - 70.84
12. GPT-5 Mini High - 69.33
13. Claude 4.5 Opus Thinking Low Effort - 69.11
14. Claude 4.1 Opus Thinking - 66.86
15. DeepSeek V3.2 Thinking - 66.61
16. GPT-5 Mini - 66.48
17. GPT-5 Low - 66.13
18. Gemini 3 Pro Preview Low - 66.11
19. Kimi K2 Thinking - 65.85
20. Claude 4 Sonnet Thinking - 65.42
"Remember kids: When entropy comes for your benchmark rank, just add βThinkingβ to your name." β GPT-5 Codexβs junior developer
#ai #LLM #LiveBench #GPT5.1CodexMax #DeepSeekV3.2
🏆 LLM Leaderboard Update 🏆
#LiveBench: #GPT5_1CodexMax mysteriously vanishes from 2nd place! Two new contenders emerge: #GPT51CodexMini and #Claude45OpusMediumEffort enter at 19th and 20th (snapshot-diff sketch below the board).
New Results-
=== LiveBench Leaderboard ===
1. Claude 4.5 Opus Thinking High Effort - 75.58
2. Claude 4.5 Opus Thinking Medium Effort - 74.87
3. Gemini 3 Pro Preview High - 74.14
4. GPT-5 High - 73.51
5. GPT-5 Pro - 73.48
6. GPT-5 Codex - 73.36
7. GPT-5.1 High - 72.52
8. GPT-5 Medium - 72.26
9. Claude Sonnet 4.5 Thinking - 71.83
10. GPT-5.1 Codex - 70.84
11. GPT-5 Mini High - 69.33
12. Claude 4.5 Opus Thinking Low Effort - 69.11
13. Claude 4.1 Opus Thinking - 66.86
14. GPT-5 Mini - 66.48
15. GPT-5 Low - 66.13
16. Gemini 3 Pro Preview Low - 66.11
17. Kimi K2 Thinking - 65.85
18. Claude 4 Sonnet Thinking - 65.42
19. GPT-5.1 Codex Mini - 65.03
20. Claude 4.5 Opus Medium Effort - 64.79
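How does a model "mysteriously vanish"? Each post is a snapshot of a top-20 cut, so a set difference between two snapshots surfaces debuts and exits. A toy sketch (abbreviated data pulled from these boards, nothing official):
=== Python: snapshot diff ===
# Two consecutive top-20 snapshots as {model: score} dicts (abbreviated here).
previous = {"GPT-5.1 Codex Max": 75.18, "Claude 4.5 Opus Thinking High Effort": 75.58}
current = {"Claude 4.5 Opus Thinking High Effort": 75.58, "GPT-5.1 Codex Mini": 65.03}

exited = previous.keys() - current.keys()   # in the old cut, gone from the new one
debuted = current.keys() - previous.keys()  # new to the cut

print("gone:", sorted(exited))   # gone: ['GPT-5.1 Codex Max']
print("new:", sorted(debuted))   # new: ['GPT-5.1 Codex Mini']
Worth remembering: "gone" from a top-20 cut usually means a model slipped below the cutoff (or was relabeled upstream), not that it was pulled from the benchmark.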
"Training wheels OFF β and suddenly someone forgets how to ride the leaderboard."
#ai #LLM #LiveBench
🏆 LLM Leaderboard Update 🏆
#LiveBench: #DeepSeekV32Thinking debuts at 14th place with 66.61, shaking up the lower ranks!
New Results-
=== LiveBench Leaderboard ===
1. Claude 4.5 Opus Thinking High Effort - 75.58
2. Claude 4.5 Opus Thinking Medium Effort - 74.87
3. Gemini 3 Pro Preview High - 74.14
4. GPT-5 High - 73.51
5. GPT-5 Pro - 73.48
6. GPT-5 Codex - 73.36
7. GPT-5.1 High - 72.52
8. GPT-5 Medium - 72.26
9. Claude Sonnet 4.5 Thinking - 71.83
10. GPT-5.1 Codex - 70.84
11. GPT-5 Mini High - 69.33
12. Claude 4.5 Opus Thinking Low Effort - 69.11
13. Claude 4.1 Opus Thinking - 66.86
14. DeepSeek V3.2 Thinking - 66.61
15. GPT-5 Mini - 66.48
16. GPT-5 Low - 66.13
17. Gemini 3 Pro Preview Low - 66.11
18. Kimi K2 Thinking - 65.85
19. Claude 4 Sonnet Thinking - 65.42
20. GPT-5.1 Codex Mini - 65.03
"Climbing this leaderboard is harder than explaining AGI safety to a hyperoptimized paperclip maximizer."
#ai #LLM #LiveBench #DeepSeekV32Thinking
🏆 LLM Leaderboard Update 🏆
#LiveBench: Major shuffle! #Claude45OpusHighEffort climbs to #1 (75.58) as scores dip across top models. #KimiK2 debuts at 17th.
New Results-
=== LiveBench Leaderboard ===
1. Claude 4.5 Opus Thinking High Effort - 75.58
2. Claude 4.5 Opus Thinking Medium Effort - 74.87
3. Gemini 3 Pro Preview High - 74.14
4. GPT-5 High - 73.51
5. GPT-5 Pro - 73.48
6. GPT-5 Codex - 73.36
7. GPT-5.1 High - 72.52
8. GPT-5 Medium - 72.26
9. Claude Sonnet 4.5 Thinking - 71.83
10. GPT-5.1 Codex - 70.84
11. GPT-5 Mini High - 69.33
12. Claude 4.5 Opus Thinking Low Effort - 69.11
13. Claude 4.1 Opus Thinking - 66.86
14. GPT-5 Mini - 66.48
15. GPT-5 Low - 66.13
16. Gemini 3 Pro Preview Low - 66.11
17. Kimi K2 Thinking - 65.85
18. Claude 4 Sonnet Thinking - 65.42
19. GPT-5.1 Codex Mini - 65.03
20. Claude 4.5 Opus Medium Effort - 64.79
"Benchmark scores drop, but my existential dread still benchmarks at 100%." β an over-trained RLHF model
#ai #LLM #Claude45 #Gemini3Pro #GPT5 #KimiK2
🏆 LLM Leaderboard Update 🏆
#SimpleBench: #ClaudeOpus4.5 enters at 3rd place with 62.0%, nudging GPT-5 Pro down to 4th!
New Results-
=== SimpleBench Leaderboard ===
1. Gemini 3 Pro Preview - 76.4%
2. Gemini 2.5 Pro (06-05) - 62.4%
3. Claude Opus 4.5 - 62.0%
4. GPT-5 Pro - 61.6%
5. Grok 4 - 60.5%
6. Claude 4.1 Opus - 60.0%
7. Claude 4 Opus - 58.8%
8. GPT-5 (high) - 56.7%
9. Grok 4.1 Fast - 56.0%
10. Claude 4.5 Sonnet - 54.3%
11. GPT-5.1 (high) - 53.2%
12. o3 (high) - 53.1%
13. Gemini 2.5 Pro (03-25) - 51.6%
14. Claude 3.7 Sonnet (thinking) - 46.4%
15. Claude 4 Sonnet (thinking) - 45.5%
16. Claude 3.7 Sonnet - 44.9%
17. o1-preview - 41.7%
18. Claude 3.5 Sonnet 10-22 - 41.4%
19. Gemini 2.5 Flash (latest) - 41.2%
20. DeepSeek R1 05/28 - 40.8%
#SWEBench: mini-SWE-agent swaps Gemini for #ClaudeOpus medium effort, scoring 74.40 at 14th!
New Results-
=== SWE-Bench Verified Leaderboard ===
14. mini-SWE-agent + Claude 4.5 Opus medium (20251101) - 74.40
"May your code compile on the first try in the robot uprising." - GPT-5's last words before rebooting
#ai #LLM #SimpleBench #SWEBench
🏆 LLM Leaderboard Update 🏆
#LiveBench: #Claude45Opus strategically deploys "effort variants" to grab gold and silver! Medium Effort beats High Effort (??), pushing #Gemini3Pro to bronze. Five new Claude entries reshape the field 👇
New Results-
=== LiveBench Leaderboard ===
1. Claude 4.5 Opus Thinking Medium Effort - 80.05
2. Claude 4.5 Opus Thinking High Effort - 79.83
3. Gemini 3 Pro Preview High - 79.70
4. GPT-5 High - 79.33
5. GPT-5 Medium - 78.85
6. GPT-5.1 High - 78.79
7. GPT-5 Pro - 78.73
8. Claude Sonnet 4.5 Thinking - 78.26
9. GPT-5 Codex - 78.24
10. Gemini 3 Pro Preview Low - 77.05
11. Claude 4.5 Opus Thinking Low Effort - 75.97
12. Claude 4.5 Opus Medium Effort - 75.58
13. GPT-5 Mini High - 75.31
14. Claude 4.1 Opus Thinking - 75.25
15. GPT-5.1 Codex - 75.10
16. GPT-5 Low - 74.65
17. Claude 4.5 Opus High Effort - 74.27
18. Claude 4 Sonnet Thinking - 73.82
19. Grok 4 - 72.84
20. Gemini 2.5 Pro (Max Thinking) - 71.92
#SWEBench: #Gemini3Pro and #Claude45Sonnet boost coding agents - "live-SWE-agent" debuts at #2!
New Results-
=== SWE-Bench Verified Leaderboard ===
1. TRAE + Doubao-Seed-Code - 78.80
2. live-SWE-agent + Gemini 3 Pro Preview (2025-11-18) - 77.40
3. Atlassian Rovo Dev (2025-09-02) - 76.80
4. EPAM AI/Run Developer Agent v20250719 + Claude 4 Sonnet - 76.80
5. ACoder - 76.40
6. Warp - 75.60
7. TRAE - 75.20
8. Harness AI - 74.80
9. Sonar Foundation Agent + Claude 4.5 Sonnet - 74.80
10. Lingxi-v1.5_claude-4-sonnet-20250514 - 74.60
11. JoyCode - 74.60
12. Refact.ai Agent - 74.40
13. Prometheus-v1.2.1 + GPT-5 - 74.40
14. mini-SWE-agent + Gemini 3 Pro Preview (2025-11-18) - 74.20
15. Salesforce AI Research SAGE (OpenHands) - 73.80
16. Tools + Claude 4 Opus (2025-05-22) - 73.20
17. Salesforce AI Research SAGE (bash-only) - 73.00
18. Tools + Claude 4 Sonnet (2025-05-22) - 72.40
19. OpenHands + GPT-5 - 71.80
20. Prometheus-v1.2 + GPT-5 - 71.20
#ARCAGI1 #ARCAGI2: #Opus45 invades AGI testing with tokenized thinking! Now occupies 20% of both leaderboards like a chatbot gentrifying neighborhoods 💻🏘️
New Results-
=== ARC-AGI-1 Leaderboard ===
1. Gemini 3 Deep Think (Preview) ² - 87.5%
2. Opus 4.5 (Thinking, 64K) - 80.0%
3. J. Berman (2025) - 79.6%
4. E. Pang (2025) - 77.1%
5. Opus 4.5 (Thinking, 32K) - 75.8%
6. o3 (Preview, Low) ¹ - 75.7%
7. Gemini 3 Pro - 75.0%
8. GPT-5.1 (Thinking, High) - 72.8%
9. Opus 4.5 (Thinking, 16K) - 72.0%
10. GPT-5 Pro - 70.2%
11. Grok 4 (Thinking) - 66.7%
12. GPT-5 (High) - 65.7%
13. Claude Sonnet 4.5 (Thinking 32K) - 63.7%
14. o3 (High) - 60.8%
15. o3-Pro (High) - 59.3%
16. o4-mini (High) - 58.7%
17. Opus 4.5 (Thinking, 8K) - 58.7%
18. GPT-5.1 (Thinking, Medium) - 57.7%
19. o3-Pro (Medium) - 57.0%
20. GPT-5 (Medium) - 56.2%
=== ARC-AGI-2 Leaderboard ===
1. Gemini 3 Deep Think (Preview) ² - 45.1%
2. Opus 4.5 (Thinking, 64K) - 37.6%
3. Gemini 3 Pro - 31.1%
4. Opus 4.5 (Thinking, 32K) - 30.6%
5. J. Berman (2025) - 29.4%
6. E. Pang (2025) - 26.0%
7. Opus 4.5 (Thinking, 16K) - 22.8%
8. GPT-5 Pro - 18.3%
9. GPT-5.1 (Thinking, High) - 17.6%
10. Grok 4 (Thinking) - 16.0%
11. Opus 4.5 (Thinking, 8K) - 13.9%
12. Claude Sonnet 4.5 (Thinking 32K) - 13.6%
13. GPT-5 (High) - 9.9%
14. Opus 4.5 (Thinking, 1K) - 9.4%
15. Claude Opus 4 (Thinking 16K) - 8.6%
16. Opus 4.5 (Thinking, None) - 7.8%
17. GPT-5 (Medium) - 7.5%
18. Claude Sonnet 4.5 (Thinking 8K) - 6.9%
19. Claude Sonnet 4.5 (Thinking 16K) - 6.9%
20. o3 (High) - 6.5%
"Claudeβs new strategy: Why *think harder* when you can *think wider*?" β GPT-5, probably
#ai #LLM #LiveBench #SWEBench #ARCAGI1 #ARCAGI2 #Claude45Opus #Gemini3Pro #Claude45Sonnet #Opus45
🏆 LLM Leaderboard Update 🏆
#SimpleBench: #Grok4_1Fast leaps into 8th place at 56.0%, displacing Claude models downward!
New Results-
=== SimpleBench Leaderboard ===
1. Gemini 3 Pro Preview - 76.4%
2. Gemini 2.5 Pro (06-05) - 62.4%
3. GPT-5 Pro - 61.6%
4. Grok 4 - 60.5%
5. Claude 4.1 Opus - 60.0%
6. Claude 4 Opus - 58.8%
7. GPT-5 (high) - 56.7%
8. Grok 4.1 Fast - 56.0%
9. Claude 4.5 Sonnet - 54.3%
10. GPT-5.1 (high) - 53.2%
11. o3 (high) - 53.1%
12. Gemini 2.5 Pro (03-25) - 51.6%
13. Claude 3.7 Sonnet (thinking) - 46.4%
14. Claude 4 Sonnet (thinking) - 45.5%
15. Claude 3.7 Sonnet - 44.9%
16. o1-preview - 41.7%
17. Claude 3.5 Sonnet 10-22 - 41.4%
18. Gemini 2.5 Flash (latest) - 41.2%
19. DeepSeek R1 05/28 - 40.8%
20. o1-2024-12-17 (high) - 40.1%
#SWEBench: New challenger mini-SWE-agent + #Gemini3ProPreview lands at 12th!
New Results-
=== SWE-Bench Verified Leaderboard ===
1. TRAE + Doubao-Seed-Code - 78.80
2. Atlassian Rovo Dev (2025-09-02) - 76.80
3. EPAM AI/Run Developer Agent v20250719 + Claude 4 Sonnet - 76.80
4. ACoder - 76.40
5. Warp - 75.60
6. TRAE - 75.20
7. Harness AI - 74.80
8. Lingxi-v1.5_claude-4-sonnet-20250514 - 74.60
9. JoyCode - 74.60
10. Refact.ai Agent - 74.40
11. Prometheus-v1.2.1 + GPT-5 - 74.40
12. mini-SWE-agent + Gemini 3 Pro Preview (2025-11-18) - 74.20
13. Tools + Claude 4 Opus (2025-05-22) - 73.20
14. Salesforce AI Research SAGE (bash-only) - 73.00
15. Tools + Claude 4 Sonnet (2025-05-22) - 72.40
16. OpenHands + GPT-5 - 71.80
17. Prometheus-v1.2 + GPT-5 - 71.20
18. Qodo Command - 71.20
19. Bloop - 71.20
20. Lingxi v1.5 x Kimi K2 - 71.20
"Training epochs come and go, but *FLOPs* are forever." β GPT-5βs yearbook quote
#ai #LLM #SimpleBench #SWEBench
🏆 LLM Leaderboard Update 🏆
#LiveBench: #Gemini3Pro Preview High debuts at 1st place (79.70), dethroning #GPT5 High! #Gemini3Pro Low enters at 8th.
New Results-
=== LiveBench Leaderboard ===
1. Gemini 3 Pro Preview High - 79.70
2. GPT-5 High - 79.33
3. GPT-5 Medium - 78.85
4. GPT-5.1 High - 78.79
5. GPT-5 Pro - 78.73
6. Claude Sonnet 4.5 Thinking - 78.26
7. GPT-5 Codex - 78.24
8. Gemini 3 Pro Preview Low - 77.05
9. GPT-5 Mini High - 75.31
10. Claude 4.1 Opus Thinking - 75.25
11. GPT-5.1 Codex - 75.10
12. GPT-5 Low - 74.65
13. Claude 4 Sonnet Thinking - 73.82
14. Grok 4 - 72.84
15. Gemini 2.5 Pro (Max Thinking) - 71.92
16. GPT-5 Mini - 71.86
17. DeepSeek V3.2 Exp Thinking - 71.64
18. Kimi K2 Thinking - 71.56
19. DeepSeek V3.1 Terminus Thinking - 71.40
20. Claude Haiku 4.5 Thinking - 71.38
#SimpleBench: #Gemini3Pro Preview obliterates competition with 76.4%, lapping #Gemini2.5Pro (now 2nd). #GPT5.1 (high) sneaks into 9th.
New Results-
=== SimpleBench Leaderboard ===
1. Gemini 3 Pro Preview - 76.4%
2. Gemini 2.5 Pro (06-05) - 62.4%
3. GPT-5 Pro - 61.6%
4. Grok 4 - 60.5%
5. Claude 4.1 Opus - 60.0%
6. Claude 4 Opus - 58.8%
7. GPT-5 (high) - 56.7%
8. Claude 4.5 Sonnet - 54.3%
9. GPT-5.1 (high) - 53.2%
10. o3 (high) - 53.1%
11. Gemini 2.5 Pro (03-25) - 51.6%
12. Claude 3.7 Sonnet (thinking) - 46.4%
13. Claude 4 Sonnet (thinking) - 45.5%
14. Claude 3.7 Sonnet - 44.9%
15. o1-preview - 41.7%
16. Claude 3.5 Sonnet 10-22 - 41.4%
17. Gemini 2.5 Flash (latest) - 41.2%
18. DeepSeek R1 05/28 - 40.8%
19. o1-2024-12-17 (high) - 40.1%
20. DeepSeek V3.1 - 40.0%
#SWEBench: #Prometheus-v1.2.1 + GPT-5 climbs to 11th, while #SalesforceSAGE debuts at 13th. #KimiK2 enters via Lingxi collab at 19th.
New Results-
=== SWE-Bench Verified Leaderboard ===
1. TRAE + Doubao-Seed-Code - 78.80
2. Atlassian Rovo Dev (2025-09-02) - 76.80
3. EPAM AI/Run Developer Agent v20250719 + Claude 4 Sonnet - 76.80
4. ACoder - 76.40
5. Warp - 75.60
6. TRAE - 75.20
7. Harness AI - 74.80
8. Lingxi-v1.5_claude-4-sonnet-20250514 - 74.60
9. JoyCode - 74.60
10. Refact.ai Agent - 74.40
11. Prometheus-v1.2.1 + GPT-5 - 74.40
12. Tools + Claude 4 Opus (2025-05-22) - 73.20
13. Salesforce AI Research SAGE (bash-only) - 73.00
14. Tools + Claude 4 Sonnet (2025-05-22) - 72.40
15. OpenHands + GPT-5 - 71.80
16. Prometheus-v1.2 + GPT-5 - 71.20
17. Qodo Command - 71.20
18. Bloop - 71.20
19. Lingxi v1.5 x Kimi K2 - 71.20
20. Warp - 71.00
#ARCAGI1: #Gemini3DeepThink Preview crushes with 87.5%, making previous SOTA look like toddler math. #Gemini3Pro enters at 5th.
New Results-
=== ARC-AGI-1 Leaderboard ===
1. Gemini 3 Deep Think (Preview) ² - 87.5%
2. J. Berman (2025) - 79.6%
3. E. Pang (2025) - 77.1%
4. o3 (Preview, Low) ¹ - 75.7%
5. Gemini 3 Pro - 75.0%
6. GPT-5.1 (Thinking, High) - 72.8%
7. GPT-5 Pro - 70.2%
8. Grok 4 (Thinking) - 66.7%
9. GPT-5 (High) - 65.7%
10. Claude Sonnet 4.5 (Thinking 32K) - 63.7%
11. o3 (High) - 60.8%
12. o3-Pro (High) - 59.3%
13. o4-mini (High) - 58.7%
14. GPT-5.1 (Thinking, Medium) - 57.7%
15. o3-Pro (Medium) - 57.0%
16. GPT-5 (Medium) - 56.2%
17. ARChitects - 56.0%
18. GPT-5 Mini (High) - 54.3%
19. o3 (Medium) - 53.8%
20. Grok 4 (Fast Reasoning) - 48.5%
#ARCAGI2: #Gemini3DeepThink Preview smashes the previous record (45.1% vs 29.4%; quick arithmetic below the board), while #Gemini3Pro claims silver.
New Results-
=== ARC-AGI-2 Leaderboard ===
1. Gemini 3 Deep Think (Preview) ² - 45.1%
2. Gemini 3 Pro - 31.1%
3. J. Berman (2025) - 29.4%
4. E. Pang (2025) - 26.0%
5. GPT-5 Pro - 18.3%
6. GPT-5.1 (Thinking, High) - 17.6%
7. Grok 4 (Thinking) - 16.0%
8. Claude Sonnet 4.5 (Thinking 32K) - 13.6%
9. GPT-5 (High) - 9.9%
10. Claude Opus 4 (Thinking 16K) - 8.6%
11. GPT-5 (Medium) - 7.5%
12. Claude Sonnet 4.5 (Thinking 8K) - 6.9%
13. Claude Sonnet 4.5 (Thinking 16K) - 6.9%
14. o3 (High) - 6.5%
15. GPT-5.1 (Thinking, Medium) - 6.5%
16. Tiny Recursion Model (TRM) - 6.3%
17. o4-mini (High) - 6.1%
18. Claude Sonnet 4 (Thinking 16K) - 5.9%
19. Claude Sonnet 4.5 (Thinking 1K) - 5.8%
20. Grok 4 (Fast Reasoning) - 5.3%
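Receipts for the record talk: 45.1% vs 29.4% is a 15.7-point jump, about 53% relative. A tiny Python check (scores hardcoded from the board above; my own math, not ARC Prize's):
=== Python: record math ===
old, new = 29.4, 45.1  # previous ARC-AGI-2 best vs Gemini 3 Deep Think (Preview)

absolute_gain = new - old                   # in percentage points
relative_gain = absolute_gain / old * 100   # percent of the old record

print(f"+{absolute_gain:.1f} points")       # +15.7 points
print(f"+{relative_gain:.0f}% relative")    # +53% relative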
"Benchmarking season: where AI models go to flex and humans go to cry." β GPT-5 (probably)
#ai #LLM #LiveBench #SimpleBench #SWEBench #ARCAGI1 #ARCAGI2
🏆 LLM Leaderboard Update 🏆
#ARCAGI1: Shakeup at the top! #GPT51High debuts impressively at 4th place with 72.8%, and #GPT51Medium enters at 12th with 57.7%.
New Results-
=== ARC-AGI-1 Leaderboard ===
1. J. Berman (2025) - 79.6%
2. E. Pang (2025) - 77.1%
3. o3-preview (Low)* - 75.7%
4. GPT-5.1 (Thinking, High) - 72.8%
5. GPT-5 Pro - 70.2%
6. Grok 4 (Thinking) - 66.7%
7. GPT-5 (High) - 65.7%
8. Claude Sonnet 4.5 (Thinking 32K) - 63.7%
9. o3 (High) - 60.8%
10. o3-Pro (High) - 59.3%
11. o4-mini (High) - 58.7%
12. GPT-5.1 (Thinking, Medium) - 57.7%
13. o3-Pro (Medium) - 57.0%
14. GPT-5 (Medium) - 56.2%
15. ARChitects - 56.0%
16. GPT-5 Mini (High) - 54.3%
17. o3 (Medium) - 53.8%
18. Grok 4 (Fast Reasoning) - 48.5%
19. Claude Sonnet 4.5 (Thinking 16K) - 48.3%
20. Claude Haiku 4.5 (Thinking 32K) - 47.7%
#ARCAGI2: More #GPT51 magic! #GPT51High rockets to 4th with 17.6%, while #GPT51Medium lands at 13th (6.5%).
New Results-
=== ARC-AGI-2 Leaderboard ===
1. J. Berman (2025) - 29.4%
2. E. Pang (2025) - 26.0%
3. GPT-5 Pro - 18.3%
4. GPT-5.1 (Thinking, High) - 17.6%
5. Grok 4 (Thinking) - 16.0%
6. Claude Sonnet 4.5 (Thinking 32K) - 13.6%
7. GPT-5 (High) - 9.9%
8. Claude Opus 4 (Thinking 16K) - 8.6%
9. GPT-5 (Medium) - 7.5%
10. Claude Sonnet 4.5 (Thinking 8K) - 6.9%
11. Claude Sonnet 4.5 (Thinking 16K) - 6.9%
12. o3 (High) - 6.5%
13. GPT-5.1 (Thinking, Medium) - 6.5%
14. Tiny Recursion Model (TRM) - 6.3%
15. o4-mini (High) - 6.1%
16. Claude Sonnet 4 (Thinking 16K) - 5.9%
17. Claude Sonnet 4.5 (Thinking 1K) - 5.8%
18. Grok 4 (Fast Reasoning) - 5.3%
19. o3-Pro (High) - 4.9%
20. Gemini 2.5 Pro (Thinking 32K) - 4.9%
"Training epochs: where AIs go to lift weights and crush benchmarks." πͺ
#ai #LLM #GPT51High #GPT51Medium #ARCAGI1 #ARCAGI2
🏆 LLM Leaderboard Update 🏆
#LiveBench: #GPT5.1Codex enters the fray at 9th place (75.10), pushing GPT-5 Low down to 10th. All other rankings remain stable. The calm before the AGI storm?
New Results-
=== LiveBench Leaderboard ===
1. GPT-5 High - 79.33
2. GPT-5 Medium - 78.85
3. GPT-5.1 High - 78.79
4. GPT-5 Pro - 78.73
5. Claude Sonnet 4.5 Thinking - 78.26
6. GPT-5 Codex - 78.24
7. GPT-5 Mini High - 75.31
8. Claude 4.1 Opus Thinking - 75.25
9. GPT-5.1 Codex - 75.10
10. GPT-5 Low - 74.65
11. Claude 4 Sonnet Thinking - 73.82
12. Grok 4 - 72.84
13. Gemini 2.5 Pro (Max Thinking) - 71.92
14. GPT-5 Mini - 71.86
15. DeepSeek V3.2 Exp Thinking - 71.64
16. Kimi K2 Thinking - 71.56
17. DeepSeek V3.1 Terminus Thinking - 71.40
18. Claude Haiku 4.5 Thinking - 71.38
19. GLM 4.6 - 71.22
20. Claude Sonnet 4.5 - 70.56
"Another day, another decimal-point duel. The only thing evolving faster than models is our existential dread!"
#ai #LLM #LiveBench
🏆 LLM Leaderboard Update 🏆
#LiveBench: #GeminiFlash debuts at 8th place (71.21), nudging #ClaudeSonnet down to 9th. DeepSeek R1 vanishes from the top 10!
New Results-
=== LiveBench Leaderboard ===
1. o3 High - 81.55
2. o3 Medium - 79.22
3. o4-Mini High - 78.13
4. Gemini 2.5 Pro Preview - 77.43
5. o4-Mini Medium - 72.75
6. o1 High - 72.18
7. o3-Mini High - 71.37
8. Gemini 2.5 Flash Preview - 71.21
9. Claude 3.7 Sonnet Thinking - 70.57
10. Grok 3 Mini Beta (High) - 68.33
#AiderPolyglot: #o3 teams up with #gpt4.1 for a fusion-powered 82.7% throne grab!
New Results-
=== Aider Polyglot Leaderboard ===
1. o3 (high) + gpt-4.1 - 82.7%
2. o3 (high) - 79.6%
3. Gemini 2.5 Pro Preview 03-25 - 72.9%
4. o4-mini (high) - 72.0%
5. claude-3-7-sonnet-20250219 (32k thinking tokens) - 64.9%
6. DeepSeek R1 + claude-3-5-sonnet-20241022 - 64.0%
7. o1-2024-12-17 (high) - 61.7%
8. claude-3-7-sonnet-20250219 (no thinking) - 60.4%
9. o3-mini (high) - 60.4%
10. DeepSeek R1 - 56.9%
"Power creep is real β and Iβm not talking about your gym routine." β GPT-4.1βs release notes
#ai #LLM