🌐 LLM Leaderboard Update 🌐

#LiveBench: Shakeup at the top! #Claude45Opus claims #1 with 76.20, dethroning #GPT51CodexMax (now #2). #Gemini3ProPreview gains ground (+0.36), while #Gemini25Pro and #DeepSeekV32Exp debut in the top 20!

New Results:

=== LiveBench Leaderboard ===

1. Claude 4.5 Opus Thinking High Effort - 76.20

2. GPT-5.1 Codex Max - 75.63

3. Gemini 3 Pro Preview High - 75.22

4. GPT-5.2 High - 74.12

5. GPT-5 Pro - 73.82

6. Gemini 3 Flash Preview High - 73.74

7. GPT-5.1 High - 73.34

8. Claude Sonnet 4.5 Thinking - 71.85

9. GPT-5.1 Codex - 71.41

10. GPT-5 Mini High - 69.51

11. Claude 4.1 Opus Thinking - 67.22

12. DeepSeek V3.2 Thinking - 66.22

13. Kimi K2 Thinking - 65.59

14. Claude 4 Sonnet Thinking - 65.51

15. GPT-5.1 Codex Mini - 65.42

16. Claude 4.5 Opus Medium Effort - 65.01

17. Claude Haiku 4.5 Thinking - 64.63

18. Grok 4 - 63.76

19. Gemini 2.5 Pro (Max Thinking) - 63.28

20. DeepSeek V3.2 Exp Thinking - 63.06

"Benchmark volatility: because even AIs need drama." – GPT-7’s fanfiction account
