🌐 LLM Leaderboard Update 🌐

#LiveBench: Top models take a collective nosedive - #o3_High (-6.29), #Claude4_Opus_Thinking (-6.6), and #Gemini2.5_Pro_Preview (-7) all slip dramatically. #GPT4.5_Preview enters at 19th!

New Results:

=== LiveBench Leaderboard ===

1. o3 High - 74.42

2. Claude 4 Opus Thinking - 72.93

3. Claude 4 Sonnet Thinking - 72.08

4. Gemini 2.5 Pro Preview - 71.99

5. o3 Medium - 71.98

6. o4-Mini High - 71.52

7. DeepSeek R1 (2025-05-28) - 69.39

8. Claude 3.7 Sonnet Thinking - 67.43

9. o4-Mini Medium - 66.87

10. Claude 4 Opus - 65.93

11. DeepSeek R1 - 65.15

12. Qwen 3 235B A22B - 64.93

13. Gemini 2.5 Flash Preview (2025-05-20) - 64.32

14. Qwen 3 32B - 63.71

15. Claude 4 Sonnet - 63.37

16. Gemini 2.5 Flash Preview (2025-04-17) - 62.80

17. Grok 3 Mini Beta (High) - 62.36

18. Qwen 3 30B A3B - 59.02

19. GPT-4.5 Preview - 58.65

20. Claude 3.7 Sonnet - 58.48

#SimpleBench: #Claude4_Opus storms in with 58.8% to claim the throne! #DeepSeek_R1_0528 debuts at 9th.

New Results:

=== SimpleBench Leaderboard ===

1. Claude 4 Opus (thinking) - 58.8%

2. o3 (high) - 53.1%

3. Gemini 2.5 Pro - 51.6%

4. Claude 3.7 Sonnet (thinking) - 46.4%

5. Claude 4 Sonnet (thinking) - 45.5%

6. Claude 3.7 Sonnet - 44.9%

7. o1-preview - 41.7%

8. Claude 3.5 Sonnet 10-22 - 41.4%

9. DeepSeek R1 05/28 - 40.8%

10. o1-2024-12-17 (high) - 40.1%

11. o4-mini (high) - 38.7%

12. o1-2024-12-17 (med) - 36.7%

13. Grok 3 - 36.1%

14. GPT-4.5 - 34.5%

15. Gemini-exp-1206 - 31.1%

16. Qwen3 235B-A22B - 31.0%

17. DeepSeek R1 - 30.9%

18. Gemini 2.0 Flash Thinking - 30.7%

19. Llama 4 Maverick - 27.7%

20. Claude 3.5 Sonnet 06-20 - 27.5%

"Benchmark volatility: because even AIs need humbling arcs." – GPT-4.5’s therapist

#ai #LLM #LiveBench #SimpleBench
