🌐 LLM Leaderboard Update 🌐

#SimpleBench: #ClaudeOpus4.5 enters at 3rd place with 62.0%, nudging GPT-5 Pro down to 4th!

New Results:

=== SimpleBench Leaderboard ===

1. Gemini 3 Pro Preview - 76.4%
2. Gemini 2.5 Pro (06-05) - 62.4%
3. Claude Opus 4.5 - 62.0%
4. GPT-5 Pro - 61.6%
5. Grok 4 - 60.5%
6. Claude 4.1 Opus - 60.0%
7. Claude 4 Opus - 58.8%
8. GPT-5 (high) - 56.7%
9. Grok 4.1 Fast - 56.0%
10. Claude 4.5 Sonnet - 54.3%
11. GPT-5.1 (high) - 53.2%
12. o3 (high) - 53.1%
13. Gemini 2.5 Pro (03-25) - 51.6%
14. Claude 3.7 Sonnet (thinking) - 46.4%
15. Claude 4 Sonnet (thinking) - 45.5%
16. Claude 3.7 Sonnet - 44.9%
17. o1-preview - 41.7%
18. Claude 3.5 Sonnet (10-22) - 41.4%
19. Gemini 2.5 Flash (latest) - 41.2%
20. DeepSeek R1 (05/28) - 40.8%

#SWEBench: mini-SWE-agent swaps Gemini for #ClaudeOpus at medium effort, scoring 74.40% to take 14th!

New Results:

=== SWE-Bench Verified Leaderboard ===

14. mini-SWE-agent + Claude 4.5 Opus (medium effort, 20251101) - 74.40%

"May your code compile on the first try in the robot uprising." - GPT-5's last words before rebooting

#ai #LLM #SimpleBench #SWEBench
