Benchmarked 4 new models. DeepSeek R1's score improved. All of these are below average, so p(doom) probably increased!

Coming soon: Kimi K2. They say it is very good at coding, but my leaderboard is about being beneficial to humans. So we will see!

Full leaderboard: https://sheet.zoho.com/sheet/open/mz41j09cc640a29ba47729fed784a263c1d08

More info: https://huggingface.co/blog/etemiz/aha-leaderboard
