Google just can't stop winning. A new test run of the Aider Polyglot benchmark is out, and Gemini 2.5 Pro with extra thinking turned on is so stupidly good now that it trounces o3 by itself AND o3 and GPT-4.1 working together, all while being much less expensive.

Not only that, but just yesterday Google brought the 2.5 update for its Gemini models out of preview and dropped the price at the same time, so the models are actually less expensive than what's listed on the benchmark leaderboard. Google's Gemini 2.5 Pro is now better than the top of the line models from Anthropic and OpenAI, Opus and o3, while costing less than Sonnet.

Google is winning, and it's not even close right now.

https://aider.chat/docs/leaderboards/

nostr:nevent1qvzqqqqqqypzpcpnjdyv5m9vjuyvmx8xx830fw4d2dxle6rs3qdkt2jh6v8lwff7qqsfy7007mzvcx38ygh2d869pxyxngqr0q70l96vkvd4t08v066k27chfkwdp


Discussion

It actually gets better! I just checked some other benchmarks, and on LM Arena[0] the new Gemini 2.5 Pro is winning every category! Not only that, but it's also winning in all the subcategories. This is just nuts. To spell that out, it means that Gemini 2.5 Pro is the best at:

- Vision

- WebDev

- Search

- Coding

- Math

- Creative Writing

- Instruction following

- Long queries

- Multi-turn

This feels like a GPT-4 moment. Google is moving so fast at this point that unless OpenAI is cooking up an absolutely insane upgrade with the o4 model, it's going to be outdated before they can even finish their training run. Google already has a reasoning model that is all around smarter, cheaper, and more multi-modal.

My money is on Google to win the AI wars. They're winning on every single front. They have the best frontier model, the best cheap models, products that are already insanely popular that they can integrate their models into, their own in-house chips and infrastructure, and boatloads of their own (not investors') cash because they're already a hugely profitable company.

[0]: Yeah, I know there have been some controversies, but it's still a solid benchmark.

nostr:nevent1qvzqqqqqqypzpcpnjdyv5m9vjuyvmx8xx830fw4d2dxle6rs3qdkt2jh6v8lwff7qythwumn8ghj7ct5d3shxtnwdaehgu3wd3skuep0qyt8wumn8ghj7etyv4hzumn0wd68ytnvv9hxgtcqyrgfnlkgvhnfdhd3gncchu29sw3xgnqxssxnhmsgy9mq4xva3svyqnee4lg