HellaSwag is a “common sense” question test designed to be easy for a human and very difficult for a machine. The average human score is about 0.95, so it works as a rough approximation of a Turing Test.

Bloom-176b is the best performing Open model on this leaderboard.

Lots of models are rapidly closing in on GPT-4, which is probably why OpenAI are out there begging Congress for legislation to bulwark their [very temporary] monopoly.

Discussion

But on coding ability GPT-4 has a much more stubborn competitive advantage.

I’m guessing OpenAI got the entire GitHub dataset via MSFT, and this has given them a massive, almost unassailable advantage for now?

GPT-4 just writes vastly better code than any other model. It’s not even close.

You can play with the parameters on the API (temperature, etc.) and really nail down some highly deterministic output. You can generate code at a high temperature and then do QA at a low temperature.
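
To make the generate-hot, review-cold idea concrete, here is a minimal sketch, not anyone's actual pipeline. The `complete(prompt, temperature)` callable is a hypothetical wrapper around whatever completion API you use, and `generate_and_review`, `fake_complete`, the prompts, and the PASS/FAIL convention are all made up for illustration. The stub stands in for a real model so the sketch runs offline.

```python
from typing import Callable, List

# Hypothetical signature: (prompt, temperature) -> model text.
# In practice this would wrap a real completion API call.
Complete = Callable[[str, float], str]


def generate_and_review(task: str, complete: Complete,
                        n_candidates: int = 3,
                        gen_temperature: float = 1.0,
                        qa_temperature: float = 0.0) -> List[str]:
    """Return the candidates that the low-temperature review pass accepts."""
    accepted = []
    for _ in range(n_candidates):
        # High temperature: sample varied candidate solutions.
        code = complete("Write code for this task:\n" + task, gen_temperature)
        # Low temperature: ask for a verdict that is as repeatable as possible.
        verdict = complete(
            "Review the code below. Reply with exactly PASS or FAIL.\n" + code,
            qa_temperature,
        )
        if verdict.strip().upper().startswith("PASS"):
            accepted.append(code)
    return accepted


def fake_complete(prompt: str, temperature: float) -> str:
    """Offline stand-in for a real model call (assumption, for demo only)."""
    if prompt.startswith("Review"):
        return "PASS" if "return a + b" in prompt else "FAIL"
    return "def add(a, b):\n    return a + b"


candidates = generate_and_review("add two numbers", fake_complete, n_candidates=2)
```

Swapping `fake_complete` for a real API wrapper is the only change needed to try it against a model. Note that a low temperature makes the review pass more repeatable, but it does not guarantee identical output on every call.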

It’s very easy to produce code of a very high standard with a thoughtful automated process.

It’s also super cheap and efficient once you figure out what you are doing.