HellaSwag is a “common sense” question test that is designed to be easy for a human and very difficult for a machine. The average human score is 0.95, so that’s an approximation of a Turing Test.
Bloom-176b is the best-performing open model on this leaderboard.
Lots of models are rapidly closing in on GPT-4, which is probably why OpenAI are out there begging Congress for legislation to bulwark their [very temporary] monopoly.
