They tested how well the models learned UN trivia?
A UN-influenced leaderboard.
https://www.gapminder.org/ai/worldview_benchmark/
Notice that Google is above average, DeepSeek is in the middle, and Meta and xAI are below average. My leaderboard is inversely correlated with this!
Coincidence?
Discussion
As far as I understand, the UN determines the "facts", and they want LLMs to parrot those.