"A new OpenAI study using their in-house SimpleQA benchmark shows that even the most advanced AI language models fail more often than they succeed when answering factual questions.
The SimpleQA test contains 4,326 questions across science, politics, and art, with each question designed to have one clear correct answer. Two independent reviewers verified answer accuracy.
The thematic distribution of the SimpleQA question set shows broad coverage, which should allow a comprehensive evaluation of AI models. | Image: Wei et al.
OpenAI's best model, o1-preview, achieved only a 42.7 percent success rate. GPT-4o followed with 38.2 percent correct answers, while the smaller GPT-4o-mini managed just 8.6 percent accuracy.
Anthropic's Claude models performed even worse. Their top model, Claude-3.5-sonnet, got 28.9 percent right and 36.1 percent wrong. However, the smaller Claude models more often declined to answer when uncertain, a desirable response that shows they recognize the limits of their knowledge."
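For context on how such percentages come about: here is a minimal sketch of tallying graded benchmark responses into the three rates mentioned above (correct, incorrect, and declined to answer). The grading labels and example data are illustrative assumptions, not OpenAI's actual SimpleQA grader or results.

```python
from collections import Counter

# Hypothetical example data: each model response graded as one of three
# categories implied by the article. Not actual SimpleQA output.
grades = ["correct", "incorrect", "not_attempted", "correct", "incorrect"]

counts = Counter(grades)
total = len(grades)

correct_rate = counts["correct"] / total          # e.g. the 28.9 percent figure
incorrect_rate = counts["incorrect"] / total      # e.g. the 36.1 percent figure
declined_rate = counts["not_attempted"] / total   # questions the model skipped

# Accuracy restricted to attempted questions rewards declining over guessing,
# which is why refusals are worth tracking separately.
attempted = counts["correct"] + counts["incorrect"]
accuracy_when_attempted = counts["correct"] / attempted if attempted else 0.0

print(f"correct: {correct_rate:.1%}, incorrect: {incorrect_rate:.1%}, "
      f"declined: {declined_rate:.1%}, "
      f"accuracy when attempted: {accuracy_when_attempted:.1%}")
```

Counting refusals separately is what makes the "declined to answer" behavior of the smaller Claude models visible at all; a single accuracy number would hide it.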
https://the-decoder.com/gpt-4o-and-co-get-it-wrong-more-often-than-right-says-openai-study/
#AI #GenerativeAI #OpenAI #SimpleQA #Hallucinations #LLMs #Chatbots