OpenAI has called for new ways to evaluate artificial intelligence, after fresh research confirmed that hallucinations—convincing but false answers—remain a major weakness in large language models like GPT-5.
The study showed how chatbots can confidently mislead users. When asked about the academic work and date of birth of co-author Adam Tauman Kalai, one AI system gave three different answers to each question, all of them inaccurate.
Researchers explained that language models are trained to predict the next word in a sequence, not to verify facts. That objective serves grammar and structure well, but it leaves models exposed when a question hinges on a rare fact.
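To see why that objective favours a confident guess over an admission of ignorance, consider a toy sketch in Python (the probabilities are invented for illustration and do not come from the paper): the model ranks candidate continuations purely by how likely they are to follow the prompt, so a plausible-sounding year always outranks "I don't know", whether or not it is true.

```python
# Toy illustration (hypothetical probabilities, not from the paper):
# a language model scores candidate next words by how likely they are
# to follow the prompt, with no notion of which one is factually true.
next_word_probs = {
    "1990": 0.34,          # plausible-sounding birth year
    "1985": 0.33,          # equally plausible, equally unverified
    "1978": 0.31,
    "I don't know": 0.02,  # rarely the statistically likely continuation
}

# Standard decoding picks the most probable continuation; fluency wins,
# and factual accuracy never enters the calculation.
best_guess = max(next_word_probs, key=next_word_probs.get)
print(best_guess)  # -> "1990": confident-sounding, quite possibly wrong
```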
According to the paper, the real issue lies in evaluation methods. Current accuracy-based tests reward correct answers but unintentionally encourage models to guess rather than admit uncertainty. The researchers liken this to multiple-choice exams, where guessing sometimes scores points while leaving questions blank never does.
To fix this, the authors propose a shift to evaluations that penalise “confident errors” more than expressions of doubt, and even give partial credit for acknowledging uncertainty. This, they argue, would incentivise AI to be cautious rather than misleading.
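The arithmetic behind that proposal is easy to sketch. The Python snippet below uses a hypothetical scoring rule (an abstention credit of 0.3 and a penalty of 1 for a wrong answer are assumptions, not the paper's exact scheme) to compare plain accuracy grading with an uncertainty-aware alternative: once confident errors cost points, a model that is only 25 per cent sure of a rare fact scores better by saying "I don't know" than by guessing.

```python
# Minimal sketch of the scoring idea. The abstention credit (0.3) and the
# wrong-answer penalty (1.0) are illustrative assumptions, not values from
# the paper.

def accuracy_score(answer: str, correct: str) -> float:
    """Common grading today: 1 for a correct answer, 0 for anything else."""
    return 1.0 if answer == correct else 0.0

def uncertainty_aware_score(answer: str, correct: str) -> float:
    """Proposed-style grading: partial credit for admitting uncertainty,
    a penalty for a confident wrong answer."""
    if answer == "I don't know":
        return 0.3
    return 1.0 if answer == correct else -1.0

def expected_guess_score(score_fn, p_correct: float) -> float:
    """Expected score if the model guesses, given its chance of being right."""
    return p_correct * score_fn("right", "right") + (1 - p_correct) * score_fn("wrong", "right")

p = 0.25  # the model is only 25% sure of a rare fact

print("accuracy-only:      guess =", expected_guess_score(accuracy_score, p),
      " abstain =", accuracy_score("I don't know", "right"))
print("uncertainty-aware:  guess =", expected_guess_score(uncertainty_aware_score, p),
      " abstain =", uncertainty_aware_score("I don't know", "right"))

# accuracy-only:      guess = 0.25  abstain = 0.0
# uncertainty-aware:  guess = -0.5  abstain = 0.3
```

On the accuracy-only test a blind guess always beats leaving the question blank, which is exactly the incentive the researchers say current benchmarks create; with a penalty for confident errors in place, the cautious answer wins.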
The findings underline that while AI has advanced rapidly, reliable systems must balance fluency with honesty. For everyday users, this could mean chatbots becoming more transparent—choosing to say “I don’t know” instead of delivering false but polished answers.