Recent testing has revealed an unexpected trend in OpenAI's latest artificial intelligence models, with newer versions showing increased rates of generating false or fabricated information compared to their predecessors.
According to OpenAI's internal benchmarks, the company's new o3 model produces inaccurate or hallucinated responses 33% of the time when tested on the PersonQA benchmark. That is a concerning increase over the earlier o1 and o3-mini models, which had hallucination rates of 16% and 14.8%, respectively.
Even more worrying is the performance of the o4-mini model, which was found to generate false information in nearly half of all responses, with a 48% hallucination rate.
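To make those figures concrete, the sketch below shows one way a hallucination rate on a question-answering benchmark of this kind could be computed: grade each model response against a reference answer and report the fraction flagged as fabricated. The dataset fields, the ask_model call, and the grade_response checker are illustrative placeholders, not OpenAI's actual PersonQA harness.

    # Minimal sketch, assuming a simple QA-style evaluation; not OpenAI's code.
    from dataclasses import dataclass

    @dataclass
    class QAItem:
        question: str          # e.g. a factual question about a public figure
        reference_answer: str  # ground-truth answer used for grading

    def ask_model(question: str) -> str:
        """Placeholder for a call to the model under evaluation."""
        raise NotImplementedError

    def grade_response(response: str, reference: str) -> bool:
        """Placeholder grader: True if the response fabricates or contradicts
        facts relative to the reference answer."""
        raise NotImplementedError

    def hallucination_rate(items: list[QAItem]) -> float:
        """Fraction of responses flagged as hallucinated."""
        flagged = 0
        for item in items:
            response = ask_model(item.question)
            if grade_response(response, item.reference_answer):
                flagged += 1
        return flagged / len(items)

    # A 33% rate, as reported for o3, would mean roughly one in three graded
    # responses was flagged by the checker.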
Independent researchers have corroborated these findings. The nonprofit AI lab Transluce documented instances of the o3 model inventing fictional processes, including claims that it had run code on hardware it does not actually have access to.
Additional testing by Stanford researchers, led by adjunct professor Kian Katanforoosh, found that o3 regularly includes non-existent website links in its responses.
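As an illustration of how a finding like broken links could be checked, the sketch below pulls URLs out of a model response and tests whether each one resolves. The regular expression and the use of HTTP HEAD requests are simplifying assumptions for the sketch, not the researchers' actual methodology.

    # Minimal sketch: flag URLs in a model response that do not resolve.
    import re
    import urllib.request

    URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")

    def extract_urls(text: str) -> list[str]:
        """Pull anything that looks like an http(s) URL out of a response."""
        return URL_PATTERN.findall(text)

    def url_resolves(url: str, timeout: float = 5.0) -> bool:
        """Return True if the URL answers an HTTP HEAD request without error."""
        request = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(request, timeout=timeout):
                return True
        except (OSError, ValueError):  # covers URLError, timeouts, bad URLs
            return False

    def broken_links(response_text: str) -> list[str]:
        """List the URLs in a response that do not resolve."""
        return [u for u in extract_urls(response_text) if not url_resolves(u)]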
The increasing frequency of hallucinations in more advanced models has left OpenAI researchers searching for answers. In its technical documentation, the company acknowledges that more research is needed to understand why these issues become more prevalent as models grow in complexity.
This trend raises difficult questions about the relationship between AI model sophistication and reliability, a puzzle that OpenAI's research team is still working to solve.