The latest AI language models, such as OpenAI's o3, are showing higher error rates than their predecessors, according to multiple studies reported by The New York Times.
Similar issues appear in models from other companies, including Google and DeepSeek. Even as their mathematical capabilities improve, their factual error rates on ordinary queries are rising.
One of the most common issues is "hallucination," where a model fabricates information and presents it as fact without any source. According to Amr Awadallah, CEO of Vectara, these hallucinations are likely to persist.
One such hallucination occurred with the Cursor support bot, which falsely claimed that the tool could only be used on a single computer, triggering a wave of complaints. It turned out the company had made no such change; the bot had simply invented the policy.
In OpenAI's internal testing on questions about famous people, the o3 model hallucinated 33% of the time, double the rate of o1. The newer o4-mini model fared even worse, with 48% errors.
On general knowledge questions, the hallucination rates for o3 and o4-mini were higher still, at 51% and 79% respectively, compared to 44% for the older o1 model. OpenAI acknowledges that further research is needed to understand the cause of these errors.
Independent tests conducted by other companies also indicate that hallucinations occur in Google's and DeepSeek's reasoning models. Vectara's research found that such models fabricate facts at least 3% of the time, and in some cases as often as 27%. Despite vendors' efforts to fix these errors, hallucination rates have fallen by only 1-2% over the past year.