OpenAI’s o3 AI model scores lower on a benchmark than the company initially implied

A discrepancy between first- and third-party benchmark results for OpenAI’s o3 AI model is raising questions about the company’s transparency and model testing practices.
When OpenAI unveiled o3 in December, the company claimed the model could answer just over a quarter of the questions on FrontierMath, a challenging set of math problems. That score blew away the competition: the next-best model managed to answer only around 2% of FrontierMath problems correctly.
“Today, all offerings out there have less than 2% [on FrontierMath],” Mark Chen, chief research officer at OpenAI, said during a livestream. “We’re seeing [internally], with o3 in aggressive test-time compute settings, we’re able to get over 25%.”
As it turns out, that figure was likely an upper bound, achieved by a version of o3 with more computing power behind it than the model OpenAI publicly launched last week.
Epoch AI, the research institute behind FrontierMath, released the results of its independent benchmark tests of o3 on Friday. Epoch found that o3 scored around 10%, well below OpenAI’s highest claimed score.
OpenAI has released o3, their highly anticipated reasoning model, along with o4-mini, a smaller and cheaper model that succeeds o3-mini.
We evaluated the new models on our suite of math and science benchmarks. Results in thread! pic.twitter.com/5gbtzkey1b
Epoch AI (@EpochAIResearch) April 18, 2025
That doesn’t mean OpenAI lied, per se. The benchmark results the company published in December show a lower-bound score that matches the score Epoch observed. Epoch also noted that its testing setup likely differs from OpenAI’s, and that it used an updated release of FrontierMath for its evaluations.
“The difference between our results and OpenAI’s might be due to OpenAI using more test-time [computing], or because those results were run on a different subset of FrontierMath (the 180 problems in frontiermath-2024-11-26 versus the 290 problems in frontiermath-2025-02-28-private),” Epoch wrote.
According to a post on X from the ARC Prize Foundation, an organization that tested a pre-release version of o3, the public o3 model “is a different model […] tuned for chat/product use,” corroborating Epoch’s report.
“All released o3 compute tiers are smaller than the version we [benchmarked],” wrote ARC Prize. Generally speaking, bigger compute tiers can be expected to achieve better benchmark scores.
Retesting o3 on ARC-AGI-1 will take a day or two. Since today’s release is a materially different system, we are relabeling our previously reported results as “preview”:
o3-preview (low): 75.7%, $200/task
o3-preview (high): 87.5%, $34.4k/task
Above o1 pro is pricing…
Mike Knoop (@mikeknoop) April 16, 2025
Granted, OpenAI’s own Wenda Zhou, a member of the technical staff, said during a livestream last week that o3 in production is “more optimized for real-world use cases” and faster than the version demoed in December. As a result, benchmark “disparities” may appear.
“[W]e’ve done [optimizations] to make the [model] more cost efficient [and] we still hope – we still think – this is a much better model […] You won’t have to wait as long when you’re asking for an answer, which is a real thing with these [types of] models,” Zhou said.
Granted, the fact that the public release of o3 falls short of OpenAI’s testing promises is somewhat of a moot point, since the company’s o3-mini-high and o4-mini models outperform o3 on FrontierMath, and OpenAI plans to debut a more powerful o3 variant, o3-pro, in the coming weeks.
It is, however, another reminder that AI benchmarks are best not taken at face value, particularly when the source is a company with services to sell.
Benchmarking “controversies” are becoming a common occurrence in the AI industry as vendors race to capture headlines and mindshare with new models.
In January, Epoch was criticized for waiting to disclose its funding from OpenAI until after the company announced o3. Many academics who contributed to FrontierMath weren’t informed of OpenAI’s involvement until it was made public.
More recently, Elon Musk’s xAI was accused of publishing misleading benchmark charts for its latest AI model, Grok 3. Just this month, Meta admitted to touting benchmark scores for a version of a model that differed from the one the company made available to developers.
Updated 4:21 p.m. Pacific: Added comments from Wenda Zhou, an OpenAI technical staff member, from a livestream last week.