OpenAI's New Model o3: These 34 Questions Stump Me
Explore OpenAI's groundbreaking o3 model, the first AI to surpass ARC-AGI benchmarks, tackling complex tasks and revealing AGI challenges.
A near miss: still 12.5 percentage points short of a perfect score.
A few days ago, OpenAI wrapped up the last of its 12 consecutive updates by introducing, as anticipated, the new reasoning models o3 and o3-mini.
Starting with o1, OpenAI's proposed scaling laws for reasoning have brought new hope for achieving AGI. The benchmark used to evaluate o3’s reasoning capabilities is ARC-AGI, which has been around for five years and had remained unsolved until now.
The new model, o3, is the first AI model to break through on the ARC-AGI benchmark: it scored 75.7% in low-computation mode, and with more computational resources and longer processing time it reached 87.5%.
In comparison, the o1 model previously achieved an accuracy of only 25% to 32% on the same benchmark.
The ARC-AGI benchmark requires an AI to infer a pattern from a handful of paired "input-output" examples and then predict the output for a new input.
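To make the task format concrete, here is a minimal Python sketch of a toy puzzle in the ARC-AGI style. Everything in it is an illustrative assumption: the tiny grids, the hidden "mirror each row" rule, and the brute-force search over four hand-written candidate rules are made up for this example. It is not how o3 works, and real ARC-AGI tasks are far more varied than anything a fixed list of rules can cover.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]  # each cell is a small integer standing for a color

# Hypothetical toy task: the hidden rule is "mirror each row left to right".
train_pairs: List[Tuple[Grid, Grid]] = [
    ([[1, 0, 0], [0, 2, 0]], [[0, 0, 1], [0, 2, 0]]),
    ([[3, 3, 0], [0, 0, 4]], [[0, 3, 3], [4, 0, 0]]),
]
test_input: Grid = [[5, 0, 0], [0, 0, 6]]


def identity(g: Grid) -> Grid:
    return [row[:] for row in g]


def flip_horizontal(g: Grid) -> Grid:
    return [row[::-1] for row in g]


def flip_vertical(g: Grid) -> Grid:
    return [row[:] for row in g[::-1]]


def transpose(g: Grid) -> Grid:
    return [list(col) for col in zip(*g)]


# Illustrative candidate rules only; a real solver cannot enumerate rules like this.
CANDIDATES: Dict[str, Callable[[Grid], Grid]] = {
    "identity": identity,
    "flip_horizontal": flip_horizontal,
    "flip_vertical": flip_vertical,
    "transpose": transpose,
}


def solve(pairs: List[Tuple[Grid, Grid]], query: Grid) -> Optional[Tuple[str, Grid]]:
    """Return the first candidate rule that explains every example pair,
    together with its prediction for the query grid."""
    for name, rule in CANDIDATES.items():
        if all(rule(inp) == out for inp, out in pairs):
            return name, rule(query)
    return None  # no candidate rule fits the examples


print(solve(train_pairs, test_input))
```

The point of the sketch is the shape of the problem: the rule is never stated, only demonstrated, and a prediction counts as correct only if the whole output grid matches.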
François Chollet, the creator of both ARC-AGI and Keras, stated in the test report that, despite the high costs, the results confirm that performance on new tasks improves with increased computation.
For o3, each task costs $17-$20 in low-computation mode, while high-computation mode can cost thousands of dollars per task.
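For a rough sense of scale, the per-task price can be multiplied out over a full evaluation run. The Python sketch below does that arithmetic for low-computation mode only, using the $17-$20 range quoted above. The task count of 100 is a hypothetical figure chosen for illustration, not a number taken from the report, and `total_cost_range` is a helper invented for this example.

```python
from typing import Tuple


def total_cost_range(n_tasks: int, low_per_task: float, high_per_task: float) -> Tuple[float, float]:
    """Return the (lowest, highest) total cost in dollars for n_tasks tasks."""
    return n_tasks * low_per_task, n_tasks * high_per_task


N_TASKS = 100  # hypothetical task count, for illustration only

# Per-task range for low-computation mode, as quoted in the text.
low_total, high_total = total_cost_range(N_TASKS, 17.0, 20.0)
print(f"Low-computation mode, {N_TASKS} tasks: ${low_total:,.0f} to ${high_total:,.0f}")
```

Under the same assumption, a high-computation run at thousands of dollars per task would come to at least six figures for the same number of tasks.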