After nearly two weeks of announcements, OpenAI concluded its 12-day livestream series with a preview of its next-generation frontier model. “With all due respect to our friends at Telefónica (owner of Europe’s O2 mobile phone network), OpenAI is building on the grand tradition of being really, really bad with names: it’s called o3,” Sam Altman, CEO of OpenAI, told people watching the announcement on YouTube.
The new model is not yet ready for public use. Instead, OpenAI will first make o3 available to external researchers for safety testing. The company also announced the existence of o3-mini; Altman said it plans to launch that model “around the end of January,” with o3 to follow “shortly thereafter.”
As you might expect, o3 improves on previous versions, and the highlight here is just how much better it is than o1. For example, on this year’s American Invitational Mathematics Examination, o3 achieved an accuracy of 96.7 percent; o1 scored a comparatively modest 83.3 percent. “What this means is that o3 often misses just one question,” said Mark Chen, senior vice president of research at OpenAI. In fact, o3 performed so well on the usual set of benchmarks OpenAI applies to its models that the company had to find more difficult tests to measure it against.
ARC-AGI
One of these is ARC-AGI, a benchmark that tests the ability of AI systems to intuitively learn new skills on the fly. According to the test’s creator, the nonprofit ARC Prize, an AI system that can beat ARC-AGI would be “an important milestone toward artificial general intelligence.” Since its debut in 2019, no AI model had beaten ARC-AGI. The test consists of input/output grid puzzles that are intuitive to most people; in one sample task, the correct answer is to use dark blue blocks to create a square from four polyominoes.
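To make the puzzle format concrete, here is a toy sketch (not an actual ARC-AGI task) of how such puzzles are typically represented: small grids of integers standing for colors, where a solver must infer the transformation rule from a few input/output pairs and then apply it to an unseen input. The mirror rule below is a stand-in chosen for illustration.

```python
# Toy illustration of the ARC-AGI task format (not a real benchmark task).
# Grids are small 2-D arrays of integers; each integer stands for a color.
# A solver sees a few input/output "training" pairs, infers the hidden rule,
# then applies it to a new test input.

def mirror_horizontally(grid):
    """Candidate rule: reflect each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# Training pairs that all follow the (hidden) rule.
train_pairs = [
    ([[1, 0], [0, 2]], [[0, 1], [2, 0]]),
    ([[3, 3, 0], [0, 0, 4]], [[0, 3, 3], [4, 0, 0]]),
]

# Verify the candidate rule explains every training pair...
assert all(mirror_horizontally(i) == o for i, o in train_pairs)

# ...then apply it to the unseen test input.
test_input = [[5, 0, 0], [0, 5, 0]]
print(mirror_horizontally(test_input))  # [[0, 0, 5], [0, 5, 0]]
```

What makes the real benchmark hard is that each task uses a different hidden rule, so the system cannot memorize transformations in advance; it has to infer each one from a handful of examples.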
On its low-compute setting, o3 scored 75.7 percent on the test; with additional processing power, it reached 87.5 percent. “Human performance is comparable, at the 85 percent threshold, so exceeding this is a major milestone,” said Greg Kamradt, president of the ARC Prize Foundation.
OpenAI also showed off o3-mini. The new model uses OpenAI’s recently announced Adaptive Thinking Time API to offer three different inference modes: low, medium, and high. In practice, this lets the user adjust how long the software “thinks” about a problem before providing an answer. According to OpenAI’s benchmark charts, o3-mini can achieve results comparable to the company’s current o1 reasoning model at a fraction of the compute cost. As mentioned above, o3-mini will be generally available before o3.
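OpenAI has not published details of the Adaptive Thinking Time API, so the following is a hypothetical sketch of what selecting an inference mode might look like from client code. The `reasoning_effort` parameter name, the `build_request` helper, and the model string are assumptions for illustration, not confirmed API details.

```python
# Hypothetical sketch of selecting an inference mode for o3-mini.
# The "reasoning_effort" parameter and model name are assumptions here,
# not confirmed details of OpenAI's Adaptive Thinking Time API.

EFFORT_LEVELS = ("low", "medium", "high")

def build_request(prompt: str, effort: str = "medium") -> dict:
    """Assemble keyword arguments for a chat-completion call at a given effort level."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"effort must be one of {EFFORT_LEVELS}, got {effort!r}")
    return {
        "model": "o3-mini",
        "reasoning_effort": effort,  # low = faster/cheaper, high = more "thinking" time
        "messages": [{"role": "user", "content": prompt}],
    }

# With an OpenAI-style client, this might be used as (not executed here):
#   from openai import OpenAI
#   client = OpenAI()
#   response = client.chat.completions.create(**build_request("Solve this puzzle", "high"))

print(build_request("What is 2+2?", "low")["reasoning_effort"])  # low
```

The trade-off the modes expose is straightforward: more thinking time generally buys higher accuracy on hard problems at higher latency and compute cost, which is why a cheaper model with a “high” setting can approach a larger model’s results.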