Is OpenAI o3 Really AGI?

The world may have changed, and we might not have realized it yet.

Yesterday, OpenAI shocked (and this is not hyperbole) everyone with the announcement of OpenAI o3 and o3-mini, the brand-new models in the ‘o’ family (OpenAI skipped ‘o2’ for trademark reasons).

o3’s results are so astonishing that some people are genuinely convinced it is AGI, as it destroys some of the benchmarks considered all but ‘impossible’ for current models.

The importance of this announcement cannot be overstated. If I’m being honest, it quite frankly renders obsolete many of the AI breakthroughs we covered… just weeks ago. And I promise this is not hype; even the biggest LLM skeptics, like François Chollet, are speechless.

Today, we are discussing everything (published and, more importantly, unpublished) on the historic announcement:

  • The incredible results obtained on some of the most challenging benchmarks (including one assembled with Fields Medal recipients, the Fields Medal being the ‘Nobel Prize of mathematics’), focusing primarily on what OpenAI has publicly shared,

  • We will then move on to the things OpenAI is less keen on talking about, as, luckily, researchers with insider information have spilled the beans, especially regarding what makes o3 different from o1,

  • We will also explain what OpenAI’s real moat is and how this superhuman model represents a victory over one of AI’s hardest problems,

  • Of course, we will answer the real question, always in a non-hyped way: Is this AGI?

  • And finally, I will force you to completely reframe your future around AI to ensure you aren’t left behind.

o3 opens a new chapter in our lives, so let’s ensure you are ready.

Let’s dive in!

Results That Simply Don’t Make Any Sense

For months, we all feared that the era of step-function improvements between generations was over and that models were plateauing (after all, since GPT-4, trained in 2022 and released in March 2023, new models had represented only marginal improvements over previous generations).

But these fears are definitely over with o3.

A New Coding Frontier

Some published results (the model has yet to be released) are so impressive that you wouldn’t believe them unless they were from a company like OpenAI.

For starters, the model appears to be on a completely new level regarding coding.

As seen below, the model comfortably surpasses o1 on both SWE-bench Verified, a benchmark that measures how good a model is at autonomously solving GitHub issues, and Codeforces, a platform where programmers solve challenging algorithmic problems and compete in contests. Top-tier programming challenges, so to speak.

Source: OpenAI

To ground these results, the model considered state-of-the-art for coding until today, Anthropic’s Claude 3.5 Sonnet, scores 49% on the former.

For the latter, Mark Chen, OpenAI’s research lead, claims an Elo of around 2,500, lower than o3’s, and this despite Mark being considered one of the best competitive programmers in the world. o3’s rating is also higher than that of Jakub Pachocki, OpenAI’s Chief Scientist since Ilya Sutskever’s departure.
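To make that Elo gap concrete, here is a minimal sketch using the standard Elo expected-score formula. The ~2,727 rating for o3 is the figure reported in OpenAI’s livestream (treat it as reported, not independently verified):

```python
# Standard Elo expected-score formula: probability that a player rated
# r_a beats a player rated r_b in a single head-to-head contest.
def elo_expected_score(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

# o3's reported Codeforces rating (~2,727 per the livestream) vs. a
# ~2,500-rated human like Mark Chen:
print(f"{elo_expected_score(2727, 2500):.1%}")  # ≈ 78.7%: o3 is a heavy favorite
```

In other words, a roughly 227-point gap already implies o3 would be expected to beat a 2,500-rated competitor in nearly four out of five contests.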

These results have caused many software developers to freak out completely, as we are talking about models superior to most humans, even those amongst the top 0.1%.

Your Future Fellow Mathematician?

Moving on to maths (remember, these models have been designed to tackle complex domains like coding, maths, and so on), the results also showcase step-function improvements.

Again, as seen below, o3 is considerably superior to the previous generation. And while we don’t have AIME numbers for Claude 3.5 Sonnet, it scores 59.4% on GPQA Diamond, far below o3. o3 is also superior to most PhDs on that benchmark, as, according to OpenAI, their average score is around 70%.

Again, the outsized superiority of o3 is clear.

Source: OpenAI

But the math surprises don’t end here.

In my newsletter on November 22nd, exactly a month ago, I introduced you to FrontierMath, a benchmark designed by EpochAI alongside top experts in different fields of mathematics, including Fields Medal recipients and Terence Tao, considered by many the smartest man alive.

The benchmark encapsulates some of the hardest (and, crucially, novel) math problems, aiming to test whether AIs can perform well in unfamiliar yet complex situations. The test is so challenging that expert mathematicians in that precise field can take hours or days to solve one problem.

Terence Tao said that even he wasn’t sure how to solve some of the problems.

In that newsletter issue, we showed that the best model score at the time was 2%, proof that AIs couldn’t yet handle complex, novel tasks. However, o3 has set a new record, reaching 25%, more than ten times the previous SOTA.

To put this impressive achievement into perspective, Terence Tao also predicted that an AI would take years to solve this benchmark. Given these results, this benchmark could be saturated by next year.

Toward Superhuman Abstract Reasoning

Furthermore, another benchmark OpenAI bragged about is one of the usual suspects among tests that have resisted the wrath of AI until now: the ARC-AGI benchmark.

Beating this benchmark is considered one of the significant milestones of AI on its path toward AGI.

The structure of this benchmark is pretty straightforward. As you can see in the elementary example below, the model must infer and complete a grid pattern, much like in a human IQ test.

Source: OpenAI

Crucially, ARC-AGI's main goal is to test what François Chollet, a world-renowned AI researcher and benchmark author, considers ‘real intelligence:’ the efficient, on-the-fly acquisition of new skills.

Efficient because, unlike most AIs, which are extremely data-inefficient and require hundreds of examples to learn a pattern, an AI tackling ARC-AGI gets at most two or three examples from which to learn each pattern.

Long story short, the benchmark tests three things (a toy example follows the list):

  1. The efficient, on-the-fly acquisition of new skills

  2. The capability of AIs to ‘generalize’ by solving tasks they have never seen before

  3. Whether AIs are approaching human-level reasoning
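To make the task format concrete, here is a toy, ARC-style task in Python. The grids and the hidden rule are invented for illustration; they are not from the actual benchmark:

```python
# Toy ARC-style task (invented for illustration, not a real benchmark item).
# Grids are small matrices of color indices. The hidden rule here is
# "mirror the grid horizontally". The solver sees only 2-3 demonstrations,
# must infer the rule on the fly, and apply it to a fresh test input.

train_pairs = [
    ([[1, 0], [2, 0]], [[0, 1], [0, 2]]),
    ([[3, 0, 0], [0, 4, 0]], [[0, 0, 3], [0, 4, 0]]),
]
test_input = [[5, 0, 0], [0, 0, 6]]

def inferred_rule(grid):
    # The rule a solver would have to discover from the two examples above.
    return [list(reversed(row)) for row in grid]

# Check the candidate rule against the demonstrations, then apply it.
assert all(inferred_rule(x) == y for x, y in train_pairs)
print(inferred_rule(test_input))  # [[0, 0, 5], [6, 0, 0]]
```

The point of the benchmark is that each task hides a different rule, so a solver cannot memorize its way through; it must acquire the skill from the two or three demonstrations alone.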

And the results are insanely impressive. o3 sets an out-of-this-world record of 87.5% at high thinking time and 75.7% at low thinking time, the latter more than two times better than the best o1 result (and any other model’s, for that matter).

A very interesting thing to point out is that the 75.7% result was achieved under the compute requirements set by the benchmark organizers. In other words, the model couldn’t think forever; it had to stay under a maximum compute threshold (the 87.5% run used a much larger compute budget).

Source: OpenAI

Focusing on the highest score of 87.5%, the equally exciting and scary part is that, for the first time, the result is above that of the average human, who usually scores around 85%. In itself, this is a huge milestone:

AIs are becoming better than average humans in complex tasks.

On the product level, OpenAI announced that these models will let you choose between different thinking thresholds. In simple terms, you get control over how much the model thinks about a task (particularly useful when you want to keep your API costs in check).
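As a sketch of what this could look like from the API, consider the hypothetical call below. The model identifier and the `reasoning_effort` parameter are assumptions extrapolated from the announcement, not a confirmed interface at the time of writing:

```python
# Hypothetical sketch of selecting a thinking threshold via the API.
# The model name and the `reasoning_effort` knob are assumptions based
# on the announcement, not a confirmed interface.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o3-mini",          # assumed model identifier
    reasoning_effort="low",   # assumed values: e.g. "low" | "medium" | "high"
    messages=[{"role": "user", "content": "Solve step by step: 17 * 24"}],
)
print(response.choices[0].message.content)
```

The design intuition is simple: a higher effort setting buys more reasoning tokens (and a better answer on hard tasks) at a higher price, while a lower setting keeps easy queries cheap.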

They also shared results for o3-mini, which sets a new frontier for compute-constrained reasoning tasks. In other words, it’s the best bang-for-your-buck reasoning model given the costs of running these models (which, as we’ll see later, won’t be cheap).

Now, we move on to the part that OpenAI did not discuss for obvious reasons:

  • What on Earth is this superhuman model?

  • What makes it different from anything we’ve seen before?

  • What are the not-so-positive things about this model that OpenAI obviously skipped to prevent turmoil and retain its competitive advantage?

And, crucially, what does all this mean to you?
