Has OpenAI Lost Its Way?


FUTURE
Has OpenAI Lost its Way?
Since the release of GPT-4.5 on Thursday, the industry has been filled with negativity toward OpenAI, with claims that it has lost its lead or is completely out of touch with its pricing.
But despite being someone who isn’t particularly fond of OpenAI, I believe all these claims are exaggerated, and the conclusions people have reached stem from evaluating the model in entirely the wrong way.
To me, this appears to be a clear decoy by OpenAI ahead of a more significant release, GPT-5. They are purposefully downplaying this release, and in fact, the prices they’ve set are high on purpose; they DON’T want you to use the model.
But why?
Today, we will describe all the details, clarifying the apparent confusion around the model. You will also learn how GPT-4.5 is the precursor to two new models (merged into one, the ‘distilled reasoner’), laying out OpenAI’s grand plan. Finally, we will discuss what this release actually means from the perspective of the entire industry, which is the real takeaway.
Let’s dive in.
An Apparently Disastrous Release
When even OpenAI themselves acknowledged during the live stream that GPT-4.5 ‘was not frontier’, and instead focused on the idea that the real improvement comes at the level of knowledge (it knows a lot more about the world and is less prone to hallucination) and that it captures user intent better, we knew the release of GPT-4.5 was not going to be a GPT-4-level release, despite it being the largest model ever trained.
An Unamusing Improvement
The famous AI lab was surprisingly light on performance graphs, presenting just three:
In the first two, they showed how GPT-4.5 knew more about the world according to their knowledge-testing SimpleQA benchmark and also showed decreased hallucination rates. GPT-4.5 is a more knowledgeable model with firmer conviction in the things it says.
Despite always being portrayed in a bad light, hallucinations are actually, in part, a feature of these models. They are, in a way, introduced on purpose to foster creativity: instead of always outputting the single most likely word (say, ‘Engineer’), the model samples one word from among the top most likely candidates (‘Lawyer’ or ‘Doctor’ are also semantically viable, so they can get chosen too).
This means the hallucination problem is a two-headed beast: it depends not only on how certain the model is about its predictions but also on how that certainty is balanced against the need for creativity.
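To make the sampling idea concrete, here is a minimal sketch (plain Python with made-up probabilities; real LLMs use temperature and top-k/top-p sampling over tens of thousands of tokens) of picking the next word from the few most likely candidates instead of always taking the single most likely one:

```python
import random

# Hypothetical next-word probabilities for "My neighbor works as a ..."
next_word_probs = {
    "Engineer": 0.38,   # the single most likely word
    "Lawyer":   0.31,   # semantically viable alternatives
    "Doctor":   0.24,
    "Banana":   0.07,   # unlikely, should rarely be picked
}

def sample_top_k(probs, k=3):
    """Keep only the k most likely words, then sample one of them."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    words, weights = zip(*top)
    return random.choices(words, weights=weights, k=1)[0]

# Often 'Engineer', but 'Lawyer' and 'Doctor' get picked a meaningful fraction
# of the time; that randomness fuels both creativity and hallucinations.
print(sample_top_k(next_word_probs, k=3))
```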


In the third graph, they compared the model to GPT-4o in a vibe-check benchmark: human testers saw responses from both models and had to choose which one responded better (without knowing which response came from which model). Here, GPT-4.5 beat GPT-4o in every category, with an average win rate of 63% (i.e., GPT-4.5’s responses were preferred roughly 63% of the time).
While a 63% win rate might be seen as ‘not that much,’ it isn’t that different from GPT-4’s win rate vs GPT-3.5 (seen as a historic improvement), which was 70.2%. In Elo-based ratings, a 63% win rate is worth around 100 Elo points (92 to be exact, using lmarena’s method), which should put GPT-4.5 as the undisputed leader of the lmarena.ai leaderboard, well ahead of Grok-3, the current leader.
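To see where that 92-point figure comes from, here is a quick sketch (Python, assuming the standard logistic Elo formula that arena-style leaderboards are based on) converting a head-to-head win rate into an Elo gap:

```python
import math

def win_rate_to_elo_gap(win_rate: float) -> float:
    """Convert a head-to-head win rate into an Elo point difference
    using the standard logistic Elo model."""
    return 400 * math.log10(win_rate / (1 - win_rate))

print(round(win_rate_to_elo_gap(0.63)))   # ~92 Elo points (GPT-4.5 vs GPT-4o)
print(round(win_rate_to_elo_gap(0.702)))  # ~149 Elo points (GPT-4 vs GPT-3.5)
```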

The results look good, right? Then, why are people so negative?
The Reasoning Obsession
The issue is the results on reasoning benchmarks, which are all the rage right now (largely OpenAI’s own fault, as they started this paradigm in the first place). Before we look at them, bear in mind two things:
- OpenAI purposely avoided comparisons with non-OpenAI models, which is already a red flag.
- They also insisted that “GPT-4.5 was not a frontier model.”
Therefore, from the very start, they incentivized people to speculate while simultaneously downplaying their results on purpose. Additionally, this is not a reasoning model, which they also acknowledged, clarifying that it would underperform not only their reasoning models (the o-family) but also other players in the market.
To serve as an example:
- On the Aider polyglot benchmark, the most revered coding benchmark, the model barely scrapes into the top 10.
- On the ARC-AGI reasoning benchmark, it underperforms all o-type reasoning models except o3-mini in low compute (and also DeepSeek R1), with an unamusing score of 10.33% and worse performance than Claude 3.7 Sonnet.

But here’s the thing: This was supposed to happen; this release is not an improvement in reasoning but an improvement in unsupervised learning, which is why people are—once again—jumping to conclusions too fast and not seeing the bigger picture.
And there’s a reason why OpenAI mentioned those precise words, ‘unsupervised learning,’ countless times in the live stream despite being a term few people actually understand.
Let me explain why they did this.
What This Release Really Means
Currently, there are two main ways in which you build “intelligence” into a model: unsupervised training (UT) and Reinforcement Learning (RL).
Understanding how AIs Learn.
In a nutshell, deep-learning AIs (all frontier models today) learn in two ways: compressing data or exploring. But what does that mean?
The first involves exposing an AI model to trillions of data points, basically the entire open Internet. The model’s goal is to accurately predict the next token in a sequence (words, in the case of large language models), as discussed above.
By learning to predict the next word, the model indirectly compresses the knowledge required to make such a prediction. For example, if the model can predict that the next word in the “The Capital of Turkey is” sequence is “Ankara,” the model has indirectly learned what the capital of Turkey is.
Whether the LLM actually understands what Turkey or Ankara are is another story, which is why so many people debate whether AIs are intelligent or imitate intelligence (personally, I bend toward the second group).
Therefore, by taking the entire Internet’s data and feeding it to the model, if it can predict every single word given a sequence, it has more or less learned all the knowledge on the Internet. This procedure involves trillions of words, far more than any human could ever review.
So, if models learn by making trillions of these predictions, who checks whether each prediction is accurate?
Crucially, the data itself serves as the judge of whether the model is improving, since we can compare the model’s prediction to the actual next word. Thus, we can train models in an unsupervised mode, where humans aren’t actively deciding whether each prediction is correct (hence the term unsupervised learning). This avoids the ‘human bottleneck,’ making the method scalable to trillions of words and unlocking the AI’s knowledge-compression powers on larger-than-life datasets.
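To make that loop concrete, here is a minimal, heavily simplified sketch of next-word prediction training (PyTorch, with a toy six-word vocabulary made up for illustration; real pretraining uses transformer models and trillions of tokens, not this tiny setup):

```python
import torch
import torch.nn as nn

# Toy "dataset": the text itself supplies both inputs and labels.
vocab = ["the", "capital", "of", "turkey", "is", "ankara"]
sentence = ["the", "capital", "of", "turkey", "is", "ankara"]
token_ids = torch.tensor([vocab.index(w) for w in sentence])

# A deliberately tiny next-word predictor: embedding -> linear over the vocab.
model = nn.Sequential(nn.Embedding(len(vocab), 16), nn.Linear(16, len(vocab)))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

inputs, targets = token_ids[:-1], token_ids[1:]  # predict each next word

for _ in range(200):
    logits = model(inputs)            # model's guess for every position
    loss = loss_fn(logits, targets)   # the data itself acts as the judge
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# After training, the model has "compressed" the fact that "ankara" follows
# "the capital of turkey is".
prediction = model(inputs)[-1].argmax().item()
print(vocab[prediction])  # expected: "ankara"
```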
In a nutshell, as the AI saying goes, “You get what you optimize for.” So, if you train a model to predict the next word, you are getting a model that compresses textual world knowledge into itself.
It must be said that the ‘overall unsupervised learning stage’ also includes minor supervised and preference stages, where this ‘word predictor’ is fine-tuned to behave in a specific way. Human involvement is much more active in these stages, but they are nevertheless generally included in this ‘pretraining stage’ of development.
Having covered the first AI learning method, we now turn to Reinforcement Learning, another training procedure in which models optimize for goal achievement. This crucial step is all the rage right now, as this method is what turns base models like GPT-4o into reasoning models like o3.
This is the cutting edge of the industry, and trillions of dollars have been bet on this single technique. If you understand scaled RL, you understand AI’s frontier, so let’s do just that.
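As rough intuition for how this differs from next-word prediction, here is a heavily simplified sketch (plain Python, with a made-up arithmetic task and toy reward function; an illustration of the general idea of RL with verifiable rewards, not any lab’s actual pipeline):

```python
import random

# Hypothetical task with a verifiable answer: the reward comes from checking
# the model's final answer programmatically, not from human labels.
problem, correct_answer = "What is 17 * 24?", 408

def sample_answer() -> int:
    """Stand-in for sampling a full reasoning trace + answer from the policy."""
    return random.choice([406, 407, 408, 408, 409, 410])  # imperfect policy

def reward(answer: int) -> float:
    """Verifiable reward: 1.0 if the answer checks out, 0.0 otherwise."""
    return 1.0 if answer == correct_answer else 0.0

# Collect a batch of rollouts and score them. In real scaled RL, these rewards
# drive gradient updates (PPO-style) that make the reasoning traces leading to
# correct answers more likely the next time around.
rollouts = [sample_answer() for _ in range(8)]
print(problem, "->", [reward(answer) for answer in rollouts])
```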

Subscribe to Full Premium package to read the rest.