GPT-5 Has Arrived

AI leaders only: Get $100 to explore high-performance AI training data.

Train smarter AI with Shutterstock’s rights-cleared, enterprise-grade data across images, video, 3D, audio, and more, enriched by 20+ years of metadata. With 600M+ assets and scalable licensing, we help AI teams improve performance and simplify data procurement. If you’re an AI decision maker, book a 30-minute call; qualified leads may receive a $100 Amazon gift card.

For complete terms and conditions, see the offer page.

THEWHITEBOX
TLDR;

After more than a year of teasing, GPT-5 finally arrived a few hours ago. Besides that, we take a look at:

  • Anthropic’s new flagship model, Opus 4.1,

  • a short analysis of AMD’s recent successes,

  • leaks on OpenAI and Anthropic revenues,

  • the reality that AI startups like Cursor are starting to face,

  • and a deep dive into OpenAI’s open-source releases a few days ago.

Enjoy!

THEWHITEBOX
Things You’ve Missed By Not Being Premium

On Tuesday, we had the start of a week full of announcements coming from all sides. From Genie 3 (to me, the most impressive AI model since the launch of ChatGPT) to OpenAI’s imminent GPT-5 release, the industry is delivering as much hype as ever.

However, we will also look at the industry’s tremendously unstable foundation, which may raise several alarms in your head.


FRONTIER MODELS
Anthropic Releases Opus 4.1

In a very low-key release, overshadowed by Google’s Genie 3 and OpenAI’s first open-source release in six years, Anthropic has updated its best model, Opus 4, to Opus 4.1.

The model is largely focused on coding and agentic tasks, setting the state of the art in those areas with little debate, yet it underperforms competitors in areas like common knowledge or math.

TheWhiteBox’s takeaway:

Anthropic isn’t even pretending anymore. They have one thing on their minds and one thing only: creating the best coding and agentic (tool-calling) models on the planet.

In other words, they are optimizing for a cohort of users who work in software programming, or, as we saw last Sunday, those who want to automate their lives by running personal agents that can call other tools to perform actions on their behalf.

The last one is the one that really matters, because having the best tool-calling AI on the planet is quite literally equivalent to having the best AI agent on the planet.

And while OpenAI can’t simply stop being a good conversational model, or one that knows a lot about a lot of things (its primary user base is people looking for common knowledge, searching stuff on the Internet, or simply having a conversation), absolutely zero people on this planet use Claude for that, so Anthropic can naturally optimize toward the things its models are actually used for.

And the strategy, revenue-wise, is panning out, as you’ll see for yourself in the Markets section.

HARDWARE MARKETS
AMD Disappoints, Yet It’s Still Crushing It

AMD fell almost 7% yesterday after reporting mixed results and, once again, refusing to give more details on its data center business. Still, it’s the best-performing chip stock of the year, even ahead of its $4-trillion-market-cap rival NVIDIA.

So, what should we make of its current state?

TheWhiteBox’s takeaway:

Ever since our deep analysis of AMD in March this year, the stock is up 70%, from around $100 to $171 at the time of writing. If you bought around that time, congrats!

In that article, we discussed why AMD was a hidden gem, but we also outlined the biggest issues: bad software and, above everything else, poor server scale-ups.

In AI, there’s no such thing as a non-distributed workload. That means models are served to users using several GPUs per model (sometimes we even split sequences between GPUs, meaning several GPUs are helping to respond to you).

This introduces an extra metric to be concerned about, beyond a GPU’s compute power and memory bandwidth (how fast we can send data in and out of memory to the compute chips).

This third metric is GPU-to-GPU communication speed, in which the server topology (how GPUs are connected to each other) is crucial to offer a good user experience. The number of GPUs you can install in a single server is known as the ‘scale-up’ (and how many servers you can combine together is the ‘scale-out’).

Scale-up is arguably the most critical metric in the entire chip accelerator industry, and in that domain, NVIDIA rules, with up to 72 state-of-the-art GPUs all connected in an all-to-all topology thanks to NVIDIA’s crown jewel, NVLink and NVSwitch technologies.

Except that NVIDIA doesn’t actually have the highest scale-up; Google and Huawei do, with their Ironwood and CloudMatrix servers. Luckily for Jensen Huang, one doesn’t commercialize its TPUs, and the other is Chinese, so it’s basically locked out of Western markets (for now).

AMD’s biggest weakness shows up in this particular discussion, due to its very poor scale-up of just 8 GPUs (and in a ring topology, which is totally unacceptable unless we are talking about small models).

The reason AMD is doing so well is more about investor confidence in betting on the next NVIDIA, plus the promise of its MI400 server, which is expected to match NVIDIA’s scale-up at 72 GPUs per server.

I’m an AMD investor, so I trust they’ll deliver, but let me be clear that AMD’s stock rise, beyond good quarterly results, is more about what AMD can be than what it is right now.

AI REVENUES
OpenAI and Anthropic Lead the Pack

Thanks to several sources, we now have solid evidence of OpenAI and Anthropic’s impressive revenue growth (still painfully small compared to the size of this inflated industry), with the two potentially ending the year at $20 billion and $10 billion in annualized recurring revenue (ARR), respectively.

Interestingly, even though OpenAI more than doubles Anthropic’s overall revenue, the latter’s API revenues are, for the first time, higher than OpenAI’s. The largest gap in overall revenue comes from consumer subscriptions, where OpenAI has enjoyed a comfortable head start since its historic ChatGPT release back in November 2022.

TheWhiteBox’s takeaway:

As I was suggesting in the Technology section, OpenAI and Anthropic have clearly parted ways in what they are trying to build.

  • ChatGPT is a general-purpose product, one that, while not being bad at anything, is clearly optimized as a conversational AI that can aid you in pretty much everything.

  • Claude is a software and agents product, one that, while terrible at most things (especially search), is great at software engineering and agents.

The Trojan Horse here (potentially for both) is Google.

Their models excel in recall (knowledge-based tasks) and coding, making them the best positioned to be the ultimate solution. At the same time, AI buyers are very prone to the “just give me the best model, bro” approach, and Google could end up in no man’s land, as some choose OpenAI and others Anthropic.

AI REVENUES
The Delicate Cursor Situation

This interesting article by The Information discusses the delicate situation (one we have predicted for some time) facing AI coding products: despite some of them seeing the biggest revenue explosions in the history of capitalism, they might still end up acquired at best, or simply fail.

Despite having great products, great growth, and deep pockets, these companies are struggling with the most essential thing: profitability.

And the reason is, interestingly, AI.

If you look at the previous graph we shared comparing OpenAI and Anthropic’s projected revenues, you'll see that Anthropic’s API revenues come from two primary sources: Cursor and GitHub Copilot.

Assuming a 50% share, that means Cursor will spend $700 million this year on Anthropic models.

And their revenues? Based on June’s monthly revenue, they have a $500 million run rate, so revenue could easily be below their COGS, meaning not only are they not operationally profitable, they probably can’t even pay their Anthropic bill without raising more money.

The article also points out the terrible churn rates of some of these companies, something we predicted a few weeks ago around Lovable’s fake $100 million ARR announcement.

While these companies report their revenues directly as MRR/ARR, implying they are recurring, most of these businesses, especially the vibe-coding apps like Lovable, don’t have a recurring business. No one, at least today, builds apps on a recurring basis; it’s a one-off task.

In plain English, customers are churning en masse, with the latest estimates putting Lovable’s 4-month retention at 61%, which implies roughly 11% monthly churn, an absolutely outrageous figure.

To put this into perspective, Netflix has historically had a roughly 4% churn rate, which was considered ‘very bad’ and led investors to believe for years that the streaming market would never be profitable. 11% is a totally different speed of customer loss: at that rate, they lose half of their entire “subscriber” base every six months (about 50% retention at six months).
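If you want to sanity-check those numbers, here’s a minimal sketch of how monthly churn compounds into retention (the 4% and 11% figures are the rough estimates quoted above):

```python
# How monthly churn compounds into cohort retention over time.
# 4% and 11% are the rough Netflix-like and Lovable-like figures quoted above.

def retention_after(months: int, monthly_churn: float) -> float:
    """Fraction of the original cohort still subscribed after `months` months."""
    return (1 - monthly_churn) ** months

for churn in (0.04, 0.11):
    print(f"monthly churn {churn:.0%}: "
          f"4-month retention {retention_after(4, churn):.0%}, "
          f"6-month retention {retention_after(6, churn):.0%}")

# monthly churn 4%: 4-month retention 85%, 6-month retention 78%
# monthly churn 11%: 4-month retention 63%, 6-month retention 50%
```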

So, what does all this mean?

Simple: they aren’t a recurring business (and never will be), and are simply using revenue from new subscribers to hide churn. As long as new revenue outpaces lost revenue, everything is fine. Until it’s not.

OPENAI
GPT-5 Drops

Finally. More than two years after the release of GPT-4, OpenAI has finally released GPT-5.

The results? On paper, it is absolutely state-of-the-art. The model (or group of models) comes in four formats:

  1. GPT-5: The flagship model

  2. GPT-5 mini

  3. GPT-5 nano

  4. And GPT-5 chat, similar in size and performance to GPT-5 but optimized for chatting (most likely the default option in the ChatGPT app).

One of the best things about the release is the pricing, which is unexpectedly competitive.

Naturally, the model is great at basically every benchmark imaginable, and it has a hybrid reasoning system: if the question is easy, it answers fast; if the question is hard, it thinks for longer (not exactly a hybrid model, because they are combining reasoning and non-reasoning models behind a router, but the behavior is similar).
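Conceptually, the routing behaves something like the sketch below. To be clear, this is a hypothetical illustration: the `estimate_difficulty` heuristic and the model names are invented, since OpenAI hasn’t published how its router actually decides.

```python
# Hypothetical sketch of "answer fast vs. think longer" routing.
# The heuristic and model names below are invented for illustration only.

def estimate_difficulty(prompt: str) -> float:
    """Toy heuristic: long prompts and math/code keywords count as 'hard'."""
    signals = ("prove", "debug", "optimize", "step by step")
    score = min(len(prompt) / 2000, 1.0)
    score += 0.3 * sum(word in prompt.lower() for word in signals)
    return min(score, 1.0)

def route(prompt: str) -> str:
    # Easy queries go to a cheap, fast model; hard ones to a reasoning model.
    return "reasoning-model" if estimate_difficulty(prompt) > 0.5 else "fast-model"

print(route("What's the capital of France?"))                      # fast-model
print(route("Prove the bound and debug the solver step by step.")) # reasoning-model
```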

But how good is it? The issue, typical of OpenAI releases, is that AI influencers have massively overhyped it, and the result is, as expected, less impressive than we would have hoped.

Importantly, it’s not a step-change improvement over what we had before, in the way GPT-4 was over GPT-3. It’s SOTA everywhere, which in itself is a resounding success, but it’s not the leap some had hoped for.

So, even though it’s very, very early (the model has been out for three hours), let me be clear:

  • Is it the best model you can use today? Most likely

  • Does it have competitive pricing? Yes, very; a really commendable effort by OpenAI, probably thanks to its smart routing system that selects the best model for each task, allowing OpenAI to serve cheaper models when possible.

  • Is it AGI? Hell no; Grok 4 outcompetes it on ARC-AGI 2.

The main takeaway for me is that this confirms that we need new stuff. Step-function improvements are a thing of the past, and since this isn't AGI, it's clear that our current capabilities won’t lead to AGI.

We clearly need new ideas and new data. That said, even if the industry stalls in terms of progress, what we have right now would still be enough to cause massive disruption in the economy.

Agents are already legit, but let’s not fool ourselves anymore with the ‘AGI by 2027’ predictions that do more harm than good.

And perhaps even more importantly, let’s stop lying to people. I mean, what on Earth is this bottom graph in OpenAI’s presentation? Fire the marketing team.

TREND OF THE WEEK
gpt-oss, Is It That Big of a Deal?

Finally, after many months of waiting, OpenAI has released open-weight models under a fully open-source license, the first time it has released something under such a permissive license since GPT-2 back in 2019.

As with any OpenAI release, the vibes have been incredible, with the entire AI industry very excited (and expectant) about the release.

But as time passes, things are getting less and less exciting. I’ll be sharing my thoughts and overall impressions, including a mandatory comparison with Chinese open-source models.

Everyone is Building the Same Thing

As mentioned, OpenAI has released two open-weight models (the datasets used for training were not released, so they can’t be considered fully open-source):

  1. gpt-oss-120B, a 120-billion parameter model with 5.1 billion activated parameters (more on this in a second).

  2. gpt-oss-20B, a 20-billion parameter model with 3.6 billion activated parameters.

This means both models are ‘mixture-of-experts’ (MoE). This is a technique, pioneered by Google but first widely adopted by OpenAI with GPT-4, in which only a part of the model’s neurons (weights) activate.

If we envision a model as something you send text to and that returns a continuation of the input as a response, then inside, the model ‘queries’ its weights (neurons), which either activate or not, similar to how brain neurons fire based on input (hence the name ‘artificial neurons’).

The more weights that participate in the next-word prediction, the more computations have to be done for that prediction, and the longer it takes.

Instead, MoEs “divide” the model into 'experts’ and a router, which chooses the best experts to answer any given question.

In practice, this means that only a small fraction of the model actually participates, in a similar fashion to how your brain doesn’t activate fully for every task and instead dynamically specializes regions (a very, very loose analogy, just to drive the point home).

In the larger gpt-oss model’s case, only about 4.2% of the model activates each time, which means it’s roughly 20 times faster than this same model would be without MoE.

If you’re wondering why we aren’t just making the model twenty times smaller, the answer is that keeping the model large and letting it define its experts is a best-of-both-worlds bridge between a large model (smarter) and a small model (faster yet dumber).
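For intuition, here’s a minimal mixture-of-experts forward pass in NumPy: a router scores the experts, only the top-k actually run, and their outputs are combined. The dimensions and expert counts are toy numbers, not gpt-oss’s real configuration.

```python
import numpy as np

# Minimal mixture-of-experts forward pass (toy sizes, not gpt-oss's real config).
rng = np.random.default_rng(0)
d_model, n_experts, top_k = 64, 8, 2

router_w = rng.standard_normal((d_model, n_experts))                 # router weights
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    """x: (d_model,) token representation -> (d_model,) output."""
    logits = x @ router_w                    # score every expert
    chosen = np.argsort(logits)[-top_k:]     # keep only the top-k experts
    weights = np.exp(logits[chosen])
    weights /= weights.sum()                 # softmax over the chosen experts
    # Only top_k of n_experts matrices are touched: that's the compute saving.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

out = moe_forward(rng.standard_normal(d_model))
print(out.shape, f"-> ran {top_k}/{n_experts} experts for this token")
```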

This is important because it confirms that everyone is basically building the same models, as this ‘fine-grained experts’ approach is the predominant one among Chinese models, too.

With my laptop's 128 GB of memory and a “decent” compute system featuring 40 GPU cores (a refurbished M3 Max MacBook Pro), I can run both. I haven’t tried the small one, but I managed to get 30+ tokens/second on gpt-oss-120B, which is above the commonly recommended minimum of 20 tokens/second.

gpt-oss-120B answering my question (wrong, by the way).
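In case you’re wondering whether numbers like that are even plausible: decoding is mostly memory-bandwidth-bound, so a crude ceiling is bandwidth divided by the bytes you have to read per token. The sketch below uses assumed, back-of-envelope figures (the 400 GB/s bandwidth is my assumption for a 40-GPU-core M3 Max), not measurements.

```python
# Crude decode-speed ceiling for a memory-bandwidth-bound MoE model.
# All figures are back-of-envelope assumptions, not measured values.

bandwidth_gb_s   = 400   # assumed unified-memory bandwidth of a 40-GPU-core M3 Max
active_params_b  = 5.1   # billions of parameters activated per token (gpt-oss-120B)
bytes_per_weight = 0.5   # ~4-bit expert weights

bytes_per_token_gb = active_params_b * bytes_per_weight  # ~2.6 GB read per token
ceiling_tok_s = bandwidth_gb_s / bytes_per_token_gb

print(f"theoretical ceiling: ~{ceiling_tok_s:.0f} tokens/s")  # ~157 tokens/s
# Real throughput (the ~30 tok/s above) lands far below this ceiling because
# attention weights sit at higher precision, the KV cache must also be read,
# and the runtime adds overhead; still, it shows 20-30 tok/s is plausible.
```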

Besides this, they seem to have trained the model entirely on synthetic data (data generated by other AI models, not human data). We’ll circle back to this later because, well, it has its pros and cons.

Beyond that, nothing is really remarkable; the models are yet another pair of ‘chain-of-thought’ reasoning models, meaning they increase performance on a task by ‘thinking about it for longer’, generating a concatenation of thoughts inspired by how humans typically reason through hard problems:

  1. Plan how to solve it,

  2. Break the plan into steps,

  3. Solve step by step.

So, really, nothing remarkable about the reasoning training either. If the architecture and training methods are the same as most new models, is there anything special about these ones?

The only thing worth mentioning is that they are the best (and, to my knowledge, the first) models natively trained on 4-bit weights, meaning each weight takes up just half a byte. As we are talking about 120 billion weights, that means that instead of the 120 GB the model would weigh at 8 bits per weight, its size is cut in half to roughly 60 GB.

In reality, it’s a little more than that, as only the expert weights are quantized (attention weights remain at higher precision), but since expert weights represent more than 90% of the total, the result is still roughly a 50% size saving.
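As a quick sanity check on those sizes, here’s a back-of-envelope sketch that just multiplies parameter count by bytes per weight (ignoring everything else):

```python
# Approximate size of a 120B-parameter model if every weight used the same precision.
total_params = 120e9

def size_gb(bits_per_weight: float) -> float:
    return total_params * bits_per_weight / 8 / 1e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{size_gb(bits):.0f} GB")

# 16-bit: ~240 GB, 8-bit: ~120 GB, 4-bit: ~60 GB.
# The real checkpoint is a bit larger than the 4-bit figure because attention
# weights stay at higher precision; only the experts (>90% of weights) are ~4-bit.
```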

But, theory aside, is the model any good?

SOTA for Open-source?

At first glance, the benchmark results look absolutely incredible. When evaluated at the “high” reasoning level on canonical benchmarks, gpt-oss-120B closed much of the gap to OpenAI’s proprietary o4-mini model.

  • On the American Invitational Mathematics Examination (AIME) 2024, without tools, it scored 95.8% accuracy, rising to 96.6% when permitted tool access.

  • Its MMLU (college-level exams) performance reached 90.0% accuracy.

  • The gpt-oss-20B model, despite being nearly six times smaller, still achieved 92.1% on AIME without tools (96.0% with tools) and 85.3% on MMLU. These results place the 120B edition just shy of o4-mini and the 20B model on par with o3-mini.

In coding and agentic tasks, the larger model again approached frontier levels.

  • Measured via Codeforces problem solving, gpt-oss-120B earned an Elo of 2,463 without tools and 2,622 with tool support, while gpt-oss-20B scored 2,230 and 2,516, respectively, at high reasoning effort.

  • On SWE-Bench Verified and τ-Bench Retail, similar trends held: the 120B model outperformed o3-mini, and the 20B matched or exceeded earlier open models under comparable conditions.

So, to summarize:

  • gpt-oss-120B is an o4-mini-ish-level model,

  • and gpt-oss-20B is an o3-mini-level model.

Now, this is what OpenAI said. But what are others saying?

The first vibe-killer is Artificial Analysis, which claims that the models are good, but not SOTA for open-weights, trailing behind (closely, though) DeepSeek R1 and Qwen3 235B.

Thus, while they are the best open-weight models on a performance-to-size metric, they are not, nominally speaking, the best open-weights on the planet.

As of today, China continues to lead in that area.

In this article’s thumbnail, you can see the performance-to-cost comparison (cost is highly correlated with size), where the 120B model is, indeed, the best open-weight model in that regard.

But things get worse from here. On the Aider Polyglot benchmark, the reference coding benchmark, the 120B model scores pretty poorly, just 41.8%, well below several Chinese counterparts like Kimi K2 and R1, and only barely better than Qwen3 30B, a model three times smaller. Not great.

So, what on Earth is going on? The answer is that the models are, as AI pundits would say, ‘fried’. But what does that mean?

Although it may still be too early to conclude, it seems the models were completely overtrained to do well on benchmarks.

This isn't good, not because the model excels in those areas per se, but because it only does well in those areas. Whenever you train a model too heavily on any given area, you invariably sacrifice performance in other areas. But optimizing for benchmarks is an entirely different level of overtraining, one you would only do if you wanted to, well, save face.

This is why people report that the model struggles in areas outside competitive math and coding. Furthermore, it lacks good taste in its conversational style and frequently hallucinates, among other critiques I’ve been hearing.

So, what to make of all this?

There is no Moat.

It’s pretty clear what OpenAI has done here: save face.

  • They knew they were not going to offer a best-in-class model without revealing too much,

  • To avoid looking bad, they benchmaxxed the model, so marketing-wise, the picture is good.

While OpenAI has obviously hidden its key IP in this release, just releasing the bare minimum, I believe the reality is more daunting: there really isn’t a technological moat, and everyone is mainly building the same models.

Sure, OpenAI may have a “universal verifier” and a low-verbosity training methodology as we saw on Tuesday’s news rundown for Premium subscribers, but word gets out eventually.

Also, China now has confirmation that it no longer depends on US progress to advance; it can now start innovating on its own.

Assuming AI becomes a commodity, the moat is what you do about it. Distribution, operational excellence (more on that on Sunday), and brand awareness will be the name of the game.

THEWHITEBOX
Join Premium Today!

If you like this content, join Premium: you will receive four times as much content weekly without saturating your inbox, and you’ll even be able to ask the questions you need answers to.

Until next time!

For business inquiries, reach out to me at [email protected]