China Sounds the 🚨, Spending, & New Models

THEWHITEBOX
TLDR;

Today, we have a newsletter packed with interesting news and insights, from leaks about xAI’s latest supermodel, Grok 3.5, to Microsoft’s new super reasoner, Phi-4-reasoning.

We also discuss Big Tech earnings and whether the AI spending party continues, as well as Sam Altman’s latest venture and the hidden costs of AI models.

Finally, we look at a transgressive research paper from China that challenges the common assumptions on reasoning models, the industry’s latest sweethearts; they aren’t what you’ve been made to believe.

THEWHITEBOX
Things You’ve Missed By Not Being Premium

On Wednesday, we covered the release of China’s new powerful model, as well as interesting releases by NVIDIA and DynaRobotics, the latter of which showcases the most advanced robustness we have ever seen in AI robots.

We also covered OpenAI’s dizzying revenue projections, the latest features, and the greatest controversy: the story of the most absurd case of sycophancy the AI industry has ever seen.

FRONTIER MODELS
Will Grok 3.5 Be an Engineering Beast?

On Tuesday, Elon Musk confirmed the launch of Grok 3.5 next week (although it could be released any time now), making very strong claims about it.

But it seems that a small mistake by one of the engineers has revealed that xAI has up to sixty different Grok fine-tunes internally (sixty different versions of the model), with very surprising names like ‘grok-SpaceX.’ What this implies is that xAI is actively training its models on SpaceX, Neuralink, Starlink, or Tesla data, which could give them an unparalleled advantage in engineering efforts.

This could be a moat in itself, because some of these companies have quite literally unique data in crucial areas like space, autonomous driving, or even brain-computer interfaces.

For instance, a very recent research paper by Tufa Labs shows how small reasoning models trained on rocket design data can outcompete not only frontier AIs but also human experts.

TheWhiteBox’s takeaway:

Every day that passes, it’s becoming harder and harder to bet against xAI. Not only have they managed to become a top-3 AI lab in under a year, but they also have the capital-raising capacity, combined with the force of being part of the ‘Elon network,’ that makes it hard to fathom a future where they don’t blow most labs completely out of the water in terms of ‘raw intelligence.’

But the point here is that this isn’t a big deal, as AI is clearly not a winner-takes-all industry, and most labs will specialize in specific use cases.

  • OpenAI appears to have a great lead in everyday, consumer-end questions and tasks, with a clear interest in agentic computer use.

  • Google has great distribution for desktop-based search via its AI-enhanced search engine, and it is also strong at coding.

  • Anthropic and Cohere seem to be focusing more on enterprise use cases, the former on agentic-style/coding work and the latter on enterprise search.

  • xAI could be developing a really strong presence on engineering use cases (design, prototyping, etc.)

Of course, none have renounced any particular use case and all remain generalist solutions, but the end game will probably be more specialized.

REASONING MODELS
Microsoft’s Phi-4-reasoning: A Small Beast

Microsoft has released new open-source models, Phi-4-reasoning and Phi-4-reasoning-plus, available for free download on HuggingFace, which exhibit incredible performance for their size of ‘just’ 14 billion parameters. This means they can be run on consumer-end hardware (the bare minimum would be around 32 GB of RAM, which is acceptable for modern high-end laptops).
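For a quick sanity check on that 32 GB figure, here’s a back-of-the-envelope sketch (my own illustration, not from Microsoft’s model card): the weights alone dominate the memory bill, and quantization is what brings a 14B model down to laptop territory.

```python
# Back-of-the-envelope memory estimate for running a 14B-parameter model locally.
# Illustrative numbers only; real usage also depends on quantization scheme,
# context length (KV cache), and runtime overhead.

def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Memory needed just to hold the model weights, in GB."""
    return n_params * bytes_per_param / 1e9

N_PARAMS = 14e9  # Phi-4-reasoning is ~14B parameters

for precision, nbytes in [("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    print(f"{precision:>9}: ~{weight_memory_gb(N_PARAMS, nbytes):.0f} GB of weights")

# fp16/bf16: ~28 GB -> explains the ~32 GB RAM floor once KV cache and runtime overhead are added
#      int8: ~14 GB
#      int4:  ~7 GB -> comfortably fits on many consumer laptops
```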

As seen above, the model's performance falls around the range of DeepSeek R1 despite being 48 times smaller, meaning it punches way above its weight class.

The secret?

The Phi models are trained using carefully curated synthetic data, which means the base model is already quite strong. However, to create the reasoning versions, they distilled data from o3-mini, one of OpenAI’s strongest reasoning models (Microsoft has rights to OpenAI’s IP, so that’s fine).

In other words, the model was fine-tuned to behave like o3-mini, which means it exhibits the behaviors and reasoning capabilities of a much larger model at a fraction of the size. In some instances, it can even surpass the performance of its teacher (o3-mini).

TheWhiteBox’s takeaway:

Probably the most exciting progress in AI today is being made at the smaller sizes. The dream of running truly knowledgeable and thoughtful LLMs in the confines of our computers, without needing Internet access or worrying about data breaches, is closer by the day.

I am fascinated by the progress of reasoning models, but in our trend of the week below, we will return to Earth with a healthy dose of the actual reality of frontier AI reasoning.

BIG TECH
AI Spending Party Continues

After Amazon’s report yesterday, we now have the complete picture of AI spending for this quarter, which has not fallen; quite the contrary.

Big Tech companies invest in AI data centers in two ways: self-build, in which they invest in land, equipment, and so on (PP&E), and finance leases, used when they want to grow faster: they neither own nor operate the data center, but they have the right to use it. Generally, both count as CAPEX.

  • Amazon increased PP&E (Property, Plant, & Equipment) investments to $25 billion (this is self-build data center spending), plus additional commitments in property and equipment of an extra $3 billion.

  • Meta reached ~$13 billion, again mostly PP&E but also including data center leases.

  • Microsoft came in a little lower than expected, at $21.4 billion, but claimed it would still meet its end-of-year goal of $80 billion. In its case, the cuts are aimed further into the future, having canceled 2 GW of lease contracts, and to underscore this, it announced it will moderate spending from 2026 onward.

Microsoft’s CFO, Amy Hood, had an interesting comment: “spending will grow at a lower rate than FY 2025 and will include a greater mix of short-lived assets, which are more directly correlated to revenue than long-lived assets.”

What she’s implying is that from 2026 onward, they will focus less on buying land and long-lasting equipment in favor of shorter-lived assets like GPUs, which monetize better.

But don’t they need both? Sure, but what she’s saying is that they feel they have enough land and want to start actually deploying compute on that land.

TheWhiteBox’s takeaway:

I believe these five guys’ spending is the industry’s compass; if they falter, the party could end, so this is generally good news. However, I still need more information about the so-called ‘huge demand’ they claim to be seeing.

All these companies justified continuing to invest in AI by claiming they cannot meet the demand for AI compute. We will look into this in detail on Sunday, but their claims don’t seem to align with what enterprises are saying, which raises the question:

Is this ‘demand’ being monetized?

IDENTIFYING AIs
Sam Altman’s World Unveils Orb Mini

Sam Altman’s startup, Tools for Humanity, has introduced the Orb Mini, a compact, smartphone-like device designed to verify human identity through iris scanning.

Unveiled at the “At Last” event in San Francisco, the Orb Mini is a portable version of the company’s earlier Orb device. It aims to provide “proof of personhood” by assigning users a unique blockchain-based ID after scanning their eyes.

This initiative addresses the growing challenge of distinguishing humans from AI agents online. In fact, this company’s existence is based solely on the belief that AI agents and humans will be indistinguishable in the future.

Alongside the device launch, Tools for Humanity is expanding its World Network in the U.S., opening storefronts in cities such as Austin, Atlanta, Los Angeles, Miami, Nashville, and San Francisco.

TheWhiteBox’s takeaway:

I do share the sentiment regarding the indistinguishability of AIs and humans in the digital world.

In fact, just recently, a group of researchers at the University of Zurich carried out a highly controversial, undisclosed experiment with AIs in a Reddit forum, ‘r/Changemyview’, in which AIs disguised as humans tried to persuade others to change their views. Fascinatingly, the AIs not only matched human performance but surpassed it in some instances, obtaining up to six times as many ‘persuasion votes’ as the baseline.

But how were they so effective? Before replying, another bot combed through the target’s post history, learned about them and their beliefs, and then crafted responses that fit that user’s persona.

So, yeah, this risk is very real. However, I don’t think I’m ready to scan my eyeballs for this.

Also, this isn’t the only “proof of personhood” option I’ve seen. A while back, a group of prominent researchers in the AI space presented another proof-of-personhood system, which we talked about in this newsletter months ago.

In that case, the idea was to use zk-proofs to make the system privacy-preserving; unlike Sam’s approach, which uses your eyeballs to discern whether you’re human, that system doesn’t require scanning any part of your body, so you can prove you’re human without revealing your identity.

FRONTIER COSTS
The Hidden Costs in AI Models

I rarely trust the media to deliver insightful articles on AI, but this VentureBeat article is one of the exceptions.

In this study, they review the hidden costs of model APIs, finding, for instance, that Claude, despite being cheaper on a per-token basis, can be more expensive to deploy than OpenAI’s models because its tokenizer generates, on average, many more tokens for the same text.

In other words, the unit costs appear smaller, but the overall costs can be larger. To understand this, we need to understand what tokenization is in the first place, as it determines how you are charged.

As you probably know, AI models work on computers, and computers can only understand numbers. Thus, when we input a text sequence, we must somehow transform it into numbers.

Currently, language models have a fixed tokenizer that breaks sentences into chunks the model knows based on its token vocabulary, and each token is then automatically transformed into its numerical form. As you can see below, when we send GPT-4o a text sequence (I apologize for the self-bragging), it gets broken down into words or subwords, the famous ‘tokens’.

As you can see, the total token count is 27, meaning that, at OpenAI’s current API prices, the sequence would cost you about $3.75/1M tokens × 27 tokens ≈ $0.0001 to process (plus the cost of the answer tokens). But assuming Llama’s API had the same price, the same sequence processed by Llama 3-70B would be cheaper, because that model’s token count would be 20:

Therefore, the more tokens a model needs to represent a sequence, the more expensive it is to process. And while Anthropic seems cheaper in a Claude 3.5 Sonnet vs. GPT-4o comparison, it really isn’t.
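You can check this yourself with OpenAI’s open-source tiktoken library. A minimal sketch below; the price is the placeholder figure from the example above, not a quote of current pricing, and other providers’ tokenizers would need their own tooling.

```python
# Minimal sketch: the same text can cost different amounts on different APIs simply
# because each model's tokenizer splits it into a different number of tokens.
import tiktoken

PRICE_PER_MILLION_INPUT_TOKENS = 3.75  # placeholder $/1M input tokens, as in the example above

text = "Explain why two tokenizers can produce different token counts for the same sentence."

enc = tiktoken.encoding_for_model("gpt-4o")  # GPT-4o uses the o200k_base vocabulary
n_tokens = len(enc.encode(text))

cost = n_tokens * PRICE_PER_MILLION_INPUT_TOKENS / 1e6
print(f"{n_tokens} tokens -> ${cost:.6f} just to process the prompt")

# A model with a different vocabulary may need more tokens for the same text, so a
# 'cheaper' per-token price can still yield a higher total bill. The only lever you
# control is the prompt itself.
```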

TheWhiteBox’s takeaway:

The issue here is that there’s literally nothing you can do about it besides being more concise in your prompts, which will probably backfire because context is crucial for performance.

The tokenizer is fixed, meaning a model will tokenize your sequence the exact same way every time. Be very careful about costs that may not be conspicuous at first.

TREND OF THE WEEK
China Sounds the Alarm

A group of researchers at Tsinghua University (if it doesn’t ring a bell, it’s the MIT of the East) has presented research on reasoning models that has spooked many.

The truth?

It just states the obvious, but this industry is so high on AI fever that it has completely lost the plot, with claims of superintelligence being nigh when the truth is actually much less sexy.

Discover whether you’re under the ‘AI illusion’ and let me help you rethink your intuitions on frontier AI models.

Let’s dive in.

The Limits of AI

This newsletter is very optimistic about AI. Thus, it’s necessary to ground ourselves in reality every once in a while to avoid getting lost in the midst of capital-backed hype and remain aware of the limitations.

And this paper perfectly meets that goal. But first, a little background.

The Need for Speed

Just like Tom Cruise in 1986’s classic Top Gun, the AI industry was very much feeling ‘the need for speed’ just a few months ago.

Until the arrival of reasoning models, we had been stuck at the same level of “intelligence” with non-reasoning models (traditional Large Language Models) for two years since GPT-4 was first trained in the summer of 2022.

Everyone feared the party was coming to an end. Then, in September, OpenAI presented the ‘o’ family of models: essentially copies of non-reasoning models that, behavior-wise, work differently, generating a ‘chain of thought’ when questioned to maximize the chances of getting the answer correct.

Source: Sebastien Raschka

Just as you would improve your performance on a maths test if given more time, models see increased performance on the types of problems that benefit from ‘thinking for longer.’

And just like that, as the lights of the party were almost off, the party resumed as if nothing had happened, because we saw a new way to ‘reach AGI’: scaling inference-time compute, the industry’s way of saying increasing the compute a model spends on solving a problem in real time.
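To make ‘spending more compute at inference’ concrete, here’s a toy sketch of one simple form of it: sampling several answers and taking a majority vote (self-consistency). Longer chains of thought are another form. The ‘model’ here is a simulation with a made-up accuracy, not a real LLM call.

```python
# Toy illustration of scaling inference-time compute: instead of taking the model's
# first answer, sample several answers and majority-vote them. The 'model' is faked.
import random
from collections import Counter

random.seed(0)

def fake_model_answer(p_correct: float = 0.6) -> str:
    """Stand-in for one sampled chain of thought ending in a final answer."""
    return "42" if random.random() < p_correct else random.choice(["41", "43", "44"])

def answer_with_voting(k: int) -> str:
    """Sample k answers and return the most common one."""
    votes = Counter(fake_model_answer() for _ in range(k))
    return votes.most_common(1)[0][0]

for k in [1, 5, 25]:
    trials = 2000
    accuracy = sum(answer_with_voting(k) == "42" for _ in range(trials)) / trials
    print(f"k={k:>2} samples -> accuracy ~{accuracy:.0%}")
# More samples (more inference-time compute) -> higher chance the majority answer is right.
```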

But how do we get a model to think for longer?

Just RL It.

The method used is Reinforcement Learning (RL). At the risk of sounding like a broken record (I’ve touched on the topic multiple times): RL is a way to train models in which we reward the specific behaviors we want and punish the rest, thereby ‘reinforcing’ the desired behaviors.

I want to add that one crucial intuition about RL is that it’s designed to incentivize exploration. When the model faces a problem it doesn’t know how to solve (although this might not be true, as we’ll see in a minute), it can explore different ways to solve it, using the reward mechanism as guidance, until it reaches the correct response.

Therefore, with RL, you take an LLM that would otherwise blurt out the first response that ‘comes to its mind’ (the one most statistically likely based on its training data) and instead teach it to generate preliminary thoughts (making a plan, reflecting on previous answers, and so on) to maximize the chance of finding the correct path to a complex response by breaking the problem into separate steps. In other words, ‘reasoning behavior’ is forced into the model, which thereby becomes a ‘reasoning model.’
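To make the ‘reward what we want’ idea concrete, here’s a minimal sketch of the verifiable-reward-plus-baseline recipe in the spirit of methods like GRPO. Everything here (the reward function, the sampled completions, the numbers) is a hypothetical simplification, not any lab’s actual training code.

```python
# Minimal sketch of the reward signal behind RL-trained reasoning models: sample
# several completions, score each with a verifiable reward, and reinforce the
# above-average ones relative to the group baseline.

def reward(completion: str, ground_truth: str) -> float:
    """Verifiable reward: 1.0 if the final answer token matches, else 0.0."""
    final_answer = completion.strip().split()[-1]
    return 1.0 if final_answer == ground_truth else 0.0

# Pretend these were sampled from the model for the prompt "What is 17 * 3?"
completions = [
    "Let me think step by step. 17 * 3 = 17 + 17 + 17 = 51. Answer: 51",
    "17 * 3 is 54",
    "First, 17 * 2 = 34, plus 17 is 51. Answer: 51",
]
ground_truth = "51"

rewards = [reward(c, ground_truth) for c in completions]
baseline = sum(rewards) / len(rewards)
advantages = [r - baseline for r in rewards]  # positive -> reinforce, negative -> discourage
print(list(zip(rewards, advantages)))         # ~[(1.0, 0.33), (0.0, -0.67), (1.0, 0.33)]
```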

And just like that, reasoning models emerged as the salvation of the AI industry. Just as we once thought that making models bigger and bigger would take us to AGI, a belief that led to stagnation, the mantra was swiftly rewritten to: “increasing a model’s thinking time will take us to AGI.”

But is all this… actually true?

New Reasoning… or Faster Reasoning?

The researchers at Tsinghua, skeptical of some of the claims being thrown around, raised the question: Are we measuring correctly whether reasoning models improve the reasoning capabilities of non-reasoning LLMs?

Of course, based on their amazing results, your intuition has to be yes, right?

Well, hold your horses for one second.

Sampling Efficiency Matters

In general, to measure performance in AI, we use metrics like ‘pass@k,’ where ‘k’ is the number of samples the model generates. This translates to: If the model tries to solve a problem ‘k’ times, what are the chances that at least one is correct?

Naturally, the best indication of performance is ‘pass@1,’ meaning that I only give the model one chance to nail it. If we measure ‘pass@10’, we are measuring the expectation that the model will get at least one correct out of the 10 tries—which isn’t great performance, as you may guess.
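If you want to compute pass@k yourself, the commonly used unbiased estimator (popularized by OpenAI’s HumanEval/Codex work) is only a few lines of Python. The example numbers below are made up for illustration, not taken from the Tsinghua paper.

```python
# Unbiased pass@k estimator: given n samples per problem of which c are correct,
# estimate the probability that at least one of k randomly chosen samples is correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """1 - C(n - c, k) / C(n, k), i.e., 1 minus the chance all k picks are wrong."""
    if n - c < k:
        return 1.0  # not enough wrong samples to fill k picks: at least one must be right
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: a model got 5 correct answers out of 100 samples on a hard problem.
for k in [1, 10, 64]:
    print(f"pass@{k}: {pass_at_k(n=100, c=5, k=k):.2f}")
# pass@1 ~0.05, pass@10 ~0.42, pass@64 ~0.99 -> same model, very different-looking numbers.
```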

Sadly, that doesn’t mean labs don’t do things ‘their way’ to present their model in a better light, and things like pass@64 or even pass@1024 are surprisingly a thing!

Next time you see a model benchmark, look at the footnotes and you’ll realize that pass@1 is, in fact, a rarity, and the metrics used are much more feeble and gamified.

This is why AI is having so much trouble making it into actual deployments; we have a preconception of performance that isn’t actually true in areas where robustness matters (i.e., everywhere).

Either way, it’s safe to say that the value of ‘k’ is important. And for low values, reasoning models have unequivocally improved performance over non-reasoning models in complex tasks.

But the researchers then asked: what if we increase k? What if we give non-reasoning models more tries?

And when they did so, well, not only were non-reasoning models capable of getting the same nominal performance as their apparently-superior reasoning counterparts…

❝

But they actually got better results!

Looking at the images below, they compare a non-reasoning model (green) and the same model trained for reasoning (red) on two tasks: coding (the two left charts) and maths (right).

And here is where things get weird.

The reasoning model is clearly better for low values of k (when the model is given a small number of tries to get it correct). But as the number of allowed tries grows, performance converges, and, eventually, the non-reasoning model gets better results!

But what does this mean?

Well, it means two things:

  1. Reasoning training improves models’ sampling efficiency. In layman’s terms, they require fewer tries to get the problem right. This is good.

  2. However, for a larger number of tries, non-reasoning models catch up and, ironically, end up solving more reasoning-heavy problems than reasoning models.

This has two greater implications:

  1. First, it shows that reasoning models do not develop reasoning capabilities beyond what they already knew before becoming reasoning models. In other words, their reasoning abilities were already present in the pre-reasoning-training base model; reasoning training simply makes it easier to elicit those behaviors.

  2. Concerningly, reasoning models sacrifice broader reasoning capabilities in the name of sampling efficiency.

Using an analogy to explain point 2, it’s like comparing a generalist human engineer and one specialized in rocket propulsion; the latter sacrifices overall engineering reasoning across several areas by becoming excellent (more accurate and faster) in that particular field.

However, this is also where AI reasoning departs from human reasoning: rocket scientists can develop further reasoning capabilities beyond their training, unlike the AI model, which remains circumscribed to what it knows and only what it knows.

Furthermore, to solidify their claim, they tested different RL training methods to see whether one of them was at fault, but the differences were not statistically significant; no matter the method used, the same pattern emerged.

However, researchers found that distillation, unlike reasoning training, can introduce new knowledge into a model.

What is distillation?

Known as the teacher-student method, it involves training a model to correctly predict the next word in a sequence while also doing so in a way similar to a teacher model.

By having this dual objective (being a good LLM but also similar to another LLM), the teacher LLM can, for lack of a better term, ‘teach’ the student model new knowledge it might not have in its training data, once again positioning distillation as a core method for AI training (it already was because it’s the primary method for training small models that are as good as larger ones).
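To make that dual objective concrete, here’s a minimal sketch of a classic distillation loss. This is the standard logit-matching formulation, not Microsoft’s exact Phi-4 recipe (distilling from a closed model like o3-mini in practice typically means training on its generated outputs), but the ‘be a good LLM and imitate the teacher’ intuition is the same.

```python
# Minimal sketch of a distillation objective: the student is trained both to predict
# the next token and to match the teacher's output distribution, softened by a temperature.
import torch
import torch.nn.functional as F

vocab_size, batch, T = 1000, 4, 2.0

# Stand-ins for real model outputs: logits over the vocabulary for one next-token position.
student_logits = torch.randn(batch, vocab_size, requires_grad=True)
teacher_logits = torch.randn(batch, vocab_size)          # e.g., from the teacher model
target_tokens = torch.randint(0, vocab_size, (batch,))   # ground-truth next tokens

# 1) Ordinary next-token prediction loss against the ground truth.
ce_loss = F.cross_entropy(student_logits, target_tokens)

# 2) Match the teacher's softened distribution (KL divergence at temperature T).
kl_loss = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * (T ** 2)  # conventional scaling so gradients stay comparable across temperatures

alpha = 0.5  # how much weight to give the teacher vs. the raw data
loss = alpha * kl_loss + (1 - alpha) * ce_loss
loss.backward()
print(f"ce={ce_loss.item():.3f}  kl={kl_loss.item():.3f}  total={loss.item():.3f}")
```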

So, what’s the takeaway here?

We Saw This Coming

The results are very clear: reasoning models do not represent, at this moment in time, a path to reasoning beyond what AIs already know from experience (from their own training).

❝

This is something that has been proclaimed far and wide, and it’s blatantly false.

However, these results aren’t surprising if you’ve read this newsletter for a while. We have been adamant that AIs are fundamentally limited by what they know; they can’t reason about things they don’t know.

In reality, AI reasoning isn’t a faithful representation of real human reasoning. Humans reason with data they know, but we can also reason in situations we have never experienced before.

As Jean Piaget would say, “intelligence is what you use when you don’t know what to do.” Each and every frontier model today fails this simple test.

But what are they missing? Simple: adaptation capabilities.

That’s the key piece that humans have and AIs clearly don’t. That’s the word you need to remember whenever someone tells you that AIs are “as smart as PhDs.” No, they aren’t, because they can’t adapt to new data on the fly, a core feature of human intelligence that lets you gain new experience and build new intuition and reasoning. Intelligence is a flywheel of knowledge gathering, compression, and search.

AIs compress knowledge (non-reasoning models) and can search the solution space (reasoning models), but in both cases they invariably fall short when experience is unavailable. Thus, to close the loop, they need the capacity to learn about the task in real time, gather feedback, compress the reward signal, and learn.

  • To the industry, this research must serve as a way to stay humble about what we have achieved.

  • To me, it means nothing more than a reminder not to listen to overly dramatic statements about AI intelligence.

But here’s the thing: we don’t need true machine intelligence to change the world with AI. That’s the key takeaway for me.

That, and the fact that, until further notice, data remains king in AI.

THEWHITEBOX
Join Premium Today!

If you like this content, by joining Premium, you will receive four times as much content weekly without saturating your inbox. You will even be able to ask the questions you need answers to.

Until next time!

For business inquiries, reach out to me at [email protected]