China Sounds the 🚨, Spending, & New Models


THEWHITEBOX
TLDR;
Today, we have a newsletter packed with interesting news and insights, from leaks about xAI's latest supermodel, Grok 3.5, to Microsoft's new super reasoner, Phi-4-reasoning.
We also discuss Big Tech earnings and whether the AI spending party continues, as well as Sam Altman's latest venture and the hidden costs of AI models.
Finally, we look at a provocative research paper from China that challenges common assumptions about reasoning models, the industry's latest sweethearts; they aren't what you've been made to believe.

THEWHITEBOX
Things You've Missed By Not Being Premium
On Wednesday, we covered the release of China's new powerful model, as well as interesting releases by NVIDIA and DynaRobotics, the latter of which showcases the most advanced robustness we have ever seen in AI robots.
We also covered OpenAI's dizzying revenue projections, the latest features, and the greatest controversy: the story of the most absurdly sycophantic model the AI industry has ever seen.


FRONTIER MODELS
Will Grok 3.5 Be an Engineering Beast?
On Tuesday, Elon Musk confirmed the launch of Grok 3.5 next week (although it could be released any time now), making very strong claims about it.
But it seems that a small mistake by one of the engineers has revealed that xAI has up to sixty different Grok fine-tunes internally (sixty different versions of the model), with very surprising names like "grok-SpaceX." What this implies is that xAI is actively training its models on SpaceX, Neuralink, Starlink, or Tesla data, which could give it an unparalleled advantage in engineering efforts.
This could be a moat in itself, because some of these companies have quite literally unique data in crucial areas like space, autonomous driving, or even brain-computer interfaces.
For instance, a very recent research paper by Tufa Labs shows how small reasoning models trained on rocket-design data can outcompete not only frontier AIs but also human experts.
TheWhiteBox's takeaway:
Every day that passes, it's becoming harder and harder to bet against xAI. Not only have they managed to become a top-3 AI lab in under a year, but they also combine the necessary capital-raising capacity with the force of being part of the "Elon network," so it's hard to fathom a future where they don't blow most labs out of the water in terms of "raw intelligence."
But the point here is that this isn't a big deal, as AI is clearly not a winner-takes-all industry, and most labs will specialize in specific use cases:
OpenAI appears to have a great lead in everyday, consumer-end questions and tasks, with a clear interest in agentic computer use.
Google has great distribution for desktop-based search via its AI-enhanced search engine, and it is also great at coding.
Anthropic and Cohere seem to be focusing more on enterprise use cases, the former on agentic-style/coding work and the latter on enterprise search.
xAI could be developing a really strong presence in engineering use cases (design, prototyping, etc.).
Of course, none of them have renounced any particular use case, and all remain generalist solutions, but the end game will probably be more specialized.
REASONING MODELS
Microsoft's Phi-4 Reasoning: A Small Beast

Microsoft has released new open-source models, Phi-4-reasoning and Phi-4-reasoning-plus, accessible through Hugging Face for free download, which exhibit incredible performance for their size of "just" 14 billion parameters. This means they can be run on consumer-end hardware (the bare minimum for this model is around 32 GB of RAM, but that's acceptable for modern high-end laptops).
As seen above, the model's performance falls around the range of DeepSeek R1 despite being 48 times smaller, meaning it punches way above its weight class.
The secret?
The Phi models are trained on carefully curated synthetic data, which means the base model is already quite strong. However, to create the reasoning versions, they distilled data from o3-mini, one of OpenAI's strongest reasoning models (Microsoft has rights to OpenAI's IP, so it's fine).
In other words, the model was fine-tuned to behave like o3-mini, which means it exhibits the behaviors and reasoning capabilities of much larger models at a fraction of their size. In some instances, it can even surpass the performance of the teacher (o3-mini).
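If you want to try it yourself, here's a minimal sketch of loading the model locally with the transformers library (assuming the Hugging Face repo id microsoft/Phi-4-reasoning and the usual chat-template workflow; check the model card for the exact requirements):

```python
# Minimal sketch: running Phi-4-reasoning locally with transformers.
# Assumes the Hugging Face repo id "microsoft/Phi-4-reasoning" and ~32 GB of
# memory (or enough GPU VRAM); quantized variants will need less.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-reasoning"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to fit in memory
    device_map="auto",           # spread across available devices (needs accelerate)
)

messages = [{"role": "user", "content": "Prove that the square root of 2 is irrational."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reasoning models emit a long chain of thought before the final answer,
# so give them plenty of room to generate.
outputs = model.generate(inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```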
TheWhiteBox's takeaway:
Probably the most exciting progress in AI today is being made at the smaller sizes. The dream of running truly knowledgeable and thoughtful LLMs in the confines of our computers, without needing Internet access or worrying about data breaches, is closer by the day.
I am fascinated by the progress of reasoning models, but in our Trend of the Week below, we will come back down to Earth with a healthy dose of the actual reality of frontier AI reasoning.


BIG TECH
AI Spending Party Continues
After Amazon's report yesterday, we now have the complete picture of AI spending for this quarter, which has not fallen; quite the contrary.
Big Tech companies invest in AI data centers in two ways: self-build, where they invest in land, equipment, and so on (PP&E), and financial leases when they want to grow faster; in the latter case, they don't own or operate the data center, but they have the right to use it. Generally, both count as CAPEX.
Amazon increased PP&E (Property, Plant, & Equipment) investments to $25 billion (this is self-build data center construction), plus additional commitments in property and equipment of an extra $3 billion.
Meta reached ~$13 billion, again mostly PP&E but also including data center leases.
Microsoft came in a little lower than expected, at $21.4 billion, but claimed it would still meet the full-year goal of $80 billion. In Microsoft's case, the cuts land further in the future, having canceled 2 GW of lease contracts, and to underline this, they announced they will moderate spending from 2026 onward.
Microsoft's CFO, Amy Hood, had an interesting comment: "spending will grow at a lower rate than FY 2025 and will include a greater mix of short-lived assets, which are more directly correlated to revenue than long-lived assets."
What she's implying is that from 2026 onward, they will focus less on buying land and long-lived equipment in favor of shorter-lived assets like GPUs, which monetize more directly.
But don't they need both? Sure, but what she's saying is that they feel they have enough land and now want to start actually deploying compute on it.
Google (Alphabet) reported $17.2 billion, a 43% year-over-year increase while reaffirming its commitment to spend $75 billion during the year.
Oracle grew its AI spending to $2.3 billion, and it projected a whopping $16 billion for the entire year, double what it spent in 2024.
TheWhiteBox's takeaway:
I believe these five companies' spending is the industry's compass; if they falter, the party could end, so this is generally good news. However, I still need more information about the "huge demand" they claim to be seeing.
All of these companies justified continuing to invest in AI by claiming they cannot meet the demand for AI compute. We will look into this in detail on Sunday, but their claims don't seem to align with what enterprises are saying, which raises the question:
Is this "demand" being monetized?


IDENTIFYING AIs
Sam Altman's World Unveils Orb Mini
Sam Altman's startup, Tools for Humanity, has introduced the Orb Mini, a compact, smartphone-like device designed to verify human identity through iris scanning.
Unveiled at the "At Last" event in San Francisco, the Orb Mini is a portable version of the company's earlier Orb device. It aims to provide "proof of personhood" by assigning users a unique blockchain-based ID after scanning their eyes.
This initiative addresses the growing challenge of distinguishing humans from AI agents online. In fact, the company's entire existence is premised on the belief that AI agents and humans will be indistinguishable in the future.
Alongside the device launch, Tools for Humanity is expanding its World Network in the U.S., opening storefronts in cities such as Austin, Atlanta, Los Angeles, Miami, Nashville, and San Francisco.
TheWhiteBox's takeaway:
I do share the sentiment regarding the indistinguishability of AIs and humans in the digital world.
In fact, just recently, a group of researchers at the University of Zurich carried out a highly controversial, undisclosed experiment with AIs in a Reddit forum, r/ChangeMyView, in which AIs disguised as humans tried to persuade others to change their views. Fascinatingly, the AIs not only matched human performance, they even surpassed it in some instances, obtaining up to six times as many "persuasion votes" as the baseline.
But how were they so effective? Before replying, another bot stalked the target's post history, learned about them and their beliefs, and then crafted responses that fit that user's persona.
So, yeah, this risk is very real. However, I don't think I'm ready to scan my eyeballs for this.
Also, this isn't the only "proof of personhood" option I've seen. A while back, a group of prominent researchers in the AI space presented another proof-of-personhood system, which we talked about in this newsletter months ago.
In that case, the idea was to use zk-proofs to make the system privacy-preserving; unlike Sam's approach, which uses your eyeballs to discern whether you're human, that system doesn't require scanning any part of your body, so you can prove you're human without revealing your identity.
FRONTIER COSTS
The Hidden Costs in AI Models
I rarely trust the media to deliver insightful articles on AI, but this VentureBeat piece is one of the exceptions.
In it, they review the hidden costs of model APIs, finding, for instance, that Claude, despite being cheaper on a per-token basis, is more expensive to deploy than OpenAI's models because Anthropic's tokenizer generates, on average, many more tokens for the same text.
In other words, the unit costs appear smaller, but the overall costs can be larger. To understand this, we first need to understand what tokenization is, as it is the basis on which you are charged.
As you probably know, AI models work on computers, and computers can only understand numbers. Thus, when we input a text sequence, we must somehow transform it into numbers.
Currently, language models have a fixed tokenizer that breaks sentences into chunks the model knows based on its token vocabulary, and each token is then automatically transformed into its numerical form. As you can see below, when we send GPT-4o a text sequence (I apologize for the self-bragging), it gets broken down into words or subwords, the famous "tokens."
As you can see, the total token count is 27, meaning that, at OpenAI's current API prices, that sequence would cost you $3.75/1M tokens × 27 ≈ $0.0001 to process (plus the cost of the answer tokens). But assuming Llama's API had the same price, that same sequence processed by Llama 3-70B would be cheaper because the token count in that model would be 20:

Therefore, the more tokens a model needs to represent a sequence, the more expensive it is. And while Anthropic seems cheaper in a Claude 3.5 Sonnet vs. GPT-4o comparison, it really isn't.
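You can sanity-check this yourself. Here's a minimal sketch using OpenAI's tiktoken library to count tokens and estimate the input cost of a prompt (the $3.75-per-million figure is the illustrative price used above; actual prices and encodings change, so treat both as assumptions):

```python
# Minimal sketch: estimate input cost from the token count of a prompt.
# Assumes tiktoken is installed (pip install tiktoken) and uses the article's
# $3.75 per million input tokens as an illustrative price.
import tiktoken

PRICE_PER_MILLION_TOKENS = 3.75  # USD, illustrative figure from the text

def input_cost(text: str, model: str = "gpt-4o") -> tuple[int, float]:
    """Return (token_count, estimated_cost_in_usd) for a prompt."""
    enc = tiktoken.encoding_for_model(model)  # picks the tokenizer used by that model
    n_tokens = len(enc.encode(text))
    return n_tokens, n_tokens * PRICE_PER_MILLION_TOKENS / 1_000_000

tokens, cost = input_cost("TheWhiteBox is the best AI newsletter out there.")
print(f"{tokens} tokens -> ${cost:.6f}")
```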
TheWhiteBox's takeaway:
The issue here is that there is little you can do besides being more concise in your prompts, which will probably backfire because context is crucial for performance.
The tokenizer is fixed, meaning a model will tokenize your sequence the exact same way every time. Be very careful about costs that may not be conspicuous at first.

TREND OF THE WEEK
China Sounds the Alarm

A group of researchers at Tsinghua University (if it doesn't ring a bell, it's the MIT of the East) has presented research on reasoning models that has spooked many.
The truth?
It just states the obvious, but this industry is so high on AI fever that it has completely lost the plot, with claims of superintelligence being nigh when the truth is actually much less sexy.
Discover whether you're under the "AI illusion" and let me help you rethink your intuitions on frontier AI models.
Let's dive in.
The Limits of AI
This newsletter is very optimistic about AI. Thus, it's necessary to ground ourselves in reality every once in a while to avoid getting lost in the midst of capital-backed hype and remain aware of the limitations.
And this paper perfectly meets that goal. But first, a little background.
The Need for Speed
Just like Tom Cruise in 1986's classic Top Gun, the AI industry very much felt "the need for speed" just a few months ago.
Until the arrival of reasoning models, we had been stuck at the same level of "intelligence" with non-reasoning models (traditional Large Language Models) for two years, ever since GPT-4 was first trained in the summer of 2022.
Everyone feared the party was coming to an end. Then, in September, OpenAI presented the "o" family of models, essentially copies of non-reasoning models that behave differently: they generate a "chain of thought" when asked a question to maximize the chances of getting the answer right.

Source: Sebastien Raschka
Just as you improve your performance on a maths test when given more time, models see increased performance on the types of problems that benefit from "thinking about them for longer."
And just like that, with the lights of the party almost off, the party resumed as if nothing had happened, as we saw a new way to "reach AGI": scaling inference-time compute, the industry's way of saying "increasing the compute the model spends on solving the problem in real time."
But how do we get a model to think for longer?
Just RL It.
The method used is Reinforcement Learning (RL). At the risk of sounding like a broken record, because I've touched on the topic multiple times: RL is a way to train models in which we reward the specific actions we want and punish the rest, thereby "reinforcing" the desired behaviors.
I want to add that one crucial intuition about RL is that it's designed to incentivize exploration. When the model faces a problem it doesn't know how to solve (although this might not be true, as we'll see in a minute), it can explore different ways to solve it, using the reward mechanism as guidance, until it reaches the correct response.
Therefore, with RL, you take an LLM that would otherwise blurt out the first response that "comes to its mind" (the one that is most statistically likely based on its training data) and instead teach it to generate preliminary thoughts (making a plan, reflecting on previous answers, and so on) to maximize the chance of finding the correct path to a complex response by breaking the problem into separate steps. In other words, "reasoning behavior" is forced into the model, turning it into a "reasoning model."
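To make the reward idea concrete, here's a toy sketch of the kind of verifiable reward used for maths-style problems in this type of training; the names and the naive string-matching check are purely illustrative, not any lab's actual pipeline:

```python
# Toy sketch of a verifiable reward for RL-style reasoning training.
# All names here are illustrative; real pipelines use far more robust
# answer extraction and verification.
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the model's final answer, assuming it ends with 'Answer: <value>'."""
    match = re.search(r"Answer:\s*(.+)\s*$", completion.strip())
    return match.group(1).strip() if match else None

def reward(completion: str, ground_truth: str) -> float:
    """Reward the chain of thought only by whether its final answer is correct."""
    answer = extract_final_answer(completion)
    return 1.0 if answer is not None and answer == ground_truth else 0.0

# The RL algorithm (PPO, GRPO, etc.) then updates the model to make
# high-reward chains of thought more likely and low-reward ones less likely.
print(reward("Let me think step by step... Answer: 42", "42"))  # 1.0
```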
And just like that, reasoning models emerged as the salvation of the AI industry. Just as we once thought that making models bigger and bigger would take us to AGI, only to hit stagnation, the belief was swiftly rewritten to: "increasing a model's thinking time will take us to AGI."
But is all this… actually true?
New Reasoning… or Faster Reasoning?
The researchers at Tsinghua, skeptical of some of the claims being thrown around, raised the question: Are we measuring correctly whether reasoning models improve the reasoning capabilities of non-reasoning LLMs?
Of course, based on their amazing results, your intuition has to be yes, right?
Well, hold your horses for one second.
Sampling Efficiency Matters
In general, to measure performance in AI, we use metrics like "pass@k," where "k" is the number of samples the model generates. This translates to: if the model tries to solve a problem "k" times, what are the chances that at least one attempt is correct?
Naturally, the best indication of performance is "pass@1," meaning that I only give the model one chance to nail it. If we measure "pass@10," we are measuring the expectation that the model will get at least one correct out of 10 tries, which isn't great performance, as you may guess.
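For reference, here's a small sketch of the unbiased pass@k estimator popularized by OpenAI's Codex paper (not necessarily the exact formula the Tsinghua team used, but the standard way these numbers are computed): given n samples, of which c are correct, pass@k = 1 − C(n−c, k)/C(n, k).

```python
# Sketch of the standard unbiased pass@k estimator (Chen et al., 2021, Codex paper).
# n = total samples generated per problem, c = number of correct samples, k = budget.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k randomly drawn samples (out of n) is correct."""
    if n - c < k:
        return 1.0  # not enough wrong samples to fill k draws, so at least one is correct
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 100 samples per problem, 5 of them correct.
print(pass_at_k(100, 5, 1))   # ~0.05
print(pass_at_k(100, 5, 10))  # ~0.42
```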
Sadly, that doesn't stop labs from doing things "their way" to present their models in a better light, and things like pass@64 or even pass@1024 are surprisingly common!
Next time you see a model benchmark, look at the footnotes and you'll realize that pass@1 is, in fact, a rarity, and the metrics used are much more lenient and gamified.
This is why AI is having so much trouble making it into actual deployments; we have a preconception of performance that isn't actually true in areas where robustness matters (i.e., everywhere).
Either way, it's safe to say that the value of "k" is important. And for low values, reasoning models have unequivocally improved performance over non-reasoning models in complex tasks.
But the researchers then asked: what if we increase k? What if we give non-reasoning models more tries?
And when they did so, well, not only were non-reasoning models capable of getting the same nominal performance as their apparently superior reasoning counterparts…
But they actually got better results!
In the images below, they compare a non-reasoning model (green) and that same model trained for reasoning (red) on two tasks: coding (the two left panels) and maths (right).
And here is where things get weird.
The reasoning model is clearly better for low values of k (when the model is given a small number of tries to get it correct). But as the number of allowed tries grows, performance converges, and, eventually, the non-reasoning model gets better results!

But what does this mean?
Well, it means two things:
Reasoning training improves sampling efficiency. In layman's terms, the model requires fewer tries to get the problem right. This is good.
However, given a larger number of tries, non-reasoning models catch up, and not only that: ironically, they are capable of solving more reasoning-heavy problems than reasoning models.
This has two larger implications:
First, it shows that reasoning models do not develop reasoning capabilities beyond what they knew before they became reasoning models. In other words, their reasoning abilities are already present in the non-reasoning base model; reasoning training just facilitates the elicitation of those behaviors.
Second, and more concerningly, reasoning models sacrifice broader reasoning coverage in the name of sampling efficiency.
Using an analogy to explain point 2, it's like comparing a generalist human engineer and one specialized in rocket propulsion; the latter sacrifices overall engineering reasoning across several areas by becoming excellent (more accurate and faster) in that particular field.
However, this also separates AI reasoning from human reasoning: a rocket scientist can still develop new reasoning capabilities beyond their training, unlike the AI model, which remains circumscribed to what it knows and only what it knows.
Furthermore, to solidify their claim, they tested different RL training methods to see whether one was at fault for this, but the differences were not statistically significant; no matter the method used, the outcome was the same.
However, the researchers found that distillation, unlike RL-based reasoning training, can introduce new knowledge into a model.
What is distillation?
Known as the teacher-student method, distillation involves training a model (the student) to predict the next word in a sequence correctly while also matching the outputs of a teacher model.
Through this dual objective (being a good LLM but also behaving like another LLM), the teacher can, for lack of a better term, "teach" the student new knowledge the student might not have in its training data, once again positioning distillation as a core method for AI training (it already was, because it's the primary method for training small models that rival much larger ones).
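For intuition, here's a minimal sketch of what that dual objective looks like in code, using a standard soft-label distillation loss (cross-entropy on the ground-truth tokens plus a KL term against the teacher's distribution); the temperature and mixing weight are illustrative assumptions, not what Microsoft actually used for Phi-4:

```python
# Minimal sketch of a knowledge-distillation loss for language models.
# The weights, temperature, and tensor shapes are illustrative.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Combine next-token cross-entropy with a KL term against the teacher.

    student_logits, teacher_logits: (batch, seq_len, vocab_size)
    labels: (batch, seq_len) ids of the ground-truth next tokens
    """
    vocab = student_logits.size(-1)

    # 1) Be a good LLM: standard next-token prediction loss.
    ce = F.cross_entropy(student_logits.reshape(-1, vocab), labels.reshape(-1))

    # 2) Be similar to the teacher: match its softened output distribution.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)

    return alpha * ce + (1 - alpha) * kl
```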
So, what's the takeaway here?
We Saw This Coming
The results are very clear: reasoning models do not represent, at this moment in time, a path to reasoning beyond what AIs already know from experience (from their own training).
The contrary has been proclaimed far and wide, and it's blatantly false.
However, these results aren't surprising if you've read this newsletter for a while. We have been adamant that AIs are fundamentally limited by what they know; they can't reason about things they don't know.
In reality, AI reasoning isn't a faithful representation of real human reasoning. Humans reason with data they know, but we can also reason in situations we have never experienced before.
As Jean Piaget would say, "intelligence is what you use when you don't know what to do." This simple test is failed by each and every frontier model today.
But what are they missing? Simple: adaptation capabilities.
That's the key piece that humans have and AIs clearly don't. That's the word to remember whenever someone tells you that AIs are "as smart as PhDs." No, they aren't, because they can't adapt to new data on the fly, a core feature of human intelligence that allows you to gain new experience and build new intuition and reasoning. Intelligence is a flywheel of knowledge gathering, compression, and search.
AIs compress knowledge (non-reasoning models) and can search the solution space (reasoning models), but in both instances, they invariably fall short when experience is unavailable. Thus, to close the loop, they need the capacity to learn about the task in real time: gather feedback, compress the reward signal, and update themselves.
For the industry, this research should serve as a reminder to stay humble about what we have achieved.
To me, it is nothing more than a reminder not to listen to overly dramatic statements about AI intelligence.
But here's the thing: we don't need true machine intelligence to change the world with AI. That's the key takeaway for me.
That, and, well, the fact that, until further notice, data remains king in AI.

THEWHITEBOX
Join Premium Today!
If you like this content, join Premium and you will receive four times as much content weekly without saturating your inbox. You will even be able to ask me the questions you need answered.
Until next time!
For business inquiries, reach out to me at [email protected]