The Bubble Grows, SEALs, & China


THEWHITEBOX
TLDR;
Welcome back! Much of today’s news revolves around a scary idea: we are terribly deep into an AI bubble, and you’ll see why below. Additionally, we include mentions of China’s remarkably rapid progress, and a notable AI expert makes a case that most AI startups will go to zero.
Beyond that, there’s also room for hope in progress, thanks to insightful research by MIT on Self-Adapting Language Models.
Enjoy!

THEWHITEBOX
Things You’ve Missed For Not Being Premium
On Wednesday, we examined several fascinating studies, ranging from one that could be the key to AI curing addictions to proof that AI makes us worse, as well as China’s latest supermodel.
We also saw confirmation that AI-powered search is harming media companies, as well as xAI’s substantial capital spending and Anthropic’s multi-agent research architecture, culminating in an influencer earning millions with AI clones.
Enjoy!


FRONTIER AI
Gemini 2.5 Pro’s Strong ARC-AGI Results

Circles are results on ARC-AGI 1, triangles on ARC-AGI 2.
The ARC team behind super popular benchmarks like ARC-AGI, specifically designed to be memorization-resistant, has published Google’s Gemini 2.5 Pro results.
And they are quite good. But why is this particular benchmark so relevant?
As we have commented several times, most of the benchmarks we test frontier AI models on are susceptible to being memorized, or, in plain words, cheated.
After all, how do we discern whether the model is reasoning its way to a result or simply regurgitating it?
Benchmarks like ARC-AGI emerged to make that separation: if a model solves this benchmark, it's more or less guaranteed that memorization has not played a crucial part, opening the door to a serious discussion of whether the model is 'intelligent.'
In this case, the model reaches 41% on ARC-AGI 1 and 4.9% on ARC-AGI 2. But those numbers look really low, right?
Why am I saying ‘great results’?
Simple, because I’m resistant to this industry’s gaslighting, and a 41% in this benchmark is leaps and bounds more impressive than a 70% on popular benchmarks like MMMU-Pro, a contaminated benchmark (most models have seen the answers) that can be memorized.
But this is also a reminder of how unintelligent these models are. They are incredible at imitating intelligence, but when they have to elicit real intelligence, well, you get a 4.9% score.
Hence, currently, AI is more useful than it is intelligent. And even the former is exaggerated.
If you’re here, I assume you believe AI will change the world. And it will, but that doesn’t mean we can’t be honest about the current state of the industry.
CODING
OpenAI’s Codex is a PR machine

As shown in this tweet, OpenAI’s autonomous background software agent, Codex, is merging an overwhelming number of pull requests (PRs), demonstrating the product's impact across the software industry.
Pull requests are code edits to a repository, meaning that the AI is actively making a huge amount of changes to coding repositories. The fact that they are being merged means that someone has validated the code, a sign that the model not only identifies issues but also solves them.
TheWhiteBox’s takeaway:
In layman's terms, the era of autonomous software engineers is well underway. While AI's impact in other areas is still open for debate, it's hard to disagree with the fact that AI is, indeed, transforming the software industry forever.
However, we must be cautious with such claims because, at least for this metric, one can create a PR for the simplest of things, such as changing a variable's name. In other words, the raw number of PRs is not proof of truly transformative productivity; that would only become evident if we examined the actual impact of each merged PR.
Still a mind-blowing statistic, though.


POACHING
Meta Close To Hiring Top AI Talents
As reported by The Information, Mark Zuckerberg is in advanced talks to onboard both Nat Friedman (former GitHub CEO, under whose watch GitHub Copilot was created) and Daniel Gross, one of the three co-founders of Safe Superintelligence Inc. (SSI), Ilya Sutskever's Israeli-American AI company, valued at $32 billion despite having no product.
This isn't Zuck's only recent move, as he has gone quite literally berserk buying people from around the industry after Llama 4's massive flop, including Meta's 'acquihire' of Scale AI for $14 billion, onboarding its CEO, Alexandr Wang, in the process (Nat and Daniel will report to Wang).
And this is only a part of the process, as we already covered the massive $100 million in signing bonuses (yes, that’s just the signing bonus).
TheWhiteBox’s takeaway:
Two conclusions:
There’s a bubble.
Zuck is betting his entire company on AI.
And, well, one last conclusion: did I mention we’re in a bubble?
AI has reached a point where companies are buying entire companies just to sign their key people, billions of dollars to hire one guy. That doesn’t seem healthy.
ENTERPRISE
OpenAI Discounts Enterprise Offering by 20%
As reported by The Information again, OpenAI is now offering its enterprise ChatGPT solution at a discount.
While this move could be seen as a 'gotcha' to Microsoft amid their deteriorating relationship, let's be real: nobody discounts a product that is selling really well just because they're mad at their primary investor.
We’ve been very insistent on OpenAI’s revenues being massively dependent on consumer instead of business, but here’s a graph that illustrates it better than I could ever do with words:

See that small red part? Yes, that’s what OpenAI expects to get from Enterprise throughout the decade. OpenAI’s revenues are experiencing robust growth, but this growth is becoming increasingly concentrated by the day.
TheWhiteBox’s takeaway:
If you don't mind, I'll brag a little bit.
If you’re a recurrent reader of this newsletter, you know AI revenues are mostly a barefaced lie, and that as long as enterprises don’t start spending massively, this is just one gigantic bubble.
In fact, I would bet that a significant portion of the cited AI revenue is incestuous, meaning Big Tech invests money in startups, the startups spend that money right back on the investors' clouds, and the hyperscalers then book it as revenue for their cloud businesses.
Generally, Hyperscalers don’t give you money, they give you discounted access to their compute.
Nonetheless, we already have strong accusations of several startups fraudulently inflating their Annual Recurring Revenue (ARR), lying about customers they never had, or falsely claiming to use AI when, in reality, it was Indian developers working under the hood.
AI is legit, but don't let that legitimacy make you believe this industry isn't living on borrowed time.
BUBBLE BEHAVIORS
Thinking Machines Raises $2 Billion
And continuing on my journey to show you how much we are in a bubble, Thinking Machines, a months-old startup with many star researchers (mainly from OpenAI), has raised $2 billion at a $10 billion valuation… despite having no product.
TheWhiteBox’s takeaway:
The company, which has no product beyond a simple website, is said to be focused on multimodality.
Still, the mere presence of ex-OpenAI CTO Mira Murati, OpenAI co-founder John Schulman, and other notable figures in the space, like Alec Radford, seems to be enough for investors to give them whatever they want, even granting Mira a golden vote that allows her to overrule the Board on anything she disagrees with.
What’s the point of the Board, then?
There’s a lot of money to be made in AI, but there’s plenty more to be lost. Again, extremely concerning bubbly behavior.


VIDEO GENERATION
China, China, China…

While immersed in a deep economic crisis, China is killing it in AI. Not only do they have another supermodel in MiniMax M1 (covered here) and a new super Deep Researcher in Kimi72B that is allegedly better than ChatGPT's version, but they have also been the kings of video-generation models for quite some time.
MiniMax's new product, Hailuo 02, has surpassed Google's Veo 3 as the best video-generation model according to Artificial Analysis. But get this: it isn't even the best model out there, as ByteDance's Seed AI lab has an even better one (Seedance, shown in the thumbnail).
The difference between these two models is small, but they both exhibit a +100-point advantage over Google's model, which is substantial in Elo-based ratings.
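For intuition on what a +100-point gap means (a hedged aside, assuming the standard Elo expected-score formula behind these arena-style ratings):

```python
# Expected head-to-head win probability for a model rated +100 Elo points
# above its opponent, using the standard formula E = 1 / (1 + 10^(-diff/400)).
def elo_win_prob(rating_diff: float) -> float:
    return 1.0 / (1.0 + 10 ** (-rating_diff / 400.0))

print(round(elo_win_prob(100), 3))  # ~0.64: wins roughly 64% of comparisons
```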
Furthermore, in fourth place we have, well, another Chinese model, Kling, and the only other Western entries in the top 10 are Google's older Veo 2 and Runway's Gen-4, with OpenAI's Sora nowhere to be seen.
In similar news, Midjourney has finally released its first video-generation product. Looks good… like every single AI demo.
TheWhiteBox’s takeaway:
In some areas, China is already ahead. And what’s more, they publish research on their models, and they are open-sourcing their breakthroughs.
I’ve said it, and I’ll say it again: this is how you win tech wars. While US AI Labs don’t share information and often compete, poaching each other's top talent on a massive scale, Chinese labs collaborate, share discoveries, and progress together.
I’m not implying that Chinese Labs don’t compete with each other; China’s internal market is as competitive as it gets, a full-blown hypercapitalist environment. However, they understand that, in order to really win, all of them must win.
And the proof is in the pudding; it's no coincidence that all of China's top AI labs now have a powerful video model: they are acting as one single team. And for people who wish to use open-source text models, well, the best ones are all Chinese (Qwen, DeepSeek, MiniMax, Kimi… they are all Chinese).
GPU export controls will only take the US so far. Open-source is what really creates traction and goodwill around your products. Otherwise, I wouldn’t be using Chinese models in my local use cases.
The US is fumbling this massively.
AI PRODUCTS
Is the App Layer Worth Anything?
Here's a brief moment from an interview with George Hotz, one of the most famous hackers on the planet and an expert in AI hardware (and AI in general; few people command more authority than he does), in which he states that most application-layer companies will eventually go to zero despite some of them having 10-figure valuations.
Application-layer AI startups, such as Sierra, Cursor, and Windsurf (acquired by OpenAI), build their entire business on the shoulders of models provided by companies in the model layer, including OpenAI, Google, or xAI.
Reality or exaggeration?
TheWhiteBox’s takeaway:
You’ve read it here first.
One of my most contrarian takes is that, as Hotz states, most of the value in the industry will accrue to the model-layer companies, the owners of the models, and that they will only leave the less-attractive segments for application-layer companies to compete in.
In the more attractive segments, the model-layer companies, which are hugely capitalized, will outcompete them, since their rivals have built their entire business around, well, their models.
This is why I can guarantee you that most VCs will lose money, as they are all betting on the application layer as their last place to make money (they aren't rich enough to play in any of the other layers unless they are top VCs like a16z, Thrive Capital, or SoftBank).
As I've said in the past, the only way a software company can survive this change is by becoming an AI tool (think MCP servers, like the ones I'm showing you in this very newsletter) that autonomous AIs can call when they need to.
In other words, instead of framing your value proposition as ‘software that uses AI for x’, it should be ‘software used by AI agents for x.’
Subtle yet deadly difference.
Therefore, if you’re an AI startup that still thinks software is the same as five years ago, you’re in for a nasty surprise.
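To make the 'software used by AI agents' framing concrete, here is a minimal sketch of exposing a product capability as an agent-callable tool. It assumes the official Python MCP SDK's FastMCP interface, and `create_invoice` is a hypothetical stand-in for whatever your product actually does:

```python
# Minimal sketch: wrap an existing product capability as an MCP tool that
# autonomous agents can discover and call (assumes `pip install mcp`).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("invoicing")  # the server that agents connect to

@mcp.tool()
def create_invoice(customer_id: str, amount_eur: float) -> str:
    """Create a draft invoice and return its ID (stubbed for illustration)."""
    return f"draft-invoice-{customer_id}-{amount_eur:.2f}"

if __name__ == "__main__":
    mcp.run()  # exposes the tool over MCP (stdio transport by default)
```

The value proposition becomes the tool's reliability and the data behind it, not the chat window in front of it.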

TREND OF THE WEEK
Self-Adapting Language Models

One thing AI models are embarrassingly bad at (to be clearer, they simply can't do it) is adapting to new data. We are claiming 'human-level intelligence' for static digital files that have no mechanism whatsoever to adjust to an ever-changing world.
For that reason, when fantasizing about AI’s future, AI enthusiasts tend to envision a future where AI models perform self-improvement, the moment they can optimize themselves without human intervention.
This is also one of the typical existential risks shared by doomers, who argue that a self-improving AI could rise above us and destroy civilization.
Now, whether you believe this is a good idea or not, a group of researchers at MIT thinks they have a potential solution they call SEAL, or SElf-Adapting Language Models.
And the excellent results speak for themselves.
Anthropomorphized Static Blobs
Currently, language models, which some incumbents claim are "smart as PhDs" (which is absolute horseshit, by the way), are static, meaning they have a knowledge cutoff after which they stop learning.
But what does that mean?
They Don’t Know What Happened Today.
To get the models you and I use, they endure a long training regime in which they first learn to imitate the world (via text) and then, in the case of reasoning models, also undergo trial-and-error training, with the hope that, given enough trials, the models develop new skills (they do, as we saw last week).
Once the model is trained to our liking, it’s deployed.
However, from then on, the model enters ‘inference mode’, where it performs an infinite number of interactions with users, but will learn nothing, and I mean nothing, from those interactions.
But don’t some of these products have ‘memory’ features so they do remember past conversations?
Yes, but this feature is context-enhancing, not real learning.
If the model identifies something worth remembering, it adds it to 'memory,' which in reality means that piece of information will be prepended to the model's prompt in every future interaction it has with you. But the model isn't learning anything; it just has updated information in the prompt for future interactions with you.
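To illustrate (a toy sketch, not any vendor's actual implementation), 'memory' is just text stitched into the prompt before each call; the model's weights never change:

```python
# Toy illustration: 'memory' features are prompt stitching, not learning.
# The underlying model's weights are untouched; facts are simply prepended.
memories = ["User's name is Ana", "User prefers metric units"]

def build_prompt(user_message: str) -> str:
    memory_block = "\n".join(f"- {m}" for m in memories)
    return (
        "Things to remember about this user:\n"
        f"{memory_block}\n\n"
        f"User: {user_message}\nAssistant:"
    )

print(build_prompt("How tall is Mont Blanc?"))
# All the 'remembering' lives in this string; the model itself stays static.
```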
But why is a static model a problem?
Imagine your current self having to live the rest of your life without learning anything.
For instance, a friend might betray you tomorrow. Too bad, because they will be able to betray you every single day for the rest of your life, since you don't internalize anything; you're a static blob of knowledge and experience from the past, unable to learn anything new.
That’s the reality of our models.
The solution, of course, is to retrain them with new data, a process known as 'fine-tuning.' This is done periodically (for instance, OpenAI's GPT-4o has changed multiple times in the past year, despite being the "same" model), but it requires a significant amount of human data annotation, with members of OpenAI's technical staff collecting user feedback and data and preparing it for retraining.
This works, but it is incredibly costly and has to be done all at once every several weeks or months. Instead, for more timely updates, the models rely on search engines like Perplexity, Google, or Bing to retrieve updates in real-time.
However, this is not the model learning anything new; it’s very similar to the memory feature we discussed earlier: simply adding a blob of context extracted from the search engine to the prompt and hoping the model utilizes it.
In reality, though, the actual model remains unchanged, ever-ignorant of world changes, unable to adapt.
But what if there were a way to enable AIs to learn as they interact with you?
This, known as test-time training, is an idea that MIT may have finally streamlined.
Test-Time Training
For centuries, humans have relied chiefly on deductive reasoning.
A genius amongst us observes something or has a unique idea, tests it, and propagates it as a new rule of nature. As individuals, we can't run infinite tests or process large amounts of data, so we rely on these leaps of insight.
Thus, deductive reasoning involves moving from the general to the specific. We propose a new theory that is logically guaranteed to explain how the world works. It works, but requires genius-like behavior from the originator of the idea.
With AI, we have opened civilization to inductive reasoning (from the specific to the general). As these models can process vast amounts of data, we can make discoveries by observing patterns across massive datasets.
This is how models like ChatGPT learn: by replicating a larger-than-life dataset, compression happens, and the model eventually finds the patterns that govern language (grammar, translation, etc.).
So, what is test-time training then?
Test-Time Training, or TTT, is the concept of training a model while it's making predictions, allowing it to improve at the task it’s facing. In other words, it runs into a task it does not know how to solve, figures it out, and internalizes the learning, directly updating the weights.
This introduces a “new” form of discovery, transductive learning (which humans also heavily leverage), in which the goal isn’t generalization (learning something that applies to other data, as in deductive and inductive learning), but instead learning how to perform a precise task very well. Put another way, reasoning from specific examples to other specific examples without ending with a general rule, just one that works for this case.
Therefore, our hope is that, through TTT, AIs will learn on the fly, adapting themselves to “known unknowns,” a critical component of human intelligence.
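As a rough sketch of the core mechanic (hedged: this is generic test-time training with Hugging Face transformers, not SEAL's exact recipe), the model takes a few gradient steps on a new passage at inference time before answering questions about it:

```python
# Generic test-time-training sketch (not the authors' code): take a few
# gradient steps on a new passage right before answering questions about it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-1B"  # any small causal LM works for the sketch
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

passage = "Some brand-new information the model has never seen before."
batch = tok(passage, return_tensors="pt")

model.train()
for _ in range(3):  # a handful of steps: the learning happens at inference time
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()  # the weights themselves now encode (some of) the new passage
```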
With adaptability, one could seriously start considering the possibility that we might end up creating real machine intelligence.
However, this is easier said than done, and more often than not, it requires human intervention.
But, what if we teach models to self-adapt? Enter SEAL.
A Path to Self-Adaptation?
As mentioned, the idea of TTT is straightforward: upon encountering a new text passage, the model undergoes a self-adaptation process, internalizing the latest data in real-time.
But SEAL goes beyond that.
Instead, the idea is to train the AI to improve this self-adaptation. The point isn’t only that the model learns new stuff on the fly, but that it has been literally trained to do so. In other words, we are actively training an effective transductive learner.
Therefore, SEAL involves two phases.
First, we aim to train the best possible self-adapter, one that generates data that facilitates effective learning.
After training, we deploy the self-adapter in the wild and let it run free, learning about the world.
Of course, the breakthrough comes in step 1, so let’s explain that in detail.
The Genius Looped Trick
Two ideas must be briefly covered (we have done so multiple times already): RL and LoRA adapters.
Reinforcement Learning (RL): Mentioned earlier, it’s the famous trial-and-error method. Instead of training models to imitate data, we give them a goal, and the model will try and fail until it figures it out.
LoRA: A clever trick to fine-tune models without having to change the entire model. The trick is to train small adapter modules and 'attach' them to the larger model, adapting its behavior accordingly. Think of it as adding a chip to your head that suddenly makes you know something; if you detach the chip, you forget it. This is how Apple Intelligence works, by the way.
The reason we use LoRAs is that we don’t actually want to train the base model until we verify the self-edit was good. As LoRAs enable us to train attachable surrogate models without modifying the base model, we can create a scalable process to teach a model to learn.
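For the curious, here is roughly what attaching such an adapter looks like in practice using the popular Hugging Face PEFT library (a hedged sketch; the paper's exact configuration may differ, and the hyperparameters below are illustrative):

```python
# Sketch: attach a small, detachable LoRA adapter to a frozen base model
# using Hugging Face PEFT. Only the adapter's tiny matrices are trained.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")

lora_cfg = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor for the update
    target_modules=["q_proj", "v_proj"],  # which layers receive adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)  # base weights frozen, adapter trainable
model.print_trainable_parameters()      # typically well under 1% of all weights
# Train the adapter on a candidate self-edit, then keep or discard it;
# the base model itself is never modified.
```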
The SEAL process is as follows:
The model is shown a new text passage.
It generates 'self-edits,' transforming the passage into 'learnable statements.'
As this is a trial-and-error exercise, the model explores several self-edit alternatives.
For every proposal it makes, it trains a small adapter.
We test every adapter, checking whether the newly augmented model has learned the facts generated by that self-edit.
The self-edits that resulted in a well-performing adapter are then used to update the self-editing policy.
By ‘self-editing policy’ we mean the model.
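Here is the whole loop as a toy, runnable Python sketch (my own hedged reconstruction of the idea, with stubbed components; it is not the authors' code):

```python
# Toy sketch of the SEAL outer loop with stubbed components
# (a reconstruction of the paper's idea, not the authors' code).
import random

K = 4  # number of candidate self-edits explored per passage

def generate_self_edit(passage: str) -> str:
    # Stub: the real model rewrites the passage into 'learnable statements'.
    return f"learnable statements derived from: {passage[:30]}..."

def train_lora_adapter(self_edit: str) -> dict:
    # Stub: the real system fine-tunes a small LoRA adapter on the self-edit.
    return {"trained_on": self_edit}

def evaluate(adapter: dict) -> float:
    # Stub: the real system checks whether the augmented model learned the facts.
    return random.random()

def update_policy(candidates: list[str], rewards: list[float]) -> None:
    # Stub: RL step that reinforces self-edits whose adapters scored well,
    # i.e., updates the self-editing policy (the model itself).
    best_reward, best_edit = max(zip(rewards, candidates))
    print(f"reinforcing: {best_edit!r} (reward={best_reward:.2f})")

for passage in ["A new text passage the model has never seen."]:
    candidates = [generate_self_edit(passage) for _ in range(K)]
    adapters = [train_lora_adapter(c) for c in candidates]
    rewards = [evaluate(a) for a in adapters]
    update_policy(candidates, rewards)
```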

But what’s the point?
The idea is that by doing this at scale, the model will improve over time in generating self-edits, thereby creating a good self-adapting model—a model that can be deployed in the wild and is highly efficient at learning new information or skills.
In the image below, you will appreciate how the model, after each round of improving its self-editing capabilities, becomes increasingly proficient at generating the trainable statements.

Hence, once we have taught the model to be a good learner, we can deploy it in the wild and trust it to learn on the fly.
Results
The model was evaluated in two types of learning: abstraction and knowledge. In the first one, we are testing whether the model can solve complex pattern puzzles similar to those found in IQ tests. The second one tests the capacity to internalize new information.
In few-shot abstract reasoning, using a subset of the ARC benchmark (a previous version of the one we discussed earlier, for which Google got good results), SEAL learned to autonomously select the best data augmentations and training parameters for new tasks. This resulted in a success rate of 72.5%, dramatically outperforming standard in-context learning (0%) and a baseline model using unoptimized self-edits (20%).
Going from 0% to 72% without prior training on benchmark data is absolutely impressive. It's literally proof that the model can adapt to the task. And we aren't talking about a powerful model; the researchers used Llama 3.2 1B, a very dumb model by modern standards.
In the domain of knowledge incorporation, SEAL was tasked with integrating factual information from text passages to answer questions without having access to the original text. Fine-tuning on its self-generated data boosted question-answering accuracy to 47.0%, superior to fine-tuning on the raw text alone (33.5%) and even surpassing the performance of synthetic data generated by the larger GPT-4.1 model (46.3%). This proves that SEAL training actually worked, as the model beat a far more capable model at generating self-edits (in standard terms, GPT-4.1 is far more capable than the model used here, Qwen2.5-7B).
In summary, SEAL works incredibly well in allowing models to acquire new skills or new facts.
Closing Thoughts
While most research focuses on making incremental improvements to what we have, other research aims to open new paths of progress.
This study belongs in the latter group.
Some problems do remain, though. As with any RL training pipeline, we still rely on verifiable rewards (in this case, the data the model was trained on had a clear correct answer), so this doesn't solve what is clearly the main unknown in AI today: how to train models in a trial-and-error regime in areas where we can't tell whether a trial went well (because then we can't measure the error, which is the learning signal).
Another problem here is that we haven’t yet fully figured out how to prevent catastrophic forgetting, which occurs when models forget older information when learning new material.
There are areas for improvement, but this research deflates the bubble not by releasing a slightly larger model with the same limitations as the previous ones, but by tackling AI's real unsolved issues.
This is what real progress looks like.

THEWHITEBOX
Join Premium Today!
If you like this content, join Premium to receive four times as much content weekly without saturating your inbox. You will even be able to ask the questions you need answers to.
Until next time!
Give a Rating to Today's Newsletter
For business inquiries, reach out to me at [email protected]