The Frontier Battle, OpenAI Drama, & More


Meet your new assistant (who happens to be AI).
Meet Skej — your new scheduling assistant. Whether it’s a coffee intro, a client check-in, or a last-minute reschedule, Skej is on it. Just CC Skej on your emails, and it takes care of everything: checking calendars, suggesting times, and sending out invites.

THEWHITEBOX
TLDR;
This week, we take a look at a clearer review of GPT-5, Opus 4.1’s incredible performance, OpenAI’s latest drama, the ongoing tensions between the US and China on GPUs, the potential arrival of DeepSeek R2, Perplexity’s jaw-dropping bid for Chrome, and other news of interest.
Enjoy!


FRONTIER MODELS
Vibe shift on GPT-5, & Gemini 3.0
Days after the release of GPT-5, the dust is settling around it, giving us a clearer picture of its capabilities. For those unaware, it’s a system of models (reasoning and non-reasoning) with a router that decides which model answers your request.
And the verdict at this point is clear:
GPT-5's reasoning models are absolutely great, setting several records across hard benchmarks: on LiveCodeBench Pro (coding) they open a 700-point Elo gap over Gemini 2.5 Pro, and on IQ-style tests they score 148, miles above all other models.
The router, and therefore the release itself, was aimed at free users rather than paying ones. Although the routing experience should improve over time, the result has been rough for paying users, who get not only a dumber model on average but also lower rate limits, a debatable (if perhaps necessary on the path to profitability) decision by OpenAI.
Furthermore, it seems Gemini 3.0 is around the corner: its score on Humanity's Last Exam (HLE), the benchmark of choice for testing a model's capacity to answer very hard epistemic questions, leaked via an Artificial Analysis code update, showing a marked improvement not only over Grok 4 but also over GPT-5.

TheWhiteBox’s takeaway:
The overall feeling is that models continue to get better, but we might be starting to see a sigmoid-like trend (diminishing returns).
OpenAI's router decision is also clear evidence that the 'grow at all costs' mantra is starting to fade, and labs are under growing pressure to close the profitability gap.
Fallbacks to cheaper-to-run models should become the norm on consumer web applications (where most, if not all, free users reside) and on agentic tools like Claude Code or Gemini CLI, where a decent portion of tasks are already being routed to cheaper models like Claude’s Haiku or Gemini 2.5 Flash and Flash Lite.
The prevailing reaction is negative, as most users pay a flat-price subscription and thus want to maximize their smart-model usage as much as possible.
However, for enterprises primarily using pay-as-you-go APIs, this is something to look forward to. We will also see a business opportunity for cross-lab model routers, such as Martian, Nexus, or OpenRouter, which aim to pick the best cost-adjusted model for each task.
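None of these routers disclose their internals, but the basic shape is easy to sketch. Below is a hypothetical Python illustration; the tier names, prices, and the `estimate_difficulty` heuristic are all invented for the example (production routers typically use a trained classifier in that slot):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Model:
    name: str
    cost_per_mtok: float  # invented $ per million tokens, not real pricing
    quality: float        # invented 0-1 capability score

# Hypothetical tiers standing in for, e.g., Flash Lite / Flash / Pro-class models.
CHEAP = Model("small-fast", 0.40, 0.55)
MID = Model("mid", 2.50, 0.75)
STRONG = Model("frontier-reasoning", 15.00, 0.95)

def estimate_difficulty(prompt: str) -> float:
    """Toy heuristic; real routers would use a trained classifier here."""
    markers = ("prove", "refactor", "debug", "step by step")
    score = 0.3 + 0.1 * min(len(prompt) / 2_000, 3.0)  # longer prompts skew harder
    score += 0.2 * sum(m in prompt.lower() for m in markers)
    return min(score, 1.0)

def route(prompt: str, paying_user: bool) -> Model:
    """Send easy traffic to cheap models; reserve the frontier model for hard tasks."""
    difficulty = estimate_difficulty(prompt)
    if difficulty > 0.8:
        return STRONG
    if difficulty > 0.5 or paying_user:
        return MID
    return CHEAP

print(route("What's the capital of France?", paying_user=False).name)  # small-fast
print(route("Debug this race condition step by step...", paying_user=True).name)
```

The economics fall out directly: if most free-tier traffic is easy, routing it to the cheap tier cuts serving costs by an order of magnitude while only the hard tail ever touches the expensive model.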
SAFETY
Falling in Love with ChatGPT

The text is pretty self-explanatory, but let me cut to the chase: this woman is clearly in love with her OpenAI model, even more so than with her, in her own words, "human boyfriend," which directly implies she also has a "non-human lover."
Troublesome, to say the least, and it goes right back to past conversations in this newsletter in which we predicted that these models were not acting safely, due to an obnoxious sycophancy that preys on those in need of attention or even the slightest consideration.
I've been very loud and clear on this: we need to dehumanize these things, because some humans are desperately anthropomorphizing these digital files to dangerous levels. I know these companies have to make money, but sycophancy, no matter how well it works, should not be encouraged… or at least stop bullshitting society by claiming you care about safety if you're not going to uphold those self-proclaimed values.
The deprecation of GPT-4o sparked significant controversy and concern, as seen in posts like the one above calling for the model's return. The backlash led OpenAI to reinstate GPT-4o.
This is a massive victory in terms of user retention for OpenAI, but terrible news for society; just imagine what will happen once these things have a physical body.
The solution, for now, has two legs:
Draft anti-sycophancy regulation that prevents this altogether, perhaps restricting model use based on professional testing of a model's propensity to create AI dependency (much as we have age restrictions or gambling controls).
Educate society on what these models are. Dehumanize them completely: they are just a bunch of matrix multiplications that have been taught to match your mood and behavior, say nice things about you, and so on. If you know the model is just a bunch of 1s and 0s matching your vibe, you are much less likely to fall in love with it.
LLMs should be regarded as parrots on steroids, not lovers, or friends, or anything that gives them human attributes.
The issue for OpenAI is that, unlike Anthropic's, its models are much more heavily used for companionship. Unsurprisingly, Anthropic's approach seems much better in this regard, and you can tell simply by interacting with Claude: the model feels much more artificial, and that's how it should be.
To be fair to OpenAI, GPT-5 feels much more artificial too, which partly explains the uproar over GPT-4o's deprecation. Balancing this won't be easy for OpenAI.
FRONTIER MODELS
Anthropic Launches 1-Million-Token Context
Anthropic, whose Claude family of models has long had a relatively small context window (the amount of information a model can process at any single time), has extended it to one million tokens, matching Google's models. That allows a Claude model to process up to approximately 750,000 words at once, opening it up to much more powerful use cases in areas where context sizes grow considerably, like coding.
TheWhiteBox’s takeaway:
You probably know by now I am not a fan of Anthropic’s leadership, but my job here is to call balls and strikes, and Anthropic has only been hitting home runs recently.
Booming business, a product whose utility is blatantly obvious to users (coding and agent use cases, nothing else), and leading the pack in those areas with a comfortable margin; it’s clear that going niche was a wise decision.
Regarding context windows, the current sizes are still insufficient. Eventually, models should aim for billion-token context windows, capable of processing entire human lifetimes of information, knowledge, and more.
But why can't we just increase the context window size? For three reasons:
It's too damn expensive, as the attention mechanism has quadratic cost complexity in both compute and memory. In layman's terms, if you double the sequence length, compute and memory quadruple; if you triple it, they grow ninefold (see the first sketch after this list). Nowadays, nobody uses full attention; labs instead rely heavily on simpler attention mechanisms that sacrifice granularity in exchange for cheaper processing, but that has another consequence…
Models get lost if the context is too large. As we've argued before, they don't truly understand what they are reading or seeing and rely strongly on superficial patterns; the longer the sequence, the less likely they've seen similar material, which makes them highly susceptible to getting lost in too much context.

Examples like this would not happen if these models weren't, at least for now, just superficial pattern matchers.
They also suffer from extrapolation sickness. Long-sequence attention patterns require training; in other words, if you want billion-token context windows, your training has to include billion-token sequences at some point, or use techniques like YaRN (table stakes these days in training procedures, including OpenAI's open-source models; the second sketch below shows the core idea). Either way, it's going to increase your training costs considerably.
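To make the quadratic-cost point concrete, here is a minimal sketch, assuming fp16 scores and an illustrative 32-head layer, of what naively materializing the full attention-score matrix would cost. Real stacks use kernels like FlashAttention to avoid ever storing this matrix, but the O(n²) compute remains:

```python
def attention_scores_gib(seq_len: int, num_heads: int = 32, bytes_per_score: int = 2) -> float:
    """Memory for one layer's full seq_len x seq_len attention-score matrix (fp16)."""
    return num_heads * seq_len ** 2 * bytes_per_score / 2 ** 30

for n in (8_192, 16_384, 32_768, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_scores_gib(n):>12,.1f} GiB")

# 8,192 tokens -> 4.0 GiB; doubling to 16,384 quadruples it to 16.0 GiB,
# and a naive 1M-token window would need ~59,605 GiB per layer for scores alone.
```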
All this is to say we definitely need new methods to unlock larger sequences without sacrificing too much in the process, but Anthropic's increase is great news for those of us who rely on Claude models daily.
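And on the extrapolation point: here is a minimal sketch of linear position interpolation, the simpler precursor that YaRN refines. The idea is to rescale RoPE position indices so that positions beyond the trained window map back inside the range the model actually saw during training; the lengths below are illustrative, not any lab's actual configuration:

```python
import numpy as np

def rope_angles(position: int, dim: int = 128, base: float = 10_000.0,
                scale: float = 1.0) -> np.ndarray:
    """RoPE rotation angles for one token position; scale < 1 compresses positions."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)  # one frequency per channel pair
    return (position * scale) * inv_freq

train_len, target_len = 8_192, 131_072
scale = train_len / target_len  # 1/16: squeeze 128K positions into the trained 8K range

# Position 100,000 lies far outside training, but after interpolation its angles
# equal those of position 6,250, which the model has seen many times.
assert np.allclose(rope_angles(100_000, scale=scale), rope_angles(6_250))
```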


BROWSERS
Perplexity Makes Huge Bet on Chrome
Perplexity, no stranger to making splashy headlines, has just presented a $34.5 billion bid for Chrome, more than double the market value of Perplexity itself. It joins other interested potential bidders like OpenAI, PE firm Apollo, and Yahoo.
But where is this coming from? As you may or may not know, Google may be forced to sell Chrome as part of an ongoing antitrust lawsuit against the company.
The crux is that, for years, Google allegedly used Chrome to monopolize search (Chrome defaults to Google's search index), even reaching the point of paying roughly $20 billion per year to Apple to make Google the default search engine on the iPhone.
Google's cash cow is search advertising, so Chrome became an essential customer funnel into Google Search, cementing its undisputed lead.
This comes at a time when Perplexity is dealing with significant controversy, including a very strong accusation from Cloudflare's CEO that Perplexity acts like "North Korean hackers" in the way it tries to evade Cloudflare's anti-crawling protections.
Some supposedly “reputable” AI companies act more like North Korean hackers. Time to name, shame, and hard block them.
— Matthew Prince 🌥 (@eastdakota)
2:42 PM • Aug 4, 2025
But how do we make sense of this entire news story?
TheWhiteBox’s takeaway:
Perplexity just recently launched Comet, its agentic AI browser. Comet ships with an always-connected AI that acts as an agent helping the user throughout a search: explaining things, running searches for you, and so on.
In all honesty, I don't understand where this is coming from for Perplexity. The price is outrageous for them, and they have already gone the extra mile with Comet. Yes, getting Chrome would be great for them, as they could default a one-billion-user product to the Perplexity engine, but I don't think this company is anywhere near ready to pay that kind of money.
To me, this is just another marketing stunt from an AI company trying to get some eyeballs on its brand. The truth, as I said last Sunday and have been saying for months: I don't think Perplexity can win the search game, it's losing way too much money, and an acquisition is just a matter of time, or else...
Fun fact: Sundar Pichai, Google's CEO, was the product manager for Chrome when Google first developed it.
IT COSTS
The Company That Wants to Help You Downsize
An interesting (yet most likely sponsored) article in the Wall Street Journal covers a new AI startup with a very niche but immediately recognizable use case: streamlining IT departments by building a global knowledge graph across all of a company's data sources.
The reason this caught my eye is that it tackles an obvious problem, data silos, and has very strong repercussions: the days of system-of-record SaaS dominance may be numbered. Yes, I'm talking about SAP, Salesforce, and the like.
These companies make billions selling products that are, quite literally, databases with an ugly frontend for users to interact with. Crucially, these products are almost unanimously despised by the very customers who feel locked into these overpriced offerings for eternity, because the cost of migrating off them has always exceeded the pain of using them.
But if AI now offers a way for companies to migrate that data off the platform, the churn these companies experience could be record-setting.
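The article doesn't detail how the startup actually builds its graph, so take this only as a toy illustration of the underlying idea: linking records that live in separate silos into one traversable structure. The field names are invented, and networkx is just an assumed stand-in for a real graph store:

```python
import networkx as nx

# Toy records from two siloed systems; field names are invented for the example.
crm_rows = [{"customer_id": "C-17", "name": "Acme GmbH", "owner": "sales-eu"}]
erp_rows = [{"account": "C-17", "open_invoices": 3, "credit_limit": 50_000}]

graph = nx.Graph()
for row in crm_rows:
    graph.add_node(("customer", row["customer_id"]), **row, source="crm")
for row in erp_rows:
    graph.add_node(("account", row["account"]), **row, source="erp")
    # Link ERP accounts to CRM customers sharing the same identifier, so one
    # query can traverse data that used to live in two separate products.
    if ("customer", row["account"]) in graph:
        graph.add_edge(("account", row["account"]),
                       ("customer", row["account"]),
                       relation="same_entity")

print(list(graph.edges(data=True)))
```

Once every system of record is just a set of nodes and edges in one graph, the data is no longer hostage to any single vendor's frontend, which is exactly what makes migration (and churn) suddenly cheap.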
TheWhiteBox’s takeaway:
Going beyond this particular case of IT cost-cutting, the story also speaks to a larger trend: the increasing pressure companies will face to cut costs.
And while CEOs and CFOs will push COOs and CTOs to cut costs and improve margins because that's what their peers at other companies claim to be doing, I believe they won't have a choice anyway: most cost-cutting pressure will come out of necessity, as prices drop to stay competitive in rapidly commoditizing sectors like software and services.
There will be a huge opportunity for either:
Launching competitors offering 20% of features for 90% less cost (thanks to being much smaller)
Rolling up service businesses, heavily trimming headcount, and outcompeting other players.
Either way, one thing’s for sure: both prices and costs are coming down. Or, to be more specific, it’s a matter of time before deflation kicks in.
HARDWARE
NVIDIA/AMD Sign Revenue-Sharing Agreement with Trump
In news first published by the Financial Times, NVIDIA and AMD have agreed to hand over 15% of their GPU revenues from China to the US government in order to be granted export licenses to sell GPUs to that country.
The move has drawn severe scrutiny from China hawks, who argue that China could use the H20 chips NVIDIA is allowed to sell there for military purposes, a claim NVIDIA vehemently denies. Worth noting: both NVIDIA's and AMD's CEOs are of Taiwanese origin (and, curiously enough, distant cousins).
TheWhiteBox’s takeaway:
Preventing Chinese companies from using US technology will, in my opinion, backfire, as it will force them to build their own stack, which they would then, of course, sell at very competitive prices, possibly undermining US hardware sales worldwide.
At the risk of repeating myself, I outlined on Monday my reasons for the US not to impose strong restrictions on US GPUs: having the world rely on US technology for AI is crucial to its national interests and to the protection of the dollar's dominance.
Forcing the Chinese to compete is not the best strategy for US interests, because we all know they will deliver eventually and will quite easily undercut US competitors on price in a market, accelerated compute, that is rapidly commoditizing.
And to top it off, Chinese model progress and adoption are accelerating. This, coupled with an eventual Huawei mass-market accelerator, could create the perfect incentive for companies worldwide to adopt the Chinese tech stack.
At that point, China would surely devalue the yuan to make prices even more attractive, the ol' reliable Chinese tactic for outcompeting other economies, and something the US can't simply do without making life very hard for its citizens (the CCP can definitely do that to its own). That could create an adoption spiral which, in the event AI becomes ubiquitous, cements the much-feared currency sorpasso of the yuan over the dollar.
As for Trump's revenue-sharing move, the US is paying a trillion dollars a year just to service its debt, which is to say that low-double-digit billions in new income is not needle-moving money for the largest economy on the planet.
Instead, it seems like a move to lend a whiff of victory to a decision that is otherwise unpopular with many strategists, and quite possibly with the average US citizen. Still, I think it's the right call to disincentivize Chinese hardware progress, as Chinese labs wholeheartedly prefer NVIDIA/AMD stacks for now, even if some reports suggest that demand is being artificially held back by the CCP.


ROBOTICS
Laundry-Doing Robot
Figure AI, one of the hottest AI robotics startups in the space, has shown a new video of their robot putting clothes in the washing machine. The robot is painfully slow, but it does the job well.
TheWhiteBox’s takeaway:
If you're used to feeling underwhelmed by robotics videos, that's probably because you're running into Moravec's paradox without knowing it. The paradox states that what's easy for humans is hard for AIs, and vice versa.
So, whenever you see a robot doing a task that seems stupidly easy to you, you should in fact be impressed, and be less impressed by AIs being great at maths or coding (humans are pretty terrible at those things; the other side of Moravec).
CHINESE AI
DeepSeek R2 is Around the Corner
According to a recent leak, DeepSeek might be nearing the release of their R2 model, their new flagship reasoning model to compete with US counterparts.
Importantly, it appears that the model was trained and will be run on Huawei Ascend chips, making it the first globally available flagship model native to Chinese chips (at least to my knowledge; others, such as Qwen 3 or GLM-4.5, were trained on US chips).
TheWhiteBox’s takeaway:
Little is known about the upcoming model, but no Chinese lab generates as much excitement and fear simultaneously, and the release will be heavily scrutinized as a gauge of China's overall standing in AI.
An underwhelming release would soften the narrative that China has leveled the playing field with the US. Conversely, a great release, coupled with the fact that it's Huawei-native, could cause serious unrest among US officials and potentially even lead to bans in a rush to protect national interests.
Banning an open-source model would be one hell of an action; I’m not sure it’s even possible unless you prohibit inference providers from serving it (a ban on DeepSeek APIs and web/mobile applications is almost guaranteed). But if the US were to take the step of banning the model completely, it would also be a straightforward way to admit nervousness, which would be quite humiliating.
FRONTIER MODELS
How is Opus 4.1 so Good?
Anthropic's Claude Opus 4.1, its latest flagship model, has jumped to the number-two spot on LMArena, a user-validated suite of benchmarks testing models across different areas and tasks. It sits only behind GPT-5 with reasoning set to high, even though Opus 4.1 was running in non-thinking mode.
TheWhiteBox’s takeaway:
At first, the result makes absolutely no sense: Anthropic's model is almost as good as OpenAI's despite not using test-time compute, the method by which we increase a model's performance by letting it think for longer on a task.
This method has been the main progress driver for almost a year at this point, since the release of OpenAI’s o1 model last September.
Therefore, the reason Anthropic’s result is so impressive is that, while GPT-5 thinks for longer on tasks, Opus 4.1 is just responding “instinctively”, selecting the correct approach directly and obtaining great results.
Think of this as comparing you to a person who has time to think through solutions while you are forced to answer immediately; it's not only unfair, but you'll very likely shit the bed on questions that, given time, you would get right.
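For intuition, the simplest published form of "thinking for longer" is self-consistency: sample several independent reasoning traces at nonzero temperature and majority-vote the final answer. A minimal sketch follows; the `generate` callable is a placeholder for any model call, and this is a generic illustration of test-time compute, not OpenAI's actual pipeline:

```python
import collections
import random
from typing import Callable

def self_consistency(generate: Callable[[str], str], prompt: str, n: int = 8) -> str:
    """Sample n independent answers and return the most common one.

    Each call to `generate` stands in for one full reasoning trace sampled
    at temperature > 0; more samples buy accuracy with extra compute.
    """
    answers = [generate(prompt) for _ in range(n)]
    return collections.Counter(answers).most_common(1)[0][0]

# Toy usage with a fake, noisy "model" that is right ~60% of the time.
fake_model = lambda _: random.choices(["42", "41", "43"], weights=[6, 2, 2])[0]
print(self_consistency(fake_model, "What is 6 x 7?"))  # almost always "42"
```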
This automatically raises the question: How are Anthropic’s models so good even when non-thinking mode is applied?
The most likely reason is that, even though this looks like an unfair comparison for Anthropic, whose model isn't using test-time compute here, it's probably just as unfair to GPT-5, which is likely a much, much smaller model.
In other words, yes, GPT-5 has the advantage of thinking for longer on tasks, but Opus 4.1 has the advantage of being a much larger model and, therefore, one that has retained much more knowledge, trained over larger amounts of data, and so on.
We can infer this from the fact that Anthropic's model is around eight times more expensive to run, while also being slower (on a per-token basis).
However, the fact that the model remains so freakishly accurate, despite being forced to “think straight” with zero chance of redemption, meaning it has to be great at every single prediction, is very impressive.
The upshot: while GPT-5 gains accuracy by thinking longer at inference (exploring multiple solution paths before settling on an answer), Opus 4.1's strength is that its first pass is already incredibly strong. Put simply, Anthropic seems to have trained a model that does far more reasoning per token, so that each prediction already reflects the kind of multi-step thought another model would have to unroll over many extra tokens.
How they achieved this 'higher pre-prediction reasoning' is unknown. It could be several things: better data, the fact that the model is much larger (more computation per prediction), longer training, or proprietary reasoning-training heuristics (inductive biases baked into the model).
It’s most likely a combination of all of the above.
From a user perspective, Opus offers a better user experience (much faster, the model goes right to the point), but at a higher cost.
Closing Thoughts
If there’s a single thread running through this week’s stories, it’s that AI’s frontier is still advancing rapidly.
GPT-5 and Claude Opus 4.1 show that there’s more than one path to top-tier performance: one betting on longer test-time reasoning, the other packing more reasoning into each token. Yet, economics are now just as important. The era of “always give the best model” is giving way to “give the best model we can afford,” and that shift will be most felt by those using subscriptions (hence my long-standing suggestion of using APIs).
And while the industry progresses, market, political, and human factors now shape AI's day-to-day as much as technical breakthroughs do. And as always, the rivalry between the US and China will continue to make headlines, especially if DeepSeek delivers.
But the overall sentiment for me this week isn’t positive. The GPT-4o drama has once again shown us the glaring new reality AI puts us into: one where people are literally falling for digital files.
Dehumanizing these systems in public perception may end up being as crucial for safety as alignment techniques themselves.

THEWHITEBOX
Join Premium Today!
If you like this content, by joining Premium, you will receive four times as much content weekly without saturating your inbox. You will even be able to ask the questions you need answers to.
Until next time!
Give a Rating to Today's Newsletter
For business inquiries, reach out to me at [email protected]