
THEWHITEBOX
TLDR;
Welcome! Today, we discuss China’s latest model, GLM-5.2, which is possibly China’s first frontier model, as well as plenty of other news related to SpaceX, Anthropic, Google, SK Hynix, Microsoft, and more.
Enjoy!

RESEARCH
The Time has Come
Zhipu, one of China’s top AI Labs, has released a model that might set a precedent in AI, as it’s the first Chinese AI model, to my knowledge, to be actually competitive in raw performance with top US models.
Yes, I’m not saying intelligence-per-cost, I’m saying raw performance. Besides being better than anything Google has ever released for Large Language Models (LLMs), at least on benchmark metrics, it beats GPT-5.5 on several benchmarks.
Previous models like DeepSeek v4 showed promise, but they are unequivocally behind in most benchmarks. However, the latest string of Chinese AIs, models like this one or Kimi K2.7 Code, are putting up a very serious fight in raw performance while remaining overwhelmingly superior on a cost-efficient basis.
Architecturally, the model is very similar to everything we’ve seen before. But just like any other Chinese model, it uses sparse attention, specifically the same variant DeepSeek uses (appropriately named DeepSeek Sparse Attention).
But what does that mean?
Most LLMs today work the same way; they take in a sequence of words and output the next. To do this, every word looks back at previous words looking for interesting attributes to attend to (e.g., in “The green cup”, ‘cup’ attends to ‘green’ to gain the attribute of “greenness”).
The question here is the following: should every word attend to all previous words?
For example, in the sequence “The green, uhm, ehhh, mmm, ah yes, cup, was…”, does ‘cup’ have to attend to ‘uhm’ or ‘ehhh’, or should it simply focus on ‘green’?
In dense attention, which is what US models mostly do, each word pays attention to all previous words, without distinction. This ensures that all relationships are found, but at the cost of a very large amount of required compute.
Sparse attention mechanisms actually ask that question first, whether all words matter, preidentify good candidates, and only then pay attention to those selected. This takes the form of an indexer that selects good candidate words that a particular word can attend to and sets a limit. If the indexer only lets you attend to 20 words and you have 1,000 previous words, only 20 will be attended to.
This allows the compute required to perform attention to remain constant across sequence length for every word with only a slight performance reduction, making it an irresistible alternative for Chinese Labs, which aren’t nearly as compute-rich as US Labs.
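To make the indexer idea concrete, here is a minimal sketch in plain Python. It is not DeepSeek’s or Zhipu’s implementation: the function names (`select_candidates`, `attend`, `sparse_attention`) and the toy vectors are hypothetical, and for simplicity the indexer reuses the same query and keys as the attention step, whereas a real indexer would rely on its own, cheaper scoring.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def select_candidates(index_query, index_keys, top_k):
    """Hypothetical indexer: score every previous token, keep the top_k positions."""
    scores = [dot(index_query, k) for k in index_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])

def attend(query, keys, values, positions):
    """Softmax attention restricted to the selected positions."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in positions])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, positions))
            for d in range(dim)]

def sparse_attention(query, keys, values, top_k):
    # Assumption: the indexer sees the same query/keys as the attention step.
    positions = select_candidates(query, keys, top_k)
    return attend(query, keys, values, positions), positions

# Toy data: positions 0 and 1 are relevant, positions 2 and 3 are filler words.
query = [1.0, 0.0]
keys = [[0.1, 0.0], [0.9, 0.1], [-0.5, 0.2], [-0.4, 0.3]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 5.0]]

output, kept = sparse_attention(query, keys, values, top_k=2)
_, dense_kept = sparse_attention(query, keys, values, top_k=len(keys))
print(kept, output)
```

Setting `top_k` to the full sequence length recovers dense attention; with a fixed `top_k`, the attention step touches the same number of tokens no matter how long the sequence gets.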
Compared with DeepSeek’s models, GLM-5.2 adds an extra compute-saving technique called IndexShare. While a DeepSeek model applies the indexer at every single model layer, GLM-5.2 applies the indexer only every few layers, preventing this ‘candidate identification’ process from being executed more times than needed. In long sequences, this reduces compute by almost 3x.
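A rough cost model shows why sharing the indexer pays off on long sequences. This is only a sketch under stated assumptions: the layer count, context length, `top_k`, sharing interval, and unit costs below are hypothetical placeholders, not GLM-5.2’s real configuration, and the model assumes the indexer scores every previous token while attention only touches `top_k` of them.

```python
def indexer_runs(num_layers, share_every):
    """Layers that actually run the indexer when its picks are reused for share_every layers."""
    return sum(1 for layer in range(num_layers) if layer % share_every == 0)

def per_token_cost(num_layers, context_len, top_k, share_every,
                   index_cost=1.0, attend_cost=1.0):
    # Assumption: the indexer scans the whole context, attention only top_k tokens.
    indexing = indexer_runs(num_layers, share_every) * context_len * index_cost
    attention = num_layers * top_k * attend_cost
    return indexing + attention

# Hypothetical configuration, for illustration only.
every_layer = per_token_cost(num_layers=60, context_len=100_000, top_k=2048, share_every=1)
shared = per_token_cost(num_layers=60, context_len=100_000, top_k=2048, share_every=3)
ratio = every_layer / shared
print(f"cost ratio: {ratio:.2f}x")
```

Under this toy model, the saving approaches the sharing interval as the context grows, because the indexer’s scan dominates the fixed-size attention step.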
If you want to better understand this indexer mechanism and the overall functioning of sparse attention (and attention in general), I wrote a very detailed description here.
But what’s the main takeaway here?
TheWhiteBox’s takeaway:
I keep hearing delusional takes such as “China is 1 year behind.” I think this model settles this as unfathomably wrong. But social media narrative is always about extremes, so either you’re in that camp, or you’re in the “China has caught up” one, which isn’t true either.
The key difference is generalization. Chinese Labs, considerably less compute-rich than US Labs, can’t compete on general capabilities with US models; they have smaller models and much smaller training budgets.
Instead, they focus on particular domains (mostly coding) to be competitive on specific, high-value tasks. A simple example of this problem can be seen below, a benchmark, FutureSim, that measures how well models can forecast future events that occurred after their training using real news articles and such. The idea is to test whether models can work well with new information, basically.
And in such instances, those that require models to work with truly unknown data, the performance gap between closed models and open models is gigantic, which proves that Chinese models are reaching excellent performance on specific domains, but have glaring deficiencies overall relative to US top models.

But here’s the thing: it doesn’t matter whether China is catching up in overall capabilities.
What matters is whether Chinese models are reaching a capability threshold that allows them to be used, while costing 10-60 times less.
I’ve said it in the past, and I’ll say it again, enterprise workflows are meant for depth, not breadth. They yearn for specialized models and don’t give a dime whether the AI model is good at tasks outside the task at hand.
Generalization is key to pushing the frontier, but not for enterprise adoption.
And they’ve hit that threshold in many areas, making them the primary option for cost-effective deployments. This is terrible news for US interests; the way the US has let China dominate the Pareto frontier is unjustifiable.
I firmly believe that commodity tokens will represent at least 80% of the total tokens generated in a few years. And right now, that means 80% of the world’s tokens will be coming from Chinese models.
As for local inference, is this model a good option? Not at all.
The problem here is the KV Cache, the working memory the model uses to respond to specific requests.
Unlike DeepSeek v4, GLM-5.2 doesn't compress this working memory. The model is much more opinionated about what parts to use each time (attention), but still stores the entire thing.
Think of this as still having to remember everything, but for any particular task, use selected parts of your working memory, making every thought faster. However, you’re still having to store “everything”.
Consequently, a single one-million-token sequence on GLM-5.2 requires an additional 92 GB of DRAM on top of what is needed to store the model's weights. The model is also BF16, so 1.4 TB are needed for the weights themselves.
In comparison, DeepSeek v4 requires just 4 GB, because DeepSeek models do, in fact, compress memory (i.e., they decide what has to be remembered and what can be forgotten).
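The figures above are easier to compare per token. The back-of-the-envelope sketch below uses only the numbers quoted in the text; it assumes decimal units (1 GB = 10^9 bytes), that the 1.4 TB figure refers to the BF16 weights (2 bytes per parameter), and the helper names are hypothetical.

```python
GB = 10**9  # assumption: decimal gigabytes

def kv_bytes_per_token(cache_bytes, num_tokens):
    """Average working-memory footprint of one token in the sequence."""
    return cache_bytes / num_tokens

def implied_parameters(weight_bytes, bytes_per_param):
    """Parameter count implied by a weight footprint at a given precision."""
    return weight_bytes / bytes_per_param

tokens = 1_000_000
glm_per_token = kv_bytes_per_token(92 * GB, tokens)      # GLM-5.2 figure from the text
deepseek_per_token = kv_bytes_per_token(4 * GB, tokens)  # DeepSeek v4 figure from the text
cache_ratio = glm_per_token / deepseek_per_token

# BF16 stores each parameter in 2 bytes.
params = implied_parameters(1400 * GB, bytes_per_param=2)
total_gb = 1400 + 92  # weights plus one million-token sequence

print(glm_per_token, deepseek_per_token, cache_ratio, params, total_gb)
```

In other words, each token costs roughly 92 KB of working memory on GLM-5.2 versus 4 KB on DeepSeek v4, a 23x gap, and that is before counting the weights.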
Still incredible progress by China, once again.
CYBERSECURITY
Are Mythos-level Cyber Capabilities Overhyped?

New work by EpochAI has tested whether the risks associated with the new generation of AI models are overhyped. In short, they seem better at exploiting vulnerabilities but not at finding them.
Mythos Preview (the unchained, not broadly available version of Fable, which is itself only available to the US-born) appears to be a major advance in exploit development: its Cyber-ECI benchmark aggregation puts it about 7 months ahead of the trend, and clearly above GPT-5.5, especially once newer, less-saturated benchmarks are included.
The evidence is less clear for vulnerability discovery. Epoch notes a large spike in publicly recorded high and critical CVEs among 21 organizations after Mythos Preview’s release, but says that may partly reflect Project Glasswing’s large increase in spending and access, not only better model capability.
Reports from partners were mixed but generally positive:
Mozilla compared Mythos Preview to elite security researchers;
Palo Alto Networks said frontier models produced the equivalent of a year of penetration testing in under three weeks;
AWS said the model helped identify additional hardening opportunities.
But curl’s lead maintainer said he saw no evidence that Mythos found issues at a more advanced level than prior tools, so the jury is still pretty much out on how transformational these models will be for cybersecurity (if ever released).
TheWhiteBox’s takeaway:
The conclusions seem to be that Mythos isn’t just hype. However, there were definitely way too many bells and whistles. Even Anthropic, in its response to the government after they blocked Fable, alleged that GPT-5.5, which is widely available, posed just as much of a threat, so who knows.
Recall that Fable is just a Mythos model with guardrails on top, and the USG's entire position is that those guardrails can be skipped, exposing the real Mythos to the user who manages to jailbreak it.
Nonetheless, as I’ve mentioned in the past, many of the associated risks that Mythos introduces have already been shown to be possible with current models.
Which is to say, it’s not like Mythos isn’t better at cybersecurity than most, if not all, available models, but many of the risks that led to its “it’s too dangerous to release” narrative are already here with us.
In short, as several cybersecurity leaders have stated, just give us the models.
COSTS
A move toward measuring tasks, not tokens

Artificial Analysis has updated its Intelligence Index to v4.1, adding new task-level cost and time metrics alongside model-quality scores.
The index now measures not only model capability, but also cost per Intelligence Index task, time per task, and token use per task.
The highlight is clearly the cost difference. While DeepSeek V4 Pro Max reaches an Intelligence Index of 44 at about $0.06 per task, GPT-5.5 on high reasoning mode costs $0.99 per task and Claude Opus 4.8 max costs $1.78. Fable 5 sets a new high bar in both performance (topping the index at 60) and cost, being 54 times more expensive per task than DeepSeek V4 Pro Max.
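To put the per-task prices on a common scale, the short sketch below expresses each one as a multiple of DeepSeek V4 Pro Max. All inputs come from the figures quoted above; the Fable 5 dollar amount is not quoted directly, so it is derived here from the 54x multiple and the approximate $0.06 baseline, which makes it an estimate.

```python
baseline = 0.06  # DeepSeek V4 Pro Max, approximate USD per task

cost_per_task = {
    "GPT-5.5 (high)": 0.99,
    "Claude Opus 4.8 max": 1.78,
}

# Derived, not quoted: 54x the approximate baseline.
fable_cost = 54 * baseline
cost_per_task["Fable 5"] = fable_cost

# Each model's price as a multiple of the baseline.
multiples = {name: cost / baseline for name, cost in cost_per_task.items()}

for name, multiple in multiples.items():
    print(f"{name}: ${cost_per_task[name]:.2f} per task, {multiple:.1f}x the baseline")
```

On these numbers, GPT-5.5 and Opus 4.8 land at roughly 16x and 30x the baseline, and Fable 5 at about $3.24 per task.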
TheWhiteBox’s takeaway:
I don’t know who needs to hear this, but if you’ll excuse my words, US Labs need to wake the fuck up. It’s great that Fable 5 tops benchmarks (if we could use them, of course), but is the model 54 times better than DeepSeek on average tasks?
Come on.
Yes, it may give you an edge on some tasks DeepSeek’s models can’t solve yet, and for those, you will default to the frontier, but what I believe most people in the US resist understanding is that those tasks are a tiny, tiny part of what the world will ask of AI.
A business is about solving a problem for the user, and the user wants it solved at the lowest possible cost. Nobody drives a Ferrari to go to work in the fields because a 20-year-old crappy Toyota does that job brilliantly.
If Chinese models are consistently the best intelligence-per-cost option, the world will run on Chinese models. Your choice, San Francisco.
COMPUTE
All Signs Continue to Point in the Same Direction: Compute
In the days after Fable's release, before it was banned at least, I noticed a pattern: progress still seems overwhelmingly linked to compute, and two recent evaluations prove this very clearly.
On the one hand, there is a clearly counterintuitive result from Vals.ai. Claude Fable 5 fell back to Opus 4.8 in 199 of 200 cases: the request was sent to Fable 5, but the system downgraded it to Opus 4.8 because Anthropic flagged the task as “unacceptable” and thus doesn’t want you to use its best model for it. In other words, it was essentially Opus 4.8 responding throughout the entire evaluation. Yet it still scored twice as high as running the task directly on Opus 4.8.

How is that possible?
Most interestingly, the cost was also double, even though both runs were charged at Opus 4.8 token prices. This means the first run generated twice as many tokens as the second, proving that inference-time compute, just deploying more compute into the task, shows no signs of slowing at all.
Simply put, doubling the token count doubled the performance.
OpenAI’s Reasoning Lead, Noam Brown, had a really good article explaining just this, stating, “As LLMs become more capable, benchmark performance is increasingly a function of test-time compute,” arguing that AI products are still not showing AI’s real capabilities because we’re forced to clamp down on how much they can actually think on the problem due to costs.
Another obvious example is the ECI index from EpochAI, which shows Fable 5 scoring just 1 point higher than GPT-5.5 Pro. And you may ask, how does GPT-5.5, an undeniably inferior model, get such a close result to the next-generation model?

And the answer is that this is an apples-to-oranges comparison: Fable 5 is one model, whereas GPT-5.5 Pro is a best-of-N sampling method where OpenAI deploys several separate AIs into the task and keeps the best result of the bunch (usually the most common one).
This proves that, despite the individual model being worse, the additional inference-time compute required to deploy several AIs closes the gap.
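The “keep the most common answer” idea can be illustrated with a small simulation. This is a sketch of plain majority voting, not OpenAI’s actual method: `noisy_solver` is a hypothetical stand-in for a single model call that is right with a given probability, and the accuracy, answer labels, and sample counts are made up for illustration.

```python
import random
from collections import Counter

def noisy_solver(rng, correct_answer, accuracy, wrong_answers):
    """Hypothetical single model call: correct with probability `accuracy`."""
    if rng.random() < accuracy:
        return correct_answer
    return rng.choice(wrong_answers)

def majority_vote(answers):
    """Keep the most common answer of the bunch."""
    return Counter(answers).most_common(1)[0][0]

def success_rate(n_samples, trials, accuracy, seed=0):
    """Fraction of trials in which the voted answer is the correct one."""
    rng = random.Random(seed)
    wrong_answers = ["A", "B", "C", "D"]
    hits = 0
    for _ in range(trials):
        answers = [noisy_solver(rng, "X", accuracy, wrong_answers)
                   for _ in range(n_samples)]
        hits += majority_vote(answers) == "X"
    return hits / trials

single = success_rate(n_samples=1, trials=2000, accuracy=0.5)
voted = success_rate(n_samples=15, trials=2000, accuracy=0.5)
print(f"one sample: {single:.2f}, fifteen samples with voting: {voted:.2f}")
```

The individual solver never gets better here; the only thing that changes is how many times it runs, which is the same trade of extra inference-time compute for accuracy described above.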
OpenRouter’s Panels API, which we discussed last Sunday, is an outcome of the same idea: by deploying several agents to a problem, the combined computational power exceeds that of a single model, yielding an outsized outcome.
The implication of this? As I said in the last newsletter, blocking a specific model is useless as the genie is pretty much out of the bottle.
TheWhiteBox’s takeaway:
Placing model restrictions doesn’t make any sense at all because worse models given more compute reach the same performance. Consequently, if you really, really want to halt progress, you have to go for the compute.
Anthropic’s entire framing of going after the models is just regulatory capture, because Chinese models badly commoditize its business and it wants them out of the way.
The US advantage is a compute advantage; I’ve always said this and will continue to push it. However, this must not lead to banning US compute for the rest of the world, because if you do so, China is going to flood the world with Chinese compute, and I challenge you to give me a single reason the USG would want that.
Instead, if you really want the US to thrive, you should acknowledge the importance of compute and, instead of placing export controls, focus all your efforts on reducing capital costs for businesses that deploy it. That is where China beats the US: in getting that compute deployed at a better cost.
Interestingly, export controls only make the customer pool smaller, reducing demand, and thus pushing prices higher.
Instead, give companies in the industry energy credits and provide liquidity for construction; do whatever it takes to bring down the $50 billion-per-gigawatt cost, which is, in fact, not falling but is currently rising.
That is how the US should compete (and most likely win).
I know I’ve said multiple times that we shouldn’t be bailing out businesses that are overspending. But China is doing so, so if you really feel AI is a true national security risk, a race between both countries, then you’ll have to.
But let me be clear, the focus should not be on models; it should be on computation, which, by the way, is exactly what the CCP is doing.
ROBOTICS
Using Autoresearch for robotics

One of the most popular themes in AI today is recursive self-improvement, the idea of letting models improve themselves or improve other models. In an ideal world, AI will be in charge of autonomously improving itself, potentially creating an explosion in progress. At least, that’s the hope.
And now, NVIDIA has applied it to robotics… and it worked.
Called ENPIRE, it’s a harness framework for coding agents that instantiates a physical feedback routine in which coding agents analyze logs, consult the literature, improve training infrastructure, and refine algorithm code to address failure modes, thereby improving robotic arms (the previous link includes video demonstrations).
As the blog states, “Powered by ENPIRE, frontier coding agents can autonomously develop a policy to achieve a 99% success rate on challenging, dexterous manipulation tasks in the real world, such as PushT, organizing pins into a pin box, and using a cutter to cut a zip tie.”
As seen in the image above, having Codex autonomously try different ways of improving performance leads to clear improvements.
TheWhiteBox’s takeaway:
The real question here is, of course, cost. Having an AI agent (or several of them) run endlessly to solve the problem can lead to enormous per-trial costs (don’t forget that an OpenAI engineer spent $1.3 million in API credits in a single month).
The other big question here is what the upper bound of these runs is. It’s clear that the ability of AI models to improve scientific discovery is bound by their “intuition”; the ability to suggest novel ideas that don’t overlap with previous ones and genuinely improve the ability to test different stuff.
Recent mathematical breakthroughs, like those in the Erdős problems, suggest these models are, in fact, improving their ability to be creative in their suggestions, but I’m skeptical as to how truly novel these are.

M&A
SpaceX Acquires Anysphere
In a move that surprised no one, SpaceX has confirmed the acquisition of Anysphere, the team behind the popular coding tool Cursor, in a $60 billion stock-based merger. The deal is expected to close in Q3 2026.
The deal puts Cursor’s ARR at $2 billion, adding to the fast-growing SpaceX AI revenue, which is much needed to justify its valuation, considering it is trading above Amazon at the time of writing, with Amazon at almost $3 trillion. Alan Greenspan surely “loves” our current market.
TheWhiteBox’s takeaway:
We can argue if the price was too high, but we can’t argue that the acquisition makes total sense for SpaceX.
They acquire a super-talented team behind what’s possibly the best agent coding harness there is right now (i.e., Claude/GPT models run better in Cursor than in their own commercial products), a team that is also training its own models, initially by fine-tuning Chinese models and currently from scratch, as announced by its CEO here.
Surprisingly, the CEO also disclosed that Opus and GPT both have 1.5 trillion parameters. Feels quite low-ish, but he surely knows more than I do on the topic, so unless he got confused, it’s quite the reveal.
Additionally, they gain >$2 billion in much-needed revenues and swaths of human coding data. I see no issues with the acquisition, except that paying 30 times revenue isn’t precisely cheap.
REGULATION
DeepSeek, To the Ban List?
As reported by Reuters, the US has held off on adding China’s DeepSeek, CXMT, and more than 100 other firms to the Commerce Department’s Entity List, despite an interagency committee reportedly deeming them national security risks. The delay comes as the Trump administration seeks to avoid worsening tensions with Beijing.
Reuters reports that the Entity List, which restricts exports of US goods, software, and technology, has not received new additions since October, the longest gap in more than a decade.
Some of the pending companies were reportedly linked to supplying Russian drones recovered in Poland, selling restricted Nvidia chips to Chinese universities, or supporting China’s military-related drone and robot-dog programs.
China’s foreign ministry said the US should stop “politicizing” economic and technology issues, while the Commerce Department’s Bureau of Industry and Security said it continues using export-control tools to address “bad actors.”
Is this the beginning of the end for Chinese open models being legal in the US?
TheWhiteBox’s takeaway:
In a basic scenario where DeepSeek posts weights publicly, and anyone can download them for free, the Entity List alone would not clearly make downloading or using those weights illegal.
The risk increases if the download involves a transaction with DeepSeek, such as signing a license directly with the company, paying for access, or using its hosted API. Those could be treated differently from the way publicly available files are handled.
However, most people using Chinese models today access them through US LLM providers like Fireworks or the Hyperscalers, meaning the models are hosted by US entities that have entered into no direct agreements with DeepSeek.
Banning open models would require treating digital files as contraband, which would be hilarious, since it would essentially be a ban on a file packed with matrix multiplications.
Jokes aside, open models keep token prices in check. Without them, you’re left with Anthropic, OpenAI, Google, and a handful of remaining players who can then raise prices with no risk of commoditization at all.
If AI truly becomes ingrained in enterprise workflows, it would increase operational costs significantly relative to foreign companies, which would, of course, rely on cheaper tokens.
The US needs to stop thinking of banning stuff as a solution to its competitiveness and start competing; force US Labs to compete at the Pareto frontier, not give them the victory via regulatory capture. I’m not saying it’s going to happen, I’m just hoping it doesn’t.
MEMORY
Hynix ADR Listing Soon?
SK Hynix, one of the key suppliers of DRAM (the memory used by AI accelerators like GPUs), is allegedly preparing to list on the American stock market (via an ADR) as soon as next month, opening the door for US investors to invest in the company.
Right now, investing in Hynix is complicated because investing in Korean companies is very hard for non-Koreans, with only a couple of brokers like Interactive Brokers offering them. Therefore, this would represent a considerable increase in liquidity for the company.
Also, as part of a plan to increase shareholder returns, they will begin repurchasing shares and paying cash dividends in the fourth quarter.
TheWhiteBox’s takeaway:
While memory is one of the most important elements in the AI supply chain today, the market's cyclicality risks (memory has traditionally been brutally cyclical) may lead Hynix to act more cautiously and, instead of fully reinvesting to extend capacity (they are expanding capacity, though clearly not as much as they could), use that cash to inflate stock value and shareholder returns.
It’s not the greatest sign for people looking to hold this stock for years, but it’s great news for current shareholders.

ENTERPRISE
Microsoft is considering using DeepSeek for Copilot
In quite an unexpected turn of events, it seems Microsoft is seriously considering using DeepSeek v4 Flash as one of the underlying models in its Copilot offering. The aim is to reduce serving costs and, with that, offer more competitive Copilot pricing, a matter of much discussion in recent times after the massive June 1st price hikes, with some customers allegedly spending up to 100 times what they did before.
TheWhiteBox’s takeaway:
If regulatory capture doesn’t prevent it, this will become the norm unless US Labs get serious about offering good intelligence-per-cost models.
But do you realize how tragic this is?
Microsoft, which basically owns OpenAI and has access to its IP, decides to use a Chinese model because US models are simply too expensive to run.
I would be surprised if not for the fact that we’ve been saying this would happen for a long time in this newsletter, but one can’t help but feel “impressed” by how easily US Labs are conceding defeat in the commodity token market.
Nonetheless, GLM-5.2, the model discussed above, scores 11 points higher on the Artificial Analysis benchmark than GPT-5.4 mini (on high reasoning), despite costing less per task.
Outrageous and a complete strategic defeat for Frontier Labs. As mentioned earlier, US Labs need to wake up. And fast.
ENTERPRISES
The Semantic Layer we were waiting for?
Google has released the Open Knowledge Format, a proposed standard for representing organizational knowledge in a form that humans, data tools, search systems, and AI agents can read.
OKF captures context around enterprise data and systems. This includes definitions of tables, metrics, datasets, APIs, business processes, and runbooks, as well as links to authoritative sources and usage notes.
The format is based on directories of markdown files. Each file describes a single concept and may include structured metadata such as type, title, description, resource links, tags, and timestamps.
---
type: BigQuery Table
title: Orders
description: One row per completed customer order.
resource: https://console.cloud.google.com/bigquery?p=acme&d=sales&t=orders
tags: [sales, revenue]
timestamp: 2026-05-28T14:30:00Z
---
# Schema
| Column | Type | Description |
|---------------|-----------|------------------------------------------|
| `order_id` | STRING | Globally unique order identifier. |
| `customer_id` | STRING | FK to [customers](/tables/customers.md). |
# Joins
Joined with [customers](/tables/customers.md) on `customer_id`.

OKF does not replace databases, schemas, APIs, or data catalogs. Instead, it documents what those systems mean, how assets relate to each other, and what caveats apply. It’s a semantic layer that packages knowledge about enterprise systems into simple files that LLMs can easily interpret.
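Because each file is just metadata followed by markdown, a tool or agent can read it with very little code. The sketch below is not an official OKF parser: `parse_okf` is a hypothetical helper that handles only the simple `key: value` lines and bracketed lists seen in the example above, and a real implementation would use a proper YAML parser for the metadata block.

```python
EXAMPLE = """---
type: BigQuery Table
title: Orders
description: One row per completed customer order.
resource: https://console.cloud.google.com/bigquery?p=acme&d=sales&t=orders
tags: [sales, revenue]
timestamp: 2026-05-28T14:30:00Z
---
# Schema
| Column | Type | Description |
|---------------|-----------|------------------------------------------|
| `order_id` | STRING | Globally unique order identifier. |
| `customer_id` | STRING | FK to [customers](/tables/customers.md). |
# Joins
Joined with [customers](/tables/customers.md) on `customer_id`.
"""

def parse_okf(text):
    """Split an OKF-style document into (metadata dict, markdown body)."""
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text.strip()
    # Find the line that closes the metadata block.
    end = next(i for i in range(1, len(lines)) if lines[i].strip() == "---")
    metadata = {}
    for line in lines[1:end]:
        # Split on the first colon only, so URLs and timestamps stay intact.
        key, _, value = line.partition(":")
        value = value.strip()
        if value.startswith("[") and value.endswith("]"):
            value = [item.strip() for item in value[1:-1].split(",") if item.strip()]
        metadata[key.strip()] = value
    body = "\n".join(lines[end + 1:]).strip()
    return metadata, body

metadata, body = parse_okf(EXAMPLE)
print(metadata["title"], metadata["tags"])
```

The metadata gives a retrieval system something structured to filter on (type, tags, timestamp), while the body is passed to the model as plain context.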
TheWhiteBox’s takeaway:
It’s all context engineering in the end. As Google explains, “AI is only as smart as the context you give it.” Thus, improving the quality of the provided context is one of the easiest ways to improve performance, with the added benefit that here, Google is standardizing how this semantic layer is built.
But you may be asking: isn’t this Anthropic’s Skills all over again? Or the memory systems these Labs use? Isn’t it just a bunch of markdown files like the other context engineering methods?
Kind of, but not quite.
Anthropic’s Skills package behavior, i.e., how to use a certain tool. OKF packages knowledge. It’s all the same in essence, instructions in a simple-to-read file, but it’s important that we understand when to use what.
If you’re up for a read, it’s heavily inspired by Andrej Karpathy’s LLM-wiki, a persistent wiki available to the model about something, in this case, enterprise systems, an idea quite similar to what OpenAI does with its memory systems.
But OKF is the first one to try to standardize this process, which is welcome.
CONSUMER
AI AirPods in 2027?
According to Mark Gurman, a really well-known journalist who is almost exclusively focused on Apple, Apple will launch AI AirPods in 2027, as well as other products.
The lineup is expected to include camera-equipped AirPods, a second-generation foldable iPhone, and a redesigned iPhone tied to the product’s 20th anniversary.
The camera-equipped AirPods are reportedly scheduled for late 2027. The cameras are not intended for taking photos or videos, but for providing Siri and Apple’s AI systems with visual context of the user’s surroundings.
Apple is also planning a second foldable iPhone for 2027, following the expected launch of its first foldable model. The anniversary iPhone is expected to feature a new design with a near-edge-to-edge display and curved glass on the sides.
The products remain in development, and Apple has not publicly announced the devices.
TheWhiteBox’s takeaway:
It’s obvious Apple is going to focus on winning the consumer AI market, and if Siri actually delivers in September, they have a great chance of running away with it, unless OpenAI does something really transformational about its consumer device being developed by Jony Ive.
I don’t own any AI consumer devices yet, like Meta’s RayBan/Oakley Meta glasses, but they are starting to become quite appealing to me.
My only concern is that people may feel uneasy about you having a device that is literally recording them, so Apple’s AI AirPods, which focus more on surroundings and providing better context, might be less intrusive. We’ll see.

Closing Thoughts
We’re finally witnessing the emergence of China as a clear rival. And it’s a brutal one with costs that are simply too low to ignore.
Chinese Labs are becoming incredibly sophisticated, but I don’t think anyone is particularly surprised at this, considering one-third of “US researchers” are actually Chinese.
The New York Times already highlighted that talent gap months ago, and I believe they have older articles about it.
US policy should be pressuring Frontier AI Labs to compete in the commodity token market. To me, this seems obvious, but for whatever reason, only Google seemed to understand that, only to then release Gemini 3.5 Flash at three times the original price of the third version.
Result? Nobody uses that model, and the question of which model offers the best intelligence per cost is equivalent to asking which Chinese model offers the best intelligence per cost.
If the US is desperate to tackle national security risks, this is the biggest one.
Sadly, US Labs can’t compete on price with Chinese labs. And no, it’s not because Chinese model architectures are more frugal or that Chinese researchers are marked by the mandate of heaven. Sparse attention is something any researcher sees as a great compromise.
No, it’s much simpler: capital costs.
When Anthropic serves you a token, they aren’t only paying for the operational cost of serving it; they are also paying the cost required to build the data center in the first place, and that’s the big one, actually.
I really give US Labs a hard time, but I acknowledge it’s partially not their fault that, to compete, they have to pay $50-$100 billion per gigawatt. It’s not normal for Meta to spend more on AI this year than the German state spends on defense.
This is the product of a highly commoditized software running on top of non-commoditized, and thus very pricey, hardware. Let’s stop pretending this isn’t happening, and let’s start enacting policies that decrease those prices somehow.


