THEWHITEBOX
DeepSeek, China, & the future of AI

As promised a few days ago, this post focuses on the nothing-short-of-crucial release of DeepSeek v4 from the eponymous company, because it reveals a lot about the state of Chinese AI and, more broadly, the future of the industry.

How far behind are they? Are they behind at all? What’s the long-term plan?

We’ll answer all these questions today by presenting the main insights from the paper. If you read this piece in full, you’re going to:

  1. Fully understand how frontier AI models work and how they are built, explained in plain English, one idea at a time,

  2. Capture subtleties about AI software and hardware that almost no one understands,

  3. And comprehend the nuanced geopolitical implications of this model. There’s a reason the entire industry holds its breath whenever DeepSeek drops a new model: the last time they had a big release, NVIDIA crashed 17%, roughly half a trillion dollars in market value, the largest single-day fall on record, on the back of a paper by these same researchers.

I mean, what other non-US model could cause the level of panic this AI Lab consistently creates, even inside the US Government?

This piece will leave you with a bittersweet taste, especially if you’re concerned about US supremacy. Not because China has “caught up”; we are way past that simplistic conversation that judges who’s ahead based on who scores higher on contaminated benchmarks.

The real battle is being fought elsewhere, and I can’t help but feel quite worried about the US's chances if things don’t change before it’s too late.

You’re about to learn a ton about frontier AI today, so let’s dive in.

Redefining the open frontier

First, let’s start with the official release numbers, which, in themselves, are very impressive for an open-source model.

DeepSeek has released two new extremely cheap open models: DeepSeek-V4-Pro and DeepSeek-V4-Flash.

  • V4-Pro is the flagship model, with 1.6 trillion total parameters and 49 billion activated per token. That means only 3% of the model is active for each prediction (i.e., a huge model running like a “small” one).

  • V4-Flash is the cheaper, smaller variant, with 284 billion total parameters and 13 billion activated per token. Less sparse than its bigger sibling, but much smaller overall.

Both support context windows of up to 1 million tokens (i.e., they can handle ~750k words in context, so you can feed them a lot of information), making the release especially well-suited for long-context reasoning, coding, search, and agentic workflows.

But let’s dissect the claims further. On general knowledge and factuality benchmarks, DeepSeek-V4-Pro-Max, the highest-reasoning version of V4-Pro, shows a major jump over previous open models but still trails the strongest proprietary systems in some categories.

  • It scores 57.9 on SimpleQA Verified, compared with 46.2 for Claude Opus 4.6, 45.3 for GPT-5.4, and 75.6 for Gemini-3.1-Pro.

  • On Chinese-SimpleQA, it reaches 84.4, close to Gemini-3.1-Pro’s 85.9 and ahead of Claude Opus 4.6 and GPT-5.4. On MMLU-Pro, a broad academic knowledge benchmark, it scores 87.5, behind Gemini-3.1-Pro’s 91.0 and Claude Opus 4.6’s 89.1.

One of the strongest parts of the release is the models’ reasoning, coding, and maths capabilities.

  • On LiveCodeBench, which measures coding ability, DeepSeek-V4-Pro-Max scores 93.5, ahead of Claude Opus 4.6’s 88.8 and Gemini-3.1-Pro’s 91.7.

  • On Codeforces, a competitive programming platform, it reaches a rating of 3206, above GPT-5.4’s 3168 and Gemini-3.1-Pro’s 3052.

  • On IMOAnswerBench, a difficult math benchmark, it scores 89.8, close to GPT-5.4’s 91.4 and ahead of Gemini-3.1-Pro’s 81.0. On Apex Shortlist, another advanced math reasoning benchmark, it scores 90.2, slightly above Gemini-3.1-Pro’s 89.1.

But the model’s strongest suit is, undeniably, its long-context results.

  • On MRCR 1M, a retrieval benchmark testing whether a model can find information across a one-million-token context, DeepSeek-V4-Pro-Max scores 83.5. That beats Gemini-3.1-Pro’s 76.3 but trails Claude Opus 4.6’s 92.9.

  • On CorpusQA 1M, a more realistic corpus-level question-answering benchmark, it scores 62.0, ahead of Gemini-3.1-Pro’s 53.8 but behind Claude Opus 4.6’s 71.7.

On agentic benchmarks, DeepSeek-V4-Pro-Max is competitive with frontier models but not consistently ahead of them.

  • On SWE Verified, which measures real software issue resolution, it scores 80.6, effectively tied with Gemini-3.1-Pro and just below Claude Opus 4.6’s 80.8.

  • On Terminal Bench 2.0, which tests command-line agent tasks, it scores 67.9, behind GPT-5.4’s 75.1 and Gemini-3.1-Pro’s 68.5.

  • On Toolathlon, a tool-use benchmark, it reaches 51.8, behind GPT-5.4’s 54.6 but ahead of Claude Opus 4.6 and Gemini-3.1-Pro.

So, overall, we’re talking about an almost frontier-level model that’s free to download. Not quite Opus 4.7 or GPT-5.5-level, but pretty incredible considering that puts it a few weeks behind the absolute frontier.

Of course, benchmarks never tell the whole story, as exemplified by Gemini 3.1 Pro, which scores SOTA in most benchmarks but is rarely used in key settings like coding or agent work.

The important thing to note here is that this is a radically different model relative to the others. It’s one that presents an extremely novel architecture, unique enough to warrant this entire newsletter.

But to capture DeepSeek’s genius, we need to understand how frontier AI models are built from first principles.

What is a Transformer in words anyone can understand?

Inside every single model you’ve used lately, and I mean every single one, lies a piece of software known as a Transformer. Conceived in 2017, it has become synonymous with AI.

Just as no one doubts the use of sugar in any sweet recipe, no one doubts the use of the Transformer as the foundation of any model aspiring to be good.

As the name suggests, it’s a model architecture that works by ‘transforming’ the meaning of data, be that words, images, or audio.

But what do I mean by that?

In AI, or in any software, data is represented as numbers (recall that digital machines can only work with numbers), where each number can be seen as an attribute that, combined with the others, defines what a data point represents.

For example, say we have a toy model that does nothing but classify food by its ‘dessertness’ or ‘sandwichness.’ An apple strudel can be represented as [0.5, 1], a food with a decent degree of sandwichness and, undoubtedly, a dessert. This means that, in the model's eyes, an apple strudel is nothing but two numbers: 0.5 and 1.

Modern AIs have thousands upon thousands of dimensions, meaning they can get extremely granular in how they categorize each concept (e.g., breaking ‘dessertness’ down into finer attributes such as sweetness, texture, etc.).

Of course, the more attributes two concepts share, the more “alike” they are. This allows the model to know that ‘hot dog’ is closer to ‘shawarma’ than it is to ‘apple strudel’.

More importantly, it’s key to understand that meaning is not static; it can be modified. Words like ‘king’ and ‘queen’ are closely related, mostly separated by the ‘sex’ attribute.

Thus, if you give an AI model the word ‘king’, it can represent ‘queen’ by taking out the ‘man’ attribute and adding the ‘woman’ one.
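To make the idea of attributes and transformations concrete, here’s a minimal sketch in Python with made-up two-dimensional vectors (real models learn thousands of dimensions; every number below is purely illustrative):

```python
import numpy as np

# Hypothetical attribute vectors: [sandwichness, dessertness]
hot_dog       = np.array([1.0, 0.1])
shawarma      = np.array([0.9, 0.1])
apple_strudel = np.array([0.5, 1.0])

def similarity(a, b):
    # Cosine similarity: closer to 1.0 means the two concepts share more attributes.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(similarity(hot_dog, shawarma))        # ~1.0 -> 'hot dog' is very close to 'shawarma'
print(similarity(hot_dog, apple_strudel))   # ~0.53 -> much less alike

# Meaning can also be transformed. With toy dimensions [sex (+1 male, -1 female), royalty]:
king  = np.array([1.0, 1.0])
man   = np.array([1.0, 0.0])
woman = np.array([-1.0, 0.0])
queen = king - man + woman   # remove 'man', add 'woman', keep royalty -> [-1, 1], i.e., 'queen'
```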

The third and final insight is that meaning is contextual. Take the word ‘bank’: it has not one but several meanings (the side of a river, a financial institution), and which one applies depends on the context.

Thus, we need AIs to do all three things:

  1. categorize concepts by attributes,

  2. apply transformations to move from one concept to the next,

  3. and use context to capture the meaning of each word.

And the Transformer is the perfect architecture for this. When a Transformer-based model like ChatGPT receives a piece of text, it processes its meaning and adds relevant information to predict the next word. In other words, it’s mostly a meaning-update process.

And the reason it works so well is that it breaks down processing data into two dimensions: context and knowledge.

As I always say, AI models are just ‘maps’; they map an input to an output.

  • House data → price, for a housing price predictor.

  • A sequence of words → the next word, as ChatGPT does.

Take the housing example: if my experience tells me 5-bedroom homes rarely go below one million dollars, then whenever I see a 6-bedroom home, I’ll infer that it’s very likely worth more than one million dollars.

AIs never talk in absolutes, always in probabilities. This is by design, to help them model uncertainty: they learn to assign a probability to being wrong.

At heart, AI is mostly statistics on steroids. Hence, at their core, all models do the same thing: capture patterns in data and use them to make inferences about new data.
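As a quick illustration of what ‘talking in probabilities’ looks like, here’s a tiny sketch: a model produces raw scores (logits) for candidate next words and converts them into a probability distribution with softmax. The candidate words and scores below are invented for the example:

```python
import numpy as np

# Hypothetical raw scores for candidate next words after
# "5-bedroom homes rarely go below one ..."
logits = {"million": 4.2, "thousand": 1.1, "dollar": 0.3}

scores = np.array(list(logits.values()))
probs = np.exp(scores) / np.exp(scores).sum()   # softmax: scores -> probabilities

for word, p in zip(logits, probs):
    print(f"{word}: {p:.1%}")
# The model never says "it IS 'million'"; it assigns 'million' ~94% probability.
```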

And the tool they all overwhelmingly use to capture such patterns is the Transformer.

However, they differ in how they “work” with the input. In the housing scenario, the quality of the model’s prediction is mostly about knowledge: how much housing data it had seen previously.

Large Language Models (LLMs) add a twist: context matters just as much. The model has not only seen what’s basically all public digitized content the world has ever created, plus proprietary data that AI Labs create, but it also receives real-time context.

Take the sequence “Jimmy ate some peanuts, what do I do?” That seems like a harmless, and quite frankly hard-to-answer, question without more context. But if the model knew ‘Jimmy’ is the user’s severely peanut-allergic son, the question is clearly a medical emergency.

Alternatively, even if the model doesn’t know Jimmy has a peanut allergy, if the user had written “Jimmy ate some peanuts by accident, what do I do?” the model can already guess Jimmy probably shouldn’t be eating peanuts.

Thus, ChatGPT not only has to know “everything,” but it needs to be able to process unforeseen, present context.

And here’s where the Transformer truly starts to shine. A Transformer model, like GPT-5.5 or Opus 4.7, is a stack of ‘Transformer blocks’, where each block is made up of two things:

  1. Attention layer. It’s what lets the model process the request. To do so, it lets each word in the sequence talk to all previous ones. For the sequence “The pink bracelet”, ‘bracelet’ can “attend” to ‘pink’ and identify that it’s pink. Bracelets aren’t pink by default, so the word ‘pink’ gives it that attribute.

  2. MLP layers. Here, each word is “enriched” with the model’s stored knowledge. For the sequence “I love Michael’s music”, the model can incorporate information about famous singers named Michael, such as Michael Jackson, into the mix. This is why LLMs know so much about everything; that knowledge is stored inside them.

Intuitively, you can think of a block as a transformation (hence the name) over the words. Each word is updated contextually (with respect to previous words and the model’s knowledge on the matter).
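Here’s a minimal sketch of one such block, written in PyTorch and following the textbook layout (pre-norm attention plus an MLP, each with a residual connection). It illustrates the structure, not DeepSeek’s exact implementation:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One block = attention (words talk to each other) + MLP (words get enriched)."""
    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        # 1) Attention: each word updates its meaning by looking at previous words only
        #    (the causal mask hides future words, as in any language model).
        seq_len = x.size(1)
        causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask)
        x = x + attn_out                      # residual connection
        # 2) MLP: each word is enriched with knowledge stored in the weights.
        x = x + self.mlp(self.norm2(x))       # residual connection
        return x

# A full model is just many of these blocks stacked on top of each other.
blocks = nn.Sequential(*[TransformerBlock() for _ in range(12)])
out = blocks(torch.randn(1, 16, 512))   # (batch, sequence length, dimensions)
```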

For instance, earlier blocks might focus on capturing how nouns and adjectives relate to each other, while deeper layers can represent much more complex structures, such as compositional reasoning.

An example of compositional reasoning is answering “Can the son of Mary’s brother be older than Mary’s daughter?” To answer, you don’t retrieve one memorized fact. You compose several relationships:

  • Mary’s brother is Mary’s sibling.

  • The son of Mary’s brother is Mary’s nephew.

  • Mary’s daughter is Mary’s child.

A nephew can be older or younger than a cousin, depending on when each sibling had children. So the answer is yes: Mary’s nephew can be older than Mary’s daughter.

As you may guess, the more blocks you concatenate, the more transformations you can apply and the better the results (hence why bigger models are ‘smarter’).

And how do we scale this architecture besides just stacking more blocks? One way is to make the MLP layers more efficient by splitting them into ‘experts’, so that only a few of them run for each word (a quick sketch of the idea follows below).
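As a preview, here’s a minimal sketch of expert routing, assuming a simple top-2 gate; the sizes are invented and this is a simplification of how production mixture-of-experts layers work:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_experts, top_k = 512, 8, 2

# Each 'expert' is its own small MLP; simplified here to a single weight matrix.
experts = [rng.standard_normal((dim, dim)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((dim, n_experts)) * 0.02   # scores each expert per word

def moe_layer(word_vec):
    scores = word_vec @ router                    # how relevant is each expert to this word?
    chosen = np.argsort(scores)[-top_k:]          # keep only the top-k experts
    weights = np.exp(scores[chosen]) / np.exp(scores[chosen]).sum()
    # Only 2 of the 8 experts actually run, so most parameters stay inactive
    # for this word -- the same logic that lets a 1.6T-parameter model
    # activate only ~49B parameters per token.
    return sum(w * (word_vec @ experts[i]) for w, i in zip(weights, chosen))

out = moe_layer(rng.standard_normal(dim))
```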

We’ll talk about experts in more depth later because they’ll become important, but for now, in terms of innovating over the quintessential Transformer architecture, it’s in the attention mechanism where DeepSeek is redefining how things are done.

Attention is all you need

Two things stand out about these models: cost and long-context performance, both of which are at the absolute forefront among all models, closed and open.

With regard to the cost of the DeepSeek API, the numbers are almost embarrassing, in some cases ~80–100 times cheaper than the competition.

And the culprit behind these incredibly cheap numbers, besides the fact that DeepSeek isn’t nearly as indebted to investors as OpenAI or Anthropic (their key costs are not running AIs, but securing hardware and training), is a new attention mechanism, a first in this industry, that beautifully explains why these models are technological marvels representing an order-of-magnitude decrease in costs relative to the previous generation.

And to explain this, we first need to understand why the attention mechanism plays such a large role in costs.

As explained earlier, attention is the mechanism these models use to process the input sequence, which requires words to “talk” to one another. This ‘talking’ is a computationally expensive process, and its cost grows with sequence length.

Think of it like this: if you grow a sequence from 100,000 words to 1,000,000, the one millionth word has to attend to 999,999 previous words instead of 99,999. That’s not free.
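A quick back-of-the-envelope calculation makes the blow-up obvious. This sketch just counts word-to-word ‘comparisons’, ignoring constants and hardware details:

```python
def total_attention_pairs(n_words: int) -> int:
    # Word i attends to the i-1 words before it, so total work grows ~quadratically.
    return n_words * (n_words - 1) // 2

print(f"{total_attention_pairs(100_000):,}")    # ~5 billion pairs
print(f"{total_attention_pairs(1_000_000):,}")  # ~500 billion pairs
# A 10x longer sequence means ~100x more attention work (plus a much larger KV cache).
```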

Of course, an ideal model would not require attending to all 999,999 previous words, only those that matter, meaning attention should be dynamic and input-dependent, not a fixed computational cost per word.

By ‘matter’, we can intuit two things: for a given word, recent words tend to provide more meaningful information, a tendency called ‘recency bias’ (e.g., a noun is more likely to be affected by adjectives close by), and some words are higher quality than others by default (e.g., somewhere in those 999,999 previous words there’s probably a ‘meh’ or an ‘uhm’, words that provide little or no value compared to, say, a character’s name in a novel).

This is what we call ‘sparse attention’: cutting computational costs by attending only to the words that matter.

Despite this, most models make no assumptions about attention, and instead use global attention (assuming all previous words are worth attending to), which, for very long sequences, can make each new prediction require an overwhelming amount of compute.

Global attention sounds great on paper, but it’s not only expensive, it’s also hard to get right: models suffer from something known as context rot, where they simply can’t handle too much context.

In fact, most of the long-context claims AI Labs make are bullshit: it’s one thing for Google to claim Gemini can process one million words; it’s another for it to do so well.

The MRCR long-context benchmark shows that model performance degrades significantly from 128k words onward, let alone at 1 million, with most models becoming unusable; only GPT-5.5, released last week, maintains good quality, besides DSv4.

In other words, DeepSeek models handle long context best by far, and only the recent release of GPT-5.5 saves the US from total humiliation.

So, how did DeepSeek pull this off? The answer is hybrid attention.

The Attention Miracle

A typical trick to make attention more cost-effective, popularized by Mistral and adopted here by DeepSeek, is sliding-window attention.

A given word can only attend to a fixed number of previous words, a window far smaller than the actual sequence length.

Take the sequence “The cat sat on” with a window of three tokens (including the token itself): ‘on’ can only attend to ‘cat’ and ‘sat’, but not to ‘The’, which is too far away (out of the window). Cumulatively across all tokens, this implies significant cost savings because each word imposes much lower computational requirements.

Vanilla attention is global attention. Source: Mistral
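Here’s a minimal sketch of the sliding-window mask using that exact example (window of three tokens; True means “allowed to attend”):

```python
import numpy as np

tokens = ["The", "cat", "sat", "on"]
window = 3   # each token sees itself plus the two tokens before it

n = len(tokens)
mask = np.zeros((n, n), dtype=bool)
for i in range(n):                                    # i = the current token
    for j in range(max(0, i - window + 1), i + 1):    # j = tokens it is allowed to see
        mask[i, j] = True

# 'on' (last row) can attend to 'cat' and 'sat', but 'The' is out of the window.
print(mask[3])   # [False  True  True  True]
```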

However, this decision doesn’t come free.

By enforcing strict recency bias, the model is guaranteed to miss key long-horizon patterns. For example, in a ten-chapter crime novel, a key detail to uncover the mystery might have been mentioned in the first chapter. If the model can only see three chapters back, it’s going to forget that key detail.

This is a toy example, but in practice, this seriously undermines model performance. An alternative is hybrid attention, mixing local (short-range) attention layers with global ones in a pattern that’s usually 5:1 (five local layers for every dense, global one).
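In config terms, that ratio is just a repeating layer pattern, something like the following (a sketch of the common 5:1 interleaving, not DeepSeek’s actual layout):

```python
# Interleave local (sliding-window) and global (dense) attention layers at 5:1.
n_layers = 24
layer_types = ["global" if (i + 1) % 6 == 0 else "local" for i in range(n_layers)]
print(layer_types[:12])
# ['local', 'local', 'local', 'local', 'local', 'global',
#  'local', 'local', 'local', 'local', 'local', 'global']
```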

And one of these hybrid attention mechanisms is precisely what DeepSeek is proposing. And it’s a big deal.

In late 2025, DeepSeek released DeepSeek Sparse Attention. The gist is to add a ‘selector’ that lets each word decide which previous tokens it wants to attend to, considerably reducing computation requirements.

The idea is that, instead of forcing the word to look back to only a certain number of immediately previous words, it can “skim” the entire sequence and choose what words it wants to look closely at.

Theoretically, this would be the ideal solution to grow to arbitrarily large sequences because we keep attention computation in check, and the model can still decide what’s important.
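A rough sketch of that ‘skim then select’ idea: score every previous token cheaply, keep only the top-k, and run full attention on those. This is a simplification for intuition, not DeepSeek’s published implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, seq_len, k = 64, 10_000, 256

context = rng.standard_normal((seq_len, dim))   # representations of all previous tokens
query = rng.standard_normal(dim)                # the current word

# Cheap "skim": one lightweight relevance score per previous token.
relevance = context @ query

# Keep only the k most relevant tokens; full (expensive) attention runs on these.
top_idx = np.argsort(relevance)[-k:]
selected = context[top_idx]                     # (256, 64) instead of (10_000, 64)

# The expensive step now touches 256 tokens instead of 10,000,
# but note the skim itself still had to visit every previous token once.
```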

However, the skimming part, where the model still has to score the entire sequence to select the words that matter, remains very taxing. So DeepSeek has taken the next step and delivered the most aggressive attention mechanism we’ve ever seen.

The idea is to break the sequence into chunks of words, compress the information in each chunk into a single representation, and then let each word attend to those compressed representations instead of the raw words. I know this is confusing, so let’s imagine a simple example.

We have a 1,000,000-word sequence. How does DeepSeek v4 process such a humongous amount of data? For reference, that’s roughly the length of Harry Potter’s entire saga.

The answer is Compressed Sparse Attention, or CSA.

Recall that the naive approach is for the millionth word to attend to 999,999 words. The 999,999th word has to attend to 999,998 words, and so on. This is incredibly taxing computationally.

Instead, the DeepSeek models break the sequence into chunks of, say, 50,000 words (20 chunks) and compress the information in each chunk into a single representation, which can be thought of as a summary of that chunk.

This is, of course, a considerable reduction in information, but at least the model can look way back with ease; even if it doesn’t see each individual word, it sees summaries of everything that happened previously.

Additionally, each word gets dense access to the most recent words using SWA. If the SWA window is 5,000 words, that means that a word, say the 500,000th, receives information from 10 compressed 50,000-word chunks, plus individual access to the last 5,000 words (from the 495,000th onward).

Then, the model selects “what chunks matter” using the selector explained earlier, so that each word can see everything that happened recently, along with selected summaries of the past.
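Putting those pieces together, here’s a minimal sketch of the idea under the numbers above (50,000-word chunks, a 5,000-word local window). Mean pooling and dot-product selection are stand-ins for whatever compression and selection DeepSeek actually uses:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, seq_len = 32, 1_000_000
chunk_size, local_window, chunks_kept = 50_000, 5_000, 4

# In reality these are learned representations; random vectors stand in here.
sequence = rng.standard_normal((seq_len, dim), dtype=np.float32)
query = rng.standard_normal(dim, dtype=np.float32)   # the current word

# 1) Compress: one summary vector per 50,000-word chunk (mean pooling as a stand-in).
chunks = sequence.reshape(-1, chunk_size, dim)       # (20, 50_000, 32)
summaries = chunks.mean(axis=1)                      # (20, 32): 1M words -> 20 summaries

# 2) Select: the current word picks the few chunk summaries that matter most to it.
chunk_scores = summaries @ query
kept = summaries[np.argsort(chunk_scores)[-chunks_kept:]]

# 3) Local window: dense, word-by-word access to the most recent 5,000 words (SWA).
recent = sequence[-local_window:]

# Attention now runs over 4 summaries + 5,000 recent words
# instead of 999,999 individual words.
attended = np.concatenate([kept, recent], axis=0)    # (5_004, 32)
```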

The DeepSeek team then added an even sparser attention step called Heavily Compressed Attention (HCA), which, as you may guess, is an even more radical compression of what we were discussing just now (meaning each chunk is larger, basically).

And considering that DeepSeek scores SOTA in long context, it’s safe to call this a SOTA attention mechanism: the first time a fully hybrid attention mechanism has taken the top spot in expressiveness.

But this isn’t close to all the new stuff these cracked Chinese guys have presented to us.

Next, we’ll cover the remaining research breakthroughs, both algorithmic and infrastructural, and explain why this is a much bigger deal than meets the eye. While the US frontier is still focused on ‘big number go up at all costs’, Chinese Labs, severely compute-constrained, are doing ‘big number go up, but only if it does so efficiently’, which is music to the ears of the biggest customers in the AI game: enterprises.

At the end of the piece, we’ll turn to geopolitics and explain why DeepSeek v4 might be the first model the Chinese Communist Party can actively weaponize against the US.

Residual hyperconnections

Another of the big algorithmic breakthroughs in this paper is perhaps the most interesting.

Why?

Because it represents the largest change to the quintessential architecture, the Transformer, in almost a decade.
