
FUTURE
Let’s Settle The DeepSeek Saga Once and for All

You’ve probably seen one too many posts on DeepSeek this past week: “China says GPUs are overspent,” “The West is so wasteful,” and so on. If you’ve read my takes, you already know I believe the market got it all wrong and that DeepSeek’s results will only drive more compute and demand.

So, fearmongering over, let me ask you: Do you want to know what really caused Silicon Valley executives to freak out? If so, here are your answers.

By the end of this essay, you’ll have gained:

  • Critical intuition into how AIs learn and how they are trained and served, the grounding you need to appreciate DeepSeek’s achievements.

  • Unmatched clarity into the world of AI engineering, which has become a $300+ billion yearly CAPEX investment. In other words, you will now understand what the headline “Microsoft plans to invest $80 billion in AI” actually means.

  • Of course, an understanding of the key breakthroughs DeepSeek has presented and their impact, including proof that this event will force frontier labs to make one of my 2025 predictions come true (that this is the year of intelligence efficiency).

  • And, finally, what all this really means for Western labs, Hyperscalers, NVIDIA, and, importantly, your wallet.

Let’s dive in.

AI’s Toughest Problem

DeepSeek (DS from now onward) has challenged the very root of the West’s understanding of AI training and inferencing. While AI models are pretty simple on a theoretical level, they are very complex in practice, and DS has greatly simplified both processes.

Thus, to understand the crucial innovations that DeepSeek introduces, it’s necessary to recap the essence of frontier AI models without complicated jargon—just the basics anyone can understand.

Transformers, the Root of Everything, in 10 Seconds

All models, be it DeepSeek or ChatGPT, are a concatenation of Transformer blocks that process the input text sequence and predict what word should come next, “similarly” to how a human would.

But what does “process like a human would” mean?

If I take the sequences “The Green Forest” and “The Blue Forest,” the word ‘forest’ has an intrinsic meaning (a group of trees), but its contextual meaning depends on the other words in the sequence; one forest is green, the other blue.

Therefore, to humans, a word's meaning is the sum of its intrinsic and contextual meanings. Additionally, to predict what word comes next, humans add knowledge they’ve learned before they encountered the sequence. For instance, for the sequence “Babe Ruth played…”, the only way a human can predict that the next word is “baseball” is to know who Babe Ruth is.

Consequently, when humans talk, predicting each next word in a sequence one after another, they do both things: contextualizing each word with regard to the other words in the sequence and adding essential knowledge.

And why am I telling you this? Because if you’re wondering how humans have managed to create machines that talk like us, the answer is that Transformers replicate the exact same procedure by doing two operations:

A Transformer block. All frontier models are concatenations of this diagram.

  1. Mixing operation (Attention): The model uses the attention mechanism to make words in the sequence talk to each other. Circling back to the sentence “The green forest,” via the attention mechanism ‘forest’ can update its meaning with ‘green,’ absorbing that attribute and adding it to its intrinsic meaning. This is done in the attention layers.

  2. Knowledge operation (MLP): The model adds knowledge learned during training. Take the sentence “Michael Jordan played…” To predict ‘basketball,’ the model needs to add information that is not present in the sequence. This is done in the MLP layers, which can store this knowledge because MLPs fulfill the Universal Approximation Theorem. For more on that, read the Notion article {📏 The Universal Approximation Theorem}.

As mentioned, our AI models are a concatenation of Transformer blocks that leverage these two operations, as you can see in the image above.

Thus, every block refines the model’s understanding of the sequence (building the meaning progressively) by updating the meaning of each word with regard to the other words in the sequence and adding whatever extra information the model thinks is necessary, until it develops a global understanding of the sequence. Every single frontier model today behaves exactly this way.
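If code helps, here’s a tiny, purely illustrative sketch of a single block, one attention (mixing) step plus one MLP (knowledge) step, written in plain NumPy with made-up sizes; this is not any lab’s actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def transformer_block(x, Wq, Wk, Wv, W1, W2):
    """One illustrative Transformer block: attention (mixing) + MLP (knowledge)."""
    # 1) Mixing operation (attention): every word builds queries/keys/values
    #    and updates its meaning as a weighted mix of the other words.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])   # how much each word 'attends' to the others
    x = x + softmax(scores) @ v               # 'forest' absorbs 'green'
    # 2) Knowledge operation (MLP): each word is pushed through a small
    #    feed-forward net that adds information learned during training.
    x = x + np.maximum(0, x @ W1) @ W2        # ReLU MLP with a residual connection
    return x

# Toy usage: a 3-word sequence ("The green forest") with 8-dimensional embeddings.
rng = np.random.default_rng(0)
d, seq = 8, 3
x = rng.normal(size=(seq, d))
weights = [rng.normal(size=(d, d)) for _ in range(3)] + [rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))]
print(transformer_block(x, *weights).shape)   # (3, 8): same sequence, refined meanings
```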

But how do we train/run this? Let’s focus on training first.

The Training Pipeline

Without getting into the low-level details, whenever you see someone talking about training an AI model, it all boils down to one thing: trial and error.

But what do we mean by trial and error? All AI training procedures are a four-step process:

  1. We force the model to predict,

  2. we compare the prediction to the ground truth,

  3. we measure the deviation (the loss),

  4. and, leveraging the fact that the prediction is a function of the model’s parameters (known as weights), we use the error as a learning signal; that is, we update the model’s parameters so that the loss falls over time (see the toy sketch after this list).
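Here’s a toy version of that four-step loop with a single ‘knob’ (everything here, data included, is invented for illustration):

```python
# A toy 'DJ knob': one parameter w, and we want the model's output w * x
# to match a ground truth y. Pure trial and error, repeated many times.
w = 0.0                      # the single knob, starting at a neutral position
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # (input, ground truth) pairs; the 'right' knob is w = 2
lr = 0.05                    # how aggressively we turn the knob each time

for step in range(200):
    for x, y in data:
        pred = w * x                     # 1) force the model to predict
        loss = (pred - y) ** 2           # 2-3) compare to the ground truth and measure the deviation
        grad = 2 * (pred - y) * x        # 4) how the loss changes if we turn this knob (the gradient)
        w -= lr * grad                   #    turn the knob so the loss falls over time

print(round(w, 3))                       # ≈ 2.0: the knob has settled where the 'sound' is right
```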

The analogy we’ll be using from now on throughout the essay:

Imagine an AI model as a DJ turntable with lots of knobs you can move and rotate to get the best sound. The idea is that you force a sound, see if it sounds good, and, if not, the DJ moves the knobs to tune the sound so that it eventually sounds good.

In AI model training, the model is the turntable, the DJ is the human model trainer, every parameter is a knob, and the quality of the sound is the error signal that tells us how good the sound is (the model’s prediction).

The only difference here is that frontier models are general-purpose: it’s as if one very specific, and huge, ‘turntable’ with every knob in a particular position could make almost every single song sound good without the DJ having to intervene anymore. Taking that huge turntable with billions of knobs to that state is called foundation model training.

Moreover, we have two types of trial-and-error training: imitation learning and Reinforcement Learning (RL). One trains models by imitating prior examples (the DJ copying certain knob positions to reproduce a sound we know sounds good), and the other involves the DJ and turntable trying out different sounds, exploring, and using this exploration to find new ones.

Interestingly, LLMs and particularly reasoning models (LRMs) do both. Read Notion article {🎭 Types of Imitation Learning} for more detail on this.

This is all fun and games, but the question is how to apply it in practice.

Turning Theory into Practice

The four steps seen earlier translate into two computation paths when applying them to hardware like a GPU:

  1. Forward pass: The series of computations that lead to the prediction, akin to the DJ pressing play and listening to the sound the turntable generates,

  2. Backward pass: Known as backpropagation, it refers to calculating the gradient of the loss with respect to each weight, plus the optimizer states: a series of surrogate calculations that tell us how much we should update each weight given that prediction. Using the DJ analogy, this is the process where the DJ listens to the generated sound, evaluates its quality, and moves the knobs to see if the sound quality improves.

To know how much the DJ needs to move the knobs, we use derivatives, which tell us the ‘rate of change,’ or ‘gradient,’ that every rotation of a knob has on the sound quality.

Additionally, we use something called ‘optimizer states’ that help us perform a more ‘steady’ and tailored adaptation of each knob.
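As a hypothetical mini-example of what an ‘optimizer state’ is, here’s the most common one, momentum: an extra number stored per knob that remembers past gradients so each adjustment is steadier (real optimizers like Adam keep two such states per weight). The numbers below are purely illustrative.

```python
# Momentum: an extra number stored *per knob* that smooths the updates.
w, velocity = 0.0, 0.0       # the knob and its optimizer state
lr, beta = 0.05, 0.9         # learning rate and how much past gradients are remembered

def grad(w, x=3.0, y=6.0):   # same toy objective as before: make w * x match y
    return 2 * (w * x - y) * x

for step in range(300):
    g = grad(w)
    velocity = beta * velocity + (1 - beta) * g   # update the optimizer state
    w -= lr * velocity                            # a steadier knob adjustment than using g directly

print(round(w, 3))            # ≈ 2.0
```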

For more detail on why we calculate derivatives, read Notion article {🔙 The Backpropagation Algorithm}.

Now that we know the two computation types that lead to model learning, we are ready to train our first model. Of course, we’ll use DeepSeek V3 for this exercise and the NVIDIA H800 that DeepSeek allegedly used as hardware.

DeepSeek v3 (DSv3) is a 671 billion parameter model (671 billion of the weights we discussed earlier, or knobs if the analogy helps), with 37 billion active parameters, making it a mixture-of-experts, or MoE model, where the model is ‘broken down’ into experts.

It’s important to note that a mixture-of-experts model does not mean breaking the model into different models; that is false and a widespread mistake (one made very recently by David Friedberg on the All-In podcast). MoE models break up the MLP layers; the attention layers remain untouched. Using the DJ analogy, this means that to generate a sound, the turntable only activates 37 billion knobs instead of the 671 billion available.

‘Feedforward’ and ‘MLP’ are used interchangeably (although they are not quite the same thing).

For today’s purpose, don’t worry about MoE internals. You just need to know that MLP layers are extremely compute-intensive. The whole point of MoEs is that, by breaking them into parts and activating only a few per prediction, we cut the model's compute requirements roughly in proportion to the fraction of active parameters. In this case, with 37 of 671 billion parameters active, we are reducing prediction effort by roughly 95%.

For more details on MoEs, read Notion article {🥸 What are Mixtures-of-Experts}.

Now, let’s learn how to draft GPU clusters.

Building your cluster

DS used a cluster of 2,048 NVIDIA H800s, each with a peak of 1,516 TFLOPs at FP16. This means each GPU can perform about fifteen hundred trillion floating-point operations (numbers with decimals) per second, or 24% less than NVIDIA’s state-of-the-art H100 (1,979 TFLOPs), when precision is 2 bytes per parameter (FP16).

Don’t worry about the numbers yet. But what does it really mean to train a model on a GPU?

Think of GPUs as calculators. To calculate, they need two things: the calculating cores (where the calculations take place) and memory to read the inputs and write the results.

Hence, any workload running on a GPU involves an interplay between the GPU cores (calculators on steroids) and memory, where the cores read data to perform computations and write back the results. For today’s review, you only need to know that reading and writing to memory takes time, which, if not optimized, is time the “calculators” aren’t calculating.

Therefore, AI labs aim to minimize idle time, ensuring that the calculators are working every second possible.

While GPUs have quite a complex memory hierarchy (registers, L1 and L2 cache, shared memory, and global memory), for the sake of simplicity we will refer to all of it as one single piece of ‘memory,’ as most DS innovations covered today are about minimizing the total amount of reads and writes, not about optimizing where things are read from or written to.

But wait, there’s a problem: a single H800 only has 80 GB of memory, so a 671 GB model does not fit in one GPU! We need many. Naively, we would need at least nine. Sadly, we will need many more.

But how many? To answer this, let’s look at the overall steps and see where memory is required:

  1. We start the forward pass. The compute cores read the first layers (the first Transformer blocks) and compute each layer's output (the input of the next one, as discussed before). The results are written into memory, and the next Transformer blocks are loaded until the entire model is processed. This means the model needs to be stored in memory.

  2. The results of each Transformer block are stored in memory, not only because they are the input of the next block, but because they are required for the backward pass. This means these results (called activations) also need to be stored in memory.

  3. Once the model outputs a prediction, we use it to measure our loss (aka, how good the prediction was).

  4. This starts the backward pass, which, in a nutshell, introduces two new calculations: the gradients (the rate of change of the loss with regard to each weight, i.e., each knob, indicating how each weight’s value has affected that loss) and the optimizer states, which tell us how to adapt each weight (the signal as to how much we need to rotate each knob). Again, this implies that we also need to store the gradients and the optimizer states.

Long story short, we actually need thousands of gigabytes. In DS’s case, that number ended up being up to 5,120 GB, or 64 entire GPUs.
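As a quick sanity check on that figure, here’s the arithmetic using only numbers stated in this essay (the detailed breakdown of what fills the remaining space is in the Notion article below):

```python
# Numbers stated in this essay; the exact breakdown is in the linked Notion article.
gpu_memory_gb   = 80        # memory of a single NVIDIA H800
gpus_per_model  = 64        # GPUs needed to hold ONE training instance of DSv3
params          = 671e9     # total weights (knobs)
bytes_per_param = 1         # FP8: one byte per weight (covered later in the essay)

weights_gb = params * bytes_per_param / 1e9
total_gb   = gpu_memory_gb * gpus_per_model

print(f"Model weights alone:   {weights_gb:,.0f} GB")   # 671 GB
print(f"Memory across 64 GPUs: {total_gb:,.0f} GB")     # 5,120 GB
print(f"Left for activations, gradients & optimizer states: {total_gb - weights_gb:,.0f} GB")
```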

To better understand why we need so many GigaBytes, read Notion article {🧐 How many GBs I Need?}.

So, the question is, how do we “divide” the training across so many GPUs, and even more importantly, how do we take 64 GPUs to 2,048?

Parallelization methods

As you can probably guess, if we train one instance of the model across 64 GPUs, we need to break it up (and not only the model but everything else, too).

To do so, there are plenty of ways:

  • Pipeline parallelism (PP), breaking the model on a layer basis (across its length, i.e., two Transformer blocks to GPU1, two to GPU2…)

  • Tensor parallelism (TP), breaking the model by the activation dimension (by its width instead of by its length, i.e., GPU1 gets a portion of all layers in the model)

  • Data parallelism (DP) divides the cluster into GPU groups, each receiving a distinct part of the data. This isn’t about breaking the model but the training data.

  • Sequence parallelism (SP) involves breaking data sequences into chunks of text and sending them to each group of GPUs. This is the same approach as DP, but instead of dividing the dataset of 1 million sequences into groups of 250k, we break every sequence, too.

  • Expert parallelism (EP). By partitioning the MLP layers, we can distribute the experts across several GPUs (GPU1 takes four experts, GPU2 four others…).

  • Through frameworks like ZeRO-1 parallelism, the optimizer states are not stored in full on every GPU but are divided across several GPUs.

For a proper description of each, read the Notion article { Types of Parallelisms Used}.
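To make the simplest of these, pipeline parallelism, concrete, here’s a toy function that assigns Transformer blocks to pipeline stages (the sizes are hypothetical; real frameworks handle this for you):

```python
def pipeline_split(num_blocks: int, num_stages: int) -> dict[int, list[int]]:
    """Assign Transformer blocks to pipeline stages as evenly as possible."""
    per_stage, extra = divmod(num_blocks, num_stages)
    assignment, next_block = {}, 0
    for stage in range(num_stages):
        count = per_stage + (1 if stage < extra else 0)   # spread any remainder over the first stages
        assignment[stage] = list(range(next_block, next_block + count))
        next_block += count
    return assignment

# Toy usage: a hypothetical 64-block model split across 16 pipeline stages (16-way PP).
for stage, blocks in pipeline_split(64, 16).items():
    print(f"Stage {stage:2d} holds Transformer blocks {blocks}")
```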

So, how does the picture look for DeepSeek-V3?

According to the paper, they use a 2,048-GPU cluster with 256 8-GPU nodes. They use four of the parallelization methods listed above: PP (16-way), EP (64-way), DP (not mentioned by DS as it’s trivial; obviously 32-way), and 64-way ZeRO-1.

In other words, DSv3 was trained on 32 clusters like the one shown below. But what does that mean in practice?

  • Via PP, each model instance is broken into 16 GPUs. Hence, each one gets 1/16th of the model’s layers.

  • Via EP, each model instance’s experts are divided across 64 GPUs. Thus, each one gets 1/64th of the experts of each MLP layer (EP)

  • Similar to EP, each GPU gets 1/64th of the model’s optimizer states (ZeRO-1)

  • Again, similar to the prior two, each GPU gets 1/64th of the model’s gradients. However, as it’s an MoE, only 5% of weights activate, so the gradients to be computed fall by 95%. In layman’s terms, if only 5% of knobs participate in a prediction, the learning signal only applies to these, not the rest, saving tons of memory.

  • Each GPU has to store the activations of its layers to send them to the GPU in charge of the next layers due to PP, so they store 1/64th of the instance’s activations.

To understand the full details of this parallelization, read the Notion article {☄️ The DeepSeek Cluster}. And to understand how they reached the 64-GPU number, read the Notion article {🧐 How many GBs Do I Need?}.

However, weren’t they using a 2,048 GPU cluster? If we “only” need 64 GPUs to train a model instance, why are they using 2,048?

The key here is parallelization. If I have a 64-GPU cluster, I can theoretically train DSv3. But how long would that take? Well, the answer is years, 33 to be exact. Therefore, we need to parallelize and train many instances instead of one if we hope to train the model in a reasonable time.

To understand why a single 64-GPU cluster is not viable and would take 33 years to train, read Notion article {⌚️ Estimating Training Run Length}.

But wait, if we have 32 64-GPU clusters all training different instances of the same model, doesn’t that mean we are training 32 different models?

Yes… and no, because we synchronize the learnings. This introduces another crucial piece: communications.

Communication

Some numbers before we talk about communications.

DS reported 160 GB/s and 50 GB/s for NVLink and InfiniBand (IB), for intra-node and cross-node communication, respectively. In other words, GPUs inside a node share data at 160 GB/s, while GPUs in different nodes see that speed drop to 50 GB/s.

As we split up the model and the data, GPUs must communicate constantly, an unavoidable problem, but one where DS improves results dramatically.

For instance, a token being processed by one GPU might need an expert that sits on a GPU in another node due to expert parallelism, meaning the token must travel through the IB cables. While the token ‘travels,’ the receiving GPU has to ‘wait’ until it arrives to perform the computation.

This essentially means that there are two types of communication pathways:

  1. Intra-node communication, when GPUs from the same node communicate. This communication is done via the much faster NVLink cables.

  2. Cross-node communication, when GPUs from different nodes communicate. This communication goes through IB cables.

In the case of DSv3, due to the extensive parallelism efforts, many communication pathways are going on.

For more detail on which communications take place and why, read the Notion article {📫 Communication Pathways in DSv3}.

But we still haven’t answered the question we raised earlier: sure, GPUs in different nodes inside the 64-GPU cluster will communicate with each other, but aren’t we still training 32 different instances of the model? (Recall we have 2,048 GPUs.)

And the answer is no, thanks to data parallelization and global training updates.

Scaling to 2,048 GPUs

Once the training step of a given cluster is finished, DS executes a communication step between all 32 clusters so that they share what they have learned. This process is called all-to-all communication, and is broken down into four steps:

Intra-64-GPU cluster:

  1. All Dispatch: As we are using ZeRO-1 parallelism, each of the 64 GPUs in the cluster has 1/64th of the optimizer states. In plain English, each GPU can update one small fraction of the model’s weights. Therefore, after the weights stored in that GPU are updated, the new values are broadcasted to the other GPUs in the cluster.

  2. Reduce: As each GPU receives the new weights from other GPUs, they update the weights they had not updated themselves. By the end of this process, all GPUs in the 64-GPU cluster are updated.

Note: This step wouldn’t have been necessary if we didn’t have Expert Parallelization, as GPUs don’t need to know the state of the weights of layers they aren’t storing themselves.

Across the 32 clusters (the DP groups):

  3. All Dispatch: Once every 64-GPU cluster has updated all its weights, we have 32 different versions of the same model, as each has been trained on different training data. We want all 32 clusters to hold the same model before proceeding to the next round, so each needs to share its weights with the rest.

  4. Reduce: As each of the 32 clusters receives the data from the other 31, it reduces it via some aggregation method, usually the average. In other words, it computes the average value of each weight across the clusters.

At this point, every DP group has a model instance identical to those of the other 31 clusters.
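A minimal sketch of that final reduce step, with averaging as the assumed aggregation (real systems do this with collective operations like all-reduce over the network, not a Python loop):

```python
import numpy as np

def all_reduce_average(replica_weights: list[np.ndarray]) -> np.ndarray:
    """Every DP group contributes its locally updated weights; all get back the average."""
    return np.mean(replica_weights, axis=0)

# Toy usage: 32 DP groups, each holding a slightly different version of the same 4 weights.
rng = np.random.default_rng(1)
replicas = [np.array([1.0, 2.0, 3.0, 4.0]) + rng.normal(scale=0.01, size=4) for _ in range(32)]
synced = all_reduce_average(replicas)
print(synced)   # every group now continues training from this identical set of weights
```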

It’s important to note that this process is mostly synchronous, meaning the entire training run is paused so that communication can be performed across the larger DP groups. This is a huge problem in itself that DS doesn’t address in its research, and it’s usually tackled with low-communication methods like Google’s DiLoCo.

Now, we are finally ready to see what DeepSeek proposes in all this, starting with DualPipe.

DeepSeek’s Contributions

DualPipe

As we mentioned earlier, training involves two computation pathways: forward and backward. Theoretically, one goes first, the forward pass to compute the predictions, and then the backward is executed so that the error signal is used to update the models.

Instead, what DeepSeek proposes is to ‘overlap’ both pathways; to make them happen simultaneously.

Let’s say I have two independent pieces of data I want to send to the model. While the backward pass of the first piece depends on first doing the forward pass (to learn from the loss signal, I have to make a prediction and measure the error), it’s independent of both pathways of the second piece of data.

Therefore, once I start the backward pass of the first piece of data, I can immediately start with the forward pass of the second. And while I start the backward pass of the second, I may start with another computation I have to perform, like communications.

Long story short, DualPipe is about parallelizing as many independent computations as possible to minimize GPU idleness. Illustrated below, you can see how GPUs simultaneously handle computation and communication instructions.
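Here’s a toy schedule that captures the spirit of the idea (it is not DeepSeek’s actual DualPipe algorithm): while micro-batch 1 runs its backward pass, micro-batch 2’s forward pass and pending communications can already run, so the GPU is rarely waiting.

```python
# Toy timeline: each list entry is 'what the GPU is busy with during that time slot'.
microbatches = [1, 2, 3, 4]

naive, overlapped = [], []
for mb in microbatches:
    naive += [f"fwd({mb})", f"bwd({mb})"]          # strictly sequential: forward, then backward

for i, mb in enumerate(microbatches):
    overlapped.append(f"fwd({mb})")
    if i > 0:
        # the backward of the *previous* micro-batch is independent of this forward,
        # so (on real hardware) they can run in the same slot, alongside communication
        overlapped[-1] += f" + bwd({microbatches[i - 1]}) + comms"
overlapped.append(f"bwd({microbatches[-1]})")

print("Naive:     ", naive)        # 8 slots of work
print("Overlapped:", overlapped)   # 5 slots: far less idle time per GPU
```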

For a more detailed explanation of DualPipe, read Notion article {🪛 DualPipe}.

DS’ contribution nº1: Achieving near-zero GPU idle time by overlapping forward, backward, and communication tasks across the entire DP group.

In plain English, DS’s training is much faster and more efficient than traditional model training, which translates into needing fewer GPUs (for reference, Llama 3.1 405B was trained on a cluster 8 times larger) and worse GPUs (24% less performant than the ones used by Meta).

And, no, this does not mean you need fewer GPUs overall. Instead, you can do more with less, and the incentives to grow your cluster are still very much there.

But while DualPipe is truly a success story for them, we still have a problem: Expert Parallelism.

Scaling MoEs is Hard.

As we recall, DS distributes the MoE layers (the MLP layers) into 64 groups, sending a 64th part of the layer to every GPU. However, MoE models always struggle with expert balancing.

If an expert gains a knowledge advantage, it starts to be picked far more often than others. This can lead to expert collapse, in which some experts are hardly picked, if at all.

This is, on paper, irrelevant (not really, but it’s not a tragedy in theory), but it’s definitely a tragedy if you’re trying to maximize average GPU FLOP usage. If the experts in GPU1 are chosen a lot, and those on GPU56 aren’t, one GPU is ‘working extra hours’ while the other is idle. This is very bad because you can overload some GPUs, which causes overall delays (and you can burn them, too).

Therefore, ideally, you want all experts to be picked roughly equally on average so that all GPUs are used as much as possible.

Until DS, most people solved this with an auxiliary loss formula, which taxes the model if it tries to pick an expert too often. This naturally leads to a more balanced expert choice distribution. However, that’s not ideal because if an expert is better, it should be chosen more often, right? Unsurprisingly, this method degrades performance.

Instead, DS achieves efficient expert balancing by using a bias term that is dynamically adjusted for each expert as the model makes predictions. If an expert is getting chosen a lot for the current data, this term penalizes it and incentivizes the model to pick others (and vice versa).

Using the DJ analogy, this ‘bias term’ acts as a second DJ that is carefully counting which knobs are being used more by the primary DJ, suggesting the use of other knobs that the DJ is ignoring to incentivize that all knobs are used evenly.
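Here’s a simplified sketch of that idea, loosely inspired by the auxiliary-loss-free strategy described in the DeepSeek-V3 paper (the scores, constants, and update rule below are illustrative): each expert gets a bias added to its routing score only when choosing the top-k experts, and the bias is nudged down for overused experts and up for underused ones.

```python
import numpy as np

rng = np.random.default_rng(0)
num_experts, top_k, gamma = 8, 2, 0.01       # gamma: how fast the biases adapt (illustrative value)
bias = np.zeros(num_experts)                 # the 'second DJ' keeping knob usage balanced
load = np.zeros(num_experts)                 # how often each expert has been picked

for token in range(10_000):                  # one routed token per step, with purely synthetic scores
    scores = rng.normal(size=num_experts)
    scores[0] += 2.0                         # expert 0 is 'better' and would otherwise dominate
    chosen = np.argsort(scores + bias)[-top_k:]      # the bias only affects WHICH experts get picked
    load[chosen] += 1
    picked = np.zeros(num_experts)
    picked[chosen] = 1
    bias -= gamma * (picked - top_k / num_experts)   # nudge down the overused, up the underused

print((load / load.sum()).round(3))          # usage ends up far more even than the raw scores suggest
```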

DS’ Contribution nº2: Achieving what was thought impossible: scaling an MoE with small experts using an expert-balancing method that doesn’t degrade performance, opening the door for MoEs to become the default and broadly improving AI inference.

For more detail on this, read Notion article {⚖️ Auxiliary-Loss-Free Node Balancing}.

Another very important milestone achieved by DS is that, as mentioned earlier, they managed to train their models in FP8 precision.

Native FP8

DeepSeek’s most significant victory is probably being the first to train a frontier model natively in FP8 precision, meaning each weight takes up just 1 byte of memory.

Thus, if we have 671 billion parameters, we require 671 gigabytes of memory (with FP16, we would have needed 1,342 GB). Additionally, this also implies that the theoretical compute throughput doubles to 3,032 TFLOPs, which is accelerated even further because only ~5% of the model activates per prediction.

This is why DeepSeek is so fast and cheap to run.

But how did they achieve this? This is probably the most fascinating part of the entire research; it’s quite literally a work of art. In a nutshell, the computational graph of DeepSeek's forward and backward passes is a beautifully executed dance between different precisions, quantizations, and dequantizations.

Choosing your precision matters. If you choose FP8, your model's parameters can only represent a smaller range of numbers, less precisely. The more precision you have, the more values each weight can take (think of this as weights being able to hold more decimals, so they can be modified and run more accurately). On the flip side, the more precision you have, the higher the cost in both compute and memory.

Therefore, while the logical incentive is to go for as much precision as possible, that isn’t viable at this scale. Thus, DS’s goal is to perform most computations in FP8 and increase or decrease precision only when necessary.
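Here’s a heavily simplified sketch of that quantize/dequantize dance, simulated with 1-byte integer codes and a per-block scale (it captures the idea, not DeepSeek’s actual FP8 kernels): values are stored cheaply and expanded back to higher precision right before a sensitive computation.

```python
import numpy as np

def quantize_blockwise(x: np.ndarray, block: int = 128, levels: int = 240):
    """Store each block as 1-byte codes plus one scale (FP8-style idea, simulated with int8)."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / (levels / 2) + 1e-12
    codes = np.round(x / scale).astype(np.int8)           # 1 byte per value
    return codes, scale

def dequantize_blockwise(codes, scale):
    """Expand back to full precision right before a computation that needs it."""
    return (codes.astype(np.float32) * scale).reshape(-1)

weights = np.random.default_rng(0).normal(size=1024).astype(np.float32)
codes, scale = quantize_blockwise(weights)
restored = dequantize_blockwise(codes, scale)
print("bytes stored:", codes.nbytes + scale.astype(np.float32).nbytes)   # ~1 byte/weight + tiny scales
print("max error:   ", np.abs(weights - restored).max())                 # small, but not zero
```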

DS Contribution nº3: Managing to train a native FP8 model for the first time, balancing different precisions when necessary while still allowing the model to be stored at FP8 and guaranteeing that most computations are made at smaller precisions. With this, DS has shown the world the way to native FP8 training.

Now, we move on to cover the inference improvements.

Inference

Once the model is trained and we want to run it, we don’t need to update the weights anymore; instead, we just make predictions.

This means that there’s only one pathway: from input to prediction. The model receives an input sequence and predicts the next word, appends this word to the sequence, and repeats the process.

However, there’s a twist to this process.

The Need for A Cache

The mixing operation we mentioned, attention, is deterministic. This means that attention between two words in a sequence gives the same result every time a new word is predicted. Therefore, to avoid redundancy, attention at inference time is not an all-to-all computation but a one-to-all computation.

But what do we mean by that? 

In essence, the keys and values of words that have already been processed are cached, and for every new prediction we only perform attention from the last-predicted word to the previous ones, fetching everything else from memory.

As seen below, in non-cached attention, we perform attention from every word to itself and all its prior words (words cannot ‘attend’ to future words).

For example, as the representations of ‘big’ that ‘dog’ attends to would otherwise be recomputed for every new word prediction, we compute them once, cache them, and simply fetch them from memory when needed.

This is what we call the KV Cache.

Algebra-wise, this means that, for every new prediction, attention becomes a matrix-vector multiplication, not a matrix-matrix one, which is why GPUs aren’t, on paper, the best option for inference (they excel on the latter).
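A minimal, single-head sketch of the mechanism (illustrative only, no real model): the keys and values of already-processed words sit in a cache, and each new prediction only computes attention from the newest word to everything stored, which is exactly the matrix-vector shape described above.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class KVCache:
    """Keys/values of already-processed words, so they are computed only once."""
    def __init__(self):
        self.keys, self.values = [], []

    def attend_new_token(self, x_new, Wq, Wk, Wv):
        # Only the NEW word's key/value are computed; everything else is fetched.
        self.keys.append(x_new @ Wk)
        self.values.append(x_new @ Wv)
        q = x_new @ Wq                                   # one query vector...
        K, V = np.stack(self.keys), np.stack(self.values)
        scores = K @ q / np.sqrt(K.shape[-1])            # ...against all cached keys: matrix-VECTOR math
        return softmax(scores) @ V

d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
cache = KVCache()
for word_embedding in rng.normal(size=(5, d)):           # process a 5-word sequence one word at a time
    out = cache.attend_new_token(word_embedding, Wq, Wk, Wv)
print(len(cache.keys), out.shape)                        # 5 cached keys, one 8-dim output per new word
```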

And what does all this mean for the underlying hardware?

Dealing With Two Memory Concerns

While training is a much more complicated endeavor (as you’ll surely appreciate by now), inference isn’t exempt from its complications. The main one here is that, as mentioned, we have to cache some of the attention computations.

Although I provide all the necessary detail in the Notion article {💾 The KV Cache}, the KV Cache is a considerable issue because, in its standard form, it keeps growing with the length of the sequence (and with every concurrent user being served).

For long sequences, the KV Cache memory requirements may grow larger than the model itself, into the hundreds or even thousands of GB, reaching the TeraByte zone!

To address this, DS introduces a technique that compresses this cache and only expands it during computation. Instead of storing the cache ‘as it is,’ it first computes and stores a compressed form. Then, whenever the cached activations are required, they are fetched in compressed form, expanded into their original form, and used as part of the forward pass.

The point is that the KV Cache is still being used as always, but it’s stored in a compressed form that, according to sources, is around 95% smaller, reducing the cache’s memory footprint accordingly.
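Here’s a rough sketch of the compression idea using a generic low-rank down-projection and up-projection (not DeepSeek’s exact Multi-Latent Attention formulation, and the sizes are hypothetical): we cache a small latent vector per word instead of the full keys and values, and expand it only when attention needs it.

```python
import numpy as np

d, latent = 512, 64                      # hypothetical sizes: cache 64 numbers per word instead of 512
rng = np.random.default_rng(0)
W_down = rng.normal(size=(d, latent)) / np.sqrt(d)       # compress: hidden state -> small latent
W_up_k = rng.normal(size=(latent, d)) / np.sqrt(latent)  # expand latent -> key, when needed
W_up_v = rng.normal(size=(latent, d)) / np.sqrt(latent)  # expand latent -> value, when needed

latent_cache = []                        # this compressed form is what actually sits in GPU memory
for hidden_state in rng.normal(size=(1000, d)):          # 1,000 processed words
    latent_cache.append(hidden_state @ W_down)           # store only the compressed latent

cache = np.stack(latent_cache)
print("cached floats per word:", cache.shape[1], "instead of", 2 * d)   # 64 vs 1,024 (keys + values)
keys_when_needed = cache @ W_up_k        # expanded on the fly during the forward pass
values_when_needed = cache @ W_up_v
print(keys_when_needed.shape, values_when_needed.shape)  # (1000, 512) each, never stored long-term
```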

The reasons why this works (and how it works) are explained in the Notion article {🗜️Multi-Latent Attention}.

DS’ Contribution nº4: Achieving a 95% KV Cache memory requirement reduction with no discernible performance loss.

Another change they introduce is multi-token prediction.

Sampling Multiple Tokens at Once

I won’t get too much into the details because it’s pretty straightforward to understand. The idea is that the model can predict more than one word in each prediction instead of the standard one.

In a way, you can think of multi-token predictors as multi-headed beasts, where each head, like the Hydra in Greek mythology, predicts its own word.

It’s worth mentioning that DeepSeek introduced this multi-token prediction regime as a training improvement, not an inference one. While it might sound like an obvious technique to boost inference efficiency, it surprisingly also improves training.

They argue that, with several heads, the heads predicting earlier tokens take into account what the later heads will predict, and vice versa. Consequently, the model is ‘looking into the future,’ making predictions that consider not only what has already been predicted but also what will be predicted next.

Think of this as a human who is much more careful about the words they say based on what they will say next. However, I wouldn’t call this a huge contribution, because the technique has been applied multiple times before in similar fashion and with similar outcomes.
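For the curious, here’s a bare-bones sketch of such a ‘multi-headed beast’ using generic extra prediction heads on a shared representation (DeepSeek’s actual multi-token prediction modules are more involved than this):

```python
import numpy as np

d, vocab, num_heads = 32, 100, 3        # hypothetical sizes: predict 3 future words at once
rng = np.random.default_rng(0)
heads = [rng.normal(size=(d, vocab)) for _ in range(num_heads)]   # one output layer per future position

hidden = rng.normal(size=d)             # the shared representation of the sequence so far
for offset, W in enumerate(heads, start=1):
    logits = hidden @ W                 # each head scores every word in the vocabulary...
    print(f"head {offset}: predicts token t+{offset} ->", int(np.argmax(logits)))
```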

Finally, we reach our destination: why would you give a damn about all this, and why is SV panicking about it?

It’s the Year Of Intelligence Efficiency!

If there’s one takeaway from DeepSeek’s research, it’s that there’s still plenty of room for improved techniques that allow training and inference to be executed more efficiently.

And if there’s a takeaway from markets this past week, besides the fact they are clueless, it is that increasing intelligence in brute terms with utter disregard for costs or efficiency will no longer be tolerated.

While the price has been paid mainly by NVIDIA (probably because most AI investors were in that stock and not the others, when in reality DeepSeek’s results are more favorable to NVIDIA than to anyone else), it’s a strong warning to Hyperscalers, and their frontier labs, that pushing the frontier “at all costs” is not acceptable anymore.

Put simply, people really don’t care if OpenAI is in the lead with a 10% advantage over DeepSeek’s models if the latter is 100 times cheaper.

Finally, these labs have realized that if China continues to lead the Pareto frontier of AI (while the US holds the best models overall, none of them, with the possible exception of Google, can compete with the cost efficiencies DeepSeek has implemented), people will keep flocking to Chinese models; it’s really not that hard.

As for you and me, here are some considerations:

  • The past week's events have also sent a clear signal that tech moats in AI don’t exist and, if they do exist for a while, are ephemeral.

  • Compute’s role is more important than ever; those AI labs with outsized access to computing will have a considerable advantage, as they will be able to run reasoning models at scale for much, much longer (both in training and inferencing).

  • In turn, this means that unless China does something about it, we could soon see costly products that only those with deep pockets can access. This, just like between AI labs, could create inequality among consumers, as some can pay to access these longer-thinking models, and others are left with the cheaper ones. If this scaling law is valid, cheaper = dumber, and those with access to o3 will have insane productivity advantages over those running GPT-4o at a $20/month price tag.

  • The pressures to ban Chinese models will be huge, and some senators are already preparing bills that attack your rights not in fear of ‘the CCP stealing your data’ (they couldn’t care less about your privacy) but in fear that Chinese models leave AI start-ups and their Hyperscaler backers with no chance of making a buck.

  • DS’ distillations (not mentioned today but covered in my previous piece on R1) prove that running extremely powerful models on consumer hardware is very possible. It could very well be the case that, during this year, you might even consider buying a high-performance laptop or desktop and running powerful reasoning models like DeepSeek’s in the comfort and security of your personal device. It’s funny because reasoning models certainly gave some breathing room to OpenAI, but they also might make their entire business model worthless unless they truly deliver a model that can’t be accessed anywhere else. In a way, they might have opened Pandora’s box on themselves, as reasoning models can be much smaller and, coupled with Chinese efficiencies, can be run at home.

  • As long as open-source is allowed to live, we will see increased speeds in shipping AI products from companies trying to remain in control; it’s no secret that DS forced OpenAI to release o3-mini. No system card, no safety evaluations, just pure desperation to continue to justify the huge $200/month price tag.

  • In turn, if you believe in AI’s risks, you will not be happy to hear that labs will probably cut many more corners to get things to market faster than before. Whether or not you believe in AI existential risks, there’s no denying AI is powerful, and clumsy delivery is never a good idea.

I can only thank you

If you’re one of the brave souls who has reached the end of this, quite frankly, endless essay, you are now totally up to date with the frontier of AI.

You have learned:

  • Crucial aspects of AI training and inference

  • How clusters are drafted and built

  • The crucial contributions that DeepSeek has undeniably offered the world

  • And the considerations that all this really implies for you and me.

Now, my scheduling will go back to normal, but I hope you understand why I felt the urge to make this extra effort and clarify all the questions you might have on your mind.

Until Thursday!

THEWHITEBOX
Premium

If you like this content, consider joining Premium. You will receive four times as much content weekly without saturating your inbox, and you will even be able to ask the questions you need answers to.

Until next time!

For business inquiries, reach out to me at [email protected]