The Whale Strikes Again

In partnership with

10x Your Outbound With Our AI BDR

Imagine your calendar filling with qualified sales meetings, on autopilot. That's Ava's job. She's an AI BDR who automates your entire outbound demand generation.

Ava operates within the Artisan platform, which consolidates every tool you need for outbound:

  • 300M+ High-Quality B2B Prospects, including E-Commerce and Local Business Leads

  • Automated Lead Enrichment With 10+ Data Sources

  • Full Email Deliverability Management

  • Multi-Channel Outreach Across Email & LinkedIn

  • Human-Level Personalization

THEWHITEBOX
TLDR;

This week, we discuss the latest AI drama (OpenAI’s cheating accusations against xAI), the extremely creepy advent of the first musculoskeletal robot, and an amazing example of agents working together to create a web app from scratch.

Moving on, we discuss the birth of a new AI super lab and Crunchbase’s AI-powered startup success predictor. We then move on to China’s latest AI model, by Kimi AI, and two breakthroughs in AI for biology and research: the largest foundation model for biology ever and Google’s AI Co-Scientist.

Finally, for this week’s trend of the week, we discuss DeepSeek’s latest research and how they intend to turn the world of AI upside down.

FRONTIER RESEARCH
The OpenAI vs. xAI Drama

Model performance on math-olympiad-level problems. Source: xAI

On Monday, xAI released its Grok-3 and Grok-3 Reasoning model families, which we covered in detail in Tuesday’s Premium-only news segment. Allegedly, these models were superior to OpenAI’s entire offering, validating Elon Musk’s claim to have built the “smartest AI on Earth.”

But it seems that we might have been misled, leading to the latest drama in the AI industry after OpenAI accused xAI of dishonesty or, in the words of OpenAI’s Head of Applied Research, just straight cheating.

The reason is the numerous charts, like the one shown above, that xAI used to prove Grok-3’s frontier-level capabilities. At first glance, you simply assume that the darker blue represents Grok-3’s performance without reasoning capabilities and the lighter blue the results once reasoning is added. Everyone, including OpenAI employees and me, read it that way, because that interpretation seems evident absent further information, and it assumes xAI is being honest.

Under this assumption, Grok-3 Reasoning appears to be way ahead of o1 and also ahead of o3-mini. However, in the release blog, xAI clarifies that the lighter blue part is the result after ‘cons@64.’ And that, my dear reader, changes things a bit.

What you’re looking at above is the following:

  1. The darker blue already refers to Grok-3 Reasoning, but on a pass@1 metric. In other words, it shows Grok-3 Reasoning’s performance when measuring whether it correctly solved the problem in one try.

  2. The lighter blue refers to Grok-3 Reasoning’s performance on ‘cons@64,’ which translates to ‘consensus at 64 tries.’ In layman’s terms, it shows the performance when the consensus response (the most common) across 64 tries is correct. For instance, an accuracy of 93% means that for 93 out of 100 problems in the benchmark, the consensus response across that problem’s 64 tries is the correct answer (see the small sketch below).
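
To make the metric concrete, here’s a tiny Python sketch (with made-up answers, not real benchmark data) of how pass@1 and cons@64 are computed for a single problem:

```python
from collections import Counter

def pass_at_1(answers, correct):
    # pass@1: grade only the first (single) attempt.
    return answers[0] == correct

def cons_at_k(answers, correct):
    # cons@k: grade the most common answer across k attempts.
    most_common_answer, _ = Counter(answers).most_common(1)[0]
    return most_common_answer == correct

# Hypothetical attempts at a problem whose true answer is 42
# (a real cons@64 run would have 64 of these).
attempts = [41, 42, 42, 37, 42, 41, 42, 42]

print(pass_at_1(attempts, 42))  # False: the single graded try missed
print(cons_at_k(attempts, 42))  # True: the consensus answer is 42
```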

This wouldn’t be a significant issue if it weren’t for the fact that the results xAI shows in the graph for o3-mini are not ‘cons@64,’ but pass@1. Therefore, in an apples-to-apples comparison, Grok-3 Reasoning gets less than 80% while o3-mini gets 87%, clearly proving that, benchmark-wise, o3-mini is better. However, the chart makes you believe that Grok-3 Reasoning is better when, in reality, it seems to be closer to o1’s level.

TheWhiteBox’s takeaway:

This supports my argument that you should take model results on public benchmarks with a pinch of salt and try the models for yourself. In fact, I’ll go further: your first reaction to a frontier AI lab claim, whether xAI or OpenAI, must be to assume you are being misled.

So, you may wonder, what’s the actual ranking? Here you go:

ROBOTICS
The First Musculoskeletal Android. And it Moves!

Today, I feel like providing some fuel for your nightmares.

This video shows Protoclone, the world's first bipedal, musculoskeletal android. Unlike most robots being designed today, including Tesla’s Optimus or Figure AI’s robot, which assemble the humanoid from rigid frames and servomotors (this video serves as an example of both what a standard robot looks like and how people are literally trying to build the Terminator), Clone uses artificial muscles that directly mimic the human body, making its movements more ‘natural.’

This allows the robot to have more than 200 degrees of freedom, over 1,000 myofibers (meaning that the muscles have artificial fibers, again mimicking human muscles), and 500 sensors.

TheWhiteBox’s takeaway:

I really don’t know what to say. What is your first reaction? I was mind-blown, and I must admit I was slightly scared too. The music choice doesn’t help.

On the technical side, we must be skeptical. There’s a reason why most labs are choosing more conventional robotic options, metal frames and servomotors, instead of actual artificial fibers. Also, building the thing is one matter; making it move is another (there is a reason why it’s hanging and not walking). There is a long way ahead for this company.

AGENTS
OpenAI & Replit Agent Working Together

This video showcases what I consider a remarkable scenario: OpenAI’s Operator (the agent that interacts with the web) uses Replit’s Agent to create a web app with zero human interaction.

TheWhiteBox’s takeaway:

You may wonder if I use these tools. I do not.

For now, I prefer to participate more actively in the building process via tools like Cursor. Sure, there’s a non-zero chance that human coding stops making sense by the end of the year, but in the current state of affairs, if you want things to work as intended, you need to get your hands dirty.

Sure, the video mentioned above is fantastic and literally gives my grandad the ability to build software, but taking it to production is another story. To me, it is a great way to visualize the future of software… but not the present.

STARTUPS
Crunchbase’s Startup Success Predictor

Crunchbase has announced that its internal AIs can predict startup fundraising events, acquisitions, IPOs, and even potential layoffs. While the VentureBeat headline is frankly sensationalist (Crunchbase does not predict with 95% accuracy whether a startup succeeds; it claims 95% accuracy at predicting fundraising events), the use of AI to find common patterns among successful startups feels completely obvious.

The CEO even considers the old data model to be “dead” and believes that companies that have built their business around their data are doomed in an AI-led world: data always leaks, and companies with better AIs will be able to match that data with open Internet information and find key patterns the old-data company cannot. If the CEO’s intuition is correct, he’s basically saying system-of-record data companies are dead, which would be disastrous for the SaaS market (I myself am a huge hawk on SaaS companies in general).

TheWhiteBox’s takeaway:

I’ll repeat it: AI’s greatest (maybe only) true superpower is finding patterns in data. In fact, most frontier AI performance is just pattern matching, not real intelligence, but I digress.

Using big data to find patterns is not new, but Generative AI models can help us make sense of these patterns better than Machine Learning methods that use statistical analysis and require extensive human engineering. That’s the key: frontier AI models break free from their human chains and find patterns without our help, naturally leading to finding patterns humans can’t. On the other hand, ML’s extensive human engineering naturally leads to ‘discoveries’ that are biased by human assumptions that hide the non-obvious patterns.

Finally, on web agents like OpenAI’s Operator, Microsoft has released a new version of OmniParser, which allows you to turn any LLM into an agent. The process is quite straightforward: OmniParser understands the image (where the button is), and the LLM does the reasoning (which button the agent has to click). The video shows it working remarkably well.
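
If you’re curious what that split looks like in practice, here’s a conceptual sketch. Every name and output below is a hypothetical stand-in, not OmniParser’s real API:

```python
# Conceptual parser + LLM split. These functions are hypothetical
# stand-ins for illustration, NOT the real OmniParser interface.

def parse_screen(screenshot: str) -> list[dict]:
    # Stand-in for the vision model: returns labeled UI elements
    # with their bounding boxes ("where the button is").
    return [{"label": "Search button", "box": (120, 40, 180, 70)},
            {"label": "Login link", "box": (300, 10, 360, 30)}]

def llm_decide(goal: str, elements: list[dict]) -> dict:
    # Stand-in for the LLM: reasons over the parsed elements and
    # picks the one matching the goal ("which button to click").
    return next(e for e in elements if "Search" in e["label"])

if __name__ == "__main__":
    target = llm_decide("search for flights", parse_screen("screen.png"))
    print("Click at", target["box"])  # the agent would click this box
```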

FRONTIER RESEARCH
New AI Super Lab.

A team of superstar researchers from OpenAI, Character.ai, Mistral, and Meta, among others, led by ex-OpenAI CTO Mira Murati, with OpenAI cofounder and ex-Anthropic researcher John Schulman as Chief Scientist, has launched Thinking Machines Lab, a new AI startup.

In its first blog post, the lab presents itself as an ‘open lab,’ clarifying that it intends to publish its breakthroughs and models, and declares it will focus on multimodality, the idea that AI models can work with more data types than just text.

No investment rounds have been announced, but considering Ilya Sutskever’s latest $30 billion valuation of SafeSuperIntelligence, Inc., researcher star power almost guarantees billion-dollar financing rounds.

TheWhiteBox’s takeaway:

Setting aside the honorable exception of xAI, which jumped onto the frontier train after it had already left the station, new AI startups will have a hard time matching OpenAI or Anthropic, even when founded by the same people who built those companies.

For that reason, it’s not surprising that they are focusing on multimodality. This feature was all the rage a year ago but has lost priority lately in favor of chain of thought models, also known as reasoning models.

  • OpenAI isn’t really focused on multimodality anymore.

  • Anthropic never was.

  • Google is only multimodal when processing inputs, not when generating them.

  • DeepSeek’s frontier models aren’t multimodal, full stop, and their multimodal efforts are still very early with examples like Janus.

  • xAI’s Grok-3 capabilities, albeit multimodal, were mentioned just once and only during the Q&A, which makes clear that multimodality isn’t a priority for them either.

Therefore, this project looks like these researchers’ way of positioning themselves against the trajectory most Western labs have taken, carving out their own niche and, at the same time, reversing the industry’s drift toward closed source.

FRONTIER RESEARCH
Kimi AI’s Free SOTA Model

As Donald Trump would say, ‘China, China, China.’ China is delivering, my dear reader, and looks committed to destroying Western AI labs’ hopes of making money. This time, Kimi AI has launched Kimi 1.5, which appears to be on par with Western non-reasoning models.

But the key is that it’s free and, crucially, has unlimited usage. They are quite literally giving it away for free.

TheWhiteBox’s takeaway:

While the world debates the frankly stupid question of who’s ahead in the AI race right now (the US is obviously ahead, at least for now, and by not as much as some think), the real insight is not what they deliver but how they deliver it.

China isn’t worried about having the best AIs or beating OpenAI. It is obsessed with making Western business models unprofitable and doing so amazingly well. If the US intends to prevent China from accessing GPUs, China will ensure that no one has a business model.

If we don’t win, we’ll make sure nobody does. That’s a great way to summarise the Chinese mentality regarding AI.

So yeah, in case you’re wondering, I’m saying that China is on your side, forcing Western companies to drop prices and ship faster. Whether you use their products or not, they are still dropping the prices of the products you use, be it ChatGPT or Gemini.

If it weren’t for China, I can guarantee you wouldn’t have o3 shipped at the prices we currently enjoy. Besides Google, which has unparalleled knowledge and efficiency regarding inference, the others are basically subsidizing their products to keep users invested. Not even OpenAI is making profits, even in its priciest tiers.

DECIPHERING THE GENOME
Evo 2, The Largest Biology AI Model

Almost a year ago, I introduced Evo, the first biological foundation model. Now, the same researchers have presented the new version, Evo 2, a foundation model for biology trained on data spanning more than 100,000 species and 128,000 complete genomes.

But why? One of the researchers summarises it best:

"Evo 2 has a generalist understanding of the tree of life that's useful for a multitude of tasks, from predicting disease-causing mutations to designing potential code for artificial life. We’re excited to see what the research community builds on top of these foundation models.” 

In layman’s terms, using trillions of data points, they have built an AI model that understands the complex nature of genetics and can find key patterns in our genome that could lead to diseases, even if these are millions of nucleotides apart.

TheWhiteBox’s takeaway:

In our 2025 predictions, we included the first true AI discovery.

Considering the advent of Evo 2 for biology and Google’s AI Co-Scientist, a Gemini 2.0-powered system that allows scientists to go through research ideas and hypotheses orders of magnitude faster, my prediction is starting to look deliciously close to coming true.

TREND OF THE WEEK
The Whale Strikes Again

DeepSeek, the whale-logo company and China’s best AI lab, which a few weeks ago provoked (however unjustifiably) the largest single-day stock decline in history in NVIDIA, has released new research. And based on what happened last time, the world is paying immediate attention.

And even more so when its intentions are clear and bold: to prove once again that the West is wasteful.

Called Native Sparse Attention (NSA), it dares to propose modifying, for good, the operation at the heart of all frontier models, possibly unlocking infinite-length context windows, something most AI labs have tried and failed to achieve.

But if someone can pull it off, it's these guys.

Today, you’ll learn two things:

  1. the intuition as to how frontier AI models understand language

  2. and, crucially, how DeepSeek intends to change the status quo once again.

Talking Words

Despite their impressive capabilities, all frontier AI models are surprisingly simple at heart. I’ve explained this multiple times in more detail, so I’ll keep it short, but I still need to explain a few things to make sense of DeepSeek’s research.

The Crucial Basics

Have you ever wondered where a word’s meaning comes from? It’s derived from two elements:

  • its intrinsic meaning (forest represents a group of trees, not a flock of birds),

  • and its meaning as a function of its surroundings (i.e., ‘bank’ means two different things depending on context).

Thus, while the former is provided by default, the AI must build the latter once it receives the sequence. Therefore, to process (understand) language, language models perform a series of updates, updating each word with its surrounding context.

For the sequence “The Blue Pirate,” the word ‘pirate’ attends to ‘blue’ to incorporate its meaning so that, in the next iteration, the pirate isn’t just a bearded, smelly, armed sailor; it’s also blue now.

Therefore, for every sequence ChatGPT reads, every word does this process with every other word without exception. This process is called full attention.

To perform this, words take the form of vectors, where each number in the vector can be considered an attribute of the word. For instance, a pirate may have four numbers in its vector, such as [3, 5, 8, 1], where each might represent common pirate attributes such as [‘country,’ ‘won battles,’ ‘color,’ ‘beard/beardless’].

So, when it attends to ‘blue,’ the number 8, which might represent the pirate as ‘red,’ may become a 5, representing that the pirate is now blue. This is an extreme oversimplification, but that’s what happens at the first-principles level.

To reinforce the idea that each number in a concept’s vector represents an attribute, look at the example below, where we combine these vectors, adding and subtracting specific features so that one concept becomes another, like taking the attribute of ‘man’ from ‘king’ and adding ‘woman,’ resulting in ‘queen.’

This is one of the most famous results in AI history, the day word embeddings, a foundation of modern AI, were born.

But why use vectors?

First, beyond this combinatorial benefit, machines can only process numbers. Second, vectors allow us to compare the semantic similarity between words. If ‘dog’ and ‘cat’ have similar vectors (because they share multiple attributes in real life), that signals to the machine that they are semantically related (both mammals, domestic, four-legged, …).

This is crucial because it allows machines to understand the similarity between real-world concepts. It turns the complexities of our world into mathematical operations that can be computed; if two concepts have similar vectors, this signals the AI that they are similar in real life.
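
Here’s a toy sketch of both ideas, using made-up four-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and the attribute values below are invented for illustration):

```python
import numpy as np

# Toy word vectors; each number is a (made-up) attribute, as in the
# pirate example above.
vectors = {
    "king":  np.array([0.9, 0.8, 0.10, 0.7]),
    "man":   np.array([0.1, 0.9, 0.00, 0.1]),
    "woman": np.array([0.1, 0.0, 0.90, 0.1]),
    "queen": np.array([0.9, 0.0, 0.95, 0.7]),
    "dog":   np.array([0.0, 0.2, 0.20, 0.0]),
}

def cosine(a, b):
    # Semantic similarity: close to 1.0 means very similar directions.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman lands very close to queen...
result = vectors["king"] - vectors["man"] + vectors["woman"]
print(cosine(result, vectors["queen"]))  # ~0.99, very similar
# ...and much farther from an unrelated word.
print(cosine(result, vectors["dog"]))    # ~0.4, not so similar
```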

Isn’t it beautiful?

Understand this, and you now understand how any frontier model works. Every other complexity you run into in AI falls away, as every breakthrough you see on the news boils down, at first principles, to computing semantic similarity between vectors.

Circling back to attention, it now makes sense how LLMs perform it. Every word searches for semantically similar words in the sequence and absorbs their information by ‘attending’ to them. This way, adjectives find their nouns, adverbs find their verbs/adjectives, and so on.
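
For the curious, here’s a minimal sketch of that process: single-head self-attention over tiny made-up vectors. Real models add learned query/key/value projections and many heads, but the core loop of “score by similarity, then absorb a weighted mix” is this:

```python
import numpy as np

def attention(words: np.ndarray) -> np.ndarray:
    """Minimal self-attention: each word scores every other word by
    similarity, softmaxes the scores, and absorbs a weighted mix of
    their vectors. Real models add learned Q/K/V projections."""
    scores = words @ words.T / np.sqrt(words.shape[1])  # similarity scores
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)       # softmax per word
    return weights @ words                              # context-updated words

# "The Blue Pirate": 3 words, 4 made-up attributes each.
sequence = np.array([[0.1, 0.0, 0.2, 0.1],   # 'the'
                     [0.0, 0.9, 0.1, 0.0],   # 'blue'
                     [0.2, 0.1, 0.8, 0.9]])  # 'pirate'
print(attention(sequence))  # 'pirate' now carries part of 'blue'
```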

For a deep rundown of attention in more technical terms, read the blog post I wrote covering it in full detail.

And well, it seems DeepSeek now wants to change this completely. But why?

From Global to Sparse

The issue with the attention mechanism is that its cost grows quadratically with sequence length (and it gets really expensive), as each new predicted word has to attend to a growing number of previous words.

Worse, most attention mechanisms are global, meaning that every word attends to every single word in the sequence, no matter how distant. If you send ChatGPT a book, words in Chapter 12 will still attend to every word in Chapter 2, even though most have no valuable information to share. This is done on purpose: if there’s a chance something important has to be shared, we guarantee the model picks up on it.

For example, if we send ChatGPT an Agatha Christie book, at one point later in the book, the protagonist may reference a key occurrence that happened earlier. If attention isn’t global, ChatGPT would basically forget that fact and lose the context.

Therefore, we are at a crossroads. We desperately need to lower the cost of attention, but we don’t want to lose our capacity to attend to everything so that the model can pick up even the minor details.

Consequently, researchers have tried for years to introduce sparse attention, a way to trade attention granularity for sparsity, i.e., attending less exhaustively without significantly reducing performance. However, they have mostly failed miserably.

And DeepSeek might have finally solved the puzzle.

A Three-Step Process

As mentioned, most AI models are global attention models, meaning that every word in the context is attended to, no matter how irrelevant.

The Great Scaling Problem.

As we have also explained, each word looks back at all previous words in the sequence and asks, “What information can you provide me?” Then, every one of these words answers back, and the word decides which words it should pay more attention to.

Naturally, as some words have much more meaningful context to provide than others (i.e., the killer’s name mentioned in Chapter 2 by Hercule Poirot is much more relevant than the word ‘uhm’ in Chapter 6), attention usually collapses onto a few words. Put another way, while we force the model to perform full attention, the results mostly end up being sparse (only a few words actually matter).

The problem is that if your sequence has a million words preceding the word performing attention, that word will ‘talk’ to all million words and only then decide which ones have valuable information. Compute costs therefore increase considerably as the sequence grows. Worse, the memory requirements, which we’ll set aside in today’s discussion, also explode.

This, my friend, is the biggest limiting factor as to why ChatGPT and other AI products don’t let you feed infinitely long sequences; it’s not that the model can’t process them; it’s that it’s too expensive.
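
The arithmetic behind that explosion is brutally simple; a back-of-the-envelope sketch:

```python
# Back-of-the-envelope: full attention does roughly n^2 word-to-word
# comparisons for a sequence of n words (each word attends to all others).
for n in [1_000, 10_000, 100_000, 1_000_000]:
    print(f"{n:>9,} words -> ~{n * n:>22,} comparisons")
# Going from 1K to 1M words (1,000x longer) costs ~1,000,000x more compute.
```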

So, what does DeepSeek propose?

Making Sparse Attention Viable

The question is, how do I identify the best way to process long sequences without having to examine every word independently?

Traditionally, there have been two common approaches:

  1. Alternative architectures like state-space models (Mamba). We aren’t touching them today, but just know that they propose keeping a state, or memory, that is fixed in size. For every new word the model reads, it asks whether the information is worth remembering or not. We keep our memory requirements low, but we give up global attention in exchange.

  2. Sparse attention methods, which compress the attention process.

Sadly, the first group has serious performance issues. The model is forced to have a fixed-sized memory, eventually forgetting important things. Regarding the latter, their theoretical efficiency gains rarely translate to reality.

DeepSeek’s approach falls in the second category but fundamentally differs from previous proposals. It divides attention into three parts: compression, selection, and a sliding window.

I’ll first explain the three components and then use an analogy to ground everything, so bear with me.

  1. Compression: It divides the sequence of words into chunks, compressing the content into a summarised, single piece of information for each chunk. 

  2. Selection: Of course, a summary always implies information loss. Therefore, it adds a selection mechanism that chooses the most important chunks. If a chunk is selected, the entire chunk is reread (meaning attention is performed with every word in the chunk).

  3. Sliding window: Most text has recency bias, meaning that recent words are generally more important for a word in a sequence. For instance, to understand Chapter 12, the information in Chapter 11 is usually more important than the information in Chapter 5. Therefore, inside a window of recent tokens, it pays attention to all of them.

When the three sets of information are retrieved, they are combined and shared with the word, preventing it from attending to every word in the context. Instead, the word is provided with summaries of all chunks in the context and can dive deep into where it might be most interesting.

This way, although we are performing sparse attention and some words are not attended to individually (risk of context loss), you still have great chances of not missing any important stuff.

For instance, using the book analogy:

  1. The compression method breaks the book into chapters and compresses the information of each chapter into its summary (i.e., in chapter 3, this happened).

  2. The chapters that appear most relevant to the current one are selected and read thoroughly, in case any noteworthy detail has to be remembered (i.e., in Chapter 12, Poirot recollects his thoughts about the night of the crime, detailed in Chapter 6, so Chapter 6 is retrieved fully).

  3. For the most recent chapters inside the recency window, it reads them thoroughly nonetheless (i.e., if we’re in Chapter 13, Chapter 12 is read fully due to recency bias).

Finally, all the information is combined and provided to the word, which can update its contextual meaning with a great deal of information processed across millions of words… without having to attend to every single one of them.
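
To ground the three branches in something concrete, here’s a deliberately toy sketch in Python. To be clear, this is not DeepSeek’s implementation: the real NSA uses learned compression, a trainable gating mechanism to combine the branches, and custom GPU kernels. The mean-pooling and dot-product scoring below are stand-ins, just to show how the three sources of context come together for one word:

```python
import numpy as np

def nsa_sketch(tokens, query_idx, chunk=64, top_k=2, window=128):
    """Toy version of the three NSA branches for one query token.
    tokens: (n, d) array of token vectors; query_idx: the attending word."""
    q = tokens[query_idx]
    past = tokens[:query_idx]

    # 1) Compression: split the past into chunks and summarize each one
    #    (here a simple mean; the real method learns the compression).
    chunks = [past[i:i + chunk] for i in range(0, len(past), chunk)]
    summaries = np.array([c.mean(axis=0) for c in chunks])

    # 2) Selection: score chunks against the query and reread the
    #    top-k most relevant chunks in full, word by word.
    scores = summaries @ q
    selected = np.concatenate([chunks[i] for i in np.argsort(scores)[-top_k:]])

    # 3) Sliding window: always attend fully to the most recent tokens.
    recent = past[-window:]

    # Combine the three sources and attend only over them (sparse!),
    # instead of over every word in `past`.
    context = np.concatenate([summaries, selected, recent])
    w = np.exp(context @ q)
    w /= w.sum()                # softmax over the reduced context
    return w @ context          # the query token's updated vector

tokens = np.random.rand(1000, 16)
print(nsa_sketch(tokens, query_idx=999).shape)  # (16,)
```

Note how the query never touches most of the past directly; it sees summaries of everything, plus full detail only where it matters. That’s where the savings come from.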

And the beautiful thing is that… it just works.

Results and Hardware

As seen in the image below, models trained with NSA instead of global attention (full attention) match or exceed the latter's performance while having insane speed and cost improvements.

But why does a method that purposely forgets some context information outperform the full attention method?

We don’t fully know, but my intuition is that the algorithm finds a signal-to-noise sweet spot.

In layman’s terms, NSA seems to eliminate the noise and let the model focus on what matters, with the added virtue of being cheaper.

But DeepSeek wasn’t finished, as the most significant reason this could eventually work is that the method is “hardware-aware.” Most sparse attention methods select individual tokens to attend to in the context. This is suboptimal for GPUs, which, by design, perform best when words are retrieved in contiguous blocks.

In layman’s terms, GPUs work best when they retrieve entire contiguous chunks of the context instead of individual, scattered pieces of data, as the latter requires many more reads from memory. This algorithm minimizes the number of times the GPU has to read memory, reads that take time and increase latency.

In other words, it ensures that contiguous data, such as words in the same chapter, are processed together, making the entire process insanely fast compared to traditional methods.
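
The difference is easy to feel even from Python. A rough illustration (exact numbers vary wildly by machine; this only shows the direction of the effect):

```python
import numpy as np
import time

x = np.random.rand(2_000_000, 64)        # pretend: 2M token vectors
block = np.arange(100_000)               # one contiguous block of rows
scattered = np.random.choice(len(x), 100_000, replace=False)  # random rows

t0 = time.perf_counter()
_ = x[block].sum()                       # contiguous read
t1 = time.perf_counter()
_ = x[scattered].sum()                   # scattered gather
t2 = time.perf_counter()
print(f"contiguous: {t1 - t0:.4f}s, scattered: {t2 - t1:.4f}s")
# Scattered gathers are typically several times slower: hardware prefers
# reading neighboring memory, which is why NSA selects whole chunks
# rather than individual tokens.
```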

This Could be Bigger than R1.

If NSA delivers on its promises, this is hands down DeepSeek’s biggest contribution, even more so than DeepSeek v3 and R1. And we know what happened that time with the markets…

If AI finally conquers long-context data, the explosion of use cases and value could be enormous. And, once again, that could be thanks to China.

How does that make you feel about the West?

THEWHITEBOX
Join Premium Today!

If you like this content, join Premium and you will receive four times as much content weekly without saturating your inbox. You will even be able to ask the questions you need answers to.

Until next time!

For business inquiries, reach out to me at [email protected]