Acrobatic Robots, OpenAI Buying the World, DSA

THEWHITEBOX
TLDR;

Welcome back! This week, we have exciting news, with acrobatic robots, new algorithmic breakthroughs, OpenAI “buying the world”, an insanely good document understanding product you can use today, and more.

Enjoy!

THEWHITEBOX
Things You’ve Missed By Not Being Premium

On Tuesday, we came packed with announcements, including the best coding model (self-proclaimed) in the world, Claude 4.5 Sonnet, as well as another powerful frontier model from China, GLM-4.6.

We also discussed the magnitude of data center buildouts in the US, accompanied by a few images and numbers that will leave you speechless, as well as a new risk for NVIDIA’s dominance coming from China and a range of new products from OpenAI that will leave you equally impressed and worried.

STARTUPS
Most Used AI Startups… by Other Startups

a16z, a top-three venture capital firm in terms of deployed capital and an investor in most of the hottest AI startups, has presented a list of AI startups whose products are being used by other startups.

Although the list has an obvious conflict of interest baked in, as most of the so-called winners here are part of a16z’s portfolio, it’s still a nice way to visualize which AI startups are deemed ‘popular’.

Among the main insights, we have:

  • 60% of top spend is in horizontal applications, those designed to boost productivity across roles. The remaining ~40% is in vertical / role-specific solutions.

  • Among the vertical (role-specific) apps, most (~12 out of 17) are augmenting human work (automating repetitive tasks). Only a minority (~5) aim to replace human workflows fully (end-to-end agentic systems). What a surprise, maybe because framing AI as a substitution tool is too much of a grifter thing to say?

  • “Vibe coding” is no longer just a consumer or hobbyist trend, they say. Four such companies made the top-50 spend list (Replit, Cursor, Lovable, and Emergent). Personally, beyond prototyping, I find spending on vibe-coding apps like Lovable ludicrous. Cursor is an obvious choice, but I don’t consider it a vibe-coding platform, in the same way Claude Code isn’t one either.

  • Nearly 70% of the top 50 companies can be adopted individually (i.e., they don’t require enterprise licensing). Et voilà, the real golden nugget here. They are framing this as enterprise spending when, in reality, it’s just scrappy founders and early employees frugally using non-enterprise licenses. Hardly proof of enterprise value.

I may come across as overly sarcastic, but I really, really believe most of these companies are living on borrowed time. The truth is, many of them are as irrelevant as they get.

Additionally, many of these startups (at least one in the list above) engage in outright “ARR fraud”: claiming recurring revenues that are not actually recurring, masking massive churn with aggressive customer acquisition, and in some cases even lying about customers they don’t actually have.

I genuinely believe that a newer generation of AI startups, based on strong models trained with end-to-end on-the-job training, will emerge and outperform all these.

But if I could highlight one that is maybe different, that would be Plaud, one of the few American startups with enough balls to build hardware.

My main criticism of the US’s startup ecosystem is that they all take the easy route; they are all building the same thing in different shapes and sizes. What looks from the outside like a flourishing and varied ecosystem is actually the same VCs pouring money into the same types of companies.

The lack of innovation is glaring.

Plaud is different. I haven't yet purchased the product, but having an AI-powered note-taker attached to my phone feels very compelling for someone like me who likes to take notes for almost everything. For once, my immediate reaction to a hardware product coming from Silicon Valley isn’t to burst out laughing.

Besides, it’s not a spy disguised as innovation in the way products like the Friend pendant are; Plaud only records meetings, chats, and similar content, the kind of information people already register anyway.

I mean, you really want to have this ‘thing’ around your neck? As a Fortune magazine journalist put it, “it’s like wearing your senile, anxious grandmother around your neck.”

Recording people in live, informal settings, where you’re clearly violating their privacy, is one thing; recording a meeting is another. But beyond Plaud, there’s really zero hardware being built in the US except for one or two humanoids.

ADOPTION
Chatbot Usage Consolidates, and… Falls?

A survey by The Information is starting to show a pattern of adoption that can be summarized in a few points:

  1. Overall chatbot usage continues to grow

  2. However, usage levels for top chatbots are, surprisingly, falling

  3. Ironically, usage is consolidating among top players, with smaller products, such as Midjourney, experiencing a decline in usage.

  4. Revenues might still be growing because paid conversions are increasing, especially in professional settings.

TheWhiteBox’s takeaway:

The survey sends some contradictory messages. Overall growth, yet smaller per-chatbot usage? Higher concentration among the top players, even as their individual metrics worsen?

It’s confusing. Therefore, don’t jump to conclusions beyond one clear insight: people are becoming more accustomed to spending money on these products, signaling better product-market fit and better features.

As shown below, people are utilizing more advanced features, such as Deep Research or computer agents, and the overall sentiment is positive (63% are satisfied with the services, with 20% being “extremely satisfied”).

Although there are very few things to celebrate in this industry regarding adoption, it’s becoming clear that chatbot assistants are helpful.

We must not confuse the fact that most incumbents’ huge demand is self-generated (investing in AI Labs in exchange for compute credits) with the separate question of whether the products are valuable.

MEMORY HARDWARE
Samsung and OpenAI Sign Huge Deal

Samsung, SK Hynix, and OpenAI have signed a partnership to co-develop next-generation AI infrastructure, anchored by OpenAI’s Stargate Project. Sources claim that OpenAI has predicted a requirement of up to 900,000 DRAM wafers per month, a staggering amount of memory chips.

TheWhiteBox’s takeaway:

There’s a reason why all three leading DRAM providers (Samsung, $SSNLF, up 9.01%; Micron, $MU, up 2.28%; and SK Hynix, up 131% YTD) are having such great stock performances this year: they are the only providers of one of the most valuable pieces of equipment on the planet right now, high-bandwidth memory, or HBM.

But why?

Well, because this is the memory used on GPUs. So, unless the data center bubble bursts, these companies essentially have a guarantee of selling every single SKU (stock-keeping unit) they produce over the next five years, at the very least.

In fact, they will actually be totally incapable of meeting demand for the foreseeable future. The reasons are two-fold:

  1. A large pool of buyers, all very wealthy and willing to outbid competitors to acquire this highly priced asset.

  2. The per-GPU HBM allocation, the amount of HBM memory each GPU carries, is growing considerably. At this point, it’s the most significant single cost in NVIDIA’s GPU BOM (Bill of Materials).

The reason is that inference (actually running the model) in Large Language Models (LLMs) is memory-bound: the bottleneck is how fast data can be moved in and out of memory, not raw compute power.
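To make that concrete, here’s a rough back-of-the-envelope sketch in Python. The model size, HBM bandwidth, and peak compute figures are assumed, illustrative numbers (not vendor specs or benchmarks); the point is only the order-of-magnitude gap for single-request decoding.

```python
# Illustrative sketch of why LLM decoding is memory-bound: generating ONE token
# with a dense model requires streaming every weight from HBM, but performs
# only ~2 FLOPs per weight. All numbers below are assumptions for illustration.

params = 70e9              # assumed model size: 70B parameters
bytes_per_param = 2        # FP16/BF16 weights
hbm_bandwidth = 3.0e12     # assumed ~3 TB/s of HBM bandwidth
peak_flops = 1.0e15        # assumed ~1 PFLOP/s of dense compute

bytes_moved = params * bytes_per_param   # ~140 GB read from memory per token
flops_needed = 2 * params                # ~140 GFLOPs per token (1 mul + 1 add per weight)

time_memory = bytes_moved / hbm_bandwidth    # time just to stream the weights
time_compute = flops_needed / peak_flops     # time just to do the math

print(f"memory time : {time_memory * 1e3:.1f} ms per token")
print(f"compute time: {time_compute * 1e3:.3f} ms per token")
# Memory time dominates by two to three orders of magnitude: the GPU mostly
# waits on HBM, which is why more and faster HBM is so valuable.
```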

An undersupplied product desired by filthy rich companies. What more can you ask for?

On the geopolitical side, two comments:

  1. HBM is South Korea’s way of becoming essential in the AI supply chain (Samsung and SK Hynix account for 73% of the market, with the rest produced by the US company Micron).

  2. The DRAM wars are just another way for AI Labs and Hyperscalers to build moats. OpenAI’s ambitions extend well beyond its product; it is actively seeking to undermine its rivals at the hardware level.

No hardware, no AI. No AI, no competition.

RESEARCH
DeepSeek’s New Breakthrough

It’s been a while, but it seems DeepSeek has finally decided to innovate again. And they have done this by solving what’s possibly one of the last remaining challenging tasks in AI these days: sparse attention.

Let me explain. Models like ChatGPT are impressive, no doubt about that. But they are also equally inefficient in the way they process data.

The human mind, when reading text, immediately realizes that some words matter more than others. We see words like ‘uh’ or ‘mmm’ or filler parts of the passage, and ignore them, focusing on the real message the messenger is trying to send.

Imagine listening to politicians talk and having to scrutinize everything they say, even though most of it is just empty words! Luckily, our brains are good at ignoring the irrelevant.

Most AI models work differently. Models like ChatGPT also capture the message, but only after scrutinizing the entire sequence with utmost detail.

This is what we call the ‘full attention mechanism’, an AI’s way of making words ‘talk’ to each other and share information. This can be intuitively understood if we think of the word ‘bank’, which has two possible meanings: river bank, or a financial bank.

To know which bank we are referring to, we need the context of the sequence. If the model sees the sentence “I need to take the money out of the bank”, attention lets the word ‘bank’ attend to ‘money’, and the model realizes we are talking about the financial kind of bank.

But as I was saying, the model doesn’t instinctively know that ‘money’ is the keyword that clarifies what bank we are talking about. Instead, it will attend to all words, individually, without exception.

In a nutshell, they treat every word as something worth attending to, which makes them increasingly fragile as the sequence length grows, in two ways:

  1. As the text sequence grows, so does the amount of context. As they are treating every single word equally, it becomes increasingly challenging to distill the key messages. Put simply, they get ‘lost in the sauce’.

  2. It makes the cost of processing and generating sequences quadratic in sequence length, meaning costs explode as models are fed larger amounts of data (the quick sketch below shows how fast this grows).
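Here is a tiny sketch of what “quadratic” means in practice (constant factors ignored; this only illustrates the scaling):

```python
# Full attention compares every token with every other token, so the number of
# interactions (the size of the attention matrix) grows with the square of length.

for n_tokens in (1_000, 10_000, 100_000):
    interactions = n_tokens * n_tokens
    print(f"{n_tokens:>7,} tokens -> {interactions:>18,} token-to-token interactions")

# 100x more tokens means 10,000x more interactions, and roughly that much more
# compute spent on attention.
```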

The solution seems obvious: train models to identify which words are relevant and which are not. Easier said than done; to this day, models still use global attention, paying attention to everything and only then deciding what’s worth keeping.

But what does attention entail in the first place? Internally, models do the following:

  1. Each word has three “communication tools”: a query, a key, and a value

    1. Each word uses its query to say ‘here’s what I’m looking for’ (e.g., for the word ‘bank’, that I am something that could be a river bank or a financial institution)

    2. Each word uses its key to say ‘here’s what I can share with you if you attend to me’ (e.g., for the word ‘money’, if you attend to me, I’ll tell you that I’m a currency)

    3. Each word uses its ‘value’ component to say ‘here’s what I’ll give you if you attend to me’. (e.g., once ‘bank’ attends to ‘money’, it receives enough evidence to realize what type of bank it is)

  2. Each word’s query interacts with the keys of the other words, identifying which ones are most suited, but without eliminating any words; all are processed equally.

  3. Finally, these attention scores act as an importance-weighting mechanism that scales the value components of the other words up or down as they are aggregated into each word’s updated representation. Put another way, each word receives information from all other words, weighted by the estimated importance from step 2. Although the model still touches every word in our example, this weighted average ensures that ‘bank’ pays the most attention to ‘money’.

This is done for every pair of words in the sequence, giving you an ‘attention matrix’ that shows how much each word attends to every other word. As you can see below, ‘bank’ pays the most attention to ‘money’.
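For readers who want to see the mechanics, below is a minimal Python sketch of steps 1 to 3 for the ‘bank’/‘money’ example, computing just the row of the attention matrix for the word ‘bank’. The vectors are hand-picked toy numbers rather than learned weights, so treat it as an intuition aid, not the implementation of any particular model:

```python
import numpy as np

# Hand-crafted sketch of attention: the vectors are toy values chosen for
# illustration (real models LEARN them via query/key/value projections);
# only the math is faithful to the mechanism described above.

words = ["I", "need", "to", "take", "the", "money", "out", "of", "the", "bank"]
d = 4  # tiny vector size for the sketch

# Every word gets a key; 'money' is given one that aligns with what 'bank'
# is "looking for" (disambiguating financial context).
keys = {w: np.zeros(d) for w in words}
keys["money"] = np.array([1.0, 0.9, 0.0, 0.0])
keys["take"]  = np.array([0.3, 0.1, 0.0, 0.0])
keys["bank"]  = np.array([0.1, 0.0, 0.5, 0.0])

query_bank = np.array([1.0, 1.0, 0.0, 0.0])  # what 'bank' is looking for

# Step 2: query-key dot products give a raw relevance score for EVERY word.
scores = np.array([query_bank @ keys[w] for w in words]) / np.sqrt(d)

# Step 3: softmax turns scores into weights that sum to 1; the value vectors
# (omitted here) would then be averaged using these weights.
weights = np.exp(scores) / np.exp(scores).sum()

for w, a in zip(words, weights):
    print(f"{w:>6}: {a:.2f}")
# 'money' gets the largest weight, so 'bank' attends mostly to 'money'; but
# notice that every single word still gets a non-zero weight: nothing is skipped.
```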

But one issue remains: while the attention mechanism judges the importance of each word, it still has to attend to all of them, even the ones that are mostly irrelevant.

So, can’t we just not waste compute on words that yield no benefit? And it turns out, we can.

DeepSeek has introduced DeepSeek Sparse Attention, or DSA, as a means to make this system smarter about what it pays attention to, while maintaining performance and reducing costs.

To do so, they add a lightweight extra attention layer that runs a fast, cheap interaction between each word and the rest. This yields, for each word, the top ‘k’ interactions (i.e., the words most worth paying attention to), and the remaining interactions are simply discarded.

In other words, this system identifies, for each word, which other words are worth paying attention to before paying attention to them, instead of wasting compute on words that are irrelevant.

For instance, in the sentence “Where is Mike? Mm, I think I saw him upstairs a few hours ago,” the word ‘him’ directs attention to ‘Mike’ as it’s a pronoun representing Mike. But thanks to DSA, it will also ignore ‘Mm’, which provides zero information worth attending to.
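To make the idea concrete, here is a minimal Python sketch of top-k sparse attention for a single word. This is not DeepSeek’s actual DSA implementation (their selection step is a separate, trained lightweight scorer); it only illustrates the “select first, attend second” pattern:

```python
import numpy as np

def topk_sparse_attention_row(query, keys, values, k):
    """Attention output for one word, restricted to its k most relevant words."""
    d = query.shape[0]
    # 1) Cheap selection pass: score every word quickly and keep the top k.
    #    (Here we reuse the dot product for simplicity; DSA trains a scorer.)
    cheap_scores = keys @ query
    keep = np.argsort(cheap_scores)[-k:]
    # 2) Full attention, but only over the k selected words.
    scores = (keys[keep] @ query) / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return keep, weights @ values[keep]

# Toy usage: a "sequence" of 8 words with 4-dim vectors, keeping the top 3.
rng = np.random.default_rng(0)
keys, values = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
query = rng.normal(size=4)

kept, output = topk_sparse_attention_row(query, keys, values, k=3)
print("words attended to:", kept)        # only 3 of the 8 words are touched
print("output vector:", output.round(2))
# Per word, the expensive step now costs O(k) instead of O(n); across the whole
# sequence, attention drops from O(n^2) toward O(n*k), hence the lower price.
```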

And is this new model any good? Is DSA working well?

They have presented an updated version of their flagship model, achieving similar scores to previous versions, but at half the price, with a cost of $0.28 per million input tokens and $0.42 per million tokens generated.

Same performance, half the price.

For reference, Anthropic’s latest model, Claude 4.5 Sonnet, costs $3 and $15 per million input and output tokens respectively, roughly 10 and 35 times more, for a model that is better, but only marginally.
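As a quick illustration using the published per-million-token prices above (the workload, 2M input and 0.5M output tokens, is an arbitrary assumption just to make the gap tangible):

```python
# Illustrative cost comparison; the workload is an assumed example, not a benchmark.

def job_cost(input_millions, output_millions, price_in, price_out):
    return input_millions * price_in + output_millions * price_out

workload = (2.0, 0.5)  # millions of input / output tokens

deepseek = job_cost(*workload, price_in=0.28, price_out=0.42)   # -> $0.77
claude   = job_cost(*workload, price_in=3.00, price_out=15.00)  # -> $13.50

print(f"DeepSeek: ${deepseek:.2f} | Claude 4.5 Sonnet: ${claude:.2f} "
      f"(~{claude / deepseek:.0f}x more expensive)")
```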

And perhaps more importantly, they show that the move to DSA can be applied as a post-training recipe, meaning you can take a model that has already been trained and turn it into a “sparse attender” through further training, without having to scrap the entire model.

TheWhiteBox’s takeaway:

Man, are these DeepSeek guys frugal and smart.

Sparse attention is such a natural ‘next step’ in terms of algorithmic progress that I’m shocked it took so long. And once again, an efficiency breakthrough comes from China, not the US.

I’m not saying labs like OpenAI haven’t developed their own sparse attention mechanisms, but they no longer publish research, so I can’t credit what I don’t know.

However, the key takeaway is that this demonstrates China's innovative capabilities at this stage. It’s no longer a follower, it’s a leader.

And while US Labs might still be releasing the best models in absolute terms, they are not releasing the best models in terms of cost (part of that price gap could simply be fatter margins rather than genuine efficiency). That said, I don’t think that explains it, because we all know what customers, especially enterprises, value the most: money.

ROBOTICS
Unitree Robots’ Impressive Movements

Another week, another impressive robotics video emerging from China. This time, we have another Unitree robot demonstrating greater movement capabilities than most humans, performing a whole 360-degree horizontal spin in the air.

TheWhiteBox’s takeaway:

As I’ve been repeating for weeks, China is investing heavily in robotics, particularly on the hardware side. The most concerning part here is the complete lack of hardware investment in the US and Europe.

As an analyst put it, you can’t simply software your way out of this problem. We desperately need Western companies investing in AI robotics hardware; otherwise, there’s absolutely no way they can compete with Chinese companies. Because, guess what, software is the least of the issues when considering robotics.

The good thing is that US investors seem to have realized this (better late than never), with a16z partner Martin Casado writing a piece on the matter in the Wall Street Journal. The title is pretty self-explanatory and shows some self-awareness: “China is winning the race for AI robots.”

As recently as 2023, there was basically only one player in the field, with China producing more robots than the rest of the world combined.

Nowadays, the picture is probably way worse.

Furthermore, among the reasons the VC gives as to why China is ahead, he points to a critical one: the role of the state. This goes deeper than what you might realize at first.

Without putting words in his mouth, what Casado might be suggesting is that, in AI, the government’s role will be crucial. In other words, he’s essentially praising China’s state capitalism, an economic system in which the state is heavily involved in key technologies, AI being one of them.

I have no proof of this, but I believe it’s a matter of time before the US copies the Chinese model (in some ways, they are already doing so).

FINE-TUNING
Thinking Machines Introduces Tinker API

Thinking Machines Lab, the latest super-funded AI lab in the US ($2 billion raised at a $10 billion valuation, despite being pre-revenue), founded by star researchers from OpenAI, Anthropic, Meta, and Google DeepMind, has released its first product.

And, in my view, this is a huge deal.

And it’s not what people expected. It’s not a model, but a training service: researchers define the parameters (data, training regime, configuration, and so on), pick a model from a selection of supported open models, and the service returns a fine-tuned model.

For example, you can choose to fine-tune, say, Qwen3-30B, send the data you want the model fine-tuned on along with the parameters of the training process, and click ‘go’.

The advantage? You focus only on the configuration side; TML takes care of the infrastructure side (the trickiest part) and spares you the capital cost of owning the GPUs.

In other words, it’s basically a service to train your own models while abstracting the hardest (and most expensive) parts; it’s about making fine-tuning easy to do.
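To give a feel for what that abstraction means in practice, here is a purely hypothetical sketch of such a workflow. The function and field names are invented for illustration and are not the actual Tinker API; only the division of labor (you bring data and config, the service brings GPUs) reflects the description above.

```python
# Hypothetical sketch of hosted fine-tuning from the user's point of view.
# Names below are invented for illustration; they are NOT the real Tinker API.

import json

# 1) You bring the data: examples that faithfully represent the actual job.
examples = [
    {"prompt": "Summarize this contract clause: ...", "completion": "..."},
    {"prompt": "Draft a reply to this client email: ...", "completion": "..."},
]
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# 2) You pick a supported open model and a training configuration;
#    the service owns the GPUs and the distributed-training plumbing.
config = {
    "base_model": "Qwen3-30B",   # the example model mentioned above
    "method": "lora",            # assumed option: lightweight fine-tuning
    "epochs": 3,
    "learning_rate": 1e-4,
}

def submit_finetuning_job(data_path: str, config: dict) -> str:
    """Hypothetical placeholder for the service call; returns a model handle."""
    raise NotImplementedError("wire this to the provider's actual SDK")

# 3) Conceptually, the whole workflow is then just:
# model_handle = submit_finetuning_job("train.jsonl", config)
# ...after which you sample from `model_handle` like any other model.
print("dataset and config ready:", config)
```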

I cannot overstress the importance of this. But why?

TheWhiteBox’s takeaway:

For some time now, I have been advocating for the idea that the reason enterprise AIs are failing is that these AI models are not being trained on the job.

Instead, what most ‘enterprise AIs’ resemble is just ChatGPT with a couple of additional prompts on top, which enable it to adopt the style of the job (but, crucially, not the skills).

The “ChatGPT for lawyers” products of the world are literally ChatGPT with a prompt on top saying “You are a fantastic, awesome lawyer… this is what lawyers do… this is a bunch of lawyer-like expressions you should use… yadda yadda yadda.”

I recently wrote a Medium article ranting about this and why I believe most AI startups will soon disappear.

And then we pretend to be surprised when implementations fail! No shit, Sherlock!

Thus, the answer to enterprise AI’s issues is on-the-job training: take an AI model and a dataset that faithfully represents the job’s tasks, and train the model on it; it’s that simple.

And this service solves just that; companies now only need to worry about the dataset (and the environment if they are using RL, but that’s another story), send the data to the service, and get back a fully-fledged fine-tuned model that, I can guarantee you, won’t be going into MIT’s list of failed implementations.

DOCUMENT EXTRACTION
Andrew Ng’s ADE is Amazing

As you know, I don’t normally highlight particular products (and when I do as a sponsored ad, it’s clearly marked). This is not one of those cases: this product is unsponsored and absolutely incredible.

Andrew Ng, one of the most respected AI researchers and founders in the world (he founded Google Brain and co-founded Coursera, the online course platform), has released the Agentic Document Extraction (ADE) product under his company LandingAI.

And this is the first document parsing (understanding) and extraction product I’ve ever seen that feels like it has solved PDF document extraction. It’s that good.

And we mustn’t understate this at all: most frontier AI Labs, like OpenAI or Anthropic, use parsing systems inside their products (like ChatGPT), and they are much, much worse at this than ADE.

But why is this such a big deal? Document understanding, in my opinion, is the most critical enterprise use case that, if solved by AI, would unlock hundreds of billions of dollars in adoption immediately.

Why? Simple:

  • Most enterprise data is distributed in an unstructured, document-based form. Enterprise data is not in clean tables and readable, pre-parsed Word files. It’s in ugly PDFs, scanned invoices, and scanned handwritten documents.

  • It’s incredibly hard for AIs to extract such data without making mistakes. ChatGPT or Gemini don’t even come close to this accuracy.

ADE might have solved this for good.

  1. It understands structure. As you can see in the image above, it breaks down unstructured PDFs into sections, allowing for an understanding of the overall layout.

  2. For each section, it performs a parsing/extraction process using its DPT-2 model, a Transformer similar to ChatGPT, but trained specifically for document extraction.

  3. It can extract information not only from text, but also from graphs, as shown below, even when the data is in another language (in my case, Spanish).

  4. It can also parse tables perfectly, extracting every single cell with dead-on accuracy.

  5. It also captures hand-written data with ease.

But how does this work? In simple terms, it’s a combination of traditional PDF parsing tools, which can read the binary data directly, and an LLM that performs the intelligent understanding and extraction.
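For intuition, here is a minimal sketch of that general “parser + LLM” pattern in Python. It is not LandingAI’s pipeline: pypdf is just one example of a traditional parser, and `call_llm` is a placeholder you would wire to your model of choice.

```python
# Minimal sketch of the "traditional parser + LLM" pattern described above.
# NOT LandingAI's actual pipeline; `call_llm` is a placeholder function.

import json
from pypdf import PdfReader  # pip install pypdf

def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to an LLM and return its text response."""
    raise NotImplementedError("wire this to your LLM provider of choice")

def extract_invoice_fields(pdf_path: str) -> dict:
    # 1) Traditional parsing: pull the raw text (and, in real systems, the
    #    layout, tables, and images) directly from the PDF's binary structure.
    reader = PdfReader(pdf_path)
    raw_text = "\n".join(page.extract_text() or "" for page in reader.pages)

    # 2) LLM step: the model does the "understanding", mapping messy text
    #    to a clean, structured schema.
    prompt = (
        "Extract the following fields from this invoice as JSON: "
        "vendor, date, total_amount, currency.\n\n" + raw_text
    )
    # Assumes the model returns valid JSON; real systems validate and repair it.
    return json.loads(call_llm(prompt))

# Production systems like ADE go much further (layout detection, vision models
# for scans and charts, per-section processing), but the division of labor,
# deterministic parsing plus model-based understanding, is the same idea.
```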

TheWhiteBox’s takeaway:

You know I’m not easily impressed. Well, I’m beyond impressed with this one. This is a very tough problem to solve. Perhaps the most impressive aspect is the layout and graph understanding, where the model appears to be fully aware of what it’s seeing, not just mindlessly extracting all text.

I really don’t know what they’ve done to achieve this, but it’s very easy to see where this can be used.

  • Invoice reconciling

  • Data gathering for AI training

  • Business process digitalization

Most companies on this planet run on these documents (sometimes they aren’t even scanned), and now they can be parsed and processed into the digital realm.

And the best thing is that you don’t have to take my word for it. Just head to their playground and see it for yourself. They have 1,000 free credits, so no need to use your credit card to try it.

Closing Thoughts

At this point, the differences in approach to AI between the US and China are strikingly clear: China continues to focus heavily on research, while the US is now more focused on product development. To me, this is just more evidence that China is treating AI not as a product but as a national security concern; the US, despite the Action Plan, now seems much more focused on product (making money).

Particularly galling is seeing US VCs pour money into robotics software instead of focusing on what really matters: hardware. Meanwhile, Chinese robots are essentially ninjas at this point.

That said, I’m an optimist by nature, and I believe we are going to see greater investment in hardware in the next few months. I think it’s a great time to be a robotics hardware founder in the US, and FigureAI’s insane $39 billion valuation at basically pre-revenue levels should make it appetizing enough.

See you tomorrow in the Leaders segment.

THEWHITEBOX
Join Premium Today!

If you like this content, join Premium and you will receive four times as much content weekly without saturating your inbox. You will even be able to ask the questions you need answers to.

Until next time!

For business inquiries, reach out to me at [email protected]