Frontier Deception, Conquering Human Cognition, & More


THEWHITEBOX
TLDR;

Technology:

  • 🧐 A Humanoid that’s too good to be true?

  • 😱 A First in AI: A Metagenomics Model

Markets:

  • 🤧 Is NVIDIA Deceiving Us?

Product:

  • 🤨 Are OpenAI's o1 Results Inflated?

TREND OF THE WEEK: A Foundation Model for Human Cognition

PREMIUM CONTENT
Premium Content You May Have Missed…

  • Running Frontier Models at Home. Understanding the implications of NVIDIA’s latest supercomputer, which will be available to you soon (Only Full Premium Subscribers)

ROBOTICS
A Humanoid That’s Too Good to be True?

Recently, Chinese company EngineAI unveiled its humanoid robot SE01, which has unheard-of mannerisms and movements. The videos are so impressive that even NVIDIA’s robotics lead, Jim Fan, openly questioned whether they were real or AI-generated (the researcher later confirmed that the video is, in fact, real).

TheWhiteBox’s takeaway:

I grew up watching the amazing developments Boston Dynamics showed off with its humanoids. However, the speed of progress in this particular industry over the last few years is staggering.

The impressive thing here is not the robot’s autonomy (the robot could very well be teleoperated in the video).

What makes this humanoid unique is that its movements are so natural that it’s hard to fathom how far we will progress this decade alone. Elon Musk was ridiculed for saying we would have billions of robots by the end of the decade, but these improvements, combined with the excellent unit economics of some of these companies (especially the Chinese ones, some of which offer quadruped robots for as little as $1,600), make me believe robots are coming to your home sooner than we could have ever predicted.

Conquering natural movement is a much harder task in robotics than improving the robot’s actual intelligence.

BIOLOGY
Prime Intellect’s Amazing MetaGenomics Model

Prime Intellect, the team behind Intellect-1, the largest decentralized LLM ever trained, has announced METAGENE-1, a 7-billion-parameter metagenomic foundation model developed collaboratively by researchers from USC, Prime Intellect, and the Nucleic Acid Observatory.

Trained on over 1.5 trillion base pairs of DNA and RNA sequences obtained from human wastewater samples, this model captures the patterns in the extensive genomic diversity present in the human microbiome.
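For intuition on how raw DNA ever becomes something a transformer can train on: genomic language models typically chop sequences into tokens, for example overlapping k-mers. The sketch below is a minimal, hypothetical illustration of that idea, not METAGENE-1’s actual pipeline.

```python
# Hypothetical k-mer tokenization of a metagenomic read.
# This is NOT METAGENE-1's actual pipeline; it only illustrates how raw
# nucleotide sequences can be turned into tokens a language model can ingest.

def kmer_tokenize(sequence: str, k: int = 6, stride: int = 1) -> list[str]:
    """Split a DNA/RNA sequence into overlapping k-mers."""
    sequence = sequence.upper().replace("U", "T")  # normalize RNA to the DNA alphabet
    return [sequence[i : i + k] for i in range(0, len(sequence) - k + 1, stride)]

read = "ATGCGTACGTTAGC"          # one short read from a wastewater sample
tokens = kmer_tokenize(read, k=6)
print(tokens)                     # ['ATGCGT', 'TGCGTA', 'GCGTAC', ...]

# Map each k-mer to an integer id (in practice you'd use a fixed vocabulary of all
# 4^6 = 4,096 possible 6-mers; here we simply index the observed ones).
vocab = {kmer: idx for idx, kmer in enumerate(sorted(set(tokens)))}
ids = [vocab[t] for t in tokens]
print(ids)
```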

TheWhiteBox’s takeaway:

We are becoming accustomed to AI breakthroughs in every field imaginable. But this release is different. Instead of trying to understand patterns through a narrow, individual-level lens, the researchers, as the model’s name suggests, are trying to capture population-level patterns (across a wide number of people). This could be essential for identifying risks to large numbers of people and, potentially, even preventing pandemics like COVID-19.

Once again, we observe the insane power of AI in discovering patterns in big data. This theme is so important that it is the centerpiece of today’s trend of the week.

While recent research has shown that humans, despite ingesting billions of data points per second through our senses, can only process around 10 bits per second (making our cognitive bandwidth extremely limited), AI sits at the opposite extreme: its cognitive capabilities are still weaker than ours (the quality of our 10 bits per second of thought is immensely higher than theirs), but the amount of data it can process is orders of magnitude larger.

In layman’s terms, while an apples-to-apples comparison between a human and an AI for a given task may turn out badly for AIs, their capacity to digest enormous amounts of data gives them an insane edge in cases where ‘big data > intelligence.’

Simply put, whenever the goal is to find patterns across large amounts of data, AIs are already unequivocally superior to us. The issue is that, in our obsession with building machines that surpass our intelligence, we always forget to mention the cases in which this technology already exceeds our capacity. That is why I consider AI a tool of discovery above everything else.

HARDWARE
Is NVIDIA Deceiving Us?

In Tuesday’s Premium rundown, we discussed NVIDIA’s latest product, NVIDIA Digits. This at-home supercomputer, which costs $3,000, allows you to run 200-billion-parameter (gigantic) models.

The computer delivers up to 1 PetaFLOP (one quadrillion operations per second) of compute if you run your model at FP4 precision (half a byte per parameter). It also has 128 GB of unified memory and up to 4 TeraBytes of NVMe storage, specs that, as covered in my deep dive of the hardware and what you can do with it, really help us envision a future where you can run frontier models at home.
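A quick back-of-the-envelope check (weights only, ignoring the KV cache and activation overhead) shows why the 200-billion-parameter claim hinges on that FP4 figure and the 128 GB of unified memory:

```python
# Back-of-the-envelope: how much memory do a model's weights need at each precision?
# Weights only -- the KV cache and activations add on top of this.

BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "FP4": 0.5}

params = 200e9            # a 200-billion-parameter model
unified_memory_gb = 128   # DIGITS' unified memory

for precision, bytes_per_param in BYTES_PER_PARAM.items():
    weights_gb = params * bytes_per_param / 1e9
    fits = "fits" if weights_gb <= unified_memory_gb else "does NOT fit"
    print(f"{precision}: {weights_gb:,.0f} GB of weights -> {fits} in {unified_memory_gb} GB")

# BF16: 400 GB -> does NOT fit
# FP8:  200 GB -> does NOT fit
# FP4:  100 GB -> fits (which is why the 200B claim hinges on FP4)
```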

But then I looked more closely at the numbers, and things started to look… different.

TheWhiteBox’s takeaway:

NVIDIA has recently become a footnote merchant. Every graph they publish includes a tiny footnote clarifying the assumptions behind the numbers (check the image above).

That’s where the problem lies.

The main concern is the headline number they have marketed: the 1 PetaFLOP of performance at FP4. Almost no one has managed to train models at such limited precision; most models are trained at BF16 (2 bytes per parameter), and only recently has DeepSeek lifted the veil on FP8 with DeepSeek v3. FP4, however, remains largely uncharted territory, as training models at such low precision has so far proven infeasible.

It must be said that you can run FP4 models through methods like QLoRA, which was used below to create Centaur. However, the real problems start when you compare DIGITS with another of NVIDIA’s announced GPUs, the RTX 5090, mainly used for gaming. If we look at the specs of that new GPU, it offers 660 TeraFLOPs at FP8 at a price tag of $2k, while the $3k DIGITS supercomputer has 500 TFLOPs at FP8 (0.5 PetaFLOPs).

In other words, this $3,000 beast underperforms NVIDIA’s own GPU gaming cards at two-thirds of the price!
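The arithmetic behind that claim, using the FP8 figures quoted above, is straightforward:

```python
# Price/performance at FP8, using the figures quoted above.
rtx_5090 = {"price_usd": 2_000, "fp8_tflops": 660}
digits   = {"price_usd": 3_000, "fp8_tflops": 500}

for name, gpu in {"RTX 5090": rtx_5090, "DIGITS": digits}.items():
    tflops_per_dollar = gpu["fp8_tflops"] / gpu["price_usd"]
    print(f"{name}: {tflops_per_dollar:.2f} FP8 TFLOPs per dollar")

# RTX 5090: 0.33 TFLOPs/$   vs   DIGITS: 0.17 TFLOPs/$
# On raw compute per dollar, the gaming card is roughly 2x the value.
```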

Of course, you can make a case for DIGITS on memory: 128 GB of unified memory lets you store huge models at home. But does that justify the $1k markup?

Well, you can buy 128 GB of RAM for $249, so I would say no. You could still argue that DIGITS offers 500+ GB/s of memory bandwidth, which is crucial for reducing communication overhead, but DIGITS still feels exceptionally overpriced.
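Bandwidth matters because, for single-user inference, every generated token requires streaming the model’s weights through memory once, so memory bandwidth sets a hard ceiling on tokens per second. A rough estimate using the ~500 GB/s figure above (ignoring the KV cache and any overlap):

```python
# Rough, memory-bandwidth-bound estimate of tokens/second for local inference.
# Assumes each generated token requires reading every weight once; ignores the KV
# cache, compute limits, and batching, so treat these as optimistic ceilings.

bandwidth_gb_s = 500       # the ~500 GB/s figure discussed above
model_sizes_gb = {
    "200B @ FP4": 100,
    "70B @ FP4": 35,
    "70B @ FP8": 70,
}

for name, size_gb in model_sizes_gb.items():
    tokens_per_second = bandwidth_gb_s / size_gb
    print(f"{name}: ~{tokens_per_second:.1f} tokens/s upper bound")

# 200B @ FP4: ~5 tokens/s -- usable, but far from snappy.
```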

I’m not trying to undermine this release. I will consider buying this supercomputer because I genuinely believe running AI models at home is the future for small companies and GenAI power users. But NVIDIA’s deliberate obtuseness in its claims for every new release makes me feel like we might be hitting diminishing returns in hardware compute performance, and they are too afraid to admit it.

In fact, NVIDIA’s new flagship GPU platform, Blackwell, isn’t a step-function improvement in compute but in communication: many more GPUs are being paired per server (going from 8 to 36 or 72), leading to outsized performance gains, while the per-chip improvement was obtained largely by doubling the number of compute dies per chip to two compared with the Hopper platform.

I’m not doubting NVIDIA’s dominance; my concerns lie more with the market itself, whose products are starting to look severely overpriced in general. This is why (tinfoil hat fully on) NVIDIA is desperately trying to open new businesses in cloud computing and robotics.

LARGE REASONER MODELS
Are OpenAI’s o1 Benchmark Results Inflated?

In a series of tweets, AI researcher Alejandro Cuadron explains that, when testing the claimed results of OpenAI’s o1 model on SWE-Bench Verified, an apparently harmless change to the open-source framework that governs the agent’s actions, from Agentless to their OpenHands framework, drops performance from 48% to 30%.

Importantly, OpenHands gives the agent greater freedom than Agentless, meaning that a drop in performance may suggest that, in genuinely autonomous settings, o1 isn’t as bright as claimed.

Interestingly, Claude 3.5 Sonnet reaches 53% using that same framework. This could imply that OpenAI’s star Large Reasoner Model (LRM), supposedly an improvement in reasoning over standard Large Language Models (LLMs) like Claude 3.5 Sonnet, still performs worse than Anthropic’s state-of-the-art model.

TheWhiteBox’s takeaway:

Without jumping too quickly to conclusions, as I still believe o1 is superior to Claude 3.5 Sonnet in most “reasoning” tasks (if you think otherwise, look at Aider’s latest coding benchmark), this proves that we must always take benchmark results with a pinch of salt and that actual performance for a given task will only emerge if you take the time to design internal domain-specific benchmarks for your task.
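If you want to act on that advice, a minimal internal benchmark doesn’t need much code. The sketch below assumes a hypothetical ask_model() wrapper around whatever API or local model you use, plus a handful of task/check pairs of your own:

```python
# Minimal internal benchmark harness: run YOUR tasks against a model and score them.
# `ask_model` is a hypothetical placeholder for whatever API or local model you use.

from typing import Callable

def ask_model(prompt: str) -> str:
    raise NotImplementedError("Wire this to your model of choice (OpenAI, Anthropic, local, ...)")

def run_benchmark(tasks: list[dict], model: Callable[[str], str]) -> float:
    """Each task is {'prompt': ..., 'check': callable(answer) -> bool}. Returns the pass rate."""
    passed = 0
    for task in tasks:
        answer = model(task["prompt"])
        passed += task["check"](answer)
    return passed / len(tasks)

tasks = [
    {"prompt": "Extract the invoice total from: 'Total due: $1,240.50'",
     "check": lambda a: "1,240.50" in a or "1240.50" in a},
    # ...add the 20-50 cases that actually matter for YOUR workflow
]

# score = run_benchmark(tasks, ask_model)
# print(f"Pass rate: {score:.0%}")
```

A few dozen cases you actually care about will tell you far more than any public leaderboard.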

I assume OpenAI is well aware of this performance decrease and decided to let it slide for obvious reasons, as these companies are under insane pressure to control the narrative and are in massive “debt” to their investors.

This also justifies my obsession with intelligence efficiency. While few doubt LRMs are the next big thing, I sincerely believe the added costs of running these models for longer during inference far outweigh the benefits (for now). Only after dramatically reducing inference costs can we claim that o-type models’ value to society is greater than that of LLMs.

For the time being, I use LLMs much more than LRMs in my daily routine, and I believe that is the case for you, too.

TREND OF THE WEEK
An AI Model for Human Cognition

A broad group of researchers from top universities, such as Cambridge and Oxford, and top companies, such as Google DeepMind, has presented Centaur, which might be considered the first foundation model for human cognition.

In other words, it is a model that can predict human behavior and does so with previously unseen performance. It even showcases out-of-distribution accuracy (it can confidently predict never-seen-before behaviors from individuals not participating in the study).

This model also shows an uncanny capacity to predict brain patterns, suggesting that the discovery could help us decode the human brain and even lead to important breakthroughs in curing illnesses that impair cognition.

But how is this even remotely possible? Let’s dive in.

Understanding the Mind

First of all, what do we mean by a foundation model of human cognition?

Predicting Human Behavior

At the end of the day, what we are trying to build with Centaur is a better decision-maker in human environments.

If we are willing to fathom a future where AI humanoids inhabit our world, clean our dishes, take the wet laundry to the backyard, or, for the extremely lonely, have sexual intercourse with them (I know, that sounds too “Black Mirrorish,” but trust me there’s demand for that, too), we need robots that act like humans and can predict our behaviors such as the ones below:

But why are we striving to build a unified model of human cognition? Couldn’t we just train machines for every problem and compound them into one system?

And the answer to that question is also one of the keys behind Centaur: the power of going broad.

General Beats Specific

For decades, AI has been a domain-specific game. If you wanted to create a model that played Go, you made a model for that particular task, like Google’s AlphaGo.

One task, one model—that was the mantra for decades. However, with the arrival of the Transformer architecture, we managed to switch from specific to general. Simply put, one model, many tasks.

This is not only easier for us to manage; generality also leads to better performance. The most striking example of why this is true comes from a paper by Cambridge researchers.

This paper, which one of the researchers discusses in more detail in this lecture, provides a striking example of the importance of going broad. While comparing different AI models for physics, they showed that a model trained from scratch on a given physics domain underperforms a model trained on cat videos (seemingly unrelated) and then fine-tuned on the same data as the first model.

In layman’s terms, the only difference between these two models was that the latter was trained on cat videos first. Still, this model considerably outperformed the first one despite the additional data being apparently irrelevant.

Thus, what does this tell us?

  • Well, for starters, no data is irrelevant, a lesson AI labs have long internalized, which is why they are all feeding humongous datasets into their models’ training.

  • The second takeaway is that generalist models are simply superior to domain-specific ones. In fact, the best-performing model in the research above was a general model for physics. Put another way, a model trained on all physics domains vastly surpassed the performance of domain-specific models in their own particular domains.
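For the record, the recipe being compared is simple to express. Below is a toy PyTorch sketch of the two regimes, from-scratch training versus pretrain-then-fine-tune; the model and data are synthetic stand-ins, and only the ordering of the phases matters:

```python
# The two regimes compared above, as a toy sketch in PyTorch.
# Everything here (model size, synthetic data) is a stand-in; only the recipe matters:
# (a) train on target data from scratch vs (b) pretrain on unrelated data, then fine-tune.

import torch
import torch.nn as nn

def make_model() -> nn.Module:
    return nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))

def train(model: nn.Module, x: torch.Tensor, y: torch.Tensor, epochs: int = 100) -> None:
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()

# Synthetic stand-ins for "physics data" and "unrelated data" (e.g., cat videos).
physics_x, physics_y = torch.randn(512, 32), torch.randn(512, 1)
unrelated_x, unrelated_y = torch.randn(512, 32), torch.randn(512, 1)

# (a) From scratch on the target domain only.
scratch = make_model()
train(scratch, physics_x, physics_y)

# (b) Pretrain on unrelated data first, then fine-tune on the exact same target data.
pretrained = make_model()
train(pretrained, unrelated_x, unrelated_y)   # pretraining phase
train(pretrained, physics_x, physics_y)       # fine-tuning phase

# The paper's finding: on real data, recipe (b) generalizes better despite
# the pretraining data looking irrelevant.
```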

Apply this playbook to human cognition, and you get Centaur.

A Foundation Model of Human Cognition

As seen below, Centaur beats both Llama (the baseline model) and the domain-specific cognition model in almost every task.

What is Pseudo-R²?

R² measures how well a model explains the variance of the target variable (the actual observed outcomes). In other words, R² is fundamentally about how well the predictions align with the ground truth; it indicates the quality of fit of the model’s predictions to the observed data, i.e., it is a measure of prediction accuracy.

Pseudo-R² approximates this concept for models with discrete target variables, providing an analogous measure of model fit. Both are widely used metrics to measure model accuracy.
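For concreteness, one common formulation is McFadden’s pseudo-R², computed from log-likelihoods; I’m using it here as an illustrative stand-in, and the paper’s exact variant may differ slightly:

```python
import numpy as np

def mcfadden_pseudo_r2(log_likelihood_model: float, log_likelihood_null: float) -> float:
    """McFadden's pseudo-R^2: 1 - LL_model / LL_null.

    LL_null is the log-likelihood of a baseline (e.g., chance/random) model.
    0 means no better than the baseline; values approaching 1 mean a near-perfect fit.
    """
    return 1.0 - log_likelihood_model / log_likelihood_null

# Toy example: 100 binary choices. The chance baseline assigns p=0.5 to each choice,
# while the model assigns p=0.8 to the option the human actually picked.
n = 100
ll_null = n * np.log(0.5)    # baseline log-likelihood
ll_model = n * np.log(0.8)   # model log-likelihood
print(round(mcfadden_pseudo_r2(ll_model, ll_null), 3))  # ~0.678
```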

Most importantly, Centaur also displays shocking out-of-distribution (OOD) capabilities.

From General to Truly General

With AIs, we continuously encounter the same obstacle: true generalization. When interacting with ChatGPT, you may be tempted to assume that the model works well in every task.

But that’s not true; that’s you using your extremely narrow view of the world as a proxy for true generalization, the point at which these AIs work not only in known domains but also in fully unknown ones. Simply put, if you test our frontier models in unknown areas, they always crack.

Besides the particular case of the o3 models we reviewed here, most AI models suffer from this problem extensively (o3 does, too, for any non-verifiable domain like creative writing, which proves it’s nowhere close to AGI).

However, Centaur appears to have truly emergent OOD capabilities. When tested on modified cover stories (switching the problem’s wording, like swapping a spaceship for a ‘flying carpet,’ to check whether the model can abstract the reasoning), on completely modified problem structures, or even on entirely new domains, Centaur appears to hold up just as well, at least relative to previous models:

But the finding that really makes this model worthy of being this week's trend is the researchers’ analysis of the human mind using Centaur.

Centaur, a Door to Our Minds?

Researchers posited the following: if we have a model that can accurately predict behaviors, can we use that model to reveal the secrets of the mind?

To test this, they wanted to see whether they could use Centaur to predict the brain activations elicited by the same tasks. For instance, for a task involving a short puzzle, can Centaur predict which parts of the brain will activate?

Using fMRI scans of several people performing specific tasks, they took Centaur’s own representations (the internal activations of the model that lead to its prediction, akin to neurons firing in a human brain) and used them to predict which areas of the brain would activate in a human solving that same task.

And, surprisingly, it worked.
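Mechanically, this kind of analysis is usually done with a linear ‘encoding model’: regress the model’s activations onto the measured voxel responses for the same trials, then check predictions on held-out trials. Here’s a self-contained sketch with simulated data; the paper’s exact regularization and cross-validation setup may differ:

```python
# A standard "encoding model" sketch: linearly map a model's internal activations
# to fMRI voxel responses for the same trials, then check predictive accuracy on
# held-out trials. This mirrors the general technique described above; the paper's
# exact setup (regularization, cross-validation scheme) may differ.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n_trials, hidden_dim, n_voxels = 200, 512, 1000
activations = rng.normal(size=(n_trials, hidden_dim))        # model hidden states per trial (stand-in)
true_map = rng.normal(size=(hidden_dim, n_voxels)) * 0.05
brain = activations @ true_map + rng.normal(size=(n_trials, n_voxels))  # simulated fMRI responses

X_train, X_test, y_train, y_test = train_test_split(activations, brain, test_size=0.25, random_state=0)

encoder = Ridge(alpha=10.0).fit(X_train, y_train)
pred = encoder.predict(X_test)

# Per-voxel correlation between predicted and measured activity: the better the
# model's representations track what the brain is doing, the higher this gets.
corrs = [np.corrcoef(pred[:, v], y_test[:, v])[0, 1] for v in range(n_voxels)]
print(f"median voxel correlation: {np.median(corrs):.2f}")
```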

It’s important to note that this alignment between Centaur’s inner representations of human behavior and human neural activations was not forced. The model was not trained to match human brain patterns. Instead, they became ‘naturally aligned’ by learning to predict human behavior.

As seen below, Centaur’s representations were the closest to human ones (akin to saying that, of all frontier models, Centaur’s internal behavior mimics humans the most).

But you may be thinking: why does predicting brain patterns give any insight into how the human brain works?

Think about it the following way. Both the model’s internal activations and the human brain’s activations serve the same purpose: predicting a specific human behavior. So, if both predict the same behavior, the patterns that must be learned to make that prediction become ‘aligned’; they must match.

For example, “John has 20 apples, and Karen takes three of them. How many does John have left?” Both the brain and the model’s neurons need to abstract the same pattern (Person 2 takes three apples from Person 1, leaving Person 1 with the initial apples minus three). Therefore, if Centaur can predict how the brain will activate, this is analogous to finding that pattern.

That probably reads like a word salad, but keep this in mind: if pattern x needs to be learned to predict behavior y, then being able to predict the brain activity that leads to behavior y in a human means I must have captured pattern x and, importantly, what the brain is actually doing.

Consequently, we can probe the model to see whether the pattern occurs; in other words, we can use the model to verify that that pattern is actually occurring inside the human brain and, more specifically, where it takes place.

Long story short, breakthroughs like Centaur open the door to decoding the human brain. For instance, in the future, someone who has suffered an injury to the part of the brain in charge of language may benefit from ‘brain decoding AIs’ that can still make sense of the brain activations of the healthy brain parts. These AIs would allow this perfectly capable human to share their thoughts even if they can’t articulate them.

It’s like a mind reader!

By the way, in case you’re wondering, this idea of matching brain patterns to outcomes, be that predicting human behavior or moving an arm, is literally what companies like Neuralink are doing to help people move the cursor in a computer.

This is huge, as our understanding of the human brain is limited. Thus, as I always say, AI once again proves to us the following:

❝

ChatGPTs aside, AI’s greatest superpower is inductive discovery, using Big Data to discover insights that our low-data-bandwidth minds could never.

While this won’t make money-grabbing VCs any more interested in these use cases, as the potential return is smaller than that of productivity-based applications, it pushes scientific discovery forward, which is, or should be, our real goal with AI.

THEWHITEBOX
Join Premium Today!

If you like this content, become a Premium subscriber: you will receive four times as much content weekly without saturating your inbox, and you will even be able to ask the questions you need answers to.

Until next time!

For business inquiries, reach out to me at nacho@thewhitebox.ai