Frontier Deception, Conquering Human Cognition, & More
![](https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/85827a5d-0907-4631-9a41-28ac51611092/image.png?t=1736405451)
THEWHITEBOX
TLDR;
Technology:
A Humanoid that's too good to be true?
A First in AI: A Metagenomics Model
Markets:
Is NVIDIA Deceiving Us?
Product:
Are OpenAI's o1 Results Inflated?
TREND OF THE WEEK: A Foundation Model for Human Cognition
![](https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/728f3fef-504f-4d12-9529-6fcb7f6d1491/image.png?t=1721747381)
PREMIUM CONTENT
Premium Content You May Have Missed…
Running Frontier Models at Home. Understanding the implications of NVIDIA's latest supercomputer, which will be available to you soon (Only Full Premium Subscribers)
![](https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/47df458e-cba7-46e6-ac5c-a9422e2a59e7/image.png?t=1723545364)
ROBOTICS
A Humanoid That's Too Good to Be True?
Recently, Chinese company EngineAI unveiled its humanoid robot SE01, which has unheard-of mannerisms and movements. The videos are so impressive that even NVIDIA's robotics lead, Jim Fan, openly questioned whether they were real or AI-generated (the researcher later confirmed that the video is, in fact, real).
TheWhiteBox's takeaway:
As a young adult, I grew up watching the amazing demos Boston Dynamics put out with its humanoids. Even so, the speed of progress in this industry over the last few years is staggering.
The impressive thing here is not the robot's autonomy (it could very well be teleoperated in the video).
What makes this humanoid unique is that its movements are so natural that it's hard to fathom how much we will progress this decade alone. Elon Musk was ridiculed for saying we would have billions of robots by the end of the decade. But these improvements, combined with the excellent unit economics of some of these companies (especially the Chinese ones, some of which offer quadruped robots for as little as $1,600), make me believe robots are coming to your home sooner than we could have ever predicted.
And remember: conquering natural robot movement is a much harder problem in robotics than improving the robot's actual intelligence.
BIOLOGY
Prime Intellect's Amazing Metagenomics Model
Prime Intellect, the team behind INTELLECT-1, the largest decentralized LLM ever trained, has announced METAGENE-1, a 7-billion-parameter model described as a metagenomic foundation model, developed collaboratively by researchers from USC, Prime Intellect, and the Nucleic Acid Observatory.
Trained on over 1.5 trillion base pairs of DNA and RNA sequenced from human wastewater samples, the model captures patterns in the extensive genomic diversity of the human microbiome.
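For the curious, here is a minimal sketch of how one might query a genomic language model like this via Hugging Face transformers. The repo id and the assumption that it loads as a standard causal LM are my own guesses, not details confirmed by the release; the idea is that an unusually high loss on a sequence flags genomic material the model finds anomalous.

```python
# Sketch: scoring a nucleotide sequence with a genomic LM (assumed repo id).
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "metagene-ai/METAGENE-1"  # hypothetical Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Lower loss -> the sequence looks more like the wastewater metagenomic
# data the model was trained on; higher loss -> potentially anomalous.
seq = "ACGTACGTTAGCCGGATACGT"
inputs = tokenizer(seq, return_tensors="pt")
loss = model(**inputs, labels=inputs["input_ids"]).loss
print(f"per-token negative log-likelihood: {loss.item():.3f}")
```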
TheWhiteBox's takeaway:
We are becoming accustomed to AI breakthroughs in every field imaginable. But this release is different. Instead of looking for patterns through a narrow lens, researchers are, as the model's name suggests, trying to capture population-level patterns (across a large number of people). This could be essential for identifying health risks at scale and, potentially, even preventing pandemics like COVID-19.
Once again, we observe the insane power of AI at discovering patterns in big data. This theme is so important that it is the centerpiece of today's Trend of the Week.
Recent research suggests that humans, despite ingesting billions of data points per second through our senses, can only consciously process around 10 bits per second, making our cognitive bandwidth extremely limited. AI operates at the opposite end of the spectrum: its cognitive capabilities are still weaker than ours (i.e., the quality of our 10 bits per second of thought is immensely higher than theirs), but the amount of data it can process is orders of magnitude larger.
In layman's terms, while an apples-to-apples comparison between a human and an AI on a given task may turn out badly for the AI, its capacity to digest enormous amounts of data gives it an insane edge wherever "big data > intelligence."
Simply put, whenever the goal is to find patterns across large amounts of data, AIs are already unequivocally superior to us. The issue is that, in our obsession with building machines that surpass our intelligence, we keep forgetting to mention the cases in which this technology already exceeds our capacity. That is why I consider AI, above everything else, a tool of discovery.
![](https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/0f5dc821-2e0a-4130-bd2d-eedeca3768b7/image.png?t=1723548000)
HARDWARE
Is NVIDIA Deceiving Us?
![](https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/a0f333ce-0b5a-4c98-9968-fc177d911452/image.png?t=1736443373)
In Tuesday's Premium rundown, we discussed NVIDIA's latest product, NVIDIA Digits. This $3,000 at-home supercomputer lets you run 200-billion-parameter (gigantic) models.
The computer delivers up to 1 PetaFLOP, or one quadrillion operations per second, of compute, if you run your model at FP4 precision (half a byte per parameter). It also packs 128 GB of unified memory and up to 4 TB of NVMe storage. As covered in my deep dive into the hardware and what you can do with it, these specs really help us envision a future where you can run frontier models at home.
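A quick sanity check (my own back-of-the-envelope arithmetic, not NVIDIA's) shows why the 200-billion-parameter claim hinges entirely on FP4:

```python
# How big is a 200B-parameter model at different precisions?
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "FP4": 0.5}

params = 200e9
for precision, nbytes in BYTES_PER_PARAM.items():
    gigabytes = params * nbytes / 1e9
    fits = "fits" if gigabytes <= 128 else "does NOT fit"
    print(f"{precision}: {gigabytes:,.0f} GB -> {fits} in DIGITS' 128 GB")
# BF16: 400 GB (no). FP8: 200 GB (no). FP4: 100 GB (fits).
# Only FP4 makes the headline claim work, before even counting the KV cache.
```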
But then I looked more closely at the numbers, and things started to look… different.
TheWhiteBox's takeaway:
NVIDIA has recently become a footnote merchant. Every graph they publish includes a tiny footnote clarifying the assumptions behind the numbers (check the image above).
That's where the problem lies.
The main concern is the headline number they marketed: 1 PetaFLOP of performance at FP4. Almost no one has managed to train models at such low precision; most models are trained at BF16 (2 bytes per parameter), and only recently did DeepSeek lift the veil on FP8 with DeepSeek V3. FP4 remains in an uncanny valley, as nobody has yet shown that training at such low precision is workable.
To be fair, you can run FP4 models through methods like QLoRA, which, as we'll see below, was used to create Centaur. However, the real problems start when you compare DIGITS with another of NVIDIA's newly announced GPUs, the RTX 5090, aimed mainly at gaming. That card offers 660 TeraFLOPs at FP8 for $2,000, while the $3,000 DIGITS delivers 500 TeraFLOPs at FP8 (0.5 PetaFLOPs).
In other words, this $3,000 beast is outcomputed by NVIDIA's own gaming card at two-thirds of the price!
![](https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/96ae1d07-a449-4734-b0aa-73ba300de9d2/image.png?t=1736443561)
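Putting the two spec sheets side by side (using the numbers quoted above; the arithmetic is mine):

```python
# Compute per dollar at FP8, using the figures cited in this section.
digits = {"price": 3000, "fp8_tflops": 500}
rtx5090 = {"price": 2000, "fp8_tflops": 660}

for name, gpu in [("DIGITS", digits), ("RTX 5090", rtx5090)]:
    print(f"{name}: {gpu['fp8_tflops'] / gpu['price']:.2f} TFLOPs per dollar")
# DIGITS:   0.17 TFLOPs/$
# RTX 5090: 0.33 TFLOPs/$  -> roughly twice the compute per dollar
```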
Of course, you can make a case for DIGITS on memory grounds: 128 GB of unified memory lets you host huge models at home. But does that justify the $1,000 markup?
Well, you can buy 128 GB of RAM for $249, so I would say no. You could still argue that DIGITS offers upwards of 500 GB/s of memory bandwidth, which is crucial for reducing communication overhead, but DIGITS still feels exceptionally overpriced.
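To see why bandwidth, not FLOPs, dominates single-user inference, here is a standard memory-bound estimate with assumed numbers: each generated token must stream every weight from memory once, so decoding speed is capped at roughly bandwidth divided by model size.

```python
# Memory-bound decoding estimate (assumed figures from this section).
bandwidth_gb_s = 500          # the ~500 GB/s bandwidth mentioned above
model_gb = 200e9 * 0.5 / 1e9  # 200B parameters at FP4 = 100 GB

# tokens/s upper bound ~= bandwidth / bytes streamed per token
print(f"~{bandwidth_gb_s / model_gb:.0f} tokens/s upper bound")  # ~5 tokens/s
# Even with enough memory to hold it, a 200B model would decode slowly.
```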
I'm not trying to undermine this release. I will consider buying this supercomputer myself because I genuinely believe running AI models at home is the future for small companies and GenAI power users. But NVIDIA's deliberate obtuseness in the claims it makes for every new release makes me feel we might be hitting diminishing returns in hardware compute performance, and that they are too afraid to admit it.
Case in point: NVIDIA's new flagship platform, Blackwell, isn't a step-function improvement in compute but in communication. We are pairing many more GPUs per server (going from 8 to 36 or 72), which yields outsized performance gains, while the per-chip gains come largely from doubling the number of compute dies per chip relative to the Hopper platform (from one to two).
I'm not doubting NVIDIA's dominance; my concerns lie more with the market itself, whose products are starting to look severely overpriced across the board. This is why (tin-foil hat fully on) NVIDIA is desperately trying to open new businesses in cloud computing and robotics.
![](https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/0da3c58f-6de7-4370-89d1-3b4b194987da/image.png?t=1723548104)
LARGE REASONER MODELS
Are OpenAI's o1 Benchmark Results Inflated?
![](https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/76844645-4d2d-4ad9-8d72-9a0f624a7410/image.png?t=1736439943)
In a series of tweets, AI researcher Alejandro Cuadron explains that, when testing the claimed results of OpenAI's o1 on SWE-Bench Verified, an apparently harmless change, swapping the open-source framework that governs the agent's actions from Agentless to the OpenHands framework, drops performance from 48% to 30%.
Importantly, OpenHands gives the agent greater freedom than Agentless, so the drop in performance may suggest that, in genuinely autonomous settings, o1 isn't as bright as claimed.
Interestingly, Claude 3.5 Sonnet reaches 53% with that same framework, which could imply that OpenAI's star Large Reasoner Model (LRM), supposedly an improvement over the reasoning capabilities of standard Large Language Models (LLMs) like Claude 3.5 Sonnet, still performs worse than Anthropic's state-of-the-art model.
TheWhiteBox's takeaway:
Without jumping too quickly to conclusions, as I still believe o1 is superior to Claude 3.5 Sonnet in most "reasoning" tasks (if you think otherwise, look at Aider's latest coding benchmark), this proves that we must always take benchmark results with a pinch of salt, and that actual performance on a given task only emerges if you take the time to design internal, domain-specific benchmarks for it.
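To make that advice concrete, here is a minimal sketch of what an internal, domain-specific benchmark could look like. `ask_model`, the prompts, and the pass criteria are hypothetical placeholders you would replace with your own provider client and tasks:

```python
# A minimal internal benchmark: a handful of tasks from YOUR domain,
# scored the same way for every model you evaluate.
def ask_model(model: str, prompt: str) -> str:
    raise NotImplementedError  # plug in your provider's API client here

BENCHMARK = [
    {"prompt": "Does this ticket request a refund? Answer yes or no: "
               "'I want my money back for order 4415.'", "must_contain": "yes"},
    {"prompt": "Extract the total from: 'Total due: $412.50'", "must_contain": "412.50"},
]

def score(model: str) -> float:
    # Fraction of tasks whose answer contains the expected string.
    hits = sum(c["must_contain"] in ask_model(model, c["prompt"]).lower()
               for c in BENCHMARK)
    return hits / len(BENCHMARK)

# Compare models on YOUR tasks, not on public leaderboards:
# print(score("o1"), score("claude-3-5-sonnet"))
```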
I assume OpenAI is well aware of this performance decrease and decided to let it slide, for obvious reasons: these companies are under insane pressure to control the narrative and are in massive "debt" to their investors.
This also justifies my obsession with intelligence efficiency. While few doubt LRMs are the next big thing, I sincerely believe the added costs of running these models for longer at inference time still far outweigh the benefits (for now). Only after dramatically reducing inference costs can we claim that o-type models' value to society is greater than that of LLMs.
For the time being, I use LLMs much more than LRMs in my daily routine, and I suspect that is the case for you, too.
TREND OF THE WEEK
An AI Model for Human Cognition
![](https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/a87e79a7-1c4b-4327-9bb2-e3638ad5d332/image.png?t=1736444159)
A broad group of researchers from top universities, such as Cambridge and Oxford, and top companies, such as Google DeepMind, has presented Centaur, which might be considered the first foundation model of human cognition.
In other words, it is a model that can predict human behavior, and it does so with previously unseen accuracy. It even showcases out-of-distribution accuracy: it can confidently predict never-before-seen behaviors from individuals who did not participate in the study.
The model also shows an uncanny capacity to predict brain patterns, suggesting that this discovery could help us decode the human brain and even lead to important breakthroughs in treating illnesses that impair cognition.
But how is this even remotely possible? Let's dive in.
Understanding the Mind
First of all, what do we mean by a foundation model of human cognition?
Predicting Human Behavior
At the end of the day, what we are trying to build with Centaur is a better decision-maker in human environments.
If we are willing to imagine a future where AI humanoids inhabit our world, clean our dishes, carry the wet laundry to the backyard, or, for the extremely lonely, even provide intimacy (I know, that sounds too "Black Mirror-ish," but trust me, there's demand for that, too), we need robots that act like humans and can predict behaviors such as the ones below:
![](https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/b87b9753-3d32-4ec6-8d21-f074959cd8ef/image.png?t=1736406400)
But why are we striving to build a unified model of human cognition? Couldn't we just train a machine for every problem and combine them into one system?
And the answer to that question is also one of the keys behind Centaur: the power of going broad.
General Beats Specific
For decades, AI was a domain-specific game. If you wanted a model that played Go, you built a model for that particular task, like Google's AlphaGo.
One task, one model: that was the mantra. With the arrival of the Transformer architecture, however, we managed to switch from specific to general. Simply put: one model, many tasks.
This is not only easier for us to manage; generality also leads to better performance. The most striking example of why this is true is a paper by Cambridge researchers.
The paper, which one of the researchers discusses in more detail in this lecture, makes the case for going broad vividly. Comparing different AI models for physics, they showed that a model trained from scratch on a given physics domain underperforms a model first trained on cat videos (seemingly unrelated data) and then fine-tuned on the very same physics data.
In layman's terms, the only difference between the two models was that the latter was trained on cat videos first. Still, it considerably outperformed the first one, despite the additional data being apparently irrelevant.
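Here is a toy, runnable illustration of the recipe (a schematic in PyTorch with random tensors standing in for cat videos and physics data; it shows the shape of the two-stage procedure, not the paper's actual setup, and won't reproduce its result):

```python
# "Pretrain broadly, then fine-tune narrow" as a two-stage training loop.
import torch
import torch.nn as nn

def train(model, x, y, steps=200):
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(steps):
        loss = nn.functional.mse_loss(model(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def make_model():
    return nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))

torch.manual_seed(0)
broad_x = torch.randn(512, 32)                 # stand-in for "cat videos"
narrow_x = torch.randn(64, 32)                 # stand-in for the physics domain
narrow_y = narrow_x.sum(dim=1, keepdim=True)

pretrained = make_model()
train(pretrained, broad_x, broad_x.norm(dim=1, keepdim=True))  # 1) broad pretraining
print("pretrained then fine-tuned:", train(pretrained, narrow_x, narrow_y))

scratch = make_model()                                          # 2) no pretraining
print("trained from scratch:      ", train(scratch, narrow_x, narrow_y))
```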
Thus, what does this tell us?
Well, for starters, no data is irrelevant, a memo AI labs internalized long ago, which is why they all feed humongous datasets into their models' training.
The second takeaway is that generalist models are simply superior to domain-specific ones. Notably, the best-performing model in the research above was a general model for physics: a model trained on all physics domains vastly surpassed domain-specific models in their own particular domains.
Apply this playbook to human cognition, and you get Centaur.
A Foundation Model of Human Cognition
As seen below, Centaur beats both Llama (the baseline model) and the domain-specific cognition model in almost every task.
![](https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/e5df7c22-5353-48df-9d7c-0edc1c9ef6c2/image.png?t=1736408170)
What is Pseudo-R²?
R² measures how much of the variance in the target variable (the actual outcomes) a model explains. In other words, R² is fundamentally about how well the predictions align with the ground truth; it indicates the quality of fit of the model's predictions to the observed data, i.e., a measure of prediction accuracy.
Pseudo-R² extends this concept to models with discrete target variables, providing an analogous measure of model fit. Both are widely used metrics of model accuracy.
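One common variant is McFadden's pseudo-R², computed from log-likelihoods against a null baseline (the paper may use a different formulation; this is just to show the shape of the metric):

```python
# McFadden's pseudo-R^2: 1 - LL_model / LL_null, where LL_null is a
# baseline such as random guessing. Higher = better fit.
import math

def mcfadden_pseudo_r2(log_lik_model: float, log_lik_null: float) -> float:
    return 1.0 - log_lik_model / log_lik_null

# Example: 100 binary choices; the model assigns p=0.8 to each observed
# choice, while the null baseline guesses 50/50.
ll_model = 100 * math.log(0.8)
ll_null = 100 * math.log(0.5)
print(f"pseudo-R^2 = {mcfadden_pseudo_r2(ll_model, ll_null):.2f}")  # ~0.68
```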
Most importantly, Centaur also displays shocking out-of-distribution (OOD) capabilities.
From General to Truly General
With AIs, we continuously run into the same obstacle: true generalization. When interacting with ChatGPT, you may be tempted to assume the model works well on every task.
But that's not true; that's you using your extremely narrow view of the world as a proxy for true generalization, the point at which these AIs work not only in known domains but also in fully unknown ones. Simply put, if you test our frontier models in unknown areas, they always crack.
Besides the particular case of the o3 models we reviewed here, most AI models suffer extensively from this problem (o3 does, too, in any non-verifiable domain like creative writing, which proves it's nowhere close to AGI).
Centaur, however, appears to have truly emergent OOD capabilities. Whether tested on modified cover stories (swapping the problem's wording, like changing "spaceship" to "flying carpet," to check whether the model can abstract the reasoning), on completely modified problem structures, or even on entirely new domains, Centaur holds up, at least compared to previous models:
![](https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/17909b16-69e9-42df-adaf-a216b5d43008/image.png?t=1736429162)
But the finding that really makes this model worthy of being this week's trend is the researchers' analysis of the human mind using Centaur.
Centaur, a Door to Our Minds?
The researchers posited the following: if we have a model that can accurately predict behaviors, can we use it to reveal the secrets of the mind?
To test this, they checked whether Centaur could predict the brain activations elicited by the same task. For instance, for a task involving a short puzzle, can Centaur predict which parts of the brain will activate?
Taking fMRI scans of several people performing specific tasks, they extracted Centaur's own representations (the internal activations that lead to the model's prediction, akin to neurons firing in a human brain) and used them to predict which areas of a human brain would activate while solving that task.
And, surprisingly, it worked.
It's important to note that this alignment between Centaur's inner representations of human behavior and human neural activations was not forced. The model was not trained to match human brain patterns. Instead, the two became "naturally aligned" as the model learned to predict human behavior.
As seen below, Centaur's representations were the closest to the human ones (akin to saying that, of all frontier models, Centaur's internal behavior mimics humans the most).
![](https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/8bcda517-5a6a-40c4-b318-8e311adc0f01/image.png?t=1736429887)
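As a sketch of how such an analysis typically works (the shapes, the random data, and the use of ridge regression here are my simplifications, not necessarily the paper's exact method): fit a linear map from the model's internal activations to fMRI voxel responses, then check how well it predicts held-out trials.

```python
# Linear "probe" from model activations to brain activity (schematic).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts = rng.standard_normal((200, 512))     # model activations per trial (assumed 512-d)
voxels = rng.standard_normal((200, 1000))  # fMRI voxel responses per trial

X_tr, X_te, y_tr, y_te = train_test_split(acts, voxels, random_state=0)
reg = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("held-out R^2:", reg.score(X_te, y_te))
# On real data, the higher this score, the more the model's representations
# "line up" with the brain's, even though nothing trained them to match.
```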
But you may be thinking: why does predicting brain patterns give any insight into how the human brain works?
Think about it this way. Both the model's inner activations and the human brain's activations serve the same purpose: predicting a specific human behavior. So, if both predict the same behavior, the patterns each had to learn to make that prediction become "aligned"; they must match.
For example: "John has 20 apples, and Karen takes three of them. How many does John have left?" Both the brain's and the model's neurons need to abstract the same pattern (Person 2 takes three apples from Person 1, leaving Person 1 with the initial apples minus three). Therefore, if Centaur can predict how the brain will activate, it must have captured that pattern itself.
That may read like a word salad, so keep this in mind: if pattern x must be learned to predict behavior y, and I can predict the brain activity that leads a human to y, then I must have learned pattern x, and, with it, what the brain is actually doing.
Consequently, we can probe the model to see where and how the pattern occurs, and use that to verify that the same pattern is actually taking place inside the human brain, and, more specifically, where.
Long story short, breakthroughs like Centaur open the door to decoding the human brain. For instance, someone who has suffered an injury to the part of the brain in charge of language may one day benefit from "brain-decoding AIs" that can still make sense of the activations of the healthy parts of the brain, allowing this perfectly capable human to share their thoughts even if they can't articulate them.
It's like a mind reader!
By the way, in case you're wondering, this idea of matching brain patterns to outcomes, be that predicting human behavior or moving an arm, is literally what companies like Neuralink are doing to help people move a computer cursor.
This is huge, as our understanding of the human brain is limited. Thus, as I always say, AI once again proves the following to us:
ChatGPTs aside, AI's greatest superpower is inductive discovery: using big data to surface insights that our low-bandwidth minds never could.
While this won't make money-grabbing VCs any more interested in these use cases, as the potential returns are smaller than for productivity-based applications, it pushes scientific discovery forward, which is, or should be, our real goal with AI.
THEWHITEBOX
Join Premium Today!
If you like this content, consider becoming a Premium subscriber: you will receive four times as much content weekly without saturating your inbox, and you will even be able to ask the questions you need answers to.
Until next time!
Give a Rating to Today's Newsletter
For business inquiries, reach out to me at nacho@thewhitebox.ai