A Chinese o1 Model, AI's Manhattan Project, & More

In partnership with

For business inquiries, reach out to me at [email protected]

THEWHITEBOX
TLDR;

  • 😟 AI’s Manhattan Project

  • 🥳 We Finally Have the First Open-Source ‘o1’

  • 😳 Writer’s Self-Evolving Models

  • 🥰 Another Victory for AI Healthcare

  • 🤖 Robotic Surgeons

  • 🤔 ChatGPT Beats Doctors…

  • [TREND OF THE WEEK] FrontierMath, AI’s Door to AGI?

Fully Automated Email Outreach With AI Agent Frank

Hire Agent Frank to join your sales team and let him take care of prospecting, emailing and booking meetings for you, so your team can focus on closing deals!

Agent Frank works in two modes - fully autonomous Auto-pilot and Co-pilot, where you can review and monitor his work. And he’s super easy to set up in just 4 quick steps!

He learns using first-party data you provide him during onboarding and continuously gets better as he works to book you more meetings 🚀

NEWSREEL
AI’s Manhattan Project

Shocked. That’s how most people have felt after reading this small paragraph. This is part of the annual report by the US-China Economic and Security Review Commission, signaling that AI is a matter of National Security, similar to how nuclear power and nuclear bombs were treated decades ago.

The report openly calls for a Manhattan Project-like effort so that the US reaches AGI before China (and based on the next news item, we can’t blame them, considering China is catching up fast).

TheWhiteBox’s takeaway:

If there’s one place that is skeptical about AI’s alleged capabilities, it’s this newsletter. Obviously, I feel this Commission is being heavily lobbied/influenced by Big Tech and other AI incumbents, because treating AI as being of ‘vital importance’ would give them business security and even regulatory capture.

However, while I agree that the country that reaches AGI first will have a tremendous advantage, two things come to mind:

  1. Achieving AGI won’t be a sudden event; it will be something that evolves over time. I feel they are framing AGI as a ‘Trinity test’-type event, where the US built the nuclear bomb and ended the war within days in August 1945. I sincerely believe that won’t be the case for AGI.

  2. The technology will be mostly open-sourced, so the idea that AGI can be built inside a lab in Los Alamos feels utopian; it would require a complete ban on AI development, and such a ban would have to be supranational, which is very hard to push through with US adversaries.

I see many holes in this strategy, but one thing’s for sure: AI is nothing like anything we have seen before, so caution is not a bad thing.

OPEN-SOURCE
We Finally Have the First Open-Source ‘o1’

DeepSeek has announced DeepSeek-R1, the first open-source implementation of an o1-type model, the new class of Large Reasoner Models (LRMs) that OpenAI presented back in September and that allegedly vastly improve the reasoning capabilities of frontier AI models.

As you can see above, the model surpasses the results obtained by o1-preview (OpenAI’s top model) on several benchmarks, a wake-up call for them and their investors. Just over two months after o1’s release, both Nous Research, with its Forge API, and DeepSeek have presented systems that match or surpass the performance of OpenAI’s flagship model.

TheWhiteBox’s takeaway:

The best thing about this release, besides the fact that we now have open-source LLMs that can perform ‘deep’ reasoning tasks (and by deep, I mean relative to what a standard LLM can do, which is not much), is that the researchers have also opened up the internal chain of thought the model engages in before answering.

In other words, you can see the ‘inner thinking’ going on while the model processes your input (think about this as the model’s introspection or “thinking about thinking”).

But how smart is this model?

Well, it’s still easily fooled. I sent it the following input: “Javier has M brothers and (N - 7 + M) sisters, and the (N - 4) sister loves pudding. How many brothers does one sister have?” This includes both tricky ways of referring to the number of brothers and sisters and an inconsequential fact (that one of the sisters loves pudding).

While the model did, in fact, initially arrive at the correct answer (each sister has M + 1 brothers, since the brothers of any sister are Javier plus his M brothers), it then went completely bananas, concluding that each sister has four brothers. It completely failed to realize that, because the quantities are expressed as variables, the answer should be expressed in variables too, contradicting its own first conclusion.
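For reference, here is a quick symbolic check of the puzzle’s intended answer using SymPy. This is my own verification sketch, not the model’s output.

# Quick symbolic check of the intended answer (my own sketch, not the model's reasoning).
from sympy import symbols

M, N = symbols("M N", integer=True, positive=True)

boys = M + 1                  # Javier plus his M brothers
sisters = N - 7 + M           # given, but irrelevant to the question (as is the pudding fact)
brothers_per_sister = boys    # every sister's brothers are exactly the boys
print(brothers_per_sister)    # M + 1, regardless of N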

The entire thought process is absolutely hilarious. And although it’s too long to share here, here are some of the highlights:

  • The model failed to acknowledge that the pudding fact was irrelevant (it seemed close to realizing but didn’t fully commit to calling bullshit on that part). Honestly, the failure to acknowledge that ‘M’ and ‘N’ are variables seems to be the root of the issue. Even after telling it they are variables, it still ‘managed’ to tell me ‘M = 3’ and ‘N = 4’. Yikes.

  • At one moment, it arrived at the ol’ reliable {N = N} equation. Don’t lie to yourself; you’ve been there, too.

  • It acknowledged it was going in circles, which is very interesting.

  • It nevertheless committed to getting a numerical response, a complete reasoning failure.

I highly recommend that you converse with this model. It will give you a good idea of what a state-of-the-art model can and can’t do and how it thinks (unlike OpenAI, which is too insecure to share its thought process).

ADAPTIVE LEARNING
Writer’s Self-Evolving Models

Writer, an AI company that builds products around Large Language Models (LLMs) and a sponsor of this newsletter, has announced self-evolving models (although the model will remain unreleased for now). These are a new implementation of an idea we’ve discussed many times recently: allowing AI models to learn while serving predictions to users.

In other words, the model continues to learn and adapt to new inputs instead of having a training phase and an inference (use) phase.

TheWhiteBox’s takeaway:

As seen in the above image, thanks to this self-evolving capability, the model improves its performance the more times it takes each test, signaling that this adaptive mechanism works.

Although they don’t provide much detail, understanding how they are doing this doesn’t take much effort.

They talk about updating a ‘per-layer memory pool.’ This likely refers either to the KV Cache (in case they are providing the memory through the prompt at all times) or to some sort of separate memory (like a vector database) from which they extract the critical data using the ‘keys’ and ‘values’ of the words in the user’s prompt (this is attention-mechanism notation; read more here). In other words, they use the user’s prompt as a ‘search query’ against the memory pool and extract information valuable to the task, similar to what Retrieval-Augmented Generation (RAG) does.
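To make this concrete, here is a minimal sketch of a key/value memory pool queried with the user’s prompt. This is purely my assumption of how such a mechanism could work, not Writer’s actual implementation; the class and method names are hypothetical.

# Minimal sketch of the retrieval idea described above (an assumption, not Writer's code).
import numpy as np

class MemoryPool:
    def __init__(self, dim: int):
        self.keys = np.empty((0, dim))  # one key vector per stored memory
        self.values = []                # the memorized content (e.g., a fact)

    def write(self, key: np.ndarray, value: str):
        # Store a new memory, keyed by an embedding of the input.
        self.keys = np.vstack([self.keys, key])
        self.values.append(value)

    def read(self, query: np.ndarray, top_k: int = 3):
        # Use the prompt's embedding as a search query (attention-style dot product).
        if not self.values:
            return []
        scores = self.keys @ query               # similarity between query and stored keys
        best = np.argsort(scores)[::-1][:top_k]  # highest-scoring memories first
        return [self.values[i] for i in best]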

They also discuss uncertainty measurements, meaning the model measures its ‘surprise’ upon seeing an unknown input (as a way to judge whether the input is new and, thus, worth learning). Although they explicitly say the models are Transformers (just like ChatGPT), this surprise measurement mechanism is eerily similar to how Mamba and TTT layers work. I wrote a detailed overview of the latter in our Notion knowledge base.

Upon seeing an unknown input (like a new fact), the model stores it in the memory pool for future retrieval, which explains why performance improves when the benchmark is retaken.
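Building on the sketch above, a ‘surprise’-gated write could look like this. Again, this is an assumption of how the mechanism might work (the threshold is arbitrary), not Writer’s code.

# Hypothetical surprise-gated write: only memorize inputs the model finds novel.
def maybe_store(pool: MemoryPool, key, value: str,
                token_logprob: float, threshold: float = 4.0):
    # If the model assigned low probability to the observed input (high surprise),
    # treat it as new information and add it to the memory pool.
    surprise = -token_logprob   # low probability under the model => high surprise
    if surprise > threshold:
        pool.write(key, value)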

Finally, although I’m speculating, it doesn’t appear that this model is actively learning (updating its parameter weights); rather, it is growing its memory pool on a given matter, meaning this is not quite test-time training, the hottest research avenue these days.

HEALTHCARE
Another Victory for AI Healthcare

Researchers from the University of Michigan and the University of California, San Francisco, have developed an AI model called FastGlioma. It is designed to assist neurosurgeons in identifying residual brain tumor tissues during surgery.

This model utilizes foundation models trained on extensive datasets, including over 11,000 surgical specimens and 4 million unique microscopic fields of view, to distinguish between healthy and tumorous brain tissues.

FastGlioma can detect tumor infiltration using lower-resolution images with 92% accuracy in 100 seconds and 90% in just 10 seconds. Notably, the research team has open-sourced the model and provided an online demonstration to facilitate its broader adoption and potential application to other cancer types in future studies.

TheWhiteBox’s takeaway:

Do I need to say it again?

AI’s productivity-enhancing capabilities are nothing compared to its ability to discover patterns in data, which is undeniably changing the world by saving lives.

A surgeon paired with an AI that can confidently detect tumor bits they might have missed is a huge victory for healthcare. So, for once, let’s stop debating how smart our frontier models are (to see how ‘smart’ they are, read DeepSeek’s news above again), and let’s start putting some extra capital into creating AIs that save lives; it’s discouraging to see how most money flows into models that can speak like Shakespeare while most healthcare breakthroughs are developed in frugal, low-capital environments.

Besides Google DeepMind, which deserves an honorable mention, most frontier AI labs don’t touch healthcare with a ten-foot pole, and that’s unacceptable.

HEALTHCARE
Robotic Surgeons: One Step Closer

A group of researchers at Stanford University developed an AI model that could learn surgical tasks simply by imitating human surgeons. Now, a group of researchers at Johns Hopkins University has trained a version of the Da Vinci Surgical Robot to perform extremely dexterous surgical procedures.

Researchers claim that we are reaching a point where fully autonomous surgical robots are on the horizon.

TheWhiteBox’s takeaway:

In summary, it is a vision-language model (VLM) that takes in images from cameras and predicts movements based on those images. It’s still a Transformer model like ChatGPT, but instead of outputting words, it outputs robot actions, and instead of taking in words, it takes in images.

Again, all AI models follow the same principle: they are maps between a set of inputs and outputs, nothing more, nothing less.
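As a toy illustration of that ‘map from inputs to outputs,’ here is what the interface of such a policy could look like. This is a hypothetical sketch of the idea, not the actual Da Vinci system; the class and field names are made up.

# Toy sketch of a vision-to-action policy: images in, robot actions out (hypothetical interface).
from dataclasses import dataclass
import numpy as np

@dataclass
class RobotAction:
    joint_deltas: np.ndarray  # small changes to apply to each joint angle
    gripper_open: bool        # whether the instrument's gripper should open

class SurgicalPolicy:
    def predict(self, camera_frames: list) -> RobotAction:
        # A real system would run a Transformer over image tokens here and decode
        # action tokens instead of word tokens; we return a dummy action.
        return RobotAction(joint_deltas=np.zeros(7), gripper_open=False)

# Usage: feed the latest camera frames, get the next motion command.
action = SurgicalPolicy().predict(camera_frames=[np.zeros((224, 224, 3))])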

As a highlight, alongside the previous news about AIs detecting residual cancer tissue to help surgeons remove the entire tumor, AIs are also gaining physicality, which may allow them to participate actively in surgical interventions.

Importantly, they don’t get tired or sleepy, didn’t argue with their spouse the night before, and don’t have hyperactive kids. So, while I wouldn’t be comfortable having a fully autonomous robot arm opening me up, I like the idea that a human surgeon combined with an AI can lead to better surgical outcomes.

HEALTHCARE
ChatGPT Beats Doctors

To finish our rundown on AI healthcare, a new study suggests that ChatGPT beats human doctors in identifying illnesses by up to 16%.

Interestingly, however, the results also showed that doctors using ChatGPT were not as accurate as doctors who weren’t using the LLM, but the standalone AI outcompeted both cohorts (doctors without ChatGPT and doctors using ChatGPT).

TheWhiteBox’s takeaway:

These are some bizarre results, right? My central intuition is that doctors have preexisting biases that lead them to ignore AI recommendations and follow their own faulty intuition.

Therefore, I urge for AI to become a common tool for doctors, who can’t always recall everything and could rely on AI for second opinions.

However, the study implies that we should at least consider having autonomous AI diagnose patients. This could lead to a world where doctors provide support throughout illnesses and use AI’s unmatched pattern-matching accuracy to perform the diagnosis.

This may seem too futuristic, but AI is already extensively used to diagnose defects in machinery or equipment in manufacturing, as humans cannot identify the subtleties in the defects the same way a Convolutional Neural Network (CNN) can.

TREND OF THE WEEK
FrontierMath, AI’s Door to AGI?

EpochAI, an AI research think tank we commonly mention in this newsletter for its predictions on AI scaling, data, and compute, has released a maths benchmark, FrontierMath, developed alongside top AI researchers, academics, and renowned mathematicians, including Fields Medal awardees (the equivalent of a Nobel Prize in mathematics). It can already be proclaimed the hardest AI benchmark in history.

Upon seeing the benchmark, Terence Tao, the greatest mathematician alive and often considered the smartest human on this planet, stated:


“These are extremely challenging… I think they will resist AIs for several years at least.”

Later, Professor Tao added that he did not know how to solve many of the problems but knew who to ask. In other words, these problems were conceived by humans and can be solved by humans, but no single known person can solve all of them.

Unsurprisingly, the benchmark destroys all frontier AIs (both Large Language and Large Reasoner Models, LLMs/LRMs). However, seeing how AIs always eventually solve every benchmark, is this the last hurdle AI must overcome to reach AGI?

Well, it will depend on how it solves it. But what do we mean by that?

A God-Level Benchmark

The FrontierMath benchmark is a sophisticated evaluation framework designed to measure AI systems' advanced mathematical reasoning capabilities.

Conceived to Be Hard

It includes hundreds of original problems carefully crafted by expert mathematicians to test a model’s ability to solve complex mathematical challenges.

The distribution of maths problems in the dataset

The problems span many modern mathematical fields, including number theory, algebraic geometry, combinatorics, and category theory. They are designed to require deep theoretical understanding, creativity, and specialized knowledge, often demanding hours or even days of expert mathematicians' time to solve.

Importantly, all tests must fulfill these four constraints:

  1. Originality: Problems must be novel, either by transforming existing ideas in innovative ways or by combining multiple concepts to obscure their origins, ensuring genuine mathematical insight is required.

  2. Automated Verifiability: Solutions must have definitive, computable answers (e.g., numerical results, symbolic expressions) that can be verified programmatically, often using tools like SymPy.

  3. Guessproofness: Problems must avoid susceptibility to guessing, requiring rigorous reasoning and work to achieve the correct solution, with a less than 1% chance of guessing correctly.

  4. Computational Tractability: Solutions must be verifiable with standard algorithms and run under a minute on standard hardware, ensuring efficient evaluation.

This way, FrontierMath and other recent benchmarks like ARC-AGI are specifically conceived to prevent models from saturating the benchmark and to prevent data contamination (when the tests leak into a model’s training set).
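To illustrate the ‘automated verifiability’ constraint, a checker only needs to compare a model’s final answer against a known exact value. The problem and answers below are made up for illustration; they are not actual FrontierMath items (those are kept private precisely to avoid contamination).

# Hypothetical answer checker in the spirit of FrontierMath's automated verification.
from sympy import Integer, sympify, SympifyError

EXPECTED = Integer(720)  # made-up exact answer to a made-up counting problem

def check_answer(submitted: str) -> bool:
    # Parse the model's final answer and compare it exactly (symbolically).
    try:
        return sympify(submitted) == EXPECTED
    except (SympifyError, TypeError):
        return False

print(check_answer("720"))        # True
print(check_answer("719 + 1"))    # True: evaluated symbolically before comparing
print(check_answer("about 700"))  # False: vague or unparseable answers score zero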

And how do models fare in this benchmark? Well, it isn’t good.

It’s a Massacre

The results show the huge gap between AI and expert humans. Among all tested models, all considered state-of-the-art, the best performers were Claude 3.5 Sonnet and Gemini 1.5 Pro, each at just 2%.

OpenAI models, including the o1 family, considered the best reasoning models in the world, only reached 1%.

Source: FrontierMath

This makes FrontierMath unequivocally the most brutal AI benchmark in the world, and probably in history, as the gap between model performance on it and on other tough benchmarks is considerable:

As an interesting reference, the benchmark in the bottom right corner, the MMLU, is the most commonly mentioned benchmark that AI labs use to showcase the improvements of their latest models. Given the stark differences in task complexity, I assume they won’t be citing FrontierMath anytime soon.

All this begs the question: why do LLMs/LRMs perform so poorly?

The Usual Suspects

The first reason AIs fail these tasks is that they are novel. LLMs/LRMs are still, until proven otherwise, AIs that can only retrieve the solution to a problem from memory or past experience, as we saw in our newsletter about the anti-LLM revolution.

In layman’s terms, they will most likely fail when faced with a problem where memorization is of no help.

FrontierMath’s creators knew this and designed the benchmark tasks to require genuine, novel, and even counterintuitive approaches, the literal opposite of what LLMs/LRMs can do (although the latter, like the o1 models, are thought to be more capable in this regard).

The other issue is data. The data required to solve these problems, and the key insights behind them, are present in just a handful of papers, as Terence Tao stated in his interview with the research team.

AIs not only need data, they need a lot of it.

Not only do AIs not work properly without experience in the problem, but they also need orders of magnitude more data than the average human to learn the task. These two constraints combine to make this task incredibly complex for AIs.

The third issue is time. These problems are meant to take hours or days to solve, even for expert mathematicians. This makes them even harder for models, which, even in the case of o1-preview, can only ‘think’ for a few minutes at most due to the extremely costly nature of their inference.

But let’s be—extremely—optimistic. Suppose an AI model cracks this benchmark a year from now. In that event, is that model AGI?

It’s Not the Outcome, It’s the Process

In last month’s ‘The Anti-LLM Revolution Begins,’ we introduced the task complexity/familiarity conundrum that François Chollet proposes. In simple terms, we evaluate AIs incorrectly, as intelligence is not proved by outcomes but by the process leading to that outcome.

This seemingly minor mental adaptation is crucial because it avoids confounding intelligence with memorization.

Let me put it this way: OpenAI has some of the most brilliant people alive inside its organization. If these people, many of them brilliant mathematicians, laid out the exact resolution paths to all the tests in this benchmark, the model’s performance (at least on the public evaluations) would skyrocket.

Why?

Simple: with memorization, no task is too complex for an AI to solve. As long as they memorize the solution function, they will solve it.

However, true intelligence emerges when we solve tasks we don’t know how to solve initially, requiring ‘on the fly’ skill acquisition (in fact, François Chollet describes intelligence literally as ‘efficient skill acquisition’). And let me be clear: no AI has yet proven capable of doing this.

Some non-LLM AIs using brute-force search, aka trying many solutions until one works, can perform well in novel settings. Still, you will agree that brute-force search suffers from combinatorial issues and would never scale to the complexity of the real world.
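To see why brute force doesn’t scale, consider the size of the search space: with b candidate operations per step and k steps, there are b^k possible solution paths. A quick illustrative calculation (my own numbers, not from the benchmark):

# Illustrative only: the combinatorial explosion behind brute-force search.
branching_factor = 10   # candidate operations to try at each step
depth = 12              # number of steps in a candidate solution
candidates = branching_factor ** depth
print(f"{candidates:,} candidate solution paths")  # 1,000,000,000,000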

Either way, FrontierMath is undoubtedly a great step in reframing our misconceptions about AI intelligence. It clarifies that AIs are still very far behind us, at least when it comes to mathematics. It also makes an extra effort to be novel and resistant to memorization (due to the lack of data).

Don’t get me wrong, if AIs solve this benchmark, that will be historic. When that happens, you will be flooded with ‘AGI is here’ claims. But before we celebrate, the AI that solves the task must have proven novelty, absence of memorization, and true generalization.

Only then can AGI be considered conquered.

Closing Thoughts

This week, we have learned about the following:

  1. The US may create a Manhattan Project-type effort to build AGI, which could set a historical precedent for AI’s future moving forward.

  2. LRMs are no longer OpenAI-only territory. DeepSeek and Nous Research have built open-source solutions that match or exceed the performance of the o1 models.

  3. AI’s impact on healthcare is still severely underappreciated. Most breakthroughs are being pushed by universities rather than cash-rich labs, which is very disappointing as we’re literally talking about saving lives.

  4. And thanks to several mathematicians and EpochAI, we also have a neat view of how limited AIs are still compared to expert humans, coming nowhere close to the genuine reasoning capabilities of humans.

See you on Sunday!

THEWHITEBOX
Premium

If you like this content, by joining Premium, you will receive four times as much content weekly without saturating your inbox. You will even be able to ask the questions you need answers to.

Until next time!