A Chinese o1 Model, AI's Manhattan Project, & More
For business inquiries, reach out to me at [email protected]
THEWHITEBOX
TLDR;
AI's Manhattan Project
We Finally Have the First Open-Source "o1"
Writer's Self-Evolving Models
Another Victory for AI Healthcare
Robotic Surgeons
ChatGPT Beats Doctors…
[TREND OF THE WEEK] FrontierMath, AI's Door to AGI?
Fully Automated Email Outreach With AI Agent Frank
Hire Agent Frank to join your sales team and let him take care of prospecting, emailing and booking meetings for you, so your team can focus on closing deals!
Agent Frank works in two modes - fully autonomous Auto-pilot and Co-pilot, where you can review and monitor his work. And he's super easy to set up in just 4 quick steps!
He learns using first-party data you provide him during onboarding and continuously gets better as he works to book you more meetings.
NEWSREEL
AI's Manhattan Project
Shocked. That's how most people have felt after reading this small paragraph. This is part of the annual report by the US-China Economic and Security Review Commission, signaling that AI is a matter of National Security, similar to how nuclear power and nuclear bombs were treated decades ago.
The report openly calls for a Manhattan-like Project so that the US races toward AGI before China (and based on the next news, we can't blame them, considering China is catching up fast).
TheWhiteBox's takeaway:
If there's one place that is skeptical about AI's alleged capabilities, it's this newsletter. Obviously, I feel this Commission is being heavily lobbied/influenced by Big Tech and other AI incumbents, because treating AI as a matter of "vital importance" would give them business security and even regulatory capture.
However, while I agree that the country that reaches AGI first will have a tremendous advantage, two things come to mind:
Achieving AGI won't be a slowly-then-suddenly event; it will be something that evolves over time. I feel they are framing AGI as a "Trinity test" type of event, like when the US built the nuclear bomb and ended the war in a couple of days in August 1945. I sincerely believe that won't be the case for AGI.
Technology will be mostly open-sourced, so this idea that AI can be built inside a lab in Los Alamos feels utopian and would require a complete ban on AI development elsewhere; such a ban would have to be supranational, which is very hard to push through with US enemies.
I see many loopholes in this strategy, but one thing's for sure: AI is like nothing we have seen before, so caution is not a bad thing.
OPEN-SOURCE
We Finally Have the First Open-Source "o1"
DeepSeek has announced DeepSeek-r1, the first open-source implementation of an o1-type model: the Large Reasoner Models (LRMs) that OpenAI presented back in September, which allegedly vastly improve the reasoning capabilities of frontier AI models.
As you can see above, the model surpasses the results obtained by o1-preview (OpenAI's top model) in several benchmarks, a wake-up call for them and their investors. Just over two months since o1's release, both Nous Research with its Forge API and DeepSeek have presented features that match or surpass the performance of OpenAI's flagship model.
Fascinatingly, you can chat with the model for free here today.
TheWhiteBox's takeaway:
Besides the fact that we now have open-source LLMs that can perform "deep" reasoning tasks (and by deep, I mean relative to what a standard LLM can do, which is not much), the most remarkable thing about this release is that the researchers have also opened up the internal chain of thought the model engages in before answering.
In other words, you can see the "inner thinking" going on while the model processes your input (think of this as the model's introspection or "thinking about thinking").
But how smart is this model?
Well, it's still easily fooled. I sent it the following input: "Javier has M brothers and (N - 7 + M) sisters, and the (N - 4) sister loves pudding. How many brothers does one sister have?" This includes both tricky ways of referring to the number of brothers and sisters and an inconsequential fact (that one of the sisters loves pudding).
While the model did, in fact, initially arrive at the correct answer (each sister has M + 1 brothers: Javier plus his M brothers), it then went completely bananas, concluding that each sister has four brothers. It completely failed to realize that the quantities are expressed as variables, so the answer should be expressed in terms of variables too, and in doing so it contradicted its own first conclusion.
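For reference, here is a tiny sanity check of the puzzle's arithmetic (my own illustration with arbitrary example values; the puzzle itself leaves M and N as unspecified variables, so the answer should stay symbolic):

```python
# A tiny sanity check of the puzzle's arithmetic (illustrative values only; the
# puzzle leaves M and N as unspecified variables, so the answer is M + 1).
def brothers_per_sister(M: int, N: int) -> int:
    sisters = N - 7 + M          # number of sisters (the pudding detail is irrelevant)
    assert sisters >= 1, "need at least one sister for the question to make sense"
    return M + 1                 # from a sister's view: Javier plus his M brothers

# Whatever concrete values we plug in, the result only ever depends on M:
for M, N in [(2, 10), (3, 8), (5, 20)]:
    print(f"M={M}, N={N} -> each sister has {brothers_per_sister(M, N)} brothers")
```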
The entire thought process is absolutely hilarious. And although it's too long to share here, here are some of the highlights:
The model failed to acknowledge that the pudding fact was irrelevant (it seemed close to realizing but didn't fully commit to calling bullshit on that part). Honestly, the failure to acknowledge that "M" and "N" are variables seems to be the root of the issue. Even after telling it they are variables, it still "managed" to tell me "M = 3" and "N = 4". Yikes.
At one moment, it arrived at the ol' reliable {N = N} equation. Don't lie to yourself; you've been there, too.
It acknowledged it was going in circles, which is very interesting.
It nevertheless committed to getting a numerical response, a complete reasoning failure.
I highly recommend that you converse with this model. It will give you a good idea of what a state-of-the-art model can and can't do and how it thinks (unlike OpenAI, which is too insecure to share its thought process).
ADAPTIVE LEARNING
Writer's Self-Evolving Models
Writer, an AI company that builds products around Large Language Models (LLMs) and a sponsor of this newsletter, has announced self-evolving models (although the model will remain unreleased for now). These are a new implementation of the idea we've discussed many times recently: allowing AI models to learn while making predictions to users.
In other words, the model continues to learn and adapt to new inputs instead of having a training phase and an inference (use) phase.
TheWhiteBox's takeaway:
As seen in the above image, thanks to this self-evolving capability, the model improves its performance the more times it takes each test, signaling that this adaptive mechanism works.
Although they don't provide much detail, understanding how they are doing this doesn't take much effort.
They talk about updating a "per-layer memory pool." This refers either to the KV Cache (in case they are providing the memory through the prompt at all times) or to some sort of separate memory (like a vector database) from which they extract the critical data using the "keys" and "values" (this is attention mechanism notation; read more here) of the words in the user's prompt. In other words, they use the user's prompt as a "search query" over the memory pool to extract information valuable for the task, similar to what Retrieval-Augmented Generation (RAG) does.
They also discuss uncertainty measurements, meaning the model measures its "surprise" upon seeing an unknown input (as a way to learn whether it's new and, thus, worth learning). Although they explicitly say the models are Transformers (just like ChatGPT), this surprise measurement mechanism is eerily similar to how Mamba and TTT layers work. I had a detailed overview of the latter in our Notion knowledge base.
Upon seeing an unknown input (like a fact), they store it in the memory pool for future retrieval, which explains why the model improves upon benchmark retesting.
Finally, although I'm speculating, it doesn't appear that this model is actively learning (updating its parameter weights); rather, it is growing its memory pool on a given topic, meaning this is not quite test-time training, the hottest research avenue these days.
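To make that reading concrete, here is a minimal sketch of how such a mechanism could work (entirely my own guess at the mechanics, not Writer's actual implementation; the class and parameter names are invented for illustration): the user's prompt embedding is used as a search query over a memory pool, and a simple "surprise" score (low similarity to anything already stored) decides whether a new input gets added.

```python
import numpy as np

class MemoryPool:
    """Toy per-layer memory pool: queried like attention/RAG, grown via a 'surprise' gate."""

    def __init__(self, dim: int, surprise_threshold: float = 0.6):
        self.keys = np.empty((0, dim))   # embeddings ("keys") of previously stored inputs
        self.values = []                 # the stored payloads themselves (e.g., facts)
        self.threshold = surprise_threshold

    def retrieve(self, query: np.ndarray, top_k: int = 3):
        """Use the prompt embedding as a search query over the pool (RAG-style)."""
        if not self.values:
            return []
        sims = self.keys @ query / (
            np.linalg.norm(self.keys, axis=1) * np.linalg.norm(query) + 1e-8
        )
        best = np.argsort(-sims)[:top_k]
        return [(self.values[i], float(sims[i])) for i in best]

    def maybe_store(self, query: np.ndarray, payload: str) -> float:
        """Store only 'surprising' inputs: ones unlike anything already in memory."""
        top = self.retrieve(query, top_k=1)
        surprise = 1.0 - (top[0][1] if top else 0.0)
        if surprise > self.threshold:
            self.keys = np.vstack([self.keys, query])
            self.values.append(payload)
        return surprise

pool = MemoryPool(dim=4)
pool.maybe_store(np.array([1.0, 0.0, 0.0, 0.0]), "fact about topic A")   # new -> stored
pool.maybe_store(np.array([0.99, 0.1, 0.0, 0.0]), "near-duplicate")      # familiar -> skipped
print(pool.retrieve(np.array([1.0, 0.0, 0.0, 0.0]), top_k=1))
```

Note that nothing in this sketch updates any model weights; the "learning" is purely memory growth, which is exactly why it would not count as test-time training.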
HEALTHCARE
Another Victory for AI Healthcare
Researchers from the University of Michigan and the University of California, San Francisco, have developed an AI model called FastGlioma. It is designed to assist neurosurgeons in identifying residual brain tumor tissues during surgery.
This model utilizes foundation models trained on extensive datasets, including over 11,000 surgical specimens and 4 million unique microscopic fields of view, to distinguish between healthy and tumorous brain tissues.
FastGlioma can detect tumor infiltration using lower-resolution images with 92% accuracy in 100 seconds and 90% in just 10 seconds. Notably, the research team has open-sourced the model and provided an online demonstration to facilitate its broader adoption and potential application to other cancer types in future studies.
TheWhiteBox's takeaway:
Do I need to say it again?
AI's productivity-enhancing capabilities are nothing compared to its ability to discover patterns in data, which is undeniably changing the world by saving lives.
Giving surgeons an AI that can confidently detect tumor bits they might have missed is a huge victory for healthcare. So, for once, let's stop debating how smart our frontier models are (to see how "smart" they are, read DeepSeek's news above again), and let's start putting some extra capital into creating AIs that save lives. It's discouraging to see how most money is flowing into models that can speak like Shakespeare while most healthcare breakthroughs are developed entirely in frugal, low-capital environments.
Besides Google DeepMind, which deserves an honorable mention, most frontier AI labs don't touch healthcare with a ten-foot pole, and that's unacceptable.
HEALTHCARE
Robotic Surgeons: One Step Closer
A group of researchers at Stanford University developed an AI model that could learn surgical skills simply by imitating human surgeons. Now, a group of researchers at Johns Hopkins University has trained a version of the da Vinci Surgical Robot to perform extremely dexterous surgical procedures.
Researchers claim that we are reaching a point where fully autonomous surgical robots are on the horizon.
TheWhiteBox's takeaway:
In summary, it is a vision-language model (VLM) that takes in images from cameras and predicts movements based on those images. It's still a Transformer model like ChatGPT, but instead of taking in and producing words, it takes in images and outputs robot actions.
Again, all AI models follow the same principle: they are maps between a set of inputs and outputs, nothing more, nothing less.
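To make that "map from images to actions" point concrete, here is a bare-bones sketch assuming a standard PyTorch setup (this is not the researchers' actual architecture; the sizes, names, and action dimensionality are arbitrary): a Transformer encoder over image patch tokens with an action head where a language model would have a vocabulary head.

```python
import torch
import torch.nn as nn

# Illustrative only: a Transformer that maps camera images to continuous robot
# actions instead of mapping words to words. All dimensions are placeholders.
class VisionToActionPolicy(nn.Module):
    def __init__(self, d_model: int = 256, n_actions: int = 7):
        super().__init__()
        # Turn the image into a sequence of patch tokens (the "words" of a VLM).
        self.patchify = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # Instead of a vocabulary head, a head that predicts joint/gripper deltas.
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        tokens = self.patchify(images).flatten(2).transpose(1, 2)   # (B, patches, d_model)
        encoded = self.encoder(tokens)
        return self.action_head(encoded.mean(dim=1))                # (B, n_actions)

policy = VisionToActionPolicy()
actions = policy(torch.randn(1, 3, 224, 224))  # one RGB frame -> a 7-DoF action vector
```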
As a highlight, alongside the previous news about AIs detecting cancer "remains" to help the surgeon remove the entire tumor, AIs are also gaining physicality, which may allow them to participate actively in surgical interventions.
Importantly, they don't get tired or sleepy, haven't argued with their spouse the night before, and don't have hyperactive kids. So, while I wouldn't be comfortable having a fully autonomous robot arm opening me up, I like the idea that a human surgeon combined with an AI can lead to better surgical outcomes.
HEALTHCARE
ChatGPT Beats Doctors
To finish our rundown on AI healthcare, a new study suggests that ChatGPT beats human doctors in identifying illnesses by up to 16%.
Interestingly, however, the results also showed that doctors using ChatGPT were not as accurate as doctors who weren't using the LLM, but the standalone AI outcompeted both cohorts (doctors without ChatGPT and doctors using ChatGPT).
TheWhiteBox's takeaway:
These are some bizarre results, right? My central intuition is that doctors have preexisting biases that lead them to ignore AI recommendations and follow their own faulty intuition.
Therefore, I urge AI to become a common tool for doctors, who can't always recall everything and can rely on it for second opinions.
However, the study implies that we should at least consider having autonomous AI diagnose patients. This could lead to a world where doctors provide support throughout illnesses and use AIâs unmatched pattern-matching accuracy to perform the diagnosis.
This may seem too futuristic, but AI is already extensively used in manufacturing to diagnose defects in machinery or equipment, as humans cannot identify the subtleties in the defects the same way a Convolutional Neural Network (CNN) can.
TREND OF THE WEEK
FrontierMath, AI's Door to AGI?
EpochAI, an AI research think tank we commonly mention in this newsletter for its predictions on AI scaling, data, and compute, has released a maths benchmark, FrontierMath. Developed alongside top AI researchers, academics, and renowned mathematicians, even with the collaboration of Fields Medal awardees (mathematics' equivalent of the Nobel Prize), it can already be proclaimed the hardest AI benchmark in history.
Upon seeing the benchmark, Terence Tao, the greatest mathematician alive and often considered the smartest human on this planet, stated:
"These are extremely challenging… I think they will resist AIs for several years at least."
Later, Professor Tao added that he did not know how to solve many of the problems himself but knew whom to ask. In other words, these problems were conceived by and can be solved by humans, but no single known person can solve all of them.
Unsurprisingly, the benchmark destroys all frontier AIs (both Large Language Models and Large Reasoner Models, LLMs/LRMs). However, seeing how AIs always eventually solve every benchmark, is this the last hurdle AI must overcome to reach AGI?
Well, it will depend on how it solves it. But what do we mean by that?
A God-Level Benchmark
The FrontierMath benchmark is a sophisticated evaluation framework designed to measure AI systems' advanced mathematical reasoning capabilities.
Conceived to Be Hard
It includes hundreds of original problems carefully crafted by expert mathematicians to test a modelâs ability to solve complex mathematical challenges.
The distribution of maths problems in the dataset
The problems span many modern mathematical fields, including number theory, algebraic geometry, combinatorics, and category theory. They are designed to require deep theoretical understanding, creativity, and specialized knowledge, often demanding hours or even days of expert mathematicians' time to solve.
Importantly, all tests must fulfill these four constraints:
Originality: Problems must be novel, either by transforming existing ideas in innovative ways or by combining multiple concepts to obscure their origins, ensuring genuine mathematical insight is required.
Automated Verifiability: Solutions must have definitive, computable answers (e.g., numerical results, symbolic expressions) that can be verified programmatically, often using tools like SymPy (a minimal sketch of such a check appears just below).
Guessproofness: Problems must avoid susceptibility to guessing, requiring rigorous reasoning and work to achieve the correct solution, with a less than 1% chance of guessing correctly.
Computational Tractability: Solutions must be verifiable with standard algorithms and run under a minute on standard hardware, ensuring efficient evaluation.
This way, FrontierMath and other recent benchmarks like ARC-AGI are specifically conceived to prevent models from saturating the benchmark and to avoid data contamination (when the tests are leaked into a model's training set).
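To make the automated-verifiability constraint concrete, here is a minimal sketch of what such a check can look like with SymPy (my own illustration, not FrontierMath's actual grading harness): the model's final answer is parsed as a symbolic expression and checked for exact mathematical equivalence with the reference solution.

```python
import sympy as sp

def verify(submitted: str, reference: str) -> bool:
    # Parse both answers into symbolic expressions.
    sub = sp.sympify(submitted)
    ref = sp.sympify(reference)
    # simplify(sub - ref) == 0 accepts answers written in a different but
    # mathematically equivalent form, while rejecting anything else.
    return sp.simplify(sub - ref) == 0

print(verify("(x + 1)**2", "x**2 + 2*x + 1"))  # True: equivalent forms
print(verify("2**10", "1024"))                  # True: same number
print(verify("2**10", "1000"))                  # False: wrong answer
```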
And how do models fare in this benchmark? Well, it isn't good.
It's a Massacre
The results show the huge differences between AI and expert humans. Among all tested models, all considered state-of-the-art, the best score was 2%, achieved by both Claude 3.5 Sonnet and Gemini 1.5 Pro.
OpenAI models, including the o1 family, considered the best reasoning models in the world, only reached 1%.
Source: FrontierMath
This makes FrontierMath unequivocally the most brutal AI benchmark in the world, and probably in history, as the gap between model performance here and on other tough benchmarks is considerable:
As an interesting reference, the benchmark in the bottom right corner, the MMLU, is the most commonly mentioned benchmark that AI labs use to showcase the improvements of their latest models. Given the stark differences in task complexity, I assume they won't be citing FrontierMath anytime soon.
All this begs the question: why do LLMs/LRMs perform so poorly?
The Usual Suspects
The first reason AIs fail these tasks is that they are novel. LLMs/LRMs are still, until proven otherwise, AIs that can only retrieve the solution to a problem from memory or past experience, as we saw in our newsletter about the anti-LLM revolution.
In layman's terms, they will most likely fail when faced with a problem where memorization is of no help.
FrontierMath's creators knew this and designed the benchmark tasks so that solving them requires genuine, novel, and even counterintuitive approaches, the literal opposite of what LLMs/LRMs can do (although the latter, like the o1 models, are thought to be somewhat more capable in this regard).
The second issue is data. The data required to solve these problems, and the key insights, are present in just a handful of papers, as stated by Terence Tao in his interview with the research team.
AIs not only need data, they need a lot of it.
Not only do AIs not work properly without prior experience with a problem, but they also need orders of magnitude more data than the average human to learn a task. These two constraints combine to make these tasks incredibly complex for AIs.
The third issue is time. These problems are meant to take hours or days to solve, even for expert mathematicians. This makes them harder for models, which, even in the case of o1-preview, can only think for a few minutes at most due to the extremely costly nature of their inference.
But let's be extremely optimistic. Suppose an AI model cracks this benchmark a year from now. In that event, is that model AGI?
It's Not the Outcome, It's the Process
In last month's "The Anti-LLM Revolution Begins," we introduced the task complexity/familiarity conundrum that François Chollet proposes. In simple terms, we evaluate AIs incorrectly, as intelligence is not proved by outcomes but by the process leading to that outcome.
This seemingly minor mental adaptation is crucial because it avoids confounding intelligence with memorization.
Let me put it this way: OpenAI has some of the most brilliant people alive inside its organization. If these people, many of them brilliant mathematicians, laid out the exact resolution paths to all tests in this benchmark, the model's performance (at least in the public evaluations) would skyrocket.
Why?
Simple: with memorization, there isn't a task too complex for an AI to solve. As long as they memorize the solution function, they will solve it.
However, true intelligence emerges when we solve tasks we don't know how to solve initially, requiring "on the fly" skill acquisition (in fact, François Chollet describes intelligence literally as "efficient skill acquisition"). And let me be clear: no AI has yet proven capable of doing this.
Some non-LLM AIs using brute-force search (aka trying many solutions until one works) can perform well in novel settings. Still, you will agree that brute-force search suffers from combinatorial explosion and would never scale to the complexity of the real world.
Either way, FrontierMath is undoubtedly a great step in reframing our misconceptions about AI intelligence. It clarifies that AIs are still very far behind us, at least when it comes to mathematics. It also makes an extra effort to be novel and resistant to memorization (due to the lack of data).
Don't get me wrong: if AIs solve this benchmark, that will be historic. But when that happens, you will be flooded with "AGI is here" proclamations. Before we celebrate, the AI that solves the task must have proven novelty, absence of memorization, and true generalization.
Only then can AGI be considered conquered.
Closing Thoughts
This week, we have learned about the following:
The US could set a historic precedent for AI's future with the possible creation of a Manhattan-Project-type effort to build AGI.
LRMs are no longer OpenAI-only territory. DeepSeek and Nous Research have built open-source solutions that match or exceed the performance of the o1 models.
AI's impact on healthcare is still severely underappreciated. Most breakthroughs are being pushed by universities rather than cash-rich labs, which is very disappointing as we're literally talking about saving lives.
And thanks to several mathematicians and EpochAI, we also have a neat view of how limited AIs still are compared to expert humans, coming nowhere close to genuine human reasoning capabilities.
See you on Sunday!
THEWHITEBOX
Premium
If you like this content, by joining Premium you will receive four times as much content weekly without saturating your inbox. You will even be able to ask the questions you need answers to.
Until next time!
Give a Rating to Today's Newsletter