The Anatomy of the AI Mind

THEWHITEBOX
TL;DR

Today, we analyze the most exciting research I’ve ever read in AI, courtesy of the leading AI lab Anthropic, which gives amazing insights into the mysterious mind of AI models.

You won’t be disappointed by what you read.

We also examine new investment rounds, Google's striking new SOTA model, new computer agents, and more.

Enjoy!

THE I IN AI
Google is the New King

On Tuesday, I covered some eye-opening new research that exposed the intelligence issues of frontier AI models. The research tested the capacity of our frontier reasoning models to solve complex mathematical questions that required a perfect proof to be counted as correct; it wasn’t about getting to a final correct answer, as the reasoning process had to be perfect, too.

This exposed the models’ limitations, as none surpassed the 2% mark, and with models like o1-pro costing $200 to run with the API, such results were quite embarrassing.

Well, the same day the researchers published those results, Google launched Gemini 2.5 Pro, and yesterday the researchers published its results too. They speak volumes: 24%, twelve times better than the best of the rest.

Insane.

Importantly, as mentioned, the model couldn’t have been trained on these tests as it was released the same day, so the results are legit.

Fascinatingly, it seems the model did not rely on code to solve the tasks (most reasoning models resort to reasoning in code so they can run it and iterate on their mistakes), making it even more impressive.

TheWhiteBox’s takeaway:

All signs are there. Google is the new SOTA, and this is hardly debatable. Not only because of this insane result, but also because it has by far the largest context window, at 1 million tokens (~750,000 words), which is crucial for tasks like coding where repository context can be massive.

I use it with Cursor, and I have to say I prefer it over Claude 3.7 Sonnet because it doesn’t overthink as much and sticks strictly to solving my task. The latter tends to change many things I didn’t ask for.

Ah! And we are talking about the Experimental version; the actual 2.5 Pro model has yet to be released. Google is so back.

But will that translate into real traction (aka money)? 

AGENTS
AI Research Agents, Closer than we Think?

OpenAI has published new research. In it, they present a new benchmark for agents, PaperBench, which evaluates agents' capability to replicate top AI research from ICML 2024 (International Conference on Machine Learning).

The models are evaluated using rubrics co-designed by the researchers behind these papers to guarantee that the models achieve actual replication.

Surprisingly, the best-performing model is Claude 3.5 Sonnet, ahead of OpenAI’s own models.

TheWhiteBox’s takeaway:

OpenAI’s recent shift toward openness (they are also releasing an open-weights model in the upcoming months, one you will be able to download for free onto your computer) is clear, and I’m all for it.

They seem to have finally understood that going open not only makes the US more competitive as a whole but also generates tremendous goodwill around their products. While before many of us used OpenAI’s models despite our disdain for their ways of working, we are now rooting for them. Let’s hope they keep it that way.

As for the results, they are quite scary, aren’t they? I firmly believe AIs will mostly end up being augmentation tools, but some jobs could very easily be automated away. The funny thing is that some of these jobs might surprise everyone.

HEALTHCARE
Isomorphic Raises $600 Million

Isomorphic Labs, Google’s subsidiary founded and operated by Google DeepMind’s CEO and Nobel Laureate Demis Hassabis, has raised $600 million at an undisclosed valuation from investors including Thrive Capital (one of OpenAI’s most prominent investors).

Isomorphic Labs is focused on using AI for drug discovery, and one of the key tools in this space is AlphaFold, developed by DeepMind. AlphaFold takes a protein’s amino acid sequence and predicts its 3D structure, mapping from a linear sequence to a folded shape. While it’s not a language model like ChatGPT, the underlying idea is similar: learning patterns in sequences to make useful predictions, just in a different domain.

The rationale is that understanding protein structures helps us design better drugs. If we know the shape of a protein involved in a disease, we can try to design molecules that bind to it effectively. In layman’s terms, the goal is to use these AI tools to help invent new medicines.

Now, this company has raised more than half a billion dollars to make this vision real.

TheWhiteBox’s takeaway:

Everything Demis touches turns to gold. Importantly, unlike Sam Altman, Demis is a known AI expert; Altman is just a great operator.

As I always mention, Google doesn’t just hold all the cards; it owns the dealer, the table, and the entire casino. They have everything they need to dominate, including the best model today (see the Technology section above).

If Google weren’t Google, I would have invested all my savings in them. The problem? They are Google, and the way they approach go-to-market strategies is just wrong: great products, excellent potential distribution, horrible execution.

COMPUTE
Google to Lease Coreweave Data Centers

According to sources, Google is in talks to lease NVIDIA chips from Coreweave. The deal could also include Coreweave data center space being used to house Google’s TPUs.

The deal isn’t closed at the time of writing, and both parties declined to comment.

TheWhiteBox’s takeaway:

This deal has several implications:

  1. For starters, it would expand Coreweave’s customer pool, which today consists mostly of OpenAI and Microsoft. If the deal closes, I would be hugely bullish on Coreweave.

  2. It suggests that even the mighty Google, with more available compute than anyone, is running into serious space constraints to house its TPUs and to deploy more compute to users (this explains Google’s extremely limited usage rates for their models).

  3. Compute spending go brr’. No matter how much better we are getting at running AI models, the demand for compute is insane. At this pace, compute deployment and, worse, energy provisions will become a severe bottleneck. Additionally, we have just learned that Chinese companies have placed $16 billion in orders for NVIDIA’s H20 chips.

  4. While it’s hardly debatable that most AI compute will run in the cloud, the serious constraints we will face in the long run make owning some compute yourself very attractive. I’m not saying you should invest $3 million in a GB200 NVL72 NVIDIA server rack (120kW of power demand; good luck with that, because you’ll need to invest in an ad hoc electrical transformer), but owning some local compute at home makes sense. If AI becomes what determines your edge over other humans, having a strong accelerated system at home could be hugely beneficial, so your moat doesn’t depend on externalities. If compute is going to be the currency of the future, owning a piece of the pie can’t hurt that much, can it?

COMPUTER AGENTS
Amazon Releases Nova Act

Amazon AGI Lab, Amazon’s AI arm, has launched a software development kit (SDK) for agents called Nova Act. This tool allows you to seamlessly leverage Amazon’s Nova LLMs to perform actions on your behalf on a computer (using a headless browser deployed for you).

The SDK appears clean and powerful (see the above image, where it outcompetes results from Anthropic and OpenAI) and gives an idea of the world we are moving into: most of the “code” is actually natural language, with you declaring what you want and AIs obeying. Moreover, the models support structured outputs, making it easy to parse the model’s results for further analysis or use.
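To make that pattern concrete, here is a hypothetical sketch of what such an agent SDK’s usage looks like. The `MockAgent` class, its `act` method, and the schema format are all invented for illustration; this is not Amazon’s actual API.

```python
from dataclasses import dataclass

# Hypothetical stand-in for an agent SDK like Nova Act: the "program"
# is mostly natural language, and results come back as structured
# output that is trivial to parse downstream.
@dataclass
class MockAgent:
    model: str

    def act(self, instruction: str, schema: dict) -> dict:
        # A real SDK would drive a headless browser here; this mock
        # just returns a placeholder conforming to the requested schema.
        return {field: None for field in schema["properties"]}

agent = MockAgent(model="nova")
result = agent.act(
    "Find the cheapest flight from Madrid to London next Friday",
    schema={"properties": {"airline": {}, "price_eur": {}}},
)
print(sorted(result))  # ['airline', 'price_eur']
```

The point is the shape of the interaction: a plain-English instruction in, a machine-parseable dictionary out.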

TheWhiteBox’s takeaway:

Another tool for one of today’s most crowded developer fields: agent development. What makes me optimistic about this release is that the team behind it is the Adept team, acquired by Amazon for hundreds of millions in 2024. The team is led by David Luan, ex-Head of Engineering at OpenAI and one of the minds behind the GPT-2 and GPT-3 models, considered the precursors of all modern AI.

Legit team.

My concern with these tools is that, honestly, I don’t see the real utility in consumer-end computer agents. I feel most of the agent value will accrue in use cases that involve repetition, because those are what actually consume humans’ time. Putting an AI on those tasks instead will massively decrease costs while increasing human productivity.

But do I need an AI to book me a flight or perform an analysis to find the cheapest flight? I feel many of these cases are flashy products that aren’t solving a problem but looking for one to solve.

In fact, I’m pretty sure no one actually uses OpenAI’s Operator tool; not because it doesn’t work in the use cases it is meant for, but because those use cases don’t need to be automated: the time savings simply aren’t there.

COMPUTER AGENTS
General Agents Comes Out of Stealth with Ace

Speaking of computer agents, startup General Agents Co has emerged from stealth with Ace, another computer agent that, according to its creator, is the best computer agent out there.

The different videos showcase several impressive demos, using a variety of tools like Adobe or Chrome, and performing what appear to be reasonably complex tasks.

TheWhiteBox’s takeaway:

The demos are crazy impressive, but again: who needs this?

Because I don’t.

I understand that to reach really useful agents that can perform tasks requiring thousands of steps, we need to first get them good at simpler tasks. But I don’t know how on Earth they are going to monetize these tools because, when push comes to shove, nobody in their right mind will pay for this when the productivity gains are almost negligible. Hope I’m wrong, though!

TREND OF THE WEEK
The Anatomy of the AI Mind

Anthropic has published one of the most fascinating research papers I’ve read in months (or probably ever), one that dissects the anatomy of Large Language Models (LLMs) to uncover fascinating aspects of their behavior.

I have to acknowledge that this research has taught me more about AI than most of the research I’ve read, and it has actually forced me to rethink my intuitions.

Stay if you want to be amazed.

A Small Intro to Neurons

To understand how AIs think, we first need to know what AIs are. And most current AI models are neural networks.

But what is that?

In very succinct terms, they are a conglomeration of elements called neurons deeply interconnected to each other (think of this as a rather loose analogy to the neurons in the brain).

For instance, LLMs like ChatGPT receive a set of words as input; their job is to predict the next word.

The above diagram shows MLP layers, which are part of ChatGPT’s overall architecture, not the entire model. It’s just to help you visualize neurons.
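To make “predict the next word” concrete, here’s a toy autoregressive loop. The bigram probability table is made up for illustration and stands in for the billions of learned weights in a real LLM:

```python
# Toy autoregressive generation: repeatedly pick the most likely next
# word given the last word. Real LLMs condition on the whole context
# with learned weights; this hand-made bigram table is a stand-in.
bigram = {
    "the": {"capital": 0.6, "state": 0.4},
    "capital": {"of": 0.9, "city": 0.1},
    "of": {"Texas": 1.0},
    "Texas": {"is": 1.0},
    "is": {"Austin": 1.0},
}

def generate(prompt: str, steps: int) -> str:
    words = prompt.split()
    for _ in range(steps):
        candidates = bigram[words[-1]]
        words.append(max(candidates, key=candidates.get))  # greedy pick
    return " ".join(words)

print(generate("the", 5))  # the capital of Texas is Austin
```

Each output word is fed back in as context for the next prediction; that feedback loop is all “autoregressive” means.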

The problem? We have no idea why and how they work.

To us, LLMs are black boxes that perform a series of computations (each one elementary, but a lot of them) and output the next word to the sequence with surprising accuracy.

And to make matters worse, current frontier models have hundreds of billions (with a ‘b’) of neurons interacting to make the prediction. This makes them highly complex to unravel or, simply put, to understand how they execute even the simplest of predictions.

So, is all hope lost? Luckily, no.

Neurons and Features

The first logical step to decipher their behavior is to monitor the activations of these neurons (when and how they fire). For instance, we can give them an input, and see which neurons activate to make the next prediction.

For every prediction, each neuron is ‘queried’ and may return a value that is then passed to the next round of neurons, or collapse to 0. This is what we mean by activating or ‘firing,’ with the term being highly influenced by how brain neurons behave.
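The “return a value or collapse to 0” behavior is, in the common case, just a ReLU applied to a weighted sum of inputs. A minimal sketch, with arbitrary made-up weights:

```python
# 'Firing' in practice: each neuron computes a weighted sum of its
# inputs; a ReLU then either passes the value on (the neuron activates)
# or collapses it to 0 (it stays silent). Weights are arbitrary.
def relu(x: float) -> float:
    return max(0.0, x)

inputs = [0.5, -1.0, 2.0]
weights = [
    [0.4, 0.1, 0.3],    # neuron 1
    [-0.2, 0.5, -0.8],  # neuron 2
]
activations = [relu(sum(w * x for w, x in zip(ws, inputs))) for ws in weights]
print([round(a, 2) for a in activations])  # [0.7, 0.0]: neuron 2 collapsed to 0
```

Whichever values survive are passed on as inputs to the next layer of neurons, and so on until the output.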

To predict the behavior of models, we would want to be able to predict outputs based on how these neurons activate. Sadly, as mentioned, this is a very hard problem.

Ideally, we should be able to match individual neuron activations to topics, meaning that when a specific neuron activates, we can predict the model’s output. However, researchers soon realized that neurons were polysemantic, meaning they activated for several seemingly unrelated topics.

Luckily, the researchers behind today’s paper at Anthropic made a fascinating discovery a while ago: neurons were indeed polysemantic, but certain neuron combinations were monosemantic (uniquely related to a specific topic).

In layman’s terms, they realized that when specific neurons in the model fired together, the model’s outcome was usually related to the topic assigned to that neuron group.

This gave us the idea of ‘features,’ which let us map a model’s different neuron combinations to specific topics. In other words, we could build a knowledge map of the model: what it knows and what it doesn’t.

If the ‘Shakespeare’ neurons activate, the model’s prediction relates to Shakespeare’s work!

This ‘activated neuron path’ is called an attribution graph because it’s highly correlated with the model’s output.

Suddenly, we had found a promising way of deciphering LLMs, going from a ‘blob’ of mystical elements poorly referred to as ‘neurons’ to an interpretable mesh of neuron circuits where we could match a specific circuit to a given topic and thereby predict model outcomes.

To do this, they introduced the idea of sparse autoencoders (SAEs), the main method we use today to build this neuron-to-feature mapping. I wrote about this idea when they published it, if you wish to understand it better, but understanding SAEs isn’t mandatory to capture the essence of what we are discussing today.
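As a rough intuition for what an SAE does at inference time (this toy skips training entirely and uses hand-picked weights and feature labels), it re-expresses a dense neuron-activation vector as a much sparser feature vector:

```python
# Toy sketch of a sparse autoencoder's forward pass: a dense vector of
# neuron activations is re-expressed as a sparser feature vector, where
# each surviving feature ideally maps to one interpretable concept.
# All weights and labels are hand-picked for illustration.
def relu(x: float) -> float:
    return max(0.0, x)

def encode(activations, enc_weights, biases):
    return [
        relu(sum(w * a for w, a in zip(row, activations)) + b)
        for row, b in zip(enc_weights, biases)
    ]

neuron_acts = [0.9, -0.3, 0.7, 0.1]   # dense, polysemantic neuron activations
enc_weights = [
    [1.0, 0.0, 1.0, 0.0],             # feature 0 (say, 'Texas')
    [-1.0, 1.0, -1.0, 0.0],           # feature 1 (some other concept)
    [0.0, 0.0, 0.0, 1.0],             # feature 2
]
biases = [-1.0, 0.0, -0.5]            # negative biases encourage sparsity

features = encode(neuron_acts, enc_weights, biases)
print([round(f, 2) for f in features])  # [0.6, 0.0, 0.0]: sparse and interpretable
```

A real SAE also has a decoder that reconstructs the original activations from the sparse features, and both halves are trained jointly with a sparsity penalty; the sketch only shows the encoding direction.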

A few months later, they presented new findings by using this method to map the mind of their Claude Sonnet LLM, leading to fascinating discoveries. For instance, they found that the model always talked about the Golden Gate Bridge whenever a certain group of neurons fired.

This relationship was so strong that when the researchers clamped this feature (they forced the neurons related to it to activate), the model essentially became the embodiment of the bridge, convincing itself it was the Golden Gate:

In summary, until today’s research:

  1. We had found a way to map the internal elements of models, called neurons, to specific topics, giving us a sense of ‘what the model knows.’ This turns what was essentially a complete black box into a map of features that explain the model’s knowledge.

  2. After building such a mapping, we also learned we can “steer” the model’s behavior by intervening on it, clamping or dialing down the neurons responsible for a feature, leading to predictable behaviors (as if we could clamp a human’s neurons to force them to behave a certain way).

And now, these same researchers—goated team, by the way—have pushed the frontier of model understanding to new heights.

Prepare to be mind blown.

The Anatomy of the AI Mind

In their latest research, the Anthropic team takes this a step further by introducing the idea of replacement models.

But what is that? We know that certain neuron combinations map to specific topics, but it’s a very complex thing to visualize.

Instead, we build a ‘replacement model,’ represented as a feature graph, which is far more interpretable. In layman’s terms: instead of mapping the inputs the model receives to the output it predicts and visualizing the entire neuron trace (the aforementioned attribution graph), we map the different neuron circuits to features (concepts) and then draw the same trace using the features, making it far more understandable.

It sounds like a word salad at the moment, but don’t worry; it will all make sense in a second.

The Capital Circuit

For instance, let’s say we have the input prompt “Texas capital?” to which the model should respond ‘Austin.’ Instead of taking billions of neurons and seeing how they combine, we can transform the world of neurons into the world of features (using the mapping method described earlier).

And when we do, a model’s outputs suddenly become understandable:

But what have they done?

Simple: we look at the attribution graph for the word ‘Austin’ (the outcome), which essentially means tracing back to see which combinations of neurons activated to make that prediction. Then, we use our neuron-circuit-to-feature mapping to identify which features (topics) the activated neurons represent.

And guess what, when we do that, we obtain a completely understandable and logical feature graph.

Let’s break it down:

  1. When the model sees the word ‘Texas’, neurons related to the topic ‘Texas’ activate. The same happens with the word ‘capital’.

  2. Once the neurons related to capitals activate, they promote the set of neurons responsible for forcing the model to output a capital, causing them to activate too.

  3. Finally, the neurons that promote topics related to ‘Texas’ and those that push the model to output a capital (depicted as ‘say a capital’ above) combine to activate the ‘Austin’ feature neurons, leading the model to predict “Austin”, which makes complete sense given it’s Texas’s capital.
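The three steps above can be sketched as a tiny feature graph, where features promote one another until the output fires. The topology and the “needs both parents” rule are illustrative, not Anthropic’s actual attribution graph:

```python
# Toy feature graph for the 'Texas capital' example: active features
# promote downstream features; 'Austin' only fires when both of its
# parents ('Texas' and 'say a capital') are active.
edges = {
    "Texas": ["Austin"],
    "capital": ["say a capital"],
    "say a capital": ["Austin"],
}
required = {"Austin": {"Texas", "say a capital"}}  # needs both parents active

def run_circuit(active_inputs):
    active = set(active_inputs)
    changed = True
    while changed:  # propagate activations until nothing new fires
        changed = False
        for src in list(active):
            for dst in edges.get(src, []):
                if dst not in active and required.get(dst, {src}) <= active:
                    active.add(dst)
                    changed = True
    return active

print("Austin" in run_circuit({"Texas", "capital"}))  # True
print("Austin" in run_circuit({"capital"}))           # False: no 'Texas' signal
```

Notice that ‘capital’ alone still activates ‘say a capital’; only the combination with ‘Texas’ produces ‘Austin’, mirroring steps 1 to 3.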

But what does this mean? Simple: LLM outputs aren’t magic, and crucially, there’s a mechanistic explanation for their behavior! In other words, they are interpretable.

Put differently, neurons related to the different concepts in the input combine to let the correct response emerge.

Is this emergent reasoning?

At this point, you may be inclined to take a cynical stance and argue that this is nothing extraordinary; these circuits are still memorized patterns and not actual reasoning. And I would agree.

Nonetheless, this alone doesn’t prove that the model has actually internalized that Austin is Texas’s capital (capturing knowledge) rather than memorized the exact sequence “Texas capital? Austin,” which would be memorization and thereby not real reasoning.

DeepMind has just published research on how to differentiate memorization from acquired knowledge in LLMs, suggesting that LLMs do, in fact, acquire knowledge instead of memorizing everything.

But here’s the thing about these circuits that solves that question: they generalize.

Truly modular and adaptable circuitry

For example, if we clamp down the ‘Texas’ feature (we force the set of neurons that activate for all things Texas to be zero), the model will still predict a capital… just not the Texas capital.

And we can take it a step further: we can actually control which capital the model chooses by clamping the neurons representing other states, regions, or countries, all while using the same circuit:

This means the circuit is general: the model uses one generalizable circuit to answer questions about state or country capitals and simply adapts the part that needs adapting based on the input.

This strongly suggests it isn’t rote memorization: the model actually understands what you’re asking (or at least the relationship between regions and their capitals) rather than simply recalling the literal sequence.
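A minimal sketch of what such an intervention looks like: the generic ‘capital’ circuit stays fixed, and we intervene only on the region feature. All names and values are made up for illustration:

```python
# Toy clamping experiment: the 'say a capital' circuit is reused as-is;
# only the region feature's activation is intervened on.
capital_of = {"Texas": "Austin", "California": "Sacramento"}
DEFAULT = "Sacramento"  # the circuit still says *a* capital with no region signal

def say_capital(region_activation: dict) -> str:
    # region_activation maps region features to activation strength;
    # clamping a feature to 0 removes it from the computation.
    active = {r: a for r, a in region_activation.items() if a > 0}
    if not active:
        return DEFAULT
    return capital_of[max(active, key=active.get)]

print(say_capital({"Texas": 1.0}))                     # Austin
print(say_capital({"Texas": 0.0}))                     # clamped: still a capital, just not Texas's
print(say_capital({"Texas": 0.0, "California": 1.0}))  # swapped region: Sacramento
```

One circuit, many answers: which capital comes out depends only on which region feature is left active.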

But Anthropic went even further and proved that the model can perform more complex circuits. To the prompt “The capital of the state containing Dallas is…“, the model engages in a multi-hop activation process that leads to Austin.

  1. First, the model acknowledges ‘capital’ and ‘state’, promoting the set of neurons that activate capital predictions.

  2. In parallel, the model promotes ‘Texas’ after seeing ‘Dallas’.

  3. Combining the urge to ‘say a capital’ with ‘Texas’ makes the model promote the neurons in charge of predicting ‘Austin’.

This feels eerily similar to how humans would have answered this question. Amazing, right?

But wait, there’s more: the model can also plan ahead.

The Autoregressive Planner

On Monday, we covered the idea of autoregressive models, models that predict the next word based on the previous ones. Given this behavior, it seems counterintuitive to imagine them planning beyond the immediate next word.

However, this skill is essential in areas like poetry, as the last word of the second verse usually rhymes with the last word of the first. Consequently, a human would build the second verse to make semantic sense while ensuring the last word rhymes, which may mean deciding on the rhyming word (the last one) first and then assembling the rest of the verse, even though those words come earlier.

And fascinatingly, researchers witnessed how the model did just that:

As you can see below, as the model processes the ‘new line’ token to jump into the second verse, it’s already considering which words would make semantic sense at the end of the second verse while rhyming, activating neurons responsible for words like ‘rabbit’ or ‘habit’.

Put another way, the model is preparing to predict ‘rabbit’ or ‘habit’ ahead of time, even though that word won’t be emitted until several more predictions have been made.
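A toy version of this plan-then-fill behavior; the rhyme table and verse templates are invented, and a real model does this implicitly in its activations rather than with explicit lookups:

```python
# Toy 'planning ahead': choose the rhyming final word of the next verse
# FIRST, then assemble the verse around it, even though words are
# emitted left to right.
rhymes = {"it": ["rabbit", "habit"]}
templates = {
    "rabbit": "he saw a carrot and grabbed the rabbit",
    "habit": "eating carrots became a daily habit",
}

def next_verse(prev_verse: str) -> str:
    ending = prev_verse[-2:]        # rhyme sound of the previous verse
    target = rhymes[ending][0]      # plan: pick the last word first
    return templates[target]        # then fill in the words before it

print(next_verse("he needed money so he went to grab it"))
```

The verse is produced word by word, but its shape was fixed by a decision about the final word, which is exactly the pattern the researchers observed.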

Impressive, right? And I could go on with several other outstanding examples, such as:

  1. Multilingual circuits. The model understands the user request in a language-agnostic form, using the same circuitry to answer while adapting it interchangeably to the language of the input.

  2. Addition. The model memorizes simple additions, but performs elaborate addition circuits to produce accurate results for more complex ones.

  3. Complex medical diagnoses. The model analyzes reported symptoms, uses them to promote follow-up questions, and elaborates a correct diagnosis.

All this is explained (and visualized) in the paper.

Closing Thoughts

Here’s the main takeaway from all this:

While there’s enough evidence to suggest that models still rely considerably on memorization (ByteDance just published yet another piece of research yesterday showing how you can collapse model performance with very minor changes to the prompt), this is the first time we have seen mechanistic evidence that models create generalizable reasoning circuits internally, at least in a primitive form.

In other words, this research shows that these models engage in behavior that goes well beyond memorization, possibly even a primitive form of reasoning.

The reasons are four-fold:

  1. The circuits shown are general and are used by the model to answer similar yet distinct questions. If all models did was memorize, they would develop a unique circuit for every single prompt, which has been shown not to be the case. Instead, the model is developing ways to answer questions by performing reasoning abstractions that allow it to flexibly adapt to different prompt variations (different facts, numbers, languages, etc.).

  2. The circuits are modular, meaning the model can combine different, simpler circuits to develop more complex ones to tackle more challenging questions.

  3. Circuits can be intervened and adapted, making models more predictable and steerable. I firmly believe this is the future of model alignment: blocking certain features to block certain behaviors. This is fundamental to AI adoption, especially for enterprise use cases that require predictability.

  4. Models plan ahead. Despite their autoregressive, backward-looking nature, the model can plan ahead for what word it wants to predict in a future prediction and alter the current predictions to lead to that one. Planning is a crucial element of reasoning, and this is evidence that LLMs can plan.

The final question: this behavior is primitive at best, despite the models having ingested trillions of data points, which raises concerns about how viable it is to improve these capabilities further. So, will models actually develop human-level reasoning capabilities?

Personally, I feel that we need algorithmic breakthroughs that improve data efficiency (models learn more with less). Otherwise, we run a considerable risk of these models plateauing.

But here’s the bottom line: LLMs still rely heavily on memorization, but at least we finally know it’s not the only thing they do. I don’t know about you, but this research has made me more optimistic about our current direction.

Exciting times ahead!

THEWHITEBOX
Join Premium Today!

If you like this content, join Premium to receive four times as much content weekly without saturating your inbox. You will even be able to ask the questions you need answers to.

Until next time!

You’ve heard the hype. It’s time for results.

After two years of siloed experiments, proofs of concept that fail to scale, and disappointing ROI, most enterprises are stuck. AI isn't transforming their organizations — it’s adding complexity, friction, and frustration.

But Writer customers are seeing positive impact across their companies. Our end-to-end approach is delivering adoption and ROI at scale. Now, we’re applying that same platform and technology to build agentic AI that actually works for every enterprise.

This isn’t just another hype train that overpromises and underdelivers.
It’s the AI you’ve been waiting for — and it’s going to change the way enterprises operate. Be among the first to see end-to-end agentic AI in action. Join us for a live product release on April 10 at 2pm ET (11am PT).

Can't make it live? No worries — register anyway and we'll send you the recording!

For business inquiries, reach out to me at [email protected]