THEWHITEBOX
We finally uncovered what AIs REALLY think.

Have you ever wondered why this newsletter is called TheWhiteBox? For the sake of my ego, I’m going to assume you ask yourself that question first thing in the morning, every day.

Jokes aside, the idea was that AI is a “black box” (we don’t know how they work, like, at all), and my newsletter would try to demystify it. Rather pretentious, but what marketing giveth, marketing taketh.

But let me tell you something: Did you know that a model might be thinking bad things about you while sounding completely normal?

Well, it’s true. Until now, we only suspected it, but we can now see it thanks to Anthropic’s new research on Natural Language Autoencoders (NLAs), which begins to answer the question of what lies beneath ChatGPT’s overly formal and professional demeanor.

And the answer is a mixture of candidness, honesty, lies, anger, scheming, and many other behaviors and, for lack of a better term, “beliefs” and “sentiments” that are only now surfacing, and that you should definitely be aware of.

The opportunity? An entirely new AI industry worth hundreds of billions in AI assurance.

Let’s dive in.

From Neurons to Thoughts

To fully comprehend the huge implications of today’s research, we need to shift from seeing AIs for what they look like to what they actually are. That way, we can understand the fundamentals of interpretability, the field that aims to uncover AI’s biggest mysteries.

The basic structures

On paper, AIs are just a bunch of elements called neurons that interact and combine to produce the output.

We call them ‘neurons’ because, much like neurons in the human brain, they are tightly connected to each other and exhibit the same “fire or mute” behavior.

It’s a “little bit” more complicated, but at the heart of every AI model you’ve interacted with lately, sits something like this:

And how does it work?

Say we have an animal classification model that takes in several images of a specific animal and outputs its name.

We feed three inputs to the model, which processes them through its hidden layers, identifies common patterns, and determines the likely animal; given three images depicting horses, it generates the text response “horse.”

Source: Author
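If you want to see how unglamorous this really is, below is a minimal sketch in Python of the diagram above: one hidden layer of “neurons” classifying a toy image. Every weight, label, and pixel here is invented for illustration; a real model learns billions of weights from data.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # The "fire or mute" behavior: negative signals are silenced to zero.
    return np.maximum(0, x)

def softmax(x):
    # Turn raw output scores into probabilities that sum to one.
    e = np.exp(x - x.max())
    return e / e.sum()

labels = ["horse", "pelican", "cat"]

image = rng.random(64)                # a toy "image": 64 pixel intensities

W_hidden = rng.normal(size=(16, 64))  # 16 hidden neurons (untrained here)
W_output = rng.normal(size=(3, 16))   # one output neuron per label

hidden = relu(W_hidden @ image)       # the hidden activations
probs = softmax(W_output @ hidden)

print(hidden[:5])                     # raw activations: opaque numbers
print(labels[int(probs.argmax())])    # the model's "answer"
```

Notice that printing the hidden activations just gives you raw numbers. Hold that thought, because that is exactly the black-box problem we are about to run into.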

This is something ChatGPT would do, though the actual architecture under the hood is way more complicated than this. But at heart, it’s just a “bunch of neurons.”

In reality, what is going on under the hood looks something like the image below.

Source: Author

I know, this sounds awfully mathematical and complex, but don’t worry, I’m not going to bore you with maths. Instead, what we’re going to do is reveal what these activation circuits, like the one depicted in red, actually mean.

That is, the issue with these neurons is that, taken at face value, they are gibberish: a bunch of numbers that somehow, magically, understood that there were horses in the images. They reveal very little about their nature, making AIs look like black boxes.

However, neurons can be surprisingly revealing when viewed through the right lens. They aren’t just a bunch of numbers; they actually “encode” meaning.

But what do I mean by that?

The idea of representations

Machines only understand numbers, so every concept we present to them must be ‘represented’ as numbers. Therefore, an ‘AI representation’ is just a mathematical representation of a real concept.

I’m not going to get into the weeds of this, which would warrant an entire piece of its own (one I actually wrote recently). For today, it’s more than enough to simply internalize two things:

  1. Every concept can be represented in numbers.

  2. AIs take concepts and transform them into other concepts.

A way to understand representations is to view them as a list of attributes: a pelican will have a ‘1’ for ‘beak or no beak’, a ‘1’ for ‘feathered’, and a ‘0’ for ‘mammal’, among many others:

Source: Author

Frontier models today have on the order of tens of thousands of attributes per representation, not just six.
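To make this concrete, here is the pelican example as toy code. The attribute names are hand-picked for illustration; in a real model, the dimensions are learned and rarely map one-to-one onto clean human concepts like these.

```python
import numpy as np

# Hypothetical attributes; real model dimensions are learned, not labeled.
attributes = ["beak", "feathered", "mammal", "flies", "aquatic", "pink"]

# The "pelican" concept as a list of attribute values.
pelican = np.array([1, 1, 0, 1, 1, 0], dtype=float)

for name, value in zip(attributes, pelican):
    print(f"{name}: {value:.0f}")
```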

But the crucial thing to understand about AI models is point two, the transformation part.

For instance, if we put the word pelican in a sequence such as ‘The pink pelican,’ we can ‘transform’ the representation of pelican into a pink one by turning on the ‘pink’ attribute.

Source: Author

If you’ve understood this, you can actually say you understand neural networks, or at least Transformers, the architecture behind the overwhelming majority of modern AIs (including all the ones you know by name), because this is “all” they do.

Greatly simplified, they take in a bunch of concepts represented as numbers and apply transformations to them, creating new representations that lead to the desired prediction. “Draw me a pink pelican” is something ChatGPT can do because it can transform a pelican, a word it knows, into a pink one by applying such transformations internally.
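In the same toy setup, the pink-pelican transformation looks like the sketch below. In a real Transformer, the transformation is a learned matrix multiplication plus attention over the context (“The pink pelican”), not a hand-written addition, but the spirit is the same.

```python
import numpy as np

attributes = ["beak", "feathered", "mammal", "flies", "aquatic", "pink"]
pelican = np.array([1, 1, 0, 1, 1, 0], dtype=float)

# Reading "pink" in context nudges the representation along a "pink"
# direction; here we fake that learned transformation with an addition.
pink_direction = np.array([0, 0, 0, 0, 0, 1], dtype=float)

pink_pelican = pelican + pink_direction  # "turn on" the pink attribute

print(dict(zip(attributes, pink_pelican)))  # 'pink' is now 1.0
```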

But again, we run into the same problem: interpretability. We infer that this is what is going on, but we have no way to actually “see” it; the model doesn’t tell us “I’m thinking about a pink pelican,” nor which specific number in a representation makes it a mammal.

Thus, how do we decode those numbers?

Decoding the encoding

Probably one of the biggest contributions Anthropic has ever made to this industry, given they hardly publish anything, is the discovery of monosemantic features and attribution graphs, which sound scarier than they are.

In layman’s terms, AI models can represent and create concepts like the ones we’ve been describing by combining neurons into identifiable circuits.

Once we can map certain neuron activation circuits to certain concepts (because they fire every time that concept appears in the output), we can trace these circuits and find how they combine internally to create other concepts, as seen below, where the neurons specialized in sports, basketball, and Michael Jordan combine to help the model predict ‘basketball’.

This was an incredibly powerful discovery for a simple reason: if we can map neuron circuits to concepts, we can steer the model.

This led to famous examples like Golden Gate Claude; once Anthropic found that one of its models had the Golden Gate Bridge concept mapped to certain neurons, they clamped the values of those neurons, and the model essentially became “the embodiment” of the actual bridge. Basically, it couldn’t stop talking about it.

Or when they clamped the ‘sycophantic praise’ feature, the model became excessively sycophantic:

Source: Anthropic
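Conceptually, clamping looks something like the sketch below. To be clear, this is not Anthropic’s actual code: the feature index and values are invented, and real steering happens on learned feature directions inside a full Transformer, not a single toy layer.

```python
import numpy as np

def forward_layer(activations, W, clamp=None):
    """One layer's forward pass, optionally pinning a feature's value."""
    out = np.maximum(0, W @ activations)  # the usual "fire or mute" step
    if clamp is not None:
        index, value = clamp
        out[index] = value                # force the feature to stay "on"
    return out

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
x = rng.random(8)

GOLDEN_GATE_FEATURE = 3  # hypothetical index of the mapped concept

normal = forward_layer(x, W)
steered = forward_layer(x, W, clamp=(GOLDEN_GATE_FEATURE, 10.0))

print(normal[GOLDEN_GATE_FEATURE], steered[GOLDEN_GATE_FEATURE])
```

With the feature pinned to a high value, every downstream computation sees “Golden Gate Bridge” screaming at full volume, which is why the model couldn’t stop talking about it.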

This was great progress, but it didn’t allow us to actually see what the model was thinking. Put another way, while we can map certain parts of the human brain to certain behaviors or movements, we can’t read human minds.

And how do we visualize thoughts? Enter Anthropic’s new proposal, the NLAs.

Verbalizing the inner monologues

Natural Language Autoencoders represent a totally new approach to AI interpretability that aims to verbalize (turn into words) the model’s internal activations.

In other words, verbalize what the model actually thinks, not necessarily what it claims to think. An “AI thoughts translator”, basically.

Simply put, NLAs analyze how models behave internally and describe those behaviors in text. The hope? That we can “read into the minds” of AI models, know what they are thinking, and hopefully steer them to our liking.
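Since we unpack the actual architecture later in the piece, here is only a conceptual sketch of the idea: capture a model’s hidden activations mid-forward-pass and decode them into words. Every name, shape, and vocabulary entry below is an assumption for illustration, not Anthropic’s design, and a real NLA would generate fluent sentences rather than keywords.

```python
import numpy as np

# Hypothetical "thoughts" vocabulary; an actual NLA is a full text decoder.
VOCAB = ["evaluated", "honest", "scheming", "confident", "suspicious", "bored"]

rng = np.random.default_rng(1)
hidden_state = rng.random(32)                 # activations captured mid-pass
W_decode = rng.normal(size=(len(VOCAB), 32))  # in reality, a trained decoder

scores = W_decode @ hidden_state              # one score per candidate "thought"
top3 = np.argsort(scores)[::-1][:3]

print("internal state reads as:", [VOCAB[i] for i in top3])
```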

This has led to some of the most surprising, exciting, and, in some cases, disturbing discoveries in AI in a long time, like the example below, where, even though the model wasn’t explicitly saying so, it was internally suspecting it was being evaluated:

But how on Earth did Anthropic build this magical ‘AI mind reader’?

Behind the paywall, we go down the rabbit hole: explaining the internal functioning of AI models from first principles, reward hacking and agentic misbehaviors like scheming or blackmailing (and, importantly, why they actually occur), the difference between a model’s verbalized thoughts and its internal thoughts and belief system, the architecture and principles underpinning this new model class called NLAs, what NLAs unlock, and the implications beyond research.
