Historical Deals & Mental Imagery Models

THEWHITEBOX
TLDR;

Welcome back! Today, we look at Tesla’s recent AI success, OpenAI and Oracle’s record-setting deal, and new features coming from two of the hottest AI startups.

Finally, our trend of the week highlights a beautiful piece of research from MIT that proposes equipping models with the ability to construct mental images, thereby enhancing their reasoning capabilities.

Enjoy!

AUTONOMOUS DRIVING
Tesla Car Drives Autonomously Through Madrid

A really impressive video that showcases the undeniable future of cars. On the downside, the feature is still pending regulatory approval, yet another example of how Europe fails to maintain the pace set by the US and China due to excessive bureaucracy.

They claim to be doing so in the interests of Europeans, but the truth is, they are just under pressure to justify the absurd number of European bureaucrats whose jobs depend on those regulations existing in the first place.

Combined with the first autonomous delivery Tesla performed last week, it’s clear that years of effort are finally yielding success.

But will that be enough to compete with Waymo?

TheWhiteBox’s takeaway:

You’re probably asking: what makes Tesla’s cars different from Waymo's? Well, it’s two things:

  1. The technological approach: Teslas rely on an AI vision model fed by cameras and other onboard sensors. Waymo relies on LiDAR (Light Detection and Ranging), a remote sensing technology that uses laser pulses to measure distances and create precise, high-resolution 3D maps of objects and environments. Both rely on AI models to drive the car; the input is what distinguishes them.

  2. Depth vs breadth: Waymo focuses on depth, meaning its technology is designed ad hoc for each place where it’s deployed. Tesla, on the other hand, aims for breadth, just like humans; the same software that drives a car through Phoenix drives it through Madrid (although I assume some fine-tuning is done to handle different driving rules).

Because Tesla’s software is generic, it doesn’t handle edge cases as well as Waymo, whose higher location-specificity helps (and cameras can also be blinded, unlike LiDAR).

If you want to compare both systems, you have to compare Waymo with the Tesla RoboTaxi, which is not what is shown in the video above and which I believe relies more on ad hoc software than Tesla’s general consumer software.

Bottom line, both approaches are quite different, which is why the comparisons between the two systems aren’t entirely valid.

CODING
TogetherAI Open-Sources SOTA Coder, DeepSWE

TogetherAI has open-sourced a remarkable new model, called DeepSWE, which excels at coding tasks and ranks among the top open-source models.

This model is a fine-tuned version of Alibaba’s Qwen 32B model, more than doubling the original model’s coding performance on the popular SWE-Bench benchmark.

As with most progress in AI these days, the secret is once again Reinforcement Learning: training models to achieve goals in constrained settings by rewarding good behaviors and penalizing bad ones.
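For the curious, below is a deliberately toy sketch of that core loop (sample a behavior, score it, nudge the policy toward rewarded behaviors). It is not DeepSWE’s actual training recipe; it only illustrates the ‘reward good, penalize bad’ idea in the smallest possible setting:

```python
# Toy REINFORCE-style loop: rewards shift probability toward good behaviors.
# Purely illustrative; not DeepSWE's training pipeline.
import numpy as np

rng = np.random.default_rng(0)
logits = np.zeros(2)        # toy "policy": preference over two possible actions
learning_rate = 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for step in range(200):
    probs = softmax(logits)
    action = rng.choice(2, p=probs)           # sample a behavior
    reward = 1.0 if action == 0 else -1.0     # stand-in for "tests pass / tests fail"

    # Gradient of the log-probability of the sampled action w.r.t. the logits:
    grad = -probs
    grad[action] += 1.0
    logits += learning_rate * reward * grad   # push up rewarded behavior, push down penalized one

print(softmax(logits))  # probability mass ends up concentrated on the rewarded action
```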

TheWhiteBox’s takeaway:

As we’ll discuss in the Cursor news below, there’s a strong chance that, sooner rather than later, humans won’t be required to code at all.

AI already handles most of the coding at Big Tech and software companies (as suggested by comments from figures like Mark Zuckerberg at Meta and Marc Benioff at Salesforce), and it continues to improve consistently.

Nonetheless, we have yet to see the real power of techniques like RL. Just yesterday, for example, a small team of researchers showed how you can take a small, low-to-mid-tier AI model of just 1.7 billion parameters (dozens or even hundreds of times smaller than frontier models) and, using RL alone, make it state-of-the-art on a particular task, well beyond what frontier models achieve.

But how is that even possible? It’s once again the breadth vs. depth conundrum we just explained, this time with Tesla vs. Waymo.

In this analogy, Waymo is the small specialized model and Tesla is the frontier model: the latter is optimized for generality (being good enough at many tasks), while the former is optimized for excellence at one particular task.

Breadth models are more monetizable, but as I always say, sophisticated open-source (the kind that includes fine-tuning for your task) still blows proprietary frontier models out of the water every day of the week.

The question is: can we find a way to bridge both worlds? Creating a general model without sacrificing per-task excellence? Well, that’s precisely what AI Labs are pursuing with their coding agents, creating the first domain-specific superhuman AI tool for coding.

Is it just around the corner? Maybe.

VIDEOGAMES
Playable AI Videogames… soon?

Although we have been discussing this precise idea for months in this newsletter, it seems that AI video models becoming tools for creating on-the-fly playable video games is coming faster than we thought, judging by hints from Google DeepMind’s CEO and the Gemini PM lead.

TheWhiteBox’s takeaway:

Imagine a future where you can create your own video games from your latest ideas, play them instantly, and even share them with friends.

The future of AI videogames seems nigh, and although I’m not a gamer myself, I have to acknowledge it’s a killer use case.

As for Google, what’s left for them to do before their stock explodes in value?

I don’t see any company on the planet coming close to the ecosystem of products and use cases this company is assembling. In the meantime, I religiously buy their stock every month (not financial advice, just stating my bias).

CODING IDEs
Cursor Goes Shopping at Anthropic

In a somewhat surprising (and probably costly) move, Cursor has managed to poach the two leads behind Anthropic’s Claude Code, who will now join the startup, quite possibly to replicate the same product.

TheWhiteBox’s takeaway:

Although I use both tools, Cursor seems to be a relic of the past, given the current direction of coding becoming something mostly done by AIs.

For those of you who don’t know what these tools are,

  • Cursor allows you to write code with the aid of an AI, a more hands-on experience for those who want to stay closer to the code as it’s written,

  • Claude Code, by contrast, is a more hands-free tool, where coding falls almost entirely on the shoulders of the AI while humans guide the process.

This move strongly suggests that Cursor, which has just surpassed half a billion dollars in Annual Recurring Revenue (ARR), realizes it needs to evolve toward this more hands-free paradigm, which might indicate that the days of human-written code could be over faster than we imagined.

DATACENTERS
OpenAI Behind Oracle’s $30 Billion-a-year Deal

In one of the most impressive deals in quite some time, OpenAI has agreed to rent up to 4.5 gigawatts of power from Oracle for its AI business, resulting in an influx of roughly $30 billion per year in revenue for the latter.

That unprecedented figure can’t be overstated. 4.5 gigawatts is equivalent to the output of roughly 4.5 average nuclear reactors, enough to power millions of homes.

For reference, the average US home consumes ~10,500 kWh of electricity per year. Thus, the power in this deal could provide electricity for ~3.75 million US homes (that number would be much higher in Europe, as Europeans consume much smaller amounts of energy on average).
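If you want to sanity-check those numbers, here’s the back-of-the-envelope math (assuming, as a simplification, that the full 4.5 GW is drawn continuously all year):

```python
# Rough check: how many average US homes could 4.5 GW of continuous power supply?
power_gw = 4.5
hours_per_year = 24 * 365                       # 8,760 hours
energy_kwh = power_gw * 1e6 * hours_per_year    # GW -> kW, times hours -> kWh per year
kwh_per_us_home = 10_500                        # average US household consumption per year

print(f"{energy_kwh / 1e9:.1f} billion kWh per year")                 # ~39.4 billion kWh
print(f"~{energy_kwh / kwh_per_us_home / 1e6:.2f} million US homes")  # ~3.75 million homes
```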

TheWhiteBox’s takeaway:

AI is going to eat our grids alive.

For that reason, besides expecting an explosion of new energy supply, I think that consumer-end hardware (our smartphones and laptops) will be crucial in meeting AI demand; there’s no way to meet everyone’s demand unless on-device AI models meet a decent portion of it.

In my case, I won’t take my chances fighting with millions of others for the same cloud services, so I expect that, by the end of the year, most of my AI demands will be met by local models, especially after I buy the MacBook Pro M5 expected this fall.

VIDEO EDITING
HeyGen Launches AI Campaign Agent

HeyGen has officially announced a product that looks exceptionally good. They describe it as a ‘creative operating system,’ which is a very marketing-friendly way of describing an AI campaign-creation agent.

The video showcases the tool and what it can do for you. Send it the assets (videos of your product, for instance) and the tool builds everything: the storytelling, the video itself (with automatic edits), and so on.

TheWhiteBox’s takeaway:

Going by the vibes, AI has just made the lives of several marketing agencies a lot harder. Yes, this will make their work much more productive, but it’s also a tool that will allow prospective customers to build campaigns themselves.

It’s another great example of the gap that AI fills. AI isn’t taking all jobs; it’s taking mediocre ones. Don’t think this tool can compete with good marketing agencies (no way), and don’t expect to create campaigns as well-designed as those crafted by the top people in the industry.

As I always say, AI tools are mostly average in most domains, so AI isn’t taking the job of sophisticated agencies, but will for sure kill mediocre ones.

TEXT-TO-DOCUMENT
Genspark Adds Support for AI Docs

Another tool that looks better by the day is Genspark, which now has support for AI Docs, allowing you to create well-crafted Microsoft Word documents with ease.

In addition to supporting other tools like PowerPoint, Excel, or Gmail, Genspark, one of the fastest-growing startups in the industry, is gradually becoming a one-stop shop for all things text-to-document.

TheWhiteBox’s takeaway:

I remain skeptical about the future of application-layer apps (I don’t see what would stop the AI Labs from replicating this exact tool in weeks if they wanted to), but that doesn’t mean these tools aren’t already very helpful.

The fact that they also offer direct templates to choose from is a brilliant business move, which tells me this team knows what they are doing.

TREND OF THE WEEK
Mental Imagery in AI Models

Researchers at MIT and Amherst have published a ‘different’ kind of paper on the first AI model that ‘thinks visually’: just as we can imagine scenes in our heads, this AI model does too, with promising results that suggest we might be on the brink of ‘spatially aware’ AI.

Let’s dive in.

Counterintuitive Machine Behavior

If we take what I believe is the most powerful model on the planet, o3 (’most powerful’ does not imply best), the model is capable of performing—or, dare I say, imitating—human reasoning across both text and images, what OpenAI dubs ‘Thinking with images’, a very powerful feature with a very counterintuitive mechanism.

Write and Draw to Think

In layman’s terms, the model can create sequences of interleaved text and images to discuss the text and images the user sends it.

However, one of the most counterintuitive features of modern Generative AI models is that they need to speak and draw in order to think. The model needs to generate both text and images, not only to provide an answer, but to ‘think’ about the answer, as if humans had to speak out loud to think about language and draw to think about scenes.

And while there has been considerable exploration of how to prevent models from having to speak in order to think (so-called ‘latent reasoning models’), no efforts have been made to allow machines to imagine images instead of drawing them.

Until now. But why would we want that?

The Power of Mental Images

We have known for several decades that humans create depictive, quasi-pictorial images in our minds.

Sparing you the technical jargon, Stephen Kosslyn showed back in 1994 that humans build ‘mental images’ in our minds, building on two decades of studies largely influenced by Roger Shepard and Jacqueline Metzler, who demonstrated in 1971 that humans rotate our mental images and that the time those rotations require is proportional to the angle of rotation.

Examples shared with participants during the famous study. Source

Put simply, the mental effort required during the imagined rotation was proportional to the angle of rotation, providing evidence that these representations possess spatial properties: the participant was actively reconstructing the image from a new angle.

If these weren’t actual images with spatial structure, rotation would not only be immediate, it wouldn’t even be necessary.

So, if humans build mental images, strongly suggesting these give us spatial reasoning capabilities, why shouldn’t AIs do the same?

Mirage, AI Meets Mental Imagery

Talking about AIs and mental imagery immediately raises the question: Do AIs have minds? Well, it’s complicated.

Latent Space

In modern AI, everything revolves around the idea of ‘representations.’ These are numerical vectors (machines only work with numbers) that represent the AI's understanding of real-life concepts.

These vectors are governed by the principle of similarity: an AI constructs an understanding of the world based on the relative similarity of each concept to the rest.

In layman’s terms, a ‘cat’ is understood not by its Platonic meaning, but by how it differs from other concepts in the model’s index of known concepts.

Put simply, the model figures out what ‘cat’ is based on how similar it is to concepts like ‘dog’ or ‘tiger’, and, equally importantly, how much more dissimilar it is to concepts like ‘tungsten cube.’

This way, the model builds a ‘representation space’, more formally known as ‘latent space.’ As this space is vectorial, the distance between concepts is computable. Thus, the meaning of ‘cat’ is measured by its relative distance to all other concepts in this latent space that define what a model knows.
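To make this concrete, here’s a minimal sketch of that similarity computation using made-up 3-dimensional vectors (real models learn embeddings with thousands of dimensions; the numbers below are purely illustrative):

```python
# Toy 'latent space': concepts as vectors, meaning as relative similarity.
import numpy as np

embeddings = {
    "cat":           np.array([0.90, 0.80, 0.10]),
    "dog":           np.array([0.85, 0.75, 0.15]),
    "tiger":         np.array([0.95, 0.60, 0.20]),
    "tungsten cube": np.array([0.05, 0.10, 0.95]),
}

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

for word in ("dog", "tiger", "tungsten cube"):
    score = cosine_similarity(embeddings["cat"], embeddings[word])
    print(f"cat vs {word}: {score:.2f}")
# 'cat' lands close to 'dog' and 'tiger' and far from 'tungsten cube',
# which is exactly the structure the model uses to infer what a 'cat' is.
```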

As concepts have different attributes, we can locate them in this ‘AI mind’ based on these attributes. Source: Google

As ‘cat’ is very close to other concepts like ‘dog’ or ‘tiger’, it is considered by the model to be an ‘animal’. More specifically, because ‘cat’ is also very close to other animals that are ‘mammals’, the model infers that cats are mammalian animals.

As this space is vectorial, each direction in this space represents an attribute, which allows the model to distribute concepts according to the image above based on similarity.

When the model encounters a new input, whether text, images, or both, it uses this latent space to determine what the input is saying and what should come next.

However, these models, despite being allegedly ‘multimodal’, still rely heavily on ‘text thinking’ to respond. And that’s an issue.

From ‘Only text’ to True Multimodality

If we examine the image below, how would you approach these problems? If we look at the first one on the left, do you imagine the path, or are you defining the algorithm using text symbols (up → right → up…)?

Of course, in all three instances, you are imagining the solution because all three require visual cues to solve them.

However, our modern AIs would attempt to solve this problem via text, disregarding all the vital spatial cues the problems provide, and instead solve the tasks as if they were written math problems.

So, how can we equip AIs with this ability? Here is where today’s research comes in.

The idea is to train the model to recognize when it needs to ‘think visually’, generating a special token that switches it into spatial thinking, reasoning over the latents (instead of mapping them back into text), and then returning to written text once the visual-thinking phase is finished.

This sounds pretty esoteric, but let’s drive this home with an example:

  1. For any of the example problems above, the model starts generating text such as ‘I’m seeing a game scene similar to Pokémon… what I should do is…’ and then emits the special token <mental imagery>.

  2. This special token relieves the model from having to speak further and allows it to think introspectively in the ‘latent space’ we described earlier.

  3. After it finishes thinking visually, it responds with the solution via text again (a toy sketch of this loop follows the list).
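Here is that sketch. Everything in it (the scripted token stream, the latent update, even the token names) is a made-up stand-in, not the paper’s implementation; it only illustrates how generation alternates between visible text and latent-only ‘visual thinking’:

```python
# Toy control-flow sketch: alternate between decoding text and thinking in latents.
import numpy as np

MENTAL_IMAGERY = "<mental imagery>"   # hypothetical token: switch into latent visual thinking
END_IMAGERY = "<end imagery>"         # hypothetical token: switch back to text

# A scripted "model output" standing in for learned generation.
scripted_stream = [
    "I'm seeing a game scene similar to Pokémon. ",
    MENTAL_IMAGERY, "latent", "latent", "latent", END_IMAGERY,
    "I should move up, then right, then up.",
]

def run(stream):
    latent = np.zeros(8)              # toy hidden state living in the latent space
    visible, imagining = [], False
    for step in stream:
        if step == MENTAL_IMAGERY:
            imagining = True          # stop decoding words
        elif step == END_IMAGERY:
            imagining = False         # resume decoding words
        elif imagining:
            # Visual thinking: update the latent without projecting it into
            # words, so its spatial features are never thrown away.
            latent = np.tanh(latent + 0.1)
        else:
            visible.append(step)      # ordinary text the user actually sees
    return "".join(visible)

print(run(scripted_stream))
# -> "I'm seeing a game scene similar to Pokémon. I should move up, then right, then up."
```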

The key insight is that the latent space we described earlier is multimodal (the model has seen plenty of images during training, too). These visual latents carry spatial attributes that the model would be forced to discard if we made it reason over the problem in text.

One way to think about this is to imagine that, instead of being allowed to picture or draw a cat in your mind, your brain could only think about cats using words. You could generate a very good textual description, but it would still fall short of what an image of a cat conveys.

By allowing the model to continue making predictions in its internal visual world, we enable it to visualize the solution before providing an answer.

This results in important improvements on spatial reasoning and planning benchmarks across the board, showing that the technique not only makes logical sense; it also works.

THEWHITEBOX
Closing Newsletter Thoughts

Today, we have introduced a new approach to understanding how AI models should think, using mental imagery, grounded in the principles that govern human cognition.

We have also covered the weekly dose of relevant AI news, which can be summarized as follows:

  1. The future of human-written code appears as brittle as ever, and the moves from the very startup that started the AI-enhanced coding era suggest that the days of human code are reaching an end.

  2. AI’s hunger for energy is unprecedented, with OpenAI and Oracle announcing one of the deals of the century in terms of the amount of power that will be used by AI.

  3. We have seen several examples (Tesla/Waymo, Osmosis vs. frontier models) showing that the breadth-vs-depth issue in AI is very much alive (and there is no reason to believe we are nearing a generalist model with deep capabilities across all areas, because that would be AGI).

  4. AI continues to facilitate the generation of mediocre yet somewhat valuable outputs, ones that aren’t anywhere near destroying all human jobs, but ones that will surely displace many, especially those knowledge workers with unremarkable talent.

See you again on Sunday!

THEWHITEBOX
Join Premium Today!

If you like this content, by joining Premium, you will receive four times as much content weekly without saturating your inbox. You will even be able to ask the questions you need answers to.

Until next time!

For business inquiries, reach out to me at [email protected]