Why Google Is Your Best AI Bet


COMPANY DEEP DIVES
Here’s Why Google Will Win
A few months ago, we covered Google and discussed how its internal structure and ways of working were killing its chances of success in the AI race.
Today, the state of AI shows a totally different picture for them. From hardware that goes toe-to-toe with NVIDIA, to models that humiliate some of the most hyped AI incumbents, to sitting on top of the biggest data pile in the history of capitalism.
Heck, Google even has the time to publish AI that decodes dolphin language. Here’s all that you are going to learn today:
Build intuition on AI hardware, understand the key metrics, and grasp what makes the data center business unique (notably, answers to questions such as ‘why the cloud?’ and ‘why are accelerated clouds different?’), while comparing Google’s Ironwood and NVIDIA’s B200. This is crucial to understanding not only Google’s future but also that of NVIDIA/AMD and the other Hyperscalers.
Learn what makes Google’s AI model strategy unique, with vital insights (Pareto graphs, unbiased usage rates, etc.) that I assure you will change your perspective on frontier AI.
See how Google is correcting its path of self-destruction and finally assembling a robust set of AI products (while even finding time to publish new protocols for agents).
Discover how they are even hitting it out of the park in M&A.
This piece is aimed at the following cohorts:
Investors who want to understand Google like experts do,
Tech enthusiasts who want to learn about AI hardware and software in great detail without jargon,
and AI practitioners who wish to familiarize themselves with Google’s stack at a deep level.
When private equity firms hire me to discuss a company’s future, this is what I tell them. Let’s dive in.
This piece turned out very long (+8,000 words). To shorten it, some of the most technical parts have been offloaded to Notion articles you can click if they are of interest to you (you must be a Premium subscriber).
Hardware - Ironwood & the Drums of Liberation
For years, all Hyperscalers (Amazon, Microsoft, Google, Meta, and their Chinese counterparts) found themselves in a very unfamiliar situation. Accustomed to having great power over their suppliers, they all found themselves barking up the same tree: NVIDIA.
NVIDIA found itself with what was essentially a blank check from several cash-rich companies all fighting for every single GPU NVIDIA could deliver. This was—is—a very uncomfortable situation for these companies, who have all independently initiated their own chip ambitions.
But one company saw this need coming from a mile away: since 2018, actually. And that was Google.
And this week, this grand vision of being chip-independent reached its ‘it moment’ as Google announced the newest version of its Tensor Processing Unit (TPU) hardware, v7, codenamed ‘Ironwood.’
And while this is the seventh version of this accelerated chip (more on what that means in a second), based on the published results we can confidently say that Google has become the first Hyperscaler to eliminate its dependency on NVIDIA for model training and inference (they still purchase NVIDIA chips as part of their cloud computing business, though).
Google did use the sixth version to train Gemini models, but it still needed NVIDIA for some workloads. With Ironwood, it literally doesn’t need NVIDIA anymore.
Google is now officially the first company to cure its expensive NVIDIA addiction.
But to understand why this is the case and why this is extremely important for an AI company’s future, let’s quickly recap what ‘AI hardware’ is and why it should matter to you.
What is Accelerated Hardware?
AI runs on accelerated hardware. But what does that even mean? When we think about hardware in general, a computer boils down to just two essential pieces:
Compute chips
Memory chips
The rest of a computer’s hardware is built on top of these two: the former perform the calculations necessary to run programs, while the latter let them read, write, and store the results.
Based on this, we can define a computer’s performance based on three main metrics:
Processing power, how many computations per second our computer can perform. We can scale this in two ways: increasing the power of each compute chip or increasing the number of chips.
Memory capacity, how much data our computer can store, both permanently (the information survives unplugging the computer) and temporarily (it doesn’t).
Memory speed, how fast the compute chips can read from or write to memory. This creates the need for two memory components: the faster yet volatile RAM, and the slower yet permanent (unless you delete the data) hard storage. Your computer’s operating system loads the programs it’s using into the faster memory to increase speed, and offloads to hard storage the things we need to keep for future reference but don’t need at this moment.
In a nutshell, a computer does all kinds of computations and reads and writes to its memory depending on the task at hand. Here, we introduce two kinds of computer workloads: sequential and parallel.
Sequential workloads run one step after another: each step might be very complicated, but you can only execute a few at the same time. Parallel workloads are the opposite: each computation is very simple, but many of them run at once.
Accelerated hardware excels at the latter; ‘accelerated’ is just a fancy way of saying ‘parallel compute hardware,’ because the number of computations per second skyrockets with this hardware.
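To make the distinction concrete, here’s a minimal sketch in Python (my own illustration, not from the article): the same dot product written as a sequential loop and as a single bulk operation that a vectorized library can execute over many elements at once, the same idea accelerators take to the extreme with thousands of simple cores.

```python
import time
import numpy as np

n = 2_000_000
a = np.random.rand(n)
b = np.random.rand(n)

# Sequential: one simple multiply-add at a time, in plain Python.
t0 = time.perf_counter()
total = 0.0
for i in range(n):
    total += a[i] * b[i]
t_loop = time.perf_counter() - t0

# Parallel-friendly: the same dot product expressed as one bulk operation,
# which NumPy hands to optimized kernels that chew through many elements at once.
t0 = time.perf_counter()
total_vec = float(a @ b)
t_vec = time.perf_counter() - t0

print(f"loop: {t_loop:.3f}s | vectorized: {t_vec:.4f}s | speedup: ~{t_loop / t_vec:.0f}x")
```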
And why is this so important?
Simple: most of the world’s computing has traditionally been sequential (watching videos, using Excel, and so on), with the noble exception of video game rendering, which was the main use case for accelerated hardware for decades. AI and High-Performance Computing, however, are built on what we call ‘embarrassingly parallel’ workloads (yes, that term is a thing).
In fact, current AI models like ChatGPT or Gemini are a massive bunch of multiplications and additions performed in parallel that, despite their complex responses, are ‘embarrassingly simple’ at heart.
As we covered in our DeepSeek piece, we can use several different methods to increase parallelization. Eventually, though, parallelization is limited by Amdahl’s law, which shows that it has diminishing returns: the serial fraction of a workload caps the maximum achievable speedup.
For more information on Amdahl’s law, read the following Notion piece.
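As a quick illustration of the law itself (a sketch of my own, not taken from the Notion article), the speedup from running the parallelizable fraction p of a workload on N cores is 1 / ((1 − p) + p / N):

```python
def amdahl_speedup(parallel_fraction: float, n_cores: int) -> float:
    """Theoretical speedup when a fraction `parallel_fraction` of the work
    parallelizes perfectly across `n_cores` and the rest stays serial."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)

# Illustrative 95%-parallelizable workload (the fraction is made up):
for cores in (8, 64, 512, 4096):
    print(f"{cores:>5} cores -> {amdahl_speedup(0.95, cores):.1f}x")
# No matter how many cores you add, the speedup creeps toward the
# 1 / (1 - 0.95) = 20x ceiling: diminishing returns.
```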
Although this may sound incredible, the hardest computation in a running AI model is a multiplication of two numbers or an exponential; the issue is that we need to perform billions of them simultaneously.
So, how does accelerated hardware enable this?
While all computing hardware is built from the same two pieces, compute and memory chips, the core difference lies in how those chips are designed.
In the case of CPUs, we have a few computing cores, each very powerful and capable of performing very complicated arithmetic.
Accelerated hardware like GPUs or TPUs is instead packed with thousands of smaller computing units that perform elementary calculations but act in unison, rather than a few very powerful cores (the CPU approach).
Consequently, we achieve parallelization by having many computing cores acting simultaneously. The constraint is that each of these cores only performs simple computations, but as we mentioned, that’s precisely what AI needs.
However, despite the massive amount of compute required, AI workloads have a more important factor to consider.
It’s not about compute, it’s about data transfers.
Even when thinking about accelerated workloads that aren’t AI, like video games, the most important metric has always been processing power (calculations per second).
Growing this number is also relevant in AI, as we need to perform many computations per second for the larger models, but less so than memory size and memory speed. With AI, we must stop looking at the power of our compute cores as the deciding factor; in today’s world, everyone is looking at memory size and speed.
But why?
In a nutshell, accelerated hardware is notably memory-constrained in speed and size, despite AI workloads requiring large models and a hefty amount of data transfers.
If you want to fully understand the reasons behind this, I highly encourage you to read the Notion article {🤔 Achieving compute-boundness}, where we cover the crucial concept of the arithmetic intensity ratio, walk through an example of how to compute it to define ‘our ideal AI workload’, and explain why models need so much RAM.
The point is that running AI is the stuff of nightmares: the sheer volume of memory transfers and the limited per-chip RAM make it almost impossible to reach compute-boundness, leading to painful compute idleness, with compute cores burning energy while not doing a single calculation. As a result, most AI workloads are ‘memory-bound’, which, as we’ll see later, is a revenue disaster.
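To make ‘memory-bound’ concrete, here’s a rough sketch of that check (my own illustration; the hardware figures are the Ironwood numbers discussed below, and the workload numbers are hypothetical):

```python
def is_memory_bound(flops: float, bytes_moved: float,
                    peak_flops: float, mem_bandwidth: float) -> bool:
    """A workload is memory-bound when its arithmetic intensity (FLOPs per byte
    moved) sits below the hardware ridge point (peak FLOP/s per byte/s)."""
    workload_intensity = flops / bytes_moved
    hardware_ridge = peak_flops / mem_bandwidth
    return workload_intensity < hardware_ridge

# Hypothetical workload: batch-1 decoding of a 70B-parameter model stored in FP8.
# Each generated token reads every weight once (~70e9 bytes) for ~2 FLOPs per weight.
flops_per_token = 2 * 70e9
bytes_per_token = 70e9

# Hardware figures: Ironwood's published 4.6 PFLOP/s FP8 and 7.4 TB/s HBM bandwidth.
print(is_memory_bound(flops_per_token, bytes_per_token,
                      peak_flops=4.6e15, mem_bandwidth=7.4e12))
# True: 2 FLOP/byte is nowhere near the ~622 FLOP/byte ridge, so the cores sit idle.
```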
In fact, with models being more or less commoditized at this point, as in any other commoditized industry, it’s what you do about your costs (margins) and how you structure your workloads to maximize accelerator utilization that become the deciding factors.
All AI Labs are under extreme pressure to improve their margins; engineering is the moat, not the technology, and in that arena Google appears beautifully positioned.
But I don’t want to get ahead of myself; we’ll touch on that later.
In summary, we now understand the high-level complexities of accelerated hardware, so we can finally make sense of Google’s newest hardware product: Ironwood.
Loosening NVIDIA’s Grip
As mentioned earlier, Google now has top-tier AI hardware called Ironwood. This is an AI ASIC (Application-Specific Integrated Circuit), a chip built exclusively for linear algebra, the kind of computation that, as we have seen, dominates AI.
But what makes it different from NVIDIA’s GPUs?
In short, they are generally less powerful but more power-efficient than GPUs (NVIDIA/AMD’s products). But nobody expected the seventh version of this chip to be this impressive compared to its predecessors:

Source: Google
At a more detailed level, Ironwood has the following specs, which we can now interpret thanks to the previous explanation:
4.6 PFLOP/s FP8 of compute power
192 GB of HBM (High-Bandwidth Memory, the fast on-package RAM)
7.4 TB/s of intra-TPU memory bandwidth (how fast the chip reads from and writes to its HBM)
600 GB/s (unidirectional) TPU-to-TPU bandwidth via the Inter-Chip Interconnect (ICI), the communication speed between different chips working on the same model
~1,000 watts TDP (Thermal Design Power), the power/heat envelope required to run the TPU
9,216 chips in a 3D torus topology, the number of chips that can be run in parallel within a single pod
Ok, what does all that even mean, and how does it compare to NVIDIA’s top GPU, the Blackwell B200?
To summarize, these specs are very similar to those of NVIDIA’s new superGPU, the Blackwell B200. However, due to their superior pod size, Google’s TPU cluster might be better suited for very large training runs. NVIDIA seems to have an edge for inference workloads where local hardware interconnect matters the most. Again, this is a nuanced comparison.
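To put the per-chip figures above in perspective, here’s a quick back-of-the-envelope calculation (my own, using only the published numbers listed earlier):

```python
# Published per-chip figures from the list above.
peak_fp8_flops = 4.6e15   # 4.6 PFLOP/s of FP8 compute
hbm_bandwidth  = 7.4e12   # 7.4 TB/s of intra-TPU memory bandwidth
pod_chips      = 9_216    # chips per pod

ridge_point = peak_fp8_flops / hbm_bandwidth   # FLOPs needed per byte moved to stay busy
pod_compute = peak_fp8_flops * pod_chips       # aggregate FP8 compute of a full pod

print(f"ridge point: ~{ridge_point:.0f} FLOP/byte")
print(f"full pod:    ~{pod_compute / 1e18:.1f} EFLOP/s of FP8 compute")
# Any workload doing fewer than ~622 FLOPs per byte it moves will leave
# these cores partially idle, i.e., it will be memory-bound.
```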
For a full comparison between both to cover the additional nuances, touching on all the similarities but importantly in all the differences (mainly the GPU-to-GPU interconnect, the communication topology, and understanding which one is better where), read Notion article {🕸️ Ironwood vs B200: The Battle of Topologies}.
The point is that Ironwood means Google no longer requires NVIDIA chips to train or run its models, and will only purchase NVIDIA GPUs to offer them to Google Cloud Platform clients who want to use them.
But for all intents and purposes, Google is the first Hyperscaler to be ‘NVIDIA-free’.
But the bullish case for Google’s hardware goes beyond just the chips; they are unequivocally the most sophisticated Hyperscaler in the world.
The King of Data Center Metrics
Having a good chip and a well-crafted scaling strategy for massive workloads is just the beginning. In this world, issues are measured in seconds, and every second the TPUs aren’t processing data is lost revenue.
But why do I say lost revenue?
We will discuss this in greater detail next week with my analysis of CoreWeave and general ‘GPUnomics’, but accelerated clouds have a totally different business model compared to traditional clouds (where compute is/was mainly CPU-driven).
But first, have you ever wondered why we need cloud computing in the first place?
A cloud provider, at its core, offers infrastructure services to companies that can’t or don’t want to spend huge CapEx (capital expenditures) buying the compute hardware that runs their digital products, and would rather treat those costs as pure OpEx (operating expenditures) in a renting model.
Running your own digital products and services on-premises is a perilous move: it requires huge initial investments in hardware, land, labor, and electricity, and predicting user/employee demand is a living nightmare, forcing you to choose between two alternatives:
Overinvest so you can serve customers at peak load, but end up with depreciating assets that sit idle most of the time, which translates into a poor return on investment (or even losses), or
Underinvest to meet average demand, but be unable to provide service at peak times. Needless to say, in most cases, companies opt for the first option.
These assets (especially accelerated hardware) also depreciate fast, by the way, with typical depreciation schedules of 3 to 5 years, at which point your servers are fried to the core and need replacing. Of course, most companies do not refresh their stack every three years, which leads to underperforming products and services.
Cloud compute, instead, offers this desired flexibility with no initial hardware investment, which is exactly why the cloud is so appealing.
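As a toy illustration of that trade-off (every number here is hypothetical; only the 3-to-5-year depreciation window comes from the text above):

```python
# All of these inputs are hypothetical; only the 3-5 year depreciation
# window comes from the text above.
server_price  = 250_000   # USD for an accelerator server, made-up figure
useful_years  = 4         # within the typical 3-5 year depreciation schedule
utilization   = 0.40      # fraction of hours the hardware is actually busy
rental_rate   = 10.0      # USD/hour to rent equivalent capacity, made-up figure

hours_per_year       = 365 * 24
owned_cost_per_year  = server_price / useful_years   # straight-line, ignoring power/labor/land
busy_hours_per_year  = hours_per_year * utilization
rented_cost_per_year = busy_hours_per_year * rental_rate

print(f"owning:  ${owned_cost_per_year:,.0f}/yr "
      f"(${owned_cost_per_year / busy_hours_per_year:.2f} per busy hour)")
print(f"renting: ${rented_cost_per_year:,.0f}/yr for the same busy hours")
# At low utilization, the idle-but-depreciating asset is what makes on-prem painful.
```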
Thus, assuming the vast majority of companies embracing AI will inevitably end up on cloud deployments (even if they take the open-source route), the crucial point is that Google and the other Hyperscalers must acknowledge that customers will treat traditional clouds and accelerated clouds completely differently; they are quite literally two different businesses with wildly different business models.
But why do I say this?
In particular, purchasing behavior varies radically: investment in accelerated compute will keep growing over time, but it will eliminate the high margins typical of software for good while also making customers more price-sensitive.
In turn, this means the most sophisticated Hyperscaler, the one that can decrease prices without killing its margins, will win. And in that scenario, Google has the highest chances of success, thanks to having the lowest PUE (Power Usage Effectiveness) and the most experience running multi-data-center workloads.
For more details on why this is the case, read Notion article {⚔️ The Economics of Accelerated Clouds & Why Google Stands Out}, where we cover Google’s superiority at running efficient data centers compared to other Hyperscalers.
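For reference, PUE is just a ratio; here’s a minimal sketch with hypothetical numbers:

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power Usage Effectiveness: total facility power over the power that
    actually reaches the IT equipment. Lower is better; 1.0 is the floor."""
    return total_facility_kw / it_equipment_kw

# Hypothetical facilities drawing power for the same 10 MW of IT load:
print(pue(total_facility_kw=11_000, it_equipment_kw=10_000))  # 1.1  (efficient)
print(pue(total_facility_kw=15_000, it_equipment_kw=10_000))  # 1.5  (wasteful)
# At the same electricity price, the 1.1 facility can cut prices while keeping
# its margin, which is exactly the dynamic described above.
```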
Furthermore, Google is also making the right moves to land privacy-centric companies: Gemini on-premise deployments.
The Google Distributed Cloud
One of the biggest concerns in enterprise adoption is security and data privacy. While end consumers aren’t nearly as protective of their data, for enterprises, protecting their data is a matter of survival.
Consequently, sending their crucial data to a Gemini server in Arizona, potentially thousands of miles away, is not the best idea, and in some cases it’s outright illegal.
To circumvent this issue, more enterprise-focused labs like Cohere or Mistral have been trying to offer their models in on-premise form factors, where the model is safely hosted inside your organization, but you are prevented from accessing the actual model files, protecting the AI Lab’s IP.
In theory, they represent the best of both worlds: they can’t see your data as it never leaves your organization, but you can’t see their model. Now, Google is offering this option to enterprises, giving them access to what seems to be the best model in the world right now (as we’ll see later).
While these deployments are still not as adaptable or controllable by the enterprise (and probably not nearly as cheap, since you pay for continuous support from the Hyperscaler), they are a great way to abstract away all the complexities of running AI models (which are many, as we’ve seen) while still safely accessing frontier models.
All things considered, if you believe that owning your hardware stack and employing some of the best data center engineers in the world are crucial for success in AI, Google is the only company in the world that can claim both.
With Google, you have four bullish hardware facts:
A verticalized offering (no dependence on NVIDIA beyond offering NVIDIA GPUs to customers via GCP),
Better margins than other Hyperscalers in the razor-thin business that accelerated compute is (and margins will only get slimmer),
A cloud offering that matches the performance of the best, but can capitalize on point 2 to offer cheaper prices,
A sound, privacy-centric go-to-market strategy for enterprises, especially when leveraging the companies already running Google’s enterprise business applications.
At this point, Google’s position feels pretty good already, but when we start talking about data and AI models, their privileged position becomes clearer than ever.
Gemini, The Undisputed King of AI Models
At the time of writing, at least until OpenAI releases o3, there’s hardly a doubt that the best frontier AI model is Gemini 2.5 Pro. This model doesn’t just shine on the common benchmarks we see with each release:

And to be clear, I don’t expect OpenAI’s o3 to be good enough to offset the very likely price difference between the two models.
It’s also the best model on LMArena (and without the shady tricks Meta used). Perhaps the most impressive result was published just two days ago on the Aider Polyglot benchmark, the benchmark most experts use to judge which coding model is best.
Unlike other benchmarks, Aider Polyglot is much more representative of the tasks actual human developers do.

This is impressive not only because the model being judged is still the experimental version; the biggest takeaway isn’t even obvious in that graph, yet it’s the perfect emblem of Google’s dominance: cost.
Whether thanks to algorithmic improvements or data-center engineering brilliance (as we saw earlier), the model isn’t only the best in raw performance; it’s also much, much cheaper, in some cases (OpenAI’s, for instance) by an order of magnitude.
Using the above example, Gemini 2.5 Pro experimental clinched the first spot for $6 on the overall benchmark. OpenAI’s best model, o1 (high), performed almost 10% worse while costing almost $200.
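Just to spell out the gap implied by those two data points (using only the figures quoted above):

```python
# Figures quoted above: Gemini 2.5 Pro (exp) topped the benchmark for ~$6,
# while o1 (high) scored roughly 10% lower at a cost of almost $200.
gemini_cost, o1_cost = 6.0, 200.0
cost_ratio = o1_cost / gemini_cost
print(f"o1 (high) cost ~{cost_ratio:.0f}x more for a ~10% lower score")
```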
While OpenAI appears to be solidly on par with Google in raw performance across most benchmarks, Google is showing them that there are levels to this, and when hype gives way to efficiency, good luck competing with them.
In fact, just yesterday, we saw OpenAI’s release of GPT-4.1, a non-reasoning model mainly conceived for coding. But does this change my view on who’s on top?
How do they compare?
Continue reading to learn the following:
- How Gemini compares to OpenAI’s latest release, GPT-4.1: the analysis incumbents don’t want you to see.
- The Pareto frontier of models: who’s really leading the pack in performance relative to cost.
- Usage trends of leading models based on unbiased actors like OpenRouter or Poe.
- The delicate situation Anthropic is in.
- And, of course, the rest of the Google analysis, looking at data, product, and the key ventures they’ve invested in.
