AI's State of the Union: Hardware, Infra, and Model Layers
It Costs $0 to Look This Good
Optics matter, and Agree helps you look like the pro you are. Whether you're a founder raising funds, sending Y Combinator SAFEs, or an entrepreneur billing clients, Agree combines (free) e-signature and payments into one seamless platform. Streamline agreements, speed up payments, and impress clients with sleek, branded processes that show you’re running a tight ship. Your business deserves tools that make you look as sharp as you are.
NEW PREMIUM CONTENT
Agents and SOTA Math Models
On the use of automated agents. Responding to a Premium subscriber, I discuss which off-the-shelf products offer fully autonomous agents, and a set of tips to build your personal autonomous agent at home. (Only Full Premium subscribers)
rStar-Math, a Huge AI Milestone. Following my viral take on Microsoft’s new supermodel, I dive deeper into its technicalities, helping you see a detailed review of how a state-of-the-art math model is built. (All Premium subscribers)
FUTURE
AI’s 2025 State of the Union
As 2025 progresses, we are beginning to observe the true consequences of the paradigm shift represented by large reasoner models (LRMs). And the more I reflect on this, the more I realize the change is far bigger than I expected.
Fundamental changes are coming in all areas of the industry.
The introduction of LRMs will affect unit economics, CAPEX, and go-to-market strategies… crowning some huge winners. However, it will also cause the first major failures among Generative AI start-ups.
It’s a different industry now, an opportunity for the redemption of some, like those at Cupertino, and a death sentence for others in places like Toronto or France.
Therefore, I am taking a holistic view, from the hardware layer to the product/service level, to explain how ‘the game changes.’ At each layer, I will provide the key insights to watch for and name the companies I consider to have an edge… and those at the edge of disaster. Today, I will start with the three main layers: hardware, infra, and models.
Let’s dive in.
LRMs Aren’t Models, They Are Systems.
LRMs represent a considerable transformation both in shape and form. In layman’s terms, AI models have changed in size and, more importantly, structure and deployment.
In fact, let’s clarify this once and for all: LRMs are not AI models; they are systems, a concatenation of multiple AI models and, crucially, other non-model components.
Generators, Verifiers, & Search
As discussed in my rStar-Math piece published in our knowledge base, every LRM has at least two components:
Generator: The AI model that provides the ‘thought’ tokens. This is usually a Large Language Model (LLM).
Verifier: Another AI model (it could be the same model as the generator) that reflects on and evaluates the quality of the generator’s thoughts by assigning a quality reward. For that reason, they are also called ‘reward models.’ These are also usually LLMs, but fine-tuned to output scalar reward scores instead of words.
This turns the answer process into a chain of thought, a reasoning trace in which one model suggests solutions to a problem, and the other confirms or reframes the accuracy of these thoughts. For more advanced LRMs, we add a third component, a search algorithm, allowing both models to explore multiple chains of thought in real time. This leads us to models like OpenAI’s o3.
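To make that concrete, below is a minimal Python sketch of how a generator, a verifier, and a simple best-of-N search fit together. The functions `generate_thought` and `score_thought` are hypothetical placeholders for real model calls, not any lab’s actual implementation.

```python
import random

def generate_thought(prompt: str, temperature: float = 0.8) -> str:
    """Generator: an LLM that proposes one candidate chain of thought.
    Faked here with a random placeholder string."""
    return f"candidate reasoning for '{prompt}' (id={random.randint(0, 999)})"

def score_thought(prompt: str, thought: str) -> float:
    """Verifier (reward model): returns a scalar quality score for a thought.
    A real verifier would be another LLM fine-tuned to output rewards."""
    return random.random()  # placeholder reward in [0, 1]

def best_of_n(prompt: str, n: int = 8) -> str:
    """Simplest possible 'search': sample N chains of thought and keep the
    one the verifier scores highest. Real systems (tree search) explore and
    prune far more cleverly."""
    candidates = [generate_thought(prompt) for _ in range(n)]
    scored = [(score_thought(prompt, c), c) for c in candidates]
    _, best = max(scored)  # highest-reward chain wins
    return best

print(best_of_n("What is 17 * 24?"))
```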
Of course, the significant change here is verifiers. But what are they really?
When people mention verifiers, they may mean a reward model like the one just described, but they may also be referring to a broader verification system made up of multiple components: one or more reward models, plus symbolic engines that can automatically verify the quality of thoughts in certain domains.
For example, the AI models behind ChatGPT use code interpreters and/or compilers to verify whether generated code is correct. o3 also uses math engines (think of something similar to Wolfram Alpha, a really advanced calculator) to check the validity of math operations.
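As a toy illustration of such a symbolic verifier (my own sketch, not OpenAI’s actual tooling), the snippet below executes a model-generated Python function against known test cases and returns a pass/fail signal the system can treat as a reward:

```python
def verify_generated_code(code: str, test_cases: list[tuple[int, int]]) -> bool:
    """Symbolic verifier for code: run the candidate in an isolated namespace
    and check it against known input/output pairs. Real systems would do this
    inside a sandboxed interpreter, not a bare exec()."""
    namespace: dict = {}
    try:
        exec(code, namespace)        # 'compile and run' the candidate
        solve = namespace["solve"]   # assumed entry-point name, by convention
        return all(solve(x) == expected for x, expected in test_cases)
    except Exception:
        return False                 # any crash counts as a failed verification

# A candidate the generator might have produced for "return the square of x"
candidate = "def solve(x):\n    return x * x\n"
print(verify_generated_code(candidate, [(2, 4), (3, 9), (10, 100)]))  # True
```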
Besides improving performance, verifiers have huge repercussions regarding hardware, unit economics, and employability. However, LRMs’ biggest impact has been model compression.
A Change in Shape and Form
One of the biggest (and most beneficial) parts of LRMs is that they are considerably smaller than big-time LLMs, as they leverage inference-time compute (thinking for longer) to “increase” their intelligence.
Put another way, if an LLM has to be an eight on a 1-to-10 intelligence scale to display grade-eight intelligence in its responses, an LRM can be a five on that scale but leverage the time allocated to the task to reach an eight or even a nine, simply by thinking for longer or, quite frankly, by trying multiple times until one attempt works (a method known as self-consistency).
Simply put, there are two ways AIs can output smart predictions:
being incredibly smart from the get-go
or thinking for longer on the task, giving you a time advantage to reflect and self-correct.
LRMs represent a transition toward the latter.
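Self-consistency, the ‘try multiple times’ approach mentioned above, is conceptually as simple as the sketch below: sample several independent answers and keep the most common one. `sample_answer` is a stand-in for a real model call.

```python
import random
from collections import Counter

def sample_answer(question: str) -> str:
    """Placeholder for one full model attempt at the problem. A real call
    would sample a chain of thought and return its final answer."""
    return random.choice(["408", "408", "408", "412", "398"])  # noisy, but biased toward the right answer

def self_consistency(question: str, n_samples: int = 16) -> str:
    """Self-consistency: ask the model n times independently and majority-vote
    the final answers. More samples = more inference-time compute = better odds."""
    answers = [sample_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("What is 17 * 24?"))  # usually '408'
```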
This can be observed below. With enough inference-time compute, the Llama 3.2 1B and 3B models, considerably “dumber” than their larger 3.1 8B and 70B siblings, surpass their math performance simply by thinking for longer on tasks (or increasing the number of attempts at the problem).
Source: HuggingFace
Now that we are convinced this is the right direction toward better AI models, how does this inevitable shift change the AI hardware industry?
What Changes in Hardware in 2025?
All Stars are Aligning
As discussed, with LRMs, the compute balance shifts completely in favor of inference, meaning that more computing power is spent running models than training them.
This seemingly harmless shift has huge (with ‘huge’ being an understatement) repercussions on the hardware layer.
In inference, unlike in model training, and as discussed multiple times in this newsletter, memory is the main bottleneck. Storing the model’s weights and its growing KV cache requires huge amounts of memory to run these workloads.
Consequently, it takes no genius to realize that if two factors determine how much memory I need (model size and cache), and we know one is growing (the cache, due to lengthier sequences), then the other should at least strive to fall, especially considering we have just seen that inference-time compute elevates the intelligence of dumber, usually smaller, models.
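To put rough numbers on those two factors, here is a back-of-the-envelope estimate (my own assumptions on precision and architecture, not vendor figures): weight memory scales with parameter count, while the KV cache scales with sequence length, so longer reasoning traces inflate the second term fast.

```python
def memory_estimate_gb(params_b: float, n_layers: int, n_kv_heads: int,
                       head_dim: int, seq_len: int, bytes_per_el: int = 2) -> tuple[float, float]:
    """Rough inference memory: model weights + KV cache for ONE sequence.
    Assumes fp16/bf16 (2 bytes) and grouped-query attention; real numbers vary."""
    weights = params_b * 1e9 * bytes_per_el
    # K and V tensors, per layer, per KV head, per token
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_el
    return weights / 1e9, kv_cache / 1e9  # gigabytes

# A hypothetical 70B-class model with a long 32k-token reasoning trace
w_gb, kv_gb = memory_estimate_gb(params_b=70, n_layers=80, n_kv_heads=8,
                                 head_dim=128, seq_len=32_768)
print(f"weights ~{w_gb:.0f} GB, KV cache ~{kv_gb:.1f} GB per sequence")
# weights ~140 GB, KV cache ~10.7 GB, and the cache grows with every extra 'thinking' token
```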
Therefore, we will see an industry-wide effort to compress models into smaller and smaller sizes.
To be fair, we were already making this transition to smaller models:
Two years ago, our frontier models were in the 100-500 billion-parameter range (GPT-3.5), or even 2 trillion (GPT-4, Opus).
Today, our best LLMs fall around the 70-billion-parameter threshold (GPT-4o, Claude 3.5 Sonnet), way smaller.
Just hours ago, DeepSeek released a Qwen 2.5 1.5B distillation that reaches state-of-the-art performance, and it’s so tiny that you can run it on your iPhone (yes, you read that correctly; more on Thursday).
With LRMs, this trend will accelerate, and we will soon see models in the sub-20-billion-parameter range that achieve state-of-the-art results while being considerably smaller (as we are already seeing). A series of techniques I will explain later will enable this, so for now you will have to take my word for it.
Interestingly, despite models ‘losing weight’ (pun intended), the overall compute required will increase, as these models generate multiple times the number of tokens original LLMs did.
But it isn’t only a ‘more tokens’ problem, as you also need to accommodate compute for the different model verifiers, the symbolic engines (like virtual coding environments for the AI to use), and other auxiliary components that, together, build the LRM system.
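A rough way to see why total compute still grows (my own back-of-the-envelope, using the common approximation of ~2 × parameter-count FLOPs per generated token): a much smaller model that ‘thinks’ for many more tokens easily out-spends a larger model that answers directly, before even counting the verifier.

```python
def generation_flops(params_b: float, tokens: int) -> float:
    """Common approximation: ~2 * parameter-count FLOPs per generated token."""
    return 2 * params_b * 1e9 * tokens

direct_llm = generation_flops(params_b=70, tokens=500)      # big model, short direct answer
reasoner   = generation_flops(params_b=20, tokens=20_000)   # small model, long reasoning trace
verifier   = generation_flops(params_b=7,  tokens=20_000)   # reward model scoring those thoughts

print(f"direct LLM:    {direct_llm:.1e} FLOPs")            # ~7.0e13
print(f"LRM generator: {reasoner:.1e} FLOPs")              # ~8.0e14, more than 10x higher
print(f"+ verifier:    {(reasoner + verifier):.1e} FLOPs") # ~1.1e15 total
```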
This is a problem for hardware companies, which have to deal with a hard truth: AI is developing and becoming more compute-hungry much faster than what Moore’s and Koomey’s laws allow.
In other words, compute requirements are growing considerably faster than the amount of compute AI hardware can provide per unit of consumed energy.
But that’s only the beginning of the problems for hardware companies:
As mentioned, inference workloads are memory-bottlenecked, meaning that hardware, particularly GPUs, saturates memory before it saturates compute cores. In simple terms, GPUs run at a compute discount, as cores sit idle for a considerable share of the time, straining the compute drought even further.
Energy will continue to be limited. Countries’ capacity to bring new power generation online is even slower, as lead times, especially at the transmission level (taking power from plants to the points of consumption), are dominated by transformer construction times: these range between 115 and 130 weeks, with larger units reaching 210 weeks, unacceptable at the speed AI is progressing. So the pressure to find a way to escape Moore’s law’s stagnation will be huge.
Hardware companies will find themselves supply-constrained on energy and with efficiency laws running behind demand, a really dark place to be.
This will lead to the following actions by these companies (or, put differently, you should use these as a signal of whether they have their shit together):
Increased per-GPU memory allocation. NVIDIA’s latest GPU, the recently announced GB300, has a RAM capacity of 288 GB per GPU, up from the 80GB per GPU of the H100 series.
Increased GPUs per node. With Blackwell, NVIDIA has gone from 8 GPUs per node to 36 or 72 GPUs per node. In simple terms, all GPUs in the node behave as one ‘fat GPU’, ensuring low latency in GPU-to-GPU communication, which is essential as these models are parallelized across multiple GPUs. In LRM inference, latency matters a lot, so the more time wasted in this communication, the worse the user experience. Thus, the more GPUs per node, the better.
Beyond GPUs, hardware companies will put a huge emphasis on CPUs. With LRMs, the CPU’s role becomes more essential than ever. As CPUs have to orchestrate everything (deciding which GPUs do what, coordinating the multiple models behind each prediction, handling coding sandboxes and other symbolic environments, managing data movement across all components, etc.), they will attract much of the attention. In fact, they already are.
A growing number of edge data centers, smaller, megawatt-scale facilities that offer better latency to users, will be deployed. As mentioned, with LRMs, latency becomes a challenge, so bringing servers closer to users is another important lever.
The need for lower latency will lead power-demanding clients to explore alternatives to memory-constrained GPUs, such as more inference-optimized hardware like Groq’s LPUs or Cerebras’ WSEs.
We will see more private capital activity toward ASICs, hardware products tailored exclusively to particular AI architectures, like Etched’s Sohu chip.
Better small models will lead to greater edge adoption, and by the end of 2025, most people who aren’t living under a rock will have at least one on-device LLM running on their laptops (if you live in the US and have an iPhone 15 Pro or newer, you already have at least one LLM on it as we speak with Apple Intelligence, despite it sucking big time). But more on edge AI’s renaissance below.
And what companies are poised to succeed or suffer?
Winners And Losers
Subscribe to Full Premium package to read the rest.