
LEADERS
The Anti-Hype Guide for 2025
In the last few weeks, two of the most influential people of our time, Rich Sutton, one of AI’s godfathers, and Andrej Karpathy, considered by many (yours truly included) to be the clearest thinker in the AI space, have come out with decidedly anti-hype takes on modern AI.
Coupled with the general sentiment of AI being in a bubble, this has caused serious turmoil, even if both are generally optimistic about AI and its value to society.
Take today’s article as a way to:
Intuitively understand what AI experts who aren’t looking to sell you something think AI’s limitations are.
Learn about the recent research they point to as a potential way out of these issues.
And, ironically, see how their somewhat bearish views on AI could in fact further justify the lavish spending we are seeing, growing the AI bubble even larger.
In an industry where every article, every blog, and every announcement is designed to sell you something, take today’s piece as a way to see the AI industry in a more straightforward, unbiased light, a cure for the ‘hype illness’ all of us, me included, suffer from.
Let’s dive in.
AI’s Big Limitations: The Sutton View
Rich Sutton’s interview on the Dwarkesh Podcast is a very, very esoteric conversation, truly not meant for someone who isn’t deeply in the weeds of AI.
But we can make it easy to understand. And we must, because there’s a lot that can be understood about this industry based solely on Sutton’s views.
Where he’s coming from
First and foremost, Rich Sutton is a legendary figure; he’s the author of what is possibly the most influential essay in the history of this industry, at least measured by the money thrown at its ideas. In it, he argues that the best AI algorithms are those that unlock more compute and data at scale.
It’s called ‘The Bitter Lesson’ because its conclusion is a bitter one for researchers: the human intervention we bake into models (human data, human heuristics, etc.) tends to work worse than a simpler algorithm that just allows more compute to be thrown at the problem.
He also happens to be one of the fathers of Reinforcement Learning (RL), a way of training AIs that is extremely popular these days and works much like training your dog: reward it whenever it does something desirable, ‘punish’ it when it makes mistakes.
Over time, this training system lets the dog reinforce (hence the name) the actions most likely to yield the treat it so desperately wants, and avoid the actions that make its owner mad. Interestingly, this is literally the same approach we use with AIs today.
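If it helps to see that reward/punish loop spelled out, here is a minimal, purely illustrative sketch in plain Python. The actions, scores, and learning rate are all invented for the example, not anything from a real training stack: the “dog” keeps a preference score per action, samples actions in proportion to those scores, and nudges a score up when a treat arrives and down otherwise.

```python
import random

# Toy "dog": a handful of known actions with learned preference scores.
actions = ["sit", "paw", "lie_down", "stay"]
prefs = {a: 1.0 for a in actions}  # start with no preference at all

def sample_action():
    """Pick an action with probability proportional to its preference."""
    return random.choices(actions, weights=[prefs[a] for a in actions], k=1)[0]

def train(target, steps=1000, lr=0.2):
    """Reward the requested action, lightly 'punish' everything else."""
    for _ in range(steps):
        a = sample_action()
        reward = 1.0 if a == target else -0.1
        # Reinforce: rewarded actions become more likely to be sampled again.
        prefs[a] = max(0.01, prefs[a] + lr * reward)

train("sit")
print(prefs)  # "sit" ends up with by far the highest preference score
```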
But what is the ‘dog’ in AI’s case?
To use RL effectively, you need a prior, something from which we can sample actions that aren’t purely random.
For example, no matter how hard you try to reward good actions, you can’t train a six-month-old baby to calculate a second-order derivative; you need your prior (i.e., the baby, or the dog in the previous example) to at least have a higher-than-chance probability of executing good actions.
This introduces a very interesting, almost philosophical discussion amongst AI experts: to what extent does RL allow the prior to go beyond what it knows? In the same way you can’t get second-order derivatives from a baby, can we teach new skills to a prior that has never seen those skills before? But I digress.
In modern AI, at least the AI that pertains to companies like OpenAI, Google DeepMind, or Anthropic, this prior is a Large Language Model, or LLM.
This AI model acquires the knowledge necessary to apply RL (to make the prior behave more like a PhD than a baby) through next-token prediction over the entire written Internet: we give it, say, three words, and it has to predict the next one, and we do this for all of the Internet’s text and beyond.
In other words, we train LLMs via imitation learning, making them very good at predicting the next word in a sequence of text. Eventually, the LLM becomes a sort of compressed representation of the Internet, a blob of words you can ask anything that appears on the Internet, and the model “knows.”
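To make “predict the next word” slightly more concrete, here is a deliberately toy sketch: a word-pair counter standing in for a real transformer, trained on a made-up three-sentence “Internet.” Everything in it is invented for illustration; the real thing learns from billions of documents and conditions on full context, not just the previous word.

```python
from collections import defaultdict, Counter

# A comically small stand-in for "the entire written Internet".
corpus = (
    "the capital of austria is vienna . "
    "the capital of france is paris . "
    "the capital of italy is rome ."
).split()

# Imitation learning reduced to its essence: for every word, record which
# word actually followed it in the training text.
next_word_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_word_counts[prev][nxt] += 1

def predict_next(word):
    """Return the continuation most often seen during 'pretraining'."""
    if word not in next_word_counts:
        return None  # the model only "knows" what it has seen
    return next_word_counts[word].most_common(1)[0][0]

print(predict_next("capital"))  # -> "of"
print(predict_next("austria"))  # -> "is"
```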
The point here is that once you have a prior that knows stuff, you can ask it more complicated questions, and the model will be able to suggest possible solutions.
At this point, solutions aren’t random, so the RL training method can converge; we can apply the reward/punish learning mechanism not to any random suggestion, but to ones that are ‘plausible’, in the same way that a friend who doesn’t know the capital of Austria will start guessing European city names to see if one is correct, not random fruits. That is what a good prior is needed for.
Intuitively, this is a fundamentally accurate description of the frontier models we have today. Called ‘reasoning models,’ they are just LLMs that have seen everything and have then been retrained using the reward/punishment training method we call RL.
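Here is a rough sketch of how the two stages fit together, under heavy simplification: the “prior” below is just an invented probability table over candidate answers (standing in for a pretrained LLM), and the reward is a simple correctness check. Real labs typically use far more elaborate verifiers and policy-gradient machinery; this only shows the reinforce-what-gets-rewarded idea in miniature.

```python
import random

# Stage 1 (pretend output of pretraining): a prior that already puts most of
# its probability on plausible answers to "What is the capital of Austria?".
# These numbers are invented for illustration.
prior = {"vienna": 0.4, "salzburg": 0.3, "graz": 0.2, "banana": 0.1}

def sample(dist):
    names = list(dist)
    return random.choices(names, weights=[dist[n] for n in names], k=1)[0]

def reward(answer):
    """A verifiable reward: 1 if the answer is correct, 0 otherwise."""
    return 1.0 if answer == "vienna" else 0.0

# Stage 2 (the RL part): sample from the prior, score the sample, and shift
# probability mass toward rewarded answers, renormalizing as we go. A real
# system would update an LLM's weights, not a four-entry table.
policy = dict(prior)
for _ in range(2000):
    answer = sample(policy)
    policy[answer] *= 1.0 + 0.05 * reward(answer)
    total = sum(policy.values())
    policy = {k: v / total for k, v in policy.items()}

print(policy)  # probability mass concentrates heavily on "vienna"
```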
But if this is what Sutton envisioned, why isn’t he pouring champagne and sniffing cocaine in celebration in every interview he gives?
The problem is the ‘nature’ of the reward.
To reward or not to reward.
RL sounds stupidly easy to apply, but boy, is that impression wrong. When we’re working with dogs, this is very hard to see; we just want the dog to sit, give a paw, lie down, and, if you put enough time into it, stay still until commanded.
I am one of those “chosen people” who managed to do that with a Husky, of all rebellious dog breeds (lots of work!).
Here, not only is the number of commands you aspire to teach your dog very small, but, even more importantly, the stakes for both the dog and the owner are negligible; it’s not the end of the world if your dog doesn’t learn to give you a paw.
The same applies to board games like chess or Go (areas where RL has allowed humans to create superhuman AIs that are literally unbeatable, like AlphaZero). What’s the worst thing that can happen? That you lose the game?
Our current RL systems are not only extremely limited in what they can learn; they are also not exposed to the consequences of their actions (or those consequences are harmless). But more importantly, they are hackable.
But what do we mean by hackable?
Circling back to the dog example, something very typical of dogs (at least mine) is that if they don’t get the action you asked for right, they become not only puzzled but also very anxious for the treat, so they start spamming different actions (give one paw, then the other, then sit, then lie down, and so on) to see if any of the learned actions is the one I (the owner) was asking for; quite literally the canine equivalent of “throw things at the wall and see what sticks.”
This is a fun example of a pervasive—and very problematic—issue in RL: reward hacking. This is the effect that happens when the AI model (or the dog in this case) finds loopholes in the reward mechanism that yield the reward, but in a suboptimal way.
I, as the owner, don’t want my dog to spam actions every time I give him a command; I want only the one action I asked for. However, during my RL training, I never punished action spamming, so the dog learns a way to cheat his way to the treat: if I’ve only taught him four actions, it’s just a matter of spamming all four, and that guarantees a treat.
To be clear, this is the result of poor reward design on my part; I should have added a punishment mechanism so that, if he spams actions, no reward is given.
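To make the loophole concrete, here is a tiny invented example of the same design flaw: a reward function that only checks whether the requested action shows up somewhere in what the “dog” did, and a patched version that also punishes spamming.

```python
def naive_reward(command, performed):
    """Flawed design: a treat if the requested action shows up at all."""
    return 1.0 if command in performed else 0.0

def fixed_reward(command, performed):
    """Patched design: a treat only for exactly one, correct action."""
    if performed == [command]:
        return 1.0
    return -0.5 if len(performed) > 1 else 0.0  # spamming now costs you

# The hack: spam every known action and the naive reward always pays out.
spam = ["sit", "paw", "lie_down", "stay"]
print(naive_reward("sit", spam))      # 1.0  -> loophole found, reward hacked
print(fixed_reward("sit", spam))      # -0.5 -> loophole closed
print(fixed_reward("sit", ["sit"]))   # 1.0  -> only the intended behavior pays
```

The bug sits in the reward, not in the learner; the learner simply optimizes whatever we wrote down.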
This ‘harmless’ result of my dog hacking the treat reward system is literally what happens with AI models during RL all the time.
In fact, this reward hacking is what sits at the heart of most modern safety issues in Generative AI these days: lying, deception, blackmail, sycophancy; all those situations you’ve seen in the media where AIs have done such things are the result of poor reward design by AI labs.
For example, when GPT-4o famously went out of control with its praise toward the user (i.e., sycophancy, the tendency to agree), OpenAI had to pull it back immediately. But this was no accident; it was simply the result of OpenAI wanting to bake a little bit of sycophancy into the model (it helps usage metrics), coupled with very bad reward design (i.e., the model found a way to satisfy the sycophancy reward by being obnoxiously sycophantic).
Put another way, they forgot to punish obnoxious sycophancy in the same way I forgot to punish action spamming with my dog.
And what does all this have to do with Sutton’s bearish view on AI’s current state? The answer is simple: actions have consequences, yet we are completely shielding AIs from the consequences of theirs.
Which is to say, what Sutton thinks we’re missing goes to the heart of what he believes is the next big thing in AI: the era of experience.