All You Need To Know About Agents


You are probably tired of hearing the mantra ‘2025 is the year of agents.’ It’s literally every tech and non-tech CEO’s only answer to basically any uncomfortable AI question.

But here’s the thing: no one has a fricking clue what an agent really is. Today, we are solving just that by answering:

  • What are agents really?

  • What lies underneath this overused term?

  • What is the implementation framework you should always follow?

  • What are its main components?

  • What are the current limitations they must overcome?

  • And what are the most likely implementations of agents in 2025?

Today, we answer all that in the only way we know: explaining complex thoughts in simple words. By the end of this newsletter, you will be more prepared than the average CEO to confront this exciting new AI field and, crucially, to call bullshit on their claims and those of “AI agent” companies trying to sell their latest lame product.

While free of incomprehensible jargon, this newsletter requires some cognitive effort, so don't bother reading otherwise; you just won’t be able to claim to be prepared for what’s to come. But if you read to the end, I guarantee you will gain invaluable intuition, far beyond what almost any executive thinks they have.

And, importantly, you will know which areas you should be paying attention to.

But these executives, unknowingly, are correct. 2025 is indeed the year of agents. But instead of focusing on agents, you need to focus on yourself. “AIs won’t replace you; a human using AI will” is the most important lesson in AI, and 2025 is the year when you will have to decide which side you’re on.

And that decision starts today.

Agents 101

First off, we need to provide a basic definition. Yet, the word ‘agent’ has become a rather loose term that people use for very different purposes, depending on how knowledgeable they are about the space and on their marketing incentives.

Here is, succinctly, what agents are.

An All-encompassing Definition… that Actually Isn’t.

A poor man’s definition of an agent is as follows:

AI models that can perform actions on your behalf.

Most people knowingly or unknowingly refer to this definition when discussing agents. However, this is not all; far from it.

Agents have a more accurate definition, one that has been used for decades since the advent of Reinforcement Learning. Hence, while agents must meet the prior definition, we must complete it.

Agents and RL

The original definition of an agent is any AI model trained through Reinforcement Learning (RL) procedures.

An agent receives an instruction (or goal), observes and interacts with an environment, and executes a series of actions that, hopefully, maximize a cumulative reward. Naturally, the environment's constraints limit the agent's actions.

Simply put, the AI is trained to choose the concatenation of actions that lead to the best possible outcome, with “best” meaning the one that maximizes the expected cumulative reward. In layman’s terms, the action path where the sum of the reward for each action is the largest.
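To make ‘cumulative reward’ concrete, here is a minimal sketch in plain Python. It is purely illustrative: the reward values and the discount factor are made up, and real agents estimate these returns rather than reading them off a list.

```python
# Minimal sketch: the discounted cumulative reward ("return") of an action path.
# The reward values and the discount factor are made-up illustration numbers.

def discounted_return(rewards, gamma=0.99):
    """Sum the rewards along a path, discounting future rewards by gamma each step."""
    total = 0.0
    for step, reward in enumerate(rewards):
        total += (gamma ** step) * reward
    return total

# Two hypothetical action paths: the agent should prefer the one with the larger
# cumulative reward, not the one whose first action looks best.
path_a = [1.0, 0.0, 0.0, 0.0]   # great first action, nothing afterwards
path_b = [0.2, 0.5, 0.5, 0.5]   # modest start, better overall

print(discounted_return(path_a))  # 1.0
print(discounted_return(path_b))  # ~1.67
```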

This leads to one of the biggest problems in AI: the exploration vs. exploitation dilemma, which will be as crucial as ever with the advent of agents into our lives.

While a given action may initially seem the best, allowing the model to explore apparently less attractive options can lead to discovering new, unexpected paths that yield greater cumulative rewards.

Figure (source: HuggingFace): while the restaurant on the right seems more appealing based on the current state (its exterior), the one on the left might have better food. Thus, we allow the AI to explore before deciding.
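One common (though by no means the only) way this trade-off is handled in practice is epsilon-greedy selection: most of the time the agent exploits the action it currently believes is best, but occasionally it explores a random one. A toy sketch, with made-up value estimates for the two restaurants above:

```python
import random

# Toy sketch of the exploration/exploitation trade-off via epsilon-greedy selection.
# The action-value estimates below are made-up numbers for the restaurant example.
value_estimates = {"right_restaurant": 0.8, "left_restaurant": 0.3}

def choose_action(epsilon=0.1):
    if random.random() < epsilon:
        # Explore: occasionally try an apparently worse option; it might be underrated.
        return random.choice(list(value_estimates))
    # Exploit: otherwise pick the action that currently looks best.
    return max(value_estimates, key=value_estimates.get)

print(choose_action())  # usually "right_restaurant", sometimes "left_restaurant"
```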

But why am I telling you this?

Simple: The agents every CEO mentions to his/her board or at a quarterly earnings call must meet both definitions (they aren’t aware of this, but it is what it is). They take action on our behalf and are also RL-trained, and that makes a crucial difference in deciding where they fit perfectly… or where they are a terrible, cash-burning choice.

Still, the definition I’ve just given you, even if far more detailed than what AI “influencers” usually offer, falls short of conceptualizing agents in the way that will prime you for success.

Let’s solve this.

An Accurate Representation of Agents

How do we actually train an agent? An agent’s training breaks down into three components:

  1. Environment: The agent is ‘placed’ in an environment that it can observe and measure.

  2. Policy: It learns an ‘action policy’ that chooses the following action after measuring/observing the environment. This policy considers the environment state (a snapshot of the environment at a given time) and determines the action based on that.

  3. Reward: A reward mechanism measures the quality of the action based on two variables: its intrinsic reward and the discounted cumulative reward (the expected reward of future actions based on this one).
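Before the example, here is a deliberately tiny, self-contained sketch of how the three components fit together in a training-style loop. The environment, the reward values, and the random ‘policy’ are all made up for illustration; a real agent would replace the policy with a trained neural network and would actually learn from the rewards it collects.

```python
import random

# Toy, self-contained sketch of the three components. Everything here is made up:
# the environment is a one-dimensional line, and the "policy" is just random.

class ToyEnvironment:
    """Component 1: an environment the agent can observe and act on."""
    def __init__(self):
        self.state = 0                        # the agent starts at position 0

    def step(self, action):                   # action is -1 (left) or +1 (right)
        self.state += action
        reward = 1.0 if self.state == 5 else -0.1   # Component 3: the reward mechanism
        done = self.state == 5                # the goal: reach position 5
        return self.state, reward, done

def policy(state):
    """Component 2: the action policy. A real agent would use a trained neural
    network mapping the observed state to an action distribution; here it's random."""
    return random.choice([-1, +1])

env = ToyEnvironment()
state, total_reward = env.state, 0.0
for _ in range(1000):                         # cap the episode length
    action = policy(state)                    # the policy observes the state...
    state, reward, done = env.step(action)    # ...acts on the environment...
    total_reward += reward                    # ...and collects the reward signal
    if done:
        break

print("cumulative reward:", round(total_reward, 2))
```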

Let’s use an example to drive this home. Exciting new features like Anthropic’s Computer Use (released in beta), Google’s Project Mariner, and OpenAI’s unannounced yet acknowledged agent product are great examples.

These are what we call ‘computer agents.’

  1. Their environment is the GUI (Graphical User Interface) of a computer. That is, their ‘world’ is the computer screen.

  2. The action policy is the distribution of actions the agent can execute given the current state of the environment and the previous set of chosen actions in the action path. For instance, clicking a button or typing a new URL in the browser’s search bar. The action policy is usually a neural network that has learned the policy through gradient descent, making most agents Deep RL agents (trained with neural networks, like ChatGPT). A minimal sketch of such a policy follows right after this list.

Agents are mostly modeled with Markov Decision Processes (MDPs). The defining characteristic of a Markov chain is the Markov property, which states that the probability of transitioning to the next state depends only on the current state and not on the sequence of previous states, as if a human decided their next action based solely on the immediate state of their environment.

Of course, that makes agents memoryless with regard to the environment, which limits their capacity in many situations (this is crucial when asking yourself if a task can be ‘agentic’). The reason behind choosing this constraint is that modeling every single previous state is a computationally intractable problem.

  3. The reward evaluates the quality of the chosen action, given its execution state. Simply put, this evaluator checks whether the agent’s action brings the agent closer to fulfilling its goal.
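Before we get to why the reward is the tricky part, here is the minimal policy sketch promised above: a hypothetical probability distribution over GUI actions computed from the current state only (the Markov property in action). The action names, state features, and scores are all made up; in a real computer agent, a neural network would produce these scores from a screenshot.

```python
import math

# Sketch of an action policy for a 'computer agent': a probability distribution
# over GUI actions computed from the CURRENT state only (the Markov property).
# The action names, state features, and scores are made up; in practice a neural
# network would produce these logits from a screenshot of the screen.
ACTIONS = ["click_button", "type_url", "scroll_down"]

def action_policy(current_state):
    logits = [current_state.get(action, 0.0) for action in ACTIONS]
    exps = [math.exp(logit) for logit in logits]
    total = sum(exps)
    return {action: e / total for action, e in zip(ACTIONS, exps)}  # softmax

# A snapshot of the environment (the screen) at a given moment:
state = {"click_button": 2.0, "type_url": 0.5, "scroll_down": -1.0}
print(action_policy(state))   # e.g. {'click_button': 0.79, 'type_url': 0.18, ...}
```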

As you may imagine, rewards are the tricky part here, to the point they are OpenAI’s moat with o3.

The evaluator can be shaped in several ways:

  1. It can be a formula, like calculating the entropy of the action distribution. Without going into detail, entropy measures how “convinced” the model is of the chosen action, which serves as a proxy for quality. In other words, it’s a way to mathematically compute a confidence score for the model’s output (see the short sketch after this list).

  2. It can be an LLM acting as a judge. Upon each action, an LLM critiques the output with natural language.

  3. It can be another neural network. RL agents like AlphaZero (Google’s superhuman Go and Chess agent) use a value network that scores the quality of the action by ‘looking into the future’ (i.e., estimating how likely the agent is to meet the goal once future actions are performed, without having to simulate those actions).

  4. It can be done through traditional search methods like Monte Carlo Tree Search (MCTS). While searching different solution paths, the model simulates future actions after the current one until arriving at a terminal state (known as a roll-out). Think of this as the model ‘playing out’ the rest of a chess game and seeing whether this action leads to victory or defeat.

  5. Bayesian methods. The least explored of the five, here the model applies Bayes’ theorem, a methodology in which it updates its beliefs (the posterior distribution) based on its current beliefs (the prior distribution) and gathered observations (the evidence). This is similar to how the human brain works (continuously learning about the world by making guesses based on current beliefs and using feedback to update them). However, in its original form, it is computationally intractable (modeling the posterior requires contemplating every possible next move). In this category we find GFlowNets, which use an amortized version of Bayes’ theorem (making it computationally tractable) and which I have talked about in detail in my Notion database (here and here), as they are rumored to be another of OpenAI’s secrets to o3.
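Here is the short sketch promised in option one: using the entropy of the action distribution as a confidence score. The probabilities are made-up examples; the point is simply that a peaked distribution (the model is ‘convinced’) has low entropy, while a near-uniform one has high entropy.

```python
import math

# Sketch of option one: the entropy of the action distribution as a confidence
# score. The probability values below are made-up examples.
def entropy(probabilities):
    return -sum(p * math.log(p) for p in probabilities if p > 0)

confident_action = [0.90, 0.05, 0.05]   # model strongly favors one action
uncertain_action = [0.34, 0.33, 0.33]   # model is basically guessing

print(entropy(confident_action))  # low entropy  -> high confidence (~0.39)
print(entropy(uncertain_action))  # high entropy -> low confidence  (~1.10)
```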

Currently, most AI models fall into options one to three. Option four requires forward-looking simulations, which are computationally expensive, and option five is largely unexplored.

But you can also combine many of these.

For instance, OpenAI’s o3 model allegedly combines options one, two, and four (and five, but I won’t go into detail here for length’s sake). LLMs acting as critique models (option two) and computationally verifiable evaluations through symbolic engines, like a code interpreter/compiler that checks whether the code runs and meets the goal (option one), are both used to evaluate the quality of every thought in each chain of thought the model generates while searching the solution space (evaluating multiple ways to solve the problem, aka option four). They do so to avoid computing the roll-outs MCTS requires, instead validating intermediate thoughts through options one and two.

Alternatively, some researchers rely on MCTS in its natural form (with roll-outs, option four), combined with entropy measurement (option one), like Alibaba with Marco-o1.

Marco-o1’s inference process

For the geeks out there, option four is non-differentiable, meaning you can’t apply gradient descent. Instead, the simulation is executed, resulting in a scalar score (win/lose, using the chess example again) that updates the model's beliefs about which option is best. Researchers avoid non-differentiable methods unless strictly necessary.
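To see what a roll-out and its scalar score look like, here is a toy sketch of just the simulation step: random Monte Carlo roll-outs on a made-up counting game, without the tree-search bookkeeping a full MCTS implementation would add.

```python
import random

# Toy sketch of a roll-out producing a scalar score. The "game" is a made-up
# counting game (reach exactly 10 to win), not a real chess or Go engine, and
# this is only the simulation step, not a full MCTS with tree bookkeeping.
TARGET = 10

def rollout(position, steps_left=20):
    """Play random moves until the game ends, then return a win/lose scalar."""
    while steps_left > 0 and position < TARGET:
        position += random.choice([1, 2, 3])
        steps_left -= 1
    return 1.0 if position == TARGET else 0.0

def estimate_action_value(position, action, simulations=1000):
    """Average the scalar outcomes of many roll-outs taken after `action`."""
    return sum(rollout(position + action) for _ in range(simulations)) / simulations

# From position 6, which move looks best according to simulated futures?
for action in (1, 2, 3):
    print(action, estimate_action_value(6, action))
```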

So, now, we are finally ready to give a proper definition of agents.

An agent is an AI model, usually a neural network, trained using one of the five RL reward methods above (or a combination of them). It is assigned a goal in a constrained environment that it must fulfill by executing actions according to an action policy trained to maximize the cumulative reward of those actions. The policy is executed sequentially, with each action conditioned on the current state of the environment (making it an MDP).

This is the definition that must stick in your mind, not the loose explanation people will give you when they want to pretend they know what they are talking about or sell you their latest product/feature.

Now that we have finally acknowledged agents for what they are, and before we move into their limitations and, crucially, their most significant uses in 2025, let’s start driving the definition home with what is unequivocally the general approach to implementing them.

Implementing an AI Agent

If you thought understanding what an agent is sounded complicated, that’s not even the hardest part. The main pain point with agents is actually making them work.

But how? To answer this and avoid disaster, we need to answer two questions:

  1. What’s the general framework the agent must follow?

  2. What components make up an agent? (If you thought agents were just an AI model, you’re deeply wrong.)

Answering the first question, most successful agent implementations involve four well-defined steps.
