TheWhiteBox by Nacho de Gregorio

AGENTS
Anti-Hype ChatGPT Agent Deep Dive
I was preparing another piece of content, but the release of ChatGPT Agent demands a detailed examination. We’ve been hearing about agents for a long time; ‘2025 is the year of agents,’ they said.
Well, now we have our first frontier agent product. OpenAI is the first among the top AI Labs—again—to release an agent product.
The industry has gone crazy for it, but is this really the ground-breaking product it’s made out to be?
Today, we are looking at:
What it is capable of
What it is (and what it isn’t), and how it works under the hood
Lastly, how real is the risk: what it means for software, jobs, particularly affected industries, and, of course, the direction of the industry.
Let’s dive in.
Very Promising Results
In a nutshell, it’s a purpose-built reasoning model that has two core capabilities:
Deep research and report generation
Computer use, to surf and interact with the web
That is, ‘Agent’ is akin to ChatGPT’s ‘Deep Research’ and ‘Operator’ having a child, enabling it to perform complex tasks through dozens or even hundreds of sources (deep research), while also having the capability to interact with these sources and execute actions on them (Operator).
For instance, you might ask it to:
Comb through your last quarter’s two thousand customer support tickets, identify the top three recurring issues, and draft a concise action plan memo for you.
Generate a fully formatted twelve-page Dungeons & Dragons adventure PDF, complete with NPC stats, maps, and encounter descriptions tailored to your campaign.
Analyze your company’s latest profit-and-loss spreadsheet and automatically build a polished PowerPoint deck highlighting key financial trends and strategic recommendations.
In other words, long-horizon tasks requiring extensive Internet searches, heavy tool execution, and the capability to execute extended plans.
Some real examples (I preemptively apologize for the cringeness of some of the posts, X is what it is these days) include:
PPTX Creation Success. Requested a presentation on "Legística Material" in Portuguese. Delivered smoothly as the first prompt test.
Real-World Research. Used for autonomous research, Excel file assembly with formulas, and PowerPoint creation. It handled complex work effectively in early access.
Retirement Plan. Crafted a complete early retirement strategy, including tax research and investment scenarios, in just 20 minutes—equivalent to expert-level advice (or so they claim).
Automated Workflow. Observed Agent in action for tasks like data analysis and document creation—loved the seamless, hands-off process.
Multi-Demo Showcase Success. Demonstrated browsing, code execution, app logins, and task completion. It includes several examples to illustrate each case.
Practical Task Execution Mix. Article by The Verge, explains how ChatGPT Agent handled parking requests, date night planning with calendar/OpenTable integration, and competitive analysis reports—strong on utility but noted latency issues.
What ChatGPT Agent Is and What It Isn’t
As we explained on Thursday, ChatGPT Agent is capable of executing various tasks in a browser, coding, and, via OpenAI connectors, integrating with your email and other productivity apps.
Just like we explained in our agent framework at the beginning of the year, the system can be broken down into the four usual components:
The AI
Memory
Tools
Multi-agent orchestration
The final solution is something similar to what you’re seeing in the graph below, but with some important improvements I’ll explain next:

We’ll first look at the overall system and then dive deeper into every component.
The Overview/System
Just like we predicted in our agent’s newsletter post months ago, ‘Agent’ follows the staple Plan + ReAct pattern. That is, the plan → execute → evaluate loop.
Initially, the user makes a request such as the ones above.
A planning AI model that may or may not be the same model that executes (most probably not) drafts a detailed plan.
It then starts the ReAct pattern, which is a fancy way of describing a loop that will:
First reason about the plan (where we are, what to do now, aka understanding the current state of the task)
Execute an action, most likely via some tool execution (i.e., calling your email client API to retrieve emails)
Generate an output that a critique model then evaluates. This critique model also determines whether the plan is finished or requires changes, defining the new state.
Repeat until a condition is met that exits the loop (as determined by the critique model).
In more advanced agents like today’s protagonist, we also include a user-intervention mechanism, where the user can proactively intervene if deemed necessary, or the system decides to go back to the user for clarifications if required (e.g., when it’s about to execute a tool with safety implications, like sending money).
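The loop described above can be sketched in a few lines of Python. To be clear, this is hypothetical scaffolding, not OpenAI’s actual implementation: `plan`, `reason`, `act`, and `critique` are stand-ins for model and tool calls.

```python
# Hypothetical sketch of the Plan + ReAct loop described above.
# plan(), reason(), act(), and critique() stand in for model/tool calls.

def plan(request):
    # A planning model drafts an ordered list of steps.
    return [f"step for: {request}"]

def reason(state):
    # The model assesses where it is in the plan (the current state).
    return f"thinking about {state['steps'][state['i']]}"

def act(thought):
    # Execute a tool (browser, code interpreter, email API, ...).
    return f"result of ({thought})"

def critique(state, result):
    # A critic model judges the result and decides what happens next:
    # 'continue', 'done', or 'ask_user' (the human-in-the-loop check).
    state["outputs"].append(result)
    state["i"] += 1
    return "done" if state["i"] >= len(state["steps"]) else "continue"

def run_agent(request):
    state = {"steps": plan(request), "i": 0, "outputs": []}
    while True:
        verdict = critique(state, act(reason(state)))
        if verdict == "ask_user":
            state["steps"].append(input("Clarification needed: "))
        elif verdict == "done":
            return state["outputs"]
```

The structure is the point: a drafted plan, a reason → act → critique cycle, and an exit condition owned by the critic rather than the executor.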

ReAct Pattern. Source: IBM
While this provides an excellent overview of the overall process, it fails to reveal the real magic that occurs behind the scenes.
Let’s look at each block in more detail.
The AI
Under the hood, they have fine-tuned an AI model specifically for the task (probably o3 Deep Research, o4-mini Deep Research, or this could even be the first appearance of o4).
This means clicking on the ‘Agent’ button is not only a UX decision, but also a signal to switch into this model. It’s also important to note that with the imminent release of GPT-5, all models will become wrapped into the same system (GPT-5 won’t be a model, but many) with a router that will decide for you what model is best for every task.
The reason this feature requires a specific model is that we are starting to see new, task-adapted models for every new feature these labs release, which is a direct result of the fact that these are all Reinforcement-Learning trained models.
These models have been specifically trained for the task by assigning them a complex goal and having them execute dozens of tools, hundreds of searches, and a plan, and then retraining the model on the successful executions.
The issue with RL training is that it’s almost guaranteed to entail a loss of performance in other areas, so Labs are simply creating new models for each task. This might sound irrelevant, but it means a lot; it’s undeniable proof of the playbook that soon enterprises will need to follow to deploy AI models.
No RL fine-tuning = almost-guaranteed enterprise deployment failure. But more on that later.
The reason for all this is that, as we’ve commented several times, while imitation learning will help us create models with okay-ish performance in most areas, RL excels at depth (being very good at a particular task) at the expense of breadth; it’s like putting a generalist engineer (the pretrained model) through a PhD in Quantum Thermofluidic Topologies for Entropy-Stabilized Microbubble Propulsion in Anisotropic Media.
The outcome is a human expert in that particular area, at the expense of years of ignoring other fields.
To further prove this point, with OpenAI’s unofficial gold medal in the 2025 Math Olympiad (we’ll talk about it in detail in Tuesday’s news rundown for Premium members), researchers confirmed that the model is also a fine-tuned version.
A few weeks ago, we predicted the new generation of models would be a ‘menagerie of agents’, and that prediction is starting to look pretty good.
AGI will most likely be a manifold of AI models, not a God AGI.
And it is at the AI layer where we start to see a potential moat for model-layer companies, like OpenAI or Anthropic, compared to application-layer companies like Genspark or Manus, which have built strikingly similar products to the one we are discussing today.
While both are building the same products, the former owns the entire stack and can fine-tune the models to their liking, adapting behaviors without having to justify it to anyone.
Conversely, the latter have to adapt to what OpenAI/Anthropic provides, and while they might be able to introduce some fine-tuning, that comes at the expense of giving their complex data back to their model providers. Hold this thought for later when we discuss repercussions.
The Memory Stack
The next piece in the puzzle is memory, or how we provide models with the appropriate context for every single interaction they have. This is the famous idea of “context engineering”, and it is quite possibly the most critical predictor of an agent’s performance, along with end-to-end RL training.
Here, most agent applications leverage internal short-term memory: the model has access to recent interactions at two levels:
The task level (remembering earlier decisions in the current task)
The session level (remembering past task executions)
However, the real unlocker, and what sets apart tools like ChatGPT Agent or Manus, is long-term memory. These are much more intricate mechanisms that may include:
Memory-formation events, where the model identifies something worth remembering, and stores it for all future interactions with that user.
Content condensation, distilling previous conversations, or certain outcomes, into summaries that will be easily retrievable in the future
Storage of single-point facts that could be relevant in the future (i.e., the user saying they always expect brief interactions, or specific output formats).
More technically, while short-term memory is most likely inserted in full into the prompt, assuming recency bias (recent interactions are usually more relevant), long-term memory is added mostly via semantic retrieval (i.e., the model takes the user’s query and performs cosine similarity against the chunks of text in the data store, a vector database, to find those semantically related to the given task).
Semantic retrieval has gotten a lot of hate lately, mostly because people once thought it ‘was all you needed’, which was, of course, dead wrong. But it’s still very relevant, to the point that just days ago, Google launched a state-of-the-art Gemini embeddings model for this precise use case.
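Mechanically, semantic retrieval is simple: embed the query, then rank stored memory chunks by cosine similarity. Here is a minimal sketch; the toy 3-dimensional vectors and memory entries are made up for illustration, where a real system would use an embeddings model (e.g., Gemini embeddings) and a vector database instead of a Python list.

```python
# Minimal sketch of semantic retrieval over long-term memory.
# Toy 3-d vectors stand in for real embedding-model outputs.
import math

def cosine(a, b):
    # Cosine similarity: dot product over the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, memory, k=2):
    # memory: list of (text, embedding) pairs from long-term storage.
    ranked = sorted(memory, key=lambda m: cosine(query_vec, m[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

memory = [
    ("user prefers brief answers", [0.9, 0.1, 0.0]),
    ("user's report format is PDF", [0.1, 0.9, 0.0]),
    ("user is based in Lisbon",     [0.0, 0.1, 0.9]),
]
print(retrieve([0.8, 0.2, 0.0], memory, k=1))  # → ['user prefers brief answers']
```

Only the top-k chunks get inserted into the prompt, which is what keeps long-term memory from blowing up the context window.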
Connectors
The next fundamental component of ChatGPT Agent is the tools that allow it to take action. Extensive tool use is quite possibly the major unlocker we are seeing with these models these days.
Precisely, as Kimi K2 proved, one of the primary ways to improve model performance is to teach it to use tools extensively. Here, we must consider four tool types:
Headless browser. The ChatGPT Agent includes virtualized computers, likely utilizing Docker containers and Kubernetes, that open computer interfaces for the model on the fly, providing access to a browser and file system (for memory access). Here, the model directly interacts with the virtualized user interface (so it doesn’t need yours), which you can see in real-time in the ChatGPT app.
Coding tools. The model also has access to code interpreters that support certain languages, such as Python, which the model can use to write code or perform mathematical operations, or to serve as a thinking scratchpad (agents rely heavily on coding for reasoning).
File system. A series of tools that allow the agent to interact with an operating system, be that yours (to send you messages) or the virtualized one.
Connectors and MCPs. For access to third-party software, the agent has access to native tools, called ChatGPT connectors, which include Gmail, GitHub, and SharePoint, and will soon include access to MCP Servers, providing the agent with access to hundreds of other tools.
MCP is not currently supported in the ChatGPT app (though it soon will be), but it is supported via the API (you can expose MCP Servers to models accessed through the API).
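Stripped to its essentials, a connector is just a named tool the model can invoke with structured arguments. The sketch below is a hypothetical dispatch layer; the tool names and handlers are made up and are not OpenAI’s actual connector API.

```python
# Hypothetical tool-dispatch layer: each connector registers a handler,
# and the agent invokes tools by name with structured arguments.
# Tool names and handlers are illustrative, not OpenAI's real connectors.

TOOLS = {}

def tool(name):
    # Decorator that registers a handler under a tool name.
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("gmail.search")
def gmail_search(query: str):
    # A real connector would call the Gmail API here.
    return [f"email matching '{query}'"]

@tool("python.run")
def python_run(code: str):
    # A real agent would run this inside a sandboxed interpreter.
    return eval(code)

def call_tool(name, **kwargs):
    # The agent's single entry point for any tool or connector.
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)

print(call_tool("python.run", code="2 + 2"))  # → 4
```

The appeal of MCP is precisely that it standardizes this registry/dispatch pattern across vendors, so the agent doesn’t need bespoke glue for every new tool.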
And last but not least, we have the biggest ‘new thing’, the orchestrating system.
The Orchestrating System
It has been confirmed that, pretty much like Grok 4 Heavy or Claude Research, ChatGPT Agent is a multi-agent system. The multi-agent system grows in two ways:
Vertically, an orchestrator and subagents
Horizontally, deploying multiple teams in parallel to increase statistical coverage.
As for the vertical agent distribution, the underlying architecture will most likely resemble Claude Research’s agent system, shown below, a multi-layered system of agents working in unison to accomplish a task, aiming for a separation of concerns.
If we take the first part of the process, planning, as an example, we can deploy a planning leader who lays out the grand plan, and sub-agent planners refine each step before the planning leader's validation.
This not only reduces context windows, a key issue with current agents (they may run out of working memory), but it is also potentially much cheaper, as sub-planners can be smaller models. Here, the large model, the most sophisticated, primarily acts as a verifier/approver.
On the other hand, we have horizontal scaling, where we deploy several of these vertically scaled systems working on the same task. Here, we can choose different decision heuristics (such as the best of N, majority vote, or even Monte Carlo Trees) to select the final response sent to the user.
This is, of course, redundant, but it increases statistical coverage, in a similar way to how you might consider several possible ways to tackle a problem in your mind, rejecting all but the chosen one.
Advanced systems like ChatGPT Agent are likely to use both multi-agent approaches, and the multi-layered agent system may originate from even a lower level, from the model itself (o3 vs o3-pro, or Grok 4 vs Grok 4 Heavy).
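Horizontal scaling with a majority-vote heuristic can be sketched in a few lines. `run` in the example is a stand-in for an entire vertically-scaled agent team; the teams and answers are invented for illustration.

```python
# Sketch of horizontal scaling: run N independent agent teams on the
# same task and pick the final answer by majority vote. Each "team"
# stands in for a full vertically-scaled agent system.
from collections import Counter

def majority_vote(answers):
    # Ties are broken by whichever answer appeared first.
    return Counter(answers).most_common(1)[0][0]

def solve(task, teams):
    # Each team is a callable returning one candidate answer.
    return majority_vote([team(task) for team in teams])

teams = [lambda t: "A", lambda t: "B", lambda t: "A"]
print(solve("pick an option", teams))  # → A
```

Best-of-N with a scoring model, or Monte Carlo tree heuristics, would slot into the same `solve` shape; only the selection function changes.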
Having understood the key technical details of this release, we now focus on what matters most. Knowing what ChatGPT Agent is, is one thing; understanding what it isn’t is more relevant.
Is this the product that marks the beginning of the AI revolution? The era of post-labor economics?
Is this the product that will massively displace jobs?
As always, the answer is: depends on who you are.

Subscribe to Full Premium package to read the rest.