It was inevitable.
Writer RAG Tool: build production-ready RAG apps in minutes with simple API calls.
Knowledge Graph integration for intelligent data retrieval and AI-powered interactions.
Streamlined full-stack platform eliminates complex setups for scalable, accurate AI workflows.
THEWHITEBOX
TLDR;
Technology:
🤩 2024, What a Year
🥳 Cerebras’ Trillion-Parameter Model
Markets:
🧐 AI As A Tool for Phishing
🤓 OpenAI’s AGI Definition
Product:
😍 Okta’s Auth for Agents Platform
🫢 The First AI Movie
TREND OF THE WEEK: The Inevitability of Latent Reasoning Models
RECAP
2024, What a Year
The Information has released a series of graphs (I’m showing two, but the article has more), providing a very visual view of AI in 2024. First, we saw the massive AI spending sprees of the four largest Hyperscalers (Microsoft, Amazon, Google, and Meta), accounting for over $200 billion, a CAPEX larger than many European countries’ annual GDP.
This trend will only accelerate in 2025 with large reasoner models (LRMs) like o3, as these models are much more compute-intensive when run.
On the other hand, we witnessed a decrease in token prices (how we measure the cost of running Generative AI models), with examples like the GPT family, whose prices have dropped by 85% in less than two years.
If we regard token prices as the 'cost of intelligence' ('intelligence' is doing a lot of work here, but bear with me), this several-orders-of-magnitude decrease in costs is essential to making these tools widely accessible.
That said, I'm skeptical about how much further these costs can fall if compute demands increase exponentially, unless energy costs drop to basically zero.
For that reason, nuclear energy, as well as wind and solar, will play a crucial role in AI's short-term future. Expect many more PPAs (power purchase agreements) between Big Tech companies and the utility sector to secure access to power plants, as the Small Modular Reactors that will fuel most future data centers aren't coming online before 2030 at the very least.
HARDWARE
Cerebras’ Trillion-Parameter Model
Cerebras Systems, in collaboration with Sandia National Laboratories, has achieved a significant milestone by training a 1-trillion-parameter AI model on a single CS-3 system, its flagship AI training machine. This accomplishment, announced at NeurIPS 2024, showcases the capabilities of Cerebras' Wafer Scale Cluster technology.
Traditionally, training models of this magnitude necessitates thousands of GPUs, extensive infrastructure, and specialized teams. In contrast, the CS-3 system, paired with a 55-terabyte MemoryX device, managed this feat without requiring any modifications to the model or infrastructure code. This streamlined approach significantly reduces the complexity and resources typically associated with large-scale AI training.
Following the initial success, the model was seamlessly scaled across 16 CS-3 systems, achieving near-linear performance improvements with a 15.3x speedup.
TheWhiteBox’s takeaway:
What makes Cerebras' chips different from NVIDIA's GPUs, and how did that difference enable this remarkable training accomplishment?
The key difference is how they approach chip development.
Chips are electronic circuits sliced and diced from a circular silicon wafer after being tested (defective dies, which are common, are usually discarded). A GPU chip is then mounted on a circuit board alongside the other components (CPU and memory chips, among others, for intra-board communication and networking).
Cerebras, instead, takes the entire wafer and treats all these smaller dies as one chip. The crucial reason this is possible is that their technology is 'defect-aware,' meaning that even the dies that malfunction are used as part of this grand chip. In other words, the computations performed by this massive-scale chip account for the defects in real time, allowing the company to use all of them.
These chips are then connected to huge external memory sources, leading to a large-scale server with around 4 trillion transistors. For reference, NVIDIA’s latest GPU, Blackwell, has 208 billion transistors, almost 20 times less, giving you an idea of the power difference.
That’s why they manage to train such massive models so easily, and their inference performance (the rate at which models produce outputs) is basically unheard of.
The only reasons you wouldn't be extremely bullish on Cerebras' future (they are eyeing an IPO) are that their customer base is very concentrated (the capital costs required to buy this hardware are insane), even more so than NVIDIA's, and that China has recently shown how much untapped potential our hardware has by training a state-of-the-art model on sub-par GPUs for less than $10 million.
This could lead to many labs cutting their model-growing efforts and brainless spending and focusing on squeezing as much performance as possible from what they currently have—although the huge inference demand coming soon makes me doubt that will be the case.
CYBERSECURITY
AI as a Tool for Phishing
In what is a secret to no one, corporate executives are facing a surge in highly personalized phishing scams generated by AI bots.
These sophisticated attacks utilize AI to analyze online profiles, crafting emails that closely mimic the tone and style of legitimate communications.
TheWhiteBox’s takeaway:
Since ChatGPT's release, it has been very clear that AI would be used to generate better-quality phishing emails. Of course, this has led to more attacks, but also to filler articles like this one.
What these articles always forget to mention is the ‘okay, so what now?’ part. Yes, hackers and scammers have become more sophisticated, but companies can also improve their phishing defense mechanisms:
Improve email filters with AIs that detect, with reasonable accuracy, whether an email was written by an AI (see the sketch after this list),
Introduce multi-signature approvals so that no single executive can release sensitive information on their own,
Adopt more secure login mechanisms that require biometric data, so that a stolen password is useless without it.
And many more.
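To make the first point a bit more concrete, here is a hypothetical sketch of how such a filter could route messages. The `score_ai_likelihood` function, the threshold, and the routing rules are all placeholders of my own, not a real library or product; in practice, a company would plug in a fine-tuned classifier or a vendor detection API.

```python
# Hypothetical sketch of an 'AI-written email' filter, not a production system.
# score_ai_likelihood stands in for a real detector (a fine-tuned classifier,
# a vendor API, etc.); the threshold and routing rules are illustrative.
from dataclasses import dataclass


@dataclass
class Email:
    sender: str
    subject: str
    body: str


def score_ai_likelihood(text: str) -> float:
    """Placeholder detector: returns a 0-1 score of how likely the text is AI-written.
    Replace with a trained classifier; this stand-in simply returns a neutral score."""
    return 0.5


def route_email(email: Email, threshold: float = 0.8) -> str:
    """Quarantine for human review rather than block outright, since detectors
    are only accurate to a reasonable degree."""
    if score_ai_likelihood(email.body) >= threshold:
        return "quarantine"
    return "deliver"


if __name__ == "__main__":
    msg = Email("ceo@example.com", "Urgent wire transfer", "Please process this payment today.")
    print(route_email(msg))  # 'deliver' with the neutral placeholder score of 0.5
```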
I'm very triggered by articles that only look for engagement and shares instead of offering a more balanced, less catastrophic view of AI. Otherwise, we are simply generating fear for nothing.
FUTURE
OpenAI's AGI Definition
Microsoft and OpenAI have finally given a name and surname to Artificial General Intelligence, or AGI, and that is the point at which AI companies generate $100 billion in profits directly from AI.
That means that, according to OpenAI, AGI is easily decades away.
TheWhiteBox’s takeaway:
Microsoft and OpenAI have given AGI such a financial perspective for a simple reason. According to their agreement, Microsoft’s grip on OpenAI disappears once AGI is achieved.
This is no small grip, as Microsoft takes a 20% cut of OpenAI’s profits and owns the IP of its models in exchange for cheaper access to computing and capital; OpenAI is basically a Microsoft subsidiary.
As Satya Nadella, Microsoft's CEO, himself said about OpenAI, Microsoft is "below them, above them, around them." As for the numbers: if AGI seems far away in terms of technology, when we focus on the finances, it seems even farther away.
For starters, no AI company building models is profitable, not even close. OpenAI has reportedly lost $5 billion just this year and doesn't expect to turn a profit until 2029, only after reaching a staggering $100 billion in revenue, which, for reference, is almost as large as NVIDIA's revenues for 2024.
The key strategy to reach that value is for companies to adopt AI.
Currently, most of OpenAI's revenue comes from end users, as companies struggle to implement this technology. The cause is quite simple: companies require robust technology that can be trusted at scale and kept within cost forecasts.
However, even after amazing releases like o3, these models still struggle with hallucinations (even more so with LRMs), making any LLM/LRM deployment outside iterative environments like coding (where errors are tolerated as part of the process) a nightmare.
AGENTS
Okta’s Auth for Agents Platform
Okta is an identity and access management (IAM) platform that provides secure authentication and authorization services for businesses and organizations, including the likes of OpenAI.
Now, with the arrival of the agent era, they are releasing an authentication platform exclusively for agents, aimed at helping developers give AI agents least-privileged access to sensitive data and secure API access to take action on behalf of users.
In layman’s terms, it manages the authorizations agents have to access certain APIs and data for you.
TheWhiteBox’s takeaway:
Agents are poised to become a foundation of enterprise software. However, most disruptive technologies hit a cybersecurity wall on their way to adoption, as they rarely meet the required security standards.
It is fundamental to have a reliable way of ensuring agents have the minimum privileges needed to do their work, preventing unnecessary data breaches (e.g., a low-level employee gaining access to high-level data through their agent).
Moreover, if you watch carefully, you will see many areas of software adapting to agents already.
Websites are adding llms.txt files alongside their browser and API versions so that LLMs can parse their content.
Many huge SaaS companies, like Salesforce or Workday, are slowly transitioning into agent software companies to prevent being killed by this new paradigm.
And now Okta is also streamlining auth control for these agents.
If Marc Benioff, Salesforce’s CEO, is remotely correct in that by the end of 2025, we will have 1 billion deployed agents worldwide, the industry needs to mature fast. Luckily, it seems that incumbents have finally understood the assignment; research is pointless if it isn’t grounded in applicable reality.
VIDEO GENERATION
The First AI Movie
An Indian team is preparing the release of the first Bollywood AI-generated movie (and probably the first AI movie full-stop). The link includes a trailer for the movie.
TheWhiteBox’s takeaway:
Achieving this is much more complex than it seems. For one, most frontier AI models can only generate minute-long sequences, meaning that to create a movie, they need to stitch together many separate generations.
This is an extreme challenge. AIs are already pretty inconsistent within a single video sequence (objects change shape, physics aren't perfect, etc.), and you also have to condition the next sequence of frames on the previous one. This can be done but requires extensive video production work (which the team is doing).
The other big challenge they face is culture. Most AI models are Western-biased, meaning they aren't great at generating sequences tailored to Indian traditions due to their lack of knowledge of those areas. This has led the team to record some sequences with human actors and then 'inpaint' the actors' faces with AI-generated imagery.
The beautiful thing is that this perfectly showcases how AI isn't a tool that will one day substitute workers in the video industry; it will empower them to make movies faster and with more creative freedom, but it will inevitably require work from the very same people who create movies through traditional methods today.
Not even once we achieve AGI will this change; AI is a tool that unlocks creative workers and transforms how they work, but it won't be capable of generating full-length movies without extensive human participation.
TREND OF THE WEEK
Meta’s Latent Reasoners
It was inevitable, and Meta seems to have finally connected the dots.
Among the many counterintuitive aspects of AI models, one is particularly ‘inhuman’ in its approach: reasoning.
In fact, they are generally very poor reasoners, and one crucial reason is that they can only reason through language. In other words, they think while they speak, instead of thinking before they speak, like your drunk friend at the bar on a Tuesday, leading to poor outcomes.
But that changes with Meta's new model, COCONUT, which stands for 'Chain of Continuous Thought.' It performs reasoning in the latent space, which, as we'll see later, is precisely what (sober) humans do, considerably improving models' reasoning capabilities and helping us unlock their true power.
This small article will explain how Large Language Models (LLMs) or Large Reasoner Models (LRMs) reason compared to humans and how Meta might have pulled them closer to us than ever before.
Another step in the direction of human-level reasoning? Let’s find out.
The Source of Reason
Imagine you had to speak out loud every single thought that crosses your mind; simply put, you had to generate language in order to think. Naturally, that makes no sense and leads to poor reasoning.
A Striking AI Limitation
In fact, neuroimaging research has widely shown that, during most reasoning acts, the parts of the brain in charge of language aren't even activated. In other words, language is a tool for communication, not for thinking, despite what many still claim (hold that thought for later).
However, that's precisely how our frontier AI models "reason." In AI parlance, 'they reason in the language space,' a cool way of describing what I just explained.
But what is really happening inside the LLM/LRM?
These models are autoregressive: they take in a sequence of words and output the next one, one token at a time.
This process involves four big steps:
Patching. The sequence is broken down into chunks. Most models use a tokenizer, a learned dictionary of tokens (words or subwords) that breaks the model's entire vocabulary into fixed parts. Newer models like Meta's BLT chunk the sequence dynamically, avoiding a static computational graph (to read more on this, click here). Focusing on tokenized models (most are this way), each chunk is assigned a number, called its index, which implies that a specific sequence will always be chunked in the same way.
Embedding. Using this index, we automatically obtain the embedding for that token from an embedding table (like a VLOOKUP in Excel). This gives us the vector form of the token, which is essential because AI models only understand numbers.
Processing. The model then processes the embeddings through its different layers, making them interact with each other and forming a general understanding of the sequence.
While being processed internally by the model, these embeddings are formally called 'latents,' because they are 'hidden' inside the model.
Decoding. Once processing is finished, the model chooses the next word, and the last embedding is 'decoded' back into language. This is the opposite of step 2: instead of turning words into vectors, we take the chosen word in vector form and 'unembed' it back into natural language through an 'unembedding table.'
Long story short, for AIs to ‘think’ they need to ‘speak.’
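To make those four steps concrete, here is a minimal sketch of a single prediction step, using GPT-2 from Hugging Face Transformers as a stand-in for any autoregressive model; the model choice and variable names are mine, purely for illustration.

```python
# Minimal sketch of one autoregressive prediction step: tokenize -> embed ->
# process -> decode. GPT-2 is used as a stand-in for any LLM.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

text = "The capital of France is"

# 1) Patching/tokenization: the sequence is chunked into fixed token indices.
input_ids = tokenizer(text, return_tensors="pt").input_ids

# 2) Embedding: each index is looked up in the embedding table (the 'VLOOKUP').
embeddings = model.transformer.wte(input_ids)  # shape: (1, seq_len, hidden_dim)

# 3) Processing: the layers turn these embeddings into internal 'latents'.
with torch.no_grad():
    outputs = model(inputs_embeds=embeddings)

# 4) Decoding: the final latent is projected back onto the vocabulary (the
#    'unembedding table'), and the most likely next token is chosen.
next_token_id = outputs.logits[:, -1, :].argmax(dim=-1)
print(tokenizer.decode(next_token_id))
```

Notice that every single output token must pass through step 4; the model literally cannot 'think' without producing language, which is exactly the limitation latent reasoning attacks.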
Conversely, humans ‘reason in the space of abstract concepts.’ Once we know what to say, we articulate our thinking.
But Meta might have finally made AIs do the “exact” same.
Reasoning over Latents
The problem with tokenization is that it isn't learned alongside the rest of the model during training. In layman's terms, it's imposed on the model, which can only chunk word sequences into a predefined form.
What Makes COCONUT Different
Specifically, that means the cost of processing a particular combination of input words, like a sentence, is always the same.
As we saw in our review of byte-level transformers, also by Meta, this prevents tokenized models (essentially all current models, including ChatGPT, Gemini, and Claude) from actively deciding how much compute they are willing to invest in a prediction, leading to an overdrive of unnecessary computation.
For instance, predicting the next word in the sequence, “The capital of France is…” is notoriously easier than “Give me a reason why China could surpass the US in AI.”
But what if we allowed the model to think for longer before making the prediction? That wouldn't be necessary for the first example, but a more complex sequence like the latter would benefit from the model thinking for longer internally before committing to a response.
This is the essence behind today’s breakthrough: latent reasoning models.
The idea is pretty straightforward. Looking at the four steps above, instead of decoding automatically (step 4), we take the last latent state and reinsert it into the model, which then processes the sequence plus that latent state. We loop on step 3 and only move to step 4 (decoding) once the model is ready to talk. In layman's terms, the model feeds its last-layer output back into itself to compute the next one, meaning these outputs stop being language and become 'thoughts.'
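Below is a rough sketch of that loop, not Meta's actual implementation: the last hidden state is appended as the next input embedding instead of being decoded into a token. The fixed 'thought budget' and the use of an off-the-shelf GPT-2 (which hasn't been trained to exploit these latents) are my own simplifications; in COCONUT, the model learns during training when to switch from latent thoughts back to language.

```python
# Rough sketch of latent-space reasoning (the COCONUT idea), NOT Meta's code:
# the last hidden state is fed back as the next input embedding instead of
# being decoded into a token. The fixed thought budget is a simplification.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "If Alice is older than Bob, and Bob is older than Carol, the youngest is"
embeds = model.transformer.wte(tokenizer(prompt, return_tensors="pt").input_ids)

LATENT_STEPS = 4  # assumed budget of internal 'continuous thoughts'

with torch.no_grad():
    for _ in range(LATENT_STEPS):
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        # The last layer's hidden state at the final position is a 'continuous thought'...
        thought = out.hidden_states[-1][:, -1:, :]
        # ...which is appended as the next input instead of being decoded into a word.
        embeds = torch.cat([embeds, thought], dim=1)

    # Only now do we decode (step 4): project the final latent onto the vocabulary.
    out = model(inputs_embeds=embeds)
    next_token_id = out.logits[:, -1, :].argmax(dim=-1)

print(tokenizer.decode(next_token_id))
```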
The difference is easy to picture. A standard LLM/LRM like ChatGPT, right after processing the sequence, is forced to produce the first word to continue it ('Paris' in the example we discussed above).
A COCONUT model, on the other hand, only produces the actual word the user sees once it's confident about it.
Intuitively, the difference is that a standard model 'thinks by speaking,' while COCONUT keeps processing the sequence internally and eventually produces a token that has received far more allocated compute before being generated.
In other words, while o3 generates the chain-of-thought (CoT) during the actual answer, COCONUT generates the CoT internally and then answers, like a human thinking in silence and then answering.
It’s important to note that this ‘thinking for longer’ idea occurs only if the model considers it necessary. This traces back to the dynamic compute allocation feature we mentioned earlier, which is necessary for creating smarter and more efficient models. If the model thinks it’s ready to speak, it does.
It’s also important to distinguish what I mean by ‘thinking for longer’ here, especially when we always frame LRMs as ‘LLMs that think for longer.’
The crucial difference is that a COCONUT model, whether an LLM or an LRM, thinks for longer internally. In contrast, a standard LRM thinks for longer by generating a language-based chain of thought, generating many more output tokens (words) during its thinking. However, a standard LRM still has the same flaw as any LLM: it needs to speak to think, precisely what we are trying to prevent here.
Smarter And More Thoughtful
Another important feature COCONUT introduces is more efficient inference. When we force a model to think while speaking, it also generates many tokens that enhance its fluency but don't necessarily provide any additional information.
For instance, the model might use better phrasal verbs, play with sentence structure to make its explanations less repetitive, and use all the other style-oriented words we add to language to provide coherence and make it digestible.
Instead, as COCONUT first thinks and then speaks, it goes straight to the point, generating considerably fewer tokens to answer.
Besides fluency-based tokens, current LLMs/LRMs are extremely wordy in their explanations because, as they need to speak to think, they use words to ‘make up their mind.’
This is similar to when you ask a human a question and they start mumbling or adding sounds like 'uh' or 'mmm' to signal they're thinking about the answer while buying themselves some extra time to think.
These unnecessary inferences are tolerable in a conversation with another human, but the excessive wordiness of current models is insufferable. As COCONUT LLMs aren’t forced to speak “just because” and instead think for longer before making the first prediction, you:
Get a response that cuts to the chase and answers whatever you ask (also reduces hallucinations).
Get a more thoughtful response, as the model generated its chain of thought internally before uttering a single word.
Overall, COCONUT extensively improves models' reasoning capabilities (although we can't jump to conclusions this early; in some cases, standard CoT still gets the better of it, which isn't surprising for a brand-new algorithmic approach).
Still, its performance across reasoning tasks, especially with regard to planning, suggests this might be the future of AI models. They are more than 10 times more inference-efficient and, quite frankly, behave much more like humans, which is a great sign.
Latent Reasoner Models, models that think before they speak, are here to stay.
Progress Is Only Accelerating
COCONUT is proof that AI is still very much in its early days. The good news is that we are progressing at a speed like nothing we've seen before.
Significantly, AI is steadily closing the efficiency gap, making these models' computations much cheaper. This is essential for making them usable in real life.
2025 will be the biggest year in AI history. Mark my words; this is only the beginning.
THEWHITEBOX
Join Premium Today!
If you like this content, consider joining Premium: you will receive four times as much content weekly without saturating your inbox, and you will even be able to ask the questions you need answers to.
Until next time!
Give a Rating to Today's Newsletter
For business inquiries, reach out to me at [email protected]