Meta's Multi-token Paradigm & Google's Pixel Sound Predictor
TheTechOasis
Welcome to the newsletter that keeps you updated on the latest developments at the cutting edge of AI by breaking down the most advanced systems in the world & the hottest news in the industry.
10-minute weekly reads.
Week's Update
Welcome back! This week, we have news from Groq, Figure, OpenAI, Cognizant, & more.
Starting strong, Groq unveiled new features you can try on their website for free with ultra-fast throughput. This article includes video examples for inspiration.
Groq's insane capabilities put into perspective how narrowly the industry is focused on GPUs, while other hardware, like LPUs, offers better performance at scale.
And once Etched.ai launches Sohu, things might start getting "interesting" for NVIDIA's dominance in LLM inference workloads.
In a recent study, Cognizant and Dataiku commented on enterprises' increased adoption of GenAI.
Let's not get carried away with the hype; let's follow the money instead. We need fewer "yeah, we are deploying GenAI" marketing stunts and more "stop bluffing and give me your core budget number for GenAI".
The truth? Only 33% of companies have moved on from the experimental phase, aka no serious commitments for the time being.
Writer, one of the few GenAI companies that seem to be experiencing this unusual feeling called "having revenues," has announced new features, including a graph-based RAG pipeline.
The interesting thing to mention here is how knowledge graphs seem to be gaining momentum in the retrieval-augmented generation space to the detriment of vector databases, meaning we might be quietly moving toward a new, more robust enterprise architecture.
SenseTime has announced a model, SenseNova 5.5, that allegedly surpasses GPT-4o in 5 out of 8 critical benchmarks.
In the latest issue of the "China is catching up quickly" series, SenseTime really pushes the boundaries, putting to shame all those claims that "China is years behind".
In fact, are they actually behind?
Speaking of politics, both Apple and Microsoft have relinquished their board seats at OpenAI amidst consistent scrutiny from regulators over the (obvious) consolidation of the LLM market.
Finally, in more futuristic news, Figure AI has published a video with its partner BMW in which a humanoid is seen working in one of the latter's factories.
As shown in the video, the robot is capable of autocorrecting the position of the metal piece. Seeing these robots self-correcting this early in robotics is truly remarkable.
You Should Pay Attention to
𤊠Metaâs New Model Paradigm
𤯠Google Predicts Sounds from Pixels
𤊠Metaâs New Model Paradigm đ¤Š
Have we discovered a better way of training Large Language Models (LLMs)?
This is what Meta is promising with its new research: it presents a model that predicts multiple tokens at once in every prediction, not just one, and with no training overhead.
This speeds up the model's text generation and, incredibly, makes it smarter, suggesting a new training paradigm for frontier AI.
With this article, you will learn about an interesting new paradigm that could be foundational for future LLM training, and you will also gain solid intuition into why LLMs work the way they do.
A Weak Form of Learning
To comprehend how industry-changing Meta's research could be, we must understand how LLMs are taught what they know.
A Universal Task Interface for LLMs
When training a deep neural network, you must define the task you want the model to optimize.
In the LLM case, that task is next-word prediction; a model receives a set of input words in the form of a text sequence and predicts the next token (a word or a subword). Then, as with all neural networks today, we find a way to evaluate the prediction quality (the error) as a signal to tune the model.
However, as we will see, Meta's model doesn't stop at predicting just one word.
As we want our model to be capable of modeling the entire written language and to be creative when required, we force it to assign a probability to every single token in its vocabulary.
This way, the model has to factor in uncertainty, as many different words might be reasonable continuations to any given sequence. As shown below, for the sequence "The boy went to the…", all five options are "semantically valid," but only one is the actual answer.
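To make this concrete, here is a minimal, runnable sketch of how a model turns raw scores over candidate tokens into a probability distribution via softmax. The sequence, candidate words, and scores below are invented for illustration; a real LLM scores its entire vocabulary, typically tens of thousands of tokens.

```python
import math

# Hypothetical raw scores (logits) for five candidate continuations of
# "The boy went to the..." -- purely illustrative numbers.
logits = {"playground": 2.1, "park": 1.8, "store": 1.3, "beach": 0.9, "moon": -0.5}

# Softmax converts scores into probabilities that sum to 1.
total = sum(math.exp(score) for score in logits.values())
probs = {token: math.exp(score) / total for token, score in logits.items()}

for token, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{token}: {p:.2f}")
```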
So, how do we measure the error in this situation?
How LLMs Learn
In an LLM's case, we do so through the cross-entropy function, which only focuses on the probability the model has assigned to the ground truth.
The Cross-Entropy function
Using the example above, we look at the probability the model assigned to the actual next word, "Playground," which is 40%. This is well below what it should have been (100%), so the model has plenty of room to tune its parameters.
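As a rough sketch of the math, cross-entropy on a single prediction reduces to the negative log of the probability assigned to the correct token; plugging in the 40% from the illustration above:

```python
import math

p_true = 0.40                  # probability the model gave the ground-truth token
loss = -math.log(p_true)       # cross-entropy for this single prediction
print(f"loss = {loss:.3f}")    # ~0.916; a perfect prediction (p_true = 1.0) gives loss 0
```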
But why do we learn this specific task?
If you create a model that, for almost any given sequence, can confidently predict what comes next, that model becomes capable of speaking the language, translating between languages, performing very basic reasoning, and so forth.
In other words, the next-token prediction task has become universal, and everyone follows this method to train their LLMs.
When it comes to inference, we perform autoregressive decoding.
In layman's terms, the model always looks backward, not forward. Thus, the probability of any given generated word depends on the words that came earlier.
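Here is a minimal sketch of that autoregressive loop. The `toy_next_token` function below is a made-up stand-in for a real LLM, just so the loop runs; the key point is that every step conditions only on the tokens generated so far.

```python
def toy_next_token(tokens: list[str]) -> str:
    """Stand-in for a real LLM: returns the next token given everything so far."""
    continuation = ["not", "the", "only", "one", "<eos>"]
    return continuation[len(tokens) - 1]

tokens = ["You may say I'm a dreamer, but I'm"]   # the prompt
while True:
    next_tok = toy_next_token(tokens)             # looks backward only, never forward
    if next_tok == "<eos>":
        break
    tokens.append(next_tok)

print(" ".join(tokens))   # You may say I'm a dreamer, but I'm not the only one
```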
Now, Meta suggests we should change all this. But how… and why?
Letâs Predict More Tokens!
Meta proposes a new training paradigm.
A New Architecture
In succinct terms, we modify the LLM to predict the next "k" words instead of just the next word (they settled for 4, so we will use that number from here on).
For that, we add extra output heads to the model, like the Hydra of Greek mythology, with each head performing the same kind of exercise: predicting one of the next four tokens in the sequence.
To be clear, this does not mean the total number of predicted tokens is 16. Each of the four heads is responsible for exactly one future position (denoted as words 5, 6, 7, and 8 in the image below).
Source: Meta
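To ground the diagram, here is a minimal PyTorch sketch of the idea: a shared trunk produces one representation, and four small output heads each score the vocabulary for one of the four future positions. The dimensions are invented, and this is only an illustration of the concept, not Meta's actual code.

```python
import torch
import torch.nn as nn

class MultiTokenPredictor(nn.Module):
    def __init__(self, hidden_dim: int = 256, vocab_size: int = 1000, k: int = 4):
        super().__init__()
        # Shared trunk (in a real LLM, this is the full transformer stack).
        self.trunk = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.GELU())
        # One output head per future position (t+1 ... t+k).
        self.heads = nn.ModuleList([nn.Linear(hidden_dim, vocab_size) for _ in range(k)])

    def forward(self, x: torch.Tensor) -> list[torch.Tensor]:
        shared = self.trunk(x)                         # same representation for all heads
        return [head(shared) for head in self.heads]   # k sets of vocabulary logits

model = MultiTokenPredictor()
hidden = torch.randn(2, 256)                        # batch of 2 token representations
logits_per_head = model(hidden)
print([tuple(l.shape) for l in logits_per_head])    # 4 x (2, 1000): one distribution per future token
```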
For instance, for the famous John Lennon verse "You may say I'm a dreamer, but I'm…", a standard LLM would predict "not," then "the," then "only," and finally "one."
In contrast, Meta's model would predict the entire ending in one go.
At inference time, we can discard the additional heads and perform standard next-word prediction. Alternatively, we can leverage all heads and increase generation speed by up to three times.
However, at this point, you may be asking: if we predict one token at a time and discard the other heads, why bother training the model to predict four tokens simultaneously?
A Smarter Learning Process
The main takeaway from this research is not to make LLMs faster but to make them smarter.
Indeed, these models obtain remarkably better results than the standard LLM architecture when trained at scale for code, as evidenced by the results on the MBPP and HumanEval coding benchmarks.
When measured at a certain scale (3 billion parameters onward), the 4-token prediction models comfortably surpass standard LLMs.
But why does this happen? Why does multi-token prediction create better models?
Not all Predictions are Created Equal
It's important to note that all output heads share the same LLM backbone (depicted as "shared" in the diagram we saw earlier). This means all heads use the same representation to predict the next four tokens.
Simply put, for this to work well, this shared representation must encode not only the previous words but also the words likely to come after the one it is predicting at any given moment.
In a way, contrary to standard LLMs, this new paradigm forces the model to "look forward in time".
This is particularly important when you realize that not all predictions are equal. While some words strongly constrain the words that follow (what the researchers call a "choice point"), other predictions are inconsequential to those coming next.
I know this is hard to grasp, so let's look at an example of both:
Choice point
"The hero entered the dark cave, unsure what he might find. Suddenly, he heard a…"
At this point, the next predicted word can significantly affect the story's direction. For instance, if the model predicts "roar" and then predicts "coming from a rat," that would be nonsensical. Thus, "roar" seriously constrains the next predictions.
Inconsequential Prediction
Now, consider a context where a character performs a mundane task: "She picked up the pen and started to…"
Here, the next predicted word is likely inconsequential to the overall narrative. For example, whether the next word is "write," "draw," or "doodle," the story remains the same; the outcome doesn't change.
The point I'm trying to make is that by training a model to predict, say, the next 4 tokens, the model becomes much more aware of whether the upcoming prediction is a high-risk choice point, which improves generation quality.
Another intuition as to why multi-token prediction enhances performance is that it reinforces local patterns.
Syntax is Subtle
In standard next-word prediction, each word is predicted separately. LLMs can still learn patterns between words close to each other (they learn that "I play the guitar" is correct and "I guitar the play" is not), but by forcing the model to predict all four tokens at once, it learns to produce them in the right order as a unit, all but eliminating the chance the second, garbled version ever appears.
This also explains why the best results have been observed in coding.
Syntax errors in code are much more subtle and have a tremendous negative impact (the code simply does not run); so while LLMs rarely make syntax errors in natural language, they rarely write perfect code.
Thus, with multi-token prediction, we allow LLMs to learn these short patterns so that the model can output the entire pattern simultaneously.
Moving beyond quality, naturally, these models are faster, too.
Outrageously Fast
One method Meta proposes to enhance speed for multiple-head decoding is Medusa, created by researchers at Together.ai, which increases generation speed up to three times compared to standard LLMs.
In Medusa, each head gets assigned one position in the generative process. If you have four heads, each is charged with predicting one token (the first head predicts the first token, the second head the second, and so forth).
With the top-k token predictions for each position ready, the model builds a set of candidate continuations and selects one using the following heuristic: the longest syntactically valid candidate wins.
Source: Together.ai
This way, models can write insanely fast, which can be critical for performance, especially when they run large batches (when they provide a service to multiple users at once).
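Below is a heavily simplified, runnable sketch of that selection idea: each head proposes its top-k tokens for its position, candidate continuations are assembled, and the longest one whose tokens all pass an acceptance check is kept. The acceptance rule here is a dummy placeholder; in the real Medusa setup, verification is done against the base model (with tree attention), which this toy does not implement.

```python
from itertools import product

def longest_accepted(head_proposals: list[list[str]], accept) -> list[str]:
    """Pick the longest candidate continuation whose tokens all pass `accept`."""
    best: list[str] = []
    for candidate in product(*head_proposals):   # every combination of head proposals
        prefix: list[str] = []
        for token in candidate:
            if not accept(prefix, token):        # stop at the first rejected token
                break
            prefix.append(token)
        if len(prefix) > len(best):
            best = prefix
    return best

# Toy example: 3 heads, top-2 proposals each; the acceptance rule (no immediate
# token repetition) is invented purely for illustration.
proposals = [["not", "no"], ["the", "a"], ["only", "the"]]
accept = lambda prefix, tok: not (prefix and prefix[-1] == tok)
print(longest_accepted(proposals, accept))   # ['not', 'the', 'only']
```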
TheWhiteBox's take
Technology:
Is this the future of LLM training?
In terms of architecture, this is a big deal, as it could eventually change how we train all frontier LLMs. While I'm not convinced this will be the case for text (the results in that modality weren't that impressive), it seems very likely for coding models.
Products:
Generating correct code is a key requirement not only for coding models, but also for agents, as they must interact with third-party systems (APIs). Improving how we train these models could be a huge enhancement for companies that train them.
Markets:
And who are the people building the aforementioned products?
Public markets: Microsoft (Copilot), Amazon (Alexa), Alphabet (Gemini for Google Cloud), Meta, Apple (Apple Intelligence)
Private markets: Imbue, Adept (acquihired by Amazon), Rabbit (most likely a scam), Cognition Labs (Devin), or Anthropic (Claude), among others.
Most, if not all, of the private players are backed by Microsoft, Alphabet, or Amazon. So either way, those three are winning.
𤯠Google Predicts Sounds from Pixels đ¤Ż
In what might be one of the most impressive models in quite some time, Google DeepMind has presented a video-to-audio model that takes in video and generates the matching audio.
The model picks up on car drifts, horror themes, or accurate drum playing, and Google's insights into how they did it provide invaluable intuition on how AI audio is generated, as well as a glimpse of their plans for some of their key products.
But how does it work?
Diffusion is King
In frontier AI, we have two dominant architectures: autoregressive transformers, like ChatGPT or Gemini, and diffusion transformers (Sora, Veo, or Stable Diffusion).
Autoregressive transformers model sequences by using prior words in a sequence to predict the next one.
Diffusion transformers model sequences by denoising a noisy canvas until you get the result.
While the former is the absolute king of LLMs, the latter is the undisputed standard for modeling images and video.
But how does Diffusion work?
From Stone to Statue
For the sake of length, we will use my best analogy: marble sculpting. Um, what? Bear with me.
The great Renaissance artist Michelangelo once said:
"The sculpture is already complete within the marble block, before I start my work. It is already there, I just have to chisel away the superfluous material."
Diffusion is this process applied to AI.
We have an initial, randomized noisy canvas (be that a noisy image, noisy audio, and so on), and the model, conditioning on the input prompt (a cue that guides the generation), slowly "chisels away the excess material" (the noise) over a fixed number of time steps (it doesn't do it in one go) until you are left with the result.
Michelangelo would have been a great diffuser
So, what is the diffusion model actually learning to do?
In a more technological sense, the model learns the distribution of the training data (just like LLMs) but is capable of taking samples from a random distribution (pure noise) and "flowing" them into its target distribution, conditioned on the user's prompt ("draw me a cat portrait").
In other words, diffusion models learn to "uncover" the image or video "hidden" beneath all that random noise.
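For intuition, here is a heavily simplified, runnable sketch of the reverse ("chiseling") loop. The noise predictor is a dummy stand-in and the update rule is schematic; real samplers (DDPM, DDIM, etc.) use more careful math, but the shape of the loop is the same: start from pure noise and remove a little predicted noise at each of a fixed number of steps.

```python
import torch

T = 50                              # fixed number of denoising steps
x = torch.randn(1, 16)              # pure Gaussian noise: the untouched "marble block"
condition = torch.randn(1, 16)      # stand-in for the encoded user prompt

def predict_noise(x_t: torch.Tensor, t: int, cond: torch.Tensor) -> torch.Tensor:
    """Dummy stand-in for the trained denoiser network."""
    return 0.5 * x_t + 0.1 * cond

for t in reversed(range(T)):
    eps = predict_noise(x, t, condition)   # estimate the "excess material" (noise)
    x = x - (1.0 / T) * eps                # chisel a little of it away

print(x.shape)   # the denoised sample (an image/audio latent in a real model)
```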
But what do we mean by conditioning?
We want a specific outcome!
Most models today are "x-to-x2," where "x" represents a human desire ("I want whatever") that can be expressed in various ways (text in the case of ChatGPT), and "x2" represents the way the model outputs what you want (text or images in ChatGPT).
Both "x" and "x2" are sequences of tokens, which is why all these models are considered "sequence-to-sequence."
However, in the case of Diffusion models, the schematic is slightly different:
Generated by author
Here, the inputs to the model are twofold:
The Gaussian (random) canvas, aka the noisy distribution
The user condition
Consequently, the diffusion process involves denoising the noisy distribution using the user's cue so that the outcome semantically matches the condition (they represent the same thing).
In technical terms, the user condition impacts how the model predicts the noise it is going to erase at every time step, through the use of cross-attention layers.
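A minimal PyTorch sketch of that mechanism: the noisy latent provides the queries, the encoded prompt provides the keys and values, so every position of the canvas can "look at" the condition when deciding what noise to remove. Dimensions are invented for illustration.

```python
import torch
import torch.nn as nn

embed_dim = 64
cross_attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)

noisy_latent = torch.randn(1, 32, embed_dim)   # queries: the canvas being denoised
prompt_embed = torch.randn(1, 8, embed_dim)    # keys/values: the encoded user condition

# Each latent position attends to the prompt tokens, so the predicted noise
# (and therefore what gets "chiseled away") depends on what the user asked for.
conditioned, _ = cross_attn(query=noisy_latent, key=prompt_embed, value=prompt_embed)
print(conditioned.shape)   # torch.Size([1, 32, 64])
```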
Finally, knowing all this, how does Google's Video-to-Audio (V2A) model work?
From Pixels to Sound
Now that we know how diffusion works, the rest is easy to understand. V2A looks like this:
Source: Google
As expected, the model conditions on two things:
The video we want to add sound to
Positive and negative prompts: the positive signals what we want, the negative what we don't want
The inputs are then encoded into vector representations and merged.
Starting from random audio (pure noise), the model denoises that audio using these conditions.
After the denoising process ends, we have our desired audio, but in compressed form (as a vector, not actual sound).
This compressed representation is decoded into a mel-spectrogram (a snapshot of the audio representing the intensity at each frequency over time).
Finally, some component (probably a vocoder) converts the mel-spectrogram into an actual audio waveform, which is then overlaid on the video.
Et voilà, you have a model that can take a video and a human cue and generate overlaid audio with high semantic fidelity to the video frames.
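Putting the steps above together, here is a schematic sketch of the pipeline as described. Every function below is a hypothetical placeholder that merely mirrors the stages in the article (encode, denoise, decode to a mel-spectrogram, vocode), not Google's actual components or API.

```python
import torch

# Hypothetical stand-ins for the stages described above -- NOT Google's V2A code.
def encode_video(frames):        return torch.randn(1, 16)             # video embedding
def encode_prompts(pos, neg):    return torch.randn(1, 16)             # merged prompt embedding
def denoise(noise, conditions):  return noise - 0.1 * sum(conditions)  # schematic diffusion
def decode_to_mel(latent):       return latent.repeat(1, 8)            # "mel-spectrogram"
def vocoder(mel):                return mel.flatten()                  # "waveform"

def video_to_audio(frames, positive_prompt, negative_prompt):
    video_emb = encode_video(frames)                                # 1. encode the conditions
    prompt_emb = encode_prompts(positive_prompt, negative_prompt)
    latent = denoise(torch.randn(1, 16), [video_emb, prompt_emb])   # 2. denoise pure noise
    mel = decode_to_mel(latent)                                     # 3. latent -> mel-spectrogram
    return vocoder(mel)                                             # 4. mel-spectrogram -> waveform

waveform = video_to_audio(frames=None, positive_prompt="tires screeching", negative_prompt="speech")
print(waveform.shape)   # a 1-D tensor standing in for the generated audio
```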
TheWhiteBox's take:
Technology:
Once again, it all boils down to the attention mechanism, the quintessential component of all things AI today.
Insightfully, Google acknowledged they also tried a pure autoregressive approach (where the audio would have been generated in a next-token manner similar to ChatGPT), but diffusion offered better performance.
Therefore, diffusion is not only the primary choice for modeling images and video; it is now the choice for audio, too.
Products:
Based on Google's acknowledgment, this model will be part of their video generation model, Veo, their rival to OpenAI's Sora. Moreover, we can see Google's short-term play from a mile away here: YouTube.
In fact, they've just released an eraser feature that allows you to eliminate sound from specific video segments (to avoid copyright claims), meaning they are really betting hard on AI for their video platform.
Markets:
For quite some time, people have speculated that AI is moving on from text-based models (LLMs) to video models (VLMs). In that scenario, nobody is better positioned than Google and its petabytes of video data, period.
Google has a huge lead here, not OpenAI or Anthropic.
At the other end of the spectrum, freelance video editors have another reason to reconsider their career choices, as all of Google's recent moves signal they are venturing into all things video editing.
Closing Thoughts
With Meta's bombshell, we have seen how the industry might reconsider the way it trains LLMs, especially for coding and agentic tasks, and we have gained great insights into why LLMs work the way they do.
We have also seen yet more proof of how ubiquitously powerful (and essential) the attention mechanism is for modeling every single damn thing in today's AI, even enabling the creation of unique audio to match a given video.
That said, I insist that the reason I talk about things like attention is that, in an industry where everything seems to change every week, knowing what doesn't change (attention, diffusion, etc.) is what gives you the edge over others.
Finally, in this Sunday's Premium segment, we will tackle one of the most pressing questions in AI today:
Is AI actually intelligent?
Until next time!
Do you have any feelings, questions, or intuitions you want to share with me? Reach me at [email protected]