JAMBA, RAG 2.0, DeepSeek Experts, and The Key to GPT-5

🏝 TheTechOasis 🏝

Breaking down the most advanced AI systems in the world to prepare you for the future.

5-minute weekly reads.

TLDR:

  • AI Research of the Week:

    • JAMBA, The First Production-Grade Hybrid

    • RAG 2.0, The Death of Standard RAG?

    • DeepSeek, Hyperspecialization of Experts

  • Leaders: Are Agentic Workflows the Path to GPT-5?

🔋 Jamba & The Hybrid Forthcoming 🔋

For almost six years, nothing has beaten the Transformer, the heart of all Generative AI models.

However, due to its excessive costs, many have tried to dethrone it but failed.

But we can finally hear the winds of change. Not to substitute the Transformer, but to create hybrids. And we finally have our very first production-grade one, Jamba.

A Quadratic Nightmare

In technology, there’s always a trade-off. And in the case of the Transformer, it’s a big one.

Although we aren’t going into the technical details of the Transformer, both for the sake of length and because we have described them many times, here’s the gist.

Models like ChatGPT, Gemini, or Claude are all based on a concatenation of Transformer blocks.

Each of these blocks contains two things:

  1. An attention operator

  2. A Feedforward layer (MLP in the image)

The former enforces the renowned attention mechanism, a mixing operation that processes the input sequence and helps each word in the sequence pay attention to the words that matter (verbs to nouns, adverbs to verbs, pronouns to nouns, and so on).

For a full review of attention, read my blog here.

The latter is a feedforward layer, which helps in extracting key features and relationships from the data.

But attention has a problem: quadratic cost complexity. In layman’s terms, if the input sequence is doubled, the cost of processing it quadruples.

In practice, this isn’t much of a problem for short sequences, but for longer sequences, the computation and memory requirements both skyrocket.
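To make the quadratic growth concrete, here is a minimal sketch that counts the query-key scores a single attention head has to compute. The function name `attention_score_count` is ours, and real models multiply this by the number of heads and layers, which we ignore here.

```python
def attention_score_count(seq_len: int) -> int:
    """Number of query-key scores in full self-attention for one head.

    Every token attends to every token, so the score matrix is
    seq_len x seq_len.
    """
    return seq_len * seq_len


for n in (1_000, 2_000, 4_000):
    print(f"{n:>5} tokens -> {attention_score_count(n):>12,} scores")

# Doubling the sequence length quadruples the work.
ratio = attention_score_count(2_000) / attention_score_count(1_000)
print(ratio)  # 4.0
```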

For reference, popular models like LLaMa-2 7B or Mistral-7B, despite their small size, require 128GB and 32GB of key-value (KV) cache memory respectively for a sequence of 256k tokens (around 192k words), even though the latter uses Grouped-Query Attention.

And this is just one single input.
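One way to arrive at those figures, assuming they refer to the key-value (KV) cache stored in 16-bit precision, is a back-of-the-envelope calculation. The helper `kv_cache_gib` is our own name. The layer and head counts are the published configurations of both models, and the units come out in GiB.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 seq_len: int, bytes_per_value: int = 2) -> float:
    """KV cache size in GiB for one sequence.

    Each layer stores one key and one value vector per KV head per token,
    hence the leading factor of 2.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value
    return total_bytes / 2**30


SEQ_LEN = 256 * 1024  # 256k tokens

# Both models have 32 layers and a head size of 128. Mistral-7B keeps only
# 8 key-value heads thanks to Grouped-Query Attention.
llama2_7b = kv_cache_gib(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=SEQ_LEN)
mistral_7b = kv_cache_gib(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=SEQ_LEN)

print(llama2_7b)   # 128.0
print(mistral_7b)  # 32.0
```

Note that this memory grows linearly with the sequence length. It is the attention computation itself that grows quadratically.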

For this reason, for years, the Transformer has been incapable of scaling to larger sequence lengths due to its prohibitive costs.

And despite some newsworthy innovations like Ring Attention facilitating things by reducing communication overhead between GPUs, they still don’t break the ‘quadratic barrier’.

And this takes us to Jamba.

Mamba, Attention, and MoE

Jamba, an open-source model created by AI21 Labs, is an LLM that looks too good to be true.

Instead of just using Transformer blocks, it includes another piece in the puzzle, Mamba blocks.

But why?

Mamba is stateful. While the Transformer has to consider the entire context every single time, Mamba has memory.

Imagine you are writing a book, and you are writing page 101. But for every single new word you write, you have to reread all the previous pages every time to get the context.

That’s how Transformers work.

On the other hand, Mamba blocks carry a fixed-size state. In other words, they keep an updatable memory of all previous context. As this state is fixed in size, for every single new input the Mamba block has to decide if it’s relevant for context or not.

For instance, if the next word is ‘um’, you probably don’t want to update your memory, right?
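The idea of a fixed-size, selectively updated memory can be sketched in a few lines. This is a toy gated update, not Mamba's actual state-space equations: the token vectors and relevance scores are hand-picked for illustration, whereas a real model computes the gate from the input with learned weights.

```python
def update_state(state: list[float], token_vec: list[float], relevance: float) -> list[float]:
    """Blend a new token into a fixed-size state.

    relevance is in [0, 1]: 0 ignores the token, 1 overwrites the state.
    """
    return [(1.0 - relevance) * s + relevance * x for s, x in zip(state, token_vec)]


# Hypothetical token vectors and relevance scores.
stream = [
    ("cat", [1.0, 0.0], 0.9),
    ("um",  [0.0, 1.0], 0.0),  # filler word: the gate stays closed
    ("sat", [0.5, 0.5], 0.5),
]

state = [0.0, 0.0]
for word, vec, relevance in stream:
    state = update_state(state, vec, relevance)
    print(word, state, len(state))  # the state never grows
```

However long the stream gets, the state keeps the same two numbers, which is why the cost per new token stays constant instead of growing with the context.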

Sadly, the reality is that Mamba underperforms Transformers quality-wise, so researchers still need to keep the Transformer.

Although not fully proven, Mamba seems to underperform because it struggles to form induction heads, the ability to ‘copy and paste’ patterns from the data, which is thought to be the key to in-context learning, the main superpower behind LLMs.

Additionally, Jamba incorporates one added feature, mixture-of-experts.

In simple terms, for every prediction, only a fraction of the model runs, allowing researchers to scale to huge sizes while keeping computation in check.

In the case of Jamba, that means that even though it has 52 billion parameters, only 12 billion are activated for every single prediction.
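Here is a minimal sketch of how a mixture-of-experts router might pick which experts run. It is a generic top-k router with made-up scores, not necessarily Jamba's exact routing, and `top_k_route` is our own name.

```python
import math


def top_k_route(router_logits: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    exps = {i: math.exp(router_logits[i]) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


# Hypothetical router scores for one token over 8 experts.
logits = [0.1, 2.0, -1.0, 0.3, 1.5, -0.2, 0.0, 0.7]
weights = top_k_route(logits, k=2)
print(weights)  # only experts 1 and 4 run for this token

# Jamba's reported sizes: 12B of 52B parameters are active per prediction.
print(f"active fraction: {12 / 52:.0%}")  # active fraction: 23%
```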

In practice, the memory required to run Jamba on long sequences falls off a cliff, despite it being the larger model.

Also, the combined effects of Mamba and MoE give Jamba insane throughput (tokens predicted per second), much higher than that of smaller models like LLaMa2 13B.

And despite the reduced requirements, the model is perfectly competitive with the others.

Overall, Jamba looks really good.

What We Think

I’ve said it before. Hybrid architectures are not an ‘if’, but a ‘when’.

Jamba is not trying to create a generational leap in terms of quality, but to prove that hybrid architectures can be as good as the Transformer despite being much cheaper.

Indeed, we are already seeing researchers trying hybrid architectures on things like DNA (EVO with StripedHyena) too.

The reason is simple: as long as compute scales quadratically, we will keep running into blockages, as the world’s electricity grid is not prepared for such compute. There’s a reason why Microsoft is reportedly planning a $100 billion data center for OpenAI.

The obvious next step should be combining hybrids with 1-bit LLMs, with examples like Microsoft’s 1.58-bit LLM, where each parameter takes one of only three values (1, 0, or -1), or about 1.58 bits, to reduce memory requirements even more drastically.
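The 1.58 figure comes from information theory: a weight that can take three values carries log2(3) bits. The sketch below checks that number and compares idealized weight storage for a hypothetical 7B-parameter model. The helper `weights_gb` is ours and ignores packing overhead and activations.

```python
import math

# A ternary weight (1, 0, or -1) carries log2(3) bits of information.
bits_per_ternary_weight = math.log2(3)
print(round(bits_per_ternary_weight, 2))  # 1.58


def weights_gb(n_params: float, bits_per_param: float) -> float:
    """Idealized weight storage in gigabytes (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 7e9  # hypothetical 7B-parameter model
print(weights_gb(n_params, 16))  # 14.0
print(round(weights_gb(n_params, bits_per_ternary_weight), 2))
```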

If successful, we might be on to a paradigm shift for AI.

🥇 This week on Leaders… 🥇

This week we will be diving deep into Agentic Workflows, a new type of model that Andrew Ng, one of the most prominent AI researchers in the world, argues will provide better results even when compared to AI’s next generation of models like GPT-5.

🔮 RAG 2.0, The Death of Standard RAG? 🔮

Looking at the AI industry, we have grown accustomed to seeing stuff get ‘killed’ every single day.

But rarely is the case as compelling as what Contextual AI has proposed with Contextual Language Models (CLMs), in what they call “RAG 2.0”, to kill standard RAG.

Behind the claim are none other than the original creators of Retrieval Augmented Generation (RAG).

But first, what’s RAG?

Grounding on Data

As you may or may not know, all standalone Large Language Models, with prominent examples like ChatGPT, have a knowledge cutoff.

What this means is that pre-training is a one-off exercise (unlike continual learning methods). Thus, they have ‘seen’ data only up to a certain point in time.

For instance, ChatGPT is updated until April 2023 at the time of writing. Consequently, it is not prepared to answer questions about facts and events that took place after that date.

This is where RAG comes in.

As the name implies, the idea is to retrieve data from a known database and feed it into the model in real time so that it has up-to-date context to provide an accurate answer.

But how does this retrieval process work?

It’s all semantic similarity

The whole architecture stems from one single principle: the capacity to retrieve semantically meaningful data relevant to the context at hand (usually the user’s question).

This process involves the use of three elements:

  1. The embedding model

  2. The retriever, often a vector database

  3. The generator, the LLM

First and foremost, to make this retrieval process work, you need the data to be in ‘embedding form’, a representation of text as a vector of numbers.

More importantly, embeddings follow a similarity principle: similar concepts have similar vectors.

For example, the concepts of ‘dog’ and ‘cat’ are similar to us: both are animals, mammals, four-legged, and crucially, domestic. Translated into vectors, ‘dog’ could be [3, -1, 2] and ‘cat’ [2.98, -1, 2.2], for instance.

For a more detailed explanation of the intuition behind embeddings, check my new blog for free.

After we have the embeddings, we insert them into the vector database (retriever), a high-dimensional space where similar things are closer together.

Then, whenever the user sends a request, like “give me similar results to a ‘yellow cat’”, the vector database performs a ‘semantic query’.

In layman’s terms, it performs an extraction of the closest vectors (in distance) to that of the user’s query.

As these vectors represent the underlying concepts, similar vectors will be representing similar concepts, in this case, other cats.
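A semantic query can be sketched with nothing more than a similarity function and a sort. We assume cosine similarity here, a common choice but not the only one. The ‘dog’ and ‘cat’ vectors are the ones from the example above, while the ‘car’ vector and the query embedding are made up for illustration. A real vector database uses approximate indexes rather than scanning every entry.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# 'dog' and 'cat' come from the text; 'car' is a made-up unrelated concept.
database = {
    "dog": [3.0, -1.0, 2.0],
    "cat": [2.98, -1.0, 2.2],
    "car": [-2.0, 4.0, 0.5],
}


def retrieve(query_vec: list[float], k: int = 2) -> list[str]:
    """Return the k stored items most similar to the query vector."""
    ranked = sorted(
        database,
        key=lambda name: cosine_similarity(query_vec, database[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical embedding of the query "yellow cat".
query = [2.9, -0.9, 2.3]
print(retrieve(query))  # ['cat', 'dog']
```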

Once we have the extracted content, we build the LLM prompt, encapsulating:

  • The user’s request

  • The extracted content

  • and, generally, a set of system instructions

As part of the prompt engineering process, you also want to tune how the model responds. A typical system instruction might be “be concise”.
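Putting the three pieces together might look like the sketch below. The prompt layout and the function name `build_prompt` are our own choices, and the retrieved content is invented for illustration. Every LLM provider has its own preferred message format.

```python
def build_prompt(user_request: str, retrieved_chunks: list[str],
                 system_instructions: str = "Be concise.") -> str:
    """Combine system instructions, retrieved context, and the user's request."""
    context = "\n".join(f"- {chunk}" for chunk in retrieved_chunks)
    return (
        f"System: {system_instructions}\n\n"
        f"Context:\n{context}\n\n"
        f"User: {user_request}"
    )


# Hypothetical retrieved content.
prompt = build_prompt(
    user_request="What do yellow cats eat?",
    retrieved_chunks=[
        "Cats are obligate carnivores.",
        "Coat color does not affect diet.",
    ],
)
print(prompt)
```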

That’s RAG in a nutshell: a system that provides content relevant to the user’s query to enhance the LLM’s response.

But this process sounds better than it is, and it is far from ideal today.

Stitching with no refinement

One way to visualize current RAG systems is as a pair of patched trousers.

Although these trousers might work for some audiences, most people would never wear them, as the colors and shapes of the patches have nothing to do with the base color.

There’s no homogeneity, despite patches being meant to go unnoticed.

The reason behind this analogy is that standard RAG systems assemble three different components that were pre-trained separately and that, by definition, were never meant to be together.

Instead, RAG 2.0 systems are designed to be ‘one thing’ from the beginning.

In practice, the complete system is trained end-to-end as a single unit, on the assumption that if you need your LLM to always be up to date, it should always come paired with a vector database.

And the results show it.

Despite using what is almost certainly a worse standalone model than GPT-4, this new methodology outperforms every other tested combination of GPT-4 and other retrieval systems.

It’s too soon to tell, but RAG 2.0 might become the enterprise standard shortly.

What We Think

Enterprise-grade GenAI almost always involves confidential data. RAGs are essential to companies embracing GenAI, and RAG 2.0 feels like a natural next step.

On the flip side, two things.

Like RAFT, another increasingly popular RAG-based innovation, RAG 2.0 involves fine-tuning with use-case data.

However, one of the perks of RAG is that you avoid feeding your data to the LLM providers. This means that RAG 2.0 will only make sense if you use open-source models.

Also, with the increasing size of LLM context windows, with examples like Gemini 1.5 reaching 1 million tokens (around 750,000 words) in every single prompt, some people are claiming RAG, no matter its version, is ‘dead’.

Needless to say, avoiding having to deal with additional components makes it a better experience overall. But if we account for costs, RAG 2.0 systems are a far cheaper option today than having large context windows for every single prompt.

Therefore, whether RAG disappears or not will depend on how cheap running long sequences becomes eventually.

🐋 Hyperspecialization of Experts 🐋

As proven by our previous article on Jamba, Mixture-of-Experts (MoE) is pretty much a standard by now.

Even GPT-4 is rumored to be a MoE.

In fact, not only is it more computationally efficient, it might even improve quality. This is an extremely rare sight in the world of technology: reducing costs while increasing quality.

But ever since its popularity rose, the methods used haven’t changed that much, meaning that some of its greatest issues remain unsolved. DeepSeek researchers think they have found a solution.

The current problems in MoE

As we explained in Jamba’s post, MoE architectures ‘break’ the model into smaller models, named experts. Thus, during inference, a certain number of experts are chosen to ‘answer’ for that prediction.

As you are essentially silencing the majority of the experts, the computing costs fall proportionally (if you activate 2 out of 8, costs could drop by a factor of about 4).

But when things work just fine, people tend to avoid touching them. In the case of MoE, that has meant that most models today have 8 experts and activate 2 per prediction, with examples like Mixtral 8x7B.

But despite the recent successes, these architectures suffer from two illnesses: knowledge hybridity and knowledge redundancy.

  1. Knowledge Hybridity occurs when each expert ends up handling a broad range of knowledge due to the limited number of experts. This broadness prevents experts from specializing deeply in specific areas.

  2. Knowledge Redundancy occurs when different experts in an MoE model learn similar knowledge, which defeats the point of partitioning the model in the first place.

To solve this, DeepSeek proposes a new type of MoE architecture.

Specialized while still Sharing

DeepSeek proposes an architecture with two modifications:

  1. A much higher number of experts, around 64 (eight times more than the usual)

  2. Shared experts, experts that fire for every single prediction

But why do this?

As for the former, by increasing the number of experts, you naturally reduce the number of topics their weights will become experts on, inducing deeper specialization by design.

This also enhances the number of possible expert combinations.

For example, for a model with 16 experts that chooses 2 per prediction, you can have 120 possible expert combinations. But with 64 experts and 8 per prediction, that number increases to 4,426,165,368 combinations.
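These counts are binomial coefficients ("n choose k"), so they are easy to check with Python's standard library:

```python
import math

# Ways to choose 2 experts out of 16.
print(math.comb(16, 2))  # 120

# Ways to choose 8 experts out of 64.
print(math.comb(64, 8))  # 4426165368
```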

However, extreme hyperspecialization has its own problems, as overly specialized experts might be incapable of learning broader topics. To solve that, they include shared experts, a set of neurons that fire for every single prediction.

While the shared experts provide broad knowledge, the appropriate set of smaller experts will fire for the topics they know best.
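As a rough illustration of the idea, here is a toy layer in which shared experts always run and only the top-scoring routed experts join them. It is a sketch under heavy simplification: the ‘experts’ are plain functions on a single number, the router scores are made up, and `moe_layer` is our own name rather than anything from DeepSeek's code.

```python
def moe_layer(x: float, shared_experts, routed_experts, router_scores, k: int) -> float:
    """Toy MoE layer with shared experts, applied to a scalar input.

    Shared experts always run; only the top-k routed experts run,
    each weighted by its router score.
    """
    shared_out = sum(expert(x) for expert in shared_experts)
    top_k = sorted(range(len(routed_experts)),
                   key=lambda i: router_scores[i], reverse=True)[:k]
    routed_out = sum(router_scores[i] * routed_experts[i](x) for i in top_k)
    return shared_out + routed_out


# Hypothetical experts: simple functions standing in for small feedforward networks.
shared = [lambda x: 0.5 * x]
routed = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
scores = [0.1, 0.6, 0.05, 0.25]  # made-up router scores for one token

# Experts 1 and 3 are selected; experts 0 and 2 stay silent.
print(moe_layer(2.0, shared, routed, scores, k=2))
```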

And the results? Highly promising.

DeepSeekMoE 16B matches or exceeds the performance of comparable models like LLaMA2 7B across a range of benchmarks with only about 40% of the computational requirements.

But do these advantages scale to bigger sizes? Sure as hell they do.

Preliminary results from scaling DeepSeekMoE to 145B parameters show it maintaining substantial advantages over traditional MoE models and with performance comparable to much larger dense models while using significantly fewer computations.

With “similar” research from the likes of Meta, the idea of improving MoEs is one of the hottest areas of research today. And it shows.

What We Think

When proposing a MoE, most people do so on the grounds of training huge models without having to deal with the huge costs. But it has an added benefit that is rarely discussed.

With MoEs, you have the opportunity to leverage the usual sparsity of neural networks. In layman’s terms, for large networks, a very small fraction of neurons fire for every single prediction.

Consequently, by breaking down the model, not only do you avoid unnecessary computation on neurons that wouldn’t fire anyway, but you also have the opportunity to specialize their training much more and reduce how hard it is for a specific neuron to elicit knowledge from extremely variable topics.

Overall, everything in this research just makes sense to me.

👾 Best news of the week 👾

👾 There’s no intelligence without a body, according to Meta

Do you have any feelings, questions, or intuitions you want to share with me? Reach me at [email protected]