The 4 Unsolved Problems in AI

A Journey to no-hype AI


The main goal of this newsletter is to show you AI as it is, not as it's usually portrayed, with all the bells and whistles.

Today, following that vision, I've gathered the four main unanswered questions that remain a mystery to everyone, spanning technology, engineering, markets, and model intelligence.

Each one of them has to be solved. Otherwise, AI is DOA (Dead On Arrival), which is why you rarely, if ever, see incumbents mentioning these problems.

Therefore, in this article, you’ll learn about the ugly side of frontier AI, such as:

  • the underlying complexities of running AI models at scale,

  • the key intuitions of how they work,

  • the engineering problems that are the 'it' question keeping Silicon Valley CEOs awake at night,

  • a deep look at OpenAI's leaked and extremely concerning financials,

  • and finally, recent data suggesting that Large Language Models (LLMs) might be headed in the utterly wrong direction toward AGI, plus a bet by one company that calls bullshit on the entire industry (& now on the Nobel Prize, too).

And all this in simple-to-understand language. Let’s dive in.

Problem 1: Hallucinations & Non-Determinism

As the Nobel Prize's most recent awardees suggest, AI is eating the world. More specifically, deep learning (DL), the field of AI that works with neural networks (all five physics and chemistry laureates were neural network researchers).

Awards aside, DL is also leading the mass arrival of AI products into society, democratizing the use of AI tools with examples like ChatGPT. However, while these products offer real value, they face two extremely complex technological barriers: hallucinations and non-determinism.

As the definition implies, a generative model generates an output based on an input. For that, it needs to be exposed to billions of examples to learn that mapping and, over time, generate sequences that closely match the ones it saw during training.

But there's a problem. If the model simply memorized the training sequences (of text, for instance), it would generate the exact same sequences every time. That's no different from a simple database, and pointless.

Therefore, we add two design features to make LLMs useful:

  • Contrary to popular belief, LLMs do not just predict the most likely token (let's assume every token is a word or subword) but also rank all other possible words by probability. This way, the model is modeling uncertainty, which is the same as saying it is also estimating the likelihood that it's wrong. This is crucial, as we'll see in a minute.

  • When the time comes to output a word, we don't always choose the most likely one, but pick one of the top candidates at random.

Both features imply that even if the most likely next word is the correct one (say we are predicting a fact), the random sampling might pick the second one instead: a plausible guess, but wrong nonetheless.
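To make that concrete, here is a minimal sketch with toy tokens and made-up probabilities (nothing here comes from a real model), contrasting greedy decoding with the random sampling that chat products actually use:

```python
import numpy as np

# Toy next-token distribution for "The capital of Australia is ___".
# The correct token is the favourite, but only narrowly.
vocab = ["Canberra", "Sydney", "Melbourne", "Perth"]   # hypothetical tokens
probs = np.array([0.40, 0.35, 0.15, 0.10])             # hypothetical probabilities

rng = np.random.default_rng(0)

# Greedy decoding: always pick the single most likely token.
greedy_pick = vocab[int(np.argmax(probs))]

# Sampling (what chat products actually do): draw from the distribution,
# so a plausible-but-wrong token like "Sydney" appears roughly 35% of the time.
sampled_picks = [vocab[rng.choice(len(vocab), p=probs)] for _ in range(10)]

print("greedy :", greedy_pick)     # always "Canberra"
print("sampled:", sampled_picks)   # a mix, including wrong answers
```

Remove the fixed seed and run it a few times: 'Sydney' will show up regularly, which is exactly the failure mode described above.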

Consequently, defining 'hallucinations' as 'when the model gets something wrong' is a misnomer; in a sense, all LLMs ever do is hallucinate, because every single prediction has randomness added to it.

This, in turn, implies a troubling realization: hallucinations can't be eliminated entirely, which is surprisingly little known, even among so-called Silicon Valley experts. The real issue, however, is that, as of today, we haven't found a proven, general way to even reduce them.

One very recent idea that has generated considerable excitement is the use of model certainty. It rests on a fundamental insight into how LLMs work: the fact that the model ranks the word 'x' as the most likely of all does not mean it is certain about its correctness.

If the model assigns a 99% probability to one word, we can assume it’s certain. But if the top probability token is just 20%, even if it’s the most likely, the model is basically clueless and just guessing.
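As a rough sketch of how you could quantify that (toy probabilities, not from any real model), here are the two obvious certainty proxies, the favourite token's probability and the entropy of the whole distribution, which is roughly the kind of signal entropy-based decoders such as Entropix look at:

```python
import numpy as np

def certainty_signals(probs):
    """Two simple proxies for model certainty over the next token:
    the top token's probability and the entropy of the distribution."""
    probs = np.asarray(probs, dtype=float)
    top_p = probs.max()                          # high = confident favourite
    entropy = -(probs * np.log(probs)).sum()     # low = concentrated = certain
    return top_p, entropy

confident = [0.99, 0.005, 0.003, 0.002]     # model 'knows' the next token
clueless  = [0.20, 0.20, 0.20, 0.20, 0.20]  # a top token exists, but it's pure guesswork

for name, p in [("confident", confident), ("clueless", clueless)]:
    top_p, h = certainty_signals(p)
    print(f"{name:9s} top prob={top_p:.2f} entropy={h:.2f}")
# confident -> top prob=0.99 entropy=0.07
# clueless  -> top prob=0.20 entropy=1.61
```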

Fascinatingly, as shown by Google research, if we let the LLM ramble, certainty tends to increase. Thus, by adding a ‘certainty’ requirement or ‘do not commit to a response until you reach a specific certainty threshold,’ we are forcing the model to continue thinking until it is sufficiently certain. This reduces hallucinations substantially, as we covered in the Entropix Notion article.

By letting the model ramble, certainty (in blue) increases. Source: Google

Other ideas include grokking and augmented LLMs, which I also covered in this newsletter here and here. Still, we have yet to see proof that any of these methods reduce hallucinations enough to make LLMs robust.

Moving on, the fact that we are modeling uncertainty and adding randomness on purpose implies another caveat: it makes LLMs non-deterministic. In layman's terms, no two outputs are identical for the same input.

Sadly, however, the world demands determinism.

Mathematics, coding, working with Excel… there's a plethora of tasks we intend to use LLMs on that do not tolerate ‘the model mildly altering the output randomly every time.’

That said, this problem, although largely unsolved, is more “solvable”.

  • OpenAI already offers Structured Outputs that guarantee 100% adherence to certain structures, alleviating some issues.

  • As mentioned in past newsletters, Rysana also guarantees 100% structure adherence through inference inversion.

Yet, these features do not fully solve the randomness problem. Still, can’t we just block the random sampling and force the model to choose the most likely word every time?

The answer is yes, we can, and it's called greedy decoding. Sadly, even then, companies like OpenAI still can't guarantee determinism, so the model might still generate different outputs every time.

Theoretically, blocking random sampling should lead to determinism. However, due to precision shifts at the hardware layer, caused by the distributed nature of LLM workloads, the model's outputs still drift slightly every time.

Consequently, non-determinism is not only unsolved; on this one we are fighting the laws of physics, as the cause is unpredictable variation in the behavior of any one of the thousands of GPUs that may be participating in that prediction.
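If it sounds far-fetched that hardware could break a deterministic computation, the root cause is easy to demonstrate on a laptop: floating-point addition is not associative, so the order in which partial results are reduced changes the answer slightly. A minimal, CPU-only sketch of the same effect:

```python
import numpy as np

# Floating-point addition is not associative, so the order in which
# thousands of partial results get reduced (across GPUs, or even across
# cores within one GPU) nudges the final number by a tiny amount.
rng = np.random.default_rng(42)
x = rng.standard_normal(1_000_000).astype(np.float32)

sum_a = np.sum(x)                     # one reduction order
sum_b = np.sum(rng.permutation(x))    # same numbers, different order

print(sum_a, sum_b, bool(sum_a == sum_b))
# The two sums typically differ in the last bits. Push that tiny drift
# through dozens of layers and a final argmax or sampling step, and two
# 'identical' requests can produce different tokens even with greedy decoding.
```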

But how does that make sense? To understand this, we move on to the next unsolved problem: LLMs are notoriously complex engineering problems.

Problem 2: The Distribution Problem

When we talk about 'LLMs,' what are they, actually? In reality, an LLM is a combination of two file types:

  1. Weight files. These can be a single file (.bin) or multiple files (.safetensors) that store the weights (the variables that form the model).

  2. Executable file. This file stores the structure of the model and executes it by retrieving the weights from the weight files and using them to compute every prediction.

While the second file's memory footprint is negligible, the weight files are huge. A simple rule of thumb is to take the total number of parameters, usually in the billions, and multiply it by 2, which assumes every variable weighs 2 bytes in memory (most LLMs are trained in the bfloat16 data type, so it's a sensible assumption).

For instance, a minute model like Llama 3.1 8B “only” weighs 16 GB {8×10⁹ × 2 = 16×10⁹ bytes = 16 GB}. However, GPT-4 is rumored to have 1.8 trillion parameters. At that size, the model weighs 3,600 GB, or 3.6 terabytes.
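That rule of thumb as a two-line helper (the GPT-4 figure is the rumored one quoted above, not a confirmed number):

```python
def model_memory_gb(params_billion, bytes_per_param=2):
    """Rule of thumb: weight footprint = nº of parameters × 2 bytes (bfloat16)."""
    return params_billion * 1e9 * bytes_per_param / 1e9

print(model_memory_gb(8))      # Llama 3.1 8B   -> 16.0 GB
print(model_memory_gb(1800))   # rumored GPT-4  -> 3600.0 GB (~3.6 TB)
```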

And here is where things get nasty. LLMs do not run on one single GPU, but on potentially thousands.

Hardware nugget: Why do we use GPUs and not CPUs?

CPUs have far more processing power per core, but they can perform far fewer simultaneous calculations because they have far fewer cores than a GPU, and massive parallelism is precisely what running LLMs requires.

A standalone NVIDIA H100 (the most in-demand GPU right now, considering that H200 deliveries are only just ramping up and the new Blackwell platform has yet to ship a single GPU) has 80 GB of HBM memory, meaning you would need 45 of them just to store one GPT-4 replica.

H100s come in blocks of eight, called a DGX server, which costs around $200,000 and essentially behaves like one fat GPU. That brings the HBM size to 640 GB, still roughly 3,000 GB short of even storing GPT-4.

For that, you would need at least 6 DGX servers, putting your capital costs at well over $1 million to store the fricking model replica. And even then, you would still not be able to run it.
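The back-of-the-envelope arithmetic behind those numbers, using the rough sizes and prices quoted above and ignoring everything except the weights:

```python
import math

H100_HBM_GB = 80               # memory of a single H100
DGX_HBM_GB = 8 * H100_HBM_GB   # a DGX server = 8 H100s -> 640 GB
DGX_PRICE_USD = 200_000        # rough price quoted above

model_gb = 3_600               # rumored GPT-4 footprint in bfloat16

gpus_needed = math.ceil(model_gb / H100_HBM_GB)   # 45 H100s
dgx_needed = math.ceil(model_gb / DGX_HBM_GB)     # 6 DGX servers
capex = dgx_needed * DGX_PRICE_USD                # $1.2M, just to hold the weights

print(gpus_needed, dgx_needed, f"${capex:,}")
```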

But why?

Even if you manage to get hold of 6 DGX servers, you might need to double that number due to the KV Cache, which I cover in my blog in detail. In a nutshell, LLM predictions share extensive computation between consecutive words; recomputing it for every new word would be wasteful and add significant per-prediction latency.

Therefore, many of these computations are stored in memory and fetched when needed.

The main purpose of the KV Cache is to dramatically increase prediction speed for every prediction after the first one (which takes the longest, as that is when the cache is built).

This is an important performance metric known as time-to-first-token. You can see it for yourself by interacting with ChatGPT: try sending it a very large PDF, and you'll immediately notice the difference in speed between the first prediction and the rest.

That is thanks to the KV Cache.
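To build intuition, here is a deliberately oversimplified, single-head sketch of what 'caching keys and values' means during decoding. Real systems run attention over these tensors across many heads and layers on GPU memory, but the growth pattern is the same:

```python
import numpy as np

d = 64                          # toy head dimension
rng = np.random.default_rng(0)
W_k = rng.standard_normal((d, d))   # toy key projection
W_v = rng.standard_normal((d, d))   # toy value projection

kv_cache = {"K": [], "V": []}

def decode_step(new_token_embedding):
    """With a KV cache, each step only projects the newest token; keys and
    values for all earlier tokens are simply fetched from memory."""
    kv_cache["K"].append(new_token_embedding @ W_k)
    kv_cache["V"].append(new_token_embedding @ W_v)
    K = np.stack(kv_cache["K"])   # (seq_len, d): grows by one row per new token
    V = np.stack(kv_cache["V"])
    return K, V                   # attention uses these against the new token's query

for _ in range(5):
    decode_step(rng.standard_normal(d))

print(len(kv_cache["K"]))  # 5 cached entries, one per generated token
```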

Another awesome feature enabled by the KV Cache is that it allows LLMs to sustain global attention. Transformers like ChatGPT do not compress memory, meaning they always have access to the entire sequence. This way, the model can still recall a fact that was said 100,000 words ago.

The trade-off is that this cache grows with sequence length: the longer the sequence, the larger the cache becomes (and the attention computation it saves would otherwise grow quadratically with sequence length, which is what makes the cache indispensable).

Concerningly, as there’s no hard stop to how much it can grow, it can even become larger than the actual model for just one sequence! For instance, using Llama 3.1 8B’s model specs (page 8), we can calculate the KV Cache for any sequence that the model processes using the following formula:

KV Cache (in bytes) = 2 (for keys and values) × {model precision in bytes} × {nº of layers} × {nº of KV heads} × {head dimension} × {sequence length}.

If we take a 500,000 token sequence (around 375k words, which may seem like a lot, but in terms of video frames, it’s actually a short video), the KV Cache of that single sequence is 65 GB, four times the size of the model… and that’s just to serve one sequence!
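Here is that calculation written out, plugging in Llama 3.1 8B's published configuration (32 layers, 8 KV heads under grouped-query attention, head dimension 128, bfloat16 values):

```python
def kv_cache_gb(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """KV Cache = 2 (keys and values) × precision × layers × KV heads × head dim × tokens."""
    return 2 * bytes_per_value * n_layers * n_kv_heads * head_dim * seq_len / 1e9

# Llama 3.1 8B: 32 layers, 8 KV heads, head dimension 128, bfloat16
print(kv_cache_gb(seq_len=500_000, n_layers=32, n_kv_heads=8, head_dim=128))
# ~65.5 GB for a single 500,000-token sequence, about four times the 16 GB model
```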

I wrote a short article exemplifying this calculation so that you can learn how to perform it yourself and understand where the formula comes from.

Hence, if you run batches of hundreds of simultaneous sequences, the KV Cache will grow into TeraBytes.

This is one of the main reasons why context windows, the maximum sequence length you can send these models, are very limited. Only Google, which is head and shoulders above the rest in compute, has managed to scale these windows into millions of tokens.

Long story short, the complexity of running LLMs at scale is highly underestimated by the general public, probably because incumbents hide that complexity so as not to spook investors.

So, what are LLM providers doing?

One of the most popular methods today is to train the largest model possible (think GPT-4, Google Ultra 1.0, or Claude 3 Opus) and then distill it into smaller models that are easier to serve while still punching way above their weight, a process called model distillation.

In fact, OpenAI now offers distillation through its API, and it is the one and only reason these companies have found a 'reasonable' way of providing services to their users without going bankrupt (as we'll see below, they are bleeding money anyway, but at least it's more manageable).

But what is distillation? Without going into much detail, distillation implies sampling a lot of data from the big model (teacher) and training a smaller model on that data (student).

The blend of extremely high-quality data plus an 'imitation objective' leads to a model that, without excessive training costs, 'behaves' like the teacher. Examples of models trained this way include GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro, which, not coincidentally, are the best models out there.
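The labs don't publish their exact recipes, so take this as the textbook version of an 'imitation objective': a soft cross-entropy that pushes the student's next-token distribution toward the teacher's (all logits below are made up):

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)          # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft cross-entropy between the teacher's softened next-token
    distribution and the student's (equivalent to a KL term up to a constant)."""
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature))
    return -(p_teacher * log_p_student).sum(axis=-1).mean()

# Made-up logits over a 5-token vocabulary at two positions in a sequence.
teacher = np.array([[4.0, 1.0, 0.5, 0.2, 0.1],
                    [0.1, 3.5, 0.3, 0.2, 0.1]])
student = np.random.default_rng(0).standard_normal((2, 5))

print(distillation_loss(student, teacher))  # the student is trained to minimize this
```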

Still, as we've seen, even serving a very small Llama 3.1 8B to potentially hundreds of millions of people may require thousands of DGX servers, billions in capital costs, and extremely inefficient, hard-to-predict workloads. An absolute engineering nightmare.

The AI race is no longer about algorithms, it’s about engineering.

While the world thinks OpenAI, DeepMind, and Anthropic are competing to have the most advanced AI algorithm, the real battle is being fought at the engineering layer: who gets to run these models most efficiently, because, truth be told, model quality is fairly similar in raw terms.

And what is the other thing these companies are doing?

Well, as you may have guessed: losing a shitload of money. While one could make the case that companies like OpenAI or Anthropic are already the most unprofitable companies in history, things aren't improving, based on OpenAI's leaked financials.

In fact, things are projected to get much, much worse, which leads to the third problem: is this technology’s business case savable?

Problem 3: Where Are the Profits?

While I won't focus too much on the overall state of the AI industry and its lackluster profits, concentrated in one single company, NVIDIA (I've already done that), recent leaks have taken our understanding of how messed up things are inside these companies to a whole new level.

OpenAI's financials, which can be readily extrapolated to the rest of the incumbents, can't get any uglier, and the scale of their mess is like nothing capitalism has ever seen.

Subscribe to the Full Premium package to read the rest.
