Meta's BLT, ChatGPT Computer, & More


For business inquiries, reach out to me at [email protected]

THEWHITEBOX
TLDR;

  • 😏 ChatGPT’s New Step Toward LLM OS

  • 📲 1-800-CHATGPT

  • 🫡 Microsoft’s Phi-4 is a Landmark Model. Here’s Why.

  • 🔐 The Verifiable Compute AI Framework

  • [TREND OF THE WEEK] Meta’s Generational Bet: Byte-Level Transformers

Build Smarter, Faster: AI Voice Agents for Every Industry

Save time building your AI calling assistant with Synthflow’s AI Voice Agent templates—pre-built, pre-tested, and ready for industries like real estate and healthcare. Get started fast with features like lead qualification and real-time booking. You can even create and sell your own templates to earn commissions!

NEWSREEL
ChatGPT’s New Step Toward LLM OS

OpenAI has expanded ChatGPT’s desktop app integrations, enhancing its ability to interact with various applications on PCs and Macs. Previously, ChatGPT supported integrations with tools like VS Code, Xcode, Terminal, iTerm 2, and TextEdit.

The latest update introduces compatibility with additional integrated development environments (IDEs) such as BBEdit, MATLAB, Nova, Script Editor, TextMate, VSCodium, Cursor, Windsurf, and the JetBrains suite, including Android Studio, IntelliJ IDEA, and PyCharm. Terminal applications like Warp and Prompt are also now supported.

Beyond coding tools, ChatGPT has extended its reach to note-taking and productivity apps, adding Apple Notes, Notion, and Quip to its integration list. These enhancements enable features like Advanced Voice Mode to function seamlessly within these applications, allowing users to interact with ChatGPT more naturally and efficiently.

According to Kevin Weil, OpenAI’s Chief Product Officer, these developments are part of a broader strategy to make ChatGPT more “agentic,” transitioning from a simple question-and-answer tool to an assistant capable of performing tasks on behalf of users.

TheWhiteBox’s takeaway:

On Sunday, we discussed how the biggest use case for LLMs was the LLM OS, where products like ChatGPT or Claude essentially become platforms on which most new software applications and human-computer interactions occur.

Now, OpenAI has taken a crucial next step toward that view by allowing ChatGPT to interact with non-agentic applications (applications in which humans are still in charge of taking action), starting the transition toward ‘declarative applications’ in which humans state what they want and the computer executes it.

In my view, this is the biggest product evolution we are going to see during 2025, and I bet that by the end of next year, a considerable portion of your computer usage will be driven by AIs, elevating the experience to something that a few years ago would have been worthy of a Hollywood film.

The next step?

This will surely make most new software AI-native. That is, most software will essentially become front-end wrappers built on these platforms, where the interaction with the back-end (databases and workflow logic) is handled autonomously by large language models (LLMs) like ChatGPT that understand plain-English commands. 2025 is the year of AI products, and I can’t wait to see what comes next.

NEWSREEL
1-800-ChatGPT

For the tenth day of its twelve-day product-release spree, OpenAI has announced access to its Advanced Voice Mode via WhatsApp.

For a limited number of minutes per month, you can access their frontier models through this application, facilitating the distribution of their product to a potential market of 2.7 billion active users of the app.

TheWhiteBox’s takeaway:

Justin Kan, the founder of the video game streaming platform Twitch, once said, “First-time founders focus on product; second-time founders focus on distribution.”

While this may seem like an inconsequential release, it’s actually huge for the AI company. After accessing the iPhone market through Apple Intelligence, they will now enjoy an even larger customer base with WhatsApp.

Distribution is tough, and while all readers of this newsletter know very well what ChatGPT is (and, crucially, what it isn’t), you would be surprised by the number of people who have yet to try the product or haven’t even heard of it. Most times, it’s not about how good your product is; it’s about how good you are at putting that product in the hands of people.

HARDWARE
NVIDIA’s New Home Supercomputer

NVIDIA has announced the Jetson Orin Nano Super, a price-cut version of its home supercomputer. This compact, affordable supercomputer offers up to 67 TOPS of AI performance (67 trillion operations per second), significantly surpassing its predecessor, for as low as $249.

TheWhiteBox’s takeaway:

For reference, the NVIDIA H100, NVIDIA’s current best GPU, has a peak of 1,979 TOPS. Although the difference is large (roughly 30 times less compute throughput), we are talking about a GPU that is roughly 100 times more expensive than the Jetson.

Thus, you get more than three times the value per dollar spent. As a caveat, you only get 8 GB of system-on-module memory, which means the computer is seriously memory-bottlenecked relative to its processing power, and you will struggle to run large models on it.
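
As a quick sanity check on that value-per-dollar claim, here’s the back-of-the-envelope math; the H100 price is a rough assumption, since list prices vary by vendor:

```python
# Rough value-per-dollar comparison (the H100 price is a ballpark assumption).
jetson_tops, jetson_price = 67, 249        # Jetson Orin Nano Super
h100_tops, h100_price = 1979, 25_000       # H100, assumed ~100x the Jetson's price

jetson_value = jetson_tops / jetson_price  # ~0.27 TOPS per dollar
h100_value = h100_tops / h100_price        # ~0.08 TOPS per dollar

print(f"Jetson: {jetson_value:.3f} TOPS/$ | H100: {h100_value:.3f} TOPS/$")
print(f"Value ratio: {jetson_value / h100_value:.1f}x in favor of the Jetson")
```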

That said, smaller models (sub-2 billion parameters) are becoming more powerful, so the range of strong models you can now run at home with this computer is considerable.

FRONTIER RESEARCH
Microsoft’s Phi-4 is a Landmark Model. Here’s Why.

Microsoft has updated its smallest model family by introducing a new, high-performance language model, Phi-4. Phi-4 is a 14-billion-parameter model that surpasses significantly larger models like Llama 3.3 70B and Qwen 2.5 (72B parameters) on math and reasoning benchmarks despite being five times smaller.

The key insight is that Phi-4 owes much of its performance edge to carefully curating its pretraining and fine-tuning datasets. The pretraining involved high-quality web data, including books and research papers, with additional filtering using custom-trained classifiers to ensure only top-tier text was included.

On benchmark tests, Phi-4 demonstrated near state-of-the-art performance, close to that of SOTA models like Meta’s latest Llama 3.3 70B and even GPT-4o.

TheWhiteBox’s takeaway:

This announcement proves two things:

  1. Distillation (training smaller models on data generated by larger ones) has become the default procedure for building powerful models that are also simple and cost-effective to deploy (a minimal sketch follows below).

  2. The transition to smaller models is inevitable. While most AI labs will continue to build ever-larger models because they can be fed larger quantities of data, these models are not deployable. Thus, small language models should be your first option when adopting Generative AI models at scale.

Small Language Models are closing the gap quickly, with examples like Phi-4 or Gemini 2.0 Flash, which is great news for everyone.
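
To make the distillation idea from point 1 concrete, here is a minimal sketch of the data-generation flavor of distillation using the Hugging Face transformers library. The model names, prompts, and training loop are illustrative placeholders, not Microsoft’s actual Phi-4 recipe:

```python
# Minimal sketch of data distillation: a large "teacher" generates text and a
# small "student" is fine-tuned on it. Model names and prompts are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_name, student_name = "large-teacher-model", "small-student-model"  # hypothetical
tok = AutoTokenizer.from_pretrained(teacher_name)  # assume both share a tokenizer
teacher = AutoModelForCausalLM.from_pretrained(teacher_name).eval()
student = AutoModelForCausalLM.from_pretrained(student_name)

# 1) The teacher generates synthetic training text from seed prompts.
prompts = ["Explain why the derivative of x^2 is 2x.", "Prove that sqrt(2) is irrational."]
synthetic_texts = []
for p in prompts:
    ids = tok(p, return_tensors="pt").input_ids
    out = teacher.generate(ids, max_new_tokens=128, do_sample=True)
    synthetic_texts.append(tok.decode(out[0], skip_special_tokens=True))

# 2) The student is trained with the ordinary next-token loss on the teacher's outputs.
optim = torch.optim.AdamW(student.parameters(), lr=1e-5)
for text in synthetic_texts:
    batch = tok(text, return_tensors="pt")
    loss = student(input_ids=batch.input_ids, labels=batch.input_ids).loss
    loss.backward()
    optim.step()
    optim.zero_grad()
```

In practice, the curation step matters as much as the generation step: Phi-4’s edge reportedly comes from aggressively filtering what ends up in that synthetic dataset, not just from generating lots of it.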

HARDWARE
The Verifiable Compute AI Framework

EQTY Lab, in collaboration with Intel and NVIDIA, has introduced the Verifiable Compute AI Framework, a solution designed to enhance AI security and trustworthiness.

It serves as a digital notary embedded into the chips and guarantees that no tampering, such as hacks, occurs while AI models are processing data. It is an alternative to what Apple already designed for its Private Cloud Compute server platform, which handles the Apple Intelligence tasks that on-device models can’t and are therefore offloaded to the servers.

TheWhiteBox’s takeaway:

Achieving fully confidential AI training and inference workloads is necessary. In fact, according to Retool’s 2024 Generative AI report, data security was cited as one of the top three pain points for adopting the technology.

To make matters worse, companies in places like Europe have very strict laws regarding the protection of client data, making the whole issue a nightmare and leading them to reject implementations for the time being.

The issue is that while small models are improving, it’s unlikely that they will ever surpass the quality of the larger models. Thus, the incentives to offload some of the compute to larger models running in cloud environments will always exist.

But if the Verifiable Compute AI Framework lives up to its name, we now have a cryptographically verified solution that guarantees that data being processed by the AI chips isn’t tampered with or stolen, as the digital notary can stop the chip completely if it identifies vulnerabilities. If this is in fact true, we might be able to run AI workloads on the cloud safely, which would be huge.

TREND OF THE WEEK
Meta’s Generational Bet: Byte-Level Transformers

Very few times in recent AI history have we seen research that dares to challenge the foundations of the current frontier AI models.

But that’s precisely what Meta has done by introducing Byte-Level Transformers (BLTs), which may be the final solution to one of AI’s current biggest problems and, simultaneously, make AI models think similarly to how humans do.

With today’s trend of the week, you’ll gain a clearer understanding of how these models work, see one of their key limitations exposed, and learn about an intuitive solution to a problem that has caused many sleepless nights in Silicon Valley.

Let’s dive in!

The Tokenization Drama

While we have gotten quite good at training models that simulate intelligence (even if, in reality, it’s mostly memorization, as we’ve discussed previously), these models still have a very counterintuitive way of processing data.

The Static Computation Problem

Not all problems are created equal. As humans have limited energy and cognitive bandwidth, we adapt the ‘thought effort’ we allocate to each problem; you won’t work as hard to sing a lullaby to your baby as you will to solve a complex math problem.

As you probably know, Large Language Models (and Large Reasoner Models, too) “work” by predicting the next word in a text sequence (e.g., given “The capital of Poland is…”, outputting “Warsaw,” although the reality is a little more complicated).

However, while humans do not commit the same compute to every word, current models allocate the exact same effort to every single prediction. In practice, the number of computations the GPUs running the model perform is the same in every instance (compute requirements grow as the text sequence grows in size, but the per-prediction cost is independent of how hard the prediction is).

Long story short, there’s a whole lotta’ unnecessary compute going on behind the scenes. And the reason for this is what we call ‘tokenization.’

The Crucial Role of Tokens

You’ve probably heard the word ‘token.’ In the case of text, tokens are usually words or subwords, and they are what the model actually predicts: LLMs don’t predict words; they predict tokens, which may or may not be entire words.

The idea of tokenization is applied across all data modalities; all Generative AI models, from text to video generation, perform some sort of ‘tokenization’ as a pre-processing step.

What you may not know is that these tokens are chosen before model training. A vocabulary of tokens is decided, and the model is trained on that vocabulary. Of course, newer and more powerful models have larger vocabularies.

When prediction time comes, the model ranks all the tokens in its vocabulary by probability and chooses the next token based on its likelihood of being a reasonable continuation of the sequence.
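
To make the ‘fixed vocabulary’ idea tangible, here is a minimal sketch using OpenAI’s tiktoken library, a real BPE tokenizer; the example string is arbitrary:

```python
# Minimal illustration of a fixed BPE vocabulary (requires `pip install tiktoken`).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # the vocabulary used by GPT-4-era models
print(enc.n_vocab)                           # ~100k tokens, fixed before training

ids = enc.encode("Byte-Level Transformers")  # text -> token IDs from that fixed vocabulary
print(ids)
print([enc.decode([i]) for i in ids])        # pieces like 'Byte', '-Level', ' Transformers'
                                             # (the exact split depends on the vocabulary)
```

Whatever text you feed the model, it must be expressed as IDs from this pre-decided list, and every prediction is a choice among exactly these entries.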

But tokenization has two big problems:

  1. The model is ‘forced’ to choose one of these tokens regardless, and if the sequence requires a previously unseen combination of letters, like a new word, that’s too bad.

  2. The way a word sequence is broken down into tokens is fixed, which prevents the model from deciding which parts of the sequence are more worthy of computation than others. Instead, the entire sequence is processed as equally important.

But now, Meta has decided to abandon this approach completely. Instead, BLTs treat sequences on the byte level, which, my dear reader, has wild repercussions.

When Models Decide

While byte-level processing, chunking the sequence into bytes instead of tokens, was long considered intractable, Meta has found a way. And if their intentions are any indication, Llama 4, their next model, could be a real revolution in AI.

Transformer Parallelization

In case you didn’t know, current models like ChatGPT do not process the sequence word by word; they process all words in parallel. After tokenization, all words are inserted simultaneously into the model, which then performs two operations:

  1. Mixing operation: Known as the attention mechanism, it makes words in the sequence ‘talk’ to each other, updating the ‘meaning’ of each word with regard to the words before it (think of this as an adjective talking to the rest of the sequence to find the noun it refers to, updating its ‘global meaning’ to capture not only its intrinsic meaning as a word but also how it relates to that noun).

  2. Knowledge embedding operation: The model adds more meaning to the sequence using its core knowledge. For the sequence “Michael Jordan played the game of…” the model uses its knowledge of Michael Jordan to assess that the next word should be basketball, even if the sequence provides no hints of this.

In short, the former contextualizes each word with regard to other words in the sequence, while the latter adds additional required knowledge. This is how models like ChatGPT “understand” what you are saying to them. That said, I insist that this process is done in parallel for every token in the sequence.
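
For those who like to see it in code, here is a bare-bones sketch of those two operations in PyTorch. It is an illustrative toy block (no causal mask, random weights), not any production architecture:

```python
# Bare-bones Transformer block: attention mixes tokens; the MLP injects learned knowledge.
import torch
import torch.nn as nn

class MiniTransformerBlock(nn.Module):
    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(                 # "knowledge embedding" step
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):                         # x: (batch, seq_len, d_model)
        h = self.ln1(x)
        mixed, _ = self.attn(h, h, h)             # every token attends to the others
        x = x + mixed                             # "mixing" (attention) step
        x = x + self.mlp(self.ln2(x))             # per-token knowledge transformation
        return x

# All positions are processed in one parallel pass, each receiving the same compute.
tokens = torch.randn(1, 10, 64)                   # 10 already-embedded tokens
print(MiniTransformerBlock()(tokens).shape)       # torch.Size([1, 10, 64])
```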

By now, you may have realized the issue at hand: since tokenization fixes how the sequence gets chunked regardless of its complexity, every token is treated as equally important and receives the same compute, whatever its real value to the next prediction; even filler words like ‘really’ or ‘umm,’ which provide little value, are processed with the same effort as any other word.

Here’s where BLTs enter the scene.

Toward Dynamic Computation

When a BLT model receives a sequence, it chunks it into patches of bytes. Crucially, these patches have dynamic size. Without entering into the hard details for the sake of length (this patching involves using an auxiliary model that computes byte entropy), the model looks at every byte and measures its ‘surprise.’

In layman’s terms, the model asks itself, “Seeing the previous words, how surprised am I to see this one?” If the answer is ‘a lot,’ the model ends the previous patch and starts a new one. This sounds complex, but an example is enough to get it.

For instance, when starting a text sequence, basically every letter in the vocabulary is a reasonable candidate to some extent. Here, the entropy (the surprise) is very high.

But if the model sees the sequence “The famous composer from Salzburg known as Moz…”, there are only a few reasonable candidates for the next letters, and, as the name already rings a bell, the next few letters are “art” to complete “Mozart.” In this case, the prediction difficulty is very low.

Thus, the surprise is low, which means the model is pretty sure what the next letters should be. It has no reason to start a new patch, so it adds “art” to the current one, and an entire span of easy-to-predict text ends up being treated as a single chunk.
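
Here is a toy sketch of the entropy-driven patching idea. A simple bigram byte model stands in for the small auxiliary model Meta actually trains, and the threshold value is an arbitrary assumption:

```python
# Toy sketch of entropy-based byte patching: start a new patch whenever the next
# byte is "surprising". A bigram byte model stands in for the paper's trained
# entropy model; the 1.5-bit threshold is an arbitrary choice.
import math
from collections import Counter, defaultdict

corpus = b"the famous composer from salzburg known as mozart wrote many symphonies"
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def next_byte_entropy(prev: int) -> float:
    """Shannon entropy (in bits) of the next-byte distribution given the previous byte."""
    counts = bigrams[prev]
    total = sum(counts.values()) or 1
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def patch(text: bytes, threshold: float = 1.5) -> list[bytes]:
    patches, current = [], bytes([text[0]])
    for prev, nxt in zip(text, text[1:]):
        if next_byte_entropy(prev) > threshold:   # surprising -> close the patch
            patches.append(current)
            current = b""
        current += bytes([nxt])
    patches.append(current)
    return patches

print(patch(b"known as mozart"))  # predictable continuations merge into longer patches
```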

As you can see below, the number of ‘chunks’ in a sequence falls as patches grow larger by grouping easy-to-predict words together.

That is, while tokenization (BPE) chunks the sequence deterministically (always the same way), BLT’s entropy-based patching allows for much larger patches that still convey the same information as breaking the text into smaller chunks would.

BPE Tokenization vs Byte Patching.

As we discussed earlier, this is highly desirable because as patches grow in size, the number of chunks in the sequence falls, and, as we recall, every chunk is processed in parallel with the same amount of allocated compute. Fewer chunks to process means a dramatic reduction in computation.


The overarching idea here is that we want to maximize patch size relative to bits of information per patch; if I can group two chunks into one while providing the same amount of information (“Mozart” conveys the same information as “Moz” plus “art”), we obtain the same value while cutting costs roughly in half.

In summary, BLTs are orders of magnitude smarter with their use of compute, allocating it where it matters.

And when Meta tested BLTs with the same training budget as Llama 3, they observed very similar performance, while the BLT showcased a massive decrease in computation cost. Impressively, the BLT Llama outcompeted Llama 3.1, which had been trained on 16 times more data.

TheWhiteBox’s takeaway:

In summary, Meta might have shown the world the way to more efficient computation of frontier AI models, a deeply coveted goal.

These models are smarter about computation while retaining or improving performance, a truly shocking development for an industry that is fighting a tough battle to make its products affordable.

Also, the idea makes sense. Why on Earth can’t AIs ‘decide’ how much computing should be allocated to every problem? Shouldn’t they be able to use compute when it really matters?

That’s the nice thing about this research: it’s not an extremely esoteric problem; it’s something anyone can read about and say, “It makes total sense.”

This is why I firmly believe that Llama 4, unless this research arrived too late for its pre-training run, will likely be the first large-scale BLT model. And soon, most AI labs will adopt this new framing as the default.

Closing Thoughts

With AI potentially becoming ubiquitous through the LLM OS (or AI OS), powerful AI accelerators like the Jetson progressively reaching end consumers, and most frontier AI labs putting considerable effort into making models smaller (Microsoft with Phi-4, Meta with byte-level transformers), the industry appears to be maturing at lightning speed, going from fancy demos to products that actually provide value.

Thanks for reading; let’s talk again on Sunday!

THEWHITEBOX
Premium

If you like this content, join Premium to receive four times as much content weekly without saturating your inbox. You will even be able to ask the questions you need answers to.

Until next time!