Meta's BLT, ChatGPT Computer, & More
For business inquiries, reach out to me at [email protected]
THEWHITEBOX
TLDR;
ChatGPT's New Step Toward LLM OS
1-800-CHATGPT
Microsoft's Phi-4 is a Landmark Model. Here's Why.
The Verifiable Compute AI Framework
[TREND OF THE WEEK] Meta's Generational Bet: Byte-Level Transformers
Build Smarter, Faster: AI Voice Agents for Every Industry
Save time building your AI calling assistant with Synthflow's AI Voice Agent templates: pre-built, pre-tested, and ready for industries like real estate and healthcare. Get started fast with features like lead qualification and real-time booking. You can even create and sell your own templates to earn commissions!
NEWSREEL
ChatGPT's New Step Toward LLM OS
OpenAI has expanded ChatGPT's desktop app integrations, enhancing its ability to interact with various applications on PCs and Macs. Previously, ChatGPT supported integrations with tools like VS Code, Xcode, Terminal, iTerm 2, and TextEdit.
The latest update introduces compatibility with additional integrated development environments (IDEs) such as BBEdit, MatLab, Nova, Script Editor, TextMate, VSCodium, Cursor, WindSurf, and the JetBrains suite, including Android Studio, IntelliJ IDEA, and PyCharm. Terminal applications like Warp and Prompt are also now supported.
Beyond coding tools, ChatGPT has extended its reach to note-taking and productivity apps, adding Apple Notes, Notion, and Quip to its integration list. These enhancements enable features like Advanced Voice Mode to function seamlessly within these applications, allowing users to interact with ChatGPT more naturally and efficiently.
According to Kevin Weil, OpenAI's Chief Product Officer, these developments are part of a broader strategy to make ChatGPT more "agentic," transitioning from a simple question-and-answer tool to an assistant capable of performing tasks on behalf of users.
TheWhiteBox's takeaway:
On Sunday, we discussed how the biggest use case for LLMs was the LLM OS, where products like ChatGPT or Claude essentially become platforms on which most new software applications run and most human-computer interactions occur.
Now, OpenAI has taken a crucial next step toward that view by allowing ChatGPT to interact with non-agentic applications (applications in which humans are still in charge of action-taking) and starting the transition toward "declarative applications," in which humans state what they want and the computer executes it.
In my view, this is the biggest product evolution we are going to see during 2025, and I bet that by the end of next year, a considerable portion of your computer usage will be driven by AIs, elevating the experience to something that a few years ago would have been worthy of a Hollywood film.
The next step?
This will surely make most new software AI-native. That is, most software will essentially become front-end wrappers built on these platforms, where the interaction with the back-end (databases and workflow logic) is handled autonomously by large language models (LLMs) like ChatGPT that can understand English commands. 2025 is the year of AI products, and I can't wait to see what comes next.
NEWSREEL
1-800-ChatGPT
For the tenth day of its twelve-day product-release spree, OpenAI has announced access to its Advanced Voice Mode via WhatsApp.
For a limited number of minutes per month, you can access their frontier models through this application, facilitating the distribution of their product to a potential market of 2.7 billion active users of the app.
TheWhiteBox's takeaway:
Justin Kan, the founder of the video game streaming platform Twitch, once said, "First-time founders focus on product; second-time founders focus on distribution."
While this appears like an inconsequential release, it's actually huge for the AI company. After accessing the iPhone market through Apple Intelligence, they will now enjoy an even larger customer base with WhatsApp.
Distribution is tough, and while all readers of this newsletter know very well what ChatGPT is (and, crucially, what it isn't), you would be surprised by the number of people who have yet to try the product or haven't even heard of it. Most times, it's not about how good your product is; it's about how good you are at putting that product in the hands of people.
HARDWARE
NVIDIA's New Home Supercomputer
NVIDIA has announced a price-cut version of its home supercomputer, Jetson Orin Nano Super. This compact, affordable supercomputer offers up to 67 TOPS of AI performance (67 trillion operations per second), significantly surpassing its predecessor for as low as $249.
TheWhiteBox's takeaway:
For reference, the NVIDIA H100, NVIDIA's current best GPU, has a peak of 1,979 TOPS. Although the difference may seem large (roughly 30 times less compute throughput), we are talking about a GPU that is 100 times more expensive than the Jetson.
Thus, you get more than three times more value per dollar spent. As a caveat, you only get 8GB of system-on-module memory, which means the computer is seriously memory-bottlenecked compared to its processing power, and you will struggle to run large models on it.
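As a rough, back-of-the-envelope sketch of that value-per-dollar claim (assuming an H100 price of about $25,000, in line with the "100 times more expensive" figure above; this is my own assumption, not an official price):

```python
# Back-of-the-envelope TOPS-per-dollar comparison using the figures above.
# The ~$25,000 H100 price is an assumption, not an official NVIDIA figure.
jetson_tops, jetson_price = 67, 249
h100_tops, h100_price = 1_979, 25_000

jetson_value = jetson_tops / jetson_price  # ~0.27 TOPS per dollar
h100_value = h100_tops / h100_price        # ~0.08 TOPS per dollar

print(f"Jetson: {jetson_value:.2f} TOPS/$  |  H100: {h100_value:.2f} TOPS/$")
print(f"Value ratio: {jetson_value / h100_value:.1f}x in favor of the Jetson")
```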
That said, smaller models (sub-2 billion parameters) are becoming more powerful, so the range of strong models you can now run at home with this computer is considerable.
FRONTIER RESEARCH
Microsoft's Phi-4 is a Landmark Model. Here's Why.
Microsoft has updated its smallest model family by introducing a new, high-performance language model, Phi-4. Phi-4 is a 14-billion-parameter model that surpasses significantly larger models like Llama 3.3 70B and Qwen 2.5 (72B parameters) on math and reasoning benchmarks despite being five times smaller.
The key insight is that Phi-4 owes much of its performance edge to carefully curating its pretraining and fine-tuning datasets. The pretraining involved high-quality web data, including books and research papers, with additional filtering using custom-trained classifiers to ensure only top-tier text was included.
On benchmark tests, Phi-4 demonstrated near-state-of-the-art performance, close to that of SOTA models like Meta's latest Llama 3.3 70B and GPT-4o.
TheWhiteBox's takeaway:
This announcement proves two things:
Distillation (training smaller models on data generated by larger ones) has become the default procedure to construct powerful models that are also simple and cost-effective to deploy.
The transition to smaller models is inevitable. While most AI labs will continue to build ever-larger models because they can be fed larger quantities of data, these models are not deployable. Thus, small language models should be your first option when adopting Generative AI models at scale.
Small Language Models are closing the gap quickly, with examples like Phi-4 or Gemini 2.0 Flash, which is great news for everyone.
HARDWARE
The Verifiable Compute AI Framework
EQTY Lab, in collaboration with Intel and NVIDIA, has introduced the Verifiable Compute AI Framework, a solution designed to enhance AI security and trustworthiness.
It serves as a digital notary embedded into the chips and guarantees that no weird behavior, like hacks, occurs while data is being processed by AI models. This is an alternative to something Apple already designed for its Private Cloud Compute server platform, which supports the Apple Intelligence tasks that the on-device models can't handle and that are therefore offloaded to Apple's servers.
TheWhiteBox's takeaway:
Achieving fully confidential AI training and inference workloads is necessary. In fact, according to Retool's 2024 Generative AI report, data security was cited as one of the top three biggest pain points for adopting the technology.
To make matters worse, companies in places like Europe have very strict laws regarding the protection of client data, making the whole issue a nightmare and leading them to reject implementations for the time being.
The issue is that while small models are improving, it's unlikely that they will ever surpass the quality of the larger models. Thus, the incentives to offload some of the compute to larger models running in cloud environments will always exist.
But if the Verifiable Compute AI Framework lives up to its name, we now have a cryptographically verified solution that guarantees that data being processed by the AI chips isn't tampered with or stolen, as the digital notary can stop the chip completely if it identifies vulnerabilities. If this is in fact true, we might be able to run AI workloads on the cloud safely, which would be huge.
TREND OF THE WEEK
Meta's Generational Bet: Byte-Level Transformers
Very few times in recent AI history have we seen research that dares to challenge the foundations of the current frontier AI models.
But that's precisely what Meta has done by introducing Byte-Level Transformers (BLTs), which may be the final solution to one of AI's current biggest problems and, simultaneously, make AI models think similarly to how humans do.
With today's trend of the week, you'll gain a clearer understanding of how these models work, see one of their key limitations, and learn about an intuitive solution to a problem that has caused many sleepless nights in Silicon Valley.
Let's dive in!
The Tokenization Drama
While we have gotten quite good at training models that simulate intelligence (even if, in reality, it's mostly memorization, as we've discussed previously), these models still have a very counterintuitive way of processing data.
The Static Computation Problem
Not all problems are created equal. As humans have limited energy and cognitive bandwidth, we adapt the "thought effort" we allocate to each problem; you won't work as hard to sing a lullaby to your baby as to solve a complex math problem.
As you probably know, Large Language Models (and Large Reasoner Models, too) "work" by predicting the next word in a text sequence (e.g., given "The capital of Poland is…" they output "Warsaw," although the reality is a little more complicated).
However, while humans do not commit the same compute to every word, current models allocate the exact same thought effort to every single prediction. In practice, the number of computations the GPUs running the model perform is exactly the same for every prediction (the compute requirements grow as the text sequence grows in size, but the per-prediction cost is independent of the prediction task).
Long story short, there's a whole lotta' unnecessary compute going on behind the scenes. And the reason for this is what we call "tokenization."
The Crucial Role of Tokens
You've probably heard the word "token." In the case of text, tokens are usually words or subwords, and they are the actual predictions: LLMs don't predict words; they predict tokens, which may or may not be entire words.
The idea of tokenization is applied across all data modalities; all Generative AI models, from text to video generation, perform some sort of "tokenization" as a pre-processing step.
What you may not know is that these tokens are chosen before model training. A vocabulary of tokens is decided, and the model is trained on that vocabulary. Of course, newer and more powerful models have larger vocabularies.
When prediction time comes, the model ranks all the tokens in its vocabulary by probability and chooses the next token based on its likelihood of being a reasonable continuation of the sequence.
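As a rough sketch of that prediction loop, here is a toy example with a made-up nine-token vocabulary and hand-picked scores (a real model has a vocabulary of tens of thousands of tokens and computes the scores with a neural network; `vocab` and `predict_next_token` are purely illustrative names):

```python
import numpy as np

# A tiny, fixed vocabulary decided before "training"; the model can only ever
# output items from this list.
vocab = ["The", "capital", "of", "Poland", "is", "Warsaw", "War", "saw", "."]

def predict_next_token(logits: np.ndarray) -> str:
    """Rank every token in the fixed vocabulary by probability and pick the top one."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # softmax over the whole vocabulary
    ranked = np.argsort(probs)[::-1]     # highest-probability tokens first
    return vocab[ranked[0]]

# Pretend the model scored "Warsaw" highest for "The capital of Poland is".
logits = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 4.0, 1.5, 1.2, 0.1])
print(predict_next_token(logits))  # -> "Warsaw"
```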
But tokenization has two big problems:
The model is "forced" to choose one of these tokens regardless, and if the sequence requires a previously unseen combination of letters, like a new word, that's too bad (the sketch after this list illustrates this).
The way a word sequence is broken down into tokens is fixed, which prevents the model from deciding which parts of the sequence are more worthy of computation than others. Instead, the entire sequence is processed as equally important.
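To make this concrete, here is a tiny, hypothetical greedy tokenizer; the `VOCAB` list and `greedy_tokenize` function are made up, and real tokenizers (such as BPE) are trained on data, but the behavior is similar: an unseen word shatters into whatever pieces happen to exist, and the chunking never adapts to how easy or hard the text is.

```python
# Toy greedy longest-match tokenizer over a tiny, made-up vocabulary.
VOCAB = ["trans", "form", "er", "token", "ization", "a", "b", "l", "t"]

def greedy_tokenize(word: str) -> list[str]:
    tokens, i = [], 0
    while i < len(word):
        # Take the longest vocabulary entry matching at position i,
        # falling back to a single character if nothing matches.
        match = max((v for v in VOCAB if word.startswith(v, i)),
                    key=len, default=word[i])
        tokens.append(match)
        i += len(match)
    return tokens

print(greedy_tokenize("transformer"))  # ['trans', 'form', 'er']
print(greedy_tokenize("blt"))          # ['b', 'l', 't']: a new word shatters into characters
```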
But now, Meta has decided to abandon this approach completely. Instead, BLTs treat sequences on the byte level, which, my dear reader, has wild repercussions.
When Models Decide
While byte-level processing, i.e., chunking the sequence into bytes instead of tokens, was considered an intractable problem, Meta has found a way. And if their intentions are an indication of anything, Llama 4, their next model, could be a real revolution in AI.
Transformer Parallelization
In case you didn't know, current models like ChatGPT do not process the sequence word by word; they process all words in parallel. After tokenization, all words are inserted simultaneously into the model, which then performs two operations:
Mixing operation: Known as the attention mechanism, it makes words in the sequence "talk" to each other, updating the "meaning" of each word with regard to the words before it (think of an adjective that talks to the rest of the sequence to find the noun it refers to, updating its "global meaning" to capture not only its intrinsic meaning as a word but also how it relates to that noun).
Knowledge embedding operation: The model adds more meaning to the sequence using its core knowledge. For the sequence "Michael Jordan played the game of…" the model uses its knowledge of Michael Jordan to assess that the next word should be "basketball," even if the sequence provides no hints of this.
In short, the former contextualizes each word with regard to the other words in the sequence, while the latter adds additional required knowledge. This is how models like ChatGPT "understand" what you are saying to them. That said, I insist that this process is done in parallel for every token in the sequence.
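For the curious, here is a heavily simplified sketch of those two operations for a single layer: one attention head, random stand-in weights, and no causal masking, residual connections, or normalization (all of which real models use). It is only meant to show that every position is processed in parallel.

```python
import numpy as np

def transformer_block(x, Wq, Wk, Wv, W1, W2):
    """Simplified transformer block: attention ("mixing") + feed-forward ("knowledge").
    x has shape (seq_len, d); every position is updated in parallel."""
    # 1) Mixing operation: each position attends to the others.
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax attention weights
    mixed = weights @ V                               # contextualized representations

    # 2) Knowledge embedding operation: a position-wise feed-forward network,
    #    where much of the model's stored knowledge is thought to live.
    hidden = np.maximum(0, mixed @ W1)                # ReLU
    return hidden @ W2

# Tiny random example: 5 tokens, dimension 8 (all weights are random stand-ins).
rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(5, d))
out = transformer_block(x,
                        *(rng.normal(size=(d, d)) for _ in range(3)),
                        rng.normal(size=(d, 4 * d)),
                        rng.normal(size=(4 * d, d)))
print(out.shape)  # (5, 8): one updated vector per token, computed in parallel
```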
By now, you may have realized the issue at hand: because tokenization fixes how the sequence gets chunked independently of its complexity, the entire sequence is treated as equal, and compute is allocated to every token regardless of its real value to the next prediction, even including filler words like "really" or "umm," which provide little value yet are still processed with the same effort as any other word.
Hereâs where BLTs enter the scene.
Toward Dynamic Computation
When a BLT model receives a sequence, it chunks it into patches of bytes. Crucially, these patches have dynamic size. Without entering into the hard details for the sake of length (this patching involves using an auxiliary model that computes byte entropy), the model looks at every byte and measures its "surprise."
In layman's terms, the model asks itself, "Seeing the previous words, how surprised am I to see this one?" If the answer is "a lot," the model ends the previous patch and starts a new one. This sounds complex, but an example makes it clear.
For instance, when starting a text sequence, basically every letter in the vocabulary is a reasonable candidate to some extent. Here, the entropy (the surprise) is very high.
But if the model sees the sequence "The famous composer from Salzburg known as Moz…" there are very few candidates for the next letters, and, as the name already rings a bell, the next letters are "art," completing "Mozart." In this case, the prediction difficulty is very low.
Thus, the surprise is low, which means the model is pretty sure what the next letters should be. It has no reason to start a new patch, so it adds "art" to the current patch, and an entire stretch of text ends up being treated as one.
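Here is a simplified sketch of that patching idea (not Meta's exact algorithm): assume a small auxiliary byte-level model has already produced an entropy, or "surprise," score for every character; a new patch starts whenever that surprise jumps above a threshold. The `entropy_patches` function and the hard-coded scores are purely illustrative.

```python
def entropy_patches(text: str, entropies: list[float], threshold: float = 2.0) -> list[str]:
    """Split a character sequence into dynamic patches.

    entropies[i] is how "surprised" a small auxiliary byte-level model is by
    character i given everything before it. When the surprise jumps above the
    threshold, the current patch is closed and a new one starts.
    """
    patches, current = [], text[0]
    for ch, surprise in zip(text[1:], entropies[1:]):
        if surprise > threshold:   # surprising character -> start a new patch
            patches.append(current)
            current = ch
        else:                      # predictable character -> extend the current patch
            current += ch
    patches.append(current)
    return patches

# Made-up surprise scores: the start of the name is hard to guess,
# but once the word becomes recognizable, the rest is almost free.
text = "Mozart"
entropies = [4.0, 2.5, 0.3, 0.1, 0.1, 0.1]
print(entropy_patches(text, entropies))  # -> ['M', 'ozart']: the predictable tail merges into one patch
```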
As you can see below, what this entails is that the number of "chunks" a sequence has tends to fall as patches grow larger by grouping easy-to-predict words together.
That is, while tokenization (BPE) chunks the sequence deterministically (always the same way), BLT's entropy-based patching allows for patches that, despite being much larger, convey the same information as breaking the text into smaller chunks.
BPE Tokenization vs Byte Patching.
As we discussed earlier, this is highly desirable because as the patches grow in size, the number of chunks in the sequence falls, and those chunks, as we recall, are all processed in parallel and with the same amount of allocated compute. This means that the model has to process fewer chunks, dramatically reducing computation.
The overarching idea here is that we want to maximize patch size relative to the bits of information per patch; if I can group two chunks into one while providing the same amount of information ("Mozart" conveys the same information as "Moz" plus "art"), we obtain the same value while cutting costs roughly in half.
In summary, BLTs are orders of magnitude smarter with their use of compute, allocating it where it matters.
And when Meta tested BLTs with the same training budget as Llama 3, they observed very similar performance, while the BLT showcased a massive decrease in computation cost. Impressively, the BLT Llama outcompeted Llama 3.1, which was trained on 16 times more data.
TheWhiteBox's takeaway
In summary, Meta might have shown the world the way to more efficient computation for frontier AI models, a deeply coveted goal.
These models are smarter about computation while retaining or improving performance, a truly shocking development for an industry that is fighting a tough battle to make its products affordable.
Also, the idea makes sense. Why on Earth can't AIs "decide" how much computing should be allocated to every problem? Shouldn't they be able to use compute when it really matters?
That's the nice thing about this research: it's not an extremely esoteric problem; it's something anyone can read and say, "It makes total sense."
This is why I firmly believe that Llama 4, unless this research comes out after pre-training, will likely be the first large-scale BLT model. And soon, most AI labs will adopt this new framing as the default.
Closing Thoughts
With AI potentially becoming ubiquitous through the LLM OS (or AI OS), powerful AI accelerators like the Jetson progressively becoming available to end consumers, and most frontier AI labs putting considerable effort into making models smaller (Microsoft with Phi-4, Meta with byte-level transformers), the industry appears to be maturing at lightning speed, going from fancy demos to products that actually provide value.
Thanks for reading; letâs talk again on Sunday!
THEWHITEBOX
Premium
If you like this content, join Premium to receive four times as much content weekly without saturating your inbox. You will even be able to ask the questions you need answers to.
Until next time!
Give a Rating to Today's Newsletter