

For business inquiries, reach out to me at [email protected]
THEWHITEBOX
TLDR;
- 🤯 OpenAI Reimagines Image Generation 
- 😍 Anthropic’s New Models 
- 📹 HeyGen Zoom Avatars & Runway’s Act-One 
- 🧐 Solving the 9.8<9.11 Dilemma 
- 🎭 Adobe’s New Image Rotation Tool 
- 😳😳 Ex-OpenAI Claims Illegal Activities 
- [TREND OF THE WEEK] Physical AI Foundation Models 

Writer RAG tool: build production-ready RAG apps in minutes
RAG in just a few lines of code? We’ve launched a predefined RAG tool on our developer platform, making it easy to bring your data into a Knowledge Graph and interact with it with AI. With a single API call, Writer LLMs will intelligently call the RAG tool to chat with your data.
Integrated into Writer’s full-stack platform, it eliminates the need for complex vendor RAG setups, making it quick to build scalable, highly accurate AI workflows just by passing a graph ID of your data as a parameter to your RAG tool.

NEWSREEL
OpenAI Redefines AI Image Generation

Source: OpenAI
As rare as seeing pigs fly, OpenAI has released open research that, for once, is actually insightful and contributes to the space. It explores consistency models, diffusion-based architectures that can generate an image almost instantly, whereas standard diffusion models require hundreds of steps to output one.
But what are diffusion models? The gold standard for image and video generation, they learn to denoise a noisy image like the one in the GIF above, guided by the text the user provides (e.g., “draw a butterfly”).
Mathematically speaking, the model receives the text input and a noisy canvas (a sample of random noise). In an iterative, step-by-step process, the model predicts the noise in the image and removes part of it, eventually arriving at a sample from a new distribution (butterflies, in this case) that semantically matches the user’s request.
Intuitively, think of diffusion as a similar process to creating a marble statue; you initially have a marble block, and by chiseling away the excess, you uncover the hidden figure. Therefore, diffusion can be understood as chiseling out the noise to uncover the image.
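To make the chiseling idea concrete, here is a toy sketch of the denoising loop. It is not how OpenAI or anyone else implements diffusion: the function `predict_noise` is a hypothetical stand-in that cheats by looking at a known target, whereas a real model is a trained neural network that estimates the noise from the canvas and the text prompt. The names `denoise`, `target`, and the step schedule are assumptions made for illustration only.

```python
import numpy as np


def predict_noise(canvas, target):
    # Hypothetical stand-in for the trained network. A real diffusion model
    # *learns* to estimate the noise; here we compute it exactly from a
    # known target so the loop itself is easy to follow.
    return canvas - target


def denoise(target, steps=50, seed=0):
    rng = np.random.default_rng(seed)
    canvas = rng.normal(size=target.shape)  # start from pure random noise
    errors = []
    for step in range(steps):
        noise_estimate = predict_noise(canvas, target)
        # Chisel away only a fraction of the estimated noise per step.
        # The fraction grows so that the last step removes what is left.
        canvas = canvas - noise_estimate / (steps - step)
        errors.append(float(np.abs(canvas - target).mean()))
    return canvas, errors


# A tiny five-pixel "image" standing in for the butterfly.
target = np.array([0.0, 0.5, 1.0, 0.5, 0.0])
image, errors = denoise(target)
print(errors[0], errors[-1])  # the error shrinks as noise is removed
```

Calling `denoise(target, steps=1)` collapses the loop into a single jump. Loosely speaking, that is the behavior a consistency model is trained to achieve, except that it has to do so with a learned network rather than an oracle.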
TheWhiteBox’s takeaway:
To me, this is an unmistakable sign that OpenAI is no longer an AI lab but a fully-fledged AI product company. The idea behind these models is not to push the industry forward or take us closer to AGI but to make image generators faster and cheaper to run.
It’s also a sign that model distillation is here to stay. These faster models are distillations of diffusion-based models. In other words, you first pre-train a more powerful yet slower diffusion model. Then, you distill its knowledge and capabilities into a smaller model that also needs far fewer sampling steps, and that is the one you serve to customers.
Distillation involves taking a more powerful model (the teacher) and using it to train a less powerful but more efficient model (the student) by teaching the student to imitate the teacher’s responses.
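As a rough illustration of what “imitating the teacher” can mean in practice, the sketch below computes a common distillation objective: the KL divergence between the teacher’s and the student’s softened output distributions. This is a generic textbook formulation, not the recipe any specific lab is confirmed to use, and the logits and the `temperature` value are made up for the example.

```python
import numpy as np


def softmax(logits, temperature=1.0):
    # Higher temperature flattens the distribution, exposing more of the
    # teacher's "opinion" about the non-top answers.
    scaled = logits / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=-1, keepdims=True)


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student), averaged over the batch. It is zero when the
    # student reproduces the teacher's distribution exactly.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=-1)))


# Illustrative logits for one input over three possible outputs.
teacher = np.array([[4.0, 1.0, 0.5]])
student_close = np.array([[3.5, 1.2, 0.4]])
student_far = np.array([[0.5, 1.0, 4.0]])

print(distillation_loss(teacher, student_close))  # small
print(distillation_loss(teacher, student_far))    # much larger
```

Training the student then amounts to adjusting its weights to push this number down across many inputs.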
Distillation is steadily becoming the main deployment method for generative AI solutions. First with LLMs like GPT-4o and Claude 3.5 Sonnet (believed to be distillations of GPT-4 and Claude 3 Opus, respectively) and now with image generators, deployment has become a game of who ships the best Pareto-optimized model: one that gives you 80% of frontier performance at 20% of the cost (the numbers are purely illustrative).
PRODUCT
Anthropic’s Claude 3.5 Sonnet v2 & Computer Use

Source: Anthropic
Anthropic has released two new models, Claude 3.5 Sonnet (New) and Claude 3.5 Haiku, as well as a new computer use feature that allows Claude to control your screen through the API.
The announcement has generated a lot of hype because the new Claude 3.5 Sonnet is pretty incredible in terms of reasoning, coding, and complex tasks in general. It even exceeds the prowess of OpenAI’s o1-preview on SWE-bench Verified, a benchmark of real GitHub issues that the model must resolve autonomously.
As for the computer use feature, it’s pretty self-explanatory, and this video showcases a simple demo. It’s an experimental beta rather than a finished product, and there are many reasons to suggest it will stay that way for quite some time.
TheWhiteBox’s takeaway:
I’m actually pretty confused by these results. As mentioned, Claude 3.5 Sonnet (New) surpasses o1-preview in some instances.
This is not an apples-to-apples comparison, as o1-preview is an LLM with an added layer of exploration (the model can explore different ways to solve a problem before answering). Naturally, this should give it a clear edge over non-explorative models like Claude, but that doesn’t seem to be the case in all reasoning tasks, meaning one of two things:
- Claude 3.5 Sonnet (New) is so absurdly superior to GPT-4o (allegedly o1-preview’s underlying LLM) that even with exploration, the former is simply superior. 
- This added layer of exploration is very overvalued, and it doesn’t imply a true increase in reasoning capabilities. 
Either way, I’m starting to cozy up to the idea of openly stating, as many in the space already have, that Anthropic’s LLMs are superior to OpenAI’s. If Anthropic releases an o1-type model, OpenAI’s recent investors could start doubting their decisions.
PRODUCT
HeyGen & Runway’s Amazing New Models
Another field that is making amazing progress is video generation. HeyGen has released a Zoom feature that allows AI avatars to connect on your behalf, leading to uncanny interactions like this. Interestingly, the product is already available, and you can test one of the five pre-generated avatars for free. It’s quite the experience.
The models run on HeyGen’s avatar engine and OpenAI’s real-time voice integration, ensuring an uncanny interaction with AIs that feel human.
Following this release, Runway presented Act-One, which creates alternative characters based on humans interacting on camera, as seen in the linked video.
TheWhiteBox’s takeaway:
Both demos are very, very impressive. We are certainly in for massive progress over the coming months. These products could, allegedly, dramatically change how humans create video games or animated films.
But one thing troubles my mind: repeatability.
As we described earlier, video generation models are also diffusion models. They start from a random canvas of noise, meaning that every generation begins from a slightly (or largely) different starting point. This makes it very hard to keep objects consistent, or even similar, across several frames.
Don’t get fooled by the demos; users of OpenAI’s Sora mentioned this is one of the worst problems with the model, as objects and characters varied between frames, making the overall video unbearable.
It is one thing to generate a 20-second video of a character speaking; it is another to generate hundreds of independent cuts and assemble a film in which the character is consistent across all of them.
One option is to condition generation on sketches or other structural guides that constrain the generated images and video to a certain layout, as ControlNet does, which is precisely what Act-One might be doing with the human videos.
Still, this requires extremely intensive preprocessing (drawing the sketch or recording the video), which implies that AI-generated videos will require loads of work from humans… just in a different way.
PRODUCT
Solving the 9.8<9.11 Dilemma
For many months, a particular problem that LLMs can’t seem to solve has intrigued researchers worldwide. For some reason, LLMs continuously fail the following question: “Which number is larger, 9.8 or 9.11?”
Now, a new product from Transluce, a company that has just emerged from stealth mode to help the industry better understand foundation models, has finally shed light on the reasons behind this weird issue, at least with the Llama 3.1 8B model. Fascinatingly, the reason is a mixture of 9/11, gravity, and the Bible.
Yes, you read that right.
Transluce’s tool, Monitor, lets you steer the model by suppressing certain behaviors and enhancing others. If you suppress the neurons associated with those three concepts (especially the Bible one), the model suddenly becomes capable of doing these numerical comparisons correctly, as shown in the accompanying short video.
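One way to read this finding (my interpretation, not something Transluce’s tool outputs) is that “9.8” and “9.11” have two legitimate orderings depending on context. As decimals, 9.8 is larger. As chapter-and-verse references or dotted dates, 9.11 comes after 9.8. The sketch below shows both readings side by side; the function names are made up for illustration and have nothing to do with how Monitor works internally.

```python
from decimal import Decimal


def larger_as_decimal(a: str, b: str) -> str:
    # Ordinary arithmetic reading: 9.8 means 9.80, which is above 9.11.
    return a if Decimal(a) > Decimal(b) else b


def larger_as_chapter_verse(a: str, b: str) -> str:
    # "Chapter.verse" (or "month.day") reading: compare each part as an
    # integer, so verse 11 comes after verse 8.
    def key(ref: str):
        major, minor = ref.split(".")
        return int(major), int(minor)

    return a if key(a) > key(b) else b


print(larger_as_decimal("9.8", "9.11"))        # 9.8
print(larger_as_chapter_verse("9.8", "9.11"))  # 9.11
```

If a model’s internal features for verses or dates fire on these strings, it is plausible that the second ordering leaks into what should be a purely numerical comparison.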
TheWhiteBox’s takeaway:
Interpretability is a growing field of interest in AI that aims to demystify the inner workings of these models. Understanding them not only gives us insight into how they work but also provides us with ways to steer models, even blocking undesired responses in a way that current alignment methods can only dream of.
PRODUCT
Adobe’s New Image Rotation Tool
Adobe has released a new tool for rotating images. As the videos provided show, you can select objects or characters in an image and rotate them, giving the scene richer, more 3D-looking interactions.
The model respects shapes and physics (shadows) well and provides image designers with yet another AI tool that seems helpful for once.
TheWhiteBox’s takeaway:
The last step in the Generative AI value pipeline is going from demo to actual product. However, it’s proving to be the biggest hurdle in the entire process, as lack of reliability, poor robustness, and, quite frankly, little real value in most cases prevent companies from embracing these tools once and for all.
In creative design, however, companies like Adobe or Canva seem to be capitalizing really well on GenAI, releasing products that appear to be really helpful and have a low adoption barrier.
LAW
Ex-OpenAI Claims Illegal Activity
I hope you weren’t getting used to OpenAI drama. Now, a former OpenAI employee has written on his personal blog that he believes the company is not complying with U.S. copyright law.
Specifically, this refers to the idea that OpenAI is scraping copyrighted data from the Internet to train its AI models without any attribution or remuneration (let alone permission) from the original sources, which surely include both corporations and individual content creators.
This news couldn’t have been made public at a worse time. A few days ago, this statement signed by 13,000 artists, actors, and musicians came out asking these companies to respect copyright law, stating: “The unlicensed use of creative works for training generative AI is a major, unjust threat to the livelihoods of the people behind those works, and must not be permitted.”
TheWhiteBox’s takeaway:
I mean, are you surprised? Foundation models, the business OpenAI, Google, Meta, or Anthropic are in, involve progressively growing the size of models and the datasets they use.
This scaling is thought to be the essential piece in the growth formula of these models’ capabilities. Consequently, these labs are highly incentivized to push the limits of what they will do to get this data. Synthetic data training (generating data from these models and using it to train newer iterations or surrogate models, as in model distillation) is heavily leveraged, but having too much synthetic data in your overall dataset severely hurts performance.
This also explains why the market for generating new data from human contractors is booming, with examples like ScaleAI quadrupling revenues in the first half of 2024.
Still, it seems that these models are so data-hungry that these labs can’t resist engaging in dubious practices.

TREND OF THE WEEK
Physical AI Foundation Models

For centuries, humanity has uncovered the laws of physics and nature by observing experiments that revealed reality’s hidden patterns. Still, many of those patterns remain hidden today, and several researchers have wondered whether AI can be of any use here.
That vision might be becoming reality. A group of researchers at Archetype AI have presented their advances in this fascinating field: the idea that sensor data can be used to train a physical AI foundation model, a model that understands the physics of the world and can reconstruct and predict physical events with uncanny accuracy.
The model, named Newton, not only seems to ‘understand’ the events it was trained on, but it also seems to have found key underlying regularities that allow it to predict motions, temperatures, and events it has not seen before.
Today, we will learn about an often overlooked area of AI research that could prove much more relevant than ‘ChatGPT’: how AI could help us find new laws in nature and physics that could advance our understanding of the real world, and the evidence that suggests this is coming soon.
The Importance of Patterns
For the last few weeks, I’ve discussed the limitations of AI models, which seem to lack real reasoning. Yet, they do have a very powerful feature: their capability to find patterns in data.
But what do we mean by that?
The Kaleidoscope Hypothesis
As illustrated by François Chollet in a recent lecture, the Kaleidoscope hypothesis claims that the world, the universe, and everything inside it follow a repeating structure.
It appears endlessly novel and different in many ways, but it all boils down to underlying structures from which all things develop. In a way, it works like a kaleidoscope, a tube-shaped optical instrument that contains mirrors and small colored objects, such as beads or pieces of glass. As you turn the kaleidoscope, the objects tumble and the mirrors reflect them differently, creating new patterns (thumbnail image).
In other words, while the amount of true bits of information is fixed (the handful of colored pieces inside the tube), the number of different patterns we can create with them feels infinite, but this ‘infinity’ is, in reality, a recombination of those key structures.
This is why the laws of nature, like Newton’s law of gravitation, Maxwell’s equations of electromagnetism, and so on, work; there are hidden laws that rule the behavior of things, be they planets or electrons.
And it appears that AI models are great at finding these patterns.
From Grammar to Rediscovering
The idea of using AI to discover the laws of nature isn’t new. Back in April, I wrote a full article on how some researchers were striving to achieve just this, diving into the inner functioning of neural networks and how they can be used to discover reality.
In particular, a group of Cambridge researchers rediscovered the laws of gravity using a neural network.
Although the model only had the position and velocity of each object at any given time, it was tasked with predicting its movement and acceleration. By doing so, it was forced to find the system’s dynamics, even inferring the objects’ masses (the model didn’t know at first that ‘mass’ was even a thing).
In layman’s terms, the model found the hidden laws behind objects’ movements and masses, aka the laws of gravity, simply by observing them. This was confirmed after the researchers distilled the model’s predictions into symbolic rules (analytical equations) that matched the ones Newton found.
This is pattern-matching in a nutshell: observing the behavior of things and inducing the general structures that govern their behavior, be that observing how planets move or how words follow each other, like ChatGPT.
However, rediscovering laws is one thing; having the AI predict unknown behaviors is another. Enter Newton.
The Newton Model
The idea behind this foundation model for physics is to train a model to predict physical behaviors from sensor measurements.
Simultaneously Similar & Different
As shown below, we use sensors to capture information from the physical world, and the model encodes these measurements to do two things:
- Reconstruct the past: Learn to reproduce measurements that occurred previously. 
- Predict the future: Learn to predict the future trajectories of physical quantities of interest. 
With the former, we teach the model to find the underlying patterns. With the latter, we force the model to apply these past regularities to extrapolate them to future trajectories, aka predict how the object will behave next.
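The two objectives above can be sketched in a few lines. This is only an illustration of how a sensor series might be split into a “past” to reconstruct and a “future” to predict; it is not Archetype AI’s training code. The sine wave is a stand-in for a real sensor signal, and `make_training_pair`, `context_len`, and `horizon` are names I made up for the example.

```python
import numpy as np


def make_training_pair(series, context_len, horizon):
    # The context is what the model sees (and must learn to reconstruct);
    # the future is what it must learn to predict.
    context = series[:context_len]
    future = series[context_len:context_len + horizon]
    return context, future


def mse(a, b):
    return float(np.mean((a - b) ** 2))


# Toy "sensor" signal: a sine wave sampled 200 times.
t = np.linspace(0, 4 * np.pi, 200)
signal = np.sin(t)
context, future = make_training_pair(signal, context_len=150, horizon=50)

# Objective 1: reconstruct the past. A perfect reconstruction scores zero.
reconstruction_loss = mse(context, context)

# Objective 2: predict the future. As a naive baseline, repeat the last
# observed value; a model that has learned the pattern should beat this.
naive_forecast = np.full(len(future), context[-1])
forecast_loss = mse(naive_forecast, future)

print(reconstruction_loss, forecast_loss)
```

A trained model would replace both the perfect reconstruction and the naive forecast with its own outputs, and training would push both losses down.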

Technology-wise, Newton isn’t very different from the models we’ve discussed many times in this newsletter. It’s a standard Transformer that has learned the mapping between sensor measurements and the past and future behaviors of the objects being measured.
But what is a Transformer?
In a nutshell, Transformers take input, such as a text sequence, an image, or a set of sensor measurements, and break it into individual patches of information (e.g., roughly words in the case of ChatGPT). Then, they make these patches talk to each other (a process known as attention) so that every patch can update its meaning relative to other patches in the input.
Doing this several times builds an understanding of the overall input. This process is nearly ubiquitous in today’s AI, used everywhere from LLMs to Neuralink’s brain-to-computer interfaces, because it offers unmatched pattern-matching capabilities.
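For readers who want to see what “patches talking to each other” looks like, here is a minimal single-head self-attention sketch in NumPy. It uses random, untrained weights and leaves out everything else a real Transformer has (multiple heads, masking, residual connections, normalization), so treat it as a picture of the mechanism rather than Newton’s actual architecture. The shapes and variable names are arbitrary choices for the example.

```python
import numpy as np


def self_attention(patches, w_q, w_k, w_v):
    # Each patch produces a query ("what am I looking for?"), a key
    # ("what do I contain?") and a value ("what do I pass along?").
    q = patches @ w_q
    k = patches @ w_k
    v = patches @ w_v
    # Similarity between every query and every key, scaled for stability.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax rows
    # Each patch becomes a weighted mix of all patches' values.
    return weights @ v, weights


rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 8))  # 4 patches, 8 features each
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))

updated, weights = self_attention(patches, w_q, w_k, w_v)
print(updated.shape)         # (4, 8)
print(weights.sum(axis=-1))  # each row of attention weights sums to 1
```

Row `i` of `weights` tells you how much patch `i` listens to every other patch when updating its own representation.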
What makes Newton different is that it has been trained exclusively on sensory data, allowing it to develop an unmatched intuition for how the world works. And it walks the talk.
Impressive Zero-Shot Capabilities
The model was trained with 590 million samples of physical data captured from sensors, which allowed the model to predict the patterns in movements like this harmonic oscillator (green is the input, yellow is the prediction, and blue is the ground truth).

As you can see, the prediction and ground truth almost overlap, signaling that the model has captured the key pattern.
But this is just an experiment. The real value comes when the Newton model can predict things it has not seen before. In particular, they tested the model in unseen situations like temperature modeling, Turkey’s power consumption, or meteorological parameters like humidity or wind speed.
In these scenarios, we saw two fascinating discoveries:
- The model can predict patterns in all these examples, forecasting how physical quantities like temperature or electricity consumption will evolve in settings it has never seen. 
- Newton outperformed the target-trained models (models trained for that particular task) in all settings. This supports the Kaleidoscope hypothesis: by studying vast quantities of physical data, a model can learn key structures that apply broadly across physics and nature, leading to higher-quality abstractions (for instance, by studying electricity and temperature, the model learns regularities that apply to both, outperforming a model that only learns about temperature modeling). 

Prediction loss across several tasks. Smaller is better. Source: Archetype AI
TheWhiteBox’s takeaways:
Technology:
Archetype’s Newton isn’t an innovative AI tech approach; it uses the same underlying structures as most models today.
Instead, it’s a reframing of what AI can be for humans: not just a tool of productivity, but a tool of discovery. Considering my skepticism about AI’s reasoning capabilities, I predict AI’s biggest contribution to society in the short term will be as a discovery tool, one that benefits from AI’s uncanny capability to find hidden patterns in data to make sense of our world.
Products:
Although Archetype AI, a team with backgrounds at NASA, Google, and MIT, among others, is still in its very early days, the idea is to streamline this discovery process by building a foundation model of the physical sciences.
Markets:
Currently, markets are still all about productivity-focused AIs. However, discovery-based AIs could add a lot of value and hype to the AI narrative, one that is being heavily questioned these days, not only from the intelligence perspective but also because many AI products aren’t living up to their promises.
Closing thoughts
After a week of many model releases, the industry seems inevitably directed toward products instead of research, signaling that the pressure to start making money is getting to these billion-dollar-backed start-ups.
But the highlight to me is clearly Newton, an exciting alternative to the view of what AI can deliver. It is a breath of fresh air that, for a moment, pushes us away from the endless spree of Generative AI LLM and video generation news and overhyped demos.
This Sunday, we are looking into the traits of future AI leaders.

THEWHITEBOX
Premium
If you like this content, by joining Premium, you will receive four times as much content weekly without saturating your inbox. You will even be able to ask the questions you need answered.
Until next time!





