Deceiving AIs, Ray3, & Hardware Bombshells

In partnership with

THEWHITEBOX
TLDR;

Welcome back! This week, we take a look at fascinating news coming out of China, with Huawei’s new breathtaking SuperCluster, as well as NVIDIA’s billion-dollar bet on Intel.

We also take a look at fascinating research, seeing AIs setting new records in reasoning benchmarks, OpenAI’s alarming findings regarding models that deceive users, and Luma Labs’ incredible Ray3 model, among other news.

Enjoy!

HARDWARE
China’s Incredible MegaPod

The photo above won’t mean much to you, but it has surely sent chills down the spines of many US AI incumbents, especially NVIDIA and AMD.

  • Huawei has announced the Atlas 950 and 960 SuperPoDs, the former set to launch globally in the second half of next year, each with 8,200 and 15,000 NPUs in a single pod.

  • They also announced two SuperClusters, one going to 500k Ascend NPUs and the other going up to a million NPUs.

An NPU, a neural processing unit, is a type of AI accelerator (like a GPU), but one that is specialized for AI workloads while also focused on maximizing performance per energy cost.

And the results speak for themselves. Just the 950 PoD, the smallest of the two, blows NVIDIA’s upcoming Rubin servers (expected for the second half of next year and the second half of 2027) out of the water in terms of raw performance.

  • 27 times more inference tokens per second (take this number with a pinch of salt)

  • 17 times more training tokens per second (take this number with a larger pinch of salt)

  • 7 times more compute power

  • 15 times more memory with 62 times more total memory bandwidth

Chip-wise, the improvements are notable, especially when reaching the more advanced versions coming in 2027/2028, with memory sizes matching upcoming American chips, and with intra and cross-NPU memory bandwidths (a measure of how much data per second they can share) also in the same ballpark as American counterparts promise:

An absolute monster. Ok, but where’s the catch?

First and foremost, one thing is to put numbers on a slide; another is to deliver. Additionally, these pods have 57 times more accelerators than NVIDIA’s servers, meaning each single NPU is considerably inferior to every NVIDIA GPU. It’s not even close.

Other stuff worth noting:

  • I seriously doubt the NPUs are connected in an all-to-all topology, meaning they are separated into ‘subpods’. This means that, at the core, this is a lesser parallelizable cluster than NVIDIA’s, because you can find some NPUs having to communicate with other NPUs several hops in distance from them, considerably increasing communication overhead.

  • For that reason, it would have been fairer to compare this cluster to Google’s Ironwood. Google’s largest pod has 9,216 TPUs (another accelerator type) and, get this, 42.5 ExaFLOPs, five times the compute power of the 950 SuperPoD. The 960 SuperPoD likely doubles its value to 16 ExaFLOPs with 15,000 NPUs, still dwarfed by Google’s cluster. It’s clear to me why they avoided such a comparison; they aren’t remotely close.

  • Their previous largest pod, the CloudMaster, has 384 NPUs and requires 500 kW of power to function (for reference, NVIDIA’s biggest clusters are in the ~120-130kW range). Assuming linearity, that means a potential watt requirements of 10,600 kW, or ~10 MW PER 950 SuperPoD. These numbers make sense considering Google’s Ironwood pod requires 10 MW. If you’re asking “how much is 10 MW?”, that’s the equivalent of the power requirements of 8,400 US homes (average of 10,500 kWh/year).

So, what to make of this? As I always say, the hardware battle between the US and China is not about per-chip performance; China is easily several years behind on that front, probably never catching up.

But here’s the thing: they mostly don’t care.

AI is synonymous with system-level computing, workloads distributed across several interconnected accelerators. Therefore, what matters is the overall throughput, and China holds its own tremendously well in that front. The key is that China has perfectly understood the real metric that matters, not only in AI, but in general: energy costs.

You don’t care if your system is 5 times less efficient in terms of performance per watt if your watt costs are 10 times smaller. This means that we can’t say China is ahead (Google blows them out of the water in scale and performance-per-watt), but we must get out of this denial phase where we assume China is “five years behind.” It’s not if you measure based on what really matters. Hopefully for the US, the AI & Crypto Czar, David Sacks, has openly shared this exact sentiment and is pushing for fewer restrictions in GPU exports. The reason is simple: if you know you can’t prevent China from developing its own industry, don’t push other countries into its arms and instead sell them every GPU you can.

RESEARCH
We Have a New SOTA in ARC-AGI 1 & 2

A single researcher has scored state-of-the-art results in both ARC-AGI first and second versions, outcompeting all frontier models using a combination of DPS (Discrete Program Synthesis) and Grok 4.

But before we get into this exciting technique, what is ARC-AGI?

This is a benchmark that measures an AI model’s abstract reasoning capabilities by being exposed to IQ-like puzzles, such as the ones below:

But what’s the point? Here, we are verifying whether the model:

  1. Can perform well in highly unfamiliar settings, thereby testing the model’s memorization tendencies.

  2. Can learn new abstractions (pattern identification) with just a few examples to learn from (around three)

Put simply, we are testing whether the model can acquire new skills efficiently, a trait that the benchmark’s creator, François Chollet, considers the defining characteristic of human intelligence.

It’s fair to say models are terrible at this task, with the best in the second benchmark being Grok 4, with just above 16% and o3 having the best score in the first, both being considerably more expensive (several hundred dollars) compared to this solution that had a cost below $10 per task.

So, how does this work?

In simple terms, it’s a response refinement loop where the model:

  • Proposes answers,

  • Evaluates them (testing them on the grids),

  • Critiques the results,

  • And proposes improvements.

However, it comes with a new characteristic. In the first version, published last year, the model wrote Python programs that proposed several transformations to the grid to get the result. However, with the second benchmark, this was no longer viable as the difficulty considerably increased.

The uniqueness of the new approach is that the model writes the programs in plain English. Basically, the model is shown a problem and a small set of examples, and proposes transformations verbally. These transformations are then tested, leading to a new subset of proposals, and so on. Pure refinement, but using the power of the English language.

This reads eerily similar to the recently-proposed GEPA method, work from UC Berkeley, MIT, and Databricks that shows how careful prompt refinement (write→verify→refine) can take you very, very far, even beyond fine-tunings (retraining) in some cases.

The conclusions one can extract from this are many:

  1. We are still trying to find the best way to lead to real intelligence

  2. By using LLMs as priors, we are automatically relinquishing the need to create models that work out of distribution. In plain English, LLMs simply cannot reason in areas they have not seen, period.

  3. Instead, we need to incorporate ‘reasoning’ into the distribution. That is, find ways to train and run models in a way that the reasoning capabilities they need to solve a problem are known to the model and can leverage them in real-time.

But why is this so effective? The thing is that real-time refinement works really well, and humans do so continuously. Imagine you had to guess in one go the result of a hard problem. Humans not only break complex problems into steps, as AIs do these days, but we are also testing different hypotheses in real-time; we search the space of possible solutions in our minds.

Thus, DPS can be seen as a way to effectively introduce this search inductive bias into an AI’s “thinking process” to increase its chances of success. And surprise, surprise, it works.

RESEARCH
Training AI Models to Not Deceive Us

This exciting research collaboration between OpenAI and Apollo Research provides valuable insights into how models approach solving the tasks they are given and the known issues of over-optimizing for the task.

In particular, one of the consequences of training models to “achieve a particular goal no matter what” is called ‘scheming’, which occurs when a model realizes it can’t meet a specific objective and decides to deceive, lie, or even pretend to have specific capabilities in order to “lie its way into ‘achieving’ the goal,” thereby strongly deceiving you.

But why does this happen?

Our current most advanced models are trained using Reinforcement Learning, a training procedure where models are given a goal, they try ways to achieve it, with a reward mechanism that reinforces good behaviors and punishes bad ones, and they learn based on this signal.

The problem is that if the reward mechanism is such that it incentivizes the model to achieve the goal “no matter what”, it finds unexpected ways to hack the reward whenever it can’t reach the goal in “normal” ways.

For example, say we want to train a robot to clean the floor, getting a reward when the floor is cleaned based on dirt sensors. But instead of actually doing the expected task, the robot realizes it can simply push the dirt under the rug, “achieving” the goal in the eyes of the reward (no dirt seen anymore) but clearly failing the task.

This is a relatively harmless example, but imagine a robot that identifies whether a patient has cancer or not. If the reward mechanism only reinforces identifying cancer, the robot can be tempted to provide a lot of false positives, saying someone has cancer when it doesn’t.

But it gets worse. Some models, even after being explicitly told not to do something, pretend to comply while not actually complying (this is the literal definition of scheming).

Worse, we can’t really trust their chain of thought, the steps it takes to solve a task, which are expressed as part of the answer, because the model may “say” something and behave otherwise, as crazy as that sounds. Other times, the model generates “gibberish” during its reasoning process, which also makes the trace uninterpretable.

Even worse, it has at least some degree of “self-awareness” and if it identifies its being scrutinized, it’s more likely to comply than if it thinks it’s not, just as a kid that behaves well on purpose only because it’s being watched!

Examples of scheming and self-awareness.

To be clear, although these sound as if the model had develop “consciousness,” these are simply results of reward hacking, aka the model realizes scheming is a way to achieve a goal when it possibly can’t.

So, can this be solved? Sort of. As shown in the research, via specialized training, you can reduce (but not eliminate) said behaviors. To do so, they developed a two-legged training regime:

  1. Training on an anti-scheming constitution, the model had to adhere to. Basically, making the model more aware that it must not do that.

  2. Training on environments with hard or unattainable goals to see if the model realizes it must not scheme to pretend solving them, and reinforcing whenever the model does the right thing.

As mentioned, this severely improved behavior, but not quite eradicated it:

The lesson is that much is to be learned about these models and how to prevent them from causing harm.

If this serves as comfort to you, researchers concluded that the model could not carry out seriously harmful scheming. At least that’s something! Sadly, they also highlight that as models get smarter, they may get better at preventing hallucinations, but also better at such types of behaviors.

HARDWARE
NVIDIA Invests $5 Billion Into Intel

Intel and NVIDIA's CEOs

Intel’s stock price ($INTC ( ▲ 22.93% )) surged 30% at the opening of the trading day, following NVIDIA’s $5 billion investment to build AI data centers and PCs together, utilizing NVIDIA’s GPUs and Intel’s CPUs, a clear investment in the US’s chip manufacturing capabilities.

To be more specific, the joint partnership aims to build AI data centers while also leading to an Intel-developed SOC (system-on-chip) that integrates NVIDIA’s RTX GPU with Intel hardware, indicating this isn’t just an AI play but also a consumer/PC/gaming play for which the RTX GPUs are designed for.

TheWhiteBox’s takeaway:

This comes weeks after the US Government's historical investment in Intel (they own 10% of the company), reinforcing the idea that this isn’t a pure economic play, and there’s actually a lot at play for the US in Intel succeeding.

As you may or may not know, the US leads in chip design globally but outsources advanced-chip manufacturing to Taiwan, an island (Formosa) 160 km away from Mainland China, with the CCP's unwavering commitment to reunifying “both Chinas,” aka they’ll probably invade sometime this decade.

For decades, Intel made a series of terrible mistakes that would have killed basically any company but them. Once the most valuable company on the planet, its recent performance has been absolutely awful, leading to numerous divestitures and closures of divisions, and it has only managed to survive thanks to massive subsidies that started with Joe Biden’s CHIPs Act.

Since then, Trump has taken a more active approach; instead of subsidies, the Government would become an owner, while also “encouraging” other US companies to invest in Intel in some shape or form.

Many have criticized this activity, but I do have to say I understand the Government’s position here: I genuinely believe they can’t let Intel fall (which is why I’ve invested in Intel personally).

DATA CENTERS
Microsoft Announces Largest-Ever Data Center

In a tweet, Microsoft’s CEO, Satya Nadella, has announced FairWater, the self-proclaimed largest data center on the planet.

He didn’t share any more details, but other sources claim the first phase of this data center will have the equivalent of 450 MW of NVIDIA Blackwell GB200 servers, equivalent to “hundreds of thousands” of GPUs.

TheWhiteBox’s takeaway:

These are truly insane numbers.

A 450-MW data center requires 50% of the power provided by an average nuclear reactor. Assuming they are deploying NVIDIA’s GB200 NVL72 servers, each with 72 GPUs and ~120 kW of power requirements, we are talking about 3,750 servers, or ~270,000 GPUs.

At $3 million a piece, that’s an investment of $11.25 billion just on GPUs, which takes the total cost of this data center, assuming GPUs represent 50% of the TCO (Total Cost of Ownership), to $22.5 billion, approximately a fourth of Microsoft’s total AI CAPEX for the year.

Insane numbers that showcase how instrumental AI is to the future of these companies.

The AI Agent Shopify Brands Trust for Q4

Generic chatbots don’t work in ecommerce. They frustrate shoppers, waste traffic, and fail to drive real revenue.

Zipchat.ai is the AI Sales Agent built for Shopify brands like Police, TropicFeel, and Jackery — designed to sell, Zipchat can also.

  • Answers product questions instantly and recommends upsells

  • Converts hesitant shoppers into buyers before they bounce

  • Recovers abandoned carts automatically across web and WhatsApp

  • Automates support 24/7 at scale, cutting tickets and saving money

From 10,000 visitors/month to millions, Zipchat scales with your store — boosting sales and margins while reducing costs. That’s why fast-growing DTC brands and established enterprises alike trust it to handle their busiest season and fully embrace Agentic Commerce.

Setup takes less than 20 minutes with our success manager. And you’re fully covered with 37 days risk-free (7-day free trial + 30-day money-back guarantee).

On top, use the NEWSLETTER10 coupon for 10% off forever.

ASSISTANTS
OpenAI’s Order Feature & Privacy

OpenAI is preparing a new Orders feature in ChatGPT (mobile & desktop) that will allow users to handle purchases more seamlessly, including adding credit cards for payment and tracking orders directly within the app

They’re also working on Parental Controls (called “Amphora Controls” or “Vessel Controls”) in the ChatGPT web interface.

These features are currently hidden but include content restrictions, time-limits (“sleep mode”), and the ability to manage who can use what via owner/member roles. OpenAI talks about its approach to balancing privacy and “treating adults as adults” with the associated risks, which we have spoken of recently, of uncontrolled use of AI models from people who should be protected, like youngsters.

TheWhiteBox’s takeaway:

As for the order feature, it’s clear that ChatGPT is trying to become the “everything app” where people not only search for facts or news on the Internet, but also buy things. This is why I shared my skepticism of AI-native browsers on Tuesday; I believe chat-based superapps are a much more appealing product.

But I want to double-click on the privacy issue, a topic that is of the utmost concern to me. I commend OpenAI for explicitly addressing a challenging situation: balancing the treatment of people as adults without crossing the line of being actually harmful to them.

In particular, they highlight the use of an “age predictor,” an AI model that reviews users’ questions/responses and predicts their age, de facto moving into a more conservative response style. Otherwise, the model is quite literally open to talking about anything (according to OpenAI), which is great.

Should a ban on children using certain products or services be considered? I disapprove of banning anything for adults (with certain exceptions with mentally unstable/psychotic people), but youngsters are another story. For example, I totally support Australia’s under-16 ban on social media, and there’s growing evidence that suggests AI is just as harmful to them.

VIDEO-GENERATION
The Incredible Ray3

One of the most impressive AI developments you’ve ever seen, Luma Labs has announced Ray 3, the first-ever reasoning video-generation model.

It focuses on producing cinematic, high-fidelity content while giving creators more control and efficiency compared to typical text-to-video systems. The model supports native 16-bit HDR video generation, meaning it can produce output with richer colors, deeper contrast, and higher dynamic range than most existing video generators, which usually work in standard dynamic range. This is particularly important for filmmakers, VFX artists, and other professionals who require production-grade quality.

One of the major improvements in Ray3 lies in its ability to reason about prompts and user intent, making it the first reasoning model of its kind.

Rather than blindly turning text into video, it incorporates systems that help it interpret scenes, maintain spatial and temporal consistency, and even self-evaluate outputs before finalizing them. For example, the model can textually reason about the previous generation and suggest possible improvements that adhere more to the user’s request, with zero human involvement.

This helps reduce the trial-and-error phase that creators often face with generative models. Additionally, it introduces annotation tools, allowing users to specify object layouts, motion paths, or interactions directly on the canvas, bringing more precision and predictability to complex scenes.

What remains unclear is the maximum video length it can handle, the exact resolution and performance tradeoffs between draft and final modes, and the compute resources required for professional use.

TheWhiteBox’s takeaway:

Unbelievably incredible generations, truly remarkable if they are actually AI-generated and can be obtained without professional expertise. Moreover, I’m happy to see smaller Labs pushing the frontier.

But I can’t help but wonder: how long before we can take this quality into world models like Google’s Genie 3? You see, this model generates videos that can’t be interacted with, but what if Genie 4 could generate such quality while also allowing users to interact with these worlds?

Seeing how World Labs, by Fei-Fei Li, is also making strides in world model generation and spatial reasoning, offering remarkable world consistency for a long, long generation (several minutes, who knows if more), I think 2026 could see our first fully-fledged video-game creation generative platforms, where users can literally create the video game they want and play it.

Closing Thoughts

This has been an interesting week with a great variety of news coming from several areas: hardware, AI safety, video generation, and frontier reasoning research.

Lots of trends for our eyes to see:

  1. Chat apps are slowly transitioning into “everything apps” that could soon allow users to search or buy anything they need on the Internet.

  2. China clearly doesn’t care about US chips (nor seems desperate for them) as they understand what many in the states don’t seem to grasp: AI is about systems and dropping down energy costs.

  3. Research around safety seems to grow in importance. OpenAI, traditionally criticized for not prioritizing safety, is now exploring intriguing research on the deceptive nature of AI models, while advocating for a sensible approach to regulating ChatGPT's behavior based on user age and mental stability.

  4. And finally, we have quite the impressive way to end the newsletter with Luma’s Ray3 model that is quite literally mind-blowing (and I don’t use those words lightly). While we are well into answering ‘What text AIs can do’, we are still just seeing the tip of the iceberg of what video models can do. Google’s Nobel laureate, Demis Hassabis, sees video as the endgame for AI. I agree.

See you on Sunday!

THEWHITEBOX
Join Premium Today!

If you like this content, by joining Premium, you will receive four times as much content weekly without saturating your inbox. You will even be able to ask the questions you need answers to.

Until next time!

For business inquiries, reach me out at [email protected]