Will o3 Take My Job? Cough-to-Tuberculosis, & More

In partnership with

Need a personal assistant? We do too; that’s why we use AI.

Ready to embrace a new era of task delegation?

HubSpot’s highly anticipated AI Task Delegation Playbook is your key to supercharging your productivity and saving precious time.

Learn how to integrate AI into your own processes, allowing you to optimize your time and resources, while maximizing your output with ease.

THEWHITEBOX
TLDR;

Technology:

  • 🫡 Cough-to-Tuberculosis AI

  • 🤩 Google’s Gemini LRM

  • 🕳️ The Well Dataset

Markets:

  • ⚡️ Sriram Enters Trump’s Cabinet

Product:

  • 🥵 Liquid AI Scores $250 Million Funding Round

  • 🫣 o1-preview outperforms Doctors

TREND OF THE WEEK: Why o3 Won’t Take Your Job in 2025

HEALTHCARE
Cough-to-Tuberculosis AI

An article in IEEE Spectrum discusses the development of an artificial intelligence (AI) system designed to detect tuberculosis (TB) by analyzing cough sounds. This innovative approach aims to provide a quick, cost-effective, and accessible screening tool, which would be especially beneficial in regions with limited access to traditional diagnostic methods, like India.

The AI model was trained on a vast dataset of cough recordings from individuals with and without TB. By identifying subtle acoustic differences, the system can distinguish TB coughs from those caused by other conditions. Preliminary results indicate that this method could serve as an effective preliminary screening tool, potentially prompting individuals to seek further medical evaluation.
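
The article doesn’t detail the model’s internals, so take this as a minimal, hypothetical sketch of how such an acoustic screener is typically built (spectral features plus a simple classifier; every function and path below is made up for illustration):

```python
# Hypothetical sketch only: the real system's architecture isn't described
# in the article. The general recipe is to turn each cough recording into
# spectral features and train a classifier on TB / non-TB labels.
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

def cough_features(wav_path: str, sr: int = 16000) -> np.ndarray:
    """Summarize a cough recording as its mean MFCCs (a 20-dim vector)."""
    audio, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=20)  # (20, n_frames)
    return mfcc.mean(axis=1)

def train_screener(wav_paths: list[str], labels: list[int]) -> LogisticRegression:
    """Fit a simple classifier on labeled recordings (1 = TB-positive)."""
    X = np.stack([cough_features(p) for p in wav_paths])
    return LogisticRegression(max_iter=1000).fit(X, labels)

def screen(model: LogisticRegression, wav_path: str) -> float:
    """Return a TB probability. A high score only flags the person for
    proper clinical testing; it is not a diagnosis."""
    return model.predict_proba(cough_features(wav_path)[None, :])[0, 1]
```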

TheWhiteBox’s takeaway:

I’ve said it many times, and I’ll say it again.

While OpenAI has massively shifted the discourse toward AI as a tool to enhance productivity (or even substitute humans, which is, of course, a massively overstated claim that nonetheless inflates these companies’ valuations), AI continues to push the boundaries of what’s possible the way it always has: as a tool of scientific discovery.

AI’s greatest superpower is pattern recognition. That is, finding subtle patterns in vast amounts of data and extrapolating them into successful predictions.

Just like a doctor can identify some illnesses based on a human cough, AIs can detect tuberculosis by finding the key nuances that make an ill patient’s cough different from a healthy person’s.

The issue with healthcare costs is not a small one. The US is currently gripped by the Luigi Mangione case, and a decent portion of the population sympathizes with the alleged killer of a healthcare CEO, despite his having (again, allegedly) killed a father of two in cold blood.

This showcases how much this country’s healthcare needs a complete rethink. Luckily, researchers are finding ingenious ways to use AI to expedite illness detection and, hopefully, reduce healthcare costs worldwide.

It also suggests AI may be one of the crucial technologies for changing an industry whose total spending by patients and insurers is almost as large as the US government’s annual budget.

FRONTIER RESEARCH
Google’s Gemini LRM

The same week OpenAI released o3, Google presented its first LRM, Gemini-2.0-Flash-Thinking-Exp-1219. This model quickly rose to the top of all leaderboards in the LMSYS arena (these results are pre-o3, of course, but it beat o1-preview and o1-mini).

This is important because it shows the first US-based competitor to o-type models, signaling that Google is quickly catching up with OpenAI’s efforts.

TheWhiteBox’s takeaway:

Google is having a great end of the year, with many remarkable models like Veo2, Genie2, and the Gemini 2.0 series, with this thinking model as the latest release.

For instance, regarding video generation, one could make a solid case for them already being ahead of OpenAI (Veo2 appears to be head and shoulders better than Sora), and they also have a massive video data advantage thanks to YouTube.

They are also pushing for real-time videogame generation with Genie2, which could soon see them commercializing solutions that generate playable videogames at scale.

And with regard to LRMs, it’s clear OpenAI is ahead with o3, but Google seems to have cured the huge illness we discussed back in our Google deep dive: their fear of disrupting themselves. They once allowed OpenAI to eat their lunch with LLMs amid fears that this technology would eat into their search business (which it is inevitably doing).

Now, they don’t seem willing to let that happen again, and they have the means to win the AI race.

DATA
The Well Dataset

The Well is a comprehensive collection of machine-learning datasets featuring approximately 15 terabytes of numerical simulations across various spatiotemporal physical systems.

Curated by domain scientists and numerical software developers from multiple top universities, it encompasses 16 datasets spanning fields such as biological systems, fluid dynamics, acoustic scattering, and magneto-hydrodynamic simulations, including phenomena like supernova explosions.

These datasets are designed to be utilized individually or as part of a broader benchmark suite, aiming to advance research in machine learning and computational sciences.

TheWhiteBox’s take:

The importance of this release can’t be overstated. This huge 15-terabyte dataset (almost as large as some of the biggest pretraining data distributions used to train 2024 frontier models) will provide AIs with rich intuition about physics, a fundamental requirement for these models to, one day, physically inhabit our world.

Moreover, they are also releasing another 100 terabytes of astronomical data, called the “Multimodal Universe,” which will help AIs deepen their understanding of this field. As we know, AIs are great discovery tools, which could push human understanding of the Universe to new heights.

TRUMP ADMINISTRATION
Sriram Enters Trump’s Cabinet As an Advisor

Sriram Krishnan, an ex-a16z General Partner, has joined the team of David Sacks, the US Government’s new AI & Crypto Czar, as an advisor to the Trump Administration on AI and crypto matters. The important thing to highlight here is that Sriram, like Andreessen Horowitz, has been notoriously in favor of open-source AI.

In other words, Sriram joins other pro-open-source voices, like Vice-President-Elect JD Vance and major campaign donor Marc Andreessen, as key advisors to Trump on AI matters related to open-source.

In 2025, regulators will be pressured to effectively ban open-source AI. Frontier AI labs are not seeing operational costs fall; they are seeing them accelerate. Training LRMs requires massive human and computing effort to build entirely new datasets, as reasoning data is largely non-existent on the open web.

That means that AI training costs will increase heavily during 2025, and the pressure to drop token prices will increase even more as competition between large labs becomes fierce. Naturally, this asymmetry breaks the business case, as they also have to deal with Meta’s and the Chinese labs’ open-source, Pareto-optimized solutions. These aren’t as good as frontier models, but they go 85% of the way while offering the technology to the world for free.

As Keith Rabois said in the All-In podcast, “There are no secrets in AI,” meaning AI labs are desperate to eliminate open-source competition. Will they succeed in 2025?

I hope people like Sriram prevent it.

LIQUID MODELS
Liquid AI Scores $250 Million Funding Round

Liquid AI, an MIT spin-off specializing in efficient, general-purpose AI systems, has secured $250 million in a Series A funding round. This investment aims to accelerate the development and deployment of Liquid Foundation Models (LFMs), lightweight AI models designed for enterprise applications.

The key point about Liquid AI’s models is that, unlike Transformer-based models like ChatGPT or Gemini, these LLMs run in near-constant time and memory regardless of input length, making them much more efficient to run.

Specifically, just like any state-space model, the amount of information the model can store at any given time doesn’t grow with the input length because the model actively decides what to remember and what to forget.

This is very similar to how humans process knowledge. We actively discard irrelevant data, as there’s a limited amount of things one can remember. On the other hand, Transformers don’t work that way, forcing them to increase their memory allocation tremendously as the input size grows.
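
To make that asymmetry concrete, here’s a toy sketch (emphatically not Liquid AI’s actual architecture) contrasting a fixed-size, state-space-style memory with a Transformer-style KV cache that grows with every token:

```python
# Toy illustration (not Liquid AI's actual model): a state-space-style cell
# keeps a fixed-size state however long the input gets, while a Transformer's
# KV cache must retain something for every token it has seen.
import numpy as np

d = 64                           # state / embedding size
A = np.eye(d) * 0.95             # toy dynamics: old information decays
B = np.random.randn(d, d) * 0.01

state = np.zeros(d)              # SSM-style memory: always d floats
kv_cache = []                    # Transformer-style memory: grows per token

for _ in range(10_000):          # stream 10,000 tokens
    x = np.random.randn(d)       # stand-in for the next token's embedding
    state = A @ state + B @ x    # fixed-size update: keep some, forget some
    kv_cache.append(x)           # attention keeps every token around

print(state.nbytes)                      # constant: 64 * 8 = 512 bytes
print(sum(v.nbytes for v in kv_cache))   # linear in sequence length
```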

TheWhiteBox’s takeaway:

I’m watching what this team is doing very carefully. Liquid AI is founded and led by some of the most prolific AI researchers out there, and their mission is to solve a truly enormous problem today for AI: memory bottlenecks.

If they solve this while maintaining performance, their ability to disrupt the industry will be massive, as all frontier AI labs are currently absorbing huge losses in exchange for nominal intelligence improvements that, while real, aren’t economically viable, which is why models like OpenAI’s o3 won’t be released anytime soon.

HEALTHCARE
o1-preview outperforms doctors

A study by several healthcare institutions, AI companies, and universities, including Microsoft and Stanford, has produced compelling results showing that AIs outperform human doctors in diagnosing illnesses.

The results aren’t even close, as doctors score around 30% while o1-preview scores 80%. Crucially, these results weren’t obtained by simply having an AI and a human take a test; they were performed in scenarios that truly mirrored real situations.

TheWhiteBox’s takeaway:

Few jobs in the world are more exposed to AI than a doctor’s. When performing diagnoses, they have to go through extensive data, find patterns in it, and map those patterns to the illnesses that match them.

There’s no way around it; it’s not surprising that AIs are already better than humans at this, because it’s precisely what they do best.

Does that mean doctors aren’t necessary anymore? Of course not. Illness detection is a very sensitive process that requires human accountability in case the diagnosis is wrong, and the emotional role doctors play throughout the process of treating an illness is something AIs won’t replace anytime soon, if ever.

Doctors aren’t going anywhere. Instead, AIs will provide better detection capabilities, which are crucial in cases where the patient is in danger, and enhance productivity: doctors will reach the correct conclusions far more often and, importantly, faster.

Also, AIs could find hidden patterns that our healthcare systems may never have found on their own (like diagnosing tuberculosis from the sound of someone’s cough, as we saw earlier), expanding the range of ways we can diagnose illnesses.

TREND OF THE WEEK
Why o3 Won’t Take Your Job in 2025

The mass hysteria around the announcement of OpenAI’s o3 model is something we’ve never seen before. Some people are literally panicking and abandoning their computer science careers, convinced that, soon enough, those skills won’t even be necessary in a world with AI.

But all of this is just plain-simple bullshit.

This is the worst side of the industry: an obscene amount of unnecessary hype, fueled by AI influencers desperate for clicks to monetize their content, who end up looking like utter grifters in my eyes or, at best, ignorant.

Today, I’ll explain why this is nonsense and why not a single person will lose their job to o3 in 2025. Let’s dive in.

An Impressive Yet Nuanced Announcement

As the industry shakes off the hangover from OpenAI’s announcement of its latest o-type Large Reasoner Model (LRM), o3, reality is kicking in.

A Historical Moment

Yes, the model is absurdly impressive in nominal terms, showing results that make you wonder whether you will have a job next year.

Among the amazing results across many benchmarks, one in particular has been consistently highlighted: o3’s 87.5% accuracy at the high-compute setting (when the model is allowed to think for longer) on the public ARC-AGI benchmark.

For the first time ever, an AI obtained better results than the average human in abstract reasoning tasks designed to test whether a test taker, whether an AI or a human, can identify a subtle pattern on a grid and apply it to a new example.

As we explained on Sunday (full overview of what o3 models are here), it tests two crucial aspects of intelligence:

  1. Acquisition of skills on the fly. In other words, whether the test taker can learn a new pattern on the spot.

  2. Efficient learning, or whether the test taker can learn the new pattern with just a few examples.
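
To make the task concrete, here’s a toy, hypothetical ARC-style example (vastly simpler than the real benchmark) in which a solver must infer the rule from a single demonstration and apply it to a new grid:

```python
# Toy ARC-style task (vastly simpler than the real benchmark). The solver
# sees one demonstration pair, must infer the hidden rule on the spot, and
# then apply it to a brand-new grid: skill acquisition from one example.
import numpy as np

demo_in = np.array([[1, 0],
                    [2, 3]])
demo_out = np.array([[0, 1],
                     [3, 2]])   # hidden rule: mirror the grid left-right

candidate_rules = {
    "mirror_lr": np.fliplr,
    "mirror_ud": np.flipud,
    "rotate_90": np.rot90,
}

# Keep only the rules consistent with the single demonstration
consistent = [f for f in candidate_rules.values()
              if np.array_equal(f(demo_in), demo_out)]
rule = consistent[0]

test_in = np.array([[4, 5, 6],
                    [7, 8, 9]])
print(rule(test_in))  # [[6 5 4], [9 8 7]]: the rule generalizes
```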

The fact that an AI model can overcome this challenge is truly commendable in itself, but there’s more to these results than meets the eye.

The Unmentioned Subtleties

While we can’t go as far as to say the results don’t reflect reality, they aren’t as scary as {insert AI “influencer”} will tell you.

Picture this: you have two kids taking a test. One scores 80% in just over 20 minutes of deep thinking. The other scores 90% but takes two entire months to finish the test.

Which kid is smarter? The one that scored 80% in 20 minutes, or the one that got a better score but took two months to do so?

In my personal view, intelligence isn’t just a raw score; efficiency plays a crucial role. And in that category, AIs are still very, very bad.

According to insiders, the average processing cost of o3 solving one task in the benchmark was $5,000.

In layman’s terms, o3 spent an average of 57 million tokens (roughly 40+ million words) per task to solve grid patterns that humans would take, at most, minutes to solve; that is where the outstanding $5,000 figure comes from.
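
Back-of-the-envelope, taking those insider figures at face value (the human cost below is my own assumption):

```python
# Back-of-the-envelope on the reported o3 high-compute figures
tokens_per_task = 57_000_000   # per insider reports
cost_per_task = 5_000          # USD, per insider reports

print(cost_per_task / tokens_per_task * 1_000_000)  # ~$88 per 1M tokens

# Generously assume a human solves the same grid for a few dollars of time
human_cost_per_task = 5        # USD, my own assumption
print(cost_per_task / human_cost_per_task)          # o3 is ~1,000x pricier
```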

This tells you that o3 and other Large Reasoner Models (LRMs) are surfing a wave of absurdly large compute and capital, turning problem-solving into a game that, with enough compute, will eventually lead you to the correct answer.

And that’s without considering that these models burn upward of $500 million to be trained, making o3’s overall economics something that could make Warren Buffett faint.

My point is that for AIs like o3 to truly impact society, they need to severely improve their intelligence efficiency: the amount of “intelligence” they deliver per unit of compute.

But how can we measure that?

The Bits per Byte Metric

With standard Large Language Models (LLMs), the main performance metric is perplexity. In layman’s terms, it measures the model’s “surprise,” or how confident it is about predicting the next word.

If perplexity falls, the model is less “perplexed,” meaning it is more confident (measured as a probability assigned to the chosen word) about what this word should be.

But with LRMs, bits per byte (BpB) becomes the main metric.

The Emergence of a New Metric

BpB measures the “amount” of information conveyed by each produced token or word.

With LRMs (LLMs that generate both reasoning and response tokens when answering), the number of produced tokens per task is considerably larger. Here, it’s not enough to predict the next word accurately; that word must also be relevant, so that, over time, the number of generated tokens shrinks.
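
For concreteness, here’s a minimal sketch of how perplexity and the standard, compression-style bits-per-byte figure fall out of the probabilities a model assigns to its own tokens (all numbers are made up):

```python
# Minimal sketch of the two metrics, computed from the probabilities a
# model assigns to its own output tokens. All numbers below are made up.
import math

token_probs = [0.50, 0.25, 0.80, 0.10]   # hypothetical per-token probabilities
generated_text = "step one: add"         # the bytes those tokens decode to

nll_nats = -sum(math.log(p) for p in token_probs)   # total cross-entropy
perplexity = math.exp(nll_nats / len(token_probs))  # per-token "surprise"

bits = nll_nats / math.log(2)                       # nats -> bits
bpb = bits / len(generated_text.encode("utf-8"))    # bits per byte of output

print(f"perplexity: {perplexity:.2f}")   # ~3.16: lower = more confident
print(f"bits/byte:  {bpb:.2f}")          # ~0.51: lower = denser information
```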

It’s great to see that o3 got almost 90% on ARC-AGI, until you realize it generated millions of tokens per question on tasks for which a human would generate, at most, 100 to 200 (if such a comparison could be made, of course).

Consequently, if we truly want to measure the intelligence of an o-type model like o3, we must measure not only the quality of the response but also how efficiently the model arrived at it.

This is why BpB is a great metric: o3’s responses are generally correct, but its bits per byte, the amount of information per produced token, is absurdly low. Using the earlier analogy, humans are the kid scoring 80% in 20 minutes; AIs sometimes beat us, but it takes them a “human lifetime” to respond.

And the problems don’t end there. As top AI researcher Miles Cranmer has shown, o-type models don’t seem to improve when it comes to hallucinations (the mistakes they make).

In fact, the user experience is actually worse, as the model doubles down on its mistakes far more often than previous models, as if it has grown much more arrogant about its knowledge.

That makes the experience of using o-type models not only an expensive one, but one that can turn out to be expensively wrong.

Hold Your Horses

For AI research labs, quoting benchmark results is a neat way to compare their offerings with other labs’ and to convey a model’s utility and “intelligence”, even though those numbers aren’t even close to reflecting reality.

Celebrate o3 for what it is

o3’s results in ARC-AGI or FrontierMath are to be celebrated for one very important reason: they once again give us hope that humans are, maybe, heading in the right direction toward building Artificial General Intelligence (AGI).

Instead, they are being celebrated as the ‘conquering of AGI,’ which is simply false. This framing is also intended to convey that these models are much smarter than they really are; when measuring intelligence efficiency, they are still dumber than a young kid, and o3’s results don’t change that.

In fact, they illustrate the point even further: o3 requires millions of dollars to run on a single benchmark, as it has to generate millions of tokens to answer moderately challenging grid pattern-finding problems.

That’s not AGI; that’s proof that, with enough compute, AI models can indeed obtain remarkable results (and again, that’s the real victory: more compute leads to greater outcomes).

If anything, o3 must be viewed as proof that compute seems to be the key unlocker of intelligence, but we are nowhere near the real intelligence we aspire to build with these systems (as even OpenAI acknowledges).

That being said, we have reasons to be optimistic about this: ChatGPT has reduced its processing costs by 100x since its launch. Also, o3-mini is cheaper to run than o1-mini despite being much “smarter.”

In other words, we are indeed improving the ‘bits per byte’ metric, but the point is that this process will take much longer than people think.

And what about our jobs? What’s the deciding factor?

Well, it’s quite simple: money.

Incentives are everything

The real reason o-type models can’t make a real dent in the labor market right now is nothing but costs. Think about it:

  • Would you care whether our frontier AIs were ‘intelligence efficient’ if the costs and latency of running them were close to zero?

  • Would you mind if the model generated millions of tokens to reach the answer if the answer is correct and cheap to obtain?

Of course not!

If o3’s price went to zero, everyone would have a model that can solve some of the hardest math problems, problems that even expert mathematicians struggle with.

You wouldn’t care if this act involved rote memorization and ungodly amounts of ‘thinking’; you would only care about the result. But right now, o3 would bankrupt your company in days if you deployed it at scale.

The truth about AI and the search for machine intelligence?

It was never—and will never be—about building true intelligence; it was always about making machine intelligence cost less than a human’s.

If AI labs achieve that, we can then ask whether these tools will substitute some humans (again, framing AI as a substitute for all human workers is cheap fearmongering).

While LLMs are already achieving this inversion, they are dumb as rocks. The real unlocker will come once LRMs become cheaper than hiring humans. o3 does have the potential to make you ask whether you really need that extra software developer, or whether it’s better value for money to pay the subscription and give the tool to your senior SWE.

Those questions are coming, but I highly doubt those numbers will add up in 2025, especially seeing how constrained all AI labs are in compute and energy.

But that’s enough from me for today. Thanks for reading, and see you on Sunday!

THEWHITEBOX
Join Premium Today!

If you like this content, join Premium and you will receive four times as much content weekly without saturating your inbox. You will even be able to ask the questions you need answered.

Until next time!

For business inquiries, reach out to me at [email protected]