The Cognitive Revolution · 2025-04-10

Google DeepMind's AMIE & Co-Scientist: AI Beats Doctors, Makes Discoveries

Hosts: Nathan Labenz

Guests: Vivek Natarajan, Anil Palepu

AMIE · Co-Scientist · Medical AI · Multi-agent systems · Scientific discovery · Clinical trials · Agent scaffolding · Long context · Test-time compute · Structured reasoning · Tournament evaluation

Why it matters

AMIE surpasses medical fellows in cardiology and oncology.

Key claims

  • AMIE now beats primary care physicians on both diagnosis and guideline-grounded treatment recommendations in multi-visit simulated consultations
  • AMIE surpasses medical fellows in cardiology and oncology; cardiologists with AMIE assistance outperform those without it across nearly every metric
  • Co-Scientist is a multi-agent Gemini system (no custom fine-tuning) that succeeded on drug repurposing, target identification, and bacterial drug resistance
  • Co-Scientist independently proposed the same drug-resistance mechanism that human scientists at Imperial College had discovered experimentally but not yet published

Episode summary

Summary

Google DeepMind researchers Vivek Natarajan and Anil Palepu join the show to discuss two major papers published in Nature. The first covers AMIE (Articulate Medical Intelligence Explorer), which has now extended beyond diagnosis to outperform primary care physicians on treatment recommendations grounded in clinical guidelines, and is surpassing medical fellows in cardiology and oncology while remaining complementary to attending physicians. The second introduces Co-Scientist, a multi-agent AI system that succeeded on three open-ended scientific challenges—drug repurposing, therapeutic target identification, and bacterial drug resistance—independently converging on the exact drug-resistance mechanism that Google's academic collaborators had discovered experimentally but not yet published.

The researchers emphasize that neither project relied on heavy custom fine-tuning; instead, both are built from general-purpose Gemini models via structured agent scaffolding, long-context reasoning (up to ~256K–1M tokens), tool use (web search, AlphaFold), and tournament-style pairwise evaluation of candidate hypotheses. They highlight three practical lessons for builders: structured/JSON-enforced reasoning outperforms free-form chain-of-thought, injecting new information or entropy (via search, tools, or human input) is essential to prevent mode collapse during extended self-critique loops, and Elo/tournament-style ranking is currently an industry-best practice for surfacing top ideas.

Deployment is underway: AMIE is entering a clinical trial with real patients at Beth Israel Deaconess Medical Center (Harvard), and Co-Scientist is being rolled out through a trusted-tester program already engaging ~100 leading scientists, with ambitions for millions of users by year-end. The guests describe the path to AI doctors and "data-center geniuses" as increasingly an engineering challenge rather than a fundamental unknown, though they note incentive and benchmark-tradeoff frictions around deeply integrating specialist scientific modalities into frontier models.

  • AMIE now beats primary care physicians on both diagnosis and guideline-grounded treatment recommendations in multi-visit simulated consultations
  • AMIE surpasses medical fellows in cardiology and oncology; cardiologists with AMIE assistance outperform those without it across nearly every metric
  • Co-Scientist is a multi-agent Gemini system (no custom fine-tuning) that succeeded on drug repurposing, target identification, and bacterial drug resistance
  • Co-Scientist independently proposed the same drug-resistance mechanism that human scientists at Imperial College had discovered experimentally but not yet published
  • Practical engineering lessons: enforce structured reasoning via JSON outputs, inject entropy (search/tools/human input) to prevent collapse during long self-critique loops, and use tournament/Elo-style pairwise ranking
  • Long-context Gemini (up to ~1M tokens) was an underrated enabler—allowed Co-Scientist to persist feedback as implicit memory without custom RAG
  • AMIE clinical trial launching at Beth Israel Deaconess (Harvard); Co-Scientist trusted-tester program already has ~100 scientists, targeting millions by year-end
  • All reported results were produced before Gemini 2.5 Pro, implying a near-term step-change improvement is expected
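
The structured-reasoning lesson in the list above can be illustrated with a minimal sketch. It assumes a hypothetical reasoning schema (the field names `findings`, `differential`, `plan`, and `confidence` are illustrative and are not taken from the AMIE or Co-Scientist papers) and shows only the validation step: the model's reply is parsed as JSON and rejected if any reasoning stage is missing, so a caller could re-prompt instead of accepting free-form text.

```python
import json

# Hypothetical schema: each key is a reasoning stage the model must fill in.
# These field names are illustrative assumptions, not from the papers.
REQUIRED_FIELDS = {
    "findings": list,
    "differential": list,
    "plan": str,
    "confidence": float,
}


def parse_structured_reply(raw: str) -> dict:
    """Parse a model reply and check that every reasoning stage is present."""
    try:
        reply = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(reply, dict):
        raise ValueError("reply must be a JSON object")
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in reply:
            raise ValueError(f"missing field: {name}")
        value = reply[name]
        # JSON has a single number type, so accept ints where a float is expected.
        if expected_type is float and isinstance(value, int) and not isinstance(value, bool):
            value = float(value)
            reply[name] = value
        if not isinstance(value, expected_type):
            raise ValueError(f"field {name} should be {expected_type.__name__}")
    return reply
```

A caller would wrap this in a retry loop: on `ValueError`, send the error message back to the model and ask for a corrected reply. How the real systems enforce their output format is not described in this episode, so treat this only as one plausible shape.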

Source material

Transcript

Hello, and welcome back to The Cognitive Revolution.

Today's episode features an eye-opening conversation with Vivek Natarajan and Anil Palepu from Google DeepMind.

Their groundbreaking work on AMIE, the Articulate Medical Intelligence Explorer, and Co-Scientist represents what seems to me an important threshold moment in AI capabilities.

I always say that if people truly understood what AI can already do today, many would be fundamentally rethinking their plans.

And these projects provide perhaps the clearest evidence yet that AI systems are beginning to outperform highly intelligent humans in domains that require years of specialized training.

Remarkably, this work was accomplished without special continued pre-training or extensive custom post-training that could only have been done within Google.

On the contrary, these approaches could have been developed and can be replicated by Google's API customers, using commercially available models, advanced prompting techniques, and thoughtful agent design.

We begin by discussing AMIE.

A year ago already, Vivek and co-authors showed that AMIE was able to outperform human general practitioners in diagnostic accuracy.

Now, with just a few important caveats remaining, Anil and team have demonstrated that it also beats human primary care physicians in analysis and treatment recommendations.

The implications for healthcare access are obviously profound, and are beginning to extend into specialized medicine too.

The second AMIE paper we cover shows that the AI system is already surpassing medical fellows in both cardiology and oncology, and closing in on, but still falling a bit short of, attending level performance.

Notably, when cardiologists have access to AMIE, their performance dramatically improves across almost every metric, suggesting a short-to-medium term future in which AI doctors have the potential to both raise the floor for access to quality care globally, and also raise the reliability ceiling even for those of us fortunate enough to have access to first-world specialized care.

This is, to put it plainly, crazy.

And I am super excited that Google is moving AMIE into something like a clinical trial, in partnership with Beth Israel Deaconess Medical Center, a Harvard Medical School teaching hospital in Boston, for real-world validation.

All that said, somehow, in Vivek and team's Co-Scientist paper, we see something equally, if not even more, amazing.

This multi-agent AI scientist system, which is capable of accepting human input and feedback at any step in its process, was tested in fully autonomous mode on three increasingly complicated scientific challenges.

First, drug repurposing, an advanced, but reasonably well-defined task amenable to combinatorial analysis.

Second, therapeutic target identification, a more open-ended challenge requiring the AI to understand and/or make quality hypotheses about causal relationships within cells.

And third, and definitely most dauntingly, the wholly open-ended challenge of understanding the process by which bacteria achieve drug resistance.

As you might have guessed, Co-Scientist, which by the way Google is now making available to trusted partners, succeeded on all three of these tasks.

And on the challenge of understanding drug resistance in particular, it blew everyone's minds by proposing the exact same mechanism that Google's independent scientific collaborators had recently discovered experimentally, but had not yet published at the time of Co-Scientist's analysis.

Overall, Co-Scientist demonstrates that AI systems are now capable of generating novel insights by connecting the dots between far-flung bits of hard-won human knowledge.

This system is not simply regurgitating its training data.

On the contrary, it's performing meaningful synthesis and proposing novel hypotheses that even human expert scientists recognize as both insightful and significant.

If all that's not enough alpha for one episode, the implementation details behind these systems offer valuable lessons for AI engineers everywhere.

First, structured reasoning proves far more effective than simple chain of thought approaches, especially when working with lots of input context.

Both of these systems demonstrate the value of thinking carefully about exactly how you want your AI system to reason about specific types of problems.

Second, finding ways to add new information or even just a bit of entropy, such as by giving the model access to search, is key to making self-critique and self-improvement schemes work over many rounds of successive iteration.

And third, for now at least, the tournament-style evaluation process used to surface the best candidate hypotheses out of the many that were generated seems to be an industry best practice that you can and should use in your own work.
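
The tournament-style evaluation described here can be sketched as a round-robin of pairwise comparisons with Elo-style rating updates. This is a minimal illustration under stated assumptions, not the Co-Scientist implementation: the `judge` callable stands in for whatever model call compares two hypotheses, the K-factor of 32 and starting rating of 1200 are conventional Elo defaults rather than values from the paper, and candidates are assumed to be unique strings.

```python
import itertools
from typing import Callable, Dict, List


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def run_tournament(
    candidates: List[str],
    judge: Callable[[str, str], str],
    k: float = 32.0,
    start: float = 1200.0,
) -> Dict[str, float]:
    """Round-robin: every pair is judged once and both ratings are updated."""
    ratings = {c: start for c in candidates}
    for a, b in itertools.combinations(candidates, 2):
        winner = judge(a, b)  # hypothetical judge; must return a or b
        score_a = 1.0 if winner == a else 0.0
        exp_a = expected_score(ratings[a], ratings[b])
        ratings[a] += k * (score_a - exp_a)
        ratings[b] += k * ((1.0 - score_a) - (1.0 - exp_a))
    return ratings


def longer_wins(a: str, b: str) -> str:
    """Toy stand-in for an LLM judge: prefers the longer hypothesis."""
    return a if len(a) >= len(b) else b


ratings = run_tournament(["h1", "hypothesis two", "hyp 3"], longer_wins)
ranked = sorted(ratings, key=ratings.get, reverse=True)
```

Here `ranked` lists the candidates from strongest to weakest according to the toy judge. In practice the judge would be a model prompted to compare two hypotheses, and a large pool would likely sample pairs instead of playing every match.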

What's most amazing to me about all of this is that it was achieved before Gemini 2.5 Pro was available to use, meaning that everything we talk about today is still subject to a step-change improvement that should come more or less for free with a simple model upgrade.

With this level of performance already established and core model progress continuing, the path to an AI doctor in your pocket and data centers full of AI geniuses is honestly becoming quite clear.

AIs are no longer just tools for routine tasks.

They are becoming legitimate thought partners in some of humanity's most complex intellectual endeavors, from diagnosing disease to expanding the very frontiers of scientific knowledge.

As always, if you're finding value in the show, please take a moment to share it with friends, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube.

And please know that I sincerely value your feedback and suggestions.

Whether we get to live in a post-scarcity society in which we all enjoy instant access to superhuman AI doctors, or perhaps on the other extreme end up going extinct due to some crazy AI-driven scientific accident seems to me to depend largely on how responsibly we handle the upcoming AI transition.

And I take my role in AI discourse very seriously.

If you think I can be doing better, please contact us via our website, cognitiverevolution.ai, or feel free to DM me on your favorite social network.

For now, I hope you enjoy this conversation on the emergence of genuine AI expertise, which you almost certainly would have considered to be AGI just a few short years ago, and which I think you should still find absolutely mind-blowing today.

With Vivek Natarajan and Anil Palepu from Google DeepMind.

Vivek Natarajan and Anil Palepu, authors of Co-Scientist and AMIE from Google DeepMind, welcome to the Cognitive Revolution.

Thanks for having us.

Likewise.

Guys, again, so Vivek, this is your fourth time.

What an unbelievable heater you and the team at Google have been on.

I always say, people, if they just had a little bit better sense of what is already out there today, they would be updating their plans in many ways that I just don't see people doing.

So this is really an unbelievable example of that.

I came away, there's three papers we're going to go down the rabbit hole on today.

And I came away from this feeling like it might be time to call it.

The AIs might in fact now be clearly smarter than me.

And it is sort of, and we can get into the nuances of what they're still missing a little bit, but I think almost everybody would read these reports on what you've been able to get the AIs to do and feel like they would have a very hard time matching that and to get to this level that like a single AI model scaffolded in different ways and put to different purposes, it would be like years of undertaking for me to get there for sure.

So let's do some headlines.

So there's two papers with AMIE.

We covered this once before.

This is the, what is it?

The Articulate?

Articulate Medical Intelligence Explorer.

Sorry, we're done.

The Articulate Medical Intelligence Explorer.

So I've been using a graph in some slides that I occasionally present for like the last year or so, since the first AMIE paper came out, showing that basically when a patient chats to the AI, the AI is more accurate in its ability to diagnose the person than human primary care physicians are, as judged by other human doctors.

So now we've got two new extensions to that.

The first one is, and there are some caveats here that I think are definitely worth unpacking.

I'll give you a chance to do that.

It's basically now outperforming general practitioners, not just on the diagnosis part, but also on reasoning through what to do about it and ultimately recommending treatments, which is obviously a big part of what the doctor is meant to do for you.

So yeah, just unpack that headline for me a little bit.

I mean, it's crazy that that is out there in the world today.

And again, I always say like, I swear when I was a kid, if something like this happened, it would have been like headline news, everybody would be talking about it.

And there's just so much going on that some of this stuff, even dramatic breakthrough as it is, doesn't seem to crack the consciousness.

So tell us more about AMIE outperforming now, not just on diagnosis, but also on recommending treatments.

Yeah, yeah.

I mean, like prior to those, I think, first two AMIE papers, a lot of the work in this space was, you know, on medical question answering.

And, you know, there was some notion, right, that these, these language models do encode, you know, clinical information well, and they have a lot to offer.

I think like with those papers, we were trying to start to ask that question of like, okay, but that's not clinical practice, right?

Like in clinical practice, the doctor is interacting with patients, they have to kind of gather this information themselves.

They're not really like presented with like all the information upfront.

And so that was the study, right?

And it was using this objective structured clinical examination format.

And it was basically trying to see like, can the doctor or the AI in this case, you know, interact with patients, gather that information and still get to that, you know, diagnostic endpoint.

And of course, I think we do a really good job in the paper and I encourage people to read it of like describing the many limitations, right?

Like this is a text based chat.

That's not how doctors talk to people.

And kind of our future, like our direction from there was really about starting to unpack like some of these limitations.

And so one of those limitations, right, is this idea that, you know, it's more than just, you know, from the first visit, you see a patient diagnosing them, like there's a lot more to, you know, clinical care, right?

And it's about managing a patient over, you know, multiple visits, you know, the endpoint at the first time you see a patient might be really like we need to just order the right test and we need to kind of set them in the right direction.

It's not always that you know exactly what to do with the patient after seeing them one time.

And so I think the management reasoning paper is really trying to unpack that.

And, you know, we'll talk more about like how that study was designed.

I think it's like super interesting, right?

But there's other aspects to it as well, like, rather than kind of more general recommendations, can we get really precise, can we ground in accepted clinical practice guidelines, can we ground in, you know, medication labels, and start to turn these into slightly more actionable things?

And similarly, like with the specialty papers, right, it's trying to expand beyond kind of the bread and butter common presentations of common diseases, like how does this work in more niche areas of medicine?

And so I think and you know, we have a lot more work, you know, kind of just trying to expand on some of these limitations that you know, we've identified in the first paper.

So how would you summarize the I mean, if I understand correctly, it is still a chat based interaction today, right?

So one major possible extension would be to go to multimodal, but you guys have also done work on that separately, right?

So maybe like, why is it not multimodal in in this particular study?

Was there a reason not to just like let people you know, throw in selfies into the chat?

Yeah, I think it's more like when you're doing research, you want to isolate components that you're studying and do that well.

And so when you add multimodal, I think artifacts add like necessarily more confounders over here.

So we try to avoid that.

And so it's just easier to study a text based system to begin with.

But clearly, we've done work on multimodal before.

And I know that like on two podcasts, we've spoken about like the Med-PaLM M work and Med-Gemini.

So all those components and pieces exist.

And so very soon you'll see like the multimodal one also come out.

Yeah, yeah.

Okay.

Pace is relentless, that's for sure.

So just to bottom-line this one more time: with the caveat that it's not yet multimodal in this particular paper, although that's coming soon, we have multi-visit, you know, kind of longer time horizon interactions between patients and doctors, where the AI doctor is outperforming the human doctors on both the diagnosis and the reasoning through what to do, and ultimately landing on sort of standard of care, you know, accepted proper treatment for these conditions.

Anything else that you know, that we should should I soften that at all?

Or is that like a good I mean, I think I think like on billboards, right?

I mean, yeah, I think the biggest thing right now is, you know, and we'll talk more about how we're trying to test this in the real world.

But right, these are simulated consultations, like they are patient actors, they're not real patients.

And obviously, there's a whole new set of challenges when you get real patients who you know, all kinds of things can happen and they're not necessarily going to stick to their script.

So I think that's a whole another thing that we need to test and validate that our results truly do translate to the real world setting.

That being said, I mean, I think simulated consultations do show that you know, we do have promise in this setting.

And you know, we personally are very optimistic that these results would would translate at least to a certain degree.

Yeah, okay.

So the next AMIE headline is moving to specialized medicine.

And there you look at cardiology and oncology.

And here I would basically summarize the findings as: the AMIE system is surpassing fellows and closing in on, but not yet hitting, the level of attending physicians in these specialist domains.

So what what further you know, complications or caveats you know, should we have to understand that?

Yeah, I think largely I'd agree with the notion that you know, we have a lot of improvements in terms of being as consistent in all in all domains as you know, the most experienced most experienced attendings and there's room for improvement.

I think the real headline for me, especially if you look at the cardiology paper is that, you know, the types of errors they make are pretty different, AI and and the general cardiologists.

And I think the really exciting thing is we see that they're quite complementary.

So you know, the comparison we made was, you know, we compared AMIE to the general cardiologist head to head.

There was, you know, some uncertainty about what was better there: some were better in some areas, some were better in others.

But when we compared the general cardiologist with access to AMIE's assessments to the general cardiologist alone, it was like a landslide, right?

And in that case, in almost every aspect, it was considered superior when they had assistance.

And you know, I could talk a little bit more about maybe why I think that might be obviously there's more investigation needed.

But you know, I think that is a really exciting aspect of this that, you know, it seems to be just a helpful system, you know, in use by these experts.

And maybe to just contextualize that work a little bit more.

I think if you look at like the access to specialists, like in the country today, I believe like for getting like consultations in neurology, for example, it's like a 12 to 18 month wait time.

And that's simply not sustainable.

Right.

And so the question is, okay, like, I mean, clearly that we have like better reasoning AI systems that seems to show promise in medicine.

So why do we have like, how can we do better over here?

Can we improve the status quo?

And we should be able to do that radically.

Like no one should be waiting for 18 months to get like consultation on like a rather serious thing, like neurology is like not straightforward.

And so so that's kind of like the motivation.

There's like a lot of like access issues around specialist care, cost issues around specialist care.

And so how can we do better?

And that's like the key question that we're trying to address.

And then I think the second thing is generally like if you look at like how medicine has evolved, it has led to these silos or these compartments and specializations.

And so you have like this primary care, which is kind of like the front face to everything else, that's the door.

And then you have all these silos.

But the way that it has evolved, I think is primarily because of like the limitation of the cognitive aspects of the human mind.

Like there's only so much expertise that we can cram into our given brain.

And so like we have to know.

So because of those limitations, you have to go like study neurology or cardiology, or I don't know, internal medicine, but not everything together.

But like AI systems don't need to have that kind of like limitation, they should be able to like do, I mean, like given what we're seeing, like they should be able to integrate knowledge from multiple different sources, multiple different disciplines.

And so that's also like there's a fundamental rethinking that is happening does like the new age of AI empowered, like AI powered healthcare, does it need those silos?

Could it be possible that you not only have like a PCP in your pocket, but like an expert neurologist in your pocket, obviously with caveats and things like that?

So that's, that's another class of like question that we are trying to address over here.

And then maybe the third point I'll just add on is this study is like what, three, four, five months old now.

Yeah.

So that was primarily done with the PaLM version of the models, if I'm not wrong.

The cardiology one was done with Flash.

Flash?

Yeah.

1.5.

Yeah.

Yeah.

1.5.

Okay.

Yeah.

So it's 1.5.

And so since then we've had 2 and 2.5.

So yeah, I mean, in that study, obviously, what we saw was that 1.5 was not as good as, say, attendings.

But who knows?

With a new one.

Yeah.

Yeah.

That's an important caveat.

Okay, well, let's do the headlines then for Co-Scientist because these are, I would say, similarly striking headlines.

So, Co-Scientist is basically, I guess I would describe it as an agent scaffolding type of setup where you, and we can get into the granular details.

But we've seen different things like this before.

I did an episode with James Zou, who has kind of a similar thing with the AI Virtual Lab, and the original Coscientist that was made just within the first couple of weeks of GPT-4, which is crazy to think that they got anything out of that with like 8,000 tokens of context.

But even then, we're seeing some interesting stuff.

But this, I would describe as just kind of taking sort of all of the lessons learned about how to make agents work from the last two years and hitting the gas on all of them.

And then coming up with a system that basically can do science, right?

I mean, it's not executing the actual physical experiments at this point, but I was really amazed by the different things that you tested the system on.

There's kind of like three levels of challenge of the problem that the system is given.

The first one is like a relatively well-scoped thing that you could kind of grind through.

That might be the one that I might be able to do with some real effort.

And that was drug repurposing, right?

So, take a drug that's out there, look for other things that it might be useful for.

And you have kind of a combinatorial approach that's like available to you.

So, if you set that up and go through it systematically and you have like decent judgment around each individual sub-question that you ask, you can sort of imagine how an AI system would be able to do something like that.

But then you go up the level to the second scope of task, which is identifying new therapeutic targets within a particular kind of diseased cell.

And this is like starting to get at what I've often called one of the grand challenges in biology, which is just like what causes what?

Obviously, we know that there's like super complicated causal graphs going on in the cell of this promotes this, but inhibits that and yada, yada, yada.

And we don't, I don't know if you guys would venture a number, but I've kind of understood broadly that we have like a long way to go in terms of really mapping that out.

Maybe we understand 10% of what the graph is in today's world.

So, the challenge here is to basically go into a cell, so to speak, and try to figure out, can we identify something that we can target with a drug that will actually make things better?

Obviously, this is hard given the vast complexity and the many, many unknowns in cells.

So, okay, that was just the middle one.

Then the third one is, can you figure out why or how bacteria are becoming drug resistant?

There was like a little bit of a hint, I think, because there was one sort of observation, right?

That there was something that was conserved across a couple different species that sort of was a notable observation that sort of served as a seed to unpack the challenge.

But beyond that, go figure it out, right?

Super open ended, really, really tough.

And that is like a daunting question for me to consider and really would have you kind of, spinning your wheels, I would think in just an absolutely vast literature for a long time before you would even, you know, at least for me, before I would even have any, you know, sense that I might be able to start to contribute to the discussion.

Okay, those three problems.

Bottom line is, AI is able to do all three of them pretty well.

And in that last case, it actually surfaced as its number one candidate idea for the mechanism of the drug resistance, something that had been discovered experimentally, but was not yet published, right?

So you guys partnered with an academic group that was doing this research.

And they had the answer, but nobody else had the answer.

It's not in the literature.

And the system was able to grind through all this content that it had available, you know, the whole kind of vast body of medical and biological literature.

And I mean, crazy, right?

Landed on the exact number one hypothesis that turned out to be the actual answer.

I was really blown away by just how successful it was.

So tell me more, I guess, you know, in terms of what else, are there any caveats that I glossed over or, you know, are there any other Eureka moments that you would want to highlight from those results?

I think the last one was kind of like interesting and also like funny, because I don't think Jose and Tiago, who are our collaborators at Imperial, would take offense at like me saying this, is I don't think they really believed that AI could do this thing.

So I think we've been trying to like, you know, we said, okay, like we have the system, Jose and I really want to try it.

And then it took us like a few months before we got like enough time with them.

And then I think after like enough pestering, they were like, okay, we have some experimental results in the lab, we'll just try and challenge your AI system.

Let's see what your AI system can do.

And so, yeah, I think it was roughly around like Thanksgiving.

And so they sent us this prompt.

And I think that's detailed in another preprint, which was co-timed with the Co-Scientist paper.

And as you said, I think there were some clues in there, but like it was not totally giving it away.

And so we take that prompt, we set the system on it, and it runs for like a couple of days.

It spits something out, and then there's Juraj, who's the first author on that paper.

And he's an amazing like technical fellow.

He's probably the least well-known technical genius at Google in some ways, because he likes to keep a low profile.

But yeah, so he just sent it over to them.

And then I think it was Thanksgiving.

It was like late in the evening.

And so Juraj and Alan were both based in Europe.

They were like offline.

And then like within 10 minutes, like Jose was also based in London.

He sends us like an email and he was like, I need to talk to you right now.

And I was like, okay, I didn't understand the seriousness of it.

But I was like, okay, I'm not doing anything better, so we can talk.

And then he was like, are you like reading my email?

And I was like, I'm not sure.

I was like, what are you asking?

He was like, no, no, no, it seems like you're reading my email.

And then I was like, we do many things at Google, but reading your email is not one of them.

And then he goes into it.

He was like, I've not published this anywhere, but your AI system came up with the same set of results that we hypothesized and found in our experiments.

So I'm like really surprised.

And so he was like, okay, do you guys get responses from like ChatGPT?

I was like, no, no, no, that's not possible.

That doesn't happen.

And so he was like, okay, so if you're not reading my emails, and if you don't have like any information from ChatGPT, then it's likely that you have something like really, really magical.

And that was kind of his response.

And then he said, like, yeah, the first one is great.

But we'd also, like, sent him four more.

And he was like, all the other four also make a lot of sense.

And I think immediately after the Thanksgiving break, he was like, okay, I'm going to like set a few of my postdocs to work on this.

And they've been like working on validating those other hypotheses.

And so it was kind of that moment, like where someone who's very pragmatic, very experienced, has spent like a decade, in fact, several decades in the field, like when they have these kinds of reactions.

So that was kind of like, okay, that told us like, okay, we might be on to something over here.

But again, I think like it was not like one single moment, but it was like, okay, it's a very hard thing to do, right?

I mean, to get like AI to not just like synthesize and integrate and summarize information, but help traverse history of knowledge and like uncover new original things and knowledge and facts about the world.

And to do that reliably, that's a super hard problem.

In some ways, that's like the Holy Grail of AI.

And to think that okay, like a system which is relatively simple in nature, like we probably talk about this more, it's like, it's probably the simplest version of the system that you can imagine.

And, like, throwing a bunch of compute at it, you're already seeing evidence of that, like you're seeing evidence of this happening reliably, that felt like, okay, super magical to us.

Yeah, it's crazy.

So this has been something that people have been discussing recently online a bit.

Like, Dwarkesh has sort of advanced the idea that like, why don't we see AIs coming up with these, like, you know, they have this like incredible breadth of knowledge, shouldn't we be seeing more connections made, you know, more sort of insights across this like super diverse knowledge base? His contention is like, if I could have all that knowledge, you know, surely I would come up with like more insights than the AIs currently seem to.

And I think he even kind of went as far as to say, like, I haven't seen a single example of this.

And I feel like some of the things that I've covered, definitely seem like they could be said to count.

But, you know, there's always a lot of details and caveats and sort of eye-of-the-beholder type of stuff.

But this seems pretty clear to me that, you know, the fact that this was essentially independently discovered by human scientists in a lab and an AI system in a data center over a couple days in parallel, to the point where the scientist accused you of reading his email, presumably not super seriously, but you know, nevertheless, you know, is there anything is there any reason that we should sort of not take this as a genuine like discovery, you know, a sort of qualitative eureka moment from an AI system?

Yeah, this is not the first evidence, like, I think right in 2023, right after we did our Med-PaLM.

And that's actually the genesis of this co-scientist work, right after we published Med-PaLM.

There was this professor from Stanford, Dr. Gary Peltz, who reached out to us, and I think you probably remember Tao, who came on one of these previous episodes. He called both of us up and he was like, you don't know me, but your AI system can potentially help like millions of people with rare diseases.

I was like, okay, Gary, that's a nice introduction, please go on.

And then he was like, I know that your models are trained on a lot of scientific literature.

And I think they can help me discover useful facts about genetic diseases.

And, yeah, I mean, that felt interesting enough for us.

I mean, if you were able to help him, probably we could help a lot of people.

And so we started working with him and on this problem of genetic discovery.

So, like, language models, can they come up with like the right kind of causative factors that are responsible for a given combination of phenotypes or symptoms?

And so we started doing that with Med-PaLM, and even with Med-PaLM and later with Med-Gemini, we saw that these systems were able to do this.

In fact, with Med-PaLM, Gary was working on like mouse models and like a very specific kind of hearing loss for which he had this NIH grant back then.

I don't think he'll ever get that kind of a grant again, but it's a separate discussion.

But so yeah, he had that grant.

And so one of the hypotheses that the model came up with, it was like a biogenic model for hearing loss, which Gary had not thought of before.

And so he went ahead and did like these CRISPR knock-in experiments in his lab.

And he was able to reverse the cause of the disease.

And so we had written that up, and it's still under review in actually a prestigious venue.

And then later on, we used a more advanced version of our LLM, which is Med-Gemini.

And we extended that work to like human variants of unknown significance.

And even there, like based on retrospective data, we see like these systems are able to do like pretty interesting work in genetic discovery, which can be cast as a hypothesis generation problem.

But the key thing is at that point of time, when we were using these LLMs in pretty much like a crude single-shot fashion, it was very unreliable in this act of like hypothesis generation.

So literally, for Med-Gemini to come up with like one hypothesis that was very, very helpful,

it had to come up with like thousands of things that were utter garbage.

But I mean, we were grateful to like work with Gary, who had the expertise to like very quickly discard things that were nonsense and also had the patience to like work through all of them, because you could easily imagine another scientist or maybe someone inexperienced who would say, looking at the first five, this is utter garbage, it's not working.

And then we would not have like, pursued this line of work at all.

And so it's about like getting together and working with the right people who believe in this.

And so that like, put us on this journey, like, okay, how do we make this more reliable?

Like, we should not be sampling like thousands of times to get something useful.

Rather, every single generation should be something useful and the system should be like well-calibrated over here.

And so that kind of like led us to this.

And so the simplest way to do this would be to like, call like an LLM repeatedly and hope that it leads to something useful.

And but that very quickly fails.

And it leads to like these degenerate solutions, which is this mode collapse.

And so it became important to like, you know, introduce like net new knowledge into the world.

And also think about how you can like gamify that process in some ways, and like introduce like repeated new helpful feedback that can help the system self improve.

And so that kind of like led us to the design that we eventually had.

And in hindsight, it should be quite obvious, I think it's like very naturally follows how like the scientific method works.

Or if you ask like a scientist, how do they come up with new ideas, you'll see like they were just roughly compartmentalized into the set of agents that we had.

And they're like just doing the same thing.

But it was just like this iterative process of like, okay, we tried something, it didn't work, we tried to improve it.

And then we eventually ended up with this design, which is like, I think remarkably intuitive.
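Editor's sketch: the loop described above (generate, review, feed the reviews and fresh outside information back in, repeat) can be illustrated with a toy Python loop. This is only a minimal sketch under stated assumptions, not the Co-Scientist implementation. The names `run_rounds`, `toy_generate`, `toy_review` and `toy_fetch` are hypothetical, and the stand-in functions return canned strings where a real system would call a model or a search tool.

```python
from typing import Callable, List


def run_rounds(
    generate: Callable[[str], List[str]],
    review: Callable[[str], str],
    fetch_new_info: Callable[[int], str],
    rounds: int,
) -> List[str]:
    """Generate ideas, review them, and carry reviews plus fresh info forward."""
    context = ""
    ideas: List[str] = []
    for r in range(rounds):
        # Inject net-new outside information each round (search, tools, people).
        context += f"\n[round {r} new info] {fetch_new_info(r)}"
        ideas = generate(context)
        # Reviews become feedback that the next round's generation can see.
        for idea in ideas:
            context += f"\n[review of {idea}] {review(idea)}"
    return ideas


# Hypothetical stand-ins so the sketch runs without any model or network access.
def toy_generate(context: str) -> List[str]:
    seen = context.count("new info")  # how many rounds of new info are in context
    return [f"idea-{seen}-{i}" for i in range(2)]


def toy_review(idea: str) -> str:
    return "needs evidence"


def toy_fetch(round_index: int) -> str:
    return f"paper-{round_index}"


final = run_rounds(toy_generate, toy_review, toy_fetch, rounds=3)
print(final)
```

The only design point the sketch tries to capture is that each round sees both the accumulated reviews and something new from outside the loop, rather than calling the same model repeatedly on its own output.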

Yeah.

There's a couple of maybe quasi philosophical questions that come to mind there.

One is, how do you think about the relationship between hallucinations on the one hand and creativity or hypothesis generation on the other hand?

It seems like we have very, people have very different intuitions about that.

And I'm also not quite sure myself, like, do I, when we do sort of extensive post-training to try to minimize hallucinations, does that help with hypothesis generation?

Because maybe it makes the models more like disciplined in their reasoning, or in some ways, does it hurt because they're maybe less willing to come up with a very random idea that every once in a while, it's these random ideas that are the big ideas.

So how do you guys think about the, like, maybe that's a false trade off, but yeah, how do you think about the relationship between those behaviors?

Yeah.

Curious to hear your take.

Yeah, I mean, I think intuitively, there's some aspect of like hallucination, like does foster creativity.

In some ways, like it's interpolating between, you know, the data it's seen, right?

And it's kind of necessary for hypothesis generation to deviate from, you know, the script a bit.

But yeah, I mean, like, I don't have a, I don't know if I have a clear understanding or, you know, answer to that.

But intuitively, at least that's how I feel.

Yeah.

I always used to think that like hallucination and creativity are like two sides of the same coin in some ways.

And I don't think this holds true any longer, but what we used to see with the previous generation of models was that it was actually much more helpful to use the models without post-training for this task, because they were then more likely to come up with these crazy ideas.

Right.

But now I don't think that's the case any longer, because we've been able to like systematize, we have been able to like put a structure around that process of coming up with new ideas.

And so it feels like now the process is much more reliable, but maybe we are sacrificing, say, some like crazy new things that would require maybe like an unhinged, like non-post-trained model to come up with.

So we don't know.

So I think we've gained like reliability in the process, like in terms of like consistently coming up with like new original thoughts.

But like, we don't know if we're like sacrificing something else over here.

So yeah, yeah.

And I think one advantage of our system too is like we, we allow like diversity of models, you know, so, you know, we can kind of get the best of both worlds.

And through this tournament process, through this ranking, like, you know, we have a diverse set of hypotheses, and those can be reranked.

And hopefully, you know, we see high quality hypotheses kind of bubble up to the top.
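Editor's sketch: the tournament and reranking step mentioned here can be illustrated with a standard Elo update over pairwise comparisons. This is a minimal sketch, assuming a round-robin schedule, a K-factor of 32 and a starting rating of 1200, none of which are stated in the conversation. The `judge` callable is a hypothetical stand-in for whatever compares two hypotheses (in a real system, presumably a model-run debate or review).

```python
from itertools import combinations
from typing import Callable, Dict, List


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def run_tournament(
    hypotheses: List[str],
    judge: Callable[[str, str], str],
    k: float = 32.0,       # assumed K-factor
    start: float = 1200.0,  # assumed starting rating
) -> Dict[str, float]:
    """Play every pair once and return the final Elo rating of each hypothesis."""
    ratings = {h: start for h in hypotheses}
    for a, b in combinations(hypotheses, 2):
        winner = judge(a, b)
        score_a = 1.0 if winner == a else 0.0
        exp_a = expected_score(ratings[a], ratings[b])
        # Zero-sum update: whatever A gains, B loses.
        ratings[a] += k * (score_a - exp_a)
        ratings[b] -= k * (score_a - exp_a)
    return ratings


# Hypothetical judge for illustration only: the longer string wins.
def toy_judge(a: str, b: str) -> str:
    return a if len(a) >= len(b) else b


ratings = run_tournament(["a", "bbb", "cc"], toy_judge)
ranked = sorted(ratings, key=ratings.get, reverse=True)
print(ranked)
```

Because ratings only depend on pairwise outcomes, hypotheses produced by different models or prompts can be ranked on one scale, which is the "best of both worlds" point made above.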

So so yeah, I mean, I think maybe it's both is kind of what I take away from the hallucination reflections there.

It's like, if you're in a, you know, in the context where there's a rich literature, and it's more about, you know, really working your way through it, then maybe you want a disciplined reasoner.

And if you're doing something where there really isn't much to go on, you know, maybe you want a sort of more whimsical hallucinator.

And, as you said, like maybe get the best of both worlds with some of these setups.

So let's describe the different setups.

I mean, there's multiple different systems here that, you know, they have their sort of intricacies.

But it seems like at a high level, if I, if I understand correctly, what you're doing to design these systems is basically just kind of introspecting or, you know, maybe interviewing people and sort of saying, one of my favorite questions in in AI, you know, automation in general is, how do you think about it?

So it seems like you're kind of doing that and saying, like, okay, how do you think about it?

What do you do next?

Then then once you do that, what do you do?

And you're basically turning all these steps of a process, whether it's the scientific method, or it's the diagnostic process, or the reasoning to treatment process, you're basically just kind of mapping that out with, you know, a subject matter expert, yourself or somebody else, and then creating little sub agents that are sort of prompted to do those sub tasks, and then kind of scaffold them together.

And also some interesting details on giving them tools, you know, literature search is obviously a huge one. I was also interested to see that in the one project, the model did have access to AlphaFold and maybe some other things.

So, you know, you're increasingly like giving it something like the full complement of tools that a human could use.

And then it seems like after that, it's like turn all the hyper parameters up, you know, kind of do more rounds, more generations, more, you know, more evaluations, more rounds of feedback.

And I guess my impression is that if you do that, and you have the budget for it, like, in today's world, you could probably be successful at almost anything.

But maybe, you know, tell me if that's wrong, like, are there things about the designs of these systems that you think are like, actually very, you know, sort of important hinges where like, if it had been if you designed it a little different, like, it wouldn't work?

Like, how sort of within that general framework that I outlined, could you go like a bunch of different directions?

Or do you feel like it actually is sort of a narrow design space that actually works?

Yeah, I think maybe there's a high level narrative change that's happening over here.

So yeah, this is my fourth time on the podcast.

And I think previously we discussed Med-PaLM, I think Med-PaLM M and Med-Gemini, and the key with all of them was they were all like "Med": we were taking some generalized model, and then we were trying to like fine-tune them and specialize them.

I think the key differentiation with the current version of AMIE and Co-Scientist is we're no longer trying to do that fine-tuning step or that specialization step.

And part of it is because like, okay, some of the data that went into that fine tuning and specialization step is now upstreamed and part of Gemini.

But yeah, I mean, it just feels like the better approach to set things up is by simply having agents with like specialized prompts, chained up together in like a nice manner.

So that, yeah, it just gives a lot of flexibility and control.

And so it does away with this need for like, fine-tuning and specialization.

I'm not sure what you think about that.

But yeah, I kind of agree.

You know, not that there's no role for post training.

But certainly, like in our first AMIE paper, and like Med-PaLM, all these papers, like it was really about, you know, this medical data we're curating or creating and how we're making that, and you know, that's what was driving the model's success.

And in these latest papers, that's really, I think, taken a backseat to simply like, how we're designing the system, you know, maybe to perform these sort of tasks at inference time.

And I would say like, I don't think anything is like, particularly like, hyper optimized, like, you know, there's probably certain ways that, you know, someone could design a system that does the stuff better.

I'm sure I'm sure there is, I think like our goals have been kind of largely to build towards like a functional prototype for these and kind of get there quickly and do the kind of tasks that we're interested in doing in each study.

So I think there are there's so much room for improvement and in all of these.

How many rounds of iteration did you go through as you built the system?

And were there any moments when it was like, Oh, we tweaked, you know, this prompt or we, yeah, we switched this, the order, we put this agent before this agent, you know, kind of re scaffold it that way.

And all of a sudden you saw a big leap.

Or Yeah, yeah.

So tell me about that.

I mean, for like any work, I think we rely like really heavily on like auto evaluations.

So I think we do try a lot of different, you know, you know, prompting a lot of different like configurations, but ultimately, like, you know, it's only finite how many things you can try before, you know, you're like, this is, this seems good.

This is good enough.

So I say we use that as like a rough signal.

We also like, honestly rely pretty heavily on like vibe checks and just like, and you know, I, I think like, you can pretty quickly like start to see differences.

Like I think with our like management reasoning agent, for example, like, you know, we saw a big difference when we started drafting like concurrent plans, and then refining them together.

We saw a big difference when we did this top down.

Can you unpack that just a little bit more?

Like, what was the before and what was the after that?

So like before we were we were just generating one plan.

And basically now, like we're generating four different plans, they might have some similarities, some differences.

But we found basically like, you know, in sort of like self consistency style manner, right, like, that the model was able to combine these plans in a way that it kind of was able to take the good stuff from each plan and leave out the bad.

And that's just something like once we tried it, it was like, our internal like auto evaluation signal was like very clear, like this was making a huge difference.

But like, you know, in terms of like hyper parameters, it's like, we went with four plans, like we could have gone with eight, we could have gone with two, like, I don't think we optimized every hyper parameter in that sense.

But I think like the big things we were able to pick up from this kind of signal.
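Editor's sketch: the "draft several plans, then combine them" idea can be illustrated with a toy merge step. In the system described, the model itself does the combining; the majority-vote rule below is a swapped-in stand-in chosen only so the example runs without a model. The function name `merge_plans`, the `min_votes` threshold and the step labels are all hypothetical.

```python
from collections import Counter
from typing import List


def merge_plans(plans: List[List[str]], min_votes: int = 2) -> List[str]:
    """Keep steps that appear in at least `min_votes` drafts, in first-seen order."""
    # Count each step once per draft, so a repeated step in one draft is one vote.
    votes = Counter(step for plan in plans for step in set(plan))
    merged: List[str] = []
    for plan in plans:
        for step in plan:
            if votes[step] >= min_votes and step not in merged:
                merged.append(step)
    return merged


# Four hypothetical drafts with overlapping and idiosyncratic steps.
drafts = [
    ["step A", "step B"],
    ["step A", "step C"],
    ["step B", "step A"],
    ["step D"],
]
print(merge_plans(drafts))
```

A vote threshold is the crudest possible way to "take the good stuff from each plan and leave out the bad"; a model-driven merge can also reconcile steps that are worded differently, which this sketch cannot.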

Yeah, I think on the Co-Scientist side, it was a much longer iteration; it was almost like an 18-month project in some ways.

And it was, yeah, driven by this need to like, okay, make this process of hypothesis generation, like more reliable.

And so, yeah, it was like, okay, we decompose these tasks, we try to like, get like individual models to work on them.

And I think the good thing is, again, like, what it has shown was how good like individual LLMs are getting at instruction following.

And that also does away with the need to like, fine-tune and specialize and whatever, because simply like if a model has the knowledge, and if you give it like a precise set of instructions, it can just follow it.

And so that just makes it much, much easier to like create agents and chain them up together, where these agents are specialized to do like specific tasks, but are like just prompted versions of like, the general purpose models.

And so yeah, I mean, like it was a process of like, okay, setting up some evals, seeing how well the system does, where the weaknesses are.

And then like, again, iteratively going in and fixing those weaknesses by adding like specialized versions of agents or like fixing the prompts.

And then ultimately, like there was a process of like, okay, let's just simplify the architecture, we don't need all of these things.

And so yeah, that ultimately led to this design.

So I would say it was a lot of iteration, like figuring out where the weaknesses are, trying to cover those weaknesses.

And then at the end, like, okay, let's do a simplification push.

Yeah, I've been through that on less world-changing projects myself multiple times, where you're like, a new model can do wonders for the simplification of your system.

Yeah, I've definitely lived that.

So I guess, you know, in terms of what is driving the improvements, you know, 18 months ago, we maybe like, couldn't, I mean, it's all happening so fast, right?

Like I even have some, you know, lingering sympathy for the stochastic parrot people, because it's like, as of GPT-2, you know, that was probably still mostly true.

At this point, you know, it's pretty clearly not, but 18 months ago, you know, you maybe couldn't, you know, models just couldn't do certain things that now they can, seems like just core model progress is just is the tide that is like, you know, lifting all boats dramatically.

Is there anything else that you think is just like super important?

Or is it really just down to foundation models getting better and just grinding out the process of figuring out how to use them?

Yeah, I think on the Co-Scientist, I feel like long context was an important part of it as well.

Because we don't have an explicit memory store in the system, but the fact that the Gemini models instantiated can take up to like 2 million tokens, it means that we can just like, you know, generate ideas, generate reviews of them, run debates of these ideas, and like generate these walls of text, essentially, which can just become feedback.

And we just put all of them back into the context of the model in the next round.

And the model just figures out how to make sense of it and use the feedback in a very like implicit manner to improve.

Like if you did not have these long context abilities, like the ability to reason over like reliably over like millions of tokens, then you would not be able to do that, like you would have to engineer like rag based systems, and maybe you have to like end to end train them.

And that would have been a lot more complex.

But like all that has been made like remarkably simple just by these long context abilities of Gemini, which I think are a little bit underappreciated in the field, because we don't have like enough of these, I don't know, Studio Ghibli-style viral moments with long context.

But it just enables a lot of these practical applications.
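Editor's sketch: "put all of it back into the context" can be illustrated with a small helper that concatenates prior ideas, reviews and debate transcripts into one prompt. The conversation describes no truncation policy, so the newest-first budget cutoff below is an assumption, as are the function name `build_context` and the whitespace-based token count (a real system would use the model's tokenizer).

```python
from typing import Callable, List


def build_context(
    history: List[str],
    budget_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> str:
    """Concatenate the most recent entries that fit within the token budget."""
    kept: List[str] = []
    used = 0
    for entry in reversed(history):  # walk from newest to oldest
        cost = count_tokens(entry)
        if used + cost > budget_tokens:
            break  # assumed policy: drop everything older than the first misfit
        kept.append(entry)
        used += cost
    # Restore chronological order for the prompt.
    return "\n\n".join(reversed(kept))


history = ["a b c", "d e", "f"]
print(build_context(history, budget_tokens=3))
```

With a budget in the millions of tokens the cutoff rarely triggers, which is the point being made: no retrieval index or explicit memory store is needed until the accumulated text outgrows the window.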

I think that's the same with the management reasoning.

It's like, yeah, I mean, like our recall for that paper, right, like the management agent, its whole point is it's trying to, you know, take in clinical guidelines and reason over those different management plans.

And like, yeah, I think if we were like, solely reliant on, you know, always picking out the right guidelines, like we would struggle, right.

But you know, with the long context, like we don't really need to worry about it, right, we take in a bunch of guidelines and, you know, with 256K, right, like we're going to catch something relevant, you know, regardless of whether our retrieval system would have been able to do that.

So yeah, I definitely agree on the long context.

I think just like, like with co scientists in particular, like we're talking about like a really long time scale, we have this thing running for days, right?

Like, I think just like that inference compute, like increase is offering a lot of benefit as well.

I'm just looking up something a friend sent me just in the last day or two, Fiction.LiveBench for long context deep comprehension, you know, one of obviously many benchmarks that look at these things.

And Gemini 2.5, even relative to Gemini 2, is absolutely just crushing it on its command of long context.

And I've definitely felt that in my initial testing of it.

This does feel like something that's like hard to go viral.

It's like hard to go viral with a notion of like, you know, I had hundreds of thousands or a million, you know, tokens of context.

It's all like very idiosyncratic, whatever I'm working on, you know, hard for people to like, even know what you're talking about, right?

When you post that on Twitter, the contrast between that context window and, you know, the length of a tweet is like pretty, pretty severe.

But I mean, it's amazing that that is that that was working that well, because just looking at these benchmark results, it's like 2.5 stands out in a massive way relative to everything that had come before.

Did you I guess, like qualitatively or vibes or you know, your own sort of sense, I would have said, up until 2.5, I would have felt like, yeah, I'm not so sure if I can just dump like hundreds of thousands of tokens in and like, yes, it can handle it.

But does it really handle it?

Does it really have command?

And especially if it's like material that I don't have full command of myself, you know, it can be very hard to evaluate that.

So I guess how did you handle that?

And how did you how did you know if it was like, actually, you know, making effective use of the super long context?

Yeah, I mean, like in the AMIE setting, I think, like I mentioned before, like we really relied on auto evaluation.

And like, I have to shout out like Valentin, my teammate on the paper, who did like a great job setting this up.

But essentially, like we also didn't know like, how well would this work?

We stuff a bunch of, you know, guidelines into context.

And we found like, you know, initially, we were going to go all the way up to 1 million, we found that actually, like, we seemed to be getting better performance when we dropped it down to 256K.

But, you know, ultimately, like there was a clear difference between the presence of, you know, that much knowledge versus, you know, trying to do it zero-shot or with, you know, one guideline or some retrieval or something.

So we were also uncertain, but we, you know, we tested it internally, at least, and that seemed to work.

Yeah, I think on the co scientist side, it might be a little bit more unscientific.

I think what we relied on was primarily the redundancy aspect.

Because like, you generate some ideas, you review them, there's a tournament that happens, you get like a bunch of feedback from the tournament.

And then you when you put that back into the state of the system, like, you're not generating only once you're like generating like numerous ideas.

And so your hope is that because of the redundancy that creates, like at least one of the generations would catch the key elements of the feedback that has been generated and propagated back over here into the system.

So yeah, I wouldn't say we have like any specific evals that target measuring how well the long context is doing.

But rather by engineering in this redundancy, we're hoping that it would be effective.
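Editor's sketch: the redundancy argument can be made concrete with a one-line calculation. Assume, purely for illustration, that each of $n$ generations independently picks up a given piece of feedback with probability $p$. Neither number is given in the conversation, and real generations sharing one context are unlikely to be fully independent, so this is an optimistic idealization rather than a measured property of the system.

```latex
P(\text{at least one of } n \text{ generations uses the feedback})
  = 1 - (1 - p)^{n}
```

For example, with an assumed $p = 0.3$ and $n = 10$, the right-hand side is $1 - 0.7^{10} \approx 0.97$, which shows why many generations per round can compensate for any single generation's imperfect use of a very long context.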

And, and yeah, and then the other distinguishing factor for us in this work was at least like, I feel like there's a lot of like science assistant style, like scientific discovery style projects, not just I think in a lot of different places, but where they do get a little bit hung up is in like curating these really nice cozy benchmarks where you can hill climb on.

And that was not our philosophy at all.

For us, we feel like the key deal is like if a system does something useful, we should rather sprint straight ahead so that like we can validate it in the lab, and then hopefully take that onward towards like a real meaningful discovery.

And so that was what we were most focused on, like, if we engineer a system, we go straight to the scientist who's like an expert in the field, we show them the idea.

And if they like it, we like try to convince them to like validate it.

And if they like validate it becomes a discovery, then yeah, great.

So, so there was that thing where we like really wanted to not micro-optimize on like specific benchmarks and hill-climb on them. I mean, I think if we did that, because there's like so many different components in the system, I think this work would have taken easily another year, if you tried to like, you know, make it the best. Rather, we were just focused on, let's stitch them all up together, let's get it to do something reasonable.

And then we'll do like end to end validation.

Yeah, okay.

That's it's a brave new world out there.

One thing that seems to be changing now, like not too long ago, I could have cited, you know, a handful of papers that sort of showed this, that like, typically three to six rounds of self-critique and, you know, sort of auto self-improvement seemed to be where like GPT-4 would kind of max out.

And beyond that, usually I felt like I would see that, you know, performance would decline if you kept running it longer than that.

You guys are talking about running these things for days.

Maybe you can tell us a little bit of like, how, you know, what are the budgets for this?

Like, how many tokens are we talking about?

And, you know, if we translate that to retail price, you know, what would the inference bill be for, you know, finding the mechanism for microbial drug resistance?

And is there any limit at this point to how many rounds of this you can run?

Or is it, you know, are we already at the point where you can just like run the thing for longer and longer sort of potentially indefinitely?

Yeah, I think this is a fascinating question.

It's something that like, has also like kind of like, I don't, I wouldn't say bothered me, but it's like intrigued me as well.

Because I remember reading one of these Andrej Karpathy posts where he talks about like leaving like a CNN training over like the winter break.

And then it magically led to like state of the art performance on some benchmark.

I forget which one that was, but all that he had to do was like let it run for like 40 days, which was a lot of compute back then.

But like, it was like unprecedented in some ways.

So yeah, I think for us, like with the Co-Scientist, I think the key thing is the fact that the system is not like closed-loop.

The fact that it has access to these different kinds of tools means that in every round of like self critique or like iteration, it can bring in new information to the system that increases the entropy.

And when that happens, that prevents the possibility of like mode collapse and like degenerate solutions from happening.

So I think that is the key thing.

So the fact that like the system can go and do web search, browse like interesting parts of the world, take information out of it and integrate that with the knowledge that it has and do that in like an effective manner.

I think that is the part that leads to like, okay, more computation being spent like efficiently and effectively over here and helping.

And it's not just web search, right?

I think like increasingly we'll be able to get feedback from like other kinds of knowledge bases, specialized tools, AlphaFold.

And so as the quality or like the surface area of the hypotheses increases, the more different kinds of feedback we'll be able to plug into the system.

And I expect that it wouldn't then like, you know, mode collapse.

And it's likely that there is going to be like even more increasing value to spending more time on computation in this setup over here.

But if you were to like strip away that, then I think it comes down to okay, like the quality of like published information in any given domain and also down to the complexity of the problem.

Again, I don't know how to put like a precise definition to it, but if there's a problem where there's like a clear unknown that is like impossible to solve, I think it's very likely that no amount of computation, reasoning, test-time compute is going to be able to get you that information if the system does not have the capacity to get that information.

And on the other hand, there can be like problems which are like very trivial that it doesn't matter like if you, you probably get it in the first or like a few tries and it doesn't matter.

So I think there's that sweet spot over here where that problem is like within limits, where it's also not trivially easy, where spending this computation helps.

And my hypothesis is I feel like a good chunk of problems that we as humanity care about today actually fall in that sweet spot, where I think we can spend a lot of computation in silico and get like very useful, interesting ideas and answers, which is I think exciting.
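A minimal sketch of the open-loop idea described above, assuming stub functions in place of real model and tool calls. The names `refine`, `fetch_new_evidence`, and `run_rounds` are hypothetical and are not taken from the Co-Scientist system; the only point illustrated is that every round of critique receives information the previous round did not have.

```python
def refine(hypothesis, evidence):
    """Hypothetical stand-in for an LLM critique-and-revise call."""
    return f"{hypothesis} | revised using: {evidence}"


def fetch_new_evidence(round_idx, sources):
    """Hypothetical stand-in for a tool call (web search, a knowledge base,
    a specialist model, or a note typed in by a human)."""
    return sources[round_idx % len(sources)]


def run_rounds(seed, sources, rounds):
    """Run `rounds` of self-critique, injecting fresh evidence each time."""
    hypothesis = seed
    evidence_log = []
    for i in range(rounds):
        evidence = fetch_new_evidence(i, sources)  # new information enters here
        evidence_log.append(evidence)
        hypothesis = refine(hypothesis, evidence)
    return hypothesis, evidence_log


final, log = run_rounds(
    "initial hypothesis",
    ["search result", "structure prediction", "expert comment"],
    rounds=4,
)
print(log)
```

A closed-loop variant would call `refine` with no external evidence at all, which is the setting where the speakers expect extended self-critique to stall.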

And what do you think?

Yeah, I mean, just like a small thing to add is like, I think, you know, also our system allows for humans to input ideas, like there's other avenues to kind of add to this entropy.

I think one other important thing to think about is like, we're comparing pairwise, like every, you know, combination of hypotheses, so there is a lot of like, you know, variability, when you compare it to maybe other papers that are talking about self critique in a more like, we're just going to keep trying to improve the same idea.

Like there's many directions where we're getting this variance.
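The pairwise comparison described above can be sketched with a standard Elo update, which the episode summary names as the ranking scheme. This is a sketch under assumptions: the `judge` callable is a hypothetical stand-in for a model deciding which of two hypotheses is better, and the starting rating of 1200 and K-factor of 32 are conventional defaults, not values from the paper.

```python
import itertools


def expected_score(r_a, r_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update(r_a, r_b, a_won, k=32.0):
    """Return both ratings after one head-to-head comparison."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(r_a, r_b))
    return r_a + delta, r_b - delta


def round_robin(hypotheses, judge, start=1200.0):
    """Compare every pair once and return hypotheses sorted best-first.

    `judge(a, b)` is a hypothetical callable returning True when `a` wins.
    """
    ratings = {h: start for h in hypotheses}
    for a, b in itertools.combinations(hypotheses, 2):
        ratings[a], ratings[b] = update(ratings[a], ratings[b], judge(a, b))
    return sorted(ratings, key=ratings.get, reverse=True)


# Toy judge for illustration only: prefers the longer string.
ranked = round_robin(["aa", "a", "aaa"], lambda a, b: len(a) > len(b))
print(ranked)
```

Because every hypothesis meets every other one, the number of judge calls grows quadratically with the number of hypotheses.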

Yeah, that's quite insightful.

So the and I don't know, do you actually know the like total number of tokens for the microbe project?

I'm going to guess it was like 10 billion or something.

No, no, I don't think it's that bad.

It should probably fit within the context limits of these models.

I haven't done an exact analysis, but I would think it's less than 10 million.

10 million total inference tokens for the whole thing?

Yeah, I think so.

For the whole tournament?

Oh, for the whole no, no, okay.

I feel like from the time you give it the question to the time that it's done, it might be a little bit more than that.

But we did some back-of-the-napkin calculations over here.

And based on like current prices on GCP, we expect that most queries would be just a few dollars, less than 10 dollars for most queries, including all the tournament and everything in inference.

So it should be fairly feasible.

And it's probably just going to come down more and more in the next six to 18 months.
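The back-of-the-napkin arithmetic mentioned above is just tokens times price. The sketch below shows the shape of that calculation; the token split and the per-million-token prices are placeholder assumptions chosen for illustration, not actual GCP prices and not figures from the speakers.

```python
def inference_cost_usd(input_tokens, output_tokens,
                       usd_per_m_input, usd_per_m_output):
    """Estimate cost from token counts and per-million-token prices."""
    return ((input_tokens / 1e6) * usd_per_m_input
            + (output_tokens / 1e6) * usd_per_m_output)


# Placeholder numbers: 10 million total tokens, split 8M input / 2M output,
# with hypothetical prices of $0.10 and $0.40 per million tokens.
cost = inference_cost_usd(8_000_000, 2_000_000, 0.10, 0.40)
print(f"${cost:.2f}")
```

With these placeholder inputs the estimate lands in the low single digits of dollars, the same order of magnitude as the "less than 10 dollars" figure quoted above.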

Yeah, especially if the performance per token is also, you know, continuing to go up.

Yeah, there's a lot of tailwinds.

And maybe the other thing I would say is that this is probably the most dumb and inefficient version of the system that we have.

And I think there's like so many things over here which we could improve in efficiency.

And also from like an intelligence perspective, we can make it much better.

And so I think the bang that you would get on each token generated and dollar spent is going to be much, much more as we keep on improving the efficiency and capabilities of the system.

Are there other, you know, narrow specialist models that it has access to, and how important is that now and how important do you think that will be in the future as like a source of new entropy?

Because that seems like a potentially dramatic unlock, right?

I mean, the model itself is already trained on like most of the literature; to search the literature again at runtime helps with grounding, helps with things that, you know, maybe weren't in the training data that it can find after the cutoff date, what have you.

But the ability to actually go to simulations and bring that kind of information back, where literally maybe nobody's ever run that simulation before at all.

Yeah, that seems like a potentially pretty big step change.

Yeah, I totally agree.

And that's why I feel like this is probably like day zero or day one of this journey in many ways, because we've primarily scratched the surface of information that is just written down in papers and peer reviewed and published.

But like a lot of scientific information is just not in that format.

So for example, we don't like publish negative results, just because of the incentives around scientific publication.

So that's like kind of like some dark matter that's hidden away.

And you'll have to figure out like a good incentive mechanism to have that kind of data also flow into the system.

But I think that's going to be important as well going ahead.

But yeah, I think more excitingly is the fact that like a lot of these papers and publications will have this supplementary data file, for example, which contains these like giant data sets of experimental data.

And they contain a lot of useful nuggets of information, which again, like you can imagine a system like this, paired up with like another system, like a data science agent, if you saw that, which could automatically go and analyze the data. So this system can generate hypotheses, that data science agent can go analyze the data and like get the right kind of feedback back into the system to improve.

So I think certainly at like small scale data sets that can happen automatically.

But I'm even more excited about like what we might be able to do when we pair up the co-scientist with the data science agent and go after, say, so Arc Institute recently came out with this Virtual Cell Atlas, it's like gene perturbations for like 300 million cells.

That's like such a vast space.

Even for teams of humans, it would take like years to explore and look at the data and see the richness in there and come up with interesting insights.

But we could set up these agents to like generate hypotheses, look at the data, analyze, come back with feedback.

And so I'm just super excited about the insights that it would unlock into like basic biology, target discovery and things like that.

When we pair up these systems and set them to go on like these giant data sets that we are generating right now, which even for teams of humans are like very impractical to go after and analyze right now.

Do you think okay, here's a very high concept question that you may think is totally misguided or you might think it's the future.

Obviously, for the last three years, we've seen a dramatic convergence of like a couple core modalities, language, vision, speech, right?

For me, one of the like early eureka moments where I was like, I think I'm going to study this, you know, subject for the rest of my life, or at least until the singularity, was when I realized that it's basically the same architecture doing all these things.

And at the time I was like trying to make, you know, a video generation product work.

And I had all these like different specialist models, but I could see pretty clearly that if these, you know, fundamentally similar architectures can do these different tasks independently, then there's going to be some integration that's going to happen and they'll, you know, the single model will be able to do them all.

Now, you know, and this has come, you know, with a very sort of, you know, viral moment, both from Flash and from GPT-4o in the last two weeks.

Now it seems like that might happen again with the reasoning models and the narrow specialist models.

Like right now you have a model calling AlphaFold, getting results, but, you know, this would be akin to an earlier language model, like calling DALL-E, you know, to generate an image.

And there's this sort of very lossy language bottleneck that happens there that, you know, was definitely a point of like major frustration for people trying to generate images that looked like they wanted them to look.

And now that this integration has happened, it's like that problem is basically no more.

So the question is, do you think that that will happen for these other modalities as well?

Like my sense is that an early super intelligence might be reasoning models akin to what we have today that actually have these other modalities integrated in a more deep way so that they are not bottlenecked through like API calls, but are actually able to start to do some of this like reasoning in biological space or in material science space or in, you know, transcriptome space or whatever.

There's a lot of spaces out there, obviously.

And it seems like we already see like pretty superhuman performance by the narrow models; like, I'm not aware of any human that can look at an amino acid sequence and like intuit what the shape is going to be.

And maybe a couple savants out there, but certainly, you know, it's not common.

Do you think that happens or, you know, what's your reaction to that possibility?

Yeah, I'm not sure.

I'm not sure.

I mean, to be honest, I think the API call thing is like clearly, I think we're very close to, you know, being able to do that pretty well.

I think a deeper integration.

Yeah, I don't know exactly how easy that is for all of these kind of specialist type areas.

Yeah.

I think it actually depends on some of the incentives of like people who are developing these frontier lab models.

I think it makes a lot of sense to combine speech and vision and language.

But if you still look at the data that's going into these models, it's primarily non-specialized public data.

And then when you're thinking about biology modalities and data sets, they are not as close to like the kind of data sets that are going in over here from the public internet over here.

They're very, very different.

And from like some personal experience, what we have seen is when you try to introduce, say, some of these more specialized modalities, even if it's medical imaging modalities, right?

It's like images, natural images.

And then you have medical images, and they lead to like regressions in benchmark performance.

And so like, then the question becomes, okay, are you willing to sacrifice some regressions on, say, LMSYS or like some other benchmark that is generally considered like an important benchmark, to have your system have like a little bit more capability in like medical images, when it's obviously not going to be as big a fraction of your users.

So I think it's a question of like incentives right now.

And I can see this tension in like many of the frontier model companies where when you introduce these interesting new modalities of information, it leads to sacrifice in other areas.

And so you're sometimes like sacrificing benchmark performance; I would argue that's a good thing.

It doesn't matter so much, this benchmark performance; I think you should aim for like practical utility.

But in the absence of like clear measures of utility, it becomes difficult to convince.

And so I think it's not a question of like, can we do it. I think the architectures exist, even the compute exists at most of these places.

But do the incentives exist?

That part I don't know.

And I don't know when that'll happen because it's unclear.

I think today I cannot articulate, like at a high level, I think that we all agree that if we were to, you know, encode the biomedical universe, that model should be able to do a lot of interesting things.

But that is sometimes conflicting with like benchmark performance on like, I don't know, like LMSYS or whatever else that you want to use right now.

And then so it just becomes like a question of incentives.

Yeah, okay.

That reminds me of, I've brought this up a couple of times so I'll keep it brief, but I heard Yi Tay, he was at Reka at the time, on the Latent Space podcast talk about how like the separation between vision and language models was sort of a reflection of like the research history.

And you know, you at one point would have like a language team and a vision team.

And then it was like, maybe we can bridge these together, but then you would have like late fusion models because those things would already be sort of done and baked.

And you know, now can we like get them to sort of talk to each other via cross attention or whatever.

And then it sort of became like, well, if this works, you know, it would probably work even better if we do it all with just kind of, you know, interwoven data sets from the beginning.

And I can see how that, you know, that same thing might be about to play out again.

I hadn't really heard so much of the benchmark thing or you were saying like you've observed that in, for example, adding image capability, like standard benchmarks do decline.

Yeah, I can't give you full details.

But like when we've tried like adding medical images or like, say, genomic information, again, it depends on how you train these models.

But like, we've tried to do things in like a pretty standardized, but maybe also like the easiest way.

And we've seen like, okay, while obviously on the benchmarks that are reflective of the new modalities that you're adding, over there the performance goes up, you're sacrificing performance on like the main original benchmarks that are like language understanding focused or vision focused.

And so yeah, you have to do this Pareto optimization.

And again, that also requires compute in its own ways.

And so, but that's all in like a late fusion paradigm, like you're starting with a model that's already like scoring on benchmarks and then trying to kind of...

Yeah, I mean, the easiest thing to do is like do continued pre-training or SFT.

So yeah, I mean, like the experiments that I'm talking about are not like late fusion, but it's like more continued pre-training and SFT over here.

But obviously, the better thing to do would be to put all that data back into the pre-training and train all of them together.

But to motivate that kind of like undertaking would require you to show some benefits.

And so typically how this works is you take like a pre-trained checkpoint, do like continued pre-training or SFT.

And then if you show that, okay, like your new data is helping improve performance on like benchmarks that everyone else cares about, then that data goes back in.

But yeah, and so that is where like I think there's these tensions, because these are like super esoteric modalities.

And so I mean, we could spend all the time doing that Pareto optimization, throwing compute at it and trying to figure out the best combination.

Or we take this other approach where we say, okay, doesn't matter, we'll train our own models, we'll train our own agents and then we'll like stitch them up together.

And so I think that's where we are at right now.

It's a little bit of a local optimum.

But I think it's still fine, because it still allows us to do a lot of these interesting things.
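The gating logic described above (new data only "goes back in" if it helps on benchmarks people care about) can be sketched as a simple check. This is one possible reading, not the team's actual procedure: the function name, the benchmark names, the scores, and the `tolerance` parameter are all hypothetical.

```python
def data_goes_back_in(baseline, candidate, tolerance=0.0):
    """Return True if the candidate checkpoint gains on at least one shared
    benchmark and drops on none by more than `tolerance` points."""
    shared = baseline.keys() & candidate.keys()
    no_regression = all(candidate[b] >= baseline[b] - tolerance for b in shared)
    some_gain = any(candidate[b] > baseline[b] for b in shared)
    return no_regression and some_gain


# Made-up scores for illustration only; these are not reported results.
baseline = {"language_bench": 80.0, "vision_bench": 70.0}
with_new_modality = {"language_bench": 78.5, "vision_bench": 70.5}

print(data_goes_back_in(baseline, with_new_modality))                 # strict gate
print(data_goes_back_in(baseline, with_new_modality, tolerance=2.0))  # looser gate
```

The choice of `tolerance` is where the incentive question discussed above shows up: a strict gate keeps esoteric modalities out, a looser one accepts a small regression in exchange for new capability.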

Yeah, the proof is in the pudding.

I'll remember to come back and, you know, ask a follow up on that next time you're here.

So let's see, not too much time left.

And you know, there's so much in these papers that we could cover.

One thing for kind of practical utility, maybe a couple things for practical utility for just people that are building their own systems out there.

One is that this tournament style, head to head evaluation seems to be kind of becoming like an industry standard.

I don't know if you'd go quite that far, but there's a strong trend, it seems, toward trying to surface the best ideas by doing pairwise comparisons and having some sort of, you know, World Cup style round robin approach to doing that.

I don't know if you want to add anything else to that.

And then I also wanted to talk about structured reasoning as opposed to chain of thought, because I think that is one that a lot of people listening could probably go apply to their projects and get a boost, you know, like tomorrow.

So yeah, maybe unpack those two things.

Yeah, I think the tournament one is interesting, because we were actually motivated by AlphaStar where we had these tournaments of agents competing and that led to a lot of strong results in that setting.

I don't know if it's an industry standard, because I also feel like it's somewhat inefficient and it hints at the limitations of these models in some ways, where they're not maybe able to independently score and verify ideas, but rather have to do these like N-squared pairwise comparisons.

And so we do have to do some optimizations where we have to cluster them up together, group them up together.

So that should reduce the computation and not do 1000 cross 1000 idea comparisons, because that would be too expensive.

So I would think that would become a little bit more computationally efficient going ahead.

But I think that overall ranking of things, the more you can do that in latent space, I think it's going to lead to more interesting results, better reasoning, and so on and so forth.

So yeah, I expect that idea to stay, but to not happen as explicitly as it's happening right now, but more of it happening in the latent space of the reasoning of these models.

I'm not sure if you want to add to that.
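To make the N-squared point above concrete, the sketch below counts judge calls for a full round robin versus a clustered tournament. The clustering scheme (full round robin inside each cluster, then a play-off among one winner per cluster) is an assumption made for illustration; the speakers only say that ideas are grouped to avoid comparing everything against everything.

```python
from math import comb


def all_pairs(n):
    """Comparisons needed to compare every idea with every other idea once."""
    return comb(n, 2)


def clustered_pairs(cluster_sizes):
    """Comparisons for a round robin inside each cluster, plus a round robin
    among one winner per cluster (an assumed scheme, see the note above)."""
    within = sum(comb(size, 2) for size in cluster_sizes)
    playoff = comb(len(cluster_sizes), 2)
    return within + playoff


print(all_pairs(1000))             # 499500
print(clustered_pairs([50] * 20))  # 24690
```

Under these assumptions, grouping 1000 ideas into 20 clusters of 50 cuts the number of comparisons by roughly a factor of 20.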

And I mean, I think regarding the structured reasoning approach, I think there's practical reasons for it.

When we have two agents that we want to talk to each other, it helps to have a data structure that we're passing.

And that just makes the engineering of it itself a little bit easier.

But I think beyond that, comparing just asking the model to, you know, in a chain of thought style, just reason about this, versus defining a certain reasoning structure, I think the advantage there is we can actually kind of better enforce it to follow a certain path.

So like in our case, right, we wanted it to do like this long analysis before kind of going into like higher level management goals, before then finally like forming its management plan.

And like being able to define that structure forces the model to take that path through its reasoning rather than, you know, free form. If we allow it to do it free form, like maybe it starts to form its management plan before it has done these kind of higher level steps that we want it to go through.

Yeah, so basically, just to try to describe this for people who might want to implement it.

I mean, for one thing, it takes advantage of another, you know, notable feature that models have gained over the last year or so, which is that we can now specify as part of an API call, this is the exact JSON data structure that you are supposed to return.

So that's huge, because it makes it really easy to set, you know, set that up and then get something back.

It kind of is a little bit like an airline checklist sort of thing, where you're just like, okay, I want you to absolutely every time go through these steps.

And if you do that, you know, we're confident you're going to get better results in the end versus just kind of, you know, walking out there and, you know, randomly walking around the plane and coming back and saying, yeah, it all looks good to me, right.

So the intuition for that is pretty simple.

How dynamic did you make that because I've never actually done a dynamic structured.

Yeah, I think in our case, I think we tried a lot of different structures and like strategies for, you know, generating the ultimate management plan.

I think when we tried to get a little too fine grained with, like, you know, okay, first summarize the patient's chief complaint, then summarize, like when we tried to get a little too granular with that, I think having the flexibility of just these kind of higher level things like analysis, management goals, things that are pretty general, tended to lead to better management plans.

Of course, this is all under our own auto evaluation and vibe checks, as I mentioned.

So, you know, that's maybe up for debate.

But I think, yeah, and I don't think it was like super.

So it was dynamic in that sense.

Like, you know, it can have any number of analysis items.

It's like a, you know, a list of however many items, any number of management goals.

But I think we try not to constrain it too much, just constrained to the point of like we want it to go through a certain like reasoning structure.
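The structure described above (analysis first, then management goals, then the management plan, each with any number of items) can be sketched as a JSON schema plus a parser that rejects out-of-order or malformed output. The field names and the `parse_response` helper are hypothetical; the source only names the three stages. How the schema is handed to the model depends on the API in use and is not shown here.

```python
import json
from dataclasses import dataclass

# Hypothetical field names; lists are unbounded so the model can emit
# as many analysis items or goals as it needs.
REASONING_SCHEMA = {
    "type": "object",
    "properties": {
        "analysis": {"type": "array", "items": {"type": "string"}},
        "management_goals": {"type": "array", "items": {"type": "string"}},
        "management_plan": {"type": "string"},
    },
    "required": ["analysis", "management_goals", "management_plan"],
}


@dataclass
class StructuredReasoning:
    analysis: list
    management_goals: list
    management_plan: str


def parse_response(raw):
    """Parse model output and check that the stages appear in the required order."""
    data = json.loads(raw)  # dicts keep the key order found in the JSON text
    if list(data) != REASONING_SCHEMA["required"]:
        raise ValueError(f"unexpected fields or order: {list(data)}")
    for key in ("analysis", "management_goals"):
        items = data[key]
        if not isinstance(items, list) or not all(isinstance(i, str) for i in items):
            raise ValueError(f"{key} must be a list of strings")
    if not isinstance(data["management_plan"], str):
        raise ValueError("management_plan must be a string")
    return StructuredReasoning(**data)


example = json.dumps({
    "analysis": ["finding one", "finding two"],
    "management_goals": ["goal one"],
    "management_plan": "plan text",
})
print(parse_response(example))
```

Keeping the schema to three coarse fields mirrors the point made above that overly granular structures produced worse plans in the team's own evaluations.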

Yeah, okay.

So structured outputs: people, don't overdo it, but definitely use it.

I mean, that was one of the things that seemed like it really drove a pretty big lift.

Go back just to the comparisons for one more second.

Are you basically saying that like, you think in the future, it won't have to be so head to head and instead, it'll just be like, here's 10 things pick the best.

Yeah, I would hope so. In some ways, what we're trying to do is like force the model to explicitly do the tree search, right, you know, come up with the new ideas, like go to different nodes, and then do the comparison over here.

So it's almost like, okay, you come up with ideas, you write them down, then you review them, and then you figure out what's the best.

But like, the question is, okay, can all of that happen in your head in some ways?

Like, does it have to be explicitly written down?

Do you have to explicitly generate all those tokens?

Can you do something in the way you set up the architecture itself, or yeah, in some other mechanism where all of that happens in like the latent space itself, so that you're more efficient with the tokens that you're generating.

So I think there's an inefficiency right now; I'm sure it helps with like interpretability and other different aspects over here.

But I think there's like a lot to be gained by encoding that tree search within the latent reasoning of the models, and we don't do that as well right now.

Yeah, okay, cool.

That's helpful.

Maybe the last two things, what do you think would happen if you just prepended a question identification agent to the co-scientist, and just had the thing, you know, kind of, you know, run in a loop where like, the first thing it did instead of taking a question from a human scientist is just like, go out on the internet and you know, search around and come up with an interesting question for itself, and then just try to answer its own question.

Is there anything about that that you think wouldn't be effective?

I feel like in some ways, like, we like to talk about the concept of like root node problems at Google and DeepMind.

And we felt like, once you have a question, generating novel, original solutions to that is a root node problem.

But in some ways, what you are describing over here is an even upstream root node problem.

Like, how do you ask the right question?

And I feel like the day we get systems to reach that, then I think that is the day we can truly say like, okay, we have geniuses in data centers.

I think that is like, that's going to be the most impactful and important unlock.

And I mean, my feeling is like, there really should be like, decent capability in these models to like, you know, go surf the internet, read information and figure out what are the right questions to ask.

So yeah, I mean, if it's okay, we'll go and give it a try.

Have you had a chance?

This is a bonus.

Have you had a chance yet to try this with Gemini 2.5?

It seems like, from my qualitative assessment, it would be a lot better.

Yeah, no, I think that's the exciting part, because whatever we've described in the paper was all Gemini 2.0.

So should be coming up very soon.

But yeah, we're super excited about that.

Feels like the path to geniuses in a data center is honestly pretty clear at this point, which is a crazy thing to say.

But you know, do you see are there any like, you know, big barrier questions that you feel are, you know, just fundamentally unanswered still, you know, programmers often call it a simple matter of programming, like, it's going to be work, but you know, we can make it work.

Is that kind of the mindset right now for you guys?

Or are there questions where you're like, we really don't have a good answer to how we're going to get over that part?

Like we have all the building blocks over here.

And so we can probably build something that will look very close to what you're talking about over here. Whether that's the most beautiful one, the most elegant one, we don't know, but does that even matter?

I think it doesn't.

So I think that is why it's truly exciting, where I think we have line of sight to one solution, which feels like it will get us where we want to go.

And that in turn is going to like lead to a lot of new unlocks.

And so, yeah, I would say I think for the next couple of years, at least, to me feels mostly an engineering challenge, rather than like trying to answer some fundamental unknowns.

Yeah.

And of course, the other big challenge is going to be the social challenge of introducing the stuff to the world and, you know, getting scientists to pay attention.

Maybe to close, you want to talk a little bit about what you're doing in that regard?

Because I was excited to see that we're now getting to the point where you're inviting scientists to reach out and partner with you on this.

And also going into, I don't know if you would officially call it a clinical trial, but like something, you know, in the actual field of medical practice with real patients too.

So, you know, tell us what you're doing on the deployment side.

Yeah, I think the co-scientist is a little bit easier for us to deploy.

I think there are maybe less questions around like regulatory and things like that.

And so it's, so I mean, the only thing that maybe like bothers us a little bit is that the system is highly capable.

And so there are also like many ways in which it can possibly not do so well.

Right.

And so we just want to ensure that like, as we scale up the system, we do that in a responsible manner.

And so that's why like, you know, we have this trusted tester program, we've already been working with like close to 100 scientists right now across the world.

And these are all like world leading experts.

And with the trusted tester, we want to invite like more organizations.

And so our hope is that like, we can do this in like batches and in waves.

And then every batch, we get feedback, identify the weak points of the system, we improve it and make it better for the next batch of scientists.

And so, yeah, this shouldn't take too long. I think like by the end of the year, I'm hoping that, yeah, if I'm optimistic, like millions of scientists around the world will have access to this tool.

And hopefully it like raises the bar, like the ceiling, for all of them and they get to do like more creative and interesting work.

I think the one with AMIE is a little bit more tricky; it's obviously a more complex space.

But again, there's like no one path to taking such systems out there in the real world that can like give diagnosis and treatment recommendations.

And so we are very excited about the clinical trial that we have coming up.

So it's going to be I think, one of the first studies of its kind where an LLM based system is going to be interacting with real patients.

And the nice thing about the setup that we have right now is we are deploying it in a clinic where there is sufficient presence of clinical experts who can oversee the system and provide oversight.

And so that is a very safe environment for us to like deploy the system, where there's like sufficient number of doctors who can take over if something goes wrong.

And our hope is in that study, like not a lot goes wrong.

So that allows us to dial down the amount of like oversight that's needed.

And so yeah, I think if things go well, then we'll probably scale it out to more centers, like reduce the amount of expertise, and also introduce like more net new capabilities into the system that we're going to trial and make them more patient facing.

So yeah, I think that's the exciting part, where I feel like the research has progressed quite a bit.

And so it's now time to, like, take it for a drive in the real world.

It's exciting times, guys, really outstanding work.

People should be paying more attention than they are.

And hopefully we'll, you know, put a little, you know, dent in the consciousness by bringing, you know, some more attention to this, but just really mind blowing, really mind blowing stuff and, you know, quite a series of work that you guys have put out.

Anything else you want to share in parting or any other thoughts you want to leave people with?

No, I mean, it's been a pleasure to work on these projects, obviously, and we have a lot cooking still.

So, you know, I'm excited.

I'm excited for the future of this.

Yeah.

Yeah.

And likewise, I think, for me, like talking to Nathan, it's just a lot of fun.

And yeah, I mean, it's a real pleasure, like fourth time over here.

I think there's going to be at least like one more where it's big enough that we'll come back again.

And I just hope you don't get too big for me.

That's my obviously here.

I think it's so much fun.

Cool.

Well, really appreciate it.

Again, fantastic work, Vivek Natarajan and Anil Palepu.

Thank you for being part of The Cognitive Revolution.

It is both energizing and enlightening to hear why people listen and learn what they value about the show.

So please don't hesitate to reach out via email at TCR at turpentine.co, or you can DM me on the social media platform of your choice.