Latent Space · 2025-02-18

The Inventors of Deep Research — Inside Gemini Deep Research at Google DeepMind

Hosts: Alessio, Swyx

Guests: Aarush Selvan, Mukund Sridhar

Deep Research agents · Gemini · Agentic UX design · Post-training / fine-tuning · Iterative planning and tool use · Long context vs RAG · Evaluation methodology · Latency vs user perception · Asynchronous orchestration · Reasoning / thinking models · Personalization and multimodality

Why it matters

Deep Research is a fine-tuned post-training of Gemini 1.5 Pro, not the base model.

Key claims

Gemini Deep Research was built and shipped before OpenAI's Deep Research and is what the hosts credit with creating the entire Deep Research agent category.
Deep Research is a fine-tuned post-training of Gemini 1.5 Pro, not the base model — the team's main post-training work was teaching reliable iterative planning.
UX bets: an editable research plan shown upfront (their #1 user tip), live website stream during browsing, and a side-by-side report + chat artifact so follow-ups stay in the same surface.
Architecture: two core tools (web search and deeper webpage reading), parallelizable sub-steps in the plan, then an analysis phase that reconciles inconsistencies and self-critiques before final output.

Episode summary

Aarush Selvan (PM) and Mukund Sridhar (Tech Lead) from Google DeepMind's Gemini Deep Research team join Latent Space to walk through how they built the product that, by their account, originated the entire "Deep Research" agent category — before OpenAI's Deep Research launched on Feb 2 and a wave of clones followed. They explain why Deep Research is an agentic, ~5-minute web research workflow (not just LLM search), and unpack the design choices: an editable, transparent research plan shown upfront; real-time visibility into the websites being read; and a side-by-side artifact + chat layout so users can keep iterating without losing context.

On the technical side, the team describes post-training a special version of Gemini 1.5 Pro (rather than using the base model) so the agent can do iterative, parallelizable planning across two core tools (search and deeper webpage reading), then enter an analysis mode that reasons over inconsistencies, drafts an outline, and self-critiques before finalizing the report. They discuss long-context vs RAG trade-offs (keeping recent research in-context for follow-ups, falling back to RAG for older turns), HTML-to-markdown handling, vision deferred due to latency cost, and an in-house asynchronous orchestration platform built to handle multi-minute, retryable jobs with durable state.

The conversation also covers evaluation (high-entropy outputs, a mix of auto-metrics like research-plan length and iteration counts plus heavy human evals against an ontology of use cases spanning broad/shallow to narrow/deep), the counterintuitive finding that users reward longer waits as evidence of effort, and where the product is headed: personalization, multimodal in/out and generative UI, and complementing the open web with private documents and subscriptions. They note that next-gen reasoning/thinking models are being explored but introduce new tension between leveraging internal model memory and grounding in sources.

Context strategy: keep recent research tasks in the long context window (up to 1–2M tokens, with RAG fallback); older turns are fine to store via RAG because cross-comparison is rare.
Counterintuitive latency insight: users reward longer runtimes as evidence of effort — the team originally capped jobs at 10 min and rejected a 15-min 'hardcore mode,' but discovered people wanted more, not less.
Evals are hard because output entropy is high: they use auto-metrics (plan length, iteration counts, behavior distributions on a dev set) plus opinionated human evals against an ontology of research behaviors (broad/shallow vs narrow/deep, plus compound 'project' queries like wedding planning).
Roadmap themes: personalization, multimodal in and out (charts, maps, images), generative UI tailored to audience, and letting users incorporate private docs and paid subscriptions into Deep Research.

Source material

Transcript

Everybody's going deep now.

Deep work, deep learning, deep mind.

If 2025 is the Year of Agents then the 2020s are the decade of deep.

While LLM-powered search is as old as Perplexity and SearchGPT, and open source projects like GPT Researcher and clones like Open Deep Research exist, the difference with commercial deep research products is that they are both agentic and bundle custom-tuned frontier models like OpenAI's o3 or, as today's guests discuss, a fine-tuned version of Gemini.

Since the launch of OpenAI's Deep Research on February 2nd, the reactions have been nothing short of breathless.

Quote, "Deep research is the best public facing AI product Google has ever released.

It's like having a college educated researcher in your pocket."

End quote from Jason Calacanis.

Quote, "I have had Deep Research write a number of 10-page papers for me, each of them outstanding.

I think of the quality as comparable to having a good PhD level research assistant and sending that person away with a task for a week or two, or maybe more.

Except Deep Research does the work in five or six minutes."

End quote from Tyler Cowen.

Quote, "Deep research is one of the best bargains in technology."

End quote from Ben Thompson.

Quote, "My very approximate vibe is that it can do a single digit percentage of all economically valuable tasks in the world, which is a wild milestone."

End quote from Sam Altman.

Since then, a dozen open and closed source clones have emerged from the woodwork trying to replicate this success, from Perplexity to xAI with their Grok 3 launch late yesterday.

In today's episode, we welcome Aarush Selvan and Mukund Sridhar, the lead PM and tech lead for Gemini Deep Research, the originators of the entire category of Deep Research agents, which have overnight become the newest killer use case for AI.

We asked detailed questions from inspiration to implementation: why they had to fine-tune a special model for it instead of using the standard Gemini model, how to run evals for them, and how to think about the distribution of use cases.

Aarush and Mukund will also be joining us as keynote speakers for the agents engineering track at the AI Engineer Summit in New York City on February 21.

This is our last in our recent series of upcoming AI Engineer Summit speakers, and we hope you're as excited for their talks and workshops as we are.

You can sign up for the online live stream linked in the show notes.

See you at the summit.

Watch out and take care.

Hey everyone, welcome to the Latent Space podcast.

This is Alessio, partner and CTO of Decibel Partners, and I'm joined by my co-host, Swyx, founder of Smol AI.

Hey, and today we're very honored to have in our studio Aarush and Mukund from the Deep Research team, the OG Deep Research team.

Welcome.

Thanks for having us.

Yeah, thanks for making a trip up.

I was fortunate enough to be one of the early beta testers of Deep Research when it came out.

And I would say I was very keen on like even, I think even at the end of last year, people were already saying like it was one of the most exciting agents that was coming out of Google.

You know that previously we had on Raiza and Usama from the NotebookLM team.

And I think like this is like an increasing trend that Gemini and Google are shipping interesting user-facing products that use AI.

So congrats on your success so far.

Yeah, it's been great.

Thanks so much for having us here.

Yeah, excited.

Yeah, thanks for making a trip up.

And I'm also excited for your talk that is happening next week.

Obviously, we have to talk about what exactly it is.

I'll ask you towards the end.

So basically, okay, you know, we have the screen up.

Maybe we just start at a high level for people who don't yet know, like what is Deep Research?

Sure.

So Deep Research is a feature where Gemini can act as your personal research assistant to help you learn about any topic that you want more deeply.

It's really helpful for those queries where you want to go from zero to 50 really fast on a new thing.

And the way it works is it takes your query, browses the web for about five minutes, and then outputs a research report for you to review and ask follow up questions.

This is one of the first times you know, something takes about five, six minutes trying to perform your research.

So there's a few challenges that brings. Like, you want to make sure you're spending that time and compute doing what the user wants.

So there's some ways of the UX design that we can talk about as we go through an example.

And then there's also challenges in browsing: the web is super fragmented, and being able to plan iteratively as you parse through this noisy information is a challenge by itself.

Yeah, this is like the first time Google is sort of automating searching itself. Like, you know, you're supposed to be the experts at search, but now you're meta-searching and determining the search strategy.

Yeah, I think at least we see it as two different use cases.

There are things that you know, you know, exactly what you're looking for.

And there, search is still probably, you know, one of the best places to go.

I think where Deep Research really shines is when there are multiple facets to your question and you spend like a weekend, you know, just opening like 50, 60 tabs.

And many times I just give up and we wanted to solve that problem and give a great starting point.

Do we want to start a query so that it runs in the meantime and then we can chat over it?

Yeah, here's one query that we like, we love to test like super niche random things, like things where there's like no Wikipedia page already about this topic or something like that, right?

Because that's where you'll see the most lift from a feature like this.

So for this one, I've come up with a query.

This is actually Mukund's query that he loves to test: help me understand how milk and meat regulations differ between the US and Europe.

What's nice is the first step is actually where it puts together a research plan that you can review.

And so this is sort of its guide for how it's going to go about and carry out the research, right?

And so this was like a pretty decently well specified query.

But like, let's say you came to Gemini and were like, tell me about batteries, right?

That query, you could mean so many different things.

You might want to know about the latest innovations in battery tech.

You might want to know about like a specific type of battery chemistry.

And if we're going to spend like five to even 10 minutes researching something, we want to one understand what exactly are you trying to accomplish here?

And to give you an opportunity like to steer where the research goes, right?

Because like, if you had an intern and you asked them this question, the first thing they do is ask you like a bunch of follow up questions and be like, okay, so like, help me figure out exactly what you want me to do.

And so the way we approached it is we thought, why don't we just have the model produce its first stab at the research query, at how it would break this down, and then invite the user to come and kind of engage with how they would want to steer this.

Yeah.

And many times when you try to use a product like this, you often don't know what questions to look for or the things to look for.

So we kind of made this decision very deliberately that instead of asking the users just follow up questions directly, we kind of lay out, hey, this is what I would do.

Like these are the different facets.

For example, here it could be like what additives are allowed, and how that differs or labeling restrictions and so on in products.

The aim of this is to kind of tell the user about the topic a little bit more, and at the same time elicit, you know, follow-up questions and so on.

So yeah, editable chain of thought.

Right.

Exactly.

Yeah, I think that, you know, we were talking to you about your top tips for using Deep Research, and your number one tip is to edit the plan.

Just edit it, right?

So like we actually, you can actually edit conversationally.

We put in a button here just to like draw users' attentions to the fact that you can edit this.

Oh, actually, you don't need to click on this.

You don't even need to click on this chat.

Yeah, actually, like in early rounds of testing, we saw no one was editing.

And so we were just like, if we just put a button here, maybe people will like, I think I just hit start a lot.

I think like we see that too.

Like most people hit start.

It's like the I'm feeling lucky.

Yeah.

All right.

So like I can just add a step here.

And what you'll see is it should like refine the plan and show you a new thing to propose.

Here we go.

So it's added step seven: find information on milk and meat labeling requirements in the US and EU.

Or you can just go ahead and hit start.

I think it's still like a nice transparency mechanism, even if users don't want to engage, like you still kind of know, okay, here's at least an understanding of why I'm getting the report I'm going to get, which is kind of nice.

And then while it browses the web (and Mukund, you should maybe explain kind of how it browses).

We show kind of the websites it's reading in real time.

Yeah.

To preface this, I forgot to explain the roles: you're the PM and you're the tech lead.

Yes.

Okay.

Yeah.

Just for people who don't know.

We maybe should have started with that.

Yeah.

We do each other's work sometimes as well, but more or less that's the boundary.

Yeah.

So what's happening behind the scenes actually is we kind of give this research plan that is a contract and that has been accepted.

But then if you look at the plan, there are things that are obviously parallelizable.

So the model figures out which of the substeps it can start exploring in parallel.

And then it primarily uses like two tools.

It has the ability to perform searches and it has abilities to go deeper within a particular webpage of interest.

And oftentimes it'll start exploring things in parallel, but that's not sufficient.

Many times it has to reason based on information found.

So in this case, one of the searches could have led to: the EU Commission has these additives banned.

It wants to go and check if the FDA does the same thing.

Right.

So this notion of being able to read outputs from the previous turn, and ground on that to decide what to do next, I think was key.

Otherwise you have like incomplete information and your report becomes a little bit of a like a high level bullet points.

So we wanted to go beyond that blueprint and actually figure out what are the key aspects here.

So yeah.

So this happens iteratively until the model thinks it's finished all its steps.

And then you kind of enter this analysis mode.

And here there can be inconsistencies across sources.

You kind of come up with an outline for the report, start generating a draft.

The model tries to revise that by self-critiquing, to finalize the report.

And that's probably what's happening behind the scenes.
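
A minimal sketch of the loop described above, assuming two stubbed tools and a toy rule in place of the model's reasoning. The names `search`, `read_page`, `propose_follow_ups`, and `research` are hypothetical and are not the team's implementation; the point is only the shape: run independent plan steps in parallel, then ground the next round on what the previous round found.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


# Hypothetical stand-ins for the two tools. A real agent would call a
# search backend and a page fetcher; these return canned data.
def search(query: str) -> list[str]:
    return [f"https://example.com/{query.replace(' ', '-')}"]


def read_page(url: str) -> str:
    return f"notes from {url}"


@dataclass
class ResearchState:
    pending: list[str]  # plan steps not yet explored
    notes: dict[str, list[str]] = field(default_factory=dict)


def run_step(step: str) -> tuple[str, list[str]]:
    """Search for one plan step, then read each result in more depth."""
    return step, [read_page(url) for url in search(step)]


def propose_follow_ups(state: ResearchState) -> list[str]:
    """Placeholder for the model reasoning over what it has found so far.

    Toy rule: any explored step about the EU triggers a matching US check.
    """
    follow_ups = []
    for step in state.notes:
        if "EU" in step:
            us_step = step.replace("EU", "US")
            if us_step not in state.notes and us_step not in follow_ups:
                follow_ups.append(us_step)
    return follow_ups


def research(plan: list[str], max_rounds: int = 5) -> ResearchState:
    state = ResearchState(pending=list(plan))
    for _ in range(max_rounds):
        if not state.pending:
            break
        # Independent steps run in parallel within a round.
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(run_step, state.pending))
        for step, notes in results:
            state.notes[step] = notes
        # The next round is grounded on what the previous round found.
        state.pending = propose_follow_ups(state)
    return state


state = research(["EU banned additives", "milk labeling rules"])
print(sorted(state.notes))
```

In this toy run the second round exists only because the first round surfaced an EU finding, which mirrors the EU Commission and FDA example from the conversation.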

What's the initial ranking of the websites?

So when you first started it, there were 36.

How do you decide where to start?

Since it sounds like, you know, the initial websites kind of carry a lot of weight too, because then they inform the following.

Yes.

So what happens in the initial turns, again, this is not like a, it's not something we enforce.

It's mostly the model making these choices.

But typically we see the model exploring all the different aspects in the, in the research plan that was presented.

So we kind of get like a breadth first idea of what are the different topics to explore.

And in terms of which ones to double click on, I think it really comes down to this: every time you search, the model gets some idea of what the page is.

And then depending on what pieces of it, sometimes there's inconsistency.

Sometimes there's just like partial information.

Those are the ones it double clicks on.

And yeah, it can continually like iteratively search and browse until it feels like it's done.

Yeah.

I'm trying to think about how I would code this. A simple question would be like, do you think that we could do this with the Gemini API, or do you have some special access that we cannot replicate?

You know, like, if I model this with tool calls like search, double click, whatever.

Yeah.

I don't think we have special access per se.

It's pretty much the same model.

We of course have our own post-training work that we do.

And you all can also, you know, fine-tune from the base model and so on.

I don't know that we can do this.

I don't know.

If you use our Gemma open source models, you could fine-tune.

Yeah.

So I don't think there's a special access per se, but a lot of the work for us is first defining these, oh, there needs to be a research plan.

And how do you go about presenting that?

And then a bunch of post-training to make sure, you know, it's able to do this consistently well and with high reliability.

Okay.

So 1.5 Pro with Deep Research is a special edition of 1.5 Pro.

Yes.

So it's not pure 1.5 Pro.

It's a post-training.

This also explains why you haven't just, you can't just toggle on 2.0 Flash and just, yeah.

Right.

Yeah.

But I mean, I assume you have the data and you know, it should be doable.

Yeah.

There's still this like question of ranking.

Yeah.

Right.

And like, oh, it looks like you're already done.

Yeah.

We're done.

Okay.

We can look at it.

Yeah.

So let's see.

It's put together this report, and what it's done is it's sort of started with milk regulation, and then it looks like it goes into meat probably further down, sort of covering how the US approaches this problem of how to regulate milk, comparing, and then, you know, covering the EU.

And then yeah, like I said, like going into the meat production.

And then it'll also, what's nice is it kind of reasons over like, why are there differences?

And I think what's really cool here is like, it's, it's showing that there's like a difference in philosophy between how the US and the EU regulate food.

So the EU like adopts a precautionary approach.

So even if there's inconclusive scientific evidence about something, it's still going to prefer to like ban it.

Whereas the US takes sort of the reactive approach where it's like allowing things until they can be proven to be harmful.

Right.

So like, this is kind of nice is that you, you also like get the second order insights from what it's being put, what it's putting together.

So yeah, it's, it's kind of nice.

It takes a few minutes to read and like understand everything, which makes for like a quiet period during a podcast, I suppose.

But, but yeah, this is, this is kind of how it, how it looks right now.

Yeah.

And then from here, you can kind of keep the usual chat and iterate thing.

So this is more, if you were to, you know, compare to other platforms, it's kind of like Anthropic Artifacts or like ChatGPT Canvas, where you have the document on one side and the chat on the other and you're working on it.

Yeah.

This is something we thought a bit about.

And one of the things we feel is like your learning journey shouldn't just stop after the first report.

And so actually what you probably want to do is while reading, be able to ask follow-up questions without having to scroll back and forth.

And there's like broadly a few different kinds of follow-up questions.

One type is like, maybe there's like a factoid that you want that isn't in here, but it's probably been already captured as part of the web browsing that it did.

Right.

So we actually keep everything in context, like all the sites that it's read remain in context.

So if there's a piece of missing information, it can just fetch that.

Then another kind is like, okay, this is nice, but you actually want to kick off more deep research.

Or like, I also want to compare the EU and Asia, let's say, in how they regulate milk and meat.

For that, you'd actually want the model to be like, okay, this is sufficiently different that I want to go do more deep research to answer this question.

I won't find this information in what I've already browsed.

And the third is actually maybe you just want to like change the report.

Like maybe you want to like condense it, remove sections, add sections, and actually like iterate on the report that you got.

So we broadly are basically trying to teach the model to be able to do all three.

And the kind of side-by-side format allows sort of for the user to do that more easily.

Yeah.

So as a PM, there's an "Open in Docs" button there, right?

How do you think about what you're supposed to build in here versus kind of, it sounds like the condensing and things should be a Google docs.

Yeah.

Bard extensions is different.

It's just like an amazing editor.

Like sometimes you just want to direct edit things.

And now Google docs also has Gemini in the side panel.

So the more we can kind of help this be part of your workflow throughout the rest of the Google ecosystem, the better, right?

Like, and one thing that we've noticed is people really like that button and really like exporting it.

It's also a nice way to just save it permanently.

And when you do export, all the citations carry over (in fact, I can just run it now), which is also really nice.

Gemini extensions is a different feature.

So that is really around Gemini being able to fetch content from other Google services in order to inform the answer.

So that was actually the first feature that we both worked on on the team is actually building extensions in Gemini.

And so I think right now we have a bunch of different Google apps, as well as I think Spotify and a couple, I don't know if we have and Samsung apps as well.

Who wants Spotify?

I have this whole thing about.

What's that?

In deep research, I think less.

But like the interesting thing is like we built extensions and we didn't, we weren't really sure how people were going to use it.

And a ton of people are doing really creative things with them.

And a ton of people are just doing things that they loved on the Google Assistant.

And Spotify is like a huge, like playing music on the go was like a huge value.

It controls Spotify.

It plays.

Yeah.

Deep research for deep research.

Yeah.

But otherwise, yeah, like you can, you can have Gemini go.

Yeah.

You have YouTube, Maps, and Search for Flash Thinking Experimental with apps.

The newest, longest model name that has been launched.

But like, yeah, I think Gmail is obvious one.

Calendar is obvious one.

Exactly.

Those I want.

Yeah.

Spotify.

Yeah.

And obviously feel free to dive in on your other work.

I know you're not just doing deep research, right?

But you know, we're just kind of focusing on deep research here.

I actually have asked for modifications after this first run where I was like, Oh, you stopped.

Like I actually want you to keep going.

Like one of these other things.

And then it continued to modify it.

So it really felt like a little bit of a co-pilot type experience, but more like an agent that would research.

I thought it was pretty cool.

Yeah.

One of the challenges is currently we kind of let the model decide based on your query, like amongst the three categories.

So some, there is, there is a boundary there.

Like some of these things, depending on how deep you want to go, you might just want a quick answer versus like kick off another deep research.

And even from a UX perspective, I think the panel allows for this notion of, you know, not every follow-up is going to take you like five minutes.

Right now, it doesn't do any follow-up.

Does it do follow-up search?

It always does.

It depends on your questions.

Since we have the liberty of really long context models, we actually hold all the research material across turns.

So if it's able to find the answer in things it's found, you're going to get a faster reply.

Yeah.

Otherwise it's just going to go back to planning and yeah.

Yeah.

A bit of a follow-up on, on the, since you brought up context, I had two questions.

One, do you have a HTML to markdown transform step or do you just consume raw HTML?

There's no way you can consume raw HTML, right?

We have both versions, right?

So there is the models are getting like every generation of models are getting much better at native, native understanding of these representations.

I think the markdown step definitely helps in terms of, you know, there's a lot of noise, as you can imagine, with the pure HTML and JavaScript.

Right.

Exactly.

So yeah, when it makes sense to do it, there's, we don't artificially try to make it hard for the model, but sometimes it depends on the kind of access of what we get as well.

Like, for example, if there's an embedded snippet, that's HTML, we want the model to, you know, be able to work on that as well.
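
To illustrate why a markdown step cuts noise, here is a minimal sketch using only Python's standard `html.parser`. The class `ReadableText`, the function `html_to_markdown`, and the sample page are hypothetical; the team's actual converter is not described in the conversation. The sketch drops script and style content and keeps headings, paragraphs, and list items as lightweight markdown.

```python
from html.parser import HTMLParser


class ReadableText(HTMLParser):
    """Collect visible text, skipping scripts and styles."""

    SKIP = {"script", "style", "noscript"}
    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}

    def __init__(self):
        super().__init__()
        self.lines = []
        self._skip_depth = 0
        self._prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.HEADINGS:
            self._prefix = self.HEADINGS[tag]
        elif tag == "li":
            self._prefix = "- "

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse whitespace; ignore anything inside a skipped element.
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.lines.append(self._prefix + text)
            self._prefix = ""


def html_to_markdown(html: str) -> str:
    parser = ReadableText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


# Made-up sample page, for illustration only.
page = """<html><head><style>p{color:red}</style>
<script>track("x");</script></head>
<body><h1>Milk rules</h1><p>Raw milk sales vary.</p>
<ul><li>US: state level</li><li>EU: member state level</li></ul></body></html>"""

print(html_to_markdown(page))
```

The tracking script and stylesheet never reach the model's context, which is the kind of noise reduction the markdown step is credited with here.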

Yeah.

And no vision yet, but currently no vision.

Yes.

The reason I asked all these things, cause I've done the same.

Like I haven't done vision.

Yeah.

So the tricky thing about vision is I think the models are getting significantly better, especially if you look at the last six months, natively being able to do like VQA stuff and so on.

But the challenge is the trade-off between having to, you know, actually render it and so on that the gap, the trade-off between the added latency versus the value add you get.

You have a latency budget of minutes.

Yeah.

It's true.

In my opinion, the places you'll see a real difference is like, like, I don't know, a small part of the tail, especially in like this kind of an open domain setting.

If you just look at what people ask, there's definitely some use cases where it makes a lot of sense to do it, but I still feel it's not in the head cases.

And we do it when we get there.

The classic is like, it's a JPEG that has some important information and you can't, you can't touch it.

Yeah.

Okay.

And then the other technical follow-up was just, you have 1 million to 2 million token context.

Has it ever exceeded 2 million?

And what do you do there?

Yeah.

So we had this challenge sometime last year where we said, when we started like wiring up this multi-turn where we said, Hey, let's see how long somebody in the team can take DR, you know?

Yeah.

What's the most challenging question you can ask that takes the longest?

Yeah.

No, we also keep asking follow-ups.

Like for example, here you could say, Hey, I also want to compare it with like how it's done.

Okay.

So you're guaranteed to bust it.

Yeah.

We also have, we have retrieval mechanisms if required.

So we natively try to use the context as much as it's available beyond which, you know, we have like a rag set up to figure out.

Okay.

This is all in-house tech?

Yes.

Okay.

Yes.

What are some of the differences between putting things in context versus rag?

And when I was in Singapore, I went to the Google.

Context versus rag.

Well, when I was in Singapore, I went to the Google cloud team and they talked about Gemini plus grounding is Gemini plus search, kind of like Gemini plus grounding, or like, how should people think about the differentiates of like I'm doing retrieval on data versus I'm using deep research versus I'm using grounding.

Sometimes the labels can be hard too.

Yeah.

I can, let me try to answer the first part of the question.

The second part, I'm not fully sure of, of the grounding offering.

So I can at least, at least talk about the first part of the question.

So I think you're asking like the difference between, like, when would you do RAG versus rely on the context?

I think we all, we all get that.

I was more curious, like from a product perspective, when you decide to do rag versus like this, you didn't need to, you know, do you get better performance?

Just putting everything in context?

The tricky thing for rag, it really works well because a lot of these things are doing like cosine distance, like a dot product kind of a thing.

And that kind of gets challenging when your query side has multiple different attributes.

The dot product doesn't really work as well.

I would say, at least for me, that's, that's my guiding principle on when to avoid rag.

That's one.

The second one is, I think every generation of these models are like the initial generations, even though they offered like long context, that performance as the context kept growing was, you would see some kind of a decline.

But I think as the newer generation models came out, they were really good, even if you kept filling in the context, at being able to pick out these really fine-grained pieces of information.

So I think these two, at least for me, are like guiding principles on when to.
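
One way the multi-attribute problem can show up, sketched with hand-made toy vectors rather than real embeddings. The axes, documents, and queries below are all assumptions chosen for illustration. With a single-attribute query, cosine similarity separates the documents clearly; with a query that mixes every attribute, the specific documents all tie and a vague overview scores highest.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Toy embeddings; axes are (milk, meat, US, EU).
specific_docs = {
    "us_milk": [1, 0, 1, 0],
    "eu_milk": [1, 0, 0, 1],
    "us_meat": [0, 1, 1, 0],
    "eu_meat": [0, 1, 0, 1],
}
overview_doc = [1, 1, 1, 1]  # mentions everything, says little

single_query = [1, 0, 1, 0]  # "US milk rules"
multi_query = [1, 1, 1, 1]   # "compare US and EU rules for milk and meat"


def spread(query):
    """Gap between the best and worst specific document for a query."""
    scores = [cosine(query, vec) for vec in specific_docs.values()]
    return max(scores) - min(scores)


print("single-attribute spread:", round(spread(single_query), 3))
print("multi-attribute spread:", round(spread(multi_query), 3))
print("overview vs multi query:", round(cosine(multi_query, overview_doc), 3))
```

Real embedding spaces are far higher dimensional, so this is only a cartoon of the effect, but it shows why a single dot product can stop being a useful ranking signal once the query has several facets.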

Just to add to that, I think like, just like a simple rule of thumb that we use is like, if it's the most recent set of research tasks where the user is likely to ask lots of follow-up questions, that should be in context.

But like, as stuff gets 10 tasks ago, you know, it's fine if that stuff is in rag because it's less likely that the user needs to do, you need to do like very complex comparisons between what's currently being discussed and the stuff that you asked about, you know, 10 turns ago, right?

So that's just like a, a very, like the rule of thumb that we follow.
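
The rule of thumb above can be sketched as a simple tiering function. Everything here is an assumption for illustration: the `ResearchTask` type, the `split_context` name, the `keep_recent` cutoff, and the token counts. The only detail taken from the conversation is the idea of a roughly 1M-token budget with older material moving to retrieval.

```python
from dataclasses import dataclass


@dataclass
class ResearchTask:
    turn: int    # position in the conversation, higher is more recent
    tokens: int  # size of the material gathered for this task


def split_context(tasks, token_budget, keep_recent=3):
    """Return (in_context, to_retrieval), newest tasks first.

    Recent tasks stay in the context window while they fit; once one
    task spills, every older task goes to the retrieval store as well.
    """
    in_context, to_retrieval = [], []
    used = 0
    spilled = False
    for task in sorted(tasks, key=lambda t: t.turn, reverse=True):
        fits = used + task.tokens <= token_budget
        if not spilled and len(in_context) < keep_recent and fits:
            in_context.append(task)
            used += task.tokens
        else:
            spilled = True
            to_retrieval.append(task)
    return in_context, to_retrieval


# Five tasks of 400k tokens each against a 1M-token budget (toy numbers).
tasks = [ResearchTask(turn=i, tokens=400_000) for i in range(1, 6)]
in_context, to_retrieval = split_context(tasks, token_budget=1_000_000)
print([t.turn for t in in_context], [t.turn for t in to_retrieval])
```

Spilling strictly by recency keeps the in-context set contiguous, which matches the stated intuition that follow-ups mostly concern the latest research tasks.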

And so from a user perspective, is it better to just start a new research instead of like extending the context?

Yeah, I think that's a good question.

I think if it's a related topic, I think there's benefit to continue with this thread because you could, the model, since it has this in memory, could figure out, oh, I've found this niche thing about, I don't know, milk regulation in this case in the US.

Let me check if your follow-up country or place also has something like that.

So these kinds of things you might have not caught if you started a new thread.

So I think it really depends on, on the use case.

If there's a natural progression and you feel like this is like part of one cohesive kind of a project, you should just continue using it.

If my follow-up is going to be like, oh, I'm just going to look for summer camps or something, then yeah, I don't think it should make a difference, but we haven't really, you know, pushed that to the end and tested that aspect of it.

For us, most of our tests are like more natural transitions.

How do you eval Deep Research?

Oh boy.

Yeah, this is a hard one.

I think the entropy of the output space is so high.

It's like, people love auto-raters, but it brings its own set of challenges.

And so for us, we have some metrics that we can auto-generate, right?

So for example, when we do post-training and have multiple models, we kind of want to track the distribution of certain stats, like, for example, how long is spent on planning, how many iterative steps it does on some dev set.

If you see large changes in distribution, that's kind of like an early signal that something has changed.

It could be for better or worse.

So we have some metrics like that, that we can auto-compute.
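A minimal sketch of that kind of auto-computed check, under stated assumptions: the stat is something like iterative steps per dev-set query, and "large change" is defined here as the mean moving by more than half a baseline standard deviation. The function name and the threshold are made up for illustration.

```python
from statistics import mean, pstdev


def drift_report(baseline, candidate, threshold=0.5):
    """Compare one behavioral stat between two model versions on the same dev set.

    The stat is flagged when the candidate mean moves by more than `threshold`
    baseline standard deviations (a hypothetical cutoff, in either direction).
    """
    base_mean = mean(baseline)
    cand_mean = mean(candidate)
    spread = pstdev(baseline) or 1.0  # fall back to 1.0 when the baseline is constant
    shift = (cand_mean - base_mean) / spread
    return {
        "baseline_mean": base_mean,
        "candidate_mean": cand_mean,
        "shift": shift,
        "flagged": abs(shift) > threshold,
    }


# Iterative planning steps per query for two hypothetical model versions.
old_steps = [5, 6, 5, 7, 6]
new_steps = [9, 10, 11, 9, 10]
report = drift_report(old_steps, new_steps)
```

A flag only says the behavior moved. As noted above, it could be for better or worse, so it is a prompt to look at examples rather than a pass/fail gate.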

So every time you have a new version, you run it across a test suite of cases and you see how long it takes.

Yeah.

So we have like a dev set and we have like some kind of automatic metrics that we can detect in terms of like the behavior end-to-end.

Like for example, how long is the research plan?

Like, does a new model produce really longer plans, many more steps?

Just a number of characters.

Like number of steps in case of the research plan.

In the plans, it could be like, like we spoke about how it iteratively plans based on like previous searches, how many steps does that go on and average over some dev set.

So there are some things like this you can automate, but beyond that, there are auto-raters, but we definitely do a lot of human evals.

And there we have defined, with product, certain things we care about and been super opinionated about: is it comprehensive?

Is it complete?

Like groundedness and these kind of things.

So it's a mix of these two attributes.

There's another challenge, but I love it.

Is this where the other challenge is that sometimes you just have to have your PM review examples?

Yeah.

Exactly.

Yeah.

And for later...

You're the human auto-rater.

The human rater.

But broadly what we tried to do is for the eval question is like, we tried to think about like, what are all the ways in which a person might use a feature like this?

And we came up with what we call an ontology of use cases.

And really what we tried to do is like stay away from like verticals like travel or shopping and things like that, but really try and go into like, what is the underlying research behavior type that a person is doing?

So there's queries on one end that are just, you're going very broad, but shallow, right?

Things like shopping queries are an example of that.

Or like, I want to find the perfect summer camp.

My kids love soccer and tennis.

And really you just want to find as many different options and explore all the different options that are available and then synthesize, okay, what's the TLDR about each one?

Kind of like those journeys where you open many, many Chrome tabs, but then like need to take notes somewhere of the stuff that's appealing.

On the other end of the spectrum, you know, you've got like a specific topic and you just want to go super deep on that and really, really understand that.

And there's like all sorts of points in the middle, right?

Around like, okay, I have a few options, but I want to compare them or like, yeah, I want to go not super deep on a topic, but I want to cover a slightly more topics.

And so we sort of developed this ontology of different research patterns.

And then for each one came up with queries that would fall within that.

And then that's sort of the eval set that we then run human evals on, to make sure we're doing well across the board on all of those.
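One way to picture an eval set organized by research pattern rather than by vertical is the sketch below. The pattern labels, the example queries other than the summer camp one, and the 1 to 5 rating scale are all illustrative assumptions, not the team's actual ontology.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalQuery:
    pattern: str  # hypothetical research-pattern label, not a vertical
    text: str


EVAL_SET = [
    EvalQuery("broad_shallow", "Find summer camps for kids who love soccer and tennis"),
    EvalQuery("compare_options", "Compare three heat pump models for a small house"),
    EvalQuery("narrow_deep", "Explain how milk regulation works in the US"),
]


def mean_score_by_pattern(ratings):
    """Average human ratings per pattern so weak patterns stand out.

    `ratings` is a list of (EvalQuery, score) pairs; the scale is an assumption.
    """
    buckets = defaultdict(list)
    for query, score in ratings:
        buckets[query.pattern].append(score)
    return {pattern: sum(scores) / len(scores) for pattern, scores in buckets.items()}


ratings = [(EVAL_SET[0], 4), (EVAL_SET[0], 2), (EVAL_SET[1], 2), (EVAL_SET[2], 5)]
by_pattern = mean_score_by_pattern(ratings)
weakest = min(by_pattern, key=by_pattern.get)
```

Grouping by pattern means a regression on, say, comparison-style queries shows up even when the overall average looks fine.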

Yeah.

You mentioned three things.

Is it literally three or is it three out of like 20 things?

How wide is the ontology?

I basically just told the, told the, the full set.

Yeah.

I told, no, no, no.

I told you the like extremes, right?

So like, okay.

Yeah.

And then we, we, we had like several, several midpoints.

So basically, yeah, going from like something super broad and shallow to something very specific and deep.

We weren't actually sure which end of the spectrum users are going to really resonate with.

And then on top of that, you have compounds of those, right?

So you can have things where you want to make a plan, right?

Like a great one is like, I'm going to plan a wedding in, you know, Lisbon and I, you know, I need you to help with like these 10 things, right?

And so that becomes like a project with research enabled, right?

And so then it needs to have research planners and venues and catering, right?

And so there's, there's sort of compounds of when you start combining these different underlying ontology types.

And so we also thought about that when we tried to put together our eval set.

What's the maximum conversation length that you allow or design for?

We don't have any hard limits on the, how many turns you can do.

One thing I will say is most users don't go very deep right now.

Yeah, it might just be that it takes a while to get comfortable.

And then over time you start pushing it further and further.

But like right now we don't see a ton of users.

I think the way that you visually present it suggests that you stop when the doc is created.

Right.

So the UI doesn't actually really encourage ongoing chats, as though it was like a project.

Right.

I think, I think there's definitely some things we can do on the UX side to basically invite the user to be like, Hey, this is the starting point.

Now let's keep going together.

Like where else would you like to explore?

So I think there's definitely some, some explorations we could do there.

I think the, in terms of sort of how deep, I don't know, we've seen people internally just really persist to quite, quite a ways.

The other thing I think will change with time is people kind of uncovering different ways to use Deep Research as well.

Like for the wedding planning thing, for example, it's not one of the, you know, first things that comes to mind when we tell people about this product.

So that's another thing I think as people explore and, and, and find that this can do these various different kinds of things.

Some of this can naturally lead to longer conversations.

And even for us, right?

When we dogfooded this, we saw people use it in like ways we hadn't really thought of before.

So that was because this was like a little new, like we didn't know, like, will users wait for five minutes?

What kind of tasks will, are they, you know, going to try for something like that takes five minutes.

So our primary goal was not to specialize in, you know, in a particular vertical or, or target one type of user.

We just wanted to put this in the hands of, like, we had like this busy parent persona and like various different user profiles, and see like what people try to use it for and learn more from that.

And how does the ontology of the DR use case tie back to like the Google main product use cases?

So you mentioned shopping as one ontology, right?

There's also Google shopping.

Yeah.

To me, this sounds like a much better way to do shopping than going on Google Shopping and looking at the wall of items.

How do you collaborate internally to figure out where it goes?

Yeah, that's a good question.

So when I meant like shopping, I sort of tried to boil down underneath what exactly is the behavior.

And that's really around like, I called it like options exploration.

Like you just want to be able to see and whether you're shopping for summer camps or shopping for a product or shopping for like scholarship opportunities, it's sort of the same action of just like I need to curate from a large, like I need to sift through a lot of information to curate a bunch of options for me.

So that's kind of what we tried to distill down rather than like thinking about it as a vertical.

But yeah, Google Search is like awesome.

If you want to have really fast answers, you've got high intent for like, I know exactly what I want.

And you want like super up to date information, right?

And I still do kind of like Google shop because it's like multimodal, you see the best prices and stuff like that.

I think creating a good shopping experience is hard, especially like when you need to look at the thing.

If I'm shopping for shoes, and like I don't want to use deep research, because I want to look at how the shoes look.

But if I'm shopping for like HVAC systems, great, like I don't care how it looks, or I don't even know what it's supposed to look like.

And I'm fine using deep research, because I really want to understand the specs and like, how exactly does this work and the voltage rating and stuff like that, right?

So like, and I need to also look at contractors who know how to install each HVAC system.

So I'd say like, where we really shine when it comes to shopping is that kind of end of the spectrum where it's more complex, and it matters less what it looks like. It's maybe less on the consumer side of shopping.

One thing I've also observed just about the, I guess, I guess the metrics or like the communication of what value you provide.

And also this goes into the latency budget, is that I think there's a perverse incentive for research agents to take longer and be perceived to be better. People are like, oh, you're searching like 70 websites for me, you know, but like 30 of them are irrelevant, you know. Like, I feel like right now, we're in kind of a honeymoon phase where you get a pass for all this, but being inefficient is actually good for you.

Because, you know, people just care about quantity and not quality.

Right?

So they're like, oh, this thing took an hour for me, like, it's doing so much work, like, or it's slow.

That was super counterintuitive for us.

So actually, the first time I realized that what you're saying is when I was talking to Jason Calacanis, and he was like, do you actually just make the answer in 10 seconds and just make me wait for the balance?

Yeah, which we hadn't expected, that people would actually value the like work that it's putting in, because we were actually worried about it.

We were really worried about it.

We were like, I remember we actually built two versions of deep research.

We had like a hardcore mode that takes like 15 minutes.

And then what we actually shipped is a thing that takes five minutes.

And I even went to eng and I was like, there has to be a hard stop, by the way, it can never take more than 10 minutes.

Yep.

Because I think at that point, like users will just drop off.

Nope.

But what's been surprising is like, that's not the case at all.

And it's been going the other way.

Because when we worked on Assistant, at least, and other Google products, the metric has always been if you improve latency, like all the other metrics go up, like satisfaction goes up, retention goes up, all of that, right?

And so when we pitched this, it's like, hold on, in contrast to like all Google orthodoxy, we're actually going to slow everything right down.

And we're going to hope that like users are still...

On purpose.

Yeah, not on purpose.

Yeah, I think it comes down to the trade-off, like, what are you getting in return for the wait.

And from an engineering slash modeling perspective, it's just trading off inference compute and time to do two things, right?

Either to explore more to be like more complete, or to verify more on things that you probably know already.

And since it's like a spectrum, and we don't claim to have found the perfect spot, we had to start somewhere. There's probably some cases where you actually care about verifying more than the others. In an ideal world, based on the query and conversation history, you know what that is.

So I think, yeah, it basically boils down to these three things from a user perspective, am I getting the right value add from an engineering slash modeling perspective, are we using the compute to either explore effectively, and also verify and go in depth for things that are vague or uncertain in the initial steps.

The other point about the more number of websites, I think, again, it comes with a trade off, like, sometimes you want to explore more early on before you kind of narrow down on either the sources or the topics you want to go deep.

So if you look at the way, at least for most queries, Deep Research works here, initially it'll go broad. If you look at the kinds of websites, it's trying to explore all the different topics that we mentioned in the research plan.

And then you would see choices of websites getting a little bit narrower on a particular topic, or a particular entity that it has come across and so on.

So that's roughly how the number kind of fluctuates.

So we don't do anything deliberate to either keep it low or, you know, try to...

Would it be interesting to have an explicit toggle for amount of verification versus amount of search?

I think so.

I think like users would always just hit that toggle.

I think I worry that like max everything.

Yeah, if you like give a max power button, users are always just going to hit that button.

Right.

So then the question comes like, why don't you just decide from the product POV, where's the right, where's the right balance?

I think either Anthropic or OpenAI has a preview of this model routing feature, where you can choose intelligence, cheapness and speed.

And but then they're all zero to one values.

So then you just choose one for everything.

Right.

Obviously, they're gonna like do a normalization number thing.

But users are always going to want one right now.

We've discussed this a bit. Like, if I wear my pure user hat, I don't want to set anything. Like, I come with a query, you figure it out.

Like, sometimes I feel like there will be based on the query.

Like, for example, right?

If I'm asking about, hey, how do rising rates from the Fed affect household income for the middle class?

And how has it traditionally happened?

These kind of things you want to be very accurate, and you want to be very precise on historical trends of this and so on and so on.

Whereas there is a little bit more leeway when you're saying, hey, I'm trying to find businesses near me to go celebrate my birthday or something like that.

So in an ideal world, we'd kind of figure out that trade-off based on the conversation history and the topic.

I don't think we are there yet as a research community.

And it's an interesting challenge by itself.

So this reminds me a little bit of the NotebookLM approach. We also asked this thing to Raiza.

And she was like, yeah, just people want to click a button and see magic.

Yeah, like you said, you just hit start every time, right?

Most people don't even okay.

My feedback on this, if you want feedback, is that I am still kind of a champion for Devin, in a sense that Devin will show you the plan while it's working the plan.

And you can say like, hey, the plan is wrong.

And I can chat with it while it's still working.

And it will live update the plan and then, you know, pick off the next item on the plan.

I think it's static, right?

Like, while you're working on a plan, I cannot chat.

It's just normal.

Bolt also has this, like, you know, that's the most default experience.

But I think you should never lock the chat.

You should always be able to chat with the plan and update the plan and the plan scheduler, whatever orchestration system you have under the hood should just pick off the next job on the list.

That'll be my two cents.
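The idea of a plan that stays editable while the scheduler keeps picking off the next job can be sketched roughly as follows. `SteerablePlan` and its methods are hypothetical names for this illustration, and the example is single-threaded. A real system would apply user edits concurrently with running steps.

```python
from collections import deque


class SteerablePlan:
    """A research plan whose pending steps can be edited while earlier steps run."""

    def __init__(self, steps):
        self.pending = deque(steps)
        self.started = []

    def next_step(self):
        """Hand the scheduler the next pending step, or None when nothing is left."""
        if not self.pending:
            return None
        step = self.pending.popleft()
        self.started.append(step)
        return step

    def add(self, step, front=False):
        """User edit: queue a new step, optionally ahead of everything else."""
        if front:
            self.pending.appendleft(step)
        else:
            self.pending.append(step)

    def drop(self, step):
        """User edit: remove a step that has not started yet."""
        if step in self.pending:
            self.pending.remove(step)
            return True
        return False


plan = SteerablePlan(["search venues", "search caterers", "search florists"])
plan.next_step()                     # scheduler starts "search venues"
plan.drop("search florists")         # user: "skip florists"
plan.add("search photographers")     # user: "also look at photographers"
remaining = list(plan.pending)
```

Steps that have already started are left alone in this sketch. Whether a running step should be cancellable is a separate design decision.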

Especially if we spend more time researching, right?

Because like right now, if you watched that query we just did, it was done within a few minutes.

So your chance, your opportunity to chime in was actually like, or it left the research phase after a few minutes.

So your opportunity to chime in and steer was less.

But especially imagine you could imagine a world where these things take an hour, right?

And you're doing something really complicated.

Then yeah, like your intern would totally come check in with you.

Be like, here's what I found.

Here's like some hiccups I'm running into the plan.

Give me some steer on how to change that or how to change direction.

And you would do that with them.

So I totally would see, especially as these tasks get longer, we actually want the user to come engage way more to like create a good output.

I guess Devin had to do this because some of these jobs like take hours.

Right.

So yeah.

And it's perverse incentives, where they charge by the hour.

Oh, so they make more money the slower they are.

Interesting.

Have we thought about that?

I'm calling this out because everyone is like, oh my God, it does hours of work autonomously for me.

And then they are like, okay, it's good.

But like, this is a honeymoon phase.

Like at some point we're going to say like, okay, but you know, it's very slow.

Yeah.

Anything else that like, I mean, obviously within Google, you have a lot of other initiatives.

I'm sure you like sit close to the NotebookLM team. Any learnings that are coming from shipping AI products in general?

They're really awesome people.

Like they're really nice, friendly, thoughtful, just like as people. I'm sure you met and you like realized this with Raiza and stuff.

So like they've actually been really, really cool collaborators or just like people to bounce ideas off.

I think one thing I found really inspiring is they just picked a problem, and hindsight's 20/20, but like in advance, just like, hey, we just want to build like the perfect IDE for you to do work and like be able to upload documents and ask questions about it and just make that really, really good.

And I think we were definitely really inspired by their ability, their vision of just like, let's pick up a simple problem, really go after it, do it really, really well and have be opinionated about how it should work and just hope that users also resonate with that.

And that's definitely something that we tried to learn from separately.

They've also been really good at, you know, and maybe Mukund, you want to chime in here, just extracting the most out of Gemini 1.5 Pro and they were really friendly about just like sharing the ideas about how to do that.

Yeah, I think, I think you, you learn a bit like when you're trying to do the last, last mile of, of these products and then pitfalls of any, any given model and so on.

So yeah, we definitely have a healthy relationship and then we're doing the same for other.

You'll never merge, right?

It's just different teams.

They are different teams.

So they're in like labs as an organization that the mission of that is to really explore kind of different bets and explore what's possible.

Even though I think there's a paid plan for NotebookLM now.

Yeah.

So I think it's the same plan as us actually.

So it's like, it's more than just the labs is what I'm saying.

It's more than just labs.

Cause I mean, yeah, ideally you want things to graduate and stick around, but hopefully one thing we've done is like not created different SKUs, but just being like, hey, if you pay the Gemini 1, yeah, whatever you get, you get everything.

What about learning from others?

Obviously, I mean, OpenAI's Deep Research, literally, that's the same name.

I'm sure there's a lot of contention.

Is there anything you've learned from other people trying to build similar tools?

Like do you have opinions on maybe what people are getting wrong that they should do differently?

It seems like from the outside, a lot of these products look the same.

Ask for research, get back to research.

But obviously when you're building them, you understand the nuances a lot more.

When we built Deep Research, I think we took a few different bets around how it should work.

And what's nice is some of that, we feel like, actually was the right way to go.

So we felt like agents should be transparent around telling you upfront, especially if they're going to take some time, what they're going to do.

So that's really where that research plan comes in; we show that in a card.

We really wanted to be very publisher forward in this product.

So while it's browsing, we wanted to show you like all the websites it's reading in real time, make it super easy for you to like double click into those while it's browsing.

And the third thing is, you know, putting it into a side-by-side artifact so that it's ideally easy for you to read and ask at the same time.

And what's nice is you kind of, as other products come around, you see some of these ideas also appearing in, in other iterations of this product.

So I definitely see this as a space where like everyone in the industry is learning from each other.

Good ideas get reproduced and built upon.

And so yeah, we'll definitely keep iterating and kind of following our users and seeing how we can make our feature better.

But yeah, I think this is the way the industry works: everyone's going to kind of see good ideas and want to replicate and build off of it.

And on the model side, OpenAI's is the o3 model, which is not available through the API, the full one.

Have you tried already with the 2.0 model?

Like, is it a big jump?

Or is a lot of the work on the post training?

Yeah, I would say stay tuned.

Definitely, it currently is running on 1.5.

The new generation models, especially with these thinking models, they unlock a few things.

So I think one is obviously the better capability in like analytical thinking, like in math, coding and these type of things.

But also this notion of, you know, as they produce thoughts and think before taking actions, they kind of inherently have this notion of being able to critique them, the partial steps that they take, and so on.

So yeah, we're definitely exploring multiple different options to make better value for our users as we iterate.

Yeah, yeah.

I feel like there's a little bit of a conflation of inference time compute here, in the sense of like, one, you can inference time compute within the model, the thinking model, right?

And then two, you can inference time compute by searching and reasoning.

Yeah, doing more iterative.

I wonder if that gets in the way. Like, presumably you've tested thinking plus Deep Research: does the thinking actually do a little bit of verification, so maybe saves you some time, or does it try to draw too much from its internal knowledge, and therefore search less? You know, like, do they step on each other?

Yeah, no, I think that's a really nice callout.

And this also goes back to the kind of use case.

The reason I bring that up is, there are certain things that I can tell you from model memory last year, the Fed did x number of updates and so on.

But unless I sourced it, it's going to be hallucinating.

Yeah, like, one is the hallucination, or even if I got it right, as a user, I'd be very wary of that number, unless I'm able to like, source the dot gov website for it and so on, right.

So that's another challenge.

Like, there are things that you might not optimally spend time verifying, even though the model's like, this is a very common fact, the model already knows it, and it's able to reason over it. And balancing that out between trying to leverage the model memory versus being able to ground this in, you know, some kind of a source is the challenging part.

And I think as, as like, you rightly called out with the thinking models, this is even more pronounced because the models know more, they're able to like, draw second order insights more just by reasoning over technically, they don't know more, they just use their internal knowledge more.

Right?

Yes.

But also, like, for example, things like math, I see they've been they've been post trained to do better math.

Yeah, I think they just they probably do way better job and in like, in that sense.

Yeah, I mean, obviously, reasoning is a topic of huge interest.

And people want to know what the engineering best practices are. Like, we think we know, you know, how to prompt them better.

But engineering with them, I think also very, very unknown.

Again, you guys are going to be the first to figure it out.

Yeah, definitely interesting times.

And yeah, no pressure.

While we're on that sort of technical elements and technical bent, I'm interested in like other parts of the Deep Research tech stack that might be worth calling out. Any hard problems that you solved?

Just more generally?

Yeah, I think the iterative planning one to do it in a generalizable way.

Yeah, that was the thing I was most wary about.

Like, you don't want to go down the route of being able to teach how to plan iteratively per domain or like per type of problem.

Like, even going back to the ontology, if you had to teach the model, for every single type in the ontology, how to come up with these traces of planning, that would have been nightmarish.

So trying to do that in a super data efficient way by leveraging a lot of things like model memory. As well as, there's this very tricky balance when you work on the product side of any of these models, which is knowing how to post-train it just enough without losing things that it knows from pre-training, basically not overfitting in the most trivial sense, I guess.

But yeah, so the techniques there, data augmentations there and multiple experiments to tune this trade off.

I think that's, that's one of the challenges.

On the orchestration side, this is basically you're spinning up a job.

I'm an orchestration nerd.

So how do you do that?

Is it like some internal tool?

Yeah, so we built this asynchronous platform for Deep Research, which is basically, like, most of our interactions before this were sync in nature, you know, all chat was sync.

Exactly.

And now now you can leave the chat and come back.

Exactly.

And close your computer.

And now it's on Android.

And yeah, rolling out.

So we switch it on sometimes.

Okay, you're reminding him, right?

Yeah, we ramped on all Android phones.

And then iOS is this week.

But yeah, what's neat though, is like you can close your computer, you get a notification, and so on.

So it's some kind of sync engine that you need.

Yes, yes.

So one is this notion of asynchronicity and the user being able to leave.

But also, if you build like five, six minute jobs, there are bound to be failures, and you don't want to lose your progress, and so on.

So this notion of like keeping state knowing what to retry and kind of keep the journey going.
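To make "keeping state and knowing what to retry" concrete, here is a minimal sketch of a checkpointed job runner. It is not the team's platform, which has no public name or API. `run_job`, the JSON state file, and the fixed retry count are assumptions chosen to keep the example small.

```python
import json
import os


def run_job(steps, state_path, max_attempts=3):
    """Run named steps in order, persisting each result so a restart can resume.

    `steps` is a list of (name, callable) pairs. Results must be JSON-serializable.
    A step that keeps failing after `max_attempts` tries re-raises its last error.
    """
    state = {}
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)  # results saved by an earlier, interrupted run

    for name, fn in steps:
        if name in state:
            continue  # already completed, do not redo the work
        for attempt in range(1, max_attempts + 1):
            try:
                state[name] = fn()
                break
            except Exception:
                if attempt == max_attempts:
                    raise
        with open(state_path, "w") as f:
            json.dump(state, f)  # checkpoint after every completed step

    return state
```

If the process dies halfway through, calling `run_job` again with the same `state_path` skips the finished steps and retries only what is left. A production system would also want atomic writes, backoff between attempts, and a distinction between retryable and fatal errors.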

Is there a public name for this?

Or?

No, I don't think there's a public name for this.

Yeah, right.

Is there any sort of, like, this is a Spark job, or, you know, it's like a Ray, you know, thing or whatever? In the old Google days it might be like MapReduce, or, you know, whatever.

But like, this is a different scale and nature of work.

Yeah, then those things.

So we just I'm trying to find a name for this.

And right now, we can name it now.

What the classic because I used to work in this area, this is workflows, this sort of durable, like back when you were in the world.

So Apache Airflow, Temporal. You guys were both at Amazon, by the way; AWS Step Functions would be one of those where you define a graph of execution, but Step Functions are more static, and would not be as able to accommodate Deep Research style back ends.

What's neat, though, is we built this to be like, quite flexible.

So like, you can imagine once you start doing hour or multi-day jobs, like, yeah, you have to model what the agent wants to do.

Exactly.

And but also like, ensure like, it's stable, you know, for like hundreds of LLM calls.

Yeah, it's boring.

But like, you know, this is the thing that makes it run autonomously, you know, yeah, so like, it's, yeah, anyway, I'm excited about it.

Just to close out the OpenAI thing, I would say OpenAI easily beat you on marketing.

And I think it's because you don't launch with benchmarks.

And my question to you is, should you care about benchmarks?

Should you care about Humanity's Last Exam or, not MMLU, but whatever they are?

I think benchmarks are great.

The thing we wanted to avoid is like, the day Kobe Bryant entered the league, who was the president's nephew and like weird like, big Kobe friend, okay, just like these like weird things that like nobody talks that way.

So like, why would we over solve for like, some sort of a benchmark that doesn't necessarily represent the product experience we want to build? Nevertheless, like benchmarks are great for the industry, and like rally a community and help us like understand where we're at.

I don't know.

Do you have any?

No, I think you kind of hit the points.

I think the for us, our primary goal is like solving the deep research user value for the user use case, the benchmarks, at least the ones that we are seeing, they don't directly translate to the product.

There's definitely some technical challenges that you can benchmark against, but they don't really, like, if I do great on HLE, that doesn't really mean I'm a great deep researcher.

So you want to avoid going into that rabbit hole a bit.

But we also feel like benchmarks are great, especially in the whole gen AI space with like models coming every other day and everybody claiming to be, like, it's tricky.

The other big challenge with benchmarks, especially when it comes to like the models these days is the output space entropy is like everything is like text.

And so there's a notion of verifying, even if you got the right answer, different labs do it in like different ways.

And but we all compare numbers.

So there's a lot of, you know, art slash figuring out like how you verify this or how you run this on a level playing field.

But yeah, so I think there are trade-offs. There's definitely value to doing benchmarks, but at the same time, we also...

From a selfish PM perspective, benchmarks are a really great way to motivate researchers, like make number go up, exactly, or just like prove you're the best.

Like it's like a really good way of like rallying the researchers within your company.

Like I used to work on the MLPerf benchmarks and like that was like, yeah, you'd put like a bunch of engineers in a room and in a few days they do like amazing performance improvements on our TPU stack and things like that.

Right.

So just like having a competitive nature and a pressure like really motivates people.

There's one benchmark that is impossible to benchmark, but I just want to leave you with it, which is that deep research, most people are chasing this idea of discovering new ideas and deep research right now will summarize the web in a way that you know, is much more readable, but it will, you know, what will it take to discover new things from the things that you've searched?

First, I think the thinking style models definitely help you because they are significantly better on how they reason natively and being able to, you know, draw these second order insights, which is like very premise.

Like if you can't do that, you can't think of doing what you mentioned.

So that's, that's one step in the other thing is I think it also depends on the domain.

So sometimes you can riff with a model for like new hypotheses, but depending on the domain, you might not be able to verify that hypothesis, right?

So like coding math, there are reasonably good tools that the model already knows to interact with, and you can run a verifier test the hypothesis and so on.

Like even if you think about it from a purely agent perspective saying, Hey, I have this hypothesis in this area, go figure out and come back to me, right?

But let's say you're a chemist, right?

So what are you going to do that we don't have like synthetic environments yet, where the model is able to verify these hypothesis by playing in a playground and have this like a very accurate verifier or a reward signal.

Computer use is another one, where both in open source and in research there are nice playgrounds coming up.

So if you're talking about truly being able to come up with new things, my personal opinion is the model has to do the second-order thinking that we're seeing now with these new models, but also be able to play and test that out in an environment where you can verify and give it feedback so that it can continue training.

Yeah.

So basically like code sandboxes for now.

Yeah.

So in those kinds of cases, I think, yeah, it's a little bit easier to envision this end to end, but not for all domains or physics engines.
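
Editor's sketch: the propose-and-verify loop discussed above can be illustrated in a few lines of Python. This is a minimal toy under stated assumptions, not how Gemini Deep Research works: the names `propose_and_verify` and `is_root` are hypothetical, the "hypotheses" are just integers, and the verifier is an exact check that only exists because the domain is math.

```python
from typing import Callable, Iterable, Optional, Tuple


def propose_and_verify(
    candidates: Iterable[int],
    verifier: Callable[[int], bool],
) -> Tuple[Optional[int], int]:
    """Try each candidate hypothesis in order.

    Returns the first candidate the verifier accepts (or None if none pass),
    together with the number of candidates that were tested.
    """
    attempts = 0
    for candidate in candidates:
        attempts += 1
        if verifier(candidate):
            return candidate, attempts
    return None, attempts


def is_root(x: int) -> bool:
    """Hypothetical verifier: is x a root of x^2 - 5x + 6?"""
    return x * x - 5 * x + 6 == 0


# In math or code the verifier is cheap and exact, so the loop can close itself.
found, tries = propose_and_verify(range(10), is_root)
print(found, tries)
```

The point made in the conversation is that for a domain like chemistry there is no equivalent of `is_root` yet, so the loop has nothing reliable to call.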

Yeah.

So if you think about agents more broadly, there's like a lot of things that go into it.

What do you think are like the most valuable pieces that people should be spending time on?

Things that come to mind, that I'm seeing a lot of early-stage companies work on, are memory, you know, and we already touched on evals.

We touched a little bit on tool calling.

There's kind of the auth piece, like should this agent be able to access this?

If yes, how do you verify that?

What are things that you want more people to work on that would be helpful to you?

I can take a stab at this from the lens of like deep research, right?

Like I think some of the things that we're really interested in, in how we can push this agent: one, similar to memory, is personalization, right?

Like if I'm giving you a research report, the way I would give it to you if you're a 15-year-old in high school should be totally different to the way I give it to you if you're a PhD or postdoc, right?

You can prompt it.

Right.

But the second thing though is like, it should like ideally know where you're at and like everything you know up to that point, right?

And kind of further customize, right?

Have this understanding of where you are in your learning journey.

I think modality will be also really interesting.

Like right now we're, we're text in, text out.

We should go multimodal in, right?

But also multimodal out, right?

Like I would love if my reports are not just text, but like charts, maps, images, like make it super interactive and multimodal, right?

And optimized for the type of consumption, right?

So the way in which I might put together an academic paper should be totally different to the way I'm trying to do like a learning program for a, for a kid, right?

And just the way it's structured.

Ideally, like you want to do things with generative UI and things like that to really customize reports.

I think those are definitely things that I'm personally interested in when it comes to a research agent.

I think the other part that's super important is that we will reach the limits of the open web. A lot of the things that people care about are in their own documents, their own corpuses, things within subscriptions that they personally really care about, right?

Like, especially as you go more niche into specific industries.

And ideally you want ways for people to be able to complement their deep research experience with that content in order to further customize their answers.

There's two answers to this.

So one is, in terms of the approach for us, at least for me, it's rather about trying to figure out the core mission for an agent and building that. I feel like it's still early days to try to platformatize, or to say, oh, there are these five horizontal pieces and you can plug and play and build your own agent.

My personal opinion is we are not there yet.

In order to build a super engaging agent, I would, if I were to start thinking of a new idea, I would start from the idea and try to just do that one thing really well.

Yes, at some point, there will be a time where like these common pieces can be pulled out and platformatized.

I know there's a lot of work across companies and in the open source community about providing these tools to really build agents very easily.

I think those are super useful to start building agents, but at some point, once those tools enable you to build the basic layers, I think me as an individual would try to focus on curating one experience before going super broad.

Yeah, we had Bret Taylor from Sierra and he said they mostly built everything in-house.

Building everything in-house, which is very sad for VCs.

The next great framework and tooling and all that.

But the space is moving so fast, like the problem I described might be often six months from now and I don't know.

We'll fix it with one more LLM ops platform.

Yes.

Okay, so just the final point, just plugging your talk, people will be hearing this before your talk.

What are you going to talk about?

What are you looking forward to in New York?

I would love to actually learn from you guys.

What would you like us to talk about?

Now that we've had this conversation with you, what do you think people would find most interesting?

I think a little bit of implementation and a little bit of vision, 50/50 and I think both of you can sort of fill those roles very well.

Everyone looks at you, you're a very polished Google product and I think Google always does polish very well.

But everyone will want deep research for their industry.

He's invested in deep research for finance and they focus on their thing.

And there will be deep research for everything.

You have created a category here that OpenAI has cloned.

And so let's talk about what the hard problems are in this brand of agent, which is probably the first real product-market-fit agent, I will say, more so than the computer use ones.

This is the one where people will easily pay $200 a month for it, probably $2,000 once you get it really good.

So I'm like, okay, let's talk about how to do this right from the people who did it.

And then where is this going?

So yeah, it's very simple.

I'm pretty sure I want that.

Yeah, thank you.

For me as well, I'm also curious to see you interact with the other speakers, because then there will be other sort of agent problems.

I'm very interested in personalization, very interested in memory.

I think those are related problems, planning, orchestration, all those things.

Auth and security, something that we haven't talked about.

There's a lot of the web that's behind auth walls.

How do I delegate to you my credentials so that you can go and search the things that I have access to?

I don't think it's that hard.

It's just people have to get their protocols together.

And that's what conferences like this are hopefully meant to achieve.

Yeah, no, I'm super excited.

I think for us, we often live and breathe within Google, which is a really big place, but it's really nice to take a step back and meet people approaching this problem at other companies or in totally different industries.

Inevitably, at least where we work, we're in a very consumer-focused space.

I see.

Right?

Yeah, I'm more B2B.

It's also really great to understand, okay, what's going on within the B2B space and within different verticals.

Yeah, the first thing they want to do is deep research for my own docs, my company docs.

Obviously, you're going to get asked for that.

Yeah, I mean, there'll be more to discuss.

I'm really looking forward to your talk.

And thanks for joining us.

Yeah, thanks for having us.

Thanks so much, guys.