Latent Space · 2025-06-02

Google I/O Recap: Gemini 2.5 Pro, Live API & Realtime Voice AI

Hosts: Sam Charrington, swyx

Guests: Logan Kilpatrick, Shrestha Basu Mallick, Kwindla Hultman Kramer

Gemini 2.5 Pro · Live API · Realtime voice AI · Native audio-to-audio models · Gemini diffusion · Implicit context caching · URL context / research agents · Daily.co and Pipecat · Voice activity detection · Speaker diarization · WebRTC vs WebSockets · Google I/O 2025

Why it matters

Implicit context caching is now active for 2.5 Pro, so cost savings happen automatically without developers opting in.

Key claims

  • Thinking budgets arriving on Gemini 2.5 Pro (expected early June), with an option to disable reasoning and run it as a raw, non-reasoning model; thought summaries are live now.
  • Implicit context caching is now active for 2.5 Pro — cost savings happen automatically without developers opting in — alongside continued support for explicit caching.
  • Native audio output in the Live API is a marquee I/O release; it can interleave languages fluidly (Shrestha demos Bengali/English; Mat Velloso demoed Klingon) and now powers a native audio-to-audio architecture distinct from the cascaded pipeline.
  • New URL context tool lets developers retrieve in-depth web page content (alone or paired with search) while respecting publisher controls, unlocking research-agent use cases.

Episode summary

Summary

Recorded on-site at Google I/O, this Latent Space preview episode features Logan Kilpatrick (Google DeepMind), Shrestha Basu Mallick (PM lead for the Gemini API side), and Kwindla Hultman Kramer (Daily.co / Pipecat). The trio walks through their personal highlights from the announcements, framing I/O as a developer-experience push on top of the Gemini model family. Logan singles out thinking budgets coming to 2.5 Pro (with the ability to disable reasoning entirely) and the rollout of thought summaries as the most consequential developer-facing changes, while Shrestha highlights native audio output and the new URL context tool for building research agents.

A central thread is the real-time Live API: session length, tool calling, voice activity detection, and the shift from cascaded architectures (audio-in, text model, TTS-out) to a native audio-to-audio model. Shrestha details knobs for tuning VAD sensitivity and prefix padding, the latency target in the 500–700ms range, and the just-released proactive audio feature that semantically ignores irrelevant speech. Kwindla complements this with the framework perspective from Daily/Pipecat, explaining how voice-specific infrastructure (turn detection, context management, WebSockets vs. WebRTC) co-evolves with the underlying APIs, and noting that features like semantic VAD and speaker diarization (an unofficial but working capability of the 'native audio dialogue' model) are now migrating down into the model itself. Asynchronous function calling, already live on the cascaded architecture, is teased for native audio next.

The conversation closes on Google's model strategy. Logan relays a recent conversation with DeepMind CTO Koray Kavukcuoglu: the north star is 'one model, Gemini,' with research forks (reasoning, diffusion, Imagen) eventually folded back in rather than splintered into separate products. The reasoning work culminating in 2.5 Pro is cited as proof — video understanding improved as a side effect of merging capabilities, not from bespoke work. Looking ahead, Shrestha hopes for Gemini 3.0 and broader language coverage (officially 24, with unofficial Klingon/Bengali support already working), while Kwindla's wishlist from the builder community is more languages and more capabilities consolidated into the main model.

  • Live API challenges and progress: session length scaled beyond the original ~15–20 min audio / ~5 min video limits via developer knobs; VAD sensitivity and prefix padding are now tunable; latency target is 500–700ms.
  • Proactive audio ('semantic VAD') on the native audio model refuses to respond to irrelevant background speech; speaker identification/diarization works experimentally on the 'native audio dialogue' model; asynchronous function calling launched on the cascaded architecture.
  • DeepMind's strategic north star (per CTO Koray Kavukcuoglu) is 'one model, Gemini' — research forks for reasoning, Imagen, and diffusion are brought back into the mainline rather than shipped as separate product lines.
  • Kwindla Hultman Kramer (Daily / Pipecat) describes the live voice stack as a WebSockets-vs-WebRTC and packet-routing problem requiring human-conversation-level latency, and credits Daily as a launch partner whose feedback shaped the Live API.

Source material

Transcript

Hey, I'm Sam Charrington.

Welcome to another episode of the TWIML AI Podcast.

And I'm swyx.

This is a special episode of the Latent Space pod with TWIML at Google I/O.

Welcome.

Thanks for being here.

Thanks for hanging out with us.

I'm excited.

Logan, you were our first guest.

You came back remotely a few months ago.

And now you're back here.

A lot of the face of the AI studio, basically, that a lot of people are using.

I'm using it.

And I think it's a really welcome change for people being more accessible with the rest of the Google suite.

And Shrestha, you've been...

I just don't super know your role.

I just generally have you pegged as PM of the API team with a particular focus on Live.

Shrestha runs the show behind the scenes.

Is the latest public face running the show behind the scenes, model launches, the live API, generally all the stuff that's happening in the API is...

Shrestha's hard work.

Thank you for that, Logan.

But I think everyone knows who really runs the show.

There's public evidence there.

But yeah, I worked with Logan and a few other excellent PMs, but I lead the API side of the house.

There's a lot of announcements.

I think a lot of people have done their recaps.

What are you guys' personal highlights over I/O?

I'll break the rule and I'll give two that are not the big, big flashy ones.

I think the two that I think developers are going to be super excited about.

One, thinking budgets coming to 2.5 Pro.

And you'll be able to disable thinking as well.

So if you just want 2.5 Pro as a raw, non-reasoning model, we'll have that hopefully in early June.

And then thought summaries.

We've had this debate internally about do we need to show full thoughts?

Do developers want full thoughts?

I think developers say they want full thoughts.

We have thought summaries right now as a sort of step in that direction.

It'll be really interesting to find out and get the feedback around what are things that work with thought summaries?

What are the things that don't work with thought summaries?

I was reading some threads last night about how thought summaries are now live in Cursor as well.

And people were sort of reacting to having summaries versus not full thoughts.

So it'll be interesting to see, but I'm excited for both of those things.

Thought summaries are live now.

Thinking budget for 2.5 Pro will land with the GA model in a couple of weeks.

Yeah.

And I should say we already do have thinking budgets in 2.5 Flash.

I do think with all of the other features that we are releasing on top of our thinking models, summaries, budgets, I think this is our way of you have the models, but then we want to give developers as much control as they can on top of models.
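A minimal sketch of the two controls discussed here, a thinking budget and thought summaries, expressed as a plain request-config dictionary. The field names (`thinkingConfig`, `thinkingBudget`, `includeThoughts`) and the convention that a budget of 0 disables reasoning are assumptions made for illustration; the helper `build_generation_config` is hypothetical and sends nothing over the network.

```python
from typing import Optional


def build_generation_config(thinking_budget: Optional[int],
                            include_summaries: bool) -> dict:
    """Build a request-config dict; field names are illustrative assumptions."""
    thinking = {"includeThoughts": include_summaries}
    if thinking_budget is not None:
        if thinking_budget < 0:
            raise ValueError("thinking_budget must be >= 0")
        # Assumption: a budget of 0 means "run without reasoning".
        thinking["thinkingBudget"] = thinking_budget
    return {"generationConfig": {"thinkingConfig": thinking}}


# Run as a raw, non-reasoning model (assumed meaning of budget 0).
no_reasoning = build_generation_config(thinking_budget=0, include_summaries=False)

# Cap reasoning at 1024 tokens and ask for thought summaries.
capped = build_generation_config(thinking_budget=1024, include_summaries=True)
```

Passing `None` leaves the budget unset, so the model's default behavior would apply.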

But coming back to your question about my favorite feature, it's really hard to pick because all of these features we've been trying to push out for weeks.

But I think native audio output is...

I was just saying that with Kwin.

Yeah.

Yeah.

Yeah.

It's a personal highlight.

I actually, Kwin and I have been playing with it together for a bit as well.

I think especially with all the, obviously the voices sound great.

The fact that it can switch in and out of languages.

So Mat Velloso, our boss, actually has a demo on Twitter where it actually speaks Klingon, even though that's not an officially supported language.

But I speak Bengali, just being able to, for it to switch into and out of Bengali and English, that's been special.

And then if I get to pick another one, it's, I'd say we released a new tool called URL context.

And the idea is that you can use it by yourself or pair it with search to retrieve more in-depth information from web pages in a way that's respectful of our publisher ecosystem, of course.

And I think this will unlock new use cases, like if people want to build their own version of a research agent, which is something developers ask us for a lot.

Yeah.

Yeah.

Of course, it's worth mentioning that just prior to I/O, there was a ton of interesting new capability, including the update to Gemini 2.5 Pro, as well as the implicit context caching, which I know a lot of folks were waiting for.

We made implicit caching happen.

I think there was lots of feedback that people are like, explicit caching is nice.

Like there's definitely use cases where it makes sense, but people want implicit caching.

So I'm happy passing the cost saving on to developers.

You don't have to do anything.

It just works right now and you're saving money.

It's a great outcome.

I don't want to manage that myself.

Yeah.

There are people, like I think if you, there's so many use cases where like you're just doing chat on the same stuff over and over again.

And for those use cases, you want to be able to explicitly cache the thing and make sure and guarantee your cache so that you save money.

So I'm happy we have that.
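To make the cost-saving point concrete, here is a small arithmetic sketch of how a cached prompt prefix reduces the bill. The price and the discount rate below are hypothetical placeholders, not published Gemini pricing, and `prompt_cost` is an illustrative helper.

```python
def prompt_cost(prompt_tokens: int, cached_tokens: int,
                price_per_million: float, cached_discount: float) -> float:
    """Cost of one prompt when part of it is served from cache.

    cached_discount is the fraction taken off the price of cached tokens
    (hypothetical value; real discounts depend on the provider).
    """
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    if not 0.0 <= cached_discount <= 1.0:
        raise ValueError("cached_discount must be between 0 and 1")
    uncached = prompt_tokens - cached_tokens
    billable = uncached + cached_tokens * (1.0 - cached_discount)
    return billable * price_per_million / 1_000_000


# Chatting over the same 80k-token document with 20k tokens of new input.
without_cache = prompt_cost(100_000, 0, price_per_million=1.0, cached_discount=0.75)
with_cache = prompt_cost(100_000, 80_000, price_per_million=1.0, cached_discount=0.75)
```

With implicit caching a cache hit is not guaranteed, so `cached_tokens` may be zero on any given request; explicit caching is the case where the developer guarantees the hit.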

Is there any behind the scenes of like what makes caching hard or anything that people don't appreciate about caching as a general concept?

I think this is a very important pricing paradigm that people need to really get behind.

Yeah, that's a good question.

I think there's a trade off between like all of the dimensions of caching, which is around like the sort of latency, because in some cases you're getting latency gains.

In other cases, it's like how much, you know, what's the cost for Google?

How much stuff do you want to cache altogether?

So we could have an entire episode and get a bunch of the caching people.

It's like a good example of like an infrastructure problem to be solved.

And a bunch of the folks who we work with love working on this problem.

So we should do a deep dive episode.

Yeah, I want to shout out that you've been doing more video stuff.

You have your own podcasts as part of your Gemini work.

You've also been doing a video with people on the team who have done the work.

Yeah, it's been fun.

We had so we should do a caching episode.

Exactly.

You did the long context one.

People loved it.

Your reception was very, was very positive about the long context one.

So thank you.

That was the first time that we did like a more deep technical discussion with folks on the team.

And Nikolay is awesome.

And we actually just did one with, we did work with Shrestha about the live API, which I'm excited about.

We did one with folks on the team about the multimodal capabilities in Gemini.

We're going to do a pre-training one, hopefully, which will be really cool.

We've got a bunch of people who are excited to talk about that.

So there's a bunch of them in the works and it's, it's fun to make them happen and have those conversations.

Yeah.

And my underrated pick is Gemini diffusion.

Yes.

Yeah.

Yeah.

Yeah.

It's not underrated.

It's not underrated.

It's the coolest thing ever.

Yeah.

So like, apart from speed, I wonder like what the potential results of a diffusion language model could be.

Generative UI.

Generative UI.

This is the way the generative UI has happened is through, through this experience.

Language.

The UI bit, just like being able to like say, I want, you know, build the UI on the fly using code based on what a user does.

So like you have no pre-compiled notion of what your website is.

And as a user goes through, as they click buttons, thousand tokens generate and it just like makes that UI for you.

Interesting.

I think that's going to be possible.

I mean, I think there's a lot of work to productionize, make Gemini diffusion, like actually a high quality model that meets the bar for us to bring to the world more generally.

But I do think that's going to be the killer use case will be like this generative UI experience that doesn't exist today because the models just take too long to generate tokens.

Yeah.

For me, it was really the role that.

Audio and video are taking throughout a bunch of independent product releases from the generative models to the live API to the on the fly transcription and translation.

Yeah.

It's, I think, kind of foreshadowing the role that that's going to play in a lot of developer applications.

Yeah.

Transcription actually, even before we release native audio.

Now, of course, you get text and audio interleaved in the output, but transcription used to be one of the biggest use cases we had on the live API.

What are you seeing as the challenges for folks getting started with live?

Yeah, that's a great question.

I think firstly, awareness, right?

Like people knowing that we have a live API.

You can do this.

That's why we're doing this talking to you folks.

I think some of the areas where so we were actually the first to market with also video input.

But one of the areas where we've been getting a lot of feedback is in session length.

Anybody who's been trying to put this in production, like when we started, you could do like 15 to 20 minutes of audio, I'm sorry, and about five minutes of video.

And so we've been putting in a lot of knobs for developers.

And we can talk about that more if you guys want to for people to have a sliding window or decide what resolution they want to send video in, but to basically increase the session length.
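The sliding-window idea can be sketched as a token-budgeted queue of turns: once the budget is exceeded, the oldest turns are dropped so the session can keep running. This is a generic illustration, not the Live API's actual mechanism; the class `SlidingContext` and its token counts are hypothetical.

```python
from collections import deque


class SlidingContext:
    """Keep only the most recent turns that fit inside a token budget."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.turns = deque()  # (text, token_count) pairs, oldest first
        self.total = 0

    def add(self, text: str, tokens: int) -> None:
        self.turns.append((text, tokens))
        self.total += tokens
        # Evict from the front until we fit, always keeping the newest turn.
        while self.total > self.max_tokens and len(self.turns) > 1:
            _, dropped = self.turns.popleft()
            self.total -= dropped

    def texts(self) -> list:
        return [text for text, _ in self.turns]


ctx = SlidingContext(max_tokens=10)
for text, tokens in [("a", 4), ("b", 4), ("c", 4)]:
    ctx.add(text, tokens)
```

After the third turn the window holds only "b" and "c", because keeping "a" would exceed the 10-token budget.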

And then tool calls.

That was another area where we used to get a lot of feedback.

Again, we were very proud because we introduced tool chaining first.

So you could chain search and code execution to do all kinds of analysis.

But then we've had to do a lot of work in improving function calling, improving the performance of search.

And we continue to push on that.

I've got a quick one on this too, which is I think the level of commitment you need to make to the model provider in the world of the live API.

Like I do think for developers is a higher bar.

If you look at what chat completions is, or what generateContent is for us, from just a text modality perspective.

It's like, it's a pretty lightweight thing.

There's a lot of model providers that have that option.

Like I could switch to a different provider if I end up not liking some model provider, which I think is good for the ecosystem.

I think if you look at a lot of the live API infrastructure right now, like you really do need to commit that you're like gonna, you know, there's it's not easily interoperable between different model providers.

Like everyone's infrastructure is all bespoke and different.

So like it is a, it's a different level of commitment that you need to have to like really bet your company or your business or your product on the live API, which I do think is a challenge for developers to sort of make that level of commitment in this like fast moving AI world.

But I think hopefully there'll be like some level of like similarity and you'll get some model agnostic infrastructure to help make that, you know, make developers feel a little bit, a little bit easier about being able to move between models potentially.

I could go on and on, but if you have say more complex workflows, then one of the things is being able to change the system instructions at every step of your workflow.

And so yeah, on boarding some of the more complex use cases with the live API has been a work in progress as we've released like more features.

So what kind of complex workflows are we talking about?

You know, we have people who are building say gaming agents, but like we have multi-states, for example, in them.

We have a lot, I mean, this was a famous demo at Next, but we have folks who want to, you know, customer support agents, of course, you know, they can, the clients can last for hours, right?

Then there's a lot of use cases around people showing a certain screen.

This is the coolest use case, honestly.

Yeah.

And I was referring to like the famous demo at Next where Shopify showed how to set up a DNS using Cloud Play, right?

So in certain cases, especially the longer your workflow runs, like you might have to go from one state to another state and might want to change the SI or if you handle it, hand it from one agent to another agent, you might have to change the system instruction.
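One way to picture changing the system instruction as a workflow moves between states is a small state machine that maps each state to its own instruction. The states, events, and instruction strings below are hypothetical, and how the new instruction is actually sent to a live session is left out.

```python
# Hypothetical states and instructions for a support-style workflow.
SYSTEM_INSTRUCTIONS = {
    "greeting": "Greet the user and ask what they need help with.",
    "troubleshoot": "Walk the user through the fix step by step.",
    "handoff": "Summarize the issue for the next agent.",
}

# (current_state, event) -> next_state
TRANSITIONS = {
    ("greeting", "issue_described"): "troubleshoot",
    ("troubleshoot", "needs_human"): "handoff",
}


class Workflow:
    def __init__(self) -> None:
        self.state = "greeting"

    def on_event(self, event: str) -> str:
        """Advance the state if the event applies; return the active instruction."""
        self.state = TRANSITIONS.get((self.state, event), self.state)
        return SYSTEM_INSTRUCTIONS[self.state]


wf = Workflow()
first = wf.on_event("issue_described")
second = wf.on_event("needs_human")
```

Unknown events leave the state unchanged, so the current instruction stays in force until a defined transition fires.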

When you're thinking about building voice-based applications, is speech to text and then processing with a standard LLM, would you say that's like a precursor to the live era or are these two distinct paths that are still viable and that you still see being viable going forward?

That's a tough question and I'm still like, no, we have both out.

I do think perhaps eventually for most use cases, as these audio to audio architecture models get better, a lot of use cases will probably transition to that.

But when we talk to our developers, they still very much like those componentized components.

So that's why we also put out two new text-to-speech models at I/O.

Not available through the live API yet, but really high performing controllable, promptable text to speech models.

I have an angle of an answer to this question, which is I talked to Koray this morning, who's our boss's boss, the CTO at DeepMind.

And Koray had a really interesting take, which is that one of the main things that makes what we're doing at Google with Gemini different from what a lot of the other labs are doing is that we're here to make one model.

And like that model is Gemini.

And like, I think you do need to, to Shrestha's point, like to make the capabilities work in some cases, like you do need to have these forks that like go off and make that capability and harden it and then find a way to bring it back into the mainline model.

But like, we want to make one model and it's the Gemini model and like not have the sort of splintering of all these different capabilities.

And we've done a good job of, I think thinking the reasoning stuff was like the best example of this.

We had those, they were separate from the mainline Gemini models so that those teams, the research teams, could go and hill climb and make progress and not need to be constrained about like, how do we do this without having there be collateral damage on other capabilities like multimodal or something like that.

But the teams went and did that and then they find a way to sort of bring the capabilities together.

And oftentimes what you see is there's tension in bringing them together, but it's the really exciting thing is what happens when you bring the capabilities together.

And like 2.5 Pro with reasoning is a great example of this where like multimodal with video understanding ended up like having this huge, like it's having this beautiful moment.

The model is like SOTA out of the box because of all the reasoning capabilities that were baked in.

It wasn't because they like did a bunch of stuff to make video understanding really good.

It was just like an artifact of bringing and merging those capabilities together.

So I think that as like a North star for Gemini models makes a ton of sense.

I agree with you.

And that's what I said, right?

Like I think eventually a lot of use cases will end up on Gemini will end up on natural voice.

But I think in order to foster development, like we have these offshoots from time to time, right?

We have our Imagen models for image generation, even though now, as of another I/O, well, slightly pre-I/O announcement, you can do interleaved text and image within Gemini also, right?

And it unlocks.

But those are different models, right?

One is auto regressive.

The other is diffusion.

That's what I'm saying, right?

But for a lot of image generation, image editing, high quality photorealistic use cases, developers are still using Imagen, but then slowly but surely we're bringing those capabilities into Gemini.

Whoever's watching this, we had a mid I/O switch because obviously there's a lot going on here.

It's not AI shape shifting.

I know, I know.

But we also have Kwin, who actually made this podcast happen.

But you're a founder CEO of Daily.

Welcome.

I'm a big fan of all things voice and audio.

All things voice and audio.

It's fun to be here with you and with Shrestha.

So Kwin actually runs the voice AI meetup in San Francisco.

You are basically consistently the leading community builder.

And you're very generous of your time and knowledge.

I really appreciate that.

And obviously also recently you started Pipecat, which is this open source framework for voice orchestration.

Which has really great support for all the Gemini models.

You wanted to say something about the relationship with Gemini and Daily?

I just wanted to say that it's been a very, very fruitful partnership with Daily.

They've been our partners since the launch of the Live API.

And a lot of the feedback that they continuously give has been instrumental to the success of the Live API.

So both Daily and LiveKit, we're partners with them.

Kwin, I think we had a little bit of a prep for this.

You also wanted to dive into a little bit on the cascade of models in Gemini live.

I mean, I think Shrestha has taken a really interesting approach designing these APIs.

So you talked about components a little bit.

You talked about how you want to be able to do things both in the live API and in the more traditional chat API.

And originally you designed the Live API to have audio in, but then a separate text model, and the NotebookLM models for audio out.

What was the sort of driver for that originally?

I mean, that at the time was we wanted to hit a certain quality bar, a certain latency bar.

And you know, NotebookLM was already out and the TTS models that were powering NotebookLM were very, very good.

But we wanted an aspect of native.

So it was native audio in, but TTS out and we still have that architecture available through the live API.

But then now we just released the audio-to-audio architecture.

I mean, the infrastructure for this stuff is so interesting because you're always balancing latency, cost, output quality.

There's no free lunch.

Yeah.

And other things like multilinguality.

Coming back to your question earlier, Sam, a lot, we had a lot of users asking us for, say, better German language support or something, which hopefully now we've delivered on with these models.

Yeah.

Yeah.

But now you have audio to audio in the live API as well.

In the live API only is where we have the native audio output models.

Now continuing to pull on the component versus single model thread a little bit.

When I think about voice, I think about it as being an area where, to deliver solutions, you need to surround that strong model with a lot of voice-specific infrastructure that is, I'm imagining, challenging to scale.

Yeah.

So Shrestha, can you talk a little bit about that and maybe we can have Kwin talk about that from his perspective?

So the first thing that comes to mind is, of course, the voice activity detection models that we have.

And we've done a lot of work like finessing that model server side, but we've also learned that we need to provide some knobs to developers.

So now developers can actually tune the sensitivity on our voice activity detection model, as well as how much of the prefix pad, like how much of a time duration at the beginning at the start or stop of saying things.

And we also have a mode where you can disable our voice activity detection and bring your own.

But I think the larger point that you're touching on, Sam, that I do want to mention is it is really, really hard to bring all these components together and still get latency down to where it needs to be in the 500 to 700 millisecond range.

It's one of the hardest things we've had to do with the Live API.
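The two knobs mentioned here, sensitivity and prefix padding, can be illustrated with a toy energy-threshold detector. This is a generic sketch, not the server-side model the Live API uses; `detect_speech`, the per-frame energies, and the parameter names are all hypothetical. A lower threshold corresponds to higher sensitivity, and prefix padding pulls the start of a segment earlier so the first syllable is not clipped.

```python
def detect_speech(energies, threshold, prefix_padding_frames):
    """Return (start, end) frame ranges judged to contain speech.

    energies: per-frame loudness values (hypothetical units).
    threshold: frames at or above this count as speech; lower = more sensitive.
    prefix_padding_frames: frames to include before the detected onset.
    """
    segments = []
    start = None
    for i, energy in enumerate(energies):
        if energy >= threshold and start is None:
            # Do not pad back into the previous segment.
            floor = segments[-1][1] if segments else 0
            start = max(floor, i - prefix_padding_frames)
        elif energy < threshold and start is not None:
            segments.append((start, i))  # end index is exclusive
            start = None
    if start is not None:
        segments.append((start, len(energies)))
    return segments


frames = [0.0, 0.1, 0.1, 0.8, 0.9, 0.1, 0.0]
padded = detect_speech(frames, threshold=0.5, prefix_padding_frames=2)
unpadded = detect_speech(frames, threshold=0.5, prefix_padding_frames=0)
```

Here `padded` starts two frames earlier than `unpadded`, which is the effect the prefix padding knob is meant to give developers.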

What we see is that the shape of building these real-time voice agents is a different set of developer problems than the shape of non-real-time or text-mode things.

One of the fun things about partnering with Shrestha in DeepMind is we work on this open source framework that people use to build these kind of production voice systems.

And so we try to solve problems at the framework level, like turn detection and context management.

As the models get better, as the use cases get more clear, some of those features migrate from the framework into the APIs, which makes life easier for developers.

The use cases at the same time continue to broaden out.

And so there's more things for the framework to do.

So we're sort of filling the top of the use cases, building blocks, developer experience funnel and pushing down as we all get better and we all figure out what this new world looks like.

And maybe this is also a good segue into WebSockets versus WebRTC.

Yeah, you know, there's so much infrastructure.

Like for my whole career, I've been building large scale, low latency network stuff.

What we saw from my perspective when we started to see the possibilities of voice AI was you need this packet routing like down underneath the inference layer.

There's like the AI inference stuff, but then there's the just how do you move the audio and increasingly video around the internet.

And so there's a whole new generation of developers who are interested in these networking protocols because voice AI and now real time video are so interesting, which is super fun for me.

Cause like I've always thought moving packets around is one of the most fun things you can do on the internet.

Yeah.

Seven layers of the OSI stack.

Exactly.

Yeah.

At pretty, pretty demanding real-time latencies; as Shrestha is saying, human beings expect you to respond in a conversation in 500 milliseconds or so.

And if we're talking to an AI, we don't relax that assumption.

We bring our assumptions about human conversation into that experience of interacting with an AI.

Yeah.

Or not respond.

So it's a great point.

Yeah.

So like one of the features that we've pushed out a little more experimental, but would love for people to test it is what we're calling proactive audio.

And it's available only in the native audio in the audio to audio architecture right now.

And what this feature does is it's trained not to respond to irrelevant audio.

Okay.

So it's like a refusal kind of.

Yeah.

Or you could call it directionally like semantic voice activity detection, right?

So basically, yeah, like let's say I'm talking to the AI and then Kwin comes and asks me a question and I respond to Kwin, it'll know when not to respond.

So yeah.

I saw that in one of the demos, the AI seemed to ignore a background question from someone else in the video.

I think there's two threads to pull on there.

One is that's another great example of things that we had to work really hard at the framework level to implement.

It's much, much better if it actually migrates down into the model or the API.

The other is part of the magic there is this semi separate feature, but I think they're multiplicative of now your models can actually recognize two different people just based on their voices.

You and I were, they have to, they, yeah.

This is not officially supported yet.

The one who does it, but just try it, right?

Just try and give us feedback.

But is it okay to talk about it?

Because it's, it might be my single favorite thing you can do with these models that you previously have not been able to do.

You can talk about what you've observed.

I was just saying it's not officially.

And what specific models are we talking about? Because speaker identification and diarization has always been really hard for these models.

It's called, gosh, model naming now has become so... it's called the native audio dialogue.

You'll see it in the live API, but that's the model.

And then, you know, to your point again, Sam, about architectures.

One thing that we launched on the cascaded architecture that we hope to eventually bring to the native audio as well is asynchronous function calling.

So earlier, the way it used to work is if you wanted the model to do a function call, you'd have to wait for the response.

And now you can set a non-blocking parameter and the model can go off and execute the function in the background.
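The difference between blocking and non-blocking function calls can be sketched with `asyncio`: the tool runs as a background task while the conversation keeps going, and its result is folded in when it finishes. This illustrates the pattern only; `slow_tool` and `conversation` are hypothetical stand-ins and do not use the Live API.

```python
import asyncio


async def slow_tool(query: str) -> str:
    """Stand-in for a function call that takes a while (hypothetical)."""
    await asyncio.sleep(0.05)
    return f"result for {query}"


async def conversation() -> list:
    log = []
    # Non-blocking: start the tool and keep talking instead of awaiting it now.
    task = asyncio.create_task(slow_tool("order status"))
    for utterance in ["Let me check that.", "Anything else meanwhile?"]:
        log.append(("model", utterance))
        await asyncio.sleep(0.01)  # yields control so the tool can make progress
    # Fold the tool result back in once it is ready.
    log.append(("tool", await task))
    return log


log = asyncio.run(conversation())
```

In the blocking version the two model utterances could only happen after `slow_tool` returned, which is the wait the speaker describes.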

I love you so much.

Yeah, that's great.

We do have to wrap up.

So I think one fun thing that we can do to wrap up would be a wish list for next year's I/O.

What would be one thing that you would wish it doesn't have to come true, but you know, which happens with Gemini?

Well, I was hoping for Gemini 3.0 at this I/O.

So maybe Gemini 5.0 at the next I/O.

Tell us what you mean by Gemini 5.

What do you want in Gemini 5.0 then?

I'll let Kwin go.

I'll just put on my hat as representative of a big community, people building this stuff more and more languages because AI is global.

And there are so many communities all over the world that are starting to use this stuff.

Can we do language LoRA?

It's hard to stuff everything in one language, in one model.

But they're building one model, as they said.

They're building the one universal model.

I think that would be a boring answer, but I think really more and more...

Languages?

No, languages.

I mean, wasn't I telling you earlier, like we officially support 24 languages, but you can try talking to the model in Klingon and it'll respond to you.

So I think we'll get there way before next I/O.

But I just think more and more capabilities into the main model is what I would say.

I'll have to think about this.

Yeah, it's a fun parlor game, but it also helps people align as to what is possible and what's coming up.

Thanks for your time, everyone.

This is very hastily organized, but I'm glad that we can make this happen.

It's nice to actually see everyone in person.

Same.

Yeah.

So you think I am only the PM for the Live API, but we didn't get to talk about all of the other releases as well.

We'll save that for your talk at World's Fair.

You guys are all speaking and we'll be podcasting as well.

All right.

That's it.

Thank you so much.