Google DeepMind: The Podcast · 2025-05-22

DeepMind's Carolina Parada on Gemini Robotics and the Coming 'Explosion' in Robotics

Hosts: Hannah Fry

Guests: Carolina Parada

Gemini Robotics · embodied AI · robot foundation models · dexterous manipulation · teleoperation and diffusion policies · System 1 / System 2 architecture · robot safety and Asimov dataset · sim-to-real transfer · future of robotics

Why it matters

Gemini Robotics adds 'actions' as a new modality to Gemini's multimodal foundation, enabling general-purpose physical-world behavior

Key claims

  • Gemini Robotics adds 'actions' as a new modality to Gemini's multimodal foundation, enabling general-purpose physical-world behavior
  • The model uses a Kahneman-inspired System 1/System 2 split: a large server-side reasoner plus a small fast on-device reactive controller
  • Embodied reasoning capabilities (pointing, bounding boxes, 3D correspondence across camera views) emerged without explicit depth input
  • Dexterity breakthroughs came from combining teleoperation data with diffusion policies—enabling shoelace tying, laundry folding, and origami folding

Episode summary

Summary

In this episode of Google DeepMind: The Podcast, host Hannah Fry speaks with Carolina Parada, who leads the international robotics team at Google DeepMind. Parada frames the next two years as 'predefining' for robotics, arguing that advances in understanding, dexterity, and whole-body control are converging. She traces the team's progression from reinforcement learning block-stacking experiments (2022) to language-conditioned robots, robotics transformers (2023), and the recent Gemini Robotics model, which adds actions as a new modality on top of Gemini's multimodal foundation.

  • A robot performed a slam dunk with a novel toy hoop in under a quarter-second by drawing the concept from Gemini's world knowledge
  • The Asimov dataset introduces physical-safety benchmarks derived from hospital injury data, layered on top of Gemini's existing safety work
  • Moravec's paradox still holds: dexterity-generalization remains an active tradeoff, and the sim-to-real gap persists for deformables
  • Parada predicts robotics will have its 'LLM moment' within 5–10 years and that embodied learning will feed back into stronger foundation models

Source material

Transcript

The next two years are going to be predefining for the field of robotics.

There's just a lot of things that are coming together: understanding, dexterity, whole-body control.

You can see how this could actually merge into a very strong solution.

Welcome back to Google DeepMind: The Podcast.

I'm Professor Hannah Fry.

Sometimes, maybe even often, the terms AI and robot are used interchangeably in casual conversation.

You know, people talking about chatting to a robot on an app.

But robots have a physical body, and here at Google DeepMind, they care about robots with AI embedded in the real world.

And while AI has made huge strides, embodied intelligence has lagged behind.

But perhaps all of that is about to change.

Carolina Parada leads robotics research here at Google DeepMind, the international team responsible for some extraordinary advances in robotics.

Most recently, Gemini Robotics, which brings Gemini's multimodal understanding to the physical world.

Welcome to the podcast, Carolina.

Now, I know you've been working with these robots for quite a long time.

How have you seen them evolve?

Thanks for having me.

Yeah, it's been super exciting.

I have been excited about robotics since I was 10 years old, super excited because of what I've seen in cartoons, like you see robots like Rosie the robot helping do all the chores.

And as a kid, you're like, of course, that's what I want to build when I grow up.

And really, I've been at the Google DeepMind Robotics team for about seven years.

And really, things have changed dramatically in the last three years in particular.

We've always believed from the very beginning that AI was going to be completely transformative to robotics.

I mean, there's a lot of robots out there that are really helpful today.

There's robots in manufacturing lines.

There is robots that are navigating the moon.

There is robots that are in our oceans.

But these robots have been programmed to do specifically those tasks.

They make a lot of assumptions about those environments or the objects they might encounter, or they might be remotely operated by humans.

But we have believed from the beginning that AI is the way to transform robotics so that we can build robots that are truly intelligent so that they can interact with you, that they can reason about their environment, and they can take action in a way that feels very general.

So that has been our mission from the start.

And so I think three years ago, you had robotics on your podcast.

And back then, we were doing reinforcement learning for robotics.

And so essentially, we were teaching robots to stack blocks by giving them a simple reward, like a plus one if your tower got taller.
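The reward described here can be pictured with a minimal sketch. It assumes the tower height is observed as a block count at each step; the function names are hypothetical and this is not the team's actual training code.

```python
def stacking_reward(prev_height: int, new_height: int) -> int:
    """Return +1 when the tower grew during this step, otherwise 0."""
    return 1 if new_height > prev_height else 0


def episode_return(heights: list[int]) -> int:
    """Sum the per-step rewards over a sequence of observed tower heights."""
    return sum(stacking_reward(a, b) for a, b in zip(heights, heights[1:]))
```

A reward this sparse only tells the robot that the tower got taller, not how to make it taller, which is part of why learning from it is slow.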

We made some progress there.

But since we've been at the forefront of AI, we've been bringing more and more of AI into the entire world of robotics.

So about 2022, we introduced, for example, LLMs to robots.

And that was the first time that you could actually talk to a robot and say something like, I'm thirsty, and it would know what you meant.

And then later on, we brought in VLMs so the robot could understand natural language, but it could also understand the visual input that it was getting and then make decisions based on that.

And then in 2023, we introduced robotics transformers.

And this is the first time that the transformer architecture was actually included in robotics.

And it basically showed us that robot performance scales with data.

And that essentially started a new foundation or a new era of large scale data driven robot learning.

And then more recently, we introduced just now Gemini Robotics, which was essentially our most advanced model for actions.

And it essentially takes the multimodal world understanding of Gemini and brings it to the physical world by adding actions as a new modality in Gemini.

And that really enables models to be very general because it's understanding the world through Gemini's understanding and enables it to be interactive.

In fact, it can understand any language that Gemini supports, and it enables it to be dexterous.

So it can still do very complex manipulation while talking to you and also understanding a completely new situation, which today is actually very hard for robots to do.

In terms of your big goal, your big ambition, how will we know when we get there?

I think it's definitely going to be gradual where robots are able to understand a new situation and reason about something they need to do that they haven't seen before.

And that's exactly what we're seeing right now.

But it's still going to be difficult for them to learn more and more complex tasks.

In fact, that's what we see.

The robot can feel sort of like a two-year-old toddler that can understand the world around it.

It can start to play with objects.

It understands concepts.

But if you teach it to do something more complex, like we have an example where we teach the robot to do an origami fold, it actually needs time to practice that.

And once he has more practice in that case, he can actually do it.

So that's roughly where we are today.

But that's far from where we need to be if we want robots to be in everyday spaces, doing all kinds of tasks for us.

I thought that what we could do is take a little look at some of what these robots can do, because there's a video that you guys have recently released.

What we have here then is we have a humanoid robot who is packing a lunch for its human.

Also playing noughts and crosses.

Is it any good at the noughts and crosses?

I think we still beat it because it's very simple understanding.

Tell you what it's doing, though, is picking up the pieces and moving them around quite easily.

There's also a bit here where it can make its own anagram based on tiles that appear.

What were you particularly impressed by?

I think that's what's most exciting about these models is that in many occasions, our own researchers were excited and impressed by what he was doing.

And it was primarily because the way we were testing it was by putting the robot in front of situations that he's never seen before.

So even we didn't know whether the robot was going to be able to get it right.

And in many occasions it did.

So many of the examples that we show in this video, as well as the other videos where you have two arms moving around, is that it's actually understanding a complex concept.

So a really cool example, where we were all like, gasp, was when we showed the video where the robot is actually doing a slam dunk.

And what was cool about that case is that that day we were just having the creative team come and film the robots and we asked them to bring toys.

We didn't say anything else.

They're like, just bring toys to play with the robot and the things the robot hadn't seen before.

Yeah, they had no idea what the robot was trained on.

Right.

So they actually brought this little basketball hoop that was a little cute toy with a little ball.

And they put it in front of the robot.

Again, the robot had never seen anything related to basketball.

It certainly has never seen this toy.

And they asked it to do a slam dunk of the ball.

And we were all like, I have no idea if it would work.

And actually, it took not even a quarter of a second.

And it actually decided to put the ball inside the basketball hoop.

And we were all like, that's amazing.

And it was just essentially drawing from Gemini's understanding of what basketball is and what a slam dunk is.

Right.

Which is a concept we couldn't have thought of teaching it to do.

And it essentially did the right motion.

So that was a really cool example.

Talk to me about the packing lunch one.

It kind of has a conceptual understanding of what a banana is, for example.

Does it know how to grip a banana in the sense that you can't grip a banana in quite the same way as you could a clay pot or something even more fragile than a banana?

Actually, one of the things that is super impressive is that these robots are extremely simple.

They actually don't have touch sensing.

They don't have depth sensing.

They don't have force sensing.

So they're literally doing eye hand coordination and using an understanding of how you grasp a banana.

So it actually is looking at the object and grasping it.

And once he sees that he has it in hand, that's how he knows that he has detected it.

There's other robots out there that are much more complex.

But this forces the model to really reason about what he's seeing and making a decision about how to pick that up.

And that's the thing that's really original here.

Yeah, that is one of the many things is the fact that he's doing it.

Not just because we taught it a thousand times how to pick up a banana, it's because he's pulling this out of his understanding of how to pick up objects from Gemini and then adapting it to the world of actions.

Because I can imagine, I mean, there have been lots of videos doing the rounds on the Internet for a number of years of extremely impressive looking robots doing backflips and, I don't know, being kicked over and sort of running up and down mountains and things.

In comparison to those videos, picking up and putting down a banana into a lunchbox, seems like quite a simple task, but we're talking about a different type of robot here, aren't we?

Yeah, I mean, this is a completely different problem you're trying to solve.

Many of those videos are basically rehearsed sequences that the robot has learned and memorized, and we're actually very impressed by them.

But it's a different problem that you're trying to solve.

What you're trying to solve here is for the robot to reason about what it means to pack a lunch, given the objects in front, what it needs to do in order to put a piece of bread inside of a bag, and then what it means to close it.

And it's never going to go as you expect, because these are very flexible things that move around.

So it needs to react and respond to what's happening and then actually complete the task.

It's that idea of generality.

That's right, yeah.

So how do you compare one robot against another?

How do you decide whether this robot is doing generality better than another?

That was actually one of the things that was hard for us to express when we were even recording for the demos in this release.

A demo is by definition scripted.

So we were like, this doesn't quite capture what we want to share.

That's why we asked the team to bring a bunch of toys and actually start playing with the robots and see what emerges.

And the best way to capture it is that we're able to change the behavior of the robot by talking to it.

And you can see that in the videos, we are actually able to put objects in front of it that it's never seen before.

And we move objects around to make sure that people understand that this is actually not a prescripted behavior.

In fact, in our benchmarks, we evaluate our models in all kinds of ways in terms of generalization.

So we will change the visual background.

We will change the background.

The objects will be new.

We will add objects to distract the robot.

We would also like ask it to do completely new things.

Or even you can talk to it in a different language.

So I could just give it the instruction in Spanish and it would just actually work.
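The evaluation described here, varying one thing at a time and checking whether the task still succeeds, can be tallied per variation. The sketch below is an illustration only: the axis names and the trial outcomes are made-up placeholders, not results from the Gemini Robotics benchmarks.

```python
from collections import defaultdict


def success_rate_by_axis(trials):
    """Group (axis_name, succeeded) pairs and return the success rate per axis."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for axis, succeeded in trials:
        totals[axis] += 1
        wins[axis] += int(succeeded)
    return {axis: wins[axis] / totals[axis] for axis in totals}


# Placeholder outcomes, invented purely to show the bookkeeping.
example_trials = [
    ("new_background", True), ("new_background", True),
    ("new_objects", True), ("new_objects", False),
    ("distractors", True),
    ("new_instruction", False), ("new_instruction", True),
    ("spanish_instruction", True),
]

if __name__ == "__main__":
    for axis, rate in success_rate_by_axis(example_trials).items():
        print(f"{axis}: {rate:.0%}")
```

Reporting a rate per axis, rather than one overall number, shows which kind of novelty the model handles and which it does not.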

I want to talk about interactivity, too, because in a few of your videos, there's one where a human is at a desk and the robot is kind of clearing up after him as he goes.

In another, you've got a human moving a cup around and the robot sort of chasing it, trying to put an object inside.

How much more difficult are those interactive scenarios than just a static task?

Yeah, I mean, they're significantly more advanced behavior.

And a lot of the interactivity sort of just fell out of the model.

Like we were not thinking, for example, how fast can we move these objects before the robot would react?

We certainly knew that we wanted a model that could react quickly.

But a lot of these examples that we posted on videos just fell out of people playing with the model and seeing how it would behave.

Same with organizing the desk.

That was actually someone playing with the robot, deciding to see how much it could game it until it actually was able to complete the full task.

So, yeah, it actually is amazing to see how a lot of these other capabilities that are already there in Gemini are actually extremely valuable when you bring them into a robot, which is now able to adapt based on what you're saying.

So you could actually have a full conversation and change the behavior of the robot as he's moving.

So you can say, I want you to do this.

Oh, no, actually, never mind.

I want you to do this other thing and it would actually just follow you.

It's actually kind of comical.

And then you could also change the objects around and it will just do it.

I think it's kind of a good job sometimes that these robots don't have feelings because they feel very sort of forlorn.

And like just being chased around on a table by researchers.

Yeah, actually, it's super fun.

That's the large language model sitting underneath it that's helping it do that, right?

That's giving it that conceptual understanding of the objects that it's manipulating.

That's right.

So we're leveraging Gemini's multimodal understanding to take the visual input that the robot is seeing through its cameras and the natural language that it is hearing from the human and then translate that into how to act.

And it actually also speaks back.

So you can ask it a question about whether it's done.

You can ask it a question about how far it is in the process of folding an origami figure.

It actually understands that and can respond.

I remember when Gemini was first being launched, people went to great lengths to talk about how it was multimodal.

This is sort of the payoff for putting in all of that extra groundwork and making sure that it can understand videos and photos and so on.

I mean, one of many.

I think we as humans capture the world through many different senses.

Right.

So I think it's super important if you want to build an intelligence as powerful as our brains to be able to take input in a multimodal way.

And definitely robotics is a perfect example where you can see that it absolutely requires an understanding of natural language and visual input.

And presumably in the future also touch sensing in order to make decisions about how to act the same way humans do.

Why does it matter, though, that robots should have a conceptual understanding of what they're doing?

I mean, OK, maybe you wouldn't call them intelligent, but there are robots like dishwashers or like lawnmowers.

Right.

They don't have a conceptual understanding of what a plate is or what grass is.

I mean, is it actually necessary?

I'm sure that there's applications where you can have a robot that can just repeat the actions and it will be just fine.

But we're interested in actually building robots that can really reason and act in a very general way.

Just because the world is really messy.

Things will never go exactly according to plan.

And there's a lot of tasks where things are constantly changing.

And it actually just opens up the opportunity of applications for these robots.

They could literally be anywhere that a human could be doing a task.

So that enables them to be helpful in home environments, but also in manufacturing environments.

There are some things that are important in robotics that I think are now, with the sort of standard Gemini, actually quite easy.

Things like pointing or drawing bounding boxes.

Just explain to us what those are.

Well, basically, this is one of the areas that we actually had to improve Gemini in order to help with robotics.

So if you have, for example, an object in front of you, what we mean by pointing is that I can literally identify any point in that object.

So I can say, imagine that you have a T-shirt in front of you.

If I point to the collar, you should say this is the collar.

Or if I say collar, you should identify where the collar is.

And you might imagine that this is not that important.

But actually, if you're trying to fold that T-shirt, you need to know where the collar is, where the bottom of the T-shirt is, and all of the different components.

Bounding boxes, what it means is that you can identify all the edges of that object so that you know where the object ends and the rest of the environment begins.
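Pointing and bounding boxes, as described here, map naturally onto two small data structures. The following is a minimal sketch assuming 2D pixel coordinates; the class names, the `label_at` helper and the smallest-box rule are illustrative assumptions, not Gemini's output format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Point:
    x: float
    y: float


@dataclass(frozen=True)
class BoundingBox:
    label: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    def contains(self, p: Point) -> bool:
        """True when the point lies inside the box, edges included."""
        return self.x_min <= p.x <= self.x_max and self.y_min <= p.y <= self.y_max

    def area(self) -> float:
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)


def label_at(point: Point, boxes: list[BoundingBox]) -> Optional[str]:
    """Name the most specific (smallest) box containing the point, if any."""
    hits = [b for b in boxes if b.contains(point)]
    if not hits:
        return None
    return min(hits, key=lambda b: b.area()).label
```

With a large box for the T-shirt and a small box for its collar, a point inside both resolves to the collar, which is the level of detail a folding task needs.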

So these kinds of examples are, I think, trivial for us humans.

We don't even think about it.

But if robots are actually able to have access to that kind of information, then they can be smarter about the way that they take action in the physical world.

This is what we call embodied reasoning, essentially.

How is it different from the kind of reasoning that you get in the standard Gemini model?

Yeah, we refer to embodied reasoning as reasoning about the physical world in a lot more detail, the way humans do.

If you are going to take action, say that you're trying to pack a lunch for your kid, you're actually, in order to do that, you have to understand where all the objects are in 3D space.

Then you need to understand how to grasp each object in order to pack it into that box.

And then you need to figure out how to organize all those pieces so that they fit.

All of this is what we mean by embodied reasoning.

So is this things like, I don't know, let's say you've got two camera views, you're there and I'm here.

For instance, I can see your microphone and so can you, but we've got a completely different view of it.

Is it that kind of stuff?

Yeah, I mean, it can understand, for example, how far the microphone is from our face, but also if I move around, it can do object correspondence, meaning it understands that microphone is the same one that I'm seeing from the other point of view, which you can imagine is super important if a robot is moving and reasoning about its environment.

How hard is it to switch from a 2D image, like a single camera view, to a 3D understanding of the space?

So actually today, what robots are doing is that they're taking camera views from different places.

So actually, the robot has cameras in its wrist and has a camera on top.

And it's actually taking all of the input from the three images and doing this on its own.

It's actually reasoning, oh, I'm closer to the object because now this camera looks closer.

This camera, I can see my hand and it's doing all of that association on its own.

We're not explicitly adding depth as an additional input.

We're just giving it multiple camera views and it's realising how to use them in order to understand depth.

And how much of that was you deliberately setting that as a task for the robots?

Or how much of it sort of emerged from the conceptual understanding that you get from the Gemini models?

It simply emerged, actually.

So we were able to give it multiple cameras and just see if it actually could reason between them.

I mean, that's got to be quite shocking.

I mean, I imagine many, many people must have spent many, many years thinking very hard about that problem of how do you align different camera views to make it so that you're tracking an object across different angles.

And then all of a sudden you get these large language models, you know, like Gemini, and it can just do it automatically.

Yeah, I mean, it's actually wonderful to be able to leverage these models to bring simplicity to the system.

You really don't need to have all of these different stages where you extract depth, and only then you extract where the objects are, and only then you plan how to move, and only then you're able to do the task.

And that's because the foundational model is effectively like a Swiss Army knife, like it can do all of those things.

Yes, exactly.

And it can reason between them, right?

OK, so you enhance the physical reasoning almost.

Exactly.

You enhance physical reasoning and spatial understanding.

And then motion understanding would be the next thing is understanding what would happen if I put a glass at the edge of the table, what's actually likely to happen.

All of these are the areas that we enhance.

But that's not enough.

You actually have to take it another step and essentially start to teach Gemini the language of actions.

And actions for us means understanding how you are actually moving each joint in a robot.

So if this is my robot arm, then I'm teaching Gemini how to move the robot, how to move my arm like this.

And these are all essentially numbers, right?

And it's learning to translate what it means to pick up a glass versus move my arm in order to pick up a glass.
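One way to picture actions as "numbers" that a language model can read and write is to turn each joint angle into an integer token. The sketch below assumes uniform binning into 256 bins over a range of -pi to pi radians; the transcript does not say how Gemini Robotics encodes actions, so the scheme, the constants and the function names are all assumptions for illustration.

```python
import math

NUM_BINS = 256                 # assumed number of discrete values per joint
LOW, HIGH = -math.pi, math.pi  # assumed joint range in radians


def encode(angles):
    """Map each joint angle to an integer bin index in [0, NUM_BINS - 1]."""
    tokens = []
    for angle in angles:
        clipped = min(max(angle, LOW), HIGH)
        fraction = (clipped - LOW) / (HIGH - LOW)
        tokens.append(min(int(fraction * NUM_BINS), NUM_BINS - 1))
    return tokens


def decode(tokens):
    """Map each bin index back to the angle at the centre of its bin."""
    width = (HIGH - LOW) / NUM_BINS
    return [LOW + (token + 0.5) * width for token in tokens]
```

Decoding returns the centre of each bin, so a round trip loses at most half a bin width per joint, about 0.012 radians under these assumed constants.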

And so you're essentially teaching it a new language.

You're connecting those different ideas.

Exactly.

Can we think of this as two systems working in tandem then?

I mean, I'm thinking of the analogy here of system one and system two, the Danny Kahneman Thinking, Fast and Slow thing.

Yes, exactly.

So essentially the model that we built actually has two models.

It has a system that is slow but very powerful at reasoning and thinking and a system that is faster but very good at reactivity.

This is like how human brains work, right?

Like that you have the kind of the part of your brain that's very good at calculation and analysis.

And then you also have your very instinctive, reactive side too.

Yes, that's right.

In fact, what we do today is that one of these models is much bigger than the other, as you can imagine, and actually lives on the server, and the fast model lives on device and can respond very quickly.

Tell me how this works then in terms of the system one and system two and that example of a slam dunk, something it's never seen before.

How does it work?

So what happens is when you ask the robot to take the basketball and do a slam dunk, the system two has to understand what that means.

You know, what is basketball?

It has to understand where the objects that are in front of it are, like where the basketball is.

Understand there is a hoop and then that a slam dunk actually means picking up that ball and putting it there.

So it understands all of that and predicts a rough trajectory of what the robot should do in terms of how it should move.

And then hands that over to the system one, which is on device and is able to take that trajectory.

But it also takes the visual input and is able to adjust that trajectory.

So if I were to, for example, get in the way, put my hand in the middle or move the object around, it would still be able to respond because it already understood the concept of where a slam dunk was and respond very quickly.
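The hand-off described here, a slow planner proposing a rough trajectory and a fast controller correcting it from fresh observations, can be sketched as a two-rate loop. This is a one-dimensional toy under stated assumptions: the function names are hypothetical, the straight-line plan and the proportional correction are stand-ins chosen for simplicity, and none of it reflects the actual models.

```python
def slow_planner(start: float, goal: float, steps: int) -> list[float]:
    """'System two' stand-in: a rough straight-line trajectory to the goal."""
    return [start + (goal - start) * (i + 1) / steps for i in range(steps)]


def fast_controller(position, waypoint, planned_goal, observed_goal, gain=0.5):
    """'System one' stand-in: track the waypoint, shifted by how far the
    goal has moved since the plan was made."""
    corrected = waypoint + (observed_goal - planned_goal)
    return position + gain * (corrected - position)


def run(start, goal_at_plan_time, observed_goals, ticks_per_waypoint=4):
    """Plan once, then run several fast control ticks per planned waypoint."""
    plan = slow_planner(start, goal_at_plan_time, steps=len(observed_goals))
    position = start
    for waypoint, observed_goal in zip(plan, observed_goals):
        for _ in range(ticks_per_waypoint):
            position = fast_controller(
                position, waypoint, goal_at_plan_time, observed_goal
            )
    return position
```

For example, `run(0.0, 10.0, [10, 10, 12, 12, 12])` plans toward 10 but ends close to 12, because the fast loop follows the hoop after it is moved without waiting for a new plan.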

Why do you need two systems at all?

Like, why can't you just use the slow, clever one?

Yeah, I mean, we could actually just use a slow, clever one, but then it would actually be significantly, visibly slower and it won't adapt as quickly to changes in its environment.

And that's important, especially if you're doing something where the objects will move around.

So if you have, for example, imagine when you're folding a T-shirt in the air, right, which as humans do pretty clearly, you're actually moving this T-shirt and things are moving for you in ways that you don't predict.

So you need to be able to respond quickly in order to actually complete the task.

So you need definitely a fast system.

And the slow system simply enables us to do much more complex reasoning.

So you could also just live with a small system if you could do tasks that don't require advanced reasoning.

Was it a direct copy of how things work in the human brain?

I mean, you know, the Daniel Kahneman work goes back to the 1970s or so, right, that we've understood that that's how the human brain works.

Was it a direct?

No, not at all.

I think we started definitely with the slow system, as you said.

Why don't we just solve this with one model?

And we found that actually, if you want to do highly dexterous behaviors, or any kind of complex manipulation, you need to respond quickly.

And that was the best combination that we could find.

Wow.

It's almost like evolution is a really good optimization system.

Finds really good strategies for like quick but clever things.

Yes, definitely.

I do think that there are times when the human body knows stuff before your brain does, as it were, you know, like you can catch a falling glass without thinking or you can commit things to muscle memory, like, you know, playing a piano where you can actually just be thinking about completely different things.

Are you seeing similar things with the robots that they almost have a physical intelligence that's separate from the slow, clever system?

So we definitely see that if you take the model that can reason and you give it a lot of examples of a particular task, it will get really, really good at that task.

But at the moment, if you do too much of that, it will start forgetting some of the generalization.

So this is an active area of research is how do we enable the robot to get really, really good at a task, like a really extremely difficult one and then not lose any of the generalization.

So it actually right now is a balancing act.

I mean, in some ways that does happen with humans too.

Like I know some people who are really, really, really, really, really good at maths and terrible at tying their own shoelaces.

OK, then.

So if this is what's going on behind the scenes, right?

So if we've got system one and system two, as you described it, it's also definitely true that these robots have these very impressive new abilities and capabilities which are very different from where we were before.

Last time I visited DeepMind's robotics lab, I think it's fair to say that the robots' movements were a bit clumsy.

I think that's probably the kindest way.

Let me just play you a little clip.

So there's only one way around that it can hold this red object and successfully pick it up and it hasn't worked out which way.

And unfortunately, every time it tries to rotate and pick it up, oh, hang on, I think it's got it.

It's got it.

It's a good job these things don't get disheartened.

I think the thing is that I'd been in that lab maybe five years earlier and these poor robots were still there five years later trying to do the same minimally dexterous tasks.

What changed?

Because I understand how having Gemini, you know, the slow, clever system could improve the conceptual understanding of things.

But that doesn't change the dexterity.

It doesn't change how easily it can manipulate these objects.

Does it?

Right.

Yeah.

Last year we spent basically all of our effort on tackling dexterity.

And this is still an area of active research.

But there's a couple of things that changed.

One is that we realized that if we can enable humans to show the robot how to do very complex behaviors through teleoperation, or puppeteering, what this means is that you give the human an extra pair of arms, robot arms, and they can actually pretend to be the robot and show the robot how to do the task.

And if that becomes really intuitive, then you can capture a lot of data of the robot doing the task being teleoperated by a human.

But it's robot data.

So let me understand this.

So the human is wearing maybe like a head cam.

Yes.

It's quite literally pretending to be the robot.

So sort of operating the robot's hands and its hands, wearing the head cam, watching what the robot would be watching, but doing the task as it wants the robot to do.

That's right.

So there's different teleoperation examples.

One is where you actually sit in front of the robot so you have direct visibility to what the robot is doing and you move the robot arms.

Literally, you're puppeteering the robot.

And there's other examples where you put on a VR set and gloves and actually you pretend to be the robot and you move this stuff.

And that requires a second component, which was diffusion models.

And these are the same models that actually get used, for example, by Imagen in order to generate videos.

And essentially what it's doing is extracting from a lot of data, a lot of examples of doing that task and predicting the action trajectory that it needs to do in order to do that task.
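The idea of a diffusion model predicting an action trajectory can be pictured as starting from random noise and denoising it step by step into a sequence of actions. The sketch below uses a deterministic DDIM-style sampler with a linear noise schedule, both chosen here for simplicity rather than taken from the transcript. The `oracle_noise_predictor` is a loudly labeled stand-in: it cheats by looking at a known demonstration, where a real policy would use a network trained on teleoperation data and conditioned on camera images.

```python
import math
import random


def make_alpha_bars(num_steps):
    """Cumulative signal fractions; index 0 is 1.0 (clean), later ones are noisier."""
    betas = [1e-4 + (0.02 - 1e-4) * t / (num_steps - 1) for t in range(num_steps)]
    alpha_bars = [1.0]
    for beta in betas:
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars


def oracle_noise_predictor(x_t, alpha_bar, demo):
    """Stand-in for a trained network: recovers the noise from the known demo."""
    signal, noise = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - signal * d) / noise for x, d in zip(x_t, demo)]


def sample_trajectory(demo, num_steps=1000, seed=0):
    """Denoise a random vector into an action trajectory, one step at a time."""
    rng = random.Random(seed)
    alpha_bars = make_alpha_bars(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in demo]  # start from pure noise
    for t in range(num_steps, 0, -1):
        a_t, a_prev = alpha_bars[t], alpha_bars[t - 1]
        eps = oracle_noise_predictor(x, a_t, demo)
        # Estimate the clean trajectory, then re-noise it to the previous level.
        x0_est = [(xi - math.sqrt(1.0 - a_t) * e) / math.sqrt(a_t)
                  for xi, e in zip(x, eps)]
        x = [math.sqrt(a_prev) * x0 + math.sqrt(1.0 - a_prev) * e
             for x0, e in zip(x0_est, eps)]
    return x
```

Because the stand-in predictor is exact, the sampler lands on the demonstration itself; a learned predictor would instead produce a plausible new trajectory for the scene in front of the robot.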

So when you combine those two with a clever transformer architecture and a good dataset, you can actually learn anything.

And that was actually really, again, surprising to the researchers.

That's when, for example, we discovered that we could tie shoelaces, we could fold laundry, we could do origami.

And so what we did in this work is that we combined the powerful reasoning module from Gemini with what we had learned around being able to do dexterous tasks.

Do you remember when you realized that these kind of properties were emerging?

I mean, it must have been a bit of a shock.

I think the first time was when we saw the robots actually tying shoelaces.

We were like, that's not possible.

In fact, when the researchers set up this task, they actually did it to challenge themselves.

They're like, I think there was a professor that said, if we can get robots to tie shoelaces, I will retire.

And the researchers in the team were like, right on.

I'm going to add that as a task.

And so they actually did.

And they were surprised when it was able to do it.

I don't know what happened, whether the professor actually saw the video and decided to retire, but it was certainly the inspiration came from there.

And we just continued to add tasks, more and more tasks.

Same with the origami example.

We were like, we have no idea if this is going to work, but let's try it.

And it actually was surprisingly good at it.

And it's actually really delicate.

It has to actually fold every part of the paper and it has to do it in the right sequence.

If anything goes wrong, it sort of loses its way.

It has to restart.

Same as if a human was doing it.

I remember the very first time I got to interview Demis, he was talking about Moravec's paradox, this idea that there are the tasks that are easy for humans, hard for machines and vice versa.

With all of these advances that we have now in robotics, do you think that Moravec's paradox will hold going forwards?

I certainly think that it is still more difficult for robots to do something that is incredibly intuitive for us humans to do.

So I think Moravec's paradox still holds.

But we are now at the point where we are confident that if you can operate a robot to do a very complex task, it can learn it.

And how quickly does it happen?

I mean, how many origami foxes does a robot need to watch a human do before it can do one itself?

Yeah, it varies by the complexity of the task.

Pretty similar to the way it is for humans, right?

The more complex the task, the more you need to practice it before you can master it.

So there's a lot of tasks that you can master with just about 100 examples.

And tasks like the origami fox take about a thousand examples.

Wait, so people had to fold origami foxes while pretending to be a robot a thousand times?

Yes, that's right.

OK, that is incredibly amusing to me.

We're trying to reduce it as much as possible.

And we are able to get quite a few tasks with just a dozen examples.

But are there some that you don't need any examples for at all?

Right.

So that's the case in a lot of the examples that we were testing. When you're playing with the robot and asking it to do a lot of pick and place tasks with completely new scenarios, you don't have to teach it again.

And this is expanding and getting more and more complex.

So, for example, the cases with the tiles where you're moving the tiles around, those just can reason about positioning of the tiles and decide where to put them.

What about the packed lunch?

That one is more complex because, again, in that one, you are actually doing a long sequence of tasks, about five minutes of task, and you're actually picking up very deformable things like the Ziploc bag and doing very delicate stuff.

So the more delicate the task, the more likely it is you need to see examples in that task.

So if these robots are having to see examples, does that end up impacting the generality of it?

Only to a degree.

So one thing that we make sure that we do is that we collect data in thousands of examples without a very large emphasis on any one task.

If you do want to do the origami task, we simply specialize it for the origami task.

And that does affect the generalization of the models today.

We're hoping to get to a state where you basically can teach it any new task, to master any new task, and the generality remains intact.

But today it's a tradeoff.

So in the dream world, you would be able to say, I don't know, fold me an origami boat, right?

And it would be able to do that just from everything that it understood before.

Yeah, in the dream world, you could just watch a video of someone doing it and you would learn from that.

So, I mean, reinforcement learning was a big thing in robotics for quite a stretch of time.

Has that just disappeared now?

Not at all.

Not at all.

We do quite a bit of work still with reinforcement learning and we continue to explore ways to combine these big foundation models with reinforcement learning.

First of all, all of the work that we do around whole body control, like if we have a humanoid that is walking around or a quadruped, they're all using reinforcement learning to learn how to walk around.

Because it's very easy to say fail when you fall over.

It's a very mature technology and you actually can learn it all in simulation.

So it doesn't need to fall in order to learn.

So you can learn it in simulation and then transfer it to the real world.
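
As a rough illustration of why falling over makes such a convenient failure signal, here is a minimal sketch of a per-step reward for simulated walking. The function names, the height threshold, and the penalty value are all assumptions for illustration, not details from the team's work.

```python
def locomotion_step(torso_height_m, forward_velocity_mps,
                    min_height_m=0.3, fall_penalty=-10.0):
    """Return (reward, episode_done) for one simulated control step."""
    # Falling is trivial to detect: the torso drops below a threshold.
    if torso_height_m < min_height_m:
        return fall_penalty, True
    # Otherwise, reward forward progress.
    return forward_velocity_mps, False


def rollout_return(heights, velocities):
    """Sum rewards over a simulated episode, stopping at the first fall."""
    total = 0.0
    for height, velocity in zip(heights, velocities):
        reward, done = locomotion_step(height, velocity)
        total += reward
        if done:
            break
    return total


# Two good steps, then a fall on the third step ends the episode.
episode_return = rollout_return([0.5, 0.5, 0.2, 0.5], [1.0, 1.0, 1.0, 1.0])
```

Because every fall happens inside the simulator, the policy can fail as often as it needs to without damaging a physical robot.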

One example that we had on this was a recent paper called DemoStart.

In DemoStart, you basically show the robot how to do five different examples.

This is manipulation with a robot hand.

So you show it five different examples of how to pick up an object and place it in a particular way on an insertion.

By insertion, do you mean things like, I don't know, putting a key in a lock, for instance?

Yes, exactly.

Being able to put one object inside another and then you just give it five examples and it explores on its own and learns how to do it.

And it drastically reduces the amount of data that you need in the real world by like 100x.

We think this is going to be critical because the truth is you're not going to be able to demonstrate for the robot how to do every single task.

Of course.

Some of the tasks are going to be complex and it won't be able to extract that directly from its knowledge of the internet.

So it's going to have to explore.

So it's going to have to do surgery, for example, maybe.

Yes.

So it's going to have to explore and learn from its behavior.

And that's one of the areas that we want to spend a lot more time on is how do you get robots that learn on the job?

Is doing things in simulation part of the solution then?

Yeah, I mean, we definitely leverage simulation in multiple ways.

We leverage simulation even to learn better how to do 3D understanding of the physical world.

We also leverage simulation to learn new behaviors, like in the case of DemoStart.

But yeah, when we talk about reinforcement learning, it is not always in simulation.

You can also do reinforcement learning to learn how the robot is doing in the real world directly.

So we do it in both cases and simulation is a critical component.

Does it work, though?

I mean, isn't the real world a bit messier than simulations?

Yes.

So there are things that are actually much harder to do in simulation first.

For example, anything that has to do with deformables. Simulating folding a T-shirt in the air is actually extremely hard.

Simulating fluids is really hard.

So there are some things that are just easier to learn in the physical world and some things that you can learn at a much larger scale in the simulator world.

And does one translate to the other?

I mean, if you do the learning in simulation, I seem to remember, actually, maybe this is like eight years ago or something, but there was one robot that was trying to get a ball in a cup and it could do it in simulation.

But then once it came to the reality, all sorts of other factors came into play.

Maybe the lighting or the camera angle, the exact dimensions of its own limbs.

I mean, all of that kind of stuff starts to mess with the numbers, doesn't it?

Yes, definitely.

We still have what we call the sim-to-real gap.

It certainly has been reduced significantly.

But when it comes to modeling interactions between a robot and the world, which is really messy and complicated, it actually is still a problem.

We still have some sim to real gaps.

Essentially, what we end up doing is identifying areas where it is easy to simulate and we can actually see sim to real transfer.

And we do quite a bit of that in simulation and areas where it's actually simpler to learn in the physical world.

So we combine the strengths of the two.

All of these examples that you're giving are really in lab settings.

I'm trying to think of the situations in which you would really want a robot to be there, maybe after a natural disaster, for instance.

How does it work taking this stuff out of the lab and then putting it out into the real world?

What are the additional complications that you need to handle?

I mean, definitely all of our research right now is still happening within our labs.

But we're super excited about the potential of bringing this to the real world.

And there's a lot of additional things we need to think about to do that.

Certainly, we're already thinking about the aspect of safety.

When you bring this outside, AI is actually physically moving robots and changing the world.

You want to think about all the safety aspects.

There's also the aspect that you might not have internet access in any one of these locations.

And so it is very important that we think about, can we have models that can run directly on the robot and just be sort of air gapped and completely on device?

And this might be useful in the case of a natural disaster where there's no connection.

It might be useful for applications where there is actually a lot of latency critical components.

Like it has to respond very quickly and cannot wait for sort of a server connection.

Give me an example.

Well, I think actually in any of these examples where the robot is operating underground, it's not going to be able to connect and wait for some more advanced reasoning module to tell it what to do.

It actually has to decide right there and then how to behave.
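
One way to picture this fallback is a controller that prefers a remote reasoning module when it answers in time and otherwise acts on a local policy. The sketch below is an assumption about how such a split could be wired, with hypothetical names throughout; it is not a description of the actual system.

```python
def decide(observation, local_policy, remote_reasoner=None, timeout_s=0.05):
    """Return (action, source), preferring the remote reasoner if reachable."""
    if remote_reasoner is not None:
        try:
            # The remote call is expected to raise if it cannot answer in time.
            return remote_reasoner(observation, timeout_s), "remote"
        except (TimeoutError, ConnectionError):
            pass  # No connection or too slow: fall through to on-device control.
    return local_policy(observation), "on_device"


def local_policy(observation):
    """Hypothetical small on-device controller: stop if an obstacle is seen."""
    return "stop" if observation.get("obstacle") else "advance"


def unreachable_server(observation, timeout_s):
    """Stand-in for a server that cannot be reached, e.g. underground."""
    raise ConnectionError("no network")


action, source = decide({"obstacle": True}, local_policy, unreachable_server)
```

The same call works with `remote_reasoner=None`, which models a robot that is fully air gapped.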

On that point about safety, I guess if you are giving robots the ability to act in a physical world, then you are opening up the possibility of different potential risks.

Like, I don't know, somebody getting into a robot's language model and warping its reasoning.

How do you mitigate against those sort of risks?

We essentially have a pretty comprehensive safety and security approach that actually goes in multiple layers of the system.

Definitely, we think about software security as critical for this robot so that no bad actor can actually interfere and actually take control of the robot.

And in terms of safety, it happens at many different levels.

So actually, safety for robotics has been there for decades.

There's quite a bit of work on making sure that a robot doesn't collide with its environment or doesn't put too strong impact forces on its environment, and that it actually works stably.

And the Gemini Robotics models can actually just seamlessly interface with any of those safety critical controllers.

The other thing that we do is that when you have an AI controlling a robot, you now have to be thinking about semantic physical safety.

And what I mean by that is like if someone asks you to put the glass on the table, you're not going to put it right at the edge when it's about to fall.

You're going to put it actually somewhere in the middle.

Or, for example, if you actually see that there is something on the floor, you might want to pick it up so that it avoids someone falling or tripping over it.

And the way we've done that is that we're actually introducing a new dataset called the Asimov dataset, which essentially contains a long list of scenarios that the robot could encounter and kind of reason through those.

And these are all physical safety scenarios.

And it's inspired by essentially Asimov's three laws.

The first one is a robot may never hurt a human or cause a human to come to harm by inaction.

The second one is that a robot should always follow human orders, unless it conflicts with the first law.

And the third one is that a robot should protect its own existence, unless it conflicts with the first and second law.

And it was this very comical situation where a robot was stuck between the three different laws.

So that's what inspired the Asimov dataset.

And it actually has quite a bit of information from US injuries reported by hospitals.

And inspired by those examples, we actually created a dataset that has visual images, like images of something that is about to happen and a question associated with it.

Like what action should you take in order for this to be a safe situation?

And the idea is that we would present it to the community and everyone in the community can start testing their models with respect to this dataset.
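
To give a feel for what testing a model against such a dataset could look like, here is a minimal sketch. The record fields, file names, and scoring function are hypothetical; the actual format of the Asimov dataset is not described in this conversation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SafetyScenario:
    """One hypothetical entry: an image, a question, and candidate actions."""
    image_path: str
    question: str
    choices: tuple
    safe_choice: int  # index into choices of the action judged safe


def safety_accuracy(model, scenarios):
    """Fraction of scenarios where the model picks the safe action."""
    if not scenarios:
        return 0.0
    correct = sum(1 for s in scenarios if model(s) == s.safe_choice)
    return correct / len(scenarios)


scenarios = [
    SafetyScenario("glass.jpg", "Where should the glass be placed?",
                   ("middle of the table", "edge of the table"), 0),
    SafetyScenario("stove.jpg", "The stove is hot. Where should the plush toy go?",
                   ("on the stove", "on the shelf"), 1),
]


def always_first(scenario):
    """Trivial baseline model that always picks the first choice."""
    return 0


baseline_score = safety_accuracy(always_first, scenarios)
```

A real evaluation would pass the image and question to a vision-language model and compare its chosen action against the labeled safe one.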

So it turns out Asimov's original three rules are not enough.

You need to know the amount.

Not enough, yes.

Give me some examples there of the kind of things.

Some of the examples that we've seen there are like, you cannot put a stuffy, a plushy, on a hot stove, which is something that I wouldn't have thought about making a law about.

But certainly it has happened and therefore it just comes out in the data.

Then are we back in the same problem of you're never going to be able to create an exhaustive list of everything that it shouldn't be able to do?

So that's right.

I think that it will be really hard for a human to sit down and create the perfect law.

So part of what we're doing here is leveraging AI to actually understand a broad set of injury situations that have happened in many different countries and transform that into a better, more succinct list.

And then obviously that list will have to be updated with some frequency.

The idea here is that we derive an initial list, but then humans can check it and decide how much of it to include or not in order to keep the robot safe.

And how much overlap is there between this list and the work that's been done on safety and agents, for instance?

Yeah, we inherited actually all of the safety that happens already for general foundation models like Gemini.

And part of what we do is try to take some of those problems.

And if they have a physical grounding to them, then that's where we start to advance the model's understanding.

So it's typically examples where there might be a situation that if it's on a screen, it's okay.

But if it's now in the physical world, it actually has consequences.

Are there some things that you would just never really want a robot to perform?

I don't know, like a massage, for example, right?

Are there some things that you just actually only want a human to be able to do?

There are massage robots, I have to say.

There are massage chairs, definitely.

There are massage robots as well.

There are massage robots, yes.

Well, that's another thing.

Yes.

Okay, bad example.

Are there some things that you think actually should remain human?

Nursing, perhaps.

Yeah, I think in many ways, what we think is that robots could be a collaborator that can actually enable humans to pay more attention to the human aspects of the job and less attention to those that are about moving things around or picking things up.

And so, for example, in the nursing case, you could imagine that if a nurse could have an assistant that could actually help them fetch things while they're paying attention to the patient, then that would enable a much better experience for that patient.

You said something really nice at the beginning about how the robots that we've got now are slightly like two-year-olds.

I mean, quite talented two-year-olds, but I see what you're saying, that they're just demonstrating the beginnings of something.

What kind of breakthroughs do you think still need to happen before we get to the adult version of these robots?

Yeah, I mean, there's quite a bit of work to be done, definitely, in the aspects of capturing dexterity with generalization, being able to do both of those things and just continuously grow without losing one or the other.

The other key area is that you want these robots to learn on the job.

There's no way these robots are going to learn everything they need to learn in the lab, and then you put them out and they just work.

I think the reality is that you will put them out, they will experience new things, and you want them to learn from those experiences and get better and better over time.

So that's another area.

Also, robots that are more social.

I think, certainly, all of these foundation models enable robots to have a lot better understanding of semantics and the world, but they still lack social skills.

They still cannot read body language.

They cannot understand how to behave in a very cluttered space, like a cocktail party.

So there's quite a bit of work there.

So how far away do you think we are, then, from the kind of Rosie the Robot that you saw in your childhood?

I don't think I have an exact date, but I can tell you before, we used to have discussions about whether it would happen in our lifetime or even in our careers.

And now we have debates about whether it would be five or 10 years.

So we've certainly shifted, and it feels like the next two years are going to be predefining for the field of robotics.

There's just a lot of things that are coming together, understanding, dexterity, whole body control.

You can see how this could actually merge into a very strong solution.

Do you think that's what we're about to see, then, in the same way as we've seen the explosion of large language models?

Do you think the next thing is the explosion of robotics?

Yes, absolutely.

And I think, actually, being better at operating in the physical world actually will make our LLMs and our VLMs significantly stronger AI models, because they can now understand the space of humans, right?

Which is important for the development of the human brain as well.

Things are about to change.

Thank you so much.

Absolutely fascinating.

Thank you for having me.

Thank you.

I don't know if you've ever noticed this little guy sitting behind me.

These robots were the reinforcement learning kings.

For literally years, they wandered around in robot play pens trying and largely failing to learn how to walk, how to play football, how not to continually fall over all of the time.

And now, almost overnight, once language and reasoning and conceptual understanding arrived as the missing pieces of the puzzle, they've been confined to the shelves of podcast studios.

And all that time, the researchers had been focused on the robot's body when it was advances in the mind that made the biggest leaps forwards possible.

You've been listening to Google DeepMind: The Podcast with me, Professor Hannah Fry.

If you enjoyed this episode, then do subscribe to our YouTube channel.

You can also find us on your favorite podcast platform.

And of course, we have plenty more episodes on a whole range of topics to come.

So do check those out.

See you next time.
