Google DeepMind: The Podcast · 2025-08-22

Genie 3: Google DeepMind's Real-Time Interactive World Model

Hosts: Hannah Fry

Guests: Shlomi Fruchter, Jack Parker-Holder

world models, Genie 3, interactive video generation, autoregressive neural networks, agent training and simulation, SIMA agent, embodied AI, AGI, sim-to-real transfer, Veo video models, AI safety

Why it matters

DeepMind's Genie 3 generates interactive 3D worlds at 24fps from text prompts

Key claims

Genie 3 is a real-time autoregressive world model that generates interactive 3D environments from text or image prompts without any underlying game engine or 3D representation
It runs at ~24 frames per second, predicting each frame from scratch given user actions and all past frames, similar to next-token prediction in LLMs
Demonstrations include walking into Edward Hopper's Nighthawks painting, exploring a Siberian hunter's lodge from a text prompt, and piloting a jet ski around Kauai
Memory and world consistency emerge from training; users can turn away and return to find the scene unchanged, while unexplored directions are generated stochastically using world knowledge

Episode summary

Google DeepMind researchers Shlomi Fruchter (Research Director) and Jack Parker-Holder (Research Scientist) joined host Hannah Fry to discuss Genie 3, a prototype real-time interactive world model that generates diverse, explorable 3D environments from text or image prompts. Unlike traditional game engines, Genie 3 is a neural network that predicts every pixel autoregressively, generating roughly 24 frames per second based on user actions and the past sequence. The model demonstrates remarkable consistency and memory—users can walk through an Edward Hopper painting, explore a Siberian hunter's lodge, or pilot a jet ski around Kauai—and exhibits emergent physics understanding for water, smoke, and gravity without explicit simulation.

Promptable world events allow users to inject characters or objects into a scene on the fly, useful for agent training and entertainment
DeepMind's SIMA (Scalable Instructable Multiworld Agent) has already been dropped into Genie 3 environments as a test case, with the environment unaware of the agent's goal to maintain authentic simulation
Current limitations include imperfect physics, poor handling of people and social interactions, and a persistent sim-to-real gap
The team positions Genie 3 as a foundation model for simulated worlds and a stepping stone toward embodied AGI, though they emphasize it is not a complete solution

Source material

Transcript

Welcome back to Google DeepMind, the podcast.

I'm Professor Hannah Fry.

Now, the latest video generation models have impressed the entire world.

They've created this near-perfect imitation of reality.

But the limitations of video is that you are just a viewer rather than a participant.

And that's not how humans experience the real world, right?

We instead can navigate environments we've never been to and still have an expectation of what we're likely to encounter.

We can explore in every feasible direction, kind of without limits, and interact with things that we chance upon along the way.

And that is the next great frontier for this technology, to move beyond generating a perfect recording of a scene and towards building a dynamic simulation of a world we can finally step into.

Enter Genie 3, a prototype world model that can generate an unprecedented variety of interactive environments.

It's already been described as a stepping stone towards AGI.

And with me today are two of its creators, Shlomi Fruchter, Research Director, and Jack Parker-Holder, Research Scientist.

Welcome to the podcast, mate.

Thanks for having us.

Okay, let's get straight into it.

What is Genie 3?

It's a real-time interactive world model that allows you to create diverse, visually interesting worlds from a text prompt.

So there is no underlying game engine, no structure, no code.

It's just a neural network that's predicting every single pixel in reaction to inputs from the user and also the past.

And so the flexibility and the diversity of things you can create in basically no time is quite unprecedented.

You haven't had a whole army of artists sitting in rooms constructing a world in order to be able to interact with it.

Yes, I think the key point is that you can create any world that you can imagine, right?

And that's not something that you can do with a game engine, right?

Well, let's, okay, let's have a look at it because you've got some demos for me, right?

Yeah, so we have a few.

The first one, I think you might like it.

So it's basically playing a cat.

Okay, you've got me already.

Ginger cat, not so.

Excellent.

And what you have here is a beautiful ginger cat wandering around an apartment.

It's very beautifully furnished.

It's got these nice Persian rugs and wooden floor.

It's got a sofa that it's currently trying to jump on, but it's not doing it itself, right?

You're prompting its movement.

Yes, exactly.

I'm just using the keyboard to control the cat.

So I can look around, move the cat, basically tell it where to go so it can jump over the sofa.

And I really like walking into the sunlight here.

So this is reacting to the inputs that you're giving it.

Yes, exactly.

Is the light gonna change as you go into the...

Look at that.

Yes.

So the model is basically trying to predict what's going to happen next based on the sequence of inputs that it gets.

And it does it in real time.

I mean, there's also sort of the detail that you're seeing in this 3D environment.

It's also quite reminiscent of some of the stuff that we're seeing with Veo.

Like if I didn't know that you were interacting with it, how is it different from that?

So when you create a video using Veo, you provide a prompt and then the model is trying to figure out how to create this entire video of, say, eight seconds from start to finish.

And once it's ready, then you cannot change how the camera moves around.

And definitely you cannot explore it much more than just eight seconds.

Right.

Can you use an image to prompt this or is it only text?

Yes.

So we just found out that we can actually use images and videos to prompt the model.

And in this particular case, we found that we can actually use paintings.

For example, this is Nighthawks by Edward Hopper.

A very famous picture.

A very famous picture from 1942.

And basically we asked Genie 3 to let us walk into the painting.

So this painting is of a very vivid image, a street corner at nighttime.

You're looking in through the glass to see a man and a woman leaning up against a bar and then someone serving drinks the other side of the counter.

It's got these rich greens, the pavement underneath, the way the light falls is really, really evocative.

And we were able to walk around this environment and kind of like maybe get a sense of how it looked in the artist's mind.

And turn around for me.

Turn around to the bit that the artist didn't paint.

Yes.

So yes.

So we can go down there.

Let's see.

It works.

It's actually pretty cool because you can imagine when you look at the picture in the first place that it was meant to be a very dark city.

But then that one spot illuminated and the model, I guess, kind of went with that.

This is proper open world stuff there.

Yeah.

And the nice thing is that if we go back, it's still there, right?

Like the model is generating the world in a way that is consistent.

So you can just go on and go back to where you've already been.

It has a memory of what existed before.

Yes, exactly.

Have you got another one for us?

Yeah.

We have one that we're just trying the jet ski around a few islands.

So let's see how it...

Tell me the original prompt.

So it's sailing a jet ski through the waters around the islands of Kauai.

Sounds dreamy.

The waters have different ramps that we can go up on.

Yeah.

Okay.

Let's see.

All right.

Here we go.

Okay.

So we are sort of the POV of the person on the jet ski.

We can see the hands in frame.

Both hands are consistent with each other, I should add.

The water is beautifully still.

You can see these islands in the background.

The sun is quite low in the sky and you can see the reflection of its rays on the water.

Now I see you're taking us up a ramp here.

It's raining.

A bit too slow.

I don't know.

What are you going to do on the way down?

Oh, and it splashes when it hits the water.

Yes, it looks bad.

And when you look around to the back, there's a trail in the water, exactly as you would expect from a real jet ski.

Yes, exactly.

Yes.

So are you seeing elements of where it's understanding physics?

Definitely, yes.

There's some things we, I guess, refer to them as emergent properties, where just from having sort of general training and seeing lots of different things, when it sees a new scenario, it understands sort of like how smoke moves or how water should flow.

But that's not maybe 100% accurate in every single setting, but it's got enough accuracy that you do feel some sense of being in the scene.

And as humans, we can't obviously spot these things that are wrong with it.

So I think, as Jack said, like it's pretty, there are definitely limitations.

But on the other hand, you know, I've worked on game engines in the past, and I think we worked really hard to make all of these kinds of effects independently, like the water simulation, everything.

And here, we have basically a model that can do all of that out of the box with some limitations.

Without even really trying.

And there's other things that can do that would be almost impossible to get through other methods, right?

Like simulating other animals and people in the world.

I think that's something that's really exciting more in the future as well, is be able to interact with other agents in the world.

Yeah, I mean, this totally interactive environment that ends up being consistent regardless of which direction you push it in.

I mean, that's really extraordinary.

I mean, these demos kind of demonstrate the proof of concept, I guess.

But how would you see this being used?

What are the kind of applications that you'd be looking at?

One thing that we are very excited about is using it for actually the simulation environment for agents.

So for example, you can imagine an agent that wants to accomplish a goal, right?

And then we can put them in any environment that we can imagine, maybe one that is more challenging for it.

And then it can explore the environment, maybe try and accomplish a goal and learn from its mistakes again, without doing anything in the real world, which is very expensive.

Another thing is that we're excited about is actually using those simulations for planning.

So if you have a robot or you have some, again, an agent that wants to accomplish a goal, they can maybe do some rollouts in the simulation, figure out what might happen.

For example, if they want to go across the road and the agent can use the model to predict a few options, maybe there is like a few scenarios, maybe a person going to cross its path, maybe something else going to happen.

And then using those rollouts, it can decide what's the next action it should take.

And then this is used for planning.
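A minimal sketch of the planning idea described above, assuming a toy stand-in for the world model. The function `sample_outcome`, the action names, the 30% risk figure, and the scores are all hypothetical and chosen only for illustration; nothing here reflects how Genie 3 is actually used for planning.

```python
import random
from typing import Dict, List


def sample_outcome(action: str, rng: random.Random) -> float:
    """Hypothetical world-model rollout: score one imagined future for an action."""
    if action == "cross_now":
        # Assumed for illustration: in 30% of imagined futures someone crosses the agent's path.
        return -10.0 if rng.random() < 0.3 else 1.0
    # "wait" carries a small time cost and no collision risk in this toy setup.
    return -0.1


def plan(actions: List[str], n_rollouts: int = 200, seed: int = 0) -> str:
    """Imagine several futures per action and pick the action with the best average score."""
    rng = random.Random(seed)
    scores: Dict[str, float] = {}
    for action in actions:
        total = sum(sample_outcome(action, rng) for _ in range(n_rollouts))
        scores[action] = total / n_rollouts
    return max(scores, key=scores.get)


best_action = plan(["cross_now", "wait"])
```

The agent never acts in the real world while deciding; it only compares averaged scores over imagined rollouts and then commits to one action.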

And beyond that, we just see a lot of applications for education, for entertainment.

Just anchor this in a few examples for me.

So I mean, what's the idea here?

Are you going to, in a history lesson, be able to create a world of Victorian England, for instance?

Exactly.

So imagine you are in front of a bunch of students, you know, they're obviously excited to learn Victorian England, but they've also got a lot of other distractions, things that they're interested in as well.

Instead of just reading a textbook, you can instead allow them to kind of step into the world, right?

And you can take them on a virtual tour, in a sense of what it would have been like to be there.

So for places that maybe are harder to access, maybe far distant corners of the planet or perspectives that you couldn't otherwise get.

So being a jaguar, for example, or other kind of animals or being a shark, you know, or going back in the past.

These are things that you couldn't really get any other way as an experience.

And for more visual learners in particular, I think it might resonate more with them.

And that's when you have the human who's playing with the controls.

But as you say, if you then let an agent loose in this, that opens up a whole other level of possibilities.

Yeah.

So once you have a real or very close to real simulation of an environment, then the agent can use it instead of actually learning in the real world, which is very costly, right?

If an agent makes a mistake, or a robot makes a mistake, in the real world, it's much harder to kind of like fix. So basically this is a way for agents to learn in a simulated environment where we can control everything, we can set up some environment that is maybe more challenging or less predictable than what typically the agent was trained to do.

And this way, basically, the agent can improve in this safe simulation.

So we're very excited about that.

So let's say that you were maybe running a factory and you wanted to install a robot with a particular task, you can recreate the precise environment it'll find itself in and allow it to, well, find its own mistakes.

Yeah, I mean, that's a great example, because this is already something that we're close to being able to do, like, already, right, robots are getting quite capable.

But I think what's even more exciting is things that it's quite far away from being able to do that this could completely enable and unlock.

So having robots and embodied agents like actually in the real world, the diversity of possible scenarios is just quite hard to fathom, I think for our current systems.

I think of this example of like, what would an embodied agent do on Halloween, right?

It's maybe one day a year, it sees children running around in costumes, like, what would it do the first time it sees this, it's quite a challenging scenario to prepare for.

And even if you've seen it before, it might be different in the next year.

And so to really be able to simulate these rare events and be able to text describe any imaginable world to become robust to it, make sure the robots are safe, or agents are safe, they understand all these different things, but they can also learn from their experience as well, which we know is really important.

But then I also wonder, I mean, all the examples that we've given so far has been it's sort of experiencing the real world at the human level, as it were in terms of our size, could you kind of shrink this down and create a simulated world that was at the level of molecules or the human cell, for instance?

Yeah, so we tried that.

And we actually have a few examples where we like walk, like basically moving around like a blood vessel, or so yes, it's not necessarily always biologically accurate.

But this is not, we don't think it's a fundamental limitation, like we could probably, if we had more accurate simulations that models can be trained on, then we see that in the future, maybe we have other variants of models that are able to kind of specialise in this particular environment.

But we did try to focus more on the real world from the eyes of a person, because we think that's the kind of the most widely applicable, in terms of the generality of the model.

Is this about just utilising the existing developments in artificial intelligence?

Or is this also about kind of stepping us closer towards the goal of AGI?

We think this is definitely a kind of a new kind of foundation model.

And that's why the breadth of applications is so wide, but also so nascent, because we've never really had this kind of model before, right?

It sort of blends and combines ideas from what we've seen in language models, and also video models, right, with some of the techniques that we use.

And so it's kind of combining these different elements in something quite new, which I think is what's so exciting about it, right?

And this breakthrough that we've made is that it might enable some completely new applications that we didn't really have before.

And so it's still quite an early stage for this research, but we're quite excited to see what the next few months bring.

Well, let me dig into that a bit deeper then, because I know that your background is much more in Veo and the video side of things.

But you were working on Genie 1 and Genie 2 before you and your team came on board.

What were you doing there?

What was the inspiration?

How is it different from this iteration?

So before Genie, I was working on open-ended learning, so training agents in large simulated environments where we could configure different components of the world.

So we worked on the XLand project.

And the idea there was basically with procedural generation, generate wide variety of environments that were still specified in code and have the agent learn from the different experiences and to become a generalist agent in simulation.

But ultimately, we were bottlenecked by the availability of environments.

We were also, in PhD, working a bit on world models as well, but with much more limited, constrained setting, we were training world models from single environments, typically quite low-dimensional ones.

And the dream was really to combine these ideas, to learn general world models that could be used as simulators for any imaginable tasks, and then train agents in them to solve completely new things and have this kind of open-ended loop where we generate new worlds and the agents would learn from those.

We started with Genie 1 as kind of a proof of concept, like could we do this at all?

Could we generate new worlds that were interactive at all?

And that was quite a major breakthrough.

And then with Genie 2, we scaled this to any 3D sort of environment.

There were some things that emerged from the emergent properties in that work.

Tell me a little bit about those.

Were they expected?

So for Genie 2, the question for us was, can this idea scale?

Because Genie 1 was quite a simple proof of concept.

Does this work at all?

Whereas Genie 2 was like, is there something that could really scale to look more like what we see in foundation models nowadays?

And we weren't sure if it would really work.

We weren't sure if it would be consistent for very long because Genie 1 only lasted a couple of seconds.

We weren't sure if it would work at higher resolution because Genie 1 was 90p, which is very small images.

Genie 2 was 360p.

And then the type of diversity of environments was a significant increase.

And so given all of that, we weren't sure if it would be possible to have a single neural network that could simulate anything within that domain.

And so when we actually got the model, it was definitely an emergent property for completely new worlds.

And we used Imagen 3 to generate the starting frames back then.

It could do things like simulate smoke or when you drove off the side of a cliff, the car would have gravity.

Or if you landed in a puddle, it splashed.

And it was quite surprising that this worked so well.

And that gave us the confidence that this next step with Genie 3 would be possible.

So Genie 1, 2D, Genie 2, 3D.

So with Genie 1 then, you're like, right, we want to be able to create any sort of environment.

And you feed in, I mean, it was platform games, wasn't it?

Yep.

Just tons and tons and tons and tons of footage of platform games.

Exactly.

And then were there any emergent properties in that?

Well, the fact that you could generate completely new ones, I think was surprising.

I don't think that had really been shown before.

Even a painting, right?

Yes, actually.

Yeah, I mean, you know the work better than me.

We even had a picture of my dog in the park and you could kind of move her left and right like a platform game.

But of course, she's not one.

We also had, Jeff Clune was an advisor of the project.

His children did a bunch of drawings.

And we were able to kind of animate those and move them around like games.

And I think it's safe to say that that was not in the training data.

Right.

So I am happy to call that an emergent property.

It looked quite different to what it was trained on.

And thanks for reminding me because it's been a couple of years.

And then the step between Genie 1 and Genie 2 was making it 3D.

Yeah.

So it was increasing the diversity of the things it could do.

So compared to Genie 1, which was just 2D platform games, it was 2D games, but as well as 3D environments as well in the same model.

It was also higher resolution and there was a lot more consistency.

So it would last a bit longer when you interacted with it.

So it was really a step up in capability in a few different dimensions, which required a much more concentrated effort to get that to be possible.

And we weren't sure if it would actually work at all.

So it was more of a proof of concept at the slightly larger scale.

And that gave us confidence that what we have now would be achievable.

So you were working on building an environment that you could interact with.

And meanwhile, in parallel, you were working on video generation.

Yes.

So my background is actually in 3D game engines.

And like, but very long time ago, I used to work on simulations.

That was kind of like where I started working on, you know, AI back then we didn't use even AI, we call it ML.

But in the last few years, I really got more excited as the technology evolved and then worked on image models, video models.

And yeah, in the last two years, video models, as we all know, like got to new kind of new levels of realism.

And I remember looking at one of the, I think it was the Imagen Video models, and just saying, like, how is it possible that there is a full simulation of a world just, like, in this model?

It's just mind blowing, right?

If you think about the level of realism that those models even initially achieved compared to simulation using like 3D graphics methods.

And then we've, we've, what we tried to do is basically to build the best possible video model.

And when I saw the results, I started thinking, okay, what happens if can we do that in real time?

And then I was obviously following the work by Jack and team.

And I think that was also very inspiring.

And we just said, okay, we have to go to the next level.

So what were the elements from Veo that you wanted to combine or learn from for this project?

Because it's not just the visual aesthetic.

The quality and the realism is something that we really invested in.

And I think there is a level of realism where I think we kind of got to it with Veo 2.

The physics are not perfect, but it's good enough to start to be useful, right?

So we can actually create some scenes that are indistinguishable from real footage, right?

Not everything, not all the time, but it's starting to be there, right?

With Veo 3, we also added audio and other stuff.

So taking it to the next step of basically making this interactive was kind of like the obvious next step, but technically it's quite challenging.

And especially in terms of how fast we should create the next frames, right?

And that was one of the core challenges for the project, for Genie 3.

I think the approach that we have overall is to try and understand how those models train, how they are being trained, how they learn, and the same kind of like principles that helped us scale and improve Veo, we found them to be useful for Genie 3 as well.

It seems to me, okay, as an outsider, right, it seems to me that the objective of Genie and Veo are, I mean, even though they look visually quite similar, the objectives of them are quite different, right?

Like Veo, you're trying to create this very realistic, non-interactive environment, whereas with Genie, you need to make this consistent, explorable world that you can move around in.

Do you have to start from scratch or is there like kind of swaps that you can make?

So I think in a way, if you think about the video model, right, you can tell it, okay, I want to walk around maybe a volcano or I want the camera to move that way or another.

So what it does is basically look at this entire video and try to create a coherent eight-second, or however long, video, and it can change the past and the future at the same time.

I think that's the property of video models that's very different from Genie 3.

Because it spits out the end result.

Exactly.

And you can think of it as a painting that you can change all the time, everything on the canvas.

And that's, in a way, much easier than doing it in what we call autoregressive, or just extending one frame at a time.

So I think that's the fundamental difference.

There's still a lot of similar aspects, for example, the way that we have eventually to take some text input and convert it to some kind of visual output.

So that's somewhat similar.

So I would say definitely we don't start from scratch.

We build on top of, I think, a lot of work in this space, but there are definitely some novel properties that we had to figure out.

Well, that's super interesting, actually.

The idea that time is the key component in this, that you have to understand the past and march forward to the future.

That's where the autoregressive stuff comes in, right?

Exactly.

So essentially every frame you see is generated from scratch at that point in time.

So things that happen later in the interaction aren't known yet.

Right.

And things that happen in the beginning have to all be remembered by the model.

So essentially, if we're doing 24 frames per second, it's like doing image generation 24 times per second, each one completely generating from scratch, given all of the past and the actions of the agent or human player.
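As a rough illustration of the loop just described, here is a minimal sketch in which each new frame is produced from the full history of frames and actions. The "model" is a hypothetical stand-in (`predict_next_frame`) and a frame is reduced to an (x, y) camera position; only the 24 frames-per-second figure comes from the conversation.

```python
from typing import Dict, List, Tuple

Frame = Tuple[int, int]  # toy "frame": just an (x, y) camera position

FPS = 24
FRAME_BUDGET_MS = 1000 / FPS  # roughly 41.7 ms available to produce each frame

MOVES: Dict[str, Tuple[int, int]] = {
    "left": (-1, 0),
    "right": (1, 0),
    "up": (0, 1),
    "down": (0, -1),
}


def predict_next_frame(history: List[Frame], actions: List[str]) -> Frame:
    """Hypothetical stand-in for the model: it receives all past frames and actions.

    This toy version only needs the latest frame and action; a real world model
    would have to use the whole history to stay consistent.
    """
    x, y = history[-1]
    dx, dy = MOVES[actions[-1]]
    return (x + dx, y + dy)


def rollout(first_frame: Frame, actions: List[str]) -> List[Frame]:
    """Generate frames one at a time; future actions are unknown when a frame is made."""
    history: List[Frame] = [first_frame]
    taken: List[str] = []
    for action in actions:
        taken.append(action)
        history.append(predict_next_frame(history, taken))
    return history


frames = rollout((0, 0), ["right", "right", "up"])
```

At 24 frames per second, the per-frame budget works out to about 41.7 ms, which is why generation speed is a central constraint for a real-time model.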

There's an analogy here, though.

I mean, that's sort of the way that language models work, right?

Yes, exactly.

So I think it's really a good example.

So we know that language models are basically being trained to predict the next token or word, right?

So they look at text and try to guess, okay, what's the distribution of words or tokens that happen to follow?

And what we have with autoregressive world models, we actually have a similar problem.

We want to basically predict the next observation, which is pretty much visual, the next frame, given what was already seen.
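To make the language-model side of this analogy concrete, here is a minimal sketch that estimates a distribution over the next token by counting which token follows which in a tiny text. It is a bigram counter, a deliberately simple stand-in and not how modern LLMs or world models are built; the function name is hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def next_token_distribution(tokens: List[str]) -> Dict[str, Dict[str, float]]:
    """For each token, estimate the probability of each token that follows it."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {tok: n / sum(followers.values()) for tok, n in followers.items()}
        for prev, followers in counts.items()
    }


dist = next_token_distribution("the cat sat on the mat".split())
```

Here `dist["the"]` splits its probability evenly between "cat" and "mat". An autoregressive world model faces the same kind of problem, with the next frame in place of the next token.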

And the nice thing I think in this parallel to LLMs is that LLMs learn from that very simple task, a very rich, potentially representation of the world or how people think, how people solve problems, for example.

And I think the reason that world models, especially autoregressive models, are kind of exciting for us is because maybe through that task, that is pretty much a simple task that anyone can understand, they have to learn the dynamics of the world.

And I think it's nice to, if you really think about it, it's kind of a superset of intelligence, because through the world, if you say, okay, now I'm playing chess with some grandmaster, and what's the next visual or the next frame would actually be their next maybe move, right?

So of course, our model is not capable of doing that, but at the limit, it kind of goes very, very far.

But it's these ideas of understanding the past, the context and being able to predict the next move in the future.

Which is also quite powerful, right?

Because it means you can start in the same location and do many very different things.

So from an agent's perspective, there could be a quite simple task, but you want to get really good at it.

And so you can simulate various different scenarios.

And this is very analogous to reinforcement learning paradigm, right?

Where you have this env.reset function that brings you back to the same state.

And then you want to get more experience from there.
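The reset-then-explore pattern from reinforcement learning can be sketched as follows. `ToyWorld` is a hypothetical environment written for this illustration, loosely following the common reset/step convention; it is not the API of any particular library and has nothing to do with Genie 3's interface.

```python
from typing import List


class ToyWorld:
    """Hypothetical one-dimensional environment: the state is a single integer."""

    def __init__(self, start: int = 0) -> None:
        self._start = start
        self.state = start

    def reset(self) -> int:
        """Return to the same starting state every time."""
        self.state = self._start
        return self.state

    def step(self, action: int) -> int:
        """Apply an action (a signed step) and return the new state."""
        self.state += action
        return self.state


def episode(env: ToyWorld, actions: List[int]) -> List[int]:
    """Run one episode from the shared start state and record every observation."""
    observations = [env.reset()]
    for action in actions:
        observations.append(env.step(action))
    return observations


env = ToyWorld(start=5)
first = episode(env, [1, 1])
second = episode(env, [-1, -2])
```

Both episodes begin at the same state and then diverge, which is the property that lets an agent gather varied experience from one starting point.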

Well, let's inch through this then.

So the very first frame, you can have an image, like you did with the painting, but you can also have a text input, right?

Yes, of course.

I mean, I know you asked me earlier to describe somewhere I'd been.

And I got to go to a Hunter's Lodge in Siberia once.

It was sort of the most impressive thing I could come up with.

And this is what I sent to you.

I went to a Hunter's Lodge, sat on a reindeer skin in the woods outside of Yakutsk in Siberia, drank vodka and ate frozen calf liver.

Yes.

So what did you do with that?

So first we were surprised by the prompt, but then we just tried to put it into the system.

And the system is basically able to add some more details to the prompt that you provided.

But still, it does follow the key elements that you've provided, right?

Yeah.

So just to describe what we're looking at here, this is, I mean, exactly as described, it's a Siberian forest.

There's snow covering the ground.

There's this outdoor table with a single plate and a bottle of clear liquid, which I can only assume is vodka.

And in the distance, there's a little wood lodge with a fire burning inside.

Actually, there's two of them there, one on either side.

And then you can kind of see the little grass poking out of the snow as you go off into the distance further into the forest.

It's really amazing.

The light, again, you guys are very, you love making the light late afternoon, the beautiful golden hour.

But so this, that first frame then is generated in the same way as you might find in Veo.

Yeah.

So the model doesn't treat it like in any special way.

It's just like you give it a text and it just, it just starts outputting frames and it doesn't do any preparation before it.

It just throws you into the world and you can go wherever.

And then it's from that first frame that the prediction goes forwards.

Exactly.

Yeah.

So it's like text to first frame, then first frame, first action to next frame, and so on and so forth from that point onwards, basically.

And how critical is the exact wording of the prompt here?

I mean, can you get better worlds with better prompts?

Yeah, I think that's definitely true.

There's an art to prompting, I think, all of these modern models and some people are better at it than others.

And fortunately, we have some people who are much better than me at this.

This can work pretty much out of the box.

This one, yeah, they often work pretty well, especially when you have a very vivid description, like the table with the vodka and the frozen calf liver.

Sometimes you can try something and not quite capture exactly what you wanted first time.

And then you can iterate a little bit on the prompt and then get something that's much more like what you wanted.

And does that require you to regenerate the world or given that it's marching forwards, can you add things in on the fly?

So we do have ways to add things on the fly, what we call "promptable world events".

And this is just something that you can say, okay, now I want, for example, a balloon to fly in or some other character to show up.

This is something we're very excited about, because it also allows the environment to be more interesting for people, but also, as we mentioned, kind of more relevant for training agents in simulation, because then we can kind of throw in something that happens in the world that they have to adapt to.

So like in that example, maybe have a reindeer coming through?

Right, we can have a reindeer or we can have like another person is walking into the scene.

And then we have to, if it's an agent, basically they can respond to it.

And if it's just for entertainment purposes, then it's much more interesting than just walking in a world that nothing happens.

And this is all a key advantage of the autoregressive part, I guess, right?

That you do have control over the future.

Exactly.

Yeah.

So you can just inject things sort of on the fly as you're generating things in real time.
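
One way to picture injecting events on the fly is to treat an event as one more item appended to the history that every later frame is conditioned on. The sketch below assumes that framing; `rollout_with_events` and `toy_step` are hypothetical names, and the toy "model" simply reports which events it has seen so far.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Context = List[Tuple[str, object]]  # (kind, payload) items in generation order


def rollout_with_events(
    prompt: str,
    actions: Sequence[str],
    events: Dict[int, str],
    step_fn: Callable[[Context], dict],
) -> List[dict]:
    """Autoregressive rollout where text events can be injected at any step."""
    context: Context = [("prompt", prompt)]
    frames: List[dict] = []
    for t, action in enumerate(actions):
        if t in events:
            # The event joins the history; past frames are never regenerated.
            context.append(("event", events[t]))
        context.append(("action", action))
        frame = step_fn(context)
        context.append(("frame", frame))
        frames.append(frame)
    return frames


def toy_step(context: Context) -> dict:
    """Toy stand-in for the model: records which events are in its history."""
    seen = [payload for kind, payload in context if kind == "event"]
    n_frames = sum(1 for kind, _ in context if kind == "frame")
    return {"t": n_frames, "events_so_far": seen}


frames = rollout_with_events(
    "lodge in a snowy forest",
    ["forward", "forward", "forward"],
    {1: "a reindeer walks in"},
    toy_step,
)
```

In this toy run the first frame is generated before the reindeer exists, and every frame from step 1 onwards has the event in its history.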

The thing that I find extraordinary about this is the consistency, right?

Like the memory of the system, you turn away and then you turn back and it's exactly how you left it.

But presumably if you're turning in a direction you haven't yet turned in, then it's sort of a stochastic process, right?

So how do you balance it that it's sometimes a memory and sometimes generated statistically?

Well, it's kind of a mixture of things.

So you can in the text prompt specify things that are out of sight, which I think is quite powerful compared to image prompting, right?

Because you can say on the right is X, Y, Z.

And then when you actually look round to the right, it's there when you actually play or interact in the world.

But then there is also an element of the model using its world knowledge to generate things.

So in the Hopper example with the artwork, it does generate the street that we haven't seen before.

And you can't be exactly certain what it's going to generate.

The model uses its own sort of, I guess, intuition of what should be there.

And that's intuition based on having watched incredible amounts of video footage in advance.

So basically the model is trying to just generate a sequence of frames that is representative of the world, right?

So if it already saw some part or generated some part of the world, then the right thing for the model to do would be to kind of recall this memory and use it.

So if I look back to where I've already been, then the right thing for the model to do would be just to use the same thing.

But when you look to a new area, sort of from the model perspective, again, it can allow itself to generate something new because it hasn't been seen.

So the model doesn't really treat it in any fundamentally different way.

So the model kind of learns to basically balance the two aspects.

And again, all comes back to anchoring all of the generation to the prompt or what the user provided, right?

That's where the information to the generation comes from.

But then I guess going back to the language analogy that we were talking about earlier, it's not that surprising that once a statement has been established in a conversation, that it remains consistent when you refer back to it.

Exactly.

And we've seen with language models, this sort of ability to have memory and consistency has been something that's improved a lot recently, especially with the latest Gemini models now versus two years ago.

Yeah, I think what's interesting here is the size of the memory or the level of detail.

So if we think about I'm talking to Gemini and after a few sentences, it might refer to something I've said before.

That's great.

But the number of details that we have in those visual worlds is just like it's staggering.

The quality of the memory, considering how much detail and how much information it has to actually remember.

Well, how do you do that?

Does this require you having a sort of 3D representation of the world that you're in?

We don't use anything like that for this version of the model.

It's largely just learned as an emergent property, just from this autoregression.

You guys, your emergent properties, my goodness.

We're very good students of the bitter lesson.

Yeah, I think it's similar to if you're predicting the next frame, you just have to learn to remember these critical things in the past.

And obviously, the model has some representations that prioritize the important details, but it really is just predicting the next frame.

And is this analogous then to the fact that language models have this conceptual understanding and can describe the same things in different ways?

Is it the transformer architecture here that's allowing you to do that?

Yes.

So the architecture is, I think almost everything today is a transformer.

So yes, this is also a transformer.

Looking back, basically the model will see what was already generated, including the user provided inputs, and based on that, makes the prediction.

So I think the interesting question about explicit 3D is that the model probably had to learn some representation, but it's just not an explicit representation.

So we see that the ability of the model to understand 3D environments is very strong.

And to me, the most striking emergent ability is that it actually works on things it was not trained for, for example, taking an oil painting from 1942 and actually making it into some kind of a 3D environment, which is pretty much out of its distribution.

Absolutely.

I want to go back to this idea of it understanding physics.

Jack, if you've got this world that you can put an agent in, does that mean that you can test how well it knows physics?

I mean, could you get a hammer and a feather, for instance, and allow it to drop it at the same time?

You definitely could do that.

I think that would probably be close to the frontier of the model's capabilities at this point.

We've seen it's quite good at visual things and things that are more general concepts, right?

So you can imagine water occurs in quite a lot of different scenarios it's seen before.

Gravity has probably occurred in many of those, but probably not these exact objects.

But I think if you were to specify in the text prompt, like in the world, there is a feather that has less gravity and a hammer that is heavier, maybe then it would work.

Yeah, I think there are a lot of, those models are inherently visual, right?

They don't know anything about the world except for what we can see.

And that's a limitation, I would say, right?

Even for video models, some things sometimes don't make sense.

And the model has to guess if something's heavy, for example, right?

There is no real way for a model to look at an image and say, okay, that's how much this thing weighs.

So it kind of makes up the weight and then try to kind of simulate what would have happened.

And sometimes it breaks.

And we know that from video models, even the best video models.

And I think what we have is basically we solve an even harder problem, because we have to, as we said, we can't fix the past, right?

We have to go with once the model generates something, it has to roll with it.

And I think we have made really good progress in terms of simulating things like fluid dynamics, but maybe, you know, some other aspects of the world would not be physically accurate because of those limitations.

Well, tell me about some of the work that you've been doing with agents so far, because actually, on a previous episode, we got to talk about SIMA, the Scalable Instructable Multiworld Agent, which they were putting into existing computer game environments.

But I mean, you can now take SIMA and put it into these generated environments, too.

Exactly.

So the cool thing with this is that at Google DeepMind, we have these agents that are trained to be general; the M is multi-world, as you just said.

And so what we're able to do is take the worlds we generate, and then test whether these agents and the latest versions of them are able to already use them for agent training or collecting experiences or evaluations as they are.

You can say to it something like navigate to the robot over there.

And then you can give it an image from the world and it can take the first action.

And then from that point onwards, it's interacting in the world through actions.

But the Genie 3 model does not know what the goal is, right?

So that makes it a genuine simulation, as opposed to telling Genie 3 the goal that the SIMA agent is trying to achieve.

It might make the experience not authentic, right?

Because it might make it happen in an incorrect way.

For it.

Yeah, exactly.

I guess if you've got a robot in the real world, the world isn't helping the robot.

Exactly.

Like you can say, okay, the robot has to go to fetch the red cube and then it looks to the left and there is a red cube, right?

It's a bit making it.

Yeah.

And this is a problem that we've encountered in other scenarios, but you don't really get this problem if you have this separation between the agent and the environment.
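
The separation between agent and environment can be sketched with a toy example in which only the agent holds the goal and the world only ever receives actions. Everything here is an assumption for illustration: `WorldModelEnv` is a one-dimensional corridor standing in for a generated world, and `GoalAgent` is a trivial stand-in for an agent such as SIMA.

```python
from typing import Optional


class WorldModelEnv:
    """Toy world: a corridor of `length` cells. It never sees the agent's goal."""

    def __init__(self, length: int, start: int = 0) -> None:
        self.length = length
        self.pos = start

    def observe(self) -> int:
        return self.pos

    def step(self, action: int) -> int:
        # The world reacts to the action only; it cannot bend itself toward a goal.
        self.pos = max(0, min(self.length - 1, self.pos + action))
        return self.observe()


class GoalAgent:
    """Toy agent: the goal lives here, not in the environment."""

    def __init__(self, goal: int) -> None:
        self.goal = goal

    def act(self, obs: int) -> int:
        if obs < self.goal:
            return 1
        if obs > self.goal:
            return -1
        return 0  # the agent believes it has arrived


def run_episode(env: WorldModelEnv, agent: GoalAgent, max_steps: int) -> Optional[int]:
    """Return the step at which the agent reports success, or None on timeout."""
    obs = env.observe()
    for t in range(max_steps):
        action = agent.act(obs)
        if action == 0:
            return t
        obs = env.step(action)
    return None
```

Because the corridor knows nothing about the goal, a goal that does not exist in the world (for example `GoalAgent(20)` in a corridor of length 5) simply never gets reached, and the episode times out rather than the world conveniently producing it.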

And that's the really nice thing of working with agent teams that are focused on building really capable agents, but then having them access our environment as if it's any other environment, right?

So the new worlds that are created by Genie 3 look the same as the existing worlds that the SIMA agent was trained on.

But then what if there isn't a red cube?

What if it searches around forever and there is no red cube?

So yeah, I think that's part of the things that we're looking into is how we can add more details as we go, right?

Like for example, if you put the agent in some room and then it has to open maybe a drawer and find something in there, right?

So we want to be able to inject events into the world and control it.

So I think this is kind of like the interesting frontier here is how you can make the world look very realistic and also control what's happening in the world in a way that still makes sense.

So I think what we've seen is that we have our promptable world events.

We can add things that happen in the world, but if you just want something to pop into the world, then it's not necessarily plausible.

I think, like, if you're in the desert and then you ask it, okay, now I want to see an elephant, then where is this elephant going to come from?

Or maybe it's going to come from the side or when you look to the left.

So I think there is something really interesting in the question of what it means to change a world, because the model has some assumptions.

And when it comes to agents, this is definitely an important capability to be able to inject this new event into the world.

So you've done this already.

Yeah, we've got signs of life for this.

I wouldn't really say we've got a full agent training loop where we're already doing some large-scale training in these environments, but what we can already do is test the agents in them and see how they do.

Right.

And I think it's quite remarkable that given these things weren't developed together, we just dropped the agent in and it can already do things.

And you can imagine all the different things you could now use this for.

Color that in for me.

Give me an example.

If you have, like, a factory where you have some robots and you want to introduce, maybe it's a very boring example, but a new machine, right.

And that wasn't there.

Or you change somehow the structure of the building and you want to test the robot before you actually put them in the new building.

Right.

So again, you can simulate a world that's basically a variant of what the agent has seen in the past and see if it breaks, right? And all of this can happen in a simulated environment and not necessarily break your new machine.

Right.

So that's like one example, I would say.

Find the unintended consequences.

Yeah.

And then the evaluation of the model.

So that's even not training a model like the agent, just testing how well it adapts maybe to a new variation of an environment.

All of these examples you've given so far are ones where the agent has a specified objective, which I know is sort of the point of SIMA thus far.

Yeah.

But what about if you had agents that didn't have an objective?

There was a really nice quote that I came across, which is that almost no prerequisite to any major invention was made with that invention in mind.

Can you imagine a point in the future where you are letting agents loose in these environments without specifying an objective for them?

Yeah, exactly.

So the quote comes from Why Greatness Cannot Be Planned, which is from Ken Stanley and Joel Lehman.

It's a great book.

And the general idea there is that searching for interestingness might actually lead to things that are more useful for practical goals than if you just directly optimize for the practical goals themselves.

And clearly, the bigger the domain and the space for discovery, the more interesting things could happen.

So they had this really nice example quite a while ago with this paper called Picbreeder, where essentially they allowed people to select images and sort of combine them to create new images that were mutations of those two.

People weren't directly optimizing for specific end goals.

But by just choosing what they found interesting, they ended up discovering really cool structured pictures like a skull or a butterfly that weren't obvious how you would reach that from the starting points.

And some of the stepping stones along the way didn't really look much like the final goal.

And they wouldn't have been things you obviously would have chosen to go for if you had those goals in mind.

And there are lots of examples of this in the real world.

If you are, for example, trying to reach the moon, you wouldn't build a bigger ladder.

Optimizing along one dimension with maybe a greedy myopic approach doesn't always lead you to make these big leaps.

Well, I mean, evolution itself is like the classic example of iteration without objective.

Yeah, and we see this a lot in research.

Yeah, my perspective on that is that I think we as humans, we kind of decide what's interesting.

I think there is even an example where the entire evolution of mathematics was guided by people deciding what's next, what's interesting, what's not interesting.

And just the problem being hard doesn't mean that it's interesting at all.

And I think in a way, when we think about generating new things in science, there is some aspects of maybe beauty or interest that is coming from us.

And models might maybe learn to simulate that.

But I think it's really important to remember that we ultimately decide what's interesting as our preferences as people.

I mean, you haven't had the move 37, as it were in this.

Even in this case, right, the goal of Go, or the game, was designed in a certain way that people find it interesting to play, right?

Otherwise, like, they and it's a very ancient game, right?

So I'm just saying that the setup, even when there is an objective, like getting to the moon, this goal was made by people.

So I'm just saying, like, I think there are still broader constraints set by us in a way.

And like, if a machine comes up with, with a problem to solve, we still have to say, is this an interesting problem?

Because otherwise, we'll just, okay, I don't care about that.

So I think there is still an aesthetic part to anything that's open ended.

Yeah, I think there's this, there was this quote from Demis a while ago about the levels of creativity.

And it was like, interpolation is one, like you see a new cat and you can identify it as a cat.

Extrapolation was one where it's like, given the rules of Go, can you discover a new move like Move 37?

And then the third level is generating completely new things, like could you actually invent Go, is what he said.

And we actually had this as a motivation for the Genie project at the beginning: can you create completely new things?

And I think we're starting to see that happen.

Since it's a completely new kind of model, someone on the team will do something like create a certain kind of world and then immediately other members of the team are like, that's really interesting.

And then they start sort of evolving that idea themselves.

And then we post it on social media and we see the reaction to some things and then we know that's interesting.

So then we create new things that way.

And that's just with a very limited access to the model.

So clearly you can see if we open this up a bit more in the future that it could lead to some sort of open ended creativity that way.

It's like an evolution of the criteria that people find.

Yeah, exactly.

There's people in the loop guiding the like interestingness.

Well, then, okay, I'm just thinking back to the conversation that we had with Dave Silver earlier in the series where he was sort of saying, you know, actually, there's one way where you remove humans from the equation and you get even more surprising results, potentially.

Okay, just play along with me here for a moment.

But could you get to a point where you just simulate the first, I don't know, single-celled organism and allow it to evolve inside of Genie, you know, and actually watch the process of evolution happen in a virtual environment?

That's a great question.

And that's kind of the dream of, like, the ALife and open-ended evolution community.

I think maybe the worlds that we create are not fully rich enough.

But I definitely think that we're getting along that path, right?

So in open-ended evolution and ALife, they've been designing worlds that could facilitate this kind of thing, typically in code.

And so this could be an alternative approach to getting maybe richer real world simulations.

And so, in theory, I mean, we've made quite a lot of progress pretty fast.

But if you've got the simulation to be fully like the real world, and it had the kind of objectives and constraints that lead to these kind of evolutionary steps, then it's definitely plausible.

But I can't say it's definitely there yet.

It's not a direct answer.

But I actually tried, I think there is maybe a very basic example, the Game of Life.

Right.

So it has four rules.

Yeah, it has four rules, right.

And then I actually tried using Veo to simulate it, right? You give it an image and text, and it doesn't work. It does look like it's evolving.

And if you don't know the rules, you would look at it and say, yeah, it looks reasonable, different pixels light up and go dark, but it doesn't follow the four rules of the game, right?

I think this is a good example of what our current models are able to do, and where they're still limited, which is in their ability to follow specific rules.
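
For reference, the four rules of Conway's Game of Life are short enough to write down exactly, which is what makes it a sharp test: a generated sequence either follows them or it does not. The sketch below is a minimal checker under two assumptions: frames have already been converted to grids of 0s and 1s, and cells beyond the edge of the grid count as dead. `life_step` and `follows_rules` are hypothetical helper names, not part of any model's tooling.

```python
from typing import List

Grid = List[List[int]]  # 1 = live cell, 0 = dead cell


def life_step(grid: Grid) -> Grid:
    """Apply the four rules once; cells outside the grid are treated as dead."""
    rows, cols = len(grid), len(grid[0])

    def live_neighbors(r: int, c: int) -> int:
        return sum(
            grid[r + dr][c + dc]
            for dr in (-1, 0, 1)
            for dc in (-1, 0, 1)
            if (dr or dc) and 0 <= r + dr < rows and 0 <= c + dc < cols
        )

    new_grid: Grid = []
    for r in range(rows):
        row = []
        for c in range(cols):
            n = live_neighbors(r, c)
            if grid[r][c]:
                # Rules 1-3: under 2 neighbours dies, 2 or 3 survives, over 3 dies.
                row.append(1 if n in (2, 3) else 0)
            else:
                # Rule 4: a dead cell with exactly 3 live neighbours becomes live.
                row.append(1 if n == 3 else 0)
        new_grid.append(row)
    return new_grid


def follows_rules(frames: List[Grid]) -> bool:
    """True if every consecutive pair of frames is one exact rule application."""
    return all(life_step(a) == b for a, b in zip(frames, frames[1:]))


# A "blinker" flips between a vertical and a horizontal bar of three cells.
vertical = [[0, 1, 0], [0, 1, 0], [0, 1, 0]]
horizontal = [[0, 0, 0], [1, 1, 1], [0, 0, 0]]
```

A sequence that merely looks plausible, with pixels lighting up and going dark, would fail `follows_rules` as soon as a single cell deviates from the rules.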

But to actually evolve, maybe, life forms, I think you need much more ability to do both: to simulate the physical world, but also to follow some basic rules of physics in a very accurate way.

And yes, I'm constrained.

And I think we see some glimpses of it, but it's definitely very far from being able to run evolution on a GPU.

Well, thank you for going philosophical with me for a moment there.

I enjoyed that.

Let's come back down to Earth with a thump though, because I mean, there are safety implications with this.

What are your main concerns?

I think there's different levels of concern, right?

There's sort of the known things.

And they're quite obvious.

I mean, things like violence, maybe we wouldn't want to occur in the world in new ways.

And that's something that we can already start addressing.

But there's also maybe some more gray areas where we're not actually really sure how we feel about these things like historical settings, for instance, some of them may be unsavory for subtle reasons.

And those are just things that we as a team can see quite clearly.

But I think there's probably also some things that we haven't considered.

And we'd rather get those things right by limiting our early access and getting feedback, which we're already doing.

And we've already learned a lot from the folks that we brought in a few weeks ago and the folks that we're still interacting with.

Like what?

We've heard a lot of new use cases that I didn't think of. Vocational training could actually be quite impactful, right?

Lots of people can't go into roles like firefighting, for instance, and know what it feels like to actually be there without having a visceral sort of experience.

It's probably something that you would benefit a lot from to be able to simulate in advance, even if it's not perfectly correct from a simulation perspective.

There may be elements to it that just getting a sense of what it's like to be situated in that specific circumstance, that it's quite nice to be able to simulate in advance.

Minus the heat.

Yeah, minus the heat and the smoke.

And the genuine jeopardy.

Yes.

But actually, you raised another interesting point there, though, because you said even if it's not perfectly realistic, is there also another danger about this gap between what's simulated and what's real?

How do you make that as small as possible?

The sim-to-real gap is something we've spoken about on this podcast lots of times before.

Let's say you've got your example where you've got a robot in a factory.

Oh, and it's moving around in terms of interacting.

You can't directly take that and map it onto the real world, right?

Yeah, so I think over time, we'll see more control, basically being able to take maybe a real environment and map it into the model so the model can base its generation on the real environment.

And we see it to an extent with starting from an image or starting with a video.

And now the question is, would it be perfectly the same as the real world?

Probably not.

I don't think it's even well defined.

What does it mean?

But I think the gap is definitely narrowing.

So, you know, in the past, we know a lot of the RL environments looked very far from anything that's photorealistic or real world.

But now we can go even closer.

But definitely there still remains a gap, and we will have to see what the implications are. And we're still not using it for any real-world deployment, of course.

Yeah.

Yeah, I think it's kind of an iterative approach.

I don't think we're saying right now that we've got Genie 3, so we've solved simulation for any possible embodied task.

But I think what we can do is combine it with other techniques.

So we still would train our agents in the same way that we already did without this.

And we use it to kind of augment the training process.

And then another element to it is, it's really important that it has some diversity, right?

So if it's always wrong in the same way, then agents might learn to exploit that inaccuracy.

Whereas if the model can generate quite diverse sort of different worlds, then what we can do is really test the breadth of the agents capabilities and make sure that there's no scenario where it does something really wrong.

That might actually be a strength, right?

So same in sim to real, we want to do domain randomization.

Maybe by having a model that is a generative model, being able to sort of search the space of possibilities and check that all of the agents do something sensible might be a good thing.

But that doesn't mean you want to completely train it as if that is the real world.

Maybe you want to use it to make it more adversarially robust rather than learn specifics.

Let me make sure I understand that then.

So if it's wrong, but wrong in unpredictable ways, then actually that might end up making the agent more robust in the long run.

Yeah.

So what you want to do, similar to domain randomization, is make sure that there's no plausible scenario where the agent could do something really unsafe.

It's quite a different objective than if you had one specifically incorrect scenario, and then you told the agent exactly how to behave from that one.

Instead, you sort of make it so that in any possible future world, the agent should be able to do something sensible.
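
The idea of checking an agent across many varied worlds, rather than tuning it to one, can be sketched as a simple evaluation loop in the spirit of domain randomization. This is a toy under stated assumptions: `sample_world` draws made-up parameters (a `friction` and a `gap` value with arbitrary ranges), `brittle_agent` is a hypothetical agent that only copes with high friction, and `evaluate` collects every sampled world in which the agent fails.

```python
import random
from typing import Callable, Dict, List

World = Dict[str, float]


def sample_world(rng: random.Random) -> World:
    """Draw one randomized world; parameter names and ranges are illustrative."""
    return {"friction": rng.uniform(0.2, 1.0), "gap": rng.uniform(0.5, 2.0)}


def evaluate(agent: Callable[[World], bool], n_worlds: int, seed: int = 0) -> List[World]:
    """Run the agent in many sampled worlds and return the ones where it fails."""
    rng = random.Random(seed)
    failures: List[World] = []
    for _ in range(n_worlds):
        world = sample_world(rng)
        if not agent(world):
            failures.append(world)
    return failures


def brittle_agent(world: World) -> bool:
    """Toy agent that only behaves sensibly when friction is high."""
    return world["friction"] > 0.5


failures = evaluate(brittle_agent, n_worlds=200, seed=0)
```

The list of failing worlds is the useful output here: the aim is not a high average score but finding any plausible scenario in which the agent does something it should not.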

That's so interesting then, because I was sort of imagining that you were trying to nudge this towards being more realistic, towards getting more reliable outcomes, towards sort of closing that sim-to-real gap.

But I mean, the way you're describing this is that not necessarily.

The question is what do you mean by reliable?

Because reliable to me is probably a lot about following the instructions that we provide, right?

The model.

So if we want the model to simulate the specific environments and we describe it with a lot of detail, we want the model to follow that.

If there is something not so plausible in this description, the model should still follow that.

And I think the challenge is that sometimes we as people or like for various reasons, we're interested in the less plausible scenarios, right?

For example, in some of the examples we saw today, you wanted some vodka and the calf liver.

All right.

So if you just sample from all of the possible tables in Siberia, that's probably not in the middle of the distribution.

So I think that the reliability to me comes mostly from following the description that we provide the model with and simulating the world in a way that would be close to that.

I think that's a really good point, actually.

I think in underspecified environments, you want diversity because you want to be able to adapt to anything within the plausible distribution.

But if you have a very well specified environment, then you want it to be accurate.

And I think we're kind of seeing improvement on both those dimensions, but we're probably not fully there yet.

Let me go back to the AGI question, if I may, the end question that everyone always wants to ask.

Do you think that this is a step towards it, Jack?

I think AGI is something itself, which is relatively subjective, and people have different interpretations of what you mean by AGI.

So I think it would be quite maybe grandiose to say our model is the key thing in the whole field that will enable AGI.

But I think for me, an AGI needs to be embodied and be able to act in the physical world.

That's what really excites me.

I think that could really improve people's quality of life in any demographic, anywhere in the world.

And so with that framing, I definitely think this is an important tool.

I can't see how an embodied AGI would be able to operate in any scenario in the world without being able to simulate it, to gather experience and learn from its own experience.

Because that's the paradigm that we've used in other settings to get superhuman capabilities or even just robust capabilities.

And so I believe very strongly that we need simulation.

And I also believe very strongly that we won't be able to build a simulator of the real world any other way.

So when you combine those two things, I think yes, it is a big step for my version of AGI.

Yeah, and I think that's a really good answer.

And on top of that, I would just say that our current generation of AI is limited to the digital world.

For AI to be useful for us, definitely it has to have some kind of real world interaction.

So I think again, it's a small step towards embodied AI.

And there are definitely a lot of gaps to get there.

So I think we need much better signals, the kind that, for example, robots get while they walk through the world.

They need to get some physical response.

It's not enough just to have a visual input and output.

So I just think that this is definitely a step towards that vision.

Yeah, because there is still a lot that this can't do.

I mean, the additional sensors being one of them, but it also doesn't handle people that well at the moment.

Exactly.

And I think that's really the key thing: I think this is one of the most promising technologies to achieve sort of sociable and socially aware robots and embodied agents.

But also, I think that's probably the biggest limitation of the current iteration of the model, that it doesn't do this perfectly, because our standards have risen and that's the thing that we now think isn't good enough.

But I think it's really critical that we do have that, right?

Because even if our robots and embodied agents fully understand physics, I think physics is fairly consistent around the world, but people are not, right?

And we want these agents, robots, whatever form factor they come in, to be able to really augment humans and work with humans to make our quality of life better.

And so they need to understand how humans think, work, interact, and be able to work with us on things.

And so that's what I think we're really excited about is one of the things that might be enabled by our model.

I think there are definitely a lot of limitations in terms of the quality of the generation, but I'm very excited about the pace.

If you think about it, we had Genie 2 and Veo 2 in December. We definitely feel the pace and its impact in our personal lives, but the field is just moving fast.

And if we remember just what less than two years ago, we had images generated with six fingers, and that was a big thing.

And nobody's speaking about that anymore.

So I don't see why we won't be able to generate people in much higher fidelity and with everything that follows that.

Is the goal here then to have a foundational model to essentially do for simulated worlds, what LLMs have done for language?

Yeah, exactly.

I think you've put it better than we probably could.

I think this is really a step change as a foundation model in terms of the breadth and generality and capabilities.

And I think this is probably similar to what Shlomi alluded to, as we've seen with images recently, which went from obvious problems like the fingers to now being, I mean, pretty incredible.

We saw the same thing with video maybe in the past year where once we had something like Veo 2, it's looking pretty amazing at this point.

And we saw this with language models, maybe three or four years ago, where they started to get really capable.

And we wanted to get to that point for this new kind of foundation model, sort of an autoregressive world model.

Now that we're there, there's a whole host of different potential things that it could be used for and have impact on.

And we're still fairly early in that right now.

But there are, you know, are there elements of simulation too, right?

Like, do you think that you will ever be able to use this kind of idea to recreate a lived experience rather than just a visual one?

So I think that we have many, many senses that we're not even aware of, right?

Like, so for example, proprioception: basically, we feel where we are.

And we have this kind of like notion of where we are in the world.

And I think when we think about actually putting people in a simulation to really feeling kind of immersed in that thing, this is a huge part of it.

Basically, the constraint to visual and maybe audio is still too much of a constraint.

I think that, like, there is definitely potential for that.

But it goes through multiple technologies that we have to build to actually get there.

So before that, I expect people to be able to connect and interact with photorealistic environments, but still through some kind of an interface, like a hybrid interface where maybe they can feel some sensation through, maybe, gloves or something.

I think I would also say that there is definitely something about being interactive in real time that does make a big difference for the experience.

And we've had members of the team say that they visited childhood locations, for example, and did actually get like a sense for it that you couldn't really get from an image or a video.

So there is already some sort of degree of experience that you can gain from this kind of model already.

And obviously, we're working hard to make it an even more capable model in the future.

So maybe that will extend.

Yeah, actually, it reminds me of a project that we had earlier in the year, where the team at Google actually used Veo to help people with early onset of dementia go back to their childhood memories and reconstruct them.

So I can imagine that that might be, for example, a potentially therapeutic tool as well, so that they can not only look at the video, but maybe actually relive or remember some things from their childhood.

So I think that even before that, we don't need to go very far for things to have a positive impact on the world.

Amazing.

That was absolutely fascinating.

Thank you so much.

Thanks for having us.

Thanks for having us, Hannah.

I think the most impressive part of this is not what you're looking at on the screen.

It's how that is generated.

It's this change that this model represents: from creating realistic images or videos of the real world, as though they were frozen moments in time, to something that can actually handle time in the way that we experience it, with an arrow that's pointing in only one direction, right, where effect follows cause, to build this consistent, forward-moving world where the present is a direct result of the past.

And that is why I think that this is an early hint of something that's much bigger.

This is not just a new way to design games or beautiful environments.

This here is the bedrock for machines that can genuinely plan and reason about our world.

You have been listening to Google DeepMind: The Podcast with me, Professor Hannah Fry.

Now we're going to take a little bit of a pause over the summer, but we're going to be back with more episodes in the autumn from Google HQ in California.

And in the meantime, do take a look at our extensive back catalog, which covers everything from tools for creators to AI for drug discovery.

See you soon.