NVIDIA AI Podcast · 2025-09-17

Bringing Robots to Life with AI: The Three Computer Revolution

Hosts: Noah Kravitz

Guests: Yashraj Narang

NVIDIA robotics · three-computer solution · imitation learning · reinforcement learning · humanoid robots · sim-to-real transfer · synthetic data · vision-language-action models · neural robot dynamics · CoRL conference

Why it matters

Imitation learning provides human-like guidance but is bounded by demonstrations.

Key claims

NVIDIA's three-computer robotics strategy: DGX (training), Omniverse/Cosmos (simulation/synthetic data), and Jetson AGX Thor (onboard inference)
Seattle Robotics Lab, founded 2017 by Dieter Fox with Jensen Huang's backing, focuses on the full robotics stack and maintains close ties to UW
Imitation learning provides human-like guidance but is bounded by demonstrations; reinforcement learning is less efficient but can achieve superhuman performance
Modular (perception-planning-control) vs end-to-end approaches are converging on hybrid architectures, mirroring autonomous driving's trajectory

Episode summary

Summary

Yashraj Narang, head of NVIDIA's Seattle Robotics Lab, joins the NVIDIA AI Podcast to explain how robots come to life through AI. He outlines NVIDIA's three-computer strategy for robotics: DGX systems (e.g., GB200, Blackwell) for training large AI models, Omniverse and Cosmos for simulation and synthetic data generation, and Jetson AGX Thor for onboard inference. The discussion spans the lab's research across perception, planning, control, reinforcement learning, imitation learning, and vision-language-action models.

Narang contrasts imitation learning (mimicking human demonstrations) with reinforcement learning (intelligent trial and error that can exceed human performance), and the modular perceive-plan-act approach with the newer end-to-end paradigm that maps sensors directly to motor commands—drawing parallels to autonomous driving's evolution toward hybrid architectures. He addresses the explosion of humanoid robotics, attributing it to the convergence of mature hardware, LLMs/VLMs, and the fact that human environments are built for human form factors, while predicting traditional and humanoid robots will coexist.

On data, Narang notes robotics lacks an internet-scale corpus, making simulation essential. He details the sim-to-real gap across perception, physics, and latency, and techniques to close it (domain randomization, adaptation, invariance). NVIDIA's proposed "data pyramid" layers YouTube videos, synthetic simulator data, and real-world data. He closes with a teaser of the CoRL conference in Seoul and the lab's Neural Robot Dynamics (NeRD) project, which replaces explicit physics computations with differentiable neural networks enabling fine-tuning and speed gains on AI-optimized hardware.

Humanoid robotics has surged because human environments are built for human form factors; traditional and humanoid robots will coexist
Robotics faces a data problem with no internet-scale corpus; synthetic data from simulation is essential, and real-world data remains the ground truth
NVIDIA proposes a 'data pyramid' for humanoids: YouTube videos at the base, synthetic simulator data in the middle, real-world data at the top
NeRD (Neural Robot Dynamics) replaces explicit physics with differentiable neural simulators, enabling fine-tuning on real data and faster execution on AI-optimized hardware
CoRL (Conference on Robot Learning) was held in Seoul; NVIDIA presented multiple papers, talks, posters, and demos

Source material

Transcript

[Music] Hello, and welcome to the NVIDIA AI Podcast.

I'm your host, Noah Kravitz.

Our guest today is Yashraj Narang.

Yash is Senior Research Manager at NVIDIA and the head of the Seattle Robotics Lab, which I'm really excited to learn more about along with you today.

Yash's work focuses on the intersection of robotics, AI, and simulation, and his team conducts fundamental and applied research across the full robotics stack, including perception, planning, control, reinforcement learning, imitation learning, simulation, and vision-language-action models.

Full robotics stack, like it says.

Prior to joining NVIDIA, Yash completed a PhD in Materials Science and Mechanical Engineering from Harvard University and a Master's in Mechanical Engineering from MIT.

And he's here now to talk about robots, the field of robotics, robotics learning, all kinds of awesome stuff.

I'm so excited to have you here, Yash.

So thank you for joining the podcast.

Welcome.

Thank you so much, Noah.

So maybe first things first, and this is a very selfish question I mentioned before we started, but I think the listeners will be into it too.

I've never been to the Seattle Robotics Lab.

I don't know much about it.

Can we start with having you talk a little bit about your own role, your background, if you like, and give us a little peek into what the Seattle Lab is all about.

Yeah, absolutely.

So the Seattle Robotics Lab, it started in, I believe, October of 2017, and I actually joined the lab in December of 2018.

And the lab was started by Dieter Fox, who's a professor at University of Washington.

And at the time, I believe, he had a conversation with Jensen at a conference.

Jensen Huang, of course, the CEO of NVIDIA.

And Jensen thinks way far out into the future.

And at that point, he was getting really excited about robotics.

He said, you know, essentially, that we need a research effort in robotics at NVIDIA.

And that's really how the lab started.

So that was kind of the birth of the lab.

And at the beginning, the lab had, and it still does have, a very academic focus.

Okay, so we consistently have really high engagement at conferences.

We publish a lot, we do a lot of fundamental and applied research.

And recently, NVIDIA has been developing, especially over the past few years, a really robust product and engineering effort as well.

And so we're working more and more closely with them to try to get some of our research out into the hands of the community.

So, you know, fundamental academic mission, but it's really important for us as well to transfer our research and get it out there for everyone to use.

Fantastic.

And you mentioned Dieter Fox, I believe, at UW, the University of Washington.

Is the lab, is there a relationship there?

Yeah, so when Dieter started the lab, we, you know, over a number of years had a very close relationship with University of Washington, where many students from his lab and others would come do internships at the Seattle Robotics Lab.

We still definitely have that kind of relationship.

I stepped into the leadership role just a few months ago.

Oh, wow.

Okay.

And, you know, plan to maintain that relationship because it's been so productive for us.

Awesome.

I have a little bit of bias.

Somebody very close to me is a UW alum, go Huskies.

So, you know, I had to ask.

All right, let's talk about robots.

We're going to start talking about, well, really, I'll leave it to you and I'll ask at a very high level, how do robots come to life?

What does that mean when we talk about, you know, a robot coming to life?

And I think this is going to get into the three-computer concept and stuff like that, but I'll leave it to you at a high level.

What does that mean?

Bringing robots to life?

Yeah, it's a big question.

I think it's a real open question too.

I think we can even start with what is a robot.

I think this is a subject of debate, but, you know, generally speaking, a robot is a synthetic system that can perceive the world, can plan out sequences of actions, can make changes in the world, and it can be programmed and it typically serves some purpose of automation.

That's really sort of the essence of a robot.

And now there's the question of if you have a robot, how can it come to life?

So I would say that if most people were, for example, to step into a factory today, you know, like an automotive manufacturing plant, they would see lots and lots of robots everywhere.

Right.

And the motion of these robots and the payloads of these robots and the speed of these robots, it's extremely impressive.

But those same people that are walking into these places, they might not feel like these robots are alive because they don't necessarily react to you.

In fact, you probably want to get out of the way if there's a good time together, you know, to be safe.

So I think part of robots coming alive is really this additional aspect of intelligence so that when conditions change, it can adapt, it can be robust to perturbations, and it can start to learn from experience.

Yeah.

And I think that's really kind of the essence of coming alive.

Got it.

And what is the three computer concept and how does it relate to robotics?

Yeah, the three computer concept is pretty interesting.

I think this was, you know, I don't know the exact history of this, but I think this was inspired by, you know, the three body problem.

So the three computer concept, it's really a formula for today's robotics, you know, both on the research side and the industry side.

And it has three parts, as the name suggests.

So the first computer is the NVIDIA DGX computer.

So this includes things like GB200 systems, Grace Blackwell superchips, and systems that are composed of those chips.

And these are really ideal for training large AI models and running inference on those models.

So getting that fundamental understanding of the world, being able to process, you know, take images as input, language as input, and produce meaningful actions, robot actions as output, for example, training these sorts of models, and then running inference on those.

The second computer is Omniverse and Cosmos.

It's a combination of these things.

So Omniverse is really a developer platform that NVIDIA has built for a number of years with incredible capabilities on rendering, incredible capabilities on simulation, and many, many applications built on top of this platform.

So for example, in the Seattle Robotics Lab, we're heavy users of Isaac Sim and Isaac Lab, which are basically robot simulation and robot learning software that is developed on top of Omniverse.

And what you can do with Omniverse is essentially train robots to acquire new behaviors, for example, using processes like reinforcement learning, which is sort of intelligent trial and error.

You can also use it to evaluate robots, for example, if you have some learned behaviors, and you want to see how it performs in different scenarios, you can put it into simulation and kind of see what happens there.

Cosmos is essentially a world model for robotics and world model is this kind of big term and many people have different interpretations of it.

But just to kind of ground things a bit here, some of the things that Cosmos has done is actually make video generation models.

So you could have an initial frame of an image, you can have a language command, and then you can predict sequences of images that come after that.

So this is the Cosmos Predict model.

There's also the Cosmos Transfer model.

And the idea here is that you can take an image and you can again take, let's say a language prompt, and you can transform that image to look like a completely different scene while maintaining, you know, the shape and semantic relationships of different objects in that image.

And then there's Cosmos Reason, which is really a VLM, which is a vision language model.

So it can take images as input, language as input, and it can basically produce language as output, it can answer questions about images, and it can do a sort of a step by step thinking or reasoning process.

Now, just stepping back a little bit, you know, second computer again, Omniverse and Cosmos.

And what they're really used for is to generate data, to generate experience, and to evaluate robots in simulation.

And so in a sense, this can kind of come either before or after the first computer, you know, you can, for example, generate a lot of data, and then learn from it using that first computer, these DGX systems, or you can train a model on that DGX system and then evaluate it using something like Omniverse or Cosmos.

And the third computer is the AGX.

By the way, I looked this up recently. We've been here for a while, but I was still curious: what does the D in DGX stand for?

What does the A in AGX stand for?

Oh, yeah, okay.

D is apparently for deep learning.

And A is apparently for autonomous.

So it's kind of a nice way to remember it.

So interesting.

The more you know.

Yeah, exactly.

The more you know, right?

So the third computer is the Jetson AGX, specifically the Thor, which has been recently released.

And this is all about running inference on models that are located on your robot.

So instead of having, you know, separate workstations or, you know, data centers, this is a chip that actually lives on the robot where you can, we can basically have AI models there and you can run inference on them in real time.

Really powerful.

So before asking a follow-up, I feel like I have to plug the podcast real quick, because it was really sort of satisfying in a way, listening to you and thinking, oh, yeah, we did an episode on that.

Oh, yeah, Sanja talked about that.

Oh, yeah.

So I will say, if you would like to know a little more about the feeling of walking through an automotive factory with a lot of robots doing amazing things, without worrying about getting out of the way, there's a great episode with Siemens from a few months back.

Check that out.

I mentioned Sanja Fidler, recently.

Sorry, from NVIDIA.

She spoke around SIGGRAPH, but a lot of stuff related to robots.

Of course, from GTC. Those are my plugs.

Okay, so you got into this a little bit, Yash, but, you know, mentioning Thor in particular, but what's changed recently in the field?

And what does that mean for where robotics is headed?

Yeah, I think there have been many changes in the field.

I think, for example, the three-computer solution, the three-computer strategy from NVIDIA, that's definitely been a key enabler.

Just the fact that there is access to more and more compute, more and more powerful compute, and tools like Omniverse, for example, for rendering and simulation and Cosmos for world models.

And of course, you know, better and better onboard compute.

I think that's really, really empowered robotics.

On, let's say, you know, maybe if we think a little bit about the learning side, I think since joining the lab in December of 2018, I've sort of been lucky to witness different transformations in robotics over time.

So, you know, one thing that I witnessed early on was actually, I think this was in 2019 when OpenAI released its Rubik's Cube manipulation work.

And so these are basically dexterous hands, human-like hands, that learn to manipulate a Rubik's Cube and essentially solve it.

But it was learned, you know, purely in simulation and then transferred to the real world.

So that was kind of a big moment in the rise of the sim-to-real paradigm: training in simulation, deploying in the real world.

I think other things came after that. Transformers were, of course, invented before, but we really started to see more and more of that model architecture in robotics.

I think that was a big moment, or a big series of moments.

Another specific moment that was pretty powerful was just, of course, as everybody in AI knows, ChatGPT.

So I think that was released in late 2022.

Most people started to interact with it early 2023.

And then, you know, the world of robotics started thinking about, okay, how do we actually leverage this for what we do?

And, you know, many other fields kind of felt the same thing.

So there was really an explosion of papers starting in 2023 about how to use language models for robotics, and how to use vision language models for robotics.

And I think that was that was quite interesting.

So there are papers that kind of explored this along every dimension, like can you, for example, give some sort of long range task to a robot, or, you know, in this case to a language model, and have it figure out all the steps you need to accomplish in order to perform that task.

Can you, for example, use a language model to construct rewards?

So when you do, for example, reinforcement learning, intelligent trial and error, you usually need some sort of signal about how good your attempt was. You know, you're trying all of these different things: how good was that sequence of actions? And that's typically called a reward.

So, you know, these are traditionally hand coded things using a lot of human intuition.

And there was some very interesting work, including Eureka from Nvidia, about how to use language models to sort of generate those rewards.

There was also kind of a simultaneous explosion in more general generative AI, for example, generating images and generating 3D assets.

A lot of this work came from Nvidia as well.

So on the image generation side, you know, there was work, for example, on generating images that describe the goal of your robotic system.

So where do you want your robot to end up?

What do you want the final product to look like?

Let's generate an image from that, and use that to sort of guide the learning process.

And then there's also simulation, and we'll probably get more into this a little bit later.

But one of the challenges of simulation is you have to build a scene, and you have to build these 3D assets or meshes.

And that can take a lot of time and effort and artistic ability and so on.

So there's a lot of work on automatically generating these scenes and generating these assets.

And in a sense, you can kind of view this transformation that we've seen over the past few years as taking human ingenuity more and more out of the process, or applying it at higher and higher levels, as opposed to absolutely doing everything and sort of hard coding things like rewards and final states, and, you know, building meshes and assets manually and describing scenes and so on and so forth.

So we're, you know, able to automate more and more of that.

There's so much in what you just said.

And one of the big things for me from this perspective is thinking about how little I understood about Omniverse, let alone Cosmos, before having the chance to have some of these conversations, particularly over the past few months, having to do with robotics, physical AI, and simulation, and the idea of creating the world so that the robot is able to learn, and Cosmos. It's all just fascinating.

It's so cool. I'm wanting to geek out on my end.

But when you're talking about the different types of learning and, you know, I'm sure they go together in the same way that you mix different approaches to anything and solving complex problems.

Can you talk a little bit about, I don't know if pros and cons is the right way to describe it, but the difference between imitation and reinforcement learning, not so much in what they are, but in sort of, you know, effectiveness or how you use them together and that sort of thing.

Yeah, absolutely.

I think these, you know, these are two really popular paradigms for robot learning.

And I will, you know, try to kind of ground it in what we do, what we typically do in robotics, the typical implementations of imitation learning and reinforcement learning.

So in a typical imitation learning pipeline, you're typically learning from examples.

So for example, let's say I define a task: I'm trying to pick up my water bottle with a robot.

What I might do if I were using an imitation learning approach is maybe, you know, physically move around the robot and pick up the water bottle, or I might use my keyboard and mouse to sort of teleoperate the robot and pick up the water bottle, or I might use other interfaces.

But the point is that I am collecting a number of demonstrations of this behavior.

I do it once in one way, I do it, you know, the second time in a different way.

And maybe I move the water bottle around and I collect a lot of different demonstrations there.

And basically, the purpose of imitation learning is to essentially mimic those demonstrations, the behaviors would ideally look as I have demonstrated it, right.
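A minimal sketch of the imitation learning idea described here, under loud assumptions: the task is collapsed to one dimension, the demonstrations are made-up (bottle position, gripper target) pairs, and the "policy" is just a straight line fit by least squares. The names `fit_linear_policy` and `policy` are hypothetical and do not come from any robotics library; real pipelines use far richer observations and neural network policies.

```python
def fit_linear_policy(demos):
    """Least-squares fit of action = w * state + b to (state, action) pairs."""
    n = len(demos)
    mean_s = sum(s for s, _ in demos) / n
    mean_a = sum(a for _, a in demos) / n
    var_s = sum((s - mean_s) ** 2 for s, _ in demos)
    cov_sa = sum((s - mean_s) * (a - mean_a) for s, a in demos)
    w = cov_sa / var_s
    b = mean_a - w * mean_s
    return w, b


# Hypothetical demonstrations: the bottle is moved around between attempts,
# and each time the demonstrator places the gripper slightly past it.
demos = [(0.10, 0.12), (0.25, 0.27), (0.40, 0.42), (0.55, 0.57)]
w, b = fit_linear_policy(demos)


def policy(state):
    """Mimic the demonstrations: map a bottle position to a gripper target."""
    return w * state + b
```

The fitted policy can then be queried at a bottle position that never appeared in the demonstrations, for example `policy(0.30)`, and it will reproduce the demonstrator's habit of reaching slightly past the bottle. It cannot do better than the demonstrations it was given, which is the limitation discussed in this part of the conversation.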

Now, reinforcement learning operates a little bit differently.

Reinforcement learning tries to discover the behaviors, you know, or the sequences of actions that achieve the goal.

So, you know, in the most extreme case, what you might do if you were to take a reinforcement learning approach, again, intelligent trial and error, is you might just have proposals of different sequences of actions that are being generated.

And if they happen to pick up the water bottle, I give a reward signal of one.

Okay.

And if they fail, I might give a reward signal of zero.

And the key difference here is that I am not providing very much guidance on this sequence of actions that the robot needs to use in order to accomplish the task.

I'm letting the robot explore, try out many different things, and then come up with its own strategy.
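The sparse reward of one or zero described above can be sketched as a toy trial-and-error loop. This is a deliberately simplified stand-in, not a full reinforcement learning algorithm: the task is one-dimensional, `BOTTLE_POSITION` and `TOLERANCE` are invented values, proposals are uniform random samples, and "learning" just means averaging the proposals that earned a reward of 1.

```python
import random

BOTTLE_POSITION = 0.7  # hypothetical 1-D bottle location
TOLERANCE = 0.05       # hypothetical grasp tolerance


def reward(action):
    """Sparse reward: 1 if the grasp lands on the bottle, otherwise 0."""
    return 1 if abs(action - BOTTLE_POSITION) <= TOLERANCE else 0


def trial_and_error(num_trials, seed=0):
    """Propose random grasp positions and average the ones that were rewarded."""
    rng = random.Random(seed)
    successes = []
    for _ in range(num_trials):
        action = rng.uniform(0.0, 1.0)  # no guidance: just try something
        if reward(action) == 1:
            successes.append(action)
    if not successes:
        return None  # the agent never stumbled on a rewarded action
    return sum(successes) / len(successes)


learned_action = trial_and_error(num_trials=500)
```

Note that nothing in the loop says how to reach the bottle; the only feedback is whether an attempt succeeded. That is why this style of learning tends to need many more attempts than learning from demonstrations.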

So, you know, pros and cons.

So, imitation learning, you know, one pro is that you can provide a lot of guidance.

And the behaviors that you learn, for example, if a person is demonstrating these behaviors, would generally be human-like.

They're trying to essentially mimic those demonstrations.

Now, reinforcement learning, on the other hand, you know, again, in the most extreme case, you're not necessarily leveraging any demonstrations. The robot, or agent, as it's often called, has to figure this out on its own.

And so it can be less efficient, of course, you're not giving it that guidance.

And so it's trying all of these sequences of actions.

And there are principled ways to do that.

But essentially, it would be less efficient than if you were to give it some demonstrations and say learn from that.

Now, the pro is that you often have the capability of doing things that can be really hard to demonstrate.

So one of the things, you know, one of the topics that I've worked on for some time, for example, is assembly, literally, teaching robots to put parts together.

And this can actually be really difficult to do via a teleoperation interface, you probably need to be an expert gamer in order to do that.

I hear you talk about assembling things.

And forget the robot, I think of myself trying to put together very small parts on something, you know, twisting a screw in, and that makes me cringe, let alone trying to teleoperate a robot.

Yeah, it can be really hard, depending on the task.

And the second thing is that reinforcement learning generally has the potential to achieve superhuman performance.

I think games are a great example. You know, one of the domains of reinforcement learning historically has been games, like Atari games, and that's kind of where, in recent history, people got super excited about reinforcement learning, because all of a sudden you could have these AI agents that can do better at these games than any human ever. And the same capabilities apply to robots.

So the robot can potentially learn behaviors that are better than what any person could possibly demonstrate.

And maybe like a simple example of this is speed.

So maybe there's a tricky problem you're trying to give your robot, where it has to go through a really narrow path and has to do this very quickly.

And if you were to demonstrate this, you might proceed very slowly, you might collide along the way.

But if a reinforcement learning agent is allowed to solve this problem, it could probably learn these behaviors automatically, these smooth behaviors, and it can start to do this really, really fast.

And you know, assembling objects is another example, you can start to assemble objects faster than you could possibly demonstrate.

And I think that's the power.

That's very cool.

Listening to you talk about different approaches to teaching and learning brought something to mind. I was looking at the NVIDIA YouTube channel just the other day, for a totally different reason, and came across the video of Jensen giving the robot a gift and writing the card that says, you know, Dear Robot, enjoy your new brain, or something along those lines, right?

There's something I only know kind of by name: modular versus end-to-end brain.

What is that about?

Am I along the right lines?

Or is that something totally different?

No, no, it's essentially a way to design robotic intelligence. I would say these are two competing paradigms.

Both of these paradigms can leverage the latest and greatest in hardware. Now, the modular approach is an approach that has been developed for a very long time in robotics.

And sort of a classic framing for this is that a robot, you know, in order to perform some tasks or set of tasks, needs to have the ability to perceive the world.

So to take in sensing information and then come up with an understanding of the world where everything is, for example, and it also needs the ability to plan.

So, for example, given some sort of model of the world, like like a physics model, for example, or a more abstract model, and maybe some sort of reward signal, you know, can it actually select a sequence of actions that is likely to accomplish a desired goal?

Right.

And then, you know, a third module in this modular approach would be the action module.

And that means you take in this sequence of actions, maybe these configurations that you'd like the robot to reach in space.

And the action module, also called control, would figure out what motor commands you want to generate.

Literally, what are the signals you want to send to the robot's motors in order to move along this path in space?

So that's kind of the perceive-plan-act framework.

It's been called different things over time.

But that's kind of the classic framing for a modular approach.

And so following that, you would have maybe a perception module, and you'd have some group of people working on that.

You'd have a planning module with some group of people working on that, and you'd have an action module.

And so this is kind of how many robotic systems have been built over time.

Now, the end-to-end approach is something that is definitely newer.

And the idea is that you don't draw these boundaries, really. You take in your sensor data, like camera data, you know, maybe force-torque data if you're interacting with the world.

And then you directly predict the commands that you send to your motors, right?

So you kind of skip these intermediate steps.

And you go straight from inputs to outputs.

And that's the end to end approach.
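To make the contrast concrete, here is a toy sketch of the two designs, with everything reduced to one dimension. All function names, the `range_m` sensor field, and the fixed `weight` are hypothetical placeholders; in a real end-to-end system the mapping would be a trained neural network rather than a single constant.

```python
def perceive(sensor_reading):
    """Perception: turn raw sensing into a world estimate (object position)."""
    return {"object_position": sensor_reading["range_m"]}


def plan(world, robot_position, num_waypoints=3):
    """Planning: pick evenly spaced waypoints from the robot to the object."""
    target = world["object_position"]
    delta = (target - robot_position) / num_waypoints
    return [robot_position + delta * (i + 1) for i in range(num_waypoints)]


def act(waypoints, robot_position, gain=1.0):
    """Control: convert waypoints into motor commands (displacement per step)."""
    commands = []
    current = robot_position
    for waypoint in waypoints:
        commands.append(gain * (waypoint - current))
        current = waypoint
    return commands


def modular_pipeline(sensor_reading, robot_position):
    """Perceive, then plan, then act, with an explicit interface at each step."""
    world = perceive(sensor_reading)
    waypoints = plan(world, robot_position)
    return act(waypoints, robot_position)


def end_to_end_policy(sensor_reading, robot_position, weight=1.0 / 3.0):
    """Sensors in, one motor command out; `weight` stands in for learned parameters."""
    return weight * (sensor_reading["range_m"] - robot_position)


commands = modular_pipeline({"range_m": 0.9}, robot_position=0.0)
first_command = end_to_end_policy({"range_m": 0.9}, robot_position=0.0)
```

In the modular version, each intermediate result (the world estimate, the waypoints) can be inspected and tested on its own. The end-to-end version has no such intermediate outputs, so there is less to design by hand but also less to look at when something goes wrong.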

And, you know, I would say the modular approaches are extremely powerful. They have their advantages: there's a lot of maturity around developing each of those modules, and they can be easier to debug, you know, for teams of engineers, which is why I was mentioning the groups of people earlier.

Yeah, it can be easier to certify as well, you know, if safety is critical to the application. With the end-to-end approach, the advantage is that you're not relying as much on human ingenuity or human engineering to figure out what exactly the outputs of my perception module should be, what exactly the outputs of my planning module should be, and so on.

That requires a lot of engineering.

And if you don't do it right, you may not get the desired outcome.

Yeah, I was just gonna say conceptually, it made me think of the difference between doing whatever task I'm used to doing and asking a chat bot just to shoot me the output, you know, and yeah, yeah, right.

And I think just another analogy here would be, I think this has been a really fruitful debate, really vigorous debate in autonomous driving, actually.

So in the 2010s, I would say just about every effort in autonomous driving was focused on the modular paradigm, again, you know, separate perception, planning, control modules and different teams associated with each of those things.

And then, let's say, early in the 2020s there was a real shift to the end-to-end paradigm, which basically said, let's just collect a lot of data and train a model that goes directly from pixels to actions, you know, actions in this case being steering angle, throttle, brakes, and so on.

Yeah.

And many things today kind of look, I would say like a hybrid, you know, different companies and strategies, but most people have converged upon something that has elements of both.

I'm speaking with Yashraj Narang.

Yash is a senior research manager at NVIDIA and the head of the Seattle Robotics Lab.

And we've been talking about all things robots, AI, simulation, which we'll get back to in a second.

But we were just talking about different styles, different approaches to robotics learning.

Wanted to go back to earlier in the conversation when you mentioned, you know, going into the factory and seeing all these different robots doing these kinds of things.

And even before that, your definition of what a robot is or is not.

And thinking about that got me thinking about asking you to define the difference between traditional and humanoid robots.

And I'm thinking traditional, like robot arms in a factory.

You know, I have fuzzy probably images from sci-fi movies when I was a kid and stuff like that, right?

And humanoid robots.

And I mentioned this earlier: back during GTC, I had the chance to sit down with the CEO of 1X Robotics, and we talked all about humanoid robots.

So maybe you can talk a little bit about this, traditional robots, humanoid robots, what the difference is, and maybe why we're now starting to see more robots that look like humans and whether or not that has anything to do with functionality.

Yeah, absolutely.

So one of your earlier questions, too, was kind of, how has robotics changed recently?

Yeah.

I think this is just another fantastic example of that.

It's been unbelievable over the past few years to see the explosion of interest and progress in humanoid robotics.

And to be fair, actually, companies like Boston Dynamics and Agility Robotics, for example, have been working on this since probably the mid, maybe even early 2010s.

Yeah.

And so they made continuous progress on that and everybody was always really excited and inspired to see their demo videos and so on.

Can I interrupt you to ask a really silly question, but now I need to know, is there a word, we say humanoid robots, right?

Is there a word for a robot that looks like a dog?

Because Boston Dynamics makes me think of those early Atlas, I think, those early videos.

Yeah, yeah, yeah.

And I think Boston Dynamics used to, they had a dog-like robot, which was called BigDog.

Okay.

Yeah, yeah.

Sometime then, which is maybe why this is coming to mind.

People typically refer to them as quadrupeds.

Quadrupeds, okay.

Just four legs.

Right.

Got it.

Thank you.

No problem.

Yeah.

So, yeah, where were we?

So, traditional robots versus humanoids.

So, there's been an explosion of interest in humanoids, particularly over the past few years.

And I think it was just this perfect storm of factors where there was already a lot of excitement being generated by some of the original players in this field.

Folks like Tesla got super interested in humanoid robotics, I think, 2022, 2023.

And it also coincided with this explosion of advancement in intelligence through LLMs, VLMs, and early signals of that in robotics.

And so, I think there's a group of people, forward-thinking people, Jensen very much included, this is near and dear to his heart, that felt that the time is right for this dream of humanoid robotics to finally be realized.

Right.

Let's actually go for it.

And this begs the question of why humanoids at all?

Why have people been so interested in humanoids?

Why do people believe in humanoids?

And I think that the most common answer you'll get to this, which I believe makes a lot of sense, is that the world has been designed for humans.

We have built everything for us, for our form factors, for our hands.

And if we want robots to operate alongside us in places that we go to every day, in our home, in the office, and so on, we want these robots to have our form.

And in doing so, they can do a lot of things, ideally, that we can.

We can go up and down stairs that were really built for the dimensions of our legs.

We can open and close doors that are located at a certain height and have a certain geometry because they're easy for us to grab.

Humanoids could manipulate tools like hammers and scissors and screwdrivers and pipettes, if you're in a lab, these sorts of things, which were built for our hands.

And so, that's really the fundamental argument about why humanoids at all.

And it's been amazing to see this iterative process where there's advancements in intelligence and advancements in the hardware.

So basically, the body and the brain, and going back and forth and just seeing, for example, the amount of progress that's been happening over the past couple of years in developing really high-quality robotic hand hardware.

It's amazing.

So that's really my understanding of the story and the fundamental argument behind humanoid robots.

But I definitely see, I would say I see a future where these things actually just coexist, traditional and humanoid.

Yeah.

So earlier, we were talking about the importance of simulation, creating world environments where robots can explore, can learn all the different approaches to that.

And I think we touched on this a little bit, but can you speak specifically to the role of simulated or synthetic data versus real world data?

It's something we touched upon.

And again, listeners, the more we're talking about this, I feel like all these recent episodes sort of coming together, talking about the increasing role of AI broadly generating tokens for other parts of the system to use and all of that.

So when it comes to the world of robotics, simulated data, real world data, how do they work?

How do they coexist?

Yeah.

So first, I'd like to say that in contrast with a number of other areas like language and vision, robotics is widely acknowledged to have a data problem.

So there is no internet scale corpus of robotics data.

And so that's really why so many people in robotics are very, very interested in simulation and specifically using it to generate synthetic data.

So that's basically the idea: simulation can be used to produce high-fidelity renderings of the world.

They can be used to do really high quality physics simulations, and they can be used as a result to generate a lot of data that would just be totally intractable to collect in the real world.

And real world data is, you know, generally speaking, your source of ground truth.

It doesn't have any gap with respect to the real world because it is the real world.

But it tends to be much harder to scale. You know, in contrast with autonomous vehicles, for example, robotics doesn't really have a car at the moment; there aren't fleets of robots that everybody has access to.

Can't put a dash cam on those little food delivery robots and get the data you need.

Even if you could, you know, would it be nearly enough data? The answer is probably no, you know, to train general intelligence.

You know, that's kind of why people are really attracted to the idea of using simulation to generate data.

And real-world data, whenever you can get it, is the ideal source of data.

But it's just really, really difficult to scale.

So you mentioned, you know, using real world data, there's no gap.

We've talked about the sim to real gap in other contexts.

How do you close it in robotics?

What's the importance of it?

Where are we at?

And you talked about a little bit, but get into the gap a little more and what we can do about it.

Sure.

So sim to real gap.

So there are different areas in which simulation is typically different from the real world.

So one is, you know, on the perception side: literally, you know, the visual qualities of simulation are very different from the real world. Simulation often looks different from the way the real world does.

So that's one source of gap.

Another source of gap is really on the physics side.

So for example, in the real world, you might be, you know, trying to manipulate something, pick up something that is very, very flexible.

And your simulator might only be able to model rigid objects, you know, or rigid objects connected by joints.

And you know, even if you had a perfect model in your simulator of whatever you're trying to move around or manipulate, you still have to figure out like what are the parameters of that model?

You know, what is the stiffness of this thing that I'm trying to move around?

What is the mass?

What are the inertia matrices and these properties?

So physics is just another gap.

And then there are other factors, things like latencies.

So in the real world, you might have different sensors that are streaming data at different frequencies.

And in simulation, you may not have modeled all of the complexities of, again, different sensors coming in at different frequencies, your control loop maybe running at a particular frequency.

And these things may have a certain amount of jitter or delay in the real world, which you may or may not model in simulation, right?

Okay.

So these are just a few examples of areas where, you know, things might be quite different between simulation and the real world.

And generally speaking, the ways around this are you either spend a lot of time modeling the real world, really capturing the visual qualities and the physics phenomena and the physics parameters and the latencies and putting that in simulation, but that can take a lot of time and effort.

Another approach is, you know, called domain randomization or dynamics randomization.

And the idea is that you can't possibly identify everything about the real world and put it into simulation.

So whenever I'm doing learning on simulated data, let me just randomize a lot of these properties.

So I want to train a robot that can, you know, pick up a mug or, you know, put two parts together.

And it should work in any environment.

It shouldn't really matter what the background looks like.

So let me just take my simulated data and randomize the background.

In many, many, many different ways.

And you can do similar strategies for physics models as well.

You can randomize different parameters of physics models.
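As a rough illustration of the domain randomization idea described here, the following is a minimal sketch, not anything from NVIDIA's tooling. The `EpisodeConfig` fields, the parameter ranges, and the `sample_episode_config` helper are all hypothetical; a real pipeline would pass values like these to a renderer and physics engine before each simulated episode.

```python
import random
from dataclasses import dataclass


@dataclass
class EpisodeConfig:
    """Hypothetical per-episode settings that a simulator would consume."""
    background_id: int       # which background image or texture to render
    light_intensity: float   # relative brightness of the scene lighting
    object_mass_kg: float    # mass of the object being manipulated
    friction: float          # contact friction coefficient
    control_delay_ms: float  # latency injected into the control loop


def sample_episode_config(rng: random.Random) -> EpisodeConfig:
    """Draw one random configuration; the ranges are illustrative assumptions."""
    return EpisodeConfig(
        background_id=rng.randrange(1000),
        light_intensity=rng.uniform(0.3, 1.5),
        object_mass_kg=rng.uniform(0.1, 0.5),
        friction=rng.uniform(0.4, 1.2),
        control_delay_ms=rng.uniform(0.0, 40.0),
    )


rng = random.Random(0)
# Each training episode gets a freshly randomized world, both visual and physical.
configs = [sample_episode_config(rng) for _ in range(5)]
for cfg in configs:
    print(cfg)
```

The design intent is that a policy trained across many such configurations cannot rely on any single background, mass, or delay, so the real world ends up looking like just one more sample.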

And then there's also another approach, which is really focused on domain adaptation.

So I really care about a particular environment in which I want to deploy my robot.

So let me just augment my simulated data to be reflective of that environment.

You know, let me make my simulation look like an industrial work cell or let me make it look like my home, because I know I'm going to have my robot operate here.

And maybe the final approach is kind of, you know, this thing called domain invariance.

So there's randomization, adaptation, and invariance, and invariance is basically the idea that I'm going to remove a lot of information that is just not necessary for learning.

You know, maybe if I'm picking up certain objects, I only need to know about the edges of these objects.

I don't need to know what color, for example.

So, you know, taking that idea and incorporating it into the learning process and making sure that my networks themselves or my data might be transformed in a way that it's no longer reliant on these things that don't matter.
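To make the edges-versus-color example concrete, here is a minimal sketch of one way data could be transformed so that color no longer matters. It is an assumption-laden toy: `edge_map` and `square_scene` are hypothetical helpers, images are nested lists of RGB tuples in the range 0 to 1, and averaging the channels is a crude stand-in for whatever preprocessing a real system would use.

```python
def edge_map(image, threshold=0.1):
    """Return a binary edge map from an RGB image given as rows of (r, g, b) tuples."""
    # Collapse color to a single brightness value per pixel.
    gray = [[sum(px) / 3.0 for px in row] for row in image]
    h, w = len(gray), len(gray[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Brightness change toward the right and downward neighbors.
            dx = abs(gray[y][x + 1] - gray[y][x]) if x + 1 < w else 0.0
            dy = abs(gray[y + 1][x] - gray[y][x]) if y + 1 < h else 0.0
            edges[y][x] = 1 if max(dx, dy) > threshold else 0
    return edges


def square_scene(color, size=6):
    """A small square of the given color centered on a black background."""
    bg = (0.0, 0.0, 0.0)
    return [
        [color if 2 <= y < 4 and 2 <= x < 4 else bg for x in range(size)]
        for y in range(size)
    ]


red_edges = edge_map(square_scene((1.0, 0.0, 0.0)))
blue_edges = edge_map(square_scene((0.0, 0.0, 1.0)))
# A network fed only edge maps sees the same input for either object color.
print(red_edges == blue_edges)
```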

Yeah.

I'm thinking about all of the data coming in and, you know, all the things that can be captured by the sensors and using video to train.

And earlier, you were talking about the problem and it made me think of reasoning models, the problem of, you know, can you give a robot a task and can it break it down and reason its way and then actually execute and do it?

Reasoning VLA models have been talked about a lot recently.

I keep hearing about them, anyway.

Can you talk a little bit about what they are and how they're used in robotics?

Yeah, absolutely.

So reasoning itself, you know, just stepping back for a second, reasoning is an interesting term because it means many things to many different people.

Yeah.

I think a lot of people think about things like logic and causality and common sense and so on, you know, different types of reasoning.

And you can use those to draw conclusions about the world.

Reasoning in the context of LLMs and VLMs and now VLAs, so vision language action models that produce actions as outputs, often means, you know, in simple terms, thinking step by step.

In fact, if you go to ChatGPT and you say, here's my question, you know, show me your work or think step by step, it will do this form of reasoning.

And so that's the idea is that you can often have better quality answers or better quality training data if you allow these models to actually engage in a multi-step thinking process.

And that's kind of the essence of reasoning models.

And reasoning VLAs are no exception to that.

So I might give a robot a really hard task like setting a table.

And maybe I want my VLA to now identify what are all the sub tasks involved in order to do that.

And within those sub tasks, what are all the smaller scale trajectories that I need to generate and so on.

So this is kind of the essence of the reasoning VLA.

Got it.

Right.

So to start to wrap up here, I was going to ask, I am going to ask you to sort of, in a way, it's kind of summarizing what we've been talking about, but maybe to put kind of a point on what you think sort of the most important current limitations are to robotic learning that, you know, we're working, you're working, you and your teams and folks in the community are working to overcome.

You mentioning setting the table though, made me think, you know, a better way to ask that, how far are we from laundry folding robots?

Like, am I going to, I'm the worst at folding laundry.

And I always see demos and I heard at some point that, you know, folding laundry sort of represents conceptually a very difficult task for a robot.

Am I going to see it soon before my kids go off to school?

I think you might see it soon.

I've seen some really impressive work coming out recently, you know, from various companies and demos within NVIDIA on things like laundry folding.

Yeah.

And, you know, the general process that people take is to collect a lot of demonstrations of people actually folding laundry and then use imitation learning paradigms or variants.

Try to learn from those demonstrations.

And this ends up actually being, if you have the right kind of data and enough data and the right model architectures, you can actually learn to do these things quite well.

Now, the classic question is how well will it generalize?

If I learn to fold, you know, if I have a robot that can fold my laundry, can it fold your laundry?

Right, right.

The typical answer to that is you probably need some amount of data that's in the setting that you actually want to deploy the robot in, and then you can fine-tune these models.

But I would say we're actually pretty, we're getting closer and closer, closer than certainly I've ever seen on tasks like laundry folding.

I'm excited.

You've got me optimistic and I thank you for that.

So perhaps to get back to the more general conversation of interest, the current limitations, what do you see them as and, you know, what's the prognosis on getting past them?

Sure.

I think one big one is people feel, I would say the community as a whole is really optimistic about the role of simulation in robotics, or at least most of the community.

Simulation can take different forms.

It can take kind of the physics simulation approach, or it can take this, you know, video generation, like let me, let me just predict what the world will look like.

And these are really, you know, really thriving paradigms.

And I think there are two questions around that. One we just talked about, which is the sim-to-real gap.

So on the sim-to-real gap, people have made a lot of progress, and it's something we've worked very hard on at NVIDIA, but there's still a lot more progress to be made, you know, until we can truly generate data and experience in simulation and have it transfer to the real world without having to, you know, put a lot of thought and engineering into truly making it work.

And conversely, there's the real-to-sim question.

So building simulators is really, really difficult.

You again, have to, you know, design your scenes and design your 3d assets and so on.

Wouldn't it be great if we could just take some images or take some videos of the real world and instantly have a simulation that also has physics properties? One that doesn't just have the visual representation of the world, but has realistic masses and friction and these other properties.

So sim to real and real to sim, I think are two big challenges.

And we're just getting closer and closer, you know, every few months on solving those problems.

And then the boundaries between sim and real, I think, will start to be a little bit blurred, which is kind of a, maybe, an interesting possibility.

I think that's one big thing.

And the second big thing I'd say for now is the data question.

Again, robotics, as we're talking about it here, doesn't have the equivalent of a car.

There is no fleet of robots that everybody has access to that can be used to collect a ton of data.

And until that exists, I think we have to think a lot more about where we're going to get that data from.

And one thing that the GR00T effort at NVIDIA, which is around humanoids, has proposed is this idea of the data pyramid, where you basically have, you know, at the base of the pyramid, things like videos, YouTube videos that you're trying to learn from, and then maybe a little bit higher in the pyramid, you have things like synthetic data that's coming from different types of simulators.

And then maybe at the top of the pyramid, you have something like data that's actually collected in the real world.

And then the question is, what is the right mixture of these different data sources to give robots this, you know, general intelligence.
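One way to picture the "right mixture" question is as a set of sampling weights over the pyramid's layers. The sketch below is only an illustration: the layer names and the weights in `MIXTURE` are placeholders chosen so the base is sampled most often, not values from the discussion or from any NVIDIA system, and `sample_batch_sources` is a hypothetical helper.

```python
import random
from collections import Counter

# Hypothetical mixture weights: wide base of web video, a middle layer of
# synthetic data, and a small top layer of real robot data.
MIXTURE = {
    "web_video": 0.6,
    "synthetic_sim": 0.3,
    "real_robot": 0.1,
}


def sample_batch_sources(batch_size: int, rng: random.Random) -> list:
    """Pick which data source each training example in a batch is drawn from."""
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=batch_size)


rng = random.Random(0)
counts = Counter(sample_batch_sources(10_000, rng))
for name in MIXTURE:
    print(name, counts[name])
```

In practice the weights themselves would be something to tune, which is exactly the open question raised here.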

So, Yash, as we're recording this, CoRL is coming up.

Let's end on that forward looking note.

And it'll be a good segue for the audience to go check out what CoRL is all about.

But tell us what it's about and what your and NVIDIA's participation is going to be like this year.

Yeah, absolutely.

So, um, CoRL stands for the Conference on Robot Learning, and it started out as a small conference, I think, and, you know, 2017 was maybe the first edition of it.

And it's grown tremendously.

It's one of the hottest conferences in robotics research now, as learning itself as a paradigm has really taken off.

This year, it's going to be in Seoul, in Korea, which is extremely exciting.

Yeah.

And it's going to bring together the robotics community, the learning community and the intersection of those two communities.

And so, you know, I think everybody in robotics is looking forward to this.

Our participation: you know, the Seattle Robotics Lab and other research efforts at NVIDIA, for example, the GEAR lab, which focuses on humanoids, are presenting a wide range of papers.

And so we're going to be giving talks on those papers, presenting posters on those papers, hopefully some demos.

And, you know, we're just going to be really excited to talk with researchers and, you know, people who will be interested in joining us in our missions.

Fantastic.

Any of those posters and papers you're excited about in particular, maybe you want to share a little teaser with us?

Yeah, I'm excited about a number of them.

But one that I can just call out for now, that I work closely on is this project called Neural Robot Dynamics.

So that's the name of the paper.

And we have, you know, abbreviated that to NeRD.

I was going to ask, I'm glad.

So it's, yeah, it's just N-E-R-D, also kind of inspired by neural radiance fields.

Right.

Right, of course.

Yeah.

So we have this framework and these models, which we call NeRD.

And the idea is basically that classical simulation, so typical physics simulators kind of work in this way where they are, you know, performing these explicit computations about here are my joint torques of the robot.

Here are some external forces.

Here's some contact forces.

And let's predict the next state of the robot.

And the idea behind neural simulation is can we capture that all with a neural network?

And so that you know, you might be wondering, why would you want to do that?

And there are some advantages to this.

So one is that, you know, neural networks are inherently differentiable.

And what that means is that you can understand if you slightly change the inputs to your simulator, what would be the change in the outputs?

And if you know this, then you can perform optimization, you can figure out how do I optimize my inputs to get the robot to do something interesting.

So neural networks are inherently differentiable.

And if you can capture a simulator in this way, you can essentially create a differentiable simulator for free, which is kind of exciting.
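The optimization idea described here can be sketched with a toy stand-in for a learned dynamics model. Everything below is an assumption made for illustration: the one-unit "network" in `predict_next_state`, its weights `W1` and `W2`, and the learning rate are invented, and this is not the NeRD model. The point is only that, because the model is differentiable, the gradient of the output with respect to the input action can be used to search for an action that reaches a target state.

```python
import math

# Hypothetical "trained" weights of a tiny one-unit dynamics model.
W1, W2 = 0.8, 1.5


def predict_next_state(state: float, action: float) -> float:
    """Toy neural dynamics: next_state = state + W2 * tanh(W1 * action)."""
    return state + W2 * math.tanh(W1 * action)


def d_next_state_d_action(action: float) -> float:
    """Analytic derivative of the model output with respect to the action."""
    t = math.tanh(W1 * action)
    return W2 * W1 * (1.0 - t * t)


def optimize_action(state: float, target: float, lr: float = 0.2, steps: int = 200) -> float:
    """Gradient descent on the action to minimize (predicted - target)^2."""
    action = 0.0
    for _ in range(steps):
        error = predict_next_state(state, action) - target
        grad = 2.0 * error * d_next_state_d_action(action)  # chain rule
        action -= lr * grad
    return action


best_action = optimize_action(state=0.0, target=1.0)
print(best_action, predict_next_state(0.0, best_action))
```

With a classical, non-differentiable simulator, the same search would typically need finite differences or sampling; here the gradient comes straight from the model.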

Another thing, which is really exciting to us, is fine-tunability.

So it's very difficult, if you're given a simulator and you have some set of real-world data that you collected on that particular robot that you're simulating, to actually figure out how should I modify the simulator to better predict that real-world data.

And neural simulators can kind of do this very, very naturally, you can fine tune them, just like any other neural network.

So I can train a neural network on some simulated data, and then collect some amount of real world data, and then fine tune it.

And this process can be continuous, you know, if my robot changes over time, or there's wear and tear, I can continue fine tuning it, and always have this really accurate, you know, simulator of that robot, which is pretty exciting.
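The train-in-sim, then fine-tune-on-real loop can be sketched with a deliberately tiny model. This is a toy under stated assumptions: the linear model `next_v = a * v + b * u`, the "sim" and "real" coefficients, the dataset sizes, and the helper names are all hypothetical, chosen only to show that the same gradient-based training routine serves for both the initial fit and the later fine-tuning.

```python
import random


def make_data(a, b, n, rng):
    """Generate (velocity, action, next_velocity) samples from next_v = a*v + b*u."""
    data = []
    for _ in range(n):
        v, u = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        data.append((v, u, a * v + b * u))
    return data


def mse(params, data):
    """Mean squared prediction error of the model (a, b) on a dataset."""
    a, b = params
    return sum((a * v + b * u - y) ** 2 for v, u, y in data) / len(data)


def train(params, data, lr=0.5, epochs=300):
    """Full-batch gradient descent; used for both pretraining and fine-tuning."""
    a, b = params
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for v, u, y in data:
            err = a * v + b * u - y
            grad_a += 2.0 * err * v
            grad_b += 2.0 * err * u
        a -= lr * grad_a / len(data)
        b -= lr * grad_b / len(data)
    return a, b


rng = random.Random(0)
sim_data = make_data(a=0.9, b=0.10, n=500, rng=rng)   # plentiful simulated data
real_data = make_data(a=0.8, b=0.12, n=20, rng=rng)   # scarce "real" data; the robot differs

pretrained = train((0.0, 0.0), sim_data)
error_before = mse(pretrained, real_data)
fine_tuned = train(pretrained, real_data)             # same routine, new data
error_after = mse(fine_tuned, real_data)
print(error_before, error_after)
```

If the robot drifts over time, the last two steps could simply be repeated on freshly collected data.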

Yeah, that's really cool.

Yeah, I think it's really cool.

And a third advantage, which we are sort of in the early stages of exploring is really on the speed side.

So a lot of compute today, as many people know, has been really optimized for AI workloads and specific types of mathematical operations, specific types of matrix multiplications, for example, that are very common in neural networks.

And if you can transform a typical simulator into a neural network, then you can really take advantage of all of these speed benefits that come with the latest compute and with the latest software built on top of that.

So that's really exciting to us.

And we sort of did this project in a way that allows these neural models to really generalize.

So given a particular robot, if you put it in a new place, you know, in the world, or you change some aspects of the world, this model can still make accurate predictions, and it can make accurate predictions over a long time scale.

Amazing.

For listeners who would like to follow the progress at CoRL in particular, Seattle Robotics Lab in particular, NVIDIA more broadly, where are some online places, some resources you might direct them to?

Yeah, I'd say the CoRL website itself is probably your, you know, your primary source of information.

So you'll find the program for CoRL, you'll find, you know, links to actually watch some of the talks at CoRL.

You'll be able to have links to papers, and you'll see the range of workshops that are going to be there.

And a lot of them, I'm sure will post recordings of these workshops.

That's a great way to get involved.

And that's just corl.org for the listeners.

Yes, yes, that's right.

Yeah, you get your website as well.

I'm sure we'll have on the website and through NVIDIA social media accounts.

No, you could probably call out to those.

I'm sure there's gonna be plenty of updates on CoRL over the next period of time.

Can I ask you, as a parting shot here, predict the future for us.

What does the future of robotics look like?

You can look out a couple of years, five years, 10 years, whatever time frame makes the most sense.

And you know, we won't hold you to this.

But what do you think about when you think about the future of all this?

Yeah, I think it comes down to those fundamental questions.

So you know, one is kind of what will the bodies of robots look like?

So this is kind of what you touched on with, you know, robot arms in factories versus humanoids.

And I think what you'll see is that there'll be a place for both.

So, you know, robot arms and more traditional looking robots will still operate in environments that are really built for them or need an extremely high degree of optimality.

And humanoids will really operate in environments where they need to actually be, you know, alongside humans and, you know, in your household and in your office and so on around many, many things that have been built for humans.

So I kind of see that as the future of the body side of things.

On the brain side of things, there's also these questions of, you know, modular versus end to end paradigms.

And what I've seen in autonomous vehicles is, of course, as we talked about before, starting with modular, swinging to end to end, you know, starting to converge on something in the middle.

And I can imagine that robotics, as we're talking about here, for example, robotic manipulation, will start to follow a similar trajectory where we will explore end to end models and then probably converge on hybrid architectures until we collect enough data that an end to end model is actually all we need.

You know, that's kind of how I see those aspects.

There are some other questions, for example, are we going to have specialized models or are we just going to have one big model that solves everything?

Right.

You know, that one is a little bit hard to predict.

But I would say that, again, there's probably a role for both where we're going to have specialized models for very specific domain specific tasks and where, for example, power or energy limits are very significant.

And you're going to have sort of these generalist models in other domains where you need to do a lot of different things and you need a lot of common sense reasoning to solve tasks.

Yeah, I would say those are some open debates.

And that would be my prediction.

And then maybe one other thing that you touched on was simulation versus the real world.

And again, I kind of see this as one of the most exciting things.

I'd love to see how this unfolds.

But I really feel that the boundaries between simulation and real world will start to be blurred.

The sim to real problem will be more and more solved.

And the real to sim problem will also be more and more solved.

And so we'll be able to capture the complexity of the real world and make predictions in a very fluid way, perhaps using a combination of physics simulators and these world models that people have been building, like Cosmos.

Amazing future.

Yash, thank you so much.

This has been an absolute pleasure.

And I know you have plenty to get back to.

So we appreciate you taking the time out to come on the podcast.

All the best with everything and enjoy CoRL.

Can't wait to follow your progress and read all about it.

Thank you so much.

No, it's been a pleasure.

Yeah.