Training Data · 2025-01-28

ReflectionAI Founder Ioannis Antonoglou: From AlphaGo to AGI

Hosts: Unknown

Guests: Ioannis Antonoglou

Tags: AlphaGo, AlphaZero, MuZero, DeepMind, reinforcement learning, Monte Carlo Tree Search, self-play, synthetic data, data wall, AI agents, world models, AGI timeline, SWE-bench, Reflection AI

Why it matters

AlphaGo combined a policy network (move suggestion), a value network (win-probability gut feel), and MCTS.

Key claims

AlphaGo combined a policy network (move suggestion), a value network (win-probability gut feel), and MCTS; initial training used human game data, then self-play via policy gradient.
Move 37 looked like a hallucination but turned out to be creative genius; Move 78 exposed a real blind spot that was fixed by scaling and switching to a ResNet architecture in AlphaZero.
AlphaZero learned Go from scratch via self-play, demonstrating that a policy-improvement-and-distillation loop (search → better policy → retrain) can produce superhuman play without human data.
MuZero removed the need for a perfect simulator by learning an internal world model—directly analogous to modern video world models like Sora.

Episode summary

Summary

Ioannis Antonoglou, founding engineer at DeepMind and now founder of Reflection AI, traces the lineage from AlphaGo to AlphaZero to MuZero and explains why those breakthroughs are directly relevant to today's LLM agent race. He walks through the technical core of AlphaGo—policy and value networks combined with Monte Carlo Tree Search—and recounts the famous Lee Sedol match, including the initially mistaken Move 37 (later revealed as creative genius) and Move 78, which exposed a blind spot that scale and self-play eventually solved. AlphaZero demonstrated that pure self-play without human data could reach superhuman performance, and MuZero extended this by learning its own internal world model, removing the need for a perfect simulator.

Antonoglou argues that reinforcement learning is essential to the next phase of AI progress because the field is approaching a data wall for LLM pretraining. He frames RL broadly as anything learned through trial and error and sees it as the path to high-quality synthetic data, citing AlphaZero's policy-improvement-and-distillation loop as a template for what today's systems need to imitate. He is skeptical that simply scaling LLMs will produce robust reasoning; planning at inference time and reward-driven improvement are required. He draws explicit parallels between MuZero and modern world models like Sora, and between AlphaZero-style search and methods like Q*.

Looking forward, Antonoglou identifies planning, in-context learning, and reliability/robustness as the three biggest open problems for agentic AI. He predicts 1–3 years to 50% on SWE-bench and 3–5 years to 90%, 5 years before LLMs have their "AlphaZero moment" where compute translates directly into intelligence, and at least one more year of text pretraining runway before the data wall becomes binding—with synthetic data bridging the gap. He praises David Silver and Ilya Sutskever as the most influential researchers in his career.

Antonoglou frames RL as the only sustainable path past the LLM data wall, since RL can trade compute for intelligence without relying on finite human-generated data.
He identifies planning, in-context learning, and reliability/robustness as the three critical open problems for agentic AI today.
Predictions: 1–3 years to 50% on SWE-bench, 3–5 years to 90%, ~5 years to an LLM "AlphaZero moment" where compute translates directly to intelligence.
He credits David Silver (his PhD supervisor and RL pioneer) and Ilya Sutskever as the two most influential researchers of his career.

Source material

Transcript

Go is a complex game, and there was always a bit of worry about whether AlphaGo was truly as good as we believed.

So we actually had the conviction that deep reinforcement learning is the answer.

Based on everything that we could measure and everything we could see.

But the thing about these systems is that they're not like classic computers, where you just know that they always produce the same answer.

They're like stochastic.

They are creative.

And they have some blind spots.

They hallucinate.

Similarly to how modern LLMs hallucinate.

So you need to just like really push them and just like see exactly where they break.

And the only way you could actually do that is by having like the best humans playing against them.

Today we're excited to welcome Ioannis Antonoglou, a researcher and an engineer who has contributed to some of the most significant breakthroughs in AI.

As a founding engineer at DeepMind, Ioannis played a crucial role in developing AlphaGo, which made history by defeating Go world champion Lee Sedol.

He later co-led the development of MuZero, which pushed the boundaries even further by mastering multiple games autonomously.

Now, as he embarks on his latest venture with Reflection AI, he's focused on building the next generation of AI agents.

We're excited to talk to Ioannis about the breakthrough moments in AI history that he's witnessed firsthand.

From AlphaGo's famous Move 37 to his perspective today on what's next for the combination of reinforcement learning and large language models on the way to AGI.

Ioannis, thank you so much for joining us today.

Thank you so much for having me.

Ioannis, you have an incredible background, having worked at DeepMind as a founding engineer for over a decade, starting with some of the most notable projects that have really defined the industry.

DeepMind quite notably created this notion of building AI within games to start.

Can you share a little bit more about why DeepMind chose to start with games at the time?

Yeah, so DeepMind was the first company to truly embrace the concept of artificial general intelligence or AGI.

From the outset, they had grand ambitions, aiming to build systems that would match or exceed human intelligence.

So the big question was and still is, how do you build AGI?

And more importantly, how do you measure intelligence in a way that allows for meaningful research and performance improvements?

So the idea of using video games as a testing ground came naturally to DeepMind founders.

That was Demis Hassabis and Shane Legg: Demis had a background in the gaming industry, and Shane's PhD thesis defined AGI as a system that could learn to complete any task.

Video games provided a controlled yet complex environment where these ideas could be explored and tested.

You mentioned that games provide a very controlled environment.

To what extent are games representative or not of the real world?

Like if you have a result in games, do you think that generalizes naturally to the real world or not?

So, I mean, I guess games have indeed been valuable for developing AI.

And you actually have like a few examples of that.

So you can see that PPO, for example, which is currently being used in RLHF, was developed using OpenAI Gym, with MuJoCo and Atari.

And similarly, you have MCTS, which stands for Monte Carlo Tree Search, and which was developed through board games like backgammon and Go.

But at the same time, games have like a number of limitations.

So the real world is messy, it's unbounded, and it's a much tougher nut to crack than even the most complex games.

So even though it just gives you an interesting test bed to develop new ideas, it's definitely limiting.

And it doesn't really capture all the complexity of the real world.

Okay, interesting, though.

So a lot of the techniques and algorithms that you've developed in the game environment, PPO, etc., these are used in the real world.

Yeah, so PPO is actually exactly what ChatGPT used for RLHF.

And MCTS is used in MuZero, and MuZero has been used in the real world in things like compression, video compression for YouTube.

It was part of the self-driving system at Tesla at some point, and it was also used for developing a pilot that was completely controlled by an AI.

So yeah, I mean, you can see methods like that being used in the real world to solve real problems.

So interesting.

Ioannis, I remember back in 2017 when AlphaGo, the movie, came out and it featured the incredible game of AlphaGo against Lee Sedol.

Can you take us back to that moment in time and maybe the years leading up to it as you're building AlphaGo?

How was Go specifically chosen as the game to focus on?

So I think games have always been a benchmark for AI research.

So before Go, you had chess, and chess was a major milestone, with IBM's Deep Blue defeating Garry Kasparov in the late 90s.

And I mean, even though chess and Go are completely different games and Go is definitely a different beast, games, especially board games, have always acted as test beds for the development of new AI methods.

Actually, even going back to the earliest days of AI research, Turing and Shannon, they both worked on their own versions of chess bots.

So now the thing about Go is that it's a much harder problem than chess.

The reason for that is that it's almost impossible to define an evaluation method, a heuristic.

So in chess, you can just take a look at the board, you can count the number of pawns that each side has.

You can see what the ranks of these pawns are.

And then you can just like make some, you can draw some conclusions, like who is winning and why.

But like in Go, there's nothing like that.

Like it's mostly human intuition.

And if you ask a professional Go player how they know whether a position is a good one or a bad one, they will say that, after having played the game for so long, they can just feel it in their gut that this is a better position than the other one.

So now it's actually a question of how do you encode the feeling in your gut into like an AI system, right?

So this is exactly the reason why solving Go was considered the holy grail of AI research for a long time.

And it was a challenge that seemed almost impossible.

But at the same time, it was like within reach.

People felt that like they could actually get it cracked.

And this is exactly what AlphaGo did back in 2016.

And it showcased two new methods, deep learning and reinforcement learning.

Now we think of deep learning and reinforcement learning as mature technologies, but back in 2015 and 2016 we were literally taking the first steps.

They were the new kid on the block.

And most people were really skeptical about them.

Everyone thought that deep learning was another AI fad that wouldn't stand the test of time.

So, yeah, I mean, Go was chosen because it was a clear way to show that you actually have the most performant agent in the world.

You could actually evaluate it.

You can have it play with other humans.

And at the same time, it was within reach given like the latest developments in deep learning and reinforcement learning.

I remember reading that there are more configurations of the Go board than atoms in the universe, by many orders of magnitude.

And that blew me away because, I mean, I grew up playing Go and it felt like such a simple game in terms of the rules.

I see why it was the holy grail.

Maybe can you explain how AlphaGo worked?

Technically, maybe explain it to me like I'm a fifth grader, because that is effectively my level of sophistication in understanding these things.

But how did it work?

I mean, you mentioned both reinforcement learning and deep learning were involved.

I'd love to peel that back a little bit.

Yeah, absolutely.

So AlphaGo has two deep neural networks.

So a neural network is a function that takes something as an input and produces something as an output.

And it's literally like a black box.

We don't really know exactly how it does it.

We just know that if you train it on enough data, it will learn the mapping.

It will learn the function from input to output.

So AlphaGo actually had access to two deep neural networks, the policy network and the value network.

And the policy network suggested the most promising move.

So it will just take a look at a current position and just like say, OK, you know, based on the current position, this is the list of moves that I would recommend you just like consider playing.

And it also had access to the value network.

It would just take a look at a board position and give you a winning probability.

Like what are your chances of actually winning the game starting from this position?

This is exactly the gut feeling like it had like its own gut feeling on like whether the position is a good one or a bad one.

So once you have access to these two networks, then you can actually like play in your imagination a number of games.

You can consider like the most promising moves, then you can consider your opponents most promising moves.

And then you can evaluate these moves with the value network.

And then, you know, you can use a method called minimax.

What that says is that I want to win the game.

But I also like know that my opponent wants to win the game.

So I want to just like pick a move that will maximize my chances of winning, knowing that like my opponent will try to maximize their chance of winning.

So if you actually like do that and simulate a bunch of moves, then you can just like get the optimal action.
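
The minimax idea described here can be sketched on a tiny hand-made game tree. This is only an illustration: the tree, the leaf values (standing in for value network estimates), and the function names are assumptions, not AlphaGo's actual code.

```python
from typing import Union

# A toy game tree (hypothetical): a leaf is my estimated win probability,
# standing in for a value network output; an inner node is a list of children.
Tree = Union[float, list]

def minimax(node: Tree, my_turn: bool) -> float:
    """Value of `node` for me, assuming both sides play their best move."""
    if isinstance(node, (int, float)):
        return float(node)
    child_values = [minimax(child, not my_turn) for child in node]
    # I maximize my win probability; my opponent minimizes it.
    return max(child_values) if my_turn else min(child_values)

def best_move(root: list) -> int:
    """Index of the move whose worst-case outcome is best for me."""
    values = [minimax(child, my_turn=False) for child in root]
    return max(range(len(values)), key=values.__getitem__)

# Three candidate moves; after each, the opponent picks the reply worst for me.
tree = [[0.9, 0.2], [0.6, 0.5], [0.7, 0.1]]
choice = best_move(tree)
```

The first move has the most attractive single outcome (0.9), but the opponent would steer toward 0.2, so the second move, whose worst case is 0.5, is the one selected.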

And, you know, the way to do this imagination, this planning, this search in the most efficient way is by using a tree search method called Monte Carlo Tree Search.

So MCTS.

So whenever people talk about MCTS, they literally just mean this heuristic of how do I choose which futures to consider so that I can make informed decisions.
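
One way to picture "choosing which futures to consider" is the selection step of the search. The sketch below uses a PUCT-style score that mixes the policy prior with the average backed-up value; the exact formula, the constant `c_puct`, and the `Edge` class are illustrative assumptions rather than the production AlphaGo implementation.

```python
import math
from dataclasses import dataclass

@dataclass
class Edge:
    prior: float            # policy network's probability for this move
    visits: int = 0         # how many simulations have tried this move
    value_sum: float = 0.0  # sum of evaluations backed up through this move

    def mean_value(self) -> float:
        return self.value_sum / self.visits if self.visits else 0.0

def select(edges: dict[str, Edge], c_puct: float = 1.5) -> str:
    """Pick the move to explore next: high value, high prior, few visits."""
    total_visits = sum(e.visits for e in edges.values())

    def score(e: Edge) -> float:
        explore = c_puct * e.prior * math.sqrt(total_visits + 1) / (1 + e.visits)
        return e.mean_value() + explore

    return max(edges, key=lambda move: score(edges[move]))

edges = {"a": Edge(prior=0.6), "b": Edge(prior=0.3), "c": Edge(prior=0.1)}
first = select(edges)      # nothing visited yet, so the prior decides
edges["a"].visits = 10     # pretend "a" was tried 10 times with neutral results
second = select(edges)     # the exploration bonus now favors an untried move
```

Before any simulation the move with the highest prior is explored; once it has been visited many times without standing out, the bonus term shifts attention to less-explored moves.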

As for the role of reinforcement learning and deep learning in building AlphaGo: AlphaGo was first of all a success of reinforcement learning and deep learning, because these are exactly the two methods that powered it.

And the policy network was initially trained on a large set of human games.

So you had like many games played by human professionals and you just like consider every position and you consider the move they took at this position.

And then you have a deep neural network that tries to predict this move.

Then once you have the policy network, you need to somehow find a way to just like obtain a value network.

So we did it in two ways.

First, we just took the policy network and we had it play against itself, and we used reinforcement learning to improve the playing strength of the model.

So we use a technique called policy gradient.

So what policy gradient does is that it just like looks at the game and then it looks at the outcome.

This is the simplest version of like of policy gradient.

It looks at the outcome of the game, and for all the moves that led to a win, it'll just say, great, you know, just increase the probability of choosing this move.

And for all the moves that led to a loss, it says, great.

Now decrease the probability of like this move being selected in the future.

And if you do that, like, you know, for many games and for long enough, then you just like get an improved policy.
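
The simplest version of policy gradient described here can be sketched with a softmax policy over a handful of moves. The one-move "game", the win probabilities, and the learning rate are made-up assumptions for illustration; the real system trained a deep network on full games of Go.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, move, outcome, lr=0.1):
    """outcome is +1 for a win and -1 for a loss.

    The gradient of log pi(move) with respect to the logits is
    one_hot(move) - probs, so a win raises the chosen move's probability
    and a loss lowers it.
    """
    probs = softmax(logits)
    return [
        x + lr * outcome * ((1.0 if i == move else 0.0) - p)
        for i, (x, p) in enumerate(zip(logits, probs))
    ]

def train(win_prob, games=5000, seed=0):
    """Play many toy games, each decided by a single move (an assumption)."""
    rng = random.Random(seed)
    logits = [0.0] * len(win_prob)
    for _ in range(games):
        move = rng.choices(range(len(logits)), weights=softmax(logits))[0]
        outcome = 1 if rng.random() < win_prob[move] else -1
        logits = reinforce_step(logits, move, outcome)
    return softmax(logits)

policy = train(win_prob=[0.9, 0.1])  # move 0 usually wins, move 1 usually loses
```

After enough games the policy places most of its probability on the move that tends to win, which is the "improved policy" the passage refers to.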

Now, once you have this improved policy, you can just generate a new data set of games where like the policy plays against itself.

And then you have a huge amount of games where, for each position, you know who the final winner was.

So then you can take this network, you can take another network, a value network and have it predict the outcome of the game based on the current position.

So what the network learns is that if I start at this position and I play under my current policy, on average, this is the player who wins.

It's either a black player or the white player.

So this is the first version of like a value network.

And you can just like use it within AlphaGo by combining it with the policy network.
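
The way self-play games turn into value training data can be sketched as follows. A lookup table of average outcomes stands in for the value network, and the position labels and game records are hypothetical; the point is only that every position in a game is labeled with that game's final winner.

```python
from collections import defaultdict

def fit_tabular_value(games):
    """Estimate P(black wins | position) by averaging self-play outcomes.

    `games` is a list of (positions, winner) pairs, where `positions` are the
    positions visited in one game and `winner` is "black" or "white".
    A real value network would generalize across positions; this table
    only averages the outcomes it has seen.
    """
    totals = defaultdict(lambda: [0.0, 0])  # position -> [black wins, count]
    for positions, winner in games:
        for pos in positions:
            totals[pos][0] += 1.0 if winner == "black" else 0.0
            totals[pos][1] += 1
    return {pos: wins / count for pos, (wins, count) in totals.items()}

# Two toy self-play games that share an opening position "p0".
games = [(["p0", "p1"], "black"), (["p0", "p2"], "white")]
value = fit_tabular_value(games)
```

Here the shared opening gets a value of 0.5 because black won one of the two games that passed through it, while the positions unique to each game inherit that game's result.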

And what were some of the biggest challenges in building this?

And how did you overcome them?

Yeah, so AlphaGo was not just a research challenge, but was mostly, I'd say, an engineering marvel.

The early versions ran on 1200 CPUs and 176 GPUs.

And the version that played against Lee Sedol used 48 TPUs.

So like TPUs were like the first accelerator, custom accelerators.

And these were like these accelerators were like really primitive back then.

Because literally it was like the first version, right?

Like now the later accelerators are much, much better and much more stable.

So the system had to be highly optimized to minimize latency, maximize throughput.

We had to build large-scale infrastructure for training these networks.

And it was a massive endeavor, just required a lot of coordinated effort from many talented individuals working on different aspects of the project.

But I just like walked you through a number of steps to just like obtain the policy network and the value network.

And each of these steps had to just be implemented at the limits of like what was available and what was possible back then in terms of scale.

And it had to be implemented in a way where people could just like think everything.

They could just like try the research ideas fast and get results fast.

So yeah, lots of people scale at levels that hadn't been implemented before.

And it's kind of like working at the forefront of what was possible back then.

I love your highlight of it being a research marvel and an engineering marvel.

And I remember you sharing one time that part of the reason this project came about also was because Google had TPUs that they needed a test customer for.

And that was the spark, this AlphaGo project.

So that's pretty incredible.

How much conviction did the DeepMind team have that this is going to work?

You mentioned that at the time, deep learning, reinforcement learning were still relatively novel but DeepMind was very much founded with that belief.

But did you guys think that you were going to be able to have kind of these superhuman level results beating the top Go player in the world?

Was it a crazy idea and maybe it'll work or did the team have conviction like this is going to work?

So at the start, the team had a cautious optimism.

So one of AlphaGo's lead developers, Aja Huang, is a strong amateur Go player, and he had been working on Go for like a decade before AlphaGo happened.

And we also had a leaderboard of computer Go players.

And you could see that AlphaGo was significantly stronger than anything that had come before.

But Go is a complex game and there was always a bit of worry about whether AlphaGo was truly as good as we believed.

So we actually had the conviction that deep reinforcement learning is the answer based on everything that we could measure and everything we could see.

But the thing about these systems is that they're not like classic computers, where you just know that they always produce the same answer.

They're like stochastic.

They are creative.

And they all have like some blind spots.

They hallucinate, similarly to how modern LLMs hallucinate.

So you need to just like really push them and just like see exactly where they break.

And the only way you could actually do that is by having like the best humans playing against them.

Move 37, can you tell us what that was?

It was such a monumental move.

And I think everyone watching it at the time, and Lee Sedol maybe primarily, was confused by that move.

What was going on in your head when that happened?

So yeah, I mean, move 37 in game two against Lee Sedol was literally a spectacular moment in the sense that it showcased to the world that AlphaGo has creativity.

And it demonstrated that AI could come up with strategies that even top human players hadn't considered.

So at first, I still remember that we thought that AlphaGo had made an error.

That it had actually hallucinated.

It did something it didn't mean to do.

But then it turned out to be a brilliant and unconventional move that underscored that the system had a deep understanding of the game.

The system actually had like creativity.

It could think of things that like people hadn't thought of before.

I want to take us to another key move in the game.

I think it was in game four.

At this point, I was rooting for Lee because I was like, oh, the poor guy needs to win a game.

Move 78.

And AlphaGo made a mistake and Lee Sedol noticed it.

I guess what was the weakness there that Lee found during the game?

Yeah, exactly.

So I mean, Lee Sedol's victory in game four was literally a testament to human ingenuity.

Move 78 was unexpected and caught AlphaGo off guard.

Initially AlphaGo, based on its evaluations, misinterpreted it as a mistake and thought that it was actually winning.

So that's why it didn't respond appropriately.

And this kind of highlighted the blind spot in the system.

So the game showed that while systems like AlphaGo are extremely powerful, at the same time they still have vulnerabilities and there were like still areas where we could further improve it.

But how do you go about improving something like that?

Do you need to show it a lot more data of that type of human ingenuity move or how do you go about fixing and patching those blind spots?

So yeah, I mean, it's actually interesting that by the end of the match against Lee Sedol, we put together a benchmark where we were trying to quantify and have a way of measuring the mistakes that AlphaGo makes, these kinds of blind spots, let's say.

And then we just tried a number of approaches to just like improve the algorithm so that we can solve these issues.

And what happened is that actually the most effective way of getting rid of them was just like do what we were doing just like at a higher scale and better.

So we changed the architecture of the model: we switched to a deep ResNet with two output heads.

And we also had a bigger network trained on more data, and then we moved to AlphaZero and better algorithms.

And that made it so that we didn't have any hallucinations anymore.

So in a way, scale and data, the things that are always the well-known recipe in the field of AI, are exactly what solved it in our case.

With scale and data, how much did higher quality data or maybe specifically data from great professional players, the best professional players make a meaningful difference or was it just any data?

Now for us, what mattered was that we kind of solved it using self-play.

So we actually had access to the most competent Go player in the world.

And we just like used it to generate the best quality games and then just trained on these games.

So I guess like, you know, we didn't need to have like human experts because you had like an expert in-house.

It wasn't human.

Right.

Interesting.

Amazing.

Well, I'd love to move on to the progression from AlphaGo to AlphaZero.

And you talked a little bit about this notion of self-play just now.

AlphaZero was powerful because it learned how to play the game from scratch entirely from self-play without any human intervention.

Can you share more about how that worked and why that was important?

So AlphaZero was a game changer because it learned entirely from scratch through self-play without any human data.

And this was like a major leap from AlphaGo because like AlphaGo, as I said, relied heavily on human expert games.

So two things happened.

First of all, AlphaZero managed to simplify the training process and also like showed that AI could literally just like get from zero to superhuman performance just purely by playing against itself.

And that allowed it to just be applicable to a whole range of like new domains that were out of reach because like there weren't enough like human data for it.

But I think the more important thing is that AlphaZero also solved all the issues that AlphaGo had in terms of hallucinations, blind spots, and robustness.

So AlphaZero was a better method, full stop.

And you explained kind of how AlphaGo worked to a fifth grader.

What would you tell the fifth grader would be the key difference technically that you implemented with AlphaZero?

So AlphaZero, just like AlphaGo, uses a policy network and a value network along with Monte Carlo Tree Search.

So in that respect, it's exactly the same as AlphaGo.

So the key difference is in training.

AlphaZero starts with random weights and learns by playing games against itself.

And by playing games against itself, it iteratively improves its performance.

But the main idea behind AlphaZero is that whenever you take a set of weights, a set of policy and value networks, and then you just combine them with search, then you just like end up with a better player.

You just like increase your performance, you just like become a stronger player.

So what that meant is that we can actually use this mechanism to improve the model's policy, the raw policy.

So this is what we call in reinforcement learning a policy improvement operator.

Here you can just take an existing policy and then do something, some magic, and then just like come up with a better policy.

And then you can just take this policy and distill it back to the initial policy and then just repeat this process.

Then you have like a reinforcement learning algorithm.

And I think this is exactly what people are trying to do today with Q* or synthetic data.

This is exactly the idea of how can I take a policy, do something with it, planning, compute, whatever it is, and derive a better policy, which I can then imitate and just like kind of distill back to the original policy.

So this is exactly what AlphaZero is doing.

It uses MCTS search to produce a better policy.

Then it takes its trajectories, it trains its policy and value network on the new better trajectories, and it repeats this process until it converges to an expert level Go player.
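
The loop of "search produces a better policy, then distill it back" can be sketched with a deliberately tiny stand-in. Here `search` simply reweights the raw policy toward moves with higher value (a placeholder for MCTS), and `distill` moves the raw policy toward that target (a placeholder for retraining the network). The move values, `sharpen`, and `lr` are assumptions chosen for illustration.

```python
import math

def search(policy, value_of_move, sharpen=2.0):
    """Policy improvement operator (stand-in for MCTS): favor better moves."""
    weights = [p * math.exp(sharpen * v) for p, v in zip(policy, value_of_move)]
    total = sum(weights)
    return [w / total for w in weights]

def distill(policy, target, lr=0.5):
    """Move the raw policy part of the way toward the improved target."""
    return [(1 - lr) * p + lr * t for p, t in zip(policy, target)]

def improve(policy, value_of_move, iterations=20):
    for _ in range(iterations):
        improved = search(policy, value_of_move)  # better policy via "search"
        policy = distill(policy, improved)        # imitate the better policy
    return policy

start = [1 / 3, 1 / 3, 1 / 3]         # an uninformed starting policy
values = [0.2, 0.8, 0.5]              # hypothetical value of each move
final = improve(start, values)
```

Each pass makes the raw policy a little more like its own search-improved version, so repeated passes concentrate probability on the strongest move without any external expert data.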

It's fascinating and counterintuitive that starting without the weights that you would have from professional level players is actually a better starting place.

The epitome of AI agents in games was achieved, I think, via MuZero, which is the progression even from AlphaZero itself.

And it's also where you became one of the co-leads or one of the leads of the game.

AlphaZero was obviously impressive because of self-play, but it also needed to be told the environment's dynamics or the rules of the game.

And MuZero takes us to the next level without needing to be told the rules of the game.

And it mastered quite a few different games, Go, Chess, and many others.

Can you share a little bit about how MuZero worked and why was this particularly meaningful?

Absolutely.

So, AlphaZero, as you said, was a massive success in games like Chess, Go, Shogi.

In games where we actually had access to the game rules, where we actually had access to a perfect simulator of the world. But this reliance on a perfect simulator made it challenging to apply it to real-world problems.

And real-world problems are often messy, they lack clear rules, and it's really hard to just write a perfect simulator of them.

So that's exactly what MuZero tried to solve.

So MuZero masters the games, of course, like Go, Chess, and Shogi, but also masters more visually challenging games or games with a hard go like Atari.

And it does that without being given access to the simulator.

It just learns how to build an internal simulator of the world and then uses this internal simulator in a way similar to what AlphaZero was doing.

So it does that by using model-based reinforcement learning, where what that means is that you can just take a number of trajectories generated by an agent and then try and learn a prediction model of how the world works.

So this is actually quite similar to what methods like Sora are trying to do now, where they just take YouTube videos and they try to just learn a world model by just trying to predict based on starting from one frame what's going to happen in the future frames.

So MuZero tries to do exactly that, but it does it in a way different from generative models, in the sense that it tries to only model things that matter for solving the reinforcement learning problem.

So it tries to predict what the reward's going to be in the future, what the value of future states is, and what the policy for future states is.

So only things that you need within your MCTS.

But the fundamentals remain the same.

So how do you just like learn a model based on trajectories?

And then once you have this model, you can just combine it with search and get superhuman performance.
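
The two stages described here, learning a model from trajectories and then searching inside that model, can be sketched with a lookup table. This is a heavily simplified stand-in: MuZero learns neural networks over latent states, whereas the state names, rewards, and exhaustive depth-limited search below are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def learn_model(trajectories):
    """Fit (state, action) -> (most common next state, mean reward).

    Each trajectory is a list of (state, action, reward, next_state) tuples
    collected by an agent; no simulator or game rules are consulted.
    """
    next_counts = defaultdict(Counter)
    reward_sum = defaultdict(float)
    visits = Counter()
    for trajectory in trajectories:
        for state, action, reward, next_state in trajectory:
            next_counts[(state, action)][next_state] += 1
            reward_sum[(state, action)] += reward
            visits[(state, action)] += 1
    return {
        key: (next_counts[key].most_common(1)[0][0], reward_sum[key] / n)
        for key, n in visits.items()
    }

def plan(model, state, depth):
    """Best predicted total reward and first action, using only the model."""
    if depth == 0:
        return 0.0, None
    best = (0.0, None)
    for (s, action), (next_state, reward) in model.items():
        if s != state:
            continue
        future, _ = plan(model, next_state, depth - 1)
        if best[1] is None or reward + future > best[0]:
            best = (reward + future, action)
    return best

trajectories = [
    [("A", "left", 0.0, "B"), ("B", "left", 1.0, "C")],
    [("A", "right", 0.5, "D")],
]
model = learn_model(trajectories)
greedy = plan(model, "A", depth=1)     # looks one step ahead
lookahead = plan(model, "A", depth=2)  # looks two steps ahead
```

With one step of lookahead the immediate reward on the right looks best; with two steps the learned model reveals the larger delayed reward on the left, which is the benefit of planning inside a learned model.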

So of course, like you can always decouple the two problems and have like the model being trained separately from data out in the wild and then just like combine that with MuZero.

And we just found that back then, given the limitations of like our models and the smaller sizes, kind of like make more sense to just like keep those two together and only have the model predict things that matter for planning.

So we didn't try to model everything, because you're hitting the limits of what the capacity of the model could take.

So interesting.

Is it right to assume then that not only Sora takes the same approach, but maybe other world models or other robotics foundation models?

Yeah.

So anything that tries to just like build a model of how the world works and then just like use that for planning, it's within MuZero like methods.

So yeah, you can just like train it on YouTube videos.

You can train it on like the inputs coming from like robots.

You can train it on any environment.

You can even think of like large language models as a form of models of like text.

They model text.

But the thing about text is that the model is a bit trivial.

There aren't many artifacts happening when you're trying to predict what the next word is going to be.

Right.

So have you seen the ideas behind MuZero kind of be used outside gameplay or in messy real world environments?

So yeah, I mean, as I've said, AlphaZero and MuZero are quite general methods, and they were adopted by a number of scientific communities.

In chemistry, there's AlphaChem.

In quantum computing and in optimization, people adopted AlphaZero because it was really powerful at planning and at solving these optimization problems.

At the same time, MuZero was incorporated in a version of like Tesla's self-driving system.

It was kind of reported in their AI day.

And it was also used, I think it's currently being used within YouTube as a custom compression algorithm.

But I think it's early days, and it takes time for this new technology to be fully adopted by the broader industry.

We'd love to talk a little bit more about reinforcement learning and agents.

You alluded earlier to the fact that reinforcement learning and deep learning back in 2015 were new, nascent ideas.

They really grew in popularity, 2017, 2018, 2019 onwards.

And then they were overshadowed by LLMs, largely because of the GPT and everything else that came out.

But now reinforcement learning is back.

Why do you think that is the case?

Yeah, I mean, first of all, LLMs and multimodal models have indeed brought incredible progress to AI.

So these models are exceptionally powerful and can perform some truly impressive tasks.

But they have some fundamental limitations, and one of them is the availability of human data.

We all just keep talking about the data wall and what happens once we run out of high-quality data.

And this is exactly where reinforcement learning shines.

So reinforcement learning excels because it doesn't rely solely on pre-existing human data.

Instead, reinforcement learning uses experience generated by the agent itself to improve its performance.

So this self-generated experience allows reinforcement learning to learn and adapt, even in scenarios where human data is scarce or non-existent.

So if you define the reinforcement learning problem in the right setting, in the right way, you can literally effectively exchange compute for intelligence.

You can get to a point similar to where we were with AlphaZero, where the moment we threw more compute at it, like we made the networks bigger, we used more games, we just literally got a better player, and it was deterministic.

You always get a better player.

So I guess this is exactly where we want to be with these synthetic data pipelines.

Currently, we have that with the scaling laws in LLMs, that if you have more data and bigger models, then, you know, you can predict that there's going to be an improvement in performance.

But you know, once you run out of like human data, how do you just keep going?

And synthetic data is like the answer to that.

And the only way that, you know, you can actually get high-quality data to just improve your model is via some form of reinforcement learning.

And I'm just keeping reinforcement learning as a really kind of blanket term here, where I define it as anything that learns through trial and error.

How do you think reinforcement learning is being brought into the kind of LLM world?

And you mentioned Q* earlier.

I guess in a closed form game, you have like a pretty clearly defined policy and value function.

How does that work in like a messy kind of real world environment or the LLM world?

I mean, I guess like there are two different types of like messy real world, right?

Like, if you try to just build a controller or something, that's a really messy environment.

And then there's operating in the digital space.

So personally, I believe that digital AGI will happen much earlier than, you know, robotics AGI.

And the reason for that is exactly that you have control over the environment.

And the environment is like computers, like the digital world.

So even though it's like messy and noisy, it's still contained.

It's not like the real kind of like world in that sense.

So now in terms of how do you bring like reinforcement learning?

So in reinforcement learning, we used to say at DeepMind that you have the problem and you have the solution.

And the problem setting of reinforcement learning is how do I take a model?

How do I take a policy and generate synthetic data?

Like, how do I find a way to improve this policy via interacting with the environment, via trial and error?

And that's the reinforcement learning problem setting, right?

Then there's the solution space, where you have value functions and you have reinforcement learning methods.

So I think that there's a lot of inspiration to draw from classical reinforcement learning methods that were developed in the past decade, but you have to adapt them, you have to adjust them to the new world of LLMs.

So methods like Q* try to do that by just taking the idea that if I have a policy and then I do planning, I consider possible future scenarios.

And then I have a way to evaluate which one is better.

Then I can just like take the best ones and then ask the model to imitate these better ones.

And this is like a way of improving the policy.

So in the classic RL framework, you do that by using a policy and a value network.

In the new world, you just do that by having a reward model, or by asking your LLM to give you feedback on an output it gave you.
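The loop described here (sample several candidates from a policy, score them, then have the policy imitate the best ones) can be sketched as a toy. This is only an illustration under loud assumptions, not Q* or any lab's actual method: the "policy" is a plain probability table over three candidate answers, `reward` is a hypothetical stand-in for a reward model or LLM judge, and `improve` and `lr` are made-up names.

```python
import random


def sample(policy, rng):
    """Draw one answer from a policy given as {answer: probability}."""
    answers = list(policy)
    weights = [policy[a] for a in answers]
    return rng.choices(answers, weights=weights, k=1)[0]


def reward(answer):
    """Hypothetical reward model: 1.0 for the correct answer to "2 + 2", else 0.0."""
    return 1.0 if answer == "4" else 0.0


def improve(policy, reward_fn, n_samples, rng, lr=0.5):
    """One round of 'search, then imitate the best'.

    Samples n_samples candidates, keeps the highest-scoring one, and shifts
    probability mass toward it. If no candidate earns a positive reward,
    the policy is left unchanged so it never imitates a bad sample.
    """
    candidates = [sample(policy, rng) for _ in range(n_samples)]
    best = max(candidates, key=reward_fn)
    if reward_fn(best) <= 0.0:
        return dict(policy)
    new_policy = {a: (1.0 - lr) * p for a, p in policy.items()}
    new_policy[best] += lr
    return new_policy


if __name__ == "__main__":
    rng = random.Random(0)
    policy = {"3": 0.4, "4": 0.2, "5": 0.4}  # starts out mostly wrong
    for _ in range(20):
        policy = improve(policy, reward, n_samples=4, rng=rng)
    print(policy)
```

Each round spends more sampling compute to find a better answer than the policy would usually give, then folds that answer back into the policy, which is the "exchange compute for intelligence" idea in miniature.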

So interesting.

You also talked a little bit about synthetic data earlier.

I think some folks are very bullish on synthetic data and some folks more skeptical.

I also believe that synthetic data is more useful in some domains where outcomes and success are perhaps more deterministic.

Can you share a little bit about your perspective on the role of synthetic data and how bullish you are on it?

Yeah, I mean, I think synthetic data is something that we have to solve one way or another.

So it's not about whether you're bullish or not.

It's an obstacle, but we have to just find a way around it.

We will run out of data.

There is only so much data that humans can produce, and also it's important that the systems start taking actions.

They start learning from their own mistakes.

So we need to just find a way to make synthetic data work.

Now what people have done is that they've tried like the most, I guess like naive approach where you just like take the models, they produce something and you try to just like train on that.

And of course, you know, they've seen that there's model collapse and this just doesn't work out of the box.

But you know, new methods never work out of the box.

You just like need to invest in it and just like take your time and you know, really kind of think of what's the best way of doing it.

So I'm really optimistic that we'll just definitely find ways to improve these models.

And I think that actually there are a number of methods out there, like Q* and the equivalents, that, you know, in the new world where people don't really share their research breakthroughs the way they used to, are probably hidden behind some company trade secrets.

I'm going to ask about reasoning and, you know, novel scientific discoveries.

Do you think that that can kind of naturally come out of just scaling LLMs if you have enough data?

Or do you think that kind of the ability to reason and, you know, come up with net new ideas requires kind of doing reinforcement learning and, you know, deeper compute at inference time?

So I think you need reinforcement learning to get better reasoning, because it's also about the distribution of data, right?

Like, you have a lot of data out in the wild, on the internet.

But at the same time, you don't always have the right type of data.

So you don't have the data where someone reasons and they just explain the reasoning in detail.

You have some of it, and it's incredible that the models have actually managed to pick it up and just imitate it.

But if you want to just improve on that capability, then you need to do reinforcement learning.

You need to just show the model how this kind of emerging capability can be further improved by having it generate synthetic data, interact with the environment, and, you know, just telling it when it's doing something right and when it's not doing something right.

So yeah, I think like reinforcement learning is definitely part of the answer for that.

AlphaGo, AlphaZero and MuZero are the most powerful agents we've ever built.

Can you share a little bit about how some of the lessons and learnings unlocked from that are relevant to how we're pursuing building AI agents today?

Yeah, so I think AlphaGo and MuZero, you know, they've actually fundamentally transformed our approach to AI agents because they highlight the importance of planning and scale. In my opinion, if you actually look at the charts of different models and how they scale, you can see that AlphaGo and AlphaZero were really ahead of their time, like they were kind of outliers.

You have this curve of how compute scaled, and then you have AlphaZero, like, somewhere standing on its own.

So it showed that if you can scale, then you can get incredible, incredible results.

At the same time, you know, it also showed that you don't have to only train. You can also, you know, have better performance during inference, during test, during evaluation, by just using planning.

And I think that this is something that we'll start seeing more and more in the near future.

Where these methods will just start thinking more, planning more, before making any decisions.
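The point about getting better performance at inference time by planning can be illustrated with a toy, sketched here under stated assumptions. Nothing below is AlphaGo's actual search (which is Monte Carlo Tree Search guided by neural networks): the environment, the `ACTIONS` table, and the `value` function are all hypothetical, and the planner is a plain exhaustive lookahead. The model is the same in both cases; only the amount of computation spent before acting differs.

```python
from itertools import product

# Hypothetical toy environment: the state is an integer and two moves are allowed.
ACTIONS = {
    "add1": lambda s: s + 1,
    "double": lambda s: s * 2,
}


def value(state, target):
    """Stand-in for a value estimate: closer to the target is better."""
    return -abs(target - state)


def greedy_rollout(state, target, steps):
    """No planning: at each step take the move that looks best right now."""
    for _ in range(steps):
        state = max((move(state) for move in ACTIONS.values()),
                    key=lambda s: value(s, target))
    return state


def plan_rollout(state, target, steps):
    """Planning: try every action sequence of length `steps`, keep the best end state."""
    best_final = None
    for plan in product(ACTIONS, repeat=steps):
        s = state
        for name in plan:
            s = ACTIONS[name](s)
        if best_final is None or value(s, target) > value(best_final, target):
            best_final = s
    return best_final


if __name__ == "__main__":
    print(greedy_rollout(4, 10, steps=2))  # 9: doubling first looked best, but overshoots
    print(plan_rollout(4, 10, steps=2))    # 10: add1 then double hits the target exactly
```

Starting from 4 with a target of 10, the greedy agent doubles to 8 and ends at 9, while the planner finds add1 followed by double and lands on 10.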

So I'd say that this is more the heritage of AlphaGo and AlphaZero and MuZero.

It's the basic principles, and the basic principles are that scale matters, planning matters.

These methods can really solve problems that we thought were insanely complex or, you know, beyond what we can solve on our own.

Problems similar to the ones that we actually observe today with these large language models are things that we saw back then. Like back in 2016, we actually saw that these models can hallucinate, or that at the same time they're also creative, that they will just come up with solutions that we hadn't thought of.

But they can also like have blind spots or hallucinate or be susceptible to kind of like adversarial attacks, which I guess like everyone knows now that these neural networks suffer from.

So I think that like these are the main kind of lessons drawn from this line of work.

What do you think are the biggest open questions from this line of work for the field to answer going forward?

So the main question is, we had AlphaGo and MuZero, and we managed to have these insanely robust and reliable systems that will just always play Go at the highest possible kind of level.

And they'll just achieve that consistently. They will just be top of the leaderboard, they'll just never lose a game.

So AlphaGo Master actually like played against 60 people in online matches and just like literally won in every single one of them.

So there was, like, this bar for a critical level of robustness.

And I think this is exactly what we're missing now with these LLM-based agents.

Sometimes they get it, sometimes they don't, you cannot trust them.

They will just, you know, you have some amazing demos, but they happen once every two times even, or once every 10 times you have something amazing, and the remaining nine, they just lose their way and don't do anything.

So I think what we need to do is find a way to make these LLM-based agents as robust as the ones that we had with AlphaGo and MuZero and AlphaZero.

This is like the new open question of like how do you actually do that?

We'd love to move into some of your thoughts on the broader ecosystem today.

I'm going to touch on a few really core problems that people are working on right now.

One, the data wall problem that we'll hit eventually, perhaps by 2028 or so, as some folks predict.

Another being the idea of planning as an area that AI agents need to get better at.

And then, you know, a third idea that you just described was around robustness and reliability.

Can you share a little bit about maybe some of these areas that you think the whole field needs to solve that you are most excited about to help us unlock this vision of really getting to the AI agents that we want?

Yeah, I mean, I'll just like also add another one to the list.

So I think another major challenge is how to improve the in-context learning capabilities of these models, so how do we make sure that these systems can learn on the fly and adapt to new contexts quickly.

So this is another thing that I think is going to be really important.

It's going to happen in the next few years, couple of years actually.

So Ioannis, what's the term that you used for that?

In-context learning?

In-context learning.

In-context learning, yeah.

So it's the idea that a system can actually learn how to do a new task with few-shot prompting. Like, it sees a few examples and on the fly it learns how to adapt to the new environment, it learns how to use the new tools that were provided to it. It's not just all the knowledge it has stored in its weights, but it's also acquiring new knowledge by interacting with the real world, interacting with the environment.

So I think that this is another place where there is a lot of work happening at the moment, and we're going to have amazing progress in the next couple of years, and I'm really excited about that.
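As a concrete picture of few-shot prompting, here is a minimal sketch of how a handful of worked examples can be packed into a single prompt so that a model can pick up the task from context alone. The helper name `build_few_shot_prompt` and the `Input:`/`Output:` layout are assumptions made for illustration, not any particular model's or library's required format.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Assemble a prompt from (input, output) example pairs plus a new query.

    The model's weights are untouched; everything it "learns" about the
    task comes from the examples placed in its context window.
    """
    parts = [instruction] if instruction else []
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}\nOutput: {example_output}")
    # The final block leaves the output empty for the model to complete.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    prompt = build_few_shot_prompt(
        examples=[("2+2", "4"), ("3+5", "8")],
        query="7+1",
        instruction="Add the two numbers.",
    )
    print(prompt)
```

The resulting string would then be sent to whatever model is in use; that call is deliberately left out because it depends on the provider.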

So yeah, I mean to recap, I think like planning is important.

You know, in-context learning is important and you know, reliability.

So the best way to achieve reliability is to ensure that these models somehow know how to recover from their mistakes.

So if they made a mistake somewhere, they can just see that and they're like, okay, you know, I made a mistake, I'll just correct for it.

The way that humans, you know, make mistakes all the time, but we can correct for them.

So these are like the three areas which I'm really excited to see progress on.

Now that you've kind of embarked on your own entrepreneurial journey, how do you think about the areas where startups can compete against the big research labs, and how do you kind of motivate yourself for that journey?

Yeah, I mean, it's a new world for me but at the same time, it's not that new because when I joined DeepMind, it was literally a startup.

And I was, like, literally one of the first few employees.

So I actually saw that firsthand.

But you know, one of the benefits of working for a startup is, you know, the agility and the focus.

So everyone really cares.

Everyone just moves really fast.

And there's like a clear focus on what we want to build.

So building is the most important kind of motivation for people, just building.

And I think like this is one of the big advantages that startups have over more established businesses.

At the same time, you know, it's easier to just pivot, to adapt to new findings and technologies.

You're not kind of like tied to some pre-existing solutions or like some products that you don't want to deprecate because like they bring a lot of revenue to you.

But if you're a startup, you know, you have like no such chains.

You can just like move fast and you know, be innovative and just, you know, break conventions.

And at the same time, it allows you to leverage open source resources, things that are out of reach for the big labs.

And yeah, you don't have the red tape that big places tend to have.

I love the term that you sometimes use, Ioannis: main quest versus side quest.

Yeah, it's the idea of like having a main focus like, you know, in big places in big labs, they have like many different projects that like people are working on.

And it usually happens that they have like the main quest, the main, you know, thing that like everyone's working on.

And there are multiple smaller side quests, where the idea is that they just feed into the bigger quest.

But usually they don't get as many resources or as much focus from the leadership.

So yeah, they tend to, yeah, atrophy.

In the broader field, what are some of the most defining projects that you admire the most and maybe who are some of the most influential researchers that you admire the most?

Yeah, absolutely.

So I actually like started my AI research journey back in 2012.

And I've actually like seen some milestones.

So I'll give a list of what I think are the main milestones in AI in the past 12 years that I've been around.

So the first one I'll say is AlexNet.

This is the first paper that kind of showed that deep learning is the answer.

I mean, back then, it didn't feel like it.

It just felt like a curiosity.

But like now, I think that most people are convinced that like deep learning is part of the answer.

Then it was DQN.

I had the pleasure to actually work on DQN and just see firsthand how it started.

It was actually developed by a friend of mine, Vlad Mnih.

And it was the first system that showed that you can actually combine deep learning with reinforcement learning to achieve superhuman performance in really complex environments.

Then there was AlphaGo.

Again, I was like really lucky to just like work on that and it showed that scale and planning are really important ingredients.

And if you just like do that right, then you get huge success in an incredibly complex environment.

AlphaFold, another one, this is again by DeepMind.

It showed that these methods are not just things that you can use to solve games, but they will actually make this world a better place.

They will ensure that healthcare is improved, that scientific discoveries are being realized, that we'll make this world a better place by using AI.

Then ChatGPT. It kind of brought AI to everyone, just made it accessible to a broad audience.

Like everyone knows what AI is now.

It has made my life of explaining my job much easier.

And finally, GPT-4.

And I think that probably GPT-4 is the latest kind of peak advancement in AI, because it kind of showed that artificial general intelligence is a matter of years.

It's within reach.

Yeah, we are getting there.

I think that most people now believe that we are a few years away from AGI.

And that's because of the incredible breakthrough that GPT-4 was.

Now in terms of like some people I really admire, before I forget.

So I'd say first David Silver. He was my PhD supervisor.

He was my mentor at DeepMind.

He's an incredible researcher.

He led AlphaGo and AlphaZero.

And he has an unyielding dedication to the field of reinforcement learning.

And he's probably one of the smartest people, or maybe the smartest person, I know, and an amazing guy and an amazing reinforcement learning engineer.

And the second one I'd say is Ilya Sutskever.

And he was a co-founder of OpenAI.

I had the opportunity to work with him just a little bit in the really early days of AlphaGo.

But I think his commitment to scaling AI methods and pushing the boundaries of what these systems can achieve is remarkable.

And he made sure that GPT-3 and GPT-4 happened.

So yeah, immense respect towards him.

Thank you for sharing that.

Let's close out with some rapid fire questions.

Maybe first, what do you think will be the next big milestones in AI, let's say in the next one, five and 10 years?

So I think in the next five to 10 years, the world will be a different place.

I actually really believe that.

I think that in the next few years, we'll see models becoming powerful and reliable agents that can actually independently execute tasks.

And I think that AI agents will be massively adopted across industries, especially in science and healthcare.

So in that sense, I'm really excited about what's coming in AI.

And what I'm most excited about is AI agents.

Systems that can actually do tasks for you.

And this is exactly what we're building at Reflection.

In what year do you think we'll pass the 50% threshold on SWE-bench?

So I think we are one to three years away from the 50% threshold for SWE agents and three to five years from achieving 90%.

So the reason is, while progress is amazing, I think we still need reliable agents to hit these milestones.

And really, when it comes to research, it's hard to make precise predictions.

When do you think we'll hit the data wall for scaling LLMs?

And do you think all the research in RL is mature enough to keep up our slope of progress?

Or do you think there will be a bit of a lull as we try to figure out what happens when we hit the wall?

So based on what I've read, I think we have at least one more year for text before we hit the wall.

And then we have these extra modalities, which might actually buy us maybe a year extra.

And I think we are in a really good place to just start using synthetic data.

So in the next two years, we'll just figure out the synthetic data problem.

So I think that we won't really hit the wall.

Or we'll hit the wall, but no one will realize it, because we'll have new methods in place.

Do you think LLMs will have their AlphaGo moment?

And if so, when?

I think LLMs had their AlphaGo moment with the initial release of ChatGPT, where they showcased the power and the progress made over the past decade.

I think what they haven't had yet is their AlphaZero moment.

And that's the moment where more compute directly translates to increased intelligence without human intervention.

And I think this breakthrough is still on the horizon.

When do you think that'll happen?

I think it's going to happen in the next five years.

Wow.

Amazing.

Ioannis, thank you so much for joining us and taking us through the awesome history of AlphaGo, AlphaZero, MuZero, your own journey through DeepMind, and then many of the core research problems that the whole industry is tackling today around data and building for reliability and robustness and planning and in-context learning.

We're really excited for the future that you're helping us build and that you're pushing forward in the field as well.

So thank you so much, Ioannis.

Thank you so much for having me.

Thank you.