Training Data · 2025-07-30

OpenAI's IMO Team on Why Models Are Finally Solving Elite-Level Math

Hosts: Unknown

Guests: Alex Wei, Sheryl Hsu, Noam Brown

IMO, test-time compute scaling, reinforcement learning, hard-to-verify tasks, multi-agent systems, math reasoning, Lean vs natural language proofs, combinatorics, AI safety / self-awareness

Why it matters

OpenAI team scaled test-time compute to ~100 minutes per problem to win IMO gold

Key claims

A three-person OpenAI team (Wei, Hsu, Brown) achieved IMO gold using general-purpose techniques rather than IMO-specific tooling.
Test-time compute scaled from ~0.1 minutes a year ago to ~100 minutes per problem for IMO, with the team targeting thousands to hundreds of thousands of hours for research-level math.
The model used natural-language proofs and deliberately skipped Lean's formal-verification track, which they view as too narrow for general reasoning goals.
Progress was driven by reinforcement learning improvements on hard-to-verify tasks, not just verifiable-reward benchmarks, plus multi-agent parallel compute.

Episode summary

OpenAI researchers Alex Wei, Sheryl Hsu, and Noam Brown join the Training Data podcast to discuss their model achieving gold-medal performance at the International Mathematical Olympiad (IMO). The three-person team built on general-purpose techniques for scaling test-time compute and tackling hard-to-verify tasks, rather than IMO-specific optimizations. They describe a rapid final sprint of only a couple of months, with strong early evidence coming from improvements on problems without verifiable rewards. The model reasoned for roughly 100 minutes per problem, up from about a tenth of a minute a year ago, and used natural-language proofs rather than Lean formal verification, which the team deemed too narrow for their general-reasoning goals.

The team discusses why the model declined to attempt Problem 6 (a notoriously hard combinatorics problem) rather than hallucinating an answer, calling that self-awareness a meaningful sign of progress. Proofs were graded by three external former IMO medalists who reached unanimous consensus on correctness, and the raw proofs were published on GitHub despite being intentionally hard for humans to read. Noam Brown frames the work in terms of scaling reasoning from minutes toward the thousands of hours that would be needed for research-level mathematics and potentially Millennium Prize problems, while emphasizing that the underlying techniques—parallel compute via multi-agent systems, long-horizon RL on hard-to-verify tasks—are general-purpose and intended to feed back into broader OpenAI model capabilities.

Looking ahead, the team plans to expose the system to mathematicians for evaluation rather than ship it directly as a product, and sees problem generation and long-horizon research reasoning as the next frontiers beyond time-boxed competition math.

The model recognized Problem 6 was beyond its reach and abstained rather than hallucinating—a behavior the team highlighted as a sign of genuine self-awareness.
Proofs were graded by three external former IMO medalists with unanimous consensus, and raw outputs were published to GitHub without rewriting for readability.
Combinatorics and abstract 'leap-of-insight' problems remain much harder than geometry or problems requiring many small deductive steps.
The same infrastructure and techniques power recent OpenAI launches, and the team plans to make the system available to mathematicians before broader deployment.
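
The time scales quoted in the summary above can be checked with a few lines of arithmetic. The sketch below is only illustrative: the constant names and the `scale_factor` helper are made up for this example, and the inputs are the rough, order-of-magnitude figures the guests mention in the episode, not measured values.

```python
from fractions import Fraction

# Approximate figures quoted in the episode (order-of-magnitude only).
THINK_MINUTES_LAST_YEAR = Fraction(1, 10)   # "a tenth of a minute"
THINK_MINUTES_IMO = Fraction(100)           # "on the order of a hundred minutes"
IMO_HOURS_PER_PROBLEM = Fraction(9, 2) / 3  # three problems in four and a half hours
RESEARCH_HOURS = Fraction(1500)             # the "1500 hours" research-math figure


def scale_factor(before: Fraction, after: Fraction) -> Fraction:
    """Return how many times larger `after` is than `before`."""
    return after / before


# Growth in model thinking time over roughly one year.
model_gain = scale_factor(THINK_MINUTES_LAST_YEAR, THINK_MINUTES_IMO)

# Gap between one IMO problem and a research-level result, in human hours.
research_gap = scale_factor(IMO_HOURS_PER_PROBLEM, RESEARCH_HOURS)

print(f"Model thinking time grew about {model_gain}x")
print(f"Research math needs about {research_gap}x more time than an IMO problem")
```

Both ratios come out to a factor of 1,000, which matches the "1,000x more thinking time" remark in the transcript. Exact fractions are used instead of floats so the division by 0.1 does not pick up rounding noise.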

Source material

Transcript

The pace of progress is really—I think you see it so clearly in math, and I think Alex tweeted about this, where even a few years ago, these models were struggling with grade-school math.

I remember even in 2024 that GSM8K was used as the standard eval when everybody would release a model.

And then it was MATH for a short period of time, and then it became AIME, and then it became USAMO.

And the pace at which it's just blown through all of these math benchmarks is really astonishing.

Today we're joined by Alex Wei, Sheryl Hsu and Noam Brown, the trio behind the OpenAI model that just achieved gold medal performance at the International Mathematical Olympiad.

The IMO gold is one of the most important milestones in the race to artificial superintelligence, and what makes this breakthrough particularly fascinating isn't just the mathematical chops, but the underlying architecture.

It relies on general-purpose techniques for scaling test-time compute and handling hard-to-verify tasks, techniques that extend far beyond competition math.

We've now gone from models that can reason about math for a tenth of a minute just a year ago to systems that can reason and concentrate on the order of a hundred minutes.

The hope for superintelligence is that as we scale reasoning to thousands or hundreds of thousands of hours, we can begin to solve humanity's greatest unsolved problems in math, the sciences, and more.

Alex, Sheryl, and Noam joined us on Training Data to talk about their approach and share some of the behind-the-scenes fun and learnings behind this historic result.

Enjoy the show.

Alex, Sheryl, Noam, thank you so much for joining us today.

We have with us the team behind OpenAI's first gold medal at the IMO.

Congratulations to you all.

It's a momentous achievement.

Thanks.

Thank you.

I'd love to get into a little bit of the origin story behind this.

I know that the IMO gold has just been this elusive thing that everyone in AI has been chasing for a long time.

I remember back when Sam pitched us in 2021, it was on the slides, and I remember thinking, "Oh, that seems really far away."

I'd love to understand the more immediate origin story for this specific effort.

When did you guys start thinking about this, and how did it come about?

Yeah, I think it's something that we've been thinking about for a long time.

I remember in my first week at OpenAI, Noam asked me, "When do you think the model will get IMO gold?"

I thought it was really unlikely in 2025.

I feel like it's something that's always been on our minds and, as you said, on Sam's many years ago as well.

But this specific effort, I think it was really only maybe a couple months since-- Just a couple months.

Like the last sprint to get everything ready for this year's IMO.

Of course, we've been working on improving our algorithms.

The ideas for this started coming together maybe six months ago, but really the last push, where we said we're going to try to do something for this year's IMO, was only a couple of months long.

It's amazing.

And how big is the team involved?

So we're definitely building on a lot of folks' work at OpenAI.

This is not possible without a lot of help from people working on inference and the scaling org, the people who do the pre-training and the RL training.

But in terms of the core team, I would say it's just three of us.

So it was a super small scrappy effort here.

That's crazy.

Just the three of you.

Also it was mostly Alex.

Alex had been working on this technique for a while, and Sheryl and I were happy to help out as we were getting closer to the IMO to make it a reality.

That's so cool.

And how does this even come about?

Do you self-direct and self-choose, I want to work on IMO gold and I'm going to get us there?

How do you even raise your hand to work on something like this?

I think it was something where it just felt like maybe it's possible.

Maybe if we push it for a couple of months we can just get there.

One of the nice things about OpenAI is that I think the researchers are really empowered to do the kinds of research that they think is impactful.

So Alex had this pitch that "Hey, there's just a new technique that I think could help out a lot."

Honestly, there was a decent amount of skepticism.

I think some people were supportive.

But everybody felt like we should give them the freedom to be able to explore this and pursue it.

And then it started showing some strong evidence.

I think people still were a little skeptical, but more people were getting excited about it.

Eventually it turns into something more substantial.

And I think now people are obviously very excited about it.

Can you say a little bit more about the strong evidence?

What were some of the early signs that you all were seeing that made you really lean in?

I think it's just progress on hard to verify tasks.

I think previously a lot of RL was more focused on what you can do if you have these verifiable rewards.

Seeing more improvement on these harder-to-verify tasks is what made us excited.

Maybe on that front, how did you even verify that the results you had were right?

I saw that you published the proofs on GitHub, but can you just say a little bit more about how you even know that you've discovered the answers?

Because my understanding is that they're done a bit differently from how a human might answer them.

Yeah, I do think the style of the model outputs is a little atrocious.

Atrocious isn't the word I was going to use.

It's a creative, like an alien language.

Yeah, I think it was a very small scrappy effort.

And so we didn't optimize as hard for human readability, but that's something that we know how to do.

We can do that in the same way that ChatGPT is very readable.

We can do the same things here.

Do you even need to optimize for human readability?

Is that even important?

I think if you're showing this to humans, they prefer readability.

We were actually discussing this when we got the proofs, because you could actually just run them through ChatGPT and ask ChatGPT to rewrite them in a more readable way.

The proofs are still correct.

They're just a little bit more readable.

And we were like, oh, should we post these online?

Should we post the more readable version that's run through ChatGPT?

Or should we just post the raw version?

And we decided, I think for full transparency, we'll just post the originals and people will figure it out.

You guys have a bunch of IMO medalists and participants on the staff at OpenAI, right?

Do you guys, like, moonlight in your spare time grading the answers that the model produces?

Like during the testing, we read a lot of samples.

But for grading these specifically, we hired external former IMO medalists.

So each proof was graded by three medalists.

And for each one, they reached unanimous consensus on the correctness.

I should also say that for me, I don't know about Sheryl, but for me, the proofs are beyond my ability to comprehend.

I was a math major and I never really did competition math.

And already, the stuff that this model is writing about is beyond my ability to grade.

Yeah, same.

I think that's what makes it like even more amazing, just like how smart the model is.

Totally.

What about problem six?

How come none of the models at this year's IMO had a solution and your model didn't even attempt problem six?

Can you say more about what makes that problem different?

And traditionally, problem six is always the hardest at the IMO.

Is that right?

Yeah, I think problem three or problem six usually.

Okay.

Just say a bit more about what made problem six different and what you learned from, I think you tweeted that the fact that your model knew that it couldn't solve problem six was one of the things that gave you hope.

So just say a bit more about that as well.

For problem six, it's just a really tough problem.

I think like if you gave me like months to think about it, if you even gave me like a big hint about like the main idea to solve problem six, I don't think I'd be able to get there.

It's just like crazy, like tough problem where there's so many things you can do and there's like a very narrow path to finding the proof.

And I think it's one of those things I think like math is just hard.

Yeah.

And we threw a lot of compute at problem six, but I think it was good to see that the model doesn't try to hallucinate or just make up some solution; instead it will say "no answer."

I mean, it's kind of disappointing when it's done so much work just to say "no answer."

But I think, you know, it's good that it actually like acknowledges that.

Yeah.

That's an amazing level of self-awareness of your own kind of ceiling because I mean, I remember at least a couple of years ago with these models, they would always try to be helpful and make up an answer.

Right.

And so to see this as just like, I think an amazing level of self-awareness from these models.

When we released the reasoning models, you know, I talked to some professors, mathematicians, computer scientists, and I was asking them, like, you know, are you finding value in these models?

And the answer was frequently yes.

But the one thing that they would complain about is like, if they would ever ask the model a question that it didn't know the answer to, it would just like output a very convincing but wrong answer.

And they would have to like go through it very carefully to figure out was it exactly correct or was there like, you know, some flip of an inequality or something that the model snuck in there.

And it's nice to see that this model like, if it doesn't know, it will just like acknowledge that it doesn't know, at least more frequently.

I guess internally, did you guys have, like, a Polymarket or something going on about whether you guys were going to win IMO gold this year?

And like, what was the internal vibe?

I think we felt like we had a strong shot.

But I think we also felt that it wasn't a lock; there's definitely a distribution of questions where the models would probably struggle more than humans.

But then, you know, there's another distribution of questions where the models would be really, really strong.

And I think this year was somewhere in the middle, where, you know, like problem six, like, I think is just out of reach of state of the art models today.

And I think maybe in general, you know, these hard combinatorics problems, which problem six was, are more challenging.

And that's still something that the models struggle with.

What is it about combinatorics that makes it challenging versus, you know, the like geometry, for example, which seems like you guys do well at?

I think for combinatorics, it's probably because it's a little more like abstract, a little more high dimensional.

And I think oftentimes combinatorics problems sort of require leaps of faith or leaps of insight that, you know, the models aren't as good at.

I think the models are better at, you know, problems that require a bunch of smaller steps, for example.

And from your guys' perspective, was the internal vibe optimistic or not that you all were going to win the gold?

I feel like it wasn't super optimistic.

Like, I think they definitely knew that like it could happen.

But I think like even like a month or like two months back, it definitely felt like it would have to like improve quite a bit, which I guess we did.

I remember I was talking to another researcher at OpenAI, like maybe two months before the competition and we were like, you know, saying like, okay, if we were to bet, you know, I'm a betting man.

Yes, you are.

And I would say like, what odds would you take?

Because I was willing to bet on like, we were going to get gold here and he was like, there's really no chance.

And, you know, and, you know, he said that he would gladly take like two to one odds against like the model winning.

So like, you know, less than one third chance.

But he didn't want to bet against us.

So, you know, he thought it'd be bad, bad vibes to bet against the team winning.

So he didn't go for the bet.

So did you make some pocket change off him?

I wish, I wish.

I mean, you need it.

Because, I mean, I think you tweeted 12% on AIME, like 15 months ago, right?

So even though you never want to bet against scale and OpenAI, it's just an astounding slope of what you all have accomplished here.

The pace of progress is really I think you see it so clearly in math.

And I think Alex tweeted about this where, you know, even a few years ago, these models were struggling with like grade school math.

And, you know, I remember even in 2024 that GSM8K was used as, like, the standard eval when everybody would release a model.

And then it was MATH for a short period of time.

And then it became AIME, and then it became USAMO.

And the pace at which it's just blown through all of these math benchmarks is really astonishing.

Yeah, I remember training a model on GSM8K two years ago.

Yeah, we're past those days, saturated the evals.

What's next?

Do you think I mean, at this point next year, you think we'll be solving Millennium prizes?

I think it's I think those are still very far away.

I think on one hand, you know, you think about how much math progress has been made since GSM8K, which, just two years ago, was sort of a standard that people were trying to push on. That's an astounding level of progress.

But also you think about how much time it takes for people. GSM8K problems, they're grade-school math; they take someone good at math a couple of seconds.

And now we've gone from a couple of seconds to something that takes these brilliant students an hour and a half per problem on average; you know, the IMO is three problems in four and a half hours.

And then, you know, research math is going to be these same brilliant students, they've grown up, they're researchers, and it's going to take them like 1,500 hours.

So there's, you know, 1,000x more thinking time.

And then Millennium Prize problems have taken entire fields, you know, people's lifetimes of thinking, and we still don't have much progress on most of those.

And so it's, on one hand, like, you know, super exciting that we've made so much progress.

On the other hand, it's sort of also humbling to see how much further progress has to go, from an hour and a half to, you know, tens of thousands, hundreds of thousands of hours of humans thinking.

Totally.

Noam, I think you deserve a lot of credit for seeing the future on this. I remember you visiting us before you even joined OpenAI, talking about the results from gameplay and, you know, what happens if you let a model think for hours and tens of hours. Credit to you, you've really seen the future on this.

Thank you.

Yeah, I mean, it's exciting to see it actually happen.

What are the hard things that happen as you scale compute time, inference time, from the order of 0.1 minutes to the order of 100 minutes?

I guess at a high level, because not everyone, most of our listeners are not AI researchers.

But what are the hard things that happen to keep the model on the rails, so to speak?

I think one thing we can point to that is pretty clearly a challenge is that if you have the model thinking for like 1,500 hours, then in order to eval it, you have to have it think for 1,500 hours.

And so eventually, the evaluation of the models becomes a significant, you know, speed bump on progress.

So we're not really at that point. You know, if we have the model think for an hour and a half, it's no big deal; we can run those tests.

But if you run a test where the model is thinking for a month, it takes a month to finish that test.

And so progress can only advance so fast if you want to wait for those kinds of results.

I think both of you are on the multi-agent team. Help me understand the role that multi-agent systems play in this.

Yeah, so in addition to having the model think for a very long time and make a lot of progress on hard-to-verify tasks, this also involved scaling up parallel compute.

And so there's a multi-agent component to that. We're probably not going to be able to go into too much detail about the exact techniques.

But that was certainly like one way that we were able to scale up test time compute for the IMO.

By the way, one thing I'll add on the multi-agent, scaling parallel compute thing is that, in the way that we did it, we really tried to prioritize generality in our techniques.

I, for example, worked on AI for poker, and Alex and I actually both worked on AI for Diplomacy.

So Alex was on the team that worked on Cicero.

Yeah, nice.

And you know, those were projects that I'm, I'm really proud of.

But they were also projects that we spent years working on to like achieve that result.

And with the pace of AI progress being so fast, it felt like that wasn't the best use of time to like develop a very bespoke system that could only do that one task.

And so we all like really prioritized general purpose techniques and all this.

And you know, the techniques that we used for everything for scaling up the thinking time for working on hard to verify tasks and for the parallel compute are all general purpose techniques that we're either planning or have used for, you know, other systems as well.

And is that the reason you all chose not to do this in Lean? My understanding is the official IMO AI track was a Lean interpretation this year.

Is that why you guys chose not to go with Lean?

Yeah, that's right.

I mean, I think there is certainly a lot of value in Lean as a tool.

You know, mathematicians find it useful, for example.

But the priority for us is really general-purpose reasoning capabilities, and Lean has its limitations.

And so that's why we wanted to prioritize natural language.

My layman's understanding is Lean is a formal verification tool.

Does your result here basically say that like informal verification with scale can, you know, can perform at the same level or even surpass formal verification?

Is that the right takeaway?

I would not say that's the right takeaway.

Alex, do you have thoughts?

I'd say that these are two sort of orthogonal components here. We found informal math an interesting problem because it represents a kernel of difficulty around scaling up test-time compute and hard-to-verify tasks, difficulties that show up across a very broad set of tasks that we were interested in from a general-purpose standpoint.

I think Lean is a little bit more narrow, in that a lot more of the world can be approached with informal reasoning than is formalizable.

I don't think there's anything wrong with narrow AI, like narrow AI can be very effective and obviously, like far surpass general purpose AI in certain domains.

And I think the right way to think about it is that, in the same way that human mathematicians find a lot of value in Lean, general AI can be compatible with a more narrow system that's focused on formal mathematics, and the combination, I think, can be better because of it.

I think I saw on Twitter from multiple folks at OpenAI and I think you guys have mentioned this as well that, you know, this system was built with a very similar approach and infrastructure to many of the recent launches from OpenAI.

We had Isa from the ChatGPT agent launch on the podcast last week.

Can you say a little bit more about the similar foundation and approaches?

I think like infrastructure wise, like, I mean, like we all kind of use the same infrastructure.

But I think as far as the core of this question, like Noam and Alex said, there's nothing that's very bespoke to the IMO here.

And the hope is really that we can use the techniques that Alex worked on for non-verifiable tasks and for scaling up test-time compute, apply them to other areas of reasoning or other areas of model capabilities in general, and just build stronger models: keep improving agent, keep improving ChatGPT and everything else.

Tell me about the actual experience of IMO day.

What was it like?

Yeah, I mean, we were waiting for the problems to come through because, you know, once the participants finish the exam, they get posted.

And so we like, you know, plug the problems into our model.

And that was around like, it was pretty late at night, maybe like 1am or something.

And honestly, I went to sleep because, you know, it's 1am, and I'm not gonna stay up for four and a half hours; I'd just look in the morning and see.

But I think these two actually stayed up and got to watch the model and see it come in in real time.

It was a lot of fun.

I'm like, wake up, wake up, wake up.

We got this.

There were a couple of moments where Alex was so exhausted that he decided to take a nap.

But we told him, okay, just make sure your phone isn't on silent in case we need to call you.

And at one point, we did actually have to call him, but I don't think he woke up.

That's awesome.

It must have been such a thrill and such a high, especially for that to come through at like, so you got you started at 1am.

So you must have known like 9am then?

Oh, it's four and a half hours for the first day.

Okay.

Yeah, I don't know.

I mean, we can kind of see the problems come in.

So I'd just be making sure the systems were staying stable, and Alex was over there reading and seeing how the model was doing.

So you were doing the live human proof checking to see if it was actually...

I was there. You know, you're naturally very anxious about the results.

So I was just like looking at the you know, like the partial progress the model was making you can sort of like, we can sort of observe that.

And then, you know, I also hand-checked things. We were going to send these out to the graders, but I also just hand-checked them because I was so curious.

Okay, well, call me next time. I want to come hang out there for that.

That sounds awesome.

One of the cool things about these models is, you know, I can't understand the proofs.

But when you see the model thinking about it, it will express its uncertainty or its confidence in natural language throughout the process.

And it will just kind of say words that hint at it. You know, if it's really confident that it has figured it out, it'll say "good" a lot.

And if it's like unsure, it'll like throw out a lot of question marks.

And so it's like cool that I can kind of follow along and see how the model is like, you know, feeling about about its progress, even though I can't really tell if it's like got it correct or not.

You get the dreaded "seems hard."

You got that on problem six.

No progress.

Hard.

Sorry, keep going too bad.

Wonderful.

I guess looking ahead, you've gotten the pinnacle result in competition math. I guess you can go do Putnam next year, but you're basically at the top, right?

And so what's next?

Yeah, so actually for Putnam, the exam has less time per problem than the IMO and it's a little more knowledge heavy.

We actually found in our evals that the model was really, really good at Putnam problems, better than it was at IMO problems.

And so I think, you know, the frontiers here are really not about like, you know, these like very like time boxed competition problems anymore.

But it's about like problems that really take like longer periods of time and more deep thinking to solve.

It's really cool.

Okay, so you're gonna start proving novel theorems now.

Yeah, I think there's this very intimidating gap between these very time-boxed competition problems and a real research breakthrough, which takes like a year's worth of work. That's on the order of 1,500 hours instead of 1.5.

Yeah, totally.

I guess relatedly, I was listening to the Demis podcast last night, and he mentions that, you know, the hardest thing is actually coming up with the interesting problems to solve.

And I'm curious if you all agree with that.

I think there's some truth to that, that, you know, these models are really good now at solving these problems.

Coming up with them is, you know, still a challenge.

But I think it's also worth noting the incredible pace of progress that we're seeing.

And, you know, there's always a next hurdle. Originally, when LLMs came out, it was like, well, how do we get them to reason?

And then we got them to reason, but then how do we get them to reason on hard-to-verify tasks? And now they can reason on hard-to-verify tasks.

And I think the next hurdle is going to be like, okay, well, how do we get them to come up with these novel questions?

You know, like even creating an IMO question is a challenge.

And, you know, it takes a lot of expert mathematicians a lot of work to do that.

But I don't see any fundamental barriers that block us from getting there.

I love that.

Do your results in math, do they just fully generalize to, you know, you're just gonna be better at scientific reasoning, you're gonna be better at general reasoning.

Does being great at competition math make you, you know, be great at everything else?

I think how we approach this was not like, you know, we should be like, you know, great at competition math.

But really, I think we were focused on developing general-purpose techniques to make reinforcement learning better.

And I think with those we are, you know, very excited to improve our models in other domains beyond math.

And so and, you know, hopefully, like make models more useful for like, you know, us in like everyday usage.

This is, you know, a pretty late-breaking result.

Honestly, it was a surprise even to people internally at OpenAI.

And so the next step is to incorporate this more broadly into our models and, you know, the reasoning capabilities across the board.

But you know, it's gonna take some time to go through that process and deploy it to the world.

So I think it's gonna come.

But yeah, it'll just take a little bit more time.

Is it harder for these models to do the IMO or the Physics Olympiad?

I think definitely the Physics Olympiad because the Physics Olympiad has, I think, like an experimental section.

Oh, we need some robotics for this.

I didn't realize that.

Okay.

I thought it was just done with a piece of paper.

Yeah, so I think the model will probably be good at the on-paper part.

But yeah, I think it'll be a little bit of time before it can, you know, do the experiments.

Not with like a world model.

Okay, cool.

Are you gonna release this model for customers to play with?

Roelof's son is a math Olympiad kid.

And he's like, I want access to the math Olympiad model.

Like, will people be able to play with this?

So we want to make this accessible to mathematicians to use.

We're still trying to figure out the exact details of how we make that happen.

But I think it's really cool that we've developed this system that is incredibly good at math.

And it makes sense that we want to see what mathematicians can do with it.

I've actually already been emailing with a Stanford mathematics professor. He actually emailed me about a year ago, before we announced o1, and he was like, "Hey, do you want to do a collaboration on solving hard math problems?"

And basically what I told him is like, I think we just got to advance general reasoning capabilities and eventually they're gonna be able to help you with your like hard math problems.

And I think that's actually the most promising route to getting there.

He was a little skeptical, but every model release, like every reasoning model release, he's like, emailed me with a follow up and is like, can it solve this problem now?

And I've been plugging them in.

And I don't know what the output is, but I like email it back to him and he says like, yeah, that's wrong.

And he emailed me a follow-up this time with, you know, the same problem, really asking, "Hey, can it solve it now?"

It still can't solve it, but at least this time it recognizes that it can't solve it.

So I think that's like a big step.

But we're curious to see if there's like a lot of other problems out there that mathematicians want to challenge this model with and see if it can take them on.

Amazing.

Congratulations to you all.

I think this is a momentous result that the entire field has been waiting for for a very long time.

And the fact that it was accomplished by a team of three people in a span of two months, it's extraordinary.

Congratulations.

And thanks for joining us on Training Data.

Thank you.

Thanks for having us.

[Music]