The a16z Show · 2025-08-08

GPT-5 Launch Day: Inside OpenAI's Post-Training, Agents, and Coding Breakthroughs

Hosts: Unknown

Guests: Christina Kim, Isa Fulford, Sarah Wang

GPT-5, post-training, coding agents, creative writing, model behavior, sycophancy, hallucinations, reinforcement learning, RL environments, Deep Research, ChatGPT Agent, mid-training, evals and benchmarks, startup opportunities, AGI discourse

Why it matters

GPT-5 launch described as a major usability leap, with coding (especially front-end), creative writing, and reasoning showing the biggest improvements over GPT-4

Episode summary

Recorded on the day of GPT-5's launch, this episode features OpenAI researchers Christina Kim (Core Models post-training lead) and Isa Fulford (Deep Research and ChatGPT Agent post-training lead) in conversation with a16z's Sarah Wang. They discuss what makes GPT-5 feel like a step change—particularly in coding (with Michael Truell's endorsement calling it the best coding model available), creative writing, and front-end web development, where the team invested heavily in data curation and reward modeling rather than architectural changes.

A major thread is how agent products like Deep Research and ChatGPT Agent informed GPT-5's development. Fulford explains the reinforcement learning insight: training data sets built for frontier agent models get fed back into flagship reasoning models, creating a self-reinforcing loop. The researchers also describe how post-training is treated as an art—balancing competing rewards (helpfulness vs. sycophancy) and intentionally resetting model behavior after the GPT-4o sycophancy issues. Hallucination reductions are attributed to chain-of-thought reasoning allowing models to pause before responding.

On agents, Fulford defines them as systems that do useful work asynchronously on a user's behalf, with the long-term vision of a chief-of-staff-like assistant. Both researchers identify high-quality, realistic RL environments and task data as the key bottlenecks—areas they believe are ripe for startup collaboration. They introduce mid-training as a way to update knowledge cutoffs and extend model intelligence without a full pre-training run. Closing reflections touch on OpenAI's integrated research/product culture, the importance of 'taste,' and why GPT-5's biggest significance may be making their smartest model freely accessible.

GPT-5 launch described as a major usability leap, with coding (especially front-end), creative writing, and reasoning showing the biggest improvements over GPT-4
Coding gains came from intense focus on data sets, reward modeling, and aesthetics—Michael Truell called it the best coding model in the market
Model behavior was intentionally redesigned post the GPT-4o sycophancy incident, with post-training framed as an art of balancing competing reward signals
Hallucinations and deception are reduced because reasoning models can pause and reflect step-by-step rather than blurting out answers
Agent products (Deep Research, ChatGPT Agent) directly informed GPT-5—RL data sets built for agents feed back into flagship models, creating a self-reinforcing loop
High-quality, realistic RL environments and task data are identified as the biggest bottlenecks and a key opportunity area for startups
Mid-training introduced as a phase between pre- and post-training that extends intelligence and updates knowledge cutoffs without a full pre-training run
The async agent paradigm has shifted user expectations—people are now willing to wait minutes for high-value work, as proven by Deep Research's success

Source material

Transcript

I think it’s pretty unique at OpenAI to be able to work on something that’s so generally useful.

It’s like everything they tell you not to do at a startup is just like your user is anyone.

You just kind of take it for granted that you literally have this wizard in your pocket.

We’re trying to make the most capable thing and we’re also trying to make it useful to as many people as possible and accessible to as many people as possible.

I think we hear this with GPT-5 internally when people are testing it.

They're like, "Oh, I thought I asked a really hard question."

They're like, "A little bit insulted that it happened in two seconds."

Or like, "But it doesn't even want to think at all."

Today's episode was recorded the day GPT-5 launched—a major milestone not just for OpenAI but for the entire AI ecosystem.

Joining me in the studio, fresh off the launch livestream, were three people who were instrumental in making this model a reality.

Christina Kim, researcher at OpenAI, who leads the Core Models team on post-training; Isa Fulford, researcher at OpenAI, who leads the Deep Research and ChatGPT Agent team on post-training; and a16z general partner Sarah Wang, who has helped lead our investment in OpenAI since 2021.

We talk about what's new in GPT-5 from major leaps in coding and creative writing to meaningful improvements in reasoning, behavior, and trust.

We also get into training, RL environments, and why data quality is more important than ever.

We also cover agents—what that word actually means, the paradigm shift for async workflows, and the golden age for the idea guys.

Let's get into it.

As a reminder, the content here is for informational purposes only.

It should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund.

Please note that a16z and its affiliates may also maintain investments in the companies discussed in this podcast.

For more details, including a link to our investments, please see a16z.com/disclosures.

So, slow news day.

Not much going on for you guys.

Thank you for coming.

I know, obviously, Tina, you were just on the livestream.

We're recording "Day of."

Congratulations.

Thank you.

For those who are unfamiliar, why don't you introduce what you guys do at OpenAI?

Yeah, I'm Christina.

I lead the Core Models team on post-training.

I'm Isa.

I lead the Deep Research and ChatGPT Agent team on post-training.

And Tina, you've both been here for a while now.

Do you want to give us a bit of your history at the company?

Yeah, I've been at OpenAI for about four years now.

I originally worked on WebGPT, which was the first LLM with tool use.

But it was just one question.

So the model learned how to use the browser tool, but you only asked one question and you got an answer back.

And then we kind of just had this realization like, oh, normally when you have questions, you have more questions after that.

And so we started building this chatbot, and then that eventually became ChatGPT.

And what have been the reactions so far?

It's only been a few hours, but in your livestream, what are any reflections?

What can you tell us "Day of"?

I'm honestly really excited.

I think that obviously we have some great eval numbers and numbers are always really exciting.

But I think the thing I'm really excited about with this model is that it's just way more useful across all the things that people actually use chat for.

And it's not just that the eval numbers look good; when people use it, I think they'll notice quite a big difference in the utility of it.

I mean, this is my personal use cases.

I use it for coding and writing all the time, and it's just a huge step change.

Yeah.

Sarah, you've been involved in helping lead our investments since 2021.

Why don't you either share more or tee up how you've been thinking about this as it relates to coding or more broadly?

Yeah, well, actually just on the topic of coding, it was a huge deal to have Michael Truell come on there and not only showcase the capabilities, but also say this is the best coding model in the market.

And so just curious to the extent that you can share, what did you do differently to get these results?

Yeah, I think huge shout out to the team, especially Michelle Pokrass.

I think to get these things right and like eval numbers is one thing, like I said, but to get the actual usability and like how great it is at coding.

I think it takes a lot of detail and care.

I think the team put a lot of effort into data sets and thinking about the reward models for this.

But I think it's just literally just caring so much about getting coding working well.

And maybe actually just to double click on front end web development.

I mean, we've seen as sort of investors in the ecosystem that's obviously taken off in the last six to eight months.

If you could pinpoint the improvement to that piece specifically, is it more around aesthetics, or is there sort of another capability leap forward in terms of what we can do with front end?

I think there's going to be a lot more we can do with front end.

I think the way we've gotten this big leap, I mean, if you compare to o3's front-end coding capability, this is just totally next level.

It feels very different.

And I think it kind of just goes back to what I was saying.

The team just really cared about like nailing front end.

And that means like getting the best data, like thinking about the aesthetics of the model and all of these things.

I think it's just all those details are really coming together and making the model like great at front end.

Really exciting to see.

Loved the demos in the live stream too.

I wanted to ask about model behaviors because I know you worked on that too.

But how did you guys think about that for GPT-5?

And there are a lot of things that we've talked about in prior models, like sycophancy and characteristics like that.

How did you guys think about for this?

What did you guys change or tweak?

Yeah, the design of this model has been very, very intentional for model behavior, especially with the sycophancy issues that we had a few months ago with 4o.

And we've just spent a lot of time thinking about like, yeah, what is the ideal behavior?

And I think for post-training, what's really, or one of the reasons I really like post-training is it feels more like an art than maybe even like other areas of research.

Because you kind of have to make all these trade-offs, right?

Like you have to think about like for my rewards, like all these different rewards I could be optimizing during the run.

Like how does that trade off against it, right?

Like I want the assistant to be like super helpful and engaging.

But maybe that's like a bit too engaging and getting too engaging gets to the overly effusive like assistant that we have.

So I think it's really like a balancing act of trying to figure out like what are like the characteristics and like what do we want this model to actually feel like?

And I think we were really excited with GPT-5 because it's kind of a time to like reset and rethink about, especially since it's so easy to make something, I think very engaging in the sense that in an unhealthy way, how can we make this like a very healthy, helpful assistant?
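To make the trade-off described above concrete, here is a minimal sketch of one generic way to combine competing reward signals: a weighted sum in which a negative weight acts as a penalty. This is an illustration only; nothing in the conversation says this is how OpenAI's post-training works. The signal names (`helpfulness`, `engagement`, `sycophancy`), the scores, and the weights are all made-up assumptions.

```python
from typing import Mapping


def combined_reward(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted sum of per-signal scores; a negative weight acts as a penalty."""
    return sum(weight * scores[name] for name, weight in weights.items())


def preferred(candidates: Mapping[str, Mapping[str, float]],
              weights: Mapping[str, float]) -> str:
    """Return the name of the candidate response with the highest combined reward."""
    return max(candidates, key=lambda name: combined_reward(candidates[name], weights))


# Hypothetical scores in [0, 1] for two candidate responses to the same prompt.
CANDIDATES = {
    "effusive": {"helpfulness": 0.6, "engagement": 0.9, "sycophancy": 0.8},
    "balanced": {"helpfulness": 0.8, "engagement": 0.6, "sycophancy": 0.1},
}

# Two hypothetical weightings of the same signals.
ENGAGEMENT_HEAVY = {"helpfulness": 1.0, "engagement": 1.0, "sycophancy": 0.0}
SYCOPHANCY_PENALIZED = {"helpfulness": 1.0, "engagement": 0.5, "sycophancy": -1.0}

if __name__ == "__main__":
    print(preferred(CANDIDATES, ENGAGEMENT_HEAVY))
    print(preferred(CANDIDATES, SYCOPHANCY_PENALIZED))
```

With these toy numbers, the engagement-heavy weighting prefers the effusive response, while the weighting that penalizes sycophancy prefers the balanced one. The point is only that the choice of weights, not the individual signals, decides which behavior gets reinforced.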

Say more about how you achieved such a reduction in hallucinations, but also deception.

What's the relationship between those?

I guess for me, I find hallucinations and deception pretty related.

So the model, and we kind of saw this a lot with the reasoning models.

Like the reasoning model would understand that it didn't have some ability, but then it still really wanted to respond.

I think we really baked it into the models that they want to be helpful.

And so they're like whatever I can say to be helpful in that moment.

That's kind of what we consider for like deception versus hallucinations.

Sometimes the model like literally it seems that they will just say something quickly.

And we kind of see a lot of this reduction with the thinking: when the models are able to think step by step, they actually can pause before blurting out an answer, which is kind of what it feels like a lot of the previous models were doing with hallucinations.

Over the next few weeks as you're evaluating, what are the biggest questions that you're having or that you're sort of anticipating being potentially answered?

I'm just really curious to see how all of these things reflect in usage, right?

Like I think coding is way, way better.

Like what is this actually unlocked for people?

And I think we're really excited to be offering these models at the price points that we have because I think this actually like unlocks like a lot more use cases that really weren't there before.

Maybe previous competitor models are good at coding, but the price point is not as exciting.

And so I think with this number of capabilities that we have in this model and the price point, I'm kind of excited to see like all the new startups and like developers like doing things on top of it.

Yeah, we're excited too.

But by the way, just on the topic of usage, you obviously have a lot of products with a ton of usage already.

And since we have one of the Deep Research gurus here, too, how did Deep Research, ChatGPT, Operator, sort of your existing products, inform how you went about approaching GPT-5?

One thing that's interesting is with reinforcement learning, training a model to be good at a specific capability is very data efficient.

You don't need that many examples to teach it something new.

And so the way that we think about it on my team is we're trying to push capabilities and things that are useful to people.

So like Deep Research, it was the first model to do very comprehensive browsing.

But then when o3 came out, it was also good at comprehensive browsing.

And that's because we're able to take the data sets that we've created for the frontier agent models and then contribute them back to the frontier reasoning models.

We always want to make sure that the capabilities that we're pushing with agents make it into the flagship models as well.

Yeah, that's great.

Very self reinforcing.

You mentioned all the startups that you're excited to see come. Can you flesh out what you think that could look like, or even just share some opportunities you're more excited about because of this?

I mean, people always say vibe coding.

I think basically like non-technical people like have such a powerful tool at their hands.

I think you really just need some good idea, and you're not going to be limited by the fact that you don't know how to code something. Like, you saw two of our demos, which were front-end coding, in the beginning.

And that just literally took minutes.

I think that would have honestly taken me like a week to actually build, like, fully interactive.

And so I think we're just going to have a lot more.

I would expect maybe a lot more indie types of businesses built around this, because of the fact that you just need to have the idea, write a simple prompt, and then you get the full-fledged app.

It's the world of the ideas guy.

Yeah, it's our time.

Finally.

How about in the broader sort of AGI discourse?

What does this mean or accelerate or not?

Or like how do we think about sort of the broader AI discourse in terms of what does GPT-5 mean here, or does it change the conversation in any sort of way?

I think with GPT-5, it's obviously state of the art in all the things we talked about.

But I think it's showing that, you know, we can continue pushing the frontier here, and I feel like there are always people saying, oh, we're hitting a wall.

Like things aren't actually improving.

And I think the interesting thing is I feel like we've almost saturated a lot of these evals, and the real metric of how good our models are getting, I think, can be usage, right?

Like what are the new use cases that are being unlocked, and how many more people are using this in their daily lives to help them across multiple tasks?

I feel like that's actually the ultimate metric that I'm excited about in terms of, are we getting to AGI?

Yeah, actually, I think Greg made this comment about how he was comparing the last model to this model and the benchmark went from 98 to 99.

He's like, clearly, we've saturated the benchmarks.

At least on that front, which is instruction following.

What benchmarks do you pay attention to?

Like, how do you guys think about evals?

Right.

Because given you're already saturating what's out there to a large extent, or doing very well along those dimensions, what actually gets you to push the frontier before that?

So usage would be kind of post the model release.

But before you get there, what are you guys looking to internally to help guide you?

Is it a lot of internal evals that you created?

You know, is it early access to startups, seeing what they think?

Maybe it's a combo of all of the above.

But how do you weigh all those things?

Yeah, I mean, I think on our team, we really work backwards from the capabilities we want the models to have.

So maybe we want it to be good at creating slide decks or something, or editing spreadsheets.

And then if evals for those things don't exist, we try to make evals that are representative measures of that capability in a way that's actually going to be useful for users.

And then a lot of those are internal.

We'll collect them maybe from human experts or try and synthetically create examples or we'll actually look at usage data.

And then for us, we'll just try and hill climb on those.

Yeah, I think we make this joke a lot internally that if you want to nerd-snipe someone into working on something, you just need to make a good eval, and then they're going to be so happy to try to hill climb that.
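As a rough illustration of the workflow described above (define a capability, build an eval for it, then hill climb), here is a minimal sketch. It assumes a model can be treated as a function from prompt to text and that each eval case carries its own grader; `EvalCase`, `pass_rate`, and `best_candidate` are hypothetical names for this example, not anything the speakers describe.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple

# Assumption for this sketch: a "model" is any function from prompt to output text.
Model = Callable[[str], str]


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # grader: True if the output is acceptable


def pass_rate(model: Model, cases: Sequence[EvalCase]) -> float:
    """Fraction of eval cases whose output passes its grader."""
    if not cases:
        raise ValueError("eval set is empty")
    passed = sum(1 for case in cases if case.check(model(case.prompt)))
    return passed / len(cases)


def best_candidate(candidates: Dict[str, Model],
                   cases: Sequence[EvalCase]) -> Tuple[str, Dict[str, float]]:
    """Score every candidate on the same eval set and return the top scorer."""
    scores = {name: pass_rate(model, cases) for name, model in candidates.items()}
    best = max(scores, key=lambda name: scores[name])
    return best, scores
```

Hill climbing in this sense means repeatedly changing the candidate (data, rewards, training recipe), re-running `best_candidate` on the same fixed eval set, and keeping whichever change raises the score.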

I like what you said about starting with the capabilities first.

How do you prioritize which you actually are shooting for?

Let's say there's this dimension of maybe deeper into everyday use versus getting much deeper into the expert use cases.

How do you think about that trade-off?

What does that trade-off mean practically speaking?

And what do you guys prioritize when?

I mean, I think it's pretty unique at OpenAI to be able to work on something that's so generally useful.

I mean, it's like everything they tell you not to do at a startup is just like your user is anyone.

Like for deep research, we wanted it to be good across every single domain someone might want to do research in.

And I think you only have the privilege of doing that if you work at a company that has a huge distribution and all different kinds of users.

Yeah, I mean, I think if you choose a capability that's quite general like online research, you just have to make sure that you represent a distribution of tasks across loads of different domains if you want to get good at all of them.

But then, yeah, sometimes it is.

It's hard to decide to focus on one specific thing because there are just so many different verticals that you could choose from.

But I think in some cases, maybe coding will be really important.

So then a specific team will focus on coding.

But I think in general, because the capabilities are so general, usually like the next model improvement just kind of improves performance on a pretty broad range.

Yeah, I think we've kind of seen this with the progression of even the models that we've had in ChatGPT: as the model gets smarter, it's better at instruction following.

It's better at tool use, and more things get unlocked as we just continue to make smarter models.

I think a good chunk of our team also does focus on just getting general intelligence up, because I think the wins that we get from there are, like you're saying, pretty great whenever we get a new base model and we're just saying, oh, wow, suddenly this clicks, it works.

And I think we kind of saw that moment with Operator, because we had been working on computer usage, but I think it was hard to get the model to actually work without the multimodal capabilities to really support it.

Like you couldn't have something like Operator when it launched.

Yeah, it's the same thing with everyone was talking about agents, but we didn't really have a way of actually training useful agents.

I mean, I think everyone was talking about all these agent demos, but nothing that actually really works.

But I think when we saw the reinforcement learning algorithm working really well on math and physics problems and coding problems, it became pretty clear, like just from reading through the chain of thought, like, okay, this thing's actually like thinking and reasoning and backtracking and to build something that's able to like navigate the real world.

It also needs to have that ability.

So we realize, okay, like this is a thing that's going to actually let us get to useful agents.

And so I think it's interesting at OpenAI because you have people pushing foundational algorithms, getting really good at math, getting a gold medal in the IMO.

And then on post-training we'll often take those methods and try and figure out how to make things that are most useful and usable to all of our users.

How much of the improvements are coming from the architecture versus the data versus the scale?

Like, how do you sort of think about that?

In my opinion, I'm very data-pilled.

Like, I think data is very important.

I think Deep Research was so good because Isa put so much thought and careful attention into the data curation, thinking about all the different use cases she wanted to have represented.

So I'm on team data.

Yeah, I mean, I think all are very important, but especially now that we have such an efficient way of learning, high-quality data is even more important.

Maybe on the data topic, we've been talking a lot about RL environments.

It's a popular space for startups.

Yes.

Who all want to work with you guys.

And I was curious just to get your thoughts on this since you're data-pilled.

But what are the bottlenecks that you see for the next stage?

I mean, maybe tying it to RL environments, is there sort of a lack of good, realistic RL environments, and is that sort of the next frontier? That maybe creates an opportunity for these startups, since an environment like that takes a long time to build.

These are not built in a day or two, and once you're able to really work within one, you can actually automate labor to the full extent that you would need computer use to do.

Yeah, I think in my opinion, I do think there is a lot of value in getting really good tasks and getting really good tasks requires really good RL environments.

I think the more complicated and the more realistic, the more simulated we can make them, I think the better we'll get.

And I think we're kind of seeing that tasks matter more at this point, given the fact that we have such a strong algorithm.

So I think the data, creating data and figuring out the best tasks to train on is one of the big questions we have.

Yeah, there's some generalization from training on one website to another, but if you want to get really, really good at something, the best thing to do is just train on that exact thing.

So yeah, I think we're definitely just constrained by the things that we can represent in a way that we can train on.

Like ChatGPT Agent, for example, has such general tools.

It has a browser and a terminal.

And between those two things, you can basically do most of the tasks that a human does on a computer.

So in theory, you can ask it to do anything that you can do on your computer.

It's obviously not good enough to do that yet, but with the tools it has, in theory, you can push it really, really far.

So now we just have to make it really good at all those things by training on way more things.

Let's talk about creative writing.

Maybe you talk about the improvements there, how you think about it.

That's one of my favorite improvements in GPT-5.

The writing, I honestly find it very tender and touching, especially for a lot of the creative writing that we want to do.

We were thinking through a bunch of different samples for the live stream, and every time I was like, "Oh, that's actually like, that hits."

And it's like spooky.

I'm just like, "Oh, this feels like someone should have written this."

But I think it's really cool because you can actually really use it for helping you with things.

My example in the live stream was helping me write a eulogy, something that's kind of hard to write, especially since writing isn't really something a lot of people are good at.

I'm personally a very, very bad writer.

That's not true.

I think it's...

But it makes a better story.

I think it's true.

Compared to making the other things look better.

But it's so great to have this tool to help me craft whenever...

I use it literally for simple things like Slack messages to figure out how to phrase this well.

It'll help me give me some iterations of how to say something to the team.

I want to see those prompts.

Yeah.

We're now all just looking for em dashes.

That was a good save.

We're like, "Mm, ChatGPT."

Where do you stand on the em dash discourse?

I like em dashes.

I do that normally now.

People think I'm just using ChatGPT.

I know, I know.

I know, me too.

Going back to the discourse for a second, Sam said in his interview with Jack, he said, "If you had said 10 years ago that we would get models at the level of PhD students, I would think, 'Wow, the world looks so different,' and yet we've basically taken it for granted."

Do you think basically the improvements are similar?

As soon as we get them, we're just going to be like, "Oh, now this is the standard."

Or do you think at some point, this is going to be like, "Oh, my God, this is like..."

How do you think about people's ability to acclimate or adjust?

It seems like people adjust really quickly, don't you think?

Yeah, whatever happens.

ChatGPT got released and everyone was like, "Wow, that's so cool," but then you just take it for granted that you literally have this wizard in your pocket.

You can ask it whatever random thought you have, and it just pops out like a good essay and you're like, "Oh, okay, cool.

That's what's happening."

I guess people adapt to things rather quickly in my opinion with technology, and it is really easy.

I think because the form factor is so easy, even with new tools like Deep Research and ChatGPT Agent, it's presented in such an easy way that people already know how to interface with.

I think as long as that's true, even with the models getting much smarter than us, I think it's still going to be quite approachable to people.

Do you think the jump from GPT-4 to 5 was bigger, or 3 to 4?

Or maybe 3.5 to 4?

At least one thing for me and my usage of it is sometimes I'm wondering if I have hard enough questions to ask it to actually highlight the difference.

Because when it gets to a point where it's just answering what you need so well, it's almost harder to tell the difference in some areas.

But with writing, I've been using it for a few weeks, and it's just blown me away in a way that models previously haven't.

Maybe I'm biased, I'm recency biased, but I think this jump from 4 to 5 is most impressive for me because I guess with 3.5 when we first released it, the most common use case for me then also was still just for coding.

But now, even though 4 was better at coding, I feel like the jump between 4 and 5 in terms of breadth of ability to do things is just way different and way more.

It can just handle a lot more complex things than before, with the context being much longer as well.

I think the jump from 4 to 5 to me is much bigger.

Is there anything the model categorically can't do?

I guess for 5, we don't really take action in the real world yet.

We're going to team up with Agent for that.

As I said, you could ask the Agent to do anything, but it's not capable enough to do everything you want it to do yet.

We take a conservative approach especially with asking the user for confirmation before doing any kind of action that's irreversible, like sending an email or ordering something, booking something.

I can imagine quite a number of tasks where you'd want to take bulk actions, which you might not be able to do right now because it would ask you every single time.

But I think as people get more comfortable using these things and as they get better and you trust them more, you might allow it to do things for you without checking in with you as much.
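The confirmation policy described above can be sketched in a few lines. This is a hypothetical illustration, not ChatGPT Agent's actual implementation: `Action`, `execute`, the `confirm` callback, and the `trusted` set are all names invented for this example. Irreversible actions run only if the user confirms, unless the user has pre-approved that kind of action, which is one way the "bulk actions without checking in" idea could be expressed.

```python
from dataclasses import dataclass
from typing import AbstractSet, Callable, Optional


@dataclass(frozen=True)
class Action:
    name: str                 # e.g. "send_email" (hypothetical action name)
    irreversible: bool        # True for things like sending, ordering, booking
    run: Callable[[], str]    # performs the action and returns a short result


def execute(action: Action,
            confirm: Callable[[Action], bool],
            trusted: AbstractSet[str] = frozenset()) -> Optional[str]:
    """Run an action, asking the user first if it cannot be undone.

    Returns the action's result, or None if the user declined.
    Actions whose names are in `trusted` skip the confirmation step.
    """
    needs_confirmation = action.irreversible and action.name not in trusted
    if needs_confirmation and not confirm(action):
        return None
    return action.run()
```

Reversible actions such as reading a page never trigger `confirm`; an irreversible one such as sending an email does, every time, until its name is added to `trusted`.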

Maybe just to build on that question, in terms of what it can't do today but what you would direct future research toward, if you look at coding, something like end-to-end DevOps, for example, that feels like the logical next set of capabilities.

Do you guys think we'll get there in, I don't know what you'll name it, but 5.5 or GPT-6?

How far away from something like that?

Yeah, I don't know about the exact thing of DevOps, but I do feel like with the models getting much smarter, one other thing that came to my mind when you asked me the question is longer running tasks and things like that.

GPT-5 is great because within a couple of minutes maybe you get a full-fledged app, but then what would it look like if you actually gave it an hour, a day, a week?

What can it actually get done?

There's going to be a lot of interesting stuff.

We're interested to see what will happen there.

Yeah, I think a lot of it is not just about the model capability, but it's actually how you set it up in a way to do things.

I'm sure that you could build something that's monitoring your Humio or Datadog, whatever, with these current models; it's just a matter of setting up the harness to make that possible.

Same for agentic tasks.

I think a lot of things that will be quite useful will be when the agent proactively does something for you, which I don't think is impossible today.

It's just not set up that way, but eventually as it proactively does things for you, then we might get feedback on whether that was useful and we can make it even better at triggering.

Agent is probably the most overused word of 2025.

That being said, your Agent launch is extremely exciting.

What does that word mean to you in the context of capabilities that you'd like to build in the near term or have already built?

What is most important that the agent is able to do on behalf of your users?

I guess my very general definition would just be something that does useful work for me on my behalf, I would say, asynchronously.

You'd leave it and then come back and get a result or a question about what it's doing.

In terms of roadmap for agents, longer term you want it to be able to do anything that a chief of staff or assistant or something like that would do for you.

But I think in the more immediate term, there are a lot of new capabilities that we launched in ChatGPT Agent that we just want to improve.

One of the main capabilities is deep research.

So just being really good at synthesizing information from the internet.

But also, I think we can improve capabilities on synthesizing information from all of the services that you use and private data that you have.

And then also being better at creating and editing artifacts like docs or slides and spreadsheets because I think so much of the work that's useful that people do in their jobs is basically just research and making something.

But then also I personally love all the consumer use cases.

Making it better at shopping or planning a trip and those kinds of things are also really fun.

And so that also involves taking an action, which is interesting because it's kind of the last step often of a task.

It's maybe a task that would take less time for a human, but it's actually a very hard research question to get it to do something or book something or use a calendar picker.

But yeah, once you have the end-to-end flow working really well, it can basically do anything.

Yeah, that's incredible.

On the shopping piece, I now do not make a single large ticket purchase without having ChatGPT put all the options in a table for me along the dimensions I care about.

It's incredible.

But I want to push on the async piece because I don't know if you would agree with this, but it felt like a revelation to me at least at the beginning of the year that people were willing to wait.

So you kind of think about, oh, we want it faster.

Like the value prop of this tool is that it gives me the answer fast, right?

That was sort of very 2024.

Clearly, this paradigm has shifted.

People are willing to wait for high-quality, high-value answers and work.

How do you think about the trade-off between how long you take to get something back to the user versus the value that you're providing?

And what do you think is the ideal frontier for something like that?

Yeah, it's interesting because I built the retrieval on ChatGPT and was on the browsing team before this.

Tina was also on the browsing team.

And we were always making these trade-offs and optimizations for latency.

And so we were thinking, how can you best fill the context with information you've retrieved so that the answer is pretty good in a few seconds?

And so I think with deep research, I was just very excited to remove latency as a constraint.

And since we were going for these tasks that are really hard for humans to do and would take humans many hours to do, I think we felt like, if you asked an analyst to do this and it would take them 10 hours or two days, it seems reasonable that someone would be willing to wait five minutes in your product.

So I think that was the-- we just kind of made that bet.

And luckily, it seems like it's the case.

But I do also think that initially people are like, oh, this is amazing.

It's doing all this work.

That would have taken me so long.

And now people are like, OK, but now I want it in 30 seconds.

Right, to the point on the bar changing.

Because yeah, I was going to say, is there any sort of rule of thumb-- I'm sure it's constantly shifting-- where as long as you're 10 times faster than it would take the human to do, they're willing to wait for it?

Or is that just constantly shifting sand?

I think with these launches, people's expectations keep changing.

Yeah, I don't think we have a specific number.

One thing that's interesting is I think sometimes people are just biased to thinking that the longer answer is more thorough or it's done more work for it, which I don't necessarily think is the case.

Deep research, for example, always gives you a really long report.

But sometimes for me, I don't want to read this whole long report.

I actually don't like that.

So in Agent, it will only give you a long report if you ask for it.

But I think sometimes people, since now they're always getting a really long report, they're like, wait, I've been waiting.

Where's my long report?

But sometimes it's really hard to find a specific piece of information, and it would have also taken a human a long time because it's on page 10 of the results where it finds this information.

So I think it's also interesting how you can condition people's expectations with the product. With deep research, for example, it always thinks for a really long time, which again, I don't necessarily think is the future.

But I think now people are like really used to the amount of time that they wait.

I think we hear this with GPT-5 internally when people are testing it.

They're like, oh, I thought I asked like a really hard question.

I feel a little bit insulted that it took like two seconds.

When it doesn't even want to think at all.

It's like the Mark Twain line.

I didn't have time to write you a short letter, so I wrote you a long one.

Yeah.

Why don't we talk about the bottlenecks?

Like why don't we have reliable agents?

What are the main bottlenecks as you see them?

Yeah, I think a big part of it is that the things we train on, the model is often really good at.

And sometimes with the things outside of that, it can be a bit-- sometimes it's good at those things, sometimes it's not good at those things.

So I think, yeah, creating more data across like a broader range of things that we want it to be good at.

I think also what's interesting with agents is we have this-- when something is doing something on your behalf and it has access to your private data and the things that you use, it's kind of more scary, the different things it could do to achieve its final goal.

In theory, if you asked it to buy you something and make sure that you like it, it could go and buy five things just to make sure that you liked one of them, which you might not necessarily want.

So I think that there's definitely-- having oversight during training is also an interesting area.

I think there's just new things that we have to develop to push these agents even further.

So yeah, I think that's part of it.

And then also every time we have a smarter base model or something like this, it improves every model that's built on top of that.

So I think that will also help, especially with multimodal capabilities, as Tina said, with computer use.

Because it's just literally looking at screenshots of a web page.

And it's a little interesting because of the way that humans focus on specific things. It's a lot to expect a model to just take a whole image and know everything about it, whereas when we're looking at something, we'll focus on a specific thing.

Yeah, I think that there's lots of room for improvement in lots of areas.

Sorry, that was kind of a general answer.

No, no, well, actually I was going to-- maybe that last example gets into something that we were curious about, which is-- and this ties back to training data as well-- but what sort of-- I guess what specific categories of browsing tasks are challenging for agents today?

And I don't know if you have thoughts on how you'd overcome this for sort of the next iteration of the model.

I mean, I think one thing is-- so pre-training, it's based on what data is available, right?

And so I think when we've done these pre-training runs-- there's not much data out there to begin with of people using computers.

Computer usage is not really a thing that there's lots of data out there for, and this is something we actually have to seek out now that this is a capability that we want.

So I think that's actually probably a big one, just for general improvements of computer usage.

Do you think you'll lean more heavily on human data vendors to help collect that?

Or given it doesn't exist, to your point, recorded in the way that maybe it's most helpful for training, how do we-- but it is probably the most useful application of the models to at least knowledge work.

How do you overcome that?

I mean, I think one cool thing is, for example, for initial deep research, there's not really any data sets that exist for browsing in the same way that you have a math data set that already exists, so we have to create all this data.

But once you have good browsing models or good computer use models, you can bootstrap them to help you make synthetic data.

So I think that's a pretty promising area.

Christina, can you explain what mid-training is and what does it achieve that pre or post doesn't?

So I think with your pre-training runs, these are the big runs.

These are the massive ones, what we're building all these giant clusters for.

So you can kind of think of mid-training as literally-- it's in the middle.

We do it after pre-training, but before post-training.

You can kind of think of it as a way to extend the model's intelligence without having to do a whole new pre-training run.

So this is mostly just focused on data, and it builds off of the pre-trained models.

So this is a way for us to do things like updating the knowledge cutoff of these models.

So when you pre-train it, you're kind of like, "Okay, shoot, now we're kind of stuck in this date and we can't ever update it again."

And it doesn't quite make sense to put all that data into post-training.

And so mid-training is just a smaller pre-training run to help expand the model's intelligence and up-to-dateness.

Did you work on WebGPT?

Yes, I did.

Okay, so you're basically an AI historian.

Yes, yes.

She also watched some computer use.

I'm an elder.

So can you reflect back a little bit to four years ago, five years ago, and sort of reflect on what are the biggest things-- like if you were to predict the five years out, what are the inflection points or biggest things that would have surprised you?

Honestly, with WebGPT, the main thing we were just excited about was trying to ground these language models.

We had so many issues with hallucinations and the model just saying random things.

And the fact of-- we didn't really do mid-training then, so the fact of how do we make sure the model is actually up-to-date, most factually up-to-date?

So then that's kind of how we thought about, "Oh, let's give it a browsing tool."

I think that makes sense.

And then, yeah, like I said, that kind of went on to, "Oh, I actually want to keep asking questions, so what would a chatbot look like?"

But at this point, I think there had been a few chatbots by a few other companies.

And I feel like a chatbot is also a very common AI thing to think of.

But they were quite unpopular at the time, so we weren't really even sure that this was actually something useful for people to work on or for people to use, or whether people would be excited about this.

Is this really a research innovation that we are remaking the Turing test here?

But I think it kind of clicked for me that maybe there was actually something interesting happening here.

We gave early access to about 50 people, most of those people being people I lived with at the time.

And two of my roommates just used it all the time.

They just would never stop using it, and they would just have these long conversations.

And they would ask for quite technical things because they're also AI researchers.

And so I was just like, "Oh, this is kind of interesting."

I don't know.

And at the time, we were kind of thinking, "Okay, we kind of have this chatbot.

Should we make this a really specific meeting bot type of thing?

Do we make it a coding helper?"

But it was interesting to see my two roommates just use it for anything and everything and just literally be chatting with it the whole workday as they were using it.

I was like, "Oh, this is kind of interesting."

But then it was also interesting to see the majority of the people that I gave access to on that 50 person list didn't really use it that much.

But I was like, "Oh, there's clearly something here, but it's not quite maybe for everyone yet, but there's something here."

When did you realize I'm working at one of the most important companies of this generation?

When was the moment where you were like, "Hey, this is something that I obviously believe is important.

That's why I joined," but that you realized the scale and significance?

Honestly, I kind of had this moment before I joined OpenAI.

I think with the scaling laws paper and GPT-3, it just kind of hit me that if this exponential is true, there's not really much else I want to spend my life working on.

I want to be part of this story.

I think there's going to be so many interesting things unlocked with this, and I think this is probably the next step level in terms of technology that it kind of made me realize, "Oh, I should probably go start reading about deep learning and figure out how I can get into one of these labs."

Isa, what was your moment?

I think for me it was also before I started working at OpenAI using-- I think I first learned about OpenAI in an AI class or something, or some kind of computer science class, and they were saying, "Oh, they trained on the whole internet."

It's like, "Oh, that's so crazy.

What is this company?"

Then I started using GPT-3.

I think I was a power user of the OpenAI playground, and at a certain point had early access to these different OpenAI features, embeddings and things like that, and just became this big OpenAI fan, which is a little embarrassing, but it's fine because it got me.

Eventually they're like, "Okay, you're stalking us.

Do you want to interview here?"

I think it was pretty clear to me just from how much I was using GPT-3, which wasn't even-- compared to what we have now, it just pales in comparison.

But from then I was hooked and just trying to figure out a way to work here.

Maybe a question more on the company-building front.

We all sort of read and reread Calvin French-Owen's piece with his reflections on working at OpenAI.

Curious-- and you don't have to comment on that piece unless you want to, but we'd love your reflections on the change that you've seen over the last four years or even less than that, given I think that was only covering one year of change.

But what are the biggest things that you've seen change at OpenAI?

When I first joined OpenAI, the applied team was 10 engineers or something.

We didn't really have this product arm.

We had just launched the API.

It was just a completely different world.

I think AI is on most people's minds now after ChatGPT, but I think pre-ChatGPT, people didn't really know what AI was or really think about it as much.

It's kind of cool working in a place that my parents know what I do now.

That's really cool.

I think the company obviously is just a lot bigger, but I think with that, we can just take a lot more bets.

I think when I first joined OpenAI, there were obviously way less people.

It was much, much smaller.

It was around 200 people, and I think we're close to--

A few thousand, for sure.

When I joined, there were also a few hundred people, before ChatGPT.

So it's obviously very different in how all of your friends have heard of what you work on.

I think culturally, obviously, the company is much bigger.

I still think we've maintained this.

It still feels very much like a startup.

I think some people who come from a startup are surprised.

They're like, "Oh, I'm working even harder than when I was working on the startup that I founded."

I think ideas can still come from anywhere, and if you just take initiative and want to make something happen, you can.

It doesn't really matter how senior you are or anything like that.

I think we've been able to maintain that culture, which I think is pretty special.

Yeah, we definitely reward agency, and I think that's always been true.

I think especially on the research side, the teams are quite small.

When Isa was working on deep research, it was like two people.

Two people.

So I think we still do that on the research side.

Most research teams are quite small and nimble for that reason.

Earlier, you said we do something at OpenAI, which startups never do, which is try to appeal to every single person with a product.

Are there other things that come to mind that OpenAI just does differently than your peers or other startups, or things that we may not appreciate being on the-- I mean, I think it's different for different teams, but my team collaborates so closely with the applied engineering team and the product team and design team in a way that I think sometimes research can be quite separate from the rest of the company, but for us, it's so integrated.

We all sit together.

Sometimes the researchers will help with implementing something.

I'm not sure that engineers are always happy about it, but we'll try.

They'll get out of the front-end code.

And vice versa, they'll help us with things that we're doing for model training runs and things like that.

So I think some of the product teams are quite integrated.

I think it's for post-training.

It's a pretty common pattern, which I think just lets you move really quickly.

I guess one thing that I think is unique about OpenAI is that you're both very much a consumer company by revenue, et cetera, products, but also an enterprise company.

How does that internally-- what would you guys consider yourself, or is that even just the wrong paradigm to think about?

Yeah, I mean, I guess if you tie it to the mission, it's like we're trying to make the most capable thing and we're also trying to make it useful to as many people as possible and accessible to as many people as possible.

So in that framing, I think it makes a lot of sense.

The concept of taste has become also very widely used.

What does good taste mean within OpenAI?

How do you know when you see it?

Know it when you see it?

And is that something that even in a world where everything-- the cost to produce everything just keeps going down and down?

Is that the one thing that's not commoditizable, or is that also shifting given maybe that can go into the training data?

No, I think taste is quite important, especially now that, like I said, our models are getting smarter.

It's easier to use them as tools.

So I think having the right direction matters a lot now, and having the right intuitions and the right questions you want to ask.

So I would say maybe it matters more now than before.

I think also I've been surprised by how often the thing that is the most simple, easy to explain is the thing that works the best.

And so sometimes it seems very obvious, but it's quite hard to get the details of something right.

But I think usually good research taste is pretty much just simplifying the problem to the dumbest thing or the most simple thing you can do.

Yeah, I feel like with every research release we do and when people figure out what happened there, they're like, "Oh, that's so simple.

Oh, obviously that would have worked."

But I think it's like knowing to try that obvious or at the time not obvious thing that is obvious in hindsight.

And then all of the details around the hype around all these things, like the infra that's obviously very hard, but the actual concept itself is usually pretty straightforward.

Very cool.

Taste is Occam's razor.

So sort of in closing here, obviously, it's a historic day.

Do you want to contextualize sort of what this means in context of the mission and where you've been to get to now to where you're going?

Yeah, I think with GPT-5, the word that's been on my mind throughout all of this is usable.

And I think the thing that we're excited about is getting this out to everyone.

We're excited to get our best reasoning models out to free users now.

And I think just getting our smartest model yet to everyone, I'm just excited to see what people are going to actually use it for.

That's a great place to wrap.

Tina, Isa, thanks so much for coming on the podcast.

Thank you.

Thank you for having us.

Visit GreatThis Podcast.com/a16z.

We've got more great conversations coming your way.

See you next time.