
Latent Space · 2025-04-15
OpenAI GPT 4.1: New Developer-Focused Workhorse Models
Hosts: Alessio, Swyx
Guests: Michelle Pokrass, Josh McGrath
Why it matters
GPT 4.1, 4.1 Mini, and a new Nano tier released, focused on instruction following, coding, and 1M context
Key claims
- GPT 4.1, 4.1 Mini, and a new Nano tier released, focused on instruction following, coding, and 1M context
- 4.1 is smaller and cheaper than 4.5 but doesn't beat it on all benchmarks; most 4.5 workloads can migrate to 4.1
- Significant gains come from post-training techniques rather than just pre-training scale increases
- Two new open-source long-context evals released: graph traversal walks and MRCR (multi-round co-reference retrieval)
Episode summary
Summary
OpenAI's Michelle Pokrass (now leading post-training research) and Josh McGrath join Latent Space to discuss the launch of GPT 4.1, GPT 4.1 Mini, and the new GPT 4.1 Nano. The models are positioned as developer-focused workhorses with three core improvements: instruction following, coding, and a first-of-its-kind 1 million token context window. The team explains the confusing 4.5-to-4.1 naming jump by noting that 4.1 is significantly smaller and cheaper than 4.5, doesn't beat it on all benchmarks, and most developers can replace much of their 4.5 usage with 4.1.
Josh McGrath details the long-context work, including two open-source evaluations (graph traversal and MRCR multi-needle retrieval) designed to test reasoning across the full context rather than simple needle-in-haystack tasks. On coding, GPT 4.1 scores 55 on SWE-bench (beating o1), with the team emphasizing post-training as the primary driver of gains rather than just pre-training scale. The conversation also covers vision improvements (especially in 4.1 Mini), preference fine-tuning, prompting best practices (XML for inputs, instruction placement), pricing updates (prompt caching discount raised from 50% to 75%), and the strategic rationale behind deprecating 4.5 to reclaim GPU compute.
- GPT 4.1, 4.1 Mini, and a new Nano tier released, focused on instruction following, coding, and 1M context
- 4.1 is smaller and cheaper than 4.5 but doesn't beat it on all benchmarks; most 4.5 workloads can migrate to 4.1
- Significant gains come from post-training techniques rather than just pre-training scale increases
- Two new open-source long-context evals released: graph traversal walks and MRCR (multi-round co-reference retrieval)
- GPT 4.1 scores 55 on SWE-bench, beating o1's 41; team broke coding into facets like diff quality, repo exploration, test writing
- Extraneous edits eval: 4.0 made unrelated edits 9% of the time vs 2% for 4.1, addressing agentic overreach
- Prompt caching discount increased from 50% to 75%; 4.1 Mini is NOT cheaper than 4.0 Mini
- Fine-tuning available day one for 4.1 and Mini (not Nano); preference fine-tuning highlighted as underused for style steering
Source material
Transcript
Hey everyone, welcome to the Layton Space podcast.
This is Alessio, partner in CTO Eddasible, and I'm joined by my co-host Swyx, founder of SmallAI.
Hey, and today we have a returning guest as well as a new friend, welcome Michelle and Josh.
Hello!
Both of you work on the...
I guess Michelle, I think used to introduce you as manager on the API team.
It seems like you've changed your role since we last talked on the podcast.
Yeah, now I lead a team on the research side, specifically in post-training.
Yeah, and Josh, you are also on post-training.
Yep, I'm a researcher on Michelle's team.
Yeah, and I just found an interesting commonality you guys have.
You're also both from Waterloo, continuing the tradition of extremely correct engineers.
Oh yeah, we talked about that last time, that's right.
Okay, so we're gathering to talk about GPT 4.1.
You launched it, I mean, we got a little preview and it was a little bit rumored, right?
It was pre-released, I guess, with OpenRouter as Quasar Alpha, and then it was also an Optimus version.
And I think people are trying to figure out why are we going back from 4.5 to 4.1, you know, there's a whole bunch of other things, but what are the headline facts, I guess, you guys want to emphasize about 4.1?
Yeah, I'll just say we released three new models today, GPT 4.1, GPT 4.1 Mini, and GPT 4.1 Data.
And the real focus on these were just making models that were great for developers, so we improved instruction following, coding, and shipped our first 1 million context models.
Josh, anything to add?
I don't know if there's anything else that people should really...
They're all, like, sort of in the fine print.
No, I think the only thing that I would touch on maybe twice is that there's actually a new model in the lineup, Nano, which is even faster for developers, so they're making, you know, low latency applications.
And cheaper.
What's any fun story behind the code names?
Or, you know, I got the Strawberry Hat as another fun time in the lore of OpenAI.
Yeah, yeah, we really wanted to get as much developer feedback as possible on this model to make sure it worked well in the real world.
And so we tested it kind of through OpenRouter, and it was super cool to see people latch on to the names and get the theories going.
But the feedback we got from there was super helpful.
Yeah, so it's not even like the name, it's more about just like the API shape.
Once we saw like, Chad Kumple, it was like very obviously OpenAI.
Yeah, it's a good note.
Yeah, but like, I mean, okay, is there like an emphasis on stars?
Like, what inference were we supposed to draw from, you know, quote unquote, super massive black holes?
I don't think there's anything really to draw from there.
Okay, they're just cool.
Just code names, you know, they make you think of cool concepts.
The vibes are good.
The vibes are good.
Yeah.
The other thing about the examples, we're just mining for lore here, right?
The interesting animal comes up a few times on the live stream and on the blog posts.
What's up with Tapirs?
Who likes Tapirs here?
Yeah, our team is just a super big fan of Tapirs.
So they just happen to work their way into a lot of our content.
Okay, cool.
Awesome.
Yeah, go ahead.
Yeah, go ahead.
I think like the first thing that yeah, we just want to run through is obviously the 4.1 to 4.5.
I think that's the first thing that everybody was maybe confused about.
So I don't know you're denvergating 4.5.
It sounds like 4.1 is just like a kickass model and the 4.5 size, maybe it's not as good of a pit.
That was just a research preview.
So yeah, I don't know.
Whatever you want to say to address that, I think it's something we've seen come up also in the Discord.
Yeah, totally.
Okay, naming is really hard.
And we've tried to make this as less confusing as we can.
But you know, nothing's perfect.
Basically, the way we got here is that GPT 4.1 is like a pretty big improvement over the 4.0 line.
And we really wanted to signify that.
However, it's a model that's like much smaller and cheaper than GPT 4.5.
And as a result, you know, doesn't achieve the same like Amy or other intelligence emails.
So it doesn't beat 4.5 on all of the emails.
And so we didn't think it made sense to increment beyond 4.5.
But we do think for most developers, they can kind of replace a lot of their 4.5 usage with 4.1.
And then the mini strictly better than 4.0 mini.
Yeah.
Yeah.
With the nano.
But like, we don't know if 4.1 is a distillation of 4.5 or there's no relationship there.
Like what can we say about the shared lineage?
Yeah, what I'll say there is we're always using various research techniques to improve our models.
And distillation is something we talked about before.
It's really meaningful, especially for the small models.
And we've kind of pulled out some of the things that made 4.5 really good.
Like it has a lot of the instruction following greatness and also rolled that into 4.1.
Awesome.
I think one of the because I really strongly remember on the 4.0 launch that their communication was that we were kind of moving to a new model architecture that is Omni model, right?
That's what that's the O in 4.0.
And that 4.1 is part of this subsequent trend of trying to merge everything like the reasoning model, the Omni model, everything.
And I think that there's this doubt about whether 4.1 I think is basically trying to be sold as a strict replacement for 4.0.
But I don't know.
Is it going to be fully Omni model?
Is it is it like roughly the same architecture that we that we that we think 4.0 has?
So we we already have different slugs on the real time API and like, yeah, sponsors API.
So they're already somewhat different checkpoints.
I don't think we don't have any current plans to release 4.1 in the real time API.
But you know, things things may change.
Yeah.
And then there's image and all that right?
Like, and as far as we know, no plans maybe about nothing announced.
Not not right now.
The focus for 4.1 was kind of these three core capabilities for developers.
Yeah, we are Discord actually also did a launch but watch party for the recent 4.5 podcast that Sam Altman did, where I think for the first time, it was basically kind of confirmed that like something that people already knew, like Andre Carpathi was already talking about this, that 4.5 was like 10x the size of four.
And I think there's a question about like, do we do the linear linear interpolation of 4.1 is like, you know, zero point is like, I don't know, 2x size or something?
That's not really how we think about naming the models is a whole bunch of parts that go into the recipe.
And so, you know, it doesn't really reflect on just the pre training recipe, our version numbering, but I think the 4.1 is just because of the large jump that we have in like coding capabilities, long contacts, and so on, it's more so what it's like for the end user more so than anything about the training recipe.
We can go a little under the hood on training, though.
And we'll say that, you know, nano is obviously a new pre train.
We also have a new pre train for mini.
And then larger version is a new mid train.
But we find that actually a significant amount of the games come from new post training techniques.
And so I think in the past, the narrative is that you need to pre train these larger and larger models to get better performance.
And we're finding that we're able to squeeze a lot more out of post training now.
Talking about how big a model is the other side of it is the context window, you have a 1 million context, I know that Sam at that day, last year, you said that yeah, 1 million was like months away.
So right on right on time.
Can you talk about yet how hard that was to get to 1 million and then maybe where the end game is in your mind?
Is it 10 million 100 million infinite?
What what really matters as you start to scale this?
Yeah, Josh worked a lot on long contacts.
So he's the right person to ask.
Definitely.
So I think the first thing that we that I thought was really interesting when we were going to long context is actually some of the evals that you see as like headlines on maybe other blogs where it's needle in a haystack.
Actually, most of the models do really well right out of the box.
But then we had to actually first get a lot of measurement on the longer context for long context reasoning.
So you know, we actually just open sourced two new evaluations that are about using the context in a more complex way.
So you know, one of them, you have to reason a lot about ordering and the other is actually walking through graphs.
So there's a lot of reasoning that you have to do in those data sets.
And that's where doing long context is actually much harder.
But single needle in a haystack, we were able to saturate pretty easily.
And then most of the work came at this these harder tasks.
Yeah, I was gonna say how not just how you think about length of context in terms of consuming documents versus like active kind of like thinking and planning.
I think there's obviously a whole part on the prompting side around billing agentic worthless.
Do you think that people maybe like still think too much of it about yet needle in a haystack kind of document retrieval versus like traversing very long plans and kind of like iterations in context?
Yeah, I think the mental model that I have is maybe actually has some more variables in it.
So there's, there's the needle in a haystack where you have like some amount of distractors and some you know, needles that you're trying to find.
And I think that it's more so about how dense of the context you need to use.
So like summarization, you're you're actually just using the entirety of the context, whereas you know, needle in a haystack, it's very sparse.
And then I also generally think about orderedness, if you're going to make some sort of inference on this, do you are you just look looking, you know, sort of front to back or do you need to move around in the context in order to generate a good answer while the model sampling?
Yeah, is that something that you worked on with graph walks?
Is that the thing?
Yeah, that was sort of the, the most synthetic and clean way to measure the model.
And then you know, we worked on a lot of other training techniques, data to sort of test the model's ability and training the model's ability to reason throughout the context in a sort of shuffled way.
Yeah, you know, actually, I have the ability, I like to give people a little bit of visual aid with with these things.
So I actually went into your hugging face release and got an example of the graph task.
And so there's there's a few versions of this, right?
There's like the BFS and DFS version.
And also, it's I guess it's very character specific.
So I don't know, maybe could you tell us like, like, you know, design choices around this, like what was surprisingly hard, you know, anything like that?
Yeah, so the idea here is you take a graph and you encode it into the context by looking at the the edge lists and just putting that into the context and then asking the model to do an operation.
And then, you know, under the hood, we're actually just executing the real operation and using that to then evaluate the model's ability to work.
One of the things that I found surprising, at first was the what the model would do when it wasn't sure how to use its context, you know, early versions of the model just sort of looping saying like, Oh, no, I can't find this edge that I think need should be there.
And yeah, I think I was actually very surprised how all models seem to have more difficulty than I would have expected on a task that, you know, we would find very simple or like, you know, maybe an undergrad could write a Python script to run in a couple of minutes.
Yeah, right.
Okay, so like, what is the part like, no, what is the real life task that this is meant to model, I guess, you know, I feel like the other one, MRCR.
Yeah, seems a little bit more intuitive, where, you know, you have like four different stories, and you pick out the second one.
And that's a real task that people have.
But people don't really traverse graphs, like this is a bit more theoretical.
But like, you know, was there any sort of correlation study done?
Yeah, this is actually meant to be sort of the idealized version of like a multi hop reasoning benchmark.
So we have a lot of things where, you know, you're putting hundreds of documents into the context, and then you might ask a question that you actually have to traverse 10 documents for, but there, the edges are they're implicit, right?
Like there is some underlying graph that's connecting all of these documents that you need to traverse in order to answer the question.
But they're actually much harder to reverse because the edge isn't actually given to you.
And so the question there was like, okay, if I actually just give you all of the IDs of these things that you need to traverse, can the model even do that where it's like, it's actually just a lower bound on how well the model can do.
And I think it's that that's actually somewhat well, well reflected in some of the internal benchmarks we have that are using more natural data.
Imagine something like a tax return, right?
Or are you like, upload the entire tax code, like to figure out what to put into this, you know, box, you'll need to reference all of these boxes.
And so this is like a similar level of multi hop reasoning.
But again, like Josh said, all of the references are implicit.
Yeah, I think that takes some kind of backtracking, if it's needed is also super interesting, especially for agent work.
I for listeners who have been listening to us for a while, we actually covered this paper in New York's last year, two years ago, called Geevel, where they actually modeled graphs for graph traversals for agent planning.
And it reminds me closely of that.
It's just that they never came up with this exact format that you have here, which basically is the same thing.
I also like that you included blank answers, because sometimes people do hallucinate, or models do hallucinate answers, and you have a fair amount of blank ones.
Thank the random sampling over graphs I did, I guess.
Is this tied also to the file search API that you released recently?
Like, how should people think about how everything kind of comes together in the API?
Yeah, I think oftentimes with retrieval, you might be using RAG to fill the context.
And a lot of this is to get around the limitation of a short context window.
So we do expect a lot of developers to start uploading their full context more directly to the model.
So for smaller tasks, you maybe don't need the whole vector store.
But we do anticipate this to play well with that paradigm as well.
Like maybe you can just insert way more chunks into the context.
So we think it'll play nice.
Yeah.
Any relationship to the memory upgrades and chat GPT that we recently got?
Is long context just directly usable for memory?
Or should we just always have a separate memory system?
Yeah, it's a good question.
So right now, the dreaming feature, we kind of have some of these memories embedded in the context.
But, you know, they are they are separate features.
So 4.1 is powering the API, whereas the enhanced memory is chat GPT only.
Yeah.
Awesome.
Yeah, I think I think that's interesting.
I guess the one last thing I'll call it on long context, which is kind of unintuitive, or maybe there's an explanation, which was the you had two needle for MRCR.
And then we had four and eight.
And everything kind of just regresses to some kind of baseline of like, let's say 30%, or 20% as that.
But it's interesting to see where the smaller models sometimes match or outperform the larger models.
I was wondering if there's anything unusual there?
Or do you think it was like a bad roll of the dice?
I think it's probably just a bad roll of the dice.
I think I would probably look more so at the the larger narrow ones, see things regresses you and increase the number of needles because there's sort of more complex reasoning that has to do about the order of different things in its context.
Awesome.
Yeah, cool.
Happy to happy to move on from there.
Yeah, we have a whole bunch of other evals that we can go over.
So I had in my notes that we could talk over, you know, anything that you that you want.
There was also like Kali from Shen Yu with who is have on a podcast for instruction following.
And I realized that, you know, he joined open the eye and I wonder if he had a role to play in that one.
No, we did not collab on it.
Honestly, I think it's best when you know, authors and model developers don't collab too much because you things, you know, as objective as possible, not training game any else.
Yeah.
And then I think there was also like for the first time the announcement of the or shout out of the internal instruction following benchmark from API data.
People have had the ability to opt in to share data for a while.
Actually, I like publish the I posted a tweet because I found it in the dashboard that you can just opt in.
And like there's basically 16 days left for this program where you can get free inference.
And like so I'm just kind of curious like what you found from that kind of IFE value that that might be different from the normal IFE value that people have.
Yeah, totally.
A lot of the instruction following emails that are open sourced are open sourced, you know, or crafted in a way that are easy to craft.
So for example, like graph walks is is somewhat easy to craft like you can create this graph and verify easily.
But it is not exactly aligned with what the users are doing.
And this is true for some of the instruction following emails where you ask the model to output exactly four words or, you know, three paragraphs, or stuff like that things that you can verify easily in code.
And these are useful instructions, but we find that many of the really interesting instructions are actually challenging to grade.
And so the open source emails often don't have them.
And so getting this like real world, diverse set of data actually helps us find like, what are the commonalities and what developers are doing?
What is a really good example of like a negative instruction?
And then we can go from there and figure out how to how to evaluate it.
Yeah, I think that's it.
There's also an interesting question of like, what domains do people use you on?
And I wonder if like, there's a way to tell you.
Because sometimes it can be very confusing if I especially because maybe I'm building an app and letting people use my key, but other people are building apps on top of me.
So you have just a lot of chaos of like multiple degrees of abstraction, where you just have to parse through the prompts.
Yeah, it's true.
Well, I will say we do use our own products internally where we can.
And so we're not manually by hand reading every prompt after they're like, anonymous, we scrub them of any identifying data, then we use our models to take passes to categorize them.
And so if we get feedback that like, we're not doing well on ordered instructions, then we can kind of do a pass over all of our data and find some good examples of those.
So there's an instruction following section and this great prompting GPT for one models, I think maybe we can go through some of these examples.
The first one that caught my mind that it's not necessary to use all caps and other incentives like bribes or tips.
But developers can experiment with this for extra emphasis.
So I think the second part leaves me confused.
Are you saying that people should still try and do this?
And sometimes the model responds positively to it?
Do you feel like it's still just part of the lore?
I'm curious why I would have loved for you to say either yes, it works or like no, you should stop.
It looks silly.
I guess the truth is somewhere in the middle.
The truth is always messy.
Reality is that our models have gotten a lot better at following instructions just stated once and clearly, but we find honestly developers often become the best experts at prompting our models because you know, you're building your livelihood on this thing and get to know the details of it really intimately.
So I will say stuff like that won't hurt the performance of the model.
We cannot always want to leave it open to people to figure out what works best.
Yeah.
Yeah.
And then you had to always start with a response rules or instructions section.
Are those keywords meant to be taken kind of like verbatim?
Like those are kind of like the tokens that work the best or is it just like an example?
Warm example.
Yeah.
Okay, cool.
Yeah.
This is great.
I feel like until today we did an episode with like the prompt report on like all these prompting techniques, but then it's also unclear for which model which ones work best.
So it's super useful.
And then you had a in the agentic workflows one you have a persistence thing.
It's like, please keep going.
How much and I think I read that improves like the sweet, the sweet bench, like 20% just by having like the persistence I think it's not that this one prompt improves sweet bench 20%.
It's that we found this is the most effective harness for our model and combined with all the post training improvements and results in the big improvement.
Yeah.
Like the model is trying a lot to be helpful and often it wants to check back in with the user and be like, you know, should I keep doing this?
Like am I on the right track?
And so a prompt like this mixture, it keeps going doesn't bother you again and just gets the task done.
Yeah.
Yeah.
I think like there's this interesting trade off between persistence and yielding back to the user, the more agentic a model wants to be the more persistent it should be.
But then sometimes it just goes off the rails.
And I wonder how you solve this trade off because sometimes it just goes too far.
There's been criticisms of Claude Sonnet trying to rewrite too many files at once when I just wanted to make one thing, for example.
And that's a form of bad persistence.
What are the axes here in which like you think about it?
Yeah.
I think one interesting thing that comes to mind here is that we had an extraneous edits eval where you asked the model to make an edit and classify like were all of its changes related to what it was asked to do or did it go off and do a little too much.
And we found that from four oh, which got 9% is pretty crazy.
9% of the time making an extraneous edit is a lot.
4.1 is at 2%.
So it's a pretty big improvement.
So yeah, I will just say like focusing on this, we've heard feedback about this.
We made an eval and we made sure to track it and improve it during training too.
Yeah.
Yeah.
I mean, everything comes out to the eval as no surprise to anybody.
That's true.
There's another interesting eval that I think is causing some noise.
For the first time, I think also that you being the master of structured outputs should know that JSON is bad now and we should all use XML.
I wouldn't say that.
I don't know which eval you're talking about, but it's in the prompts guide, which maybe you guys didn't write.
So we kind of spring this on you.
Yeah.
Noah and Julie and our team wrote the prompt guide and did a great job.
I do think XML is very helpful for structuring prompts.
Whereas for parsing outputs, maybe the story is a bit different.
Like sometimes it's really useful to get outputs in JSON so you can plug them directly into your application.
But I do think the models work particularly well with XML as inputs.
But, Chris, you need an ad?
No.
No.
Cool.
I mean, I think people always just care a lot about code tool calls and structured outputs as you well know.
And so any updates to instructions over there is good.
People also are interested in this concept of that apparently putting the instructions and user query at the top and the bottom, so duplicating it at the top of the bottom in the context, is much better than putting it top only and much better than putting it bottom only.
Again, this is from the prompt guide, so I don't know how aware you guys are on this.
Yeah, I think part of that was just like, you know, a miracle.
We tried all three for when we were evaluating the model and having that redundancy is definitely the best.
But then using those, the instructions at the beginning, the model is going to be able to then take that into account as it does processing.
Yeah, I think like a lot of people would see this as like running counter to prompt caching, because obviously you want to put the things that change a lot at the bottom.
Basically, is this fixable in post training?
Like, can we just tell models to take instructions or user queries only at the bottom, because we want to optimize for prompt caching?
When we figure it out, we will do that.
It seems doable.
It seems like a post training thing.
I don't know, maybe my mental model post training is wrong.
So I think actually having things at the beginning of the prompt, you would still get prompt caching there.
If you're putting in, for example, like a big needle in a second and you have the data changing each time, like per user, there's still different ways that you can be putting the prompt at the beginning and getting a lot of the cache hits.
It sort of just depends on your use case.
Yeah, awesome.
The other thing I noticed, I know you made a note of this, Sean, too, is that our chain of thought and reasoning and how people should think about this model versus a reasoning model.
Yeah, what's your, yeah, should I just use 4.1 and prompt it to do a chain of thought?
Should I use a one and make a plan and then use 4.1 to implement the plan?
How should people think about composability?
Yeah, it's a great question.
We have found that 4.1 is a lot better at doing planning and thinking through its steps in COT when prompted than our previous non-reasoning models.
But our reasoning models are designed to have kind of more coherent plans and be able to reason over longer horizons than these non-reasoning models.
And you can see that reflected in things like intelligence benchmarks.
So, AMI, GPQA, stuff like that, you'll see the reasoning models do much better.
So, in general, I would say the question you're really getting at is, "I'm a developer, which model should I be using?"
And I think the answer is always going to be the fastest model that accomplishes your task.
So, maybe you start prompting 4.1 as a starting point.
If it does your task super well, then maybe you drop down to 4.1 mini and save latency or even nano.
Whereas if 4.1 is struggling a little, maybe needs more coherent reasoning over longer time horizons, then maybe you upgrade to a reasoning model.
Is there a quick way to get through this heuristics?
I know one thing that a lot of people do is like they use 01 for like a plan and then they put that plan in cursor and then have the plan applied to their code base.
It sounds like there's maybe not a rule to when to do which, it's just like task dependent.
Yeah, I would say we're all kind of figuring out the best way to use these models together.
So, I do think reasoning models for planning and using kind of more targeted models to execute is definitely a good architecture.
Cool.
If there's nothing else on that side, I'd love to go into the coding, which is something that we're emphasizing a lot.
It's doing super well.
It's better than 01 in Sweetbench.
Was that expected?
Not really.
Yeah, like what's the story there?
There's also Sweet Lancer, which is a newer one, which attaches a money value to things.
And basically, what should people understand is going on here?
Like, is it a better coding based model or just a coding agent model?
And I think there's also a question about like, how important to coding is it if I'm not using a coding use case?
Yeah, so I'll start by saying we just set out to make model that was great at coding, both in your terminal or in your editor or wherever you want to use it.
And so we kind of broke that down into the problems that it encompasses.
So like developers want the model to produce better diffs, for example, or they want the model to explore the code base correctly, or they want to produce code that compiles or produce code that writes tests.
And so our approach was kind of teaching the model all of these various facets.
There's kind of just a bunch of work streams that all coalesced around GBT 4.1.
Yeah, I think much improved post training all over to make for a better coding model.
Yeah, I think there's like different kinds of coding, right?
Like, it's interesting for me to observe that there, for example, so I'm just gonna pull it up on the chart here because I always like to show people visuals.
You're 55 on Sweetbench, and 01 gets like a 41.
But then on, oh, I don't think I don't think I have the others like but but Aider is it is less it is not at 01 level.
And so I think I think I struggle to get some kind of intuition of when like, like, what are the different elements of coding?
I guess there is like, you know, single file edits, where there's like a diff or a whole file.
And then there is entire project edits.
Is that a reasonable split?
Are there more to this?
Yeah, that's one way to think about it.
Basically, where GBT 4.1, can I kind of explore go through a repo?
Yeah, I've been trained to do that particularly well.
Whereas, you know, to just get some code and produce a change, a reasoning model might do better because it can kind of reason over the entire file.
And so that's one good way to think about it.
Yeah, yeah, that's fair.
Any any understanding of like the smaller ones, the smaller models, like basically, for coding, I should only use 4.1 and forget the rest.
You might like want to use the smaller models.
Maybe if you have like, if you have an IDE where you need an autocomplete feature, for example, or if you want something super fast, if you're building like, I don't know, a text to SQL thing, you might want the first version of populate instantly.
So you can see like 4.1 mini is actually quite significantly better than 4.0 mini, and not that far away from the old 4.0.
So I do think that model will find use case in a bunch of these coding niches.
And I know you might not be able to talk about this, but the clip of the OpenAI CFO talking about the agentic suite has been going viral, I think today, it seems like every lab is putting a lot of emphasis into coding.
So yeah, I'm just curious if there's anything you can share about how people should think about open AI encoding, you know, obviously today you don't have, you know, Clotus clock code, you don't have any anything related to coding.
And I think the Windsor partnership today, they're giving for one for free, 4.1 for free for a couple weeks is maybe like one of the first open AI endorsement, I guess, on the live stream.
But yeah, just I know there might not be an answer that the PR team might approve, but I'm curious if you have any takes and thoughts.
I think just stay tuned.
Yeah, we, I think coding is an important use case for our users.
And so that's why we focused it on it a lot for 4.1.
We also love to use our own products internally.
And so making 4.1 selfishly helps us move faster as a company.
And so that's where the real focus has been for this model.
Do you track what percentage of code is written by 4.1 internally now?
We do have some metrics like that.
I don't have it off the top.
But I was actually just talking to one of the researchers on the team who worked on something over the weekend.
And he said that this model GBT 4.1 was able to like get 4950 of his commits on this massive PR done.
So we were pretty happy to hear that.
Excited to use that.
Awesome.
Yeah, I think on the yeah, I think coding is a super exciting use case.
And I think like open AI has always been very developer first as you've been to Michelle.
So it's good to see the convergence.
Yeah, the other I think the last capability that I kind of vectored in on was vision or just multimodality in general, it is a lot better.
Basically, I think like I really like these niche benchmarks like Math Vista and chart size.
Yeah, just any any any extra color on the vision side that you wanted to talk about, but maybe you couldn't fit into the blog post.
Yeah, yeah.
I think one maybe small nugget there is actually I think that 4.1 mini is really exciting on that front, because we were talking about it's a different pre training base.
And I think that really shows up in some of the vision evals.
And yeah, we talked about like coding instruction following long contexts, a lot of games coming from post training.
But in particular, multimodal, like, basically everything you're seeing the games are there for pre training.
So kudos to the pre training team there.
They've done incredible work on on perception and multimodal.
Yeah, totally.
So something that we've been exploring on the podcast for a while, and I'm curious if there's any takes on your side, is is there a strong split between like sort of what I call screen vision versus embodied vision, right?
Like, are you taking pictures of a training on snapshots of a computer for computer use, or, you know, and anything with charts, anything like on a PDF is very similar to that, or pictures from the real world, which is more embodied, right, like where a robot might be able to use that people have argued back and forth.
I'm curious where the movement is or the emphasis is.
I think one of the first off, I think that 4.1 is better at both of those things, regardless of how it was actually trained.
I think I would probably somewhat defer to the pre training team when it comes to which one you should be using, or using, you know, a mixture of both.
But we've improved our results across eVals on both.
Awesome.
Yeah, that's something that I think people should definitely do want to explore the more embodied stuff as well, because the benchmarks tend to focus on the screen vision stuff, you know, more, more chat, more controllable.
It's always easy to look at eVAL that is easy to grade.
Yeah, exactly.
Those are the things that get looked at the most, for sure.
I think one of the things that was really funny with both the 4.1 mini and nano is we had some strange internal eVAL results.
And it turns out that actually the, these new vision capabilities, they were able to read like, you know, signs in the background and stuff, which was actually changing like some of the validity of our results.
And so we were, you know, just running into different eVAL problems as you actually improve the models.
Is there a feature of a 4.1 image gen or is that like a completely different part of this vision?
Like, you know, in some sense, vision is image to text.
And the other way around is image gen.
Is it that simple or something else?
It is not.
No place right now to get 4.1 image gen.
Well, you know, it's very, very popular.
We like it too.
It's like melting your GPUs.
I mean, talking about GPUs, right?
Like, you know, part of this whole deprecation of 4.5 and moving people to 4.1 is to get back your GPUs.
That's a message that both Shuki and Kevin Weil have mentioned.
But like you are running all these models concurrently for the next three months.
Like, I don't know if you get back your GPUs.
I think it just grow that usage even more.
Yeah, I do think, you know, people get the message on deprecation and start moving over.
So as developers use this model less, we can kind of reclaim that compute.
But you're right, it takes a while.
And the trade off there is really our commitment to developers.
Like, if you have something in the API, we won't take it away without, you know, sufficient notice.
Yeah, with some notice.
That's the trade off that is right for us.
Okay, awesome.
Then a couple other smaller announcements.
Fine tuning available day one, which is I think new for open AI.
Usually you have to wait like a month or two for the fine tuning capability for 4.1 and mini only in nano and future.
Any specific call outs for fine tuning?
I guess like there's fine tuning is a general discipline that always applies.
But any wins that you guys can talk about?
So first off, yeah, shout out to the fine tuning team.
They've worked really hard to get this ready on day one.
One thing I will say is that I think people have slept on the preference fine tuning offering or the I think that's what we call the product.
Yeah.
So SFT is people know it pretty well.
It's the original fine tuning we had, whereas this preference fine tuning is super helpful for steering in a particular style.
And so I think not enough people are using that.
Isn't that only for reasoning models or is that for everything?
No, that's reinforcement fine tuning is only for reasoning models.
Right.
Preference fine tuning offers the pairs.
Yeah, exactly.
Yeah.
And I thought it was an alpha.
This is why I haven't looked into it.
I think it's RFT that's still an alpha.
Okay.
Well, that's a lot of confusion that we that we just cleared up.
Yeah, I think we're going to let us know.
You know, I'm doing my conference again in June and I think we're going to do a workshop on just general all the fine tuning options.
And I think that will clear up a lot of things, which is good.
Okay.
New models.
I know that we can't talk a lot about a lot of them.
Noam Brown from your reasoning team just said that there should be a follow up on reasoning models soon.
What can we say about that?
Sounds like he's, we're not the right people to ask, but let's stay tuned for.
Yeah.
But like 4.1 is a good basis for like whatever comes next.
Right.
Yeah.
Not all of our models kind of build on each other necessarily, but we think 4.1 is a great standalone offering for developers.
And we also think, you know, reasoning models are a good tool and toolbox.
Yeah.
Like more just generally, like I always want to explore the relationship between non reasoners and reasoners.
And then also like how we merge them.
Are we doing routing, you know, anything, anything of that sort.
Obviously you have a lot of secret sauce.
Cool.
And then I think the other thing that a lot of people are demanding or asking about is the creative writing model.
Will that ever see the light of day?
We're working on incorporating kind of those improvements into the models more generally.
On a separate release.
People love about 4.5 is like humor, the green text, the nuance.
So we've heard that feedback and I know, yeah, there's lots of folks working on that and trying to bring it into our next models.
Awesome.
Alessio, anything else?
No, this was great.
Any requests for the developer community?
Things that you want them to try out that maybe people are not doing, things you want them to build for you using the new on the new APIs.
I feel like first off send us feedback.
It was really useful to look at different partners and customers who are using our models and to get this like nice wrapped feedback from them.
It allows us to iterate a lot faster.
And on that vein, you know, opt in to data sharing.
This just helps us make the model better for you.
And one kind of slept on way to do this is the email's product.
So you can upload an email and opt in such that we'll pay for the inference costs if we can also use the email.
And this is just another great way.
Like we'll use those emails to make sure our models are getting better for people over time.
Yeah, I think the email's is permanent.
There's no end date announced, but the opt in in the API is at least until April 30th.
I think a lot of people still don't know about it.
We might want to extend that so that people can do more.
Yeah, it's like always with the team.
Yeah, awesome.
And I think the last question I had was on just on pricing.
I think pricing, you know, it's basically just generally cheaper than 4.0, but like not a ton, but like cheaper.
And then you're also introducing this concept of blended pricing for the first time that I've seen it.
But maybe it's just been out there for a while because you have caching and all that.
Just generally, what is the cash to non-cash ratio that we should be thinking about when thinking about workloads?
Like is there a general rule of thumb?
So one clarification, which is that GPT 4.1 Mini is not cheaper than GPT 4.0.
So it's not just like a blanket decrease in all the models, but however, 4.1 Mini is cheaper than 4.1.
Also, not sure if this is widely reported, but we've increased our prompt caching discount from 50% to 75% on these models.
Yeah, I saw that.
So that's a big input, you know, into figuring out what kind of application you build.
And then your question was on like what kind of blips do you think about?
Yeah, blended pricing, right?
Like I think there's this question of comparability of prices across models and across providers because like I, you know, like some people are three to one in terms of context to output and then some a part of that is cached.
I selfishly, I make a chart that just plots all the model labs versus all the prices.
And I'm sure you guys have seen it.
And I don't know what numbers to plug in there.
So what are people seeing in real life?
What's the median, you know, caching rate?
I don't think we have that off the top.
Blended pricing is more to just make it easier to compare like so you can say something like GPT 4.1 is 25% cheaper than GPT 4.0.
Yeah, you want one number.
Yeah.
Yeah.
No.
All right.
Well, I'll have to figure it out.
But thank you so much.
That was fantastic.
Thanks for all the work.
I think people are very excited to get to work testing this out, giving you feedback.
And I'm sure we'll be back again for the next one.
Probably the reasoner.
Nice.
Thank you guys.
Thank you.