Latent Space · 2025-10-07

OpenAI DevDay 2025: Apps SDK, AgentKit, MCP, and the New Era of Prompting

Hosts: Alessio (Kernel Labs), Swyx (Latent Space editor)

Guests: Sherwin Wu (OpenAI Platform), Christina Huang (OpenAI Platform)

OpenAI DevDay 2025 · Apps SDK · MCP protocol · AgentKit / Agent Builder · ChatKit · Evals and prompt optimization · Codex · Developer platform strategy · Reliability and Service Health

Why it matters

OpenAI adopted MCP around March 2025 alongside the Agents SDK and Responses API.

Key claims

  • Apps SDK launches built on MCP, inverting the web paradigm so ChatGPT is the outer layer and third-party apps embed inside, giving developers full branded UI control — a hard-won lesson from the Plugins and GPTs eras
  • OpenAI adopted MCP around March 2025 alongside the Agents SDK and Responses API; team member Nick Cooper sits on the MCP steering committee and the team credits Anthropic for treating it as a genuinely open protocol
  • AgentKit bundles Agent SDK, Agent Builder (visual node-based canvas), Connector Registry, ChatKit, and Evals into an end-to-end agent platform with shipped templates for customer support, data enrichment, document comparison, and planning
  • Evals is described as only ~10% complete for agents: trace-based grading is live, but multimodal evals, trajectory-level scoring, and automated prompt optimization (including interest in GEPA-style approaches) are the near-term roadmap

Episode summary

Summary

Sherwin Wu and Christina Huang from OpenAI's Platform team join Latent Space live from DevDay 2025 to walk through the major launches. They describe the Apps SDK as a natural extension of OpenAI's developer-first mission, finally delivering the lessons from earlier failures with Plugins and GPTs: third-party developers now get full UI control, custom branded components, and ChatGPT distribution (800M weekly active users, top app store rankings) rather than being constrained to model-call tools. The architecture intentionally inverts the traditional website-plus-chatbot model, putting ChatGPT at the top layer with apps embedded underneath, all built on the MCP protocol which OpenAI formally adopted around March 2025 alongside the Agents SDK and Responses API.

  • Prompting is positioned as increasingly important, not less — framed as 'zero-gradient fine-tuning' — with automated prompt improvement meant to be tightly integrated with the Evals loop on Agent Builder
  • Evals now supports third-party models via OpenRouter, letting developers benchmark across providers including open-source models on Together, addressing multi-model portability
  • ChatKit is an embeddable iFrame (Stripe Checkout-style) with a widget ecosystem at widget.studio; open-sourcing the iFrame itself is unlikely because the evergreen update model is the value, though widget primitives and design system could be explored
  • New Service Health dashboard exposes per-org real-time SLOs (TPM, response codes, token velocity) as OpenAI pushes toward four- and five-nines reliability following last December's multi-hour outage

Source material

Transcript

[MUSIC PLAYING] Hey, everyone.

Welcome to the Latent Space podcast.

This is Alessio from Kernel Labs, and I'm joined by Swyx, editor of Latent Space.

Hello.

Hello.

And we are here at OpenAI DevDay with Sherwin and Christina from the OpenAI Platform team.

Welcome.

Thank you for having us.

Yeah.

It's always-- Very good to be here.

Yeah.

It's so-- it's such a nice thing.

We've been-- we've covered, like, three of these Dev Days now.

And this is, like, the first time it's been, like, so well-organized that we have our own little studio, podcast studio, in the Dev Day venue.

And it's really nice that you actually get a chance to sit down with you guys.

So thanks for taking the time.

Yeah.

I feel like we-- Dev Day is always a process.

And, like, we've only had three of them, and we try to improve it every time.

And I actually-- I know for a fact that I think we have this podcast studio this time, because the podcast interviews and the interviews with folks like yourselves last time went really well.

And so I want to lean into it a little bit more.

I'm glad that we were able to have this studio for you all.

We were kneeling on the ground interviewing, like, Michelle, last year.

I don't know.

[INTERPOSING VOICES] Yeah.

I don't know that.

I just saw it post-production.

I thought it was-- We had to have people, like, cordon off the area so they wouldn't walk in front of the cameras.

People would just come up, hey, good to-- I'm like, we're, like, recording.

Nice.

I guess if you guys have been to three, like, what stood out from today?

Or what's your favorite part?

I feel like the vibes are just a lot more confident.

Like, you are obviously doing very well.

You have the numbers to show it.

I just-- every year, in Dev Day, you report the number of developers.

This year, it's 4 million.

I think last year was, like, three.

And I have more questions about that kind of stuff.

But also, like, just very interesting, very high confidence launches.

And then also, I think that the community is clearly much more developed.

Like, I think there's just a lot more things to dive into across the API surface area of OpenAI than, I think, last year, in my mind.

I don't know about you.

And we were at the OG Dev Day, which was the DALL·E Hack Night at OpenAI in 2022.

And I think Sam spoke to, like, 30 people.

So I think it's just crazy to see the-- Yeah, honestly, I think it's kind of similar to this podcast studio, which is, I think, we've had a number of Dev Days now.

We honestly were, like, slowly figuring things out as a company over time as well, both from a product perspective and also from a, like, how we want to present ourselves with Dev Day.

And at this point, we've had a lot of feedback from people.

I actually think a lot of attendees, we'll get an email with a chance for feedback as well.

And we actually do read those.

And we act on those.

And one of the things that we did this year that I really liked were all of those-- there was some art installations and the little arcade games that we did, which came up with-- via engaging with the feedback from the-- Yeah, the arcade games were so fun.

I loved the theme of all the ASCII art throughout.

This was my first SF Dev Day.

But I've been to the Singapore one.

That was actually my first-- Oh, yeah, that's the one I spoke to.

Yeah, I saw you there.

That was my first week of OpenAI.

So they're really in the defense.

We're on a plane to Singapore.

Yeah.

Yeah, that's awesome.

Well, so congrats on everything.

And kudos to the organizing team.

We should talk about some developer API stuff.

So we're going to cover a few of the things.

You're not exactly working on apps SDK.

But I guess, what should people just generally take away-- what should developers take away from the apps SDK launch?

How do you internally view it?

So the way that I think about it is, I actually view OpenAI since the very beginning as a company that has really valued, kind of like, opening up our technology and bringing it out to the rest of the world.

One thing we talk about a lot internally is, our mission at OpenAI is to, one, build AGI, which we're trying to do.

But two, potentially just as important is to bring the benefits of that to the entire world.

And one thing that we realized very early on is that we as a company, it's very difficult for us to just bring it to truly every corner of the world.

And we really need to rely on developers, other third parties to be able to do this, which is-- Greg talked about the start of the API and how that was formulated.

But that was part of that mentality, which is we need to rely on developers.

And we need to open up our technology to the rest of the world so that they can partake for us to really fulfill our mission.

So the API obviously is a very natural way of doing that, where we just literally expose API endpoints or expose tools for people to build things.

But now that we have ChatGPT with its 800 million weekly active users-- I forgot the stat that we-- sure, I think it's now the fifth or sixth largest website in the world.

And the number one and number two most downloaded on the Apple App Store.

Oh, yeah, with Sora.

Yeah.

Yeah, but that one, it moves around all the time, so it's kind of hard to celebrate.

And-- We just screenshot it when it's good.

We definitely screenshot it when it was good.

But going back to my main point, we've always kind of engaged the developers as a way for us to bring the benefits of AGI to the rest of the world.

And so I view this as actually a natural extension of this.

Candidly, we've actually been trying to do this a couple of times, with last Dev Day with GPTs-- sorry, two Dev Days ago with GPTs-- GPTs, plugins.

--plugins, which was, I think, not tied to a dev day.

So I view this as, again, we love to deploy things so iteratively.

And I view it as just a continuation of that process and also engaging deeply with developers and helping them benefit from some of the stuff that we have, which in this case is ChatGPT distribution.

And so the Apps SDK is built on the MCP protocol.

When did OpenAI become MCP-pilled?

I'm sure internally you must have had design discussions before about doing your own protocol.

When did you buy into it, and how long ago was that?

I think it was in March, I want to say.

It's hard for me to remember kind of the exact-- March was the takeoff of MCP.

OK, yeah, yeah.

So we built the agents SDK, and we launched that alongside the responses API in early March.

And I think as MCP was growing, that felt like a really-- and we're building kind of a new agentic API that can call tools and just be much more powerful.

MCP was kind of like the natural protocol that developers were already using to bring all the tools into their system.

And I think in March is when we added an MCP to agents SDK first, and then soon after with kind of our other-- Yeah, I think there was like a tweet or something we did.

There was like, OpenAI is-- Yeah, there was definitely a moment.

I think there was a specific moment in a specific tweet.

But what I will say though is like-- and this is honestly a credit to the team at Anthropic that kind of created MCP-- is I really do think they treat it as an open protocol.

Like we work very closely with I think like David and the folks on the consortium.

And they are not really viewing it as this like thing that is specific to Anthropic.

They really view it as this open protocol.

There is like-- it is an open protocol.

The way in which you make changes feels very open.

We actually have a member of our team, Nick Cooper, who is sitting on kind of like that steering committee for MCP as well.

And so I think they are really treating it as something that is easy for us and other companies and everyone else to embrace, which I think they should because they do want it to be something that is very embraced by all.

And so because of that, I think it makes it a little bit easier for us to embrace it.

And honestly, it's a great protocol.

It's very general.

It's already solved.

Why would you make it?

Yeah, it's very general.

There's obviously still more to do with it.

But it was very easy for us to integrate because of how streamlined and how simple it was.

Yeah.

My final comment on apps SDK stuff, and then we'll move to AgentKit, is I always see it like abstractly when you sort of wireframe a website or an AI app.

It used to be that the initial AI integration on the website would be you have the normal website and then you have a little chat bot app.

And now it's kind of like inverted where there's ChatGPT at the top layer.

And then there's, like, the website embedded inside of it.

And it's kind of like that inversion that I honestly have been looking for a little bit.

And I think it's really well done.

Actually, all the integrations and the custom UI components that come up, you had Canva on the keynote there.

And it looks like Canva, but you can chat with it in all the context of your ChatGPT.

That is an experience I've never seen.

Yeah, and I think that's kind of back to the iterative learning that we've had.

That I think was because we've learned a lot from plugins.

So when we launched plugins, I remember one of the feedback that we got.

I don't know if people here really remember plugins.

It was like March '23.

One of the points of feedback was like, oh, you can integrate.

We told all these companies that you can integrate these plugins into ChatGPT.

But they really didn't have that much control over how exactly it was used.

It was really just like a tool that the model could call.

And you were just like really bound by ChatGPT.

And so I think you can kind of see the evolution of our product with this.

And this time, we realized how important it was for companies, for third-party developers, to really own and steer the experience to make it feel like themselves, to help them really preserve their own brand.

And I actually don't think we would have gotten that learning had we not had all these other steps beforehand.

Awesome.

Christina, you were the star today on stage with the AgentKit demo.

You had eight minutes to build an agent.

You had a minute to spare.

And then you had so many-- Three seconds.

Yeah, I wasn't sure.

Honestly, I was like, let's do a little bit less testing.

And maybe we-- I don't know how much time I killed on the widget.

I was extremely stressed when the download came.

Yeah, I was stressed out.

If a UI bug is what takes the demo down, it would be so sad.

I think it was a full screen, yeah, focus thing.

Yeah, I heard the window wasn't in focus or something.

Maybe you want to introduce AgentKit to the audience.

Yeah, so we launched AgentKit today.

Full set of solutions to build, deploy, and optimize agents.

I think a lot of this comes from working with API customers and realizing how hard it actually is to build agents and then actually take them into production.

Hard to get that confidence and the iterative loop and writing prompts, optimizing them, writing evals.

All takes a lot of expertise.

And so taking those learnings and packaging them into a set of tools, that makes it a lot easier and intuitive to know what you need to do.

And so there's a few different building blocks that can be used independently.

But they're stronger together because you then can get the whole end-to-end system and releasing that today for people to try out and see what they build.

Yeah, so I find it hard to hold all the building blocks in my head.

But actually, chronologically, it's really interesting that you guys started out with the Agent SDK first.

And then you have Agent Builder.

You have a Connector Registry.

You have ChatKit.

And then you have the eval stuff.

Am I missing any major components?

Those are the main moving parts, right?

Yeah, I think that's it.

I mean, we also still have the RFT fine-tuning API.

But we technically group it outside of the AgentKit umbrella.

Got it.

Yeah, so it's weird how it develops.

And it's now become the full Agent platform, right?

And I think one thing that I wasn't clear about when I was looking at the demo was-- it's very funny because what you did on stage was build a live chat app for Dev Day's website.

Yeah, did you get a chance to try it out?

Yeah, I tried to try it out.

That was awesome.

And actually, I wanted to ask how to deploy-- Where's merch?

Yeah, exactly.

I was like, where'd you click the merch?

Anyway, and this is very close to home because I've done it for my conferences.

And it's a very similar process.

But I think what was not obvious is how much is going to be done inside of Agent Builder.

I see there are some actually very interesting nodes that you didn't get to talk about on stage, like user approval.

That's a whole thing.

And transform and set state.

There's kind of like a Turing complete machine in here.

Yeah.

Yeah, so I mean, I think, again, this is the first time that we're showing Agent Builder.

And so it's definitely the beginning of what we're building.

And human approval is one of those use cases that we want to go pretty deep on, I think.

The node today that I showed is pretty simple, like binary approval.

Approve or reject.

It's similar to what you'd see for MCP tools, of approving that an action can take place.

But I think what we've seen with much more complex workflows from our users is that it's actually quite advanced, like human in the loop interaction.

Sometimes these could be over the course of weeks.

It's not just kind of simple approval for tool.

There's actual decision making involved in it.

And I think as we work with those customers, we definitely want to continue to go deeper onto those use cases, too.
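The binary approve-or-reject node described above can be pictured as a gate in front of an action. The following is a minimal sketch in plain Python, not Agent Builder's or the Agents SDK's actual API; `ProposedAction`, `run_with_approval`, and the stand-in reviewer are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ProposedAction:
    """Hypothetical record of something an agent wants to do, e.g. a tool call."""
    tool: str
    arguments: dict = field(default_factory=dict)


def run_with_approval(
    action: ProposedAction,
    approve: Callable[[ProposedAction], bool],
    execute: Callable[[ProposedAction], str],
) -> str:
    """Run the action only if the reviewer approves; otherwise report a rejection."""
    if approve(action):
        return execute(action)
    return f"rejected: {action.tool}"


def stand_in_reviewer(action: ProposedAction) -> bool:
    # Stand-in for a real person: approves read-only tools, rejects everything else.
    return action.tool.startswith("read_")


def stand_in_execute(action: ProposedAction) -> str:
    # Stand-in for actually calling the tool.
    return f"executed: {action.tool}"


approved = run_with_approval(ProposedAction("read_calendar"), stand_in_reviewer, stand_in_execute)
rejected = run_with_approval(ProposedAction("send_email"), stand_in_reviewer, stand_in_execute)
```

The longer-running cases mentioned in the conversation (approvals that span weeks, or that involve real decisions rather than yes/no) would need persisted state and richer responses than a boolean, which this sketch leaves out.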

Yeah.

What's the entry point?

So are developers also supposed to come here and then do the export to code? Like, just segment the use cases.

Yeah.

So I think the two reasons that you would come to Agent Builder are one, more as a playground, to model and iterate on your systems and write your prompts and optimize them and test them out.

And then you can export it and run it in your own systems using Agent SDK, using other models as well.

The second would be to get all of the benefits of us deploying that for you, too.

So you can use maybe natural language to describe what type of agent you want to build, model it out, bring in subject matter experts so that you really have this canvas for iterating on it and getting feedback, building data sets, and kind of getting feedback from those subject matter experts as well, and then being able to deploy it all without needing to handle that on your own.

And that's a lot of the philosophy around how we're building it with ChatKit as well.

You can kind of take pieces of it.

You can have a more advanced integration where it's much more customized.

But you also get a really natural path of going live with really kind of easy defaults as well.

Do you see it as a two-way thing?

So I build here.

I go to code.

Then maybe I make changes in code.

And then I bring those changes back to the agent builder.

I think eventually that's definitely what we want to do.

So maybe you could start off in code.

You could bring it in.

We'll also probably have ability to run code in the agent builder as well.

And so I think just a lot of flexibility around that.

The one thing I'd say too is a lot of the demos that we showed today, I think, erred on the side of simplicity just so that the audience could kind of see it.

But if you talk to a lot of these customers, they're building pretty complex.

You've got to zoom out on that canvas quite a bit to kind of see the full flow.

And then for us, we were kind of working with a lot of customers who were doing this.

And then if you turn that into an actual agent SDK file, it's pretty long.

And so we saw a lot of benefit from having the visual set up here, especially as the setup grows longer and longer.

It would have been a little difficult to showcase this.

But even on some of the-- You want to do it in eight minutes.

Yeah, you can do it in eight minutes.

But even with some of the presets that we have on the side with the support being-- Yeah, exactly.

One of the things that we launched today as well alongside just the canvas is a set of templates that we've actually gathered from our engineers who are working in the field with customers directly of the kind of common patterns that they have in our own-- basically like playbooks when we're working with customers on customer support, document discovery, and so kind of publishing those as well.

Data enrichment, planning helper, customer service, structured data Q&A, document comparison.

That's nice.

Internal knowledge assistant.

And I think we just plan to add more to those as we can kind of build those out.

I was wondering if there should be-- so you're not the only agent builders, but obviously by default being OpenAI, you are a very significant one.

Any interest in a protocol or interop between different open source implementations of this kind of pattern of agent builder?

I think we've thought about it, especially around, I'd say, agents SDK.

I would actually say maybe even zooming out a bit more from just this is like-- yeah, we were also sitting here and observing things being made over and over again.

Even besides agent workflows, we're kind of launching what the industry is trying to do with responses, like what we've done with responses API, stateful APIs.

And so obviously we were the first one to launch the Responses API, but a couple of other people have kind of adopted it. I think Groq has it in their API.

I think I saw LMSys just did something, you're sleeping walls, but not everyone.

And so unfortunately, I don't have a great answer today of yes or no, but we are kind of assessing everything and trying to see, hey, there has been a lot of value with MCP, hopefully with our commerce protocol as well.

ACP, yeah, I definitely did not forget the name.

And so even thinking about what we want to do with agents, with the agent workflow, the portability story around that, as well as the portability, I'd say even of responses API would be great if that could be a standard or something and developers don't need to build three different stateful API integrations if they want to use different models.

Yeah, and I think that's one of the-- so it's not exactly a protocol, but one of the things that we launched today with Evals too is the ability to use third-party models as well and kind of bring that into one place.

And so I think definitely kind of see where the ecosystem is at, which is using multi-models and kind of having-- Third-party models as in non-OpenAI models?

Yeah, it'll work with Evals starting today.

OK, got it.

We have a really cool setup with OpenRouter, where we're working with them.

And then you can bring your OpenRouter setup.

And then with that, you can actually-- you write your evals using our datasets tool, or use our datasets tool to create a bunch of evals.

And you'd actually be able to hit a bunch of different model layers, take your pick from wherever, even open-source ones on Together, and see the results in our product.
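The idea of writing evals once and running them against several providers can be sketched as one dataset scored against interchangeable model callables. This is a plain-Python illustration under stated assumptions, not the Evals product or the OpenRouter API: the `Model` type, the stub models, and the exact-match metric are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: each "model" is just a function from prompt to completion.
# In practice this would wrap a call to whichever provider you use.
Model = Callable[[str], str]
Dataset = List[Tuple[str, str]]  # (prompt, expected answer) pairs


def exact_match_accuracy(model: Model, dataset: Dataset) -> float:
    """Fraction of prompts where the model's answer equals the expected one."""
    correct = sum(1 for prompt, expected in dataset if model(prompt).strip() == expected)
    return correct / len(dataset)


def compare_models(models: Dict[str, Model], dataset: Dataset) -> Dict[str, float]:
    """Run the same dataset against every model and collect the scores."""
    return {name: exact_match_accuracy(model, dataset) for name, model in models.items()}


# Stub models standing in for real providers.
def stub_model_a(prompt: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(prompt, "")


def stub_model_b(prompt: str) -> str:
    return {"2+2": "4"}.get(prompt, "unknown")


dataset = [("2+2", "4"), ("capital of France", "Paris")]
scores = compare_models({"model_a": stub_model_a, "model_b": stub_model_b}, dataset)
```

Because the dataset and grader stay fixed while only the callable changes, differences in `scores` reflect the models rather than the test setup.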

Yeah, that's awesome.

Speaking more about evals, I think I saw somewhere in the release docs that you basically had to expand the Evals product a little bit to allow for agent evals.

Maybe you can talk about what you had to do there.

Yeah.

I have an answer.

Yeah, I was going to say, so I actually think agent evals is still a work in progress.

So I think we've made maybe 10% of the progress that we need here.

For example, I think we could still do a lot more around multimodal evals.

But the main progress that we made this time was allowing you to take traces.

So the Agents SDK has a really nice traces feature, where if you define things, you can have a really long trace, allowing you to use that in the Evals product and be able to grade it in some way, shape, or form over the entirety of what it's supposed to be doing.

I think this is step one.

I think it's good to be able to do this.

But I think our roadmap from here on out is to really allow you to break down the different parts of the trace and allow you to eval and measure each of those and optimize each of those as well.

A lot of times, this will involve human in the loop as well, which is why we have the human in the loop component here too.

But if you look at our Evals product over the last year, it's been very simple.

It's been much more geared towards this simple prompt completion setup.

But obviously, as we see people doing these longer agentic traces, how do you even evaluate a 20 minute task correctly?

And it's a really hard problem.

We're trying to set up our Evals product and move it that way to help you not only evaluate the overall trajectory, but also individual parts of it.
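The distinction drawn here, grading a whole trace versus grading its individual parts, can be made concrete with a small sketch. It assumes a trace is simply an ordered list of named steps; the `Span` class and both grading functions are hypothetical and do not reflect the Agents SDK trace format or the Evals product.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Span:
    """Hypothetical single step in an agent trace."""
    name: str    # e.g. "classify", "search", "answer"
    output: str


def grade_trace(trace: List[Span], final_grader: Callable[[str], bool]) -> bool:
    """Whole-trace grading: judge only the final output of the run."""
    return final_grader(trace[-1].output)


def grade_steps(trace: List[Span], step_graders: Dict[str, Callable[[str], bool]]) -> Dict[str, bool]:
    """Per-step grading: apply the grader registered for each step name, if any."""
    return {
        span.name: step_graders[span.name](span.output)
        for span in trace
        if span.name in step_graders
    }


trace = [
    Span("classify", "billing"),
    Span("search", ""),  # retrieval returned nothing
    Span("answer", "Your refund is on its way."),
]

overall = grade_trace(trace, lambda out: "refund" in out)
per_step = grade_steps(
    trace,
    {
        "classify": lambda out: out == "billing",
        "search": lambda out: len(out) > 0,
    },
)
```

In this toy run the overall grade passes while the search step fails, which is the kind of hidden problem that per-step measurement is meant to surface.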

Yeah.

I mean, the magic keyword is rubrics, right?

Everyone wants LLM-as-judge rubrics.

Yeah.

Obviously, where does this look of?

OK, great.

The other thing I think online, I see the developer community very excited about is automated prompt optimization, which is kind of evals in the loop with prompts.

What is the thinking there?

Where's things going?

Yeah, so we have automated prompt optimization.

But again, I think that's an area that we definitely want to invest more in.

We, I think, did a pretty big launch of this when we launched GPT-5, actually, because we saw that it was pretty difficult as new models come out to learn all the quirks about a new model.

Yeah, the prompt optimization.

We have a big prompting guide for every model that we launch.

And I think building out a system to make that a lot easier, we definitely want to tie that in completely with Evals.

We should be able to kind of improve your prompts over time, improve your agents over time as well if they're kind of made in the Agent Builder based on the evals that you've set up.

And so I think we see this as a pretty core part of the platform of basically suggested improvements to the things that you're building.
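Tying prompt improvement to evals amounts to a search loop: propose prompt variants, score each against the eval, keep the best. Below is a minimal greedy hill-climbing sketch under stated assumptions. It is not OpenAI's prompt optimizer and not GEPA; the `propose` and `score` functions are toy stand-ins (a real `score` would run the prompt through a model and an eval set).

```python
from typing import Callable, List, Tuple


def hill_climb(
    seed_prompt: str,
    propose: Callable[[str], List[str]],
    score: Callable[[str], float],
    rounds: int = 3,
) -> Tuple[str, float]:
    """Greedy search: each round, keep any candidate that beats the current best."""
    best, best_score = seed_prompt, score(seed_prompt)
    for _ in range(rounds):
        for candidate in propose(best):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
    return best, best_score


# Toy eval: reward prompts that contain instructions the graders care about.
REQUIRED_PHRASES = ("cite sources", "be concise")
SUFFIXES = ("Cite sources.", "Be concise.")


def toy_score(prompt: str) -> float:
    return float(sum(phrase in prompt.lower() for phrase in REQUIRED_PHRASES))


def toy_propose(prompt: str) -> List[str]:
    # Candidate edits: append one extra instruction to the current prompt.
    return [f"{prompt} {suffix}" for suffix in SUFFIXES]


best_prompt, best_score = hill_climb("Answer the question.", toy_propose, toy_score)
```

No model weights change at any point, only the prompt text, which is why this kind of loop is attractive when serving many fine-tuned snapshots is costly.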

I actually think it's a really cool time right now in prompt optimization.

I'm sure you guys are seeing this too.

Not only are there a lot of products kind of around this, so kind of what we're thinking about.

But I also think there's a lot of interesting research around this, like GEPA with-- the Databricks folks are actually doing really cool stuff around this.

We're obviously not doing any of the cool GEPA optimization right now in our product.

But we'd love to do that soon.

And also, it's just an active research area.

So whatever Matei and the Databricks folks might think about next, what we might think about internally as well, whatever new prompt optimization techniques come out, I think we'd love to be able to have that in our product as well.

And it's interesting because it's coming at a time when people are realizing that prompt.

I feel like two years ago, people were like, oh, at some point, prompting's going to be dead.

No.

And it's like-- It's gone up.

Yeah.

And if anything, it has become more and more entrenched.

And I think that there's this interesting trend where it's becoming more and more important.

And then there's also interesting, cool work being done to further entrench prompt optimization.

And so that's why I just think it's a very fascinating area to follow right now.

And also, it was an area where I think a lot of us were wrong two years ago because if anything, it's only gotten more important.

Yeah, I would say what-- Shenyu used to work at OpenAI, now at MSL.

We call this zero gradient fine tuning or zero gradient updating because you're just tweaking the prompts.

But it is so much prompt that it's actually-- you end up with a different model at the end of it.

There's a lot of things that make it more practical, too, just even from our perspective.

We have a fine tuning API.

And it is extremely difficult for us to run and serve all of these different snapshots.

LoRA's great. Thinking Machines Lab just published-- John Schulman just had a cool blog post about this.

But man, it is pretty difficult for us to manage all of these different snapshots.

And so if there is a way to hill climb and do this zero gradient optimization via prompts, yeah, I'm all for it.

And I think developers should be all for it because you get all these gains without having the fancy fine tuning work.

Since you are part of the API-- you lead the API team, and since you mentioned Thinky, I got to throw a cheeky one in there.

What do you think about the Tinker API?

So yeah, that's a good one.

So it's actually funny.

When it launched, I actually DMed John Schulman.

I was like-- Really?

--wow, you finally launched it.

So the-- Because you used to work with him.

Yeah.

Yeah, so we-- it's actually funny.

So at-- yeah, so right when I joined OpenAI, this has actually been, I think, a passion project of John's.

He's been talking about doing something in this shape for a while, which is a truly low level research fine tuning library.

And so we actually talked about it quite a bit when he was at OpenAI as well.

It's actually funny.

I talked to one of my friends who said that when he was at Anthropic, he also worked on the idea for a bit.

And now-- He's a man on a mission.

Yeah, I mean, John's so great in this regard.

He's so purely just interested in the impact of this, because it's-- one, it's a really cool problem.

And then two, it also empowers builders and researchers.

You saw all the researchers who express all this love for Tinker, because it is a really great product.

And so I'm just really happy to see that they shipped it.

And I think he was really happy to get it out there in the world as well.

Yeah, this is probably-- this is very much a digression.

But it's weird-- someone passionate about API design, that it took this long to find a good fine tuning API abstraction, which is effectively all he wanted.

He was like, guys, I don't want to worry about all the infra.

I'm a researcher.

I just want these four functions.

And it's kind of interesting.

Yeah, yeah.

Cool.

Before the OpenAI Coms team barges in the room-- I know.

So what feedback do you want from people and the agent builder?

For example, the thing I was surprised by was the if/else blocks not being natural language and using the Common Expression Language.

I'm sure that's something already on your roadmap.

What are other things where you're kind of at a fork that you would love more input on?

I think one of the things that we spent a lot of time discussing was whether we want more of the deterministic workflows or more LLM-driven workflows.

And so I think getting feedback on that-- honestly, having people model existing workflow-- a lot of what we did was work with our team on-- especially with engineers who are working with customers, modeling the workflows that already exist in the agent builder and what gaps exist.

What types of nodes are really common?

And how can we add those in?

I think that would be the most helpful feedback to get back.

And then as we expand from just chat-based-- right now, the initial deployment for Agent Builder is through ChatKit. We plan on releasing more standalone workflow runs as well, and the types of tasks that people would like to use in that type of API.

So more modalities, for example.

I mean, I think for sure more modalities-- I think voice is already something that a lot of people have talked to us about even today at Dev Day.

So I think modalities for sure, but also more like the logical nodes of what can't be expressed today.

Yeah.

Well, you're building a language, right?

You have Common Expression Language, which I had never heard of prior to this.

I thought, is this Python?

Is this JavaScript?

And then there was a whole link in there.

Was that a big decision for you guys?

I think that was more just kind of a way that we thought we could kind of represent a mix of variables and conditional statements.
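As a rough illustration of what "a mix of variables and conditional statements" can look like, here is a minimal sketch. The CEL-style expression is shown only as a string for comparison, and the variable names (`category`, `amount`) and branch names are hypothetical, not taken from Agent Builder.

```python
# Hypothetical CEL-style condition for an if/else node (illustrative only):
CEL_CONDITION = 'state.category == "refund" && state.amount > 100'


def route(state: dict) -> str:
    """Python equivalent of the condition above; returns a branch name."""
    if state.get("category") == "refund" and state.get("amount", 0) > 100:
        return "escalate"
    return "auto_reply"
```

The point of an expression language like this is that the branch is deterministic: the same state always takes the same path, unlike a natural-language condition evaluated by a model.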

The other thing I'll also mention is that once you-- so there's a trope in developer tooling where anything that can store state will eventually be used as a database, including DNS.

So be prepared for your state store to become a database. I don't know if there are any limits on that, because people will be using it.

It's actually funny.

I'd heard this quote before.

And there's definitely some truth to it.

I don't know if our stateful APIs have become a database quite yet, I guess.

But who knows?

I mean, conversations-- Well, you charge for it.

You charge for a system-- Storage, yeah.

The storage.

So there's some limit on that.

Yeah, but it's very cheap.

I remember when we priced it.

I think if you wanted to kind of dump all your data somewhere, I don't know.

This is the most transforming it all into this shape.

It's useful.

It's easy.

That's the place we want to buy it.

But also, please don't do this because I think it'll put quite a bit of strain on Venkat and our infra team and what we try and do.

How do you think about the MCP side?

So you have OpenAI first party connectors.

You have third-party preferred servers, I guess you would call them.

And then you have open-ended ones.

Do you see that part of registry-like functionality expanding?

Or do you see most of it being user-driven?

OAuth is the biggest thing.

If you add Gmail and Calendar and Drive, you have to auth each of them separately.

There's not a canonical-- what's the thinking there?

Yeah, but I think definitely for the registry, that's why we want to make it a lot easier for companies to manage what their developers have access to, managing the configurations around it.

And I think in terms of first party versus third party, we want to support both of those.

We have some direct integrations.

And then anyone can create MCP servers.

I think we want to make that a lot easier to establish private links for companies to use those internally.

So I think just really excited about that ecosystem growing.

Yeah, I think one of the coolest things I've observed, too, is that we as an industry are still trying to figure out the ideal shape of connectors.

So I mean, part of why I think the 1P connectors exist, too, is that we end up storing quite a bit of state.

It's a lot of work for us.

But by having a lot of state on our side-- we call them sync connectors-- we can actually end up doing a lot more creative stuff on our side when you're chatting with ChatGPT and using these connectors to boost the quality of how you're using it.

If you have all the data there, you can do all this re-ranking.

You can do-- we can put it in a vector store if you want to put it anywhere else.

And so there's some inherent trade-offs here where you put in a lot of work to get these 1P connectors working.

But because you have the data, you can do a lot more and get higher quality.

But then the question is, oh my god, there's such a long tail of other things, which is where the MCP and the third party connectors come in.

But then you have the trade-off that you're beholden to the API shape of the MCP creator.

It might actually work well.

It might not work well with the models.

And then what happens if it doesn't work well?

Then you kind of have to-- you're kind of at the mercy of this.

And MCP, by the way, is really great because it already does some layer of standardization.

But my sense is there's still going to be more evolving here.

And I think we want to support both of them because we see value in both right now, especially working with developers.

We want to have all options on the table here.

But it will be interesting to see how this evolves over time.

Yeah, when I saw, about three or four months ago, that you launched the form for Sign in with ChatGPT interest, I think to me that's kind of like the vision, where I log in and I have the MCPs tied in.

And then I sign in with ChatGPT somewhere.

And I can run these workflows in that app where I'm logging in.

So yeah, I think Sam said in an interview that he sees ChatGPT as your personal assistant.

So I think this is a great step in that direction.

Yeah, I think there's a lot more to go in that direction.

But so far, no plan on OpenAI as an IdP, which is a different role in the auth ecosystem.

Yeah, it's interesting because-- so the direct answer is no plans right now, of course.

But I actually think we currently have some version of this, which is our partnership with Apple.

Because with Apple, you can actually sign in to your ChatGPT account.

And some of that identity does carry with you into your iOS experience with Siri.

Like if you-- I don't know if you've actually used the Siri integration.

I actually use it quite a bit.

But if you sign into your ChatGPT account, the Siri integration will actually use your subscription status to decide what type of model to use when it passes things over to ChatGPT.

And so if you're just a free user, you get the free model.

But if you're a Plus or a Pro subscriber, you get routed to GPT-5, which is I think what they-- I think we also recently announced the partnership with Kakao.

Oh, yeah, Kakao's another one.

Yeah, where I think it's a similar thing, where you can sign in with ChatGPT.

Kakao's one of the largest messenger apps in Korea and kind of interact with Kakao directly there.

Yeah, I mean, Sam's been talking about it for a while.

It's a very compelling vision.

We obviously want to be very thoughtful with it and how we do it.

Now you have a social network.

You have a developer platform.

At the beginning of the social network.

Very, very valuable.

Yeah, exactly.

OK, so and then on the other side of auth, something I was really interested to look at.

And I couldn't get a straight answer.

Is there some form of bring your own key for AgentKit?

When I expose it to the wider world, obviously, by default, I'm paying for all the inference.

But it'd be nice for that to have a limit.

And then if you want more, you can bring your own key.

Yeah, I mean, we don't have something like that yet.

But I think, yeah, it's definitely an interesting area too.

Yeah, it doesn't do it out of the box today.

But developers have been asking about it for forever.

It's a really cool concept.

Because then as a developer, especially indie developer, you don't need to bear the burden of inference.
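The pattern described in the question (the developer pays up to a limit, then the user brings their own key) could be sketched as below. This is purely hypothetical: per the answer above, AgentKit does not offer this out of the box, and every name here (`KeyChoice`, `choose_key`, `free_limit`) is an illustrative assumption rather than a real API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KeyChoice:
    api_key: str
    billed_to: str  # "developer" or "user"


def choose_key(
    tokens_used: int,
    free_limit: int,
    developer_key: str,
    user_key: Optional[str] = None,
) -> KeyChoice:
    """Pick whose key pays for the next request (hypothetical policy)."""
    # Within the free allowance, the developer bears the inference cost.
    if tokens_used < free_limit:
        return KeyChoice(developer_key, "developer")
    # Past the allowance, fall back to a key the user supplied, if any.
    if user_key:
        return KeyChoice(user_key, "user")
    raise PermissionError("free allowance exhausted; a user-supplied key is required")
```

In a real deployment the key selection would have to happen server-side, since handling pasted keys in the client is exactly the kind of security hazard raised later in this exchange.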

Yeah, I think when you get into the business of Agent Builders that are publicly exposed where you have an allow list of domains, it rhymes with this exact pattern of someone has to bear the cost.

Sometimes you want to mess around with the different levels of responsibility.

Yeah, I will say in general, if you look at our roadmap, we engage a lot with developers.

We hear what are the pain points.

And we try and build things that address it.

And ideally, we're prioritizing in a way that's helpful.

But yeah, we've definitely heard from a good number of developers that the cost is.

Or all of the copy paste your key solutions right now, which are huge security hazards.

Because developers don't want to bear the burden of inference.

Hopefully, we make the cost cheaper.

The models keep getting cheaper.

Yeah, yeah, so hopefully that helps.

But what we realize is as we make it cheaper, the demand for it goes up even more.

And you end up still spending quite a bit.

But yeah, so we definitely heard this from a lot of developers.

And it's definitely something top of mind.

Do you see this as mostly like an internal tools platform, though?

To me, you've been doing a big push on the more forward deployed engineering things.

It's almost like, hey, we needed to build this for ourselves as we sell into these enterprises.

Might as well open it up to everybody.

What drives building these tools?

Do you think of people building tools that are then exposed, or mostly on the internal side?

Yeah, so I think, again, our first deployment is ChatKit, which is intended to be for external users.

But I think one of the things that we also did see a lot as we were working with customers is that a lot of companies have actually built some version of an agent builder internally to manage prompts internally, to manage templates that they're sharing across the different developers that they have, maybe the different product areas.

And we were seeing that over and over again as well, and really wanted to build a platform so that this is not an area that every company needs to invest in and rebuild from scratch, but that they can have a place where they can manage these templates, manage these prompts, and really focus on the parts of agent building that are more unique to their business.

It is interesting, too.

From a deployment perspective, it has spanned both internal and external use cases.

Kind of like these internal platforms, people will use it for data processing or something, which is an internal use case.

But if you saw some of the demos today, there have been a huge number of companies that are trying to do this for external-facing use cases as well.

Customer service is one template in here.

Customer service, the Ramp use case.

We use this internally and externally.

Our customer support at help.openai.com is already powered by AgentKit, and then various other internal use cases as well.

And one of the things that I actually think the team has done a really great job of-- so like Tyler, David, and Juwan on the team-- especially the ChatKit components, they built it to be very consumer-grade and very polished.

You kind of look at that-- there's a whole grid of the different widgets and things that you could create there.

Ideally, people see it as these very polished, consumer-grade ready, external-facing things versus-- I think of internal tools and the UI as always the last thing that people care about.

But you really pushed the team, and I think they did a really great job of making the ChatKit experience really, really consumer-grade.

And it should feel almost like ChatGPT, with really buttery-smooth animations and really responsive designs and all of that.

Yeah, I think your point on widgets definitely resonates because ChatKit-- it handles the chat UX, but we're also just building really visual ways for you to represent every action that you want to take.

And that is definitely very highly polished.

Yeah, and when working with customers, those have been the most helpful customers for us to work with.

Because when Ramp is thinking about what they want to publicly present to people, they have a pretty high bar, as they should, as well as all the other customers that have been iterating on it.

And so that kind of feedback from our customers has really helped us up level the general product quality of the launch that we had today as well.

Yeah.

Would you ever-- would you open source ChatKit?

Talked about it.

We've talked about it.

There are a bunch of trade-offs.

So ChatKit itself is like an embeddable iframe.

And so I think the actual-- It's an iframe.

Yeah, for-- And so that helps us keep it like evergreen, right?

So if you are using ChatKit and we come up with-- I don't know-- a new model that reasons in a different way or a new kind of modality, you don't actually need to rebuild and pull in new components to use it in the front end.

I think there are parts of widgets, for example, that are much more like a language, and it's definitely something that is easier to explore that for, as well as the design system that we've built for ChatKit.

But I think as part of-- yeah, the actual iframe itself, I think there's a lot of value in that being a more evergreen experience that is pretty opinionated.

Like there'd be no point in being open source.

You want the-- Then you don't get the benefits of it.

Being Stripe alums, Stripe Checkout, it's all optimized for you to-- So I'm not a Stripe alum, but Christina is.

The team, actually, is the team that built-- Stripe Checkout?

Yeah, so it's very similar philosophically, right?

So Stripe can build Elements and Checkout, and not every business needs to rebuild the pieces that are really common.

And I think we see the same with chat.

We see chat being built over and over again, especially as we kind of come up with new modalities, like reasoning, everything.

It's not really something that is easy to keep up to date.

And so we should just do that and leave kind of the hard parts of building agents again to the developers.

Does it feel-- I mean, I know WordPress has a bad connotation in a lot of circles, but to me, it almost feels like the WordPress equivalent for chat: hey, this is a drop-in thing, and then you have all these different widgets.

Do you see the widget becoming a big kind of like developer ecosystem, where people share a widget?

Is that kind of like a first party thing?

And then what's like the MCP versus widget forest?

No, exactly.

I mean, it's kind of like-- it seems great for people that are like in between being technical and not really being technical enough.

Yeah, I mean, I think that's a big part of building widgets.

It's already kind of in the language that is very consumer-friendly.

You can use-- in our widget builder already, you can kind of use AI to create those widgets, and they look pretty good.

I don't know if you guys have gotten a chance to try that out yet, but definitely see kind of-- I don't know, a forest or something.

If you haven't tried out the widget studio and the demo apps as well, yeah.

You got a custom domain, like widget.studio.

It's cool.

Actually, don't know how we got that, but-- Yeah, everything's in chatkit.studio, and then we have the playground there, so you can try out what ChatKit would look like with all the customizations.

We have chatkit.world, which is a fun site we built.

I was spinning the globe for a while this morning.

It was like a widget spinner.

Kasia also uploaded some of her solar system stuff and all the demos as well.

And then that's where the widget builder-- Yeah, so it's really come together.

It's taken almost more than a year to come together and build all this stuff, but it's coming together.

It's really interesting.

Yeah, yeah, it's something that we-- You definitely planned all of this up front.

Oh, yeah, yeah.

We have the master plan from three years ago.

No, but I think-- especially on this stuff, I think there was an arc of a general platform that we did want to build around.

And it takes a while to build these things.

Obviously, Codex helps speed it up quite a bit now.

But yeah, I will say it does seem great to start to have all the pieces start fitting together.

I mean, we launched evals, and we've had the fine-tuning API for a while, and we laid all the groundwork for some of this stuff over the last year.

And we're hoping that we can eventually make it into this full-featured platform that's helpful for people.

I think you have.

Since you did the Codex mention, maybe a quick tip from each of you on Codex power user tools or tips.

So there's actually a funny one that one of the new grads has, I think, taught our team in general.

And I think this is a point for just how new grads and younger generation people are actually more AI native.

So one of them is to really lean in to push yourself to trust the model to do more and more.

So I feel like the way that I was using Codex-- and so for me, it's usually for my personal projects.

They don't let me touch the code anymore.

But you give it small tasks.

So you're not really trusting it.

I view it as this intern that I really don't trust.

But what a lot of the-- so we had an intern class this year.

But what a lot of the interns would do is just full YOLO mode, trust it to write the whole feature.

And it doesn't work.

It doesn't work for worse.

It doesn't work sometimes.

But I don't know, 30%, 40% of the time, it just one-shots it.

I actually haven't tried this with GPT-5-Codex.

I bet it probably one-shots it even more.

But one tip that I'm starting to-- I feel like I have to undo this, relearn things here-- is to really lean into the AGI component of it and just really let the model rip and trust it.

Because a lot of times, they actually do stuff that surprises me.

And then I have to readjust my priors.

Whereas before, I feel like I was in this safe space of, I'm just treating this-- I'm giving this thing a tiny bit of rope.

And because of that, I was limiting myself with how effective I could be.

Sure, but OK.

But also, is there an etiquette around submitting effectively vibe-coded PRs that someone else has to review?

And it's like, it can be offensive.

We have Codex to review now.

It actually reviews itself.

Because Codex approves its own PRs a lot more than humans.

It doesn't get approved, then.

I was going to say, I think the Codex PR reviews are actually one of the things that my team very much relies on.

I think they're very, very high-quality reviews.

On the Codex PR side, for the visual Agent Builder, we only started that probably less than two months ago.

And that wouldn't be possible without Codex.

So I think there's definitely a lot of use of Codex internally, and it keeps getting better and better.

And so yeah, I think people are just finding they can rely on it more and more.

And it's not totally vibe-coded.

It's still checked and edited.

But definitely as a kicking-off point-- and I think I've heard of people on my team who, on their way to work, are kicking off five Codex tasks because the bus takes 30 minutes.

And you get to the office, and it kind of helps you orient yourself for the day.

You're like, OK, now I know the files.

I have the rough sense.

Maybe I don't even take that PR, and I actually just still code it, but it helps you just context switch so much faster, too, and be able to orient yourself in a code base.

There are so many meetings nowadays where I have one-on-ones with engineers.

And I walk into the room.

They're like, wait, wait, wait.

Give me a second.

I got to kick off my Codex thing.

I'm like, oh, sorry.

We're about to enter async zone.

It's almost like our notes.

You're like, let me-- And they're like, OK, now we can start our one-on-one because now it's a prank.

Yeah.

Cool.

We're almost out of time.

I wanted to leave a little bit of time for you to shout out the Service Health dashboard because I know you're passionate about it.

Well, tell people what it is and why it matters.

Yeah, so this is a launch that we actually didn't-- it didn't get any stage time today, but it was actually something I'm really excited about.

So we launched this thing called the Service Health dashboard.

You can now go into your usage or your settings account and see the health of your integration with our OpenAI API.

And so this is scoped to your own org.

So basically, if you have an integration that's running with us doing a bunch of tokens per minute or a bunch of queries, it's now tracking each of those responses, looking at your token velocity, the TPM that you're getting, the throughput, as well as the response codes.

And so you can see kind of like a real-time personal SLO for your integration.
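The kind of per-org summary described here, a success rate over response codes plus token throughput, can be sketched roughly as follows. This is not how the Service Health dashboard is implemented; the record shape, the function name, and the choice to count only 2xx responses as successes are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ResponseRecord:
    status_code: int  # HTTP status of one API response
    tokens: int       # tokens processed by that response


def summarize(records: List[ResponseRecord], window_minutes: float) -> dict:
    """Compute a success rate and tokens-per-minute over one time window."""
    total = len(records)
    # Assumption: only 2xx responses count as successful.
    ok = sum(1 for r in records if 200 <= r.status_code < 300)
    success_rate: Optional[float] = ok / total if total else None
    tpm = sum(r.tokens for r in records) / window_minutes
    return {"success_rate": success_rate, "tokens_per_minute": tpm}
```

Whether rate-limit responses (429) should count against an SLO is a policy choice; this sketch counts them as failures only because they are not 2xx.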

The reason why I care a lot about this is, obviously, over the last year, we've spent a lot of time thinking about reliability.

We had that really bad outage last December, longest three, four hours of my life, and then had to talk to a bunch of customers.

We haven't had one that bad since, knock on wood.

We've done a bunch of work.

We have an infra team led by Venkat, and they've been working with Johnna on our team.

And they've just been doing so much good work to get reliability better.

And so we actually-- again, knock on wood-- we think we've got reliability in the spot where we're comfortable kind of putting this out there and kind of letting people actually see their SLO.

And hopefully, it's three, four, soon to be five nines.

But the reason why I care a lot about it is because we spent so much time on it, and we feel confident enough to have it behind a product now.

Five nines is like two minutes of outage or something.

Yeah, yeah.

We're working to get to five nines.

What does an extra nine take?

It's exponentially more work.

But in the last couple of days, you were talking about hitting three nines, and then hitting three and a half nines, and then hitting four nines.

But yeah, it's exponentially more work.
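For reference, the downtime budget behind each "nine" is simple arithmetic: N nines of availability allows a fraction 10^-N of the period to be down. A small sketch, where the function name is illustrative and a 365-day year is assumed:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_budget_minutes(nines: float, period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Allowed downtime for an availability of `nines` nines (e.g. 3 -> 99.9%)."""
    unavailable_fraction = 10 ** (-nines)
    return period_minutes * unavailable_fraction


if __name__ == "__main__":
    for n in (3, 3.5, 4, 5):
        print(f"{n} nines: {downtime_budget_minutes(n):.2f} minutes per year")
```

Under these assumptions, three nines allows roughly 526 minutes a year, four nines roughly 53, and five nines a little over 5, so each extra nine shrinks the budget by a factor of ten.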

I could go for a while on the different topics.

But-- We'll have to do that in a follow-up.

I mean, that's the engineering side, right?

Yes, yes, yes.

You're serving six billion tokens per minute.

We actually zoomed past that.

Yeah, that's the-- That's outdated.

Yeah, but yeah, it's been crazy, the growth that we've seen.

Awesome.

I know we're out of time.

It's been a long day for both of you, so we'll let you go.

But thank you both for joining us.

Yeah.

Yeah.

Thanks for having us.

Thanks.

Thank you.

That's it.

[APPLAUSE] How was that?

That was great.

OK.

We had the mic off.

What I didn't want to say on the podcast was on the Tinker thing.

So we actually-- [MUSIC PLAYING]