Latent Space · 2026-04-07

OpenAI's Extreme Harness Engineering for Autonomous Coding Agents

Hosts: Alessio Fanelli, Swyx

Guests: Ryan Lopopolo

harness engineering, autonomous coding agents, OpenAI Frontier, Codex, Symphony orchestration, enterprise AI deployment, agent skills, software development lifecycle automation

Why it matters

OpenAI Frontier team built a 1M+ LOC internal product with zero human-written code, relying entirely on Codex-powered agents.

Key claims

OpenAI Frontier team built a 1M+ LOC internal product with zero human-written code, relying entirely on Codex-powered agents.
The harness engineering approach focuses on modular skills, observability, and continuous feedback to automate the software development lifecycle.
Symphony, an Elixir-based orchestration system, manages multi-agent workflows to minimize human synchronous involvement and improve scalability.
Frontier is an enterprise platform enabling safe, governed deployment of AI agents integrated with existing company infrastructure and security tooling.

Episode summary

In this episode of Latent Space, Ryan Lopopolo from OpenAI Frontier discusses the development of a fully autonomous coding harness that produces over a million lines of code with zero human-written code or review in the loop. The team leveraged Codex models and a systems-thinking approach to build an internal product with rapid iteration cycles, focusing on modularity, observability, and automation to drastically increase engineering productivity. They emphasize the importance of scaffolding, agent skills, and continuous feedback loops to improve agent behavior and reduce human bottlenecks in the software development lifecycle.

Ryan also introduces Symphony, an orchestration system built on Elixir and the BEAM VM, designed to manage multi-agent workflows and remove humans from synchronous loops. The Frontier platform aims to enable enterprises to deploy safe, observable, and customizable AI agents integrated with their native stacks and governance requirements. The conversation highlights the evolving role of AI in software engineering, the shift towards AI-native codebases optimized for agent legibility, and the challenges and opportunities in scaling AI-driven development within large organizations.

Agents autonomously handle code, tests, CI, documentation, reviews, and release tooling, with humans primarily involved in final release approval.
The team uses strict architectural boundaries and reusable primitives to align agent output and reduce human review overhead.
Continuous distillation of team knowledge and agent behavior from logs and PR comments improves agent performance and process alignment.
The approach anticipates future model improvements to handle higher complexity tasks, with current limitations in zero-to-one product ideation and large-scale refactoring.

Source material

Transcript

I do think that there is an interesting space to explore here with Codex, the harness, as part of building AI products, right?

There's a ton of momentum around getting the models to be good at coding.

We've seen big leaps in, like, the task complexity with each incremental model release. If you can figure out how to collapse a product that you're trying to build, or a user journey that you're trying to solve, into code, it's pretty natural to use the Codex harness to solve that problem for you.

It's done all the wiring and lets you just communicate in prompts to let the model cook.

You have to step back, right?

Like, you need to take a systems thinking mindset to things and constantly be asking, where is the agent making mistakes?

Where am I spending my time?

How can I not spend that time going forward?

And then build confidence in the automation that I'm putting in place so I have solved this part of the SDLC.

All right, we're in the studio with Ryan Lopopolo from OpenAI.

Welcome.

Hi, thanks for visiting San Francisco and thanks for spending some time with us.

Yeah, thank you.

I'm super excited to be here.

You wrote a blockbuster article on harness engineering.

It's probably going to be the defining piece of this emerging discipline.

Thank you.

It has been fun to feel like we've defined the discourse in some sense.

Let's contextualize a little bit.

This is the first podcast you've ever done?

Yes, and thank you for spending the time with us.

Where is this coming from?

What team are you in?

All that jazz?

Sure, sure.

I work on Frontier Product Exploration, new product development in the space of OpenAI Frontier, which is our enterprise platform for deploying agents safely at scale, with good governance, in any business.

And the role of the team has been to figure out novel ways to deploy our models into packaged products that we can sell as solutions to enterprises.

And you have the background.

I'll just squeeze it in there.

Snowflake, Brex, Stripe, Citadel.

Yes, yes.

The only exciting kind of customer entire life.

Yes.

The exact kind of customer that you went to.

So I'll say I was actually, I didn't expect the background when I looked at your Twitter.

I'm seeing the office there.

Stuff like this.

So you've got the mindset of like full send AI coding.

Stuff about slop, like buckling in, your laptop in your Waymos.

And then I look at your profile and I'm like, oh, you're just like, you're correct in the other run.

Perfect.

My.

It's quite fun to be an AI maximalist. If you're going to live that persona, OpenAI is the place to do it.

And it's sort of token this way to see.

Yeah, certainly helps that we have no rate limits internally.

And I can go, like you said, full send at this thing.

Yeah, yeah.

So, OpenAI Frontier, and your special team within OpenAI Frontier.

We had been given some space to cook, which has been super, super exciting.

And this is why I started with kind of an out-there constraint: to not write any of the code myself.

I was figuring, if we're trying to make agents that can be deployed into enterprises, they should be able to do all the things that I do.

And having worked with these coding models and coding harnesses over six, seven, eight months, I do feel like the models are there enough and the harnesses are there enough that they're isomorphic to me in capability and in the ability to do the job.

So starting with this constraint of I can't write the code meant that the only way I could do my job was to get the agent to do my job.

And like, just a bit of background before that.

This is basically the article.

So what you guys did is five months of working on an internal tool.

Zero lines of human-written code, over a million lines of code in the total codebase.

You say it was 10x faster than it would have been if you had done it by hand.

So yeah, that was the mindset going into this, right?

I started with some of the very first versions of Codex CLI with the Codex mini model, which was obviously much less capable than the ones we have today, which was also a very good constraint, right?

It's quite a visceral feeling to ask the model to build you a product feature and have it just not be able to assemble the pieces together, which kind of defined one of the mindsets we had going into this: whenever the model just cannot, you always pop open the task, double-click into it, and build smaller building blocks that you can then reassemble into the broader objective.

And it was quite painful to do this, honestly. The first month and a half was 10 times slower than I would be.

But because we paid that cost, we ended up getting to something much more productive than any one engineer could be, because we built the tools, the assembly station, for the agent to do the whole thing.

But yeah, so onward to GPT-5, 5.1, 5.2, 5.3, 5.4. Going through all these model generations and seeing their quirks and different working styles also meant we had to adapt the codebase to change things up when the model was revved.

One interesting thing here is that with 5.2, the Codex harness at the time did not have background shells in it, which means we were able to rely on blocking scripts to perform long-horizon work.

But with 5.3 and background shells, it became less patient, less willing to block, so we had to retool the entire build system to complete in under a minute. This is not a thing I would expect to be able to do in a codebase where people have opinions. But because the only goal was to make the agent productive, over the course of a week we went from a bespoke Makefile build to Bazel to Turbo to Nx, and just left it there because builds were fast at that point.

It's interesting to talk about Turbo to Nx, because that's the opposite direction from what other people have been doing.

Ultimately, I do not have a lot of experience with actual frontend repo architecture.

You're trying to adjust your build system. So I know the Nx team, I know Turbo from Jared Palmer, and I'm like, yeah, that's an interesting.

The hill we were climbing, right, was: let's make it fast.

Are there micro-frontends involved?

So, to help: how complex is it? React, Electron?

It's a simple app sort of thing and must be under a minute that's an interesting limitation.

I'm actually not super familiar with the background shell stuff. It was probably talked about in the 5.3 release.

It basically means that Codex is able to spawn commands in the background and then continue to work while it waits for them to finish. So it can spawn an expensive build and then continue reviewing the code, for example. This helps it be more time-efficient for the user invoking the harness.

And just to really nail this: why does one minute matter? Why not five?

We want the inner loop to be as fast as possible. One minute was just a nice round number, and we were able to hit it.

And if it doesn't complete, it kills it or something?

No, we just take that as a signal that we need to stop what we're doing, double-click, and decompose the build graph a bit to get us back under, so that we can enable the agent to continue to operate.

It's almost like a ratchet, forcing the build-time discipline, because if you don't, it will just grow and grow.

That's right.
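The one-minute build budget described above can be sketched as a small timing wrapper. This is a minimal illustration, not the team's tooling: `BuildReport`, `timed_build`, and the 60-second default are hypothetical names and values chosen for the example. In line with the conversation, the wrapper does not kill a slow build; it only reports that the budget was breached so someone can go decompose the build graph.

```python
import subprocess
import sys
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildReport:
    exit_code: int
    elapsed_seconds: float
    within_budget: bool


def timed_build(cmd, budget_seconds=60.0):
    """Run a build command to completion and compare its wall time to a budget.

    The build is never interrupted. A breach of the budget is treated as a
    signal for follow-up work, not as a failure of this particular run.
    """
    start = time.monotonic()
    completed = subprocess.run(cmd)
    elapsed = time.monotonic() - start
    return BuildReport(
        exit_code=completed.returncode,
        elapsed_seconds=elapsed,
        within_budget=elapsed <= budget_seconds,
    )


def describe(report, budget_seconds=60.0):
    """Turn a report into a short text message an agent or human could act on."""
    if report.within_budget:
        return f"build finished in {report.elapsed_seconds:.1f}s (budget {budget_seconds:.0f}s)"
    return (
        f"build took {report.elapsed_seconds:.1f}s, over the {budget_seconds:.0f}s budget; "
        "decompose the build graph before adding more work"
    )
```

For example, `timed_build([sys.executable, "-c", "pass"])` runs a trivial stand-in for a build and returns a report whose `within_budget` field can gate whatever follow-up the team wants.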

And, like, the software I work on currently is at 12 minutes.

It sucks.

This has been my experience with platform teams in the past where you have an envelope of acceptable build times and you let it go up to breach.

And then you spend two three weeks to bring it back down to the lower end of the envelope and stop.

But because tokens are so cheap and we're so insanely parallel with the model, we can just constantly be gardening this thing to make sure that we maintain these invariants, which means there's way less dispersion in the code and the SDLC, which means we can simplify in a way and rely on a lot more invariants as we write the software.

You mentioned in your article that humans became the bottleneck, right? You kicked off as a team of three people.

You're putting out a million lines of code, like 1,500 PRs. Basically, what's the mindset there? As much as code is disposable, you're doing a lot of review. A lot of the article talks about how you want to rephrase everything as prompting; what the agent can't see is kind of garbage, right? You shouldn't have it in there.

So what's the high level of how you went about building it? And then how did you address, okay, humans are just pure review? How was the human in the loop for this?

We're beyond even the humans reviewing the code as well. Most of the human review is post-merge at this point.

It's like the review that just let's just make ourselves happy you by you.

Fundamentally, the model is trivially parallelizable, right? For as many GPUs and tokens as I am willing to spend, I can have capacity to work on the codebase.

The only fundamentally scarce thing is the synchronous human attention of my team. There are only so many hours in the day, and we have to eat lunch.

I would like to sleep, although it's quite difficult to stop poking the machine, because it makes me want to feed it. You have to step back, right? You need to take a systems-thinking mindset to things and constantly be asking: where is the agent making mistakes? Where am I spending my time? How can I not spend that time going forward? And then build confidence in the automation that I'm putting in place, so I have solved this part of the SDLC.

And usually what that has looked like is that we started out needing to pay very close attention to the code, because the agent did not have the right building blocks to produce modular software that decomposed appropriately, that was reliable and observable, and that actually accrued to a working frontend, and these things, right?

So in order to not spend all of our time sitting in front of a terminal, doing at most one or two things at a time, we invested in giving the model that observability, which is that graph in the post here.

Yeah, let's walk through it. The traces: which existed first?

We started with just the app, and the whole rest of it, from Vector through to all these log and metrics APIs, was, I don't know, half an afternoon of my time. We have intentionally chosen very high-level, fast developer tools. There's a ton of great stuff out there now. We use mise a bunch, which makes it trivial to pull down all these Go-written Victoria stack binaries in our local development.

A tiny little bit of Python glue to spin all these up, and off you go. One neat thing here is that we have tried to invert things as much as possible: instead of setting up an environment to spawn the coding agent into, we spawn the coding agent. That's the entry point, just Codex. And then we give Codex, via skills and scripts, the ability to boot this stack if it chooses to, and tell it how to set some env variables so the app in local dev points at the stack that it has chosen to spin up.
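One way to picture the "Python glue" half of this is a script the agent can call to plan a per-worktree stack and get back the env variables the app should use. This is a sketch under stated assumptions: the service names, the port layout, and the `*_URL` variable names are all hypothetical, and the step that actually launches binaries is deliberately left out.

```python
# Hypothetical service names; the transcript only says logs, metrics and
# traces are available locally, not how they are named or addressed.
DEFAULT_SERVICES = ("logs", "metrics", "traces")


def plan_stack(services=DEFAULT_SERVICES, base_port=19000):
    """Assign one local port per service, starting at base_port.

    Giving each agent worktree its own base_port keeps parallel stacks
    from colliding on the same machine.
    """
    return {name: base_port + offset for offset, name in enumerate(services)}


def env_for(ports, host="127.0.0.1"):
    """Build the env variables that point the app at the planned stack."""
    return {
        f"{name.upper()}_URL": f"http://{host}:{port}"
        for name, port in ports.items()
    }


def as_export_lines(env):
    """Render the env as shell export lines, i.e. plain text an agent can read."""
    return "\n".join(f"export {key}={value}" for key, value in sorted(env.items()))
```

The design point is the inversion described above: nothing here runs before the agent starts. The agent decides whether it needs the stack at all, calls the script, and applies the printed exports to its own dev session.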

And this, I think, is the fundamental difference between reasoning models and the 4.1s and 4os of the past, where these models could not think, so you had to put them in boxes with a predefined set of state transitions.

Whereas here we have the model and the harness be the whole box, and give it a bunch of options for how to proceed, with enough context for it to make intelligent choices.

So a lot of that is around scaffolding, right? With previous agents, you would define a scaffold, and it would operate in that loop and try again.

That's pivoted since we've had reasoning models. They seem to perform better when you don't have a scaffold, right? And you go into niches here too, like your spec.md and having a very short AGENTS.md.

Yes, so you even lay out what is here. I like the table of contents.

Stuff like this really helps guide people, because everyone's trying to do this.

This structure also makes it super cheap to put new content into the repository to steer both the humans and the agents.

So you reinvented skills, right?

One big AGENTS.md, then skills. Skills did not exist when we started doing this.

You have a short, 100-line overall table of contents, and then you have little skills, right? Core beliefs, tech debt tracker.

Yeah, the tech debt tracker and the quality score are pretty interesting, because this is basically a tiny little scaffold, like a markdown table, which is a hook for Codex to review all the business logic that we have defined in the app, assess how it matches all these documented guardrails, and propose follow-up work for itself.

Before Beads and all these ticketing systems, we were just tracking follow-up work as notes in a markdown file, which we could spawn an agent on a cron to burn down.
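The "markdown table as a hook" idea can be sketched as a tiny parser plus an ordering rule for the cron-driven burn-down. Everything specific here is an assumption made for illustration: the column names, the convention that a lower score means worse quality, and the sample rows are not from the team's repository.

```python
SAMPLE_TRACKER = """
| Area | Guardrail | Score | Follow-up |
| --- | --- | --- | --- |
| billing | network calls have timeouts | 2 | add timeout to invoice fetch |
| auth | structured logging | 5 | |
| sync | retries are bounded | 1 | cap retry loop |
"""


def parse_markdown_table(text):
    """Parse a simple pipe-delimited markdown table into a list of dicts."""
    rows = [line.strip() for line in text.splitlines() if line.strip().startswith("|")]
    if len(rows) < 2:
        return []

    def cells(line):
        # Drop the outer pipes, then split on the inner ones.
        return [cell.strip() for cell in line.strip("|").split("|")]

    header = cells(rows[0])
    # rows[1] is the |---| separator line, so data starts at rows[2].
    return [dict(zip(header, cells(line))) for line in rows[2:]]


def burn_down_order(records):
    """Return rows that have follow-up work, lowest quality score first."""
    pending = [row for row in records if row.get("Follow-up")]
    return sorted(pending, key=lambda row: int(row["Score"]))
```

A scheduled job could call `burn_down_order(parse_markdown_table(text))` and hand the first row to an agent as its next task, which keeps the whole backlog as plain text inside the repository.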

There's this really neat thing, that the models fundamentally crave text. So a lot of what we have done here is figure out ways to inject text into the system, right? When we get a page because we're missing a timeout, for example, I can just @ Codex in Slack on that page and say: I'm going to fix this by adding a timeout. Please update our reliability documentation to require that all network calls have timeouts.

So I have not only made a point-in-time fix, but also durably encoded this process knowledge around what good looks like.

And we give that to the root coding agent as it goes and does the thing, but you can also use that to distill tests out of, or a code review agent that is pointed at the same things, to narrow the acceptable universe of the code that's produced.
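As an illustration of distilling a documented rule into a check, here is a minimal sketch of a lint for the "all network calls have timeouts" requirement. It assumes, purely for the example, that the code under review is Python using the `requests` library; the transcript does not say what language or HTTP client the product uses. The error text is written so that it tells the agent what to do next.

```python
import ast

# Function names on the `requests` module that perform a network call.
HTTP_FUNCTIONS = {"get", "post", "put", "patch", "delete", "head", "request"}


def find_calls_missing_timeout(source):
    """Return (line, message) pairs for requests.* calls without timeout=."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_requests_call = (
            isinstance(func, ast.Attribute)
            and isinstance(func.value, ast.Name)
            and func.value.id == "requests"
            and func.attr in HTTP_FUNCTIONS
        )
        if not is_requests_call:
            continue
        if any(keyword.arg == "timeout" for keyword in node.keywords):
            continue
        findings.append((
            node.lineno,
            f"requests.{func.attr}() has no timeout; pass timeout=<seconds> "
            "as required by the reliability docs",
        ))
    # ast.walk does not promise source order, so sort by line number.
    return sorted(findings)
```

This only catches the direct `requests.<function>(...)` form; sessions, aliases, and other clients would need their own rules. The point is the shape: a written requirement becomes a mechanical check whose output is itself more text for the agent.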

I think one of the concerns I have with that kind of stuff is that you think you're making the right call by making it persist for all time across everything.

But then you didn't think about the exceptions that you need to make, right? And then you have to roll it back.

Part of it is also that it's somewhat of a skill, right? So it determines when it uses the tools. It's not like it'll run on every call; it'll determine when it wants to check quality, right?

Yeah, and the prompts we give these agents do allow them to push back. When we first started adding code review agents to the PR, it would be: Codex CLI locally writes the change and pushes up a PR; on those PR synchronizations a review agent fires and posts a comment; and we instruct Codex that it has to at least acknowledge and respond to that feedback.

And initially the Codex driving the code authoring was willing to be bullied by the PR reviewer, which meant you could end up in a situation where things were not converging.

So we had to add more optionality to the prompts on both of these things, right?

The reviewer agents were instructed to bias toward merging the thing, and to not surface anything greater than a P2 in priority.

We didn't really define P2, but we gave it a framework within which to score its output.

And then greater, as in P0 is worst, right?

Yes.

P0 is you will.

Yeah.

But also on the code authoring agent side.

We also gave it the flexibility to either defer or push back against review feedback right.

It happens all the time right.

Like I happen to notice something and leave a code review which could blow up the scope by a factor of two.

I usually don't mean for that to be addressed exactly in the moment.

It's more of an FYI: file it to the backlog, pick it up in the next fix-it week, sort of thing.

And without the context that this is permissible, the coding agents are going to bias toward what they do, which is following instructions.
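The reviewer and author rules described in this exchange can be sketched as two small policies. This is one reading of the conversation, not the team's prompts: the numeric priorities, the cut-off at P2, the rule that only a P0 blocks a merge, and the `agrees` and `expands_scope` inputs are assumptions introduced for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Response(Enum):
    FIX_NOW = "fix_now"
    DEFER = "defer"          # file to the backlog, pick up later
    PUSH_BACK = "push_back"  # author disagrees and says why


@dataclass(frozen=True)
class Finding:
    priority: int            # 0 means P0, the most severe
    summary: str
    expands_scope: bool = False


MAX_SURFACED_PRIORITY = 2    # assumed cut-off: P3 and below are not surfaced


def surfaced(findings):
    """Reviewer side: drop anything less severe than the cut-off."""
    return [f for f in findings if f.priority <= MAX_SURFACED_PRIORITY]


def reviewer_verdict(findings):
    """Bias toward merging: in this sketch only a P0 blocks."""
    blocking = any(f.priority == 0 for f in surfaced(findings))
    return "request_changes" if blocking else "approve"


def author_response(finding, agrees=True):
    """Author side: acknowledge every finding, but not always by fixing it."""
    if not agrees:
        return Response.PUSH_BACK
    if finding.expands_scope and finding.priority > 0:
        return Response.DEFER
    return Response.FIX_NOW
```

Giving the author three explicit responses is what lets the loop converge: a comment that would double the scope of the PR gets deferred instead of being followed to the letter.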

Yeah.

I do want to check in on a couple things right.

Sure.

So, the code review agent: it can merge autonomously.

I think that's something that a lot of people are comfortable with.

And you have a list here of how much the agents do: product code and tests, CI configuration, release tooling, internal tools, documentation, eval harnesses, review comments, scripts that manage your repository, production dashboard definition files. Like, everything.

Yes.

And so they're all just churning at the same time.

Is there, like, a ripcord that any human on the team pulls to stop everything?

Because we are building a native application here, we're not doing continuous deploy.

So there's still a human in the loop for cutting the release branch.

That's it.

We require a blessed, human-approved smoke test of the app before we promote it to distribution, these sorts of things.

So you're building an app.

You're not building, like, infrastructure, where you have nines of reliability and that kind of stuff.

That's correct.

Okay.

And also, full recognition here that all this activity took place in a completely greenfield repository.

There should be no assertion that this applies generally.

But this is a production thing you're going to ship to the customer?

Of course.

Yeah.

So this is real.

And like one of the things there is you mentioned you started this as a repo from scratch.

The onboarding first month or so was pretty.

It was like working backwards, right?

And then you have to work with the system.

And now you're at that point where you know, you're very autonomous.

I'm curious, like, okay, so how human-in-the-loop is it?

So what are the bottlenecks that you wish you could still automate?

And part of that is also like where do you see the model trajectory improving and offloading more human in the loop?

We just got 5.4.

It's a really fantastic model, by the way.

Yeah.

It's the first one that's merged top-tier coding, so Codex-level coding, and general reasoning, both in one model.

And computer vision.

Now we have it.

Now with 5.4 I can just have Codex write the blog post.

Whereas for this one, I had to bounce between chat.

Oh, I need to.

I might be out of a job.

Oh, my god.

Oh, you just gave me an idea for a completely AI newsletter that 5.4 could do.

Yeah.

I get it now.

This sort of thing is just one example of closing the loop.

Right.

Like the dashboard thing you mentioned.

We have Codex authoring the JSON for the Grafana dashboards and publishing them.

And also responding to the pages, which means when it gets the page, it knows exactly which dashboards are defined and what alert was triggered by which exact log in the codebase, because all of this stuff is colocated together.

It has to own everything.

Yes.

It means that if we have an outage that did not result in a page, it has the existing set of dashboards available to it.

It has the existing set of metrics and logs and can figure out where the gaps in the dashboard are, or in the underlying metrics, and fix them in one go.

In the same way you would have a full stack engineer be able to drive a feature from the back end all the way to the front end.

So it seems like a lot of the work you guys had to do, as a small team, was working fully in the way that the model wants software to be written.

It's less human-legible in exchange for better agent legibility.

How do you think that affects broader teams?

So, one, at OpenAI, do they agree that this is how software should be written?

Like, I can imagine, say, you join a new team with this methodology, this mindset.

There are ways that teams do code review, ways teams write code, ways teams are structured, and a lot of it is for human legibility.

Should we all swap? How does this play back, one, more broadly into OpenAI, and then more broadly into the software industry?

Is it like, teams pick this up?

Well, it's pretty drastic, right? You have to make a pretty big switch. Should they just full send?

Yeah, the mindset is very much that I'm removed from the process right I can't really have deep code level opinions about things.

It's as if I'm group tech leading a 500 person organization.

Yeah, like it's not appropriate for me to be in the weeds on every PR.

This is why that post merge code review thing is like a good analog here, right?

Like I have some representative sample of the code as it is written.

I have to use that to infer what the teams are struggling with where they could use help.

Where they're already moving quickly and I can pivot my focus elsewhere.

Yeah, so I don't really have too many opinions around the code as it is written.

I do however have a command base class which is used to have repeatable chunks of business logic that comes with tracing and metrics and observability for free.

The thing to focus on is not how that business logic is structured, but that it uses this primitive because I know that's going to give leverage by default.
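A command base class of the kind described here might look like the sketch below. It is an assumption-laden illustration: the class and method names, the in-memory metrics dictionary, and the use of the standard `logging` module stand in for whatever tracing and metrics backend the real primitive uses.

```python
import logging
import time
from abc import ABC, abstractmethod

logger = logging.getLogger("commands")


class Command(ABC):
    """Base class: subclasses write execute(); callers always go through run()."""

    # Shared in-memory counters keyed by command name (a stand-in for real metrics).
    metrics = {}

    @abstractmethod
    def execute(self, **kwargs):
        """The business logic. Its internal structure is left to the author."""

    def run(self, **kwargs):
        name = type(self).__name__
        stats = Command.metrics.setdefault(
            name, {"calls": 0, "errors": 0, "total_seconds": 0.0}
        )
        stats["calls"] += 1
        start = time.perf_counter()
        try:
            return self.execute(**kwargs)
        except Exception:
            stats["errors"] += 1
            logger.exception("command failed: %s", name)
            raise
        finally:
            elapsed = time.perf_counter() - start
            stats["total_seconds"] += elapsed
            logger.info("command %s finished in %.3fs", name, elapsed)


class ApplyDiscount(Command):
    """Example business logic that gets timing and error counts for free."""

    def execute(self, *, price, percent):
        return round(price * (1 - percent / 100), 2)
```

A reviewer, human or agent, then only has to check one thing: that new business logic subclasses `Command` and is invoked through `run()`. What happens inside `execute()` can stay the agent's call.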

Yeah, back to that sort of system thinking.

And you have part of that in your blog post, on enforcing architecture and taste, and how you set boundaries for what to use.

There's also a section on redefining engineering and stuff, but yeah, it's just it's interesting to hear.

And as the models have gotten better, they have gotten better at proposing these abstractions to unlock themselves, which again lets me move higher and higher up the steps.

To look deeper into the future on what ultimately blocks the team from shipping.

Yeah, you mentioned, so this is primarily, like, a 1-million-line codebase Electron app, but it manages its own services as well.

So it's like a backend-for-frontend type thing.

We do have a backend in there, but that's hosted in the cloud.

This sort of structure is actually within the separate main and renderer processes within Electron.

That's just how Electron works.

Yeah, of course.

So I have also treated MVC-style decomposition with the same level of rigor, which hasn't been very fun.

I have a fun pun.

This is like a tangent.

MVC is model-view-controller; any sort of full-stack web dev knows that. But my AI-native version of this is model-view-claw.

The claw is the harness.

That's right.

I do think that there is an interesting space to explore here with codex, the harness as part of building AI products.

Right.

There's a ton of momentum around getting the models to be good at coding.

We've seen big leaps in, like, the task complexity with each incremental model release. If you can figure out how to collapse a product that you're trying to build, or a user journey that you're trying to solve, into code, it's pretty natural to use the Codex harness to solve that problem for you.

It's done all the wiring and lets you just communicate in prompts to let the model cook.

Yeah, it's been very fun.

And there's also a very engineering-legible way of increasing the complexity.

Just give the model scripts.

The same scripts you would already build for yourself.

Yeah.

So for listeners, this is Ryan saying that software engineering, or coding agents, will eat knowledge work.

Like the non-coding parts that you would normally think, oh, you have to build a separate agent for it.

No.

Start your coding agent and go out from there.

Which OpenClaw has?

Yes.

In code.

Everything is occurring.

By the way, since I brought it up, this is probably the only place to bring it up: is there any OpenClaw usage from you?

No, no, not for any.

I don't have any spare Mac minis rattling around my house.

You can afford it.

No, I just don't care.

If it's changed anything and open the idea, but it's probably already it is.

And then, yeah, one thing I want to pull on here is that you mentioned ticketing systems and you mentioned PRs.

And I'm wondering if both those things have to go away or be reinvented for this kind of coding.

So Git itself is, like, very hostile to multi-agent work.

Yeah, we make very heavy use of worktrees.

But even then, I just did a podcast yesterday with Cursor, and they said they're getting rid of worktrees because there are still too many merge conflicts.

It's too unintuitive, but I go ahead.

The models are really great at resolving merge conflicts.

Yeah, and to get to a state where I'm not synchronously in the loop in my terminal.

I almost don't care that there are merge conflicts; this is disposable.

Yeah, we invoke a $land skill, and that coaches Codex to push the PR.

Wait for human and agent reviewers.

Wait for CI to be green.

Fix the flakes if there are any, and merge upstream if the PR comes into conflict.

Wait for everything to pass.

Put it in the merge queue and deal with flakes until it's in main.

And this is what it means to delegate fully, right?

In a very large monorepo, this is probably a significant tax on humans to get PRs merged.

But the agent is more than capable of doing this.

And I really don't have to think about it.
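The steps of the landing skill listed above can be sketched as a loop over an in-memory stand-in for a pull request. Nothing here talks to Git or a code host: `FakePullRequest`, its fields, the status strings, and the attempt limit are all hypothetical, chosen only to show the control flow of "keep going until it is in main".

```python
from dataclasses import dataclass, field


@dataclass
class FakePullRequest:
    """In-memory stand-in for a real PR; every field is hypothetical."""
    ci_results: list                      # consumed one entry per CI run
    conflicts_with_main: bool = False
    approved: bool = True
    merged: bool = False
    log: list = field(default_factory=list)

    def run_ci(self):
        # Once the scripted results run out, assume CI is green.
        return self.ci_results.pop(0) if self.ci_results else "green"


def land(pr, max_attempts=10):
    """Drive a PR to main: resolve conflicts, wait for review, retry flaky CI."""
    pr.log.append("push")
    for _ in range(max_attempts):
        if pr.conflicts_with_main:
            pr.log.append("merge_upstream")
            pr.conflicts_with_main = False
            continue
        if not pr.approved:
            pr.log.append("wait_for_review")
            pr.approved = True  # a real loop would poll; the fake approves at once
            continue
        status = pr.run_ci()
        if status == "flaky":
            pr.log.append("retry_flaky_ci")
            continue
        if status == "red":
            pr.log.append("needs_fix")
            return False  # a genuine failure goes back to the authoring agent
        pr.log.append("enter_merge_queue")
        pr.merged = True
        return True
    return False
```

The value of writing the skill as one loop is that the human hands off a single instruction and the agent owns every retry, which is the "delegate fully" point made in the conversation.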

Other than keeping my laptop open.

Yeah.

I used to be much more of a control freak.

But now I'm like, yeah, actually, you could do a better job of this than me.

Yeah, with the right context.

Yes.

Anything else on harness engineering in general, just this piece?

I just want to make sure we.

I think one thing that I maybe didn't make super clear in the article, that I heard on Twitter as an interesting.

I'm on to them.

What's the chatter and what's your response?

Ultimately.

All the things that we have encoded in docs and tests and review agents and all these things are ways to put all the non-functional requirements of building high-scale, high-quality, reliable software into a space that prompt-injects the agent.

We either write it down as docs, or as lints where the error messages tell it how to do the right thing.

So the whole meta of the thing is to basically tease out of the heads of all the engineers on my team what they think good looks like, what they would do by default, or what they would coach a new hire on the team to do to get things to merge.

And that's why we pay attention to all the mistakes that the agent makes, right? Code being written that is misaligned with some as-yet-unwritten non-functional requirement.

Sorry, what did the online people misunderstand or.

No, what do you recently just literally said that.

I was like, oh, yeah, okay.

This is the thing.

This is what you do.

Interesting.

One other neat thing, which I totally did not expect, is that folks were just taking the link to the article, giving it to Pi or Codex, and saying: make my repo this.

You achieved full recursion. And it was wildly effective? Really, it was wildly effective.

No, that's actually just something I tried with 5.4 yesterday.

I didn't have that much time.

I was like, out speaking at something and this is one of my things.

That's okay.

I have this article.

Can we just scaffold out what it would be like to run this and I did it.

First is that and I was like, okay, let me take another little side repo and say, okay, if I was to fully automate this like this.

I haven't written a line of code.

It's like, okay, all set.

It's the side thing.

Voice TTS.

I'm just, like, slopping out whatever.

It's nothing production.

I'm like, how would I make this like this?

And it's actually a really good way.

It's like a good way to learn.

What could be changed?

What could be like, it's just a good analyzing, right?

You give it all the code.

You give it all the context, you give it the article, and it walks you through it very well.

That's right.

I guess one more thing before we go to Symphony is I wanted to cover Bret Taylor's response.

We had him on the show.

He is your chairman.

Which is wild.

Yeah.

That he's reading your articles as well and like getting engaged in it.

He says software dependencies are going away, basically.

They can just be, like, vendored.

Yes.

Response.

100%.

How you were selling?

You still pay Datadog.

You still pay Temporal.

Thank you.

Yep.

The level of complexity of the dependencies that we can internalize is, I would say, low to medium right now.

Just based on model capabilities.

What is medium?

I would say, like, a couple-thousand-line dependency is a thing that we could in-house, no problem. Call it an afternoon of time.

One neat thing about it is that probably most of that code you don't even need.

By in-housing an abstraction, you can strip away all the generic parts of it and only focus on what you need to enable the specific thing.

Yes.

You're building.

I've been calling this to end a bullshit plug-ins.

Yeah.

Because it's so much when I publish an open source thing.

I want to accept everything.

Be liberal.

I want to accept.

This is Postel's law.

But that means it's so much bloat.

It's so much overhead.

One other neat thing about this too is when we deploy Codex Security on the repo.

It is able to deeply review and change the internalized dependencies in a much lower-friction way than it would be to push patches upstream.

Wait for them to be released.

Pull them down.

Make sure that's compatible with all the transitives I have in my repo and things like that.

So it's also much lower friction to internalize some of these things.

If code is free because the tokens are cheaper.

Sort of thing.

I think like the only argument I have against this is basically scale testing.

Obviously the larger pieces of software, like Linux.

MySQL.

He calls up even the data.

And then maybe security testing.

Where yes.

Classically, I think, is it...

Linus.

That open source is the best disinfectant.

Many eyes.

Many eyes.

And if you inline your dependencies and code them up, you're going to have to relearn mistakes from other people.

Yep.

And to internalize that dependency.

You're back to zero.

And you have to start reassembling all those bits and pieces to have high confidence.

Yeah.

Even part of the first intro of this.

You basically mentioned, like, everything was written by Codex, including internal tooling.

Right?

Your internal tooling, like when you're visualizing what's going on.

It's writing it for you.

Yeah.

I'm building internal tools this way now.

And I just showed them off and they're like, how long did you spend?

And I didn't spend any time; I just prompted it.

Very funny story here.

Yeah.

We had deployed the app to the first dozen users internally.

We had some performance issues.

So we asked them to export a trace for us.

Got a tarball.

Gave it to our on-call engineer.

And he did a fantastic job of working with Codex to build this beautiful local dev tool, a Next.js app, with drag and drop.

You drop the tarball in; it visualizes the entire trace.

Well, it's fantastic.

Took an afternoon.

But none of this was necessary.

Because you could just spin up Codex, give it the tarball, ask the same thing, and get the response immediately.

So in a way, optimizing for human legibility of that debugging process was wrong.

It kept him in the loop unnecessarily.

Instead, he could have just let Codex cook for five minutes and gotten it.

Yeah.

If I were in things here, this is how we used to do it.

Or this is how I would have used to solve it.

Yeah.

And this local observability stack.

Like, sure, you can deploy Jaeger to visualize the traces.

But I wouldn't expect to be looking at the traces in the first place because I'm not going to write the code to fix them.

Yeah.

So basically, you need to be building this kind of in-house stack and owning the whole loop.

I think that is very well established.

It sounds like you might be like sharing more about that in the future, right?

Yeah.

I think we're excited to do.

We're going to talk about Symphony in a little bit.

But, like, the way we distributed it is as a spec.

I think folks are calling these ghost libraries on Twitter.

This is such a cool name.

It does mean it becomes much cheaper to share software with the world.

You define a spec for how you could build your own, specifying as much as is required for a coding agent to reassemble it locally.

The flow here is very cool.

Like, we have taken all the scaffolding that has existed in our proprietary repo, spun up a new one, and asked Codex, with our repo as a reference, to write the spec.

We tell it: spin up a tmux, spawn a disconnected Codex to implement the spec.

Wait for it to be done, then spawn another Codex in another tmux to review the implementation compared to upstream and update the spec so it diverges less.

And then you just loop over and over, Ralph-style, until you get a spec that is, with high fidelity, able to reproduce the system as it is.
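The implement, review, update loop described here can be sketched roughly as follows. This is a minimal illustration, not the team's actual tooling: `distill_spec`, `Review`, and the toy `fake_implement` / `fake_review` stand-ins are all hypothetical names, and the real steps would be separate Codex sessions rather than Python callables.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Review:
    divergences: List[str]  # places where the rebuild differs from upstream
    revised_spec: str       # spec text updated to close those gaps


def distill_spec(
    draft_spec: str,
    implement: Callable[[str], str],       # hypothetical: spec -> rebuilt system
    review: Callable[[str, str], Review],  # hypothetical: (spec, rebuild) -> Review
    max_rounds: int = 10,
) -> str:
    """Loop implement -> review -> update until the rebuild stops diverging."""
    spec = draft_spec
    for _ in range(max_rounds):
        rebuild = implement(spec)       # fresh agent that only sees the spec
        result = review(spec, rebuild)  # second agent compares against upstream
        if not result.divergences:
            return spec                 # the spec now reproduces the system
        spec = result.revised_spec
    return spec


# Toy stand-ins: "upstream" is a set of features; the spec is a ';'-joined list.
UPSTREAM = {"poll tickets", "spawn agent", "rework state"}


def fake_implement(spec: str) -> str:
    return spec  # the rebuild contains exactly what the spec mentions


def fake_review(spec: str, rebuild: str) -> Review:
    built = set(filter(None, rebuild.split(";")))
    missing = sorted(UPSTREAM - built)
    # Close one gap per round, the way an iterative loop would.
    revised = ";".join(sorted(built | set(missing[:1])))
    return Review(divergences=missing, revised_spec=revised)


print(distill_spec("poll tickets", fake_implement, fake_review))
```

The point of the shape is that the implementing agent never sees the original repo, so any gap in the spec shows up as a divergence in the next review.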

Fantastic.

And you're basically you're not really adding any of your human bias in there, right?

Correct.

A lot of times people write a spec and be like, okay, I think it should be done this way.

And you'll riff on something and it's on all of that.

You're still scaffolding in a sense, right?

I want it done this way.

It can determine its spec better.

That's right.

Part of me... I've been working a lot on evals recently.

And part of me is wondering if an agent can produce a spec that it cannot solve.

Is it always capable of things that it can imagine or can you imagine things that it is impossible to do?

I think with Symphony there are, like, these axes, where you have things that are easy or hard, and established or new, right?

And I think things that are hard and new are still something where the models need humans

to drive.

Yeah.

But I think those other quadrants are largely solved, given the right scaffold and the right thing to drive the agent to completion.

It's crazy to solve.

But it means that the humans, the ones with limited time and attention get to work on the hardest stuff.

Like the problems where it's pure white space out in front.

Or like the deepest refactorings where you don't know what the proper shape of the interfaces are.

And this is where I want to spend my time because it lets me set up for the next level of scale.

Yeah.

Yeah, amazing.

Let's introduce Symphony.

I think we've been mentioning it every now and then.

Elixir.

Interesting option.

Yeah.

Again, the Elixir manifestation here is just a derivative.

Is it model-chosen?

Yeah.

Because the process supervision and the GenServers are super amenable to the type of process orchestration that we're doing here.

You are essentially spinning up little daemons for every task that is in execution and driving it to completion, which means the model gets a ton of stuff for free by using Elixir and the BEAM.
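For readers who have not used the BEAM, here is a rough Python analogy of "one supervised worker per in-flight task". It is only a sketch: Symphony itself is Elixir, and `supervise`, `run_all`, and `flaky` are hypothetical names made up for this illustration. Real OTP supervisors give you this behavior (plus much more) as a runtime primitive.

```python
import asyncio
from typing import Awaitable, Callable, Dict


async def supervise(name: str, work: Callable[[], Awaitable[str]],
                    max_restarts: int = 3) -> str:
    """Run one task to completion, restarting it if it crashes."""
    for attempt in range(1, max_restarts + 2):
        try:
            return await work()
        except Exception as exc:  # a crash in one task stays local to it
            print(f"{name}: attempt {attempt} crashed ({exc})")
    return "gave up"


async def run_all(tasks: Dict[str, Callable[[], Awaitable[str]]]) -> Dict[str, str]:
    # One supervised worker per in-flight task, all running concurrently.
    results = await asyncio.gather(*(supervise(n, w) for n, w in tasks.items()))
    return dict(zip(tasks.keys(), results))


def flaky(fail_times: int) -> Callable[[], Awaitable[str]]:
    """Build a toy task that crashes `fail_times` times, then succeeds."""
    remaining = {"n": fail_times}

    async def work() -> str:
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise RuntimeError("agent process died")
        return "done"

    return work


print(asyncio.run(run_all({"ticket-1": flaky(0), "ticket-2": flaky(2)})))
```

In Python you have to hand-roll the restart loop; on the BEAM the equivalent restart policy is declared rather than written, which is the "ton of stuff for free" being described.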

I had to go do a crash course in BEAM and Elixir.

And I think most people are not operating at that scale of concurrency where you need that.

But it is a good mental model for resumability and all those things.

And these are things I care about.

But tell me the story, the origin story of Symphony.

What do you use it for?

How did it form?

Maybe any abandoned paths that you didn't take?

At the end of December, we were at about three and a half PRs per engineer per day.

So that was before 5.2 came out.

In the beginning of January, everyone gets back from holiday with 5.2, and with no other work on the repository,

we were up in the five to ten PRs per day per engineer.

And I don't know about y'all, but it's very taxing to constantly be switching like that.

Like I was pretty capped out at the end of the day.

Again, where are the humans spending their time?

They're spending their time.

Context switching between all these active tmux panes to drive the agents forward.

Yeah, so let's, again, build something to remove ourselves from the loop.

And this is what we frantically sprinted after here: a way to remove the need for the human to sit in front of their terminal.

So, a lot of experimentation with dev boxes and automatically spinning up agents.

Like, it seems like a fantastic end state here, where my life is a beach.

I open Linear twice a day and say yes or no to these things.

And this is, again, a super interesting framing for how the work is done, because I become more latency-insensitive.

I have way less attachment to the code as it is written.

Like I've had close to zero investment in the actual authorship experience.

So if it's garbage, I can just throw it away and not care too much about it.

In Symphony there's this rework state: once the PR is proposed, it's escalated to the human for review.

It should be a cheap review.

It is either mergeable or it is not.

And if it's not, you move it to rework.

The Elixir service will completely trash the entire worktree and PR and start it again from scratch.

And this is that opportunity, again, to ask: why was it trash, right?

What did the agent do that was bad?

Fix that before moving the ticket to progress again.
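The ticket lifecycle described here can be sketched as a small state machine. This is a minimal sketch in Python rather than Elixir; only the "rework" state is named in the conversation, and the other state names, the `Ticket` fields, and the helper functions are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class State(Enum):
    IN_PROGRESS = "in_progress"    # hypothetical name
    HUMAN_REVIEW = "human_review"  # hypothetical name
    REWORK = "rework"              # named in the conversation
    MERGED = "merged"              # hypothetical name


@dataclass
class Ticket:
    title: str
    state: State = State.IN_PROGRESS
    attempt: int = 1
    worktree: List[str] = field(default_factory=list)  # stand-in for files + PR
    lessons: List[str] = field(default_factory=list)   # guardrails added after rework


def propose_pr(t: Ticket, files: List[str]) -> None:
    t.worktree = list(files)
    t.state = State.HUMAN_REVIEW


def review(t: Ticket, mergeable: bool) -> None:
    # The review is meant to be cheap: a yes/no, not a line-by-line edit.
    t.state = State.MERGED if mergeable else State.REWORK


def restart(t: Ticket, lesson: str) -> None:
    # Rework discards everything; nothing from the failed attempt is patched.
    assert t.state is State.REWORK
    t.worktree.clear()
    t.lessons.append(lesson)     # fix the cause first...
    t.attempt += 1
    t.state = State.IN_PROGRESS  # ...then move the ticket back to progress


ticket = Ticket("Add trace export")
propose_pr(ticket, ["export.ts"])
review(ticket, mergeable=False)
restart(ticket, "document the trace format in the repo")
print(ticket.state, ticket.attempt, ticket.lessons)
```

The design choice worth noticing is that `restart` requires a lesson: the rejected attempt is thrown away, and what carries forward is the fix to the context that produced it.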

Why is it not in the Codex app?

I guess you guys are ahead of the Codex app.

Yeah, so the way the team has been working is basically to be as AGI-pilled as possible and sprint ahead.

And a lot of the things we have worked on have fallen out into a lot of the products that we have.

Like, we were in deep consultation with the Codex team to have the Codex app be a thing that exists.

Right, to have skills be a thing that Codex is able to use,

so we didn't have to roll our own. To put automations into the product,

so all of our automatic refactoring agents didn't have to be these hand-rolled control loops.

It has been really fantastic to be, in a way, unanchored to the product development of Frontier and Codex, and just very quickly try to figure out what works.

And then later find the scalable thing that can be deployed widely.

It's been a very fun way to operate.

It's certainly chaotic.

I have lost track very often of what the actual state of the code looks like because I'm not in the loop.

There was one point where we had wired Playwright directly up to the Electron app with MCP.

MCPs I'm pretty bearish on, because the harness forcibly injects all those tokens into the context and I don't really get a say over it.

They mess with auto-compaction.

The agent can forget how to use the tool.

There's probably only, what, three calls in Playwright that I actually ever want to use.

So I pay the cost for a ton of things.

Somebody vibed a local daemon that boots Playwright and exposes a tiny little shim CLI to drive it, and I had zero idea that this had occurred, because to me, I run Codex and it's able to.

Like, no knowledge of this at all.

So we have had, in human space, to spend a lot of time doing synchronous knowledge sharing.

We have a daily stand up that's 45 minutes long because we almost have to fan out the understanding of the current state.

I was going to say this is good for a single human multi agent, but multi human multi agent is a whole like positive explosion of stuff.

Yeah, and that this is fundamentally why we have such a rigid like 10,000 engineer level architecture in the app because we have to find ways to carve up the space so people are not trampling on each other.

Sorry, I don't get the 10,000 thing.

Did I miss that?

The structure of the repository is, like, 500 npm packages. It is architected to excess for what you would consider, I think, normal for a seven-person team.

But if every person is actually like 10 to 50, then leaning into being super, super deep into decomposition and sharding and proper interface boundaries makes a lot more sense.

To me, that's why I talked about microphones and I annexes from that world, but cool.

It's just coming back to to this.

I don't know if you have other thoughts on orchestrating so much work coin going through this is this enough is this like any of our moments.

It'll be interesting to see where it goes. Right now you picked Linear as your issue tracker, right? Or is it actually Linear?

This is actually Linear.

Oh, I know I never look a little more video to demo video at the download to run.

Because I'm a Slack maxi, but yeah, Linear is also really good.

Yes, we do make good use of Slack.

We fire off Codex to do all these low-stakes things, like the fix-ups, the things that sync that knowledge into the repository.

It's super cheap.

Yeah, do it in codex.

My biggest plug is that OpenAI needs to build Slack.

You need to own Slack, build yours.

The turn distance.

I did read.

I would say that if we think that we want these agents to do economically valuable work, which is like this is the mission, right.

We want AI to be deployed widely to do economically valuable work.

Then we need to find ways for them to naturally collaborate with humans, which means collaboration tooling.

I think is an interesting space to explore.

Yeah, totally.

GitHub, Slack, Linear.

Yeah, I was kind of thinking, okay, where do we see it going? Codex started as a Codex model, then a CLI, and now there's an app that lets me shoot off multiple Codexes in parallel.

But there's no great team collaboration for codex.

And it seems like your team had some say into what comes out, right.

So you talked them codex kind of was a thing from there.

If you guys are on the bound.

What stuff that like you might not focus on, but what do you expect other people to be building, right?

So people that are like 5x 50xing should you build stuff that's like very niche for your workflow for your team should it be more general.

So other people can adopt this or niche there.

As part of it is just okay, is everything just internal tooling do we have everything our own way like the way our team operates has our own ways that we like the communicate work.

Is there a broader way to do it?

Is it something like an issue tracker? Just thoughts, if you have any on that.

I think TBD; we have not figured this out in a general way.

I do think that there is leverage to be had in making the code and the processes as much the same as possible.

If you think that code is context, code is prompts,

it's better from the agent behavior perspective to be able to look in a package in directory XYZ and not have to page so deeply into another directory, because they have the same structure, use the same language, and have the same patterns internally.

And that same leverage comes from aligning on a single set of skills that you're pouring every engineer's taste into, to make sure that the agent is effective.

So in our codebase we have, I think, six skills, that's it. And if some part of the software development loop is not being covered, our first attempt is to encode it in one of the existing skills, which means that we can change the agent behavior more cheaply than changing the human driver behavior.

Have you experimented with the agent changing its own behavior?

We do, yeah.

A parent agent changing a sub-agent's behavior, something like that?

We have some bits for skill distillation. For example, there's one neat thing you can do with Codex, which is just point it at its own session logs and ask it to tell you how you can use it better.

Yeah, it's like introspection: you are asking it to do something better. So, can I do this session better? What skills should I have? I like the modification of "you can just do things" to "you can just ask agents to do things."

Yeah, you can just Codex things. This is like a silly emoji that we have, "just Codex things." Because, you can just prompt things. It's a really glorious future we live in.

But okay, you can do that one-on-one, but we're actually slurping these up for the entire team into blob storage and running agent loops over them every day to figure out where, as a team, we can do better and how we reflect that back in the repository, so everybody benefits from everybody else's behavior for free.
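A daily reflection pass over collected logs might look something like the sketch below. Everything here is an assumption for illustration: the JSONL layout, the `topic` field, the local directory standing in for blob storage, and the function names. The actual agent call is left out; the sketch only shows gathering signals and turning them into a prompt.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Dict, Iterable, List


def load_signals(log_dir: Path) -> List[Dict]:
    """Read every *.jsonl file in log_dir; each line is one feedback event."""
    events = []
    for path in sorted(log_dir.glob("*.jsonl")):
        for line in path.read_text().splitlines():
            if line.strip():
                events.append(json.loads(line))
    return events


def build_reflection_prompt(events: Iterable[Dict], top_n: int = 3) -> str:
    """Summarize the most frequent topics into a prompt for a reflection agent."""
    counts = Counter(e["topic"] for e in events)  # "topic" is a hypothetical field
    lines = [f"- {topic}: {n} signals" for topic, n in counts.most_common(top_n)]
    return (
        "These signals suggest the agent was missing context.\n"
        "Propose repository changes (docs, lints, skills) for:\n"
        + "\n".join(lines)
    )


sample = [{"topic": "lint"}, {"topic": "lint"}, {"topic": "docs"}]
print(build_reflection_prompt(sample))
```

The output of the agent run would then land in the repository (a doc, a lint, a skill update), which is how one engineer's correction becomes shared behavior.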

Same for like PR comments right these are all feedback.

That means the code as written deviated from what was good.

A PR comment, a failed build: these are all signals that mean at some point the agent was missing context. We've got to figure out how to...

Yeah, slurp it up and put it back in the repo. By the way, I do exactly this when I use Claude Code for

knowledge work.

Cowork is, like, a nice product.

Yes. I think you would agree: I always have it tell me what I can do better next time, and that's the meta-programming reflection thing.

So it's almost like you have six abstraction levels in Symphony, and then almost like a zeroth layer.

So the six levels are policy, configuration, coordination, execution, integration, and observability. We've talked about a couple of these.

But the zeroth layer is like, okay, are we working well? How do we improve how we work?

Yes. Can I modify my own WORKFLOW.md or something?

I don't know.

Yeah, of course you can. This thing is also able to cut its own tickets, because we give it full access.

Yeah, making a ticket to have it cut tickets. You can put in the ticket that you expect it to file follow-up work.

Like, self-modifying, yeah. Don't put the agent in a box; give the agent full accessibility over its domain.

I had a mental reaction when you said don't put the agent in a box. I think you should put it in a box; it's just that you're giving the box everything it needs.

Yeah, context and tools but we're like as developers were used to calling out to different systems but here you use the open source things like the.

For me, it's whatever and you run it locally so that you can have the full loop I assume.

Yep, I think I want to minimize cloud dependencies.

You also want to make sure that you think about what the agent has access to what does it see does it go back in the loop like from the most basic sense of you let it see it's own like calls traces.

It can determine where it went wrong.

Are you feeding that back in so you know just the most basic level if you want to see exactly what they put out what like does the agent have access to.

What is being supported right it can something prove a lot of these things it's all text right.

My job is to figure out ways to funnel text from one agent to the other.

It's so strange. Way back at the start of this whole AI wave, Andrej was like, English is the hottest new programming language.

It's here.

Yeah, the features.

Yeah.

Okay, like a lot of software a lot of stuff.

There's a GUI.

It's made for the human.

We're seeing the evolution of CLIs for everything.

All tools have CLIs here out.

It's going to use them well.

Do we get good vision?

Do we get good little sandboxes?

Like, right now it's a really effective way, right? Models love to use tools.

They love to parse.

They love to read through text. So slap on a CLI and let it loose.

That works for everything.

It does.

Yeah, we've also been adapting non-textual things to that shape in order to improve model behavior in some ways.

Right.

We want the agent to be able to see the UI.

Agents do not perceive visually in the same way that we do.

They don't see a red box.

They see red box button.

Right.

They see these things in latent space.

So if we want.

Yeah.

We have a thing every day goes off every time.

You're sitting in space.

Ding.

Anyway, if we want to actually make it see the layout, it is almost easier to rasterize that image to ASCII art and feed it in to the agent.

And there's no reason you can't do both.

Right.

To like further refine how the model perceives the object it's manipulating.
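As a rough illustration of "rasterize to ASCII art", the sketch below maps a grid of brightness values to characters. It is a toy under stated assumptions: `to_ascii`, `draw_box`, and the character ramp are made up for this example, and a real pipeline would first downsample an actual screenshot to a small grayscale grid.

```python
from typing import List

# Sparse-to-dense glyphs: higher brightness values map to denser characters.
RAMP = " .:-=+*#%@"


def to_ascii(pixels: List[List[int]], ramp: str = RAMP) -> str:
    """Map a grid of 0-255 brightness values to one character per pixel."""
    rows = []
    for row in pixels:
        rows.append(
            "".join(ramp[min(v, 255) * (len(ramp) - 1) // 255] for v in row)
        )
    return "\n".join(rows)


def draw_box(width: int, height: int,
             x0: int, y0: int, x1: int, y1: int) -> List[List[int]]:
    """Toy 'screenshot': a bright filled rectangle on a dark background."""
    return [
        [255 if x0 <= x < x1 and y0 <= y < y1 else 0 for x in range(width)]
        for y in range(height)
    ]


print(to_ascii(draw_box(12, 5, 2, 1, 9, 4)))
```

The text grid preserves relative position and size, which is the layout information the agent otherwise has trouble recovering from an image alone.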

Cool.

Cool. Do you want to talk about a couple more of these layers that might bear more introspection, or that you have personal passion for?

I will say that the coordination layer here was a really tricky piece to get right.

Let's do it.

Yep.

I'm all about that.

And this is simple.

This is where, when we turned the spec into Elixir, the model takes a shortcut.

Right.

Oh, I have all these primitives that I can make use of in this lovely runtime that has native process supervision, which is, I think, a neat way to have taken the spec and made it more accessible by making choices that naturally map to the domain.

You know, in the same way that you would prefer to have a TypeScript monorepo if you are doing full-stack web development, right, because the ability to share types across the front end and back end

reduces a lot of complexity. And this is what GraphQL used to be.

That's right.

And I don't know if it's still alive, but there's no humans in the loop here.

So my own personal ability to write or not write Elixir doesn't really have to bias us away from using the right tool for the job.

It is just wild.

Love it.

I love it.

Yeah.

I wonder if any languages struggle more than others because of this.

I feel like everyone has their own abstractions that would make sense, but maybe it might be slower, might be more faulty, where like you have to just kick the server every now and then.

I don't know.

I think I'm so good there's really well understood integration there.

MCP's dead.

I think all these are just like a really interesting hierarchy to travel up and down.

It's common language for people working on the system to understand.

The policy stuff is really cool, right?

Yeah.

You don't really have to build a bunch of code to make sure this is a way through the eye to pass it to institutional knowledge.

Yeah.

You just give it the gh CLI with some text that says CI has to pass.

It makes the maintenance of these systems a lot easier.

Do you think that CLI maintainers need to be do anything special for agents or just as is it's good?

Because, like, I don't think when people made the GitHub CLI they anticipated this happening.

That's correct.

The gh CLI is fantastic.

It's a super industry.

Everyone go try GH repo, create GH pool and then pull across number.

Right.

GHPR, like 150 free.

Whatever.

And then it lay pools.

Basically, my only interaction with the GitHub web UI at this point is gh pr view --web, just to glance at the diff.

And be like, sure thing, send it.

But the CLIs are nice because they're super token efficient.

And they can be made more token efficient really easily.

Like, I'm sure you all have seen this: I go to Buildkite or Jenkins.

And I could just get this massive wall of build output.

And in order to unblock the humans, your developer productivity team is almost certainly going to write some code that parses the actual exception out of the build logs and sticks it in a sticky note at the top of the page.

And you basically want CLIs to be structured in a similar way, right?

You're going to want to pass dash-silent to Prettier, because the agent doesn't care that every file was already formatted.

It just wants to know whether it's formatted or not.

So it can then go run a write command.

Similarly, in our pnpm distributed script runner, when we had one, when you do --recursive it produces an absolute mountain of text.

But all of that is for passing test suites.

So we ended up wrapping all of this in another script to suppress the verbosity

and generally only output the failing parts of the test.

You can pipe errors to stderr versus standard out.
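A wrapper of the kind described here can be sketched in a few lines. This is an illustrative stand-in, not the team's script: `run_quiet` is a hypothetical name, and keeping only the tail of the output is a simplification of "only the failing parts" (a real wrapper would likely parse the test runner's failure sections).

```python
import subprocess
import sys
from typing import List, Tuple


def run_quiet(cmd: List[str], tail_lines: int = 40) -> Tuple[int, str]:
    """Run `cmd`; return (exit code, text worth showing to the agent)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode == 0:
        return 0, "ok"  # success: one token-cheap word instead of a wall of logs
    combined = (proc.stdout + proc.stderr).splitlines()
    return proc.returncode, "\n".join(combined[-tail_lines:])


# Demo with a passing and a failing child process (no external tools needed).
passing = [sys.executable, "-c", "print('1000 lines of passing tests')"]
failing = [sys.executable, "-c",
           "import sys; print('suite a ok'); "
           "print('FAIL suite b', file=sys.stderr); sys.exit(1)"]

print(run_quiet(passing))
print(run_quiet(failing))
```

On success the agent sees a single word; on failure it sees the exit code and the last few lines, which is usually where the actual error lives.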

I don't know, okay, whatever.

Generally thinking that this CLIs, I used to maintain a CLI for my company.

Yeah, this is, of course, very close to my heart.

Like, you're vibing my job.

That's right.

Cool.

And the other things, this is a long spec.

I appreciate that it's got a lot of strong opinions in here.

Any other things that we should highlight.

I think I also can spend the whole day going through some of these.

But I do think that some of these have a lot of care or some of this.

You might want to tell people, hey, take this, but make it your own.

Fundamentally, software is made more flexible when it's able to adapt to the environment in which it is deployed, which means that things like Linear or GitHub are specified within the spec but are not required pieces of it.

There's a more platonic ideal of the thing, where you could swap in Jira or Bitbucket, for example.

But being able to tightly specify things like the ID formats, or how the Ralph loop works for the individual agents, basically means you can get up and running with a fully specified system quickly, which you then evolve later on.

I think we never intended for this to be a static spec that you can never change.

It's more like a blueprint to get something up and running to start with, for you then to vibe on to your heart's content.

You have code in scripts in here, where it's all.

I think this is a really good prompt.

It's just a very long prompt.

Fundamentally, the agents are good at following instructions, so give them instructions, and it will improve the reliability of the result.

Much like the way we use Symphony, we don't want folks to have to monitor the agent as it is vibing the system into existence.

So being very opinionated, very strict around what the success criteria are, means that our deployment success rate goes up.

Yeah, means we don't have to get tickets on this thing.

It all goes back to that "code is disposable" idea, right?

Like early on, when you had the CLI and kicked off a Codex run, it would take two hours.

You would want to monitor.

Okay, I'm in the workflow of just using one.

I don't want it to go down the wrong path.

I'll cut it off and just shoot off four.

Like, that was my favorite thing about the Codex app, right?

It's 4x.

It's okay.

One of them will probably be right.

One of them might be better.

Stop overthinking it.

Like my first example is probably like Deep Research.

When you put out Deep Research, I asked it something about LLMs.

It thought it was something legal, spent an hour, and came back with a report completely off the rails.

And I was like, okay, I've got to monitor this thing a bit, and I don't want to.

You want to build it.

It goes the right way.

You don't want to sit there and babysit, right?

You don't want to babysit your agents.

With Deep Research query that you made, looking at the bad result, you probably figured out you needed to tweak your prompt.

Yeah, a bit.

That's that guardrail that you feed back into the codebase, or into your prompt, to further align the agent's execution.

Same sort of concepts apply there too.

When you talk, how are the customers feeling?

For Symphony, I think we have none, right?

That's the thing we have put out into the world.

That's even as internal, right?

As long as you're happy, you're the customer.

That's right.

Just let's the external view.

I say folks are very excited about this way of distributing software and ideas in cheap ways.

For us as users, it has again pushed the productivity 5x, which means I think there's something here that's like a durable pattern around removing the human from the loop and figuring out ways to trust the output.

The video that is shared here is the same sort of video we would expect the coding agent to attach to the PR that is created.

That's part of building trust in the system.

That's to me fundamentally what has been cool about building this.

It more closely pushes that persona of the agent working with you to be like a teammate.

I don't shoulder-surf you for the tickets that you work on during the week.

I would never think that I would want to do that.

Yeah, I wouldn't want a screen recording of your entire session in Cursor or Claude Code.

I would expect you to do what you think you need to do to convince me that the code is good and mergeable, and compress that full trajectory in a way that is legible to me.

The reviewer.

It's stuff.

And you can just do that.

Because Codex will absolutely sling ffmpeg.

You can take it around.

It's great.

Oh, ffmpeg is the OG, like, god CLI.

Yes.

I would strongly change it.

I used to say there's a SAS micro SAS that's called it.

In every flag.

Of ffmpeg?

Oh for sure.

For sure.

Just post it as a service.

Put a UI on it.

People who don't know ffmpeg will pay for it.

When we were first experimenting with this, it was a wild feeling to be at the computer with windows just popping up all over the place and getting captured, and files appearing on my desktop.

Very much felt like the future.

Yeah.

A thing controlling my computer for like actual productive use.

Like I'm just there keeping it like awake, jiggling the mouse that everyone saw.

That's the way that some office workers do.

So anybody of mouse jiggler.

That's right.

One thing like that.

Okay.

As stuff is so code is disposable.

A saying shoot off a budget for agents.

One question is, okay, are you always like a extra high thinking guy?

And where do you see spark?

So 5.3 Spark.

There's a lot of me wanting to make quick changes.

I'm not going to open up an IDE.

I'm not going to do anything.

But I will say, okay, fix this little thing.

Change a line.

Change a color.

Spark is great for that.

But am I still the model?

Like why don't I just let that go back and just riff on that?

Is there.

Spark is such a different model compared to the.

The extra high level reasoning that you get in these.

Yeah, three different people.

It is a different model of different architecture different.

Like it doesn't support it.

It's incredibly fast.

Small line model.

I have not quite figured out how to use it yet, to be honest.

I was faster.

I was adapting it to the same sorts of tasks I would use xhigh reasoning for.

Yeah.

And it would blow through three compactions before writing a line of code.

And that's another big thing with 5.4, right?

Million-token context.

Yes.

That's huge in agentic, right?

You can just run for longer before you have to compact.

The more tokens you can spend on a task before compacting.

Like the better you'll do.

That's right.

I'm not sure how to deploy Spark.

I think your intuition is right.

That it's very great for spiking out prototypes.

Exploring ideas quickly doing those documentation updates.

It is fantastic for us in taking that feedback and transferring it into a lint.

Where we already have good infrastructure for ESLints and the code base.

These sorts of things.

It's great.

And it allows us to unblock quickly doing those like anti-fragile healing tasks.

And the comments.

Yeah.

That makes sense.

So your push.

You guys are pushing models to the freaking limit.

What can current models not do well yet?

They're definitely not there on being able to go from new product idea to prototype.

Single one shot.

This is where I find I spend a lot of time steering: translating the end state of a mock for a net-new thing.

Right.

Think no existing screens, into a product that is playable with.

Similarly, while this has gotten better with each model release, the gnarliest refactorings are the ones that I spend the most time with.

Right.

The ones where I am interrupting the most.

The ones where I am now double clicking to build tooling to help decompose monoliths and things like that.

This is a thing I only expect to get better, right? Over the course of a month,

we went from the low-complexity tasks to, like, low-complexity and big tasks, in both these directions.

So this is what it means to not bet against the model, right?

You should expect that is going to push itself out into these higher and higher complexity spaces.

Yeah.

So the things we do are robust to that.

It just basically means I'll be able to spend my time elsewhere.

And figure out what the next model matters.

I do think it's also a bit of a different type of task right.

Codex is really good at codebase understanding working with code bases.

But companies like Lovable, Bolt, Replit, they solve a very different problem: scaffolding zero to one, right, idea to product.

And it's that there are people working on that.

Models are also pushing like step function changes there.

It's just different than the software engineers in today right.

Like I said, the model is isomorphic to myself.

The only thing that's different is figuring out how to get what's in here into context for the model.

And for these white space sort of projects, I myself.

I'm just not good at it.

Which means that often, over the agent trajectory, I realize the bits that were missing, which is why I find I need to have the synchronous interaction.

I expect with the right harness with the right scaffold that's able to tease that out of me.

Or refine the possible space right to be super opinionated around the frameworks that are deployed or to put a template in place.

Right, these are ways to give the model all those non-functional requirements, that extra context to anchor on and avoid that wide dispersion of possible outcomes.

Thank you for that.

I wanted to talk a little bit about Frontier.

Yeah, sure.

Overall, you guys announced it maybe like a month ago.

And there are a few charts in here. It's like your enterprise offering, is how I view it.

Is there one product or is there many?

I can't speak to the full product roadmap here, but what I can say is that Frontier is the platform by which we want to do AI transformation of every enterprise, from big to small.

And the way we want to do that is by making it easy to deploy highly observable, safe, controlled, identifiable agents into the workplace.

We want it to work with your company-native IAM stack.

We want it to plug into the security tooling that you have.

We want it to be able to plug into the workspace tools that you use.

So you're just going to be stripping specs.

Right.

We expect that there will be some harness things there.

Agents SDK is a core part of this, to enable both startup builders as well as enterprise builders to have a works-by-default harness that is able to use all the best features of our models, from the shell tool down to the Codex harness,

with file attachments and containers and all these other things that we know go into building highly reliable, complex agents.

We want to make that great and we want to make it easy to compose these things together in ways that are safe.

For example, right?

Like the gpt-oss-safeguard model, for example.

One thing that's really cool about it is it ships the ability to interface with a safety spec.

Safety specs are things that are bespoke to enterprises.

We owe it to these folks to figure out ways for them to instrument the agents in their enterprise to avoid exfiltration in the ways they specifically care about.

To know about their internal company code names, these sorts of things.

So providing the right hooks to make the platform customizable, but also mostly working by default for folks is the space we are trying to explore here.
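To make the "hooks" idea concrete, here is a minimal sketch of an enterprise-specific check on outbound agent text. It is purely illustrative: `SafetySpec`, `Verdict`, `check_outbound`, and the code names are hypothetical, and plain string matching stands in for what a policy-reading model would do. It is not the interface of gpt-oss-safeguard or of Frontier.

```python
import re
from dataclasses import dataclass, field


@dataclass
class SafetySpec:
    """Hypothetical per-enterprise spec: terms that must not leave the company."""
    internal_code_names: list[str] = field(default_factory=list)


@dataclass
class Verdict:
    allowed: bool
    matches: list[str]


def check_outbound(text: str, spec: SafetySpec) -> Verdict:
    """Flag outbound text that mentions any internal code name (case-insensitive, whole word)."""
    matches = [
        name
        for name in spec.internal_code_names
        if re.search(rf"\b{re.escape(name)}\b", text, flags=re.IGNORECASE)
    ]
    return Verdict(allowed=not matches, matches=matches)


# Made-up code names, for illustration only.
spec = SafetySpec(internal_code_names=["Project Bluebird", "Nimbus"])
print(check_outbound("Status update on project bluebird launch", spec))
print(check_outbound("Quarterly status update", spec))
```

The point of the sketch is the shape: the spec is data owned by the enterprise, and the check is a hook the platform calls, so each company can care about different things without changing the agent.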

Yeah, and the Snowflakes of the world.

Just need this, right?

Yeah, the Brexes of the world.

Stripes.

Yeah.

Makes sense.

I was going to go back to here.

I think the demo videos that you guys had were pretty illustrative.

It's also, to me, an example of very large scale agent management.

Yes, like you give people a control dashboard where, if you play any one of these multiple-agent things, you can dig down to the individual instance and see what's going on.

Yes, of course.

Well, who's the user?

Is it like the CEO, the CTO, CIO, something like that?

At least my personal opinion here, the buyer that we're trying to build product for here is, one, employees who are making productive use of these agents, right?

That's going to be whatever surfaces they appear in, the connectors they have access to, things like that.

Something like this dashboard is for IT, your GRC and governance folks, your AI innovation office, your security team, right?

The stakeholders in your company that are responsible for successfully deploying into the spaces where your employees work as well as doing so in a safe way that is consistent with all the regulatory requirements that you have and customer attestations and things like that.

So it is an iceberg beneath the actual, and it's great.

Every, I guess, layer in the UI is like going down a layer of abstraction in terms of the agent, right?

Yeah, yeah, I think it's good.

The ability to dive deep into the individual agent trajectory level is going to be super powerful, not only from, like, a security perspective, but also for someone who is accountable for developing skills.

One thing that was interesting that we also blogged about shipping was an internal data agent, which uses a lot of the Frontier technology in order to make our data ontology accessible to the agent and things like that, to understand what's actually in the data warehouse.

Yeah, semantic-layer type things.

I was pretty much part of that world.

Is this all?

I don't know.

It's actually really hard for humans to agree on what revenue is.

Yes, yes.

Yes.

What is an active user?

There's, what, five data scientists in the company that have defined this golden...

They are different, yeah.

And there's also internal politics as to attribution: I'm marketing, I'm responsible for this much, and sales is responsible for this much, and they all add up to more than a hundred.

And I'm like, well, you guys have different definitions.

And if you start out, everything is there.

So I think that's cool.

Oh, you guys blogged about this.

Okay, I didn't see this.

Yeah, is this the same team?

Is this what you're referring to?

Yes.

Okay.

We'll send people to read this.

This is our data agent.

A lot of the going to be doing this on.

Yeah, I don't know if you have any highlights.

I don't know.

Yeah, yeah, lots of homework for people.

No, but like, data as the feedback layer: you need to solve this first in order to have the product feedback loop closed.

That's right.

So the agents understand, and this is something that humans have not solved.

Like, and this is how you build.

Oh, yeah, right.

Agents that do more than coding, right?

Yeah.

To actually understand how you operate the business.

Yeah.

You have to understand what revenue is, what your customer segments are.

Yeah.

What your product lines are.

Moving back to the codebase that we described here for the harness:

One thing that's in core-beliefs.md is who's on the team.

What product we're building.

Who our end customers are.

Who our pilot customers are.

What the full vision of what we want to achieve over the next 12 months is.

These are all bits of context that inform how we would go about building the software.

Let me go.

So we have to give it to the agent too.
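As a rough illustration of handing that kind of context to an agent, here is a minimal sketch that pulls named sections out of a markdown file and assembles them into a prompt preamble. The section names, the placeholder contents, and the `split_sections` / `build_preamble` helpers are all assumptions made for this example; the episode does not describe the file's actual layout or how it is loaded.

```python
def split_sections(markdown: str) -> dict[str, str]:
    """Split a markdown document into {heading: body} using '## ' headings."""
    sections: dict[str, str] = {}
    current = None
    for line in markdown.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    return {name: body.strip() for name, body in sections.items()}


def build_preamble(markdown: str, wanted: list[str]) -> str:
    """Assemble the requested sections, in order, into a context block for an agent prompt."""
    sections = split_sections(markdown)
    missing = [name for name in wanted if name not in sections]
    if missing:
        raise KeyError(f"core beliefs file is missing sections: {missing}")
    return "\n\n".join(f"{name}:\n{sections[name]}" for name in wanted)


# Hypothetical stand-in for the file's contents; a real harness would read it from the repo.
CORE_BELIEFS = """\
# Core beliefs

## Team
(placeholder: who is on the team)

## Product
(placeholder: what we are building)

## Pilot customers
(placeholder: who is piloting it)

## 12-month vision
(placeholder: where this should be in a year)
"""

print(build_preamble(CORE_BELIEFS, ["Product", "Pilot customers"]))
```

Failing loudly on a missing section is a deliberate choice in this sketch: if the context an agent depends on has drifted, it is better to find out before the run than after.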

I'm guessing that stuff is like pretty dynamic and it changes over time too, right?

Like part of it was it's not just a big spec.

You have it as one of the things, and it will iterate.

One thing that I think is going to break your mind even more is we have skills for how to properly generate deep-fried memes and have reacji culture in Slack, because with Slack, the fact that you're able to use it in Codex, like, I can get the agent to shitpost on my behalf.

It's just it's part of humor.

Humor is part of AGI.

Is it funny?

And it's pretty good.

Yeah.

Okay.

Yeah.

It's pretty good.

I think humor is like a really hard intelligence test, right?

It's like you have to get a lot of context into like very few words.

This is funny.

It's like, 5.4 is such a big uplift for our community.

It's the meme-ing.

Yeah.

For sure.

Yeah.

Maybe when y'all are done here today, ask Codex to go over your coding agent sessions and to roast you.

Love it.

I'll give it a shot.

You're coming back to the final five one in the make is.

Yeah.

I think that there are multiple other like you guys are working on this.

But this is a pattern that every other company out there should adopt.

Regardless of whether or not they work with you.

To me, this is like, I saw this.

I was like, fuck, every company needs this.

This is multiple billion dollars.

What it takes to get people to yes.

Yeah.

Actually realize the benefits and distribute it.

I think it sounds boring to people, like, oh, that's for safeguards and whatever, but I think to handle agents at scale like you're envisioning here,

I don't know if it's like a real screenshot or like a demo, but this is what you need. It's the original sort of view of what Temporal was supposed to be. Like, you've really built this dashboard.

You basically have every long-running process in the company in one dashboard.

And that's it.

That's right.

Yeah, I think it's pretty customized towards every enterprise.

Like you care about different things.

There's a lot of customization.

But there will be multiple unicorns just doing this as a service.

I'm, like, very Frontier-pilled, if you can tell.

Amazing.

Love it.

It only clicked because obviously this came out first.

Then harness and then Symphony.

And it only clicked for me that, like, this is actually the thing you ship to do that.

Yeah.

There's a set of building blocks here that we assembled into these agents.

And the building blocks themselves are part of the product, right?

Yeah.

The ability to steer, to revoke authorization if a model becomes misaligned.

Like, all of this is accessible through Frontier.

And there's going to be a bunch of stakeholders in the company that have the things they need to see in the platform.

Yeah.

So we'll build all those into Frontier so that we can actually do the widespread deployment.

Yeah.

I'm also calling back to, there's this, like, levels of AGI.

I don't know if OpenAI is still talking about this, but they used to talk about five levels of AGI.

And one of them was like, oh, it's like an intern, the coding software engineer.

And at some point it was AI organization.

And this is it.

That's right.

This is level four or five.

I can't remember which level, but it's somewhere on that path was this.

You know how I mentioned that my team is having fun sprinting ahead here.

And we do this thing where we're collecting all the agent trajectories from Codex to slurp them up and distill them.

This is what it means to build our team level knowledge base.

It happens to be reflected back into the codebase, but it doesn't have to be that way.

And it doesn't have to be bound to just Codex.

I want ChatGPT to also learn our meme-ing culture and also the product we are building and how.

So that when I go ask it, it also has the full context of the way I do my work.

And I'm super excited for Frontier to enable this.

Yeah, amazing.

What did the model people say when they see you do this?

Like, you have a lot of feedback obviously.

You have a lot of usage, a lot of trajectories.

I don't, I don't imagine a lot of it's useful to them.

But some of it is.

But you have this too.

You deploy a billion tokens of intelligence a day.

And this was at the beginning of 2026.

Yeah, cooking.

Yeah.

There's this fundamental tension, which I think you have talked about.

Between whether or not we invest deeper into the harness or we invest deeper into the training process to get the model to do more of this by default.

Yeah.

And I think success for the way we are operating here means the model gets better taste because we can point the way there.

And none of the things we have built actively degrade agent performance, because really all they're doing is running tests.

And like running tests is a good part of what it means to write reliable software.

If we were building an entire separate raw scaffold around Codex to restrict its output, that I think would be like additional harness that would be prone to being scrapped.

But yeah, if instead we can build all the guardrails in a way that's just native to the output that Codex is already producing, which is code, I think there's no friction with how the model continues to advance.

But also like just good engineering.

And that's the whole point.
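One way to picture a guardrail that is "just code" is an ordinary test that enforces an architectural rule. The sketch below is an assumption-laden example, not the team's actual tooling: the layer names, the `ALLOWED_IMPORTS` table, and both helper functions are hypothetical. It uses Python's standard `ast` module to check that modules only import from layers they are allowed to depend on.

```python
import ast

# Hypothetical layering rule: each layer may import only from the layers listed for it.
ALLOWED_IMPORTS = {
    "ui": {"service"},
    "service": {"storage"},
    "storage": set(),
}


def imported_layers(source: str) -> set[str]:
    """Return the top-level package names imported (absolutely) by a piece of Python source."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module.split(".")[0])
    return found


def layering_violations(modules: dict[str, str]) -> list[str]:
    """Check each module (keyed as 'layer/file.py') against ALLOWED_IMPORTS."""
    violations = []
    for path, source in sorted(modules.items()):
        layer = path.split("/")[0]
        allowed = ALLOWED_IMPORTS.get(layer, set())
        # Only imports that name a known layer are subject to the rule.
        for target in sorted(imported_layers(source) & set(ALLOWED_IMPORTS)):
            if target != layer and target not in allowed:
                violations.append(f"{path}: {layer} must not import {target}")
    return violations


# Inline sources keep the sketch self-contained; a real check would read files from the repo.
good = {"ui/page.py": "from service.orders import list_orders\n"}
bad = {"storage/db.py": "import ui.page\n"}

assert layering_violations(good) == []
print(layering_violations(bad))
```

Because the rule runs as a normal test, an agent hits it the same way it hits any failing test and can fix the code, rather than having its output filtered by a separate wrapper.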

Yeah.

So I've had similar discussions with research scientists where the RL equivalent is on policy versus off policy.

Yeah.

And you're basically saying that you should build an on policy harness, which is already within distribution.

And you modify it from there.

But if you build off policy, it's not that useful.

That's right.

Super cool.

Are there any things that we haven't covered that we should get out there?

Just, I've been super excited to benefit from all the cooking that the Codex team has been doing.

They absolutely ship relentlessly.

This is one of our core engineering values: ship relentlessly.

And the team there embodies it to an extreme degree.

You know, to have 5.3 and then Spark and 5.4 come out within what feels like a month is just phenomenally fast.

This is exactly a month that goes five three.

Four.

Yeah.

I mean, do we have every month now?

It's five five.

Thanks.

Exactly.

I can't say that.

The Polymarkets will be very upset.

I think it's interesting that it's also correlated with the growth.

They announced two million users.

But, like, I almost don't care about coding anymore.

This is it.

This is the game, man.

It's like, coding: cool, solved.

Like, knowledge work.

That's right.

This is the thing to chase after.

Yeah.

And this is one of the things that my team is excited to support.

The whole, like, self-hosted harness thing working, which you have done, and, like, the rest of us are trying to figure out how to catch up.

But then do things.

You know, right with it.

Do things.

That's right.

You can just do things.

That's the line for the episode.

So that's it.

Any other calls to action?

You're based in Seattle.

A team of your thing.

New Bellevue office.

We just had the grand opening yesterday as of the recording date, which was fantastic.

Beautiful building.

Super excited to be part of the Bellevue community, building the future in Washington.

And I would say that there is lots of work to be done in order to successfully serve enterprise customers here in Frontier.

We are certainly hiring.

And if you haven't tried the Codex app yet, please give it a download.

We just passed two million weekly active users growing at a phenomenally fast rate.

Twenty-five percent week over week.

So come join us.

Yes.

I think that's an interesting.

My final observation.

So, OpenAI is a very SF-centric company.

I know people who have been who turned down the job or didn't get the job because they didn't want to move to SF.

And now they just don't have a choice.

You have to open London and you have to open Seattle.

And I wonder if that's going to be a shift in the culture.

Obviously, you can't say.

But I was one of the first engineering hires out of our Seattle office.

Yeah.

So I was very natural.

The success has been part of what I have been building toward.

And it has grown quite well.

We have durable products and lines of business that are built out of there.

A ton of zero-to-one work happening as well, which is the core essence of the way we do applied AI work at the company: to sprint after it, to figure out where we can actually successfully deploy the model.

Yes.

100 percent.

We also have a New York office too.

Does it have an engineering presence?

Yeah.

Exactly.

That's at least on my roadmap for AIE.

Wherever people hire engineers, I will go.

That's right.

Azaram.

So the cool office in New York is the old REI building.

I believe the REI office.

Yes.

No, you will never be as big.

New York is, you can't get the size of office that they need.

The New York office.

Seattle has a very...

Mad Men.

That's the vibe.

It's a beautiful.

The Bellevue one is very green.

Gold fixture is very Pacific Northwest.

It's very cool.

There's a lot of people who are like there for people like New York.

They want to be in New York, right?

Yeah.

We have a fantastic workplace team that has been building out these offices.

It really is a privilege to work here.

Yeah.

Excellent.

Okay.

Thank you for your time.

You've been very generous and you've been cooking.

So I'm going to let you get back to cooking.

It's been amazing chatting with you folks.

Happy Friday.

Thank you.