In the past few weeks, a number of "autonomous background coding agents" have been launched.
- Supervised coding agents: Interactive chat agents that are driven and steered by a developer. They create code locally, in the IDE. Tool examples: GitHub Copilot, Windsurf, Cursor, Cline, Roo Code, Claude Code, Aider, Goose, …
- Autonomous background coding agents: Headless agents that you send off to work autonomously through a whole task. Code gets created in an environment spun up exclusively for that agent, and usually results in a pull request. Some of them are also runnable locally though. Tool examples: OpenAI Codex, Google Jules, Cursor background agents, Devin, …
I gave a task to OpenAI Codex and a few other agents to see what I could learn. The following is a report of one particular Codex run, to help you look behind the scenes and draw your own conclusions, followed by some of my own observations.
The task
We have an internal tool called Haiven that we use as a demo frontend for our software delivery prompt library, and to run some experiments with different AI assistance experiences on software teams. The code for that tool is public.
The task I gave to Codex was about the following UI issue:
Actual:
What I wanted from the agent:
Prompt: When we create the filter labels on the dashboard, we create human readable labels based on the categories we have. "client-research" is turned into "Client-Research", "deliveryManagement" into "DeliveryManagement". Improve the sophistication of the "category-to-human-readable" logic so that the labels appear to the user as "Client Research" and "Delivery Management"
This task is
- Relatively small
- A typical "cosmetics" task that is not super urgent and can get deprioritised again and again
Agent logs
I'm sharing the detailed logs here so that you can get an impression of what's going on under the hood.
Here's what the logs view looks like in Codex:
Gathering information first
The first 17 log entries were all about reading AGENTS.md and README.md:
- "I need to check the AGENTS.md in the repository root for instructions."
- looks for AGENTS.md, but I had actually called it AGENT.md, which led to some back and forth until it figured that out
- reads README.md
- Looks at .pre-commit, finds a test setup
The following tables list the rest of the log messages verbatim. After each of these messages, the agent executed a command; for some of them I added that exact command for context. I'm sure the log messages are just summaries of the tokens that were actually produced behind the scenes, because the messages are way too short to represent the full "thinking" the model must have been going through.
Searching for the relevant code with lots of grep
Message | Command (samples) | Notes
---|---|---
I need to search for "category-to-human-readable" to locate where to make changes for the frontend labels. | `grep -R "human readable" -n` |
I'll look for any mentions of "category" in the project. | |
Grep output is huge because of node_modules; I'll limit it to src instead. | |
I'll search for "to human" in the src directory to find relevant references. | |
I'll open and check index.js to see what's inside. | |
I'm checking if the category-to-human-readable transformation is handled in _navigation_items or elsewhere. | |
I'm going to look up "human readable" to get more information on the topic. | `grep -R "human-readable" -n ui/src \| head` |
I wonder if there's a function for converting categories into something more readable. | |
Filtering the search to just the "src" folder to avoid node_modules noise. | | second time it's running into this, didn't 'remember'
I'll look for a function like "toHumanReadable". Let's see what I find. | |
I'm checking the whole repo in case the logic is in the backend or a library. | `grep -R "Human" -n \| head` |
I'll limit the scope to app/ and ui/src to avoid the huge node_modules folder. | | third time it's running into this, didn't 'remember'
I think the logic might be missing, so I'll check _navigation_items and other modules for categories. | `grep -R "human" app -n \| head` |
I think it's time to look a bit deeper. Let's dive in and explore. | `grep -R "human" app \| wc -l` | To "look deeper", it searches for "human" again, but pipes it into wc instead of head?
Finally, it's getting closer! The labels were previously crudely capitalized via CSS, and it finds that:
Message | Command (samples) | Notes
---|---|---
I'm checking for references to "capitalize" or category transformation in the frontend. | |
I'll check around the Tag component in index.js. | |
Tag display capitalizes words incorrectly because of hyphens or camelCase; I'll look at filter categories. | |
Now it's starting to check for more relevant code in the app folder, where the backend is. It doesn't find anything, which is correct, there's nothing relevant there:
Message | Command (samples) | Notes
---|---|---
I'll check the repo for how categories are formatted, focusing on hyphens and camelCase. | `grep -R "categories" -n app \| head` |
Time to try a new search! | `grep -R "client" -n app \| head` |
I'll take a closer look at the prompts in the test data. | `grep -R "categories:" -n app/tests \| head` | Why?
It looks like the test data has limited categories. I'll check the docs for any variations. | `grep -R "client-research" -n` |
I'll start by opening the files to take a closer look. | `sed -n '1,100p' ui/src/app/_navigation_items.js` |
Now it's going back to searching in the ui folder – it found the CSS previously, and is now looking for existing code that it can reuse to solve the task. That's great, because we do indeed already have a similar function, and it finds it:
Message | Command (samples) | Notes
---|---|---
I need a function to convert tags like "client-research" to "Client Research" for filters. | `grep -R "to.*Readable" -n ui/src \| head` |
I'm opening the toReadableText function inside _dynamic_data_renderer.js to explore it. | |
I'm moving through the page to get the details I need. | |
It now finds a gap in the existing function's implementation:
Message | Command (samples) | Notes
---|---|---
I see that hyphens aren't handled by the function. I need to adjust it for hyphens. | |
I'll modify toReadableText to handle hyphens, ensuring proper capitalization and spacing. Let's check tests! | | It actually didn't check any tests
I'll check _dynamic_data_renderer.js to examine toReadableText and modify it for hyphen-case. | |
I'll check tests/_sidebar.test.js to review the related categories there. | |
I'll check other files to see if toReadableText appears anywhere. | |
I'll check if _sidebar or others contain a toReadableText function. | |
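As context for what that gap looks like: a conversion that only splits camelCase will at best turn "client-research" into "Client-Research". A minimal sketch of a toReadableText that covers both camelCase and hyphen-case could look like the following – the function name comes from the logs, but the body is my own illustration, not Haiven's actual implementation:

```javascript
// Illustrative sketch only – the real toReadableText lives in
// _dynamic_data_renderer.js (per the logs); this is not its actual body.
export function toReadableText(technicalKey) {
  if (!technicalKey) return "";
  return technicalKey
    .replace(/([a-z])([A-Z])/g, "$1 $2") // "deliveryManagement" -> "delivery Management"
    .replace(/-/g, " ") // "client-research" -> "client research"
    .replace(/\b\w/g, (char) => char.toUpperCase()); // -> "Client Research"
}
```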
Update the test for the function that it wants to change:
Message | Command (samples) | Notes
---|---|---
I'll adjust the function to handle hyphen-case, then update tests to reflect that. | |
I'll check the test file after line 36 to make sure hyphen-case is covered. | |
I'll scroll through lines 60 to 120 to examine the rest of the file. | |
I'm scrolling through the rest to capture all the relevant information. | |
I'll add tests for hyphen-case and update toReadableText, along with modifying index.js for categories. | |
I'll adjust _dynamic_data_renderer.js and index.js to use toReadableText and update tests. | `git status --short` |
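For illustration, the kind of test addition the agent describes might look like this vitest sketch (the file and function names come from the logs; the import path and concrete assertions are my assumptions, not the actual diff from the pull request):

```javascript
// Hypothetical addition to _dynamic_data_renderer.test.js – a sketch of
// the hyphen-case coverage the agent talks about adding.
import { describe, expect, it } from "vitest";
import { toReadableText } from "./_dynamic_data_renderer";

describe("toReadableText", () => {
  it("converts hyphen-case keys into readable labels", () => {
    expect(toReadableText("client-research")).toBe("Client Research");
  });

  it("still converts camelCase keys", () => {
    expect(toReadableText("deliveryManagement")).toBe("Delivery Management");
  });
});
```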
You can configure some environment startup commands in Codex; I had configured that to be yarn install, which led to the following:
Message | Command (samples) | Notes
---|---|---
There are untracked .yarn/ and .yarnrc files. I'll check and clean them up. | |
I'll check the root of 'ui' and remove any unnecessary Yarn 4 entries. | |
I'll look at the commit history to see if these .yarn files were added later. | `git log --oneline -n 1` |
I'll review the repository's .gitignore to confirm whether Node modules are excluded. | `grep yarn .gitignore` |
I see no entry for .yarn or .yarnrc. I'll check and remove them if needed. | `cat ui/.gitignore` |
I'll check if I can remove .yarn or .yarnrc files since they shouldn't be committed. | |
It looks like yarn.lock changes were made. Time to reset to HEAD. | |
Actual code changes:
Message | Command (samples) | Notes
---|---|---
Time to modify the code. I'll begin the necessary updates now. | `sed -n '1,30p' ...` |
I'm going to add a hyphen to the necessary part of the code. | |
I'm going to make changes to the index.js file now. | |
I'll import toReadableText and start using it in the code. | |
I'll update index.js to import toReadableText and change the _dynamic_data_renderer tests. | |
I need to replace {tag} with {toReadableText(tag)} for better rendering. | |
I'll look at the differences to better understand the changes. | `git diff` |
I'll review the changes in index.js to see what's been modified. | |
I'll add a new test for hyphenated cases in _dynamic_data_renderer.test.js. | |
I'll consider inserting a test after line 35. | |
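The core frontend change is tiny. Conceptually, the swap in index.js looks something like this – only the replacement of {tag} with {toReadableText(tag)} is taken from the logs; the surrounding markup, including the assumption that Tag is Ant Design's component, is my own sketch:

```jsx
// Illustrative sketch of the change in index.js – only the swap from
// {tag} to {toReadableText(tag)} is taken from the logs.
import { Tag } from "antd"; // assumption: the Tag component the logs mention
import { toReadableText } from "../app/_dynamic_data_renderer"; // assumed path

const FilterLabels = ({ tags }) => (
  <>
    {tags.map((tag) => (
      // before: <Tag key={tag}>{tag}</Tag>
      <Tag key={tag}>{toReadableText(tag)}</Tag>
    ))}
  </>
);
```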
Running our code linter next, good:
Message | Command (samples) | Notes
---|---|---
I'll format the code using Prettier as instructed. | `cd ui && npx prettier -w ...` | Prettier is mentioned in the AGENTS.md. It seems to run prettier successfully, though none of the files were changed
I'll review the logs to gather more information. | |
Trying to run the tests – unsuccessfully:
Message | Command (samples) | Notes
---|---|---
I'll run the yarn tests to check if everything's working fine. | |
Tests might've been cached or need network; output's empty, so I'll check the log. | |
The lockfile isn't updated. I'll try running 'yarn install' since the environment allows it. | |
The environment is offline, so yarn install can't be run, but I see a node_modules folder. | |
I see node_modules are already installed. I'll check the log for why tests failed. | |
The error happens because Yarn 4 relies on .yarn and .yarnrc, which we removed. | | ?
It looks like .yarnrc.yml isn't tracked. I'll also check .yarn for tracking. | |
I'm double-checking the package.json to see if "haiven-ui" is the correct name. | | ?
I'll check the Node version to confirm if it's 18 or 20 by running node -v. | | weird, the node version specified in package.json is 22.6
I think switching to npx vitest run could fix the error. Let's try that. | |
The output is long; now I'll grab the last summary lines. | | Error: Not implemented: window.getComputedStyle(elt, pseudoElt)
The output is too long, so I'll extract the key points for a summary. | |
Going off on a tangent about setting up the venv, which is completely unnecessary because it is only planning changes to frontend code:
Message | Command (samples) | Notes
---|---|---
The tests failed because of case differences, but now I'll activate the virtual environment. | | AGENTS.md contains instructions to activate the venv for the Python backend, but we're in the frontend here, so…
Hmm, the .venv isn't here. I should check if it needs creation or already exists. | |
It looks like the .venv directory is missing. Let's review AGENT's instructions carefully. | |
The environment seems offline, but there might be a script to set up dependencies locally. | |
I can't run "poetry install" offline, but I'll check the .cursor folder for dependencies. | |
I can't run backend tests with the current setup, but I'll try pytest anyway. | |
Gave up on the tests, now wrapping up the changes:
Message | Command (samples) | Notes
---|---|---
Tests failed because of missing dependencies. Now I'll check status and proceed with committing. | |
I think the .yarnrc.yml was left over from a previous yarn install. Removing it. | |
I'll grab the relevant file lines for the citations now. | |
I'll get the surrounding line range for the changes in index.js. | `nl -ba ui/src/pages/index.js \| sed -n '12,28p'` |
I'll grab the relevant lines around those changes now. | |
It's time to finalize the PR message and verify the commit. | |
UI tests failed because of environment issues. I'll reference the last failure lines. Done! | |
Here is the full pull request created by Codex.
Observations
Brute text search
I found it interesting to see that Codex, and most of the other coding assistants I'm using, are resorting more and more to simple text search to find relevant code. In the log above you can see how Codex goes through a bunch of search terms with grep to find relevant pieces of code ("human", "human readable", "humanReadable", …). The reason I find this interesting is that a lot of seemingly more sophisticated code search mechanisms have been implemented, like semantic search over codebase indices with vectors / embeddings (Cursor, GH Copilot, Windsurf), or using the abstract syntax tree as a starting point (Aider, Cline). The latter is still quite simple, but doing text search with grep is the simplest possible approach.
It seems like the tool creators have found that this simple search is still the most effective after all? Or are they making some kind of trade-off here between simplicity and effectiveness?
The remote dev environment is crucial for these agents to work "in the background"
Here is a screenshot of Codex's environment configuration screen (as of end of May 2025). As of now, you can configure a container image, environment variables, secrets, and a startup script. They point out that after the execution of that startup script, the environment no longer has access to the internet, which sandboxes the environment and mitigates some of the security risks.
For these "autonomous background agents", the maturity of the remote dev environment that is set up for the agent is crucial, and it's a tricky challenge. In this case, for example, Codex didn't manage to run the tests.
And it turned out that when the pull request was created, there were indeed two tests failing because of a regression, which is a shame: if it had known, it would have easily been able to fix them, it was a trivial fix:
This particular project, Haiven, actually has a scripted developer safety net, in the form of a quite elaborate .pre-commit configuration. It would be perfect if the agent could execute the full pre-commit before even creating a pull request. However, to run all of the steps, it would need
- Node and yarn (to run UI tests and the frontend linter)
- Python and poetry (to run backend tests)
- Semgrep (for security-related static code analysis)
- Ruff (Python linter)
- Gitleaks (secret scanner)
…and all of those have to be available in the right versions as well, of course.
Figuring out a smooth experience to spin up just the right environment for an agent is crucial for these agent products, if you want to really run them "in the background" instead of on a developer machine. It's not a new problem, and to an extent a solved problem; after all, we do this in CI pipelines all the time. But it's also not trivial, and at the moment my impression is that environment maturity is still a challenge in most of these products, and the user experience to configure and test the environment setups is as frustrating, if not more so, as it can be for CI pipelines.
Solution quality
I ran the same prompt 3 times in OpenAI Codex, 1 time in Google's Jules, and 2 times locally in Claude Code (which isn't fully autonomous though, I needed to manually say "yes" to everything). Even though this was a relatively simple task and solution, it turns out there were quality differences between the results.
Good news first: the agents came up with a working solution every time (leaving the breaking regression tests aside, and to be honest I didn't actually run every single one of the solutions to confirm). I think this task is a good example of the types and sizes of tasks that GenAI agents are already well positioned to work on by themselves. But there were two aspects that differed in terms of solution quality:
- Discovery of existing code that could be reused: In the log here you'll notice that Codex found an existing component, the "dynamic data renderer", that already had functionality for turning technical keys into human readable versions. In the 6 runs I did, the respective agent found this piece of code only 2 times. In the other 4, the agents created a new file with a new function, which led to duplicated code.
- Discovery of an additional place that should use this logic: The team is currently working on a new feature that also displays category names to the user, in a dropdown. In one of the 6 runs, the agent actually discovered that and suggested changing that place as well to use the new functionality.
Found the reusable code | Went the extra mile and found the additional place where it should be used
---|---
Yes | Yes
Yes | No
No | Yes
No | No
No | No
No | No
I put these results into a table to illustrate that in each task given to an agent, we have multiple dimensions of quality, of things that we want to "go right". Each agent run can "go wrong" in one or more of these dimensions, and the more dimensions there are, the less likely it is that an agent gets everything done the way we want it.
Sunk cost fallacy
I've been wondering – let's say a team uses background agents for this type of task, the kinds of tasks that are sort of small, and neither important nor urgent. Haiven is an internal-facing application, and has only two developers assigned at the moment, so this type of cosmetic fix is definitely considered low priority, as it takes developer capacity away from more important things. When an agent only sort of succeeds, but not fully – in which situations would a team discard the pull request, and in which situations would they invest the time to get it the last 20% of the way there, even though spending capacity on this had been deprioritised? It makes me wonder about the tail end of unprioritised effort we might end up with.