Harness engineering: shipping software at scale with agents

We've spent a couple of years in a public conversation about coding agents that is almost entirely focused on the wrong place: the model. Which version, which benchmark, which context window, which provider, which price per million tokens. It's a legitimate conversation, but it leaves out what actually explains why some teams are shipping software with agents at a scale that's hard to believe and other teams are still treating them as a glorified autocomplete. The difference isn't in the model. It's in everything around the model.

The community has given that "everything else" a name that deserves to stick: the harness. The term emerged around LangChain — with the formula "Agent = Model + Harness" — and was articulated in more detail by Birgitta Böckeler in Harness Engineering (Thoughtworks, published on martinfowler.com). This piece is about why the harness matters more than the model, what shape it takes when it's done right, and where to start if your team already has experience with agents but feels like they're not getting the most out of them.

The arithmetic that changes everything

Before getting into it, three data points are worth putting on the table.

Stripe merges more than 1,000 pull requests per week generated by agents. The codebase has hundreds of millions of lines, in Ruby with Sorbet, with internal libraries no public model knows about. And still, the agents work.

OpenAI built an internal product with a million lines of code and 1,500 PRs in five months, with a team that grew from 3 to 7 engineers, under an absolute rule: zero lines written by hand. None. Not the logic, not the tests, not the CI configuration, not the documentation, not the tooling scripts. All Codex.

Geoffrey Huntley describes a pattern he calls the "Ralph loop" — a tiny, monolithic orchestrator that runs one iteration, observes the result, and starts over — and presents it not as a trick, but as the most important primitive of agentic work.
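The shape Huntley describes fits in a few lines. This is a minimal sketch, not his implementation: `run_agent` and `check` are hypothetical stand-ins for your agent call and your harness's sensors (tests, linters), not a real API.

```python
def ralph_loop(task, run_agent, check, max_iterations=10):
    """Run one fresh attempt at a time until the sensors pass."""
    feedback = None
    for _ in range(max_iterations):
        result = run_agent(task, feedback)  # one full iteration, from scratch
        ok, feedback = check(result)        # sensors observe the outcome
        if ok:
            return result                   # the harness is satisfied
    raise RuntimeError("sensors still failing after max_iterations")
```

The design choice worth noticing: each iteration starts clean, and the only thing that persists between runs is what the sensors report back.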

If the only thing separating you from these teams were the model, we'd all have their numbers. We don't. What they have and most don't is internal infrastructure around the model: disposable sandboxes, structured context, automatic validators, fast feedback loops, agents reviewing other agents, and a mindset that treats the agent as a system to be engineered, not as a magic box to be prayed to.

The thesis, in one sentence

The model is the engine. The harness is what turns an LLM into an agent a team can depend on.

The model is interchangeable. A new one comes out every six months, all similar in what matters. The harness isn't. The harness is something you build, it evolves with your project, it captures your team's knowledge, and it compounds on itself months down the road. It is, bluntly, where your competitive advantage lives. The model is where the commodity lives.

This is, deep down, good news. You don't control the model — a vendor controls it, and they're going to optimize it for everyone equally. You do control the harness. And like almost everything you control in engineering, it improves with intent, with discipline, and with a team that understands what it's building and why.

What exactly is a harness

Böckeler breaks the harness down into two mechanisms, and the distinction is the most useful tool you'll take away from this piece:

  • Guides (feedforward). Hints that orient the agent before it acts. Conventions, templates, schemas, module generators, readable architecture documents, AGENTS.md as a map of the repo. They tell the agent how to proceed.
  • Sensors (feedback). Guardrails that act after the fact. Tests, type checkers, linters, ephemeral observability, evals, other agents reviewing. They detect when the agent has gone off the path with enough speed and precision for the harness to correct it without your intervention.

A harness without guides produces an agent that drifts. A harness without sensors produces an agent that drifts and never finds out. The two together, well calibrated, produce something that starts to look like a collaborator.
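To make the sensor half concrete, here is a toy example: a custom lint check that fails when code touches a module the team has decided to retire, with the error message addressed to the agent. The module name, the replacement, and the doc pointer are all illustrative, not from the article; the point is that the rule is versioned and mechanical.

```python
import re

# Illustrative rule: flag any import of a hypothetical deprecated module.
FORBIDDEN = re.compile(r"^\s*(from|import)\s+legacy_billing\b", re.MULTILINE)

def lint(source: str, path: str = "<memory>") -> list[str]:
    """Return one agent-addressed error message per violation."""
    errors = []
    for match in FORBIDDEN.finditer(source):
        line = source.count("\n", 0, match.start()) + 1
        errors.append(
            f"{path}:{line}: do not import legacy_billing; it is being "
            "deprecated. Use billing.v2 instead (see AGENTS.md, 'Billing')."
        )
    return errors
```

Wired into CI, a check like this is a sensor in Böckeler's sense: it fires after the agent acts, fast enough and precisely enough that the correction happens without a human in the middle.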

Every time your agent fails, you should be able to say whether the failure is due to a missing guide or a missing sensor. If you can't, you don't have a harness. You have prompts and luck.

The question worth learning to ask

There is one concrete question worth more than any framework. When the agent fails — and it will fail — most teams ask: "what do I tell the agent, in this chat, so this doesn't happen?" It's the wrong question. It leads to ephemeral instructions that evaporate when you close the session, micro-corrections that don't get recorded anywhere, a prompt-tweaking spiral that accumulates nothing.

The right question is this:

What is missing from the harness so that this failure becomes impossible, or so that it gets detected automatically when it happens?

The answer can be many things: a new section in AGENTS.md or in a repo doc, an updated plan template, a new linter, an inferential sensor that checks for a specific pattern, a restructuring of the context the agent receives, a missing isolation layer. Any of these is valid — including the ones that touch versioned prompts, because a prompt committed to the repo is harness; a prompt in the chat that disappears is not. What distinguishes a good answer from a bad one isn't whether it touches prompts or code, but whether the lesson gets recorded somewhere the agent will read next time. And the important part: once you implement it, that specific failure doesn't happen again. Ever. Across thousands of future runs of the agent.

It's the same mental shift that distinguishes a mature platform team from one still firefighting. Not "how do I teach this team not to break this in this conversation", but "how do I make it impossible (or trivially detectable) for them to break this in any future conversation". The difference is that with agents the ROI is brutal: every hour invested in the harness improves every future run, and future runs are many more than you'd have with a human team.

Rigor doesn't disappear, it relocates

There's an observation from Chad Fowler worth having in front of you before you start dismantling practices. Every time software has taken a major leap — dynamic languages, extreme programming, continuous deployment — someone has announced "we don't need so much rigor anymore". And every time they've been wrong. Rigor doesn't disappear; it relocates. Dynamic languages moved it to the test suite; XP moved it to continuous integration; continuous deployment moved it to observability and automatic rollback.

The same thing happens with agents. Rigor moves out of "writing every line carefully" and "human-to-human review of every PR", and into two other places:

  • Specification of intent. What used to be informal criteria in a ticket now has to be mechanically readable, because that's what the agent is going to interpret. Imprecision in the ticket becomes imprecision in the code.
  • Verifiable evaluation. If you can't mechanically evaluate whether the agent's output is correct, you don't have a harness, you have a roulette wheel.

The heuristic worth adopting: when something seems to let go of rigor, look for where it relocated. If you can't find it, worry. Rigor is either somewhere else or it's been lost.

This explains an observation worth keeping in front of you: repos with codified discipline work better with agents than repos with cultural discipline, even when both produce human code of equivalent quality. Codified discipline means aggressive linters, schemas validated at the edge, fast tests, versioned docs. Cultural discipline means careful reviews, attentive seniors, unwritten conventions. The reason is straightforward: the agent can't access the culture. It only sees what the code makes mechanically clear.
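As an illustration of "schemas validated at the edge", a minimal sketch: malformed input is rejected at the boundary with a precise message, so nothing downstream, human or agent, has to guess what shape the data has. The field names are invented for the example.

```python
def validate_event(payload: dict) -> dict:
    """Validate an inbound event at the edge; raise with a precise message."""
    required = {"id": str, "amount": int, "currency": str}  # illustrative schema
    for field, expected in required.items():
        if field not in payload:
            raise ValueError(f"missing required field '{field}'")
        if not isinstance(payload[field], expected):
            raise ValueError(
                f"field '{field}' must be {expected.__name__}, "
                f"got {type(payload[field]).__name__}"
            )
    return payload
```

A check like this is codified discipline in miniature: the invariant lives in the code, where the agent can see it, instead of in a reviewer's head.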

What the agent can't see doesn't exist

There's a line from OpenAI's Codex team worth painting on the wall: "what the agent can't see doesn't exist". It's a literal statement about how a coding agent operates, and it has a consequence that changes how you see the repository.

In human teams, knowledge lives distributed: in the README yes, but also in Slack, in a Google Doc linked from some email, in a senior's head, in a hallway conversation. This distribution works — badly, but it works — because humans know how to ask and cross-reference sources.

An agent doesn't. The agent only sees what's in its effective context at the moment of acting. If the decision "don't touch this module, we're going to deprecate it" lives only in a Slack message from March 14, for the agent that decision doesn't exist. It's not that it ignores it: it's that it isn't accessible.

The consequence is liberating once you accept it: everything you want the agent to know has to be materialized in the repo or mechanically accessible from it. Not "documented somewhere". Materialized, in markdown, indexed, maintained by linters and by recurring agents that detect stale documentation.
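One mechanical piece of that maintenance can be sketched directly: a freshness sensor that flags file paths mentioned in a doc that no longer exist in the repo. The backtick-path convention it scans for is an assumption of this sketch, not something the article prescribes.

```python
import re

# Assumed convention: docs reference repo files in backticks, e.g. `billing/v2.py`.
PATH_REF = re.compile(r"`([\w./-]+\.(?:py|rb|ts|md))`")

def find_dead_references(doc_text: str, existing_paths: set[str]) -> list[str]:
    """Return paths referenced in the doc that are missing from the repo."""
    refs = PATH_REF.findall(doc_text)
    return sorted({r for r in refs if r not in existing_paths})
```

Run on a schedule, a check like this turns "stale documentation" from a cultural worry into a sensor reading.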

And here's a side effect almost nobody anticipates: by making the repo legible to an agent, you make it enormously more legible to humans too. The historical investment in developer experience pays dividends for the agent, and the investment in the agent pays dividends for the humans joining the team. It's the virtuous loop Stripe sums up best: "what's good for humans is good for agents".

Three modes of delegation, and only one scales

Kief Morris articulates a distinction that avoids half the unproductive discussions there are about human oversight. There are three ways of putting humans in the loop with agents:

  • Humans Outside the Loop ("vibe coding"). The human defines the outcome and watches what comes out. It works for prototypes. It fails in serious production.
  • Humans In the Loop. The human reviews every artifact, line by line. It's arithmetically unsustainable when the agent's throughput goes up. The human becomes the bottleneck.
  • Humans On the Loop. The human supervises the system, not the artifacts. When something goes wrong, they don't fix the artifact: they modify the system that produced the artifact.

The only mode that scales months down the road is "on the loop". "In the loop" is the regime a conservative team lands on by default and where it gets stuck. The healthy transition is from "in" to "on", investing in the harness until system-level supervision becomes possible. The transition to "outside" — skipping the harness altogether — is where most teams get burned trying to shortcut the middle phases.

The kind of work this asks for

When you accept all of the above, the engineer's work changes. You're still at the center of the system, but the center has a different shape:

  • Before, you wrote code. Now you design environments where the right code can write itself.
  • Before, you reviewed PRs. Now you design the rules, sensors and reviewers that review PRs.
  • Before, you understood the domain to implement it. Now you understand the domain to specify it in a way the agent can implement correctly.
  • Before, you fixed bugs. Now you diagnose why the agent made the mistake and modify the harness so it doesn't happen again.

The skills that gain value are the ones that already distinguished good seniors: technical taste, systems thinking, clarity in specifying, judgment on when to invest in infrastructure. The ones that lose relative value are the purely artisanal ones: typing fast, memorizing APIs, mastering exotic syntax.

There's an awkward moment in this transition — if I'm not writing code anymore, what am I? — and it's worth naming instead of pretending it doesn't exist. The honest answer is that you're still an engineer, but the medium has changed. Before, you produced code; now you produce the conditions that make it possible for the right code to exist. It's engineering of a higher order, not a lower one.

Where to start on Monday

If you've made it this far and you're interested in moving from theory to practice, there's a sequence that works better than most alternatives:

  1. Start with isolation. Disposable sandboxes, cheap to create, cheap to throw away. This is the investment that multiplies all the others; without it, the next steps yield half as much.
  2. Materialize context in the repo. Move the important stuff living in Slack, Google Docs, or heads into versioned markdown. AGENTS.md as a 100-line index, not as a thousand-line encyclopedia.
  3. Codify the invariants you care most about. Custom linters with messages addressed to the agent. Dependency directions mechanically validated. Schemas at the edge.
  4. Promote every recurring failure to the harness. If the failure is obviously systemic from the start, you convert it into a guide or a sensor immediately. If it seems like a one-off, you fix it with minimum spend and note it down — but the second time there's no decision to make: you stop and promote it, because now you know it's a pattern. There's no well-managed third time.
  5. Design the PR flow so most PRs are reviewed agent-to-agent, and reserve human attention for cases where judgment adds real value.
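Step 5 can be sketched as a routing rule: most PRs go through agent review first, and human attention is reserved for changes where judgment adds real value. The risk signals here (diff size, sensitive paths) are illustrative; a real harness would derive its own.

```python
def route_review(diff_lines: int, touches_sensitive_path: bool) -> str:
    """Decide who reviews a PR first: another agent or a human."""
    # Illustrative thresholds; tune these to your own repo's risk profile.
    if touches_sensitive_path or diff_lines > 500:
        return "human"
    return "agent"
```

The interesting property is the default: the cheap path is agent-to-agent, and escalation to a human is the exception that has to be earned by risk.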

You don't have to do everything at once. Start with steps 1 and 2; the rest follow naturally once the first two are in place.

The close

The line that separates the teams doing remarkable things with agents from the ones still frustrated is a line of investment in infrastructure. Not talent, not model, not budget: the decision to treat the agent as a system to be engineered, not as a tool to be prompted nicely.

Models are going to keep improving. That's certain. What's less clear is how many teams will take advantage of them. The ones investing in their harness while others wait for the next model are going to arrive at the next model with a multiplier on top. The ones waiting for the next model are going to discover that the problem wasn't the model and are going to go back to waiting for the next one.

If this read resonated, chapters 1 through 11 of the guide develop each of the ideas in more detail: the mindset (ch. 1–2), the pillars of guides and sensors (ch. 3–4), the concrete practice of environments, context, architecture and PR flow (ch. 5–8), and sustainable maintenance and anti-patterns (ch. 9–11). Read them as standalone essays; no order required. What is required is starting.


This piece synthesizes ideas from Birgitta Böckeler (Thoughtworks), Chad Fowler, Geoffrey Huntley, Kief Morris, Ryan Lopopolo (OpenAI Codex team) and Alistair Gray (Stripe Leverage team, Minions). The term "harness" in this sense was popularized around LangChain. URLs in the sources section of the guide's README.