1. Why the harness matters more than the model¶
The public conversation about coding agents almost always revolves around the model: which version, which benchmark, which context window. It's the wrong question for a team that already uses these tools daily. The useful question is a different one: what's around the model?
The community calls that "around" the harness: the set of prompts, tools, rules, sandboxes, validators and feedback loops that surround the LLM when it acts as an agent. The term emerged around LangChain ("Agent = Model + Harness") and Birgitta Böckeler articulated it in more detail in Harness Engineering (Thoughtworks, published on martinfowler.com). The model is interchangeable. The harness isn't.
The engine analogy¶
An F1 engine inside a compact car doesn't win races. What wins races is the chassis, the aerodynamics, the tires, the telemetry and the pit crew. The model is the engine; everything else is where the race is won or lost. And the important part: that "everything else" is something you build, not the model vendor.
When a team complains that "the agent hallucinates" or "doesn't understand the code", the problem is almost never the model. The problem is that the agent is operating in an environment where it would be unreasonable to expect a new human to understand anything either. It has no map and no fast tests; it can't reproduce the bug, doesn't know the team's conventions, and nobody tells it when what it did is wrong.
The operational definition¶
The harness is everything that increases the probability that the agent does the right thing on the first try and that, when it doesn't, the failure is detected and corrected without human intervention. Böckeler breaks it down into two mechanisms:
- Guides (feedforward) — hints that orient the agent before it acts: conventions, templates, system prompts, AGENTS.md, schemas.
- Sensors (feedback) — guardrails that act after: tests, type checkers, linters, observability, automatic reviewers, evals.
A harness without guides produces an agent that drifts. A harness without sensors produces an agent that drifts and never finds out. A harness with both produces something that starts to look like a collaborator.
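The interplay of the two mechanisms can be made concrete with a minimal sketch. All names here (`harness_loop`, `call_model`, the sensor shape) are illustrative, not any real framework's API: guides are injected into the prompt before the agent acts, sensors inspect the result after, and sensor failures flow back into the next attempt.

```python
# Hedged sketch of a guide + sensor loop. `call_model` stands in for any
# LLM call; a sensor returns None on success or a failure message.

def harness_loop(task, call_model, guides, sensors, max_iters=3):
    """Run the agent until every sensor passes or the iteration budget runs out."""
    patch, feedback = None, []
    for _ in range(max_iters):
        # Feedforward: guides (conventions, AGENTS.md, schemas) shape the prompt.
        prompt = "\n\n".join(guides + feedback + [task])
        patch = call_model(prompt)
        # Feedback: sensors (tests, linters, type checkers) judge the result.
        failures = [msg for sensor in sensors if (msg := sensor(patch))]
        if not failures:
            return patch, True
        # Failed checks re-enter the loop without human intervention.
        feedback = [f"Previous attempt failed:\n{f}" for f in failures]
    return patch, False
```

The point of the shape is that no human copies errors into a chat: the sensor output is the next prompt's context.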
The empirical evidence¶
Three data points worth having on hand when someone says "agents aren't ready":
- Stripe merges more than 1,000 PRs per week generated by agents in a codebase of hundreds of millions of lines of Ruby/Sorbet, with an uncommon internal stack. They pull it off not because they have a better model than anyone, but because they built Toolshed (~500 internal tools via MCP), devboxes pre-instantiated in 10 seconds, and a "blueprints" system that combines deterministic nodes with agentic nodes.
- OpenAI built an internal product with ~1M lines and ~1,500 PRs in 5 months, with a team that grew from 3 to 7 engineers, and with the absolute rule that no line be written by hand. The key wasn't Codex. The key was the harness around Codex: app bootable per worktree, Chrome DevTools Protocol wired to the runtime, ephemeral observability queryable with LogQL/PromQL, a 100-line AGENTS.md as an index, and docs/ structured as a system of record.
- Geoffrey Huntley describes the "Ralph loop" — a monolithic orchestrator that executes one task per iteration — as the primitive any serious agent starts from. It's not a feature of the model; it's an architectural decision made by the team operating it.
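The Ralph-style loop is simple enough to sketch. This is a hedged illustration of the general shape (one task per iteration, verify, retry), not Huntley's actual code; `run_agent` and `verify` are placeholders for whatever agent call and check a team wires in.

```python
# Illustrative sketch of a monolithic one-task-per-iteration orchestrator.
from collections import deque

def ralph_loop(tasks, run_agent, verify, max_attempts=3):
    """Pop exactly one task per iteration; failed tasks go back on the queue."""
    queue = deque((task, 0) for task in tasks)
    done = []
    while queue:
        task, attempts = queue.popleft()   # one task, one iteration
        result = run_agent(task)           # fresh run, no carried chat state
        if verify(result):                 # sensors decide, not a human
            done.append(result)
        elif attempts + 1 < max_attempts:
            queue.append((task, attempts + 1))  # retry rather than patch in place
    return done
```

The design choice worth noticing is that retry happens at the task level, with a fresh run, instead of accumulating a long corrective conversation.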
None of these teams won because they had access to a secret model. They won because they treated the agent as a system to be engineered.
Why this changes the team's work¶
Once you accept that the harness is where quality is decided, you stop doing certain things and start doing others. You stop looking for the "perfect prompt" — prompts live in the harness now, versioned as code. You stop copy-pasting errors into the chat — errors enter the loop via automatic sensors. You stop asking the agent to guess the context — the context lives structured in the repo, mechanically accessible.
And you start doing things that used to look like over-engineering: custom linters for a single convention, versioned execution plans, local observability, "doc-gardening" agents that keep documentation up to date.
What in a human team would be nitpicky, in a team with agents is a multiplier. A rule codified once applies forever, on every PR, without wear, without debate, without having to remind anyone at the next standup.
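A custom linter for a single convention really can be this small. The rule below (flag naive `datetime.now()` calls) is a made-up example to show the cost side of the trade-off; once written, it runs on every PR forever.

```python
# Hedged sketch: a one-rule linter. The convention enforced here is
# illustrative; the pattern is what matters.
import ast

def check_naive_now(source: str, filename: str = "<string>") -> list[str]:
    """Flag bare .now() calls, which produce timezone-naive timestamps."""
    problems = []
    for node in ast.walk(ast.parse(source, filename)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr == "now"
                and not node.args and not node.keywords):
            problems.append(f"{filename}:{node.lineno}: naive now(); pass a tz")
    return problems
```

Wired into CI as a sensor, this is the "codified once, applies forever" multiplier: the convention never needs to be re-litigated in review.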
The principle that runs through the whole guide¶
When something fails, the question isn't "how do I tell the agent this is wrong?". The question is "what is missing from the harness so this can't happen, or so it gets detected automatically when it happens?".
It's the same question a good internal engineering platform asks about its product teams. Not "how do I teach this team not to break this", but "how do I make it impossible (or trivially detectable) for them to break this". The mental shift is the same. The difference is that with agents the ROI is immediate and the improvement loop is hours, not quarters.