It is easy to build an agent. It is very difficult to build a great one.

Ever since I started using Claude Code, I have been amazed at how much the agent can do so well. For a while, I attributed everything to the model (Claude) and its training. It was only after Claude Code's source code was reverse-engineered that I discovered that much of what makes the product so good is not Claude the model, but the harness around it.

So I did a deep dive to understand harnesses, consuming as much relevant information as I could find on the subject. However, the most challenging part was not understanding how harnesses work or what they do. The most challenging part of writing this article was figuring out what makes a harness valuable, through the lens of a product builder.

The harness is the layer of your AI agent stack that sits immediately above the model and the only one that interacts with it. Right now (Jun 2026), the most dependable way to make an agent better is not by swapping out the model. It is by building a better harness.

A good example of this is when LangChain held its model fixed and changed only the surrounding harness. Its coding agent climbed from outside the top 30 to the top 5 of the Terminal-Bench 2.0 leaderboard, from 52.8% to 66.5%.1

This article explains what a harness is, what it does and how it can create a durable competitive advantage in your agentic products.

First, some basic concepts that make understanding harnesses easier.

Basic Concepts

The Model is Stateless

A language model is stateless. It is a function from text to text. Text goes in, text comes out, and once the text is sent out the model retains nothing. It cannot open a file, run a command, check the time, or recall what it said a second earlier.

So on every turn, some surrounding program has to gather the system instructions, the entire conversation to date, and the newest input, then send all of it back to the model. Once the model replies, this program reads the reply, performs whatever action the reply asks for, and feeds the result back to the model for the next turn.

The surrounding program that does this is the harness. It exists because the model cannot hold its own state or act on the world.1

The model is stateless A stateless model sits outside the harness boundary at the top. Inside the harness, Assemble Context sends the whole transcript up to the model and receives text back. Below it, a growing transcript store holds all prior turns; Assemble reads from it and appends each reply to it. The transcript grows turn over turn, and all state lives in the harness, not the model. The Model Is Stateless Stateless Model Text In → Text Out · No Memory Sends Transcript Returns Text THE HARNESS Assemble Context Whole Transcript Reads Appends Growing Transcript Held & Re-Sent In Full Each Turn Turn 1 Turn 2 Turn 3 ALL STATE LIVES IN THE HARNESS — the model is recomputed each turn and remembers nothing between calls.
Figure 1.The model forgets after every call and returns only text. The harness re-assembles and re-sends the whole growing transcript each turn. All of the state lives in the harness, not the model.

Definition of a Harness

The simplest and cleanest definition of a harness is the one LangChain1 and Anthropic2 use. Simply put:

Agent = Model + Harness

In other words, if it is not the model, it is the harness.1 I will use this definition throughout this article.

Other framings are finer breakdowns of the same thing. Hugging Face splits harness from scaffold,3 METR and the safety world call the whole thing scaffolding and its tuning elicitation,4 and product teams separate the framework you import, the harness that runs, and the hosted runtime.5 None of these definitions change the central ideas in this article.

Control Philosophy: Open vs. Fixed Paths

One choice in how you design the harness decides most of what you build: how much autonomy the agent is given. This falls on a spectrum with two ends.
Open Paths: You hand the model a computer and let it direct its own steps.
Fixed Paths: The model is a component inside an explicit graph or set of roles you control.

The control-philosophy spectrum, with products placed A horizontal axis runs from fixed paths on the left, where you constrain the model inside a graph or roles, to open paths on the right, where you give the model a computer and get out of the way. The start point sits at the fixed end, marked with a dot below the line labelled start here: begin constrained and widen autonomy only as the task needs it. Seven products are placed along it: CrewAI and LangGraph at the fixed end, OpenAI's Agents SDK in the centre, OpenClaw and OpenHands in the open middle, and Hermes and Claude Code near the open end. The Control-Philosophy Spectrum The control stance, not the model, is what predicts the rest Fixed Paths Constrain it in a graph or roles Open Paths Give the model a computer START HERE CrewAI OpenAI Agents SDK OpenHands Claude Code LangGraph OpenClaw Hermes You fix the path in advance The agent decides the path Begin constrained — widen autonomy only as the task needs it
Figure 2.The control philosophy, not the model, is what decides the rest of the harness. Claude Code and Hermes sit near the open end, OpenHands and OpenClaw in the middle, LangGraph and CrewAI at the fixed end, with OpenAI's Agents SDK in the centre. The arrow marks where to start: begin constrained, then widen autonomy only as the task needs it.

I have included some examples in the spectrum above. If you know any of these products, you already know what I am talking about.

Regardless of the kind of agent you are building, it is always better to start at the fixed, constrained end. Autonomy is earned by reliability and performance. You widen the autonomy only when an open-ended problem genuinely cannot be written as a fixed path. 6

Anatomy of a Harness

The harness is deterministic code. It is ordinary software: loops, conditional branches, schema validation, permission checks, retries, etc. It runs the same way every time. It is dependable and it is something you can fully control.

The model is probabilistic. It samples a plausible next token, and you can never fully guarantee what it will do.

Division of Labor: Trust the model for judgment. Trust the harness for guarantees.

Anything that must hold regardless of situation, such as a denied permission, a sandbox boundary, a verification gate, or a required output format, lives in the deterministic harness. Only open-ended judgment is delegated to the model. This is also the structural reason a smarter model cannot absorb governance.7

Anatomy of a harness A harness turns a stateless model into an agent by running a loop: assemble context, call the model, parse the reply, then fork on whether the model requested a tool or gave a final answer. Tool calls pass through a governance gate (permissions and sandbox), execute, and the result is appended back into the assembled context for the next turn, closing the loop. History, curated memory, and context-specific skills form a bidirectional side-store with assemble. Final answers go to a verify step whose mechanisms include rule and schema checks, test execution, self-critique, and a judge/evaluator; verified answers are returned to the user. The model, the external world, and the user sit outside the harness boundary. Anatomy of a Harness THE HARNESS Stateless Model Text In → Text Out · No Memory Sends Context Returns Text Call Model One Stateless Call Parse Reply Read Model Output Tool call? Assemble Context Prompt · History · Tools History Memory Skills Tool Call Final Answer GOVERNANCE Permissions / Sandbox Allow · Deny · Isolate If Allowed Execute Tool Run the Requested Call Acts On Returns Append Result to Context External World Files · Databases · Web · APIs Verify Goal Met? — Else Loop Rule & Schema Checks Tests / Execution Self-Critique Judge / Evaluator Return to User Final Answer Orchestration — a lead agent may spawn sub-agents, each its own loop. THE LOOP IS THE ENGINE — a stateless model re-fed its growing context and called again each turn, until Verify passes or a stop limit is hit: Budget · Steps · Repeated Failure.
Figure 3.The control loop is the spine. History, memory and skills feed each prompt the harness assembles, verification checks the output, tools call out through a permissions and sandbox gate, and orchestration wraps the loop with sub-agents. The model, the external world, and the proprietary context all sit outside the harness boundary.

The diagram above has seven parts:

  • Control Loop. The cycle at the center. Assemble the prompt, call the model, parse the reply, execute the tool the model asked for, observe the result, and repeat. The model only emits text. Every action in the world happens because the harness chose to act on that text. The loop itself is small and just a few dozen lines of code.1
  • History, Memory and Skills. The three things the harness draws on to assemble what is sent to the model. The live conversation (history), what the agent has learned about your world (memory) and any reusable procedures it has available (skills) specific to the context.
  • Context Assembly. The management of what the model sees inside its finite context window. Because the model remembers nothing between calls, the harness re-sends the relevant history each turn and prunes or summarizes it as it grows. This is genuinely hard. Model performance degrades well before the context is full. Only 60%-70% of the model's context is actually usable.8
  • Tools. The functions the model is allowed to call: read a file, run a command, query a database, hit an API. The Model Context Protocol (MCP) standardizes how a harness connects to those tools, but the harness is not limited to using MCP in calling tools.9
  • Permissions/Sandbox. The permissions and the sandbox form the deterministic gate on what the agent may do. Allow and deny rules decide which tool calls are permitted, and OS isolation, such as Seatbelt on macOS or Bubblewrap on Linux, contains what can be affected. The governance must hold even if a prompt injection bypasses the model's decision-making, so agents with access to sensitive information require sophisticated deterministic governance mechanisms.7
  • Verification. The checks on the model's work: deterministic checks, schema validation, retries, and a separate evaluator agent (the judge) given a fresh context window and no write access. Models skew positive when grading their own output, so the judge has to be separated from the worker.10
  • Orchestration. The outer layer that spawns and coordinates sub-agents, each with its own clean context window, for work that decomposes into parallel parts. It is powerful and expensive, but very useful on hard, splittable tasks.10

Two parts, Tools and Governance, do not actually sit in any single box. They run as threads through the agent stack. A tool's interface lives in the harness, but the backend it calls, the real business logic, lives in your own systems. Governance is enforced in the harness, but the policy it enforces comes from the business.

Tools and governance cross two layers Two vertical threads, tools and governance, each cross two layers rather than sitting in one. In the upper layer, your data and business logic, sit the tool backend (your business logic) and the policy (what is allowed). In the lower layer, the harness, sit the tool interface (function and MCP definitions) and the enforcement (permissions and sandbox). The tool thread runs both ways across the harness boundary; the governance thread runs one way, the policy setting the rules the enforcement runs. Tools & Governance Cross Two Layers YOUR DATA & BUSINESS LOGIC · THE LAYER ABOVE Tool Backend Your business logic Policy What's allowed THE HARNESS Tool Interface MCP / function defs Enforcement Permissions / sandbox Tool calls Sets the rules The interface and the enforcement live in the harness; the backend and the policy live in the layer above.
Figure 4.Tools and governance cross two layers rather than sitting in one. Tools: the interface lives in the harness, the backend lives in your data and business logic. Governance: enforcement lives in the harness, policy lives in the layer above.

Models absorb capabilities

As models scale, they absorb capabilities. This is a pattern that runs through the whole field. Skills that once demanded a system of their own such as translation, summarization, document comprehension, coding, multi-step reasoning, keep collapsing into a general large language model that simply does these things.

Each model generation pulls in work that was once hard, within its boundary of default behavior and you are watching this right now in real time. This absorption runs on a clock, dictated by model training cycles and it is accelerating.4

Figure 5 · What the model absorbs, and how fast

Harness featureBuild costAbsorbed by the modelTime
Chat-with-PDF wrappersA weekendNative document understanding, late 2023Months
Chain-of-thought scaffoldsDaysReasoning models, 2024 to 2025About a year
Function-calling orchestrationWeeksNative tool use, 2023 onwardMonths
RAG and chunking crutchesWeeksEased by 1M-token context, 2026 (partial)One to two years
"Context anxiety" patchA sprintGone one generation later, Sonnet 4.5 to Opus 4.5About three months
Hand-built few-shot exemplarsA dayNative zero-shot instruction following, instruction-tuned models, 2021 to 202211About a year
Agentic coding loops (plan, edit, run tests, repair)MonthsNative agentic coding via RL post-training, 2024 to 2026 (partial)About two years
Figure 5 · What the model absorbs, and how fast.The shaded columns are the model's doing: what absorbed each, and when.12

A large part of harness design is capability crutches. Stuff the model is not capable of doing yet. These include planning prompts, reflection loops, structured output tricks, retrieval workarounds, and hand-built orchestration. As you can see in the table above, these get absorbed by each successive model generation.

So when building a harness, we need to be careful about what we choose to build and treat capability crutches as cheap, temporary and disposable features that will be absorbed. They are useful in the short term, but rapidly depreciating assets.13

A useful shorthand: If a new model with a minimal harness matches a carefully tuned harness across real tasks, the model has absorbed much of the carefully tuned harness.

An AI Value Map

To know what makes an agent valuable, we first need to see where the value collects.

The map below which charts value in the AI agent stack is useful in understanding the current overall landscape. But it is only useful as long as we remember that this is a map and not the territory. We are still at the beginning of a change that will redefine many things including products and businesses. And the actual market is shifting rapidly with new techniques and technological progress as well as significant market moves by the current SOTA model providers.

The real test is not where a layer sits in the stack or how this map changes. It is whether you rent that layer, can only copy it, or actually own it.1

Where value sticks: a rent-to-own view of the agentic stack Five layers of the agentic stack run as rows against three columns: rented, copyable, and owned, where durable value accrues. Distribution and trust and your data and domain knowledge are owned outright. The harness splits three ways. The perishable edge the model will eat holds planning and RAG crutches, prompt and parse fitting, memory optimization, and inference cost tactics. The copyable mechanism anyone can clone holds the control loop, tool and MCP wiring, retries and verification, and orchestration and sandbox. The owned harness value, which cannot be copied or absorbed, is the three pillars: curated memory (what it knows about your world), context-specific skills (what it can do in your context), and reliability (validated judgment it cannot game). The model is rented at the frontier, fine-tuned or distilled, or owned at serving scale. Compute is rented, reaching toward owned only at fixed-cost scale. A flywheel links the owned data and distribution layers to an owned model. Where Value Sticks RENTED COPYABLE OWNED Where durable value accrues 5 Distribution & Trust (Brand) Brand, Installed Base, Switching Cost, Compliance 4 Domain Knowledge Owned Data & Processes Your proprietary context 3 The Harness Perishable Edge The model will eat it Planning & RAG crutches Prompt & parse fitting Memory optimization Inference cost tactics Mechanism Anyone can clone it Control loop Tool / MCP wiring Retries & verification Orchestration & sandbox Owned Harness Value Cannot be copied or absorbed Curated Memory What it knows about your world Context-Specific Skills What it can do in your context Reliability Validated judgment it can't game 2 The Model Frontier API Rented by the token Fine-Tune / Distill Open weights Owned Model Train or distill your own 1 Compute & Inference Chips · Serving Rented · the bill Fixed-Cost Compute At Scale Reaches toward owned FLYWHEEL
Figure 6.A value and ownership view of the agentic stack. Each layer sits left to right by whether you rent it, can only copy it, or own it, and only the owned side holds durable value. Inside the harness, which splits the same three ways, the durable part is the three pillars: curated memory, context-specific skills, and reliability. Inference cost control is table stakes, and its one durable lever, an owned model at serving scale, sits at the harness edge.
  • Compute. Rented by the token, from whoever owns the chips. Edge compute (devices capable enough to run a model) is catching on fast, but still limited by device processing capabilities and memory.
  • The Model. Rented at the frontier, where you call an API. It is owned only at serving scale, where you train or distill your own. Open models can be fine-tuned and hosted on your own or rented compute. The owned model is the single biggest thing that moves the margin needle in the current market, but requires a lot of capital.
  • The Harness. This is the most interesting one. It can be split into three things: a perishable edge the model will eat, mechanisms anyone can clone, and the parts that simply cannot be copied by competitors or absorbed by the model.
  • Domain Knowledge. Currently owned outright. Future is uncertain. This is what the model providers are targeting with FDE driven (Forward Deployed Engineer driven) business models and engineers who embed in the customer to encode domain workflows into the product.14 If these business models succeed (like Palantir) the models will absorb significant portions of domain knowledge specific to businesses. This is the most valuable part in this map,15 and it deserves its own articles. This is not one of those articles.
  • Distribution & Trust (Brand). Owned outright. The distribution and trust that make your brand are the real moat. No model can absorb them and no competitor can fork them. The harness provides you a few valuable methods in building these.

The map above is a view of the whole stack. From here on let's focus on the harness and identify what is worth building and owning within it.

Harnessing Value

Within the scope of work an agent handles, there are three things that make an agent competent at doing the job it is assigned. The agent has to know the context it works in, it has to be able to act in that context, and it has to be able to judge whether it did the work right.

Agent Competence Competence has three parts, each fused to your domain and each compounding with use. Knowledge becomes curated memory, what the agent knows about your world. Ability becomes context-specific skills, what it can do in your context. Judgment becomes reliability, how it checks its own work. The three form a triangle around a center labelled competence: knowledge, ability, judgment. The model supplies only the generic version of each; the version fused to your world is the part that does not commoditize. Agent Competence COMPETENCE Knowledge · Ability · Judgment KNOW Curated Memory What it knows about your world DO Context-Specific Skills What it can do in your context JUDGE Reliability How it checks its own work The model gives you the generic version of each.
Figure 7.Memory is what the agent knows about your world, ability is what it can do in it, judgement is how it checks itself. Each accrues in your context and compounds with use; the model supplies only the generic version of each.

The model gives you all three, but only ever the generic version. Currently it knows the world at large, it acts on tasks at large and it judges in the abstract. Each of these has a durable counterpart inside the harness, which can neither be absorbed by the model nor copied by competitors.

Curated Memory: This is what the agent has learned about your world, use case or domain and has chosen to keep.

Context-Specific Skills: These are the abilities the agent has built as a result of routine use within the context of its work.

Reliability: This is a checked and validated sense of whether the work is right.

All of these follow the same pattern: Take what the model is turning into a commodity, fuse it to your use case or domain, and let it accumulate until a competitor cannot simply buy past it.

Curated Memory

Memory mechanisms such as vector stores, full-text search, and the create-read-update-delete memory tools that ship today are a commodity. The open-source mem0 and Letta projects give these tools away,18 and the model providers are adding native memory of their own. This is not what I mean by Curated Memory and these are not worth investing your time in. Like inference optimization, these are table stakes.

The part that neither the model providers nor a competitor can copy is the discipline of deciding what to remember when and the accumulated path-dependent state the harness produces as a result of this. This compounds over time, increasing the agent's comprehension of its context. This lives squarely in the harness.

Curated Memory is the best name I have for it, for lack of a more precise, memorable industry-standard term.

It is easier to explain this with examples, and there are a handful of products already doing this today very creatively.

  1. Hermes, keeps four kinds of memory and recalls across past sessions through full-text search and summarization. The memory is deliberately cache-aware, so learning does not inflate inference costs.19
  2. Claude Code runs an agent-maintained memory file, an on-demand private store, and compaction.20
  3. Claude Managed Agents' Dreaming pass replays past sessions in the background to extract patterns, merge duplicates, and retire stale entries.21
  4. Sierra's Expert Answers mines a company's own resolved support conversations into reusable knowledge articles. A rival can copy the architecture, not the accumulated ticket history.22

Hermes and Claude Code's curated memory lives in the harness. Claude Managed Agents and Sierra's Expert Answers add this knowledge to their business owned data and processes.

Each of these creates switching costs and stickiness on repeated use.23 Additionally when captured, they create owned long term knowledge which a model can be trained on.

A Problem/Opportunity: Despite the techniques illustrated above, Memory Staleness is still not fully solved,24 and auto-generated memory can sometimes actively hurt rather than help. Something I experienced first-hand recently when Claude Code confused a repository with a similarly named older one, and made a royal mess of it. I am sure there are plenty more examples like this.

Context-Specific Skills

Just like memory mechanisms, the mechanisms for building context-specific skills are a commodity, open-sourced or shipped natively. They go by different names. Microsoft's Semantic Kernel called them Skills first, before renaming them Plugins. Anthropic, AutoGen, and OpenHands all ship Skills. OpenHands also calls them microagents.25 Cognition's Devin calls them Playbooks.26 And NVIDIA's Voyager grows a skill library.27

Academic literature just calls them reusable procedures, or SOPs. The packaging is converging as well. Anthropic open-sourced SKILL.md.28 AGENTS.md is a cross-tool standard read by Codex, Cursor, Copilot, and a dozen others, and Rules files ship in every IDE. By whatever name and in whatever format, the mechanism is a commodity.

Any model can build these skills for each user and deployment, and will even help the harness spot which ones repeat or solve a hard problem. But for that to happen, the harness has to provide the mechanisms to automatically identify, create, and fine-tune these skills.

Context-specific skills are the least proven of the three (Curated Memory, Context-specific skills & Reliability) at creating lasting value. I included it because it makes sense despite the lack of proven market examples. It stands to reason that a corpus of skills, each tuned over repeated use in real-world context and circumstances, is a valuable record of what gets used over and over, and which actions work and which don't. Curated memory captures what is true about the real-world context. Context-specific skills capture what works in it with the model.

That record feeds into your Owned Data and Processes.

At scale, and from a large enough corpus, this already mostly-anonymized data can be used to train or tune an owned model. As smaller open and edge models make tuning cheap, this corpus becomes a valuable source product builders can use to do what is currently only accessible to frontier models.

Some examples of actual implementations:

  1. Voyager (NVIDIA Research), an agent in Minecraft writes each new behavior into an ever-growing library of executable skills, composing later ones from earlier.27
  2. Hermes writes its own reusable skills when a workflow proves worth saving and refines them with use.19
  3. Agent Skills from Anthropic packages procedural knowledge and organizational context so a general model performs in its specific context.28
  4. Devin Playbooks (Cognition) are versioned procedures with success conditions and forbidden actions, saved and replayed.26

In the first two examples above, Context-Specific Skills live in the harness. In the last two examples (Agent Skills and Devin Playbooks), the harness sends this data to the layer above, which is Owned Data & Processes.

Reliability

This is the most important one.

The hardest part of shipping an agent is not getting it to do a task. It is being sure it did the task right. An agent that is even right 90% of the time and silently wrong the other 10%, with no way to catch the 10%, cannot be turned loose on anything that matters. And getting an agent to 90% is easy. The last 10% is very hard.

The three layers of reliability Reliability is built in three stacked layers an output falls through. First, the constrained path leaves the fewest open decisions and a route you can read back. Second, deterministic checks assert what code can verify. These two layers are deterministic, cheap and certain. Third, the validated judge is tuned on your own withheld, human-labeled ground truth; it is probabilistic, costly, and the part a competitor cannot buy. Improving Reliability 1 The Constrained Path Fewest open decisions, a readable route 2 Deterministic Checks Assert what code can verify 3 The Validated Judge Tuned on your withheld ground truth DETERMINISTIC Cheap, certain PROBABILISTIC Validated, costly Getting to 90% reliability is easy. The last 10% is very hard. These three methods go a long way.
Figure 8.A constrained path leaves fewer decisions to go wrong, deterministic checks catch what code can verify, and a validated judge tuned on your own withheld, human-labeled ground truth handles the open-ended rest. The first two layers are deterministic, cheap and certain; the third is probabilistic and costly.

There are three different techniques, that are known to work well:

1. The Constrained Agent Path

The most reliable path to reliability, is to leave the agent less room to be wrong. Every open decision the agent has to make can go wrong and then has to be caught. A path pinned down in advance prevents this. The most reliable agent has the fewest open decisions, and a fixed path that can be read back to see what it did.6

This is the driving design philosophy of LangGraph which has deterministic structural control, graph based state management and checkpoints, making everything auditable and recoverable. This is an example to anchor the mental model, not a recommendation of LangGraph or fixed paths.

By beginning constrained and widening autonomy as the task demands it, we retain more control over the reliability of an agent. This is the reason for my recommendation that you start with constrained paths earlier under control-philosophy.

2. Deterministic Checks / Assertions 29

We check what code can check. A lot of what an agent gets wrong can be defined.

Examples: The output is plainly incorrect in one attribute, a total does not add up, a citation points nowhere, an action breaks a business rule, etc.

A deterministic check/assertion catches such issues on every output. There is no judgment required and it is cheap, certain, and catches a surprising number of problems. The rules the harness encodes are owned by you.

3. A Validated LLM-as-Judge (Eval)

The first two methods are deterministic. A validated judge is the probabilistic one.

It determines what no deterministic test can determine, i.e. whether a model output is actually right. Having a validated judge allows you to reuse the judgement of the people who understand the domain or agent task to verify the work of the agent.

In general, if a model can be optimized for a metric instead of the output that metric represents, it will be, so public benchmarks and generic evals don't work. They measure the performance of a model on generic tasks and not what the agent you are building needs to actually do. Often they worsen the problem by letting you believe that the model is performing even when it is not.3031

If you ask the model to grade its own work, it will tend to agree with itself and forgive its own failures.10 If you use an unverified grader (even a different model) to grade the output, you will get a confident opinion of unknown quality.32

The only judgment you can trust is the one checked against the people who know the domain and the work the agent is tasked with. You have experts decide on a sample of real cases right or wrong (it has to be binary), hold it back as an answer key, and tune an LLM judge until its verdicts match theirs (validation) using ML-based techniques, retesting as the product and models shift. A generic eval and an unchecked grader are guesses, while a judge measured against human ground truth is an instrument whose accuracy you know, on the work the agent actually does. This is how you build a validated llm-as-judge.33

A validated judge can be expensive, and by a lot if you intend to use one at runtime within the harness. With current inference costs, I would build one only for critical agent tasks that need to be scrutinized automatically on a schedule, at random or during run-time.

Conclusion

The largest disclosed AI revenue today sits at the frontier model layer.34 But that revenue is pulled through from the layer above. GPT-3.5 only became a phenomenon once a chat product turned it into ChatGPT. Anthropic's recent growth is hard to imagine without Claude Code. The model supplies general capability. It is the harness that turns this into a powerful and valuable agent. And that is where the next wave is forming.

Gartner expects task-specific agents in roughly 40% of enterprise applications by the end of 2026, up from under 5% a year earlier.35 The infrastructure layer these agents plug into has already passed 10,000 MCP servers.9

No one agent or agent-type can address the millions of tasks or thousands of use cases we have today, so the variety of agents is set to explode. A coding agent and a support agent cannot use the same loop or the same guardrails, so the types of harnesses and what each does is set to explode as well.

The harness clearly is an extremely useful tool in creating value. For a product builder keeping up with this rapid momentum, building valuable agents by building a valuable harness is a bet on where durable long-term value comes from and not a strategy to capitalize on the current market. The design choices we make now in doing so will matter disproportionately. This article was an effort to identify the good ones. If you find better ones, please tell me.

A recurring theme you must have noticed, and the underlying message is this: Build what the models or your competition cannot commoditize.

Happy Building!

References

  1. LangChain, The Anatomy of an Agent Harness (Vivek Trivedy, 10 Mar 2026). langchain.com
  2. Anthropic, How Claude Code works (Claude Code documentation). code.claude.com
  3. Hugging Face, Harness, Scaffold, and the AI Agent Terms Worth Getting Right (25 May 2026). huggingface.co
  4. METR, Measuring AI Ability to Complete Long Tasks (Kwa et al., arXiv:2503.14499), Time Horizon 1.1 (29 Jan 2026), and Measuring Time Horizon using Claude Code and Codex (13 Feb 2026, the 50.7 percent Claude-Code-versus-ReAct result). arxiv.org · metr.org
  5. Anthropic, Scaling Managed Agents: decoupling the brain from the harness. anthropic.com
  6. Anthropic, Building effective agents (workflows versus agents). anthropic.com
  7. Simon Willison, The lethal trifecta for AI agents (16 Jun 2025), with Claude Code permissions and sandbox documentation. simonwillison.net
  8. Chroma, Context Rot technical report (Hong, Troynikov and Huber, Jul 2025). research.trychroma.com
  9. Anthropic, Donating the Model Context Protocol and establishing the Agentic AI Foundation (9 Dec 2025, 97M downloads, 10,000+ servers). anthropic.com
  10. Anthropic, How we built our multi-agent research system (token multiples, separate evaluator, single-agent versus multi-agent +90.2%). anthropic.com
  11. The few-shot to zero-shot shift: Wei et al., Finetuned Language Models Are Zero-Shot Learners (FLAN, ICLR 2022, arXiv:2109.01652), and Ouyang et al., Training Language Models to Follow Instructions with Human Feedback (InstructGPT, NeurIPS 2022, arXiv:2203.02155). arxiv.org
  12. Rich Sutton, The Bitter Lesson (13 Mar 2019). incompleteideas.net
  13. Cat Wu (Anthropic, Head of Product for Claude Code) on Lenny's Podcast, How Anthropic's product team moves faster than anyone else (2026): the team builds features that do not fully work yet, then swaps in each newer model to see whether the capability gap has closed, and audits the system prompt every model release to strip out crutches that compensated for prior weaknesses. lennysnewsletter.com
  14. TechCrunch, Anthropic and OpenAI are both launching joint ventures for enterprise AI services (4 May 2026): both labs launched enterprise AI ventures built on the Palantir-style forward-deployed-engineer model, including OpenAI's "The Development Company." techcrunch.com
  15. Andreessen Horowitz, AI will split the software industry (Immerman and Rodriguez, Mar 2026), with Jennifer Li on context as currency. a16z.com
  16. Epoch AI, The Price of Progress: Algorithmic Efficiency in LLM Inference (arXiv:2511.23455, Nov 2025): per-token prices on matured "chatbot-tier" models keep falling fast, but the cost of running a hard reasoning benchmark (GPQA-Diamond) at the frontier rose roughly 18× per year as models grew and reasoning chains lengthened — reasoning is the scarce, expensive layer. arxiv.org
  17. TechCrunch, The high costs and thin margins threatening AI coding startups (Aug 2025), with Marina Temkin's Apr 2026 reporting on Cursor's Composer model, plus Foundamental's gross-margin calculation for Cursor. Secondary reporting, not audited accounts. techcrunch.com
  18. mem0 (Chhikara et al., arXiv:2504.19413), with Anthropic memory tool documentation. Token-savings claims are vendor-reported. arxiv.org
  19. Hermes (Nous Research), agent documentation and technical write-ups of its self-improving, skill-writing memory loop. Web-sourced. hermes-agent.nousresearch.com
  20. Anthropic, Managing context on the Claude Developer Platform (context editing plus memory eval: 84% fewer tokens, +39% performance, vendor-internal). anthropic.com
  21. Anthropic, Claude Code memory and Dreaming, a scheduled memory-consolidation process launched as a Managed Agents research preview in May 2026. claude.com · code.claude.com/docs/en/memory
  22. Sierra, Agent Data Platform (Nov 2025) and Expert Answers (Jan 2026): grounded knowledge articles auto-mined from a deployment's own resolved conversations. Vendor, primary. sierra.ai
  23. On memory as a moat: Nicolas Bustamante, Agent memory engineering, and Tara Tan (Strange VC), Memory is a moat. Web-sourced analysis. nicolasbustamante.com · strangevc.com
  24. Zep, Zep: A Temporal Knowledge Graph Architecture for Agent Memory (Rasmussen et al., arXiv:2501.13956, Jan 2025), built on the Graphiti engine; bi-temporal fact invalidation gives each fact a validity window so stale facts are retired, not deleted. Primary. arxiv.org
  25. OpenHands (All-Hands AI), agent Skills and microagents: keyword-triggered procedures shared through a public registry. Primary. docs.openhands.dev
  26. Cognition, Devin Playbooks — versioned, reusable task procedures with explicit success conditions and forbidden actions. Vendor docs, primary. docs.devin.ai
  27. Voyager (Guanzhi Wang et al., NVIDIA, Voyager: An Open-Ended Embodied Agent with Large Language Models, arXiv:2305.16291, 2023): builds an "ever-growing skill library of executable code," composing new skills from earlier ones, with no model fine-tuning. arxiv.org
  28. Anthropic, Equipping agents for the real world with Agent Skills (16 Oct 2025; open standard, 18 Dec 2025). The "procedural knowledge and organizational context" and "specific context" framing is Anthropic's own. anthropic.com
  29. Eval practitioners on why agents fail and how to judge them. Hamel Husain, Your AI Product Needs Evals, LLM Evals FAQ, and Who Validates the Validators? (the root-cause claim and the binary, hand-labeled, validate-the-judge rules), and Chip Huyen, Common pitfalls when building generative AI applications. Primary practitioner sources. hamel.dev · huyenchip.com
  30. OpenAI, Why we no longer evaluate SWE-bench Verified (Feb 2026). openai.com
  31. UC Berkeley RDI (Hao Wang, Qiuyang Mang, Alvin Cheung, Koushik Sen, Dawn Song), Trustworthy Benchmarks and Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack (Apr 2026, arXiv:2605.12673). Their BenchJack exploit drove eight major agent benchmarks to near-perfect scores without solving tasks, six to 100 percent (incl. SWE-bench Verified and Terminal-Bench), GAIA to about 98 percent, OSWorld to 73 percent. Primary, web-verified. rdi.berkeley.edu
  32. Penfield Labs, We Audited LoCoMo: 6.4% of the Answer Key Is Wrong and the Judge Accepts up to 63% of Intentionally Wrong Answers (Apr 2026): the language-model judge accepted 62.8 percent of wrong but topically plausible answers, with reproducible scripts. Secondary. dev.to
  33. Hamel Husain, Using LLM-as-a-Judge For Evaluation: A Complete Guide — treating the judge as an ML problem: align it to human labels on a held-out labeled set, measure true-positive and true-negative rates, and iterate via prompt engineering or fine-tuning. hamel.dev
  34. Menlo Ventures, 2025: The State of Generative AI in the Enterprise (Dec 2025). $19B of $37B app layer, 63% startup share, 76% buy not build, 47% versus 25% pilot conversion, Anthropic about 40% API share. menlovc.com
  35. Gartner, Gartner Predicts 40% of Enterprise Apps Will Feature Task-Specific AI Agents by 2026, Up From Less Than 5% in 2025 (26 Aug 2025). Analyst projection, secondary. gartner.com