What is an agent harness?

An agent harness is the non-model software around a language model that turns it into an agent: the loop that calls the model, the tools it can invoke, the management of what it sees in the context window, the permissions and sandbox that constrain it, and the logic that checks its work. Put simply, Agent = Model + Harness, so if it is not the model, it is the harness.

What is the difference between the model and the harness?

The model is stateless and probabilistic: text goes in, text comes out, and it retains nothing and acts on nothing. The harness is deterministic code that holds all the state and does what the model cannot, re-assembling the transcript each turn, running tools, enforcing permissions, and verifying output. The rule of thumb is to trust the model for judgment and the harness for guarantees.

Where does value accrue in the AI agent stack?

Durable value sits on the layers you own, not the ones you rent. Compute and the frontier model are rented; domain knowledge, distribution and brand are owned; the harness is mixed. The real test of any layer is whether you rent it, can only copy it, or actually own it, and only the owned side holds lasting value.

What are the three pillars of a valuable harness?

Curated memory is what the agent has learned about your world and chosen to keep; context-specific skills are the procedures it has built and tuned through real use; reliability is a checked, validated sense of whether the work is right. They map to know, act, and judge. The model supplies only the generic version of each, while the compounding, context-fitted version lives in your harness.

How do you make an AI agent reliable?

Reliability is the hardest part, because getting an agent to 90% is easy and the last 10% is very hard. Three techniques work together: constrain the agent's path so there are fewer decisions to go wrong, add deterministic checks for anything code can verify, and use a validated LLM-as-judge for the open-ended rest. The first two are cheap and certain; the third is probabilistic and costly.

What is a validated LLM-as-a-judge?

It is an evaluator model tuned until its binary right or wrong verdicts match human experts on a held-out, hand-labeled answer key, then retested as the models and product change. Generic benchmarks and unverified graders do not work, and a model grading its own output tends to forgive its own failures. A judge measured against human ground truth is an instrument whose accuracy you actually know.

Is optimizing inference cost a competitive advantage for AI agents?

No, inference optimization is table stakes. Agents invert SaaS economics, cheap to build but expensive to run, burning 5 to 30 times the tokens of a simple app, so every builder optimizes with the same tactics: caching, routing, and compaction. The only durable cost advantage is owning the model you serve at scale, which carries a nine-figure admission fee beyond most builders.

What does it mean that models absorb capabilities?

Each model generation pulls in work that used to need scaffolding, such as chat-with-PDF wrappers, chain-of-thought prompts, function-calling orchestration, and hand-built few-shot examples. Those are capability crutches: useful but rapidly depreciating. Treat them as cheap, temporary, and disposable, and invest only in the parts a model cannot absorb, which are curated memory, context-specific skills, and reliability.

Should an AI agent use open or fixed paths?

Start at the fixed, constrained end and widen autonomy only as a task genuinely demands it. Open paths hand the model a computer and let it direct its own steps; fixed paths make the model a component in a graph you control. Autonomy is earned by reliability and performance, and the most reliable agent has the fewest open decisions and a path you can read back to see what it did.

Building Valuable Agents: Harnesses

Q: What makes an AI agent valuable and hard to copy?

An agent is competent when it can know its context, act in it, and judge its own work. The durable, uncopyable versions of those three live in the harness: curated memory, context-specific skills, and reliability. Each takes what the model is turning into a commodity, fuses it to your use case, and accumulates until a competitor cannot simply buy past it.

It is easy to build an agent. It is very difficult to build a great one.

Ever since I started using Claude Code, I have been amazed at how much the agent can do so well. For a while, I attributed everything to the model (Claude) and its training. It was only after Claude Code's source code was reverse-engineered that I discovered that much of what makes the product so good is not Claude the model, but the harness around it.

So I did a deep dive to understand harnesses, consuming as much relevant information as I could find on the subject. However, the most challenging part was not understanding how harnesses work or what they do. The most challenging part of writing this article was figuring out what makes a harness valuable, through the lens of a product builder.

The harness is the layer of your AI agent stack that sits immediately above the model and the only one that interacts with it. Right now (Jun 2026), the most dependable way to make an agent better is not by swapping out the model. It is by building a better harness.

A good example of this is when LangChain held its model fixed and changed only the surrounding harness. Its coding agent climbed from outside the top 30 to the top 5 of the Terminal-Bench 2.0 leaderboard, from 52.8% to 66.5%.¹

This article explains what a harness is, what it does and how it can create a durable competitive advantage in your agentic products.

First, some basic concepts that make understanding harnesses easier.

Basic Concepts

The Model is Stateless

A language model is stateless. It is a function from text to text. Text goes in, text comes out, and once the text is sent out the model retains nothing. It cannot open a file, run a command, check the time, or recall what it said a second earlier.

So on every turn, some surrounding program has to gather the system instructions, the entire conversation to date, and the newest input, then send all of it back to the model. Once the model replies, this program reads the reply, performs whatever action the reply asks for, and feeds the result back to the model for the next turn.

The surrounding program that does this is the harness. It exists because the model cannot hold its own state or act on the world.¹

Figure 1.The model forgets after every call and returns only text. The harness re-assembles and re-sends the whole growing transcript each turn. All of the state lives in the harness, not the model.

Definition of a Harness

The simplest and cleanest definition of a harness is the one LangChain¹ and Anthropic² use. Simply put:

Agent = Model + Harness

In other words, if it is not the model, it is the harness.¹ I will use this definition throughout this article.

Other framings are finer breakdowns of the same thing. Hugging Face splits harness from scaffold,³ METR and the safety world call the whole thing scaffolding and its tuning elicitation,⁴ and product teams separate the framework you import, the harness that runs, and the hosted runtime.⁵ None of these definitions change the central ideas in this article.

Control Philosophy: Open vs. Fixed Paths

One choice in how you design the harness decides most of what you build: how much autonomy the agent is given. This falls on a spectrum with two ends.
Open Paths: You hand the model a computer and let it direct its own steps.
Fixed Paths: The model is a component inside an explicit graph or set of roles you control.

Figure 2.The control philosophy, not the model, is what decides the rest of the harness. Claude Code and Hermes sit near the open end, OpenHands and OpenClaw in the middle, LangGraph and CrewAI at the fixed end, with OpenAI's Agents SDK in the centre. The arrow marks where to start: begin constrained, then widen autonomy only as the task needs it.

I have included some examples in the spectrum above. If you know any of these products, you already know what I am talking about.

Regardless of the kind of agent you are building, it is always better to start at the fixed, constrained end. Autonomy is earned by reliability and performance. You widen the autonomy only when an open-ended problem genuinely cannot be written as a fixed path. ⁶

Anatomy of a Harness

The harness is deterministic code. It is ordinary software: loops, conditional branches, schema validation, permission checks, retries, etc. It runs the same way every time. It is dependable and it is something you can fully control.

The model is probabilistic. It samples a plausible next token, and you can never fully guarantee what it will do.

Division of Labor: Trust the model for judgment. Trust the harness for guarantees.

Anything that must hold regardless of situation, such as a denied permission, a sandbox boundary, a verification gate, or a required output format, lives in the deterministic harness. Only open-ended judgment is delegated to the model. This is also the structural reason a smarter model cannot absorb governance.⁷

Figure 3.The control loop is the spine. History, memory and skills feed each prompt the harness assembles, verification checks the output, tools call out through a permissions and sandbox gate, and orchestration wraps the loop with sub-agents. The model, the external world, and the proprietary context all sit outside the harness boundary.

The diagram above has seven parts:

Control Loop. The cycle at the center. Assemble the prompt, call the model, parse the reply, execute the tool the model asked for, observe the result, and repeat. The model only emits text. Every action in the world happens because the harness chose to act on that text. The loop itself is small and just a few dozen lines of code.¹
History, Memory and Skills. The three things the harness draws on to assemble what is sent to the model. The live conversation (history), what the agent has learned about your world (memory) and any reusable procedures it has available (skills) specific to the context.
Context Assembly. The management of what the model sees inside its finite context window. Because the model remembers nothing between calls, the harness re-sends the relevant history each turn and prunes or summarizes it as it grows. This is genuinely hard. Model performance degrades well before the context is full. Only 60%-70% of the model's context is actually usable.⁸
Tools. The functions the model is allowed to call: read a file, run a command, query a database, hit an API. The Model Context Protocol (MCP) standardizes how a harness connects to those tools, but the harness is not limited to using MCP in calling tools.⁹
Permissions/Sandbox. The permissions and the sandbox form the deterministic gate on what the agent may do. Allow and deny rules decide which tool calls are permitted, and OS isolation, such as Seatbelt on macOS or Bubblewrap on Linux, contains what can be affected. The governance must hold even if a prompt injection bypasses the model's decision-making, so agents with access to sensitive information require sophisticated deterministic governance mechanisms.⁷
Verification. The checks on the model's work: deterministic checks, schema validation, retries, and a separate evaluator agent (the judge) given a fresh context window and no write access. Models skew positive when grading their own output, so the judge has to be separated from the worker.¹⁰
Orchestration. The outer layer that spawns and coordinates sub-agents, each with its own clean context window, for work that decomposes into parallel parts. It is powerful and expensive, but very useful on hard, splittable tasks.¹⁰

Two parts, Tools and Governance, do not actually sit in any single box. They run as threads through the agent stack. A tool's interface lives in the harness, but the backend it calls, the real business logic, lives in your own systems. Governance is enforced in the harness, but the policy it enforces comes from the business.

Figure 4.Tools and governance cross two layers rather than sitting in one. Tools: the interface lives in the harness, the backend lives in your data and business logic. Governance: enforcement lives in the harness, policy lives in the layer above.

Models absorb capabilities

As models scale, they absorb capabilities. This is a pattern that runs through the whole field. Skills that once demanded a system of their own such as translation, summarization, document comprehension, coding, multi-step reasoning, keep collapsing into a general large language model that simply does these things.

Each model generation pulls in work that was once hard, within its boundary of default behavior and you are watching this right now in real time. This absorption runs on a clock, dictated by model training cycles and it is accelerating.⁴

Figure 5 · What the model absorbs, and how fast

Harness feature	Build cost	Absorbed by the model	Time
Chat-with-PDF wrappers	A weekend	Native document understanding, late 2023	Months
Chain-of-thought scaffolds	Days	Reasoning models, 2024 to 2025	About a year
Function-calling orchestration	Weeks	Native tool use, 2023 onward	Months
RAG and chunking crutches	Weeks	Eased by 1M-token context, 2026 (partial)	One to two years
"Context anxiety" patch	A sprint	Gone one generation later, Sonnet 4.5 to Opus 4.5	About three months
Hand-built few-shot exemplars	A day	Native zero-shot instruction following, instruction-tuned models, 2021 to 2022¹¹	About a year
Agentic coding loops (plan, edit, run tests, repair)	Months	Native agentic coding via RL post-training, 2024 to 2026 (partial)	About two years

Figure 5 · What the model absorbs, and how fast.The shaded columns are the model's doing: what absorbed each, and when.¹²

A large part of harness design is capability crutches. Stuff the model is not capable of doing yet. These include planning prompts, reflection loops, structured output tricks, retrieval workarounds, and hand-built orchestration. As you can see in the table above, these get absorbed by each successive model generation.

So when building a harness, we need to be careful about what we choose to build and treat capability crutches as cheap, temporary and disposable features that will be absorbed. They are useful in the short term, but rapidly depreciating assets.¹³

A useful shorthand: If a new model with a minimal harness matches a carefully tuned harness across real tasks, the model has absorbed much of the carefully tuned harness.

An AI Value Map

To know what makes an agent valuable, we first need to see where the value collects.

The map below which charts value in the AI agent stack is useful in understanding the current overall landscape. But it is only useful as long as we remember that this is a map and not the territory. We are still at the beginning of a change that will redefine many things including products and businesses. And the actual market is shifting rapidly with new techniques and technological progress as well as significant market moves by the current SOTA model providers.

The real test is not where a layer sits in the stack or how this map changes. It is whether you rent that layer, can only copy it, or actually own it.¹

Figure 6.A value and ownership view of the agentic stack. Each layer sits left to right by whether you rent it, can only copy it, or own it, and only the owned side holds durable value. Inside the harness, which splits the same three ways, the durable part is the three pillars: curated memory, context-specific skills, and reliability. Inference cost control is table stakes, and its one durable lever, an owned model at serving scale, sits at the harness edge.

Compute. Rented by the token, from whoever owns the chips. Edge compute (devices capable enough to run a model) is catching on fast, but still limited by device processing capabilities and memory.
The Model. Rented at the frontier, where you call an API. It is owned only at serving scale, where you train or distill your own. Open models can be fine-tuned and hosted on your own or rented compute. The owned model is the single biggest thing that moves the margin needle in the current market, but requires a lot of capital.
The Harness. This is the most interesting one. It can be split into three things: a perishable edge the model will eat, mechanisms anyone can clone, and the parts that simply cannot be copied by competitors or absorbed by the model.
Domain Knowledge. Currently owned outright. Future is uncertain. This is what the model providers are targeting with FDE driven (Forward Deployed Engineer driven) business models and engineers who embed in the customer to encode domain workflows into the product.¹⁴ If these business models succeed (like Palantir) the models will absorb significant portions of domain knowledge specific to businesses. This is the most valuable part in this map,¹⁵ and it deserves its own articles. This is not one of those articles.
Distribution & Trust (Brand). Owned outright. The distribution and trust that make your brand are the real moat. No model can absorb them and no competitor can fork them. The harness provides you a few valuable methods in building these.

The map above is a view of the whole stack. From here on let's focus on the harness and identify what is worth building and owning within it.

Harnessing Value

Within the scope of work an agent handles, there are three things that make an agent competent at doing the job it is assigned. The agent has to know the context it works in, it has to be able to act in that context, and it has to be able to judge whether it did the work right.

Figure 7.Memory is what the agent knows about your world, ability is what it can do in it, judgement is how it checks itself. Each accrues in your context and compounds with use; the model supplies only the generic version of each.

The model gives you all three, but only ever the generic version. Currently it knows the world at large, it acts on tasks at large and it judges in the abstract. Each of these has a durable counterpart inside the harness, which can neither be absorbed by the model nor copied by competitors.

Curated Memory: This is what the agent has learned about your world, use case or domain and has chosen to keep.

Context-Specific Skills: These are the abilities the agent has built as a result of routine use within the context of its work.

Reliability: This is a checked and validated sense of whether the work is right.

All of these follow the same pattern: Take what the model is turning into a commodity, fuse it to your use case or domain, and let it accumulate until a competitor cannot simply buy past it.

Note · Inference Optimization

Inference optimization looks like a fourth source of value in the harness, but when you look closely, it is table stakes. Agents invert SaaS economics: cheap to build, expensive to run. They burn 5–30× the tokens compared to simple applications like AI chat. And the cost of inference is falling far slower for reasoning than it is for simple inference.¹⁶ So inference optimization is table stakes. Every builder renting a model has to clear these stakes using the same tactics like caching, routing, and compaction. And these features are converging into shared, automated defaults baked into platforms and SDKs. Clearing these table stakes keeps you alive. It does not set you apart.

In the current market the only thing that provides a genuine advantage on inference costs is owning the model you serve at scale. But that club requires a nine-figure admission fee (Ex: Cursor ¹⁷) and is beyond the reach of most product builders.

Curated Memory

Memory mechanisms such as vector stores, full-text search, and the create-read-update-delete memory tools that ship today are a commodity. The open-source mem0 and Letta projects give these tools away,¹⁸ and the model providers are adding native memory of their own. This is not what I mean by Curated Memory and these are not worth investing your time in. Like inference optimization, these are table stakes.

The part that neither the model providers nor a competitor can copy is the discipline of deciding what to remember when and the accumulated path-dependent state the harness produces as a result of this. This compounds over time, increasing the agent's comprehension of its context. This lives squarely in the harness.

Curated Memory is the best name I have for it, for lack of a more precise, memorable industry-standard term.

It is easier to explain this with examples, and there are a handful of products already doing this today very creatively.

Hermes, keeps four kinds of memory and recalls across past sessions through full-text search and summarization. The memory is deliberately cache-aware, so learning does not inflate inference costs.¹⁹
Claude Code runs an agent-maintained memory file, an on-demand private store, and compaction.²⁰
Claude Managed Agents' Dreaming pass replays past sessions in the background to extract patterns, merge duplicates, and retire stale entries.²¹
Sierra's Expert Answers mines a company's own resolved support conversations into reusable knowledge articles. A rival can copy the architecture, not the accumulated ticket history.²²

Hermes and Claude Code's curated memory lives in the harness. Claude Managed Agents and Sierra's Expert Answers add this knowledge to their business owned data and processes.

Each of these creates switching costs and stickiness on repeated use.²³ Additionally when captured, they create owned long term knowledge which a model can be trained on.

A Problem/Opportunity: Despite the techniques illustrated above, Memory Staleness is still not fully solved,²⁴ and auto-generated memory can sometimes actively hurt rather than help. Something I experienced first-hand recently when Claude Code confused a repository with a similarly named older one, and made a royal mess of it. I am sure there are plenty more examples like this.

Context-Specific Skills

Just like memory mechanisms, the mechanisms for building context-specific skills are a commodity, open-sourced or shipped natively. They go by different names. Microsoft's Semantic Kernel called them Skills first, before renaming them Plugins. Anthropic, AutoGen, and OpenHands all ship Skills. OpenHands also calls them microagents.²⁵ Cognition's Devin calls them Playbooks.²⁶ And NVIDIA's Voyager grows a skill library.²⁷

Academic literature just calls them reusable procedures, or SOPs. The packaging is converging as well. Anthropic open-sourced SKILL.md.²⁸ AGENTS.md is a cross-tool standard read by Codex, Cursor, Copilot, and a dozen others, and Rules files ship in every IDE. By whatever name and in whatever format, the mechanism is a commodity.

Any model can build these skills for each user and deployment, and will even help the harness spot which ones repeat or solve a hard problem. But for that to happen, the harness has to provide the mechanisms to automatically identify, create, and fine-tune these skills.

Context-specific skills are the least proven of the three (Curated Memory, Context-specific skills & Reliability) at creating lasting value. I included it because it makes sense despite the lack of proven market examples. It stands to reason that a corpus of skills, each tuned over repeated use in real-world context and circumstances, is a valuable record of what gets used over and over, and which actions work and which don't. Curated memory captures what is true about the real-world context. Context-specific skills capture what works in it with the model.

That record feeds into your Owned Data and Processes.

At scale, and from a large enough corpus, this already mostly-anonymized data can be used to train or tune an owned model. As smaller open and edge models make tuning cheap, this corpus becomes a valuable source product builders can use to do what is currently only accessible to frontier models.

Some examples of actual implementations:

Voyager (NVIDIA Research), an agent in Minecraft writes each new behavior into an ever-growing library of executable skills, composing later ones from earlier.²⁷
Hermes writes its own reusable skills when a workflow proves worth saving and refines them with use.¹⁹
Agent Skills from Anthropic packages procedural knowledge and organizational context so a general model performs in its specific context.²⁸
Devin Playbooks (Cognition) are versioned procedures with success conditions and forbidden actions, saved and replayed.²⁶

In the first two examples above, Context-Specific Skills live in the harness. In the last two examples (Agent Skills and Devin Playbooks), the harness sends this data to the layer above, which is Owned Data & Processes.

Reliability

This is the most important one.

The hardest part of shipping an agent is not getting it to do a task. It is being sure it did the task right. An agent that is even right 90% of the time and silently wrong the other 10%, with no way to catch the 10%, cannot be turned loose on anything that matters. And getting an agent to 90% is easy. The last 10% is very hard.

Figure 8.A constrained path leaves fewer decisions to go wrong, deterministic checks catch what code can verify, and a validated judge tuned on your own withheld, human-labeled ground truth handles the open-ended rest. The first two layers are deterministic, cheap and certain; the third is probabilistic and costly.

There are three different techniques, that are known to work well:

1. The Constrained Agent Path

The most reliable path to reliability, is to leave the agent less room to be wrong. Every open decision the agent has to make can go wrong and then has to be caught. A path pinned down in advance prevents this. The most reliable agent has the fewest open decisions, and a fixed path that can be read back to see what it did.⁶

This is the driving design philosophy of LangGraph which has deterministic structural control, graph based state management and checkpoints, making everything auditable and recoverable. This is an example to anchor the mental model, not a recommendation of LangGraph or fixed paths.

By beginning constrained and widening autonomy as the task demands it, we retain more control over the reliability of an agent. This is the reason for my recommendation that you start with constrained paths earlier under control-philosophy.

2. Deterministic Checks / Assertions ²⁹

We check what code can check. A lot of what an agent gets wrong can be defined.

Examples: The output is plainly incorrect in one attribute, a total does not add up, a citation points nowhere, an action breaks a business rule, etc.

A deterministic check/assertion catches such issues on every output. There is no judgment required and it is cheap, certain, and catches a surprising number of problems. The rules the harness encodes are owned by you.

3. A Validated LLM-as-Judge (Eval)

The first two methods are deterministic. A validated judge is the probabilistic one.

It determines what no deterministic test can determine, i.e. whether a model output is actually right. Having a validated judge allows you to reuse the judgement of the people who understand the domain or agent task to verify the work of the agent.

In general, if a model can be optimized for a metric instead of the output that metric represents, it will be, so public benchmarks and generic evals don't work. They measure the performance of a model on generic tasks and not what the agent you are building needs to actually do. Often they worsen the problem by letting you believe that the model is performing even when it is not.³⁰³¹

If you ask the model to grade its own work, it will tend to agree with itself and forgive its own failures.¹⁰ If you use an unverified grader (even a different model) to grade the output, you will get a confident opinion of unknown quality.³²

The only judgment you can trust is the one checked against the people who know the domain and the work the agent is tasked with. You have experts decide on a sample of real cases right or wrong (it has to be binary), hold it back as an answer key, and tune an LLM judge until its verdicts match theirs (validation) using ML-based techniques, retesting as the product and models shift. A generic eval and an unchecked grader are guesses, while a judge measured against human ground truth is an instrument whose accuracy you know, on the work the agent actually does. This is how you build a validated llm-as-judge.³³

A validated judge can be expensive, and by a lot if you intend to use one at runtime within the harness. With current inference costs, I would build one only for critical agent tasks that need to be scrutinized automatically on a schedule, at random or during run-time.

Conclusion

The largest disclosed AI revenue today sits at the frontier model layer.³⁴ But that revenue is pulled through from the layer above. GPT-3.5 only became a phenomenon once a chat product turned it into ChatGPT. Anthropic's recent growth is hard to imagine without Claude Code. The model supplies general capability. It is the harness that turns this into a powerful and valuable agent. And that is where the next wave is forming.

Gartner expects task-specific agents in roughly 40% of enterprise applications by the end of 2026, up from under 5% a year earlier.³⁵ The infrastructure layer these agents plug into has already passed 10,000 MCP servers.⁹

No one agent or agent-type can address the millions of tasks or thousands of use cases we have today, so the variety of agents is set to explode. A coding agent and a support agent cannot use the same loop or the same guardrails, so the types of harnesses and what each does is set to explode as well.

The harness clearly is an extremely useful tool in creating value. For a product builder keeping up with this rapid momentum, building valuable agents by building a valuable harness is a bet on where durable long-term value comes from and not a strategy to capitalize on the current market. The design choices we make now in doing so will matter disproportionately. This article was an effort to identify the good ones. If you find better ones, please tell me.

A recurring theme you must have noticed, and the underlying message is this: Build what the models or your competition cannot commoditize.

Happy Building!

References

LangChain, The Anatomy of an Agent Harness (Vivek Trivedy, 10 Mar 2026). langchain.com
Anthropic, How Claude Code works (Claude Code documentation). code.claude.com
Hugging Face, Harness, Scaffold, and the AI Agent Terms Worth Getting Right (25 May 2026). huggingface.co
METR, Measuring AI Ability to Complete Long Tasks (Kwa et al., arXiv:2503.14499), Time Horizon 1.1 (29 Jan 2026), and Measuring Time Horizon using Claude Code and Codex (13 Feb 2026, the 50.7 percent Claude-Code-versus-ReAct result). arxiv.org · metr.org
Anthropic, Scaling Managed Agents: decoupling the brain from the harness. anthropic.com
Anthropic, Building effective agents (workflows versus agents). anthropic.com
Simon Willison, The lethal trifecta for AI agents (16 Jun 2025), with Claude Code permissions and sandbox documentation. simonwillison.net
Chroma, Context Rot technical report (Hong, Troynikov and Huber, Jul 2025). research.trychroma.com
Anthropic, Donating the Model Context Protocol and establishing the Agentic AI Foundation (9 Dec 2025, 97M downloads, 10,000+ servers). anthropic.com
Anthropic, How we built our multi-agent research system (token multiples, separate evaluator, single-agent versus multi-agent +90.2%). anthropic.com
The few-shot to zero-shot shift: Wei et al., Finetuned Language Models Are Zero-Shot Learners (FLAN, ICLR 2022, arXiv:2109.01652), and Ouyang et al., Training Language Models to Follow Instructions with Human Feedback (InstructGPT, NeurIPS 2022, arXiv:2203.02155). arxiv.org
Rich Sutton, The Bitter Lesson (13 Mar 2019). incompleteideas.net
Cat Wu (Anthropic, Head of Product for Claude Code) on Lenny's Podcast, How Anthropic's product team moves faster than anyone else (2026): the team builds features that do not fully work yet, then swaps in each newer model to see whether the capability gap has closed, and audits the system prompt every model release to strip out crutches that compensated for prior weaknesses. lennysnewsletter.com
TechCrunch, Anthropic and OpenAI are both launching joint ventures for enterprise AI services (4 May 2026): both labs launched enterprise AI ventures built on the Palantir-style forward-deployed-engineer model, including OpenAI's "The Development Company." techcrunch.com
Andreessen Horowitz, AI will split the software industry (Immerman and Rodriguez, Mar 2026), with Jennifer Li on context as currency. a16z.com
Epoch AI, The Price of Progress: Algorithmic Efficiency in LLM Inference (arXiv:2511.23455, Nov 2025): per-token prices on matured "chatbot-tier" models keep falling fast, but the cost of running a hard reasoning benchmark (GPQA-Diamond) at the frontier rose roughly 18× per year as models grew and reasoning chains lengthened — reasoning is the scarce, expensive layer. arxiv.org
TechCrunch, The high costs and thin margins threatening AI coding startups (Aug 2025), with Marina Temkin's Apr 2026 reporting on Cursor's Composer model, plus Foundamental's gross-margin calculation for Cursor. Secondary reporting, not audited accounts. techcrunch.com
mem0 (Chhikara et al., arXiv:2504.19413), with Anthropic memory tool documentation. Token-savings claims are vendor-reported. arxiv.org
Hermes (Nous Research), agent documentation and technical write-ups of its self-improving, skill-writing memory loop. Web-sourced. hermes-agent.nousresearch.com
Anthropic, Managing context on the Claude Developer Platform (context editing plus memory eval: 84% fewer tokens, +39% performance, vendor-internal). anthropic.com
Anthropic, Claude Code memory and Dreaming, a scheduled memory-consolidation process launched as a Managed Agents research preview in May 2026. claude.com · code.claude.com/docs/en/memory
Sierra, Agent Data Platform (Nov 2025) and Expert Answers (Jan 2026): grounded knowledge articles auto-mined from a deployment's own resolved conversations. Vendor, primary. sierra.ai
On memory as a moat: Nicolas Bustamante, Agent memory engineering, and Tara Tan (Strange VC), Memory is a moat. Web-sourced analysis. nicolasbustamante.com · strangevc.com
Zep, Zep: A Temporal Knowledge Graph Architecture for Agent Memory (Rasmussen et al., arXiv:2501.13956, Jan 2025), built on the Graphiti engine; bi-temporal fact invalidation gives each fact a validity window so stale facts are retired, not deleted. Primary. arxiv.org
OpenHands (All-Hands AI), agent Skills and microagents: keyword-triggered procedures shared through a public registry. Primary. docs.openhands.dev
Cognition, Devin Playbooks — versioned, reusable task procedures with explicit success conditions and forbidden actions. Vendor docs, primary. docs.devin.ai
Voyager (Guanzhi Wang et al., NVIDIA, Voyager: An Open-Ended Embodied Agent with Large Language Models, arXiv:2305.16291, 2023): builds an "ever-growing skill library of executable code," composing new skills from earlier ones, with no model fine-tuning. arxiv.org
Anthropic, Equipping agents for the real world with Agent Skills (16 Oct 2025; open standard, 18 Dec 2025). The "procedural knowledge and organizational context" and "specific context" framing is Anthropic's own. anthropic.com
Eval practitioners on why agents fail and how to judge them. Hamel Husain, Your AI Product Needs Evals, LLM Evals FAQ, and Who Validates the Validators? (the root-cause claim and the binary, hand-labeled, validate-the-judge rules), and Chip Huyen, Common pitfalls when building generative AI applications. Primary practitioner sources. hamel.dev · huyenchip.com
OpenAI, Why we no longer evaluate SWE-bench Verified (Feb 2026). openai.com
UC Berkeley RDI (Hao Wang, Qiuyang Mang, Alvin Cheung, Koushik Sen, Dawn Song), Trustworthy Benchmarks and Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack (Apr 2026, arXiv:2605.12673). Their BenchJack exploit drove eight major agent benchmarks to near-perfect scores without solving tasks, six to 100 percent (incl. SWE-bench Verified and Terminal-Bench), GAIA to about 98 percent, OSWorld to 73 percent. Primary, web-verified. rdi.berkeley.edu
Penfield Labs, We Audited LoCoMo: 6.4% of the Answer Key Is Wrong and the Judge Accepts up to 63% of Intentionally Wrong Answers (Apr 2026): the language-model judge accepted 62.8 percent of wrong but topically plausible answers, with reproducible scripts. Secondary. dev.to
Hamel Husain, Using LLM-as-a-Judge For Evaluation: A Complete Guide — treating the judge as an ML problem: align it to human labels on a held-out labeled set, measure true-positive and true-negative rates, and iterate via prompt engineering or fine-tuning. hamel.dev
Menlo Ventures, 2025: The State of Generative AI in the Enterprise (Dec 2025). $19B of $37B app layer, 63% startup share, 76% buy not build, 47% versus 25% pilot conversion, Anthropic about 40% API share. menlovc.com
Gartner, Gartner Predicts 40% of Enterprise Apps Will Feature Task-Specific AI Agents by 2026, Up From Less Than 5% in 2025 (26 Aug 2025). Analyst projection, secondary. gartner.com