In 2026, the most interesting sentence in AI isn’t about a new model. It’s a formula that Anthropic, Martin Fowler, and half of the AI research community have converged on in the last few weeks:
Agent = Model + Harness.
If you’ve been on AI Twitter recently, you’ve seen the word harness everywhere. Princeton released the HAL harness. HKUDS open-sourced OpenHarness. A new Meta-Harness paper showed that automatically rewriting the harness around a fixed model can lift TerminalBench-2 scores by several points without touching the weights. Philipp Schmid called the agent harness “the primary tool for solving model drift” in long-running tasks.
But here’s the thing nobody is saying out loud: almost every harness conversation in 2026 is about coding agents. Claude Code. SWE-bench. Terminal tasks. Repo navigation.
What about everything else? What about the agent work that doesn’t involve a Git repo?
We’re Lessie, and we build a harness agent for one specific job: finding people. Recruiters use us to find candidates. Sales teams use us to find decision-makers. VCs use us to find founders. Marketers use us to find creators. So when the harness conversation took off, we wanted to know something concrete: does the “harness matters more than the model” thesis actually hold up outside of coding?
So we built a benchmark and ran the experiment. The result is PeopleSearchBench, and the headline number is this:
On 119 real-world people search queries, Lessie scored 65.2. Claude Code, running on Sonnet 4.6, scored 45.8. That’s a 42% relative gap—and the only thing that changed was the harness.
Let’s unpack what that means.
What is a harness agent, in plain English?
The shortest definition comes from the OpenHarness team: the model is the agent; the code is the harness. A slightly longer one from Parallel Web: a harness is the runtime that wraps a model, intercepts its tool calls, manages its context, and keeps it on task.
Martin Fowler frames it as two halves working together. Guides are feed-forward controls—they shape the agent’s behavior before it acts (system prompts, tool descriptions, retrieved context, environment snapshots). Sensors are feedback controls—they observe what the agent did and feed corrections back in (linters, validators, verification loops). A good harness combines both. A bad harness is feed-forward only and watches the agent repeat the same mistake on turn 47.
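Fowler’s two halves reduce to a small control loop: assemble feed-forward context, let the model act, run the output past the sensors, and feed corrections back in. A toy sketch (every function and name here is hypothetical, not any real SDK):

```python
from typing import Callable, Optional

def run_harness(model: Callable[[str], str], task: str,
                guides: list[str],
                sensors: list[Callable[[str], Optional[str]]],
                max_turns: int = 10) -> str:
    """Minimal guides-and-sensors loop, illustrative only."""
    # Guides: feed-forward controls assembled BEFORE the model acts
    # (system prompts, tool descriptions, retrieved context).
    context = "\n".join(guides) + "\n" + task
    output = ""
    for _ in range(max_turns):
        output = model(context)
        # Sensors: feedback controls that inspect what the model did.
        # Each sensor returns a correction string, or None if it passes.
        corrections = [c for s in sensors if (c := s(output)) is not None]
        if not corrections:
            return output  # all sensors pass: the job is done
        # Feed the corrections back in so the next turn can fix them.
        context += "\n" + output + "\nFix: " + "; ".join(corrections)
    return output
```

A feed-forward-only harness is this loop with an empty `sensors` list: the model gets one shaped attempt and nobody checks the result.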
So a harness agent is the whole package: a model plus the guides and sensors and tools and memory and verification logic that turn raw token prediction into something that can finish a real job.
There are two flavors emerging:
- General-purpose harnesses like the Claude Agent SDK, OpenHarness, and the harness inside Claude Code. These are designed to be domain-agnostic.
- Vertical harnesses built around one job, with guides and sensors tuned for that job’s failure modes.
Almost every harness benchmark you’ve read about—SWE-bench, TerminalBench-2, USACO, AppWorld—measures general harnesses on coding tasks. PeopleSearchBench is, as far as we know, the first benchmark that pits a vertical harness agent against a general one on a non-coding job.
Why people search needs its own harness
If you’ve ever asked a general AI agent to “find me senior ML engineers at Series B startups in Berlin who’ve shipped LLM products,” you already know the failure modes. Three of them are particularly stubborn, and all three are harness problems, not model problems:
1. Cross-source entity resolution. A real person exists across LinkedIn, X, GitHub, conference talks, company pages, and academic databases. They use different names, different photos, sometimes different spellings. A general harness has no built-in notion of “this LinkedIn profile and this GitHub account are the same human.” A people search harness has to solve this on every query.
2. Verification loops. Without a sensor layer, agents confidently invent people. They’ll cite a “Senior ML Engineer at Stripe Berlin” who doesn’t exist, because the tokens are plausible. The fix isn’t a smarter model—Sonnet 4.6 still does this in Claude Code. The fix is a sensor: every returned person gets checked against live web sources before it ever reaches the user.
3. Query decomposition for human attributes. “Series B Berlin ML engineer who ships LLM products” isn’t one query. It’s a checklist: role + seniority + company stage + location + domain + recent output. A general harness throws the whole sentence into a search box. A vertical harness decomposes it into criteria, runs them in parallel across the right sources, then re-assembles and ranks.
Each of these is exactly what Fowler means by guides and sensors. They just happen to be guides and sensors that nobody bothers to build into a general coding harness, because coding harnesses don’t need them.
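The decomposition step in particular is easy to make concrete. A toy version, with keyword rules standing in for the model-driven extraction a real harness would use (all rules and names here are illustrative, not Lessie’s actual criteria):

```python
import re
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str     # e.g. "location"
    pattern: str  # what to look for in a candidate's profile text

def decompose(query: str) -> list[Criterion]:
    """Toy decomposition of a people-search query into checkable criteria.
    A real harness would use the model for this step; these keyword
    rules are purely illustrative."""
    rules = {
        "seniority": r"\bsenior\b",
        "role": r"\bML engineer",
        "company_stage": r"\bSeries B\b",
        "location": r"\bBerlin\b",
        "domain": r"\bLLM\b",
    }
    return [Criterion(name, pat) for name, pat in rules.items()
            if re.search(pat, query, re.IGNORECASE)]

def check(profile_text: str, criteria: list[Criterion]) -> dict[str, bool]:
    # Each criterion is verified independently, so a failed check
    # points at a specific missing attribute instead of a vague "no".
    return {c.name: bool(re.search(c.pattern, profile_text, re.IGNORECASE))
            for c in criteria}
```

The payoff is the checklist itself: one boolean per criterion per candidate, which is exactly what a verification sensor needs to reject a plausible-sounding invention.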
The receipts: PeopleSearchBench
We built PeopleSearchBench to test this honestly. The full methodology is in the paper, but here’s the short version:
- 119 real queries sourced from actual recruiter, sales, and research workflows
- 4 languages (English, Portuguese, Spanish, Dutch)
- 4 scenarios: Recruiting (30), B2B Prospecting (32), Expert / Deterministic (28), Influencer / KOL (29)
- 4 platforms tested: Lessie (vertical harness agent), Exa (structured search API), Juicebox / PeopleGPT (recruiting platform with 800M+ profiles), Claude Code (general harness on Sonnet 4.6)
- 3 independent scoring dimensions: Relevance (padded nDCG@10), Coverage (task completion × yield), Utility (profile data completeness)
- Verification by live web search, not LLM vibes—every returned person gets fact-checked against LinkedIn, company sites, and public profiles. The verification agent has no idea which platform produced which result.
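For readers who want the Relevance metric pinned down: nDCG@10 is the standard ranked-retrieval measure, and one plausible reading of “padded” is that result lists shorter than 10 get zero-relevance slots so every query is scored over the same 10 positions (the paper has the authoritative definition; this sketch is just that reading):

```python
import math

def padded_ndcg_at_10(relevances: list[float], k: int = 10) -> float:
    """nDCG@k with short result lists padded to k zero-relevance slots.
    The padding behavior is an assumed reading of 'padded', not a
    confirmed implementation detail of PeopleSearchBench."""
    rels = (list(relevances) + [0.0] * k)[:k]
    # DCG: relevance discounted by log2 of the (1-indexed) position + 1.
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    # Ideal DCG: same list sorted best-first.
    idcg = sum(r / math.log2(i + 2)
               for i, r in enumerate(sorted(rels, reverse=True)))
    return dcg / idcg if idcg > 0 else 0.0
```

The padding matters for people search specifically: a platform that returns three great profiles and then stops should not score the same as one that fills all ten slots with great profiles.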
Here are the overall scores:
- Lessie: Overall 65.2 | Relevance 70.2 | Coverage 69.1 | Utility 56.4
- Exa: Overall 54.6 | Relevance 53.8 | Coverage 58.1 | Utility 53.1
- Claude Code: Overall 45.8 | Relevance 54.3 | Coverage 41.1 | Utility 42.7
- Juicebox: Overall 45.8 | Relevance 44.7 | Coverage 41.8 | Utility 50.9
Lessie is first in every dimension. It’s also the only platform that finished every single one of the 119 queries—a 100% completion rate. The other three regularly returned nothing on niche searches.
But the number that matters most for the harness debate is the gap between Lessie and Claude Code. Both are AI agents. Both can call tools. Both can search the web. Claude Code is running on one of the strongest models on the planet. And it lost by 19.4 points overall, including a 28-point gap on Coverage.
That 19.4 points is not a model gap. It is a harness gap.
The widest single-scenario gap was Influencer / KOL discovery, where Lessie scored 62.3 and Claude Code scored 43.2. Influencer search is where general harnesses fall apart hardest, because the right answer lives across TikTok, Instagram, YouTube, and X simultaneously, and a general harness doesn’t know how to fuse them. The narrowest gap was Recruiting, where three platforms scored above 64—recruiting is the most mature people-search vertical, and the field has had years to build tools for it.
The pattern is consistent: the more a scenario requires multi-source fusion and verification, the more the harness matters.
What’s inside the Lessie harness
We’re not going to publish our system prompts. But the architecture has three layers that map cleanly onto the guides-and-sensors model, and they’re worth describing because they’re roughly what any vertical harness agent needs:
Layer 1 — Multi-source orchestration (guides). When a query comes in, the harness routes it in parallel across professional networks, social platforms, academic databases, and public registries. Each source has its own retrieval strategy. The model never sees the raw fan-out; it sees a unified candidate set.
Layer 2 — Criteria decomposition and verification (sensors). Every query gets broken into explicit criteria—role, seniority, location, company stage, signals—and every candidate gets checked against those criteria via live web lookups before it reaches the ranking step. This is the same methodology PeopleSearchBench uses to score us, which is not a coincidence: we built the harness around the failure modes the benchmark measures.
Layer 3 — Profile enrichment. Once a person passes verification, the harness pulls structured profile data—current role, recent activity, contact paths, social presence. This is why our Utility score leads the field: returning the right person with empty fields isn’t useful, and a general harness has no reason to do enrichment as a built-in step.
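Stripped of product specifics, the three layers compose into a pipeline shaped roughly like this. Every callable is a stand-in; none of this is Lessie’s actual code:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Candidate = dict  # e.g. {"name": ..., "source": ..., "profile": ...}

def people_search(query: str,
                  sources: list[Callable[[str], list[Candidate]]],
                  verify: Callable[[Candidate, str], bool],
                  enrich: Callable[[Candidate], Candidate],
                  rank: Callable[[list[Candidate], str], list[Candidate]],
                  ) -> list[Candidate]:
    """Toy three-layer pipeline: orchestrate -> verify -> enrich -> rank."""
    # Layer 1 (guides): fan the query out across sources in parallel;
    # downstream steps see one unified candidate set, not the raw fan-out.
    with ThreadPoolExecutor() as pool:
        fanned = pool.map(lambda s: s(query), sources)
    candidates = [c for batch in fanned for c in batch]
    # Layer 2 (sensors): drop anyone who fails live verification
    # before they can reach the ranking step.
    verified = [c for c in candidates if verify(c, query)]
    # Layer 3: enrich survivors with structured profile data, then rank.
    return rank([enrich(c) for c in verified], query)
```

The division of labor is the point: the model shows up inside `verify` and `rank`, where judgment lives, while the pipeline itself is plain deterministic code.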
The model in the middle is doing what models are good at: reasoning, ranking, summarizing, judging. The harness is doing everything else. Take the harness away and you have a chatbot. Take the model away and you have a search pipeline. Put them together and you have a vertical harness agent.
What this means for the harness debate
The interesting claim coming out of the 2026 harness conversation is that model progress is slowing on static benchmarks but agent performance is still wide open, because most of the remaining gains live in the harness. Meta-Harness showed this by automatically discovering better harnesses for coding. PeopleSearchBench shows the same thing from the other direction: hand-built vertical harnesses can beat a frontier model running inside a general harness, by margins that no model upgrade is going to close.
If that’s right, two things follow.
First, every commercially valuable agent job is going to get its own harness agent. People search is one. Legal research is another. Clinical reasoning, financial analysis, supply chain investigation, scientific literature review—each of these has failure modes that a general harness will never optimize for, because the general harness is optimizing for everything. Vertical harness agents are going to eat the long tail of agent work the same way SaaS ate the long tail of software.
Second, benchmarks need to follow. SWE-bench and TerminalBench-2 are great, but they measure one slice of harness quality. If the field is serious about the harness thesis, we need harness benchmarks for every vertical that matters. PeopleSearchBench is our attempt to start that for people search. The dataset, the evaluation pipeline, and the full results are open source.
The model is the engine. The harness is the car. We built the car for one road. If your job involves finding people—candidates, customers, investors, creators, partners—try the car: lessie.ai. And if you want to see exactly how we beat a frontier-model coding agent at something it was never built to do, the full benchmark and paper are here.
In 2026, the harness is the moat. The numbers say so.