How good is AI people search in 2026? We built an open-source benchmark to find out. It runs 119 real-world queries, drawn from actual practitioner workflows in recruiting, sales, and research, against four platforms: Lessie, Exa, Claude Code, and Juicebox. Every result was independently verified against live web sources. No self-reported data. No hand-picked examples.
The result: Lessie scored 65.2 overall, leading in all four scenario categories. The next closest platform scored 55. This post walks through the full benchmark results—what we measured, how we scored, and what the data reveals about the state of AI-powered people search.
Why This Matters
AI people search is becoming core infrastructure for recruiting, sales, and research teams. But until now, there was no standardized way to compare platforms. Vendors self-report accuracy numbers that can’t be verified. Case studies cherry-pick the best results. This benchmark changes that—119 real queries, independent web verification, and a level playing field for every platform tested.
Platform Comparison
Each platform ran the same 119 real-world queries under identical conditions, and every result was scored independently through web verification on a 0–100 scale. Scores are averaged across three dimensions: Relevance, Coverage, and Utility.

Overall scores: Lessie 65.2 | Exa 55 | Claude Code 46 | Juicebox 45.8. The overall score is the simple average of Relevance, Coverage, and Utility—each measured independently on a 0–100 scale.
Breaking it down by dimension: Lessie led on Relevance (70.2 vs. 54.3 for the next best), Coverage (69.1 vs. 58.1), and Utility (56.4 vs. 53.1). The largest gap was in Relevance—a +29% advantage over the runner-up—meaning Lessie consistently returned the right people, correctly ranked, across diverse query types.
Performance by Scenario
The benchmark covers four real-world use cases where AI people search creates business value. Each scenario reflects a distinct workflow: different data sources, different criteria complexity, and different definitions of a “good” result.

Influencer / KOL: Lessie 62.3 | Claude Code 43.2 | Exa 41.6 | Juicebox 31.1. This was the widest performance gap in the entire benchmark. Single-source platforms struggle most here because influencers exist across fragmented social platforms—Instagram, TikTok, YouTube, Twitter, podcasts, newsletters—and no single database covers them all.
Expert / Deterministic: Lessie 70.4 | Exa 61.2 | Claude Code 57 | Juicebox 44.2. These queries have verifiable correct answers or seek specific domain experts. Lessie’s hybrid search strategy—combining structured databases with live web research—proved most effective at finding the exact right people.
B2B Prospecting: Lessie 60.6 | Exa 55.2 | Juicebox 51.4 | Claude Code 43. Finding decision-makers at target companies is the most common AI people search use case. Lessie’s advantage comes from cross-referencing multiple data sources to verify current roles and contact information.
Recruiting: Lessie 68.2 | Juicebox 65.7 | Exa 64.7 | Claude Code 50.5. This was the most competitive scenario—three platforms scored above 64 overall. Recruiting queries benefit from LinkedIn-centric databases, which all platforms access. The margins here are the thinnest in the benchmark.
Scenario Deep Dive
Each scenario score breaks down into three independent dimensions: Relevance (did you find the right people?), Coverage (how many qualified results?), and Utility (is the returned data actionable?). Here’s the detailed breakdown.
Influencer / KOL—Finding content creators across social platforms
- Lessie: Relevance 65.2, Coverage 62.8, Utility 58.9—100% completion rate
- Exa: 89.7% completion rate
- Claude Code: 82.8% completion rate
- Juicebox: 79.3% completion rate
Expert / Deterministic—Queries with verifiable answers or specific domain experts
- Lessie: Relevance 79, Coverage 75.2, Utility 57.1—100% completion rate
- Exa: 96.4% completion rate
- Claude Code: 100% completion rate but lower overall scores
- Juicebox: 71.4% completion rate
B2B Prospecting—Finding decision-makers at target companies
- Lessie: Relevance 62.8, Coverage 63.5, Utility 55.5—100% completion rate
- Exa: 100% completion rate, close on Coverage
- Juicebox: 84.4% completion rate
- Claude Code: 75% completion rate—the lowest in this category
Recruiting—Finding candidates with specific skills, experience, and location
- Lessie: Relevance 74.8, Coverage 75.6, Utility 54.3—100% completion rate
- Exa and Juicebox: both 100% completion rate
- Claude Code: 90% completion rate
- Recruiting had the highest absolute scores across all platforms—this is the most mature use case for AI people search
Evaluation Dataset
The benchmark uses 119 queries curated from real practitioner workflows in recruiting, sales, and research. These are not synthetic test cases—they reflect the actual searches professionals run when looking for people. The dataset is multi-language (English, Portuguese, Spanish, Dutch) and practitioner-driven.
- Recruiting (30 queries): Finding candidates with specific skills, experience levels, and locations
- B2B Prospecting (32 queries): Identifying decision-makers at target companies for sales outreach
- Expert / Deterministic (28 queries): Queries with verifiable correct answers or seeking specific domain experts
- Influencer / KOL (29 queries): Finding content creators across social platforms by niche, audience, and engagement
Three evaluation dimensions measure independent aspects of search quality: Relevance (ranking quality), Coverage (result volume), and Utility (data completeness). These combine into the Overall score.
Methodology
The evaluation pipeline is fully automated and reproducible. Every result from every platform is verified against live web sources—no self-reported data, no manual curation.
Step 1: Decompose the Query. A query like “Senior ML engineer at a Series B startup in Berlin” becomes a structured checklist: role, seniority, domain, company stage, location. This decomposition defines the grading criteria for each result.
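As a concrete illustration, here's roughly what that structured checklist might look like in code. The schema below is our own sketch, not the benchmark's actual data model:

```python
from dataclasses import dataclass

@dataclass
class QueryChecklist:
    """Illustrative schema; the benchmark's real checklist may differ."""
    role: str
    seniority: str
    domain: str
    company_stage: str
    location: str

# "Senior ML engineer at a Series B startup in Berlin" decomposed:
checklist = QueryChecklist(
    role="engineer",
    seniority="senior",
    domain="machine learning",
    company_stage="Series B",
    location="Berlin",
)
```

Each field becomes an explicit criterion that a returned profile either satisfies or fails during grading.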
Step 2: Verify Against the Web. Every person returned by every platform is checked against LinkedIn, company websites, and social profiles. No self-reported data—only what can be independently confirmed online. This eliminates platform bias and ensures fair comparison.
Step 3: Score on Three Axes. Relevance (did you find the right people?), Coverage (how many?), and Utility (is the profile data actually useful?). These three scores combine into one Overall score: (Relevance + Coverage + Utility) / 3.
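In code, that final combination is a one-liner. Plugging in Lessie's reported dimension scores reproduces its overall number:

```python
def overall_score(relevance: float, coverage: float, utility: float) -> float:
    """Overall = simple average of the three 0-100 dimension scores."""
    return (relevance + coverage + utility) / 3

# Lessie's dimension scores from this benchmark:
print(round(overall_score(70.2, 69.1, 56.4), 1))  # 65.2
```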
What We Measure
Relevance—Padded nDCG@10. Measures whether returned people match the query and are correctly ranked. Each person is web-verified and graded against explicit criteria. The score is padded to 10 slots—returning fewer results is penalized. This rewards both precision and recall in the top results.
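Here is one plausible reading of the padded variant in Python. The grading scale for individual results is an assumption on our part; the defining property is that the ideal DCG is computed over ten perfect slots, so a platform that returns only three people cannot score 100 no matter how good those three are:

```python
import math

def padded_ndcg_at_10(gains: list[float]) -> float:
    """gains: web-verified relevance grades (assumed to be in [0, 1]),
    in the order the platform ranked them."""
    padded = (gains + [0.0] * 10)[:10]  # pad short lists with zero-gain slots
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(padded))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(10))  # 10 perfect results
    return 100 * dcg / idcg
```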
Coverage—TCR × Yield. Measures how many qualified people are found per query. Combines task completion rate (did the platform return any results at all?) with average qualified result yield, capped at K=10. This rewards both reliability and volume of relevant results.
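A sketch of how that product could be computed, assuming yield is normalized by the cap K (the benchmark's exact normalization isn't published here). The effect of the multiplication is that skipping hard queries costs Coverage even when the answered queries are strong:

```python
def coverage_score(qualified_per_query: list[int | None], k: int = 10) -> float:
    """qualified_per_query holds one web-verified count per query,
    with None for queries where the platform returned nothing."""
    answered = [n for n in qualified_per_query if n is not None]
    if not answered:
        return 0.0
    tcr = len(answered) / len(qualified_per_query)  # task completion rate
    avg_yield = sum(min(n, k) for n in answered) / (len(answered) * k)  # capped at K=10
    return 100 * tcr * avg_yield
```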
Utility—(C + E + A) / 3. Measures whether returned data is complete and actionable. Averages three sub-dimensions: structural completeness (C), query-specific evidence (E), and actionability (A). A profile with a name but no email, title, or company scores low on Utility even if the person is relevant.
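Utility is again a plain average over its three sub-dimensions; how C, E, and A are each scored internally is not something we reproduce here:

```python
def utility_score(completeness: float, evidence: float, actionability: float) -> float:
    """Utility = (C + E + A) / 3, each sub-dimension on a 0-100 scale.
    A name-only profile scores low on completeness, which drags Utility
    down even when the person is a relevant match."""
    return (completeness + evidence + actionability) / 3
```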
Key Findings
After 476 platform runs across 119 queries, several patterns emerged that reveal where AI people search stands today and where each platform excels or falls short.
- #1 in All Four Scenarios. Lessie is the only platform to lead every category: Recruiting, B2B Prospecting, Expert / Deterministic, and Influencer / KOL. No other platform ranked first in any scenario.
- 100% Completion Rate. Lessie returned results for every query in the benchmark; no other platform did. The misses clustered on niche and abstract searches, where other platforms came back empty. Returning zero results is a failure mode unique to single-source platforms.
- Largest Relevance Gap: 70.2 vs. 54.3 (+29%). The ranking quality difference is most pronounced on multi-criteria queries—searches that combine role, seniority, industry, and location constraints.
- Influencer Is the Widest Gap. Lessie scored 62.3 overall; the runner-up scored 43.2. Single-source platforms struggle most here because influencer data is fragmented across dozens of social platforms.
- Utility Is the Closest Race. Profile data completeness is the most competitive dimension—all platforms scored between 42.7 and 56.4. This is where the industry has the most room for improvement.
- Recruiting Is the Most Competitive. Three platforms scored above 64 overall. This is the scenario where existing tools perform best—and where margins are thinnest. LinkedIn-centric data gives all platforms a stronger baseline here.
Open Source: The full evaluation dataset, scoring methodology, and platform-level results are available for review. We believe transparent benchmarks push the entire industry forward.