The Lessie Team · April 2, 2026

#1 in AI People Search: Lessie Benchmark Results Across 119 Real-World Queries

119 real-world queries, scored independently through web verification. Open-source methodology.

TL;DR

119 Real-World Queries
4 Scenario Categories
#1 Overall Ranking
100% Completion Rate

How good is AI people search in 2026? We built an open-source benchmark to find out: 119 real-world queries drawn from actual practitioner workflows in recruiting, sales, and research, tested against four platforms: Lessie, Exa, Claude Code, and Juicebox. Every result was independently verified against live web sources. No self-reported data. No hand-picked examples.

The result: Lessie scored 65.2 overall, leading in all four scenario categories. The next closest platform scored 55. This post walks through the full benchmark results: what we measured, how we scored, and what the data reveals about the state of AI-powered people search.

Why This Matters

AI people search is becoming core infrastructure for recruiting, sales, and research teams. But until now, there was no standardized way to compare platforms. Vendors self-report accuracy numbers that can't be verified. Case studies cherry-pick the best results. This benchmark changes that: 119 real queries, independent web verification, and a level playing field for every platform tested.

Platform Comparison

119 real-world queries, scored independently through web verification on a 0–100 scale. Each platform ran the same queries under identical conditions. Scores are averaged across three dimensions: Relevance, Coverage, and Utility.

Figure: Platform comparison bar chart showing Lessie at 65.2 overall, Exa at 55, Claude Code at 46, and Juicebox at 45.8 across the Overall, Relevance, Coverage, and Utility dimensions.

Overall scores: Lessie 65.2 | Exa 55 | Claude Code 46 | Juicebox 45.8. The overall score is the simple average of Relevance, Coverage, and Utility, each measured independently on a 0–100 scale.

Breaking it down by dimension: Lessie led on Relevance (70.2 vs. 54.3 for the next best), Coverage (69.1 vs. 58.1), and Utility (56.4 vs. 53.1). The largest gap was in Relevance, a +29% advantage over the runner-up, meaning Lessie consistently returned the right people, correctly ranked, across diverse query types.

Performance by Scenario

The benchmark covers four real-world use cases where AI people search creates business value. Each scenario reflects a distinct workflow: different data sources, different criteria complexity, and different definitions of a good result.

Figure: Horizontal bar chart showing Lessie leading all four scenarios: Influencer/KOL 62.3, Expert/Deterministic 70.4, B2B Prospecting 60.6, Recruiting 68.2.

Influencer / KOL: Lessie 62.3 | Claude Code 43.2 | Exa 41.6 | Juicebox 31.1. This was the widest performance gap in the entire benchmark. Single-source platforms struggle most here because influencers exist across fragmented social platforms (Instagram, TikTok, YouTube, Twitter, podcasts, newsletters), and no single database covers them all.

Expert / Deterministic: Lessie 70.4 | Exa 61.2 | Claude Code 57 | Juicebox 44.2. These queries have verifiable correct answers or seek specific domain experts. Lessie's hybrid search strategy, combining structured databases with live web research, proved most effective at finding exactly the right people.

B2B Prospecting: Lessie 60.6 | Exa 55.2 | Juicebox 51.4 | Claude Code 43. Finding decision-makers at target companies is the most common AI people search use case. Lessie's advantage comes from cross-referencing multiple data sources to verify current roles and contact information.

Recruiting: Lessie 68.2 | Juicebox 65.7 | Exa 64.7 | Claude Code 50.5. This was the most competitive scenario: three platforms scored above 64 overall. Recruiting queries benefit from LinkedIn-centric databases, which all platforms access. The margins here are the thinnest in the benchmark.

Scenario Deep Dive

Each scenario score breaks down into three independent dimensions: Relevance (did you find the right people?), Coverage (how many qualified results?), and Utility (is the returned data actionable?). Here's the detailed breakdown.

Influencer / KOL: Finding content creators across social platforms

Expert / Deterministic: Queries with verifiable answers or specific domain experts

B2B Prospecting: Finding decision-makers at target companies

Recruiting: Finding candidates with specific skills, experience, and location

Evaluation Dataset

The benchmark uses 119 queries curated from real practitioner workflows in recruiting, sales, and research. These are not synthetic test cases; they reflect the actual searches professionals run when looking for people. The dataset is multi-language (English, Portuguese, Spanish, Dutch) and practitioner-driven.

Three evaluation dimensions measure independent aspects of search quality: Relevance (ranking quality), Coverage (result volume), and Utility (data completeness). These combine into the Overall score.

Methodology

The evaluation pipeline is fully automated and reproducible. Every result from every platform is verified against live web sourcesno self-reported data, no manual curation.

Step 1: Decompose the Query. A query like "Senior ML engineer at a Series B startup in Berlin" becomes a structured checklist: role, seniority, domain, company stage, location. This decomposition defines the grading criteria for each result.
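As a minimal sketch, the decomposition might be represented like this in Python. The field names are illustrative, not the benchmark's actual schema.

from dataclasses import dataclass

@dataclass
class QueryChecklist:
    # Each field becomes one grading criterion applied to every returned person.
    role: str            # e.g. "ML engineer"
    seniority: str       # e.g. "senior"
    domain: str          # e.g. "machine learning"
    company_stage: str   # e.g. "Series B"
    location: str        # e.g. "Berlin"

checklist = QueryChecklist(
    role="ML engineer",
    seniority="senior",
    domain="machine learning",
    company_stage="Series B",
    location="Berlin",
)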

Step 2: Verify Against the Web. Every person returned by every platform is checked against LinkedIn, company websites, and social profiles. No self-reported data, only what can be independently confirmed online. This eliminates platform bias and ensures fair comparison.
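As a rough illustration only (this helper is hypothetical, and real verification is far more involved than substring matching), each verified profile can then be graded criterion by criterion against the checklist from Step 1:

def grade_against_checklist(confirmed_facts: dict[str, str | None], checklist) -> dict[str, bool]:
    # confirmed_facts: criterion -> value independently confirmed online, or None if unverifiable.
    # Unverifiable claims count as not met, so self-reported data never earns credit.
    results = {}
    for criterion, expected in vars(checklist).items():
        found = confirmed_facts.get(criterion)
        results[criterion] = found is not None and expected.lower() in found.lower()
    return results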

Step 3: Score on Three Axes. Relevance (did you find the right people?), Coverage (how many?), and Utility (is the profile data actually useful?). These three scores combine into one Overall score: (Relevance + Coverage + Utility) / 3.
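In code, the aggregation is just an unweighted mean. As a sanity check, plugging in Lessie's three dimension scores from above reproduces the reported 65.2 overall:

def overall_score(relevance: float, coverage: float, utility: float) -> float:
    # Simple unweighted mean; every dimension is on a 0-100 scale.
    return (relevance + coverage + utility) / 3

print(round(overall_score(70.2, 69.1, 56.4), 1))  # 65.2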

What We Measure

Relevance: Padded nDCG@10. Measures whether returned people match the query and are correctly ranked. Each person is web-verified and graded against explicit criteria. The score is padded to 10 slots; returning fewer results is penalized. This rewards both precision and recall in the top results.
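A minimal sketch of a padded nDCG@10, assuming one web-verified relevance grade in [0, 1] per returned person and an ideal list of ten fully relevant results; the benchmark's exact grading scale and normalization may differ.

import math

def padded_ndcg_at_10(grades: list[float]) -> float:
    # grades: verified relevance grade (0.0-1.0) per returned person, in rank order.
    k = 10
    padded = (grades[:k] + [0.0] * k)[:k]                        # empty slots earn zero gain
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(padded))
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(k))  # ten perfectly relevant results
    return 100 * dcg / ideal                                     # scaled to 0-100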

Coverage: TCR × Yield. Measures how many qualified people are found per query. Combines task completion rate (did the platform return any results at all?) with average qualified result yield, capped at K=10. This rewards both reliability and volume of relevant results.
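One plausible reading of that formula, with the yield term normalized by the K=10 cap (an assumption; the post does not spell out the normalization):

def coverage_score(completed: list[bool], qualified_counts: list[int], k: int = 10) -> float:
    # completed: per-query flag - did the platform return any results at all?
    # qualified_counts: per-query count of web-verified qualified people.
    tcr = sum(completed) / len(completed)
    avg_yield = sum(min(n, k) for n in qualified_counts) / (len(qualified_counts) * k)
    return 100 * tcr * avg_yield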

Utility: (C + E + A) / 3. Measures whether returned data is complete and actionable. Averages three sub-dimensions: structural completeness (C), query-specific evidence (E), and actionability (A). A profile with a name but no email, title, or company scores low on Utility even if the person is relevant.
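The Utility formula maps directly to code, with each sub-dimension scored on the same 0-100 scale:

def utility_score(completeness: float, evidence: float, actionability: float) -> float:
    # completeness (C): structural fields present (name, title, company, contact info)
    # evidence (E): query-specific supporting evidence
    # actionability (A): whether the profile can be acted on as returned
    return (completeness + evidence + actionability) / 3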

Key Findings

After 476 platform runs across 119 queries, several patterns emerged that reveal where AI people search stands today and where each platform excels or falls short.

Open Source: The full evaluation dataset, scoring methodology, and platform-level results are available for review. We believe transparent benchmarks push the entire industry forward.

Frequently Asked Questions

What is an AI people search benchmark?

An AI people search benchmark is a standardized evaluation that tests how well different platforms find and return information about people. This benchmark uses 119 real-world queries across recruiting, B2B prospecting, expert search, and influencer discovery — scored on Relevance, Coverage, and Utility through independent web verification.

How does Lessie compare to Exa, Claude Code, and Juicebox?

Lessie scored 65.2 overall, compared to Exa (55), Claude Code (46), and Juicebox (45.8). Lessie led in all four scenario categories and achieved 100% query completion rate. The biggest gap was in Relevance (+29% over the next-best) and Influencer / KOL search (62.3 vs. 43.2). See the full comparison at the benchmark results page.

Is the benchmark methodology open source?

Yes. The evaluation pipeline is fully automated and reproducible. Every result is verified against live web sources — LinkedIn, company websites, and social profiles. The dataset, scoring formulas, and per-query results are available for independent review.

What do Relevance, Coverage, and Utility mean in this benchmark?

Relevance (Padded nDCG@10) measures whether the returned people match the query and are correctly ranked. Coverage (TCR × Yield) measures how many qualified results are found per query. Utility ((C + E + A) / 3) measures whether the returned data is complete and actionable — including contact info, current role, and company details.

Why does Lessie perform best in the Influencer / KOL scenario?

Influencer data is fragmented across Instagram, TikTok, YouTube, Twitter, podcasts, and newsletters. Single-source platforms that rely on one database miss most of this. Lessie's hybrid search strategy searches across 100+ sources simultaneously, which is why it scored 62.3 in this scenario while the runner-up scored 43.2. Try it yourself at Lessie Influencer Discovery.

Search Smarter. Find Anyone.

One search across professional networks, social platforms, and academic databases. Try Lessie free.

Start for free →

Related Articles

Apollo.io vs Lessie: Which B2B Contact Tool Fits Your Workflow in 2026?

An honest review of Apollo.io's features and limitations, comparing database size, bounce rates, pricing, and AI search capabilities.

The Best B2B Lead Generation Tools of 2026: Lessie Compared with 9 Alternatives

A hands-on comparison of the top 10 B2B lead generation tools of 2026.

The 2026 B2B Sales Guide: AI Prospecting with Lessie

Lessie AI searches 100+ sources to compress B2B prospecting from hours to minutes.

How to Find Influencers with Lessie: Searching 50M+ Creator Profiles (2026)

Use data-driven influencer discovery to find creators that fit your niche, audience, and budget.