Benchmarks

Real evals, real data.

Run on HumanEval, Planning Completeness, and SimpleQA. Last updated May 2026.

Coding Hints (HumanEval)

All 164 HumanEval problems, attempted by Haiku with and without Tom's Index hints. Hints are optional: Haiku decides when to use them.

94%

With hints

Haiku + Tom's Index /v1/hint. Hint used only when Haiku judges the problem is complex.

92%

Without hints

Haiku alone, no external context.

0

Regressions

Hints never broke a solution that was correct without them. They rescued 4 of the harder problems that Haiku failed on its own.

Benchmark: OpenAI HumanEval (164 problems, full set). Each problem was run once with claude -p --model haiku. Hints were called via MCP only when Haiku judged that the task required multi-step reasoning. Eval script: eval/hint-eval.js.
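The regression and rescue counts above come from comparing the two runs problem by problem. A minimal sketch of that comparison follows. It is not the contents of eval/hint-eval.js: the `{ taskId: boolean }` result shape and the names `compareRuns` and `passRate` are assumptions made for illustration, and the sample data is a toy set, not the benchmark results.

```javascript
// Hypothetical per-problem results: true means the problem's tests passed.
// The { taskId: boolean } shape is an assumption, not the eval script's format.
function compareRuns(baseline, withHints) {
  const summary = { total: 0, baselinePassed: 0, hintsPassed: 0, rescued: [], regressed: [] };
  for (const taskId of Object.keys(baseline)) {
    if (!(taskId in withHints)) continue; // compare only problems present in both runs
    const before = baseline[taskId];
    const after = withHints[taskId];
    summary.total += 1;
    if (before) summary.baselinePassed += 1;
    if (after) summary.hintsPassed += 1;
    if (!before && after) summary.rescued.push(taskId);   // failed alone, passed with hints
    if (before && !after) summary.regressed.push(taskId); // passed alone, failed with hints
  }
  return summary;
}

// Pass rate as a percentage rounded to the nearest whole number.
function passRate(passed, total) {
  return total === 0 ? 0 : Math.round((passed / total) * 100);
}

// Toy data for illustration only.
const baseline = { "HumanEval/0": true, "HumanEval/1": false, "HumanEval/2": true, "HumanEval/3": false };
const withHints = { "HumanEval/0": true, "HumanEval/1": true, "HumanEval/2": true, "HumanEval/3": false };

const summary = compareRuns(baseline, withHints);
console.log("rescued:", summary.rescued, "regressed:", summary.regressed);
console.log("baseline %:", passRate(summary.baselinePassed, summary.total));
console.log("with hints %:", passRate(summary.hintsPassed, summary.total));
```

Tracking rescued and regressed problems separately matters because two runs can have the same pass rate while passing different problems.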

Hint Agent (Planning Completeness)

20 complex technical planning questions scored by keyword coverage of required components. Tests whether /v1/hint decomposition improves answer completeness.

87%

With hint

Haiku + /v1/hint decomposition and follow-up exploration.

86%

Without hint

Haiku alone.

5→7

100% scores

More questions reached perfect coverage with hints (7 vs. 5).

Questions cover OAuth, database migration, rate limiting, Kubernetes, E2E encryption, CI/CD, caching, event sourcing, and more. Scored by presence of required technical terms. Eval script: eval/planning-eval.js.
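To make "scored by presence of required technical terms" concrete, here is a minimal sketch of one way to compute keyword coverage. It is not taken from eval/planning-eval.js: the function name `coverageScore`, the case-insensitive substring match, and the sample rubric are all assumptions for illustration.

```javascript
// Fraction of required terms that appear in the answer.
// Matching is a case-insensitive substring test (an assumption for this sketch).
function coverageScore(answer, requiredTerms) {
  if (requiredTerms.length === 0) return 1; // nothing required, so trivially complete
  const text = answer.toLowerCase();
  const found = requiredTerms.filter((term) => text.includes(term.toLowerCase()));
  return found.length / requiredTerms.length;
}

// Hypothetical rubric and answer for a rate-limiting question.
const requiredTerms = ["token bucket", "redis", "429", "retry-after"];
const answer = "Use a token bucket per API key stored in Redis, and return HTTP 429 when it is empty.";

console.log(coverageScore(answer, requiredTerms)); // 0.75
```

Under this scheme, a score of 1 corresponds to a "100% score" in the card above. Substring matching only checks that a term is mentioned, not that it is used correctly.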

Search Quality (SimpleQA)

20 developer-focused factual questions. Pass = correct answer found in top 5 results.

Engine	Correct	Rate
Google	19/20	95%
Tom's Index	18/20	90%

Sample questions: "Who created Git?", "What does JSON stand for?", "What is the default port for PostgreSQL?". Each is a factual query with a single verifiable answer. Eval script: eval/simpleqa-eval.js.
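The "correct answer found in top 5 results" rule can be sketched as follows. This is an illustration, not the logic of eval/simpleqa-eval.js: the `{ title, snippet }` result shape, the names `passesAtK` and `scoreEngine`, and the substring match against the expected answer are assumptions, and the sample results are invented for the example.

```javascript
// Pass if any of the first k results mentions the expected answer (case-insensitive).
// The { title, snippet } result shape is hypothetical.
function passesAtK(results, expected, k = 5) {
  const needle = expected.toLowerCase();
  return results
    .slice(0, k)
    .some((r) => `${r.title} ${r.snippet}`.toLowerCase().includes(needle));
}

// Aggregate pass/fail over a list of { expected, results } cases.
function scoreEngine(cases, k = 5) {
  const correct = cases.filter((c) => passesAtK(c.results, c.expected, k)).length;
  const rate = cases.length === 0 ? 0 : Math.round((correct / cases.length) * 100);
  return { correct, total: cases.length, rate };
}

// Invented sample results for two of the questions quoted above.
const cases = [
  {
    question: "What is the default port for PostgreSQL?",
    expected: "5432",
    results: [{ title: "PostgreSQL connection settings", snippet: "The server listens on port 5432 by default." }],
  },
  {
    question: "Who created Git?",
    expected: "Linus Torvalds",
    results: [{ title: "Git tutorial", snippet: "Learn branching and merging step by step." }],
  },
];

console.log(scoreEngine(cases)); // { correct: 1, total: 2, rate: 50 }
```

A row such as 18/20 in the table corresponds to `correct` and `total` in this sketch.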