Run on HumanEval, Planning Completeness, and SimpleQA. Last updated May 2026.
Haiku attempted all 164 HumanEval problems with and without Tom's Index hints. Hints are optional: Haiku decides when to use them.
Haiku + Tom's Index /v1/hint. Hint used only when Haiku judges the problem is complex.
Haiku alone, no external context.
Hints never caused a correct solution to break, and they rescued 4 of the harder problems.
Benchmark: OpenAI HumanEval (164 problems, full set). Each problem run once with claude -p --model haiku. Hints called via MCP only when Haiku judges the task requires multi-step reasoning. Eval script: eval/hint-eval.js.
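The "rescued" and "never broke" claims come down to a per-problem comparison of the two runs. A minimal sketch of that comparison follows. It is not the code in eval/hint-eval.js: the function name `compareRuns`, the pass/fail map shape, and the problem IDs are all assumptions made for illustration.

```javascript
// Hypothetical sketch: not the code in eval/hint-eval.js.
// Each map is problem ID -> true if the generated solution passed its tests.
function compareRuns(baseline, withHints) {
  const rescued = []; // failed alone, passed with hints
  const broken = [];  // passed alone, failed with hints
  for (const [id, passedAlone] of Object.entries(baseline)) {
    const passedWithHints = Boolean(withHints[id]);
    if (!passedAlone && passedWithHints) rescued.push(id);
    if (passedAlone && !passedWithHints) broken.push(id);
  }
  return { rescued, broken };
}

// Toy data for illustration only, not the benchmark results.
const baselineRun = { p1: true, p2: false, p3: false };
const hintedRun = { p1: true, p2: true, p3: false };
console.log(compareRuns(baselineRun, hintedRun)); // { rescued: [ 'p2' ], broken: [] }
```

An empty `broken` list corresponds to "hints never caused a correct solution to break", and the length of `rescued` is the number of rescued problems.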
20 complex technical planning questions scored by keyword coverage of required components. Tests whether /v1/hint decomposition improves answer completeness.
Haiku + /v1/hint decomposition and follow-up exploration.
Haiku alone.
More questions reached perfect coverage with hints (7 vs. 5).
Questions cover OAuth, database migration, rate limiting, Kubernetes, E2E encryption, CI/CD, caching, event sourcing, and more. Scored by presence of required technical terms. Eval script: eval/planning-eval.js.
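Keyword-coverage scoring of this kind can be sketched in a few lines. The example below is an illustration only and not the code in eval/planning-eval.js: the function name `keywordCoverage`, the case-insensitive substring matching, and the sample terms are assumptions.

```javascript
// Hypothetical sketch: not the code in eval/planning-eval.js.
// Scores an answer by the fraction of required terms it mentions
// (case-insensitive substring match, which is an assumption).
function keywordCoverage(answer, requiredTerms) {
  const text = answer.toLowerCase();
  const found = requiredTerms.filter((term) => text.includes(term.toLowerCase()));
  const missing = requiredTerms.filter((term) => !found.includes(term));
  const coverage = requiredTerms.length === 0 ? 0 : found.length / requiredTerms.length;
  return { found, missing, coverage };
}

// Made-up required terms for an OAuth planning question.
const score = keywordCoverage(
  "Use the authorization code flow with PKCE and rotate each refresh token.",
  ["authorization code", "PKCE", "refresh token", "redirect URI"]
);
console.log(score.coverage); // 0.75
console.log(score.missing);  // [ 'redirect URI' ]
```

Under this scheme, "perfect coverage" means a coverage of 1, that is, every required term appears in the answer.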
20 developer-focused factual questions. Pass = correct answer found in top 5 results.
| Engine | Correct | Rate |
|---|---|---|
| | 19/20 | 95% |
| Tom's Index | 18/20 | 90% |
Example questions: "Who created Git?", "What does JSON stand for?", "What is the default port for PostgreSQL?" Each is a factual query with a single verifiable answer. Eval script: eval/simpleqa-eval.js.
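The pass criterion (correct answer found in the top 5 results) and the rate column can be illustrated with a short sketch. It is not the code in eval/simpleqa-eval.js: the function names `passesTopK` and `formatRate`, the substring matching, and the result snippets are assumptions.

```javascript
// Hypothetical sketch: not the code in eval/simpleqa-eval.js.
// A question passes if any of the first k results contains the expected answer.
function passesTopK(results, expectedAnswer, k = 5) {
  const needle = expectedAnswer.toLowerCase();
  return results.slice(0, k).some((text) => text.toLowerCase().includes(needle));
}

// Formats a pass count as a whole-number percentage, as in the table.
function formatRate(correct, total) {
  return `${Math.round((100 * correct) / total)}%`;
}

// Made-up result snippets for one of the example questions.
const snippets = [
  "Git - Wikipedia",
  "Git was created by Linus Torvalds in 2005.",
];
console.log(passesTopK(snippets, "Linus Torvalds")); // true
console.log(formatRate(19, 20)); // 95%
console.log(formatRate(18, 20)); // 90%
```

A match in the sixth result or later does not count, because only the first five results are checked.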