Execution success is not business correctness. Each row reports one agent under one condition (closed book = schema only; open book = schema + the business rulebook), run with up to 3 seeds over the 150-item public bank; the provenance column lists the seeds actually used. Per-axis numbers always appear beside the aggregate, and the weights are printed in every run report.

| agent | condition | ran fine | business-correct | weighted overall | definitional | grain | ambiguity | refusal | faithfulness | provenance |
|---|---|---|---|---|---|---|---|---|---|---|
| anthropic:claude-haiku-4-5-20251001 | closed | 100.0% | 38.0% | 45.4% | 32.4% | 58.6% | 33.3% | 50.4% | 56.7% | da57177f1e72fb34 · seeds 11 · $0.61 |
| anthropic:claude-haiku-4-5-20251001 | open | 100.0% | 44.0% | 29.1% | 23.8% | 30.4% | 25.5% | 41.6% | 28.4% | da57177f1e72fb34 · seeds 11 · $0.70 |
| http_openai:gpt-4o-mini | closed | 100.0% | 42.0% | 59.9% | 45.7% | 94.3% | 0.0% | 89.6% | 60.9% | da57177f1e72fb34 · seeds 11,22,33 · $0.11 |
| http_openai:gpt-4o-mini | open | 100.0% | 59.3% | 58.7% | 56.5% | 78.7% | 28.6% | 78.4% | 40.4% | da57177f1e72fb34 · seeds 11,22,33 · $0.19 |
| naive | closed | 100.0% | 9.3% | 48.9% | 13.3% | 100.0% | 0.0% | 84.0% | n/e | da57177f1e72fb34 · seeds 11,22,33 · $0.00 |
| naive | open | 100.0% | 9.3% | 48.9% | 13.3% | 100.0% | 0.0% | 84.0% | n/e | da57177f1e72fb34 · seeds 11,22,33 · $0.00 |

Frontier agent rows (Anthropic / OpenAI APIs, closed + open book, 3 seeds) are pending keyed runs. Every number on this page is traceable to a committed manifest, and none will be projected in advance. Private-split numbers appear only as aggregates when published (protocol).

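As a rough illustration of how a weighted overall score can be derived from the per-axis columns, here is a minimal sketch. The weights, the function name `weighted_overall`, and the choice to renormalize over axes marked n/e are all assumptions made for this example; the actual weights are the ones printed in each run report, so the result will generally not match the published weighted overall column.

```python
from typing import Optional

# Hypothetical weights for illustration only; the real ones are in each run report.
EXAMPLE_WEIGHTS = {
    "definitional": 0.30,
    "grain": 0.20,
    "ambiguity": 0.20,
    "refusal": 0.15,
    "faithfulness": 0.15,
}


def weighted_overall(
    axis_scores: dict[str, Optional[float]],
    weights: dict[str, float],
) -> float:
    """Weighted mean over the axes that have a score.

    None stands for an axis reported as n/e. Dropping such axes and
    renormalizing the remaining weights is an assumption of this sketch.
    """
    evaluated = {axis: s for axis, s in axis_scores.items() if s is not None}
    total_weight = sum(weights[axis] for axis in evaluated)
    if total_weight == 0:
        raise ValueError("no evaluated axes to aggregate")
    return sum(weights[axis] * s for axis, s in evaluated.items()) / total_weight


# Per-axis values from the naive row of the table; faithfulness is n/e there.
naive_axes = {
    "definitional": 13.3,
    "grain": 100.0,
    "ambiguity": 0.0,
    "refusal": 84.0,
    "faithfulness": None,
}
print(f"{weighted_overall(naive_axes, EXAMPLE_WEIGHTS):.1f}%")
```

Keeping the per-axis inputs beside the aggregate, as the table does, lets a reader redo this arithmetic with the published weights instead of trusting a single number.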
The naive rows are the deterministic offline floor: a keyword-template baseline
that answers everything and reads nothing. Its closed and open scores are identical,
which shows that the floor ignores the rulebook; that is exactly the gap the benchmark
measures. This page is generated by scripts/build_leaderboard.py from
benchmark/results/; every number is recomputable from committed traces.
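
One way to read the closed/open comparison is as an open-book lift: the open-book business-correct rate minus the closed-book rate for the same agent. The sketch below computes it from the business-correct column of the table. The function name `open_book_lift` is hypothetical, and treating this difference as "the gap" is an interpretation, not a metric the page defines. The Anthropic rows list a single seed, so their difference rests on less data than the others.

```python
# Business-correct percentages copied from the table on this page.
BUSINESS_CORRECT = {
    "anthropic:claude-haiku-4-5-20251001": {"closed": 38.0, "open": 44.0},
    "http_openai:gpt-4o-mini": {"closed": 42.0, "open": 59.3},
    "naive": {"closed": 9.3, "open": 9.3},
}


def open_book_lift(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Open-book minus closed-book rate per agent, in percentage points."""
    return {
        agent: round(cond["open"] - cond["closed"], 1)
        for agent, cond in scores.items()
    }


for agent, lift in open_book_lift(BUSINESS_CORRECT).items():
    print(f"{agent}: {lift:+.1f} points")
# anthropic:claude-haiku-4-5-20251001: +6.0 points
# http_openai:gpt-4o-mini: +17.3 points
# naive: +0.0 points
```

The naive baseline's lift of zero is what its identical rows imply: giving it the rulebook changes nothing.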