LedgerBench — are analytics agents business-correct?

Execution success is not business correctness. Each row reports one agent under one condition (closed book = schema only; open book = schema plus the business rulebook), nominally over 3 seeds on the 150-item public bank; the seeds actually run are listed in each row's provenance column. Per-axis numbers always appear beside the aggregate, and the weights are printed in every run report.
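To make the relationship between per-axis numbers and the aggregate concrete, here is a minimal sketch of a weighted mean over axes. It is not the benchmark's scoring code: the function name, the equal `EXAMPLE_WEIGHTS`, and the choice to skip `n/e` axes and renormalize over the rest are all assumptions for illustration. The real weights are the ones printed in each run report.

```python
from typing import Optional

# HYPOTHETICAL weights, for illustration only. The real weights are
# printed in every run report and are not reproduced here.
EXAMPLE_WEIGHTS = {
    "definitional": 0.2,
    "grain": 0.2,
    "ambiguity": 0.2,
    "refusal": 0.2,
    "faithfulness": 0.2,
}


def weighted_overall(
    axis_scores: dict[str, Optional[float]],
    weights: dict[str, float],
) -> float:
    """Weighted mean of per-axis scores, in percent.

    Axes scored as None (shown as "n/e" in the table) are skipped and the
    remaining weights are renormalized. That handling is an assumption.
    """
    scored = {axis: s for axis, s in axis_scores.items() if s is not None}
    total_weight = sum(weights[axis] for axis in scored)
    if total_weight == 0:
        raise ValueError("no evaluated axes with nonzero weight")
    return sum(weights[axis] * s for axis, s in scored.items()) / total_weight


# Per-axis scores copied from the naive row; faithfulness is "n/e".
naive_axes = {
    "definitional": 13.3,
    "grain": 100.0,
    "ambiguity": 0.0,
    "refusal": 84.0,
    "faithfulness": None,
}
naive_overall = weighted_overall(naive_axes, EXAMPLE_WEIGHTS)
```

With these placeholder weights the naive row comes out near 49.3, not the reported 48.9%, so the real weights or the real handling of `n/e` evidently differ from this sketch. Use the weights from the run report to reproduce the published column.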

agent	condition	ran fine	business-correct	weighted overall	definitional	grain	ambiguity	refusal	faithfulness	provenance
anthropic:claude-haiku-4-5-20251001	closed	100.0%	38.0%	45.4%	32.4%	58.6%	33.3%	50.4%	56.7%	`da57177f1e72fb34` · seeds 11 · $0.61
anthropic:claude-haiku-4-5-20251001	open	100.0%	44.0%	29.1%	23.8%	30.4%	25.5%	41.6%	28.4%	`da57177f1e72fb34` · seeds 11 · $0.70
http_openai:gpt-4o-mini	closed	100.0%	42.0%	59.9%	45.7%	94.3%	0.0%	89.6%	60.9%	`da57177f1e72fb34` · seeds 11,22,33 · $0.11
http_openai:gpt-4o-mini	open	100.0%	59.3%	58.7%	56.5%	78.7%	28.6%	78.4%	40.4%	`da57177f1e72fb34` · seeds 11,22,33 · $0.19
naive	closed	100.0%	9.3%	48.9%	13.3%	100.0%	0.0%	84.0%	n/e	`da57177f1e72fb34` · seeds 11,22,33 · $0.00
naive	open	100.0%	9.3%	48.9%	13.3%	100.0%	0.0%	84.0%	n/e	`da57177f1e72fb34` · seeds 11,22,33 · $0.00
Frontier agent rows (Anthropic / OpenAI APIs, closed + open book, 3 seeds) are pending keyed runs. Every number on this page is traceable to a committed manifest, and none will be projected in advance. Private-split numbers, when published, appear only as aggregates (see the protocol).

The naive rows are the deterministic offline floor: a keyword-template baseline that answers everything and reads nothing. Its closed and open scores are identical, which shows that the floor ignores the rulebook, and that is the gap the benchmark measures. This page is generated by scripts/build_leaderboard.py from benchmark/results/; every number is recomputable from committed traces.
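As a small illustration of the closed/open comparison, the sketch below takes the business-correct column from the table and computes the percentage-point change when the rulebook is supplied. The names `business_correct` and `rulebook_gain` are hypothetical and are not part of scripts/build_leaderboard.py; the figures are copied by hand from the rows above.

```python
# Business-correct percentages copied from the table: (closed, open).
business_correct = {
    "anthropic:claude-haiku-4-5-20251001": (38.0, 44.0),
    "http_openai:gpt-4o-mini": (42.0, 59.3),
    "naive": (9.3, 9.3),
}


def rulebook_gain(closed: float, open_: float) -> float:
    """Percentage-point change in business-correct from closed to open book."""
    return round(open_ - closed, 1)


gains = {
    agent: rulebook_gain(closed, open_)
    for agent, (closed, open_) in business_correct.items()
}

for agent, gain in gains.items():
    print(f"{agent}: {gain:+.1f} pp")
```

The naive baseline shows zero gain, as expected for an agent that never reads the rulebook. The claude-haiku rows list a single seed in their provenance, so their gain is not directly comparable to the three-seed rows.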