LedgerBench — are analytics agents business-correct?

Execution success is not business correctness. Each row reports one agent under one condition (closed book = schema only; open book = schema + the business rulebook), nominally 3 seeds over the 150-item public bank; the seeds actually run are listed in each row's provenance cell. Per-axis numbers always appear beside the aggregate, and the axis weights are printed in every run report.

| agent | condition | ran fine | business-correct | weighted overall | definitional | grain | ambiguity | refusal | faithfulness | provenance |
|---|---|---|---|---|---|---|---|---|---|---|
| anthropic:claude-haiku-4-5-20251001 | closed | 100.0% | 38.0% | 45.4% | 32.4% | 58.6% | 33.3% | 50.4% | 56.7% | da57177f1e72fb34 · seeds 11 · $0.61 |
| anthropic:claude-haiku-4-5-20251001 | open | 100.0% | 44.0% | 29.1% | 23.8% | 30.4% | 25.5% | 41.6% | 28.4% | da57177f1e72fb34 · seeds 11 · $0.70 |
| http_openai:gpt-4o-mini | closed | 100.0% | 42.0% | 59.9% | 45.7% | 94.3% | 0.0% | 89.6% | 60.9% | da57177f1e72fb34 · seeds 11,22,33 · $0.11 |
| http_openai:gpt-4o-mini | open | 100.0% | 59.3% | 58.7% | 56.5% | 78.7% | 28.6% | 78.4% | 40.4% | da57177f1e72fb34 · seeds 11,22,33 · $0.19 |
| naive | closed | 100.0% | 9.3% | 48.9% | 13.3% | 100.0% | 0.0% | 84.0% | n/e | da57177f1e72fb34 · seeds 11,22,33 · $0.00 |
| naive | open | 100.0% | 9.3% | 48.9% | 13.3% | 100.0% | 0.0% | 84.0% | n/e | da57177f1e72fb34 · seeds 11,22,33 · $0.00 |
Frontier agent rows (Anthropic / OpenAI APIs, closed + open book, 3 seeds) are pending keyed runs. Every number on this page is traceable to a committed manifest, and none will be projected in advance. Private-split numbers appear only as aggregates when published (see the protocol).
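How the weighted overall figure combines the per-axis scores is not spelled out on this page; the actual weights are printed in each run report. The sketch below is only an illustration under stated assumptions: it assumes the aggregate is a normalized weighted mean, and that an axis marked n/e (not evaluated) is skipped and the remaining weights renormalized. The function name `weighted_overall`, the axis keys, and the weights are hypothetical placeholders, not the benchmark's values.

```python
from typing import Mapping, Optional


def weighted_overall(
    axis_scores: Mapping[str, Optional[float]],
    weights: Mapping[str, float],
) -> float:
    """Normalized weighted mean over the axes that were evaluated.

    ASSUMPTION: axes scored None (shown as n/e) are dropped and the
    remaining weights are renormalized. The real rule may differ.
    """
    evaluated = {axis: s for axis, s in axis_scores.items() if s is not None}
    total_weight = sum(weights[axis] for axis in evaluated)
    if total_weight == 0:
        raise ValueError("no evaluated axis carries a non-zero weight")
    return sum(weights[axis] * s for axis, s in evaluated.items()) / total_weight


# Toy scores and placeholder weights, NOT taken from any run report.
toy_scores = {"axis_a": 50.0, "axis_b": 100.0, "axis_c": None}
toy_weights = {"axis_a": 1.0, "axis_b": 1.0, "axis_c": 2.0}
toy_overall = weighted_overall(toy_scores, toy_weights)
```

Plugging in the weights from a run report, rather than the placeholders, is what would show whether this assumed rule reproduces a published aggregate.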

The naive rows are the deterministic offline floor: a keyword-template baseline that answers everything and reads nothing. Its closed and open scores are identical, which shows that the floor ignores the rulebook; the difference between using the rulebook and ignoring it is the gap the benchmark measures. This page is generated by scripts/build_leaderboard.py from benchmark/results/; every number is recomputable from committed traces.
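The closed-versus-open comparison can be read straight off the table. As a minimal sketch, the snippet below copies the business-correct rates from the rows above and computes the open minus closed difference in percentage points. The helper name `rulebook_lift` is an illustrative assumption and is not part of scripts/build_leaderboard.py. Note that the claude-haiku rows cover a single seed, so their difference is less stable than the three-seed rows.

```python
# Business-correct rates (%) copied from the table: agent -> (closed, open).
BUSINESS_CORRECT = {
    "anthropic:claude-haiku-4-5-20251001": (38.0, 44.0),
    "http_openai:gpt-4o-mini": (42.0, 59.3),
    "naive": (9.3, 9.3),
}


def rulebook_lift(closed: float, open_book: float) -> float:
    """Open-book minus closed-book rate, in percentage points (1 decimal)."""
    return round(open_book - closed, 1)


lifts = {
    agent: rulebook_lift(closed, open_book)
    for agent, (closed, open_book) in BUSINESS_CORRECT.items()
}

for agent, lift in lifts.items():
    print(f"{agent}: {lift:+.1f} pts")
```

The naive baseline's difference is exactly zero, consistent with a floor that never reads the rulebook, while both model rows improve on business-correctness once the rulebook is supplied.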