Evaluation architecture
How Josh evaluates two distinct things — substrate retrieval quality (Phase 1) and agent harness quality (Phase 2) — through one shared eval framework.
This document is the canonical reference for evaluation design across both phases. When per-phase docs (e.g. chunking for chunker eval specifics) disagree with this document on cross-cutting concerns, this document wins.
Where this lives: the eval framework ships as `usejosh/josh-eval`, a separate public repo (MIT) at v1 launch. See repo-structure "Launch repo split (planned)" for the topology.
Why two evaluation systems
Josh has two architectural surfaces that need measurement, and they want measurement of fundamentally different shapes:
| System | What it measures | Metric shape |
|---|---|---|
| Retrieval eval (Phase 1) | Given a query, does the substrate surface the right chunk in the top-k? Tests chunker × embedding × ranker. | Graded relevance: NDCG@10, Recall@10, MRR over labeled (query, chunk) pairs. |
| Harness eval (Phase 2) | Given an instruction + corpus access, does the agent produce a correct deliverable? Tests model + tools + harness end-to-end. | Binary pass/fail per rubric criterion; all-pass scoring per task (LAB-style). |
These are complementary halves of the same engineering question. Retrieval eval keeps the model side deliberately simple (a fixed embedding + reranker) and varies the retrieval pipeline. Harness eval takes retrieval as given (a closed corpus or a snapshotted substrate) and varies the model + agent loop. Neither subsumes the other.
They share substrate snapshots, the LLM-as-judge wrapper, the CLI surface, the reporting layer, and the model adapter abstraction. They diverge on what they evaluate.
The unified framework
```
josh-eval/
  substrate/            # SHARED: versioned corpus snapshots
    manifests/          # source URLs + SHA256s + retrieval timestamps
    snapshots/          # locally-rebuilt corpus at version v2026.08
    rebuild.py          # deterministic rebuild from manifest
  retrieval/            # PHASE 1: substrate retrieval quality
    eval-sets/          # queries + labels per source/topic
    runner.py
    scorer.py           # NDCG@k / Recall@k / MRR
  harness/              # PHASE 2: agent deliverable quality
    tasks/              # task.json + (Mode 1/2 bundles)
    runner.py
    judge.py            # all-pass rubric scoring
    adapters/           # Anthropic / OpenAI / Google / Mistral
    tools/              # read / write / glob / grep / josh.search
    sandbox/            # Podman per-task isolation
  shared/               # infra used by both
    judge_model.py      # common LLM-as-judge wrapper
    cli.py              # `josh-eval` entrypoint with subcommands
    report/             # static HTML
    sweep.py            # matrix runs
    compare.py          # dashboard generator
  results/              # gitignored output of any run
```

Single CLI entrypoint with subcommands:

```bash
josh-eval substrate fetch --version=v2026.08
josh-eval substrate rebuild --manifest=./manifests/v2026.08/manifest.json
josh-eval retrieval run --eval-set=federal-register --pipeline=bm25+arctic-l+rerank
josh-eval harness run --task=appropriations/track-sa-3217 --mode=mode_2 --model=anthropic/claude-sonnet-4-7
josh-eval sweep --task=appropriations --modes=mode_1,mode_2,mode_3 --models=sonnet,gpt-5
josh-eval compare --area=appropriations
```

Shared global flags: `--substrate-version`, `--judge-model`, `--parallel`, `--out-dir`.
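The `sweep` flags above suggest a cross product over tasks, modes, and models. The following is a minimal sketch of that expansion, not the actual `sweep.py`; the `sweep_matrix` helper and the run-config dictionary shape are assumptions made for illustration.

```python
from itertools import product


def sweep_matrix(tasks, modes, models):
    """Expand a sweep into one run config per (task, mode, model) cell.

    Hypothetical helper: the real sweep.py may structure runs differently.
    """
    return [
        {"task": task, "mode": mode, "model": model}
        for task, mode, model in product(tasks, modes, models)
    ]


runs = sweep_matrix(
    tasks=["appropriations/track-sa-3217"],
    modes=["mode_1", "mode_2", "mode_3"],
    models=["sonnet", "gpt-5"],
)
print(len(runs))  # 1 task x 3 modes x 2 models = 6
```

Each cell is independent of the others, which is what would make a `--parallel` flag straightforward to apply across the matrix.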
Substrate snapshot system (shared)
Both eval systems test against a versioned substrate. The substrate moves daily as new bills, rules, and hearings are ingested; for any benchmark to be reproducible, runs must pin to a specific corpus state.
Manifest format
```
substrate/manifests/v2026.08/
  manifest.json
  sources/
    federal-register.json        # doc_id → {url, sha256, retrieved_at}
    congress-bills.json
    crs-reports.json
    legislators-committees.json
    ...
```

Each source manifest is a flat list of `(doc_id, source_url, sha256, retrieved_at)` tuples. The full snapshot manifest is small (~MB); the rebuilt corpus is large (~tens of GB).
Two rebuild paths
- Primary: download from Hugging Face Datasets. `josh-eval substrate fetch --version=v2026.08` pulls a pre-built SQLite + chunks + indexes archive from `huggingface.co/datasets/usejosh/substrate-snapshots`. This is what 99% of eval users want: no scrapers, no API quotas, just download and query.
- Escape hatch: rebuild from manifest. `josh-eval substrate rebuild --version=v2026.08` reads the manifest, hits the public APIs, downloads each source, verifies SHA256s, and rebuilds the SQLite database locally. Requires `pip install josh-foundation`. For Josh-internal verification or genuinely paranoid users (e.g., academic reviewers reproducing a paper).
Both paths produce bit-identical substrates at a given version, or near-identical ones where an upstream API has changed since tag time. The HF archive is built internally at Josh by running the rebuild path against the current public APIs at tag time.
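The SHA256 verification step in the rebuild path could look roughly like the sketch below. It assumes manifest entries carry the four fields listed under "Manifest format"; the `ManifestEntry` class, the `find_mismatches` helper, and the example document ID and URL are hypothetical, not names from `rebuild.py`.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestEntry:
    doc_id: str
    source_url: str
    sha256: str
    retrieved_at: str


def verify_document(entry: ManifestEntry, content: bytes) -> bool:
    """Return True when the downloaded bytes match the manifest hash."""
    return hashlib.sha256(content).hexdigest() == entry.sha256


def find_mismatches(entries, fetched):
    """List doc_ids whose content is missing or fails the hash check."""
    bad = []
    for entry in entries:
        content = fetched.get(entry.doc_id)
        if content is None or not verify_document(entry, content):
            bad.append(entry.doc_id)
    return bad


body = b"example rule text"
entry = ManifestEntry(
    doc_id="fr-example-0001",              # hypothetical ID
    source_url="https://example.org/doc",  # placeholder URL
    sha256=hashlib.sha256(body).hexdigest(),
    retrieved_at="2026-08-01T00:00:00Z",
)
print(find_mismatches([entry], {"fr-example-0001": body}))        # []
print(find_mismatches([entry], {"fr-example-0001": b"changed"}))  # ['fr-example-0001']
```

A non-empty mismatch list is the case where the rebuilt substrate is only near-identical to the tagged one, because an upstream source changed after the manifest was written.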
Snapshot cadence
Monthly tagged versions: `v2026.08`, `v2026.09`, etc. A floating `latest` alias points at the most recent version.
Eval sets and harness Mode 3 tasks pin a specific version. Re-running an old eval against an old version produces the same number; pinning to latest produces a number that drifts over time.
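One way to make the pinned-versus-floating distinction explicit in a runner is to resolve the requested version up front and record whether it was pinned. This is a sketch under assumptions: `resolve_version` is a hypothetical helper, and it assumes tags always follow the zero-padded `vYYYY.MM` pattern shown above.

```python
import re

# Assumed tag shape, based on the examples v2026.08 and v2026.09.
VERSION_RE = re.compile(r"^v\d{4}\.\d{2}$")


def resolve_version(requested, available):
    """Return (version, pinned) for a requested substrate version."""
    if requested == "latest":
        # Lexicographic max is safe because tags are zero-padded vYYYY.MM.
        return max(available), False
    if not VERSION_RE.match(requested):
        raise ValueError(f"not a snapshot tag: {requested!r}")
    if requested not in available:
        raise KeyError(f"unknown snapshot: {requested}")
    return requested, True


available = ["v2026.08", "v2026.09"]
print(resolve_version("v2026.08", available))  # ('v2026.08', True)
print(resolve_version("latest", available))    # ('v2026.09', False)
```

Storing the resolved tag and the `pinned` flag alongside the scores would let a report mark any number that came from `latest` as non-reproducible.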
What's in the snapshot
The snapshot is pure substrate: what every Josh Foundation install would have if it ingested the same source URLs at the same time. Proprietary downstream signals (forecasting outputs, prediction overlays, hand annotations) are not in the snapshot.
Retrieval eval (Phase 1)
Detailed design lives in chunking: query authoring, candidate gathering, labeling, metrics, decision rules. The relevant additions from this document are:
- Eval sets pin a substrate version. `retrieval/eval-sets/<source>/manifest.json` declares the substrate version the eval was authored against. Re-running on a newer version is allowed but flagged in reports, because labels may carry stale assumptions about what's in the corpus.
- The runner moves to `josh-eval`. What chunking calls `shared/josh_substrate/src/josh_substrate/eval/chunker_compare.py` actually lives in `josh-eval/retrieval/runner.py`. The chunker protocol and chunker implementations stay in `josh-substrate` (production code), but the eval runner is research code and lives in the eval repo.
- The CLI is `josh-eval retrieval run` (not bare `josh-eval chunker` as chunking hinted), a subcommand under the unified eval CLI.
Run shape:
```bash
josh-eval retrieval run \
  --eval-set=federal-register \
  --substrate-version=v2026.08 \
  --pipeline=bm25+arctic-l+rerank
```

Output: `results/retrieval/<eval-set>/<pipeline-config>/<timestamp>/scores.json` + `report.html`. Pooled NDCG@10 / Recall@10 / MRR plus per-query breakdown.
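For reference, the three per-query metrics can be computed as in the sketch below. This is not the actual `scorer.py`: it assumes labels are a mapping from chunk ID to a graded relevance integer (0 meaning irrelevant) and uses linear gain in the DCG, whereas the chunking doc may specify a different gain function. The function names are illustrative.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranked_ids, labels, k=10):
    """labels maps chunk_id -> graded relevance (assumed: 0 = irrelevant)."""
    gains = [labels.get(cid, 0) for cid in ranked_ids[:k]]
    ideal_dcg = dcg(sorted(labels.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranked_ids, labels, k=10):
    """Fraction of relevant chunks that appear in the top k."""
    relevant = {cid for cid, grade in labels.items() if grade > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


def reciprocal_rank(ranked_ids, labels):
    """1 / rank of the first relevant chunk, or 0.0 if none is returned."""
    for rank, cid in enumerate(ranked_ids, start=1):
        if labels.get(cid, 0) > 0:
            return 1.0 / rank
    return 0.0


labels = {"a": 2, "b": 1}
ranked = ["x", "a", "b"]
print(recall_at_k(ranked, labels))      # 1.0
print(reciprocal_rank(ranked, labels))  # 0.5
print(round(ndcg_at_k(ranked, labels), 3))
```

The pooled numbers in `scores.json` would then be means of these per-query values across the eval set, with MRR being the mean of `reciprocal_rank`.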
Harness eval (Phase 2) — three modes
Phase 2 ships an agent UI / harness that takes natural-language instructions, calls tools against the substrate, and produces deliverables. The harness eval measures whether the agent's deliverables are correct.
Modeled on Harvey's Legal Agent Benchmark (LAB).
Three modes for the same task
A single `task.json` is runnable in three modes that differ only in how the agent gets to the documents. The task definition (instructions, rubric, deliverables) stays identical.
| Mode | Document access | What it tests |
|---|---|---|
| Mode 1 — pure-relevant bundle | The agent gets a sealed documents/ folder containing only relevant docs (~10–30 files, all gold). | Pure reasoning + synthesis. No filtering load. The "best case" — perfect retrieval, perfect scoping. Diagnostic: if the agent fails Mode 1, the failure is reasoning, not retrieval. |
| Mode 2 — gold + distractors bundle | Same sealed folder pattern, but with 50–200 plausible distractor docs added (~60–230 files total). | In-corpus filtering + reasoning. The agent has to identify which files matter. Closest to LAB's existing setup but with larger document sets to address LAB's "5–30 docs is too small for realism" criticism. |
| Mode 3 — live substrate snapshot | No sealed folder. The agent has a josh.search(query, k=10) tool against a snapshotted substrate (1M+ docs at the pinned version). Sandbox is --network=none except for substrate API. | Retrieval-at-scale + reasoning. The hardest mode. The gold docs are buried in a real haystack the agent has to navigate via search. |
Why three, not two
The user's call (2026-05-09 conversation): Mode 1 isolates reasoning from filtering in a way Mode 2 doesn't. A model that scores high on Mode 1 but poorly on Mode 2 has a filtering problem; one that scores poorly on both has a reasoning problem. Without Mode 1 we can't disambiguate.
A model run produces a 3-tuple score (mode_1, mode_2, mode_3). The shape of that tuple is diagnostic:
| Profile | What it means |
|---|---|
| `(0.85, 0.62, 0.38)` | Reasoning fine; filtering degrades; live retrieval breaks down. Investigate Mode 3 retrieval pipeline. |
| `(0.85, 0.82, 0.40)` | Reasoning fine; filtering fine; live retrieval breaks down. The gap is purely retrieval-at-scale. |
| `(0.45, 0.43, 0.40)` | Reasoning is the bottleneck; filtering and retrieval are not the issue. |
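The reading of a score tuple can be made mechanical. The sketch below reproduces the three profiles in the table; the `floor` and `gap` thresholds are illustrative assumptions, not values this document specifies, and `diagnose` is a hypothetical helper.

```python
def diagnose(profile, floor=0.6, gap=0.15):
    """Label the likely bottleneck from a (mode_1, mode_2, mode_3) tuple.

    floor and gap are assumed thresholds chosen only for this example.
    """
    mode_1, mode_2, mode_3 = profile
    findings = []
    if mode_1 < floor:
        findings.append("reasoning")           # fails even with perfect retrieval
    if mode_1 - mode_2 > gap:
        findings.append("filtering")           # distractors hurt
    if mode_2 - mode_3 > gap:
        findings.append("retrieval-at-scale")  # live search hurts
    return findings or ["no clear bottleneck"]


print(diagnose((0.85, 0.62, 0.38)))  # ['filtering', 'retrieval-at-scale']
print(diagnose((0.85, 0.82, 0.40)))  # ['retrieval-at-scale']
print(diagnose((0.45, 0.43, 0.40)))  # ['reasoning']
```

The comparison works on adjacent modes because each mode adds exactly one source of difficulty on top of the previous one.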
Task format
```
harness/tasks/<area>/<task>/
  task.json
  bundles/
    mode-1/        # ~10-30 pure-relevant docs (all gold)
    mode-2/        # ~60-230 docs (gold + plausible distractors)
    # mode-3 has no bundle; task.json references substrate doc IDs
```

`task.json` (sketch):
```json
{
  "title": "Track SA 3217 amendments to FY2026 appropriations bill",
  "instructions": "Review the attached materials for HR 5417 and prepare a brief summarizing all pending amendments...",
  "deliverables": { "amendments-brief.docx": "amendments-brief.docx" },
  "criteria": [
    {
      "id": "C-001",
      "title": "Identifies SA 3217 strikes the §4(b) appropriation",
      "match_criteria": "PASS if the brief notes that SA 3217, filed by Sen. Lee on May 2, strikes the $1.4B figure in §4(b)(1) and replaces it with $0...",
      "deliverables": ["amendments-brief.docx"],
      "sources": ["bill:hr-119-5417", "amendment:sa-119-3217"]
    }
  ],
  "modes": {
    "mode_1": {"docs": ["hr-5417-text.docx", "sa-3217-text.docx", ...]},
    "mode_2": {"docs": ["hr-5417-text.docx", "sa-3217-text.docx", "sa-3215-text.docx", ...], "distractor_count": 180},
    "mode_3": {
      "substrate_version": "v2026.08",
      "gold_source_ids": ["bill:hr-119-5417", "amendment:sa-119-3217"]
    }
  }
}
```

The crucial property: the rubric describes facts to surface, not files to read. "PASS if the brief notes that SA 3217 strikes the §4(b) appropriation" works in all three modes. In Mode 1 the agent reads SA 3217 from a guaranteed-relevant folder. In Mode 2 it filters SA 3217 out of 200 mixed amendments. In Mode 3 it has to find SA 3217 via `josh.search()` against the v2026.08 substrate snapshot.
Mode 3 search tool
Black-box from the agent's POV. The agent calls `josh.search(query: str, k: int = 10)` and gets ranked chunks back. It does not see whether the underlying implementation is FTS5 BM25, vector search, hybrid, or a reranked stack.
Rationale (decided 2026-05-09): wrapping and abstracting maximizes the chance that retrieval improvements propagate transparently. A low-level interface exposing FTS5 + vector tools separately would test the agent's ability to drive a retrieval API rather than the harness's ability to retrieve. Mode 3 is meant to measure the second.
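The black-box arrangement can be sketched as a single callable wrapped around a swappable backend. Only the `search(query, k=10)` signature comes from this document; the `Chunk` fields, the `SearchBackend` protocol, the toy `KeywordBackend`, and `make_search_tool` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    score: float


class SearchBackend(Protocol):
    def search(self, query: str, k: int = 10) -> list[Chunk]: ...


class KeywordBackend:
    """Toy stand-in backend: scores chunks by query-term overlap."""

    def __init__(self, chunks: dict[str, str]):
        self._chunks = chunks

    def search(self, query: str, k: int = 10) -> list[Chunk]:
        terms = set(query.lower().split())
        hits = []
        for doc_id, text in self._chunks.items():
            overlap = len(terms & set(text.lower().split()))
            if overlap:
                hits.append(Chunk(doc_id, text, float(overlap)))
        hits.sort(key=lambda c: c.score, reverse=True)
        return hits[:k]


def make_search_tool(backend: SearchBackend):
    """Return the only callable the agent sees; the backend stays hidden."""
    def josh_search(query: str, k: int = 10) -> list[Chunk]:
        return backend.search(query, k)
    return josh_search


tool = make_search_tool(KeywordBackend({
    "amendment:sa-119-3217": "SA 3217 strikes the appropriation",
    "bill:hr-119-5417": "HR 5417 appropriations bill text",
}))
hits = tool("SA 3217 appropriation", k=1)
print(hits[0].doc_id)  # amendment:sa-119-3217
```

Swapping `KeywordBackend` for a hybrid or reranked stack changes nothing on the agent side, which is the propagation property the rationale above is after.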
Distractor sampling for Mode 2
Hand-picked distractors don't scale. Random substrate samples are mostly trivially irrelevant (no signal). The right balance is semantic-near-miss sampling: pull from positions ~50–200 in a vector-similarity ranking against the gold docs. These documents are plausibly relevant by construction (the embedding thinks they're close) but not actually relevant (the labels say so). The approach is cheap to generate and produces real filtering load.
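A minimal sketch of near-miss sampling follows. It assumes "ranking against the gold docs" means cosine similarity to the centroid of the gold embeddings, which is one possible reading; `sample_near_misses` and the synthetic 2-D vectors are hypothetical, and a real run would use the pipeline's embedding model.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sample_near_misses(gold_vecs, corpus_vecs, gold_ids, start=50, stop=200):
    """Return doc IDs at ranks [start, stop) by similarity to the gold centroid.

    Gold docs are excluded before ranking so they never appear as distractors.
    """
    dim = len(gold_vecs[0])
    centroid = [sum(v[i] for v in gold_vecs) / len(gold_vecs) for i in range(dim)]
    ranked = sorted(
        (doc_id for doc_id in corpus_vecs if doc_id not in gold_ids),
        key=lambda doc_id: cosine(centroid, corpus_vecs[doc_id]),
        reverse=True,
    )
    return ranked[start:stop]


# Synthetic corpus: doc-000 is gold; similarity falls off as the index grows.
corpus = {
    f"doc-{i:03d}": [math.cos(i * 0.005), math.sin(i * 0.005)]
    for i in range(300)
}
distractors = sample_near_misses([[1.0, 0.0]], corpus, gold_ids={"doc-000"})
print(len(distractors), distractors[0], distractors[-1])  # 150 doc-051 doc-200
```

Skipping the first 50 ranks is what keeps near-duplicates and unlabeled true positives out of the distractor set; whether 50 is the right cutoff is the calibration question raised under Open questions.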
Mode parity check (CI gate)
A task should be solvable in all three modes. A buggy distractor selection could make Mode 2 unsolvable while Mode 1 is fine. CI runs a strong reference model in all three modes for every new or modified task; if Mode 1 < 0.95 the task itself is broken.
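The gate itself reduces to a threshold check over reference-model scores. The sketch below assumes scores arrive as a mapping from task ID to per-mode score; `parity_failures` and the second task ID are hypothetical, and only the 0.95 Mode 1 floor comes from the text above.

```python
def parity_failures(reference_scores, mode_1_floor=0.95):
    """Return task IDs whose reference-model Mode 1 score is below the floor."""
    return sorted(
        task_id
        for task_id, scores in reference_scores.items()
        if scores["mode_1"] < mode_1_floor
    )


reference_scores = {
    "appropriations/track-sa-3217": {"mode_1": 1.0, "mode_2": 0.9, "mode_3": 0.6},
    "appropriations/example-broken": {"mode_1": 0.7, "mode_2": 0.7, "mode_3": 0.5},
}
broken = parity_failures(reference_scores)
print(broken)  # ['appropriations/example-broken']
```

A CI job could exit non-zero whenever the returned list is non-empty, which blocks the merge of a task whose rubric or bundle is itself faulty.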
Run shape
```bash
josh-eval harness run \
  --task=appropriations/track-sa-3217 \
  --mode=mode_2 \
  --model=anthropic/claude-sonnet-4-7
```

Output: `results/harness/<task-id>/<mode>/<model-config>/<timestamp>/` containing `config.json`, `transcript.jsonl`, `metrics.json`, `output/` (the deliverables), `scores.json`, `report.html`. Layout mirrors LAB's `results/<run-id>/` shape.
Scoring
All-pass binary scoring per LAB. Each criterion is judged independently against its scoped deliverable file by an LLM judge (default `claude-sonnet-4-7` at temperature 0). A task scores 1.0 only if every criterion passes; partial credit is recorded as a diagnostic (`n_passed / n_criteria`) but does not move the headline number.
Rationale: a vote-prediction missing the swing senator is not 95% useful; it is wrong. A regulatory comment summary that misses the rulemaking deadline is not "high quality with one gap"; it is a failure. The operational question is how often the agent gets everything right, run-by-run. Adopt LAB's framing wholesale.
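The all-pass rule and its partial-credit diagnostic can be written in a few lines. This is a sketch, not the actual `judge.py`: it assumes the judge's verdicts have already been reduced to a mapping from criterion ID to a boolean, and the `score_task` name and output keys are illustrative.

```python
def score_task(verdicts):
    """Apply all-pass scoring to judge verdicts.

    verdicts maps criterion id -> bool (True means the judge returned PASS).
    """
    n_criteria = len(verdicts)
    n_passed = sum(1 for passed in verdicts.values() if passed)
    return {
        # Headline number: 1.0 only when every criterion passes.
        "score": 1.0 if n_criteria > 0 and n_passed == n_criteria else 0.0,
        # Diagnostics only; they never move the headline number.
        "n_passed": n_passed,
        "n_criteria": n_criteria,
        "pass_fraction": n_passed / n_criteria if n_criteria else 0.0,
    }


print(score_task({"C-001": True, "C-002": True})["score"])   # 1.0
print(score_task({"C-001": True, "C-002": False})["score"])  # 0.0
```

In the second call the diagnostic `pass_fraction` is 0.5 while the headline score is 0.0, which is exactly the distinction the rationale draws.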
Public-vs-private boundary
The eval framework is fully open. What does and does not enter it:
In the open benchmark
- The substrate snapshot (rebuilt from public APIs)
- All task definitions (instructions, rubrics, deliverables)
- All Mode 1 / Mode 2 bundles (synthetic or curated from public material)
- Mode 3 substrate references (doc IDs in the snapshot)
- The runner, judge, model adapters, sandbox
- Josh's own scores on the benchmark — published openly
Stays private
- Proprietary downstream models layered on top of the substrate (forecasting, prediction) are evaluated by the benchmark but not part of it. Any product running on Josh Foundation can compete on the same public test.
- Hill-staff rubric authoring notes, contractor calibration sessions
- Any held-back test split (MTEB pattern: public-dev / private-test, rotated quarterly to defend against contamination)
- Operator telemetry, deployment-private ranking signals
What evaluators of other products release
Any independent product can claim a number on this benchmark by running its stack against the queries and rubrics; it does not need to install Josh Foundation or release its source. Independent and Josh-published scores exist on the same axis.
Three benchmark types, three reproducibility models
| Type | What's compared | Reproducibility |
|---|---|---|
| A. Retrieval pipeline comparison | Different chunker/embedder/ranker stacks, all running against the same substrate snapshot | Fully reproducible — anyone runs josh-eval retrieval run and gets the same answer. |
| B. Model comparison on Josh's harness | Different LLMs running through Josh's harness, same task corpora and rubrics | Fully reproducible — anyone runs josh-eval harness run --model=X and gets the same answer. |
| C. Product comparison | Different end-to-end products running against the same queries | Reproducible per-participant, but cross-participant numbers are not apples-to-apples by construction. That's the right framing — the test is "whose final answer is better," not "whose retrieval algorithm is better." |
Different benchmark, different question, different reproducibility model. Don't try to make one benchmark do all three jobs.
What we adopt from Harvey LAB
LAB is the closest analog in any vertical. Summary of decisions adopted:
| LAB pattern | Adopted? | Notes |
|---|---|---|
| Filesystem-first results layout | Yes — verbatim | results/<run-id>/{config.json, transcript.jsonl, scores.json, report.html, output/} |
| `task.json` + `documents/` per task | Yes — extended | Add the three-mode bundle layout. |
| Rubric with `match_criteria` text as the standard, no golden reference | Yes — verbatim | Cheaper to author than golden answers; self-documenting at the criterion level. |
| All-pass binary scoring | Yes — for harness only | Wrong shape for retrieval (NDCG is graded by design). |
| LLM judge at temperature 0, JSON verdict + reasoning | Yes — verbatim | Default claude-sonnet-4-7. Reasoning recorded for post-hoc review. |
| Per-criterion deliverable scoping | Yes — verbatim | Judge sees only the deliverable files relevant to the criterion. |
| Model adapter pattern (Anthropic / OpenAI / Google / Mistral) | Yes — verbatim | 4-method interface, provider-prefix routing. |
| Six closed-workspace tools (bash/read/write/edit/glob/grep) | Yes — for Modes 1 + 2 | Add josh.search() for Mode 3. |
| Per-task Podman sandbox | Yes — verbatim | --network=none --cap-drop=ALL, ro /workspace/documents, rw /workspace/output. Mode 3 allows substrate API only. |
| Sweep harness | Yes — extended | Sweeps over (task × mode × model) instead of just (task × model). |
| Static HTML comparison dashboards | Yes — verbatim | Same layout, ranks configs by all-pass rate. |
| Closed-corpus only (no large-scale retrieval test) | No — added Mode 3 | LAB's 5–30 doc per task is too small for realism. Mode 3 tests at substrate scale. |
| 1,251 hand-authored tasks across 24 practice areas | No — different shape | Federal policy uses ~15–20 policy areas instead. Task count TBD; aim for credible launch coverage rather than parity with LAB. |
| Synthetic data | No — public real data | Federal policy data is public; we can use real bills/rules/hearings. Bundles in Mode 1/2 are extracted from the substrate snapshot. |
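The sandbox row in the table maps onto a `podman run` invocation along the following lines. This sketch only builds the argument list and does not execute it; the image name, host paths, and agent command are placeholders, and the Mode 3 allowance for the substrate API is left out because this document does not say how it is implemented.

```python
def podman_argv(documents_dir, output_dir, image, command):
    """Build a podman command line for one closed-workspace task run.

    Covers Modes 1 and 2 only; image and command are hypothetical inputs.
    """
    return [
        "podman", "run", "--rm",
        "--network=none",   # no network access from inside the sandbox
        "--cap-drop=ALL",   # drop all Linux capabilities
        "--volume", f"{documents_dir}:/workspace/documents:ro",
        "--volume", f"{output_dir}:/workspace/output:rw",
        image,
        *command,
    ]


argv = podman_argv(
    documents_dir="/tmp/task/bundles/mode-2",  # placeholder host path
    output_dir="/tmp/task/output",             # placeholder host path
    image="localhost/josh-eval-agent:dev",     # hypothetical image name
    command=["run-agent"],                     # hypothetical entrypoint
)
print(" ".join(argv))
```

Returning a list rather than a shell string means the arguments could be passed straight to `subprocess.run` without shell quoting concerns.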
Open questions
- Distractor sampling: semantic-near-miss configuration. Position 50–200 in vector ranking is a sketch. Needs calibration: do distractors at position 50 trip up models more than position 200? Worth a small experiment when the first task lands.
- HF Datasets archive size. A full substrate snapshot at v2026.08 might be 30–80 GB. HF Datasets handles this fine, but bandwidth on first download is non-trivial. Consider a "tiny" variant for CI testing (one source, one month of data, ~500 MB).
- Mode parity threshold. "Mode 1 < 0.95 = task broken" is a reasonable default but unjustified. After running the first 50 tasks across a strong model, look at the Mode 1 distribution and pick a threshold from data.
- Per-criterion `sources` enforcement. Mode 3 task scoring should arguably check that the agent actually retrieved the cited sources, not just that the deliverable contains the right facts. Adds a "retrieval check" diagnostic alongside the rubric pass/fail. Worth piloting on a few tasks.
- Eval set versioning vs substrate versioning. A query labeled against substrate v2026.08 may be wrong against v2026.09 (the corpus changed). Re-labeling on every snapshot is heavy. Lean: pin eval-sets to a substrate version; bump together when a new eval-set version is authored.
- Public-dev / private-test split mechanics. When does the private test set get rotated? Who has access? Mention in launch press to set expectations, or stay quiet?
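If the per-criterion `sources` check from the list above is piloted, the diagnostic could be as simple as a set comparison between a criterion's `sources` and the document IDs the agent's search calls returned. The sketch assumes those retrieved IDs can be extracted from `transcript.jsonl`; `retrieval_check` and its output keys are hypothetical.

```python
def retrieval_check(criterion_sources, retrieved_ids):
    """Report which of a criterion's cited sources the agent retrieved."""
    cited = set(criterion_sources)
    found = cited & set(retrieved_ids)
    return {
        "retrieved": sorted(found),
        "missing": sorted(cited - found),
        "coverage": len(found) / len(cited) if cited else 1.0,
    }


check = retrieval_check(
    criterion_sources=["bill:hr-119-5417", "amendment:sa-119-3217"],
    retrieved_ids=["bill:hr-119-5417", "amendment:sa-119-3215"],
)
print(check["missing"])   # ['amendment:sa-119-3217']
print(check["coverage"])  # 0.5
```

Keeping this separate from the rubric verdict preserves all-pass scoring while still flagging deliverables that state the right fact without the agent having retrieved its source.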
Cross-references
- chunking: retrieval eval design specifics (query authoring, labeling, metrics, decision rules)
- repo-structure "Launch repo split (planned)": where `josh-eval` lives at launch