Evaluation architecture

How Josh evaluates two distinct things — substrate retrieval quality (Phase 1) and agent harness quality (Phase 2) — through one shared eval framework.

This document is the canonical reference for evaluation design across both phases. When per-phase docs (e.g. chunking for chunker eval specifics) disagree with this document on cross-cutting concerns, this document wins.

Where this lives: the eval framework ships as usejosh/josh-eval, a separate public repo (MIT) at v1 launch. See repo-structure "Launch repo split (planned)" for the topology.

Why two evaluation systems

Josh has two architectural surfaces that need measurement, and they call for fundamentally different kinds of measurement:

System	What it measures	Metric shape
Retrieval eval (Phase 1)	Given a query, does the substrate surface the right chunk in the top-k? Tests chunker × embedding × ranker.	Graded relevance: NDCG@10, Recall@10, MRR over labeled (query, chunk) pairs.
Harness eval (Phase 2)	Given an instruction + corpus access, does the agent produce a correct deliverable? Tests model + tools + harness end-to-end.	Binary pass/fail per rubric criterion; all-pass scoring per task (LAB-style).

These are complementary halves of the same engineering question. Retrieval eval keeps the model side trivial (no LLM in the loop, only the embedding + reranker stack under test) and varies the retrieval pipeline. Harness eval treats retrieval as given (a closed corpus or a snapshotted substrate) and varies the model + agent loop. Neither subsumes the other.

They share substrate snapshots, the LLM-as-judge wrapper, the CLI surface, the reporting layer, and the model adapter abstraction. They diverge on what they evaluate.

The unified framework

josh-eval/
  substrate/            # SHARED — versioned corpus snapshots
    manifests/          # source URLs + SHA256s + retrieval timestamps
    snapshots/          # locally-rebuilt corpus at version v2026.08
    rebuild.py          # deterministic rebuild from manifest
  retrieval/            # PHASE 1 — substrate retrieval quality
    eval-sets/          # queries + labels per source/topic
    runner.py
    scorer.py           # NDCG@k / Recall@k / MRR
  harness/              # PHASE 2 — agent deliverable quality
    tasks/              # task.json + (Mode 1/2 bundles)
    runner.py
    judge.py            # all-pass rubric scoring
    adapters/           # Anthropic / OpenAI / Google / Mistral
    tools/              # bash / read / write / edit / glob / grep / josh.search
    sandbox/            # Podman per-task isolation
  shared/               # infra used by both
    judge_model.py      # common LLM-as-judge wrapper
    cli.py              # `josh-eval` entrypoint with subcommands
    report/             # static HTML
    sweep.py            # matrix runs
    compare.py          # dashboard generator
  results/              # gitignored output of any run

Single CLI entrypoint with subcommands:

josh-eval substrate fetch --version=v2026.08
josh-eval substrate rebuild --manifest=./manifests/v2026.08/manifest.json
josh-eval retrieval run --eval-set=federal-register --pipeline=bm25+arctic-l+rerank
josh-eval harness run --task=appropriations/track-sa-3217 --mode=mode_2 --model=anthropic/claude-sonnet-4-7
josh-eval sweep --task=appropriations --modes=mode_1,mode_2,mode_3 --models=sonnet,gpt-5
josh-eval compare --area=appropriations

Shared global flags: --substrate-version, --judge-model, --parallel, --out-dir.
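A sweep is the cross product of tasks, modes, and models. The following is a minimal sketch of that expansion, not the contents of sweep.py; `RunSpec` and `expand_sweep` are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RunSpec:
    """One cell of the sweep matrix (hypothetical structure)."""
    task: str
    mode: str
    model: str


def expand_sweep(tasks, modes, models):
    """Return one RunSpec per (task, mode, model) combination."""
    return [RunSpec(t, m, mo) for t, m, mo in product(tasks, modes, models)]


specs = expand_sweep(
    tasks=["appropriations/track-sa-3217"],
    modes=["mode_1", "mode_2", "mode_3"],
    models=["sonnet", "gpt-5"],
)
print(len(specs))  # 1 task x 3 modes x 2 models = 6 runs
```

Each spec is independent of the others, which is presumably what makes a --parallel flag straightforward to apply across the matrix.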

Substrate snapshot system (shared)

Both eval systems test against a versioned substrate. The substrate moves daily as new bills, rules, and hearings ingest; for any benchmark to be reproducible, runs must pin to a specific corpus state.

Manifest format

substrate/manifests/v2026.08/
  manifest.json
  sources/
    federal-register.json    # doc_id → {url, sha256, retrieved_at}
    congress-bills.json
    crs-reports.json
    legislators-committees.json
    ...

Each source manifest is a flat list of (doc_id, source_url, sha256, retrieved_at) tuples. The full snapshot manifest is small (~MB); the rebuilt corpus is large (~tens of GB).
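As a rough illustration of how a manifest entry pins a document, the sketch below checks fetched bytes against the recorded SHA256. It assumes each entry is a JSON object with the four fields named above; the exact on-disk layout, the example URL, and the helper names `sha256_hex` and `verify_document` are assumptions, not the actual rebuild.py.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Hex digest of the raw document bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_document(entry: dict, content: bytes) -> bool:
    """True when the fetched bytes match the hash recorded in the manifest."""
    return sha256_hex(content) == entry["sha256"]


# Hypothetical source-manifest content; field names follow the tuple above.
content = b"example document body"
manifest_text = json.dumps([
    {
        "doc_id": "example-doc-1",
        "source_url": "https://example.invalid/doc-1",
        "sha256": sha256_hex(content),
        "retrieved_at": "2026-08-01T00:00:00Z",
    }
])

entries = json.loads(manifest_text)
print(verify_document(entries[0], content))                 # True
print(verify_document(entries[0], content + b" (edited)"))  # False
```

A mismatch means the upstream document changed after the snapshot was tagged, which is the case where a rebuilt substrate can no longer be bit-identical to the published one.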

Two rebuild paths

Primary: download from Hugging Face Datasets. josh-eval substrate fetch --version=v2026.08 pulls a pre-built SQLite + chunks + indexes archive from huggingface.co/datasets/usejosh/substrate-snapshots. This is what most eval users want: no scrapers, no API quotas, just download and query.
Escape hatch: rebuild from manifest. josh-eval substrate rebuild --version=v2026.08 reads the manifest, hits the public APIs, downloads each source, verifies SHA256s, and rebuilds the SQLite database locally. Requires pip install josh-foundation. For Josh-internal verification or genuinely paranoid users (e.g., academic reviewers reproducing a paper).

Both paths produce bit-identical (or close-to-identical, modulo upstream-API changes) substrates at a given version. The HF archive is built internally by Josh, running the rebuild path against the current public APIs at tag time.

Snapshot cadence

Monthly tagged versions: v2026.08, v2026.09, etc. A floating latest alias points at the most recent version.

Eval sets and harness Mode 3 tasks pin a specific version. Re-running an old eval against an old version produces the same number; pinning to latest produces a number that drifts over time.

What's in the snapshot

The snapshot is pure substrate: what every Josh Foundation install would have if it ingested the same source URLs at the same time. Proprietary downstream signals (forecasting outputs, prediction overlays, hand annotations) are not in the snapshot.

Retrieval eval (Phase 1)

Detailed design lives in chunking — query authoring, candidate gathering, labeling, metrics, decision rules. The relevant additions from this document are:

Eval sets pin a substrate version. retrieval/eval-sets/<source>/manifest.json declares the substrate version the eval was authored against. Re-running on a newer version is allowed but flagged in reports — labels may have stale assumptions about what's in the corpus.
The runner moves to josh-eval. What chunking calls shared/josh_substrate/src/josh_substrate/eval/chunker_compare.py actually lives in josh-eval/retrieval/runner.py. The chunker protocol and chunker implementations stay in josh-substrate (production code), but the eval runner is research code and lives in the eval repo.
The CLI is josh-eval retrieval run (not bare josh-eval chunker as chunking hinted) — subcommand under the unified eval CLI.
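The version-pinning rule can be sketched as a small check that the report layer might run before scoring. This is illustrative only: `version_key` and `stale_label_warning` are hypothetical helpers, and the sketch assumes version tags always follow the vYYYY.MM pattern used in this document.

```python
from typing import Optional, Tuple


def version_key(version: str) -> Tuple[int, int]:
    """Parse 'v2026.08' into (2026, 8) so versions compare chronologically."""
    year, month = version.lstrip("v").split(".")
    return int(year), int(month)


def stale_label_warning(authored: str, run: str) -> Optional[str]:
    """Return a report flag when the run uses a newer substrate than the labels."""
    if version_key(run) > version_key(authored):
        return (
            f"eval set authored against {authored}, run on {run}: "
            "labels may be stale"
        )
    return None


print(stale_label_warning("v2026.08", "v2026.08"))  # None
print(stale_label_warning("v2026.08", "v2026.09"))
```

Running against an older substrate than the one the labels were authored on is not addressed here; this sketch leaves that case unflagged.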

Run shape:

josh-eval retrieval run \
  --eval-set=federal-register \
  --substrate-version=v2026.08 \
  --pipeline=bm25+arctic-l+rerank

Output: results/retrieval/<eval-set>/<pipeline-config>/<timestamp>/scores.json + report.html. Pooled NDCG@10 / Recall@10 / MRR plus per-query breakdown.
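For reference, the three metrics can be computed per query roughly as follows. This is a sketch under stated assumptions rather than the actual scorer.py: it uses linear gain for NDCG (some implementations use 2^grade - 1), treats any grade above 0 as relevant for Recall and MRR, and the function names are illustrative.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranked_ids, labels, k=10):
    """labels maps chunk id -> graded relevance; unlabeled chunks count as 0."""
    gains = [labels.get(cid, 0) for cid in ranked_ids[:k]]
    ideal_dcg = dcg(sorted(labels.values(), reverse=True)[:k])
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranked_ids, labels, k=10):
    """Fraction of relevant chunks (grade > 0) that appear in the top k."""
    relevant = {cid for cid, grade in labels.items() if grade > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


def reciprocal_rank(ranked_ids, labels):
    """1 / rank of the first relevant chunk, or 0.0 if none was returned."""
    for rank, cid in enumerate(ranked_ids, start=1):
        if labels.get(cid, 0) > 0:
            return 1.0 / rank
    return 0.0


# Toy query: two labeled chunks, first relevant hit at rank 2.
labels = {"c1": 2, "c2": 1}
ranked = ["c9", "c1", "c2"]
print(recall_at_k(ranked, labels))      # 1.0
print(reciprocal_rank(ranked, labels))  # 0.5
print(round(ndcg_at_k(ranked, labels), 4))
```

The pooled numbers in scores.json would then be means of these per-query values; MRR is the mean of `reciprocal_rank` across queries.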

Harness eval (Phase 2) — three modes

Phase 2 ships an agent UI / harness that takes natural-language instructions, calls tools against the substrate, and produces deliverables. The harness eval measures whether the agent's deliverables are correct.

Modeled on Harvey's Legal Agent Benchmark (LAB).

Three modes for the same task

A single task.json is runnable in three modes that differ only in how the agent gets to the documents. The task definition (instructions, rubric, deliverables) stays identical.

Mode	Document access	What it tests
Mode 1 — pure-relevant bundle	The agent gets a sealed `documents/` folder containing only relevant docs (~10–30 files, all gold).	Pure reasoning + synthesis. No filtering load. The "best case" — perfect retrieval, perfect scoping. Diagnostic: if the agent fails Mode 1, the failure is reasoning, not retrieval.
Mode 2 — gold + distractors bundle	Same sealed folder pattern, but with 50–200 plausible distractor docs added (~60–230 files total).	In-corpus filtering + reasoning. The agent has to identify which files matter. Closest to LAB's existing setup but with larger document sets to address LAB's "5–30 docs is too small for realism" criticism.
Mode 3 — live substrate snapshot	No sealed folder. The agent has a `josh.search(query, k=10)` tool against a snapshotted substrate (1M+ docs at the pinned version). Sandbox is `--network=none` except for substrate API.	Retrieval-at-scale + reasoning. The hardest mode. The gold docs are buried in a real haystack the agent has to navigate via search.

Why three, not two

Rationale (decided 2026-05-09): Mode 1 isolates reasoning from filtering in a way Mode 2 doesn't. A model that scores high on Mode 1 but poorly on Mode 2 has a filtering problem; one that scores poorly on both has a reasoning problem. Without Mode 1 we can't disambiguate.

A model run produces a 3-tuple score (mode_1, mode_2, mode_3). The shape of that tuple is diagnostic:

Profile	What it means
`(0.85, 0.62, 0.38)`	Reasoning fine; filtering degrades; live retrieval breaks down. Investigate Mode 2 filtering as well as the Mode 3 retrieval pipeline.
`(0.85, 0.82, 0.40)`	Reasoning fine; filtering fine; live retrieval breaks down. The gap is purely retrieval-at-scale.
`(0.45, 0.43, 0.40)`	Reasoning is the bottleneck; filtering and retrieval are not the issue.
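The reading of the tuple can be made mechanical by looking at the Mode 1 level and the two mode-to-mode drops. The sketch below is illustrative: the `diagnose` function and its `floor` and `gap` thresholds are assumptions chosen only so the three example profiles classify as described, not calibrated values.

```python
def diagnose(mode_1, mode_2, mode_3, floor=0.7, gap=0.15):
    """Name the likely bottlenecks for a (mode_1, mode_2, mode_3) score tuple.

    floor: Mode 1 score below which reasoning itself is suspect (assumed).
    gap:   drop between adjacent modes treated as significant (assumed).
    """
    findings = []
    if mode_1 < floor:
        findings.append("reasoning")
    if mode_1 - mode_2 > gap:
        findings.append("filtering")
    if mode_2 - mode_3 > gap:
        findings.append("retrieval-at-scale")
    return findings


print(diagnose(0.85, 0.62, 0.38))  # ['filtering', 'retrieval-at-scale']
print(diagnose(0.85, 0.82, 0.40))  # ['retrieval-at-scale']
print(diagnose(0.45, 0.43, 0.40))  # ['reasoning']
```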

Task format

harness/tasks/<area>/<task>/
  task.json
  bundles/
    mode-1/             # ~10-30 pure-relevant docs (all gold)
    mode-2/             # ~60-230 docs (gold + plausible distractors)
    # mode-3 has no bundle — task.json references substrate doc IDs

task.json (sketch):

{
  "title": "Track SA 3217 amendments to FY2026 appropriations bill",
  "instructions": "Review the attached materials for HR 5417 and prepare a brief summarizing all pending amendments...",
  "deliverables": {
    "amendments-brief.docx": "amendments-brief.docx"
  },
  "criteria": [
    {
      "id": "C-001",
      "title": "Identifies SA 3217 strikes the §4(b) appropriation",
      "match_criteria": "PASS if the brief notes that SA 3217, filed by Sen. Lee on May 2, strikes the $1.4B figure in §4(b)(1) and replaces it with $0...",
      "deliverables": ["amendments-brief.docx"],
      "sources": ["bill:hr-119-5417", "amendment:sa-119-3217"]
    }
  ],
  "modes": {
    "mode_1": {"docs": ["hr-5417-text.docx", "sa-3217-text.docx", ...]},
    "mode_2": {"docs": ["hr-5417-text.docx", "sa-3217-text.docx", "sa-3215-text.docx", ...], "distractor_count": 180},
    "mode_3": {
      "substrate_version": "v2026.08",
      "gold_source_ids": ["bill:hr-119-5417", "amendment:sa-119-3217"]
    }
  }
}

The crucial property: the rubric describes facts to surface, not files to read. "PASS if the brief notes that SA 3217 strikes the §4(b) appropriation" works in all three modes. In Mode 1 the agent reads SA 3217 from a guaranteed-relevant folder. In Mode 2 it filters SA 3217 out of 200 mixed amendments. In Mode 3 it has to find SA 3217 via josh.search() against the v2026.08 substrate snapshot.
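Because the rubric refers to facts and source IDs rather than bundle files, a task can be sanity-checked without looking at any bundle. The sketch below is one possible check, not part of the framework: `validate_task` is a hypothetical helper, and the rule that every criterion source must appear in the Mode 3 `gold_source_ids` is an assumption about the intended invariant.

```python
def validate_task(task: dict) -> list:
    """Return a list of consistency problems found in a task definition."""
    problems = []
    declared = set(task["deliverables"])
    gold = set(task["modes"]["mode_3"]["gold_source_ids"])
    for criterion in task["criteria"]:
        for name in criterion["deliverables"]:
            if name not in declared:
                problems.append(f'{criterion["id"]}: undeclared deliverable {name}')
        for source in criterion["sources"]:
            if source not in gold:
                problems.append(
                    f'{criterion["id"]}: source {source} not in mode_3 gold_source_ids'
                )
    return problems


# Trimmed version of the task.json sketch, keeping only the fields checked here.
task = {
    "deliverables": {"amendments-brief.docx": "amendments-brief.docx"},
    "criteria": [
        {
            "id": "C-001",
            "deliverables": ["amendments-brief.docx"],
            "sources": ["bill:hr-119-5417", "amendment:sa-119-3217"],
        }
    ],
    "modes": {
        "mode_3": {
            "substrate_version": "v2026.08",
            "gold_source_ids": ["bill:hr-119-5417", "amendment:sa-119-3217"],
        }
    },
}
print(validate_task(task))  # []
```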

Mode 3 search tool

Black-box from the agent's POV. The agent calls josh.search(query: str, k: int = 10) and gets ranked chunks back. It does not see whether the underlying implementation is FTS5 BM25, vector search, hybrid, or a reranked stack.

Rationale (decided 2026-05-09): wrapping and abstracting maximizes the chance that retrieval improvements propagate transparently. A low-level interface exposing FTS5 + vector tools separately would test the agent's ability to drive a retrieval API rather than the harness's ability to retrieve. Mode 3 is meant to measure the second.
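One way to picture the black-box boundary is a thin tool wrapper over a swappable backend. Everything in this sketch is hypothetical (the `Chunk` fields, the `SearchBackend` protocol, and the toy `KeywordBackend`); only the `(query, k=10)` call shape comes from the text above.

```python
from dataclasses import dataclass
from typing import Dict, List, Protocol


@dataclass(frozen=True)
class Chunk:
    doc_id: str
    text: str
    score: float


class SearchBackend(Protocol):
    """Anything that can rank chunks: BM25, vectors, hybrid, reranked stack."""

    def search(self, query: str, k: int) -> List[Chunk]: ...


class KeywordBackend:
    """Toy stand-in backend: scores a doc by how many query terms it contains."""

    def __init__(self, docs: Dict[str, str]):
        self._docs = docs

    def search(self, query: str, k: int) -> List[Chunk]:
        terms = query.lower().split()
        hits = []
        for doc_id, text in self._docs.items():
            score = sum(term in text.lower() for term in terms)
            if score:
                hits.append(Chunk(doc_id, text, float(score)))
        return sorted(hits, key=lambda c: c.score, reverse=True)[:k]


class JoshSearchTool:
    """What the agent sees: one call, ranked chunks, no backend details."""

    def __init__(self, backend: SearchBackend):
        self._backend = backend

    def __call__(self, query: str, k: int = 10) -> List[Chunk]:
        return self._backend.search(query, k)


tool = JoshSearchTool(KeywordBackend({
    "amendment:sa-119-3217": "SA 3217 strikes the appropriation",
    "bill:hr-119-5417": "HR 5417 appropriations bill text",
}))
print([c.doc_id for c in tool("SA 3217")])  # ['amendment:sa-119-3217']
```

Swapping `KeywordBackend` for a different implementation changes nothing on the agent side, which is the property the rationale above is after.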

Distractor sampling for Mode 2

Hand-picked distractors don't scale. Random substrate samples are mostly trivially-irrelevant (no signal). The right balance is semantic-near-miss sampling: pull from positions ~50–200 in a vector-similarity ranking against the gold docs. Plausibly relevant by construction (the embedding thinks they're close) but not actually relevant (the labels say so). Cheap to generate, generates real filtering load.
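A minimal sketch of near-miss sampling, assuming document embeddings are already available as plain vectors. The function names are illustrative, ranking by maximum cosine similarity to any gold doc is one possible choice among several, and the 50–200 window is applied as a zero-based slice over the non-gold ranking. A real pipeline would also drop candidates that the labels mark as relevant.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sample_near_miss_distractors(embeddings, gold_ids, start=50, stop=200):
    """Pick non-gold docs at rank positions [start, stop) by similarity to gold."""
    gold = set(gold_ids)

    def best_gold_similarity(doc_id):
        return max(cosine(embeddings[doc_id], embeddings[g]) for g in gold)

    candidates = [d for d in embeddings if d not in gold]
    ranked = sorted(candidates, key=best_gold_similarity, reverse=True)
    return ranked[start:stop]


# Synthetic corpus: doc d<i> sits at a growing angle from the single gold doc,
# so similarity to gold falls as i rises.
embeddings = {"gold": (1.0, 0.0)}
for i in range(1, 301):
    embeddings[f"d{i}"] = (math.cos(i * 0.005), math.sin(i * 0.005))

distractors = sample_near_miss_distractors(embeddings, ["gold"])
print(len(distractors), distractors[0], distractors[-1])  # 150 d51 d200
```

Skipping the first 50 positions is what keeps the nearest neighbors, which are the most likely to be genuinely relevant, out of the distractor set.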

Mode parity check (CI gate)

A task should be solvable in all three modes. A buggy distractor selection could make Mode 2 unsolvable while Mode 1 is fine. CI runs a strong reference model in all three modes for every new or modified task; if Mode 1 < 0.95 the task itself is broken.
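The gate could look roughly like the sketch below. Two things here are assumptions that the text does not specify: that each mode's score is a pass rate over repeated reference-model runs (a single all-pass run can only score 0 or 1), and that a mode which never passes should also fail the gate. `parity_gate` is a hypothetical name.

```python
def parity_gate(pass_rates, mode_1_floor=0.95):
    """Return CI failure messages for one task; an empty list means it passes.

    pass_rates maps mode name -> reference-model pass rate in [0, 1].
    """
    failures = []
    if pass_rates["mode_1"] < mode_1_floor:
        failures.append("mode_1 below floor: the task itself is likely broken")
    for mode in ("mode_2", "mode_3"):
        # Assumed extra check: a mode that never passes is probably unsolvable.
        if pass_rates[mode] == 0.0:
            failures.append(f"{mode} never passes: check bundle or gold references")
    return failures


print(parity_gate({"mode_1": 1.0, "mode_2": 0.8, "mode_3": 0.5}))  # []
print(parity_gate({"mode_1": 0.9, "mode_2": 0.0, "mode_3": 0.5}))
```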

Run shape

josh-eval harness run \
  --task=appropriations/track-sa-3217 \
  --mode=mode_2 \
  --model=anthropic/claude-sonnet-4-7

Output: results/harness/<task-id>/<mode>/<model-config>/<timestamp>/ containing config.json, transcript.jsonl, metrics.json, output/ (the deliverables), scores.json, report.html. Layout mirrors LAB's results/<run-id>/ shape.

Scoring

All-pass binary scoring per LAB. Each criterion is judged independently against its scoped deliverable file by an LLM judge (default claude-sonnet-4-7 at temperature 0). A task scores 1.0 only if every criterion passes; the pass fraction (n_passed / n_criteria) is recorded as a partial-credit diagnostic but does not move the headline number.

Rationale: a vote prediction missing the swing senator is not 95% useful; it is wrong. A regulatory comment summary that misses the rulemaking deadline is not "high quality with one gap"; it is a failure. The operational question is how often the agent gets everything right, run by run. We adopt LAB's framing wholesale.
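The aggregation step is small enough to sketch directly. This assumes the judge has already produced one pass/fail verdict per criterion; the `Verdict` type, the `score_task` name, and the output field names are illustrative, not the actual judge.py or scores.json schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Verdict:
    criterion_id: str
    passed: bool


def score_task(verdicts: List[Verdict]) -> dict:
    """All-pass headline score plus the partial-credit diagnostic."""
    n_criteria = len(verdicts)
    n_passed = sum(v.passed for v in verdicts)
    all_pass = n_criteria > 0 and n_passed == n_criteria
    return {
        "score": 1.0 if all_pass else 0.0,  # headline number
        "n_passed": n_passed,
        "n_criteria": n_criteria,
        "partial": n_passed / n_criteria if n_criteria else 0.0,  # diagnostic only
    }


verdicts = [Verdict("C-001", True), Verdict("C-002", True), Verdict("C-003", False)]
print(score_task(verdicts)["score"])  # 0.0, even though 2 of 3 criteria passed
```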

Public-vs-private boundary

The eval framework is fully open. What does and does not enter it:

In the open benchmark

The substrate snapshot (rebuilt from public APIs)
All task definitions (instructions, rubrics, deliverables)
All Mode 1 / Mode 2 bundles (synthetic or curated from public material)
Mode 3 substrate references (doc IDs in the snapshot)
The runner, judge, model adapters, sandbox
Josh's own scores on the benchmark — published openly

Stays private

Proprietary downstream models layered on top of the substrate (forecasting, prediction) are evaluated by the benchmark but not part of it. Any product running on Josh Foundation can compete on the same public test.
Hill-staff rubric authoring notes, contractor calibration sessions
Any held-back test split (MTEB pattern: public-dev / private-test, rotated quarterly to defend against contamination)
Operator telemetry, deployment-private ranking signals

What evaluators of other products release

Any independent product can claim a number on this benchmark by running its stack against the queries and rubrics; it does not need to install Josh Foundation or release its source. Independent and Josh-published scores exist on the same axis.

Three benchmark types, three reproducibility models

Type	What's compared	Reproducibility
A. Retrieval pipeline comparison	Different chunker/embedder/ranker stacks, all running against the same substrate snapshot	Fully reproducible — anyone runs `josh-eval retrieval run` and gets the same answer.
B. Model comparison on Josh's harness	Different LLMs running through Josh's harness, same task corpora and rubrics	Fully reproducible — anyone runs `josh-eval harness run --model=X` and gets the same answer.
C. Product comparison	Different end-to-end products running against the same queries	Reproducible per-participant, but cross-participant numbers are not apples-to-apples by construction. That's the right framing — the test is "whose final answer is better," not "whose retrieval algorithm is better."

Different benchmark, different question, different reproducibility model. Don't try to make one benchmark do all three jobs.

What we adopt from Harvey LAB

LAB is the closest analog in any vertical. Summary of decisions adopted:

LAB pattern	Adopted?	Notes
Filesystem-first results layout	Yes — verbatim	`results/<run-id>/{config.json, transcript.jsonl, scores.json, report.html, output/}`
`task.json` + `documents/` per task	Yes — extended	Add the three-mode bundle layout.
Rubric with `match_criteria` text as the standard, no golden reference	Yes — verbatim	Cheaper to author than golden answers; self-documenting at the criterion level.
All-pass binary scoring	Yes — for harness only	Wrong shape for retrieval (NDCG is graded by design).
LLM judge at temperature 0, JSON verdict + reasoning	Yes — verbatim	Default `claude-sonnet-4-7`. Reasoning recorded for post-hoc review.
Per-criterion deliverable scoping	Yes — verbatim	Judge sees only the deliverable files relevant to the criterion.
Model adapter pattern (Anthropic / OpenAI / Google / Mistral)	Yes — verbatim	4-method interface, provider-prefix routing.
Six closed-workspace tools (bash/read/write/edit/glob/grep)	Yes — for Modes 1 + 2	Add `josh.search()` for Mode 3.
Per-task Podman sandbox	Yes — verbatim	`--network=none --cap-drop=ALL`, ro `/workspace/documents`, rw `/workspace/output`. Mode 3 allows substrate API only.
Sweep harness	Yes — extended	Sweeps over (task × mode × model) instead of just (task × model).
Static HTML comparison dashboards	Yes — verbatim	Same layout, ranks configs by all-pass rate.
Closed-corpus only (no large-scale retrieval test)	No — added Mode 3	LAB's 5–30 docs per task is too small for realism. Mode 3 tests at substrate scale.
1,251 hand-authored tasks across 24 practice areas	No — different shape	Federal policy uses ~15–20 policy areas instead. Task count TBD; aim for credible launch coverage rather than parity with LAB.
Synthetic data	No — public real data	Federal policy data is public; we can use real bills/rules/hearings. Bundles in Mode 1/2 are extracted from the substrate snapshot.

Open questions

Distractor sampling — semantic-near-miss configuration. Position 50–200 in vector ranking is a sketch. Needs calibration: do distractors at position 50 trip up models more than position 200? Worth a small experiment when the first task lands.
HF Datasets archive size. A full substrate snapshot at v2026.08 might be 30–80 GB. HF Datasets handles this fine, but bandwidth on first download is non-trivial. Consider a "tiny" variant for CI testing (one source, one month of data, ~500 MB).
Mode parity threshold. "Mode 1 < 0.95 = task broken" is a reasonable default but not yet backed by data. After running the first 50 tasks with a strong model, look at the Mode 1 distribution and pick a threshold from data.
Per-criterion sources enforcement. Mode 3 task scoring should arguably check that the agent actually retrieved the cited sources, not just that the deliverable contains the right facts. Adds a "retrieval check" diagnostic alongside the rubric pass/fail. Worth piloting on a few tasks.
Eval set versioning vs substrate versioning. A query labeled against substrate v2026.08 may be wrong against v2026.09 (the corpus changed). Re-labeling on every snapshot is heavy. Lean: pin eval-sets to a substrate version; bump together when a new eval-set version is authored.
Public-dev / private-test split mechanics. When does the private test set get rotated? Who has access? Mention in launch press to set expectations, or stay quiet?

Cross-references

chunking — retrieval eval design specifics: query authoring, labeling, metrics, decision rules
repo-structure "Launch repo split (planned)" — where josh-eval lives at launch