Per-source retrieval-eval confirmation

Header

Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.

Why

The substrate has a default (model × chunker) pair locked from the CRS
embedding ablation (mxbai-embed-large-v1 + fixed_token_512_50), but the
win there was 0.026 NDCG — just over the 0.02 noise floor for a 50-query
set. Other corpora (bills, Federal Register, hearings, U.S. Code, etc.)
have meaningfully different shape: heavier structure, longer documents,
more citations, transcript-style turn structure. The CRS-winning combo
may underperform on those.

This spec defines a *light* per-source retrieval-eval that runs once
before a source's first bulk-embed, using the same harness as the CRS
ablation but scoped to fit a $2-5 Modal budget and an evening of
operator time. It is the formal NDCG/Recall/MRR eval the
substrate-embedding-evaluation (pre-bulk gate) is explicitly NOT —
the gate catches degenerate chunkers; this eval picks the chunker.

User stories

As a substrate operator, I want a one-command per-source eval that confirms or overrides the default (model, chunker) before bulk embed so that I never lock a source into a config that beats the default on CRS but loses by 0.05 NDCG on this source's shape.

As a substrate maintainer, I want an immutable eval artifact per source under josh-eval/retrieval/eval-sets/<source>/v1/ so that future model swaps have a baseline to beat, source by source.

As a contributor proposing a chunker change for an existing source, I want a documented protocol for re-running the eval at v2 so that chunker improvements are evidence-driven, not vibes-driven.

Acceptance criteria (EARS)

Before a source's first bulk-embed run, the system shall produce a per-source eval result at `josh-eval/retrieval/eval-sets/<source>/v1/results.html` showing NDCG@10, Recall@10, MRR for at minimum 6 (model × chunker) cells plus BM25, against ≥25 labeled queries.
When the eval runs, the system shall include `Snowflake/snowflake-arctic-embed-l-v2.0 × fixed_token_512_50` as the production baseline (the Phase 1 winner per `https://docs.usejosh.com/operations/embedding-architecture/#model-choice`).
When the eval runs against a source with predicted-best structural chunker per `https://docs.usejosh.com/operations/chunking/`, the system shall include at minimum one structural cell (e.g., `section_aware_fr_v1` for FR, `section_aware_bills_v1` for bills, `usc_uslm_section_v1` for USC/public-laws, `speaker_turn_v1` for hearings/Congressional Record).
When the eval runs against US Code or public laws, the system shall include a hybrid (BM25 + dense rerank) cell as a third arm — citation-heavy corpora are the wedge where hybrid may dominate either pure approach.
When a cell beats the production baseline (Arctic-L × fixed_token_512_50) by ≥0.02 NDCG@10 with ≤0.02 Recall@10 regression on this source's eval set, the system shall flag it as a candidate replacement; the operator decides whether to deploy the replacement or the default.
When no cell beats the production baseline, the system shall deploy the baseline as this source's live (model, chunker) and record the decision in `eval-sets/<source>/v1/decision.json`.
The per-source eval set (queries + labels + judge prompt + decision) shall be committed to the tracked repo and immutable once labeled; query or label changes force a v2 with re-baselining.
While the OVHcloud bare-metal host (`substrate-bare-metal-host`) is being provisioned, the eval harness shall run independently against Modal — Phase 2 work is NOT blocked on host readiness.

Success determiner

kindmanual

Checklist

bin/per-source-eval.py exists and is executable
For at least one source past CRS, `josh-eval/retrieval/eval-sets/<source>/v1/results.html` exists with cell rankings
decision.json names the chosen (model, chunker) for that source
Eval set queries and labels live in the tracked tree, not just on a developer's laptop

Determiner upgrades to bash once `bin/per-source-eval.py` lands. At that point the determiner is `bin/per-source-eval.py --source=<src> --dry-run` with `expect: "decision recorded"`. Keeping it `manual` while the spec is draft so the determiner doesn't false-fail before the implementation lands.

Clarifications needed

Should the per-source eval be a hard gate (block bulk-embed until run) or a soft default (operator can opt out for low-stakes sources)? Lean: hard gate for sources >10K docs, soft for <10K.
When this eval contradicts the predicted-best in `https://docs.usejosh.com/operations/chunking/` (e.g., predicted markdown-header for CRS, reality was fixed_token_512), what's the update protocol for chunking.html? Lean: append observed-best column, never silently rewrite predictions.
How often to re-run? Bumping the embedder model (e.g., mxbai-v2 ships) triggers a v2 across all sources. Beyond that, no scheduled re-runs — only triggered by 'we want to change something'.

Out of scope

The pre-bulk-embed *sanity gate* (chunker p95/max bounds, 10-query smoke). That's `substrate-embedding-evaluation`. This spec is the *formal* eval that picks the chunker; the gate is the safety-net that catches bugs.
Sources without bulk-embed in the v1 ingest roster (members, committees, votes, sponsorships, regulations.gov dockets, LDA filings). These are structured/metadata sources, retrieval is keyword/SQL, not vector.
Production-time embedding observability. Covered by `embedding-observability`.
Sources with weak structural signal (CRS — v1 frozen; GAO; CBO; SAPs). These ship on the production baseline (Arctic-L × fixed_token_512_50) without Phase 2 — the eval cost isn't worth running when the expected outcome is "baseline wins."

Dependencies

Plan

Single Python driver bin/per-source-eval.py --source=<src> that
reuses the embedding-ablation harness from private/experiments/ embedding-ablation/ (chunk_corpus.py, embed_chunks.py, retrieve.py,
score.py, prep_labels.py) and runs against Modal. Per-source inputs
the driver needs:

1. A *corpus sampler* that produces a stratified 500-1000-doc sample
from the live substrate (or the source's HTTP feed if pre-ingest).
Each source already has a sampling rule sketched in
substrate-embedding-evaluation.plan — promote those to actual code.

2. A *queries/v1.yaml* committed under josh-eval/retrieval/ eval-sets/<source>/v1/queries/. Authored the same way as CRS:
8 hand-authored + 12-17 subagent-generated from a stratified 25-doc
read, deduped, eyeballed. Distribution biased toward the source's
realistic question shape (e.g., heavier citation queries for USC,
heavier authority queries for committee reports).

3. A *cell matrix*. Default cells per source:

- mxbai-embed-large-v1 × fixed_token_512_50 (cross-corpus baseline)
- mxbai-embed-large-v1 × <source's predicted-best>
- nomic-embed-text-v1.5 × fixed_token_512_50
- mpnet-base-v2 × fixed_token_512_50 (cheap-tier control)
- BM25 whole-doc (sparse control)
- 1-2 source-specific exploration cells (e.g., 1024-token for long docs, speaker-turn for transcripts)

Skip arctic-l-v2 entirely — CRS showed it's not competitive on this
corpus shape and it's 2× the inference cost of mxbai.

4. A *judge prompt* — the same judge-v1.md from the CRS eval works
across sources without modification. Commit a copy alongside each
source's eval set so the eval is self-contained.

Sizing: ~10 cells × ~30 docs per query top-K = ~150-300 pool docs per
query × 25 queries ≈ 3,000-7,000 pairs to label. One pass of
Claude-Sonnet subagents (~30-70 batches of 100 pairs). Skip the
self-consistency re-judge for v1 (note in known-weaknesses).

Cost budget per source: ~$2 Modal GPU (15-30 min L4), <$1 CPU,
~$2-3 in subagent labeling. ~$5 total per source. Operator time:
~2 hours of authoring + reviewing + kickoff.

Sources in scope (have prose body, get bulk-embedded):

- crs_reports (v1 done — mxbai × fixed_token_512_50, NDCG 0.6911)
- bills
- public_laws
- us_code
- federal_register (rules, NPRMs, EOs)
- ecfr_and_cfr
- hearings (metadata only — skip; transcripts below)
- hearing_transcripts
- congressional_record
- committee_reports
- gao_reports
- cbo_cost_estimates
- statements_of_administration_policy

Sources skipped (structured/metadata, no body to embed):
legislators-and-committees, roll-call-votes, regulations-dot-gov-dockets,
lda-filings, topic-taxonomy, staff-directories.

Trigger order matches the bulk-embed order in the source roster.

## Known weaknesses (declared, not fixed)

The Phase 1 cross-source bake-off ([private/experiments/embedding-ablation/ phase1-design.md]) and the CRS v1 ablation produced the (model, chunker)
default this spec ratifies per source. Both have honest weaknesses that
this spec inherits — surfacing them so a future contributor (or future-me)
doesn't read the eval and conclude more than the data supports.

1. No structural chunkers in Phase 1. The cross-source bake-off
fixed the chunker at fixed_token_512_50 across all 9 sources by
design — to isolate the model variable. The structural-chunker
priors in https://docs.usejosh.com/operations/chunking/ (section-aware for
FR/bills/USC, speaker-turn for hearings/Congressional Record) are
untested. Phase 2 *is* the test. The risk this spec is built to
catch: a model that wins Phase 1 with naïve chunking might lose
Phase 2 with structural chunking on a structurally-rich source.

2. LLM-as-judge bias is reduced, not eliminated. Phase 1 used
Sonnet 4.6 at temp 0 with a 10% Opus dual-judge spot-check
(weighted κ 0.65, 88.7% binary agreement, disagreement symmetric
and model-independent). That's solid for *relative* ranking
between cells, weaker as an absolute number. A future Phase 3
with human spot-checks would tighten the absolute claims; not in
scope for v1.

3. 25 queries per source has wide per-source CIs. Bootstrap
CIs on per-source NDCG@10 are wider than the cross-source
aggregate's CI (which pools ~250 effective queries). This means
per-source comparisons inside Phase 2 have less statistical
power than the Phase 1 model pick. The 0.02 NDCG win threshold
was tuned for the CRS 50-query set; for 25-query Phase 2 runs
it's directionally fine but operators should not treat 0.021 vs
0.019 as a meaningful gap.

4. No hybrid retrieval cell in Phase 1. BM25 was tested whole-doc
as a sparse control, not as part of a fusion. On citation-heavy
corpora (USC especially), hybrid (sparse + dense rerank) may
dominate either pure approach. This spec's per-source eval should
add a hybrid arm for USC and public-laws specifically — citation
anchors are the wedge where BM25 closes or wins. If hybrid wins by
≥0.02 NDCG on USC, the production default for that source is
hybrid, not pure-dense. Other corpora (FR, hearings) are less
citation-dense; hybrid is lower priority there.

5. No end-to-end test on the production substrate. Phase 1 ran
on Modal harness with temp SQLite files; the production SQLite +
sqlite-vec + FTS5 + the chosen model has never been benchmarked
end-to-end at scale. Numbers in production may diverge from
numbers in the eval harness for non-retrieval reasons (vec0
quantization, index recall, FTS5 weighting). The pre-bulk gate
(substrate-embedding-evaluation) is the cheap check; a full
end-to-end retrieval benchmark against the live substrate is
a follow-on concern.

6. CBO is missing from Phase 1. DataDome wall blocked the
scraper; CBO never entered the cross-source bake-off. The
fallback when its ingester lands is to run Phase 2 against
whatever sample size the scraper can pull (likely small) and
pick recursive_v1 as the production chunker on prior, with the
eval as confirmation rather than tiebreaker.

7. Query sets are mostly LLM-generated. 8 hand-authored + 12-17
subagent-generated per source, deduped + eyeballed. This means
the eval reflects what an LLM thinks a policy professional would
ask, not real user traffic. Cite carefully in any external
writeup. The fix is harvesting real queries once Step 2 traffic
exists; not viable for the substrate-launch eval.

8. fixed_token_512_50 may handicap long-context models. mxbai
is 512-max-seq, so 512 is its natural fit. Arctic-L (8192) and
nomic (8192) were forced into 512-token windows. Phase 2's
fixed_token_1024_100 cell tests this for 8192-max-seq models —
a win there would suggest mxbai's tie with Arctic-L in Phase 1
overstated mxbai's quality (it was running at its ceiling, Arctic
wasn't). One reason the Arctic-L tiebreaker pick was right
irrespective of Phase 1's near-tie verdict.

9. Arctic-M was never benchmarked. The original production-planned
default crashed four times on Modal L4 with a CUDA gather-kernel
bug in the GTE custom modeling code. Excluded from Phase 1 results.
Re-test would be valuable if the upstream bug ever clears, but
none of this spec's per-source decisions hinge on Arctic-M.

None of these block running the per-source eval as scoped. They're
the honest perimeter of what the result can claim.

Tasks

0 of 14 done.

Changelog

2026-05-27T00:30:00Z draft→draft Promoted tasks from [STUB] to actionable. Restructured into: scaffolding-PR shape (t1-t5: runner promotion, driver, eval-set schema, immutability rule, data-status row), Tier 4 per-source PRs (t6-t10: FR, bills, USC, public laws, hearing transcripts), Tier 7 deferred per-source PRs (t11-t13: CR reuses transcripts, committee reports, eCFR reuses FR), and pre-launch verification (t14: re-run on production substrate). Also: - Acceptance criteria updated: baseline changed from `mxbai × fixed_token_512_50` (CRS winner) to `Arctic-L × fixed_token_512_50` (Phase 1 cross-source winner, per https://docs.usejosh.com/operations/embedding-architecture/#model-choice). - New acceptance criterion: USC and public-laws MUST include a hybrid (BM25 + dense rerank) cell — citation-heavy corpora are the wedge where hybrid may dominate. - New acceptance criterion: Phase 2 work runs in parallel with host provisioning (Modal-only, no host dependency). - Out-of-scope narrowed: sources with weak structural signal (CRS — v1 frozen; GAO; CBO; SAPs) ship on production baseline without Phase 2. - Removed "hybrid retrieval is out of scope" from out_of_scope — it's now in scope as a specific arm for USC/public-laws.
2026-05-27T00:00:00Z draft→draft Added "Known weaknesses (declared, not fixed)" section to the plan body. Nine honest perimeter notes inherited from Phase 1 + the CRS v1 ablation: no structural chunkers in Phase 1, LLM-judge bias reduced not eliminated, 25-query per-source CI width, no hybrid cell, no end-to-end production benchmark, CBO missing from Phase 1, LLM-generated queries, fixed_token_512_50 may handicap long-context models (one reason the Arctic-L tiebreaker pick was right), and Arctic-M unbenchmarked. None block running the eval as scoped. Spec remains draft.