Per-source retrieval-eval confirmation
Header
Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.
Why
The substrate has a default (model × chunker) pair locked from the CRS
embedding ablation (mxbai-embed-large-v1 + fixed_token_512_50), but the
win there was 0.026 NDCG — just over the 0.02 noise floor for a 50-query
set. Other corpora (bills, Federal Register, hearings, U.S. Code, etc.)
have meaningfully different shape: heavier structure, longer documents,
more citations, transcript-style turn structure. The CRS-winning combo
may underperform on those.
This spec defines a *light* per-source retrieval-eval that runs once
before a source's first bulk-embed, using the same harness as the CRS
ablation but scoped to fit a $2-5 Modal budget and an evening of
operator time. It is the formal NDCG/Recall/MRR eval thesubstrate-embedding-evaluation (pre-bulk gate) is explicitly NOT —
the gate catches degenerate chunkers; this eval picks the chunker.
User stories
As a substrate operator, I want a one-command per-source eval that confirms or overrides the default (model, chunker) before bulk embed so that I never lock a source into a config that beats the default on CRS but loses by 0.05 NDCG on this source's shape.
As a substrate maintainer, I want an immutable eval artifact per source under josh-eval/retrieval/eval-sets/<source>/v1/ so that future model swaps have a baseline to beat, source by source.
As a contributor proposing a chunker change for an existing source, I want a documented protocol for re-running the eval at v2 so that chunker improvements are evidence-driven, not vibes-driven.
Acceptance criteria (EARS)
- Before a source's first bulk-embed run, the system shall produce a per-source eval result at `josh-eval/retrieval/eval-sets/<source>/v1/results.html` showing NDCG@10, Recall@10, MRR for at minimum 6 (model × chunker) cells plus BM25, against ≥25 labeled queries.
- When the eval runs, the system shall include `Snowflake/snowflake-arctic-embed-l-v2.0 × fixed_token_512_50` as the production baseline (the Phase 1 winner per `https://docs.usejosh.com/operations/embedding-architecture/#model-choice`).
- When the eval runs against a source with predicted-best structural chunker per `https://docs.usejosh.com/operations/chunking/`, the system shall include at minimum one structural cell (e.g., `section_aware_fr_v1` for FR, `section_aware_bills_v1` for bills, `usc_uslm_section_v1` for USC/public-laws, `speaker_turn_v1` for hearings/Congressional Record).
- When the eval runs against US Code or public laws, the system shall include a hybrid (BM25 + dense rerank) cell as a third arm — citation-heavy corpora are the wedge where hybrid may dominate either pure approach.
- When a cell beats the production baseline (Arctic-L × fixed_token_512_50) by ≥0.02 NDCG@10 with ≤0.02 Recall@10 regression on this source's eval set, the system shall flag it as a candidate replacement; the operator decides whether to deploy the replacement or the default.
- When no cell beats the production baseline, the system shall deploy the baseline as this source's live (model, chunker) and record the decision in `eval-sets/<source>/v1/decision.json`.
- The per-source eval set (queries + labels + judge prompt + decision) shall be committed to the tracked repo and immutable once labeled; query or label changes force a v2 with re-baselining.
- While the OVHcloud bare-metal host (`substrate-bare-metal-host`) is being provisioned, the eval harness shall run independently against Modal — Phase 2 work is NOT blocked on host readiness.
Success determiner
Checklist
- bin/per-source-eval.py exists and is executable
- For at least one source past CRS, `josh-eval/retrieval/eval-sets/<source>/v1/results.html` exists with cell rankings
- decision.json names the chosen (model, chunker) for that source
- Eval set queries and labels live in the tracked tree, not just on a developer's laptop
Determiner upgrades to bash once `bin/per-source-eval.py` lands. At that point the determiner is `bin/per-source-eval.py --source=<src> --dry-run` with `expect: "decision recorded"`. Keeping it `manual` while the spec is draft so the determiner doesn't false-fail before the implementation lands.
Clarifications needed
- Should the per-source eval be a hard gate (block bulk-embed until run) or a soft default (operator can opt out for low-stakes sources)? Lean: hard gate for sources >10K docs, soft for <10K.
- When this eval contradicts the predicted-best in `https://docs.usejosh.com/operations/chunking/` (e.g., predicted markdown-header for CRS, reality was fixed_token_512), what's the update protocol for chunking.html? Lean: append observed-best column, never silently rewrite predictions.
- How often to re-run? Bumping the embedder model (e.g., mxbai-v2 ships) triggers a v2 across all sources. Beyond that, no scheduled re-runs — only triggered by 'we want to change something'.
Out of scope
- The pre-bulk-embed *sanity gate* (chunker p95/max bounds, 10-query smoke). That's `substrate-embedding-evaluation`. This spec is the *formal* eval that picks the chunker; the gate is the safety-net that catches bugs.
- Sources without bulk-embed in the v1 ingest roster (members, committees, votes, sponsorships, regulations.gov dockets, LDA filings). These are structured/metadata sources, retrieval is keyword/SQL, not vector.
- Production-time embedding observability. Covered by `embedding-observability`.
- Sources with weak structural signal (CRS — v1 frozen; GAO; CBO; SAPs). These ship on the production baseline (Arctic-L × fixed_token_512_50) without Phase 2 — the eval cost isn't worth running when the expected outcome is "baseline wins."
Dependencies
Plan
Single Python driver bin/per-source-eval.py --source=<src> that
reuses the embedding-ablation harness from private/experiments/ (chunk_corpus.py, embed_chunks.py, retrieve.py,
embedding-ablation/
score.py, prep_labels.py) and runs against Modal. Per-source inputs
the driver needs:
1. A *corpus sampler* that produces a stratified 500-1000-doc sample
from the live substrate (or the source's HTTP feed if pre-ingest).
Each source already has a sampling rule sketched in
substrate-embedding-evaluation.plan — promote those to actual code.
2. A *queries/v1.yaml* committed under josh-eval/retrieval/. Authored the same way as CRS:
eval-sets/<source>/v1/queries/
8 hand-authored + 12-17 subagent-generated from a stratified 25-doc
read, deduped, eyeballed. Distribution biased toward the source's
realistic question shape (e.g., heavier citation queries for USC,
heavier authority queries for committee reports).
3. A *cell matrix*. Default cells per source:
- mxbai-embed-large-v1 × fixed_token_512_50 (cross-corpus baseline)
- mxbai-embed-large-v1 × <source's predicted-best>
- nomic-embed-text-v1.5 × fixed_token_512_50
- mpnet-base-v2 × fixed_token_512_50 (cheap-tier control)
- BM25 whole-doc (sparse control)
- 1-2 source-specific exploration cells (e.g., 1024-token for long docs, speaker-turn for transcripts)
Skip arctic-l-v2 entirely — CRS showed it's not competitive on this
corpus shape and it's 2× the inference cost of mxbai.
4. A *judge prompt* — the same judge-v1.md from the CRS eval works
across sources without modification. Commit a copy alongside each
source's eval set so the eval is self-contained.
Sizing: ~10 cells × ~30 docs per query top-K = ~150-300 pool docs per
query × 25 queries ≈ 3,000-7,000 pairs to label. One pass of
Claude-Sonnet subagents (~30-70 batches of 100 pairs). Skip the
self-consistency re-judge for v1 (note in known-weaknesses).
Cost budget per source: ~$2 Modal GPU (15-30 min L4), <$1 CPU,
~$2-3 in subagent labeling. ~$5 total per source. Operator time:
~2 hours of authoring + reviewing + kickoff.
Sources in scope (have prose body, get bulk-embedded):
- crs_reports (v1 done — mxbai × fixed_token_512_50, NDCG 0.6911)
- bills
- public_laws
- us_code
- federal_register (rules, NPRMs, EOs)
- ecfr_and_cfr
- hearings (metadata only — skip; transcripts below)
- hearing_transcripts
- congressional_record
- committee_reports
- gao_reports
- cbo_cost_estimates
- statements_of_administration_policy
Sources skipped (structured/metadata, no body to embed):
legislators-and-committees, roll-call-votes, regulations-dot-gov-dockets,
lda-filings, topic-taxonomy, staff-directories.
Trigger order matches the bulk-embed order in the source roster.
## Known weaknesses (declared, not fixed)
The Phase 1 cross-source bake-off ([private/experiments/embedding-ablation/]) and the CRS v1 ablation produced the (model, chunker)
phase1-design.md
default this spec ratifies per source. Both have honest weaknesses that
this spec inherits — surfacing them so a future contributor (or future-me)
doesn't read the eval and conclude more than the data supports.
1. No structural chunkers in Phase 1. The cross-source bake-off
fixed the chunker at fixed_token_512_50 across all 9 sources by
design — to isolate the model variable. The structural-chunker
priors in https://docs.usejosh.com/operations/chunking/ (section-aware for
FR/bills/USC, speaker-turn for hearings/Congressional Record) are
untested. Phase 2 *is* the test. The risk this spec is built to
catch: a model that wins Phase 1 with naïve chunking might lose
Phase 2 with structural chunking on a structurally-rich source.
2. LLM-as-judge bias is reduced, not eliminated. Phase 1 used
Sonnet 4.6 at temp 0 with a 10% Opus dual-judge spot-check
(weighted κ 0.65, 88.7% binary agreement, disagreement symmetric
and model-independent). That's solid for *relative* ranking
between cells, weaker as an absolute number. A future Phase 3
with human spot-checks would tighten the absolute claims; not in
scope for v1.
3. 25 queries per source has wide per-source CIs. Bootstrap
CIs on per-source NDCG@10 are wider than the cross-source
aggregate's CI (which pools ~250 effective queries). This means
per-source comparisons inside Phase 2 have less statistical
power than the Phase 1 model pick. The 0.02 NDCG win threshold
was tuned for the CRS 50-query set; for 25-query Phase 2 runs
it's directionally fine but operators should not treat 0.021 vs
0.019 as a meaningful gap.
4. No hybrid retrieval cell in Phase 1. BM25 was tested whole-doc
as a sparse control, not as part of a fusion. On citation-heavy
corpora (USC especially), hybrid (sparse + dense rerank) may
dominate either pure approach. This spec's per-source eval should
add a hybrid arm for USC and public-laws specifically — citation
anchors are the wedge where BM25 closes or wins. If hybrid wins by
≥0.02 NDCG on USC, the production default for that source is
hybrid, not pure-dense. Other corpora (FR, hearings) are less
citation-dense; hybrid is lower priority there.
5. No end-to-end test on the production substrate. Phase 1 ran
on Modal harness with temp SQLite files; the production SQLite +
sqlite-vec + FTS5 + the chosen model has never been benchmarked
end-to-end at scale. Numbers in production may diverge from
numbers in the eval harness for non-retrieval reasons (vec0
quantization, index recall, FTS5 weighting). The pre-bulk gate
(substrate-embedding-evaluation) is the cheap check; a full
end-to-end retrieval benchmark against the live substrate is
a follow-on concern.
6. CBO is missing from Phase 1. DataDome wall blocked the
scraper; CBO never entered the cross-source bake-off. The
fallback when its ingester lands is to run Phase 2 against
whatever sample size the scraper can pull (likely small) and
pick recursive_v1 as the production chunker on prior, with the
eval as confirmation rather than tiebreaker.
7. Query sets are mostly LLM-generated. 8 hand-authored + 12-17
subagent-generated per source, deduped + eyeballed. This means
the eval reflects what an LLM thinks a policy professional would
ask, not real user traffic. Cite carefully in any external
writeup. The fix is harvesting real queries once Step 2 traffic
exists; not viable for the substrate-launch eval.
8. fixed_token_512_50 may handicap long-context models. mxbai
is 512-max-seq, so 512 is its natural fit. Arctic-L (8192) and
nomic (8192) were forced into 512-token windows. Phase 2's
fixed_token_1024_100 cell tests this for 8192-max-seq models —
a win there would suggest mxbai's tie with Arctic-L in Phase 1
overstated mxbai's quality (it was running at its ceiling, Arctic
wasn't). One reason the Arctic-L tiebreaker pick was right
irrespective of Phase 1's near-tie verdict.
9. Arctic-M was never benchmarked. The original production-planned
default crashed four times on Modal L4 with a CUDA gather-kernel
bug in the GTE custom modeling code. Excluded from Phase 1 results.
Re-test would be valuable if the upstream bug ever clears, but
none of this spec's per-source decisions hinge on Arctic-M.
None of these block running the per-source eval as scoped. They're
the honest perimeter of what the result can claim.
Tasks
0 of 14 done.
- t1 Promote chunk_corpus.py / embed_chunks.py / retrieve.py / score.py / prep_labels.py from private/experiments/embedding-ablation/ to a tracked module under josh-eval/retrieval/runner/. Pin Modal versions, add a unit test for each module against a fixture corpus.
- t2 bin/per-source-eval.py — driver that orchestrates the runner module against a named source. Flags: --source, --corpus-path, --eval-set-path, --cells, --modal, --dry-run, --post-bulk.
- t3 Eval-set schema: queries.yaml (8 hand + 12-17 subagent), labels.jsonl, judge-v1.md (copy from CRS eval), decision.json (chosen model+chunker+rationale). Schema documented in https://docs.usejosh.com/operations/eval-architecture/.
- t4 Eval-set immutability rule documented in https://docs.usejosh.com/operations/eval-architecture/: v1 frozen on commit, queries/labels are append-only, changes force a v2 with full re-baseline.
- t5 decision.json schema + per-source row in https://docs.usejosh.com/data-status/ showing chosen (model, chunker). Surfaces eval status in the existing storage-and-chunking table.
- t6 Federal Register v1 eval — write sample_federal_register.py corpus sampler, implement section_aware_fr_v1 chunker (XML HD-driven), author queries/v1.yaml, run eval against Modal, write decision.json. Spec: bills-USC-FR get section-aware; FR is the first run past CRS.
- t7 Bills v1 eval — write sample_bills.py corpus sampler, implement section_aware_bills_v1 chunker (USLM <section>), author queries/v1.yaml, run eval, write decision.json. Establishes the USLM walker as a shared module reused by USC and public-laws.
- t8 U.S. Code v1 eval — write sample_us_code.py corpus sampler, implement usc_uslm_section_v1 chunker (reuses USLM walker), include a hybrid (BM25 + dense rerank) cell as required by acceptance criterion 4 — citation-heavy corpus, hybrid is the wedge.
- t9 Public laws v1 eval — reuses USC's usc_uslm_section_v1 chunker without new chunker code; decision.json explicitly notes the reuse and the USC eval result.
- t10 Hearing transcripts v1 eval — write sample_hearing_transcripts.py corpus sampler, implement speaker_turn_v1 chunker (one chunk per speaker turn, recursive split on >2048-token turns), author queries/v1.yaml (heavier on quote-attribution patterns), run eval.
- t11 Congressional Record v1 eval — reuses speaker_turn_v1 chunker from t10 without new chunker code; decision.json explicitly notes the reuse and the hearing-transcripts eval result.
- t12 Committee reports v1 eval — implement section_aware_committee_v1 chunker (CRPT XML), run eval. Lower priority — committee reports are less structurally rich than FR/bills/USC.
- t13 eCFR v1 eval — likely reuses section_aware_fr_v1 chunker; decision.json notes the reuse and the FR eval result.
- t14 Pre-OSS-launch: re-run all v1 evals against Arctic-L locally (Phase 1 used Modal) to confirm production-substrate numbers match Modal numbers within noise. If they diverge meaningfully, that's a real finding worth investigating before launch.
Changelog
-
2026-05-27T00:30:00Z
draft→draftPromoted tasks from [STUB] to actionable. Restructured into: scaffolding-PR shape (t1-t5: runner promotion, driver, eval-set schema, immutability rule, data-status row), Tier 4 per-source PRs (t6-t10: FR, bills, USC, public laws, hearing transcripts), Tier 7 deferred per-source PRs (t11-t13: CR reuses transcripts, committee reports, eCFR reuses FR), and pre-launch verification (t14: re-run on production substrate). Also: - Acceptance criteria updated: baseline changed from `mxbai × fixed_token_512_50` (CRS winner) to `Arctic-L × fixed_token_512_50` (Phase 1 cross-source winner, per https://docs.usejosh.com/operations/embedding-architecture/#model-choice). - New acceptance criterion: USC and public-laws MUST include a hybrid (BM25 + dense rerank) cell — citation-heavy corpora are the wedge where hybrid may dominate. - New acceptance criterion: Phase 2 work runs in parallel with host provisioning (Modal-only, no host dependency). - Out-of-scope narrowed: sources with weak structural signal (CRS — v1 frozen; GAO; CBO; SAPs) ship on production baseline without Phase 2. - Removed "hybrid retrieval is out of scope" from out_of_scope — it's now in scope as a specific arm for USC/public-laws. -
2026-05-27T00:00:00Z
draft→draftAdded "Known weaknesses (declared, not fixed)" section to the plan body. Nine honest perimeter notes inherited from Phase 1 + the CRS v1 ablation: no structural chunkers in Phase 1, LLM-judge bias reduced not eliminated, 25-query per-source CI width, no hybrid cell, no end-to-end production benchmark, CBO missing from Phase 1, LLM-generated queries, fixed_token_512_50 may handicap long-context models (one reason the Arctic-L tiebreaker pick was right), and Arctic-M unbenchmarked. None block running the eval as scoped. Spec remains draft.