substrate · draft · p0

Pre-bulk embedding evaluation gate

substrate-embedding-evaluation · updated — · owner rritz

Why

Before the embedder is turned loose at full concurrency on a source's
full corpus, we need a cheap, reproducible smoke test that confirms
(a) the chunker isn't producing degenerate output, (b) the embedder
model's runtime maximum sequence length is no smaller than the hard
ceiling the chunker is sized against, (c) the embedder hits expected
throughput on the configured provider, and (d) 10 hand-authored sanity
queries return plausible top-5s. Without this gate, a bad chunker
config or a silently truncating embedder ships thousands to millions
of useless vectors before anyone notices, and fixing it means
re-embedding the corpus.

Triggering case: the 23K CRS reports re-load on the new OVHcloud host
(per crs-reports-ingester t10) is the first source to hit the bulk
embedder. We need this gate in place before that run, and we want it
to be the same gate every subsequent source uses.

User stories

As a substrate operator, I want a single command that says `GATE: PASS` or `GATE: FAIL` before bulk embed so that I never run the embedder for hours just to find out the chunker p95 is 8K tokens.

As someone evaluating a chunker change, I want the gate to also catch retrieval-quality regressions on a 10-query smoke so that I can land a chunker swap with at least eyeball-level confidence before re-embedding.

As a future contributor, I want an artifact directory I can diff between runs so that chunker-config decisions are reviewable in PRs, not buried in chat.

Acceptance criteria (EARS)

STUB Before a source's first bulk-embed run at full concurrency, the system shall run `bin/embed-eval-gate.py --source=<source>` against a stratified ~200-doc sample (sampling rule documented per source) and produce a pre-bulk report under `/data/eval-gate/<source>/<timestamp>/`.
STUB When the gate runs, the system shall record the embedder's runtime `max_seq_length` and refuse to PASS if the chunker's hard ceiling exceeds it (truncation would silently corrupt the bulk run).
STUB Where the smoke pass produces chunks with p95 > 2048 tokens, mean > 1024 tokens, median < 32 tokens, or any single chunk > 4096 tokens, the system shall emit `GATE: FAIL` and exit non-zero.
STUB When the gate runs, the system shall execute 10 hand-authored sanity queries (per source, version-controlled under `josh-eval/retrieval/eval-sets/<source>/sanity-queries.yaml`) against the smoke index and write top-5 results + chunk text to `pre-bulk-top5.jsonl` for human eyeballing.
STUB When the bulk run completes, the system shall re-execute the same 10 sanity queries against the full index and confirm top-5 stability (≥80% Jaccard overlap with the smoke top-5 per query, averaged across the 10); below-threshold queries are reported but do not auto-fail (judgment call for the operator).
STUB While the gate is running, if the embedder provider raises `ProviderConfigError`, the system shall emit `GATE: FAIL` immediately without consuming the rest of the sample (config errors don't fix themselves).
STUB When the gate emits `GATE: PASS`, the system shall write a `gate-pass.json` artifact (sample size, chunker_id+config, model_id+version, runtime max_seq_length, throughput chunks/s, top-5 query log path, observer hash) — this artifact is what authorizes the bulk run.
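
The post-bulk stability criterion above (average Jaccard of at least 80% against the smoke top-5, with below-threshold queries reported rather than failed) could be computed roughly as follows. This is a minimal sketch, not the gate's implementation: the function names, the `query id -> list of chunk ids` input shape, and the report keys are all assumptions made for illustration.

```python
from __future__ import annotations

from statistics import mean

THRESHOLD = 0.80  # from the acceptance criterion; everything else here is hypothetical


def jaccard(a: list[str], b: list[str]) -> float:
    """Jaccard similarity of two result lists, treated as sets of chunk ids."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # two empty result sets are treated as identical
    return len(sa & sb) / len(sa | sb)


def stability_report(pre: dict[str, list[str]], post: dict[str, list[str]]) -> dict:
    """Compare pre-bulk and post-bulk top-5 chunk ids, keyed by query id.

    Queries missing from `post` score 0.0. Below-threshold queries are
    listed for the operator but do not change any exit status.
    """
    per_query = {qid: jaccard(ids, post.get(qid, [])) for qid, ids in pre.items()}
    return {
        "mean_jaccard": mean(per_query.values()),
        "below_threshold": sorted(q for q, s in per_query.items() if s < THRESHOLD),
        "per_query": per_query,
    }
```

One arithmetic consequence worth knowing when reading the report: for two 5-item result sets, a single swapped result already gives 4/6 ≈ 0.67, so at top-5 any per-query change falls below 0.8. Only the average across the 10 queries has room to absorb drift.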

Success determiner

kind: manual

Checklist

bin/embed-eval-gate.py exists and is executable
Running `bin/embed-eval-gate.py --source=crs_reports --dry-run` against the OVHcloud substrate emits GATE: PASS or GATE: FAIL with reasons
GATE: PASS run produces gate-pass.json with all fields populated
GATE: FAIL on a deliberately-degenerate chunker config (chunk_size=50000) blocks correctly
Sanity-query top-5 log is human-readable (chunk text + heading_path + score per result)

Determiner upgrades to bash once `bin/embed-eval-gate.py` lands. At that point the kind becomes `bash` and the command is `bin/embed-eval-gate.py --source=crs_reports --sample-size=200 --dry-run` with `expect: "GATE: PASS"`. Keeping it `manual` while the spec is `draft` so the determiner doesn't false-fail before the implementation lands.

Clarifications needed

Should the gate run in CI on every chunker-module change, or only on operator demand? CI would catch regressions but adds ~minutes to `uv run poe ci`.
10 sanity queries per source — author them upfront for CRS, or harvest from real agent traffic once Step 2 is live? Lean upfront for CRS (no traffic exists yet).
Where do post-bulk Jaccard stability numbers go? Lean toward writing them to `gate-post-bulk.json` alongside `gate-pass.json` and surfacing in `https://docs.usejosh.com/data-status/` per source.
Whether to gate the production cron's incremental embeds the same way. Probably not: each incremental run adds only tens of chunks, while the gate is sized for thousands.

Out of scope

Full retrieval eval with NDCG@10 / Recall@10 / MRR — that's the chunker comparison harness in `josh-eval/retrieval/runner.py` per `eval-architecture.html`. The pre-bulk gate is a sanity check, not the formal eval.
Production-time embedding observability — covered by `embedding-observability`.
Per-chunker-strategy A/B comparison — separate concern; once the gate passes for a source, the chosen chunker is "live" and alternatives are evaluated through the comparison harness.
Re-running the gate on every nightly incremental ingest (gate is for first-bulk and chunker-change events, not steady state).

Dependencies

Plan

Single Python script at bin/embed-eval-gate.py. Reads source name,
loads ~200 stratified docs from the named source's parsed-records
state, runs the configured chunker, embeds via the live provider into
a temp SQLite file (the gate never touches the production substrate),
then runs the per-source sanity-query yaml against the temp index.
Writes artifacts under /data/eval-gate/<source>/<utc-timestamp>/.
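
As an illustration of the artifact step, the sketch below builds the per-run directory and writes `gate-pass.json`. The field list follows the acceptance criterion for `GATE: PASS`, but the exact key names, the timestamp format, and both function names are assumptions; the spec does not fix any of them.

```python
from __future__ import annotations

import json
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical key names for the fields the spec lists for gate-pass.json.
REQUIRED_FIELDS = (
    "sample_size", "chunker_id", "chunker_config", "model_id", "model_version",
    "max_seq_length", "throughput_chunks_per_s", "top5_log_path", "observer_hash",
)


def artifact_dir(root: Path, source: str, now: datetime | None = None) -> Path:
    """Return <root>/<source>/<utc-timestamp>; the timestamp format is assumed."""
    now = now or datetime.now(timezone.utc)
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return root / source / stamp


def write_gate_pass(out_dir: Path, fields: dict) -> Path:
    """Write gate-pass.json, refusing if any required field is absent or None."""
    missing = sorted(k for k in REQUIRED_FIELDS if fields.get(k) is None)
    if missing:
        raise ValueError(f"gate-pass.json fields not populated: {missing}")
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / "gate-pass.json"
    # Sorted keys keep the file stable so runs can be diffed in review.
    path.write_text(json.dumps(fields, indent=2, sort_keys=True) + "\n")
    return path
```

Sorted keys and a fixed-width UTC timestamp are chosen here only because they make run directories sort chronologically and keep diffs between runs small, which fits the "artifact directory I can diff" user story.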

Stratification rule for CRS: sample weighted by type prefix proportion
in crs_reports — roughly (R, RL, RS, IF, LSB, IN, balance) ≈
(60, 35, 25, 30, 15, 25, 10) for a 200-doc sample, capped at the
available rows per type. Sampling rule for other sources is added when
each source's first bulk-embed approaches.
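
The CRS stratification rule can be sketched as a capped per-type draw. The targets are the ones quoted above; the function name, the seeded RNG, and the convention that the caller buckets all remaining type prefixes under a `"balance"` key are assumptions for illustration. The sketch does not redistribute a shortfall when a type has fewer rows than its target, since the spec only says the draw is capped.

```python
from __future__ import annotations

import random

# Per-type targets for a 200-doc CRS sample, as given in the plan.
CRS_TARGETS = {"R": 60, "RL": 35, "RS": 25, "IF": 30, "LSB": 15, "IN": 25, "balance": 10}


def stratified_sample(
    ids_by_type: dict[str, list[str]],
    targets: dict[str, int],
    seed: int = 0,
) -> list[str]:
    """Draw up to `targets[t]` doc ids per type, capped at the rows available."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    picked: list[str] = []
    for doc_type, want in targets.items():
        # Sort first so the draw does not depend on the caller's row order.
        pool = sorted(ids_by_type.get(doc_type, []))
        picked.extend(rng.sample(pool, min(want, len(pool))))
    return picked
```

With a fixed seed and sorted pools, two runs over the same parsed-records state pick the same documents, which keeps chunk statistics comparable between gate runs.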

Sanity-query authoring rule: 10 queries per source covering (1)
citation-following ("CRS analysis on 42 USC 1396a"), (2) topical
("Veterans Affairs life insurance coverage"), (3) recency ("most
recent CRS report on AI export controls"), (4) authority ("CRS
guidance on Endangered Species Act"), and (5) spot ("Electoral
College FAQ"). Authored once, version-controlled, immutable per
substrate-version pin (per eval-architecture.html).

Block thresholds picked deliberately to be loose — the gate exists to
catch screw-ups, not to gate quality. Quality calls happen in the
formal retrieval harness. False positives here cost operator time;
false negatives cost embedded-vector hours.
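
For reference, the block thresholds from the acceptance criteria reduce to four comparisons over the per-chunk token counts. The sketch below is one way to express them; the function names, the nearest-rank percentile method, and the "no chunks produced" failure are assumptions, not something the spec prescribes.

```python
from __future__ import annotations

from statistics import mean, median

# Block thresholds, in tokens, from the acceptance criteria.
P95_MAX, MEAN_MAX, MEDIAN_MIN, SINGLE_MAX = 2048, 1024, 32, 4096


def p95_nearest_rank(values: list[int]) -> int:
    """95th percentile by the nearest-rank method (integer math, no interpolation)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def chunk_size_failures(token_counts: list[int]) -> list[str]:
    """Return human-readable reasons the sample breaches a threshold (empty = ok)."""
    if not token_counts:
        return ["no chunks produced"]  # assumed: an empty sample is also a failure
    reasons = []
    p95 = p95_nearest_rank(token_counts)
    if p95 > P95_MAX:
        reasons.append(f"p95 {p95} > {P95_MAX} tokens")
    if mean(token_counts) > MEAN_MAX:
        reasons.append(f"mean {mean(token_counts):.0f} > {MEAN_MAX} tokens")
    if median(token_counts) < MEDIAN_MIN:
        reasons.append(f"median {median(token_counts)} < {MEDIAN_MIN} tokens")
    if max(token_counts) > SINGLE_MAX:
        reasons.append(f"largest chunk {max(token_counts)} > {SINGLE_MAX} tokens")
    return reasons


def verdict(reasons: list[str]) -> tuple[str, int]:
    """Map failure reasons to the gate line and process exit code."""
    return ("GATE: PASS", 0) if not reasons else ("GATE: FAIL", 1)
```

A deliberately degenerate config such as the checklist's `chunk_size=50000` would trip the p95, mean, and single-chunk bounds at once, while a chunker that shreds documents into tiny fragments would trip only the median bound.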

## Known gaps (not blocking initial operator use)

The spec as drafted is well-shaped for "operator runs the gate
manually before kicking off a bulk-embed." It is underspecified for
"automated check that catches problems without a human watching."
Surfacing these here so future-me (and any contributor activating
the embedder) doesn't think the spec is more complete than it is:

1. No notification on FAIL. If the gate is ever moved into a
scheduled/automated path, an overnight GATE: FAIL would be
discovered only by reading logs. The host has no notification
infra wired up (email, Slack, pager) and substrate-observability-defaults
doesn't cover this. Out of scope until that broader infra lands.

2. No auto-block of the bulk run. The spec uses "shall" language
for the operator workflow but nothing enforces it in code. If the
embedder is started without the gate having passed, the bulk run
proceeds. Future wiring: the embedder daemon refuses to drain
chunk_embedding_jobs for source X until
/data/eval-gate/<source>/latest/gate-pass.json exists AND its
chunker_id+config matches the running config. That makes "you
forgot to run the gate" a fail-loud condition.

3. Throughput threshold is a STUB. Acceptance criterion 3 lists
concrete chunk-size bounds (p95 > 2048 tokens, mean > 1024, etc.), but
"expected throughput on the configured provider" has no number
attached, and no criterion currently fails the gate on throughput.
Concrete throughput thresholds will come from observing the first
few real runs against CRS. The [STUB] prefix on each criterion is
the marker for "this number is TBD."

4. Post-bulk drift is reported but not actioned. The Jaccard
≥80% re-check after bulk completes is a "judgment call for the
operator," and the spec doesn't name the file where below-threshold
queries are written. Lean toward gate-post-bulk.json alongside
gate-pass.json, surfaced per source in
https://docs.usejosh.com/data-status/, but that page has no
per-source surfacing yet.

5. No PR-time check. A chunker-config change in a PR currently
has no CI hook that runs the smoke against the changed config.
The gate is purely operator-time. Adding it to uv run poe ci
would add minutes; deferred until chunker config moves out of
env-vars into a versioned, reviewable place.

None of these block first use. The gap that's most worth closing
alongside actual embedder activation is #2 (auto-block) — the rest
can wait until the operational rhythm is established.
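
The auto-block wiring described in gap 2 might look something like the check below, run by the embedder daemon before it drains jobs for a source. This is a sketch under stated assumptions: the function name, the `chunker_id` and `chunker_config` key names inside `gate-pass.json`, and the `(allowed, reason)` return shape are hypothetical; only the `latest/gate-pass.json` path and the "exists AND config matches" rule come from the text above.

```python
from __future__ import annotations

import json
from pathlib import Path


def bulk_run_authorized(
    gate_root: Path,
    source: str,
    chunker_id: str,
    chunker_config: dict,
) -> tuple[bool, str]:
    """Allow a bulk run only if the latest gate pass matches the running config."""
    path = gate_root / source / "latest" / "gate-pass.json"
    if not path.is_file():
        return False, f"{path} not found; run the gate first"
    try:
        artifact = json.loads(path.read_text())
    except json.JSONDecodeError as exc:
        return False, f"{path} is not valid JSON: {exc}"
    if not isinstance(artifact, dict):
        return False, f"{path} does not contain a JSON object"
    # Key names below are assumed; the spec only says chunker_id+config is recorded.
    if artifact.get("chunker_id") != chunker_id:
        return False, "gate passed for a different chunker_id"
    if artifact.get("chunker_config") != chunker_config:
        return False, "gate passed for a different chunker config"
    return True, "ok"
```

Returning a reason alongside the boolean lets the daemon log why it refused, which is what turns "you forgot to run the gate" into a fail-loud condition rather than a silent skip.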

Tasks

0 of 7 done.

t1 [STUB] bin/embed-eval-gate.py — sample loader + chunker driver + temp-substrate embedder + reporter
t2 [STUB] CRS sanity-query yaml at josh-eval/retrieval/eval-sets/crs_reports/sanity-queries.yaml (10 queries per the 5-pattern rule above)
t3 [STUB] markdown_header_v1 chunker landed at shared/josh_substrate/src/josh_substrate/chunking/markdown_header.py (depends-on for the gate to have anything to drive)
t4 [STUB] First gate run on CRS — read out p50/p95/max chunk sizes, throughput, top-5 logs; iterate on chunker config until GATE: PASS
t5 [STUB] Post-bulk re-run protocol — same gate command with --post-bulk flag, computes Jaccard against pre-bulk top-5
t6 [STUB] Determiner upgrades to bash kind once the script ships and a real GATE: PASS is reproducible
t7 [STUB] https://docs.usejosh.com/data-status/ surfaces per-source gate-pass.json existence + timestamp once the gate is the operational norm

Changelog

2026-05-13T01:00:00Z draft→draft Added "Known gaps (not blocking initial operator use)" section to the plan body. Captures five gaps surfaced during a critical read of the spec: no FAIL notification, no auto-block of the bulk run, throughput threshold is STUB, post-bulk drift is reported-but-not-actioned, no PR-time check. None block first use; #2 (auto-block) is the one worth wiring alongside actual embedder activation. Spec remains `draft` — gaps are honest forward-looking commentary, not yet acceptance criteria.

docs/spec/substrate-embedding-evaluation.html · generated by bin/build-spec.py