Embedding architecture

How chunks become vectors. Four moving parts — schema, protocol, worker, query-time singleton — wired so the encoder is a swappable function over the substrate, not a baked-in dependency.

Three goals

A new Josh deployment becomes queryable in a day, not a week. Bulk-embedding 50M chunks on CPU is days; on a remote GPU it's hours. The architecture makes the slow path optional, not the only path.
The encoder is genuinely swappable. Every embedded row carries model_id and model_version, so a future model swap is a side-by-side rebuild followed by a config flip — not a downtime migration.
The code provably works. Every adapter passes the same contract suite. Every spec carries a runnable success_determiner. uv run poe ci catches regressions before push.

Four moving parts

Part	Lives in	Job
Schema	`migrations/0005_embedding_jobs.py`	Queue table + per-chunk attribution columns. Idempotent claim primitive.
Protocol	`shared/josh_substrate/embedding/`	One `EmbeddingProvider` Protocol every adapter satisfies. Three callables, four attributes.
Worker	`josh-embedder/`	Daemon that drains the queue → calls provider → writes vec0. Resumable, idempotent, retry-aware.
Query-time singleton	`josh-core/app/embedding.py`	`get_query_provider()`, lazy-loaded once at first call. Backs `POST /embed` + future retrieval endpoints.

Schema

Two changes in migration 0005:

chunk_embedding_jobs — central queue read by the worker. Composite PK (chunk_table, chunk_id, model_id, model_version) makes the worker idempotent and supports side-by-side multi-model rebuilds. Statuses: pending → running → done (success) or pending → running → failed (after max_attempts).
embedded_model_id / _version / _at on every <source>_chunks table. Lets retrieval answer "is this chunk current?" without joining the queue.

The per-source vec0 virtual tables are unchanged — crs_report_chunks_vec0 still has embedding float[1024] and embedding_bq bit[1024]. Vectors are written via vec_quantize_binary(?) to avoid coupling to vec0's tagged-binary internal format.

The provider protocol

One Protocol, three callables, four attributes. Adapters satisfy it structurally — no inheritance from a substrate class, so a third-party adapter can drop in without an import dependency.

class EmbeddingProvider(Protocol):
    model_id: str          # 'Snowflake/snowflake-arctic-embed-l-v2.0'
    model_version: str     # '2.0'
    dim: int               # 1024
    max_batch: int

    async def embed_documents(self, texts: list[str]) -> EmbeddingResult: ...
    async def embed_query(self, text: str) -> Embedding: ...
    async def health_check(self) -> bool: ...

Three reference adapters ship with the substrate:

Adapter	Use for	Throughput	Cost
`LocalSTProvider`	Query-time, low-volume bulk	~50–200 docs/s on CPU	Free, in-process
`ModalProvider`	Bulk backfills, GPU-on-demand	~3–5K docs/s on L4	~$0.80/hr L4 (per-second billing)
`HttpProvider`	OpenAI / TEI / vLLM / self-hosted	Backend-dependent	Backend-dependent

Adapters wrap backend-specific exceptions in EmbeddingError subclasses (ProviderConfigError, ProviderTransientError, ProviderUnavailableError) so the worker's retry policy is uniform regardless of which backend failed.

The worker (josh-embedder)

One daemon process, one event loop, one provider, polling the queue. Public surface is two coroutines:

run_forever — loops until cancelled. Yields per-cycle counts for observability.
drain_once — runs one full pass, returns. Used by tests, the smoke script, and one-off CLI invocations.

One claim → embed → write cycle:

Atomic UPDATE … RETURNING claims up to batch_size rows in pending, transitions them to running, returns identifiers. Increments attempts.
Per claimed row, fetch chunk_text from <chunk_table> (validated against [A-Za-z0-9_]; defends against injection-shaped queue rows).
Call provider.embed_documents(texts).
Single transaction: INSERT OR REPLACE INTO <chunk_table>_vec0 …, stamp embedded_model_* on the chunk row, mark job done.

Failure semantics

Error	Worker action	Why
`ProviderConfigError`	Mark every job in batch `failed` (terminal).	Config doesn't fix itself; retrying wastes budget.
`ProviderTransientError`	Revert to `pending` if `attempts < max_attempts`; otherwise `failed`.	Retry budget bounded; transient errors clear.
`ProviderUnavailableError`	Revert to `pending` without burning attempts.	Outage affects every job equally; per-job budget is wrong tool.
Source chunk row missing	Mark `failed` with explanatory `last_error`.	Nothing to embed; not retryable.

Query-time path

Different from bulk in two ways: latency-sensitive (sub-50ms target), and run inside a long-lived FastAPI process that can't pay model-load cost per request. The query-time singleton lives in josh-core, not josh-embedder:

from app.embedding import get_query_provider
provider = get_query_provider()              # lazy-loads once
vector = await provider.embed_query(text)    # ~30ms on CPU

Query-time is locked to JOSH_QUERY_EMBED_PROVIDER=local — Modal cold-start (10–30s) makes remote providers unsuitable for interactive paths. Bulk embedding still uses whichever JOSH_EMBED_PROVIDER the worker is configured with; the two are independent.

Model swaps

v1 ships single-model-per-deployment. The composite PK on chunk_embedding_jobs still supports the side-by-side pattern when a swap is needed:

Insert new jobs with the new model_version: INSERT INTO chunk_embedding_jobs SELECT chunk_table, chunk_id, 'new-model-id', 'new-version', 'pending' FROM chunk_embedding_jobs WHERE status='done'.
Spin up a second worker (or repoint the existing one) with the new JOSH_EMBED_MODEL_VERSION. It claims only the new rows and writes new vectors. The old vec0 rows are overwritten via INSERT OR REPLACE when the new run reaches them.
Once coverage hits 100% (verify via SELECT COUNT(*) FROM chunk_embedding_jobs WHERE model_version='new' AND status='done'), flip JOSH_QUERY_EMBED_MODEL_VERSION on josh-core and restart. Queries cut over to the new model with no downtime in retrieval availability.

Multi-model parallel querying (different vec0 tables per model_version, runtime routing) is deferred to a Phase 2 spec.

Configuration reference

Worker (`josh-embedder`)

Env var	Default	Purpose
`JOSH_EMBED_PROVIDER`	`local`	`local` \| `modal` \| `http`
`JOSH_EMBED_MODEL_ID`	`Snowflake/snowflake-arctic-embed-l-v2.0`	Model the worker will load (local) or claim jobs for (modal/http). See "On the production default" below for the Phase 1 result that picked Arctic-L over Arctic-M.
`JOSH_EMBED_MODEL_VERSION`	`2.0`	Version stamped onto every vec0/chunk row.
`JOSH_EMBED_BATCH_SIZE`	`32`	Max jobs claimed per poll cycle.
`JOSH_EMBED_POLL_INTERVAL`	`5.0` (seconds)	Sleep when queue is empty.
`JOSH_EMBED_MAX_ATTEMPTS`	`5`	Per-job retry budget on transient failure.
`JOSH_EMBED_TRUST_REMOTE_CODE`	`0`	Set to `1` for trusted publishers (e.g. Snowflake) requiring custom modeling code.

Query-time (`josh-core`)

Env var	Default	Purpose
`JOSH_QUERY_EMBED_PROVIDER`	`local`	Query-time only supports `local` in v1.
`JOSH_QUERY_EMBED_MODEL_ID`	`Snowflake/snowflake-arctic-embed-l-v2.0`	Loaded at first `/embed` or retrieval call. Must match the worker's `JOSH_EMBED_MODEL_ID`.
`JOSH_QUERY_EMBED_MODEL_VERSION`	`2.0`	Used by `/health/embedding` to filter queue counts.
`JOSH_QUERY_EMBED_TRUST_REMOTE_CODE`	`0`	Mirror of the worker flag for the query path.

On the production default

The substrate ships Snowflake/snowflake-arctic-embed-l-v2.0 (568M params, 1024-dim, 8192 max-seq) as the default for both bulk and query-time embedding. The pick comes from a two-phase evaluation:

CRS ablation (v1, frozen 2026-05-14). 5 dense models × 5 chunkers + BM25 against ~6,500 labeled (query, doc) pairs on a 1,500-doc CRS sample. Winner was mxbai-embed-large-v1 × fixed_token_512_50 by +0.026 NDCG@10 over Arctic-L — just over the 0.02 noise floor.
Phase 1 cross-source bake-off (2026-05-19). Same harness extended to 9 prose corpora (CRS + 8 others), 4 models × 1 fixed chunker × 25 queries × 9 sources ≈ 37,500 labels. Mean NDCG@10 across corpora: Arctic-L 0.8364, mxbai 0.8226 — a statistical tie inside their bootstrap CIs. Arctic-L wins or ties 7 of 9 sources.

The tiebreaker between Arctic-L and mxbai went to Arctic-L for one operational reason: Arctic-L's 8192 max-seq lets the Phase 2 chunker work (per substrate-retrieval-eval-per-source) test long-context configurations (1024-token chunks, section-aware structural chunks that may exceed 512 tokens) without re-embedding the corpus under a different model. mxbai's 512 max-seq would force re-embeds for any chunker that emits chunks longer than 512 tokens — that's a real cost at substrate scale.

On Arctic-M. An earlier revision of this document defaulted to Snowflake/snowflake-arctic-embed-m-v2.0 (335M params, same 1024-dim) on the prior that the smaller model would be cheaper inference at comparable quality. Four attempts to run Arctic-M on Modal L4 all crashed with CUDA error: device-side assert triggered ... vectorized gather kernel index out of bounds in the GTE custom modeling code path; Arctic-M is excluded from Phase 1 results as a result. Arctic-L uses the standard XLM-Roberta backbone and runs cleanly. If the Arctic-M kernel bug is resolved upstream, a future Phase 1' could re-test Arctic-M; until then Arctic-L is the production pick.

What this default doesn't yet decide. Phase 1 fixed the chunker at fixed_token_512_50 across all sources. The per-source structural-chunker question (e.g., section-aware for FR/bills/USC, speaker-turn for hearings/Congressional Record) is Phase 2 work under substrate-retrieval-eval-per-source — run before each structurally-rich source's first bulk-embed, scoped to ~$5 Modal budget per source.

End-to-end verification

bin/embed-smoke.sh runs the whole pipeline against a fresh tmp SQLite in ~30 seconds (after first-run model download): migrations, seed chunks + jobs, drain queue, verify vec0 rows, verify chunk attribution, verify re-drain is a no-op, print queue status. The unit test suites cover individual contracts; the smoke script proves the pieces fit together.

uv run poe smoke
# or directly:
bin/embed-smoke.sh