Embedding architecture
How chunks become vectors. Four moving parts — schema, protocol, worker, query-time singleton — wired so the encoder is a swappable function over the substrate, not a baked-in dependency.
Three goals
Section titled “Three goals”- A new Josh deployment becomes queryable in a day, not a week. Bulk-embedding 50M chunks on CPU is days; on a remote GPU it's hours. The architecture makes the slow path optional, not the only path.
- The encoder is genuinely swappable. Every embedded row carries
model_idandmodel_version, so a future model swap is a side-by-side rebuild followed by a config flip — not a downtime migration. - The code provably works. Every adapter passes the same contract suite. Every spec carries a runnable
success_determiner.uv run poe cicatches regressions before push.
Four moving parts
Section titled “Four moving parts”| Part | Lives in | Job |
|---|---|---|
| Schema | migrations/0005_embedding_jobs.py | Queue table + per-chunk attribution columns. Idempotent claim primitive. |
| Protocol | shared/josh_substrate/embedding/ | One EmbeddingProvider Protocol every adapter satisfies. Three callables, four attributes. |
| Worker | josh-embedder/ | Daemon that drains the queue → calls provider → writes vec0. Resumable, idempotent, retry-aware. |
| Query-time singleton | josh-core/app/embedding.py | get_query_provider(), lazy-loaded once at first call. Backs POST /embed + future retrieval endpoints. |
Schema
Section titled “Schema”Two changes in migration 0005:
chunk_embedding_jobs— central queue read by the worker. Composite PK(chunk_table, chunk_id, model_id, model_version)makes the worker idempotent and supports side-by-side multi-model rebuilds. Statuses:pending → running → done(success) orpending → running → failed(aftermax_attempts).embedded_model_id/_version/_aton every<source>_chunkstable. Lets retrieval answer "is this chunk current?" without joining the queue.
The per-source vec0 virtual tables are unchanged — crs_report_chunks_vec0 still has embedding float[1024] and embedding_bq bit[1024]. Vectors are written via vec_quantize_binary(?) to avoid coupling to vec0's tagged-binary internal format.
The provider protocol
Section titled “The provider protocol”One Protocol, three callables, four attributes. Adapters satisfy it structurally — no inheritance from a substrate class, so a third-party adapter can drop in without an import dependency.
class EmbeddingProvider(Protocol): model_id: str # 'Snowflake/snowflake-arctic-embed-l-v2.0' model_version: str # '2.0' dim: int # 1024 max_batch: int
async def embed_documents(self, texts: list[str]) -> EmbeddingResult: ... async def embed_query(self, text: str) -> Embedding: ... async def health_check(self) -> bool: ...Three reference adapters ship with the substrate:
| Adapter | Use for | Throughput | Cost |
|---|---|---|---|
LocalSTProvider | Query-time, low-volume bulk | ~50–200 docs/s on CPU | Free, in-process |
ModalProvider | Bulk backfills, GPU-on-demand | ~3–5K docs/s on L4 | ~$0.80/hr L4 (per-second billing) |
HttpProvider | OpenAI / TEI / vLLM / self-hosted | Backend-dependent | Backend-dependent |
Adapters wrap backend-specific exceptions in EmbeddingError subclasses (ProviderConfigError, ProviderTransientError, ProviderUnavailableError) so the worker's retry policy is uniform regardless of which backend failed.
The worker (josh-embedder)
Section titled “The worker (josh-embedder)”One daemon process, one event loop, one provider, polling the queue. Public surface is two coroutines:
run_forever— loops until cancelled. Yields per-cycle counts for observability.drain_once— runs one full pass, returns. Used by tests, the smoke script, and one-off CLI invocations.
One claim → embed → write cycle:
- Atomic
UPDATE … RETURNINGclaims up tobatch_sizerows inpending, transitions them torunning, returns identifiers. Incrementsattempts. - Per claimed row, fetch
chunk_textfrom<chunk_table>(validated against[A-Za-z0-9_]; defends against injection-shaped queue rows). - Call
provider.embed_documents(texts). - Single transaction:
INSERT OR REPLACE INTO <chunk_table>_vec0 …, stampembedded_model_*on the chunk row, mark jobdone.
Failure semantics
Section titled “Failure semantics”| Error | Worker action | Why |
|---|---|---|
ProviderConfigError | Mark every job in batch failed (terminal). | Config doesn't fix itself; retrying wastes budget. |
ProviderTransientError | Revert to pending if attempts < max_attempts; otherwise failed. | Retry budget bounded; transient errors clear. |
ProviderUnavailableError | Revert to pending without burning attempts. | Outage affects every job equally; per-job budget is wrong tool. |
| Source chunk row missing | Mark failed with explanatory last_error. | Nothing to embed; not retryable. |
Query-time path
Section titled “Query-time path”Different from bulk in two ways: latency-sensitive (sub-50ms target), and run inside a long-lived FastAPI process that can't pay model-load cost per request. The query-time singleton lives in josh-core, not josh-embedder:
from app.embedding import get_query_providerprovider = get_query_provider() # lazy-loads oncevector = await provider.embed_query(text) # ~30ms on CPUQuery-time is locked to JOSH_QUERY_EMBED_PROVIDER=local — Modal cold-start (10–30s) makes remote providers unsuitable for interactive paths. Bulk embedding still uses whichever JOSH_EMBED_PROVIDER the worker is configured with; the two are independent.
Model swaps
Section titled “Model swaps”v1 ships single-model-per-deployment. The composite PK on chunk_embedding_jobs still supports the side-by-side pattern when a swap is needed:
- Insert new jobs with the new
model_version:INSERT INTO chunk_embedding_jobs SELECT chunk_table, chunk_id, 'new-model-id', 'new-version', 'pending' FROM chunk_embedding_jobs WHERE status='done'. - Spin up a second worker (or repoint the existing one) with the new
JOSH_EMBED_MODEL_VERSION. It claims only the new rows and writes new vectors. The oldvec0rows are overwritten viaINSERT OR REPLACEwhen the new run reaches them. - Once coverage hits 100% (verify via
SELECT COUNT(*) FROM chunk_embedding_jobs WHERE model_version='new' AND status='done'), flipJOSH_QUERY_EMBED_MODEL_VERSIONon josh-core and restart. Queries cut over to the new model with no downtime in retrieval availability.
Multi-model parallel querying (different vec0 tables per model_version, runtime routing) is deferred to a Phase 2 spec.
Configuration reference
Section titled “Configuration reference”Worker (josh-embedder)
Section titled “Worker (josh-embedder)”| Env var | Default | Purpose |
|---|---|---|
JOSH_EMBED_PROVIDER | local | local | modal | http |
JOSH_EMBED_MODEL_ID | Snowflake/snowflake-arctic-embed-l-v2.0 | Model the worker will load (local) or claim jobs for (modal/http). See "On the production default" below for the Phase 1 result that picked Arctic-L over Arctic-M. |
JOSH_EMBED_MODEL_VERSION | 2.0 | Version stamped onto every vec0/chunk row. |
JOSH_EMBED_BATCH_SIZE | 32 | Max jobs claimed per poll cycle. |
JOSH_EMBED_POLL_INTERVAL | 5.0 (seconds) | Sleep when queue is empty. |
JOSH_EMBED_MAX_ATTEMPTS | 5 | Per-job retry budget on transient failure. |
JOSH_EMBED_TRUST_REMOTE_CODE | 0 | Set to 1 for trusted publishers (e.g. Snowflake) requiring custom modeling code. |
Query-time (josh-core)
Section titled “Query-time (josh-core)”| Env var | Default | Purpose |
|---|---|---|
JOSH_QUERY_EMBED_PROVIDER | local | Query-time only supports local in v1. |
JOSH_QUERY_EMBED_MODEL_ID | Snowflake/snowflake-arctic-embed-l-v2.0 | Loaded at first /embed or retrieval call. Must match the worker's JOSH_EMBED_MODEL_ID. |
JOSH_QUERY_EMBED_MODEL_VERSION | 2.0 | Used by /health/embedding to filter queue counts. |
JOSH_QUERY_EMBED_TRUST_REMOTE_CODE | 0 | Mirror of the worker flag for the query path. |
On the production default
Section titled “On the production default”The substrate ships Snowflake/snowflake-arctic-embed-l-v2.0 (568M params, 1024-dim, 8192 max-seq) as the default for both bulk and query-time embedding. The pick comes from a two-phase evaluation:
- CRS ablation (v1, frozen 2026-05-14). 5 dense models × 5 chunkers + BM25 against ~6,500 labeled (query, doc) pairs on a 1,500-doc CRS sample. Winner was
mxbai-embed-large-v1 × fixed_token_512_50by +0.026 NDCG@10 over Arctic-L — just over the 0.02 noise floor. - Phase 1 cross-source bake-off (2026-05-19). Same harness extended to 9 prose corpora (CRS + 8 others), 4 models × 1 fixed chunker × 25 queries × 9 sources ≈ 37,500 labels. Mean NDCG@10 across corpora: Arctic-L 0.8364, mxbai 0.8226 — a statistical tie inside their bootstrap CIs. Arctic-L wins or ties 7 of 9 sources.
The tiebreaker between Arctic-L and mxbai went to Arctic-L for one operational reason: Arctic-L's 8192 max-seq lets the Phase 2 chunker work (per substrate-retrieval-eval-per-source) test long-context configurations (1024-token chunks, section-aware structural chunks that may exceed 512 tokens) without re-embedding the corpus under a different model. mxbai's 512 max-seq would force re-embeds for any chunker that emits chunks longer than 512 tokens — that's a real cost at substrate scale.
On Arctic-M. An earlier revision of this document defaulted to Snowflake/snowflake-arctic-embed-m-v2.0 (335M params, same 1024-dim) on the prior that the smaller model would be cheaper inference at comparable quality. Four attempts to run Arctic-M on Modal L4 all crashed with CUDA error: device-side assert triggered ... vectorized gather kernel index out of bounds in the GTE custom modeling code path; Arctic-M is excluded from Phase 1 results as a result. Arctic-L uses the standard XLM-Roberta backbone and runs cleanly. If the Arctic-M kernel bug is resolved upstream, a future Phase 1' could re-test Arctic-M; until then Arctic-L is the production pick.
What this default doesn't yet decide. Phase 1 fixed the chunker at fixed_token_512_50 across all sources. The per-source structural-chunker question (e.g., section-aware for FR/bills/USC, speaker-turn for hearings/Congressional Record) is Phase 2 work under substrate-retrieval-eval-per-source — run before each structurally-rich source's first bulk-embed, scoped to ~$5 Modal budget per source.
End-to-end verification
Section titled “End-to-end verification”bin/embed-smoke.sh runs the whole pipeline against a fresh tmp SQLite in ~30 seconds (after first-run model download): migrations, seed chunks + jobs, drain queue, verify vec0 rows, verify chunk attribution, verify re-drain is a no-op, print queue status. The unit test suites cover individual contracts; the smoke script proves the pieces fit together.
uv run poe smoke# or directly:bin/embed-smoke.sh