
Embedding architecture

How chunks become vectors. Four moving parts — schema, protocol, worker, query-time singleton — wired so the encoder is a swappable function over the substrate, not a baked-in dependency.

  1. A new Josh deployment becomes queryable in a day, not a week. Bulk-embedding 50M chunks on CPU is days; on a remote GPU it's hours. The architecture makes the slow path optional, not the only path.
  2. The encoder is genuinely swappable. Every embedded row carries model_id and model_version, so a future model swap is a side-by-side rebuild followed by a config flip — not a downtime migration.
  3. The code provably works. Every adapter passes the same contract suite. Every spec carries a runnable success_determiner. uv run poe ci catches regressions before push.

| Part | Lives in | Job |
| --- | --- | --- |
| Schema | migrations/0005_embedding_jobs.py | Queue table + per-chunk attribution columns. Idempotent claim primitive. |
| Protocol | shared/josh_substrate/embedding/ | One EmbeddingProvider Protocol every adapter satisfies. Three callables, four attributes. |
| Worker | josh-embedder/ | Daemon that drains the queue → calls provider → writes vec0. Resumable, idempotent, retry-aware. |
| Query-time singleton | josh-core/app/embedding.py | get_query_provider(), lazy-loaded once at first call. Backs POST /embed + future retrieval endpoints. |

Two changes in migration 0005:

  1. chunk_embedding_jobs — central queue read by the worker. Composite PK (chunk_table, chunk_id, model_id, model_version) makes the worker idempotent and supports side-by-side multi-model rebuilds. Statuses: pending → running → done (success) or pending → running → failed (after max_attempts).
  2. embedded_model_id / _version / _at on every <source>_chunks table. Lets retrieval answer "is this chunk current?" without joining the queue.
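
A minimal sketch of the queue table, assuming SQLite and Python's sqlite3 module. The composite key and the status values come from the description above; the column types, the CHECK constraint, the defaults, and the enqueue helper are assumptions for illustration, not the contents of migration 0005.

```python
import sqlite3

# Hypothetical DDL: key columns and statuses follow the prose; types,
# defaults, and the CHECK constraint are assumptions.
JOBS_DDL = """
CREATE TABLE IF NOT EXISTS chunk_embedding_jobs (
    chunk_table   TEXT    NOT NULL,
    chunk_id      INTEGER NOT NULL,
    model_id      TEXT    NOT NULL,
    model_version TEXT    NOT NULL,
    status        TEXT    NOT NULL DEFAULT 'pending'
                  CHECK (status IN ('pending', 'running', 'done', 'failed')),
    attempts      INTEGER NOT NULL DEFAULT 0,
    last_error    TEXT,
    PRIMARY KEY (chunk_table, chunk_id, model_id, model_version)
)
"""


def enqueue(conn, chunk_table, chunk_id, model_id, model_version):
    """Insert a pending job; return False when the same key is already queued."""
    before = conn.total_changes
    conn.execute(
        "INSERT OR IGNORE INTO chunk_embedding_jobs "
        "(chunk_table, chunk_id, model_id, model_version) VALUES (?, ?, ?, ?)",
        (chunk_table, chunk_id, model_id, model_version),
    )
    return conn.total_changes > before


conn = sqlite3.connect(":memory:")
conn.execute(JOBS_DDL)
first = enqueue(conn, "crs_report_chunks", 1, "model-a", "1.0")
again = enqueue(conn, "crs_report_chunks", 1, "model-a", "1.0")  # same key: no-op
other = enqueue(conn, "crs_report_chunks", 1, "model-a", "2.0")  # new version: second row
```

Because model_id and model_version are part of the key, re-enqueueing the same chunk for the same model is harmless, while a new version gets its own row next to the old one.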

The per-source vec0 virtual tables are unchanged: crs_report_chunks_vec0 still has embedding float[1024] and embedding_bq bit[1024]. The embedding_bq value is produced in SQL via vec_quantize_binary(?), which avoids coupling to vec0's tagged-binary internal format.
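
To make the float[1024] / bit[1024] pairing concrete, here is a pure-Python sketch of what binary quantization does conceptually: one bit per dimension, set when the component is positive. The function names and the bit order within each byte are assumptions; the worker relies on the SQL function and never packs bits itself.

```python
def quantize_binary(vec: list[float]) -> bytes:
    """Pack one bit per dimension: 1 where the component is > 0, else 0."""
    if len(vec) % 8:
        raise ValueError("dimension must be a multiple of 8")
    out = bytearray(len(vec) // 8)
    for i, x in enumerate(vec):
        if x > 0:
            out[i // 8] |= 1 << (i % 8)  # bit order is an assumption
    return bytes(out)


def hamming(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length bit vectors."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


packed = quantize_binary([1.0, -1.0, 0.5, -0.2, 0.0, 0.0, 0.0, 3.0])
full = quantize_binary([0.1] * 1024)  # a 1024-dim vector packs into 128 bytes
```

A 1024-dim float32 vector takes 4096 bytes, while its bit form takes 128, which is why a binary column is useful as a cheap first-pass index.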

One Protocol, three callables, four attributes. Adapters satisfy it structurally — no inheritance from a substrate class, so a third-party adapter can drop in without an import dependency.

from typing import Protocol

class EmbeddingProvider(Protocol):
    model_id: str       # 'Snowflake/snowflake-arctic-embed-l-v2.0'
    model_version: str  # '2.0'
    dim: int            # 1024
    max_batch: int

    async def embed_documents(self, texts: list[str]) -> EmbeddingResult: ...
    async def embed_query(self, text: str) -> Embedding: ...
    async def health_check(self) -> bool: ...
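
The structural-typing claim can be illustrated with a toy adapter that never imports or subclasses anything from the substrate. This is a sketch under assumptions: the shapes of Embedding and EmbeddingResult are not given in the text, so both are stand-ins here, and HashProvider is a hypothetical deterministic fake, not one of the shipped adapters.

```python
import asyncio
import hashlib
from dataclasses import dataclass
from typing import Protocol, runtime_checkable

Embedding = list[float]  # assumption: the real alias may differ


@dataclass
class EmbeddingResult:  # assumption: stand-in for the real result type
    vectors: list[Embedding]
    model_id: str
    model_version: str


@runtime_checkable
class EmbeddingProvider(Protocol):
    model_id: str
    model_version: str
    dim: int
    max_batch: int

    async def embed_documents(self, texts: list[str]) -> EmbeddingResult: ...
    async def embed_query(self, text: str) -> Embedding: ...
    async def health_check(self) -> bool: ...


class HashProvider:
    """Hypothetical adapter: hashes text into a small deterministic vector."""

    model_id = "hypothetical/hash-embedder"
    model_version = "0"
    dim = 8
    max_batch = 4

    def _embed(self, text: str) -> Embedding:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [b / 255.0 for b in digest[: self.dim]]

    async def embed_documents(self, texts: list[str]) -> EmbeddingResult:
        if len(texts) > self.max_batch:
            raise ValueError("batch larger than max_batch")
        vectors = [self._embed(t) for t in texts]
        return EmbeddingResult(vectors, self.model_id, self.model_version)

    async def embed_query(self, text: str) -> Embedding:
        return self._embed(text)

    async def health_check(self) -> bool:
        return True


provider = HashProvider()
result = asyncio.run(provider.embed_documents(["alpha", "beta"]))
```

HashProvider passes an isinstance check against the Protocol purely because it has the right attributes and methods, which is the property that lets a third-party adapter live outside the substrate package.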

Three reference adapters ship with the substrate:

| Adapter | Use for | Throughput | Cost |
| --- | --- | --- | --- |
| LocalSTProvider | Query-time, low-volume bulk | ~50–200 docs/s on CPU | Free, in-process |
| ModalProvider | Bulk backfills, GPU-on-demand | ~3–5K docs/s on L4 | ~$0.80/hr L4 (per-second billing) |
| HttpProvider | OpenAI / TEI / vLLM / self-hosted | Backend-dependent | Backend-dependent |
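
The throughput column can be turned into a back-of-envelope estimate for a 50M-chunk backfill. The sketch below assumes a single worker running at the quoted rates and ignores cold starts, I/O, and retries, so the numbers are order-of-magnitude only.

```python
N_CHUNKS = 50_000_000
L4_DOLLARS_PER_HOUR = 0.80


def hours(n_docs: int, docs_per_second: float) -> float:
    """Wall-clock hours to embed n_docs at a steady rate."""
    return n_docs / docs_per_second / 3600


# (best case, worst case) using the ranges from the table above
cpu_days = (hours(N_CHUNKS, 200) / 24, hours(N_CHUNKS, 50) / 24)
gpu_hours = (hours(N_CHUNKS, 5000), hours(N_CHUNKS, 3000))
gpu_cost = tuple(h * L4_DOLLARS_PER_HOUR for h in gpu_hours)

print(f"CPU: {cpu_days[0]:.1f} to {cpu_days[1]:.1f} days")
print(f"L4:  {gpu_hours[0]:.2f} to {gpu_hours[1]:.2f} hours, "
      f"${gpu_cost[0]:.2f} to ${gpu_cost[1]:.2f}")
```

Under these assumptions the CPU path lands at roughly 3 to 12 days and the L4 path at roughly 3 to 5 hours for a few dollars, which matches the days-versus-hours framing.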

Adapters wrap backend-specific exceptions in EmbeddingError subclasses (ProviderConfigError, ProviderTransientError, ProviderUnavailableError) so the worker's retry policy is uniform regardless of which backend failed.
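
One way an adapter might do this wrapping is sketched below. The three subclass names come from the text; the mapping from standard-library exceptions to those subclasses, and the helper names wrap_backend_error and call_backend, are hypothetical. A real adapter would map its own backend's exception types.

```python
class EmbeddingError(Exception):
    """Base class for everything the worker's retry policy understands."""


class ProviderConfigError(EmbeddingError):
    """Bad credentials, unknown model, malformed settings."""


class ProviderTransientError(EmbeddingError):
    """Likely to clear on retry: timeouts, rate limits."""


class ProviderUnavailableError(EmbeddingError):
    """Backend is down or unreachable for every job."""


def wrap_backend_error(exc: Exception) -> EmbeddingError:
    """Hypothetical mapping, using stdlib exceptions as stand-ins."""
    if isinstance(exc, (KeyError, ValueError)):
        return ProviderConfigError(str(exc))
    if isinstance(exc, TimeoutError):
        return ProviderTransientError(str(exc))
    if isinstance(exc, ConnectionError):
        return ProviderUnavailableError(str(exc))
    return ProviderTransientError(str(exc))  # assumption: unknown means retryable


def call_backend(fn):
    """Run fn and re-raise any backend failure as an EmbeddingError."""
    try:
        return fn()
    except EmbeddingError:
        raise
    except Exception as exc:
        raise wrap_backend_error(exc) from exc


def _refused():
    raise ConnectionRefusedError("backend not listening")


try:
    call_backend(_refused)
except EmbeddingError as err:
    caught = err
```

Chaining with `from exc` keeps the original backend exception available on `__cause__` for logging while the worker only ever branches on the three subclasses.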

The worker is one daemon process with one event loop and one provider, polling the queue. Its public surface is two coroutines:

  • run_forever — loops until cancelled. Yields per-cycle counts for observability.
  • drain_once — runs one full pass, returns. Used by tests, the smoke script, and one-off CLI invocations.
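
The relationship between the two entry points can be sketched as follows. This is an illustration under assumptions: the real signatures are not shown in the text, so the cycle callable (standing in for one claim, embed, write pass that returns the number of jobs processed) and the poll_interval parameter are hypothetical.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable

# Hypothetical: one claim -> embed -> write pass, returning jobs processed.
CycleFn = Callable[[], Awaitable[int]]


async def drain_once(cycle: CycleFn) -> int:
    """Run cycles until the queue reports nothing left, then return the total."""
    total = 0
    while (n := await cycle()) > 0:
        total += n
    return total


async def run_forever(cycle: CycleFn, poll_interval: float = 5.0) -> AsyncIterator[int]:
    """Loop until cancelled, yielding the count from each cycle."""
    while True:
        n = await cycle()
        yield n
        if n == 0:
            await asyncio.sleep(poll_interval)  # queue empty: back off


def make_fake_cycle(counts):
    """Test double that replays a fixed sequence of per-cycle counts."""
    it = iter(counts)

    async def cycle() -> int:
        return next(it)

    return cycle


drained = asyncio.run(drain_once(make_fake_cycle([3, 2, 0])))
```

Keeping the cycle as an injected callable is what lets tests and the smoke script exercise the loop logic without a real provider or database.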

One claim → embed → write cycle:

  1. Atomic UPDATE … RETURNING claims up to batch_size rows in pending, transitions them to running, returns identifiers. Increments attempts.
  2. Per claimed row, fetch chunk_text from <chunk_table> (validated against [A-Za-z0-9_]; defends against injection-shaped queue rows).
  3. Call provider.embed_documents(texts).
  4. Single transaction: INSERT OR REPLACE INTO <chunk_table>_vec0 …, stamp embedded_model_* on the chunk row, mark job done.
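
Steps 1 and 2 can be sketched with Python's sqlite3 module. This is not the worker's actual SQL: the table layout, the rowid subquery used to apply the batch limit, and the function names are assumptions, and UPDATE … RETURNING requires SQLite 3.35 or newer.

```python
import re
import sqlite3

_IDENT = re.compile(r"[A-Za-z0-9_]+")


def safe_table(name: str) -> str:
    """Reject any table name that is not purely [A-Za-z0-9_] before interpolation."""
    if not _IDENT.fullmatch(name):
        raise ValueError(f"unsafe table name: {name!r}")
    return name


def claim(conn, model_id, model_version, batch_size):
    """Move up to batch_size pending jobs to running in one statement."""
    with conn:  # commits on success, rolls back on error
        return conn.execute(
            """
            UPDATE chunk_embedding_jobs
               SET status = 'running', attempts = attempts + 1
             WHERE rowid IN (
                   SELECT rowid FROM chunk_embedding_jobs
                    WHERE status = 'pending'
                      AND model_id = ? AND model_version = ?
                    LIMIT ?)
            RETURNING chunk_table, chunk_id
            """,
            (model_id, model_version, batch_size),
        ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE chunk_embedding_jobs (
           chunk_table TEXT, chunk_id INTEGER, model_id TEXT, model_version TEXT,
           status TEXT DEFAULT 'pending', attempts INTEGER DEFAULT 0,
           PRIMARY KEY (chunk_table, chunk_id, model_id, model_version))"""
)
conn.executemany(
    "INSERT INTO chunk_embedding_jobs (chunk_table, chunk_id, model_id, model_version) "
    "VALUES ('crs_report_chunks', ?, 'm', '1')",
    [(1,), (2,), (3,)],
)
conn.commit()

first = claim(conn, "m", "1", batch_size=2)
second = claim(conn, "m", "1", batch_size=2)
```

Because the status change and the read of the claimed keys happen in one statement, a second worker polling at the same moment cannot be handed the same rows.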

| Error | Worker action | Why |
| --- | --- | --- |
| ProviderConfigError | Mark every job in batch failed (terminal). | Config doesn't fix itself; retrying wastes budget. |
| ProviderTransientError | Revert to pending if attempts < max_attempts; otherwise failed. | Retry budget bounded; transient errors clear. |
| ProviderUnavailableError | Revert to pending without burning attempts. | Outage affects every job equally; per-job budget is wrong tool. |
| Source chunk row missing | Mark failed with explanatory last_error. | Nothing to embed; not retryable. |
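
The policy table reduces to a small pure function. The sketch below is an assumption about shape, not the worker's code: next_state and MissingChunkError are hypothetical names, and "without burning attempts" is modeled by refunding the increment the claim step applied.

```python
class EmbeddingError(Exception):
    pass


class ProviderConfigError(EmbeddingError):
    pass


class ProviderTransientError(EmbeddingError):
    pass


class ProviderUnavailableError(EmbeddingError):
    pass


class MissingChunkError(Exception):
    """Hypothetical: the source chunk row no longer exists."""


def next_state(error: Exception, attempts: int, max_attempts: int = 5) -> tuple[str, int]:
    """Return (new status, new attempts) for a job that was already claimed.

    attempts is the value after the claim step incremented it.
    """
    if isinstance(error, ProviderConfigError):
        return ("failed", attempts)  # terminal: config will not fix itself
    if isinstance(error, ProviderUnavailableError):
        return ("pending", max(attempts - 1, 0))  # refund the claim's increment
    if isinstance(error, ProviderTransientError):
        status = "pending" if attempts < max_attempts else "failed"
        return (status, attempts)
    if isinstance(error, MissingChunkError):
        return ("failed", attempts)  # nothing to embed
    raise error  # unknown errors are not swallowed


after_outage = next_state(ProviderUnavailableError("down"), attempts=3)
```

Keeping the decision separate from the database write makes each row of the table a one-line unit test.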

Query-time embedding differs from bulk in two ways: it is latency-sensitive (sub-50ms target), and it runs inside a long-lived FastAPI process that can't pay model-load cost per request. The query-time singleton lives in josh-core, not josh-embedder:

from app.embedding import get_query_provider
provider = get_query_provider() # lazy-loads once
vector = await provider.embed_query(text) # ~30ms on CPU

Query-time is locked to JOSH_QUERY_EMBED_PROVIDER=local — Modal cold-start (10–30s) makes remote providers unsuitable for interactive paths. Bulk embedding still uses whichever JOSH_EMBED_PROVIDER the worker is configured with; the two are independent.

v1 ships single-model-per-deployment. The composite PK on chunk_embedding_jobs still supports the side-by-side pattern when a swap is needed:

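  1. Insert new jobs with the new model_id and model_version: INSERT INTO chunk_embedding_jobs (chunk_table, chunk_id, model_id, model_version, status) SELECT chunk_table, chunk_id, 'new-model-id', 'new-version', 'pending' FROM chunk_embedding_jobs WHERE status='done'. The explicit column list keeps the statement valid when the table carries additional columns such as attempts.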
  2. Spin up a second worker (or repoint the existing one) with the new JOSH_EMBED_MODEL_VERSION. It claims only the new rows and writes new vectors. The old vec0 rows are overwritten via INSERT OR REPLACE when the new run reaches them.
  3. Once coverage hits 100% (compare SELECT COUNT(*) FROM chunk_embedding_jobs WHERE model_version='new' AND status='done' against the total number of jobs enqueued for that version), flip JOSH_QUERY_EMBED_MODEL_VERSION on josh-core and restart. Queries cut over to the new model with no retrieval downtime.
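
The enqueue and coverage checks from steps 1 and 3 can be sketched against an in-memory SQLite database. The helper names enqueue_rebuild and coverage, the simplified table layout, and the use of INSERT OR IGNORE with DISTINCT (so the enqueue can be re-run safely) are assumptions for illustration.

```python
import sqlite3


def enqueue_rebuild(conn, new_model_id, new_version):
    """Queue one pending job under the new model for every chunk already done."""
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO chunk_embedding_jobs "
            "(chunk_table, chunk_id, model_id, model_version, status) "
            "SELECT DISTINCT chunk_table, chunk_id, ?, ?, 'pending' "
            "FROM chunk_embedding_jobs WHERE status = 'done'",
            (new_model_id, new_version),
        )


def coverage(conn, model_id, model_version) -> float:
    """Fraction of this model's jobs that are done (0.0 when none are queued)."""
    done, total = conn.execute(
        "SELECT COALESCE(SUM(status = 'done'), 0), COUNT(*) "
        "FROM chunk_embedding_jobs WHERE model_id = ? AND model_version = ?",
        (model_id, model_version),
    ).fetchone()
    return done / total if total else 0.0


conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE chunk_embedding_jobs (
           chunk_table TEXT, chunk_id INTEGER, model_id TEXT, model_version TEXT,
           status TEXT DEFAULT 'pending', attempts INTEGER DEFAULT 0,
           PRIMARY KEY (chunk_table, chunk_id, model_id, model_version))"""
)
conn.executemany(
    "INSERT INTO chunk_embedding_jobs VALUES ('crs_report_chunks', ?, 'old', '1', 'done', 1)",
    [(i,) for i in range(4)],
)
conn.commit()

enqueue_rebuild(conn, "new", "2")
enqueue_rebuild(conn, "new", "2")  # re-running adds nothing
start = coverage(conn, "new", "2")

with conn:  # simulate the worker finishing three of the four new jobs
    conn.execute(
        "UPDATE chunk_embedding_jobs SET status = 'done' "
        "WHERE model_id = 'new' AND chunk_id < 3"
    )
partial = coverage(conn, "new", "2")
```

Gating the config flip on a ratio rather than a raw count avoids cutting over while failed or still-pending jobs remain for the new version.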

Multi-model parallel querying (different vec0 tables per model_version, runtime routing) is deferred to a Phase 2 spec.

| Env var | Default | Purpose |
| --- | --- | --- |
| JOSH_EMBED_PROVIDER | local | One of local, modal, http. |
| JOSH_EMBED_MODEL_ID | Snowflake/snowflake-arctic-embed-l-v2.0 | Model the worker will load (local) or claim jobs for (modal/http). See "On the production default" below for the Phase 1 result that picked Arctic-L over Arctic-M. |
| JOSH_EMBED_MODEL_VERSION | 2.0 | Version stamped onto every vec0/chunk row. |
| JOSH_EMBED_BATCH_SIZE | 32 | Max jobs claimed per poll cycle. |
| JOSH_EMBED_POLL_INTERVAL | 5.0 (seconds) | Sleep when queue is empty. |
| JOSH_EMBED_MAX_ATTEMPTS | 5 | Per-job retry budget on transient failure. |
| JOSH_EMBED_TRUST_REMOTE_CODE | 0 | Set to 1 for trusted publishers (e.g. Snowflake) requiring custom modeling code. |

| Env var | Default | Purpose |
| --- | --- | --- |
| JOSH_QUERY_EMBED_PROVIDER | local | Query-time only supports local in v1. |
| JOSH_QUERY_EMBED_MODEL_ID | Snowflake/snowflake-arctic-embed-l-v2.0 | Loaded at first /embed or retrieval call. Must match the worker's JOSH_EMBED_MODEL_ID. |
| JOSH_QUERY_EMBED_MODEL_VERSION | 2.0 | Used by /health/embedding to filter queue counts. |
| JOSH_QUERY_EMBED_TRUST_REMOTE_CODE | 0 | Mirror of the worker flag for the query path. |
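
A small loader shows how these variables and defaults might be read and validated. The variable names and defaults are taken from the tables; the WorkerConfig dataclass, load_worker_config, and query_model_matches are hypothetical names, and the real services may parse their settings differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping

DEFAULT_MODEL_ID = "Snowflake/snowflake-arctic-embed-l-v2.0"


@dataclass(frozen=True)
class WorkerConfig:
    provider: str
    model_id: str
    model_version: str
    batch_size: int
    poll_interval: float
    max_attempts: int
    trust_remote_code: bool


def load_worker_config(env: Mapping[str, str] = os.environ) -> WorkerConfig:
    """Read JOSH_EMBED_* settings, falling back to the documented defaults."""
    provider = env.get("JOSH_EMBED_PROVIDER", "local")
    if provider not in {"local", "modal", "http"}:
        raise ValueError(f"unknown JOSH_EMBED_PROVIDER: {provider!r}")
    return WorkerConfig(
        provider=provider,
        model_id=env.get("JOSH_EMBED_MODEL_ID", DEFAULT_MODEL_ID),
        model_version=env.get("JOSH_EMBED_MODEL_VERSION", "2.0"),
        batch_size=int(env.get("JOSH_EMBED_BATCH_SIZE", "32")),
        poll_interval=float(env.get("JOSH_EMBED_POLL_INTERVAL", "5.0")),
        max_attempts=int(env.get("JOSH_EMBED_MAX_ATTEMPTS", "5")),
        trust_remote_code=env.get("JOSH_EMBED_TRUST_REMOTE_CODE", "0") == "1",
    )


def query_model_matches(env: Mapping[str, str], worker: WorkerConfig) -> bool:
    """Check the rule that the query-time model must match the worker's model."""
    return env.get("JOSH_QUERY_EMBED_MODEL_ID", DEFAULT_MODEL_ID) == worker.model_id


defaults = load_worker_config({})
modal = load_worker_config({"JOSH_EMBED_PROVIDER": "modal", "JOSH_EMBED_BATCH_SIZE": "256"})
```

Checking the query/worker model match at startup is cheap, and it catches the case where vectors written by one model would be searched with queries embedded by another.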

The substrate ships Snowflake/snowflake-arctic-embed-l-v2.0 (568M params, 1024-dim, 8192 max-seq) as the default for both bulk and query-time embedding. The pick comes from a two-phase evaluation:

  1. CRS ablation (v1, frozen 2026-05-14). 5 dense models × 5 chunkers + BM25 against ~6,500 labeled (query, doc) pairs on a 1,500-doc CRS sample. Winner was mxbai-embed-large-v1 × fixed_token_512_50 by +0.026 NDCG@10 over Arctic-L — just over the 0.02 noise floor.
  2. Phase 1 cross-source bake-off (2026-05-19). Same harness extended to 9 prose corpora (CRS + 8 others), 4 models × 1 fixed chunker × 25 queries × 9 sources ≈ 37,500 labels. Mean NDCG@10 across corpora: Arctic-L 0.8364, mxbai 0.8226 — a statistical tie inside their bootstrap CIs. Arctic-L wins or ties 7 of 9 sources.

The tiebreaker between Arctic-L and mxbai went to Arctic-L for one operational reason: Arctic-L's 8192 max-seq lets the Phase 2 chunker work (per substrate-retrieval-eval-per-source) test long-context configurations (1024-token chunks, section-aware structural chunks that may exceed 512 tokens) without re-embedding the corpus under a different model. mxbai's 512 max-seq would force re-embeds for any chunker that emits chunks longer than 512 tokens — that's a real cost at substrate scale.

On Arctic-M. An earlier revision of this document defaulted to Snowflake/snowflake-arctic-embed-m-v2.0 (305M params, 768-dim) on the prior that the smaller model would give cheaper inference at comparable quality. Four attempts to run Arctic-M on Modal L4 all crashed with CUDA error: device-side assert triggered ... vectorized gather kernel index out of bounds in the GTE custom modeling code path, so Arctic-M is excluded from the Phase 1 results. Arctic-L uses the standard XLM-Roberta backbone and runs cleanly. If the Arctic-M kernel bug is resolved upstream, a future Phase 1' could re-test Arctic-M; until then Arctic-L is the production pick.

What this default doesn't yet decide. Phase 1 fixed the chunker at fixed_token_512_50 across all sources. The per-source structural-chunker question (e.g., section-aware for FR/bills/USC, speaker-turn for hearings/Congressional Record) is Phase 2 work under substrate-retrieval-eval-per-source — run before each structurally-rich source's first bulk-embed, scoped to ~$5 Modal budget per source.

bin/embed-smoke.sh runs the whole pipeline against a fresh tmp SQLite in ~30 seconds (after first-run model download): migrations, seed chunks + jobs, drain queue, verify vec0 rows, verify chunk attribution, verify re-drain is a no-op, print queue status. The unit test suites cover individual contracts; the smoke script proves the pieces fit together.

uv run poe smoke
# or directly:
bin/embed-smoke.sh