substrateshippedp0

Substrate vector pipeline

substrate-vector-pipeline · updated 2026-05-10T18:30:00Z · owner rritz

Header

Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.

Why

Per-source vector index via sqlite-vec's vec0 virtual table. Each
text-bearing source gets a <source>_chunks_vec0 companion that holds
both the full-precision embedding (float[1024]) and a binary-quantized
(bit[1024]) companion for the BQ+rescore retrieval pattern — ~32×
storage compression and ~95 % recall versus full-precision float at
brute-force scale. The chunk-attribution columns
(embedded_model_id, embedded_model_version, embedded_at) on every
chunks table let a model swap re-enqueue affected chunks without a
full rebuild. vec1 (IVFADC + OPQ) is the future ANN swap; until it
releases, the substrate ships on sqlite-vec BQ+rescore.

User stories

As a query-time agent, I want hybrid search across FTS5 and vector results so that lexical and semantic retrieval are one substrate, not two.

As an embedder, I want a typed `vec0` table per source with a stable column shape so that writing vectors is the same shape regardless of source.

As an operator, I want chunk-level model attribution so that a model swap touches only the affected chunks.

Acceptance criteria (EARS)

Where a substrate source produces chunks, its migration shall create a `<source>_chunks_vec0` virtual table via `CREATE VIRTUAL TABLE … USING vec0(chunk_id TEXT PRIMARY KEY, embedding float[1024], embedding_bq bit[1024])` (see `crs_report_chunks_vec0` in migration `0002_crs.py` line 277-).
Where a substrate source has a chunks table, the table shall include `embedded_model_id TEXT`, `embedded_model_version TEXT`, and `embedded_at TEXT` columns; existing rows with NULL values shall be treated as 'needs embedding' by the worker (per migration `0005_embedding_jobs.py` line 96-).
When `josh-embedder` writes a vector for a chunk, it shall stamp the chunk's `embedded_*` columns and insert a row into the per-source `vec0` table; the insert shall include both `embedding` (float[1024]) and `embedding_bq` (bit[1024]).
When the model is swapped (`model_id` or `model_version` change), `embedded_model_version` divergence shall cause the embedder to re-enqueue the affected chunks rather than overwriting silently.
When `alembic upgrade head` runs against a fresh DB, every shipped source's `<source>_chunks_vec0` table shall exist and accept the documented column shape (verified by the determiner).

Success determiner

kindbash

Command

set -e
tmp=$(mktemp -d)
SUBSTRATE_DB_PATH="$tmp/probe.db" \
  uv run --directory shared/josh_substrate alembic upgrade head
uv run python - <<PY
import asyncio, os
os.environ["SUBSTRATE_DB_PATH"] = "$tmp/probe.db"
from sqlalchemy import text
from josh_substrate.db import async_engine

async def main() -> None:
    engine = async_engine()
    async with engine.connect() as conn:
        # vec0 table exists.
        rows = (await conn.execute(text(
            "SELECT name FROM sqlite_master WHERE type='table' AND name='crs_report_chunks_vec0'"
        ))).all()
        assert rows, "crs_report_chunks_vec0 missing"
        # Chunk-attribution columns exist on the chunks table.
        cols = {r[1] for r in (await conn.execute(text(
            "PRAGMA table_info(crs_report_chunks)"
        ))).all()}
        for c in ("embedded_model_id", "embedded_model_version", "embedded_at"):
            assert c in cols, f"crs_report_chunks missing column {c}"
        # chunk_embedding_jobs queue table exists.
        rows = (await conn.execute(text(
            "SELECT name FROM sqlite_master WHERE type='table' AND name='chunk_embedding_jobs'"
        ))).all()
        assert rows, "chunk_embedding_jobs missing"
    await engine.dispose()
    print("OK")
asyncio.run(main())
PY
rm -rf "$tmp"

Expect

Stdout ends with `OK`. `crs_report_chunks_vec0`, `chunk_embedding_jobs`, and the three `embedded_*` attribution columns on `crs_report_chunks` are all present.

Probe (machine-checked contract)

[
  {
    "label": "crs_report_chunks_vec0 carries float[1024] embedding + bit[1024] companion",
    "query": "SELECT (sql LIKE '%float[1024]%' AND sql LIKE '%bit[1024]%') FROM sqlite_master WHERE type='table' AND name='crs_report_chunks_vec0'",
    "compare": {
      "op": "eq",
      "value": 1
    }
  },
  {
    "label": "crs_report_chunks has the 3 embedded_* attribution columns",
    "query": "SELECT COUNT(*) FROM pragma_table_info('crs_report_chunks') WHERE name IN ('embedded_model_id', 'embedded_model_version', 'embedded_at')",
    "compare": {
      "op": "eq",
      "value": 3
    }
  },
  {
    "label": "chunk_embedding_jobs work queue exists",
    "query": "SELECT count(*) FROM sqlite_master WHERE type='table' AND name='chunk_embedding_jobs'",
    "compare": {
      "op": "eq",
      "value": 1
    }
  }
]

Schema-only check — proves the vector pipeline is wired on an empty substrate. End-to-end behaviour is exercised by `bin/embed-smoke.sh` and the embedding contract tests (`embedding-provider-protocol`, `embedding-worker`, `embedding-query-integration`).

Clarifications needed

None.

Out of scope

The actual `EmbeddingProvider` implementations — that's `embedding-provider-protocol` and the per-adapter specs.
Chunking strategy. The chunker output shape is documented in `https://docs.usejosh.com/operations/chunking/`.
`vec1` IVFADC+OPQ migration. Tracked separately when upstream releases.
Cross-source vector search. Per-source `vec0` is the substrate primitive; the surface concern lives in `rest-api-search`.

Dependencies

Plan

Three migration-level pieces:

1. Per-source <source>_chunks_vec0 (migration 0002_crs.py,
analogous blocks per future source). One vec0 row per chunk;
chunk_id is the soft FK to <source>_chunks.id. Holds both
full-precision and binary-quantized vectors so the BQ+rescore
query path doesn't need a separate table.
2. Per-source <source>_chunks attribution columns (migration
0005_embedding_jobs.py). Three additive columns
(embedded_model_id, embedded_model_version, embedded_at)
added via batch_alter_table (required for SQLite ALTER limitations,
see migrations/env.py's render_as_batch=True).
3. chunk_embedding_jobs work queue (migration 0005_embedding_jobs.py).
Composite primary key
(chunk_table, chunk_id, model_id, model_version); status enum
(pending, running, done, failed); covering index on
(status, inserted_at) for the worker poll query and a partial
index WHERE status='failed' for operator triage.

Worker side (josh-embedder): polls the queue, embeds in batches via
the configured EmbeddingProvider, writes both vector representations
to the per-source vec0 table, stamps the attribution columns. Detail
in https://docs.usejosh.com/operations/embedding-architecture/.

Query side (josh-core): the BQ+rescore pattern uses
vec_distance_hamming(embedding_bq, query_bq) for the wide cut, then
a re-rank on the candidates' full-precision embedding. The wrapper
ships with rest-api-search.

Tasks

4 of 6 done.

t1 crs_report_chunks_vec0 shipped with float[1024] + bit[1024] columns
t2 Chunk-attribution columns (embedded_model_id/version/at) shipped
t3 chunk_embedding_jobs queue table + indexes shipped (migration 0005)
t4 Determiner verifies all three pieces on a fresh substrate
t5 BQ+rescore query wrapper — tracked under rest-api-search
t6 vec1 IVFADC+OPQ swap — deferred until upstream releases

Changelog

2026-05-10T18:30:00Z (new)→shipped Spec content backfilled retrospectively. The per-source `vec0` table pattern landed with the CRS migration; the chunk-attribution columns and `chunk_embedding_jobs` queue landed in migration 0005 alongside the embedding worker. This commit replaces the [STUB] placeholders with acceptance criteria grounded in migrations 0002 and 0005, plus a schema-only determiner.

docs/spec/substrate-vector-pipeline.html · generated by bin/build-spec.py