Substrate vector pipeline
Header
Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.
Why
Per-source vector index via sqlite-vec's vec0 virtual table. Each
text-bearing source gets a <source>_chunks_vec0 companion that holds
both the full-precision embedding (float[1024]) and a binary-quantized
(bit[1024]) companion for the BQ+rescore retrieval pattern — ~32×
storage compression and ~95 % recall versus full-precision float at
brute-force scale. The chunk-attribution columns
(embedded_model_id, embedded_model_version, embedded_at) on every
chunks table let a model swap re-enqueue affected chunks without a
full rebuild. vec1 (IVFADC + OPQ) is the future ANN swap; until it
releases, the substrate ships on sqlite-vec BQ+rescore.
User stories
As a query-time agent, I want hybrid search across FTS5 and vector results so that lexical and semantic retrieval are one substrate, not two.
As an embedder, I want a typed `vec0` table per source with a stable column shape so that writing vectors is the same shape regardless of source.
As an operator, I want chunk-level model attribution so that a model swap touches only the affected chunks.
Acceptance criteria (EARS)
- Where a substrate source produces chunks, its migration shall create a `<source>_chunks_vec0` virtual table via `CREATE VIRTUAL TABLE … USING vec0(chunk_id TEXT PRIMARY KEY, embedding float[1024], embedding_bq bit[1024])` (see `crs_report_chunks_vec0` in migration `0002_crs.py` line 277-).
- Where a substrate source has a chunks table, the table shall include `embedded_model_id TEXT`, `embedded_model_version TEXT`, and `embedded_at TEXT` columns; existing rows with NULL values shall be treated as 'needs embedding' by the worker (per migration `0005_embedding_jobs.py` line 96-).
- When `josh-embedder` writes a vector for a chunk, it shall stamp the chunk's `embedded_*` columns and insert a row into the per-source `vec0` table; the insert shall include both `embedding` (float[1024]) and `embedding_bq` (bit[1024]).
- When the model is swapped (`model_id` or `model_version` change), `embedded_model_version` divergence shall cause the embedder to re-enqueue the affected chunks rather than overwriting silently.
- When `alembic upgrade head` runs against a fresh DB, every shipped source's `<source>_chunks_vec0` table shall exist and accept the documented column shape (verified by the determiner).
Success determiner
Command
set -e
tmp=$(mktemp -d)
SUBSTRATE_DB_PATH="$tmp/probe.db" \
uv run --directory shared/josh_substrate alembic upgrade head
uv run python - <<PY
import asyncio, os
os.environ["SUBSTRATE_DB_PATH"] = "$tmp/probe.db"
from sqlalchemy import text
from josh_substrate.db import async_engine
async def main() -> None:
engine = async_engine()
async with engine.connect() as conn:
# vec0 table exists.
rows = (await conn.execute(text(
"SELECT name FROM sqlite_master WHERE type='table' AND name='crs_report_chunks_vec0'"
))).all()
assert rows, "crs_report_chunks_vec0 missing"
# Chunk-attribution columns exist on the chunks table.
cols = {r[1] for r in (await conn.execute(text(
"PRAGMA table_info(crs_report_chunks)"
))).all()}
for c in ("embedded_model_id", "embedded_model_version", "embedded_at"):
assert c in cols, f"crs_report_chunks missing column {c}"
# chunk_embedding_jobs queue table exists.
rows = (await conn.execute(text(
"SELECT name FROM sqlite_master WHERE type='table' AND name='chunk_embedding_jobs'"
))).all()
assert rows, "chunk_embedding_jobs missing"
await engine.dispose()
print("OK")
asyncio.run(main())
PY
rm -rf "$tmp"
Expect
Probe (machine-checked contract)
[
{
"label": "crs_report_chunks_vec0 carries float[1024] embedding + bit[1024] companion",
"query": "SELECT (sql LIKE '%float[1024]%' AND sql LIKE '%bit[1024]%') FROM sqlite_master WHERE type='table' AND name='crs_report_chunks_vec0'",
"compare": {
"op": "eq",
"value": 1
}
},
{
"label": "crs_report_chunks has the 3 embedded_* attribution columns",
"query": "SELECT COUNT(*) FROM pragma_table_info('crs_report_chunks') WHERE name IN ('embedded_model_id', 'embedded_model_version', 'embedded_at')",
"compare": {
"op": "eq",
"value": 3
}
},
{
"label": "chunk_embedding_jobs work queue exists",
"query": "SELECT count(*) FROM sqlite_master WHERE type='table' AND name='chunk_embedding_jobs'",
"compare": {
"op": "eq",
"value": 1
}
}
]
Schema-only check — proves the vector pipeline is wired on an empty substrate. End-to-end behaviour is exercised by `bin/embed-smoke.sh` and the embedding contract tests (`embedding-provider-protocol`, `embedding-worker`, `embedding-query-integration`).
Clarifications needed
None.
Out of scope
- The actual `EmbeddingProvider` implementations — that's `embedding-provider-protocol` and the per-adapter specs.
- Chunking strategy. The chunker output shape is documented in `https://docs.usejosh.com/operations/chunking/`.
- `vec1` IVFADC+OPQ migration. Tracked separately when upstream releases.
- Cross-source vector search. Per-source `vec0` is the substrate primitive; the surface concern lives in `rest-api-search`.
Dependencies
Plan
Three migration-level pieces:
1. Per-source <source>_chunks_vec0 (migration 0002_crs.py,
analogous blocks per future source). One vec0 row per chunk;
chunk_id is the soft FK to <source>_chunks.id. Holds both
full-precision and binary-quantized vectors so the BQ+rescore
query path doesn't need a separate table.
2. Per-source <source>_chunks attribution columns (migration
0005_embedding_jobs.py). Three additive columns
(embedded_model_id, embedded_model_version, embedded_at)
added via batch_alter_table (required for SQLite ALTER limitations,
see migrations/env.py's render_as_batch=True).
3. chunk_embedding_jobs work queue (migration 0005_embedding_jobs.py).
Composite primary key
(chunk_table, chunk_id, model_id, model_version); status enum
(pending, running, done, failed); covering index on
(status, inserted_at) for the worker poll query and a partial
index WHERE status='failed' for operator triage.
Worker side (josh-embedder): polls the queue, embeds in batches via
the configured EmbeddingProvider, writes both vector representations
to the per-source vec0 table, stamps the attribution columns. Detail
in https://docs.usejosh.com/operations/embedding-architecture/.
Query side (josh-core): the BQ+rescore pattern usesvec_distance_hamming(embedding_bq, query_bq) for the wide cut, then
a re-rank on the candidates' full-precision embedding. The wrapper
ships with rest-api-search.
Tasks
4 of 6 done.
- t1 crs_report_chunks_vec0 shipped with float[1024] + bit[1024] columns
- t2 Chunk-attribution columns (embedded_model_id/version/at) shipped
- t3 chunk_embedding_jobs queue table + indexes shipped (migration 0005)
- t4 Determiner verifies all three pieces on a fresh substrate
- t5 BQ+rescore query wrapper — tracked under rest-api-search
- t6 vec1 IVFADC+OPQ swap — deferred until upstream releases
Changelog
-
2026-05-10T18:30:00Z
(new)→shippedSpec content backfilled retrospectively. The per-source `vec0` table pattern landed with the CRS migration; the chunk-attribution columns and `chunk_embedding_jobs` queue landed in migration 0005 alongside the embedding worker. This commit replaces the [STUB] placeholders with acceptance criteria grounded in migrations 0002 and 0005, plus a schema-only determiner.