substrateverifiedp0

Embedding jobs queue + per-chunk model attribution

embedding-jobs-schema · updated — · owner rritz

Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.

An embedding worker needs (1) an idempotent queue it can claim from
atomically and (2) per-chunk attribution so retrieval code knows
which model produced which vector. Migration 0005 introduces the
chunk_embedding_jobs table (PK on chunk_table+chunk_id+model_id+
model_version, supporting side-by-side multi-model rebuilds) plus
three columns on every <source>_chunks table: embedded_model_id,
embedded_model_version, embedded_at. Without these, the worker
has no durable claim mechanism and the substrate can't answer "is
this chunk current with the active model?" without a join.

As an embedding worker, I want an atomic claim primitive on a queue table so that multiple workers can run side-by-side without double-embedding chunks.

As a retrieval query, I want per-chunk attribution of which model produced the vector so that I can detect stale vectors after a model swap without joining the queue.

  1. When migration 0005 runs against a fresh substrate, the system shall create `chunk_embedding_jobs` with the composite PK `(chunk_table, chunk_id, model_id, model_version)`.
  2. Where `chunk_embedding_jobs.status` is set, the system shall enforce a CHECK constraint allowing only `pending`, `running`, `done`, `failed`.
  3. When migration 0005 runs, the system shall add `embedded_model_id`, `embedded_model_version`, and `embedded_at` columns to `crs_report_chunks`.
  4. While the worker polls the queue, the index `idx_chunk_embedding_jobs_status_inserted` shall make `status='pending' ORDER BY inserted_at LIMIT N` an indexed scan.
  5. When migration 0005 is downgraded, the system shall drop the queue table and the new chunk-row columns in reverse order.
kindbash

Command

set -euo pipefail
DB=$(mktemp)
(cd shared/josh_substrate && SUBSTRATE_DB_PATH="$DB" uv run alembic upgrade head)
sqlite3 "$DB" "SELECT name FROM sqlite_master WHERE type='table' AND name='chunk_embedding_jobs'" | grep -q chunk_embedding_jobs
sqlite3 "$DB" "SELECT name FROM pragma_table_info('crs_report_chunks') WHERE name='embedded_model_id'" | grep -q embedded_model_id
rm -f "$DB"
echo OK

Expect

OK

Runs alembic against a tmp SQLite, asserts both schema additions exist.

None.

None.

Single migration 0005_embedding_jobs.py in
shared/josh_substrate/src/josh_substrate/migrations/versions/. Idiomatic
raw-SQL migration matching the style of 0002_crs.py. TEXT timestamps
via strftime('%Y-%m-%dT%H:%M:%fZ','now') matches every other table.

4 of 4 done.

  • t1 Migration 0005 created with queue table + indexes
  • t2 embedded_model_id/version/at added to crs_report_chunks via batch_alter_table
  • t3 Verified migration applies cleanly to a fresh SQLite
  • t4 downgrade() reverses both changes
  • 2026-05-10T11:00:00Z plannedverified Migration 0005 written and verified to apply cleanly.

docs/spec/embedding-jobs-schema.html · generated by bin/build-spec.py