Embedding jobs queue + per-chunk model attribution
Header
Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.
Why
An embedding worker needs (1) an idempotent queue it can claim from
atomically and (2) per-chunk attribution so retrieval code knows
which model produced which vector. Migration 0005 introduces thechunk_embedding_jobs table (PK on chunk_table+chunk_id+model_id+
model_version, supporting side-by-side multi-model rebuilds) plus
three columns on every <source>_chunks table: embedded_model_id,embedded_model_version, embedded_at. Without these, the worker
has no durable claim mechanism and the substrate can't answer "is
this chunk current with the active model?" without a join.
User stories
As an embedding worker, I want an atomic claim primitive on a queue table so that multiple workers can run side-by-side without double-embedding chunks.
As a retrieval query, I want per-chunk attribution of which model produced the vector so that I can detect stale vectors after a model swap without joining the queue.
Acceptance criteria (EARS)
- When migration 0005 runs against a fresh substrate, the system shall create `chunk_embedding_jobs` with the composite PK `(chunk_table, chunk_id, model_id, model_version)`.
- Where `chunk_embedding_jobs.status` is set, the system shall enforce a CHECK constraint allowing only `pending`, `running`, `done`, `failed`.
- When migration 0005 runs, the system shall add `embedded_model_id`, `embedded_model_version`, and `embedded_at` columns to `crs_report_chunks`.
- While the worker polls the queue, the index `idx_chunk_embedding_jobs_status_inserted` shall make `status='pending' ORDER BY inserted_at LIMIT N` an indexed scan.
- When migration 0005 is downgraded, the system shall drop the queue table and the new chunk-row columns in reverse order.
Success determiner
Command
set -euo pipefail
DB=$(mktemp)
(cd shared/josh_substrate && SUBSTRATE_DB_PATH="$DB" uv run alembic upgrade head)
sqlite3 "$DB" "SELECT name FROM sqlite_master WHERE type='table' AND name='chunk_embedding_jobs'" | grep -q chunk_embedding_jobs
sqlite3 "$DB" "SELECT name FROM pragma_table_info('crs_report_chunks') WHERE name='embedded_model_id'" | grep -q embedded_model_id
rm -f "$DB"
echo OK
Expect
Runs alembic against a tmp SQLite, asserts both schema additions exist.
Clarifications needed
None.
Out of scope
None.
Dependencies
Plan
Single migration 0005_embedding_jobs.py inshared/josh_substrate/src/josh_substrate/migrations/versions/. Idiomatic
raw-SQL migration matching the style of 0002_crs.py. TEXT timestamps
via strftime('%Y-%m-%dT%H:%M:%fZ','now') matches every other table.
Tasks
4 of 4 done.
- t1 Migration 0005 created with queue table + indexes
- t2 embedded_model_id/version/at added to crs_report_chunks via batch_alter_table
- t3 Verified migration applies cleanly to a fresh SQLite
- t4 downgrade() reverses both changes
Changelog
-
2026-05-10T11:00:00Z
planned→verifiedMigration 0005 written and verified to apply cleanly.