bin/embed-smoke.sh — end-to-end embedding pipeline verification
Why
Per-component unit tests prove each piece in isolation, but they don't
prove the pieces fit together when wired with real config. The smoke
script does that in ~30 seconds (after first-run model download): fresh
SQLite → migrations → seeded chunks + jobs → real josh-embedder drain-once
invocation against a real LocalSTProvider → vec0 row
verification → re-drain idempotence → status command. If any of those
drift apart, the smoke catches it before deploy. It is wired into uv run poe smoke and uv run poe ci-full, so it's one command from
any developer's machine.
User stories
As a contributor making cross-component changes, I want one command that exercises the whole embedding pipeline so that I find integration drift locally, not in production.
As an operator preparing a deploy, I want a quick end-to-end check before pushing to hetzner so that I don't ship a worker that can't actually write to vec0.
Acceptance criteria (EARS)
- When `bin/embed-smoke.sh` runs from the repo root, the system shall create a fresh tmp SQLite, apply migrations, seed 5 CRS chunks + jobs, drain the queue via `josh-embedder drain-once`, and exit 0 only if all assertions hold.
- When the script verifies vec0 row count, the system shall query through `josh_substrate.db.async_engine` (which loads sqlite-vec) — not via the bare `sqlite3` CLI — because the virtual table is invisible without the extension.
- When the queue has been fully drained, the system shall verify `chunk_embedding_jobs.status='done'` for all 5 rows AND `crs_report_chunks_vec0` has 5 rows AND every chunk has `embedded_at IS NOT NULL`.
- When the script re-runs `drain-once` on an already-drained queue, the system shall verify zero new jobs are processed (idempotency proof).
- When the smoke completes, the system shall remove its tmp directory automatically (trap on EXIT) so the disk doesn't accumulate test DBs.
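The drained-state and idempotency criteria above can be sketched with the standard-library `sqlite3` module. This is an illustration under stated assumptions, not the script's actual code: the chunk table name `crs_report_chunks`, the columns `chunk_id`, `body`, and `embedding`, and the `seed`, `drain_once`, and `check_drained` helpers are all hypothetical. A plain table stands in for the `crs_report_chunks_vec0` virtual table so the sketch runs without sqlite-vec; the real check has to go through `josh_substrate.db.async_engine`.

```python
import sqlite3


def seed(conn, n=5):
    """Create a hypothetical minimal schema and seed n chunks with pending jobs."""
    conn.executescript(
        """
        CREATE TABLE crs_report_chunks (
            id INTEGER PRIMARY KEY, body TEXT, embedded_at TEXT);
        CREATE TABLE chunk_embedding_jobs (
            chunk_id INTEGER PRIMARY KEY, status TEXT NOT NULL);
        -- Ordinary table standing in for the sqlite-vec virtual table.
        CREATE TABLE crs_report_chunks_vec0 (
            chunk_id INTEGER PRIMARY KEY, embedding BLOB);
        """
    )
    for i in range(1, n + 1):
        conn.execute(
            "INSERT INTO crs_report_chunks (id, body) VALUES (?, ?)",
            (i, f"chunk {i}"),
        )
        conn.execute(
            "INSERT INTO chunk_embedding_jobs (chunk_id, status) VALUES (?, 'pending')",
            (i,),
        )
    conn.commit()


def drain_once(conn):
    """Stand-in for `josh-embedder drain-once`: process pending jobs, return the count."""
    pending = [
        row[0]
        for row in conn.execute(
            "SELECT chunk_id FROM chunk_embedding_jobs "
            "WHERE status = 'pending' ORDER BY chunk_id"
        )
    ]
    for chunk_id in pending:
        # A zeroed blob stands in for a real embedding vector.
        conn.execute(
            "INSERT INTO crs_report_chunks_vec0 (chunk_id, embedding) VALUES (?, ?)",
            (chunk_id, bytes(4)),
        )
        conn.execute(
            "UPDATE crs_report_chunks SET embedded_at = datetime('now') WHERE id = ?",
            (chunk_id,),
        )
        conn.execute(
            "UPDATE chunk_embedding_jobs SET status = 'done' WHERE chunk_id = ?",
            (chunk_id,),
        )
    conn.commit()
    return len(pending)


def check_drained(conn, expected=5):
    """Return True only if all three drained-state conditions hold."""
    done = conn.execute(
        "SELECT COUNT(*) FROM chunk_embedding_jobs WHERE status = 'done'"
    ).fetchone()[0]
    vec_rows = conn.execute(
        "SELECT COUNT(*) FROM crs_report_chunks_vec0"
    ).fetchone()[0]
    unstamped = conn.execute(
        "SELECT COUNT(*) FROM crs_report_chunks WHERE embedded_at IS NULL"
    ).fetchone()[0]
    return done == expected and vec_rows == expected and unstamped == 0


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    seed(conn)
    first = drain_once(conn)   # processes every seeded job
    second = drain_once(conn)  # nothing left to process on the re-drain
    print(first, second, check_drained(conn))
```

The second `drain_once` call is the idempotency proof in miniature: it selects only `pending` jobs, so an already-drained queue yields zero processed jobs and leaves the three counts unchanged.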
Success determiner
Command
bash bin/embed-smoke.sh
Expect
First run downloads ~1.3GB (BAAI/bge-large-en-v1.5) into the HF cache. Subsequent runs are warm in <30s. Production deployments swap to Snowflake/snowflake-arctic-embed-l-v2.0 (also 1024 dim, standard XLM-Roberta backbone, no custom modeling code) — the smoke uses BGE-large to keep CI prereqs minimal and to avoid binding the smoke to a model the per-source retrieval eval might later swap. See https://docs.usejosh.com/operations/embedding-architecture/#model-choice for the Arctic-L production default rationale.
Clarifications needed
None.
Out of scope
None.
Dependencies
Plan
Single bash script (~80 lines) at bin/embed-smoke.sh. Uses BGE-large
(1024 dim, no remote code, no xformers) so the smoke is reproducible
across environments. Each verification step echoes a "→" line so the
output is grep-friendly. set -euo pipefail + EXIT trap for cleanup.
Wired into [tool.poe.tasks].smoke so uv run poe smoke and uv run poe ci-full invoke it.
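The script itself is bash, but the same shape (a grep-friendly "→" line per step, fail on the first error, always remove the tmp directory) can be illustrated in Python. This is only an analogue, assuming hypothetical helpers `run_step` and `smoke`: `subprocess.run(..., check=True)` plays the role of `set -e`, and `tempfile.TemporaryDirectory` plays the role of the EXIT trap. The single step shown creates a throwaway SQLite file; it is not one of the real pipeline steps.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_step(label, argv):
    """Echo a grep-friendly marker line, then run the step and fail fast."""
    print(f"→ {label}", flush=True)
    # A non-zero exit raises CalledProcessError, much like `set -e` aborts bash.
    subprocess.run(argv, check=True)


def smoke():
    """Run a one-step stand-in pipeline inside a self-cleaning tmp directory."""
    # The context manager removes the directory on success and on failure,
    # which is the job the EXIT trap does in the bash script.
    with tempfile.TemporaryDirectory(prefix="embed-smoke-") as tmp:
        db_path = Path(tmp) / "smoke.db"
        create_db = (
            "import sqlite3, sys; "
            "c = sqlite3.connect(sys.argv[1]); "
            "c.execute('CREATE TABLE t (x)'); c.commit(); c.close()"
        )
        run_step("create fresh SQLite", [sys.executable, "-c", create_db, str(db_path)])
        if not db_path.exists():
            raise RuntimeError("database file was not created")
        return tmp  # returned only so a caller can confirm it is gone afterwards


if __name__ == "__main__":
    leftover = smoke()
    print("tmp dir removed:", not Path(leftover).exists())
```

Because every step prints its marker before running, `grep '→'` over the output shows how far the run got before a failure.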
Tasks
7 of 7 done.
- t1 Script writes fresh SQLite, runs alembic upgrade head
- t2 Seeds 5 CRS chunks + matching pending jobs via inline SQL
- t3 Calls `uv run josh-embedder drain-once` and asserts processed count
- t4 Verifies vec0 row count via Python (extension-loaded engine)
- t5 Verifies chunk attribution columns are stamped
- t6 Verifies re-drain is a no-op (`processed 0 jobs`)
- t7 Wired into `poe smoke` and `poe ci-full` tasks
Changelog
- 2026-05-27T00:00:00Z (verified→verified): Production swap target updated from Arctic-M to Arctic-L per Phase 1 result (see https://docs.usejosh.com/operations/embedding-architecture/#model-choice). Arctic-L uses the standard XLM-Roberta backbone, so the swap no longer requires JOSH_EMBED_TRUST_REMOTE_CODE=1 or xformers. Smoke itself unchanged: it still uses BGE-large for minimal CI prereqs.
- 2026-05-10T11:00:00Z (planned→verified): Smoke script written and exits 0 against fresh substrate.