bin/embed-smoke.sh — end-to-end embedding pipeline verification

Why

Per-component unit tests prove each piece in isolation, but they don't
prove the pieces fit together when wired with real config. The smoke
script does that in ~30 seconds (after the first-run model download):
fresh SQLite → migrations → seeded chunks + jobs → real
`josh-embedder drain-once` invocation against a real LocalSTProvider →
vec0 row verification → re-drain idempotence → status command. If any
of those drift apart, the smoke catches it before deploy. It is wired
into `uv run poe smoke` and `uv run poe ci-full`, so it's one command
from any developer's machine.

User stories

As a contributor making cross-component changes, I want one command that exercises the whole embedding pipeline so that I find integration drift locally, not in production.

As an operator preparing a deploy, I want a quick end-to-end check before pushing to Hetzner so that I don't ship a worker that can't actually write to vec0.

Acceptance criteria (EARS)

When `bin/embed-smoke.sh` runs from the repo root, the system shall create a fresh tmp SQLite, apply migrations, seed 5 CRS chunks + jobs, drain the queue via `josh-embedder drain-once`, and exit 0 only if all assertions hold.
When the script verifies vec0 row count, the system shall query through `josh_substrate.db.async_engine` (which loads sqlite-vec) — not via the bare `sqlite3` CLI — because the virtual table is invisible without the extension.
When the queue has been fully drained, the system shall verify `chunk_embedding_jobs.status='done'` for all 5 rows AND `crs_report_chunks_vec0` has 5 rows AND every chunk has `embedded_at IS NOT NULL`.
When the script re-runs `drain-once` on an already-drained queue, the system shall verify zero new jobs are processed (idempotency proof).
When the smoke completes, the system shall remove its tmp directory automatically (trap on EXIT) so the disk doesn't accumulate test DBs.
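
The cleanup criterion above can be sketched as follows. This is a minimal illustration, not the script's actual code: the names `SMOKE_TMP`, `DB_PATH`, `cleanup`, and the file name `smoke.db` are assumptions made for the example.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical names: SMOKE_TMP, DB_PATH, and smoke.db are illustrative only.
SMOKE_TMP="$(mktemp -d)"

# Remove the temp directory; safe to call more than once.
cleanup() {
  rm -rf "$SMOKE_TMP"
}

# The EXIT trap fires on normal exit and when `set -e` aborts the script,
# so a failed assertion still leaves no test DB behind.
trap cleanup EXIT

DB_PATH="$SMOKE_TMP/smoke.db"
echo "→ using temp database at $DB_PATH"
```

Registering the trap immediately after `mktemp -d` keeps the window in which a failure could leak the directory as small as possible.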

Success determiner

kind: bash

Command

bash bin/embed-smoke.sh

Expect

✓ embed-smoke.sh: all checks passed

The first run downloads ~1.3 GB (BAAI/bge-large-en-v1.5) into the HF cache; subsequent runs are warm and finish in under 30 seconds. Production deployments swap to Snowflake/snowflake-arctic-embed-l-v2.0, which is also 1024-dim and uses the standard XLM-Roberta backbone with no custom modeling code. The smoke uses BGE-large to keep CI prerequisites minimal and to avoid binding the smoke to a model that the per-source retrieval eval might later swap. See https://docs.usejosh.com/operations/embedding-architecture/#model-choice for the rationale behind the Arctic-L production default.

Clarifications needed

None.

Out of scope

None.

Dependencies

Plan

Single bash script (~80 lines) at `bin/embed-smoke.sh`. Uses BGE-large
(1024 dim, no remote code, no xformers) so the smoke is reproducible
across environments. Each verification step echoes a "→" line so the
output is grep-friendly. The script runs under `set -euo pipefail` and
uses an EXIT trap for cleanup. Wired into `[tool.poe.tasks].smoke` so
`uv run poe smoke` and `uv run poe ci-full` invoke it.
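
One way the grep-friendly "→" lines could be produced is sketched below. The helper names `step`, `fail`, and `assert_eq`, and the "✗" failure marker, are hypothetical and chosen for illustration; the real script may structure its checks differently.

```shell
#!/usr/bin/env bash
set -euo pipefail

# step, fail, and assert_eq are hypothetical helper names for illustration.

# Print a progress line with a leading arrow so `grep '^→'` lists every step.
step() {
  printf '→ %s\n' "$*"
}

# Print a failure line to stderr and abort with a non-zero status.
fail() {
  printf '✗ %s\n' "$*" >&2
  exit 1
}

# Compare an expected and an actual value; report the step or abort.
assert_eq() {
  local label="$1" expected="$2" actual="$3"
  if [ "$expected" != "$actual" ]; then
    fail "$label: expected $expected, got $actual"
  fi
  step "$label: $actual"
}

assert_eq "sample count" 5 5
```

Sending failures to stderr keeps stdout limited to the "→" lines, so piping the output through `grep` or a CI log filter shows exactly which steps completed.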

Tasks

7 of 7 done.

t1 Script writes fresh SQLite, runs alembic upgrade head
t2 Seeds 5 CRS chunks + matching pending jobs via inline SQL
t3 Calls `uv run josh-embedder drain-once` and asserts processed count
t4 Verifies vec0 row count via Python (extension-loaded engine)
t5 Verifies chunk attribution columns are stamped
t6 Verifies re-drain is a no-op (`processed 0 jobs`)
t7 Wired into `poe smoke` and `poe ci-full` tasks
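
The drain and re-drain checks in t3 and t6 follow a run-twice pattern that can be sketched in isolation. In the sketch below, `drain_once` is a hypothetical stub standing in for `uv run josh-embedder drain-once`, and a temp file stands in for the `chunk_embedding_jobs` queue. Only the `processed 0 jobs` line comes from the task list; the `processed 5 jobs` wording for the first drain is an assumption.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in queue: one line per pending job. The real script reads
# chunk_embedding_jobs in SQLite instead.
QUEUE_FILE="$(mktemp)"
trap 'rm -f "$QUEUE_FILE"' EXIT
printf 'job-%s\n' 1 2 3 4 5 > "$QUEUE_FILE"

# Hypothetical stub for `uv run josh-embedder drain-once`: counts the
# pending lines, empties the queue, and reports the count.
drain_once() {
  local n
  n="$(wc -l < "$QUEUE_FILE" | tr -d ' ')"
  : > "$QUEUE_FILE"
  echo "processed $n jobs"
}

# Abort unless the captured output contains the wanted text.
expect_output() {
  local output="$1" wanted="$2"
  if ! grep -qF -- "$wanted" <<<"$output"; then
    echo "✗ expected '$wanted', got '$output'" >&2
    exit 1
  fi
  echo "→ $wanted"
}

# First drain should pick up all five seeded jobs (assumed output wording).
first="$(drain_once)"
expect_output "$first" "processed 5 jobs"

# Second drain on the now-empty queue must be a no-op.
second="$(drain_once)"
expect_output "$second" "processed 0 jobs"
```

Capturing the output in a variable before matching means a failed check can print what the command actually said, which is usually the first thing needed when the smoke breaks.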

Changelog

2026-05-27T00:00:00Z verified→verified Production swap target updated from Arctic-M to Arctic-L per Phase 1 result (see https://docs.usejosh.com/operations/embedding-architecture/#model-choice). Arctic-L uses the standard XLM-Roberta backbone, so the swap no longer requires JOSH_EMBED_TRUST_REMOTE_CODE=1 or xformers. Smoke itself unchanged — still uses BGE-large for minimal CI prereqs.
2026-05-10T11:00:00Z planned→verified Smoke script written and exits 0 against fresh substrate.