substrateverifiedp0

josh-embedder service — drains the queue into vec0 tables

embedding-worker · updated — · owner rritz

Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.

A new top-level service peer to josh-core and josh-ingester,
responsible for one job: read pending rows from chunk_embedding_jobs,
embed them via the configured provider, write vectors into the
per-source vec0 virtual table, mark done. Decoupled from ingestion
so a failing GPU backend doesn't take ingestion down with it; runs
in its own Kamal app with its own scaling envelope.

As an ingestion pipeline, I want chunks I write to the queue eventually become vectors so that I don't have to know about embedding to do my job.

As an operator, I want a worker I can throttle independently of ingestion so that GPU spend doesn't ride my source-scraping cadence.

  1. When the worker polls the queue, the system shall claim up to `batch_size` jobs in a single `UPDATE ... RETURNING` statement, atomically transitioning them to `running`.
  2. Where the chunk row referenced by a claimed job no longer exists, the system shall mark that job `failed` with `last_error='source chunk row missing at claim time'`.
  3. When the provider raises `ProviderConfigError`, the system shall mark every claimed job `failed` (terminal) and continue serving subsequent batches.
  4. When the provider raises `ProviderUnavailableError`, the system shall revert claimed jobs to `pending` without consuming attempt budget.
  5. When the provider raises `ProviderTransientError`, the system shall revert to `pending` if `attempts < max_attempts`; otherwise mark `failed`.
  6. When the worker successfully embeds a batch, the system shall write `INSERT OR REPLACE` to `<chunk_table>_vec0`, stamp the chunk row's `embedded_model_*` columns, and mark the job `done` — all in one transaction.
  7. Where `chunk_table` contains characters outside `[A-Za-z0-9_]`, the system shall raise `ValueError` (defends against SQL-injection-shaped queue rows).
kindtest_file

Path

josh-embedder/tests/test_worker.py

Runner

uv run pytest josh-embedder/tests -v

None.

None.

Service lives at josh-embedder/. Module split:
settings.py (frozen dataclass, env-driven),
factory.py (provider switch),
jobs.py (queue I/O — claim, mark_done, mark_failed, queue_depth),
worker.py (poll loop, batch processing, error semantics),
cli.py (typer entrypoints: daemon, drain-once, status).
vec0 writes use vec_quantize_binary(?) for the bit[] companion to
avoid tagged-binary-format coupling.

7 of 7 done.

  • t1 Service skeleton with pyproject.toml, src/, tests/
  • t2 Atomic claim via UPDATE...RETURNING with subquery
  • t3 vec0 + chunk attribution + queue update in one transaction
  • t4 Three error classes mapped to terminal/non-terminal/budget-aware retries
  • t5 CLI with daemon, drain-once, status commands
  • t6 8 worker tests covering happy path, idempotency, resumability, all error paths
  • t7 chunk_table name validation (rejects injection-shaped values)
  • 2026-05-10T11:00:00Z plannedverified Worker + CLI + 8 tests all pass.

docs/spec/embedding-worker.html · generated by bin/build-spec.py