Vector snapshot distribution — "deploy in a day" (Phase 2)
Header
Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.
Why
Federal data is public domain and chunk IDs are stable; there is no
reason every Josh deployment re-embeds the same 50M-chunk corpus
from scratch. Once a week, a canonical snapshot of the chunk SQLite
DB and the vector parquet for the canonical model is published. New
deployments download both, import, and incrementally ingest forward
— minutes to first queryable substrate vs. hours of GPU spend on
Modal or days of CPU. This is the "deploy in a day" path. Because
federal data is public domain and chunk IDs are stable, the snapshot
is reproducible; that is the unique property the architecture exposes.
NOT IN SCOPE FOR PHASE 1. Documented as draft so the architectural
direction is visible; built only when external deployments ask for it.
User stories
As a new federation deployer, I want my substrate queryable in <1 hour, not 6 days so that I can demo Josh on a fresh box without burning Modal credits.
As the Foundation, I want to be the canonical source of federal-data vectors so that a new deployment is queryable in minutes, not days.
Acceptance criteria (EARS)
- STUB When the snapshot pipeline runs (weekly), the system shall publish content + vectors snapshots + manifest.json to a public R2/S3 bucket.
- STUB When a fresh deployment runs `josh bootstrap --snapshot <date>`, the system shall download both files, verify their hashes against the manifest, import vectors keyed by chunk_id, and resume incremental ingestion from the snapshot date.
- STUB Where a deployment uses a non-default embedding model, the system shall ignore the snapshot and embed via the configured provider (snapshot is opt-in).
Success determiner
Checklist
- Snapshot publishing job runs to completion against a recent corpus
- manifest.json matches actual file hashes
- Bootstrap path on a fresh VM produces a queryable substrate
Determiner upgrades to bash once the publishing pipeline exists.
Clarifications needed
None.
Out of scope
- Per-source incremental snapshots (whole-corpus only for v1 of this spec)
- Automatic snapshot fallback when bootstrap encounters errors
- Multi-model snapshots (one snapshot per model_version; multi-model is its own spec)
Dependencies
Plan
Two jobs (publishing on Cloud, bootstrap on every deployment) plus a
serialization format. Likely Parquet for vectors (compact, well-tooled)
with chunk_id, model_id, model_version, vector columns. Manifest
carries content + vectors + model identity + sha256s + timestamps.
Tasks
0 of 4 done.
- t1 [STUB] Snapshot serialization format + manifest schema
- t2 [STUB] Cloud-side weekly publishing job
- t3 [STUB] josh bootstrap --snapshot CLI command
- t4 [STUB] Hash verification + atomic import path
Changelog
No history yet.