substratedraftp2

Vector snapshot distribution — "deploy in a day" (Phase 2)

embedding-snapshot-distribution · updated — · owner rritz

Header

Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.

Why

Federal data is public domain and chunk IDs are stable; there is no
reason every Josh deployment re-embeds the same 50M-chunk corpus
from scratch. Once a week, a canonical snapshot of the chunk SQLite
DB and the vector parquet for the canonical model is published. New
deployments download both, import, and incrementally ingest forward
— minutes to first queryable substrate vs. hours of GPU spend on
Modal or days of CPU. This is the "deploy in a day" path. Because
federal data is public domain and chunk IDs are stable, the snapshot
is reproducible; that is the unique property the architecture exposes.

NOT IN SCOPE FOR PHASE 1. Documented as draft so the architectural
direction is visible; built only when external deployments ask for it.

User stories

As a new federation deployer, I want my substrate queryable in <1 hour, not 6 days so that I can demo Josh on a fresh box without burning Modal credits.

As the Foundation, I want to be the canonical source of federal-data vectors so that a new deployment is queryable in minutes, not days.

Acceptance criteria (EARS)

STUB When the snapshot pipeline runs (weekly), the system shall publish content + vectors snapshots + manifest.json to a public R2/S3 bucket.
STUB When a fresh deployment runs `josh bootstrap --snapshot <date>`, the system shall download both files, verify their hashes against the manifest, import vectors keyed by chunk_id, and resume incremental ingestion from the snapshot date.
STUB Where a deployment uses a non-default embedding model, the system shall ignore the snapshot and embed via the configured provider (snapshot is opt-in).

Success determiner

kindmanual

Checklist

Snapshot publishing job runs to completion against a recent corpus
manifest.json matches actual file hashes
Bootstrap path on a fresh VM produces a queryable substrate

Determiner upgrades to bash once the publishing pipeline exists.

Clarifications needed

None.

Out of scope

Per-source incremental snapshots (whole-corpus only for v1 of this spec)
Automatic snapshot fallback when bootstrap encounters errors
Multi-model snapshots (one snapshot per model_version; multi-model is its own spec)

Dependencies

Plan

Two jobs (publishing on Cloud, bootstrap on every deployment) plus a
serialization format. Likely Parquet for vectors (compact, well-tooled)
with chunk_id, model_id, model_version, vector columns. Manifest
carries content + vectors + model identity + sha256s + timestamps.

Tasks

0 of 4 done.

t1 [STUB] Snapshot serialization format + manifest schema
t2 [STUB] Cloud-side weekly publishing job
t3 [STUB] josh bootstrap --snapshot CLI command
t4 [STUB] Hash verification + atomic import path

Changelog

No history yet.

docs/spec/embedding-snapshot-distribution.html · generated by bin/build-spec.py