Spec — Roadmap (ordered)

A sequenced view of the substrate's path to the open-source launch. The spec catalog filters by status and category; this page orders the same specs by when they need to land. Tiers are dependencies: each tier assumes everything above it.

Status: shipped, verified, in_progress, blocked, planned, draft · Priority: p0, p1, p2. Rows in each tier are roughly in execution order. "Source roster" tiers (3–6) follow the launch sequence in data-status.

Tier 0 — Substrate foundation (mostly done)

The substrate primitives every source depends on: SQLite schema, migrations, FTS5 + vec0 pipelines, embedding queue + worker + providers. This tier is largely landed.

Spec	Status	Priority	Notes
substrate-sqlite-foundation	shipped	p0	SQLite + WAL config + connection conventions.
substrate-migrations-workflow	shipped	p0	Alembic-style migrations; schema source of truth.
substrate-body-normalization	shipped	p0	Raw → Markdown body normalization.
substrate-citation-metadata	shipped	p0	Citation extraction + canonical IDs.
substrate-fts5-pipeline	shipped	p0	FTS5 virtual tables + per-column weights.
substrate-vector-pipeline	shipped	p0	sqlite-vec wiring; vec0 companion tables.
ingester-modularity-pass	shipped	p0	Ingester scaffold + per-source module pattern.
conventions-refactor	verified	p0	Code conformed to conventions doc.
ci-foundation	verified	p0	`uv run poe ci` = lint + typecheck + tests + spec validate + nav check.
embedding-jobs-schema	verified	p0	Job queue table + composite PK per (chunk, model, version).
embedding-provider-protocol	verified	p0	One Protocol, three callables; uniform error hierarchy.
embedding-provider-local-st	verified	p0	In-process sentence-transformers adapter.
embedding-provider-modal	verified	p1	Modal GPU adapter for bulk backfills.
embedding-provider-http	verified	p1	Generic OpenAI/TEI-compatible HTTP adapter.
embedding-contract-tests	verified	p0	Provider contract suite, parameterized over adapters.
embedding-worker	verified	p0	Daemon: drains the queue into vec0 tables.
embedding-query-integration	verified	p0	josh-core singleton + `/embed` endpoint.
embedding-observability	verified	p1	`/health/embedding` with queue stats.
embedding-smoke-test	verified	p1	End-to-end pipeline smoke under `uv run poe smoke`.
embedding-architecture-doc	verified	p1	Operator-facing reference. Updated 2026-05-27: Arctic-L is the production default.
substrate-single-image-deploy	verified	p1	Substrate services consolidated into one Kamal image.
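
The composite key on the job queue (one row per chunk, model, and version) is what makes re-enqueueing idempotent. The sketch below illustrates that property with the standard-library `sqlite3` module. The table name, column names, and the `arctic-l` / `v1` labels are hypothetical placeholders; only the (chunk, model, version) key shape comes from the table above.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not the substrate's real DDL.
SCHEMA = """
CREATE TABLE embedding_jobs (
    chunk_id      INTEGER NOT NULL,
    model         TEXT    NOT NULL,
    model_version TEXT    NOT NULL,
    status        TEXT    NOT NULL DEFAULT 'pending',
    PRIMARY KEY (chunk_id, model, model_version)
)
"""


def enqueue(conn: sqlite3.Connection, chunk_id: int, model: str, version: str) -> bool:
    """Queue one job. Returns True if a row was added, False if it already existed."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO embedding_jobs (chunk_id, model, model_version) "
        "VALUES (?, ?, ?)",
        (chunk_id, model, version),
    )
    # rowcount is 0 when the composite key already exists and the insert is ignored.
    return cur.rowcount == 1


def pending_count(conn: sqlite3.Connection) -> int:
    """Count jobs a worker would still need to drain."""
    row = conn.execute(
        "SELECT COUNT(*) FROM embedding_jobs WHERE status = 'pending'"
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

first = enqueue(conn, 1, "arctic-l", "v1")        # new job
duplicate = enqueue(conn, 1, "arctic-l", "v1")    # same key, ignored
new_version = enqueue(conn, 1, "arctic-l", "v2")  # same chunk, new version
```

Under this shape, bumping the model version queues fresh work for every chunk without touching rows for the previous version.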

Tier 1 — Host + storage + backup (unblock everything else)

Nothing else loads until the bare-metal host is reachable, `/data` is bind-mounted, and the backup pipeline works. The Litestream spec is kept here to document a path not taken; restic + DO Spaces is the canonical backup route.

Spec	Status	Priority	Notes
substrate-bare-metal-host	in_progress	p0	OVHcloud Vint Hill provision.
substrate-volume-mount	planned	p0	Bind-mount `/data:/data`; no named Docker volumes.
substrate-nightly-backup	blocked	p0	restic → DO Spaces, 30/8/12 retention. Blocked on host availability.
substrate-litestream-backup	draft	p2	Path not taken. Documented for design-archaeology.
substrate-observability-defaults	planned	p0	Logging + embedding operational knobs.
substrate-tombstone-policy	planned	p0	Supersession + tombstones for re-fetched documents.
substrate-cron-scheduler	in_progress	p1	Per-source incremental scheduling.
substrate-source-defaults	planned	p0	Per-source ingester defaults (rate limits, retries, UA).
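
As a rough illustration of the "30/8/12 retention" note, the sketch below builds a `restic forget` command line. It assumes the three numbers mean daily, weekly, and monthly snapshots, which the roadmap does not state. The repository string is a placeholder, and the helper name is hypothetical.

```python
def restic_forget_args(
    repo: str, daily: int = 30, weekly: int = 8, monthly: int = 12
) -> list[str]:
    """Build argv for a restic retention pass.

    Assumption: 30/8/12 maps to --keep-daily / --keep-weekly / --keep-monthly.
    """
    return [
        "restic",
        "--repo", repo,
        "forget",
        "--keep-daily", str(daily),
        "--keep-weekly", str(weekly),
        "--keep-monthly", str(monthly),
        "--prune",  # also delete data no longer referenced by any snapshot
    ]


# Placeholder repository; a real one would point at the DO Spaces bucket.
args = restic_forget_args("s3:example-endpoint/example-bucket")
```

The list can be handed to `subprocess.run(args, check=True)` from whatever scheduler runs the nightly job.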

Tier 2 — Eval gates (run before any bulk-embed)

Two gates protect every bulk-embed run from silently shipping useless vectors: a cheap pre-bulk smoke test (a sanity check of chunker p95 against the model's max-seq), and a per-source retrieval-eval that picks the live (model × chunker) pair for structurally-rich sources.

Spec	Status	Priority	Notes
substrate-embedding-evaluation	draft	p0	Pre-bulk gate: ~200-doc smoke, 10 sanity queries, chunker bounds check. Runs before each source's first bulk-embed.
substrate-retrieval-eval-per-source	draft	p1	Phase 2 chunker confirmation per structurally-rich source. ~$5 Modal budget per source.
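
One way to read the chunker bounds check: take the token counts from the smoke sample, compute the 95th percentile, and fail the gate if it exceeds the model's maximum sequence length. This is a minimal sketch under that assumption. The function names and the sample numbers are hypothetical, and the real gate may use a different percentile method or threshold.

```python
def p95(values: list[int]) -> int:
    """Nearest-rank 95th percentile, using integer math to avoid float rounding."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def chunker_bounds_ok(token_counts: list[int], model_max_seq: int) -> bool:
    """Pass when 95% of chunks fit inside the model's sequence window."""
    return p95(token_counts) <= model_max_seq


# Illustrative sample: chunk lengths of 1..100 tokens.
sample = list(range(1, 101))
fits_512 = chunker_bounds_ok(sample, model_max_seq=512)
fits_64 = chunker_bounds_ok(sample, model_max_seq=64)
```

A failing check would suggest that a meaningful share of chunks would be truncated at embed time, which is the kind of silent quality loss the gate exists to catch.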

Tier 3 — Re-ingest verified sources on the new host

Two ingesters already shipped against the prior DigitalOcean droplet; their code is verified but the data needs reloading on the OVHcloud host. CRS reuses the v1 retrieval-eval; legislators are structured-only.

Spec	Status	Priority	Notes
legislators-and-committees-ingester	verified	p0	Re-ingest 12,780 legislators + 230 committees + 3,879 assignments. Unblocks bills (FK on bioguide).
crs-reports-ingester	verified	p0	Re-ingest 23,092 reports. First bulk-embed run under the new Arctic-L default; uses CRS v1 retrieval-eval result.

Tier 4 — Launch corpus (5–7 sources for the OSS launch)

The credible minimum the OSS launch ships against. Each structurally-rich source runs Phase 2 retrieval-eval before its bulk-embed. Order optimizes for demo legibility (FR first), FK dependencies (bills after legislators), and chunker reuse (USC + public laws share USLM; the chunker is written once).

Order	Spec	Status	Priority	Phase 2 eval?
1	federal-register-ingester	planned	p0	Yes — section-aware HD-driven chunker vs fixed-token
2	bills-ingester	planned	p0	Yes — USLM `<section>` chunker
3	us-code-ingester	planned	p0	Yes — USLM section-level + hybrid arm (citation-heavy)
4	public-laws-ingester	planned	p0	Reuses USC chunker + eval result
5	roll-call-votes-ingester	planned	p0	No — structured-only
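
The hard constraints behind this ordering can be checked mechanically. The sketch below encodes only the two dependencies stated above (bills after legislators, public laws after USC) and uses the standard-library `graphlib` to produce one valid order. The short source names are hypothetical labels. A topological sort yields some valid order, not the hand-picked one in the table, which also weighs demo legibility.

```python
from graphlib import TopologicalSorter

# Each source maps to the sources that must land before it.
DEPENDS_ON = {
    "federal-register": set(),
    "bills": {"legislators-and-committees"},  # FK on bioguide
    "us-code": set(),
    "public-laws": {"us-code"},               # reuses the USC chunker + eval result
    "roll-call-votes": set(),
}


def ingest_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Return one ordering that respects every declared dependency."""
    return list(TopologicalSorter(depends_on).static_order())


order = ingest_order(DEPENDS_ON)
```

`TopologicalSorter` raises `graphlib.CycleError` if the declared dependencies ever form a loop, which makes this a cheap consistency check when a new source is added.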

Tier 5 — Surface (CLI + REST + MCP)

The substrate's external shape. The OSS launch needs `josh init`, `josh ingest`, the REST API conventions doc, the search endpoint, and an MCP server wrapping it. Specs land in parallel with Tier 4; cutover happens once the launch sources are queryable end-to-end.

Spec	Status	Priority	Notes
cli-conventions	planned	p0	Pipe-friendly, structured, agent-callable.
cli-init	planned	p1	`josh init` — set up empty SQLite + migrations.
cli-ingest	planned	p0	`josh ingest --source=<name>`.
oss-startup-scripts	planned	p0	Docker-compose-friendly startup; documented prereqs.
rest-api-conventions	planned	p0	Pagination, error shape, auth header conventions.
rest-api-resource-endpoints	planned	p0	Per-source resource endpoints.
rest-api-search	planned	p0	Unified hybrid search endpoint (FTS5 + vec0).
rest-api-aggregations	planned	p0	Faceted aggregations.
rest-api-dossiers	planned	p0	Pre-computed entity dossiers.
rest-api-entity-resolution	planned	p0	Bioguide / committee / topic resolution.
mcp-server	planned	p0	MCP wrapper around the REST surface.
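
The roadmap does not say how the hybrid search endpoint merges FTS5 and vec0 results. Reciprocal rank fusion is one common option, shown here purely as an illustration rather than as the substrate's method. The document IDs are made up, and `k=60` is a conventional default, not a value from the spec.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Illustrative inputs: top hits from a keyword arm and a vector arm.
fts_hits = ["a", "b", "c"]
vec_hits = ["b", "c", "d"]
fused = reciprocal_rank_fusion([fts_hits, vec_hits])
```

Documents that appear in both arms rise to the top, so here `b` and `c` outrank `a` even though `a` was the best keyword hit.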

Tier 6 — Launch (cut release, blog, Show HN)

The OSS launch coordination: dataset card, vision doc, public README, then the actual cutover. Specs in private/strategy/specs/ handle the cloud-side concerns (admin UI, API-key auth, Stripe billing); those land independently.

Spec	Status	Priority	Notes
dataset-card-and-vision-doc	draft	p0	Public dataset documentation (what the cloud's data contains, sources, freshness cadence).
agents-md-cross-tool-instructions	draft	p2	`AGENTS.md` cross-tool instructions for agent clients.
josh-code-quality-skill	draft	p0	Code-quality skill for contributors.
josh-pr-review-skill	draft	p0	PR-review skill for contributors.
spec-pickup	planned	p1	`bin/spec-pickup.py` — render a spec as a self-contained agent brief.

Tier 7 — Post-launch sources (cloud-first, then OSS)

Sources beyond the launch corpus. These land in the cloud product first, then in the OSS once their ingester is stable. Per-source Phase 2 evals continue as each new structurally-rich source approaches bulk-embed.

Order	Spec	Status	Priority	Notes
6	committee-reports-ingester	planned	p1	CRPT XML; section-aware chunker family.
7	hearings-ingester	planned	p0	Hearings metadata + markup schedule (structured-only).
8	hearing-transcripts-ingester	planned	p0	Speaker-turn chunker; Phase 2 eval required.
9	congressional-record-ingester	planned	p0	Reuses speaker-turn chunker + eval result.
10	ecfr-and-cfr-ingester	planned	p0	Likely reuses FR section-aware chunker.
11	regulations-dot-gov-dockets-ingester	planned	p0	Metadata-only; comment text deferred.
12	gao-reports-ingester	planned	p1	Recursive chunker; GovInfo provider after probe swap.
13	statements-of-administration-policy-ingester	planned	p0	Short docs; recursive or whole-doc.
14	lda-filings-ingester	planned	p0	Structured-only; lda.gov v1.
15	cbo-cost-estimates-ingester	planned	p0	Blocked on DataDome workaround.
16	staff-directories-ingester	planned	p0	Best-effort coverage; structured-only.
17	topic-taxonomy-ingester	planned	p0	LCSH subset loader.

Tier 8 — Substrate v1.x (after launch)

Optimizations and capability expansions queued for post-launch. Vector-snapshot distribution is the most consequential: it turns "first-day setup is multi-day backfill" into "import the canonical snapshot, ingest the delta, query immediately."

Spec	Status	Priority	Notes
embedding-snapshot-distribution	draft	p2	"Deploy in a day" — weekly canonical snapshot of the chunk SQLite + vec0 vectors.

docs/spec/roadmap.html · Hand-authored from the spec catalog. Updated 2026-05-27 alongside the Arctic-L production-default resolution.