Spec — Roadmap (ordered)

A sequenced view of the substrate's path to the open-source launch. The spec catalog filters by status and category; this page orders the same specs by when they need to land. Tiers are dependencies: each tier assumes everything above it.

Status: shipped, verified, in_progress, blocked, planned, draft · Priority: p0, p1, p2. Rows in each tier are roughly in execution order. "Source roster" tiers (3–6) follow the launch sequence in data-status.

Tier 0 — Substrate foundation (mostly done)

The substrate primitives every source depends on: SQLite schema, migrations, FTS5 + vec0 pipelines, embedding queue + worker + providers. This tier is largely landed.

Spec | Status | Priority | Notes
substrate-sqlite-foundation | shipped | p0 | SQLite + WAL config + connection conventions.
substrate-migrations-workflow | shipped | p0 | Alembic-style migrations; schema source of truth.
substrate-body-normalization | shipped | p0 | Raw → Markdown body normalization.
substrate-citation-metadata | shipped | p0 | Citation extraction + canonical IDs.
substrate-fts5-pipeline | shipped | p0 | FTS5 virtual tables + per-column weights.
substrate-vector-pipeline | shipped | p0 | sqlite-vec wiring; vec0 companion tables.
ingester-modularity-pass | shipped | p0 | Ingester scaffold + per-source module pattern.
conventions-refactor | verified | p0 | Code conformed to conventions doc.
ci-foundation | verified | p0 | uv run poe ci = lint + typecheck + tests + spec validate + nav check.
embedding-jobs-schema | verified | p0 | Job queue table + composite PK per (chunk, model, version).
embedding-provider-protocol | verified | p0 | One Protocol, three callables; uniform error hierarchy.
embedding-provider-local-st | verified | p0 | In-process sentence-transformers adapter.
embedding-provider-modal | verified | p1 | Modal GPU adapter for bulk backfills.
embedding-provider-http | verified | p1 | Generic OpenAI/TEI-compatible HTTP adapter.
embedding-contract-tests | verified | p0 | Provider contract suite, parameterized over adapters.
embedding-worker | verified | p0 | Daemon: drains the queue into vec0 tables.
embedding-query-integration | verified | p0 | josh-core singleton + /embed endpoint.
embedding-observability | verified | p1 | /health/embedding with queue stats.
embedding-smoke-test | verified | p1 | End-to-end pipeline smoke under uv run poe smoke.
embedding-architecture-doc | verified | p1 | Operator-facing reference. Updated 2026-05-27: Arctic-L is the production default.
substrate-single-image-deploy | verified | p1 | Substrate services consolidated into one Kamal image.
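
To make the embedding-jobs-schema row concrete, here is a minimal sketch of a job queue keyed by a composite primary key per (chunk, model, version). The table name, column names, and the enqueue helper are hypothetical and chosen for illustration; the actual schema lives in the spec and the migrations. The point it shows is that the composite key makes enqueueing idempotent: re-queueing the same chunk for the same model and version is a no-op, while a new model version produces a new job.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not taken from the spec.
SCHEMA = """
CREATE TABLE embedding_jobs (
    chunk_id      INTEGER NOT NULL,
    model         TEXT    NOT NULL,
    model_version TEXT    NOT NULL,
    status        TEXT    NOT NULL DEFAULT 'pending',
    PRIMARY KEY (chunk_id, model, model_version)
);
"""


def enqueue(conn, chunk_id, model, model_version):
    """Queue one embedding job; return True only if a new row was inserted."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO embedding_jobs (chunk_id, model, model_version) "
        "VALUES (?, ?, ?)",
        (chunk_id, model, model_version),
    )
    # An ignored duplicate changes zero rows.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
enqueue(conn, 1, "example-model", "v1")  # inserted
enqueue(conn, 1, "example-model", "v1")  # duplicate key, ignored
enqueue(conn, 1, "example-model", "v2")  # new version, inserted
```

Under this assumption a worker can re-scan chunks and call enqueue freely without creating duplicate jobs.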

Tier 1 — Host + storage + backup (unblock everything else)

Nothing else loads until the bare-metal host is reachable with /data bind-mounted and a working backup pipeline. The Litestream spec is here as documentation of a path-not-taken — restic+DO Spaces is canonical.

Spec | Status | Priority | Notes
substrate-bare-metal-host | in_progress | p0 | OVHcloud Vint Hill provision.
substrate-volume-mount | planned | p0 | Bind-mount /data:/data; no named Docker volumes.
substrate-nightly-backup | blocked | p0 | restic → DO Spaces, 30/8/12 retention. Blocked on host availability.
substrate-litestream-backup | draft | p2 | Path not taken. Documented for design-archaeology.
substrate-observability-defaults | planned | p0 | Logging + embedding operational knobs.
substrate-tombstone-policy | planned | p0 | Supersession + tombstones for re-fetched documents.
substrate-cron-scheduler | in_progress | p1 | Per-source incremental scheduling.
substrate-source-defaults | planned | p0 | Per-source ingester defaults (rate limits, retries, UA).
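
As an illustration of the 30/8/12 retention noted for substrate-nightly-backup, the sketch below builds a restic forget command line. It assumes the three numbers mean 30 daily, 8 weekly, and 12 monthly snapshots; the roadmap does not spell that out, so treat the mapping as a guess. The repository string is a placeholder and the helper name is hypothetical.

```python
def restic_forget_argv(repo, daily=30, weekly=8, monthly=12):
    """Build a `restic forget` command for a keep-daily/weekly/monthly policy.

    Assumption: 30/8/12 maps to daily/weekly/monthly snapshot counts.
    """
    return [
        "restic", "-r", repo, "forget",
        "--keep-daily", str(daily),
        "--keep-weekly", str(weekly),
        "--keep-monthly", str(monthly),
        "--prune",  # also delete data no longer referenced by any snapshot
    ]


# Placeholder repository; substitute the real Spaces endpoint and bucket.
argv = restic_forget_argv("s3:https://example.invalid/bucket")
print(" ".join(argv))
```

Building the argument list separately from running it keeps the policy easy to review and to unit-test before it is wired into a scheduler.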

Tier 2 — Eval gates (run before any bulk-embed)

Two gates protect every bulk-embed run from silently shipping useless vectors: a cheap pre-bulk smoke (chunker p95 / model max-seq sanity), and a per-source retrieval-eval that picks the live (model × chunker) for structurally-rich sources.

Spec | Status | Priority | Notes
substrate-embedding-evaluation | draft | p0 | Pre-bulk gate: ~200-doc smoke, 10 sanity queries, chunker bounds check. Runs before each source's first bulk-embed.
substrate-retrieval-eval-per-source | draft | p1 | Phase 2 chunker confirmation per structurally-rich source. ~$5 Modal budget per source.
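
The chunker bounds check in the pre-bulk gate can be sketched as follows. This is one possible reading of "chunker p95 / model max-seq sanity": compute the 95th percentile of chunk token counts over the smoke sample and require it to fit within the model's maximum sequence length. The function names and the nearest-rank percentile method are assumptions, not the spec's definition.

```python
def p95(values):
    """Nearest-rank 95th percentile of a non-empty sequence of numbers."""
    ordered = sorted(values)
    # ceil(0.95 * n) in integer arithmetic, avoiding float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def chunker_within_bounds(chunk_token_counts, model_max_seq):
    """Return True if the p95 chunk length fits the model's max sequence length."""
    if not chunk_token_counts:
        raise ValueError("no chunks sampled; the smoke run produced nothing to check")
    return p95(chunk_token_counts) <= model_max_seq


# Illustrative sample: 19 short chunks and one oversized outlier.
sample = [100] * 19 + [9000]
print(chunker_within_bounds(sample, model_max_seq=512))
```

A p95 threshold tolerates a few outlier chunks, which would be truncated at embed time, while still failing the gate when the chunker systematically overshoots the model's window.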

Tier 3 — Re-ingest verified sources on the new host

Two ingesters already shipped against the prior DigitalOcean droplet; their code is verified but the data needs reloading on the OVHcloud host. CRS reuses the v1 retrieval-eval; legislators are structured-only.

Spec | Status | Priority | Notes
legislators-and-committees-ingester | verified | p0 | Re-ingest 12,780 legislators + 230 committees + 3,879 assignments. Unblocks bills (FK on bioguide).
crs-reports-ingester | verified | p0 | Re-ingest 23,092 reports. First bulk-embed run under the new Arctic-L default; uses CRS v1 retrieval-eval result.

Tier 4 — Launch corpus (5–7 sources for the OSS launch)

The credible minimum the OSS launch ships against. Each structurally-rich source runs Phase 2 retrieval-eval before its bulk-embed. Order optimizes for demo legibility (FR first), FK dependencies (bills after legislators), and chunker reuse (USC + public laws share USLM; the chunker is written once).

Order | Spec | Status | Priority | Phase 2 eval?
1 | federal-register-ingester | planned | p0 | Yes: section-aware HD-driven chunker vs fixed-token
2 | bills-ingester | planned | p0 | Yes: USLM <section> chunker
3 | us-code-ingester | planned | p0 | Yes: USLM section-level + hybrid arm (citation-heavy)
4 | public-laws-ingester | planned | p0 | Reuses USC chunker + eval result
5 | roll-call-votes-ingester | planned | p0 | No: structured-only
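
The ordering constraints described for this tier can be checked mechanically. The sketch below encodes only the two dependencies the text states (bills after legislators because of the bioguide FK, and public laws after the US Code because they reuse its chunker) and derives a valid load order with the standard-library graphlib module. The dependency map is an illustration; other sources may have dependencies not listed here.

```python
from graphlib import TopologicalSorter

# Each key maps to the set of ingesters that must finish before it starts.
DEPENDS_ON = {
    "legislators-and-committees-ingester": set(),
    "federal-register-ingester": set(),
    "bills-ingester": {"legislators-and-committees-ingester"},  # FK on bioguide
    "us-code-ingester": set(),
    "public-laws-ingester": {"us-code-ingester"},  # reuses the USC chunker
    "roll-call-votes-ingester": set(),
}


def load_order(depends_on):
    """Return one valid execution order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(depends_on).static_order())


for name in load_order(DEPENDS_ON):
    print(name)
```

Several orders satisfy these constraints, so the hand-chosen order in the table (which also weighs demo legibility) is one valid option among many.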

Tier 5 — Surface (CLI + REST + MCP)

The substrate's external shape. The OSS launch needs josh init, josh ingest, the REST API conventions doc, the search endpoint, and an MCP server wrapping it. Specs land in parallel with Tier 4; cutover is when the launch sources are queryable end-to-end.

Spec | Status | Priority | Notes
cli-conventions | planned | p0 | Pipe-friendly, structured, agent-callable.
cli-init | planned | p1 | josh init: set up empty SQLite + migrations.
cli-ingest | planned | p0 | josh ingest --source=<name>.
oss-startup-scripts | planned | p0 | Docker-compose-friendly startup; documented prereqs.
rest-api-conventions | planned | p0 | Pagination, error shape, auth header conventions.
rest-api-resource-endpoints | planned | p0 | Per-source resource endpoints.
rest-api-search | planned | p0 | Unified hybrid search endpoint (FTS5 + vec0).
rest-api-aggregations | planned | p0 | Faceted aggregations.
rest-api-dossiers | planned | p0 | Pre-computed entity dossiers.
rest-api-entity-resolution | planned | p0 | Bioguide / committee / topic resolution.
mcp-server | planned | p0 | MCP wrapper around the REST surface.
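
The roadmap does not say how the unified search endpoint combines FTS5 and vec0 results. One common approach, shown here purely as an assumption, is reciprocal rank fusion (RRF): each document scores the sum of 1 / (k + rank) over the ranked lists it appears in. The function name, the document ids, and k = 60 are illustrative choices rather than anything specified by rest-api-search.

```python
def rrf_fuse(rankings, k=60):
    """Fuse several best-first ranked lists of document ids with reciprocal rank fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Illustrative inputs: ids as they might come back from a keyword and a vector query.
fts_hits = ["a", "b", "c"]
vec_hits = ["b", "d", "a"]
print(rrf_fuse([fts_hits, vec_hits]))  # → ['b', 'a', 'd', 'c']
```

RRF uses only ranks, so it avoids normalizing BM25 scores against vector distances, which are on unrelated scales.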

Tier 6 — Launch (cut release, blog, Show HN)

The OSS launch coordination: dataset card, vision doc, public README, then the actual cutover. Specs in private/strategy/specs/ handle the cloud-side concerns (admin UI, API-key auth, Stripe billing); those land independently.

Spec | Status | Priority | Notes
dataset-card-and-vision-doc | draft | p0 | Public dataset documentation (what the cloud's data contains, sources, freshness cadence).
agents-md-cross-tool-instructions | draft | p2 | AGENTS.md cross-tool instructions for agent clients.
josh-code-quality-skill | draft | p0 | Code-quality skill for contributors.
josh-pr-review-skill | draft | p0 | PR-review skill for contributors.
spec-pickup | planned | p1 | bin/spec-pickup.py: render a spec as a self-contained agent brief.

Tier 7 — Post-launch sources (cloud-first, then OSS)

Sources beyond the launch corpus. These land in the cloud product first, then in the OSS once their ingester is stable. Per-source Phase 2 evals continue as each new structurally-rich source approaches bulk-embed.

Order | Spec | Status | Priority | Notes
6 | committee-reports-ingester | planned | p1 | CRPT XML; section-aware chunker family.
7 | hearings-ingester | planned | p0 | Hearings metadata + markup schedule (structured-only).
8 | hearing-transcripts-ingester | planned | p0 | Speaker-turn chunker; Phase 2 eval required.
9 | congressional-record-ingester | planned | p0 | Reuses speaker-turn chunker + eval result.
10 | ecfr-and-cfr-ingester | planned | p0 | Likely reuses FR section-aware chunker.
11 | regulations-dot-gov-dockets-ingester | planned | p0 | Metadata-only; comment text deferred.
12 | gao-reports-ingester | planned | p1 | Recursive chunker; GovInfo provider after probe swap.
13 | statements-of-administration-policy-ingester | planned | p0 | Short docs; recursive or whole-doc.
14 | lda-filings-ingester | planned | p0 | Structured-only; lda.gov v1.
15 | cbo-cost-estimates-ingester | planned | p0 | Blocked on DataDome workaround.
16 | staff-directories-ingester | planned | p0 | Best-effort coverage; structured-only.
17 | topic-taxonomy-ingester | planned | p0 | LCSH subset loader.

Tier 8 — Substrate v1.x (after launch)

Optimizations and capability expansions queued for post-launch. Vector-snapshot distribution is the most consequential: it turns "first-day setup is a multi-day backfill" into "import the canonical snapshot, ingest the delta, query immediately."

Spec | Status | Priority | Notes
embedding-snapshot-distribution | draft | p2 | "Deploy in a day": weekly canonical snapshot of the chunk SQLite + vec0 vectors.

docs/spec/roadmap.html · Hand-authored from the spec catalog. Updated 2026-05-27 alongside the Arctic-L production-default resolution.