Spec — Roadmap (ordered)
A sequenced view of the substrate's path to the open-source launch. The spec catalog filters by status and category; this page orders the same specs by when they need to land. Tiers are dependencies: each tier assumes everything above it.
Status values: shipped, verified, in_progress, blocked, planned, draft. Priority values: p0, p1, p2. Rows in each tier are roughly in execution order. "Source roster" tiers (3–6) follow the launch sequence in data-status.
Tier 0 — Substrate foundation (mostly done)
The substrate primitives every source depends on: SQLite schema, migrations, FTS5 + vec0 pipelines, embedding queue + worker + providers. This tier is largely landed.
| Spec | Status | Priority | Notes |
|---|---|---|---|
| substrate-sqlite-foundation | shipped | p0 | SQLite + WAL config + connection conventions. |
| substrate-migrations-workflow | shipped | p0 | Alembic-style migrations; schema source of truth. |
| substrate-body-normalization | shipped | p0 | Raw → Markdown body normalization. |
| substrate-citation-metadata | shipped | p0 | Citation extraction + canonical IDs. |
| substrate-fts5-pipeline | shipped | p0 | FTS5 virtual tables + per-column weights. |
| substrate-vector-pipeline | shipped | p0 | sqlite-vec wiring; vec0 companion tables. |
| ingester-modularity-pass | shipped | p0 | Ingester scaffold + per-source module pattern. |
| conventions-refactor | verified | p0 | Code conformed to conventions doc. |
| ci-foundation | verified | p0 | uv run poe ci = lint + typecheck + tests + spec validate + nav check. |
| embedding-jobs-schema | verified | p0 | Job queue table + composite PK per (chunk, model, version). |
| embedding-provider-protocol | verified | p0 | One Protocol, three callables; uniform error hierarchy. |
| embedding-provider-local-st | verified | p0 | In-process sentence-transformers adapter. |
| embedding-provider-modal | verified | p1 | Modal GPU adapter for bulk backfills. |
| embedding-provider-http | verified | p1 | Generic OpenAI/TEI-compatible HTTP adapter. |
| embedding-contract-tests | verified | p0 | Provider contract suite, parameterized over adapters. |
| embedding-worker | verified | p0 | Daemon: drains the queue into vec0 tables. |
| embedding-query-integration | verified | p0 | josh-core singleton + /embed endpoint. |
| embedding-observability | verified | p1 | /health/embedding with queue stats. |
| embedding-smoke-test | verified | p1 | End-to-end pipeline smoke under uv run poe smoke. |
| embedding-architecture-doc | verified | p1 | Operator-facing reference. Updated 2026-05-27: Arctic-L is the production default. |
| substrate-single-image-deploy | verified | p1 | Substrate services consolidated into one Kamal image. |
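The embedding-jobs-schema row describes a job queue table with a composite primary key per (chunk, model, version). A minimal sketch of that idea using Python's standard `sqlite3` module is shown here. The table name, column names, and the `open_queue` / `enqueue` helpers are hypothetical and are not taken from the spec itself.

```python
import sqlite3

# Hypothetical DDL: table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS embedding_jobs (
    chunk_id      INTEGER NOT NULL,
    model         TEXT    NOT NULL,
    model_version TEXT    NOT NULL,
    status        TEXT    NOT NULL DEFAULT 'pending',
    PRIMARY KEY (chunk_id, model, model_version)
);
"""


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database and make sure the (assumed) queue table exists."""
    conn = sqlite3.connect(path)
    # WAL takes effect for file-backed databases; an in-memory database
    # simply keeps its "memory" journal mode.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.executescript(SCHEMA)
    return conn


def enqueue(conn: sqlite3.Connection, chunk_id: int, model: str, model_version: str) -> bool:
    """Queue one job; return False if the same (chunk, model, version) is already queued."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO embedding_jobs (chunk_id, model, model_version) "
        "VALUES (?, ?, ?)",
        (chunk_id, model, model_version),
    )
    conn.commit()
    # The composite primary key makes a duplicate insert a no-op (0 rows changed).
    return cur.rowcount == 1
```

With a key like this, re-enqueueing a chunk for the same model and version does nothing, while bumping the version creates a fresh job for the same chunk.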
Tier 1 — Host + storage + backup (unblock everything else)
Nothing else loads until the bare-metal host is reachable, /data is bind-mounted, and the backup pipeline works. The Litestream spec is kept here to document a path not taken; restic + DO Spaces is the canonical backup.
| Spec | Status | Priority | Notes |
|---|---|---|---|
| substrate-bare-metal-host | in_progress | p0 | OVHcloud Vint Hill provision. |
| substrate-volume-mount | planned | p0 | Bind-mount /data:/data; no named Docker volumes. |
| substrate-nightly-backup | blocked | p0 | restic → DO Spaces, 30/8/12 retention. Blocked on host availability. |
| substrate-litestream-backup | draft | p2 | Path not taken. Documented for design-archaeology. |
| substrate-observability-defaults | planned | p0 | Logging + embedding operational knobs. |
| substrate-tombstone-policy | planned | p0 | Supersession + tombstones for re-fetched documents. |
| substrate-cron-scheduler | in_progress | p1 | Per-source incremental scheduling. |
| substrate-source-defaults | planned | p0 | Per-source ingester defaults (rate limits, retries, UA). |
Tier 2 — Eval gates (run before any bulk-embed)
Two gates protect every bulk-embed run from silently shipping useless vectors: a cheap pre-bulk smoke test (chunker p95 / model max-seq sanity), and a per-source retrieval-eval that picks the live (model × chunker) pair for structurally-rich sources.
| Spec | Status | Priority | Notes |
|---|---|---|---|
| substrate-embedding-evaluation | draft | p0 | Pre-bulk gate: ~200-doc smoke, 10 sanity queries, chunker bounds check. Runs before each source's first bulk-embed. |
| substrate-retrieval-eval-per-source | draft | p1 | Phase 2 chunker confirmation per structurally-rich source. ~$5 Modal budget per source. |
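The chunker bounds check in the pre-bulk gate compares the chunker's p95 chunk length with the model's maximum sequence length. A rough sketch of such a check follows, assuming per-chunk token counts are already available. The function names, the nearest-rank percentile definition, and the pass criterion (p95 at or below max-seq) are assumptions for illustration, not the spec's definitions.

```python
from typing import Sequence


def p95(values: Sequence[int]) -> int:
    """Nearest-rank 95th percentile of a non-empty sequence."""
    if not values:
        raise ValueError("no chunk lengths to check")
    ordered = sorted(values)
    # 1-based nearest rank, ceil(0.95 * n), computed with integer arithmetic
    # to avoid floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def chunker_within_bounds(token_counts: Sequence[int], max_seq_len: int) -> bool:
    """Assumed gate: pass when the p95 chunk length fits the model's max sequence length."""
    return p95(token_counts) <= max_seq_len
```

Checking p95 instead of the maximum tolerates a few oversized outlier chunks while still catching a chunker whose typical output would be truncated by the model.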
Tier 3 — Re-ingest verified sources on the new host
Two ingesters already shipped against the prior DigitalOcean droplet; their code is verified but the data needs reloading on the OVHcloud host. CRS reuses the v1 retrieval-eval; legislators are structured-only.
| Spec | Status | Priority | Notes |
|---|---|---|---|
| legislators-and-committees-ingester | verified | p0 | Re-ingest 12,780 legislators + 230 committees + 3,879 assignments. Unblocks bills (FK on bioguide). |
| crs-reports-ingester | verified | p0 | Re-ingest 23,092 reports. First bulk-embed run under the new Arctic-L default; uses CRS v1 retrieval-eval result. |
Tier 4 — Launch corpus (5–7 sources for the OSS launch)
The credible minimum the OSS launch ships against. Each structurally-rich source runs Phase 2 retrieval-eval before its bulk-embed. Order optimizes for demo legibility (FR first), FK dependencies (bills after legislators), and chunker reuse (USC + public laws share USLM; the chunker is written once).
| Order | Spec | Status | Priority | Phase 2 eval? |
|---|---|---|---|---|
| 1 | federal-register-ingester | planned | p0 | Yes — section-aware HD-driven chunker vs fixed-token |
| 2 | bills-ingester | planned | p0 | Yes — USLM <section> chunker |
| 3 | us-code-ingester | planned | p0 | Yes — USLM section-level + hybrid arm (citation-heavy) |
| 4 | public-laws-ingester | planned | p0 | Reuses USC chunker + eval result |
| 5 | roll-call-votes-ingester | planned | p0 | No — structured-only |
Tier 5 — Surface (CLI + REST + MCP)
The substrate's external shape. The OSS launch needs josh init, josh ingest, the REST API conventions doc, the search endpoint, and an MCP server wrapping it. Specs land in parallel with Tier 4; cutover is when the launch sources are queryable end-to-end.
| Spec | Status | Priority | Notes |
|---|---|---|---|
| cli-conventions | planned | p0 | Pipe-friendly, structured, agent-callable. |
| cli-init | planned | p1 | josh init — set up empty SQLite + migrations. |
| cli-ingest | planned | p0 | josh ingest --source=<name>. |
| oss-startup-scripts | planned | p0 | Docker-compose-friendly startup; documented prereqs. |
| rest-api-conventions | planned | p0 | Pagination, error shape, auth header conventions. |
| rest-api-resource-endpoints | planned | p0 | Per-source resource endpoints. |
| rest-api-search | planned | p0 | Unified hybrid search endpoint (FTS5 + vec0). |
| rest-api-aggregations | planned | p0 | Faceted aggregations. |
| rest-api-dossiers | planned | p0 | Pre-computed entity dossiers. |
| rest-api-entity-resolution | planned | p0 | Bioguide / committee / topic resolution. |
| mcp-server | planned | p0 | MCP wrapper around the REST surface. |
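The rest-api-search row calls for a unified hybrid endpoint over FTS5 and vec0, but this roadmap does not say how the two result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below purely as an assumption and not as the spec's method. The function name and the `k = 60` constant are illustrative.

```python
from collections import defaultdict
from typing import Hashable, Iterable, List, Sequence


def reciprocal_rank_fusion(
    rankings: Iterable[Sequence[Hashable]], k: int = 60
) -> List[Hashable]:
    """Merge several best-first ranked ID lists into one fused ranking."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); IDs found by both
            # the keyword and the vector search accumulate a higher score.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; the stable sort keeps first-seen order on ties.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

For example, `reciprocal_rank_fusion([fts_ids, vec_ids])` would take one best-first ID list from a keyword query and one from a vector query. RRF uses only rank positions, so the two scoring scales never need to be normalized against each other.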
Tier 6 — Launch (cut release, blog, Show HN)
The OSS launch coordination: dataset card, vision doc, public README, then the actual cutover. Specs in private/strategy/specs/ handle the cloud-side concerns (admin UI, API-key auth, Stripe billing); those land independently.
| Spec | Status | Priority | Notes |
|---|---|---|---|
| dataset-card-and-vision-doc | draft | p0 | Public dataset documentation (what the cloud's data contains, sources, freshness cadence). |
| agents-md-cross-tool-instructions | draft | p2 | AGENTS.md cross-tool instructions for agent clients. |
| josh-code-quality-skill | draft | p0 | Code-quality skill for contributors. |
| josh-pr-review-skill | draft | p0 | PR-review skill for contributors. |
| spec-pickup | planned | p1 | bin/spec-pickup.py — render a spec as a self-contained agent brief. |
Tier 7 — Post-launch sources (cloud-first, then OSS)
Sources beyond the launch corpus. These land in the cloud product first, then in the OSS release once each ingester is stable. Per-source Phase 2 evals continue as each new structurally-rich source approaches bulk-embed.
| Order | Spec | Status | Priority | Notes |
|---|---|---|---|---|
| 6 | committee-reports-ingester | planned | p1 | CRPT XML; section-aware chunker family. |
| 7 | hearings-ingester | planned | p0 | Hearings metadata + markup schedule (structured-only). |
| 8 | hearing-transcripts-ingester | planned | p0 | Speaker-turn chunker; Phase 2 eval required. |
| 9 | congressional-record-ingester | planned | p0 | Reuses speaker-turn chunker + eval result. |
| 10 | ecfr-and-cfr-ingester | planned | p0 | Likely reuses FR section-aware chunker. |
| 11 | regulations-dot-gov-dockets-ingester | planned | p0 | Metadata-only; comment text deferred. |
| 12 | gao-reports-ingester | planned | p1 | Recursive chunker; GovInfo provider after probe swap. |
| 13 | statements-of-administration-policy-ingester | planned | p0 | Short docs; recursive or whole-doc. |
| 14 | lda-filings-ingester | planned | p0 | Structured-only; lda.gov v1. |
| 15 | cbo-cost-estimates-ingester | planned | p0 | Blocked on DataDome workaround. |
| 16 | staff-directories-ingester | planned | p0 | Best-effort coverage; structured-only. |
| 17 | topic-taxonomy-ingester | planned | p0 | LCSH subset loader. |
Tier 8 — Substrate v1.x (after launch)
Optimizations and capability expansions queued for post-launch. Vector-snapshot distribution is the most consequential: it turns "first-day setup is multi-day backfill" into "import the canonical snapshot, ingest the delta, query immediately."
| Spec | Status | Priority | Notes |
|---|---|---|---|
| embedding-snapshot-distribution | draft | p2 | "Deploy in a day" — weekly canonical snapshot of the chunk SQLite + vec0 vectors. |