Spec

The spec is the single source of truth for what Josh Foundation v1 ships and how each item is verified. Each item carries an EARS-style acceptance criteria list and a tagged-union success determiner an agent can run mechanically.

Every spec item lives as a YAML file under docs/spec/data/<id>.yaml. These pages are generated by bin/build-spec.py. Click the pencil icon on any section of an individual spec page to edit in-browser; in Chrome / Edge / Arc, saves write back to disk silently via the File System Access API. In Safari / Firefox, saves download a patched YAML you drop back into docs/spec/data/. See repo structure for the surrounding layout.

9 draft 33 planned 3 in progress 1 blocked 17 verified 7 shipped

substrate draft p2

Litestream WAL streaming (path not taken)

Continuous WAL replication of `/data/josh.db` to S3-compatible storage with point-in-time recovery to any second within the retention window. **Evaluated and indefinitely deferred for the substrate.** Preserved as this spec for documentation of the path-not-taken; not on any roadmap.

substrate draft p1

Per-source retrieval-eval confirmation

The substrate has a default (model × chunker) pair locked from the CRS embedding ablation (mxbai-embed-large-v1 + fixed_token_512_50), but the win there was 0.026 NDCG — just over the 0.02 noise floor for a 50-query set. Other corpora (bills, Federal Register, hearings, U.S. Code, etc.) have meaningfully different shape: heavier structure, longer documents, more citations, transcript-style turn structure. The CRS-winning combo may underperform on those.

substrate draft p0

Pre-bulk embedding evaluation gate

Before the embedder is turned loose at full concurrency on a source's full corpus, we need a cheap, reproducible smoke that confirms (a) the chunker isn't producing degenerate output, (b) the embedder model's runtime maximum sequence length matches what the chunker is sized against, (c) the embedder hits expected throughput on the configured provider, and (d) 10 hand-authored sanity queries return plausible top-5s. Without this gate, a bad chunker config or a silently-truncating embedder ships thousands-to-millions of useless vectors before anyone notices, and fixing it means re-embedding the corpus.

substrate draft p1

Spec system as a living, agent-maintained surface

The spec system today is a one-way pipe: humans (and occasionally agents) author YAML; `bin/build-spec.py` renders HTML; `bin/sync-nav.py` propagates nav; humans read the result. Three things are missing that would make the system *living* — self-updating as work happens, and agent-extensible:

substrate draft p2

Vector snapshot distribution — "deploy in a day" (Phase 2)

Federal data is public domain and chunk IDs are stable; there is no reason every Josh deployment re-embeds the same 50M-chunk corpus from scratch. Once a week, a canonical snapshot of the chunk SQLite DB and the vector parquet for the canonical model is published. New deployments download both, import, and incrementally ingest forward — minutes to first queryable substrate vs. hours of GPU spend on Modal or days of CPU. This is the "deploy in a day" path. Because federal data is public domain and chunk IDs are stable, the snapshot is reproducible; that is the unique property the architecture exposes.

substrate planned p0

Substrate observability defaults — logging + embedding operational knobs

Pin the cross-cutting observability conventions every Josh service inherits — logging library, log format, log levels, request-id propagation, sensitive-data redaction — so debugging a production incident isn't four different grep patterns across four services. Today there's drift: `josh-ingester` uses `structlog` while `josh-embedder` uses Python's stdlib `logger`; `josh-core` has no convention; `josh-cli` (future) doesn't either. Without a substrate- wide answer, observability dashboards have to handle multiple shapes and agents debugging via log reads can't filter by stable fields.

substrate planned p0

Substrate source-ingester defaults

Pin the cross-cutting defaults every source ingester inherits, so adding a new source is "implement the Source protocol" rather than "rediscover HTTP timeouts, anti-bot strategy, PDF library choice, and disk layout per source." Today, working defaults exist in `josh-ingester/ingester/http.py` (User-Agent, retry, rate limit), `shared/josh_substrate/corpus.py` (filesystem layout), and `josh-ingester/ingester/scheduler.py` (timezone) but they're not documented as canonical conventions and four explicit cross-source decisions remain undecided across clarifications_needed in multiple ingester specs:

substrate planned p0

Substrate tombstone & supersession policy

Federal records get withdrawn, replaced, corrected, and superseded by upstream sources. Without a substrate-wide policy, each ingester invents its own response: bills hard-deletes, FR keeps the row, CRS bumps `updated_at`. Citations break inconsistently. Agents can't reason about why one source's withdrawn record returns 404 while another returns 200.

substrate planned p0

Volume-as-data-host bind mount

Every byte of substrate state — the SQLite file, raw payloads, normalized Markdown, advisory locks, backup staging — lives at `/data` on the host and is bind-mounted into containers as `/data:/data`. The host-side `/data` lives on a durable, redundant device (a block-storage volume on cloud, a RAID array on bare metal); the container's root filesystem is treated as throwaway. This is what makes the substrate portable across compute and what makes Kamal redeploys safe — recreate the container freely, the substrate is unaffected.

substrate planned p1

bin/spec-pickup.py — render a spec as a self-contained agent brief

The spec system already stores everything an executor needs — acceptance criteria, success determiner, plan, dependencies, tasks — but in a structured shape, not as a prose prompt. Today, queueing work for an agent means hand-writing a self-contained brief: copy the spec content, add ambient repo context (CLAUDE.md, conventions, recent commits), spell out guardrails. `bin/spec-pickup.py` mechanizes that: one command, one spec ID (or no ID for "what's next?"), one prompt ready to pipe into `spawn_task`, the chip UI, or any other agent harness.

substrate in progress p0

Provision substrate on OVHcloud bare-metal host

Stand up the substrate fresh on an **OVHcloud Advance-1 2024** dedicated bare-metal server in Vint Hill, VA. The prior DigitalOcean droplet + 500 GB block-storage volume was destroyed on 2026-05-10; the substrate is mostly regenerable from public APIs, so the rebuild path is "provision OVHcloud → run migrations → re-ingest" rather than rsync-and-cutover.

substrate in progress p1

Substrate cron scheduler

Every source declares a `schedule` cron string, but those declarations are inert until a daemon fires them. This spec wires an `apscheduler.AsyncIOScheduler` inside the `josh-ingester` container that walks `discover_sources()` on boot, registers one job per source whose `schedule` is non-null, and invokes the shared `run_source()` runner on each tick — picking up flock-based per-source advisory locks and writing every cron-fired run to `ingestion_runs` with `triggered_by='scheduler'` so cron runs are distinguishable from manual ones.

substrate blocked p0

Substrate nightly backup (restic to DO Spaces)

A nightly backup of the entire `/data` directory shipped to S3-compatible object storage via **restic** (block-level content-addressed dedupe), with retention managed by `restic forget` + `restic prune`. The backup covers everything durable: the SQLite substrate (`/data/josh.db`, taken via `sqlite3 .backup` into a staging file for consistency), the corpus directory (`/data/corpus/` — raw payloads + normalized Markdown), and any other state that lands under `/data` over time. Excludes are limited to active-write transient files (`-wal`, `-shm`, lock files).

substrate verified p1

/health/embedding observability endpoint

Operators need a single cheap endpoint that reports queue depth by status, the active query-time provider's identity, and a provider health probe. Without this, "is the embedding pipeline keeping up?" is a manual SQL query against the queue table — fine for triage, bad for dashboards.

substrate verified p1

Consolidate substrate services into a single-image Kamal deploy

Before this change, the substrate ran as three independent Kamal apps — `josh-core`, `josh-ingester`, `josh-embedder` — each with its own `Dockerfile`, `config/deploy.yml`, Docker Hub image, and Docker Hub build-cache image. Bringing the substrate up required three `kamal setup` invocations from three different working directories, three separate image builds, three pre-deploy hook firings (one of which actually ran alembic; the others early-exited), and triplicate maintenance of nearly-identical deploy YAML.

substrate verified p0

Embedding jobs queue + per-chunk model attribution

An embedding worker needs (1) an idempotent queue it can claim from atomically and (2) per-chunk attribution so retrieval code knows which model produced which vector. Migration 0005 introduces the `chunk_embedding_jobs` table (PK on chunk_table+chunk_id+model_id+ model_version, supporting side-by-side multi-model rebuilds) plus three columns on every `<source>_chunks` table: `embedded_model_id`, `embedded_model_version`, `embedded_at`. Without these, the worker has no durable claim mechanism and the substrate can't answer "is this chunk current with the active model?" without a join.

substrate verified p0

EmbeddingProvider protocol + uniform error hierarchy

Worker, query-time, and adapter code all need a single shape to talk about embeddings. The Protocol (in `shared/josh_substrate/embedding/protocol.py`) defines three callables and four required attributes; the error hierarchy (in `errors.py`) gives the worker uniform retry/terminate semantics regardless of which backend failed. Without this layer every adapter is an island.

substrate verified p1

HttpProvider — generic OpenAI-/TEI-compatible HTTP adapter

Covers OpenAI's `/v1/embeddings`, HuggingFace TEI, vLLM with `--task embed`, and any self-hosted server speaking either format. This is the escape hatch for deployments that already pay for an embedding API or run their own embedding service: no need to touch the substrate, just point `JOSH_HTTP_EMBED_URL` at it.

substrate verified p0

Local CI foundation — uv workspace + ruff + mypy + pytest + poe

Every later spec in the substrate inherits this layer. CI runs locally — not on GitHub Actions — so contributors can replicate the exact same checks pre-push, no cloud minutes burnt. The single entry point is `uv run poe ci`: lint + format-check + typecheck + unit + contract + spec-validate, all in <30s on a warm cache. Without this, every new spec's "success determiner" runs in an undocumented, drift-prone environment.

substrate verified p0

LocalSTProvider — sentence-transformers in-process adapter

The default provider for two scenarios: (1) query-time embedding in josh-core, where Modal cold-start (10–30s) is unacceptable for interactive latency, and (2) bulk embedding when a deployment has no GPU/remote backend and must fall back to CPU. Without LocalSTProvider, Josh deployments require Modal credentials or a TEI server to do anything; with it, the substrate is fully usable on a single VM.

substrate verified p1

ModalProvider — embeddings via a Modal-deployed function

GPU-backed bulk embedding via Modal's pay-per-second pricing makes the cheapest option for large backfills (~$3 to embed 50M chunks on an A10/L4 vs ~6 days on CPU). Adapter ships in `josh-substrate` and is self-contained; the deployed Modal app spec is shipped separately so federation users can `modal deploy` under their own account and pay their own bill — Foundation doesn't middleman GPU spend.

substrate verified p0

Provider contract test suite (parameterized over adapters)

An adapter is "correct" if and only if it satisfies the Protocol's contract. We make that assertable: a single pytest module parameterized over every registered `ProviderFactory` runs the same ~12 tests against each adapter. Drop in a new adapter, append its factory, run the suite — pass = production-ready. Without this suite, every adapter has its own ad-hoc tests and "works" means whatever each author thought to verify.

substrate verified p0

Query-time embedding in josh-core (singleton + /embed endpoint)

Retrieval queries need to embed the user's query string before vec0 lookup, with sub-50ms latency. josh-core loads a singleton `EmbeddingProvider` at process startup so per-request embedding skips model-load overhead. The `/embed` endpoint is a smoke surface — the production retrieval API uses the same singleton internally but is a separate spec (`rest-api-search`).

substrate verified p0

Refactor shipping code to match the conventions doc

The conventions doc (`https://docs.usejosh.com/operations/conventions/`) was authored from the patterns the two shipping ingesters already mostly follow. CI is now landed (see `ci-foundation`) — ruff + mypy + pytest + poe all run via `uv run poe ci`, but the legacy modules are *grandfathered* out of the strict checks via documented exclude lists. The grandfathering buys surgical-changes safety while the embedding work lands; this spec closes the gaps so the conventions doc actually applies to every file in the repo, not just the new code.

substrate verified p1

Substrate determiner verification

Make a success_determiner's green/red signal trustworthy enough that a scheduled, self-prompting routine can act on it without a human in the loop. The unlock is verification, not autonomy: a determiner is only trustworthy if it provably goes RED when the thing it checks actually breaks. This spec governs the layer that proves that — a target-agnostic `probe:` contract, a single substrate execution shim, adversarial mutation suites, and a CI gate (`poe verify-determiner-suite`) that fails the push if any determiner is gameable.

substrate verified p1

bin/embed-smoke.sh — end-to-end embedding pipeline verification

Per-component unit tests prove each piece in isolation, but they don't prove the pieces fit together when wired with real config. The smoke script does that in ~30 seconds (after first-run model download): fresh SQLite → migrations → seeded chunks + jobs → real `josh-embedder drain-once` invocation against a real `LocalSTProvider` → vec0 row verification → re-drain idempotence → status command. If any of those drift apart, the smoke catches it before deploy. Wired into `uv run poe smoke` and `uv run poe ci-full` so it's one command from any developer's machine.

substrate verified p1

https://docs.usejosh.com/operations/embedding-architecture/ — operator-facing reference

Operators, contributors, and future agents need one canonical place to read about the embedding pipeline: schema, protocol, worker semantics, query-time path, model swap mechanics, and the full env var matrix. Without this, the only source of truth is the code, which scatters the architectural reasoning across 8 files and a migration. The doc lives in the public `docs/` tree alongside `ingestion-architecture.html` so the pair tells the full ETL story.

substrate verified p0

josh-embedder service — drains the queue into vec0 tables

A new top-level service peer to `josh-core` and `josh-ingester`, responsible for one job: read pending rows from `chunk_embedding_jobs`, embed them via the configured provider, write vectors into the per-source `vec0` virtual table, mark done. Decoupled from ingestion so a failing GPU backend doesn't take ingestion down with it; runs in its own Kamal app with its own scaling envelope.

substrate shipped p0

Ingester modularity pass

Before more sources land (Federal Register is next), tighten the ingester's modularity so source work and CLI work can proceed in parallel without colliding. Three coordinated moves: subpackage the source modules (currently single 20–50KB files), add an optional per-source CLI extension point to the `Source` protocol (so source-specific commands like the existing `fix-crs-types` live with their source instead of in the global CLI), and reframe the add-a-new-source runbook to make explicit which steps touch source code vs. exercise the existing generic surface.

substrate shipped p0

Substrate FTS5 pipeline

Full-text search via SQLite's FTS5 — external-content virtual tables paired with the primary record tables, plus the standard three-trigger pattern (AFTER INSERT / AFTER UPDATE / AFTER DELETE) that keeps the FTS5 shadow row in sync with the parent row. BM25 with per-column weights is the ranking surface; `(10.0, 4.0, 2.0, 1.0)` for `(title, abstract, action, body)` is the working default, applied at query time in the search wrapper that ships with `rest-api-search`. FTS5 ranks better than Postgres `ts_rank_cd` for our use case (native IDF + length normalization); choosing FTS5 over the alternatives is one of the reasons the SQLite-as-substrate decision was a search upgrade, not a downgrade.

substrate shipped p0

Substrate SQLite foundation

SQLite + WAL + sqlite-vec + FTS5 — the storage foundation under everything else in Josh. One file at `/data/josh.db`, opened by `josh-core` (read-mostly) and `josh-ingester` (batch writes) via the same connection helper. The helper applies the substrate-wide PRAGMAs (WAL, foreign keys on, 30s busy_timeout, NORMAL synchronous, 256 MB mmap) and loads the bundled `vec0` extension on every fresh DB-API connection. Decision locked 2026-05-07 — see the storage-stack rationale in `https://docs.usejosh.com/operations/architecture/`.

substrate shipped p0

Substrate body normalization

Every text-bearing source needs raw payload preserved on disk *and* a normalized Markdown body the chunker, embedder, and FTS5 indexer consume. Per-format `Normalizer` implementations (`html_v1`, `text_v1` shipped; `fr_xml_v1`, `uslm_v1`, `ecfr_xml_v1`, `pdf_v1` planned) all conform to the `Normalizer` protocol in `josh_substrate.protocols` and return a `NormalizedBody` (markdown + heading tree + artifacts + warnings). The registry lookup (`josh_substrate.normalizers.registry`) is the single dispatch point — loaders never import a specific normalizer.

substrate shipped p0

Substrate citation metadata

Every record-bearing table in the substrate carries the same six citation columns (`source_url`, `source_org`, `published_at`, `retrieved_at`, `raw_sha256`, `citation_string`) so downstream agents can cite a result without a second round-trip. Per-source `citation_for()` formatters live in `josh_substrate.citations.formatters.<source>` and produce the human-readable `citation_string` from the row's metadata using a single shared style. Loader stamps citation columns at write time; API serializers re-use the same formatters for response payloads.

substrate shipped p0

Substrate migrations workflow

Schema is the source of truth: one Alembic migration per logical change, living inside the installed `josh_substrate` package and addressed via the `package:resource` syntax (`script_location = josh_substrate:migrations`). Production runs use a Kamal pre-deploy hook so `alembic upgrade head` fires against the new image *before* the container swap — failed migrations abort the deploy, never half-apply against the running substrate. Local development and tests use the same mechanism via a session-scoped template DB in `conftest.py`.

substrate shipped p0

Substrate vector pipeline

Per-source vector index via `sqlite-vec`'s `vec0` virtual table. Each text-bearing source gets a `<source>_chunks_vec0` companion that holds both the full-precision embedding (`float[1024]`) and a binary-quantized (`bit[1024]`) companion for the BQ+rescore retrieval pattern — ~32× storage compression and ~95 % recall versus full-precision float at brute-force scale. The chunk-attribution columns (`embedded_model_id`, `embedded_model_version`, `embedded_at`) on every chunks table let a model swap re-enqueue affected chunks without a full rebuild. `vec1` (IVFADC + OPQ) is the future ANN swap; until it releases, the substrate ships on `sqlite-vec` BQ+rescore.

source planned p0

Bills ingester

Bills are the centerpiece of the substrate — every other legislative source (public laws, committee reports, hearings, votes, CBO estimates) FKs back to a bill, and most agent queries either start at a bill or need to resolve one. Shipping bills early is the biggest schema test in v1: it stresses cross-source joins (legislators, committees), large body text (HR-1-class bills can run 1.9 MB of XML), inlined amendments (HR 1 119th has 493), and three independent change signals (RSS, sitemap `<lastmod>`, per-file `Last-Modified`). If the SQLite + FTS5 + sqlite-vec stack survives bills, it survives the v1 source roster.

source planned p0

CBO cost estimates ingester

Cost estimates and budgetary analyses from the Congressional Budget Office, the legislative branch's nonpartisan budget shop. Every bill ordered reported by a House or Senate committee gets a CBO score (when feasible) — 5- and 10-year budget impact. CBO also publishes broader analyses: federal budget outlook, agency-program reviews, working papers. For Josh, CBO is critical for the **vote-impact and budget-summary** workflows: "what's the 10-year cost of the SPEED Act," "what's CBO's score on H.R. 4920," "what's the budget outlook for FY2027." The bill ↔ CBO linkage already has a schema target (`bill_cbo_estimates` in `bills.html`); this ingester populates the source table CBO publications point at.

source planned p1

Committee reports ingester

Committee reports (H. Rep. and S. Rep. — GovInfo CRPT collection) are the written record of legislative intent: section-by-section analyses, dissents, CBO cost estimates inlined, and the committee vote rollcall. They are FK-target-rich (reports reference bills, members, and committees) and the best substrate answer to "why was this bill amended" / "what did the committee say" queries. Alternative 6th/7th launch source if Federal Register or hearings slip.

source planned p0

Congressional Record ingester

The official daily journal of floor proceedings in the House and Senate, plus the Daily Digest summary, plus the Extensions of Remarks. The Congressional Record is the record of what was said in Congress — every speech, every parliamentary motion, every quorum call, every vote tally. For Josh, this is the densest citation source in the v1 corpus: MODS metadata is extraordinarily rich (per-granule bioguide IDs for every speaking member, bills referenced, USC sections cited, Stat. cross- references, committees, votes, reports, amendments, RINs, laws). The citation graph more or less falls out of CREC MODS for free. Distinct from `hearing-transcripts-ingester` (committee hearings, CHRG); this is floor proceedings (CREC).

source planned p0

Federal Register ingester

Federal Register is the most cross-referenced regulatory source on the substrate — every other regulatory source FKs to it eventually. Shipping FR early validates the SQLite + FTS5 + sqlite-vec stack against a real, high-volume, body-text source (NPRMs and final rules routinely run 30K+ tokens). It's also the bedrock under "what regs touch X" queries that agents will run constantly.

source planned p1

GAO reports ingester

GAO reports are heavily-cited oversight + investigatory products from the legislative branch's audit arm — the substrate's strongest single signal for agency accountability and program-performance questions. Shipping GAO is also the substrate's first PDF-heavy, anti-bot-walled source, and forces the RSS-discovery + headed-browser-fetch pattern into the ingester framework. GAO is one of CRS / GAO / committee-reports as the 6th-or-7th launch source and is launch-desired (not launch-blocking).

source planned p0

Hearing transcripts ingester

The text of what was actually said in hearings — every speaker turn, every Q&A round, every witness statement read into the record. This is the highest-value-per-MB corpus in Josh: it lets the agent answer "what did Jensen Huang testify about export controls" or "summarize the GAO IG's testimony in the May 7 oversight hearing." Distinct from `hearings-ingester` (the meeting calendar metadata) — this source carries the body content of what was said.

source planned p0

Hearings and markup schedule ingester

Every committee meeting Congress holds — hearings, markups, business meetings, oversight panels — both as future-scheduled events (the calendar) and as past records (the meeting + witness metadata). This is the source-of-truth for "what is committee X doing this week" and "who testified at the Snowden hearing." Distinct from `hearing-transcripts` which carries the *content* of what was said; this source carries the *metadata* of when, where, who, and what bills were on the agenda.

source planned p0

LDA filings ingester (Senate + House unified at lda.gov)

Lobbying Disclosure Act filings: who is lobbying whom, on what bills, for which clients, and (for political contributions) where the lobbyists' money is going. Federal lobbying activity must be reported quarterly to both chambers; political contributions semi-annually. The Senate Office of Public Records and Clerk of the House publish through a unified API at `lda.gov/api/v1/` (the legacy `lda.senate.gov/api/v1/` is sunset 2026-06-30). For Josh, LDA filings are the primary signal of "who is influencing this legislation": a user looking at a bill should see the disclosed lobbyists who reported lobbying activity on it; a user looking at a Member should see contributions reported to/from PACs the lobbyists represent.

source planned p0

Public laws ingester

Public Laws are the "what actually became law" surface — the bridge between bills (the legislative process) and the U.S. Code (codified statute). The GovInfo PLAW collection ships USLM 2.0 XML for every enacted statute since 1995, and pairs cleanly with the existing `bill_laws` join table so agents can answer "what bill became this law?" and "what laws came out of this Congress?" without leaving the substrate.

source planned p0

Regulations.gov dockets ingester

When a federal agency proposes or finalizes a regulation, it opens a docket on regulations.gov for public comment. Each docket is the lifecycle container for a single rulemaking — the regulatory document(s), supporting analyses, comment count, and deadline. For Josh, dockets give us the bridge between Federal Register documents and the broader rulemaking proceeding: every FR rule cites a `docket_id`, and many dockets house multiple FR documents over their lifecycle (proposed rule → final rule → correction). Comment counts and deadlines are also high-signal for "what's open for comment now" agent queries. **Comment text itself is deferred to v2** (hundreds of GB to TB).

source planned p0

Roll-call votes ingester

Every recorded floor vote in the House and Senate — passage, motions to recommit, amendment votes, cloture, motion to proceed, election of Speaker, motions to table — and every member's individual vote on each. This is the most direct expression of legislator behavior in the substrate. FK-target-rich on bills (most roll calls attach to a `legis-num`) and on legislators (every recorded vote names the member), which makes "how did Senator X vote on the SPEED Act" or "every roll call on HR 1" trivial joins. Both chambers publish XML back to 1989, and the schemas are completely different. As of 2026-05-29 the Senate side is no longer bot-walled: `vote_menu_119_1.xml` and `vote_119_1_00001.xml` both return plain HTTP/1.1 200 (Server: Apache) to a bare `curl` (confirmed with default and `curl/8` UA), so plain `requests`/`httpx` fetch works for both chambers — no headed browser required.

source planned p0

Staff directories ingester (congressional staff + agency officials + caucus memberships)

Three thinly-related person datasets, all classified as **best-effort + community contribution**: (1) congressional staff (chiefs of staff, LDs, LAs, committee staff — the people who actually write bills), (2) federal agency officials (heads, deputies, key program directors), and (3) caucus memberships (Freedom Caucus, Problem Solvers, etc.). Bundled into one ingester because **all three share the same characteristic problem**: there is no canonical, machine-readable, comprehensively-maintained public source. Each requires fragmented per-site scraping or community-contributed datasets. Honest disclosure: this is the **weakest source area in v1** — partial coverage shipped, gap explicitly documented.

source planned p0

Statements of Administration Policy ingester

A Statement of Administration Policy (SAP) is an OMB memorandum on behalf of the President expressing the administration's position on pending legislation. Short (typically 1-3 pages), structured, and high-signal: a "Strongly Opposes" SAP frequently signals a veto threat; a "Supports" SAP often accompanies the President's signature when the bill becomes law. For Josh, SAPs answer "what's the White House position on this bill" — a question every Hill staffer asks weekly. Naturally FK'd into bills via the per-admin YAML's `bills`/`congress` fields (the repo README warns that filenames are arbitrary, so bills cannot be parsed from the slug). Small but high-value source.

source planned p0

Topic taxonomy ingester (LCSH subset)

Pull a curated subset of the Library of Congress Subject Headings (LCSH) into the substrate as a backbone vocabulary for cross-source topic alignment. Each ingested source comes with its own native topic vocabulary (`bill_subjects`, `fr_document_topics`, `gao_report_topics`, `lda_lobbying_activities.general_issue_code`, etc.), and the per-source vocabularies don't align. LCSH is the authority-controlled, library- jargony, well-defined vocabulary that gives us a stable backbone to reference; the per-source taxonomies map *into* it.

source planned p0

U.S. Code ingester

The U.S. Code is the canonical statute layer the citation graph anchors on: bills cite USC sections, FR rules amend or implement USC sections, CRS reports cite USC. Shipping the OLRC USLM 2.0 XML pipeline (with GovInfo htm/pdf as the human-readable companion) into `usc_titles` + `usc_sections` (+ structure + chunks) gives Josh the resolution target for "what does the law actually say" queries and turns every other source's `42 U.S.C. § 1396` reference into a real node instead of a dead string.

source planned p0

eCFR and CFR ingester

The Code of Federal Regulations (CFR) is the codified body of regulations issued by federal agencies — the regulatory analog to the US Code's statutes. Two sources, one substrate: **eCFR** (current state, daily updated, clean REST API) for "what is 40 CFR 60 today" queries, and **CFR Annual** (yearly snapshots from GovInfo) for citation-grade date-stamped historical reference. CFR is the regulatory citation target: every Federal Register rule cites CFR sections; bills cite CFR sections; CRS reports cite CFR sections. Closing the citation graph (bill → law → USC + Federal Register rule → CFR) needs this source.

source verified p0

CRS reports ingester

Congressional Research Service reports — Congress's in-house policy analysts. EveryCRSReport.com mirror is the cleanest of the v1 sources (CSV index + per-report JSON + HTML, conditional GET, daily refresh) so it served as the test case that validated the body-normalization + citation-metadata + FTS5 + sqlite-vec pipeline end to end before bills.

source verified p0

Legislators and committees ingester

The `unitedstates/congress-legislators` mirror (7 fetched YAML files; `committees-historical.yaml` is the 8th upstream file but intentionally out of scope at v1) is the canonical open-data source for every member of Congress (current + historical), every committee + subcommittee, and current committee assignments. It's the smallest second source — no body normalization, no vector pipeline, no chunking — which made it the right shape to validate that the add-a-new-source runbook generalizes beyond CRS before bills exposes the full schema surface. Every other Congress-side source FKs to `legislators.bioguide_id` and `committees.thomas_id`.

surface planned p0

CLI conventions

Pin the cross-cutting CLI conventions before per-verb specs land. Today the workspace ships two per-service Typer CLIs (`josh-ingester`, `josh-embedder`) that operators invoke via `kamal app exec --reuse '...'`, and three planned specs (`cli-init`, `cli-ingest`, `embedding-snapshot-distribution`) reference a unified `josh` binary that doesn't yet exist. Without this layer, each spec invents its own verb shape, output format, and exit-code mapping, and the OSS first-run UX fragments across three CLI names.

surface planned p0

CLI ingest

`josh ingest <source> [--since ...] [--max ...]` — the public CLI command for backfill and ad-hoc runs. Mirrors what the runner offers but with a stable, OSS-friendly UX.

surface planned p1

CLI init

`josh init` — first-run setup for OSS self-hosters. Creates `/data` skeleton, applies migrations, prompts for embedding-provider config, prints a smoke-test command.

surface planned p0

MCP server

MCP is the destination shape for Josh's agent-facing surface. Anthropic "Connectors" (Claude.ai + Cowork), ChatGPT "Apps", Claude Code, Cursor — all install a remote MCP server by URL. Locking in that one server is what lets Josh be installable in every major agent host with a single codebase, no per-vendor adapter.

surface planned p0

OSS startup scripts

Self-hoster bootstrapping scripts: `docker compose up` (or single `docker run`), starter `.env`, smoke ingest of one source. Required for OSS launch — code ships, data does not.

surface planned p0

REST API conventions

Pin the cross-cutting REST conventions every endpoint, every resource, and every consumer (REST clients, MCP server, CLI) inherits from. Without this layer, each downstream spec (`rest-api-resource-endpoints`, `rest-api-search`, `mcp-server`) silently picks its own pagination shape, error format, datetime style, and field-projection model — and when agents trained on different APIs hit Josh, they drift.

surface planned p0

REST API — aggregations

Roughly 17% of realistic queries against the substrate (11 of 64 in the `https://docs.usejosh.com/operations/query-coverage/` analysis) are analytical: counts, top-N rankings, group-by-and-tally, time-bucket histograms. Examples that drove this spec:

surface planned p0

REST API — dossiers

Cross-source fan-out from one ID is a high-volume agent shape. The 64-query coverage analysis (`https://docs.usejosh.com/operations/query-coverage/`) surfaced 11 questions (17%) that need this — e.g.:

surface planned p0

REST API — entity resolution

Registry sources (legislators, committees, lobbyists, staffers, taxonomies) need a typed fuzzy-name-to-ID resolver that is distinct from both search and ID lookup. The 64-query coverage analysis (`https://docs.usejosh.com/operations/query-coverage/`) surfaced this as Category 5: questions like "Sen. Markey from Mass," "Katie O'Brian on E&C," "Marc Lampkin lobbyist with uncertain spelling," "House Ag — which committee?" — each is a fuzzy match against short structured fields, not BM25 over a body and not an exact ID lookup.

surface planned p0

REST API — resource endpoints

Per-source resource fetches: `GET /v1/bills/{bill_id}`, `GET /v1/legislators/{bioguide_id}`, `GET /v1/committees/{committee_id}`, and friends. This is the **deterministic** half of the API — given a known identifier, return the full record + citation, never via FTS5 or vector search. Without this surface, identifier-bearing questions ("Does HR103 mention child protection?", "Who is on Ways and Means?") fall through into the search path, where dense embeddings reliably miss exact identifiers (Harvey, 2025). The substrate is already relational; this spec exposes that relational shape directly.

surface planned p0

REST API — unified search endpoint

Search is the agent's most-used tool. A single `GET /v1/search` that unifies FTS5 BM25 + vector retrieval across every text-bearing source is the smallest API surface that proves the substrate is a real product rather than a database with a pretty face. Per the v1 plan: this is the first endpoint shipped after schema validation against two real sources.

launch draft p2

AGENTS.md cross-tool agent instructions

Outside contributors using non-Claude AI tools (Cursor, Codex CLI, Aider, Continue) won't auto-load Josh's project guidance — those tools look for `AGENTS.md`, the cross-tool convention. Today everything lives in `CLAUDE.md`, which only Claude Code reads. To keep contribution quality high regardless of which agent a contributor uses, both filenames must resolve to the same content. Low priority — only matters once the repo is public and outside contributions land.

launch draft p0

Code-quality skill for Josh contributors

AI agents (Claude Code, Cursor, Codex) writing or reading Josh code need a consistent way to check files against `https://docs.usejosh.com/operations/conventions/`. Without a skill, every agent re-derives "what good Josh code looks like" from prose each session — which the CMU sugar-rush research (https://civic.io/2026/03/16/ending-the-sugar-rush/) shows is unreliable and produces drift over time. A skill compresses the conventions doc into a checkable contract: invoke `/josh-code-quality <path>`, get a structured report grouped by convention category. Same input → same output. Pairs with the PR review skill (separate spec) which evaluates whole changes.

launch draft p0

Dataset card and vision doc

HuggingFace-style dataset card (what's in the substrate, sources, licenses, freshness) + a vision doc framing Josh Foundation's role. Ships on day one of OSS launch.

launch draft p0

PR-review skill for Josh contributors

The CMU sugar-rush study (https://civic.io/2026/03/16/ending-the-sugar-rush/) argues that AI-assisted contributions should be reviewed by AI applying the *same* project guidance the writer applied — a "closed-loop quality system." Today, an outside contributor's PR gets eyeballed by a maintainer who has to remember every Josh convention. A PR-review skill encodes those conventions into a structured review pass: takes a diff (local branch or GitHub PR URL), evaluates against AGENTS.md, the four working rules, the conventions doc, the spec workflow rules, and the critical rules, and emits a verdict with reasoning. Pairs with the code-quality skill (separate spec) which evaluates whole files; this skill evaluates whole *changes*.

launch in progress p1

Public docs site on Astro Starlight at docs.usejosh.com

The current public docs (~50 hand-authored HTML files under `docs/`) carry a custom design system, hand-maintained sidebar, and a hand- authored breadcrumb pattern. That's fine for an internal-only working set but it's a contributor tax once the repo is public — every doc edit means hand-touching the HTML, the nav, the breadcrumbs, and the design tokens. The OSS launch needs a docs surface where a contributor can edit a single `.mdx` file, push a PR, and have the navigation/styling regenerate.

Schema and authoring

The spec format is anchored in a few sources of truth:

Adding a new spec

  1. For per-source ingester specs, copy docs/spec/data/_templates/source.yaml to docs/spec/data/<source-name>-ingester.yaml. The 14-task ingestion-built contract is baked in.
  2. For other specs, copy any of the seeded examples (substrate-volume-mount, federal-register-ingester, rest-api-search) and edit.
  3. Run bin/build-spec.py to validate + regenerate HTML, then bin/sync-nav.py to propagate the canonical sidebar.

docs/spec/index.html · generated by bin/build-spec.py