Spec

The spec is the single source of truth for what Josh Foundation v1 ships and how each item is verified. Each item carries an EARS-style acceptance criteria list and a tagged-union success determiner an agent can run mechanically.

Every spec item lives as a YAML file under docs/spec/data/<id>.yaml. These pages are generated by bin/build-spec.py. Click the pencil icon on any section of an individual spec page to edit in-browser; in Chrome / Edge / Arc, saves write back to disk silently via the File System Access API. In Safari / Firefox, saves download a patched YAML you drop back into docs/spec/data/. See repo structure for the surrounding layout.

9 draft 33 planned 3 in progress 1 blocked 17 verified 7 shipped

substrate draft p2

Litestream WAL streaming (path not taken)

Continuous WAL replication of `/data/josh.db` to S3-compatible storage with point-in-time recovery to any second within the retention window. **Evaluated and indefinitely deferred for the substrate.** Preserved as this spec for documentation of the path-not-taken; not on any roadmap.

substrate draft p1

Per-source retrieval-eval confirmation

The substrate has a default (model × chunker) pair locked from the CRS embedding ablation (mxbai-embed-large-v1 + fixed_token_512_50), but the win there was 0.026 NDCG — just over the 0.02 noise floor for a 50-query set. Other corpora (bills, Federal Register, hearings, U.S. Code, etc.) have meaningfully different shape: heavier structure, longer documents, more citations, transcript-style turn structure. The CRS-winning combo may underperform on those.

substrate draft p0

Pre-bulk embedding evaluation gate

Before the embedder is turned loose at full concurrency on a source's full corpus, we need a cheap, reproducible smoke that confirms (a) the chunker isn't producing degenerate output, (b) the embedder model's runtime maximum sequence length matches what the chunker is sized against, (c) the embedder hits expected throughput on the configured provider, and (d) 10 hand-authored sanity queries return plausible top-5s. Without this gate, a bad chunker config or a silently-truncating embedder ships thousands-to-millions of useless vectors before anyone notices, and fixing it means re-embedding the corpus.

substrate draft p1

Spec system as a living, agent-maintained surface

The spec system today is a one-way pipe: humans (and occasionally agents) author YAML; `bin/build-spec.py` renders HTML; `bin/sync-nav.py` propagates nav; humans read the result. Three things are missing that would make the system *living* — self-updating as work happens, and agent-extensible:

substrate draft p2

Vector snapshot distribution — "deploy in a day" (Phase 2)

Federal data is public domain and chunk IDs are stable; there is no reason every Josh deployment re-embeds the same 50M-chunk corpus from scratch. Once a week, a canonical snapshot of the chunk SQLite DB and the vector parquet for the canonical model is published. New deployments download both, import, and incrementally ingest forward — minutes to first queryable substrate vs. hours of GPU spend on Modal or days of CPU. This is the "deploy in a day" path. Because federal data is public domain and chunk IDs are stable, the snapshot is reproducible; that is the unique property the architecture exposes.

substrate planned p0

Substrate observability defaults — logging + embedding operational knobs

Pin the cross-cutting observability conventions every Josh service inherits — logging library, log format, log levels, request-id propagation, sensitive-data redaction — so debugging a production incident isn't four different grep patterns across four services. Today there's drift: `josh-ingester` uses `structlog` while `josh-embedder` uses Python's stdlib `logger`; `josh-core` has no convention; `josh-cli` (future) doesn't either. Without a substrate- wide answer, observability dashboards have to handle multiple shapes and agents debugging via log reads can't filter by stable fields.

substrate planned p0

Substrate source-ingester defaults

Pin the cross-cutting defaults every source ingester inherits, so adding a new source is "implement the Source protocol" rather than "rediscover HTTP timeouts, anti-bot strategy, PDF library choice, and disk layout per source." Today, working defaults exist in `josh-ingester/ingester/http.py` (User-Agent, retry, rate limit), `shared/josh_substrate/corpus.py` (filesystem layout), and `josh-ingester/ingester/scheduler.py` (timezone) but they're not documented as canonical conventions and four explicit cross-source decisions remain undecided across clarifications_needed in multiple ingester specs:

substrate planned p0

Substrate tombstone & supersession policy

Federal records get withdrawn, replaced, corrected, and superseded by upstream sources. Without a substrate-wide policy, each ingester invents its own response: bills hard-deletes, FR keeps the row, CRS bumps `updated_at`. Citations break inconsistently. Agents can't reason about why one source's withdrawn record returns 404 while another returns 200.

substrate planned p0

Volume-as-data-host bind mount

Every byte of substrate state — the SQLite file, raw payloads, normalized Markdown, advisory locks, backup staging — lives at `/data` on the host and is bind-mounted into containers as `/data:/data`. The host-side `/data` lives on a durable, redundant device (a block-storage volume on cloud, a RAID array on bare metal); the container's root filesystem is treated as throwaway. This is what makes the substrate portable across compute and what makes Kamal redeploys safe — recreate the container freely, the substrate is unaffected.

substrate planned p1

bin/spec-pickup.py — render a spec as a self-contained agent brief

The spec system already stores everything an executor needs — acceptance criteria, success determiner, plan, dependencies, tasks — but in a structured shape, not as a prose prompt. Today, queueing work for an agent means hand-writing a self-contained brief: copy the spec content, add ambient repo context (CLAUDE.md, conventions, recent commits), spell out guardrails. `bin/spec-pickup.py` mechanizes that: one command, one spec ID (or no ID for "what's next?"), one prompt ready to pipe into `spawn_task`, the chip UI, or any other agent harness.

substrate in progress p0

Provision substrate on OVHcloud bare-metal host

Stand up the substrate fresh on an **OVHcloud Advance-1 2024** dedicated bare-metal server in Vint Hill, VA. The prior DigitalOcean droplet + 500 GB block-storage volume was destroyed on 2026-05-10; the substrate is mostly regenerable from public APIs, so the rebuild path is "provision OVHcloud → run migrations → re-ingest" rather than rsync-and-cutover.

substrate in progress p1

Substrate cron scheduler

Every source declares a `schedule` cron string, but those declarations are inert until a daemon fires them. This spec wires an `apscheduler.AsyncIOScheduler` inside the `josh-ingester` container that walks `discover_sources()` on boot, registers one job per source whose `schedule` is non-null, and invokes the shared `run_source()` runner on each tick — picking up flock-based per-source advisory locks and writing every cron-fired run to `ingestion_runs` with `triggered_by='scheduler'` so cron runs are distinguishable from manual ones.

substrate blocked p0

Substrate nightly backup (restic to DO Spaces)

A nightly backup of the entire `/data` directory shipped to S3-compatible object storage via **restic** (block-level content-addressed dedupe), with retention managed by `restic forget` + `restic prune`. The backup covers everything durable: the SQLite substrate (`/data/josh.db`, taken via `sqlite3 .backup` into a staging file for consistency), the corpus directory (`/data/corpus/` — raw payloads + normalized Markdown), and any other state that lands under `/data` over time. Excludes are limited to active-write transient files (`-wal`, `-shm`, lock files).

substrate verified p1

/health/embedding observability endpoint

Operators need a single cheap endpoint that reports queue depth by status, the active query-time provider's identity, and a provider health probe. Without this, "is the embedding pipeline keeping up?" is a manual SQL query against the queue table — fine for triage, bad for dashboards.

substrate verified p1

Consolidate substrate services into a single-image Kamal deploy

Before this change, the substrate ran as three independent Kamal apps — `josh-core`, `josh-ingester`, `josh-embedder` — each with its own `Dockerfile`, `config/deploy.yml`, Docker Hub image, and Docker Hub build-cache image. Bringing the substrate up required three `kamal setup` invocations from three different working directories, three separate image builds, three pre-deploy hook firings (one of which actually ran alembic; the others early-exited), and triplicate maintenance of nearly-identical deploy YAML.

substrate verified p0

Embedding jobs queue + per-chunk model attribution

An embedding worker needs (1) an idempotent queue it can claim from atomically and (2) per-chunk attribution so retrieval code knows which model produced which vector. Migration 0005 introduces the `chunk_embedding_jobs` table (PK on chunk_table+chunk_id+model_id+ model_version, supporting side-by-side multi-model rebuilds) plus three columns on every `<source>_chunks` table: `embedded_model_id`, `embedded_model_version`, `embedded_at`. Without these, the worker has no durable claim mechanism and the substrate can't answer "is this chunk current with the active model?" without a join.

substrate verified p0

EmbeddingProvider protocol + uniform error hierarchy

Worker, query-time, and adapter code all need a single shape to talk about embeddings. The Protocol (in `shared/josh_substrate/embedding/protocol.py`) defines three callables and four required attributes; the error hierarchy (in `errors.py`) gives the worker uniform retry/terminate semantics regardless of which backend failed. Without this layer every adapter is an island.

substrate verified p1

HttpProvider — generic OpenAI-/TEI-compatible HTTP adapter

Covers OpenAI's `/v1/embeddings`, HuggingFace TEI, vLLM with `--task embed`, and any self-hosted server speaking either format. This is the escape hatch for deployments that already pay for an embedding API or run their own embedding service: no need to touch the substrate, just point `JOSH_HTTP_EMBED_URL` at it.

substrate verified p0

Local CI foundation — uv workspace + ruff + mypy + pytest + poe

Every later spec in the substrate inherits this layer. CI runs locally — not on GitHub Actions — so contributors can replicate the exact same checks pre-push, no cloud minutes burnt. The single entry point is `uv run poe ci`: lint + format-check + typecheck + unit + contract + spec-validate, all in <30s on a warm cache. Without this, every new spec's "success determiner" runs in an undocumented, drift-prone environment.

substrate verified p0

LocalSTProvider — sentence-transformers in-process adapter

The default provider for two scenarios: (1) query-time embedding in josh-core, where Modal cold-start (10–30s) is unacceptable for interactive latency, and (2) bulk embedding when a deployment has no GPU/remote backend and must fall back to CPU. Without LocalSTProvider, Josh deployments require Modal credentials or a TEI server to do anything; with it, the substrate is fully usable on a single VM.

substrate verified p1

ModalProvider — embeddings via a Modal-deployed function

GPU-backed bulk embedding via Modal's pay-per-second pricing makes the cheapest option for large backfills (~$3 to embed 50M chunks on an A10/L4 vs ~6 days on CPU). Adapter ships in `josh-substrate` and is self-contained; the deployed Modal app spec is shipped separately so federation users can `modal deploy` under their own account and pay their own bill — Foundation doesn't middleman GPU spend.

substrate verified p0

Provider contract test suite (parameterized over adapters)

An adapter is "correct" if and only if it satisfies the Protocol's contract. We make that assertable: a single pytest module parameterized over every registered `ProviderFactory` runs the same ~12 tests against each adapter. Drop in a new adapter, append its factory, run the suite — pass = production-ready. Without this suite, every adapter has its own ad-hoc tests and "works" means whatever each author thought to verify.

substrate verified p0

Query-time embedding in josh-core (singleton + /embed endpoint)

Retrieval queries need to embed the user's query string before vec0 lookup, with sub-50ms latency. josh-core loads a singleton `EmbeddingProvider` at process startup so per-request embedding skips model-load overhead. The `/embed` endpoint is a smoke surface — the production retrieval API uses the same singleton internally but is a separate spec (`rest-api-search`).

substrate verified p0

Refactor shipping code to match the conventions doc

The conventions doc (`https://docs.usejosh.com/operations/conventions/`) was authored from the patterns the two shipping ingesters already mostly follow. CI is now landed (see `ci-foundation`) — ruff + mypy + pytest + poe all run via `uv run poe ci`, but the legacy modules are *grandfathered* out of the strict checks via documented exclude lists. The grandfathering buys surgical-changes safety while the embedding work lands; this spec closes the gaps so the conventions doc actually applies to every file in the repo, not just the new code.

substrate verified p1

Substrate determiner verification

Make a success_determiner's green/red signal trustworthy enough that a scheduled, self-prompting routine can act on it without a human in the loop. The unlock is verification, not autonomy: a determiner is only trustworthy if it provably goes RED when the thing it checks actually breaks. This spec governs the layer that proves that — a target-agnostic `probe:` contract, a single substrate execution shim, adversarial mutation suites, and a CI gate (`poe verify-determiner-suite`) that fails the push if any determiner is gameable.

substrate verified p1

bin/embed-smoke.sh — end-to-end embedding pipeline verification

Per-component unit tests prove each piece in isolation, but they don't prove the pieces fit together when wired with real config. The smoke script does that in ~30 seconds (after first-run model download): fresh SQLite → migrations → seeded chunks + jobs → real `josh-embedder drain-once` invocation against a real `LocalSTProvider` → vec0 row verification → re-drain idempotence → status command. If any of those drift apart, the smoke catches it before deploy. Wired into `uv run poe smoke` and `uv run poe ci-full` so it's one command from any developer's machine.

substrate verified p1

https://docs.usejosh.com/operations/embedding-architecture/ — operator-facing reference

Operators, contributors, and future agents need one canonical place to read about the embedding pipeline: schema, protocol, worker semantics, query-time path, model swap mechanics, and the full env var matrix. Without this, the only source of truth is the code, which scatters the architectural reasoning across 8 files and a migration. The doc lives in the public `docs/` tree alongside `ingestion-architecture.html` so the pair tells the full ETL story.

substrate verified p0

josh-embedder service — drains the queue into vec0 tables

A new top-level service peer to `josh-core` and `josh-ingester`, responsible for one job: read pending rows from `chunk_embedding_jobs`, embed them via the configured provider, write vectors into the per-source `vec0` virtual table, mark done. Decoupled from ingestion so a failing GPU backend doesn't take ingestion down with it; runs in its own Kamal app with its own scaling envelope.

substrate shipped p0

Ingester modularity pass

Before more sources land (Federal Register is next), tighten the ingester's modularity so source work and CLI work can proceed in parallel without colliding. Three coordinated moves: subpackage the source modules (currently single 20–50KB files), add an optional per-source CLI extension point to the `Source` protocol (so source-specific commands like the existing `fix-crs-types` live with their source instead of in the global CLI), and reframe the add-a-new-source runbook to make explicit which steps touch source code vs. exercise the existing generic surface.

substrate shipped p0

Substrate FTS5 pipeline

Full-text search via SQLite's FTS5 — external-content virtual tables paired with the primary record tables, plus the standard three-trigger pattern (AFTER INSERT / AFTER UPDATE / AFTER DELETE) that keeps the FTS5 shadow row in sync with the parent row. BM25 with per-column weights is the ranking surface; `(10.0, 4.0, 2.0, 1.0)` for `(title, abstract, action, body)` is the working default, applied at query time in the search wrapper that ships with `rest-api-search`. FTS5 ranks better than Postgres `ts_rank_cd` for our use case (native IDF + length normalization); choosing FTS5 over the alternatives is one of the reasons the SQLite-as-substrate decision was a search upgrade, not a downgrade.

substrate shipped p0

Substrate SQLite foundation

SQLite + WAL + sqlite-vec + FTS5 — the storage foundation under everything else in Josh. One file at `/data/josh.db`, opened by `josh-core` (read-mostly) and `josh-ingester` (batch writes) via the same connection helper. The helper applies the substrate-wide PRAGMAs (WAL, foreign keys on, 30s busy_timeout, NORMAL synchronous, 256 MB mmap) and loads the bundled `vec0` extension on every fresh DB-API connection. Decision locked 2026-05-07 — see the storage-stack rationale in `https://docs.usejosh.com/operations/architecture/`.

substrate shipped p0

Substrate body normalization

Every text-bearing source needs raw payload preserved on disk *and* a normalized Markdown body the chunker, embedder, and FTS5 indexer consume. Per-format `Normalizer` implementations (`html_v1`, `text_v1` shipped; `fr_xml_v1`, `uslm_v1`, `ecfr_xml_v1`, `pdf_v1` planned) all conform to the `Normalizer` protocol in `josh_substrate.protocols` and return a `NormalizedBody` (markdown + heading tree + artifacts + warnings). The registry lookup (`josh_substrate.normalizers.registry`) is the single dispatch point — loaders never import a specific normalizer.

substrate shipped p0

Substrate citation metadata

Every record-bearing table in the substrate carries the same six citation columns (`source_url`, `source_org`, `published_at`, `retrieved_at`, `raw_sha256`, `citation_string`) so downstream agents can cite a result without a second round-trip. Per-source `citation_for()` formatters live in `josh_substrate.citations.formatters.<source>` and produce the human-readable `citation_string` from the row's metadata using a single shared style. Loader stamps citation columns at write time; API serializers re-use the same formatters for response payloads.

substrate shipped p0

Substrate migrations workflow

Schema is the source of truth: one Alembic migration per logical change, living inside the installed `josh_substrate` package and addressed via the `package:resource` syntax (`script_location = josh_substrate:migrations`). Production runs use a Kamal pre-deploy hook so `alembic upgrade head` fires against the new image *before* the container swap — failed migrations abort the deploy, never half-apply against the running substrate. Local development and tests use the same mechanism via a session-scoped template DB in `conftest.py`.

substrate shipped p0

Substrate vector pipeline

Per-source vector index via `sqlite-vec`'s `vec0` virtual table. Each text-bearing source gets a `<source>_chunks_vec0` companion that holds both the full-precision embedding (`float[1024]`) and a binary-quantized (`bit[1024]`) companion for the BQ+rescore retrieval pattern — ~32× storage compression and ~95 % recall versus full-precision float at brute-force scale. The chunk-attribution columns (`embedded_model_id`, `embedded_model_version`, `embedded_at`) on every chunks table let a model swap re-enqueue affected chunks without a full rebuild. `vec1` (IVFADC + OPQ) is the future ANN swap; until it releases, the substrate ships on `sqlite-vec` BQ+rescore.

Spec

Litestream WAL streaming (path not taken)

Per-source retrieval-eval confirmation

Pre-bulk embedding evaluation gate

Spec system as a living, agent-maintained surface

Vector snapshot distribution — "deploy in a day" (Phase 2)

Substrate observability defaults — logging + embedding operational knobs

Substrate source-ingester defaults

Substrate tombstone & supersession policy

Volume-as-data-host bind mount

bin/spec-pickup.py — render a spec as a self-contained agent brief

Provision substrate on OVHcloud bare-metal host

Substrate cron scheduler

Substrate nightly backup (restic to DO Spaces)

/health/embedding observability endpoint

Consolidate substrate services into a single-image Kamal deploy

Embedding jobs queue + per-chunk model attribution

EmbeddingProvider protocol + uniform error hierarchy

HttpProvider — generic OpenAI-/TEI-compatible HTTP adapter

Local CI foundation — uv workspace + ruff + mypy + pytest + poe

LocalSTProvider — sentence-transformers in-process adapter

ModalProvider — embeddings via a Modal-deployed function

Provider contract test suite (parameterized over adapters)

Query-time embedding in josh-core (singleton + /embed endpoint)

Refactor shipping code to match the conventions doc

Substrate determiner verification

bin/embed-smoke.sh — end-to-end embedding pipeline verification

https://docs.usejosh.com/operations/embedding-architecture/ — operator-facing reference

josh-embedder service — drains the queue into vec0 tables

Ingester modularity pass

Substrate FTS5 pipeline

Substrate SQLite foundation

Substrate body normalization

Substrate citation metadata

Substrate migrations workflow

Substrate vector pipeline

Bills ingester

CBO cost estimates ingester

Committee reports ingester

Congressional Record ingester

Federal Register ingester

GAO reports ingester

Hearing transcripts ingester

Hearings and markup schedule ingester

LDA filings ingester (Senate + House unified at lda.gov)

Public laws ingester

Regulations.gov dockets ingester

Roll-call votes ingester

Staff directories ingester (congressional staff + agency officials + caucus memberships)

Statements of Administration Policy ingester

Topic taxonomy ingester (LCSH subset)

U.S. Code ingester

eCFR and CFR ingester

CRS reports ingester

Legislators and committees ingester

CLI conventions

CLI ingest

CLI init

MCP server

OSS startup scripts

REST API conventions

REST API — aggregations

REST API — dossiers

REST API — entity resolution

REST API — resource endpoints

REST API — unified search endpoint

AGENTS.md cross-tool agent instructions

Code-quality skill for Josh contributors

Dataset card and vision doc

PR-review skill for Josh contributors

Public docs site on Astro Starlight at docs.usejosh.com

Schema and authoring

Adding a new spec