U.S. Code ingester · Josh spec

Why

The U.S. Code is the canonical statute layer the citation graph anchors on:
bills cite USC sections, FR rules amend or implement USC sections, CRS
reports cite USC. Loading OLRC USLM 2.0 XML (with GovInfo
htm/pdf as the human-readable companion) into
usc_titles + usc_sections (+ structure + chunks) gives Josh the
resolution target for "what does the law actually say" queries and turns
every other source's 42 U.S.C. § 1396 reference into a real node instead
of a dead string.

User stories

As a policy analyst using my own AI agent, I want to ask "what does 42 U.S.C. § 1396a(b) say today?" and get the current statutory text with a citation so that I can ground a memo on the actual law instead of guessing from headlines.

As an OSS self-hoster, I want to run `josh ingest us-code` on a fresh substrate and get all 54 titles loaded so that I have a complete statute layer locally without standing up my own USLM parser.

As a cross-source consumer (bills, CRS, FR, eCFR), I want to resolve a `usc:42-1396a` reference parsed out of upstream text to a real `usc_sections` row so that citation-graph queries like "every CRS report touching § 1396a" return joined rows, not orphan strings.
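
The resolution step in the last story can be sketched as follows. This is a minimal illustration, not the resolver's actual implementation: the regex, `UscRef`, `parse_usc_cite`, and `section_key` are hypothetical names, and the pattern only covers cites shaped like `42 U.S.C. § 1396a(b)(2)(A)`. The section is kept as text so a suffix like `1396a` survives, and the sub-path is split off so the lookup lands on the section row.

```python
import re
from typing import NamedTuple, Optional


class UscRef(NamedTuple):
    title: int
    section: str   # text, so letter suffixes such as "1396a" survive
    subpath: str   # e.g. "(b)(2)(A)"; empty when the cite stops at the section


# Hypothetical pattern for cites shaped like "42 U.S.C. § 1396a(b)(2)(A)".
_USC_CITE = re.compile(
    r"\b(?P<title>\d{1,2})\s+U\.?\s?S\.?\s?C\.?\s*§*\s*"
    r"(?P<section>\d+[A-Za-z]*(?:-\d+[A-Za-z]*)?)"
    r"(?P<subpath>(?:\([0-9A-Za-z]+\))*)"
)


def parse_usc_cite(text: str) -> Optional[UscRef]:
    """Return the first USC cite found in text, or None."""
    m = _USC_CITE.search(text)
    if m is None:
        return None
    return UscRef(int(m["title"]), m["section"], m["subpath"])


def section_key(ref: UscRef) -> str:
    """Build the section-level key; the sub-path is deliberately dropped."""
    return f"usc:{ref.title}-{ref.section}"


ref = parse_usc_cite("Payment is governed by 42 U.S.C. § 1396a(b)(2)(A).")
print(ref, section_key(ref))
```

Dropping the sub-path from the key matches the section-only granularity proposed for v1; the sub-path stays available on the parsed reference for a later walk of the section body.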

Acceptance criteria (EARS)

When `josh ingest us-code --title 26 --max 100` is run against an empty substrate, the system shall populate `usc_sections` rows whose six citation columns (`source_url`, `source_org`, `published_at`, `retrieved_at`, `raw_sha256`, `citation_string`) are all non-null.
Where a section body is parsed from USLM XML, the system shall write a normalized Markdown body under `/data/corpus/us-code/bodies/markdown/` and the raw title-level USLM XML under `/data/corpus/us-code/bodies/raw/`.
While the ingester is walking a USLM document, if a single section fails to parse, then the system shall log the failure to `ingestion_logs` with the offending section identifier and continue with the next section in the title.
When an FTS5 query such as `MATCH '"tax-exempt"'` (quoted as a phrase, because FTS5 barewords cannot contain a hyphen) is executed against `usc_sections_fts`, the system shall return BM25-ranked results with the per-column weight tuple applied to (heading, chapeau, body).
When an incremental run completes, the system shall update `ingestion_source_state.last_seen_key` to the latest ingested release point (e.g. `rp:119-36`) such that the next run only fetches newer release points or annual editions.
Where a section row is loaded, the system shall populate the parent FK chain (`usc_titles.id` → `usc_structure_nodes.id` for chapter/subchapter/part/subpart → `usc_sections.title_id`) so that the title-and-structure graph is walkable from any section.
The system shall enforce a unique citation identifier per section (e.g. `26 USC § 501` resolves to exactly one current-edition row, with letter-suffix sections like `1396a` preserved as text).
Where a section body exceeds ~5K tokens, the system shall enqueue rows in `ingestion_embedding_queue` for chunk-level embedding.
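
The FTS5 criterion above can be exercised with a small in-memory sketch. The column list mirrors (heading, chapeau, body) from the criterion; the sample rows are made up and the weights `10.0, 3.0, 1.0` are placeholders, since the spec does not state the actual tuple. The search term is passed as a quoted phrase because FTS5 barewords cannot contain a hyphen.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE VIRTUAL TABLE usc_sections_fts USING fts5(heading, chapeau, body)"
)
# Made-up rows for illustration only; not real statutory text.
conn.executemany(
    "INSERT INTO usc_sections_fts (heading, chapeau, body) VALUES (?, ?, ?)",
    [
        ("Tax-exempt organizations", "An organization is exempt if",
         "it is described in this section"),
        ("Unrelated business income", "The tax applies to",
         "income of a tax-exempt entity"),
        ("Definitions", "For purposes of this title",
         "the term person includes corporations"),
        ("Effective date", "This chapter applies to",
         "years beginning after enactment"),
        ("Short title", "This Act may be cited as",
         "the sample revenue act"),
    ],
)

# bm25() takes one weight per column, in declaration order.
# Lower (more negative) scores rank first, so ascending order is best-first.
WEIGHTED_QUERY = """
SELECT heading, bm25(usc_sections_fts, 10.0, 3.0, 1.0) AS score
FROM usc_sections_fts
WHERE usc_sections_fts MATCH ?
ORDER BY score
"""

rows = conn.execute(WEIGHTED_QUERY, ('"tax-exempt"',)).fetchall()
for heading, score in rows:
    print(f"{score:8.4f}  {heading}")
```

With a heading weight well above the body weight, a hit in the heading should outrank a hit that only appears in the body, which is the behaviour the criterion asks for.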

Success determiner

Kind: bash

Command

set -euo pipefail
# Smoke backfill: a single small title is plenty (Title 1 is ~200 KB)
ssh josh 'docker exec josh-ingester josh ingest us-code --title 1 --max 100'
# Verify section rows landed with all citation columns populated
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
  SELECT COUNT(*) FROM usc_sections
  WHERE title_number = 1
    AND source_url IS NOT NULL AND source_org IS NOT NULL
    AND published_at IS NOT NULL AND retrieved_at IS NOT NULL
    AND raw_sha256 IS NOT NULL AND citation_string IS NOT NULL;"'
# FTS5 sanity — Title 1 ('General Provisions') contains 'United States'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db \
  "SELECT COUNT(*) FROM usc_sections_fts WHERE usc_sections_fts MATCH '\''United States'\'';"'
# Section-graph integrity: every loaded section FKs back to its title
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
  SELECT COUNT(*) FROM usc_sections s
  LEFT JOIN usc_titles t ON s.title_id = t.id
  WHERE s.title_number = 1 AND t.id IS NULL;"'

Expect

>= 1 fully-populated section row; FTS5 match returns >= 1; orphan-FK count = 0.

Smoke run only — single small title (Title 1). Full 54-title backfill (~3-5 GB USLM XML from the OLRC release-point bundle `xml_uscAll@{c}-{n}.zip`, ~50K-60K sections) runs detached on the server per the `add-a-new-source.html` runbook (`docker exec -d` pattern).

Clarifications needed

Granularity: section-level rows only for v1, or also a `usc_subsections` table for `(b)(2)(A)` sub-section paths? Lean section-only per the per-source doc; defer subsection rows to v1.x.
Amendments: store only the latest edition + latest release point, or a versioned `usc_section_versions` table for time-travel queries? Lean latest-only for v1; document the lag honestly.
Release-point cadence: USC release points appear every 1-3 months as Congress passes laws. Monthly poll is plenty — confirm before registering the cron, and decide whether to also do a yearly full-edition refresh.
Historical retention: keep last N annual editions (e.g. 5 × ~5 GB = 25 GB) for 'as of date X' queries, or always-current only? Lean current-only for v1.
Chunking strategy for very long sections: Title 26 (IRC) and Title 42 (The Public Health and Welfare) have sections that run hundreds of pages. Decide the chunker (recursive vs section-aware via USLM hierarchy) per the chunking experiment framework in `https://docs.usejosh.com/operations/chunking/`.
OLRC vs GovInfo as the primary fetcher when both work — OLRC release points (`xml_uscAll@{c}-{n}.zip` at uscode.house.gov/download) are the only source of USLM 2.0 full-text XML, so OLRC is primary; GovInfo per-title ZIPs (html/*.htm + pdf/*.pdf + metadata XML, no USLM full text — verified by unzipping the 2023 title1/title5 packages) are the human-readable companion/fallback; reconfirm before backfill.

Out of scope

State codes (CA, NY, TX statute corpora) — Phase 6.
Historical USC versions pre-1994 — USLM didn't exist; pre-1994 editions are PDF-only and don't fit this pipeline.
Subsection-level row decomposition (`usc:42-1396a(b)(2)(A)` as its own row) — defer to v1.x; v1 retrieves at section granularity and the citation resolver walks `body_xml` for sub-paths.
Popular-name table, tables-of-acts-cited, parallel-table hyperlink overlays — defer; v1 ships with section-level coverage only.
Real-time public-law integration (closing the 1-3 month OLRC integration lag ourselves) — out of scope; document the lag.
Appendix/Tables volumes (Title 50 Appendix, Title 18 Appendix) — defer to v1.x; needs a small schema accommodation (`appendix_designation` or similar).

Dependencies

Plan

Source module follows the standard Source protocol from
https://docs.usejosh.com/operations/ingestion-architecture/. Per-source spec at
https://docs.usejosh.com/sources/us-code/ (probe findings + indicative DDL — the
Alembic migration under shared/josh_substrate/src/josh_substrate/migrations/versions/
is the source of truth, per CLAUDE.md "Where the schema lives").

Discover/fetch/parse/load shape:
- discover: OLRC release-point landing page
https://uscode.house.gov/download/download.shtml lists release points
(xml_uscAll@{c}-{n}.zip, e.g. xml_uscAll@119-93.zip) — the USLM 2.0
full-text source. Watermark is
ingestion_source_state.last_seen_key = "{edition_year}|{release_point}"
so the next run only fetches newer editions or release points. GovInfo
per-year sitemap
https://www.govinfo.gov/sitemap/USCODE_{YYYY}_sitemap.xml (one <loc>
per title) layers on for the human-readable htm/pdf companion artifacts.
- fetch: OLRC release-point ZIP at
https://uscode.house.gov/download/releasepoints/us/pl/{c}/{n}/xml_uscAll@{c}-{n}.zip,
extracted to pull the per-title usc{N}.xml USLM 2.0 files (verified:
the GovInfo per-title ZIP USCODE-{YYYY}-title{N}.zip contains only
html/*.htm, pdf/*.pdf and metadata XML — NOT USLM full text — by
unzipping the 2023 title1/title5 packages). Save the raw ZIP
bytes under /data/corpus/us-code/bodies/raw/; SHA256 of the ZIP is
the title's raw_sha256 for citation. The GovInfo per-title ZIP is a
fallback for the human-readable htm/pdf rendering only; OLRC is the
primary USLM source.
- parse: uslm_v2 lxml walker. The walker traverses
<title>/(<subtitle>?)/<chapter>/(<subchapter>?)/(<part>?)/(<subpart>?)/<section>
emitting one usc_titles row, N usc_structure_nodes rows
(chapter/subchapter/part/subpart with parent FKs), and N
usc_sections rows. For each <section>: render heading + chapeau
+ body (subsections/paragraphs/subparagraphs/clauses preserved as
nested Markdown lists) → Markdown to
/data/corpus/us-code/bodies/markdown/; capture <sourceCredit>
verbatim into source_credit_text and extract public-law refs into
enacted_by_law_ids (junction table usc_section_enacting_laws,
not a SQLite array — pattern from FR fr_document_rins).
- load: idempotent upsert on (title_number, section_number, edition_year)
natural key. Junction-table inserts for enacting-law refs and
cross-section references (no SQLite arrays). FTS5 sync triggers fire
on insert/update/delete. Long-section chunks enqueued for embedding
via ingestion_embedding_queue. is_repealed / is_omitted /
is_reserved flags preserve placeholder rows so historical
citations still resolve.
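
The parse step, including the "log and continue" behaviour for a section that fails to parse, can be sketched as below. Several things here are assumptions: the spec calls for an lxml walker, while this sketch uses the standard library's `xml.etree.ElementTree` to stay dependency-free; the sample document is a heavily simplified stand-in with a placeholder namespace; and `walk`, `parse_section`, and the record shapes are hypothetical. Tags are matched by local name so the sketch does not depend on a particular namespace URI.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a USLM title file; the namespace is a placeholder.
SAMPLE = """<uscDoc xmlns="urn:example:uslm">
  <main>
    <title identifier="/us/usc/t1">
      <num value="1">Title 1</num>
      <heading>GENERAL PROVISIONS</heading>
      <chapter identifier="/us/usc/t1/ch1">
        <num value="1">CHAPTER 1</num>
        <heading>RULES OF CONSTRUCTION</heading>
        <section identifier="/us/usc/t1/s1">
          <num value="1">§ 1.</num>
          <heading>Words denoting number, gender, and so forth</heading>
          <content>Sample body text for the sketch.</content>
        </section>
        <section identifier="/us/usc/t1/s2">
          <heading>Deliberately malformed: no num element</heading>
        </section>
      </chapter>
    </title>
  </main>
</uscDoc>"""

STRUCTURE_TAGS = {"subtitle", "chapter", "subchapter", "part", "subpart"}


def local(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def child_text(elem, name):
    """Whitespace-normalized text of the first direct child called name."""
    for child in elem:
        if local(child.tag) == name:
            return " ".join("".join(child.itertext()).split())
    return None


def parse_section(sec, parent):
    num = next((c for c in sec if local(c.tag) == "num"), None)
    if num is None or not num.get("value"):
        raise ValueError("section has no <num value=...>")
    return {
        "identifier": sec.get("identifier"),
        "section_number": num.get("value"),  # kept as text, e.g. "1396a"
        "heading": child_text(sec, "heading"),
        "body": child_text(sec, "content") or "",
        "parent": parent,
    }


def walk(elem, parent=None):
    """Yield ("node" | "section" | "error", record) pairs depth-first."""
    for child in elem:
        kind = local(child.tag)
        ident = child.get("identifier")
        if kind in STRUCTURE_TAGS:
            yield "node", {"kind": kind, "identifier": ident, "parent": parent}
            yield from walk(child, parent=ident)
        elif kind == "section":
            try:
                row = parse_section(child, parent)
            except ValueError as exc:
                # One bad section is reported and skipped, not fatal.
                yield "error", {"identifier": ident, "reason": str(exc)}
                continue
            yield "section", row


root = ET.fromstring(SAMPLE)
title = next(e for e in root.iter() if local(e.tag) == "title")
records = list(walk(title))
for kind, record in records:
    print(kind, record["identifier"])
```

In a real run the "error" records would presumably be written to `ingestion_logs` and the "node" and "section" records handed to the load step; rendering nested subsections to Markdown is left out of the sketch.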

Migration shape: 0NNN_us_code.py with usc_titles,
usc_structure_nodes (self-referential parent_id for the
chapter→subchapter→part→subpart graph), usc_sections (with FK to
usc_titles.id), usc_section_enacting_laws junction,
usc_sections_fts virtual + 3 sync triggers (BM25 with column weights
applied to heading/chapeau/body), usc_section_chunks and
usc_section_chunk_vecs (vec0 with float[1024] and bit[1024]).
Indices on (title_number, section_number), a partial index restricted by
WHERE NOT (is_repealed OR is_omitted) for active-section filtering,
and an index on usc_structure_nodes(parent_id) for graph walks.
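
A reduced sketch of that shape, runnable against an in-memory SQLite database, is shown below. It is indicative only: the Alembic migration remains the source of truth, most columns are omitted, and `structure_node_id`, `kind`, `label`, the index names, and the sample edition year are assumptions made for the illustration. It shows the natural-key upsert, the partial index, and a recursive walk up `parent_id`.

```python
import sqlite3

DDL = """
CREATE TABLE usc_titles (
    id INTEGER PRIMARY KEY,
    title_number INTEGER NOT NULL UNIQUE
);
CREATE TABLE usc_structure_nodes (
    id INTEGER PRIMARY KEY,
    title_id INTEGER NOT NULL REFERENCES usc_titles(id),
    parent_id INTEGER REFERENCES usc_structure_nodes(id),
    kind TEXT NOT NULL,   -- chapter / subchapter / part / subpart
    label TEXT NOT NULL
);
CREATE INDEX ix_usc_structure_nodes_parent ON usc_structure_nodes(parent_id);
CREATE TABLE usc_sections (
    id INTEGER PRIMARY KEY,
    title_id INTEGER NOT NULL REFERENCES usc_titles(id),
    structure_node_id INTEGER REFERENCES usc_structure_nodes(id),  -- assumed
    title_number INTEGER NOT NULL,
    section_number TEXT NOT NULL,   -- text so "1396a" is preserved
    edition_year INTEGER NOT NULL,
    heading TEXT,
    is_repealed INTEGER NOT NULL DEFAULT 0,
    is_omitted INTEGER NOT NULL DEFAULT 0,
    UNIQUE (title_number, section_number, edition_year)
);
CREATE INDEX ix_usc_sections_active
    ON usc_sections(title_number, section_number)
    WHERE NOT (is_repealed OR is_omitted);
"""

UPSERT = """
INSERT INTO usc_sections
    (title_id, structure_node_id, title_number, section_number,
     edition_year, heading)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT (title_number, section_number, edition_year)
DO UPDATE SET heading = excluded.heading,
              structure_node_id = excluded.structure_node_id
"""

ANCESTORS = """
WITH RECURSIVE chain(id, parent_id, kind, label, depth) AS (
    SELECT n.id, n.parent_id, n.kind, n.label, 0
    FROM usc_structure_nodes n
    JOIN usc_sections s ON s.structure_node_id = n.id
    WHERE s.title_number = ? AND s.section_number = ?
    UNION ALL
    SELECT p.id, p.parent_id, p.kind, p.label, chain.depth + 1
    FROM usc_structure_nodes p
    JOIN chain ON p.id = chain.parent_id
)
SELECT kind, label FROM chain ORDER BY depth DESC
"""

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(DDL)
conn.execute("INSERT INTO usc_titles (id, title_number) VALUES (1, 26)")
conn.executemany(
    "INSERT INTO usc_structure_nodes (id, title_id, parent_id, kind, label) "
    "VALUES (?, 1, ?, ?, ?)",
    [(10, None, "chapter", "1"), (11, 10, "subchapter", "F"), (12, 11, "part", "I")],
)

# Loading the same natural key twice updates in place instead of duplicating.
for heading in ("Exemption from tax",
                "Exemption from tax on corporations, certain trusts, etc."):
    conn.execute(UPSERT, (1, 12, 26, "501", 2024, heading))

section_count = conn.execute("SELECT COUNT(*) FROM usc_sections").fetchone()[0]
stored_heading = conn.execute("SELECT heading FROM usc_sections").fetchone()[0]
chain = conn.execute(ANCESTORS, (26, "501")).fetchall()
print(section_count, stored_heading, chain)
```

The `ON CONFLICT ... DO UPDATE` form needs SQLite 3.24 or newer. The FTS5 table, sync triggers, junction table, and vec0 tables are left out to keep the sketch short.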

Schedule: monthly at 04:30 UTC on the 1st. USC release points appear
every 1-3 months — monthly polling is plenty, no need to spam OLRC
hourly. Annual-edition refresh runs in the same job by checking the
GovInfo sitemap year roll-over (USCODE_{YYYY+1}_sitemap.xml first
appears around June of YYYY+1).
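
The "only fetch newer release points" check behind this schedule can be sketched as below. It assumes the plain `xml_uscAll@{c}-{n}.zip` naming used in this spec; any other naming variants on the download page would need extra handling. `parse_release_points` and `newer_than` are hypothetical helpers, and how the watermark is serialized into `last_seen_key` is left aside.

```python
import re
from typing import List, Optional, Tuple

ReleasePoint = Tuple[int, int]  # (congress, public-law number)

_RP_RE = re.compile(r"xml_uscAll@(\d+)-(\d+)\.zip")


def parse_release_points(html: str) -> List[ReleasePoint]:
    """Collect every release point named in the page, sorted oldest first."""
    return sorted({(int(c), int(n)) for c, n in _RP_RE.findall(html)})


def newer_than(points: List[ReleasePoint],
               last_seen: Optional[ReleasePoint]) -> List[ReleasePoint]:
    """Release points strictly after the watermark; all of them on a first run."""
    return [p for p in points if last_seen is None or p > last_seen]


# Made-up page fragment for illustration only.
page = """
<a href="releasepoints/us/pl/118/158/xml_uscAll@118-158.zip">118-158</a>
<a href="releasepoints/us/pl/119/36/xml_uscAll@119-36.zip">119-36</a>
<a href="releasepoints/us/pl/119/93/xml_uscAll@119-93.zip">119-93</a>
"""

points = parse_release_points(page)
todo = newer_than(points, last_seen=(119, 36))
print(points, todo)
```

Comparing integer tuples rather than raw strings matters once law numbers reach three digits: as strings, `119-100` would sort before `119-93`.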

Concurrency: per-source flock /data/locks/ingest-us-code.lock.
OLRC USLM fetch is a single release-point ZIP bundling all titles; the
GovInfo htm/pdf companion fetch is small fan-out (54 titles) and should
respect a ~1-2 req/sec polite cap.
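
Both ideas in that paragraph, a per-source lock and a polite request cap, can be sketched with the standard library. This is an illustration under assumptions, not the ingester's code: `source_lock` and `PoliteLimiter` are hypothetical names, the lock file lives in a temp directory here rather than under `/data/locks/`, and `fcntl` is Unix-only.

```python
import fcntl
import os
import tempfile
import time
from contextlib import contextmanager


@contextmanager
def source_lock(path):
    """Hold an exclusive, non-blocking flock on path for the with-block.

    Raises BlockingIOError immediately if another run already holds it.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


class PoliteLimiter:
    """Space calls at least 1 / max_per_sec seconds apart."""

    def __init__(self, max_per_sec, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / max_per_sec
        self.clock = clock
        self.sleep = sleep
        self._next = 0.0  # earliest time the next request may start

    def wait(self):
        now = self.clock()
        delay = self._next - now
        if delay > 0:
            self.sleep(delay)
        self._next = max(now, self._next) + self.interval


lock_path = os.path.join(tempfile.mkdtemp(), "ingest-us-code.lock")
with source_lock(lock_path):
    try:
        with source_lock(lock_path):
            second_run_started = True
    except BlockingIOError:
        second_run_started = False  # overlapping run is refused
print("second run started:", second_run_started)

limiter = PoliteLimiter(max_per_sec=2.0)
for title_number in (1, 2, 3):
    limiter.wait()  # a companion htm/pdf fetch for this title would go here
```

The non-blocking flag makes an overlapping cron run fail fast instead of queueing behind a long backfill; whether the real job should wait or exit is a choice the spec does not state.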

Tasks

0 of 17 done.

Changelog

No history yet.