Public laws ingester
Why
Public Laws are the "what actually became law" surface — the bridge between
bills (the legislative process) and the U.S. Code (codified statute). The
GovInfo PLAW collection ships USLM 2.0 XML for every enacted statute since
1995, and pairs cleanly with the existing bill_laws join table so agents
can answer "what bill became this law?" and "what laws came out of this
Congress?" without leaving the substrate.
User stories
As a policy analyst using my own AI agent, I want to ask the agent for the full text of P.L. 119-21 with citations so that I can quote the as-enacted statute exactly, not a stale bill version.
As an OSS self-hoster, I want to run `josh ingest public-laws --congress 119` on a fresh substrate so that I can populate the public-laws table from GovInfo on my own instance.
As the U.S. Code ingester (downstream), I want to FK from `usc_section_amendments` rows to `public_laws.id` by `law:<congress>-<number>` so that agents can trace a USC section back to the enacting Public Law.
As the bills ingester (downstream), I want the `bill_laws.public_law_number` field to resolve to a real `public_laws` row so that agents can pivot from "bill" to "as-enacted statute text" in one hop.
Acceptance criteria (EARS)
- When `josh ingest public-laws --congress 119` is run on the deployed substrate, the system shall populate `public_laws` rows whose six citation columns (`source_url`, `source_org`, `published_at`, `retrieved_at`, `raw_sha256`, `citation_string`) are all non-null.
- When a Public Law's USLM XML body is fetched, the system shall write the raw payload under `/data/corpus/public-laws/bodies/raw/` and a normalized Markdown body under `/data/corpus/public-laws/bodies/markdown/`.
- While the ingester is running, if a single law fails to parse, then the system shall log the failure to `ingestion_logs` (with the offending `package_id`) and continue with the next law.
- Where a Public Law's MODS extension declares an originating bill, the system shall populate the existing `bill_laws` join table linking `bills.id` to `public_laws.id` (resolving the bill soft-FK when the bill is already in the substrate; leaving `bill_id` NULL otherwise).
- The system shall enforce a unique constraint on `(congress, law_type, law_number)` so re-issued PLAW packages overwrite rather than duplicate.
- When an FTS5 query `MATCH 'reconciliation'` is executed against `public_laws_fts`, the system shall return BM25-ranked results with the per-column weight tuple `(10.0, 4.0, 2.0, 1.0)` over `(short_title, title, citation_text, body)`.
- Where a Public Law body exceeds ~5K tokens (most reconciliation and omnibus laws), the system shall enqueue rows in `ingestion_embedding_queue` for chunk-level embedding.
- When an incremental run completes, the system shall update `ingestion_source_state.last_seen_key` with the highest sitemap `<lastmod>` seen so the next run only fetches new or changed packages.
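The BM25 criterion above can be sketched with Python's built-in `sqlite3` against an in-memory table. The column order and weights are taken from the criterion; the sample rows are made up, the `search` helper is hypothetical, and the real table definition lives in the migration. This assumes the local SQLite build has FTS5 compiled in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column order must match the order of the weights passed to bm25().
conn.execute(
    "CREATE VIRTUAL TABLE public_laws_fts "
    "USING fts5(short_title, title, citation_text, body)"
)
conn.executemany(
    "INSERT INTO public_laws_fts VALUES (?, ?, ?, ?)",
    [
        # Made-up rows for illustration only.
        ("Reconciliation Act", "An act for reconciliation",
         "Pub. L. 0-1", "Body text."),
        ("Naming Act", "An act to name a building",
         "Pub. L. 0-2", "Mentions reconciliation once in the body only."),
        ("Post Office Act", "An act to designate a post office",
         "Pub. L. 0-3", "Short body."),
    ],
)


def search(query: str) -> list:
    """Return short titles ranked by weighted BM25 (hypothetical helper)."""
    # bm25() yields lower (more negative) values for better matches,
    # so ascending order puts the strongest hit first.
    rows = conn.execute(
        "SELECT short_title FROM public_laws_fts "
        "WHERE public_laws_fts MATCH ? "
        "ORDER BY bm25(public_laws_fts, 10.0, 4.0, 2.0, 1.0)",
        (query,),
    ).fetchall()
    return [row[0] for row in rows]


results = search("reconciliation")
```

A hit in `short_title` counts ten times as much as a hit in `body`, so a law named for the search term should outrank one that only mentions it in passing.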
Success determiner
Command
set -euo pipefail
# Smoke backfill against the current Congress
ssh josh 'docker exec josh-ingester josh ingest public-laws --congress 119 --max 50'
# Verify rows landed with all six citation columns populated
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
SELECT COUNT(*) FROM public_laws
WHERE congress = 119
AND source_url IS NOT NULL AND source_org IS NOT NULL
AND published_at IS NOT NULL AND retrieved_at IS NOT NULL
AND raw_sha256 IS NOT NULL AND citation_string IS NOT NULL;"'
# FTS5 sanity — every modern Congress has at least one act
ssh josh 'docker exec josh-core sqlite3 /data/josh.db \
"SELECT COUNT(*) FROM public_laws_fts WHERE public_laws_fts MATCH '\''act'\'';"'
# bill_laws join populated for at least some of the smoke-backfill rows
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
SELECT COUNT(*) FROM bill_laws bl
JOIN public_laws pl ON pl.id = bl.public_law_id
WHERE pl.congress = 119;"'
Expect
Smoke run only (capped at 50 laws by `--max 50`). The 119th Congress is expected to produce ~200-400 public laws in total by the time it ends. Full backfill (1995-present, ~3-5K packages) runs detached on the server per `add-a-new-source.html`.
Clarifications needed
- Backfill scope: 119th Congress only (~200-400 laws), or 1995-present full PLAW collection (~3-5K laws), or current + previous Congress for v1 launch? Per `https://docs.usejosh.com/sources/public-laws/` the PLAW collection is small (~500 MB-1 GB total), so full 1995-present is feasible.
- Private laws (`pvtl` packageId pattern, ~5-10 per Congress) — include in v1 with `law_type='private'`, or skip entirely as out-of-scope? Schema in the source doc supports both.
- `bill_laws` resolution policy when the originating bill isn't yet in the substrate: insert with `bill_id=NULL` and re-resolve on a periodic job, or block PLAW load until bills catches up? The bills ingester runs more frequently, so soft-FK with later resolution is the simplest path.
- Slip-law-vs-Statutes-at-Large progression: PLAW packages are the slip-law version (single act). The bound Statutes at Large volume that supersedes it adds editorial notes and side-headings. v1 likely tracks slip laws only and accepts that Stat. volume publication lag (~12-18 months) means the canonical Stat citation may need a later refresh.
- PLAW publication lag (1-7 days post-signature): the bills ingester sees `bill_laws.public_law_number` populated before GPO publishes the PLAW package. Implement a retry queue (per `https://docs.usejosh.com/sources/public-laws/` cross-trigger), or just rely on the daily sitemap diff to catch it on the next run?
Out of scope
- State laws / state statutes-at-large analogs — Phase 6+.
- Pre-1995 historical PLAW backfill — Statutes at Large bound volumes via Hathitrust have older laws but require a different parsing pipeline; v2 if useful.
- Full USC amendment extraction (`public_law_usc_amendments` table) — the structural amends-graph is high-value but parser-heavy; v1 ships the law text + bill linkage, v2 builds the section-amends-section graph.
- Statutes-at-Large to law_id resolution (`public_law_stat_references.resolved_law_id`) — captured as raw Stat citations in v1; resolution to other PLAW rows lands in v2.
- Real-time webhooks / push subscriptions — surface-layer concern, not an ingester concern.
Dependencies
Plan
Source module follows the standard Source protocol; see https://docs.usejosh.com/operations/ingestion-architecture/ and https://docs.usejosh.com/operations/add-a-new-source/. Per-source spec at https://docs.usejosh.com/sources/public-laws/ (probe findings + indicative DDL; the migration under
shared/josh_substrate/src/josh_substrate/migrations/versions/0NNN_public_laws.py
is the source of truth).
Discover/fetch/parse/load shape:
- discover: GovInfo per-year sitemap
https://www.govinfo.gov/sitemap/PLAW_{YYYY}_sitemap.xml (~100-200
<url> entries per year). Watermark from
ingestion_source_state.last_seen_key is the highest <lastmod> seen
in the previous run. Per-Congress filter is a discover-time predicate
(the package id encodes Congress: PLAW-{congress}publ{number}).
- fetch: per-package USLM XML at
/content/pkg/PLAW-{c}publ{n}/uslm/PLAW-{c}publ{n}.xml and MODS at
/metadata/pkg/PLAW-{c}publ{n}/mods.xml. Both write to
/data/corpus/public-laws/bodies/raw/. SHA256 the USLM body — that's
the raw_sha256 for citation. Conditional GET via per-file
Last-Modified to skip unchanged packages.
- parse: a plaw_uslm_v1 lxml walker preserving section/heading
hierarchy → Markdown body to
/data/corpus/public-laws/bodies/markdown/. MODS extension parses
title, short title, congress, law_type, law_number, originating bill
(type/number/chamber), approval_date, statutes-at-large citation, and
the list of <statuteAtLargeAmended> references for citation graph.
- load: idempotent upsert on (congress, law_type, law_number)
natural key (matches the unique constraint). When the MODS bill
reference resolves to an existing bills.id, write a bill_laws
row linking bills.id ↔ public_laws.id; otherwise insert with
bill_id=NULL and let a periodic re-resolve job pick it up later.
FTS5 sync triggers fire on insert/update/delete. Bodies > ~5K tokens
enqueue chunks for embedding.
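The discover-time predicate and the fetch paths above can be sketched as follows. This is a minimal illustration, not the Source module itself: the `Package`, `parse_package_id`, `discover`, and `raw_sha256` names are hypothetical, the `https://www.govinfo.gov` host is assumed from the sitemap URL, and `<lastmod>` values are assumed to be ISO 8601 strings in one consistent format so that string comparison orders them correctly. Private-law (`pvtl`) package ids are deliberately not matched, since their handling is still an open clarification.

```python
import hashlib
import re
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple

GOVINFO = "https://www.govinfo.gov"  # assumed host for the relative paths
# Public-law package ids look like PLAW-{congress}publ{number}.
PACKAGE_RE = re.compile(r"^PLAW-(?P<congress>\d+)publ(?P<number>\d+)$")


@dataclass(frozen=True)
class Package:
    package_id: str
    congress: int
    number: int

    @property
    def uslm_url(self) -> str:
        pid = self.package_id
        return f"{GOVINFO}/content/pkg/{pid}/uslm/{pid}.xml"

    @property
    def mods_url(self) -> str:
        return f"{GOVINFO}/metadata/pkg/{self.package_id}/mods.xml"


def parse_package_id(package_id: str) -> Optional[Package]:
    """Decode Congress and law number from a package id, or return None."""
    m = PACKAGE_RE.match(package_id)
    if m is None:
        return None
    return Package(package_id, int(m["congress"]), int(m["number"]))


def discover(
    entries: Iterable[Tuple[str, str]],
    congress: int,
    last_seen_key: Optional[str],
) -> List[Package]:
    """Keep packages for one Congress that are newer than the watermark.

    entries is an iterable of (package_id, lastmod) pairs.
    """
    selected = []
    for package_id, lastmod in entries:
        pkg = parse_package_id(package_id)
        if pkg is None or pkg.congress != congress:
            continue
        if last_seen_key is not None and lastmod <= last_seen_key:
            continue  # unchanged since the previous run
        selected.append(pkg)
    return selected


def raw_sha256(uslm_body: bytes) -> str:
    """Hex digest of the raw USLM payload, as stored in raw_sha256."""
    return hashlib.sha256(uslm_body).hexdigest()
```

For example, `parse_package_id("PLAW-119publ21").uslm_url` yields the USLM path for P.L. 119-21 under the assumed host.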
Migration shape (0NNN_public_laws.py): public_laws (with the six
citation columns + package_id UNIQUE + (congress, law_type, law_number) UNIQUE),
public_laws_fts virtual + 3 sync triggers, public_law_chunks, public_law_chunk_vecs (vec0 with both float[1024] and bit[1024]). Re-uses the existing bill_laws join
table from the bills migration; no new join table for v1. Stat
references and USC amendments tables deferred to v2 per out-of-scope.
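To make the natural-key upsert concrete, here is a minimal sketch against an in-memory SQLite database. The table is trimmed to a handful of columns and is not the real DDL (the migration remains the source of truth); the `upsert_law` helper and the `'public'` value for `law_type` are assumptions for illustration. It relies on SQLite's `ON CONFLICT ... DO UPDATE` clause, which needs a reasonably recent SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed, illustrative table: only enough columns to show the constraints.
conn.execute(
    """
    CREATE TABLE public_laws (
        id          INTEGER PRIMARY KEY,
        package_id  TEXT NOT NULL UNIQUE,
        congress    INTEGER NOT NULL,
        law_type    TEXT NOT NULL,
        law_number  INTEGER NOT NULL,
        title       TEXT,
        raw_sha256  TEXT NOT NULL,
        UNIQUE (congress, law_type, law_number)
    )
    """
)


def upsert_law(package_id, congress, law_type, law_number, title, raw_sha256):
    """Insert a law, or overwrite the existing row with the same natural key."""
    conn.execute(
        """
        INSERT INTO public_laws
            (package_id, congress, law_type, law_number, title, raw_sha256)
        VALUES (?, ?, ?, ?, ?, ?)
        ON CONFLICT (congress, law_type, law_number) DO UPDATE SET
            package_id = excluded.package_id,
            title      = excluded.title,
            raw_sha256 = excluded.raw_sha256
        """,
        (package_id, congress, law_type, law_number, title, raw_sha256),
    )


# A re-issued package overwrites the earlier row instead of duplicating it.
upsert_law("PLAW-119publ21", 119, "public", 21, "First title", "aaa")
upsert_law("PLAW-119publ21", 119, "public", 21, "Corrected title", "bbb")
row_count = conn.execute("SELECT COUNT(*) FROM public_laws").fetchone()[0]
```

Because the row `id` survives the update, any `bill_laws` rows already pointing at it stay valid after a re-issue.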
Schedule: daily 04:00 UTC. Public laws are signed when Congress is in
session and GPO publishes the PLAW package 1-7 days post-signature, so
daily polling of current + previous year sitemaps is plenty. Hourly is
overkill; daily aligns with the cadence in https://docs.usejosh.com/sources/public-laws/.
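A daily run over the current and previous year's sitemaps might look roughly like this. The sketch uses only the standard library; `sitemap_urls`, `read_sitemap`, and `next_watermark` are hypothetical names, the XML namespace is the standard sitemaps.org one, and `<lastmod>` values are assumed to share one ISO 8601 format so that `max()` on strings picks the latest.

```python
import xml.etree.ElementTree as ET
from datetime import date
from typing import List, Optional, Tuple

SITEMAP_TEMPLATE = "https://www.govinfo.gov/sitemap/PLAW_{year}_sitemap.xml"
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(today: date) -> List[str]:
    """Sitemaps to poll on a daily run: previous year, then current year."""
    return [SITEMAP_TEMPLATE.format(year=y) for y in (today.year - 1, today.year)]


def read_sitemap(xml_text: str) -> List[Tuple[str, str]]:
    """Return (loc, lastmod) pairs for every <url> entry in a sitemap."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS)
        lastmod = url.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS)
        entries.append((loc.strip(), lastmod.strip()))
    return entries


def next_watermark(
    entries: List[Tuple[str, str]], previous: Optional[str]
) -> Optional[str]:
    """Highest <lastmod> seen so far, never moving backwards."""
    seen = [lastmod for _, lastmod in entries if lastmod]
    if previous:
        seen.append(previous)
    return max(seen) if seen else None
```

The value returned by `next_watermark` is what would be written back to `ingestion_source_state.last_seen_key` once the run completes.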
Tasks
0 of 17 done.
- t1 Migration 0NNN_public_laws.py applied locally + production
- t2 Pydantic models in shared/josh_substrate/src/josh_substrate/models/public_laws.py
- t3 Source module implementing the `Source` protocol
- t4 Body normalization wired (plaw_uslm_v1 walker over USLM 2.0; raw + markdown paths)
- t5 Citation metadata: all 6 columns populated; citation_for() formatter + unit tests on 3+ fixtures (P.L. 119-21, P.L. 117-103, a private law)
- t6 FTS5 virtual table + 3 sync triggers wired into migration
- t7 Vector vec0 table for chunks (float + BQ); public-law bodies routinely exceed 5K tokens
- t8 Unit test for plaw_uslm_v1 parser against captured fixtures
- t9 Integration test for full discover→fetch→parse→load pipeline against the GovInfo sitemap
- t10 Smoke backfill (50 laws, 119th Congress) clean per success_determiner
- t11 Production backfill at agreed scope (per clarifications_needed) complete
- t12 Daily 04:00 UTC cron registered and observed firing once
- t13 No recurring errors in ingestion_logs after 24h; https://docs.usejosh.com/sources/public-laws/ updated to ingestion-built
- t14 Inventory rows in https://docs.usejosh.com/josh-data-sources/ and https://docs.usejosh.com/data-status/ flipped to ingestion-built
- t15 12-sample manual USLM→Markdown validation gate (per add-a-new-source.html) — stratified across reconciliation, omnibus, single-section, private-law, and pre-2010 packages
- t16 URL verification gate: 10/10 random public_laws rows match upstream GovInfo titles + approval dates via WebFetch
- t17 bill_laws join integrity test: for the smoke-backfill cohort, every `public_laws` row whose MODS declared an originating bill that exists in `bills` has a corresponding `bill_laws` row
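The join integrity check in t17 could be expressed as a single anti-join query. The sketch below is only one possible shape and rests on loud assumptions: the spec does not say where the MODS-declared originating bill is stored, so the `origin_bill_key` column on `public_laws` and the `bill_key` column on `bills` are hypothetical stand-ins, as is the `missing_links` helper. Only `bill_laws.bill_id` and `bill_laws.public_law_id` come from the spec.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Illustrative tables; bill_key and origin_bill_key are hypothetical.
    CREATE TABLE bills (id INTEGER PRIMARY KEY, bill_key TEXT UNIQUE);
    CREATE TABLE public_laws (
        id INTEGER PRIMARY KEY,
        congress INTEGER NOT NULL,
        origin_bill_key TEXT  -- bill declared in MODS, NULL if none
    );
    CREATE TABLE bill_laws (bill_id INTEGER, public_law_id INTEGER);
    """
)

# Laws whose declared bill exists in bills but which lack a bill_laws link.
MISSING_LINKS_SQL = """
SELECT pl.id
FROM public_laws pl
JOIN bills b ON b.bill_key = pl.origin_bill_key
LEFT JOIN bill_laws bl
       ON bl.public_law_id = pl.id AND bl.bill_id = b.id
WHERE pl.congress = ? AND bl.public_law_id IS NULL
ORDER BY pl.id
"""


def missing_links(congress: int) -> list:
    """Return public_laws ids that should have a bill_laws row but do not."""
    return [row[0] for row in conn.execute(MISSING_LINKS_SQL, (congress,))]
```

An integrity test would then assert that `missing_links(119)` is empty after the smoke backfill. Laws whose declared bill is not yet in `bills` are excluded by the inner join, which matches the soft-FK policy of leaving `bill_id` NULL until the bill arrives.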
Changelog
No history yet.