Bills ingester · Josh spec

Header

Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.

Why

Bills are the centerpiece of the substrate — every other legislative source
(public laws, committee reports, hearings, votes, CBO estimates) FKs back to
a bill, and most agent queries either start at a bill or need to resolve one.
Shipping bills early is the biggest schema test in v1: it stresses
cross-source joins (legislators, committees), large body text (HR-1-class
bills can run 1.9 MB of XML), inlined amendments (HR 1 119th has 493), and
three independent change signals (RSS, sitemap <lastmod>, per-file
Last-Modified). If the SQLite + FTS5 + sqlite-vec stack survives bills,
it survives the v1 source roster.

User stories

As a policy analyst using my own AI agent, I want to ask the agent for the latest House and Senate bills on a given topic, with sponsors and cosponsors resolved to real members so that I can run vote-prediction and "who introduced what" research without scraping Congress.gov myself.

As an OSS self-hoster, I want to run `josh ingest bills` on a fresh substrate and have BILLSTATUS, sponsors, cosponsors, actions, subjects, and law linkage all populate so that I can stand up a federal-bills-aware agent on my own instance without standing up a separate scraper stack.

As a downstream FK consumer (public laws, committee reports, hearings ingesters), I want to FK to bills by the canonical `bill:{congress}-{type}-{number}` ID so that cross-source citation graph queries (e.g. "every committee report about HR 1") resolve cleanly inside one substrate.

As an agent harness running retrieval over the substrate, I want FTS5 BM25 ranking over bill titles, policy areas, and latest-action text, plus vector similarity over bill body chunks so that "find similar bills" and topical search return ranked, citation-anchored results in one query path.

Acceptance criteria (EARS)

When `josh ingest bills --since 2026-01-01` is run against the deployed substrate, the system shall populate `bills` rows whose six citation columns (`source_url`, `source_org`, `published_at`, `retrieved_at`, `raw_sha256`, `citation_string`) are all non-null.
When a BILLSTATUS XML body is fetched, the system shall write the raw payload under `/data/corpus/bills/bodies/raw/` and a normalized Markdown body under `/data/corpus/bills/bodies/markdown/`.
While the ingester is running, if a single bill fails to parse, then the system shall log the failure to `ingestion_logs` (with `bill_id` and parser stage) and continue with the next bill — no crash, no half-written rows.
When a bill's `<actions>` block is parsed, the system shall populate one `bill_actions` row per `<item>` preserving `(bill_id, sequence)` uniqueness, and shall NOT deduplicate rows that share the same event with different `actionCode`/`type` categorizations.
Where a bill declares a sponsor or cosponsors, the system shall resolve each `bioguideId` to the `legislators` table populated by `legislators-and-committees-ingester` and write `bill_cosponsors` rows; if a `bioguideId` is unknown, the system shall log to `ingestion_logs` and queue a retry on the next legislators sync.
When an FTS5 query `MATCH 'reconciliation'` is executed against `bills_fts`, the system shall return BM25-ranked results with the per-column weight tuple `(10.0, 4.0, 2.0, 1.0)` over (title, policy_area, latest_action_text, summary_text).
Where a bill's text version body exceeds ~5K tokens, the system shall enqueue rows in `ingestion_embedding_queue` for chunk-level embedding into `bill_chunk_vecs`.
When the BILLSTATUS RSS feed reports a batch of updated files, the system shall refetch only those files and shall update `ingestion_source_state.last_seen_key` to the highest `pubDate` observed, so the next incremental tick skips already-seen batches.
Where a bill's BILLSTATUS contains a `<laws>` block, the system shall populate the `bill_laws` junction (composite PK `(bill_id, law_type, law_number)`) AND set the denormalized `bills.became_law=true` and `bills.public_law_number` columns in the same transaction.

Success determiner

kindbash

Command

set -euo pipefail
# Smoke backfill (50 bills from current congress)
ssh josh 'docker exec josh-ingester josh ingest bills --since 2026-04-01 --max 50'
# Verify rows landed with all six citation columns populated
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
  SELECT COUNT(*) FROM bills
  WHERE update_date >= ''2026-04-01''
    AND source_url IS NOT NULL AND source_org IS NOT NULL
    AND retrieved_at IS NOT NULL AND raw_sha256 IS NOT NULL
    AND citation_string IS NOT NULL;"'
# Sponsor join populated (FK to legislators)
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
  SELECT COUNT(*) FROM bills b
  JOIN legislators l ON l.bioguide_id = b.sponsor_bioguide
  WHERE b.update_date >= ''2026-04-01'';"'
# Action junction populated
ssh josh 'docker exec josh-core sqlite3 /data/josh.db \
  "SELECT COUNT(*) FROM bill_actions WHERE bill_id IN (SELECT id FROM bills WHERE update_date >= '\''2026-04-01'\'');"'
# FTS5 sanity
ssh josh 'docker exec josh-core sqlite3 /data/josh.db \
  "SELECT COUNT(*) FROM bills_fts WHERE bills_fts MATCH '\''appropriations'\'';"'

Expect

>= 50 fully-populated rows; sponsor join >= 50; bill_actions >= 50; FTS5 match returns >= 1.

Smoke run only. Full BILLSTATUS backfill (113th-current ~105K bills, ~1.5 GB ZIP download) runs detached on the server per the `add-a-new-source.html` runbook. Bill-text backfill is lazy — only enacted PLAW USLM bodies and current-congress text versions are eager-fetched.

Clarifications needed

Backfill scope: 113th-current (BILLSTATUS coverage, ~105K bills, ~10-25 GB) vs current-congress only (119th, ~16K bills) for v1 launch. Brief targets '108th onward' but BILLSTATUS only goes back to 113th — confirm we're shipping 113th+ and documenting the pre-113th gap, OR backfilling 108th-112th via Congress.gov API alone.
Primary source for text: BILLSTATUS XML for status + structure is locked in. Open question is whether to eagerly fetch BILLS text-version XML for all ~2.5M version files (100s of GB) or lazily fetch on first agent request, eager-fetching only enacted-law bodies (PLAW USLM, much smaller). Lean: lazy + eager-only for current-congress + PLAW.
Bill text chunking strategy. Section-aware (split at `<section>` / `<part>` / `<title>` in billres XML / USLM) vs fixed-size with overlap. Section-aware preserves USLM semantics but parser is more work. Resolve via the `chunking.html` experiment framework before kicking off the embedding backfill.
Amendment-text version codes (`eas`, `eah`) refer to amendment text, not bill text. Model as (a) a separate `bill_amendment_text_versions` table or (b) `bill_text_versions.is_amendment_text` flag. Lean: (b).
108th-112th historical coverage: skip with documented v1 gap, or partial-ingest via Congress.gov v3 API alone? Affects backfill time, storage, and the v1 launch narrative.

Out of scope

State-level legislatures (CA, NY, TX bills) — Phase 6.
Pre-1990s historical bills predating any digital structured source — not addressable.
Real-time webhook / push subscription surface — surface-layer concern, deferred to a future surface spec.
PDF text extraction from bill PDFs — XML and HTML formats are sufficient for v1; PDF is fallback only when neither exists.
The `unitedstates/congress` Python scrapers as a runtime dependency — we read their parsers for edge cases but re-implement directly against our schema.

Dependencies

Plan

Source module follows the standard Source protocol from
https://docs.usejosh.com/operations/ingestion-architecture/. Per-source spec at
https://docs.usejosh.com/sources/bills/ (probe findings + indicative DDL — the migration
under shared/josh_substrate/src/josh_substrate/migrations/versions/ is the
source of truth, not the docs page).

Source priority (locked from https://docs.usejosh.com/sources/bills/ §source-priority-decision):
- Primary: GovInfo BILLSTATUS XML (no auth, no rate limit, deterministic
URLs, ZIP packages per congress×billtype, RSS for incrementals).
- Secondary: GovInfo BILLS / PLAW for bill text (XML preferred, HTML/PDF
fallback). PLAW ships modern USLM 2.0; live bills ship older billres DTD.
- Tertiary: Congress.gov v3 API for verification + enrichment only
(needs api.data.gov key, 5K/hr quota). Not worth being primary.
- Companion (already shipped): legislators-and-committees-ingester
populates legislators and committees keyed by bioguide_id /
system_code. Bills FK to those tables; load order is
legislators → committees → bills.

Discover/fetch/parse/load shape:
- discover: Hourly RSS poll on
https://www.govinfo.gov/rss/billstatus-batch.xml for current-congress
deltas; daily sitemap diff per congress×billtype against
bills.source_last_modified (BILLSTATUS sitemaps live under
https://www.govinfo.gov/sitemap/bulkdata/BILLSTATUS/... — the root
www.govinfo.gov/sitemap.xml is a 404); weekly drift-check comparing
per-congress sitemap counts to substrate row counts. Watermark = highest
pubDate observed in ingestion_source_state.last_seen_key.
- fetch: Per-bill BILLSTATUS XML (deterministic URL
BILLSTATUS-{c}{t}{n}.xml) to /data/corpus/bills/bodies/raw/. SHA256
the bytes for raw_sha256 citation column. Polite ~1-2 req/sec; retry
HTTP 503 with exponential backoff up to ~5 min.
- parse: bills_billstatus_v1 lxml walker → bills row + child rows
(bill_actions, bill_cosponsors, bill_committees, bill_subjects,
bill_titles, bill_summaries, bill_text_versions, bill_amendments,
bill_related, bill_laws, bill_cbo_estimates, bill_committee_reports).
Markdown body = title + policy area + latest action + summary text
written to /data/corpus/bills/bodies/markdown/{bill_id}.md. Bill text
XML (billres / USLM) is parsed lazily on first request.
- load: Idempotent upsert on (congress, bill_type, number) natural key.
All child-table inserts in one transaction per bill. FTS5 sync triggers
fire on insert/update/delete. Body chunks for bills above ~5K tokens
enqueued into ingestion_embedding_queue.

Migration shape: New migration under
shared/josh_substrate/src/josh_substrate/migrations/versions/0NNN_bills.py
with: bills, bill_actions, bill_cosponsors, bill_committees,
bill_subjects, bill_titles, bill_summaries, bill_text_versions,
bill_amendments, bill_related (soft-FK), bill_laws (junction),
bill_cbo_estimates, bill_committee_reports, plus bills_fts virtual
table with the standard 3-trigger sync (bills_fts_ai, _au, _ad),
plus bill_chunks and bill_chunk_vecs (vec0 with float[1024] and
bit[1024]). Indexes per https://docs.usejosh.com/sources/bills/ §schema. Use SQLite
shapes (no text[], no jsonb — junction tables and JSON1).

Schedule recommendation: Hourly at :30. Bills are introduced and acted
on Mon-Fri while Congress is in session; the GovInfo current-congress RSS
job runs every 4 hours, but hourly polling catches stragglers and amortizes
any single-fetch failure across the day. Off-hours and weekends are
cheap-no-ops because the RSS feed simply has no new batches. Sitemap diff
runs daily at 03:30; drift check weekly Sunday 04:00.

Tasks

0 of 17 done.

Changelog

No history yet.