Bills
Federal legislation: House and Senate bills, joint resolutions, concurrent resolutions, and simple resolutions across all bill types and congresses. The centerpiece data source — bills are the most cross-referenced entity in the substrate, linking to members, committees, votes, hearings, US Code, and the Federal Register.
This source is more complex than Federal Register because there are three official primary sources (GovInfo bulk, Congress.gov API, unitedstates/congress scrapers) plus a mandatory companion source (unitedstates/congress-legislators) for resolving sponsors and cosponsors to canonical member records.
Quick reference
Section titled “Quick reference”| Source name | Bills |
| Publishers | Library of Congress (Congress.gov), Office of the Clerk (House), Secretary of the Senate, Government Publishing Office (GovInfo) |
| License | Public domain (US government works); congress-legislators is CC0-1.0 |
| Coverage | BILLSTATUS XML: 113th Congress (2013) – present. Older bills available as text/summary only. Congress.gov API aggregate has ~426K bill records back further. |
| Volume | ~10K-19K bills/congress, ~150K total in BILLSTATUS-covered range, ~2.5M-3M text version files |
| Storage estimate | ~10-25 GB raw across all collections (gzipped XML + bill text in 3 formats) |
| Auth | Congress.gov API: register at api.data.gov for a key and set CONGRESS_GOV_API_KEY. GovInfo: none. GitHub raw / Pages: none. |
| Stable ID format | bill:{congress}-{type}-{number} e.g. bill:119-hr-1, bill:117-hjres-30 |
| Status | exploring — schema drafted, ingestion not built |
Source priority decision
Section titled “Source priority decision”Three primary sources overlap. Recommendation:
Primary: GovInfo BILLSTATUS XML. Single file per bill at a deterministic URL, contains essentially every field Congress.gov exposes (sponsor, cosponsors with bioguideId and isOriginalCosponsor, full action history with actionCode/type/sourceSystem, summaries, subjects, titles by titleType, committeeReports, relatedBills, amendments inline, textVersions with format URLs, laws). No auth, no rate limit, ZIP packages per congress×billtype, RSS for incremental updates.
Secondary: GovInfo BILLS / PLAW. Bill text in PDF/HTML/XML across all version codes (ih, is, eh, es, rh, rs, enr, etc.). Linked from BILLSTATUS via textVersions URLs — fetches are deterministic once BILLSTATUS is parsed. Public Laws (PLAW) ship with USLM 2.0 XML (modern format); live bill versions ship with the older "billres" DTD-based XML.
Tertiary: Congress.gov v3 API. Use only as enrichment / verification layer. Slightly cleaner JSON payload structure; cboCostEstimates and legislationUrl are convenient. But it duplicates BILLSTATUS data, costs API quota, and needs a real key. Not worth being primary. Note: this is the JSON API at api.congress.gov/v3/ — distinct from the public congress.gov/bill/... landing pages, which are now behind a Cloudflare Turnstile bot wall (see Access notes). When you need to confirm a bill's title or that a legislationUrl resolves, query the v3 API rather than fetching the landing page.
Companion (mandatory): unitedstates/congress-legislators. Hosted YAML registry of every member of Congress, current and historical, keyed by bioguide ID — the same ID used by both BILLSTATUS and Congress.gov. Required for joining bills to members.
Skip as runtime dependency: unitedstates/congress scrapers. They're a thin BILLSTATUS-XML-to-JSON converter at this point. Read bill_info.py and bills.py for action-code edge cases their maintainers have caught over a decade — but re-implement the parse ourselves, writing directly to our schema.
Access notes
Section titled “Access notes”Congress.gov: Register at api.data.gov for a key. DEMO_KEY is rate-limited to ~30 requests per hour per IP — fine for spot checks, not for ingestion; a real key is needed for ingestion. Current rate-limit numbers and the live-quota response headers are documented at api.data.gov rate limits.
Congress.gov landing pages (bot-walled, observed 2026-05-29): the human-facing pages at https://www.congress.gov/bill/... (the form legislationUrl points at, e.g. https://www.congress.gov/bill/119th-congress/house-bill/1) now return an HTTP 403 backed by a Cloudflare Turnstile challenge on a plain-HTTP GET. Do not verify legislationUrl by fetching the landing page. The URL pattern (/bill/{congress}th-congress/{type-slug}/{number}) is stable and verifiable by construction; to cross-check a bill's title or existence, query the Congress.gov v3 JSON API (api.congress.gov/v3/bill/{c}/{t}/{n}), which is not bot-walled. The same Turnstile wall is why congress.gov/help/field-values/bill-versions 403s on direct fetch.
GovInfo: No auth. Polite rate (~1-2 req/sec) recommended. No formal limit observed.
GitHub Pages mirror (unitedstates.github.io/congress-legislators/...) is faster, CORS-friendly, and serves valid Content-Type: text/yaml with Last-Modified + ETag. Prefer this over raw.githubusercontent.com when fetching the YAML registry.
User-Agent: Send a real-browser UA — it's harmless and avoids any heuristic blocks.
Endpoints
Section titled “Endpoints”Endpoint catalogs are upstream-documented; this page keeps only the Josh-specific source-priority call (above) and the gotchas below.
- Congress.gov v3 API (tertiary / enrichment only) — base
https://api.congress.gov/v3/, all requests need?api_key=.... Full endpoint catalog and OpenAPI: Congress.gov API docs.pagination.counton/bill= aggregate (~426K). Note:/cosponsorsis often empty even for major bills. - GovInfo bulk (primary / secondary) — BILLSTATUS / BILLSUM / BILLS / PLAW collections, ZIP packages per congress×type, and MODS/
wssearchmetadata sidecars. URL patterns and the BILLSTATUS user guide: GovInfo bulk data and the GPO bulk-data repo. PLAW ships USLM 2.0; live BILLS versions ship the older billres DTD XML. unitedstates/congress-legislators(companion / mandatory) — GitHub Pages YAML athttps://unitedstates.github.io/congress-legislators/{name}.yaml. Files used:legislators-current.yaml(~1.1 MB),legislators-historical.yaml(~9 MB),committees-current.yaml,committees-historical.yaml,committee-membership-current.yaml(plus district-offices, social-media, executive). No JSON/CSV mirror is committed in the repo — all YAML; conversion is on us. Update cadence: multiple commits per month is normal; pull on a daily cron.
GovInfo discovery quirks (probe findings): There is no JSON browse API for these collections — https://www.govinfo.gov/bulkdata/json/BILLSTATUS returns an HTML error page. There is also no usable site-root sitemap — https://www.govinfo.gov/sitemap.xml returns a 404. Enter discovery through the BILLSTATUS sitemap index (under https://www.govinfo.gov/sitemap/bulkdata/BILLSTATUS/..., confirmed 2026-05-29), not a top-level sitemap. Incremental discovery uses the per-collection sitemaps (per-file <lastmod>) and the BILLSTATUS RSS batch feed; see the GovInfo BILLSTATUS bulk landing for the current URLs. The current-Congress job runs every 4 hours; prior congresses daily.
Bill vocabulary
Section titled “Bill vocabulary”Bill types (8 total)
Section titled “Bill types (8 total)”The 8 type codes are load-bearing — they're pinned in the bills.bill_type CHECK constraint:
| Code | Name |
|---|---|
hr | House Bill |
s | Senate Bill |
hjres | House Joint Resolution |
sjres | Senate Joint Resolution |
hconres | House Concurrent Resolution |
sconres | Senate Concurrent Resolution |
hres | House Simple Resolution |
sres | Senate Simple Resolution |
Casing gotcha: URLs use lowercase (/bill/119/hr/1); BILLSTATUS XML <type> is uppercase (HR, HJRES).
Action types and codes
Section titled “Action types and codes”The full <actions>/<item>/<type> enum (Actions by the President … VARIES) and the <actionCode> table are upstream-documented in the Congress.gov legislative glossary / User Guide. What Josh needs to remember:
- Each action carries a short alphanumeric
<actionCode>(e.g.E40000for "Became Public Law",H41610,36000). <sourceSystem>/<code>enum:0= House,1= House committee actions,2= Senate,9= Library of Congress.actionTimeis meaningful only whensourceSystem.code == 2(House floor).- Gotcha: the same legislative event can produce multiple action rows with different categorizations. "Became Public Law No: 119-21." may appear as both
type=President(actionCode=E40000) andtype=BecameLaw(actionCode=36000). Do not deduplicate naively — preserve both rows and useactionCodefor filtering.
Text version codes and title types
Section titled “Text version codes and title types”~30 text-version codes exist (ih, is, rh, rs, eh, es, eah, eas, eas2, pcs, pcs2, rfs, rds, rfh, rhuc, enr, …) and <titleType> codes run 01–05, 09, 10, 20–26, 29, 30–36, 39, 40 (Official Title as Introduced through Short Title as Enacted through Popular Title). The full canonical version table lives at congress.gov/help/field-values/bill-versions — but that page 403s behind the same Cloudflare Turnstile wall (see Access notes). Don't reproduce or fetch it: when parsing BILLSTATUS, the textVersions/item/type element gives the human-readable name next to each code, so unknown codes resolve at parse time without hitting that page. Each bill has many titles.
Stable ID format
Section titled “Stable ID format”Recommended: bill:{congress}-{type}-{number}, lowercased.
Examples:
bill:119-hr-1bill:117-hjres-30bill:118-sres-512
Rationale:
- Matches Congress.gov URL ordering exactly (
/bill/119/hr/1) — round-trippable. - Congress-first prefix is chronologically grouped, agent-readable.
- Hyphen separators are URL-safe and human-readable.
- Other formats considered:
unitedstates/congressuses{type}{number}-{congress}(hr1-119), GovInfo usesBILLS-{c}{t}{n}{v}packed (BILLS-119hr1eh), GovTrack used{type}{number}-{congress}. None is dominant. Putting congress first wins on agent ergonomics.
Derived IDs:
- Text version:
bill:{congress}-{type}-{number}-{versionCode}→bill:119-hr-1-eh. Concatenates to GovInfo packageId viaBILLS-{c}{t}{n}{v}→BILLS-119hr1eh. - Amendment:
amdt:{congress}-{type}-{number}→amdt:119-samdt-2849. - Public law:
law:{congress}-{number}→law:119-21.
Member cross-reference
Section titled “Member cross-reference”Sponsors and cosponsors in both Congress.gov API and BILLSTATUS XML are identified by bioguideId (e.g. A000375, D000600). unitedstates/congress-legislators uses id.bioguide as primary key for the same population — one-to-one match, no transformation.
Implementation: load legislators-current.yaml + legislators-historical.yaml into a dict keyed on bioguide. Decorate every bill sponsor/cosponsor row with name.official_full, current term's party/state/district, etc.
The YAML carries crosswalk IDs to GovTrack, OpenSecrets, VoteSmart, FEC, ICPSR, Wikidata, Wikipedia, and Ballotpedia — useful for later integrations.
Committee join: BILLSTATUS <committee><systemCode>hsbu00</systemCode> ↔ committees-current thomas_id: HSBU via systemCode = lowercase(thomas_id) + "00" (top-level) or lowercase(parent_thomas) + subcommittee_thomas_id (subcommittee). Need a small mapping shim.
BILLSTATUS XML shape (the canonical record)
Section titled “BILLSTATUS XML shape (the canonical record)”The full element-by-element <billStatus> / <bill> skeleton is upstream-documented in the GPO BILLSTATUS XML User Guide. The <bill> element nests committees, committeeReports, relatedBills, actions, sponsors, cboCostEstimates, laws, policyArea, subjects, summaries, title/titles, amendments, textVersions, and latestAction. The Josh-specific parse quirks that no upstream doc flags:
Sponsors / cosponsors. <sponsors> is always present (single sponsor, identified by bioguideId + fullName + party/state/district). <cosponsors> may be elided entirely on bills with no cosponsors. Don't assume cosponsors.count > 0 — major reconciliation bills routinely have zero cosponsors because they ride budget resolution authority. Cosponsor items carry sponsorshipDate and isOriginalCosponsor.
textVersions/item. Each item has a type (human-readable name, e.g. "Engrossed in House"), a date, and a formats list of urls. <textVersions> types include amendments-as-text (e.g. eas "Engrossed Amendment Senate"). These are text of an amendment to a bill, not bill-version variants. Plan the text-version table accordingly.
amendments inline. BILLSTATUS inlines all amendments as <amendment> elements with their own sponsors, cosponsors, actions, and latestAction. HR 1 (119th) has 493 amendments — the file is 2 MB largely because of inline amendment trees. Plan a separate bill_amendments table; do not store as JSON column.
Public-law linkage. When a bill becomes law, three independent signals appear: a <laws> item (<type>Public Law</type><number>119-21</number>), a <textVersions> item with <type>Public Law</type> linking to PLAW USLM, and an <actions> row with <actionCode>36000</actionCode><type>BecameLaw</type>, plus <latestAction><text>Became Public Law No: 119-21.</text></latestAction>.
Bill text formats
Section titled “Bill text formats”Per text version, three formats:
xml— the older DTD-based "billres" / OLC format. Not USLM for live bills.html—.htmextension.pdf— printable.
USLM XML is only available for enacted Public Laws at /content/pkg/PLAW-{congress}publ{lawnumber}/uslm/.... If we want USLM for live bills, we'd convert ourselves; for v1, parse the billres XML directly.
Per-text-version typical sizes (HR 1 119th — extreme upper bound):
- XML: 1.93 MB · HTML: 1.37 MB · PDF: 1.65 MB
Median bill is ~30-200 KB BILLSTATUS + 100-300 KB across all text versions.
Volume
Section titled “Volume”Per-congress bill counts (sitemap-derived, all 8 bill types):
| Congress | Total bills + resolutions |
|---|---|
| 113 (2013-14) | 10,637 |
| 114 (2015-16) | 12,063 |
| 115 (2017-18) | 13,556 |
| 116 (2019-20) | 16,601 |
| 117 (2021-22) | 17,828 |
| 118 (2023-24) | 19,315 |
| 119 (2025-26, in progress) | 15,565 |
BILLSTATUS bulk coverage starts at 113th Congress — older congresses have BILLS text and PLAW for enacted laws but no structured status. The original "108th onward" target needs a caveat: full structure for 113+, partial (text + summary + enrollment) for 108-112, thinner still for older congresses via Congress.gov API.
For the BILLSTATUS-covered range (113th-present): ~105K bill records + a few hundred thousand text-version files.
Total raw corpus estimate: ~10-25 GB across BILLSTATUS XML + BILLSUM + BILLS (xml only — skip pdf/html when xml is available) + PLAW USLM. Smaller than Federal Register because per-bill files are usually small.
Caching / incremental sync
Section titled “Caching / incremental sync”Three independent change signals exist (raw sitemap/RSS URLs are listed on the GovInfo BILLSTATUS bulk landing):
- Per-file
Last-Modifiedon every GovInfo XML and ZIP — usable forIf-Modified-Sincerequests on individual files. - Per-collection sitemap
<lastmod>— every URL listed with its modification time. Diff against last-stored to find changed bills since last run. - RSS batch feed — batch updates with file lists. Current-congress runs every 4 hours; prior congresses daily.
Practical incremental strategy:
- Hourly/4-hourly tick: poll RSS for batch updates. Each batch lists files changed; refetch only those.
- Daily tick: compare sitemap
<lastmod>against our last-seen timestamps for current-congress bill types. - Weekly tick (sanity check): compare sitemap counts per-congress against our row counts; alert on drift.
Congress.gov API ?fromDateTime=YYYY-MM-DDTHH:MM:SSZ&sort=updateDate+desc is the equivalent for the API path — useful for verification or when GovInfo is lagging.
Schema (Postgres DDL)
Section titled “Schema (Postgres DDL)”-- ============================================================-- Members (companion source — load before bills)-- ============================================================
CREATE TABLE legislators ( bioguide_id text PRIMARY KEY, -- 'A000375' govtrack_id int, thomas_id text, lis_id text, opensecrets_id text, fec_ids text[], icpsr_id int, wikidata_id text, wikipedia text, first_name text, last_name text, middle_name text, suffix text, nickname text, official_full_name text, birthday date, gender text, is_current boolean NOT NULL DEFAULT false, raw_yaml jsonb NOT NULL, fetched_at timestamptz NOT NULL, updated_at timestamptz NOT NULL DEFAULT now());
CREATE TABLE legislator_terms ( id bigserial PRIMARY KEY, bioguide_id text NOT NULL REFERENCES legislators(bioguide_id) ON DELETE CASCADE, term_type text NOT NULL, -- 'rep' | 'sen' term_start date NOT NULL, term_end date NOT NULL, state text NOT NULL, district int, -- rep only senate_class int, -- sen only party text, party_affiliations jsonb, -- list when party changed mid-term caucus text, state_rank text, -- 'junior' | 'senior' url text, address text, phone text, rss_url text, UNIQUE (bioguide_id, term_start));
CREATE INDEX legislator_terms_active ON legislator_terms (term_end DESC, term_start);
-- ============================================================-- Committees (companion source — load before bills)-- ============================================================
CREATE TABLE committees ( id text PRIMARY KEY, -- 'committee:hsbu' (lowercase thomas_id) thomas_id text UNIQUE, -- 'HSBU' system_code text UNIQUE, -- 'hsbu00' for top-level; subcommittees have different code chamber text NOT NULL CHECK (chamber IN ('house', 'senate', 'joint')), name text NOT NULL, parent_id text REFERENCES committees(id), url text, minority_url text, address text, phone text, jurisdiction text, rss_url text, youtube_id text, is_current boolean NOT NULL DEFAULT true, raw_yaml jsonb, fetched_at timestamptz NOT NULL, updated_at timestamptz NOT NULL DEFAULT now());
CREATE INDEX committees_parent ON committees (parent_id) WHERE parent_id IS NOT NULL;
-- ============================================================-- Bills (canonical)-- ============================================================
CREATE TABLE bills ( -- Stable canonical ID id text PRIMARY KEY, -- 'bill:119-hr-1'
-- Composite natural key congress smallint NOT NULL, -- 119 bill_type text NOT NULL CHECK (bill_type IN ( 'hr', 's', 'hjres', 'sjres', 'hconres', 'sconres', 'hres', 'sres' )), number int NOT NULL, -- 1 UNIQUE (congress, bill_type, number),
origin_chamber text CHECK (origin_chamber IN ('House', 'Senate')), introduced_date date,
-- Latest title (titleType = display title or official as introduced) title text NOT NULL,
-- Single-value rollups (denormalized for filter speed) policy_area text, sponsor_bioguide text REFERENCES legislators(bioguide_id), sponsor_is_by_request boolean, latest_action_date date, latest_action_text text, became_law boolean NOT NULL DEFAULT false, public_law_type text, -- 'Public Law' | 'Private Law' public_law_number text, -- '119-21'
-- Update tracking update_date timestamptz, -- BILLSTATUS <updateDate> update_date_inc_text timestamptz, -- <updateDateIncludingText> source_last_modified timestamptz, -- HTTP Last-Modified of BILLSTATUS XML
-- Lifecycle / forensics raw_billstatus_xml bytea, -- gzipped XML for forensics; nullable fetched_at timestamptz NOT NULL, parsed_at timestamptz, inserted_at timestamptz NOT NULL DEFAULT now(), updated_at timestamptz NOT NULL DEFAULT now());
CREATE INDEX bills_congress_intro ON bills (congress, introduced_date DESC);CREATE INDEX bills_sponsor ON bills (sponsor_bioguide) WHERE sponsor_bioguide IS NOT NULL;CREATE INDEX bills_latest_action ON bills (latest_action_date DESC NULLS LAST);CREATE INDEX bills_public_law ON bills (public_law_number) WHERE public_law_number IS NOT NULL;CREATE INDEX bills_update_date ON bills (update_date DESC NULLS LAST);
-- Full-text on title + summary text (body text comes from text versions, indexed separately)ALTER TABLE bills ADD COLUMN search_tsv tsvector GENERATED ALWAYS AS ( setweight(to_tsvector('english', coalesce(title, '')), 'A') || setweight(to_tsvector('english', coalesce(policy_area, '')), 'B') || setweight(to_tsvector('english', coalesce(latest_action_text, '')), 'C') ) STORED;CREATE INDEX bills_search ON bills USING gin (search_tsv);
-- ============================================================-- Bill child tables-- ============================================================
CREATE TABLE bill_actions ( id bigserial PRIMARY KEY, bill_id text NOT NULL REFERENCES bills(id) ON DELETE CASCADE, sequence int NOT NULL, -- order within the bill's action history action_date date NOT NULL, action_time time, -- only meaningful when source_system_code=2 action_code text, -- short alphanumeric, e.g. 'E40000' action_type text, -- the categorization, e.g. 'BecameLaw' source_system_code int, -- 0=House, 1=House cmte, 2=Senate, 9=LoC source_system_name text, text text NOT NULL, committee_system_code text, -- when action was committee-driven UNIQUE (bill_id, sequence));CREATE INDEX bill_actions_bill_date ON bill_actions (bill_id, action_date DESC);CREATE INDEX bill_actions_code ON bill_actions (action_code) WHERE action_code IS NOT NULL;
CREATE TABLE bill_cosponsors ( bill_id text NOT NULL REFERENCES bills(id) ON DELETE CASCADE, bioguide_id text NOT NULL REFERENCES legislators(bioguide_id), sponsorship_date date NOT NULL, is_original_cosponsor boolean NOT NULL DEFAULT false, PRIMARY KEY (bill_id, bioguide_id));CREATE INDEX bill_cosponsors_bioguide ON bill_cosponsors (bioguide_id);
CREATE TABLE bill_committees ( id bigserial PRIMARY KEY, bill_id text NOT NULL REFERENCES bills(id) ON DELETE CASCADE, committee_id text REFERENCES committees(id), activity_name text NOT NULL, -- 'Referred to', 'Reported', etc. activity_date timestamptz, UNIQUE (bill_id, committee_id, activity_name, activity_date));CREATE INDEX bill_committees_bill ON bill_committees (bill_id);
CREATE TABLE bill_subjects ( bill_id text NOT NULL REFERENCES bills(id) ON DELETE CASCADE, subject text NOT NULL, PRIMARY KEY (bill_id, subject));CREATE INDEX bill_subjects_subject ON bill_subjects (subject);
CREATE TABLE bill_titles ( id bigserial PRIMARY KEY, bill_id text NOT NULL REFERENCES bills(id) ON DELETE CASCADE, title_type_code text, -- '01', '02', ..., '40' title_type_name text, -- e.g. 'Official Title as Introduced' chamber text, -- 'House' | 'Senate' | NULL title text NOT NULL, UNIQUE (bill_id, title_type_code, chamber, title));
CREATE TABLE bill_summaries ( id bigserial PRIMARY KEY, bill_id text NOT NULL REFERENCES bills(id) ON DELETE CASCADE, version_code text, -- e.g. '00' (introduced), '49', '50' (private law), etc. action_desc text, -- describes the bill state at summary time action_date date, text_html text NOT NULL, -- CDATA-wrapped HTML UNIQUE (bill_id, version_code, action_date));
CREATE TABLE bill_text_versions ( id text PRIMARY KEY, -- 'bill:119-hr-1-eh' bill_id text NOT NULL REFERENCES bills(id) ON DELETE CASCADE, version_code text NOT NULL, -- 'eh', 'is', 'enr', etc. version_name text NOT NULL, -- 'Engrossed in House' version_date timestamptz, govinfo_package_id text, -- 'BILLS-119hr1eh' xml_url text, html_url text, pdf_url text, body_text text, -- extracted plain text body_xml text, -- raw billres XML (or USLM for PLAW) body_fetched_at timestamptz, UNIQUE (bill_id, version_code));CREATE INDEX bill_text_versions_bill ON bill_text_versions (bill_id);
CREATE TABLE bill_amendments ( id text PRIMARY KEY, -- 'amdt:119-samdt-2849' bill_id text NOT NULL REFERENCES bills(id) ON DELETE CASCADE, congress smallint NOT NULL, amendment_type text NOT NULL CHECK (amendment_type IN ('hamdt', 'samdt', 'suamdt')), number int NOT NULL, purpose text, sponsor_bioguide text REFERENCES legislators(bioguide_id), submitted_date date, proposed_date date, latest_action_date date, latest_action_text text, chamber text, type_description text, -- 'Amendment in the Nature of a Substitute', etc. raw_xml bytea, -- the inline <amendment> subtree UNIQUE (congress, amendment_type, number));CREATE INDEX bill_amendments_bill ON bill_amendments (bill_id);
CREATE TABLE bill_related ( bill_id text NOT NULL REFERENCES bills(id) ON DELETE CASCADE, related_bill_id text NOT NULL, -- soft FK — related bill may not yet be ingested relationship_type text NOT NULL, -- 'companion', 'identical', 'similar', etc. identified_by text NOT NULL, -- 'CRS' | 'House' | 'Senate' PRIMARY KEY (bill_id, related_bill_id, relationship_type, identified_by));CREATE INDEX bill_related_target ON bill_related (related_bill_id);
CREATE TABLE bill_laws ( bill_id text NOT NULL REFERENCES bills(id) ON DELETE CASCADE, law_type text NOT NULL, -- 'Public Law' | 'Private Law' law_number text NOT NULL, -- '119-21' PRIMARY KEY (bill_id, law_type, law_number));
CREATE TABLE bill_cbo_estimates ( id bigserial PRIMARY KEY, bill_id text NOT NULL REFERENCES bills(id) ON DELETE CASCADE, pub_date timestamptz, title text, description text, url text, UNIQUE (bill_id, url));
CREATE TABLE bill_committee_reports ( id bigserial PRIMARY KEY, bill_id text NOT NULL REFERENCES bills(id) ON DELETE CASCADE, citation text NOT NULL, -- e.g. 'H. Rept. 119-100' url text, UNIQUE (bill_id, citation));
-- ============================================================-- Vector chunks for similarity search on bill text-- ============================================================
CREATE TABLE bill_text_chunks ( id bigserial PRIMARY KEY, text_version_id text NOT NULL REFERENCES bill_text_versions(id) ON DELETE CASCADE, chunk_index int NOT NULL, chunk_text text NOT NULL, embedding vector(1024), -- adjust dim per model UNIQUE (text_version_id, chunk_index));CREATE INDEX bill_text_chunks_embedding ON bill_text_chunks USING hnsw (embedding vector_cosine_ops);Schema decisions worth flagging:
- Legislators and committees in their own tables with their own ingestion path (from
congress-legislators). Bills join to them but they're not bill-children. Load these first, then bills. raw_billstatus_xmlas gzipped bytea — preserves full record for re-parsing without re-fetching. Optional (can be turned off post-stable to save space).- Amendments as a top-level table, not a JSON column on bills — they have their own sponsors, actions, and cross-references. HR 1 (119) has 493 of them.
bill_related.related_bill_idis a soft FK — the related bill may not yet be in our database (e.g., we ingest 119th first, but it references 118th bills we haven't loaded). Don't enforce FK constraint.- Text versions get their own stable IDs (
bill:119-hr-1-eh) so the agent can pass them around independently. - Search tsvector on title + policy_area + latest_action_text for now — bill body text is searched via
bill_text_chunks(embedding) or a separate full-text index onbill_text_versions.body_textto be added. - No tsvector on body text initially. Bill text bodies are large; deferring full-text indexing on them until we measure size impact. Use
bill_text_chunksfor similarity, which is what the agent'sfind_similar_billstool needs. - Public-law denormalized onto
billsin addition tobill_lawstable —became_lawboolean andpublic_law_numberare common filter targets, worth duplicating.
Download / update strategy
Section titled “Download / update strategy”Backfill (one-time, 113th → present for BILLSTATUS, broader for text)
Section titled “Backfill (one-time, 113th → present for BILLSTATUS, broader for text)”Recommended path: GovInfo BILLSTATUS ZIPs for backfill, per-file BILLS for text.
- Members + committees first. Pull
legislators-current.yaml,legislators-historical.yaml,committees-current.yaml,committees-historical.yaml,committee-membership-current.yaml. Parse YAML → load. ~10 MB total, single fetch each. - BILLSTATUS by congress×type. For each congress 113 through current, for each bill type:
- Download
BILLSTATUS-{c}-{t}.zip(10-35 MB each, ~40 ZIPs total for 113-119). - Extract to
/data/corpus/bills/billstatus/{c}/{t}/. - Parse each XML → bills + child tables.
- This is the fastest backfill path — total download is ~1.5 GB.
- Download
- BILLSUM by bill (or by ZIP if available). Per-bill XML for richer summary metadata if BILLSTATUS summaries are insufficient. Optional for v1.
- BILLS text by version. For each
bill_text_versionsrow, fetch the XML format (skip HTML/PDF unless XML missing).- Per-bill text fetch is 100s of GB across all bills × versions. Probably do this lazily — fetch on first agent request, populate
bill_text_versions.body_texton demand. - Or: backfill text only for bills that became law (PLAW collection — much smaller, USLM format).
- Per-bill text fetch is 100s of GB across all bills × versions. Probably do this lazily — fetch on first agent request, populate
- Older congresses (108th-112th). No BILLSTATUS coverage. Use Congress.gov API
/bill/{c}/...with a real api.data.gov key for partial structure. Lower priority — get 113+ solid first.
Time budget: BILLSTATUS backfill (full structure for 113-119) is hours, not days. Bill text backfill (lazy) — depends on workload.
Daily incremental
Section titled “Daily incremental”- Hourly: poll RSS. Parse the BILLSTATUS RSS batch feed's
<item><pubDate>and<description>(file list). For each file URL newer than last_run, refetch, re-parse, upsert. - Daily: sitemap diff. For current-congress bill types only. Compare each entry's
<lastmod>against ourbills.source_last_modified. Refetch any that changed. - Daily: legislators/committees. Re-pull
legislators-current.yamlandcommittees-current.yamlif their HTTPLast-Modifiedchanged. Multiple commits per month is normal. - Weekly: drift check. Compare per-congress sitemap counts against our row counts. Alert on discrepancies.
State tracking
Section titled “State tracking”Per architecture: ingestion state in Postgres tables ingestion_runs, ingestion_logs, ingestion_schedule. Each run records source name (bills_billstatus, bills_text, legislators, etc.) and the actual parameters in config jsonb so we can re-run.
Failure modes
Section titled “Failure modes”- HTTP 503 from GovInfo during ZIP download — retry with exponential backoff up to ~5 min, then halt and alert.
- Malformed BILLSTATUS XML — log doc, skip parse, mark
parsed_at = NULL. Re-attempt next run. - Bioguide mismatch (sponsor's bioguide_id not in legislators table) — usually means a member wasn't yet in
legislators-historical.yaml(rare race; the YAML lags slightly). Log and queue for retry on next legislators sync. - Action sequence ambiguity — same event with multiple type categorizations. Preserve all action rows; rely on
action_codefor filtering. - Public Law not yet in PLAW collection — bills can be marked
became_law=truebefore PLAW XML is available. Decouple PLAW fetch from BILLSTATUS upsert; have its own retry queue.
Open questions
Section titled “Open questions”These don't block ingestion but should be resolved before this source is "shipped":
- Coverage for 108th-112th congresses. BILLSTATUS doesn't cover these. Do we (a) skip entirely and document the v1 gap, (b) ingest partial via Congress.gov API alone, or (c) use the older
unitedstates/congresscached outputs as a one-time backfill? Decide before launching v1. - Bill text strategy. Eager fetch for all 2.5M+ text versions, or lazy fetch on first agent request? Lean toward lazy + eager-only for current-congress bills + enacted-law (PLAW) bodies. Measure before deciding.
Embedding model and dimension for bill text— resolved. Locked OSS-only on Snowflake Arctic-Embed-M-v2 (1024-dim, MIT license, runs on CPU). The vec0 schema column isembedding float[1024]+embedding_bq bit[1024]. See Architecture, Storage stack. Chunking strategy is still open (next bullet).- Chunking for bill text. Section-aware (split at
<section>/<part>/<title>in billres XML / USLM) vs fixed-size with overlap. Section-aware matches USLM's structure but parser is more work. unitedstates/congressparsing edge cases. The maintainers have caught dozens of action-code idiosyncrasies over a decade. Worth readingbill_info.py/bills.pyend-to-end before writing our parser, even if we don't run their code.- Older bill types or naming variants. Have any bill types been retired or renamed historically? E.g., do early congresses have records under different type codes? Unlikely to be a v1 issue but worth checking before backfilling pre-113th.
- Action time zone.
<actionTime>from House floor is presumably ET, but isn't formally documented. Verify by spot-check against C-SPAN times. textVersionsfor amendments. As noted, amendment-as-text version codes (e.g.eas) refer to amendment text, not bill text. Need a clean way to model this — either (a) separatebill_amendment_text_versionstable, or (b) flag inbill_text_versions.is_amendment_text. (b) is simpler.