Skip to content

Bills

Federal legislation: House and Senate bills, joint resolutions, concurrent resolutions, and simple resolutions across all bill types and congresses. The centerpiece data source — bills are the most cross-referenced entity in the substrate, linking to members, committees, votes, hearings, US Code, and the Federal Register.

This source is more complex than Federal Register because there are three official primary sources (GovInfo bulk, Congress.gov API, unitedstates/congress scrapers) plus a mandatory companion source (unitedstates/congress-legislators) for resolving sponsors and cosponsors to canonical member records.

Source nameBills
PublishersLibrary of Congress (Congress.gov), Office of the Clerk (House), Secretary of the Senate, Government Publishing Office (GovInfo)
LicensePublic domain (US government works); congress-legislators is CC0-1.0
CoverageBILLSTATUS XML: 113th Congress (2013) – present. Older bills available as text/summary only. Congress.gov API aggregate has ~426K bill records back further.
Volume~10K-19K bills/congress, ~150K total in BILLSTATUS-covered range, ~2.5M-3M text version files
Storage estimate~10-25 GB raw across all collections (gzipped XML + bill text in 3 formats)
AuthCongress.gov API: register at api.data.gov for a key and set CONGRESS_GOV_API_KEY. GovInfo: none. GitHub raw / Pages: none.
Stable ID formatbill:{congress}-{type}-{number} e.g. bill:119-hr-1, bill:117-hjres-30
Statusexploring — schema drafted, ingestion not built

Three primary sources overlap. Recommendation:

Primary: GovInfo BILLSTATUS XML. Single file per bill at a deterministic URL, contains essentially every field Congress.gov exposes (sponsor, cosponsors with bioguideId and isOriginalCosponsor, full action history with actionCode/type/sourceSystem, summaries, subjects, titles by titleType, committeeReports, relatedBills, amendments inline, textVersions with format URLs, laws). No auth, no rate limit, ZIP packages per congress×billtype, RSS for incremental updates.

Secondary: GovInfo BILLS / PLAW. Bill text in PDF/HTML/XML across all version codes (ih, is, eh, es, rh, rs, enr, etc.). Linked from BILLSTATUS via textVersions URLs — fetches are deterministic once BILLSTATUS is parsed. Public Laws (PLAW) ship with USLM 2.0 XML (modern format); live bill versions ship with the older "billres" DTD-based XML.

Tertiary: Congress.gov v3 API. Use only as enrichment / verification layer. Slightly cleaner JSON payload structure; cboCostEstimates and legislationUrl are convenient. But it duplicates BILLSTATUS data, costs API quota, and needs a real key. Not worth being primary. Note: this is the JSON API at api.congress.gov/v3/ — distinct from the public congress.gov/bill/... landing pages, which are now behind a Cloudflare Turnstile bot wall (see Access notes). When you need to confirm a bill's title or that a legislationUrl resolves, query the v3 API rather than fetching the landing page.

Companion (mandatory): unitedstates/congress-legislators. Hosted YAML registry of every member of Congress, current and historical, keyed by bioguide ID — the same ID used by both BILLSTATUS and Congress.gov. Required for joining bills to members.

Skip as runtime dependency: unitedstates/congress scrapers. They're a thin BILLSTATUS-XML-to-JSON converter at this point. Read bill_info.py and bills.py for action-code edge cases their maintainers have caught over a decade — but re-implement the parse ourselves, writing directly to our schema.

Congress.gov: Register at api.data.gov for a key. DEMO_KEY is rate-limited to ~30 requests per hour per IP — fine for spot checks, not for ingestion; a real key is needed for ingestion. Current rate-limit numbers and the live-quota response headers are documented at api.data.gov rate limits.

Congress.gov landing pages (bot-walled, observed 2026-05-29): the human-facing pages at https://www.congress.gov/bill/... (the form legislationUrl points at, e.g. https://www.congress.gov/bill/119th-congress/house-bill/1) now return an HTTP 403 backed by a Cloudflare Turnstile challenge on a plain-HTTP GET. Do not verify legislationUrl by fetching the landing page. The URL pattern (/bill/{congress}th-congress/{type-slug}/{number}) is stable and verifiable by construction; to cross-check a bill's title or existence, query the Congress.gov v3 JSON API (api.congress.gov/v3/bill/{c}/{t}/{n}), which is not bot-walled. The same Turnstile wall is why congress.gov/help/field-values/bill-versions 403s on direct fetch.

GovInfo: No auth. Polite rate (~1-2 req/sec) recommended. No formal limit observed.

GitHub Pages mirror (unitedstates.github.io/congress-legislators/...) is faster, CORS-friendly, and serves valid Content-Type: text/yaml with Last-Modified + ETag. Prefer this over raw.githubusercontent.com when fetching the YAML registry.

User-Agent: Send a real-browser UA — it's harmless and avoids any heuristic blocks.

Endpoint catalogs are upstream-documented; this page keeps only the Josh-specific source-priority call (above) and the gotchas below.

  • Congress.gov v3 API (tertiary / enrichment only) — base https://api.congress.gov/v3/, all requests need ?api_key=.... Full endpoint catalog and OpenAPI: Congress.gov API docs. pagination.count on /bill = aggregate (~426K). Note: /cosponsors is often empty even for major bills.
  • GovInfo bulk (primary / secondary) — BILLSTATUS / BILLSUM / BILLS / PLAW collections, ZIP packages per congress×type, and MODS/wssearch metadata sidecars. URL patterns and the BILLSTATUS user guide: GovInfo bulk data and the GPO bulk-data repo. PLAW ships USLM 2.0; live BILLS versions ship the older billres DTD XML.
  • unitedstates/congress-legislators (companion / mandatory) — GitHub Pages YAML at https://unitedstates.github.io/congress-legislators/{name}.yaml. Files used: legislators-current.yaml (~1.1 MB), legislators-historical.yaml (~9 MB), committees-current.yaml, committees-historical.yaml, committee-membership-current.yaml (plus district-offices, social-media, executive). No JSON/CSV mirror is committed in the repo — all YAML; conversion is on us. Update cadence: multiple commits per month is normal; pull on a daily cron.

GovInfo discovery quirks (probe findings): There is no JSON browse API for these collections — https://www.govinfo.gov/bulkdata/json/BILLSTATUS returns an HTML error page. There is also no usable site-root sitemaphttps://www.govinfo.gov/sitemap.xml returns a 404. Enter discovery through the BILLSTATUS sitemap index (under https://www.govinfo.gov/sitemap/bulkdata/BILLSTATUS/..., confirmed 2026-05-29), not a top-level sitemap. Incremental discovery uses the per-collection sitemaps (per-file <lastmod>) and the BILLSTATUS RSS batch feed; see the GovInfo BILLSTATUS bulk landing for the current URLs. The current-Congress job runs every 4 hours; prior congresses daily.

The 8 type codes are load-bearing — they're pinned in the bills.bill_type CHECK constraint:

CodeName
hrHouse Bill
sSenate Bill
hjresHouse Joint Resolution
sjresSenate Joint Resolution
hconresHouse Concurrent Resolution
sconresSenate Concurrent Resolution
hresHouse Simple Resolution
sresSenate Simple Resolution

Casing gotcha: URLs use lowercase (/bill/119/hr/1); BILLSTATUS XML <type> is uppercase (HR, HJRES).

The full <actions>/<item>/<type> enum (Actions by the President … VARIES) and the <actionCode> table are upstream-documented in the Congress.gov legislative glossary / User Guide. What Josh needs to remember:

  • Each action carries a short alphanumeric <actionCode> (e.g. E40000 for "Became Public Law", H41610, 36000).
  • <sourceSystem>/<code> enum: 0 = House, 1 = House committee actions, 2 = Senate, 9 = Library of Congress. actionTime is meaningful only when sourceSystem.code == 2 (House floor).
  • Gotcha: the same legislative event can produce multiple action rows with different categorizations. "Became Public Law No: 119-21." may appear as both type=President (actionCode=E40000) and type=BecameLaw (actionCode=36000). Do not deduplicate naively — preserve both rows and use actionCode for filtering.

~30 text-version codes exist (ih, is, rh, rs, eh, es, eah, eas, eas2, pcs, pcs2, rfs, rds, rfh, rhuc, enr, …) and <titleType> codes run 01–05, 09, 10, 20–26, 29, 30–36, 39, 40 (Official Title as Introduced through Short Title as Enacted through Popular Title). The full canonical version table lives at congress.gov/help/field-values/bill-versions — but that page 403s behind the same Cloudflare Turnstile wall (see Access notes). Don't reproduce or fetch it: when parsing BILLSTATUS, the textVersions/item/type element gives the human-readable name next to each code, so unknown codes resolve at parse time without hitting that page. Each bill has many titles.

Recommended: bill:{congress}-{type}-{number}, lowercased.

Examples:

  • bill:119-hr-1
  • bill:117-hjres-30
  • bill:118-sres-512

Rationale:

  • Matches Congress.gov URL ordering exactly (/bill/119/hr/1) — round-trippable.
  • Congress-first prefix is chronologically grouped, agent-readable.
  • Hyphen separators are URL-safe and human-readable.
  • Other formats considered: unitedstates/congress uses {type}{number}-{congress} (hr1-119), GovInfo uses BILLS-{c}{t}{n}{v} packed (BILLS-119hr1eh), GovTrack used {type}{number}-{congress}. None is dominant. Putting congress first wins on agent ergonomics.

Derived IDs:

  • Text version: bill:{congress}-{type}-{number}-{versionCode}bill:119-hr-1-eh. Concatenates to GovInfo packageId via BILLS-{c}{t}{n}{v}BILLS-119hr1eh.
  • Amendment: amdt:{congress}-{type}-{number}amdt:119-samdt-2849.
  • Public law: law:{congress}-{number}law:119-21.

Sponsors and cosponsors in both Congress.gov API and BILLSTATUS XML are identified by bioguideId (e.g. A000375, D000600). unitedstates/congress-legislators uses id.bioguide as primary key for the same population — one-to-one match, no transformation.

Implementation: load legislators-current.yaml + legislators-historical.yaml into a dict keyed on bioguide. Decorate every bill sponsor/cosponsor row with name.official_full, current term's party/state/district, etc.

The YAML carries crosswalk IDs to GovTrack, OpenSecrets, VoteSmart, FEC, ICPSR, Wikidata, Wikipedia, and Ballotpedia — useful for later integrations.

Committee join: BILLSTATUS <committee><systemCode>hsbu00</systemCode> ↔ committees-current thomas_id: HSBU via systemCode = lowercase(thomas_id) + "00" (top-level) or lowercase(parent_thomas) + subcommittee_thomas_id (subcommittee). Need a small mapping shim.

BILLSTATUS XML shape (the canonical record)

Section titled “BILLSTATUS XML shape (the canonical record)”

The full element-by-element <billStatus> / <bill> skeleton is upstream-documented in the GPO BILLSTATUS XML User Guide. The <bill> element nests committees, committeeReports, relatedBills, actions, sponsors, cboCostEstimates, laws, policyArea, subjects, summaries, title/titles, amendments, textVersions, and latestAction. The Josh-specific parse quirks that no upstream doc flags:

Sponsors / cosponsors. <sponsors> is always present (single sponsor, identified by bioguideId + fullName + party/state/district). <cosponsors> may be elided entirely on bills with no cosponsors. Don't assume cosponsors.count > 0 — major reconciliation bills routinely have zero cosponsors because they ride budget resolution authority. Cosponsor items carry sponsorshipDate and isOriginalCosponsor.

textVersions/item. Each item has a type (human-readable name, e.g. "Engrossed in House"), a date, and a formats list of urls. <textVersions> types include amendments-as-text (e.g. eas "Engrossed Amendment Senate"). These are text of an amendment to a bill, not bill-version variants. Plan the text-version table accordingly.

amendments inline. BILLSTATUS inlines all amendments as <amendment> elements with their own sponsors, cosponsors, actions, and latestAction. HR 1 (119th) has 493 amendments — the file is 2 MB largely because of inline amendment trees. Plan a separate bill_amendments table; do not store as JSON column.

Public-law linkage. When a bill becomes law, three independent signals appear: a <laws> item (<type>Public Law</type><number>119-21</number>), a <textVersions> item with <type>Public Law</type> linking to PLAW USLM, and an <actions> row with <actionCode>36000</actionCode><type>BecameLaw</type>, plus <latestAction><text>Became Public Law No: 119-21.</text></latestAction>.

Per text version, three formats:

  • xml — the older DTD-based "billres" / OLC format. Not USLM for live bills.
  • html.htm extension.
  • pdf — printable.

USLM XML is only available for enacted Public Laws at /content/pkg/PLAW-{congress}publ{lawnumber}/uslm/.... If we want USLM for live bills, we'd convert ourselves; for v1, parse the billres XML directly.

Per-text-version typical sizes (HR 1 119th — extreme upper bound):

  • XML: 1.93 MB · HTML: 1.37 MB · PDF: 1.65 MB

Median bill is ~30-200 KB BILLSTATUS + 100-300 KB across all text versions.

Per-congress bill counts (sitemap-derived, all 8 bill types):

CongressTotal bills + resolutions
113 (2013-14)10,637
114 (2015-16)12,063
115 (2017-18)13,556
116 (2019-20)16,601
117 (2021-22)17,828
118 (2023-24)19,315
119 (2025-26, in progress)15,565

BILLSTATUS bulk coverage starts at 113th Congress — older congresses have BILLS text and PLAW for enacted laws but no structured status. The original "108th onward" target needs a caveat: full structure for 113+, partial (text + summary + enrollment) for 108-112, thinner still for older congresses via Congress.gov API.

For the BILLSTATUS-covered range (113th-present): ~105K bill records + a few hundred thousand text-version files.

Total raw corpus estimate: ~10-25 GB across BILLSTATUS XML + BILLSUM + BILLS (xml only — skip pdf/html when xml is available) + PLAW USLM. Smaller than Federal Register because per-bill files are usually small.

Three independent change signals exist (raw sitemap/RSS URLs are listed on the GovInfo BILLSTATUS bulk landing):

  1. Per-file Last-Modified on every GovInfo XML and ZIP — usable for If-Modified-Since requests on individual files.
  2. Per-collection sitemap <lastmod> — every URL listed with its modification time. Diff against last-stored to find changed bills since last run.
  3. RSS batch feed — batch updates with file lists. Current-congress runs every 4 hours; prior congresses daily.

Practical incremental strategy:

  • Hourly/4-hourly tick: poll RSS for batch updates. Each batch lists files changed; refetch only those.
  • Daily tick: compare sitemap <lastmod> against our last-seen timestamps for current-congress bill types.
  • Weekly tick (sanity check): compare sitemap counts per-congress against our row counts; alert on drift.

Congress.gov API ?fromDateTime=YYYY-MM-DDTHH:MM:SSZ&sort=updateDate+desc is the equivalent for the API path — useful for verification or when GovInfo is lagging.

-- ============================================================
-- Members (companion source — load before bills)
-- ============================================================
CREATE TABLE legislators (
bioguide_id text PRIMARY KEY, -- 'A000375'
govtrack_id int,
thomas_id text,
lis_id text,
opensecrets_id text,
fec_ids text[],
icpsr_id int,
wikidata_id text,
wikipedia text,
first_name text,
last_name text,
middle_name text,
suffix text,
nickname text,
official_full_name text,
birthday date,
gender text,
is_current boolean NOT NULL DEFAULT false,
raw_yaml jsonb NOT NULL,
fetched_at timestamptz NOT NULL,
updated_at timestamptz NOT NULL DEFAULT now()
);
CREATE TABLE legislator_terms (
id bigserial PRIMARY KEY,
bioguide_id text NOT NULL REFERENCES legislators(bioguide_id) ON DELETE CASCADE,
term_type text NOT NULL, -- 'rep' | 'sen'
term_start date NOT NULL,
term_end date NOT NULL,
state text NOT NULL,
district int, -- rep only
senate_class int, -- sen only
party text,
party_affiliations jsonb, -- list when party changed mid-term
caucus text,
state_rank text, -- 'junior' | 'senior'
url text,
address text,
phone text,
rss_url text,
UNIQUE (bioguide_id, term_start)
);
CREATE INDEX legislator_terms_active ON legislator_terms (term_end DESC, term_start);
-- ============================================================
-- Committees (companion source — load before bills)
-- ============================================================
CREATE TABLE committees (
id text PRIMARY KEY, -- 'committee:hsbu' (lowercase thomas_id)
thomas_id text UNIQUE, -- 'HSBU'
system_code text UNIQUE, -- 'hsbu00' for top-level; subcommittees have different code
chamber text NOT NULL CHECK (chamber IN ('house', 'senate', 'joint')),
name text NOT NULL,
parent_id text REFERENCES committees(id),
url text,
minority_url text,
address text,
phone text,
jurisdiction text,
rss_url text,
youtube_id text,
is_current boolean NOT NULL DEFAULT true,
raw_yaml jsonb,
fetched_at timestamptz NOT NULL,
updated_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX committees_parent ON committees (parent_id) WHERE parent_id IS NOT NULL;
-- ============================================================
-- Bills (canonical)
-- ============================================================
CREATE TABLE bills (
-- Stable canonical ID
id text PRIMARY KEY, -- 'bill:119-hr-1'
-- Composite natural key
congress smallint NOT NULL, -- 119
bill_type text NOT NULL CHECK (bill_type IN (
'hr', 's', 'hjres', 'sjres', 'hconres', 'sconres', 'hres', 'sres'
)),
number int NOT NULL, -- 1
UNIQUE (congress, bill_type, number),
origin_chamber text CHECK (origin_chamber IN ('House', 'Senate')),
introduced_date date,
-- Latest title (titleType = display title or official as introduced)
title text NOT NULL,
-- Single-value rollups (denormalized for filter speed)
policy_area text,
sponsor_bioguide text REFERENCES legislators(bioguide_id),
sponsor_is_by_request boolean,
latest_action_date date,
latest_action_text text,
became_law boolean NOT NULL DEFAULT false,
public_law_type text, -- 'Public Law' | 'Private Law'
public_law_number text, -- '119-21'
-- Update tracking
update_date timestamptz, -- BILLSTATUS <updateDate>
update_date_inc_text timestamptz, -- <updateDateIncludingText>
source_last_modified timestamptz, -- HTTP Last-Modified of BILLSTATUS XML
-- Lifecycle / forensics
raw_billstatus_xml bytea, -- gzipped XML for forensics; nullable
fetched_at timestamptz NOT NULL,
parsed_at timestamptz,
inserted_at timestamptz NOT NULL DEFAULT now(),
updated_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX bills_congress_intro ON bills (congress, introduced_date DESC);
CREATE INDEX bills_sponsor ON bills (sponsor_bioguide) WHERE sponsor_bioguide IS NOT NULL;
CREATE INDEX bills_latest_action ON bills (latest_action_date DESC NULLS LAST);
CREATE INDEX bills_public_law ON bills (public_law_number) WHERE public_law_number IS NOT NULL;
CREATE INDEX bills_update_date ON bills (update_date DESC NULLS LAST);
-- Full-text on title + summary text (body text comes from text versions, indexed separately)
ALTER TABLE bills ADD COLUMN search_tsv tsvector
GENERATED ALWAYS AS (
setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
setweight(to_tsvector('english', coalesce(policy_area, '')), 'B') ||
setweight(to_tsvector('english', coalesce(latest_action_text, '')), 'C')
) STORED;
CREATE INDEX bills_search ON bills USING gin (search_tsv);
-- ============================================================
-- Bill child tables
-- ============================================================
CREATE TABLE bill_actions (
id bigserial PRIMARY KEY,
bill_id text NOT NULL REFERENCES bills(id) ON DELETE CASCADE,
sequence int NOT NULL, -- order within the bill's action history
action_date date NOT NULL,
action_time time, -- only meaningful when source_system_code=2
action_code text, -- short alphanumeric, e.g. 'E40000'
action_type text, -- the categorization, e.g. 'BecameLaw'
source_system_code int, -- 0=House, 1=House cmte, 2=Senate, 9=LoC
source_system_name text,
text text NOT NULL,
committee_system_code text, -- when action was committee-driven
UNIQUE (bill_id, sequence)
);
CREATE INDEX bill_actions_bill_date ON bill_actions (bill_id, action_date DESC);
CREATE INDEX bill_actions_code ON bill_actions (action_code) WHERE action_code IS NOT NULL;
CREATE TABLE bill_cosponsors (
bill_id text NOT NULL REFERENCES bills(id) ON DELETE CASCADE,
bioguide_id text NOT NULL REFERENCES legislators(bioguide_id),
sponsorship_date date NOT NULL,
is_original_cosponsor boolean NOT NULL DEFAULT false,
PRIMARY KEY (bill_id, bioguide_id)
);
CREATE INDEX bill_cosponsors_bioguide ON bill_cosponsors (bioguide_id);
CREATE TABLE bill_committees (
id bigserial PRIMARY KEY,
bill_id text NOT NULL REFERENCES bills(id) ON DELETE CASCADE,
committee_id text REFERENCES committees(id),
activity_name text NOT NULL, -- 'Referred to', 'Reported', etc.
activity_date timestamptz,
UNIQUE (bill_id, committee_id, activity_name, activity_date)
);
CREATE INDEX bill_committees_bill ON bill_committees (bill_id);
CREATE TABLE bill_subjects (
bill_id text NOT NULL REFERENCES bills(id) ON DELETE CASCADE,
subject text NOT NULL,
PRIMARY KEY (bill_id, subject)
);
CREATE INDEX bill_subjects_subject ON bill_subjects (subject);
CREATE TABLE bill_titles (
id bigserial PRIMARY KEY,
bill_id text NOT NULL REFERENCES bills(id) ON DELETE CASCADE,
title_type_code text, -- '01', '02', ..., '40'
title_type_name text, -- e.g. 'Official Title as Introduced'
chamber text, -- 'House' | 'Senate' | NULL
title text NOT NULL,
UNIQUE (bill_id, title_type_code, chamber, title)
);
CREATE TABLE bill_summaries (
id bigserial PRIMARY KEY,
bill_id text NOT NULL REFERENCES bills(id) ON DELETE CASCADE,
version_code text, -- e.g. '00' (introduced), '49', '50' (private law), etc.
action_desc text, -- describes the bill state at summary time
action_date date,
text_html text NOT NULL, -- CDATA-wrapped HTML
UNIQUE (bill_id, version_code, action_date)
);
CREATE TABLE bill_text_versions (
id text PRIMARY KEY, -- 'bill:119-hr-1-eh'
bill_id text NOT NULL REFERENCES bills(id) ON DELETE CASCADE,
version_code text NOT NULL, -- 'eh', 'is', 'enr', etc.
version_name text NOT NULL, -- 'Engrossed in House'
version_date timestamptz,
govinfo_package_id text, -- 'BILLS-119hr1eh'
xml_url text,
html_url text,
pdf_url text,
body_text text, -- extracted plain text
body_xml text, -- raw billres XML (or USLM for PLAW)
body_fetched_at timestamptz,
UNIQUE (bill_id, version_code)
);
CREATE INDEX bill_text_versions_bill ON bill_text_versions (bill_id);
CREATE TABLE bill_amendments (
id text PRIMARY KEY, -- 'amdt:119-samdt-2849'
bill_id text NOT NULL REFERENCES bills(id) ON DELETE CASCADE,
congress smallint NOT NULL,
amendment_type text NOT NULL CHECK (amendment_type IN ('hamdt', 'samdt', 'suamdt')),
number int NOT NULL,
purpose text,
sponsor_bioguide text REFERENCES legislators(bioguide_id),
submitted_date date,
proposed_date date,
latest_action_date date,
latest_action_text text,
chamber text,
type_description text, -- 'Amendment in the Nature of a Substitute', etc.
raw_xml bytea, -- the inline <amendment> subtree
UNIQUE (congress, amendment_type, number)
);
CREATE INDEX bill_amendments_bill ON bill_amendments (bill_id);
CREATE TABLE bill_related (
bill_id text NOT NULL REFERENCES bills(id) ON DELETE CASCADE,
related_bill_id text NOT NULL, -- soft FK — related bill may not yet be ingested
relationship_type text NOT NULL, -- 'companion', 'identical', 'similar', etc.
identified_by text NOT NULL, -- 'CRS' | 'House' | 'Senate'
PRIMARY KEY (bill_id, related_bill_id, relationship_type, identified_by)
);
CREATE INDEX bill_related_target ON bill_related (related_bill_id);
CREATE TABLE bill_laws (
bill_id text NOT NULL REFERENCES bills(id) ON DELETE CASCADE,
law_type text NOT NULL, -- 'Public Law' | 'Private Law'
law_number text NOT NULL, -- '119-21'
PRIMARY KEY (bill_id, law_type, law_number)
);
CREATE TABLE bill_cbo_estimates (
id bigserial PRIMARY KEY,
bill_id text NOT NULL REFERENCES bills(id) ON DELETE CASCADE,
pub_date timestamptz,
title text,
description text,
url text,
UNIQUE (bill_id, url)
);
CREATE TABLE bill_committee_reports (
id bigserial PRIMARY KEY,
bill_id text NOT NULL REFERENCES bills(id) ON DELETE CASCADE,
citation text NOT NULL, -- e.g. 'H. Rept. 119-100'
url text,
UNIQUE (bill_id, citation)
);
-- ============================================================
-- Vector chunks for similarity search on bill text
-- ============================================================
CREATE TABLE bill_text_chunks (
id bigserial PRIMARY KEY,
text_version_id text NOT NULL REFERENCES bill_text_versions(id) ON DELETE CASCADE,
chunk_index int NOT NULL,
chunk_text text NOT NULL,
embedding vector(1024), -- adjust dim per model
UNIQUE (text_version_id, chunk_index)
);
CREATE INDEX bill_text_chunks_embedding
ON bill_text_chunks USING hnsw (embedding vector_cosine_ops);

Schema decisions worth flagging:

  • Legislators and committees in their own tables with their own ingestion path (from congress-legislators). Bills join to them but they're not bill-children. Load these first, then bills.
  • raw_billstatus_xml as gzipped bytea — preserves full record for re-parsing without re-fetching. Optional (can be turned off post-stable to save space).
  • Amendments as a top-level table, not a JSON column on bills — they have their own sponsors, actions, and cross-references. HR 1 (119) has 493 of them.
  • bill_related.related_bill_id is a soft FK — the related bill may not yet be in our database (e.g., we ingest 119th first, but it references 118th bills we haven't loaded). Don't enforce FK constraint.
  • Text versions get their own stable IDs (bill:119-hr-1-eh) so the agent can pass them around independently.
  • Search tsvector on title + policy_area + latest_action_text for now — bill body text is searched via bill_text_chunks (embedding) or a separate full-text index on bill_text_versions.body_text to be added.
  • No tsvector on body text initially. Bill text bodies are large; deferring full-text indexing on them until we measure size impact. Use bill_text_chunks for similarity, which is what the agent's find_similar_bills tool needs.
  • Public-law denormalized onto bills in addition to bill_laws table — became_law boolean and public_law_number are common filter targets, worth duplicating.

Backfill (one-time, 113th → present for BILLSTATUS, broader for text)

Section titled “Backfill (one-time, 113th → present for BILLSTATUS, broader for text)”

Recommended path: GovInfo BILLSTATUS ZIPs for backfill, per-file BILLS for text.

  1. Members + committees first. Pull legislators-current.yaml, legislators-historical.yaml, committees-current.yaml, committees-historical.yaml, committee-membership-current.yaml. Parse YAML → load. ~10 MB total, single fetch each.
  2. BILLSTATUS by congress×type. For each congress 113 through current, for each bill type:
    • Download BILLSTATUS-{c}-{t}.zip (10-35 MB each, ~40 ZIPs total for 113-119).
    • Extract to /data/corpus/bills/billstatus/{c}/{t}/.
    • Parse each XML → bills + child tables.
    • This is the fastest backfill path — total download is ~1.5 GB.
  3. BILLSUM by bill (or by ZIP if available). Per-bill XML for richer summary metadata if BILLSTATUS summaries are insufficient. Optional for v1.
  4. BILLS text by version. For each bill_text_versions row, fetch the XML format (skip HTML/PDF unless XML missing).
    • Per-bill text fetch is 100s of GB across all bills × versions. Probably do this lazily — fetch on first agent request, populate bill_text_versions.body_text on demand.
    • Or: backfill text only for bills that became law (PLAW collection — much smaller, USLM format).
  5. Older congresses (108th-112th). No BILLSTATUS coverage. Use Congress.gov API /bill/{c}/... with a real api.data.gov key for partial structure. Lower priority — get 113+ solid first.

Time budget: BILLSTATUS backfill (full structure for 113-119) is hours, not days. Bill text backfill (lazy) — depends on workload.

  1. Hourly: poll RSS. Parse the BILLSTATUS RSS batch feed's <item><pubDate> and <description> (file list). For each file URL newer than last_run, refetch, re-parse, upsert.
  2. Daily: sitemap diff. For current-congress bill types only. Compare each entry's <lastmod> against our bills.source_last_modified. Refetch any that changed.
  3. Daily: legislators/committees. Re-pull legislators-current.yaml and committees-current.yaml if their HTTP Last-Modified changed. Multiple commits per month is normal.
  4. Weekly: drift check. Compare per-congress sitemap counts against our row counts. Alert on discrepancies.

Per architecture: ingestion state in Postgres tables ingestion_runs, ingestion_logs, ingestion_schedule. Each run records source name (bills_billstatus, bills_text, legislators, etc.) and the actual parameters in config jsonb so we can re-run.

  • HTTP 503 from GovInfo during ZIP download — retry with exponential backoff up to ~5 min, then halt and alert.
  • Malformed BILLSTATUS XML — log doc, skip parse, mark parsed_at = NULL. Re-attempt next run.
  • Bioguide mismatch (sponsor's bioguide_id not in legislators table) — usually means a member wasn't yet in legislators-historical.yaml (rare race; the YAML lags slightly). Log and queue for retry on next legislators sync.
  • Action sequence ambiguity — same event with multiple type categorizations. Preserve all action rows; rely on action_code for filtering.
  • Public Law not yet in PLAW collection — bills can be marked became_law=true before PLAW XML is available. Decouple PLAW fetch from BILLSTATUS upsert; have its own retry queue.

These don't block ingestion but should be resolved before this source is "shipped":

  • Coverage for 108th-112th congresses. BILLSTATUS doesn't cover these. Do we (a) skip entirely and document the v1 gap, (b) ingest partial via Congress.gov API alone, or (c) use the older unitedstates/congress cached outputs as a one-time backfill? Decide before launching v1.
  • Bill text strategy. Eager fetch for all 2.5M+ text versions, or lazy fetch on first agent request? Lean toward lazy + eager-only for current-congress bills + enacted-law (PLAW) bodies. Measure before deciding.
  • Embedding model and dimension for bill textresolved. Locked OSS-only on Snowflake Arctic-Embed-M-v2 (1024-dim, MIT license, runs on CPU). The vec0 schema column is embedding float[1024] + embedding_bq bit[1024]. See Architecture, Storage stack. Chunking strategy is still open (next bullet).
  • Chunking for bill text. Section-aware (split at <section> / <part> / <title> in billres XML / USLM) vs fixed-size with overlap. Section-aware matches USLM's structure but parser is more work.
  • unitedstates/congress parsing edge cases. The maintainers have caught dozens of action-code idiosyncrasies over a decade. Worth reading bill_info.py/bills.py end-to-end before writing our parser, even if we don't run their code.
  • Older bill types or naming variants. Have any bill types been retired or renamed historically? E.g., do early congresses have records under different type codes? Unlikely to be a v1 issue but worth checking before backfilling pre-113th.
  • Action time zone. <actionTime> from House floor is presumably ET, but isn't formally documented. Verify by spot-check against C-SPAN times.
  • textVersions for amendments. As noted, amendment-as-text version codes (e.g. eas) refer to amendment text, not bill text. Need a clean way to model this — either (a) separate bill_amendment_text_versions table, or (b) flag in bill_text_versions.is_amendment_text. (b) is simpler.