
Committee reports

Formal explanations of bills issued by the committee of jurisdiction. When a House or Senate committee orders a bill reported, it usually files a written report — citation form H. Rept. 119-100 or S. Rept. 119-50 — that includes the committee's recommendation, a section-by-section analysis of the bill, the CBO cost estimate, dissenting / minority views, the rollcall of the committee vote, and required statements (constitutional authority, federal mandates, etc.).

Committee reports are the written record of legislative intent. Courts cite them. Lobbyists scrutinize them. Anyone trying to understand what a bill actually does should read them. For Josh, they are FK-target-rich (they reference bills, members, committees, CBO estimates, and other reports) and high-value retrieval text.

Source name: Committee reports
Publisher: Government Publishing Office (CRPT collection); reports authored by House and Senate committees
License: Public domain
Coverage: 1995 – present (online corpus); 104th Congress onward systematically.
Volume: ~500-1,300 reports per year. ~25,000+ all-time across both chambers.
Storage estimate: ~3-5 GB raw HTML + PDF; ~1-2 GB extracted text
Primary source: GovInfo CRPT bulk (per-year sitemap + per-package MODS/HTML/PDF; see GovInfo bulk data)
Auth: None
Rate limits: None observed; ~1-2 req/sec polite
Incremental sync hints: CRPT yearly sitemap diff; MODS recordChangeDate
Stable ID format: report:{packageId}, e.g. report:CRPT-119hrpt100, report:CRPT-119srpt50
Status: exploring (schema drafted, ingestion not built)

Primary: GovInfo CRPT bulk. Per-year sitemap, per-package MODS XML, HTML rendition, PDF rendition. Same shape as CHRG (hearings transcripts) — reuse the parser scaffolding. See the GovInfo CRPT help page for the collection's package layout and api.govinfo.gov for the endpoint reference.

Secondary: Congress.gov v3 /committee-report. Useful for verification + the relatedBills cross-reference (which is also in MODS but more cleanly structured in JSON). Costs API quota; not worth being primary. See the Congress.gov API docs for the endpoint's response structure.

Skip: per-committee committee-report pages on House/Senate sites. Heterogeneous, fragile, low marginal value over GovInfo CRPT.

GovInfo: open. Same patterns as CHRG. Real-browser User-Agent harmless. ~1-2 req/sec polite.

The bills source already has a bill_committee_reports child table — that's the bills-side index. This source is the canonical entity, with full body text, the citation graph, and the report-as-document.

The stable ID is report:{packageId}, with packageId taken verbatim from GovInfo. The packageId follows GovInfo's CRPT-{congress}{chamberCode}rpt{number} convention (chamber code h = House, s = Senate, e = Senate Executive Report on treaties); see the GovInfo CRPT help page for the full package-naming reference.

Examples:

  • report:CRPT-119hrpt100
  • report:CRPT-119srpt50
  • report:CRPT-119erpt6

Conference reports (the joint House+Senate reconciliation document) use the same format but typically appear in both chambers' hrpt and srpt series with adjacent numbers.
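A minimal sketch of how the stable ID and its discriminators could be derived from a packageId, assuming the three-part CRPT-{congress}{chamberCode}rpt{number} form described above. The names `parse_package_id` and `ReportKey` are hypothetical, and package ids for multi-part reports are deliberately not handled because their exact form is not spelled out here.

```python
import re
from typing import NamedTuple

# CRPT-{congress}{chamberCode}rpt{number}, e.g. CRPT-119hrpt100
_PACKAGE_RE = re.compile(r"^CRPT-(\d+)([hse])rpt(\d+)$")

# Chamber code -> (doc_class, chamber); 'e' is a Senate Executive Report.
_CODES = {
    "h": ("HRPT", "house"),
    "s": ("SRPT", "senate"),
    "e": ("ERPT", "senate"),
}


class ReportKey(NamedTuple):
    stable_id: str
    congress: int
    chamber: str
    doc_class: str
    report_number: int


def parse_package_id(package_id: str) -> ReportKey:
    """Split a CRPT packageId into the columns used by committee_reports."""
    m = _PACKAGE_RE.match(package_id)
    if m is None:
        raise ValueError(f"unrecognized CRPT packageId: {package_id!r}")
    congress, code, number = m.groups()
    doc_class, chamber = _CODES[code]
    return ReportKey(
        stable_id=f"report:{package_id}",
        congress=int(congress),
        chamber=chamber,
        doc_class=doc_class,
        report_number=int(number),
    )
```

Raising on anything that does not match keeps unexpected id shapes visible during backfill instead of silently producing a wrong row.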

The loader maps GovInfo packageIds to Bluebook citation form:

Citation | packageId
H. Rept. 119-100 | CRPT-119hrpt100
S. Rept. 119-50 | CRPT-119srpt50
H. Rept. 119-200, Pt. 1 | CRPT-119hrpt200 (multi-part; granules carry parts)
H. Conf. Rept. 119-50 | CRPT-119hrpt50 with isConferenceReport=true
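The mapping in this table could be sketched roughly as follows. `to_citation` is a hypothetical helper, not the loader's actual function; the "S. Exec. Rept." prefix for the ERPT series is an assumption (the table only shows House and Senate forms), and the "Pt. N" suffix for multi-part reports is left out.

```python
import re

_PACKAGE_RE = re.compile(r"^CRPT-(\d+)([hse])rpt(\d+)$")

# H. Rept. and S. Rept. come from the table above.
# "S. Exec. Rept." for the 'e' series is an assumption.
_PREFIX = {"h": "H. Rept.", "s": "S. Rept.", "e": "S. Exec. Rept."}


def to_citation(package_id: str, is_conference_report: bool = False) -> str:
    """Render a citation string from a CRPT packageId."""
    m = _PACKAGE_RE.match(package_id)
    if m is None:
        raise ValueError(f"unrecognized CRPT packageId: {package_id!r}")
    congress, code, number = m.groups()
    prefix = _PREFIX[code]
    # Conference reports in the House series use the "H. Conf. Rept." form.
    if is_conference_report and code == "h":
        prefix = "H. Conf. Rept."
    return f"{prefix} {int(congress)}-{int(number)}"
```

In practice the MODS preferred citation is available per package, so a derived citation like this is mostly useful as a cross-check.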

Upstream MODS <docClass> values map into Josh's chamber + doc_class discriminators: HRPT → House, SRPT → Senate, ERPT → Senate Executive Report (treaty consideration). Conference reports are not a separate doc class — they carry <isConferenceReport>true</isConferenceReport> and become the is_conference_report flag rather than a separate table. The MODS vocabulary itself is documented at the GovInfo CRPT help page.

The per-package MODS XML is the metadata source; the raw element reference lives in the Library of Congress MODS standard. The Josh-relevant extraction:

MODS element / attribute | Josh column
<titleInfo><title> | title (the all-caps formal title)
<extension><shortTitle> | short_title
<extension><subtitle> | subtitle ("Report to accompany H.R. 1286, …")
<originInfo><dateIssued> | issue_date
<identifier type="preferred citation"> | citation (H. Rept. 119-100)
<classification authority="sudocs"> | sudoc_class_number (Y 1.1/8:119-100)
<extension><docClass> | doc_class + derived chamber
<extension><accessId> | package_id
<extension><congress> / <session> / <number> | congress / session / report_number
<extension><isConferenceReport> / <isSerialSet> | is_conference_report / is_serial_set
<extension><granuleClass> | granule_class (FIRSTPART for multi-part)
<abstract> | abstract
<extension><recommendation> | recommendation
<extension><submittedStatement> | submitted_by_member_name ("Mr. Bost")
<extension><congCommittee> (authorityId) | submitting_committee_system_code (hsvr00), resolved to submitting_committee_id
<extension><bill context="…"> | committee_report_bills rows
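As an illustration of the table above, the following sketch pulls a handful of those fields out of a MODS record with the standard library. The `SAMPLE` record is hypothetical and heavily trimmed (title and date are placeholders), and real GovInfo records may nest or repeat these elements differently, so the paths should be checked against an actual package before being relied on.

```python
import xml.etree.ElementTree as ET
from typing import Optional

NS = {"m": "http://www.loc.gov/mods/v3"}  # MODS v3 namespace

CHAMBER_BY_DOC_CLASS = {"HRPT": "house", "SRPT": "senate", "ERPT": "senate"}

# Hypothetical, trimmed record for illustration only.
SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>EXAMPLE ACT</title></titleInfo>
  <originInfo><dateIssued>2025-01-01</dateIssued></originInfo>
  <identifier type="preferred citation">H. Rept. 119-100</identifier>
  <extension>
    <docClass>HRPT</docClass>
    <accessId>CRPT-119hrpt100</accessId>
    <congress>119</congress>
    <number>100</number>
    <isConferenceReport>false</isConferenceReport>
  </extension>
</mods>"""


def _text(root: ET.Element, path: str) -> Optional[str]:
    """Return stripped text at path, or None when the element is absent."""
    node = root.find(path, NS)
    return node.text.strip() if node is not None and node.text else None


def _int(value: Optional[str]) -> Optional[int]:
    return int(value) if value else None


def extract_report_row(mods_xml: str) -> dict:
    root = ET.fromstring(mods_xml)
    doc_class = _text(root, ".//m:extension/m:docClass")
    return {
        "package_id": _text(root, ".//m:extension/m:accessId"),
        "title": _text(root, "m:titleInfo/m:title"),
        "issue_date": _text(root, "m:originInfo/m:dateIssued"),
        "citation": _text(root, "m:identifier[@type='preferred citation']"),
        "doc_class": doc_class,
        "chamber": CHAMBER_BY_DOC_CLASS.get(doc_class),
        "congress": _int(_text(root, ".//m:extension/m:congress")),
        "report_number": _int(_text(root, ".//m:extension/m:number")),
        "is_conference_report":
            _text(root, ".//m:extension/m:isConferenceReport") == "true",
    }
```

Missing elements come back as None rather than raising, which suits optional columns such as short_title; required columns would still need a check before insert.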

The <bill> elements carry a context attribute:

  • PRIMARY — the bill the report accompanies (the main bill).
  • OTHER — bills referenced in passing (often companion bills, related reconciliations, prior versions).

A single committee report can reference many bills via OTHER. We surface them as a many-to-many in committee_report_bills (the context value becomes the relationship column).
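One possible shape for turning <bill> elements into committee_report_bills rows is sketched below. Only the context attribute is described in this document; the congress, type, and number attributes in the `SAMPLE` fragment are assumed for illustration, as is the helper name `extract_report_bills`. A bill listed under both contexts keeps only its PRIMARY row.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://www.loc.gov/mods/v3"}

# Hypothetical fragment: attribute names other than "context" are assumptions.
SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <extension>
    <bill context="PRIMARY" congress="119" type="HR" number="1286"/>
    <bill context="OTHER" congress="119" type="HR" number="1286"/>
    <bill context="OTHER" congress="119" type="S" number="42"/>
  </extension>
</mods>"""


def extract_report_bills(report_id: str, mods_xml: str) -> list:
    """Return (report_id, bill_id, relationship) tuples, one per bill."""
    root = ET.fromstring(mods_xml)
    relationship_by_bill = {}
    for bill in root.iterfind(".//m:extension/m:bill", NS):
        context = bill.get("context")
        if context not in ("PRIMARY", "OTHER"):
            continue  # unknown context values are skipped, not guessed
        bill_id = "bill:{}-{}-{}".format(
            bill.get("congress"), bill.get("type", "").lower(), bill.get("number")
        )
        # PRIMARY wins when the same bill appears under both contexts.
        if relationship_by_bill.get(bill_id) != "PRIMARY":
            relationship_by_bill[bill_id] = context
    return [(report_id, b, rel) for b, rel in relationship_by_bill.items()]
```

Because bill_id is a soft FK, nothing here checks that the referenced bill exists in the bills source.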

Same <html><body><pre>...</pre></body></html> wrapping as CHRG. Sections inside the <pre>:

  1. Cover page — chamber, report citation, congress/session, bill identifier ("Mr. BOST, from the Committee on Veterans' Affairs, submitted the following REPORT, [To accompany H.R. 1286]").
  2. Committee recommendation paragraph.
  3. Section-by-section analysis of the bill — section text, then committee commentary.
  4. Background and need for legislation.
  5. Hearings held on the topic (with dates, witness lists).
  6. Committee consideration — markup vote rollcall.
  7. Committee oversight findings.
  8. Statement of constitutional authority (House) / Regulatory impact statement (Senate).
  9. Federal Mandates statement.
  10. Cost estimate — full CBO cost estimate inlined.
  11. Earmark statement (when applicable).
  12. Changes in existing law — the bill's amendments shown in Ramseyer / [strike-through] / italic format.
  13. Dissenting / Minority views (when filed).
  14. Additional / Supplemental views.

The section-by-section analysis is the most retrieval-valuable part. It's the closest thing to "plain English explanation of what each part of the bill does" we have.
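Since the body sits inside a single <pre> wrapper, body_text extraction can stay simple. The sketch below uses only the standard library; `PreTextExtractor` and `extract_body_text` are hypothetical names, and it makes no attempt to split the text into the sections listed above.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collect the text found inside <pre>...</pre> elements."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)  # decode &amp; and friends
        self._depth = 0
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth > 0:
            self._parts.append(data)

    @property
    def text(self) -> str:
        return "".join(self._parts)


def extract_body_text(html: str) -> str:
    parser = PreTextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.text.strip()
```

Line breaks inside the <pre> are preserved, which matters if a later section parser keys on heading lines.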

Metric | Value
Reports per year | ~500-1,300
Total all-time online | ~25,000
Per-package HTML | ~30-200 KB
Per-package PDF | ~100 KB - 5 MB
Conference reports per congress | ~10-30 (rare but high-impact)

Per-year CRPT sitemap counts (sample): 2023: 497, 2024: 813.

Postgres footprint: ~2-4 GB after ingestion.

  1. Weekly: per-year CRPT sitemap diff. For current year + previous year. Diff <lastmod>; refetch where changed.
  2. MODS recordChangeDate as per-package watermark. Skip re-parse when unchanged.
  3. Real-time signal via the bills source BILLSTATUS: when a BILLSTATUS XML adds a <committeeReports> entry, that's the announcement of a new committee report. Cross-check with our CRPT loader; useful for "report just landed" queue. Already captured in bill_committee_reports.
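The sitemap diff in step 1 might look roughly like this. It assumes each <loc> ends in the packageId (for example a details URL ending in /CRPT-119hrpt100), which should be verified against a real CRPT sitemap, and it treats any difference in the <lastmod> string as a change. `changed_packages` and the shape of `known_lastmod` are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def changed_packages(sitemap_xml: str, known_lastmod: dict) -> list:
    """Return packageIds whose <lastmod> differs from the stored value.

    known_lastmod maps packageId -> the lastmod string seen on the last sync.
    Packages never seen before are reported as changed.
    """
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.iterfind("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        # Assumption: the packageId is the last path segment of <loc>.
        package_id = loc.rstrip("/").rsplit("/", 1)[-1]
        if package_id and known_lastmod.get(package_id) != lastmod:
            changed.append(package_id)
    return changed
```

Comparing the raw strings avoids timestamp parsing; the MODS recordChangeDate check in step 2 then decides whether a refetched package actually needs a re-parse.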

-- ============================================================
-- Committee reports (the canonical entity)
-- ============================================================
CREATE TABLE committee_reports (
    id text PRIMARY KEY, -- 'report:CRPT-119hrpt100'
    package_id text NOT NULL UNIQUE, -- 'CRPT-119hrpt100'
    chamber text NOT NULL CHECK (chamber IN ('house', 'senate')),
    doc_class text NOT NULL CHECK (doc_class IN ('HRPT', 'SRPT', 'ERPT')),
    congress smallint NOT NULL,
    session smallint CHECK (session IN (1, 2) OR session IS NULL),
    report_number int NOT NULL, -- the integer after the dash in citation
    citation text NOT NULL, -- 'H. Rept. 119-100'
    short_title text,
    title text NOT NULL, -- the all-caps formal title
    subtitle text, -- 'Report to accompany H.R. 1286, ...'
    -- When
    issue_date date NOT NULL, -- <dateIssued>
    -- Submitter
    submitted_by_member_name text, -- "Mr. Bost"
    submitted_by_bioguide text REFERENCES legislators(bioguide_id), -- resolved soft FK
    submitting_committee_id text REFERENCES committees(id),
    submitting_committee_system_code text, -- 'hsvr00'
    -- Type discriminators
    is_conference_report boolean NOT NULL DEFAULT false,
    is_serial_set boolean NOT NULL DEFAULT false,
    granule_class text, -- 'FIRSTPART' for multi-part
    -- Body / abstract
    abstract text, -- MODS <abstract>
    recommendation text, -- MODS <recommendation>
    body_text text, -- extracted from HTML
    body_html bytea, -- gzipped raw HTML
    -- Source URLs
    govinfo_html_url text,
    govinfo_pdf_url text,
    govinfo_zip_url text,
    mods_url text,
    sudoc_class_number text, -- 'Y 1.1/8:119-100'
    -- Lifecycle
    raw_mods_xml bytea, -- gzipped MODS for re-parse
    sitemap_lastmod timestamptz,
    fetched_at timestamptz NOT NULL,
    parsed_at timestamptz,
    inserted_at timestamptz NOT NULL DEFAULT now(),
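    updated_at timestamptz NOT NULL DEFAULT now()
    -- No UNIQUE (chamber, congress, report_number, doc_class) here: multi-part
    -- reports (Pt. 1, Pt. 2) are stored as separate rows that share all four
    -- values, so that constraint would reject the second part.
    -- package_id UNIQUE is the row-level key.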
);

CREATE INDEX committee_reports_congress ON committee_reports (congress, chamber, report_number);
CREATE INDEX committee_reports_issue_date ON committee_reports (issue_date DESC);
CREATE INDEX committee_reports_committee ON committee_reports (submitting_committee_id);
CREATE INDEX committee_reports_conference ON committee_reports (issue_date DESC) WHERE is_conference_report;

ALTER TABLE committee_reports ADD COLUMN search_tsv tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '') || ' ' || coalesce(short_title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(subtitle, '')), 'B') ||
        setweight(to_tsvector('english', coalesce(abstract, '') || ' ' || coalesce(recommendation, '')), 'C') ||
        setweight(to_tsvector('english', coalesce(body_text, '')), 'D')
    ) STORED;
CREATE INDEX committee_reports_search ON committee_reports USING gin (search_tsv);

-- Many-to-many: a report can reference many bills (PRIMARY + OTHER)
CREATE TABLE committee_report_bills (
    report_id text NOT NULL REFERENCES committee_reports(id) ON DELETE CASCADE,
    bill_id text NOT NULL, -- 'bill:119-hr-1286'; soft FK
    relationship text NOT NULL CHECK (relationship IN ('PRIMARY', 'OTHER')),
    PRIMARY KEY (report_id, bill_id, relationship)
);
CREATE INDEX committee_report_bills_bill ON committee_report_bills (bill_id);

-- Vector chunks
CREATE TABLE committee_report_chunks (
    id bigserial PRIMARY KEY,
    report_id text NOT NULL REFERENCES committee_reports(id) ON DELETE CASCADE,
    chunk_index int NOT NULL,
    chunk_text text NOT NULL,
    embedding vector(1024),
    UNIQUE (report_id, chunk_index)
);
CREATE INDEX committee_report_chunks_embedding
    ON committee_report_chunks USING hnsw (embedding vector_cosine_ops);
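
The chunks table needs (chunk_index, chunk_text) pairs per report. No chunking strategy is specified here, so the following is only a placeholder sketch: fixed-size character windows with overlap, where `max_chars` and `overlap` are arbitrary illustrative defaults rather than tuned values.

```python
def chunk_body_text(body_text: str, max_chars: int = 2000, overlap: int = 200) -> list:
    """Split body_text into overlapping windows.

    Returns a list of (chunk_index, chunk_text) tuples suitable for
    committee_report_chunks. Window sizes are assumptions, not tuned values.
    """
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    chunks = []
    start = 0
    while start < len(body_text):
        end = min(start + max_chars, len(body_text))
        piece = body_text[start:end].strip()
        if piece:  # skip windows that are only whitespace
            chunks.append((len(chunks), piece))
        if end == len(body_text):
            break
        start = end - overlap  # step forward, keeping `overlap` chars of context
    return chunks
```

A section-aware splitter would likely retrieve better once the section-by-section parser exists; this version only guarantees stable, contiguous chunk_index values.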

Schema decisions worth flagging:

  • Single table for House + Senate + Executive reports — they have the same shape; the chamber + doc_class discriminators are sufficient.
  • report_number is the integer suffix (100, 50, 200) — the integer space resets per (chamber, congress). Multi-part reports share a number with Pt. N suffix; we represent multi-parts as separate rows since GovInfo gives them separate packageIds.
  • committee_report_bills.bill_id as soft FK — a report's OTHER-context bills may not all be in our corpus.
  • raw_mods_xml preserved — MODS structure has been stable but adds new fields occasionally. Keep raw.
  • No separate "conference report" table: conference reports use the is_conference_report flag on the same row.
  • Section-by-section parsing deferred to v1.x — schema doesn't yet model the report's internal sections (background, section-by-section, dissenting views, etc.). Body text is searchable; structured section table can come later.

  1. For each year in 1995..current_year:
    • Fetch CRPT_{year}_sitemap.xml.
    • For each <loc>, derive packageId.
    • Fetch MODS XML, HTML, PDF (PDF lazy if storage tight).
    • Parse MODS for metadata; HTML for body; soft-resolve bills + committee.
    • Insert.
  2. ~25K packages × 2 fetches each ≈ 50K calls. With 4 workers @ 1 req/s each, that is about 12,500 seconds, or roughly 3.5 hours.
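
The backfill estimate above is simple enough to keep as a function, so it can be re-run if the worker count or the number of fetches per package changes (for example, if PDFs are fetched eagerly). `backfill_hours` is a hypothetical helper.

```python
def backfill_hours(packages: int, fetches_per_package: int,
                   workers: int, req_per_sec_per_worker: float) -> float:
    """Estimate wall-clock hours for the initial backfill."""
    total_calls = packages * fetches_per_package
    calls_per_second = workers * req_per_sec_per_worker
    return total_calls / calls_per_second / 3600


# 25,000 packages x 2 fetches, 4 workers at 1 req/s each
print(round(backfill_hours(25_000, 2, 4, 1.0), 2))  # 3.47
```

The estimate ignores retries, latency above one second per request, and parse time, so it is a lower bound.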

  1. Daily: poll current-year + previous-year CRPT sitemap. Diff <lastmod>; refetch.
  2. Hourly during session days: BILLSTATUS-driven trigger. When a bill_committee_reports row arrives via the bills loader and the corresponding CRPT package isn't yet in our table, queue a fetch.
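
The BILLSTATUS-driven trigger could be sketched as below, assuming the bills loader exposes each announced report as a citation string such as "H. Rept. 119-100" (the actual column layout of bill_committee_reports is not shown here). Both function names are hypothetical, and only plain House and Senate citations are recognized.

```python
import re
from typing import Iterable, Optional

# Plain "H. Rept. 119-100" / "S. Rept. 119-50" citations only.
_CITATION_RE = re.compile(r"^(H|S)\. Rept\. (\d+)-(\d+)$")


def citation_to_package_id(citation: str) -> Optional[str]:
    """Map a citation to its CRPT packageId, or None if not recognized."""
    m = _CITATION_RE.match(citation.strip())
    if m is None:
        return None
    chamber, congress, number = m.groups()
    return f"CRPT-{congress}{chamber.lower()}rpt{number}"


def packages_to_queue(announced_citations: Iterable[str],
                      ingested_package_ids: set) -> list:
    """Return packageIds announced via BILLSTATUS but not yet ingested."""
    queue = []
    for citation in announced_citations:
        package_id = citation_to_package_id(citation)
        if package_id is None:
            continue  # unrecognized citation form; leave it for the daily diff
        if package_id not in ingested_package_ids and package_id not in queue:
            queue.append(package_id)
    return queue
```

A queued package may not be published on GovInfo yet, so the fetch should tolerate a miss and retry later.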

Source key: committee_reports. State stores per-year sitemap last_modified.

  • Multi-part report inconsistency. H. Rept. 119-200, Pt. 1 and Pt. 2 are separate packages but represent one report. Treat as separate rows linked by report_number + congress + chamber; downstream queries can re-aggregate.
  • MODS <bill> element variants. context="PRIMARY" and context="OTHER" are the common values. Sometimes the same bill (e.g. H.R. 1286) appears twice, once as PRIMARY and once as OTHER; collapse these to a single PRIMARY row.
  • Submitter resolution to bioguide fails. "Mr. Bost" needs resolution against the committee membership at the time. Not always unambiguous (multiple members named Smith have served). Heuristic: take all members of submitting_committee_id at issue_date, look for last-name match, prefer the chair/ranking member. When ambiguous, leave bioguide null.
  • HTML parse failure on older reports. Pre-2010 CRPT HTML is occasionally malformed. Fall back to PDF text extraction.
  • Conference reports parsing. Conference reports include the joint explanatory statement plus the bill text adopted by both chambers. Body text gets very long (10-100+ MB for major budget bills). Plan body chunking accordingly.
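
The submitter-resolution heuristic from the list above can be written down in a few lines. This is a sketch under assumptions: `Member` is a hypothetical record of who sat on the submitting committee at issue_date, `is_leadership` marks the chair or ranking member, and the bioguide ids used with it would come from the legislators source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Member:
    bioguide_id: str
    last_name: str
    is_leadership: bool = False  # chair or ranking member at issue_date


def resolve_submitter(submitted_by: str, committee_members: list) -> Optional[str]:
    """Resolve a name like "Mr. Bost" to a bioguide id, or None if ambiguous."""
    parts = submitted_by.replace(",", " ").split()
    if not parts:
        return None
    last_name = parts[-1].lower()  # drop the honorific, keep the surname
    matches = [m for m in committee_members if m.last_name.lower() == last_name]
    if len(matches) == 1:
        return matches[0].bioguide_id
    # Several members share the surname: prefer a single chair/ranking member.
    leaders = [m for m in matches if m.is_leadership]
    if len(leaders) == 1:
        return leaders[0].bioguide_id
    return None  # no match or still ambiguous: leave submitted_by_bioguide null
```

Returning None in the ambiguous case matches the rule of leaving the bioguide null rather than guessing.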

These don't block ingestion but should be resolved before this source is "shipped":

  • Section-by-section parser — the highest-signal content per character. Worth building a careful parser that splits the report by section heading and stores per-section rows. Defer to v1.x; ship v1 with full body text only.
  • Dissenting / minority views as separate row? Within a single report, dissenting views are authored by different members. Splitting them out for citation precision is appealing. Defer.
  • Conference report ↔ bill resolution. A conference report cites the underlying bill but is published independently. Make sure bill_id FK works for conference reports' primary bills.
  • CBO cost estimate inlined in body. The full CBO cost estimate text appears inside the report body. Cross-link to cbo_publications (see CBO cost estimates) when we can match by date + bill (heuristic). Defer to v1.x; the cross-reference is in the citation graph anyway.
  • Pre-1995 reports — exist in print, mostly digitized as scans. Out of scope for v1.
  • Treaty reports (ERPT doc class) — Senate executive reports on treaties. Treaties as a source are deferred to v2; the reports themselves we ingest along with HRPT/SRPT but treat as orphans (no bill_id).
  • Very large conference reports. A committee print of an omnibus conference report can be hundreds of MB. Verify body text storage strategy (drop body, keep only chunks for very-large rows?).