Committee reports
Formal explanations of bills issued by the committee of jurisdiction. When a House or Senate committee orders a bill reported, it usually files a written report — citation form H. Rept. 119-100 or S. Rept. 119-50 — that includes the committee's recommendation, a section-by-section analysis of the bill, the CBO cost estimate, dissenting / minority views, the rollcall of the committee vote, and required statements (constitutional authority, federal mandates, etc.).
Committee reports are the written record of legislative intent. Courts cite them. Lobbyists scrutinize them. Anyone trying to understand what a bill actually does should read them. For Josh, they are FK-target-rich (they reference bills, members, committees, CBO estimates, and other reports) and high-value retrieval text.
Quick reference
| Source name | Committee reports |
| Publisher | Government Publishing Office (CRPT collection); reports authored by House and Senate committees |
| License | Public domain |
| Coverage | 1995 – present (online corpus); 104th Congress onward systematically. |
| Volume | ~500-1,300 reports per year; ~25,000+ all-time across both chambers. |
| Storage estimate | ~3-5 GB raw HTML + PDF; ~1-2 GB extracted text |
| Primary source | GovInfo CRPT bulk — per-year sitemap + per-package MODS/HTML/PDF (see GovInfo bulk data) |
| Auth | None |
| Rate limits | None observed; ~1-2 req/sec polite |
| Incremental sync hints | CRPT yearly sitemap diff; MODS recordChangeDate |
| Stable ID format | report:{packageId} e.g. report:CRPT-119hrpt100, report:CRPT-119srpt50 |
| Status | exploring — schema drafted, ingestion not built |
Source priority decision
Primary: GovInfo CRPT bulk. Per-year sitemap, per-package MODS XML, HTML rendition, PDF rendition. Same shape as CHRG (hearing transcripts), so the parser scaffolding can be reused. See the GovInfo CRPT help page for the collection's package layout and api.govinfo.gov for the endpoint reference.
Secondary: Congress.gov v3 /committee-report. Useful for verification + the relatedBills cross-reference (which is also in MODS but more cleanly structured in JSON). Costs API quota; not worth being primary. See the Congress.gov API docs for the endpoint's response structure.
Skip: per-committee committee-report pages on House/Senate sites. Heterogeneous, fragile, low marginal value over GovInfo CRPT.
Access notes
GovInfo: open. Same patterns as CHRG. Real-browser User-Agent harmless. ~1-2 req/sec polite.
The bills source already has a bill_committee_reports child table — that's the bills-side index. This source is the canonical entity, with full body text, the citation graph, and the report-as-document.
Stable ID format
report:{packageId}, packageId verbatim from GovInfo. The packageId follows GovInfo's CRPT-{c}{chamberCode}rpt{number} convention (chamber code h = House, s = Senate, e = Senate Executive Report on treaties); see the GovInfo CRPT help page for the full package-naming reference.
Examples:
- report:CRPT-119hrpt100
- report:CRPT-119srpt50
- report:CRPT-119erpt6
Conference reports (the joint House+Senate reconciliation document) use the same format but typically appear in both chambers' hrpt and srpt series with adjacent numbers.
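The packageId convention above can be unpacked with a small parser. This is a minimal sketch, not the loader itself: the regular expression covers only the three chamber codes named above and assumes no part suffix, and `ReportKey` and `parse_package_id` are hypothetical names introduced for illustration.

```python
import re
from typing import NamedTuple

# Assumption: packageIds look exactly like CRPT-{congress}{h|s|e}rpt{number}.
# Any other shape (for example a multi-part suffix) is rejected, not guessed.
PACKAGE_ID_RE = re.compile(r"^CRPT-(\d+)([hse])rpt(\d+)$")

DOC_CLASS = {"h": "HRPT", "s": "SRPT", "e": "ERPT"}
CHAMBER = {"h": "house", "s": "senate", "e": "senate"}


class ReportKey(NamedTuple):
    stable_id: str
    package_id: str
    congress: int
    doc_class: str
    chamber: str
    report_number: int


def parse_package_id(package_id: str) -> ReportKey:
    """Split a CRPT packageId into the discriminators used by the schema."""
    match = PACKAGE_ID_RE.match(package_id)
    if match is None:
        raise ValueError(f"unrecognized CRPT packageId: {package_id!r}")
    congress, code, number = match.groups()
    return ReportKey(
        stable_id=f"report:{package_id}",
        package_id=package_id,
        congress=int(congress),
        doc_class=DOC_CLASS[code],
        chamber=CHAMBER[code],
        report_number=int(number),
    )


key = parse_package_id("CRPT-119erpt6")
print(key.stable_id, key.doc_class, key.chamber, key.report_number)
```

Rejecting unknown shapes keeps unexpected packageIds visible in logs instead of silently producing a wrong chamber or number.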
Citation form
The loader maps GovInfo packageIds to the congressional citation form (H. Rept. / S. Rept.; Bluebook writes these as H.R. Rep. No. / S. Rep. No.):
| Citation | packageId |
|---|---|
| H. Rept. 119-100 | CRPT-119hrpt100 |
| S. Rept. 119-50 | CRPT-119srpt50 |
| H. Rept. 119-200, Pt. 1 | CRPT-119hrpt200 (multi-part — granules carry parts) |
| H. Conf. Rept. 119-50 | CRPT-119hrpt50 with isConferenceReport=true |
Upstream MODS <docClass> values map into Josh's chamber + doc_class discriminators: HRPT → House, SRPT → Senate, ERPT → Senate Executive Report (treaty consideration). Conference reports are not a separate doc class — they carry <isConferenceReport>true</isConferenceReport> and become the is_conference_report flag rather than a separate table. The MODS vocabulary itself is documented at the GovInfo CRPT help page.
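The citation table can be expressed as a small formatting function. The sketch below covers only the forms listed in the table; `format_citation` and its `part` parameter are hypothetical, and ERPT is deliberately left out because the table gives no citation form for it.

```python
from typing import Optional

# Prefixes taken from the citation table; ERPT is intentionally absent.
PREFIX = {"HRPT": "H. Rept.", "SRPT": "S. Rept."}
# Only the House conference form appears in the table.
CONFERENCE_PREFIX = {"HRPT": "H. Conf. Rept."}


def format_citation(
    doc_class: str,
    congress: int,
    report_number: int,
    is_conference_report: bool = False,
    part: Optional[int] = None,
) -> str:
    """Build a citation string such as 'H. Rept. 119-100'."""
    if doc_class not in PREFIX:
        raise ValueError(f"no citation form known for doc_class {doc_class!r}")
    prefix = PREFIX[doc_class]
    if is_conference_report:
        # Fall back to the ordinary prefix where no conference form is listed.
        prefix = CONFERENCE_PREFIX.get(doc_class, prefix)
    citation = f"{prefix} {congress}-{report_number}"
    if part is not None:
        citation += f", Pt. {part}"
    return citation


print(format_citation("HRPT", 119, 200, part=1))
```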
Response shapes
MODS field → Josh column
The per-package MODS XML is the metadata source; the raw element reference lives in the Library of Congress MODS standard. The Josh-relevant extraction:
| MODS element / attribute | Josh column |
|---|---|
| <titleInfo><title> | title (the all-caps formal title) |
| <extension><shortTitle> | short_title |
| <extension><subtitle> | subtitle ("Report to accompany H.R. 1286, …") |
| <originInfo><dateIssued> | issue_date |
| <identifier type="preferred citation"> | citation (H. Rept. 119-100) |
| <classification authority="sudocs"> | sudoc_class_number (Y 1.1/8:119-100) |
| <extension><docClass> | doc_class + derived chamber |
| <extension><accessId> | package_id |
| <extension><congress> / <session> / <number> | congress / session / report_number |
| <extension><isConferenceReport> / <isSerialSet> | is_conference_report / is_serial_set |
| <extension><granuleClass> | granule_class (FIRSTPART for multi-part) |
| <abstract> | abstract |
| <extension><recommendation> | recommendation |
| <extension><submittedStatement> | submitted_by_member_name ("Mr. Bost") |
| <extension><congCommittee> (authorityId) | submitting_committee_system_code (hsvr00), resolved to submitting_committee_id |
| <extension><bill context="…"> | committee_report_bills rows |
The <bill> elements carry a context attribute:
- PRIMARY: the bill the report accompanies (the main bill).
- OTHER: bills referenced in passing (often companion bills, related reconciliations, prior versions).
A single committee report can reference many bills via OTHER. We surface them as a many-to-many in committee_report_bills (the context value becomes the relationship column).
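As a rough illustration of how the `<bill>` elements could become committee_report_bills rows, the sketch below parses a hand-written MODS fragment with the standard library. The fragment is invented for the example, and the `congress`, `type`, and `number` attribute names are assumptions about the upstream XML; only `context` is described above. When the same bill appears as both PRIMARY and OTHER, the PRIMARY relationship wins.

```python
import xml.etree.ElementTree as ET

MODS_NS = {"m": "http://www.loc.gov/mods/v3"}

# Illustrative fragment only; real packages carry many more elements.
SAMPLE_MODS = """\
<mods xmlns="http://www.loc.gov/mods/v3">
  <extension>
    <bill congress="119" type="HR" number="1286" context="PRIMARY"/>
    <bill congress="119" type="S" number="423" context="OTHER"/>
    <bill congress="119" type="HR" number="1286" context="OTHER"/>
  </extension>
</mods>
"""


def extract_report_bills(mods_xml: str) -> list[tuple[str, str]]:
    """Return (bill_id, relationship) pairs, one per bill, PRIMARY preferred."""
    root = ET.fromstring(mods_xml)
    best: dict[str, str] = {}
    for bill in root.iterfind(".//m:extension/m:bill", MODS_NS):
        bill_type = (bill.get("type") or "").lower()
        bill_id = f"bill:{bill.get('congress')}-{bill_type}-{bill.get('number')}"
        context = bill.get("context", "OTHER")
        # Never downgrade a bill already recorded as PRIMARY.
        if best.get(bill_id) != "PRIMARY":
            best[bill_id] = context
    return sorted(best.items())


for bill_id, relationship in extract_report_bills(SAMPLE_MODS):
    print(bill_id, relationship)
```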
HTML body shape
Same <html><body><pre>...</pre></body></html> wrapping as CHRG. Sections inside the <pre>:
- Cover page — chamber, report citation, congress/session, bill identifier ("Mr. BOST, from the Committee on Veterans' Affairs, submitted the following REPORT, [To accompany H.R. 1286]").
- Committee recommendation paragraph.
- Section-by-section analysis of the bill — section text, then committee commentary.
- Background and need for legislation.
- Hearings held on the topic (with dates, witness lists).
- Committee consideration — markup vote rollcall.
- Committee oversight findings.
- Statement of constitutional authority (House) / Regulatory impact statement (Senate).
- Federal Mandates statement.
- Cost estimate — full CBO cost estimate inlined.
- Earmark statement (when applicable).
- Changes in existing law: the bill's amendments shown in Ramseyer / [strike-through] / italic format.
- Dissenting / Minority views (when filed).
- Additional / Supplemental views.
The section-by-section analysis is the most retrieval-valuable part. It's the closest thing to "plain English explanation of what each part of the bill does" we have.
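Because the body arrives as a single `<pre>` block, pulling out body_text needs very little machinery. The following is a minimal sketch using only the standard library; `PreTextExtractor` and `extract_body_text` are hypothetical names, and splitting the text into the sections listed above is out of scope here.

```python
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collect the character data found inside <pre> elements."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._depth = 0
        self._parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        # Text outside <pre> (titles, stray whitespace) is ignored.
        if self._depth:
            self._parts.append(data)

    def text(self) -> str:
        return "".join(self._parts)


def extract_body_text(html: str) -> str:
    parser = PreTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text().strip()


sample = "<html><body><pre>\nH. Rept. 119-100\n\nMr. BOST &amp; others\n</pre></body></html>"
print(extract_body_text(sample))
```

Whitespace inside the `<pre>` is preserved on purpose, since line breaks and indentation are the only structure a later section parser would have to work with.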
Volume
| Metric | Value |
|---|---|
| Reports per year | ~500-1,300 |
| Total all-time online | ~25,000 |
| Per-package HTML | ~30-200 KB |
| Per-package PDF | ~100 KB - 5 MB |
| Conference reports per congress | ~10-30 (rare but high-impact) |
Per-year CRPT sitemap counts (sample): 2023: 497, 2024: 813.
Postgres footprint: ~2-4 GB after ingestion.
Caching / incremental sync
- Weekly: per-year CRPT sitemap diff. For current year + previous year. Diff <lastmod>; refetch where changed.
- MODS recordChangeDate as per-package watermark. Skip re-parse when unchanged.
- Real-time signal via the bills source BILLSTATUS: when a BILLSTATUS XML adds a <committeeReports> entry, that's the announcement of a new committee report. Cross-check with our CRPT loader; useful for "report just landed" queue. Already captured in bill_committee_reports.
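The sitemap diff described above might look roughly like this. It is a sketch under two assumptions: the sitemap uses the standard sitemaps.org namespace, and each `<loc>` URL ends in the packageId (the URL shape in the sample is illustrative, not confirmed). `changed_packages` and the `seen` mapping of packageId to last stored `<lastmod>` are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Illustrative sitemap; the <loc> URL shape is an assumption.
SAMPLE_SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.govinfo.gov/app/details/CRPT-119hrpt100</loc>
    <lastmod>2025-05-01T12:00:00Z</lastmod>
  </url>
  <url>
    <loc>https://www.govinfo.gov/app/details/CRPT-119srpt50</loc>
    <lastmod>2025-06-10T08:30:00Z</lastmod>
  </url>
</urlset>
"""


def changed_packages(sitemap_xml: str, seen: dict[str, str]) -> list[str]:
    """Return packageIds that are new or whose <lastmod> differs from `seen`."""
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.iterfind("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        package_id = loc.rstrip("/").rsplit("/", 1)[-1]
        if package_id and seen.get(package_id) != lastmod:
            changed.append(package_id)
    return changed


stored = {"CRPT-119hrpt100": "2025-05-01T12:00:00Z"}
print(changed_packages(SAMPLE_SITEMAP, stored))
```

Comparing `<lastmod>` as an opaque string avoids timezone parsing; any change in the value triggers a refetch, which the MODS recordChangeDate watermark can then short-circuit.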
Schema (Postgres DDL)
```sql
-- ============================================================
-- Committee reports (the canonical entity)
-- ============================================================

CREATE TABLE committee_reports (
  id                     text PRIMARY KEY,        -- 'report:CRPT-119hrpt100'
  package_id             text NOT NULL UNIQUE,    -- 'CRPT-119hrpt100'

  chamber                text NOT NULL CHECK (chamber IN ('house', 'senate')),
  doc_class              text NOT NULL CHECK (doc_class IN ('HRPT', 'SRPT', 'ERPT')),
  congress               smallint NOT NULL,
  session                smallint CHECK (session IN (1, 2) OR session IS NULL),
  report_number          int NOT NULL,            -- the integer after the dash in citation

  citation               text NOT NULL,           -- 'H. Rept. 119-100'
  short_title            text,
  title                  text NOT NULL,           -- the all-caps formal title
  subtitle               text,                    -- 'Report to accompany H.R. 1286, ...'

  -- When
  issue_date             date NOT NULL,           -- <dateIssued>

  -- Submitter
  submitted_by_member_name          text,         -- "Mr. Bost"
  submitted_by_bioguide             text REFERENCES legislators(bioguide_id),  -- resolved soft FK
  submitting_committee_id           text REFERENCES committees(id),
  submitting_committee_system_code  text,         -- 'hsvr00'

  -- Type discriminators
  is_conference_report   boolean NOT NULL DEFAULT false,
  is_serial_set          boolean NOT NULL DEFAULT false,
  granule_class          text,                    -- 'FIRSTPART' for multi-part

  -- Body / abstract
  abstract               text,                    -- MODS <abstract>
  recommendation         text,                    -- MODS <recommendation>
  body_text              text,                    -- extracted from HTML
  body_html              bytea,                   -- gzipped raw HTML

  -- Source URLs
  govinfo_html_url       text,
  govinfo_pdf_url        text,
  govinfo_zip_url        text,
  mods_url               text,
  sudoc_class_number     text,                    -- 'Y 1.1/8:119-100'

  -- Lifecycle
  raw_mods_xml           bytea,                   -- gzipped MODS for re-parse
  sitemap_lastmod        timestamptz,
  fetched_at             timestamptz NOT NULL,
  parsed_at              timestamptz,
  inserted_at            timestamptz NOT NULL DEFAULT now(),
  updated_at             timestamptz NOT NULL DEFAULT now(),

  UNIQUE (chamber, congress, report_number, doc_class)
);

CREATE INDEX committee_reports_congress   ON committee_reports (congress, chamber, report_number);
CREATE INDEX committee_reports_issue_date ON committee_reports (issue_date DESC);
CREATE INDEX committee_reports_committee  ON committee_reports (submitting_committee_id);
CREATE INDEX committee_reports_conference ON committee_reports (issue_date DESC) WHERE is_conference_report;

ALTER TABLE committee_reports ADD COLUMN search_tsv tsvector
  GENERATED ALWAYS AS (
    setweight(to_tsvector('english', coalesce(title, '') || ' ' || coalesce(short_title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(subtitle, '')), 'B') ||
    setweight(to_tsvector('english', coalesce(abstract, '') || ' ' || coalesce(recommendation, '')), 'C') ||
    setweight(to_tsvector('english', coalesce(body_text, '')), 'D')
  ) STORED;
CREATE INDEX committee_reports_search ON committee_reports USING gin (search_tsv);

-- Many-to-many: a report can reference many bills (PRIMARY + OTHER)
CREATE TABLE committee_report_bills (
  report_id     text NOT NULL REFERENCES committee_reports(id) ON DELETE CASCADE,
  bill_id       text NOT NULL,                    -- 'bill:119-hr-1286'; soft FK
  relationship  text NOT NULL CHECK (relationship IN ('PRIMARY', 'OTHER')),
  PRIMARY KEY (report_id, bill_id, relationship)
);
CREATE INDEX committee_report_bills_bill ON committee_report_bills (bill_id);

-- Vector chunks
CREATE TABLE committee_report_chunks (
  id           bigserial PRIMARY KEY,
  report_id    text NOT NULL REFERENCES committee_reports(id) ON DELETE CASCADE,
  chunk_index  int NOT NULL,
  chunk_text   text NOT NULL,
  embedding    vector(1024),
  UNIQUE (report_id, chunk_index)
);
CREATE INDEX committee_report_chunks_embedding ON committee_report_chunks USING hnsw (embedding vector_cosine_ops);
```

Schema decisions worth flagging:
- Single table for House + Senate + Executive reports: they have the same shape; the chamber + doc_class discriminators are sufficient.
- report_number is the integer suffix (100, 50, 200): the integer space resets per (chamber, congress). Multi-part reports share a number with a Pt. N suffix; we represent multi-parts as separate rows since GovInfo gives them separate packageIds. As drafted, UNIQUE (chamber, congress, report_number, doc_class) would reject those sibling rows, so the constraint needs a part discriminator (or has to go) before multi-part reports are loaded.
- committee_report_bills.bill_id as soft FK: a report's OTHER-context bills may not all be in our corpus.
- raw_mods_xml preserved: MODS structure has been stable but adds new fields occasionally. Keep raw.
- No separate "conference report" table: is_conference_report flag on the same row.
- Section-by-section parsing deferred to v1.x: the schema doesn't yet model the report's internal sections (background, section-by-section, dissenting views, etc.). Body text is searchable; a structured section table can come later.
Download / update strategy
Backfill
- For each year in 1995..current_year:
  - Fetch CRPT_{year}_sitemap.xml.
  - For each <loc>, derive packageId.
  - Fetch MODS XML, HTML, PDF (PDF lazy if storage tight).
  - Parse MODS for metadata; HTML for body; soft-resolve bills + committee.
  - Insert.
- ~25K packages × 2 fetches each ≈ 50K calls. With 4 workers @ 1 req/s ≈ ~3-4 hours.
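The backfill arithmetic can be checked with a few lines. This is a sketch only: the sitemap base URL is an assumption (only the CRPT_{year}_sitemap.xml file name comes from the steps above), and both function names are hypothetical.

```python
def sitemap_urls(first_year: int, last_year: int,
                 base: str = "https://www.govinfo.gov/sitemap") -> list[str]:
    """One per-year CRPT sitemap URL for each year in the inclusive range.

    The `base` default is an assumed location, not a confirmed endpoint.
    """
    return [f"{base}/CRPT_{year}_sitemap.xml" for year in range(first_year, last_year + 1)]


def backfill_hours(packages: int, fetches_per_package: int,
                   workers: int, req_per_sec_per_worker: float) -> float:
    """Wall-clock estimate, assuming workers stay saturated and never retry."""
    total_calls = packages * fetches_per_package
    aggregate_rate = workers * req_per_sec_per_worker  # requests per second
    return total_calls / aggregate_rate / 3600


print(len(sitemap_urls(1995, 2025)), "sitemaps")
print(round(backfill_hours(25_000, 2, 4, 1.0), 2), "hours")
```

With the figures quoted above (25K packages, 2 fetches each, 4 workers at 1 req/s) the estimate comes to about 3.5 hours, inside the stated ~3-4 hour range. Fetching PDFs eagerly would raise `fetches_per_package` to 3 and the estimate to roughly 5.2 hours.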
Daily incremental
- Daily: poll current-year + previous-year CRPT sitemap. Diff <lastmod>; refetch.
- Hourly during session days: BILLSTATUS-driven trigger. When a bill_committee_reports row arrives via the bills loader and the corresponding CRPT package isn't yet in our table, queue a fetch.
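One way to sketch the BILLSTATUS-driven trigger is to turn each bills-side citation back into a packageId and queue whatever the CRPT table does not yet hold. This assumes bill_committee_reports exposes the citation as a string like "H. Rept. 119-100"; that column shape, and both function names, are hypothetical.

```python
import re
from typing import Iterable, Optional

# Assumption: bills-side rows carry citations in the plain H./S. Rept. form.
CITATION_RE = re.compile(r"^([HS])\. Rept\. (\d+)-(\d+)$")


def citation_to_package_id(citation: str) -> Optional[str]:
    """Map 'H. Rept. 119-100' to 'CRPT-119hrpt100'; None if unrecognized."""
    match = CITATION_RE.match(citation.strip())
    if match is None:
        return None
    chamber, congress, number = match.groups()
    return f"CRPT-{congress}{chamber.lower()}rpt{number}"


def reports_to_queue(citations: Iterable[str], known_package_ids: set[str]) -> list[str]:
    """PackageIds announced by the bills loader but missing from our table."""
    queue: list[str] = []
    for citation in citations:
        package_id = citation_to_package_id(citation)
        if package_id and package_id not in known_package_ids and package_id not in queue:
            queue.append(package_id)
    return queue


announced = ["H. Rept. 119-100", "S. Rept. 119-50", "H. Rept. 119-100"]
print(reports_to_queue(announced, {"CRPT-119srpt50"}))
```

Citations that do not match the pattern are skipped, not guessed at, so multi-part or conference forms would need their own handling.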
State tracking
Source key: committee_reports. State stores per-year sitemap last_modified.
Failure modes
- Multi-part report inconsistency. H. Rept. 119-200, Pt. 1 and Pt. 2 are separate packages but represent one report. Treat as separate rows linked by report_number + congress + chamber; downstream queries can re-aggregate.
- MODS <bill> element variants. context="PRIMARY" and context="OTHER" are common; sometimes the same bill (e.g. 1286) appears twice, once as PRIMARY and once as OTHER. Collapse to a single PRIMARY row.
- Submitter resolution to bioguide fails. "Mr. Bost" needs resolution against the committee membership at the time. Not always unambiguous (multiple members named Smith have served). Heuristic: take all members of submitting_committee_id at issue_date, look for last-name match, prefer the chair/ranking member. When ambiguous, leave bioguide null.
- HTML parse failure on older reports. Pre-2010 CRPT HTML is occasionally malformed. Fall back to PDF text extraction.
- Conference reports parsing. Conference reports include the joint explanatory statement plus the bill text adopted by both chambers. Body text gets very long (10-100+ MB for major budget bills). Plan body chunking accordingly.
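The submitter-resolution heuristic from the list above can be sketched as follows. The `Member` record, its `is_leadership` flag (chair or ranking member), and the honorific list are assumptions made for illustration; the caller is expected to pass only members of the submitting committee as of issue_date.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed set of honorifics seen in submitted statements such as "Mr. Bost".
HONORIFIC_RE = re.compile(r"^(mr|mrs|ms|miss|dr)\.?\s+", re.IGNORECASE)


@dataclass(frozen=True)
class Member:
    bioguide_id: str
    last_name: str
    is_leadership: bool = False  # chair or ranking member at issue_date


def resolve_submitter(submitted_name: str, members: list[Member]) -> Optional[str]:
    """Return a bioguide id when the last-name match is unambiguous, else None."""
    last_name = HONORIFIC_RE.sub("", submitted_name.strip()).casefold()
    if not last_name:
        return None
    matches = [m for m in members if m.last_name.casefold() == last_name]
    if len(matches) == 1:
        return matches[0].bioguide_id
    # Several members share the name: prefer a single chair/ranking member.
    leaders = [m for m in matches if m.is_leadership]
    if len(leaders) == 1:
        return leaders[0].bioguide_id
    return None  # ambiguous or no match: leave submitted_by_bioguide null


# Placeholder ids, not real bioguide values.
committee = [
    Member("X000001", "Bost", is_leadership=True),
    Member("X000002", "Smith"),
    Member("X000003", "Smith"),
]
print(resolve_submitter("Mr. Bost", committee), resolve_submitter("Mr. Smith", committee))
```

Returning None on ambiguity matches the rule above of leaving the bioguide null instead of guessing.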
Open questions
These don't block ingestion but should be resolved before this source is "shipped":
- Section-by-section parser — the highest-signal content per character. Worth building a careful parser that splits the report by section heading and stores per-section rows. Defer to v1.x; ship v1 with full body text only.
- Dissenting / minority views as separate row? Within a single report, dissenting views are authored by different members. Splitting them out for citation precision is appealing. Defer.
- Conference report ↔ bill resolution. A conference report cites the underlying bill but is published independently. Make sure bill_id FK works for conference reports' primary bills.
- CBO cost estimate inlined in body. The full CBO cost estimate text appears inside the report body. Cross-link to cbo_publications (see CBO cost estimates) when we can match by date + bill (heuristic). Defer to v1.x; the cross-reference is in the citation graph anyway.
- Pre-1995 reports: exist in print, mostly digitized as scans. Out of scope for v1.
- Treaty reports (ERPT doc class): Senate executive reports on treaties. Treaties as a source are deferred to v2; the reports themselves we ingest along with HRPT/SRPT but treat as orphans (no bill_id).
- Very large conference reports. A committee print of an omnibus conference report can be hundreds of MB. Verify body text storage strategy (drop body, keep only chunks for very-large rows?).
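To make the "drop body, keep only chunks" option concrete, here is one possible shape for it. Everything in this sketch is an assumption: the fixed-size character chunking with overlap, the default sizes, and the `max_body_bytes` threshold are placeholders, not decisions recorded in this document.

```python
def chunk_text(body: str, max_chars: int = 2000, overlap: int = 200) -> list[tuple[int, str]]:
    """Split body into (chunk_index, chunk_text) pairs with overlapping windows."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    if not body:
        return []
    chunks: list[tuple[int, str]] = []
    step = max_chars - overlap
    start = 0
    while True:
        chunks.append((len(chunks), body[start:start + max_chars]))
        if start + max_chars >= len(body):
            break  # this window already reached the end of the text
        start += step
    return chunks


def should_store_body(body: str, max_body_bytes: int = 50_000_000) -> bool:
    """Hypothetical policy: keep body_text only below a size threshold."""
    return len(body.encode("utf-8")) <= max_body_bytes


pieces = chunk_text("a" * 25, max_chars=10, overlap=2)
print([(index, len(text)) for index, text in pieces])
```

The pairs line up with the chunk_index and chunk_text columns of committee_report_chunks; a very large row could then store NULL in body_text while its chunks remain searchable.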