Committee reports ingester
Header
Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.
Why
Committee reports (H. Rep. and S. Rep. — GovInfo CRPT collection) are the
written record of legislative intent: section-by-section analyses, dissents,
CBO cost estimates inlined, and the committee vote rollcall. They are
FK-target-rich (reports reference bills, members, and committees) and the
best substrate answer to "why was this bill amended" / "what did the
committee say" queries. Alternative 6th/7th launch source if Federal
Register or hearings slip.
User stories
As a policy analyst using my own AI agent, I want to ask the agent for the committee report accompanying a bill and the dissenting views filed against it so that I understand legislative intent and minority objections without leaving the chat.
As an OSS self-hoster, I want to run `josh ingest committee-reports --congress 119` on a fresh substrate so that I can populate H. Rep. / S. Rep. entries from GovInfo's CRPT bulk endpoints on my own instance.
As a downstream legislative-history consumer (bills, roll-call votes), I want to join `bills → committee_reports → committees → legislators` cleanly so that queries like "every report that committee X submitted on bills sponsored by member Y" resolve through real foreign keys, not free-text matching.
Acceptance criteria (EARS)
- When `josh ingest committee-reports --congress 119` is run on the deployed substrate, the system shall populate `committee_reports` rows whose six citation columns (`source_url`, `source_org`, `published_at`, `retrieved_at`, `raw_sha256`, `citation_string`) are all non-null.
- When a committee report HTML rendition is fetched, the system shall write the raw payload under `/data/corpus/committee-reports/bodies/raw/` and a normalized Markdown body under `/data/corpus/committee-reports/bodies/markdown/`.
- While the ingester is running, if a single report fails to parse, then the system shall log the failure to `ingestion_logs` and continue with the next report rather than crashing the run.
- When an FTS5 query `MATCH 'section-by-section'` is executed against `committee_reports_fts`, the system shall return BM25-ranked results with the per-column weight tuple `(10.0, 4.0, 2.0, 1.0)`.
- Where the report MODS extension declares one or more `<bill>` elements, the system shall populate the `committee_report_bills` junction table with one row per (report, bill, relationship) where `relationship IN ('PRIMARY', 'OTHER')`.
- Where the MODS extension declares a `<congCommittee>` whose `authorityId` resolves against `committees`, the system shall populate `committee_reports.submitting_committee_id`; otherwise the column shall be left null and a `committee_unresolved` event written to `ingestion_logs`.
- Where a report body exceeds ~5K tokens (true for nearly all reports, and routinely true for conference reports), the system shall enqueue rows in `ingestion_embedding_queue` for chunk-level embedding.
- The system shall enforce a unique constraint on `(chamber, congress, report_number, doc_class)` so that re-running the ingester is idempotent and `H. Rept. 119-12` cannot be inserted twice under different `id`s.
- When an incremental run completes, the system shall update `ingestion_source_state.last_seen_key` (per-year sitemap `lastmod`) such that the next run only fetches packages whose sitemap `<lastmod>` is newer.
Success determiner
Command
set -euo pipefail
# Smoke backfill: 50 reports from the current Congress
ssh josh 'docker exec josh-ingester josh ingest committee-reports --congress 119 --max 50'
# Verify rows landed with all citation columns populated
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
SELECT COUNT(*) FROM committee_reports
WHERE congress = 119
AND source_url IS NOT NULL AND source_org IS NOT NULL
AND published_at IS NOT NULL AND retrieved_at IS NOT NULL
AND raw_sha256 IS NOT NULL AND citation_string IS NOT NULL;"'
# FTS5 sanity
ssh josh 'docker exec josh-core sqlite3 /data/josh.db \
"SELECT COUNT(*) FROM committee_reports_fts WHERE committee_reports_fts MATCH '\''committee'\'';"'
# FK linkage: rows that resolved a bill_id (PRIMARY relationship)
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
SELECT COUNT(*) FROM committee_report_bills
WHERE relationship = ''PRIMARY'';"'
# FK linkage: rows that resolved submitting_committee_id
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
SELECT COUNT(*) FROM committee_reports
WHERE congress = 119 AND submitting_committee_id IS NOT NULL;"'
Expect
Smoke run only. Full backfill scope (current Congress vs back to 104th vs full GPO 1995-onward coverage) is open per `clarifications_needed` and runs detached on the server per `add-a-new-source.html` once chosen.
Clarifications needed
- Backfill scope: current Congress (119th, ~500-1300 reports) vs back to 104th (~25K reports, ~3-5 GB raw) vs full GPO 1995-onward (same ~25K — 1995 is roughly the start of CRPT online coverage)? Currently leaning current Congress for v1 launch, expanding to all-time on a follow-up backfill.
- Pre-1995 reports exist in print (mostly digitized scans, heterogeneous quality). Confirm out-of-scope for v1 — assumed deferred but worth naming explicitly.
- Conference reports: ingest into the same `committee_reports` table with `is_conference_report=true` (per `https://docs.usejosh.com/sources/committee-reports/`), or split into a sibling table? Default plan is single table with discriminator flag.
- Bill FK resolution when the referenced bill is not yet in the substrate: soft FK (text `bill_id` column, no DB-level constraint, resolve at query time) is the documented plan; confirm this matches what `bills-ingester` exposes, since lazy-join requires that bill IDs follow the canonical `bill:<congress>-<type>-<number>` shape verbatim.
- Committee FK resolution when the report's `<congCommittee authorityId>` is not yet in `committees`: leave `submitting_committee_id` null and re-resolve on the next run, vs block ingestion of the report? Default is null + log + retry; confirm acceptable.
- Submitter member resolution (`submitted_by_bioguide`) — heuristic name-match against committee membership at `issue_date` is documented as ambiguous in the per-source doc. Confirm we ship v1 with `submitted_by_member_name` populated and `submitted_by_bioguide` best-effort / often null.
- Section-by-section parser, dissenting-views split, and CBO inline cross-link to `cbo_publications` are all listed as v1.x in the per-source doc — confirm v1 ships full body text only, with structured-section work deferred.
- Treaty reports (`ERPT` doc class, Senate Executive Reports): ingest alongside HRPT/SRPT as orphans (no `bill_id`) per the per-source doc, or skip until treaties land in v2?
- Very-large conference reports (omnibus joint explanatory statements, hundreds of MB) — store full `body_text` or chunks-only? Storage strategy for the long-tail not yet decided.
Out of scope
- Pre-1995 committee reports — GPO online coverage starts ~1995; older reports exist only as print/scans of heterogeneous quality.
- State legislature committee report equivalents — Phase 6.
- House and Senate floor speeches and debate transcripts — that's `congressional-record`, a separate source.
- Per-section structured parser (background, section-by-section, dissenting views as separate rows) — deferred to v1.x; v1 ships full body text + FTS5 + chunk-level embeddings.
- CBO cost-estimate cross-link to `cbo_publications` — citation graph captures it; structured cross-reference deferred to v1.x.
Dependencies
Plan
Source module follows the standard Source protocol — seehttps://docs.usejosh.com/operations/ingestion-architecture/ andhttps://docs.usejosh.com/operations/add-a-new-source/. Per-source spec athttps://docs.usejosh.com/sources/committee-reports/ (probe findings + indicative DDL —
the Alembic migration inshared/josh_substrate/src/josh_substrate/migrations/versions/ is the
source of truth).
Discover/fetch/parse/load shape:
- discover: GovInfo per-year sitemap
https://www.govinfo.gov/sitemap/CRPT_{YYYY}_sitemap.xml. ~500-1300
<url> entries per recent year; each carries <lastmod>. Watermark from
ingestion_source_state.last_seen_key (per-year sitemap-lastmod tuple).
No auth, ~1-2 req/sec polite.
- fetch: per-package MODS XML
(/metadata/pkg/CRPT-{packageId}/mods.xml) + HTML rendition
(/content/pkg/CRPT-{packageId}/html/CRPT-{packageId}.htm) to
/data/corpus/committee-reports/bodies/raw/. SHA256 over the HTML
bytes is raw_sha256 for citation. PDF rendition fetched lazily
(off the hot path; some pre-2010 HTML is malformed and PDF is the
fallback). Multi-part conference reports come in separate packageIds —
each becomes its own row, linked by (chamber, congress, report_number).
- parse: MODS lxml walker for metadata
(docClass, chamber, congress, number, session,
isConferenceReport, <bill> PRIMARY/OTHER references, <congCommittee>
authorityId, <dateIssued>, <recommendation>, <abstract>,
<submittedStatement>, citation <identifier type="preferred citation">).
HTML walker over the <pre>-wrapped plaintext for body extraction →
Markdown body to /data/corpus/committee-reports/bodies/markdown/.
Pre-2010 HTML fallback path: PDF text extraction.
- load: idempotent upsert on the (chamber, congress, report_number, natural-key unique constraint. Junction-table inserts for
doc_class)
committee_report_bills (one row per (report, bill, PRIMARY|OTHER)).
submitting_committee_id resolved against committees.system_code
(the <congCommittee authorityId>) — when the committee is missing,
leave null and write committee_unresolved to ingestion_logs; same
for unresolved bill IDs (soft FK pattern). FTS5 sync triggers fire on
insert/update/delete. Body chunks enqueued for embedding.
Migration shape (canonical lives in the Alembic versions directory; sketch
here is illustrative):
- committee_reports table with the citation-metadata 6 columns,
chamber / doc_class / congress / session / report_number,
is_conference_report flag, submitted_by_member_name +
submitted_by_bioguide (soft FK to legislators), submitting_committee_id
(soft FK to committees), body Markdown path, and the unique constraint
(chamber, congress, report_number, doc_class).
- committee_report_bills(report_id, bill_id, relationship) junction —
bill_id is a soft FK (text-only, no DB-level constraint) so the
ingester can record references to bills not yet in the substrate.
- committee_reports_fts virtual table + 3 sync triggers
(BM25 weights (title=10.0, abstract=4.0, recommendation=2.0,).
body=1.0)
- committee_report_chunks + committee_report_chunk_vecs (vec0 with
both float[1024] and bit[1024] for BQ rescore, per the substrate
storage decision).
FK resolution strategy:
- Bills: soft FK. committee_report_bills.bill_id is text only,
no DB-level constraint. Lazy join at query time. When a referenced bill
isn't yet in bills, the row still lands; the join just returns null on
the bill side until the bill ingester catches up. Rationale: the
committee-reports ingester must not block on bills coverage gaps,
especially for old conference reports where the underlying bill may
pre-date the bills backfill window.
- Committees: best-effort. Lookup against committees.system_code by
the MODS <congCommittee authorityId>. On miss, leave null + log; the
next run after legislators-and-committees-ingester re-runs will pick
it up.
- Submitter member: best-effort name-match heuristic. v1 always
populates submitted_by_member_name (the literal "Mr. Bost" string from
MODS); submitted_by_bioguide is set only when name → committee
membership at issue_date resolves unambiguously. Ambiguous cases stay
null.
Schedule recommendation: daily at :30. Committee reports publish
irregularly when Congress is in session — daily polling of current-year +
previous-year CRPT sitemaps catches new reports without spamming GovInfo.
Hourly is overkill (this isn't FR's daily-bulk shape), but the BILLSTATUS
cross-trigger from the bills ingester (when a <committeeReports> entry
appears) can enqueue an immediate fetch for the "report just landed" case.
Tasks
0 of 17 done.
- t1 Alembic migration applied locally + production
- t2 Pydantic models in shared/josh_substrate/src/josh_substrate/models/committee_reports.py
- t3 Source module implementing the `Source` protocol
- t4 Body normalization wired (raw + markdown paths, per-format converter unit-tested — HTML primary, PDF fallback for pre-2010)
- t5 Citation metadata: all 6 columns populated; citation_for() formatter + unit tests on 3+ fixtures (HRPT, SRPT, ERPT)
- t6 FTS5 virtual table + 3 sync triggers wired into migration
- t7 Vector vec0 table wired (committee reports are long; conference reports especially so)
- t8 Unit test for MODS + HTML parser against captured fixtures
- t9 Integration test for the full discover→fetch→parse→load pipeline
- t10 Minimum-viable backfill complete at planned volume
- t11 Incremental cron registered and observed firing once
- t12 No recurring errors in `ingestion_logs` for the last completed run
- t13 Source doc updated with as-built notes; status flipped to ingestion-built
- t14 Inventory row in `https://docs.usejosh.com/josh-data-sources/` updated to ingestion-built
- t15 12-sample manual HTML/MODS→Markdown validation gate (per add-a-new-source.html) — covers HRPT, SRPT, ERPT, conference, and multi-part
- t16 URL verification gate: 10/10 random records' `source_url` resolves to the matching CRPT package on GovInfo via WebFetch
- t17 Bill + committee FK resolution integrity test — for a sample of 50 reports, confirm `committee_report_bills` PRIMARY rows match MODS PRIMARY-context bills, and `submitting_committee_id` matches the MODS `<congCommittee authorityId>` when the committee exists in `committees`
Changelog
No history yet.