Hearing transcripts

The text of what was actually said in hearings — every speaker turn, every Q&A round, every witness statement read into the record. This is the highest-value-per-MB corpus in Josh: it lets the agent answer "what did Jensen Huang testify about export controls" or "summarize the GAO IG's testimony in the May 7 oversight hearing."

This source is distinct from hearings (the calendar + meeting + witness metadata). That source records that a hearing happened; this source carries the content of what was said. The two link via hearings.id (the GovInfo packageId).

Coverage gap warning: the v1 plan promised "every committee hearing." The reality is messier. Officially published hearing volumes (CHRG collection at GovInfo) cover only a fraction of all meetings — typically the high-profile ones. CRS has documented that only a minority of committee hearings are ever printed for some committees. We need a layered strategy: official text where available, ASR-from-video where it isn't, with explicit gap tracking.

Quick reference


Source name	Hearing transcripts
Publishers	Government Publishing Office (CHRG collection), C-SPAN (video + closed captions), individual committees (occasionally PDFs of preliminary transcripts)
License	CHRG: public domain. C-SPAN: terms of use restrictive — research use only, no rebroadcast. Closed captions are derivative of broadcast. ASR-derived: depends on input source.
Coverage	CHRG: 1995 – present, ~600-1300 packages/year, ~35K total in API. Pre-1995: paper-only or scanned. C-SPAN closed captions: 1987 – present (broadcast), but no programmatic download.
Volume	~35K transcripts × ~300 KB HTML each = ~10 GB raw HTML, ~30-50 GB PDF, ~1-3 GB plain text after extraction
Stable ID format	See Stable ID format — three divergent forms exist on this page and must be reconciled.
Upstream	GovInfo CHRG collection help and developer / bulk-data + sitemap docs. No auth for GovInfo. C-SPAN clip downloads require a MyC-SPAN account (4 free/month).
Status	exploring — schema drafted, ASR pipeline shape defined but not built

Source priority decision

Three layers, in priority order:

Layer 1 (canonical, where available): GovInfo CHRG. Official, GPO-typeset, edited-for-record text. Comes in PDF (printed) and HTML (formatted text) renderings, both pointing at the same <pre>-wrapped plaintext blob. MODS XML carries witness names, jacket numbers, dates, committee codes. Use this for any hearing that has been formally printed. Coverage ~600-1300 hearings per year.

Layer 2 (best-effort): committee-published preliminary transcripts. Some committees post unofficial transcripts as meeting_documents with documentType: "Hearing: Transcript" (we saw an example in hearings — HHRG-118-GO00-Transcript-20230208.pdf). These appear weeks-to-months before the formal CHRG publication and are immensely useful for staying current. Pull them via the hearings flow (already a meeting_documents row); extract text and link as a transcript record.

Layer 3 (ASR fallback): video → automated transcription. For meetings with no published transcript and no committee-published preliminary, we run ASR on the video. Options for source video:

Congress.gov video pages (congress.gov/event/{c}-Congress/{chamber}-event/{eventId}) — embedded.
YouTube — official committee channels publish meeting recordings (the meetingMeeting.videos[] field of source #6 carries YouTube URLs).
C-SPAN — best coverage, but research-use-only license.
House.gov video archive — committees host their own.

ASR engine: Whisper-large-v3 or distil-whisper-v3 self-hosted, or a managed service (Groq Whisper API, Replicate). Quality is good enough for retrieval (BM25 search, semantic search) but not citation-grade. Flag ASR-derived transcripts in the schema so the agent can tell the user "I'm relying on auto-transcribed text — verify before citing."

Skip: scraping individual committee websites for transcripts. Heterogeneous, fragile, low-yield. The few committees that publish reliably are already feeding meeting_documents.

Skip: OCR of pre-1995 paper transcripts. Out of scope for v1. Library of Congress's Hathitrust has scanned older volumes; HOLLIS / catalog.gpo.gov are the right entrance points but neither is digitized text. Re-revisit in v2.

Access notes

GovInfo CHRG: no auth, polite rate, real-browser UA optional. The HTML rendition (html/CHRG-{packageId}.htm) is <pre>-wrapped plaintext — already structured-ish, easy to parse. This HTML rendition is the primary text-extraction target. PDF is canonical for citation purposes. Endpoint/URL patterns (HTML, PDF, ZIP, MODS, PREMIS, wssearch/getContentDetail) live in GovInfo's CHRG collection help.

C-SPAN: no programmatic transcript API. Closed captions exist (used internally for video search) but are not exposed for download. Going through MyC-SPAN (a free account, 4 clip downloads/month) to download videos and re-transcribe is OK for ASR; pulling captions out of HLS streams is technically possible but unsanctioned. See C-SPAN terms & conditions.

Discovery + diff: per-year sitemap

Discovery and incremental diff run off GovInfo's per-year CHRG sitemaps (CHRG_{YYYY}_sitemap.xml, ~600-1300 packages per recent year); each <url> carries a <lastmod> we diff against. The sitemap/bulk-data conventions are documented at GovInfo developers.

Per-year counts (sample):

2023: 1,301 packages
2024: 835 packages
2025: 634 packages (publication lag — many 2025 hearings won't print until 2026-2027)

Cumulative ~35K all-time. The publication lag means the latest few quarters of a sitemap year always under-counts — hearings from Q4 2025 will continue to land in the 2025 sitemap through 2027.

Vocabulary / enums

The MODS field catalog (extension/genre is always government publication; extension/branch is always legislative; the upstream-defined granuleClass / docClass value lists) is documented in GovInfo's MODS help. The Josh consumption notes for the fields we actually join on live under MODS structure. Observed values worth noting: granuleClass is one of OTHERPART / MAIN / WITNESS / MEMBER (multi-part hearings are split into granules), and docClass is HHRG (House), SHRG (Senate), or JHRG (Joint).

Transcript source layer (our enum, not from upstream)

We classify each transcript record by provenance:

Value	Meaning
`chrg_official`	GovInfo CHRG package — formally published, edited
`committee_preliminary`	PDF transcript posted by committee before CHRG publication
`asr_video`	Auto-transcribed from video (low fidelity)
`asr_audio`	Auto-transcribed from audio-only feed

The agent surfaces this so a user can tell whether a quote is citation-grade.

Stable ID format

The transcript record is the body content of an existing hearing record, so its ID derives from the parent hearings source.

For ASR-derived transcripts where no CHRG package exists yet, we mint a synthetic ID. This rolls into a CHRG-keyed record once the official volume is published — at which point we link the ASR transcript as a prior version (preserving it) and prefer the CHRG body for the agent's reads.

CHRG HTML structure

The HTML body is <html><body><pre>...all text...</pre></body></html>. Inside the <pre>, the structure is loose but learnable:

Cover page header. "House Hearing, 118 Congress" / "Senate Hearing, 117 Congress" / "Joint Hearing", title (often all-caps with line breaks, multi-line), "BEFORE THE COMMITTEE ON ...", chamber, congress, session, date.
Cover page metadata. "Serial No. 118-3", U.S. GPO publication line, jacket info (50-898 PDF).
Committee membership listing. Two-column ASCII layout — majority members left, minority right.
Table of contents (optional). Witness list with page numbers.
Body — the proceedings. Speaker turns prefixed with Mr. NAME. or Chairman NAME. or The CHAIRMAN. or Mr. RASKIN. etc., followed by paragraphs. Italicized stage directions in [brackets] (e.g. [Whereupon, at 4:12 p.m., the Committee was adjourned.], [GRAPHIC NOT AVAILABLE IN TIFF FORMAT]).
Appendices. Witness submitted statements, attachments, member statements for the record. Each labeled with the witness/member name and a header rule (-----).
Closing marker. [all] or [Whereupon, ...adjourned.].

Speaker turn parsing. A robust regex for speaker turns:

^(Mr\.|Mrs\.|Ms\.|Dr\.|Senator|Chair(?:man|woman)|Vice Chair|Ranking Member|Representative|The CHAIRMAN|The CHAIR)\s+([A-Z][A-Z'\.\- ]+?)\.\s

The speaker name is mostly all-caps in the body but appears in mixed case in the witness/member listings. Resolution to bioguide ID requires fuzzy matching against legislators (last name + state hint from the committee membership listing).

Witness vs member discrimination. Members appear in the committee listing on the cover page. Anyone speaking who isn't listed there is a witness. Cross-check with the meeting_witnesses table from hearings for resolved witness names.

Page breaks. [[Page N]] markers indicate the printed page number. Useful for citing back to the printed serial.

Unparseable junk. Long ASCII tables, bracketed graphics callouts ([GRAPHIC NOT AVAILABLE IN TIFF FORMAT]), bracketed audio annotations ([laughter], [applause], [off-mic comment]). Filter at parse time but preserve in body_text_raw.

MODS structure (high-value subset)

GovInfo's full MODS field catalog is documented in their MODS help. The few elements Josh actually joins on, trimmed to their join annotations:

<extension>
  <heldDate>2023-02-08</heldDate>           <!-- can repeat for multi-day -->
  <docClass>HHRG</docClass>
  <witness>Mr. James Baker, Former Twitter Deputy General Counsel; Former FBI General Counsel</witness>
  <witness>Ms. Vijaya Gadde, Former Twitter Chief Legal Officer</witness>
  <congSerial chamber="H" congress="118" number="3"/>
  <eventId>115286</eventId>                  <!-- joins to committee_meetings -->
</extension>

The eventId link is the same number as committee_meetings.event_id — exact join.

<witness> strings are richer than the API's witness list (full credentials inline). Useful for entity extraction.

<congSerial chamber="..." congress="..." number="..."/> is the chamber serial (Serial No. 118-3). House and Senate publish independent serial sequences.

ASR pipeline shape

For meetings without a published transcript, the auto-transcription pipeline:

Discovery. Daily, query for committee meetings where:
- meeting_status = 'Held'
- meeting_date <= now() - interval '7 days' (give committee + GPO time to publish official text)
- No row in hearing_transcripts with this meeting_id yet
- The meeting_videos table has at least one video URL
Video acquisition. Prefer YouTube (most reliable for download via yt-dlp) > congress.gov direct > C-SPAN (last-resort due to license). Save .mp3 (audio-only, ~50 MB per hour) to /data/corpus/hearings/audio/{eventId}/.
Transcription. Whisper-large-v3 self-hosted on GPU; or distil-whisper-v3 if cost-constrained. Output .json with word-level timestamps + per-segment text. Save to /data/corpus/hearings/asr/{eventId}/.
Speaker diarization (optional v1.x). pyannote.audio's diarization model + speaker-embedding clustering. Match clusters to known speakers via voice prints if we have them; otherwise label as Speaker_1, Speaker_2. Skip in v1 — un-diarized text is searchable; the agent can ask follow-up questions.
Speaker labeling. Cross-reference timestamps with the meeting's meeting_witnesses and committee roster from committees. For now, a heuristic: assume the committee chair speaks first and last; assume opening statements alternate majority → minority leadership → ranking member → other members. Confidence scores stored alongside.
Load. Insert as hearing_transcripts row with source = 'asr_video', body_text populated, no package_id. When the official CHRG package eventually appears, the loader detects it and adds a second row (doesn't replace) so we keep the ASR for audit.
Re-rank citations. When CHRG arrives, the agent's retrieval re-ranks: official > preliminary > ASR.

Cost estimate. ~3,500 meetings/year × ~3 hours each × $0.10/hr (Whisper local on amortized H100 at $1.50/hr, real-time = 0.1× wall) = ~$1,000/year compute amortized. Acceptable.

Quality estimate. Whisper-large-v3 on clean committee audio: ~5-10% WER. Names are the worst — congressional hearings have many proper nouns the model hasn't seen often. Post-processing pass to align speaker labels with member names mitigates.

Volume

Source layer	Records (estimate)	Storage
CHRG official	~35K records (1995-present), ~600-1300 added per year	~10 GB HTML + ~30 GB PDF + ~1-3 GB extracted plaintext
Committee preliminary PDFs	~500-2000 per year, but transient (replaced by CHRG eventually)	<1 GB
ASR backfill (history)	~30-50K meetings without transcript, 1-3 hrs each	Audio: ~5 TB (we won't store, transient). Text: ~5-10 GB
ASR ongoing (per year)	~3,500 meetings/year	<1 GB text/year

Storage strategy: keep extracted plaintext + raw HTML for CHRG; keep extracted ASR text only (drop audio after transcription). Total Postgres footprint: ~15-20 GB. Body text searched via GIN index on body_tsv; chunks for embedding via hearing_transcript_chunks. (Chunker strategy is TBD — no chunker is named or evaluated for this source yet; only the hearing_transcript_chunks table and chunk→turn anchoring exist.)

Caching / incremental sync

Per-year sitemap diff (weekly). Fetch each year's CHRG_{YYYY}_sitemap.xml and CHRG_{YYYY-1}_sitemap.xml (recent year still has churn). Diff <lastmod> per URL; refetch where changed.
MODS recordChangeDate as the per-package change watermark. Skip re-parse if recordChangeDate <= last_parsed_at AND <lastmod> unchanged.
Daily ASR queue scan. Pull meetings >7 days post-gavel with no transcript yet, up to a per-day cap (CPU/GPU bound).
Re-pass on new CHRG arrivals. When a new CHRG package matches an existing eventId we already have ASR for, mark the ASR row as superseded (don't delete) and load the official text.

Schema (Postgres DDL)

-- ============================================================
-- Hearing transcripts (the body content)
-- ============================================================

CREATE TABLE hearing_transcripts (
    id text PRIMARY KEY,                          -- 'transcript:CHRG-118hhrg50898' or 'transcript:asr:119-h-119082'

    -- Linkage
    hearing_id text REFERENCES hearings(id) ON DELETE CASCADE,         -- when CHRG-published
    meeting_id text REFERENCES committee_meetings(id) ON DELETE SET NULL, -- when ASR/preliminary

    -- Provenance
    source text NOT NULL CHECK (source IN (
        'chrg_official', 'committee_preliminary', 'asr_video', 'asr_audio'
    )),
    is_superseded boolean NOT NULL DEFAULT false, -- true when an authoritative version exists

    -- Source URLs (per-rendering)
    govinfo_html_url text,
    govinfo_pdf_url text,
    committee_pdf_url text,
    source_video_url text,

    -- Text content
    body_text text NOT NULL,                      -- cleaned, structured plaintext
    body_text_raw text,                           -- before junk filtering — preserves brackets, page markers, ASCII tables

    -- ASR-specific metadata
    asr_engine text,                              -- 'whisper-large-v3', 'distil-whisper-v3', etc.
    asr_engine_version text,
    asr_word_count int,
    asr_avg_confidence numeric(4,3),              -- 0.000-1.000
    asr_word_timestamps_url text,                 -- path to /data/corpus/hearings/asr/{id}.json with word-level timing

    -- Lifecycle
    fetched_at timestamptz NOT NULL,
    parsed_at timestamptz,
    inserted_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now(),

    -- One transcript per (hearing, source) pair; multiple sources allowed for same hearing
    UNIQUE (hearing_id, source) WHERE hearing_id IS NOT NULL,
    UNIQUE (meeting_id, source) WHERE meeting_id IS NOT NULL AND hearing_id IS NULL
);

CREATE INDEX hearing_transcripts_hearing ON hearing_transcripts (hearing_id) WHERE hearing_id IS NOT NULL;
CREATE INDEX hearing_transcripts_meeting ON hearing_transcripts (meeting_id) WHERE meeting_id IS NOT NULL;
CREATE INDEX hearing_transcripts_source ON hearing_transcripts (source);
CREATE INDEX hearing_transcripts_active
    ON hearing_transcripts (hearing_id, source)
    WHERE NOT is_superseded;

-- Search
ALTER TABLE hearing_transcripts ADD COLUMN body_tsv tsvector
    GENERATED ALWAYS AS (
        to_tsvector('english', coalesce(body_text, ''))
    ) STORED;
CREATE INDEX hearing_transcripts_body_search ON hearing_transcripts USING gin (body_tsv);

-- ============================================================
-- Speaker turns (parsed from body_text)
-- ============================================================

CREATE TABLE hearing_transcript_turns (
    id bigserial PRIMARY KEY,
    transcript_id text NOT NULL REFERENCES hearing_transcripts(id) ON DELETE CASCADE,
    sequence int NOT NULL,                        -- order within transcript
    speaker_raw text NOT NULL,                    -- 'Mr. RASKIN', 'The CHAIRMAN', 'Mr. BAKER'
    speaker_role text,                            -- 'member', 'witness', 'chairman', 'unidentified'
    speaker_bioguide text REFERENCES legislators(bioguide_id),  -- when resolvable
    speaker_witness_id bigint REFERENCES meeting_witnesses(id), -- when resolvable
    text text NOT NULL,
    -- ASR-specific
    asr_start_seconds numeric(10,3),
    asr_end_seconds numeric(10,3),
    UNIQUE (transcript_id, sequence)
);

CREATE INDEX hearing_transcript_turns_transcript ON hearing_transcript_turns (transcript_id, sequence);
CREATE INDEX hearing_transcript_turns_member ON hearing_transcript_turns (speaker_bioguide) WHERE speaker_bioguide IS NOT NULL;

-- Search by member's words
ALTER TABLE hearing_transcript_turns ADD COLUMN search_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('english', coalesce(text, ''))) STORED;
CREATE INDEX hearing_transcript_turns_search ON hearing_transcript_turns USING gin (search_tsv);

-- ============================================================
-- Embedding chunks
-- ============================================================

CREATE TABLE hearing_transcript_chunks (
    id bigserial PRIMARY KEY,
    transcript_id text NOT NULL REFERENCES hearing_transcripts(id) ON DELETE CASCADE,
    chunk_index int NOT NULL,
    chunk_text text NOT NULL,
    -- Anchor back to source for citation
    starting_turn_id bigint REFERENCES hearing_transcript_turns(id),
    embedding vector(1024),
    UNIQUE (transcript_id, chunk_index)
);
CREATE INDEX hearing_transcript_chunks_embedding
    ON hearing_transcript_chunks USING hnsw (embedding vector_cosine_ops);

Schema decisions worth flagging:

Multiple transcripts per hearing/meeting allowed — the unique constraint is (hearing_id, source) so a single hearing can have a chrg_official row plus an asr_video row plus a committee_preliminary row. is_superseded=true flags the lower-fidelity rows once the official lands; we don't delete.
hearing_id and meeting_id both nullable, but at least one must be set. ASR transcripts may have meeting_id only (no CHRG package yet); CHRG transcripts may have hearing_id only (the meeting_id resolution depends on the MODS <eventId> cross-reference being populated, which it usually is but not always for older hearings). A check constraint CHECK (hearing_id IS NOT NULL OR meeting_id IS NOT NULL) is enforceable; left out for brevity.
hearing_transcript_turns separate table — speaker-turn-level retrieval is the dominant access pattern. The agent searches "what did Schumer say about chips" and wants turns, not whole transcripts.
speaker_bioguide and speaker_witness_id both nullable — resolution to a canonical person is best-effort. A single transcript may have both members and witnesses; a single turn maps to at most one.
Word-level ASR timing kept off-row — asr_word_timestamps_url points to a JSON file in /data/corpus/hearings/asr/, not stored in Postgres (1-3 MB per hour is too much for inline). The per-turn asr_start_seconds / asr_end_seconds is the in-row anchor.
hearing_transcript_chunks.starting_turn_id anchors chunks back to turns so the agent can quote a chunk and link to the source turn for citation.

Download / update strategy

Backfill

CHRG (1995-present):

For each year 1995..current:
- Fetch CHRG_{YYYY}_sitemap.xml. ~600-1300 URLs each.
- For each <loc>, derive packageId (suffix after /app/details/).
- Fetch HTML rendering: /content/pkg/CHRG-{packageId}/html/CHRG-{packageId}.htm. Save to /data/corpus/hearings/transcripts/{year}/{packageId}.htm.
- Fetch MODS XML. Save to /data/corpus/hearings/transcripts/{year}/{packageId}.mods.xml.
- Parse HTML → speaker turns. Parse MODS → witness names + heldDate + eventId.
- Insert hearing_transcripts + hearing_transcript_turns.
- Embed chunks.
~35K packages × 2 fetches each = ~70K calls. With 4 concurrent workers @ 1 req/s: ~5 hours.

Committee preliminary PDFs:

From meeting_documents where document_type = 'Hearing: Transcript', fetch each PDF, extract text via pdfplumber or pdftotext, parse turns. Insert with source='committee_preliminary'. ~1-2K total currently.

ASR backfill — defer. Don't backfill ASR over historical hearings without published transcripts. Cost-efficient strategy: ASR only for new meetings that haven't yet been published, plus on-demand if an agent specifically asks about a hearing without text.

Daily incremental

Weekly: CHRG sitemap diff. For current year + previous year (publication lag). For changed <lastmod>, refetch + re-parse.
Daily: scan committee_meetings. Find held meetings >7 days old without hearing_transcripts rows, with at least one meeting_videos row. Queue for ASR.
Daily: ASR queue worker. Pull from queue, fetch video → audio, transcribe, parse, insert.
On CHRG arrival: supersede ASR. When sitemap diff detects a new CHRG for a meeting that has ASR, set is_superseded=true on the ASR row, insert the new official row.

State tracking

Source keys: hearing_transcripts_chrg, hearing_transcripts_committee, hearing_transcripts_asr. Each tracked separately. ASR runs have a task row per meeting with status (queued, downloading, transcribing, parsed, error).

Failure modes

CHRG HTML parse failure (unusual layout). Rare in 2010+ packages, common for pre-2005. Fall back to PDF text extraction; mark parsed_at = NULL; flag for manual review.
MODS missing <eventId>. Older packages don't have this. Fall back to fuzzy match: (committee_systemCode, heldDate) → committee_meetings. If no match, leave meeting_id = NULL.
Multi-day hearing → multiple meetings → one CHRG. The MODS has multiple <heldDate> elements. Insert hearing_dates rows for each (already in hearings schema), and link the single transcript to the earliest matching meeting. Document this in the hearings ↔ hearing-transcripts join semantics.
ASR transcribed wrong language / very poor audio. Check asr_avg_confidence. Below 0.5 → retry once. Below 0.5 again → flag for human review; insert with is_superseded=false but a confidence note in the agent's surface.
YouTube takedown of committee video before we transcribe. Some committees pull old videos. Fall back to congress.gov video page. If both gone, we miss this meeting; log and move on.
Disk pressure — ASR audio at 50 MB/hour × thousands of meetings won't fit on /data if we keep everything. Strict policy: delete audio once transcription succeeds and is verified.

Open questions

These don't block ingestion but should be resolved before this source is "shipped":

ASR engine choice. Whisper-large-v3 (best quality, ~$0.10-0.30/hr local on H100; OpenAI/Groq API ~$0.36/hr) vs distil-whisper-v3 (3-6× faster, ~95% of accuracy) vs API services (Replicate, Groq) for quality + cost. Benchmark on a sample of real hearing audio before committing.
Speaker diarization in v1? A diarized transcript with speaker labels is dramatically more useful than an undifferentiated wall of text. pyannote.audio is mature; runs on the same GPU. Adds ~30% to the per-hearing time. Lean toward yes for v1 if compute allows.
Voice-print library for senators / cabinet officials. With a small library of known speakers' voice prints, we can label diarized clusters by name with high confidence. Building the library is an annotation task; defer to v1.x.
Member-name resolution from committee membership listing. Most CHRG cover pages have a two-column ASCII listing of committee members. Parsing it gives a ground-truth list of speakers we should expect, plus a state hint for disambiguating last names. Worth a careful parser.
Pre-2010 CHRG layout drift. Older publications have less consistent formatting. Plan a parser test suite of fixtures pulled from each decade.
Witness statements vs spoken testimony. Witness statements (their pre-prepared remarks, available in meeting_documents.witness_documents) differ from what they actually said in the hearing. Both are valuable but distinct. Keep them separate — the meeting_documents table already holds witness statement PDFs.
Committee preliminary transcripts identification. meeting_documents.document_type = 'Hearing: Transcript' is one signal but not the only one. Some committees use 'Member Statements' for full transcripts. Build a small classifier: PDF page count + title keywords.
ASR cost ceiling. ~$1,000/year compute is a back-of-envelope. If the agent gets used heavily and we ASR-on-demand for arbitrary historical meetings, costs balloon. Set a per-month budget.
Citation-grade flag in agent output. Plumbing the source enum into the agent's response so users see "(auto-transcribed, verify)" alongside ASR-derived quotes is a UX detail worth nailing early.
License compliance for C-SPAN. Even ASR-derived text from C-SPAN broadcasts inherits some restriction. Default to YouTube and congress.gov for video sourcing; only fall back to C-SPAN with a clear research-use justification.