Hearing transcripts
The text of what was actually said in hearings — every speaker turn, every Q&A round, every witness statement read into the record. This is the highest-value-per-MB corpus in Josh: it lets the agent answer "what did Jensen Huang testify about export controls" or "summarize the GAO IG's testimony in the May 7 oversight hearing."
This source is distinct from hearings (the calendar + meeting + witness metadata). That source records that a hearing happened; this source carries the content of what was said. The two link via hearings.id (the GovInfo packageId).
Coverage gap warning: the v1 plan promised "every committee hearing." The reality is messier. Officially published hearing volumes (CHRG collection at GovInfo) cover only a fraction of all meetings — typically the high-profile ones. CRS has documented that only a minority of committee hearings are ever printed for some committees. We need a layered strategy: official text where available, ASR-from-video where it isn't, with explicit gap tracking.
Quick reference
| Source name | Hearing transcripts |
| Publishers | Government Publishing Office (CHRG collection), C-SPAN (video + closed captions), individual committees (occasionally PDFs of preliminary transcripts) |
| License | CHRG: public domain. C-SPAN: terms of use restrictive — research use only, no rebroadcast. Closed captions are derivative of broadcast. ASR-derived: depends on input source. |
| Coverage | CHRG: 1995 – present, ~600-1300 packages/year, ~35K total in API. Pre-1995: paper-only or scanned. C-SPAN closed captions: 1987 – present (broadcast), but no programmatic download. |
| Volume | ~35K transcripts × ~300 KB HTML each = ~10 GB raw HTML, ~30-50 GB PDF, ~1-3 GB plain text after extraction |
| Stable ID format | See Stable ID format — three divergent forms exist on this page and must be reconciled. |
| Upstream | GovInfo CHRG collection help and developer / bulk-data + sitemap docs. No auth for GovInfo. C-SPAN clip downloads require a MyC-SPAN account (4 free/month). |
| Status | exploring — schema drafted, ASR pipeline shape defined but not built |
Source priority decision
Three layers, in priority order:
Layer 1 (canonical, where available): GovInfo CHRG. Official, GPO-typeset, edited-for-record text. Comes in PDF (printed) and HTML (formatted text) renderings, both pointing at the same <pre>-wrapped plaintext blob. MODS XML carries witness names, jacket numbers, dates, committee codes. Use this for any hearing that has been formally printed. Coverage ~600-1300 hearings per year.
Layer 2 (best-effort): committee-published preliminary transcripts. Some committees post unofficial transcripts as meeting_documents with documentType: "Hearing: Transcript" (we saw an example in hearings — HHRG-118-GO00-Transcript-20230208.pdf). These appear weeks-to-months before the formal CHRG publication and are immensely useful for staying current. Pull them via the hearings flow (already a meeting_documents row); extract text and link as a transcript record.
Layer 3 (ASR fallback): video → automated transcription. For meetings with no published transcript and no committee-published preliminary, we run ASR on the video. Options for source video:
- Congress.gov video pages (congress.gov/event/{c}-Congress/{chamber}-event/{eventId}): embedded.
- YouTube: official committee channels publish meeting recordings (the meetingMeeting.videos[] field of source #6 carries YouTube URLs).
- C-SPAN: best coverage, but research-use-only license.
- House.gov video archive: committees host their own.
ASR engine: Whisper-large-v3 or distil-whisper-v3 self-hosted, or a managed service (Groq Whisper API, Replicate). Quality is good enough for retrieval (BM25 search, semantic search) but not citation-grade. Flag ASR-derived transcripts in the schema so the agent can tell the user "I'm relying on auto-transcribed text — verify before citing."
Skip: scraping individual committee websites for transcripts. Heterogeneous, fragile, low-yield. The few committees that publish reliably are already feeding meeting_documents.
Skip: OCR of pre-1995 paper transcripts. Out of scope for v1. HathiTrust has scanned older volumes; HOLLIS / catalog.gpo.gov are the right entry points, but neither offers digitized text. Revisit in v2.
Access notes
GovInfo CHRG: no auth, polite rate, real-browser UA optional. The HTML rendition (html/CHRG-{packageId}.htm) is <pre>-wrapped plaintext: already structured-ish, easy to parse. This HTML rendition is the primary text-extraction target. PDF is canonical for citation purposes. Endpoint/URL patterns (HTML, PDF, ZIP, MODS, PREMIS, wssearch/getContentDetail) live in GovInfo's CHRG collection help.
C-SPAN: no programmatic transcript API. Closed captions exist (used internally for video search) but are not exposed for download. Going through MyC-SPAN (a free account, 4 clip downloads/month) to download videos and re-transcribe is OK for ASR; pulling captions out of HLS streams is technically possible but unsanctioned. See C-SPAN terms & conditions.
Discovery + diff: per-year sitemap
Discovery and incremental diff run off GovInfo's per-year CHRG sitemaps (CHRG_{YYYY}_sitemap.xml, ~600-1300 packages per recent year); each <url> carries a <lastmod> we diff against. The sitemap/bulk-data conventions are documented at GovInfo developers.
Per-year counts (sample):
- 2023: 1,301 packages
- 2024: 835 packages
- 2025: 634 packages (publication lag — many 2025 hearings won't print until 2026-2027)
Cumulative ~35K all-time. Because of the publication lag, the latest few quarters of a sitemap year always under-count: hearings from Q4 2025 will continue to land in the 2025 sitemap through 2027.
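The sitemap diff described above can be sketched as follows. This is a minimal illustration, not the loader itself: parse_sitemap and diff_sitemap are hypothetical helper names, the sitemaps.org namespace is assumed to be what GovInfo's sitemap files declare, and the second package ID in the sample is made up for the example.

```python
import xml.etree.ElementTree as ET

# Standard sitemaps.org namespace; assumed to match GovInfo's sitemap files.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> dict:
    """Map each <loc> to its <lastmod> string."""
    root = ET.fromstring(xml_text)
    entries = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS).strip()
        if loc:
            entries[loc] = lastmod
    return entries


def diff_sitemap(previous: dict, current: dict) -> list:
    """Return locs that are new or whose <lastmod> differs from the stored value."""
    return sorted(loc for loc, lastmod in current.items() if previous.get(loc) != lastmod)


# Illustrative sample; the second package ID is invented for this sketch.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.govinfo.gov/app/details/CHRG-118hhrg50898</loc><lastmod>2023-06-01T00:00:00Z</lastmod></url>
  <url><loc>https://www.govinfo.gov/app/details/CHRG-118shrg00001</loc><lastmod>2023-07-15T00:00:00Z</lastmod></url>
</urlset>"""
```

Comparing <lastmod> as an opaque string avoids timestamp parsing entirely; any change in the value, in either direction, triggers a refetch.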
Vocabulary / enums
The MODS field catalog (extension/genre is always government publication; extension/branch is always legislative; the upstream-defined granuleClass / docClass value lists) is documented in GovInfo's MODS help. The Josh consumption notes for the fields we actually join on live under MODS structure. Observed values worth noting: granuleClass is one of OTHERPART / MAIN / WITNESS / MEMBER (multi-part hearings are split into granules), and docClass is HHRG (House), SHRG (Senate), or JHRG (Joint).
Transcript source layer (our enum, not from upstream)
We classify each transcript record by provenance:
| Value | Meaning |
|---|---|
| chrg_official | GovInfo CHRG package: formally published, edited |
| committee_preliminary | PDF transcript posted by committee before CHRG publication |
| asr_video | Auto-transcribed from video (low fidelity) |
| asr_audio | Auto-transcribed from audio-only feed |
The agent surfaces this so a user can tell whether a quote is citation-grade.
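One way the enum could be carried into agent output is sketched below. The four values come from the table above; the TranscriptSource class, the citation_note helper, the choice to treat only chrg_official as citation-grade, and the wording of the preliminary caveat are all assumptions of this sketch.

```python
from enum import Enum


class TranscriptSource(str, Enum):
    CHRG_OFFICIAL = "chrg_official"
    COMMITTEE_PRELIMINARY = "committee_preliminary"
    ASR_VIDEO = "asr_video"
    ASR_AUDIO = "asr_audio"


# Assumption: only the GPO-published text counts as citation-grade.
CITATION_GRADE = {TranscriptSource.CHRG_OFFICIAL}


def citation_note(source: str) -> str:
    """Return a short caveat to show next to a quote, or '' when none is needed."""
    src = TranscriptSource(source)  # raises ValueError on an unknown value
    if src in CITATION_GRADE:
        return ""
    if src is TranscriptSource.COMMITTEE_PRELIMINARY:
        # Hypothetical wording for the preliminary case.
        return "(preliminary committee transcript, verify against the printed record)"
    return "(auto-transcribed, verify)"
```

Keeping the mapping in one function means the caveat text can change without touching retrieval code.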
Stable ID format
The transcript record is the body content of an existing hearing record, so its ID derives from the parent hearings source.
For ASR-derived transcripts where no CHRG package exists yet, we mint a synthetic ID. This rolls into a CHRG-keyed record once the official volume is published — at which point we link the ASR transcript as a prior version (preserving it) and prefer the CHRG body for the agent's reads.
CHRG HTML structure
The HTML body is <html><body><pre>...all text...</pre></body></html>. Inside the <pre>, the structure is loose but learnable:
- Cover page header. "House Hearing, 118 Congress" / "Senate Hearing, 117 Congress" / "Joint Hearing", title (often all-caps with line breaks, multi-line), "BEFORE THE COMMITTEE ON ...", chamber, congress, session, date.
- Cover page metadata. "Serial No. 118-3", U.S. GPO publication line, jacket info (50-898 PDF).
- Committee membership listing. Two-column ASCII layout: majority members left, minority right.
- Table of contents (optional). Witness list with page numbers.
- Body: the proceedings. Speaker turns prefixed with "Mr. NAME." or "Chairman NAME." or "The CHAIRMAN." or "Mr. RASKIN." etc., followed by paragraphs. Italicized stage directions in [brackets] (e.g. [Whereupon, at 4:12 p.m., the Committee was adjourned.], [GRAPHIC NOT AVAILABLE IN TIFF FORMAT]).
- Appendices. Witness submitted statements, attachments, member statements for the record. Each labeled with the witness/member name and a header rule (-----).
- Closing marker. [all] or [Whereupon, ...adjourned.].
Speaker turn parsing. A robust regex for speaker turns:
```
^(Mr\.|Mrs\.|Ms\.|Dr\.|Senator|Chair(?:man|woman)|Vice Chair|Ranking Member|Representative|The CHAIRMAN|The CHAIR)(?:\s+([A-Z][A-Z'\.\- ]+?))?\.\s
```
The name group is optional so that title-only turns such as "The CHAIRMAN." match; with a mandatory name group they never would, because the title is followed directly by a period. The speaker name is mostly all-caps in the body but appears in mixed case in the witness/member listings. Resolution to bioguide ID requires fuzzy matching against legislators (last name + state hint from the committee membership listing).
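A minimal sketch of how the speaker-turn pattern might be applied to split a body into turns. It makes two adjustments that are assumptions of this sketch rather than part of the documented pattern: leading indentation is allowed, and the name group is optional so that title-only turns like "The CHAIRMAN." are captured. split_turns and the sample text are hypothetical.

```python
import re

SPEAKER_RE = re.compile(
    r"^[ \t]*(Mr\.|Mrs\.|Ms\.|Dr\.|Senator|Chair(?:man|woman)|Vice Chair|Ranking Member|"
    r"Representative|The CHAIRMAN|The CHAIR)"
    r"(?:\s+([A-Z][A-Z'\.\- ]+?))?\.\s",  # optional all-caps name, then ". "
    re.MULTILINE,
)


def split_turns(body: str) -> list:
    """Split transcript body text into ordered speaker turns."""
    matches = list(SPEAKER_RE.finditer(body))
    turns = []
    for i, m in enumerate(matches):
        # A turn runs until the next speaker prefix (or the end of the body).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(body)
        title, name = m.group(1), m.group(2)
        speaker_raw = f"{title} {name}" if name else title
        text = " ".join(body[m.end():end].split())  # collapse hard line wraps
        turns.append({"sequence": i, "speaker_raw": speaker_raw, "text": text})
    return turns


# Invented sample in the style of a CHRG body.
SAMPLE = (
    "    The CHAIRMAN. The Committee will come to order.\n"
    "    Mr. RASKIN. Thank you, Mr. Chairman. I appreciate\n"
    "the witnesses being here today.\n"
    "    Mr. BAKER. Thank you for the opportunity.\n"
)
```

Because the pattern is anchored to the start of a line, an in-sentence "Mr. Chairman." does not open a new turn. A wrapped line that happens to begin with an all-caps name would still be a false positive, so real fixtures are needed before trusting it.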
Witness vs member discrimination. Members appear in the committee listing on the cover page. Anyone speaking who isn't listed there is a witness. Cross-check with the meeting_witnesses table from hearings for resolved witness names.
Page breaks. [[Page N]] markers indicate the printed page number. Useful for citing back to the printed serial.
Unparseable junk. Long ASCII tables, bracketed graphics callouts ([GRAPHIC NOT AVAILABLE IN TIFF FORMAT]), bracketed audio annotations ([laughter], [applause], [off-mic comment]). Filter at parse time but preserve in body_text_raw.
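The page-marker and junk handling could look roughly like this. It is a sketch under assumptions: clean_body is a hypothetical name, the bracketed-annotation list contains only the examples mentioned above (real transcripts will have more), page numbers are assumed to be arabic digits, and ASCII-table detection is left out entirely.

```python
import re

PAGE_RE = re.compile(r"\[\[Page (\d+)\]\]")
# Only the annotations named in the text above; assumed to be an incomplete list.
JUNK_RE = re.compile(
    r"\[(?:GRAPHIC NOT AVAILABLE IN TIFF FORMAT|laughter|applause|off-mic comment)\]",
    re.IGNORECASE,
)


def clean_body(raw: str):
    """Return (cleaned text, printed page numbers seen). The raw input is not modified."""
    pages = [int(n) for n in PAGE_RE.findall(raw)]
    text = PAGE_RE.sub("", raw)
    text = JUNK_RE.sub("", text)
    # Tidy the whitespace left behind by removed markers.
    text = re.sub(r"[ \t]{2,}", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip(), pages
```

The caller would store the untouched input in body_text_raw and the first return value in body_text, so nothing filtered here is lost.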
MODS structure (high-value subset)
GovInfo's full MODS field catalog is documented in their MODS help. The few elements Josh actually joins on, trimmed to their join annotations:
```xml
<extension>
  <heldDate>2023-02-08</heldDate> <!-- can repeat for multi-day -->
  <docClass>HHRG</docClass>
  <witness>Mr. James Baker, Former Twitter Deputy General Counsel; Former FBI General Counsel</witness>
  <witness>Ms. Vijaya Gadde, Former Twitter Chief Legal Officer</witness>
  <congSerial chamber="H" congress="118" number="3"/>
  <eventId>115286</eventId> <!-- joins to committee_meetings -->
</extension>
```
The eventId link is the same number as committee_meetings.event_id, so it is an exact join.
<witness> strings are richer than the API's witness list (full credentials inline). Useful for entity extraction.
<congSerial chamber="..." congress="..." number="..."/> is the chamber serial (Serial No. 118-3). House and Senate publish independent serial sequences.
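A small sketch of pulling the join fields out of that extension block. It matches on local element names so it should behave the same whether or not the real file declares the MODS namespace (the trimmed sample here does not). parse_mods_extension and the output keys are hypothetical.

```python
import xml.etree.ElementTree as ET


def local(tag: str) -> str:
    """Strip a namespace prefix such as '{http://...}heldDate' down to 'heldDate'."""
    return tag.rsplit("}", 1)[-1]


def parse_mods_extension(xml_text: str) -> dict:
    root = ET.fromstring(xml_text)
    out = {"held_dates": [], "witnesses": [], "doc_class": None, "event_id": None, "serial": None}
    for el in root.iter():
        name = local(el.tag)
        text = (el.text or "").strip()
        if name == "heldDate" and text:
            out["held_dates"].append(text)  # may repeat for multi-day hearings
        elif name == "witness" and text:
            out["witnesses"].append(text)
        elif name == "docClass" and text:
            out["doc_class"] = text
        elif name == "eventId" and text:
            out["event_id"] = text  # None when an older package lacks it
        elif name == "congSerial":
            out["serial"] = (el.get("chamber"), el.get("congress"), el.get("number"))
    return out


SAMPLE = """<extension>
  <heldDate>2023-02-08</heldDate>
  <docClass>HHRG</docClass>
  <witness>Mr. James Baker, Former Twitter Deputy General Counsel; Former FBI General Counsel</witness>
  <witness>Ms. Vijaya Gadde, Former Twitter Chief Legal Officer</witness>
  <congSerial chamber="H" congress="118" number="3"/>
  <eventId>115286</eventId>
</extension>"""
```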
ASR pipeline shape
For meetings without a published transcript, the auto-transcription pipeline:
- Discovery. Daily, query for committee meetings where:
  - meeting_status = 'Held'
  - meeting_date <= now() - interval '7 days' (give committee + GPO time to publish official text)
  - No row in hearing_transcripts with this meeting_id yet
  - The meeting_videos table has at least one video URL
- Video acquisition. Prefer YouTube (most reliable for download via yt-dlp) > congress.gov direct > C-SPAN (last-resort due to license). Save .mp3 (audio-only, ~50 MB per hour) to /data/corpus/hearings/audio/{eventId}/.
- Transcription. Whisper-large-v3 self-hosted on GPU; or distil-whisper-v3 if cost-constrained. Output .json with word-level timestamps + per-segment text. Save to /data/corpus/hearings/asr/{eventId}/.
- Speaker diarization (optional v1.x). pyannote.audio's diarization model + speaker-embedding clustering. Match clusters to known speakers via voice prints if we have them; otherwise label as Speaker_1, Speaker_2. Skip in v1: un-diarized text is searchable; the agent can ask follow-up questions.
- Speaker labeling. Cross-reference timestamps with the meeting's meeting_witnesses and committee roster from committees. For now, a heuristic: assume the committee chair speaks first and last; assume opening statements alternate majority → minority leadership → ranking member → other members. Confidence scores stored alongside.
- Load. Insert as hearing_transcripts row with source = 'asr_video', body_text populated, no package_id. When the official CHRG package eventually appears, the loader detects it and adds a second row (doesn't replace) so we keep the ASR for audit.
- Re-rank citations. When CHRG arrives, the agent's retrieval re-ranks: official > preliminary > ASR.
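The final re-rank step amounts to a fixed preference order over the source values. A minimal sketch, assuming rows arrive as dicts carrying the source and is_superseded columns; pick_preferred is a hypothetical name, and ranking asr_video ahead of asr_audio is an arbitrary tie-break chosen for this example.

```python
from typing import Optional

# Lower rank = preferred. Order follows "official > preliminary > ASR".
SOURCE_RANK = {
    "chrg_official": 0,
    "committee_preliminary": 1,
    "asr_video": 2,  # video-before-audio is an assumption of this sketch
    "asr_audio": 3,
}


def pick_preferred(rows: list) -> Optional[dict]:
    """Pick the transcript row the agent should read for one meeting."""
    if not rows:
        return None
    # Non-superseded rows sort first (False < True); then the higher-fidelity source wins.
    return min(rows, key=lambda r: (r["is_superseded"], SOURCE_RANK[r["source"]]))
```

Superseded rows are still eligible as a last resort, which matches the keep-for-audit policy: nothing is deleted, it just sorts last.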
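Cost estimate. ~3,500 meetings/year × ~3 hours each × ~$0.10 per audio-hour = ~$1,050/year, call it ~$1,000/year of amortized compute. The per-hour figure assumes local Whisper on an amortized H100 at $1.50/hr, transcribing an hour of audio in roughly 0.1 hours of wall time. Taken literally, those inputs give $0.15 per audio-hour (~$1,600/year), so treat $1,000 as the low end. Acceptable either way.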
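The arithmetic behind the estimate, written out so the inputs can be varied. The function names are hypothetical; the numbers are the ones quoted in the estimate. Note that the GPU price and speed assumptions multiply out to about $0.15 per audio-hour rather than the $0.10 used in the headline figure, so both cases are computed.

```python
def usd_per_audio_hour(gpu_usd_per_hour: float, wall_hours_per_audio_hour: float) -> float:
    """Compute cost to transcribe one hour of audio on a rented or amortized GPU."""
    return gpu_usd_per_hour * wall_hours_per_audio_hour


def annual_asr_cost(meetings_per_year: int, hours_per_meeting: float, rate: float) -> float:
    """Yearly ASR compute cost in USD."""
    return meetings_per_year * hours_per_meeting * rate


# Headline figure: 3,500 meetings x 3 hours x $0.10 per audio-hour (about $1,050).
headline = annual_asr_cost(3500, 3, 0.10)

# Same volume with the rate derived from $1.50/hr GPU time at 0.1x wall time (about $1,575).
derived = annual_asr_cost(3500, 3, usd_per_audio_hour(1.50, 0.1))
```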
Quality estimate. Whisper-large-v3 on clean committee audio: ~5-10% WER. Names are the worst: congressional hearings have many proper nouns the model hasn't seen often. A post-processing pass that aligns speaker labels with member names mitigates this.
Volume
| Source layer | Records (estimate) | Storage |
|---|---|---|
| CHRG official | ~35K records (1995-present), ~600-1300 added per year | ~10 GB HTML + ~30 GB PDF + ~1-3 GB extracted plaintext |
| Committee preliminary PDFs | ~500-2000 per year, but transient (replaced by CHRG eventually) | <1 GB |
| ASR backfill (history) | ~30-50K meetings without transcript, 1-3 hrs each | Audio: ~5 TB (we won't store, transient). Text: ~5-10 GB |
| ASR ongoing (per year) | ~3,500 meetings/year | <1 GB text/year |
Storage strategy: keep extracted plaintext + raw HTML for CHRG; keep extracted ASR text only (drop audio after transcription). Total Postgres footprint: ~15-20 GB. Body text searched via GIN index on body_tsv; chunks for embedding via hearing_transcript_chunks. (Chunker strategy is TBD — no chunker is named or evaluated for this source yet; only the hearing_transcript_chunks table and chunk→turn anchoring exist.)
Caching / incremental sync
- Per-year sitemap diff (weekly). Fetch the current year's CHRG_{YYYY}_sitemap.xml and the previous year's CHRG_{YYYY-1}_sitemap.xml (recent year still has churn). Diff <lastmod> per URL; refetch where changed.
- MODS recordChangeDate as the per-package change watermark. Skip re-parse if recordChangeDate <= last_parsed_at AND <lastmod> unchanged.
- Daily ASR queue scan. Pull meetings >7 days post-gavel with no transcript yet, up to a per-day cap (CPU/GPU bound).
- Re-pass on new CHRG arrivals. When a new CHRG package matches an existing eventId we already have ASR for, mark the ASR row as superseded (don't delete) and load the official text.
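The re-parse watermark rule can be expressed as a small predicate. This is a sketch: needs_reparse is a hypothetical name, and treating a never-parsed package (no last_parsed_at) as always needing a parse is an assumption added here.

```python
from datetime import datetime, timezone
from typing import Optional


def needs_reparse(
    record_change_date: datetime,
    last_parsed_at: Optional[datetime],
    lastmod_changed: bool,
) -> bool:
    """Skip only when the MODS watermark is not newer than our parse AND <lastmod> is unchanged."""
    if last_parsed_at is None:
        return True  # never parsed (assumption: always parse in that case)
    can_skip = record_change_date <= last_parsed_at and not lastmod_changed
    return not can_skip


# Example timestamps, timezone-aware so comparisons are unambiguous.
PARSED = datetime(2024, 3, 1, tzinfo=timezone.utc)
OLDER_CHANGE = datetime(2024, 2, 1, tzinfo=timezone.utc)
NEWER_CHANGE = datetime(2024, 4, 1, tzinfo=timezone.utc)
```

Requiring both conditions before skipping means either signal alone is enough to trigger a re-parse, which errs on the side of freshness.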
Schema (Postgres DDL)
```sql
-- ============================================================
-- Hearing transcripts (the body content)
-- ============================================================
CREATE TABLE hearing_transcripts (
  id text PRIMARY KEY,  -- 'transcript:CHRG-118hhrg50898' or 'transcript:asr:119-h-119082'

  -- Linkage
  hearing_id text REFERENCES hearings(id) ON DELETE CASCADE,             -- when CHRG-published
  meeting_id text REFERENCES committee_meetings(id) ON DELETE SET NULL,  -- when ASR/preliminary

  -- Provenance
  source text NOT NULL CHECK (source IN (
    'chrg_official', 'committee_preliminary', 'asr_video', 'asr_audio'
  )),
  is_superseded boolean NOT NULL DEFAULT false,  -- true when an authoritative version exists

  -- Source URLs (per-rendering)
  govinfo_html_url text,
  govinfo_pdf_url text,
  committee_pdf_url text,
  source_video_url text,

  -- Text content
  body_text text NOT NULL,  -- cleaned, structured plaintext
  body_text_raw text,       -- before junk filtering; preserves brackets, page markers, ASCII tables

  -- ASR-specific metadata
  asr_engine text,                  -- 'whisper-large-v3', 'distil-whisper-v3', etc.
  asr_engine_version text,
  asr_word_count int,
  asr_avg_confidence numeric(4,3),  -- 0.000-1.000
  asr_word_timestamps_url text,     -- path to /data/corpus/hearings/asr/{id}.json with word-level timing

  -- Lifecycle
  fetched_at timestamptz NOT NULL,
  parsed_at timestamptz,
  inserted_at timestamptz NOT NULL DEFAULT now(),
  updated_at timestamptz NOT NULL DEFAULT now()
);

-- One transcript per (hearing, source) pair; multiple sources allowed for same hearing.
-- Postgres does not accept a WHERE clause on a table-level UNIQUE constraint,
-- so the partial uniqueness rules are written as partial unique indexes.
CREATE UNIQUE INDEX hearing_transcripts_hearing_source_uniq
  ON hearing_transcripts (hearing_id, source) WHERE hearing_id IS NOT NULL;
CREATE UNIQUE INDEX hearing_transcripts_meeting_source_uniq
  ON hearing_transcripts (meeting_id, source) WHERE meeting_id IS NOT NULL AND hearing_id IS NULL;

CREATE INDEX hearing_transcripts_hearing ON hearing_transcripts (hearing_id) WHERE hearing_id IS NOT NULL;
CREATE INDEX hearing_transcripts_meeting ON hearing_transcripts (meeting_id) WHERE meeting_id IS NOT NULL;
CREATE INDEX hearing_transcripts_source ON hearing_transcripts (source);
CREATE INDEX hearing_transcripts_active ON hearing_transcripts (hearing_id, source) WHERE NOT is_superseded;

-- Search
ALTER TABLE hearing_transcripts ADD COLUMN body_tsv tsvector
  GENERATED ALWAYS AS (to_tsvector('english', coalesce(body_text, ''))) STORED;
CREATE INDEX hearing_transcripts_body_search ON hearing_transcripts USING gin (body_tsv);

-- ============================================================
-- Speaker turns (parsed from body_text)
-- ============================================================
CREATE TABLE hearing_transcript_turns (
  id bigserial PRIMARY KEY,
  transcript_id text NOT NULL REFERENCES hearing_transcripts(id) ON DELETE CASCADE,
  sequence int NOT NULL,           -- order within transcript
  speaker_raw text NOT NULL,       -- 'Mr. RASKIN', 'The CHAIRMAN', 'Mr. BAKER'
  speaker_role text,               -- 'member', 'witness', 'chairman', 'unidentified'
  speaker_bioguide text REFERENCES legislators(bioguide_id),   -- when resolvable
  speaker_witness_id bigint REFERENCES meeting_witnesses(id),  -- when resolvable
  text text NOT NULL,
  -- ASR-specific
  asr_start_seconds numeric(10,3),
  asr_end_seconds numeric(10,3),
  UNIQUE (transcript_id, sequence)
);

CREATE INDEX hearing_transcript_turns_transcript ON hearing_transcript_turns (transcript_id, sequence);
CREATE INDEX hearing_transcript_turns_member ON hearing_transcript_turns (speaker_bioguide) WHERE speaker_bioguide IS NOT NULL;

-- Search by member's words
ALTER TABLE hearing_transcript_turns ADD COLUMN search_tsv tsvector
  GENERATED ALWAYS AS (to_tsvector('english', coalesce(text, ''))) STORED;
CREATE INDEX hearing_transcript_turns_search ON hearing_transcript_turns USING gin (search_tsv);

-- ============================================================
-- Embedding chunks
-- ============================================================
CREATE TABLE hearing_transcript_chunks (
  id bigserial PRIMARY KEY,
  transcript_id text NOT NULL REFERENCES hearing_transcripts(id) ON DELETE CASCADE,
  chunk_index int NOT NULL,
  chunk_text text NOT NULL,
  -- Anchor back to source for citation
  starting_turn_id bigint REFERENCES hearing_transcript_turns(id),
  embedding vector(1024),
  UNIQUE (transcript_id, chunk_index)
);
CREATE INDEX hearing_transcript_chunks_embedding ON hearing_transcript_chunks USING hnsw (embedding vector_cosine_ops);
```
Schema decisions worth flagging:
- Multiple transcripts per hearing/meeting allowed: the unique constraint is (hearing_id, source) so a single hearing can have a chrg_official row plus an asr_video row plus a committee_preliminary row. is_superseded=true flags the lower-fidelity rows once the official lands; we don't delete.
- hearing_id and meeting_id both nullable, but at least one must be set. ASR transcripts may have meeting_id only (no CHRG package yet); CHRG transcripts may have hearing_id only (the meeting_id resolution depends on the MODS <eventId> cross-reference being populated, which it usually is but not always for older hearings). A check constraint CHECK (hearing_id IS NOT NULL OR meeting_id IS NOT NULL) is enforceable; left out for brevity.
- hearing_transcript_turns separate table: speaker-turn-level retrieval is the dominant access pattern. The agent searches "what did Schumer say about chips" and wants turns, not whole transcripts.
- speaker_bioguide and speaker_witness_id both nullable: resolution to a canonical person is best-effort. A single transcript may have both members and witnesses; a single turn maps to at most one.
- Word-level ASR timing kept off-row: asr_word_timestamps_url points to a JSON file in /data/corpus/hearings/asr/, not stored in Postgres (1-3 MB per hour is too much for inline). The per-turn asr_start_seconds / asr_end_seconds is the in-row anchor.
- hearing_transcript_chunks.starting_turn_id anchors chunks back to turns so the agent can quote a chunk and link to the source turn for citation.
Download / update strategy
Backfill
CHRG (1995-present):
- For each year 1995..current:
  - Fetch CHRG_{YYYY}_sitemap.xml. ~600-1300 URLs each.
  - For each <loc>, derive packageId (suffix after /app/details/).
  - Fetch HTML rendering: /content/pkg/CHRG-{packageId}/html/CHRG-{packageId}.htm. Save to /data/corpus/hearings/transcripts/{year}/{packageId}.htm.
  - Fetch MODS XML. Save to /data/corpus/hearings/transcripts/{year}/{packageId}.mods.xml.
  - Parse HTML → speaker turns. Parse MODS → witness names + heldDate + eventId.
  - Insert hearing_transcripts + hearing_transcript_turns.
  - Embed chunks.
- ~35K packages × 2 fetches each = ~70K calls. With 4 concurrent workers @ 1 req/s: ~5 hours.
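The ID derivation and the backfill time estimate can be sketched as below. Several things here are assumptions: the helper names are hypothetical, the suffix after /app/details/ is taken to already include the CHRG- prefix (so the path template is built from it directly, without prepending CHRG- a second time), and the MODS URL pattern is assumed rather than taken from this page.

```python
from urllib.parse import urlparse

GOVINFO = "https://www.govinfo.gov"


def package_id_from_loc(loc: str) -> str:
    """Take the suffix after /app/details/ from a sitemap <loc>."""
    path = urlparse(loc).path
    marker = "/app/details/"
    if marker not in path:
        raise ValueError(f"unexpected sitemap loc: {loc}")
    return path.split(marker, 1)[1].strip("/")


def rendition_urls(package_id: str) -> dict:
    # Assumption: package_id already carries the CHRG- prefix, e.g. CHRG-118hhrg50898.
    return {
        "html": f"{GOVINFO}/content/pkg/{package_id}/html/{package_id}.htm",
        "mods": f"{GOVINFO}/metadata/pkg/{package_id}/mods.xml",  # assumed MODS path
    }


def backfill_hours(packages: int, fetches_each: int, workers: int, req_per_sec: float) -> float:
    """Wall-clock hours if every worker sustains req_per_sec requests per second."""
    total_calls = packages * fetches_each
    return total_calls / (workers * req_per_sec) / 3600
```

With the figures above (35,000 packages, 2 fetches each, 4 workers at 1 req/s each), backfill_hours comes out a little under 5 hours, consistent with the estimate. If IDs are stored without the CHRG- prefix, rendition_urls needs the prefix added back.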
Committee preliminary PDFs:
- From meeting_documents where document_type = 'Hearing: Transcript', fetch each PDF, extract text via pdfplumber or pdftotext, parse turns. Insert with source='committee_preliminary'. ~1-2K total currently.
ASR backfill — defer. Don't backfill ASR over historical hearings without published transcripts. Cost-efficient strategy: ASR only for new meetings that haven't yet been published, plus on-demand if an agent specifically asks about a hearing without text.
Daily incremental
- Weekly: CHRG sitemap diff. For current year + previous year (publication lag). For changed <lastmod>, refetch + re-parse.
- Daily: scan committee_meetings. Find held meetings >7 days old without hearing_transcripts rows, with at least one meeting_videos row. Queue for ASR.
- Daily: ASR queue worker. Pull from queue, fetch video → audio, transcribe, parse, insert.
- On CHRG arrival: supersede ASR. When sitemap diff detects a new CHRG for a meeting that has ASR, set is_superseded=true on the ASR row, insert the new official row.
State tracking
Source keys: hearing_transcripts_chrg, hearing_transcripts_committee, hearing_transcripts_asr. Each tracked separately. ASR runs have a task row per meeting with status (queued, downloading, transcribing, parsed, error).
Failure modes
- CHRG HTML parse failure (unusual layout). Rare in 2010+ packages, common for pre-2005. Fall back to PDF text extraction; mark parsed_at = NULL; flag for manual review.
- MODS missing <eventId>. Older packages don't have this. Fall back to fuzzy match: (committee_systemCode, heldDate) → committee_meetings. If no match, leave meeting_id = NULL.
- Multi-day hearing → multiple meetings → one CHRG. The MODS has multiple <heldDate> elements. Insert hearing_dates rows for each (already in hearings schema), and link the single transcript to the earliest matching meeting. Document this in the hearings ↔ hearing-transcripts join semantics.
- ASR transcribed wrong language / very poor audio. Check asr_avg_confidence. Below 0.5 → retry once. Below 0.5 again → flag for human review; insert with is_superseded=false but a confidence note in the agent's surface.
- YouTube takedown of committee video before we transcribe. Some committees pull old videos. Fall back to congress.gov video page. If both gone, we miss this meeting; log and move on.
- Disk pressure: ASR audio at 50 MB/hour × thousands of meetings won't fit on /data if we keep everything. Strict policy: delete audio once transcription succeeds and is verified.
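The low-confidence ASR policy from the list above (retry once below 0.5, then flag for review) can be sketched like this. transcribe_with_retry is a hypothetical wrapper; the transcribe callable stands in for whatever engine call is eventually chosen and is assumed to return the text together with an average confidence.

```python
from typing import Callable, Tuple

CONFIDENCE_FLOOR = 0.5  # threshold from the failure-mode note
MAX_ATTEMPTS = 2        # first run plus one retry


def transcribe_with_retry(transcribe: Callable[[], Tuple[str, float]]) -> dict:
    """Run transcribe up to twice; flag for human review if confidence stays below the floor."""
    text, confidence = "", 0.0
    for attempt in range(1, MAX_ATTEMPTS + 1):
        text, confidence = transcribe()
        if confidence >= CONFIDENCE_FLOOR:
            return {"text": text, "confidence": confidence,
                    "attempts": attempt, "needs_review": False}
    # Still below the floor after the retry: keep the last result, but flag it.
    return {"text": text, "confidence": confidence,
            "attempts": MAX_ATTEMPTS, "needs_review": True}
```

The flagged result is still returned rather than discarded, mirroring the policy of inserting the row with a confidence note instead of dropping it.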
Open questions
These don't block ingestion but should be resolved before this source is "shipped":
- ASR engine choice. Whisper-large-v3 (best quality, ~$0.10-0.30/hr local on H100; OpenAI/Groq API ~$0.36/hr) vs distil-whisper-v3 (3-6× faster, ~95% of accuracy) vs API services (Replicate, Groq) for quality + cost. Benchmark on a sample of real hearing audio before committing.
- Speaker diarization in v1? A diarized transcript with speaker labels is dramatically more useful than an undifferentiated wall of text. pyannote.audio is mature; runs on the same GPU. Adds ~30% to the per-hearing time. Lean toward yes for v1 if compute allows.
- Voice-print library for senators / cabinet officials. With a small library of known speakers' voice prints, we can label diarized clusters by name with high confidence. Building the library is an annotation task; defer to v1.x.
- Member-name resolution from committee membership listing. Most CHRG cover pages have a two-column ASCII listing of committee members. Parsing it gives a ground-truth list of speakers we should expect, plus a state hint for disambiguating last names. Worth a careful parser.
- Pre-2010 CHRG layout drift. Older publications have less consistent formatting. Plan a parser test suite of fixtures pulled from each decade.
- Witness statements vs spoken testimony. Witness statements (their pre-prepared remarks, available in meeting_documents.witness_documents) differ from what they actually said in the hearing. Both are valuable but distinct. Keep them separate: the meeting_documents table already holds witness statement PDFs.
- Committee preliminary transcripts identification. meeting_documents.document_type = 'Hearing: Transcript' is one signal but not the only one. Some committees use 'Member Statements' for full transcripts. Build a small classifier: PDF page count + title keywords.
- ASR cost ceiling. ~$1,000/year compute is a back-of-envelope. If the agent gets used heavily and we ASR-on-demand for arbitrary historical meetings, costs balloon. Set a per-month budget.
- Citation-grade flag in agent output. Plumbing the source enum into the agent's response so users see "(auto-transcribed, verify)" alongside ASR-derived quotes is a UX detail worth nailing early.
- License compliance for C-SPAN. Even ASR-derived text from C-SPAN broadcasts inherits some restriction. Default to YouTube and congress.gov for video sourcing; only fall back to C-SPAN with a clear research-use justification.