sourceplannedp0

Hearing transcripts ingester

hearing-transcripts-ingester · updated 2026-05-10T18:00:00Z

Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.

The text of what was actually said in hearings — every speaker turn,
every Q&A round, every witness statement read into the record. This is
the highest-value-per-MB corpus in Josh: it lets the agent answer
"what did Jensen Huang testify about export controls" or "summarize the
GAO IG's testimony in the May 7 oversight hearing." Distinct from
hearings-ingester (the meeting calendar metadata) — this source
carries the body content of what was said.

Coverage gap acknowledged: officially-printed CHRG volumes cover only a
fraction of all meetings; the layered strategy is "official text where
available, ASR-from-video where not, with explicit gap tracking."

As a policy analyst using my own AI agent, I want to ask "what did witness X testify on topic Y" and get back the speaker turns with the printed-record citation so that I can quote testimony exactly, with confidence in the source.

As an OSS self-hoster, I want to run `josh ingest hearing-transcripts` and populate the GovInfo CHRG corpus on my own instance so that I have searchable transcript text without operating an ASR pipeline.

As a downstream agent doing FTS5 + vector retrieval, I want hearing transcripts chunked, embedded, and citation-stamped so that hybrid search returns transcript chunks alongside other source chunks.

  1. When `josh ingest hearing-transcripts --since 2026-01-01` is run on the deployed substrate, the system shall populate `hearing_transcripts` rows whose six citation columns are all non-null.
  2. When a CHRG transcript HTML body is fetched, the system shall write the raw payload under `/data/corpus/hearing-transcripts/bodies/raw/` and a normalized Markdown body (with speaker-turn structure preserved) under `/data/corpus/hearing-transcripts/bodies/markdown/`.
  3. While the ingester is running, if a single transcript fails to parse, then the system shall log the failure to `ingestion_logs` (with the offending `packageId`) and continue with the next transcript.
  4. Where a transcript declares one or more witnesses (from MODS), the system shall populate the `transcript_witnesses` junction table for that transcript.
  5. Where a transcript is multi-part (multiple granules), the system shall populate the `transcript_granules` table with one row per granule preserving `(transcript_id, granule_sequence)` ordering.
  6. When an FTS5 query is executed against `hearing_transcripts_fts`, the system shall return BM25-ranked results.
  7. Where a transcript body exceeds ~5K tokens (almost all do), the system shall enqueue rows in `chunk_embedding_jobs` for chunk-level embedding into `hearing_transcript_chunks_vec0`.
  8. When an ASR-derived transcript is loaded, the system shall set `body_normalizer_version` to indicate ASR provenance (e.g. `asr_whisper_v1`) so downstream consumers can flag low-confidence text.
kindbash

Command

set -euo pipefail
ssh josh 'docker exec josh-ingester josh ingest hearing-transcripts --since 2026-04-01 --max 25'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
  SELECT COUNT(*) FROM hearing_transcripts
  WHERE published_at >= ''2026-04-01''
    AND source_url IS NOT NULL AND source_org IS NOT NULL
    AND retrieved_at IS NOT NULL AND raw_sha256 IS NOT NULL
    AND citation_string IS NOT NULL;"'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db \
  "SELECT COUNT(*) FROM hearing_transcripts_fts WHERE hearing_transcripts_fts MATCH '\''testimony'\'';"'

Expect

>= 25 fully-populated rows; FTS5 match returns >= 1.

Smoke run only. Full backfill (1995-present, ~35K CHRG packages, ~10 GB raw HTML) runs detached on the server per `add-a-new-source.html`. ASR layer (Layer 3) is gated separately — see clarifications.

  • Backfill scope: 1995-present full CHRG corpus (~35K packages) vs current Congress only for v1 launch. Lean: full — packages are small, ingest is sequential.
  • Layer 2 (committee-published preliminary transcripts via `meeting_documents`) — wire in v1 or defer? Pulls from `hearings-ingester` meeting_documents rows. Lean: yes for v1 — bridges the publication-lag gap.
  • Layer 3 (ASR fallback): in v1 or defer? ASR engine choice (Whisper-large-v3 self-hosted vs Groq Whisper API vs Replicate), source video selection (Congress.gov / YouTube / C-SPAN / House.gov), and quality gate. Lean: defer to v1.x — ship CHRG + Layer 2 first, prove value, then add ASR.
  • Per-speaker-turn granularity: index speaker turns as separate FTS5 rows (lets agent cite per-turn) or one FTS row per transcript (simpler)? Lean: per-transcript for v1 with speaker-name extracted into a separate column; per-turn is a v1.x improvement.
  • Pre-1995 paper transcripts (Hathitrust scans): out of scope for v1 (already documented), but reaffirm before kicking off backfill.
  • Multi-volume hearings: model as N rows in `hearing_transcripts` keyed by `(packageId, granuleId)`, or one parent row + N `transcript_granules` rows? Lean: parent + granules so a single 'find this hearing' query returns one logical record.
  • C-SPAN closed-caption scraping (terms-of-use restrictive; ASR layer covers same need with cleaner license).
  • Per-committee scraping for transcripts — heterogeneous, fragile, low-yield.
  • OCR of pre-1995 paper transcripts — v2.
  • Real-time webhooks / push subscriptions — surface-layer concern.

Source module follows the standard Source protocol. Per-source spec at
https://docs.usejosh.com/sources/hearing-transcripts/ (probe findings + indicative DDL).

Layered strategy (priority order):
- Layer 1 (canonical): GovInfo CHRG. Per-year sitemap
https://www.govinfo.gov/sitemap/CHRG_{YYYY}_sitemap.xml enumerates
every published-hearing package; per-package HTML at
/content/pkg/CHRG-{packageId}/html/CHRG-{packageId}.htm is <pre>-
wrapped plaintext (easy to parse); MODS at
/metadata/pkg/CHRG-{packageId}/mods.xml carries witness names,
jacket numbers, dates, committee codes.
- Layer 2 (best-effort): committee-published preliminary transcripts.
Pulled from hearings-ingester meeting_documents rows where
documentType: "Hearing: Transcript". Flagged with
body_normalizer_version='committee_preliminary_v1'.
- Layer 3 (ASR fallback, deferred): Whisper on video for hearings
with no published transcript. Gated per clarifications_needed.

Discover/fetch/parse/load shape (Layer 1):
- discover: yearly CHRG sitemap diff against
ingestion_source_state.last_seen_key='hearing-transcripts-chrg-{YYYY}'.
- fetch: per-package HTML + MODS to
/data/corpus/hearing-transcripts/bodies/raw/. SHA256 the HTML for
raw_sha256. Polite ~1-2 req/sec.
- parse: chrg_html_v1 extracts the <pre> content into
speaker-tagged Markdown (regex on "Mr. SMITH." style speaker prefixes
is the v1 heuristic; statistical or ML splitter is v1.x). MODS extracts
witnesses, jacket number, committee, held date, preferred citation.
- load: idempotent upsert on (packageId) natural key. Granule rows
inserted per multi-part hearing. FTS5 sync triggers fire on
insert/update/delete. Body chunks enqueued for embedding.

Migration shape: hearing_transcripts, transcript_witnesses,
transcript_granules, hearing_transcripts_fts virtual + 3 sync
triggers, hearing_transcript_chunks, hearing_transcript_chunks_vec0
(vec0 with float[1024] and bit[1024]).

Schedule: daily 03:30 UTC. CHRG publication lag is weeks-to-months, so
daily polling of the CHRG sitemap catches new packages with no urgency
pressure.

0 of 16 done.

  • t1 Alembic migration applied locally + production (hearing_transcripts + 2 junctions + FTS5 + vec0)
  • t2 Pydantic models in shared/josh_substrate/src/josh_substrate/models/hearing_transcripts.py
  • t3 Source module implementing the `Source` protocol
  • t4 Body normalization wired (chrg_html_v1 walker preserving speaker turns; raw + markdown paths; per-format converter unit-tested)
  • t5 Citation metadata: all 6 columns populated; citation_for() formatter + 3-fixture unit tests
  • t6 FTS5 virtual table + 3 sync triggers wired into migration
  • t7 Vector vec0 table wired (transcripts almost always exceed 5K tokens)
  • t8 Unit tests for chrg_html_v1 parser against captured fixtures (single-volume, multi-part, witness-statement-heavy)
  • t9 Integration test for the full discover→fetch→parse→load pipeline
  • t10 Smoke backfill (25 transcripts) clean per success_determiner
  • t11 Production backfill at agreed scope (per clarifications_needed) complete
  • t12 Daily cron registered and observed firing once
  • t13 No recurring errors in `ingestion_logs` after 24h
  • t14 Source doc updated; inventory rows in https://docs.usejosh.com/josh-data-sources/ + https://docs.usejosh.com/data-status/ flipped to ingestion-built
  • t15 12-sample manual HTML→Markdown validation gate (per add-a-new-source.html — stratified across single-witness, multi-witness, multi-part, recent vs 2000s vintage)
  • t16 Layer 2 (committee-preliminary) pull from hearings-ingester meeting_documents wired (or explicit decision to defer to v1.x per clarifications)

No history yet.

docs/spec/hearing-transcripts-ingester.html · generated by bin/build-spec.py