Hearing transcripts ingester

Header

Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.

Why

The text of what was actually said in hearings — every speaker turn,
every Q&A round, every witness statement read into the record. This is
the highest-value-per-MB corpus in Josh: it lets the agent answer
"what did Jensen Huang testify about export controls" or "summarize the
GAO IG's testimony in the May 7 oversight hearing." Distinct from
hearings-ingester (the meeting calendar metadata) — this source
carries the body content of what was said.

Coverage gap acknowledged: officially-printed CHRG volumes cover only a
fraction of all meetings; the layered strategy is "official text where
available, ASR-from-video where not, with explicit gap tracking."

User stories

As a policy analyst using my own AI agent, I want to ask "what did witness X testify on topic Y" and get back the speaker turns with the printed-record citation so that I can quote testimony exactly, with confidence in the source.

As an OSS self-hoster, I want to run `josh ingest hearing-transcripts` and populate the GovInfo CHRG corpus on my own instance so that I have searchable transcript text without operating an ASR pipeline.

As a downstream agent doing FTS5 + vector retrieval, I want hearing transcripts chunked, embedded, and citation-stamped so that hybrid search returns transcript chunks alongside other source chunks.

Acceptance criteria (EARS)

When `josh ingest hearing-transcripts --since 2026-01-01` is run on the deployed substrate, the system shall populate `hearing_transcripts` rows whose six citation columns are all non-null.
When a CHRG transcript HTML body is fetched, the system shall write the raw payload under `/data/corpus/hearing-transcripts/bodies/raw/` and a normalized Markdown body (with speaker-turn structure preserved) under `/data/corpus/hearing-transcripts/bodies/markdown/`.
While the ingester is running, if a single transcript fails to parse, then the system shall log the failure to `ingestion_logs` (with the offending `packageId`) and continue with the next transcript.
Where a transcript declares one or more witnesses (from MODS), the system shall populate the `transcript_witnesses` junction table for that transcript.
Where a transcript is multi-part (multiple granules), the system shall populate the `transcript_granules` table with one row per granule preserving `(transcript_id, granule_sequence)` ordering.
When an FTS5 query is executed against `hearing_transcripts_fts`, the system shall return BM25-ranked results.
Where a transcript body exceeds ~5K tokens (almost all do), the system shall enqueue rows in `chunk_embedding_jobs` for chunk-level embedding into `hearing_transcript_chunks_vec0`.
When an ASR-derived transcript is loaded, the system shall set `body_normalizer_version` to indicate ASR provenance (e.g. `asr_whisper_v1`) so downstream consumers can flag low-confidence text.

Success determiner

kindbash

Command

set -euo pipefail
ssh josh 'docker exec josh-ingester josh ingest hearing-transcripts --since 2026-04-01 --max 25'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
  SELECT COUNT(*) FROM hearing_transcripts
  WHERE published_at >= ''2026-04-01''
    AND source_url IS NOT NULL AND source_org IS NOT NULL
    AND retrieved_at IS NOT NULL AND raw_sha256 IS NOT NULL
    AND citation_string IS NOT NULL;"'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db \
  "SELECT COUNT(*) FROM hearing_transcripts_fts WHERE hearing_transcripts_fts MATCH '\''testimony'\'';"'

Expect

>= 25 fully-populated rows; FTS5 match returns >= 1.

Smoke run only. Full backfill (1995-present, ~35K CHRG packages, ~10 GB raw HTML) runs detached on the server per `add-a-new-source.html`. ASR layer (Layer 3) is gated separately — see clarifications.

Clarifications needed

Backfill scope: 1995-present full CHRG corpus (~35K packages) vs current Congress only for v1 launch. Lean: full — packages are small, ingest is sequential.
Layer 2 (committee-published preliminary transcripts via `meeting_documents`) — wire in v1 or defer? Pulls from `hearings-ingester` meeting_documents rows. Lean: yes for v1 — bridges the publication-lag gap.
Layer 3 (ASR fallback): in v1 or defer? ASR engine choice (Whisper-large-v3 self-hosted vs Groq Whisper API vs Replicate), source video selection (Congress.gov / YouTube / C-SPAN / House.gov), and quality gate. Lean: defer to v1.x — ship CHRG + Layer 2 first, prove value, then add ASR.
Per-speaker-turn granularity: index speaker turns as separate FTS5 rows (lets agent cite per-turn) or one FTS row per transcript (simpler)? Lean: per-transcript for v1 with speaker-name extracted into a separate column; per-turn is a v1.x improvement.
Pre-1995 paper transcripts (Hathitrust scans): out of scope for v1 (already documented), but reaffirm before kicking off backfill.
Multi-volume hearings: model as N rows in `hearing_transcripts` keyed by `(packageId, granuleId)`, or one parent row + N `transcript_granules` rows? Lean: parent + granules so a single 'find this hearing' query returns one logical record.

Out of scope

C-SPAN closed-caption scraping (terms-of-use restrictive; ASR layer covers same need with cleaner license).
Per-committee scraping for transcripts — heterogeneous, fragile, low-yield.
OCR of pre-1995 paper transcripts — v2.
Real-time webhooks / push subscriptions — surface-layer concern.

Dependencies

Plan

Source module follows the standard Source protocol. Per-source spec at
https://docs.usejosh.com/sources/hearing-transcripts/ (probe findings + indicative DDL).

Layered strategy (priority order):
- Layer 1 (canonical): GovInfo CHRG. Per-year sitemap
https://www.govinfo.gov/sitemap/CHRG_{YYYY}_sitemap.xml enumerates
every published-hearing package; per-package HTML at
/content/pkg/CHRG-{packageId}/html/CHRG-{packageId}.htm is <pre>-
wrapped plaintext (easy to parse); MODS at
/metadata/pkg/CHRG-{packageId}/mods.xml carries witness names,
jacket numbers, dates, committee codes.
- Layer 2 (best-effort): committee-published preliminary transcripts.
Pulled from hearings-ingester meeting_documents rows where
documentType: "Hearing: Transcript". Flagged with
body_normalizer_version='committee_preliminary_v1'.
- Layer 3 (ASR fallback, deferred): Whisper on video for hearings
with no published transcript. Gated per clarifications_needed.

Discover/fetch/parse/load shape (Layer 1):
- discover: yearly CHRG sitemap diff against
ingestion_source_state.last_seen_key='hearing-transcripts-chrg-{YYYY}'.
- fetch: per-package HTML + MODS to
/data/corpus/hearing-transcripts/bodies/raw/. SHA256 the HTML for
raw_sha256. Polite ~1-2 req/sec.
- parse: chrg_html_v1 extracts the <pre> content into
speaker-tagged Markdown (regex on "Mr. SMITH." style speaker prefixes
is the v1 heuristic; statistical or ML splitter is v1.x). MODS extracts
witnesses, jacket number, committee, held date, preferred citation.
- load: idempotent upsert on (packageId) natural key. Granule rows
inserted per multi-part hearing. FTS5 sync triggers fire on
insert/update/delete. Body chunks enqueued for embedding.

Migration shape: hearing_transcripts, transcript_witnesses,
transcript_granules, hearing_transcripts_fts virtual + 3 sync
triggers, hearing_transcript_chunks, hearing_transcript_chunks_vec0
(vec0 with float[1024] and bit[1024]).

Schedule: daily 03:30 UTC. CHRG publication lag is weeks-to-months, so
daily polling of the CHRG sitemap catches new packages with no urgency
pressure.

Tasks

0 of 16 done.

Changelog

No history yet.