Hearings and markup schedule ingester

Why

This source covers every committee meeting Congress holds (hearings, markups,
business meetings, oversight panels), both as future-scheduled events (the
calendar) and as past records (the meeting + witness metadata). It is
the source of truth for "what is committee X doing this week" and
"who testified at the Snowden hearing." It is distinct from hearing-transcripts,
which carries the *content* of what was said; this source carries the
*metadata* of when, where, who, and what bills were on the agenda.

User stories

As a policy analyst using my own AI agent, I want to ask the agent which hearings on my topics are scheduled this week so that I can plan attendance, witness coordination, and prep without scraping committee pages.

As an OSS self-hoster, I want to run `josh ingest hearings` on a fresh substrate so that I can populate the meeting calendar + witness list from public APIs on my own instance.

As a downstream FK consumer (hearing-transcripts, bills), I want to FK to meetings by `meeting:{congress}-{chamber}-{eventId}` and to printed transcripts by `hearing:{jacketNumber}` so that cross-source queries (e.g. "every hearing about HR 1, with the transcripts where available") resolve cleanly.
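
The two key formats in the last story can be captured in a pair of small helpers. This is a minimal sketch, not the shipped implementation: the function names and the lowercasing of `chamber` are assumptions made for illustration, and the sample identifiers in the usage note are made up.

```python
def meeting_key(congress: int, chamber: str, event_id: str) -> str:
    """Build a key of the form meeting:{congress}-{chamber}-{eventId}."""
    # Assumption: chamber is trimmed and lowercased so keys compare consistently.
    return f"meeting:{congress}-{chamber.strip().lower()}-{event_id}"


def hearing_key(jacket_number: str) -> str:
    """Build a key of the form hearing:{jacketNumber}."""
    return f"hearing:{jacket_number}"
```

For example, `meeting_key(118, "House", "115538")` yields `meeting:118-house-115538`, and `hearing_key("52366")` yields `hearing:52366`.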

Acceptance criteria (EARS)

When `josh ingest hearings --since 2026-01-01` is run on the deployed substrate, the system shall populate `committee_meetings` rows whose six citation columns are all non-null.
When the Congress.gov v3 `/hearing` endpoint returns a printed transcript record, the system shall populate a corresponding `hearing_volumes` row keyed by `hearing:{jacketNumber}` (the response carries no top-level `packageId`; the GovInfo CHRG package id, if needed, must be parsed out of the `formats[]` URLs) and link it to its `associatedMeeting` when one exists.
While the ingester is running, if a single meeting or hearing record fails to parse, then the system shall log the failure to `ingestion_logs` (with the offending `eventId` or `jacketNumber`) and continue with the next record.
Where a meeting declares one or more witnesses, the system shall populate the `meeting_witnesses` junction table with one row per witness; if a witness has supporting documents, they shall be linked via `meeting_witness_documents`.
Where a meeting declares one or more bills on its agenda, the system shall populate the `meeting_bills` junction table linking `committee_meetings.id` to `bills.id` (resolving the bill soft-FK when the bill is in the substrate; leaving `bill_id` NULL otherwise).
When an FTS5 query is executed against `committee_meetings_fts`, the system shall return BM25-ranked results over `(title, witnesses, agenda)`.
When an incremental run completes, the system shall update `ingestion_source_state.last_seen_key` to the highest `updateDate` observed so the next run only fetches changed records.
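
The failure-isolation and incremental-cursor criteria above could be sketched roughly as follows. Only the table names `ingestion_logs` and `ingestion_source_state` and the column `last_seen_key` come from this spec; the other columns, the `run_incremental` function, and the `parse` callback (standing in for the parse and load steps) are hypothetical.

```python
import sqlite3

# Hypothetical minimal schema; only the table names and last_seen_key are from the spec.
SCHEMA = """
CREATE TABLE IF NOT EXISTS ingestion_logs (
    source TEXT NOT NULL, record_id TEXT, error TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS ingestion_source_state (
    source TEXT PRIMARY KEY, last_seen_key TEXT
);
"""


def run_incremental(conn, source, records, parse):
    """Parse each record, log failures, advance the cursor; return the success count."""
    row = conn.execute(
        "SELECT last_seen_key FROM ingestion_source_state WHERE source = ?", (source,)
    ).fetchone()
    high_water = row[0] if row else None
    parsed_count = 0
    for raw in records:
        record_id = raw.get("eventId") or raw.get("jacketNumber")
        try:
            parsed = parse(raw)
            update_date = parsed["updateDate"]
        except Exception as exc:  # one bad record must not abort the run
            conn.execute(
                "INSERT INTO ingestion_logs (source, record_id, error) VALUES (?, ?, ?)",
                (source, None if record_id is None else str(record_id), repr(exc)),
            )
            continue
        parsed_count += 1
        # Assumption: updateDate values share one ISO 8601 format and timezone,
        # so plain string comparison orders them chronologically.
        if high_water is None or update_date > high_water:
            high_water = update_date
    conn.execute(
        "INSERT INTO ingestion_source_state (source, last_seen_key) VALUES (?, ?) "
        "ON CONFLICT(source) DO UPDATE SET last_seen_key = excluded.last_seen_key",
        (source, high_water),
    )
    conn.commit()
    return parsed_count
```

In this sketch the cursor advances only on successfully parsed records, and the upsert syntax assumes SQLite 3.24 or later.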

Success determiner

Kind: bash

Command

set -euo pipefail
ssh josh 'docker exec josh-ingester josh ingest hearings --since 2026-04-01 --max 50'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
  SELECT COUNT(*) FROM committee_meetings
  WHERE update_date >= '\''2026-04-01'\''
    AND source_url IS NOT NULL AND source_org IS NOT NULL
    AND retrieved_at IS NOT NULL AND raw_sha256 IS NOT NULL
    AND citation_string IS NOT NULL;"'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db \
  "SELECT COUNT(*) FROM committee_meetings_fts WHERE committee_meetings_fts MATCH '\''oversight'\'';"'

Expect

>= 50 fully-populated rows; FTS5 match returns >= 1.

Smoke run only. Full backfill scope is 113th Congress – present (~17,600 meetings + ~35,000 hearing volumes per Congress.gov pagination counts) and runs detached on the server per `add-a-new-source.html`.

Clarifications needed

Backfill scope: 113th-current (Congress.gov coverage start) vs current-Congress only for v1 launch. Lean: 113th-current (~17,600 meetings) — small enough to backfill in a few hours.
Witness-documents PDFs: lazy-fetch on first agent request, or eager-fetch for current Congress only? Lean: lazy (witness PDFs are heavy and most aren't queried).
Cancelled / postponed meetings: keep historical row with `meetingStatus='Cancelled'` (full audit trail) or hard-delete? Lean: keep — cancellations themselves are signal.

Out of scope

State legislative hearings — Phase 6+.
Per-committee `*.house.gov` and `*.senate.gov` schedule pages — duplicates Congress.gov with worse structure.
C-SPAN / video downloads — separate concern under `hearing-transcripts-ingester` ASR layer.
Hearing transcript content — `hearing-transcripts-ingester` carries the body text.
Real-time webhooks / push subscriptions — surface-layer concern.

Dependencies

Plan

Source module follows the standard Source protocol. Per-source spec at
https://docs.usejosh.com/sources/hearings/ (probe findings + indicative DDL — the
migration shipped with this spec is the source of truth).

Discover/fetch/parse/load shape:
- discover: Congress.gov v3 /committee-meeting paginated by
  ?fromDateTime= + ?sort=updateDate+desc for the calendar; Congress.gov
  v3 /hearing paginated similarly for the printed-volume metadata layer.
  GovInfo CHRG sitemap (https://www.govinfo.gov/sitemap/CHRG_{YYYY}_sitemap.xml)
  as a low-cost backfill discovery path.
- fetch: per-meeting JSON via /v3/committee-meeting/{c}/{chamber}/{eventId}
  → /data/corpus/hearings/bodies/raw/; per-hearing-volume JSON via
  /v3/hearing/{c}/{chamber}/{jacketNumber} (the volume body text is
  hearing-transcripts-ingester territory). The /hearing response carries
  NO top-level packageId: it is keyed by jacketNumber /
  libraryOfCongressIdentifier; the GovInfo CHRG package id appears only
  embedded inside the formats[] URLs, so derive hearing_volumes keys
  from jacketNumber and parse the CHRG package id out of formats[] only
  when it is actually needed. SHA256 the JSON for raw_sha256.
- parse: extract structured fields (title, status, datetime, location,
  committee FK, witnesses, witness documents, bills on agenda, video URLs,
  hearingTranscript[] cross-ref). Markdown body = title + agenda summary
  + witness list; written to /data/corpus/hearings/bodies/markdown/.
- load: idempotent upsert on (congress, chamber, event_id) for
  meetings; on (congress, chamber, jacket_number) for hearing volumes.
  Junction inserts for meeting_witnesses, meeting_witness_documents,
  meeting_bills, meeting_committees. Soft-FK to bills (resolved when
  the bill exists, NULL otherwise). FTS5 sync triggers fire on
  insert/update/delete.
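
Two pieces of the fetch step, hashing the raw JSON and recovering the CHRG package id from `formats[]`, might look like the sketch below. It assumes each `formats[]` entry is an object with a `url` field and that the package id is the `CHRG-...` segment of that URL. The helper names are hypothetical, and the response shape should be checked against real `/hearing` payloads.

```python
import hashlib
import re
from typing import Optional

# Assumption: the package id is the alphanumeric "CHRG-..." segment of a formats[] URL.
CHRG_ID = re.compile(r"CHRG-[0-9A-Za-z]+")


def raw_sha256(body: bytes) -> str:
    """Hex SHA-256 of the raw response bytes, as stored in raw_sha256."""
    return hashlib.sha256(body).hexdigest()


def chrg_package_id(hearing: dict) -> Optional[str]:
    """Return the first CHRG package id found in formats[] URLs, else None."""
    for fmt in hearing.get("formats") or []:
        match = CHRG_ID.search(fmt.get("url") or "")
        if match:
            return match.group(0)
    return None
```

Hashing the bytes exactly as received, rather than a re-serialized dict, keeps `raw_sha256` stable regardless of key order or whitespace.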

Migration shape: committee_meetings, meeting_witnesses,
meeting_witness_documents, meeting_bills, meeting_committees,
hearing_volumes (the printed-record layer), hearing_volume_meetings
(M:N — a printed volume can cover multiple meetings, a meeting can produce
multiple volumes), committee_meetings_fts virtual + 3 sync triggers.
No *_chunks_vec0 — body content lives on hearing-transcripts-ingester.
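
As an illustration of how the idempotent upsert and the three FTS5 sync triggers fit together, here is a minimal sketch using an external-content FTS5 table. The DDL is indicative only and the shipped migration remains the source of truth. The column list, the trigger names, and the denormalized `witnesses` and `agenda` text columns are assumptions, and the real table carries more columns (citation fields, status, datetime).

```python
import sqlite3

# Indicative DDL only. Assumption: witnesses and agenda are denormalized text
# columns kept on committee_meetings so the FTS index can cover them.
DDL = """
CREATE TABLE committee_meetings (
    id INTEGER PRIMARY KEY,
    congress INTEGER NOT NULL,
    chamber TEXT NOT NULL,
    event_id TEXT NOT NULL,
    title TEXT, witnesses TEXT, agenda TEXT,
    UNIQUE (congress, chamber, event_id)
);
CREATE VIRTUAL TABLE committee_meetings_fts USING fts5(
    title, witnesses, agenda,
    content='committee_meetings', content_rowid='id'
);
CREATE TRIGGER committee_meetings_ai AFTER INSERT ON committee_meetings BEGIN
    INSERT INTO committee_meetings_fts(rowid, title, witnesses, agenda)
    VALUES (new.id, new.title, new.witnesses, new.agenda);
END;
CREATE TRIGGER committee_meetings_ad AFTER DELETE ON committee_meetings BEGIN
    INSERT INTO committee_meetings_fts(committee_meetings_fts, rowid, title, witnesses, agenda)
    VALUES ('delete', old.id, old.title, old.witnesses, old.agenda);
END;
CREATE TRIGGER committee_meetings_au AFTER UPDATE ON committee_meetings BEGIN
    INSERT INTO committee_meetings_fts(committee_meetings_fts, rowid, title, witnesses, agenda)
    VALUES ('delete', old.id, old.title, old.witnesses, old.agenda);
    INSERT INTO committee_meetings_fts(rowid, title, witnesses, agenda)
    VALUES (new.id, new.title, new.witnesses, new.agenda);
END;
"""

UPSERT = """
INSERT INTO committee_meetings (congress, chamber, event_id, title, witnesses, agenda)
VALUES (:congress, :chamber, :event_id, :title, :witnesses, :agenda)
ON CONFLICT (congress, chamber, event_id) DO UPDATE SET
    title = excluded.title, witnesses = excluded.witnesses, agenda = excluded.agenda
"""


def upsert_meeting(conn, meeting):
    """Insert or update one meeting; re-running with the same key never duplicates."""
    conn.execute(UPSERT, meeting)


def search(conn, query):
    """Return matching meeting ids, best BM25 match first (FTS5 orders by rank)."""
    rows = conn.execute(
        "SELECT rowid FROM committee_meetings_fts "
        "WHERE committee_meetings_fts MATCH ? ORDER BY rank",
        (query,),
    ).fetchall()
    return [r[0] for r in rows]
```

The sketch needs an SQLite build with FTS5 enabled. Because the upsert resolves conflicts with `DO UPDATE`, a re-ingested meeting goes through the update trigger, which removes the old index entry before adding the new one.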

Schedule: hourly :45. Congress.gov updates throughout the day during
session weeks; hourly catches witness adds and cancellations promptly
without burning quota off-hours.

Tasks

0 of 15 done.

Changelog

No history yet.