Hearings and markup schedule ingester
Why
This source covers every committee meeting Congress holds (hearings, markups,
business meetings, oversight panels), both as future-scheduled events (the
calendar) and as past records (the meeting + witness metadata). It is
the source of truth for "what is committee X doing this week" and
"who testified at the Snowden hearing." It is distinct from hearing-transcripts,
which carries the *content* of what was said; this source carries the
*metadata* of when, where, who, and what bills were on the agenda.
User stories
As a policy analyst using my own AI agent, I want to ask the agent which hearings on my topics are scheduled this week so that I can plan attendance, witness coordination, and prep without scraping committee pages.
As an OSS self-hoster, I want to run `josh ingest hearings` on a fresh substrate so that I can populate the meeting calendar + witness list from public APIs on my own instance.
As a downstream FK consumer (hearing-transcripts, bills), I want to FK to meetings by `meeting:{congress}-{chamber}-{eventId}` and to printed transcripts by `hearing:{jacketNumber}` so that cross-source queries (e.g. "every hearing about HR 1, with the transcripts where available") resolve cleanly.
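The two key formats in the last story can be sketched as small helpers. This is a minimal illustration, not the shipped code: the function names are hypothetical, and lower-casing the chamber is an assumption about how the keys are normalized.

```python
from __future__ import annotations

import re

# Pattern for keys of the form meeting:{congress}-{chamber}-{eventId}.
_MEETING_KEY = re.compile(r"meeting:(\d+)-([a-z]+)-(.+)")


def meeting_key(congress: int, chamber: str, event_id: int | str) -> str:
    """Build a meeting key. Lower-casing the chamber is an assumption."""
    return f"meeting:{congress}-{chamber.lower()}-{event_id}"


def hearing_key(jacket_number: int | str) -> str:
    """Build a printed-volume key from the jacket number."""
    return f"hearing:{jacket_number}"


def parse_meeting_key(key: str) -> tuple[int, str, str]:
    """Split a meeting key back into (congress, chamber, event_id)."""
    match = _MEETING_KEY.fullmatch(key)
    if match is None:
        raise ValueError(f"not a meeting key: {key!r}")
    congress, chamber, event_id = match.groups()
    return int(congress), chamber, event_id
```

Keeping the builder and the parser side by side makes it easy for downstream sources to round-trip a key in tests before relying on it as a foreign key.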
Acceptance criteria (EARS)
- When `josh ingest hearings --since 2026-01-01` is run on the deployed substrate, the system shall populate `committee_meetings` rows whose six citation columns are all non-null.
- When the Congress.gov v3 `/hearing` endpoint returns a printed transcript record, the system shall populate a corresponding `hearing_volumes` row keyed by `hearing:{jacketNumber}` (the response carries no top-level `packageId`; the GovInfo CHRG package id, if needed, must be parsed out of the `formats[]` URLs) and link it to its `associatedMeeting` when one exists.
- While the ingester is running, if a single meeting or hearing record fails to parse, then the system shall log the failure to `ingestion_logs` (with the offending `eventId` or `jacketNumber`) and continue with the next record.
- Where a meeting declares one or more witnesses, the system shall populate the `meeting_witnesses` junction table with one row per witness; if a witness has supporting documents, they shall be linked via `meeting_witness_documents`.
- Where a meeting declares one or more bills on its agenda, the system shall populate the `meeting_bills` junction table linking `committee_meetings.id` to `bills.id` (resolving the bill soft-FK when the bill is in the substrate; leaving `bill_id` NULL otherwise).
- When an FTS5 query is executed against `committee_meetings_fts`, the system shall return BM25-ranked results over `(title, witnesses, agenda)`.
- When an incremental run completes, the system shall update `ingestion_source_state.last_seen_key` to the highest `updateDate` observed so the next run only fetches changed records.
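The per-record failure isolation and the `last_seen_key` high-water mark from the criteria above might look roughly like the sketch below. The column names on `ingestion_logs` and `ingestion_source_state`, the `load_batch` helper, and the `parse` callback are all hypothetical; the sketch also assumes `updateDate` values share one ISO 8601 format so that string comparison orders them correctly. The actual row upsert is omitted.

```python
from __future__ import annotations

import sqlite3
from typing import Any, Callable, Iterable

# Hypothetical minimal schema for this sketch only; the real tables come
# from the shipped migration.
STATE_SCHEMA = """
CREATE TABLE ingestion_logs (source TEXT, record_id TEXT, error TEXT);
CREATE TABLE ingestion_source_state (
    source TEXT PRIMARY KEY,
    last_seen_key TEXT
);
"""


def load_batch(
    conn: sqlite3.Connection,
    records: Iterable[dict[str, Any]],
    parse: Callable[[dict[str, Any]], dict[str, Any]],
    source: str = "hearings",
) -> tuple[int, str | None]:
    """Parse records one by one; log failures and continue with the next."""
    loaded = 0
    high_water: str | None = None
    for record in records:
        record_id = record.get("eventId") or record.get("jacketNumber")
        try:
            row = parse(record)
        except Exception as exc:  # isolate the failure to this one record
            conn.execute(
                "INSERT INTO ingestion_logs (source, record_id, error) "
                "VALUES (?, ?, ?)",
                (source, str(record_id), repr(exc)),
            )
            continue
        loaded += 1  # the upsert of `row` would happen here
        update_date = row["updateDate"]
        if high_water is None or update_date > high_water:
            high_water = update_date
    if high_water is not None:
        # Only ever move the stored high-water mark forward.
        conn.execute(
            """
            INSERT INTO ingestion_source_state (source, last_seen_key)
            VALUES (?, ?)
            ON CONFLICT (source) DO UPDATE
            SET last_seen_key = excluded.last_seen_key
            WHERE excluded.last_seen_key > ingestion_source_state.last_seen_key
            """,
            (source, high_water),
        )
    conn.commit()
    return loaded, high_water
```

Guarding the update with a `WHERE` clause means a re-run over older records cannot move the stored mark backwards.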
Success determiner
Command
set -euo pipefail
ssh josh 'docker exec josh-ingester josh ingest hearings --since 2026-04-01 --max 50'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
SELECT COUNT(*) FROM committee_meetings
WHERE update_date >= '\''2026-04-01'\''
AND source_url IS NOT NULL AND source_org IS NOT NULL
AND retrieved_at IS NOT NULL AND raw_sha256 IS NOT NULL
AND citation_string IS NOT NULL;"'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db \
"SELECT COUNT(*) FROM committee_meetings_fts WHERE committee_meetings_fts MATCH '\''oversight'\'';"'
Expect
Smoke run only. Full backfill scope is 113th Congress – present (~17,600 meetings + ~35,000 hearing volumes per Congress.gov pagination counts) and runs detached on the server per `add-a-new-source.html`.
Clarifications needed
- Backfill scope: 113th-current (Congress.gov coverage start) vs current-Congress only for v1 launch. Lean: 113th-current (~17,600 meetings) — small enough to backfill in a few hours.
- Witness-documents PDFs: lazy-fetch on first agent request, or eager-fetch for current Congress only? Lean: lazy (witness PDFs are heavy and most aren't queried).
- Cancelled / postponed meetings: keep historical row with `meetingStatus='Cancelled'` (full audit trail) or hard-delete? Lean: keep — cancellations themselves are signal.
Out of scope
- State legislative hearings — Phase 6+.
- Per-committee `*.house.gov` and `*.senate.gov` schedule pages — duplicates Congress.gov with worse structure.
- C-SPAN / video downloads — separate concern under `hearing-transcripts-ingester` ASR layer.
- Hearing transcript content — `hearing-transcripts-ingester` carries the body text.
- Real-time webhooks / push subscriptions — surface-layer concern.
Dependencies
Plan
Source module follows the standard Source protocol. Per-source spec at https://docs.usejosh.com/sources/hearings/ (probe findings + indicative DDL; the
migration shipped with this spec is the source of truth).
Discover/fetch/parse/load shape:
- discover: Congress.gov v3 /committee-meeting paginated by
?fromDateTime= + ?sort=updateDate+desc for the calendar; Congress.gov
v3 /hearing paginated similarly for the printed-volume metadata layer.
GovInfo CHRG sitemap (https://www.govinfo.gov/sitemap/CHRG_{YYYY}_sitemap.xml)
as a low-cost backfill discovery path.
- fetch: per-meeting JSON via /v3/committee-meeting/{c}/{chamber}/{eventId}
→ /data/corpus/hearings/bodies/raw/; per-hearing-volume JSON via
/v3/hearing/{c}/{chamber}/{jacketNumber} (the volume body text is
hearing-transcripts-ingester territory). The /hearing response carries
NO top-level packageId — it is keyed by jacketNumber /
libraryOfCongressIdentifier; the GovInfo CHRG package id appears only
embedded inside the formats[] URLs, so derive hearing_volumes keys
from jacketNumber and parse the CHRG package id out of formats[] only
when it is actually needed. SHA256 the JSON for raw_sha256.
- parse: extract structured fields (title, status, datetime, location,
committee FK, witnesses, witness documents, bills on agenda, video URLs,
hearingTranscript[] cross-ref). Markdown body = title + agenda summary
+ witness list; written to /data/corpus/hearings/bodies/markdown/.
- load: idempotent upsert on (congress, chamber, event_id) for
meetings; on (congress, chamber, jacket_number) for hearing volumes.
Junction inserts for meeting_witnesses, meeting_witness_documents,
meeting_bills, meeting_committees. Soft-FK to bills (resolved when
the bill exists, NULL otherwise). FTS5 sync triggers fire on
insert/update/delete.
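Two pieces of the fetch step lend themselves to a short sketch: deriving the GovInfo CHRG package id from the `formats[]` URLs, and hashing the raw response for `raw_sha256`. The helper names are hypothetical, and the regular expression encodes an assumed package-id shape (such as `CHRG-116hhrg41444`) that should be checked against real responses before use.

```python
from __future__ import annotations

import hashlib
import re

# Assumed shape of a GovInfo CHRG package id, e.g. "CHRG-116hhrg41444".
_CHRG_ID = re.compile(r"CHRG-\d+[a-z]+\d+")


def chrg_package_id(formats: list[dict]) -> str | None:
    """Return the first CHRG package id found in any formats[] URL."""
    for fmt in formats:
        match = _CHRG_ID.search(fmt.get("url") or "")
        if match:
            return match.group(0)
    return None  # no format URL embeds a package id


def raw_sha256(raw_body: bytes) -> str:
    """Hash the response bytes exactly as fetched, before any parsing."""
    return hashlib.sha256(raw_body).hexdigest()
```

Hashing the bytes as received, rather than a re-serialized dict, keeps the digest stable regardless of key order or whitespace choices made by a JSON encoder.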
Migration shape: committee_meetings, meeting_witnesses, meeting_witness_documents, meeting_bills, meeting_committees, hearing_volumes (the printed-record layer), hearing_volume_meetings
(M:N: a printed volume can cover multiple meetings, and a meeting can produce
multiple volumes), committee_meetings_fts virtual table + 3 sync triggers.
No *_chunks_vec0: body content lives on hearing-transcripts-ingester.
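As a rough illustration of how the FTS5 table, its three sync triggers, and the idempotent upsert interact, the sketch below uses SQLite's external-content FTS5 pattern. The column set is trimmed and partly assumed (in particular, a denormalized `witnesses` text column on `committee_meetings`), the trigger and helper names are hypothetical, and it requires a Python `sqlite3` build with FTS5 enabled. The shipped migration remains the source of truth.

```python
import sqlite3

# Trimmed, assumed schema: external-content FTS5 table plus 3 sync triggers.
FTS_SCHEMA = """
CREATE TABLE committee_meetings (
    id INTEGER PRIMARY KEY,
    congress INTEGER NOT NULL,
    chamber TEXT NOT NULL,
    event_id TEXT NOT NULL,
    title TEXT, witnesses TEXT, agenda TEXT,
    UNIQUE (congress, chamber, event_id)
);
CREATE VIRTUAL TABLE committee_meetings_fts USING fts5(
    title, witnesses, agenda,
    content='committee_meetings', content_rowid='id'
);
CREATE TRIGGER committee_meetings_ai AFTER INSERT ON committee_meetings BEGIN
    INSERT INTO committee_meetings_fts(rowid, title, witnesses, agenda)
    VALUES (new.id, new.title, new.witnesses, new.agenda);
END;
CREATE TRIGGER committee_meetings_ad AFTER DELETE ON committee_meetings BEGIN
    INSERT INTO committee_meetings_fts(committee_meetings_fts, rowid, title, witnesses, agenda)
    VALUES ('delete', old.id, old.title, old.witnesses, old.agenda);
END;
CREATE TRIGGER committee_meetings_au AFTER UPDATE ON committee_meetings BEGIN
    INSERT INTO committee_meetings_fts(committee_meetings_fts, rowid, title, witnesses, agenda)
    VALUES ('delete', old.id, old.title, old.witnesses, old.agenda);
    INSERT INTO committee_meetings_fts(rowid, title, witnesses, agenda)
    VALUES (new.id, new.title, new.witnesses, new.agenda);
END;
"""


def upsert_meeting(conn, congress, chamber, event_id, title, witnesses, agenda):
    """Idempotent upsert keyed on (congress, chamber, event_id)."""
    conn.execute(
        """
        INSERT INTO committee_meetings
            (congress, chamber, event_id, title, witnesses, agenda)
        VALUES (?, ?, ?, ?, ?, ?)
        ON CONFLICT (congress, chamber, event_id) DO UPDATE SET
            title = excluded.title,
            witnesses = excluded.witnesses,
            agenda = excluded.agenda
        """,
        (congress, chamber, event_id, title, witnesses, agenda),
    )
    conn.commit()


def search(conn, query):
    """Return event ids ranked by BM25 (lower score = better match)."""
    rows = conn.execute(
        """
        SELECT m.event_id
        FROM committee_meetings_fts
        JOIN committee_meetings AS m ON m.id = committee_meetings_fts.rowid
        WHERE committee_meetings_fts MATCH ?
        ORDER BY bm25(committee_meetings_fts)
        """,
        (query,),
    ).fetchall()
    return [row[0] for row in rows]
```

Because the `DO UPDATE` branch of an upsert fires the `AFTER UPDATE` trigger, re-ingesting a changed meeting replaces its index entry rather than leaving stale terms behind.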
Schedule: hourly :45. Congress.gov updates throughout the day during
session weeks; hourly catches witness adds and cancellations promptly
without burning quota off-hours.
Tasks
0 of 15 done.
- t1 Alembic migration applied locally + production (committee_meetings + hearing_volumes + 4 junctions + FTS5)
- t2 Pydantic models in shared/josh_substrate/src/josh_substrate/models/hearings.py
- t3 Source module implementing the `Source` protocol
- t4 Body normalization wired (Markdown body = title + agenda + witness list; raw JSON cached)
- t5 Citation metadata: all 6 columns populated; citation_for() formatter + 3-fixture unit tests
- t6 FTS5 virtual table + 3 sync triggers wired into migration
- t7 N/A — no body chunks (transcript content lives in hearing-transcripts-ingester)
- t8 Unit tests for Congress.gov JSON parser against captured fixtures (scheduled meeting, cancelled meeting, multi-witness hearing, hearing-with-printed-volume)
- t9 Integration test for the full discover→fetch→parse→load pipeline
- t10 Smoke backfill (50 meetings) clean per success_determiner
- t11 Production backfill at agreed scope (per clarifications_needed) complete
- t12 Hourly cron registered and observed firing once
- t13 No recurring errors in `ingestion_logs` for the last completed run
- t14 Source doc updated; inventory rows in https://docs.usejosh.com/josh-data-sources/ + https://docs.usejosh.com/data-status/ flipped to ingestion-built
- t15 Bills FK soft-resolution test: meetings inserted before their referenced bill is in the substrate get the bill_id back-filled on the next bills run
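The soft-FK back-fill that t15 tests could be expressed as a single `UPDATE`, sketched below. It assumes `meeting_bills` stores the bill's natural key (`congress`, `bill_type`, `bill_number`) alongside the nullable `bill_id`; those columns, the `bills` layout, and the `backfill_bill_ids` helper are hypothetical and only serve the illustration.

```python
import sqlite3

# Hypothetical minimal schema for this sketch only.
BILLS_SCHEMA = """
CREATE TABLE bills (
    id INTEGER PRIMARY KEY,
    congress INTEGER NOT NULL,
    bill_type TEXT NOT NULL,
    bill_number INTEGER NOT NULL,
    UNIQUE (congress, bill_type, bill_number)
);
CREATE TABLE meeting_bills (
    meeting_id INTEGER NOT NULL,
    congress INTEGER NOT NULL,
    bill_type TEXT NOT NULL,
    bill_number INTEGER NOT NULL,
    bill_id INTEGER REFERENCES bills (id)  -- NULL until the bill arrives
);
"""

BACKFILL_SQL = """
UPDATE meeting_bills
SET bill_id = (
    SELECT b.id FROM bills AS b
    WHERE b.congress = meeting_bills.congress
      AND b.bill_type = meeting_bills.bill_type
      AND b.bill_number = meeting_bills.bill_number
)
WHERE bill_id IS NULL
  AND EXISTS (
    SELECT 1 FROM bills AS b
    WHERE b.congress = meeting_bills.congress
      AND b.bill_type = meeting_bills.bill_type
      AND b.bill_number = meeting_bills.bill_number
)
"""


def backfill_bill_ids(conn: sqlite3.Connection) -> int:
    """Resolve unresolved bill references; return how many rows changed."""
    cursor = conn.execute(BACKFILL_SQL)
    conn.commit()
    return cursor.rowcount
```

Running this at the end of each bills run is one way to get the behavior described; the statement is safe to repeat because it only touches rows whose `bill_id` is still NULL and whose bill now exists.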
Changelog
No history yet.