Hearings and markup schedule

Every committee meeting Congress holds — hearings, markups, business meetings, oversight panels — both as future-scheduled events (the calendar) and as past records (the meeting + transcript). This is the source-of-truth for "what is committee X doing this week" and "who testified at the Snowden hearing."

This source is distinct from hearing transcripts (source #7). A meeting record carries the metadata: when, where, which committee, what witnesses, what bills are on the agenda, links to documents and (eventually) video. The published transcript — the printed/PDF/HTML proceedings volume — comes later, sometimes months later, and is documented in hearing transcripts.

Quick reference


Source name	Hearings and markup schedule
Publishers	Library of Congress (Congress.gov), House Clerk (docs.house.gov), Senate Sergeant at Arms (committee schedule pages), Government Publishing Office (CHRG bulk data)
License	Public domain
Coverage	Congress.gov `/committee-meeting`: 113th Congress (2013) – present. `/hearing` (transcript records): 92nd Congress (1971) – present, but coverage is thin before the 110th.
Volume	~17,600 committee meetings in total for the 113th-119th (combined chambers), ~3,470 in 119th-to-date. ~35,000 published hearing volumes all-time across the API.
Storage estimate	~1-2 GB metadata; ~5-10 GB if we include all linked PDFs (witness statements, support docs, attendance sheets)
Stable ID format	Meeting: `meeting:{congress}-{chamber}-{eventId}` e.g. `meeting:118-h-115286`. Hearing volume: `hearing:{jacketNumber}` e.g. `hearing:50898` (the `/hearing` response has no top-level `packageId`; the GovInfo CHRG package id, e.g. `CHRG-118hhrg50898`, appears only inside the `formats[]` URLs).
Status	exploring — schema drafted, ingestion not built

Endpoint roots, auth, rate limits, caching headers, and incremental-sync params are upstream-documented — see the Congress.gov v3 API, api.data.gov key/rate limits, and the GovInfo developer hub. Congress.gov needs a real api.data.gov key for ingestion (DEMO_KEY's ~30/hr is unusable); GovInfo bulk has no auth.

Source priority decision

Three overlapping sources cover hearings. Recommendation:

Primary: Congress.gov v3 /committee-meeting. Best structured API for the forward-looking calendar — gives meetingStatus (Scheduled / Held / Cancelled / Postponed), date/time, committee, location (building + room), witnesses, witness documents, meeting documents, video URLs, and outbound link to hearingTranscript[] when the printed proceedings exist. Coverage starts 113th Congress.

Primary: Congress.gov v3 /hearing. Best for the published transcript volume metadata — jacketNumber, citation (e.g. H.Hrg.118), libraryOfCongressIdentifier, formatted-text and PDF URLs, and inbound link to associatedMeeting. Use this in tandem with /committee-meeting. The API /hearing is the metadata layer over GovInfo's CHRG collection. The response is keyed by jacketNumber / libraryOfCongressIdentifier and carries no top-level packageId — the GovInfo CHRG package id appears only embedded inside the formats[] URLs (e.g. the PDF/Formatted-Text links resolve to CHRG-118hhrg50898), so it must be parsed out of those URLs when needed.

Secondary: GovInfo CHRG bulk. The actual content (full-text HTML, PDF, MODS metadata, ZIP of all parts) lives in the GovInfo CHRG collection. A per-year sitemap enumerates every published-hearing package with <lastmod>. Useful for backfill (33K+ packages) and as a sitemap-diff incremental signal. See GovInfo CHRG help.

Tertiary: docs.house.gov calendar pages. The docs.house.gov ByEvent/ByWeek pages render the same data as the Congress.gov /committee-meeting/118/house/N API call but as HTML. The eventId is the same numeric identifier in both systems, which means the URL is round-trippable from the API response. Use only as a fallback if the API is down.

Skip: per-committee *.house.gov and *.senate.gov schedule pages. They duplicate Congress.gov's data with worse structure. The committee aliases (agriculture.house.gov/calendar, etc.) are useful for press release scraping (different source) but redundant for the meeting calendar.

Note: the Congress.gov /hearing endpoint is transcript-volume-centric (one record per published proceedings volume — the thing GPO bound and printed). The /committee-meeting endpoint is event-centric (one record per scheduled gavel-fall). They link via associatedMeeting ↔ hearingTranscript[] but are not 1:1: a single meeting may produce multiple printed volumes (multi-part hearings); a printed volume may cover multiple meetings (a series). The schema must model both.

Access notes

Congress.gov API quotas, headers, and DEMO_KEY behavior are identical to those of the bills source. A real api.data.gov key is needed for ingestion; DEMO_KEY's ~30/hr ceiling is unusable. ?fromDateTime=YYYY-MM-DDTHH:MM:SSZ&sort=updateDate desc is the canonical incremental filter.
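
A minimal sketch of building that incremental filter URL, so the timestamp format and the space in the sort value are encoded consistently. The https://api.congress.gov/v3 root, the format and api_key parameters, and the incremental_url helper name are assumptions made for illustration; check them against the upstream API docs before relying on them.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Assumed API root; the upstream docs are authoritative.
API_ROOT = "https://api.congress.gov/v3"


def incremental_url(endpoint: str, last_run: datetime, api_key: str, limit: int = 250) -> str:
    """Build a list URL for rows updated since last_run (hypothetical helper)."""
    # The filter expects UTC in YYYY-MM-DDTHH:MM:SSZ form.
    stamp = last_run.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    query = urlencode({
        "fromDateTime": stamp,
        "sort": "updateDate desc",  # urlencode writes the space as '+'
        "limit": limit,
        "format": "json",           # assumed parameter
        "api_key": api_key,         # assumed parameter
    })
    return f"{API_ROOT}/{endpoint}?{query}"
```

Pass a timezone-aware datetime for last_run; a naive value would be interpreted as local time by astimezone.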

GovInfo bulk has no auth, no formal rate limit. ~1-2 req/sec is polite. The /wssearch/getContentDetail?packageId=... endpoint returns rich JSON with collection metadata + download links (we used this for bills too).

Sending a real-browser User-Agent is harmless across all three sources.

Endpoints

The Congress.gov v3 endpoint paths, query parameters (fromDateTime / toDateTime / sort / limit / offset / congress), and the GovInfo CHRG URL conventions are documented upstream:

Congress.gov OpenAPI — committee-meeting (event-centric: list, per-congress, per-congress/chamber, single-detail variants)
Congress.gov OpenAPI — hearing (transcript-volume-centric)
GovInfo developer hub — CHRG sitemap, getContentDetail, mods.xml, content-package PDF/HTML/ZIP conventions

Josh-specific notes on those endpoints:

pagination.count reports the total: ~17,600 for /committee-meeting, ~35,300 for /hearing.
jacketNumber is the GPO printer's "jacket" identifier — a 5-digit integer per published volume, scoped per congress×chamber.
GovInfo getContentDetail is the only place to get SuDoc Class Number and Hearing Sub Type; neither is in the Congress.gov API. The per-year CHRG sitemap enumerates the packages to fetch, and each sitemap entry resolves to /app/details/{packageId} (the packageId already carries the CHRG- prefix, e.g. CHRG-118hhrg50898).
House calendar: docs.house.gov ByEvent.aspx?EventID=N uses the same eventId as the Congress.gov API, so the URL is round-trippable from the API response. There is no XML / JSON feed equivalent — the docs.house.gov system is ASP.NET-rendered HTML only.
Senate calendar: no centralized feed. Each committee publishes its own page, and many senate.gov properties are bot-walled by Akamai — the same problem as roll-call votes. Rely on the Congress.gov API as the canonical Senate-meeting source; it ingests Senate committee data centrally with better coverage than scraping each committee's site.

Vocabulary / enums

The meetingStatus, type, documentType, and formats enum value lists are upstream-controlled vocabularies (Congress.gov / GovInfo) — see the Congress.gov OpenAPI and GovInfo CHRG help. The canonical Josh record of the values we accept lives in the CHECK constraints in the DDL below (meeting_status, chamber, document_role).

Josh-specific glosses on those vocabularies:

House document filenames follow HHRG-{c}-{committeeCode}-{date}-{slot}-{seq}.pdf. Senate filenames mirror this with an SHRG- prefix (verify when probing Senate samples).
GovInfo "Hearing Sub Type" observed values: General, Field, Markup, Business Meeting, Briefing, Roundtable. Mostly redundant with the API's committeeMeeting.type.

Stable ID formats

Meeting: meeting:{congress}-{chamber}-{eventId}, where chamber is the lowercased short token used in the examples below (h, s, or nochamber).

Examples:

meeting:118-h-115286 (House Oversight and Accountability meeting on Twitter Files, Feb 8 2023)
meeting:119-s-119082
meeting:118-nochamber-119250 (rare — joint or unattributed)

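The API path is /committee-meeting/{congress}/{chamber}/{eventId}, so the ID is round-trippable once the short chamber token is expanded back to the API's chamber segment (h → house, as in /committee-meeting/118/house/N).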
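
A small sketch of that round trip. The h/s tokens are inferred from the examples above and the house/senate/nochamber path segments from the chamber values used elsewhere in this document; the helper names are hypothetical.

```python
# Chamber name as used in the API path -> short token used in the stable ID.
CHAMBER_TO_TOKEN = {"house": "h", "senate": "s", "nochamber": "nochamber"}
TOKEN_TO_CHAMBER = {token: chamber for chamber, token in CHAMBER_TO_TOKEN.items()}


def meeting_id(congress: int, chamber: str, event_id: str) -> str:
    """Build the stable ID, e.g. meeting:118-h-115286."""
    return f"meeting:{congress}-{CHAMBER_TO_TOKEN[chamber.lower()]}-{event_id}"


def meeting_api_path(stable_id: str) -> str:
    """Recover the API detail path from a stable meeting ID."""
    prefix, _, rest = stable_id.partition(":")
    if prefix != "meeting":
        raise ValueError(f"not a meeting id: {stable_id!r}")
    # maxsplit=2 keeps any hyphen inside the event id intact.
    congress, token, event_id = rest.split("-", 2)
    return f"/committee-meeting/{congress}/{TOKEN_TO_CHAMBER[token]}/{event_id}"
```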

Hearing volume: hearing:{jacketNumber}, where jacketNumber is the GPO printer's jacket identifier carried verbatim on the /hearing record. The /hearing response has no top-level packageId — it is keyed by jacketNumber / libraryOfCongressIdentifier, so the stable ID is derived from jacketNumber.

Examples:

hearing:50898 (House volume; the underlying GovInfo package is CHRG-118hhrg50898)
hearing:47102 (Senate volume; underlying GovInfo package CHRG-117shrg47102)

The GovInfo CHRG package id (which does encode congress + chamber + jacketNumber + sometimes a suffix) surfaces only embedded inside the formats[] URLs on the /hearing record, not as a first-class field. Parse it out of those URLs when it is actually needed (e.g. to fetch GovInfo MODS/ZIP). Because jacketNumber alone is not unique across chambers (50898 exists in both houses), scope it by congress + chamber when joining. The same caveat applies to the hearing:{jacketNumber} stable ID itself: wherever it serves as a primary key, it needs the same congress + chamber scoping to stay unique.

The GovInfo package id pattern (when parsed from formats[]): CHRG-{c}{chamberCode}hrg{jacketNumber} where chamberCode ∈ {h, s, j} (j = joint hearing volumes).
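
As a sketch of the parsing step, the pattern above translates into a regular expression roughly like the following. The ChrgPackage type and parse_package_id name are hypothetical, and the regex captures only the base id, so any trailing suffix is ignored.

```python
import re
from typing import NamedTuple, Optional

# CHRG-{c}{chamberCode}hrg{jacketNumber}, chamberCode in {h, s, j}.
PACKAGE_RE = re.compile(r"CHRG-(\d+)([hsj])hrg(\d+)")
CHAMBERS = {"h": "house", "s": "senate", "j": "joint"}


class ChrgPackage(NamedTuple):
    package_id: str
    congress: int
    chamber: str
    jacket_number: int


def parse_package_id(url: str) -> Optional[ChrgPackage]:
    """Return the first CHRG package id embedded in a formats[] URL, if any."""
    match = PACKAGE_RE.search(url)
    if match is None:
        return None
    congress, code, jacket = match.groups()
    return ChrgPackage(match.group(0), int(congress), CHAMBERS[code], int(jacket))
```

The parsed congress and chamber give exactly the scoping needed to join on jacket number safely.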

Response shapes

The field-by-field committeeMeeting and hearing JSON response shapes are documented in the upstream OpenAPI schemas (committee-meeting, hearing). The GovInfo getContentDetail?packageId=CHRG-... response shape (a columnnamevalueset of collection metadata plus a download block with premis/mods/zip links) is documented at the GovInfo developer hub. Josh interpretation of those shapes:

committeeMeeting witnesses are per-event (just name and position); a separate child table meeting_witnesses is appropriate.
committeeMeeting.committees is a list because joint hearings link multiple committees; systemCode is the BILLSTATUS-XML-style code (e.g. hsgo00 for House Oversight and Accountability, formerly Government Reform), which joins to our committees table directly.
hearing.dates[] is plural — a hearing volume can span multiple meeting dates; a multi-part hearing series gets bound into a single printed volume. Plan storage with a hearing_dates child table.
hearing has no top-level packageId. The volume is keyed by jacketNumber / libraryOfCongressIdentifier; the GovInfo CHRG package id (CHRG-118hhrg50898) appears only inside the formats[].url values, so derive it by parsing those URLs rather than reading a field.
GovInfo getContentDetail is the only source of Hearing Sub Type and SuDoc Class Number (e.g. Y 4.OV 2:118-3) plus Held Date — none of these are in the Congress.gov API response.

Volume

Metric	Value
`/committee-meeting` total (113th-119th)	~17,600
`/hearing` total (all coverage)	~35,300
Per-congress meetings	~3,500-4,000
Per-congress published volumes	~1,500-2,500 (many meetings don't get a printed hearing — markups, business meetings rarely do)
Witness count per hearing	typically 3-8
Meeting documents per hearing	5-30 (many support docs)

Storage: meetings + hearings + witnesses + documents metadata = ~1-2 GB Postgres after parse. PDF bodies (witness statements, transcripts) add ~5-10 GB if we cache them in /data/corpus/hearings/. Transcripts proper are documented in source #7 — we may choose to skip body PDFs at this layer and let the transcript loader handle them.

Caching / incremental sync

Three independent change signals:

Congress.gov API ?fromDateTime= filter on updateDate — both /committee-meeting and /hearing support this (see Endpoints). Run a frequent (15-30 minute) incremental tick that fetches everything updated since last run, sorted updateDate desc. This catches:
- New scheduled meetings.
- Updates to existing meetings (witness changes, location updates, postponements).
- New transcript volumes (eventually — usually months after the meeting).
- Updates to existing transcript records.
GovInfo CHRG sitemap — the per-year sitemap's <url> elements each carry <loc> (package URL) and <lastmod> (most recent update). Run weekly to detect transcript publications and re-publications. Cheaper than scanning the API.
Per-event polling for upcoming meetings within the next 14 days — meeting status can flip to Postponed or Cancelled rapidly. Re-fetch each upcoming Scheduled meeting daily up until its date.
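
The sitemap signal can be sketched as a parse step plus a diff against stored lastmod values. This assumes the sitemap uses the standard sitemaps.org namespace and that the package id is the last path segment of each loc; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime
from typing import Dict, List

# Standard sitemap namespace (sitemaps.org protocol).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> Dict[str, datetime]:
    """Map package id -> lastmod for every <url> entry in a sitemap document."""
    entries: Dict[str, datetime] = {}
    root = ET.fromstring(xml_text)
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if not loc or not lastmod:
            continue
        # Assumption: the package id is the final path segment of <loc>.
        package_id = loc.rstrip("/").rsplit("/", 1)[-1]
        # Normalise a trailing 'Z' so fromisoformat accepts it on older Pythons.
        entries[package_id] = datetime.fromisoformat(lastmod.replace("Z", "+00:00"))
    return entries


def stale_packages(sitemap: Dict[str, datetime], stored: Dict[str, datetime]) -> List[str]:
    """Package ids that are new, or whose lastmod is newer than the stored one."""
    return sorted(
        pid for pid, lastmod in sitemap.items()
        if pid not in stored or lastmod > stored[pid]
    )
```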

Schema (Postgres DDL)

-- ============================================================
-- Committee meetings (event-centric)
-- ============================================================

CREATE TABLE committee_meetings (
    id text PRIMARY KEY,                          -- 'meeting:118-h-115286'
    congress smallint NOT NULL,
    chamber text NOT NULL CHECK (chamber IN ('house', 'senate', 'nochamber')),
    event_id text NOT NULL,
    UNIQUE (congress, chamber, event_id),

    type text NOT NULL,                           -- 'Hearing', 'Markup', 'Business Meeting', etc.
    title text,
    meeting_status text CHECK (meeting_status IN (
        'Scheduled', 'Held', 'Postponed', 'Cancelled', 'Closed'
    ) OR meeting_status IS NULL),

    -- When + where
    meeting_date timestamptz,                     -- the gavel time
    location_building text,
    location_room text,

    -- Lifecycle
    update_date timestamptz NOT NULL,             -- API updateDate
    raw_json jsonb NOT NULL,
    fetched_at timestamptz NOT NULL,
    parsed_at timestamptz,
    inserted_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX committee_meetings_date ON committee_meetings (meeting_date DESC NULLS LAST);
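-- Partial index over scheduled meetings. The predicate cannot reference CURRENT_DATE:
-- Postgres requires index predicates to use only IMMUTABLE functions. Filter by date at query time.
CREATE INDEX committee_meetings_status_date
    ON committee_meetings (meeting_date)
    WHERE meeting_status = 'Scheduled';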
CREATE INDEX committee_meetings_update ON committee_meetings (update_date DESC);

ALTER TABLE committee_meetings ADD COLUMN search_tsv tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(type, '')), 'C')
    ) STORED;
CREATE INDEX committee_meetings_search ON committee_meetings USING gin (search_tsv);

-- One meeting → many committees (joint hearings)
CREATE TABLE committee_meeting_committees (
    meeting_id text REFERENCES committee_meetings(id) ON DELETE CASCADE,
    committee_id text REFERENCES committees(id),
    committee_system_code text,                   -- captured even if committees row missing
    PRIMARY KEY (meeting_id, committee_system_code)
);

-- Witnesses appearing at the meeting
CREATE TABLE meeting_witnesses (
    id bigserial PRIMARY KEY,
    meeting_id text NOT NULL REFERENCES committee_meetings(id) ON DELETE CASCADE,
    sequence int NOT NULL,                        -- order witnesses are listed
    name text NOT NULL,
    position text,
    UNIQUE (meeting_id, sequence)
);
CREATE INDEX meeting_witnesses_name ON meeting_witnesses (name);

-- Documents linked to a meeting (witness statements, support docs, member statements)
CREATE TABLE meeting_documents (
    id bigserial PRIMARY KEY,
    meeting_id text NOT NULL REFERENCES committee_meetings(id) ON DELETE CASCADE,
    document_type text NOT NULL,                  -- 'Support Document', 'Witness Statement', etc.
    document_role text NOT NULL CHECK (document_role IN ('meeting', 'witness')),
    name text,                                    -- meeting-doc display name
    format text,                                  -- 'PDF', 'HTML'
    url text NOT NULL,
    -- Body retrieval is lazy; not all docs are worth caching
    body_text text,                               -- extracted (if fetched)
    fetched_at timestamptz,
    UNIQUE (meeting_id, url)
);
CREATE INDEX meeting_documents_type ON meeting_documents (meeting_id, document_type);

-- Videos linked to the meeting
CREATE TABLE meeting_videos (
    id bigserial PRIMARY KEY,
    meeting_id text NOT NULL REFERENCES committee_meetings(id) ON DELETE CASCADE,
    name text,
    url text NOT NULL,
    platform text,                                -- 'congress.gov' | 'youtube' | 'cspan' | 'committee-site'
    UNIQUE (meeting_id, url)
);

-- ============================================================
-- Hearings (published transcript volumes)
-- ============================================================

CREATE TABLE hearings (
    id text PRIMARY KEY,                          -- 'hearing:50898' (derived from jacketNumber)
    package_id text,                              -- 'CHRG-118hhrg50898'; NOT a top-level API field — parsed out of formats[] URLs when needed

    congress smallint NOT NULL,
    chamber text NOT NULL CHECK (chamber IN ('house', 'senate', 'joint')),
    jacket_number int NOT NULL,
    UNIQUE (congress, chamber, jacket_number),    -- jacketNumber is unique only when scoped by congress + chamber
    senate_number int,                            -- 'number' field, Senate only

    citation text,                                -- 'H.Hrg.118' (rare; mostly null in samples)
    library_of_congress_identifier text,          -- 'LC70386'
    sudoc_class_number text,                      -- from GovInfo: 'Y 4.OV 2:118-3'
    hearing_sub_type text,                        -- from GovInfo: 'General', 'Markup', etc.

    title text NOT NULL,                          -- the volume title (usually all-caps)

    -- Source URLs
    formatted_text_url text,                      -- congress.gov HTML
    pdf_url text,                                 -- congress.gov / govinfo PDF
    govinfo_html_url text,                        -- govinfo.gov HTML
    govinfo_zip_url text,                         -- govinfo.gov ZIP archive
    mods_url text,                                -- govinfo.gov mods.xml

    -- Lifecycle
    update_date timestamptz NOT NULL,
    sitemap_lastmod timestamptz,                  -- per-package sitemap signal
    raw_json jsonb NOT NULL,                      -- API response
    raw_govinfo_metadata jsonb,                   -- getContentDetail response
    fetched_at timestamptz NOT NULL,
    parsed_at timestamptz,
    inserted_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX hearings_congress_chamber ON hearings (congress, chamber);
CREATE INDEX hearings_jacket ON hearings (chamber, jacket_number);
CREATE INDEX hearings_update ON hearings (update_date DESC);

ALTER TABLE hearings ADD COLUMN search_tsv tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(citation, '') || ' ' || coalesce(library_of_congress_identifier, '')), 'D')
    ) STORED;
CREATE INDEX hearings_search ON hearings USING gin (search_tsv);

-- Hearing → committees (a published volume can cover multiple committees, but most have one)
CREATE TABLE hearing_committees (
    hearing_id text REFERENCES hearings(id) ON DELETE CASCADE,
    committee_id text REFERENCES committees(id),
    committee_system_code text,
    PRIMARY KEY (hearing_id, committee_system_code)
);

-- Hearing dates (multi-day hearings produce multiple date rows)
CREATE TABLE hearing_dates (
    hearing_id text NOT NULL REFERENCES hearings(id) ON DELETE CASCADE,
    held_date date NOT NULL,
    PRIMARY KEY (hearing_id, held_date)
);

-- Meeting ↔ Hearing back-reference (a meeting can produce multiple hearing volumes; vice versa)
CREATE TABLE meeting_hearings (
    meeting_id text NOT NULL REFERENCES committee_meetings(id) ON DELETE CASCADE,
    hearing_id text NOT NULL REFERENCES hearings(id) ON DELETE CASCADE,
    PRIMARY KEY (meeting_id, hearing_id)
);

Schema decisions worth flagging:

committee_meetings and hearings as separate top-level tables — they have distinct lifecycles (a meeting exists from scheduling onward; a hearing exists only after publication, often months later). The meeting_hearings join table captures the M:N relationship.
meeting_documents as a single table for both witness and meeting-level documents — document_role discriminates. The set of document types is small enough that one table beats two with mostly-overlapping shape.
body_text nullable on meeting_documents — body fetch is lazy. Many support docs aren't worth caching (PDFs of news articles, member-circulated graphics). Fetch on first agent request.
raw_govinfo_metadata jsonb on hearings — preserves the GovInfo getContentDetail payload. SuDoc Class, Hearing Sub Type, and Held Date come from there only.
hearing_committees.committee_id is nullable; committee_system_code is the canonical key. Same pattern used elsewhere for soft FKs to committees that may be historical and not in our current committees table.
No hearing transcript body text lives in this table. That's source #7 (hearing transcripts). The hearings table is metadata only; the body comes via a separate table that joins to hearings.

Download / update strategy

Backfill (one-time, 113th → present for meetings; 92nd → present for transcripts)

Meetings (Congress.gov):

For each congress 113 through current:
- Page through /v3/committee-meeting filtered by congress (limit=250, sort=updateDate desc) until exhausted — see Endpoints.
- For each row, fetch detail: /v3/committee-meeting/{c}/{chamber}/{eventId}.
- Save raw JSON to /data/corpus/hearings/meetings/{c}/{chamber}/{eventId}.json.
- Parse, load.
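~17,600 meetings × 1 detail fetch each ≈ 17,600 calls. With a real api.data.gov key (5K/hr): ~4 hours. The hourly quota, not the worker count, is the bottleneck: concurrent workers only help until the quota is exhausted, so ~3.5 hours is the floor on a single key.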

Hearings (Congress.gov):

For each congress 92 through current:
- Page through /v3/hearing filtered by congress (limit=250) until exhausted.
- For each row, fetch detail: /v3/hearing/{c}/{chamber}/{jacketNumber}.
- Save raw JSON.
~35,300 hearings × 1 detail fetch ≈ 35,300 calls. With real key: ~7-8 hours.
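
The arithmetic behind both estimates, as a rough sketch. It assumes the 5K/hr figure is a hard hourly quota on a single key, counts one detail call per record plus the list pages at limit=250, and ignores the few extra list pages that per-congress paging would add.

```python
import math

HOURLY_QUOTA = 5_000  # calls per hour on a real api.data.gov key (figure quoted above)
PAGE_SIZE = 250       # limit= used when paging the list endpoints


def backfill_hours(records: int, hourly_quota: int = HOURLY_QUOTA, page_size: int = PAGE_SIZE) -> float:
    """Lower bound on wall-clock hours when the hourly quota is the bottleneck."""
    list_calls = math.ceil(records / page_size)  # pages needed to enumerate every record
    total_calls = list_calls + records           # plus one detail fetch per record
    return total_calls / hourly_quota


if __name__ == "__main__":
    print(f"meetings: {backfill_hours(17_600):.2f} h")
    print(f"hearings: {backfill_hours(35_300):.2f} h")
```

Both backfills draw on the same key, so running them back to back adds the two figures together.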

GovInfo enrichment:

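For each hearing, parse the GovInfo package id out of the formats[] URLs (the /hearing record has no packageId field), then fetch /wssearch/getContentDetail?packageId=... to populate sudoc_class_number and hearing_sub_type. Save raw JSON. Free and unauthenticated; concurrency is limited only by politeness.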

Sitemap-driven shortcut (alternative):

Instead of paginating the API for older transcripts, fetch GovInfo CHRG sitemaps year-by-year (CHRG_{1995..2024}_sitemap.xml). Each entry has packageId + lastmod. For each package, fetch getContentDetail + MODS. Skip the API entirely for backfill of transcript records older than 113th congress (where API coverage is thinner anyway).

Daily incremental

Every 30 minutes: meetings. /v3/committee-meeting?fromDateTime={last_run}&sort=updateDate desc, paginating until we reach a row whose updateDate is no newer than the update_date we already store for it. Refetch detail for each newer row.
Every 30 minutes — hearings. /v3/hearing?fromDateTime={last_run}&sort=updateDate desc.
Daily — upcoming-meeting refresh. For each meeting with meeting_status = 'Scheduled' and meeting_date <= now() + interval '14 days', refetch detail. Captures last-minute postponements.
Weekly — sitemap diff. Fetch the current-year CHRG sitemap. For any <lastmod> newer than hearings.sitemap_lastmod, refetch GovInfo metadata.
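
The stop condition for the 30-minute ticks can be sketched as a generator over pages sorted updateDate desc. This assumes earlier runs completed (so everything at or below a stored timestamp is already ingested) and that updateDate strings share one ISO-8601 UTC format, which makes string comparison safe. The fetch_page callable and the eventId/updateDate row keys are placeholders for illustration.

```python
from typing import Callable, Dict, Iterator, List

Row = Dict[str, str]


def changed_rows(
    fetch_page: Callable[[int], List[Row]],
    stored: Dict[str, str],
    page_size: int = 250,
) -> Iterator[Row]:
    """Yield rows newer than what we hold, stopping at the first already-current row."""
    offset = 0
    while True:
        page = fetch_page(offset)
        for row in page:
            known = stored.get(row["eventId"])
            # Rows arrive newest first, so the first current row ends the walk.
            if known is not None and row["updateDate"] <= known:
                return
            yield row
        if len(page) < page_size:
            return  # short page: the listing is exhausted
        offset += page_size
```

Each yielded row would then get a detail refetch, as described in the schedule above.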

State tracking

Source keys: hearings_meetings, hearings_transcripts_api, hearings_govinfo_enrich. Each tracked separately in ingestion_runs so we can re-run pieces without redoing the world.

Failure modes

Senate Akamai for committee schedule pages — irrelevant if we trust Congress.gov as canonical (which we should; see Source priority decision). Skip Senate per-committee scrapes.
Cancelled or rescheduled meetings — meeting row's event_id stays the same; the row's meeting_status and possibly meeting_date change. Use update_date to detect.
Witness changes mid-flight — common. The witness list updates as people drop out. Always replace meeting_witnesses rows on every detail fetch (they're cheap; no FK references to specific witness rows).
Document URL rot — committees occasionally re-upload documents under different filenames. Detect via the URL diff at parse time; archive the old meeting_documents row (don't delete) and add the new one.
GovInfo hearing_sub_type missing for older volumes — pre-2009ish CHRG packages don't have all the metadata fields. Treat as nullable.
Joint hearings — a meeting may legitimately have multiple committees from both chambers. Schema handles it; tests should cover it.
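
For the witness-change case, the replace-on-every-fetch approach might look like the sketch below. It uses sqlite3 only so the example runs standalone; the real table is the Postgres meeting_witnesses defined in the schema, and both function names are hypothetical.

```python
import sqlite3
from typing import Iterable, Optional, Tuple


def make_demo_db() -> sqlite3.Connection:
    """In-memory stand-in for the Postgres meeting_witnesses table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE meeting_witnesses ("
        " id INTEGER PRIMARY KEY,"
        " meeting_id TEXT NOT NULL,"
        " sequence INTEGER NOT NULL,"
        " name TEXT NOT NULL,"
        " position TEXT,"
        " UNIQUE (meeting_id, sequence))"
    )
    return conn


def replace_witnesses(
    conn: sqlite3.Connection,
    meeting_id: str,
    witnesses: Iterable[Tuple[str, Optional[str]]],
) -> None:
    """Swap the full witness list for one meeting inside a single transaction."""
    rows = [
        (meeting_id, seq, name, position)
        for seq, (name, position) in enumerate(witnesses, start=1)
    ]
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("DELETE FROM meeting_witnesses WHERE meeting_id = ?", (meeting_id,))
        conn.executemany(
            "INSERT INTO meeting_witnesses (meeting_id, sequence, name, position)"
            " VALUES (?, ?, ?, ?)",
            rows,
        )
```

Doing the delete and insert in one transaction means readers never see a meeting with a half-replaced witness list.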

Open questions

These don't block ingestion but should be resolved before this source is "shipped":

Witness ↔ legislator resolution. Witnesses are free-text names ("Ms. Vijaya Gadde"). Some witnesses are former or sitting government officials we have records for. Building a witness ↔ canonical-person index is valuable but a separate task; defer to v1.x.
Markup ↔ bill cross-reference. When a meeting type is Markup, the title usually mentions specific bills ("Markup of H.R. 498"). Extracting the bill IDs from the title and adding to bill_committees.activity_name = 'Marked up by' is worth doing — but the API does not expose a structured field for this. Consider a regex/NER pass at parse time.
Hearing-transcript handoff. The actual transcript text + speaker turns lives in source #7. This source's hearings table records the existence, location, and metadata of a printed volume; the text and witness-statements get parsed separately. Avoid duplication.
PDF body caching policy. Witness statements (PDFs) average 5-30 pages each, ~30-200 KB. If we cache all of them eagerly, ~1-3 GB total. Lean toward eager-caching witness statements (they're high-signal for the agent's vote-prediction and witness-tracking jobs) and lazy-fetching support documents.
Closed sessions. When meetingStatus = 'Closed', no transcript is published and witnesses may be redacted. Schema handles it; document downstream behavior (the agent should know "this hearing was closed" rather than throwing on missing data).
Field hearings outside DC. location.building and location.room may be free-text city/venue names rather than building codes. Schema accepts any text, but downstream "search by building" filters will need normalization.
Pre-113th meeting calendar. Congress.gov /committee-meeting only goes back to the 113th. Earlier meetings exist as /hearing records (with dates[]) but no scheduled-meeting record. Decide if backfilling pre-113th events from the hearing dates is in v1 scope.
Real-time gavel time vs scheduled time. The API's meetingStatus = 'Held' flips after the gavel falls, but meeting_date may still reflect the scheduled time, not the actual gavel time. C-SPAN has the actual gavel time. v2 enrichment.
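
For the markup-to-bill question, a regex pass over the title could start from something like this sketch. The set of bill-type prefixes and the normalised output codes (hr, s, hjres, and so on) are assumptions made for illustration, not fields the API provides, and real titles will need more cases (ranges, amendments, "H.R." without a number).

```python
import re
from typing import List, Tuple

# Longer prefixes come first so "H.Res." is not read as "H.R." plus debris.
# The lookbehind avoids matching the "S." inside strings such as "U.S. 5".
BILL_RE = re.compile(
    r"(?<![\w.])"
    r"(H\.\s?J\.\s?Res\.|S\.\s?J\.\s?Res\."
    r"|H\.\s?Con\.\s?Res\.|S\.\s?Con\.\s?Res\."
    r"|H\.\s?Res\.|S\.\s?Res\.|H\.\s?R\.|S\.)"
    r"\s?(\d+)"
)


def extract_bills(title: str) -> List[Tuple[str, int]]:
    """Return (bill_type, number) pairs mentioned in a meeting title."""
    bills = []
    for prefix, number in BILL_RE.findall(title):
        bill_type = re.sub(r"[.\s]", "", prefix).lower()  # "H.J.Res." -> "hjres"
        bills.append((bill_type, int(number)))
    return bills
```

The congress is not in the title, so a caller would pair each result with the meeting's own congress before looking up a bill row.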