Congressional Record

The official daily journal of floor proceedings in the House and Senate, plus the Daily Digest summary, plus the Extensions of Remarks (member-submitted statements that weren't actually delivered on the floor). The Congressional Record is the record of what was said in Congress — every speech, every parliamentary motion, every quorum call, every vote tally, plus the fully-edited statements members "extend" into the record without speaking.

For Josh, this is the densest citation source in the v1 corpus. The MODS metadata is extraordinarily rich: per-granule bioguide IDs for every speaking member, bills referenced, US Code sections cited, Statutes at Large cross-references, committees, votes, reports, amendments, RINs, and laws. The citation graph more or less falls out of CREC MODS for free.

This source is distinct from hearing transcripts: those cover committee hearings (CHRG), while this covers floor proceedings (CREC).

Quick reference


Source name	Congressional Record
Publisher	Government Publishing Office (GPO); produced from House and Senate session transcripts
License	Public domain
Coverage	Published since 1873; the online corpus is systematic from 1995 to the present (see the GPO CREC collection). The bound Congressional Record (the year-end consolidated version) lags the daily edition by years.
Auth	None. Public domain, no key required; ~1–2 req/sec polite. See GovInfo for developers.
Stable ID format	Issue: `crec:{YYYY-MM-DD}` e.g. `crec:2025-12-18`. Granule: `crec-granule:{granuleId}` e.g. `crec-granule:CREC-2025-12-18-pt1-PgS8351`.
Status	exploring — schema drafted, ingestion not built

Endpoint patterns, package/granule ID construction rules, the per-year sitemaps, and the MODS bulk-data layout are upstream GPO reference — see GovInfo bulk data and the GPO MODS Bulk Data User Guide. Storage and volume estimates track the CREC collection.

Source priority decision

Primary: GovInfo CREC bulk. The canonical GPO output. Per-day package with PDF (full + per-section), MODS XML (rich metadata), and per-granule HTML rendering. Reuse the parser scaffolding from hearing transcripts and committee reports.

Secondary accelerator: unitedstates/congressional-record parser. A community-maintained Python parser that takes daily HTML and splits into per-speaker segments. Public-domain (CC0). Useful as a reference implementation; we may inline-port to our own parser rather than depending on it at runtime.

Skip: thomas.loc.gov / congress.gov "Congressional Record" landing. Slower to ingest, redirects to GovInfo for bulk anyway.

Skip: pre-1995 historical Congressional Record. HathiTrust has scans; the Library of Congress has structured indexes from 1873 onward. Out of scope for v1; revisit in v2 if a specific use case appears.

Quirks and probe findings

GovInfo access is open and polite (~1–2 req/sec; real-browser User-Agent harmless), with the same patterns as CHRG / CRPT. The mechanics that bit us in practice:

HTML is per-granule only. The issue-level HTML rendering at /content/pkg/CREC-{date}/html/CREC-{date}.htm redirects to a GovInfo error page; there is no single combined-issue HTML. Instead, each speaker turn / floor segment is its own granule with its own HTML rendering, and the granule list comes from the issue's MODS XML. This is the primary text-extraction surface.
Sitemap <lastmod> runs far past <dateIssued>. Many 2024 packages were re-touched in 2025-10, so the per-URL sitemap <lastmod> is well ahead of the package's <dateIssued>. The weekly sync must diff a trailing window (last ~90 days) to catch these retroactive corrections rather than trusting <lastmod> as a publish date.
Granule HTML 404. Some older granules have only PDF representation. Fall back to PDF text extraction for the granule's pages (pdftotext -f {start_page} -l {end_page}).
MODS shape drift. Pre-2010 issues carry less metadata in <extension> (some tags were added in later years). Parse defensively and store missing fields as NULL.
<congMember> is authoritative. When a member's name appears in [Mr. SMITH] form, take the member's identity from the MODS <congMember> element; don't re-resolve it from the text.
<time> is often absent. Many granules don't have <time> — Daily Digest and Extensions of Remarks rarely do.
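
The HTML-then-PDF fallback described above could be sketched as follows. This is a minimal illustration, not the shipped fetcher: `granule_text`, its arguments, and the injected `run` callable are hypothetical names, and the page numbers are assumed to be 1-based PDF page indices that the caller has already mapped from Record pages.

```python
import subprocess


def pdftotext_command(pdf_path, start_page, end_page):
    """Build the pdftotext argv for a page range, writing text to stdout ("-")."""
    if start_page < 1 or end_page < start_page:
        raise ValueError("page range must be 1-based and non-empty")
    return ["pdftotext", "-f", str(start_page), "-l", str(end_page), pdf_path, "-"]


def granule_text(html_status, html_body, pdf_path, start_page, end_page,
                 run=subprocess.run):
    """Return (text, source), preferring granule HTML and falling back to PDF.

    html_status / html_body come from the caller's HTTP client (hypothetical).
    run is injectable so the fallback can be exercised without poppler installed.
    """
    if html_status == 200 and html_body:
        return html_body, "html"
    if html_status == 404:
        result = run(
            pdftotext_command(pdf_path, start_page, end_page),
            capture_output=True, text=True, check=True,
        )
        return result.stdout, "pdf"
    raise RuntimeError(f"unexpected granule HTML status {html_status}")
```

Recording which surface the text came from ("html" or "pdf") is a design choice of this sketch; it makes it easy to re-extract PDF-derived granules later if an HTML rendering appears.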

State key for incremental sync: crec. State stores last-seen issue date + per-issue-date last_modified. The daily edition is published the morning after the session day. Polling/feed mechanics are upstream — see GovInfo feeds.
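
One way to read the trailing-window rule together with the state shape above is sketched below. It assumes the ~90-day window applies to the sitemap `<lastmod>` value (so that 2024 packages re-touched in late 2025 are still caught); the dictionary layout and function names are illustrative, not the real sync code.

```python
from datetime import date, datetime, timedelta, timezone

TRAILING_WINDOW = timedelta(days=90)  # "last ~90 days" from the quirk above


def issues_to_refetch(sitemap_entries, state, now, window=TRAILING_WINDOW):
    """Pick issue dates that need a (re)fetch.

    sitemap_entries: iterable of (issue_date, lastmod) pairs from the sitemap.
    state: {"last_seen_issue_date": date or None,
            "last_modified": {issue_date: lastmod}}  (assumed layout)
    """
    known = state["last_modified"]
    cutoff = now - window
    due = []
    for issue_date, lastmod in sitemap_entries:
        previous = known.get(issue_date)
        if previous is None:
            due.append(issue_date)  # never ingested
        elif lastmod >= cutoff and lastmod > previous:
            due.append(issue_date)  # re-touched inside the trailing window
    return sorted(due)


def record_fetch(state, issue_date, lastmod):
    """Update sync state after an issue has been fetched successfully."""
    state["last_modified"][issue_date] = lastmod
    last_seen = state["last_seen_issue_date"]
    if last_seen is None or issue_date > last_seen:
        state["last_seen_issue_date"] = issue_date
```

Comparing against the stored per-issue `last_modified` rather than treating `<lastmod>` as a publish date is the point of the sketch: an old issue date with a fresh `<lastmod>` is a correction, not a new issue.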

Stable ID formats

Issue: crec:{YYYY-MM-DD}

Examples:

crec:2025-12-18
crec:2024-01-08

Granule: crec-granule:{granuleId}

Examples:

crec-granule:CREC-2025-12-18-pt1-PgS8351
crec-granule:CREC-2025-12-18-pt1-PgH6048
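
The two ID formats can be built and taken apart with a few lines of Python. This is a sketch under stated assumptions: the granule ID pattern is inferred only from the two examples above, the optional trailing `-N` suffix is a guess at how same-page granules might be disambiguated, and the helper names are hypothetical.

```python
import re
from datetime import date

# Shape inferred from the example granule IDs above. The optional "-N" suffix
# is an assumption; IDs that do not match simply return None.
_GRANULE_ID = re.compile(
    r"^CREC-(?P<date>\d{4}-\d{2}-\d{2})-pt(?P<part>\d+)-Pg(?P<page>[A-Z]\d+)(?:-\d+)?$"
)


def issue_stable_id(issue_date: date) -> str:
    """Format an issue ID as crec:{YYYY-MM-DD}."""
    return f"crec:{issue_date.isoformat()}"


def granule_stable_id(granule_id: str) -> str:
    """Format a granule ID as crec-granule:{granuleId}."""
    return f"crec-granule:{granule_id}"


def parse_granule_id(granule_id: str):
    """Split a granule ID into its parent issue ID, part number, and first page."""
    m = _GRANULE_ID.match(granule_id)
    if m is None:
        return None
    return {
        "issue_id": issue_stable_id(date.fromisoformat(m["date"])),
        "part": int(m["part"]),
        "page_start": m["page"],
    }
```

Deriving the parent issue ID from the granule ID keeps the `issue_id` foreign key consistent without a second lookup, at the cost of depending on the ID shape staying stable.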

MODS structure (the high-value layer)

Issue-level MODS has standard publication metadata + an array of <relatedItem type="constituent"> elements, one per granule. Each constituent has its own <extension> block with the citation graph. The full field-by-field <mods>/<relatedItem>/<extension> layout is documented upstream in the GPO MODS Bulk Data User Guide.

The interpretation that matters for Josh: the MODS <extension> element on each granule is the citation graph row, carrying <congMember>, <bill>, <USCode>, <statuteAtLarge>, <congCommittee>, <congVote>, <congReport>, <amendment>, <law>, and <rin>. We unpack it into our schema's edge tables (below). Expect roughly 300 or more <relatedItem> elements per issue; the 2025-12-18 sample has 326 granules.
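
A rough sketch of that unpacking step, using only the standard library, is shown below. Treat the details as assumptions: the `<accessId>` element as the carrier of the granule ID, the attribute names in the sample, and the sample document itself are hand-written for illustration and are not real GPO output; the upstream user guide is the reference for the actual layout.

```python
import xml.etree.ElementTree as ET

# Edge-bearing tags named in the text above.
EDGE_TAGS = {
    "congMember", "bill", "USCode", "statuteAtLarge", "congCommittee",
    "congVote", "congReport", "amendment", "law", "rin",
}


def _local(tag):
    """Strip any '{namespace}' prefix so the parser is namespace-agnostic."""
    return tag.rsplit("}", 1)[-1]


def granule_edges(mods_xml):
    """Map granule ID -> granule class plus a list of (tag, attributes, text) edges."""
    root = ET.fromstring(mods_xml)
    out = {}
    for item in root.iter():
        if _local(item.tag) != "relatedItem" or item.get("type") != "constituent":
            continue
        granule_id, granule_class, edges = None, None, []
        for ext in (e for e in item if _local(e.tag) == "extension"):
            for child in ext:
                name = _local(child.tag)
                text = (child.text or "").strip() or None  # missing -> None
                if name == "accessId":  # assumed carrier of the granule ID
                    granule_id = text
                elif name == "granuleClass":
                    granule_class = text
                elif name in EDGE_TAGS:
                    edges.append((name, dict(child.attrib), text))
        if granule_id is not None:
            out[granule_id] = {"granule_class": granule_class, "edges": edges}
    return out


# Hand-written sample with placeholder values, for illustration only.
SAMPLE = """<mods xmlns="http://www.loc.gov/mods/v3">
  <relatedItem type="constituent">
    <extension>
      <accessId>CREC-2025-12-18-pt1-PgS8351</accessId>
      <granuleClass>SENATE</granuleClass>
      <congMember bioGuideId="X000000" role="SPEAKING"/>
    </extension>
  </relatedItem>
</mods>"""
```

Keeping each edge as a generic (tag, attributes, text) tuple defers the mapping into the specific edge tables, which fits the defensive-parsing stance for older issues with sparser `<extension>` blocks.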

Enum reference (consumed by the DDL CHECK constraints)

Granule classes (MODS <granuleClass>) — the values the crec_granules.granule_class CHECK constraint enforces:

Value	Meaning
`HOUSE`	House floor proceedings
`SENATE`	Senate floor proceedings
`EXTENSIONS`	Extensions of Remarks (member-submitted, not delivered on floor)
`DAILYDIGEST`	Daily Digest summary entries (committee meeting notices, programs for next day)

Sample distribution from one day (2025-12-18): 148 HOUSE + 135 SENATE + 37 EXTENSIONS + 6 DAILYDIGEST = 326 granules.

congMember role — the values stored in crec_granule_members.role:

Value	Meaning
`SPEAKING`	Member is speaking
`PRESIDING`	Member presiding (Speaker, President pro tempore, etc.)
`SUBMITTING`	Member submitting an extension or document

HTML body shape (per-granule)

Each granule HTML wraps the body in <html><body><pre>...</pre></body></html>. Inside <pre> are a heading line (speaker name in [Mr. NAME (XX-NN)] form or topic title), the speaker turn body with [Page H6048] page-break markers, and bracket-wrapped procedural notes ([Time: 10:23 a.m.], [Roll No. 362], [Whereupon, the House proceeded to ...]).

The speaker turn parser is similar to CHRG but with different formatting conventions — names appear as [The PRESIDING OFFICER], [Mr. SCHUMER], Mr. SCHUMER. Mr. President, .... The bracket-marker rendering detail itself is upstream — see the GPO CREC collection.
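
To make the body shape concrete, here is a small regex-based sketch that pulls page markers, time markers, roll call numbers, and turn openers out of a granule's `<pre>` block. The patterns are assumptions derived only from the examples on this page (they cover the unbracketed "Mr. SCHUMER." turn opener and a handful of honorifics), and the sample HTML is hand-written rather than taken from GovInfo.

```python
import html
import re

PRE_RE = re.compile(r"<pre>(.*?)</pre>", re.DOTALL | re.IGNORECASE)
PAGE_RE = re.compile(r"\[Page ([HSED]\d+)\]")
TIME_RE = re.compile(r"\[Time: (\d{1,2}:\d{2} [ap]\.m\.)\]")
ROLL_RE = re.compile(r"\[Roll No\. (\d+)\]")
# A turn opens a line with an honorific plus an upper-case surname, or a chair
# title, followed by a period. "Mr. President" does not match (mixed case).
TURN_RE = re.compile(
    r"^[ \t]*((?:Mr|Mrs|Ms)\. [A-Z][A-Z'-]+|The PRESIDING OFFICER"
    r"|The SPEAKER(?: pro tempore)?)\.\s",
    re.MULTILINE,
)


def parse_granule_html(raw_html):
    """Extract markers from the <pre> body; fall back to the whole document."""
    m = PRE_RE.search(raw_html)
    body = html.unescape(m.group(1) if m else raw_html)
    return {
        "pages": PAGE_RE.findall(body),
        "times": TIME_RE.findall(body),
        "roll_calls": [int(n) for n in ROLL_RE.findall(body)],
        "speakers": TURN_RE.findall(body),
    }


# Hand-written sample in the shape described above.
SAMPLE_HTML = """<html><body><pre>
[Page S8351]

  The PRESIDING OFFICER. The Senator from New York.
  Mr. SCHUMER. Mr. President, I rise today.

[Time: 10:23 a.m.]
[Roll No. 362]
</pre></body></html>"""
```

The speaker names found this way are only useful for segmenting turns; per the quirk above, member identity should still come from the MODS `<congMember>` data.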

Schema (Postgres DDL)

Status: exploring. This DDL is drafted, not shipped — ingestion for this source is not yet built. Cross-check data status before relying on any of it. Migrations under shared/josh_substrate/.../migrations/versions/ are the schema source of truth once this source is built. A content gap to close before shipping: this page records vector(1024) + HNSW in the DDL but does not yet name a chunker strategy or carry a chunker eval result.

-- ============================================================
-- Congressional Record issues (one per session day)
-- ============================================================

CREATE TABLE crec_issues (
    id text PRIMARY KEY,                          -- 'crec:2025-12-18'
    package_id text NOT NULL UNIQUE,              -- 'CREC-2025-12-18'
    issue_date date NOT NULL UNIQUE,

    volume int NOT NULL,                          -- 171
    issue_number int NOT NULL,                    -- 214
    congress smallint NOT NULL,
    session smallint NOT NULL,

    -- Page ranges by section
    house_pages_start text,                       -- 'H6027'
    house_pages_end text,                         -- 'H6119'
    senate_pages_start text,
    senate_pages_end text,
    extensions_pages_start text,
    extensions_pages_end text,
    digest_pages_start text,
    digest_pages_end text,

    -- Source URLs (per PDF view)
    full_pdf_url text,
    house_pdf_url text,
    senate_pdf_url text,
    extensions_pdf_url text,
    digest_pdf_url text,
    mods_url text,

    -- Lifecycle
    raw_mods_xml bytea,                           -- gzipped
    sitemap_lastmod timestamptz,
    fetched_at timestamptz NOT NULL,
    parsed_at timestamptz,
    inserted_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX crec_issues_date ON crec_issues (issue_date DESC);
CREATE INDEX crec_issues_volume ON crec_issues (volume, issue_number);

-- ============================================================
-- Per-granule rows (each speaker turn / topic block / digest entry)
-- ============================================================

CREATE TABLE crec_granules (
    id text PRIMARY KEY,                          -- 'crec-granule:CREC-2025-12-18-pt1-PgS8351'
    granule_id text NOT NULL UNIQUE,              -- 'CREC-2025-12-18-pt1-PgS8351'
    issue_id text NOT NULL REFERENCES crec_issues(id) ON DELETE CASCADE,

    granule_class text NOT NULL CHECK (granule_class IN (
        'HOUSE', 'SENATE', 'EXTENSIONS', 'DAILYDIGEST'
    )),
    chamber text CHECK (chamber IN ('HOUSE', 'SENATE') OR chamber IS NULL),

    title text,                                   -- per-granule title (often the topic)
    page_start text NOT NULL,                     -- 'S8351'
    page_end text,
    granule_time time,                            -- approximate speech time (when present)

    -- Body
    body_text text,
    body_html bytea,                              -- gzipped raw HTML

    -- Source URLs
    granule_html_url text,
    granule_mods_url text,

    -- Lifecycle
    fetched_at timestamptz NOT NULL,
    parsed_at timestamptz,
    inserted_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX crec_granules_issue ON crec_granules (issue_id);
CREATE INDEX crec_granules_class ON crec_granules (granule_class, page_start);

ALTER TABLE crec_granules ADD COLUMN body_tsv tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(body_text, '')), 'D')
    ) STORED;
CREATE INDEX crec_granules_body ON crec_granules USING gin (body_tsv);

-- ============================================================
-- Citation graph edges (the high-value MODS extension data)
-- ============================================================

-- Members speaking in a granule
CREATE TABLE crec_granule_members (
    id bigserial PRIMARY KEY,
    granule_id text NOT NULL REFERENCES crec_granules(id) ON DELETE CASCADE,
    bioguide_id text NOT NULL REFERENCES legislators(bioguide_id),
    role text NOT NULL,                           -- 'SPEAKING', 'PRESIDING', 'SUBMITTING'
    chamber text,
    party text,
    state text,
    UNIQUE (granule_id, bioguide_id, role)
);
CREATE INDEX crec_granule_members_member ON crec_granule_members (bioguide_id);

-- Bills referenced
CREATE TABLE crec_granule_bills (
    granule_id text NOT NULL REFERENCES crec_granules(id) ON DELETE CASCADE,
    bill_id text NOT NULL,                        -- 'bill:119-hr-3447'; soft FK
    relationship text NOT NULL,                   -- 'PRIMARY' | 'OTHER'
    PRIMARY KEY (granule_id, bill_id, relationship)
);
CREATE INDEX crec_granule_bills_bill ON crec_granule_bills (bill_id);

-- Committees referenced
CREATE TABLE crec_granule_committees (
    granule_id text NOT NULL REFERENCES crec_granules(id) ON DELETE CASCADE,
    committee_id text REFERENCES committees(id),
    committee_system_code text NOT NULL,          -- part of the PK, so never NULL
    PRIMARY KEY (granule_id, committee_system_code)
);

-- Roll call votes referenced
CREATE TABLE crec_granule_votes (
    granule_id text NOT NULL REFERENCES crec_granules(id) ON DELETE CASCADE,
    vote_id text NOT NULL,                        -- 'vote:119-h-1-362'; soft FK
    PRIMARY KEY (granule_id, vote_id)
);

-- Generic references for USC, Statutes at Large, Reports, Laws, Amendments, RINs
CREATE TABLE crec_granule_citations (
    id bigserial PRIMARY KEY,
    granule_id text NOT NULL REFERENCES crec_granules(id) ON DELETE CASCADE,
    citation_type text NOT NULL CHECK (citation_type IN (
        'us_code', 'statute_at_large', 'committee_report', 'public_law',
        'amendment', 'rin', 'executive_order'
    )),
    citation_value text NOT NULL,                 -- the raw citation string
    -- Resolved IDs when possible (soft FKs to other tables)
    resolved_id text                              -- e.g. 'usc:42-1396'; NULL when unresolved
);
CREATE INDEX crec_granule_citations_granule ON crec_granule_citations (granule_id, citation_type);

-- Vector chunks
CREATE TABLE crec_granule_chunks (
    id bigserial PRIMARY KEY,
    granule_id text NOT NULL REFERENCES crec_granules(id) ON DELETE CASCADE,
    chunk_index int NOT NULL,
    chunk_text text NOT NULL,
    embedding vector(1024),
    UNIQUE (granule_id, chunk_index)
);
CREATE INDEX crec_granule_chunks_embedding
    ON crec_granule_chunks USING hnsw (embedding vector_cosine_ops);

Schema decisions worth flagging:

crec_issues and crec_granules as separate tables — issues are the "container" with paging/issue-level metadata; granules are the body. Most retrieval is at the granule level (one speaker turn or one topic block).
crec_granule_members is the gold mine: every member statement traces directly to a bioguide ID, which makes "what has Senator X said about education" a clean query.
Citation edges in dedicated tables — crec_granule_bills, crec_granule_votes, crec_granule_committees for high-volume joins. crec_granule_citations as a catchall for less-frequent reference types (USC, statutes at large, etc.).
raw_mods_xml preserved at issue level — the file is large (5 MB) but full preservation supports re-parsing as our citation extractor matures.
Full-text search targets crec_granules.body_text (via body_tsv), not the issue level, because granule-level retrieval is the dominant pattern.
Time precision (granule_time) preserved when available — useful for cross-referencing C-SPAN video timestamps in v1.x.
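
As an illustration of the "what has Senator X said about education" query mentioned above, the sketch below builds a parameterized statement against the drafted tables. It assumes a driver that accepts psycopg-style named placeholders (`%(name)s`) and Postgres 11 or later for `websearch_to_tsquery`; the function name and defaults are hypothetical, and nothing here has been run against a real database since ingestion is not built.

```python
def member_statements_query(bioguide_id, terms, role="SPEAKING", limit=50):
    """Return (sql, params) for a member-scoped full-text search over granules."""
    sql = """
        SELECT g.id, i.issue_date, g.title, g.page_start,
               ts_rank(g.body_tsv, q) AS rank
        FROM crec_granule_members AS m
        JOIN crec_granules AS g ON g.id = m.granule_id
        JOIN crec_issues AS i ON i.id = g.issue_id
        CROSS JOIN websearch_to_tsquery('english', %(terms)s) AS q
        WHERE m.bioguide_id = %(bioguide_id)s
          AND m.role = %(role)s
          AND g.body_tsv @@ q
        ORDER BY rank DESC, i.issue_date DESC
        LIMIT %(limit)s
    """
    params = {
        "bioguide_id": bioguide_id,
        "terms": terms,
        "role": role,
        "limit": limit,
    }
    return sql, params
```

Filtering on a single role leans on the UNIQUE (granule_id, bioguide_id, role) constraint, so each matching granule appears at most once per member.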

Open questions

These don't block ingestion but should be resolved before this source is "shipped":

Bound vs daily edition. Page numbers in the daily edition can shift in the year-end bound edition. Citation precision in court contexts requires the bound page. We default to daily; consider ingesting bound retrospectively (but it lags by 5+ years).
Pre-1995 historical Record. HathiTrust and LoC have scans. Out of scope for v1.
Speaker turn parser. The MODS gives us <congMember> per granule; the HTML gives us turn-by-turn speech. Granular speaker-turn-level rows (one per speaker turn within a granule) would enable "Senator X said Y at S8352 line 14" precision. Defer to v1.x; ship v1 with granule-level granularity.
Citation graph extractor reuse. The crec_granule_citations table is a candidate to merge with the global citation graph from the U.S. Code and public laws sources. When we build the unified citation graph, this table likely becomes a denormalized view.
unitedstates/congressional-record dependency. That parser handles pre-MODS-rich-metadata issues better than ours might. Decide: vendor it, fork it, or reimplement.
Daily Digest as separate source? The Daily Digest is structured differently (committee meeting notices, bills introduced, programs for next day). Could be a separate source (a markup-schedule annex). Keep in CREC for v1; consider extracting in v1.x.
Cross-reference to roll call votes. When the Record reports a roll call ([Roll No. 362]), the crec_granule_votes link needs to be resolved at parse time — Roll No. 362 plus session date plus chamber → vote_id. Build the resolver carefully.
Cross-reference to extensions and member-submitted documents. Extensions of Remarks frequently reproduce articles, statements, and reports submitted "for the Record." These are valuable text not authored on the floor. Schema treats them as regular granules with granule_class=EXTENSIONS; OK for v1.
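
For the roll call cross-reference question above, a resolver might start out like the sketch below. It assumes the `vote_id` layout is congress, chamber letter, session, roll number (read off the single `vote:119-h-1-362` example in the DDL comment), that congress and session come from the parent `crec_issues` row, and that only the `[Roll No. N]` marker form needs handling; all names are hypothetical.

```python
import re

ROLL_RE = re.compile(r"\[Roll No\. (\d+)\]")
_CHAMBER_CODE = {"HOUSE": "h", "SENATE": "s"}  # assumed letter codes


def resolve_vote_id(congress, session, chamber, roll_no):
    """Build a vote_id such as 'vote:119-h-1-362'; None if the chamber is unknown."""
    code = _CHAMBER_CODE.get(chamber)
    if code is None:
        return None
    return f"vote:{congress}-{code}-{session}-{roll_no}"


def vote_ids_in_text(body_text, congress, session, chamber):
    """Resolve every [Roll No. N] marker in a granule body, preserving order."""
    seen = []
    for number in ROLL_RE.findall(body_text):
        vote_id = resolve_vote_id(congress, session, chamber, int(number))
        if vote_id is not None and vote_id not in seen:
            seen.append(vote_id)
    return seen
```

Returning None for granules without a chamber (Daily Digest, Extensions of Remarks) keeps unresolved markers out of `crec_granule_votes` instead of guessing.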