Congressional Record ingester

Why

The Congressional Record is the official daily journal of floor
proceedings in the House and Senate, together with the Daily Digest
summary and the Extensions of Remarks. It is the record of what was said
in Congress: every speech, every parliamentary motion, every quorum
call, every vote tally.
For Josh, this is the densest citation source in the v1 corpus: MODS
metadata is extraordinarily rich (per-granule bioguide IDs for every
speaking member, bills referenced, USC sections cited, Stat. cross-
references, committees, votes, reports, amendments, RINs, laws). The
citation graph more or less falls out of CREC MODS for free. Distinct
from hearing-transcripts-ingester (committee hearings, CHRG); this is
floor proceedings (CREC).

User stories

As a policy analyst using my own AI agent, I want to ask "what did Sen. X say on the floor about the SPEED Act" so that I get the speaker turn with the printed-record citation.

As an OSS self-hoster, I want to run `josh ingest congressional-record` on a fresh substrate so that I can populate floor proceedings from GovInfo CREC on my own instance.

As a downstream agent building the citation graph, I want per-granule bioguide IDs + bill / USC / committee references already extracted so that "who said what, and what they cited" is queryable in the substrate.

Acceptance criteria (EARS)

When `josh ingest congressional-record --since 2026-01-01` is run on the deployed substrate, the system shall populate `crec_issues` rows whose six citation columns are all non-null.
When a CREC issue is parsed, the system shall populate one `crec_granules` row per granule (`HOUSE`, `SENATE`, `EXTENSIONS`, `DAILYDIGEST`) with full body text and granule-class discriminator.
When a granule MODS extension declares speaking members (`bioguideId`), the system shall populate `crec_granule_speakers` resolved to `legislators.bioguide_id`.
When a granule MODS extension declares cited bills, USC sections, committees, votes, or laws, the system shall populate the corresponding `crec_granule_*_refs` junction tables (soft-FKs into the relevant source tables, with raw citation text preserved).
While the ingester is running, if a single granule fails to parse, then the system shall log the failure to `ingestion_logs` (with `(packageId, granuleId)`) and continue with the next granule — issue-level fetch shall not abort on per-granule failure.
When an FTS5 query is executed against `crec_granules_fts`, the system shall return BM25-ranked results.
Where a granule body exceeds ~5K tokens, the system shall enqueue rows in `chunk_embedding_jobs` for chunk-level embedding into `crec_granule_chunks_vec0`.
When an incremental run completes, the system shall update `ingestion_source_state.last_seen_key` to the highest CREC sitemap `<lastmod>` observed.
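
The per-granule failure criterion and the idempotent load can be sketched together. This is a minimal illustration, not the ingester's actual code: the column names (`package_id`, `granule_id`, `body`, `error`), the `SCHEMA` string, and the `ingest_issue` helper are all assumptions made for the example, and the real tables carry many more columns.

```python
import sqlite3
from typing import Callable, Iterable, Tuple

# Hypothetical minimal schema, for illustration only.
SCHEMA = """
CREATE TABLE crec_granules (
    package_id TEXT NOT NULL,
    granule_id TEXT NOT NULL,
    body       TEXT NOT NULL,
    UNIQUE (package_id, granule_id)
);
CREATE TABLE ingestion_logs (
    package_id TEXT NOT NULL,
    granule_id TEXT NOT NULL,
    error      TEXT NOT NULL
);
"""


def ingest_issue(
    conn: sqlite3.Connection,
    package_id: str,
    granules: Iterable[Tuple[str, str]],  # (granule_id, raw_html) pairs
    parse: Callable[[str], str],          # stand-in for the real granule parser
) -> Tuple[int, int]:
    """Load every granule that parses; log the rest. Returns (loaded, failed)."""
    loaded = failed = 0
    for granule_id, raw_html in granules:
        try:
            body = parse(raw_html)
        except Exception as exc:
            # Isolate the failure to this granule and record its identity.
            conn.execute(
                "INSERT INTO ingestion_logs (package_id, granule_id, error) "
                "VALUES (?, ?, ?)",
                (package_id, granule_id, repr(exc)),
            )
            failed += 1
            continue
        # Upsert keyed on (package_id, granule_id) so re-runs do not duplicate rows.
        conn.execute(
            "INSERT INTO crec_granules (package_id, granule_id, body) "
            "VALUES (?, ?, ?) "
            "ON CONFLICT (package_id, granule_id) "
            "DO UPDATE SET body = excluded.body",
            (package_id, granule_id, body),
        )
        loaded += 1
    conn.commit()
    return loaded, failed
```

Catching the exception inside the loop, rather than around it, is what keeps one bad granule from aborting the rest of the issue. The `ON CONFLICT ... DO UPDATE` form needs SQLite 3.24 or newer.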

Success determiner

kind: bash

Command

set -euo pipefail
ssh josh 'docker exec josh-ingester josh ingest congressional-record --since 2026-04-01 --max 5'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
  SELECT
    (SELECT COUNT(*) FROM crec_issues WHERE published_at >= '\''2026-04-01'\''
      AND source_url IS NOT NULL AND raw_sha256 IS NOT NULL
      AND citation_string IS NOT NULL) AS issues,
    (SELECT COUNT(*) FROM crec_granules g
      JOIN crec_issues i ON i.id = g.issue_id
      WHERE i.published_at >= '\''2026-04-01'\'') AS granules,
    (SELECT COUNT(*) FROM crec_granule_speakers s
      JOIN crec_granules g ON g.id = s.granule_id
      JOIN crec_issues i ON i.id = g.issue_id
      WHERE i.published_at >= '\''2026-04-01'\'') AS speakers;"'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db \
  "SELECT COUNT(*) FROM crec_granules_fts WHERE crec_granules_fts MATCH '\''quorum'\'';"'

Expect

>= 5 issues, >= 500 granules (~150-300 per issue typical), >= 50 speaker resolutions, FTS5 match >= 1.

Smoke run only — 5 issues × ~300 granules each is ~1500 granule rows. Full backfill scope is 1995-present (~5,000 packages × 200-300 granules = ~1-1.5M granules) and runs detached on the server per `add-a-new-source.html`.

Clarifications needed

Backfill scope: 1995-present (full online corpus, ~5,000 packages, ~10-15 GB raw PDF + ~3-5 GB extracted text) vs current Congress only for v1 launch. Lean: full — packages are small individually, ingest is sequential, and the citation-graph value compounds with breadth.
Bound edition (`CRECB-{YYYY}`) — pull or skip for v1? Daily edition is canonical for currentness; bound has citation-page-of-record stability (page numbers shift between daily and bound). Lean: skip for v1; revisit when citation-grade page numbers matter.
`unitedstates/congressional-record` parser — inline-port to our parser or runtime dependency? Lean: inline-port (their parser is CC0; we'd vendor under `shared/josh_substrate/parsers/crec_html_v1.py` so we control the version).
Per-granule chunking strategy: chunk by paragraph, by speaker turn, or by fixed-size with overlap? Lean: by speaker turn (granules are already small enough that one granule = one chunk for ~80% of granules; multi-speaker granules split at speaker boundaries).
Daily Digest granule body — store full text + structured-extraction (committee meeting notices into `meetings_announced` rows) or just full text? Lean: full text for v1; structured extraction is v1.x.

Out of scope

State legislative records (CA / NY analogues) — Phase 6+.
Pre-1995 historical Congressional Record — Hathitrust scans available; out of scope for v1.
Bound edition (`CRECB-`) — daily edition is canonical for v1.
thomas.loc.gov / congress.gov "Congressional Record" landing — slower than GovInfo bulk, redirects there anyway.
Real-time webhooks / push subscriptions — surface-layer concern.

Dependencies

Plan

Source module follows the standard Source protocol. Per-source spec at
https://docs.usejosh.com/sources/congressional-record/.

Discover/fetch/parse/load shape:
- discover: GovInfo per-year sitemap
https://www.govinfo.gov/sitemap/CREC_{YYYY}_sitemap.xml (~150-200
entries per year, one per session day). Watermark from
ingestion_source_state.last_seen_key='crec-{YYYY}-lastmod' is the
highest <lastmod> seen.
- fetch: per-issue ZIP at /content/pkg/CREC-{date}.zip (PDFs +
per-granule HTMLs + MODS) to /data/corpus/congressional-record/bodies/raw/.
Or per-granule HTML at /content/pkg/CREC-{date}/html/CREC-{date}-pt1-Pg{page}.htm
if we want streaming-style ingest (lower disk peak). MODS at
/metadata/pkg/CREC-{date}/mods.xml (~5 MB per issue, rich).
- parse: crec_html_v1 for granule body text (vendored from
unitedstates/congressional-record); crec_mods_v1 lxml walker for
issue + granule metadata (speakers, bills, USC, committees, votes,
reports, amendments, RINs, laws). Markdown body per granule to
bodies/markdown/{packageId}/{granuleId}.md.
- load: idempotent upsert on (packageId) for issues; on
(packageId, granuleId) for granules. Speaker rows resolved against
legislators.bioguide_id. Bill / USC / committee / vote refs as
soft-FKs (resolved when target is in substrate, raw text preserved
otherwise). FTS5 sync triggers fire on insert/update/delete. Body
chunks > ~5K tokens enqueued for embedding.
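
The discover step above can be sketched as a pure function over the sitemap XML. This is a sketch under stated assumptions: `discover` is a hypothetical helper name, the sitemap is assumed to use the standard sitemaps.org namespace, and `<lastmod>` values are assumed to be uniformly formatted ISO 8601 UTC strings so that plain string comparison orders them correctly. Fetching the XML and persisting the watermark are left out.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

# Standard sitemap namespace (assumed to be what the CREC sitemaps declare).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def discover(sitemap_xml: str, watermark: str | None) -> tuple[list[str], str | None]:
    """Return (locs newer than the watermark, highest <lastmod> observed).

    A watermark of None means a first run, so every entry counts as new.
    """
    root = ET.fromstring(sitemap_xml)
    fresh: list[str] = []
    highest = watermark
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if not loc or not lastmod:
            continue  # skip malformed entries rather than guessing
        if watermark is None or lastmod > watermark:
            fresh.append(loc)
        if highest is None or lastmod > highest:
            highest = lastmod
    return fresh, highest
```

Returning the new high-water mark alongside the work list lets the caller write it to `ingestion_source_state` only after the run completes, so an interrupted run is retried from the old watermark.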

Migration shape: crec_issues, crec_granules, crec_granule_speakers,
crec_granule_bill_refs, crec_granule_usc_refs, crec_granule_committee_refs,
crec_granule_vote_refs, crec_granule_report_refs, crec_granule_amendment_refs,
crec_granule_law_refs, crec_granule_rin_refs, crec_granules_fts (FTS5
virtual table) + 3 sync triggers, crec_granule_chunks +
crec_granule_chunks_vec0 (vec0 with float[1024] and bit[1024]).
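
One way the FTS5 table and its three sync triggers could look is the external-content pattern from the SQLite FTS5 documentation, sketched here with Python's built-in `sqlite3`. Whether the real migration uses an external-content table is an assumption, as are the single `body` column, the trigger names, and the `search` helper. The sketch also requires an SQLite build with FTS5 enabled.

```python
import sqlite3

# Hypothetical, reduced schema: one text column indexed via an
# external-content FTS5 table kept in sync by three triggers.
FTS_SCHEMA = """
CREATE TABLE crec_granules (
    id   INTEGER PRIMARY KEY,
    body TEXT NOT NULL
);
CREATE VIRTUAL TABLE crec_granules_fts USING fts5(
    body, content='crec_granules', content_rowid='id'
);
CREATE TRIGGER crec_granules_ai AFTER INSERT ON crec_granules BEGIN
    INSERT INTO crec_granules_fts(rowid, body) VALUES (new.id, new.body);
END;
CREATE TRIGGER crec_granules_ad AFTER DELETE ON crec_granules BEGIN
    INSERT INTO crec_granules_fts(crec_granules_fts, rowid, body)
    VALUES ('delete', old.id, old.body);
END;
CREATE TRIGGER crec_granules_au AFTER UPDATE ON crec_granules BEGIN
    INSERT INTO crec_granules_fts(crec_granules_fts, rowid, body)
    VALUES ('delete', old.id, old.body);
    INSERT INTO crec_granules_fts(rowid, body) VALUES (new.id, new.body);
END;
"""


def search(conn: sqlite3.Connection, query: str, limit: int = 10) -> list:
    """Return granule ids matching the query, best BM25 match first."""
    rows = conn.execute(
        "SELECT rowid FROM crec_granules_fts "
        "WHERE crec_granules_fts MATCH ? "
        "ORDER BY bm25(crec_granules_fts) "  # lower score means a better match
        "LIMIT ?",
        (query, limit),
    ).fetchall()
    return [row[0] for row in rows]
```

With this layout the index stores no second copy of the body text, and an upsert that takes the `DO UPDATE` path fires the update trigger, so the index follows the base table.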

Schedule: daily 06:00 UTC. CREC publishes the next morning UTC after a
session day; daily polling catches new issues with no urgency pressure.
Yearly sitemap re-checked on the first of each month for <lastmod>
re-touches on past issues.

Tasks

0 of 16 done.

Changelog

No history yet.