Legislators and committees ingester

Header

Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.

Why

The unitedstates/congress-legislators mirror (7 fetched YAML files;
committees-historical.yaml is the 8th upstream file but intentionally
out of scope at v1) is the canonical
open-data source for every member of Congress (current + historical),
every committee + subcommittee, and current committee assignments. It's
the smallest second source — no body normalization, no vector pipeline,
no chunking — which made it the right shape to validate that the
add-a-new-source runbook generalizes beyond CRS before bills exposes
the full schema surface. Every other Congress-side source FKs to
legislators.bioguide_id and committees.thomas_id.

Code is intact (ingester module, parser, migration, audit-driven 11-table
schema, FTS5 wiring, citation formatter, tests). The 12,780-legislator /
230-committee / 3,879-membership production load that previously
satisfied this spec lived on the destroyed DigitalOcean droplet
(2026-05-10) and needs to be re-run against the OVHcloud substrate.
Re-backfill is minutes — the upstream YAML mirror is small and stable.

User stories

As a policy analyst using my own AI agent, I want to ask the agent who sits on a given committee or sponsors a bill so that I get accurate, name-resolved answers with citations.

As a downstream source ingester (bills, votes, hearings), I want to FK to legislators and committees by stable ID so that cross-source citation graph queries resolve cleanly.

As an OSS self-hoster, I want to run `josh ingest legislators-committees` on a fresh substrate so that the FK targets exist before I ingest any Congress-side source.

Acceptance criteria (EARS)

When `josh ingest legislators-committees` is run on the deployed substrate, the system shall populate `legislators`, `committees`, `committee_memberships`, and the legislator side tables (`legislator_terms`, `legislator_leadership_roles`, `legislator_family`, `legislator_other_names`, `legislator_fec_ids`, `legislator_bioguide_previous`, `legislator_social_media`, `legislator_district_offices`) from the 7 fetched upstream YAML files.
When a legislator row is written, the system shall populate all six citation columns (`source_url`, `source_org`, `published_at`, `retrieved_at`, `raw_sha256`, `citation_string`) — `source_org` shall be `unitedstates/congress-legislators`.
While the ingester is running, if a single YAML record fails to parse, then the system shall log the failure to `ingestion_logs` and continue with the next record.
When a legislator's `bioguide_id` already exists, the system shall upsert that row rather than insert a duplicate, preserving FK targets for downstream sources.
When an FTS5 query is executed against `legislators_fts` or `committees_fts`, the system shall return BM25-ranked name/jurisdiction matches.
When a daily incremental run completes, the system shall update `ingestion_source_state.last_seen_key` so the next run's conditional GET on the upstream YAML files only re-fetches changed files.

Success determiner

kindsql

DB

substrate

Query

SELECT
  (SELECT COUNT(*) FROM legislators) AS legislators,
  (SELECT COUNT(*) FROM committees) AS committees,
  (SELECT COUNT(*) FROM committee_memberships) AS memberships,
  (SELECT COUNT(*) FROM legislators
    WHERE source_url IS NOT NULL AND source_org IS NOT NULL
      AND retrieved_at IS NOT NULL AND raw_sha256 IS NOT NULL
      AND citation_string IS NOT NULL) AS legislators_fully_cited;

Compare (machine-checked)

[
  {
    "op": "gte",
    "value": 12000
  },
  {
    "op": "gte",
    "value": 200
  },
  {
    "op": "gte",
    "value": 3500
  },
  {
    "op": "eq_col",
    "value": 0
  }
]

Expect (prose)

legislators >= 12000, committees >= 200, memberships >= 3500, legislators_fully_cited == legislators (column 3 == column 0).

Production state as of 2026-05-09 (on the now-destroyed DO droplet): 12,780 legislators (536 current + ~12,200 historical + 67 executive), 230 committees (49 top-level + 181 subs), 3,879 current-Congress committee assignments. Floors leave headroom for upstream churn without false-failing the determiner. **Determiner currently fails** against an empty substrate — re-run after the OVHcloud provision + backfill (per `substrate-bare-metal-host` t9) to re-flip the spec verified → shipped. Of the citation columns checked, only raw_sha256 is nullable (the others are NOT NULL-enforced in the schema), so the fully_cited == total parity check is in practice a raw_sha256-completeness check.

Clarifications needed

None.

Out of scope

State-level legislators / committees.
Historical committee memberships (only current Congress in upstream YAML).
Caucus memberships and personal-office staff (separate sources — see staff-directories spec).

Dependencies

Plan

Source module follows the standard Source protocol:
josh-ingester/ingester/sources/legislators_committees.py.

- Discover — pull the 7 YAMLs from the GitHub Pages host
https://unitedstates.github.io/congress-legislators/
(legislators-current.yaml, legislators-historical.yaml,
executive.yaml, committees-current.yaml,
committee-membership-current.yaml, plus social-media / district-
office companions). committees-historical.yaml exists upstream but
is intentionally skipped at v1; there is no leadership companion file
(leadership_roles is an in-record field on each legislator, not a
separate YAML). Per-file conditional GET keyed on ETag
(If-None-Match → 304) as primary, Last-Modified
(If-Modified-Since) as secondary; state watermark is a JSON of
per-file ETags / Last-Modified values. (The raw host
raw.githubusercontent.com is NOT used — it serves no Last-Modified
and ignores If-Modified-Since; the GitHub Pages host serves both
headers and returns 304 to both conditional forms.)
- Fetch — write each YAML to
/data/corpus/legislators_committees/bodies/raw/<file>.yaml so an
audit trail exists.
- Parse — PyYAML → Pydantic models in
shared/josh_substrate/src/josh_substrate/models/legislators.py.
Each legislator's history of terms / names / FEC IDs / bioguide
previous-IDs unpacks into the side tables.
- Load — upsert by bioguide_id (legislators) and thomas_id
(committees). Memberships are deleted-and-reinserted per committee
(current-Congress snapshot only).
- Citation — shared/josh_substrate/citations/formatters/legislators_committees.py
builds a human-readable citation_string ("Sen. Jane Doe (D-CA),
member of Senate Judiciary Committee" style).
- Schema — 0003_legislators_committees.py migration: 11 tables
+ 2 FTS5 virtual tables (legislators_fts, committees_fts) + 6
sync triggers. **No *_chunks or *_chunks_vec0 — no body text.
- Audit-driven** — current production state is the result of an
audit landed 2026-05-09 (commit 6f16967) that redesigned the schema
after the initial pass. The audit-redesign rationale is preserved
internally and isn't required for re-running the ingester.

Tasks

10 of 14 done.

Changelog

2026-05-10T18:00:00Z shipped→verified DigitalOcean droplet destroyed 2026-05-10 — the production load (12,780 legislators / 230 committees / 3,879 memberships, audit- driven schema landed 2026-05-09) is gone with the volume. Code (ingester module, parser, audit-driven 11-table migration, FTS5 wiring, citation formatter, tests, integration pipeline) all intact in the repo. Status flipped shipped → verified: the success_determiner ran green against real data once and the code's contract still holds; the production data load needs to be re-run against the OVHcloud substrate (per `substrate-bare-metal-host` t9). Tasks t10-t12 + t14 reset to false to track the re-load.
2026-05-09T13:00:00Z in_progress→shipped 6f16967 Audit-driven redesign landed 2026-05-09 (commit 6f16967): 12,780 legislators, 230 committees, 3,879 current-Congress memberships. Spec backfilled with real acceptance criteria + sql success_determiner + the ingestion-built task contract; removing the stub-criteria drift left over from the seeded template.