Legislators and committees ingester
Header
Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.
Why
The unitedstates/congress-legislators mirror (7 fetched YAML files;committees-historical.yaml is the 8th upstream file but intentionally
out of scope at v1) is the canonical
open-data source for every member of Congress (current + historical),
every committee + subcommittee, and current committee assignments. It's
the smallest second source — no body normalization, no vector pipeline,
no chunking — which made it the right shape to validate that the
add-a-new-source runbook generalizes beyond CRS before bills exposes
the full schema surface. Every other Congress-side source FKs tolegislators.bioguide_id and committees.thomas_id.
Code is intact (ingester module, parser, migration, audit-driven 11-table
schema, FTS5 wiring, citation formatter, tests). The 12,780-legislator /
230-committee / 3,879-membership production load that previously
satisfied this spec lived on the destroyed DigitalOcean droplet
(2026-05-10) and needs to be re-run against the OVHcloud substrate.
Re-backfill is minutes — the upstream YAML mirror is small and stable.
User stories
As a policy analyst using my own AI agent, I want to ask the agent who sits on a given committee or sponsors a bill so that I get accurate, name-resolved answers with citations.
As a downstream source ingester (bills, votes, hearings), I want to FK to legislators and committees by stable ID so that cross-source citation graph queries resolve cleanly.
As an OSS self-hoster, I want to run `josh ingest legislators-committees` on a fresh substrate so that the FK targets exist before I ingest any Congress-side source.
Acceptance criteria (EARS)
- When `josh ingest legislators-committees` is run on the deployed substrate, the system shall populate `legislators`, `committees`, `committee_memberships`, and the legislator side tables (`legislator_terms`, `legislator_leadership_roles`, `legislator_family`, `legislator_other_names`, `legislator_fec_ids`, `legislator_bioguide_previous`, `legislator_social_media`, `legislator_district_offices`) from the 7 fetched upstream YAML files.
- When a legislator row is written, the system shall populate all six citation columns (`source_url`, `source_org`, `published_at`, `retrieved_at`, `raw_sha256`, `citation_string`) — `source_org` shall be `unitedstates/congress-legislators`.
- While the ingester is running, if a single YAML record fails to parse, then the system shall log the failure to `ingestion_logs` and continue with the next record.
- When a legislator's `bioguide_id` already exists, the system shall upsert that row rather than insert a duplicate, preserving FK targets for downstream sources.
- When an FTS5 query is executed against `legislators_fts` or `committees_fts`, the system shall return BM25-ranked name/jurisdiction matches.
- When a daily incremental run completes, the system shall update `ingestion_source_state.last_seen_key` so the next run's conditional GET on the upstream YAML files only re-fetches changed files.
Success determiner
DB
Query
SELECT
(SELECT COUNT(*) FROM legislators) AS legislators,
(SELECT COUNT(*) FROM committees) AS committees,
(SELECT COUNT(*) FROM committee_memberships) AS memberships,
(SELECT COUNT(*) FROM legislators
WHERE source_url IS NOT NULL AND source_org IS NOT NULL
AND retrieved_at IS NOT NULL AND raw_sha256 IS NOT NULL
AND citation_string IS NOT NULL) AS legislators_fully_cited;
Compare (machine-checked)
[
{
"op": "gte",
"value": 12000
},
{
"op": "gte",
"value": 200
},
{
"op": "gte",
"value": 3500
},
{
"op": "eq_col",
"value": 0
}
]
Expect (prose)
Production state as of 2026-05-09 (on the now-destroyed DO droplet): 12,780 legislators (536 current + ~12,200 historical + 67 executive), 230 committees (49 top-level + 181 subs), 3,879 current-Congress committee assignments. Floors leave headroom for upstream churn without false-failing the determiner. **Determiner currently fails** against an empty substrate — re-run after the OVHcloud provision + backfill (per `substrate-bare-metal-host` t9) to re-flip the spec verified → shipped. Of the citation columns checked, only raw_sha256 is nullable (the others are NOT NULL-enforced in the schema), so the fully_cited == total parity check is in practice a raw_sha256-completeness check.
Clarifications needed
None.
Out of scope
- State-level legislators / committees.
- Historical committee memberships (only current Congress in upstream YAML).
- Caucus memberships and personal-office staff (separate sources — see staff-directories spec).
Dependencies
Plan
Source module follows the standard Source protocol:josh-ingester/ingester/sources/legislators_committees.py.
- Discover — pull the 7 YAMLs from the GitHub Pages host
https://unitedstates.github.io/congress-legislators/
(legislators-current.yaml, legislators-historical.yaml,
executive.yaml, committees-current.yaml,
committee-membership-current.yaml, plus social-media / district-
office companions). committees-historical.yaml exists upstream but
is intentionally skipped at v1; there is no leadership companion file
(leadership_roles is an in-record field on each legislator, not a
separate YAML). Per-file conditional GET keyed on ETag
(If-None-Match → 304) as primary, Last-Modified
(If-Modified-Since) as secondary; state watermark is a JSON of
per-file ETags / Last-Modified values. (The raw host
raw.githubusercontent.com is NOT used — it serves no Last-Modified
and ignores If-Modified-Since; the GitHub Pages host serves both
headers and returns 304 to both conditional forms.)
- Fetch — write each YAML to
/data/corpus/legislators_committees/bodies/raw/<file>.yaml so an
audit trail exists.
- Parse — PyYAML → Pydantic models in
shared/josh_substrate/src/josh_substrate/models/legislators.py.
Each legislator's history of terms / names / FEC IDs / bioguide
previous-IDs unpacks into the side tables.
- Load — upsert by bioguide_id (legislators) and thomas_id
(committees). Memberships are deleted-and-reinserted per committee
(current-Congress snapshot only).
- Citation — shared/josh_substrate/citations/formatters/legislators_committees.py
builds a human-readable citation_string ("Sen. Jane Doe (D-CA),
member of Senate Judiciary Committee" style).
- Schema — 0003_legislators_committees.py migration: 11 tables
+ 2 FTS5 virtual tables (legislators_fts, committees_fts) + 6
sync triggers. **No *_chunks or *_chunks_vec0 — no body text.
- Audit-driven** — current production state is the result of an
audit landed 2026-05-09 (commit 6f16967) that redesigned the schema
after the initial pass. The audit-redesign rationale is preserved
internally and isn't required for re-running the ingester.
Tasks
10 of 14 done.
- t1 Alembic migration written + applied locally (0003_legislators_committees) — production re-apply pending OVHcloud provision
- t2 Pydantic models in shared/josh_substrate/src/josh_substrate/models/legislators.py
- t3 Source module implementing the `Source` protocol
- t4 N/A — no body text on this source (raw YAML cached for audit only)
- t5 Citation metadata: all 6 columns populated; citation_for() formatter + unit tests
- t6 FTS5 virtual tables (legislators_fts, committees_fts) + 6 sync triggers wired into migration
- t7 N/A — no body text, no vector pipeline
- t8 Unit test for parser against captured YAML fixtures
- t9 Integration test for the full discover→fetch→parse→load pipeline
- t10 Full backfill complete on the OVHcloud substrate (code-equivalent of the prior 2026-05-09 DO load: ~12,780 legislators + 230 committees + 3,879 memberships)
- t11 Incremental cron registered (daily 02:00 ET) on the OVHcloud host and observed firing
- t12 No recurring errors in `ingestion_logs` for the last completed run on the new host
- t13 Source doc updated with as-built notes
- t14 Inventory rows in `https://docs.usejosh.com/josh-data-sources/` + `https://docs.usejosh.com/data-status/` re-flipped to ingestion-built / shipped after the OVHcloud backfill lands
Changelog
-
2026-05-10T18:00:00Z
shipped→verifiedDigitalOcean droplet destroyed 2026-05-10 — the production load (12,780 legislators / 230 committees / 3,879 memberships, audit- driven schema landed 2026-05-09) is gone with the volume. Code (ingester module, parser, audit-driven 11-table migration, FTS5 wiring, citation formatter, tests, integration pipeline) all intact in the repo. Status flipped shipped → verified: the success_determiner ran green against real data once and the code's contract still holds; the production data load needs to be re-run against the OVHcloud substrate (per `substrate-bare-metal-host` t9). Tasks t10-t12 + t14 reset to false to track the re-load. -
2026-05-09T13:00:00Z
in_progress→shipped6f16967 Audit-driven redesign landed 2026-05-09 (commit 6f16967): 12,780 legislators, 230 committees, 3,879 current-Congress memberships. Spec backfilled with real acceptance criteria + sql success_determiner + the ingestion-built task contract; removing the stub-criteria drift left over from the seeded template.