sourceverifiedp0

Legislators and committees ingester

legislators-and-committees-ingester · updated 2026-05-29T00:00:00Z · owner rritz

Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.

The unitedstates/congress-legislators mirror (7 fetched YAML files;
committees-historical.yaml is the 8th upstream file but intentionally
out of scope at v1) is the canonical
open-data source for every member of Congress (current + historical),
every committee + subcommittee, and current committee assignments. It's
the smallest second source — no body normalization, no vector pipeline,
no chunking — which made it the right shape to validate that the
add-a-new-source runbook generalizes beyond CRS before bills exposes
the full schema surface. Every other Congress-side source FKs to
legislators.bioguide_id and committees.thomas_id.

Code is intact (ingester module, parser, migration, audit-driven 11-table
schema, FTS5 wiring, citation formatter, tests). The 12,780-legislator /
230-committee / 3,879-membership production load that previously
satisfied this spec lived on the destroyed DigitalOcean droplet
(2026-05-10) and needs to be re-run against the OVHcloud substrate.
Re-backfill is minutes — the upstream YAML mirror is small and stable.

As a policy analyst using my own AI agent, I want to ask the agent who sits on a given committee or sponsors a bill so that I get accurate, name-resolved answers with citations.

As a downstream source ingester (bills, votes, hearings), I want to FK to legislators and committees by stable ID so that cross-source citation graph queries resolve cleanly.

As an OSS self-hoster, I want to run `josh ingest legislators-committees` on a fresh substrate so that the FK targets exist before I ingest any Congress-side source.

  1. When `josh ingest legislators-committees` is run on the deployed substrate, the system shall populate `legislators`, `committees`, `committee_memberships`, and the legislator side tables (`legislator_terms`, `legislator_leadership_roles`, `legislator_family`, `legislator_other_names`, `legislator_fec_ids`, `legislator_bioguide_previous`, `legislator_social_media`, `legislator_district_offices`) from the 7 fetched upstream YAML files.
  2. When a legislator row is written, the system shall populate all six citation columns (`source_url`, `source_org`, `published_at`, `retrieved_at`, `raw_sha256`, `citation_string`) — `source_org` shall be `unitedstates/congress-legislators`.
  3. While the ingester is running, if a single YAML record fails to parse, then the system shall log the failure to `ingestion_logs` and continue with the next record.
  4. When a legislator's `bioguide_id` already exists, the system shall upsert that row rather than insert a duplicate, preserving FK targets for downstream sources.
  5. When an FTS5 query is executed against `legislators_fts` or `committees_fts`, the system shall return BM25-ranked name/jurisdiction matches.
  6. When a daily incremental run completes, the system shall update `ingestion_source_state.last_seen_key` so the next run's conditional GET on the upstream YAML files only re-fetches changed files.
kindsql

DB

substrate

Query

SELECT
  (SELECT COUNT(*) FROM legislators) AS legislators,
  (SELECT COUNT(*) FROM committees) AS committees,
  (SELECT COUNT(*) FROM committee_memberships) AS memberships,
  (SELECT COUNT(*) FROM legislators
    WHERE source_url IS NOT NULL AND source_org IS NOT NULL
      AND retrieved_at IS NOT NULL AND raw_sha256 IS NOT NULL
      AND citation_string IS NOT NULL) AS legislators_fully_cited;

Compare (machine-checked)

[
  {
    "op": "gte",
    "value": 12000
  },
  {
    "op": "gte",
    "value": 200
  },
  {
    "op": "gte",
    "value": 3500
  },
  {
    "op": "eq_col",
    "value": 0
  }
]

Expect (prose)

legislators >= 12000, committees >= 200, memberships >= 3500, legislators_fully_cited == legislators (column 3 == column 0).

Production state as of 2026-05-09 (on the now-destroyed DO droplet): 12,780 legislators (536 current + ~12,200 historical + 67 executive), 230 committees (49 top-level + 181 subs), 3,879 current-Congress committee assignments. Floors leave headroom for upstream churn without false-failing the determiner. **Determiner currently fails** against an empty substrate — re-run after the OVHcloud provision + backfill (per `substrate-bare-metal-host` t9) to re-flip the spec verified → shipped. Of the citation columns checked, only raw_sha256 is nullable (the others are NOT NULL-enforced in the schema), so the fully_cited == total parity check is in practice a raw_sha256-completeness check.

None.

  • State-level legislators / committees.
  • Historical committee memberships (only current Congress in upstream YAML).
  • Caucus memberships and personal-office staff (separate sources — see staff-directories spec).

Source module follows the standard Source protocol:
josh-ingester/ingester/sources/legislators_committees.py.

- Discover — pull the 7 YAMLs from the GitHub Pages host
https://unitedstates.github.io/congress-legislators/
(legislators-current.yaml, legislators-historical.yaml,
executive.yaml, committees-current.yaml,
committee-membership-current.yaml, plus social-media / district-
office companions). committees-historical.yaml exists upstream but
is intentionally skipped at v1; there is no leadership companion file
(leadership_roles is an in-record field on each legislator, not a
separate YAML). Per-file conditional GET keyed on ETag
(If-None-Match → 304) as primary, Last-Modified
(If-Modified-Since) as secondary; state watermark is a JSON of
per-file ETags / Last-Modified values. (The raw host
raw.githubusercontent.com is NOT used — it serves no Last-Modified
and ignores If-Modified-Since; the GitHub Pages host serves both
headers and returns 304 to both conditional forms.)
- Fetch — write each YAML to
/data/corpus/legislators_committees/bodies/raw/<file>.yaml so an
audit trail exists.
- ParsePyYAML → Pydantic models in
shared/josh_substrate/src/josh_substrate/models/legislators.py.
Each legislator's history of terms / names / FEC IDs / bioguide
previous-IDs unpacks into the side tables.
- Load — upsert by bioguide_id (legislators) and thomas_id
(committees). Memberships are deleted-and-reinserted per committee
(current-Congress snapshot only).
- Citationshared/josh_substrate/citations/formatters/legislators_committees.py
builds a human-readable citation_string ("Sen. Jane Doe (D-CA),
member of Senate Judiciary Committee" style).
- Schema0003_legislators_committees.py migration: 11 tables
+ 2 FTS5 virtual tables (legislators_fts, committees_fts) + 6
sync triggers. **No *_chunks or *_chunks_vec0 — no body text.
-
Audit-driven** — current production state is the result of an
audit landed 2026-05-09 (commit 6f16967) that redesigned the schema
after the initial pass. The audit-redesign rationale is preserved
internally and isn't required for re-running the ingester.

10 of 14 done.

  • t1 Alembic migration written + applied locally (0003_legislators_committees) — production re-apply pending OVHcloud provision
  • t2 Pydantic models in shared/josh_substrate/src/josh_substrate/models/legislators.py
  • t3 Source module implementing the `Source` protocol
  • t4 N/A — no body text on this source (raw YAML cached for audit only)
  • t5 Citation metadata: all 6 columns populated; citation_for() formatter + unit tests
  • t6 FTS5 virtual tables (legislators_fts, committees_fts) + 6 sync triggers wired into migration
  • t7 N/A — no body text, no vector pipeline
  • t8 Unit test for parser against captured YAML fixtures
  • t9 Integration test for the full discover→fetch→parse→load pipeline
  • t10 Full backfill complete on the OVHcloud substrate (code-equivalent of the prior 2026-05-09 DO load: ~12,780 legislators + 230 committees + 3,879 memberships)
  • t11 Incremental cron registered (daily 02:00 ET) on the OVHcloud host and observed firing
  • t12 No recurring errors in `ingestion_logs` for the last completed run on the new host
  • t13 Source doc updated with as-built notes
  • t14 Inventory rows in `https://docs.usejosh.com/josh-data-sources/` + `https://docs.usejosh.com/data-status/` re-flipped to ingestion-built / shipped after the OVHcloud backfill lands
  • 2026-05-10T18:00:00Z shippedverified DigitalOcean droplet destroyed 2026-05-10 — the production load (12,780 legislators / 230 committees / 3,879 memberships, audit- driven schema landed 2026-05-09) is gone with the volume. Code (ingester module, parser, audit-driven 11-table migration, FTS5 wiring, citation formatter, tests, integration pipeline) all intact in the repo. Status flipped shipped → verified: the success_determiner ran green against real data once and the code's contract still holds; the production data load needs to be re-run against the OVHcloud substrate (per `substrate-bare-metal-host` t9). Tasks t10-t12 + t14 reset to false to track the re-load.
  • 2026-05-09T13:00:00Z in_progressshipped 6f16967 Audit-driven redesign landed 2026-05-09 (commit 6f16967): 12,780 legislators, 230 committees, 3,879 current-Congress memberships. Spec backfilled with real acceptance criteria + sql success_determiner + the ingestion-built task contract; removing the stub-criteria drift left over from the seeded template.

docs/spec/legislators-and-committees-ingester.html · generated by bin/build-spec.py