Legislators and Committees

Canonical registries of every Member of Congress (current and historical), every committee and subcommittee (active and dissolved), and the membership rosters that link them. This is a companion source — its records don't change much in absolute volume, but almost every other Josh source has a foreign key into it: bills join sponsors and cosponsors, roll-call votes join voters, hearings join witnesses, lobbying disclosures resolve to lobbied members, citation graph rows reference statute authors. Load this first. Nothing downstream is fully ingestible without it.

The good news: there is a single, community-maintained, public-domain canonical registry, unitedstates/congress-legislators, keyed by the same bioguide IDs that official US government data sources already use (Congress.gov, GovInfo, House Clerk, Senate.gov). Joins to those sources are one-to-one, with no transformation and no fuzzy matching. Non-government datasets such as OpenSecrets use their own IDs, which the registry's id crosswalk carries.

Quick reference


Source name	Legislators and Committees
Publishers	Community-maintained (@unitedstates/congress-legislators — Eric Mill, Sunlight Foundation alumni, GovTrack maintainers, etc.), drawing on official House/Senate/Bioguide/LoC data.
License	CC0-1.0 (public domain)
Coverage	Legislators: 1789–present (every Member ever: ~12,200 historical + ~540 current). Committees: current Congress only for `committees-current` + `committee-membership-current`. Historical committees exist upstream but are out of scope at v1.
Storage estimate	<50 MB raw across the 7 fetched files; <500 MB in Postgres including denormalization
Access path	GitHub Pages mirror `https://unitedstates.github.io/congress-legislators/{file}.yaml` — preferred (see Access notes)
Auth	None
Stable ID format	Member: `member:{bioguide_id}` e.g. `member:C000127`. Committee: `committee:{thomas_id}` e.g. `committee:HSAG`. Subcommittee: `committee:{parent_thomas}-{sub_thomas_id}` e.g. `committee:HSAG-15`.
Status	exploring — schema drafted, ingestion not built

For the upstream file inventory (sizes, record counts), HTTP/CDN behavior, and rate limits, see the unitedstates/congress-legislators repo.

Source priority decision

Several sources overlap for the same data. Recommendation:

Primary: unitedstates/congress-legislators GitHub Pages YAML. Single repo, public-domain, hand-curated since 2013, updated continuously. Same bioguide ID space as every official US government source. No translation layer needed.

Secondary: Congress.gov v3 API /member and /committee. Useful as a verification / spot-check source but redundant with the YAML — the YAML maintainers themselves cross-reference Congress.gov when curating. Costs api.data.gov quota for nothing we can't already get for free.

Skip: scraping bioguide.congress.gov directly. The YAML registry already does this work upstream. Scraping ourselves is reinventing the registry.

Skip: GovTrack member/committee dumps. The congress-legislators registry includes id.govtrack crosswalks; we'd just be ingesting downstream copies of upstream truth.

The YAML files are also the only canonical source for some fields that no official feed exposes well — leadership roles, district office addresses, social media handles, family relationships (e.g. dynasty tracking), and historical thomas/lis/icpsr/govtrack ID crosswalks. Building these from primary sources would take months.

Access notes

The repo lives on GitHub but the GitHub Pages mirror is the right access path, not raw.githubusercontent.com. The Pages mirror returns valid Content-Type: text/yaml (raw GitHub returns text/plain), serves clean Last-Modified and ETag headers, is CDN-cached at Fastly, and has CORS open for browser use.

https://unitedstates.github.io/congress-legislators/{file}.yaml

Why this matters for Josh: raw.githubusercontent.com serves no Last-Modified and ignores If-Modified-Since, which would break the conditional-GET state watermark the ingester relies on (see Caching / incremental sync). A real-browser User-Agent isn't required (no anti-bot wall observed). For programmatic discovery of what's changed, hit the GitHub commits API instead of the Pages mirror.

Files inventoried

Eight files exist upstream at https://unitedstates.github.io/congress-legislators/{name}.yaml; the ingester fetches 7 of them:

legislators-current.yaml — sitting Members of Congress (House + Senate).
legislators-historical.yaml — every former Member, 1789–present.
committees-current.yaml — active standing, select, special, and joint committees with subcommittees inline.
committee-membership-current.yaml — who serves on what committee/subcommittee right now, with rank and title.
legislators-social-media.yaml — Twitter, Facebook, YouTube, Instagram per current member. Official accounts only (campaign accounts excluded).
legislators-district-offices.yaml — district/state office addresses, lat/long, phone, fax.
executive.yaml — Presidents and Vice Presidents (1789–present).

committees-historical.yaml exists upstream but is intentionally out of scope at v1 (dissolved-committee lineages aren't needed for the current-Congress FK graph) and is not fetched. There is no JSON or CSV mirror committed in the repo — conversion is on the consumer.

Stable ID formats

Member: member:{bioguide_id}, lowercased prefix.

Examples:

member:C000127 (Maria Cantwell)
member:K000367 (Amy Klobuchar)
member:W000178 (George Washington — yes, Washington has a bioguide ID)

Rationale: bioguide IDs are the de facto canonical IDs across US government data sources (Congress.gov API, GovInfo BILLSTATUS XML, House Clerk roll-call votes, Senate.gov XML, Bioguide proper), and outside datasets such as OpenSecrets are reachable through the crosswalk. They never collide and never get reused, even across centuries, so there is no reason to invent a new ID.

Committee: committee:{thomas_id}, uppercase Thomas ID preserved.

Top-level: committee:HSAG (House Agriculture), committee:SSAF (Senate Ag), committee:JSPR (Joint Printing).
Subcommittee: committee:HSAG-15 (House Ag Forestry/Horticulture subcommittee — 15 is the subcommittee's thomas_id).

The Thomas ID is what unitedstates/congress-legislators uses, what every committee membership file keys on, and what BILLSTATUS XML's <systemCode> derives from. The system_code is a separate field (see the schema), derived at parse time: a top-level committee uses lowercase(thomas_id) + "00" (e.g. hsag00 for HSAG, hsbu00 for HSBU); a subcommittee uses lowercase(parent.thomas_id) + subcommittee.thomas_id (e.g. hsag15 for HSAG subcommittee 15).
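The two derivations above can be sketched in a few lines. This is a minimal illustration in Python (assumed here as the ingester's language, since this doc mentions PyYAML); the function names are hypothetical and not part of any existing module.

```python
from typing import Optional


def committee_id(thomas_id: str, parent_thomas_id: Optional[str] = None) -> str:
    """Build the Josh stable ID for a committee or subcommittee."""
    if parent_thomas_id is None:
        return f"committee:{thomas_id}"
    # Subcommittee IDs are unique only within the parent, so prefix the parent.
    return f"committee:{parent_thomas_id}-{thomas_id}"


def system_code(thomas_id: str, parent_thomas_id: Optional[str] = None) -> str:
    """Derive the BILLSTATUS-style system code (hsag00, hsag15)."""
    if parent_thomas_id is None:
        return thomas_id.lower() + "00"
    return parent_thomas_id.lower() + thomas_id
```

For example, committee_id("15", "HSAG") gives committee:HSAG-15 and system_code("15", "HSAG") gives hsag15.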

Crosswalk: the id block on each legislator carries IDs to every other system (govtrack, opensecrets, fec, icpsr, etc.). Preserve them all so we can join to outside datasets later.
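A rough sketch of how one parsed legislator record might be flattened into a legislators row while keeping every crosswalk. It assumes the record is already a Python dict (as a YAML parser would return), that the id/name/bio key names match the upstream README, and that the column names are those in the schema in this doc. The ID_COLUMNS mapping and legislator_row function are illustrative names, and only a subset of columns is shown.

```python
import json
from typing import Any, Dict, Mapping

# Upstream id-block key -> legislators column (subset, for illustration).
ID_COLUMNS = {
    "thomas": "thomas_id",
    "lis": "lis_id",
    "govtrack": "govtrack_id",
    "opensecrets": "opensecrets_id",
    "votesmart": "votesmart_id",
    "fec": "fec_ids",
    "cspan": "cspan_id",
    "icpsr": "icpsr_id",
    "wikidata": "wikidata_id",
    "wikipedia": "wikipedia",
}


def legislator_row(record: Mapping[str, Any], is_current: bool) -> Dict[str, Any]:
    """Flatten one legislator record into a dict of column values."""
    ids = record["id"]
    name = record.get("name", {})
    bio = record.get("bio", {})
    row: Dict[str, Any] = {
        "id": f"member:{ids['bioguide']}",
        "bioguide_id": ids["bioguide"],
        "first_name": name.get("first"),
        "last_name": name.get("last"),
        "official_full_name": name.get("official_full"),
        "birthday": bio.get("birthday"),
        "gender": bio.get("gender"),
        "is_current": is_current,
    }
    # A crosswalk that is absent upstream becomes None (NULL), never an error.
    for yaml_key, column in ID_COLUMNS.items():
        row[column] = ids.get(yaml_key)
    # Keep the whole record for re-parsing; default=str covers date objects
    # that a YAML parser may produce for unquoted dates.
    row["raw_yaml"] = json.dumps(record, default=str, sort_keys=True)
    return row
```

Reading crosswalks with .get means early records that lack thomas or lis IDs load without special cases.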

Upstream record shapes

The full field-by-field shape of each YAML file — the legislator id/name/bio/terms[]/leadership_roles[]/family[]/other_names[] blocks, the committee and subcommittee fields, the committee-membership map, the social-media and district-office records, and all enum/vocabulary values (terms[].type, end-type, how, Senate class, state_rank, committee type, membership party/title) — is documented upstream in the congress-legislators README. Josh lands these into the Postgres schema below; the Josh-specific data-modeling consequences are called out here:

family[] is a soft pointer. The relative name is free-text, not a bioguide_id. Resolving it to another member is on us if we want a true graph (legislator_family.relative_bioguide is a soft FK, NULL when unresolved).
Subcommittee thomas_id is unique only within its parent committee. Numeric strings ('15', '22') collide across parents — Josh composes parent_thomas + sub_thomas for committees.id and the system_code.
committee-membership-current.yaml is one flat map that mixes committee and subcommittee keys. It is keyed by Thomas ID, so SSAF (full committee) and SSAF13 (a subcommittee) are separate top-level keys; there is no nesting at this layer (the nesting lives in committees-current.yaml). Member entries carry party: majority/minority (chamber-relative, not Democrat/Republican), a within-party seniority rank (1 = most senior), an optional title, and chamber on joint committees.
Social media is current members only, official accounts only. Per the file's editorial policy, only taxpayer-funded "official" legislative accounts are accepted (campaign/personal accounts excluded). Numeric IDs (twitter_id, instagram_id) are kept because handles can be renamed but numeric IDs are stable.
District office id is {bioguide}-{slug} (e.g. A000055-cullman) and is stable across pulls — Josh uses it as the primary key.
executive.yaml follows the same shape as legislators-historical.yaml but with terms[].type in {prez, viceprez} and a few fewer ID crosswalks (no thomas/govtrack/opensecrets for early presidents). 80 records, Washington (1789) through the current administration. Useful for resolving Federal Register president slugs (donald-trump, joe-biden) to a canonical record with bio.
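Because the membership file uses flat keys (SSAF, SSAF13) while committees.id uses a hyphenated form, the loader needs a small translation step. The sketch below is one way to do it in Python. It assumes that top-level Thomas IDs are always four uppercase letters and that subcommittee keys append only digits; that holds for the examples in this doc but should be verified against the live file. The function names are hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumption: four uppercase letters, optionally followed by the
# subcommittee's digits, as in SSAF and SSAF13.
_KEY = re.compile(r"([A-Z]{4})(\d+)?")


def split_membership_key(key: str) -> Tuple[str, Optional[str]]:
    """Split a membership-file key into (parent_thomas_id, sub_thomas_id)."""
    match = _KEY.fullmatch(key)
    if match is None:
        raise ValueError(f"unrecognized committee key: {key!r}")
    return match.group(1), match.group(2)


def membership_key_to_committee_id(key: str) -> str:
    """Map a membership-file key to the committees.id it should reference."""
    parent, sub = split_membership_key(key)
    if sub is None:
        return f"committee:{parent}"
    return f"committee:{parent}-{sub}"
```

Raising on an unrecognized key, rather than guessing, surfaces any upstream key format the assumption does not cover.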

Schema (Postgres DDL)

-- ============================================================
-- Members of Congress (current + historical, unified table)
-- ============================================================

CREATE TABLE legislators (
    id text PRIMARY KEY,                          -- 'member:C000127'
    bioguide_id text NOT NULL UNIQUE,             -- 'C000127'

    -- All cross-system IDs for joining to other datasets
    thomas_id text,
    lis_id text,
    govtrack_id int,
    opensecrets_id text,
    votesmart_id int,
    fec_ids text[],                               -- list per career
    cspan_id int,
    icpsr_id int,
    house_history_id int,
    house_history_alternate_id int,
    bioguide_previous text[],                     -- when bioguide was renumbered
    wikidata_id text,
    wikipedia text,                               -- article title
    ballotpedia text,
    maplight_id int,
    google_entity_id text,
    pictorial_id int,

    -- Names
    first_name text,
    last_name text,
    middle_name text,
    suffix text,
    nickname text,
    official_full_name text,

    -- Bio
    birthday date,
    gender text CHECK (gender IN ('M', 'F') OR gender IS NULL),

    -- Status flags
    is_current boolean NOT NULL DEFAULT false,    -- in legislators-current.yaml right now
    is_executive boolean NOT NULL DEFAULT false,  -- in executive.yaml (presidents/VPs)

    -- Lifecycle / forensics
    raw_yaml jsonb NOT NULL,                      -- full record for re-parsing
    fetched_at timestamptz NOT NULL,
    inserted_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX legislators_current ON legislators (last_name, first_name) WHERE is_current;
CREATE INDEX legislators_govtrack ON legislators (govtrack_id) WHERE govtrack_id IS NOT NULL;
CREATE INDEX legislators_icpsr ON legislators (icpsr_id) WHERE icpsr_id IS NOT NULL;
CREATE INDEX legislators_opensecrets ON legislators (opensecrets_id) WHERE opensecrets_id IS NOT NULL;

-- Search by name
ALTER TABLE legislators ADD COLUMN search_tsv tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(official_full_name, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(last_name, '') || ' ' || coalesce(first_name, '')), 'B') ||
        setweight(to_tsvector('english', coalesce(nickname, '')), 'C')
    ) STORED;
CREATE INDEX legislators_search ON legislators USING gin (search_tsv);

-- ============================================================
-- Terms (one per office period; many per legislator)
-- ============================================================

CREATE TABLE legislator_terms (
    id bigserial PRIMARY KEY,
    bioguide_id text NOT NULL REFERENCES legislators(bioguide_id) ON DELETE CASCADE,

    term_type text NOT NULL CHECK (term_type IN ('rep', 'sen', 'prez', 'viceprez')),
    term_start date NOT NULL,
    term_end date NOT NULL,

    state text,                                   -- 2-letter, NULL for prez/viceprez
    district int,                                 -- rep only; 0 = at-large
    senate_class smallint CHECK (senate_class IN (1, 2, 3) OR senate_class IS NULL),
    state_rank text CHECK (state_rank IN ('junior', 'senior') OR state_rank IS NULL),

    party text,                                   -- display name; party_affiliations is canonical when set
    party_affiliations jsonb,                     -- list of {start, end, party} when party changed mid-term
    caucus text,

    how text,                                     -- 'special-election', 'appointment', or NULL
    end_type text,                                -- end-type renamed to be SQL-friendly

    -- Contact info (for sen/rep — the DC office)
    url text,
    address text,
    office text,
    phone text,
    fax text,
    contact_form text,
    rss_url text,

    UNIQUE (bioguide_id, term_start, term_type)
);

CREATE INDEX legislator_terms_bioguide ON legislator_terms (bioguide_id);
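-- CURRENT_DATE is STABLE, not IMMUTABLE, so PostgreSQL rejects it in a partial-index
-- predicate. Index term_end directly and filter by date at query time.
CREATE INDEX legislator_terms_active ON legislator_terms (term_end DESC, term_start);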
CREATE INDEX legislator_terms_state_district ON legislator_terms (state, district);

-- ============================================================
-- Leadership roles (sparse — most legislators have none)
-- ============================================================

CREATE TABLE legislator_leadership_roles (
    id bigserial PRIMARY KEY,
    bioguide_id text NOT NULL REFERENCES legislators(bioguide_id) ON DELETE CASCADE,
    title text NOT NULL,
    chamber text CHECK (chamber IN ('house', 'senate') OR chamber IS NULL),
    role_start date NOT NULL,
    role_end date NOT NULL,
    UNIQUE (bioguide_id, title, role_start)
);

-- No partial predicate here: CURRENT_DATE is not IMMUTABLE and cannot appear in one.
CREATE INDEX legislator_leadership_active
    ON legislator_leadership_roles (bioguide_id, role_end DESC);

-- ============================================================
-- Family relations (sparse, free-text — soft pointer to other members)
-- ============================================================

CREATE TABLE legislator_family (
    id bigserial PRIMARY KEY,
    bioguide_id text NOT NULL REFERENCES legislators(bioguide_id) ON DELETE CASCADE,
    relative_name text NOT NULL,                  -- free-text; resolution to bioguide is a separate task
    relative_bioguide text,                       -- populated where we can match — soft FK
    relation text NOT NULL,                       -- 'son', 'father', 'brother', 'husband', etc.
    UNIQUE (bioguide_id, relative_name, relation)
);

-- ============================================================
-- Other names (married/maiden names, naturalization name changes — historical)
-- ============================================================

CREATE TABLE legislator_other_names (
    id bigserial PRIMARY KEY,
    bioguide_id text NOT NULL REFERENCES legislators(bioguide_id) ON DELETE CASCADE,
    last_name text,
    middle_name text,
    used_until date,                              -- 'end' in YAML
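    -- A plain UNIQUE treats NULLs as distinct, so rows with a NULL last_name or middle_name
    -- could be duplicated on re-run. NULLS NOT DISTINCT requires PostgreSQL 15+.
    UNIQUE NULLS NOT DISTINCT (bioguide_id, last_name, middle_name)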
);

-- ============================================================
-- Social media (current members only)
-- ============================================================

CREATE TABLE legislator_social_media (
    bioguide_id text PRIMARY KEY REFERENCES legislators(bioguide_id) ON DELETE CASCADE,
    twitter text,
    twitter_id text,
    facebook text,
    youtube text,
    youtube_id text,
    instagram text,
    instagram_id text,
    fetched_at timestamptz NOT NULL,
    updated_at timestamptz NOT NULL DEFAULT now()
);

-- ============================================================
-- District offices (current members only)
-- ============================================================

CREATE TABLE legislator_district_offices (
    id text PRIMARY KEY,                          -- 'A000055-cullman' from YAML
    bioguide_id text NOT NULL REFERENCES legislators(bioguide_id) ON DELETE CASCADE,
    address text,
    suite text,
    building text,
    city text,
    state text,
    zip text,
    latitude numeric(9, 6),
    longitude numeric(9, 6),
    phone text,
    fax text,
    fetched_at timestamptz NOT NULL,
    updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX legislator_district_offices_bioguide ON legislator_district_offices (bioguide_id);
CREATE INDEX legislator_district_offices_state ON legislator_district_offices (state);

-- ============================================================
-- Committees + subcommittees (current — historical in same table with is_current flag)
-- ============================================================

CREATE TABLE committees (
    id text PRIMARY KEY,                          -- 'committee:HSAG' or 'committee:HSAG-15'
    thomas_id text NOT NULL,                      -- 'HSAG' or '15' (subcommittee local id)
    parent_id text REFERENCES committees(id),     -- subcommittees only

    -- system_code is what BILLSTATUS XML uses — derived: hsag00 (full) or hsag15 (sub)
    system_code text UNIQUE,

    chamber text NOT NULL CHECK (chamber IN ('house', 'senate', 'joint')),
    name text NOT NULL,

    -- Only set on top-level committees
    house_committee_id text,
    senate_committee_id text,

    url text,
    minority_url text,
    rss_url text,
    minority_rss_url text,
    address text,
    phone text,
    jurisdiction text,
    jurisdiction_source text,
    wikipedia text,                               -- article title
    youtube_id text,

    is_current boolean NOT NULL DEFAULT true,

    raw_yaml jsonb,
    fetched_at timestamptz NOT NULL,
    updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX committees_parent ON committees (parent_id) WHERE parent_id IS NOT NULL;
CREATE INDEX committees_chamber ON committees (chamber) WHERE is_current;
CREATE INDEX committees_system_code ON committees (system_code) WHERE system_code IS NOT NULL;

-- ============================================================
-- Committee membership (current — historical handled via congress-specific
-- replays of committees-historical when we get to it)
-- ============================================================

CREATE TABLE committee_memberships (
    id bigserial PRIMARY KEY,
    committee_id text NOT NULL REFERENCES committees(id) ON DELETE CASCADE,
    bioguide_id text NOT NULL REFERENCES legislators(bioguide_id) ON DELETE CASCADE,
    party_alignment text NOT NULL CHECK (party_alignment IN ('majority', 'minority')),
    rank int NOT NULL,                            -- seniority rank within party
    title text,                                   -- 'Chairman', 'Ranking Member', etc.
    chamber text CHECK (chamber IN ('house', 'senate') OR chamber IS NULL),  -- joint committees only
    -- Period of membership (current snapshot, but congress-stamped for time-series)
    congress smallint NOT NULL,
    UNIQUE (committee_id, bioguide_id, congress)
);

CREATE INDEX committee_memberships_member ON committee_memberships (bioguide_id);
CREATE INDEX committee_memberships_committee ON committee_memberships (committee_id);
CREATE INDEX committee_memberships_chair
    ON committee_memberships (committee_id)
    WHERE title IN ('Chairman', 'Chair');

Schema decisions worth flagging:

Single legislators table for current + historical + executive with is_current and is_executive flags. The shape is identical (modulo a few mostly-null fields for early presidents). Splitting into three tables would force every join site to UNION.
raw_yaml preserved per record — the YAML registry adds new ID crosswalks every few years (google_entity_id, pictorial, wikidata weren't all there a decade ago). Keep raw to re-parse.
legislator_terms is the canonical source for state/district/party at any given date. The denormalized current-state on bills.sponsor_* (and similar) is a snapshot at time of sponsorship. To answer "who represents NY-12 today" join to legislator_terms with term_start <= now < term_end.
Committees + subcommittees in one table with parent_id self-FK. Subcommittees inherit chamber from parent (denormalized to make the chamber filter cheap). The system_code column is the BILLSTATUS XML join key — derived at parse time (hsag00 for the top-level HSAG; hsag15 for its subcommittee with thomas_id: '15').
committee_memberships is congress-stamped even though the source file is committee-membership-current — so when the next congress's data arrives, we don't lose history. Re-running ingestion in a new congress inserts new rows rather than overwriting.
legislator_family.relative_bioguide is a soft FK (NULL when we couldn't resolve the free-text name). Resolution is a future task — possibly aided by the wikidata_id crosswalk.
party_affiliations as a jsonb column on legislator_terms rather than a separate table — it's a sparse, list-of-3-fields-each structure. When set (rare), we'd rather query it as JSON than join.
No executive table separately. executive.yaml records load into legislators with is_executive=true and terms in legislator_terms with term_type IN ('prez', 'viceprez').
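The point-in-time lookup described above ("who represents NY-12 today") uses a half-open interval, term_start <= day < term_end. The following Python sketch applies the same predicate to in-memory rows shaped like legislator_terms; it is only an illustration of the interval logic, not the query the service would run, and the function name is hypothetical.

```python
from datetime import date
from typing import Any, List, Mapping, Sequence


def representatives_on(
    terms: Sequence[Mapping[str, Any]], state: str, district: int, day: date
) -> List[str]:
    """Return bioguide IDs whose House term for the seat covers the given day."""
    return [
        term["bioguide_id"]
        for term in terms
        if term["term_type"] == "rep"
        and term["state"] == state
        and term["district"] == district
        # Half-open interval: the term's end date itself is not covered.
        and term["term_start"] <= day < term["term_end"]
    ]
```

Using a half-open interval means a seat that changes hands on a given date resolves to the incoming member on that date, not to both.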

Volume

Steady-state. Roughly 540 current legislators; 12,230 historical legislators (effectively static, growing by ~30-50 per congress as members leave office); 49 committees + 181 subcommittees in the current congress; and 230 committee-roster entries (one per committee or subcommittee). Storage in Postgres (after denormalizing terms, party affiliations, leadership, family, other names) is well under 500 MB.

Update cadence (from upstream git log): staffer-driven changes (committee assignments, leadership, contact info) every few days; real-time member entry/exit per resignation/swearing-in (committed within hours); daily-ish automated upstream runs; major churn per election cycle (terms file rewrites for all sitting members). A daily pull is sufficient for v1. Consider a faster (hourly) pull only during the first weeks of a new Congress when the committee membership churn is rapid.

Caching / incremental sync

Each YAML file on the GitHub Pages mirror exposes ETag, Last-Modified, and Cache-Control: max-age=600 headers. The incremental strategy:

Per-file If-None-Match (primary) / If-Modified-Since (secondary) — a daily cron sends a conditional GET keyed on the stored ETag first, falling back to Last-Modified. 304 responses are free; 200 responses mean the file changed since last pull. (Only the GitHub Pages host honors both forms — raw.githubusercontent.com serves no Last-Modified and ignores If-Modified-Since.)
GitHub commits API as a coarse change feed — returns recent commits with commit.author.date and commit.message, useful for human-readable change reports ("Rep. Smith resigned on 2026-03-15") and for when a Last-Modified is bumped by an unrelated commit. See the GitHub list-commits API.
Granular diff at parse time — even when a file changed, most of its content didn't. Compute a row-level diff (by primary key) when reloading. Insert/update only changed rows.

Because the data is small, even an uncached full reload every day is cheap (~11 MB total transfer). The conditional GET matters mainly for change detection: it tells us whether to re-parse and re-emit per-row diffs.
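A minimal sketch of the conditional-GET bookkeeping, separated from the HTTP client so it can be unit-tested. It assumes per-file state is a dict with the last_etag and last_modified keys named under State tracking; the function names are hypothetical. Both validators are sent when available, since HTTP gives If-None-Match precedence over If-Modified-Since when a server supports both.

```python
from typing import Dict, Mapping, Optional, Tuple

State = Dict[str, Optional[str]]


def conditional_headers(state: Mapping[str, Optional[str]]) -> Dict[str, str]:
    """Build request headers from the validators stored on the last run."""
    headers: Dict[str, str] = {}
    if state.get("last_etag"):
        headers["If-None-Match"] = state["last_etag"]
    if state.get("last_modified"):
        headers["If-Modified-Since"] = state["last_modified"]
    return headers


def next_state(
    state: Mapping[str, Optional[str]],
    status: int,
    response_headers: Mapping[str, str],
) -> Tuple[bool, State]:
    """Return (should_parse, new_state) for one file's response."""
    if status == 304:
        # Unchanged: keep the stored validators and skip the parse.
        return False, dict(state)
    if status == 200:
        new_state = dict(state)
        new_state["last_etag"] = response_headers.get("ETag")
        new_state["last_modified"] = response_headers.get("Last-Modified")
        return True, new_state
    # Anything else (e.g. 503) is left to the caller's retry policy.
    raise RuntimeError(f"unexpected HTTP status {status}")
```

The new state should be persisted only after the parse and merge succeed, so a failed parse is retried on the next run rather than masked by a 304.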

Download / update strategy

Backfill (one-time, then daily)

This source is small enough that "backfill" and "incremental" are the same operation. There is no historical-only fetch — we re-pull the same URLs daily.

Fetch the 7 in-scope YAML files in parallel. 7 concurrent HTTPS requests (committees-historical.yaml is skipped at v1). Total transfer ~11 MB.
Parse YAML → records. No streaming parser is needed at this scale: PyYAML's safe_load reads each file whole without trouble, and ruamel.yaml remains an option if PyYAML falls short.
Upsert per-record. For each table, compute the row-level set difference between fresh and stored rows, then insert new rows and update changed ones. Rows missing from the fresh pull are not hard-deleted from legislators: a member who is no longer current keeps a row with is_current=false.

Specifically:

legislators-current.yaml → set is_current=true, upsert.
legislators-historical.yaml → set is_current=false, upsert.
A member who moves from current to historical (e.g. resigns mid-session) is added to the historical YAML and removed from the current YAML. The merge must catch that: for any row in legislators whose bioguide was in the previous run's current set but is not in this run's, flip is_current to false.
executive.yaml → set is_executive=true, upsert.
committees-current.yaml → flatten committees + subcommittees, set is_current=true, upsert. Compute system_code at parse time.
committees-historical.yaml → not fetched at v1 (out of scope — would set is_current=false and upsert when added later).
committee-membership-current.yaml → reset all current-congress rows, re-insert.
legislators-social-media.yaml, legislators-district-offices.yaml → upsert.
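The row-level diff and the current-to-historical flip described above can be sketched as two small set operations. This is an illustrative Python version that assumes each table's rows have been loaded into a dict keyed by primary key; the function names are hypothetical.

```python
from typing import Any, Dict, Hashable, Mapping, Set, Tuple

Row = Mapping[str, Any]


def diff_rows(
    stored: Mapping[Hashable, Row], fresh: Mapping[Hashable, Row]
) -> Tuple[Dict[Hashable, Row], Dict[Hashable, Row], Set[Hashable]]:
    """Return (inserts, updates, missing) keyed by primary key."""
    inserts = {key: fresh[key] for key in fresh.keys() - stored.keys()}
    updates = {
        key: fresh[key]
        for key in fresh.keys() & stored.keys()
        if fresh[key] != stored[key]
    }
    # Keys present before but absent now; the caller decides what to do
    # with them (for legislators: keep the row, never hard-delete).
    missing = set(stored.keys() - fresh.keys())
    return inserts, updates, missing


def demoted_members(previous_current: Set[str], fresh_current: Set[str]) -> Set[str]:
    """Bioguide IDs that left legislators-current.yaml since the last run."""
    return previous_current - fresh_current
```

Rows returned by demoted_members get is_current=false; unchanged rows produce no write at all, which keeps updated_at meaningful.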

Daily incremental

A 02:00 UTC daily cron (after the upstream nightly automation runs) is sufficient.

Conditional-GET pattern — for each file, hold the previous run's Last-Modified and ETag and send If-None-Match/If-Modified-Since. On 304, skip parse. On 200, full parse + diff-merge.

The ingester also pulls the GitHub commits API (unauthenticated) to surface a human-readable change log: each commit's commit.message (e.g. "Rep. Cherfilus-McCormick resigned, Rep. David Scott died") is recorded in ingestion_logs for the run, so the admin UI can show why the file changed without reading diffs.
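A sketch of the change-log step, assuming the GitHub REST "list commits" endpoint with its path, since, and per_page query parameters, and assuming the YAML files sit at the repository root. The helpers below only build the URL and reduce an already-decoded JSON response; fetching is left to whatever HTTP client the ingester uses, and the function names are hypothetical.

```python
from typing import Any, Dict, List, Optional, Sequence, Mapping
from urllib.parse import urlencode

REPO = "unitedstates/congress-legislators"


def commits_url(path: str, since: Optional[str] = None, per_page: int = 30) -> str:
    """URL listing recent commits that touched one file."""
    params: Dict[str, Any] = {"path": path, "per_page": per_page}
    if since:
        # ISO 8601 timestamp, e.g. the last_parsed_at of the previous run.
        params["since"] = since
    return f"https://api.github.com/repos/{REPO}/commits?{urlencode(params)}"


def change_log(payload: Sequence[Mapping[str, Any]]) -> List[Dict[str, str]]:
    """Reduce the API response to sha, author date, and first message line."""
    entries = []
    for item in payload:
        commit = item["commit"]
        lines = commit["message"].splitlines() or [""]
        entries.append(
            {
                "sha": item["sha"],
                "date": commit["author"]["date"],
                "summary": lines[0],
            }
        )
    return entries
```

Unauthenticated GitHub API calls are rate-limited per IP, so one request per changed file per run (rather than per file on every run) keeps the daily job well within the limit.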

State tracking

Per ingestion architecture: ingestion state in ingestion_runs, ingestion_logs, ingestion_source_state. Source key: legislators_committees. Per-file state stores last_etag, last_modified, last_parsed_at.

Failure modes

HTTP 503 from GitHub Pages CDN — exponential backoff and retry against the same Pages host. raw.githubusercontent.com is not used as a fallback: it serves no Last-Modified and ignores If-Modified-Since, so switching hosts would break the conditional-GET state watermark.
YAML parse error — log full file path, halt for this file, alert. Don't update partial state. The maintainers occasionally introduce a YAML error during a manual edit — usually fixed within hours.
New top-level field on a record — raw_yaml preserves it; alert via ingestion_logs so we know to extend the schema. Don't fail.
Bioguide collision with existing record but different name — extremely rare (bioguide IDs are never reused), but possible if a member's bioguide is renumbered. Update via bioguide_previous array; alert.
Subcommittee thomas_id collision across parents — by design, subcommittee thomas_ids are only unique within parent. Compose parent_thomas + sub_thomas for committees.id and the system_code.

Open questions

These don't block ingestion but should be resolved before this source is "shipped":

Historical committee membership. committee-membership-current.yaml covers only the current congress. Reconstructing committee rosters for past congresses requires either (a) git history of the file across previous-congress sessions, (b) the committees-historical.yaml membership-by-congress section (if it exists — needs check), or (c) the Congress.gov API /committee/{c}/{chamber}/{committeeCode}/members. Decide before promising "show me who chaired Senate Finance in the 115th." Likely v1.x.
Caucus memberships beyond chamber alignment. The terms[].caucus field captures formal chamber alignment for independents (e.g., Sanders caucuses with Democrats). It does not capture issue-based caucuses (Freedom Caucus, Problem Solvers, Black Caucus, etc.). Those live in source #29 (caucus memberships, separate doc).
family[] resolution to bioguide. The current free-text relative name doesn't link to another bioguide — could be valuable for political dynasty queries ("show me all Bushes in Congress"). One-time enrichment task, possibly via wikidata kinship properties.
Executive.yaml coverage cliff. Bioguide IDs exist for every president, but thomas and lis are absent for early presidents (they predate those systems). Make sure code handles missing crosswalks gracefully.
Pre-1789 Continental Congress delegates. The historical file appears to start at the first congress (1789). Earlier Continental Congress representatives, if they have bioguide IDs at all, may be absent. Out of scope for federal substrate, but check for completeness.
District office geocoding accuracy. The latitude/longitude values in legislators-district-offices.yaml are committed manually and have observed ~50 m precision. Good enough for "is this office near the requester" use cases; not good enough for in-building queries.
House delegates and resident commissioners. DC, Puerto Rico, Guam, Virgin Islands, American Samoa, Northern Marianas: they have terms[].type: rep with non-state values in state (e.g. DC, PR, GU). legislator_terms.state is plain text with no enum or CHECK constraint, so these load today; verify that any state validation added later still accepts them.
Update lag for ingestion. When a member dies or resigns mid-session, BILLSTATUS XML may continue listing them as a current sponsor for hours to days. The legislators YAML usually updates within hours of public reporting, but downstream sources may lag. The bills loader will need to handle "sponsor bioguide doesn't yet exist as is_current=true" gracefully.