Staff directories
Three loosely related person datasets, all classified as best-effort + community contribution:
- Congressional staff directory — chiefs of staff, legislative directors, legislative assistants, committee staff. The people who actually write bills and shape policy on Capitol Hill.
- Federal agency officials — agency heads, deputies, key program directors. The decision-makers in the executive branch.
- Caucus memberships — Freedom Caucus, Problem Solvers Caucus, Black Caucus, Hispanic Caucus, etc. Informal groupings of legislators around shared interests.
These three are bundled into one doc because they share the same characteristic problem: there is no canonical, machine-readable, comprehensively maintained public source. Each requires at least one of:
- Scraping data fragmented across hundreds of individual office websites.
- Community-contributed datasets that lag, are incomplete, or have gone dormant.
- Manual data entry by Josh users.
Honest disclosure: this is the weakest source area in v1. We promise partial coverage, document the gaps, and design the schema for community contribution, analogous to how Wikipedia and OpenStreetMap operate. If any of these becomes high-priority for users, we revisit with dedicated effort or commercial data partners.
Quick reference
| Field | Value |
|---|---|
| Source name | Staff directories (combined) |
| Sub-sources | Congressional staff, federal agency officials, caucus memberships |
| Publishers | None canonical. Mix of: official chamber/agency sites (incomplete), USA.gov agency index (contact directory; agency-discovery seed only, no official names), Wikidata (community-edited), Ballotpedia (paywalled), commercial directories (Leadership Connect, Quorum, Gary Hill, Knowlegis; licenses from $10K to $100K+ per year). |
| License | Mixed: official sites public-domain; community sources vary (Wikidata CC0, Ballotpedia CC-BY-SA non-commercial). |
| Coverage | Partial. Best for sitting Members and senior agency officials; thinning toward staff and caucus memberships. |
| Volume | Congressional staff: ~10,000-15,000 people across 535 offices + ~50 committees. Agency officials: ~5,000-10,000 senior. Caucus memberships: ~2,000-3,000 active member-caucus pairs. |
| Storage estimate | <500 MB |
| Auth | Varies by source |
| Rate limits | Varies; many sites bot-walled |
| Caching headers | Varies |
| Incremental sync hints | Mostly manual; site-specific |
| Stable ID format | Person: staff:{slug}-{disambiguator} (e.g. staff:smith-jane-001); caucus: caucus:{slug} (e.g. caucus:freedom-caucus); agency official: agency-official:{agency_id}-{slug}-{seq} (e.g. agency-official:doj-garland-merrick-001) |
| Status | exploring (partial): schema drafted, ingestion built best-effort |
Source priority decision
Per sub-source:
Congressional staff directory
Primary: per-office .house.gov / .senate.gov website scrapes. Each Member's official site has a "Staff" or "Meet the Team" page with Chief of Staff, Legislative Director, and key Legislative Assistants. Format varies wildly: some sites have structured listings with role + name + bio, others have a single photo carousel.
Skip for staff names: USA.gov agency index. It is an A–Z agency contact index (agency websites, phone numbers, mailing addresses) — no senior-official names and no office-staff names. Useful only as an agency-discovery seed, not a staff source. (usa.gov/agency-index)
Skip: ProPublica / OpenSecrets / OpenStates for staff — they don't maintain this.
Skip: commercial directories (Leadership Connect / Quorum / Gary Hill / Knowlegis) — paywalled at $10K+/year; not OSS-compatible; can't redistribute.
Future: community contribution. Plan a staff_contributions flow where users with verified affiliation can update entries.
Federal agency officials
Agency-discovery seed: USA.gov agency index. The A–Z index at usa.gov/agency-index is a contact directory (agency websites, phone numbers, mailing addresses) with no senior-official names and no org chart. It supplies the list of agencies to visit, not the officials themselves.
Primary (officials tier): per-agency Leadership pages. Each agency's own Leadership page carries the cabinet, deputy secretaries, and major bureau heads. Structured for bigger agencies (DOJ, DOD, HHS); chaotic for smaller ones.
Secondary: Wikidata SPARQL. Community-maintained; has bioguide-equivalent cross-references for some senior officials, and fills gaps where an agency Leadership page is unparseable. (query.wikidata.org)
Skip: scraping White House directory directly — bot-walled and changes per administration.
Caucus memberships
Primary: per-caucus website scrapes. A few caucuses publish member rosters publicly (Freedom Caucus, Problem Solvers Caucus, Bipartisan Climate Solutions Caucus, Congressional Black Caucus, etc.). The rest don't publish at all.
Secondary: Wikipedia. Community-maintained articles often list members. Quality varies.
Tertiary: news scrapes. When a caucus chair changes, news articles list current members — extract via NLP.
Skip: commercial directories for the same reason as staff.
Access notes
This is largely a write-many-fragile-scrapers operation. Each sub-source has its own access pattern.
For Member-office scrapes specifically:
- ~535 individual sites, ~50 committee sites.
- Most run on a small set of CMS templates (House Drupal 10/11 on Apache / Senate Drupal). Detect the template first, then apply a per-template parser.
- Sending a real-browser User-Agent is harmless; some sites bot-wall, but most don't.
- Update cadence: low (staff turn over once per ~2 years on average).
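The detect-then-dispatch pattern above could look roughly like the following. This is a minimal sketch, not the actual ingestion code: the template labels (`house_drupal`, `senate_drupal`, `custom`), the function names, and the reliance on a `generator` meta tag are all assumptions, and real sites may need different or additional markers.

```python
import re
from typing import Callable

# A parser takes raw HTML and returns a list of staff dicts.
Parser = Callable[[str], list[dict]]

# Assumption: Drupal sites expose a <meta name="Generator" content="Drupal ..."> tag
# with the name attribute before the content attribute.
GENERATOR_RE = re.compile(
    r'<meta[^>]+name=["\']generator["\'][^>]+content=["\']([^"\']+)["\']',
    re.IGNORECASE,
)


def detect_template(html: str, host: str) -> str:
    """Guess which CMS template a Member site uses (labels are hypothetical)."""
    match = GENERATOR_RE.search(html)
    is_drupal = bool(match and "drupal" in match.group(1).lower())
    if is_drupal and host.endswith(".house.gov"):
        return "house_drupal"
    if is_drupal and host.endswith(".senate.gov"):
        return "senate_drupal"
    return "custom"


def parse_site(html: str, host: str, parsers: dict[str, Parser]) -> list[dict]:
    """Dispatch to the parser registered for the detected template."""
    parser = parsers.get(detect_template(html, host))
    if parser is None:
        return []  # no parser registered for this template yet
    return parser(html)
```

Keeping detection separate from parsing means a template refresh only requires touching one parser, not the per-site fetch loop.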
Endpoints / URL patterns
There is no single endpoint structure. Per-office and per-caucus URLs are scraped individually. The relevant upstream datasets are community-maintained references, not APIs:
- unitedstates/congress-legislators — already covered in legislators & committees. Does not include staff or caucus memberships.
- Wikidata items for US House (Q7137) and US Senate (Q66096) — linking to Members + officials with cross-references.
- Per-caucus public sites (e.g. freedomcaucus.house.gov/members — illustrative example; rare to find and may not resolve).
- USA.gov A–Z agency index (usa.gov/agency-index) as the agency-discovery seed — lists agencies + contact info, no official names.
Vocabulary / enums
Staff role
A normalized enum of common Hill staff titles:

| Role | Acronym | Notes |
|---|---|---|
| `chief_of_staff` | COS | Senior aide; manages office. |
| `deputy_chief_of_staff` | DCOS | |
| `legislative_director` | LD | Manages legislative agenda. |
| `legislative_assistant` | LA | Staffs specific issue areas. |
| `legislative_correspondent` | LC | Junior; constituent mail + research. |
| `communications_director` | Comms Dir | |
| `press_secretary` | Press Sec | |
| `scheduler` | | |
| `district_director` | District Dir | Runs district/state offices. |
| `staff_director` | Staff Dir | For committees only. |
| `general_counsel` | GC | |
| `professional_staff_member` | PSM | Committee policy staffer. |
| `senior_advisor` | | |
| `intern` | | Excluded from data (privacy). |
| `other` | | |
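Mapping a raw scraped title onto this enum could be sketched as below. The pattern list and the `normalize_role` name are illustrative assumptions; real titles on office sites are messier and would need a longer, curated list.

```python
import re

# Ordered most-specific first so "Deputy Chief of Staff" is not captured
# by the plain "Chief of Staff" pattern. Patterns are illustrative only.
ROLE_PATTERNS = [
    (r"\bdeputy chief of staff\b", "deputy_chief_of_staff"),
    (r"\bchief of staff\b", "chief_of_staff"),
    (r"\blegislative director\b", "legislative_director"),
    (r"\blegislative assistant\b", "legislative_assistant"),
    (r"\blegislative correspondent\b", "legislative_correspondent"),
    (r"\bcommunications director\b", "communications_director"),
    (r"\bpress secretary\b", "press_secretary"),
    (r"\bscheduler\b", "scheduler"),
    (r"\bdistrict director\b", "district_director"),
    (r"\bstaff director\b", "staff_director"),
    (r"\bgeneral counsel\b", "general_counsel"),
    (r"\bprofessional staff member\b", "professional_staff_member"),
    (r"\bsenior advis[eo]r\b", "senior_advisor"),
    (r"\bintern\b", "intern"),
]


def normalize_role(raw_title: str) -> str:
    """Return the first matching enum value, or 'other' when nothing matches."""
    title = re.sub(r"\s+", " ", raw_title).strip().lower()
    for pattern, role in ROLE_PATTERNS:
        if re.search(pattern, title):
            return role
    return "other"
```

The raw title should still be stored alongside the normalized value, and rows that normalize to `intern` would be dropped before insert, per the table above. A combined title such as "Chief of Staff and Legislative Director" yields only the first match here.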
Caucus type
| Type | Notes |
|---|---|
| `ideological` | Freedom Caucus, Progressive Caucus, etc. |
| `demographic` | Congressional Black Caucus, Hispanic Caucus, Asian Pacific American Caucus, etc. |
| `issue` | Bipartisan Climate Solutions Caucus, Congressional Privacy Caucus, etc. |
| `bipartisan` | Problem Solvers, Tuesday Group, etc. |
| `regional` | Western Caucus, Rural Caucus, etc. |
| `other` | |
Federal agency official type
| Type | Notes |
|---|---|
| `political_appointee_senate_confirmed` | PAS: requires Senate confirmation |
| `political_appointee_non_confirmed` | PA: not Senate-confirmed |
| `senior_executive_service` | SES: career senior executive |
| `career` | Standard career civil servant |
| `other` | |
Stable ID formats
Staff person: `staff:{slug}-{disambiguator}`, where slug is `{lastname}-{firstname}` lowercased and disambiguator is a short hash or sequence number.
Examples:
- staff:smith-jane-001
- staff:rodriguez-carlos-002

Email addresses and SSN/EIN-style identifiers are deliberately not used as keys. Slugs are stable enough, and collisions are resolved by the disambiguator.
Caucus: `caucus:{slug}`, kebab-case lowercased name.
Examples:
- caucus:freedom-caucus
- caucus:problem-solvers-caucus
- caucus:congressional-black-caucus

Agency official: `agency-official:{agency_id}-{slug}-{seq}`
Examples:
- agency-official:doj-garland-merrick-001
- agency-official:epa-zeldin-lee-001
Schema (Postgres DDL)
```sql
-- ============================================================
-- Congressional staff persons
-- ============================================================
CREATE TABLE congressional_staff (
  id                text PRIMARY KEY,          -- 'staff:smith-jane-001'

  first_name        text,
  last_name         text NOT NULL,
  middle_name       text,
  suffix            text,
  full_name         text NOT NULL,             -- display name

  -- Affiliation (current)
  member_bioguide   text REFERENCES legislators(bioguide_id),  -- soft FK; for personal-office staff
  committee_id      text REFERENCES committees(id),            -- soft FK; for committee staff
  is_majority_staff boolean,                   -- committee staff only

  -- Role
  role              text NOT NULL,             -- normalized enum
  title             text,                      -- raw title from source
  issue_areas       text[],                    -- self-reported (e.g. ['health', 'tax'])

  -- Contact (often partial)
  email             text,
  phone             text,
  twitter           text,

  -- Provenance
  source_url        text NOT NULL,
  source_type       text NOT NULL CHECK (source_type IN (
    'office_scrape', 'committee_scrape', 'usa_gov', 'wikidata', 'community_contribution'
  )),
  confidence        text NOT NULL CHECK (confidence IN ('high', 'medium', 'low', 'unverified')),

  -- Lifecycle
  is_current        boolean NOT NULL DEFAULT true,
  started_at        date,
  ended_at          date,
  raw_html          bytea,
  fetched_at        timestamptz NOT NULL,
  last_verified_at  timestamptz,
  inserted_at       timestamptz NOT NULL DEFAULT now(),
  updated_at        timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX congressional_staff_member    ON congressional_staff (member_bioguide);
CREATE INDEX congressional_staff_committee ON congressional_staff (committee_id);
CREATE INDEX congressional_staff_role      ON congressional_staff (role);
CREATE INDEX congressional_staff_active    ON congressional_staff (member_bioguide) WHERE is_current;

-- array_to_string() is STABLE, but generated columns require IMMUTABLE
-- expressions, so it is wrapped here. The helper name is introduced for
-- this purpose; for text[] input the result is deterministic.
CREATE FUNCTION issue_areas_text(text[]) RETURNS text
  LANGUAGE sql IMMUTABLE
  AS $$ SELECT coalesce(array_to_string($1, ' '), '') $$;

ALTER TABLE congressional_staff ADD COLUMN search_tsv tsvector
  GENERATED ALWAYS AS (
    setweight(to_tsvector('english', full_name), 'A') ||
    setweight(to_tsvector('english', coalesce(title, '') || ' ' || coalesce(role, '')), 'B') ||
    setweight(to_tsvector('english', issue_areas_text(issue_areas)), 'C')
  ) STORED;
CREATE INDEX congressional_staff_search ON congressional_staff USING gin (search_tsv);
```
```sql
-- ============================================================
-- Federal agency officials
-- ============================================================
CREATE TABLE federal_agency_officials (
  id                   text PRIMARY KEY,       -- 'agency-official:doj-garland-merrick-001'

  first_name           text,
  last_name            text NOT NULL,
  full_name            text NOT NULL,

  agency_id            text NOT NULL REFERENCES agencies(id),  -- soft FK
  agency_name          text NOT NULL,
  sub_agency_name      text,                   -- when known (bureau / office)

  title                text NOT NULL,          -- "Attorney General", "Secretary of State", etc.
  role_type            text CHECK (role_type IN (
    'political_appointee_senate_confirmed',
    'political_appointee_non_confirmed',
    'senior_executive_service',
    'career',
    'other'
  )),

  -- Lifecycle
  is_current           boolean NOT NULL DEFAULT true,
  appointed_at         date,
  departed_at          date,

  administration       text,                   -- '47-Trump', '46-Biden'; joins to executive
  nomination_pn        text,                   -- Congress.gov nomination ID when PAS
  confirmation_vote_id text,                   -- 'vote:119-s-1-50' when PAS confirmed via roll call

  -- Provenance
  source_url           text,
  source_type          text NOT NULL,
  confidence           text NOT NULL CHECK (confidence IN ('high', 'medium', 'low', 'unverified')),

  raw_data             jsonb,
  fetched_at           timestamptz NOT NULL,
  last_verified_at     timestamptz,
  inserted_at          timestamptz NOT NULL DEFAULT now(),
  updated_at           timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX federal_agency_officials_agency ON federal_agency_officials (agency_id);
CREATE INDEX federal_agency_officials_active ON federal_agency_officials (agency_id) WHERE is_current;
CREATE INDEX federal_agency_officials_admin  ON federal_agency_officials (administration);

ALTER TABLE federal_agency_officials ADD COLUMN search_tsv tsvector
  GENERATED ALWAYS AS (
    setweight(to_tsvector('english', full_name || ' ' || title), 'A') ||
    setweight(to_tsvector('english', agency_name || ' ' || coalesce(sub_agency_name, '')), 'B')
  ) STORED;
CREATE INDEX federal_agency_officials_search ON federal_agency_officials USING gin (search_tsv);
```
```sql
-- ============================================================
-- Caucuses
-- ============================================================
CREATE TABLE caucuses (
  id                text PRIMARY KEY,          -- 'caucus:freedom-caucus'
  name              text NOT NULL,
  slug              text NOT NULL UNIQUE,
  chamber           text CHECK (chamber IN ('house', 'senate', 'joint') OR chamber IS NULL),
  caucus_type       text NOT NULL CHECK (caucus_type IN (
    'ideological', 'demographic', 'issue', 'bipartisan', 'regional', 'other'
  )),
  description       text,
  website           text,
  founded_year      int,
  is_active         boolean NOT NULL DEFAULT true,

  -- Leadership
  chair_bioguide    text REFERENCES legislators(bioguide_id),
  co_chair_bioguide text REFERENCES legislators(bioguide_id),

  -- Provenance
  source_url        text,
  raw_html          bytea,
  fetched_at        timestamptz NOT NULL,
  last_verified_at  timestamptz,
  inserted_at       timestamptz NOT NULL DEFAULT now(),
  updated_at        timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX caucuses_type    ON caucuses (caucus_type) WHERE is_active;
CREATE INDEX caucuses_chamber ON caucuses (chamber) WHERE is_active;

-- Many-to-many: members ↔ caucuses
CREATE TABLE caucus_memberships (
  id               bigserial PRIMARY KEY,
  caucus_id        text NOT NULL REFERENCES caucuses(id) ON DELETE CASCADE,
  bioguide_id      text NOT NULL REFERENCES legislators(bioguide_id),
  congress         smallint NOT NULL,          -- which congress this membership applies to
  role             text,                       -- 'member', 'co-chair', 'founding_member', etc.
  started_at       date,
  ended_at         date,
  is_current       boolean NOT NULL DEFAULT true,

  -- Provenance per membership
  source_url       text,
  confidence       text CHECK (confidence IN ('high', 'medium', 'low', 'unverified')),
  last_verified_at timestamptz,

  UNIQUE (caucus_id, bioguide_id, congress)
);

CREATE INDEX caucus_memberships_member ON caucus_memberships (bioguide_id, is_current);
CREATE INDEX caucus_memberships_caucus ON caucus_memberships (caucus_id, is_current);
```

Schema decisions worth flagging:
- `confidence` enum on every row: staff and caucus data is partial and ages quickly. Surfacing confidence to the agent lets it caveat answers ("based on a 2024 office scrape, possibly outdated").
- `source_url` and `last_verified_at` per row: a staffer who's still listed on a Member's website 2 years after departure is common. `last_verified_at` lets us downgrade confidence over time.
- `is_current` flag, never delete: preserves history. Even when staff leave, the record stays for "who was Senator X's LD in 2023" queries.
- `congress`-stamped `caucus_memberships`: caucus rosters reset each congress. Per-congress rows preserve history.
- Soft FKs to legislators, committees, agencies throughout: most entities resolve, but partial data is still useful with NULL. (See legislators & committees for the `legislators` and `committees` tables.)
- `raw_html` preserved on staff and caucus rows when the source is a scrape: lets us re-parse when site templates change.
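One way to downgrade confidence over time from `last_verified_at` is sketched below. The 180-day and 365-day thresholds, the one-step-per-threshold rule, and the function name are assumptions chosen for illustration; the source does not define a decay schedule.

```python
from datetime import datetime, timedelta, timezone

LEVELS = ["high", "medium", "low", "unverified"]


def effective_confidence(stored, last_verified_at, now=None):
    """Return the confidence to surface, given the stored value and row age.

    Hypothetical rule: drop one level after 180 days without verification,
    two levels after 365 days, never decaying below 'low'.
    """
    now = now or datetime.now(timezone.utc)
    if last_verified_at is None or stored == "unverified":
        return "unverified"
    age_days = (now - last_verified_at).days
    if age_days <= 180:
        steps = 0
    elif age_days <= 365:
        steps = 1
    else:
        steps = 2
    index = min(LEVELS.index(stored) + steps, LEVELS.index("low"))
    return LEVELS[index]
```

Computing this at read time leaves the stored `confidence` untouched, so the original assessment of the source is not lost as the row ages.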
Download / update strategy
Backfill: best-effort
Congressional staff:
- For each Member of Congress (~535):
  - Fetch their `.house.gov` or `.senate.gov` site.
  - Detect template (House Drupal 10/11 on Apache, Senate Drupal, custom). Use template-specific parser.
  - Extract staff names + titles + emails (where shown).
  - Save raw HTML.
- For each Committee (~50):
  - Fetch committee site.
  - Extract Staff Director, Chief Counsel, Communications Director, etc.
- Quarterly re-scrape.
Federal agency officials:
- Fetch the USA.gov agency index (usa.gov/agency-index) as the agency-discovery seed — it lists agencies + contact info, not officials.
- For each discovered agency, fetch its own "Leadership" page and extract the senior-official names + titles.
- Cross-reference Wikidata (SPARQL) for senior officials' bioguide-equivalent and to fill gaps where a Leadership page is unparseable.
- For PAS positions: cross-reference Congress.gov nomination IDs.
- Yearly re-scrape; on administration changes, full refresh.
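The Wikidata cross-reference step could start from a query like the one built below. This is a sketch, not the pipeline's actual query: it uses the Wikidata properties P39 (position held), P580 (start time), and P582 (end time) to find current holders of a position, and the position QID must be looked up per office (any QID shown in usage is a placeholder). The helper names are hypothetical, and no network call is made here.

```python
import re
from urllib.parse import urlencode

WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def current_holders_query(position_qid: str) -> str:
    """Build a SPARQL query for people holding a position with no end date."""
    if not re.fullmatch(r"Q\d+", position_qid):
        raise ValueError(f"not a Wikidata QID: {position_qid!r}")
    return f"""
SELECT ?person ?personLabel ?start WHERE {{
  ?person p:P39 ?statement .
  ?statement ps:P39 wd:{position_qid} .
  OPTIONAL {{ ?statement pq:P580 ?start . }}
  FILTER NOT EXISTS {{ ?statement pq:P582 ?end . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
""".strip()


def query_url(position_qid: str) -> str:
    """Return a GET URL for the public query service, asking for JSON results."""
    params = {"query": current_holders_query(position_qid), "format": "json"}
    return WIKIDATA_SPARQL_ENDPOINT + "?" + urlencode(params)
```

Rows derived this way are community-edited, so they would be stored with `source_type` set to a Wikidata value and a confidence below that of an agency's own Leadership page.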
Caucus memberships:
- Curated list of caucuses (~30-50) with public membership pages.
- For each, scrape membership.
- Wikipedia fallback for caucuses without public membership lists.
- Quarterly re-scrape; on chair-change news, immediate refresh.
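Applying a fresh roster scrape without deleting history can be sketched as a set difference over bioguide IDs for one caucus and one congress. The function name and return shape are assumptions, and the IDs in any usage are placeholders; the empty-scrape guard is a design choice added here, not something the source specifies.

```python
def diff_roster(stored_current: set[str], scraped: set[str]) -> dict[str, set[str]]:
    """Compare stored current members with a fresh scrape for one caucus."""
    if not scraped:
        # An empty scrape more likely means a broken parser than a dissolved
        # caucus, so change nothing and leave it to alerting.
        return {"to_insert": set(), "to_end": set(), "to_reverify": set()}
    return {
        "to_insert": scraped - stored_current,    # new caucus_memberships rows
        "to_end": stored_current - scraped,       # set is_current=false and ended_at
        "to_reverify": stored_current & scraped,  # bump last_verified_at
    }
```

Members in `to_end` keep their rows; only `is_current` and `ended_at` change, which preserves per-congress history.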
Daily incremental
Limited. The data churns slowly (staff turn over every 6-24 months on average). Quarterly batch refresh is the realistic cadence.
Community contribution
Plan a future feature: authenticated users with verified Hill / agency affiliation can submit corrections. Each contribution becomes a community_contribution source-type row with confidence='medium' and goes through review.
Failure modes
- Member site bot-walls. Some sites have CDN bot-detection. Fall back to a headed-browser fetcher.
- Site template changes. Common after election cycles when offices refresh sites. A per-template parser fails silently, so alert when a site that previously yielded staff now returns zero.
- Departed staff still listed. Common. `last_verified_at` decays confidence.
- Caucus website missing. Many caucuses don't publish member lists. The schema accommodates this with `is_active=true` but no memberships.
- Wikipedia inconsistencies. Wikipedia caucus articles vary in quality. Use confidence=`low` for Wiki-derived rows.
- Acting officials (e.g., "Acting Secretary"). The schema allows `title` to capture the "Acting" prefix; downstream consumers handle it.
- Multi-role staff. A senior aide may simultaneously be COS and Legislative Director. Options: multiple rows, or a composed role string. Lean: multiple rows per person, distinguished by `committee_id` / context.
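The zero-yield alert described for template changes can be sketched as a comparison of per-site staff counts between two scrape runs. The function name and the dict-of-counts input are assumptions for illustration.

```python
def zero_yield_regressions(
    previous_counts: dict[str, int], current_counts: dict[str, int]
) -> list[str]:
    """Return sites that yielded staff in the previous run but none in this one.

    A site missing from current_counts is treated as yielding zero.
    """
    return sorted(
        site
        for site, before in previous_counts.items()
        if before > 0 and current_counts.get(site, 0) == 0
    )
```

Sites returned here are candidates for a parser fix, and their existing rows should not be marked `is_current=false` on the strength of an empty scrape alone.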
Open questions
These don't block ingestion but should be resolved before this source is "shipped":
- Public release of partial coverage. Is it more honest to ship with 30% coverage and flag gaps, or to skip in v1 entirely until we have 70%+ coverage? Lean: ship partial, with confidence flags surfaced.
- Commercial directory licensing. Leadership Connect and Quorum have comprehensive coverage. Cost is $50K-100K+/year. v1: skip. v2 if operationally justified.
- Privacy and harassment risk. Staff are sometimes targeted online. We do not publish home addresses, personal phone numbers, or photos without consent. Limit to publicly-disclosed Hill / agency contact info.
- Agency officials below SES. Career civil servants are not generally public-facing. Schema accommodates them but we focus on PAS + senior career officials.
- Caucus discovery. New caucuses form regularly. Detection via news monitoring + Wikipedia.
- Cross-source person resolution. "Janet Doe, LD for Senator X" might be the same person who appears as a witness in a hearing two years later. Building a unified person index is a separate, large project. v1 keeps each data source's persons distinct; v2 attempts cross-source linking.
- Per-office issue-area inference. Staff often list "issues" (health, tax, etc.) on bio pages. We capture these in `issue_areas[]`; standardizing the vocabulary is open.
- Update flag for departing staff. One option is to periodically check a staffer's listed email and mark `is_current=false` on a bounce. The cadence of such a "ping check" is privacy-sensitive.
- Caucus membership source ambiguity. Some caucuses have multiple definitions: dues-paying vs vote-affiliated vs publicly-aligned. Document scope per caucus.
- Senior executive service appointments. SES rosters are partially public via OPM data; volume is large. Defer until v2.