Staff directories

Three loosely related person datasets, all classified as best-effort plus community contribution:

Congressional staff directory — chiefs of staff, legislative directors, legislative assistants, committee staff. The people who actually write bills and shape policy on Capitol Hill.
Federal agency officials — agency heads, deputies, key program directors. The decision-makers in the executive branch.
Caucus memberships — Freedom Caucus, Problem Solvers Caucus, Black Caucus, Hispanic Caucus, etc. Informal groupings of legislators around shared interests.

These three are bundled into one doc because all three share the same characteristic problem: there is no canonical, machine-readable, comprehensively maintained public source. Each relies on one or more of the following:

Fragmented scraping across hundreds of individual office websites.
Community-contributed datasets that lag, are incomplete, or have gone dormant.
Manual data entry by Josh users.

Honest disclosure: this is the weakest source area in v1. We promise partial coverage, document the gaps, and design the schema for community contribution, analogous to how Wikipedia and OpenStreetMap operate. If any of these becomes a high priority for users, we will revisit it with dedicated effort or commercial data partners.

Quick reference


Source name	Staff directories (combined)
Sub-sources	Congressional staff, federal agency officials, caucus memberships
Publishers	None canonical. Mix of: official chamber/agency sites (incomplete), USA.gov agency index (contact directory; agency-discovery seed only, no official names), Wikidata (community-edited), Ballotpedia (paywalled), commercial directories (Leadership Connect, Quorum, Gary Hill, Knowlegis; roughly $10K-100K+/year licenses).
License	Mixed: official sites public-domain; community sources vary (Wikidata CC0, Ballotpedia CC-BY-SA non-commercial).
Coverage	Partial. Best for sitting Members and senior agency officials; thinning toward staff and caucus memberships.
Volume	Congressional staff: ~10,000-15,000 people across 535 offices + ~50 committees. Agency officials: ~5,000-10,000 senior. Caucus memberships: ~2,000-3,000 active member-caucus pairs.
Storage estimate	<500 MB
Auth	Varies by source
Rate limits	Varies; many sites bot-walled
Caching headers	Varies
Incremental sync hints	Mostly manual; site-specific
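Stable ID format	Person: `staff:{slug}-{disambiguator}` (e.g. `staff:smith-jane-001`); caucus: `caucus:{slug}` (e.g. `caucus:freedom-caucus`); agency official: `agency-official:{agency_id}-{slug}-{seq}` (e.g. `agency-official:doj-garland-merrick-001`)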
Status	exploring (partial) — schema drafted, ingestion built best-effort

Source priority decision

Per sub-source:

Congressional staff directory

Primary: per-office .house.gov / .senate.gov website scrapes. Each Member's official site has a "Staff" or "Meet the Team" page with Chief of Staff, Legislative Director, and key Legislative Assistants. Format varies wildly — some sites have structured listings with role + name + bio, others have a single photo carousel.

Skip for staff names: USA.gov agency index. It is an A–Z agency contact index (agency websites, phone numbers, mailing addresses) — no senior-official names and no office-staff names. Useful only as an agency-discovery seed, not a staff source. (usa.gov/agency-index)

Skip: ProPublica / OpenSecrets / OpenStates for staff — they don't maintain this.

Skip: commercial directories (Leadership Connect / Quorum / Gary Hill / Knowlegis) — paywalled at $10K+/year; not OSS-compatible; can't redistribute.

Future: community contribution. Plan a staff_contributions flow where users with verified affiliation can update entries.

Federal agency officials

Agency-discovery seed: USA.gov agency index. The A–Z index at usa.gov/agency-index is a contact directory (agency websites, phone numbers, mailing addresses) with no senior-official names and no org chart. It supplies the list of agencies to visit — not the officials themselves.

Primary (officials tier): per-agency Leadership pages. Each agency's own Leadership page carries the cabinet, deputy secretaries, and major bureau heads. Structured for bigger agencies (DOJ, DOD, HHS); chaotic for smaller ones.

Secondary: Wikidata SPARQL. Community-maintained; has bioguide-equivalent cross-references for some senior officials, and fills gaps where an agency Leadership page is unparseable. (query.wikidata.org)
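As a rough illustration of the Wikidata fallback, the sketch below builds a SPARQL query for people currently holding a given position and the matching request URL for the public query service. It relies on the standard Wikidata properties P39 (position held), P580 (start time), and P582 (end time). The function names are hypothetical, and the position item ID must be looked up per office and passed in by the caller.

```python
from urllib.parse import urlencode

WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_current_holders_query(position_qid: str, limit: int = 50) -> str:
    """Return SPARQL selecting people who hold `position_qid` with no end date."""
    if not (position_qid.startswith("Q") and position_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {position_qid!r}")
    return f"""
SELECT ?person ?personLabel ?start WHERE {{
  ?person p:P39 ?stmt .
  ?stmt ps:P39 wd:{position_qid} .
  OPTIONAL {{ ?stmt pq:P580 ?start . }}
  FILTER NOT EXISTS {{ ?stmt pq:P582 ?end . }}
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en" . }}
}}
LIMIT {int(limit)}
""".strip()


def build_request_url(position_qid: str) -> str:
    """Return a GET URL asking the query service for JSON results."""
    query = build_current_holders_query(position_qid)
    return WIKIDATA_SPARQL_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
```

A statement without an end-time qualifier is only a hint that the person is still in office, since community editors often lag, so rows derived this way probably deserve a lower confidence than a fresh Leadership-page scrape.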

Skip: scraping White House directory directly — bot-walled and changes per administration.

Caucus memberships

Primary: per-caucus website scrapes. A few caucuses publish member rosters publicly (Freedom Caucus, Problem Solvers Caucus, Bipartisan Climate Solutions Caucus, Congressional Black Caucus, etc.). The rest don't publish at all.

Secondary: Wikipedia. Community-maintained articles often list members. Quality varies.

Tertiary: news scrapes. When a caucus chair changes, news articles list current members — extract via NLP.

Skip: commercial directories for the same reason as staff.

Access notes

This is largely a write-many-fragile-scrapers operation. Each sub-source has its own access pattern.

For Member-office scrapes specifically:

~535 individual sites, ~50 committee sites.
Most run on a small set of CMS templates (House Drupal 10/11 on Apache / Senate Drupal). Pattern-detect template; per-template parser.
Sending a real-browser User-Agent is harmless; some sites bot-wall, most don't.
Update cadence: low (staff turn over once per ~2 years on average).
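The "pattern-detect template; per-template parser" step could look roughly like the sketch below. It assumes the Drupal sites expose the usual generator meta tag and that the chamber can be read from the hostname; both assumptions need checking against sampled sites. The template labels are hypothetical names for illustration.

```python
import re
from urllib.parse import urlparse

# Matches a <meta name="generator" ...> tag regardless of attribute case.
GENERATOR_META = re.compile(r"<meta[^>]*name=[\"']generator[\"'][^>]*>", re.IGNORECASE)


def detect_template(url: str, html: str) -> str:
    """Classify a Member site as 'house_drupal', 'senate_drupal', or 'custom'."""
    host = (urlparse(url).hostname or "").lower()
    match = GENERATOR_META.search(html)
    is_drupal = bool(match) and "drupal" in match.group(0).lower()
    if is_drupal and host.endswith(".house.gov"):
        return "house_drupal"
    if is_drupal and host.endswith(".senate.gov"):
        return "senate_drupal"
    # Anything unrecognized falls through to a hand-written or manual path.
    return "custom"
```

The returned label can key a dictionary of parser functions, so adding a new template means adding one entry rather than touching the fetch loop.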

Endpoints / URL patterns

There is no single endpoint structure. Per-office and per-caucus URLs are scraped individually. The relevant upstream datasets are community-maintained references, not APIs:

unitedstates/congress-legislators — already covered in legislators & committees. Does not include staff or caucus memberships.
Wikidata items for the US House (Q11701) and US Senate (Q66096), which link to Members and officials with cross-references.
Per-caucus public sites (e.g. freedomcaucus.house.gov/members — illustrative example; rare to find and may not resolve).
USA.gov A–Z agency index (usa.gov/agency-index) as the agency-discovery seed — lists agencies + contact info, no official names.

Vocabulary / enums

Staff role

A normalized enum of common Hill staff titles:

Role	Acronym	Notes
`chief_of_staff`	COS	Senior aide; manages office.
`deputy_chief_of_staff`	DCOS
`legislative_director`	LD	Manages legislative agenda.
`legislative_assistant`	LA	Staffs specific issue areas.
`legislative_correspondent`	LC	Junior; constituent mail + research.
`communications_director`	Comms Dir
`press_secretary`	Press Sec
`scheduler`
`district_director`	District Dir	Runs district/state offices.
`staff_director`	Staff Dir	For committees only.
`general_counsel`	GC
`professional_staff_member`	PSM	Committee policy staffer.
`senior_advisor`
`intern`		Excluded from data — privacy.
`other`
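A minimal sketch of mapping a raw scraped title onto the role enum above. The patterns and their ordering are assumptions for illustration (for example, treating "State Director" as `district_director`), and the helper names are hypothetical; real titles will need a longer pattern list.

```python
import re

# Order matters: more specific titles must precede their substrings.
ROLE_PATTERNS = [
    (r"\bdeputy chief of staff\b", "deputy_chief_of_staff"),
    (r"\bchief of staff\b", "chief_of_staff"),
    (r"\blegislative director\b", "legislative_director"),
    (r"\blegislative assistant\b", "legislative_assistant"),
    (r"\blegislative correspondent\b", "legislative_correspondent"),
    (r"\bcommunications director\b", "communications_director"),
    (r"\bpress secretary\b", "press_secretary"),
    (r"\bscheduler\b", "scheduler"),
    (r"\b(district|state) director\b", "district_director"),
    (r"\bstaff director\b", "staff_director"),
    (r"\bgeneral counsel\b", "general_counsel"),
    (r"\bprofessional staff member\b", "professional_staff_member"),
    (r"\bsenior advis[oe]r\b", "senior_advisor"),
    (r"\bintern\b", "intern"),
]


def normalize_role(raw_title: str) -> str:
    """Return the first matching enum value, or 'other' if nothing matches."""
    title = " ".join(raw_title.lower().split())
    for pattern, role in ROLE_PATTERNS:
        if re.search(pattern, title):
            return role
    return "other"


def should_store(role: str) -> bool:
    """Interns are excluded from the dataset for privacy."""
    return role != "intern"
```

The raw title is still kept in the `title` column, so a wrong normalization can be corrected later without re-scraping.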

Caucus type

Type	Notes
`ideological`	Freedom Caucus, Progressive Caucus, etc.
`demographic`	Congressional Black Caucus, Hispanic Caucus, Asian Pacific American Caucus, etc.
`issue`	Bipartisan Climate Solutions Caucus, Congressional Privacy Caucus, etc.
`bipartisan`	Problem Solvers Caucus, etc.
`regional`	Western Caucus, Rural Caucus, etc.
`other`

Federal agency official type

Type	Notes
`political_appointee_senate_confirmed`	PAS — requires Senate confirmation
`political_appointee_non_confirmed`	PA — not Senate-confirmed
`senior_executive_service`	SES — career senior executive
`career`	Standard career civil servant
`other`

Stable ID formats

Staff person: staff:{slug}-{disambiguator} where slug is {lastname}-{firstname} lowercased and disambiguator is a short hash or sequence number.

Examples:

staff:smith-jane-001
staff:rodriguez-carlos-002

IDs deliberately avoid email addresses and SSN/EIN-style identifiers. Slugs are stable enough, and collisions are resolved by the disambiguator.

Caucus: caucus:{slug}, kebab-case lowercased name.

Examples:

caucus:freedom-caucus
caucus:problem-solvers-caucus
caucus:congressional-black-caucus

Agency official: agency-official:{agency_id}-{slug}-{seq}

Examples:

agency-official:doj-garland-merrick-001
agency-official:epa-zeldin-lee-001
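The three ID formats above can be generated with a small helper like the following sketch. It assumes a zero-padded three-digit sequence number as the disambiguator (the doc also allows a short hash) and strips accents to keep slugs ASCII; the function names are illustrative only.

```python
import re
import unicodedata


def slugify(text: str) -> str:
    """Lowercase, strip accents, and collapse non-alphanumerics to single hyphens."""
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def _join(*parts) -> str:
    """Join the non-empty slugs of the given parts (first_name may be missing)."""
    return "-".join(s for s in (slugify(p) for p in parts if p) if s)


def staff_id(last_name: str, first_name, seq: int) -> str:
    return f"staff:{_join(last_name, first_name)}-{seq:03d}"


def caucus_id(name: str) -> str:
    return f"caucus:{slugify(name)}"


def agency_official_id(agency_id: str, last_name: str, first_name, seq: int) -> str:
    return f"agency-official:{_join(agency_id, last_name, first_name)}-{seq:03d}"
```

The caller is responsible for picking the next free sequence number when a slug collides with an existing row.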

Schema (Postgres DDL)

-- ============================================================
-- Congressional staff persons
-- ============================================================

CREATE TABLE congressional_staff (
    id text PRIMARY KEY,                          -- 'staff:smith-jane-001'

    first_name text,
    last_name text NOT NULL,
    middle_name text,
    suffix text,
    full_name text NOT NULL,                      -- display name

    -- Affiliation (current)
    member_bioguide text REFERENCES legislators(bioguide_id),  -- soft FK; for personal-office staff
    committee_id text REFERENCES committees(id),               -- soft FK; for committee staff
    is_majority_staff boolean,                                  -- committee staff only

    -- Role
    role text NOT NULL,                                         -- normalized enum
    title text,                                                 -- raw title from source
    issue_areas text[],                                         -- self-reported (e.g. ['health', 'tax'])

    -- Contact (often partial)
    email text,
    phone text,
    twitter text,

    -- Provenance
    source_url text NOT NULL,
    source_type text NOT NULL CHECK (source_type IN (
        'office_scrape', 'committee_scrape', 'usa_gov',
        'wikidata', 'community_contribution'
    )),
    confidence text NOT NULL CHECK (confidence IN ('high', 'medium', 'low', 'unverified')),

    -- Lifecycle
    is_current boolean NOT NULL DEFAULT true,
    started_at date,
    ended_at date,
    raw_html bytea,
    fetched_at timestamptz NOT NULL,
    last_verified_at timestamptz,
    inserted_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX congressional_staff_member ON congressional_staff (member_bioguide);
CREATE INDEX congressional_staff_committee ON congressional_staff (committee_id);
CREATE INDEX congressional_staff_role ON congressional_staff (role);
CREATE INDEX congressional_staff_active ON congressional_staff (member_bioguide) WHERE is_current;

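-- array_to_string() is only STABLE, so Postgres rejects it inside a generated column.
-- Wrap it in an IMMUTABLE helper; this is safe for text[] because text output
-- does not depend on session settings.
CREATE FUNCTION immutable_array_to_string(arr text[], sep text) RETURNS text
    LANGUAGE sql IMMUTABLE PARALLEL SAFE
    AS $$ SELECT array_to_string(arr, sep) $$;

ALTER TABLE congressional_staff ADD COLUMN search_tsv tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('english', full_name), 'A') ||
        setweight(to_tsvector('english', coalesce(title, '') || ' ' || coalesce(role, '')), 'B') ||
        setweight(to_tsvector('english', coalesce(immutable_array_to_string(issue_areas, ' '), '')), 'C')
    ) STORED;
CREATE INDEX congressional_staff_search ON congressional_staff USING gin (search_tsv);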

-- ============================================================
-- Federal agency officials
-- ============================================================

CREATE TABLE federal_agency_officials (
    id text PRIMARY KEY,                          -- 'agency-official:doj-garland-merrick-001'

    first_name text,
    last_name text NOT NULL,
    full_name text NOT NULL,

    agency_id text NOT NULL REFERENCES agencies(id),  -- required FK; the row id embeds agency_id
    agency_name text NOT NULL,
    sub_agency_name text,                              -- when known (bureau / office)

    title text NOT NULL,                                -- "Attorney General", "Secretary of State", etc.
    role_type text CHECK (role_type IN (
        'political_appointee_senate_confirmed',
        'political_appointee_non_confirmed',
        'senior_executive_service',
        'career',
        'other'
    )),

    -- Lifecycle
    is_current boolean NOT NULL DEFAULT true,
    appointed_at date,
    departed_at date,

    administration text,                               -- '47-Trump', '46-Biden' — joins to executive
    nomination_pn text,                                -- Congress.gov nomination ID when PAS
    confirmation_vote_id text,                         -- 'vote:119-s-1-50' when PAS confirmed via roll call

    -- Provenance
    source_url text,
    source_type text NOT NULL,
    confidence text NOT NULL CHECK (confidence IN ('high', 'medium', 'low', 'unverified')),

    raw_data jsonb,
    fetched_at timestamptz NOT NULL,
    last_verified_at timestamptz,
    inserted_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX federal_agency_officials_agency ON federal_agency_officials (agency_id);
CREATE INDEX federal_agency_officials_active ON federal_agency_officials (agency_id) WHERE is_current;
CREATE INDEX federal_agency_officials_admin ON federal_agency_officials (administration);

ALTER TABLE federal_agency_officials ADD COLUMN search_tsv tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('english', full_name || ' ' || title), 'A') ||
        setweight(to_tsvector('english', agency_name || ' ' || coalesce(sub_agency_name, '')), 'B')
    ) STORED;
CREATE INDEX federal_agency_officials_search ON federal_agency_officials USING gin (search_tsv);

-- ============================================================
-- Caucuses
-- ============================================================

CREATE TABLE caucuses (
    id text PRIMARY KEY,                          -- 'caucus:freedom-caucus'
    name text NOT NULL,
    slug text NOT NULL UNIQUE,
    chamber text CHECK (chamber IN ('house', 'senate', 'joint') OR chamber IS NULL),
    caucus_type text NOT NULL CHECK (caucus_type IN (
        'ideological', 'demographic', 'issue', 'bipartisan', 'regional', 'other'
    )),
    description text,
    website text,
    founded_year int,
    is_active boolean NOT NULL DEFAULT true,

    -- Leadership
    chair_bioguide text REFERENCES legislators(bioguide_id),
    co_chair_bioguide text REFERENCES legislators(bioguide_id),

    -- Provenance
    source_url text,
    raw_html bytea,
    fetched_at timestamptz NOT NULL,
    last_verified_at timestamptz,
    inserted_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX caucuses_type ON caucuses (caucus_type) WHERE is_active;
CREATE INDEX caucuses_chamber ON caucuses (chamber) WHERE is_active;

-- Many-to-many: members ↔ caucuses
CREATE TABLE caucus_memberships (
    id bigserial PRIMARY KEY,
    caucus_id text NOT NULL REFERENCES caucuses(id) ON DELETE CASCADE,
    bioguide_id text NOT NULL REFERENCES legislators(bioguide_id),
    congress smallint NOT NULL,                   -- which congress this membership applies to
    role text,                                    -- 'member', 'co-chair', 'founding_member', etc.
    started_at date,
    ended_at date,
    is_current boolean NOT NULL DEFAULT true,

    -- Provenance per membership
    source_url text,
    confidence text CHECK (confidence IN ('high', 'medium', 'low', 'unverified')),
    last_verified_at timestamptz,

    UNIQUE (caucus_id, bioguide_id, congress)
);

CREATE INDEX caucus_memberships_member ON caucus_memberships (bioguide_id, is_current);
CREATE INDEX caucus_memberships_caucus ON caucus_memberships (caucus_id, is_current);

Schema decisions worth flagging:

confidence enum on every person and membership row: staff and caucus data is partial and ages quickly. Surfacing confidence to the agent lets it caveat answers ("based on a 2024 office scrape, possibly outdated").
source_url and last_verified_at per row: it is common for a staffer to remain listed on a Member's website 2 years after departure. last_verified_at lets us downgrade confidence over time.
is_current flag, never delete — preserves history. Even when staff leaves, the record stays for "who was Senator X's LD in 2023" queries.
congress-stamped caucus_memberships — caucus rosters reset each congress. Per-congress rows preserve history.
Soft (nullable) FKs to legislators and committees: most entities resolve, but partial data is still useful with NULL. The exception is federal_agency_officials.agency_id, which is declared NOT NULL. (See legislators & committees for the legislators and committees tables.)
raw_html preserved on staff and caucus rows when the source is a scrape — lets us re-parse when site templates change.
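One way to downgrade confidence over time from last_verified_at is sketched below. The one-level-per-year step and the floor at 'low' are assumptions chosen for illustration, not settled policy; 'unverified' is reserved here for rows that were never verified.

```python
from datetime import datetime, timedelta, timezone

LEVELS = ["high", "medium", "low", "unverified"]
DOWNGRADE_EVERY = timedelta(days=365)  # assumed: lose one level per year unverified


def effective_confidence(stored: str, last_verified_at, now=None) -> str:
    """Return the confidence to surface, given the stored level and its age."""
    if last_verified_at is None or stored == "unverified":
        return "unverified"
    now = now or datetime.now(timezone.utc)
    # Whole decay periods elapsed; clamp so future timestamps never upgrade.
    steps = max(0, (now - last_verified_at) // DOWNGRADE_EVERY)
    # Decay stops at 'low': the row was verified once, just long ago.
    index = min(LEVELS.index(stored) + steps, LEVELS.index("low"))
    return LEVELS[index]
```

Computing this at read time keeps the stored `confidence` value as the judgment made at verification, and a re-scrape only needs to bump last_verified_at.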

Download / update strategy

Backfill — best-effort

Congressional staff:

For each Member of Congress (~535):
- Fetch their .house.gov or .senate.gov site.
- Detect template (House Drupal 10/11 on Apache, Senate Drupal, custom). Use template-specific parser.
- Extract staff names + titles + emails (where shown).
- Save raw HTML.
For each Committee (~50):
- Fetch committee site.
- Extract Staff Director, Chief Counsel, Communications Director, etc.
Quarterly re-scrape.
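For a structured listing, the extraction step might resemble the sketch below, built on the standard-library HTML parser. The CSS class names `staff-name` and `staff-title` are hypothetical stand-ins for whatever a given template actually uses, and the parser assumes each name precedes its title and that the marked elements do not nest same-named tags.

```python
from html.parser import HTMLParser


class StaffListingParser(HTMLParser):
    # Hypothetical CSS classes; each real template needs its own mapping.
    FIELDS = {"staff-name": "name", "staff-title": "title"}

    def __init__(self):
        super().__init__()
        self.entries = []
        self._field = None  # field currently being captured, if any
        self._tag = None    # tag name that opened the captured element

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for css_class, field in self.FIELDS.items():
            if css_class in classes:
                self._field, self._tag = field, tag
                if field == "name":  # a name starts a new staff entry
                    self.entries.append({"name": "", "title": ""})

    def handle_data(self, data):
        if self._field and self.entries:
            self.entries[-1][self._field] += data

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._field = self._tag = None


def parse_staff_listing(html: str) -> list:
    """Return [{'name': ..., 'title': ...}] with whitespace normalized."""
    parser = StaffListingParser()
    parser.feed(html)
    parser.close()
    return [
        {key: " ".join(value.split()) for key, value in entry.items()}
        for entry in parser.entries
        if entry["name"].strip()
    ]
```

Photo-carousel pages will yield nothing from a parser like this, which is the expected signal to route the site to a different parser or to manual entry.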

Federal agency officials:

Fetch the USA.gov agency index (usa.gov/agency-index) as the agency-discovery seed — it lists agencies + contact info, not officials.
For each discovered agency, fetch its own "Leadership" page and extract the senior-official names + titles.
Cross-reference Wikidata (SPARQL) for senior officials' bioguide-equivalent and to fill gaps where a Leadership page is unparseable.
For PAS positions: cross-reference Congress.gov nomination IDs.
Yearly re-scrape; on administration changes, full refresh.

Caucus memberships:

Curated list of caucuses (~30-50) with public membership pages.
For each, scrape membership.
Wikipedia fallback for caucuses without public membership lists.
Quarterly re-scrape; on chair-change news, immediate refresh.
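Each scraped membership row needs a congress number. A minimal sketch of deriving it from the scrape date follows, assuming the modern convention that a Congress begins on January 3 of each odd-numbered year; it is not valid for dates before that convention took effect in the 1930s. The function name is illustrative.

```python
from datetime import date


def congress_for(day: date) -> int:
    """Return the number of the Congress in session on `day` (modern era only)."""
    year = day.year
    # January 1-2 of an odd year still belong to the outgoing Congress.
    if year % 2 == 1 and (day.month, day.day) < (1, 3):
        year -= 1
    # The 1st Congress began in 1789; each Congress spans two years.
    return (year - 1789) // 2 + 1
```

For example, a roster scraped on 2025-01-03 is stamped 119, while one scraped the day before is stamped 118.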

Daily incremental

Limited. The data churns slowly (staff typically turn over every 6-24 months), so a quarterly batch refresh is the realistic cadence.
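The cadences described in this section can be captured in a small scheduling check, sketched below. The sub-source keys and the exact day counts are assumptions; `forced` stands in for the out-of-band triggers mentioned above (an administration change or chair-change news).

```python
from datetime import date, timedelta

# Assumed intervals: roughly quarterly for staff and caucuses, yearly for agencies.
REFRESH_INTERVAL = {
    "congressional_staff": timedelta(days=91),
    "caucus_memberships": timedelta(days=91),
    "federal_agency_officials": timedelta(days=365),
}


def is_refresh_due(sub_source: str, last_run, today: date, forced: bool = False) -> bool:
    """True if the sub-source has never run, was forced, or its interval elapsed."""
    if forced or last_run is None:
        return True
    return today - last_run >= REFRESH_INTERVAL[sub_source]
```

A daily job can call this for each sub-source and do nothing on most days, which matches the low churn rate without needing a separate quarterly scheduler.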

Community contribution

Plan a future feature: authenticated users with verified Hill / agency affiliation can submit corrections. Each contribution becomes a community_contribution source-type row with confidence='medium' and goes through review.

Failure modes

Member site bot-walls. Some sites have CDN bot-detection. Headed-browser fetcher fallback.
Site template changes. Common after election cycles when offices refresh sites. Per-template parser fails silently — alert when a site that previously yielded staff now returns zero.
Departed staff still listed. Common. last_verified_at decays confidence.
Caucus website missing. Many caucuses don't publish member lists. Schema accommodates with is_active=true but no memberships.
Wikipedia inconsistencies. Wikipedia caucus articles vary in quality. Use confidence=low for Wiki-derived rows.
Acting officials (e.g., "Acting Secretary"). Schema allows title to capture "Acting" prefix; downstream consumers handle.
Multi-role staff. A senior aide may simultaneously be COS and Legislative Director. Options: multiple rows, or a composed role string. Lean: multiple rows per person, distinguished by committee_id / context. Because id is the primary key, each such row needs its own disambiguator.
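The silent-failure alert for template changes can be as simple as comparing per-site yields between two scrape runs. The sketch below assumes each run produces a mapping from site hostname to the number of staff extracted; the function name and the mapping shape are hypothetical.

```python
def find_silent_failures(previous_counts: dict, current_counts: dict) -> list:
    """Return sites that yielded staff last run but yield zero (or are missing) now."""
    return sorted(
        site
        for site, before in previous_counts.items()
        if before > 0 and current_counts.get(site, 0) == 0
    )
```

Sites that never yielded staff are deliberately not flagged, so the alert stays focused on regressions rather than on known-unparseable pages.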

Open questions

These don't block ingestion but should be resolved before this source is "shipped":

Public release of partial coverage. Is it more honest to ship with 30% coverage and flag gaps, or to skip in v1 entirely until we have 70%+ coverage? Lean: ship partial, with confidence flags surfaced.
Commercial directory licensing. Leadership Connect and Quorum have comprehensive coverage. Cost is $50K-100K+/year. v1: skip. v2 if operationally justified.
Privacy and harassment risk. Staff are sometimes targeted online. We do not publish home addresses, personal phone numbers, or photos without consent. Limit to publicly-disclosed Hill / agency contact info.
Agency officials below SES. Career civil servants are not generally public-facing. Schema accommodates them but we focus on PAS + senior career officials.
Caucus discovery. New caucuses form regularly. Detection via news monitoring + Wikipedia.
Cross-source person resolution. "Janet Doe, LD for Senator X" might be the same person who appears as a witness in a hearing two years later. Building a unified person index is a separate, large project. v1 keeps each data source's persons distinct; v2 attempts cross-source linking.
Per-office issue-area inference. Staff often list "issues" (health, tax, etc.) on bio pages. We capture in issue_areas[]; standardizing the vocabulary is open.
Update flag for departing staff. Periodically email-check a staffer's listed email; bounce → mark is_current=false. Privacy-respectful "ping check" cadence is sensitive.
Caucus membership source ambiguity. Some caucuses have multiple definitions: dues-paying vs vote-affiliated vs publicly-aligned. Document scope per caucus.
Senior executive service appointments. SES rosters are partially public via OPM data; volume is large. Defer until v2.