Staff directories ingester (congressional staff + agency officials + caucus memberships)

Header

Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.

Why

Three thinly-related person datasets, all classified as best-effort +
community contribution: (1) congressional staff
(chiefs of staff, LDs, LAs, committee staff — the people who actually
write bills), (2) federal agency officials (heads, deputies, key
program directors), and (3) caucus memberships (Freedom Caucus,
Problem Solvers, etc.). Bundled into one ingester because all three
share the same characteristic problem: there is no canonical,
machine-readable, comprehensively-maintained public source. Each
requires fragmented per-site scraping or community-contributed datasets.
Honest disclosure: this is the weakest source area in v1 — partial
coverage shipped, gap explicitly documented.

User stories

As a Hill staffer using my own AI agent, I want to ask "who's the LD for Sen. X" and get a name (or a clear "we don't have it") so that I can route an outreach email without guessing.

As a policy analyst using my own AI agent, I want to ask "who chairs the Freedom Caucus" or "list current Problem Solvers Caucus members" so that my agent can answer caucus-mapping questions without scraping per-caucus sites myself.

As an OSS self-hoster, I want to run `josh ingest staff-directories` on a fresh substrate so that I get the partial-coverage staff + agency + caucus tables on my own instance — with confidence flags so my agent never overstates coverage.

Acceptance criteria (EARS)

When `josh ingest staff-directories` is run on the deployed substrate, the system shall populate `staff_persons`, `agency_officials`, `caucus_memberships`, and `caucuses` rows from the configured discovery sources.
When a staff/official record is written, the system shall populate `confidence` (`high` for official-site scrape, `medium` for Wikidata/Wikipedia, `low` for news-NLP), the six citation columns, and a `last_verified_at` timestamp so stale records are visible.
While the ingester is running, if a single per-office or per-caucus scrape fails, then the system shall log the failure to `ingestion_logs` and continue with the next target — one broken site shall not block the rest.
Where a staff person can be linked to a Member's office, the system shall populate `staff_person_offices` linking `staff_persons.id` to `legislators.bioguide_id` (or to a committee via `committees.thomas_id`).
Where a caucus has a current chair / vice-chair / executive director, the system shall populate `caucus_leadership` rows.
When an FTS5 query is executed against `staff_persons_fts` or `caucuses_fts`, the system shall return BM25-ranked name/role matches.
When the ingester encounters a per-office page in a recognized template (House Drupal 10/11 on Apache / Senate Drupal), it shall use the per-template parser; otherwise it shall log to `staff_unparseable_targets` for review.

Success determiner

kindbash

Command

set -euo pipefail
ssh josh 'docker exec josh-ingester josh ingest staff-directories'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
  SELECT
    (SELECT COUNT(*) FROM staff_persons
      WHERE source_url IS NOT NULL AND raw_sha256 IS NOT NULL
      AND citation_string IS NOT NULL) AS staff_cited,
    (SELECT COUNT(*) FROM agency_officials
      WHERE source_url IS NOT NULL AND citation_string IS NOT NULL) AS agency_cited,
    (SELECT COUNT(*) FROM caucus_memberships) AS caucus_memberships,
    (SELECT COUNT(*) FROM staff_persons WHERE confidence IN (''high'',''medium'',''low'')) AS confidence_set;"'

Expect

staff_cited >= 200 (partial coverage of ~535 offices' senior staff expected for v1), agency_cited >= 100, caucus_memberships >= 100, confidence_set == staff_cited.

Coverage is explicitly partial. Floors are conservative: a few hundred staff persons + a few hundred agency officials + a few hundred caucus memberships are realistic for the v1 best-effort scrape. Numbers will improve as community contribution kicks in (separate feature). This is the weakest source area in v1 — the spec ships the ingester + schema + confidence flags, not full coverage.

Clarifications needed

Backfill scope priority: per-office scrapes for sitting Members (~535 sites) vs caucus scrapes (~30-50 caucuses with public rosters) vs agency-officials (usa.gov/agency-index as agency-discovery seed only — it carries contacts, not officials — then per-agency Leadership pages + Wikidata SPARQL for the actual names). Lean: caucus first (smallest, highest hit rate per scrape), then sitting-Member offices (highest user-value), then agency officials (most fragmented).
Per-office template detection strategy: hand-code a per-template parser for House Drupal 10/11 (Apache) + Senate Drupal (covers ~80% of offices), per-target parser for the rest? Or LLM-first extraction across all sites? Lean: hand-code templates for the dominant CMSes, log unrecognized to `staff_unparseable_targets` for case-by-case review.
Community-contribution flow (`staff_contributions` writes by verified users): in v1 or defer until OSS launch when there's a contributor base? Lean: defer the write path to v1.x; v1 ships read-only ingester + schema that anticipates contributions.
Wikidata reliance for agency-officials and caucuses: pull SPARQL queries (clean, structured) or the per-item JSON (slower, more thorough)? Lean: SPARQL for v1; per-item JSON for the cross-reference enrichment pass.
Bot-walls on individual House/Senate sites: blanket headed-browser fetcher (slow, expensive) or `requests` with realistic UA + retry on 403 to headed (faster, falls back when needed)? Lean: requests-first with headed fallback.
Stale-record decay: `last_verified_at` past 12 months — mark as stale automatically, or only on next manual revisit? Lean: 12-month auto-stale flag, agents surface as 'low confidence due to age'.

Out of scope

Commercial-directory licensing (Leadership Connect / Quorum / Gary Hill / Knowlegis) — paywalled at $10K+/year; not OSS-compatible.
Personal-office staff for non-sitting Members (former Senators / Reps) — out of scope for v1.
Sub-agency / regional federal staff (only senior cabinet-and-deputy tier) — partial-coverage v1 ceiling.
State-level legislative staff and state agency officials — Phase 6+.
Caucus voting positions / party-line analysis — that's the roll-call ingester's job, not staff directories.
Real-time scraping during business hours (gentle, daily-ish refresh; ~535 office sites + ~50 caucus sites is enough to trigger throttling if too aggressive).

Dependencies

Plan

Source module follows the standard Source protocol. Per-source spec at
https://docs.usejosh.com/sources/staff-directories/.

Three sub-sources, one source module, three discover-fetch-parse paths
feeding shared schema:

1. Congressional staff — per-office scrapes against *.house.gov
and *.senate.gov Member sites, plus committee sites. Per-template
parsers for House Drupal 10/11 (Apache) + Senate Drupal cover the majority;
per-target hand-config for the rest.
2. Federal agency officials — usa.gov/agency-index is an A–Z agency
*contact* index (websites/phones/addresses), with no senior-official
names or org chart, so it is not usable for the officials tier; the
senior tier comes from per-agency Leadership pages for bigger
agencies plus Wikidata SPARQL for cross-reference enrichment. usa.gov
is retained only as an agency-discovery seed (the list of agencies to
visit), not as an officials source.
3. Caucus memberships — per-caucus website scrapes for the few
that publish (Freedom Caucus, Problem Solvers, Bipartisan Climate
Solutions, CBC, etc.); Wikipedia for community-maintained backbone;
news scrapes (deferred) for rotation tracking.

Discover/fetch/parse/load shape:
- discover: per-source-type config in
shared/josh_substrate/data/staff_directories_targets.yaml listing
URLs, expected templates, and confidence-level priors. Watermark per
target in ingestion_source_state.last_seen_key='staff-{kind}-{target_slug}-lastmod'.
- fetch: HTTP-first with realistic UA; fall back to headed browser
on 403. Raw HTML to /data/corpus/staff-directories/bodies/raw/{kind}/{target_slug}/{date}.html.
SHA256 for raw_sha256. Polite ~0.5 req/sec sustained per host.
- parse: per-template parser produces staff_persons /
agency_officials / caucus_memberships rows with confidence flags.
LLM-assisted extraction is an escape hatch for non-template targets
(gated behind a budget cap to avoid runaway costs).
- load: idempotent upsert on (person_slug) for staff;
(agency_id, role) for officials; (caucus_slug, bioguide_id) for
memberships. last_verified_at bumped on every successful re-fetch
(whether the data changed or not). FTS5 sync triggers fire on
insert/update/delete.

Migration shape: staff_persons, staff_person_offices, staff_roles,
agency_officials, caucuses, caucus_memberships, caucus_leadership,
staff_unparseable_targets, staff_persons_fts, caucuses_fts. Soft-FKs
to legislators and committees throughout. No *_chunks_vec0 —
staff directory is structured, not body text.

Schedule: weekly Sunday 04:00 UTC for the per-office/per-caucus refresh
(low cadence — staff turn over once per ~2 years on average; weekly is
plenty). Daily 03:00 UTC for the lightweight usa.gov/agency-index
discovery-seed refresh + Wikidata SPARQL pulls.

Tasks

0 of 15 done.

Changelog

No history yet.