Roll-call votes ingester
Header
Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.
Why
Every recorded floor vote in the House and Senate — passage, motions to
recommit, amendment votes, cloture, motion to proceed, election of
Speaker, motions to table — and every member's individual vote on each.
This is the most direct expression of legislator behavior in the
substrate. FK-target-rich on bills (most roll calls attach to alegis-num) and on legislators (every recorded vote names the member),
which makes "how did Senator X vote on the SPEED Act" or "every roll call
on HR 1" trivial joins. Both chambers publish XML back to 1989, and the
schemas are completely different. As of 2026-05-29 the Senate side is no
longer bot-walled: vote_menu_119_1.xml and vote_119_1_00001.xml both
return plain HTTP/1.1 200 (Server: Apache) to a bare curl (confirmed
with default and curl/8 UA), so plain requests/httpx fetch works for
both chambers — no headed browser required.
User stories
As a policy analyst using my own AI agent, I want to ask "how did the swing-district Republicans vote on the budget bill" so that I can do vote-prediction and aisle-crossing analysis without scraping clerk.house.gov myself.
As an OSS self-hoster, I want to run `josh ingest roll-call-votes` on a fresh substrate so that I can populate House + Senate votes from public XML on my own instance.
As a downstream agent doing impact analysis, I want per-member vote rows joined to bills and legislators so that aisle-crossing, party-line, and absent-from-vote queries work in one substrate.
Acceptance criteria (EARS)
- When `josh ingest roll-call-votes --congress 119` is run on the deployed substrate, the system shall populate `roll_call_votes` rows whose six citation columns are all non-null.
- When a roll-call XML is parsed, the system shall populate one `roll_call_member_votes` row per voting member, resolved to `legislators.bioguide_id` (with the Senate `lis_member_id`→`bioguide_id` crosswalk applied for Senate votes).
- While the ingester is running, if a single roll-call XML fails to parse, then the system shall log the failure to `ingestion_logs` (with `(chamber, congress, session, vote_number)`) and continue with the next vote.
- Where a roll-call references a `legis-num` (e.g. `H R 498` or `H AMDT 234`), the system shall resolve it to a `bills.id` (or `bill_amendments.id` if the substrate carries amendments) and write the FK; if unresolved, the row shall record the raw `legis-num` text and queue a retry.
- The system shall enforce a UNIQUE constraint on `(chamber, congress, session, vote_number)` so re-fetching the same XML upserts rather than duplicates.
- When an FTS5 query is executed against `roll_call_votes_fts`, the system shall return BM25-ranked results over `(question, vote_description)`.
- When a Senate XML is fetched, the system shall do so over plain `requests`/`httpx` (as of 2026-05-29 the Akamai bot wall is gone — `vote_menu_119_1.xml` and `vote_119_1_00001.xml` return HTTP/1.1 200 from Apache to a bare `curl`); if a future fetch returns 403, then the system shall fall back to the headed-browser fetcher pattern.
Success determiner
Command
set -euo pipefail
# === CNV preflight: substrate reachable? loadavg sane? ===
# Three-state exit: 0 = PASS, 1 = FAIL, 77 = CNV (could-not-verify).
# A routine interpreter must treat 77 as "skip, do not act" so transient
# infra problems never cascade into a spec-rollback decision.
if ! ssh -o ConnectTimeout=5 -o BatchMode=yes josh 'true' 2>/dev/null; then
echo "CNV: ssh josh unreachable; cannot verify" >&2
exit 77
fi
LOAD1=$(ssh josh 'cut -d" " -f1 /proc/loadavg')
if awk "BEGIN { exit !($LOAD1 > 8.0) }"; then
echo "CNV: josh loadavg=${LOAD1} > 8.0; refusing to verify under load" >&2
exit 77
fi
# === Smoke ingest: 50 votes from the current Congress ===
ssh josh 'docker exec josh-ingester josh ingest roll-call-votes --congress 119 --session 1 --max 50'
# === Asserting probes — exit nonzero on any assertion failure ===
# Anti-pattern this replaces: bare COUNT(*) queries whose returned values
# were never read; bash exit 0 meant "SELECT parsed", not "assertion held".
python3 - <<'PY'
import subprocess
def q(sql: str) -> int:
out = subprocess.check_output(
["ssh", "josh", "docker", "exec", "josh-core",
"sqlite3", "/data/josh.db", sql],
text=True,
).strip()
return int(out) if out else 0
# Probe 1 — AC #1: six citation columns non-null on >= 50 rows
populated = q("""
SELECT COUNT(*) FROM roll_call_votes
WHERE congress = 119 AND session = 1
AND source_url IS NOT NULL AND source_org IS NOT NULL
AND retrieved_at IS NOT NULL AND raw_sha256 IS NOT NULL
AND citation_string IS NOT NULL;
""")
assert populated >= 50, \
f"FAIL probe-1: populated roll_call_votes = {populated}, expected >= 50"
# Probe 2 — AC #2: per-member rows resolved through legislators.bioguide_id
member_votes = q("""
SELECT COUNT(*) FROM roll_call_member_votes rcmv
JOIN legislators l ON l.bioguide_id = rcmv.bioguide_id
WHERE rcmv.vote_id IN (SELECT id FROM roll_call_votes WHERE congress = 119);
""")
assert member_votes >= 50, \
f"FAIL probe-2: bioguide-resolved member_votes = {member_votes}, expected >= 50"
print(f"PASS: populated={populated} member_votes={member_votes}")
PY
Expect
Smoke run only. Full backfill (101st Congress – present, ~30-50K roll calls × ~435/100 members = ~5-10M per-member rows) runs detached on the server per `add-a-new-source.html`. As of 2026-05-29 the Senate side fetches over plain HTTP (Akamai bot wall gone — bare `curl` gets HTTP/1.1 200 from Apache); no headed-browser fetcher needed. Adversarial mutation suite lives at `docs/spec/mutations/roll-call-votes-ingester.yaml`. Before flipping this spec to `verified`, every mutation marked `expect: fail` must cause the determiner to exit nonzero. Mutations marked `expect: pass` document known gaps in this v1 determiner (deferred to v2).
Clarifications needed
- Backfill scope: 101st-current (full ~30-50K roll calls × all members) vs current Congress only for v1 launch. Lean: 113th-current to align with bills coverage; older votes have no bill FK target.
- Senate `lis_member_id`→`bioguide_id` crosswalk: source. `unitedstates/congress` Python scrapers maintain one; copy into `shared/josh_substrate/data/` or fetch at ingestion time? Lean: copy-as-data, refresh quarterly.
- Amendment votes: model `legis-num='H AMDT 234'` as FK to a `bill_amendments` table (added by `bills-ingester`) or as a free-text column? Lean: bills FK + free-text fallback so unresolved amendments don't lose data.
- Absent / not-voting members: insert a `roll_call_member_votes` row with `position='absent'` (full attendance audit) or omit (smaller table, lossy)? Lean: insert — attendance is itself signal.
- Headed-browser fetcher: as of 2026-05-29 the Senate Akamai bot wall is gone (`vote_menu_119_1.xml` and `vote_119_1_00001.xml` return HTTP/1.1 200 from Apache to a bare `curl`), so v1 fetches Senate XML over plain `requests`/`httpx` — no headed browser. Lean: drop the headed-browser path for Senate; keep it as a documented fallback only if 403s reappear.
- Congress.gov v3 `/house-vote` enrichment (118th+ only): wire as a verification cross-check or skip for v1? Lean: skip for v1 (Clerk XML is authoritative); enrich in v1.x when API quota is paid for.
Out of scope
- State-level roll-call votes — Phase 6+.
- Pre-1989 historical roll calls — out of digital structured coverage.
- Voice votes / unanimous consent (no per-member breakdown) — only recorded roll calls in scope.
- Scraping of `senate.gov/.htm` HTML — XML adjacent has same data, structured.
- GovTrack vote dumps — derived from same Clerk + Senate XMLs; direct is better.
- Real-time webhooks / push subscriptions — surface-layer concern.
Dependencies
Plan
Source module follows the standard Source protocol. Per-source spec athttps://docs.usejosh.com/sources/roll-call-votes/.
Two chambers, two parsers, one schema:
- House: https://clerk.house.gov/evs/{year}/index.asp (HTML index)
+ roll{NNN}.xml (per-vote XML). Open, no bot wall. Vote XML uses
name-id which is bioguide-format directly.
- Senate: https://www.senate.gov/legislative/LIS/roll_call_lists/vote_menu_{C}_{S}.xml
(per-session menu) + roll_call_votes/vote{C}{S}/vote_{C}_{S}_{NNNNN}.xml
(per-vote XML, 5-digit zero-padded number). No bot wall as of
2026-05-29 — vote_menu_119_1.xml and vote_119_1_00001.xml return
HTTP/1.1 200 (Server: Apache) to a bare curl, so plain fetch works.
Vote XML uses lis_member_id — needs crosswalk to bioguide.
Discover/fetch/parse/load shape:
- discover: House — poll index.asp for the current year, diff
against last seen (year, rollnumber) set. Senate — poll
vote_menu_{C}_{S}.xml, diff against last seen vote_number set.
Watermark per-chamber-per-session in ingestion_source_state.
- fetch: House and Senate XML both over plain requests/httpx
(Senate Akamai bot wall gone as of 2026-05-29 — bare curl gets
HTTP/1.1 200 from Apache). SHA256 the XML bytes for raw_sha256.
- parse: chamber-specific lxml walker → roll_call_votes row +
roll_call_member_votes rows. House name-id is bioguide-direct;
Senate lis_member_id runs through the crosswalk.
- load: idempotent upsert on (chamber, congress, session, vote_number).
Per-member rows inserted in one transaction per vote. FTS5 sync
triggers fire on insert/update/delete. Bill FK resolved opportunistically
(legis-num → bills.id when bill exists in substrate).
Migration shape: roll_call_votes (with the six citation columns + the
unique constraint), roll_call_member_votes (PK (vote_id, bioguide_id),
FK to roll_call_votes and legislators), roll_call_votes_fts virtual
+ 3 sync triggers. No *_chunks_vec0 — the question/description text is
short and structured, FTS5 is enough.
Schedule: hourly :05 during weekdays. House votes happen in clusters
during active session days; Senate similarly. With plain HTTP fetch on
both chambers (Senate Akamai wall gone as of 2026-05-29), keep the
per-tick budget at ~30 votes max so a backfill doesn't tar-pit the live
ingest.
Tasks
0 of 17 done.
- t1 Alembic migration applied locally + production (roll_call_votes + roll_call_member_votes + FTS5)
- t2 Pydantic models in shared/josh_substrate/src/josh_substrate/models/roll_call_votes.py (House + Senate variants)
- t3 Source module implementing the `Source` protocol
- t4 Body normalization N/A (structured XML → structured rows; no Markdown body)
- t5 Citation metadata: all 6 columns populated; citation_for() formatter + unit tests on House + Senate fixtures
- t6 FTS5 virtual table + 3 sync triggers wired into migration
- t7 N/A — no body chunks, no vector pipeline
- t8 Unit tests for House + Senate parsers against captured XML fixtures (passage, amendment, motion to recommit, election-of-Speaker variants)
- t9 Integration test for the full discover→fetch→parse→load pipeline (House + Senate)
- t10 Smoke backfill (50 votes from current Congress, both chambers) clean per success_determiner
- t11 Production backfill at agreed scope (per clarifications_needed) complete
- t12 Hourly cron registered and observed firing once
- t13 No recurring errors in `ingestion_logs` for the last completed run (allow transient bioguide-mismatch retries)
- t14 Source doc updated; inventory rows in https://docs.usejosh.com/josh-data-sources/ + https://docs.usejosh.com/data-status/ flipped to ingestion-built
- t15 Senate `lis_member_id`→`bioguide_id` crosswalk shipped (per clarification — copy from `unitedstates/congress` into `shared/josh_substrate/data/`)
- t16 Senate plain-HTTP fetch confirmed working through one full backfill day without 403s (Akamai bot wall gone as of 2026-05-29; headed-browser fetcher kept as documented fallback only if 403s reappear)
- t17 URL verification gate: 5/5 random House + 5/5 random Senate vote rows match upstream `result` + `tally` via WebFetch / browser-use
Changelog
No history yet.