sourceplannedp0

LDA filings ingester (Senate + House unified at lda.gov)

lda-filings-ingester · updated 2026-05-29T00:00:00Z

Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.

Lobbying Disclosure Act filings: who is lobbying whom, on what bills,
for which clients, and (for political contributions) where the
lobbyists' money is going. Federal lobbying activity must be reported
quarterly to both chambers; political contributions semi-annually.
The Senate Office of Public Records and Clerk of the House publish
through a unified API at lda.gov/api/v1/ (the legacy
lda.senate.gov/api/v1/ is sunset 2026-06-30). For Josh, LDA filings
are the primary signal of "who is influencing this legislation": a user
looking at a bill should see the disclosed lobbyists who reported
lobbying activity on it; a user looking at a Member should see
contributions reported to/from PACs the lobbyists represent.

As a policy analyst using my own AI agent, I want to ask "who's lobbying on the SPEED Act this quarter, and for whom" so that I can map the influence landscape without scraping lda.gov myself.

As an OSS self-hoster, I want to run `josh ingest lda-filings` on a fresh substrate so that I can populate the LDA corpus from the unified lda.gov API on my own instance.

As a downstream agent doing influence-graph analysis, I want filings joined to bills, lobbyists, registrants, and clients in one substrate so that "every Q1 2025 filing that listed HR 1 as a lobbied issue" resolves in a single query.

  1. When `josh ingest lda-filings --year 2026` is run on the deployed substrate, the system shall populate `lda_filings` rows whose six citation columns are all non-null.
  2. When a filing is parsed, the system shall populate child rows in `lda_filing_lobbying_activities` (one per `lobbying_activities[]` entry) and `lda_activity_government_entities` (per the post-2021-02-14 per-activity model; pre-date filings populate `lda_filings.government_entities_filing_level` as a JSON array).
  3. While the ingester is running, if a single filing fails to parse or the lda.gov API returns 502, then the system shall log the failure to `ingestion_logs` (with the offending `filing_uuid`) and continue with the next record.
  4. Where a filing's `lobbying_activities[]` declares specific bills (`description` text patterns like `H.R. 4920`; `general_issue_specific` is null on lda.gov), the system shall extract and write to `lda_activity_bills` linking to `bills.id` (soft-FK; raw text preserved).
  5. When LD-203 contribution filings are ingested, the system shall populate `lda_contributions` rows with one row per `contribution_items[]` entry, resolving recipient legislator references to `legislators.bioguide_id` opportunistically.
  6. The system shall enforce a UNIQUE constraint on `(filing_uuid)` so re-fetching the same filing upserts rather than duplicates.
  7. When an FTS5 query is executed against `lda_filings_fts`, the system shall return BM25-ranked results over `(client_name, registrant_name, lobbying_issues_text)`.
  8. When a fetch is made against lda.gov, the request shall use a plain HTTP client (no Akamai bot wall — default, blank, or python-requests UA returns 200 JSON from gunicorn).
kindbash

Command

set -euo pipefail
ssh josh 'docker exec josh-ingester josh ingest lda-filings --year 2026 --max 100'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
  SELECT
    (SELECT COUNT(*) FROM lda_filings WHERE filing_year = 2026
      AND source_url IS NOT NULL AND raw_sha256 IS NOT NULL
      AND citation_string IS NOT NULL) AS filings,
    (SELECT COUNT(*) FROM lda_filing_lobbying_activities lfla
      JOIN lda_filings f ON f.id = lfla.filing_id
      WHERE f.filing_year = 2026) AS activities;"'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db \
  "SELECT COUNT(*) FROM lda_filings_fts WHERE lda_filings_fts MATCH '\''appropriations'\'';"'

Expect

>= 100 filings, >= 100 activities (typically multiple per filing), FTS5 match >= 1.

Smoke run only. Full backfill (1999-present, ~1.94M filings) at the api-keyed 120/min throttle takes ~5 hours saturated; runs detached on the server per `add-a-new-source.html`. Plain HTTP client throughout (no headed browser — lda.gov returns 200 JSON to plain curl).

  • Backfill scope: full 1999-present (~1.94M filings) vs last 5 years (~250K-400K) vs last 2 years (~100K-160K) for v1 launch. Lean: full — at api-keyed throttle it's ~5 hours saturated and the historical breadth is the value prop.
  • API key procurement: lda.gov API key is free at https://lda.gov/api/register/ but needs registration. Confirm key is in `.kamal/secrets` (env: `LDA_GOV_API_KEY`) before backfill. Anonymous's 15/min limit makes ~1.94M backfill take ~9 days.
  • No headed-browser dependency: plain curl (default, blank, or python-requests UA) returns 200 JSON from gunicorn at lda.gov/api/v1/ — the Akamai bot wall is gone. Use a plain HTTP client (`requests`/`httpx`) for this source.
  • Government-entity granularity flip date (2021-02-14): the schema accommodates both views via per-activity rows (`lda_activity_government_entities`) for newer filings and a filing-level JSON column for older. Confirm the parser handles both branches without losing data.
  • Bill-text extraction from the `description` field (the per-activity free-text field; `general_issue_specific` is null on lda.gov): regex (`H\.R\. \d+`, `S\. \d+`) vs LLM extraction. The field is free-text and notoriously messy. Lean: regex for v1 with a `clarifications_log` for matches that look like bills but don't resolve.
  • Contribution-recipient resolution: LD-203 names are free-text (e.g. 'McConnell, Mitch'); resolving to bioguide is fuzzy. Lean: best-effort resolve via `legislators.fuzzy_name_match`; raw text always preserved.
  • Constants endpoints (filing types, issue codes, government entities, etc.) — refresh on every ingester boot or daily? Lean: daily — they don't count against the rate limit and the enums change quarterly at most.
  • State lobbying disclosures (CA / NY / TX state-level analogues) — Phase 6+.
  • FARA filings (Foreign Agents Registration Act, related but different schema) — v2 per `https://docs.usejosh.com/josh-data-sources/`.
  • ProPublica's bulk LDA mirror — derived from the same Senate API; direct is better.
  • OpenSecrets-style enrichment (employer matching, PAC affiliations) — v2 layer on top of raw LDA.
  • Real-time webhooks / push subscriptions — surface-layer concern.
  • Writing back to lda.gov (filing submission) — read-only ingest.

Source module follows the standard Source protocol. Per-source spec at
https://docs.usejosh.com/sources/lda-filings/.

Discover/fetch/parse/load shape:
- discover: GET /api/v1/filings/?filing_year={year}&ordering=-dt_posted&page=N
paginated. A filter is no longer required — unfiltered /filings/
returns 200 (count ~1.94M) — but we filter by filing_year for
watermarked incremental runs. Watermark from
ingestion_source_state.last_seen_key=
'lda-filings-{year}-dtposted'
is the most-recent dt_posted seen.
Constants endpoints (filing types, issue codes, government entities,
etc.) refreshed daily into lda_constants_* tables.
- fetch: per-filing JSON via /api/v1/filings/{filing_uuid}/ to
/data/corpus/lda-filings/bodies/raw/{filing_uuid}.json. SHA256 the
JSON for raw_sha256. Plain HTTP client throughout (no bot wall —
plain curl gets 200 JSON from gunicorn). Polite pacing under the
120/min api-keyed throttle.
- parse: extract filing-level fields (filing_uuid, filing_year,
filing_period, filing_type, registrant, client, dt_posted, posted_url,
income, expenses) into lda_filings. Extract per-activity fields
into lda_filing_lobbying_activities (general_issue_code,
description, lobbyists list, government_entities). Extract
bill references from the per-activity description free-text field
(general_issue_specific is null on lda.gov) into
lda_activity_bills. LD-203 contributions follow a parallel parser
into lda_contributions.
- load: idempotent upsert on (filing_uuid). Child rows inserted in
one transaction per filing. Lobbyist / registrant / client rows
upserted from per-filing snapshots into lda_lobbyists,
lda_registrants, lda_clients (these are slowly-changing dimensions
— re-fetched yearly via the /lobbyists/, /registrants/, /clients/
endpoints). FTS5 sync triggers fire on insert/update/delete.

Migration shape: lda_filings (with the six citation columns +
filing_uuid UNIQUE), lda_filing_lobbying_activities,
lda_activity_government_entities, lda_activity_bills (soft-FK to
bills), lda_contributions (LD-203), lda_registrants, lda_clients,
lda_lobbyists, lda_constants_filing_types,
lda_constants_issue_codes, lda_constants_government_entities,
lda_filings_fts virtual + 3 sync triggers. No *_chunks_vec0
filings are structured, not body text.

Schedule: hourly :20 during weekdays, daily off-hours. New filings land
throughout business hours; hourly catches them promptly. Constants
endpoints refreshed once daily at 02:00 UTC (don't count against quota).

Fetcher: plain HTTP client (requests/httpx) — no headed browser.
The former Akamai bot wall is gone; plain curl (default, blank, or
python-requests UA) returns 200 JSON from gunicorn. The 120/min
api-keyed throttle is the load-bearing constraint, not bot detection.

0 of 17 done.

  • t1 Alembic migration applied locally + production (lda_filings + 7 child + 3 dimension + 3 constants tables + FTS5)
  • t2 Pydantic models in shared/josh_substrate/src/josh_substrate/models/lda.py
  • t3 Source module implementing the `Source` protocol
  • t4 Body normalization N/A (structured JSON → structured rows; no Markdown body)
  • t5 Citation metadata: all 6 columns populated; citation_for() formatter + 3-fixture unit tests (LD-1, LD-2, LD-203)
  • t6 FTS5 virtual table + 3 sync triggers wired into migration
  • t7 N/A — no body chunks, no vector pipeline
  • t8 Unit tests for filing parser against captured fixtures (LD-1 registration, LD-2 quarterly, LD-203 contribution; both pre-2021-02-14 and post-2021-02-14 government-entity granularity)
  • t9 Integration test for full discover→fetch→parse→load pipeline (with mocked lda.gov)
  • t10 Smoke backfill (100 filings, current year) clean per success_determiner
  • t11 Production backfill at agreed scope (per clarifications_needed) complete
  • t12 Hourly cron registered and observed firing once
  • t13 No recurring errors in `ingestion_logs` for the last completed run
  • t14 Source doc updated; inventory rows in https://docs.usejosh.com/josh-data-sources/ + https://docs.usejosh.com/data-status/ flipped to ingestion-built
  • t15 lda.gov API key obtained and stored in `.kamal/secrets` (env: `LDA_GOV_API_KEY`)
  • t16 Plain HTTP fetcher (requests/httpx) confirmed working through one full backfill day without 403s (no bot wall — lda.gov returns 200 JSON to plain curl)
  • t17 Bill-extraction QA: smoke-backfill cohort of `description` strings that look like bill citations resolve to `bills.id` rows at >70% rate (the rest preserved as raw text in `lda_activity_bills.raw_citation`)

No history yet.

docs/spec/lda-filings-ingester.html · generated by bin/build-spec.py