LDA filings
Lobbying Disclosure Act filings: who is lobbying whom, on what bills, for which clients, and (for political contributions) where the lobbyists' money is going. Federal lobbying activity must be reported quarterly to both chambers of Congress; political contributions semi-annually. The Lobbying Disclosure Act (1995) and Honest Leadership and Open Government Act (2007) define the disclosure requirements.
For Josh, LDA filings are the primary signal of "who is influencing this legislation." A user looking at a bill should be able to see the disclosed lobbyists who reported lobbying activity on it; a user looking at a Member should be able to see contributions reported to/from PACs the lobbyists represent.
The Office of Public Records (Senate) and the Clerk of the House are migrating to a single unified API at lda.gov, which consolidates what used to be split across lda.senate.gov/api/v1/ and disclosurespreview.house.gov/lobbyingdisclosure/. The lda.gov API is already live (probed during this work) and serves the same JSON schema as the legacy Senate API. The legacy API at lda.senate.gov sunsets on 2026-06-30; its deprecation headers (Deprecation: @1768003199, Sunset: Tue, 30 Jun 2026 23:59:59 GMT, Link: <https://lda.gov/api/v1/>; rel="successor-version") name lda.gov as the successor. The disclosurespreview.house.gov mirror is superseded by the same unified system.
Upstream API reference (endpoints, filters, enum vocabularies, and full response shapes): lda.gov ReDoc v1 (version 1.0.0).
Quick reference
| Field | Value |
|---|---|
| Source name | LDA filings (Senate + House unified at lda.gov) |
| Publishers | Office of Public Records, U.S. Senate + Clerk of the House — operating a unified system |
| License | Public domain |
| Coverage | 1999 – present |
| Volume | ~1.94M filings (1999-present); ~50K-80K new filings per year |
| Storage estimate | ~3-5 GB API JSON + linked filing PDFs (PDFs lazily fetched) |
| API base (canonical) | https://lda.gov/api/v1/ — the new unified endpoint. Reference: ReDoc v1. Register for a key at https://lda.gov/api/register/. |
| Auth | Anonymous read OK; API key strongly recommended for ingestion (higher throttle) |
| Stable ID format | Filing: lda:{filing_uuid} e.g. lda:455edc06-55d1-41ed-878e-70a4040f953c |
| Status | exploring — schema drafted, ingestion not built. lda.gov endpoint live and verified. |
Source priority decision
Primary: lda.gov API at https://lda.gov/api/v1/. The unified API published by the Senate Office of Public Records / House Clerk. Same JSON schema as the legacy lda.senate.gov/api/v1/. URLs in responses now use lda.gov (not lda.senate.gov). Build directly against this from day one.
Skip: lda.senate.gov/api/v1/ — deprecated, sunset 2026-06-30. The deprecation headers point at lda.gov as the successor-version. No reason to build against the legacy URL.
Skip: disclosurespreview.house.gov — superseded by the unified lda.gov system.
Skip: ProPublica's bulk LDA mirror — derived from the same Senate API. Direct is better.
Skip: OpenSecrets lobbying data — derived + enriched (employer matching, etc.). v1 ingests raw LDA; v2 may layer OpenSecrets crosswalks.
Access notes
No bot wall: a plain HTTP client gets 200 JSON. A live source-drift sweep (2026-05-29) confirms the former Akamai bot wall is gone:
- Plain curl with a default, blank, or python-requests User-Agent → 200 OK with normal JSON from gunicorn. Confirmed against https://lda.gov/api/v1/ (returns endpoint inventory) and https://lda.gov/api/v1/filings/?filing_year=2025&page_size=5 (returns full filing JSON, no auth).
- No JA3/TLS-fingerprint trick is required; UA spoofing is unnecessary.
Plan: use a plain Python HTTP client (requests / httpx) for this source. The 120/min api-keyed throttle is the load-bearing constraint, not bot detection. Fallback note: if 403s from Akamai reappear, fall back to an HTTP/2 client with a browser-like TLS fingerprint (e.g. curl_cffi, tls-client, niquests) or a headed Chromium fetcher — but neither is needed today.
Get an API key. At the default page size (25), ~1.94M filings is roughly 78K list requests: about 3.6 days of saturated requests at the anonymous 15/min limit, versus about 11 hours with an API key (120/min). Registration is at https://lda.gov/api/register/ and is free for research / developer use. Auth is a DRF token sent in the Authorization: Token <key> header; share the key via .kamal/secrets. Anonymous: 15 req/min (~900/hr) per IP. API key: 120 req/min (~7,200/hr) per user. The canonical limits live in the upstream ReDoc reference and may drift.
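The request setup described above can be sketched with the standard library alone. This is a minimal illustration, not the ingestion client: the names LDA_BASE, min_interval_seconds, and build_request are hypothetical, and the limits are copied from the notes above and may drift.

```python
import urllib.request
from typing import Optional

LDA_BASE = "https://lda.gov/api/v1/"  # canonical base from the quick reference

# Per-minute limits as documented above; treat as assumptions that may drift.
ANON_LIMIT_PER_MIN = 15
KEYED_LIMIT_PER_MIN = 120


def min_interval_seconds(api_key: Optional[str]) -> float:
    """Smallest delay between requests that stays under the throttle."""
    limit = KEYED_LIMIT_PER_MIN if api_key else ANON_LIMIT_PER_MIN
    return 60.0 / limit


def build_request(path: str, api_key: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a GET request, adding the DRF token when present."""
    headers = {"Accept": "application/json"}
    if api_key:
        headers["Authorization"] = f"Token {api_key}"
    return urllib.request.Request(LDA_BASE + path.lstrip("/"), headers=headers)
```

A caller would sleep for min_interval_seconds(key) between sends; the same pacing works with requests or httpx in place of urllib.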
Constants endpoints don't count against the rate limit — fetch the enum endpoints freely (filing types, lobbying activity issues, government entities, countries, states, prefixes, suffixes, contribution item types). Same for raw PDF/HTML at https://lda.gov/filings/public/filing/{uuid}/print/.
The filings list endpoint no longer requires a filter. The 2026-05-29 sweep confirms unfiltered GET /api/v1/filings/ returns 200 (count ~1.94M) and ?page=2 alone paginates without error. We still filter by filing_year for watermarked incremental runs, but it is no longer a hard requirement for pagination.
Caveat: government-entity granularity changes 2021-02-14. Filings posted before that date list government entities at the filing level (an aggregate list); filings after that date have entities broken down per lobbying_activities[] item. The schema accommodates both via the lda_activity_government_entities table and a fallback lda_filings.government_entities_filing_level field if we want both views.
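The cutover can be handled with a small date check at ingest. The sketch below assumes dt_posted arrives as an ISO 8601 string with a UTC offset, and treats filings posted on the cutover date itself as per-activity; both are assumptions to verify against real responses. The function name entity_granularity is hypothetical.

```python
from datetime import date, datetime

# Date from the caveat above; the handling of the boundary day is an assumption.
ENTITY_CUTOVER = date(2021, 2, 14)


def entity_granularity(filing: dict) -> str:
    """Return 'activity' or 'filing' depending on when the filing was posted."""
    posted = datetime.fromisoformat(filing["dt_posted"]).date()
    return "activity" if posted >= ENTITY_CUTOVER else "filing"
```

Storing this value alongside each filing lets downstream consumers tell an aggregate entity list from a per-activity one.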
Endpoints, vocabulary, and response shapes
The endpoint inventory (/filings/, /contributions/, /registrants/, /clients/, /lobbyists/, and the /constants/* enum endpoints), the common filter parameters (filing_year, filing_period, filing_type, ordering, page, page_size), the enum vocabularies (filing-type codes such as RR/Q1/MM/YE; filing periods; the 3-letter issue codes like BUD/HCR/TAX; filer types), and the full JSON response shapes are all upstream-maintained: served live at the /constants/* endpoints and documented in the lda.gov ReDoc reference. We do not duplicate them here.
Josh-specific interpretive notes on those shapes:
- income (firms charging clients) and expenses (in-house lobbying) on an LD-2 filing are mutually exclusive: a firm filing reports income; a corporate in-house team reports expenses.
- On an LD-203 contribution filing, filer_type is lobbyist or registrant; lobbyist is set when filer_type = 'lobbyist'; no_contributions is true on a no-activity report; pacs[] lists PACs the filer contributed to; each contribution_items[] row carries honoree_name (the recipient member or candidate).
- The embedded registrant carries house_registrant_id, a cross-reference to the House system, and ppb_country (principal place of business).
Stable ID formats
Filing: lda:{filing_uuid} (UUID-based, opaque).
Examples:
- lda:455edc06-55d1-41ed-878e-70a4040f953c
- lda:86260004-84e7-46e3-9cfa-76edae508768
Registrant: lda-reg:{id} (e.g. lda-reg:9181)
Client: lda-client:{id} (e.g. lda-client:113256)
Lobbyist: lda-lob:{id} (e.g. lda-lob:43217)
These IDs are stable across queries within v1; v2 migration may need an ID-mapping table.
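A minimal sketch of building and parsing these IDs, so the prefixes live in one place. The helper names make_stable_id and parse_stable_id are hypothetical; the prefixes are the ones listed above.

```python
import uuid

PREFIXES = {
    "filing": "lda",
    "registrant": "lda-reg",
    "client": "lda-client",
    "lobbyist": "lda-lob",
}


def make_stable_id(kind: str, raw) -> str:
    """Format a stable ID; filings use a normalized UUID, everything else an int."""
    if kind == "filing":
        raw = str(uuid.UUID(str(raw)))  # validates and lowercases the UUID
    else:
        raw = str(int(raw))
    return f"{PREFIXES[kind]}:{raw}"


def parse_stable_id(stable_id: str):
    """Split a stable ID back into (kind, uuid.UUID or int)."""
    prefix, _, raw = stable_id.partition(":")
    kind = {v: k for k, v in PREFIXES.items()}[prefix]
    return kind, (uuid.UUID(raw) if kind == "filing" else int(raw))
```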
Volume
| Metric | Value |
|---|---|
| Total filings (1999+) | 1,946,003 (current API count) |
| Filings per year | ~50,000-80,000 |
| Active registrants | ~3,000-5,000 |
| Active clients | ~10,000-20,000 |
| Active lobbyists | ~10,000-15,000 |
Postgres footprint: ~3-5 GB.
Caching / incremental sync
The API supports ?ordering=-dt_posted for newest-first ordering; combined with a dt_posted filter, this gives incremental sync:
- Hourly: /filings/?ordering=-dt_posted&page_size=100. Fetch the first page, then paginate until we hit a row already in our table.
- Daily: /contributions/?ordering=-dt_posted. Same pattern.
- Weekly: full reconciliation for the current quarter. Fetch all rows for the current filing_year + filing_period to catch updates we may have missed.
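The "paginate until we hit a known row" pattern can be sketched independently of the HTTP layer. Here pages is any iterable of already-decoded result lists in newest-first order, and is_known is a lookup against our table; both names, and the filing_uuid key on each row, are assumptions for illustration.

```python
from typing import Callable, Iterable, Iterator


def new_rows(pages: Iterable[list], is_known: Callable[[str], bool]) -> Iterator[dict]:
    """Yield rows from newest-first pages, stopping at the first stored filing."""
    for page in pages:
        for row in page:
            if is_known(row["filing_uuid"]):
                return  # everything after this row is older and already ingested
            yield row
```

Because the generator stops early, later pages are never requested when pages is itself lazy. The weekly reconciliation remains the safety net for rows this shortcut skips.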
Schema (proposed Postgres DDL)
-- ============================================================
-- Registrants (lobbying firms or in-house orgs)
-- ============================================================
CREATE TABLE lda_registrants (
  id text PRIMARY KEY,                 -- 'lda-reg:9181'
  registrant_id int NOT NULL UNIQUE,
  house_registrant_id int,             -- House's parallel ID

  name text NOT NULL,
  description text,
  address_1 text,
  address_2 text,
  city text,
  state text,
  zip text,
  country text,
  ppb_country text,                    -- principal place of business

  contact_name text,
  contact_telephone text,

  dt_updated timestamptz,
  raw_json jsonb,
  fetched_at timestamptz NOT NULL,
  inserted_at timestamptz NOT NULL DEFAULT now(),
  updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX lda_registrants_name ON lda_registrants USING gin (to_tsvector('english', name));

-- ============================================================
-- Clients (entities being lobbied for)
-- ============================================================
CREATE TABLE lda_clients (
  id text PRIMARY KEY,                 -- 'lda-client:113256'
  client_id int NOT NULL UNIQUE,
  house_client_id int,

  name text NOT NULL,
  general_description text,
  state text,
  country text,
  ppb_country text,
  is_government_entity boolean,
  is_self_select boolean,

  dt_updated timestamptz,
  raw_json jsonb,
  fetched_at timestamptz NOT NULL,
  inserted_at timestamptz NOT NULL DEFAULT now(),
  updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX lda_clients_name ON lda_clients USING gin (to_tsvector('english', name));

-- ============================================================
-- Lobbyists (individuals)
-- ============================================================
CREATE TABLE lda_lobbyists (
  id text PRIMARY KEY,                 -- 'lda-lob:43217'
  lobbyist_id int NOT NULL UNIQUE,
  prefix text,
  first_name text,
  middle_name text,
  last_name text NOT NULL,
  nickname text,
  suffix text,
  raw_json jsonb,
  fetched_at timestamptz NOT NULL,
  inserted_at timestamptz NOT NULL DEFAULT now(),
  updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX lda_lobbyists_name ON lda_lobbyists (last_name, first_name);

-- ============================================================
-- LD-1 / LD-2 filings (registration + quarterly activity)
-- ============================================================
CREATE TABLE lda_filings (
  id text PRIMARY KEY,                 -- 'lda:{uuid}'
  filing_uuid uuid NOT NULL UNIQUE,

  filing_type text NOT NULL,           -- 'RR', 'Q1', '1A', etc.
  filing_type_display text,
  filing_year smallint NOT NULL,
  filing_period text NOT NULL,
  is_amendment boolean NOT NULL DEFAULT false,
  is_termination boolean NOT NULL DEFAULT false,
  is_no_activity boolean NOT NULL DEFAULT false,

  -- Money
  income numeric(14,2),
  expenses numeric(14,2),
  expenses_method text,

  -- Parties
  registrant_id int REFERENCES lda_registrants(registrant_id),
  client_id int REFERENCES lda_clients(client_id),

  -- Dates
  dt_posted timestamptz NOT NULL,
  termination_date date,

  -- Source URLs
  filing_document_url text,
  filing_document_content_type text,

  -- Lifecycle
  raw_json jsonb NOT NULL,
  fetched_at timestamptz NOT NULL,
  parsed_at timestamptz,
  inserted_at timestamptz NOT NULL DEFAULT now(),
  updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX lda_filings_year_period ON lda_filings (filing_year, filing_period);
CREATE INDEX lda_filings_registrant ON lda_filings (registrant_id);
CREATE INDEX lda_filings_client ON lda_filings (client_id);
CREATE INDEX lda_filings_dt_posted ON lda_filings (dt_posted DESC);

-- ============================================================
-- Lobbying activities (LD-2 detail rows)
-- ============================================================
CREATE TABLE lda_lobbying_activities (
  id bigserial PRIMARY KEY,
  filing_id text NOT NULL REFERENCES lda_filings(id) ON DELETE CASCADE,
  sequence int NOT NULL,
  general_issue_code text NOT NULL,    -- 'TAX', 'HCR', etc.
  general_issue_display text,
  description text,
  foreign_entity_issues text,
  UNIQUE (filing_id, sequence)
);

CREATE INDEX lda_activities_issue ON lda_lobbying_activities (general_issue_code);

ALTER TABLE lda_lobbying_activities ADD COLUMN search_tsv tsvector
  GENERATED ALWAYS AS (
    setweight(to_tsvector('english', coalesce(general_issue_display, '') || ' ' || coalesce(general_issue_code, '')), 'B') ||
    setweight(to_tsvector('english', coalesce(description, '')), 'D')
  ) STORED;
CREATE INDEX lda_activities_search ON lda_lobbying_activities USING gin (search_tsv);

-- Lobbyists per activity (M:N)
CREATE TABLE lda_activity_lobbyists (
  activity_id bigint NOT NULL REFERENCES lda_lobbying_activities(id) ON DELETE CASCADE,
  lobbyist_id int NOT NULL REFERENCES lda_lobbyists(lobbyist_id),
  PRIMARY KEY (activity_id, lobbyist_id)
);

-- Government entities lobbied per activity (M:N)
CREATE TABLE lda_activity_government_entities (
  activity_id bigint NOT NULL REFERENCES lda_lobbying_activities(id) ON DELETE CASCADE,
  entity_id int NOT NULL,              -- references the LDA enum
  entity_name text NOT NULL,
  PRIMARY KEY (activity_id, entity_id)
);

-- ============================================================
-- LD-203 contribution filings
-- ============================================================
CREATE TABLE lda_contribution_filings (
  id text PRIMARY KEY,                 -- 'lda-c:{uuid}'
  filing_uuid uuid NOT NULL UNIQUE,
  filing_type text NOT NULL,           -- 'MM', 'YE', 'MMA', 'YEA'
  filing_year smallint NOT NULL,
  filing_period text NOT NULL,         -- 'mid_year', 'year_end'
  filer_type text NOT NULL CHECK (filer_type IN ('lobbyist', 'registrant')),
  registrant_id int REFERENCES lda_registrants(registrant_id),
  lobbyist_id int REFERENCES lda_lobbyists(lobbyist_id),
  no_contributions boolean NOT NULL DEFAULT false,
  dt_posted timestamptz NOT NULL,
  pacs text[],
  raw_json jsonb NOT NULL,
  fetched_at timestamptz NOT NULL,
  inserted_at timestamptz NOT NULL DEFAULT now(),
  updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX lda_contributions_year ON lda_contribution_filings (filing_year, filing_period);

-- ============================================================
-- Itemized contributions (LD-203 line items)
-- ============================================================
CREATE TABLE lda_contribution_items (
  id bigserial PRIMARY KEY,
  contribution_filing_id text NOT NULL REFERENCES lda_contribution_filings(id) ON DELETE CASCADE,
  sequence int NOT NULL,
  contribution_type text,              -- 'FECA', 'PRESIDENTIAL_LIBRARY', etc.
  contributor_name text,
  honoree_name text,                   -- Member / candidate / official receiving the contribution
  honoree_bioguide text REFERENCES legislators(bioguide_id),  -- soft FK when resolvable
  payee_name text,
  contribution_date date,
  amount numeric(12,2),
  UNIQUE (contribution_filing_id, sequence)
);

CREATE INDEX lda_contribution_items_honoree ON lda_contribution_items (honoree_bioguide) WHERE honoree_bioguide IS NOT NULL;
CREATE INDEX lda_contribution_items_date ON lda_contribution_items (contribution_date DESC);

Schema decisions worth flagging:
- One filing → many lda_lobbying_activities rows, each capturing one issue area with its lobbyist + government-entity edges. Lets us answer "what was lobbied on TAX in 2025-Q1" cleanly.
- lda_contribution_items.honoree_bioguide as soft FK: most contributions go to sitting Members; resolution is best-effort (free-text honoree_name → bioguide via fuzzy match against legislators.last_name + first_name, see legislators & committees). Critical for the "PAC contributions to my representative" workflow.
- raw_json preserved per filing: the v1 → v2 migration may add fields; preserving raw lets us re-parse.
- No body text on filings: the LD-2 form has structured activity rows; the PDF is the human-readable rendering. The activity rows are the structured signal we ingest.
- bills cross-reference deferred. LD-2 activity descriptions sometimes mention specific bill numbers ("H.R. 1", "S. 5"). Extraction at parse time is feasible but error-prone (free text). Defer to v1.x; ship v1 with description-text searchable. (See bills for the BILLSTATUS target.)
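To give a sense of what the deferred bill-number extractor might look like, here is a deliberately narrow sketch. It recognizes only "H.R. N", "HR N", "S. N", and "S N" citations, is case-sensitive to limit false positives, and ignores resolutions and Congress numbers. The regex and the function name extract_bill_mentions are illustrative assumptions, not a vetted parser.

```python
import re

# Narrow on purpose: House and Senate bills only, no resolutions.
BILL_RE = re.compile(r"\b(H\.\s?R\.|HR|S\.|S)\s?(\d{1,5})\b")


def extract_bill_mentions(description):
    """Return de-duplicated (bill_type, number) pairs in order of first mention."""
    found = []
    for chamber, number in BILL_RE.findall(description or ""):
        bill_type = "hr" if chamber.startswith("H") else "s"
        mention = (bill_type, int(number))
        if mention not in found:
            found.append(mention)
    return found
```

Free text will still defeat this in places (for example "U.S. 5" reads as a Senate bill), which is why the extraction is treated as best-effort.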
Download / update strategy
Backfill (1999-present)
- Iterate by year + period:
  - For year in 1999..2025:
    - For period in [first_quarter, second_quarter, third_quarter, fourth_quarter, mid_year, year_end]:
      - Page through /filings/?filing_year={Y}&filing_period={P}&ordering=dt_posted until exhausted.
      - Same for /contributions/.
- ~1.94M filings × per-row detail not always needed (list endpoint returns full shape). Single API call paginates 25-100 rows.
- Reference data first: /registrants/, /clients/, /lobbyists/ are small enough to backfill in entirety.
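The year + period iteration above can be sketched as a generator of first-page URLs, one per (year, period, endpoint) slice. The function name backfill_urls and its defaults are hypothetical; the period names and query parameters are the ones listed above.

```python
from itertools import product
from urllib.parse import urlencode

PERIODS = [
    "first_quarter", "second_quarter", "third_quarter",
    "fourth_quarter", "mid_year", "year_end",
]


def backfill_urls(first_year=1999, last_year=2025, base="https://lda.gov/api/v1/"):
    """Yield the first-page URL for every year/period slice of both endpoints."""
    for year, period in product(range(first_year, last_year + 1), PERIODS):
        query = urlencode(
            {"filing_year": year, "filing_period": period, "ordering": "dt_posted"}
        )
        for endpoint in ("filings", "contributions"):
            yield f"{base}{endpoint}/?{query}"
```

Each URL is then paged until exhausted. Slicing this way keeps every unit of work small and restartable.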
Time estimate (lda.gov, with API key, 120/min sustained):
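- One request per filing would mean ~1.94M ÷ 120 req/min ÷ 60 ≈ 270 hours (~11 days) of saturated requests, but the list endpoint returns full rows, so the request count is set by page size instead.
- At the default page size (25): ~78K requests ÷ 120 req/min ≈ 650 minutes ≈ 11 hours for filings. At 100 rows per page: ~19.5K requests ≈ 2.7 hours.
- Contributions, if of the same magnitude, add a similar amount: ~12-24 hours for filings + contributions combined at the default page size.
- Reference data (registrants, clients, lobbyists) negligible: minutes.
Without API key (15/min anonymous): 8x slower, roughly 3.6 days for filings alone and about a week combined. Registering and using a key is strongly preferred.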
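Because each list request returns a page of full rows, the estimate is driven by page size and throttle rather than by filing count alone. A small sketch of that arithmetic follows; the function name backfill_hours is hypothetical, and the row count is the current API count from the Volume table.

```python
import math


def backfill_hours(total_rows: int, page_size: int, requests_per_min: int) -> float:
    """Hours of fully saturated requests needed to page through total_rows."""
    requests_needed = math.ceil(total_rows / page_size)
    return requests_needed / requests_per_min / 60


# Filings only, using the documented limits (which may drift upstream).
keyed_default = backfill_hours(1_946_003, 25, 120)   # about 10.8 hours
keyed_max_page = backfill_hours(1_946_003, 100, 120)  # about 2.7 hours
anonymous = backfill_hours(1_946_003, 25, 15)         # about 86.5 hours (~3.6 days)
```

These are lower bounds: they assume no 429s, no retries, and zero processing time between requests.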
Daily incremental
- Hourly: /filings/?ordering=-dt_posted&filing_year={current}. Stop at last-seen UUID. The filing_year filter scopes the watermarked incremental run (no longer required for pagination, since unfiltered pagination now returns 200).
- Hourly: /contributions/?ordering=-dt_posted&filing_year={current}.
- Weekly: per-quarter reconciliation. Compare row counts against our table.
State tracking
Source keys: lda_filings, lda_contributions, lda_registrants, lda_clients, lda_lobbyists. State stores last-seen dt_posted watermark per source plus the most recent filing_uuid for tie-breaking.
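One way to read "watermark plus UUID for tie-breaking" is sketched below: a row is new if it was posted after the watermark, or at exactly the watermark timestamp with a different UUID than the one last seen. This is an interpretation, not a settled design, and the Watermark class and its is_new method are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Watermark:
    dt_posted: datetime  # newest dt_posted ingested for this source key
    filing_uuid: str     # UUID of the row that set the watermark

    def is_new(self, row_dt_posted: datetime, row_uuid: str) -> bool:
        """True when the row should be (re)ingested relative to this watermark."""
        if row_dt_posted != self.dt_posted:
            return row_dt_posted > self.dt_posted
        # Same timestamp: only the exact last-seen row is known for certain.
        return row_uuid != self.filing_uuid
```

Rows sharing the watermark timestamp may be fetched twice under this rule, so it assumes writes are idempotent upserts keyed on filing_uuid.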
Failure modes
- HTTP 429 throttle. API returns Retry-After in seconds; honor it. Don't retry tighter than the recommended wait.
- HTTP 403 from Akamai (fallback only). Not observed as of the 2026-05-29 sweep: plain HTTP clients get 200 JSON. If 403s reappear, fall back to a TLS-fingerprint-spoofing client (curl_cffi, tls-client, niquests) or a headless-with-stealth Chromium.
- Pagination without filter. No longer an error: unfiltered /filings/?page=N returns 200 as of the 2026-05-29 sweep. We still pass filing_year for watermarked incremental runs.
- Government entity granularity for pre-2021-02-14 filings. Fall back to filing-level government entities; flag at ingest so downstream consumers know.
- Honoree resolution to bioguide. Many contributions go to candidates not yet sworn in. Fuzzy match; leave NULL when ambiguous.
- Filing amendments. New filing_uuid (amendments get distinct UUIDs); they reference the parent filing via filing_type. Capture amendments as separate rows; downstream consumers walk the amendment chain.
- No-activity report. is_no_activity flag; activities array empty.
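The 429 handling can be reduced to one pure function that decides how long to wait. The sketch below assumes Retry-After carries a number of seconds (as noted above) and falls back to capped exponential backoff when the header is missing or unparseable; the name retry_delay and the 300-second cap are illustrative choices.

```python
from typing import Mapping, Optional


def retry_delay(status: int, headers: Mapping[str, str], attempt: int) -> Optional[float]:
    """Seconds to wait before retrying, or None when the response is not a throttle."""
    if status != 429:
        return None
    backoff = min(2.0 ** attempt, 300.0)  # fallback: 1s, 2s, 4s, ... capped at 5 min
    try:
        advised = float(headers.get("Retry-After", ""))
    except ValueError:
        return backoff
    return max(advised, backoff)  # never retry sooner than the server asked
```

The lookup is case-sensitive on a plain dict; HTTP libraries such as requests and httpx expose case-insensitive header mappings, which avoids that pitfall.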
Open questions
These don't block ingestion but should be resolved before this source is "shipped":
- HTTP client (resolved). No bot wall as of the 2026-05-29 sweep: a plain Python requests / httpx client gets 200 JSON. No TLS-fingerprint trick or headed browser needed. Fallback only if 403s reappear: curl_cffi / tls-client / niquests, or a headed Chromium fetcher.
- API key acquisition. Free registration at lda.gov/api/register/. One-time setup; share via .kamal/secrets.
- Bill-number extraction from activity descriptions. A high-value cross-reference: when a filing's lobbying_activities[].description mentions "H.R. 1, the One Big Beautiful Bill Act," we should populate a lda_activity_bills join table linking the activity to the bill_id. The free-text bill-citation field is description, not general_issue_specific, which is null on lda.gov. Lean: ship v1 with description-text search; build the extractor in v1.x.
- House LDA divergence. HLOGA requires identical filings in both chambers, but historically some filings appear in House but not Senate due to filing system glitches. v1 trusts Senate; v1.x can spot-check House.
- PAC contribution graph. LD-203 contributions are itemized by PAC. Building a "which PACs are funded by lobbyists for which clients" graph is an analytic layer, not an ingest concern.
- Foreign Agents (FARA). The LDA covers domestic lobbying. Foreign agent filings under FARA are a separate registration system, listed in the v2 deferral table. Distinct schema; document later.
- Contribution amount ambiguity. Some filings report ranges or "see attached PDF." Schema's numeric(12,2) requires a number; when unparseable, leave NULL and log.
- Relationship to bills cross-reference (BILLSTATUS). Bills have a policy_area and subjects; LDA filings have issue codes. Build a small crosswalk so "lobbying on HCR (Health Care)" can be cross-referenced with bills tagged "Health" subject.
- Data quality. LDA filings are notoriously self-reported with limited verification. The description field is free-text and varies wildly. We ingest as-is; downstream consumers handle the noise.
- Mass amendments. When the OPR requires registrants to re-file (e.g., format change), thousands of amendments hit at once. Plan for burst.
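For the contribution-amount item above, a conservative parser might look like the sketch below: it accepts plain numbers with optional dollar signs and thousands separators, and returns None (to be stored as NULL) for ranges, prose, and anything else it cannot read as a single number. The function name parse_amount is hypothetical, and logging the rejected value is left to the caller.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_amount(raw) -> Optional[Decimal]:
    """Parse a reported amount into a 2-decimal Decimal, or None if unparseable."""
    if raw is None:
        return None
    text = str(raw).replace("$", "").replace(",", "").strip()
    try:
        amount = Decimal(text)
    except InvalidOperation:
        return None  # ranges like "1000-2000" and prose like "see attached PDF"
    if not amount.is_finite():
        return None  # rejects "NaN" and "Infinity", which Decimal would accept
    return amount.quantize(Decimal("0.01"))
```

Returning None rather than guessing a midpoint for ranges keeps the stored numeric(12,2) column honest; the original string remains available in raw_json.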