Josh — Data Sources (v1 / v2 canonical inventory)

Working document. The source of truth for what Josh ingests, in what order, with what status.

This document is the source of truth for what Josh ingests, in what order, and at what status. Per-source deep-dive docs in docs/sources/ carry the authoritative endpoint specs, schemas, and probe findings — this document is the index that points to them.

How this document is used

Status column tracks where each source is in the pipeline: planned → exploring → schema-drafted → ingestion-built → mirror-published.
This document is the index. Per-source deep-dive docs live in docs/sources/ — one markdown file per source, with full endpoint specs, schemas, probe findings, and ingestion strategy. The per-source docs are the source of truth for findings; this index just links to them.
Volume estimates are rough; refine as we measure.
Adding a source to v1 requires bumping it into the table and writing its per-source doc. Adding to v2 just requires the table row.

Summary — v1

The 30 sources Josh ingests for v1 launch.

Legislative core

#	Source	Primary access	Format	Volume	Status
1	Bills (text + status + actions + sponsors + cosponsors + amendments + versions)	GovInfo BILLSTATUS + BILLS + Congress.gov API	XML + JSON	~10-25 GB	schema-drafted
2	Bill summaries (CRS-prepared) — covered in bills.html	BILLSTATUS XML inline	XML	(rolls into bills)	schema-drafted
3	Members of Congress (bio, terms, party, chamber, state, district)	`unitedstates/congress-legislators` GitHub Pages	YAML	<500 MB	schema-drafted
4	Committees + subcommittees (membership, jurisdiction, leadership)	`unitedstates/congress-legislators` GitHub Pages	YAML	<500 MB	schema-drafted
5	Committee assignments (member ↔ committee)	`unitedstates/congress-legislators` GitHub Pages	YAML	<100 MB	schema-drafted
6	Hearings + markup schedule (calendar, witnesses)	Congress.gov API + GovInfo CHRG	JSON / XML	~1-2 GB	schema-drafted
7	Hearing transcripts (best-effort, partial)	GovInfo CHRG + committee PDFs + ASR fallback	HTML / PDF / video	~15-20 GB	schema-drafted
8	Roll call votes (chamber + per-member)	House Clerk + Senate.gov XML	XML	~2 GB	schema-drafted
9	Sponsorships + cosponsorships — covered in bills.html	BILLSTATUS XML inline	XML	(rolls into bills)	schema-drafted

Statutory law

#	Source	Primary access	Format	Volume	Status
10	US Code (USC)	GovInfo USCODE + OLRC release points	USLM XML	~3-5 GB	schema-drafted
11	Public Laws (enacted law text, distinct from bills)	GovInfo Bulk PLAW collection	USLM XML / PDF	~2 GB	schema-drafted

Regulatory

#	Source	Primary access	Format	Volume	Status
12	Federal Register documents (rules, NPRMs, notices, IFRs, ANPRs)	federalregister.gov API	JSON	~20-30 GB	schema-drafted
13	Executive Orders — covered in federal-register.html	FR `presidential_document_type=executive_order` subtype	JSON	(rolls into FR)	schema-drafted
14	Proclamations + memoranda — covered in federal-register.html	FR `presidential_document_type` subtype	JSON	(rolls into FR)	schema-drafted
15	Regulations.gov dockets (metadata, document list, NOT comment text)	regulations.gov API v4	JSON	~1-2 GB	schema-drafted
16	Comment counts + deadlines per docket	regulations.gov API v4	JSON	<500 MB	schema-drafted
17	eCFR (current)	ecfr.gov REST API	XML / JSON	~3-5 GB	schema-drafted
18	CFR Annual Edition (historical)	GovInfo CFR collection	XML	~5 GB	schema-drafted

Reference and analysis

#	Source	Primary access	Format	Volume	Status
19	CRS reports	EveryCRSReport.com mirror + Congress.gov API	HTML / PDF / JSON	~10 GB DB + ~10 GB raw	schema-drafted (code shipped + verified; 23,092-report load 2026-05-08 lost when DO droplet destroyed 2026-05-10 — re-ingest pending OVHcloud provision)
20	GAO reports	gao.gov RSS + headed-browser scrape	HTML / PDF / RSS	~10-15 GB	schema-drafted
21	CBO cost estimates	cbo.gov RSS + headed-browser scrape	HTML / PDF / RSS	~1-2 GB	schema-drafted
22	Committee reports (formal explanations of bills)	GovInfo Bulk CRPT collection	XML / PDF	~3-5 GB	schema-drafted
23	Congressional Record (floor proceedings, daily digest, extensions of remarks)	GovInfo Bulk CREC collection	XML / HTML / PDF	~10-15 GB	schema-drafted
24	Statement of Administration Policy (SAP)	`unitedstates/statements-of-administration-policy` GitHub mirror	YAML / PDF	<500 MB	schema-drafted

Lobbying

#	Source	Primary access	Format	Volume	Status
25	Senate LDA filings (LD-1, LD-2, LD-203)	lda.gov API v1 (unified)	JSON	~3-5 GB	schema-drafted
26	House LDA filings	unified with Senate at lda.gov	JSON	—	schema-drafted

People metadata (best-effort)

#	Source	Primary access	Format	Volume	Status
27	Congressional staff directory (chiefs of staff, LDs, LAs, committee staff)	House.gov + Senate.gov + community contribution	HTML	<200 MB	schema-drafted (partial coverage)
28	Federal agency officials (heads, deputies, key staff)	Agency websites + USA.gov + Wikidata	HTML	<200 MB	schema-drafted (partial coverage)
29	Caucus memberships	Per-caucus scrape + Wikipedia	HTML	<50 MB	schema-drafted (partial coverage)

Taxonomy

#	Source	Primary access	Format	Volume	Status
30	Issue / topic taxonomy	Hand-curated Josh meta-taxonomy + LCSH backbone + per-source native taxonomies	YAML / JSON-LD	<100 MB	schema-drafted

Estimated v1 corpus total: ~80-110 GB raw / ~50-70 GB compressed. Larger than the original ~50 GB estimate because USC, Public Laws, Committee Reports, and the Congressional Record were missing from the original count.

Accelerators (not sources, but used during ingestion)

unitedstates/congress — community-maintained scrapers, public domain (CC0). Massive head start for bills, members, votes.
eyecite — citation extraction library. Used to build the citation graph (bill ↔ USC, rule ↔ USC, doc ↔ doc).

Summary — v2 (deferred from v1)

Source	Reason for deferral
Reg.gov public comment text (full text)	Hundreds of GB to TBs; per-docket on-demand fetch instead.
FEC campaign finance (receipts, disbursements, PAC giving)	Useful for member-voting analysis; not in core v1 loop.
Federal contracts and solicitations (SAM.gov, USAspending, FPDS-NG)	Different market segment; possible v2/v3.
State legislative tracking (50 states)	Real per-state engineering work. v2; OpenStates as v1 stopgap pointer.
EU and international tracking	v3, hosted-first.
Member statements / press releases via per-office sites	Fragile. v1 ships floor statements via Congressional Record only.
Member social media (X, Bluesky, Truth Social)	Fragile, hostile platforms. v2 if at all.
District-to-address matching (Census TIGER/Line)	Useful but adds geo dependency. v2.
FARA filings (Foreign Agents Registration Act)	Different from LDA, valuable for embassy buyer segment. v2.
Member financial disclosure forms (House Clerk + Senate ECRD)	Privacy-sensitive area, not core to v1 loop. v2.
Treaties	Congress.gov API includes them; cheap to add later. v2.
Nominations (presidential nominees)	Confirmation hearings already covered; nomination records themselves v2.

Permanent non-scope

For clarity, sources Josh deliberately does not ingest:

News articles and editorial content (POLITICO Pro, E&E News, Bloomberg journalists, CQ, Roll Call) — different shape of product; aggregating editorial content isn't substrate work.
Westlaw / LexisNexis case law — different vertical (legal, not policy).
Industry research / private datasets — out of scope; partner if needed.
Local government / school board tracking — out of scope; the per-jurisdiction discovery cost makes it a different shape of product.
Grassroots messaging deliverability — different product entirely; integrate, don't compete.
PAC management / FEC compliance — money-touching, regulatory liability.
Stakeholder CRM — out of scope; integrate with existing CRMs rather than build one.
Custom human-analyst research services (Oxford Analytica style) — different business.