Josh — Data Sources (v1 / v2 canonical inventory)

Working document.

This document is the source of truth for what Josh ingests, in what order, and at what status. Per-source deep-dive docs in docs/sources/ carry the authoritative endpoint specs, schemas, and probe findings — this document is the index that points to them.

  • Status column tracks where each source is in the pipeline: planned → exploring → schema-drafted → ingestion-built → mirror-published.
  • This document is the index. Per-source deep-dive docs live in docs/sources/ — one markdown file per source, with full endpoint specs, schemas, probe findings, and ingestion strategy. The per-source docs are the source of truth for findings; this index just links to them.
  • Volume estimates are rough; refine as we measure.
  • Adding a source to v1 requires adding its row to the table and writing its per-source doc. Adding a source to v2 requires only the table row.
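The status pipeline and the v1 rule above could be checked mechanically. The following is a minimal sketch, not Josh's actual tooling: the `Status` enum mirrors the five stages named above, while `SourceRow`, its `slug` field, and the `<slug>.md` file-naming convention are hypothetical names introduced only for illustration.

```python
import re
from dataclasses import dataclass
from enum import IntEnum
from pathlib import Path


class Status(IntEnum):
    """Pipeline stages in the order listed above; a higher value is further along."""
    PLANNED = 0
    EXPLORING = 1
    SCHEMA_DRAFTED = 2
    INGESTION_BUILT = 3
    MIRROR_PUBLISHED = 4


def parse_status(cell: str) -> tuple[Status, str]:
    """Split a Status cell such as 'schema-drafted (partial coverage)' into (stage, note)."""
    match = re.fullmatch(r"\s*([a-z-]+)\s*(?:\((.*)\))?\s*", cell, flags=re.DOTALL)
    key = match.group(1).replace("-", "_").upper() if match else ""
    if key not in Status.__members__:
        raise ValueError(f"unrecognised status cell: {cell!r}")
    return Status[key], (match.group(2) or "").strip()


@dataclass(frozen=True)
class SourceRow:
    number: int
    name: str
    slug: str        # hypothetical: file stem of the per-source doc
    status: Status


def v1_rows_missing_docs(rows: list[SourceRow], docs_dir: Path) -> list[SourceRow]:
    """Return the v1 rows that have no per-source doc at <docs_dir>/<slug>.md."""
    return [row for row in rows if not (docs_dir / f"{row.slug}.md").is_file()]
```

Because `Status` is an `IntEnum`, stages compare in pipeline order, so a check such as `row.status >= Status.INGESTION_BUILT` reads naturally. Running `v1_rows_missing_docs(rows, Path("docs/sources"))` in CI would flag any v1 row added without its deep-dive doc.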

The 30 sources Josh ingests for v1 launch.

| # | Source | Primary access | Format | Volume | Status |
|---|---|---|---|---|---|
| 1 | Bills (text + status + actions + sponsors + cosponsors + amendments + versions) | GovInfo BILLSTATUS + BILLS + Congress.gov API | XML + JSON | ~10-25 GB | schema-drafted |
| 2 | Bill summaries (CRS-prepared); covered in bills.html | BILLSTATUS XML inline | XML | (rolls into bills) | schema-drafted |
| 3 | Members of Congress (bio, terms, party, chamber, state, district) | unitedstates/congress-legislators GitHub Pages | YAML | <500 MB | schema-drafted |
| 4 | Committees + subcommittees (membership, jurisdiction, leadership) | unitedstates/congress-legislators GitHub Pages | YAML | <500 MB | schema-drafted |
| 5 | Committee assignments (member ↔ committee) | unitedstates/congress-legislators GitHub Pages | YAML | <100 MB | schema-drafted |
| 6 | Hearings + markup schedule (calendar, witnesses) | Congress.gov API + GovInfo CHRG | JSON / XML | ~1-2 GB | schema-drafted |
| 7 | Hearing transcripts (best-effort, partial) | GovInfo CHRG + committee PDFs + ASR fallback | HTML / PDF / video | ~15-20 GB | schema-drafted |
| 8 | Roll call votes (chamber + per-member) | House Clerk + Senate.gov XML | XML | ~2 GB | schema-drafted |
| 9 | Sponsorships + cosponsorships; covered in bills.html | BILLSTATUS XML inline | XML | (rolls into bills) | schema-drafted |

| # | Source | Primary access | Format | Volume | Status |
|---|---|---|---|---|---|
| 10 | US Code (USC) | GovInfo USCODE + OLRC release points | USLM XML | ~3-5 GB | schema-drafted |
| 11 | Public Laws (enacted law text, distinct from bills) | GovInfo Bulk PLAW collection | USLM XML / PDF | ~2 GB | schema-drafted |

| # | Source | Primary access | Format | Volume | Status |
|---|---|---|---|---|---|
| 12 | Federal Register documents (rules, NPRMs, notices, IFRs, ANPRs) | federalregister.gov API | JSON | ~20-30 GB | schema-drafted |
| 13 | Executive Orders; covered in federal-register.html | FR presidential_document_type=executive_order subtype | JSON | (rolls into FR) | schema-drafted |
| 14 | Proclamations + memoranda; covered in federal-register.html | FR presidential_document_type subtype | JSON | (rolls into FR) | schema-drafted |
| 15 | Regulations.gov dockets (metadata, document list, NOT comment text) | regulations.gov API v4 | JSON | ~1-2 GB | schema-drafted |
| 16 | Comment counts + deadlines per docket | regulations.gov API v4 | JSON | <500 MB | schema-drafted |
| 17 | eCFR (current) | ecfr.gov REST API | XML / JSON | ~3-5 GB | schema-drafted |
| 18 | CFR Annual Edition (historical) | GovInfo CFR collection | XML | ~5 GB | schema-drafted |

| # | Source | Primary access | Format | Volume | Status |
|---|---|---|---|---|---|
| 19 | CRS reports | EveryCRSReport.com mirror + Congress.gov API | HTML / PDF / JSON | ~10 GB DB + ~10 GB raw | schema-drafted (code shipped + verified; 23,092-report load 2026-05-08 lost when DO droplet destroyed 2026-05-10; re-ingest pending OVHcloud provision) |
| 20 | GAO reports | gao.gov RSS + headed-browser scrape | HTML / PDF / RSS | ~10-15 GB | schema-drafted |
| 21 | CBO cost estimates | cbo.gov RSS + headed-browser scrape | HTML / PDF / RSS | ~1-2 GB | schema-drafted |
| 22 | Committee reports (formal explanations of bills) | GovInfo Bulk CRPT collection | XML / PDF | ~3-5 GB | schema-drafted |
| 23 | Congressional Record (floor proceedings, daily digest, extensions of remarks) | GovInfo Bulk CREC collection | XML / HTML / PDF | ~10-15 GB | schema-drafted |
| 24 | Statement of Administration Policy (SAP) | unitedstates/statements-of-administration-policy GitHub mirror | YAML / PDF | <500 MB | schema-drafted |

| # | Source | Primary access | Format | Volume | Status |
|---|---|---|---|---|---|
| 25 | Senate LDA filings (LD-1, LD-2, LD-203) | lda.gov API v1 (unified) | JSON | ~3-5 GB | schema-drafted |
| 26 | House LDA filings | unified with Senate at lda.gov | JSON | | schema-drafted |

| # | Source | Primary access | Format | Volume | Status |
|---|---|---|---|---|---|
| 27 | Congressional staff directory (chiefs of staff, LDs, LAs, committee staff) | House.gov + Senate.gov + community contribution | HTML | <200 MB | schema-drafted (partial coverage) |
| 28 | Federal agency officials (heads, deputies, key staff) | Agency websites + USA.gov + Wikidata | HTML | <200 MB | schema-drafted (partial coverage) |
| 29 | Caucus memberships | Per-caucus scrape + Wikipedia | HTML | <50 MB | schema-drafted (partial coverage) |

| # | Source | Primary access | Format | Volume | Status |
|---|---|---|---|---|---|
| 30 | Issue / topic taxonomy | Hand-curated Josh meta-taxonomy + LCSH backbone + per-source native taxonomies | YAML / JSON-LD | <100 MB | schema-drafted |
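Rows 13 and 14 roll into the Federal Register source because presidential documents are a filtered subtype of FR documents rather than a separate feed. A minimal sketch of that filter follows. The endpoint path and the `conditions[...]` parameter names reflect my understanding of the public federalregister.gov API v1 and should be confirmed against the per-source doc; `presidential_documents_url` is a hypothetical helper, and the function only builds a URL without making a request.

```python
from urllib.parse import urlencode

# Assumed base endpoint of the federalregister.gov documents search (API v1).
FR_DOCUMENTS_URL = "https://www.federalregister.gov/api/v1/documents.json"


def presidential_documents_url(subtype: str, per_page: int = 100) -> str:
    """Build a search URL for one presidential_document_type subtype."""
    params = [
        ("conditions[type][]", "PRESDOCU"),                     # presidential documents only
        ("conditions[presidential_document_type][]", subtype),  # e.g. "executive_order"
        ("per_page", str(per_page)),
    ]
    return f"{FR_DOCUMENTS_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(presidential_documents_url("executive_order"))
```

The same builder would serve row 14 by passing a different subtype value. Those values are not listed in this index, so they should come from the federal-register per-source doc rather than being guessed.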

Estimated v1 corpus total: ~80-110 GB raw / ~50-70 GB compressed. Larger than the original ~50 GB estimate because USC, Public Laws, Committee Reports, and the Congressional Record were missing from the original count.
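Since the volume estimates are rough and meant to be refined, it may help to recompute the total from the Volume column rather than maintain it by hand. The sketch below parses the cell formats used in the tables into a (low, high) range in GB. It assumes decimal units (1 GB = 1000 MB) and treats a `<N` cell as the range 0 to N; both are my assumptions, and `parse_volume` and `total_volume` are hypothetical names.

```python
import re
from typing import Iterable, Optional

_MB_PER_UNIT = {"MB": 1, "GB": 1_000, "TB": 1_000_000}
# One size term: optional "<" or "~", a number, an optional "-upper" bound, a unit.
_TERM = re.compile(r"([<~])?\s*(\d+(?:\.\d+)?)(?:\s*-\s*(\d+(?:\.\d+)?))?\s*(MB|GB|TB)")


def parse_volume(cell: str) -> Optional[tuple[float, float]]:
    """Return (low_gb, high_gb) for a Volume cell, or None if it states no size."""
    terms = _TERM.findall(cell)
    if not terms:
        return None  # e.g. "(rolls into bills)" or an empty cell
    low = high = 0.0
    for prefix, first, second, unit in terms:
        factor = _MB_PER_UNIT[unit]
        high += float(second or first) * factor / 1000
        # Assumption: "<N" means anywhere between 0 and N.
        low += 0.0 if prefix == "<" else float(first) * factor / 1000
    return low, high


def total_volume(cells: Iterable[str]) -> tuple[float, float]:
    """Sum the (low, high) ranges of all cells that state a size."""
    ranges = [r for r in map(parse_volume, cells) if r is not None]
    return sum(lo for lo, _ in ranges), sum(hi for _, hi in ranges)
```

Cells with several terms, such as `~10 GB DB + ~10 GB raw`, are summed term by term, so a caller that wants raw corpus size only would need to drop the database term before parsing.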

Accelerators (not sources, but used during ingestion)

  • unitedstates/congress — community-maintained scrapers, public domain (CC0). Massive head start for bills, members, votes.
  • eyecite — citation extraction library. Used to build the citation graph (bill ↔ USC, rule ↔ USC, doc ↔ doc).
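To make the shape of the citation graph concrete, here is a minimal stand-in that does not use eyecite: a deliberately narrow regular expression that recognises only simple US Code citations and records bidirectional document ↔ USC edges. eyecite handles far more citation forms than this pattern; the node ID scheme (`usc:<title>/<section>`) and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Narrow on purpose: matches forms like "42 U.S.C. § 1983" or "5 USC 552" only.
USC_CITATION = re.compile(r"\b(\d+)\s+U\.?\s?S\.?\s?C\.?\s*(?:§+\s*)?(\d+[a-z]?)\b")


def usc_edges(doc_id: str, text: str) -> set[tuple[str, str]]:
    """Return (doc_id, 'usc:<title>/<section>') pairs for each citation found in text."""
    return {
        (doc_id, f"usc:{title}/{section}")
        for title, section in USC_CITATION.findall(text)
    }


def build_graph(docs: dict[str, str]) -> dict[str, set[str]]:
    """Build an undirected adjacency map over documents and the USC sections they cite."""
    graph: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for source, target in usc_edges(doc_id, text):
            graph[source].add(target)
            graph[target].add(source)  # store both directions (bill ↔ USC, rule ↔ USC)
    return dict(graph)
```

With both directions stored, looking up a USC section node returns every bill or rule that cites it, which is the query the citation graph exists to answer.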

| Source | Reason for deferral |
|---|---|
| Reg.gov public comment text (full text) | Hundreds of GB to TBs; per-docket on-demand fetch instead. |
| FEC campaign finance (receipts, disbursements, PAC giving) | Useful for member-voting analysis; not in core v1 loop. |
| Federal contracts and solicitations (SAM.gov, USAspending, FPDS-NG) | Different market segment; possible v2/v3. |
| State legislative tracking (50 states) | Real per-state engineering work. v2; OpenStates as v1 stopgap pointer. |
| EU and international tracking | v3, hosted-first. |
| Member statements / press releases via per-office sites | Fragile. v1 ships floor statements via Congressional Record only. |
| Member social media (X, Bluesky, Truth Social) | Fragile, hostile platforms. v2 if at all. |
| District-to-address matching (Census TIGER/Line) | Useful but adds geo dependency. v2. |
| FARA filings (Foreign Agents Registration Act) | Different from LDA, valuable for embassy buyer segment. v2. |
| Member financial disclosure forms (House Clerk + Senate ECRD) | Privacy-sensitive area, not core to v1 loop. v2. |
| Treaties | Congress.gov API includes them; cheap to add later. v2. |
| Nominations (presidential nominees) | Confirmation hearings already covered; nomination records themselves v2. |

For clarity, sources Josh deliberately does not ingest:

  • News articles and editorial content (POLITICO Pro, E&E News, Bloomberg journalists, CQ, Roll Call) — different shape of product; aggregating editorial content isn't substrate work.
  • Westlaw / LexisNexis case law — different vertical (legal, not policy).
  • Industry research / private datasets — out of scope; partner if needed.
  • Local government / school board tracking — out of scope; the per-jurisdiction discovery cost makes it a different shape of product.
  • Grassroots messaging deliverability — different product entirely; integrate, don't compete.
  • PAC management / FEC compliance — money-touching, regulatory liability.
  • Stakeholder CRM — out of scope; integrate with existing CRMs rather than build one.
  • Custom human-analyst research services (Oxford Analytica style) — different business.