Working document. The source of truth for what Josh ingests, in what order, with what status.
This document is the source of truth for what Josh ingests, in what order, and at what status. Per-source deep-dive docs in docs/sources/ carry the authoritative endpoint specs, schemas, and probe findings — this document is the index that points to them.
- Status column tracks where each source is in the pipeline:
planned → exploring → schema-drafted → ingestion-built → mirror-published.
- This document is the index. Per-source deep-dive docs live in
docs/sources/ — one markdown file per source, with full endpoint specs, schemas, probe findings, and ingestion strategy. The per-source docs are the source of truth for findings; this index just links to them.
- Volume estimates are rough; refine as we measure.
- Adding a source to v1 requires bumping it into the table and writing its per-source doc. Adding to v2 just requires the table row.
The 30 sources Josh ingests for v1 launch.
| # | Source | Primary access | Format | Volume | Status |
|---|
| 1 | Bills (text + status + actions + sponsors + cosponsors + amendments + versions) | GovInfo BILLSTATUS + BILLS + Congress.gov API | XML + JSON | ~10-25 GB | schema-drafted |
| 2 | Bill summaries (CRS-prepared) — covered in bills.html | BILLSTATUS XML inline | XML | (rolls into bills) | schema-drafted |
| 3 | Members of Congress (bio, terms, party, chamber, state, district) | unitedstates/congress-legislators GitHub Pages | YAML | <500 MB | schema-drafted |
| 4 | Committees + subcommittees (membership, jurisdiction, leadership) | unitedstates/congress-legislators GitHub Pages | YAML | <500 MB | schema-drafted |
| 5 | Committee assignments (member ↔ committee) | unitedstates/congress-legislators GitHub Pages | YAML | <100 MB | schema-drafted |
| 6 | Hearings + markup schedule (calendar, witnesses) | Congress.gov API + GovInfo CHRG | JSON / XML | ~1-2 GB | schema-drafted |
| 7 | Hearing transcripts (best-effort, partial) | GovInfo CHRG + committee PDFs + ASR fallback | HTML / PDF / video | ~15-20 GB | schema-drafted |
| 8 | Roll call votes (chamber + per-member) | House Clerk + Senate.gov XML | XML | ~2 GB | schema-drafted |
| 9 | Sponsorships + cosponsorships — covered in bills.html | BILLSTATUS XML inline | XML | (rolls into bills) | schema-drafted |
| # | Source | Primary access | Format | Volume | Status |
|---|
| 10 | US Code (USC) | GovInfo USCODE + OLRC release points | USLM XML | ~3-5 GB | schema-drafted |
| 11 | Public Laws (enacted law text, distinct from bills) | GovInfo Bulk PLAW collection | USLM XML / PDF | ~2 GB | schema-drafted |
| # | Source | Primary access | Format | Volume | Status |
|---|
| 19 | CRS reports | EveryCRSReport.com mirror + Congress.gov API | HTML / PDF / JSON | ~10 GB DB + ~10 GB raw | schema-drafted (code shipped + verified; 23,092-report load 2026-05-08 lost when DO droplet destroyed 2026-05-10 — re-ingest pending OVHcloud provision) |
| 20 | GAO reports | gao.gov RSS + headed-browser scrape | HTML / PDF / RSS | ~10-15 GB | schema-drafted |
| 21 | CBO cost estimates | cbo.gov RSS + headed-browser scrape | HTML / PDF / RSS | ~1-2 GB | schema-drafted |
| 22 | Committee reports (formal explanations of bills) | GovInfo Bulk CRPT collection | XML / PDF | ~3-5 GB | schema-drafted |
| 23 | Congressional Record (floor proceedings, daily digest, extensions of remarks) | GovInfo Bulk CREC collection | XML / HTML / PDF | ~10-15 GB | schema-drafted |
| 24 | Statement of Administration Policy (SAP) | unitedstates/statements-of-administration-policy GitHub mirror | YAML / PDF | <500 MB | schema-drafted |
| # | Source | Primary access | Format | Volume | Status |
|---|
| 27 | Congressional staff directory (chiefs of staff, LDs, LAs, committee staff) | House.gov + Senate.gov + community contribution | HTML | <200 MB | schema-drafted (partial coverage) |
| 28 | Federal agency officials (heads, deputies, key staff) | Agency websites + USA.gov + Wikidata | HTML | <200 MB | schema-drafted (partial coverage) |
| 29 | Caucus memberships | Per-caucus scrape + Wikipedia | HTML | <50 MB | schema-drafted (partial coverage) |
| # | Source | Primary access | Format | Volume | Status |
|---|
| 30 | Issue / topic taxonomy | Hand-curated Josh meta-taxonomy + LCSH backbone + per-source native taxonomies | YAML / JSON-LD | <100 MB | schema-drafted |
Estimated v1 corpus total: ~80-110 GB raw / ~50-70 GB compressed. Larger than the original ~50 GB estimate because USC, Public Laws, Committee Reports, and the Congressional Record were missing from the original count.
unitedstates/congress — community-maintained scrapers, public domain (CC0). Massive head start for bills, members, votes.
eyecite — citation extraction library. Used to build the citation graph (bill ↔ USC, rule ↔ USC, doc ↔ doc).
| Source | Reason for deferral |
|---|
| Reg.gov public comment text (full text) | Hundreds of GB to TBs; per-docket on-demand fetch instead. |
| FEC campaign finance (receipts, disbursements, PAC giving) | Useful for member-voting analysis; not in core v1 loop. |
| Federal contracts and solicitations (SAM.gov, USAspending, FPDS-NG) | Different market segment; possible v2/v3. |
| State legislative tracking (50 states) | Real per-state engineering work. v2; OpenStates as v1 stopgap pointer. |
| EU and international tracking | v3, hosted-first. |
| Member statements / press releases via per-office sites | Fragile. v1 ships floor statements via Congressional Record only. |
| Member social media (X, Bluesky, Truth Social) | Fragile, hostile platforms. v2 if at all. |
| District-to-address matching (Census TIGER/Line) | Useful but adds geo dependency. v2. |
| FARA filings (Foreign Agents Registration Act) | Different from LDA, valuable for embassy buyer segment. v2. |
| Member financial disclosure forms (House Clerk + Senate ECRD) | Privacy-sensitive area, not core to v1 loop. v2. |
| Treaties | Congress.gov API includes them; cheap to add later. v2. |
| Nominations (presidential nominees) | Confirmation hearings already covered; nomination records themselves v2. |
For clarity, sources Josh deliberately does not ingest:
- News articles and editorial content (POLITICO Pro, E&E News, Bloomberg journalists, CQ, Roll Call) — different shape of product; aggregating editorial content isn't substrate work.
- Westlaw / LexisNexis case law — different vertical (legal, not policy).
- Industry research / private datasets — out of scope; partner if needed.
- Local government / school board tracking — out of scope; the per-jurisdiction discovery cost makes it a different shape of product.
- Grassroots messaging deliverability — different product entirely; integrate, don't compete.
- PAC management / FEC compliance — money-touching, regulatory liability.
- Stakeholder CRM — out of scope; integrate with existing CRMs rather than build one.
- Custom human-analyst research services (Oxford Analytica style) — different business.