CBO cost estimates ingester
Why
Cost estimates and budgetary analyses from the Congressional Budget
Office, the legislative branch's nonpartisan budget shop. Every bill
ordered reported by a House or Senate committee gets a CBO score (when
feasible) — 5- and 10-year budget impact. CBO also publishes broader
analyses: federal budget outlook, agency-program reviews, working papers.
For Josh, CBO is critical for the vote-impact and budget-summary
workflows: "what's the 10-year cost of the SPEED Act," "what's CBO's
score on H.R. 4920," "what's the budget outlook for FY2027." The
bill ↔ CBO linkage already has a schema target (bill_cbo_estimates in
bills.html); this ingester populates the CBO publications source table
that the linkage points at.
User stories
As a policy analyst using my own AI agent, I want to ask the agent for the latest CBO score on a bill, with the budget-window summary so that I can quote the official cost without scraping cbo.gov.
As an OSS self-hoster, I want to run `josh ingest cbo-cost-estimates` on a fresh substrate so that I can populate CBO publications from public RSS + asset URLs on my own instance.
As an analyst running impact summaries, I want per-bill CBO estimates joined to bills and CRS reports so that my agent can answer "what's the cost AND who supports it" in one substrate.
Acceptance criteria (EARS)
- When `josh ingest cbo-cost-estimates --since 2026-01-01` is run on the deployed substrate, the system shall populate `cbo_publications` rows whose six citation columns are all non-null.
- When a CBO publication PDF is fetched, the system shall write the raw payload under `/data/corpus/cbo-cost-estimates/bodies/raw/` and a normalized Markdown body under `/data/corpus/cbo-cost-estimates/bodies/markdown/`.
- While the ingester is running, if a single PDF or product page fails to parse, then the system shall log the failure to `ingestion_logs` (with the offending `publication_id`) and continue with the next record.
- Where a publication's title declares a bill (`H.R. NNNN, Title` or `S. NNNN, Title`), the system shall populate `cbo_publication_bills` linking `cbo_publications.id` to `bills.id` (resolving the soft-FK when the bill is in the substrate; raw bill citation preserved otherwise).
- Where the publication subtitle declares the bill stage (`As ordered reported`, `As passed by the House`, etc.), the system shall set `cbo_publications.bill_stage` to the corresponding enum value.
- When an FTS5 query is executed against `cbo_publications_fts`, the system shall return BM25-ranked results.
- Where a publication body exceeds ~5K tokens (broader analyses routinely cross this), the system shall enqueue rows in `chunk_embedding_jobs` for chunk-level embedding into `cbo_publication_chunks_vec0`.
- When an incremental run completes, the system shall update `ingestion_source_state.last_seen_key` to the highest RSS `<pubDate>` observed.
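The title and subtitle criteria above could be covered by a pair of small regex helpers. What follows is a minimal sketch under stated assumptions, not the shipped parser: `BillStage` and its values, `parse_bill_citation`, and `parse_bill_stage` are hypothetical names, and the "As passed by the Senate" pattern is an assumed member of the "etc." in the bill-stage criterion.

```python
import re
from enum import Enum
from typing import Optional


class BillStage(str, Enum):
    # Hypothetical values; the real enum comes from the cbo_publications migration.
    ORDERED_REPORTED = "ordered_reported"
    PASSED_HOUSE = "passed_house"
    PASSED_SENATE = "passed_senate"  # assumed; covered only by "etc." in the spec


# Titles shaped like "H.R. 4920, Some Act" or "S. 123, Some Act".
BILL_TITLE_RE = re.compile(
    r"^\s*(?P<chamber>H\.R\.|S\.)\s*(?P<number>\d+),\s*(?P<title>.+?)\s*$"
)

STAGE_PATTERNS = [
    (re.compile(r"\bas ordered reported\b", re.IGNORECASE), BillStage.ORDERED_REPORTED),
    (re.compile(r"\bas passed by the house\b", re.IGNORECASE), BillStage.PASSED_HOUSE),
    (re.compile(r"\bas passed by the senate\b", re.IGNORECASE), BillStage.PASSED_SENATE),
]


def parse_bill_citation(title: str) -> Optional[dict]:
    """Return chamber, number, short title, and raw citation, or None if no bill is declared."""
    m = BILL_TITLE_RE.match(title)
    if m is None:
        return None
    return {
        "chamber": m["chamber"],
        "number": int(m["number"]),
        "title": m["title"],
        # The raw citation is kept so an unresolved soft-FK can be retried later.
        "raw_citation": f"{m['chamber']} {m['number']}",
    }


def parse_bill_stage(subtitle: str) -> Optional[BillStage]:
    """Map subtitle wording to a stage; None means no known pattern matched."""
    for pattern, stage in STAGE_PATTERNS:
        if pattern.search(subtitle):
            return stage
    return None
```

Returning `None` for an unmatched subtitle, rather than guessing, keeps the match rate measurable, which is what a regex-first approach needs if it is to be compared against a coverage threshold.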
Success determiner
Command
set -euo pipefail
ssh josh 'docker exec josh-ingester josh ingest cbo-cost-estimates --since 2026-04-01 --max 25'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
SELECT COUNT(*) FROM cbo_publications
WHERE published_at >= '\''2026-04-01'\''
AND source_url IS NOT NULL AND source_org IS NOT NULL
AND retrieved_at IS NOT NULL AND raw_sha256 IS NOT NULL
AND citation_string IS NOT NULL;"'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db \
"SELECT COUNT(*) FROM cbo_publications_fts WHERE cbo_publications_fts MATCH '\''appropriations'\'';"'
Expect
Smoke run only. Full backfill scope is post-2000 (~25K publications online) and runs detached on the server per `add-a-new-source.html`. Both HTML product pages and `/system/files/` asset PDFs are DataDome-walled (PDFs return 403 on direct curl), so the headed-browser fetcher is required for every product-page and PDF fetch; only the RSS discovery feed accepts direct requests.
Clarifications needed
- Backfill scope: post-2000 (~25K publications) vs current FY only (~600-1000) for v1 launch. Lean: post-2000 — ingest is sequential, asset PDFs are ~1-2 GB total, citation graph value benefits from breadth.
- PDF normalizer: PyMuPDF (fast, decent) vs pdfplumber (best for tables) vs Marker (LLM-assisted, expensive). CBO docs are table-heavy (budget windows). Lean: pdfplumber — same call as GAO's PDF normalizer choice; ideally the choice converges across PDF-heavy sources.
- Headed-browser dependency story for both product detail pages AND asset PDFs (both behind the DataDome wall — `/system/files/` PDFs now 403 on direct curl): does `josh-ingester` ship with the same browser instance used for GAO + Senate + LDA, or per-source workers? Lean: shared long-running browser context — same Chromium serves all bot-walled sources, including the PDF fetch.
- Bill-stage parsing: regex on subtitle text (current plan) vs LLM extraction (more robust to wording variation)? Lean: regex for v1 with a clarifications log if matches fall below 95%.
- Working papers and presentations — store body text (PDFs) or just metadata + link? Lean: store body for working papers (high-value research), metadata-only for presentations (slides ingest poorly into text).
Out of scope
- State-level budget offices (legislative branch budget analysis at the state level) — Phase 6+.
- Pre-2000 historical CBO publications — out of digital coverage; revisit in v2.
- CBO Data Library spreadsheets (Baseline Projections, Long-Term Outlook) — separate dataset, useful for v1.x.
- Real-time webhooks / push subscriptions — surface-layer concern.
- Scraping cbo.gov search directly — RSS gives linear coverage of new publications.
Dependencies
Plan
Source module follows the standard Source protocol. Per-source spec at https://docs.usejosh.com/sources/cbo-cost-estimates/.
Discover/fetch/parse/load shape:
- discover: https://www.cbo.gov/publications/all/rss.xml (RSS, open
to bots) is the sole discovery feed; filter to cost estimates
client-side off the feed entries. The former cost-estimate-only feed at
/cost-estimates/all/rss.xml is dead — it returns a 200 text/html
Drupal antibot page (unrendered Twig title, zero <item> elements), so
it cannot be used for discovery or filtering. Watermark from
ingestion_source_state.last_seen_key='cbo-publications-rss' keyed on
<pubDate>. Backfill via paginated /publications HTML (headed browser).
- fetch: per-publication product page via headed browser
(https://www.cbo.gov/publication/{id}) saved to
/data/corpus/cbo-cost-estimates/bodies/raw/{id}.html, plus the asset
PDF at /system/files/{YYYY-MM}/{slug}.pdf (extract slug from product
page — slug is not always derivable from title). Asset PDFs are now
DataDome-walled (direct curl returns 403 even with full browser
headers), so the PDF fetcher must reuse the same headed/DataDome path
as the product pages — not a direct HTTP GET. SHA256 over the PDF
bytes is raw_sha256.
- parse: extract product page fields (title, subtitle/bill-stage,
publish date, summary, "As ordered reported by..." note, formats list).
Run the chosen PDF normalizer to produce Markdown body. Bill-stage
enum populated from subtitle regex.
- load: idempotent upsert on (publication_id) natural key. Bill
soft-FK populated from title parse. FTS5 sync triggers fire on
insert/update/delete. Body chunks enqueued for embedding when body
exceeds threshold.
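The discover step and the raw_sha256 rule above can be sketched with the standard library alone. This is an illustration under assumptions rather than the actual Source module: `FeedEntry`, `discover`, `raw_sha256`, and the `keep` predicate are hypothetical names, and how cost estimates are recognized among feed entries is left to the caller because the plan does not specify it.

```python
import hashlib
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Callable, Iterator, NamedTuple, Optional


class FeedEntry(NamedTuple):
    title: str
    link: str
    published_at: datetime


def discover(
    rss_xml: str,
    watermark: Optional[datetime] = None,
    keep: Optional[Callable[[FeedEntry], bool]] = None,
) -> Iterator[FeedEntry]:
    """Yield feed entries newer than the watermark that pass the optional filter."""
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        link = item.findtext("link")
        if not pub or not link:
            continue  # skip malformed items instead of aborting the run
        published_at = parsedate_to_datetime(pub.strip())
        if published_at.tzinfo is None:
            # Assumption: a pubDate without a usable offset is treated as UTC.
            published_at = published_at.replace(tzinfo=timezone.utc)
        if watermark is not None and published_at <= watermark:
            continue
        entry = FeedEntry((item.findtext("title") or "").strip(), link.strip(), published_at)
        if keep is None or keep(entry):
            yield entry


def raw_sha256(pdf_bytes: bytes) -> str:
    """Hex SHA256 over the fetched PDF bytes."""
    return hashlib.sha256(pdf_bytes).hexdigest()
```

After a run, the new watermark would be the maximum `published_at` among the yielded entries. The watermark passed in must be timezone-aware, since Python refuses to compare aware and naive datetimes.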
Migration shape: cbo_publications (with the six citation columns +
publication_id UNIQUE + publication_type discriminator + bill_stage
enum), cbo_publication_bills (M:N; a publication can score multiple
bills, and a bill can have multiple stage estimates), cbo_publications_fts
virtual + 3 sync triggers, cbo_publication_chunks +
cbo_publication_chunks_vec0 (vec0 with float[1024] and bit[1024]).
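To make the FTS5 and upsert parts of this shape concrete, here is a reduced sketch against an in-memory SQLite database. It is an assumption-laden illustration, not the Alembic migration: the `title` and `body_markdown` columns, the trigger names, and the `upsert_publication` and `search` helpers are hypothetical. The citation columns, cbo_publication_bills, and the vec0 tables are omitted. It assumes a SQLite build with FTS5 and version 3.24 or later for `ON CONFLICT ... DO UPDATE`.

```python
import sqlite3

SCHEMA = """
CREATE TABLE cbo_publications (
    id INTEGER PRIMARY KEY,
    publication_id TEXT NOT NULL UNIQUE,   -- natural key used by the upsert
    title TEXT NOT NULL,
    body_markdown TEXT NOT NULL DEFAULT ''
);
-- External-content FTS5 index: text lives in cbo_publications only.
CREATE VIRTUAL TABLE cbo_publications_fts USING fts5(
    title, body_markdown, content='cbo_publications', content_rowid='id'
);
-- Three sync triggers keep the index aligned with the base table.
CREATE TRIGGER cbo_publications_ai AFTER INSERT ON cbo_publications BEGIN
    INSERT INTO cbo_publications_fts(rowid, title, body_markdown)
    VALUES (new.id, new.title, new.body_markdown);
END;
CREATE TRIGGER cbo_publications_ad AFTER DELETE ON cbo_publications BEGIN
    INSERT INTO cbo_publications_fts(cbo_publications_fts, rowid, title, body_markdown)
    VALUES ('delete', old.id, old.title, old.body_markdown);
END;
CREATE TRIGGER cbo_publications_au AFTER UPDATE ON cbo_publications BEGIN
    INSERT INTO cbo_publications_fts(cbo_publications_fts, rowid, title, body_markdown)
    VALUES ('delete', old.id, old.title, old.body_markdown);
    INSERT INTO cbo_publications_fts(rowid, title, body_markdown)
    VALUES (new.id, new.title, new.body_markdown);
END;
"""


def upsert_publication(conn: sqlite3.Connection, publication_id: str, title: str, body: str) -> None:
    """Idempotent load: re-ingesting the same publication_id updates the row in place."""
    conn.execute(
        """
        INSERT INTO cbo_publications (publication_id, title, body_markdown)
        VALUES (?, ?, ?)
        ON CONFLICT(publication_id) DO UPDATE SET
            title = excluded.title,
            body_markdown = excluded.body_markdown
        """,
        (publication_id, title, body),
    )


def search(conn: sqlite3.Connection, query: str) -> list:
    """Return publication_ids for an FTS5 MATCH query, best BM25 score first."""
    rows = conn.execute(
        """
        SELECT p.publication_id
        FROM cbo_publications_fts
        JOIN cbo_publications AS p ON p.id = cbo_publications_fts.rowid
        WHERE cbo_publications_fts MATCH ?
        ORDER BY bm25(cbo_publications_fts)
        """,
        (query,),
    ).fetchall()
    return [row[0] for row in rows]
```

Because the upsert resolves conflicts with `DO UPDATE`, a re-ingested publication fires the update trigger, so stale terms leave the index without any separate reindex step.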
Schedule: 4-hourly RSS poll. CBO publishes 1-5 cost estimates per
business day during session weeks; a 4-hourly poll catches stragglers
without spamming the feed. Off-hours and weekend polls are cheap no-ops.
Anti-bot: real-Chromium headed browser for both product pages AND asset
PDFs (mirrors GAO/LDA pattern) — /system/files/ PDFs are DataDome-walled
and 403 on direct curl. Only the /publications/all/rss.xml feed accepts
direct curl with a realistic UA.
Tasks
0 of 16 done.
- t1 Alembic migration applied locally + production (cbo_publications + cbo_publication_bills + FTS5 + vec0)
- t2 Pydantic models in shared/josh_substrate/src/josh_substrate/models/cbo.py
- t3 Source module implementing the `Source` protocol
- t4 Body normalization wired (chosen PDF normalizer per clarification + raw HTML fallback)
- t5 Citation metadata: all 6 columns populated; citation_for() formatter + 3-fixture unit tests (cost estimate, working paper, broader report)
- t6 FTS5 virtual table + 3 sync triggers wired into migration
- t7 Vector vec0 table wired (broader analyses routinely cross 5K-token threshold)
- t8 Unit tests for product-page HTML parser, PDF normalizer, and bill-stage subtitle regex against captured fixtures
- t9 Integration test for full discover→fetch→parse→load pipeline (RSS → headed-browser fetch → PDF extract → row insert)
- t10 Smoke backfill (25 publications) clean per success_determiner
- t11 Production backfill at agreed scope (per clarifications_needed) complete
- t12 4-hourly cron registered and observed firing once
- t13 No recurring errors in `ingestion_logs` for the last completed run
- t14 Source doc updated; inventory rows in https://docs.usejosh.com/josh-data-sources/ + https://docs.usejosh.com/data-status/ flipped to ingestion-built
- t15 12-sample manual PDF→Markdown validation gate (per add-a-new-source.html — stratified across cost estimate, working paper, broader report, table-heavy budget outlook)
- t16 Bill-soft-FK resolution test: smoke-backfill cohort of cost estimates with `H.R. NNNN`/`S. NNNN` titles all resolve to a `bills.id` (when bill exists in substrate) or queue retry
Changelog
No history yet.