CBO cost estimates ingester


Why

This ingester covers cost estimates and budgetary analyses from the
Congressional Budget Office, the legislative branch's nonpartisan budget
shop. Every bill
ordered reported by a House or Senate committee gets a CBO score (when
feasible) — 5- and 10-year budget impact. CBO also publishes broader
analyses: federal budget outlook, agency-program reviews, working papers.
For Josh, CBO is critical for the vote-impact and budget-summary
workflows: "what's the 10-year cost of the SPEED Act," "what's CBO's
score on H.R. 4920," "what's the budget outlook for FY2027." The
bill ↔ CBO linkage already has a schema target (bill_cbo_estimates in
bills.html); this ingester populates the CBO publications table that
the linkage points at.

User stories

As a policy analyst using my own AI agent, I want to ask the agent for the latest CBO score on a bill, with the budget-window summary, so that I can quote the official cost without scraping cbo.gov.

As an OSS self-hoster, I want to run `josh ingest cbo-cost-estimates` on a fresh substrate so that I can populate CBO publications from public RSS + asset URLs on my own instance.

As an analyst running impact summaries, I want per-bill CBO estimates joined to bills and CRS reports so that my agent can answer "what's the cost AND who supports it" in one substrate.

Acceptance criteria (EARS)

When `josh ingest cbo-cost-estimates --since 2026-01-01` is run on the deployed substrate, the system shall populate `cbo_publications` rows whose six citation columns are all non-null.
When a CBO publication PDF is fetched, the system shall write the raw payload under `/data/corpus/cbo-cost-estimates/bodies/raw/` and a normalized Markdown body under `/data/corpus/cbo-cost-estimates/bodies/markdown/`.
While the ingester is running, if a single PDF or product page fails to parse, then the system shall log the failure to `ingestion_logs` (with the offending `publication_id`) and continue with the next record.
Where a publication's title declares a bill (`H.R. NNNN, Title` or `S. NNNN, Title`), the system shall populate `cbo_publication_bills` linking `cbo_publications.id` to `bills.id` (resolving the soft-FK when the bill is in the substrate; raw bill citation preserved otherwise).
Where the publication subtitle declares the bill stage (`As ordered reported`, `As passed by the House`, etc.), the system shall set `cbo_publications.bill_stage` to the corresponding enum value.
When an FTS5 query is executed against `cbo_publications_fts`, the system shall return BM25-ranked results.
Where a publication body exceeds ~5K tokens (broader analyses routinely cross this), the system shall enqueue rows in `chunk_embedding_jobs` for chunk-level embedding into `cbo_publication_chunks_vec0`.
When an incremental run completes, the system shall update `ingestion_source_state.last_seen_key` to the highest RSS `<pubDate>` observed.
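
The title-to-bill criterion above can be sketched as a small parser. This is a minimal illustration, assuming only the two title forms named in the criterion (`H.R. NNNN, Title` and `S. NNNN, Title`); the `BillCitation` type, the `parse_bill_citation` name, and the `"hr"`/`"s"` chamber codes are hypothetical and not taken from the codebase.

```python
import re
from typing import NamedTuple, Optional


class BillCitation(NamedTuple):
    """Hypothetical parsed form of a bill reference found in a title."""
    chamber: str      # "hr" or "s" (assumed codes, not the real schema)
    number: int
    short_title: str


# Covers only the two forms the criterion names:
# "H.R. NNNN, Title" and "S. NNNN, Title".
_BILL_TITLE = re.compile(r"^\s*(H\.R\.|S\.)\s*(\d+),\s*(.+?)\s*$")


def parse_bill_citation(title: str) -> Optional[BillCitation]:
    """Return the bill cited in a publication title, or None if none is declared."""
    match = _BILL_TITLE.match(title)
    if match is None:
        return None
    chamber = "hr" if match.group(1) == "H.R." else "s"
    return BillCitation(chamber, int(match.group(2)), match.group(3))
```

A `None` result means the title declares no bill (a budget outlook, for instance), so no `cbo_publication_bills` row would be written. When a citation parses but the bill is not yet in the substrate, the caller would keep the raw title string alongside the unresolved link, as the criterion requires.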

Success determiner

Kind: bash

Command

set -euo pipefail
ssh josh 'docker exec josh-ingester josh ingest cbo-cost-estimates --since 2026-04-01 --max 25'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
  SELECT COUNT(*) FROM cbo_publications
  WHERE published_at >= '\''2026-04-01'\''
    AND source_url IS NOT NULL AND source_org IS NOT NULL
    AND retrieved_at IS NOT NULL AND raw_sha256 IS NOT NULL
    AND citation_string IS NOT NULL;"'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db \
  "SELECT COUNT(*) FROM cbo_publications_fts WHERE cbo_publications_fts MATCH '\''appropriations'\'';"'

Expect

>= 25 fully-populated rows; FTS5 match returns >= 1.

Smoke run only. Full backfill scope is post-2000 (~25K publications online) and runs detached on the server per `add-a-new-source.html`. Both HTML product pages and `/system/files/` asset PDFs are DataDome-walled (PDFs 403 on direct curl), so the headed-browser fetcher is required across the whole discover→fetch loop.

Clarifications needed

Backfill scope: post-2000 (~25K publications) vs current FY only (~600-1000) for v1 launch. Lean: post-2000 — ingest is sequential, asset PDFs are ~1-2 GB total, citation graph value benefits from breadth.
PDF normalizer: PyMuPDF (fast, decent) vs pdfplumber (best for tables) vs Marker (LLM-assisted, expensive). CBO docs are table-heavy (budget windows). Lean: pdfplumber — same call as GAO's PDF normalizer choice; ideally the choice converges across PDF-heavy sources.
Headed-browser dependency story for both product detail pages AND asset PDFs (both behind the DataDome wall — `/system/files/` PDFs now 403 on direct curl): does `josh-ingester` ship with the same browser instance used for GAO + Senate + LDA, or per-source workers? Lean: shared long-running browser context — same Chromium serves all bot-walled sources, including the PDF fetch.
Bill-stage parsing: regex on subtitle text (current plan) vs LLM extraction (more robust to wording variation)? Lean: regex for v1 with a clarifications log if matches fall below 95%.
Working papers and presentations — store body text (PDFs) or just metadata + link? Lean: store body for working papers (high-value research), metadata-only for presentations (slides ingest poorly into text).
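
For the bill-stage question above, the regex-first lean with a 95% match-rate guard could look roughly like the sketch below. The enum values (`ordered_reported`, `passed_house`) and all function names are hypothetical; only the two subtitle phrasings and the 95% threshold come from this spec.

```python
import re
from typing import Iterable, Optional

# Hypothetical enum values; the spec names only these two phrasings ("etc.").
_STAGE_PATTERNS = [
    (re.compile(r"\bas ordered reported\b", re.IGNORECASE), "ordered_reported"),
    (re.compile(r"\bas passed by the house\b", re.IGNORECASE), "passed_house"),
]


def parse_bill_stage(subtitle: str) -> Optional[str]:
    """Map a publication subtitle to a stage value, or None if nothing matches."""
    for pattern, stage in _STAGE_PATTERNS:
        if pattern.search(subtitle):
            return stage
    return None


def match_rate(subtitles: Iterable[str]) -> float:
    """Fraction of subtitles for which a stage could be extracted."""
    items = list(subtitles)
    if not items:
        return 1.0  # nothing to parse, so nothing failed
    hits = sum(parse_bill_stage(s) is not None for s in items)
    return hits / len(items)


def needs_clarification(subtitles: Iterable[str], threshold: float = 0.95) -> bool:
    """True when the regex coverage falls below the agreed threshold."""
    return match_rate(subtitles) < threshold
```

Running `needs_clarification` over the subtitles of a completed batch would give a cheap signal for when wording variation has outgrown the regex list and the LLM-extraction option is worth revisiting.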

Out of scope

State-level budget offices (legislative branch budget analysis at the state level) — Phase 6+.
Pre-2000 historical CBO publications — out of digital coverage; revisit in v2.
CBO Data Library spreadsheets (Baseline Projections, Long-Term Outlook) — separate dataset, useful for v1.x.
Real-time webhooks / push subscriptions — surface-layer concern.
Scraping cbo.gov search directly — RSS gives linear coverage of new publications.

Dependencies

Plan

Source module follows the standard Source protocol. Per-source spec at
https://docs.usejosh.com/sources/cbo-cost-estimates/.

Discover/fetch/parse/load shape:
- discover: https://www.cbo.gov/publications/all/rss.xml (RSS, open
to bots) is the sole discovery feed; filter to cost estimates
client-side off the feed entries. The former cost-estimate-only feed at
/cost-estimates/all/rss.xml is dead — it returns a 200 text/html
Drupal antibot page (unrendered Twig title, zero <item> elements), so
it cannot be used for discovery or filtering. Watermark is the
ingestion_source_state row for 'cbo-publications-rss', whose
last_seen_key holds the highest <pubDate> seen. Backfill via paginated
/publications HTML (headed browser).
- fetch: per-publication product page via headed browser
(https://www.cbo.gov/publication/{id}) saved to
/data/corpus/cbo-cost-estimates/bodies/raw/{id}.html, plus the asset
PDF at /system/files/{YYYY-MM}/{slug}.pdf (extract slug from product
page — slug is not always derivable from title). Asset PDFs are now
DataDome-walled (direct curl returns 403 even with full browser
headers), so the PDF fetcher must reuse the same headed/DataDome path
as the product pages — not a direct HTTP GET. SHA256 over the PDF
bytes is raw_sha256.
- parse: extract product page fields (title, subtitle/bill-stage,
publish date, summary, "As ordered reported by..." note, formats list).
Run the chosen PDF normalizer to produce Markdown body. Bill-stage
enum populated from subtitle regex.
- load: idempotent upsert on (publication_id) natural key. Bill
soft-FK populated from title parse. FTS5 sync triggers fire on
insert/update/delete. Body chunks enqueued for embedding when body
exceeds threshold.
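
The discover step's watermark logic can be sketched with the standard library alone. This is an illustration under assumptions: the feed is standard RSS 2.0 with `<item>`, `<title>`, `<link>`, and `<pubDate>` elements, and the `new_items` name and the returned dictionary keys are hypothetical. Filtering down to cost estimates is left to the caller, since this spec does not say which feed field distinguishes them.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import List, Optional, Tuple


def new_items(
    rss_xml: str, watermark: Optional[datetime]
) -> Tuple[List[dict], Optional[datetime]]:
    """Return feed items newer than the watermark, plus the new high-water mark."""
    root = ET.fromstring(rss_xml)
    fresh: List[dict] = []
    high = watermark
    for item in root.iter("item"):
        pub = item.findtext("pubDate")
        link = item.findtext("link")
        if not pub or not link:
            continue  # incomplete entry; skip rather than abort the run
        try:
            published = parsedate_to_datetime(pub.strip())
        except (TypeError, ValueError):
            continue  # unparseable date; skip this entry
        if published.tzinfo is None:
            # Treat zone-less dates as UTC so comparisons stay well-defined.
            published = published.replace(tzinfo=timezone.utc)
        if watermark is not None and published <= watermark:
            continue  # already seen on a previous run
        fresh.append(
            {
                "title": (item.findtext("title") or "").strip(),
                "link": link.strip(),
                "published_at": published,
            }
        )
        if high is None or published > high:
            high = published
    return fresh, high
```

The second return value is what an incremental run would persist as the new watermark. Passing it back in on the next poll yields an empty list when nothing new has been published, which is what makes off-hours polls cheap.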

Migration shape: cbo_publications (with the six citation columns +
publication_id UNIQUE + publication_type discriminator + bill_stage
enum), cbo_publication_bills (M:N — a publication can score multiple
bills, a bill can have multiple stage estimates), cbo_publications_fts
virtual + 3 sync triggers, cbo_publication_chunks +
cbo_publication_chunks_vec0 (vec0 with float[1024] and bit[1024]).
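
As a rough illustration of how the `publication_id UNIQUE` constraint supports the idempotent load step, the sketch below upserts into a deliberately trimmed-down table using Python's built-in `sqlite3` module. The column subset, the `SCHEMA` and `UPSERT` strings, and the `upsert_publication` helper are assumptions for illustration; the real migration carries the full citation columns, the FTS5 table, and the vec0 tables.

```python
import hashlib
import sqlite3

# Trimmed-down, hypothetical subset of the real cbo_publications table.
SCHEMA = """
CREATE TABLE IF NOT EXISTS cbo_publications (
    id             INTEGER PRIMARY KEY,
    publication_id TEXT NOT NULL UNIQUE,
    title          TEXT NOT NULL,
    bill_stage     TEXT,
    raw_sha256     TEXT NOT NULL
);
"""

# Re-ingesting the same publication_id updates the row instead of duplicating it.
UPSERT = """
INSERT INTO cbo_publications (publication_id, title, bill_stage, raw_sha256)
VALUES (:publication_id, :title, :bill_stage, :raw_sha256)
ON CONFLICT(publication_id) DO UPDATE SET
    title      = excluded.title,
    bill_stage = excluded.bill_stage,
    raw_sha256 = excluded.raw_sha256;
"""


def upsert_publication(
    conn: sqlite3.Connection,
    publication_id: str,
    title: str,
    bill_stage: str,
    pdf_bytes: bytes,
) -> str:
    """Insert or update one publication; returns the SHA256 of the PDF bytes."""
    digest = hashlib.sha256(pdf_bytes).hexdigest()
    conn.execute(
        UPSERT,
        {
            "publication_id": publication_id,
            "title": title,
            "bill_stage": bill_stage,
            "raw_sha256": digest,
        },
    )
    conn.commit()
    return digest
```

Because the conflict target is the natural key, rerunning a backfill over already-ingested publications leaves one row per `publication_id`, and an `UPDATE`-based FTS5 sync trigger would fire on the second pass rather than an `INSERT` one.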

Schedule: 4-hourly RSS poll. CBO publishes 1-5 cost estimates per
business day during session weeks; 4-hourly catches stragglers without
spamming. Off-hours and weekends are cheap no-ops.

Anti-bot: real-Chromium headed browser for both product pages AND asset
PDFs (mirrors GAO/LDA pattern) — /system/files/ PDFs are DataDome-walled
and 403 on direct curl. Only the /publications/all/rss.xml feed accepts
direct curl with a realistic UA.

Tasks

0 of 16 done.

Changelog

No history yet.