
CBO cost estimates

Cost estimates and budgetary analyses from the Congressional Budget Office, the legislative branch's nonpartisan budget shop. Every bill ordered reported by a House or Senate committee gets a CBO score (when feasible) — the budget impact over a 5-year and 10-year window. CBO also publishes broader analyses: the federal budget outlook, agency-program reviews, working papers, and presentations.

For Josh, CBO cost estimates are critical for the vote-impact and budget-summary workflows: "what's the 10-year cost of the SPEED Act," "what's CBO's score on H.R. 4920," "what's the budget outlook for FY2027." The data is structured enough to ingest reliably; the published score is a foreign-key target on bills (we already drafted bill_cbo_estimates in bills).

This is a thin source doc: CBO publishes ~600-1000 cost estimates per year and ~150-300 broader publications. Volume and complexity are modest.

CBO publishes no structured API or developer reference — there is no JSON API, and nothing upstream documents the page layouts, feeds, or the bot-walls. The content below is reverse-engineered from probing the live site; see CBO's products / RSS overview for the only upstream-facing pointer.

| Field | Value |
| --- | --- |
| Source name | CBO cost estimates |
| Publisher | Congressional Budget Office |
| License | Public domain |
| Coverage | 1976 – present (CBO's founding). Online corpus reliable post-2000. |
| Volume | ~600-1000 cost estimates per year + ~150-300 broader publications. ~25,000+ all-time. |
| Storage estimate | ~1-2 GB raw PDF + HTML; ~500 MB extracted text |
| Site | https://www.cbo.gov/ |
| Primary feed | https://www.cbo.gov/publications/all/rss.xml (all CBO publications, including cost estimates); the only open feed (see Access notes) |
| Auth | None |
| Rate limits | None observed; HTML pages and /system/files/ asset PDFs bot-walled by DataDome (real Chrome required; PDFs 403 on direct curl) |
| Incremental sync hints | RSS <pubDate> + per-publication pubDate |
| Stable ID format | cbo:{publication_id}, e.g. cbo:62310 |
| Status | exploring (schema drafted, ingestion not built) |

Primary: the CBO publications RSS feed (/publications/all/rss.xml) for discovery, with client-side filtering to cost estimates, plus a headed-browser fetch for both product detail pages and /system/files/ asset PDFs (both DataDome-walled).

Secondary: Congress.gov bill API. BILLSTATUS XML's <cboCostEstimates> element references CBO publications by URL — confirms the bill ↔ CBO linkage when CRS or committee staff has tagged it. Use as enrichment. (The upstream BILLSTATUS schema is documented at usgpo/bill-status.)

Skip: scraping search results from cbo.gov directly. Search is heavily JS-driven and protected; the RSS gives us linear coverage of new publications.

HTML pages and asset PDFs are bot-walled by DataDome (403 responses carry an x-datadome: protected header). Only the /publications/all/rss.xml feed is open to direct curl with a real-browser User-Agent. Both the product detail pages and the /system/files/ asset PDFs need the headed-browser fetcher (the same one used for Senate / GAO); direct curl of a PDF now returns 403 even with full browser headers.
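A minimal sketch of how a fetcher might recognise the bot-wall described above, so it can route the request to the headed browser instead of retrying curl. The function name `is_datadome_block` is hypothetical; the only signal used is the 403 status plus the `x-datadome: protected` header noted in this section.

```python
from typing import Mapping


def is_datadome_block(status: int, headers: Mapping[str, str]) -> bool:
    """Return True when a response looks like a DataDome rejection.

    Assumption: the 403 + 'x-datadome: protected' pairing observed on
    cbo.gov is the only signal checked here.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value.strip().lower() for name, value in headers.items()}
    return status == 403 and lowered.get("x-datadome") == "protected"
```

A caller would typically treat a `True` result as "re-queue for the headed browser" rather than as a hard failure.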

The site uses an absolute URL pattern for assets: cbo.gov/system/files/{YYYY-MM}/{slug}.pdf. Slugs are the publication's node_alias derived from title at publication time. The slug is not always derivable from the title — extract it from the product page rather than constructing it.

There is no JSON API. Everything is HTML + RSS + PDF. See CBO's products overview for the publication types themselves; the URL shapes (/publication/{id}, /system/files/{YYYY-MM}/{slug}.pdf, the /data Data Library) carry no Josh-specific finding beyond what's noted here.

The former cost-estimates-only RSS feed at https://www.cbo.gov/cost-estimates/all/rss.xml is dead — it returns a 200 text/html Drupal antibot page (unrendered Twig title, zero <item> elements). Do not use it for discovery or cross-check. Filter cost estimates client-side off the all-publications feed instead.

The /data Data Library hosts recurring spreadsheets (Baseline Projections, Long-Term Budget Outlook) — a separate dataset, useful for v1.x (see Open questions).

Publication types (inferred from title prefix and RSS subtitle)

| Type | Title pattern | Description |
| --- | --- | --- |
| cost_estimate | "H.R. NNNN, Title" or "S. NNNN, Title" prefix | Per-bill cost estimate (the dominant volume) |
| report | various | Broader analytical reports (DoD budget, Long-Term Outlook, etc.) |
| working_paper | "Working Paper YYYY-NN" suffix or /publication/...working-paper-... slug | Research papers from CBO economists |
| presentation | "Presentation about ..." or specific slug pattern | Conference/seminar slides |
| letter | "Letter to ..." or "Information Letter" | Member-correspondence responses |
| data_publication | Spreadsheets / interactive tools | Underlying data files |
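The title patterns in the table can be sketched as a small classifier. This is an illustration under assumptions, not the loader's actual logic: `classify_publication` is a hypothetical name, only the H.R. and S. prefixes from the table are recognised, and `report` and `data_publication` have no title pattern to match on, so anything unrecognised falls through to `other`.

```python
import re

# "H.R. NNNN, Title" or "S. NNNN, Title" prefix, as listed in the table.
_BILL_PREFIX = re.compile(r"^(H\.R\.|S\.)\s*\d+,")
# "Working Paper YYYY-NN" suffix.
_WORKING_PAPER_SUFFIX = re.compile(r"Working Paper \d{4}-\d{2}$")


def classify_publication(title: str, slug: str = "") -> str:
    """Guess publication_type from the title (and optionally the URL slug)."""
    text = title.strip()
    if _BILL_PREFIX.match(text):
        return "cost_estimate"
    if _WORKING_PAPER_SUFFIX.search(text) or "working-paper-" in slug:
        return "working_paper"
    if text.startswith("Presentation about "):
        return "presentation"
    if text.startswith("Letter to ") or "Information Letter" in text:
        return "letter"
    # 'report' and 'data_publication' cannot be told apart by title alone.
    return "other"
```

Because the result is a heuristic, it is best stored as a hint and corrected from the product page when that page says otherwise.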

CBO cost estimates state the bill stage in the subtitle. Observed values:

| Subtitle pattern | Meaning |
| --- | --- |
| As ordered reported by the [Committee] on [Date] | After committee markup vote |
| As reported by the [Committee] on [Date] | After committee report filed |
| As passed by the House on [Date] | After House passage (Senate-side estimate) |
| As passed by the Senate on [Date] | After Senate passage |
| As introduced | At introduction (less common; CBO usually waits for committee action) |
| As enacted | After enactment (rare; usually rolled into broader analyses) |

A single bill can have multiple CBO estimates — one per stage. Schema must allow many estimates per bill.
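One way to turn the observed subtitle patterns into structured fields is a list of anchored regular expressions. The sketch below is hypothetical (`parse_subtitle` and `StageInfo` are illustrative names), it assumes dates are written like "June 10, 2025", and it reuses the stage values from the `bill_stage` column comment in the schema. The committee group is deliberately greedy so that names containing " on " (for example "Committee on ...") split at the last " on ".

```python
import re
from datetime import date, datetime
from typing import NamedTuple, Optional


class StageInfo(NamedTuple):
    bill_stage: str
    committee: Optional[str]
    action_date: Optional[date]


_STAGE_PATTERNS = [
    (re.compile(r"^As ordered reported by the (?P<committee>.+) on (?P<date>.+)$"),
     "ordered_reported_committee"),
    (re.compile(r"^As reported by the (?P<committee>.+) on (?P<date>.+)$"),
     "reported_committee"),
    (re.compile(r"^As passed by the House on (?P<date>.+)$"), "passed_house"),
    (re.compile(r"^As passed by the Senate on (?P<date>.+)$"), "passed_senate"),
    (re.compile(r"^As introduced"), "introduced"),
    (re.compile(r"^As enacted"), "enacted"),
]


def _parse_date(raw: str) -> Optional[date]:
    # Assumed format: "Month D, YYYY"; anything else yields None.
    try:
        return datetime.strptime(raw.strip().rstrip("."), "%B %d, %Y").date()
    except ValueError:
        return None


def parse_subtitle(subtitle: str) -> Optional[StageInfo]:
    """Map a cost-estimate subtitle to (bill_stage, committee, action_date)."""
    text = " ".join(subtitle.split())  # collapse stray whitespace
    for pattern, stage in _STAGE_PATTERNS:
        match = pattern.match(text)
        if match:
            groups = match.groupdict()
            raw_date = groups.get("date")
            return StageInfo(
                stage,
                groups.get("committee"),
                _parse_date(raw_date) if raw_date else None,
            )
    return None  # unknown pattern: leave bill_stage unset
```

Returning `None` for unrecognised subtitles keeps parser misses visible, which helps when tracking accuracy against a fixture set.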

cbo:{publication_id}, where publication_id is the integer node ID at the end of cbo.gov/publication/{N}.

Examples:

  • cbo:62310 (HR 3447 Chip Security Act)
  • cbo:62301 (DoD 2026 Budget Request analysis)

The integer is monotonically increasing; new publications get higher IDs. There's no per-bill collision because each estimate is its own publication.

The discovery feed is a standard RSS 2.0 + Dublin Core stream (see CBO's RSS overview). The Josh-relevant mapping per item:

  • <title> is the bill ID + short title for cost estimates, or the full title for non-cost-estimate publications.
  • <description> is the bill stage / subtitle (one line) for cost estimates, or a longer summary paragraph for other publications.
  • <pubDate> is an RFC 2822 date-time in US Eastern time.
  • <link> / <guid> is the cbo.gov/publication/{id} permalink.
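The per-item mapping above can be sketched with the standard library alone. This is an illustration, not the loader: `parse_feed` is a hypothetical name, and it assumes the feed body has already been fetched and that permalinks end in `/publication/{id}` as described.

```python
import re
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime
from typing import Dict, List

_PUBLICATION_LINK = re.compile(r"/publication/(\d+)")


def parse_feed(xml_bytes: bytes) -> List[Dict[str, object]]:
    """Extract the fields used for discovery from an RSS 2.0 document."""
    root = ET.fromstring(xml_bytes)
    items: List[Dict[str, object]] = []
    for item in root.iterfind("./channel/item"):
        link = (item.findtext("link") or item.findtext("guid") or "").strip()
        match = _PUBLICATION_LINK.search(link)
        if match is None:
            continue  # not a /publication/{id} permalink; skip it
        publication_id = int(match.group(1))
        raw_date = (item.findtext("pubDate") or "").strip()
        items.append({
            "id": f"cbo:{publication_id}",
            "publication_id": publication_id,
            "title": (item.findtext("title") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
            "link": link,
            # RFC 2822 date string -> timezone-aware datetime.
            "published": parsedate_to_datetime(raw_date) if raw_date else None,
        })
    return items
```

Keeping the parsed `published` value timezone-aware avoids ambiguity when it is later compared with the stored run state.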

The product detail page (fetched after the bot-wall) carries the title, publish date, subtitle ("As ordered reported by..." for cost estimates; abstract for others), the summary + body, the "Download PDF" link, sometimes a supplementary spreadsheet, and a "Related publications" sidebar listing earlier-stage estimates of the same bill. The columns Josh actually stores from this page are defined in the schema DDL below — pdf_url is constructed from the /system/files/... link, supplementary_file_urls from other /system/files/ links, and related_publications from the sidebar.

For cost estimates specifically, title parsing yields a (bill_type, bill_number) pair, e.g. H.R. 3447 → (hr, 3447). Combined with the publication year, this resolves to a bills.id once the congress is known; the year gives it (Congress 119 covers 2025-2026).
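A minimal sketch of that resolution, assuming the `bill:119-hr-3447` shape shown in the schema comment for `bill_id`. The names `congress_for` and `soft_bill_id` are hypothetical, only the H.R. and S. prefixes are handled, and the year-to-congress formula ignores the first days of January in odd years, which still belong to the outgoing Congress.

```python
import re
from datetime import date
from typing import Optional

_BILL_RE = re.compile(r"^(H\.R\.|S\.)\s*(\d+)")
_TYPE_CODES = {"H.R.": "hr", "S.": "s"}


def congress_for(day: date) -> int:
    """Approximate the Congress number from a calendar date.

    The 1st Congress began in 1789 and each one spans two years,
    so 2025 and 2026 both map to 119.
    """
    return (day.year - 1789) // 2 + 1


def soft_bill_id(title: str, publish_date: date) -> Optional[str]:
    """Build a candidate bill_id from a cost-estimate title, or None."""
    match = _BILL_RE.match(title.strip())
    if match is None:
        return None  # not a recognised bill prefix: leave bill_id NULL
    bill_type = _TYPE_CODES[match.group(1)]
    bill_number = int(match.group(2))
    return f"bill:{congress_for(publish_date)}-{bill_type}-{bill_number}"
```

Because the result is only a candidate key, it fits the soft-FK approach: store it even when the matching row in bills does not exist yet.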

| Metric | Value |
| --- | --- |
| Cost estimates per year | ~600-1000 |
| Other publications per year | ~150-300 |
| Total publications all-time | ~25,000 (post-2000) |
| Per-publication PDF | ~100 KB - 5 MB; median ~300 KB for cost estimates |
| Per-publication HTML | ~30-150 KB |

Postgres footprint: ~500 MB - 1 GB after ingestion + indexing.

One RSS feed is the discovery signal:

  1. https://www.cbo.gov/publications/all/rss.xml — all publications. Poll 6-hourly; filter to cost estimates client-side off the feed entries.

The former cost-estimates-only feed at https://www.cbo.gov/cost-estimates/all/rss.xml is dead — it returns a 200 text/html Drupal antibot page with zero <item> elements, so it is no longer usable for discovery or cross-check.
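Since a dead feed can still answer 200, a poller may want a cheap sanity check before trusting a response. The sketch below is an assumption-laden illustration (`feed_is_usable` is a hypothetical name): it rejects non-XML content types and bodies with zero `<item>` elements, which are the two symptoms described above.

```python
import xml.etree.ElementTree as ET


def feed_is_usable(content_type: str, body: bytes) -> bool:
    """Return True only for an XML response containing at least one <item>."""
    # The dead feed answers with text/html, so require an XML media type.
    if "xml" not in content_type.lower():
        return False
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False
    return root.find("./channel/item") is not None
```

Running the same check on the live all-publications feed guards against it silently degrading in the same way.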

For each new RSS item, derive (publication_id, link, pubDate) and check against cbo_publications.id. Fetch new items via headed browser, extract metadata + body, then fetch the PDF via the same headed-browser path (the /system/files/ asset is DataDome-walled and 403s on direct curl).

PDFs are occasionally re-issued (errata) but rarely change publication_id. Detect a re-issue via a shift in the Last-Modified header on the PDF asset.
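A rough sketch of the Last-Modified comparison. It assumes the previous header value is persisted somewhere; the drafted schema has no column for it, so that storage and the name `pdf_was_reissued` are hypothetical.

```python
from email.utils import parsedate_to_datetime
from typing import Optional


def pdf_was_reissued(stored: Optional[str], current: Optional[str]) -> bool:
    """Compare two HTTP Last-Modified header values.

    Returns True only when both are present and the current one is later.
    Missing headers are treated as "no evidence of a re-issue".
    """
    if not stored or not current:
        return False
    return parsedate_to_datetime(current) > parsedate_to_datetime(stored)
```

Comparing parsed datetimes rather than raw strings avoids false positives from formatting differences alone.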

-- ============================================================
-- CBO publications (all types: cost estimates + reports + working papers)
-- ============================================================
CREATE TABLE cbo_publications (
    id                      text PRIMARY KEY,          -- 'cbo:62310'
    publication_id          int NOT NULL UNIQUE,       -- 62310
    publication_type        text NOT NULL CHECK (publication_type IN (
        'cost_estimate', 'report', 'working_paper', 'presentation',
        'letter', 'data_publication', 'other'
    )),
    title                   text NOT NULL,
    subtitle                text,                      -- "As ordered reported by..."
    publish_date            date NOT NULL,
    -- Body
    summary_html            text,
    body_text               text,                      -- extracted from HTML or PDF
    -- For cost estimates: link to the bill being scored
    bill_id                 text,                      -- 'bill:119-hr-3447'; soft FK
    bill_stage              text,                      -- 'ordered_reported_committee', 'reported_committee', 'passed_house', 'passed_senate', 'introduced', 'enacted'
    committee_id            text REFERENCES committees(id), -- the committee that ordered/reported (when applicable)
    committee_action_date   date,                      -- the date in the subtitle
    -- Source URLs
    publication_url         text NOT NULL,             -- cbo.gov/publication/{id}
    pdf_url                 text,
    supplementary_file_urls text[],                    -- xlsx, csv, etc.
    -- Lifecycle
    raw_html                bytea,                     -- gzipped product page
    raw_rss_description     text,                      -- RSS body
    fetched_at              timestamptz NOT NULL,
    parsed_at               timestamptz,
    inserted_at             timestamptz NOT NULL DEFAULT now(),
    updated_at              timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX cbo_publications_publish_date ON cbo_publications (publish_date DESC);
CREATE INDEX cbo_publications_type ON cbo_publications (publication_type, publish_date DESC);
CREATE INDEX cbo_publications_bill ON cbo_publications (bill_id) WHERE bill_id IS NOT NULL;

ALTER TABLE cbo_publications ADD COLUMN search_tsv tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(subtitle, '')), 'B') ||
        setweight(to_tsvector('english', coalesce(summary_html, '')), 'C') ||
        setweight(to_tsvector('english', coalesce(body_text, '')), 'D')
    ) STORED;
CREATE INDEX cbo_publications_search ON cbo_publications USING gin (search_tsv);

-- ============================================================
-- Cost-estimate-specific structured fields
-- ============================================================
CREATE TABLE cbo_cost_estimates (
    publication_id              int PRIMARY KEY REFERENCES cbo_publications(publication_id) ON DELETE CASCADE,
    -- Quantitative score (extracted at parse time when available)
    direct_spending_estimate    jsonb,   -- {fy2026: ..., fy2027: ..., 5_year_total: ..., 10_year_total: ...}
    revenue_effect              jsonb,   -- same shape
    deficit_effect              jsonb,
    discretionary_authorization jsonb,   -- "Authorization of appropriations of $..."
    -- Mandates assessment
    intergovernmental_mandate   text,    -- 'No Mandates' | 'Below Threshold' | 'Above Threshold' | 'Increase'
    private_sector_mandate      text,
    -- Pay-go relevance
    paygo_increases_deficit     boolean,
    -- Confidence flags
    estimate_complete           boolean NOT NULL DEFAULT false, -- false when CBO did not produce a complete score (e.g., "no estimate could be made")
    estimate_text_only          text     -- when scoring tables aren't extractable, the prose summary lives here
);

-- ============================================================
-- Vector chunks for similarity
-- (the vector type and hnsw index require the pgvector extension)
-- ============================================================
CREATE TABLE cbo_publication_chunks (
    id             bigserial PRIMARY KEY,
    publication_id int NOT NULL REFERENCES cbo_publications(publication_id) ON DELETE CASCADE,
    chunk_index    int NOT NULL,
    chunk_text     text NOT NULL,
    embedding      vector(1024),
    UNIQUE (publication_id, chunk_index)
);
CREATE INDEX cbo_publication_chunks_embedding
    ON cbo_publication_chunks USING hnsw (embedding vector_cosine_ops);

Schema decisions worth flagging:

  • One unified cbo_publications table for all publication types — most queries don't care if it's a cost estimate vs report. The discriminator + child table for cost-estimate-specific structured fields gives the best of both worlds.
  • cbo_cost_estimates is a child with the same primary key — the one-to-one extension table pattern. Parsing the cost-estimate scoring tables is fragile (PDFs vary); when extraction fails, estimate_complete=false and estimate_text_only carries the prose.
  • Quantitative fields as jsonb rather than typed columns. The fiscal-year keys vary (5-year vs 10-year windows; some estimates only project 3 years). JSONB is the right shape.
  • bill_id as soft FK — CBO often estimates bills before our bills loader has caught them (their cost estimate may post within hours of committee markup; BILLSTATUS XML may lag).
  • raw_html preserved — DataDome occasionally serves variant HTML; preserving raw lets us re-parse later.
  • Cross-check via bill_cbo_estimates (defined in bills) — that table has bill_id, pub_date, title, description, url. The CBO loader populates it with the canonical CBO link as confirmation.

A committee_id foreign key references committees(id) (see legislators & committees).

Chunker / eval status — known gap. Only the cbo_publication_chunks table (vector(1024), HNSW) exists; no explicit chunking strategy or embedding eval is documented for this source yet. Resolve before moving from exploring → shipped.

  1. Pull RSS feed (recent ~50 items only — RSS doesn't archive). For backfill, paginate cbo.gov/publication?page=N via headed browser. ~25K publications across ~500 pages.
  2. Per publication:
    • Fetch product page via headed browser. Extract title, subtitle, publish date, body, PDF URL, related publications.
    • Fetch PDF via the headed browser — the /system/files/ asset is DataDome-walled and 403s on direct curl even with full browser headers.
    • Extract body text from PDF.
    • For cost estimates: parse the budget table (regex + heuristic on \$[\d,]+\s*million patterns; defer rich extraction to v1.x).
    • Soft-resolve bill_id from title regex ((H\.R\.|S\.|H\.J\.Res\.|...)\s*(\d+)).
  3. Embed chunks.
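The v1 budget-table heuristic from the steps above can be sketched as a direct use of the `\$[\d,]+\s*million` pattern. `dollar_millions` is a hypothetical name, and the sketch is deliberately narrow: it returns whole-number amounts only, and figures written with decimals or in billions are not captured.

```python
import re
from typing import List

# Same pattern as in the ingest steps, with the digits captured.
_MILLIONS = re.compile(r"\$([\d,]+)\s*million")


def dollar_millions(text: str) -> List[int]:
    """Return every "$N million" amount found in the text, in order."""
    amounts: List[int] = []
    for raw in _MILLIONS.findall(text):
        digits = raw.replace(",", "")
        if digits:  # guard against a capture made only of commas
            amounts.append(int(digits))
    return amounts
```

For example, `dollar_millions("would cost $1,250 million over 2026-2035 and $15 million in 2026")` yields `[1250, 15]`. Amounts found this way carry no fiscal-year context, so they suit a completeness flag better than populating the jsonb score fields.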

Backfill: 25K publications × headed-browser fetch ≈ 10-15 hours wall-clock, which implies a sustained rate of roughly 1.4-2.2 seconds per publication.

  1. 6-hourly: poll the publications RSS. New items → filter to cost estimates → product page fetch + PDF + parse.

Source key: cbo_publications. Run state stores last seen publication_id max + last RSS <pubDate>.
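A minimal sketch of how the run state could drive one incremental pass. All names here (`FeedItem`, `RunState`, `plan_run`) are hypothetical, and the re-issue rule (a known publication_id with a pubDate later than the last one seen) is an assumption about how to combine the two stored values.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional, Set, Tuple


@dataclass
class FeedItem:
    publication_id: int
    published: datetime


@dataclass
class RunState:
    max_publication_id: int = 0
    last_pub_date: Optional[datetime] = None


def plan_run(
    items: Iterable[FeedItem], known_ids: Set[int], state: RunState
) -> Tuple[List[FeedItem], RunState]:
    """Pick items to fetch and compute the state to store afterwards."""
    items = list(items)
    to_fetch: List[FeedItem] = []
    for item in items:
        is_new = item.publication_id not in known_ids
        # A known ID with a newer pubDate is treated as a re-issue.
        is_reissue = (
            not is_new
            and state.last_pub_date is not None
            and item.published > state.last_pub_date
        )
        if is_new or is_reissue:
            to_fetch.append(item)
    to_fetch.sort(key=lambda it: it.publication_id)
    if not items:
        return to_fetch, state
    dates = [it.published for it in items]
    if state.last_pub_date is not None:
        dates.append(state.last_pub_date)
    new_state = RunState(
        max_publication_id=max(
            [state.max_publication_id] + [it.publication_id for it in items]
        ),
        last_pub_date=max(dates),
    )
    return to_fetch, new_state
```

The new state should be persisted only after the selected items have been fetched and parsed, so a failed run is retried from the old state.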

  • DataDome 403 escalation — headed browser may need session rotation. Same playbook as Senate / GAO.
  • PDF score-table extraction fails — set estimate_complete=false, store prose in estimate_text_only. The score is still searchable via body_text.
  • Bill ID regex fails — common for non-cost-estimate publications. Leave bill_id=NULL.
  • Re-issued cost estimate (corrigendum) — same publication_id, new pubDate. Update in place; preserve old raw_html.

These don't block ingestion but should be resolved before this source is "shipped":

  • Score table extraction quality. Cost estimate PDFs have well-formatted tables of dollar amounts by fiscal year, but extracting them reliably from PDF requires OCR-aware parsing or a structured-table extractor (pdfplumber's table mode, camelot, or vision models). Lean: ship v1 with prose-only score and a flag; build structured extractor in v1.x once we have a fixture set.
  • CBO's pre-2000 corpus. Some early estimates exist only as scans. Out of scope for v1.
  • Recurring data files (Baseline Projections, Budget Outlook) at cbo.gov/data are Excel workbooks updated quarterly. Different beast — separate ingestion. Defer to v1.x as its own source if useful.
  • Working papers vs reports. The line is fuzzy; both contain analysis. Use publication_type as a hint but don't promise the agent it's a strict taxonomy.
  • Bill stage extraction reliability. "As ordered reported by..." vs "As reported by..." vs "As passed by..." — we want a clean bill_stage enum. The subtitle text is short and structured but committee names vary. Build a small classifier; track parser accuracy.
  • Linking to public laws when CBO publishes "as enacted" estimates. Less common but exists. Schema supports via bill_id; the bill_stage='enacted' flag captures the case.
  • Multi-bill estimates — some CBO publications cover multiple bills (e.g., omnibus packages). Title carries multiple bill IDs separated by commas. Schema allows only one bill_id; consider a join table cbo_publication_bills for the rare multi-bill case.