CBO cost estimates
Cost estimates and budgetary analyses from the Congressional Budget Office, the legislative branch's nonpartisan budget shop. Every bill ordered reported by a House or Senate committee gets a CBO score (when feasible) — the budget impact over a 5-year and 10-year window. CBO also publishes broader analyses: the federal budget outlook, agency-program reviews, working papers, and presentations.
For Josh, CBO cost estimates are critical for the vote-impact and
budget-summary workflows: "what's the 10-year cost of the SPEED Act," "what's
CBO's score on H.R. 4920," "what's the budget outlook for FY2027." The data is
structured enough to ingest reliably; the published score is a foreign-key target
on bills (we already drafted bill_cbo_estimates in bills).
This source is a thin doc — CBO publishes ~600-1000 cost estimates per year and a few dozen broader publications. Volume and complexity are modest.
CBO publishes no structured API or developer reference — there is no JSON API, and nothing upstream documents the page layouts, feeds, or the bot-walls. The content below is reverse-engineered from probing the live site; see CBO's products / RSS overview for the only upstream-facing pointer.
Quick reference
| Field | Value |
|---|---|
| Source name | CBO cost estimates |
| Publisher | Congressional Budget Office |
| License | Public domain |
| Coverage | 1976 – present (CBO's founding). Online corpus reliable post-2000. |
| Volume | ~600-1000 cost estimates per year + ~150-300 broader publications. ~25,000+ all-time. |
| Storage estimate | ~1-2 GB raw PDF + HTML; ~500 MB extracted text |
| Site | https://www.cbo.gov/ |
| Primary feed | https://www.cbo.gov/publications/all/rss.xml (all CBO publications, including cost estimates) — the only open feed (see Access notes) |
| Auth | None |
| Rate limits | None observed; HTML pages and /system/files/ asset PDFs bot-walled by DataDome (real Chrome required; PDFs 403 on direct curl) |
| Incremental sync hints | RSS <pubDate> + per-publication pubDate |
| Stable ID format | cbo:{publication_id} e.g. cbo:62310 |
| Status | exploring — schema drafted, ingestion not built |
Source priority decision
Primary: CBO publications RSS feed (/publications/all/rss.xml) for discovery
with client-side filtering to cost estimates, headed-browser fetch for both
product detail pages and /system/files/ asset PDFs (both DataDome-walled).
Secondary: Congress.gov bill API. BILLSTATUS XML's <cboCostEstimates>
element references CBO publications by URL — confirms the bill ↔ CBO linkage when
CRS or committee staff has tagged it. Use as enrichment. (The upstream BILLSTATUS
schema is documented at usgpo/bill-status.)
Skip: scraping search results from cbo.gov directly. Search is heavily JS-driven and protected; the RSS gives us linear coverage of new publications.
Access notes
HTML pages and asset PDFs are both bot-walled by DataDome (x-datadome: protected
header on 403 responses). Only the /publications/all/rss.xml feed is open to
direct curl with a real-browser User-Agent. Both the product detail pages and the
/system/files/ asset PDFs need the headed-browser fetcher (same one as
Senate / GAO) — direct curl of a PDF now returns 403 even
with full browser headers.
The site uses an absolute URL pattern for assets:
cbo.gov/system/files/{YYYY-MM}/{slug}.pdf. Slugs are the publication's
node_alias derived from title at publication time. The slug is not always
derivable from the title — extract it from the product page rather than
constructing it.
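A minimal sketch of reading asset links out of a fetched product page instead of building them from the title. It assumes the download links are plain `<a href>` anchors pointing into /system/files/, which has not been verified against the live markup; `AssetLinkParser`, `extract_asset_urls`, and the sample slugs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetLinkParser(HTMLParser):
    """Collects every anchor href that points into /system/files/."""

    def __init__(self):
        super().__init__()
        self.asset_hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        if "/system/files/" in href:
            self.asset_hrefs.append(href)


def extract_asset_urls(page_html, base_url="https://www.cbo.gov/"):
    """Return (pdf_url, supplementary_urls) found in a product page."""
    parser = AssetLinkParser()
    parser.feed(page_html)
    urls = [urljoin(base_url, href) for href in parser.asset_hrefs]
    # Ignore any query string when checking the file extension.
    pdfs = [u for u in urls if u.lower().split("?")[0].endswith(".pdf")]
    others = [u for u in urls if u not in pdfs]
    return (pdfs[0] if pdfs else None), others


# Invented markup for illustration only.
sample = """
<a href="/system/files/2025-01/example-slug.pdf">Download PDF</a>
<a href="/system/files/2025-01/example-data.xlsx">Data</a>
"""
pdf_url, extras = extract_asset_urls(sample)
print(pdf_url)
print(extras)
```

Taking the first PDF link as pdf_url is a simplification; a page with several PDFs would need a rule for picking the primary one.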
There is no JSON API. Everything is HTML + RSS + PDF. See CBO's
products overview for the publication types
themselves; the URL shapes (/publication/{id}, /system/files/{YYYY-MM}/{slug}.pdf,
the /data Data Library) carry no Josh-specific finding beyond what's noted here.
The dead cost-estimates feed
The former cost-estimates-only RSS feed at
https://www.cbo.gov/cost-estimates/all/rss.xml is dead — it returns a 200
text/html Drupal antibot page (unrendered Twig title, zero <item> elements).
Do not use it for discovery or cross-check. Filter cost estimates client-side off
the all-publications feed instead.
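Because the dead feed still answers with a 200, a status check alone will not catch it. The sketch below is one possible guard, assuming the loader has the response's Content-Type and body in hand; `is_usable_feed` is a hypothetical helper, not part of any existing loader.

```python
import xml.etree.ElementTree as ET


def is_usable_feed(content_type, body):
    """True only for an XML response that parses and has at least one <item>."""
    if "xml" not in content_type.lower():
        return False  # e.g. the text/html antibot page
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False
    return root.find("./channel/item") is not None


# Invented bodies that mimic the two cases described above.
antibot_page = "<html><head><title>{{ title }}</title></head><body></body></html>"
live_feed = "<rss version='2.0'><channel><item><title>x</title></item></channel></rss>"

print(is_usable_feed("text/html; charset=UTF-8", antibot_page))
print(is_usable_feed("application/rss+xml", live_feed))
```

The same check is worth running on the all-publications feed on every poll, so a future antibot change surfaces as an explicit failure rather than as an empty run.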
The /data Data Library hosts recurring spreadsheets (Baseline Projections,
Long-Term Budget Outlook) — a separate dataset, useful for v1.x (see
Open questions).
Vocabulary / enums
Publication types (inferred from title prefix and RSS subtitle)
| Type | Title pattern | Description |
|---|---|---|
| cost_estimate | "H.R. NNNN, Title" or "S. NNNN, Title" prefix | Per-bill cost estimate (the dominant volume) |
| report | various | Broader analytical reports (DoD budget, Long-Term Outlook, etc.) |
| working_paper | "Working Paper YYYY-NN" suffix or /publication/...working-paper-... slug | Research papers from CBO economists |
| presentation | "Presentation about ..." or specific slug pattern | Conference/seminar slides |
| letter | "Letter to ..." or "Information Letter" | Member-correspondence responses |
| data_publication | Spreadsheets / interactive tools | Underlying data files |
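The title patterns in the table can be turned into a first-pass classifier for the client-side filter. This is a sketch under assumptions: it handles only the bill prefixes named in this document (H.R., S., H.J.Res.), it returns 'other' whenever no title pattern matches (report and data_publication need signals beyond the title), and `classify_publication` and the sample titles other than the H.R. 3447 one are made up.

```python
import re

# Bill-number prefix at the start of the title marks a cost estimate.
COST_ESTIMATE_TITLE = re.compile(r"^(?:H\.R\.|S\.|H\.J\.Res\.)\s*\d+\b")
# "Working Paper YYYY-NN" at the end of the title.
WORKING_PAPER_SUFFIX = re.compile(r"Working Paper \d{4}-\d{2}$")


def classify_publication(title):
    """Map an RSS/product title to a publication_type value."""
    t = title.strip()
    if COST_ESTIMATE_TITLE.match(t):
        return "cost_estimate"
    if WORKING_PAPER_SUFFIX.search(t):
        return "working_paper"
    if t.startswith("Presentation about"):
        return "presentation"
    if t.startswith("Letter to") or "Information Letter" in t:
        return "letter"
    return "other"  # not decidable from the title alone


for title in [
    "H.R. 3447, Chip Security Act",
    "An Example Study: Working Paper 2025-03",
    "Letter to the Honorable Example Member",
    "An Example Budget Report",
]:
    print(title, "->", classify_publication(title))
```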
Bill stage (from cost estimate subtitle)
CBO cost estimates state the bill stage in the subtitle. Observed values:
| Subtitle pattern | Meaning |
|---|---|
| As ordered reported by the [Committee] on [Date] | After committee markup vote |
| As reported by the [Committee] on [Date] | After committee report filed |
| As passed by the House on [Date] | After House passage (Senate-side estimate) |
| As passed by the Senate on [Date] | After Senate passage |
| As introduced | At introduction (less common; CBO usually waits for committee action) |
| As enacted | After enactment (rare; usually rolled into broader analyses) |
A single bill can have multiple CBO estimates — one per stage. Schema must allow many estimates per bill.
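One way to map these subtitles onto the bill_stage values listed in the schema comment is a prefix match, sketched below. The function name `classify_bill_stage` is hypothetical, and the committee names and dates in the sample subtitles are invented; only the leading phrases come from the table above.

```python
import re

# (pattern, bill_stage) pairs; each pattern is anchored at the subtitle start.
STAGE_PATTERNS = [
    (re.compile(r"^As ordered reported by\b", re.I), "ordered_reported_committee"),
    (re.compile(r"^As reported by\b", re.I), "reported_committee"),
    (re.compile(r"^As passed by the House\b", re.I), "passed_house"),
    (re.compile(r"^As passed by the Senate\b", re.I), "passed_senate"),
    (re.compile(r"^As introduced\b", re.I), "introduced"),
    (re.compile(r"^As enacted\b", re.I), "enacted"),
]


def classify_bill_stage(subtitle):
    """Return the bill_stage for a cost-estimate subtitle, or None if unknown."""
    text = " ".join(subtitle.split())  # collapse stray whitespace and newlines
    for pattern, stage in STAGE_PATTERNS:
        if pattern.search(text):
            return stage
    return None


print(classify_bill_stage(
    "As ordered reported by the House Committee on Examples on June 25, 2025"))
print(classify_bill_stage("As passed by the Senate on May 1, 2025"))
print(classify_bill_stage("A longer summary paragraph for a report"))
```

Returning None for anything unrecognized keeps unknown subtitles visible for review instead of forcing them into a stage.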
Stable ID format
cbo:{publication_id}, where publication_id is the integer node ID at the
end of cbo.gov/publication/{N}.
Examples:
- cbo:62310 (HR 3447 Chip Security Act)
- cbo:62301 (DoD 2026 Budget Request analysis)
The integer is monotonically increasing; new publications get higher IDs. There's no per-bill collision because each estimate is its own publication.
Response shapes
The discovery feed is a standard RSS 2.0 + Dublin Core stream (see CBO's RSS overview). The Josh-relevant mapping per item:
- <title> is the bill ID + short title for cost estimates, or the full title for non-cost-estimate publications.
- <description> is the bill stage / subtitle (one line) for cost estimates, or a longer summary paragraph for other publications.
- <pubDate> is RFC 2822 with East Coast TZ.
- <link> / <guid> is the cbo.gov/publication/{id} permalink.
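The per-item mapping above could be applied roughly as follows. This is a sketch using only the standard library; the feed snippet is invented (publication 12345 and its title, subtitle, and date are placeholders), and the output keys simply mirror column names from the schema.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Invented feed content for illustration; not real CBO data.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item>
  <title>H.R. 1, Example Act</title>
  <description>As ordered reported by the Example Committee on January 2, 2025</description>
  <pubDate>Mon, 06 Jan 2025 10:00:00 -0500</pubDate>
  <link>https://www.cbo.gov/publication/12345</link>
  <guid>https://www.cbo.gov/publication/12345</guid>
</item>
</channel></rss>"""


def parse_feed_items(feed_xml):
    """Turn RSS <item> elements into dicts keyed like cbo_publications columns."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iterfind("./channel/item"):
        link = (item.findtext("link") or "").strip()
        tail = link.rstrip("/").rsplit("/", 1)[-1]
        if not tail.isdigit():
            continue  # not a /publication/{id} permalink
        items.append({
            "id": f"cbo:{tail}",
            "publication_id": int(tail),
            "title": (item.findtext("title") or "").strip(),
            "raw_rss_description": (item.findtext("description") or "").strip(),
            # RFC 2822 date; the result keeps the feed's UTC offset.
            "published_at": parsedate_to_datetime(item.findtext("pubDate")),
            "publication_url": link,
        })
    return items


items = parse_feed_items(SAMPLE_FEED)
print(items[0]["id"], items[0]["published_at"].isoformat())
```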
The product detail page (fetched after the bot-wall) carries the title, publish
date, subtitle ("As ordered reported by..." for cost estimates; abstract for
others), the summary + body, the "Download PDF" link, sometimes a supplementary
spreadsheet, and a "Related publications" sidebar listing earlier-stage estimates
of the same bill. The columns Josh actually stores from this page are defined in
the schema DDL below: pdf_url is read from the
/system/files/... link on the page, supplementary_file_urls from other /system/files/
links, and related_publications from the sidebar (the DDL below does not yet define a related_publications column).
For cost estimates specifically, the title-parsing yields a (bill_type,
bill_number) pair — e.g. H.R. 3447 → (hr, 3447). Combined with the year, we
can resolve to a bills.id if we know the congress (the year tells us — Congress
119 covers 2025-2026).
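A sketch of that resolution, producing the 'bill:119-hr-3447' shape shown in the schema comment. Several things here are assumptions: the 's' type code for Senate bills, the `soft_bill_id` and `congress_for` names, the publish date used in the example, and the shortcut of deriving the Congress from the calendar year alone (a Congress convenes on January 3, so the first days of an odd year are an edge case this ignores). Only the H.R. and S. prefixes are handled.

```python
import re
from datetime import date

BILL_TITLE = re.compile(r"^(H\.R\.|S\.)\s*(\d+)\b")
BILL_TYPE_CODES = {"H.R.": "hr", "S.": "s"}  # 's' is an assumed code


def congress_for(d):
    """Congress number for a date; the 1st Congress began in 1789."""
    return (d.year - 1789) // 2 + 1


def soft_bill_id(title, publish_date):
    """Best-effort bill_id from a cost-estimate title, or None."""
    m = BILL_TITLE.match(title.strip())
    if m is None:
        return None
    bill_type = BILL_TYPE_CODES[m.group(1)]
    return f"bill:{congress_for(publish_date)}-{bill_type}-{int(m.group(2))}"


# The publish date below is illustrative, not the estimate's real date.
print(soft_bill_id("H.R. 3447, Chip Security Act", date(2025, 6, 1)))
print(soft_bill_id("An Example Budget Report", date(2025, 6, 1)))
```

Since the result is a soft foreign key, a value that does not yet exist in bills is acceptable and can be reconciled once the bills loader catches up.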
Volume
| Metric | Value |
|---|---|
| Cost estimates per year | ~600-1000 |
| Other publications per year | ~150-300 |
| Total publications all-time | ~25,000 (post-2000) |
| Per-publication PDF | ~100 KB - 5 MB; median ~300 KB for cost estimates |
| Per-publication HTML | ~30-150 KB |
Postgres footprint: ~500 MB - 1 GB after ingestion + indexing.
Caching / incremental sync
One RSS feed is the discovery signal:
- https://www.cbo.gov/publications/all/rss.xml: all publications. Poll 6-hourly; filter to cost estimates client-side off the feed entries.
The former cost-estimates-only feed at
https://www.cbo.gov/cost-estimates/all/rss.xml is dead — it returns a 200
text/html Drupal antibot page with zero <item> elements, so it is no longer
usable for discovery or cross-check.
For each new RSS item, derive (publication_id, link, pubDate) and check against
cbo_publications.id. Fetch new items via headed browser, extract metadata +
body, then fetch the PDF via the same headed-browser path (the /system/files/
asset is DataDome-walled and 403s on direct curl).
PDFs occasionally re-issue (errata) but rarely change publication_id. Detect via
the Last-Modified shift on the PDF asset.
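The two checks described above (which feed items are new, and whether a PDF was re-issued) might look like this. It is a sketch: `new_publication_ids` and `pdf_needs_refetch` are hypothetical helpers, the stored Last-Modified value is assumed to be kept as a timezone-aware datetime, and the header is assumed to be available from whatever the headed-browser fetcher returns.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def new_publication_ids(feed_ids, known_ids):
    """Publication IDs present in the feed but not yet stored, lowest first."""
    return sorted(set(feed_ids) - set(known_ids))


def pdf_needs_refetch(stored_last_modified, last_modified_header):
    """True when the asset's Last-Modified is newer than the stored value."""
    if last_modified_header is None:
        return False  # no signal from the server; keep what we have
    current = parsedate_to_datetime(last_modified_header)
    return stored_last_modified is None or current > stored_last_modified


# Made-up IDs and timestamps for illustration.
stored = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(new_publication_ids([101, 102, 103], [101]))
print(pdf_needs_refetch(stored, "Thu, 02 Jan 2025 00:00:00 GMT"))
print(pdf_needs_refetch(stored, "Wed, 01 Jan 2025 00:00:00 GMT"))
```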
Schema (Postgres DDL)
```sql
-- ============================================================
-- CBO publications (all types: cost estimates + reports + working papers)
-- ============================================================

CREATE TABLE cbo_publications (
  id                      text PRIMARY KEY,          -- 'cbo:62310'
  publication_id          int NOT NULL UNIQUE,       -- 62310

  publication_type        text NOT NULL CHECK (publication_type IN (
    'cost_estimate', 'report', 'working_paper', 'presentation',
    'letter', 'data_publication', 'other'
  )),

  title                   text NOT NULL,
  subtitle                text,                      -- "As ordered reported by..."
  publish_date            date NOT NULL,

  -- Body
  summary_html            text,
  body_text               text,                      -- extracted from HTML or PDF

  -- For cost estimates: link to the bill being scored
  bill_id                 text,                      -- 'bill:119-hr-3447'; soft FK
  bill_stage              text,                      -- 'ordered_reported_committee', 'reported_committee', 'passed_house', 'passed_senate', 'introduced', 'enacted'
  committee_id            text REFERENCES committees(id),  -- the committee that ordered/reported (when applicable)
  committee_action_date   date,                      -- the date in the subtitle

  -- Source URLs
  publication_url         text NOT NULL,             -- cbo.gov/publication/{id}
  pdf_url                 text,
  supplementary_file_urls text[],                    -- xlsx, csv, etc.

  -- Lifecycle
  raw_html                bytea,                     -- gzipped product page
  raw_rss_description     text,                      -- RSS body
  fetched_at              timestamptz NOT NULL,
  parsed_at               timestamptz,
  inserted_at             timestamptz NOT NULL DEFAULT now(),
  updated_at              timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX cbo_publications_publish_date ON cbo_publications (publish_date DESC);
CREATE INDEX cbo_publications_type ON cbo_publications (publication_type, publish_date DESC);
CREATE INDEX cbo_publications_bill ON cbo_publications (bill_id) WHERE bill_id IS NOT NULL;

ALTER TABLE cbo_publications ADD COLUMN search_tsv tsvector
  GENERATED ALWAYS AS (
    setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(subtitle, '')), 'B') ||
    setweight(to_tsvector('english', coalesce(summary_html, '')), 'C') ||
    setweight(to_tsvector('english', coalesce(body_text, '')), 'D')
  ) STORED;
CREATE INDEX cbo_publications_search ON cbo_publications USING gin (search_tsv);

-- ============================================================
-- Cost-estimate-specific structured fields
-- ============================================================

CREATE TABLE cbo_cost_estimates (
  publication_id              int PRIMARY KEY
                              REFERENCES cbo_publications(publication_id) ON DELETE CASCADE,

  -- Quantitative score (extracted at parse time when available)
  direct_spending_estimate    jsonb,   -- {fy2026: ..., fy2027: ..., 5_year_total: ..., 10_year_total: ...}
  revenue_effect              jsonb,   -- same shape
  deficit_effect              jsonb,
  discretionary_authorization jsonb,   -- "Authorization of appropriations of $..."

  -- Mandates assessment
  intergovernmental_mandate   text,    -- 'No Mandates' | 'Below Threshold' | 'Above Threshold' | 'Increase'
  private_sector_mandate      text,

  -- Pay-go relevance
  paygo_increases_deficit     boolean,

  -- Confidence flags
  estimate_complete           boolean NOT NULL DEFAULT false,  -- false when CBO did not produce a complete score (e.g., "no estimate could be made")
  estimate_text_only          text     -- when scoring tables aren't extractable, the prose summary lives here
);

-- ============================================================
-- Vector chunks for similarity
-- ============================================================

CREATE TABLE cbo_publication_chunks (
  id              bigserial PRIMARY KEY,
  publication_id  int NOT NULL REFERENCES cbo_publications(publication_id) ON DELETE CASCADE,
  chunk_index     int NOT NULL,
  chunk_text      text NOT NULL,
  embedding       vector(1024),
  UNIQUE (publication_id, chunk_index)
);
CREATE INDEX cbo_publication_chunks_embedding ON cbo_publication_chunks
  USING hnsw (embedding vector_cosine_ops);
```

Schema decisions worth flagging:
- One unified cbo_publications table for all publication types: most queries don't care if it's a cost estimate vs report. The discriminator + child table for cost-estimate-specific structured fields gives the best of both worlds.
- cbo_cost_estimates is a child with the same primary key (the one-to-one extension table pattern). Parsing the cost-estimate scoring tables is fragile (PDFs vary); when extraction fails, estimate_complete=false and estimate_text_only carries the prose.
- Quantitative fields as jsonb rather than typed columns. The fiscal-year keys vary (5-year vs 10-year windows; some estimates only project 3 years). JSONB is the right shape.
- bill_id as soft FK: CBO often estimates bills before our bills loader has caught them (their cost estimate may post within hours of committee markup; BILLSTATUS XML may lag).
- raw_html preserved: DataDome occasionally serves variant HTML; preserving raw lets us re-parse later.
- Cross-check via bill_cbo_estimates (defined in bills): that table has bill_id, pub_date, title, description, url. The CBO loader populates it with the canonical CBO link as confirmation.
A committee_id foreign key references committees(id) (see
legislators & committees).
Chunker / eval status: known gap. Only the
cbo_publication_chunks table (vector(1024), HNSW) exists; no explicit chunking strategy or embedding eval is documented for this source yet. Resolve before moving from exploring → shipped.
Download / update strategy
Backfill
- Pull RSS feed (recent ~50 items only; RSS doesn't archive). For backfill, paginate cbo.gov/publication?page=N via headed browser. ~25K publications across ~500 pages.
- Per publication:
  - Fetch product page via headed browser. Extract title, subtitle, publish date, body, PDF URL, related publications.
  - Fetch PDF via the headed browser: the /system/files/ asset is DataDome-walled and 403s on direct curl even with full browser headers.
  - Extract body text from PDF.
  - For cost estimates: parse the budget table (regex + heuristic on \$[\d,]+\s*million patterns; defer rich extraction to v1.x).
  - Soft-resolve bill_id from title regex ((H\.R\.|S\.|H\.J\.Res\.|...)\s*(\d+)).
- Embed chunks.
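The dollar-amount heuristic from the steps above can be sketched as a single regex pass over the extracted body text. The pattern is the one quoted in the list, with a capture group added and a leading digit required so a stray "$," cannot match; `million_dollar_mentions` is a hypothetical name and the sample sentence is invented, not taken from a real estimate.

```python
import re

# Based on \$[\d,]+\s*million, tightened to start with a digit.
MILLION_AMOUNT = re.compile(r"\$(\d[\d,]*)\s*million")


def million_dollar_mentions(text):
    """Return every '$N million' amount in the text as an int (in millions)."""
    return [int(m.group(1).replace(",", "")) for m in MILLION_AMOUNT.finditer(text)]


# Invented prose for illustration only.
sample = (
    "Enacting the bill would increase direct spending by $1,250 million "
    "over ten years, including $12 million in the first year."
)
print(million_dollar_mentions(sample))
```

This only collects candidate amounts from prose; it does not say which fiscal year or score category each one belongs to, which is why the structured fields stay optional in v1.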
Backfill: 25K publications × headed-browser fetch = ~10-15 hours wall-clock, which works out to roughly 1.4-2.2 s per publication.
Daily incremental
- 6-hourly: poll the publications RSS. New items → filter to cost estimates → product page fetch + PDF + parse.
State tracking
Source key: cbo_publications. Run state stores last seen publication_id max +
last RSS <pubDate>.
Failure modes
- DataDome 403 escalation: headed browser may need session rotation. Same playbook as Senate / GAO.
- PDF score-table extraction fails: set estimate_complete=false, store prose in estimate_text_only. The score is still searchable via body_text.
- Bill ID regex fails: common for non-cost-estimate publications. Leave bill_id=NULL.
- Re-issued cost estimate (corrigendum): same publication_id, new pubDate. Update in place; preserve old raw_html.
Open questions
These don't block ingestion but should be resolved before this source is "shipped":
- Score table extraction quality. Cost estimate PDFs have well-formatted tables of dollar amounts by fiscal year, but extracting them reliably from PDF requires OCR-aware parsing or a structured-table extractor (pdfplumber's table mode, camelot, or vision models). Lean: ship v1 with prose-only score and a flag; build structured extractor in v1.x once we have a fixture set.
- CBO's pre-2000 corpus. Some early estimates exist only as scans. Out of scope for v1.
- Recurring data files (Baseline Projections, Budget Outlook) at cbo.gov/data are Excel workbooks updated quarterly. Different beast, so separate ingestion. Defer to v1.x as its own source if useful.
- Working papers vs reports. The line is fuzzy; both contain analysis. Use publication_type as a hint but don't promise the agent it's a strict taxonomy.
- Bill stage extraction reliability. "As ordered reported by..." vs "As reported by..." vs "As passed by...": we want a clean bill_stage enum. The subtitle text is short and structured but committee names vary. Build a small classifier; track parser accuracy.
- Linking to public laws when CBO publishes "as enacted" estimates. Less common but exists. Schema supports via bill_id; the bill_stage='enacted' flag captures the case.
- Multi-bill estimates. Some CBO publications cover multiple bills (e.g., omnibus packages). Title carries multiple bill IDs separated by commas. Schema allows only one bill_id; consider a join table cbo_publication_bills for the rare multi-bill case.