CBO cost estimates

Cost estimates and budgetary analyses from the Congressional Budget Office, the legislative branch's nonpartisan budget shop. Every bill ordered reported by a House or Senate committee gets a CBO score (when feasible) — the budget impact over a 5-year and 10-year window. CBO also publishes broader analyses: the federal budget outlook, agency-program reviews, working papers, and presentations.

For Josh, CBO cost estimates are critical for the vote-impact and budget-summary workflows: "what's the 10-year cost of the SPEED Act," "what's CBO's score on H.R. 4920," "what's the budget outlook for FY2027." The data is structured enough to ingest reliably; the published score is a foreign-key target on bills (we already drafted bill_cbo_estimates in bills).

This is a thin doc: CBO publishes ~600-1000 cost estimates per year and ~150-300 broader publications, so volume and complexity are modest.

CBO publishes no structured API or developer reference — there is no JSON API, and nothing upstream documents the page layouts, feeds, or the bot-walls. The content below is reverse-engineered from probing the live site; see CBO's products / RSS overview for the only upstream-facing pointer.

Quick reference


Source name	CBO cost estimates
Publisher	Congressional Budget Office
License	Public domain
Coverage	1976 – present (CBO's founding). Online corpus reliable post-2000.
Volume	~600-1000 cost estimates per year + ~150-300 broader publications. ~25,000+ all-time.
Storage estimate	~1-2 GB raw PDF + HTML; ~500 MB extracted text
Site	https://www.cbo.gov/
Primary feed	`https://www.cbo.gov/publications/all/rss.xml` (all CBO publications, including cost estimates) — the only open feed (see Access notes)
Auth	None
Rate limits	None observed; HTML pages and `/system/files/` asset PDFs bot-walled by DataDome (real Chrome required; PDFs 403 on direct curl)
Incremental sync hints	RSS `<pubDate>` + per-publication `pubDate`
Stable ID format	`cbo:{publication_id}` e.g. `cbo:62310`
Status	exploring — schema drafted, ingestion not built

Source priority decision

Primary: CBO publications RSS feed (/publications/all/rss.xml) for discovery, with client-side filtering to cost estimates. Product detail pages and /system/files/ asset PDFs are both DataDome-walled, so both are fetched via headed browser.

Secondary: Congress.gov bill API. BILLSTATUS XML's <cboCostEstimates> element references CBO publications by URL — confirms the bill ↔ CBO linkage when CRS or committee staff has tagged it. Use as enrichment. (The upstream BILLSTATUS schema is documented at usgpo/bill-status.)

Skip: scraping search results from cbo.gov directly. Search is heavily JS-driven and protected; the RSS gives us linear coverage of new publications.

Access notes

HTML pages and asset PDFs are both bot-walled by DataDome (403 responses carry an x-datadome: protected header). Only the /publications/all/rss.xml feed is open to direct curl with a real-browser User-Agent. Both the product detail pages and the /system/files/ asset PDFs need the headed-browser fetcher (same one as Senate / GAO); direct curl of a PDF now returns 403 even with full browser headers.

The site uses an absolute URL pattern for assets: cbo.gov/system/files/{YYYY-MM}/{slug}.pdf. The slug is the publication's node_alias, generated from the title at publication time. It cannot be reliably reconstructed from the title afterwards, so extract the asset URL from the product page rather than constructing it.

There is no JSON API. Everything is HTML + RSS + PDF. See CBO's products overview for the publication types themselves; the URL shapes (/publication/{id}, /system/files/{YYYY-MM}/{slug}.pdf, the /data Data Library) carry no Josh-specific finding beyond what's noted here.

The dead cost-estimates feed

The former cost-estimates-only RSS feed at https://www.cbo.gov/cost-estimates/all/rss.xml is dead — it returns a 200 text/html Drupal antibot page (unrendered Twig title, zero <item> elements). Do not use it for discovery or cross-check. Filter cost estimates client-side off the all-publications feed instead.
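Because the dead feed still answers 200, a status check alone will not catch a feed going dark. A minimal sketch of a liveness guard, assuming the fetcher hands over the Content-Type header and the raw body; the function name and the "xml in the content type" rule are illustrative assumptions, not observed CBO behavior.

```python
import xml.etree.ElementTree as ET


def feed_looks_alive(content_type: str, body: bytes) -> bool:
    """Return True only if the response parses as RSS with at least one <item>."""
    # The dead feed is served as text/html, so reject non-XML content types early.
    if "xml" not in content_type.lower():
        return False
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False
    # A healthy feed has an <rss> root and one or more channel items.
    return root.tag == "rss" and root.find("./channel/item") is not None
```

Running the same guard against the all-publications feed on every poll would flag it if it ever degrades the same way.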

The /data Data Library hosts recurring spreadsheets (Baseline Projections, Long-Term Budget Outlook) — a separate dataset, useful for v1.x (see Open questions).

Vocabulary / enums

Publication types (inferred from title prefix and RSS subtitle)

Type	Title pattern	Description
`cost_estimate`	"H.R. NNNN, Title" or "S. NNNN, Title" prefix	Per-bill cost estimate (the dominant volume)
`report`	various	Broader analytical reports (DoD budget, Long-Term Outlook, etc.)
`working_paper`	"Working Paper YYYY-NN" suffix or `/publication/...working-paper-...` slug	Research papers from CBO economists
`presentation`	"Presentation about ..." or specific slug pattern	Conference/seminar slides
`letter`	"Letter to ..." or "Information Letter"	Member-correspondence responses
`data_publication`	Spreadsheets / interactive tools	Underlying data files
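
The title patterns in the table can be sketched as an ordered rule list. This is a rough illustration only: the exact regexes are assumptions derived from the table, `report` and `data_publication` have no reliable title signal and fall through to `other`, and the function name is hypothetical.

```python
import re

# Ordered (type, pattern) rules taken from the table above; first match wins.
_TYPE_RULES = [
    ("cost_estimate", re.compile(r"^(H\.R\.|S\.)\s*\d+,\s")),
    ("working_paper", re.compile(r"Working Paper \d{4}-\d{2}\s*$")),
    ("presentation", re.compile(r"^Presentation about ")),
    ("letter", re.compile(r"^(Letter to |Information Letter)")),
]


def classify_publication_type(title: str, slug: str = "") -> str:
    """Guess publication_type from the title, falling back to the URL slug."""
    title = title.strip()
    for pub_type, pattern in _TYPE_RULES:
        if pattern.search(title):
            return pub_type
    # Working papers are sometimes only identifiable from the slug.
    if "working-paper" in slug:
        return "working_paper"
    return "other"
```

For example, `classify_publication_type("H.R. 3447, Chip Security Act")` yields `cost_estimate`, while an untagged analytical report lands in `other` until a later pass reclassifies it.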

Bill stage (from cost estimate subtitle)

CBO cost estimates state the bill stage in the subtitle. Observed values:

Subtitle pattern	Meaning
`As ordered reported by the [Committee] on [Date]`	After committee markup vote
`As reported by the [Committee] on [Date]`	After committee report filed
`As passed by the House on [Date]`	After House passage (Senate-side estimate)
`As passed by the Senate on [Date]`	After Senate passage
`As introduced`	At introduction (less common; CBO usually waits for committee action)
`As enacted`	After enactment (rare; usually rolled into broader analyses)

A single bill can have multiple CBO estimates — one per stage. Schema must allow many estimates per bill.
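
A minimal sketch of mapping the subtitle patterns above onto the bill_stage values used in the schema. It assumes "[Date]" renders as "Month D, YYYY", which has not been verified across the corpus; the `BillStage` tuple and function name are hypothetical.

```python
import re
from datetime import date, datetime
from typing import NamedTuple, Optional

# Assumption: "[Date]" renders as "Month D, YYYY" (e.g. "June 25, 2025").
_DATE = r"(?P<date>[A-Z][a-z]+ \d{1,2}, \d{4})"

_STAGES = [
    ("ordered_reported_committee", rf"As ordered reported by the (?P<committee>.+) on {_DATE}"),
    ("reported_committee", rf"As reported by the (?P<committee>.+) on {_DATE}"),
    ("passed_house", rf"As passed by the House on {_DATE}"),
    ("passed_senate", rf"As passed by the Senate on {_DATE}"),
    ("introduced", r"As introduced"),
    ("enacted", r"As enacted"),
]


class BillStage(NamedTuple):
    stage: str
    committee: Optional[str]
    action_date: Optional[date]


def parse_bill_stage(subtitle: str) -> Optional[BillStage]:
    """Return the stage, committee name, and action date, or None if unrecognized."""
    text = " ".join(subtitle.split())  # collapse stray whitespace and newlines
    for stage, pattern in _STAGES:
        m = re.match(pattern, text)
        if m is None:
            continue
        groups = m.groupdict()
        raw_date = groups.get("date")
        action_date = datetime.strptime(raw_date, "%B %d, %Y").date() if raw_date else None
        return BillStage(stage, groups.get("committee"), action_date)
    return None
```

The greedy committee group is deliberate: committee names often contain " on " themselves, so the match backtracks to the last " on <date>" in the line.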

Stable ID format

cbo:{publication_id}, where publication_id is the integer node ID at the end of cbo.gov/publication/{N}.

Examples:

cbo:62310 (HR 3447 Chip Security Act)
cbo:62301 (DoD 2026 Budget Request analysis)

The integer is monotonically increasing; new publications get higher IDs. There's no per-bill collision because each estimate is its own publication.
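
Deriving the stable ID from a permalink is a one-liner, sketched here for concreteness. The helper name is hypothetical, and the regex assumes the permalink has no trailing path beyond the integer.

```python
import re

_PUBLICATION_URL = re.compile(r"^https?://(?:www\.)?cbo\.gov/publication/(\d+)/?$")


def stable_id_from_url(url: str) -> str:
    """Turn a cbo.gov/publication/{N} permalink into 'cbo:{N}'."""
    m = _PUBLICATION_URL.match(url.strip())
    if m is None:
        raise ValueError(f"not a CBO publication permalink: {url!r}")
    return f"cbo:{int(m.group(1))}"
```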

Response shapes

The discovery feed is a standard RSS 2.0 + Dublin Core stream (see CBO's RSS overview). The Josh-relevant mapping per item:

<title> is the bill ID + short title for cost estimates, or the full title for non-cost-estimate publications.
<description> is the bill stage / subtitle (one line) for cost estimates, or a longer summary paragraph for other publications.
<pubDate> is an RFC 2822 timestamp in US Eastern time.
<link> / <guid> is the cbo.gov/publication/{id} permalink.
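
The per-item mapping above can be sketched with the standard library alone. The sample item in the usage note is made up for illustration, not a captured feed entry, and the dict keys are hypothetical names rather than the loader's actual fields.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def parse_feed_items(xml_bytes: bytes) -> list[dict]:
    """Map each RSS <item> to the fields the loader cares about."""
    root = ET.fromstring(xml_bytes)
    items = []
    for item in root.iterfind("./channel/item"):
        link = (item.findtext("link") or item.findtext("guid") or "").strip()
        items.append({
            # The integer at the end of the permalink is the publication_id.
            "publication_id": int(link.rstrip("/").rsplit("/", 1)[-1]),
            "title": (item.findtext("title") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
            # RFC 2822 date; raises on a missing or malformed <pubDate>.
            "published_at": parsedate_to_datetime(item.findtext("pubDate")),
            "link": link,
        })
    return items


SAMPLE = b"""<rss version="2.0"><channel><item>
<title>H.R. 1234, Example Act</title>
<description>As ordered reported by the House Committee on Examples on July 22, 2025</description>
<pubDate>Fri, 15 Aug 2025 10:00:00 -0400</pubDate>
<link>https://www.cbo.gov/publication/12345</link>
</item></channel></rss>"""
```

`parse_feed_items(SAMPLE)` returns one dict whose `published_at` is timezone-aware, so it can be stored as timestamptz or truncated to publish_date without guessing the offset.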

The product detail page (fetched after the bot-wall) carries the title, publish date, subtitle ("As ordered reported by..." for cost estimates; abstract for others), the summary + body, the "Download PDF" link, sometimes a supplementary spreadsheet, and a "Related publications" sidebar listing earlier-stage estimates of the same bill. The columns Josh actually stores from this page are defined in the schema DDL below — pdf_url is constructed from the /system/files/... link, supplementary_file_urls from other /system/files/ links, and related_publications from the sidebar.
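
A rough sketch of pulling pdf_url and supplementary_file_urls out of the fetched product page. It assumes the first /system/files/ link ending in .pdf is the main document and that every other /system/files/ link is supplementary; both are assumptions about page layout, and the function name is hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple
from urllib.parse import urljoin, urlparse


class _AssetLinkParser(HTMLParser):
    """Collects href values of <a> tags that point under /system/files/."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and "/system/files/" in href:
            self.hrefs.append(href)


def extract_asset_urls(
    html: str, base: str = "https://www.cbo.gov/"
) -> Tuple[Optional[str], List[str]]:
    parser = _AssetLinkParser()
    parser.feed(html)
    # Resolve relative links and drop duplicates while keeping page order.
    urls = list(dict.fromkeys(urljoin(base, h) for h in parser.hrefs))
    pdfs = [u for u in urls if urlparse(u).path.lower().endswith(".pdf")]
    pdf_url = pdfs[0] if pdfs else None
    supplementary = [u for u in urls if u != pdf_url]
    return pdf_url, supplementary
```

Reading the link off the page, rather than building it from the title, is what keeps the loader independent of how the slug was generated.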

For cost estimates specifically, title parsing yields a (bill_type, bill_number) pair, e.g. H.R. 3447 → (hr, 3447). Resolving that pair to a bills.id also needs the congress, which the publication year supplies: the 119th Congress covers 2025-2026.
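
A minimal sketch of the soft bill_id resolution. The prefix-to-slug mapping and the eight bill types listed are assumptions (the source only names a few prefixes), and the congress is inferred from the publication date, which can be wrong for an estimate published after its Congress has ended.

```python
import re
from datetime import date
from typing import Optional

# Assumed mapping from title prefix to the bill-type slug used in bill IDs.
_PREFIX_TO_TYPE = {
    "H.R.": "hr", "S.": "s",
    "H.J.Res.": "hjres", "S.J.Res.": "sjres",
    "H.Con.Res.": "hconres", "S.Con.Res.": "sconres",
    "H.Res.": "hres", "S.Res.": "sres",
}
_BILL_RE = re.compile(
    r"^(" + "|".join(re.escape(p) for p in _PREFIX_TO_TYPE) + r")\s*(\d+)\b"
)


def congress_for(d: date) -> int:
    """Congress number in session on a given date (new Congress starts January 3 of odd years)."""
    year = d.year - 1 if (d.month, d.day) < (1, 3) else d.year
    return (year - 1789) // 2 + 1


def soft_bill_id(title: str, publish_date: date) -> Optional[str]:
    """Return e.g. 'bill:119-hr-3447', or None when the title has no bill prefix."""
    m = _BILL_RE.match(title.strip())
    if m is None:
        return None
    bill_type = _PREFIX_TO_TYPE[m.group(1)]
    return f"bill:{congress_for(publish_date)}-{bill_type}-{int(m.group(2))}"
```

With a 2025 publish date, `soft_bill_id("H.R. 3447, Chip Security Act", ...)` produces `bill:119-hr-3447`, matching the bill_id format shown in the schema comment.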

Volume

Metric	Value
Cost estimates per year	~600-1000
Other publications per year	~150-300
Total publications (online corpus, post-2000)	~25,000
Per-publication PDF	~100 KB - 5 MB; median ~300 KB for cost estimates
Per-publication HTML	~30-150 KB

Postgres footprint: ~500 MB - 1 GB after ingestion + indexing.

Caching / incremental sync

One RSS feed is the discovery signal:

https://www.cbo.gov/publications/all/rss.xml — all publications. Poll 6-hourly; filter to cost estimates client-side off the feed entries.

The former cost-estimates-only feed at https://www.cbo.gov/cost-estimates/all/rss.xml is dead — it returns a 200 text/html Drupal antibot page with zero <item> elements, so it is no longer usable for discovery or cross-check.

For each new RSS item, derive (publication_id, link, pubDate) and check against cbo_publications.id. Fetch new items via headed browser, extract metadata + body, then fetch the PDF via the same headed-browser path (the /system/files/ asset is DataDome-walled and 403s on direct curl).

PDFs are occasionally re-issued (errata), usually under the same publication_id. Detect a re-issue by a change in the PDF asset's Last-Modified header.
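
The Last-Modified comparison could look roughly like this. It assumes the headed-browser fetcher exposes response headers and that the previous value is stored somewhere (the schema below has no column for it yet, so `stored` is a hypothetical field).

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def _as_utc(dt: datetime) -> datetime:
    # Treat naive timestamps as UTC so comparisons never mix naive and aware values.
    return dt.replace(tzinfo=timezone.utc) if dt.tzinfo is None else dt


def pdf_was_reissued(stored: Optional[datetime], last_modified_header: Optional[str]) -> bool:
    """True when the server reports a newer Last-Modified than the one on record."""
    if stored is None or not last_modified_header:
        return False  # nothing to compare against
    try:
        observed = parsedate_to_datetime(last_modified_header)
    except (TypeError, ValueError):
        return False  # unparseable header: do not trigger a re-fetch on noise
    return _as_utc(observed) > _as_utc(stored)
```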

Schema (Postgres DDL)

-- ============================================================
-- CBO publications (all types — cost estimates + reports + working papers)
-- ============================================================

CREATE TABLE cbo_publications (
    id text PRIMARY KEY,                          -- 'cbo:62310'
    publication_id int NOT NULL UNIQUE,           -- 62310

    publication_type text NOT NULL CHECK (publication_type IN (
        'cost_estimate', 'report', 'working_paper', 'presentation',
        'letter', 'data_publication', 'other'
    )),

    title text NOT NULL,
    subtitle text,                                -- "As ordered reported by..."
    publish_date date NOT NULL,

    -- Body
    summary_html text,
    body_text text,                               -- extracted from HTML or PDF

    -- For cost estimates: link to the bill being scored
    bill_id text,                                 -- 'bill:119-hr-3447'; soft FK
    bill_stage text,                              -- 'ordered_reported_committee', 'reported_committee', 'passed_house', 'passed_senate', 'introduced', 'enacted'
    committee_id text REFERENCES committees(id),  -- the committee that ordered/reported (when applicable)
    committee_action_date date,                   -- the date in the subtitle

    -- Source URLs
    publication_url text NOT NULL,                -- cbo.gov/publication/{id}
    pdf_url text,
    supplementary_file_urls text[],               -- xlsx, csv, etc.

    -- Lifecycle
    raw_html bytea,                               -- gzipped product page
    raw_rss_description text,                     -- RSS body
    fetched_at timestamptz NOT NULL,
    parsed_at timestamptz,
    inserted_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX cbo_publications_publish_date ON cbo_publications (publish_date DESC);
CREATE INDEX cbo_publications_type ON cbo_publications (publication_type, publish_date DESC);
CREATE INDEX cbo_publications_bill ON cbo_publications (bill_id) WHERE bill_id IS NOT NULL;

ALTER TABLE cbo_publications ADD COLUMN search_tsv tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(subtitle, '')), 'B') ||
        setweight(to_tsvector('english', coalesce(summary_html, '')), 'C') ||
        setweight(to_tsvector('english', coalesce(body_text, '')), 'D')
    ) STORED;
CREATE INDEX cbo_publications_search ON cbo_publications USING gin (search_tsv);

-- ============================================================
-- Cost-estimate-specific structured fields
-- ============================================================

CREATE TABLE cbo_cost_estimates (
    publication_id int PRIMARY KEY REFERENCES cbo_publications(publication_id) ON DELETE CASCADE,

    -- Quantitative score (extracted at parse time when available)
    direct_spending_estimate jsonb,               -- {fy2026: ..., fy2027: ..., 5_year_total: ..., 10_year_total: ...}
    revenue_effect jsonb,                         -- same shape
    deficit_effect jsonb,
    discretionary_authorization jsonb,            -- "Authorization of appropriations of $..."

    -- Mandates assessment
    intergovernmental_mandate text,               -- 'No Mandates' | 'Below Threshold' | 'Above Threshold' | 'Increase'
    private_sector_mandate text,

    -- Pay-go relevance
    paygo_increases_deficit boolean,

    -- Confidence flags
    estimate_complete boolean NOT NULL DEFAULT false,  -- false when CBO did not produce a complete score (e.g., "no estimate could be made")
    estimate_text_only text                       -- when scoring tables aren't extractable, the prose summary lives here
);

-- ============================================================
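-- Vector chunks for similarity (requires the pgvector extension)
-- ============================================================

CREATE EXTENSION IF NOT EXISTS vector;  -- provides the vector type and the hnsw index method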

CREATE TABLE cbo_publication_chunks (
    id bigserial PRIMARY KEY,
    publication_id int NOT NULL REFERENCES cbo_publications(publication_id) ON DELETE CASCADE,
    chunk_index int NOT NULL,
    chunk_text text NOT NULL,
    embedding vector(1024),
    UNIQUE (publication_id, chunk_index)
);
CREATE INDEX cbo_publication_chunks_embedding
    ON cbo_publication_chunks USING hnsw (embedding vector_cosine_ops);

Schema decisions worth flagging:

One unified cbo_publications table for all publication types — most queries don't care if it's a cost estimate vs report. The discriminator + child table for cost-estimate-specific structured fields gives the best of both worlds.
cbo_cost_estimates is a one-to-one extension table. Its primary key is publication_id, which references the parent's unique integer publication_id rather than the parent's text primary key id. Parsing the cost-estimate scoring tables is fragile (PDFs vary); when extraction fails, estimate_complete=false and estimate_text_only carries the prose.
Quantitative fields as jsonb rather than typed columns. The fiscal-year keys vary (5-year vs 10-year windows; some estimates only project 3 years). JSONB is the right shape.
bill_id as soft FK — CBO often estimates bills before our bills loader has caught them (their cost estimate may post within hours of committee markup; BILLSTATUS XML may lag).
raw_html preserved — DataDome occasionally serves variant HTML; preserving raw lets us re-parse later.
Cross-check via bill_cbo_estimates (defined in bills) — that table has bill_id, pub_date, title, description, url. The CBO loader populates it with the canonical CBO link as confirmation.

A committee_id foreign key references committees(id) (see legislators & committees).

Chunker / eval status — known gap. Only the cbo_publication_chunks table (vector(1024), HNSW) exists; no explicit chunking strategy or embedding eval is documented for this source yet. Resolve before moving from exploring → shipped.

Download / update strategy

Backfill

Pull RSS feed (recent ~50 items only — RSS doesn't archive). For backfill, paginate cbo.gov/publication?page=N via headed browser. ~25K publications across ~500 pages.
Per publication:
- Fetch product page via headed browser. Extract title, subtitle, publish date, body, PDF URL, related publications.
- Fetch PDF via the headed browser — the /system/files/ asset is DataDome-walled and 403s on direct curl even with full browser headers.
- Extract body text from PDF.
- For cost estimates: parse the budget table (regex + heuristic on \$[\d,]+\s*million patterns; defer rich extraction to v1.x).
- Soft-resolve bill_id from title regex ((H\.R\.|S\.|H\.J\.Res\.|...)\s*(\d+)).
Embed chunks.
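
The v1 budget-table heuristic from the steps above, sketched as written. Note the limits of the source pattern: it only captures whole-number "$N million" figures, so decimals such as "$1.5 million" and any "billion" amounts are skipped. The function name is hypothetical and the sentence in the usage note is made up.

```python
import re
from decimal import Decimal

# The pattern named in the backfill steps, with the digits captured.
_MILLION = re.compile(r"\$([\d,]+)\s*million")


def dollar_amounts_in_millions(text: str) -> list[Decimal]:
    """Return every '$N million' figure in the text, in document order."""
    amounts = []
    for m in _MILLION.finditer(text):
        digits = m.group(1).replace(",", "")
        if digits:  # guard against a capture made only of commas
            amounts.append(Decimal(digits))
    return amounts
```

On "would cost $12 million over 2026-2030 and $1,250 million over 2026-2035" this returns the two amounts 12 and 1250. Tying each amount to a fiscal-year window is the part deferred to the v1.x structured extractor.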

Backfill: 25K publications × headed-browser fetch ≈ 10-15 hours wall-clock, which implies roughly 1.4-2.2 seconds per publication for the product page plus the PDF.

Incremental sync (6-hourly)

6-hourly: poll the publications RSS. New items → filter to cost estimates → product page fetch + PDF + parse.

State tracking

Source key: cbo_publications. Run state stores the highest publication_id seen so far and the most recent RSS <pubDate>.
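
One way the run state could be advanced per poll, as a sketch. `RunState` and `advance` are hypothetical names. New work is decided by membership in the known-ID set rather than by the cursor alone, since a re-issued publication keeps its ID.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class RunState:
    max_publication_id: int = 0
    last_pub_date: Optional[datetime] = None


def advance(
    state: RunState,
    items: Iterable[tuple[int, datetime]],
    known_ids: set[int],
) -> tuple[RunState, list[int]]:
    """Return the updated state and the publication IDs that still need fetching."""
    new_ids = []
    max_id, last = state.max_publication_id, state.last_pub_date
    for pub_id, pub_date in items:
        if pub_id not in known_ids:
            new_ids.append(pub_id)
        max_id = max(max_id, pub_id)
        if last is None or pub_date > last:
            last = pub_date
    return RunState(max_id, last), sorted(new_ids)
```

Polling the same feed twice is then idempotent: the second call returns an empty work list and an unchanged state.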

Failure modes

DataDome 403 escalation — headed browser may need session rotation. Same playbook as Senate / GAO.
PDF score-table extraction fails — set estimate_complete=false, store prose in estimate_text_only. The score is still searchable via body_text.
Bill ID regex fails — common for non-cost-estimate publications. Leave bill_id=NULL.
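Re-issued cost estimate (corrigendum): same publication_id, new pubDate. Update the row in place and keep the previous raw_html. The schema above has a single raw_html column, so the old copy has to be archived elsewhere before the overwrite.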

Open questions

These don't block ingestion but should be resolved before this source is "shipped":

Score table extraction quality. Cost estimate PDFs have well-formatted tables of dollar amounts by fiscal year, but extracting them reliably from PDF requires OCR-aware parsing or a structured-table extractor (pdfplumber's table mode, camelot, or vision models). Lean: ship v1 with prose-only score and a flag; build structured extractor in v1.x once we have a fixture set.
CBO's pre-2000 corpus. Some early estimates exist only as scans. Out of scope for v1.
Recurring data files (Baseline Projections, Budget Outlook) at cbo.gov/data are Excel workbooks updated quarterly. Different beast — separate ingestion. Defer to v1.x as its own source if useful.
Working papers vs reports. The line is fuzzy; both contain analysis. Use publication_type as a hint but don't promise the agent it's a strict taxonomy.
Bill stage extraction reliability. "As ordered reported by..." vs "As reported by..." vs "As passed by..." — we want a clean bill_stage enum. The subtitle text is short and structured but committee names vary. Build a small classifier; track parser accuracy.
Linking to public laws when CBO publishes "as enacted" estimates. Less common but exists. Schema supports via bill_id; the bill_stage='enacted' flag captures the case.
Multi-bill estimates — some CBO publications cover multiple bills (e.g., omnibus packages). Title carries multiple bill IDs separated by commas. Schema allows only one bill_id; consider a join table cbo_publication_bills for the rare multi-bill case.
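
For the multi-bill case, a sketch of extracting every bill reference from a title so each can become a row in the proposed cbo_publication_bills join table. Only H.R. and S. prefixes are handled here for brevity, and the title format is an assumption since no multi-bill example is documented above.

```python
import re

# The lookbehind keeps "U.S. 2025" from being read as Senate bill 2025.
_BILL_REF = re.compile(r"(?<![\w.])(H\.R\.|S\.)\s*(\d+)")
_TYPES = {"H.R.": "hr", "S.": "s"}


def bill_refs(title: str) -> list[tuple[str, int]]:
    """Return unique (bill_type, bill_number) pairs in the order they appear."""
    seen = {}
    for m in _BILL_REF.finditer(title):
        seen.setdefault((_TYPES[m.group(1)], int(m.group(2))), None)
    return list(seen)
```

A title yielding more than one pair is the signal to leave cbo_publications.bill_id NULL and write the pairs to the join table instead.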