Statements of Administration Policy (SAP)
A Statement of Administration Policy is a memorandum prepared by the Office of Management and Budget (OMB) on behalf of the President, expressing the administration's position on a piece of legislation pending before Congress. SAPs are short (typically 1-3 pages), structured, and high-signal: a "Strongly Opposes" SAP frequently signals a veto threat; a "Supports" SAP often accompanies the President's signature when the bill becomes law.
For Josh, SAPs answer "what's the White House position on this bill" — a question every Hill staffer asks weekly. Because they're authored by OMB and addressed to a specific bill, they're naturally FK'd into the bills table. This is a small but high-value source.
Quick reference
| Source name | Statements of Administration Policy |
| Publishers | Office of Management and Budget (OMB), Executive Office of the President |
| License | Public domain (US government work) |
| Coverage | Reagan administration (1985) – present. Modern systematic coverage from Obama administration (2009) forward. |
| Volume | ~50-200 SAPs per congress; ~108 in 119th-to-date (Trump 2 administration). ~1,095 PDFs all-time (systematic from Obama 2009 onward). |
| Storage estimate | ~161 MB (mostly small PDFs) |
| Primary mirror | unitedstates/statements-of-administration-policy GitHub repo (CC0; Flatgov team-maintained) |
| Current source | https://www.whitehouse.gov/omb/statements-of-administration-policy/ |
| Pre-Obama source | UCSB American Presidency Project (mirrored in same GitHub repo) |
| Auth | None |
| Rate limits | None observed for GitHub raw / Pages |
| Caching headers | GitHub: Last-Modified + ETag |
| Incremental sync hints | YAML metadata file Last-Modified; commit log for human-readable change feed |
| Stable ID format | sap:{date}-{bill-id-slug} e.g. sap:2025-01-22-s5, sap:2012-02-06-hr1734 |
| Status | exploring — schema drafted, ingestion not built |
Source priority decision
Primary: unitedstates/statements-of-administration-policy GitHub repo. Same maintainers as congress-legislators (Flatgov team). Hosts:
- Per-administration YAML metadata files (`archive/44-Obama.yaml`, `archive/47-Trump.yaml`).
- PDFs of every SAP from Obama onward at `archive/statements/{admin}/{congress}/{filename}.pdf` (filenames are arbitrary per the repo README; `{admin}` is number-prefixed, e.g. `46-Biden`).
- Pre-Obama SAPs reference UCSB American Presidency Project URLs (no PDF; HTML scrape needed for body text).
- Continuously scraped via `scraper.py` against current `whitehouse.gov` and `ucsb_scraper.py` for archive.
This is the canonical aggregate. Public-domain (CC0). No auth. Updates typically reach the mirror within 1-3 days of OMB posting.
Secondary: whitehouse.gov/omb/statements-of-administration-policy/. The live source. The mirror scrapes it; we do not need to scrape directly. Keep as a fallback if the mirror lapses.
Skip: scraping UCSB American Presidency Project for pre-Obama text. The mirror has the metadata + URL pointers; we can fetch UCSB pages on-demand for body text if needed (mostly historical context for v1).
The presidents.yaml file at the repo root maps administration directory names (44-Obama, 47-Trump) to canonical president records (with bioguide IDs from executive.yaml in legislators & committees). Cross-reference for clean president-to-SAP joins.
Access notes
GitHub raw / Pages: open, fast, well-cached. No special handling needed. Last-Modified and ETag headers are reliable for conditional GET. See the mirror repo for the raw YAML / PDF path layout.
The whitehouse.gov live listing has an intermittent bot wall (varies by White House administration); the GitHub mirror is open and uncapped, which is why it's the primary source. This bot-wall is a documented assumption from earlier observation, not a fresh live probe (status is still exploring).
Endpoints / URL patterns
Josh reads three things from the mirror: a per-administration/congress directory listing (to discover new SAPs), the per-admin YAML metadata, and individual SAP PDFs. The git commit log gives a human-readable change feed for the admin UI. All of this rides standard GitHub HTTP surfaces; see the GitHub Contents API and Commits API docs for the raw endpoint shapes.
The live whitehouse.gov listing at https://www.whitehouse.gov/omb/statements-of-administration-policy/ is the fallback source. It is intermittently behind a bot wall, and the behavior varies by White House administration.
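The three reads above map onto standard GitHub URL shapes. The following is a minimal sketch of URL builders, not the project's actual client: the branch name `main` and the helper names `contents_url`, `raw_url`, and `commits_url` are assumptions made for illustration and should be checked against the mirror repo.

```python
from urllib.parse import quote, urlencode

OWNER = "unitedstates"
REPO = "statements-of-administration-policy"
BRANCH = "main"  # assumption: default branch name; verify against the repo

API_ROOT = "https://api.github.com"
RAW_ROOT = "https://raw.githubusercontent.com"


def contents_url(path: str) -> str:
    """GitHub Contents API URL: directory listing or file metadata for `path`."""
    query = urlencode({"ref": BRANCH})
    return f"{API_ROOT}/repos/{OWNER}/{REPO}/contents/{quote(path)}?{query}"


def raw_url(path: str) -> str:
    """raw.githubusercontent.com URL for the bytes of a YAML file or PDF."""
    return f"{RAW_ROOT}/{OWNER}/{REPO}/{BRANCH}/{quote(path)}"


def commits_url(path: str, per_page: int = 30) -> str:
    """GitHub Commits API URL listing commits that touched `path`."""
    query = urlencode({"path": path, "per_page": per_page})
    return f"{API_ROOT}/repos/{OWNER}/{REPO}/commits?{query}"
```

For example, `contents_url("archive/statements/47-Trump/119")` would give the listing URL for one administration/congress directory, and `commits_url("archive/47-Trump.yaml")` the change feed for one metadata file.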
File path convention
SAP PDFs live at archive/statements/{admin-dir}/{congress}/{filename}.pdf in the mirror (admin dirs are number-prefixed, e.g. 44-Obama, 47-Trump; see the repo README). The load-bearing Josh rule:
Do not parse the bill from the filename. Filenames are arbitrary — the repo README explicitly warns they are not a stable, parseable convention. In practice many follow a {YYYY-MM-DD}_{billslug} shape (ISO date, e.g. 2021-02-26_hr1319.pdf), but you must not depend on it. The authoritative bill mapping is the per-admin YAML's bills + congress fields — these are the source of truth for both the PDF inventory and the bill linkage. Roughly 34 of ~1,095 SAPs (~3%) are multi-bill, amendment-targeted, or prose-titled, so no deterministic filename→bill_id slug derivation holds across the corpus. Read bills/congress from the YAML entry and resolve the soft FK from there.
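A minimal sketch of that rule: resolve bill IDs from an already-parsed YAML entry and never from the filename. The exact shape of the `bills` values is an assumption here (slugs such as `hr1319`), as are the helper names; the `bill:{congress}-{type}-{number}` output follows the `bill:119-s-5` example in the schema. Check the mirror README for the real field layout before relying on this.

```python
import re
from typing import Optional

BILL_TYPES = ("hconres", "sconres", "hjres", "sjres", "hres", "sres", "hr", "s")
# Assumption: each `bills` value is a slug like "hr1319" (type + number).
_SLUG_RE = re.compile(r"^(%s)(\d+)$" % "|".join(BILL_TYPES))


def bill_id_from_slug(slug: str, congress: int) -> Optional[str]:
    """Turn a bill slug plus congress into 'bill:{congress}-{type}-{number}'."""
    m = _SLUG_RE.match(slug.strip().lower())
    if m is None:
        return None  # prose title, amendment, or unknown shape
    bill_type, number = m.groups()
    return f"bill:{congress}-{bill_type}-{int(number)}"


def resolve_bills(entry: dict) -> tuple[Optional[str], list[str]]:
    """Return (bill_id, bills_extra) from a YAML entry's bills + congress."""
    congress = entry.get("congress")
    if congress is None:
        return None, []
    ids = []
    for slug in entry.get("bills") or []:
        bill_id = bill_id_from_slug(str(slug), int(congress))
        if bill_id is not None:
            ids.append(bill_id)
    if not ids:
        return None, []  # caller falls back to fuzzy title match + manual review
    return ids[0], ids[1:]
```

The first resolvable bill becomes `bill_id` and the rest land in `bills_extra`; a filename-shaped string such as `2021-02-26_hr1319` deliberately resolves to nothing.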
Vocabulary / enums
SAP position (extracted from body text)
The SAP text begins with a position statement. The high-signal categorical:
| Category | Body text cue |
|---|---|
| `strongly_supports` | "The Administration strongly supports..." |
| `supports` | "The Administration supports..." |
| `does_not_oppose` | "The Administration does not oppose..." |
| `concerns` | "The Administration has concerns about..." or "The Administration is concerned that..." |
| `opposes` | "The Administration opposes..." |
| `strongly_opposes` | "The Administration strongly opposes..." |
| `veto_threat` | Body contains "...the President's senior advisors will recommend that he veto..." |
The veto threat is a categorical added to the position — strongly_opposes + veto_threat is the strongest stance. Some OMB SAPs explicitly say the President "will" veto; others use the "senior advisors recommend" formulation.
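The cue table lends itself to a small keyword classifier. The sketch below is one possible v1 under stated assumptions: the regexes are illustrative, the "earliest cue in the text wins" rule is a design guess based on the position statement opening the SAP, and `classify_position` is a hypothetical helper name. It has not been measured against real SAP text.

```python
import re

# One regex per category; each is anchored on "Administration <cue>".
_POSITION_CUES = [
    ("strongly_supports", r"\bAdministration\s+strongly\s+supports\b"),
    ("supports", r"\bAdministration\s+supports\b"),
    ("does_not_oppose", r"\bAdministration\s+does\s+not\s+oppose\b"),
    ("concerns", r"\bAdministration\s+(?:has\s+concerns\s+about|is\s+concerned\s+that)\b"),
    ("opposes", r"\bAdministration\s+opposes\b"),
    ("strongly_opposes", r"\bAdministration\s+strongly\s+opposes\b"),
]
_POSITION_RES = [(label, re.compile(p, re.IGNORECASE)) for label, p in _POSITION_CUES]

# Veto language: the "senior advisors ... recommend that he veto" form,
# plus the direct "will veto" / "would veto" form.
_VETO_RE = re.compile(
    r"recommend\s+that\s+(?:he|she)\s+veto|\b(?:will|would)\s+veto\b",
    re.IGNORECASE,
)


def classify_position(body_text: str) -> tuple[str, bool]:
    """Return (position, is_veto_threat); position is 'unknown' if no cue matches."""
    best = None  # (offset, label) of the earliest cue in the text
    for label, rx in _POSITION_RES:
        m = rx.search(body_text)
        if m and (best is None or m.start() < best[0]):
            best = (m.start(), label)
    position = best[1] if best else "unknown"
    return position, bool(_VETO_RE.search(body_text))
```

Keeping the veto check independent of the position label matches the note above that the veto threat is an add-on rather than a seventh position.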
Per-bill type (from the per-admin YAML's bills field)
Same enum as Bills' bill_type: hr, s, hjres, sjres, hconres, sconres, hres, sres. Read it from the YAML bills entry (not the filename). Concurrent / simple resolutions rarely get SAPs (they don't reach the President's desk) but exist.
Stable ID format
`sap:{YYYY-MM-DD}-{billslug}` preserves both the issue date and the bill it addresses.
Examples:
- `sap:2025-01-22-s5`
- `sap:2012-02-06-hr1734`
- `sap:2025-02-24-hjres20`
The combination is unique in practice because OMB normally issues at most one SAP per bill per day; a same-day revised SAP is the rare exception (see Failure modes). (Repeat SAPs on the same bill at different stages get separate dates.)
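As a small illustration of the ID format, here is a hedged sketch of a builder and its inverse. The function names `make_sap_id` and `parse_sap_id` are hypothetical, and the parser only handles the plain single-bill form shown in the examples.

```python
import re
from datetime import date

_ID_RE = re.compile(r"^sap:(\d{4}-\d{2}-\d{2})-([a-z]+)(\d+)$")


def make_sap_id(issued: date, bill_type: str, bill_number: int) -> str:
    """Build 'sap:{YYYY-MM-DD}-{billslug}' from the issue date and bill."""
    return f"sap:{issued.isoformat()}-{bill_type.lower()}{bill_number}"


def parse_sap_id(sap_id: str) -> tuple[date, str, int]:
    """Split a plain stable ID back into (date, bill_type, bill_number)."""
    m = _ID_RE.match(sap_id)
    if m is None:
        raise ValueError(f"not a plain SAP id: {sap_id!r}")
    return date.fromisoformat(m.group(1)), m.group(2), int(m.group(3))
```

For instance, `make_sap_id(date(2025, 1, 22), "s", 5)` yields `sap:2025-01-22-s5`, matching the first example above.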
Response shapes
Each SAP entry comes from a per-administration YAML file in the mirror, whose fields supply the metadata Josh lands. bill_id provenance: it is resolved from the YAML entry's bills + congress fields, never from the PDF filename. Obama+ entries carry a local file: path to the PDF; pre-Obama entries instead carry a url: pointing at the UCSB American Presidency Project (no PDF; body text needs an HTML scrape). See the mirror README for the full field-by-field YAML layout, and the UCSB SAP archive for the pre-Obama source pages.
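The `file:` versus `url:` split decides which of the two body columns gets populated. A minimal sketch, assuming the `file` value is a path relative to the repo root and that the default branch is `main` (both unverified); `body_source` is a hypothetical helper name:

```python
from typing import Optional

# Assumption: branch "main" and repo-root-relative `file` paths.
RAW_BASE = (
    "https://raw.githubusercontent.com/"
    "unitedstates/statements-of-administration-policy/main/"
)


def body_source(entry: dict) -> dict[str, Optional[str]]:
    """Map a YAML entry to the body_pdf_url / fallback_url column pair."""
    file_path = entry.get("file")
    if file_path:
        # Obama+ entry: PDF lives in the mirror.
        return {"body_pdf_url": RAW_BASE + file_path.lstrip("/"), "fallback_url": None}
    # Pre-Obama entry: metadata only, body text lazily fetched from UCSB.
    return {"body_pdf_url": None, "fallback_url": entry.get("url")}
```

An entry with neither field yields two `None` values, which is worth flagging for review rather than inserting silently.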
PDF body shape
The PDF is short (1-3 pages typically). Body structure:
```
EXECUTIVE OFFICE OF THE PRESIDENT
OFFICE OF MANAGEMENT AND BUDGET
WASHINGTON, D.C. 20503

[Date — usually a Tuesday or Wednesday]

STATEMENT OF ADMINISTRATION POLICY

[BILL TITLE — H.R. 1734 — Civilian Property Realignment Act of 2011]

The Administration [strongly supports / supports / opposes / strongly opposes / has concerns about] [bill]
because...

[Body paragraphs — typically 200-1000 words. Discusses specific provisions, agency
implementation concerns, alternative approaches the administration prefers.]

[Veto threat (when applicable):]
"If [bill] were presented to the President in its current form, his senior advisors
would recommend that he veto the bill."

[Sometimes a closing line summarizing the administration's preferred path forward.]
```
OCR / text extraction is straightforward: short, single-column, no tables.
Volume
| Metric | Value |
|---|---|
| Per congress | ~50-200 SAPs (varies by administration's level of legislative engagement) |
| 119th congress to-date | ~108 |
| All-time PDFs in the mirror | ~1,095 (systematic from Obama 2009 onward; ~161 MB total) |
| Multi-bill / amendment / prose-titled | ~34 of ~1,095 (~3%) — no single parseable bill in the filename |
| Per-PDF size | 50-200 KB |
| Per-PDF text length | ~500-1500 words |
On-disk corpus footprint: ~161 MB of PDFs.
Caching / incremental sync
- Daily: poll directory listing for the current administration + congress via the GitHub Contents API. Diff against our `sap_statements.id`.
- For each new file: fetch PDF, parse body, extract position + bill linkage.
- Weekly: poll `archive/{admin}.yaml` for older administrations to catch retroactive additions.
- GitHub commits API: human-readable change tracking. Useful for the admin UI to show "SAPs added this week."
Last-Modified / ETag make conditional GET cheap — see GitHub's REST best practices. The repo is well-maintained but not real-time — expect a 1-3 day lag from OMB posting to the mirror updating.
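A conditional GET can be sketched with the standard library alone. This is an illustrative helper, not the project's fetcher: `conditional_get` and its injectable `opener` parameter are assumptions added so the logic can be exercised without network access. Note that `urllib` raises `HTTPError` for a 304 response, so "not modified" is handled in the `except` branch.

```python
import urllib.request
from typing import Callable, Optional
from urllib.error import HTTPError


def conditional_get(
    url: str,
    etag: Optional[str] = None,
    last_modified: Optional[str] = None,
    opener: Callable = urllib.request.urlopen,
):
    """Return (body, etag, last_modified); body is None on 304 Not Modified."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    req = urllib.request.Request(url, headers=headers)
    try:
        with opener(req) as resp:
            return resp.read(), resp.headers.get("ETag"), resp.headers.get("Last-Modified")
    except HTTPError as err:
        if err.code == 304:
            # Unchanged since the validators we sent: keep the stored ones.
            return None, etag, last_modified
        raise
```

The caller stores the returned validators next to the cached copy and passes them back on the next poll; a `None` body means the cached copy is still current.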
Schema (Postgres DDL)
```sql
-- ============================================================
-- Statements of Administration Policy
-- ============================================================

CREATE TABLE sap_statements (
  id                   text PRIMARY KEY,             -- 'sap:2025-01-22-s5'

  -- When + who
  date_issued          date NOT NULL,
  administration       text NOT NULL,                -- '47-Trump' (matches presidents.yaml)
  president_bioguide   text REFERENCES legislators(bioguide_id), -- soft FK to executive.yaml records
  congress             smallint NOT NULL,

  -- What
  document_title       text NOT NULL,
  bill_id              text,                         -- 'bill:119-s-5'; soft FK
  bill_type            text,                         -- 's', 'hr', 'hjres', etc.
  bill_number          int,
  bills_extra          text[],                       -- when SAP covers multiple bills (rare)

  -- Position (extracted from body text)
  position             text CHECK (position IN (
                         'strongly_supports', 'supports', 'does_not_oppose',
                         'concerns', 'opposes', 'strongly_opposes', 'unknown'
                       )),
  is_veto_threat       boolean NOT NULL DEFAULT false,
  position_confidence  numeric(3,2),                 -- 0.0-1.0; how confident position parser was

  -- Body
  body_text            text,                         -- extracted from PDF
  body_pdf_url         text,                         -- raw.githubusercontent.com path
  fallback_url         text,                         -- UCSB URL when no PDF (pre-Obama)

  -- Source
  source_yaml_url      text,
  fetched_from_url     text,                         -- original whitehouse.gov / archives URL

  -- Lifecycle
  raw_pdf              bytea,                        -- gzipped PDF bytes (small enough to inline)
  fetched_at           timestamptz NOT NULL,
  parsed_at            timestamptz,
  inserted_at          timestamptz NOT NULL DEFAULT now(),
  updated_at           timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX sap_statements_date     ON sap_statements (date_issued DESC);
CREATE INDEX sap_statements_bill     ON sap_statements (bill_id) WHERE bill_id IS NOT NULL;
CREATE INDEX sap_statements_admin    ON sap_statements (administration, date_issued DESC);
CREATE INDEX sap_statements_position ON sap_statements (position) WHERE position IS NOT NULL;
CREATE INDEX sap_statements_veto     ON sap_statements (date_issued DESC) WHERE is_veto_threat;

ALTER TABLE sap_statements ADD COLUMN search_tsv tsvector
  GENERATED ALWAYS AS (
    setweight(to_tsvector('english', coalesce(document_title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(body_text, '')), 'D')
  ) STORED;
CREATE INDEX sap_statements_search ON sap_statements USING gin (search_tsv);
```
Schema decisions worth flagging:
- `raw_pdf` inlined as gzipped bytea: SAPs are small (<200 KB each, ~1,095 total ≈ 161 MB). Worth keeping the raw PDF for re-extraction.
- `position` as a normalized enum with `position_confidence` indicating parser confidence. Lean conservative: when extraction is ambiguous, set `position='unknown'` and let downstream consumers fall back to body-text search.
- `is_veto_threat` is a separate boolean: the categorical "strongly opposes" doesn't always carry a veto threat, and "opposes" sometimes does. They're orthogonal.
- `bills_extra text[]` for the rare multi-bill SAPs: the YAML supports a list under `bills:`. Most have one bill; we denormalize the canonical one to `bill_id` and keep extras here.
- `president_bioguide` as soft FK: every modern president has a bioguide via `executive.yaml`.
- No vector chunks initially: bodies are short (~500-1,500 words per PDF, per the Volume table). Full-text search via tsvector covers it. Add `sap_chunks` if/when retrieval needs similarity over the corpus.
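The `raw_pdf` decision implies a compress-on-write, decompress-on-read pair around the bytea column. A minimal sketch with the standard `gzip` module; `pack_pdf` and `unpack_pdf` are hypothetical names, and the fixed `mtime=0` is a design choice made here so that identical PDFs produce identical stored bytes. PDF streams are usually compressed internally already, so the size gain may be modest.

```python
import gzip


def pack_pdf(pdf_bytes: bytes) -> bytes:
    """Gzip PDF bytes for the raw_pdf bytea column (deterministic output)."""
    # mtime=0 keeps the gzip header stable across runs.
    return gzip.compress(pdf_bytes, mtime=0)


def unpack_pdf(stored: bytes) -> bytes:
    """Recover the original PDF bytes for re-extraction."""
    return gzip.decompress(stored)
```

Deterministic output makes it possible to compare the stored bytes (or a hash of them) when deciding whether a re-fetched PDF actually changed.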
Download / update strategy
Backfill
- Pull `presidents.yaml` to map administration dirs to canonical president records.
- Pull each `archive/{admin}.yaml` (where present: Obama+ have YAML; pre-Obama uses inline UCSB URLs).
- For Obama+ (44, 45, 46, 47):
  - List directory contents at `archive/statements/{admin}/{congress}/`.
  - For each PDF, fetch raw bytes.
  - Extract text via `pdftotext/pdfplumber`.
  - Parse position categorical from opening paragraph. Resolve `bill_id` from the per-admin YAML's `bills` + `congress` fields (the filename is arbitrary and must not be parsed for bill IDs).
  - Insert row.
- For pre-Obama (administrations 40-43, back to Reagan):
  - Iterate YAML rows.
  - Where `file` is set, fetch local PDF.
  - Where `url` is set (UCSB), defer body text: store metadata only with `body_text=NULL` and `fallback_url`. Lazy-fetch body on first agent request, parse text from UCSB HTML.
- ~1,095 PDF-backed rows (~161 MB). With 4 concurrent workers the full backfill runs in minutes.
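The "4 concurrent workers" step can be sketched with a thread pool. This is an assumption-laden outline rather than the real job: `backfill` is a hypothetical name, and `fetch` stands in for whatever function downloads one PDF path. Failures are collected per path so that one bad file does not abort the run.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def backfill(paths: Iterable[str], fetch: Callable[[str], bytes], workers: int = 4) -> dict:
    """Fetch every path with a small thread pool; collect failures per path."""
    fetched, failed = {}, {}

    def _one(path: str):
        try:
            return path, fetch(path), None
        except Exception as exc:  # record and continue with the other files
            return path, None, exc

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results are deterministic.
        for path, body, exc in pool.map(_one, list(paths)):
            if exc is None:
                fetched[path] = body
            else:
                failed[path] = exc
    return {"fetched": fetched, "failed": failed}
```

Parsing and inserting can then run over `fetched` in a single thread, and `failed` feeds a retry pass.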
Daily incremental
- Daily 12:00 UTC: poll the current administration + congress directory via the GitHub Contents API.
- Diff vs `sap_statements.id`. New filenames → fetch PDF + parse + insert.
- Weekly: poll older admin directories for retroactive additions (rare but happens).
- Weekly: GitHub commits API to populate human-readable change feed in admin UI.
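The daily diff reduces to a set difference between the directory listing and what has already been seen. A minimal sketch, assuming the listing is the JSON array returned by the GitHub Contents API for a directory (objects with `name` and `type` keys) and that seen filenames are tracked per admin/congress; `new_filenames` is a hypothetical helper name.

```python
def new_filenames(listing: list[dict], seen: set[str]) -> list[str]:
    """Return PDF filenames in a Contents API listing that are not in `seen`."""
    names = [
        item["name"]
        for item in listing
        # Skip subdirectories and non-PDF files such as READMEs.
        if item.get("type") == "file" and item["name"].lower().endswith(".pdf")
    ]
    return sorted(name for name in names if name not in seen)
```

Each returned filename is then looked up in the per-admin YAML to get its bills + congress before the PDF is fetched and parsed.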
State tracking
Source key: sap_statements. State stores last-seen filenames per admin/congress.
Failure modes
- Position parser fails to classify. OMB writers occasionally use unusual phrasing. Set `position='unknown'`, `position_confidence=0`. Body text remains searchable.
- Bill resolution fails for a YAML entry with no resolvable bill: the ~34/~1,095 (~3%) multi-bill, amendment-targeted, or prose-titled SAPs. Carry every bill listed under the YAML `bills` field; where none resolves cleanly, fall back to fuzzy match on document_title and flag for manual review.
- PDF text extraction fails (rare; older SAPs scanned, not native PDF). Fall back to OCR via `tesseract`. Keep `raw_pdf` so re-extraction is possible later.
- GitHub repo lapses or rate-limits the API. Fall back to direct White House scrape (which has its own bot wall sometimes). Document the scraping path as backup.
- OMB issues a "revised" SAP on the same bill same day. Rare but possible. Schema's `id` would collide; suffix with `-v2` if detected.
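The same-day collision rule can be sketched as a tiny suffixing helper. Counting past `-v2` to `-v3` and beyond is an extrapolation made here, not something the failure-mode list specifies, and `dedupe_id` is a hypothetical name.

```python
def dedupe_id(base_id: str, existing: set[str]) -> str:
    """Return base_id, or base_id with the first free -vN suffix (N >= 2)."""
    if base_id not in existing:
        return base_id
    n = 2
    while f"{base_id}-v{n}" in existing:
        n += 1
    return f"{base_id}-v{n}"
```

In a real insert path the uniqueness check would more likely lean on the primary-key constraint than on an in-memory set; the set keeps the sketch self-contained.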
Open questions
These don't block ingestion but should be resolved before this source is "shipped":
- Position parser quality. A simple keyword classifier ("Administration strongly supports" → `strongly_supports`) covers most. Some SAPs phrase differently. Build a small evaluation set + measure F1.
- Veto threat detection. Watching for "would recommend that he veto" / "will veto" / "his senior advisors". A trained classifier or a simple regex set both work. Lean simple regex for v1; track precision/recall.
- Pre-Obama body text. UCSB American Presidency Project pages have body text in HTML. Fetching ~1,500 historical pages (Reagan-Bush43) is fine but adds an HTML scraper. Defer to v1.x; ship v1 with metadata-only for pre-Obama.
- Multi-bill SAPs. Currently rare (a few percent). Schema handles via `bills_extra`; extraction needs to populate it correctly. Test with a known multi-bill case.
- SAPs on amendments / motion to proceed. Most SAPs target final-passage bills, but a few target an amendment to a bill or a motion to proceed. The slug captures the bill but loses the stage. Decide if `bill_stage` is worth adding (it would be similar to CBO's `bill_stage`).
- Whitehouse.gov direct scrape as fallback. Per-administration White House URLs change every transition (`obamawhitehouse.archives.gov`, `trumpwhitehouse.archives.gov`). The `unitedstates` mirror tracks them. If we fall back to direct scrape, code must know which URL pattern is current.
- Veto threats vs actual vetoes. Cross-referencing veto-threat SAPs against actual presidential vetoes (which would come from a separate source) is a research-grade analysis target. Worth flagging in the schema as future enrichment but not blocking.
- Cross-administration coverage of OBE-d SAPs. Some SAPs were posted then withdrawn. The mirror keeps them; ingest with `is_active=false`? Currently no flag; not enough cases observed to model. Defer.
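For the parser-quality question, the F1 measurement over a hand-labeled evaluation set can be sketched in a few lines. The helper names are hypothetical, and the choice of macro-averaging (every label weighted equally, so rare categories such as `does_not_oppose` count as much as `supports`) is an assumption about what "measure F1" should mean here.

```python
from collections import Counter


def per_label_f1(gold: list[str], pred: list[str]) -> dict[str, float]:
    """Per-label F1 = 2*TP / (2*TP + FP + FN) for a single-label classifier."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label was wrong
            fn[g] += 1  # gold label was missed
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Unweighted mean of the per-label F1 scores."""
    scores = per_label_f1(gold, pred)
    return sum(scores.values()) / len(scores) if scores else 0.0
```

The same functions work for the veto-threat boolean if the labels are passed as strings such as `"veto"` and `"no_veto"`.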