Statements of Administration Policy (SAP)

A Statement of Administration Policy is a memorandum prepared by the Office of Management and Budget (OMB) on behalf of the President, expressing the administration's position on a piece of legislation pending before Congress. SAPs are short (typically 1-3 pages), structured, and high-signal: a "Strongly Opposes" SAP frequently signals a veto threat, while a "Supports" SAP often foreshadows the President's signature if the bill passes.

For Josh, SAPs answer "what's the White House position on this bill" — a question every Hill staffer asks weekly. Because they're authored by OMB and addressed to a specific bill, they're naturally FK'd into the bills table. This is a small but high-value source.

| Field | Value |
| --- | --- |
| Source name | Statements of Administration Policy |
| Publishers | Office of Management and Budget (OMB), Executive Office of the President |
| License | Public domain (US government work) |
| Coverage | Reagan administration (1985) to present. Modern systematic coverage from Obama administration (2009) forward. |
| Volume | ~50-200 SAPs per congress; ~108 in 119th-to-date (second Trump administration). ~1,095 PDFs all-time (systematic from Obama 2009 onward). |
| Storage estimate | ~161 MB (mostly small PDFs) |
| Primary mirror | unitedstates/statements-of-administration-policy GitHub repo (CC0; Flatgov team-maintained) |
| Current source | https://www.whitehouse.gov/omb/statements-of-administration-policy/ |
| Pre-Obama source | UCSB American Presidency Project (mirrored in same GitHub repo) |
| Auth | None |
| Rate limits | None observed for GitHub raw / Pages |
| Caching headers | GitHub: Last-Modified + ETag |
| Incremental sync hints | YAML metadata file Last-Modified; commit log for human-readable change feed |
| Stable ID format | sap:{date}-{bill-id-slug} e.g. sap:2025-01-22-s5, sap:2012-02-06-hr1734 |
| Status | exploring (schema drafted, ingestion not built) |

Primary: unitedstates/statements-of-administration-policy GitHub repo. Same maintainers as congress-legislators (Flatgov team). Hosts:

  • Per-administration YAML metadata files (archive/44-Obama.yaml, archive/47-Trump.yaml).
  • PDFs of every SAP from Obama onward at archive/statements/{admin}/{congress}/{filename}.pdf (filenames are arbitrary per the repo README; {admin} is number-prefixed, e.g. 46-Biden).
  • Pre-Obama SAPs reference UCSB American Presidency Project URLs (no PDF; HTML scrape needed for body text).
  • Continuously scraped via scraper.py against current whitehouse.gov and ucsb_scraper.py for archive.

This is the canonical aggregate. Public domain (CC0). No auth. Updates trail OMB posting; plan for a lag of roughly 1-3 days rather than real-time delivery.

Secondary: whitehouse.gov/omb/statements-of-administration-policy/. The live source. The mirror scrapes it; we do not need to scrape directly. Keep as a fallback if the mirror lapses.

Skip: scraping UCSB American Presidency Project for pre-Obama text. The mirror has the metadata + URL pointers; we can fetch UCSB pages on-demand for body text if needed (mostly historical context for v1).

The presidents.yaml file at the repo root maps administration directory names (44-Obama, 47-Trump) to canonical president records (with bioguide IDs from executive.yaml in legislators & committees). Cross-reference for clean president-to-SAP joins.

GitHub raw / Pages: open, fast, well-cached. No special handling needed. Last-Modified and ETag headers are reliable for conditional GET. See the mirror repo for the raw YAML / PDF path layout.

The whitehouse.gov live listing has an intermittent bot wall (varies by White House administration); the GitHub mirror is open and uncapped, which is why it's the primary source. This bot-wall is a documented assumption from earlier observation, not a fresh live probe (status is still exploring).

Josh reads three things from the mirror: a per-administration/congress directory listing (to discover new SAPs), the per-admin YAML metadata, and individual SAP PDFs. The git commit log gives a human-readable change feed for the admin UI. All of this rides standard GitHub HTTP surfaces — see the GitHub Contents API and Commits API docs for the raw endpoint shapes.

The live whitehouse.gov listing at https://www.whitehouse.gov/omb/statements-of-administration-policy/ is the fallback source only; its bot wall comes and goes from one White House administration to the next.

SAP PDFs live at archive/statements/{admin-dir}/{congress}/{filename}.pdf in the mirror (admin dirs are number-prefixed, e.g. 44-Obama, 47-Trump — see the repo README). The load-bearing Josh rule:

Do not parse the bill from the filename. Filenames are arbitrary — the repo README explicitly warns they are not a stable, parseable convention. In practice many follow a {YYYY-MM-DD}_{billslug} shape (ISO date, e.g. 2021-02-26_hr1319.pdf), but you must not depend on it. The authoritative bill mapping is the per-admin YAML's bills + congress fields — these are the source of truth for both the PDF inventory and the bill linkage. Roughly 34 of ~1,095 SAPs (~3%) are multi-bill, amendment-targeted, or prose-titled, so no deterministic filename→bill_id slug derivation holds across the corpus. Read bills/congress from the YAML entry and resolve the soft FK from there.
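
The rule above can be sketched as a small resolver that works only from an already-parsed YAML entry. This is a minimal sketch under stated assumptions: the entry shape (a `congress` integer plus a `bills` list of slugs such as "hr1319") is assumed rather than taken from the mirror README, and `resolve_bills` is a hypothetical helper name. The `bill:{congress}-{type}-{number}` output follows the bill_id example in the schema.

```python
import re
from typing import Optional

# Assumption: each parsed YAML entry exposes `congress` (int) and `bills`
# (a list of slugs such as "hr1319"). Check the mirror README for the real shape.
BILL_SLUG = re.compile(r"^(hconres|sconres|hjres|sjres|hres|sres|hr|s)(\d+)$")


def resolve_bills(entry: dict) -> tuple[Optional[str], list[str]]:
    """Return (canonical bill_id, extra bill_ids) for one YAML entry.

    The PDF filename is never consulted. Entries whose bills do not look
    like a bill slug (prose titles, amendments) resolve to (None, []).
    """
    congress = entry.get("congress")
    resolved = []
    for slug in entry.get("bills") or []:
        match = BILL_SLUG.match(str(slug).strip().lower())
        if congress is not None and match:
            bill_type, number = match.groups()
            resolved.append(f"bill:{congress}-{bill_type}-{int(number)}")
    if not resolved:
        return None, []
    # First listed bill becomes bill_id; the rest would land in bills_extra.
    return resolved[0], resolved[1:]
```

An entry that returns `(None, [])` is a candidate for the fuzzy title match and manual-review path rather than a silent drop.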

The SAP text begins with a position statement. The high-signal categorical:

| Category | Body text cue |
| --- | --- |
| strongly_supports | "The Administration strongly supports..." |
| supports | "The Administration supports..." |
| does_not_oppose | "The Administration does not oppose..." |
| concerns | "The Administration has concerns about..." or "The Administration is concerned that..." |
| opposes | "The Administration opposes..." |
| strongly_opposes | "The Administration strongly opposes..." |
| veto_threat | Body contains "...the President's senior advisors would recommend that he veto..." |

The veto threat is a separate flag layered on top of the position, not a position value itself: strongly_opposes + veto_threat is the strongest stance. Some OMB SAPs explicitly say the President "will" veto; others use the "senior advisors would recommend" formulation.
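
A keyword classifier over these cues could look like the sketch below. It is illustrative only: the regexes cover just the phrasings listed in the table, `classify` is a hypothetical name, and anything that matches no cue falls through to `unknown`, as the schema expects.

```python
import re

# Cue patterns taken from the table above. The plain and "strongly" forms
# cannot both match at the same spot, so list order does not matter.
POSITION_CUES = [
    ("strongly_supports", r"\bAdministration\s+strongly\s+supports\b"),
    ("strongly_opposes", r"\bAdministration\s+strongly\s+opposes\b"),
    ("does_not_oppose", r"\bAdministration\s+does\s+not\s+oppose\b"),
    ("concerns", r"\bAdministration\s+(?:has\s+concerns\s+about|is\s+concerned\s+that)\b"),
    ("supports", r"\bAdministration\s+supports\b"),
    ("opposes", r"\bAdministration\s+opposes\b"),
]
VETO_CUES = re.compile(
    r"(?:would|will)\s+recommend\s+that\s+(?:he|she)\s+veto|\bwill\s+veto\b",
    re.IGNORECASE,
)


def classify(body: str) -> tuple[str, bool]:
    """Return (position, is_veto_threat) for extracted SAP body text."""
    # Collapse PDF line breaks so cues that wrap across lines still match.
    text = " ".join(body.split())
    position = "unknown"
    earliest = len(text) + 1
    for label, pattern in POSITION_CUES:
        match = re.search(pattern, text, re.IGNORECASE)
        # Prefer the cue that appears first, since the position statement
        # opens the SAP.
        if match and match.start() < earliest:
            position, earliest = label, match.start()
    return position, bool(VETO_CUES.search(text))
```

Keeping the veto check independent of the position lookup mirrors the point that the two signals are orthogonal.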

Per-bill type (from the per-admin YAML's bills field)


Same enum as Bills' bill_type: hr, s, hjres, sjres, hconres, sconres, hres, sres. Read it from the YAML bills entry (not the filename). Concurrent / simple resolutions rarely get SAPs (they don't reach the President's desk) but exist.

sap:{YYYY-MM-DD}-{billslug} — preserves both the issue date and the bill it addresses.

Examples:

  • sap:2025-01-22-s5
  • sap:2012-02-06-hr1734
  • sap:2025-02-24-hjres20

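The combination is unique in practice because OMB rarely issues more than one SAP per bill per day; a same-day revised SAP is the known exception and needs a disambiguating suffix. (Repeat SAPs on the same bill at different stages get separate dates.)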
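
As a rough illustration of the ID format, the helper below builds an ID from an issue date and a bill designation. `make_sap_id` is a hypothetical name, and the normalization (lowercase, strip periods and spaces) is an assumption about how designations like "H.J.Res. 20" should collapse to the slugs shown in the examples.

```python
from datetime import date


def make_sap_id(date_issued: date, bill_slug: str) -> str:
    """Build 'sap:{YYYY-MM-DD}-{billslug}' from an issue date and a bill."""
    # Assumed normalization: "H.J.Res. 20" and "hjres20" yield the same slug.
    slug = bill_slug.strip().lower().replace(".", "").replace(" ", "")
    return f"sap:{date_issued.isoformat()}-{slug}"
```

For example, `make_sap_id(date(2025, 1, 22), "s5")` yields `sap:2025-01-22-s5`. The bill designation should come from the YAML entry, not from the PDF filename.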

Each SAP entry comes from a per-administration YAML file in the mirror, whose fields supply the metadata Josh lands. bill_id provenance: it is resolved from the YAML entry's bills + congress fields, never from the PDF filename. Obama+ entries carry a local file: path to the PDF; pre-Obama entries instead carry a url: pointing at the UCSB American Presidency Project (no PDF — body text needs an HTML scrape). See the mirror README for the full field-by-field YAML layout, and the UCSB SAP archive for the pre-Obama source pages.

The PDF is short (1-3 pages typically). Body structure:

EXECUTIVE OFFICE OF THE PRESIDENT
OFFICE OF MANAGEMENT AND BUDGET
WASHINGTON, D.C. 20503
[Date — usually a Tuesday or Wednesday]
STATEMENT OF ADMINISTRATION POLICY
[BILL TITLE — H.R. 1734 — Civilian Property Realignment Act of 2011]
The Administration [strongly supports / supports / opposes / strongly opposes / has concerns about] [bill]
because...
[Body paragraphs — typically 200-1000 words. Discusses specific provisions, agency
implementation concerns, alternative approaches the administration prefers.]
[Veto threat (when applicable):]
"If [bill] were presented to the President in its current form, his senior advisors
would recommend that he veto the bill."
[Sometimes a closing line summarizing the administration's preferred path forward.]

OCR / text extraction is straightforward — short, single-column, no tables.

| Metric | Value |
| --- | --- |
| Per congress | ~50-200 SAPs (varies by administration's level of legislative engagement) |
| 119th congress to-date | ~108 |
| All-time PDFs in the mirror | ~1,095 (systematic from Obama 2009 onward; ~161 MB total) |
| Multi-bill / amendment / prose-titled | ~34 of ~1,095 (~3%); no single parseable bill in the filename |
| Per-PDF size | 50-200 KB |
| Per-PDF text length | ~500-1500 words |

On-disk corpus footprint: ~161 MB of PDFs.

  1. Daily: poll directory listing for the current administration + congress via the GitHub Contents API. Diff against our sap_statements.id.
  2. For each new file: fetch PDF, parse body, extract position + bill linkage.
  3. Weekly: poll archive/{admin}.yaml for older administrations to catch retroactive additions.
  4. GitHub commits API: human-readable change tracking. Useful for the admin UI to show "SAPs added this week."

Last-Modified / ETag make conditional GET cheap — see GitHub's REST best practices. The repo is well-maintained but not real-time — expect a 1-3 day lag from OMB posting to the mirror updating.
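
A conditional GET against the mirror might be sketched as follows, using only the standard library. The function names are hypothetical, and the caller is assumed to persist the ETag and Last-Modified values from the previous successful fetch.

```python
import urllib.error
import urllib.request
from typing import Optional


def conditional_headers(etag: Optional[str], last_modified: Optional[str]) -> dict:
    """Build validator headers from the values stored on the last fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url: str, etag: Optional[str] = None,
                     last_modified: Optional[str] = None):
    """Return (body, etag, last_modified), or None when the server says 304."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        # urllib surfaces 304 Not Modified as an HTTPError.
        if err.code == 304:
            return None
        raise
```

A `None` result means the stored copy is still current, so the daily poll can skip parsing entirely.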

-- ============================================================
-- Statements of Administration Policy
-- ============================================================
CREATE TABLE sap_statements (
    id text PRIMARY KEY, -- 'sap:2025-01-22-s5'
    -- When + who
    date_issued date NOT NULL,
    administration text NOT NULL, -- '47-Trump' (matches presidents.yaml)
    president_bioguide text, -- soft FK to executive.yaml records; deliberately no REFERENCES constraint
    congress smallint NOT NULL,
    -- What
    document_title text NOT NULL,
    bill_id text, -- 'bill:119-s-5'; soft FK
    bill_type text, -- 's', 'hr', 'hjres', etc.
    bill_number int,
    bills_extra text[], -- when SAP covers multiple bills (rare)
    -- Position (extracted from body text)
    position text CHECK (position IN (
        'strongly_supports', 'supports', 'does_not_oppose',
        'concerns', 'opposes', 'strongly_opposes',
        'unknown'
    )),
    is_veto_threat boolean NOT NULL DEFAULT false,
    position_confidence numeric(3,2), -- 0.0-1.0; how confident position parser was
    -- Body
    body_text text, -- extracted from PDF
    body_pdf_url text, -- raw.githubusercontent.com path
    fallback_url text, -- UCSB URL when no PDF (pre-Obama)
    -- Source
    source_yaml_url text,
    fetched_from_url text, -- original whitehouse.gov / archives URL
    -- Lifecycle
    raw_pdf bytea, -- gzipped PDF bytes (small enough to inline)
    fetched_at timestamptz NOT NULL,
    parsed_at timestamptz,
    inserted_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX sap_statements_date ON sap_statements (date_issued DESC);
CREATE INDEX sap_statements_bill ON sap_statements (bill_id) WHERE bill_id IS NOT NULL;
CREATE INDEX sap_statements_admin ON sap_statements (administration, date_issued DESC);
CREATE INDEX sap_statements_position ON sap_statements (position) WHERE position IS NOT NULL;
CREATE INDEX sap_statements_veto ON sap_statements (date_issued DESC) WHERE is_veto_threat;
ALTER TABLE sap_statements ADD COLUMN search_tsv tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(document_title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(body_text, '')), 'D')
    ) STORED;
CREATE INDEX sap_statements_search ON sap_statements USING gin (search_tsv);

Schema decisions worth flagging:

  • raw_pdf inlined as gzipped bytea — SAPs are small (<200 KB each, ~1,095 total ≈ 161 MB). Worth keeping the raw PDF for re-extraction.
  • position as a normalized enum with position_confidence indicating parser confidence. Lean conservative — when extraction is ambiguous, set position='unknown' and let downstream consumers fall back to body-text search.
  • is_veto_threat is a separate boolean — the categorical "strongly opposes" doesn't always carry a veto threat, and "opposes" sometimes does. They're orthogonal.
  • bills_extra text[] for the rare multi-bill SAPs — the YAML supports a list under bills:. Most have one bill; we denormalize the canonical one to bill_id and keep extras here.
  • president_bioguide as soft FK — every modern president has a bioguide via executive.yaml.
  • No vector chunks initially: bodies are short (~500-1,500 words each). Full-text search via tsvector covers it. Add sap_chunks if/when retrieval needs similarity over the corpus.

  1. Pull presidents.yaml to map administration dirs to canonical president records.
  2. Pull each archive/{admin}.yaml. Obama+ entries carry a local file: path to a PDF; pre-Obama entries carry a UCSB url: instead.
  3. For Obama+ (44, 45, 46, 47):
    • List directory contents at archive/statements/{admin}/{congress}/.
    • For each PDF, fetch raw bytes.
    • Extract text via pdftotext / pdfplumber.
    • Parse position categorical from opening paragraph. Resolve bill_id from the per-admin YAML's bills + congress fields (the filename is arbitrary and must not be parsed for bill IDs).
    • Insert row.
  4. For pre-Obama (administrations 40-43, Reagan through George W. Bush):
    • Iterate YAML rows.
    • Where file is set, fetch local PDF.
    • Where url is set (UCSB), defer body text — store metadata only with body_text=NULL and fallback_url. Lazy-fetch body on first agent request, parse text from UCSB HTML.
  5. ~1,095 PDF-backed rows (~161 MB). With 4 concurrent workers the full backfill runs in minutes.
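
The four-worker backfill in the last step can be sketched with a thread pool. This is a minimal sketch: `backfill` and the injected `fetch` callable are hypothetical, and the real fetcher would wrap whatever HTTP client the ingester uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def backfill(paths: Iterable[str], fetch: Callable[[str], bytes],
             workers: int = 4) -> dict:
    """Fetch every path with a small worker pool.

    Returns {path: bytes} for successes and {path: Exception} for failures,
    so one bad PDF does not abort the whole run.
    """
    paths = list(paths)

    def safe_fetch(path: str):
        try:
            return fetch(path)
        except Exception as exc:  # keep going; caller decides how to retry
            return exc

    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its outcome.
        for path, outcome in zip(paths, pool.map(safe_fetch, paths)):
            results[path] = outcome
    return results
```

Injecting `fetch` keeps the pool logic testable offline and lets the fallback source reuse the same loop.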

  1. Daily 12:00 UTC: poll the current administration + congress directory via the GitHub Contents API.
  2. Diff vs sap_statements.id. New filenames → fetch PDF + parse + insert.
  3. Weekly: poll older admin directories for retroactive additions (rare but happens).
  4. Weekly: GitHub commits API to populate human-readable change feed in admin UI.

Source key: sap_statements. State stores last-seen filenames per admin/congress.
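
The daily diff against last-seen filenames might look like the sketch below. It relies on the `name` and `type` keys that each item in a GitHub Contents API directory listing carries. `new_pdf_entries` and the idea of holding state as a set of names are illustrative assumptions.

```python
def new_pdf_entries(listing: list[dict], seen_names: set[str]) -> list[dict]:
    """Return Contents API items for PDFs not yet recorded in sync state."""
    return [
        item
        for item in listing
        if item.get("type") == "file"
        and item.get("name", "").lower().endswith(".pdf")
        and item["name"] not in seen_names
    ]
```

Because filenames are arbitrary, they serve here only as a change-detection key. Bill linkage still comes from the per-admin YAML.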

  • Position parser fails to classify. OMB writers occasionally use unusual phrasing. Set position='unknown', position_confidence=0. Body text remains searchable.
  • Bill resolution fails for a YAML entry with no resolvable bill — the ~34/~1,095 (~3%) multi-bill, amendment-targeted, or prose-titled SAPs. Carry every bill listed under the YAML bills field; where none resolves cleanly, fall back to fuzzy match on document_title and flag for manual review.
  • PDF text extraction fails (rare; older SAPs scanned, not native PDF). Fall back to OCR via tesseract. Keep raw_pdf so re-extraction is possible later.
  • GitHub repo lapses or rate-limits the API. Fall back to direct White House scrape (which has its own bot wall sometimes). Document the scraping path as backup.
  • OMB issues a "revised" SAP on the same bill same day. Rare but possible. Schema's id would collide; suffix with -v2 if detected.
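
For the same-day revision case, one possible way to apply the -v2 suffix is sketched below. `unique_sap_id` is a hypothetical helper, and continuing to -v3 and beyond is an assumption that goes past what the rule above states.

```python
def unique_sap_id(base_id: str, existing_ids: set[str]) -> str:
    """Return base_id, or base_id with a -vN suffix if it is already taken."""
    if base_id not in existing_ids:
        return base_id
    version = 2
    # Assumption: further same-day revisions keep counting up (-v3, -v4, ...).
    while f"{base_id}-v{version}" in existing_ids:
        version += 1
    return f"{base_id}-v{version}"
```

The first SAP for a bill and date keeps the unsuffixed ID, so existing references stay stable when a revision appears later.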

These don't block ingestion but should be resolved before this source is "shipped":

  • Position parser quality. A simple keyword classifier ("Administration strongly supports" → strongly_supports) covers most. Some SAPs phrase differently. Build a small evaluation set + measure F1.
  • Veto threat detection. Watching for "would recommend that he veto" / "will veto" / "his senior advisors". A trained classifier or a simple regex set both work. Lean simple regex for v1; track precision/recall.
  • Pre-Obama body text. UCSB American Presidency Project pages have body text in HTML. Fetching ~1,500 historical pages (Reagan-Bush43) is fine but adds an HTML scraper. Defer to v1.x; ship v1 with metadata-only for pre-Obama.
  • Multi-bill SAPs. Currently rare (a few percent). Schema handles via bills_extra; extraction needs to populate it correctly. Test with a known multi-bill case.
  • SAPs on amendments / motion to proceed. Most SAPs target final-passage bills, but a few target an amendment to a bill or a motion to proceed. The slug captures the bill but loses the stage. Decide if bill_stage is worth adding (it would be similar to CBO's bill_stage).
  • Whitehouse.gov direct scrape as fallback. Per-administration White House URLs change every transition (obamawhitehouse.archives.gov, trumpwhitehouse.archives.gov). The unitedstates mirror tracks them. If we fall back to direct scrape, code must know which URL pattern is current.
  • Veto threats vs actual vetoes. Cross-referencing veto-threat SAPs against actual presidential vetoes (which would come from a separate source) is a research-grade analysis target. Worth flagging in the schema as future enrichment but not blocking.
  • Cross-administration coverage of OBE-d (overtaken by events) SAPs. Some SAPs were posted and then withdrawn. The mirror keeps them. One option is ingesting them with is_active=false, but the schema has no such flag today and too few cases have been observed to model it. Defer.
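
For the position-parser evaluation mentioned in the first open question, per-label F1 over a hand-labeled set could be computed as in this sketch. The function name and the parallel-list input format are assumptions; no evaluation set exists yet.

```python
def per_label_f1(gold: list[str], predicted: list[str]) -> dict[str, float]:
    """Compute F1 = 2*TP / (2*TP + FP + FN) for each position label."""
    pairs = list(zip(gold, predicted))
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = (2 * tp / denom) if denom else 0.0
    return scores
```

Reporting per label rather than a single average makes it easier to see whether rare categories such as does_not_oppose are the ones being missed.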