source · planned · p0

Statements of Administration Policy ingester

statements-of-administration-policy-ingester · updated 2026-05-29T00:00:00Z

A Statement of Administration Policy (SAP) is an OMB memorandum on
behalf of the President expressing the administration's position on
pending legislation. SAPs are short (typically 1-3 pages), structured, and
high-signal: a "Strongly Opposes" SAP frequently signals a veto threat;
a "Supports" SAP often accompanies the President's signature when the
bill becomes law. For Josh, SAPs answer "what's the White House position
on this bill", a question every Hill staffer asks weekly. SAPs are
FK'd into bills via the per-admin YAML's bills/congress fields (the
repo README warns that filenames are arbitrary, so bills cannot be
parsed from the slug). A small but high-value source.

As a Hill staffer using my own AI agent, I want to ask "did the White House issue a SAP on the SPEED Act, and what's the position" so that I get the categorical signal (strongly_supports / opposes / veto_threat) and the SAP text.

As an OSS self-hoster, I want to run `josh ingest sap` on a fresh substrate so that I can populate SAPs from the unitedstates GitHub mirror on my own instance.

As a downstream agent doing vote-prediction analysis, I want SAPs joined to bills with the position categorical extracted so that "what bills did the White House threaten to veto this Congress" is a single SQL query.

  1. When `josh ingest sap` is run on the deployed substrate, the system shall populate `saps` rows whose six citation columns are all non-null.
  2. When a SAP PDF is fetched, the system shall write the raw payload under `/data/corpus/sap/bodies/raw/` and a normalized Markdown body under `/data/corpus/sap/bodies/markdown/`.
  3. While the ingester is running, if a single SAP fails to parse, then the system shall log the failure to `ingestion_logs` (with the offending file path) and continue with the next record.
  4. Where a SAP's per-admin YAML entry lists one or more bills (the `bills` + `congress` fields — filenames are arbitrary per the repo README, and ~34/1095 SAPs are multi-bill/amendment/prose), the system shall populate `sap_bills` linking `saps.id` to `bills.id` (resolving the soft-FK when the bill is in the substrate; raw YAML bill reference preserved otherwise).
  5. When the SAP body text is parsed, the system shall extract the position categorical (`strongly_supports` / `supports` / `does_not_oppose` / `concerns` / `opposes` / `strongly_opposes` / `veto_threat`) into `saps.position`.
  6. When an FTS5 query is executed against `saps_fts`, the system shall return BM25-ranked results.
  7. When an incremental run completes, the system shall update `ingestion_source_state.last_seen_key` so the next run only re-fetches changed YAMLs / new PDFs (per-file conditional GET: `If-Modified-Since` + `If-None-Match`, using the stored `Last-Modified` and `ETag` values).
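
The incremental behaviour in criterion 7 can be sketched as follows. This is a minimal illustration, not the ingester's actual code: the `Validators` dataclass and both function names are hypothetical, and only the HTTP header names and the 304 status code come from the HTTP standard.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Validators:
    """Validators saved from the previous response for one file (hypothetical shape)."""
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(prev: Validators) -> Dict[str, str]:
    """Build request headers that let the server answer 304 Not Modified."""
    headers: Dict[str, str] = {}
    if prev.etag:
        # The stored ETag is echoed back in If-None-Match.
        headers["If-None-Match"] = prev.etag
    if prev.last_modified:
        # The stored Last-Modified value is echoed back in If-Modified-Since.
        headers["If-Modified-Since"] = prev.last_modified
    return headers


def is_unchanged(status_code: int) -> bool:
    """A 304 response means the cached copy is current and the file can be skipped."""
    return status_code == 304
```

On a first run no validators are stored, so `conditional_headers(Validators())` returns an empty dict and the request behaves like a plain GET.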
kind: bash

Command

set -euo pipefail

# === CNV preflight: substrate reachable? loadavg sane? ===
# Three-state exit: 0 PASS, 1 FAIL, 77 CNV.
if ! ssh -o ConnectTimeout=5 -o BatchMode=yes josh 'true' 2>/dev/null; then
  echo "CNV: ssh josh unreachable; cannot verify" >&2
  exit 77
fi
LOAD1=$(ssh josh 'cut -d" " -f1 /proc/loadavg')
if awk "BEGIN { exit !($LOAD1 > 8.0) }"; then
  echo "CNV: josh loadavg=${LOAD1} > 8.0; refusing to verify under load" >&2
  exit 77
fi

# === Smoke ingest ===
ssh josh 'docker exec josh-ingester josh ingest sap --congress 119'

# === Asserting probes ===
python3 - <<'PY'
import subprocess

def q(sql: str) -> int:
    out = subprocess.check_output(
        ["ssh", "josh", "docker", "exec", "josh-core",
         "sqlite3", "/data/josh.db", sql],
        text=True,
    ).strip()
    return int(out) if out else 0

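# Probe 1 (AC #1): SAPs with non-null citation columns.
# NOTE: only three of the six citation columns are checked below; the
# remaining three must be added to the WHERE clause to fully cover AC #1.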
saps = q("""
    SELECT COUNT(*) FROM saps
    WHERE congress = 119
      AND source_url IS NOT NULL AND raw_sha256 IS NOT NULL
      AND citation_string IS NOT NULL;
""")
assert saps >= 50, f"FAIL probe-1: saps={saps}, expected >= 50"

# Probe 2 — AC #4: SAPs mapped to >=1 bill via the per-admin YAML.
# NOT 1:1 with saps — ~3% (multi-bill/amendment/prose) carry zero or
# multiple bill links, so we assert the distinct-SAP-with-a-link ratio.
linked_saps = q("""
    SELECT COUNT(DISTINCT sb.sap_id) FROM sap_bills sb
    JOIN saps s ON s.id = sb.sap_id
    WHERE s.congress = 119;
""")
link_ratio = linked_saps / saps if saps else 0.0
assert link_ratio >= 0.95, \
    f"FAIL probe-2: linked_saps/saps = {linked_saps}/{saps} = {link_ratio:.3f}, expected >= 0.95"

# Probe 3 — AC #5: position categorical extracted on >=95% of SAPs
positions = q("""
    SELECT COUNT(*) FROM saps
    WHERE congress = 119
      AND position IN ('strongly_supports','supports','does_not_oppose',
                       'concerns','opposes','strongly_opposes','veto_threat');
""")
ratio = positions / saps if saps else 0.0
assert ratio >= 0.95, \
    f"FAIL probe-3: positions/saps = {positions}/{saps} = {ratio:.3f}, expected >= 0.95"

print(f"PASS: saps={saps} linked_saps={linked_saps} link_ratio={link_ratio:.3f} positions={positions} ratio={ratio:.3f}")
PY

Expect

Stdout ends with PASS where saps >= 50 AND linked_saps/saps >= 0.95 AND positions/saps >= 0.95. Exit codes: 0 = PASS, 1 = FAIL (probe assertion), 77 = CNV.

Smoke run only — 119th Congress is the current one. Full backfill (Reagan administration onward via the unitedstates GitHub mirror, ~1,095 PDFs / ~161 MB all-time) runs in minutes. Adversarial mutation suite lives at `docs/spec/mutations/statements-of-administration-policy-ingester.yaml`.

  • Backfill scope: full 1985-present (the GitHub mirror holds ~1,095 PDFs / ~161 MB all-time, systematically from Obama 2009 onward) vs Obama-onward vs current administration only. Lean: full — corpus is tiny, breadth is free.
  • Pre-Obama SAPs reference UCSB American Presidency Project URLs (no PDF; HTML scrape needed for body text). Pull body text in v1 or store metadata-only? Lean: metadata-only for pre-Obama; body text on-demand from UCSB if needed.
  • Position-categorical extraction: regex on body text (current plan, ~95% accurate per the position table in the source doc) vs LLM extraction. Lean: regex for v1; flag the 5% misses for review.
  • Bill-FK soft resolution policy when the bill isn't yet in the substrate: insert SAP with `bill_id=NULL` and re-resolve on a periodic job, or block SAP load? Lean: soft-FK with later resolution — bills ingester runs more often, and the per-admin YAML's `bills`/`congress` fields give a stable bill reference to re-resolve against.
  • President FK: link `saps.president_bioguide_id` to `legislators` executive entries (executive.yaml in `legislators-and-committees`)? Lean: yes — clean cross-source pivot.
  • State analogues (gubernatorial veto messages on state bills) — Phase 6+.
  • Real-time scraping of `whitehouse.gov/omb/statements-of-administration-policy/` directly — the GitHub mirror does this; we ride their work.
  • Body text of pre-Obama SAPs (UCSB-only) — metadata-only for v1.
  • Real-time webhooks / push subscriptions — surface-layer concern.

Source module follows the standard Source protocol. Per-source spec at
https://docs.usejosh.com/sources/statements-of-administration-policy/.

Discover/fetch/parse/load shape:
- discover: GitHub API listing of
archive/statements/{admin}/{congress}/ for each administration of
interest, where {admin} is the number-prefixed admin directory
(46-Biden, 45-Trump, 44-Obama — NOT lowercase biden). The
per-admin YAML metadata file (bills + congress per entry) is the
authoritative bill mapping and PDF inventory, since filenames are
arbitrary per the repo README. Watermark from
ingestion_source_state.last_seen_key='sap-{admin}-lastmod' is the
repo's last-commit timestamp affecting that admin's directory.
- fetch: per-SAP PDF from
https://raw.githubusercontent.com/unitedstates/statements-of-administration-policy/main/archive/statements/{admin}/{congress}/{date}_{slug}.pdf
where {admin} is number-prefixed (46-Biden) and {date} is ISO
YYYY-MM-DD (e.g. archive/statements/46-Biden/117/2021-02-26_hr1319.pdf),
to /data/corpus/sap/bodies/raw/. SHA256 for raw_sha256.
Conditional GET via If-Modified-Since / If-None-Match, using the stored
Last-Modified / ETag values.
- parse: PDF text extraction (PyMuPDF or pdfplumber — ideally same
chosen normalizer as CBO/GAO). Position extraction via regex on the
body's first ~500 chars (per the position table in the source doc).
Bill IDs read from the per-admin YAML's bills + congress fields
(filenames are arbitrary; the slug must not be parsed for bill IDs —
~34/1095 SAPs are multi-bill/amendment/prose).
- load: idempotent upsert on (date, bill_slug). President FK
resolved via presidents.yaml cross-reference into
legislators executive entries. Bill soft-FK resolved when the bill
is in the substrate. FTS5 sync triggers fire on insert/update/delete.
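
The fetch step above can be sketched as below. The URL template and the raw directory come from this spec; the helper names and the subdirectory layout under `/data/corpus/sap/bodies/raw/` are assumptions for illustration. The filename is passed through verbatim from the per-admin YAML, since the slug carries no reliable meaning.

```python
import hashlib
from pathlib import PurePosixPath

RAW_BASE = (
    "https://raw.githubusercontent.com/unitedstates/"
    "statements-of-administration-policy/main"
)
RAW_DIR = PurePosixPath("/data/corpus/sap/bodies/raw")


def pdf_url(admin: str, congress: int, filename: str) -> str:
    """Build the raw.githubusercontent.com URL for one SAP PDF.

    `admin` is the number-prefixed directory name (e.g. "46-Biden").
    """
    return f"{RAW_BASE}/archive/statements/{admin}/{congress}/{filename}"


def raw_sha256(payload: bytes) -> str:
    """Hex digest stored in the raw_sha256 citation column."""
    return hashlib.sha256(payload).hexdigest()


def raw_path(admin: str, congress: int, filename: str) -> PurePosixPath:
    """Destination for the raw payload (the admin/congress nesting is assumed)."""
    return RAW_DIR / admin / str(congress) / filename
```

For the example path quoted above, `pdf_url("46-Biden", 117, "2021-02-26_hr1319.pdf")` yields a URL ending in `archive/statements/46-Biden/117/2021-02-26_hr1319.pdf`.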

Migration shape: saps (with the six citation columns + position enum
+ president_bioguide_id FK), sap_bills (M:N — a SAP can address
multiple bills in rare cases; a bill can have multiple SAPs at different
stages), saps_fts virtual table + 3 sync triggers. No *_chunks_vec0
table in v1: SAPs are 1-3 pages, so FTS5 + per-document embedding (no
chunking) is enough, but we may add chunking in v1.x for multi-bill SAPs.
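
One way the `saps_fts` table and its three sync triggers might be shaped is the standard SQLite external-content pattern shown below. This is a sketch only: the `title` and `body` columns, the trigger names, and the `search` helper are assumptions, the real migration is written in Alembic, and the snippet requires a SQLite build with FTS5 enabled.

```python
import sqlite3

SCHEMA = """
CREATE TABLE saps (id INTEGER PRIMARY KEY, title TEXT, body TEXT);
CREATE VIRTUAL TABLE saps_fts USING fts5(
    title, body, content='saps', content_rowid='id'
);
-- Three sync triggers keep the external-content index aligned with saps.
CREATE TRIGGER saps_ai AFTER INSERT ON saps BEGIN
    INSERT INTO saps_fts(rowid, title, body)
    VALUES (new.id, new.title, new.body);
END;
CREATE TRIGGER saps_ad AFTER DELETE ON saps BEGIN
    INSERT INTO saps_fts(saps_fts, rowid, title, body)
    VALUES ('delete', old.id, old.title, old.body);
END;
CREATE TRIGGER saps_au AFTER UPDATE ON saps BEGIN
    INSERT INTO saps_fts(saps_fts, rowid, title, body)
    VALUES ('delete', old.id, old.title, old.body);
    INSERT INTO saps_fts(rowid, title, body)
    VALUES (new.id, new.title, new.body);
END;
"""


def search(conn: sqlite3.Connection, query: str) -> list:
    """Return saps.id values for an FTS5 match, best BM25 score first."""
    rows = conn.execute(
        "SELECT rowid FROM saps_fts WHERE saps_fts MATCH ? "
        "ORDER BY bm25(saps_fts)",
        (query,),
    )
    return [r[0] for r in rows]
```

FTS5's `bm25()` returns lower values for better matches, so a plain ascending `ORDER BY` puts the most relevant SAP first.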

Schedule: every 4 hours during active session weeks (SAPs land in
bursts on contested bills); daily otherwise. The mirror updates within
hours of OMB posting.

0 of 16 done.

  • t1 Alembic migration applied locally + production (saps + sap_bills + FTS5)
  • t2 Pydantic models in shared/josh_substrate/src/josh_substrate/models/sap.py
  • t3 Source module implementing the `Source` protocol
  • t4 Body normalization wired (PDF normalizer per shared CBO/GAO choice; raw + markdown paths)
  • t5 Citation metadata: all 6 columns populated; citation_for() formatter + 3-fixture unit tests (Obama-era, Trump-era, recent)
  • t6 FTS5 virtual table + 3 sync triggers wired into migration
  • t7 N/A for v1 — SAPs are short (1-3 pages); per-document embedding without chunking
  • t8 Unit tests for position regex against captured fixtures (one per categorical value: strongly_supports, supports, does_not_oppose, concerns, opposes, strongly_opposes, veto_threat)
  • t9 Integration test for full discover→fetch→parse→load pipeline against the GitHub mirror
  • t10 Smoke backfill (119th Congress, ~108 SAPs) clean per success_determiner
  • t11 Production backfill at agreed scope (per clarifications_needed) complete
  • t12 4-hourly cron registered and observed firing once
  • t13 No recurring errors in `ingestion_logs` for the last completed run
  • t14 Source doc updated; inventory rows in https://docs.usejosh.com/josh-data-sources/ + https://docs.usejosh.com/data-status/ flipped to ingestion-built
  • t15 Bill-soft-FK resolution test: every smoke-backfill SAP whose per-admin YAML lists a bill present in the substrate has a corresponding `sap_bills` row
  • t16 Position-extraction QA: ≥95% of smoke-backfill cohort has a non-NULL `position` value (regex hit rate target)
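
For t8 and t16, the position regex and its per-category fixtures might be shaped like the sketch below. The trigger phrases are illustrative assumptions, not the position table from the source doc, and the fixture sentences are synthetic rather than quoted SAP text. Patterns are ordered most-specific first so that "strongly opposes" is not swallowed by "opposes".

```python
import re
from typing import Optional

# Ordered most-specific first. Phrases are assumed for illustration.
POSITION_PATTERNS = [
    ("veto_threat", re.compile(
        r"\b(?:would|will)\s+veto\b|\brecommend\s+that\s+(?:he|she)\s+veto\b",
        re.I)),
    ("strongly_opposes", re.compile(r"\bstrongly\s+opposes\b", re.I)),
    ("strongly_supports", re.compile(r"\bstrongly\s+supports\b", re.I)),
    ("does_not_oppose", re.compile(r"\bdoes\s+not\s+oppose\b", re.I)),
    ("opposes", re.compile(r"\bopposes\b", re.I)),
    ("supports", re.compile(r"\bsupports\b", re.I)),
    ("concerns", re.compile(r"\bconcerns?\b", re.I)),
]


def extract_position(body: str, window: int = 500) -> Optional[str]:
    """Return the first matching categorical in the body's opening window.

    Returns None on a miss so the row can be flagged for review.
    """
    head = body[:window]
    for label, pattern in POSITION_PATTERNS:
        if pattern.search(head):
            return label
    return None


# Synthetic fixtures, one per categorical value.
FIXTURES = {
    "strongly_supports": "The Administration strongly supports passage of H.R. 1.",
    "supports": "The Administration supports H.R. 2.",
    "does_not_oppose": "The Administration does not oppose H.R. 3.",
    "concerns": "The Administration has concerns with H.R. 4.",
    "opposes": "The Administration opposes H.R. 5.",
    "strongly_opposes": "The Administration strongly opposes H.R. 6.",
    "veto_threat": "If presented with H.R. 7, the President would veto it.",
}
```

A parametrized test can then assert `extract_position(text) == label` for every fixture, and the t16 hit rate is the share of smoke-backfill SAPs for which the function returns a non-None value.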

No history yet.

docs/spec/statements-of-administration-policy-ingester.html · generated by bin/build-spec.py