GAO reports ingester

Header

Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.

Why

GAO reports are heavily-cited oversight + investigatory products from the
legislative branch's audit arm — the substrate's strongest single signal for
agency accountability and program-performance questions. Shipping GAO is
also the substrate's first PDF-heavy, anti-bot-walled source, and forces the
RSS-discovery + headed-browser-fetch pattern into the ingester framework.
GAO is one of CRS / GAO / committee-reports as the 6th-or-7th launch source
and is launch-desired (not launch-blocking).

User stories

As a policy analyst using my own AI agent, I want to ask the agent for recent GAO findings on a given program or topic so that I get cited oversight evidence without scraping gao.gov myself.

As an OSS self-hoster, I want to run `josh ingest gao-reports` on a fresh substrate so that I can populate GAO products from the public RSS + asset URLs on my own instance.

As an oversight-focused analyst, I want to correlate GAO findings on a program with the bills and hearings touching it so that my agent can answer "what has GAO said about X, and what is Congress doing about it".

Acceptance criteria (EARS)

When `josh ingest gao-reports --since 2026-01-01` is run on the deployed substrate, the system shall populate `gao_reports` rows whose six citation columns (`source_url`, `source_org`, `published_at`, `retrieved_at`, `raw_sha256`, `citation_string`) are all non-null.
The system shall enforce a UNIQUE constraint on `gao_reports.report_number` such that re-ingesting the same GAO product is an idempotent upsert, not a duplicate row.
When a GAO report PDF is fetched, the system shall write the raw payload under `/data/corpus/gao-reports/bodies/raw/` and a normalized Markdown body under `/data/corpus/gao-reports/bodies/markdown/`.
While the ingester is running, if a single PDF or product page fails to parse, then the system shall log the failure to `ingestion_logs` and continue with the next record (no whole-batch crash).
Where the RSS item exposes per-topic tags or per-recommendation rows, the system shall populate the `gao_report_topics` and `gao_report_recommendations` junction tables for that report (no SQLite arrays).
When an FTS5 query `MATCH 'medicare fraud'` is executed against `gao_reports_fts`, the system shall return BM25-ranked results with the per-column weight tuple `(10.0, 4.0, 2.0, 1.0)` over (title, fast_facts, what_recommends, body_text).
While PDF normalization is pending, the system shall bound the `ingestion_normalization_queue` depth such that no more than N (configured) PDFs are extracted concurrently — PDF extraction is CPU-heavy and must not starve the rest of the ingester.
Where a report body exceeds ~5K tokens, the system shall enqueue rows in `ingestion_embedding_queue` for chunk-level embedding (most GAO reports cross this threshold).
When an incremental run completes, the system shall update `ingestion_source_state.last_seen_key` for the `gao-reports-rss` source key such that the next run only fetches reports newer than the watermark.

Success determiner

kindbash

Command

set -euo pipefail
# Smoke backfill (25 docs — PDFs are heavy, keep it small)
ssh josh 'docker exec josh-ingester josh ingest gao-reports --since 2026-04-01 --max 25'
# Verify rows landed with all citation columns
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
  SELECT COUNT(*) FROM gao_reports
  WHERE published_at >= ''2026-04-01''
    AND source_url IS NOT NULL AND source_org IS NOT NULL
    AND retrieved_at IS NOT NULL AND raw_sha256 IS NOT NULL
    AND citation_string IS NOT NULL;"'
# FTS5 sanity
ssh josh 'docker exec josh-core sqlite3 /data/josh.db \
  "SELECT COUNT(*) FROM gao_reports_fts WHERE gao_reports_fts MATCH '\''agency'\'';"'

Expect

>= 25 fully-populated rows; FTS5 match returns >= 1.

Smoke run only. PDF normalization runs out-of-band via the ingestion_normalization_queue, so body_text may still be empty for some rows immediately after ingest — that's expected; the citation columns must still be populated. Full backfill (scope TBD per clarifications_needed) runs detached on the server per `add-a-new-source.html`.

Clarifications needed

Backfill scope: current FY only (~600–1000 reports), 2010-forward (~12K reports per https://docs.usejosh.com/sources/gao-reports/), or full ~50K online corpus 1995-present? Lean is 2010-forward but confirm before kicking off.
Headed-browser dependency story: does `josh-ingester` ship with Playwright + a Chromium binary baked into the image, or do we run a separate `josh-ingester-browser` worker that the main ingester RPCs into? Touches image size and the deploy.yml shape.
PDF normalization tool: PyMuPDF (fast, decent layout), pdfplumber (best for tables, slow), or Marker (LLM-assisted, highest fidelity, expensive). GAO reports are layout-heavy with tables; pick before t4 lands. May need a new `pdf_v1` normalizer spec — flag if not yet a substrate spec.
Anti-bot strategy concretely: rotating User-Agent + browser fingerprint, request pacing (RPS cap), backoff on 403/429/503, residential proxy (no, prefer not), or just slow + headed? Akamai-style fingerprinting per https://docs.usejosh.com/sources/gao-reports/#access-notes.
Whether to ingest GAO testimony (URL suffix `T`) and correspondence (suffix `R`) into the same `gao_reports` table with a `product_type` discriminator, or split into separate tables. Lean: same table, discriminator column — but confirm.
Recommendations refresh: monthly polling of every report with an Open* recommendation is ~10K browser fetches / month per the source doc. Is that load acceptable given the anti-bot wall, or do we cap to N most-recent reports?
Pre-2003 PDFs are scanned; OCR adds compute cost. Default to skipping OCR (leave `body_text` NULL, expose PDF link only) — confirm.
Cross-source FK story: GAO product pages list 'Requesters' (Members / committees that requested the work). Worth a `gao_report_requesters` junction with FKs to legislators / committees — in v1 or defer?

Out of scope

State equivalents and federal IG (Inspector General) reports — Phase 6+.
GAO bid-protest decisions at `/legal/bid-protests/...` — different URL space, different product line, separate value. Defer to v1.x.
GAO video / audio testimony recordings — text-only ingestion in v1.
OCR of pre-2003 scanned PDFs — leave `body_text` NULL, expose PDF link only.
Authors / GAO-contact directory ("what else has Director X published") — defer to v1.x.
Real-time webhooks / push subscriptions — surface-layer concern.

Dependencies

Plan

Source module follows the standard Source protocol. Per-source spec at
https://docs.usejosh.com/sources/gao-reports/ (probe findings + indicative DDL —
the migration shipped with this spec is the source of truth).

Discover/fetch/parse/load shape:
- discover: https://www.gao.gov/rss/reports.xml (open to bots) is the
sole discovery feed. The per-topic feeds at /rss/topics/{slug}.xml
return a fast repeatable 404 (the mechanism is gone — not the Akamai wall,
not a timeout), so there is no secondary RSS safety net. Watermark from
ingestion_source_state.last_seen_key='gao-reports-rss' keyed on
<pubDate>. Backfill (out of scope for the success_determiner) uses a
headed-browser crawl of /reports-testimonies?page=N. Feed item links are
lowercase (/products/gao-26-107967), so normalize casing when deriving
the product-page URL and asset path.
- fetch: per-report product page via headed browser
(https://www.gao.gov/products/{report-number}, lowercase in feed links —
e.g. /products/gao-26-107967) saved to
/data/corpus/gao-reports/bodies/raw/<report-number>.html, plus the
deterministic asset PDF
(/assets/gao-{c}-{NNNNNN}.pdf) saved alongside as .pdf. SHA256 over
the PDF bytes is raw_sha256 for citation; the HTML is preserved for
re-parse when the Drupal frontend drifts.
- parse: extract structured sections from the product HTML
(title, GAO ID, publish + public-release dates, fast facts, what found,
why studied, what recommends, topics, agencies, per-recommendation rows).
Run the chosen PDF normalizer (PyMuPDF / pdfplumber / Marker — TBD per
clarifications) to produce a Markdown body under
/data/corpus/gao-reports/bodies/markdown/<report-number>.md. RSS
<description> is preserved as a fallback body when product-page parse
fails.
- load: idempotent upsert on (report_number) natural key (UNIQUE).
Junction-table inserts for topics, agencies, and recommendations
(no SQLite arrays). FTS5 sync triggers fire on insert/update/delete.
Body chunks enqueued for embedding when body length exceeds the
threshold.

Migration shape: gao_reports (id text PK = gao:GAO-26-107932,
report_number UNIQUE, product_type discriminator,
title/subtitle/fast_facts/what_found/why_studied/what_recommends,
published_at, public_release_date, citation columns,
body_normalizer_version), gao_report_topics(report_id, topic),
gao_report_agencies(report_id, agency_name, agency_id),
gao_report_recommendations(id, report_id, recommendation_number, text, agency_name, status, status_date, status_history), gao_reports_fts
virtual + 3 sync triggers, gao_chunks + gao_chunk_vecs (vec0 with
float[1024] and bit[1024]).

Anti-bot strategy: real-Chromium headed browser (same fetcher pattern
as Senate roll-call XML) with realistic UA + accept headers, pace at
1–2 RPS, exponential backoff on 403/429/503. Only the RSS endpoint
(/rss/reports.xml) is open to bots; /assets/... PDFs are now walled
equally to /products/... HTML (Akamai 403), so the headed browser is
required for both — there is no cheaper PDF path to test first.

PDF normalization is queued via ingestion_normalization_queue (bounded
depth) so the discover→fetch loop never blocks on extraction. The
ingester writes the row to gao_reports with the raw payload + citation
columns first; body_text and body_normalizer_version populate
asynchronously.

Schedule: 6-hourly RSS poll (GAO publishes 5–15 reports per business day —
daily would be fine, 6-hourly catches stragglers without spamming).
Monthly recommendations refresh runs as a separate cron entry against the
same source module.

Tasks

0 of 17 done.

Changelog

No history yet.