GAO reports ingester
Header
Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.
Why
GAO reports are heavily-cited oversight + investigatory products from the
legislative branch's audit arm — the substrate's strongest single signal for
agency accountability and program-performance questions. Shipping GAO is
also the substrate's first PDF-heavy, anti-bot-walled source, and forces the
RSS-discovery + headed-browser-fetch pattern into the ingester framework.
GAO is one of CRS / GAO / committee-reports as the 6th-or-7th launch source
and is launch-desired (not launch-blocking).
User stories
As a policy analyst using my own AI agent, I want to ask the agent for recent GAO findings on a given program or topic so that I get cited oversight evidence without scraping gao.gov myself.
As an OSS self-hoster, I want to run `josh ingest gao-reports` on a fresh substrate so that I can populate GAO products from the public RSS + asset URLs on my own instance.
As an oversight-focused analyst, I want to correlate GAO findings on a program with the bills and hearings touching it so that my agent can answer "what has GAO said about X, and what is Congress doing about it".
Acceptance criteria (EARS)
- When `josh ingest gao-reports --since 2026-01-01` is run on the deployed substrate, the system shall populate `gao_reports` rows whose six citation columns (`source_url`, `source_org`, `published_at`, `retrieved_at`, `raw_sha256`, `citation_string`) are all non-null.
- The system shall enforce a UNIQUE constraint on `gao_reports.report_number` such that re-ingesting the same GAO product is an idempotent upsert, not a duplicate row.
- When a GAO report PDF is fetched, the system shall write the raw payload under `/data/corpus/gao-reports/bodies/raw/` and a normalized Markdown body under `/data/corpus/gao-reports/bodies/markdown/`.
- While the ingester is running, if a single PDF or product page fails to parse, then the system shall log the failure to `ingestion_logs` and continue with the next record (no whole-batch crash).
- Where the RSS item exposes per-topic tags or per-recommendation rows, the system shall populate the `gao_report_topics` and `gao_report_recommendations` junction tables for that report (no SQLite arrays).
- When an FTS5 query `MATCH 'medicare fraud'` is executed against `gao_reports_fts`, the system shall return BM25-ranked results with the per-column weight tuple `(10.0, 4.0, 2.0, 1.0)` over (title, fast_facts, what_recommends, body_text).
- While PDF normalization is pending, the system shall bound the `ingestion_normalization_queue` depth such that no more than N (configured) PDFs are extracted concurrently — PDF extraction is CPU-heavy and must not starve the rest of the ingester.
- Where a report body exceeds ~5K tokens, the system shall enqueue rows in `ingestion_embedding_queue` for chunk-level embedding (most GAO reports cross this threshold).
- When an incremental run completes, the system shall update `ingestion_source_state.last_seen_key` for the `gao-reports-rss` source key such that the next run only fetches reports newer than the watermark.
Success determiner
Command
set -euo pipefail
# Smoke backfill (25 docs — PDFs are heavy, keep it small)
ssh josh 'docker exec josh-ingester josh ingest gao-reports --since 2026-04-01 --max 25'
# Verify rows landed with all citation columns
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
SELECT COUNT(*) FROM gao_reports
WHERE published_at >= ''2026-04-01''
AND source_url IS NOT NULL AND source_org IS NOT NULL
AND retrieved_at IS NOT NULL AND raw_sha256 IS NOT NULL
AND citation_string IS NOT NULL;"'
# FTS5 sanity
ssh josh 'docker exec josh-core sqlite3 /data/josh.db \
"SELECT COUNT(*) FROM gao_reports_fts WHERE gao_reports_fts MATCH '\''agency'\'';"'
Expect
Smoke run only. PDF normalization runs out-of-band via the ingestion_normalization_queue, so body_text may still be empty for some rows immediately after ingest — that's expected; the citation columns must still be populated. Full backfill (scope TBD per clarifications_needed) runs detached on the server per `add-a-new-source.html`.
Clarifications needed
- Backfill scope: current FY only (~600–1000 reports), 2010-forward (~12K reports per https://docs.usejosh.com/sources/gao-reports/), or full ~50K online corpus 1995-present? Lean is 2010-forward but confirm before kicking off.
- Headed-browser dependency story: does `josh-ingester` ship with Playwright + a Chromium binary baked into the image, or do we run a separate `josh-ingester-browser` worker that the main ingester RPCs into? Touches image size and the deploy.yml shape.
- PDF normalization tool: PyMuPDF (fast, decent layout), pdfplumber (best for tables, slow), or Marker (LLM-assisted, highest fidelity, expensive). GAO reports are layout-heavy with tables; pick before t4 lands. May need a new `pdf_v1` normalizer spec — flag if not yet a substrate spec.
- Anti-bot strategy concretely: rotating User-Agent + browser fingerprint, request pacing (RPS cap), backoff on 403/429/503, residential proxy (no, prefer not), or just slow + headed? Akamai-style fingerprinting per https://docs.usejosh.com/sources/gao-reports/#access-notes.
- Whether to ingest GAO testimony (URL suffix `T`) and correspondence (suffix `R`) into the same `gao_reports` table with a `product_type` discriminator, or split into separate tables. Lean: same table, discriminator column — but confirm.
- Recommendations refresh: monthly polling of every report with an Open* recommendation is ~10K browser fetches / month per the source doc. Is that load acceptable given the anti-bot wall, or do we cap to N most-recent reports?
- Pre-2003 PDFs are scanned; OCR adds compute cost. Default to skipping OCR (leave `body_text` NULL, expose PDF link only) — confirm.
- Cross-source FK story: GAO product pages list 'Requesters' (Members / committees that requested the work). Worth a `gao_report_requesters` junction with FKs to legislators / committees — in v1 or defer?
Out of scope
- State equivalents and federal IG (Inspector General) reports — Phase 6+.
- GAO bid-protest decisions at `/legal/bid-protests/...` — different URL space, different product line, separate value. Defer to v1.x.
- GAO video / audio testimony recordings — text-only ingestion in v1.
- OCR of pre-2003 scanned PDFs — leave `body_text` NULL, expose PDF link only.
- Authors / GAO-contact directory ("what else has Director X published") — defer to v1.x.
- Real-time webhooks / push subscriptions — surface-layer concern.
Dependencies
Plan
Source module follows the standard Source protocol. Per-source spec athttps://docs.usejosh.com/sources/gao-reports/ (probe findings + indicative DDL —
the migration shipped with this spec is the source of truth).
Discover/fetch/parse/load shape:
- discover: https://www.gao.gov/rss/reports.xml (open to bots) is the
sole discovery feed. The per-topic feeds at /rss/topics/{slug}.xml
return a fast repeatable 404 (the mechanism is gone — not the Akamai wall,
not a timeout), so there is no secondary RSS safety net. Watermark from
ingestion_source_state.last_seen_key='gao-reports-rss' keyed on
<pubDate>. Backfill (out of scope for the success_determiner) uses a
headed-browser crawl of /reports-testimonies?page=N. Feed item links are
lowercase (/products/gao-26-107967), so normalize casing when deriving
the product-page URL and asset path.
- fetch: per-report product page via headed browser
(https://www.gao.gov/products/{report-number}, lowercase in feed links —
e.g. /products/gao-26-107967) saved to
/data/corpus/gao-reports/bodies/raw/<report-number>.html, plus the
deterministic asset PDF
(/assets/gao-{c}-{NNNNNN}.pdf) saved alongside as .pdf. SHA256 over
the PDF bytes is raw_sha256 for citation; the HTML is preserved for
re-parse when the Drupal frontend drifts.
- parse: extract structured sections from the product HTML
(title, GAO ID, publish + public-release dates, fast facts, what found,
why studied, what recommends, topics, agencies, per-recommendation rows).
Run the chosen PDF normalizer (PyMuPDF / pdfplumber / Marker — TBD per
clarifications) to produce a Markdown body under
/data/corpus/gao-reports/bodies/markdown/<report-number>.md. RSS
<description> is preserved as a fallback body when product-page parse
fails.
- load: idempotent upsert on (report_number) natural key (UNIQUE).
Junction-table inserts for topics, agencies, and recommendations
(no SQLite arrays). FTS5 sync triggers fire on insert/update/delete.
Body chunks enqueued for embedding when body length exceeds the
threshold.
Migration shape: gao_reports (id text PK = gao:GAO-26-107932,report_number UNIQUE, product_type discriminator,title/subtitle/fast_facts/what_found/why_studied/what_recommends,published_at, public_release_date, citation columns,body_normalizer_version), gao_report_topics(report_id, topic),gao_report_agencies(report_id, agency_name, agency_id),gao_report_recommendations(id, report_id, recommendation_number, text,,
agency_name, status, status_date, status_history)gao_reports_fts
virtual + 3 sync triggers, gao_chunks + gao_chunk_vecs (vec0 withfloat[1024] and bit[1024]).
Anti-bot strategy: real-Chromium headed browser (same fetcher pattern
as Senate roll-call XML) with realistic UA + accept headers, pace at
1–2 RPS, exponential backoff on 403/429/503. Only the RSS endpoint
(/rss/reports.xml) is open to bots; /assets/... PDFs are now walled
equally to /products/... HTML (Akamai 403), so the headed browser is
required for both — there is no cheaper PDF path to test first.
PDF normalization is queued via ingestion_normalization_queue (bounded
depth) so the discover→fetch loop never blocks on extraction. The
ingester writes the row to gao_reports with the raw payload + citation
columns first; body_text and body_normalizer_version populate
asynchronously.
Schedule: 6-hourly RSS poll (GAO publishes 5–15 reports per business day —
daily would be fine, 6-hourly catches stragglers without spamming).
Monthly recommendations refresh runs as a separate cron entry against the
same source module.
Tasks
0 of 17 done.
- t1 Alembic migration applied locally + production (gao_reports + topics/agencies/recommendations junctions + FTS5 + vec0)
- t2 Pydantic models in shared/josh_substrate/src/josh_substrate/models/gao.py
- t3 Source module implementing the `Source` protocol
- t4 Body normalization wired (chosen PDF normalizer per clarification + raw HTML fallback; per-format converter unit-tested)
- t5 Citation metadata: all 6 columns populated; citation_for() formatter + unit tests on 3+ fixtures
- t6 FTS5 virtual table + 3 sync triggers wired into migration
- t7 Vector vec0 table wired (most GAO reports cross the 5K-token threshold)
- t8 Unit tests for product-page HTML parser and recommendations-table parser against captured fixtures
- t9 Integration test for the full discover→fetch→parse→load pipeline (RSS → headed-browser fetch → PDF extract → row insert)
- t10 Minimum-viable backfill complete at planned volume (scope TBD per clarification — lean: 2010-forward)
- t11 Incremental cron registered (6-hourly RSS poll) and observed firing once
- t12 No recurring errors in `ingestion_logs` for the last completed run
- t13 Source doc `https://docs.usejosh.com/sources/gao-reports/` updated with as-built notes; status flipped to ingestion-built
- t14 Inventory rows in `https://docs.usejosh.com/josh-data-sources/` and `https://docs.usejosh.com/data-status/` updated to ingestion-built / shipped
- t15 12-sample manual PDF→Markdown validation gate (per add-a-new-source.html — stratified across report length, table density, FY, with/without highlights; binary all-12 × all-12-criteria pass; non-optional given PDF extraction fragility)
- t16 URL verification gate: 10/10 random `gao_reports.source_url` records match upstream titles via WebFetch
- t17 PDF extraction unit tests against captured fixtures (short report, long report, table-heavy report, scanned/OCR-needed report — assertion: gracefully empty body_text rather than crash on the scanned one)
Changelog
No history yet.