GAO reports
Reports and testimonies from the Government Accountability Office, the legislative branch's audit and investigative arm. GAO produces in-depth investigations of federal programs, fraud audits, performance reviews, technology assessments, and Congressional testimony. Each report contains formal findings, often with recommendations to specific agencies — and a tracked implementation status that updates over years.
For Josh, GAO's value is two-fold: (1) substantive analysis the agent can cite when answering oversight or accountability questions, (2) the recommendations database — a structured row-per-recommendation-per-agency dataset that quantifies implementation status. The recommendations database is the highest-signal subset; treat it as a separate sub-source.
GAO publishes no official public API and no published developer/field-reference docs — it is a direct-scrape source. Almost everything Josh-specific on this page is operational reverse-engineering that no upstream doc states. The gao.gov site is the canonical reference for the raw content; the bot-wall, feed, ID, and schema notes below are ours.
Quick reference
| Source name | GAO reports |
| Publisher | Government Accountability Office |
| License | Public domain (US government works) |
| Coverage | 1971 – present (GAO predecessor "GAO Reports" since 1921 mostly paper). Online corpus reliable for ~1995 onward. |
| Volume | ~600-1000 reports + ~100-300 testimonies per year. Historical corpus ~50K+ reports. ~10,000+ active recommendations. |
| Storage estimate | ~10-15 GB raw PDF + HTML; ~2-4 GB extracted text |
| Site | https://www.gao.gov/ |
| Discovery feed | /rss/reports.xml (recent reports) — the sole open path (see Access notes). Per-topic feeds (/rss/topics/{slug}.xml) are dead (repeatable 404, mechanism gone). |
| Asset path pattern | https://www.gao.gov/assets/gao-{c}-{NNNNNN}.pdf (full report) and gao-{c}-{NNNNNN}-highlights.pdf — deterministic from the product number, but now bot-walled equally to product pages (Akamai 403). |
| Auth | None |
| Rate limits | None observed; HTML pages bot-walled (real Chrome required) |
| Caching headers | RSS: per-feed Last-Modified. PDFs: per-file Last-Modified. |
| Incremental sync hints | RSS <pubDate> + per-asset Last-Modified; sitemap exists but bot-walled |
| Stable ID format | gao:{report-number} e.g. gao:GAO-26-107932, gao:GAO-25-106451 |
| Status | exploring — schema drafted, ingestion not built |
Source priority decision
Primary: gao.gov direct. RSS for the discovery feed (open to bots), individual product pages via headed-browser, asset PDFs at the deterministic asset URL pattern.
Secondary: GovInfo GAO collections. Some GAO reports are in
/bulkdata/GAOREPORTS/... — sparse coverage, GovInfo only mirrors a subset that
GAO formally submits to GPO. Unreliable as primary.
Skip: ProPublica's GAO scraper / community mirrors. They lag the source and re-add a parsing layer. Direct is better.
Skip: Recommendations API third parties. A few exist; underlying data is on gao.gov already.
Access notes
HTML pages bot-walled. https://www.gao.gov/products/... and
https://www.gao.gov/sitemap.xml return 403 to non-browser clients. Same
Akamai-style fingerprinting we hit on Senate.gov. The RSS endpoint at
/rss/reports.xml is the sole open path — that's our discovery feed. Asset
URLs (/assets/gao-{c}-{NNNNNN}.pdf) are now walled equally to the product
pages (Akamai 403) — there is no longer a cheaper PDF path; the headed browser is
required for both.
For ingestion, the same headed-browser fetcher used for Senate roll-call XML (see roll-call votes) can be reused here. It scales: GAO publishes ~1,000 reports/year, all fetchable in a few hours of browser time.
Feed item links are lowercase (e.g. /products/gao-26-107967), so normalize
casing when deriving the product-page URL and the deterministic asset path. The
full browser context is required for both the product/HTML pages and the asset
PDFs.
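The casing rule above can be sketched as a small helper. This is a minimal illustration, not the production fetcher: the function name `derive_report_urls` is hypothetical, and it assumes the asset path reuses the lowercase product slug unchanged (including any trailing letter suffix), which has not been verified for every product type.

```python
import re
from urllib.parse import urlparse

# Modern product slugs only, e.g. /products/gao-26-107967 (optional letter suffix).
_PRODUCT_RE = re.compile(r"^/products/(gao-\d{2}-\d{6}[a-z]*)$")
_BASE = "https://www.gao.gov"


def derive_report_urls(feed_link: str) -> dict:
    """Map a feed <link> to the uppercase report number and lowercase URLs."""
    path = urlparse(feed_link).path.rstrip("/").lower()
    match = _PRODUCT_RE.match(path)
    if match is None:
        raise ValueError(f"not a modern GAO product link: {feed_link!r}")
    slug = match.group(1)  # lowercase, as it appears in the feed
    return {
        "report_number": slug.upper(),  # uppercase form used in stable IDs
        "product_url": f"{_BASE}/products/{slug}",
        # Assumption: asset files are named after the same lowercase slug.
        "full_report_pdf_url": f"{_BASE}/assets/{slug}.pdf",
        "highlights_pdf_url": f"{_BASE}/assets/{slug}-highlights.pdf",
    }
```

Both the product URL and the asset URLs still need the headed browser; the helper only computes where to point it.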
Per-topic RSS feeds are dead. GAO formerly published per-topic feeds at
predictable URLs (/rss/topics/agriculture.xml, /rss/topics/health-care.xml,
~30 topics). Every /rss/topics/{slug}.xml now returns a fast, repeatable 404
(the mechanism is gone, not an Akamai wall or a timeout). The primary
/rss/reports.xml feed is the sole discovery feed; there is no per-topic
safety net. Filtered-by-topic ingestion must be done by reading
/rss/reports.xml and filtering on the product page's topic tags.
Endpoints / URL patterns
The raw URL surface (RSS feed, /products/{report-number}, /assets/...pdf,
-highlights.pdf, the recommendations database, /sitemap.xml,
/topics/{slug}) lives on the gao.gov site — GAO has no
developer-docs page that enumerates it, so the operational facts that matter for
Josh are captured in Access notes above (which paths are open,
which are bot-walled, which are dead). The two load-bearing paths:
- https://www.gao.gov/rss/reports.xml: RSS feed of recent reports, the one path open to bots. Carries the full "What GAO Found" body in <description> (see Response shapes).
- https://www.gao.gov/assets/gao-{c}-{NNNNNN}.pdf: full report PDF, deterministic from the product number, but bot-walled equally to product pages (Akamai 403), so a headed browser is required.
The searchable recommendations database is bot-walled too.
Vocabulary / enums
Report number format
GAO-{c}-{NNNNNN} where {c} is the two-digit GAO fiscal year (not congress)
and {NNNNNN} is a 6-digit sequential ID.
Examples:
- GAO-26-107932: FY2026 report
- GAO-25-106451: FY2025 report
- GAO-23-105680: FY2023 report
GAO's fiscal year (and report-number year) starts October 1 of the prior calendar
year. So GAO-26-... reports are published October 2025 – September 2026.
Pre-2003 numbers used letter-prefix sequences (GAO/T-RCED-99-1). For the v1
ingestion scope (mostly post-2010), the modern numeric format dominates.
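As a worked example of the fiscal-year rule, the sketch below maps a modern report number to the publication window it implies. The helper name `fiscal_year_window` is hypothetical, and it assumes every modern two-digit year belongs to the 2000s, which holds only because the numeric format postdates 2003. Legacy letter-prefix numbers are rejected rather than guessed at.

```python
import re
from datetime import date

# GAO-{c}-{NNNNNN} with an optional letter suffix (e.g. T for testimony).
_MODERN_RE = re.compile(r"^GAO-(\d{2})-(\d{6})([A-Z]*)$")


def fiscal_year_window(report_number: str) -> tuple[date, date]:
    """Return (first_day, last_day) of the GAO fiscal year in a report number."""
    match = _MODERN_RE.match(report_number.strip().upper())
    if match is None:
        raise ValueError(f"not a modern GAO report number: {report_number!r}")
    # Assumption: two-digit years are 20xx, since the format is post-2003.
    fiscal_year = 2000 + int(match.group(1))
    # The fiscal year starts October 1 of the prior calendar year.
    return date(fiscal_year - 1, 10, 1), date(fiscal_year, 9, 30)
```

For GAO-26-107932 this gives October 1, 2025 through September 30, 2026, matching the range stated above. A publish_date outside the window is a cheap sanity check worth logging rather than rejecting.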
Product type is inferred from the URL slug + title pattern: a plain
/products/gao-{c}-{n} is a standard report; a trailing T
(/products/gao-{c}-{n}t) is testimony; a trailing R is correspondence;
a "Technology Assessment" title is a technology assessment; bid-protest
decisions live in a separate URL space (/legal/decisions/...). These map to
the product_type CHECK constraint in the schema. The
full GAO topic taxonomy (~30 top-level topics, each
with a kebab-case slug) is stored free-text in gao_report_topics — we do not
pre-define an enum and accept new topics as they appear.
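The product-type rules above might be encoded roughly as follows. This is a sketch under assumptions: the function name `infer_product_type` is hypothetical, and letting the title check win over the URL suffix is our guess at precedence, since the notes above do not say which signal wins when both apply.

```python
import re

_PRODUCT_PATH_RE = re.compile(r"/products/gao-\d{2}-\d{6}([a-z]*)")


def infer_product_type(product_path: str, title: str) -> str:
    """Return a value accepted by the product_type CHECK constraint."""
    path = product_path.lower().rstrip("/")
    if path.startswith("/legal/decisions/"):
        return "decision"  # bid-protest decisions live in their own URL space
    match = _PRODUCT_PATH_RE.fullmatch(path)
    if match is None:
        return "other"
    # Assumption: a "Technology Assessment" title outranks the slug suffix.
    if "technology assessment" in title.lower():
        return "technology_assessment"
    suffix = match.group(1)
    if suffix == "":
        return "report"
    if suffix == "t":
        return "testimony"
    if suffix == "r":
        return "correspondence"
    return "other"  # unknown suffix: keep the row, flag for review
```

Returning 'other' instead of raising keeps ingestion moving when GAO introduces a suffix we have not seen.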
Recommendation status
Landed as the gao_recommendations.status CHECK constraint. These five values
must stay byte-identical to the schema (including the em-dash):
- Open: pending; agency has not implemented
- Open—Partially Addressed: agency took some but not all steps
- Closed—Implemented: agency took recommended action; verified by GAO
- Closed—Not Implemented: agency declined / time-limited; closed without action
- Closed—Other: withdrawn, OBE'd, or moot
The upstream prose definitions live in the recommendations database; the canonical enum is the CHECK constraint below.
Stable ID format
gao:{report-number}, uppercase preserved.
Examples:
- gao:GAO-26-107932
- gao:GAO-25-106451T (testimony)
Per-recommendation IDs (when ingesting the recommendations database):
- gao-rec:{report-number}-{recommendation-number}, e.g. gao-rec:GAO-25-106451-1 for the first recommendation in that report.
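A minimal sketch of building both ID forms, assuming the report number may arrive in either case (feed links are lowercase). The function names `report_id` and `recommendation_id` are hypothetical.

```python
def report_id(report_number: str) -> str:
    """Build the stable report ID, forcing the uppercase form."""
    return f"gao:{report_number.strip().upper()}"


def recommendation_id(report_number: str, recommendation_number: int) -> str:
    """Build the per-recommendation ID; numbering starts at 1 within a report."""
    if recommendation_number < 1:
        raise ValueError("recommendation_number starts at 1")
    return f"gao-rec:{report_number.strip().upper()}-{recommendation_number}"
```

Uppercasing in one place keeps the primary key stable regardless of which page or feed the number was scraped from.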
Response shapes
Because GAO publishes no schema reference, the observed shapes below are the only documentation of GAO's feed/page structure that exists; they are kept here rather than linked out.
RSS feed item
<item>
  <title>Hydrogen Energy: Technologies Offer Potential Benefits but Face Challenges to Widespread Use</title>
  <link>https://www.gao.gov/products/gao-26-107932</link>
  <description>
    What GAO Found
    Hydrogen energy technologies offer long-duration energy storage, ...
    [Full "What GAO Found" body, ~500-2000 words]
  </description>
  <pubDate>Wed, 30 Apr 2026 00:00:00 -0400</pubDate>
</item>
The RSS <description> contains the full "What GAO Found" body, often
pre-formatted plaintext from the page. This is enough text for a useful initial
parse without fetching the full PDF; for backfill, the RSS body alone may serve
for retrieval indexing. <pubDate> is in RFC 822 / RFC 2822 format. The time is
always 00:00:00, so it carries date precision only.
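A minimal sketch of parsing the feed with the standard library, assuming a plain RSS 2.0 document with un-namespaced item elements as in the observed shape. The function name `parse_feed` is hypothetical; the dictionary keys mirror the schema columns they would populate.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def parse_feed(xml_payload) -> list[dict]:
    """Extract title, link, body, and publish date from each <item>."""
    root = ET.fromstring(xml_payload)
    items = []
    for item in root.iter("item"):
        # pubDate is RFC 2822; the time part is always midnight, so keep
        # only the date as written in the feed's own UTC offset.
        published = parsedate_to_datetime(item.findtext("pubDate", "").strip())
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "raw_rss_description": (item.findtext("description") or "").strip(),
            "publish_date": published.date(),
        })
    return items
```

Taking the date without converting to UTC matters here: midnight at -0400 would fall on the same calendar day in UTC, but a positive offset would not, so the feed's local date is the safer reading.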
Product detail page (post-bot-wall fetch)
Sections observed via headed-browser fetch of gao-26-107932: the GAO ID and
publish/public-release dates (the latter sometimes lags publish), a Topics tag
list, Fast Facts (lay summary), the Highlights block (What GAO Found,
Why GAO Did This Study, What GAO Recommends), a Full Report section
linking the full-report and highlights PDFs, GAO Contacts (report authors),
and a Recommendations table when the report has formal recommendations. The
"What GAO Recommends" section sometimes says "GAO is making N recommendations"
with details below; sometimes just one summary paragraph. The Drupal HTML
structure has been observed inconsistent across templates — see
Open questions.
Recommendations table per report
When a report has formal recommendations the detail page exposes a structured
table — one row per recommendation with #, recommendation prose, target agency,
status, and date, e.g.:
| # | Recommendation | Agency | Status | Date |
|---|---------------------------------------|--------|---------------------|--------------|
| 1 | DoD should establish a process to ... | DoD | Open | Apr 30, 2026 |
| 2 | DOE should issue guidance ... | DOE | Closed—Implemented | Sep 12, 2024 |
Each row's Status updates over time as agencies implement (or don't). This is the source of the recommendations database and drives the monthly refresh.
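One way to turn a scraped table row into a record that satisfies the status CHECK constraint is sketched below. The names `normalize_status` and `parse_row` are hypothetical, and the dash variants being normalized (spaced dashes, en-dash, ASCII hyphen) are an assumption about what the page might emit, not something observed. The date format follows the two example rows above.

```python
import re
from datetime import datetime

EM = "\u2014"  # the em-dash that the schema's status values contain
ALLOWED_STATUSES = frozenset({
    "Open",
    f"Open{EM}Partially Addressed",
    f"Closed{EM}Implemented",
    f"Closed{EM}Not Implemented",
    f"Closed{EM}Other",
})
# Assumption: pages may render the separator with spaces or a different dash.
_SEPARATOR_RE = re.compile(r"\s*(?:\u2014|\u2013|--|-)\s*")


def normalize_status(raw: str) -> str:
    """Coerce a scraped status to the exact schema value, or raise."""
    cleaned = " ".join(raw.split())
    cleaned = _SEPARATOR_RE.sub(EM, cleaned, count=1)
    if cleaned not in ALLOWED_STATUSES:
        raise ValueError(f"unknown recommendation status: {raw!r}")
    return cleaned


def parse_row(cells: list[str]) -> dict:
    """Convert the five table cells (#, text, agency, status, date) to a record."""
    number, text, agency, status, when = (cell.strip() for cell in cells)
    return {
        "recommendation_number": int(number),
        "text": text,
        "agency_name": agency,
        "status": normalize_status(status),
        "status_date": datetime.strptime(when, "%b %d, %Y").date(),
    }
```

Raising on an unknown status is deliberate: a sixth status value appearing upstream should surface as a parser alert rather than a constraint violation at insert time.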
Volume
| Metric | Value |
|---|---|
| Reports per year | ~600-1,000 |
| Testimonies per year | ~100-300 |
| Total all-time (online) | ~50,000 |
| Per-report PDF size | ~500 KB - 5 MB |
| RSS feed size | ~50-100 KB (recent ~50 reports) |
| Per-report HTML | ~50-150 KB |
| Recommendations all-time | ~30,000+ tracked, ~10,000+ open at any time |
| Recommendations per report | 0-50; median ~3-5 |
Postgres footprint: ~3-5 GB after backfill (text + metadata, no PDF bytes inline). Recommendations table ~30K rows.
Caching / incremental sync
Three signals:
- RSS <pubDate>: new reports appear in /rss/reports.xml on publication day. Read every 6-12 hours.
- Per-asset Last-Modified on the PDF: confirms re-issuances and corrigenda.
- Recommendations database polling: recommendation status updates lag the report and continue for years. Re-poll the database monthly per active report.
Practical incremental:
- 6-hourly: RSS poll of /rss/reports.xml (the sole feed) → diff against gao_reports.id set → fetch new product pages + PDFs.
- Backstop for missed reports (e.g., feed truncation): headed-browser crawl of /reports-testimonies?page=N. The per-topic RSS feeds are dead (404), so there is no lightweight secondary feed.
- Monthly: recommendations status refresh for all reports with at least one Open recommendation. ~10K reports × 1 page fetch = 10K browser requests/month, ~3 hours.
Schema (Postgres DDL)
-- ============================================================
-- GAO reports
-- ============================================================
CREATE TABLE gao_reports (
  id                   text PRIMARY KEY,       -- 'gao:GAO-26-107932'
  report_number        text NOT NULL UNIQUE,   -- 'GAO-26-107932'

  -- Type discriminator (inferred from URL suffix or title)
  product_type         text NOT NULL CHECK (product_type IN (
    'report', 'testimony', 'correspondence',
    'technology_assessment', 'decision', 'other'
  )),

  title                text NOT NULL,
  subtitle             text,                   -- post-colon part of title

  -- Dates
  publish_date         date NOT NULL,
  public_release_date  date,                   -- sometimes lags publish (e.g. classified annexes)

  -- Body sections
  fast_facts           text,                   -- lay summary
  what_found           text,                   -- "What GAO Found"
  why_studied          text,                   -- "Why GAO Did This Study"
  what_recommends      text,                   -- "What GAO Recommends" prose

  -- Source URLs
  product_url          text NOT NULL,          -- gao.gov product page
  full_report_pdf_url  text,                   -- /assets/gao-{c}-{n}.pdf
  highlights_pdf_url   text,                   -- /assets/gao-{c}-{n}-highlights.pdf

  -- Body text (extracted from PDF or HTML)
  body_text            text,                   -- full report text

  -- Lifecycle
  raw_html             bytea,                  -- gzipped product page HTML
  raw_rss_description  text,                   -- RSS body, useful as backup body text
  fetched_at           timestamptz NOT NULL,
  parsed_at            timestamptz,
  inserted_at          timestamptz NOT NULL DEFAULT now(),
  updated_at           timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX gao_reports_publish_date ON gao_reports (publish_date DESC);
CREATE INDEX gao_reports_type ON gao_reports (product_type, publish_date DESC);
CREATE INDEX gao_reports_recent ON gao_reports (publish_date DESC)
  WHERE publish_date >= '2020-01-01';

ALTER TABLE gao_reports ADD COLUMN search_tsv tsvector
  GENERATED ALWAYS AS (
    setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(fast_facts, '') || ' ' || coalesce(subtitle, '')), 'B') ||
    setweight(to_tsvector('english', coalesce(what_found, '') || ' ' || coalesce(why_studied, '') || ' ' || coalesce(what_recommends, '')), 'C') ||
    setweight(to_tsvector('english', coalesce(body_text, '')), 'D')
  ) STORED;
CREATE INDEX gao_reports_search ON gao_reports USING gin (search_tsv);
-- ============================================================
-- Topics (free-text from GAO; we don't pre-define enum)
-- ============================================================
CREATE TABLE gao_report_topics (
  report_id  text NOT NULL REFERENCES gao_reports(id) ON DELETE CASCADE,
  topic      text NOT NULL,                    -- 'Energy', 'Health Care', etc.
  PRIMARY KEY (report_id, topic)
);
CREATE INDEX gao_report_topics_topic ON gao_report_topics (topic);

-- ============================================================
-- Agencies referenced (free-text initially; can soft-FK later)
-- ============================================================
CREATE TABLE gao_report_agencies (
  report_id    text NOT NULL REFERENCES gao_reports(id) ON DELETE CASCADE,
  agency_name  text NOT NULL,                  -- 'Department of Energy'
  agency_id    text REFERENCES agencies(id),   -- soft FK to agencies table when resolvable
  PRIMARY KEY (report_id, agency_name)
);
-- ============================================================
-- Recommendations (the high-value structured subset)
-- ============================================================
CREATE TABLE gao_recommendations (
  id                     text PRIMARY KEY,     -- 'gao-rec:GAO-25-106451-1'
  report_id              text NOT NULL REFERENCES gao_reports(id) ON DELETE CASCADE,
  recommendation_number  int NOT NULL,         -- sequence within report

  text                   text NOT NULL,        -- recommendation prose

  -- Target
  agency_name            text NOT NULL,
  agency_id              text REFERENCES agencies(id),

  -- Status (updates over time)
  status                 text NOT NULL CHECK (status IN (
    'Open',
    'Open—Partially Addressed',
    'Closed—Implemented',
    'Closed—Not Implemented',
    'Closed—Other'
  )),
  status_date            date,                 -- last status update
  status_history         jsonb,                -- list of {date, status, comment}

  -- Lifecycle
  fetched_at             timestamptz NOT NULL,
  inserted_at            timestamptz NOT NULL DEFAULT now(),
  updated_at             timestamptz NOT NULL DEFAULT now(),

  UNIQUE (report_id, recommendation_number)
);

CREATE INDEX gao_recommendations_status ON gao_recommendations (status, status_date DESC);
CREATE INDEX gao_recommendations_agency ON gao_recommendations (agency_name);
CREATE INDEX gao_recommendations_open ON gao_recommendations (report_id)
  WHERE status LIKE 'Open%';
-- ============================================================
-- Vector chunks for similarity
-- ============================================================
-- The vector type and hnsw index below come from the pgvector extension.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE gao_report_chunks (
  id           bigserial PRIMARY KEY,
  report_id    text NOT NULL REFERENCES gao_reports(id) ON DELETE CASCADE,
  chunk_index  int NOT NULL,
  chunk_text   text NOT NULL,
  embedding    vector(1024),
  UNIQUE (report_id, chunk_index)
);
CREATE INDEX gao_report_chunks_embedding ON gao_report_chunks
  USING hnsw (embedding vector_cosine_ops);

Schema decisions worth flagging:
- raw_rss_description preserved as a fallback body text source. The RSS body is ~80% of "What GAO Found" pre-extracted; useful when product-page parsing fails.
- raw_html as gzipped bytea: product HTML is small, and preserving it lets us re-parse later when GAO adds structured fields we currently miss (recommendations table parsing is fragile).
- gao_recommendations.status_history as jsonb: status changes over years. The JSON list is small (typically <10 entries per recommendation) and rarely queried for filter; keeping it inline simplifies updates.
- agency_id is a soft FK: many GAO recommendations target agencies whose canonical ID may not be in our agencies table (especially sub-agencies). Don't enforce.
- No formal subagency or bureau field: too inconsistent; GAO's "agency" target is sometimes the cabinet department, sometimes a sub-bureau. Resolution is a future task.
Download / update strategy
Backfill
- Discovery: Use the headed browser (same fetcher as Senate roll-call) to crawl https://www.gao.gov/reports-testimonies?page=N for every paginated page. This is the comprehensive list.
- Extract (report_number, title, publish_date, product_url) per row.
- For each report_number:
  - Fetch product page via headed browser. Save HTML to /data/corpus/gao/products/{report_number}.html.
  - Extract sections (fast facts, what found, why studied, what recommends, topics, agencies, recommendations).
  - Construct asset URLs deterministically; fetch full-report PDF + highlights PDF.
  - Save to /data/corpus/gao/assets/{report_number}.pdf and .highlights.pdf.
- Parse + load.
- Embed chunks.
Time budget: ~50K reports × (1 product page + 1 PDF) = ~100K browser-mediated fetches. With 2 concurrent browser instances at ~2 fetches/min each (~4 fetches/min combined), that is ~17 days. Strategy: backfill from 2010 forward (~12K reports, ~24K fetches, ~4 days) and accept gaps for older reports.
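The time budget reduces to a single division, sketched here so the figures can be re-derived when the concurrency or per-instance rate changes. The helper name `browser_days` is hypothetical, and the defaults encode the rates quoted above (2 instances, ~2 fetches/min each), which are estimates rather than measurements.

```python
def browser_days(fetches: int, instances: int = 2,
                 per_instance_per_min: float = 2.0) -> float:
    """Wall-clock days needed for a number of browser-mediated fetches."""
    minutes = fetches / (instances * per_instance_per_min)
    return minutes / (60 * 24)


full_corpus = browser_days(50_000 * 2)    # product page + PDF per report
from_2010 = browser_days(12_000 * 2)      # 2010-forward slice only
monthly_refresh = browser_days(10_000)    # one product page per report

print(f"full: {full_corpus:.1f} d, 2010+: {from_2010:.1f} d, "
      f"refresh: {monthly_refresh:.1f} d")
```

At these rates the full corpus comes out near 17 days and the 2010-forward slice a little over 4 days. The same rate applied to ~10K monthly refresh fetches gives roughly 1.7 days, so any much shorter estimate for that job presumably assumes product pages fetch considerably faster than the backfill average.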
An alternative is to accumulate the RSS feed over time: GAO does not archive the feed, but it carries the last ~50 entries with full body text. With patience (~10 years × 12 polls/year = 120 RSS fetches) a corpus accumulates, but that is not viable for a fresh launch. The headed-browser crawl is the answer.
Daily incremental
- 6-hourly: RSS poll. Fetch /rss/reports.xml. Parse <item>s. Diff against gao_reports.id.
- For each new item:
  - Use the RSS body for initial what_found text (already extracted).
  - Queue a headed-browser fetch for the product page (gives us topics, agencies, recommendations).
  - Queue PDF fetch.
- Periodic backstop: headed-browser crawl of /reports-testimonies?page=N. Catch any items missed by the main feed (rare; the feed is comprehensive). The per-topic RSS feeds that previously served this role are dead (404).
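The diff step of the incremental loop can be sketched as a pure function over already-parsed feed items. The name `plan_incremental`, the item keys, and the job tuples are all hypothetical; the real queue format is not defined anywhere in these notes.

```python
def plan_incremental(feed_items: list[dict], known_ids: set[str]):
    """Split feed items into seed rows and headed-browser fetch jobs.

    feed_items: dicts with 'link' and 'description' keys (assumed shape).
    known_ids:  the existing gao_reports.id values.
    """
    seeds, queue = [], []
    seen = set(known_ids)
    for item in feed_items:
        slug = item["link"].rstrip("/").rsplit("/", 1)[-1].lower()
        rid = f"gao:{slug.upper()}"
        if rid in seen:
            continue  # already ingested, or duplicated within this poll
        seen.add(rid)
        # The RSS body doubles as the initial what_found text.
        seeds.append({
            "id": rid,
            "what_found": item["description"],
            "raw_rss_description": item["description"],
        })
        queue.append(("product_page", f"https://www.gao.gov/products/{slug}"))
        queue.append(("pdf", f"https://www.gao.gov/assets/{slug}.pdf"))
    return seeds, queue
```

Keeping this step free of I/O makes it easy to test against a saved copy of the feed before wiring it to the browser fetcher.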
Monthly recommendations refresh
For every report with any Open* recommendation, re-fetch the product page and
re-parse the recommendations table. Update gao_recommendations.status + append
to status_history if changed. ~10K product fetches/month, ~3 hours of
headed-browser time.
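The update-and-append rule can be sketched as below, operating on an in-memory record before it is written back. The function name `apply_status_update` is hypothetical, and the history entry shape follows the {date, status, comment} list noted in the schema.

```python
from datetime import date
from typing import Optional


def apply_status_update(rec: dict, new_status: str, observed_on: date,
                        comment: Optional[str] = None) -> bool:
    """Mutate rec if the status changed; return True when a write is needed."""
    if rec["status"] == new_status:
        return False  # unchanged: leave status_date and history alone
    history = list(rec.get("status_history") or [])
    history.append({
        "date": observed_on.isoformat(),  # JSON-friendly for the jsonb column
        "status": new_status,
        "comment": comment,
    })
    rec["status"] = new_status
    rec["status_date"] = observed_on
    rec["status_history"] = history
    return True
```

Returning a boolean lets the refresh job skip the UPDATE entirely for the majority of recommendations that did not move that month.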
State tracking
Source keys: gao_reports_rss, gao_reports_backfill,
gao_recommendations_refresh. Each tracked separately.
Failure modes
- gao.gov 503 / 429: back off; retry. Reduce concurrency.
- Product page schema drift: GAO occasionally restructures the Drupal frontend. Detect via missing required field; fall back to PDF text extraction; alert for parser update.
- PDF asset 404: some reports have only highlights, or the asset path is non-standard. Fall back to whatever links the product page does expose.
- Duplicate report numbers across years: should not happen (numbers include FY) but guard against it; raise on conflict.
- Recommendations table parse failure: recommendation text is structured but the HTML has been observed to vary. Soft-fail: insert gao_reports without recommendations; queue for retry next refresh.
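For the 503 / 429 case, a minimal exponential-backoff wrapper might look like the following. Everything here is an assumption for illustration: `fetch_with_backoff` is a hypothetical name, `fetch` stands in for whatever the headed-browser fetcher exposes (assumed to return a `(status, body)` tuple), and the delay constants are placeholders to tune.

```python
import random
import time

RETRYABLE = frozenset({429, 503})


def fetch_with_backoff(fetch, url: str, max_attempts: int = 5,
                       base_delay: float = 2.0, sleep=time.sleep):
    """Call fetch(url) until it returns a non-retryable status.

    fetch: hypothetical callable returning (status_code, body).
    sleep: injectable so tests do not actually wait.
    """
    status, body = None, None
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status not in RETRYABLE:
            return status, body
        if attempt < max_attempts - 1:
            # Double the wait each time and add jitter to avoid lockstep retries.
            sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
    return status, body  # still failing after max_attempts; caller decides
```

The wrapper only covers the retry half of the rule; reducing concurrency is a scheduler-level decision that would sit above it.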
Open questions
These don't block ingestion but should be resolved before this source is "shipped":
- Pre-2010 backfill scope. ~30K reports exist 1995-2010 in the online corpus. Are they substantively useful for an agent in 2026? Some are. Decision: skip pre-2010 in v1; revisit if user feedback says we're missing important historical context.
- Bid Protest decisions are at /legal/bid-protests/... (separate URL space, separate database, separate value). Defer to v1.x; document the gap.
- Restricted Reports at /reports-testimonies/restricted: classified versions exist with a public summary. Ingest the public summary only.
- Highlights vs full report dual-text. Highlights is the 1-page lay summary; full report is 30-200 pages. Both are public. Indexing both wastes space. Lean: index highlights body text as fast_facts; index full report body in body_text.
- Recommendations fuzzy-matching to agencies. GAO's agency strings are inconsistent ("Department of Defense" vs "DoD" vs "Department of the Army"). A canonical resolver would benefit cross-source queries; defer to a separate normalization task.
- Product detail page extraction reliability. The Drupal HTML structure has been observed inconsistent (different field names per template). Build a defensive parser with multiple selector fallbacks; track parse-success rate.
- Whether to OCR pre-2003 PDFs. Many older reports are scanned. OCR adds compute cost; the agent rarely needs them. Lean toward skipping OCR; flag with body_text IS NULL and provide PDF link only.
- Authors / GAO contact resolution. Each report names a GAO point-of-contact (a director-level GAO staffer). Building an analyst directory is useful for "what else has Director X published"; defer to v1.x.
- Cross-reference to bills/hearings. GAO reports often reference congressional requests. The product page sometimes lists "Requesters": Members or committees that requested the work. Worth extracting to a separate table (gao_report_requesters) with FKs to legislators / committees.