GAO reports

Reports and testimonies from the Government Accountability Office, the legislative branch's audit and investigative arm. GAO produces in-depth investigations of federal programs, fraud audits, performance reviews, technology assessments, and Congressional testimony. Each report contains formal findings, often with recommendations to specific agencies — and a tracked implementation status that updates over years.

For Josh, GAO's value is two-fold: (1) substantive analysis the agent can cite when answering oversight or accountability questions, (2) the recommendations database — a structured row-per-recommendation-per-agency dataset that quantifies implementation status. The recommendations database is the highest-signal subset; treat it as a separate sub-source.

GAO publishes no official public API and no published developer/field-reference docs — it is a direct-scrape source. Almost everything Josh-specific on this page is operational reverse-engineering that no upstream doc states. The gao.gov site is the canonical reference for the raw content; the bot-wall, feed, ID, and schema notes below are ours.

| Field | Value |
|---|---|
| Source name | GAO reports |
| Publisher | Government Accountability Office |
| License | Public domain (US government works) |
| Coverage | 1971 – present (GAO predecessor "GAO Reports" since 1921 mostly paper). Online corpus reliable for ~1995 onward. |
| Volume | ~600-1000 reports + ~100-300 testimonies per year. Historical corpus ~50K+ reports. ~10,000+ active recommendations. |
| Storage estimate | ~10-15 GB raw PDF + HTML; ~2-4 GB extracted text |
| Site | https://www.gao.gov/ |
| Discovery feed | /rss/reports.xml (recent reports), the sole open path (see Access notes). Per-topic feeds (/rss/topics/{slug}.xml) are dead (repeatable 404, mechanism gone). |
| Asset path pattern | https://www.gao.gov/assets/gao-{c}-{NNNNNN}.pdf (full report) and gao-{c}-{NNNNNN}-highlights.pdf; deterministic from the product number, but now bot-walled equally to product pages (Akamai 403). |
| Auth | None |
| Rate limits | None observed; HTML pages bot-walled (real Chrome required) |
| Caching headers | RSS: per-feed Last-Modified. PDFs: per-file Last-Modified. |
| Incremental sync hints | RSS <pubDate> + per-asset Last-Modified; sitemap exists but bot-walled |
| Stable ID format | gao:{report-number} e.g. gao:GAO-26-107932, gao:GAO-25-106451 |
| Status | exploring: schema drafted, ingestion not built |

Primary: gao.gov direct. RSS for the discovery feed (open to bots), individual product pages via headed-browser, asset PDFs at the deterministic asset URL pattern.

Secondary: GovInfo GAO collections. Some GAO reports are in /bulkdata/GAOREPORTS/..., but coverage is sparse: GovInfo mirrors only the subset that GAO formally submits to GPO. Unreliable as primary.

Skip: ProPublica's GAO scraper / community mirrors. They lag the source and re-add a parsing layer. Direct is better.

Skip: Recommendations API third parties. A few exist; underlying data is on gao.gov already.

HTML pages bot-walled. https://www.gao.gov/products/... and https://www.gao.gov/sitemap.xml return 403 to non-browser clients. Same Akamai-style fingerprinting we hit on Senate.gov. The RSS endpoint at /rss/reports.xml is the sole open path — that's our discovery feed. Asset URLs (/assets/gao-{c}-{NNNNNN}.pdf) are now walled equally to the product pages (Akamai 403) — there is no longer a cheaper PDF path; the headed browser is required for both.

For ingestion, the same headed-browser fetcher used for Senate roll-call XML (see roll-call votes) can be reused here. It scales: GAO publishes ~1,000 reports/year, all fetchable in a few hours of browser time.

Feed item links are lowercase (e.g. /products/gao-26-107967), so normalize casing when deriving the product-page URL and the deterministic asset path. The full browser context is required for both the product/HTML pages and the asset PDFs.
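The casing rule above can be sketched in Python as follows. This is a minimal illustration, not the project's fetcher: the function names are hypothetical, and it assumes asset paths use the same lowercase slug as the product URL, including any one-letter suffix.

```python
import re

GAO_BASE = "https://www.gao.gov"

# Feed links carry a lowercase slug such as "gao-26-107967" (optionally with a
# one-letter suffix, e.g. "t" for testimony).
_PRODUCT_LINK_RE = re.compile(r"/products/(gao-\d{2}-\d{6}[a-z]?)/?$", re.IGNORECASE)


def report_number_from_link(link: str) -> str:
    """Return the uppercase report number encoded in a feed item link."""
    match = _PRODUCT_LINK_RE.search(link.strip())
    if match is None:
        raise ValueError(f"not a GAO product link: {link!r}")
    return match.group(1).upper()


def product_url(report_number: str) -> str:
    """Product pages use the lowercase slug."""
    return f"{GAO_BASE}/products/{report_number.lower()}"


def asset_urls(report_number: str) -> dict[str, str]:
    """Deterministic asset URLs (assumed to use the lowercase slug as well)."""
    slug = report_number.lower()
    return {
        "full_report": f"{GAO_BASE}/assets/{slug}.pdf",
        "highlights": f"{GAO_BASE}/assets/{slug}-highlights.pdf",
    }
```

Both URL families still need the headed browser to fetch; the sketch only derives the addresses.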

Per-topic RSS feeds are dead. GAO formerly published per-topic feeds at predictable URLs (/rss/topics/agriculture.xml, /rss/topics/health-care.xml, ~30 topics). Every /rss/topics/{slug}.xml now returns a fast, repeatable 404 (the mechanism is gone, not an Akamai wall or a timeout). The primary /rss/reports.xml feed is the sole discovery feed; there is no per-topic safety net. Filtered-by-topic ingestion must be done by reading /rss/reports.xml and filtering on the product page's topic tags.

The raw URL surface (RSS feed, /products/{report-number}, /assets/...pdf, -highlights.pdf, the recommendations database, /sitemap.xml, /topics/{slug}) lives on the gao.gov site — GAO has no developer-docs page that enumerates it, so the operational facts that matter for Josh are captured in Access notes above (which paths are open, which are bot-walled, which are dead). The two load-bearing paths:

  • https://www.gao.gov/rss/reports.xml — RSS feed of recent reports, the one path open to bots. Carries the full "What GAO Found" body in <description> (see Response shapes).
  • https://www.gao.gov/assets/gao-{c}-{NNNNNN}.pdf — full report PDF, deterministic from the product number, but bot-walled equally to product pages (Akamai 403), so headed-browser required.

The searchable recommendations database is bot-walled too.

Report numbers take the form GAO-{c}-{NNNNNN}, where {c} is the two-digit GAO fiscal year (not the Congress number) and {NNNNNN} is a 6-digit sequential ID.

Examples:

  • GAO-26-107932 — FY2026 report
  • GAO-25-106451 — FY2025 report
  • GAO-23-105680 — FY2023 report

GAO's fiscal year (and report-number year) starts October 1 of the prior calendar year. So GAO-26-... reports are published October 2025 – September 2026.

Pre-2003 numbers used letter-prefix sequences (GAO/T-RCED-99-1). For the v1 ingestion scope (mostly post-2010), the modern numeric format dominates.
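A small Python sketch of the numbering rules above, under two labeled assumptions: two-digit years are mapped to 20xx (reasonable only for the modern numeric format), and an optional one-letter suffix such as T is tolerated. The names `ReportNumber`, `parse_report_number`, and `fiscal_year_window` are illustrative, not existing code.

```python
import re
from datetime import date
from typing import NamedTuple


class ReportNumber(NamedTuple):
    fiscal_year: int  # four-digit GAO fiscal year
    sequence: int     # 6-digit sequential ID
    suffix: str       # "" or a one-letter product suffix such as "T"


_MODERN_RE = re.compile(r"^GAO-(\d{2})-(\d{6})([A-Z]?)$")


def parse_report_number(report_number: str) -> ReportNumber:
    """Parse the modern numeric format; legacy letter-prefix numbers are rejected."""
    match = _MODERN_RE.match(report_number.upper())
    if match is None:
        raise ValueError(f"not a modern GAO report number: {report_number!r}")
    # Assumption: every modern number belongs to the 2000s.
    return ReportNumber(2000 + int(match.group(1)), int(match.group(2)), match.group(3))


def fiscal_year_window(fiscal_year: int) -> tuple[date, date]:
    """GAO fiscal year N runs from October 1 of N-1 through September 30 of N."""
    return date(fiscal_year - 1, 10, 1), date(fiscal_year, 9, 30)
```

The window is a plausibility check on publish_date rather than a hard constraint; a legacy number such as GAO/T-RCED-99-1 raises ValueError and would need its own parser.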

Product type is inferred from the URL slug + title pattern: a plain /products/gao-{c}-{n} is a standard report; a trailing T (/products/gao-{c}-{n}t) is testimony; a trailing R is correspondence; a "Technology Assessment" title is a technology assessment; bid-protest decisions live in a separate URL space (/legal/decisions/...). These map to the product_type CHECK constraint in the schema. The full GAO topic taxonomy (~30 top-level topics, each with a kebab-case slug) is stored free-text in gao_report_topics — we do not pre-define an enum and accept new topics as they appear.
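The inference rules above could look like this in Python. It is a sketch under assumptions: the precedence (legal URL space first, then title, then slug suffix) is a guess, and `infer_product_type` is a hypothetical helper name. Return values are the product_type CHECK values.

```python
import re

_SUFFIX_TO_TYPE = {"": "report", "t": "testimony", "r": "correspondence"}


def infer_product_type(product_url: str, title: str) -> str:
    """Map URL slug and title pattern onto a product_type value."""
    path = product_url.lower().rstrip("/")
    if "/legal/" in path:
        # Bid-protest decisions live outside /products/.
        return "decision"
    if "technology assessment" in title.lower():
        return "technology_assessment"
    match = re.search(r"/products/gao-\d{2}-\d{6}([a-z]?)$", path)
    if match is None:
        return "other"
    return _SUFFIX_TO_TYPE.get(match.group(1), "other")
```

Anything that does not match a known pattern falls through to 'other', so an unexpected suffix never violates the CHECK constraint.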

Recommendation status values are landed as the gao_recommendations.status CHECK constraint. These five values must stay byte-identical to the schema (including the em-dash):

  • Open — pending; agency has not implemented
  • Open—Partially Addressed — agency took some but not all steps
  • Closed—Implemented — agency took recommended action; verified by GAO
  • Closed—Not Implemented — agency declined / time-limited; closed without action
  • Closed—Other — withdrawn, OBE'd, or moot

The upstream prose definitions live in the recommendations database; the canonical enum is the CHECK constraint below.
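One way to keep the parser and the CHECK constraint in step is to hold the five values as constants and reject everything else. The Python sketch below writes the em-dash as the escape `\u2014` so an editor cannot silently swap it for a hyphen; `validate_status` and `is_open` are hypothetical helper names.

```python
EM_DASH = "\u2014"  # U+2014, the separator used inside the status values

RECOMMENDATION_STATUSES = frozenset({
    "Open",
    f"Open{EM_DASH}Partially Addressed",
    f"Closed{EM_DASH}Implemented",
    f"Closed{EM_DASH}Not Implemented",
    f"Closed{EM_DASH}Other",
})


def validate_status(raw: str) -> str:
    """Return the trimmed status, or raise if it is not one of the five values."""
    status = raw.strip()
    if status not in RECOMMENDATION_STATUSES:
        raise ValueError(f"unknown recommendation status: {raw!r}")
    return status


def is_open(status: str) -> bool:
    """True for 'Open' and the partially addressed variant."""
    return status.startswith("Open")
```

Failing loudly on a hyphenated or otherwise altered value surfaces upstream wording changes in the parser rather than as a constraint violation at insert time.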

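Stable IDs are gao:{report-number}, with the report number's uppercase preserved (feed slugs are lowercase, so uppercase them before building the ID).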

Examples:

  • gao:GAO-26-107932
  • gao:GAO-25-106451T (testimony)

Per-recommendation IDs (when ingesting the recommendations database):

  • gao-rec:{report-number}-{recommendation-number} e.g. gao-rec:GAO-25-106451-1 for the first recommendation in that report.

Because GAO publishes no schema reference, the observed shapes below are the only documentation of GAO's feed/page structure that exists — they are kept here rather than linked out.

<item>
<title>Hydrogen Energy: Technologies Offer Potential Benefits but Face Challenges to Widespread Use</title>
<link>https://www.gao.gov/products/gao-26-107932</link>
<description>
What GAO Found
Hydrogen energy technologies offer long-duration energy storage, ...
[Full "What GAO Found" body, ~500-2000 words]
</description>
<pubDate>Wed, 30 Apr 2026 00:00:00 -0400</pubDate>
</item>

The RSS <description> contains the full "What GAO Found" body, often pre-formatted plaintext from the page. This is enough text for a useful initial parse without fetching the full PDF; for newly discovered reports, the RSS body alone may serve for retrieval indexing until the PDF is fetched. <pubDate> is in RFC 822 / RFC 2822 format. The time is always 00:00:00, so treat it as date-precision only.
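A minimal Python sketch of parsing that item shape with the standard library. The sample is the observed item wrapped in a bare RSS envelope (the envelope is an assumption), and `parse_feed` is a hypothetical name; keys mirror the gao_reports columns.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Trimmed copy of the observed <item>, wrapped in a minimal RSS envelope.
SAMPLE_FEED = """<rss version="2.0"><channel>
<item>
<title>Hydrogen Energy: Technologies Offer Potential Benefits but Face Challenges to Widespread Use</title>
<link>https://www.gao.gov/products/gao-26-107932</link>
<description>
What GAO Found
Hydrogen energy technologies offer long-duration energy storage, ...
</description>
<pubDate>Wed, 30 Apr 2026 00:00:00 -0400</pubDate>
</item>
</channel></rss>"""


def parse_feed(xml_text: str) -> list[dict]:
    """Turn RSS items into rows shaped like the gao_reports columns."""
    rows = []
    for item in ET.fromstring(xml_text).iter("item"):
        link = (item.findtext("link") or "").strip()
        pub_date = (item.findtext("pubDate") or "").strip()
        if not link or not pub_date:
            continue  # incomplete item; leave it for the backstop crawl
        number = link.rstrip("/").rsplit("/", 1)[-1].upper()
        title = (item.findtext("title") or "").strip()
        rows.append({
            "id": f"gao:{number}",
            "report_number": number,
            "title": title,
            "subtitle": title.partition(": ")[2] or None,  # post-colon part
            # The time component is always midnight, so keep the date only.
            "publish_date": parsedate_to_datetime(pub_date).date(),
            "product_url": link,
            "raw_rss_description": (item.findtext("description") or "").strip(),
        })
    return rows
```

Calling `parse_feed(SAMPLE_FEED)` yields one row with report_number GAO-26-107932; the live feed would have to come from /rss/reports.xml.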

Sections observed via headed-browser fetch of gao-26-107932: the GAO ID and publish/public-release dates (the latter sometimes lags publish), a Topics tag list, Fast Facts (lay summary), the Highlights block (What GAO Found, Why GAO Did This Study, What GAO Recommends), a Full Report section linking the full-report and highlights PDFs, GAO Contacts (report authors), and a Recommendations table when the report has formal recommendations. The "What GAO Recommends" section sometimes says "GAO is making N recommendations" with details below; sometimes just one summary paragraph. The Drupal HTML structure has been observed inconsistent across templates — see Open questions.

When a report has formal recommendations the detail page exposes a structured table — one row per recommendation with #, recommendation prose, target agency, status, and date, e.g.:

| # | Recommendation | Agency | Status | Date |
|---|---------------------------------------|--------|---------------------|--------------|
| 1 | DoD should establish a process to ... | DoD | Open | Apr 30, 2026 |
| 2 | DOE should issue guidance ... | DOE | Closed—Implemented | Sep 12, 2024 |

Each row's Status updates over time as agencies implement (or don't). This is the source of the recommendations database and drives the monthly refresh.
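Once the five cells of a row have been pulled out of the page (the HTML extraction itself is template-dependent and not shown), mapping them onto gao_recommendations columns is mechanical. A Python sketch, assuming the date renders as in the example ("Apr 30, 2026") and using a hypothetical helper name:

```python
from datetime import datetime


def recommendation_row(report_number: str, cells: list[str]) -> dict:
    """Map the [#, recommendation, agency, status, date] cells onto column names."""
    number, text, agency, status, status_date = (cell.strip() for cell in cells)
    report_number = report_number.upper()
    return {
        "id": f"gao-rec:{report_number}-{int(number)}",
        "report_id": f"gao:{report_number}",
        "recommendation_number": int(number),
        "text": text,
        "agency_name": agency,
        "status": status,
        # "%b" expects English month abbreviations (the default C locale).
        "status_date": datetime.strptime(status_date, "%b %d, %Y").date(),
    }
```

A row with the wrong number of cells raises ValueError on unpacking, which fits the soft-fail-and-retry handling for recommendation parse failures.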

| Metric | Value |
|---|---|
| Reports per year | ~600-1,000 |
| Testimonies per year | ~100-300 |
| Total all-time (online) | ~50,000 |
| Per-report PDF size | ~500 KB - 5 MB |
| RSS feed size | ~50-100 KB (recent ~50 reports) |
| Per-report HTML | ~50-150 KB |
| Recommendations all-time | ~30,000+ tracked, ~10,000+ open at any time |
| Recommendations per report | 0-50; median ~3-5 |

Postgres footprint: ~3-5 GB after backfill (text + metadata, no PDF bytes inline). Recommendations table ~30K rows.

Three change-detection signals:

  1. RSS <pubDate> — new reports appear in /rss/reports.xml on publication day. Read every 6-12 hours.
  2. Per-asset Last-Modified on the PDF — confirms re-issuances and corrigenda.
  3. Recommendations database polling — recommendations status updates lag the report and continue for years. Re-poll the database monthly per active report.

Practical incremental:

  • 6-hourly: RSS poll of /rss/reports.xml (the sole feed) → diff against gao_reports.id set → fetch new product pages + PDFs.
  • Backstop for missed reports (e.g., feed truncation): headed-browser crawl of /reports-testimonies?page=N — the per-topic RSS feeds are dead (404), so there is no lightweight secondary feed.
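  • Monthly: recommendations status refresh for all reports with at least one Open recommendation. ~10K reports × 1 page fetch = ~10K browser requests/month; at the backfill throughput of ~4 fetches/min (2 browser instances × ~2 fetches/min) that is ~42 hours of browser time rather than ~3 hours.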

-- ============================================================
-- GAO reports
-- ============================================================
CREATE TABLE gao_reports (
    id text PRIMARY KEY,                    -- 'gao:GAO-26-107932'
    report_number text NOT NULL UNIQUE,     -- 'GAO-26-107932'

    -- Type discriminator (inferred from URL suffix or title)
    product_type text NOT NULL CHECK (product_type IN (
        'report', 'testimony', 'correspondence', 'technology_assessment', 'decision', 'other'
    )),
    title text NOT NULL,
    subtitle text,                          -- post-colon part of title

    -- Dates
    publish_date date NOT NULL,
    public_release_date date,               -- sometimes lags publish (e.g. classified annexes)

    -- Body sections
    fast_facts text,                        -- lay summary
    what_found text,                        -- "What GAO Found"
    why_studied text,                       -- "Why GAO Did This Study"
    what_recommends text,                   -- "What GAO Recommends" prose

    -- Source URLs
    product_url text NOT NULL,              -- gao.gov product page
    full_report_pdf_url text,               -- /assets/gao-{c}-{n}.pdf
    highlights_pdf_url text,                -- /assets/gao-{c}-{n}-highlights.pdf

    -- Body text (extracted from PDF or HTML)
    body_text text,                         -- full report text

    -- Lifecycle
    raw_html bytea,                         -- gzipped product page HTML
    raw_rss_description text,               -- RSS body; useful as backup body text
    fetched_at timestamptz NOT NULL,
    parsed_at timestamptz,
    inserted_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX gao_reports_publish_date ON gao_reports (publish_date DESC);
CREATE INDEX gao_reports_type ON gao_reports (product_type, publish_date DESC);
CREATE INDEX gao_reports_recent ON gao_reports (publish_date DESC) WHERE publish_date >= '2020-01-01';

ALTER TABLE gao_reports ADD COLUMN search_tsv tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(fast_facts, '') || ' ' || coalesce(subtitle, '')), 'B') ||
        setweight(to_tsvector('english', coalesce(what_found, '') || ' ' || coalesce(why_studied, '') || ' ' || coalesce(what_recommends, '')), 'C') ||
        setweight(to_tsvector('english', coalesce(body_text, '')), 'D')
    ) STORED;

CREATE INDEX gao_reports_search ON gao_reports USING gin (search_tsv);

-- ============================================================
-- Topics (free-text from GAO; we don't pre-define enum)
-- ============================================================
CREATE TABLE gao_report_topics (
    report_id text NOT NULL REFERENCES gao_reports(id) ON DELETE CASCADE,
    topic text NOT NULL,                    -- 'Energy', 'Health Care', etc.
    PRIMARY KEY (report_id, topic)
);

CREATE INDEX gao_report_topics_topic ON gao_report_topics (topic);

-- ============================================================
-- Agencies referenced (free-text initially; can soft-FK later)
-- ============================================================
CREATE TABLE gao_report_agencies (
    report_id text NOT NULL REFERENCES gao_reports(id) ON DELETE CASCADE,
    agency_name text NOT NULL,              -- 'Department of Energy'
    agency_id text REFERENCES agencies(id), -- soft FK to agencies table when resolvable
    PRIMARY KEY (report_id, agency_name)
);

-- ============================================================
-- Recommendations (the high-value structured subset)
-- ============================================================
CREATE TABLE gao_recommendations (
    id text PRIMARY KEY,                    -- 'gao-rec:GAO-25-106451-1'
    report_id text NOT NULL REFERENCES gao_reports(id) ON DELETE CASCADE,
    recommendation_number int NOT NULL,     -- sequence within report
    text text NOT NULL,                     -- recommendation prose

    -- Target
    agency_name text NOT NULL,
    agency_id text REFERENCES agencies(id),

    -- Status (updates over time)
    status text NOT NULL CHECK (status IN (
        'Open', 'Open—Partially Addressed',
        'Closed—Implemented', 'Closed—Not Implemented', 'Closed—Other'
    )),
    status_date date,                       -- last status update
    status_history jsonb,                   -- list of {date, status, comment}

    -- Lifecycle
    fetched_at timestamptz NOT NULL,
    inserted_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now(),
    UNIQUE (report_id, recommendation_number)
);

CREATE INDEX gao_recommendations_status ON gao_recommendations (status, status_date DESC);
CREATE INDEX gao_recommendations_agency ON gao_recommendations (agency_name);
CREATE INDEX gao_recommendations_open ON gao_recommendations (report_id) WHERE status LIKE 'Open%';

-- ============================================================
-- Vector chunks for similarity
-- ============================================================
-- The vector type and the hnsw index below come from the pgvector extension.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE gao_report_chunks (
    id bigserial PRIMARY KEY,
    report_id text NOT NULL REFERENCES gao_reports(id) ON DELETE CASCADE,
    chunk_index int NOT NULL,
    chunk_text text NOT NULL,
    embedding vector(1024),
    UNIQUE (report_id, chunk_index)
);

CREATE INDEX gao_report_chunks_embedding
    ON gao_report_chunks USING hnsw (embedding vector_cosine_ops);

Schema decisions worth flagging:

  • raw_rss_description preserved as a fallback body text source. The RSS body is ~80% of "What GAO Found" pre-extracted; useful when product-page parsing fails.
  • raw_html as gzipped bytea — product HTML is small, preserving it lets us re-parse later when GAO adds structured fields we currently miss (recommendations table parsing is fragile).
  • gao_recommendations.status_history as jsonb — status changes over years. The JSON list is small (typically <10 entries per recommendation) and rarely queried for filter; keeping it inline simplifies updates.
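  • agency_id is a soft FK: many GAO recommendations target agencies whose canonical ID may not be in our agencies table (especially sub-agencies). The column is nullable, so leave it NULL when the agency cannot be resolved; the REFERENCES clause is only checked for non-NULL values. Don't require a match.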
  • No formal subagency or bureau field — too inconsistent; GAO's "agency" target is sometimes the cabinet department, sometimes a sub-bureau. Resolution is a future task.

Backfill steps:

  1. Discovery: Use the headed browser (same fetcher as Senate roll-call) to crawl https://www.gao.gov/reports-testimonies?page=N for every paginated page. This is the comprehensive list.
  2. Extract (report_number, title, publish_date, product_url) per row.
  3. For each report_number:
    • Fetch product page via headed browser. Save HTML to /data/corpus/gao/products/{report_number}.html.
    • Extract sections (fast facts, what found, why studied, what recommends, topics, agencies, recommendations).
    • Construct asset URLs deterministically; fetch full-report PDF + highlights PDF.
    • Save to /data/corpus/gao/assets/{report_number}.pdf and .highlights.pdf.
  4. Parse + load.
  5. Embed chunks.

Time budget: ~50K reports backfilled × 1 product page + 1 PDF each = 100K browser-mediated fetches. With 2 concurrent browser instances at ~2 fetches/min each (~4 fetches/min total) that is ~17 days. Strategy: backfill from 2010 forward (~12K reports = ~24K fetches, ~4 days), accept gaps for older.
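The arithmetic behind that budget, as a small Python helper. The defaults are the throughput figures quoted above (2 instances at ~2 fetches/min each); `browser_days` is an illustrative name.

```python
def browser_days(fetches: int, instances: int = 2, fetches_per_min: float = 2.0) -> float:
    """Wall-clock days needed for a number of browser-mediated fetches."""
    minutes = fetches / (instances * fetches_per_min)
    return minutes / (60 * 24)


full_backfill = browser_days(50_000 * 2)   # product page + PDF per report
from_2010 = browser_days(12_000 * 2)
```

Under these assumptions the full corpus comes to about 17.4 days and the 2010-forward slice to about 4.2 days; adding browser instances scales both down linearly if gao.gov tolerates the concurrency.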

An alternative is to accumulate the RSS feed over time: GAO doesn't archive RSS, but the feed carries the last ~50 entries with full body. That only builds the corpus going forward (~10 years × 12 polls/year = 120 RSS fetches over time), so it is not viable for a fresh launch. The headed-browser crawl is the answer.

Incremental sync steps:

  1. 6-hourly: RSS poll. Fetch /rss/reports.xml. Parse <item>s. Diff against gao_reports.id.
  2. For each new item:
    • Use the RSS body for initial what_found text — already extracted.
    • Queue a headed-browser fetch for the product page (gives us topics, agencies, recommendations).
    • Queue PDF fetch.
  3. Periodic backstop: headed-browser crawl of /reports-testimonies?page=N. Catch any items missed by the main feed (rare — feed is comprehensive). The per-topic RSS feeds that previously served this role are dead (404).
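The diff-and-queue step in the list above can be sketched in Python. The job names and both function names are hypothetical, and the real queue and fetcher are out of scope here.

```python
def new_report_numbers(feed_numbers: list[str], known_ids: set[str]) -> list[str]:
    """Report numbers in the feed whose 'gao:' ID is not yet in gao_reports."""
    seen = set(known_ids)
    fresh = []
    for number in feed_numbers:
        report_id = f"gao:{number.upper()}"
        if report_id not in seen:
            seen.add(report_id)  # also de-duplicates repeats within one feed
            fresh.append(number.upper())
    return fresh


def jobs_for(report_number: str) -> list[tuple[str, str]]:
    """Headed-browser work to queue for one newly discovered report."""
    return [
        ("product_page", report_number),     # topics, agencies, recommendations
        ("full_report_pdf", report_number),
        ("highlights_pdf", report_number),
    ]
```

Feed order is preserved, so jobs are queued in the order the feed lists them.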

For every report with any Open* recommendation, re-fetch the product page and re-parse the recommendations table. Update gao_recommendations.status + append to status_history if changed. ~10K product fetches/month; at the backfill throughput of ~4 fetches/min that is ~42 hours of headed-browser time rather than ~3 hours.
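A minimal Python sketch of the update-if-changed rule, operating on an in-memory dict shaped like a gao_recommendations row. `apply_status` is a hypothetical helper; the history entry follows the {date, status, comment} shape noted in the schema.

```python
from datetime import date
from typing import Optional


def apply_status(rec: dict, new_status: str, observed_on: date,
                 comment: Optional[str] = None) -> bool:
    """Update status and append to status_history only when the status changed."""
    if rec["status"] == new_status:
        return False  # nothing to write on an unchanged refresh
    history = list(rec.get("status_history") or [])
    history.append({
        "date": observed_on.isoformat(),
        "status": new_status,
        "comment": comment,
    })
    rec["status"] = new_status
    rec["status_date"] = observed_on
    rec["status_history"] = history
    return True
```

The boolean return lets the caller skip the UPDATE (and leave updated_at alone) on an unchanged refresh, which should be the common case.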

Source keys: gao_reports_rss, gao_reports_backfill, gao_recommendations_refresh. Each tracked separately.

Failure modes:

  • gao.gov 503 / 429: back off, retry, and reduce concurrency.
  • Product page schema drift — GAO occasionally restructures the Drupal frontend. Detect via missing required field; fall back to PDF text extraction; alert for parser update.
  • PDF asset 404 — some reports have only highlights, or the asset path is non-standard. Fall back to whatever links the product page does expose.
  • Duplicate report numbers across years — should not happen (numbers include FY) but guard against it; raise on conflict.
  • Recommendations table parse failure — recommendation text is structured but the HTML has been observed to vary. Soft-fail: insert gao_reports without recommendations; queue for retry next refresh.
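For the 503 / 429 case, a capped exponential backoff with jitter is one common way to implement "back off; retry". The Python sketch below is an assumption about policy, not something the source specifies: the base delay, cap, and jitter fraction are placeholder values.

```python
import random

RETRYABLE_STATUS = {429, 503}


def should_retry(status_code: int) -> bool:
    """Only the throttling/unavailable responses are retried."""
    return status_code in RETRYABLE_STATUS


def backoff_delay(attempt: int, base: float = 5.0, cap: float = 600.0,
                  jitter: float = 0.25) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    Doubles from `base` up to `cap`, then spreads the result by +/- `jitter`
    so concurrent browser instances do not retry in lockstep.
    """
    delay = min(cap, base * 2 ** attempt)
    return delay * (1 + random.uniform(-jitter, jitter))
```

A 403 is deliberately not retryable here: on gao.gov it signals the bot wall, which more waiting will not fix.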

These don't block ingestion but should be resolved before this source is "shipped":

  • Pre-2010 backfill scope. ~30K reports exist 1995-2010 in the online corpus. Are they substantively useful for an agent in 2026? Some are. Decision: skip pre-2010 in v1; revisit if user feedback says we're missing important historical context.
  • Bid Protest decisions are at /legal/bid-protests/... — separate URL space, separate database, separate value. Defer to v1.x; document the gap.
  • Restricted Reports at /reports-testimonies/restricted — classified versions exist with a public summary. Ingest the public summary only.
  • Highlights vs full report dual-text. Highlights is the 1-page lay summary; full report is 30-200 pages. Both are public. Indexing both wastes space. Lean: index highlights body text as fast_facts; index full report body in body_text.
  • Recommendations fuzzy-matching to agencies. GAO's agency strings are inconsistent ("Department of Defense" vs "DoD" vs "Department of the Army"). A canonical resolver would benefit cross-source queries; defer to a separate normalization task.
  • Product detail page extraction reliability. The Drupal HTML structure has been observed inconsistent (different field names per template). Build a defensive parser with multiple selector fallbacks; track parse-success rate.
  • Whether to OCR pre-2003 PDFs. Many older reports are scanned. OCR adds compute cost; the agent rarely needs them. Lean toward skipping OCR; flag with body_text IS NULL and provide PDF link only.
  • Authors / GAO contact resolution. Each report names a GAO point-of-contact (a director-level GAO staffer). Building an analyst directory is useful for "what else has Director X published" — defer to v1.x.
  • Cross-reference to bills/hearings. GAO reports often reference congressional requests. The product page sometimes lists "Requesters" — Members or committees that requested the work. Worth extracting to a separate table (gao_report_requesters) with FKs to legislators / committees.
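For the extraction-reliability item above, "multiple selector fallbacks" plus a parse-success counter could be sketched as below in Python. The two markup patterns are hypothetical stand-ins (the real Drupal templates are not documented here), and regexes are used only to keep the sketch free of dependencies; a real parser would more likely use an HTML library.

```python
import re
from collections import Counter
from typing import Optional

# Hypothetical patterns for two page templates; ordered from most to least specific.
FAST_FACTS_PATTERNS = [
    re.compile(r'<div[^>]*class="[^"]*fast-facts[^"]*"[^>]*>(.*?)</div>', re.S),
    re.compile(r"<h2[^>]*>\s*Fast Facts\s*</h2>\s*<p[^>]*>(.*?)</p>", re.S),
]

PARSE_STATS: Counter = Counter()  # which fallback fired, per field


def extract_first(html: str, patterns: list, field: str) -> Optional[str]:
    """Try each pattern in order; record which one matched (or a miss)."""
    for index, pattern in enumerate(patterns):
        match = pattern.search(html)
        if match:
            PARSE_STATS[f"{field}:pattern_{index}"] += 1
            return re.sub(r"<[^>]+>", "", match.group(1)).strip()  # drop inline tags
    PARSE_STATS[f"{field}:miss"] += 1
    return None
```

Watching the miss count per field over time gives the parse-success rate and an early signal of template drift.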