GAO reports

Reports and testimonies from the Government Accountability Office, the legislative branch's audit and investigative arm. GAO produces in-depth investigations of federal programs, fraud audits, performance reviews, technology assessments, and Congressional testimony. Each report contains formal findings, often with recommendations to specific agencies — and a tracked implementation status that updates over years.

For Josh, GAO's value is two-fold: (1) substantive analysis the agent can cite when answering oversight or accountability questions, and (2) the recommendations database, a structured row-per-recommendation-per-agency dataset that quantifies implementation status. The recommendations database is the highest-signal subset; treat it as a separate sub-source.

GAO publishes no official public API and no developer or field-reference docs, so it is a direct-scrape source. Almost everything Josh-specific on this page is operational reverse-engineering that no upstream doc states. The gao.gov site is the canonical reference for the raw content; the bot-wall, feed, ID, and schema notes below are ours.

Quick reference


Source name	GAO reports
Publisher	Government Accountability Office
License	Public domain (US government works)
Coverage	1971 – present (the predecessor "GAO Reports" series dates back to 1921 but is mostly paper). Online corpus reliable for ~1995 onward.
Volume	~600-1000 reports + ~100-300 testimonies per year. Historical corpus ~50K+ reports. ~10,000+ active recommendations.
Storage estimate	~10-15 GB raw PDF + HTML; ~2-4 GB extracted text
Site	https://www.gao.gov/
Discovery feed	`/rss/reports.xml` (recent reports) — the sole open path (see Access notes). Per-topic feeds (`/rss/topics/{slug}.xml`) are dead (repeatable 404, mechanism gone).
Asset path pattern	`https://www.gao.gov/assets/gao-{c}-{NNNNNN}.pdf` (full report) and `gao-{c}-{NNNNNN}-highlights.pdf` — deterministic from the product number, but now bot-walled equally to product pages (Akamai 403).
Auth	None
Rate limits	None observed; HTML pages bot-walled (real Chrome required)
Caching headers	RSS: per-feed `Last-Modified`. PDFs: per-file `Last-Modified`.
Incremental sync hints	RSS `<pubDate>` + per-asset `Last-Modified`; sitemap exists but bot-walled
Stable ID format	`gao:{report-number}` e.g. `gao:GAO-26-107932`, `gao:GAO-25-106451`
Status	exploring — schema drafted, ingestion not built

Source priority decision

Primary: gao.gov direct. RSS for the discovery feed (open to bots); individual product pages via headed browser; asset PDFs at the deterministic asset URL pattern, also via headed browser.

Secondary: GovInfo GAO collections. Some GAO reports are in /bulkdata/GAOREPORTS/..., but coverage is sparse: GovInfo only mirrors the subset that GAO formally submits to GPO. Unreliable as primary.

Skip: ProPublica's GAO scraper / community mirrors. They lag the source and re-add a parsing layer. Direct is better.

Skip: Recommendations API third parties. A few exist; underlying data is on gao.gov already.

Access notes

HTML pages bot-walled. https://www.gao.gov/products/... and https://www.gao.gov/sitemap.xml return 403 to non-browser clients, with the same Akamai-style fingerprinting we hit on Senate.gov. The RSS endpoint at /rss/reports.xml is the sole open path and serves as our discovery feed. Asset URLs (/assets/gao-{c}-{NNNNNN}.pdf) are now walled the same way as the product pages (Akamai 403), so there is no longer a cheaper PDF path; the headed browser is required for both.

For ingestion, the same headed-browser fetcher used for Senate roll-call XML (see roll-call votes) can be reused here. It scales: GAO publishes ~1,000 reports/year, all fetchable in a few hours of browser time.

Feed item links are lowercase (e.g. /products/gao-26-107967), so normalize casing when deriving the product-page URL and the deterministic asset path. The full browser context is required for both the product/HTML pages and the asset PDFs.
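A minimal sketch of that normalization, assuming the feed link always ends in the product slug and that asset filenames are the lowercase product number (the pattern from Quick reference). The function name `derive_urls` and the returned keys are illustrative only; whether suffixed products such as testimonies follow the same asset pattern is an untested assumption.

```python
import re
from urllib.parse import urlparse

# Modern product slug: gao-{FY}-{6 digits}, optionally followed by a single
# letter suffix such as "t" (testimony) or "r" (correspondence).
_SLUG_RE = re.compile(r"^gao-(\d{2})-(\d{6})([a-z]?)$", re.IGNORECASE)

BASE = "https://www.gao.gov"


def derive_urls(feed_link: str) -> dict:
    """Derive the stable ID and canonical URLs from an RSS <link> value."""
    slug = urlparse(feed_link).path.rstrip("/").rsplit("/", 1)[-1]
    if not _SLUG_RE.match(slug):
        raise ValueError(f"not a modern GAO product slug: {slug!r}")
    lower = slug.lower()          # URLs and asset paths are lowercase
    report_number = slug.upper()  # stable IDs preserve uppercase
    return {
        "id": f"gao:{report_number}",
        "report_number": report_number,
        "product_url": f"{BASE}/products/{lower}",
        # Assumption: the asset name mirrors the slug, suffix included.
        "full_report_pdf_url": f"{BASE}/assets/{lower}.pdf",
        "highlights_pdf_url": f"{BASE}/assets/{lower}-highlights.pdf",
    }
```

Both URLs still have to be fetched through the headed browser; the helper only removes the casing ambiguity between feed links and stable IDs.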

Per-topic RSS feeds are dead. GAO formerly published per-topic feeds at predictable URLs (/rss/topics/agriculture.xml, /rss/topics/health-care.xml, ~30 topics). Every /rss/topics/{slug}.xml now returns a fast, repeatable 404 (the mechanism is gone, not an Akamai wall or a timeout). The primary /rss/reports.xml feed is the sole discovery feed; there is no per-topic safety net. Filtered-by-topic ingestion must be done by reading /rss/reports.xml and filtering on the product page's topic tags.

Endpoints / URL patterns

The raw URL surface (RSS feed, /products/{report-number}, /assets/...pdf, -highlights.pdf, the recommendations database, /sitemap.xml, /topics/{slug}) lives on the gao.gov site — GAO has no developer-docs page that enumerates it, so the operational facts that matter for Josh are captured in Access notes above (which paths are open, which are bot-walled, which are dead). The two load-bearing paths:

https://www.gao.gov/rss/reports.xml — RSS feed of recent reports, the one path open to bots. Carries the full "What GAO Found" body in <description> (see Response shapes).
https://www.gao.gov/assets/gao-{c}-{NNNNNN}.pdf — full report PDF, deterministic from the product number, but bot-walled equally to product pages (Akamai 403), so headed-browser required.

The searchable recommendations database is bot-walled too.

Vocabulary / enums

Report number format

GAO-{c}-{NNNNNN} where {c} is the two-digit GAO fiscal year (not congress) and {NNNNNN} is a 6-digit sequential ID.

Examples:

GAO-26-107932 — FY2026 report
GAO-25-106451 — FY2025 report
GAO-23-105680 — FY2023 report

GAO's fiscal year (and report-number year) starts October 1 of the prior calendar year. So GAO-26-... reports are published October 2025 – September 2026.

Pre-2003 numbers used letter-prefix sequences (GAO/T-RCED-99-1). For the v1 ingestion scope (mostly post-2010), the modern numeric format dominates.
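As an illustration of the fiscal-year rule, the sketch below maps a modern report number to the window in which it should have been published. The helper name `fiscal_year_window` is hypothetical, and it assumes every modern-format number belongs to the 2000s; legacy letter-prefix numbers are rejected rather than guessed at.

```python
import re
from datetime import date

# Modern format only: GAO-{FY}-{6 digits} with an optional letter suffix.
_NUM_RE = re.compile(r"^GAO-(\d{2})-(\d{6})([A-Z]?)$")


def fiscal_year_window(report_number: str) -> tuple[date, date]:
    """Return (first_day, last_day) of the fiscal year encoded in the number."""
    m = _NUM_RE.match(report_number.upper())
    if m is None:
        raise ValueError(f"not a modern GAO report number: {report_number!r}")
    fy = 2000 + int(m.group(1))  # assumption: two-digit year is 20xx
    # The fiscal year starts October 1 of the prior calendar year.
    return date(fy - 1, 10, 1), date(fy, 9, 30)
```

A publish date outside this window is a cheap sanity flag for a mis-parsed number or date.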

Product type is inferred from the URL slug + title pattern: a plain /products/gao-{c}-{n} is a standard report; a trailing T (/products/gao-{c}-{n}t) is testimony; a trailing R is correspondence; a "Technology Assessment" title is a technology assessment; bid-protest decisions live in a separate URL space (/legal/decisions/...). These map to the product_type CHECK constraint in the schema. The full GAO topic taxonomy (~30 top-level topics, each with a kebab-case slug) is stored free-text in gao_report_topics — we do not pre-define an enum and accept new topics as they appear.
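The inference rules above can be sketched as a small function. This is an assumption-laden illustration, not a tested parser: `infer_product_type` is a hypothetical name, the title check is a plain substring match, and anything that does not look like a modern product slug (including bid-protest decisions, which live elsewhere) falls through to 'other'.

```python
import re

_SLUG_RE = re.compile(r"^gao-\d{2}-\d{6}([a-z]?)$")


def infer_product_type(slug: str, title: str = "") -> str:
    """Map a product slug plus title to a product_type CHECK value."""
    m = _SLUG_RE.match(slug.lower())
    if m is None:
        return "other"
    # Title pattern takes priority over the URL suffix.
    if "technology assessment" in title.lower():
        return "technology_assessment"
    suffix = m.group(1)
    if suffix == "t":
        return "testimony"
    if suffix == "r":
        return "correspondence"
    if suffix == "":
        return "report"
    return "other"  # unrecognized suffix letter
```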

Recommendation status

Landed as the gao_recommendations.status CHECK constraint — these five values must stay byte-identical to the schema (including the em-dash):

Open — pending; agency has not implemented
Open—Partially Addressed — agency took some but not all steps
Closed—Implemented — agency took recommended action; verified by GAO
Closed—Not Implemented — agency declined / time-limited; closed without action
Closed—Other — withdrawn, OBE'd, or moot

The upstream prose definitions live in the recommendations database; the canonical enum is the CHECK constraint below.
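Because the CHECK constraint rejects anything that is not byte-identical, scraped status strings are worth normalizing before insert. The sketch below assumes (without having confirmed it against live pages) that the separator may arrive as a hyphen, en dash, or spaced dash; `normalize_status` is a hypothetical helper name.

```python
import re

# Must stay byte-identical to the gao_recommendations.status CHECK constraint.
VALID_STATUSES = frozenset({
    "Open",
    "Open—Partially Addressed",
    "Closed—Implemented",
    "Closed—Not Implemented",
    "Closed—Other",
})


def normalize_status(raw: str) -> str:
    """Collapse whitespace and dash variants, then validate against the enum."""
    cleaned = " ".join(raw.split())
    # Rewrite the first hyphen / en dash / em dash (with optional spaces)
    # as a bare em dash (U+2014), matching the schema values.
    cleaned = re.sub(r"\s*[-\u2013\u2014]\s*", "\u2014", cleaned, count=1)
    if cleaned not in VALID_STATUSES:
        raise ValueError(f"unknown recommendation status: {raw!r}")
    return cleaned
```

Raising on unknown values, rather than mapping them to a default, surfaces any new status GAO introduces instead of silently misfiling it.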

Stable ID format

gao:{report-number}, uppercase preserved.

Examples:

gao:GAO-26-107932
gao:GAO-25-106451T (testimony)

Per-recommendation IDs (when ingesting the recommendations database):

gao-rec:{report-number}-{recommendation-number} e.g. gao-rec:GAO-25-106451-1 for the first recommendation in that report.
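A small sketch of building and splitting these IDs, assuming recommendation numbers are positive integers numbered within a report. Both function names are hypothetical.

```python
def recommendation_id(report_number: str, number: int) -> str:
    """Build 'gao-rec:{report-number}-{recommendation-number}'."""
    if number < 1:
        raise ValueError("recommendation numbers start at 1")
    return f"gao-rec:{report_number.upper()}-{number}"


def split_recommendation_id(rec_id: str) -> tuple[str, int]:
    """Inverse of recommendation_id: return (report_number, number)."""
    prefix, _, rest = rec_id.partition(":")
    if prefix != "gao-rec":
        raise ValueError(f"not a recommendation ID: {rec_id!r}")
    # The recommendation number is everything after the last hyphen.
    report_number, _, number = rest.rpartition("-")
    return report_number, int(number)
```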

Response shapes

Because GAO publishes no schema reference, the observed shapes below are the only documentation of GAO's feed/page structure that exists — they are kept here rather than linked out.

RSS feed item

<item>
  <title>Hydrogen Energy: Technologies Offer Potential Benefits but Face Challenges to Widespread Use</title>
  <link>https://www.gao.gov/products/gao-26-107932</link>
  <description>
    What GAO Found
    Hydrogen energy technologies offer long-duration energy storage, ...
    [Full "What GAO Found" body, ~500-2000 words]
  </description>
  <pubDate>Thu, 30 Apr 2026 00:00:00 -0400</pubDate>
</item>

The RSS <description> contains the full "What GAO Found" body, often as pre-formatted plaintext from the page. This is enough text for a useful initial parse without fetching the full PDF; for backfill, the RSS body alone may serve for retrieval indexing. <pubDate> is in RFC 822 / RFC 2822 format, and its time component is always 00:00:00, so it carries date precision only.
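A minimal sketch of parsing feed items with the Python standard library, assuming the feed is plain RSS 2.0 with un-namespaced `<item>` children as in the observed shape. `parse_feed` and the dict keys are illustrative names.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def parse_feed(xml_text: str) -> list[dict]:
    """Extract title, link, description, and publish date from each <item>."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        pub = (item.findtext("pubDate") or "").strip()
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
            "description": (item.findtext("description") or "").strip(),
            # The time component carries no information, so keep the date only.
            "publish_date": parsedate_to_datetime(pub).date() if pub else None,
        })
    return items
```

`parsedate_to_datetime` handles the RFC 2822 date form directly, so no custom format string is needed.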

Product detail page (post-bot-wall fetch)

Sections observed via headed-browser fetch of gao-26-107932: the GAO ID and publish/public-release dates (the latter sometimes lags publish), a Topics tag list, Fast Facts (lay summary), the Highlights block (What GAO Found, Why GAO Did This Study, What GAO Recommends), a Full Report section linking the full-report and highlights PDFs, GAO Contacts (report authors), and a Recommendations table when the report has formal recommendations. The "What GAO Recommends" section sometimes says "GAO is making N recommendations" with details below; sometimes it is just one summary paragraph. The Drupal HTML structure has been observed to be inconsistent across templates (see Open questions).

Recommendations table per report

When a report has formal recommendations the detail page exposes a structured table — one row per recommendation with #, recommendation prose, target agency, status, and date, e.g.:

| # | Recommendation                        | Agency | Status              | Date         |
|---|---------------------------------------|--------|---------------------|--------------|
| 1 | DoD should establish a process to ... | DoD    | Open                | Apr 30, 2026 |
| 2 | DOE should issue guidance ...         | DOE    | Closed—Implemented  | Sep 12, 2024 |

Each row's Status updates over time as agencies implement (or don't). This is the source of the recommendations database and drives the monthly refresh.

Volume

Metric	Value
Reports per year	~600-1,000
Testimonies per year	~100-300
Total all-time (online)	~50,000
Per-report PDF size	~500 KB - 5 MB
RSS feed size	~50-100 KB (recent ~50 reports)
Per-report HTML	~50-150 KB
Recommendations all-time	~30,000+ tracked, ~10,000+ open at any time
Recommendations per report	0-50; median ~3-5

Postgres footprint: ~3-5 GB after backfill (text + metadata, no PDF bytes inline). Recommendations table ~30K rows.

Caching / incremental sync

Three signals:

RSS <pubDate> — new reports appear in /rss/reports.xml on publication day. Read every 6-12 hours.
Per-asset Last-Modified on the PDF — confirms re-issuances and corrigenda.
Recommendations database polling — recommendations status updates lag the report and continue for years. Re-poll the database monthly per active report.

Practical incremental:

6-hourly: RSS poll of /rss/reports.xml (the sole feed) → diff against gao_reports.id set → fetch new product pages + PDFs.
Backstop for missed reports (e.g., feed truncation): headed-browser crawl of /reports-testimonies?page=N — the per-topic RSS feeds are dead (404), so there is no lightweight secondary feed.
Monthly: recommendations status refresh for all reports with at least one Open recommendation. ~10K reports × 1 page fetch = 10K browser requests/month, ~42 hours of browser time at the ~4 fetches/min assumed for backfill (2 instances × ~2 fetches/min).
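The diff step of the 6-hourly poll can be sketched as a pure function over feed links and the set of IDs already in gao_reports. The name `new_report_ids` is hypothetical, and the sketch assumes every feed link ends in the product slug.

```python
from typing import Iterable


def new_report_ids(feed_links: Iterable[str], known_ids: set) -> list:
    """Return stable IDs present in the feed but not yet in gao_reports."""
    fresh = []
    seen = set()
    for link in feed_links:
        slug = link.rstrip("/").rsplit("/", 1)[-1]
        report_id = f"gao:{slug.upper()}"  # feed slugs are lowercase
        if report_id not in known_ids and report_id not in seen:
            seen.add(report_id)
            fresh.append(report_id)  # preserve feed order
    return fresh
```

Each returned ID then becomes one queued product-page fetch plus its PDF fetches.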

Schema (Postgres DDL)

-- ============================================================
-- GAO reports
-- ============================================================

CREATE TABLE gao_reports (
    id text PRIMARY KEY,                          -- 'gao:GAO-26-107932'
    report_number text NOT NULL UNIQUE,           -- 'GAO-26-107932'

    -- Type discriminator (inferred from URL suffix or title)
    product_type text NOT NULL CHECK (product_type IN (
        'report', 'testimony', 'correspondence', 'technology_assessment', 'decision', 'other'
    )),

    title text NOT NULL,
    subtitle text,                                -- post-colon part of title

    -- Dates
    publish_date date NOT NULL,
    public_release_date date,                     -- sometimes lags publish (e.g. classified annexes)

    -- Body sections
    fast_facts text,                              -- lay summary
    what_found text,                              -- "What GAO Found"
    why_studied text,                             -- "Why GAO Did This Study"
    what_recommends text,                         -- "What GAO Recommends" prose

    -- Source URLs
    product_url text NOT NULL,                    -- gao.gov product page
    full_report_pdf_url text,                     -- /assets/gao-{c}-{n}.pdf
    highlights_pdf_url text,                       -- /assets/gao-{c}-{n}-highlights.pdf

    -- Body text (extracted from PDF or HTML)
    body_text text,                               -- full report text

    -- Lifecycle
    raw_html bytea,                               -- gzipped product page HTML
    raw_rss_description text,                      -- RSS body — useful as backup body text
    fetched_at timestamptz NOT NULL,
    parsed_at timestamptz,
    inserted_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX gao_reports_publish_date ON gao_reports (publish_date DESC);
CREATE INDEX gao_reports_type ON gao_reports (product_type, publish_date DESC);
CREATE INDEX gao_reports_recent ON gao_reports (publish_date DESC) WHERE publish_date >= '2020-01-01';

ALTER TABLE gao_reports ADD COLUMN search_tsv tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(fast_facts, '') || ' ' || coalesce(subtitle, '')), 'B') ||
        setweight(to_tsvector('english', coalesce(what_found, '') || ' ' || coalesce(why_studied, '') || ' ' || coalesce(what_recommends, '')), 'C') ||
        setweight(to_tsvector('english', coalesce(body_text, '')), 'D')
    ) STORED;
CREATE INDEX gao_reports_search ON gao_reports USING gin (search_tsv);

-- ============================================================
-- Topics (free-text from GAO; we don't pre-define enum)
-- ============================================================

CREATE TABLE gao_report_topics (
    report_id text NOT NULL REFERENCES gao_reports(id) ON DELETE CASCADE,
    topic text NOT NULL,                          -- 'Energy', 'Health Care', etc.
    PRIMARY KEY (report_id, topic)
);
CREATE INDEX gao_report_topics_topic ON gao_report_topics (topic);

-- ============================================================
-- Agencies referenced (free-text initially; can soft-FK later)
-- ============================================================

CREATE TABLE gao_report_agencies (
    report_id text NOT NULL REFERENCES gao_reports(id) ON DELETE CASCADE,
    agency_name text NOT NULL,                    -- 'Department of Energy'
    agency_id text REFERENCES agencies(id),       -- soft FK to agencies table when resolvable
    PRIMARY KEY (report_id, agency_name)
);

-- ============================================================
-- Recommendations (the high-value structured subset)
-- ============================================================

CREATE TABLE gao_recommendations (
    id text PRIMARY KEY,                          -- 'gao-rec:GAO-25-106451-1'
    report_id text NOT NULL REFERENCES gao_reports(id) ON DELETE CASCADE,
    recommendation_number int NOT NULL,           -- sequence within report

    text text NOT NULL,                           -- recommendation prose

    -- Target
    agency_name text NOT NULL,
    agency_id text REFERENCES agencies(id),

    -- Status (updates over time)
    status text NOT NULL CHECK (status IN (
        'Open', 'Open—Partially Addressed',
        'Closed—Implemented', 'Closed—Not Implemented', 'Closed—Other'
    )),
    status_date date,                             -- last status update
    status_history jsonb,                         -- list of {date, status, comment}

    -- Lifecycle
    fetched_at timestamptz NOT NULL,
    inserted_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now(),

    UNIQUE (report_id, recommendation_number)
);

CREATE INDEX gao_recommendations_status ON gao_recommendations (status, status_date DESC);
CREATE INDEX gao_recommendations_agency ON gao_recommendations (agency_name);
CREATE INDEX gao_recommendations_open ON gao_recommendations (report_id) WHERE status LIKE 'Open%';

-- ============================================================
-- Vector chunks for similarity
-- ============================================================

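-- Assumes the pgvector extension, which provides the vector type and hnsw index method.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE gao_report_chunks (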
    id bigserial PRIMARY KEY,
    report_id text NOT NULL REFERENCES gao_reports(id) ON DELETE CASCADE,
    chunk_index int NOT NULL,
    chunk_text text NOT NULL,
    embedding vector(1024),
    UNIQUE (report_id, chunk_index)
);
CREATE INDEX gao_report_chunks_embedding
    ON gao_report_chunks USING hnsw (embedding vector_cosine_ops);

Schema decisions worth flagging:

raw_rss_description preserved as a fallback body text source. The RSS body is ~80% of "What GAO Found" pre-extracted; useful when product-page parsing fails.
raw_html as gzipped bytea — product HTML is small, preserving it lets us re-parse later when GAO adds structured fields we currently miss (recommendations table parsing is fragile).
gao_recommendations.status_history as jsonb: status changes over years. The JSON list is small (typically <10 entries per recommendation) and rarely used as a query filter; keeping it inline simplifies updates.
agency_id is a soft (nullable) FK: many GAO recommendations target agencies whose canonical ID may not be in our agencies table (especially sub-agencies). Leave it NULL when the agency cannot be resolved; don't enforce NOT NULL.
No formal subagency or bureau field — too inconsistent; GAO's "agency" target is sometimes the cabinet department, sometimes a sub-bureau. Resolution is a future task.

Download / update strategy

Backfill

Discovery: Use the headed browser (same fetcher as Senate roll-call) to crawl https://www.gao.gov/reports-testimonies?page=N for every paginated page. This is the comprehensive list.
Extract (report_number, title, publish_date, product_url) per row.
For each report_number:
- Fetch product page via headed browser. Save HTML to /data/corpus/gao/products/{report_number}.html.
- Extract sections (fast facts, what found, why studied, what recommends, topics, agencies, recommendations).
- Construct asset URLs deterministically; fetch full-report PDF + highlights PDF.
- Save to /data/corpus/gao/assets/{report_number}.pdf and /data/corpus/gao/assets/{report_number}.highlights.pdf.
Parse + load.
Embed chunks.

Time budget: ~50K reports backfilled × 1 product page + 1 PDF each = 100K browser-mediated fetches. With 2 concurrent browser instances at ~2 fetches/min each (~4 fetches/min total), that is ~17 days. Strategy: backfill from 2010 forward (~12K reports, ~24K fetches, ~4 days) and accept gaps for older reports.
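The time-budget arithmetic can be checked with a small helper. It assumes the stated throughput (2 instances at ~2 fetches/min each) holds steadily with no retries or downtime, which is optimistic; `browser_days` is a hypothetical name.

```python
def browser_days(fetches: int, instances: int = 2,
                 fetches_per_instance_per_min: float = 2.0) -> float:
    """Wall-clock days needed for a number of browser-mediated fetches."""
    fetches_per_min = instances * fetches_per_instance_per_min
    minutes = fetches / fetches_per_min
    return minutes / (60 * 24)


# Full backfill: ~50K reports x (1 product page + 1 PDF) = 100K fetches,
# which comes to about 17 days at 4 fetches/min.
full_backfill_days = browser_days(50_000 * 2)
```

The same function can size a narrower backfill window or the monthly recommendations refresh by changing the fetch count.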

Alternatively, backfill could rely on accumulating the RSS feed: GAO doesn't archive RSS, but the feed carries the last ~50 entries with full body. That only builds a corpus going forward (~10 years × 12 polls/year = 120 RSS fetches over time), so it is not viable for a fresh launch. The headed-browser crawl is the answer.

Daily incremental

6-hourly: RSS poll. Fetch /rss/reports.xml. Parse <item>s. Diff against gao_reports.id.
For each new item:
- Use the RSS body for initial what_found text — already extracted.
- Queue a headed-browser fetch for the product page (gives us topics, agencies, recommendations).
- Queue PDF fetch.
Periodic backstop: headed-browser crawl of /reports-testimonies?page=N. Catch any items missed by the main feed (rare — feed is comprehensive). The per-topic RSS feeds that previously served this role are dead (404).

Monthly recommendations refresh

For every report with any Open* recommendation, re-fetch the product page and re-parse the recommendations table. Update gao_recommendations.status and append to status_history if it changed. ~10K product fetches/month, ~42 hours of headed-browser time at the ~4 fetches/min assumed for backfill.
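The update rule for a single recommendation can be sketched as below, assuming `row` is a plain dict mirroring the gao_recommendations columns and that history entries use the {date, status, comment} shape from the schema comment. `apply_status` is a hypothetical helper, not existing code.

```python
from datetime import date
from typing import Optional


def apply_status(row: dict, new_status: str, observed: date,
                 comment: Optional[str] = None) -> bool:
    """Update row in place if the status changed; return True on change."""
    if row["status"] == new_status:
        return False  # unchanged: leave status_date and history alone
    history = list(row.get("status_history") or [])
    history.append({
        "date": observed.isoformat(),
        "status": new_status,
        "comment": comment,
    })
    row["status"] = new_status
    row["status_date"] = observed
    row["status_history"] = history
    return True
```

Returning a boolean lets the caller skip the database write for the common case where nothing changed.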

State tracking

Source keys: gao_reports_rss, gao_reports_backfill, gao_recommendations_refresh. Each tracked separately.

Failure modes

gao.gov 503 / 429 — back off; retry. Reduce concurrency.
Product page schema drift — GAO occasionally restructures the Drupal frontend. Detect via missing required field; fall back to PDF text extraction; alert for parser update.
PDF asset 404 — some reports have only highlights, or the asset path is non-standard. Fall back to whatever links the product page does expose.
Duplicate report numbers across years — should not happen (numbers include FY) but guard against it; raise on conflict.
Recommendations table parse failure — recommendation text is structured but the HTML has been observed to vary. Soft-fail: insert gao_reports without recommendations; queue for retry next refresh.
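For the 503 / 429 case, a minimal exponential-backoff sketch follows. It assumes a caller-supplied `fetch(url)` that returns a `(status_code, body)` tuple, which is a stand-in for the headed-browser fetcher and not its real interface; the delays and attempt count are illustrative defaults.

```python
import random
import time

RETRYABLE = (429, 503)


def fetch_with_backoff(fetch, url, max_attempts=5, base_delay=2.0,
                       sleep=time.sleep):
    """Call fetch(url), retrying on 429/503 with exponential backoff."""
    for attempt in range(max_attempts):
        status, body = fetch(url)
        if status not in RETRYABLE:
            return status, body
        if attempt == max_attempts - 1:
            break  # out of attempts; do not sleep again
        # Delay doubles each attempt; jitter avoids synchronized retries.
        delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
        sleep(delay)
    raise RuntimeError(f"gave up on {url} after {max_attempts} attempts")
```

Reducing concurrency is a separate lever that sits in the scheduler, outside this per-request helper.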

Open questions

These don't block ingestion but should be resolved before this source is "shipped":

Pre-2010 backfill scope. ~30K reports exist 1995-2010 in the online corpus. Are they substantively useful for an agent in 2026? Some are. Decision: skip pre-2010 in v1; revisit if user feedback says we're missing important historical context.
Bid Protest decisions are at /legal/bid-protests/... — separate URL space, separate database, separate value. Defer to v1.x; document the gap.
Restricted Reports at /reports-testimonies/restricted — classified versions exist with a public summary. Ingest the public summary only.
Highlights vs full report dual-text. Highlights is the 1-page lay summary; full report is 30-200 pages. Both are public. Indexing both wastes space. Lean: index highlights body text as fast_facts; index full report body in body_text.
Recommendations fuzzy-matching to agencies. GAO's agency strings are inconsistent ("Department of Defense" vs "DoD" vs "Department of the Army"). A canonical resolver would benefit cross-source queries; defer to a separate normalization task.
Product detail page extraction reliability. The Drupal HTML structure has been observed to be inconsistent (different field names per template). Build a defensive parser with multiple selector fallbacks; track parse-success rate.
Whether to OCR pre-2003 PDFs. Many older reports are scanned. OCR adds compute cost; the agent rarely needs them. Lean toward skipping OCR; flag with body_text IS NULL and provide PDF link only.
Authors / GAO contact resolution. Each report names a GAO point-of-contact (a director-level GAO staffer). Building an analyst directory is useful for "what else has Director X published" — defer to v1.x.
Cross-reference to bills/hearings. GAO reports often reference congressional requests. The product page sometimes lists "Requesters" — Members or committees that requested the work. Worth extracting to a separate table (gao_report_requesters) with FKs to legislators / committees.