CRS reports

Congressional Research Service products: nonpartisan analysis written for Members of Congress and their staff. Covers virtually every policy area — economics, foreign affairs, intelligence, environment, agriculture, social policy. CRS reports are the go-to backgrounder format on Capitol Hill: a Member's office gets one before a vote, before a hearing, before a town hall.

For Josh, CRS reports are a high-value retrieval target — they're substantive, well-cited, and authored by domain experts. They are also (since 2018) all public, all online, and uniformly formatted enough to ingest reliably.

This is a reference / analysis source — no FK relationships into the legislative-action graph except the soft relatedMaterials links to bills. Load whenever; not blocking on legislators or bills. Citation extraction at parse time will populate cross-references to bills, USC, and CFR.

Quick reference


Source name	CRS reports
Publishers	Congressional Research Service (Library of Congress)
License	Reports are works of the US government — public domain. CRS asserts no copyright.
Coverage	1971 – present. Public posting became mandatory under the Consolidated Appropriations Act, 2018 (P.L. 115-141), but the EveryCRSReport.com mirror reaches back well before that: 9,820 of the 23,211 CSV rows pre-date 2013 (back to 1971), and all of them are ingested, so pre-2013 is not out of scope. ~23K active reports + thousands of archived versions.
Volume	~23K reports in the mirror (23,211 CSV rows as of May 2026, 9,820 of them pre-2013); ~600-1500 new + ~3K updates per year. Each report ~50-300 KB HTML, ~100-500 KB PDF.
Storage estimate	~3-5 GB raw HTML + PDF; ~1-2 GB extracted text
Primary source	EveryCRSReport.com community mirror (see Source priority decision)
Verification source	Congress.gov v3 `/crsreport` API — see api.congress.gov CRS report docs (register a key at api.data.gov/signup)
Stable ID format	`crs:{number}` e.g. `crs:R48481`, `crs:IF13213`
Status	exploring — schema drafted, ingestion not built

Upstream endpoint, auth, and rate-limit specifics live in the EveryCRSReport about page and the Congress.gov CRS report API docs; the access behavior Josh observed in practice is in Access notes below.

Source priority decision

Primary: EveryCRSReport.com mirror. Project run by Daniel Schuman + Demand Progress (and now FrontFin / Federation of American Scientists), continuously crawled from crsreports.congress.gov. Provides:

A daily-refreshed master CSV (reports.csv, ~6.6 MB) with every report's number, URL, latest pub date, title, latest PDF/HTML paths. Conditional GET works.
Per-report JSON at reports/{number}.json with full version history (every revision the mirror has crawled) and formats with SHA1 checksums. Top-level keys are exactly [active, id, number, source, topics, type, typeId, versions] — there is no authors, relatedBills, or bills key, so the crs_report_authors and crs_report_related_bills tables receive zero rows on this mirror.
The actual PDF and HTML files at deterministic paths.

This is the path of least resistance: one CSV download tells us the entire universe; one JSON per report gives the full version history; SHA1 checksums make idempotency cheap. The mirror's per-report JSON shape is documented (informally) on the EveryCRSReport about page — there is no canonical upstream API reference for it.
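A minimal sketch of reading one reports/{number}.json payload: pick the most recent entry in versions and collect its per-format SHA1 checksums. The top-level versions key is from the key list above; the per-version field names used here (date, formats, format, sha1) are assumptions about the mirror's shape and should be checked against a real payload before relying on them.

```python
from typing import Any


def latest_version(report: dict[str, Any]) -> dict[str, Any]:
    """Return the version entry with the greatest date.

    Assumes every version carries an ISO-8601 'date' string in one
    consistent format, so plain string comparison orders them correctly.
    """
    versions = report.get("versions") or []
    if not versions:
        raise ValueError(f"report {report.get('number')!r} has no versions")
    return max(versions, key=lambda v: v.get("date", ""))


def format_checksums(version: dict[str, Any]) -> dict[str, str]:
    """Map format name ('HTML', 'PDF') to SHA1, skipping incomplete entries."""
    return {
        f["format"].upper(): f["sha1"]
        for f in version.get("formats", [])
        if f.get("format") and f.get("sha1")
    }
```

The checksums returned for the latest version are what would land in latest_html_sha1 and latest_pdf_sha1.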

Secondary: Congress.gov v3 /crsreport. The official source, with a cleaner JSON shape and authoritative data. Use it as a verification layer (spot-check a handful of reports daily for drift) but not as primary, because (a) it costs api.data.gov quota, (b) the EveryCRSReport mirror is comprehensive and faster, and (c) the official feed sometimes lags the mirror by hours. The field-by-field response shape is documented at api.congress.gov.

Skip: scraping crsreports.congress.gov directly. EveryCRSReport already does this. Re-doing it duplicates work and creates a second crawl footprint on the underlying source.

Skip: separate pre-2018 backfill via UNT digital library, FAS legacy archive, etc. Unnecessary — the EveryCRSReport.com mirror already carries the older corpus (9,820 of 23,211 rows pre-date 2013, back to 1971), and the ingester loads all of it. No extra archive scrape is needed for pre-2018 coverage.

Access notes

EveryCRSReport: clean nginx behind Cloudflare, Last-Modified and ETag headers reliable, If-Modified-Since and If-None-Match both work. No User-Agent restrictions. Daily refresh of reports.csv happens around 07:00 UTC.
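Since both validators work, a conditional fetch of reports.csv could look roughly like the sketch below. It uses only the standard library; the CSV_URL constant and the function names are illustrative assumptions, not part of the ingester.

```python
import urllib.error
import urllib.request
from typing import Optional

# Assumed location of the master CSV on the mirror.
CSV_URL = "https://www.everycrsreport.com/reports.csv"


def build_conditional_request(
    url: str, etag: Optional[str], last_modified: Optional[str]
) -> urllib.request.Request:
    """Attach whichever validators were saved from the previous run."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return urllib.request.Request(url, headers=headers)


def fetch_if_changed(url: str, etag: Optional[str], last_modified: Optional[str]):
    """Return (body, etag, last_modified), or None when the server answers 304."""
    req = build_conditional_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.read(), resp.headers.get("ETag"), resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        # urllib surfaces 304 Not Modified as an HTTPError.
        if err.code == 304:
            return None
        raise
```

A None result means the daily run can stop early; otherwise the returned validators are what should be saved for the next run.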

Congress.gov: same auth pattern as Bills. Real api.data.gov key needed for ingestion. DEMO_KEY hits ~30/hour cap quickly during probing.

Endpoints

Upstream URL patterns, the per-report JSON / CSV column schemas, and the Congress.gov v3 field listings are documented upstream:

EveryCRSReport (reports.csv columns, reports/{number}.json, files/... PDF/HTML paths): everycrsreport.com/about
Congress.gov v3 /crsreport and /crsreport/{id}: api.congress.gov CRS report docs

Josh-specific notes on those shapes that no upstream doc states:

The reports.csv URLs are relative to everycrsreport.com/.
The direct CRS document URLs on congress.gov follow crs_external_products/{TYPE}/PDF/{number}/{number}.{version}.pdf and crs_external_products/{TYPE}/HTML/{number}.web.html, where {TYPE} is the prefix (R, IF, RL, etc.) and {version} is a numeric per-report version (.1., .2.).
The mirror's per-report JSON has only the keys [active, id, number, source, topics, type, typeId, versions] — see Source priority decision for the zero-row consequence on crs_report_authors / crs_report_related_bills.
On Congress.gov, the relatedMaterials field is a soft cross-reference — a CRS report cites bills, but linking is editorial (CRS staff tag the report with relevant bills they discuss). Useful for the agent's "find me CRS analysis on this bill" query.
The authors field (Congress.gov only) is sparse on older reports; CRS keeps analyst names current but legacy reports (pre-2010) may have author byline removed for retired-staff privacy.
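The direct-URL pattern noted above can be expressed as two small path builders. This is a sketch under the stated pattern only: it returns relative paths (no host is assumed), and it handles plain alphabetic prefixes such as R, IF, and LSB, not the hyphenated specialized ones.

```python
import re


def _type_prefix(number: str) -> str:
    """Leading uppercase letters of a report number, e.g. 'IF' for 'IF13213'."""
    match = re.match(r"[A-Z]+", number)
    if not match:
        raise ValueError(f"no type prefix in report number {number!r}")
    return match.group(0)


def direct_pdf_path(number: str, version: int) -> str:
    """crs_external_products/{TYPE}/PDF/{number}/{number}.{version}.pdf"""
    return f"crs_external_products/{_type_prefix(number)}/PDF/{number}/{number}.{version}.pdf"


def direct_html_path(number: str) -> str:
    """crs_external_products/{TYPE}/HTML/{number}.web.html"""
    return f"crs_external_products/{_type_prefix(number)}/HTML/{number}.web.html"
```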

Vocabulary / enums

CRS report types (prefix in report `number`)

Distribution from a fresh EveryCRSReport CSV (May 2026):

Prefix	Count	Type
`R`	6,881	Standard report
`RL`	4,280	Long-form report (legacy prefix; phased out for new reports)
`IF`	3,183	In Focus — 2-page brief on a single topic
`RS`	2,433	Short report (legacy prefix)
`IN`	2,353	Insight — short topical analysis
`LSB`	1,359	Legal Sidebar — legal analysis
`IB`	400	Issue Brief (mostly historical)
`TE`	91	Testimony (when CRS analyst testified)
`IG`	91	Infographic
`ZZZ`, `ZZZA-F`	~85	Archived / withdrawn
`IP`	41	Information Paper
`IS`	32	Information Sheet
`MB`	19	Memorandum (rare; specific to a Member request)
`SP-`, `RA-`, `M-`, `CP`, `IPO`, `SS-CSRS`	<10 each	Specialized prefixes

Some rows have an empty prefix, and ~2,994 rows have commas or quotes in the title, so the file needs an RFC 4180-aware parser rather than a naive split on commas. Handle malformed rows defensively: log the row and continue instead of failing the run.
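A defensive parse of reports.csv might look like the sketch below. Python's csv module already follows RFC 4180 quoting, so quoted commas and doubled quotes in titles come through intact; rows that are still malformed are collected instead of raised. The column names in REQUIRED are assumptions about the CSV header.

```python
import csv
import io

# Assumed header names; adjust to the real reports.csv header.
REQUIRED = ("number", "latestPubDate", "title")


def parse_reports_csv(text: str):
    """Return (rows, errors). A bad row is recorded with its line number, never raised."""
    rows, errors = [], []
    reader = csv.DictReader(io.StringIO(text, newline=""))
    for row in reader:
        # DictReader puts surplus cells under the None key and fills
        # missing cells with None, so both cases are easy to detect.
        has_extra_cells = None in row
        missing = [c for c in REQUIRED if not (row.get(c) or "").strip()]
        if has_extra_cells or missing:
            errors.append((reader.line_num, row))
            continue
        rows.append(row)
    return rows, errors
```

The errors list is what the daily run would log per row before carrying on with the rest of the file.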

The Congress.gov contentType enum (reports, posts, resources, infographics, testimony) and the status enum (active | archived) are documented at api.congress.gov. Note: on the mirror an archived report may still be present, and Josh stores it.
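For filling type_id and type_name, a lookup along these lines may be enough. The mapping is transcribed from the distribution table above; the function name and the "Other" / "Unknown" fallback labels are illustrative choices, not settled vocabulary.

```python
import re

# Prefix -> display name, transcribed from the distribution table.
CRS_TYPE_NAMES = {
    "R": "Report",
    "RL": "Long-form report",
    "IF": "In Focus",
    "RS": "Short report",
    "IN": "Insight",
    "LSB": "Legal Sidebar",
    "IB": "Issue Brief",
    "TE": "Testimony",
    "IG": "Infographic",
    "IP": "Information Paper",
    "IS": "Information Sheet",
    "MB": "Memorandum",
}

_LEADING_LETTERS = re.compile(r"[A-Za-z]+")


def classify_number(number: str) -> tuple[str, str]:
    """Return (type_id, type_name) for a report number such as 'RL30787'."""
    match = _LEADING_LETTERS.match(number.strip())
    prefix = match.group(0).upper() if match else ""
    if not prefix:
        return "", "Unknown"  # rows with an empty prefix
    if prefix.startswith("ZZZ"):
        return prefix, "Archived / withdrawn"
    # Taking the whole letter run keeps 'RL' and 'RS' from matching 'R'.
    return prefix, CRS_TYPE_NAMES.get(prefix, "Other")
```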

Stable ID format

crs:{number}, uppercase prefix preserved.

Examples:

crs:R48481 (a "report" type)
crs:IF13213 (an "in focus" type)
crs:LSB10987 (a "legal sidebar")

The number is unique across all types — R48481 and IF13213 cannot collide because the type prefix is part of the canonical number. CRS ensures global uniqueness.

Versioning: every report has a version number that monotonically increases as updates are published. We store all versions. Per-version stable ID derived as crs:{number}-v{version} if needed (e.g. crs:R48481-v3), but the canonical row is the latest version; older versions are in crs_report_versions.
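The ID rules above are small enough to sketch directly. The helper names are hypothetical; the formats crs:{number} and crs:{number}-v{version} are the ones described in this section.

```python
import re
from typing import Optional

# Number part allows hyphens (for specialized prefixes); the version
# suffix uses a lowercase 'v', so the two cannot be confused.
_STABLE_ID = re.compile(r"^crs:([A-Z0-9-]+?)(?:-v(\d+))?$")


def stable_id(number: str, version: Optional[int] = None) -> str:
    """Build 'crs:R48481' or, with a version, 'crs:R48481-v3'."""
    number = number.strip()
    if not number:
        raise ValueError("empty report number")
    base = f"crs:{number}"
    return base if version is None else f"{base}-v{version}"


def parse_stable_id(value: str) -> tuple[str, Optional[int]]:
    """Split a stable ID into (number, version); version is None for the canonical ID."""
    match = _STABLE_ID.match(value)
    if not match:
        raise ValueError(f"not a CRS stable ID: {value!r}")
    number, version = match.groups()
    return number, int(version) if version is not None else None
```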

Response shapes

Both upstream JSON shapes are documented upstream — the EveryCRSReport mirror about page for reports/{number}.json and api.congress.gov for /crsreport/{id}. The Josh-relevant facts about those shapes (mirror key set, the zero-row consequence, the soft relatedMaterials link, the sparse authors) are captured under Source priority decision and Endpoints.

Schema (Postgres DDL)

-- ============================================================
-- CRS reports (latest version)
-- ============================================================

CREATE TABLE crs_reports (
    id text PRIMARY KEY,                          -- 'crs:IF13213'
    number text NOT NULL UNIQUE,                  -- 'IF13213'
    type_id text NOT NULL,                        -- 'IF', 'R', 'LSB', 'IN', etc.
    type_name text NOT NULL,                      -- 'In Focus', 'Report', 'Legal Sidebar', etc.
    content_type text,                            -- Congress.gov: 'reports' | 'posts' | 'resources' | ...

    title text NOT NULL,
    summary text,                                 -- Often null on mirror; richer via Congress.gov

    -- Latest version metadata
    latest_version int,
    latest_publish_date date NOT NULL,
    update_date timestamptz,                      -- Congress.gov updateDate

    -- Status
    is_active boolean NOT NULL DEFAULT true,
    status text,                                  -- 'active' | 'archived'

    -- Latest version source URLs
    latest_pdf_url text,
    latest_html_url text,
    latest_pdf_sha1 text,
    latest_html_sha1 text,

    -- Latest version body (extracted at parse)
    body_text text,                               -- from HTML rendition
    body_html bytea,                              -- gzipped raw HTML preserved
    body_pdf_url_local text,                      -- /data/corpus/crs/files/...

    -- Lifecycle
    raw_json_mirror jsonb,                        -- everycrsreport.com/reports/{n}.json
    raw_json_api jsonb,                           -- Congress.gov /crsreport/{id}
    fetched_at timestamptz NOT NULL,
    parsed_at timestamptz,
    inserted_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX crs_reports_publish_date ON crs_reports (latest_publish_date DESC);
CREATE INDEX crs_reports_type ON crs_reports (type_id, latest_publish_date DESC);
CREATE INDEX crs_reports_active ON crs_reports (latest_publish_date DESC) WHERE is_active;
CREATE INDEX crs_reports_update_date ON crs_reports (update_date DESC NULLS LAST);

-- Search
ALTER TABLE crs_reports ADD COLUMN search_tsv tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(summary, '')), 'B') ||
        setweight(to_tsvector('english', coalesce(body_text, '')), 'D')
    ) STORED;
CREATE INDEX crs_reports_search ON crs_reports USING gin (search_tsv);

-- ============================================================
-- Per-version history (every prior version preserved)
-- ============================================================

CREATE TABLE crs_report_versions (
    id text PRIMARY KEY,                          -- 'crs:R48481-v3'
    report_id text NOT NULL REFERENCES crs_reports(id) ON DELETE CASCADE,
    version int NOT NULL,
    publish_date date NOT NULL,
    title text NOT NULL,
    summary text,
    pdf_url text,
    html_url text,
    pdf_sha1 text,
    html_sha1 text,
    body_text text,
    fetched_at timestamptz NOT NULL,
    UNIQUE (report_id, version)
);

CREATE INDEX crs_report_versions_report ON crs_report_versions (report_id, version DESC);

-- ============================================================
-- Authors (CRS analysts)
-- ============================================================

CREATE TABLE crs_report_authors (
    id bigserial PRIMARY KEY,
    report_id text NOT NULL REFERENCES crs_reports(id) ON DELETE CASCADE,
    author_name text NOT NULL,
    sequence int,                                 -- order author listed
    UNIQUE (report_id, author_name)
);

-- ============================================================
-- Topics (CRS-assigned)
-- ============================================================

CREATE TABLE crs_report_topics (
    report_id text NOT NULL REFERENCES crs_reports(id) ON DELETE CASCADE,
    topic text NOT NULL,
    PRIMARY KEY (report_id, topic)
);

CREATE INDEX crs_report_topics_topic ON crs_report_topics (topic);

-- ============================================================
-- Related bills (soft FK; CRS-tagged)
-- ============================================================

CREATE TABLE crs_report_related_bills (
    report_id text NOT NULL REFERENCES crs_reports(id) ON DELETE CASCADE,
    bill_id text NOT NULL,                        -- 'bill:117-hr-1234'; soft FK
    PRIMARY KEY (report_id, bill_id)
);

CREATE INDEX crs_report_related_bills_bill ON crs_report_related_bills (bill_id);

-- ============================================================
-- Vector chunks for similarity
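-- ============================================================

-- The vector type and the hnsw index below require the pgvector extension.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE crs_report_chunks (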
    id bigserial PRIMARY KEY,
    report_id text NOT NULL REFERENCES crs_reports(id) ON DELETE CASCADE,
    chunk_index int NOT NULL,
    chunk_text text NOT NULL,
    embedding vector(1024),
    UNIQUE (report_id, chunk_index)
);
CREATE INDEX crs_report_chunks_embedding
    ON crs_report_chunks USING hnsw (embedding vector_cosine_ops);

Schema decisions worth flagging:

Latest version denormalized onto crs_reports — most queries want the current version. Past versions are in crs_report_versions for citation history.
Both raw_json_mirror and raw_json_api preserved — the two sources differ in coverage (mirror has fuller version history; API has richer related-materials and authors). Keeping both supports re-parsing if either source changes shape.
latest_html_sha1 checksum for idempotency — refetch only when the SHA1 differs from what we have.
Search tsvector on title (A), summary (B), body (D) — body text is heavy enough that we want it ranked lower than title/summary on tied relevance. No author-name weighting (analyst names rarely matter to retrieval).
crs_report_related_bills.bill_id as soft FK — the related bill may be from a congress we haven't fully ingested.
Body PDF URL kept as a path reference, not the bytes — PDF parsing is typically worse than HTML parsing; we treat HTML as primary and only fall back to PDF when HTML rendition is missing or malformed.

Caching / incremental sync

Daily strategy:

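Conditional GET on reports.csv with If-Modified-Since: {last_csv_lastmodified} and If-None-Match: {last_csv_etag} (the validators saved in run state, not the wall-clock time of the last run). If 304, skip. If 200: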
Diff CSV: any (number, latestPubDate) pair not in our crs_reports table is a new version.
For each new version, fetch reports/{number}.json, identify the latest version entry, fetch its PDF + HTML.
SHA1 checksum verifies file integrity (the mirror provides per-format SHA1 hashes).
Parse HTML → text. Update crs_reports.latest_* to the new version; insert previous-version row into crs_report_versions.
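The diff step above could be sketched as follows, assuming the CSV rows are dicts with number and latestPubDate keys (assumed column names) and that known is a number-to-date map loaded from crs_reports. A report counts as changed when it is unseen or its CSV date is newer than the stored one.

```python
from datetime import date


def find_new_versions(csv_rows: list[dict], known: dict[str, date]) -> list[str]:
    """Return report numbers that are new or have a newer latestPubDate."""
    changed = []
    for row in csv_rows:
        number = row["number"]
        # Keep only the date part in case the column carries a timestamp.
        published = date.fromisoformat(row["latestPubDate"][:10])
        seen = known.get(number)
        if seen is None or published > seen:
            changed.append(number)
    return changed
```

One limitation of a date-only comparison: a revision republished on the same date as the stored version would not be detected, so the weekly drift check remains useful.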

Optional secondary check (weekly): Congress.gov /crsreport?fromDateTime={last_run}&sort=updateDate desc — confirm we caught everything. Alert on drift.
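A sketch of building the weekly drift-check URL. The fromDateTime and sort parameters are the ones named above; the base URL, the timestamp format, and the format, limit, and api_key parameters are assumptions to confirm against the api.congress.gov docs.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

# Assumed base URL for the v3 CRS report endpoint.
CRSREPORT_BASE = "https://api.congress.gov/v3/crsreport"


def drift_check_url(last_run: datetime, api_key: str, limit: int = 250) -> str:
    """Build the list query for reports updated since last_run (timezone-aware)."""
    params = {
        "fromDateTime": last_run.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "sort": "updateDate desc",
        "format": "json",
        "limit": limit,
        "api_key": api_key,
    }
    # urlencode handles the space in the sort value and the colons in the timestamp.
    return f"{CRSREPORT_BASE}?{urlencode(params)}"
```

Any report number in the response that is missing from crs_reports, or carries a newer updateDate than the stored one, is the drift to alert on.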

Hourly (during business hours): for "active" reports likely to update (recently-published, frequently-updated), poll EveryCRSReport per-report JSON. Most reports update predictably, so this is rarely needed.

Refresh cadence is gentle — one full master CSV download per day is fine. EveryCRSReport's reports.csv is 6.6 MB — sufficient inventory for a daily diff.

Download / update strategy

Backfill (one-time)

Pull reports.csv (~6.6 MB).
For each row in CSV, fetch reports/{number}.json. ~23K JSON fetches at 1 req/s = ~6.5 hours. With 4 workers @ 1 req/s = ~2 hours.
For each report, fetch the file named in its latestHTML column (~50-300 KB each). Same parallelism = ~2 hours for HTML.
PDFs — fetch lazily on first agent request unless we want eager backfill. PDFs total ~3-5 GB; HTML ~1-2 GB. Storage budget allows eager.
Parse + load + embed. Body text → tsvector + chunks.

Total backfill: ~6-8 hours wall.
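The fetch-time estimates follow from simple arithmetic on the row count. The sketch below reproduces them; the helper name is illustrative, and the 1 req/s per-worker rate is the one assumed in the steps above.

```python
REPORTS = 23_211  # CSV rows as of May 2026


def fetch_hours(requests: int, workers: int = 1, rate_per_worker: float = 1.0) -> float:
    """Wall-clock hours to issue `requests` at the given aggregate rate."""
    return requests / (workers * rate_per_worker) / 3600


single = fetch_hours(REPORTS)             # one worker at 1 req/s
parallel = fetch_hours(REPORTS, workers=4)

print(f"1 worker:  {single:.1f} h")    # 1 worker:  6.4 h
print(f"4 workers: {parallel:.1f} h")  # 4 workers: 1.6 h
```

Two such passes (JSON, then HTML) come to a little over 3 hours at four workers; the rest of the 6-8 hour budget covers PDFs, parsing, and embedding.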

Daily incremental

04:00 UTC: pull reports.csv with If-Modified-Since. If 304, skip remainder. If 200, continue.
Diff CSV against crs_reports rows: rows with latestPubDate > our latest_publish_date (new version).
For each new version, fetch reports/{number}.json. Insert version row, update parent row.
Fetch new HTML/PDF. SHA1-verify against the mirror's checksum.
Re-embed chunks for affected reports.

State tracking

Per architecture: source key crs_reports. Run state stores last_csv_etag, last_csv_lastmodified, last successful parse timestamp.

Failure modes

Mirror down — fall back to Congress.gov API. Daily cron logs a warning; multi-day outage triggers alert.
CSV parse errors — ~2,994 rows in the current CSV have title commas/quotes (or empty/malformed prefixes). Don't fail the whole run; log per-row parse errors and continue.
Report withdrawn — an active=false flag flips. Update is_active but keep all rows + body text (legal/historical research need).
HTML rendition malformed — fall back to PDF text extraction (pdftotext / pdfplumber). Mark parsed_at regardless.
Title-only references — older reports sometimes have no summary and no topics. Body is still searchable; UI surfaces "summary unavailable."
Mirror SHA1 mismatch with download — retry once. If persistent, alert; the mirror may be mid-update.
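The checksum check and the retry-once rule can be sketched together. The fetch callable is injected so the logic stays independent of the HTTP client; the names here are hypothetical.

```python
import hashlib
from typing import Callable


class ChecksumMismatch(Exception):
    """Downloaded bytes never matched the mirror's published SHA1."""


def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def fetch_verified(fetch: Callable[[], bytes], expected_sha1: str, retries: int = 1) -> bytes:
    """Call fetch() until the SHA1 matches, allowing `retries` extra attempts."""
    expected = expected_sha1.strip().lower()
    for _ in range(retries + 1):
        data = fetch()
        if sha1_hex(data) == expected:
            return data
    raise ChecksumMismatch(f"SHA1 mismatch after {retries + 1} attempts (expected {expected})")
```

The caller would catch ChecksumMismatch, raise the alert, and leave the stored version untouched until the mirror settles.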

Open questions

These don't block ingestion but should be resolved before this source is "shipped":

Author resolution. CRS analyst names in authors are free-text. Many are public via the CRS analyst directory. Building a canonical analyst directory would let the agent answer "what else has Janet Doe written about export controls." v1.x.
Topic taxonomy normalization. Congress.gov's topics[] strings (e.g. "Veterans Affairs", "International Trade") are CRS-internal categories that overlap with but don't equal the bill subject taxonomy in the topic-taxonomy source. Decide if we maintain a unified topic table or keep CRS topics in their own column. Lean toward source-specific to start; consolidate in v1.x.
Citation extraction at parse time vs separate batch. CRS reports cite bills, USC, CFR, court cases, and other CRS reports densely. Running eyecite at parse time slows ingestion but populates the citation graph table eagerly. Lean toward parse-time for the citation graph (this is the substrate's load-bearing capability) but bench first.
PDF as primary source for non-HTML reports. A small fraction of reports have HTML-malformed renditions. Default to HTML, fall back to PDF; document the fallback rate after backfill.
Stable-ID collision with old report numbers. CRS occasionally retires a number range. Verify by running our Postgres unique constraint over the full backfill — alert on conflicts.
Legal Sidebars (LSB) as a distinct subtype with different needs. They're shorter, more frequent, with different structural layout (more case citations). Worth a separate parser path for tighter quote-extraction.
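For the citation-extraction benchmark, a cheap baseline helps put the cost of a full extractor in context. The sketch below is a plain regex for U.S.C. citations only; it is a stand-in for comparison, not eyecite and not a replacement for it, and it will miss many citation forms.

```python
import re

# Matches e.g. "42 U.S.C. § 1396a" or "5 U.S.C. § 552(b)(6)".
# \s also matches non-breaking spaces, which CRS HTML may use after the section sign.
USC_RE = re.compile(
    r"\b(\d+)\s+U\.S\.C\.\s+§+\s*(\d[0-9A-Za-z-]*(?:\([0-9A-Za-z]+\))*)"
)


def usc_citations(text: str) -> list[tuple[str, str]]:
    """Return (title, section) pairs in document order."""
    return USC_RE.findall(text)
```

Timing this against the full extractor on the 12 fixture reports would give a first read on whether parse-time extraction is affordable.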

Normalizer validation

Per the project's HTML→Markdown manual-validation gate, every HTML/XML→Markdown converter passes this stratified check before promotion. CRS uses html_v1 (markdownify-based, lossy-table detection, scripts/nav stripping).

Sample set — 12 reports, type-prefix-stratified

Selection rule: for each type, the most-recently-updated reports with >5 KB HTML. Captured as fixtures under josh-ingester/tests/fixtures/crs/<number>.html so the gate is reproducible.

Type	Number	Title (excerpt)	HTML bytes
R	R45770	The U.S. Election Assistance Commission (EAC)	195,842
R	R47021	Federal Involvement in Ocean-Based Research and Development	152,946
R	R48936	The Electoral College: Frequently Asked Questions	52,831
IF	IF11681	Defense Primer: LGM-35A Sentinel ICBM	39,084
IF	IF13218	Endangered Species Committee ("God Squad") Exemption	25,950
LSB	LSB11429	Artificial Intelligence and the Fourth Amendment	49,157
LSB	LSB11428	Vaccine Injury Compensation Program	45,048
IN	IN12690	The FY2027 President's Budget: Revenues	43,918
RL	RL30787	Parliamentary Reference Sources: House	137,994
RS	RS22331	Foreign Holdings of Federal Debt	105,295
IG	IG10096	US Service Academy Nominations: Timelines	11,801
TE	TE10122	Mitigation and Multiple Loss Properties	93,290

Validation run — 2026-05-08, normalizer `html_v1` v1.0.0

All 12 samples processed through html_v1.to_markdown(). Metrics + criteria assessment below.

#	Number	Ratio	Headings	MD-tables	Artifacts	Warnings
1	R45770	0.68	34	192	0	(none)
2	R47021	0.72	24	162	0	(none)
3	R48936	0.75	25	57	0	(none)
4	IF11681	0.55	10	5	0	(none)
5	IF13218	0.66	5	0	0	(none)
6	LSB11429	0.76	6	6	0	(none)
7	LSB11428	0.77	6	0	0	(none)
8	IN12690	0.64	4	3	0	(none)
9	RL30787	0.59	41	93	0	(none)
10	RS22331	0.67	9	55	1	"table 2: rendered as raw HTML"
11	IG10096	0.34	2	0	0	(none) — at lower bound; expected for image-heavy infographic
12	TE10122	0.77	19	119	0	(none)

Per-criterion result

Criterion	Pass bar	Status
Heading hierarchy	All `<h1>`–`<h4>` map to `#`–`####` in correct nesting order	✅ all 12
Section ordering	First-paragraph-of-section text in source order	✅ all 12 (verified via spot-check)
Footnotes	Markers preserved; round-trip without manual hunting	✅ where present (R45770: 12, R47021: 2; most CRS uses parenthetical citations rather than footnotes)
Tables	Every `<table>` becomes Markdown table OR Artifact reference	✅ all 12 — 9 produce Markdown tables; RS22331 caught 1 lossy table as `Artifact(kind='table')` with surfaced warning
Block quotes	`<blockquote>` becomes `>`	✅ verified in spot-check (LSB11429 has indented statutory excerpts)
Citations	"42 U.S.C. § 1396a"-style preserved verbatim, `§` not Unicode-mangled	✅ all 12 (R45770: 59 U.S.C. refs + 61 § symbols, all intact; Pub.L. abbreviations preserved across reports)
Links	Every `<a href>` becomes `[text](url)`; absolute URLs via `source_hint` resolution	✅ all 12 (counts range 13–275 per report)
Lists	`<ol>`/`<ul>` preserved with correct nesting; numbering preserved	✅ verified in spot-check
Emphasis	`<em>`/`<strong>` map to `*`/`**`	✅ verified in spot-check
Images / figures	`<img>` / `<figure>` produce Markdown link with alt text or Artifact	✅ verified (IG10096 has `![]()` for empty-alt images; non-trivial captions preserved as paragraphs)
No HTML tag bleed	Zero raw `<div>`, `<span>` (tables exempt where lossy)	✅ all 12 — `grep '<div\|<span'` → 0 hits across all outputs
Body length sanity	`len(markdown) / len(html)` ∈ [0.3, 0.8]	✅ all 12 (range 0.34–0.77)
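The two mechanical criteria in the table (tag bleed and body length) are easy to automate. A minimal sketch, assuming the converter output and source HTML are both available as strings; it does not implement the lossy-table exemption, so a document with an intentional raw-HTML table would need that carved out first.

```python
import re

# Same intent as the grep in the table: any raw <div or <span is bleed.
TAG_BLEED = re.compile(r"<(?:div|span)\b", re.IGNORECASE)


def length_ratio(markdown: str, html: str) -> float:
    """len(markdown) / len(html); 0.0 for empty HTML."""
    return len(markdown) / len(html) if html else 0.0


def passes_mechanical_checks(markdown: str, html: str, lo: float = 0.3, hi: float = 0.8) -> bool:
    """True when the ratio is inside [lo, hi] and no div/span tags leaked through."""
    ratio = length_ratio(markdown, html)
    return lo <= ratio <= hi and TAG_BLEED.search(markdown) is None
```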

Promotion decision

html_v1 v1.0.0 PROMOTED for use as the CRS HTML normalizer. All 12 samples pass all criteria. Lossy-table detection works (RS22331). Length-ratio sanity holds even at the IG infographic edge (0.34, just above the 0.3 floor).

Known cosmetic issue (CRS-specific, deferred)

The EveryCRSReport.com page wrapper produces a duplicate H1 on most reports — once as the page title (# {title} - EveryCRSReport.com) and once as the actual report H1 (# {title}). This is upstream wrapper noise, not a normalizer defect. Address in the CRS source module's parse stage (Phase 4) by trimming the wrapper title or extracting only <main> / <article> content. Not a promotion blocker.
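One possible shape for the wrapper-title trim, operating on the converted Markdown. It only inspects the first H1 and removes it when it ends with the " - EveryCRSReport.com" suffix described above; the function name is hypothetical, and extracting only the main content before conversion may turn out to be the cleaner fix.

```python
import re

WRAPPER_H1 = re.compile(r"^# .* - EveryCRSReport\.com\s*$")


def drop_wrapper_h1(markdown: str) -> str:
    """Remove the page-wrapper H1 if it is the first H1 in the document."""
    lines = markdown.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("# "):
            if WRAPPER_H1.match(line):
                del lines[i]
                # Drop the blank line the wrapper title leaves behind.
                if i < len(lines) and not lines[i].strip():
                    del lines[i]
            break  # only the first H1 is ever considered
    return "\n".join(lines)
```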

Re-run protocol

If html_v1 ships a new version:

Run josh-ingester/tests/fixtures/crs/ through the new converter.
Re-validate against this 12-sample set.
Update this section with the new run's metrics + per-criterion results.
Bump body_normalizer_version in crs_reports; the substrate's re-conversion path picks up changed rows.