CRS reports

Congressional Research Service products: nonpartisan analysis written for Members of Congress and their staff. Covers virtually every policy area — economics, foreign affairs, intelligence, environment, agriculture, social policy. CRS reports are the go-to backgrounder format on Capitol Hill: a Member's office gets one before a vote, before a hearing, before a town hall.

For Josh, CRS reports are a high-value retrieval target — they're substantive, well-cited, and authored by domain experts. They are also (since 2018) all public, all online, and uniformly formatted enough to ingest reliably.

This is a reference / analysis source — no FK relationships into the legislative-action graph except the soft relatedMaterials links to bills. Load whenever; not blocking on legislators or bills. Citation extraction at parse time will populate cross-references to bills, USC, and CFR.

Source name: CRS reports
Publisher: Congressional Research Service (Library of Congress)
License: Reports are works of the US government and are in the public domain. CRS asserts no copyright.
Coverage: 1971 – present. Public posting became mandatory under the Consolidated Appropriations Act, 2018 (P.L. 115-141), but the EveryCRSReport.com mirror reaches back well before that: 9,820 of the 23,211 CSV rows pre-date 2013 (back to 1971), and all of them are ingested; pre-2013 is not out of scope. ~23K active reports + thousands of archived versions.
Volume: ~23K reports in the mirror (23,211 CSV rows as of May 2026, 9,820 of them pre-2013); ~600-1500 new + ~3K updates per year. Each report ~50-300 KB HTML, ~100-500 KB PDF.
Storage estimate: ~3-5 GB raw HTML + PDF; ~1-2 GB extracted text
Primary source: EveryCRSReport.com community mirror (see Source priority decision)
Verification source: Congress.gov v3 /crsreport API; see the api.congress.gov CRS report docs (register a key at api.data.gov/signup)
Stable ID format: crs:{number}, e.g. crs:R48481, crs:IF13213
Status: exploring (schema drafted, ingestion not built)

Upstream endpoint, auth, and rate-limit specifics live in the EveryCRSReport about page and the Congress.gov CRS report API docs; the access behavior Josh observed in practice is in Access notes below.

Primary: EveryCRSReport.com mirror. Project run by Daniel Schuman + Demand Progress (and now FrontFin / Federation of American Scientists), continuously crawled from crsreports.congress.gov. Provides:

  • A daily-refreshed master CSV (reports.csv, ~6.6 MB) with every report's number, URL, latest pub date, title, latest PDF/HTML paths. Conditional GET works.
  • Per-report JSON at reports/{number}.json with full version history (every revision the mirror has crawled) and formats with SHA1 checksums. Top-level keys are exactly [active, id, number, source, topics, type, typeId, versions] — there is no authors, relatedBills, or bills key, so the crs_report_authors and crs_report_related_bills tables receive zero rows on this mirror.
  • The actual PDF and HTML files at deterministic paths.

This is the path of least resistance: one CSV download tells us the entire universe; one JSON per report gives the full version history; SHA1 checksums make idempotency cheap. The mirror's per-report JSON shape is documented (informally) on the EveryCRSReport about page — there is no canonical upstream API reference for it.
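The key set listed above can be checked at ingest time, so that a change in the mirror's JSON shape gets noticed rather than silently dropped. A minimal sketch, assuming the per-report JSON has already been decoded into a dict; `check_mirror_keys` is a hypothetical helper name and the sample values are illustrative only.

```python
# Top-level keys the mirror's per-report JSON is described (above) as carrying.
EXPECTED_KEYS = frozenset(
    ["active", "id", "number", "source", "topics", "type", "typeId", "versions"]
)


def check_mirror_keys(doc: dict) -> tuple[set, set]:
    """Return (missing, unexpected) top-level keys for one decoded report."""
    present = set(doc)
    return set(EXPECTED_KEYS - present), set(present - EXPECTED_KEYS)


# Illustrative document: the values are placeholders, not real mirror data.
sample = {
    "active": True, "id": "IF13213", "number": "IF13213", "source": "mirror",
    "topics": [], "type": "placeholder", "typeId": "IF", "versions": [],
}
missing, unexpected = check_mirror_keys(sample)
```

A non-empty `unexpected` set (for example an `authors` key appearing one day) would be the signal to revisit the zero-row assumption for crs_report_authors and crs_report_related_bills.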

Secondary: Congress.gov v3 /crsreport. The official, authoritative source, with a cleaner JSON shape. Use it as a verification layer (spot-check a handful of reports daily for drift) but not as primary, because (a) it costs api.data.gov quota, (b) the EveryCRSReport mirror is comprehensive and faster, and (c) the official feed sometimes lags the mirror by hours. The field-by-field response shape is documented at api.congress.gov.

Skip: scraping crsreports.congress.gov directly. EveryCRSReport already does this. Re-doing it duplicates work and creates a second crawl footprint on the underlying source.

Skip: separate pre-2018 backfill via UNT digital library, FAS legacy archive, etc. Unnecessary — the EveryCRSReport.com mirror already carries the older corpus (9,820 of 23,211 rows pre-date 2013, back to 1971), and the ingester loads all of it. No extra archive scrape is needed for pre-2018 coverage.

EveryCRSReport: clean nginx behind Cloudflare, Last-Modified and ETag headers reliable, If-Modified-Since and If-None-Match both work. No User-Agent restrictions. Daily refresh of reports.csv happens around 07:00 UTC.
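Since both validators work, the daily pull can send whichever ones the previous run stored. A minimal standard-library sketch; `CSV_URL` is an assumed full URL built from the relative path, and `fetch_if_changed` is a hypothetical helper, not the ingester's actual code.

```python
import urllib.error
import urllib.request

CSV_URL = "https://www.everycrsreport.com/reports.csv"  # assumed full URL


def conditional_headers(etag=None, last_modified=None):
    """Build If-None-Match / If-Modified-Since from the last run's stored validators."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None, timeout=30):
    """Return (body, etag, last_modified), or None when the server answers 304."""
    req = urllib.request.Request(url, headers=conditional_headers(etag, last_modified))
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read(), resp.headers.get("ETag"), resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib surfaces 304 Not Modified as an HTTPError
            return None
        raise
```

Calling `fetch_if_changed(CSV_URL, etag=state_etag, last_modified=state_lm)` and storing the returned validators back into run state keeps the 304 path cheap on days when the CSV has not moved.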

Congress.gov: same auth pattern as Bills. Real api.data.gov key needed for ingestion. DEMO_KEY hits ~30/hour cap quickly during probing.

Upstream URL patterns, the per-report JSON / CSV column schemas, and the Congress.gov v3 field listings are documented upstream, on the EveryCRSReport about page and at api.congress.gov.

Josh-specific notes on those shapes that no upstream doc states:

  • The reports.csv URLs are relative to everycrsreport.com/.
  • The direct CRS document URLs on congress.gov follow crs_external_products/{TYPE}/PDF/{number}/{number}.{version}.pdf and crs_external_products/{TYPE}/HTML/{number}.web.html, where {TYPE} is the prefix (R, IF, RL, etc.) and {version} is a numeric per-report version (.1., .2.).
  • The mirror's per-report JSON has only the keys [active, id, number, source, topics, type, typeId, versions] — see Source priority decision for the zero-row consequence on crs_report_authors / crs_report_related_bills.
  • On Congress.gov, the relatedMaterials field is a soft cross-reference — a CRS report cites bills, but linking is editorial (CRS staff tag the report with relevant bills they discuss). Useful for the agent's "find me CRS analysis on this bill" query.
  • The authors field (Congress.gov only) is sparse on older reports; CRS keeps analyst names current but legacy reports (pre-2010) may have author byline removed for retired-staff privacy.
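The direct-document path patterns in the list above can be expressed as two small builders. This is a sketch under assumptions: the helper names are hypothetical, only the relative path is produced (the host is left to the caller), and the prefix regex covers plain letter prefixes such as R, IF, and LSB but not the hyphenated specialized ones.

```python
import re

_PREFIX_RE = re.compile(r"^[A-Z]+")  # assumption: plain uppercase letter prefix


def type_prefix(number: str) -> str:
    """Extract the {TYPE} segment (R, IF, RL, ...) from a report number."""
    match = _PREFIX_RE.match(number)
    if not match:
        raise ValueError(f"no letter prefix in report number: {number!r}")
    return match.group(0)


def pdf_path(number: str, version: int) -> str:
    """crs_external_products/{TYPE}/PDF/{number}/{number}.{version}.pdf"""
    return f"crs_external_products/{type_prefix(number)}/PDF/{number}/{number}.{version}.pdf"


def html_path(number: str) -> str:
    """crs_external_products/{TYPE}/HTML/{number}.web.html"""
    return f"crs_external_products/{type_prefix(number)}/HTML/{number}.web.html"
```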

CRS report types (prefix in report number)

Distribution from a fresh EveryCRSReport CSV (May 2026):

| Prefix | Count | Type |
|---|---|---|
| R | 6,881 | Standard report |
| RL | 4,280 | Long-form report (legacy prefix; phased out for new reports) |
| IF | 3,183 | In Focus: 2-page brief on a single topic |
| RS | 2,433 | Short report (legacy prefix) |
| IN | 2,353 | Insight: short topical analysis |
| LSB | 1,359 | Legal Sidebar: legal analysis |
| IB | 400 | Issue Brief (mostly historical) |
| TE | 91 | Testimony (when CRS analyst testified) |
| IG | 91 | Infographic |
| ZZZ, ZZZA-F | ~85 | Archived / withdrawn |
| IP | 41 | Information Paper |
| IS | 32 | Information Sheet |
| MB | 19 | Memorandum (rare; specific to a Member request) |
| SP-, RA-, M-, CP, IPO, SS-CSRS | <10 each | Specialized prefixes |

Some rows have empty prefix or quote-comma issues (~2,994 rows have title commas/quotes that need RFC 4180-aware parsing — handle defensively in CSV parser; don't fail the row).
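Python's `csv` module already handles quoted commas and doubled quotes in the RFC 4180 style, so the defensive part is mostly about rows with the wrong field count. A minimal sketch: the header in `SAMPLE` is an assumption (these notes only name `number`, `latestPubDate`, and `latestHTML`), and the report numbers and titles are made up for illustration.

```python
import csv
import io

REQUIRED = ("number", "latestPubDate", "title")  # assumed column names


def parse_reports_csv(text):
    """Return (rows, errors); a bad row is recorded and skipped, never raised."""
    rows, errors = [], []
    reader = csv.DictReader(io.StringIO(text, newline=""))
    for row in reader:
        line = reader.line_num
        # DictReader files surplus fields under the key None and pads short rows with None.
        if None in row or any(value is None for value in row.values()):
            errors.append((line, "wrong field count"))
            continue
        if not all(row.get(col, "").strip() for col in REQUIRED):
            errors.append((line, "missing required field"))
            continue
        rows.append(row)
    return rows, errors


SAMPLE = (
    "number,url,sha1,latestPubDate,title,latestPDF,latestHTML\n"
    'R00001,reports/R00001.json,abc,2026-05-01,"Trade, Tariffs, and ""Reciprocity""",a.pdf,a.html\n'
    "IF00002,reports/IF00002.json,def,2026-04-30\n"
)
rows, errors = parse_reports_csv(SAMPLE)
```

Here the quoted title survives intact as one field, and the truncated second row lands in `errors` with its line number instead of aborting the run.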

The Congress.gov contentType enum (reports, posts, resources, infographics, testimony) and the status enum (active | archived) are documented at api.congress.gov. Note: on the mirror an archived report may still be present, and Josh stores it.

crs:{number}, uppercase prefix preserved.

Examples:

  • crs:R48481 (a "report" type)
  • crs:IF13213 (an "in focus" type)
  • crs:LSB10987 (a "legal sidebar")

The number is unique across all types — R48481 and IF13213 cannot collide because the type prefix is part of the canonical number. CRS ensures global uniqueness.

Versioning: every report has a version number that monotonically increases as updates are published. We store all versions. Per-version stable ID derived as crs:{number}-v{version} if needed (e.g. crs:R48481-v3), but the canonical row is the latest version; older versions are in crs_report_versions.
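The ID rules above are small enough to sketch directly. The helper names are hypothetical, and the parser assumes a plain uppercase letter prefix followed by digits, so hyphenated specialized prefixes would need a wider pattern.

```python
import re

# Assumption: canonical numbers look like R48481, IF13213, LSB10987.
_ID_RE = re.compile(r"^crs:(?P<number>[A-Z]+\d+)(?:-v(?P<version>\d+))?$")


def stable_id(number: str) -> str:
    """Canonical report ID; the prefix case is preserved as given."""
    return f"crs:{number}"


def version_id(number: str, version: int) -> str:
    """Per-version ID, e.g. crs:R48481-v3."""
    return f"{stable_id(number)}-v{version}"


def parse_id(value: str):
    """Split a stable ID into (number, version); version is None for the canonical row."""
    match = _ID_RE.match(value)
    if not match:
        raise ValueError(f"not a CRS stable ID: {value!r}")
    version = match.group("version")
    return match.group("number"), int(version) if version is not None else None
```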

Both upstream JSON shapes are documented upstream — the EveryCRSReport mirror about page for reports/{number}.json and api.congress.gov for /crsreport/{id}. The Josh-relevant facts about those shapes (mirror key set, the zero-row consequence, the soft relatedMaterials link, the sparse authors) are captured under Source priority decision and Endpoints.

-- ============================================================
-- CRS reports (latest version)
-- ============================================================
CREATE TABLE crs_reports (
    id text PRIMARY KEY, -- 'crs:IF13213'
    number text NOT NULL UNIQUE, -- 'IF13213'
    type_id text NOT NULL, -- 'IF', 'R', 'LSB', 'IN', etc.
    type_name text NOT NULL, -- 'In Focus', 'Report', 'Legal Sidebar', etc.
    content_type text, -- Congress.gov: 'reports' | 'posts' | 'resources' | ...
    title text NOT NULL,
    summary text, -- Often null on mirror; richer via Congress.gov

    -- Latest version metadata
    latest_version int,
    latest_publish_date date NOT NULL,
    update_date timestamptz, -- Congress.gov updateDate

    -- Status
    is_active boolean NOT NULL DEFAULT true,
    status text, -- 'active' | 'archived'

    -- Latest version source URLs
    latest_pdf_url text,
    latest_html_url text,
    latest_pdf_sha1 text,
    latest_html_sha1 text,

    -- Latest version body (extracted at parse)
    body_text text, -- from HTML rendition
    body_html bytea, -- gzipped raw HTML preserved
    body_pdf_url_local text, -- /data/corpus/crs/files/...

    -- Lifecycle
    raw_json_mirror jsonb, -- everycrsreport.com/reports/{n}.json
    raw_json_api jsonb, -- Congress.gov /crsreport/{id}
    fetched_at timestamptz NOT NULL,
    parsed_at timestamptz,
    inserted_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX crs_reports_publish_date ON crs_reports (latest_publish_date DESC);
CREATE INDEX crs_reports_type ON crs_reports (type_id, latest_publish_date DESC);
CREATE INDEX crs_reports_active ON crs_reports (latest_publish_date DESC) WHERE is_active;
CREATE INDEX crs_reports_update_date ON crs_reports (update_date DESC NULLS LAST);

-- Search
ALTER TABLE crs_reports ADD COLUMN search_tsv tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(summary, '')), 'B') ||
        setweight(to_tsvector('english', coalesce(body_text, '')), 'D')
    ) STORED;

CREATE INDEX crs_reports_search ON crs_reports USING gin (search_tsv);

-- ============================================================
-- Per-version history (every prior version preserved)
-- ============================================================
CREATE TABLE crs_report_versions (
    id text PRIMARY KEY, -- 'crs:R48481-v3'
    report_id text NOT NULL REFERENCES crs_reports(id) ON DELETE CASCADE,
    version int NOT NULL,
    publish_date date NOT NULL,
    title text NOT NULL,
    summary text,
    pdf_url text,
    html_url text,
    pdf_sha1 text,
    html_sha1 text,
    body_text text,
    fetched_at timestamptz NOT NULL,
    UNIQUE (report_id, version)
);

CREATE INDEX crs_report_versions_report ON crs_report_versions (report_id, version DESC);

-- ============================================================
-- Authors (CRS analysts)
-- ============================================================
CREATE TABLE crs_report_authors (
    id bigserial PRIMARY KEY,
    report_id text NOT NULL REFERENCES crs_reports(id) ON DELETE CASCADE,
    author_name text NOT NULL,
    sequence int, -- order author listed
    UNIQUE (report_id, author_name)
);

-- ============================================================
-- Topics (CRS-assigned)
-- ============================================================
CREATE TABLE crs_report_topics (
    report_id text NOT NULL REFERENCES crs_reports(id) ON DELETE CASCADE,
    topic text NOT NULL,
    PRIMARY KEY (report_id, topic)
);

CREATE INDEX crs_report_topics_topic ON crs_report_topics (topic);

-- ============================================================
-- Related bills (soft FK; CRS-tagged)
-- ============================================================
CREATE TABLE crs_report_related_bills (
    report_id text NOT NULL REFERENCES crs_reports(id) ON DELETE CASCADE,
    bill_id text NOT NULL, -- 'bill:117-hr-1234'; soft FK
    PRIMARY KEY (report_id, bill_id)
);

CREATE INDEX crs_report_related_bills_bill ON crs_report_related_bills (bill_id);

-- ============================================================
-- Vector chunks for similarity
-- ============================================================
-- The vector type and the hnsw / vector_cosine_ops index below come from pgvector.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE crs_report_chunks (
    id bigserial PRIMARY KEY,
    report_id text NOT NULL REFERENCES crs_reports(id) ON DELETE CASCADE,
    chunk_index int NOT NULL,
    chunk_text text NOT NULL,
    embedding vector(1024),
    UNIQUE (report_id, chunk_index)
);

CREATE INDEX crs_report_chunks_embedding
    ON crs_report_chunks USING hnsw (embedding vector_cosine_ops);

Schema decisions worth flagging:

  • Latest version denormalized onto crs_reports — most queries want the current version. Past versions are in crs_report_versions for citation history.
  • Both raw_json_mirror and raw_json_api preserved — the two sources differ in coverage (mirror has fuller version history; API has richer related-materials and authors). Keeping both supports re-parsing if either source changes shape.
  • latest_html_sha1 checksum for idempotency — refetch only when the SHA1 differs from what we have.
  • Search tsvector on title (A), summary (B), body (D) — body text is heavy enough that we want it ranked lower than title/summary on tied relevance. No author-name weighting (analyst names rarely matter to retrieval).
  • crs_report_related_bills.bill_id as soft FK — the related bill may be from a congress we haven't fully ingested.
  • Body PDF URL kept as a path reference, not the bytes — PDF parsing is typically worse than HTML parsing; we treat HTML as primary and only fall back to PDF when HTML rendition is missing or malformed.

Daily strategy:

  1. Conditional GET on reports.csv with If-Modified-Since: {last_run}. If 304, skip. If 200:
  2. Diff CSV: any (number, latestPubDate) pair not in our crs_reports table is a new version.
  3. For each new version, fetch reports/{number}.json, identify the latest version entry, fetch its PDF + HTML.
  4. SHA1 checksum verifies file integrity (the mirror provides per-format SHA1 hashes).
  5. Parse HTML → text. Update crs_reports.latest_* to the new version; insert previous-version row into crs_report_versions.
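Step 2 of the list above (the CSV diff) can be sketched as a pure function over already-parsed rows. Assumptions: `latestPubDate` begins with an ISO `YYYY-MM-DD` date, `known` is a `{number: latest_publish_date}` mapping loaded from crs_reports, and `new_versions` is a hypothetical name.

```python
from datetime import date


def new_versions(csv_rows, known):
    """Return report numbers whose (number, latestPubDate) pair is not already stored."""
    changed = []
    for row in csv_rows:
        # Keep only the date part in case the column carries a full timestamp.
        published = date.fromisoformat(row["latestPubDate"][:10])
        if known.get(row["number"]) != published:
            changed.append(row["number"])
    return changed


known = {"R48481": date(2026, 4, 1), "IF13213": date(2026, 5, 1)}
csv_rows = [
    {"number": "R48481", "latestPubDate": "2026-05-02"},    # newer than stored
    {"number": "IF13213", "latestPubDate": "2026-05-01"},   # unchanged
    {"number": "LSB10987", "latestPubDate": "2026-05-03"},  # not stored yet
]
to_fetch = new_versions(csv_rows, known)
```

Only the numbers in `to_fetch` need a reports/{number}.json request, which keeps the daily run proportional to the number of changes rather than the size of the corpus.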

Optional secondary check (weekly): Congress.gov /crsreport?fromDateTime={last_run}&sort=updateDate desc — confirm we caught everything. Alert on drift.

Hourly (during business hours): for "active" reports likely to update (recently-published, frequently-updated), poll EveryCRSReport per-report JSON. Most reports update predictably, so this is rarely needed.

Refresh cadence is gentle — one full master CSV download per day is fine. EveryCRSReport's reports.csv is 6.6 MB — sufficient inventory for a daily diff.

  1. Pull reports.csv (~6.6 MB).
  2. For each row in CSV, fetch reports/{number}.json. ~23K JSON fetches at 1 req/s = ~6.5 hours. With 4 workers @ 1 req/s = ~2 hours.
  3. For each report, fetch the latest latestHTML file (~3-50 KB). Same parallelism = ~2 hours for HTML.
  4. PDFs — fetch lazily on first agent request unless we want eager backfill. PDFs total ~3-5 GB; HTML ~1-2 GB. Storage budget allows eager.
  5. Parse + load + embed. Body text → tsvector + chunks.

Total backfill: ~6-8 hours wall.
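The fetch-time figures above follow from simple rate arithmetic, reproduced here so they can be re-run if the row count or politeness rate changes. The 23,211 row count is the May 2026 figure from this page; `fetch_hours` is a hypothetical helper.

```python
REPORTS = 23_211  # CSV rows as of May 2026


def fetch_hours(n_requests, workers=1, req_per_sec_per_worker=1.0):
    """Wall-clock hours to issue n_requests at a fixed per-worker request rate."""
    return n_requests / (workers * req_per_sec_per_worker) / 3600


single = fetch_hours(REPORTS)             # one worker at 1 req/s
pooled = fetch_hours(REPORTS, workers=4)  # four workers at 1 req/s each
```

One worker comes out at roughly 6.4 hours and four workers at roughly 1.6 hours per pass, which the step list rounds up to ~6.5 and ~2 hours. Two such passes (JSON, then HTML) leave the rest of the ~6-8 hour estimate for parsing, loading, and embedding.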

  1. 04:00 UTC: pull reports.csv with If-Modified-Since. If 304, skip remainder. If 200, continue.
  2. Diff CSV against crs_reports rows: rows with latestPubDate > our latest_publish_date (new version).
  3. For each new version, fetch reports/{number}.json. Insert version row, update parent row.
  4. Fetch new HTML/PDF. SHA1-verify against the mirror's checksum.
  5. Re-embed chunks for affected reports.

Per architecture: source key crs_reports. Run state stores last_csv_etag, last_csv_lastmodified, last successful parse timestamp.

  • Mirror down — fall back to Congress.gov API. Daily cron logs a warning; multi-day outage triggers alert.
  • CSV parse errors — ~2,994 rows in the current CSV have title commas/quotes (or empty/malformed prefixes). Don't fail the whole run; log per-row parse errors and continue.
  • Report withdrawn — an active=false flag flips. Update is_active but keep all rows + body text (legal/historical research need).
  • HTML rendition malformed — fall back to PDF text extraction (pdftotext / pdfplumber). Mark parsed_at regardless.
  • Title-only references — older reports sometimes have no summary and no topics. Body is still searchable; UI surfaces "summary unavailable."
  • Mirror SHA1 mismatch with download — retry once. If persistent, alert; the mirror may be mid-update.
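The retry-once rule for checksum mismatches can be sketched with `hashlib`. The `fetch` argument is a hypothetical zero-argument callable that returns the downloaded bytes; alerting is left to the caller, which here just sees an exception.

```python
import hashlib


class ChecksumMismatch(Exception):
    """Raised when a download still fails SHA1 verification after the retry."""


def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def fetch_verified(fetch, expected_sha1: str, retries: int = 1) -> bytes:
    """Call fetch() up to retries + 1 times until the body matches expected_sha1."""
    actual = None
    for _ in range(retries + 1):
        data = fetch()
        actual = sha1_hex(data)
        if actual == expected_sha1.lower():
            return data
    raise ChecksumMismatch(f"expected {expected_sha1}, got {actual}")


# Simulated download that is truncated on the first attempt only.
good = b"<html>report body</html>"
attempts = iter([b"<html>rep", good])
body = fetch_verified(lambda: next(attempts), sha1_hex(good))
```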

These don't block ingestion but should be resolved before this source is "shipped":

  • Author resolution. CRS analyst names in authors are free-text. Many are public via the CRS analyst directory. Building a canonical analyst directory would let the agent answer "what else has Janet Doe written about export controls." v1.x.
  • Topic taxonomy normalization. Congress.gov's topics[] strings (e.g. "Veterans Affairs", "International Trade") are CRS-internal categories that overlap with but don't equal the bill subject taxonomy in the topic-taxonomy source. Decide if we maintain a unified topic table or keep CRS topics in their own column. Lean toward source-specific to start; consolidate in v1.x.
  • Citation extraction at parse time vs separate batch. CRS reports cite bills, USC, CFR, court cases, and other CRS reports densely. Running eyecite at parse time slows ingestion but populates the citation graph table eagerly. Lean toward parse-time for the citation graph (this is the substrate's load-bearing capability) but bench first.
  • PDF as primary source for non-HTML reports. A small fraction of reports have HTML-malformed renditions. Default to HTML, fall back to PDF; document the fallback rate after backfill.
  • Stable-ID collision with old report numbers. CRS occasionally retires a number range. Verify by running our Postgres unique constraint over the full backfill — alert on conflicts.
  • Legal Sidebars (LSB) as a distinct subtype with different needs. They're shorter, more frequent, with different structural layout (more case citations). Worth a separate parser path for tighter quote-extraction.

Per the project's HTML→Markdown manual-validation gate, every HTML/XML→Markdown converter passes this stratified check before promotion. CRS uses html_v1 (markdownify-based, lossy-table detection, scripts/nav stripping).

Sample set — 12 reports, type-prefix-stratified

Selection rule: for each type, the most-recently-updated reports with >5 KB HTML. Captured as fixtures under josh-ingester/tests/fixtures/crs/<number>.html so the gate is reproducible.

| Type | Number | Title (excerpt) | HTML bytes |
|---|---|---|---|
| R | R45770 | The U.S. Election Assistance Commission (EAC) | 195,842 |
| R | R47021 | Federal Involvement in Ocean-Based Research and Development | 152,946 |
| R | R48936 | The Electoral College: Frequently Asked Questions | 52,831 |
| IF | IF11681 | Defense Primer: LGM-35A Sentinel ICBM | 39,084 |
| IF | IF13218 | Endangered Species Committee ("God Squad") Exemption | 25,950 |
| LSB | LSB11429 | Artificial Intelligence and the Fourth Amendment | 49,157 |
| LSB | LSB11428 | Vaccine Injury Compensation Program | 45,048 |
| IN | IN12690 | The FY2027 President's Budget: Revenues | 43,918 |
| RL | RL30787 | Parliamentary Reference Sources: House | 137,994 |
| RS | RS22331 | Foreign Holdings of Federal Debt | 105,295 |
| IG | IG10096 | US Service Academy Nominations: Timelines | 11,801 |
| TE | TE10122 | Mitigation and Multiple Loss Properties | 93,290 |

Validation run — 2026-05-08, normalizer html_v1 v1.0.0

All 12 samples processed through html_v1.to_markdown(). Metrics + criteria assessment below.

| # | Number | Ratio | Headings | MD-tables | Artifacts | Warnings |
|---|---|---|---|---|---|---|
| 1 | R45770 | 0.68 | 34 | 192 | 0 | (none) |
| 2 | R47021 | 0.72 | 24 | 162 | 0 | (none) |
| 3 | R48936 | 0.75 | 25 | 57 | 0 | (none) |
| 4 | IF11681 | 0.55 | 10 | 5 | 0 | (none) |
| 5 | IF13218 | 0.66 | 5 | 0 | 0 | (none) |
| 6 | LSB11429 | 0.76 | 6 | 6 | 0 | (none) |
| 7 | LSB11428 | 0.77 | 6 | 0 | 0 | (none) |
| 8 | IN12690 | 0.64 | 4 | 3 | 0 | (none) |
| 9 | RL30787 | 0.59 | 41 | 93 | 0 | (none) |
| 10 | RS22331 | 0.67 | 9 | 55 | 1 | "table 2: rendered as raw HTML" |
| 11 | IG10096 | 0.34 | 2 | 0 | 0 | (none); at lower bound, expected for image-heavy infographic |
| 12 | TE10122 | 0.77 | 19 | 119 | 0 | (none) |

| Criterion | Pass bar | Status |
|---|---|---|
| Heading hierarchy | All <h1>–<h4> map to #–#### in correct nesting order | ✅ all 12 |
| Section ordering | First-paragraph-of-section text in source order | ✅ all 12 (verified via spot-check) |
| Footnotes | Markers preserved; round-trip without manual hunting | ✅ where present (R45770: 12, R47021: 2; most CRS uses parenthetical citations rather than footnotes) |
| Tables | Every <table> becomes Markdown table OR Artifact reference | ✅ all 12; 9 produce Markdown tables; RS22331 caught 1 lossy table as Artifact(kind='table') with surfaced warning |
| Block quotes | <blockquote> becomes > | ✅ verified in spot-check (LSB11429 has indented statutory excerpts) |
| Citations | "42 U.S.C. § 1396a"-style preserved verbatim, § not Unicode-mangled | ✅ all 12 (R45770: 59 U.S.C. refs + 61 § symbols, all intact; Pub.L. abbreviations preserved across reports) |
| Links | Every <a href> becomes [text](url); absolute URLs via source_hint resolution | ✅ all 12 (counts range 13–275 per report) |
| Lists | <ol>/<ul> preserved with correct nesting; numbering preserved | ✅ verified in spot-check |
| Emphasis | <em>/<strong> map to */** | ✅ verified in spot-check |
| Images / figures | <img> / <figure> produce Markdown link with alt text or Artifact | ✅ verified (IG10096 has ![]() for empty-alt images; non-trivial captions preserved as paragraphs) |
| No HTML tag bleed | Zero raw <div>, <span> (tables exempt where lossy) | ✅ all 12; grep '<div\|<span' → 0 hits across all outputs |
| Body length sanity | len(markdown) / len(html) ∈ [0.3, 0.8] | ✅ all 12 (range 0.34–0.77) |

html_v1 v1.0.0 PROMOTED for use as the CRS HTML normalizer. All 12 samples pass all criteria. Lossy-table detection works (RS22331). Length-ratio sanity holds even at the IG infographic edge (0.34, just above the 0.3 floor).
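Two of the criteria (body length sanity and no HTML tag bleed) are mechanical enough to automate over the fixtures. A minimal sketch under assumptions: `gate` is a hypothetical helper, the converter output is passed in as a string rather than produced here, and the lossy-table exemption for raw HTML is not modelled.

```python
import re

_TAG_BLEED = re.compile(r"<div|<span")  # same pattern as the grep check


def gate(markdown: str, html: str, low: float = 0.3, high: float = 0.8) -> dict:
    """Compute the length ratio and count raw <div>/<span> occurrences in the Markdown."""
    if not html:
        raise ValueError("empty HTML source")
    ratio = len(markdown) / len(html)
    return {
        "ratio": round(ratio, 2),
        "ratio_ok": low <= ratio <= high,
        "tag_bleed": len(_TAG_BLEED.findall(markdown)),
    }


# Toy input standing in for one fixture and its converted output.
html = "<div><h1>Title</h1><p>" + "x" * 60 + "</p></div>"
markdown = "# Title\n\n" + "x" * 60
result = gate(markdown, html)
```

For the toy input the ratio is 69 / 92, inside the [0.3, 0.8] band, with no tag bleed. Running the same function over each fixture gives the Ratio column and the two automated criteria in one pass.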

Known cosmetic issue (CRS-specific, deferred)

The EveryCRSReport.com page wrapper produces a duplicate H1 on most reports — once as the page title (# {title} - EveryCRSReport.com) and once as the actual report H1 (# {title}). This is upstream wrapper noise, not a normalizer defect. Address in the CRS source module's parse stage (Phase 4) by trimming the wrapper title or extracting only <main> / <article> content. Not a promotion blocker.
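One way the wrapper-title trim could look at the Markdown level is sketched below. It assumes the wrapper heading is the first H1, ends with the literal suffix shown above, and otherwise matches the report's own H1; `drop_wrapper_h1` is a hypothetical name, and lines starting with `# ` inside code fences are not special-cased.

```python
WRAPPER_SUFFIX = " - EveryCRSReport.com"


def drop_wrapper_h1(markdown: str) -> str:
    """Remove the page-title H1 when it duplicates the report's own H1."""
    lines = markdown.splitlines()
    h1 = [i for i, line in enumerate(lines) if line.startswith("# ")]
    if len(h1) >= 2 and lines[h1[0]].endswith(WRAPPER_SUFFIX):
        wrapper_title = lines[h1[0]][2:-len(WRAPPER_SUFFIX)].strip()
        if wrapper_title == lines[h1[1]][2:].strip():
            del lines[h1[0]]
    text = "\n".join(lines).lstrip("\n")
    # splitlines() drops a trailing newline; restore it if the input had one.
    return text + ("\n" if markdown.endswith("\n") else "")


doc = (
    "# The Electoral College: Frequently Asked Questions - EveryCRSReport.com\n"
    "\n"
    "# The Electoral College: Frequently Asked Questions\n"
    "\n"
    "Body text.\n"
)
cleaned = drop_wrapper_h1(doc)
```

Extracting only the <main> / <article> element before conversion would avoid the duplicate entirely; this post-hoc trim is the smaller change if the parse stage already works on Markdown.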

If html_v1 ships a new version:

  1. Run josh-ingester/tests/fixtures/crs/ through the new converter.
  2. Re-validate against this 12-sample set.
  3. Update this section with the new run's metrics + per-criterion results.
  4. Bump body_normalizer_version in crs_reports; the substrate's re-conversion path picks up changed rows.