CRS reports
Congressional Research Service products: nonpartisan analysis written for Members of Congress and their staff. Covers virtually every policy area — economics, foreign affairs, intelligence, environment, agriculture, social policy. CRS reports are the go-to backgrounder format on Capitol Hill: a Member's office gets one before a vote, before a hearing, before a town hall.
For Josh, CRS reports are a high-value retrieval target — they're substantive, well-cited, and authored by domain experts. They are also (since 2018) all public, all online, and uniformly formatted enough to ingest reliably.
This is a reference / analysis source — no FK relationships into the legislative-action graph except the soft relatedMaterials links to bills. Load whenever; not blocking on legislators or bills. Citation extraction at parse time will populate cross-references to bills, USC, and CFR.
Quick reference
| Source name | CRS reports |
| Publishers | Congressional Research Service (Library of Congress) |
| License | Reports are works of the US government — public domain. CRS asserts no copyright. |
| Coverage | 1971 – present. Public posting became mandatory under the Consolidated Appropriations Act, 2018 (P.L. 115-141), but the EveryCRSReport.com mirror reaches back well before that: 9,820 of the 23,211 CSV rows pre-date 2013 (back to 1971), and all of them are ingested, so pre-2013 is in scope. ~23K active reports + thousands of archived versions. |
| Volume | ~23K reports in the mirror (23,211 CSV rows as of May 2026, 9,820 of them pre-2013); ~600-1500 new + ~3K updates per year. Each report ~50-300 KB HTML, ~100-500 KB PDF. |
| Storage estimate | ~3-5 GB raw HTML + PDF; ~1-2 GB extracted text |
| Primary source | EveryCRSReport.com community mirror (see Source priority decision) |
| Verification source | Congress.gov v3 /crsreport API — see api.congress.gov CRS report docs (register a key at api.data.gov/signup) |
| Stable ID format | crs:{number} e.g. crs:R48481, crs:IF13213 |
| Status | exploring — schema drafted, ingestion not built |
Upstream endpoint, auth, and rate-limit specifics live in the EveryCRSReport about page and the Congress.gov CRS report API docs; the access behavior Josh observed in practice is in Access notes below.
Source priority decision
Primary: EveryCRSReport.com mirror. Project run by Daniel Schuman + Demand Progress (and now FrontFin / Federation of American Scientists), continuously crawled from crsreports.congress.gov. Provides:
- A daily-refreshed master CSV (`reports.csv`, ~6.6 MB) with every report's number, URL, latest pub date, title, latest PDF/HTML paths. Conditional GET works.
- Per-report JSON at `reports/{number}.json` with full version history (every revision the mirror has crawled) and formats with SHA1 checksums. Top-level keys are exactly `[active, id, number, source, topics, type, typeId, versions]`; there is no `authors`, `relatedBills`, or `bills` key, so the `crs_report_authors` and `crs_report_related_bills` tables receive zero rows on this mirror.
- The actual PDF and HTML files at deterministic paths.
This is the path of least resistance: one CSV download tells us the entire universe; one JSON per report gives the full version history; SHA1 checksums make idempotency cheap. The mirror's per-report JSON shape is documented (informally) on the EveryCRSReport about page — there is no canonical upstream API reference for it.
Secondary: Congress.gov v3 /crsreport. The official source, with a cleaner JSON shape and authoritative content. Use it as a verification layer (spot-check a handful daily for drift) but not as primary because (a) it costs api.data.gov quota, (b) the EveryCRSReport mirror is comprehensive and faster, (c) the official feed sometimes lags the mirror by hours. The field-by-field response shape is documented at api.congress.gov.
Skip: scraping crsreports.congress.gov directly. EveryCRSReport already does this. Re-doing it duplicates work and creates a second crawl footprint on the underlying source.
Skip: separate pre-2018 backfill via UNT digital library, FAS legacy archive, etc. Unnecessary — the EveryCRSReport.com mirror already carries the older corpus (9,820 of 23,211 rows pre-date 2013, back to 1971), and the ingester loads all of it. No extra archive scrape is needed for pre-2018 coverage.
Access notes
EveryCRSReport: clean nginx behind Cloudflare, Last-Modified and ETag headers reliable, If-Modified-Since and If-None-Match both work. No User-Agent restrictions. Daily refresh of reports.csv happens around 07:00 UTC.
Congress.gov: same auth pattern as Bills. Real api.data.gov key needed for ingestion. DEMO_KEY hits ~30/hour cap quickly during probing.
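The conditional-GET behavior noted for the mirror can be sketched with the standard library. This is a minimal illustration, not the ingester's actual code: `conditional_headers` and `fetch_if_changed` are hypothetical names, and the caller supplies the URL.

```python
import urllib.error
import urllib.request
from typing import Optional, Tuple


def conditional_headers(etag: Optional[str] = None,
                        last_modified: Optional[str] = None) -> dict:
    """Build validator headers from whatever the previous run stored."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url: str,
                     etag: Optional[str] = None,
                     last_modified: Optional[str] = None,
                     timeout: float = 30.0
                     ) -> Optional[Tuple[bytes, Optional[str], Optional[str]]]:
    """Return (body, etag, last_modified), or None when the server answers 304."""
    req = urllib.request.Request(url, headers=conditional_headers(etag, last_modified))
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read(), resp.headers.get("ETag"), resp.headers.get("Last-Modified")
    except urllib.error.HTTPError as exc:
        # urllib surfaces 304 Not Modified as an HTTPError rather than a response.
        if exc.code == 304:
            return None
        raise
```

Sending both validators is harmless; a server that honors `If-None-Match` gives it precedence over `If-Modified-Since`.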
Endpoints
Upstream URL patterns, the per-report JSON / CSV column schemas, and the Congress.gov v3 field listings are documented upstream:
- EveryCRSReport (`reports.csv` columns, `reports/{number}.json`, `files/...` PDF/HTML paths): everycrsreport.com/about
- Congress.gov v3 `/crsreport` and `/crsreport/{id}`: api.congress.gov CRS report docs
Josh-specific notes on those shapes that no upstream doc states:
- The `reports.csv` URLs are relative to `everycrsreport.com/`.
- The direct CRS document URLs on `congress.gov` follow `crs_external_products/{TYPE}/PDF/{number}/{number}.{version}.pdf` and `crs_external_products/{TYPE}/HTML/{number}.web.html`, where `{TYPE}` is the prefix (`R`, `IF`, `RL`, etc.) and `{version}` is a numeric per-report version (`.1.`, `.2.`).
- The mirror's per-report JSON has only the keys `[active, id, number, source, topics, type, typeId, versions]`; see Source priority decision for the zero-row consequence on `crs_report_authors` / `crs_report_related_bills`.
- On Congress.gov, the `relatedMaterials` field is a soft cross-reference: a CRS report cites bills, but linking is editorial (CRS staff tag the report with relevant bills they discuss). Useful for the agent's "find me CRS analysis on this bill" query.
- The `authors` field (Congress.gov only) is sparse on older reports; CRS keeps analyst names current but legacy reports (pre-2010) may have author byline removed for retired-staff privacy.
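The direct-URL pattern in the notes above can be expressed as two small path builders. This is a sketch under assumptions: the function names are hypothetical, only the relative path is produced (the host and scheme are left to the caller), and prefix extraction handles plain leading-letter prefixes such as `R`, `IF`, `LSB`, not the hyphenated specialized ones.

```python
import re

_PREFIX_RE = re.compile(r"^[A-Za-z]+")


def type_prefix(number: str) -> str:
    """Leading letters of a report number, e.g. 'IF13213' -> 'IF'."""
    match = _PREFIX_RE.match(number)
    if match is None:
        raise ValueError(f"no type prefix in report number: {number!r}")
    return match.group(0).upper()


def pdf_path(number: str, version: int) -> str:
    """Relative path of one PDF version, following the documented pattern."""
    return f"crs_external_products/{type_prefix(number)}/PDF/{number}/{number}.{version}.pdf"


def html_path(number: str) -> str:
    """Relative path of the HTML rendition, following the documented pattern."""
    return f"crs_external_products/{type_prefix(number)}/HTML/{number}.web.html"
```

Rows with an empty prefix raise `ValueError`, so a caller can log and skip them instead of building a malformed path.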
Vocabulary / enums
CRS report types (prefix in report number)
Distribution from a fresh EveryCRSReport CSV (May 2026):
| Prefix | Count | Type |
|---|---|---|
| R | 6,881 | Standard report |
| RL | 4,280 | Long-form report (legacy prefix; phased out for new reports) |
| IF | 3,183 | In Focus: 2-page brief on a single topic |
| RS | 2,433 | Short report (legacy prefix) |
| IN | 2,353 | Insight: short topical analysis |
| LSB | 1,359 | Legal Sidebar: legal analysis |
| IB | 400 | Issue Brief (mostly historical) |
| TE | 91 | Testimony (when CRS analyst testified) |
| IG | 91 | Infographic |
| ZZZ, ZZZA-F | ~85 | Archived / withdrawn |
| IP | 41 | Information Paper |
| IS | 32 | Information Sheet |
| MB | 19 | Memorandum (rare; specific to a Member request) |
| SP-, RA-, M-, CP, IPO, SS-CSRS | <10 each | Specialized prefixes |
Some rows have empty prefix or quote-comma issues (~2,994 rows have title commas/quotes that need RFC 4180-aware parsing — handle defensively in CSV parser; don't fail the row).
The Congress.gov contentType enum (reports, posts, resources, infographics, testimony) and the status enum (active | archived) are documented at api.congress.gov. Note: on the mirror an archived report may still be present, and Josh stores it.
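The defensive CSV handling called for above can be sketched with Python's `csv` module, which already copes with quoted commas and doubled quotes in titles. This is a hedged illustration: `parse_reports_csv` is a hypothetical name, and the column names (`number`, `latestPubDate`, `title`) are assumptions about the mirror's header row.

```python
import csv
import io
import logging
from typing import Dict, List, Tuple

log = logging.getLogger("crs.csv")

# Assumed column names; adjust to the actual reports.csv header.
REQUIRED_COLUMNS = ("number", "latestPubDate", "title")


def parse_reports_csv(text: str) -> Tuple[List[Dict[str, str]], List[Tuple[int, str]]]:
    """Split rows into (good, bad) without ever failing the whole file."""
    good: List[Dict[str, str]] = []
    bad: List[Tuple[int, str]] = []
    reader = csv.DictReader(io.StringIO(text, newline=""))
    for row in reader:
        line_no = reader.line_num
        if None in row:
            # DictReader parks surplus fields under the None key.
            reason = "too many fields"
        elif any(not (row.get(col) or "").strip() for col in REQUIRED_COLUMNS):
            # Short rows are padded with None; empty strings are also rejected.
            reason = "missing required column"
        else:
            good.append(row)
            continue
        log.warning("reports.csv line %d skipped: %s", line_no, reason)
        bad.append((line_no, reason))
    return good, bad
```

Returning the bad rows alongside the good ones lets the run report a per-row error count instead of aborting.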
Stable ID format
`crs:{number}`, uppercase prefix preserved.
Examples:
- `crs:R48481` (a "report" type)
- `crs:IF13213` (an "in focus" type)
- `crs:LSB10987` (a "legal sidebar")
The number is unique across all types — R48481 and IF13213 cannot collide because the type prefix is part of the canonical number. CRS ensures global uniqueness.
Versioning: every report has a version number that monotonically increases as updates are published. We store all versions. Per-version stable ID derived as crs:{number}-v{version} if needed (e.g. crs:R48481-v3), but the canonical row is the latest version; older versions are in crs_report_versions.
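As a small sketch of the ID scheme described above, the helpers below build and parse both forms. The function names are hypothetical, and uppercasing the input is an assumed normalization step rather than something the source specifies.

```python
import re
from typing import Optional, Tuple

# Non-greedy number so a trailing "-v<digits>" is read as the version suffix.
_ID_RE = re.compile(r"^crs:(?P<number>.+?)(?:-v(?P<version>\d+))?$")


def stable_id(number: str) -> str:
    """'if13213' -> 'crs:IF13213' (uppercasing is an assumed normalization)."""
    cleaned = number.strip().upper()
    if not cleaned:
        raise ValueError("empty report number")
    return f"crs:{cleaned}"


def version_id(number: str, version: int) -> str:
    """Per-version ID, e.g. 'crs:R48481-v3'."""
    return f"{stable_id(number)}-v{version}"


def parse_stable_id(value: str) -> Tuple[str, Optional[int]]:
    """Return (number, version); version is None for the canonical row ID."""
    match = _ID_RE.match(value)
    if match is None:
        raise ValueError(f"not a CRS stable ID: {value!r}")
    version = match.group("version")
    return match.group("number"), int(version) if version is not None else None
```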
Response shapes
Both upstream JSON shapes are documented upstream: the EveryCRSReport mirror about page for reports/{number}.json and api.congress.gov for /crsreport/{id}. The Josh-relevant facts about those shapes (mirror key set, the zero-row consequence, the soft relatedMaterials link, the sparse authors) are captured under Source priority decision and Endpoints.
Schema (Postgres DDL)
```sql
-- ============================================================
-- CRS reports (latest version)
-- ============================================================
CREATE TABLE crs_reports (
  id                  text PRIMARY KEY,              -- 'crs:IF13213'
  number              text NOT NULL UNIQUE,          -- 'IF13213'
  type_id             text NOT NULL,                 -- 'IF', 'R', 'LSB', 'IN', etc.
  type_name           text NOT NULL,                 -- 'In Focus', 'Report', 'Legal Sidebar', etc.
  content_type        text,                          -- Congress.gov: 'reports' | 'posts' | 'resources' | ...

  title               text NOT NULL,
  summary             text,                          -- Often null on mirror; richer via Congress.gov

  -- Latest version metadata
  latest_version      int,
  latest_publish_date date NOT NULL,
  update_date         timestamptz,                   -- Congress.gov updateDate

  -- Status
  is_active           boolean NOT NULL DEFAULT true,
  status              text,                          -- 'active' | 'archived'

  -- Latest version source URLs
  latest_pdf_url      text,
  latest_html_url     text,
  latest_pdf_sha1     text,
  latest_html_sha1    text,

  -- Latest version body (extracted at parse)
  body_text           text,                          -- from HTML rendition
  body_html           bytea,                         -- gzipped raw HTML preserved
  body_pdf_url_local  text,                          -- /data/corpus/crs/files/...

  -- Lifecycle
  raw_json_mirror     jsonb,                         -- everycrsreport.com/reports/{n}.json
  raw_json_api        jsonb,                         -- Congress.gov /crsreport/{id}
  fetched_at          timestamptz NOT NULL,
  parsed_at           timestamptz,
  inserted_at         timestamptz NOT NULL DEFAULT now(),
  updated_at          timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX crs_reports_publish_date ON crs_reports (latest_publish_date DESC);
CREATE INDEX crs_reports_type ON crs_reports (type_id, latest_publish_date DESC);
CREATE INDEX crs_reports_active ON crs_reports (latest_publish_date DESC) WHERE is_active;
CREATE INDEX crs_reports_update_date ON crs_reports (update_date DESC NULLS LAST);

-- Search
ALTER TABLE crs_reports ADD COLUMN search_tsv tsvector
  GENERATED ALWAYS AS (
    setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(summary, '')), 'B') ||
    setweight(to_tsvector('english', coalesce(body_text, '')), 'D')
  ) STORED;
CREATE INDEX crs_reports_search ON crs_reports USING gin (search_tsv);

-- ============================================================
-- Per-version history (every prior version preserved)
-- ============================================================
CREATE TABLE crs_report_versions (
  id           text PRIMARY KEY,                     -- 'crs:R48481-v3'
  report_id    text NOT NULL REFERENCES crs_reports(id) ON DELETE CASCADE,
  version      int NOT NULL,
  publish_date date NOT NULL,
  title        text NOT NULL,
  summary      text,
  pdf_url      text,
  html_url     text,
  pdf_sha1     text,
  html_sha1    text,
  body_text    text,
  fetched_at   timestamptz NOT NULL,
  UNIQUE (report_id, version)
);

CREATE INDEX crs_report_versions_report ON crs_report_versions (report_id, version DESC);

-- ============================================================
-- Authors (CRS analysts)
-- ============================================================
CREATE TABLE crs_report_authors (
  id          bigserial PRIMARY KEY,
  report_id   text NOT NULL REFERENCES crs_reports(id) ON DELETE CASCADE,
  author_name text NOT NULL,
  sequence    int,                                   -- order author listed
  UNIQUE (report_id, author_name)
);

-- ============================================================
-- Topics (CRS-assigned)
-- ============================================================
CREATE TABLE crs_report_topics (
  report_id text NOT NULL REFERENCES crs_reports(id) ON DELETE CASCADE,
  topic     text NOT NULL,
  PRIMARY KEY (report_id, topic)
);

CREATE INDEX crs_report_topics_topic ON crs_report_topics (topic);

-- ============================================================
-- Related bills (soft FK; CRS-tagged)
-- ============================================================
CREATE TABLE crs_report_related_bills (
  report_id text NOT NULL REFERENCES crs_reports(id) ON DELETE CASCADE,
  bill_id   text NOT NULL,                           -- 'bill:117-hr-1234'; soft FK
  PRIMARY KEY (report_id, bill_id)
);

CREATE INDEX crs_report_related_bills_bill ON crs_report_related_bills (bill_id);

-- ============================================================
-- Vector chunks for similarity
-- ============================================================
CREATE TABLE crs_report_chunks (
  id          bigserial PRIMARY KEY,
  report_id   text NOT NULL REFERENCES crs_reports(id) ON DELETE CASCADE,
  chunk_index int NOT NULL,
  chunk_text  text NOT NULL,
  embedding   vector(1024),
  UNIQUE (report_id, chunk_index)
);
CREATE INDEX crs_report_chunks_embedding ON crs_report_chunks USING hnsw (embedding vector_cosine_ops);
```

Schema decisions worth flagging:
- Latest version denormalized onto `crs_reports`: most queries want the current version. Past versions are in `crs_report_versions` for citation history.
- Both `raw_json_mirror` and `raw_json_api` preserved: the two sources differ in coverage (mirror has fuller version history; API has richer related-materials and authors). Keeping both supports re-parsing if either source changes shape.
- `latest_html_sha1` checksum for idempotency: refetch only when the SHA1 differs from what we have.
- Search tsvector on title (A), summary (B), body (D): body text is heavy enough that we want it ranked lower than title/summary on tied relevance. No author-name weighting (analyst names rarely matter to retrieval).
- `crs_report_related_bills.bill_id` as soft FK: the related bill may be from a congress we haven't fully ingested.
- Body PDF URL kept as a path reference, not the bytes: PDF parsing is typically worse than HTML parsing; we treat HTML as primary and only fall back to PDF when HTML rendition is missing or malformed.
Caching / incremental sync
Daily strategy:
- Conditional GET on `reports.csv` with `If-Modified-Since: {last_run}`. If 304, skip. If 200:
- Diff CSV: any `(number, latestPubDate)` pair not in our `crs_reports` table is a new version.
- For each new version, fetch `reports/{number}.json`, identify the latest version entry, fetch its PDF + HTML.
- SHA1 checksum verifies file integrity (the mirror provides per-format SHA1 hashes).
- Parse HTML → text. Update `crs_reports.latest_*` to the new version; insert previous-version row into `crs_report_versions`.
Optional secondary check (weekly): Congress.gov /crsreport?fromDateTime={last_run}&sort=updateDate desc — confirm we caught everything. Alert on drift.
Hourly (during business hours): for "active" reports likely to update (recently-published, frequently-updated), poll EveryCRSReport per-report JSON. Most reports update predictably, so this is rarely needed.
Refresh cadence is gentle: one full master CSV download per day is fine. At ~6.6 MB, reports.csv is a complete inventory and small enough to diff daily.
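The diff step in the daily strategy can be sketched as a pure function over parsed CSV rows. This is a hedged illustration: `find_new_versions` is a hypothetical name, the column names are assumptions, and `latestPubDate` is assumed to be an ISO `YYYY-MM-DD` string.

```python
from datetime import date
from typing import Dict, Iterable, List, Set, Tuple


def find_new_versions(csv_rows: Iterable[Dict[str, str]],
                      known: Set[Tuple[str, date]]) -> List[str]:
    """Report numbers whose (number, latestPubDate) pair is not stored yet.

    `known` would be loaded from crs_reports as
    (number, latest_publish_date) pairs.
    """
    to_fetch: List[str] = []
    for row in csv_rows:
        pair = (row["number"], date.fromisoformat(row["latestPubDate"]))
        if pair not in known:
            to_fetch.append(row["number"])
    return to_fetch
```

A pair comparison catches both brand-new reports and new versions of existing ones. It would miss a revision republished on the same date, which is one reason the SHA1 comparison is still worth keeping.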
Download / update strategy
Backfill (one-time)
- Pull `reports.csv` (~6.6 MB).
- For each row in CSV, fetch `reports/{number}.json`. ~23K JSON fetches at 1 req/s = ~6.5 hours. With 4 workers @ 1 req/s = ~2 hours.
- For each report, fetch the latest `latestHTML` file (~3-50 KB). Same parallelism = ~2 hours for HTML.
- PDFs: fetch lazily on first agent request unless we want eager backfill. PDFs total ~3-5 GB; HTML ~1-2 GB. Storage budget allows eager.
- Parse + load + embed. Body text → tsvector + chunks.
Total backfill: ~6-8 hours wall.
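The wall-clock figures in the backfill steps follow from simple rate arithmetic. The helper below is only an illustration of that calculation (`fetch_hours` is a hypothetical name), using the 23,211-row count quoted in Quick reference.

```python
def fetch_hours(n_requests: int, workers: int = 1, req_per_sec_per_worker: float = 1.0) -> float:
    """Hours needed to issue n_requests at a fixed per-worker request rate."""
    if workers < 1 or req_per_sec_per_worker <= 0:
        raise ValueError("workers and rate must be positive")
    return n_requests / (workers * req_per_sec_per_worker) / 3600


single = fetch_hours(23_211)               # one worker at 1 req/s
parallel = fetch_hours(23_211, workers=4)  # four workers at 1 req/s each
print(f"1 worker: {single:.2f} h, 4 workers: {parallel:.2f} h")
```

With four workers each pass (JSON, then HTML) comes to roughly 1.6 hours, which the steps round up to ~2 hours; the remainder of the 6-8 hour estimate is PDFs plus parsing and embedding.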
Daily incremental
- 04:00 UTC: pull `reports.csv` with `If-Modified-Since`. If 304, skip remainder. If 200, continue.
- Diff CSV against `crs_reports` rows: rows with `latestPubDate > our latest_publish_date` (new version).
- For each new version, fetch `reports/{number}.json`. Insert version row, update parent row.
- Fetch new HTML/PDF. SHA1-verify against the mirror's checksum.
- Re-embed chunks for affected reports.
State tracking
Per architecture: source key crs_reports. Run state stores last_csv_etag, last_csv_lastmodified, last successful parse timestamp.
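One possible shape for that run state, sketched as a dataclass with a JSON round-trip. The class name and the `last_parse_at` field name are hypothetical; only `last_csv_etag` and `last_csv_lastmodified` come from the description above, and how the state is actually persisted is left to the architecture.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class CrsRunState:
    last_csv_etag: Optional[str] = None
    last_csv_lastmodified: Optional[str] = None
    last_parse_at: Optional[str] = None  # ISO-8601 timestamp; hypothetical field name

    def to_json(self) -> str:
        """Serialize for storage under the crs_reports source key."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "CrsRunState":
        """Rebuild the state; unknown keys raise TypeError on purpose."""
        return cls(**json.loads(raw))
```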
Failure modes
- Mirror down: fall back to Congress.gov API. Daily cron logs a warning; multi-day outage triggers alert.
- CSV parse errors: ~2,994 rows in the current CSV have title commas/quotes (or empty/malformed prefixes). Don't fail the whole run; log per-row parse errors and continue.
- Report withdrawn: the mirror's `active` flag flips to false. Update `is_active` but keep all rows + body text (legal/historical research need).
- HTML rendition malformed: fall back to PDF text extraction (`pdftotext` / `pdfplumber`). Mark `parsed_at` regardless.
- Title-only references: older reports sometimes have no `summary` and no `topics`. Body is still searchable; UI surfaces "summary unavailable."
- Mirror SHA1 mismatch with download: retry once. If persistent, alert; the mirror may be mid-update.
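The SHA1-mismatch policy (retry once, then alert) can be sketched with the download injected as a callable, so the retry logic stays testable without network access. The names `fetch_verified` and `ChecksumMismatch` are hypothetical.

```python
import hashlib
from typing import Callable


class ChecksumMismatch(Exception):
    """Raised when every attempt disagrees with the mirror's SHA1."""


def sha1_hex(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()


def fetch_verified(fetch: Callable[[], bytes], expected_sha1: str, retries: int = 1) -> bytes:
    """Call fetch() up to retries + 1 times until the SHA1 matches."""
    expected = expected_sha1.strip().lower()
    actual = ""
    for _ in range(retries + 1):
        data = fetch()
        actual = sha1_hex(data)
        if actual == expected:
            return data
    # The caller decides how to alert; the mirror may be mid-update.
    raise ChecksumMismatch(f"expected {expected}, last saw {actual}")
```

The caller can catch `ChecksumMismatch`, log it as an alert, and leave the stored row untouched until the next run.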
Open questions
These don't block ingestion but should be resolved before this source is "shipped":
- Author resolution. CRS analyst names in `authors` are free-text. Many are public via the CRS analyst directory. Building a canonical analyst directory would let the agent answer "what else has Janet Doe written about export controls." v1.x.
- Topic taxonomy normalization. Congress.gov's `topics[]` strings (e.g. "Veterans Affairs", "International Trade") are CRS-internal categories that overlap with but don't equal the bill subject taxonomy in the topic-taxonomy source. Decide if we maintain a unified topic table or keep CRS topics in their own column. Lean toward source-specific to start; consolidate in v1.x.
- Citation extraction at parse time vs separate batch. CRS reports cite bills, USC, CFR, court cases, and other CRS reports densely. Running `eyecite` at parse time slows ingestion but populates the citation graph table eagerly. Lean toward parse-time for the citation graph (this is the substrate's load-bearing capability) but bench first.
- PDF as primary source for non-HTML reports. A small fraction of reports have HTML-malformed renditions. Default to HTML, fall back to PDF; document the fallback rate after backfill.
- Stable-ID collision with old report numbers. CRS occasionally retires a number range. Verify by running our Postgres unique constraint over the full backfill and alert on conflicts.
- Legal Sidebars (`LSB`) as a distinct subtype with different needs. They're shorter, more frequent, with different structural layout (more case citations). Worth a separate parser path for tighter quote-extraction.
Normalizer validation
Per the project's HTML→Markdown manual-validation gate, every HTML/XML→Markdown converter passes this stratified check before promotion. CRS uses html_v1 (markdownify-based, lossy-table detection, scripts/nav stripping).
Sample set — 12 reports, type-prefix-stratified
Selection rule: for each type, the most-recently-updated reports with >5 KB HTML. Captured as fixtures under josh-ingester/tests/fixtures/crs/<number>.html so the gate is reproducible.
| Type | Number | Title (excerpt) | HTML bytes |
|---|---|---|---|
| R | R45770 | The U.S. Election Assistance Commission (EAC) | 195,842 |
| R | R47021 | Federal Involvement in Ocean-Based Research and Development | 152,946 |
| R | R48936 | The Electoral College: Frequently Asked Questions | 52,831 |
| IF | IF11681 | Defense Primer: LGM-35A Sentinel ICBM | 39,084 |
| IF | IF13218 | Endangered Species Committee ("God Squad") Exemption | 25,950 |
| LSB | LSB11429 | Artificial Intelligence and the Fourth Amendment | 49,157 |
| LSB | LSB11428 | Vaccine Injury Compensation Program | 45,048 |
| IN | IN12690 | The FY2027 President's Budget: Revenues | 43,918 |
| RL | RL30787 | Parliamentary Reference Sources: House | 137,994 |
| RS | RS22331 | Foreign Holdings of Federal Debt | 105,295 |
| IG | IG10096 | US Service Academy Nominations: Timelines | 11,801 |
| TE | TE10122 | Mitigation and Multiple Loss Properties | 93,290 |
Validation run — 2026-05-08, normalizer html_v1 v1.0.0
All 12 samples processed through html_v1.to_markdown(). Metrics + criteria assessment below.
| # | Number | Ratio | Headings | MD-tables | Artifacts | Warnings |
|---|---|---|---|---|---|---|
| 1 | R45770 | 0.68 | 34 | 192 | 0 | (none) |
| 2 | R47021 | 0.72 | 24 | 162 | 0 | (none) |
| 3 | R48936 | 0.75 | 25 | 57 | 0 | (none) |
| 4 | IF11681 | 0.55 | 10 | 5 | 0 | (none) |
| 5 | IF13218 | 0.66 | 5 | 0 | 0 | (none) |
| 6 | LSB11429 | 0.76 | 6 | 6 | 0 | (none) |
| 7 | LSB11428 | 0.77 | 6 | 0 | 0 | (none) |
| 8 | IN12690 | 0.64 | 4 | 3 | 0 | (none) |
| 9 | RL30787 | 0.59 | 41 | 93 | 0 | (none) |
| 10 | RS22331 | 0.67 | 9 | 55 | 1 | "table 2: rendered as raw HTML" |
| 11 | IG10096 | 0.34 | 2 | 0 | 0 | (none) — at lower bound; expected for image-heavy infographic |
| 12 | TE10122 | 0.77 | 19 | 119 | 0 | (none) |
Per-criterion result
| Criterion | Pass bar | Status |
|---|---|---|
| Heading hierarchy | All <h1>–<h4> map to #–#### in correct nesting order | ✅ all 12 |
| Section ordering | First-paragraph-of-section text in source order | ✅ all 12 (verified via spot-check) |
| Footnotes | Markers preserved; round-trip without manual hunting | ✅ where present (R45770: 12, R47021: 2; most CRS uses parenthetical citations rather than footnotes) |
| Tables | Every <table> becomes Markdown table OR Artifact reference | ✅ all 12 — 9 produce Markdown tables; RS22331 caught 1 lossy table as Artifact(kind='table') with surfaced warning |
| Block quotes | <blockquote> becomes > | ✅ verified in spot-check (LSB11429 has indented statutory excerpts) |
| Citations | "42 U.S.C. § 1396a"-style preserved verbatim, § not Unicode-mangled | ✅ all 12 (R45770: 59 U.S.C. refs + 61 § symbols, all intact; Pub.L. abbreviations preserved across reports) |
| Links | Every <a href> becomes [text](url); absolute URLs via source_hint resolution | ✅ all 12 (counts range 13–275 per report) |
| Lists | <ol>/<ul> preserved with correct nesting; numbering preserved | ✅ verified in spot-check |
| Emphasis | <em>/<strong> map to */** | ✅ verified in spot-check |
| Images / figures | <img> / <figure> produce Markdown link with alt text or Artifact | ✅ verified (IG10096 has ![]() for empty-alt images; non-trivial captions preserved as paragraphs) |
| No HTML tag bleed | Zero raw <div>, <span> (tables exempt where lossy) | ✅ all 12: grep -e '<div' -e '<span' → 0 hits across all outputs |
| Body length sanity | len(markdown) / len(html) ∈ [0.3, 0.8] | ✅ all 12 (range 0.34–0.77) |
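Two of the criteria in the table are mechanical enough to sketch as checks. This is an illustration under assumptions, not the gate's actual code: the function names are hypothetical, and lengths are measured in characters since the source does not say whether bytes or characters are used.

```python
import re

_TAG_BLEED_RE = re.compile(r"<(div|span)\b", re.IGNORECASE)


def passes_length_sanity(markdown: str, html: str, low: float = 0.3, high: float = 0.8) -> bool:
    """True when len(markdown) / len(html) falls inside [low, high]."""
    if not html:
        return False
    return low <= len(markdown) / len(html) <= high


def has_tag_bleed(markdown: str) -> bool:
    """True when a raw <div> or <span> opening tag survives in the output."""
    return _TAG_BLEED_RE.search(markdown) is not None
```

A ratio of 0.34, as seen on the infographic sample, passes; anything below 0.3 or above 0.8 would be flagged for manual review.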
Promotion decision
html_v1 v1.0.0 PROMOTED for use as the CRS HTML normalizer. All 12 samples pass all criteria. Lossy-table detection works (RS22331). Length-ratio sanity holds even at the IG infographic edge (0.34, just above the 0.3 floor).
Known cosmetic issue (CRS-specific, deferred)
The EveryCRSReport.com page wrapper produces a duplicate H1 on most reports: once as the page title (# {title} - EveryCRSReport.com) and once as the actual report H1 (# {title}). This is upstream wrapper noise, not a normalizer defect. Address in the CRS source module's parse stage (Phase 4) by trimming the wrapper title or extracting only <main> / <article> content. Not a promotion blocker.
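One way the wrapper-title trim could look, operating on the converted Markdown. This is a sketch of the first option only (trimming the wrapper title); `drop_wrapper_h1` is a hypothetical name, and the function does not try to tell headings apart from `# ` lines inside code fences.

```python
WRAPPER_SUFFIX = " - EveryCRSReport.com"


def drop_wrapper_h1(markdown: str) -> str:
    """Remove the page-title H1 when it duplicates the report's own H1."""
    lines = markdown.splitlines()
    h1_positions = [i for i, line in enumerate(lines) if line.startswith("# ")]
    if len(h1_positions) >= 2:
        first = lines[h1_positions[0]].rstrip()
        second = lines[h1_positions[1]].rstrip()
        # Only drop the first H1 when it is exactly "<second H1> - EveryCRSReport.com".
        if first.endswith(WRAPPER_SUFFIX) and first[: -len(WRAPPER_SUFFIX)] == second:
            del lines[h1_positions[0]]
    return "\n".join(lines).lstrip("\n")
```

Requiring an exact match against the second H1 keeps the trim from touching reports whose real title happens to mention the site name.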
Re-run protocol
If html_v1 ships a new version:
- Run `josh-ingester/tests/fixtures/crs/` through the new converter.
- Re-validate against this 12-sample set.
- Update this section with the new run's metrics + per-criterion results.
- Bump `body_normalizer_version` in `crs_reports`; the substrate's re-conversion path picks up changed rows.