CRS reports ingester
Header
Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.
Why
Congressional Research Service reports — Congress's in-house policy analysts.
EveryCRSReport.com mirror is the cleanest of the v1 sources (CSV index +
per-report JSON + HTML, conditional GET, daily refresh) so it served as the
test case that validated the body-normalization + citation-metadata + FTS5 +
sqlite-vec pipeline end to end before bills.
Code is intact (ingester module, parser, migration, tests, integration
pipeline). The 23,092-report production load that previously satisfied
this spec lived on the destroyed DigitalOcean droplet (2026-05-10) and
needs to be re-run against the OVHcloud substrate once it's provisioned.
Re-backfill is hours, not days — the EveryCRSReport.com CSV is the
authoritative state, and the parser is idempotent.
User stories
As a policy analyst using my own AI agent, I want to ask the agent for the latest CRS reports on a given topic so that I get authoritative non-partisan analysis with citations.
As an OSS self-hoster, I want to run `josh ingest crs-reports` on a fresh substrate so that I can populate the full ~23K-report mirror on my own instance.
As a downstream agent doing FTS5 + vector retrieval, I want CRS report bodies chunked, embedded, and citation-stamped so that hybrid search returns CRS chunks alongside other source chunks.
Acceptance criteria (EARS)
- When `josh ingest crs-reports` is run on the deployed substrate, the system shall populate `crs_reports` rows whose six citation columns (`source_url`, `source_org`, `published_at`, `retrieved_at`, `raw_sha256`, `citation_string`) are all non-null.
- When a CRS report HTML body is fetched, the system shall write the raw payload under `/data/corpus/crs_reports/bodies/raw/` and write the `html_v1`-normalized Markdown into the row's `body_text` column (with `body_normalizer_version` and `body_text_sha256` populated).
- While the ingester is running, if a single CSV row or report fails to parse, then the system shall log the failure to `ingestion_logs` and continue with the next record.
- Where a per-report JSON carries a `topics` array, the system shall populate the `crs_report_topics` junction table for that report. (The EveryCRSReport.com per-report JSON for this corpus carries only `topics`/`versions` — no `authors`, `relatedBills`, or `bills` keys — so `crs_report_authors` and `crs_report_related_bills` receive zero rows; see `clarifications_needed`.)
- When an FTS5 query is executed against `crs_reports_fts`, the system shall return BM25-ranked results. (Per-column weight tuple — provisionally `(10.0, 4.0, 2.0, 1.0)` for `(title, abstract, action, body)` — is applied at query time by the search wrapper that ships with `rest-api-search`; criterion to be tightened then.)
- When a report body is loaded, the system shall enqueue rows in `ingestion_embedding_queue` for chunk-level embedding into `crs_report_chunks_vec0`.
- When an incremental run completes, the system shall update `ingestion_source_state.last_seen_key` to the max `latestPubDate` seen so the next run only fetches newer reports.
Success determiner
DB
Query
SELECT COUNT(*) FROM crs_reports
WHERE source_url IS NOT NULL
AND source_org IS NOT NULL
AND published_at IS NOT NULL
AND retrieved_at IS NOT NULL
AND raw_sha256 IS NOT NULL
AND citation_string IS NOT NULL;
Compare (machine-checked)
{
"op": "gte",
"value": 22000
}
Expect (prose)
Production count was 23,092 fully-citation-stamped reports as of 2026-05-08 (Phase 6 backfill on the now-destroyed DO droplet). The 22,000 floor leaves headroom for EveryCRSReport.com mirror churn without false-failing the determiner. **Determiner currently fails** against an empty substrate — re-run after the OVHcloud provision + backfill (per `substrate-bare-metal-host` t9) to re-flip the spec verified → shipped. Of the six citation columns, only published_at and raw_sha256 are nullable; the other four are NOT NULL-enforced in the schema, so the WHERE filter mechanically verifies completeness of those two columns (the other four cannot be null by construction).
Clarifications needed
- spec↔code mismatch (authors / related bills are dead extraction): the parser reads `meta.get("authors", [])` and `meta.get("relatedBills", []) or meta.get("bills", [])` from the per-report JSON, but the EveryCRSReport.com per-report JSON for this corpus has no such keys — observed top-level keys on R44889.json (and the R48936 / IF11681 / LSB11429 fixtures) are exactly [active, id, number, source, topics, type, typeId, versions]. Because of the `.get(..., [])` defaults the loops iterate empty lists, so `crs_report_authors` and `crs_report_related_bills` receive ZERO rows for every report — a silent no-op, not nulls and not a crash. Fix (code, not here): either drop the authors/relatedBills/bills child- building and the two tables from `_CHILD_TABLES`, or source authors elsewhere (they sometimes appear inside version `summary`/HTML, not as a JSON key). `topics` and `versions` do exist and populate correctly.
Out of scope
- State-level analogs.
- Real-time webhooks / push subscriptions (we poll the CSV daily).
Dependencies
Plan
Source module follows the standard Source protocol:josh-ingester/ingester/sources/crs_reports.py.
- Discover — GET https://www.everycrsreport.com/reports.csv with
If-Modified-Since. csv.DictReader (RFC 4180-aware) yields one
FetchTask per row. State watermark is the max latestPubDate seen.
- Fetch — per report: /reports/{number}.json (metadata, version
history) + /reports/{number}.html (body wrapper). SHA1s in the JSON
let us short-circuit unchanged bodies.
- Parse — html_v1 normalizer strips the wrapper page-title H1 and
converts to Markdown. Only topics (and versions) come out of the
per-report JSON; the EveryCRSReport.com JSON for this corpus carries no
authors, relatedBills, or bills keys, so author / related-bill
extraction yields zero rows (see clarifications_needed).
- Load — crs_reports + 4 junction tables + chunked body into
crs_report_chunks + enqueue chunks for embedding into
crs_report_chunks_vec0.
- Citation — shared/josh_substrate/citations/formatters/crs.py
builds the human-readable citation_string (CRS-style: "CRS Report
R12345, Title (May 8, 2026)").
- Schema — 0002_crs.py migration: crs_reports,
crs_report_versions, 3 junction tables, crs_report_chunks,
crs_reports_fts (FTS5 + 3 sync triggers), crs_report_chunks_vec0
(vec0 virtual table).
- Defensive parsing — 2,994 CSV rows have title commas/quotes;
per-row exceptions are logged and skipped, not aborted.
- Cron — daily 07:30 ET (CSV refreshed ~07:00 UTC).
Tasks
10 of 14 done.
- t1 Alembic migration written + applied locally (production re-apply pending OVHcloud provision)
- t2 Pydantic models in shared/josh_substrate/src/josh_substrate/models/crs.py
- t3 Source module implementing the `Source` protocol
- t4 Body normalization wired (raw + markdown paths, html_v1 unit-tested)
- t5 Citation metadata: all 6 columns populated; citation_for() formatter + unit tests
- t6 FTS5 virtual table + 3 sync triggers wired into migration
- t7 Vector vec0 table wired (crs_report_chunks_vec0)
- t8 Unit test for parser against captured fixtures
- t9 Integration test for the full discover→fetch→parse→load pipeline
- t10 Minimum-viable backfill complete on the OVHcloud substrate (~23K reports — code-equivalent of the prior 2026-05-08 DO load)
- t11 Incremental cron registered (daily 07:30 ET) on the OVHcloud host and observed firing
- t12 No recurring errors in `ingestion_logs` for the last completed run on the new host
- t13 Source doc updated with as-built notes
- t14 Inventory rows in `https://docs.usejosh.com/josh-data-sources/` + `https://docs.usejosh.com/data-status/` re-flipped to ingestion-built / shipped after the OVHcloud backfill lands
Changelog
-
2026-05-10T18:00:00Z
shipped→verifiedDigitalOcean droplet destroyed 2026-05-10 — the production backfill (23,092 reports as of 2026-05-08) is gone with the volume. Code (ingester module, parser, migration, tests, integration pipeline, citation formatter, FTS5/vec0 wiring) all intact in the repo. Status flipped shipped → verified: the success_determiner ran green against real data once and the code's contract still holds; the production data load needs to be re-run against the OVHcloud substrate (per `substrate-bare-metal-host` t9). Tasks t10-t12 + t14 reset to false to track the re-load. -
2026-05-09T13:00:00Z
in_progress→shipped23d1bcb Backfill landed 2026-05-08 (commit 23d1bcb): 23,092 reports + bodies + chunks + vectors + citations. Reliability follow-ups (commit 3121911) cleaned up SQLITE_BUSY, WAL checkpointer, and the type_id bug. Spec backfilled with real acceptance criteria + sql success_determiner; removing the in_progress + stub-criteria drift.