CRS reports ingester

Header

Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.

Why

Congressional Research Service reports — Congress's in-house policy analysts.
EveryCRSReport.com mirror is the cleanest of the v1 sources (CSV index +
per-report JSON + HTML, conditional GET, daily refresh) so it served as the
test case that validated the body-normalization + citation-metadata + FTS5 +
sqlite-vec pipeline end to end before bills.

Code is intact (ingester module, parser, migration, tests, integration
pipeline). The 23,092-report production load that previously satisfied
this spec lived on the destroyed DigitalOcean droplet (2026-05-10) and
needs to be re-run against the OVHcloud substrate once it's provisioned.
Re-backfill is hours, not days — the EveryCRSReport.com CSV is the
authoritative state, and the parser is idempotent.

User stories

As a policy analyst using my own AI agent, I want to ask the agent for the latest CRS reports on a given topic so that I get authoritative non-partisan analysis with citations.

As an OSS self-hoster, I want to run `josh ingest crs-reports` on a fresh substrate so that I can populate the full ~23K-report mirror on my own instance.

As a downstream agent doing FTS5 + vector retrieval, I want CRS report bodies chunked, embedded, and citation-stamped so that hybrid search returns CRS chunks alongside other source chunks.

Acceptance criteria (EARS)

When `josh ingest crs-reports` is run on the deployed substrate, the system shall populate `crs_reports` rows whose six citation columns (`source_url`, `source_org`, `published_at`, `retrieved_at`, `raw_sha256`, `citation_string`) are all non-null.
When a CRS report HTML body is fetched, the system shall write the raw payload under `/data/corpus/crs_reports/bodies/raw/` and write the `html_v1`-normalized Markdown into the row's `body_text` column (with `body_normalizer_version` and `body_text_sha256` populated).
While the ingester is running, if a single CSV row or report fails to parse, then the system shall log the failure to `ingestion_logs` and continue with the next record.
Where a per-report JSON carries a `topics` array, the system shall populate the `crs_report_topics` junction table for that report. (The EveryCRSReport.com per-report JSON for this corpus carries only `topics`/`versions` — no `authors`, `relatedBills`, or `bills` keys — so `crs_report_authors` and `crs_report_related_bills` receive zero rows; see `clarifications_needed`.)
When an FTS5 query is executed against `crs_reports_fts`, the system shall return BM25-ranked results. (Per-column weight tuple — provisionally `(10.0, 4.0, 2.0, 1.0)` for `(title, abstract, action, body)` — is applied at query time by the search wrapper that ships with `rest-api-search`; criterion to be tightened then.)
When a report body is loaded, the system shall enqueue rows in `ingestion_embedding_queue` for chunk-level embedding into `crs_report_chunks_vec0`.
When an incremental run completes, the system shall update `ingestion_source_state.last_seen_key` to the max `latestPubDate` seen so the next run only fetches newer reports.

Success determiner

kindsql

DB

substrate

Query

SELECT COUNT(*) FROM crs_reports
WHERE source_url IS NOT NULL
  AND source_org IS NOT NULL
  AND published_at IS NOT NULL
  AND retrieved_at IS NOT NULL
  AND raw_sha256 IS NOT NULL
  AND citation_string IS NOT NULL;

Compare (machine-checked)

{
  "op": "gte",
  "value": 22000
}

Expect (prose)

Single COUNT >= 22000 (post-117th-Congress CRS report population on the production substrate).

Production count was 23,092 fully-citation-stamped reports as of 2026-05-08 (Phase 6 backfill on the now-destroyed DO droplet). The 22,000 floor leaves headroom for EveryCRSReport.com mirror churn without false-failing the determiner. **Determiner currently fails** against an empty substrate — re-run after the OVHcloud provision + backfill (per `substrate-bare-metal-host` t9) to re-flip the spec verified → shipped. Of the six citation columns, only published_at and raw_sha256 are nullable; the other four are NOT NULL-enforced in the schema, so the WHERE filter mechanically verifies completeness of those two columns (the other four cannot be null by construction).

Clarifications needed

spec↔code mismatch (authors / related bills are dead extraction): the parser reads `meta.get("authors", [])` and `meta.get("relatedBills", []) or meta.get("bills", [])` from the per-report JSON, but the EveryCRSReport.com per-report JSON for this corpus has no such keys — observed top-level keys on R44889.json (and the R48936 / IF11681 / LSB11429 fixtures) are exactly [active, id, number, source, topics, type, typeId, versions]. Because of the `.get(..., [])` defaults the loops iterate empty lists, so `crs_report_authors` and `crs_report_related_bills` receive ZERO rows for every report — a silent no-op, not nulls and not a crash. Fix (code, not here): either drop the authors/relatedBills/bills child- building and the two tables from `_CHILD_TABLES`, or source authors elsewhere (they sometimes appear inside version `summary`/HTML, not as a JSON key). `topics` and `versions` do exist and populate correctly.

Out of scope

State-level analogs.
Real-time webhooks / push subscriptions (we poll the CSV daily).

Dependencies

Plan

Source module follows the standard Source protocol:
josh-ingester/ingester/sources/crs_reports.py.

- Discover — GET https://www.everycrsreport.com/reports.csv with
If-Modified-Since. csv.DictReader (RFC 4180-aware) yields one
FetchTask per row. State watermark is the max latestPubDate seen.
- Fetch — per report: /reports/{number}.json (metadata, version
history) + /reports/{number}.html (body wrapper). SHA1s in the JSON
let us short-circuit unchanged bodies.
- Parse — html_v1 normalizer strips the wrapper page-title H1 and
converts to Markdown. Only topics (and versions) come out of the
per-report JSON; the EveryCRSReport.com JSON for this corpus carries no
authors, relatedBills, or bills keys, so author / related-bill
extraction yields zero rows (see clarifications_needed).
- Load — crs_reports + 4 junction tables + chunked body into
crs_report_chunks + enqueue chunks for embedding into
crs_report_chunks_vec0.
- Citation — shared/josh_substrate/citations/formatters/crs.py
builds the human-readable citation_string (CRS-style: "CRS Report
R12345, Title (May 8, 2026)").
- Schema — 0002_crs.py migration: crs_reports,
crs_report_versions, 3 junction tables, crs_report_chunks,
crs_reports_fts (FTS5 + 3 sync triggers), crs_report_chunks_vec0
(vec0 virtual table).
- Defensive parsing — 2,994 CSV rows have title commas/quotes;
per-row exceptions are logged and skipped, not aborted.
- Cron — daily 07:30 ET (CSV refreshed ~07:00 UTC).

Tasks

10 of 14 done.

Changelog

2026-05-10T18:00:00Z shipped→verified DigitalOcean droplet destroyed 2026-05-10 — the production backfill (23,092 reports as of 2026-05-08) is gone with the volume. Code (ingester module, parser, migration, tests, integration pipeline, citation formatter, FTS5/vec0 wiring) all intact in the repo. Status flipped shipped → verified: the success_determiner ran green against real data once and the code's contract still holds; the production data load needs to be re-run against the OVHcloud substrate (per `substrate-bare-metal-host` t9). Tasks t10-t12 + t14 reset to false to track the re-load.
2026-05-09T13:00:00Z in_progress→shipped 23d1bcb Backfill landed 2026-05-08 (commit 23d1bcb): 23,092 reports + bodies + chunks + vectors + citations. Reliability follow-ups (commit 3121911) cleaned up SQLITE_BUSY, WAL checkpointer, and the type_id bug. Spec backfilled with real acceptance criteria + sql success_determiner; removing the in_progress + stub-criteria drift.