eCFR and CFR ingester

Header

Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.

Why

The Code of Federal Regulations (CFR) is the codified body of regulations
issued by federal agencies — the regulatory analog to the US Code's
statutes. Two sources, one substrate: eCFR (current state, daily
updated, clean REST API) for "what is 40 CFR 60 today" queries, and
CFR Annual (yearly snapshots from GovInfo) for citation-grade
date-stamped historical reference. CFR is the regulatory citation target:
every Federal Register rule cites CFR sections; bills cite CFR sections;
CRS reports cite CFR sections. Closing the citation graph
(bill → law → USC + Federal Register rule → CFR) needs this source.

User stories

As a policy analyst using my own AI agent, I want to ask "what does 40 CFR 60 say about NSPS limits today" and get the live text so that I'm working from the current rule, not a stale copy.

As a litigator's AI agent, I want to ask "what was 17 CFR 240 as of January 1, 2020" so that I have the citation-grade date-stamped historical version for legal context.

As an OSS self-hoster, I want to run `josh ingest ecfr-and-cfr` on a fresh substrate so that I can populate the regulatory text from public APIs without orchestrating multiple sources myself.

As the federal-register-ingester (downstream FK consumer), I want to soft-FK FR rule citations (`§ 40.60.1.1`) to a real `cfr_sections.id` so that agents can pivot from "this rule" to "the CFR section it amends".

Acceptance criteria (EARS)

When `josh ingest ecfr-and-cfr` is run on the deployed substrate, the system shall populate `cfr_titles`, `cfr_parts`, and `cfr_sections` rows from the eCFR REST API for all 50 titles, with all six citation columns populated on every section.
When a section's body is loaded, the system shall write the raw XML under `/data/corpus/ecfr-and-cfr/bodies/raw/` and a normalized Markdown body under `/data/corpus/ecfr-and-cfr/bodies/markdown/`.
While the ingester is running, if a single title fails to parse, then the system shall log the failure to `ingestion_logs` (with `(title_number, source)`) and continue with the next title.
Where the eCFR API exposes per-section version history (`/versions/title-{N}.json`), the system shall populate `cfr_section_versions` with one row per version (amended_on date + amending FR citation).
Where a CFR Annual snapshot is loaded, the system shall populate `cfr_annual_snapshots` with one row per `(title, year)` and link sections via `cfr_annual_section_text` keyed on `(title, year, part, section)`.
When an FTS5 query is executed against `cfr_sections_fts`, the system shall return BM25-ranked results.
Where a section body exceeds ~5K tokens, the system shall enqueue rows in `chunk_embedding_jobs` for chunk-level embedding into `cfr_section_chunks_vec0`.
When an incremental run completes, the system shall update `ingestion_source_state.last_seen_key` per-title using the eCFR API's `up_to_date_as_of` field so the next run only re-fetches titles whose date has advanced.

Success determiner

kindbash

Command

set -euo pipefail
ssh josh 'docker exec josh-ingester josh ingest ecfr-and-cfr --titles 40 --max-parts 10'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
  SELECT
    (SELECT COUNT(*) FROM cfr_titles) AS titles,
    (SELECT COUNT(*) FROM cfr_sections WHERE title_number=40) AS sections_t40,
    (SELECT COUNT(*) FROM cfr_sections WHERE source_url IS NOT NULL
      AND source_org IS NOT NULL AND retrieved_at IS NOT NULL
      AND raw_sha256 IS NOT NULL AND citation_string IS NOT NULL) AS fully_cited;"'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db \
  "SELECT COUNT(*) FROM cfr_sections_fts WHERE cfr_sections_fts MATCH '\''pollution'\'';"'

Expect

>= 1 title, >= 100 sections in title 40 partial backfill, fully_cited == sections_t40, FTS5 match >= 1.

Smoke run only — Title 40 (EPA) is large (~1500 sections) but the `--max-parts 10` cap keeps the smoke under a few minutes. Full backfill of all 50 titles runs detached on the server per `add-a-new-source.html`; most titles are 5-50 MB of XML each.

Clarifications needed

Full historical CFR Annual coverage: 1996-present (~30 years × 50 titles = 1500 snapshots, ~15-25 GB) vs last 5 years (~250 snapshots, ~3-5 GB) for v1 launch. Lean: last 5 years for v1; full historical is a v1.x backfill.
Per-section version history granularity: ingest the full version history (~17,800 versions for Title 42 alone, ~hundreds of thousands across all titles) or just the latest amended_on per section? Lean: full — version history is the citation graph (which FR rule amended which section when).
eCFR daily refresh strategy: per-title diff via `up_to_date_as_of` (cheap) vs full re-fetch on any change (expensive but simple). Lean: per-title diff — ~50 cheap GETs per day vs ~3-5 GB of redundant XML.
Sub-section granularity: model `§ 40.60.1.1(a)(1)(ii)` as a separate row in `cfr_section_subsections` or just store as text under the parent section? Lean: text under parent for v1; subsection-level rows is a v1.x improvement (helps very-precise citation linkage).
Cross-source FR `<CITA>` parsing: extract `[37 FR 23603, Nov. 4, 1972, as amended at 50 FR 12466, Mar. 28, 1985]` into `cfr_section_versions.amending_fr_citation` rows during eCFR load? Or wait for a separate citation-extraction pass? Lean: do it during load — `eyecite` library or custom regex.

Out of scope

State regulatory codes (CCR / NYCRR / etc.) — Phase 6+.
Pre-1996 historical CFR (paper-only / scanned editions) — out of scope for v1.
Cornell LII / other CFR mirrors — derivative of eCFR + GPO; direct is better.
Real-time webhooks / push subscriptions — surface-layer concern.
Authority-line ("Authority: 44 U.S.C. 1506") graph extraction beyond raw text — separate citation-graph spec.

Dependencies

Plan

Source module follows the standard Source protocol. Per-source spec at
https://docs.usejosh.com/sources/ecfr-and-cfr/.

Two sources, layered:
- Layer 1 (current state): eCFR REST API at
https://www.ecfr.gov/api/versioner/v1/. Daily-current, well-versioned,
no auth. JSON for structure, XML for body content.
- Layer 2 (historical): GovInfo CFR Annual at
https://www.govinfo.gov/content/pkg/CFR-{YYYY}-title{N}-vol{V}/.
Yearly snapshots, citation-grade for legal use.

Discover/fetch/parse/load shape:
- discover (eCFR): GET /api/versioner/v1/titles.json for all 50
titles with up_to_date_as_of. Compare to per-title watermark in
ingestion_source_state.last_seen_key='ecfr-title-{N}-uptodate'.
- discover (CFR Annual): https://www.govinfo.gov/sitemap/CFR_{YYYY}_sitemap.xml
per year of interest. ~244 entries per year sitemap.
- fetch (eCFR): per-title GET /full/{date}/title-{N}.xml to
/data/corpus/ecfr-and-cfr/bodies/raw/title-{N}.xml. Plus
GET /versions/title-{N}.json for the version history. SHA256 the
title XML for raw_sha256.
- fetch (CFR Annual): per-volume XML from GovInfo bulk paths.
- parse: ecfr_xml_v1 walker over the DIV-numbered hierarchy
(DIV1=title, DIV5=part, DIV8=section). Builds parent rows for
titles + parts; section rows with full body text → Markdown to
bodies/markdown/. Authority line (<AUTH>) and citation block
(<CITA>) extracted into structured columns. Version history JSON
parses into cfr_section_versions.
- load: idempotent upsert on (title_number, part_number, section_number)
for sections; on (title_number, year, volume) for annual snapshots.
FTS5 sync triggers fire on insert/update/delete. Body chunks > ~5K
tokens enqueued for embedding.

Migration shape: cfr_titles, cfr_parts, cfr_sections,
cfr_section_versions (the citation-graph backbone),
cfr_annual_snapshots, cfr_annual_section_text, cfr_sections_fts
virtual + 3 sync triggers, cfr_section_chunks + cfr_section_chunks_vec0
(vec0 with float[1024] and bit[1024]).

Schedule: daily 04:30 UTC for eCFR (catches overnight Federal Register
publication propagation). CFR Annual sitemap polled monthly for new
yearly volumes (publication is annual + slow).

Tasks

0 of 16 done.

Changelog

No history yet.