source · planned · p0

eCFR and CFR ingester

ecfr-and-cfr-ingester · updated 2026-05-10T18:00:00Z

The Code of Federal Regulations (CFR) is the codified body of regulations
issued by federal agencies; it is the regulatory analog to the US Code's
statutes. Two sources feed one substrate: eCFR (current state, updated
daily, clean REST API) for "what is 40 CFR 60 today" queries, and
CFR Annual (yearly snapshots from GovInfo) for citation-grade,
date-stamped historical reference. CFR is the regulatory citation target:
Federal Register rules, bills, and CRS reports all cite CFR sections.
Closing the citation graph
(bill → law → USC + Federal Register rule → CFR) needs this source.

As a policy analyst using my own AI agent, I want to ask "what does 40 CFR 60 say about NSPS limits today" and get the live text so that I'm working from the current rule, not a stale copy.

As a litigator's AI agent, I want to ask "what was 17 CFR 240 as of January 1, 2020" so that I have the citation-grade date-stamped historical version for legal context.

As an OSS self-hoster, I want to run `josh ingest ecfr-and-cfr` on a fresh substrate so that I can populate the regulatory text from public APIs without orchestrating multiple sources myself.

As the federal-register-ingester (downstream FK consumer), I want to soft-FK FR rule citations (`§ 40.60.1.1`) to a real `cfr_sections.id` so that agents can pivot from "this rule" to "the CFR section it amends".

  1. When `josh ingest ecfr-and-cfr` is run on the deployed substrate, the system shall populate `cfr_titles`, `cfr_parts`, and `cfr_sections` rows from the eCFR REST API for all 50 titles, with all six citation columns populated on every section.
  2. When a section's body is loaded, the system shall write the raw XML under `/data/corpus/ecfr-and-cfr/bodies/raw/` and a normalized Markdown body under `/data/corpus/ecfr-and-cfr/bodies/markdown/`.
  3. While the ingester is running, if a single title fails to parse, then the system shall log the failure to `ingestion_logs` (with `(title_number, source)`) and continue with the next title.
  4. Where the eCFR API exposes per-section version history (`/versions/title-{N}.json`), the system shall populate `cfr_section_versions` with one row per version (amended_on date + amending FR citation).
  5. Where a CFR Annual snapshot is loaded, the system shall populate `cfr_annual_snapshots` with one row per `(title, year)` and link sections via `cfr_annual_section_text` keyed on `(title, year, part, section)`.
  6. When an FTS5 query is executed against `cfr_sections_fts`, the system shall return BM25-ranked results.
  7. Where a section body exceeds ~5K tokens, the system shall enqueue rows in `chunk_embedding_jobs` for chunk-level embedding into `cfr_section_chunks_vec0`.
  8. When an incremental run completes, the system shall update `ingestion_source_state.last_seen_key` per-title using the eCFR API's `up_to_date_as_of` field so the next run only re-fetches titles whose date has advanced.
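
Criterion 8 comes down to a per-title date comparison. The following is a minimal Python sketch, not the source module's code. It assumes the `titles.json` payload is a `titles` array whose entries carry `number` and `up_to_date_as_of`, and that watermarks are available as a mapping from the `ecfr-title-{N}-uptodate` key to the last-seen ISO date. The function name and both shapes are illustrative.

```python
from datetime import date
from typing import Dict, List


def titles_needing_refresh(titles_payload: dict, watermarks: Dict[str, str]) -> List[int]:
    """Return title numbers whose up_to_date_as_of is newer than the stored watermark.

    Assumed shapes (hypothetical, for illustration only):
      titles_payload = {"titles": [{"number": 40, "up_to_date_as_of": "2026-05-09"}, ...]}
      watermarks     = {"ecfr-title-40-uptodate": "2026-05-08", ...}
    """
    stale = []
    for entry in titles_payload.get("titles", []):
        as_of = entry.get("up_to_date_as_of")
        if as_of is None:
            # No date reported (for example a reserved title): nothing to fetch.
            continue
        seen = watermarks.get(f"ecfr-title-{entry['number']}-uptodate")
        # Never-seen titles are always fetched; otherwise fetch only if the date advanced.
        if seen is None or date.fromisoformat(as_of) > date.fromisoformat(seen):
            stale.append(entry["number"])
    return stale
```

Only the titles returned here would go on to the fetch step. The watermark would be written back after a title loads cleanly, so a failed title is retried on the next run.
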
kind: bash

Command

set -euo pipefail
ssh josh 'docker exec josh-ingester josh ingest ecfr-and-cfr --titles 40 --max-parts 10'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
  SELECT
    (SELECT COUNT(*) FROM cfr_titles) AS titles,
    (SELECT COUNT(*) FROM cfr_sections WHERE title_number=40) AS sections_t40,
    (SELECT COUNT(*) FROM cfr_sections WHERE title_number=40
      AND source_url IS NOT NULL
      AND source_org IS NOT NULL AND retrieved_at IS NOT NULL
      AND raw_sha256 IS NOT NULL AND citation_string IS NOT NULL) AS fully_cited;"'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db \
  "SELECT COUNT(*) FROM cfr_sections_fts WHERE cfr_sections_fts MATCH '\''pollution'\'';"'

Expect

>= 1 title, >= 100 sections in title 40 partial backfill, fully_cited == sections_t40, FTS5 match >= 1.

Smoke run only — Title 40 (EPA) is large (~1500 sections) but the `--max-parts 10` cap keeps the smoke under a few minutes. Full backfill of all 50 titles runs detached on the server per `add-a-new-source.html`; most titles are 5-50 MB of XML each.

  • Full historical CFR Annual coverage: 1996-present (~30 years × 50 titles = 1500 snapshots, ~15-25 GB) vs last 5 years (~250 snapshots, ~3-5 GB) for v1 launch. Lean: last 5 years for v1; full historical is a v1.x backfill.
  • Per-section version history granularity: ingest the full version history (~17,800 versions for Title 42 alone, ~hundreds of thousands across all titles) or just the latest amended_on per section? Lean: full — version history is the citation graph (which FR rule amended which section when).
  • eCFR daily refresh strategy: per-title diff via `up_to_date_as_of` (cheap) vs full re-fetch on any change (expensive but simple). Lean: per-title diff — ~50 cheap GETs per day vs ~3-5 GB of redundant XML.
  • Sub-section granularity: model `§ 40.60.1.1(a)(1)(ii)` as a separate row in `cfr_section_subsections` or just store as text under the parent section? Lean: text under parent for v1; subsection-level rows are a v1.x improvement (they help with very precise citation linkage).
  • Cross-source FR `<CITA>` parsing: extract `[37 FR 23603, Nov. 4, 1972, as amended at 50 FR 12466, Mar. 28, 1985]` into `cfr_section_versions.amending_fr_citation` rows during eCFR load? Or wait for a separate citation-extraction pass? Lean: do it during load — `eyecite` library or custom regex.
  • State regulatory codes (CCR / NYCRR / etc.) — Phase 6+.
  • Pre-1996 historical CFR (paper-only / scanned editions) — out of scope for v1.
  • Cornell LII / other CFR mirrors — derivative of eCFR + GPO; direct is better.
  • Real-time webhooks / push subscriptions — surface-layer concern.
  • Authority-line ("Authority: 44 U.S.C. 1506") graph extraction beyond raw text — separate citation-graph spec.
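
For the `<CITA>` open question above, the "custom regex" option could look roughly like the sketch below. It covers only the `volume FR page, Mon. D, YYYY` pattern shown in the example string. The `FrCitation` type and `parse_cita` name are hypothetical, and real citation blocks contain variants (page ranges, "redesignated at" clauses) that this does not handle.

```python
import re
from datetime import date
from typing import List, NamedTuple

# Month lookup keyed on the first three letters, so "Nov.", "Sept." and "June" all resolve.
_MONTHS = {name: i for i, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}

# volume, "FR", page, month word (optional trailing dot), day, year
_FR_CITE = re.compile(r"(\d+)\s+FR\s+(\d+),\s+([A-Za-z]+)\.?\s+(\d{1,2}),\s+(\d{4})")


class FrCitation(NamedTuple):  # hypothetical row shape, not the real schema
    volume: int
    page: int
    published_on: date


def parse_cita(text: str) -> List[FrCitation]:
    """Extract Federal Register citations from a CITA block, in document order."""
    found = []
    for vol, page, month, day, year in _FR_CITE.findall(text):
        month_num = _MONTHS.get(month[:3].lower())
        if month_num is None:
            continue  # unrecognized month token: skip rather than guess
        found.append(FrCitation(int(vol), int(page), date(int(year), month_num, int(day))))
    return found
```

Under this sketch the first citation in a block is the original promulgation and later ones are amendments, which is the ordering `cfr_section_versions.amending_fr_citation` would need.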

Source module follows the standard Source protocol. Per-source spec at
https://docs.usejosh.com/sources/ecfr-and-cfr/.

Two sources, layered:
- Layer 1 (current state): eCFR REST API at
https://www.ecfr.gov/api/versioner/v1/. Daily-current, well-versioned,
no auth. JSON for structure, XML for body content.
- Layer 2 (historical): GovInfo CFR Annual at
https://www.govinfo.gov/content/pkg/CFR-{YYYY}-title{N}-vol{V}/.
Yearly snapshots, citation-grade for legal use.
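
The URL templates used by the two layers can be collected in one place. The helper names below are illustrative. The path templates are copied from this spec and are not independently verified against either API.

```python
ECFR_BASE = "https://www.ecfr.gov/api/versioner/v1"
GOVINFO_BASE = "https://www.govinfo.gov"


def ecfr_titles_url() -> str:
    """Title index with per-title up_to_date_as_of (discover step)."""
    return f"{ECFR_BASE}/titles.json"


def ecfr_full_title_url(title: int, as_of: str) -> str:
    """Full XML body for one title as of an ISO date (fetch step)."""
    return f"{ECFR_BASE}/full/{as_of}/title-{title}.xml"


def ecfr_versions_url(title: int) -> str:
    """Version history JSON for one title."""
    return f"{ECFR_BASE}/versions/title-{title}.json"


def govinfo_sitemap_url(year: int) -> str:
    """Yearly CFR Annual sitemap (discover step for layer 2)."""
    return f"{GOVINFO_BASE}/sitemap/CFR_{year}_sitemap.xml"


def govinfo_package_url(year: int, title: int, volume: int) -> str:
    """Package root for one CFR Annual volume."""
    return f"{GOVINFO_BASE}/content/pkg/CFR-{year}-title{title}-vol{volume}/"
```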

Discover/fetch/parse/load shape:
- discover (eCFR): GET /api/versioner/v1/titles.json for all 50
titles with up_to_date_as_of. Compare to per-title watermark in
ingestion_source_state.last_seen_key='ecfr-title-{N}-uptodate'.
- discover (CFR Annual): https://www.govinfo.gov/sitemap/CFR_{YYYY}_sitemap.xml
per year of interest. ~244 entries per year sitemap.
- fetch (eCFR): per-title GET /full/{date}/title-{N}.xml to
/data/corpus/ecfr-and-cfr/bodies/raw/title-{N}.xml. Plus
GET /versions/title-{N}.json for the version history. SHA256 the
title XML for raw_sha256.
- fetch (CFR Annual): per-volume XML from GovInfo bulk paths.
- parse: ecfr_xml_v1 walker over the DIV-numbered hierarchy
(DIV1=title, DIV5=part, DIV8=section). Builds parent rows for
titles + parts; section rows with full body text → Markdown to
bodies/markdown/. Authority line (<AUTH>) and citation block
(<CITA>) extracted into structured columns. Version history JSON
parses into cfr_section_versions.
- load: idempotent upsert on (title_number, part_number, section_number)
for sections; on (title_number, year, volume) for annual snapshots.
FTS5 sync triggers fire on insert/update/delete. Body chunks > ~5K
tokens enqueued for embedding.
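
The parse step can be pictured as a nested walk over the DIV levels. The sketch below is not the `ecfr_xml_v1` walker itself. It assumes the section number sits in the `N` attribute (optionally prefixed with `§`), that `<AUTH>` is a child of the part, and that `<HEAD>`, `<P>`, and `<CITA>` are direct children of the section. The `Section` dataclass is a hypothetical stand-in for the real models.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Section:  # hypothetical shape, not the Pydantic model from the task list
    title_number: str
    part_number: str
    section_number: str
    heading: str
    authority: Optional[str]
    cita: Optional[str]
    body_markdown: str


def _flat(el) -> str:
    """Concatenate all nested text and collapse runs of whitespace."""
    return " ".join("".join(el.itertext()).split())


def _opt(el) -> Optional[str]:
    return _flat(el) if el is not None else None


def walk_title(xml_text: str) -> List[Section]:
    """Walk DIV1 (title) -> DIV5 (part) -> DIV8 (section) and emit one Section each."""
    root = ET.fromstring(xml_text)
    sections = []
    for title in root.iter("DIV1"):  # iter() also matches the root element itself
        for part in title.iter("DIV5"):
            authority = _opt(part.find("AUTH"))  # part-level authority line
            for sec in part.iter("DIV8"):
                paragraphs = [_flat(p) for p in sec.findall("P")]
                sections.append(Section(
                    title_number=title.get("N", ""),
                    part_number=part.get("N", ""),
                    section_number=sec.get("N", "").lstrip("§ "),
                    heading=_opt(sec.find("HEAD")) or "",
                    authority=authority,
                    cita=_opt(sec.find("CITA")),
                    # Simplification: paragraphs only; tables and notes are ignored here.
                    body_markdown="\n\n".join(p for p in paragraphs if p),
                ))
    return sections
```

A production walker would stream the 5-50 MB title files rather than parse them whole. `ET.fromstring` keeps the sketch short.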

Migration shape: cfr_titles, cfr_parts, cfr_sections,
cfr_section_versions (the citation-graph backbone),
cfr_annual_snapshots, cfr_annual_section_text, cfr_sections_fts
virtual + 3 sync triggers, cfr_section_chunks + cfr_section_chunks_vec0
(vec0 with float[1024] and bit[1024]).
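
The `cfr_sections_fts` table, its three sync triggers, and the idempotent upsert can be sketched with SQLite's external-content FTS5 pattern. The column set here (`heading`, `body`) is an assumption made for illustration; the real migration has more columns and is managed by Alembic. The vec0 tables are omitted because they need an extension. The sketch requires an SQLite build with FTS5 and upsert support.

```python
import sqlite3
from typing import List

SCHEMA = """
CREATE TABLE cfr_sections (
    id INTEGER PRIMARY KEY,
    title_number INTEGER NOT NULL,
    part_number TEXT NOT NULL,
    section_number TEXT NOT NULL,
    heading TEXT,
    body TEXT,  -- assumed column; the spec stores bodies on disk as well
    UNIQUE (title_number, part_number, section_number)
);
-- External-content FTS5 index: text lives in cfr_sections, the index lives here.
CREATE VIRTUAL TABLE cfr_sections_fts USING fts5(
    heading, body, content='cfr_sections', content_rowid='id'
);
CREATE TRIGGER cfr_sections_ai AFTER INSERT ON cfr_sections BEGIN
    INSERT INTO cfr_sections_fts(rowid, heading, body)
    VALUES (new.id, new.heading, new.body);
END;
CREATE TRIGGER cfr_sections_ad AFTER DELETE ON cfr_sections BEGIN
    INSERT INTO cfr_sections_fts(cfr_sections_fts, rowid, heading, body)
    VALUES ('delete', old.id, old.heading, old.body);
END;
CREATE TRIGGER cfr_sections_au AFTER UPDATE ON cfr_sections BEGIN
    INSERT INTO cfr_sections_fts(cfr_sections_fts, rowid, heading, body)
    VALUES ('delete', old.id, old.heading, old.body);
    INSERT INTO cfr_sections_fts(rowid, heading, body)
    VALUES (new.id, new.heading, new.body);
END;
"""

# Idempotent load: re-running with the same natural key updates the row in place.
UPSERT = """
INSERT INTO cfr_sections (title_number, part_number, section_number, heading, body)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT (title_number, part_number, section_number)
DO UPDATE SET heading = excluded.heading, body = excluded.body
"""


def search(conn: sqlite3.Connection, query: str, limit: int = 10) -> List[int]:
    """Return cfr_sections ids for an FTS5 query, best BM25 match first."""
    rows = conn.execute(
        "SELECT rowid FROM cfr_sections_fts WHERE cfr_sections_fts MATCH ? "
        "ORDER BY bm25(cfr_sections_fts) LIMIT ?",
        (query, limit),
    ).fetchall()
    return [row[0] for row in rows]
```

Because the update path of the upsert fires the `AFTER UPDATE` trigger, a re-ingested section replaces its old index entry and does not leave stale terms behind.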

Schedule: daily 04:30 UTC for eCFR (catches overnight Federal Register
publication propagation). CFR Annual sitemap polled monthly for new
yearly volumes (publication is annual + slow).

0 of 16 done.

  • t1 Alembic migration applied locally + production (cfr_titles + cfr_parts + cfr_sections + cfr_section_versions + cfr_annual_* + FTS5 + vec0)
  • t2 Pydantic models in shared/josh_substrate/src/josh_substrate/models/cfr.py
  • t3 Source module implementing the `Source` protocol
  • t4 Body normalization wired (ecfr_xml_v1 walker preserving DIV5/DIV8 hierarchy; raw + markdown paths)
  • t5 Citation metadata: all 6 columns populated; citation_for() formatter + 3-fixture unit tests (`40 CFR § 60.1`, `17 CFR § 240.10b-5`, undated section)
  • t6 FTS5 virtual table + 3 sync triggers wired into migration
  • t7 Vector vec0 table wired (long-section bodies routinely exceed 5K tokens)
  • t8 Unit tests for ecfr_xml_v1 parser against captured fixtures (multi-section part, reserved section, deeply-nested subsection, complex AUTH+CITA)
  • t9 Integration test for full discover→fetch→parse→load pipeline against eCFR + GovInfo CFR Annual
  • t10 Smoke backfill (Title 40 partial) clean per success_determiner
  • t11 Production backfill at agreed scope (per clarifications_needed — likely all 50 titles eCFR + last 5 years CFR Annual) complete
  • t12 Daily cron registered and observed firing once
  • t13 No recurring errors in `ingestion_logs` after 24h
  • t14 Source doc updated; inventory rows in https://docs.usejosh.com/josh-data-sources/ + https://docs.usejosh.com/data-status/ flipped to ingestion-built
  • t15 12-sample manual XML→Markdown validation gate (per add-a-new-source.html — stratified across short section, long section, table-heavy section, reserved, recently-amended)
  • t16 Cross-source FR citation test: smoke-backfill cohort of FR rules that cite a CFR section all resolve to a `cfr_sections.id` (or queue retry)

No history yet.

docs/spec/ecfr-and-cfr-ingester.html · generated by bin/build-spec.py