eCFR and CFR ingester
Why
The Code of Federal Regulations (CFR) is the codified body of regulations
issued by federal agencies — the regulatory analog to the US Code's
statutes. Two sources, one substrate: eCFR (current state, daily
updated, clean REST API) for "what is 40 CFR 60 today" queries, and
CFR Annual (yearly snapshots from GovInfo) for citation-grade
date-stamped historical reference. CFR is the regulatory citation target:
every Federal Register rule cites CFR sections; bills cite CFR sections;
CRS reports cite CFR sections. Closing the citation graph
(bill → law → USC + Federal Register rule → CFR) needs this source.
User stories
As a policy analyst using my own AI agent, I want to ask "what does 40 CFR 60 say about NSPS limits today" and get the live text so that I'm working from the current rule, not a stale copy.
As a litigator's AI agent, I want to ask "what was 17 CFR 240 as of January 1, 2020" so that I have the citation-grade date-stamped historical version for legal context.
As an OSS self-hoster, I want to run `josh ingest ecfr-and-cfr` on a fresh substrate so that I can populate the regulatory text from public APIs without orchestrating multiple sources myself.
As the federal-register-ingester (downstream FK consumer), I want to soft-FK FR rule citations (`§ 40.60.1.1`) to a real `cfr_sections.id` so that agents can pivot from "this rule" to "the CFR section it amends".
Acceptance criteria (EARS)
- When `josh ingest ecfr-and-cfr` is run on the deployed substrate, the system shall populate `cfr_titles`, `cfr_parts`, and `cfr_sections` rows from the eCFR REST API for all 50 titles, with all six citation columns populated on every section.
- When a section's body is loaded, the system shall write the raw XML under `/data/corpus/ecfr-and-cfr/bodies/raw/` and a normalized Markdown body under `/data/corpus/ecfr-and-cfr/bodies/markdown/`.
- While the ingester is running, if a single title fails to parse, then the system shall log the failure to `ingestion_logs` (with `(title_number, source)`) and continue with the next title.
- Where the eCFR API exposes per-section version history (`/versions/title-{N}.json`), the system shall populate `cfr_section_versions` with one row per version (amended_on date + amending FR citation).
- Where a CFR Annual snapshot is loaded, the system shall populate `cfr_annual_snapshots` with one row per `(title, year)` and link sections via `cfr_annual_section_text` keyed on `(title, year, part, section)`.
- When an FTS5 query is executed against `cfr_sections_fts`, the system shall return BM25-ranked results.
- Where a section body exceeds ~5K tokens, the system shall enqueue rows in `chunk_embedding_jobs` for chunk-level embedding into `cfr_section_chunks_vec0`.
- When an incremental run completes, the system shall update `ingestion_source_state.last_seen_key` per-title using the eCFR API's `up_to_date_as_of` field so the next run only re-fetches titles whose date has advanced.
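The incremental criterion above can be illustrated with a minimal sketch of the per-title watermark comparison. The response field `number`, the function name `titles_to_refetch`, and the shape of the watermark mapping (title number to last-seen ISO date) are assumptions for illustration; only `up_to_date_as_of` comes from this spec.

```python
from datetime import date


def titles_to_refetch(api_titles, watermarks):
    """Return title numbers whose up_to_date_as_of is newer than the stored watermark.

    api_titles: list of dicts, assumed to carry "number" and "up_to_date_as_of".
    watermarks: dict mapping title number -> last-seen ISO date string (assumed shape).
    """
    stale = []
    for entry in api_titles:
        as_of = entry.get("up_to_date_as_of")
        if as_of is None:
            # No date reported (for example, an empty or reserved title): skip it.
            continue
        seen = watermarks.get(entry["number"])
        # Refetch when the title has never been seen or its date has advanced.
        if seen is None or date.fromisoformat(as_of) > date.fromisoformat(seen):
            stale.append(entry["number"])
    return stale
```

With this shape, a daily run costs one listing request plus one full fetch per title whose date moved; titles whose date is unchanged are skipped.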
Success determiner
Command
set -euo pipefail
ssh josh 'docker exec josh-ingester josh ingest ecfr-and-cfr --titles 40 --max-parts 10'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
  SELECT
    (SELECT COUNT(*) FROM cfr_titles) AS titles,
    (SELECT COUNT(*) FROM cfr_sections WHERE title_number=40) AS sections_t40,
    (SELECT COUNT(*) FROM cfr_sections WHERE source_url IS NOT NULL
       AND source_org IS NOT NULL AND retrieved_at IS NOT NULL
       AND raw_sha256 IS NOT NULL AND citation_string IS NOT NULL) AS fully_cited;"'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db \
"SELECT COUNT(*) FROM cfr_sections_fts WHERE cfr_sections_fts MATCH '\''pollution'\'';"'
Expect
Smoke run only: Title 40 (EPA) is large (~1500 sections), but the `--max-parts 10` cap keeps the run under a few minutes. The full backfill of all 50 titles runs detached on the server per `add-a-new-source.html`; most titles are 5-50 MB of XML each.
Clarifications needed
- Full historical CFR Annual coverage: 1996-present (~30 years × 50 titles = 1500 snapshots, ~15-25 GB) vs last 5 years (~250 snapshots, ~3-5 GB) for v1 launch. Lean: last 5 years for v1; full historical is a v1.x backfill.
- Per-section version history granularity: ingest the full version history (~17,800 versions for Title 42 alone, ~hundreds of thousands across all titles) or just the latest amended_on per section? Lean: full — version history is the citation graph (which FR rule amended which section when).
- eCFR daily refresh strategy: per-title diff via `up_to_date_as_of` (cheap) vs full re-fetch on any change (expensive but simple). Lean: per-title diff — ~50 cheap GETs per day vs ~3-5 GB of redundant XML.
- Sub-section granularity: model `§ 40.60.1.1(a)(1)(ii)` as a separate row in `cfr_section_subsections` or just store as text under the parent section? Lean: text under parent for v1; subsection-level rows are a v1.x improvement (they help very precise citation linkage).
- Cross-source FR `<CITA>` parsing: extract `[37 FR 23603, Nov. 4, 1972, as amended at 50 FR 12466, Mar. 28, 1985]` into `cfr_section_versions.amending_fr_citation` rows during eCFR load? Or wait for a separate citation-extraction pass? Lean: do it during load — `eyecite` library or custom regex.
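For the `<CITA>` question above, the custom-regex option might look like the following sketch. It handles only the `NN FR NNNNN, Mon. D, YYYY` shape shown in the example citation; the names `FR_CITE` and `parse_cita` and the output dict keys are hypothetical, and real citation blocks may contain variants this pattern misses.

```python
import re
from datetime import date

# Month lookup keyed on the first three letters, so "Sept." and "Sep." both resolve.
_MONTHS = {
    name: index
    for index, name in enumerate(
        ["jan", "feb", "mar", "apr", "may", "jun",
         "jul", "aug", "sep", "oct", "nov", "dec"],
        start=1,
    )
}

# Matches e.g. "37 FR 23603, Nov. 4, 1972".
FR_CITE = re.compile(
    r"(?P<volume>\d+)\s+FR\s+(?P<page>\d+),\s+"
    r"(?P<month>[A-Z][a-z]+)\.?\s+(?P<day>\d{1,2}),\s+(?P<year>\d{4})"
)


def parse_cita(text):
    """Extract FR citations from a CITA string, in document order."""
    found = []
    for m in FR_CITE.finditer(text):
        month = _MONTHS.get(m["month"][:3].lower())
        if month is None:
            continue  # Unrecognized month token: skip rather than guess.
        found.append({
            "volume": int(m["volume"]),
            "page": int(m["page"]),
            "date": date(int(m["year"]), month, int(m["day"])).isoformat(),
        })
    return found
```

Keeping document order matters here: in the example string the first citation is the original publication and later ones follow "as amended at", so the position in the list carries meaning for the version rows.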
Out of scope
- State regulatory codes (CCR / NYCRR / etc.) — Phase 6+.
- Pre-1996 historical CFR (paper-only / scanned editions) — out of scope for v1.
- Cornell LII / other CFR mirrors — derivative of eCFR + GPO; direct is better.
- Real-time webhooks / push subscriptions — surface-layer concern.
- Authority-line ("Authority: 44 U.S.C. 1506") graph extraction beyond raw text — separate citation-graph spec.
Dependencies
Plan
Source module follows the standard Source protocol. Per-source spec at https://docs.usejosh.com/sources/ecfr-and-cfr/.
Two sources, layered:
- Layer 1 (current state): eCFR REST API at
https://www.ecfr.gov/api/versioner/v1/. Daily-current, well-versioned,
no auth. JSON for structure, XML for body content.
- Layer 2 (historical): GovInfo CFR Annual at
https://www.govinfo.gov/content/pkg/CFR-{YYYY}-title{N}-vol{V}/.
Yearly snapshots, citation-grade for legal use.
Discover/fetch/parse/load shape:
- discover (eCFR): GET /api/versioner/v1/titles.json for all 50
titles with up_to_date_as_of. Compare to per-title watermark in
ingestion_source_state.last_seen_key='ecfr-title-{N}-uptodate'.
- discover (CFR Annual): https://www.govinfo.gov/sitemap/CFR_{YYYY}_sitemap.xml
per year of interest. ~244 entries per yearly sitemap.
- fetch (eCFR): per-title GET /full/{date}/title-{N}.xml to
/data/corpus/ecfr-and-cfr/bodies/raw/title-{N}.xml. Plus
GET /versions/title-{N}.json for the version history. SHA256 the
title XML for raw_sha256.
- fetch (CFR Annual): per-volume XML from GovInfo bulk paths.
- parse: ecfr_xml_v1 walker over the DIV-numbered hierarchy
(DIV1=title, DIV5=part, DIV8=section). Builds parent rows for
titles + parts; section rows with full body text → Markdown to
bodies/markdown/. Authority line (<AUTH>) and citation block
(<CITA>) extracted into structured columns. Version history JSON
parses into cfr_section_versions.
- load: idempotent upsert on (title_number, part_number, section_number)
for sections; on (title_number, year, volume) for annual snapshots.
FTS5 sync triggers fire on insert/update/delete. Section bodies over ~5K
tokens are enqueued for chunk-level embedding.
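The parse step above could be sketched as a small walker over the DIV1/DIV5/DIV8 hierarchy. The fixture below is synthetic placeholder text, and the `N` attribute, the `HEAD` and `P` child elements, and the row keys are assumptions about the XML shape rather than confirmed details of the eCFR schema.

```python
import xml.etree.ElementTree as ET

# Synthetic fixture (placeholder text only), shaped after DIV1=title, DIV5=part, DIV8=section.
SAMPLE = """<ECFR>
<DIV1 N="40" TYPE="TITLE">
  <DIV5 N="60" TYPE="PART">
    <HEAD>PART 60 (placeholder heading)</HEAD>
    <AUTH>Authority: placeholder authority line.</AUTH>
    <DIV8 N="60.1" TYPE="SECTION">
      <HEAD>§ 60.1 Placeholder heading.</HEAD>
      <P>(a) Placeholder paragraph one.</P>
      <P>(b) Placeholder paragraph two.</P>
      <CITA>[placeholder FR citation]</CITA>
    </DIV8>
  </DIV5>
</DIV1>
</ECFR>"""


def _text(element):
    """Collapse an element's text (including children) to one whitespace-normalized string."""
    if element is None:
        return None
    return " ".join("".join(element.itertext()).split())


def walk_title(xml_text):
    """Yield one dict per DIV8 section, carrying its title and part context."""
    root = ET.fromstring(xml_text)
    rows = []
    for title in root.iter("DIV1"):
        for part in title.iter("DIV5"):
            authority = _text(part.find("AUTH"))  # assumed to sit directly under the part
            for section in part.iter("DIV8"):
                head = _text(section.find("HEAD")) or ""
                paragraphs = [_text(p) for p in section.findall("P")]
                rows.append({
                    "title_number": title.get("N"),
                    "part_number": part.get("N"),
                    # Tolerate an N attribute written with or without the section sign.
                    "section_number": section.get("N", "").lstrip("§ ").strip(),
                    "authority": authority,
                    "citation": _text(section.find("CITA")),
                    "markdown": "\n\n".join([f"## {head}"] + paragraphs),
                })
    return rows
```

Calling `walk_title(SAMPLE)` returns a single row for section 60.1. A production walker would also need to handle tables, reserved sections, and nested subsection markup, which this sketch ignores.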
Migration shape: cfr_titles, cfr_parts, cfr_sections, cfr_section_versions
(the citation-graph backbone), cfr_annual_snapshots, cfr_annual_section_text,
cfr_sections_fts virtual + 3 sync triggers, cfr_section_chunks +
cfr_section_chunks_vec0 (vec0 with float[1024] and bit[1024]).
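As a rough illustration of the `cfr_sections_fts` piece of this migration, the sketch below pairs an external-content FTS5 table with three sync triggers and an idempotent upsert, using Python's built-in `sqlite3`. The column set is trimmed to a single `body` column and the trigger names are hypothetical; the real migration is Alembic-managed and carries more columns, and the vec0 table is omitted because it needs an extension. It assumes an SQLite build with FTS5 and upsert support (3.24 or later).

```python
import sqlite3

SCHEMA = """
CREATE TABLE cfr_sections (
    id INTEGER PRIMARY KEY,
    title_number INTEGER NOT NULL,
    part_number TEXT NOT NULL,
    section_number TEXT NOT NULL,
    body TEXT NOT NULL,
    UNIQUE (title_number, part_number, section_number)
);
-- External-content FTS5 index: the text lives in cfr_sections, the index here.
CREATE VIRTUAL TABLE cfr_sections_fts USING fts5(
    body, content='cfr_sections', content_rowid='id'
);
CREATE TRIGGER cfr_sections_ai AFTER INSERT ON cfr_sections BEGIN
    INSERT INTO cfr_sections_fts(rowid, body) VALUES (new.id, new.body);
END;
CREATE TRIGGER cfr_sections_ad AFTER DELETE ON cfr_sections BEGIN
    INSERT INTO cfr_sections_fts(cfr_sections_fts, rowid, body)
    VALUES ('delete', old.id, old.body);
END;
CREATE TRIGGER cfr_sections_au AFTER UPDATE ON cfr_sections BEGIN
    INSERT INTO cfr_sections_fts(cfr_sections_fts, rowid, body)
    VALUES ('delete', old.id, old.body);
    INSERT INTO cfr_sections_fts(rowid, body) VALUES (new.id, new.body);
END;
"""

UPSERT = """
INSERT INTO cfr_sections (title_number, part_number, section_number, body)
VALUES (?, ?, ?, ?)
ON CONFLICT (title_number, part_number, section_number)
DO UPDATE SET body = excluded.body
"""


def open_db(path=":memory:"):
    """Open a database and create the trimmed-down schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def upsert_section(conn, title_number, part_number, section_number, body):
    """Insert a section, or replace its body if the natural key already exists."""
    conn.execute(UPSERT, (title_number, part_number, section_number, body))


def search(conn, query):
    """Return matching section ids, best BM25 score first."""
    sql = (
        "SELECT rowid FROM cfr_sections_fts "
        "WHERE cfr_sections_fts MATCH ? ORDER BY bm25(cfr_sections_fts)"
    )
    return [row[0] for row in conn.execute(sql, (query,))]
```

Re-running the upsert with the same natural key updates the row in place, and the update trigger removes the old terms from the index before adding the new ones, so repeated loads do not leave stale search hits.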
Schedule: daily 04:30 UTC for eCFR (catches overnight Federal Register
publication propagation). CFR Annual sitemap polled monthly for new
yearly volumes (publication is annual + slow).
Tasks
0 of 16 done.
- t1 Alembic migration applied locally + production (cfr_titles + cfr_parts + cfr_sections + cfr_section_versions + cfr_annual_* + FTS5 + vec0)
- t2 Pydantic models in shared/josh_substrate/src/josh_substrate/models/cfr.py
- t3 Source module implementing the `Source` protocol
- t4 Body normalization wired (ecfr_xml_v1 walker preserving DIV5/DIV8 hierarchy; raw + markdown paths)
- t5 Citation metadata: all 6 columns populated; citation_for() formatter + 3-fixture unit tests (`40 CFR § 60.1`, `17 CFR § 240.10b-5`, undated section)
- t6 FTS5 virtual table + 3 sync triggers wired into migration
- t7 Vector vec0 table wired (long-section bodies routinely exceed 5K tokens)
- t8 Unit tests for ecfr_xml_v1 parser against captured fixtures (multi-section part, reserved section, deeply-nested subsection, complex AUTH+CITA)
- t9 Integration test for full discover→fetch→parse→load pipeline against eCFR + GovInfo CFR Annual
- t10 Smoke backfill (Title 40 partial) clean per success_determiner
- t11 Production backfill at agreed scope (per clarifications_needed — likely all 50 titles eCFR + last 5 years CFR Annual) complete
- t12 Daily cron registered and observed firing once
- t13 No recurring errors in `ingestion_logs` after 24h
- t14 Source doc updated; inventory rows in https://docs.usejosh.com/josh-data-sources/ + https://docs.usejosh.com/data-status/ flipped to ingestion-built
- t15 12-sample manual XML→Markdown validation gate (per add-a-new-source.html — stratified across short section, long section, table-heavy section, reserved, recently-amended)
- t16 Cross-source FR citation test: smoke-backfill cohort of FR rules that cite a CFR section all resolve to a `cfr_sections.id` (or queue retry)
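For t5, a `citation_for()` formatter covering the three named fixtures could start from something like the sketch below. The signature and the parenthesized-year suffix for dated citations are assumptions for illustration; only the `40 CFR § 60.1` and `17 CFR § 240.10b-5` output shapes come from the task text.

```python
from datetime import date
from typing import Optional


def citation_for(title_number: int, section_number: str,
                 as_of: Optional[date] = None) -> str:
    """Format a CFR section citation, optionally stamped with an edition year.

    The "(YYYY)" suffix for dated citations is a hypothetical convention.
    """
    base = f"{title_number} CFR § {section_number}"
    if as_of is None:
        # Undated section: emit the bare citation.
        return base
    return f"{base} ({as_of.year})"
```

Treating the section number as an opaque string keeps identifiers such as `240.10b-5` intact instead of trying to parse them as numbers.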
Changelog
No history yet.