eCFR and CFR
The Code of Federal Regulations (CFR) is the codified body of regulations issued by federal agencies — the regulatory analog to the US Code's statutes. The CFR is organized into 50 titles by subject area (Title 26 for IRS, Title 40 for EPA, Title 42 for HHS regulations, etc.), each updated annually.
Two sources, one substrate:
- eCFR (ecfr.gov): current CFR, updated daily as agencies publish rules in the Federal Register. Has a clean REST API. Sub-section granular versioning across time.
- CFR Annual Edition (GovInfo CFR collection): printed yearly snapshots used as the official date-stamped reference for legal citation purposes.
These cover the same content at different cadences. We ingest both: eCFR for current-state queries ("what is 40 CFR 60 today?") and CFR Annual for historical ("what was 40 CFR 60 as of January 1, 2020?").
For Josh, CFR is the regulatory citation target. Every Federal Register rule
cites CFR sections; bills cite CFR sections; CRS reports cite CFR sections. The
citation graph closes the loop: bill → law → USC + Federal Register rule → CFR.
The related ingester spec is ecfr-and-cfr-ingester;
the citation edges join to Federal Register (the other
half of the loop) and to U.S. Code (via <AUTH> authority lists).
Quick reference
| Field | Value |
|---|---|
| Source name | eCFR + CFR (combined) |
| Publishers | Office of the Federal Register (eCFR); Government Publishing Office (CFR Annual). |
| License | Public domain |
| Coverage | eCFR: current state, daily-updated. CFR Annual: 1996 – present systematically (occasional gaps). |
| Volume | 50 titles. ~10,000 parts. ~150,000+ sections. ~3-5 GB raw XML per snapshot. |
| Storage estimate | eCFR (current state): ~3-5 GB. CFR Annual snapshots (5 most recent): ~15-25 GB. |
| Auth | None for both (upstream). |
| Incremental sync hints | eCFR versioner/versions/title-{N}.json returns full version history per title (~17,800 versions for Title 42 alone). Daily refresh suffices. |
| Stable ID format | Section: cfr:{title}-{part}-{section} e.g. cfr:40-60-1.1. With version: cfr:40-60-1.1@2025-12-01. |
| Status | exploring — schema drafted, ingestion not built |
Upstream API surface (endpoint URLs, rate limits, caching headers, auth) is documented by the publishers: the eCFR Developer API docs and the GovInfo CFR collection / GovInfo developer hub.
Source priority decision
Primary: eCFR API. The Office of the Federal Register publishes a clean, well-documented REST API that returns daily-current full title content as XML, hierarchy structure as JSON, full version history per title, and section-level granular dates ("amended on YYYY-MM-DD").
This is the easiest of all federal data sources to ingest. No bot wall, no auth, JSON+XML, well-versioned. Use as primary for current-state and recent-history queries.
Secondary: GovInfo CFR Annual — for citation-grade date-stamped historical snapshots. Not all queries need it, but legal citation contexts ("what was 17 CFR 240 as of Jan 1, 2020?") do.
Skip: Cornell LII / other CFR mirrors. They're downstream of eCFR + GPO. Direct is better.
Access notes
eCFR: open. Respects standard HTTP caching (Last-Modified, ETag). Sending a
real-browser User-Agent is harmless but not required. Keep the request rate
polite, well under any observed limit.
GovInfo CFR Annual: open. Same patterns as USCODE / CHRG / CRPT.
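The caching behavior noted above can be exercised with a conditional GET. The following is a minimal sketch using only the Python standard library; the helper name `build_conditional_request`, the `USER_AGENT` string, and the example host are illustrative assumptions, not part of either publisher's documentation.

```python
import urllib.request
from typing import Optional

# Hypothetical identifier for our crawler; substitute your own contact string.
USER_AGENT = "cfr-ingester-sketch/0.1"


def build_conditional_request(
    url: str,
    etag: Optional[str] = None,
    last_modified: Optional[str] = None,
) -> urllib.request.Request:
    """Build a GET request that lets the server answer 304 Not Modified."""
    request = urllib.request.Request(url, method="GET")
    request.add_header("User-Agent", USER_AGENT)
    if etag:
        # Validator saved from the ETag header of a previous response.
        request.add_header("If-None-Match", etag)
    if last_modified:
        # Value saved verbatim from a previous Last-Modified header.
        request.add_header("If-Modified-Since", last_modified)
    return request


# Assumed host; check it against the eCFR Developer API docs.
req = build_conditional_request(
    "https://www.ecfr.gov/api/versioner/v1/titles.json", etag='"abc123"'
)
```

When such a request is passed to `urllib.request.urlopen`, an unchanged resource surfaces as an `HTTPError` with `code == 304`, so the caller should catch that case and treat it as "nothing to do".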
Endpoints / URL patterns
The two upstream feeds are fully documented at the publishers' developer hubs;
see the eCFR Developer API docs
(/api/versioner/v1/ — titles.json, structure/{date}/title-{N}.json,
full/{date}/title-{N}.xml, versions/title-{N}.json, ancestry/...) and the
GovInfo CFR collection (per-year
CFR_{YYYY}_sitemap.xml, content/metadata packages per title-volume). {date} is
YYYY-MM-DD; the latest queryable date is up_to_date_as_of from titles.json.
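As a rough illustration of the URL patterns listed above, the helpers below assemble the four eCFR versioner URLs. The host `https://www.ecfr.gov` and the function names are assumptions made for this sketch; only the path fragments come from the text above, so verify the base URL against the eCFR Developer API docs.

```python
from datetime import date

# Assumption: the versioner API lives under this host and prefix.
BASE = "https://www.ecfr.gov/api/versioner/v1"


def titles_url() -> str:
    """Roster of all titles, including up_to_date_as_of per title."""
    return f"{BASE}/titles.json"


def structure_url(title: int, as_of: date) -> str:
    """Hierarchy (JSON) of one title as of a given date."""
    return f"{BASE}/structure/{as_of.isoformat()}/title-{title}.json"


def full_url(title: int, as_of: date) -> str:
    """Full XML content of one title as of a given date."""
    return f"{BASE}/full/{as_of.isoformat()}/title-{title}.xml"


def versions_url(title: int) -> str:
    """Version history for one title."""
    return f"{BASE}/versions/title-{title}.json"


example = full_url(40, date(2025, 12, 1))
```

Using `date.isoformat()` keeps the `{date}` segment in the YYYY-MM-DD form the API expects.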
What matters for Josh's parser is the shared XML structure (both feeds use the
FR/CFR-uniform DIV-numbered DTD) and the <CITA> / <AUTH> interpretation:
XML DIV hierarchy
The DIV-numbered hierarchy (a <DIV5 TYPE="PART"> wraps <DIV8 TYPE="SECTION">
nodes, each carrying <HEAD>, body <P> elements, an optional <AUTH> authority
block, and a <CITA> publication-history note):
- DIV1 = Title
- DIV2 = Subtitle
- DIV3 = Chapter
- DIV4 = Subchapter
- DIV5 = Part (the granular regulatory unit)
- DIV6 = Subpart
- DIV7 = Subject group (rare)
- DIV8 = Section (the citation target)
- DIV9 = Subsection (rare; usually flattened into section)
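To make the hierarchy concrete, here is a minimal parsing sketch using Python's standard `xml.etree.ElementTree`. The `SAMPLE` fragment is hand-written for illustration and is not real eCFR output; real documents carry more attributes and nested markup, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hand-written fragment that mimics the structure described above.
SAMPLE = """
<DIV5 TYPE="PART">
  <HEAD>PART 60 EXAMPLE PART HEADING</HEAD>
  <DIV8 TYPE="SECTION">
    <HEAD>§ 60.1 Example section heading.</HEAD>
    <P>(a) First paragraph of body text.</P>
    <P>(b) Second paragraph of body text.</P>
    <CITA>[37 FR 23603, Nov. 4, 1972, as amended at 50 FR 12466, Mar. 28, 1985]</CITA>
  </DIV8>
</DIV5>
"""


def text_of(elem: Optional[ET.Element]) -> Optional[str]:
    """Flatten an element's text (including children) and collapse whitespace."""
    if elem is None:
        return None
    return " ".join("".join(elem.itertext()).split())


def parse_part(xml_text: str) -> dict:
    """Extract the part heading and its DIV8 sections from a DIV5 fragment."""
    part = ET.fromstring(xml_text)
    sections = []
    for div8 in part.iter("DIV8"):
        if div8.get("TYPE") != "SECTION":
            continue
        sections.append({
            "heading": text_of(div8.find("HEAD")),
            "body_text": "\n".join(text_of(p) for p in div8.findall("P")),
            "authority": text_of(div8.find("AUTH")),  # optional block
            "cita_text": text_of(div8.find("CITA")),
        })
    return {"heading": text_of(part.find("HEAD")), "sections": sections}


parsed = parse_part(SAMPLE)
```

The sketch assumes the fragment's root is the DIV5 element; a full-title document would need an outer loop over `root.iter("DIV5")`.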
<CITA> carries the publication history —
[37 FR 23603, Nov. 4, 1972, as amended at 50 FR 12466, Mar. 28, 1985]. This is
the inbound citation graph: the section was created/amended by these Federal
Register documents. Parse to build edges.
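A first-pass extraction of Federal Register citations from `<CITA>` text might look like the sketch below. The regex and the function name `extract_fr_citations` are assumptions for illustration; real `<CITA>` text varies in format, so anything the pattern misses should still be preserved as raw text.

```python
import re
from typing import List, Optional

# Matches "<volume> FR <page>", e.g. "37 FR 23603".
FR_CITE = re.compile(r"\b(\d{1,3})\s+FR\s+(\d{1,6})\b")


def extract_fr_citations(cita_text: Optional[str]) -> List[str]:
    """Return normalized 'VOL FR PAGE' strings in first-seen order, de-duplicated."""
    seen: List[str] = []
    for volume, page in FR_CITE.findall(cita_text or ""):
        cite = f"{volume} FR {page}"
        if cite not in seen:
            seen.append(cite)
    return seen


cites = extract_fr_citations(
    "[37 FR 23603, Nov. 4, 1972, as amended at 50 FR 12466, Mar. 28, 1985]"
)
```

Keeping first-seen order means the first element is normally the document that created the section and later elements are amendments.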
The vol{V} suffix on CFR Annual packages is because some titles are too large for
one volume (Title 26 IRC has many volumes; Title 40 EPA has ~30+). The CFR Annual
XML uses essentially the same DIV structure as eCFR (FR/CFR-uniform DTD).
Vocabulary / enums
Title status. Each title carries a reserved flag: false for active titles
(49 of 50), true for a reserved title — Title 35 has been reserved historically.
Section types. CFR sections come in several flavors based on the type
attribute and presence of certain elements:
- SECTION: standard regulatory text
- RESERVED: <RESERVED> element instead of body content (placeholder for future regulation)
- EFFECTIVE_DATE_NOTE: special timing note
For our purposes, all roll into cfr_sections with flags.
Stable ID format
Section: cfr:{title}-{part}-{section}. No uppercase T prefix is needed because
titles are always numeric.
Examples:
- cfr:40-60-1.1 = 40 CFR 60.1
- cfr:17-240-10b-5 = 17 CFR 240.10b-5 (the famous SEC Rule 10b-5)
- cfr:42-411-15 = 42 CFR 411.15
With version: cfr:40-60-1.1@2025-12-01 — useful when the agent specifically
needs the version of a section as of a date.
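The ID grammar can be sketched as a small formatter and parser. The `CfrRef` type and function names are hypothetical, and the parser assumes part numbers never contain a hyphen (section suffixes such as 10b-5 may). A part-only ID such as cfr:40-60 is also accepted.

```python
from typing import NamedTuple, Optional


class CfrRef(NamedTuple):
    title: int
    part: str                      # text, since parts can be alphanumeric
    section: Optional[str] = None  # e.g. '10b-5'; None for a part-level ID
    version: Optional[str] = None  # 'YYYY-MM-DD' when pinned to a date


def format_id(ref: CfrRef) -> str:
    """Render a CfrRef as a stable ID string."""
    stable_id = f"cfr:{ref.title}-{ref.part}"
    if ref.section is not None:
        stable_id += f"-{ref.section}"
    if ref.version is not None:
        stable_id += f"@{ref.version}"
    return stable_id


def parse_id(stable_id: str) -> CfrRef:
    """Split a stable ID back into its components."""
    if not stable_id.startswith("cfr:"):
        raise ValueError(f"not a CFR stable ID: {stable_id!r}")
    # Peel off the version first: the date itself contains hyphens.
    body, _, version = stable_id[len("cfr:"):].partition("@")
    pieces = body.split("-", 2)  # title, part, then the rest is the section
    if len(pieces) < 2:
        raise ValueError(f"missing part number: {stable_id!r}")
    section = pieces[2] if len(pieces) == 3 else None
    return CfrRef(int(pieces[0]), pieces[1], section, version or None)


rule_10b5 = parse_id("cfr:17-240-10b-5")
```

Splitting with a maximum of two cuts is what keeps a hyphenated section suffix like 10b-5 intact.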
Part: cfr:{title}-{part} — e.g. cfr:40-60. Parts are commonly cited; e.g.,
"the new source performance standards in 40 CFR Part 60."
Response shapes
The full eCFR JSON response shapes (titles.json, structure/{date}/title-{N}.json,
versions/title-{N}.json) are documented at the
eCFR Developer API docs.
The two glosses that drive Josh's model:
- up_to_date_as_of (from titles.json) is the most recent date for which we can query a title's content. Some titles update frequently (Title 26 IRS), some rarely (Title 35 Reserved).
- versions/title-{N}.json is the full amendment history per section, the input to the versioned cfr_section_versions table.
Volume
| Metric | Value |
|---|---|
| Titles | 50 (1 reserved) |
| Parts | ~10,000 |
| Sections | ~150,000+ |
| Per-title XML (current) | 50 KB - 100 MB (Title 26 IRC, Title 40 EPA largest) |
| Total eCFR XML (current) | ~3-5 GB |
| Per-year CFR Annual | ~3-5 GB |
| Section versions all-time (per title) | up to ~17,800 (Title 42) |
Postgres footprint: ~5-10 GB for current state + last 5 annual editions.
Caching / incremental sync
- Daily 06:00 UTC: eCFR refresh. For each of 50 titles, check up_to_date_as_of from titles.json. If newer than our last ingest:
  - Fetch versions/title-{N}.json for the version delta.
  - For each changed section, fetch full/{up_to_date}/title-{N}.xml?section={X.Y} (or full title if many sections changed).
  - Update sections + version history.
- Yearly: CFR Annual ingest. When CFR_{YYYY}_sitemap.xml appears (typically January of the following year), ingest the new annual edition as a snapshot.
- Per-section If-Modified-Since for conditional GET on individual section content.
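The daily staleness check described above could be sketched as follows. The payload field names (`titles`, `number`, `reserved`, `up_to_date_as_of`) are assumptions about the titles.json shape and should be confirmed against the eCFR Developer API docs; `last_ingested` stands in for whatever state store tracks our own per-title dates.

```python
from datetime import date
from typing import Dict, List


def titles_needing_refresh(
    titles_payload: dict, last_ingested: Dict[int, date]
) -> List[int]:
    """Return title numbers whose upstream date is newer than our last ingest."""
    stale: List[int] = []
    for row in titles_payload.get("titles", []):
        if row.get("reserved"):
            continue  # reserved titles have no content to fetch
        upstream = row.get("up_to_date_as_of")
        if upstream is None:
            continue
        ours = last_ingested.get(row["number"])
        # A title we have never ingested always counts as stale.
        if ours is None or date.fromisoformat(upstream) > ours:
            stale.append(row["number"])
    return stale


# Hand-written payload for illustration only.
payload = {
    "titles": [
        {"number": 26, "up_to_date_as_of": "2025-12-01", "reserved": False},
        {"number": 35, "up_to_date_as_of": None, "reserved": True},
        {"number": 40, "up_to_date_as_of": "2025-11-28", "reserved": False},
        {"number": 42, "up_to_date_as_of": "2025-12-01", "reserved": False},
    ]
}
stale = titles_needing_refresh(
    payload, {26: date(2025, 11, 30), 40: date(2025, 11, 28)}
)
```

Only the titles returned here need the follow-up versions and content fetches, which keeps the daily run cheap on quiet days.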
Schema (Postgres DDL)
```sql
-- pgvector provides the vector type and hnsw index used by cfr_section_chunks.
CREATE EXTENSION IF NOT EXISTS vector;

-- ============================================================
-- CFR titles
-- ============================================================
CREATE TABLE cfr_titles (
  id                  text PRIMARY KEY,             -- 'cfr-title:40'
  title_number        int NOT NULL UNIQUE,
  title_name          text NOT NULL,
  latest_amended_on   date,
  latest_issue_date   date,
  up_to_date_as_of    date,                         -- as reported by eCFR
  is_reserved         boolean NOT NULL DEFAULT false,

  raw_titles_json     jsonb,                        -- the per-title row from /titles.json
  fetched_at          timestamptz NOT NULL,
  inserted_at         timestamptz NOT NULL DEFAULT now(),
  updated_at          timestamptz NOT NULL DEFAULT now()
);

-- ============================================================
-- CFR parts (one row per part; the commonly cited unit)
-- ============================================================
CREATE TABLE cfr_parts (
  id                  text PRIMARY KEY,             -- 'cfr:40-60'
  title_number        int NOT NULL,
  part_number         text NOT NULL,                -- '60' (string preserves alphanumeric like '110a')

  -- Hierarchy
  chapter             text,                         -- 'I'
  subchapter          text,                         -- 'C'
  subpart             text,                         -- rare at the part level

  name                text,                         -- 'Standards of Performance for...'
  authority           text,                         -- the <AUTH> block content

  UNIQUE (title_number, part_number)
);

CREATE INDEX cfr_parts_title ON cfr_parts (title_number, part_number);

-- ============================================================
-- CFR sections (the citation target)
-- ============================================================
CREATE TABLE cfr_sections (
  id                      text PRIMARY KEY,         -- 'cfr:40-60-1.1'
  title_number            int NOT NULL,
  part_id                 text REFERENCES cfr_parts(id),
  part_number             text NOT NULL,
  section_number          text NOT NULL,            -- '1.1' (string; preserves dot-separated subsections like '60.45a')

  heading                 text,
  body_text               text NOT NULL,
  body_xml                bytea,                    -- gzipped section XML

  -- Version stamp (most-recent)
  current_amendment_date  date,                     -- when last amended
  current_issue_date      date,                     -- when last published in FR
  up_to_date_as_of        date,                     -- our snapshot date
  is_reserved             boolean NOT NULL DEFAULT false,

  -- Citation graph signals (extracted from <CITA>)
  cita_text               text,                     -- raw "[37 FR 23603, Nov. 4, 1972, ...]"
  fr_citations            text[],                   -- ['37 FR 23603', '50 FR 12466']
  fr_document_ids         text[],                   -- resolved fr_documents.id[] when resolvable

  fetched_at              timestamptz NOT NULL,
  parsed_at               timestamptz,
  inserted_at             timestamptz NOT NULL DEFAULT now(),
  updated_at              timestamptz NOT NULL DEFAULT now(),

  UNIQUE (title_number, part_number, section_number)
);

CREATE INDEX cfr_sections_title_part ON cfr_sections (title_number, part_number);
CREATE INDEX cfr_sections_active     ON cfr_sections (title_number) WHERE NOT is_reserved;
CREATE INDEX cfr_sections_amendment  ON cfr_sections (current_amendment_date DESC NULLS LAST);
CREATE INDEX cfr_sections_fr_docs    ON cfr_sections USING gin (fr_document_ids);

-- Weighted full-text search: heading ranks above body.
ALTER TABLE cfr_sections ADD COLUMN search_tsv tsvector
  GENERATED ALWAYS AS (
    setweight(to_tsvector('english', coalesce(heading, '')), 'A') ||
    setweight(to_tsvector('english', coalesce(body_text, '')), 'D')
  ) STORED;
CREATE INDEX cfr_sections_search ON cfr_sections USING gin (search_tsv);

-- ============================================================
-- Section version history (one row per amendment per section)
-- ============================================================
CREATE TABLE cfr_section_versions (
  id                  text PRIMARY KEY,             -- 'cfr:40-60-1.1@2024-03-15'
  section_id          text NOT NULL REFERENCES cfr_sections(id) ON DELETE CASCADE,
  amendment_date      date NOT NULL,
  issue_date          date,
  body_text           text,                         -- snapshot at this version
  body_xml            bytea,                        -- gzipped
  fr_document_id      text,                         -- the FR doc that made this amendment, if known
  fetched_at          timestamptz NOT NULL,
  UNIQUE (section_id, amendment_date)
);

CREATE INDEX cfr_section_versions_section
  ON cfr_section_versions (section_id, amendment_date DESC);

-- ============================================================
-- CFR structure nodes (nav)
-- ============================================================
CREATE TABLE cfr_structure_nodes (
  id                  text PRIMARY KEY,
  title_number        int NOT NULL,
  parent_id           text REFERENCES cfr_structure_nodes(id),
  level               text NOT NULL CHECK (level IN ('subtitle', 'chapter', 'subchapter', 'subpart')),
  identifier          text NOT NULL,                -- 'I', 'A', 'C', etc.
  label               text NOT NULL
);

-- ============================================================
-- Vector chunks
-- ============================================================
CREATE TABLE cfr_section_chunks (
  id                  bigserial PRIMARY KEY,
  section_id          text NOT NULL REFERENCES cfr_sections(id) ON DELETE CASCADE,
  chunk_index         int NOT NULL,
  chunk_text          text NOT NULL,
  embedding           vector(1024),
  UNIQUE (section_id, chunk_index)
);
CREATE INDEX cfr_section_chunks_embedding
  ON cfr_section_chunks USING hnsw (embedding vector_cosine_ops);
```

Schema decisions worth flagging:
- cfr_sections represents the current state. Versions live in cfr_section_versions. Default queries hit the current row.
- fr_document_ids array is the explicit citation edge: every section says which FR docs created/amended it via <CITA>. Resolution at parse time means we can answer "what FR rule created 40 CFR 60.45a" with a clean join to fr_documents.
- section_number and part_number as text: CFR section numbering uses dot-separated paths (60.45a, 1.1-1) and parts can have alphanumeric IDs (110a).
- No formal cfr_part_chunks for the part level: chunking is at section level (the right granularity for retrieval).
- raw_titles_json on cfr_titles: preserves the per-title metadata blob for re-parse.
- CFR Annual snapshots are loaded into cfr_section_versions rather than a separate table: they're versions, just labeled with issue_date = annual_edition_date. Query "as of annual 2020" returns the appropriate version row.
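The "as of" lookup implied by the versions table can be sketched in memory as below. This is an illustration of the selection rule only (latest version whose date is on or before the requested date); the function name and the tuple layout are assumptions, and in production the same rule would run as a query ordered by amendment_date descending with a limit of one row.

```python
from bisect import bisect_right
from datetime import date
from typing import List, Optional, Tuple

# (amendment_date, version row id) pairs, sorted ascending by date.
Version = Tuple[date, str]


def version_as_of(versions: List[Version], as_of: date) -> Optional[Version]:
    """Return the latest version dated on or before as_of, or None if none exists."""
    dates = [amended for amended, _ in versions]
    index = bisect_right(dates, as_of)
    return versions[index - 1] if index else None


# Illustrative version rows; dates are made up for the example.
history = [
    (date(2019, 7, 1), "cfr:40-60-1.1@2019-07-01"),
    (date(2021, 3, 15), "cfr:40-60-1.1@2021-03-15"),
]
hit = version_as_of(history, date(2020, 1, 1))
```

`bisect_right` makes an exact date match resolve to that version rather than the one before it.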
Chunking & eval status: section-level granularity (chunked + vectorized), tracked
in data status as chunker section_aware_fr_v1 (likely reuse) —
Phase 1 only; Phase 2 likely reuses the Federal Register result.
Download / update strategy
Backfill
- Pull titles.json for the 50-title roster.
- For each title (most-recent date):
  - Fetch structure/{date}/title-{N}.json to enumerate sections.
  - Fetch full/{date}/title-{N}.xml for the body.
  - Parse XML by DIV5 (parts) + DIV8 (sections).
  - Insert cfr_titles, cfr_parts, cfr_sections.
  - Extract <CITA> citations to fr_citations[] + resolve to fr_document_ids.
- Per-title version history:
  - Fetch versions/title-{N}.json.
  - For each version row, insert cfr_section_versions row (without body unless we backfill bodies; body backfill is bulk-volume, so defer).
- CFR Annual snapshots for last 5 years:
  - Fetch each year's sitemap.
  - For each title-volume package, fetch ZIP, extract XML.
  - Parse + insert as cfr_section_versions rows with issue_date = annual_edition_date.
Per-title backfill: ~10-50 MB of XML for a typical title × 50 titles = ~0.5-2.5 GB, with the largest titles (up to ~100 MB each) pushing the total higher. ~1-2 hours wall-clock.
Daily incremental
- 06:00 UTC: poll titles.json. For any title whose up_to_date_as_of > our up_to_date_as_of:
  - Fetch versions/title-{N}.json to get changed sections since last run.
  - For each changed section, fetch the new content.
  - Update cfr_sections.body_text + insert cfr_section_versions row.
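The "changed sections since last run" step might be sketched like this. The payload keys (`content_versions`, `identifier`, `amendment_date`) are assumptions about the versions/title-{N}.json shape and must be checked against the eCFR Developer API docs before use; the function name is hypothetical.

```python
from datetime import date
from typing import Dict


def changed_sections_since(versions_payload: dict, since: date) -> Dict[str, date]:
    """Map each section identifier to its newest amendment date after `since`."""
    changed: Dict[str, date] = {}
    for row in versions_payload.get("content_versions", []):
        amended = date.fromisoformat(row["amendment_date"])
        if amended <= since:
            continue  # already covered by a previous run
        identifier = row["identifier"]
        if identifier not in changed or amended > changed[identifier]:
            changed[identifier] = amended
    return changed


# Hand-written payload for illustration only.
versions_payload = {
    "content_versions": [
        {"identifier": "60.1", "amendment_date": "2024-03-15"},
        {"identifier": "60.1", "amendment_date": "2025-06-01"},
        {"identifier": "60.2", "amendment_date": "2023-01-10"},
    ]
}
delta = changed_sections_since(versions_payload, date(2024, 1, 1))
```

Collapsing to the newest date per section means each changed section is fetched once, even if it was amended several times between runs; intermediate versions would still need their own rows if full history is wanted.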
State tracking
Source key: cfr_ecfr for current state, cfr_annual for annual editions. Each
tracks per-title up_to_date_as_of.
Failure modes
- eCFR API rate limit / outage. Fall back to GovInfo CFR Annual (less current but available).
- CITA parsing variance. <CITA> text can include amendment chains, dates, FR cites in slightly different formats. Defensive regex; fall back to raw text.
- Section identifier collisions across titles. 1.1 exists in many titles. Always include title in the ID.
- Reserved sections. Schema handles via is_reserved; body_text empty.
- Annual edition lag. CFR Annual for year YYYY publishes throughout YYYY+1 (multi-month). Sitemap entries appear as titles publish. Don't expect all 50 in any one fetch.
Open questions
These don't block ingestion but should be resolved before this source is "shipped":
- Subsection-level retrieval granularity. Same as USC — defer to v1.x; ship section-level for v1.
- Cross-section references within a section. Sections cite other sections ("§ 60.45 of this part"). Extract to citation graph at parse time.
- Authority text parsing. <AUTH> lists USC authorities; these are inbound USC → CFR edges. Extract.
- CFR Annual vs eCFR discrepancies. The CFR Annual is a snapshot; eCFR is daily-updated. They should agree as of the annual's date. Spot-check.
- Body backfill for old versions. Storing full body for every version of every section across 30 years is 100s of GB. Lean: store body for current + last 5 annuals. Older versions: store metadata only (date, FR citation), reconstruct body on demand from FR document if needed.
- Title 35 and other rare titles. Title 35 is reserved/empty; some titles are rarely amended. Schema accommodates.
- Section renumbering. Occasionally an agency renumbers a part. The old section ID still cited in older FR docs; we need a redirect or alias table. Defer.
- Annual edition single-volume vs multi-volume. Some titles like Title 40 EPA span ~30 volumes annually. Schema treats each volume's XML as a fragment; section IDs unique across the title.
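For the authority-text question above, a first-pass extractor for U.S. Code citations could look like the sketch below. The regex, the function name, and the sample authority strings are all illustrative assumptions. It deliberately captures only fully spelled "N U.S.C. S" citations, so bare continuations (a second section number after a comma) are missed and would need a follow-up pass.

```python
import re
from typing import List, Optional

# "<title> U.S.C. <section>", where the section may carry letters or a
# hyphenated suffix or range, e.g. "78c" or "7401-7671q".
USC_CITE = re.compile(
    r"\b(\d+)\s+U\.S\.C\.\s+(\d+[A-Za-z]*(?:-\d+[A-Za-z]*)?)"
)


def extract_usc_authorities(auth_text: Optional[str]) -> List[str]:
    """Return normalized 'T U.S.C. S' strings found in an <AUTH> block."""
    return [
        f"{title} U.S.C. {section}"
        for title, section in USC_CITE.findall(auth_text or "")
    ]


# Illustrative authority text, not copied from a real part.
found = extract_usc_authorities(
    "Authority: 15 U.S.C. 78c, 78j; 42 U.S.C. 7401-7671q."
)
```

Each extracted string is a candidate USC → CFR edge; unmatched text should be kept verbatim so a later, stricter parser can reprocess it.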