eCFR and CFR

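The Code of Federal Regulations (CFR) is the codified body of regulations issued by federal agencies, the regulatory analog to the US Code's statutes. The CFR is organized into 50 titles by subject area (Title 26 for IRS, Title 40 for EPA, Title 42 for HHS regulations, etc.). Each title is revised once per calendar year on a staggered quarterly schedule: Titles 1-16 as of January 1, Titles 17-27 as of April 1, Titles 28-41 as of July 1, and Titles 42-50 as of October 1.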

Two sources, one substrate:

eCFR (ecfr.gov): the current CFR, updated daily as agencies publish rules in the Federal Register. Has a clean REST API with section-level versioning over time.
CFR Annual Edition (GovInfo CFR collection): printed yearly snapshots used as the official date-stamped reference for legal citation purposes.

These cover the same content at different cadences. We ingest both: eCFR for current-state queries ("what is 40 CFR 60 today?") and CFR Annual for historical ones ("what was 40 CFR 60 as of July 1, 2020, the revision date of that year's Title 40 edition?").

For Josh, CFR is the regulatory citation target. Every Federal Register rule cites CFR sections; bills cite CFR sections; CRS reports cite CFR sections. The citation graph closes the loop: bill → law → USC + Federal Register rule → CFR. The related ingester spec is ecfr-and-cfr-ingester; the citation edges join to Federal Register (the other half of the loop) and to U.S. Code (via <AUTH> authority lists).

Quick reference


Source name	eCFR + CFR (combined)
Publishers	Office of the Federal Register (eCFR); Government Publishing Office (CFR Annual).
License	Public domain
Coverage	eCFR: current state, daily-updated. CFR Annual: 1996 – present systematically (occasional gaps).
Volume	50 titles. ~10,000 parts. ~150,000+ sections. ~3-5 GB raw XML per snapshot.
Storage estimate	eCFR (current state): ~3-5 GB. CFR Annual snapshots (5 most recent): ~15-25 GB.
Auth	None for both (upstream).
Incremental sync hints	eCFR `versioner/versions/title-{N}.json` returns full version history per title (~17,800 versions for Title 42 alone). Daily refresh suffices.
Stable ID format	Section: `cfr:{title}-{part}-{section}` e.g. `cfr:40-60-1.1`. With version: `cfr:40-60-1.1@2025-12-01`.
Status	exploring — schema drafted, ingestion not built

Upstream API surface (endpoint URLs, rate limits, caching headers, auth) is documented by the publishers: the eCFR Developer API docs and the GovInfo CFR collection / GovInfo developer hub.

Source priority decision

Primary: eCFR API. The Office of the Federal Register publishes a clean, well-documented REST API that returns daily-current full title content as XML, hierarchy structure as JSON, full version history per title, and section-level granular dates ("amended on YYYY-MM-DD").

This is the easiest of all federal data sources to ingest. No bot wall, no auth, JSON+XML, well-versioned. Use as primary for current-state and recent-history queries.

Secondary: GovInfo CFR Annual, for citation-grade date-stamped historical snapshots. Not all queries need it, but legal citation contexts do, e.g. "what was 17 CFR 240 as of April 1, 2020?" (April 1 being the annual revision date for Title 17).

Skip: Cornell LII / other CFR mirrors. They're downstream of eCFR + GPO. Direct is better.

Access notes

eCFR: open. Respects standard HTTP caching (Last-Modified, ETag). A real-browser User-Agent is harmless but not required. Keep to a polite request rate, well under any limit observed so far.

GovInfo CFR Annual: open. Same patterns as USCODE / CHRG / CRPT.
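
Because eCFR is noted above as respecting Last-Modified and ETag, a client can store the validators from the previous response and replay them on the next request. The following is a minimal standard-library sketch, not the ingester's actual code: `CachedValidators`, `conditional_headers`, `fetch_if_changed`, and the User-Agent string are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple
import urllib.error
import urllib.request


@dataclass
class CachedValidators:
    """Validators saved from the previous 200 response (hypothetical helper)."""
    etag: Optional[str] = None
    last_modified: Optional[str] = None


def conditional_headers(cached: CachedValidators) -> dict:
    """Build request headers that allow the server to answer 304 Not Modified."""
    headers = {"User-Agent": "cfr-ingester/0.1 (placeholder)"}  # placeholder UA
    if cached.etag:
        headers["If-None-Match"] = cached.etag
    if cached.last_modified:
        headers["If-Modified-Since"] = cached.last_modified
    return headers


def fetch_if_changed(
    url: str, cached: CachedValidators, timeout: float = 30.0
) -> Tuple[Optional[bytes], CachedValidators]:
    """Return (body, new_validators), or (None, cached) when nothing changed."""
    req = urllib.request.Request(url, headers=conditional_headers(cached))
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            fresh = CachedValidators(
                etag=resp.headers.get("ETag"),
                last_modified=resp.headers.get("Last-Modified"),
            )
            return resp.read(), fresh
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib reports 304 as an HTTPError
            return None, cached
        raise
```

Persisting the validators next to `fetched_at` would let the daily job skip unchanged content without re-downloading it.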

Endpoints / URL patterns

The two upstream feeds are fully documented at the publishers' developer hubs; see the eCFR Developer API docs (/api/versioner/v1/ — titles.json, structure/{date}/title-{N}.json, full/{date}/title-{N}.xml, versions/title-{N}.json, ancestry/...) and the GovInfo CFR collection (per-year CFR_{YYYY}_sitemap.xml, content/metadata packages per title-volume). {date} is YYYY-MM-DD; the latest queryable date is up_to_date_as_of from titles.json.
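
As a quick illustration of the path patterns listed above, the sketch below assembles the four eCFR URLs from a title number and a date. The base host in `ECFR_BASE` is an assumption to confirm against the eCFR Developer API docs, the helper names are hypothetical, and the `section` query parameter follows the pattern used in the caching notes later in this document.

```python
from datetime import date
from typing import Optional
from urllib.parse import urlencode

# Assumption: the versioner API is served from this host and prefix.
ECFR_BASE = "https://www.ecfr.gov/api/versioner/v1"


def titles_url() -> str:
    """Roster of titles, including up_to_date_as_of per title."""
    return f"{ECFR_BASE}/titles.json"


def structure_url(as_of: date, title: int) -> str:
    """Hierarchy (JSON) of one title on a given date."""
    return f"{ECFR_BASE}/structure/{as_of.isoformat()}/title-{title}.json"


def full_url(as_of: date, title: int, section: Optional[str] = None) -> str:
    """Full XML of one title, optionally narrowed to a single section."""
    url = f"{ECFR_BASE}/full/{as_of.isoformat()}/title-{title}.xml"
    if section is not None:
        url += "?" + urlencode({"section": section})
    return url


def versions_url(title: int) -> str:
    """Version history (JSON) of one title."""
    return f"{ECFR_BASE}/versions/title-{title}.json"


if __name__ == "__main__":
    print(full_url(date(2025, 12, 1), 17, section="240.10b-5"))
```

`date.isoformat()` yields the YYYY-MM-DD form the {date} placeholder expects.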

What matters for Josh's parser is the shared XML structure (both feeds use the FR/CFR-uniform DIV-numbered DTD) and the <CITA> / <AUTH> interpretation:

XML DIV hierarchy

The DIV-numbered hierarchy (a <DIV5 TYPE="PART"> wraps <DIV8 TYPE="SECTION"> nodes, each carrying <HEAD>, body <P> elements, an optional <AUTH> authority block, and a <CITA> publication-history note):

DIV1 = Title
DIV2 = Subtitle
DIV3 = Chapter
DIV4 = Subchapter
DIV5 = Part (the granular regulatory unit)
DIV6 = Subpart
DIV7 = Subject group (rare)
DIV8 = Section (the citation target)
DIV9 = Appendix (attached to a part or subpart; subsections are not a DIV level and stay inside the section body)
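
To make the hierarchy concrete, here is a small sketch that walks one part and collects its sections with `xml.etree.ElementTree`. It relies only on the elements named above (DIV5, DIV8, HEAD, P, AUTH, CITA, plus the RESERVED marker described under section types). The `SAMPLE` fragment is a synthetic fixture, not real CFR text, and `parse_part` is a hypothetical helper; real documents carry more attributes and inline markup than this handles.

```python
import xml.etree.ElementTree as ET

# Synthetic fixture for illustration only; not real CFR content.
SAMPLE = """
<DIV5 TYPE="PART">
  <HEAD>PART 999 - EXAMPLE PART</HEAD>
  <AUTH>Authority: 42 U.S.C. 7401 et seq.</AUTH>
  <DIV8 TYPE="SECTION">
    <HEAD>§ 999.1 Example heading.</HEAD>
    <P>(a) Example paragraph one.</P>
    <P>(b) Example paragraph two.</P>
    <CITA>[37 FR 23603, Nov. 4, 1972, as amended at 50 FR 12466, Mar. 28, 1985]</CITA>
  </DIV8>
  <DIV8 TYPE="SECTION">
    <HEAD>§ 999.2 [Reserved]</HEAD>
    <RESERVED/>
  </DIV8>
</DIV5>
"""


def text_of(elem) -> str:
    """Flatten an element's text content and collapse whitespace."""
    return " ".join("".join(elem.itertext()).split())


def parse_part(xml_text: str) -> dict:
    """Return the part name, authority text, and a list of section dicts."""
    part = ET.fromstring(xml_text)
    if part.tag != "DIV5" or part.get("TYPE") != "PART":
        raise ValueError("expected a <DIV5 TYPE='PART'> root element")
    sections = []
    # iter() also finds sections nested under DIV6 subparts.
    for sec in part.iter("DIV8"):
        if sec.get("TYPE") != "SECTION":
            continue
        head = sec.find("HEAD")
        cita = sec.find("CITA")
        sections.append({
            "heading": text_of(head) if head is not None else None,
            "body_text": "\n".join(text_of(p) for p in sec.findall("P")),
            "cita_text": text_of(cita) if cita is not None else None,
            "is_reserved": sec.find("RESERVED") is not None,
        })
    part_head = part.find("HEAD")
    auth = part.find("AUTH")
    return {
        "name": text_of(part_head) if part_head is not None else None,
        "authority": text_of(auth) if auth is not None else None,
        "sections": sections,
    }


if __name__ == "__main__":
    for section in parse_part(SAMPLE)["sections"]:
        print(section["heading"], "| reserved:", section["is_reserved"])
```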

<CITA> carries the publication history — [37 FR 23603, Nov. 4, 1972, as amended at 50 FR 12466, Mar. 28, 1985]. This is the inbound citation graph: the section was created/amended by these Federal Register documents. Parse to build edges.

The vol{V} suffix on CFR Annual packages exists because some titles are too large for one volume (Title 26, Internal Revenue, spans many volumes; Title 40, EPA, has ~30+). The CFR Annual XML uses essentially the same DIV structure as eCFR (FR/CFR-uniform DTD).
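
A sketch of how volume-suffixed package identifiers might be grouped per title before parsing. It assumes package IDs of the form CFR-{year}-title{N}-vol{V}, with the volume suffix optional; that format should be checked against the GovInfo sitemap before relying on it. `AnnualPackage`, `parse_package_id`, and `volumes_by_title` are hypothetical names.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional, Tuple

# Assumed package ID shape: CFR-2020-title40-vol1 (volume suffix optional).
PACKAGE_ID = re.compile(r"^CFR-(\d{4})-title(\d{1,2})(?:-vol(\d{1,2}))?$")


class AnnualPackage(NamedTuple):
    year: int
    title: int
    volume: Optional[int]


def parse_package_id(package_id: str) -> AnnualPackage:
    """Split a package ID into year, title, and (optional) volume."""
    match = PACKAGE_ID.match(package_id)
    if match is None:
        raise ValueError(f"unrecognized CFR package id: {package_id!r}")
    year, title, volume = match.groups()
    return AnnualPackage(int(year), int(title), int(volume) if volume else None)


def volumes_by_title(package_ids: Iterable[str]) -> Dict[Tuple[int, int], List[int]]:
    """Map (year, title) to the sorted volume numbers seen so far."""
    grouped: Dict[Tuple[int, int], List[int]] = defaultdict(list)
    for package_id in package_ids:
        pkg = parse_package_id(package_id)
        if pkg.volume is not None:
            grouped[(pkg.year, pkg.title)].append(pkg.volume)
    return {key: sorted(vols) for key, vols in grouped.items()}
```

Grouping by (year, title) lets the ingester treat each volume as a fragment of one title, so section IDs stay unique across the title.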

Vocabulary / enums

Title status. Each title carries a reserved flag: false for active titles (49 of 50), true for a reserved title — Title 35 has been reserved historically.

Section types. CFR sections come in several flavors based on the type attribute and presence of certain elements:

SECTION — standard regulatory text
RESERVED — <RESERVED> element instead of body content (placeholder for future regulation)
EFFECTIVE_DATE_NOTE — special timing note

For our purposes, all roll into cfr_sections with flags.

Stable ID format

Section: cfr:{title}-{part}-{section}. The title is a bare number with no T prefix, since CFR titles are always numeric.

Examples:

cfr:40-60-1 for 40 CFR 60.1
cfr:17-240-10b-5 for 17 CFR 240.10b-5 (the famous SEC Rule 10b-5)
cfr:42-411-15 for 42 CFR 411.15

With version: cfr:40-60-1.1@2025-12-01 — useful when the agent specifically needs the version of a section as of a date.

Part: cfr:{title}-{part} — e.g. cfr:40-60. Parts are commonly cited; e.g., "the new source performance standards in 40 CFR Part 60."
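
A minimal sketch of building these IDs from a citation string. It assumes the reading suggested by the 17 CFR 240.10b-5 and 42 CFR 411.15 examples: the section component is whatever follows the part number and the dot. That reading is an inference, not a confirmed rule, and `stable_id` and `stable_id_from_citation` are hypothetical helpers.

```python
import re
from typing import Optional

# Matches "17 CFR 240.10b-5", "42 CFR 411.15", and "40 CFR Part 60".
CITATION = re.compile(
    r"^\s*(\d+)\s+CFR\s+(?:Part\s+)?(\d+[a-z]?)(?:\.(\S+))?\s*$",
    re.IGNORECASE,
)


def stable_id(
    title: int,
    part: str,
    section: Optional[str] = None,
    as_of: Optional[str] = None,
) -> str:
    """Compose cfr:{title}-{part}[-{section}][@{YYYY-MM-DD}]."""
    base = f"cfr:{title}-{part}"
    if section is not None:
        base += f"-{section}"
    if as_of is not None:
        base += f"@{as_of}"
    return base


def stable_id_from_citation(citation: str) -> str:
    """Turn a plain citation into a stable ID (part-level if no section)."""
    match = CITATION.match(citation)
    if match is None:
        raise ValueError(f"not a recognizable CFR citation: {citation!r}")
    title, part, section = match.groups()
    return stable_id(int(title), part, section)


if __name__ == "__main__":
    print(stable_id_from_citation("17 CFR 240.10b-5"))  # cfr:17-240-10b-5
    print(stable_id_from_citation("40 CFR Part 60"))    # cfr:40-60
```

Keeping part and section as strings end to end avoids losing suffixes such as the "b-5" in 10b-5.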

Response shapes

The full eCFR JSON response shapes (titles.json, structure/{date}/title-{N}.json, versions/title-{N}.json) are documented at the eCFR Developer API docs. The two glosses that drive Josh's model:

up_to_date_as_of (from titles.json) is the most recent date for which we can query a title's content. Some titles update frequently (Title 26 IRS), some rarely (Title 35 Reserved).
versions/title-{N}.json is the full amendment history per section — the input to the versioned cfr_section_versions table.

Volume

Metric	Value
Titles	50 (1 reserved)
Parts	~10,000
Sections	~150,000+
Per-title XML (current)	50 KB - 100 MB (Title 26 IRC, Title 40 EPA largest)
Total eCFR XML (current)	~3-5 GB
Per-year CFR Annual	~3-5 GB
Section versions all-time (per title)	up to ~17,800 (Title 42)

Postgres footprint: ~5-10 GB for current state + last 5 annual editions.

Caching / incremental sync

Daily 06:00 UTC: eCFR refresh. For each of 50 titles, check up_to_date_as_of from titles.json. If newer than our last ingest:
- Fetch versions/title-{N}.json for the version delta.
- For each changed section, fetch full/{up_to_date}/title-{N}.xml?section={X.Y} (or full title if many sections changed).
- Update sections + version history.
Yearly: CFR Annual ingest. Once CFR_{YYYY}_sitemap.xml appears (typically January of the following year), ingest the new annual edition as a snapshot, re-polling the sitemap as further titles are published over the following months.
Per-section If-Modified-Since for conditional GET on individual section content.
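
The daily staleness check above can be sketched as a pure function over the parsed titles.json payload. The field names `titles`, `number`, and `reserved` are assumptions about the response shape (only `up_to_date_as_of` is named in this document), and `titles_needing_refresh` is a hypothetical helper.

```python
from datetime import date
from typing import Dict, List


def titles_needing_refresh(
    titles_json: dict, last_ingested: Dict[int, date]
) -> List[int]:
    """Return title numbers whose upstream date is newer than our last ingest.

    titles_json: parsed titles.json (assumed shape: {"titles": [{...}, ...]}).
    last_ingested: our stored up_to_date_as_of per title number.
    """
    stale = []
    for row in titles_json.get("titles", []):
        if row.get("reserved"):
            continue  # reserved titles carry no content to refresh
        upstream = row.get("up_to_date_as_of")
        if upstream is None:
            continue
        upstream_date = date.fromisoformat(upstream)
        ours = last_ingested.get(row["number"])
        # Never-ingested titles count as stale.
        if ours is None or upstream_date > ours:
            stale.append(row["number"])
    return stale
```

Only the titles returned here need a `versions/title-{N}.json` fetch, which keeps the daily run to a single request on days when nothing changed.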

Schema (Postgres DDL)

-- ============================================================
-- CFR titles
-- ============================================================

CREATE TABLE cfr_titles (
    id text PRIMARY KEY,                          -- 'cfr-title:40'
    title_number int NOT NULL UNIQUE,
    title_name text NOT NULL,
    latest_amended_on date,
    latest_issue_date date,
    up_to_date_as_of date,                        -- as reported by eCFR
    is_reserved boolean NOT NULL DEFAULT false,

    raw_titles_json jsonb,                        -- the per-title row from /titles.json
    fetched_at timestamptz NOT NULL,
    inserted_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now()
);

-- ============================================================
-- CFR parts (one row per part — commonly cited unit)
-- ============================================================

CREATE TABLE cfr_parts (
    id text PRIMARY KEY,                          -- 'cfr:40-60'
    title_number int NOT NULL,
    part_number text NOT NULL,                    -- '60' (string preserves alphanumeric like '110a')

    -- Hierarchy
    chapter text,                                 -- 'I'
    subchapter text,                              -- 'C'
    subpart text,                                 -- rare at the part level

    name text,                                    -- 'Standards of Performance for...'
    authority text,                               -- the <AUTH> block content

    UNIQUE (title_number, part_number)
);

-- No separate (title_number, part_number) index: the UNIQUE constraint above already creates one.

-- ============================================================
-- CFR sections (the citation target)
-- ============================================================

CREATE TABLE cfr_sections (
    id text PRIMARY KEY,                          -- 'cfr:40-60-1.1'
    title_number int NOT NULL,
    part_id text REFERENCES cfr_parts(id),
    part_number text NOT NULL,
    section_number text NOT NULL,                 -- '1.1' (string — preserves dot-separated subsections like '60.45a')

    heading text,
    body_text text NOT NULL,
    body_xml bytea,                               -- gzipped section XML

    -- Version stamp (most-recent)
    current_amendment_date date,                  -- when last amended
    current_issue_date date,                      -- when last published in FR
    up_to_date_as_of date,                        -- our snapshot date
    is_reserved boolean NOT NULL DEFAULT false,

    -- Citation graph signals (extracted from <CITA>)
    cita_text text,                               -- raw "[37 FR 23603, Nov. 4, 1972, ...]"
    fr_citations text[],                          -- ['37 FR 23603', '50 FR 12466']
    fr_document_ids text[],                       -- resolved fr_documents.id[] when resolvable

    fetched_at timestamptz NOT NULL,
    parsed_at timestamptz,
    inserted_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now(),

    UNIQUE (title_number, part_number, section_number)
);

-- (title_number, part_number) lookups are served by the UNIQUE index above (leading-column prefix).
CREATE INDEX cfr_sections_active ON cfr_sections (title_number) WHERE NOT is_reserved;
CREATE INDEX cfr_sections_amendment ON cfr_sections (current_amendment_date DESC NULLS LAST);
CREATE INDEX cfr_sections_fr_docs ON cfr_sections USING gin (fr_document_ids);

ALTER TABLE cfr_sections ADD COLUMN search_tsv tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(heading, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(body_text, '')), 'D')
    ) STORED;
CREATE INDEX cfr_sections_search ON cfr_sections USING gin (search_tsv);

-- ============================================================
-- Section version history (one row per amendment per section)
-- ============================================================

CREATE TABLE cfr_section_versions (
    id text PRIMARY KEY,                          -- 'cfr:40-60-1.1@2024-03-15'
    section_id text NOT NULL REFERENCES cfr_sections(id) ON DELETE CASCADE,
    amendment_date date NOT NULL,
    issue_date date,
    body_text text,                               -- snapshot at this version
    body_xml bytea,                               -- gzipped
    fr_document_id text,                          -- the FR doc that made this amendment, if known
    fetched_at timestamptz NOT NULL,
    UNIQUE (section_id, amendment_date)
);

CREATE INDEX cfr_section_versions_section ON cfr_section_versions (section_id, amendment_date DESC);

-- ============================================================
-- CFR structure nodes (nav)
-- ============================================================

CREATE TABLE cfr_structure_nodes (
    id text PRIMARY KEY,
    title_number int NOT NULL,
    parent_id text REFERENCES cfr_structure_nodes(id),
    level text NOT NULL CHECK (level IN ('subtitle', 'chapter', 'subchapter', 'subpart')),
    identifier text NOT NULL,                     -- 'I', 'A', 'C', etc.
    label text NOT NULL
);

-- ============================================================
-- Vector chunks (requires the pgvector extension)
-- ============================================================

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE cfr_section_chunks (
    id bigserial PRIMARY KEY,
    section_id text NOT NULL REFERENCES cfr_sections(id) ON DELETE CASCADE,
    chunk_index int NOT NULL,
    chunk_text text NOT NULL,
    embedding vector(1024),
    UNIQUE (section_id, chunk_index)
);
CREATE INDEX cfr_section_chunks_embedding
    ON cfr_section_chunks USING hnsw (embedding vector_cosine_ops);

Schema decisions worth flagging:

cfr_sections represents the current state. Versions live in cfr_section_versions. Default queries hit the current row.
fr_document_ids array is the explicit citation edge — every section says which FR docs created/amended it via <CITA>. Resolution at parse time means we can answer "what FR rule created 40 CFR 60.45a" with a clean join to fr_documents.
section_number and part_number as text — CFR section numbering uses dot-separated paths (60.45a, 1.1-1) and parts can have alphanumeric IDs (110a).
No formal cfr_part_chunks for the part level — chunking is at section level (the right granularity for retrieval).
raw_titles_json on cfr_titles — preserves the per-title metadata blob for re-parse.
CFR Annual snapshots are loaded into cfr_section_versions rather than a separate table — they're versions, just labeled with issue_date = annual_edition_date. Query "as of annual 2020" returns the appropriate version row.
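
The "as of" rule implied by the versions table (return the latest version whose amendment date is on or before the requested date) can be sketched in memory as follows. `version_as_of` is a hypothetical helper and the rows are plain dicts standing in for `cfr_section_versions` records; in Postgres the same rule would be an ORDER BY amendment_date DESC LIMIT 1 query, which the `cfr_section_versions_section` index is shaped for.

```python
from bisect import bisect_right
from datetime import date
from typing import List, Optional


def version_as_of(versions: List[dict], as_of: date) -> Optional[dict]:
    """Pick the latest version with amendment_date <= as_of, or None."""
    ordered = sorted(versions, key=lambda v: v["amendment_date"])
    dates = [v["amendment_date"] for v in ordered]
    # bisect_right gives the count of versions dated on or before as_of.
    idx = bisect_right(dates, as_of)
    return ordered[idx - 1] if idx else None


if __name__ == "__main__":
    history = [
        {"amendment_date": date(2019, 7, 1), "label": "v1"},
        {"amendment_date": date(2020, 2, 10), "label": "v2"},
        {"amendment_date": date(2021, 3, 15), "label": "v3"},
    ]
    print(version_as_of(history, date(2020, 7, 1))["label"])  # v2
```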

Chunking & eval status: section-level granularity (chunked + vectorized), tracked in data status as chunker section_aware_fr_v1 (likely reuse) — Phase 1 only; Phase 2 likely reuses the Federal Register result.

Download / update strategy

Backfill

Pull titles.json for the 50-title roster.
For each title (most-recent date):
- Fetch structure/{date}/title-{N}.json to enumerate sections.
- Fetch full/{date}/title-{N}.xml for the body.
- Parse XML by DIV5 (parts) + DIV8 (sections).
- Insert cfr_titles, cfr_parts, cfr_sections.
- Extract <CITA> citations to fr_citations[] + resolve to fr_document_ids.
Per-title version history:
- Fetch versions/title-{N}.json.
- For each version row, insert cfr_section_versions row (without body unless we backfill bodies — body backfill is bulk-volume; defer).
CFR Annual snapshots for last 5 years:
- Fetch each year's sitemap.
- For each title-volume package, fetch ZIP, extract XML.
- Parse + insert as cfr_section_versions rows with issue_date = annual_edition_date.

Backfill size: per-title XML ranges from ~50 KB to ~100 MB (see Volume), so all 50 titles total ~3-5 GB. ~1-2 hours wall.
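
The version-history step of the backfill can be sketched as a transform from the versions payload to metadata-only `cfr_section_versions` rows. The payload keys (`content_versions`, `type`, `part`, `identifier`, `amendment_date`, `issue_date`) are assumptions about the eCFR response shape, `version_rows` is a hypothetical helper, and the ID construction assumes the section component is the identifier with its part prefix removed.

```python
from typing import Dict, List, Tuple


def version_rows(title: int, versions_json: dict) -> List[dict]:
    """Build metadata-only version rows, one per (section, amendment_date)."""
    rows: Dict[Tuple[str, str], dict] = {}
    for v in versions_json.get("content_versions", []):
        if v.get("type") != "section":
            continue  # appendices and other node types are out of scope here
        part = v["part"]
        identifier = v["identifier"]
        prefix = part + "."
        # Assumption: "240.10b-5" in part 240 becomes section component "10b-5".
        suffix = identifier[len(prefix):] if identifier.startswith(prefix) else identifier
        section_id = f"cfr:{title}-{part}-{suffix}"
        amended = v["amendment_date"]
        # Keyed by the table's UNIQUE (section_id, amendment_date) pair.
        rows[(section_id, amended)] = {
            "id": f"{section_id}@{amended}",
            "section_id": section_id,
            "amendment_date": amended,
            "issue_date": v.get("issue_date"),
            "body_text": None,  # body backfill deferred
            "fr_document_id": None,
        }
    return [rows[key] for key in sorted(rows)]
```

Deduplicating on the unique key before insert avoids constraint violations when the feed repeats an entry.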

Daily incremental

06:00 UTC: poll titles.json. For any title whose up_to_date_as_of > our up_to_date_as_of:
Fetch versions/title-{N}.json to get changed sections since last run.
For each changed section, fetch the new content.
Update cfr_sections.body_text + insert cfr_section_versions row.
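
A sketch of the delta step: find the sections changed since the last run, then decide between per-section fetches and one full-title fetch. The payload keys are the same assumed shape as elsewhere (`content_versions`, `type`, `identifier`, `issue_date`). Comparing on `issue_date` and the cutoff of 25 sections are illustrative choices, not values from this spec.

```python
from datetime import date
from typing import Iterable, List


def changed_sections(
    versions_json: dict, since: date, date_field: str = "issue_date"
) -> List[str]:
    """Return sorted section identifiers with a version newer than `since`."""
    changed = set()
    for v in versions_json.get("content_versions", []):
        if v.get("type") != "section":
            continue
        stamp = v.get(date_field)
        if stamp and date.fromisoformat(stamp) > since:
            changed.add(v["identifier"])
    return sorted(changed)


def fetch_plan(identifiers: Iterable[str], full_title_threshold: int = 25) -> dict:
    """Choose per-section fetches, or one full-title fetch when many changed."""
    ids = list(identifiers)
    if not ids:
        return {"mode": "skip", "sections": []}
    if len(ids) > full_title_threshold:
        return {"mode": "full_title", "sections": []}
    return {"mode": "per_section", "sections": ids}
```

The threshold trades request count against download size, so it is worth tuning once real per-title sizes are measured.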

State tracking

Source key: cfr_ecfr for current state, cfr_annual for annual editions. Each tracks per-title up_to_date_as_of.

Failure modes

eCFR API rate limit / outage. Fall back to GovInfo CFR Annual (less current but available).
CITA parsing variance. <CITA> text can include amendment chains, dates, FR cites in slightly different formats. Defensive regex; fall back to raw text.
Section identifier collisions across titles. 1.1 exists in many titles. Always include title in the ID.
Reserved sections. Schema handles via is_reserved; body_text empty.
Annual edition lag. CFR Annual for year YYYY publishes throughout YYYY+1 (multi-month). Sitemap entries appear as titles publish. Don't expect all 50 in any one fetch.
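
For the CITA parsing variance item above, a defensive extractor might look like the sketch below: pull every "volume FR page" pair it can recognize and leave the raw text untouched as the fallback. The pattern covers only the citation form shown in this document, and `extract_fr_citations` is a hypothetical helper.

```python
import re
from typing import List, Optional

# "37 FR 23603" style citations: volume, the literal FR, starting page.
FR_CITE = re.compile(r"\b(\d{1,3})\s+FR\s+(\d{1,6})\b")


def extract_fr_citations(cita_text: Optional[str]) -> List[str]:
    """Return de-duplicated FR citations in order of first appearance.

    An empty list means nothing was recognized; the caller keeps the raw
    cita_text either way so the note can be re-parsed later.
    """
    if not cita_text:
        return []
    seen: List[str] = []
    for volume, page in FR_CITE.findall(cita_text):
        cite = f"{volume} FR {page}"
        if cite not in seen:
            seen.append(cite)
    return seen


if __name__ == "__main__":
    note = "[37 FR 23603, Nov. 4, 1972, as amended at 50 FR 12466, Mar. 28, 1985]"
    print(extract_fr_citations(note))  # ['37 FR 23603', '50 FR 12466']
```

The result maps onto `fr_citations[]`. Resolving each entry to an `fr_documents.id` is a separate lookup that can fail without losing the citation itself.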

Open questions

These don't block ingestion but should be resolved before this source is "shipped":

Subsection-level retrieval granularity. Same as USC — defer to v1.x; ship section-level for v1.
Cross-section references within a section. Sections cite other sections ("§ 60.45 of this part"). Extract to citation graph at parse time.
Authority text parsing. <AUTH> lists USC authorities — these are inbound USC → CFR edges. Extract.
CFR Annual vs eCFR discrepancies. The CFR Annual is a snapshot; eCFR is daily-updated. They should agree as of the annual's date. Spot-check.
Body backfill for old versions. Storing full body for every version of every section across 30 years is 100s of GB. Lean: store body for current + last 5 annuals. Older versions: store metadata only (date, FR citation), reconstruct body on demand from FR document if needed.
Title 35 and other rare titles. Title 35 is reserved/empty; some titles are rarely amended. Schema accommodates.
Section renumbering. Occasionally an agency renumbers a part. The old section ID still cited in older FR docs; we need a redirect or alias table. Defer.
Annual edition single-volume vs multi-volume. Some titles like Title 40 EPA span ~30 volumes annually. Schema treats each volume's XML as a fragment; section IDs unique across the title.
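
Two of the open questions above (cross-section references and authority parsing) come down to pattern extraction at parse time. The sketch below is a first-pass starting point under narrow assumptions: it recognizes only "N U.S.C. NNN" and "§ N.N" forms, and it misses comma-continued lists (such as a second section number after a shared title) and "§§ ... through ..." ranges. Both function names are hypothetical.

```python
import re
from typing import List

# "42 U.S.C. 7401" style authority citations.
USC_CITE = re.compile(r"\b(\d+)\s+U\.S\.C\.\s+(\d+[a-z]*(?:-\d+[a-z]*)?)", re.IGNORECASE)

# "§ 60.45", "§ 60.45a", "§ 240.10b-5" style section references.
SECTION_REF = re.compile(r"§\s*(\d+[a-z]?\.\d+[a-z]?(?:-\d+[a-z]?)*)")


def extract_usc_authorities(auth_text: str) -> List[str]:
    """Return normalized 'T USC S' strings found in an <AUTH> block."""
    found: List[str] = []
    for title, section in USC_CITE.findall(auth_text or ""):
        cite = f"{title} USC {section}"
        if cite not in found:
            found.append(cite)
    return found


def extract_section_refs(body_text: str) -> List[str]:
    """Return section numbers referenced with a section sign in body text."""
    found: List[str] = []
    for ref in SECTION_REF.findall(body_text or ""):
        if ref not in found:
            found.append(ref)
    return found
```

A reference such as "§ 60.45 of this part" names a section only within its own title, so the caller has to supply the title to turn it into a stable ID.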