Federal Register

Daily journal of US Government rules, proposed rules, notices, and presidential documents. Published by the Office of the Federal Register (OFR), part of the National Archives. Coverage from 1994-01-03 through present, ~70-80K documents per year.

This source is the test case for the v1 ingestion framework. Federal Register's API is clean enough to learn from, the data shapes are rich enough to stress-test the schema, and the v1 wedge feature (regulatory comment drafting) depends on this source being solid.

The authoritative spec for the ingester is federal-register-ingester; the canonical upstream API reference is the Federal Register developer documentation (note: the developer-docs web page is bot-walled — see Access notes). This page records only what Josh decided and probed about the source.

Quick reference


Source name	Federal Register
Publisher	Office of the Federal Register / National Archives
License	Public domain (US government work)
Coverage	1994-01-03 – present
Volume	~2.5M docs total, ~70-80K/year, ~6-7K/month, ~200-400/day
API base	`https://www.federalregister.gov/api/v1/`
Bulk base	`https://www.govinfo.gov/bulkdata/json/FR/`
Auth	None — no API key required
Rate limit	No formal limit; ~1 req/sec sustained tested clean, ceiling not characterized
Caching headers	None exposed (`Cache-Control: no-store`)
Stable ID format	`fr:{document_number}` e.g. `fr:2026-08558`
Status	exploring — schema drafted, ingestion not built

Access notes

The public docs page at federalregister.gov/developers/documentation/api/v1 is bot-walled (CAPTCHA via unblock.federalregister.gov). The API endpoints themselves are wide open — no auth, no CAPTCHA, just polite rate limiting expected.

A real-browser User-Agent header is required to avoid the bot wall on web pages. The API endpoints accept any UA, but using a browser-shaped UA is safer:

User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36

Upstream API reference

The endpoint catalog, the full conditions[*] filter key list, the fields[] sparse-fieldset mechanics, the thin/fat document field enumerations, the embedded JSON object shapes (regulations_dot_gov_info, dockets, regulation_id_number_info), the Public Inspection field list, the body-fetch XML URL pattern, and the presidential_document_type / notice_type enum value lists are all documented upstream — see the Federal Register API v1 documentation. The GovInfo bulk-download URL tree is at govinfo.gov/bulkdata/FR.

Two structural facts worth keeping at hand: a single document detail fetch (GET /documents/{document_number}.json) returns the fat shape (47 fields) while the list endpoint (GET /documents.json) returns a thin shape (10 fields) unless fields[] is requested; and the document_number namespace is shared between Public Inspection and Published — a PI doc retains its number after publication, only the endpoint changes.

Josh-learned quirks

These contradict or are absent from the upstream docs; they were discovered by probing this mirror directly.

Filter-key vs response-field mismatches (the filter key is not the field name in the response):

Filter key is effective_date, response field is effective_on.
Filter key is comment_date, response field is comments_close_on.

Silent failures (no error, just wrong/empty results):

Comma-separated value lists silently fail — no error, just count: null. Use repeated [] keys (multi-value within one key = OR; multiple keys = AND).
Negation ([not]) is silently ignored.
An invalid order= value silently falls back to the default. Only three values produce distinct ordering: order=newest (publication_date desc, the default), order=oldest (publication_date asc), and order=relevance (only meaningful with term; injects a synthetic excerpts field).
topics[] takes the canonical TOC topic name as an exact, case-sensitive match; topics with commas/spaces (e.g. "Banks, banking") need careful URL encoding and are fragile.

The 10,000 hard cap — load-bearing for backfill windowing. There is a hard 10,000-result cap per query: total_pages ceilings at 50 with per_page=200, and asking for page=51 silently returns page 1 (no error). The count field is also capped at 10,000 — a single query spanning more than 10K documents reports count: 10000 (or lower) and silently undercounts the true total. Do not size a backfill window off one count read; paginate/window by publication_date and sum the per-window totals instead. (Confirmed 2026-05-29 against the corrected ingester spec.)
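
The windowing advice above can be sketched as a recursive splitter. This is a minimal sketch, not the ingester's actual code: `split_windows` and `count_fn` are hypothetical names, and `count_fn(start, end)` stands in for one read of the API's count field over a publication_date range. Because a capped query may report 10,000 or lower, a caller may want to pass a more conservative `cap`.

```python
from datetime import date, timedelta
from typing import Callable, Iterator, Tuple

RESULT_CAP = 10_000  # per-query result cap described above


def split_windows(
    start: date,
    end: date,
    count_fn: Callable[[date, date], int],  # hypothetical: one count read per window
    cap: int = RESULT_CAP,
) -> Iterator[Tuple[date, date, int]]:
    """Yield (start, end, count) windows whose reported count is below cap.

    A count at or above cap is treated as untrustworthy, so the window is
    halved by date and each half is counted again.
    """
    n = count_fn(start, end)
    if n < cap or start == end:
        yield (start, end, n)
        return
    mid = start + (end - start) // 2
    yield from split_windows(start, mid, count_fn, cap)
    yield from split_windows(mid + timedelta(days=1), end, count_fn, cap)
```

Summing the third element of each yielded window gives a total that does not depend on any single capped count read.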

per_page confirmed up to 1000. Brackets must be URL-encoded as %5B / %5D.
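
A small URL builder makes the bracket-encoding and repeated-key rules hard to get wrong. This is a sketch under assumptions: `build_documents_url` is a hypothetical helper, not part of any published client, and it relies only on the standard library's `urlencode`, which percent-encodes `[` and `]` as %5B / %5D.

```python
from urllib.parse import urlencode

API_BASE = "https://www.federalregister.gov/api/v1/"


def build_documents_url(conditions, fields=(), per_page=1000, page=1, order=None):
    """Build a /documents.json URL.

    conditions maps a filter key to a scalar, a list of values (emitted as
    repeated [] keys, i.e. OR within the key), or a dict of operators such as
    {"gte": "2026-01-01"}.
    """
    params = []
    for key, value in conditions.items():
        if isinstance(value, dict):
            for op, v in value.items():
                params.append((f"conditions[{key}][{op}]", v))
        elif isinstance(value, (list, tuple)):
            for v in value:  # never join with commas: that fails silently
                params.append((f"conditions[{key}][]", v))
        else:
            params.append((f"conditions[{key}]", value))
    for name in fields:
        params.append(("fields[]", name))
    params.append(("per_page", per_page))
    params.append(("page", page))
    if order is not None:
        params.append(("order", order))
    return API_BASE + "documents.json?" + urlencode(params)


url = build_documents_url(
    {"type": ["RULE", "PRORULE"], "publication_date": {"gte": "2026-01-01"}},
    fields=["document_number"],
    per_page=200,
)
print(url)
```

Note that `urlencode` writes spaces as `+`; whether topics[] values tolerate that encoding was not probed, so topic names are left out of the example.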

The CORRECT-not-CORRECTION trap. conditions[type][] takes uppercase codes (RULE, PRORULE, NOTICE, PRESDOCU, CORRECT, UNKNOWN) while result.type returns display values (Rule, Proposed Rule, Notice, Presidential Document, Correction, Uncategorized Document). The code for corrections is CORRECT — passing the full word CORRECTION returns 0 results with no error.
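
Since a wrong code returns zero results rather than an error, it is cheap to validate the value client-side before sending it. A minimal sketch using only the code/display pairs listed above; `type_filter_code` is a hypothetical helper name.

```python
# Filter code (conditions[type][]) -> display value (result.type), as listed above.
TYPE_CODE_TO_DISPLAY = {
    "RULE": "Rule",
    "PRORULE": "Proposed Rule",
    "NOTICE": "Notice",
    "PRESDOCU": "Presidential Document",
    "CORRECT": "Correction",
    "UNKNOWN": "Uncategorized Document",
}
DISPLAY_TO_TYPE_CODE = {v: k for k, v in TYPE_CODE_TO_DISPLAY.items()}


def type_filter_code(value: str) -> str:
    """Return the filter code for a code or a display value; raise on anything else."""
    if value in TYPE_CODE_TO_DISPLAY:
        return value
    if value in DISPLAY_TO_TYPE_CODE:
        return DISPLAY_TO_TYPE_CODE[value]
    raise ValueError(f"not a known document type code or display value: {value!r}")
```

Raising locally turns the silent zero-result case (for example `CORRECTION`) into a visible error at query-build time.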

Rejected as invalid (negative probe results; this mirror does not accept these parameter names): query, q, search, agency_id (singular), agencies_short_name, agencies_name, agencies_slug, rin, effective_on, comments_close_on, regulations_dot_gov_docket_id, topic (singular), subtype, subtypes, updated_since, since, modified_after, page_views.
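
The probe results above lend themselves to a pre-flight check. The sketch below is illustrative only: `check_query` is a hypothetical function, and the sets simply transcribe the names and values recorded on this page.

```python
VALID_ORDERS = {"newest", "oldest", "relevance"}

# Parameter names the probes found to be rejected.
REJECTED_PARAMS = {
    "query", "q", "search", "agency_id", "agencies_short_name", "agencies_name",
    "agencies_slug", "rin", "effective_on", "comments_close_on",
    "regulations_dot_gov_docket_id", "topic", "subtype", "subtypes",
    "updated_since", "since", "modified_after", "page_views",
}

# Filter key -> response field, for the two known mismatches.
FILTER_KEY_TO_RESPONSE_FIELD = {
    "effective_date": "effective_on",
    "comment_date": "comments_close_on",
}


def check_query(param_names, order=None, term=None):
    """Return a list of human-readable problems; an empty list means nothing was flagged."""
    problems = [f"rejected parameter: {name}" for name in param_names
                if name in REJECTED_PARAMS]
    if order is not None and order not in VALID_ORDERS:
        problems.append(f"order={order} silently falls back to the default")
    if order == "relevance" and not term:
        problems.append("order=relevance is only meaningful with term")
    return problems
```

For instance, `check_query(["effective_on"])` flags the response-field name being used where the filter key `effective_date` is needed.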

fields[] edge cases. No wildcard (fields[]=* returns 400). An invalid field name returns 400 with field 'X' not valid but does not enumerate the valid set. excerpts (the ~700-char relevance-highlight string with <span class="match"> markup, present only under order=relevance) is NOT in the 47-field list. president is not a top-level document field, but it IS available via fields[]=president, returning {identifier, name} (e.g. {"identifier": "donald-trump", "name": "Donald Trump"}).

Public Inspection distinct shape. Default /public-inspection-documents.json returns the next-publication-day filings only; use conditions[available_on]=YYYY-MM-DD to browse historical PI dates. PI carries fields not in regular docs (filing_type, filed_at, pdf_file_name, pdf_file_size, pdf_updated_at, num_pages, subject_1/2/3, editorial_note, agency_letters, agency_names, and docket_numbers — note: docket_numbers, not docket_ids). filing_type enum observed: regular, special (an April 30 sample held 89 regular + 4 special). /public-inspection-documents/current.json returns sync timestamps (special_filings_updated_at, regular_filings_updated_at) that we track for incremental sync.

Body fetch (XML)

full_text_xml_url returns structured XML — this is what we parse for body text. The URL pattern is documented upstream; no CAPTCHA, no auth, same browser-shaped UA recommended.

XML structure per document type

The body XML is HD-driven and shaped by document type — this is the structure the deferred section-aware chunker would split on.

Rule (<RULE>):
- <PREAMB> → AGENCY, SUBAGY, CFR, DEPDOC, RIN, SUBJECT, AGY, ACT, SUM, EFFDATE, ADD, FURINF
- <SUPLINF> → headings, paragraphs, LSTSUB, PART, REGTEXT(AMDPAR, AUTH), SECTION(SECTNO, SUBJECT), SIG(DATED, NAME, TITLE)
- <FRDOC>, <BILCOD>
Proposed Rule (<PRORULE>):
- <PREAMB> → same shape as Rule but no EFFDATE
- <SUPLINF> → headings + paragraphs + SIG(NAME, TITLE) (no REGTEXT since regs aren't yet codified)
Notice (<NOTICE>):
- <PREAMB> → PRTPAGE, AGENCY, SUBJECT, AGY, ACT, SUM, DATES, FURINF (no CFR, no RIN)
- <SUPLINF> → HD, P*, SIG(NAME, TITLE), optional GPOTABLE (TTITLE, BOXHD/CHED, ROW/ENT)
Presidential Document (<PRESDOCU>): wraps a single sub-element by subtype:
- Executive Order: <EXECORD>(TITLE3, PRES, EXECORDR, HD, FP*, P*, GPH(GID), PSIG, PLACE, DATE, FRDOC, FILED, BILCOD)
- Proclamation: <PROCLA>(TITLE3, PRES, PROC, HD, FP*, GPH(GID), PSIG, FRDOC, FILED, BILCOD)
- Memorandum: <PRMEMO>(TITLE3, PRES, MEMO, HD*, FP*, PRTPAGE, GPH, PSIG, PLACE, DATE, FRDOC, FILED, BILCOD)
- Presidential Notice: <PRNOTICE>(TITLE3, PRES, PNOTICE, HD, FP*, GPH(GID), PSIG, PLACE, DATE, FRDOC, FILED, BILCOD)
- Determination: <DETERM>(TITLE3, PRES, DETNO, HD*, FP*, GPH, PSIG, PLACE, DATE, FRDOC, FILED, BILCOD)

Common leaf elements across types:

HD — heading (with SOURCE attribute indicating heading level: HED, HD1, HD2)
P — paragraph
FP — flush paragraph (no indent)
E — emphasis (T attribute: 01 italic, 02 bold, 03 smallcaps)
GPH / GID — graphic / graphic ID
PRTPAGE — print page break
SIG — signature block
STARS — * * * ellipsis
GPOTABLE — tables
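
To show how the HD / P / FP elements could drive extraction, here is a minimal sketch using the standard library's ElementTree. The sample XML is invented for illustration and is far smaller than a real document; `outline` is a hypothetical helper, not the ingester's parser.

```python
import xml.etree.ElementTree as ET

# Made-up sample shaped like the Notice structure described above.
SAMPLE = """<NOTICE>
  <PREAMB>
    <AGENCY>EXAMPLE AGENCY</AGENCY>
    <SUBJECT>Example Notice Title</SUBJECT>
    <SUM><HD SOURCE="HED">SUMMARY:</HD><P>Summary text with <E T="03">emphasis</E> inside.</P></SUM>
  </PREAMB>
  <SUPLINF>
    <HD SOURCE="HD1">I. Background</HD>
    <P>First paragraph.</P>
    <P>Second paragraph.</P>
  </SUPLINF>
</NOTICE>"""


def outline(xml_text):
    """Return (root_tag, items) with headings and paragraphs in document order."""
    root = ET.fromstring(xml_text)
    items = []
    for el in root.iter():
        text = "".join(el.itertext()).strip()  # flattens inline children such as E
        if el.tag == "HD":
            items.append(("heading", el.get("SOURCE"), text))
        elif el.tag in ("P", "FP"):
            items.append(("para", None, text))
    return root.tag, items


root_tag, items = outline(SAMPLE)
for kind, level, text in items:
    print(kind, level, text)
```

The root tag identifies the document type, and the SOURCE attribute on each heading gives the level a section-aware chunker would split on.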

Bulk download — GovInfo

Federal Register data is mirrored at GovInfo with a clean bulk structure (browse the tree at govinfo.gov/bulkdata/FR). Each daily XML is the full issue (~2-5 MB) containing all documents for that day in one file — useful for backfill (skip ~2.5M API calls) but the body XML is the same content the FR API points to.

Caveat (Josh hybrid strategy): GovInfo daily XML contains the body content of all documents in that day's issue, but does NOT contain the metadata that's only in the FR API (e.g., regulations_dot_gov_info, dockets as full objects, regulation_id_number_info). Hybrid strategy: GovInfo bulk for body backfill if the API is too slow; FR API for metadata always.

Caching / incremental sync

No HTTP caching exposed. All endpoints return:

cache-control: no-store, no-cache, must-revalidate, private
pragma: no-cache

No Last-Modified, no ETag. There's an internal age: header from a CDN but it's not actionable.

No updated_at / last_modified field on document records. The only timestamps are page_views.last_updated (popularity) and regulations_dot_gov_info.checked_regulationsdotgov_at (tertiary metadata). Documents are effectively immutable post-publication except for corrections, which arrive as new documents linked via correction_of / corrections.

No conditions[updated_since] filter exists — all variants rejected.

Practical incremental strategy: the document_number format {YYYY}-NNNNN (e.g. 2026-08558) is monotonic per publication date. Sync with conditions[publication_date][gte]={last_seen} and order=newest. Track per-source state in ingestion_runs. Watch corrections via conditions[correction]=1 with a date filter.

Rate limits

Rate limits as practiced, not as documented: no X-RateLimit-* headers, no Retry-After seen. Empirical: 20 sequential requests at ~1 req/sec all returned HTTP 200 with no throttling. Cap not characterized. Be polite — start at 1 req/sec, scale carefully if needed.

For backfill with concurrent workers: 4-8 workers each at ~1 req/sec is probably safe. Stop and back off if any worker sees a 429 or 503.

Schema (indicative — SQLite)

The migration in shared/josh_substrate/.../migrations/versions/ is the source of truth. The block below is illustrative — keep it as a guide for the field map and the relational shape, but assume migrations win when they disagree. SQLite-flavored: TEXT for IDs/timestamps, INTEGER for booleans (0/1), TEXT containing JSON for raw_json, junction tables instead of arrays.

-- Canonical FR document table
CREATE TABLE fr_documents (
    -- Stable canonical ID — what the agent passes between turns
    id TEXT PRIMARY KEY,                          -- 'fr:2026-08558'
    document_number TEXT NOT NULL UNIQUE,         -- '2026-08558'
    citation TEXT,                                -- '91 FR 24038'

    -- Type / classification
    type TEXT NOT NULL CHECK (type IN (
        'Rule', 'Proposed Rule', 'Notice',
        'Presidential Document', 'Correction', 'Uncategorized Document'
    )),
    subtype TEXT,
    presidential_doc_type TEXT CHECK (presidential_doc_type IS NULL OR presidential_doc_type IN (
        'executive_order', 'proclamation', 'determination',
        'notice', 'memorandum', 'presidential_order', 'other'
    )),
    notice_type TEXT CHECK (notice_type IS NULL OR notice_type IN (
        'sorn', 'sunshine_act_meeting'
    )),

    -- Dates as ISO-8601 TEXT ('YYYY-MM-DD'); lexicographic ordering matches chronological
    publication_date TEXT NOT NULL,
    effective_on TEXT,
    comments_close_on TEXT,
    signing_date TEXT,

    -- Content
    title TEXT NOT NULL,
    abstract TEXT,
    action TEXT,                                  -- "Notice; request for comment."
    dates_text TEXT,                              -- the API's `dates` field
    body_text TEXT,                               -- extracted from XML
    body_xml TEXT,                                -- raw XML preserved for re-parsing

    -- Presidential-specific
    presidential_doc_number TEXT,
    executive_order_number TEXT,
    proclamation_number TEXT,
    executive_order_notes TEXT,
    president_identifier TEXT,                    -- e.g. 'donald-trump'

    -- Issue paging in printed FR
    volume INTEGER,
    start_page INTEGER,
    end_page INTEGER,
    page_length INTEGER,

    -- Flags (0/1; SQLite has no boolean)
    significant INTEGER,
    not_received_for_publication INTEGER,

    -- Versioning relationships
    correction_of TEXT REFERENCES fr_documents(id) ON DELETE SET NULL,

    -- Source URLs (subset; the upstream API docs list the full set)
    json_url TEXT,
    html_url TEXT,
    body_html_url TEXT,
    full_text_xml_url TEXT,
    raw_text_url TEXT,
    pdf_url TEXT,
    public_inspection_pdf_url TEXT,
    mods_url TEXT,
    regulations_dot_gov_url TEXT,
    comment_url TEXT,

    -- Lifecycle / forensics
    raw_json TEXT NOT NULL,                       -- full detail-endpoint response, JSON
    fetched_at TEXT NOT NULL,                     -- ISO-8601 UTC
    parsed_at TEXT,
    body_fetched_at TEXT,

    -- Maintenance
    inserted_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')),
    updated_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now'))
);

-- Filter-friendly indexes
CREATE INDEX fr_documents_pub_date ON fr_documents (publication_date DESC);
CREATE INDEX fr_documents_type_pub ON fr_documents (type, publication_date DESC);
CREATE INDEX fr_documents_comment_close
    ON fr_documents (comments_close_on)
    WHERE comments_close_on IS NOT NULL;
CREATE INDEX fr_documents_effective
    ON fr_documents (effective_on)
    WHERE effective_on IS NOT NULL;
CREATE INDEX fr_documents_correction_of ON fr_documents (correction_of) WHERE correction_of IS NOT NULL;
CREATE INDEX fr_documents_significant ON fr_documents (publication_date DESC) WHERE significant = 1;

-- RINs and docket IDs are arrays in the FR API; in SQLite they live in junction tables
CREATE TABLE fr_document_rins (
    fr_document_id TEXT NOT NULL REFERENCES fr_documents(id) ON DELETE CASCADE,
    rin TEXT NOT NULL,                            -- e.g. '3150-AL36'
    PRIMARY KEY (fr_document_id, rin)
);
CREATE INDEX fr_document_rins_rin ON fr_document_rins (rin);

CREATE TABLE fr_document_docket_ids (
    fr_document_id TEXT NOT NULL REFERENCES fr_documents(id) ON DELETE CASCADE,
    docket_id TEXT NOT NULL,                      -- agency docket string (NOT regs.gov)
    PRIMARY KEY (fr_document_id, docket_id)
);
CREATE INDEX fr_document_docket_ids_docket ON fr_document_docket_ids (docket_id);

-- Full-text search via FTS5 with column weights applied at query time
-- Use bm25(fr_documents_fts, 0.0, w_title, w_abstract, w_action, w_body) at SELECT time;
-- weights are positional, so the leading 0.0 covers the UNINDEXED fr_document_id column.
CREATE VIRTUAL TABLE fr_documents_fts USING fts5(
    fr_document_id UNINDEXED,
    title,
    abstract,
    action,
    body,
    tokenize = 'porter unicode61'
);

-- Triggers keep the FTS5 mirror in sync with the parent table
CREATE TRIGGER fr_documents_ai AFTER INSERT ON fr_documents BEGIN
    INSERT INTO fr_documents_fts (fr_document_id, title, abstract, action, body)
    VALUES (NEW.id, COALESCE(NEW.title,''), COALESCE(NEW.abstract,''),
            COALESCE(NEW.action,''), COALESCE(NEW.body_text,''));
END;
CREATE TRIGGER fr_documents_ad AFTER DELETE ON fr_documents BEGIN
    DELETE FROM fr_documents_fts WHERE fr_document_id = OLD.id;
END;
CREATE TRIGGER fr_documents_au AFTER UPDATE ON fr_documents BEGIN
    DELETE FROM fr_documents_fts WHERE fr_document_id = OLD.id;
    INSERT INTO fr_documents_fts (fr_document_id, title, abstract, action, body)
    VALUES (NEW.id, COALESCE(NEW.title,''), COALESCE(NEW.abstract,''),
            COALESCE(NEW.action,''), COALESCE(NEW.body_text,''));
END;

-- Agencies (shared with other sources)
CREATE TABLE agencies (
    id TEXT PRIMARY KEY,                          -- 'agency:securities-and-exchange-commission'
    fr_agency_id INTEGER UNIQUE,                  -- FR's numeric ID
    slug TEXT UNIQUE NOT NULL,                    -- 'securities-and-exchange-commission'
    short_name TEXT,                              -- 'SEC'
    name TEXT NOT NULL,
    raw_name TEXT,
    parent_id TEXT REFERENCES agencies(id),
    description TEXT,
    url TEXT
);
CREATE INDEX agencies_parent ON agencies (parent_id) WHERE parent_id IS NOT NULL;

CREATE TABLE fr_document_agencies (
    fr_document_id TEXT NOT NULL REFERENCES fr_documents(id) ON DELETE CASCADE,
    agency_id TEXT NOT NULL REFERENCES agencies(id) ON DELETE CASCADE,
    PRIMARY KEY (fr_document_id, agency_id)
);

-- Topics (subject taxonomy)
CREATE TABLE topics (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT NOT NULL UNIQUE
);

CREATE TABLE fr_document_topics (
    fr_document_id TEXT NOT NULL REFERENCES fr_documents(id) ON DELETE CASCADE,
    topic_id INTEGER NOT NULL REFERENCES topics(id) ON DELETE CASCADE,
    PRIMARY KEY (fr_document_id, topic_id)
);

-- CFR cross-references
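CREATE TABLE fr_document_cfr_refs (
    fr_document_id TEXT NOT NULL REFERENCES fr_documents(id) ON DELETE CASCADE,
    cfr_title INTEGER NOT NULL,
    cfr_chapter TEXT,
    cfr_part TEXT,
    citation_url TEXT
);
-- SQLite prohibits expressions inside PRIMARY KEY / UNIQUE constraints, so uniqueness
-- over the nullable chapter/part columns is enforced with an expression index instead.
CREATE UNIQUE INDEX fr_document_cfr_refs_uniq
    ON fr_document_cfr_refs (fr_document_id, cfr_title, COALESCE(cfr_chapter, ''), COALESCE(cfr_part, ''));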

-- Public Inspection (separate table — distinct lifecycle and shape)
CREATE TABLE fr_public_inspection (
    id TEXT PRIMARY KEY,                          -- 'fr-pi:2026-08677'
    document_number TEXT NOT NULL,                -- shared namespace with fr_documents
    type TEXT,
    subtype TEXT,
    title TEXT NOT NULL,
    filing_type TEXT NOT NULL CHECK (filing_type IN ('regular', 'special')),
    filed_at TEXT NOT NULL,
    publication_date TEXT,                        -- target publication date
    num_pages INTEGER,
    subject_1 TEXT,
    subject_2 TEXT,
    subject_3 TEXT,
    editorial_note TEXT,
    pdf_url TEXT,
    pdf_file_name TEXT,
    pdf_file_size INTEGER,
    pdf_updated_at TEXT,
    raw_text_url TEXT,
    html_url TEXT,
    json_url TEXT,
    raw_json TEXT NOT NULL,
    fetched_at TEXT NOT NULL,
    inserted_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')),
    -- When the doc publishes, we don't delete this row — keep for audit trail
    published_as_id TEXT REFERENCES fr_documents(id) ON DELETE SET NULL
);
CREATE INDEX fr_pi_publication_date ON fr_public_inspection (publication_date DESC);
CREATE INDEX fr_pi_filed_at ON fr_public_inspection (filed_at DESC);

-- PI docket numbers in their own junction table (PI uses 'docket_numbers', not 'docket_ids')
CREATE TABLE fr_pi_docket_numbers (
    fr_pi_id TEXT NOT NULL REFERENCES fr_public_inspection(id) ON DELETE CASCADE,
    docket_number TEXT NOT NULL,
    PRIMARY KEY (fr_pi_id, docket_number)
);

-- Vector chunks via sqlite-vec
-- BQ+rescore today: store both binary (fast scan) and float (rescore) — sqlite-vec supports
-- both at modest space cost (binary alone is what gets brute-force-scanned).
-- Migrate to vec1 when it ships a stable release: same payload, different virtual table.
CREATE TABLE fr_document_chunks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    fr_document_id TEXT NOT NULL REFERENCES fr_documents(id) ON DELETE CASCADE,
    chunk_index INTEGER NOT NULL,
    chunk_text TEXT NOT NULL,
    UNIQUE (fr_document_id, chunk_index)
);

-- Companion vec0 virtual table holds the embeddings keyed back to fr_document_chunks.id
CREATE VIRTUAL TABLE fr_document_chunk_vecs USING vec0(
    chunk_id INTEGER PRIMARY KEY,
    embedding float[1024],                        -- full-precision for rescore
    embedding_bq bit[1024]                        -- binary-quantized for fast brute-force scan
);

Schema decisions worth flagging:

raw_json always preserved — when we add fields later, we can re-parse without re-fetching.
Body text and body XML both stored: body_text for search, body_xml for re-extraction (section-aware navigation when we add it).
Type-specific fields nullable on the same table rather than per-type subtables. Most queries don't care about type, and the agent's fetch_rule returns one shape regardless. CHECK constraints enforce valid enums.
Public Inspection in a separate table rather than overloading fr_documents with a status flag. Different shape, different lifecycle, link via published_as_id on publication.
No topics enum constraint — the topic list is open-ended; we capture as we see them.
Vector chunks separate from the doc table because chunking strategy may change (we re-chunk and re-embed without rewriting fr_documents).
Triggers feed FTS5 rather than generated columns: SQLite generated columns can't directly source an FTS5 virtual table, and the 3-trigger pattern (insert, delete, update) is the standard SQLite idiom.

Normalization rationale (derived from probing the API shapes): RINs and docket_ids are arrays in the FR API, so they become junction tables (fr_document_rins, fr_document_docket_ids). The full dockets objects are sparse — many notices/rules have empty dockets even with non-empty docket_ids. Public Inspection uses docket_numbers (not docket_ids), so it gets its own fr_pi_docket_numbers junction table.
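
The array-to-junction-table mapping can be sketched with the standard sqlite3 module. The tables below are trimmed copies of the indicative schema, `load_document` is a hypothetical loader, and the response key `regulation_id_numbers` for the RIN array is an assumption (only docket_ids is named on this page). The upsert syntax needs SQLite 3.24 or newer.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE fr_documents (
    id TEXT PRIMARY KEY,
    document_number TEXT NOT NULL UNIQUE,
    title TEXT NOT NULL,
    raw_json TEXT NOT NULL
);
CREATE TABLE fr_document_rins (
    fr_document_id TEXT NOT NULL REFERENCES fr_documents(id) ON DELETE CASCADE,
    rin TEXT NOT NULL,
    PRIMARY KEY (fr_document_id, rin)
);
CREATE TABLE fr_document_docket_ids (
    fr_document_id TEXT NOT NULL REFERENCES fr_documents(id) ON DELETE CASCADE,
    docket_id TEXT NOT NULL,
    PRIMARY KEY (fr_document_id, docket_id)
);
"""

# (junction table, value column, assumed key in the API response)
ARRAY_FIELDS = (
    ("fr_document_rins", "rin", "regulation_id_numbers"),
    ("fr_document_docket_ids", "docket_id", "docket_ids"),
)


def load_document(conn, doc):
    """Upsert one document and rewrite its junction rows in a single transaction."""
    doc_id = f"fr:{doc['document_number']}"
    with conn:
        conn.execute(
            "INSERT INTO fr_documents (id, document_number, title, raw_json) "
            "VALUES (?, ?, ?, ?) "
            "ON CONFLICT(id) DO UPDATE SET title = excluded.title, "
            "raw_json = excluded.raw_json",
            (doc_id, doc["document_number"], doc["title"], json.dumps(doc)),
        )
        for table, column, key in ARRAY_FIELDS:
            conn.execute(f"DELETE FROM {table} WHERE fr_document_id = ?", (doc_id,))
            conn.executemany(
                f"INSERT OR IGNORE INTO {table} (fr_document_id, {column}) VALUES (?, ?)",
                [(doc_id, value) for value in doc.get(key) or []],
            )
    return doc_id


conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(SCHEMA)
```

Deleting and reinserting the junction rows on every load keeps them in step with the latest raw_json without diffing arrays.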

Download / update strategy

Backfill (one-time, 1994 → present)

Recommended path: API-driven, parallel workers, chunked by month.

For each month from 1994-01 through current month:
- Page through /documents.json?conditions[publication_date][gte]={month-start}&conditions[publication_date][lte]={month-end}&per_page=1000 (at ~6-7K docs/month a month normally fits under the 10K cap; treat a reported count of 10,000 as a capped window and split it).
- Save raw list response to /data/corpus/federal_register/list/{YYYY}/{MM}.json.
Extract the document_number set from list responses.
For each document_number:
- Fetch detail: /documents/{number}.json. Save to /data/corpus/federal_register/metadata/{YYYY}/{MM}/{number}.json.
- Fetch body XML from full_text_xml_url. Save to /data/corpus/federal_register/bodies/xml/{YYYY}/{MM}/{number}.xml.
Parse + load into SQLite via the loader stage.
Embed + index via the indexer stage.
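
The month-by-month chunking in the steps above can be sketched as a generator. This is an illustration, not the ingester: `month_windows` and `list_path` are hypothetical names, and the path simply follows the list/{YYYY}/{MM}.json layout given above.

```python
from datetime import date, timedelta


def month_windows(first: date, last: date):
    """Yield (window_start, window_end) per calendar month, clipped to first..last."""
    cur = first
    while cur <= last:
        # First day of the following month (December rolls into next January).
        nxt = date(cur.year + (cur.month == 12), cur.month % 12 + 1, 1)
        yield cur, min(nxt - timedelta(days=1), last)
        cur = nxt


def list_path(window_start: date) -> str:
    """Where the raw list response for that month would be saved."""
    return f"/data/corpus/federal_register/list/{window_start:%Y}/{window_start:%m}.json"


for start, end in month_windows(date(1994, 1, 3), date(1994, 3, 15)):
    print(start.isoformat(), end.isoformat(), list_path(start))
```

Each (start, end) pair maps directly onto the gte / lte publication_date conditions, so a worker can take one window at a time with no shared state.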

Concurrency. 4-8 workers, share-nothing. Each worker takes one month at a time. Watch for 429/503 — if seen, back off all workers.

Time budget. ~2.5M docs × ~2 API calls each (detail + body) = ~5M calls. At 4 workers × 1 req/sec = ~14 days. At 8 workers × 1.5 req/sec = ~5 days. Reasonable.
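
The arithmetic behind those two estimates, as a quick check. `backfill_days` is a hypothetical helper; it assumes every worker sustains its rate around the clock with no retries or pauses.

```python
SECONDS_PER_DAY = 86_400


def backfill_days(total_docs, calls_per_doc, workers, req_per_sec_per_worker):
    """Wall-clock days to issue every call at a steady aggregate rate."""
    total_calls = total_docs * calls_per_doc
    aggregate_rate = workers * req_per_sec_per_worker  # requests per second
    return total_calls / aggregate_rate / SECONDS_PER_DAY


slow = backfill_days(2_500_000, 2, workers=4, req_per_sec_per_worker=1.0)
fast = backfill_days(2_500_000, 2, workers=8, req_per_sec_per_worker=1.5)
print(round(slow, 1))  # 14.5
print(round(fast, 1))  # 4.8
```

Any backoff time from 429/503 responses adds to these figures, so they are lower bounds rather than schedules.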

GovInfo bulk fallback. If API rate limits become painful, switch body fetches to GovInfo daily XML downloads (each daily XML has all docs for that day in one ~3 MB file — way cheaper). Keep API for metadata always.

Daily incremental

Three scheduled ticks, all triggered by cron / systemd timer in josh-ingester:

8:00 AM ET — sync Public Inspection:
- Hit /public-inspection-documents/current.json, capture regular_filings_updated_at and special_filings_updated_at.
- Skip if both timestamps are unchanged from last run.
- Otherwise iterate /public-inspection-documents.json?conditions[available_on]={YYYY-MM-DD} (today or the next publication day) to refresh PI rows.
10:00 AM ET — sync newly-published documents:
- /documents.json?conditions[publication_date][is]={today}&order=newest&per_page=1000.
- Upsert metadata + fetch bodies for new documents.
- For each PI row whose document_number now resolves on /documents/{number}.json, upsert the published doc and link fr_public_inspection.published_as_id.
Hourly tick — sync corrections:
- /documents.json?conditions[correction]=1&conditions[publication_date][gte]={last_run} to catch corrections to historical documents.
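
The skip check in the Public Inspection tick and the query for the corrections tick can be sketched as two small functions. Both names are hypothetical, and the sketch assumes current.json exposes the two timestamps as top-level keys, which this page does not confirm.

```python
PI_TIMESTAMP_KEYS = ("regular_filings_updated_at", "special_filings_updated_at")


def pi_sync_needed(current: dict, last_seen: dict) -> bool:
    """True when either PI sync timestamp differs from the previous run's value."""
    return any(current.get(key) != last_seen.get(key) for key in PI_TIMESTAMP_KEYS)


def corrections_params(last_run_date: str) -> dict:
    """Query parameters for the corrections tick; last_run_date is 'YYYY-MM-DD'."""
    return {
        "conditions[correction]": "1",
        "conditions[publication_date][gte]": last_run_date,
    }
```

A first run with an empty `last_seen` compares every timestamp against None and therefore always syncs.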

State tracking

Per architecture doc: ingestion state lives in SQLite tables ingestion_runs, ingestion_logs, ingestion_source_state, ingestion_tasks. Each run records source name, started/finished timestamps, status, rows inserted/updated/skipped, and a config JSON column with the actual parameters (date range, etc.) so we can re-run identically.

Failure modes to handle

HTTP 429 / 503: exponential backoff up to a few minutes; if it persists, halt and alert.
HTTP 404 on a doc number that came from the list: possible if a doc is withdrawn between list and detail. Skip and log.
Body XML missing or malformed: log, leave body_fetched_at NULL, continue with metadata. The body can be retried later.
Schema-breaking changes in API response: raw_json is preserved; alert on unknown top-level keys and investigate manually.
Disk space on /data: monitor; the corpus alone is ~50-80 GB compressed and the volume is 100 GB, so expand if needed.
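
The 429/503 handling can be sketched as a capped exponential backoff. This is one possible shape, with assumed parameters (2-second base, 5-minute cap, 8 attempts) chosen only to match "up to a few minutes"; `fetch` and `sleep` are injected callables so the logic stays testable without network access.

```python
RETRYABLE = {429, 503}


def backoff_delays(base=2.0, cap=300.0, max_attempts=8):
    """Yield delays in seconds that double from base until they reach cap."""
    for attempt in range(max_attempts):
        yield min(cap, base * 2 ** attempt)


def fetch_with_backoff(fetch, sleep, **backoff_kwargs):
    """Call fetch() -> (status, body), retrying while the status is 429 or 503.

    Returns the first non-retryable (status, body), so the caller still decides
    what to do with a 404. Raises once the retry budget is spent, which is the
    point to halt workers and alert.
    """
    for delay in backoff_delays(**backoff_kwargs):
        status, body = fetch()
        if status not in RETRYABLE:
            return status, body
        sleep(delay)
    raise RuntimeError("still throttled after all retries; halt and alert")
```

In production `sleep` would be `time.sleep`; with several workers, adding random jitter to each delay would keep them from retrying in lockstep.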

Open questions

These don't block ingestion but should be resolved before this source is "shipped":

topics[] canonical enum — observed values work for some but URL-encoding of names with commas/spaces is fragile. Consider scraping the topics index page or building our own normalized topic table over time.
search_type_id parameter — accepted but undocumented, no observable effect. Likely safe to ignore.
Cap characterization — what's the actual concurrent rate ceiling? Will require empirical testing once ingestion runs.
~~Postgres FTS engine choice~~ — superseded. Storage backend moved from Postgres to SQLite (decision in Architecture, Storage stack). Search uses SQLite FTS5 with BM25 + per-column weights; vector retrieval uses sqlite-vec brute-force + binary-quantization+rescore today, with a vec1 IVFADC+OPQ migration planned when it releases.
~~Embedding model and dimension~~ — resolved. Locked OSS-only on Snowflake Arctic-Embed-M-v2 (1024-dim, MIT license, runs on CPU). The vec0 schema column is embedding float[1024] + embedding_bq bit[1024]. See Architecture, Storage stack.
Chunking strategy — fixed-size with overlap vs section-aware (split at FR's <HD>/<SECTION> headers). Section-aware is better but more parsing work; defer to v1.x.
Citation extraction (eyecite) — when we add the citation graph table, we need to decide if extraction runs at parse time (slower ingestion) or as a separate batch pass.
Public Inspection retention — do we keep PI rows forever for audit, or prune after a window? Lean toward keeping forever (small data, useful history).

docs/sources/federal-register.html · Last updated 2026-05-29