Federal Register
Daily journal of US Government rules, proposed rules, notices, and presidential documents. Published by the Office of the Federal Register (OFR), part of the National Archives. Coverage from 1994-01-03 through present, ~70-80K documents per year.
This source is the test case for the v1 ingestion framework. Federal Register's API is clean enough to learn from, the data shapes are rich enough to stress-test the schema, and the v1 wedge feature (regulatory comment drafting) depends on this source being solid.
The authoritative spec for the ingester is federal-register-ingester; the canonical upstream API reference is the Federal Register developer documentation (note: the developer-docs web page is bot-walled — see Access notes). This page records only what Josh decided and probed about the source.
Quick reference
| Source name | Federal Register |
| Publisher | Office of the Federal Register / National Archives |
| License | Public domain (US government work) |
| Coverage | 1994-01-03 – present |
| Volume | ~2.5M docs total, ~70-80K/year, ~6-7K/month, ~200-400/day |
| API base | https://www.federalregister.gov/api/v1/ |
| Bulk base | https://www.govinfo.gov/bulkdata/json/FR/ |
| Auth | None — no API key required |
| Rate limit | None documented; ~1 req/sec sustained ran without throttling (ceiling not characterized) |
| Caching headers | None usable (Cache-Control: no-store; no ETag, no Last-Modified) |
| Stable ID format | fr:{document_number} e.g. fr:2026-08558 |
| Status | exploring — schema drafted, ingestion not built |
Access notes
The public docs page at federalregister.gov/developers/documentation/api/v1
is bot-walled (CAPTCHA via unblock.federalregister.gov). The API endpoints
themselves are wide open — no auth, no CAPTCHA, just polite rate limiting
expected.
A real-browser User-Agent header is required to avoid the bot wall on web pages. The API endpoints accept any UA, but using a browser-shaped UA is safer:
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36
Upstream API reference
The endpoint catalog, the full conditions[*] filter key list, the fields[]
sparse-fieldset mechanics, the thin/fat document field enumerations, the embedded
JSON object shapes (regulations_dot_gov_info, dockets,
regulation_id_number_info), the Public Inspection field list, the body-fetch
XML URL pattern, and the presidential_document_type / notice_type enum value
lists are all documented upstream — see the
Federal Register API v1 documentation.
The GovInfo bulk-download URL tree is at
govinfo.gov/bulkdata/FR.
Two structural facts worth keeping at hand: a single document detail fetch
(GET /documents/{document_number}.json) returns the fat shape (47 fields)
while the list endpoint (GET /documents.json) returns a thin shape (10 fields)
unless fields[] is requested; and the document_number namespace is shared
between Public Inspection and Published — a PI doc retains its number after
publication, only the endpoint changes.
Josh-learned quirks
These contradict or are absent from the upstream docs; they were discovered by probing this mirror directly.
Filter-key vs response-field mismatches (the filter key is not the field name in the response):
- Filter key is effective_date, response field is effective_on.
- Filter key is comment_date, response field is comments_close_on.
Silent failures (no error, just wrong/empty results):
- Comma-separated value lists silently fail: no error, just count: null. Use repeated [] keys (multi-value within one key = OR; multiple keys = AND).
- Negation ([not]) is silently ignored.
- An invalid order= value falls back to the default with no error. Only three order values produce distinct ordering: order=newest (publication_date desc, the default), order=oldest (publication_date asc), and order=relevance (only meaningful with term; injects a synthetic excerpts field). All other values silently fall back to default.
- topics[] takes the canonical TOC topic name as an exact, case-sensitive match; topics with commas/spaces (e.g. "Banks, banking") need careful URL encoding and are fragile.
The 10,000 hard cap — load-bearing for backfill windowing. There is a hard
10,000-result cap per query: total_pages ceilings at 50 with per_page=200,
and asking for page=51 silently returns page 1 (no error). The count
field is also capped at 10,000 — a single query spanning more than 10K
documents reports count: 10000 (or lower) and silently undercounts the true
total. Do not size a backfill window off one count read; paginate/window
by publication_date and sum the per-window totals instead. (Confirmed
2026-05-29 against the corrected ingester spec.)
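As a rough illustration of the windowing advice above, the sketch below sums per-window counts and splits any window whose reported count reaches the cap. The `fetch_count` callable is hypothetical: it stands in for a GET on /documents.json filtered by conditions[publication_date][gte] / [lte] that reads the count field. Treating any count at or above 10,000 as untrustworthy is an assumption drawn from the behaviour described here, not a documented contract.

```python
from datetime import date, timedelta
from typing import Callable

RESULT_CAP = 10_000  # hard per-query cap described above


def count_in_windows(
    start: date,
    end: date,
    fetch_count: Callable[[date, date], int],  # hypothetical: wraps the list endpoint
) -> int:
    """Sum document counts over [start, end], splitting windows that hit the cap."""
    reported = fetch_count(start, end)
    if reported < RESULT_CAP or start == end:
        # Below the cap the count is usable; a single day cannot be split further.
        return reported
    mid = start + (end - start) // 2
    return (
        count_in_windows(start, mid, fetch_count)
        + count_in_windows(mid + timedelta(days=1), end, fetch_count)
    )
```

A capped window is bisected by date until every leaf window reports fewer than 10,000 documents, so the sum never relies on a single capped read.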
per_page confirmed up to 1000. Brackets must be URL-encoded as %5B /
%5D.
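The repeated-[] rule and the bracket-encoding requirement can be sketched with the standard library alone. This is a minimal example, not the ingester's actual client; `documents_url` and its argument names are made up for illustration. `urllib.parse.urlencode` percent-encodes `[` and `]`, and passing a list of pairs lets the same key repeat.

```python
from urllib.parse import urlencode

API_BASE = "https://www.federalregister.gov/api/v1"


def documents_url(types, start, end, per_page=1000, page=1):
    """Build a /documents.json URL with repeated [] keys and encoded brackets."""
    # One (key, value) pair per type code: repeated keys, never a comma-joined list.
    params = [("conditions[type][]", code) for code in types]
    params += [
        ("conditions[publication_date][gte]", start),
        ("conditions[publication_date][lte]", end),
        ("per_page", per_page),
        ("page", page),
    ]
    return f"{API_BASE}/documents.json?{urlencode(params)}"


url = documents_url(["RULE", "PRORULE"], "2026-01-01", "2026-01-31")
```

Here `url` carries `conditions%5Btype%5D%5B%5D=RULE&conditions%5Btype%5D%5B%5D=PRORULE`, which is the OR-within-one-key form described above.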
The CORRECT-not-CORRECTION trap. conditions[type][] takes uppercase
codes (RULE, PRORULE, NOTICE, PRESDOCU, CORRECT, UNKNOWN) while
result.type returns display values (Rule, Proposed Rule, Notice,
Presidential Document, Correction, Uncategorized Document). The code for
corrections is CORRECT — passing the full word CORRECTION returns 0 results
with no error.
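Because a wrong code returns zero results rather than an error, it may be worth validating codes client-side before sending a request. The sketch below assumes the codes and display values listed above pair up positionally; `validate_type_codes` is a hypothetical helper name.

```python
# Filter codes (request side) mapped to result.type display values (response side).
TYPE_CODE_TO_DISPLAY = {
    "RULE": "Rule",
    "PRORULE": "Proposed Rule",
    "NOTICE": "Notice",
    "PRESDOCU": "Presidential Document",
    "CORRECT": "Correction",
    "UNKNOWN": "Uncategorized Document",
}
DISPLAY_TO_TYPE_CODE = {v: k for k, v in TYPE_CODE_TO_DISPLAY.items()}


def validate_type_codes(codes):
    """Raise early on codes the API would silently answer with 0 results."""
    bad = [c for c in codes if c not in TYPE_CODE_TO_DISPLAY]
    if bad:
        raise ValueError(f"unknown conditions[type][] codes: {bad}")
    return list(codes)
```

With this in place, passing CORRECTION fails loudly in the client instead of quietly producing an empty window.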
Rejected as invalid (negative probe results; this mirror does not accept
these parameter names): query, q, search, agency_id (singular),
agencies_short_name, agencies_name, agencies_slug, rin, effective_on,
comments_close_on, regulations_dot_gov_docket_id, topic (singular),
subtype, subtypes, updated_since, since, modified_after, page_views.
fields[] edge cases. No wildcard (fields[]=* returns 400). An invalid
field name returns 400 with field 'X' not valid but does not enumerate the
valid set. excerpts (the ~700-char relevance-highlight string with
<span class="match"> markup, present only under order=relevance) is NOT in
the 47-field list. president is not a top-level document field, but it IS
available via fields[]=president, returning {identifier, name} (e.g.
{"identifier": "donald-trump", "name": "Donald Trump"}).
Public Inspection distinct shape. Default
/public-inspection-documents.json returns the next-publication-day filings
only; use conditions[available_on]=YYYY-MM-DD to browse historical PI dates.
PI carries fields not in regular docs (filing_type, filed_at,
pdf_file_name, pdf_file_size, pdf_updated_at, num_pages,
subject_1/2/3, editorial_note, agency_letters, agency_names, and
docket_numbers — note: docket_numbers, not docket_ids). filing_type
enum observed: regular, special (an April 30 sample held 89 regular + 4
special). /public-inspection-documents/current.json returns sync timestamps
(special_filings_updated_at, regular_filings_updated_at) that we track for
incremental sync.
Body fetch (XML)
full_text_xml_url returns structured XML; this is what we parse for body
text. The URL pattern is documented upstream; no CAPTCHA, no auth, same
browser-shaped UA recommended.
XML structure per document type
The body XML is HD-driven and shaped by document type; this is the structure the deferred section-aware chunker would split on.
- Rule (<RULE>):
  - <PREAMB> → AGENCY, SUBAGY, CFR, DEPDOC, RIN, SUBJECT, AGY, ACT, SUM, EFFDATE, ADD, FURINF
  - <SUPLINF> → headings, paragraphs, LSTSUB, PART, REGTEXT(AMDPAR, AUTH), SECTION(SECTNO, SUBJECT), SIG(DATED, NAME, TITLE)
  - <FRDOC>, <BILCOD>
- Proposed Rule (<PRORULE>):
  - <PREAMB> → same shape as Rule but no EFFDATE
  - <SUPLINF> → headings + paragraphs + SIG(NAME, TITLE) (no REGTEXT since regs aren't yet codified)
- Notice (<NOTICE>):
  - <PREAMB> → PRTPAGE, AGENCY, SUBJECT, AGY, ACT, SUM, DATES, FURINF (no CFR, no RIN)
  - <SUPLINF> → HD, P*, SIG(NAME, TITLE), optional GPOTABLE (TTITLE, BOXHD/CHED, ROW/ENT)
- Presidential Document (<PRESDOCU>): wraps a single sub-element by subtype:
  - Executive Order: <EXECORD> (TITLE3, PRES, EXECORDR, HD, FP*, P*, GPH(GID), PSIG, PLACE, DATE, FRDOC, FILED, BILCOD)
  - Proclamation: <PROCLA> (TITLE3, PRES, PROC, HD, FP*, GPH(GID), PSIG, FRDOC, FILED, BILCOD)
  - Memorandum: <PRMEMO> (TITLE3, PRES, MEMO, HD*, FP*, PRTPAGE, GPH, PSIG, PLACE, DATE, FRDOC, FILED, BILCOD)
  - Presidential Notice: <PRNOTICE> (TITLE3, PRES, PNOTICE, HD, FP*, GPH(GID), PSIG, PLACE, DATE, FRDOC, FILED, BILCOD)
  - Determination: <DETERM> (TITLE3, PRES, DETNO, HD*, FP*, GPH, PSIG, PLACE, DATE, FRDOC, FILED, BILCOD)
Common leaf elements across types:
- HD: heading (with SOURCE attribute indicating heading level: HED, HD1, HD2)
- P: paragraph
- FP: flush paragraph (no indent)
- E: emphasis (T attribute: 01 italic, 02 bold, 03 smallcaps)
- GPH / GID: graphic / graphic ID
- PRTPAGE: print page break
- SIG: signature block
- STARS: * * * ellipsis
- GPOTABLE: tables
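To make the leaf elements concrete, here is a small sketch that walks a body document and pulls out headings and paragraphs in document order. The SAMPLE string is a hand-written stand-in that only borrows element names from the lists above; it is not a real Federal Register document, and `extract_blocks` is a hypothetical helper, not the ingester's parser.

```python
import xml.etree.ElementTree as ET

# Made-up sample using the element names listed above (not real FR content).
SAMPLE = """<NOTICE>
  <PREAMB>
    <AGENCY>EXAMPLE AGENCY</AGENCY>
    <SUBJECT>Example notice title</SUBJECT>
    <SUM><HD SOURCE="HED">SUMMARY:</HD><P>First <E T="03">emphasized</E> paragraph.</P></SUM>
  </PREAMB>
  <SUPLINF>
    <HD SOURCE="HD1">I. Background</HD>
    <P>Body paragraph.</P>
    <FP>Flush paragraph.</FP>
  </SUPLINF>
</NOTICE>"""


def extract_blocks(xml_text):
    """Return (tag, heading_level, text) for each HD / P / FP element, in order."""
    root = ET.fromstring(xml_text)
    blocks = []
    for el in root.iter():
        if el.tag in ("HD", "P", "FP"):
            # itertext() flattens inline children such as <E>; split/join tidies whitespace.
            text = " ".join("".join(el.itertext()).split())
            if text:
                level = el.get("SOURCE") if el.tag == "HD" else None
                blocks.append((el.tag, level, text))
    return blocks


blocks = extract_blocks(SAMPLE)
```

A section-aware chunker could start from the same walk and open a new chunk whenever an HD block appears.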
Bulk download — GovInfo
Federal Register data is mirrored at GovInfo with a clean bulk structure (browse the tree at govinfo.gov/bulkdata/FR). Each daily XML is the full issue (~2-5 MB) containing all documents for that day in one file, which is useful for backfill (skip ~2.5M API calls), but the body XML is the same content the FR API points to.
Caveat (Josh hybrid strategy): GovInfo daily XML contains the body content
of all documents in that day's issue, but does NOT contain the metadata that's
only in the FR API (e.g., regulations_dot_gov_info, dockets as full objects,
regulation_id_number_info). Hybrid strategy: GovInfo bulk for body backfill if
the API is too slow; FR API for metadata always.
Caching / incremental sync
No HTTP caching exposed. All endpoints return:
cache-control: no-store, no-cache, must-revalidate, private
pragma: no-cache
No Last-Modified, no ETag. There's an internal age: header from a CDN but
it's not actionable.
No updated_at / last_modified field on document records. The only
timestamps are page_views.last_updated (popularity) and
regulations_dot_gov_info.checked_regulationsdotgov_at (tertiary metadata).
Documents are effectively immutable post-publication except for corrections,
which arrive as new documents linked via correction_of / corrections.
No conditions[updated_since] filter exists — all variants rejected.
Practical incremental strategy: the document_number format
{YYYY}-NNNNN (e.g. 2026-08558) is monotonic per publication date. Sync with
conditions[publication_date][gte]={last_seen} and order=newest. Track per-source
state in ingestion_runs. Watch corrections via conditions[correction]=1 with a
date filter.
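Since a gte filter re-reads the last seen day, the sync step needs to drop documents it already holds and then advance the high-water mark. A minimal sketch, assuming each list result carries document_number and publication_date; the helper names and the sample dates in the usage lines are illustrative only.

```python
def stable_id(document_number):
    """Canonical ID in the fr:{document_number} form used on this page."""
    return f"fr:{document_number}"


def select_new(results, known_ids):
    """Keep only results whose stable ID has not been ingested yet."""
    return [
        stable_id(doc["document_number"])
        for doc in results
        if stable_id(doc["document_number"]) not in known_ids
    ]


def next_high_water_mark(results, last_seen):
    """ISO dates sort lexicographically, so max() gives the latest date."""
    return max([doc["publication_date"] for doc in results] + [last_seen])


# Illustrative values only.
page = [
    {"document_number": "2026-08558", "publication_date": "2026-05-01"},
    {"document_number": "2026-08677", "publication_date": "2026-05-04"},
]
fresh = select_new(page, known_ids={"fr:2026-08558"})
mark = next_high_water_mark(page, "2026-05-01")
```

Keeping the mark as a plain ISO date string matches how publication_date is stored in the schema.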
Rate limits
Rate limits as practiced, not as documented: no X-RateLimit-* headers, no
Retry-After seen. Empirical: 20 sequential requests at ~1 req/sec all returned
HTTP 200 with no throttling. Cap not characterized. Be polite — start at 1
req/sec, scale carefully if needed.
For backfill with concurrent workers: 4-8 workers each at ~1 req/sec is probably safe. Stop and back off if any worker sees a 429 or 503.
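One simple way to hold each worker at roughly 1 req/sec is a minimum-interval pacer called before every request. This is a sketch under that assumption, not the ingester's actual throttle; the `Pacer` class is hypothetical, and the clock and sleep functions are injectable so the behaviour can be checked without real waiting.

```python
import time


class Pacer:
    """Enforce a minimum interval between calls to wait()."""

    def __init__(self, rate_per_sec=1.0, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / rate_per_sec
        self.clock = clock
        self.sleep = sleep
        self.next_ok = None  # earliest time the next request may start

    def wait(self):
        now = self.clock()
        if self.next_ok is not None and now < self.next_ok:
            self.sleep(self.next_ok - now)
            now = self.next_ok
        self.next_ok = now + self.interval
```

Each share-nothing worker would own one Pacer and call `wait()` immediately before issuing a request, so 4-8 workers stay near 4-8 req/sec in aggregate.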
Schema (indicative — SQLite)
The migration in shared/josh_substrate/.../migrations/versions/ is the source of truth. The block below is illustrative: keep it as a guide for the field map and the relational shape, but assume migrations win when they disagree. SQLite-flavored: TEXT for IDs/timestamps, INTEGER for booleans (0/1), TEXT containing JSON for raw_json, junction tables instead of arrays.
-- Canonical FR document table
CREATE TABLE fr_documents (
  -- Stable canonical ID: what the agent passes between turns
  id TEXT PRIMARY KEY,                       -- 'fr:2026-08558'
  document_number TEXT NOT NULL UNIQUE,      -- '2026-08558'
  citation TEXT,                             -- '91 FR 24038'

  -- Type / classification
  type TEXT NOT NULL CHECK (type IN (
    'Rule', 'Proposed Rule', 'Notice', 'Presidential Document',
    'Correction', 'Uncategorized Document'
  )),
  subtype TEXT,
  presidential_doc_type TEXT CHECK (presidential_doc_type IS NULL OR presidential_doc_type IN (
    'executive_order', 'proclamation', 'determination', 'notice',
    'memorandum', 'presidential_order', 'other'
  )),
  notice_type TEXT CHECK (notice_type IS NULL OR notice_type IN (
    'sorn', 'sunshine_act_meeting'
  )),

  -- Dates as ISO-8601 TEXT ('YYYY-MM-DD'); lexicographic ordering matches chronological
  publication_date TEXT NOT NULL,
  effective_on TEXT,
  comments_close_on TEXT,
  signing_date TEXT,

  -- Content
  title TEXT NOT NULL,
  abstract TEXT,
  action TEXT,                               -- "Notice; request for comment."
  dates_text TEXT,                           -- the API's `dates` field
  body_text TEXT,                            -- extracted from XML
  body_xml TEXT,                             -- raw XML preserved for re-parsing

  -- Presidential-specific
  presidential_doc_number TEXT,
  executive_order_number TEXT,
  proclamation_number TEXT,
  executive_order_notes TEXT,
  president_identifier TEXT,                 -- e.g. 'donald-trump'

  -- Issue paging in printed FR
  volume INTEGER,
  start_page INTEGER,
  end_page INTEGER,
  page_length INTEGER,

  -- Flags (0/1; SQLite has no boolean)
  significant INTEGER,
  not_received_for_publication INTEGER,

  -- Versioning relationships
  correction_of TEXT REFERENCES fr_documents(id) ON DELETE SET NULL,

  -- Source URLs (omitted here for brevity; see API spec above for full list)
  json_url TEXT,
  html_url TEXT,
  body_html_url TEXT,
  full_text_xml_url TEXT,
  raw_text_url TEXT,
  pdf_url TEXT,
  public_inspection_pdf_url TEXT,
  mods_url TEXT,
  regulations_dot_gov_url TEXT,
  comment_url TEXT,

  -- Lifecycle / forensics
  raw_json TEXT NOT NULL,                    -- full detail-endpoint response, JSON
  fetched_at TEXT NOT NULL,                  -- ISO-8601 UTC
  parsed_at TEXT,
  body_fetched_at TEXT,

  -- Maintenance
  inserted_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')),
  updated_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now'))
);
-- Filter-friendly indexes
CREATE INDEX fr_documents_pub_date ON fr_documents (publication_date DESC);
CREATE INDEX fr_documents_type_pub ON fr_documents (type, publication_date DESC);
CREATE INDEX fr_documents_comment_close ON fr_documents (comments_close_on) WHERE comments_close_on IS NOT NULL;
CREATE INDEX fr_documents_effective ON fr_documents (effective_on) WHERE effective_on IS NOT NULL;
CREATE INDEX fr_documents_correction_of ON fr_documents (correction_of) WHERE correction_of IS NOT NULL;
CREATE INDEX fr_documents_significant ON fr_documents (publication_date DESC) WHERE significant = 1;
-- RINs and docket IDs are arrays in the FR API; in SQLite they live in junction tables
CREATE TABLE fr_document_rins (
  fr_document_id TEXT NOT NULL REFERENCES fr_documents(id) ON DELETE CASCADE,
  rin TEXT NOT NULL,                         -- e.g. '3150-AL36'
  PRIMARY KEY (fr_document_id, rin)
);
CREATE INDEX fr_document_rins_rin ON fr_document_rins (rin);

CREATE TABLE fr_document_docket_ids (
  fr_document_id TEXT NOT NULL REFERENCES fr_documents(id) ON DELETE CASCADE,
  docket_id TEXT NOT NULL,                   -- agency docket string (NOT regs.gov)
  PRIMARY KEY (fr_document_id, docket_id)
);
CREATE INDEX fr_document_docket_ids_docket ON fr_document_docket_ids (docket_id);
-- Full-text search via FTS5 with column weights applied at query time
-- Use bm25(fr_documents_fts, 0.0, w_title, w_abstract, w_action, w_body) at SELECT time.
-- bm25() weights are positional across all declared columns, so the UNINDEXED
-- fr_document_id column takes the first slot.
CREATE VIRTUAL TABLE fr_documents_fts USING fts5(
  fr_document_id UNINDEXED,
  title,
  abstract,
  action,
  body,
  tokenize = 'porter unicode61'
);

-- Triggers keep the FTS5 mirror in sync with the parent table
CREATE TRIGGER fr_documents_ai AFTER INSERT ON fr_documents BEGIN
  INSERT INTO fr_documents_fts (fr_document_id, title, abstract, action, body)
  VALUES (NEW.id, COALESCE(NEW.title,''), COALESCE(NEW.abstract,''), COALESCE(NEW.action,''), COALESCE(NEW.body_text,''));
END;
CREATE TRIGGER fr_documents_ad AFTER DELETE ON fr_documents BEGIN
  DELETE FROM fr_documents_fts WHERE fr_document_id = OLD.id;
END;
CREATE TRIGGER fr_documents_au AFTER UPDATE ON fr_documents BEGIN
  DELETE FROM fr_documents_fts WHERE fr_document_id = OLD.id;
  INSERT INTO fr_documents_fts (fr_document_id, title, abstract, action, body)
  VALUES (NEW.id, COALESCE(NEW.title,''), COALESCE(NEW.abstract,''), COALESCE(NEW.action,''), COALESCE(NEW.body_text,''));
END;
-- Agencies (shared with other sources)
CREATE TABLE agencies (
  id TEXT PRIMARY KEY,                       -- 'agency:securities-and-exchange-commission'
  fr_agency_id INTEGER UNIQUE,               -- FR's numeric ID
  slug TEXT UNIQUE NOT NULL,                 -- 'securities-and-exchange-commission'
  short_name TEXT,                           -- 'SEC'
  name TEXT NOT NULL,
  raw_name TEXT,
  parent_id TEXT REFERENCES agencies(id),
  description TEXT,
  url TEXT
);
CREATE INDEX agencies_parent ON agencies (parent_id) WHERE parent_id IS NOT NULL;

CREATE TABLE fr_document_agencies (
  fr_document_id TEXT REFERENCES fr_documents(id) ON DELETE CASCADE,
  agency_id TEXT REFERENCES agencies(id) ON DELETE CASCADE,
  PRIMARY KEY (fr_document_id, agency_id)
);

-- Topics (subject taxonomy)
CREATE TABLE topics (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  name TEXT NOT NULL UNIQUE
);

CREATE TABLE fr_document_topics (
  fr_document_id TEXT REFERENCES fr_documents(id) ON DELETE CASCADE,
  topic_id INTEGER REFERENCES topics(id) ON DELETE CASCADE,
  PRIMARY KEY (fr_document_id, topic_id)
);
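-- CFR cross-references
CREATE TABLE fr_document_cfr_refs (
  fr_document_id TEXT REFERENCES fr_documents(id) ON DELETE CASCADE,
  cfr_title INTEGER NOT NULL,
  cfr_chapter TEXT,
  cfr_part TEXT,
  citation_url TEXT
);
-- SQLite rejects expressions inside PRIMARY KEY / UNIQUE constraints, so the
-- NULL-tolerant uniqueness rule lives in a unique expression index instead.
CREATE UNIQUE INDEX fr_document_cfr_refs_unique ON fr_document_cfr_refs (
  fr_document_id, cfr_title, COALESCE(cfr_chapter, ''), COALESCE(cfr_part, '')
);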
-- Public Inspection (separate table: distinct lifecycle and shape)
CREATE TABLE fr_public_inspection (
  id TEXT PRIMARY KEY,                       -- 'fr-pi:2026-08677'
  document_number TEXT NOT NULL,             -- shared namespace with fr_documents
  type TEXT,
  subtype TEXT,
  title TEXT NOT NULL,
  filing_type TEXT NOT NULL CHECK (filing_type IN ('regular', 'special')),
  filed_at TEXT NOT NULL,
  publication_date TEXT,                     -- target publication date
  num_pages INTEGER,
  subject_1 TEXT,
  subject_2 TEXT,
  subject_3 TEXT,
  editorial_note TEXT,
  pdf_url TEXT,
  pdf_file_name TEXT,
  pdf_file_size INTEGER,
  pdf_updated_at TEXT,
  raw_text_url TEXT,
  html_url TEXT,
  json_url TEXT,
  raw_json TEXT NOT NULL,
  fetched_at TEXT NOT NULL,
  inserted_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')),
  -- When the doc publishes, we don't delete this row; keep for audit trail
  published_as_id TEXT REFERENCES fr_documents(id) ON DELETE SET NULL
);
CREATE INDEX fr_pi_publication_date ON fr_public_inspection (publication_date DESC);
CREATE INDEX fr_pi_filed_at ON fr_public_inspection (filed_at DESC);

-- PI docket numbers in their own junction table (PI uses 'docket_numbers', not 'docket_ids')
CREATE TABLE fr_pi_docket_numbers (
  fr_pi_id TEXT NOT NULL REFERENCES fr_public_inspection(id) ON DELETE CASCADE,
  docket_number TEXT NOT NULL,
  PRIMARY KEY (fr_pi_id, docket_number)
);
-- Vector chunks via sqlite-vec
-- BQ+rescore today: store both binary (fast scan) and float (rescore); sqlite-vec supports
-- both at modest space cost (binary alone is what gets brute-force-scanned).
-- Migrate to vec1 when it ships a stable release: same payload, different virtual table.
CREATE TABLE fr_document_chunks (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  fr_document_id TEXT NOT NULL REFERENCES fr_documents(id) ON DELETE CASCADE,
  chunk_index INTEGER NOT NULL,
  chunk_text TEXT NOT NULL,
  UNIQUE (fr_document_id, chunk_index)
);

-- Companion vec0 virtual table holds the embeddings keyed back to fr_document_chunks.id
CREATE VIRTUAL TABLE fr_document_chunk_vecs USING vec0(
  chunk_id INTEGER PRIMARY KEY,
  embedding float[1024],                     -- full-precision for rescore
  embedding_bq bit[1024]                     -- binary-quantized for fast brute-force scan
);
Schema decisions worth flagging:
- raw_json always preserved: when we add fields later, we can re-parse without re-fetching.
- Body text and body XML both stored: body_text for search, body_xml for re-extraction (section-aware navigation when we add it).
- Type-specific fields nullable on the same table rather than per-type subtables. Most queries don't care about type, and the agent's fetch_rule returns one shape regardless. CHECK constraints enforce valid enums.
- Public Inspection in a separate table rather than overloading fr_documents with a status flag. Different shape, different lifecycle, link via published_as_id on publication.
- No topics enum constraint: the topic list is open-ended; we capture as we see them.
- Vector chunks separate from the doc table because chunking strategy may change (we re-chunk and re-embed without rewriting fr_documents).
- Triggers feed FTS5 rather than generated columns: SQLite generated columns can't directly source an FTS5 virtual table, and the 3-trigger pattern is the standard SQLite idiom.
Normalization rationale (derived from probing the API shapes): RINs and
docket_ids are arrays in the FR API, so they become junction tables
(fr_document_rins, fr_document_docket_ids). The full dockets objects are
sparse — many notices/rules have empty dockets even with non-empty
docket_ids. Public Inspection uses docket_numbers (not docket_ids), so it
gets its own fr_pi_docket_numbers junction table.
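The array-to-junction-table mapping can be sketched with the standard sqlite3 module. The table definitions below are trimmed copies of the ones in the schema block; `load_identifiers` is a hypothetical loader helper, the `regulation_id_numbers` key is an assumption about the response field that carries RINs, and the docket string in the usage lines is invented.

```python
import sqlite3

SCHEMA = """
CREATE TABLE fr_documents (
  id TEXT PRIMARY KEY,
  document_number TEXT NOT NULL UNIQUE
);
CREATE TABLE fr_document_rins (
  fr_document_id TEXT NOT NULL REFERENCES fr_documents(id) ON DELETE CASCADE,
  rin TEXT NOT NULL,
  PRIMARY KEY (fr_document_id, rin)
);
CREATE TABLE fr_document_docket_ids (
  fr_document_id TEXT NOT NULL REFERENCES fr_documents(id) ON DELETE CASCADE,
  docket_id TEXT NOT NULL,
  PRIMARY KEY (fr_document_id, docket_id)
);
"""


def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")  # required for ON DELETE CASCADE
    conn.executescript(SCHEMA)
    return conn


def load_identifiers(conn, raw):
    """Insert one document row plus one junction row per array element."""
    doc_id = f"fr:{raw['document_number']}"
    with conn:  # single transaction; commits on success
        conn.execute(
            "INSERT OR IGNORE INTO fr_documents (id, document_number) VALUES (?, ?)",
            (doc_id, raw["document_number"]),
        )
        conn.executemany(
            "INSERT OR IGNORE INTO fr_document_rins (fr_document_id, rin) VALUES (?, ?)",
            [(doc_id, rin) for rin in raw.get("regulation_id_numbers") or []],
        )
        conn.executemany(
            "INSERT OR IGNORE INTO fr_document_docket_ids (fr_document_id, docket_id) VALUES (?, ?)",
            [(doc_id, d) for d in raw.get("docket_ids") or []],
        )
    return doc_id


conn = open_db()
doc_id = load_identifiers(conn, {
    "document_number": "2026-08558",
    "regulation_id_numbers": ["3150-AL36"],
    "docket_ids": ["EXAMPLE-2026-0001", "EXAMPLE-2026-0001"],  # invented value, repeated on purpose
})
```

The composite primary keys make repeated array values harmless, and the cascade removes junction rows when a document row is deleted.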
Download / update strategy
Backfill (one-time, 1994 → present)
Recommended path: API-driven, parallel workers, chunked by month.
- For each month from 1994-01 through current month:
  - Page through /documents.json?conditions[publication_date][gte]={month-start}&conditions[publication_date][lte]={month-end}&per_page=1000 (each month is well under the 10K cap).
  - Save raw list response to /data/corpus/federal_register/list/{YYYY}/{MM}.json.
- Extract the document_number set from list responses.
- For each document_number:
  - Fetch detail: /documents/{number}.json. Save to /data/corpus/federal_register/metadata/{YYYY}/{MM}/{number}.json.
  - Fetch body XML from full_text_xml_url. Save to /data/corpus/federal_register/bodies/xml/{YYYY}/{MM}/{number}.xml.
- Parse + load into SQLite via the loader stage (the storage backend moved from Postgres to SQLite; see Open questions).
- Embed + index via the indexer stage.
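The month loop and on-disk layout above can be sketched as follows. The path templates come from the steps above; the function names are hypothetical, and clipping the first window to the 1994-01-03 coverage start is an assumption about how the edge should be handled.

```python
from datetime import date, timedelta

CORPUS_ROOT = "/data/corpus/federal_register"


def month_windows(first, last):
    """Yield (start, end) date pairs, one per calendar month, clipped to [first, last]."""
    cursor = date(first.year, first.month, 1)
    while cursor <= last:
        if cursor.month == 12:
            nxt = date(cursor.year + 1, 1, 1)
        else:
            nxt = date(cursor.year, cursor.month + 1, 1)
        yield max(cursor, first), min(nxt - timedelta(days=1), last)
        cursor = nxt


def list_path(window_start):
    return f"{CORPUS_ROOT}/list/{window_start:%Y}/{window_start:%m}.json"


def detail_path(publication_date, number):
    return f"{CORPUS_ROOT}/metadata/{publication_date:%Y}/{publication_date:%m}/{number}.json"


def body_path(publication_date, number):
    return f"{CORPUS_ROOT}/bodies/xml/{publication_date:%Y}/{publication_date:%m}/{number}.xml"


windows = list(month_windows(date(1994, 1, 3), date(1994, 3, 15)))
```

Each worker can take one window from this generator, which keeps the share-nothing split by month straightforward.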
Concurrency. 4-8 workers, share-nothing. Each worker takes one month at a time. Watch for 429/503 — if seen, back off all workers.
Time budget. ~2.5M docs × ~2 API calls each (detail + body) = ~5M calls. At 4 workers × 1 req/sec = ~14 days. At 8 workers × 1.5 req/sec = ~5 days. Reasonable.
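The time-budget arithmetic can be checked with a few lines; the function below simply restates the estimate above (documents × calls per document, divided by aggregate request rate) and introduces no new inputs.

```python
SECONDS_PER_DAY = 86_400


def backfill_days(docs, calls_per_doc, workers, req_per_sec):
    """Wall-clock days for a backfill at a steady aggregate request rate."""
    total_calls = docs * calls_per_doc
    return total_calls / (workers * req_per_sec) / SECONDS_PER_DAY


slow = backfill_days(2_500_000, 2, workers=4, req_per_sec=1.0)   # about 14.5 days
fast = backfill_days(2_500_000, 2, workers=8, req_per_sec=1.5)   # about 4.8 days
```

Both figures ignore retries and back-off pauses, so real runs would take somewhat longer.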
GovInfo bulk fallback. If API rate limits become painful, switch body fetches to GovInfo daily XML downloads (each daily XML has all docs for that day in one ~3 MB file — way cheaper). Keep API for metadata always.
Daily incremental
Three scheduled ticks, all triggered by cron / systemd timer in josh-ingester:
- 8:00 AM ET: sync Public Inspection:
  - Hit /public-inspection-documents/current.json, capture regular_filings_updated_at and special_filings_updated_at.
  - Skip if both timestamps are unchanged from last run.
  - Otherwise iterate /public-inspection-documents.json?conditions[available_on]={tomorrow_or_today} to refresh PI rows.
- 10:00 AM ET: sync newly-published documents:
  - /documents.json?conditions[publication_date][is]={today}&order=newest&per_page=1000.
  - Upsert metadata + fetch bodies for new documents.
  - For each PI row whose document_number now resolves on /documents/{number}.json, upsert the published doc and link fr_public_inspection.published_as_id.
- Hourly tick: sync corrections:
  - /documents.json?conditions[correction]=1&conditions[publication_date][gte]={last_run} to catch corrections to historical documents.
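The skip check in the Public Inspection tick reduces to comparing the two sync timestamps against the values stored from the previous run. A minimal sketch; `pi_sync_needed` is a hypothetical name, the timestamp strings are invented, and treating a missing key as a change is an assumption.

```python
PI_SYNC_KEYS = ("regular_filings_updated_at", "special_filings_updated_at")


def pi_sync_needed(current, last_seen):
    """True when either filing timestamp differs from the stored value."""
    return any(current.get(key) != last_seen.get(key) for key in PI_SYNC_KEYS)


# Invented timestamps for illustration.
stored = {
    "regular_filings_updated_at": "2026-05-28T08:45:00-04:00",
    "special_filings_updated_at": "2026-05-28T11:15:00-04:00",
}
unchanged = pi_sync_needed(dict(stored), stored)
changed = pi_sync_needed(
    {**stored, "special_filings_updated_at": "2026-05-28T16:15:00-04:00"}, stored
)
```

On the very first run `last_seen` would be empty, so the check returns True and the full PI refresh proceeds.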
State tracking
Per architecture doc: ingestion state lives in SQLite tables ingestion_runs,
ingestion_logs, ingestion_source_state, ingestion_tasks. Each run records
source name, started/finished timestamps, status, rows
inserted/updated/skipped, and a config JSON column with the actual parameters
(date range, etc.) so we can re-run identically.
Failure modes to handle
- HTTP 429 / 503: exponential backoff up to a few minutes; if it persists, halt and alert.
- HTTP 404 on a doc number that came from the list: possible if a doc is withdrawn between list and detail. Skip and log.
- Body XML missing or malformed: log, mark body_fetched_at null, continue with metadata. Body can be retried later.
- Schema-breaking changes in API response: raw_json preserved; alert on unknown top-level keys; manual investigation.
- Disk space on /data: monitor; corpus alone is ~50-80 GB compressed. Volume is 100 GB; expand if needed.
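The status handling in the list above could be sketched as a small classifier plus a capped doubling schedule. The 300-second cap is one reading of "a few minutes", and both function names are hypothetical; the real ingester may choose different limits or add jitter.

```python
def classify_status(status):
    """Map an HTTP status to the action described in the failure-mode list."""
    if status in (429, 503):
        return "retry"   # back off, then retry
    if status == 404:
        return "skip"    # doc may have been withdrawn between list and detail
    if 200 <= status < 300:
        return "ok"
    return "alert"       # anything else needs a human look


def backoff_delays(base=1.0, cap=300.0, attempts=10):
    """Doubling delays in seconds, capped at `cap`."""
    return [min(cap, base * 2 ** n) for n in range(attempts)]


delays = backoff_delays()
```

Once the schedule is exhausted without a success, the run would halt and alert rather than keep retrying.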
Open questions
These don't block ingestion but should be resolved before this source is "shipped":
- topics[] canonical enum: observed values work for some but URL-encoding of names with commas/spaces is fragile. Consider scraping the topics index page or building our own normalized topic table over time.
- search_type_id parameter: accepted but undocumented, no observable effect. Likely safe to ignore.
- Cap characterization: what's the actual concurrent rate ceiling? Will require empirical testing once ingestion runs.
- Postgres FTS engine choice (superseded). Storage backend moved from Postgres to SQLite (decision in Architecture, Storage stack). Search uses SQLite FTS5 with BM25 + per-column weights; vector retrieval uses sqlite-vec brute-force + binary-quantization + rescore today, with a vec1 IVFADC+OPQ migration planned when it releases.
- Embedding model and dimension (resolved). Locked OSS-only on Snowflake Arctic-Embed-M-v2 (1024-dim, MIT license, runs on CPU). The vec0 schema column is embedding float[1024] + embedding_bq bit[1024]. See Architecture, Storage stack.
- Chunking strategy: fixed-size with overlap vs section-aware (split at FR's <HD> / <SECTION> headers). Section-aware is better but more parsing work; defer to v1.x.
- Citation extraction (eyecite): when we add the citation graph table, we need to decide if extraction runs at parse time (slower ingestion) or as a separate batch pass.
- Public Inspection retention: do we keep PI rows forever for audit, or prune after a window? Lean toward keeping forever (small data, useful history).
docs/sources/federal-register.html · Last updated 2026-05-29