Federal Register ingester

Why

Federal Register is the most cross-referenced regulatory source on the
substrate — every other regulatory source FKs to it eventually. Shipping FR
early validates the SQLite + FTS5 + sqlite-vec stack against a real,
high-volume, body-text source (NPRMs and final rules routinely run 30K+
tokens). It's also the bedrock under "what regs touch X" queries that
agents will run constantly.

User stories

As a policy analyst using my own AI agent, I want to ask the agent for the latest NPRMs from EPA on a given topic so that I get fresh data with citations without running my own scraper.

As an OSS self-hoster, I want to run `josh-ingester backfill federal-register` on a fresh substrate so that I can populate FR from the public API on my own instance.

As a downstream regulatory source (eCFR, regulations.gov dockets), I want to FK to FR documents by `fr:<doc-number>` ID so that cross-source citation graph queries resolve cleanly.

Acceptance criteria (EARS)

When `josh-ingester backfill federal-register --limit 50` is run inside the deployed `josh-ingester` container, the system shall populate `fr_documents` rows whose six citation columns (`source_url`, `source_org`, `published_at`, `retrieved_at`, `raw_sha256`, `citation_string`) are all non-null.
When an FR document body is fetched, the system shall write the raw XML/HTML under `/data/corpus/federal-register/bodies/raw/` and write the normalized Markdown into the row's `body_text` column (with `body_normalizer_version` and `body_text_sha256` populated).
While the ingester is running, if a single document fails to parse, then the system shall log the failure to `ingestion_logs` and continue with the next document.
Where the FR document declares one or more RINs, the system shall populate the `fr_document_rins` junction table for that document (no `text[]` columns; SQLite-shaped).
When an FTS5 query is executed against `fr_documents_fts`, the system shall return BM25-ranked results. (Per-column weight tuple — provisionally `(10.0, 4.0, 2.0, 1.0)` for `(title, abstract, action, body)` — is applied at query time by the search wrapper that ships with `rest-api-search`; criterion to be tightened then.)
Where a document body exceeds ~5K tokens, the system shall enqueue rows in `ingestion_embedding_queue` for chunk-level embedding.
When an incremental run completes, the system shall update `ingestion_source_state.last_seen_key` such that the next run only fetches newer documents.
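The BM25 criterion above can be illustrated with a short query-time sketch. It assumes an FTS5 table whose indexed columns are declared in the order `(title, abstract, action, body)`, since `bm25()` assigns weights positionally. The `FTS_DDL` stand-in and the `search` helper are hypothetical names for illustration, not the wrapper that will ship with `rest-api-search`.

```python
import sqlite3

# Hypothetical stand-in for the real FTS5 table. Column order matters because
# bm25() maps its weight arguments onto columns positionally.
FTS_DDL = "CREATE VIRTUAL TABLE fr_documents_fts USING fts5(title, abstract, action, body)"

# Provisional weights from the acceptance criterion, in column order.
WEIGHTS = (10.0, 4.0, 2.0, 1.0)


def search(conn: sqlite3.Connection, query: str, limit: int = 10) -> list[tuple[int, float]]:
    """Return (rowid, score) pairs for an FTS5 query, best match first."""
    # bm25() returns lower values for better matches, so an ascending
    # ORDER BY puts the most relevant rows first.
    sql = """
        SELECT rowid, bm25(fr_documents_fts, ?, ?, ?, ?) AS score
        FROM fr_documents_fts
        WHERE fr_documents_fts MATCH ?
        ORDER BY score
        LIMIT ?
    """
    return conn.execute(sql, (*WEIGHTS, query, limit)).fetchall()
```

With these weights, a term that appears in the title contributes ten times as much to the score as the same term in the body, so title hits tend to rank first.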

Success determiner

kind: bash

Command

set -euo pipefail
# Smoke backfill (50 docs) inside the deployed josh-ingester container
# head -n 1 guards against more than one container name matching the prefix
ssh josh 'docker exec "$(docker ps --format "{{.Names}}" | grep "^josh-ingester-workers-" | head -n 1)" \
  josh-ingester backfill federal-register --limit 50'
# Verify rows landed with all citation columns populated
ssh josh 'docker exec "$(docker ps --format "{{.Names}}" | grep "^josh-core-web-" | head -n 1)" \
  sqlite3 /data/josh.db "
    SELECT COUNT(*) FROM fr_documents
    WHERE source_url IS NOT NULL AND source_org IS NOT NULL
      AND published_at IS NOT NULL AND retrieved_at IS NOT NULL
      AND raw_sha256 IS NOT NULL AND citation_string IS NOT NULL;"'
# FTS5 sanity: BM25-ranked results exist for a generic regulatory term
ssh josh 'docker exec "$(docker ps --format "{{.Names}}" | grep "^josh-core-web-" | head -n 1)" \
  sqlite3 /data/josh.db \
    "SELECT COUNT(*) FROM fr_documents_fts WHERE fr_documents_fts MATCH '\''rule'\'';"'

Expect

>= 50 fully-citation-stamped rows; FTS5 match returns >= 1.

Smoke run only: `--limit 50` caps the discover stage at 50 FetchTasks. Full production backfill scope is **2024–2026** (~60–65K docs at the observed ~2.0–2.35K docs/month publication volume; confirmed 2026-05-29) and runs detached on the server, following Step 11 of `add-a-new-source.html` (the `docker exec -d` form). Do not size the window from the API `count` field: it caps at 10,000, so a single-query total silently undercounts multi-year windows. Size it by paginating or windowing on `publication_date` instead.
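The windowing advice above can be sketched as follows. At the observed ~2.0–2.35K docs/month, a one-month window stays well under the 10,000 cap, so per-window counts can be summed safely. The helper names (`month_windows`, `window_params`, `size_backfill`) are hypothetical, and the query parameter names are an assumption about the public FR API that should be checked against its documentation. The HTTP call itself is left to a caller-supplied function.

```python
from collections.abc import Callable, Iterator
from datetime import date, timedelta


def month_windows(start: date, end: date) -> Iterator[tuple[date, date]]:
    """Yield inclusive (first, last) date windows, one per calendar month,
    clipped to the [start, end] range."""
    cursor = start
    while cursor <= end:
        # First day of the month after `cursor`.
        if cursor.month == 12:
            next_month = date(cursor.year + 1, 1, 1)
        else:
            next_month = date(cursor.year, cursor.month + 1, 1)
        yield cursor, min(next_month - timedelta(days=1), end)
        cursor = next_month


def window_params(gte: date, lte: date) -> dict[str, str]:
    """Build query parameters for one window (parameter names are assumed)."""
    return {
        "conditions[publication_date][gte]": gte.isoformat(),
        "conditions[publication_date][lte]": lte.isoformat(),
        "per_page": "1",  # only the count is needed when sizing
    }


def size_backfill(
    start: date,
    end: date,
    count_for_window: Callable[[dict[str, str]], int],
) -> int:
    """Sum per-window counts; `count_for_window` performs the actual request."""
    return sum(
        count_for_window(window_params(gte, lte))
        for gte, lte in month_windows(start, end)
    )
```

Passing the request function in as an argument keeps the sizing logic testable without network access.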

Clarifications needed

Topics table shape from FR vs the planned topic-taxonomy structure — confirm alignment before topic junction tables land.

Out of scope

State Federal Registers (CA Reg.gov, NY etc.) — Phase 6.
Real-time webhooks / push subscriptions — surface-layer concern.
Public mirror / dataset publication — explicitly dropped from v1 strategy (cloud-private).

Dependencies

Plan

The source module follows the standard Source protocol. The per-source spec
lives at https://docs.usejosh.com/sources/federal-register/ (probe findings plus
indicative DDL; migration 0002_fr_documents.py is the source of truth).

Discover/fetch/parse/load shape:
- discover: api.federalregister.gov /documents.json paginated by
  publication_date, watermark from ingestion_source_state.last_seen_key.
  The response count field caps at 10,000, so backfill sizing must
  paginate/window across publication_date rather than trust a single
  count read (which silently undercounts multi-year windows). At the
  observed ~2.0–2.35K docs/month, the 2024–2026 window is ~60–65K docs.
- fetch: per-doc raw XML (preferred) or HTML to
  /data/corpus/federal-register/bodies/raw/. SHA256 the bytes; that's the
  raw_sha256 for citation.
- parse: fr_xml_v1 lxml walker preserving heading hierarchy → Markdown
  body to /data/corpus/federal-register/bodies/markdown/.
- load: idempotent upsert on (doc_number) natural key. Junction-table
  inserts for RINs and docket IDs (no SQLite arrays). FTS5 sync triggers
  fire on insert/update/delete. Body chunks enqueued for embedding.
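A minimal sketch of the load stage described above: hash the raw bytes, upsert on `doc_number`, rewrite the RIN junction rows, and log a per-document failure without stopping the batch. The reduced schema, the `ingestion_logs` column names, and the helpers `load_document` and `load_batch` are assumptions for illustration; the real shapes come from the migration.

```python
import hashlib
import sqlite3

# Hypothetical, heavily reduced schema; the real tables carry more columns.
SCHEMA = """
CREATE TABLE fr_documents (
    doc_number TEXT PRIMARY KEY,
    title      TEXT NOT NULL,
    raw_sha256 TEXT NOT NULL
);
CREATE TABLE fr_document_rins (
    doc_number TEXT NOT NULL REFERENCES fr_documents(doc_number),
    rin        TEXT NOT NULL,
    PRIMARY KEY (doc_number, rin)
);
CREATE TABLE ingestion_logs (
    doc_number TEXT,
    error      TEXT NOT NULL
);
"""


def load_document(conn, doc_number, title, raw_bytes, rins):
    """Upsert one document and replace its RIN rows in a single transaction."""
    raw_sha256 = hashlib.sha256(raw_bytes).hexdigest()
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute(
            """
            INSERT INTO fr_documents (doc_number, title, raw_sha256)
            VALUES (?, ?, ?)
            ON CONFLICT(doc_number) DO UPDATE SET
                title = excluded.title,
                raw_sha256 = excluded.raw_sha256
            """,
            (doc_number, title, raw_sha256),
        )
        # Delete-then-insert keeps the junction table idempotent on re-runs.
        conn.execute("DELETE FROM fr_document_rins WHERE doc_number = ?", (doc_number,))
        conn.executemany(
            "INSERT INTO fr_document_rins (doc_number, rin) VALUES (?, ?)",
            [(doc_number, rin) for rin in sorted(set(rins))],
        )


def load_batch(conn, docs):
    """Load each document; log failures and keep going. Returns the success count."""
    loaded = 0
    for doc in docs:
        try:
            load_document(conn, **doc)
            loaded += 1
        except Exception as exc:  # one bad document must not stop the run
            with conn:
                conn.execute(
                    "INSERT INTO ingestion_logs (doc_number, error) VALUES (?, ?)",
                    (doc.get("doc_number"), repr(exc)),
                )
    return loaded
```

Running the same batch twice leaves `fr_documents` and `fr_document_rins` unchanged, which is the property the idempotent upsert is meant to provide. Only `ingestion_logs` grows.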

Migration shape: 0002_fr_documents.py with fr_documents,
fr_document_rins, fr_document_dockets, fr_documents_fts virtual,
fr_documents_fts_* triggers, fr_chunks, fr_chunk_vecs (vec0 with
both float[1024] and bit[1024]).
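The FTS5 part of the migration shape can be sketched as follows, assuming a plain (non-external-content) FTS5 table kept in sync by three triggers. The reduced `fr_documents` columns, the trigger suffixes (`_ai`, `_ad`, `_au`), and the helpers `open_db` and `match_count` are hypothetical. The `fr_chunk_vecs` vec0 table is omitted because it needs the sqlite-vec extension.

```python
import sqlite3

# Hypothetical, reduced schema: only the columns the FTS index reads.
SCHEMA = """
CREATE TABLE fr_documents (
    id         INTEGER PRIMARY KEY,
    doc_number TEXT NOT NULL UNIQUE,
    title      TEXT,
    abstract   TEXT,
    action     TEXT,
    body_text  TEXT
);
CREATE VIRTUAL TABLE fr_documents_fts USING fts5(title, abstract, action, body);

-- Mirror inserts into the index, keyed by the base table's rowid.
CREATE TRIGGER fr_documents_fts_ai AFTER INSERT ON fr_documents BEGIN
    INSERT INTO fr_documents_fts(rowid, title, abstract, action, body)
    VALUES (new.id, new.title, new.abstract, new.action, new.body_text);
END;
-- Drop the index row when the document is deleted.
CREATE TRIGGER fr_documents_fts_ad AFTER DELETE ON fr_documents BEGIN
    DELETE FROM fr_documents_fts WHERE rowid = old.id;
END;
-- Re-index on update: remove the old entry, then insert the new values.
CREATE TRIGGER fr_documents_fts_au AFTER UPDATE ON fr_documents BEGIN
    DELETE FROM fr_documents_fts WHERE rowid = old.id;
    INSERT INTO fr_documents_fts(rowid, title, abstract, action, body)
    VALUES (new.id, new.title, new.abstract, new.action, new.body_text);
END;
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database and apply the sketch schema."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def match_count(conn: sqlite3.Connection, query: str) -> int:
    """Count index rows matching an FTS5 query string."""
    row = conn.execute(
        "SELECT COUNT(*) FROM fr_documents_fts WHERE fr_documents_fts MATCH ?",
        (query,),
    ).fetchone()
    return row[0]
```

An external-content table (the FTS5 `content=` option) would avoid storing body text twice, at the cost of using FTS5's special `'delete'` insert form in the delete and update triggers. Which variant the real migration uses is not stated here.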

Schedule: hourly at :15 (FR publishes Mon-Fri ~08:45 ET, but later updates
are common; hourly catches stragglers without spamming).

Tasks

0 of 16 done.

Changelog

No history yet.