Federal Register ingester
Header
Why
Federal Register is the most cross-referenced regulatory source on the
substrate — every other regulatory source FKs to it eventually. Shipping FR
early validates the SQLite + FTS5 + sqlite-vec stack against a real,
high-volume, body-text source (NPRMs and final rules routinely run 30K+
tokens). It's also the bedrock under "what regs touch X" queries that
agents will run constantly.
User stories
As a policy analyst using my own AI agent, I want to ask the agent for the latest NPRMs from EPA on a given topic so that I get fresh data with citations without running my own scraper.
As an OSS self-hoster, I want to run `josh-ingester backfill federal-register` on a fresh substrate so that I can populate FR from the public API on my own instance.
As a downstream regulatory source (eCFR, regulations.gov dockets), I want to FK to FR documents by `fr:<doc-number>` ID so that cross-source citation graph queries resolve cleanly.
Acceptance criteria (EARS)
- When `josh-ingester backfill federal-register --limit 50` is run inside the deployed `josh-ingester` container, the system shall populate `fr_documents` rows whose six citation columns (`source_url`, `source_org`, `published_at`, `retrieved_at`, `raw_sha256`, `citation_string`) are all non-null.
- When an FR document body is fetched, the system shall write the raw XML/HTML under `/data/corpus/federal-register/bodies/raw/` and write the normalized Markdown into the row's `body_text` column (with `body_normalizer_version` and `body_text_sha256` populated).
- While the ingester is running, if a single document fails to parse, then the system shall log the failure to `ingestion_logs` and continue with the next document.
- Where the FR document declares one or more RINs, the system shall populate the `fr_document_rins` junction table for that document (no `text[]` columns; SQLite-shaped).
- When an FTS5 query is executed against `fr_documents_fts`, the system shall return BM25-ranked results. (Per-column weight tuple — provisionally `(10.0, 4.0, 2.0, 1.0)` for `(title, abstract, action, body)` — is applied at query time by the search wrapper that ships with `rest-api-search`; criterion to be tightened then.)
- Where a document body exceeds ~5K tokens, the system shall enqueue rows in `ingestion_embedding_queue` for chunk-level embedding.
- When an incremental run completes, the system shall update `ingestion_source_state.last_seen_key` such that the next run only fetches newer documents.
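The BM25 criterion above can be illustrated with a minimal sketch, assuming an FTS5 table whose columns are declared in the order `(title, abstract, action, body)`. The table definition, the sample rows, and the `make_demo_db` and `search` helpers are hypothetical and exist only for illustration; the real wrapper ships with `rest-api-search`, and the weights below are the provisional tuple from the criterion.

```python
import sqlite3

# Provisional per-column weights for (title, abstract, action, body).
WEIGHTS = (10.0, 4.0, 2.0, 1.0)


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory FTS5 table with a few made-up rows (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE fr_documents_fts "
        "USING fts5(title, abstract, action, body)"
    )
    rows = [
        ("Emissions standards update", "Agency proposes changes",
         "Proposed rule", "Text about vehicles and fuel"),
        ("Vehicle standards update", "Agency proposes changes",
         "Proposed rule", "Text about emissions and fuel"),
        ("Pesticide tolerance notice", "Agency sets limits",
         "Final rule", "Text about crops and residue"),
        ("Airspace amendment notice", "Agency revises routes",
         "Final rule", "Text about airports and routes"),
        ("Fishery quota notice", "Agency adjusts quotas",
         "Temporary rule", "Text about harvest and seasons"),
    ]
    conn.executemany(
        "INSERT INTO fr_documents_fts (title, abstract, action, body) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    return conn


def search(conn: sqlite3.Connection, query: str, limit: int = 10) -> list[tuple[str, float]]:
    """Return (title, score) pairs, best match first.

    FTS5's bm25() returns more negative values for better matches,
    so ascending order puts the best hit first.
    """
    sql = (
        "SELECT title, bm25(fr_documents_fts, ?, ?, ?, ?) AS score "
        "FROM fr_documents_fts WHERE fr_documents_fts MATCH ? "
        "ORDER BY score LIMIT ?"
    )
    return conn.execute(sql, (*WEIGHTS, query, limit)).fetchall()


if __name__ == "__main__":
    for title, score in search(make_demo_db(), "emissions"):
        print(f"{score:8.4f}  {title}")
```

With these weights a term that appears in the title should outrank the same term appearing only in the body, which is the behavior the provisional tuple is meant to produce.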
Success determiner
Command
set -euo pipefail
# Smoke backfill (50 docs) inside the deployed josh-ingester container
ssh josh 'docker exec $(docker ps --format "{{.Names}}" | grep ^josh-ingester-workers-) \
josh-ingester backfill federal-register --limit 50'
# Verify rows landed with all citation columns populated
ssh josh 'docker exec $(docker ps --format "{{.Names}}" | grep ^josh-core-web-) \
sqlite3 /data/josh.db "
SELECT COUNT(*) FROM fr_documents
WHERE source_url IS NOT NULL AND source_org IS NOT NULL
AND published_at IS NOT NULL AND retrieved_at IS NOT NULL
AND raw_sha256 IS NOT NULL AND citation_string IS NOT NULL;"'
# FTS5 sanity — BM25-ranked results exist for a generic regulatory term
ssh josh 'docker exec $(docker ps --format "{{.Names}}" | grep ^josh-core-web-) \
sqlite3 /data/josh.db \
"SELECT COUNT(*) FROM fr_documents_fts WHERE fr_documents_fts MATCH '\''rule'\'';"'
Expect
Smoke run only: `--limit 50` caps the discover stage at 50 FetchTasks. Full production backfill scope is **2024–2026** (~60–65K docs at the observed ~2.0–2.35K docs/month publication volume; confirmed 2026-05-29) and runs detached on the server per Step 11 of `add-a-new-source.html` (`docker exec -d` form). Do not size the window off the API `count` field: it caps at 10,000, so a single-query total silently undercounts multi-year windows. To size the backfill, paginate or window by `publication_date` instead.
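One way to size the backfill without trusting a single `count` read is to split the range into calendar-month windows and sum the per-window counts. The following is a sketch under stated assumptions: `month_windows`, `window_params`, and `size_backfill` are hypothetical helpers, `fetch_count` is a caller-supplied function that performs the actual API request, and the `conditions[publication_date][gte|lte]` parameter names should be verified against the Federal Register API documentation before use.

```python
from datetime import date, timedelta
from typing import Callable

COUNT_CAP = 10_000  # The API `count` field stops growing at this value.


def month_windows(start: date, end: date) -> list[tuple[date, date]]:
    """Split [start, end] into inclusive calendar-month windows."""
    windows = []
    cursor = start
    while cursor <= end:
        # First day of the month after `cursor`.
        if cursor.month == 12:
            next_month = date(cursor.year + 1, 1, 1)
        else:
            next_month = date(cursor.year, cursor.month + 1, 1)
        windows.append((cursor, min(next_month - timedelta(days=1), end)))
        cursor = next_month
    return windows


def window_params(window: tuple[date, date]) -> dict[str, str]:
    """Query parameters for one window (parameter names are an assumption)."""
    gte, lte = window
    return {
        "conditions[publication_date][gte]": gte.isoformat(),
        "conditions[publication_date][lte]": lte.isoformat(),
    }


def size_backfill(
    windows: list[tuple[date, date]],
    fetch_count: Callable[[dict[str, str]], int],
) -> int:
    """Sum per-window counts, refusing any window that hits the cap."""
    total = 0
    for window in windows:
        count = fetch_count(window_params(window))
        if count >= COUNT_CAP:
            raise ValueError(f"window {window} hit the count cap; split it further")
        total += count
    return total
```

At the observed monthly volume a one-month window stays well below the cap, so the guard in `size_backfill` should only fire if the window is widened or volume changes sharply.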
Clarifications needed
- Topics table shape from FR vs the planned topic-taxonomy structure — confirm alignment before topic junction tables land.
Out of scope
- State Federal Registers (CA Reg.gov, NY etc.) — Phase 6.
- Real-time webhooks / push subscriptions — surface-layer concern.
- Public mirror / dataset publication — explicitly dropped from v1 strategy (cloud-private).
Dependencies
Plan
Source module follows the standard Source protocol. Per-source spec at
https://docs.usejosh.com/sources/federal-register/ (probe findings + indicative
DDL; migration 0002_fr_documents.py is the source of truth).
Discover/fetch/parse/load shape:
- discover: api.federalregister.gov /documents.json paginated by
publication_date, watermark from ingestion_source_state.last_seen_key.
The response count field caps at 10,000, so backfill sizing must
paginate/window across publication_date rather than trust a single
count read (which silently undercounts multi-year windows). At the
observed ~2.0–2.35K docs/month, the 2024–2026 window is ~60–65K docs.
- fetch: per-doc raw XML (preferred) or HTML to
/data/corpus/federal-register/bodies/raw/. SHA256 the bytes; that's the
raw_sha256 for citation.
- parse: fr_xml_v1 lxml walker preserving heading hierarchy → Markdown
body to /data/corpus/federal-register/bodies/markdown/.
- load: idempotent upsert on (doc_number) natural key. Junction-table
inserts for RINs and docket IDs (no SQLite arrays). FTS5 sync triggers
fire on insert/update/delete. Body chunks enqueued for embedding.
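The load step above can be sketched as follows, assuming a much-reduced schema. The column set, the `SCHEMA` string, and the `load_document` helper are illustrative assumptions rather than the real DDL (migration 0002_fr_documents.py is the source of truth). The sketch shows three ideas from the list: hashing the raw bytes for `raw_sha256`, an idempotent upsert keyed on `doc_number`, and junction-table rows instead of array columns.

```python
import hashlib
import sqlite3

# Reduced, hypothetical schema: only the columns needed for the sketch.
SCHEMA = """
CREATE TABLE fr_documents (
    doc_number TEXT PRIMARY KEY,
    title      TEXT NOT NULL,
    raw_sha256 TEXT NOT NULL
);
CREATE TABLE fr_document_rins (
    doc_number TEXT NOT NULL REFERENCES fr_documents(doc_number),
    rin        TEXT NOT NULL,
    PRIMARY KEY (doc_number, rin)
);
"""


def load_document(
    conn: sqlite3.Connection,
    doc_number: str,
    title: str,
    raw_bytes: bytes,
    rins: list[str],
) -> str:
    """Upsert one document and replace its RIN rows; return the raw SHA-256."""
    raw_sha256 = hashlib.sha256(raw_bytes).hexdigest()
    with conn:  # One transaction: commit on success, roll back on error.
        conn.execute(
            "INSERT INTO fr_documents (doc_number, title, raw_sha256) "
            "VALUES (?, ?, ?) "
            "ON CONFLICT(doc_number) DO UPDATE SET "
            "title = excluded.title, raw_sha256 = excluded.raw_sha256",
            (doc_number, title, raw_sha256),
        )
        # Replace the junction rows so a re-run converges on the same state.
        conn.execute(
            "DELETE FROM fr_document_rins WHERE doc_number = ?", (doc_number,)
        )
        conn.executemany(
            "INSERT INTO fr_document_rins (doc_number, rin) VALUES (?, ?)",
            [(doc_number, rin) for rin in sorted(set(rins))],
        )
    return raw_sha256
```

Loading the same `doc_number` twice leaves a single `fr_documents` row carrying the latest values, which is what makes re-running a backfill safe. The `ON CONFLICT ... DO UPDATE` form needs SQLite 3.24 or newer.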
Migration shape: 0002_fr_documents.py with fr_documents, fr_document_rins,
fr_document_dockets, fr_documents_fts (FTS5 virtual table),
fr_documents_fts_* triggers, fr_chunks, fr_chunk_vecs (vec0 with
both float[1024] and bit[1024]).
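For the FTS5 part of the migration, a minimal sketch of the external-content pattern with three sync triggers is shown here. It follows the pattern described in the SQLite FTS5 documentation. The reduced column list, the `id` rowid column, the `_ai`/`_ad`/`_au` trigger suffixes, and the `match_ids` helper are assumptions for illustration and may not match the real migration. The vec0 table is left out because it needs the sqlite-vec extension loaded.

```python
import sqlite3

# Hypothetical, reduced DDL: an external-content FTS5 index kept in sync
# with fr_documents by insert, delete, and update triggers.
FTS_DDL = """
CREATE TABLE fr_documents (
    id        INTEGER PRIMARY KEY,
    title     TEXT,
    abstract  TEXT,
    action    TEXT,
    body_text TEXT
);
CREATE VIRTUAL TABLE fr_documents_fts USING fts5(
    title, abstract, action, body_text,
    content='fr_documents', content_rowid='id'
);
CREATE TRIGGER fr_documents_fts_ai AFTER INSERT ON fr_documents BEGIN
    INSERT INTO fr_documents_fts(rowid, title, abstract, action, body_text)
    VALUES (new.id, new.title, new.abstract, new.action, new.body_text);
END;
CREATE TRIGGER fr_documents_fts_ad AFTER DELETE ON fr_documents BEGIN
    INSERT INTO fr_documents_fts(fr_documents_fts, rowid, title, abstract, action, body_text)
    VALUES ('delete', old.id, old.title, old.abstract, old.action, old.body_text);
END;
CREATE TRIGGER fr_documents_fts_au AFTER UPDATE ON fr_documents BEGIN
    INSERT INTO fr_documents_fts(fr_documents_fts, rowid, title, abstract, action, body_text)
    VALUES ('delete', old.id, old.title, old.abstract, old.action, old.body_text);
    INSERT INTO fr_documents_fts(rowid, title, abstract, action, body_text)
    VALUES (new.id, new.title, new.abstract, new.action, new.body_text);
END;
"""


def match_ids(conn: sqlite3.Connection, query: str) -> list[int]:
    """Return rowids of documents matching an FTS5 query, in rowid order."""
    cursor = conn.execute(
        "SELECT rowid FROM fr_documents_fts "
        "WHERE fr_documents_fts MATCH ? ORDER BY rowid",
        (query,),
    )
    return [row[0] for row in cursor]
```

The update trigger issues a `'delete'` command with the old values before inserting the new ones. With an external-content table the index cannot look up the previous text on its own, so the old values have to be supplied for stale terms to be removed.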
Schedule: hourly at :15 (FR publishes Mon-Fri ~08:45 ET, but later updates
are common; hourly catches stragglers without spamming).
Tasks
0 of 16 done.
- t1 Migration 0002_fr_documents.py applied locally + production
- t2 Pydantic models in shared/josh_substrate/src/josh_substrate/models/fr.py
- t3 Source module implementing the `Source` protocol
- t4 Body normalization wired (fr_xml_v1 walker; html_v1 fallback for legacy)
- t5 Citation metadata: 6 columns + citation_for() formatter + 3-fixture unit tests
- t6 FTS5 virtual table + 3 sync triggers in migration
- t7 Vector vec0 table for chunks (float + BQ)
- t8 Unit test for fr_xml_v1 parser against captured fixtures
- t9 Integration test for full discover→fetch→parse→load pipeline
- t10 Smoke backfill (50 docs) clean per success_determiner
- t11 Production backfill at agreed scope (2024–2026, ~60–65K docs) complete
- t12 Hourly cron registered and observed firing once
- t13 No recurring errors in ingestion_logs after 24h
- t14 https://docs.usejosh.com/sources/federal-register/ updated to ingestion-built; inventory row flipped
- t15 12-sample manual HTML/XML→Markdown validation gate (per add-a-new-source.html)
- t16 URL verification gate: 10/10 random records match upstream titles via WebFetch
Changelog
No history yet.