Regulations.gov dockets ingester
Why
When a federal agency proposes or finalizes a regulation, it opens a
docket on regulations.gov for public comment. Each docket is the
lifecycle container for a single rulemaking — the regulatory document(s),
supporting analyses, comment count, and deadline. For Josh, dockets give
us the bridge between Federal Register documents and the broader
rulemaking proceeding: every FR rule cites a docket_id, and many
dockets house multiple FR documents over their lifecycle (proposed rule
→ final rule → correction). Comment counts and deadlines are also
high-signal for "what's open for comment now" agent queries. Comment
text itself is deferred to v2 (hundreds of GB to TB).
User stories
As a policy analyst using my own AI agent, I want to ask "what regulations are open for comment in EPA Air Quality this month, with deadlines" so that I can prep stakeholder comment letters before the window closes.
As an OSS self-hoster, I want to run `josh ingest regulations-dot-gov-dockets` on a fresh substrate so that I can populate the docket metadata + document list from regulations.gov on my own instance.
As the Federal Register ingester (cross-source consumer), I want to FK a `fr_documents.docket_id` to a real `dockets.id` row so that agents can pivot from "this rule" to "the full rulemaking docket" in one hop.
Acceptance criteria (EARS)
- When `josh ingest regulations-dot-gov-dockets --since 2026-01-01` is run on the deployed substrate, the system shall populate `dockets` rows whose six citation columns are all non-null.
- While the ingester is running, if a single docket fetch returns 429 (api.data.gov api-umbrella throttle, signalled by `x-ratelimit-*` headers and `Retry-After`), then the system shall retry with exponential backoff up to ~5 min and continue with the next record on terminal failure.
- When a docket's documents list is fetched, the system shall populate `docket_documents` rows with one row per document (proposed rule, final rule, supporting docs, etc.) — comment text rows shall NOT be inserted in v1.
- Where an FR document declares a `docket_id`, the system shall resolve it to a `dockets.id` and write the junction row in `fr_document_dockets` (one docket can hold many FR docs; one FR doc is in exactly one docket).
- The system shall enforce a UNIQUE constraint on `dockets.docket_id` (e.g. `EPA-HQ-OAR-2025-0001`) so re-fetching the same docket upserts rather than duplicates.
- When an FTS5 query is executed against `dockets_fts`, the system shall return BM25-ranked results over `(title, agency, document_titles)`.
- When an incremental run completes, the system shall update `ingestion_source_state.last_seen_key` to the highest `lastModifiedDate` observed so the next run only fetches changed dockets.
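The UNIQUE-constraint criterion above could be satisfied with a SQLite upsert keyed on `docket_id`. The following is a minimal sketch, not the actual migration or loader: the trimmed-down schema, the `upsert_docket` helper, and the column subset are illustrative assumptions, and `ON CONFLICT ... DO UPDATE` requires SQLite 3.24 or newer.

```python
import sqlite3

# Hypothetical, trimmed-down table: only the natural key and two payload
# columns. The real table would also carry the six citation columns.
SCHEMA = """
CREATE TABLE dockets (
    id INTEGER PRIMARY KEY,
    docket_id TEXT NOT NULL UNIQUE,
    title TEXT NOT NULL,
    last_modified_date TEXT NOT NULL
);
"""

UPSERT = """
INSERT INTO dockets (docket_id, title, last_modified_date)
VALUES (:docket_id, :title, :last_modified_date)
ON CONFLICT(docket_id) DO UPDATE SET
    title = excluded.title,
    last_modified_date = excluded.last_modified_date;
"""


def upsert_docket(conn: sqlite3.Connection, row: dict) -> None:
    """Insert the docket, or update it in place if docket_id already exists."""
    with conn:  # commits on success, rolls back on exception
        conn.execute(UPSERT, row)


def open_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema applied."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn
```

Re-fetching the same docket then rewrites the existing row and keeps its `id`, which is what lets other tables hold a stable reference to `dockets.id`.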
Success determiner
Command
set -euo pipefail
ssh josh 'docker exec josh-ingester josh ingest regulations-dot-gov-dockets --since 2026-04-01 --max 50'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
SELECT COUNT(*) FROM dockets
WHERE last_modified_date >= '\''2026-04-01'\''
AND source_url IS NOT NULL AND source_org IS NOT NULL
AND retrieved_at IS NOT NULL AND raw_sha256 IS NOT NULL
AND citation_string IS NOT NULL;"'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db \
"SELECT COUNT(*) FROM dockets_fts WHERE dockets_fts MATCH '\''rulemaking'\'';"'
Expect
Smoke run only. Full backfill scope is an open question (full ~1.5M historical dockets vs current-window only); see clarifications_needed. An api.data.gov key with elevated quota is required for any meaningful backfill.
Clarifications needed
- Backfill scope: full ~1.5M historical dockets (~2003-present) vs last-N-years window vs active-only. Active-only is ~10K-30K, easily backfilled. Full historical at the elevated 1,000/hr quota is ~62 days continuous. Lean: last 5 years + all currently-active = ~150K dockets, ~6 days at quota.
- API key procurement: api.data.gov key is free but requires registration and per-key throttle agreement. Confirm we have the key in `.kamal/secrets` before starting backfill (env var `REGULATIONS_GOV_API_KEY`).
- Document attachments (PDFs of supporting analyses): lazy-fetch on first agent request, or eager for active dockets only? Lean: lazy — most attachments aren't queried.
- Defensive retry strategy concretely: the API is fronted by api.data.gov's api-umbrella gateway (live headers show `via: api-umbrella (ApacheTrafficServer)`, `x-api-umbrella-request-id`, `x-vcap-request-id`, `x-ratelimit-*`; no Cloudflare markers). Throttling surfaces as HTTP 429 with `x-ratelimit-*` + `Retry-After`. Plan exponential backoff with jitter keyed on 429; max retries TBD.
- Comment metadata (count + deadline) — refresh policy. Once per docket per day for active dockets? Once per week for closed-comment-period dockets? Lean: daily for active (`commentEndDate >= today`), weekly otherwise.
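The backfill durations quoted in the first clarification can be reproduced with simple arithmetic. This sketch assumes one API request per docket, which is what the ~62 day and ~6 day figures imply; the `requests_per_docket` parameter is a hypothetical knob for checking how the estimate moves if each docket costs more than one call.

```python
def backfill_days(
    dockets: int,
    requests_per_hour: int = 1000,  # elevated quota quoted above
    requests_per_docket: int = 1,   # assumption behind the quoted estimates
) -> float:
    """Days of continuous fetching needed at a fixed hourly quota."""
    hours = dockets * requests_per_docket / requests_per_hour
    return hours / 24


for label, count in [
    ("full historical", 1_500_000),
    ("last 5 years + active", 150_000),
]:
    print(f"{label}: {backfill_days(count):.2f} days")
```

If the fetch step issues two calls per docket (detail plus documents list, as described in the Plan), passing `requests_per_docket=2` doubles both figures, which may be worth factoring into the scope decision.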
Out of scope
- Public comment text (hundreds of GB to TB) — explicitly v2 per `https://docs.usejosh.com/josh-data-sources/` deferral list. Per-comment-text fetch is on-demand only.
- Scraping regulations.gov website (API is the right path).
- State-level regulations.gov analogs (CA Reg.gov, NY etc.) — Phase 6+.
- Real-time webhooks / push subscriptions — surface-layer concern.
- Comment submission / writing back to regulations.gov — read-only ingest.
Dependencies
Plan
Source module follows the standard Source protocol. Per-source spec at https://docs.usejosh.com/sources/regulations-dot-gov-dockets/.
Discover/fetch/parse/load shape:
- discover: GET /v4/dockets?filter[lastModifiedDate][ge]={watermark}&page[size]=250,
paginated. Watermark from ingestion_source_state.last_seen_key=
'regulations-dot-gov-dockets-lastmod'.
- fetch: per-docket GET /v4/dockets/{docketId} for full detail; then
GET /v4/documents?filter[docketId]={docketId}&page[size]=250 for the
docket's documents list. JSON written to
/data/corpus/regulations-dot-gov-dockets/bodies/raw/{docketId}.json.
SHA256 the JSON for raw_sha256.
- parse: extract docket fields (title, agency, docket type, comment
deadlines, document count) into the row; extract per-document fields
into docket_documents. Markdown body = title + agency + summary +
documents list, written to bodies/markdown/.
- load: idempotent upsert on (docketId) natural key. Document rows
inserted in one transaction per docket. FTS5 sync triggers fire on
insert/update/delete. Cross-source FK to fr_documents via the
docket_id field on FR docs (resolved during the FR ingester's load
step or by a periodic re-resolution pass).
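The discover and fetch steps above can be sketched as two small helpers: one that builds the paginated discover URL and one that hashes the raw response for `raw_sha256`. This is an illustration under assumptions: the `https` scheme, the `page[number]` parameter used for pagination, and the helper names are not stated in this ticket, and the watermark is treated as an opaque string in whatever format the API expects.

```python
import hashlib
from urllib.parse import urlencode

# Host taken from the integration-test task; the scheme is an assumption.
BASE_URL = "https://api.regulations.gov"


def discover_url(watermark: str, page_number: int = 1, page_size: int = 250) -> str:
    """Build the discover request URL for one page of changed dockets."""
    params = {
        "filter[lastModifiedDate][ge]": watermark,
        "page[size]": page_size,
        "page[number]": page_number,  # assumed pagination parameter
    }
    # urlencode percent-encodes the brackets and the watermark value.
    return f"{BASE_URL}/v4/dockets?{urlencode(params)}"


def raw_sha256(body: bytes) -> str:
    """Hex digest of the raw JSON bytes exactly as received."""
    return hashlib.sha256(body).hexdigest()
```

Hashing the bytes as received, rather than a re-serialized dict, keeps `raw_sha256` stable across JSON library versions and key-ordering differences.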
Migration shape: dockets (with the six citation columns + docket_id
UNIQUE), docket_documents (PK (docket_id, document_id)), dockets_fts
virtual + 3 sync triggers, fr_document_dockets join
table (added in the FR migration but documented here for completeness).
No *_chunks_vec0 — docket metadata is structured, not body text.
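One plausible shape for the `dockets_fts` virtual table and its three sync triggers is the external-content pattern from the SQLite FTS5 documentation, sketched below. The trimmed `dockets` table, the denormalized `document_titles` column on it, the trigger names, and the `search` helper are assumptions for illustration; the real migration would go through Alembic and carry the full column set. It also assumes the SQLite build has FTS5 enabled.

```python
import sqlite3

MIGRATION = """
CREATE TABLE dockets (
    id INTEGER PRIMARY KEY,
    docket_id TEXT NOT NULL UNIQUE,
    title TEXT NOT NULL,
    agency TEXT NOT NULL,
    document_titles TEXT NOT NULL DEFAULT ''  -- assumed denormalized column
);

-- External-content index: the text lives in dockets, FTS5 stores only the index.
CREATE VIRTUAL TABLE dockets_fts USING fts5(
    title, agency, document_titles,
    content='dockets', content_rowid='id'
);

CREATE TRIGGER dockets_ai AFTER INSERT ON dockets BEGIN
    INSERT INTO dockets_fts(rowid, title, agency, document_titles)
    VALUES (new.id, new.title, new.agency, new.document_titles);
END;

CREATE TRIGGER dockets_ad AFTER DELETE ON dockets BEGIN
    INSERT INTO dockets_fts(dockets_fts, rowid, title, agency, document_titles)
    VALUES ('delete', old.id, old.title, old.agency, old.document_titles);
END;

CREATE TRIGGER dockets_au AFTER UPDATE ON dockets BEGIN
    INSERT INTO dockets_fts(dockets_fts, rowid, title, agency, document_titles)
    VALUES ('delete', old.id, old.title, old.agency, old.document_titles);
    INSERT INTO dockets_fts(rowid, title, agency, document_titles)
    VALUES (new.id, new.title, new.agency, new.document_titles);
END;
"""


def search(conn: sqlite3.Connection, query: str) -> list:
    """Return matching dockets.id values, best BM25 match first."""
    rows = conn.execute(
        "SELECT rowid FROM dockets_fts WHERE dockets_fts MATCH ? "
        "ORDER BY bm25(dockets_fts)",
        (query,),
    )
    return [row[0] for row in rows]
```

`bm25()` returns lower values for better matches, so the ascending `ORDER BY` puts the most relevant docket first.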
Schedule: every 4 hours. Active dockets see comment count and document
list updates throughout the day; daily polling misses the per-day
deadline-approaching signal that's high-value for users. Off-hours, the
delta is small and the API quota cost is negligible.
Defensive retry: the API is fronted by api.data.gov's api-umbrella
gateway (not Cloudflare; live headers show via: api-umbrella
(ApacheTrafficServer), x-api-umbrella-request-id, x-ratelimit-*).
Throttling surfaces as HTTP 429 with Retry-After + x-ratelimit-*;
respect Retry-After, exponential backoff with jitter keyed on 429,
terminate after 5 retries and log to ingestion_logs.
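The retry policy described above could look roughly like the sketch below. It is not the ingester's actual code: the `fetch` callable contract, the base and cap values, and the use of full jitter are illustrative assumptions, and `Retry-After` is assumed to be already parsed into seconds. Only 429 is retried; any other status is handed back to the caller.

```python
import random
import time
from typing import Callable, Optional, Tuple

MAX_RETRIES = 5  # from the plan: terminate after 5 retries

# fetch() is a hypothetical callable returning
# (status_code, retry_after_seconds_or_None, body_or_None).
FetchResult = Tuple[int, Optional[float], Optional[bytes]]


def backoff_delay(
    attempt: int,
    retry_after: Optional[float],
    base: float = 2.0,   # placeholder value
    cap: float = 60.0,   # placeholder value
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait before retry number `attempt` (1-based)."""
    exponential = min(cap, base * 2 ** (attempt - 1))
    jittered = exponential * rng()  # full jitter: uniform in [0, exponential)
    # Never retry sooner than the gateway asked for.
    return max(jittered, retry_after or 0.0)


def fetch_with_retry(
    fetch: Callable[[], FetchResult],
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> Optional[Tuple[int, Optional[bytes]]]:
    """Return (status, body) for the first non-429 response, or None once
    MAX_RETRIES throttled retries are exhausted."""
    retries = 0
    while True:
        status, retry_after, body = fetch()
        if status != 429:
            return status, body
        if retries == MAX_RETRIES:
            return None  # terminal failure: caller logs it and moves on
        retries += 1
        sleep(backoff_delay(retries, retry_after, rng=rng))
```

Injecting `sleep` and `rng` keeps the policy testable without real waits; a caller would treat a `None` result as the terminal failure to record in `ingestion_logs` before continuing with the next docket.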
Tasks
0 of 16 done.
- t1 Alembic migration applied locally + production (dockets + docket_documents + fr_document_dockets junction + FTS5)
- t2 Pydantic models in shared/josh_substrate/src/josh_substrate/models/dockets.py
- t3 Source module implementing the `Source` protocol
- t4 Body normalization wired (Markdown body = title + agency + documents list; raw JSON cached)
- t5 Citation metadata: all 6 columns populated; citation_for() formatter + 3-fixture unit tests
- t6 FTS5 virtual table + 3 sync triggers wired into migration
- t7 N/A — no body chunks, no vector pipeline
- t8 Unit tests for parser against captured fixtures (active docket, closed-comments docket, multi-document docket)
- t9 Integration test for the full discover→fetch→parse→load pipeline (with mocked api.regulations.gov)
- t10 Smoke backfill (50 dockets) clean per success_determiner
- t11 Production backfill at agreed scope (per clarifications_needed) complete
- t12 4-hourly cron registered and observed firing once
- t13 No recurring errors in `ingestion_logs` after 24h (transient 429 throttle retries are fine)
- t14 Source doc updated; inventory rows in https://docs.usejosh.com/josh-data-sources/ + https://docs.usejosh.com/data-status/ flipped to ingestion-built
- t15 api.data.gov elevated quota requested + key in `.kamal/secrets` (env: `REGULATIONS_GOV_API_KEY`)
- t16 Cross-source FK test: smoke-backfill cohort of FR docs that declare a docket_id all resolve to a `dockets.id` row (or queue retry)
Changelog
No history yet.