Regulations.gov dockets ingester

Why

When a federal agency proposes or finalizes a regulation, it opens a
docket on regulations.gov for public comment. Each docket is the
lifecycle container for a single rulemaking — the regulatory document(s),
supporting analyses, comment count, and deadline. For Josh, dockets give
us the bridge between Federal Register documents and the broader
rulemaking proceeding: every FR rule cites a docket_id, and many
dockets house multiple FR documents over their lifecycle (proposed rule
→ final rule → correction). Comment counts and deadlines are also
high-signal for "what's open for comment now" agent queries. Comment
text itself is deferred to v2 (hundreds of GB to TB).

User stories

As a policy analyst using my own AI agent, I want to ask "what regulations are open for comment in EPA Air Quality this month, with deadlines" so that I can prep stakeholder comment letters before the window closes.

As an OSS self-hoster, I want to run `josh ingest regulations-dot-gov-dockets` on a fresh substrate so that I can populate the docket metadata + document list from regulations.gov on my own instance.

As the Federal Register ingester (cross-source consumer), I want to FK a `fr_documents.docket_id` to a real `dockets.id` row so that agents can pivot from "this rule" to "the full rulemaking docket" in one hop.

Acceptance criteria (EARS)

When `josh ingest regulations-dot-gov-dockets --since 2026-01-01` is run on the deployed substrate, the system shall populate `dockets` rows whose six citation columns are all non-null.
While the ingester is running, if a single docket fetch returns 429 (api.data.gov api-umbrella throttle, signalled by `x-ratelimit-*` headers and `Retry-After`), then the system shall retry with exponential backoff up to ~5 min and continue with the next record on terminal failure.
When a docket's documents list is fetched, the system shall populate `docket_documents` rows with one row per document (proposed rule, final rule, supporting docs, etc.) — comment text rows shall NOT be inserted in v1.
Where an FR document declares a `docket_id`, the system shall resolve it to a `dockets.id` and write the junction row in `fr_document_dockets` (one docket can hold many FR docs; one FR doc is in exactly one docket, so the relationship is 1:N rather than M:N).
The system shall enforce a UNIQUE constraint on `dockets.docket_id` (e.g. `EPA-HQ-OAR-2025-0001`) so re-fetching the same docket upserts rather than duplicates.
When an FTS5 query is executed against `dockets_fts`, the system shall return BM25-ranked results over `(title, agency, document_titles)`.
When an incremental run completes, the system shall update `ingestion_source_state.last_seen_key` to the highest `lastModifiedDate` observed so the next run only fetches changed dockets.
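
The watermark rule in the last criterion can be sketched as a small pure function. This is only an illustration: the name `next_watermark` is hypothetical, and it assumes `lastModifiedDate` values arrive as ISO 8601 strings in one consistent format and timezone, so that string order matches time order.

```python
from typing import Iterable, Optional


def next_watermark(current: Optional[str], observed: Iterable[str]) -> Optional[str]:
    """Return the highest lastModifiedDate seen, never moving backwards.

    Assumption: all timestamps share one ISO 8601 format and timezone,
    so lexicographic comparison is also chronological comparison.
    """
    candidates = [ts for ts in observed if ts]  # skip missing/empty values
    if current:
        candidates.append(current)  # keep the old watermark as a floor
    return max(candidates) if candidates else None
```

Keeping the previous value as a floor means a run that sees only older records (or none) leaves `last_seen_key` unchanged.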

Success determiner

kind: bash

Command

set -euo pipefail
ssh josh 'docker exec josh-ingester josh ingest regulations-dot-gov-dockets --since 2026-04-01 --max 50'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
  SELECT COUNT(*) FROM dockets
  WHERE last_modified_date >= '\''2026-04-01'\''
    AND source_url IS NOT NULL AND source_org IS NOT NULL
    AND retrieved_at IS NOT NULL AND raw_sha256 IS NOT NULL
    AND citation_string IS NOT NULL;"'
ssh josh 'docker exec josh-core sqlite3 /data/josh.db \
  "SELECT COUNT(*) FROM dockets_fts WHERE dockets_fts MATCH '\''rulemaking'\'';"'

Expect

>= 50 fully-populated rows; FTS5 match returns >= 1.

Smoke run only. Full backfill scope is an open question (full ~1.5M historical dockets vs current-window only); see Clarifications needed. The api.data.gov key with elevated quota is required for any meaningful backfill.

Clarifications needed

Backfill scope: full ~1.5M historical dockets (~2003-present) vs last-N-years window vs active-only. Active-only is ~10K-30K, easily backfilled. Full historical at the elevated 1,000/hr quota is ~62 days continuous. Lean: last 5 years + all currently-active = ~150K dockets, ~6 days at quota.
API key procurement: api.data.gov key is free but requires registration and per-key throttle agreement. Confirm we have the key in `.kamal/secrets` before starting backfill (env var `REGULATIONS_GOV_API_KEY`).
Document attachments (PDFs of supporting analyses): lazy-fetch on first agent request, or eager for active dockets only? Lean: lazy — most attachments aren't queried.
Defensive retry strategy, concretely: the API is fronted by api.data.gov's api-umbrella gateway (live headers show `via: api-umbrella (ApacheTrafficServer)`, `x-api-umbrella-request-id`, `x-vcap-request-id`, `x-ratelimit-*`; no Cloudflare markers). Throttling surfaces as HTTP 429 with `x-ratelimit-*` + `Retry-After`. Plan exponential backoff with jitter keyed on 429; max retries TBD (the Plan section proposes 5).
Comment metadata (count + deadline) — refresh policy. Once per docket per day for active dockets? Once per week for closed-comment-period dockets? Lean: daily for active (`commentEndDate >= today`), weekly otherwise.
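
The backfill durations in the first item above can be checked with a few lines of arithmetic. A rough sketch: `backfill_days` is a hypothetical helper, and the `requests_per_docket` parameter is an assumption added here because the Plan's fetch step issues two calls per docket (detail plus documents list). If both calls count against the quota, the estimates roughly double.

```python
HOURLY_QUOTA = 1_000  # elevated api.data.gov quota cited in the backfill item


def backfill_days(dockets: int, requests_per_docket: int = 1,
                  hourly_quota: int = HOURLY_QUOTA) -> float:
    """Days of continuous fetching needed at the given hourly quota."""
    return dockets * requests_per_docket / hourly_quota / 24


print(f"{backfill_days(1_500_000):.2f}")   # full history, one request each: 62.50
print(f"{backfill_days(150_000):.2f}")     # 5-year window + active: 6.25
print(f"{backfill_days(150_000, 2):.2f}")  # same window, two requests each: 12.50
```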

Out of scope

Public comment text (hundreds of GB to TB) — explicitly v2 per `https://docs.usejosh.com/josh-data-sources/` deferral list. Per-comment-text fetch is on-demand only.
Scraping regulations.gov website (API is the right path).
State-level regulations.gov analogs (CA Reg.gov, NY etc.) — Phase 6+.
Real-time webhooks / push subscriptions — surface-layer concern.
Comment submission / writing back to regulations.gov — read-only ingest.

Dependencies

Plan

Source module follows the standard Source protocol. Per-source spec at
https://docs.usejosh.com/sources/regulations-dot-gov-dockets/.

Discover/fetch/parse/load shape:
- discover: GET /v4/dockets?filter[lastModifiedDate][ge]={watermark}&page[size]=250,
paginated. Watermark from ingestion_source_state.last_seen_key = 'regulations-dot-gov-dockets-lastmod'.
- fetch: per-docket GET /v4/dockets/{docketId} for full detail; then
GET /v4/documents?filter[docketId]={docketId}&page[size]=250 for the
docket's documents list. JSON written to
/data/corpus/regulations-dot-gov-dockets/bodies/raw/{docketId}.json.
SHA256 the JSON for raw_sha256.
- parse: extract docket fields (title, agency, docket type, comment
deadlines, document count) into the row; extract per-document fields
into docket_documents. Markdown body = title + agency + summary +
documents list, written to bodies/markdown/.
- load: idempotent upsert on (docketId) natural key. Document rows
inserted in one transaction per docket. FTS5 sync triggers fire on
insert/update/delete. Cross-source FK to fr_documents via the
docket_id field on FR docs (resolved during the FR ingester's load
step or by a periodic re-resolution pass).
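
As an illustration of the discover and fetch steps above, the following sketch builds the discover URL and hashes a raw response body. The host `https://api.regulations.gov`, the `page[number]` parameter, and both function names are assumptions for illustration; only the path, the `filter[lastModifiedDate][ge]` filter, and `page[size]=250` come from the plan.

```python
import hashlib
from urllib.parse import urlencode

BASE_URL = "https://api.regulations.gov"  # assumed host for the v4 API


def discover_url(watermark: str, page_number: int = 1, page_size: int = 250) -> str:
    """Build the paginated discover request for dockets changed since `watermark`."""
    params = {
        "filter[lastModifiedDate][ge]": watermark,
        "page[size]": page_size,
        "page[number]": page_number,  # assumed pagination parameter
    }
    # urlencode percent-encodes the brackets; servers decode them back.
    return f"{BASE_URL}/v4/dockets?{urlencode(params)}"


def raw_sha256(body: bytes) -> str:
    """Hex digest of the raw JSON bytes exactly as written to bodies/raw/."""
    return hashlib.sha256(body).hexdigest()
```

Hashing the bytes as stored, rather than a re-serialized object, keeps `raw_sha256` reproducible from the file on disk.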

Migration shape: dockets (with the six citation columns + docket_id
UNIQUE), docket_documents (PK (docket_id, document_id)),
dockets_fts virtual + 3 sync triggers, fr_document_dockets join
table (added in the FR migration but documented here for completeness).
No *_chunks_vec0 — docket metadata is structured, not body text.
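
One possible reading of the migration shape, run against an in-memory SQLite database. This is a sketch under assumptions, not the actual migration: the column list is trimmed, `document_titles` is assumed to be a denormalized text column, the trigger names are made up, and the external-content FTS5 layout is one common pattern rather than a confirmed design choice. It needs an SQLite build with FTS5 and upsert support (3.24+).

```python
import sqlite3

SCHEMA = """
CREATE TABLE dockets (
    id              INTEGER PRIMARY KEY,
    docket_id       TEXT NOT NULL UNIQUE,  -- natural key, e.g. EPA-HQ-OAR-2025-0001
    title           TEXT,
    agency          TEXT,
    document_titles TEXT,                  -- assumed: denormalized for FTS
    raw_sha256      TEXT                   -- other citation columns omitted
);
CREATE TABLE docket_documents (
    docket_id   TEXT NOT NULL REFERENCES dockets(docket_id),
    document_id TEXT NOT NULL,
    title       TEXT,
    PRIMARY KEY (docket_id, document_id)
);
CREATE VIRTUAL TABLE dockets_fts USING fts5(
    title, agency, document_titles,
    content='dockets', content_rowid='id'
);
-- Three sync triggers keep the index in step with the base table.
CREATE TRIGGER dockets_ai AFTER INSERT ON dockets BEGIN
    INSERT INTO dockets_fts(rowid, title, agency, document_titles)
    VALUES (new.id, new.title, new.agency, new.document_titles);
END;
CREATE TRIGGER dockets_ad AFTER DELETE ON dockets BEGIN
    INSERT INTO dockets_fts(dockets_fts, rowid, title, agency, document_titles)
    VALUES ('delete', old.id, old.title, old.agency, old.document_titles);
END;
CREATE TRIGGER dockets_au AFTER UPDATE ON dockets BEGIN
    INSERT INTO dockets_fts(dockets_fts, rowid, title, agency, document_titles)
    VALUES ('delete', old.id, old.title, old.agency, old.document_titles);
    INSERT INTO dockets_fts(rowid, title, agency, document_titles)
    VALUES (new.id, new.title, new.agency, new.document_titles);
END;
"""

UPSERT = """
INSERT INTO dockets (docket_id, title, agency, document_titles, raw_sha256)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT(docket_id) DO UPDATE SET
    title = excluded.title,
    agency = excluded.agency,
    document_titles = excluded.document_titles,
    raw_sha256 = excluded.raw_sha256
"""


def upsert_docket(conn, docket_id, title, agency, document_titles, raw_sha256):
    """Insert or update one docket in its own transaction."""
    with conn:
        conn.execute(UPSERT, (docket_id, title, agency, document_titles, raw_sha256))


def search(conn, query):
    """Return docket_ids matching `query`, best BM25 score first."""
    rows = conn.execute(
        "SELECT d.docket_id FROM dockets_fts "
        "JOIN dockets AS d ON d.id = dockets_fts.rowid "
        "WHERE dockets_fts MATCH ? ORDER BY bm25(dockets_fts)",
        (query,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
upsert_docket(conn, "EPA-HQ-OAR-2025-0001", "Draft particulate standard", "EPA", "", "aaa")
upsert_docket(conn, "EPA-HQ-OAR-2025-0001", "National ozone standard", "EPA", "Proposed rule", "bbb")
```

The second `upsert_docket` call hits the UNIQUE constraint and updates the existing row, which fires the update trigger and replaces the old index entry, so a search for the old title no longer matches.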

Schedule: every 4 hours. Active dockets see comment count and document
list updates throughout the day; a single daily poll would miss same-day
changes as a comment deadline approaches, which is a high-value signal
for users. Off-hours, the delta is small and the API quota cost is
negligible.

Defensive retry: the API is fronted by api.data.gov's api-umbrella
gateway (not Cloudflare — live headers show via: api-umbrella (ApacheTrafficServer), x-api-umbrella-request-id, x-ratelimit-*).
Throttling surfaces as HTTP 429 with Retry-After + x-ratelimit-*;
respect Retry-After, exponential backoff with jitter keyed on 429,
terminate after 5 retries and log to ingestion_logs.
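
A minimal sketch of the retry loop described above, written against an injected `fetch` callable so it stays independent of any HTTP client. The base delay, the 300-second cap, the "full jitter" variant, and all names here are assumptions; only the 429 trigger, the Retry-After precedence, and the 5-retry limit come from the plan.

```python
import random
import time
from typing import Callable, Mapping, Optional, Tuple

MAX_RETRIES = 5      # from the plan above
BASE_DELAY_S = 2.0   # assumed starting delay
MAX_DELAY_S = 300.0  # assumed cap (~5 min)

Response = Tuple[int, Mapping[str, str], bytes]  # (status, headers, body)


def backoff_delay(attempt: int, retry_after: Optional[str]) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None and retry_after.strip().isdigit():
        # Retry-After in delta-seconds form; the HTTP-date form is not handled.
        return min(float(retry_after), MAX_DELAY_S)
    ceiling = min(BASE_DELAY_S * (2 ** attempt), MAX_DELAY_S)
    return random.uniform(0, ceiling)  # full jitter: anywhere up to the ceiling


def fetch_with_retry(fetch: Callable[[], Response],
                     sleep: Callable[[float], None] = time.sleep) -> Optional[Response]:
    """Call `fetch` until it returns a non-429 status or retries run out.

    Returns None on terminal failure so the caller can log the record to
    ingestion_logs and continue with the next one.
    """
    for attempt in range(MAX_RETRIES + 1):  # one initial try plus MAX_RETRIES
        status, headers, body = fetch()
        if status != 429:
            return status, headers, body
        if attempt == MAX_RETRIES:
            break
        # Assumes the mapping exposes the header under this exact casing.
        sleep(backoff_delay(attempt, headers.get("Retry-After")))
    return None
```

Passing `sleep` as a parameter lets tests record delays instead of waiting, and returning `None` rather than raising matches the "continue with the next record" behavior in the acceptance criteria.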

Tasks

0 of 16 done.

Changelog

No history yet.