source · planned · p0

Regulations.gov dockets ingester

regulations-dot-gov-dockets-ingester · updated 2026-05-29T00:00:00Z


When a federal agency proposes or finalizes a regulation, it opens a
docket on regulations.gov for public comment. Each docket is the
lifecycle container for a single rulemaking — the regulatory document(s),
supporting analyses, comment count, and deadline. For Josh, dockets give
us the bridge between Federal Register documents and the broader
rulemaking proceeding: every FR rule cites a docket_id, and many
dockets house multiple FR documents over their lifecycle (proposed rule
→ final rule → correction). Comment counts and deadlines are also
high-signal for "what's open for comment now" agent queries. Comment
text itself is deferred to v2 (hundreds of GB to TB).

As a policy analyst using my own AI agent, I want to ask "what regulations are open for comment in EPA Air Quality this month, with deadlines" so that I can prep stakeholder comment letters before the window closes.

As an OSS self-hoster, I want to run `josh ingest regulations-dot-gov-dockets` on a fresh substrate so that I can populate the docket metadata + document list from regulations.gov on my own instance.

As the Federal Register ingester (cross-source consumer), I want to FK a `fr_documents.docket_id` to a real `dockets.id` row so that agents can pivot from "this rule" to "the full rulemaking docket" in one hop.

  1. When `josh ingest regulations-dot-gov-dockets --since 2026-01-01` is run on the deployed substrate, the system shall populate `dockets` rows whose six citation columns are all non-null.
  2. While the ingester is running, if a single docket fetch returns 429 (api.data.gov api-umbrella throttle, signalled by `x-ratelimit-*` headers and `Retry-After`), then the system shall retry with exponential backoff up to ~5 min and continue with the next record on terminal failure.
  3. When a docket's documents list is fetched, the system shall populate `docket_documents` rows with one row per document (proposed rule, final rule, supporting docs, etc.) — comment text rows shall NOT be inserted in v1.
  4. Where an FR document declares a `docket_id`, the system shall resolve it to a `dockets.id` and write the M:N junction row in `fr_document_dockets` (one docket can hold many FR docs, and an FR doc may cite more than one docket).
  5. The system shall enforce a UNIQUE constraint on `dockets.docket_id` (e.g. `EPA-HQ-OAR-2025-0001`) so re-fetching the same docket upserts rather than duplicates.
  6. When an FTS5 query is executed against `dockets_fts`, the system shall return BM25-ranked results over `(title, agency, document_titles)`.
  7. When an incremental run completes, the system shall update `ingestion_source_state.last_seen_key` to the highest `lastModifiedDate` observed so the next run only fetches changed dockets.
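
Criterion 4 can be illustrated with a minimal sketch using Python's standard `sqlite3` module. The table layout below is an assumption for illustration only: the column names `fr_document_id` and `docket_pk`, and the single `fr_documents.docket_id` column, are hypothetical and the real migration may differ.

```python
import sqlite3

# Hypothetical minimal schema; only the columns needed for resolution are shown.
DEMO_SCHEMA = """
CREATE TABLE dockets (
    id INTEGER PRIMARY KEY,
    docket_id TEXT NOT NULL UNIQUE
);
CREATE TABLE fr_documents (
    id INTEGER PRIMARY KEY,
    docket_id TEXT
);
CREATE TABLE fr_document_dockets (
    fr_document_id INTEGER NOT NULL REFERENCES fr_documents(id),
    docket_pk INTEGER NOT NULL REFERENCES dockets(id),
    PRIMARY KEY (fr_document_id, docket_pk)
);
"""


def resolve_fr_dockets(conn: sqlite3.Connection) -> list:
    """Write junction rows for resolvable FR docs; return the unresolved ones.

    The return value is a list of (fr_documents.id, declared docket_id) pairs
    that had no matching dockets row, so the caller can queue a retry.
    """
    # INSERT OR IGNORE keeps the pass idempotent: re-running it never
    # duplicates a junction row thanks to the composite primary key.
    conn.execute(
        """
        INSERT OR IGNORE INTO fr_document_dockets (fr_document_id, docket_pk)
        SELECT f.id, d.id
        FROM fr_documents AS f
        JOIN dockets AS d ON d.docket_id = f.docket_id
        """
    )
    unresolved = conn.execute(
        """
        SELECT f.id, f.docket_id
        FROM fr_documents AS f
        LEFT JOIN dockets AS d ON d.docket_id = f.docket_id
        WHERE f.docket_id IS NOT NULL AND d.id IS NULL
        ORDER BY f.id
        """
    ).fetchall()
    conn.commit()
    return unresolved
```

Because the pass is idempotent, it could run either inside the FR ingester's load step or as a periodic re-resolution job without extra bookkeeping.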
kind: bash

Command

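set -euo pipefail
# 1. Ingest a small smoke cohort (at most 50 dockets modified since 2026-04-01).
ssh josh 'docker exec josh-ingester josh ingest regulations-dot-gov-dockets --since 2026-04-01 --max 50'
# 2. Count rows in the window whose citation columns are populated.
#    The date literal is written with '\'' so its SQL quotes survive the local
#    single-quoted string; a doubled '' would strip them and compare against 2026-04-01 as arithmetic.
#    Note: five citation columns are checked here; criterion 1 requires six.
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
  SELECT COUNT(*) FROM dockets
  WHERE last_modified_date >= '\''2026-04-01'\''
    AND source_url IS NOT NULL AND source_org IS NOT NULL
    AND retrieved_at IS NOT NULL AND raw_sha256 IS NOT NULL
    AND citation_string IS NOT NULL;"'
# 3. Confirm the FTS5 index returns at least one match.
ssh josh 'docker exec josh-core sqlite3 /data/josh.db \
  "SELECT COUNT(*) FROM dockets_fts WHERE dockets_fts MATCH '\''rulemaking'\'';"'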

Expect

>= 50 fully-populated rows; FTS5 match returns >= 1.

Smoke run only. Full backfill scope is an open question (full ~1.5M historical dockets vs current-window only); see clarifications_needed. An api.data.gov key with elevated quota is required for any meaningful backfill.

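  • Backfill scope: full ~1.5M historical dockets (~2003-present) vs last-N-years window vs active-only. Active-only is ~10K-30K, easily backfilled. Full historical at the elevated 1,000/hr quota is ~62 days continuous, counting one request per docket. Lean: last 5 years + all currently-active = ~150K dockets, ~6 days at quota on the same basis. The fetch step in the design issues at least two requests per docket (detail + documents list), so both figures roughly double in practice.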
  • API key procurement: api.data.gov key is free but requires registration and per-key throttle agreement. Confirm we have the key in `.kamal/secrets` before starting backfill (env var `REGULATIONS_GOV_API_KEY`).
  • Document attachments (PDFs of supporting analyses): lazy-fetch on first agent request, or eager for active dockets only? Lean: lazy — most attachments aren't queried.
  • Defensive retry strategy concretely: the API is fronted by api.data.gov's api-umbrella gateway (live headers show `via: api-umbrella (ApacheTrafficServer)`, `x-api-umbrella-request-id`, `x-vcap-request-id`, `x-ratelimit-*`; no Cloudflare markers). Throttling surfaces as HTTP 429 with `x-ratelimit-*` + `Retry-After`. Plan: exponential backoff with jitter keyed on 429, honoring `Retry-After`. The retry cap is still open: the design notes propose 5 retries, while acceptance criterion 2 bounds the total at ~5 min.
  • Comment metadata (count + deadline) — refresh policy. Once per docket per day for active dockets? Once per week for closed-comment-period dockets? Lean: daily for active (`commentEndDate >= today`), weekly otherwise.

Out of scope:

  • Public comment text (hundreds of GB to TB) — explicitly v2 per `https://docs.usejosh.com/josh-data-sources/` deferral list. Per-comment-text fetch is on-demand only.
  • Scraping regulations.gov website (API is the right path).
  • State-level regulations.gov analogs (CA Reg.gov, NY etc.) — Phase 6+.
  • Real-time webhooks / push subscriptions — surface-layer concern.
  • Comment submission / writing back to regulations.gov — read-only ingest.
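
The backfill estimates in the open questions above can be reproduced with a few lines of arithmetic. This is only a sketch: the `requests_per_docket` parameter is a hypothetical knob added here to show how the figures change once the documents-list call is counted alongside the docket-detail call.

```python
def backfill_days(dockets: int, requests_per_docket: int = 1,
                  quota_per_hour: int = 1000) -> float:
    """Days of continuous fetching needed at a fixed hourly request quota."""
    hours = dockets * requests_per_docket / quota_per_hour
    return hours / 24


# Full historical corpus, one request per docket.
print(backfill_days(1_500_000))                       # 62.5
# Lean scope (last 5 years + active), one request per docket.
print(backfill_days(150_000))                         # 6.25
# Lean scope with detail + documents-list requests per docket.
print(backfill_days(150_000, requests_per_docket=2))  # 12.5
```

Discover-page requests and retries after 429s add to these totals, so they are best read as lower bounds.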

Source module follows the standard Source protocol. Per-source spec at
https://docs.usejosh.com/sources/regulations-dot-gov-dockets/.

Discover/fetch/parse/load shape:
- discover: GET /v4/dockets?filter[lastModifiedDate][ge]={watermark}&page[size]=250,
paginated. Watermark from ingestion_source_state.last_seen_key =
'regulations-dot-gov-dockets-lastmod'.
- fetch: per-docket GET /v4/dockets/{docketId} for full detail; then
GET /v4/documents?filter[docketId]={docketId}&page[size]=250 for the
docket's documents list. JSON written to
/data/corpus/regulations-dot-gov-dockets/bodies/raw/{docketId}.json.
SHA256 the JSON for raw_sha256.
- parse: extract docket fields (title, agency, docket type, comment
deadlines, document count) into the row; extract per-document fields
into docket_documents. Markdown body = title + agency + summary +
documents list, written to bodies/markdown/.
- load: idempotent upsert on (docketId) natural key. Document rows
inserted in one transaction per docket. FTS5 sync triggers fire on
insert/update/delete. Cross-source FK to fr_documents via the
docket_id field on FR docs (resolved during the FR ingester's load
step or by a periodic re-resolution pass).
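
The discover and fetch steps above can be sketched with standard-library helpers. Several details are assumptions rather than confirmed behavior of the ingester: the `page[number]` and `sort` parameters, the watermark timestamp format, and the JSON:API response shape (`data[].attributes.lastModifiedDate`) should all be checked against the regulations.gov v4 documentation. The function names are hypothetical.

```python
import hashlib
from urllib.parse import urlencode

API_BASE = "https://api.regulations.gov"  # host named in task t9


def discover_url(watermark: str, page_number: int = 1, page_size: int = 250) -> str:
    """Build one page of the discover query; brackets are percent-encoded."""
    params = {
        "filter[lastModifiedDate][ge]": watermark,
        "page[size]": page_size,
        "page[number]": page_number,   # assumed pagination parameter
        "sort": "lastModifiedDate",    # assumed: oldest first, so the watermark only moves forward
    }
    return f"{API_BASE}/v4/dockets?{urlencode(params)}"


def raw_sha256(raw_body: bytes) -> str:
    """Hash the exact bytes written to bodies/raw/{docketId}.json."""
    return hashlib.sha256(raw_body).hexdigest()


def next_watermark(current: str, page: dict) -> str:
    """Return the highest lastModifiedDate seen, never moving backwards.

    Plain string comparison is only valid if every timestamp shares one
    fixed-width ISO-8601-style format, which is assumed here.
    """
    seen = [item["attributes"]["lastModifiedDate"] for item in page.get("data", [])]
    return max([current, *seen])
```

Hashing the raw bytes rather than a re-serialized dict keeps `raw_sha256` stable regardless of key ordering in the parser.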

Migration shape: dockets (with the six citation columns + docket_id
UNIQUE), docket_documents (PK (docket_id, document_id)),
dockets_fts virtual + 3 sync triggers, fr_document_dockets join
table (added in the FR migration but documented here for completeness).
No *_chunks_vec0 — docket metadata is structured, not body text.
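
The migration shape can be sketched against an in-memory SQLite database. This is not the actual Alembic migration: the six citation columns are omitted, `document_titles` is a hypothetical denormalized column feeding the FTS index, and the trigger names are made up. It assumes the SQLite build includes FTS5 and supports `ON CONFLICT ... DO UPDATE` (3.24+).

```python
import sqlite3

SCHEMA = """
CREATE TABLE dockets (
    id INTEGER PRIMARY KEY,
    docket_id TEXT NOT NULL UNIQUE,      -- e.g. EPA-HQ-OAR-2025-0001
    title TEXT,
    agency TEXT,
    document_titles TEXT                 -- hypothetical denormalized column
);
CREATE TABLE docket_documents (
    docket_id TEXT NOT NULL REFERENCES dockets(docket_id),
    document_id TEXT NOT NULL,
    title TEXT,
    PRIMARY KEY (docket_id, document_id)
);
-- External-content FTS5 table kept in sync by three triggers.
CREATE VIRTUAL TABLE dockets_fts USING fts5(
    title, agency, document_titles,
    content='dockets', content_rowid='id'
);
CREATE TRIGGER dockets_ai AFTER INSERT ON dockets BEGIN
    INSERT INTO dockets_fts(rowid, title, agency, document_titles)
    VALUES (new.id, new.title, new.agency, new.document_titles);
END;
CREATE TRIGGER dockets_ad AFTER DELETE ON dockets BEGIN
    INSERT INTO dockets_fts(dockets_fts, rowid, title, agency, document_titles)
    VALUES ('delete', old.id, old.title, old.agency, old.document_titles);
END;
CREATE TRIGGER dockets_au AFTER UPDATE ON dockets BEGIN
    INSERT INTO dockets_fts(dockets_fts, rowid, title, agency, document_titles)
    VALUES ('delete', old.id, old.title, old.agency, old.document_titles);
    INSERT INTO dockets_fts(rowid, title, agency, document_titles)
    VALUES (new.id, new.title, new.agency, new.document_titles);
END;
"""


def upsert_docket(conn, docket_id, title, agency, document_titles):
    """Idempotent load keyed on the docket_id natural key."""
    conn.execute(
        """
        INSERT INTO dockets (docket_id, title, agency, document_titles)
        VALUES (?, ?, ?, ?)
        ON CONFLICT(docket_id) DO UPDATE SET
            title = excluded.title,
            agency = excluded.agency,
            document_titles = excluded.document_titles
        """,
        (docket_id, title, agency, document_titles),
    )


def search_dockets(conn, query):
    """Return docket_ids for an FTS5 query, best BM25 match first."""
    rows = conn.execute(
        "SELECT rowid FROM dockets_fts WHERE dockets_fts MATCH ? "
        "ORDER BY bm25(dockets_fts)",
        (query,),
    ).fetchall()
    return [
        conn.execute("SELECT docket_id FROM dockets WHERE id = ?", (rowid,)).fetchone()[0]
        for (rowid,) in rows
    ]
```

With an external-content table, the update trigger has to delete the old index entry before inserting the new one; otherwise stale terms from the previous title would keep matching.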

Schedule: every 4 hours. Active dockets see comment-count and
document-list updates throughout the day, and a once-daily poll would
miss the deadline-approaching signal that is high-value for users.
Off-hours, the delta is small and the API quota cost is negligible.
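
One way the 4-hourly tick could coexist with the refresh lean from the open questions (daily for active dockets, weekly otherwise) is a per-docket due check on each run. The sketch below is an assumption about how that might look; `refresh_interval` and `is_due` are hypothetical names, and the refresh policy itself is not yet decided.

```python
from datetime import date, datetime, timedelta
from typing import Optional


def refresh_interval(comment_end_date: Optional[date], today: date) -> timedelta:
    """Daily while the comment window is open, weekly once it has closed."""
    if comment_end_date is not None and comment_end_date >= today:
        return timedelta(days=1)
    return timedelta(days=7)


def is_due(last_refreshed: datetime, comment_end_date: Optional[date],
           now: datetime) -> bool:
    """True when a docket's comment metadata should be re-fetched on this tick."""
    return now - last_refreshed >= refresh_interval(comment_end_date, now.date())
```

Dockets with no `commentEndDate` fall into the weekly bucket here, which is a design choice worth confirming.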

Defensive retry: the API is fronted by api.data.gov's api-umbrella
gateway (not Cloudflare; live headers show via: api-umbrella
(ApacheTrafficServer), x-api-umbrella-request-id, x-ratelimit-*).
Throttling surfaces as HTTP 429 with Retry-After + x-ratelimit-*;
respect Retry-After, exponential backoff with jitter keyed on 429,
terminate after 5 retries and log to ingestion_logs.
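
A minimal sketch of that retry loop, with the transport injected so it stays independent of any HTTP library. The names and the `(status, headers, body)` tuple are hypothetical. The `base` and `cap` defaults are assumptions chosen so the worst-case waits (10 + 20 + 40 + 80 + 160 s) total 310 s, close to the ~5 min in acceptance criterion 2. Only the seconds form of `Retry-After` is handled, and header lookup is assumed to use that exact casing.

```python
import random
import time
from typing import Callable, Mapping, Optional, Tuple

Response = Tuple[int, Mapping[str, str], bytes]  # (status, headers, body)


def backoff_delay(attempt: int, retry_after: Optional[str],
                  base: float = 10.0, cap: float = 160.0) -> float:
    """Seconds to wait before retry number `attempt` (1-based)."""
    if retry_after is not None and retry_after.strip().isdigit():
        # A numeric server hint takes precedence over our own schedule.
        return float(retry_after.strip())
    # Full jitter: uniform between 0 and the capped exponential bound.
    return random.uniform(0.0, min(cap, base * 2 ** (attempt - 1)))


def fetch_with_retry(fetch: Callable[[], Response], max_retries: int = 5,
                     sleep: Callable[[float], None] = time.sleep) -> Optional[Response]:
    """Call `fetch` until it returns a non-429 status or retries run out."""
    for attempt in range(max_retries + 1):
        status, headers, body = fetch()
        if status != 429:
            return status, headers, body
        if attempt == max_retries:
            break
        sleep(backoff_delay(attempt + 1, headers.get("Retry-After")))
    # Terminal failure: the caller logs to ingestion_logs and moves on
    # to the next record.
    return None
```

Passing `sleep` as a parameter lets tests record the waits instead of actually sleeping.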

0 of 16 done.

  • t1 Alembic migration applied locally + production (dockets + docket_documents + fr_document_dockets junction + FTS5)
  • t2 Pydantic models in shared/josh_substrate/src/josh_substrate/models/dockets.py
  • t3 Source module implementing the `Source` protocol
  • t4 Body normalization wired (Markdown body = title + agency + summary + documents list; raw JSON cached)
  • t5 Citation metadata: all 6 columns populated; citation_for() formatter + 3-fixture unit tests
  • t6 FTS5 virtual table + 3 sync triggers wired into migration
  • t7 N/A — no body chunks, no vector pipeline
  • t8 Unit tests for parser against captured fixtures (active docket, closed-comments docket, multi-document docket)
  • t9 Integration test for the full discover→fetch→parse→load pipeline (with mocked api.regulations.gov)
  • t10 Smoke backfill (50 dockets) clean per success_determiner
  • t11 Production backfill at agreed scope (per clarifications_needed) complete
  • t12 4-hourly cron registered and observed firing once
  • t13 No recurring errors in `ingestion_logs` after 24h (transient 429 throttle retries are fine)
  • t14 Source doc updated; inventory rows in https://docs.usejosh.com/josh-data-sources/ + https://docs.usejosh.com/data-status/ flipped to ingestion-built
  • t15 api.data.gov elevated quota requested + key in `.kamal/secrets` (env: `REGULATIONS_GOV_API_KEY`)
  • t16 Cross-source FK test: smoke-backfill cohort of FR docs that declare a docket_id all resolve to a `dockets.id` row (or queue retry)

No history yet.

docs/spec/regulations-dot-gov-dockets-ingester.html · generated by bin/build-spec.py