Regulations.gov dockets

When a federal agency proposes or finalizes a regulation, it opens a docket on regulations.gov for public comment. Each docket contains the regulatory document(s), supporting analyses, and (after the comment period) the public comments submitted. The docket is the lifecycle container for a single rulemaking.

For Josh, regulations.gov dockets give us:

The comment count and deadline per active rulemaking — high-signal for "what's open for comment now."
The mapping from a Federal Register document to its docket_id and any related supporting documents.
Public comment counts per document (per-comment records and full comment text are deferred to v2).

This source is adjacent to Federal Register: FR rules cite a docket_id, and many dockets house multiple FR documents over their lifecycle. The relationship is one-to-many (1:N), not M:N: a docket may contain multiple FR documents (proposed rule + final rule + correction), while an FR document is in exactly one docket.

Quick reference


Source name	Regulations.gov dockets (metadata + document list)
Publisher	General Services Administration (regulations.gov is operated by GSA on behalf of agencies)
License	Public domain
Coverage	All open and historical federal rulemakings, ~2003 – present
Volume	~1.5M+ dockets all-time; ~10K-30K active at any time. ~50K-100K new documents per year.
Storage estimate	~1-2 GB metadata; full comment text would be 100s of GB to TB (deferred to v2).
Access	regulations.gov v4 REST API, JSON, base `https://api.regulations.gov/v4/`. Requires `X-Api-Key`. See Access notes.
Stable ID format	Docket: `docket:{docketId}` e.g. `docket:EPA-HQ-OAR-2025-0001`. Document: `regdoc:{documentId}`.
Status	exploring — schema drafted, ingestion not built

Source priority decision

Primary: regulations.gov v4 REST API. The canonical source. Direct from GSA. JSON. Filterable by date and agency. See the GSA API reference for endpoint routes, query params, and ?include=attachments.

Secondary: the regulations_dot_gov_info field embedded in Federal Register API responses (already documented in Federal Register). It carries comments_count, comment_url, docket_id, and document_id for FR docs that exist in regulations.gov. This is redundant with the regulations.gov API but cheaper to obtain (no key needed). Use it as enrichment for FR docs we already have.

Skip: scraping regulations.gov website. API is the right path; scraping is fragile and adds nothing.

Skip: comment text in v1. Full comment text is on the data sources index's v2 deferral list because it runs to hundreds of GB to TB. We ingest comment counts and deadlines; per-comment text is fetched on demand only.

Access notes

An X-Api-Key header is required for every request. Register for a production key at api.data.gov/signup and expose it to the loader through an environment variable. DEMO_KEY works for testing but rate-limits aggressively (observed at ~30 requests/hour in practice), which won't sustain ingestion.

The API is fronted by api.data.gov's api-umbrella gateway, not Cloudflare — live response headers show via: api-umbrella (ApacheTrafficServer), x-api-umbrella-request-id, x-vcap-request-id, and x-ratelimit-* (no Cloudflare markers). Throttling surfaces as HTTP 429 with x-ratelimit-* headers and Retry-After. Plan defensive retry: respect Retry-After, then exponential backoff with jitter keyed on 429. See the upstream api.data.gov rate limits for current default and per-endpoint caps.
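The retry policy described above could be sketched roughly as follows. This is a minimal illustration, not the loader's actual code: the function names, the 5-second base, the 300-second cap, and the choice of "full jitter" are assumptions, and only the numeric-seconds form of Retry-After is handled.

```python
import random
import time
import urllib.error
import urllib.request


def parse_retry_after(value):
    """Return Retry-After as seconds, or None if absent or not a plain number."""
    if value is None:
        return None
    try:
        return max(0.0, float(value))
    except ValueError:
        return None  # the HTTP-date form is not handled in this sketch


def backoff_delay(attempt, retry_after=None, base=5.0, cap=300.0, rng=random.random):
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        return float(retry_after)  # the server's hint wins when present
    ceiling = min(cap, base * (2 ** attempt))
    return ceiling * rng()  # full jitter: uniform in [0, ceiling)


def fetch_with_retry(url, api_key, max_retries=5, sleep=time.sleep):
    """GET `url`, retrying only on HTTP 429; other errors propagate."""
    request = urllib.request.Request(url, headers={"X-Api-Key": api_key})
    for attempt in range(max_retries + 1):
        try:
            with urllib.request.urlopen(request, timeout=30) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            if err.code != 429 or attempt == max_retries:
                raise
            hint = parse_retry_after(err.headers.get("Retry-After"))
            sleep(backoff_delay(attempt, hint))
```

Keeping the delay calculation separate from the network call makes the policy easy to unit-test without touching the API.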

The commenting subsystem (/comments/{commentId}) is rate-limited more strictly than dockets/documents. We don't ingest comment text in v1, so this matters less.

Stable ID formats

Docket: docket:{docketId}, where docketId is the agency-assigned identifier verbatim.

Examples:

docket:EPA-HQ-OAR-2025-0001
docket:FAA-2024-1234
docket:NHTSA-2025-0050

The format often follows {AGENCY}-{ORG}-{TYPE}-{YEAR}-{SEQUENCE}, but the segments vary by agency (FAA-2024-1234 has no ORG or TYPE segment). Treat it as an opaque string.

Document: regdoc:{documentId}, where documentId is regulations.gov's identifier (often {docketId}-NNNN form).

The Federal Register document number is separate from regulations.gov's documentId. fr_documents.document_number (e.g. 2026-08558) is FR-side; regdoc:{documentId} (e.g. EPA-HQ-OAR-2025-0001-0123) is regulations.gov-side. They're linked via cross-reference fields: frDocNum on the regulations.gov document, and regulations_dot_gov_info on the FR document.
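A small sketch of how the two stable ID formats might be built and checked. The helper names are hypothetical; the only behavior taken from the text is the `docket:` / `regdoc:` prefix and the rule that the upstream identifier is kept verbatim and never parsed into segments.

```python
def docket_stable_id(docket_id: str) -> str:
    """Prefix the agency-assigned docket identifier, kept verbatim."""
    return f"docket:{docket_id}"


def regdoc_stable_id(document_id: str) -> str:
    """Prefix the regulations.gov document identifier, kept verbatim."""
    return f"regdoc:{document_id}"


def split_stable_id(stable_id: str) -> tuple[str, str]:
    """Split a stable ID into (prefix, upstream_id) without parsing the upstream part."""
    prefix, sep, upstream_id = stable_id.partition(":")
    if not sep or prefix not in {"docket", "regdoc"} or not upstream_id:
        raise ValueError(f"not a regulations.gov stable ID: {stable_id!r}")
    return prefix, upstream_id
```

For example, `split_stable_id("docket:FAA-2024-1234")` yields the prefix and the untouched upstream ID, while an FR-side ID such as `fr:2026-08558` is rejected.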

Volume

Metric	Value
Dockets total (all-time)	~1.5M+
Active dockets (open for comment)	~10K-30K at any time
Documents per year	~50K-100K
Comments per year	tens of millions (v2 deferral)
Per-docket metadata	~5-10 KB JSON
Per-document metadata	~5-10 KB JSON

Postgres footprint: ~1-2 GB metadata only.

Caching / incremental sync

Hourly: /dockets?filter[lastModifiedDate][ge]={last_run}&page[size]=250&sort=-lastModifiedDate. Paginate; fetch new and modified dockets.
Hourly: /documents?filter[lastModifiedDate][ge]={last_run}&page[size]=250&sort=-lastModifiedDate. Paginate; fetch new and modified documents.
Per document with openForComment=true: check commentEndDate against today; mark closing_soon for ones within 7 days.
Cross-trigger from Federal Register loader: when an FR doc's regulations_dot_gov_info.docket_id is populated, schedule a docket fetch if we don't already have it.
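The hourly queries and the closing_soon check listed above might look like this in code. It is a sketch under assumptions: the function names are hypothetical, and the `page[number]` parameter and the `YYYY-MM-DD HH:MM:SS` timestamp format (including which timezone the filter expects) should be confirmed against the GSA API reference before use.

```python
from datetime import date, datetime
from urllib.parse import urlencode

BASE_URL = "https://api.regulations.gov/v4/"


def incremental_url(resource, last_run, page_number=1, page_size=250):
    """Build the list URL for records modified at or after `last_run`.

    `resource` is "dockets" or "documents".
    """
    params = {
        # Assumed timestamp format; verify against the API reference.
        "filter[lastModifiedDate][ge]": last_run.strftime("%Y-%m-%d %H:%M:%S"),
        "page[size]": page_size,
        "page[number]": page_number,
        "sort": "-lastModifiedDate",
    }
    return f"{BASE_URL}{resource}?{urlencode(params)}"


def closing_soon(comment_end_date, today, window_days=7):
    """True when the comment period is still open and ends within the window."""
    days_left = (comment_end_date - today).days
    return 0 <= days_left <= window_days


if __name__ == "__main__":
    print(incremental_url("documents", datetime(2025, 1, 2, 3, 4, 5)))
    print(closing_soon(date(2025, 6, 10), today=date(2025, 6, 3)))
```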

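The API rate limit is the binding constraint. At 1,000 requests/hour, ~50K-100K new documents per year averages out to roughly 140-270 documents per day, which fits within a single hour's quota. The risk is bursts (a backlog after downtime, or a sudden high-volume docket), not steady-state volume.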

Schema (Postgres DDL)

-- ============================================================
-- Dockets
-- ============================================================

CREATE TABLE reg_dockets (
    id text PRIMARY KEY,                          -- 'docket:EPA-HQ-OAR-2025-0001'
    docket_id text NOT NULL UNIQUE,               -- 'EPA-HQ-OAR-2025-0001'

    agency_id text,                               -- 'EPA' (regs.gov code; differs from FR slug)
    docket_type text,                             -- 'Rulemaking' | 'Nonrulemaking' | 'Other'
    sub_type text,
    sub_type_2 text,
    category text,
    rin text,                                     -- '2060-AV12'

    title text,
    short_title text,
    docket_abstract text,

    last_modified_date timestamptz,

    -- Lifecycle
    raw_json jsonb NOT NULL,
    fetched_at timestamptz NOT NULL,
    parsed_at timestamptz,
    inserted_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX reg_dockets_last_modified ON reg_dockets (last_modified_date DESC);
CREATE INDEX reg_dockets_agency ON reg_dockets (agency_id, last_modified_date DESC);
CREATE INDEX reg_dockets_rin ON reg_dockets (rin) WHERE rin IS NOT NULL;

ALTER TABLE reg_dockets ADD COLUMN search_tsv tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '') || ' ' || coalesce(short_title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(docket_abstract, '')), 'B') ||
        setweight(to_tsvector('english', coalesce(category, '')), 'C')
    ) STORED;
CREATE INDEX reg_dockets_search ON reg_dockets USING gin (search_tsv);

-- ============================================================
-- Documents (within dockets)
-- ============================================================

CREATE TABLE reg_documents (
    id text PRIMARY KEY,                          -- 'regdoc:EPA-HQ-OAR-2025-0001-0001'
    document_id text NOT NULL UNIQUE,             -- 'EPA-HQ-OAR-2025-0001-0001'
    docket_id text NOT NULL,                      -- 'EPA-HQ-OAR-2025-0001'
    docket_row_id text REFERENCES reg_dockets(id) ON DELETE CASCADE,

    agency_id text,
    document_type text NOT NULL,                  -- 'Rule' | 'Proposed Rule' | 'Supporting & Related Material' | 'Other'
    subtype text,
    title text,
    doc_abstract text,

    -- Federal Register cross-reference
    fr_doc_num text,                              -- '2025-25434' — joins to fr_documents.document_number
    fr_document_id text REFERENCES fr_documents(id),  -- nullable; set once the FR row exists ('fr:2025-25434')

    -- Comment window
    posted_date timestamptz,
    comment_start_date timestamptz,
    comment_end_date timestamptz,
    open_for_comment boolean NOT NULL DEFAULT false,

    -- Status
    withdrawn boolean NOT NULL DEFAULT false,
    last_modified_date timestamptz,

    -- Comment counts (denormalized; refresh periodically)
    comment_count int,
    comment_count_fetched_at timestamptz,

    -- Lifecycle
    raw_json jsonb NOT NULL,
    fetched_at timestamptz NOT NULL,
    parsed_at timestamptz,
    inserted_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX reg_documents_docket ON reg_documents (docket_id);
CREATE INDEX reg_documents_fr_doc ON reg_documents (fr_doc_num) WHERE fr_doc_num IS NOT NULL;
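-- Partial-index predicates must be IMMUTABLE, so CURRENT_DATE cannot appear here;
-- filter on comment_end_date >= CURRENT_DATE at query time instead.
CREATE INDEX reg_documents_comment_open ON reg_documents (comment_end_date)
    WHERE open_for_comment;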
CREATE INDEX reg_documents_last_modified ON reg_documents (last_modified_date DESC);

ALTER TABLE reg_documents ADD COLUMN search_tsv tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(title, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(doc_abstract, '')), 'B') ||
        setweight(to_tsvector('english', coalesce(document_type, '')), 'D')
    ) STORED;
CREATE INDEX reg_documents_search ON reg_documents USING gin (search_tsv);

-- ============================================================
-- Document attachments (file format index)
-- ============================================================

CREATE TABLE reg_document_attachments (
    id bigserial PRIMARY KEY,
    reg_document_id text NOT NULL REFERENCES reg_documents(id) ON DELETE CASCADE,
    file_format text NOT NULL,                    -- 'pdf' | 'docx' | 'xlsx' etc.
    file_url text NOT NULL,
    file_size_bytes bigint,                       -- int would overflow past ~2 GiB
    fetched_at timestamptz,
    UNIQUE (reg_document_id, file_url)
);

-- ============================================================
-- (v2) Comments — placeholder; not in v1 ingestion
-- ============================================================

-- Schema deferred to v2 — comment text is too large to ingest eagerly.
-- We populate `reg_documents.comment_count` for "how many comments are on this docket"
-- queries; per-comment text is fetched on demand if/when needed.

Schema decisions worth flagging:

reg_documents.fr_doc_num is the cross-source bridge to Federal Register. The upstream frDocNum field on a document (e.g. frDocNum: "2025-25434") is the clean cross-reference — it matches fr_documents.document_number. We soft-resolve it to fr_document_id. This lets the agent answer "what's the Reg.gov docket for FR 2025-25434" cleanly.
comment_count denormalized with comment_count_fetched_at to indicate freshness — refresh asynchronously, don't block on it.
raw_json is always preserved, so upstream fields we don't model yet are not lost.
withdrawn flag retained — when a document is withdrawn from regulations.gov (rare), we keep the row with withdrawn=true rather than deleting it.
Comments deferred: no reg_comments table is created in v1; the placeholder in the DDL only records the intent. Full comment text is v2.
Attachments separate — multiple file formats per document (PDF + DOCX + XLSX); attachment lookups are common ("download the PDF version").

The full upstream JSON-API response shapes for dockets, documents, and attachments are documented in the GSA API reference; the docketType / documentType enum vocabularies (Rulemaking / Nonrulemaking / Other; Rule / Proposed Rule / Supporting & Related Material / Other) are upstream-defined there too.
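To make the mapping from an upstream document resource to reg_documents columns concrete, here is a minimal sketch. The function names and the sample payload are illustrative, not taken from a live response. Apart from frDocNum, openForComment, commentEndDate, lastModifiedDate, and documentType, which the text above mentions, the attribute names (docketId, title, withdrawn) are assumptions to confirm against the GSA API reference.

```python
import json


def fr_stable_id(fr_doc_num):
    """Candidate fr_documents.id for an frDocNum, or None when absent."""
    return f"fr:{fr_doc_num}" if fr_doc_num else None


def document_row(resource: dict, fetched_at: str) -> dict:
    """Flatten one JSON:API document resource into reg_documents column values."""
    attrs = resource.get("attributes", {})
    document_id = resource["id"]
    docket_id = attrs.get("docketId")
    return {
        "id": f"regdoc:{document_id}",
        "document_id": document_id,
        "docket_id": docket_id,
        "docket_row_id": f"docket:{docket_id}" if docket_id else None,
        "document_type": attrs.get("documentType"),
        "title": attrs.get("title"),
        "fr_doc_num": attrs.get("frDocNum"),
        # Left NULL at insert; a later pass sets it once fr_documents has the row.
        "fr_document_id": None,
        "comment_end_date": attrs.get("commentEndDate"),
        "open_for_comment": bool(attrs.get("openForComment", False)),
        "withdrawn": bool(attrs.get("withdrawn", False)),
        "last_modified_date": attrs.get("lastModifiedDate"),
        "raw_json": json.dumps(resource, sort_keys=True),  # keep unmodeled fields
        "fetched_at": fetched_at,
    }


# Illustrative payload only; values are made up for the example.
sample = {
    "id": "EPA-HQ-OAR-2025-0001-0123",
    "type": "documents",
    "attributes": {
        "docketId": "EPA-HQ-OAR-2025-0001",
        "documentType": "Proposed Rule",
        "frDocNum": "2025-25434",
        "openForComment": True,
    },
}
row = document_row(sample, fetched_at="2025-06-01T00:00:00Z")
```

The soft resolution step would then look up `fr_stable_id(row["fr_doc_num"])` in fr_documents and fill fr_document_id only when that row exists.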

Download / update strategy

Backfill

The 1.5M+ docket history is too much to backfill entirely on first launch. Strategy:

Backfill last 3 years of dockets: ?filter[lastModifiedDate][ge]=2023-01-01&page[size]=250&sort=lastModifiedDate. Paginate. ~100K-500K dockets.
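For each docket, fetch detail. At the upper bound, ~500K calls at 1,000/hr is ~500 hours, or ~21 days of continuous running. Lean: accept the ~21-day backfill and run it in the background during off-hours, or parallelize across multiple keys if the api.data.gov terms allow it.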
Cross-reference with Federal Register: any FR doc with regulations_dot_gov_info.docket_id populated triggers a docket fetch if we don't have it. This catches important historical dockets even if outside our 3-year window.
Document-level fetch piggybacks on docket fetch by listing all documents in the docket.
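The backfill estimate above is simple arithmetic, worked through here so the assumptions are visible. The function names are hypothetical; the inputs (500K dockets at the upper bound, 1,000 requests/hour, 250 records per page) come from the text.

```python
import math


def list_pages(total_records, page_size=250):
    """Number of list requests needed to enumerate all records."""
    return math.ceil(total_records / page_size)


def backfill_hours(total_calls, calls_per_hour=1000, keys=1):
    """Hours needed if every key is used at its full hourly quota."""
    return total_calls / (calls_per_hour * keys)


dockets = 500_000                      # upper bound of the 3-year window
detail_hours = backfill_hours(dockets)
detail_days = detail_hours / 24

print(f"list pages:    {list_pages(dockets)}")
print(f"detail fetch:  {detail_hours:.0f} hours (~{detail_days:.1f} days)")
```

The list calls are negligible next to the per-docket detail fetches, so the scope of the backfill window is what drives the cost.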

Incremental updates

Hourly: /dockets?filter[lastModifiedDate][ge]={last_run}&sort=-lastModifiedDate. Paginate.
Hourly: /documents?filter[lastModifiedDate][ge]={last_run}&sort=-lastModifiedDate. Paginate.
Daily: comment-count refresh for documents with open_for_comment=true (hourly in the final week before close; see Failure modes). Hit /documents/{id} to refresh commentEndDate and the comment count.

State tracking

Source keys: reg_dockets, reg_documents. State stores one last_modified_date watermark per source key.
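One way the watermark could advance after each run is sketched below. The overlap window is an assumption added here, not something the design specifies: it re-reads a few minutes of already-seen records to tolerate clock skew and late-arriving updates, which is safe only if writes are idempotent upserts.

```python
from datetime import datetime, timedelta


def next_watermark(previous, seen_last_modified, overlap=timedelta(minutes=5)):
    """Return the watermark to store after a run.

    `seen_last_modified` holds the lastModifiedDate of every record fetched.
    The watermark never moves backwards, and stays put when nothing was seen.
    """
    if not seen_last_modified:
        return previous
    newest = max(seen_last_modified)
    return max(previous, newest - overlap)


previous = datetime(2025, 6, 1, 12, 0, 0)
seen = [datetime(2025, 6, 1, 12, 20, 0), datetime(2025, 6, 1, 12, 45, 0)]
print(next_watermark(previous, seen))
```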

Failure modes

Rate limit (429) from api.data.gov's api-umbrella gateway: throttling surfaces as HTTP 429 with x-ratelimit-* headers and Retry-After. Respect Retry-After; when it is absent, back off exponentially with jitter (e.g. 5s → 30s → 5min) and halt after ~5 retries. DEMO_KEY hits the cap fast; production needs a real api.data.gov key.
Docket withdrawn / merged — rare but possible. The id may persist while attributes change. Update in place via lastModifiedDate.
Document withdrawn — withdrawn=true flag. Preserve row.
fr_doc_num resolution fails — fr_documents may not yet have the row (FR loader may be behind). Soft FK; resolve later.
Comment count rapidly changing during open period — refresh hourly during the last week before close; once-daily during the rest.
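The comment-count refresh cadence in the last item could be expressed as a small scheduling helper. This is a sketch with a hypothetical function name; the 7-day threshold and the hourly/daily intervals are the ones stated above.

```python
from datetime import date, timedelta


def comment_count_refresh_interval(comment_end_date, today):
    """How often to refresh a document's comment count, or None once closed."""
    if comment_end_date < today:
        return None                     # comment period over; stop refreshing
    if comment_end_date - today <= timedelta(days=7):
        return timedelta(hours=1)       # final week: counts move quickly
    return timedelta(days=1)            # earlier in the window: daily is enough


print(comment_count_refresh_interval(date(2025, 6, 10), today=date(2025, 6, 5)))
```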

Open questions

These don't block ingestion but should be resolved before this source is "shipped":

Real api.data.gov key vs DEMO_KEY. DEMO_KEY won't sustain ingestion. Provision a real key early.
Backfill scope. 3 years is a guess. Decide based on actual user need and the 21-day backfill cost.
Comment text v2 path. When we add comments in v2: per-comment metadata is small but text + attachments are huge. Plan: index per-comment metadata, store text only on-demand. Some agencies see >1M comments per high-profile docket.
Agency ID normalization. Regulations.gov uses agency codes (EPA, NHTSA) that may or may not match Federal Register's slugs (environmental-protection-agency). Build a small crosswalk.
API stability. The api.data.gov api-umbrella gateway throttles bursty callers with HTTP 429 (x-ratelimit-* + Retry-After). Build robust retry + circuit-breaker logic keyed on 429 that honors Retry-After before backing off.
Closing-soon notifications. A high-value workflow ("alert me when this comment period closes in 3 days") relies on this source. Schema supports it via the index on comment_end_date WHERE open_for_comment.
Withdrawn documents in citation graph. When a document is withdrawn, downstream FR docs / CFR sections that cited it still point at a valid (withdrawn) record. The schema preserves the row; consumers must filter on withdrawn.
Comment counts vs mass-comment submissions. Some dockets carry comments in batches as Mass Comment documents: a single docket entry that aggregates thousands of identical or near-identical comments, so the entry count understates the number of commenters. Surface this as a separate mass_comment_count field if useful.
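For the agency ID normalization question above, the crosswalk could start as a plain lookup table. The sketch below is illustrative: the function name is hypothetical, and the table holds only the one pairing given in this document; every other code would need to be filled in from the two APIs' agency lists.

```python
# regulations.gov agency code -> Federal Register agency slug.
# Only the EPA pairing comes from this document; extend from upstream data.
AGENCY_CROSSWALK = {
    "EPA": "environmental-protection-agency",
}


def fr_slug_for(agency_id):
    """Return the FR slug for a regulations.gov agency code, or None if unmapped."""
    if not agency_id:
        return None
    return AGENCY_CROSSWALK.get(agency_id.strip().upper())


unmapped = sorted(code for code in ["EPA", "NHTSA"] if fr_slug_for(code) is None)
print(unmapped)
```

Returning None for unmapped codes, rather than guessing a slug, lets the loader log the gaps and keeps bad joins out of the data.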