substrateplannedp0

Substrate source-ingester defaults

substrate-source-defaults · updated 2026-05-10T22:00:00Z · owner rritz

Header

Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.

Why

Pin the cross-cutting defaults every source ingester inherits, so adding
a new source is "implement the Source protocol" rather than "rediscover
HTTP timeouts, anti-bot strategy, PDF library choice, and disk layout
per source." Today, working defaults exist in josh-ingester/ingester/http.py
(User-Agent, retry, rate limit), shared/josh_substrate/corpus.py
(filesystem layout), and josh-ingester/ingester/scheduler.py (timezone)
but they're not documented as canonical conventions and four explicit
cross-source decisions remain undecided across clarifications_needed in
multiple ingester specs:

1. api.data.gov key strategy (bills, hearings, regulations.gov,
public-laws, roll-call-votes share quota).
2. PDF normalizer (pdf_v1) library choice (CBO, GAO, LDA, hearings).
3. Headed-browser worker topology (CBO, GAO, LDA, Senate roll-calls
fight Akamai/DataDome).
4. HTTP request timeout (httpx default 5s/5s is too short for federal
servers).

This spec is decisions-as-acceptance-criteria across those four plus
ratification of the existing conventions. Implementation lands in
shared/josh_substrate/ (corpus + http helpers) and
josh-ingester/ingester/ (browser singleton, api.data.gov budget
allocator). Every per-source ingester spec depends on this.

User stories

As a contributor adding a new source ingester, I want one place to read what HTTP, browser, PDF, and disk conventions apply so that my Source implementation follows the substrate's existing patterns instead of inventing.

As an OSS self-hoster reading the docs, I want a single conventions reference instead of 16 ingester-specific quirks so that the substrate feels coherent rather than 16 unrelated scrapers in a trench coat.

As an operator, I want api.data.gov quota fairly allocated across the ingesters that share it so that a bills backfill doesn't starve the regulations.gov daily ingester.

Acceptance criteria (EARS)

Where a source declares a `schedule` cron string, it shall be interpreted in `America/New_York` by default; per-source override is allowed via the source's `timezone` attribute.
When `josh-ingester/ingester/scheduler.py` builds the AsyncIOScheduler, the timezone shall default to `America/New_York` and the convention shall be documented as the substrate's source-schedule default.
When `JoshHttpClient` is constructed for a source, it shall apply the substrate User-Agent (real-browser Chrome-130 prefix + `josh-ingester/<version>` suffix), tenacity retry (3 attempts, exponential backoff, 5xx + network errors only), and a per-source token-bucket rate limit defaulting to 2 req/sec.
When `JoshHttpClient` issues an HTTP request, the timeout shall default to 10 seconds connect + 60 seconds read; per-source override is allowed via constructor arguments.
Where the runner orchestrates a source's fetch stage, the in-flight request concurrency shall default to 4; per-source override is allowed.
When the substrate User-Agent's Chrome version is more than 12 months old, a maintenance task shall bump it (manual hygiene, not blocking).
When any source fetches an api.data.gov-fronted endpoint, it shall use the shared key from the `JOSH_API_DATA_GOV_KEY` env var; per-source keys are not used.
Where multiple sources concurrently hit api.data.gov, requests shall pass through a shared parent token bucket scoped to the api.data.gov key so the aggregate never exceeds the configured quota.
Where the api.data.gov default quota of 1000 req/hour applies, per-source budgets shall sum to 1000/hour with the v1 default split: bills 350, regulations.gov-dockets 200, hearings 150, public-laws 100, roll-call-votes 100, reserve 100.
When `JoshHttpClient` observes an `X-RateLimit-Limit` header indicating an elevated quota (>1000/hr), per-source budgets shall scale proportionally so an operator with a 5K/hr key sees `bills=1750` etc. without manual config.
Where a source's per-hour budget is exceeded, the `JoshHttpClient` shall sleep until the next bucket refill rather than raising; lock-held semantics do not apply to api.data.gov rate limiting.
Where a source produces PDF bytes, the substrate's canonical `pdf_v1` normalizer shall implement the `Normalizer` protocol via `pdfplumber` (MIT license, pure Python).
When `pdf_v1` extracts a table, it shall serialize as a GFM pipe-table in the Markdown body; tables that exceed 20 columns or have irregular row/colspans shall be captured as `Artifact` per the body-normalization protocol rather than being rendered lossily.
Where `pdf_v1` encounters a PDF with no extractable text (scanned image-only PDF), the body normalization shall return an empty `markdown` field and stamp `body_status='scanned_no_text'` on the row; OCR is out of scope for v1.
When the `pdf_v1` validation suite at `shared/josh_substrate/tests/normalizers/test_pdf_v1.py` runs, it shall exercise 12 stratified samples (3 each from CBO / GAO / LDA / hearings) covering simple text, table-heavy, multi-column, and complex-layout cases.
Where `pdfplumber` is a substrate dependency, it shall be pinned `pdfplumber>=0.11,<0.12`; bumps require the validation suite to pass.
Where Josh's licensing is Apache 2.0, AGPL-licensed PDF libraries (PyMuPDF / Artifex) shall not be used as runtime dependencies, even when technically more capable.
Where a source must fetch from a bot-walled origin (Akamai / DataDome / similar), it shall use the shared `BrowserClient` singleton in `josh-ingester` rather than spawning its own browser.
When the first headed-browser fetch happens in a `josh-ingester` process, the `BrowserClient` shall lazily spawn a single Playwright Chromium instance and keep it alive for the rest of the process lifetime.
Where multiple sources share the browser, each shall fetch via its own `BrowserContext` so cookies do not leak across sources.
When `josh-ingester` is built, the Docker base image shall be `mcr.microsoft.com/playwright/python:v1.49.0-noble` so Chromium system dependencies are pre-baked; `josh-core` and `josh-embedder` retain `python:3.12-slim`.
When `BrowserClient` issues a fetch, the page shall use `playwright-stealth` patches (navigator.webdriver=false, etc.) to evade trivial bot detection; per-source override is allowed for sources requiring different stealth profiles.
Where a source preserves raw payload, files shall be written under `/data/corpus/<source>/bodies/raw/<source-specific-subpath>` per the source's chosen partitioning (date / congress / number / etc.).
Where a source's natural key has a stable hierarchical component (year, congress, type), the on-disk subpath shall partition by that component (FR by `YYYY/MM`, bills by `congress/type`, etc.).
When a source overwrites an existing raw file (upstream record updated), the write shall be atomic via tmp-then-rename + fsync; the prior bytes are preserved only in the restic snapshot tier, not on the live filesystem.
Where normalized Markdown is produced from a raw payload, it shall be written to the row's `body_text` column only; no parallel Markdown file shall be written to disk.
Where raw bytes need compression, the substrate shall not gzip at the application layer; filesystem-level compression (ZFS / btrfs) is the recommended path if disk pressure ever justifies it.
Where raw bytes are written, retention shall be indefinite at v1; pruning policy is reserved for revisit when disk pressure justifies it.
When a source's fetch stage crashes between file write and DB commit, leaving an orphan in the corpus directory, the `josh source janitor <name>` command (per cli-conventions) shall sweep orphans older than 24 hours from the per-source `.tmp/` directory.

Success determiner

kindtest_file

Path

shared/josh_substrate/tests/test_source_defaults.py

Runner

uv run pytest shared/josh_substrate/tests/test_source_defaults.py -v

Contract test that exercises every default the substrate ships: - Scheduler builds with `America/New_York` default; per-source override flows through. - JoshHttpClient applies UA, rate limit, retry, timeout defaults. - api.data.gov budget allocator splits 1000/hr per the documented table; auto-scales on observed `X-RateLimit-Limit` header. - `pdf_v1` normalizer ships and passes the 12-sample validation suite under `tests/normalizers/test_pdf_v1.py`. - `BrowserClient` singleton lazy-spawns once and produces per-source contexts. - Corpus path helpers in `josh_substrate.corpus` produce the documented per-source subpaths; atomic write helper round-trips via tmp-then-rename. Determiner currently fails because: (a) `shared/josh_substrate/tests/test_source_defaults.py` does not yet exist; (b) the api.data.gov budget allocator and BrowserClient singleton are not yet implemented; (c) `pdf_v1` and its 12-sample validation suite are not yet implemented. Will flip to passing as the implementation lands across `shared/josh_substrate/` and `josh-ingester/ingester/`.

Clarifications needed

None.

Out of scope

Specific source ingester implementations — those live in per-source ingester specs (`bills-ingester`, `gao-reports-ingester`, etc.) which now depend on this spec.
OCR fallback for scanned PDFs — defer to v1.x when a real need surfaces.
Marker (LLM-assisted PDF) integration — defer to v1.x; preserve the option as a `pdf_v2` track.
Extraction of headed-browser into a separate `josh-fetcher` container — deferred until single-process Chromium hits a real scaling wall.
api.data.gov backfill burst mode (`--burst` flag) — deferred to v1.x.
Application-layer compression of raw bytes — deferred; filesystem-level compression is the preferred answer if disk pressure surfaces.
On-disk version history of raw payloads — restic snapshot tier IS the version history.
Source-specific HTTP, browser, or PDF customizations — those override the defaults via per-source attributes; this spec defines the defaults, not the override mechanism beyond noting it exists.

Dependencies

Plan

Decisions locked, organized by section. Each maps to one or more EARS
acceptance criteria above and to a contract test in the determiner.

## 1. Scheduler timezone

Locked: America/New_York (already implemented in
josh-ingester/ingester/scheduler.py as DEFAULT_TIMEZONE). Federal
data is intrinsically Eastern (Congress floor activity, FR ~08:45 ET,
DC business hours). Per-source override is allowed via the source's
timezone attribute.

## 2. HTTP client defaults

Locked, applied in JoshHttpClient (josh-ingester/ingester/http.py):

| Field | Default |
|-------|---------|
| User-Agent | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36 josh-ingester/<version> |
| Rate limit | 2 req/sec token bucket |
| Retry | tenacity, 3 attempts, exponential backoff, 5xx + network only |
| Fetch concurrency | 4 in-flight per source |
| Timeout | 10s connect / 60s read (new — was unset previously) |

Per-source override is the standard pattern; document the tighter
defaults source authors should reach for (e.g., FR API supports much
higher rate; LDA's headed-browser path is much lower).

Maintenance: when the User-Agent's Chrome version is >12 months old,
bump it. Non-blocking hygiene.

## 3. api.data.gov key strategy

Locked: single key, single env var, per-source budget allocation,
auto-scaling on elevated quota.

```
JOSH_API_DATA_GOV_KEY=...

Default budget split (1000 req/hr aggregate):
bills-ingester 350/hr
regulations-dot-gov-dockets 200/hr
hearings-ingester 150/hr
public-laws-ingester 100/hr
roll-call-votes-ingester 100/hr
headroom for v2 sources 100/hr
```

When JoshHttpClient observes X-RateLimit-Limit > 1000, all
per-source budgets scale proportionally — operator with elevated 5K/hr
key sees bills=1750 etc. without retuning. Per-source override via
JOSH_RATE_LIMIT_<SOURCE>_PER_HOUR=N env var.

Implementation: shared parent TokenBucket scoped to api.data.gov,
per-source child buckets. JoshHttpClient.api_data_gov_request()
acquires from both before issuing. New code in
josh-ingester/ingester/http.py.

Backfill burst (temporary access to the full unallocated budget) is
deferred to v1.x; v1 backfill on default quota is acceptable for
one-time operations.

## 4. PDF normalizer (pdf_v1)

Locked: pdfplumber (MIT license, pure Python, table-strong) as the
canonical pdf_v1 Normalizer implementation.

Library reasoning:
- PyMuPDF (Artifex) is AGPL-3.0 — incompatible with Apache 2.0
Foundation distribution. Ruled out despite being technically
strongest.
- pypdf is BSD but weak on tables; federal data is table-heavy.
- Marker (LLM-assisted) is highest-quality but requires GPU + 3GB
model weights. Defer to v1.x.

Behaviors:
- Tables → GFM pipe-table format in Markdown body.
- Complex tables (>20 cols or irregular spans) → captured as
Artifact per body-normalization protocol; surrounding prose
still in the Markdown body.
- Scanned PDFs → return empty markdown, stamp
body_status='scanned_no_text'. OCR is out of scope for v1; the
row still exists with metadata + null body.
- Validation: 12-sample stratified suite at
shared/josh_substrate/tests/normalizers/test_pdf_v1.py. 3
samples each from CBO, GAO, LDA, hearings, covering simple text,
table-heavy, multi-column, complex layouts.
- Pin pdfplumber>=0.11,<0.12. Bumps require the validation suite
to pass.

## 5. Headed-browser worker topology

Locked: in-process Playwright in josh-ingester (Option A from the
Q4 grilling pass).

Architecture:
- One BrowserClient singleton, lazy-spawned on first headed-fetch.
- Lifetime = ingester process lifetime. Killed on shutdown.
- Per-source BrowserContext for cookie isolation.
- playwright-stealth patches applied (navigator.webdriver=false,
etc.) to evade trivial bot-detection.

Library: playwright>=1.49, playwright-stealth>=0.3. Single
library family; async-native; well-maintained.

Image:
- josh-ingester/Dockerfile base swapped from python:3.12-slim to
mcr.microsoft.com/playwright/python:v1.49.0-noble. Image size
grows ~150MB but Chromium system deps are pre-baked, build is
faster, deps are correct.
- josh-core and josh-embedder retain python:3.12-slim (no
browser need).

Path to v2: extract BrowserClient to a separate josh-fetcher
container if scale demands. Trigger conditions: backfill takes >7
days OR steady-state browser memory exceeds 1GB OR ingester needs
to scale horizontally. Until then, in-process is operationally
simpler and adequate.

## 6. Corpus filesystem layout

Locked: existing per-source-subpath layout (already implemented in
shared/josh_substrate/corpus.py).

Pattern: /data/corpus/<source>/bodies/raw/<source-specific-subpath>
with the source author choosing partitioning that matches their
natural key:

| Source | Pattern |
|--------|---------|
| crs_reports | crs/bodies/raw/{number}.html + .json |
| federal_register | federal_register/bodies/raw/{YYYY}/{MM}/{document_number}.xml |
| public_laws | public_laws/bodies/raw/{congress}/PLAW-{c}publ{n}.xml |
| bills | bills/bodies/raw/billstatus/{congress}/{type}/BILLSTATUS-{c}{t}{n}.xml |

Convention for new sources: partition by the most stable component of
the natural key, then smaller components, terminating in the
natural-key string + extension.

Storage policies:
- Markdown: DB column only (body_text). No parallel Markdown
file on disk. Re-derivation = "fetch row, run normalizer on raw,
compare hash."
- Updates: atomic overwrite via tmp-then-rename + fsync. Prior
bytes preserved only in restic snapshots — backup tier IS the
version history.
- Compression: uncompressed at application layer. Filesystem-
level (ZFS / btrfs) is the recommended path if disk pressure
justifies it later.
- Retention: indefinite at v1. Revisit when disk pressure
justifies pruning policy.
- Atomicity: every fetch writes to <source>/bodies/raw/.tmp/
first, fsyncs, then os.replace() to the canonical path. Only
after the file is at the canonical path does the DB row get
committed.
- Orphan sweep: josh source janitor <name> (per
cli-conventions) cleans .tmp/ files older than 24 hours.

## Implementation surface

- shared/josh_substrate/corpus.py — gains atomic_write_bytes(path, bytes) helper used by every source's fetch stage. Per-source path
helpers already implemented.
- shared/josh_substrate/normalizers/pdf_v1.py — new module
implementing Normalizer against pdfplumber. ~150 LOC.
- shared/josh_substrate/tests/normalizers/test_pdf_v1.py — 12-sample
stratified validation suite.
- josh-ingester/ingester/http.py — extends JoshHttpClient with
api.data.gov budget allocator (shared parent bucket + per-source
children, auto-scaling on observed quota header). Adds 10s/60s
timeout default.
- josh-ingester/ingester/browser.py — new module with
BrowserClient singleton, per-source context manager, stealth
patching.
- josh-ingester/Dockerfile — base image swap to Playwright Python
base.
- shared/josh_substrate/tests/test_source_defaults.py — contract
test suite that the success_determiner runs.

Changelog

2026-05-10T22:00:00Z (new)→planned Spec authored after a 5-question grilling-style decision pass with claude. Locked the existing in-code defaults (scheduler TZ, HTTP User-Agent + retry + rate limit, fetch concurrency, corpus layout) as canonical conventions and made four real new decisions: 1. HTTP timeout: 10s connect / 60s read default. 2. api.data.gov: single shared key with per-source budget allocation (350/200/150/100/100 + 100 reserve at 1000/hr default; auto-scales on elevated quota detection). 3. PDF normalizer: pdfplumber as canonical pdf_v1; PyMuPDF ruled out due to AGPL-3.0 / Apache 2.0 conflict; Marker deferred. 4. Headed-browser: in-process Playwright in josh-ingester with singleton browser + per-source BrowserContext; base image swap to Microsoft's playwright/python; extraction to separate container deferred until scale demands. Per-source ingester specs will declare dependency on this spec and drop the resolved clarifications in a follow-up commit.

docs/spec/substrate-source-defaults.html · generated by bin/build-spec.py