substrateplannedp0

Substrate source-ingester defaults

substrate-source-defaults · updated 2026-05-10T22:00:00Z · owner rritz

Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.

Pin the cross-cutting defaults every source ingester inherits, so adding
a new source is "implement the Source protocol" rather than "rediscover
HTTP timeouts, anti-bot strategy, PDF library choice, and disk layout
per source." Today, working defaults exist in josh-ingester/ingester/http.py
(User-Agent, retry, rate limit), shared/josh_substrate/corpus.py
(filesystem layout), and josh-ingester/ingester/scheduler.py (timezone)
but they're not documented as canonical conventions and four explicit
cross-source decisions remain undecided across clarifications_needed in
multiple ingester specs:

1. api.data.gov key strategy (bills, hearings, regulations.gov,
public-laws, roll-call-votes share quota).
2. PDF normalizer (pdf_v1) library choice (CBO, GAO, LDA, hearings).
3. Headed-browser worker topology (CBO, GAO, LDA, Senate roll-calls
fight Akamai/DataDome).
4. HTTP request timeout (httpx default 5s/5s is too short for federal
servers).

This spec is decisions-as-acceptance-criteria across those four plus
ratification of the existing conventions. Implementation lands in
shared/josh_substrate/ (corpus + http helpers) and
josh-ingester/ingester/ (browser singleton, api.data.gov budget
allocator). Every per-source ingester spec depends on this.

As a contributor adding a new source ingester, I want one place to read what HTTP, browser, PDF, and disk conventions apply so that my Source implementation follows the substrate's existing patterns instead of inventing.

As an OSS self-hoster reading the docs, I want a single conventions reference instead of 16 ingester-specific quirks so that the substrate feels coherent rather than 16 unrelated scrapers in a trench coat.

As an operator, I want api.data.gov quota fairly allocated across the ingesters that share it so that a bills backfill doesn't starve the regulations.gov daily ingester.

  1. Where a source declares a `schedule` cron string, it shall be interpreted in `America/New_York` by default; per-source override is allowed via the source's `timezone` attribute.
  2. When `josh-ingester/ingester/scheduler.py` builds the AsyncIOScheduler, the timezone shall default to `America/New_York` and the convention shall be documented as the substrate's source-schedule default.
  3. When `JoshHttpClient` is constructed for a source, it shall apply the substrate User-Agent (real-browser Chrome-130 prefix + `josh-ingester/<version>` suffix), tenacity retry (3 attempts, exponential backoff, 5xx + network errors only), and a per-source token-bucket rate limit defaulting to 2 req/sec.
  4. When `JoshHttpClient` issues an HTTP request, the timeout shall default to 10 seconds connect + 60 seconds read; per-source override is allowed via constructor arguments.
  5. Where the runner orchestrates a source's fetch stage, the in-flight request concurrency shall default to 4; per-source override is allowed.
  6. When the substrate User-Agent's Chrome version is more than 12 months old, a maintenance task shall bump it (manual hygiene, not blocking).
  7. When any source fetches an api.data.gov-fronted endpoint, it shall use the shared key from the `JOSH_API_DATA_GOV_KEY` env var; per-source keys are not used.
  8. Where multiple sources concurrently hit api.data.gov, requests shall pass through a shared parent token bucket scoped to the api.data.gov key so the aggregate never exceeds the configured quota.
  9. Where the api.data.gov default quota of 1000 req/hour applies, per-source budgets shall sum to 1000/hour with the v1 default split: bills 350, regulations.gov-dockets 200, hearings 150, public-laws 100, roll-call-votes 100, reserve 100.
  10. When `JoshHttpClient` observes an `X-RateLimit-Limit` header indicating an elevated quota (>1000/hr), per-source budgets shall scale proportionally so an operator with a 5K/hr key sees `bills=1750` etc. without manual config.
  11. Where a source's per-hour budget is exceeded, the `JoshHttpClient` shall sleep until the next bucket refill rather than raising; lock-held semantics do not apply to api.data.gov rate limiting.
  12. Where a source produces PDF bytes, the substrate's canonical `pdf_v1` normalizer shall implement the `Normalizer` protocol via `pdfplumber` (MIT license, pure Python).
  13. When `pdf_v1` extracts a table, it shall serialize as a GFM pipe-table in the Markdown body; tables that exceed 20 columns or have irregular row/colspans shall be captured as `Artifact` per the body-normalization protocol rather than being rendered lossily.
  14. Where `pdf_v1` encounters a PDF with no extractable text (scanned image-only PDF), the body normalization shall return an empty `markdown` field and stamp `body_status='scanned_no_text'` on the row; OCR is out of scope for v1.
  15. When the `pdf_v1` validation suite at `shared/josh_substrate/tests/normalizers/test_pdf_v1.py` runs, it shall exercise 12 stratified samples (3 each from CBO / GAO / LDA / hearings) covering simple text, table-heavy, multi-column, and complex-layout cases.
  16. Where `pdfplumber` is a substrate dependency, it shall be pinned `pdfplumber>=0.11,<0.12`; bumps require the validation suite to pass.
  17. Where Josh's licensing is Apache 2.0, AGPL-licensed PDF libraries (PyMuPDF / Artifex) shall not be used as runtime dependencies, even when technically more capable.
  18. Where a source must fetch from a bot-walled origin (Akamai / DataDome / similar), it shall use the shared `BrowserClient` singleton in `josh-ingester` rather than spawning its own browser.
  19. When the first headed-browser fetch happens in a `josh-ingester` process, the `BrowserClient` shall lazily spawn a single Playwright Chromium instance and keep it alive for the rest of the process lifetime.
  20. Where multiple sources share the browser, each shall fetch via its own `BrowserContext` so cookies do not leak across sources.
  21. When `josh-ingester` is built, the Docker base image shall be `mcr.microsoft.com/playwright/python:v1.49.0-noble` so Chromium system dependencies are pre-baked; `josh-core` and `josh-embedder` retain `python:3.12-slim`.
  22. When `BrowserClient` issues a fetch, the page shall use `playwright-stealth` patches (navigator.webdriver=false, etc.) to evade trivial bot detection; per-source override is allowed for sources requiring different stealth profiles.
  23. Where a source preserves raw payload, files shall be written under `/data/corpus/<source>/bodies/raw/<source-specific-subpath>` per the source's chosen partitioning (date / congress / number / etc.).
  24. Where a source's natural key has a stable hierarchical component (year, congress, type), the on-disk subpath shall partition by that component (FR by `YYYY/MM`, bills by `congress/type`, etc.).
  25. When a source overwrites an existing raw file (upstream record updated), the write shall be atomic via tmp-then-rename + fsync; the prior bytes are preserved only in the restic snapshot tier, not on the live filesystem.
  26. Where normalized Markdown is produced from a raw payload, it shall be written to the row's `body_text` column only; no parallel Markdown file shall be written to disk.
  27. Where raw bytes need compression, the substrate shall not gzip at the application layer; filesystem-level compression (ZFS / btrfs) is the recommended path if disk pressure ever justifies it.
  28. Where raw bytes are written, retention shall be indefinite at v1; pruning policy is reserved for revisit when disk pressure justifies it.
  29. When a source's fetch stage crashes between file write and DB commit, leaving an orphan in the corpus directory, the `josh source janitor <name>` command (per cli-conventions) shall sweep orphans older than 24 hours from the per-source `.tmp/` directory.
kindtest_file

Path

shared/josh_substrate/tests/test_source_defaults.py

Runner

uv run pytest shared/josh_substrate/tests/test_source_defaults.py -v

Contract test that exercises every default the substrate ships: - Scheduler builds with `America/New_York` default; per-source override flows through. - JoshHttpClient applies UA, rate limit, retry, timeout defaults. - api.data.gov budget allocator splits 1000/hr per the documented table; auto-scales on observed `X-RateLimit-Limit` header. - `pdf_v1` normalizer ships and passes the 12-sample validation suite under `tests/normalizers/test_pdf_v1.py`. - `BrowserClient` singleton lazy-spawns once and produces per-source contexts. - Corpus path helpers in `josh_substrate.corpus` produce the documented per-source subpaths; atomic write helper round-trips via tmp-then-rename. Determiner currently fails because: (a) `shared/josh_substrate/tests/test_source_defaults.py` does not yet exist; (b) the api.data.gov budget allocator and BrowserClient singleton are not yet implemented; (c) `pdf_v1` and its 12-sample validation suite are not yet implemented. Will flip to passing as the implementation lands across `shared/josh_substrate/` and `josh-ingester/ingester/`.

None.

  • Specific source ingester implementations — those live in per-source ingester specs (`bills-ingester`, `gao-reports-ingester`, etc.) which now depend on this spec.
  • OCR fallback for scanned PDFs — defer to v1.x when a real need surfaces.
  • Marker (LLM-assisted PDF) integration — defer to v1.x; preserve the option as a `pdf_v2` track.
  • Extraction of headed-browser into a separate `josh-fetcher` container — deferred until single-process Chromium hits a real scaling wall.
  • api.data.gov backfill burst mode (`--burst` flag) — deferred to v1.x.
  • Application-layer compression of raw bytes — deferred; filesystem-level compression is the preferred answer if disk pressure surfaces.
  • On-disk version history of raw payloads — restic snapshot tier IS the version history.
  • Source-specific HTTP, browser, or PDF customizations — those override the defaults via per-source attributes; this spec defines the defaults, not the override mechanism beyond noting it exists.

Decisions locked, organized by section. Each maps to one or more EARS
acceptance criteria above and to a contract test in the determiner.

## 1. Scheduler timezone

Locked: America/New_York (already implemented in
josh-ingester/ingester/scheduler.py as DEFAULT_TIMEZONE). Federal
data is intrinsically Eastern (Congress floor activity, FR ~08:45 ET,
DC business hours). Per-source override is allowed via the source's
timezone attribute.

## 2. HTTP client defaults

Locked, applied in JoshHttpClient (josh-ingester/ingester/http.py):

| Field | Default |
|-------|---------|
| User-Agent | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36 josh-ingester/<version> |
| Rate limit | 2 req/sec token bucket |
| Retry | tenacity, 3 attempts, exponential backoff, 5xx + network only |
| Fetch concurrency | 4 in-flight per source |
| Timeout | 10s connect / 60s read (new — was unset previously) |

Per-source override is the standard pattern; document the tighter
defaults source authors should reach for (e.g., FR API supports much
higher rate; LDA's headed-browser path is much lower).

Maintenance: when the User-Agent's Chrome version is >12 months old,
bump it. Non-blocking hygiene.

## 3. api.data.gov key strategy

Locked: single key, single env var, per-source budget allocation,
auto-scaling on elevated quota.

```
JOSH_API_DATA_GOV_KEY=...

Default budget split (1000 req/hr aggregate):
bills-ingester 350/hr
regulations-dot-gov-dockets 200/hr
hearings-ingester 150/hr
public-laws-ingester 100/hr
roll-call-votes-ingester 100/hr
headroom for v2 sources 100/hr
```

When JoshHttpClient observes X-RateLimit-Limit > 1000, all
per-source budgets scale proportionally — operator with elevated 5K/hr
key sees bills=1750 etc. without retuning. Per-source override via
JOSH_RATE_LIMIT_<SOURCE>_PER_HOUR=N env var.

Implementation: shared parent TokenBucket scoped to api.data.gov,
per-source child buckets. JoshHttpClient.api_data_gov_request()
acquires from both before issuing. New code in
josh-ingester/ingester/http.py.

Backfill burst (temporary access to the full unallocated budget) is
deferred to v1.x; v1 backfill on default quota is acceptable for
one-time operations.

## 4. PDF normalizer (pdf_v1)

Locked: pdfplumber (MIT license, pure Python, table-strong) as the
canonical pdf_v1 Normalizer implementation.

Library reasoning:
- PyMuPDF (Artifex) is AGPL-3.0 — incompatible with Apache 2.0
Foundation distribution. Ruled out despite being technically
strongest.
- pypdf is BSD but weak on tables; federal data is table-heavy.
- Marker (LLM-assisted) is highest-quality but requires GPU + 3GB
model weights. Defer to v1.x.

Behaviors:
- Tables → GFM pipe-table format in Markdown body.
- Complex tables (>20 cols or irregular spans) → captured as
Artifact per body-normalization protocol; surrounding prose
still in the Markdown body.
- Scanned PDFs → return empty markdown, stamp
body_status='scanned_no_text'. OCR is out of scope for v1; the
row still exists with metadata + null body.
- Validation: 12-sample stratified suite at
shared/josh_substrate/tests/normalizers/test_pdf_v1.py. 3
samples each from CBO, GAO, LDA, hearings, covering simple text,
table-heavy, multi-column, complex layouts.
- Pin pdfplumber>=0.11,<0.12. Bumps require the validation suite
to pass.

## 5. Headed-browser worker topology

Locked: in-process Playwright in josh-ingester (Option A from the
Q4 grilling pass).

Architecture:
- One BrowserClient singleton, lazy-spawned on first headed-fetch.
- Lifetime = ingester process lifetime. Killed on shutdown.
- Per-source BrowserContext for cookie isolation.
- playwright-stealth patches applied (navigator.webdriver=false,
etc.) to evade trivial bot-detection.

Library: playwright>=1.49, playwright-stealth>=0.3. Single
library family; async-native; well-maintained.

Image:
- josh-ingester/Dockerfile base swapped from python:3.12-slim to
mcr.microsoft.com/playwright/python:v1.49.0-noble. Image size
grows ~150MB but Chromium system deps are pre-baked, build is
faster, deps are correct.
- josh-core and josh-embedder retain python:3.12-slim (no
browser need).

Path to v2: extract BrowserClient to a separate josh-fetcher
container if scale demands. Trigger conditions: backfill takes >7
days OR steady-state browser memory exceeds 1GB OR ingester needs
to scale horizontally. Until then, in-process is operationally
simpler and adequate.

## 6. Corpus filesystem layout

Locked: existing per-source-subpath layout (already implemented in
shared/josh_substrate/corpus.py).

Pattern: /data/corpus/<source>/bodies/raw/<source-specific-subpath>
with the source author choosing partitioning that matches their
natural key:

| Source | Pattern |
|--------|---------|
| crs_reports | crs/bodies/raw/{number}.html + .json |
| federal_register | federal_register/bodies/raw/{YYYY}/{MM}/{document_number}.xml |
| public_laws | public_laws/bodies/raw/{congress}/PLAW-{c}publ{n}.xml |
| bills | bills/bodies/raw/billstatus/{congress}/{type}/BILLSTATUS-{c}{t}{n}.xml |

Convention for new sources: partition by the most stable component of
the natural key, then smaller components, terminating in the
natural-key string + extension.

Storage policies:
- Markdown: DB column only (body_text). No parallel Markdown
file on disk. Re-derivation = "fetch row, run normalizer on raw,
compare hash."
- Updates: atomic overwrite via tmp-then-rename + fsync. Prior
bytes preserved only in restic snapshots — backup tier IS the
version history.
- Compression: uncompressed at application layer. Filesystem-
level (ZFS / btrfs) is the recommended path if disk pressure
justifies it later.
- Retention: indefinite at v1. Revisit when disk pressure
justifies pruning policy.
- Atomicity: every fetch writes to <source>/bodies/raw/.tmp/
first, fsyncs, then os.replace() to the canonical path. Only
after the file is at the canonical path does the DB row get
committed.
- Orphan sweep: josh source janitor <name> (per
cli-conventions) cleans .tmp/ files older than 24 hours.

## Implementation surface

- shared/josh_substrate/corpus.py — gains atomic_write_bytes(path,
bytes)
helper used by every source's fetch stage. Per-source path
helpers already implemented.
- shared/josh_substrate/normalizers/pdf_v1.py — new module
implementing Normalizer against pdfplumber. ~150 LOC.
- shared/josh_substrate/tests/normalizers/test_pdf_v1.py — 12-sample
stratified validation suite.
- josh-ingester/ingester/http.py — extends JoshHttpClient with
api.data.gov budget allocator (shared parent bucket + per-source
children, auto-scaling on observed quota header). Adds 10s/60s
timeout default.
- josh-ingester/ingester/browser.py — new module with
BrowserClient singleton, per-source context manager, stealth
patching.
- josh-ingester/Dockerfile — base image swap to Playwright Python
base.
- shared/josh_substrate/tests/test_source_defaults.py — contract
test suite that the success_determiner runs.

1 of 12 done.

  • t1 All 5 cross-cutting decisions resolved with rritz; spec drafted
  • t2 10s/60s default HTTP timeout wired into JoshHttpClient
  • t3 api.data.gov shared parent bucket + per-source budgets + auto-scaling on observed quota
  • t4 pdf_v1 Normalizer implementation against pdfplumber
  • t5 12-sample stratified PDF validation suite (3 each: CBO / GAO / LDA / hearings)
  • t6 BrowserClient singleton with playwright-stealth and per-source BrowserContext
  • t7 josh-ingester Dockerfile base swap to mcr.microsoft.com/playwright/python:v1.49.0-noble
  • t8 atomic_write_bytes helper in josh_substrate.corpus + tmp-then-rename pattern enforced in source fetch stages
  • t9 Per-ingester-spec dependency declaration on substrate-source-defaults (17 ingester specs)
  • t10 Per-ingester-spec clarifications_needed pruned of items resolved by this spec
  • t11 Contract test suite at shared/josh_substrate/tests/test_source_defaults.py covering every acceptance criterion
  • t12 Determiner runs green — every default exercised against the live ingester runtime
  • 2026-05-10T22:00:00Z (new)planned Spec authored after a 5-question grilling-style decision pass with claude. Locked the existing in-code defaults (scheduler TZ, HTTP User-Agent + retry + rate limit, fetch concurrency, corpus layout) as canonical conventions and made four real new decisions: 1. HTTP timeout: 10s connect / 60s read default. 2. api.data.gov: single shared key with per-source budget allocation (350/200/150/100/100 + 100 reserve at 1000/hr default; auto-scales on elevated quota detection). 3. PDF normalizer: pdfplumber as canonical pdf_v1; PyMuPDF ruled out due to AGPL-3.0 / Apache 2.0 conflict; Marker deferred. 4. Headed-browser: in-process Playwright in josh-ingester with singleton browser + per-source BrowserContext; base image swap to Microsoft's playwright/python; extraction to separate container deferred until scale demands. Per-source ingester specs will declare dependency on this spec and drop the resolved clarifications in a follow-up commit.

docs/spec/substrate-source-defaults.html · generated by bin/build-spec.py