Skip to content

Code conventions

The patterns the two shipping ingesters (CRS Reports, Legislators+Committees) and the substrate framework already mostly agree on, codified so the next ingester doesn't drift. Read this before contributing code. Update it when the convention changes — never let conventions live only in your head.

Posture. These are conventions, not commandments. When the convention is wrong for a specific case, deviate and leave a comment that says why. When the deviation generalizes (the third instance of "this source is different because…"), pause and refactor the convention itself rather than papering over it.

  1. Tooling baseline
  2. Python style
  3. Async-first
  4. Schema: where it lives
  5. Database access
  6. Source modules
  7. Error handling
  8. Logging
  9. Tests
  10. HTTP API (josh-core)
  11. The raw_* rule
  12. Pre-PR checklist

Three checkers + one orchestrator, all run from the repo root via uv run poe:

ToolPurposepoe task
ruff formatCode formatter (drop-in for Black, ~30× faster).uv run poe format
ruff checkLinter (replaces flake8, isort, pyupgrade, autoflake, etc.).uv run poe lint
mypyType checker, strict mode.uv run poe typecheck
poethepoetTask runner — orchestrates everything via [tool.poe.tasks].uv run poe ci (full pipeline)

Why mypy not pyright. Both work; we picked mypy because the Pydantic plugin is more mature, the typeshed ecosystem treats mypy as upstream, and pyproject.toml's [tool.mypy] is what most third-party tools assume. Pyright in editor (Pylance) is fine and not in conflict with mypy in CI — they tend to agree on real findings, disagree only on edge cases neither catches reliably. If editor-CI alignment ever becomes a real friction point, switching is a ~2-hour exercise (see conventions-refactor spec for the rationale we landed on).

Why strict typing matters more than usual here. Josh is agent-written by design — Claude Code, Cursor, future agents. Strict typing is a circuit breaker against AI hallucination: it catches "method that doesn't exist" and "Any-leaking through three layers" at edit time instead of at production-ETL time. The bar for type rigor is higher than it would be for a typical web app.

Config lives at the repo root in pyproject.toml (Ruff and mypy both read root configs; per-package overrides are allowed but discouraged):

[tool.ruff]
target-version = "py312"
line-length = 100
extend-exclude = ["vendor", "**/migrations/versions", "**/__pycache__", ".venv"]
[tool.ruff.lint]
# Strict rule set — every file in the repo is held to this.
select = ["E", "F", "I", "UP", "B", "SIM", "RUF"]
ignore = [
"E501", # long lines — `ruff format` handles this; long URLs are fine
"RUF002", # ambiguous-unicode in docstrings — em-dashes / × / – are deliberate prose
]
[tool.ruff.lint.per-file-ignores]
# Tests can have asserts and unused fixtures.
"**/tests/**" = ["B", "SIM"]
[tool.mypy]
python_version = "3.12"
strict = true
plugins = ["pydantic.mypy"]
ignore_missing_imports = true
exclude = ["(?:^|/)migrations/versions/", "(?:^|/)\\.venv/", "(?:^|/)vendor/"]

Outside contributors and AI agents must run uv run poe ci before opening a PR. The pre-push git hook enforces this locally; there is no remote CI. See pre-PR checklist.

  • Python ≥ 3.12. All pyproject.tomls pin requires-python = ">=3.12". Use 3.12+ syntax freely (PEP 695 generics, Self, etc.).
  • from __future__ import annotations at the top of every module. Cheap forward-compat, lets us use modern type syntax in runtime contexts (e.g., SimpleNamespace attrs).
  • Modern type syntax. str | None, list[int], dict[str, Any] — never Optional[T], List[int], or Dict[str, Any].
  • Naming. snake_case for functions and variables, PascalCase for classes, SCREAMING_SNAKE for module-level constants. Private helpers prefix with _.
  • Imports. Three groups separated by blank lines: stdlib, third-party, first-party (Ruff's I rule enforces this). First-party means anything under josh_substrate, ingester, app.
  • Line length 100. Comments and SQL strings are exempt — long URLs and inline SQL stay readable.
  • Comments capture why, not what. If a future reader can derive the what from the code, the comment is noise. The why — workarounds, perf decisions, surprising upstream behavior, observed failure modes — is what code can't tell you. See shared/josh_substrate/src/josh_substrate/db.py's PRAGMA comment for the model.

The pipeline is async end-to-end. Every Source-protocol method is async def; the parse stage is an async generator even when it does pure CPU work, so the runner can async for uniformly.

Concrete rules:

  • Pipeline code: async. Source discover/fetch/parse/load, runner state writes, HTTP calls.
  • Pure utilities: sync. A regex parser, a string normalizer, a hash function — sync. Don't async def something that has nothing to await.
  • Crossing the boundary: asyncio.to_thread for blocking-but-justifiable code (e.g., lxml parse on a large XML); never call sync I/O from an async context without it.
  • CLI: asyncio.run(_inner()) at the leaf. Each Typer command does one asyncio.run call against a small async helper. See josh-ingester/ingester/cli.py.

This is the convention question with the highest blast radius — get it wrong, and every new ingester pays the tax. The decision:

Migrations are the source of truth. One Python file under shared/josh_substrate/src/josh_substrate/migrations/versions/ per logical schema change. The DDL there wins over every other representation.

No "row models" for the ingest path. The loader receives a ParsedRecord from the parser, calls a generic upsert helper with the column tuple + payload dict, and that's it. There is no Pydantic class that mirrors the table for ingestion to validate against. (The migration is the contract; the loader's column tuple is the only other place the column list appears, and tests cover that they match.)

Pydantic models exist where they validate at a boundary. Specifically:

  • Pipeline data shapes (FetchTask, ParsedRecord, SourceState, etc.) — boundary between pipeline stages.
  • Eventual HTTP request/response models in josh-core — boundary with API consumers.
  • Not "documentation of what's in this table." The migration documents that.

What this means for shared/josh_substrate/src/josh_substrate/models/. The CrsReport, Legislator, etc. models there are speculative: they were drafted ahead of the serving layer that will use them. Until josh-core serves a given resource, the corresponding model should not exist. When the serving endpoint lands, write the response model alongside the router — close to where it's used.

Why not SQLModel? SQLModel tries to unify SQLAlchemy ORM and Pydantic in one class. It's elegant when it works and painful when it lags Pydantic or SQLAlchemy releases. We have no ORM today — the loader uses raw SQL via SQLAlchemy Core's text() — so SQLModel solves a problem we don't have. Community discussion reflects this lag concern. Skip it.

Three rules govern every line of DB code in the repo:

5.1 Always go through josh_substrate.db.async_engine()

Section titled “5.1 Always go through josh_substrate.db.async_engine()”

Never construct your own engine. async_engine() resolves SUBSTRATE_DB_PATH, applies the substrate-wide PRAGMAs, and loads the sqlite-vec extension. Bypassing it gets you an engine without WAL or vec0 — silent correctness bug.

No ORM, no query builder. Every statement is a text("…") with named parameters. Why:

  • SQLite + FTS5 + sqlite-vec queries don't fit any ORM's mental model — JOINs against virtual tables, vec_search(), MATCH '...'. Hand-written SQL is honest about the storage.
  • The substrate is read-mostly under load. Hand-written SQL is faster than emitted SQL — the difference matters at the per-query latency of an MCP/REST call.
  • Migration-as-source-of-truth becomes natural: the SQL you write matches the schema you migrated, no translation layer.

5.3 Upserts go through the substrate helper

Section titled “5.3 Upserts go through the substrate helper”

Every loader does the same dance: build an INSERT … ON CONFLICT(natural_key) DO UPDATE SET …, replace child rows via DELETE-then-INSERT, count inserted vs updated. Centralize it as josh_substrate.db.upsert + upsert_with_children and call from the loader:

from josh_substrate.db import upsert_with_children
_PARENT_COLUMNS = ("id", "number", "type_id", "title", ...)
_CHILD_TABLES = ("crs_report_versions", "crs_report_authors",
"crs_report_topics", "crs_report_related_bills")
_PARENT_UPDATE_EXTRA = (
"updated_at = strftime('%Y-%m-%dT%H:%M:%fZ','now')",
# any audit-only setters that fire on UPDATE go here.
)
async with engine.begin() as conn:
existed = await upsert_with_children(
conn,
parent_table="crs_reports",
parent_columns=_PARENT_COLUMNS,
parent_payload=rec.payload,
parent_update_extra=_PARENT_UPDATE_EXTRA,
children=rec.children, # list[ParsedRecord]
child_tables=_CHILD_TABLES, # FULL set — DELETE fires for empty too
child_fk_column="report_id",
child_fk_value=rec.key,
)

The helper handles the column-tuple → SQL string concat (no two idioms across files), runs the SELECT 1 existence check, executes the upsert, DELETE-then-INSERTs children, and returns True if the row pre-existed. Loaders shrink to: per-record try/except + call the helper + accumulate LoadStats.

Three companion helpers in the same module:

  • upsert(conn, *, table, columns, payload, conflict_columns=("id",), update_extra=(), skip_update_columns=()) — when there are no children. Composite-key conflicts work via conflict_columns=("a", "b").
  • insert_record(conn, *, table, payload) — plain INSERT for child rows; column list inferred from payload keys.
  • replace_children(conn, *, child_tables, child_fk_column, child_fk_value) — DELETE-by-FK pattern, used internally by upsert_with_children and exposed for loaders that need it directly.

All four helpers validate every interpolated identifier (table, conflict_columns, child_fk_column, etc.) against [A-Za-z_]\w* at compose time. Caller typos and injection-shaped strings raise ValueError immediately rather than failing as obscure SQL errors at exec.

Note on RETURNING. SQLite supports RETURNING on INSERT … ON CONFLICT DO UPDATE as of 3.35, but it returns rows for both insert and update — there's no in-statement way to tell them apart (SQLite forum thread). The SELECT 1 existence check is the cleanest reliable detection at our scale. Revisit if loader latency ever becomes a bottleneck.

5.4 Transactions per record (or per logical group)

Section titled “5.4 Transactions per record (or per logical group)”

The loader opens one transaction per top-level ParsedRecord via async with db.begin() as conn. If the parent + children all succeed, commit. If any insert fails, the per-record try/except catches it, rolls back the transaction, increments stats.errored, and the run continues. Never let one bad record abort an entire batch.

Every source is a Python subpackage under josh-ingester/ingester/sources/<source_name>/ with this exact layout:

sources/<source_name>/
├── __init__.py # exposes `source = MySource()` (and re-exports `cli` if present)
├── _constants.py # optional: shared URL templates, file lists, taxonomy IDs
├── discover.py # async generator yielding FetchTasks
├── fetch.py # download payloads to /data/corpus/<source>/...
├── parse.py # bytes-on-disk → ParsedRecord async generator
├── load.py # one async function: per-record try/except + helper call
└── cli.py # optional: typer.Typer with source-local maintenance commands

Rules:

  • __init__.py exposes source: Source. The registry auto-discovers it via pkgutil.iter_modules. Don't register manually anywhere.
  • Each pipeline stage in its own file. One discover async generator in discover.py, etc. The Source class in __init__.py just delegates.
  • Pure functions where possible. parse.py takes raw_paths: dict[str, Path] and yields ParsedRecords. No HTTP, no DB, no filesystem writes. Easy to test.
  • Source-local helpers prefix with _ (e.g., _normalize_bill_id, _strip_wrapper_noise). Module-level constants in SCREAMING_SNAKE.
  • Underscore-prefixed modules are skipped by the registry. Use _constants.py, _helpers.py for anything you don't want auto-imported.
  • Optional CLI. If a source has its own maintenance commands, expose cli: typer.Typer from cli.py and re-export from __init__.py. The global CLI auto-mounts under josh-ingester <source-name-with-hyphens>. Don't add commands to the global CLI for source-specific concerns.
  • discover and parse are def-returning-AsyncIterator, not async def. Both implementations contain yield, which makes them async generators — callables that return an AsyncIterator directly. Declaring them async def would have mypy infer the return type as Coroutine[Any, Any, AsyncIterator[T]] (you'd have to await them before iterating). The def form matches what runtime actually does. fetch and load remain async def because they're regular coroutines. See mypy#5385.

The two shipping ingesters (josh-ingester/ingester/sources/crs_reports/__init__.py, josh-ingester/ingester/sources/legislators_committees/__init__.py) are the worked examples. The 14-task ingestion-built contract baked into docs/spec/data/_templates/source.yaml mirrors this layout step by step.

Three layers, each with a different policy:

LayerPolicyWhere
Per-recordCatch in load()'s loop. Append f"{type(exc).__name__}: {exc}" to LoadStats.error_messages. Log at error. Continue with next record.Source load() functions
Per-taskCatch in the runner's fetch/parse/load loops. Mark task errored in ingestion_tasks. Log at error with traceback. Run continues with remaining tasks.runner._run_inner_body
Whole-sourceAnything that escapes the loops (DB unreachable, HTTP client init failed, bug in discover()) is caught at the runner's outer try, the run is marked failed, and the exception is logged with traceback.format_exc().runner._run_locked

Universal idiom for stringifying exceptions: err = f"{type(exc).__name__}: {exc}". Used everywhere identically — search the codebase, you'll find one shape. Don't reinvent.

Re-raise vs catch. Catch when you want the run to continue (per-record, per-task). Re-raise (or let propagate) when the failure means subsequent work is meaningless (YAML parse error in legislators-historical = no point trying the rest). The default is catch for record-level work, propagate for setup-level work.

Defensive "should not happen" checks. Allowed when they protect against upstream API drift (e.g., "if raw_paths.get(file_name) is None, log and skip — the upstream YAML mirror occasionally changes filenames"). Not allowed when they protect against your own code's invariants (e.g., asserting that rec.table == "crs_reports" when the parser only ever emits that table) — that's noise that violates Rule 2 (Simplicity First).

  • structlog only. Standard logging is configured for level filtering, but every log call goes through structlog.get_logger(name).
  • Logger name = module path. structlog.get_logger("ingester.sources.crs_reports.load"). Makes filtering by component trivial.
  • Event name as first positional, structured kwargs after. log.error("crs.load_failed", report_id=rec.key, error=err) — never log.error(f"failed to load {rec.key}: {err}"). Structured fields make logs queryable in ingestion_logs and any future log aggregator.
  • Event names are dotted, lowercase, past-tense. fetch.completed, load_failed, wal_checkpoint_failed. Stable enough to alert on.
  • JSON to stderr in production. The CLI's _configure_logging sets up JSONRenderer. Don't override per-module.
  • Never log payload bodies at info or above. Field-level (e.g., report_id, title) is fine; full raw_json in a log line is not.

Today the test suite is asymmetric — CRS has fixture-driven tests, legislators+committees has none. The convention going forward: every shipping source has at least three test categories before its spec can flip verified.

CategoryWhat it coversExample
Parse fixturesReal upstream payloads captured under josh-ingester/tests/fixtures/<source>/. Asserts the parser produces the expected ParsedRecord shape and child counts.josh-ingester/tests/sources/test_crs_reports.py against fixtures/crs/R48481.json + .html
Defensive inputsMalformed CSV row, missing required field, unexpected JSON shape. Asserts the parser logs a warning and continues — never crashes the generator.test_csv_skips_malformed_rows in the same file
Loader smokeAn in-memory SQLite engine with the actual migrations applied. Run load(parsed_records, engine) and assert the row landed and counts are right.To be added in tests/sources/test_<source>_load.py

Conventions inside tests:

  • pytest + pytest-asyncio in auto mode. No @pytest.mark.asyncio decorators clutter; mark asyncio_mode = "auto" in pyproject.toml.
  • Fixtures live next to tests under tests/fixtures/<source>/. Real upstream payloads, captured once and committed. Don't generate fixtures programmatically in conftest — the point is to catch upstream-shape regressions.
  • HTTP is stubbed via a small _StubHttp shim like the one in test_crs_reports.py. No responses or httpx_mock dependencies — the shim is 15 lines.
  • Loader smoke tests use in-memory SQLite (SUBSTRATE_DB_PATH=":memory:") with alembic upgrade head applied at fixture setup. Each test gets a fresh DB.
  • Runner integration test (one-time, in tests/test_runner.py) exercises a fake source through the full pipeline and asserts state-table writes happened. Not duplicated per source.

josh-core is a placeholder today (two endpoints, no structure). When the first real endpoint lands, follow this layout:

josh-core/app/
├── main.py # FastAPI app, lifespan, router includes
├── deps.py # Depends() factories (engine, settings, etc.)
├── settings.py # pydantic-settings BaseSettings
├── routers/
│ ├── crs.py # /v1/crs/* endpoints + their response models
│ ├── legislators.py # /v1/legislators/*
│ └── ...
└── exceptions.py # custom exception classes + handlers

Rules:

  • One router file per resource. Endpoints + Pydantic response models live together. Don't split response models into a separate schemas/ directory unless they're shared across routers.
  • DB engine via FastAPI lifespan (one engine per process, attached on startup, disposed on shutdown). Inject into endpoints via Depends(get_engine).
  • Every endpoint declares response_model=…. Auto-validates and shapes the OpenAPI spec. Without it, FastAPI returns whatever you return — silent contract drift.
  • Versioned routes. /v1/... prefix from day one. Path-based versioning beats header negotiation for an OSS tool.
  • Errors: HTTPException for HTTP-shaped problems (404, 400, 422). Custom exception classes for substrate-level failures, registered as exception handlers in main.py.
  • Response models are hand-written, not derived. Per the schema rule, the migration is the source of truth. The response model is whatever the API contract requires — usually a subset, sometimes with computed fields.

Worked walkthrough of the async SQLAlchemy 2.0 + Pydantic v2 pattern if it's new to you.

Several substrate tables have raw_json / raw_yaml columns that hold the full upstream payload (a CRS JSON blob, a legislator YAML record). These exist for replay and debugging — re-run the parser against the same input to validate a parser change.

They are internal-only. No serving endpoint returns them. No CLI command prints them. No log line includes their full contents. The Pydantic response models for the eventual API must not include raw_* fields — even if you have them in the row, the projection drops them.

Why this matters: legislator YAML carries district-office addresses and other contact info that's appropriate for an internal store but not for a low-friction public API. The convention is the wall.

Before opening a PR (or, for an AI agent, before declaring the task verified):

  • uv run poe ci — green. This runs lint + format-check + typecheck + unit tests + contract tests + spec validation in one shot. The pre-push git hook enforces it locally; passing means the change is mergeable.
  • If a substrate or new-service change: uv run poe smoke — green. End-to-end verification is opt-in (~30s after warm cache); it isn't part of poe ci.
  • If a spec is involved: bin/build-spec.py --lint clean, bin/dry-run-spec.py <id> --explain green, changelog entry appended (the in-browser editor auto-prepends; if you edit YAML directly, append by hand).
  • If a data source's status changed: docs/josh-data-sources.html + docs/data-status.html updated in the same commit.
  • If a substrate-level pattern changed: this conventions doc updated in the same commit. Add a one-line entry to the convention changelog.
  • No new convention introduced silently — if you deviated from a rule above, the deviation is commented and (if it generalizes) the convention is updated.

This is a young doc. When a convention changes, add a one-line entry here so a reader can see the history without spelunking git blame.

  • 2026-05-10 (latest) — Removed the legacy mypy override block; strict typing now applies repo-wide. The seven previously-relaxed modules (ingester.*, app.main, josh_substrate.normalizers.*, josh_substrate.citations.*, josh_substrate.chunking.*, josh_substrate.corpus, josh_substrate.db, josh_substrate.protocols) were brought up to strict = true.
  • 2026-05-10 (later) — Conventions-refactor pass landed. Switched type checker from pyright to mypy (Pydantic plugin maturity, typeshed alignment); tightened [tool.ruff.lint] to E F I UP B SIM RUF across the whole repo (no per-module grandfathering except a documented legacy override block in mypy); flipped Source.discover and Source.parse from async def to def-returning-AsyncIterator to match what async generators actually produce; landed upsert / upsert_with_children helpers in josh_substrate.db as the canonical loader pattern; deleted speculative table-mirror Pydantic models (will reintroduce next to the FastAPI router that uses them, when it lands). Pre-PR checklist now centers on uv run poe ci.
  • 2026-05-10 — Initial draft. Codifies what CRS Reports + Legislators+Committees + the framework already do, with deliberate decisions on schema duplication (migrations win, no row models for ingest), upsert helper (substrate-owned), and HTTP layout (router-per-resource, hand-written response models).