Code conventions
The patterns the two shipping ingesters (CRS Reports, Legislators+Committees) and the substrate framework already mostly agree on, codified so the next ingester doesn't drift. Read this before contributing code. Update it when the convention changes — never let conventions live only in your head.
Posture. These are conventions, not commandments. When the convention is wrong for a specific case, deviate and leave a comment that says why. When the deviation generalizes (the third instance of "this source is different because…"), pause and refactor the convention itself rather than papering over it.
Contents
Section titled “Contents”- Tooling baseline
- Python style
- Async-first
- Schema: where it lives
- Database access
- Source modules
- Error handling
- Logging
- Tests
- HTTP API (josh-core)
- The
raw_*rule - Pre-PR checklist
1. Tooling baseline
Section titled “1. Tooling baseline”Three checkers + one orchestrator, all run from the repo root via uv run poe:
| Tool | Purpose | poe task |
|---|---|---|
ruff format | Code formatter (drop-in for Black, ~30× faster). | uv run poe format |
ruff check | Linter (replaces flake8, isort, pyupgrade, autoflake, etc.). | uv run poe lint |
mypy | Type checker, strict mode. | uv run poe typecheck |
poethepoet | Task runner — orchestrates everything via [tool.poe.tasks]. | uv run poe ci (full pipeline) |
Why mypy not pyright. Both work; we picked mypy because the Pydantic plugin is more mature, the typeshed ecosystem treats mypy as upstream, and pyproject.toml's [tool.mypy] is what most third-party tools assume. Pyright in editor (Pylance) is fine and not in conflict with mypy in CI — they tend to agree on real findings, disagree only on edge cases neither catches reliably. If editor-CI alignment ever becomes a real friction point, switching is a ~2-hour exercise (see conventions-refactor spec for the rationale we landed on).
Why strict typing matters more than usual here. Josh is agent-written by design — Claude Code, Cursor, future agents. Strict typing is a circuit breaker against AI hallucination: it catches "method that doesn't exist" and "Any-leaking through three layers" at edit time instead of at production-ETL time. The bar for type rigor is higher than it would be for a typical web app.
Config lives at the repo root in pyproject.toml (Ruff and mypy both read root configs; per-package overrides are allowed but discouraged):
[tool.ruff]target-version = "py312"line-length = 100extend-exclude = ["vendor", "**/migrations/versions", "**/__pycache__", ".venv"]
[tool.ruff.lint]# Strict rule set — every file in the repo is held to this.select = ["E", "F", "I", "UP", "B", "SIM", "RUF"]ignore = [ "E501", # long lines — `ruff format` handles this; long URLs are fine "RUF002", # ambiguous-unicode in docstrings — em-dashes / × / – are deliberate prose]
[tool.ruff.lint.per-file-ignores]# Tests can have asserts and unused fixtures."**/tests/**" = ["B", "SIM"]
[tool.mypy]python_version = "3.12"strict = trueplugins = ["pydantic.mypy"]ignore_missing_imports = trueexclude = ["(?:^|/)migrations/versions/", "(?:^|/)\\.venv/", "(?:^|/)vendor/"]Outside contributors and AI agents must run uv run poe ci before opening a PR. The pre-push git hook enforces this locally; there is no remote CI. See pre-PR checklist.
2. Python style
Section titled “2. Python style”- Python ≥ 3.12. All
pyproject.tomls pinrequires-python = ">=3.12". Use 3.12+ syntax freely (PEP 695 generics,Self, etc.). from __future__ import annotationsat the top of every module. Cheap forward-compat, lets us use modern type syntax in runtime contexts (e.g.,SimpleNamespaceattrs).- Modern type syntax.
str | None,list[int],dict[str, Any]— neverOptional[T],List[int], orDict[str, Any]. - Naming.
snake_casefor functions and variables,PascalCasefor classes,SCREAMING_SNAKEfor module-level constants. Private helpers prefix with_. - Imports. Three groups separated by blank lines: stdlib, third-party, first-party (Ruff's
Irule enforces this). First-party means anything underjosh_substrate,ingester,app. - Line length 100. Comments and SQL strings are exempt — long URLs and inline SQL stay readable.
- Comments capture why, not what. If a future reader can derive the what from the code, the comment is noise. The why — workarounds, perf decisions, surprising upstream behavior, observed failure modes — is what code can't tell you. See
shared/josh_substrate/src/josh_substrate/db.py's PRAGMA comment for the model.
3. Async-first
Section titled “3. Async-first”The pipeline is async end-to-end. Every Source-protocol method is async def; the parse stage is an async generator even when it does pure CPU work, so the runner can async for uniformly.
Concrete rules:
- Pipeline code: async. Source
discover/fetch/parse/load, runner state writes, HTTP calls. - Pure utilities: sync. A regex parser, a string normalizer, a hash function — sync. Don't
async defsomething that has nothing to await. - Crossing the boundary:
asyncio.to_threadfor blocking-but-justifiable code (e.g.,lxmlparse on a large XML); never call sync I/O from an async context without it. - CLI:
asyncio.run(_inner())at the leaf. Each Typer command does oneasyncio.runcall against a small async helper. Seejosh-ingester/ingester/cli.py.
4. Schema: where it lives
Section titled “4. Schema: where it lives”This is the convention question with the highest blast radius — get it wrong, and every new ingester pays the tax. The decision:
Migrations are the source of truth. One Python file under shared/josh_substrate/src/josh_substrate/migrations/versions/ per logical schema change. The DDL there wins over every other representation.
No "row models" for the ingest path. The loader receives a ParsedRecord from the parser, calls a generic upsert helper with the column tuple + payload dict, and that's it. There is no Pydantic class that mirrors the table for ingestion to validate against. (The migration is the contract; the loader's column tuple is the only other place the column list appears, and tests cover that they match.)
Pydantic models exist where they validate at a boundary. Specifically:
- Pipeline data shapes (
FetchTask,ParsedRecord,SourceState, etc.) — boundary between pipeline stages. - Eventual HTTP request/response models in
josh-core— boundary with API consumers. - Not "documentation of what's in this table." The migration documents that.
What this means for shared/josh_substrate/src/josh_substrate/models/. The CrsReport, Legislator, etc. models there are speculative: they were drafted ahead of the serving layer that will use them. Until josh-core serves a given resource, the corresponding model should not exist. When the serving endpoint lands, write the response model alongside the router — close to where it's used.
Why not SQLModel? SQLModel tries to unify SQLAlchemy ORM and Pydantic in one class. It's elegant when it works and painful when it lags Pydantic or SQLAlchemy releases. We have no ORM today — the loader uses raw SQL via SQLAlchemy Core's text() — so SQLModel solves a problem we don't have. Community discussion reflects this lag concern. Skip it.
5. Database access
Section titled “5. Database access”Three rules govern every line of DB code in the repo:
5.1 Always go through josh_substrate.db.async_engine()
Section titled “5.1 Always go through josh_substrate.db.async_engine()”Never construct your own engine. async_engine() resolves SUBSTRATE_DB_PATH, applies the substrate-wide PRAGMAs, and loads the sqlite-vec extension. Bypassing it gets you an engine without WAL or vec0 — silent correctness bug.
5.2 Raw SQL via SQLAlchemy Core's text()
Section titled “5.2 Raw SQL via SQLAlchemy Core's text()”No ORM, no query builder. Every statement is a text("…") with named parameters. Why:
- SQLite + FTS5 +
sqlite-vecqueries don't fit any ORM's mental model — JOINs against virtual tables,vec_search(),MATCH '...'. Hand-written SQL is honest about the storage. - The substrate is read-mostly under load. Hand-written SQL is faster than emitted SQL — the difference matters at the per-query latency of an MCP/REST call.
- Migration-as-source-of-truth becomes natural: the SQL you write matches the schema you migrated, no translation layer.
5.3 Upserts go through the substrate helper
Section titled “5.3 Upserts go through the substrate helper”Every loader does the same dance: build an INSERT … ON CONFLICT(natural_key) DO UPDATE SET …, replace child rows via DELETE-then-INSERT, count inserted vs updated. Centralize it as josh_substrate.db.upsert + upsert_with_children and call from the loader:
from josh_substrate.db import upsert_with_children
_PARENT_COLUMNS = ("id", "number", "type_id", "title", ...)_CHILD_TABLES = ("crs_report_versions", "crs_report_authors", "crs_report_topics", "crs_report_related_bills")_PARENT_UPDATE_EXTRA = ( "updated_at = strftime('%Y-%m-%dT%H:%M:%fZ','now')", # any audit-only setters that fire on UPDATE go here.)
async with engine.begin() as conn: existed = await upsert_with_children( conn, parent_table="crs_reports", parent_columns=_PARENT_COLUMNS, parent_payload=rec.payload, parent_update_extra=_PARENT_UPDATE_EXTRA, children=rec.children, # list[ParsedRecord] child_tables=_CHILD_TABLES, # FULL set — DELETE fires for empty too child_fk_column="report_id", child_fk_value=rec.key, )The helper handles the column-tuple → SQL string concat (no two idioms across files), runs the SELECT 1 existence check, executes the upsert, DELETE-then-INSERTs children, and returns True if the row pre-existed. Loaders shrink to: per-record try/except + call the helper + accumulate LoadStats.
Three companion helpers in the same module:
upsert(conn, *, table, columns, payload, conflict_columns=("id",), update_extra=(), skip_update_columns=())— when there are no children. Composite-key conflicts work viaconflict_columns=("a", "b").insert_record(conn, *, table, payload)— plain INSERT for child rows; column list inferred from payload keys.replace_children(conn, *, child_tables, child_fk_column, child_fk_value)— DELETE-by-FK pattern, used internally byupsert_with_childrenand exposed for loaders that need it directly.
All four helpers validate every interpolated identifier (table, conflict_columns, child_fk_column, etc.) against [A-Za-z_]\w* at compose time. Caller typos and injection-shaped strings raise ValueError immediately rather than failing as obscure SQL errors at exec.
Note on RETURNING. SQLite supports RETURNING on INSERT … ON CONFLICT DO UPDATE as of 3.35, but it returns rows for both insert and update — there's no in-statement way to tell them apart (SQLite forum thread). The SELECT 1 existence check is the cleanest reliable detection at our scale. Revisit if loader latency ever becomes a bottleneck.
5.4 Transactions per record (or per logical group)
Section titled “5.4 Transactions per record (or per logical group)”The loader opens one transaction per top-level ParsedRecord via async with db.begin() as conn. If the parent + children all succeed, commit. If any insert fails, the per-record try/except catches it, rolls back the transaction, increments stats.errored, and the run continues. Never let one bad record abort an entire batch.
6. Source modules
Section titled “6. Source modules”Every source is a Python subpackage under josh-ingester/ingester/sources/<source_name>/ with this exact layout:
sources/<source_name>/├── __init__.py # exposes `source = MySource()` (and re-exports `cli` if present)├── _constants.py # optional: shared URL templates, file lists, taxonomy IDs├── discover.py # async generator yielding FetchTasks├── fetch.py # download payloads to /data/corpus/<source>/...├── parse.py # bytes-on-disk → ParsedRecord async generator├── load.py # one async function: per-record try/except + helper call└── cli.py # optional: typer.Typer with source-local maintenance commandsRules:
__init__.pyexposessource: Source. The registry auto-discovers it viapkgutil.iter_modules. Don't register manually anywhere.- Each pipeline stage in its own file. One
discoverasync generator indiscover.py, etc. The Source class in__init__.pyjust delegates. - Pure functions where possible.
parse.pytakesraw_paths: dict[str, Path]and yieldsParsedRecords. No HTTP, no DB, no filesystem writes. Easy to test. - Source-local helpers prefix with
_(e.g.,_normalize_bill_id,_strip_wrapper_noise). Module-level constants inSCREAMING_SNAKE. - Underscore-prefixed modules are skipped by the registry. Use
_constants.py,_helpers.pyfor anything you don't want auto-imported. - Optional CLI. If a source has its own maintenance commands, expose
cli: typer.Typerfromcli.pyand re-export from__init__.py. The global CLI auto-mounts underjosh-ingester <source-name-with-hyphens>. Don't add commands to the global CLI for source-specific concerns. discoverandparsearedef-returning-AsyncIterator, notasync def. Both implementations containyield, which makes them async generators — callables that return anAsyncIteratordirectly. Declaring themasync defwould have mypy infer the return type asCoroutine[Any, Any, AsyncIterator[T]](you'd have toawaitthem before iterating). Thedefform matches what runtime actually does.fetchandloadremainasync defbecause they're regular coroutines. See mypy#5385.
The two shipping ingesters (josh-ingester/ingester/sources/crs_reports/__init__.py, josh-ingester/ingester/sources/legislators_committees/__init__.py) are the worked examples. The 14-task ingestion-built contract baked into docs/spec/data/_templates/source.yaml mirrors this layout step by step.
7. Error handling
Section titled “7. Error handling”Three layers, each with a different policy:
| Layer | Policy | Where |
|---|---|---|
| Per-record | Catch in load()'s loop. Append f"{type(exc).__name__}: {exc}" to LoadStats.error_messages. Log at error. Continue with next record. | Source load() functions |
| Per-task | Catch in the runner's fetch/parse/load loops. Mark task errored in ingestion_tasks. Log at error with traceback. Run continues with remaining tasks. | runner._run_inner_body |
| Whole-source | Anything that escapes the loops (DB unreachable, HTTP client init failed, bug in discover()) is caught at the runner's outer try, the run is marked failed, and the exception is logged with traceback.format_exc(). | runner._run_locked |
Universal idiom for stringifying exceptions: err = f"{type(exc).__name__}: {exc}". Used everywhere identically — search the codebase, you'll find one shape. Don't reinvent.
Re-raise vs catch. Catch when you want the run to continue (per-record, per-task). Re-raise (or let propagate) when the failure means subsequent work is meaningless (YAML parse error in legislators-historical = no point trying the rest). The default is catch for record-level work, propagate for setup-level work.
Defensive "should not happen" checks. Allowed when they protect against upstream API drift (e.g., "if raw_paths.get(file_name) is None, log and skip — the upstream YAML mirror occasionally changes filenames"). Not allowed when they protect against your own code's invariants (e.g., asserting that rec.table == "crs_reports" when the parser only ever emits that table) — that's noise that violates Rule 2 (Simplicity First).
8. Logging
Section titled “8. Logging”structlogonly. Standardloggingis configured for level filtering, but every log call goes throughstructlog.get_logger(name).- Logger name = module path.
structlog.get_logger("ingester.sources.crs_reports.load"). Makes filtering by component trivial. - Event name as first positional, structured kwargs after.
log.error("crs.load_failed", report_id=rec.key, error=err)— neverlog.error(f"failed to load {rec.key}: {err}"). Structured fields make logs queryable iningestion_logsand any future log aggregator. - Event names are dotted, lowercase, past-tense.
fetch.completed,load_failed,wal_checkpoint_failed. Stable enough to alert on. - JSON to stderr in production. The CLI's
_configure_loggingsets upJSONRenderer. Don't override per-module. - Never log payload bodies at
infoor above. Field-level (e.g.,report_id,title) is fine; fullraw_jsonin a log line is not.
9. Tests
Section titled “9. Tests”Today the test suite is asymmetric — CRS has fixture-driven tests, legislators+committees has none. The convention going forward: every shipping source has at least three test categories before its spec can flip verified.
| Category | What it covers | Example |
|---|---|---|
| Parse fixtures | Real upstream payloads captured under josh-ingester/tests/fixtures/<source>/. Asserts the parser produces the expected ParsedRecord shape and child counts. | josh-ingester/tests/sources/test_crs_reports.py against fixtures/crs/R48481.json + .html |
| Defensive inputs | Malformed CSV row, missing required field, unexpected JSON shape. Asserts the parser logs a warning and continues — never crashes the generator. | test_csv_skips_malformed_rows in the same file |
| Loader smoke | An in-memory SQLite engine with the actual migrations applied. Run load(parsed_records, engine) and assert the row landed and counts are right. | To be added in tests/sources/test_<source>_load.py |
Conventions inside tests:
pytest+pytest-asyncioinautomode. No@pytest.mark.asynciodecorators clutter; markasyncio_mode = "auto"inpyproject.toml.- Fixtures live next to tests under
tests/fixtures/<source>/. Real upstream payloads, captured once and committed. Don't generate fixtures programmatically in conftest — the point is to catch upstream-shape regressions. - HTTP is stubbed via a small
_StubHttpshim like the one intest_crs_reports.py. Noresponsesorhttpx_mockdependencies — the shim is 15 lines. - Loader smoke tests use in-memory SQLite (
SUBSTRATE_DB_PATH=":memory:") withalembic upgrade headapplied at fixture setup. Each test gets a fresh DB. - Runner integration test (one-time, in
tests/test_runner.py) exercises a fake source through the full pipeline and asserts state-table writes happened. Not duplicated per source.
10. HTTP API (josh-core)
Section titled “10. HTTP API (josh-core)”josh-core is a placeholder today (two endpoints, no structure). When the first real endpoint lands, follow this layout:
josh-core/app/├── main.py # FastAPI app, lifespan, router includes├── deps.py # Depends() factories (engine, settings, etc.)├── settings.py # pydantic-settings BaseSettings├── routers/│ ├── crs.py # /v1/crs/* endpoints + their response models│ ├── legislators.py # /v1/legislators/*│ └── ...└── exceptions.py # custom exception classes + handlersRules:
- One router file per resource. Endpoints + Pydantic response models live together. Don't split response models into a separate
schemas/directory unless they're shared across routers. - DB engine via FastAPI lifespan (one engine per process, attached on startup, disposed on shutdown). Inject into endpoints via
Depends(get_engine). - Every endpoint declares
response_model=…. Auto-validates and shapes the OpenAPI spec. Without it, FastAPI returns whatever you return — silent contract drift. - Versioned routes.
/v1/...prefix from day one. Path-based versioning beats header negotiation for an OSS tool. - Errors:
HTTPExceptionfor HTTP-shaped problems (404, 400, 422). Custom exception classes for substrate-level failures, registered as exception handlers inmain.py. - Response models are hand-written, not derived. Per the schema rule, the migration is the source of truth. The response model is whatever the API contract requires — usually a subset, sometimes with computed fields.
Worked walkthrough of the async SQLAlchemy 2.0 + Pydantic v2 pattern if it's new to you.
11. The raw_* rule
Section titled “11. The raw_* rule”Several substrate tables have raw_json / raw_yaml columns that hold the full upstream payload (a CRS JSON blob, a legislator YAML record). These exist for replay and debugging — re-run the parser against the same input to validate a parser change.
They are internal-only. No serving endpoint returns them. No CLI command prints them. No log line includes their full contents. The Pydantic response models for the eventual API must not include raw_* fields — even if you have them in the row, the projection drops them.
Why this matters: legislator YAML carries district-office addresses and other contact info that's appropriate for an internal store but not for a low-friction public API. The convention is the wall.
12. Pre-PR checklist
Section titled “12. Pre-PR checklist”Before opening a PR (or, for an AI agent, before declaring the task verified):
-
uv run poe ci— green. This runs lint + format-check + typecheck + unit tests + contract tests + spec validation in one shot. The pre-push git hook enforces it locally; passing means the change is mergeable. - If a substrate or new-service change:
uv run poe smoke— green. End-to-end verification is opt-in (~30s after warm cache); it isn't part ofpoe ci. - If a spec is involved:
bin/build-spec.py --lintclean,bin/dry-run-spec.py <id> --explaingreen, changelog entry appended (the in-browser editor auto-prepends; if you edit YAML directly, append by hand). - If a data source's status changed:
docs/josh-data-sources.html+docs/data-status.htmlupdated in the same commit. - If a substrate-level pattern changed: this conventions doc updated in the same commit. Add a one-line entry to the convention changelog.
- No new convention introduced silently — if you deviated from a rule above, the deviation is commented and (if it generalizes) the convention is updated.
Convention changelog
Section titled “Convention changelog”This is a young doc. When a convention changes, add a one-line entry here so a reader can see the history without spelunking git blame.
- 2026-05-10 (latest) — Removed the legacy mypy override block; strict typing now applies repo-wide. The seven previously-relaxed modules (
ingester.*,app.main,josh_substrate.normalizers.*,josh_substrate.citations.*,josh_substrate.chunking.*,josh_substrate.corpus,josh_substrate.db,josh_substrate.protocols) were brought up tostrict = true. - 2026-05-10 (later) — Conventions-refactor pass landed. Switched type checker from pyright to mypy (Pydantic plugin maturity, typeshed alignment); tightened
[tool.ruff.lint]toE F I UP B SIM RUFacross the whole repo (no per-module grandfathering except a documented legacy override block in mypy); flippedSource.discoverandSource.parsefromasync deftodef-returning-AsyncIteratorto match what async generators actually produce; landedupsert/upsert_with_childrenhelpers injosh_substrate.dbas the canonical loader pattern; deleted speculative table-mirror Pydantic models (will reintroduce next to the FastAPI router that uses them, when it lands). Pre-PR checklist now centers onuv run poe ci. - 2026-05-10 — Initial draft. Codifies what CRS Reports + Legislators+Committees + the framework already do, with deliberate decisions on schema duplication (migrations win, no row models for ingest), upsert helper (substrate-owned), and HTTP layout (router-per-resource, hand-written response models).