substrateverifiedp0

Refactor shipping code to match the conventions doc

conventions-refactor · updated 2026-05-10T13:00:00Z · owner rritz

Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.

The conventions doc (https://docs.usejosh.com/operations/conventions/) was authored from
the patterns the two shipping ingesters already mostly follow. CI is now
landed (see ci-foundation) — ruff + mypy + pytest + poe all run via
uv run poe ci, but the legacy modules are *grandfathered* out of the
strict checks via documented exclude lists. The grandfathering buys
surgical-changes safety while the embedding work lands; this spec closes
the gaps so the conventions doc actually applies to every file in the
repo, not just the new code.

Specifically: ruff check and ruff format --check skip
shared/josh_substrate/src/josh_substrate/{citations,normalizers,models,
chunking,protocols.py,db.py,corpus.py,migrations/env.py}
,
josh-ingester/ingester, josh-ingester/tests,
josh-core/app/{main.py,__init__.py}, bin/build-spec.py,
bin/sync-nav.py, plus the corresponding test directories. Strict
mypy applies only to the embedding cohort (josh_substrate.embedding.*,
josh_embedder.*, app.embedding, app.routes.{embed,health_embedding})
via [[tool.mypy.overrides]]. Everything else is checked at base
strictness (with runner.py excluded entirely due to a known
async-generator typing quirk).

Closing the legacy-cleanup gap also unblocks the existing convention
goals: helper consolidation (upsert_with_children), test coverage
parity for legislators_committees, and removal of speculative
Pydantic models. Doing all this together lets the spec flip to
verified in one sweep.

As an outside contributor adding a new source, I want the existing two ingesters to demonstrate one consistent pattern so that my source can be a copy of an existing one with no ambiguity about which idiom to follow.

As a maintainer, I want ruff/pyright/tests gating every PR so that drift from the conventions doc shows up at PR time, not at code-review-by-eyeball time.

  1. When `ruff format .` runs from the repo root, it shall produce no diff (legacy `extend-exclude` paths in `pyproject.toml` shall have been removed).
  2. When `ruff check .` runs from the repo root, it shall exit 0 with zero findings against the FULL repo (legitimate exceptions live in `[tool.ruff.lint.per-file-ignores]`); the global `extend-exclude` shall list only build artifacts and vendor.
  3. When `mypy` runs from the repo root, it shall exit 0 against every workspace member's source tree at the strict ruleset; `[[tool.mypy.overrides]]` shall no longer carve out a separate cohort for new vs. legacy code.
  4. When `mypy` runs, the `runner.py` file shall be type-clean without the current exclude — the protocol-level async-generator quirk shall be resolved (likely by changing `Source.discover` from `async def` to `def`-returning-AsyncIterator, or via a strategically placed cast).
  5. When `pytest` runs from the repo root, every shipping source shall have at least three test categories: parse fixtures, defensive inputs, loader smoke.
  6. When a loader needs to upsert a parent + children, it shall call `josh_substrate.db.upsert_with_children` rather than building INSERT/UPDATE SQL inline.
  7. Where `shared/josh_substrate/models/` contains Pydantic classes that mirror DB tables ahead of need (e.g., `CrsReport` when no josh-core endpoint serves it), they shall be deleted (and re-introduced next to the serving code when that code lands).
  8. When `uv run poe ci` runs after the cleanup, every check shall remain green (regression bar — the cleanup must not break the already-shipped embedding work).
kindbash

Command

set -euo pipefail
# 1. The full poe ci pipeline must pass on the cleaned-up repo.
uv run poe ci
# 2. ruff [tool.ruff] extend-exclude must contain only build-artifact
#    paths — no per-module grandfathering of legacy code.
uv run python -c '
import sys, tomllib
with open("pyproject.toml", "rb") as f:
    cfg = tomllib.load(f)
excludes = cfg.get("tool", {}).get("ruff", {}).get("extend-exclude", [])
allowed = {"vendor", "**/migrations/versions", "**/__pycache__", ".venv"}
bad = [e for e in excludes if e not in allowed]
if bad:
    print(f"ERROR: legacy ruff excludes still present: {bad}", file=sys.stderr)
    sys.exit(1)
'
# 3. The mypy strict-cohort override block must be gone — strict
#    typing should now be the default, with relaxations granted
#    only to truly-legacy modules (not the embedding cohort).
if grep -A 3 'tool.mypy.overrides' pyproject.toml | grep -E 'embedding|josh_embedder'; then
  echo "ERROR: per-module strict-mypy overrides still present — strict should be the default" >&2
  exit 1
fi
# 4. No inline INSERT…ON CONFLICT in source loaders.
if grep -RE 'INSERT.*ON CONFLICT' josh-ingester/ingester/sources/ | grep -v '_helpers'; then
  echo "ERROR: inline upsert SQL — must use josh_substrate.db.upsert_with_children" >&2
  exit 1
fi
# 5. Speculative table-mirroring Pydantic models removed.
if [ -f shared/josh_substrate/src/josh_substrate/models/crs.py ] || \
   [ -f shared/josh_substrate/src/josh_substrate/models/legislators.py ]; then
  echo "ERROR: shared/josh_substrate/models/ still has table-mirror Pydantic" >&2
  exit 1
fi
echo OK

Expect

OK

Five mechanical checks: poe ci is green AND the legacy carve-outs (ruff extend-exclude legacy paths, mypy embedding-cohort override) are gone AND the helper consolidation has happened AND speculative models are removed. Pass = the conventions doc applies to every file, not just the new ones.

None.

  • Wiring CI itself (tracked separately by the user-built CI scaffold).
  • Per-source deep refactors beyond the loader path (parse/discover/fetch are already consistent).
  • Adding `josh-core` HTTP endpoints (only structural prep — actual endpoints are tracked by `rest-api-resource-endpoints` and `rest-api-search`).
  • Migrating to a different test runner or build system.

State as of 2026-05-10. CI is in place via ci-foundation. The
embedding cohort is held to strict mypy + the full ruff rule set;
legacy modules are excluded. The cleanup work is now to drag the
legacy modules into the same standard.

Order of operations.

1. Ruff format pass. Drop the legacy extend-exclude entries
in pyproject.toml for shared/josh_substrate/{citations,
normalizers,models,chunking,protocols.py,db.py,corpus.py}
,
josh-ingester/ingester, josh-ingester/tests,
josh-core/app/{main.py,__init__.py}, bin/build-spec.py,
bin/sync-nav.py, and the legacy test dirs. Run uv run poe
format
and commit as one mechanical "ruff format baseline" commit.
Expect ~50 files reformatted (vertical alignment in db.py's
PRAGMA list, ternary collapses in runner.py, etc.).
2. Ruff lint pass. Bring the rule set up to the strict default
(E, F, I, UP, B, SIM, RUF). uv run poe format
auto-fixes most. The known holdouts are the 17 errors enumerated
during the embedding work: 3 B904 exception-chaining issues,
1 B023 loop-variable bug in bin/sync-nav.py, 4 SIM105
contextlib.suppress opportunities, 2 SIM103, 2 RUF012
mutable-default issues, 1 RUF005, 1 RUF002 (unicode
ambiguity in a docstring — likely false positive), 1 F841
dead variable in josh-ingester/ingester/runner.py:163. Fix
each one; document any remaining as noqa with a justifying
comment.
3. Mypy strict pass. Remove the per-module override block in
pyproject.toml and the runner.py exclude. ~70 missing-
annotation errors will surface across josh-core/app/main.py
(3) and josh-ingester/ingester/cli.py (~25). Two real bugs
hiding: Source.discover's async def returning AsyncIterator
produces a Coroutine wrapper that mypy correctly flags
(runner.py:336, :352). Fix the protocol — drop async from
discover since the implementations are async generators (yield
in body) which makes async def a typing footgun.
4. Helper consolidation. Write josh_substrate.db.upsert(conn,
table, columns, payload)
and upsert_with_children(conn,
parent_table, parent_columns, parent_payload, children,
child_fk_column)
. Unit-test against in-memory SQLite. Document
in the docstring.
5. Loader refactors. crs_reports/load.py → call helper. Verify
CRS tests pass unchanged. legislators_committees/load.py
same. Five inline upsert helpers in legislators collapse.
6. Test backfill. Add legislators_committees tests matching
the three-category contract (parse fixtures, defensive inputs,
loader smoke).
7. Speculative-Pydantic cleanup. Decide fate of
shared/josh_substrate/models/{crs,legislators}.py.
Recommendation: delete. Reintroduce as FastAPI response models
next to the josh-core router when that endpoint lands.
8. Verify regression-free. uv run poe ci plus uv run poe
smoke
plus uv run poe test-contract all green. Embedding work
must keep working — the cleanup is plumbing, not behavior.

Risk: mypy with the protocol fix surfaces issues across every
ingester source module.
Likely outcome: each source module's
discover signature changes from async def to def. That's a
one-line patch per source. Verify CRS + legislators tests pass
after the change.

11 of 11 done.

  • t1 [ci-foundation] Configure ruff + mypy + pytest + poe via root pyproject.toml
  • t2 Drop legacy ruff `extend-exclude` paths and run `uv run poe format` as a one-shot baseline commit
  • t3 Restore the strict ruff rule set (B, SIM, RUF, UP) and resolve all 17 enumerated holdouts
  • t4 Drop the per-module strict-mypy override block; resolve the ~70 missing-annotation errors that surface
  • t5 Fix `Source.discover` async-generator typing quirk (drop `async def` from the protocol) so `runner.py` no longer needs the mypy exclude
  • t6 Write `josh_substrate.db.upsert` and `upsert_with_children` helpers with unit tests
  • t7 Refactor `crs_reports/load.py` to use the helper; verify CRS tests pass
  • t8 Refactor `legislators_committees/load.py` to use the helper
  • t9 Add tests for `legislators_committees` (parse fixtures, defensive inputs, loader smoke)
  • t10 Delete or relocate speculative Pydantic models in `shared/josh_substrate/models/`
  • t11 Run the success determiner; flip to `verified` once green
  • 2026-05-10T13:00:00Z plannedverified All six follow-up commits landed (ruff format baseline, strict rule set + 17 holdouts, mypy override drop + Source.discover fix, upsert helpers + loader refactors, legislators_committees tests, speculative-models removal). Determiner runs green: `uv run poe ci` passes, no legacy ruff excludes remain, no embedding-cohort mypy override remains, no inline upsert SQL in source loaders, no speculative Pydantic models. 90 unit + 12 contract tests pass.
  • 2026-05-10T12:30:00Z draftplanned Updated to reflect ci-foundation now landed; reframed scope around removing the legacy grandfathering rather than building CI from scratch. Added explicit tasks for the 17 ruff holdouts and the Source.discover protocol fix that the embedding work surfaced.

docs/spec/conventions-refactor.html · generated by bin/build-spec.py