substrateshippedp0

Ingester modularity pass

ingester-modularity-pass · updated 2026-05-09T17:00:00Z

Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.

Before more sources land (Federal Register is next), tighten the ingester's
modularity so source work and CLI work can proceed in parallel without
colliding. Three coordinated moves: subpackage the source modules
(currently single 20–50KB files), add an optional per-source CLI extension
point to the Source protocol (so source-specific commands like the
existing fix-crs-types live with their source instead of in the global
CLI), and reframe the add-a-new-source runbook to make explicit which
steps touch source code vs. exercise the existing generic surface.

Net effect: a Federal Register PR (or any future new-source PR) only ever
touches files under that source's subpackage, its migration, its citation
formatter, and its tests — never the global CLI, runner, or scheduler.

As an agent (or human) adding a new source, I want every file I create or edit to live in or near that source's subpackage so that my work doesn't collide with parallel work on the CLI or other sources.

As an agent (or human) polishing the CLI surface, I want to know that the global CLI never contains source-specific code so that I can refactor or improve CLI UX without touching ingester internals.

As an OSS reader scanning the repo at launch, I want each source to look like a coherent package, not a wall of code so that the architecture is legible and per-source contributions are easy to scope.

  1. When `josh-ingester/ingester/sources/` is listed, every concrete source shall be a subpackage (a directory containing `__init__.py`) and no source shall be a single `.py` file.
  2. When the global CLI module `josh-ingester/ingester/cli.py` is read, no command body shall import from `ingester.sources.<source-name>` or branch on a hard-coded source name (the registry lookup via `discover_sources()` / `get_source(name)` is the only sanctioned path).
  3. When a source subpackage exposes a module-level `cli: typer.Typer` attribute, the system shall mount that app under `josh-ingester <source-name>` automatically at startup, with no manual registration required in `cli.py`.
  4. When `josh-ingester crs-reports fix-types --help` is invoked, the system shall print help text for the source-local replacement of the previous global `fix-crs-types` command.
  5. When the `discover_sources()` registry is enumerated after the refactor, the system shall return the same set of source names as before (CRS Reports + Legislators+Committees) with no behavior change to `discover/fetch/parse/load`.
  6. Where the `add-a-new-source.html` runbook describes Step 8 (source module), the prose shall reference the subpackage layout (`sources/<name>/{__init__,discover,fetch,parse,load}.py`) rather than the single-file form.
  7. Where the `add-a-new-source.html` runbook introduces the 14-step flow, the prose shall make explicit which steps modify code vs. invoke the existing generic CLI/runner with no CLI changes required.
kindbash

Command

set -euo pipefail
cd "$(git rev-parse --show-toplevel)"

# 1. Every source under sources/ is a directory (subpackage), not a .py file.
cd josh-ingester/ingester/sources
for entry in *; do
  case "$entry" in
    _*|__pycache__|*.py) ;;
    *) continue ;;
  esac
  if [ -f "$entry" ] && [ "${entry##*.}" = "py" ] && [ "${entry#_}" = "$entry" ]; then
    echo "FAIL: $entry is a single-file source; should be a subpackage"; exit 1
  fi
done
test -f crs_reports/__init__.py
test -f legislators_committees/__init__.py
cd - >/dev/null

# 2. Global CLI does not import from a concrete source subpackage. The
#    sanctioned path is `ingester.sources._registry` (underscore prefix);
#    the regex requires the first char of the imported name to be a
#    letter, so the registry import doesn't match.
if grep -E 'from ingester\.sources\.[a-z][a-z_]* import' josh-ingester/ingester/cli.py; then
  echo "FAIL: global CLI imports from a concrete source subpackage"; exit 1
fi

# 3. The Source protocol carries the optional cli hook.
grep -E '^\s*cli\s*:' shared/josh_substrate/src/josh_substrate/protocols.py

# 4. The per-source subcommand is reachable. The auto-mount loop derives
#    the namespace from the Source's registry name (underscores → hyphens),
#    so `crs_reports` mounts under `crs-reports`. We don't special-case to
#    rename the source because `Source.name` is the state-table value too.
josh-ingester crs-reports fix-types --help >/dev/null

# 5. Registry discovery still works and returns both sources.
python -c "from ingester.sources._registry import discover_sources; \
  names = sorted(discover_sources().keys()); \
  assert 'crs_reports' in names and 'legislators_committees' in names, names; \
  print('OK', names)"

Expect

Exit 0; final line begins with 'OK' and lists both source names.

Run from a dev env with `pip install -e josh-ingester` and `pip install -e shared/josh_substrate` already done. The grep for the `cli` attribute on the protocol is loose on purpose — it just checks the field is declared, not its full type signature.

  • Keep a one-release deprecation alias for the old `josh-ingester fix-crs-types` command, or hard-cut to the new `josh-ingester crs-reports fix-types`? (Default: hard-cut — pre-launch, no external users yet.)
  • Subpackage file split — do we want all four stages (`discover.py`, `fetch.py`, `parse.py`, `load.py`) split out always, or let small sources keep everything in `__init__.py` and only split when a stage exceeds ~200 lines? (Default: always split, for uniformity.)
  • Per-source plugin packaging via setuptools entry points (overkill for v1; revisit if sources ever become independently distributable).
  • Declarative per-source CLI option schemas (nice-to-have; ship only if it falls out of the protocol extension naturally, otherwise defer).
  • Renaming the `josh-ingester` console script or restructuring `josh-core` (out of scope for this pass — purely about the ingester layout).

None.

Three coordinated moves. Land in this order; FR work waits on (a) at minimum.

## (a) Subpackage the source modules

Each concrete source becomes a directory under
josh-ingester/ingester/sources/<name>/:

- __init__.py — instantiates source = <Name>Source() (the
registry-consumed instance) plus the optional cli Typer app
- discover.py — async generator of FetchTasks
- fetch.py — payloads to disk
- parse.py — bytes → ParsedRecord (this is where most source-specific
cleanup lives, per the runbook's Step 8 note)
- load.py — transactional upsert

CRS Reports (crs_reports.py, ~20KB) and Legislators+Committees
(legislators_committees.py, ~50KB) are the two existing sources.
Migrate both as worked examples; the 50KB Legislators file gets the
biggest legibility win.

Keep the public registry contract (module-level source instance
exposed from the subpackage) identical so discover_sources() keeps
working unchanged. Tests should pass with no edits beyond import-path
updates inside the source's own test file.

## (b) Per-source CLI extension point

Extend the Source protocol in
shared/josh_substrate/src/josh_substrate/protocols.py with one
optional attribute:

``python
class Source(Protocol):
...
cli: typer.Typer | None # optional; if set, mounted as <source-name> subcommand
``

In josh-ingester/ingester/cli.py, after the existing
discover_sources() is wired (or at app startup), add:

``python
for src in discover_sources().values():
sub = getattr(src, "cli", None)
if sub is not None:
app.add_typer(sub, name=src.name.replace("_", "-"))
``

Move the body of fix-crs-types from cli.py into
josh-ingester/ingester/sources/crs_reports/cli.py, expose as
cli = typer.Typer() with @cli.command("fix-types"), and re-export
via crs_reports/__init__.py. Result: josh-ingester crs-reports
fix-types
becomes the canonical form (the auto-mount loop derives the
namespace from Source.name with underscores → hyphens, and we don't
special-case the mount name since Source.name is also the state-table
key). The old global fix-crs-types command is removed (see
clarifications_needed for the alias question).

## (c) Reframe the add-a-new-source runbook

In https://docs.usejosh.com/operations/add-a-new-source/:

- Add a "What changes vs. what doesn't" callout near the top:
- Per source, you change: spec YAML, migration, Pydantic models,
citation formatter, normalizer (if new format), source subpackage,
tests, two inventory pages.
- You do not change: global CLI, runner, scheduler, state-table
writers, substrate machinery.
- Steps 9–12 *exercise* the new source via the existing generic CLI
and runner — no CLI code edits required.
- Update Step 8 ("Write the source module") to reference the subpackage
layout and the optional cli.py extension point.
- Update the "Tools you have available" table to list the per-source
cli extension as a sanctioned tool.

## Sequencing for parallel work

- (a) and (b) can be done by separate agents in parallel; they touch
disjoint files (sources/ vs. cli.py + protocols.py).
- (c) follows naturally once (a) is in — same PR is fine.
- Federal Register's spec gains this as a dependency; FR work
starts after (a) at minimum so FR is born into the new layout.

13 of 13 done.

  • t1 Subpackage CRS Reports: crs_reports.py → crs_reports/{__init__,discover,fetch,parse,load}.py with no behavior change
  • t2 Subpackage Legislators+Committees similarly (the 50KB file is the biggest win)
  • t3 Run josh-ingester test suite — must pass with no edits beyond import paths inside source-local tests
  • t4 Verify discover_sources() returns both source names unchanged
  • t5 Add optional `cli: typer.Typer | None` attribute to Source protocol in shared/.../protocols.py
  • t6 Wire CLI auto-mount loop in josh-ingester/ingester/cli.py (one for-loop after discover_sources)
  • t7 Move fix-crs-types into crs_reports/cli.py; expose via crs_reports.cli; remove from global CLI
  • t8 Grep global cli.py: zero `from ingester.sources.<name>` imports remain
  • t9 Update add-a-new-source.html Step 8 (source module) to reflect subpackage layout
  • t10 Add 'What changes vs. what doesn't' framing callout to add-a-new-source.html intro
  • t11 Update 'Tools you have available' table to list the per-source cli extension point
  • t12 Run full success_determiner against a clean dev env; all 5 checks green
  • t13 Confirm federal-register-ingester spec lists this as a dependency before FR work begins
  • 2026-05-09T17:00:00Z verifiedshipped Committed to main.
  • 2026-05-09T16:15:00Z in_progressverified success_determiner ran green (5/5 checks). Pytest also clean: josh-ingester 13/13, josh_substrate 20/20 (33/33 total). Both source subpackages discover correctly via the registry; `josh-ingester crs-reports fix-types --help` resolves through the auto-mount loop; the `Source` protocol now declares `cli: Any | None = None` so existing sources without a CLI satisfy the protocol via the default. Runbook reframed (Step 8 around the subpackage layout, "What changes vs. what doesn't" callout near the top, per-source `cli` row in the tools table). Federal Register can now build into the new layout from t0.
  • 2026-05-09T15:45:00Z in_progressin_progress Spec edit (no status change): success_determiner check #2 regex tightened from `[a-z_]+` to `[a-z][a-z_]*` so the sanctioned `from ingester.sources._registry import …` line doesn't trip the "no concrete-source imports" check. The intent of the check was always "no concrete source modules"; the original regex was too loose against an underscore-prefixed registry module that's the framework's entry point.
  • 2026-05-09T15:30:00Z in_progressin_progress Spec edit (no status change): success_determiner check #4 and acceptance criterion #4 corrected from `josh-ingester crs fix-types` to `josh-ingester crs-reports fix-types`. Reason: the auto-mount loop derives the subcommand namespace from `Source.name` with underscores → hyphens, and we don't special-case the mount name because `Source.name` is also the state-table key (renaming it would be a data-correctness change, not a refactor). `plan` text and clarifications_needed updated to match.
  • 2026-05-09T15:00:00Z plannedin_progress Starting the three-part pass (subpackage sources, per-source CLI extension, runbook reframe) so Federal Register can land into the new layout.

docs/spec/ingester-modularity-pass.html · generated by bin/build-spec.py