
Repo structure

How the Josh monorepo is organized, why it's organized this way, and what to do when adding a new service.

This is the canonical reference for layout decisions. When CLAUDE.md or other docs disagree with this document, this document wins for repo-organization concerns.

Storage backend: SQLite + FTS5 + sqlite-vec. See "Storage backend decision" in CLAUDE.md. The substrate is a single file at /data/josh.db shared between containers via a host bind mount — there is no josh-postgres accessory.

josh/
├── CLAUDE.md
├── .dockerignore                  # project-root, controls build-context uploads
├── .gitignore
├── .ruby-version                  # pins ruby for kamal
├── .kamal/                        # shared kamal config across services
│   ├── secrets                    # plaintext, gitignored
│   ├── secrets.example            # committed template
│   └── secrets.age                # encrypted, committed (when populated)
├── shared/
│   └── josh_substrate/            # shared Python package (see below)
│       ├── pyproject.toml
│       ├── README.md
│       ├── alembic.ini
│       └── src/josh_substrate/
│           ├── __init__.py
│           ├── migrations/        # all Alembic migrations
│           ├── models/            # (future) Pydantic models per table
│           ├── ids.py             # (future) canonical ID helpers
│           └── db.py              # (future) connection helpers (PRAGMAs, sqlite-vec load)
├── josh-core/                     # FastAPI + LLM harness (currently placeholder)
│   ├── Dockerfile                 # build context = project root
│   ├── README.md
│   ├── pyproject.toml             # service-specific deps (alembic etc. via shared/)
│   ├── app/
│   │   └── main.py
│   ├── config/deploy.yml          # kamal config (no accessory; volume mount /data)
│   └── .kamal -> ../.kamal        # symlink so kamal finds shared secrets
├── josh-ingester/                 # (to be rebuilt) ETL workers, same shape as josh-core
├── josh-web/                      # (later) Next.js, consumes /openapi.json
├── docs/
│   ├── josh-data-sources.md       # canonical inventory of v1 / v2 data sources
│   ├── sources/                   # per-data-source ingestion specs (probe findings, indicative schema)
│   └── operations/                # operational runbooks (this directory)
└── design/                        # UI / brand design system

The previous iteration had a josh-postgres/ directory with extension init scripts; with the SQLite swap that directory is being removed. The substrate file lives on the host volume at /data/josh.db, mounted into each service container that needs it.

Both josh-core (FastAPI + harness) and josh-ingester (ETL workers) need:

  • Identical Pydantic models for the substrate tables
  • The same Alembic migrations (one schema, one source of truth)
  • The same DB connection helpers
  • The same canonical-ID format helpers

We could duplicate these models across services and hand-sync. We don't, because:

  • Drift is inevitable when two copies exist
  • Refactors require changing N places instead of one
  • The shape is genuinely shared, not coincidentally aligned

So: one Python package at shared/josh_substrate/, installable via pip install. Each service's Dockerfile copies it in and pip installs it before installing service-specific deps.

For the frontend (josh-web, when it exists), the chain is different because it is TypeScript, not Python. The approach is to generate the types with openapi-typescript:

openapi-typescript http://localhost:8000/openapi.json -o src/types/api.ts

FastAPI auto-generates /openapi.json from the Pydantic models the API uses, so the same models drive backend types directly and frontend types via OpenAPI.

The single root Dockerfile uses the project root as Docker build context so it can reach across service boundaries:

# config/deploy.yml
builder:
  context: .                 # project root
  dockerfile: ./Dockerfile   # project-root Dockerfile

This lets the Dockerfile COPY shared/josh_substrate and each service's source tree in turn. The Dockerfile installs the substrate plus the three service packages (josh-core, josh-ingester, josh-embedder) into a single image; roles in config/deploy.yml pick which CMD to run.

The cost: a project-root .dockerignore is necessary to keep build-context uploads small. Things like docs/, design/, and .claude/ are excluded.

The substrate ships as one Kamal service (josh) with multiple roles defined in config/deploy.yml:

kamal setup # bring up every active role
kamal deploy # deploy all active roles
kamal deploy --roles=web # surgical: web only
kamal deploy --roles=ingester # surgical: ingester only

Roles share the same image but use different CMDs (uvicorn josh_core.main:app … for web; josh-ingester daemon for ingester; josh-embedder daemon for embedder when activated). The web role is fronted by kamal-proxy on 80/443; the others set proxy: false.

The substrate SQLite file at /data/josh.db lives on a host directory bind-mounted at /data into every role. There's no accessory and no Docker network DB target — the file is the boundary, the volume mount is the wire. The root config/deploy.yml declares volumes: ["/data:/data"] globally, inherited by every role. See the substrate-single-image-deploy spec for the trade-offs (image bloat, activation path for deferred roles, etc.).

Kamal runs from the project root: config/deploy.yml, .kamal/secrets, and .kamal/hooks/ are all there. One service, one config, one secrets file shared across every role. No symlinks needed because kamal's cwd is the same place every time.

Project-root .dockerignore (the active one, controls build context uploads to the remote builder):

  • Excludes .git/, node_modules/, __pycache__/, docs/, design/, .claude/, secrets
  • Keeps the SSH-uploaded build context small and free of secrets

We don't have per-service .dockerignore files anymore. They were obsolete once the build context moved to project root — Docker only reads .dockerignore from the build-context root.

The root Dockerfile installs shared/josh_substrate first (with the [all-providers] extra, which is the union of every role's substrate-side deps), then pip-installs each service package into the same image:

FROM python:3.12-slim
WORKDIR /app
# Layer 1 — substrate (heavy: torch, sentence-transformers, alembic, …).
# Sits at the bottom of the cache stack because it changes rarely.
COPY shared/josh_substrate /tmp/josh_substrate
RUN pip install --no-cache-dir '/tmp/josh_substrate[all-providers]'
# Alembic config — pre-deploy hook runs `alembic upgrade head` from /app
# against the web role's container. script_location points into the
# installed josh_substrate package.
COPY shared/josh_substrate/alembic.ini /app/alembic.ini
# Layer 2 — josh-core (FastAPI). `josh_core/` is the top-level wheel package.
COPY josh-core/pyproject.toml /tmp/josh-core/pyproject.toml
COPY josh-core/josh_core /tmp/josh-core/josh_core
RUN pip install --no-cache-dir /tmp/josh-core
# Layer 3 — josh-ingester. Installs `josh-ingester` console script.
COPY josh-ingester/pyproject.toml /tmp/josh-ingester/pyproject.toml
COPY josh-ingester/josh_ingester /tmp/josh-ingester/josh_ingester
RUN pip install --no-cache-dir /tmp/josh-ingester
# Layer 4 — josh-embedder. Installs `josh-embedder` console script. Built
# into the image even when the role is commented out in deploy.yml —
# activation is config-only.
COPY josh-embedder/pyproject.toml /tmp/josh-embedder/pyproject.toml
COPY josh-embedder/src /tmp/josh-embedder/src
RUN pip install --no-cache-dir /tmp/josh-embedder
# Default CMD = web role; ingester/embedder override in config/deploy.yml.
CMD ["uvicorn", "josh_core.main:app", "--host", "0.0.0.0", "--port", "8000"]

Layers are ordered for cache reuse — substrate at the bottom (changes rarely), then service packages.

The sqlite-vec PyPI package bundles the compiled extension binary. Connection helpers in josh_substrate.db load it via conn.load_extension(...) after enabling extension loading on the aiosqlite connection. If extension loading needs to happen on a connection type that aiosqlite doesn't expose, fall back to wrapping apsw connections instead — same SQL, different driver.
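
A connection helper along these lines could live in josh_substrate.db. This is a minimal sketch, not the project's actual code. It uses the standard-library sqlite3 module instead of aiosqlite so it runs without extra dependencies. The function name open_substrate and the specific PRAGMAs (WAL, foreign keys, busy timeout) are assumptions, and the path to the compiled sqlite-vec binary is passed in by the caller rather than looked up. SUBSTRATE_DB_PATH and the /data/josh.db default are the values this document uses for the deployed services.

```python
import os
import sqlite3
from typing import Optional

DEFAULT_DB_PATH = "/data/josh.db"  # bind-mounted substrate file


def open_substrate(path: Optional[str] = None,
                   extension_path: Optional[str] = None) -> sqlite3.Connection:
    """Open the substrate file, apply PRAGMAs, optionally load an extension."""
    db_path = path or os.environ.get("SUBSTRATE_DB_PATH", DEFAULT_DB_PATH)
    conn = sqlite3.connect(db_path)

    # Assumed PRAGMAs: the document only says the helper sets "PRAGMAs".
    conn.execute("PRAGMA journal_mode=WAL")    # readers do not block the writer
    conn.execute("PRAGMA foreign_keys=ON")     # off by default in SQLite
    conn.execute("PRAGMA busy_timeout=5000")   # wait up to 5 s on a locked file

    if extension_path is not None:
        # Requires a Python build with loadable-extension support.
        conn.enable_load_extension(True)
        try:
            conn.load_extension(extension_path)
        finally:
            # Switch loading back off so later SQL cannot load libraries.
            conn.enable_load_extension(False)
    return conn


# Demo against an in-memory database, with no extension loaded.
demo = open_substrate(":memory:")
print(demo.execute("PRAGMA foreign_keys").fetchone()[0])
```

The same three steps (enable loading, load, disable loading) apply on an aiosqlite connection; only the calls become awaitable.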

We don't ship a docker-compose.yml. Local dev runs against the deployed substrate (SSH tunnel into josh-core's /health, query the SQLite file remotely, etc.) or by running josh-core and josh-ingester directly via python -m against a local SQLite file. Provisioning a fresh test/staging server is fast — see new-host-setup.html.

When OSS Foundation packaging ships at Step 1 launch, we'll add an OSS-friendly path (single docker run, prebuilt image, or a fresh compose file) sized for self-hosters.

When we add a service (next up: josh-ingester):

  1. Create the directory: mkdir -p josh-ingester/{ingester,config}
  2. Add the Dockerfile following the shared-package consumption pattern above.
  3. Add josh-ingester/pyproject.toml for service-specific deps (e.g., httpx, lxml, apscheduler, typer, tenacity, structlog) and register it under the workspace's [tool.uv.workspace] members in the root pyproject.toml.
  4. Add josh-ingester/config/deploy.yml:
    • service: josh-ingester
    • image: yourorg/josh-ingester
    • Same builder shape as josh-core (context: ../, dockerfile: ./Dockerfile)
    • No accessories. The substrate SQLite file is shared via the host volume mount.
    • Volume mount the /data host directory: volumes: ["/data:/data"]. Same as josh-core.
    • DB target env: SUBSTRATE_DB_PATH=/data/josh.db.
    • For non-web (worker) services, add primary_role: <your_role> at the top level (Kamal v2 defaults to expecting a web role; without this, deploy fails with primary_role web isn't defined).
    • For non-web services, opt out of kamal-proxy registration by setting proxy: false at the role level (top-level proxy: false is invalid — Kamal v2 expects proxy: to be a hash, not a boolean):
      servers:
        workers:
          hosts: [your.host.example]
          proxy: false # workers have no HTTP, so skip proxy registration
  5. Dockerfile shape — workers should idle, not run a one-shot command:
    • For HTTP services (josh-core): CMD ["uvicorn", "josh_core.main:app", ...] — long-running server.
    • For workers (josh-ingester): CMD ["sleep", "infinity"] — container idles; jobs run via kamal app exec --reuse 'python -m josh_ingester ...'. Don't use ENTRYPOINT for workers; Kamal's per-role cmd: override is appended to the entrypoint, so ENTRYPOINT ["python","-m","josh_ingester"] + cmd: tail -f /dev/null becomes python -m josh_ingester tail -f /dev/null and the container exits immediately.
  6. Add a role to config/deploy.yml at the project root — a new servers.<rolename> block with hosts, cmd, and proxy: false if it's a worker.
  7. Extend the root Dockerfile to COPY + pip install the new service package alongside the existing ones.
  8. Update .dockerignore if anything in the new service needs excluding.
  9. Deploy: env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal deploy --roles=<rolename> from project root.
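
The ENTRYPOINT pitfall in step 5 follows from how Docker assembles the final command line: with exec-form instructions, the command (the image's CMD, or whatever overrides it at run time) is appended to ENTRYPOINT as arguments. The illustration below models only that rule; effective_argv is a hypothetical helper written for this example and is not part of Docker or Kamal.

```python
from typing import List, Optional, Sequence


def effective_argv(entrypoint: Optional[Sequence[str]],
                   cmd: Optional[Sequence[str]]) -> List[str]:
    """Return the argv a container runs: ENTRYPOINT followed by the command."""
    return [*(entrypoint or []), *(cmd or [])]


# Worker image with an ENTRYPOINT, plus a role-level cmd override:
# the override becomes arguments to the entrypoint.
broken = effective_argv(["python", "-m", "josh_ingester"],
                        ["tail", "-f", "/dev/null"])

# Worker image with no ENTRYPOINT: the command runs as written.
idle = effective_argv(None, ["sleep", "infinity"])

print(" ".join(broken))  # python -m josh_ingester tail -f /dev/null
print(" ".join(idle))    # sleep infinity
```

With no ENTRYPOINT in the image, a role's cmd: replaces the default CMD outright, which is why the worker guidance above sticks to CMD.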

kamal app exec --reuse 'python -m josh_ingester backfill <source>' works for short jobs. For long jobs (the CRS full backfill takes about 6 hours), client-side timeouts can kill the local Kamal client process, taking the SSH session with it. The right pattern is to detach via docker exec -d so the work is decoupled from any SSH session:

CONTAINER=$(ssh josh 'docker ps --filter label=service=josh --filter label=role=ingester --format "{{.Names}}"')
ssh josh "docker exec -d $CONTAINER sh -c 'josh-ingester backfill <source> > /data/<source>-backfill.log 2>&1'"
# Monitor progress periodically — the runner writes incremental counters
ssh josh "docker exec $CONTAINER josh-ingester status --source=<source> --limit=1"

The container's CMD is sleep infinity, so it stays alive after the detached job finishes. The log lands on /data/ (the bind-mounted volume) so it's visible from the host.
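
The detach command nests three layers of quoting (local shell, remote shell, and sh -c inside the container), which is easy to get wrong when a source name or log path contains unusual characters. If this ever moves into a script, one option is to build the command with shlex so each layer is quoted mechanically. The sketch below is illustrative only: detached_backfill_command is a hypothetical helper, and the host alias, container name, and source used in the demo are placeholders.

```python
import shlex
from typing import List


def detached_backfill_command(host: str, container: str, source: str,
                              log_dir: str = "/data") -> List[str]:
    """Build the local argv that starts a detached backfill over SSH."""
    log_path = f"{log_dir}/{source}-backfill.log"
    # Innermost layer: the shell line run by `sh -c` inside the container.
    inner = (f"josh-ingester backfill {shlex.quote(source)} "
             f"> {shlex.quote(log_path)} 2>&1")
    # Middle layer: the command line the remote shell parses.
    remote = shlex.join(["docker", "exec", "-d", container, "sh", "-c", inner])
    # Outer layer: an argv list, so the local shell never re-parses it.
    return ["ssh", host, remote]


# Placeholder values for illustration.
argv = detached_backfill_command("josh", "josh-ingester-abc123", "crs")
print(argv[2])
```

The returned list can be passed to subprocess.run without shell=True, which keeps the local shell out of the quoting chain.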

When we add a new substrate concept (a new table, a new ID format, a new DB helper):

  1. Schema work: add the migration in shared/josh_substrate/src/josh_substrate/migrations/versions/.
  2. Models work: add Pydantic models in shared/josh_substrate/src/josh_substrate/models/.
  3. Helpers work: add to shared/josh_substrate/src/josh_substrate/db.py or ids.py.
  4. Bump version in shared/josh_substrate/pyproject.toml (semver — minor for additive, major for breaking).
  5. Rebuild and redeploy every consuming service. Kamal picks up the changed shared package on its next build because the COPY shared/josh_substrate layer invalidates.
  6. Apply the migration (currently manual via kamal app exec --reuse "alembic upgrade head"; eventually via the pre-deploy hook).
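
The version rule in step 4 (minor for additive, major for breaking) can be written down as a small function. This is only a worked example of the rule: bump_version is a hypothetical helper, and deciding whether a given change counts as additive or breaking is left to the author of the change.

```python
def bump_version(version: str, change: str) -> str:
    """Return the next version for an 'additive' or 'breaking' change."""
    major, minor, _patch = (int(part) for part in version.split("."))
    if change == "breaking":
        return f"{major + 1}.0.0"        # major bump resets minor and patch
    if change == "additive":
        return f"{major}.{minor + 1}.0"  # minor bump resets patch
    raise ValueError(f"unknown change type: {change!r}")


print(bump_version("1.4.2", "additive"))  # 1.5.0
print(bump_version("1.4.2", "breaking"))  # 2.0.0
```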

When the multi-service workflow gets repetitive, add bin/ wrapper scripts:

  • bin/ingester — wraps kamal app exec --reuse --roles=ingester "josh-ingester $@"
  • bin/migrate — wraps kamal app exec --reuse --roles=web "alembic upgrade head"

Not added yet — wait until the friction is real.

| Concern | Lives in |
| --- | --- |
| Pydantic models for substrate tables | shared/josh_substrate/src/josh_substrate/models/ |
| Alembic migrations | shared/josh_substrate/src/josh_substrate/migrations/versions/ |
| FastAPI routes, request/response shapes | josh-core/app/ |
| LLM harness, tool definitions | josh-core/app/harness/ (when added) |
| Source modules (Federal Register, Bills, etc.) | josh-ingester/ingester/sources/ (when added) |
| ETL framework (Source protocol, runner, scheduler) | josh-ingester/ingester/ (when added) |
| Agent UI components | josh-web/src/ (when added) |
| TypeScript types from API | josh-web/src/types/api.ts (generated) |
| Postgres init scripts | josh-postgres/init/ (directory removed alongside the SQLite swap) |
| Kamal hooks (pre-deploy migration, etc.) | <service>/.kamal/hooks/ (per-service) |
| Per-data-source schemas + ingestion specs | docs/sources/ |
| Operational runbooks | docs/operations/ |
| Product / design docs | docs/ |

Josh ships as three public repos: josh-foundation (the substrate, the ingester, and the FastAPI core), josh-substrate (the pip-installable schema package — Pydantic models, Alembic migrations, ID helpers, snapshot loader), and josh-eval (the benchmark). License: Apache 2.0 for foundation + substrate; MIT for the benchmark.