substrate · verified · p1

Consolidate substrate services into a single-image Kamal deploy

substrate-single-image-deploy · updated 2026-05-12T18:30:00Z · owner rritz

Before this change, the substrate ran as three independent Kamal apps —
josh-core, josh-ingester, josh-embedder — each with its own
Dockerfile, config/deploy.yml, Docker Hub image, and Docker Hub
build-cache image. Bringing the substrate up required three kamal setup
invocations from three different working directories, three separate image
builds, three pre-deploy hook firings (one of which actually ran alembic;
the others early-exited), and triplicate maintenance of nearly-identical
deploy YAML.

Consolidate to one Kamal service josh with three roles (web,
ingester, embedder), all built from one shared root Dockerfile. This
is Kamal's canonical "Rails monolith + Sidekiq workers" pattern — one
image, multiple roles with different CMDs, deployed atomically. Selective
deploys still work via kamal deploy --roles=embedder (etc).

v1 scope note: the embedder package is built into the image by the
Dockerfile, but the embedder role is commented out in
config/deploy.yml, so the worker container is not
actually deployed in v1. Reason: the chunking strategy + embedding model
evaluation are not yet locked (tracked in
substrate-embedding-evaluation), and we don't want to bulk-embed a
23 K-document corpus against a strategy we'll later regret. The
query-time embedding singleton in josh-core stays live (/embed endpoint
works for ad-hoc queries); what's deferred is the bulk-embed worker
draining chunk_embedding_jobs. Activation later is a config-only
change: uncomment the role block, kamal deploy --roles=embedder.
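
To make the "config-only change" concrete, here is a minimal sketch of uncommenting the role block. It assumes the block is commented with a two-space indent followed by `# `, which is the form the determiner greps for (`^  # embedder:`). The sample YAML, the `192.0.2.10` placeholder address, and the `uncomment_embedder` helper are hypothetical illustrations, not the project's real config or tooling; editing the file by hand works just as well.

```shell
#!/usr/bin/env bash
# Hypothetical helper: uncomment the "  # embedder:" block and print the result.
uncomment_embedder() {
  awk '
    /^  # embedder:/    { in_block = 1 }         # start of the commented role block
    in_block && !/^  #/ { in_block = 0 }         # first non-comment line ends the block
    in_block            { sub(/^  # /, "  ") }   # drop "# " and keep the indent
                        { print }
  ' "$1"
}

# Hypothetical stand-in for config/deploy.yml (not the real file).
sample=$(mktemp)
cat > "$sample" <<'YAML'
servers:
  web:
    hosts:
      - 192.0.2.10
  # embedder:
  #   hosts:
  #     - 192.0.2.10
  #   proxy: false
env:
  clear:
    HF_HOME: /data/cache/huggingface
YAML

activated=$(uncomment_embedder "$sample")
printf '%s\n' "$activated"
rm -f "$sample"
```

After the edit is committed, the activation step named above is `kamal deploy --roles=embedder`.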

The cost is ~200 MB of image bloat per role (everyone pays for everyone
else's pip deps on disk), which is irrelevant at our scale (1.8 TB on
RAID 10). Runtime memory is unaffected — Python only imports what each
role's CMD code path imports. The wins are: one image to build / push /
pull / cache, one deploy.yml to maintain, the pre-deploy alembic hook
fires exactly once per deploy (not three times), the basename "$PWD"
gate disappears, and the OSS install story collapses to "edit .env,
kamal setup."

Done now (rather than later) because the substrate is empty (no data
to migrate), kamal-proxy persists across kamal app remove (no HTTP
downtime concern beyond the brief cutover window), and adding two more
per-service deploys (ingester + embedder) before consolidating would
triple the rework cost.

As an operator standing up the substrate on a fresh host, I want a single `kamal setup` from the project root to bring up every role so that I'm not running three setup invocations from three directories and reasoning about cross-service ordering.

As an operator deploying a routine code change, I want one `kamal deploy` to ship every role atomically (with `--roles=` for surgical updates) so that a schema change lands on all three roles in one logical step, with no "deploy core, then ingester, then embedder" sequencing.

As an OSS self-hoster reading the Josh Foundation README, I want a single configuration artifact (`config/deploy.yml`) and a single image to understand so that I can read the deploy story end-to-end without context-switching across three subdirectories.

  1. There shall be exactly one `Dockerfile` at the project root (excluding `shared/`), and zero Dockerfiles under `josh-core/`, `josh-ingester/`, or `josh-embedder/`.
  2. There shall be exactly one `config/deploy.yml` at the project root, and zero `config/deploy.yml` files under `josh-core/`, `josh-ingester/`, or `josh-embedder/`.
  3. When `kamal setup` (or `kamal deploy`) runs from the project root, the host shall end up with **two role-containers** labelled `service=josh` (roles: `web`, `ingester`) and one `kamal-proxy` container. The `embedder` role is intentionally commented out in `config/deploy.yml` for v1; uncommenting + redeploying activates it.
  4. When `kamal app exec --reuse --roles web 'alembic current'` runs against the deployed stack, alembic shall report the current head from `shared/josh_substrate/src/josh_substrate/migrations/versions/`.
  5. When `curl http://<your-host-ip>/health` runs against the deployed stack, it shall return HTTP 200 with body `{"status":"ok"}` (proves kamal-proxy routes to the web role).
  6. When a subsequent `kamal deploy` (without `--skip-hooks`) runs against the consolidated config, the rewritten pre-deploy hook shall fire successfully — proves env files are on the host and the `--roles web` alembic invocation works end-to-end.
  7. When the embedder role is uncommented in `config/deploy.yml` and `kamal deploy --roles=embedder` runs, the embedder container shall start successfully from the same image (proves the package is built into the image even when not deployed).

kind: bash

Command

set -euo pipefail
cd /Users/ritz/projects/josh

# 1. Single Dockerfile at root, none in service dirs. (Vendored
# third-party Dockerfiles under vendor/ are excluded — they aren't
# part of our deploy stack.)
test "$(find . -maxdepth 4 -name 'Dockerfile' -not -path './shared/*' -not -path './node_modules/*' -not -path './.git/*' -not -path './vendor/*' -not -path './private/*' | wc -l | tr -d ' ')" = "1" \
  || { echo "FAIL: expected exactly 1 Dockerfile at root"; exit 1; }

# 2. Single config/deploy.yml at root, none in service dirs
test "$(find . -maxdepth 4 -name 'deploy.yml' -path '*/config/*' -not -path './node_modules/*' -not -path './.git/*' -not -path './vendor/*' -not -path './private/*' | wc -l | tr -d ' ')" = "1" \
  || { echo "FAIL: expected exactly 1 config/deploy.yml at root"; exit 1; }

# 3. Two role-containers + kamal-proxy on host (embedder commented out for v1)
roles=$(ssh josh 'docker ps --filter label=service=josh --format "{{.Label \"role\"}}"' | sort -u | tr '\n' ',')
[ "$roles" = "ingester,web," ] \
  || { echo "FAIL: expected two role-containers (ingester, web); got: $roles"; exit 1; }
ssh josh 'docker ps --filter name=kamal-proxy --format "{{.Names}}"' | grep -q '^kamal-proxy$' \
  || { echo "FAIL: kamal-proxy not running"; exit 1; }
# Confirm embedder role is commented out in deploy.yml (not just absent — explicit intent)
grep -q "^  # embedder:" config/deploy.yml \
  || { echo "FAIL: embedder role expected to be commented in config/deploy.yml"; exit 1; }

# 4. Alembic at head. We `docker exec` directly via SSH rather than
# going through `kamal app exec` because kamal's output buffering
# turns into empty captures when its stdout is a subshell pipe.
web_container=$(ssh josh 'docker ps --filter label=service=josh --filter label=role=web --format "{{.Names}}"' | head -1)
test -n "$web_container" \
  || { echo "FAIL: no running web container on host"; exit 1; }
cur=$(ssh josh "docker exec $web_container alembic current 2>&1" | tail -1)
echo "$cur" | grep -q "(head)" \
  || { echo "FAIL: alembic not at head; got: $cur"; exit 1; }

# 5. /health returns 200. Host IP is read from config/deploy.yml so the
# determiner stays in sync if the deploy target ever moves; the prose
# examples elsewhere in this spec keep the `<your-host-ip>` template
# form so the spec is reusable as an OSS reference.
host_ip=$(awk '/^servers:/{flag=1} flag && /^      - /{print $2; exit}' config/deploy.yml)
test -n "$host_ip" \
  || { echo "FAIL: could not parse host IP from config/deploy.yml"; exit 1; }
curl -fs "http://$host_ip/health" | grep -q '"status":"ok"' \
  || { echo "FAIL: /health did not return ok"; exit 1; }

echo OK

Expect

Final line is `OK`. All five functional checks (one root Dockerfile, one root config/deploy.yml, two role-containers + kamal-proxy, alembic head, /health 200) pass.

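Run on the local Mac. The script changes into the project root itself; the `find`, `grep`, and `awk` checks read paths relative to it, and the host checks go over `ssh josh` rather than `kamal app exec`.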

None.

  • Renaming josh-core's `app/` package to `josh_core/` for symmetry with `ingester/` and `josh_embedder/` — separable refactor, captured as a follow-up
  • Multi-host or multi-region deploy — single-host on Vint Hill is the v1 shape (see `substrate-bare-metal-host`)
  • Switching off Kamal entirely (e.g., to Docker Compose / Coolify / Dokploy) — researched and rejected; Kamal is the modern best fit for our shape

## End state

### Repo layout
- Dockerfile at project root — single Dockerfile that installs josh-substrate[all-providers] and all three service packages
- config/deploy.yml at project root — single Kamal service josh with three roles (web, ingester, embedder)
- .kamal/hooks/pre-deploy — rewritten to drop the basename "$PWD" != "josh-core" gate and target --roles web for the alembic invocation
- josh-core/, josh-ingester/, josh-embedder/ — source-only directories (the Dockerfile, config/, and .kamal symlink in each get deleted)
- shared/josh_substrate/ — unchanged

### Production host (v1 deploy state)
- One Kamal service josh deployed at <your-host-ip>
- Two role-containers from yourorg/josh:<sha>:
  - josh-web-<sha>: FastAPI on port 8000, fronted by kamal-proxy on 80/443
  - josh-ingester-<sha>: scheduler daemon, proxy: false
- josh-embedder role commented out in config/deploy.yml for v1; the embedder package + ML deps are still built into the image, so activation later is a config-only change (uncomment + kamal deploy --roles=embedder).
- kamal-proxy preserved across the migration (kamal app remove does not touch it)
- /data and /data/josh.db preserved across migration (host bind mount, not a Docker volume)
- HF_HOME=/data/cache/huggingface global — the query-time embedding singleton in josh-core uses this; the deferred embedder role would use it too when activated

## Migration sequence (executed)

Phase A — local repo prep on consolidate-kamal branch:
1. Create branch, write root Dockerfile, write root config/deploy.yml, rewrite .kamal/hooks/pre-deploy.
2. Validate kamal config parses cleanly.

Phase B — tear down existing josh-core:
3. Capture pre-state: /data/josh.db sha256 + alembic version + docker ps (tripwire).
4. cd josh-core && kamal app remove.
5. Verify post-tear-down: josh.db sha256 unchanged, kamal-proxy still up.
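
The tripwire in steps 3 and 5 amounts to hashing the substrate file before and after the tear-down and comparing the two digests. The sketch below demonstrates the comparison on a temporary local file so it runs anywhere. The `file_sha256` helper and the temp file are hypothetical stand-ins; against the real host the hash would presumably be taken over SSH on `/data/josh.db`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the sha256 of a file using whichever tool exists
# (sha256sum on most Linux hosts, shasum on macOS).
file_sha256() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{print $1}'
  else
    shasum -a 256 "$1" | awk '{print $1}'
  fi
}

db=$(mktemp)                          # stand-in for /data/josh.db
printf 'substrate bytes' > "$db"

before=$(file_sha256 "$db")           # step 3: capture pre-state
# ... the tear-down (kamal app remove) would happen here ...
after=$(file_sha256 "$db")            # step 5: re-check

if [ "$before" = "$after" ]; then tripwire=ok; else tripwire=CHANGED; fi
echo "tripwire: $tripwire"            # prints "tripwire: ok"

# Any write to the file changes the digest, which is what trips the wire.
printf 'x' >> "$db"
tampered=$(file_sha256 "$db")
rm -f "$db"
```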

Phase C — deploy consolidated stack from project root:
6. kamal setup --skip-hooks — first-deploy chicken-egg (env files pushed after pre-deploy hook).
7. Manual kamal app exec --reuse --roles web "alembic upgrade head" — confirms schema head.
8. Smoke-test a regular kamal deploy (no --skip-hooks) to prove the rewritten pre-deploy hook works end-to-end.

Phase D — repo cleanup:
9. Delete josh-*/Dockerfile, josh-*/config/, josh-*/.kamal (symlinks).
10. Update docs (architecture, repo-structure, migrations, new-host-setup), specs (substrate-bare-metal-host, substrate-cron-scheduler, substrate-migrations-workflow), and josh-core/README.md.
11. bin/build-spec.py; uv run poe spec-lint; grep for stragglers.

Phase E — commit + push + merge:
12. Stage explicit paths (parallel agent has unrelated uncommitted work on main); commit.
13. Push; CI runs; fast-forward main; delete branch.

## The Dockerfile pattern

Multi-package install into one image. Each service is pip-installed from
its own /tmp/<service> directory to keep pyproject.tomls isolated:

```dockerfile
COPY shared/josh_substrate /tmp/josh_substrate
RUN pip install --no-cache-dir '/tmp/josh_substrate[all-providers]'
COPY shared/josh_substrate/alembic.ini /app/alembic.ini
COPY josh-core/pyproject.toml /tmp/josh-core/pyproject.toml
COPY josh-core/josh_core /tmp/josh-core/josh_core
RUN pip install --no-cache-dir /tmp/josh-core
# …same for josh_ingester and josh_embedder
```

WORKDIR is /app so alembic.ini is discovered without -c. The
josh_core/, josh_ingester/, and josh_embedder/ packages coexist
in site-packages without collision — all four workspace packages share
a consistent josh_* import prefix (josh_core, josh_ingester,
josh_embedder, josh_substrate).

## The deploy.yml pattern

One service with three roles defined; the embedder role is commented
out for v1 (intentional; see the v1 scope note in why). The
proxy: false flag on ingester (and embedder, when activated) keeps
kamal-proxy routing only to web. Per-role env (JOSH_EMBED_*) is scoped
under servers.embedder.env.clear (also commented). HF_HOME=/data/cache/huggingface
is global so the query-time singleton in josh-core uses it now, and
the embedder will too when activated — without re-downloading ~600 MB
of arctic-embed weights on every restart.
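
As a rough illustration of the shape described above, the sketch below writes a hypothetical sample config to a temp file and runs the determiner's two parsing checks against it. The sample is an assumption, not the real `config/deploy.yml`: the `192.0.2.10` address is a documentation placeholder for `<your-host-ip>`, `JOSH_EMBED_EXAMPLE` is an invented key standing in for the `JOSH_EMBED_*` variables, and the commands carry no flags beyond what this spec mentions.

```shell
#!/usr/bin/env bash
# Hypothetical sample of the consolidated config (illustration only).
sample=$(mktemp)
cat > "$sample" <<'YAML'
service: josh
image: yourorg/josh
servers:
  web:
    hosts:
      - 192.0.2.10
    cmd: uvicorn josh_core.main:app
  ingester:
    hosts:
      - 192.0.2.10
    cmd: josh-ingester daemon
    proxy: false
  # embedder:
  #   hosts:
  #     - 192.0.2.10
  #   proxy: false
  #   env:
  #     clear:
  #       JOSH_EMBED_EXAMPLE: "placeholder"
env:
  clear:
    HF_HOME: /data/cache/huggingface
YAML

# Same awk as the determiner: first host entry under servers:.
host_ip=$(awk '/^servers:/{flag=1} flag && /^      - /{print $2; exit}' "$sample")

# Same grep as the determiner: embedder must be explicitly commented out.
if grep -q '^  # embedder:' "$sample"; then
  embedder_state=commented
else
  embedder_state=active
fi

echo "host_ip=$host_ip embedder=$embedder_state"   # prints "host_ip=192.0.2.10 embedder=commented"
rm -f "$sample"
```

Both checks depend on exact indentation (six spaces before a host entry, two before the commented role), so reindenting the real file would require adjusting the determiner's patterns.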

## The pre-deploy hook rewrite

Old hook gated on basename "$PWD" != "josh-core". New hook drops the
gate (only one config now) and targets --roles web so alembic runs
once across the three role-containers, not three times.
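
A minimal sketch of the difference, under stated assumptions: the function names and the `DRY_RUN` switch are hypothetical and exist only so the example runs without a deployed stack. The `kamal app exec --reuse --roles web` invocation is the one this spec uses elsewhere; the real hook's contents are not reproduced here.

```shell
#!/usr/bin/env bash
# Old behaviour (sketch): migrations ran only when the hook fired from josh-core.
old_gate_allows() {   # $1 = working directory the hook fired from
  [ "$(basename "$1")" = "josh-core" ]
}

# New behaviour (sketch): no directory gate; one migration run on the web role.
migration_command() {
  printf '%s\n' 'kamal app exec --reuse --roles web "alembic upgrade head"'
}

run_hook() {
  # DRY_RUN is a hypothetical switch; it defaults to 1 so this sketch only prints.
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "would run: $(migration_command)"
  else
    kamal app exec --reuse --roles web "alembic upgrade head"
  fi
}

run_hook   # prints: would run: kamal app exec --reuse --roles web "alembic upgrade head"
```

With three per-service configs, the old gate made two of the three hook firings exit early; with one config there is a single firing, so the gate has nothing left to filter.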

## Risks + rollback

Branch isolates all repo changes. If anything in Phase C+ fails:
1. git checkout main
2. cd josh-core && kamal setup --skip-hooks
3. kamal app exec --reuse "alembic upgrade head"
Back to working state. /data/josh.db is untouched throughout (bind mount,
not a Docker volume — kamal app remove cannot delete host paths).

11 of 15 done.

  • t1 Create `consolidate-kamal` branch off main
  • t2 Write root `Dockerfile` consolidating josh-substrate[all-providers] + three service packages
  • t3 Write root `config/deploy.yml` with two active roles (web, ingester) + embedder role commented out for v1; per-role env for JOSH_EMBED_*; global HF_HOME=/data/cache/huggingface
  • t4 Rewrite `.kamal/hooks/pre-deploy` to drop the basename gate and target --roles web
  • t5 Validate root deploy.yml via `kamal config`
  • t6 Capture pre-tear-down state (sha256 of /data/josh.db, alembic version, docker ps) as a tripwire
  • t7 `cd josh-core && kamal app remove` — tear down current single-service deploy
  • t8 From project root: `kamal setup --skip-hooks` — brings up two active role-containers (web + ingester) from consolidated image. Done 2026-05-12; ~13 min build on ritz-cmd, 71s pull on josh, container health checks passed.
  • t9 Manual `kamal app exec --reuse --roles web 'alembic upgrade head'` — verified schema at `0005 (head)`; substrate file sha256 unchanged across tear-down + rebuild.
  • t10 Smoke-test `kamal deploy` (no skip-hooks) — succeeded end-to-end in 74s, proves rewritten pre-deploy hook works against env files already on the host.
  • t11 Deleted `josh-core/Dockerfile`, `josh-core/config/`, `josh-core/.kamal` symlink — same for josh-ingester and josh-embedder.
  • t12 Update docs (architecture, repo-structure, migrations, new-host-setup); specs (substrate-bare-metal-host changelog)
  • t13 `bin/build-spec.py` to regenerate spec HTML; `uv run poe spec-lint`; grep sweep for stragglers
  • t14 Commit, push (triggers `uv run poe ci`), fast-forward main, delete branch
  • t15 Run success_determiner; flip status to verified
  • 2026-05-12T17:30:00Z draft → in_progress
    Spec drafted and put in flight as part of the kamal consolidation execution. Phases A and B (local prep + tear-down of single-service josh-core) already complete; Phase C (deploy consolidated stack) in progress.
    Decisions captured during planning (see `/Users/ritz/.claude/plans/playful-floating-brooks.md`):
    - Image name: `yourorg/josh` (single image replacing the three per-service images)
    - HF model cache: `/data/cache/huggingface` on host (under /data, restic-excluded in a follow-up)
    - Rename `josh-core/app/` → `josh_core/` deferred to follow-up commit (landed 2026-05-12; see changelog below)
    - Stay on Kamal vs. Docker Compose: researched (haloy.dev, kamal-deploy.org, evilmartians, mkdev) and Kamal is the modern best fit for our shape: zero server overhead, battle-tested at 37signals scale, ships with Rails 8 as default, opinionated but right-shaped.
    Builds atomically on top of `substrate-bare-metal-host` t6 (which stays as the host-provision spec); `substrate-bare-metal-host` t7 (the planned per-service `kamal setup josh-ingester + josh-embedder`) gets reframed against this consolidation rather than executed as three separate setups.
  • 2026-05-12T22:00:00Z in_progress → in_progress
    Followed up the consolidation by renaming the two outlier Python packages so all four workspace members share a consistent `josh_*` import prefix:
    - `josh-core/app/` → `josh-core/josh_core/` (commit 51d29b5)
    - `josh-ingester/ingester/` → `josh-ingester/josh_ingester/` (commit 1a1a4a7)
    `josh-embedder/src/josh_embedder/` and `shared/josh_substrate/` were already correctly named and untouched. The dashed distribution names (`josh-core`, `josh-ingester`) and the runtime console script (`josh-ingester daemon`) are unchanged; only the Python *import* paths moved.
    Touch points handled: directory renames via `git mv` (history preserved), `packages = […]` lists in each `pyproject.toml`, console-script target `josh_ingester.cli:app`, 22 import sites, 9 logger labels, 2 `importlib.import_module()` runtime paths in `josh_ingester.sources._registry`, Dockerfile COPY + CMD, `config/deploy.yml` web role cmd (`uvicorn josh_core.main:app`), root mypy task path list.
    Pure mechanical refactor: no schema change, no behavior change, /data/josh.db untouched. `uv run poe ci` green after each commit.
  • 2026-05-13T00:00:00Z in_progress → verified
    Ran `bin/dry-run-spec.py substrate-single-image-deploy`: exit code 0, all five functional checks pass against the live OVH host (single Dockerfile at root, single `config/deploy.yml` at root, web + ingester role-containers + kamal-proxy running, alembic at `0005 (head)`, `/health` returns 200).
    Two determiner fixes landed alongside the flip:
    1. Excluded `vendor/` and `private/` from the Dockerfile/`deploy.yml` find checks. A vendored Dockerfile under `vendor/harveyai-legal-agent-benchmark/sandbox/` was tripping the "exactly one Dockerfile" assertion.
    2. Replaced the `kamal app exec --reuse --roles web "alembic current"` capture with a direct `ssh josh "docker exec <web-container> alembic current"`. Kamal's CLI emits empty output when its stdout is captured by a subshell (`$(...)`), which had made the alembic check fail in a non-obvious way.
    Also: parsed the host IP from `config/deploy.yml` at runtime (instead of hardcoding `40.160.12.131`) so the determiner stays in sync if the deploy target moves. The prose elsewhere in this spec keeps the `<your-host-ip>` template form so the spec is reusable as an OSS reference.

docs/spec/substrate-single-image-deploy.html · generated by bin/build-spec.py