Provisioning a new host

How to set up a fresh server to run the Josh substrate. Use this when:

Spinning up the production server for the first time
Migrating to a new provider or region
Replacing a failed or compromised host
Setting up a staging environment

Steps are written for our current host shape — OVHcloud Advance-1 2024 bare metal with Soft RAID 10 — but the architectural commitments (data lives at /data, Kamal-managed Docker, bind-mount contract) are provider-agnostic. The major branch in the doc is cloud VM with detachable block storage (DO/Hetzner/AWS/GCP) versus bare metal with software RAID (OVHcloud, Hetzner Robot, your own iron). We default to the bare-metal path; cloud-VM differences are called out where they matter.

What this gets you

An Ubuntu host with /data reachable on a redundant disk array, persisted across reboots
The directory skeleton the ingester expects (/data/corpus/, /data/locks/, /data/backups/)
Docker installed (by Kamal, on first kamal setup)
Two role-containers running from a single image (web + ingester), each bind-mounting /data:/data. The embedder role is defined but commented out for v1.

The substrate SQLite file at /data/josh.db materializes when the first migration runs.

Prerequisites

A provider account with billing enabled
A local Mac with kamal installed (gem install kamal) and SSH access to the new host
The Josh repo cloned locally
Decrypted .kamal/secrets (age -d -i ~/.age/key.txt .kamal/secrets.age > .kamal/secrets)

Sizing

For the substrate node:

Resource	Minimum	Recommended	Notes
CPU	4 cores	6+ cores	Embedding throughput is CPU-bound when the embedder runs on the same host; modern AMD/Intel cores at 3.5+ GHz keep p50 query latency well under a second
RAM	8 GB	32 GB	SQLite is light; headroom is for embedding workers + parsers + page cache for the substrate file
Disk (data, after RAID)	500 GB usable	~2 TB usable	Holds everything substrate-related. Bare metal can't expand in-place, so size for 2–3 years of corpus growth at order time.
Disk redundancy	RAID 1 or zero-copy block storage	RAID 10 across 4 drives	Single-drive failure must not destroy the substrate. RAID 10 across 4 drives also roughly doubles striped read/write throughput vs a 2-drive RAID 1.

What we run today is an OVHcloud Advance-1 2024 in Vint Hill, VA: AMD EPYC 4244P (6c/12t, 5.1 GHz boost), 32 GB DDR5 ECC, 4 × 960 GB NVMe in Soft RAID 10 (~1.92 TB usable), 3 Gbps public + 25 Gbps private. It costs ~$165/mo, with the install fee waived on the 2024 chipset. The reasoning, including why not a cloud VM with detachable block storage, lives in substrate-bare-metal-host.
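The usable figure follows from RAID 10 mirroring every drive once, so half the raw capacity remains. A minimal sketch of that arithmetic, useful when comparing drive options at order time; the function name raid10_usable_gb is illustrative and not part of the repo, and it assumes the conventional layout of mirrored pairs:

```shell
# Hypothetical helper (not part of the repo): usable capacity of a RAID 10
# array in GB. Assumes the conventional layout of mirrored pairs, so the
# drive count must be even and at least 4; usable = drives * size / 2.
raid10_usable_gb() {
  drives=$1
  size_gb=$2
  if [ "$drives" -lt 4 ] || [ $((drives % 2)) -ne 0 ]; then
    echo "expected an even drive count of at least 4" >&2
    return 1
  fi
  echo $((drives * size_gb / 2))
}

raid10_usable_gb 4 960   # prints 1920
```

For the current host, 4 drives of 960 GB give 1920 GB, which matches the ~1.92 TB quoted above.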

If you're following on a cloud VM (DigitalOcean, Hetzner, AWS, GCP), the equivalent shape is a 4–8 vCPU instance plus a single block-storage volume sized to your corpus expectations. The instructions below note where the bare-metal flow differs from the cloud-VM flow.

Step 1 — Order the host

Bare metal (OVHcloud Advance-1 2024, what we run):

Order via the OVHcloud Manager. Pick the Vint Hill datacenter (or wherever is geographically closest to your users).
Storage option: 4 × 960 GB SSD NVMe Enterprise, Soft RAID 10. Confirm "Soft RAID" (mdadm), not hardware RAID — simpler recovery, no controller dependency, drives portable to any Linux box.
OS: Ubuntu 24.04 LTS. The OVHcloud installer pre-builds the RAID 10 array and lays down the OS partition during install — you don't have to assemble the array by hand.
SSH key: add your public key during checkout so the post-install image trusts it.
Network: 3 Gbps public unmetered + 25 Gbps private unmetered are included on this SKU; anti-DDoS is on by default.
Wait for the provisioning email (~hours to ~1 day on a fresh order). Note the public IPv4 it ships with.

Cloud VM alternative (DigitalOcean, Hetzner, AWS, etc.):

Create the instance: 4–8 vCPU, 8–32 GB RAM, Ubuntu 24.04 LTS, your SSH key added.
Choose a region near your users.
Create a Block Storage Volume in the same region. Size as planned. Do not let the provider format-and-mount it for you — we want explicit control. (If you do, just unmount and follow Step 3.)
Attach the volume to the instance.
Note the instance's public IPv4.

Step 2 — First SSH and update local config

Update ~/.ssh/config:

Host josh
  HostName <new-ip>
  User root
  IdentityFile ~/.ssh/id_rsa
  StrictHostKeyChecking accept-new

Verify:

ssh josh 'whoami && hostnamectl'

You should see root and the OS info.

Step 3 — Confirm the array and mount `/data`

This is the architectural commitment per CLAUDE.md "Where data lives": every durable byte of substrate state lives on a redundant array, not on a single non-mirrored disk. The mechanics differ between bare metal and cloud-VM-with-detachable-volume; the contract — /data reachable, on redundant storage — is the same.

Bare-metal (Soft RAID 10 mdadm): the OVHcloud installer has already built the RAID 10 array across all 4 NVMes and laid the root filesystem on top. There's no separable block-storage volume to mount — /data is a directory inside the root filesystem on the same array. Confirm and create:

ssh josh

# Confirm the array spans 4 NVMe devices and is healthy
cat /proc/mdstat
# Expected: a line like
#   md0 : active raid10 nvme0n1p2[0] nvme1n1p2[1] nvme2n1p2[2] nvme3n1p2[3]
# with [4/4] [UUUU] in the status line.

# Confirm the root filesystem rides on the array
findmnt /
# Expected SOURCE: /dev/md0 (or similar md device).

# Create /data as a directory on the array (no separate fstab entry needed —
# / already covers the whole array)
mkdir -p /data

If your installer carved a separate /data partition during OS install (some bare-metal flows do), confirm fstab uses UUID with defaults,noatime,discard,nofail rather than the device name — drives can be reordered when the chassis swaps a failed disk.
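The /proc/mdstat check above can also be scripted, for example to run after each reboot. This is a sketch under assumptions: mdstat_healthy is a hypothetical name, and it relies only on the status-bracket convention shown in the expected output ([UUUU] when healthy, an underscore for each missing member).

```shell
# Hypothetical helper: read /proc/mdstat-style text on stdin and succeed only
# if at least one array is listed and no member slot is missing.
# A healthy 4-drive array shows [UUUU]; a failed member shows as "_".
mdstat_healthy() {
  awk '
    /\[[U_]+\]/ {
      seen = 1
      if ($0 ~ /\[[U_]*_[U_]*\]/) degraded = 1
    }
    END { exit (seen && !degraded) ? 0 : 1 }
  '
}

# On the host:
#   mdstat_healthy < /proc/mdstat || echo "array degraded or not found"
```

The function reads stdin rather than opening /proc/mdstat itself, so it can be tried against saved output before it is trusted on a live host.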

Cloud-VM alternative (block storage volume): mount the attached volume by UUID at /data so device names (/dev/sda, /dev/sdb) can change without breaking fstab.

# Identify the volume — typically /dev/sda when only one is attached
lsblk

# Get its UUID
blkid /dev/sda

# Unmount any provider-default mountpoint and remove the empty dir
umount /mnt/<whatever> 2>/dev/null
rmdir /mnt/<whatever>  2>/dev/null

# Format only if blank (DO ships ext4-formatted; some providers don't).
# blkid exits non-zero when it finds no filesystem signature, so mkfs runs only then.
blkid /dev/sda >/dev/null || mkfs.ext4 /dev/sda

# Create /data
mkdir -p /data

# Persist the mount in fstab using the UUID
UUID=$(blkid -s UUID -o value /dev/sda)
echo "# Block storage volume (substrate data root)" >> /etc/fstab
echo "UUID=$UUID /data ext4 defaults,noatime,discard,nofail 0 2" >> /etc/fstab

# Mount it
mount -a

# Verify
df -h /data
ls -la /data

Mount flags explained (cloud-VM path):

defaults — standard rw, suid, dev, exec, auto, nouser, async
noatime — don't update access timestamps. Reduces SSD writes; faster.
discard — issue TRIM commands to the underlying SSD on file deletes
nofail — don't block boot if the volume is missing (so a detached volume doesn't brick the host)
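The two echo lines in the cloud-VM flow append to /etc/fstab every time they run, so repeating the step leaves duplicate entries. A possible idempotent variant is sketched below; ensure_fstab_entry is a hypothetical helper, and the fstab path is a parameter so it can be tried on a copy first.

```shell
# Hypothetical helper: append an fstab entry for a mountpoint only when the
# file has no uncommented entry for that mountpoint yet. Uses the same mount
# flags as the cloud-VM path above.
ensure_fstab_entry() {
  fstab=$1
  uuid=$2
  mountpoint=$3
  if awk -v mp="$mountpoint" \
      '$1 !~ /^#/ && $2 == mp { found = 1 } END { exit !found }' "$fstab"; then
    echo "entry for $mountpoint already present, leaving $fstab unchanged" >&2
    return 0
  fi
  printf 'UUID=%s %s ext4 defaults,noatime,discard,nofail 0 2\n' \
    "$uuid" "$mountpoint" >> "$fstab"
}

# On the host (assuming the volume is /dev/sda, as in the steps above):
#   ensure_fstab_entry /etc/fstab "$(blkid -s UUID -o value /dev/sda)" /data && mount -a
```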

Step 4 — Create the substrate directory skeleton

ssh josh 'mkdir -p /data/corpus /data/locks /data/backups && chmod 755 /data/corpus /data/locks /data/backups'

After this:

/data/
├── corpus/                 # raw + Markdown bodies (per-source subdirs created at ingest)
├── locks/                  # ingester flock files
├── backups/                # Litestream local snapshots before S3 ship
└── lost+found/             # ext4 default; present only when /data is its own filesystem (cloud-VM path)

/data/josh.db (and its WAL/SHM companion files) will be created by the first Alembic migration.
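Before moving on to the deploy, the skeleton can be checked with a few lines of shell. A minimal sketch, assuming only the three directory names listed above; check_skeleton is a hypothetical name:

```shell
# Hypothetical helper: confirm the directory skeleton exists under a data root
# (default /data). Prints each missing directory and fails if any is absent.
check_skeleton() {
  root=${1:-/data}
  missing=0
  for d in corpus locks backups; do
    if [ ! -d "$root/$d" ]; then
      echo "missing: $root/$d"
      missing=1
    fi
  done
  return "$missing"
}

# On the host:
#   check_skeleton /data && echo "skeleton ok"
```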

Step 5 — (Optional) Configure swap

The substrate workload (SQLite + ingester + embedding) usually fits in 8 GB RAM. Skip swap unless you observe pressure.

If you want it (e.g., running a larger embedding model):

ssh josh '
  fallocate -l 4G /swapfile &&
  chmod 600 /swapfile &&
  mkswap /swapfile &&
  swapon /swapfile &&
  echo "/swapfile none swap sw 0 0" >> /etc/fstab &&
  swapon --show
'

On bare metal where / and /data share the RAID array, this just lives at /swapfile. On cloud VMs with a separate block-storage volume, put the swapfile on the root disk rather than the volume — keeps the volume purely for substrate data.
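One way to decide whether memory pressure justifies swap is to look at how much RAM the kernel still reports as available. The sketch below computes that as a percentage from /proc/meminfo; mem_available_pct is a hypothetical helper, and any threshold applied to it is a judgment call rather than a figure from this runbook.

```shell
# Hypothetical helper: percentage of RAM still available, computed from a
# /proc/meminfo-style file as MemAvailable / MemTotal (integer percent).
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total <= 0) exit 1
      printf "%d\n", avail * 100 / total
    }
  ' "${1:-/proc/meminfo}"
}

# On the host (the 10 percent cutoff is an arbitrary illustration):
#   [ "$(mem_available_pct)" -lt 10 ] && echo "memory is tight; consider swap"
```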

Step 6 — Deploy via Kamal (Docker installs itself)

You do not install Docker manually; kamal setup does it for you on first run. From the project root on your local Mac:

cd ~/projects/josh

# Verify secrets are decrypted
test -f .kamal/secrets || age -d -i ~/.age/key.txt .kamal/secrets.age > .kamal/secrets

# Update config/deploy.yml's `servers:` block with the new host IP if you migrated

# First-time deploy — one command brings up every active role.
# --skip-hooks is required on first deploy (the pre-deploy alembic hook
# needs env files that kamal only pushes during setup itself — chicken
# and egg). Subsequent `kamal deploy` runs the hook normally.
env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal setup --skip-hooks

# Apply migrations manually for the first deploy.
env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal app exec --reuse --roles web "alembic upgrade head"

What kamal setup does on first run:

SSHes to the host
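Installs Docker if it is not already present (Kamal runs Docker's get.docker.com convenience script, which on Ubuntu installs from Docker's apt repository)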
Logs into your container registry (Docker Hub) using KAMAL_REGISTRY_PASSWORD
Pulls the consolidated image (built remotely on ritz-cmd)
Starts every active role-container (web, ingester) with the bind mount /data:/data
Sets up kamal-proxy to route HTTP traffic to the web role on 80/443

Subsequent deploys use kamal deploy (without setup or --skip-hooks). For surgical updates: kamal deploy --roles=web or --roles=ingester. The embedder role is commented out in config/deploy.yml for v1; activation is a config-only change when the chunking + embedding-model strategy is locked.

Step 7 — Verify the bind mount

The most important verification. If the container isn't bind-mounting /data from the host, your data lands on the container's writable layer (or on a non-redundant device) and you've broken the durability commitment.

ssh josh 'docker inspect $(docker ps -qf label=service=josh -f label=role=web) --format "{{ range .Mounts }}{{ .Source }} -> {{ .Destination }} ({{ .Type }}){{ println }}{{ end }}"'

Expected output includes:

/data -> /data (bind)

If you see (volume) instead of (bind), the deploy is misconfigured — the container is using a Docker named volume under /var/lib/docker/volumes/ instead of the host's /data. Fix the volumes: block in the service's config/deploy.yml and redeploy.
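The comparison against the expected line can be automated so that a post-deploy script fails loudly. A sketch, assuming the exact "source -> destination (type)" output format of the docker inspect command above; check_data_mount is a hypothetical name:

```shell
# Hypothetical helper: read the "source -> destination (type)" lines produced
# by the docker inspect command above on stdin, and succeed only if the
# container's /data is a bind mount of the host's /data.
check_data_mount() {
  mounts=$(cat)
  if printf '%s\n' "$mounts" | grep -qxF '/data -> /data (bind)'; then
    echo "ok: /data is bind-mounted from the host"
    return 0
  fi
  echo "FAIL: /data is not a host bind mount; mounts targeting /data:"
  printf '%s\n' "$mounts" | grep -F -- '-> /data (' || echo "(none)"
  return 1
}
```

To use it, pipe the output of the ssh command above into check_data_mount on your Mac; a non-zero exit status means the volumes: block needs fixing before any data is written.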

After running migrations:

ssh josh 'ls -la /data/'

You should see josh.db alongside the directories, plus josh.db-wal and josh.db-shm while a process has the database open.

Migrating from an existing host

If /data on an old host already has the SQLite file and corpus you want to keep, the path depends on whether the old and new hosts both expose detachable block storage:

Block-volume detach + reattach (zero-copy — only works between cloud VMs on the same provider):

Stop and remove containers on the old host: cd <service-dir> && kamal app remove
Provider panel: detach the volume from the old instance
Provider panel: attach the same volume to the new instance
On the new host: identify the volume via lsblk, get its UUID, add to fstab, mount at /data
Continue from Step 6 above (Kamal deploy on the new host)

The substrate file is byte-identical because nothing was copied. Not available when the destination is bare metal — there is no detachable volume there.

File-level rsync (always works, including bare-metal-to-anything):

Stop and remove containers on the old host: cd <service-dir> && kamal app remove
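On the new host (connect with ssh -A josh-new so your forwarded key can authenticate to the old host), pull the data: rsync -avzP root@<old-ip>:/data/ /data/. rsync cannot copy directly between two remote hosts, so it has to run on one end of the transfer.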
Continue from Step 6

For the SQLite file specifically, check that josh.db-wal is empty or absent (ls -la /data) before copying. A non-empty WAL means the database still had a connection open or was not shut down cleanly, and a copy might be inconsistent. Stop containers first; SQLite checkpoints the WAL when the last connection closes cleanly.
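The WAL check is easy to script as a guard in front of the copy. A minimal sketch; wal_is_quiet is a hypothetical helper that assumes the default SQLite naming of <database>-wal:

```shell
# Hypothetical helper: succeed when the SQLite database at $1 has no pending
# write-ahead log, i.e. its -wal file is absent or zero bytes.
wal_is_quiet() {
  wal="$1-wal"
  [ ! -s "$wal" ]
}

# On the old host, before starting the copy:
#   wal_is_quiet /data/josh.db || echo "WAL not checkpointed; stop containers first"
```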

From a destroyed host (no source filesystem to copy from): skip the migration entirely. Provision the new host (Steps 1–6), then restore from the latest snapshot as described in "restore from backup". This is acceptable because the substrate is regenerable from public APIs even if the snapshot is gone; re-ingestion takes hours, not days.

Decommissioning the old host

After verifying the new host has the data and is serving traffic:

Take a final snapshot of the old host (cheap insurance — image-based on cloud, rsync archive on bare metal)
kamal app remove for each service from the old host (skip if the old host is already destroyed)
Cancel the old host with the provider (cloud panel "destroy droplet" / OVHcloud "release server")
If the old host had a detachable volume: keep it around for ~1 week as a backup, then destroy
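"Verifying the new host has the data" can be made concrete by comparing a content digest of the same directory on both hosts before cancelling anything. The sketch below is one way to do that, not an established part of this workflow; tree_digest is a hypothetical helper and assumes sha256sum from GNU coreutils, as shipped with Ubuntu.

```shell
# Hypothetical helper: print one SHA-256 digest summarizing every regular file
# (relative path and content) under a directory. Two directories with the same
# files and contents produce the same digest.
tree_digest() {
  (
    cd "$1" || exit 1
    find . -type f -exec sha256sum {} + | LC_ALL=C sort | sha256sum | cut -d' ' -f1
  )
}

# Run on each host and compare the two values, e.g.:
#   tree_digest /data/corpus
```

Comparing /data/corpus is the safer choice once the new host is serving traffic, because josh.db and its WAL change as soon as the containers start writing.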

Troubleshooting

kamal setup fails with "Cannot connect to Docker daemon" on the host. Kamal couldn't install Docker. SSH manually, install via apt install -y docker.io or the provider's recommended Docker repo, then retry.

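RAID array reports degraded after reboot. cat /proc/mdstat will show e.g. [4/3] [UU_U] instead of [4/4] [UUUU] (the second number is the count of active members). File a hardware ticket with the provider to swap the failed drive; once the new drive is in and partitioned to match the surviving members, run mdadm --add /dev/md0 /dev/<new> (where <new> is the matching partition on the replacement drive) to start the rebuild. The substrate stays available while the array is degraded and rebuilding.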

Cloud-volume isn't there after reboot. Check /etc/fstab has the UUID line. Check the volume is still attached in the provider panel. mount -a should re-mount it.

Container starts but writes don't persist. Check that the docker inspect command from Step 7 shows (bind), not (volume). If it shows (volume), the volumes: block in config/deploy.yml is wrong.

kamal deploy hangs on the first build. The remote builder (ssh://ritz@ritz-cmd) needs to be reachable from your Mac. Test with ssh ritz-cmd 'echo ok'. If it fails, fix the SSH config or set builder: { local: true } in deploy.yml temporarily.

Reference

CLAUDE.md — server profile, volume-as-data-host pattern, Kamal/Docker mount pattern
docs/operations/architecture.html — full server profile, where data lives, bind-mount contract
substrate-bare-metal-host — why bare metal over a cloud VM, runway projections, drive-failure workflow
docs/operations/repo-structure.html — monorepo layout, build context, secrets symlinks
docs/operations/migrations.html — Alembic + Kamal migration pattern