Provisioning a new host
How to set up a fresh server to run the Josh substrate. Use this when:
- Spinning up the production server for the first time
- Migrating to a new provider or region
- Replacing a failed or compromised host
- Setting up a staging environment
Steps are written for our current host shape — OVHcloud Advance-1 2024 bare metal with Soft RAID 10 — but the architectural commitments (data lives at /data, Kamal-managed Docker, bind-mount contract) are provider-agnostic. The major branch in the doc is cloud VM with detachable block storage (DO/Hetzner/AWS/GCP) versus bare metal with software RAID (OVHcloud, Hetzner Robot, your own iron). We default to the bare-metal path; cloud-VM differences are called out where they matter.
What this gets you
- An Ubuntu host with /data reachable on a redundant disk array, persisted across reboots
- The directory skeleton the ingester expects (/data/corpus/, /data/locks/, /data/backups/)
- Docker installed (by Kamal, on first kamal setup)
- Two role-containers running from a single image (web + ingester), each bind-mounting /data:/data. The embedder role is defined but commented out for v1.
The substrate SQLite file at /data/josh.db materializes when the first migration runs.
Prerequisites
- A provider account with billing enabled
- A local Mac with kamal installed (gem install kamal) and SSH access
- The Josh repo cloned locally
- Decrypted .kamal/secrets (age -d -i ~/.age/key.txt .kamal/secrets.age > .kamal/secrets)
Sizing
For the substrate node:
| Resource | Minimum | Recommended | Notes |
|---|---|---|---|
| CPU | 4 cores | 6+ cores | Embedding throughput is CPU-bound when the embedder runs on the same host; modern AMD/Intel cores at 3.5+ GHz keep p50 query latency well under a second |
| RAM | 8 GB | 32 GB | SQLite is light; headroom is for embedding workers + parsers + page cache for the substrate file |
| Disk (data, after RAID) | 500 GB usable | ~2 TB usable | Holds everything substrate-related. Bare metal can't expand in-place, so size for 2–3 years of corpus growth at order time. |
| Disk redundancy | RAID 1 or zero-copy block storage | RAID 10 across 4 drives | Single-drive failure must not destroy the substrate. RAID 10 also doubles read/write throughput vs RAID 1. |
What we run today: OVHcloud Advance-1 2024, Vint Hill VA — AMD EPYC 4244P (6c/12t, 5.1 GHz boost), 32 GB DDR5 ECC, 4 × 960 GB NVMe in Soft RAID 10 (~1.92 TB usable), 3 Gbps public + 25 Gbps private. ~$165/mo with free install fee on the 2024 chipset. The "why" — and why not a cloud VM with detachable block storage — lives in substrate-bare-metal-host.
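The usable-capacity figure follows from RAID 10 mirroring every block once, so half of the raw capacity is usable. A minimal shell sketch of that arithmetic, assuming the classic mirrored-pairs layout with an even drive count of at least four; the function name raid10_usable_gb is ours and not part of any tool:

```shell
# Sketch: usable capacity of a RAID 10 array built from identical drives.
# Assumption: classic mirrored-pairs layout, so usable = raw / 2.
raid10_usable_gb() {
  drives="$1"
  size_gb="$2"
  # Reject drive counts this simple model does not cover
  if [ "$drives" -lt 4 ] || [ $((drives % 2)) -ne 0 ]; then
    echo "expected an even drive count of 4 or more" >&2
    return 1
  fi
  echo $((drives * size_gb / 2))
}
```

For the host described above, raid10_usable_gb 4 960 prints 1920, which matches the ~1.92 TB usable figure.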
If you're following on a cloud VM (DigitalOcean, Hetzner, AWS, GCP), the equivalent shape is a 4–8 vCPU instance plus a single block-storage volume sized to your corpus expectations. The instructions below note where the bare-metal flow differs from the cloud-VM flow.
Step 1 — Order the host
Bare metal (OVHcloud Advance-1 2024, what we run):
- Order via the OVHcloud Manager. Pick the Vint Hill datacenter (or wherever is geographically closest to your users).
- Storage option: 4 × 960 GB SSD NVMe Enterprise, Soft RAID 10. Confirm "Soft RAID" (mdadm), not hardware RAID — simpler recovery, no controller dependency, drives portable to any Linux box.
- OS: Ubuntu 24.04 LTS. The OVHcloud installer pre-builds the RAID 10 array and lays down the OS partition during install — you don't have to assemble the array by hand.
- SSH key: add your public key during checkout so the post-install image trusts it.
- Network: 3 Gbps public unmetered + 25 Gbps private unmetered are included on this SKU; anti-DDoS is on by default.
- Wait for the provisioning email (~hours to ~1 day on a fresh order). Note the public IPv4 it ships with.
Cloud VM alternative (DigitalOcean, Hetzner, AWS, etc.):
- Create the instance: 4–8 vCPU, 8–32 GB RAM, Ubuntu 24.04 LTS, your SSH key added.
- Choose a region near your users.
- Create a Block Storage Volume in the same region. Size as planned. Do not let the provider format-and-mount it for you — we want explicit control. (If you do, just unmount and follow Step 3.)
- Attach the volume to the instance.
- Note the instance's public IPv4.
Step 2 — First SSH and update local config
Update ~/.ssh/config:

    Host josh
      HostName <new-ip>
      User root
      IdentityFile ~/.ssh/id_rsa
      StrictHostKeyChecking accept-new

Verify:

    ssh josh 'whoami && hostnamectl'

You should see root and the OS info.
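If you provision hosts more than once, the SSH config edit can be scripted. The following is a minimal sketch that appends the stanza from this step only when no stanza for the alias exists yet. The helper name add_ssh_host and its argument order are hypothetical, and the stanza values simply mirror the ones shown above:

```shell
# Sketch: append a Host stanza to an SSH config file unless one for the
# same alias is already present. add_ssh_host is a hypothetical helper,
# not an OpenSSH command.
add_ssh_host() {
  cfg="$1"
  name="$2"
  ip="$3"
  # An existing "Host <alias>" line means we leave the file untouched
  if [ -f "$cfg" ] && grep -Eq "^Host[[:space:]]+${name}\$" "$cfg"; then
    echo "Host $name already present in $cfg"
    return 0
  fi
  printf 'Host %s\n  HostName %s\n  User root\n  IdentityFile ~/.ssh/id_rsa\n  StrictHostKeyChecking accept-new\n' \
    "$name" "$ip" >> "$cfg"
}
```

A call might look like add_ssh_host ~/.ssh/config josh <new-ip>. The sketch does not rewrite an existing stanza, so when migrating you would still edit HostName by hand.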
Step 3 — Confirm the array and mount /data
This is the architectural commitment per CLAUDE.md "Where data lives": every durable byte of substrate state lives on a redundant array, not on a single non-mirrored disk. The mechanics differ between bare metal and cloud-VM-with-detachable-volume; the contract (/data reachable, on redundant storage) is the same.
Bare-metal (Soft RAID 10 mdadm): the OVHcloud installer has already built the RAID 10 array across all 4 NVMes and laid the root filesystem on top. There's no separable block-storage volume to mount; /data is a directory inside the root filesystem on the same array. Confirm and create:

    ssh josh

    # Confirm the array spans 4 NVMe devices and is healthy
    cat /proc/mdstat
    # Expected: a line like
    #   md0 : active raid10 nvme0n1p2[0] nvme1n1p2[1] nvme2n1p2[2] nvme3n1p2[3]
    # with [4/4] [UUUU] in the status line.

    # Confirm the root filesystem rides on the array
    findmnt /
    # Expected SOURCE: /dev/md0 (or similar md device).

    # Create /data as a directory on the array (no separate fstab entry needed;
    # / already covers the whole array)
    mkdir -p /data

If your installer carved a separate /data partition during OS install (some bare-metal flows do), confirm fstab uses UUID with defaults,noatime,discard,nofail rather than the device name, because drives can be reordered when the chassis swaps a failed disk.
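The manual look at /proc/mdstat can be turned into a scriptable pass/fail. Below is a minimal sketch that reads mdstat-format text on stdin and reports whether any array has a missing member, which mdstat marks with an underscore in the [UUUU] map. The function name mdstat_status and its three status words are our own convention:

```shell
# Sketch: classify mdstat-format text read from stdin.
# Prints "healthy", "degraded", or "no-array" and returns 0, 1, or 2.
mdstat_status() {
  input=$(cat)
  # No "mdN : active" line means no running array was found
  if ! printf '%s\n' "$input" | grep -Eq '^md[0-9]+ : active'; then
    echo "no-array"
    return 2
  fi
  # A member map such as [UUU_] contains "_" for each missing device
  if printf '%s\n' "$input" | grep -Eq '\[[U_]*_[U_]*\]'; then
    echo "degraded"
    return 1
  fi
  echo "healthy"
}
```

From the Mac this could be run as ssh josh 'cat /proc/mdstat' | mdstat_status. It only inspects the member map, so an array that is mid-rebuild still reports degraded until the rebuild completes.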
Cloud-VM alternative (block storage volume): mount the attached volume by UUID at /data so device names (/dev/sda, /dev/sdb) can change without breaking fstab.

    # Identify the volume (typically /dev/sda when only one is attached)
    lsblk

    # Get its UUID
    blkid /dev/sda

    # Unmount any provider-default mountpoint and remove the empty dir
    umount /mnt/<whatever> 2>/dev/null
    rmdir /mnt/<whatever> 2>/dev/null

    # Format if blank (DO ships ext4-formatted; some providers don't)
    mkfs.ext4 /dev/sda   # only if blank; skip if already formatted

    # Create /data
    mkdir -p /data

    # Persist the mount in fstab using the UUID
    UUID=$(blkid -s UUID -o value /dev/sda)
    echo "# Block storage volume (substrate data root)" >> /etc/fstab
    echo "UUID=$UUID /data ext4 defaults,noatime,discard,nofail 0 2" >> /etc/fstab

    # Mount it
    mount -a

    # Verify
    df -h /data
    ls -la /data

Mount flags explained (cloud-VM path):

- defaults: standard rw, suid, dev, exec, auto, nouser, async
- noatime: don't update access timestamps. Reduces SSD writes; faster.
- discard: issue TRIM commands to the underlying SSD on file deletes
- nofail: don't block boot if the volume is missing (so a detached volume doesn't brick the host)
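One caveat with appending to /etc/fstab via >> is that re-running the step adds a duplicate entry. The following is a minimal sketch of an idempotent variant, assuming the same ext4 mount line as above. The function name ensure_data_fstab_entry is hypothetical, and the fstab path is a parameter so the logic can be tried on a scratch file first:

```shell
# Sketch: add the /data mount line to an fstab file only when no
# non-comment entry for /data exists yet.
ensure_data_fstab_entry() {
  uuid="$1"
  fstab="${2:-/etc/fstab}"
  line="UUID=$uuid /data ext4 defaults,noatime,discard,nofail 0 2"
  # Field 2 of an fstab entry is the mount point; skip comment lines
  if awk '$1 !~ /^#/ && $2 == "/data" { found = 1 } END { exit !found }' "$fstab" 2>/dev/null; then
    echo "fstab already has a /data entry; leaving it unchanged"
    return 0
  fi
  printf '%s\n' "$line" >> "$fstab"
}
```

On the host this might be called as ensure_data_fstab_entry "$(blkid -s UUID -o value /dev/sda)" followed by mount -a. It does not compare UUIDs, so an existing /data entry that points at the wrong volume still needs a manual fix.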
Step 4 — Create the substrate directory skeleton

    ssh josh 'mkdir -p /data/corpus /data/locks /data/backups && chmod 755 /data/corpus /data/locks /data/backups'

After this:

    /data/
    ├── corpus/      # raw + Markdown bodies (per-source subdirs created at ingest)
    ├── locks/       # ingester flock files
    ├── backups/     # Litestream local snapshots before S3 ship
    └── lost+found/  # ext4 default (present only when /data is its own ext4 filesystem)

/data/josh.db (and its WAL/SHM companion files) will be created by the first Alembic migration.
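For repeatable provisioning, the same skeleton can be created by a small function that takes the data root as a parameter, which also makes it easy to try against a scratch directory before touching the host. A minimal sketch; create_substrate_skeleton is a hypothetical name, and the directory list mirrors the three directories above:

```shell
# Sketch: create the substrate directory skeleton under a given root
# (defaults to /data) with mode 755 on each directory.
create_substrate_skeleton() {
  root="${1:-/data}"
  for sub in corpus locks backups; do
    # mkdir -p is a no-op when the directory already exists
    mkdir -p "$root/$sub" || return 1
    chmod 755 "$root/$sub" || return 1
  done
}
```

Because mkdir -p tolerates existing directories, the function is safe to re-run after a partial setup.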
Step 5 — (Optional) Configure swap
The substrate workload (SQLite + ingester + embedding) usually fits in 8 GB RAM. Skip swap unless you observe pressure.
If you want it (e.g., running a larger embedding model):

    ssh josh '
      fallocate -l 4G /swapfile &&
      chmod 600 /swapfile &&
      mkswap /swapfile &&
      swapon /swapfile &&
      echo "/swapfile none swap sw 0 0" >> /etc/fstab &&
      swapon --show
    '

On bare metal where / and /data share the RAID array, this just lives at /swapfile. On cloud VMs with a separate block-storage volume, put the swapfile on the root disk rather than the volume, which keeps the volume purely for substrate data.
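Whether the host is under memory pressure can be made concrete by comparing MemAvailable with MemTotal in /proc/meminfo. Here is a minimal sketch that reads meminfo-format text on stdin and prints available memory as a whole-number percentage. The function name mem_available_pct is hypothetical, and what percentage counts as pressure is left to the operator:

```shell
# Sketch: print MemAvailable as an integer percentage of MemTotal,
# reading /proc/meminfo-format text from stdin.
mem_available_pct() {
  awk '
    /^MemTotal:/     { total = $2 }
    /^MemAvailable:/ { avail = $2 }
    END {
      # Without a MemTotal line there is nothing to compute
      if (total > 0) printf "%d\n", avail * 100 / total
      else exit 1
    }'
}
```

From the Mac this could be run as ssh josh 'cat /proc/meminfo' | mem_available_pct. A single reading is only a snapshot, so sample it while the ingester and any embedding work are actually running.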
Step 6 — Deploy via Kamal (Docker installs itself)
You do not install Docker manually; kamal setup does it for you on first run. From the project root on your local Mac:

    cd ~/projects/josh

    # Verify secrets are decrypted
    test -f .kamal/secrets || age -d -i ~/.age/key.txt .kamal/secrets.age > .kamal/secrets

    # Update config/deploy.yml's `servers:` block with the new host IP if you migrated

    # First-time deploy: one command brings up every active role.
    # --skip-hooks is required on first deploy (the pre-deploy alembic hook
    # needs env files that kamal only pushes during setup itself, which is a
    # chicken-and-egg problem). Subsequent `kamal deploy` runs the hook normally.
    env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal setup --skip-hooks

    # Apply migrations manually for the first deploy.
    env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal app exec --reuse --roles web "alembic upgrade head"

What kamal setup does on first run:

- SSHes to the host
- Installs Docker if not present (Kamal runs Docker's convenience install script, which uses apt on Ubuntu)
- Logs into your container registry (Docker Hub) using KAMAL_REGISTRY_PASSWORD
- Pulls the consolidated image (built remotely on ritz-cmd)
- Starts every active role-container (web, ingester) with the bind mount /data:/data
- Sets up kamal-proxy to route HTTP traffic to the web role on 80/443
Subsequent deploys use kamal deploy (without setup or --skip-hooks). For surgical updates: kamal deploy --roles=web or --roles=ingester. The embedder role is commented out in config/deploy.yml for v1; activation is a config-only change when the chunking + embedding-model strategy is locked.
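Two of the manual checks in this step, decrypted secrets and an updated host IP in config/deploy.yml, are easy to forget on a migration. A minimal pre-flight sketch follows. The function name kamal_preflight is hypothetical, and the IP check is a plain substring search rather than a YAML parse, so treat a pass as a hint and not as proof:

```shell
# Sketch: check the two local prerequisites before running kamal setup.
# Arguments: project root, expected host IP.
kamal_preflight() {
  project_root="$1"
  host_ip="$2"
  secrets="$project_root/.kamal/secrets"
  deploy="$project_root/config/deploy.yml"
  # -s: file exists and is not empty
  if [ ! -s "$secrets" ]; then
    echo "missing or empty: $secrets (decrypt it first)" >&2
    return 1
  fi
  # Fixed-string search; does not verify the IP sits under servers:
  if ! grep -Fq "$host_ip" "$deploy" 2>/dev/null; then
    echo "$deploy does not mention $host_ip" >&2
    return 1
  fi
  echo "preflight ok"
}
```

A call might look like kamal_preflight ~/projects/josh <new-ip> right before kamal setup --skip-hooks.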
Step 7 — Verify the bind mount
The most important verification. If the container isn't bind-mounting /data from the host, your data lands on the container's writable layer (or on a non-redundant device) and you've broken the durability commitment.

    ssh josh 'docker inspect $(docker ps -qf label=service=josh -f label=role=web) --format "{{ range .Mounts }}{{ .Source }} -> {{ .Destination }} ({{ .Type }}){{ println }}{{ end }}"'

Expected output includes:

    /data -> /data (bind)

If you see (volume) instead of (bind), the deploy is misconfigured: the container is using a Docker named volume under /var/lib/docker/volumes/ instead of the host's /data. Fix the volumes: block in the service's config/deploy.yml and redeploy.
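The bind-mount check lends itself to a scripted assertion, for example as a post-deploy smoke test. The following minimal sketch reads the docker inspect output in the "source -> destination (type)" format used above from stdin and passes only when the exact bind line is present. The function name check_data_bind is hypothetical:

```shell
# Sketch: assert that the mount listing on stdin contains the exact
# line "/data -> /data (bind)".
check_data_bind() {
  mounts=$(cat)
  # -F fixed string, -x whole-line match, -q quiet
  if printf '%s\n' "$mounts" | grep -Fxq '/data -> /data (bind)'; then
    echo "ok: /data is bind-mounted from the host"
    return 0
  fi
  echo "FAIL: no '/data -> /data (bind)' line found" >&2
  return 1
}
```

Piping the docker inspect command from this step into check_data_bind gives a non-zero exit status when the container is on a named volume, which makes the check usable in scripts.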
After running migrations:

    ssh josh 'ls -la /data/'

You should see josh.db, josh.db-wal, josh.db-shm alongside the directories.
Migrating from an existing host
If /data on an old host already has the SQLite file and corpus you want to keep, the path depends on whether the old and new hosts both expose detachable block storage:
Block-volume detach + reattach (zero-copy — only works between cloud VMs on the same provider):
- Stop containers on old host: cd <service-dir> && kamal app remove
- Provider panel: detach the volume from the old instance
- Provider panel: attach the same volume to the new instance
- On the new host: identify the volume via lsblk, get its UUID, add to fstab, mount at /data
- Continue from Step 6 above (Kamal deploy on the new host)
The substrate file is byte-identical because nothing was copied. Not available when the destination is bare metal — there is no detachable volume there.
File-level rsync (always works, including bare-metal-to-anything):
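- Stop containers on old host: cd <service-dir> && kamal app remove
- Copy the data. rsync cannot copy between two remote hosts in a single invocation, so log in to the new host and pull from the old one: rsync -avzP root@<old-ip>:/data/ /data/ (the new host needs SSH access to the old one, for example through agent forwarding with ssh -A josh-new)
- Continue from Step 6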
For the SQLite file specifically, check that josh.db-wal is empty or absent (ls -la) before copying. A non-empty WAL means the database was active, and a copy might be inconsistent. Stop containers first; the WAL checkpoints on clean shutdown.
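The WAL check can be scripted so the copy is refused while the database still has uncheckpointed writes. A minimal sketch, assuming the WAL file sits next to the database with the usual -wal suffix; wal_is_clean is a hypothetical helper name:

```shell
# Sketch: succeed only when <db>-wal is absent or zero bytes long.
wal_is_clean() {
  wal="$1-wal"
  # -s is true when the file exists and has a size greater than zero
  if [ -s "$wal" ]; then
    echo "non-empty WAL: $wal (stop the containers and re-check)" >&2
    return 1
  fi
  return 0
}
```

On the old host this could gate the copy, for example wal_is_clean /data/josh.db && echo "safe to copy". It only looks at file size, so it still relies on the containers being stopped first.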
From a destroyed host (no source filesystem to copy from): skip the migration entirely. Provision the new host (Steps 1–6), then restore from the latest snapshot per restore from backup. Acceptable because the substrate is regenerable from public APIs even if the snapshot is gone — re-ingestion takes hours, not days.
Decommissioning the old host
After verifying the new host has the data and is serving traffic:
- Take a final snapshot of the old host (cheap insurance — image-based on cloud, rsync archive on bare metal)
- kamal app remove for each service from the old host (skip if the old host is already destroyed)
- Cancel the old host with the provider (cloud panel "destroy droplet" / OVHcloud "release server")
- If the old host had a detachable volume: keep it around for ~1 week as a backup, then destroy
Troubleshooting
kamal setup fails with "Cannot connect to Docker daemon" on the host. Kamal couldn't install Docker. SSH manually, install via apt install -y docker.io or the provider's recommended Docker repo, then retry.
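RAID array reports degraded after reboot. cat /proc/mdstat will show e.g. [4/3] [UUU_] instead of [4/4] [UUUU] (the first number is the configured member count, the second is the number currently active). File a hardware ticket with the provider to swap the failed drive; once the new drive is in, run mdadm --add /dev/md0 /dev/<new> to start the rebuild. Because the array members are partitions (nvme0n1p2 and so on in the Step 3 output), <new> is the matching partition on the replacement drive, not the whole disk. The substrate stays available throughout.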
Cloud-volume isn't there after reboot. Check /etc/fstab has the UUID line. Check the volume is still attached in the provider panel. mount -a should re-mount it.
Container starts but writes don't persist. Check docker inspect ... --format ... shows (bind) not (volume). If it's volume, the deploy.yml is wrong.
kamal deploy hangs on the first build. The remote builder (ssh://ritz@ritz-cmd) needs to be reachable from your Mac. Test with ssh ritz-cmd 'echo ok'. If it fails, fix the SSH config or set builder: { local: true } in deploy.yml temporarily.
Reference
- CLAUDE.md: server profile, volume-as-data-host pattern, Kamal/Docker mount pattern
- docs/operations/architecture.html: full server profile, where data lives, bind-mount contract
- substrate-bare-metal-host: why bare metal over a cloud VM, runway projections, drive-failure workflow
- docs/operations/repo-structure.html: monorepo layout, build context, secrets symlinks
- docs/operations/migrations.html: Alembic + Kamal migration pattern