substrate · in progress · p0

Provision substrate on OVHcloud bare-metal host

substrate-bare-metal-host · updated 2026-05-12T19:00:00Z · owner rritz


Stand up the substrate fresh on an OVHcloud Advance-1 2024 dedicated
bare-metal server in Vint Hill, VA. The prior DigitalOcean droplet + 500 GB
block-storage volume was destroyed on 2026-05-10; the substrate is mostly
regenerable from public APIs, so the rebuild path is "provision OVHcloud
→ run migrations → re-ingest" rather than rsync-and-cutover.

The motivation for OVHcloud over re-provisioning DigitalOcean is runway:
at the substrate's projected ingestion volumes, DO's storage pricing curve
gets brutal and the 500 GB volume cliff would hit within a year on
conservative backfills. The Advance-1 2024 with 4 × 960 GB NVMe in Soft RAID 10
is the closest US analog to a Hetzner-style host: ~$165/mo configured
(free install), ~1.92 TB usable, 32 GB DDR5 ECC RAM (upgradable
to 192 GB later), EPYC 4244P (6c/12t, 5.1 GHz boost), anti-DDoS included.
Buys 5+ years of runway under the conservative backfill scenario
(~250 GB substrate), ~1–2 years even under aggressive full-historical backfill
(~1.5 TB). Vint Hill is east-coast (low-latency to the mid-Atlantic).

Picked the 2024 chipset over the 2026 because the CPU difference is
marginal for our workload (single-thread is comparable, neither is the
bottleneck), the install fee is free vs $134, and the price-per-disk is
better.

Bare-metal trade-off accepted: disk is not online-expandable (RAM is).
We size disks generously now to skip the next migration. When we eventually
need more capacity, the path is "order a second server, scale out" rather
than "expand the volume in place." The 1.92 TB usable target is sized to
give us a credible 2–3 year horizon without a re-host.

As the operator paying the cloud bill, I want the substrate to live on a host whose pricing scales with disk size by ~5×, not ~30× of equivalent DO storage so that I'm not migrating again at $300+/mo cloud spend.

As the operator who picked volume-as-data-host, I want the bulk-NVMe RAID array mounted at /data to preserve the same architectural commitment as the prior DO setup so that the existing `/data:/data` Kamal bind mount, the backup script, and the operations runbooks all keep working unchanged.

As an OSS self-hoster reading our docs to plan their own deployment, I want a clear "we run on bare metal at OVHcloud, here's why and the trade-offs" so that I can pick a similar shape (or pick differently) with eyes open.

  1. When the new OVHcloud server is provisioned, the host shall expose a single Soft RAID 10 array across all four NVMe drives, with `/` and `/data` both living on it (separate ext4 partitions or a single root with `/data` as a subdirectory — the durable contract is `/data`, not the partition layout).
  2. When `kamal setup` runs against the new host, the resulting `josh-core` container shall bind-mount `/data:/data` and `docker inspect` shall report `/data -> /data (bind)` (not `volume`).
  3. When `alembic upgrade head` runs against the new host's substrate file, the schema version shall match `head` in `shared/josh_substrate/src/josh_substrate/migrations/versions/`.
  4. Where the substrate is hosted on bare metal, the SQLite file shall live at `/data/josh.db` (on the RAID 10 array, never on a non-mirrored device).
  5. When the nightly backup script runs on the new host, the resulting snapshot shall appear in the configured object-storage bucket and pass the `substrate-nightly-backup` success_determiner (round-trip download + open + row counts >= 0).
  6. After the new host has been stable for 7 days under regular ingestion + backup load, the spec shall flip to shipped.
  7. When `df /data` reports >70% used (~1.35 TB), the operator shall plan the next hardware migration (this is the trigger to revisit, not an alert today).
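Criterion 2 can be checked mechanically by filtering the mount list that `docker inspect` prints. The following is a minimal sketch, assuming the mount lines have already been rendered in the `source -> destination (type)` shape; the function name `has_data_bind_mount` is illustrative and not part of the repo.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads rendered mount lines on stdin, one per line,
# e.g. "/data -> /data (bind)". Succeeds only if /data is a host bind mount.
has_data_bind_mount() {
  grep -qx '/data -> /data (bind)'
}

# Example: a bind mount passes, a named Docker volume does not.
printf '%s\n' '/data -> /data (bind)' | has_data_bind_mount && echo "bind: ok"
printf '%s\n' '/var/lib/docker/volumes/x/_data -> /data (volume)' \
  | has_data_bind_mount || echo "volume: rejected"
```

Matching the whole line (`-x`) is deliberate: a substring match would also accept a mount whose source merely ends in `/data`.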
kind: bash

Command

set -euo pipefail
ssh josh '
  set -euo pipefail

  # 1. RAID 10 array exists across 4 devices
  grep -q "raid10" /proc/mdstat \
    || { echo "FAIL: no raid10 array in /proc/mdstat"; exit 1; }
  # Member names may carry a partition suffix (e.g. nvme0n1p3), so allow it.
  grep -A2 "raid10" /proc/mdstat | grep -oE "nvme[0-9]+n[0-9]+(p[0-9]+)?\[[0-9]+\]" \
    | wc -l | grep -q "^4$" \
    || { echo "FAIL: raid10 array does not span 4 devices"; exit 1; }

  # 2. /data is mounted (could be a separate partition OR a directory on /;
  #    either is fine — the contract is the path, not the layout)
  findmnt /data >/dev/null 2>&1 || test -d /data \
    || { echo "FAIL: /data not reachable via either mount or path"; exit 1; }
  test -d /data || { echo "FAIL: /data is not a directory"; exit 1; }

  # 3. If /data is a separate mount, fstab uses UUID. If it is just a
  #    directory on /, the / mount is enough. Either is acceptable.
  if findmnt /data >/dev/null 2>&1 && [ "$(findmnt -no SOURCE /data)" != "$(findmnt -no SOURCE /)" ]; then
    grep -q "^UUID=.* /data " /etc/fstab \
      || { echo "FAIL: /data fstab entry not UUID-based"; exit 1; }
  fi

  # 4. Substrate file on /data
  test -f /data/josh.db || { echo "FAIL: /data/josh.db not present"; exit 1; }

  # 5. josh-core container bind-mounts /data
  docker inspect "$(docker ps -qf name=josh-core)" \
    --format "{{ range .Mounts }}{{ .Source }} -> {{ .Destination }} ({{ .Type }}){{ println }}{{ end }}" \
    | grep -q "^/data -> /data (bind)$" \
    || { echo "FAIL: josh-core not bind-mounting /data"; exit 1; }

  # 6. Backup timer enabled and ran in last 25h
  systemctl is-enabled josh-backup.timer 2>/dev/null | grep -q "enabled" \
    || { echo "FAIL: josh-backup.timer is not enabled"; exit 1; }
  source /etc/josh-backup.env
  export AWS_REQUEST_CHECKSUM_CALCULATION=when_required
  latest=$(/usr/local/bin/aws s3 ls "s3://${BACKUP_BUCKET}/${BACKUP_PREFIX}/" \
    --endpoint-url "$BACKUP_ENDPOINT" | sort | tail -1)
  ts=$(echo "$latest" | awk "{print \$1\" \"\$2}")
  age=$(( ( $(date -u +%s) - $(date -u -d "$ts" +%s) ) / 3600 ))
  [ "$age" -lt 25 ] || { echo "FAIL: latest backup ${age}h old (>25h)"; exit 1; }

  # 7. Storage utilization headroom check
  pct=$(df --output=pcent /data | tail -1 | tr -dc 0-9)
  echo "/data is ${pct}% full (trigger to plan next migration: 70%)"

  echo OK
'

Expect

Final line is `OK`. All six functional checks (RAID 10 × 4 devices, /data path reachable, UUID-based fstab if separately mounted, substrate file present, container bind-mount intact, recent backup) pass. The utilization line is informational — not a failure trigger today.

Run from local Mac. `josh` SSH alias must be repointed at the new host's IP at first-deploy time.
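The 25-hour freshness rule in check 6 is plain timestamp arithmetic and can be exercised without a bucket. A minimal sketch, assuming GNU `date` (as on the Ubuntu host); the function names `backup_age_hours` and `backup_is_fresh` are hypothetical and do not exist in the repo.

```shell
#!/usr/bin/env bash
# Hypothetical helper: whole hours between a UTC timestamp and "now".
# Args: $1 = "YYYY-MM-DD HH:MM:SS" (UTC), $2 = now as epoch seconds (optional).
# Assumes GNU date (the -d flag).
backup_age_hours() {
  local ts="$1" now="${2:-$(date -u +%s)}" then_s
  then_s=$(date -u -d "$ts" +%s) || return 1
  echo $(( (now - then_s) / 3600 ))
}

# Hypothetical helper: succeeds if the backup is younger than the limit.
# Args: $1 = timestamp, $2 = now (optional), $3 = limit in hours (default 25).
backup_is_fresh() {
  local age
  age=$(backup_age_hours "$1" "${2:-}") || return 1
  [ "$age" -lt "${3:-25}" ]
}
```

On the host, the timestamp would come from the first two columns of the newest `aws s3 ls` line, as in the determiner. Passing `now` explicitly keeps the function deterministic for testing.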

None.

  • Horizontal scaling / multi-host topology (future spec; "scale by adding a second server" is the v2 answer)
  • High-availability failover (v1 is single-node; the backup is the disaster-recovery story)
  • Multi-region presence (single region Vint Hill is fine for v1)
  • Splitting any out-of-substrate state into a separate file — out of scope here
  • RAM upgrade beyond 32 GB (deferred until query load justifies; OVHcloud RAM upgrades are a few hours of downtime, not a migration)
  • Migration from DO (no longer applicable — DO droplet was destroyed 2026-05-10 ahead of this provision)

## Hardware

- OVHcloud Advance-1 2024 (https://us.ovhcloud.com/bare-metal/advance/adv-1/)
- Datacenter: Vint Hill, VA — east coast, low-latency to the mid-Atlantic
- CPU: AMD EPYC 4244P (6c/12t, 3.8–5.1 GHz)
- RAM: 32 GB DDR5 ECC at order. Upgradable to 192 GB later (~hours of downtime).
- Storage: single Soft RAID 10 array across 4 × 960 GB NVMe drives
→ ~1.92 TB usable. OS lives on a small partition (typically 50–100 GB);
/data gets the rest. (Or the whole array is mounted at / with
/data as a directory — equivalent for our purposes.)
- Network: 1–5 Gbps public, 25 Gbps private, anti-DDoS
- Configured price: ~$165/mo with free install fee on the 2024
chipset.

## Why 4×960 RAID 10 over alternatives

- vs 2×960 RAID 1 ($115/mo, 960 GB usable): $50/mo more buys 2× the
storage runway and meaningfully better IOPS via striping. Same single-
drive fault tolerance.
- vs 2×15.36 TB RAID 1 ($300–400/mo, 15 TB usable): That config sizes
for full-historical Bills + Federal Register backfill. We're not
committing to that for v1, and the price-per-month is ~2× too much for
a ceiling we won't touch in 2–3 years.
- vs 2026 chipset variant: Marginal CPU improvement isn't worth $19/mo
+ $134 install fee for our workload. The 2024's free install effectively
pays for ~7 months at the price differential.
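The "~7 months" figure is the install fee divided by the monthly price gap. A small sketch of that arithmetic, using only the numbers quoted above; `breakeven_months` is an illustrative name.

```shell
#!/usr/bin/env bash
# Whole months of the monthly price gap that the avoided install fee covers.
# Args: $1 = install fee in dollars, $2 = monthly price difference in dollars.
breakeven_months() {
  local install_fee="$1" monthly_delta="$2"
  echo $(( install_fee / monthly_delta ))   # integer division, rounds down
}

breakeven_months 134 19
```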

## RAID 10 strategy

- Soft RAID 10 (mdadm) across all 4 drives. Striped mirror = double
the read/write throughput vs RAID 1, with the same single-drive fault
tolerance (and survives a 2-drive failure if they're in different mirror
pairs).
- mdadm-managed. Simpler recovery than hardware RAID, no controller
dependency, drives portable to any Linux box. Modern CPU overhead is
negligible.
- OVHcloud's installer pre-configures RAID at install time. We pick "Soft
RAID 10" on all 4 disks during the install flow.
- Drive failure workflow: file ticket with OVHcloud, they swap the failed
disk in the chassis, we rebuild the array (`mdadm --add /dev/mdX /dev/<new>`).
A few hours degraded, then back to healthy.
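The drive-failure workflow starts with noticing that the array is degraded. One way to detect that is to look for a missing member (`_`) in the status field of `/proc/mdstat`. This is a sketch under assumptions: the function name is hypothetical, and the sample text below is made-up illustration data in the usual mdstat layout, not output captured from the host.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads /proc/mdstat-formatted text on stdin and succeeds
# if any array shows a missing member ("_" inside the [UUUU]-style field).
mdstat_is_degraded() {
  grep -qE '\[[U_]*_[U_]*\]'
}

# Illustrative samples only (device names and block counts are invented).
healthy='md3 : active raid10 nvme3n1p3[3] nvme2n1p3[2] nvme1n1p3[1] nvme0n1p3[0]
      1874190336 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]'
degraded='md3 : active raid10 nvme3n1p3[3] nvme1n1p3[1] nvme0n1p3[0]
      1874190336 blocks super 1.2 512K chunks 2 near-copies [4/3] [UU_U]'

if printf '%s\n' "$degraded" | mdstat_is_degraded; then
  echo "array degraded: open a ticket, then re-add the replacement disk"
fi
```

On the host this would be invoked as `mdstat_is_degraded < /proc/mdstat`.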

## Volume-as-data-host commitment, on bare metal

The architectural commitment from substrate-volume-mount carries forward
with one mechanical change: there's no separable block-storage volume on
bare metal. The RAID 10 array *is* the data host.

- /data/josh.db (+ -wal, -shm) — SQLite substrate
- /data/corpus/<source>/bodies/{raw,markdown}/... — bodies
- /data/locks/ingest-<source>.lock — flock files
- /data/backups/ — local snapshot staging (currently unused; the nightly
backup script writes directly to a mktemp -d under PrivateTmp)

The "data is on /data, root disk is throwaway" commitment is now
aspirational rather than physically enforced — the same array holds both.
In practice, the OS partition uses ~5% of the array, so the commitment is
preserved by convention: write durable state only to /data, treat
everything else as reproducible from kamal deploy.
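Because the boundary is now a convention, it helps to be able to create and verify the `/data` layout in one step. A minimal sketch, assuming only the three top-level directories listed above; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create the durable-state skeleton under a data root.
# $1 defaults to /data; pass another path to try it without touching the host.
make_data_skeleton() {
  local root="${1:-/data}"
  mkdir -p "$root"/corpus "$root"/locks "$root"/backups
}

# Hypothetical helper: succeeds only if every expected directory exists.
data_skeleton_ok() {
  local root="${1:-/data}" d
  for d in corpus locks backups; do
    [ -d "$root/$d" ] || { echo "missing: $root/$d"; return 1; }
  done
}

# Dry run against a scratch directory.
scratch=$(mktemp -d)
make_data_skeleton "$scratch/data" && data_skeleton_ok "$scratch/data" \
  && echo "skeleton ok"
rm -rf "$scratch"
```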

## Storage runway projections

Starting from ~0 GB on /data (fresh provision; everything ingests
forward):

- CRS Reports re-backfilled + embedded ≈ ~14 GB (10 GB DB + 4 GB
embeddings)
- v1 conservative backfill (last 2 yr per source, current congress for
Bills): ~180–280 GB total → 5+ years runway at 1.92 TB usable
- v1 aggressive backfill (full historical FR + all bill text versions):
~1.2–1.8 TB → ~12–24 months runway at 1.92 TB usable

Trigger to revisit hardware: df /data reports >70% (~1.35 TB used).
Migration path at that point is "order a successor server (16-core EPYC
with 4× larger NVMe), rsync across, decommission old."
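The capacity and trigger figures follow from two small calculations. The sketch below works in nominal GB (the formatted filesystem will report somewhat less); the function names are illustrative.

```shell
#!/usr/bin/env bash
# RAID 10 keeps two copies of every block, so usable space is half the raw total.
# Args: $1 = number of drives, $2 = size of each drive in GB.
raid10_usable_gb() {
  local drives="$1" size_gb="$2"
  echo $(( drives * size_gb / 2 ))
}

# Space consumed when usage reaches a given percentage of capacity.
# Args: $1 = usable capacity in GB, $2 = percentage.
threshold_gb() {
  local usable_gb="$1" pct="$2"
  echo $(( usable_gb * pct / 100 ))
}

usable=$(raid10_usable_gb 4 960)
echo "usable: ${usable} GB, 70% trigger: $(threshold_gb "$usable" 70) GB"
```

With 4 × 960 GB this gives 1920 GB usable and a 70% trigger at 1344 GB, which matches the ~1.92 TB and ~1.35 TB figures quoted above.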

## Provision sequence (fresh, no DO migration)

Ordered as a clean stand-up. The order has been placed; expected ready
within ~1 day. Pre-server prep is complete — see
substrate-nightly-backup for the backup-side details (restic repo
pre-initialized at s3://josh-bucket/josh-db-restic, install scripts
ready in bin/josh-backup/, secrets re-encrypted with RESTIC_PASSWORD
in .kamal/secrets.age). Once the OVHcloud host is reachable:

1. First SSH and initial config. Add josh to ~/.ssh/config pointing
at the new IP. Verify Ubuntu 24.04 LTS, RAID 10 array on all 4 NVMes
via cat /proc/mdstat, network up. Drop a basic security baseline
(root SSH key only, no password auth, ufw allow ssh + 80 + 443).
2. Set up /data. If / already covers the full RAID 10 array,
mkdir -p /data /data/{corpus,locks,backups} is enough. If the
installer carved a separate /data partition, confirm fstab uses UUID
with defaults,noatime,discard,nofail.
3. Decrypt Kamal secrets locally:
`age -d -i ~/.age/key.txt .kamal/secrets.age > .kamal/secrets`.
Repoint the project-root config/deploy.yml's servers: blocks at the new IP.
4. kamal setup --skip-hooks from project root. With secrets decrypted:
env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal setup --skip-hooks.
One command builds the consolidated image on ritz-cmd, pushes to
Docker Hub, pulls on the new host, and starts every active role
(web, ingester) with the bind mount /data:/data. --skip-hooks
is needed only on first deploy (the alembic pre-deploy hook needs env
files that kamal hasn't pushed yet — chicken and egg).
5. Apply migrations manually on first deploy. Same shell:
kamal app exec --reuse --roles web "alembic upgrade head".
Subsequent kamal deploy runs apply migrations automatically via the
pre-deploy hook. The embedder role stays commented out in
config/deploy.yml for v1; activation later is uncomment + kamal deploy --roles=embedder.
6. Install the backup script. From local Mac:
JOSH_HOST=josh bin/josh-backup/install.sh. The script copies the
systemd unit + timer + env template to the host, installs restic +
jq + sqlite3, creates /var/cache/restic and /data/backups,
prompts you to populate /etc/josh-backup.env from .kamal/secrets
(map BACKUP_ACCESS_KEY_ID → AWS_ACCESS_KEY_ID,
BACKUP_SECRET_ACCESS_KEY → AWS_SECRET_ACCESS_KEY,
RESTIC_PASSWORD → RESTIC_PASSWORD), then enables the timer.
7. Smoke-test the backup:
`ssh josh 'systemctl start josh-backup.service && journalctl -u josh-backup.service -n 50 --no-pager'`.
The first run connects to the already-initialized restic repo (init was
done from local Mac during pre-server prep, so no restic init is needed)
and writes snapshot 1. Verify with
`ssh josh 'source /etc/josh-backup.env && restic snapshots'`.
8. Round-trip restore. Per the runbook in
https://docs.usejosh.com/operations/restore-from-backup/: restic restore latest →
verify SQLite + corpus → confirm restic check passes. This satisfies
substrate-nightly-backup t8 and clears the path to flip both this
spec and the backup spec to verified.
9. Re-ingest. Kick off the source backfills per their respective
ingester specs:
- legislators-and-committees-ingester (small, fast — minutes)
- crs-reports-ingester (~23K reports — hours)
- The shipping-source roster expands per the planned source-ingester
specs (bills, federal-register, public-laws, etc.).
10. Verify the substrate. Run the success_determiner. Spot-check FTS5
search, legislator lookups, latest CRS row.
11. DNS / external clients. No production DNS today (no users), but
when there is, this is where it points at the new IP.
12. Stable observation period (7 days). Let the regular ingester +
cron + backup cycle run. Watch journal for unexpected errors. Then
flip the spec from verified → shipped.
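The variable mapping in step 6 can be expressed as a small filter. This is only a sketch: it assumes the decrypted secrets are plain `KEY=value` lines, which may not match the real `.kamal/secrets` format, and `map_backup_env` is a hypothetical name that is not part of `bin/josh-backup/`. It covers the three mapped variables only, not the other values in `/etc/josh-backup.env`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read KEY=value lines on stdin and print the three
# variables the backup env file needs, renamed per the mapping in step 6.
# Lines for any other key are dropped.
map_backup_env() {
  sed -n \
    -e 's/^BACKUP_ACCESS_KEY_ID=/AWS_ACCESS_KEY_ID=/p' \
    -e 's/^BACKUP_SECRET_ACCESS_KEY=/AWS_SECRET_ACCESS_KEY=/p' \
    -e 's/^RESTIC_PASSWORD=/RESTIC_PASSWORD=/p'
}

# Example with placeholder values (not real credentials).
printf '%s\n' 'BACKUP_ACCESS_KEY_ID=example-id' 'UNRELATED=1' | map_backup_env
```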

## What does NOT change vs the prior DO setup

- The Kamal deploy pattern (build on ritz-cmd, push to Docker Hub, pull
on host) — same.
- The bind-mount commitment (/data:/data) — same.
- The substrate file format — byte-identical.
- The age-encrypted secrets pattern — same.

## What DOES change vs the prior DO setup

- Server cost profile. ~$165/mo vs ~$98/mo DO previously (+$67/mo).
Crosses over much later — at multi-TB substrate the equivalent DO
config would be $1,500+/mo just for the volume.
- Volume portability story. No detachable volume; future host changes
are rsync-based, not zero-copy.
- Drive failure workflow. Was "DO replaces hardware transparently."
Now "OVHcloud replaces drive on ticket; we rebuild RAID array."
- Headroom. ~1.92 TB usable RAID 10 vs the prior 500 GB DO volume.
~4× more runway before the next migration, with much higher IOPS.

10 of 16 done.

  • t1 Order Advance-1 2024 (Vint Hill, 4×960 GB NVMe Soft RAID 10, 32 GB RAM, free install) — placed 2026-05-10; delivered 2026-05-11
  • t2 Initial SSH + add `josh` alias in `~/.ssh/config` against new IP (<your-host-ip>) — done 2026-05-11
  • t3 Verify RAID 10 array spans 4 devices (`/proc/mdstat`); confirm Ubuntu 24.04 LTS; basic security baseline (root SSH key only, no passwords, ufw allow ssh + 80 + 443). All done 2026-05-11: RAID/OS verified, root SSH key authorized (via `cp /home/ubuntu/.ssh/authorized_keys → /root/.ssh/authorized_keys` — OVH installer creates user `ubuntu`, not root), `ufw` configured as `limit 22/tcp` + `allow 80/tcp` + `allow 443/tcp` with `deny incoming, allow outgoing` defaults.
  • t4 Carve out / partition (or single partition); create `/data` (separate ext4 partition mounted via fstab UUID with `defaults,noatime,discard,nofail` — OR a directory on the root partition; either is fine). Done as single root partition on RAID 10 (`md3`), `/data` is a directory on `/`.
  • t5 Create /data skeleton (`corpus/`, `locks/`, `backups/`) — done 2026-05-11
  • t6 `kamal setup` against new host; verify bind-mount via `docker inspect`; alembic migrations apply against fresh /data/josh.db. Done 2026-05-11 (initial single-service josh-core deploy) and redone 2026-05-12 after consolidation to single-image multi-role per `substrate-single-image-deploy`. Confirmed end-state: bind mount `/data -> /data (bind)` on both web + ingester roles, alembic at `0005 (head)`, `/health` returns 200. Workaround for first-deploy chicken-and-egg documented in https://docs.usejosh.com/operations/migrations/#first-deploy.
  • t7 Worker roles deployed via consolidation. Reframed 2026-05-12: instead of `kamal setup josh-ingester` + `kamal setup josh-embedder` (separate Kamal apps), the worker roles now ship from the same image as the web role. ingester role is live; embedder role is defined in `Dockerfile` but commented out in `config/deploy.yml` for v1 (chunking + embedding-model strategy not yet locked — see `substrate-embedding-evaluation`). Consolidation spec: `substrate-single-image-deploy`.
  • t8 [server-blocked] `JOSH_HOST=josh bin/josh-backup/install.sh` from local Mac (one shot — script is fully ready); populate `/etc/josh-backup.env` from `.kamal/secrets`; smoke-test `systemctl start josh-backup.service`. Restic repo is pre-initialized, so first backup writes snapshot 1 directly.
  • t9 [server-blocked] Round-trip restore per `https://docs.usejosh.com/operations/restore-from-backup/` — satisfies substrate-nightly-backup t8
  • t10 [server-blocked] Re-ingest legislators-and-committees + CRS reports backfill (the two previously-shipped sources)
  • t11 [server-blocked] Run success_determiner; spot-check FTS5 + legislator + latest CRS row
  • t12 [server-blocked, blocks ship] 7-day stable observation: regular ingester + cron + backup cycle without unexpected errors
  • t13 Update `CLAUDE.md`: server profile (DO droplet → EPYC 4244P bare metal), 'Where data lives' section (RAID 10 array instead of detachable volume)
  • t14 Update `https://docs.usejosh.com/operations/new-host-setup/`: bare-metal section (Soft RAID 10 setup, OVHcloud-specific notes)
  • t15 [server-blocked] Spec lifecycle: in_progress → verified → shipped
  • t16 Update `https://docs.usejosh.com/operations/architecture/`: server profile + 'Where data lives' sections reframed for OVHcloud bare metal + Soft RAID 10 (was DO droplet + detachable block-storage volume). Added in-flight after t13/t14 because the docs would silently disagree otherwise.
  • 2026-05-12T19:00:00Z in_progress → in_progress **t7 reframed and marked done via single-image consolidation.** Original t7 planned three separate `kamal setup` invocations (one per service: josh-core, josh-ingester, josh-embedder). Instead, executed the consolidation per `substrate-single-image-deploy`: one Kamal service `josh` with multiple roles built from one shared image. `kamal setup --skip-hooks` from project root brings up every active role in one shot.

    End-state on the OVH host:
    - 2 role-containers running from `robertritz/josh:<sha>`: `josh-web-<sha>` (FastAPI on 8000, fronted by kamal-proxy on 80/443) and `josh-ingester-<sha>` (scheduler daemon, no proxy).
    - `embedder` role defined in `Dockerfile` but commented out in `config/deploy.yml` for v1 (chunking + embedding-model strategy still TBD; tracked in `substrate-embedding-evaluation`).
    - Pre-deploy hook rewritten to drop the `basename "$PWD" != "josh-core"` gate and target `--roles web` for the alembic invocation.
    - Verified: `docker inspect` shows `/data -> /data (bind)` on both roles; `alembic current` → `0005 (head)`; `curl /health` → 200; subsequent `kamal deploy` (no `--skip-hooks`) succeeds end-to-end (proves hook works against env files already on the host).
    - `/data/josh.db` sha256 unchanged across the entire tear-down + rebuild sequence (host bind mount, not a Docker volume).

    Per-service deploy artifacts (`josh-core/Dockerfile`, `josh-core/config/deploy.yml`, `josh-core/.kamal` symlink, and the matching trio under josh-ingester/ and josh-embedder/) deleted. Docs updated: `architecture.html`, `repo-structure.html`, `migrations.html`, `new-host-setup.html`.

    Rollback note for this work: the build first failed because `ritz-cmd` (remote builder) was unreachable mid-session (user VPN dropped). josh-core was already torn down at that point. Restored via `kamal setup --skip-push --version=31a134d...` pulling the prior image from Docker Hub. Once VPN came back, tore down again and ran the consolidated `kamal setup --skip-hooks` cleanly. Substrate file untouched throughout.
  • 2026-05-11T06:00:00Z in_progress → in_progress **t6 done: josh-core deployed via kamal setup.**

    **First-deploy chicken-and-egg, documented.** The pre-deploy hook (`alembic upgrade head` via `kamal app exec --version <new>`) needs `.kamal/apps/josh-core/env/roles/web.env` on the host, but kamal only writes that file when starting the first container, which happens *after* the pre-deploy hook. First `kamal setup` failed with:

    ```
    docker: open .kamal/apps/josh-core/env/roles/web.env: no such file
    ERROR (Kamal::Cli::HookError): Hook `pre-deploy` failed
    ```

    Workaround used (and now documented at `https://docs.usejosh.com/operations/migrations/#first-deploy`):

    ```
    kamal setup --skip-hooks
    kamal app exec --reuse "alembic upgrade head"
    ```

    The hook itself is correct; only the first-deploy ordering is broken. Subsequent `kamal deploy` runs work cleanly because env files persist on the host. Fixing the hook to probe each host for the env file would add complexity for a once-per-host edge case; cheaper to document the bootstrap.

    **Verified post-deploy state on the OVH host:**
    - `docker inspect` confirms `/data -> /data (bind)` (satisfies acceptance criterion 2)
    - `alembic current` reports `0005 (head)` (satisfies acceptance criterion 3: schema version matches migrations directory)
    - `/data/josh.db` materialized at 487 KB on the RAID 10 array (satisfies acceptance criterion 4)
    - kamal-proxy publishes 80/443, `curl http://josh/health` returns `{"status":"ok"}`

    Acceptance criteria 1 (RAID 10 across 4 NVMe), 2 (bind-mount), 3 (alembic head), and 4 (substrate on RAID 10) all satisfied. Still open: AC5 (nightly backup snapshot in S3; t8/t9) and AC6 (7-day stable observation; t12). Next blocker is t7: `kamal setup` for josh-ingester + josh-embedder.
  • 2026-05-11T05:30:00Z in_progress → in_progress **t3 done: ufw baseline active.** Sequence (run from local Mac via `ssh josh`):

    ```
    ufw default deny incoming
    ufw default allow outgoing
    ufw limit 22/tcp    # SSH (rate-limited: 6 conns / 30s per IP)
    ufw allow 80/tcp    # HTTP: kamal-proxy + ACME
    ufw allow 443/tcp   # HTTPS: kamal-proxy
    ufw --force enable
    ```

    All rules apply v4 + v6 automatically (Ubuntu 24.04's `/etc/default/ufw` defaults `IPV6=yes`).

    **Decision: skip fail2ban for now.** With password auth off + ufw's connection-rate limit on 22, fail2ban would mostly quiet logs rather than reduce real attack surface. Revisit when (a) a public API surface opens on josh-core, (b) log noise becomes operationally annoying, or (c) compliance demands it.

    **Operator gotcha (worth promoting to the runbook later):** after `ufw limit 22/tcp` is active, do NOT open more than 5 SSH connections in any 30-second window, because the rate-limit will lock your IP out temporarily (~30s to drain). Manifests as "Connection refused", not timeout. Recovery is just to wait. Hit this during initial smoke-test from rapid `nc -z` + `ssh -v` + retries.

    Next blocker: t6, `cd josh-core && kamal setup` from local Mac. Requires `.kamal/secrets` decrypted from age.
  • 2026-05-11T05:00:00Z in_progress → in_progress **OVHcloud host delivered and brought up.** Server is reachable at `<your-host-ip>` via `ssh josh`. Progress on the server-blocked tasks:
    - **t1** updated: order delivered 2026-05-11 (was "ETA ~1 day").
    - **t2** done: `josh` alias in `~/.ssh/config` repointed from the dead DO IP (157.245.246.232) to the OVHcloud IP (<your-host-ip>).
    - **t3** partial: RAID 10 (`md3`) verified across all 4 NVMes via `/proc/mdstat`: [4/4] [UUUU] healthy. Ubuntu 24.04.4 LTS confirmed. Root SSH key authorized. **`ufw` baseline (allow 22 + 80 + 443) is NOT yet configured**; pending before kamal-setup.
    - **t4** done: single `/` partition on `md3` RAID 10 (~1.8 TB), `/data` is a directory on `/` (no separate /data partition; equivalent per acceptance criterion 1).
    - **t5** done: `/data/{corpus,locks,backups}` created on the RAID 10 array.

    **OVH installer gotcha (not in the runbook yet).** OVHcloud's Ubuntu 24.04 template provisions a default user `ubuntu` with passwordless sudo and authorizes the registered SSH key for that user, not for root. Since all three deploy.yml files (`josh-{core,ingester,embedder}`) expect `ssh.user: root`, we did a one-time `sudo cp /home/ubuntu/.ssh/authorized_keys /root/.ssh/authorized_keys` (+ chown + chmod 700/600) so `ssh josh` lands as root. Modern Ubuntu's default `PermitRootLogin prohibit-password` already allows key-only root SSH; no `sshd_config` edit needed. **TODO**: fold this into `https://docs.usejosh.com/operations/new-host-setup/` Step 1.

    **Container deploy.yml hygiene swept in the same change.** All three `config/deploy.yml` files had stale references: `servers: 157.245.246.232` (the destroyed DO droplet) and comments describing `/data` as "the DigitalOcean block-storage volume." Both swept: IPs updated to the OVHcloud host; comments updated to "a directory on the Soft RAID 10 NVMe array." This was a hard blocker on `kamal setup` (t6).

    **`clarifications_needed` cleared.** `josh-embedder/config/deploy.yml` exists in the repo (was scaffolded in a prior commit); t7 is no longer config-blocked. Next blocker is t3 (`ufw` baseline) → then t6 (`kamal setup josh-core`).
  • 2026-05-10T23:45:00Z in_progress → in_progress **Pre-server doc tasks landed: t13, t14, and a new t16.**
    - **t13** done: `CLAUDE.md` gained a short `## Server` section (provider, hardware shorthand, link out to `architecture.html`) and a brief "Where data lives" paragraph framing the RAID 10 array as the data host. Updated the `Volume-as-data-host` critical rule wording to drop the now-incorrect "block volume" / "droplet root" phrasing in favor of provider-neutral "host root filesystem".
    - **t14** done: `https://docs.usejosh.com/operations/new-host-setup/` reframed around bare-metal Soft RAID 10 as the default flow, with a parallel "Cloud VM alternative" path kept for self-hosters using DO/Hetzner/AWS. Step 1 retitled "Order the host"; Step 3 retitled "Confirm the array and mount /data" with separate bare-metal vs. cloud-VM instructions. Migration section now explicitly covers the rsync-only path (zero-copy detach/reattach is no longer an option when the destination is bare metal). Decommissioning section retitled and adjusted for the case where the prior host is already destroyed.
    - **t16** added and done: `https://docs.usejosh.com/operations/architecture/` was not in the original task list but heavily DO-shaped (server table, "volume-as-data-host" section, mount-pattern paragraph). Bringing it forward in the same change avoided silent disagreement with t13/t14. Server table swapped for the EPYC 4244P / 32 GB DDR5 ECC / 4×960 GB NVMe RAID 10 / 3+25 Gbps facts; "Volume-as-data-host" renamed to "Where data lives" with the boundary explicitly framed as a convention rather than a physical guarantee on bare metal.

    Pre-flip verification done before this status update: smoke test (`uv run poe smoke`) green end-to-end, three local `docker build`s green for josh-core / josh-embedder / josh-ingester from monorepo root. Spec lint pending; will run before commit.
  • 2026-05-10T23:00:00Z in_progress → in_progress **Provision sequence updated to reflect completed pre-server prep.** The substrate-nightly-backup spec is now ~80% landed locally (restic repo initialized at `s3://josh-bucket/josh-db-restic` with ID `1769e39995`; install scripts complete in `bin/josh-backup/`; runbook rewritten for restic; `.kamal/secrets.age` updated with `RESTIC_PASSWORD`; iCloud `SSH Keys and Tokens.tar.age` carries a backup of the password). When the OVHcloud host arrives, backup install is one command: `JOSH_HOST=josh bin/josh-backup/install.sh`.

    Provision sequence reordered to put backup install + smoke + restore verification (steps 6-8) before re-ingest (step 9), so we have a proven backup pipeline before substrate data exists to back up. Tasks updated with `[server-blocked]` prefix on the steps that wait for the host. Two tasks (t13 CLAUDE.md update, t14 new-host-setup doc update) are NOT server-blocked and can land in parallel.

    Added `clarifications_needed`: `josh-embedder/config/deploy.yml` is missing; it is needed for t7 (`kamal setup josh-embedder`) to have anything to act on.
  • 2026-05-10T18:00:00Z planned → in_progress DigitalOcean droplet destroyed earlier today; spec reframed from "migrate from DO" to "fresh provision OVHcloud." Rsync cutover steps dropped (no source host to sync from); re-ingestion replaces the data-migration path. OVHcloud Advance-1 2024 ordered today (Vint Hill, 4×960 GB NVMe Soft RAID 10, 32 GB RAM); expected ready within a day. Status flipped planned → in_progress with t1 (order) already done.

docs/spec/substrate-bare-metal-host.html · generated by bin/build-spec.py