Restore the substrate from a nightly restic backup

When the substrate is corrupted, deleted, or otherwise unusable, the nightly restic snapshot in DigitalOcean Spaces (josh-bucket, NYC3) is your recovery path. This runbook walks the full restore: list snapshots, restore the chosen one, swap the live files, restart.

Companion to the substrate-nightly-backup spec, which defines what's being backed up and on what cadence (entire /data tree — SQLite snapshot + corpus directory — via restic block-level dedupe to standard-tier object storage, with internal retention managed by restic forget --prune).

When to reach for this

You ran a destructive query against production by mistake (DROP TABLE, DELETE FROM … WHERE 1=1) and need to roll back.
An ingester bug introduced bad rows that you only just noticed.
The substrate file shows up corrupted on next open (integrity_check returns errors, queries fail with SQLITE_CORRUPT).
The corpus directory got wiped or the host disk failed and the substrate is gone.
You want to grab last night's substrate (DB + corpus) to a local box for debugging without disturbing production state.

Recovery time objective (RTO). ~5–30 minutes depending on substrate size and what you're restoring (DB-only is fast; full /data tree is bound by bucket throughput).

Recovery point objective (RPO). Up to 24 hours — whatever's been written since the last 03:30 UTC backup is gone. Re-run the relevant ingester to backfill if needed.

Prerequisites

SSH access to the josh host (alias configured in ~/.ssh/config).
The host already has restic, jq, and sqlite3 installed (from the backup-side install — bin/josh-backup/install.sh).
/etc/josh-backup.env on the host populated with RESTIC_REPOSITORY, RESTIC_PASSWORD, and the AWS_* credentials. If you've lost RESTIC_PASSWORD, the repo is unrecoverable — it's stashed in .kamal/secrets.age in the repo and in iCloud SSH Keys and Tokens.tar.age as a backup.
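Before step 1, it can save a confusing error to confirm that the environment really carries everything restic needs. The following is a minimal sketch, not part of the backup tooling: the function name check_backup_env is hypothetical, and it assumes the AWS_* credentials are the standard AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY pair that restic reads for S3-compatible storage.

```shell
# Sketch: report which restic variables are missing from the environment,
# without echoing any secret values.
# Assumption: the AWS_* credentials are AWS_ACCESS_KEY_ID and
# AWS_SECRET_ACCESS_KEY; adjust the list if the env file differs.
check_backup_env() {
  missing=0
  for name in RESTIC_REPOSITORY RESTIC_PASSWORD AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
    # printenv only sees exported variables, which is what restic sees too.
    if [ -z "$(printenv "$name" 2>/dev/null)" ]; then
      echo "missing or not exported: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

On the host, run it right after sourcing the env file. If the file uses plain KEY=value lines without export, wrapping the source in `set -a` / `set +a` exports them so that restic can see them.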

Step 1 — List available snapshots

SSH to the host and ask restic for the snapshot listing.

ssh josh
source /etc/josh-backup.env

restic snapshots

Expected output is one row per snapshot with timestamp, host, tags, and snapshot ID:

ID        Time                 Host  Tags     Paths
--------------------------------------------------------
3a1b2c3d  2026-05-08 03:30:14  josh  nightly  /data
4e5f6g7h  2026-05-09 03:30:11  josh  nightly  /data
…

Pick the most recent snapshot that pre-dates the corruption. If you're not sure when the corruption started, work backwards from the latest until the data looks right.

For long-term recovery (a snapshot from weeks or months ago), restic's forget policy keeps daily snapshots for 30 days, weekly for 8 weeks, and monthly for 12 months. The windows overlap rather than add up, so coverage reaches back about 12 months, at progressively coarser granularity.
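The retention windows map onto restic forget's --keep-* flags. The sketch below previews what such a policy would keep and remove; the real invocation lives in the backup spec and may differ, so both the function name and the --tag nightly filter (taken from the Tags column in the listing) are assumptions.

```shell
# Sketch: preview the retention policy without changing the repository.
# --dry-run only prints which snapshots would be kept or removed.
# Assumption: nightly snapshots carry the "nightly" tag shown in step 1.
preview_retention() {
  restic forget --dry-run --tag nightly \
    --keep-daily 30 --keep-weekly 8 --keep-monthly 12
}
```

Note that --keep-daily 30 keeps the most recent 30 days that have a snapshot, so a missed night stretches the daily window slightly rather than shrinking it.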

Step 2 — Restore the snapshot to a staging directory

Restore to a tmp dir first, never directly over the live /data. Restic recreates the original directory structure under the target.

ssh josh
source /etc/josh-backup.env

SNAPSHOT="4e5f6g7h"          # snapshot ID from step 1, or `latest`
TMP=$(mktemp -d /tmp/josh-restore.XXXXXX)

restic restore "$SNAPSHOT" --target "$TMP"
ls -la "$TMP/data/"          # confirm the restore landed

The restored layout mirrors the live host:

$TMP/data/
├── backups/
│   └── josh-snap.db          # the consistent SQLite snapshot
├── corpus/
│   ├── crs_reports/
│   ├── legislators_committees/
│   └── …
└── …
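When only the SQLite file is needed (the fast DB-only case from the RTO note), restic's --include flag can limit the restore to that one path. A minimal sketch, assuming the snapshot file sits at /data/backups/josh-snap.db as in the layout above; the function name restore_db_only is hypothetical.

```shell
# Sketch: restore only the SQLite snapshot instead of the full /data tree.
# Arguments: snapshot ID (or "latest") and an empty staging directory.
restore_db_only() {
  snapshot="$1"
  target="$2"
  restic restore "$snapshot" --target "$target" \
    --include /data/backups/josh-snap.db
}
```

Calling `restore_db_only "$SNAPSHOT" "$TMP"` leaves the file at `$TMP/data/backups/josh-snap.db`, so the verification in step 3 works unchanged. The corpus directory is not restored in this mode.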

Step 3 — Verify integrity

Never trust a backup you haven't checked. PRAGMA integrity_check on the restored SQLite snapshot must print ok on a single line.

SNAPDB="$TMP/data/backups/josh-snap.db"
sqlite3 "$SNAPDB" "PRAGMA integrity_check;"

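If it returns anything other than ok, the snapshot is unusable: pick an earlier snapshot and try again. Report the bad snapshot too. The nightly structural restic check (run via the success_determiner) validates the repository, not the SQLite file inside it, so a restored snapshot that fails integrity_check means either the database was already damaged when it was captured or pack data has gone bad since. Both are worth digging into.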
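Working backwards through snapshots by hand gets tedious, so the loop can be scripted. This is a sketch under assumptions rather than a tested tool: first_good_snapshot is a hypothetical helper, it restores only the SQLite file via --include, and it expects snapshot IDs passed newest first.

```shell
# Sketch: try snapshot IDs newest-first and print the first one whose
# restored SQLite file passes integrity_check, followed by its staging
# directory. Returns 1 if none pass. Failed staging directories are removed.
first_good_snapshot() {
  for snapshot in "$@"; do
    staging=$(mktemp -d /tmp/josh-restore.XXXXXX) || return 1
    db="$staging/data/backups/josh-snap.db"
    if restic restore "$snapshot" --target "$staging" \
         --include /data/backups/josh-snap.db >/dev/null \
       && [ "$(sqlite3 "$db" 'PRAGMA integrity_check;')" = "ok" ]; then
      echo "$snapshot $staging"
      return 0
    fi
    echo "snapshot $snapshot failed verification" >&2
    rm -rf "$staging"
  done
  return 1
}
```

The winning staging directory holds only the database. If the corpus is needed as well, run a full restore of the reported snapshot afterwards.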

Optional sanity check on the snapshot data:

sqlite3 "$SNAPDB" "SELECT name FROM sqlite_master WHERE type='table' ORDER BY 1;"
sqlite3 "$SNAPDB" "SELECT COUNT(*) FROM crs_reports;"
sqlite3 "$SNAPDB" "SELECT MAX(retrieved_at) FROM crs_reports;"

And on the restored corpus tree:

du -sh "$TMP/data/corpus/"          # should be MB to GB depending on substrate size
ls "$TMP/data/corpus/"              # one directory per source
find "$TMP/data/corpus" -maxdepth 4 -type f | head

Step 4 — Quiesce the writers

The active substrate file at /data/josh.db can't be replaced safely while josh-core and josh-ingester hold it open. Stop the running containers first.

# From your local Mac (so kamal can reach SSH agent etc.)
cd ~/projects/josh/josh-core
env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal app stop

cd ../josh-ingester
env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal app stop

cd ../josh-embedder    # if deployed
env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal app stop

Verify nothing has the file open:

ssh josh 'lsof /data/josh.db 2>/dev/null || echo "no holders"'
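lsof is not in the prerequisites list, and because stderr is discarded, a host without lsof would also print "no holders". A second check that does not depend on lsof is to confirm that no writer containers are running. The sketch below assumes every writer container has "josh-" in its name, as josh-core and josh-ingester do; assert_no_writers is a hypothetical name.

```shell
# Sketch: fail if any container whose name contains "josh-" is still running.
# Prints nothing and returns 0 when all writers are stopped.
assert_no_writers() {
  running=$(docker ps --filter name=josh- --format '{{.Names}}')
  if [ -n "$running" ]; then
    echo "still running:" >&2
    echo "$running" >&2
    return 1
  fi
  return 0
}
```

Run it on the host after the kamal app stop commands. Any name it prints is a container that may still hold /data/josh.db open.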

Step 5 — Swap the SQLite file (and optionally the corpus)

Move the (likely corrupted) live SQLite file aside as a safety copy, drop in the restored snapshot, and clear stale WAL/SHM companion files. The WAL/SHM purge is critical: a stale WAL from before the swap will corrupt the freshly restored DB on first open.

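# Run from your local machine. \$-escaped parts run on the host, but '$TMP'
# expands locally (the command string is double-quoted), so TMP must be set
# here to the staging path from step 2. If it is empty, cp would silently
# read the host's own /data/backups/josh-snap.db instead.
: "${TMP:?set TMP locally to the staging directory from step 2}" && ssh josh "
  set -euo pipefail
  cd /data

  # Safety copy of the current (suspect) DB
  if [ -f josh.db ]; then
    mv josh.db josh.db.bad-\$(date -u +%Y-%m-%dT%H-%M-%SZ)
  fi

  # Drop in the restored snapshot (note: the file is at backups/josh-snap.db
  # in the restored tree, NOT at the live name)
  cp '$TMP'/data/backups/josh-snap.db /data/josh.db
  chmod 644 /data/josh.db

  # Purge stale WAL/SHM: these reference the OLD file's transactions
  rm -f /data/josh.db-wal /data/josh.db-shm

  ls -la /data/
"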

If you also need to restore the corpus directory (raw payloads + normalized Markdown — required if disk failed or corpus was wiped, optional if only the SQLite was bad):

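# Run from your local machine with TMP set to the staging path from step 2
# ('$TMP' expands locally because the command string is double-quoted).
: "${TMP:?set TMP locally to the staging directory from step 2}" && ssh josh "
  set -euo pipefail
  # rsync the restored corpus over the live one. --delete removes any files
  # that exist live but not in the snapshot (post-snapshot ingest results).
  # Drop --delete if you want to merge instead, or add -n first for a dry run
  # that lists the changes without applying them.
  rsync -aHv --delete '$TMP'/data/corpus/ /data/corpus/
"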

You should now see /data/josh.db at the expected size, no -wal/-shm companions, and the renamed josh.db.bad-… safety copy alongside.
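Eyeballing the file size works, but a byte-for-byte comparison is cheap while the writers are still stopped. A minimal sketch with a hypothetical helper name, verify_swap; it must run before step 6, because a booted writer will recreate the -wal/-shm files and start changing the database.

```shell
# Sketch: confirm the live DB is byte-identical to the restored snapshot
# and that no WAL/SHM companions are left next to it.
verify_swap() {
  restored="$1"   # e.g. "$TMP/data/backups/josh-snap.db"
  live="$2"       # e.g. /data/josh.db
  if ! cmp -s "$restored" "$live"; then
    echo "live DB differs from restored snapshot" >&2
    return 1
  fi
  for companion in "$live-wal" "$live-shm"; do
    if [ -e "$companion" ]; then
      echo "stale companion present: $companion" >&2
      return 1
    fi
  done
  echo "swap verified"
}
```

On the host this would be called as `verify_swap "$TMP/data/backups/josh-snap.db" /data/josh.db`, with TMP set to the staging directory from step 2.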

Step 6 — Resume traffic

cd ~/projects/josh/josh-core
env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal app boot

cd ../josh-ingester
env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal app boot

cd ../josh-embedder    # if deployed
env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal app boot

# Tail logs to confirm a clean start
ssh josh 'docker logs --tail 50 -f $(docker ps -qf name=josh-core)'

Hit the health check on the host (the request goes through SSH, so curl runs there rather than on your local machine):

ssh josh 'curl -s http://localhost/health'
# {"status":"ok"}
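Containers can take a few seconds to come up, so a single curl right after boot may fail even when nothing is wrong. The polling loop below is a sketch under assumptions: wait_healthy is a hypothetical helper, and the attempt count and delay defaults are arbitrary. It matches on the same {"status":"ok"} body shown above.

```shell
# Sketch: poll the health endpoint until it reports ok or attempts run out.
# Arguments: URL, number of attempts (default 10), delay in seconds (default 3).
wait_healthy() {
  url="$1"
  attempts="${2:-10}"
  delay="${3:-3}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl return non-zero on HTTP errors; -s silences progress output.
    if body=$(curl -fs "$url") \
       && printf '%s' "$body" | grep -q '"status":"ok"'; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "not healthy after $attempts attempts" >&2
  return 1
}
```

On the host, `wait_healthy http://localhost/health` waits up to roughly 30 seconds with the defaults before giving up.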

Step 7 — Confirm the restore

ssh josh 'sqlite3 /data/josh.db "PRAGMA integrity_check;"'
ssh josh 'sqlite3 /data/josh.db "SELECT COUNT(*) FROM crs_reports;"'
ssh josh 'sqlite3 /data/josh.db "SELECT COUNT(*) FROM legislators;"'
ssh josh 'ls /data/corpus | head'

If everything looks right, you're done. After a few hours of clean operation, you can clean up the safety files:

ssh josh 'ls -la /data/josh.db.bad-*'
ssh josh 'rm /data/josh.db.bad-…'   # adjust the filename
ssh josh 'rm -rf /tmp/josh-restore.*'

Catching up the lost window

Anything ingested between the snapshot timestamp and the corruption is gone. To replace it, re-run the affected ingester(s) over the date range:

# Example — rerun CRS Reports for the lost day
ssh josh 'docker exec $(docker ps -qf name=josh-ingester) \
  josh ingest crs-reports --since 2026-05-09T00:00:00Z'

Most v1 substrate sources are derivable from public APIs, so re-running is safe and idempotent. Sources that depend on irreplaceable upstream state (e.g., a deleted upstream document that you'd already fetched) are the exception — flag those if you find any.

`restic check --read-data` — periodic deep integrity verification

The structural restic check runs nightly via the substrate-nightly-backup success_determiner. It validates pack-file references and snapshot-tree integrity but doesn't re-download pack file content.

The deep check, restic check --read-data, downloads every pack file in the repo and verifies content hashes — catches bit rot, server-side corruption, and transmission errors that survived the structural check. Run monthly (or weekly if paranoid; standard tier means there's no per-GB retrieval cost concern):

ssh josh
source /etc/josh-backup.env
restic check --read-data

This is also worth running before an actual restore when time allows, to confirm that the pack data behind the snapshot you're about to restore is intact. It downloads the whole repository, so it adds to the recovery time.
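If a full --read-data pass takes too long to run in one sitting, restic can verify a slice of the pack files per run with --read-data-subset=n/t, where the repository is split into t groups and group n is read. The sketch below wraps that flag; deep_check_slice is a hypothetical name, and the choice of four slices in the usage note is only an example.

```shell
# Sketch: verify one slice of the repository's pack data per run.
# Arguments: slice number n (1..total) and the total number of slices.
deep_check_slice() {
  n="$1"
  total="$2"
  if [ "$n" -lt 1 ] || [ "$n" -gt "$total" ]; then
    echo "slice must be between 1 and $total" >&2
    return 2
  fi
  restic check --read-data-subset="$n/$total"
}
```

Running `deep_check_slice 1 4` in the first week, `deep_check_slice 2 4` in the second, and so on covers every pack file once over four weeks.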

Quick reference — one-shot restore

The restore-and-verify half of the runbook (steps 1 to 3, using the latest snapshot), condensed into a single block you can paste once you're confident:

ssh josh "
  set -euo pipefail
  source /etc/josh-backup.env

  TMP=\$(mktemp -d /tmp/josh-restore.XXXXXX)

  echo 'restoring latest snapshot to:' \"\$TMP\"
  restic restore latest --target \"\$TMP\"

  SNAP=\"\$TMP/data/backups/josh-snap.db\"
  result=\$(sqlite3 \"\$SNAP\" 'PRAGMA integrity_check;')
  test \"\$result\" = 'ok' || { echo 'integrity_check FAILED:' \"\$result\"; exit 1; }
  echo 'integrity_check ok — snapshot ready at:' \"\$SNAP\"

  echo 'STOP josh-core/ingester/embedder via kamal app stop, then run the swap block from this runbook.'
"

The one-shot intentionally stops before the swap step — swapping the live file under a running writer is the one thing this runbook can't do safely without manual confirmation.

Troubleshooting

restic snapshots returns "Fatal: unable to open config file". Either the RESTIC_REPOSITORY env var is wrong, the credentials are bad, or the bucket name is mistyped. Decrypt .kamal/secrets.age, copy the BACKUP_* + RESTIC_PASSWORD values into /etc/josh-backup.env, and retry. The bucket is josh-bucket (NYC3, standard tier); the repo prefix is josh-db-restic.

restic snapshots returns "wrong password or no key found". The RESTIC_PASSWORD in /etc/josh-backup.env doesn't match the password used to initialize the repo. Check the iCloud SSH Keys and Tokens.tar.age backup of the password.

integrity_check returns errors on the restored snapshot. Try the previous snapshot. File a follow-up: nightly restic check validates structural integrity, so a snapshot that passes structural but fails restored-DB integrity is interesting. Run restic check --read-data to see if the repo itself has bit rot.

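After swap, queries return SQLITE_CORRUPT on first open. You probably skipped the WAL/SHM purge in step 5. Stop the containers, copy the restored snapshot over /data/josh.db again (the stale WAL may already have been replayed into the file), rm /data/josh.db-wal /data/josh.db-shm, and restart.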

The container won't boot after restore. Check schema compatibility: if the restored snapshot is from a substrate version older than the deployed code expects, run alembic upgrade head against the restored file before booting (see migrations).

restic restore is slow or fails partway. Re-run the same command against the same --target directory. Recent restic releases (0.17 and later) verify files already present in the target and re-download only what differs; older releases restore everything again. If a specific pack file is consistently failing, that's a sign the bucket has corruption. Contact DO support and check whether restic check --read-data confirms the corruption against the same pack.

Reference

substrate-nightly-backup — the spec defining what gets backed up, on what cadence, with what retention
substrate-volume-mount — why /data is the source-of-truth location and what lives there
substrate-litestream-backup — the path-not-taken (continuous WAL streaming, deferred indefinitely)
new-host-setup — how to spin up a fresh host that this restore can target
CLAUDE.md — repo overview, server profile, deploy pattern
restic docs — for everything not covered here