
Restore the substrate from a nightly restic backup

When the substrate is corrupted, deleted, or otherwise unusable, the nightly restic snapshot in DigitalOcean Spaces (josh-bucket, NYC3) is your recovery path. This runbook walks the full restore: list snapshots, restore the chosen one, swap the live files, restart.

This runbook is the companion to the substrate-nightly-backup spec, which defines what is backed up and on what cadence: the entire /data tree (SQLite snapshot plus corpus directory), stored via restic with block-level deduplication in standard-tier object storage, with retention managed by restic forget --prune.

When to use this runbook

  • You ran a destructive query against production by mistake (DROP TABLE, DELETE FROM … WHERE 1=1) and need to roll back.
  • An ingester bug introduced bad rows that you only just noticed.
  • The substrate file shows up corrupted on next open (integrity_check returns errors, queries fail with SQLITE_CORRUPT).
  • The corpus directory got wiped or the host disk failed and the substrate is gone.
  • You want to grab last night's substrate (DB + corpus) to a local box for debugging without disturbing production state.

Recovery time objective (RTO). ~5–30 minutes depending on substrate size and what you're restoring (DB-only is fast; full /data tree is bound by bucket throughput).

Recovery point objective (RPO). Up to 24 hours — whatever's been written since the last 03:30 UTC backup is gone. Re-run the relevant ingester to backfill if needed.
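
To put a number on the data-loss window for a specific incident, subtract the snapshot time from the time the corruption happened. The following is a minimal sketch, assuming GNU date (as found on most Linux hosts) and UTC timestamps; hours_lost is a hypothetical helper name, not part of the backup tooling.

```shell
# Whole hours between a snapshot time and an incident time, both in UTC.
# Assumes GNU date, which accepts "YYYY-MM-DD HH:MM:SS" via -d.
hours_lost() {
  snap_epoch=$(date -u -d "$1" +%s) || return 1
  incident_epoch=$(date -u -d "$2" +%s) || return 1
  echo $(( (incident_epoch - snap_epoch) / 3600 ))
}

# Usage: hours_lost '2026-05-09 03:30:11' '2026-05-09 15:45:00'
```

For the usage line shown, the gap is 12 hours and 14 minutes, so the helper prints 12. Anything ingested in that window is what the backfill has to cover.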

Prerequisites

  • SSH access to the josh host (alias configured in ~/.ssh/config).
  • The host already has restic, jq, and sqlite3 installed (from the backup-side install — bin/josh-backup/install.sh).
  • /etc/josh-backup.env on the host populated with RESTIC_REPOSITORY, RESTIC_PASSWORD, and the AWS_* credentials. If RESTIC_PASSWORD is lost, the restic repository is unrecoverable. The password is stored in .kamal/secrets.age in the project repo, with a backup copy in the iCloud archive SSH Keys and Tokens.tar.age.

Step 1 — List available snapshots

SSH to the host and ask restic for the snapshot listing.

ssh josh
source /etc/josh-backup.env
restic snapshots

Expected output is one row per snapshot with timestamp, host, tags, and snapshot ID:

ID        Time                 Host  Tags     Paths
--------------------------------------------------------
3a1b2c3d  2026-05-08 03:30:14  josh  nightly  /data
4e5f6g7h  2026-05-09 03:30:11  josh  nightly  /data

Pick the most recent snapshot that pre-dates the corruption. If you're not sure when the corruption started, work backwards from the latest until the data looks right.
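
The "most recent snapshot that pre-dates the corruption" rule can be expressed as a small filter over the listing. This is a sketch rather than part of the backup tooling: pick_snapshot_before is a hypothetical helper name, and it assumes the table layout shown above (ID in the first column, date and time in the second and third) with all times in the same timezone as the cutoff.

```shell
# Print the ID of the newest snapshot taken strictly before CUTOFF.
# Reads the table printed by `restic snapshots` on stdin.
pick_snapshot_before() {
  cutoff="$1"   # "YYYY-MM-DD HH:MM:SS", same timezone as the listing
  awk -v cutoff="$cutoff" '
    # Data rows are the only lines whose second field is a date.
    $2 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ {
      ts = $2 " " $3
      if (ts < cutoff && ts > best_ts) { best_ts = ts; best_id = $1 }
    }
    END {
      if (best_id == "") exit 1   # nothing older than the cutoff
      print best_id
    }
  '
}

# Usage on the host, after sourcing /etc/josh-backup.env:
#   restic snapshots | pick_snapshot_before '2026-05-09 00:00:00'
```

The helper only narrows the choice by time. It says nothing about whether the data in that snapshot is good, so the integrity check later in this runbook still applies.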

For long-term recovery (a snapshot from weeks or months ago), restic's forget policy keeps daily snapshots for 30 days, weekly for 8 weeks, and monthly for 12 months. These windows overlap rather than stack, so restoration coverage reaches back roughly 12 months.

Step 2 — Restore the snapshot to a staging directory


Restore to a tmp dir first, never directly over the live /data. Restic recreates the original directory structure under the target.

ssh josh
source /etc/josh-backup.env
SNAPSHOT="4e5f6g7h" # snapshot ID from step 1, or `latest`
TMP=$(mktemp -d /tmp/josh-restore.XXXXXX)
restic restore "$SNAPSHOT" --target "$TMP"
ls -la "$TMP/data/" # confirm the restore landed

The restored layout mirrors the live host:

$TMP/data/
├── backups/
│ └── josh-snap.db # the consistent SQLite snapshot
├── corpus/
│ ├── crs_reports/
│ ├── legislators_committees/
│ └── …
└── …
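
The layout above can be asserted rather than eyeballed before moving on. The following is a minimal sketch to run on the host; check_restore_layout is a hypothetical helper name, and it checks only the two paths this runbook relies on later.

```shell
# Return 0 only if the staging dir contains the pieces the later steps use.
check_restore_layout() {
  root="$1"   # the directory passed to `restic restore --target`
  status=0
  if [ ! -s "$root/data/backups/josh-snap.db" ]; then
    echo "missing or empty: $root/data/backups/josh-snap.db" >&2
    status=1
  fi
  if [ ! -d "$root/data/corpus" ]; then
    echo "missing directory: $root/data/corpus" >&2
    status=1
  fi
  return "$status"
}

# Usage: check_restore_layout "$TMP" && echo "layout ok"
```

A non-zero return here most likely means the restore was interrupted or a different snapshot path was restored, so it is worth re-running the restore before continuing.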

Step 3 — Verify the restored snapshot

Never trust a backup you haven't checked. PRAGMA integrity_check on the restored SQLite snapshot must print ok on a single line.

SNAPDB="$TMP/data/backups/josh-snap.db"
sqlite3 "$SNAPDB" "PRAGMA integrity_check;"

If it returns anything other than ok, the snapshot is unusable. Pick an earlier snapshot and try again. Report the bad snapshot — restic's own structural check runs nightly via the success_determiner, so if a restored snap fails integrity_check that's a real signal worth digging into.

Optional sanity check on the snapshot data:

sqlite3 "$SNAPDB" "SELECT name FROM sqlite_master WHERE type='table' ORDER BY 1;"
sqlite3 "$SNAPDB" "SELECT COUNT(*) FROM crs_reports;"
sqlite3 "$SNAPDB" "SELECT MAX(retrieved_at) FROM crs_reports;"

And on the restored corpus tree:

du -sh "$TMP/data/corpus/" # should be MB to GB depending on substrate size
ls "$TMP/data/corpus/" # one directory per source
find "$TMP/data/corpus" -maxdepth 4 -type f | head

Step 4 — Stop the running containers

The active substrate file at /data/josh.db can't be replaced safely while josh-core and josh-ingester hold it open. Stop the running containers first.

# From your local Mac (so kamal can reach SSH agent etc.)
cd ~/projects/josh/josh-core
env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal app stop
cd ../josh-ingester
env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal app stop
cd ../josh-embedder # if deployed
env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal app stop

Verify nothing has the file open:

ssh josh 'lsof /data/josh.db 2>/dev/null || echo "no holders"'
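
If the check still lists holders shortly after the stop, a polling loop along these lines can wait for them to go away instead of re-running the command by hand. This is a sketch meant to run on the host; wait_for_release and the HOLDER_PROBE override are hypothetical names introduced here for illustration.

```shell
# Poll until nothing holds FILE open, or give up after TRIES attempts.
# The probe defaults to lsof, which exits non-zero when no process has
# the file open. HOLDER_PROBE lets another command stand in for it.
wait_for_release() {
  file="$1"
  tries="${2:-30}"
  probe="${HOLDER_PROBE:-lsof}"
  if ! command -v "$probe" >/dev/null 2>&1; then
    echo "probe command not found: $probe" >&2
    return 2
  fi
  while [ "$tries" -gt 0 ]; do
    if ! "$probe" "$file" >/dev/null 2>&1; then
      return 0          # no holders reported
    fi
    tries=$((tries - 1))
    sleep 1
  done
  return 1              # still held after all attempts
}

# Usage: wait_for_release /data/josh.db 30 || echo "still held, do not swap"
```

The explicit check for the probe matters: without it, a host where lsof is missing would look the same as a file with no holders.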

Step 5 — Swap the SQLite file (and optionally the corpus)


Move the (likely corrupted) live SQLite file aside as a safety copy, drop in the restored snapshot, and clear stale WAL/SHM companion files. The WAL/SHM purge is critical: a stale WAL from before the swap will corrupt the freshly restored DB on first open.

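# '$TMP' below is expanded by your LOCAL shell before ssh runs, so set it
# locally to the staging path created on the host in step 2. Left unset, the
# cp would read from the host's own /data/backups instead of the restore.
: "${TMP:?set TMP to the step 2 staging directory, e.g. /tmp/josh-restore.XXXXXX}"
ssh josh "
set -euo pipefail
cd /data
# Safety copy of the current (suspect) DB
if [ -f josh.db ]; then
  mv josh.db josh.db.bad-\$(date -u +%Y-%m-%dT%H-%M-%SZ)
fi
# Drop in the restored snapshot (note: the file is at backups/josh-snap.db
# in the restored tree, NOT at the live name)
cp '$TMP'/data/backups/josh-snap.db /data/josh.db
chmod 644 /data/josh.db
# Purge stale WAL/SHM: these reference the OLD file's transactions
rm -f /data/josh.db-wal /data/josh.db-shm
ls -la /data/
"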

If you also need to restore the corpus directory (raw payloads + normalized Markdown — required if disk failed or corpus was wiped, optional if only the SQLite was bad):

# '$TMP' is expanded by your local shell; set it to the step 2 staging path first.
: "${TMP:?set TMP to the step 2 staging directory}"
ssh josh "
set -euo pipefail
# rsync the restored corpus over the live one. --delete removes any files
# that exist live but not in the snapshot (post-snapshot ingest results).
# Drop --delete if you want to merge instead.
rsync -aHv --delete '$TMP'/data/corpus/ /data/corpus/
"

You should now see /data/josh.db at the expected size, no -wal/-shm companions, and the renamed josh.db.bad-… safety copy alongside.
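
That end state can also be checked mechanically before the containers are restarted. The following is a minimal sketch to run on the host; verify_swap is a hypothetical helper name, and it assumes the staging copy from step 2 is still in place.

```shell
# Confirm the live DB is byte-identical to the restored snapshot and that
# no WAL/SHM companion files remain next to it.
verify_swap() {
  restored="$1"   # e.g. "$TMP/data/backups/josh-snap.db"
  live="$2"       # e.g. /data/josh.db
  if ! cmp -s "$restored" "$live"; then
    echo "live file differs from restored snapshot: $live" >&2
    return 1
  fi
  for suffix in -wal -shm; do
    if [ -e "$live$suffix" ]; then
      echo "stale companion still present: $live$suffix" >&2
      return 1
    fi
  done
  return 0
}

# Usage: verify_swap "$TMP/data/backups/josh-snap.db" /data/josh.db
```

Run it before the restart only. Once a writer opens the database, new -wal/-shm files and a changed josh.db are expected, and the comparison no longer means anything.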

Step 6 — Restart the containers

Boot the containers from your local machine, in the same order they were stopped:

cd ~/projects/josh/josh-core
env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal app boot
cd ../josh-ingester
env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal app boot
cd ../josh-embedder # if deployed
env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal app boot
# Tail logs to confirm a clean start
ssh josh 'docker logs --tail 50 -f $(docker ps -qf name=josh-core)'

Hit the health check from your local machine:

ssh josh 'curl -s http://localhost/health'
# {"status":"ok"}

Then confirm the live database and corpus directly:

ssh josh 'sqlite3 /data/josh.db "PRAGMA integrity_check;"'
ssh josh 'sqlite3 /data/josh.db "SELECT COUNT(*) FROM crs_reports;"'
ssh josh 'sqlite3 /data/josh.db "SELECT COUNT(*) FROM legislators;"'
ssh josh 'ls /data/corpus | head'

If everything looks right, you're done. After a few hours of clean operation, you can clean up the safety files:

ssh josh 'ls -la /data/josh.db.bad-*'
ssh josh 'rm /data/josh.db.bad-…' # adjust the filename
ssh josh 'rm -rf /tmp/josh-restore.*'

Anything ingested between the snapshot timestamp and the corruption is gone. To replace it, re-run the affected ingester(s) over the date range:

# Example — rerun CRS Reports for the lost day
ssh josh 'docker exec $(docker ps -qf name=josh-ingester) \
josh ingest crs-reports --since 2026-05-09T00:00:00Z'

Most v1 substrate sources are derivable from public APIs, so re-running is safe and idempotent. Sources that depend on irreplaceable upstream state (e.g., a deleted upstream document that you'd already fetched) are the exception — flag those if you find any.

restic check --read-data — periodic deep integrity verification


The structural restic check runs nightly via the substrate-nightly-backup success_determiner. It validates pack-file references and snapshot-tree integrity but doesn't re-download pack file content.

The deep check, restic check --read-data, downloads every pack file in the repo and verifies content hashes — catches bit rot, server-side corruption, and transmission errors that survived the structural check. Run monthly (or weekly if paranoid; standard tier means there's no per-GB retrieval cost concern):

ssh josh
source /etc/josh-backup.env
restic check --read-data

This is also worth running before any actual restore, to confirm that the data behind the snapshot you're about to restore from is intact (the nightly structural check alone does not verify pack contents).
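
If a full --read-data pass takes too long to run often, restic check also accepts --read-data-subset=n/t, which reads only the n-th of t slices of the pack files. One way to rotate through the slices week by week is sketched below. read_data_subset_for_week and the default of 4 slices are assumptions for illustration, not part of the nightly job.

```shell
# Print the --read-data-subset argument for a given ISO week number,
# cycling through TOTAL equal slices of the repository.
read_data_subset_for_week() {
  week="${1#0}"          # strip a leading zero so 08/09 are not read as octal
  total="${2:-4}"
  echo "$(( (week - 1) % total + 1 ))/$total"
}

# Usage on the host, after sourcing /etc/josh-backup.env:
#   restic check --read-data-subset="$(read_data_subset_for_week "$(date -u +%V)")"
```

With 4 slices, every pack file is read about once a month while each run downloads only a quarter of the repo. A full restic check --read-data remains the stronger check before a real restore.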

The full restore, condensed into a single block you can paste once you're confident:

ssh josh "
set -euo pipefail
source /etc/josh-backup.env
TMP=\$(mktemp -d /tmp/josh-restore.XXXXXX)
echo 'restoring latest snapshot to:' \"\$TMP\"
restic restore latest --target \"\$TMP\"
SNAP=\"\$TMP/data/backups/josh-snap.db\"
result=\$(sqlite3 \"\$SNAP\" 'PRAGMA integrity_check;')
test \"\$result\" = 'ok' || { echo 'integrity_check FAILED:' \"\$result\"; exit 1; }
echo 'integrity_check ok — snapshot ready at:' \"\$SNAP\"
echo 'STOP josh-core/ingester/embedder via kamal app stop, then run the swap block from this runbook.'
"

The one-shot intentionally stops before the swap step — swapping the live file under a running writer is the one thing this runbook can't do safely without manual confirmation.

restic snapshots returns "Fatal: unable to open config file". Either the RESTIC_REPOSITORY env var is wrong, the credentials are bad, or the bucket name is mistyped. Decrypt .kamal/secrets.age, copy the BACKUP_* + RESTIC_PASSWORD values into /etc/josh-backup.env, and retry. The bucket is josh-bucket (NYC3, standard tier); the repo prefix is josh-db-restic.

restic snapshots returns "wrong password or no key found". The RESTIC_PASSWORD in /etc/josh-backup.env doesn't match the password used to initialize the repo. Check the iCloud SSH Keys and Tokens.tar.age backup of the password.
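
Both failures above usually come down to what is (or is not) set in the shell after sourcing /etc/josh-backup.env. A small check like the following can narrow it down. It is a sketch: missing_env is a hypothetical helper, and AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are assumed to be the AWS_* names in the env file (they are the variables restic's S3 backend reads).

```shell
# Report which of the named variables are unset or empty in this shell.
missing_env() {
  missing=""
  for name in "$@"; do
    eval "value=\${$name:-}"      # indirect lookup of the variable called $name
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "unset or empty:$missing" >&2
    return 1
  fi
}

# Usage:
#   source /etc/josh-backup.env
#   missing_env RESTIC_REPOSITORY RESTIC_PASSWORD AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY
```

This only shows that a value is present, not that it is correct. A variable that is set but not exported is also invisible to restic, so check that the env file exports what it defines.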

integrity_check returns errors on the restored snapshot. Try the previous snapshot. File a follow-up: nightly restic check validates structural integrity, so a snapshot that passes structural but fails restored-DB integrity is interesting. Run restic check --read-data to see if the repo itself has bit rot.

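After swap, queries return SQLITE_CORRUPT on first open. You probably skipped the WAL/SHM purge in step 5. Stop the container, rm /data/josh.db-wal /data/josh.db-shm, and restart. If the errors persist, the stale WAL may already have been applied to the restored file; repeat the step 5 copy from the staging directory, purge again, then restart.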

The container won't boot after restore. Check schema compatibility: if the restored snapshot is from a substrate version older than the deployed code expects, run alembic upgrade head against the restored file before booting (see migrations).

restic restore is slow or fails partway. Re-run the same command against the same target: recent restic versions verify files already present under the target and fetch only content that is missing or different. If a specific pack file fails consistently, the bucket may hold corrupted data; run restic check --read-data to confirm the corruption against that pack, and contact DO support.