Restore the substrate from a nightly restic backup
When the substrate is corrupted, deleted, or otherwise unusable, the nightly restic snapshot in DigitalOcean Spaces (josh-bucket, NYC3) is your recovery path. This runbook walks the full restore: list snapshots, restore the chosen one, swap the live files, restart.
Companion to the substrate-nightly-backup spec, which defines what's being backed up and on what cadence (entire /data tree — SQLite snapshot + corpus directory — via restic block-level dedupe to standard-tier object storage, with internal retention managed by restic forget --prune).
When to reach for this
- You ran a destructive query against production by mistake (DROP TABLE, DELETE FROM … WHERE 1=1) and need to roll back.
- An ingester bug introduced bad rows that you only just noticed.
- The substrate file shows up corrupted on next open (integrity_check returns errors, queries fail with SQLITE_CORRUPT).
- The corpus directory got wiped or the host disk failed and the substrate is gone.
- You want to grab last night's substrate (DB + corpus) to a local box for debugging without disturbing production state.
Recovery time objective (RTO). ~5–30 minutes depending on substrate size and what you're restoring (DB-only is fast; full /data tree is bound by bucket throughput).
Recovery point objective (RPO). Up to 24 hours — whatever's been written since the last 03:30 UTC backup is gone. Re-run the relevant ingester to backfill if needed.
Prerequisites
- SSH access to the josh host (alias configured in ~/.ssh/config).
- The host already has restic, jq, and sqlite3 installed (from the backup-side install, bin/josh-backup/install.sh).
- /etc/josh-backup.env on the host populated with RESTIC_REPOSITORY, RESTIC_PASSWORD, and the AWS_* credentials. Without RESTIC_PASSWORD the repo is unrecoverable; the password is stashed in .kamal/secrets.age in the repo, with a backup copy in iCloud (SSH Keys and Tokens.tar.age).
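The prerequisites above can be checked in one pass before starting a restore. This is a minimal sketch, assuming a POSIX shell on the host; require_tools and require_vars are hypothetical helper names, not part of the josh-backup tooling.

```shell
# Hypothetical helpers (not part of josh-backup): report anything missing.

# Succeeds only if every named command is on PATH.
require_tools() {
  rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing tool: $tool" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Succeeds only if every named variable is set and non-empty.
require_vars() {
  rc=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "unset or empty: $name" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Usage on the host, after `source /etc/josh-backup.env`:
#   require_tools restic jq sqlite3 && require_vars RESTIC_REPOSITORY RESTIC_PASSWORD
```

Both helpers keep going after the first miss, so a single run lists everything that needs fixing.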
Step 1 — List available snapshots
SSH to the host and ask restic for the snapshot listing.

    ssh josh
    source /etc/josh-backup.env
    restic snapshots

Expected output is one row per snapshot with timestamp, host, tags, and snapshot ID:

    ID        Time                 Host  Tags     Paths
    --------------------------------------------------------
    3a1b2c3d  2026-05-08 03:30:14  josh  nightly  /data
    4e5f6g7h  2026-05-09 03:30:11  josh  nightly  /data
    …

Pick the most recent snapshot that pre-dates the corruption. If you're not sure when the corruption started, work backwards from the latest until the data looks right.
For long-term recovery (a snapshot from weeks/months ago), restic's forget policy keeps daily snapshots for 30 days, weekly for 8 weeks, and monthly for 12 months. The windows overlap rather than add up, so restoration coverage reaches back roughly 12 months, at progressively coarser granularity.
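Picking the most recent snapshot that pre-dates the corruption can also be done mechanically from restic's JSON listing. This is a sketch, assuming jq is available (per the prerequisites), that each JSON entry carries time and short_id fields, and that all snapshot times are recorded in UTC so that string comparison matches chronological order. latest_before is a hypothetical helper name.

```shell
# Hypothetical helper: read `restic snapshots --json` on stdin and print the
# short ID of the newest snapshot taken strictly before the cutoff.
# Assumes every .time value is in UTC, so plain string comparison is enough.
latest_before() {
  jq -r --arg cutoff "$1" '
    map(select(.time < $cutoff))   # keep snapshots older than the cutoff
    | sort_by(.time)               # oldest first
    | last                         # newest of the remaining ones
    | .short_id // empty           # print nothing when none qualify
  '
}

# Usage on the host, after `source /etc/josh-backup.env`:
#   restic snapshots --json | latest_before 2026-05-09T00:00:00Z
```

An empty result means no snapshot is older than the cutoff, which is worth treating as a stop sign rather than falling back to latest.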
Step 2 — Restore the snapshot to a staging directory
Restore to a tmp dir first, never directly over the live /data. Restic recreates the original directory structure under the target.

    ssh josh
    source /etc/josh-backup.env

    SNAPSHOT="4e5f6g7h"   # snapshot ID from step 1, or `latest`
    TMP=$(mktemp -d /tmp/josh-restore.XXXXXX)

    restic restore "$SNAPSHOT" --target "$TMP"
    ls -la "$TMP/data/"   # confirm the restore landed

The restored layout mirrors the live host:

    $TMP/data/
    ├── backups/
    │   └── josh-snap.db          # the consistent SQLite snapshot
    ├── corpus/
    │   ├── crs_reports/
    │   ├── legislators_committees/
    │   └── …
    └── …

Step 3 — Verify integrity
Never trust a backup you haven't checked. PRAGMA integrity_check on the restored SQLite snapshot must print ok on a single line.

    SNAPDB="$TMP/data/backups/josh-snap.db"
    sqlite3 "$SNAPDB" "PRAGMA integrity_check;"

If it returns anything other than ok, the snapshot is unusable. Pick an earlier snapshot and try again. Report the bad snapshot: restic's own structural check runs nightly via the success_determiner, so if a restored snap fails integrity_check that's a real signal worth digging into.
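The "pick an earlier snapshot and try again" loop can be sketched as a small helper that walks a list of snapshot IDs, newest first, and stops at the first one that passes. The names first_good_snapshot and check_snapshot_db are hypothetical, and the check assumes that restoring only the SQLite snapshot with restic's --include option is enough to judge a snapshot.

```shell
# Hypothetical helper: print the first snapshot ID (in the order given) for
# which the check command succeeds; fail if none passes.
first_good_snapshot() {
  check=$1
  shift
  for id in "$@"; do
    if "$check" "$id"; then
      printf '%s\n' "$id"
      return 0
    fi
  done
  return 1
}

# Hypothetical check: restore just the SQLite snapshot into a scratch dir and
# require integrity_check to print exactly "ok". Scratch dirs are left behind
# for inspection; remove them by hand afterwards.
check_snapshot_db() {
  dir=$(mktemp -d /tmp/josh-verify.XXXXXX) || return 1
  restic restore "$1" --target "$dir" \
    --include /data/backups/josh-snap.db >/dev/null || return 1
  [ "$(sqlite3 "$dir/data/backups/josh-snap.db" 'PRAGMA integrity_check;')" = "ok" ]
}

# Usage on the host, newest snapshot first:
#   first_good_snapshot check_snapshot_db 4e5f6g7h 3a1b2c3d
```

Passing the check as a parameter keeps the walk-back logic separate from restic and sqlite3, so the loop can be exercised with a stub before it is pointed at the real repo.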
Optional sanity check on the snapshot data:
    sqlite3 "$SNAPDB" "SELECT name FROM sqlite_master WHERE type='table' ORDER BY 1;"
    sqlite3 "$SNAPDB" "SELECT COUNT(*) FROM crs_reports;"
    sqlite3 "$SNAPDB" "SELECT MAX(retrieved_at) FROM crs_reports;"

And on the restored corpus tree:

    du -sh "$TMP/data/corpus/"   # should be MB to GB depending on substrate size
    ls "$TMP/data/corpus/"       # one directory per source
    find "$TMP/data/corpus" -maxdepth 4 -type f | head

Step 4 — Quiesce the writers
The active substrate file at /data/josh.db can't be replaced safely while josh-core and josh-ingester hold it open. Stop the running containers first.

    # From your local Mac (so kamal can reach SSH agent etc.)
    cd ~/projects/josh/josh-core
    env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal app stop

    cd ../josh-ingester
    env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal app stop

    cd ../josh-embedder   # if deployed
    env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal app stop

Verify nothing has the file open:

    ssh josh 'lsof /data/josh.db 2>/dev/null || echo "no holders"'

Step 5 — Swap the SQLite file (and optionally the corpus)
Move the (likely corrupted) live SQLite file aside as a safety copy, drop in the restored snapshot, and clear stale WAL/SHM companion files. The WAL/SHM purge is critical: a stale WAL from before the swap will corrupt the freshly restored DB on first open.

These commands run from your local machine, and '$TMP' sits inside a double-quoted string, so it expands in your local shell before ssh runs. Set TMP locally to the staging directory created on the host in step 2 first. If it is empty, the cp reads /data/backups/josh-snap.db from the live tree instead of the restored one.

    TMP=/tmp/josh-restore.XXXXXX   # replace with the real staging dir from step 2 (echo "$TMP" on the host)

    ssh josh "
      set -euo pipefail
      cd /data

      # Safety copy of the current (suspect) DB
      if [ -f josh.db ]; then
        mv josh.db josh.db.bad-\$(date -u +%Y-%m-%dT%H-%M-%SZ)
      fi

      # Drop in the restored snapshot (note: the file is at backups/josh-snap.db
      # in the restored tree, NOT at the live name)
      cp '$TMP'/data/backups/josh-snap.db /data/josh.db
      chmod 644 /data/josh.db

      # Purge stale WAL/SHM: these reference the OLD file's transactions
      rm -f /data/josh.db-wal /data/josh.db-shm

      ls -la /data/
    "

If you also need to restore the corpus directory (raw payloads + normalized Markdown; required if disk failed or corpus was wiped, optional if only the SQLite was bad):

    ssh josh "
      set -euo pipefail
      # rsync the restored corpus over the live one. --delete removes any files
      # that exist live but not in the snapshot (post-snapshot ingest results).
      # Drop --delete if you want to merge instead.
      rsync -aHv --delete '$TMP'/data/corpus/ /data/corpus/
    "

You should now see /data/josh.db at the expected size, no -wal/-shm companions, and the renamed josh.db.bad-… safety copy alongside.
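The swap sequence above can be rehearsed against a scratch directory before touching /data. This is a sketch under stated assumptions: swap_db is a hypothetical wrapper, not an existing josh script, and it adds one guard the inline version lacks by refusing to move the live file aside unless the restored file actually exists.

```shell
# Hypothetical wrapper around the swap: safety copy, drop-in, WAL/SHM purge.
# $1 = data directory holding josh.db, $2 = path to the restored snapshot DB.
swap_db() {
  data_dir=$1
  restored_db=$2

  # Guard: never move the live DB aside unless the replacement is really there.
  if [ ! -f "$restored_db" ]; then
    echo "restored DB not found: $restored_db" >&2
    return 1
  fi

  # Safety copy of the current (suspect) DB, stamped in UTC.
  stamp=$(date -u +%Y-%m-%dT%H-%M-%SZ)
  if [ -f "$data_dir/josh.db" ]; then
    mv "$data_dir/josh.db" "$data_dir/josh.db.bad-$stamp" || return 1
  fi

  # Drop in the restored snapshot under the live name.
  cp "$restored_db" "$data_dir/josh.db" || return 1
  chmod 644 "$data_dir/josh.db"

  # Purge stale WAL/SHM companions left over from the old file.
  rm -f "$data_dir/josh.db-wal" "$data_dir/josh.db-shm"
}

# Rehearsal: point it at a scratch copy first, for example
#   swap_db /tmp/scratch-data "$TMP/data/backups/josh-snap.db"
```

Running it once against a scratch directory confirms the paths resolve as expected before the same arguments are used with /data.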
Step 6 — Resume traffic
    cd ~/projects/josh/josh-core
    env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal app boot

    cd ../josh-ingester
    env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal app boot

    cd ../josh-embedder   # if deployed
    env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal app boot

    # Tail logs to confirm a clean start
    ssh josh 'docker logs --tail 50 -f $(docker ps -qf name=josh-core)'

Hit the health check from your local machine:

    ssh josh 'curl -s http://localhost/health'
    # {"status":"ok"}

Step 7 — Confirm the restore
    ssh josh 'sqlite3 /data/josh.db "PRAGMA integrity_check;"'
    ssh josh 'sqlite3 /data/josh.db "SELECT COUNT(*) FROM crs_reports;"'
    ssh josh 'sqlite3 /data/josh.db "SELECT COUNT(*) FROM legislators;"'
    ssh josh 'ls /data/corpus | head'

If everything looks right, you're done. After a few hours of clean operation, you can clean up the safety files:

    ssh josh 'ls -la /data/josh.db.bad-*'
    ssh josh 'rm /data/josh.db.bad-…'   # adjust the filename
    ssh josh 'rm -rf /tmp/josh-restore.*'

Catching up the lost window
Anything ingested between the snapshot timestamp and the corruption is gone. To replace it, re-run the affected ingester(s) over the date range:

    # Example: rerun CRS Reports for the lost day
    ssh josh 'docker exec $(docker ps -qf name=josh-ingester) \
      josh ingest crs-reports --since 2026-05-09T00:00:00Z'

Most v1 substrate sources are derivable from public APIs, so re-running is safe and idempotent. Sources that depend on irreplaceable upstream state (e.g., a deleted upstream document that you'd already fetched) are the exception; flag those if you find any.
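The --since value in the example is midnight UTC on the day of the restored snapshot. A small sketch of deriving it from the timestamp shown in the step 1 listing follows, assuming that listing is in UTC. since_for_snapshot is a hypothetical helper, and rounding down to midnight simply mirrors the example rather than any documented ingester requirement.

```shell
# Hypothetical helper: turn a snapshot timestamp (for example
# "2026-05-09 03:30:11") into a --since value at 00:00 UTC of the same day.
since_for_snapshot() {
  day=$(printf '%s' "$1" | cut -c1-10)   # keep the YYYY-MM-DD prefix
  case "$day" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9])
      printf '%sT00:00:00Z\n' "$day"
      ;;
    *)
      echo "unrecognized timestamp: $1" >&2
      return 1
      ;;
  esac
}

# Usage:
#   since_for_snapshot "2026-05-09 03:30:11"
```

Rounding down re-ingests a few hours that the snapshot already contains, which is harmless for sources where re-running is idempotent.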
restic check --read-data — periodic deep integrity verification
The structural restic check runs nightly via the substrate-nightly-backup success_determiner. It validates pack-file references and snapshot-tree integrity but doesn't re-download pack file content.
The deep check, restic check --read-data, downloads every pack file in the repo and verifies content hashes — catches bit rot, server-side corruption, and transmission errors that survived the structural check. Run monthly (or weekly if paranoid; standard tier means there's no per-GB retrieval cost concern):
    ssh josh
    source /etc/josh-backup.env
    restic check --read-data

This is also worth running before any actual restore, to confirm the snapshot you're about to restore from is structurally sound.
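If a full --read-data pass takes too long, restic check also accepts --read-data-subset=n/t, which reads one of t groups of pack files per run. The sketch below is not part of the existing backup job. It rotates the group by ISO week number so that each group comes up roughly once every t weeks. subset_for_week is a hypothetical helper.

```shell
# Hypothetical helper: map an ISO week number (01-53) onto an "n/t" argument
# for `restic check --read-data-subset`, cycling n through 1..t.
subset_for_week() {
  week=${1#0}    # drop a leading zero so "08" is not parsed as octal
  total=$2
  echo "$(( (week - 1) % total + 1 ))/$total"
}

# Usage on the host, after `source /etc/josh-backup.env`:
#   restic check --read-data-subset="$(subset_for_week "$(date -u +%V)" 4)"
```

The rotation is only approximate at year boundaries (week 53 and week 1 map to the same group), so the full check is still the one to run before an actual restore.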
Quick reference — one-shot restore
The full restore, condensed into a single block you can paste once you're confident:

    ssh josh "
      set -euo pipefail
      source /etc/josh-backup.env

      TMP=\$(mktemp -d /tmp/josh-restore.XXXXXX)

      echo 'restoring latest snapshot to:' \"\$TMP\"
      restic restore latest --target \"\$TMP\"

      SNAP=\"\$TMP/data/backups/josh-snap.db\"
      result=\$(sqlite3 \"\$SNAP\" 'PRAGMA integrity_check;')
      test \"\$result\" = 'ok' || { echo 'integrity_check FAILED:' \"\$result\"; exit 1; }
      echo 'integrity_check ok, snapshot ready at:' \"\$SNAP\"

      echo 'STOP josh-core/ingester/embedder via kamal app stop, then run the swap block from this runbook.'
    "

The one-shot intentionally stops before the swap step: swapping the live file under a running writer is the one thing this runbook can't do safely without manual confirmation.
Troubleshooting
restic snapshots returns "Fatal: unable to open config file". Either the RESTIC_REPOSITORY env var is wrong, the credentials are bad, or the bucket name is mistyped. Decrypt .kamal/secrets.age, copy the BACKUP_* + RESTIC_PASSWORD values into /etc/josh-backup.env, and retry. The bucket is josh-bucket (NYC3, standard tier); the repo prefix is josh-db-restic.
restic snapshots returns "wrong password or no key found". The RESTIC_PASSWORD in /etc/josh-backup.env doesn't match the password used to initialize the repo. Check the iCloud SSH Keys and Tokens.tar.age backup of the password.
integrity_check returns errors on the restored snapshot. Try the previous snapshot. File a follow-up: nightly restic check validates structural integrity, so a snapshot that passes structural but fails restored-DB integrity is interesting. Run restic check --read-data to see if the repo itself has bit rot.
After swap, queries return SQLITE_CORRUPT on first open. You probably skipped the WAL/SHM purge in step 5. Stop the container, rm /data/josh.db-wal /data/josh.db-shm, restart.
The container won't boot after restore. Check schema compatibility: if the restored snapshot is from a substrate version older than the deployed code expects, run alembic upgrade head against the restored file before booting (see migrations).
restic restore is slow or fails partway. Resumable: re-run the same command. Restic skips already-downloaded pack files. If a specific pack file is consistently failing, that's a sign the bucket has corruption — contact DO support and check whether restic check --read-data confirms the corruption against the same pack.
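Since the advice for a restore that fails partway is to re-run the same command, the re-run can be wrapped in a bounded retry loop. This is a generic sketch: retry is a hypothetical helper, and the attempt count and delay shown in the usage line are arbitrary.

```shell
# Hypothetical helper: run a command up to N times, sleeping between attempts.
# $1 = max attempts, $2 = delay in seconds, remaining args = the command.
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while :; do
    "$@" && return 0
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    echo "attempt $n failed; retrying in ${delay}s" >&2
    sleep "$delay"
    n=$((n + 1))
  done
}

# Usage on the host, with SNAPSHOT and TMP set as in step 2:
#   retry 3 30 restic restore "$SNAPSHOT" --target "$TMP"
```

Keeping the attempt count small matters here: a pack file that fails every time points at bucket corruption, and more retries will not fix that.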
Reference
- substrate-nightly-backup: the spec defining what gets backed up, on what cadence, with what retention
- substrate-volume-mount: why /data is the source-of-truth location and what lives there
- substrate-litestream-backup: the path-not-taken (continuous WAL streaming, deferred indefinitely)
- new-host-setup: how to spin up a fresh host that this restore can target
- CLAUDE.md: repo overview, server profile, deploy pattern
- restic docs: for everything not covered here