substrateshippedp0

Substrate SQLite foundation

substrate-sqlite-foundation · updated 2026-05-10T18:30:00Z · owner rritz

Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.

SQLite + WAL + sqlite-vec + FTS5 — the storage foundation under everything
else in Josh. One file at /data/josh.db, opened by josh-core (read-mostly)
and josh-ingester (batch writes) via the same connection helper. The
helper applies the substrate-wide PRAGMAs (WAL, foreign keys on, 30s
busy_timeout, NORMAL synchronous, 256 MB mmap) and loads the bundled
vec0 extension on every fresh DB-API connection. Decision locked
2026-05-07 — see the storage-stack rationale in
https://docs.usejosh.com/operations/architecture/.

As a service author, I want one connection factory that returns engines with the right PRAGMAs and `vec0` already loaded so that I can't accidentally bypass WAL or vector support.

As an OSS self-hoster, I want a single SQLite file as the substrate so that backups, migrations, and inspection are `cp` and `sqlite3 josh.db`.

  1. When a caller imports `josh_substrate.db.async_engine` and opens a connection, the engine shall have `journal_mode=wal`, `foreign_keys=on`, `busy_timeout=30000`, and `synchronous=NORMAL` applied on every fresh DB-API connection.
  2. When `josh_substrate.db.async_engine()` returns, `SELECT vec_version()` shall succeed against any connection produced by it (sqlite-vec is loaded via the connect-event hook in `db.py`).
  3. When `SUBSTRATE_DB_PATH` is set, `async_engine()` shall honour it; otherwise the substrate path shall default to `/data/josh.db`.
  4. Where any service in the workspace needs DB access, it shall go through `josh_substrate.db.async_engine` rather than constructing its own engine — enforced by code review against the conventions doc.
kindbash

Command

uv run python - <<'PY'
import asyncio, os, tempfile
from pathlib import Path
from sqlalchemy import text
from josh_substrate.db import async_engine

async def main() -> None:
    with tempfile.TemporaryDirectory() as td:
        os.environ["SUBSTRATE_DB_PATH"] = str(Path(td) / "probe.db")
        engine = async_engine()
        async with engine.connect() as conn:
            jm = (await conn.execute(text("PRAGMA journal_mode"))).scalar()
            fk = (await conn.execute(text("PRAGMA foreign_keys"))).scalar()
            bt = (await conn.execute(text("PRAGMA busy_timeout"))).scalar()
            sn = (await conn.execute(text("PRAGMA synchronous"))).scalar()
            vv = (await conn.execute(text("SELECT vec_version()"))).scalar()
        await engine.dispose()
        assert jm == "wal", f"journal_mode={jm!r}"
        assert fk == 1, f"foreign_keys={fk!r}"
        assert bt == 30000, f"busy_timeout={bt!r}"
        assert sn == 1, f"synchronous={sn!r} (1==NORMAL)"
        assert vv, f"vec_version() empty"
        print("OK", jm, fk, bt, sn, vv)

asyncio.run(main())
PY

Expect

Stdout starts with `OK wal 1 30000 1 ` and includes a sqlite-vec version string.

Self-contained: creates a temp DB, opens an engine, asserts each PRAGMA value and that `vec_version()` returns non-empty. Doesn't require the deployed substrate. Re-run any time the connection helper changes.

None.

  • Postgres, libSQL/Turso, or any other DB engine. SQLite is locked.
  • Schema definitions — those live in per-source migrations, not here.
  • Connection pooling beyond SQLAlchemy defaults — the substrate is single-host single-file.

None.

Implementation lives in
shared/josh_substrate/src/josh_substrate/db.py:

- SUBSTRATE_PRAGMAS tuple — single source of truth for the connection
PRAGMAs. Centralized here so josh-core, josh-ingester, and Alembic
env.py agree.
- _on_connect SQLAlchemy event hook — applies PRAGMAs, then toggles
enable_load_extension, loads sqlite_vec.loadable_path(), toggles
extension loading off again so downstream code can't inject other
extensions.
- async_engine(db_path=None, **kwargs) — public factory. Resolves
SUBSTRATE_DB_PATH (default /data/josh.db), wires the connect event
handler.

The sync→async bridge for enable_load_extension uses sqlalchemy.util.await_only
inside the connect event because aiosqlite's standard adapter doesn't
expose enable_load_extension synchronously — that method lives on the
underlying sqlite3.Connection on aiosqlite's worker thread.

0001_state_tables.py is the seed migration; subsequent migrations
(CRS, legislators+committees, ingestion-runs triggered_by, embedding
jobs) layer on top.

4 of 4 done.

  • t1 PRAGMAs + sqlite-vec load via SQLAlchemy connect event
  • t2 async_engine() factory honouring SUBSTRATE_DB_PATH
  • t3 Production deploy.yml sets SUBSTRATE_DB_PATH=/data/josh.db
  • t4 Conventions doc codifies the 'always go through async_engine' rule
  • 2026-05-10T18:30:00Z (new)shipped Spec content backfilled retrospectively. The implementation has been live in the workspace since the SQLite swap (2026-05-07); this commit replaces the [STUB] placeholders with real acceptance criteria, a mechanical bash success determiner, and a plan grounded in `db.py` and the seed migration.

docs/spec/substrate-sqlite-foundation.html · generated by bin/build-spec.py