Substrate SQLite foundation
Header
Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.
Why
SQLite + WAL + sqlite-vec + FTS5 — the storage foundation under everything
else in Josh. One file at /data/josh.db, opened by josh-core (read-mostly)
and josh-ingester (batch writes) via the same connection helper. The
helper applies the substrate-wide PRAGMAs (WAL, foreign keys on, 30s
busy_timeout, NORMAL synchronous, 256 MB mmap) and loads the bundledvec0 extension on every fresh DB-API connection. Decision locked
2026-05-07 — see the storage-stack rationale inhttps://docs.usejosh.com/operations/architecture/.
User stories
As a service author, I want one connection factory that returns engines with the right PRAGMAs and `vec0` already loaded so that I can't accidentally bypass WAL or vector support.
As an OSS self-hoster, I want a single SQLite file as the substrate so that backups, migrations, and inspection are `cp` and `sqlite3 josh.db`.
Acceptance criteria (EARS)
- When a caller imports `josh_substrate.db.async_engine` and opens a connection, the engine shall have `journal_mode=wal`, `foreign_keys=on`, `busy_timeout=30000`, and `synchronous=NORMAL` applied on every fresh DB-API connection.
- When `josh_substrate.db.async_engine()` returns, `SELECT vec_version()` shall succeed against any connection produced by it (sqlite-vec is loaded via the connect-event hook in `db.py`).
- When `SUBSTRATE_DB_PATH` is set, `async_engine()` shall honour it; otherwise the substrate path shall default to `/data/josh.db`.
- Where any service in the workspace needs DB access, it shall go through `josh_substrate.db.async_engine` rather than constructing its own engine — enforced by code review against the conventions doc.
Success determiner
Command
uv run python - <<'PY'
import asyncio, os, tempfile
from pathlib import Path
from sqlalchemy import text
from josh_substrate.db import async_engine
async def main() -> None:
with tempfile.TemporaryDirectory() as td:
os.environ["SUBSTRATE_DB_PATH"] = str(Path(td) / "probe.db")
engine = async_engine()
async with engine.connect() as conn:
jm = (await conn.execute(text("PRAGMA journal_mode"))).scalar()
fk = (await conn.execute(text("PRAGMA foreign_keys"))).scalar()
bt = (await conn.execute(text("PRAGMA busy_timeout"))).scalar()
sn = (await conn.execute(text("PRAGMA synchronous"))).scalar()
vv = (await conn.execute(text("SELECT vec_version()"))).scalar()
await engine.dispose()
assert jm == "wal", f"journal_mode={jm!r}"
assert fk == 1, f"foreign_keys={fk!r}"
assert bt == 30000, f"busy_timeout={bt!r}"
assert sn == 1, f"synchronous={sn!r} (1==NORMAL)"
assert vv, f"vec_version() empty"
print("OK", jm, fk, bt, sn, vv)
asyncio.run(main())
PY
Expect
Self-contained: creates a temp DB, opens an engine, asserts each PRAGMA value and that `vec_version()` returns non-empty. Doesn't require the deployed substrate. Re-run any time the connection helper changes.
Clarifications needed
None.
Out of scope
- Postgres, libSQL/Turso, or any other DB engine. SQLite is locked.
- Schema definitions — those live in per-source migrations, not here.
- Connection pooling beyond SQLAlchemy defaults — the substrate is single-host single-file.
Dependencies
None.
Plan
Implementation lives inshared/josh_substrate/src/josh_substrate/db.py:
- SUBSTRATE_PRAGMAS tuple — single source of truth for the connection
PRAGMAs. Centralized here so josh-core, josh-ingester, and Alembic
env.py agree.
- _on_connect SQLAlchemy event hook — applies PRAGMAs, then toggles
enable_load_extension, loads sqlite_vec.loadable_path(), toggles
extension loading off again so downstream code can't inject other
extensions.
- async_engine(db_path=None, **kwargs) — public factory. Resolves
SUBSTRATE_DB_PATH (default /data/josh.db), wires the connect event
handler.
The sync→async bridge for enable_load_extension uses sqlalchemy.util.await_only
inside the connect event because aiosqlite's standard adapter doesn't
expose enable_load_extension synchronously — that method lives on the
underlying sqlite3.Connection on aiosqlite's worker thread.
0001_state_tables.py is the seed migration; subsequent migrations
(CRS, legislators+committees, ingestion-runs triggered_by, embedding
jobs) layer on top.
Tasks
4 of 4 done.
- t1 PRAGMAs + sqlite-vec load via SQLAlchemy connect event
- t2 async_engine() factory honouring SUBSTRATE_DB_PATH
- t3 Production deploy.yml sets SUBSTRATE_DB_PATH=/data/josh.db
- t4 Conventions doc codifies the 'always go through async_engine' rule
Changelog
-
2026-05-10T18:30:00Z
(new)→shippedSpec content backfilled retrospectively. The implementation has been live in the workspace since the SQLite swap (2026-05-07); this commit replaces the [STUB] placeholders with real acceptance criteria, a mechanical bash success determiner, and a plan grounded in `db.py` and the seed migration.