substrateshippedp0

Substrate FTS5 pipeline

substrate-fts5-pipeline · updated 2026-05-10T18:30:00Z · owner rritz

Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.

Full-text search via SQLite's FTS5 — external-content virtual tables
paired with the primary record tables, plus the standard three-trigger
pattern (AFTER INSERT / AFTER UPDATE / AFTER DELETE) that keeps the
FTS5 shadow row in sync with the parent row. BM25 with per-column
weights is the ranking surface; (10.0, 4.0, 2.0, 1.0) for
(title, abstract, action, body) is the working default, applied at
query time in the search wrapper that ships with rest-api-search.
FTS5 ranks better than Postgres ts_rank_cd for our use case (native
IDF + length normalization); choosing FTS5 over the alternatives is
one of the reasons the SQLite-as-substrate decision was a search
upgrade, not a downgrade.

As a query-time agent, I want BM25 results across every text-bearing source via one SQL primitive so that I don't compose four different search APIs for one question.

As a Source author, I want a copy-pasteable migration block (FTS5 table + 3 triggers) so that adding a new source's search surface is mechanical.

As a query writer, I want column weights to be tunable at query time so that per-source weight experimentation doesn't require a re-index.

  1. Where a substrate source has body text, its migration shall create an FTS5 virtual table with `content='<parent>'` and `content_rowid='rowid'` so FTS5 stays in sync with the parent table without storing duplicate text (see `crs_reports_fts` in `0002_crs.py` line 226-).
  2. When a row is inserted into the parent table, an `AFTER INSERT` trigger shall populate the FTS5 shadow row; when the parent row is updated, the `AFTER UPDATE` trigger shall emit FTS5's `'delete'` instruction followed by a fresh insert; when the parent row is deleted, the `AFTER DELETE` trigger shall emit the same `'delete'` instruction.
  3. When a query runs `SELECT … FROM <source>_fts WHERE <source>_fts MATCH ?`, the system shall return rows ranked by BM25; the ranking function `bm25(<source>_fts, w_title, w_abstract, w_action, w_body)` shall be available for per-column weight tuning at query time.
  4. Where a source has multiple FTS-eligible parent tables (e.g., legislators-and-committees), each shall get its own FTS5 table and trigger trio (`legislators_fts`, `committees_fts` in `0003_legislators_committees.py`).
  5. When `alembic upgrade head` runs against a fresh DB, every shipped source's FTS5 tables shall exist and pass a smoke `MATCH` query (verified by the determiner).
kindbash

Command

set -e
tmp=$(mktemp -d)
SUBSTRATE_DB_PATH="$tmp/probe.db" \
  uv run --directory shared/josh_substrate alembic upgrade head
# Every shipped source's FTS5 table must exist and be queryable.
uv run python - <<PY
import asyncio, os, sys
os.environ["SUBSTRATE_DB_PATH"] = "$tmp/probe.db"
from sqlalchemy import text
from josh_substrate.db import async_engine

async def main() -> None:
    engine = async_engine()
    async with engine.connect() as conn:
        for tbl in ("crs_reports_fts", "legislators_fts", "committees_fts"):
            rows = (await conn.execute(text(
                f"SELECT name FROM sqlite_master WHERE type='table' AND name=:n"
            ), {"n": tbl})).all()
            assert rows, f"{tbl} missing"
            # Smoke MATCH — empty result is fine; we only care it parses.
            await conn.execute(text(f"SELECT * FROM {tbl} WHERE {tbl} MATCH 'probe' LIMIT 1"))
    await engine.dispose()
    print("OK")
asyncio.run(main())
PY
rm -rf "$tmp"

Expect

Stdout ends with `OK`. Every shipped source's FTS5 table is present in `sqlite_master` and a `MATCH` query parses cleanly.

Determiner targets the schema, not the corpus — proves the FTS5 pipeline is wired even on an empty substrate. After backfill, the same approach scales: `SELECT count(*) FROM crs_reports_fts WHERE crs_reports_fts MATCH 'medicaid'` should return >0 once CRS reports are loaded.

None.

  • Cross-source unified search. That's the surface concern in `rest-api-search`; the substrate primitive is per-source FTS5 tables.
  • Custom tokenizers. Default `unicode61` is fine for v1.
  • Phrase suggestions / autocomplete. Surface concern, not substrate.

Implementation pattern, copy-pasteable for new sources:

1. Migration creates the parent <source> table with a body TEXT
column populated by the normalizer.
2. CREATE VIRTUAL TABLE <source>_fts USING fts5(title, abstract, action,
body, content='<source>', content_rowid='rowid')
— external-content
mode means FTS5 stores only the index, not the text.
3. Three triggers maintain sync: AFTER INSERT populates the shadow,
AFTER UPDATE issues the 'delete' opcode + re-inserts (FTS5's
pattern for external-content updates), AFTER DELETE issues
'delete'. See lines 239-265 of 0002_crs.py for the canonical
example.
4. Query-time wrapper in josh-core (when rest-api-search ships)
applies the per-column weight tuple via bm25(<source>_fts, …).

Live tables (per migrations 0002 + 0003):

- crs_reports_fts — title, abstract, action, body
- legislators_fts — name search
- committees_fts — committee name + jurisdiction

New sources add their FTS5 block to their own migration; the runner
doesn't need any per-source registration.

4 of 5 done.

  • t1 External-content FTS5 + three-trigger sync pattern documented and demonstrated
  • t2 crs_reports_fts shipped (migration 0002)
  • t3 legislators_fts + committees_fts shipped (migration 0003)
  • t4 Determiner verifies tables exist + MATCH parses on empty substrate
  • t5 Per-column weight tuple (10.0, 4.0, 2.0, 1.0) wired in the rest-api-search wrapper — tracked under that spec
  • 2026-05-10T18:30:00Z (new)shipped Spec content backfilled retrospectively. The FTS5 external-content pattern + three-trigger sync has been live since the CRS Reports ingester (migration 0002) and was extended for legislators/committees in migration 0003. This commit replaces the [STUB] placeholders with acceptance criteria grounded in the migrations and a determiner that proves the schema is intact on a fresh DB. The per-column weight tuple lives query-side and tracks under `rest-api-search`.

docs/spec/substrate-fts5-pipeline.html · generated by bin/build-spec.py