REST API — unified search endpoint

Why

Search is the agent's most-used tool. A single GET /v1/search that
unifies FTS5 BM25 + vector retrieval across every text-bearing source
is the smallest API surface that proves the substrate is a real product
rather than a database with a pretty face. Per the v1 plan: this is
the first endpoint shipped after schema validation against two real
sources.

This endpoint is the primary backend for mcp-server's search,
lexical_search, and semantic_search tools (the tool name search is
mandatory for ChatGPT App compatibility). The mode parameter on this endpoint
is what lets one REST endpoint back three MCP tools without code
duplication. Result IDs must round-trip through rest-api-resource-endpoints'
fetch handlers — that round-trip is the load-bearing contract between
search and mcp-server's fetch tool.

User stories

As a Cowork / ChatGPT-desktop / Cursor user who installed Josh's MCP, I want to ask 'find regs from EPA in 2026 about PFAS' and have the agent issue one HTTP call so that I get cited results without waterfalling through per-source endpoints.

As an OSS self-hoster pointing curl at my own substrate, I want `GET /v1/search?q=...&source=fr,bills&limit=20` to return BM25-ranked + vector-rescored hits so that I don't have to learn the underlying SQLite + FTS5 + sqlite-vec stack.

As an agent assembling a citation-rich answer, I want every result to carry a `citation` block with `citation_string` + `url` so that I can drop citations into the deliverable without a second round-trip.

Acceptance criteria (EARS)

When a client sends `GET /v1/search?q=<query>` with a valid API key, the system shall return a JSON `{results: [...], total: N, took_ms: M}` payload within 500 ms p95 against the v1 substrate.
When the request specifies `source=fr,bills` (comma-separated), the system shall restrict results to those source tables; when `source` is omitted, all text-bearing sources are searched.
When the request specifies `mode=lexical`, only the FTS5 BM25 path shall execute; `mode=semantic` shall run only the vector path with filters applied as pre-filters; `mode=hybrid` (default) shall run both and fuse with Reciprocal Rank Fusion.
Where a source publishes body text >5K tokens and `mode` is `hybrid` or `semantic`, the system shall include vector-rescored hits in the merged result set; if `mode=lexical` the system shall return BM25 hits only.
When the search endpoint receives a `source` value that is NOT body-bearing or short-body (i.e., a registry source per `https://docs.usejosh.com/operations/query-flows/`), the system shall return HTTP 400 with `error.code='source_not_searchable'`; `error.hint` shall include `offending_sources` (the rejected source IDs), `redirect_to` (the `rest-api-entity-resolution` endpoint for that source), and `valid_sources` (the body-bearing + short-body source IDs).
When `mode=semantic` is requested with a `source` that is short-body (has body FTS5 but no vectors — e.g., SAPs), the system shall return HTTP 400 with `error.code='source_not_searchable_semantically'`; `error.hint` shall include `valid_sources` (the body-bearing source IDs that DO have vectors).
When `mode=hybrid` is requested with a short-body source mixed with body-bearing sources, the system shall execute only the lexical leg for the short-body source, run hybrid for the body-bearing sources, and include a top-level `degraded` block in the response body of shape `{from: 'hybrid', to: 'lexical', per_source: {<source>: <reason>}}` — never silently change semantics without surfacing it.
Where `source` is omitted (search across all eligible sources) and `mode=semantic`, the system shall restrict the candidate set to body-bearing sources only (excluding short-body and registry); the response shall include `degraded.excluded_sources` listing the sources skipped.
Where the source eligibility classification is consulted at runtime, the system shall derive it from the live schema (presence of `<source>_chunks_vec0` for body-bearing; presence of `body_text` column with FTS5 but no chunks for short-body; neither for registry) — not from a hard-coded list.
When any result is returned, each item shall carry an `id` field whose value is accepted as input to the matching `GET /v1/<resource>/{id}` endpoint (round-trip contract with `rest-api-resource-endpoints` and `mcp-server`'s `fetch` tool).
When any result is returned, each item shall carry a `citation` object with non-null `citation_string`, `url`, and `published_at`.
If the API key is missing, invalid, or rate-limited, then the system shall return HTTP 401 (missing), 401 (invalid), or 429 (rate-limited) with a JSON `{error: ...}` body.
While the substrate is empty for a given source, the system shall return `total: 0` for that source rather than failing the whole request.
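
The request and response contract above can be sketched from the client side. This is a minimal illustration, not the shipped client: the host `api.example.invalid` and the helper names `build_search_url` and `check_response_shape` are hypothetical, and only the parameter names (`q`, `source`, `mode`, `limit`, `offset`) and response fields come from this spec.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.example.invalid"  # hypothetical host, not from the spec


def build_search_url(q, sources=None, mode="hybrid", limit=20, offset=0):
    """Build a GET /v1/search URL; `source` is sent comma-separated."""
    params = {"q": q, "mode": mode, "limit": limit, "offset": offset}
    if sources:
        params["source"] = ",".join(sources)
    return f"{BASE_URL}/v1/search?{urlencode(params)}"


def check_response_shape(payload):
    """Return a list of contract violations (an empty list means the shape is OK)."""
    problems = []
    for key in ("results", "total", "took_ms"):
        if key not in payload:
            problems.append(f"missing top-level key: {key}")
    for i, item in enumerate(payload.get("results", [])):
        if not item.get("id"):
            problems.append(f"results[{i}] has no id")
        citation = item.get("citation") or {}
        for field in ("citation_string", "url", "published_at"):
            if citation.get(field) is None:
                problems.append(f"results[{i}].citation.{field} is null")
    return problems


print(build_search_url("PFAS", sources=["fr", "bills"]))
```

A shape checker like this could back the citation-block and payload assertions in the test file, though the real tests may be structured differently.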

Success determiner

Kind

test_file

Path

josh-core/tests/test_search_endpoint.py

Runner

kamal app exec --reuse "pytest josh-core/tests/test_search_endpoint.py -v"

The test file exercises every acceptance criterion: valid query against seeded FR + bills fixtures, source-filter behavior, citation block shape, auth failure modes, empty-source fallback, p95 latency budget.

Clarifications needed

RRF `k` parameter — community default is 60. Smaller corpora (Josh's v1 is ≤ 5M chunks) sometimes prefer k=10–20. Need a small offline eval before locking.
Pagination: locked to `offset` + `limit` per `rest-api-conventions` §1 (search uses offset, list endpoints use cursor).
Should `q` accept FTS5 native MATCH syntax (`"medical device"`, `term*`) directly, or do we sanitize first? Current leaning: direct passthrough with a small allowlist of operators.

Out of scope

Per-source resource endpoints (`GET /v1/fr/{id}`, etc.) — separate spec `rest-api-resource-endpoints` (consumed alongside this spec by `mcp-server`'s `fetch` tool).
MCP tool surface, transport, and OAuth posture — separate spec `mcp-server`. This endpoint is the canonical contract; MCP is a thin wrapper.
Pure structured (identifier-only) lookups like "get HR103" or "list Ways and Means members" — those route to `rest-api-resource-endpoints`, never enter the search path.
Fuzzy entity resolution (find legislator by noisy name, find committee by colloquial name) — separate spec `rest-api-entity-resolution`. Search and resolution are surface-disjoint.
Analytical queries (counts, top-N, time-series) — separate spec `rest-api-aggregations`.
Cross-source fan-out from one ID — separate spec `rest-api-dossiers`.
Saved searches / alerts / webhooks — Step 2 territory.
Prediction or other downstream models — out of scope for the substrate.

Dependencies

Plan

FastAPI josh-core router at app/routers/search.py. The router calls a
SearchService that fans out per-source queries, runs each source's
FTS5 + vector pipeline, and merges. Per-source query plans (which tables,
which weights) live in josh_substrate.search.plans so the router stays
source-agnostic — adding a new source means adding a plan, not editing
the search router.

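Vector retrieval uses the BQ+rescore pattern (binary quantization, then
full-precision rescoring): brute-force scan of the bit[1024] column for
top-K candidates (K ≈ 100), then rescore the top-K with float[1024] cosine.
At the v1 corpus size (≤ 5M chunks) this is expected to stay comfortably
inside the 500 ms p95 budget; the latency test confirms it.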
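
The two-stage retrieval can be sketched in pure Python. This is an illustration under stated assumptions, not the substrate's code: quantization here is a sign threshold per dimension (a common choice, but not confirmed by this spec), bit vectors are packed into Python ints rather than a bit[1024] column, and the names `quantize`, `hamming`, and `bq_rescore_search` are hypothetical.

```python
import math


def quantize(vec):
    """Pack the sign of each dimension into one int (1 bit per dimension)."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits


def hamming(a, b):
    """Number of differing bits between two packed bit vectors."""
    return bin(a ^ b).count("1")


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def bq_rescore_search(query, corpus, k_candidates=100, limit=10):
    """corpus maps chunk_id -> float vector. Returns [(chunk_id, cosine), ...]."""
    q_bits = quantize(query)
    # Stage 1: brute-force Hamming scan over the binary vectors.
    # (A real system would store the bits precomputed, not quantize per query.)
    coarse = sorted(corpus, key=lambda cid: hamming(q_bits, quantize(corpus[cid])))
    coarse = coarse[:k_candidates]
    # Stage 2: rescore only the surviving candidates with full-precision cosine.
    rescored = sorted(((cosine(query, corpus[cid]), cid) for cid in coarse), reverse=True)
    return [(cid, score) for score, cid in rescored[:limit]]


corpus = {
    "a": [1.0, 0.5, -1.0, -0.2],
    "b": [-1.0, -1.0, 1.0, 1.0],
    "c": [1.0, 1.0, -1.0, -1.0],
}
hits = bq_rescore_search([1.0, 0.4, -1.0, -0.3], corpus, k_candidates=2)
print([cid for cid, _ in hits])
```

The coarse stage only has to keep the true neighbours inside the top-K; the final ordering comes entirely from the float rescore.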

Mode parameter (?mode=lexical|semantic|hybrid, default hybrid) is the
contract surface for mcp-server's three search tools (lexical_search,
semantic_search, search). One endpoint, three modes, three MCP tools —
the agent picks by tool name; this endpoint just executes. When
mode=semantic, source / since / until filters are applied
as pre-filters on the candidate set before vector scoring (matches
Harvey / LanceDB / pgvector + HNSW guidance), not as post-filters on top-K.

Merge strategy for mode=hybrid: Reciprocal Rank Fusion with k=60
(community default; revisit if the routing eval shows the corpus is small
enough to prefer k=10–20). RRF avoids comparing incompatible BM25 and
cosine scores — purely rank-based, drop-in safe.
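
A minimal sketch of the fusion step, assuming each leg hands back an ordered list of result IDs (best first). The function name `rrf_fuse` and the tie-break on ID are illustrative choices, not part of this spec.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


lexical = ["a", "b", "c"]    # BM25 order
semantic = ["b", "d", "a"]   # cosine order
print(rrf_fuse([lexical, semantic]))
```

Only ranks enter the formula, so the raw BM25 and cosine scores never need to be normalized against each other. A smaller k gives the top ranks of each list more weight relative to the tail.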

## Source eligibility (search is body-only)

The substrate splits into three shapes (matrix maintained in
https://docs.usejosh.com/operations/query-flows/):

1. Body-bearing (11 sources at v1: bills, cbo-cost-estimates,
committee-reports, congressional-record, crs-reports, ecfr-and-cfr,
federal-register, gao-reports, hearing-transcripts, public-laws,
us-code) — have <source>_fts + <source>_chunks +
<source>_chunks_vec0. All three modes work.
2. Short-body (1 source: statements-of-administration-policy) —
have <source>_fts over a full body but no chunks/vec0 (SAPs are
1-3 pages, per-doc embedding deferred to v1.x). Modes lexical
and hybrid (downgraded) work; semantic rejects.
3. Registry / lookup (9 sources: legislators, committees,
committee-memberships, hearings, lda-filings,
regulations-dot-gov-dockets, roll-call-votes, staff-directories,
topic-taxonomy) — out of scope for this endpoint. Their FTS5
indexes are over short structured fields (names, jurisdictions)
for fuzzy ID lookup, not body BM25. They route through
rest-api-entity-resolution (GET /v1/<source>?q=<noisy_name>)
which returns full records, not search hits.

The search service derives this classification at request time from
the live schema (SELECT name FROM sqlite_master WHERE name LIKE '%_chunks_vec0') — never hard-coded. New body-bearing sources
become searchable automatically; the registry/body split is locked
by ingester contract.
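
The schema-derived classification might look like the sketch below. Assumptions are loud here: the `body_text` column is taken to live on the `<source>_fts` table, plain tables stand in for the FTS5 and vec0 virtual tables (those extensions may not be loaded), and `classify_sources` is a hypothetical name. Suffixes are matched in Python because `_` is a single-character wildcard in SQLite's LIKE, so an unescaped `'%_chunks_vec0'` pattern is slightly looser than it looks.

```python
import sqlite3

FTS_SUFFIX = "_fts"
VEC_SUFFIX = "_chunks_vec0"


def column_names(conn, table):
    """Column names of a table, via PRAGMA table_info (row[1] is the name)."""
    return {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}


def classify_sources(conn):
    """Map source -> 'body-bearing' | 'short-body' | 'registry' from the live schema."""
    tables = {
        row[0]
        for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    }
    shapes = {}
    for table in sorted(tables):
        if not table.endswith(FTS_SUFFIX):
            continue
        source = table[: -len(FTS_SUFFIX)]
        if source + VEC_SUFFIX in tables:
            shapes[source] = "body-bearing"   # FTS plus a vector index
        elif "body_text" in column_names(conn, table):
            shapes[source] = "short-body"     # full-body FTS, no vectors
        else:
            shapes[source] = "registry"       # FTS over short structured fields
    return shapes


# Stand-in schema for the demo; real deployments would have virtual tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bills_fts (body_text TEXT);
    CREATE TABLE bills_chunks_vec0 (embedding BLOB);
    CREATE TABLE saps_fts (body_text TEXT);
    CREATE TABLE legislators_fts (name TEXT, state TEXT);
""")
SHAPES = classify_sources(conn)
print(SHAPES)
```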

Behavior on mismatch:
- Any source that is registry-shape → HTTP 400
source_not_searchable with hint.redirect_to pointing at the
rest-api-entity-resolution endpoint. Search and entity resolution
are surface-disjoint and intent-disjoint (Postgres tsvector vs
pg_trgm pattern).
- mode=semantic against a short-body source → HTTP 400
source_not_searchable_semantically with hint.valid_sources.
- mode=hybrid against a short-body source mixed with body-bearing
sources → body sources get hybrid, short-body source gets lexical
only, response carries a degraded block surfacing the downgrade
per source.
- mode=semantic with source omitted → restricted to body-bearing
only; short-body and registry sources are excluded and listed in
degraded.excluded_sources.
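
The mismatch rules can be expressed as one planning function that runs before any query executes. This is a sketch that follows the EARS acceptance criteria for each case; `plan_search`, the `legs` return shape, the `/v1/<source>` form of `redirect_to`, and the reason string are illustrative assumptions. It also assumes `mode` has already been validated and that every requested source is known.

```python
def plan_search(mode, requested, shapes):
    """Decide which leg runs per source, or return a 400-style rejection.

    shapes maps source -> 'body-bearing' | 'short-body' | 'registry'.
    requested is a list of source IDs, or None when `source` was omitted.
    """
    body = sorted(s for s, shape in shapes.items() if shape == "body-bearing")
    short = sorted(s for s, shape in shapes.items() if shape == "short-body")
    registry = sorted(s for s, shape in shapes.items() if shape == "registry")

    def reject(code, hint):
        return {"status": 400, "error": {"code": code, "hint": hint}}

    if requested is not None:
        offending = [s for s in requested if s in registry]
        if offending:
            return reject("source_not_searchable", {
                "offending_sources": offending,
                "redirect_to": {s: f"/v1/{s}" for s in offending},  # assumed shape
                "valid_sources": sorted(body + short),
            })
    targets = list(requested) if requested is not None else sorted(body + short)

    degraded = None
    if mode == "semantic":
        if requested is None:
            targets = body
            excluded = sorted(short + registry)
            if excluded:
                degraded = {"excluded_sources": excluded}
        elif any(s in short for s in targets):
            return reject("source_not_searchable_semantically", {"valid_sources": body})
        legs = {s: "semantic" for s in targets}
    elif mode == "hybrid":
        legs = {s: ("lexical" if s in short else "hybrid") for s in targets}
        downgraded = {s: "no vector index" for s in targets if s in short}
        if downgraded:
            degraded = {"from": "hybrid", "to": "lexical", "per_source": downgraded}
    else:  # lexical
        legs = {s: "lexical" for s in targets}
    return {"status": 200, "legs": legs, "degraded": degraded}


shapes = {"bills": "body-bearing", "saps": "short-body", "legislators": "registry"}
print(plan_search("hybrid", ["bills", "saps"], shapes))
```

Keeping the decision in one pure function makes each row of the behavior list directly testable without a database or HTTP layer.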

Auth: a small Depends(api_key) resolver against the api_keys table
for managed deployments. OSS self-host: JOSH_AUTH=disabled skips the dependency.

Citation block is built per-result by calling
josh_substrate.citations.formatters.<source>.citation_for(record). The
formatter contract is shared with the loader, which is what makes the
citation block free at query time (it's a column read, not a re-derive).

Tasks

0 of 14 done.

Changelog

2026-05-13T12:00:00Z planned→planned Refined to be the explicit backend for mcp-server. Added `?mode=` parameter so one endpoint backs `search` / `lexical_search` / `semantic_search` MCP tools. Locked RRF (k=60 default) as the hybrid merge strategy. Added round-trip `id` requirement so search results are valid input to mcp-server's `fetch` tool. Marked identifier-only lookups out of scope (they belong to rest-api-resource-endpoints). No behavioral change to in-scope paths — semantic pre-filter was already implicit but is now explicit.
2026-05-13T14:00:00Z planned→planned Source eligibility added. Audit of substrate sources surfaced three shapes — body-bearing (11), short-body (1: SAPs), registry/lookup (9) — and only the first supports semantic/hybrid retrieval. Locked behavior: `mode=semantic` rejects non-vector sources with HTTP 400 + `source_not_searchable_semantically`; `mode=hybrid` against a registry source auto-downgrades to lexical and surfaces the change in a top-level `degraded` block. Classification derived from live schema, not hard-coded — new body-bearing sources become eligible automatically. Matrix lives in `https://docs.usejosh.com/operations/query-flows/`.
2026-05-13T15:00:00Z planned→planned Registry-source auto-downgrade removed. After the 64-query coverage analysis and the production-systems research (Postgres `tsvector` vs `pg_trgm` is the canonical split), fuzzy entity lookup over registry sources moved to a dedicated spec `rest-api-entity-resolution`. This endpoint now rejects registry sources outright with HTTP 400 `source_not_searchable` and a `redirect_to` hint. Search and entity resolution are surface-disjoint, intent-disjoint, and never overlap. Short-body sources (SAPs) still auto-downgrade for `mode=hybrid` because they're genuinely body sources without a vector index — that's a real `degraded` case, not a category error.