REST API — entity resolution

Why

Registry sources (legislators, committees, lobbyists, staffers,
taxonomies) need a typed fuzzy-name-to-ID resolver that is distinct
from both search and ID lookup. The 64-query coverage analysis
(https://docs.usejosh.com/operations/query-coverage/) surfaced this as Category 5:
questions like "Sen. Markey from Mass," "Katie O'Brian on E&C," "Marc
Lampkin lobbyist with uncertain spelling," "House Ag — which
committee?" — each is a fuzzy match against short structured fields,
not BM25 over a body and not an exact ID lookup.

Production-system precedent is decisive on this split. Postgres uses
different operators and indexes for the two cases (tsvector for
body lexical search, pg_trgm for fuzzy entity names). Elasticsearch
unifies the surface but the field mapping (text vs keyword) is
load-bearing. Treating fuzzy registry lookup as a mode of
lexical_search (the prior "Flow 1b" framing) conflates two
different operations with different intents, result shapes, and
relevance models.

This spec locks the resolver as a ?q=<noisy_name> query parameter
on the list-endpoint surface of each registry source. The MCP server
exposes it via a resolve_entity tool (mcp-server Class D),
scoped by entity type. Body-bearing sources continue to use
rest-api-search for free-text retrieval; the two never overlap.

User stories

As an agent given "Sen. Markey from Massachusetts" with no bioguide ID in hand, I want to call `GET /v1/legislators?q=Markey&state=MA` and get back a ranked list of matching legislators with their bioguide IDs so that every downstream tool call (votes, sponsored bills, committee memberships) can use a canonical ID, not a noisy string.

As a journalist with the staffer's name as "Katie O'Brian" but uncertain of the spelling, I want to call `GET /v1/staff-directories?q=Katie+O+Brian&committee=House+Energy+Commerce` and get fuzzy matches ranked by relevance so that I can find the right person even when I have the spelling slightly wrong.

As a lobbyist looking up "Marc Lampkin" across LDA filings, I want to call `GET /v1/lda-filings?q=Marc+Lampkin` and find the canonical registrant record so that I can pull every filing they've made, regardless of which client they represented at the time.

As a constituent asking about "House Ag", I want to call `GET /v1/committees?q=House+Ag` and get back the canonical committee record so that I learn the committee's Thomas ID and can fetch its members, hearings, and reports through structured endpoints.

Acceptance criteria (EARS)

When a client sends `GET /v1/<registry-source>?q=<text>` against any of `legislators`, `committees`, `staff-directories`, `lda-filings`, `regulations-dot-gov-dockets`, `roll-call-votes`, `hearings`, or `topic-taxonomy`, the system shall execute an FTS5 match against the source's documented name/short-field index and return ranked records (best match first).
Where a body-bearing source receives a `?q=` parameter on its list endpoint, the request shall be rejected with HTTP 400 `error.code='use_search_for_body'` and `hint.redirect_to='/v1/search?source=<source>&q=<text>'` — body-bearing sources route through `rest-api-search`, not here.
When entity-resolution returns a result, each item shall be the resource's documented `card` fieldset (per `rest-api-conventions` §9), with the resource's `id` always present — NOT a search-card with snippet/highlights.
Where the resolver returns multiple matches, the response shall include a `match_score` field per result (FTS5 BM25 score, smaller is better) and shall be ordered ascending by that score.
When the resolver returns zero matches, the response shall be `{data: [], total: 0, has_more: false}` — never HTTP 404 (a 404 means the endpoint is wrong; an empty result means no entity matched).
When `?q=` is combined with other filter parameters defined in `rest-api-resource-endpoints` (e.g., `?q=Markey&state=MA&congress=119`), the system shall apply structured filters BEFORE the FTS5 match (pre-filter) so the FTS5 candidate set is bounded.
Where multiple structured filters are passed, the system shall AND them and apply the FTS5 ranking on the remaining candidates.
When `?q=` is set, pagination shall be cursor-based (matching `rest-api-conventions` §1 for list endpoints), NOT offset-based; search-style offset is reserved for `rest-api-search`.
Where a source has no FTS5 index in the substrate schema (e.g., `committee-memberships`, which is a pure join table), the resolver shall return HTTP 400 `error.code='source_has_no_resolver'` with `hint.use_endpoint` pointing at the typed lookup path (`/v1/committees/{id}/members?congress=`).
When the resolver is called against a registry source with the v1 substrate populated, the response shall be returned within 100 ms p95 (FTS5 over short fields is cheap; this budget is tighter than the 500 ms search budget).
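
The envelope rules above (ascending `match_score`, `id` always present, an empty list rather than a 404) can be sketched as follows. This is a minimal illustration, not the spec's implementation: `build_envelope`, the `(card, score)` input shape, and the reading of `total` as the full match count are assumptions, and cursor handling is left out.

```python
from typing import Any


def build_envelope(
    hits: list[tuple[dict[str, Any], float]], limit: int = 20
) -> dict[str, Any]:
    """Wrap resolver hits in the list envelope, best match first.

    `hits` pairs a resource card with its FTS5 bm25() score.
    Smaller scores are better, so the sort is ascending.
    """
    ranked = sorted(hits, key=lambda pair: pair[1])
    data = []
    for card, score in ranked[:limit]:
        if "id" not in card:
            # The AC requires the resource id on every item.
            raise ValueError("every resolver card must include 'id'")
        data.append({**card, "match_score": score})
    # Zero matches yields {data: [], total: 0, has_more: False}, never a 404.
    return {"data": data, "total": len(ranked), "has_more": len(ranked) > limit}
```

An empty `hits` list produces the documented empty envelope, so the router never needs a special no-match branch.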

Success determiner

Kind

test_file

Path

josh-core/tests/test_entity_resolution.py

Runner

uv run pytest josh-core/tests/test_entity_resolution.py -v

Contract test against the live FastAPI app covering every AC:

- Happy path resolution per registry source with seeded fixtures (Markey/MA, House Ag, Pfizer, etc.).
- Composability: `?q=Markey&state=MA&congress=119` returns the right subset.
- Body-bearing rejection: `GET /v1/bills?q=climate` → 400 with `use_search_for_body` and redirect hint.
- No-resolver rejection: `GET /v1/committee-memberships?q=anything` → 400 with `source_has_no_resolver`.
- Empty result: noisy non-existent name returns `{data: [], total: 0}`, not 404.
- Pagination: cursor-based, `next_cursor` survives across pages.
- Latency: p95 < 100 ms against v1 fixtures (asserted).
- Round-trip with `mcp-server`'s `resolve_entity` tool: same path, same envelope.

Determiner currently fails because the per-source list endpoints don't yet accept `?q=` (tracked in `rest-api-resource-endpoints` t14, the new filter task this spec adds as a dependency).

Clarifications needed

Should the resolver accept multi-token queries with FTS5 native operators (`"Markey" AND MA`), or sanitize to plain phrase matching? Leaning toward plain phrase matching for v1: agents pass natural text, not query DSL.
Trigram (pg_trgm-style) similarity beats FTS5 BM25 for true typo tolerance, but SQLite has no native pg_trgm-style similarity operator (the FTS5 `trigram` tokenizer, available since SQLite 3.34, supports substring matching, not similarity ranking). Should we add a custom tokenizer or live with FTS5 for v1? Leaning toward FTS5: the fields are short enough that BM25 with prefix matching should cover the most common near-misses (truncated or partial names, though not mid-word typos), and adding a tokenizer is a build-step change.
Should `resolve_entity` accept multiple entity types in one call (`resolve_entity('Markey', type=['legislator','committee'])`)? Probably not — the agent picks the type from context. Single-type per call keeps the result shape uniform.
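
If the plain-phrase option in the first clarification is chosen, the sanitizing step could look roughly like this. It is a sketch under that assumption; `to_fts5_phrase` is a hypothetical helper name. It relies on the FTS5 query-syntax rule that text inside double quotes is a string, with an embedded double quote escaped by doubling it.

```python
def to_fts5_phrase(raw: str) -> str:
    """Turn free text into a single FTS5 phrase.

    Quoting the whole input means words such as AND, OR, NOT and NEAR
    are matched as ordinary tokens instead of being parsed as operators.
    """
    collapsed = " ".join(raw.split())  # trim and collapse runs of whitespace
    if not collapsed:
        raise ValueError("q must contain at least one non-space character")
    return '"' + collapsed.replace('"', '""') + '"'
```

The returned string is meant to be passed as a bound parameter to `MATCH ?`, so no SQL-level quoting is needed on top of it.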

Out of scope

Free-text search over body-bearing sources: owned by `rest-api-search`. Calling `?q=` on a body source is rejected with HTTP 400 and a redirect hint pointing at `/v1/search`.
Cross-source entity resolution (a single 'who is this' resolver across legislators + staff + lobbyists). Each entity type is a separate endpoint; the agent picks. A meta-resolver may make sense at v2 if the agent's tool-selection overhead is high.
Entity disambiguation (when 'Markey' matches two senators across different congresses). Returns both; agent picks. No special UI for ambiguity at v1.
Trigram indexes — see clarifications.
Entity-linking against external authorities (Wikidata, VIAF). Substrate-native only.

Dependencies

Plan

## 1. Why this is a distinct spec (not just a filter param)

Three things make entity resolution a separate operation from list
filtering and from body search:

- Different intent. "Find me records about X" (body search) vs
"Find the canonical record FOR X" (entity resolution).
- Different result shape. Search returns hits with snippets +
highlights pointing into bodies. Resolution returns full records
with no body excerpt.
- Different relevance model. Body BM25 ranks long documents.
FTS5 over short structured fields is closer to typo-tolerant
string match — Postgres makes this split explicit with
tsvector vs pg_trgm.

The spec is small but the contract is load-bearing for agent
ergonomics: an agent should never need to "BM25-search for a senator"
to resolve a name — that's the wrong tool. The resolve_entity
MCP tool maps 1:1 onto this REST endpoint pattern.

## 2. The endpoint shape

Registry sources expose ?q=<text> on their list endpoint:

```
GET /v1/legislators?q=Markey&state=MA
GET /v1/committees?q=ways+and+means
GET /v1/staff-directories?q=Katie+O+Brian&committee=HSEN
GET /v1/lda-filings?q=Marc+Lampkin
GET /v1/roll-call-votes?q=fisa+reauthorization
GET /v1/hearings?q=hypersonics+procurement
GET /v1/regulations-dot-gov-dockets?q=EPA+PFAS
GET /v1/topic-taxonomy?q=climate
```

Per-source name fields are pinned by each ingester spec:

| Source | FTS5 columns |
|-----------------|----------------------------------------------------|
| legislators | official_full_name, last_name, first_name |
| committees | name, jurisdiction |
| staff-directories | name, role |
| lda-filings | client_name, registrant_name, lobbying_issues_text |
| roll-call-votes | question, vote_description |
| hearings | title, agenda, witnesses_text |
| regulations-dot-gov-dockets | title, agency, document_titles |
| topic-taxonomy | label, broader_labels |

The endpoint dispatches by URL path; the FTS5 column set is per-source.

## 3. The MCP wrapper

mcp-server exposes a single resolve_entity(query, entity_type) tool
that maps to one of these REST paths. entity_type is an enum derived
from the substrate schema at tools/list time; new registry sources
become resolvable automatically.
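
A rough sketch of the tool-to-path mapping described above. The enum values in `ENTITY_PATHS` are hypothetical (the spec derives them from the substrate schema at tools/list time), only three sources are shown, and forwarding extra structured filters through the tool is an assumption beyond the `resolve_entity(query, entity_type)` signature.

```python
from urllib.parse import urlencode

# Hypothetical entity_type values mapped to the registry list endpoints.
ENTITY_PATHS = {
    "legislator": "/v1/legislators",
    "committee": "/v1/committees",
    "lda_filing": "/v1/lda-filings",
}


def resolve_entity_url(query: str, entity_type: str, **filters: str) -> str:
    """Build the REST URL that a resolve_entity call would map to."""
    try:
        path = ENTITY_PATHS[entity_type]
    except KeyError:
        raise ValueError(f"unknown entity_type: {entity_type!r}") from None
    # q goes first; urlencode escapes spaces as '+'.
    return f"{path}?{urlencode({'q': query, **filters})}"
```

For example, `resolve_entity_url("Markey", "legislator", state="MA")` yields `/v1/legislators?q=Markey&state=MA`, the same request an HTTP client would send directly.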

## 4. Body sources reject ?q=

Calling GET /v1/bills?q=climate is a category error — the user
wants search, not resolution. The endpoint returns:

```
HTTP 400
{
  "error": {
    "type": "invalid_request",
    "code": "use_search_for_body",
    "message": "?q= on body-bearing sources is reserved for free-text search; use /v1/search?source=bills&q=climate",
    "hint": { "redirect_to": "/v1/search?source=bills&q=climate" }
  }
}
```

Same shape as other 400s in rest-api-conventions §3.
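
One way a router might assemble this rejection, shown as a framework-free sketch. The helper name `use_search_for_body` and the `(status, body)` return shape are assumptions; the field names and values mirror the envelope in this section.

```python
from urllib.parse import urlencode


def use_search_for_body(source: str, q: str) -> tuple[int, dict]:
    """Build the 400 response for ?q= on a body-bearing source."""
    redirect = "/v1/search?" + urlencode({"source": source, "q": q})
    body = {
        "error": {
            "type": "invalid_request",
            "code": "use_search_for_body",
            "message": (
                "?q= on body-bearing sources is reserved for free-text "
                f"search; use {redirect}"
            ),
            # A hint only: the client decides whether to follow it.
            "hint": {"redirect_to": redirect},
        }
    }
    return 400, body
```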

## 5. The pre-filter property

Filters apply before the FTS5 match. ?q=Markey&state=MA is:

```sql
SELECT l.*
FROM legislators l
JOIN legislators_fts fts ON l.rowid = fts.rowid
JOIN legislator_terms lt ON l.bioguide_id = lt.bioguide_id
WHERE legislators_fts MATCH 'Markey'
  AND lt.state = 'MA'
  AND lt.is_current = 1
ORDER BY bm25(legislators_fts) ASC
LIMIT 20
```

Pre-filter keeps the FTS5 candidate set small; FTS5 ranks the result.
Composes cleanly with the new filter params from
rest-api-resource-endpoints t14.
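
To make the query above concrete, here is a small in-memory sketch using Python's `sqlite3` module. It assumes the local SQLite build has FTS5 compiled in. The table layouts are simplified, the rows are made-up fixtures (not real bioguide IDs), and `seed_demo_db` and `resolve_legislators` are hypothetical names. The FTS table is joined without an alias, a small deviation from the query above.

```python
import sqlite3


def seed_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory substrate with made-up fixture rows."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE legislators (bioguide_id TEXT PRIMARY KEY, official_full_name TEXT);
        CREATE TABLE legislator_terms (bioguide_id TEXT, state TEXT, is_current INTEGER);
        CREATE VIRTUAL TABLE legislators_fts USING fts5(official_full_name);
        """
    )
    db.executemany(
        "INSERT INTO legislators VALUES (?, ?)",
        [("X000001", "Alex Markey"), ("X000002", "Jordan Markey"), ("X000003", "Sam Rivera")],
    )
    db.executemany(
        "INSERT INTO legislator_terms VALUES (?, ?, 1)",
        [("X000001", "MA"), ("X000002", "TX"), ("X000003", "MA")],
    )
    # Keep FTS rowids aligned with the base table so the join works.
    db.execute(
        "INSERT INTO legislators_fts(rowid, official_full_name) "
        "SELECT rowid, official_full_name FROM legislators"
    )
    return db


def resolve_legislators(db: sqlite3.Connection, q: str, state: str) -> list[str]:
    """Return bioguide IDs matching q within one state, best match first."""
    phrase = '"' + q.replace('"', '""') + '"'  # quote q as one FTS5 phrase
    rows = db.execute(
        """
        SELECT l.bioguide_id
        FROM legislators l
        JOIN legislators_fts ON l.rowid = legislators_fts.rowid
        JOIN legislator_terms lt ON l.bioguide_id = lt.bioguide_id
        WHERE legislators_fts MATCH ?
          AND lt.state = ?
          AND lt.is_current = 1
        ORDER BY bm25(legislators_fts) ASC
        LIMIT 20
        """,
        (phrase, state),
    ).fetchall()
    return [row[0] for row in rows]
```

With these fixtures, two rows match "Markey" but only one has a current MA term, so the structured filter narrows the set before ranking matters.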

## 6. Implementation surface

```
josh-core/josh_core/
  routers/
    legislators.py           # extended: add ?q= support
    committees.py            # extended: add ?q= support
    staff_directories.py     # extended
    lda_filings.py           # extended
    roll_call_votes.py       # extended
    hearings.py              # extended
    regulations_dockets.py   # extended
    topic_taxonomy.py        # extended
  services/
    entity_resolver.py       # shared FTS5 query builder + pre-filter
  tests/
    test_entity_resolution.py
```

The per-router change is small: detect ?q=, build the FTS5 fragment,
pre-filter with structured args. Shared service handles the common
shape so the routers stay thin.
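
The shared service could take roughly the following shape. This is a sketch, not the contents of `entity_resolver.py`: `ResolverSource`, `build_resolver_query`, and the `chamber` filter in the usage note are hypothetical. It also assumes filters live on the base table, whereas the legislators example joins a terms table. Table and column names come only from the allowlist, never from request input.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class ResolverSource:
    table: str                      # base table, e.g. "committees"
    fts_table: str                  # FTS5 table sharing rowids with `table`
    filter_columns: frozenset[str]  # structured filters this source accepts


def build_resolver_query(
    src: ResolverSource, q: str, filters: dict[str, Any], limit: int = 20
) -> tuple[str, list[Any]]:
    """Return (sql, params) for an FTS5 match ANDed with structured filters."""
    unknown = set(filters) - src.filter_columns
    if unknown:
        raise ValueError(f"unsupported filters: {sorted(unknown)}")
    where = [f"{src.fts_table} MATCH ?"]
    params: list[Any] = ['"' + q.replace('"', '""') + '"']  # q as one phrase
    for column in sorted(filters):  # sorted for deterministic SQL text
        where.append(f"t.{column} = ?")
        params.append(filters[column])
    sql = (
        f"SELECT t.*, bm25({src.fts_table}) AS match_score "
        f"FROM {src.table} t "
        f"JOIN {src.fts_table} ON t.rowid = {src.fts_table}.rowid "
        f"WHERE {' AND '.join(where)} "
        "ORDER BY match_score ASC LIMIT ?"
    )
    params.append(limit)
    return sql, params
```

A router would then only need something like `build_resolver_query(COMMITTEES, q, {"chamber": "house"})` and hand the result to the database layer, which keeps the per-router change small.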

Tasks

0 of 11 done.

Changelog

2026-05-13T15:00:00Z planned→planned New spec authored after the 64-query coverage analysis (https://docs.usejosh.com/operations/query-coverage/) and the prior research pass on production retrieval systems (which found the Postgres `tsvector` vs `pg_trgm` distinction is the canonical model for separating body lexical from fuzzy entity lookup). The `resolve_entity` MCP tool now has a dedicated REST backend rather than being a special case of `lexical_search` over registry FTS5 indexes. Body sources reject `?q=` with a redirect hint to `/v1/search`. The two operations are surface-disjoint and intent-disjoint, matching production conventions.