REST API — entity resolution
Why
Registry sources (legislators, committees, lobbyists, staffers,
taxonomies) need a typed fuzzy-name-to-ID resolver that is distinct
from both search and ID lookup. The 64-query coverage analysis
(https://docs.usejosh.com/operations/query-coverage/) surfaced this as Category 5:
questions like "Sen. Markey from Mass," "Katie O'Brian on E&C," "Marc
Lampkin lobbyist with uncertain spelling," "House Ag — which
committee?" — each is a fuzzy match against short structured fields,
not BM25 over a body and not an exact ID lookup.
Production-system precedent is decisive on this split. Postgres uses
different operators and indexes for the two cases (tsvector for
body lexical search, pg_trgm for fuzzy entity names). Elasticsearch
unifies the surface but the field mapping (text vs keyword) is
load-bearing. Treating fuzzy registry lookup as a mode of lexical_search (the prior "Flow 1b" framing) conflates two
different operations with different intents, result shapes, and
relevance models.
This spec locks the resolver as a ?q=<noisy_name> query parameter
on the list-endpoint surface of each registry source. The MCP server
exposes it via a resolve_entity tool (mcp-server Class D),
scoped by entity type. Body-bearing sources continue to use rest-api-search for free-text retrieval; the two never overlap.
User stories
As an agent given "Sen. Markey from Massachusetts" with no bioguide ID in hand, I want to call `GET /v1/legislators?q=Markey&state=MA` and get back a ranked list of matching legislators with their bioguide IDs so that every downstream tool call (votes, sponsored bills, committee memberships) can use a canonical ID, not a noisy string.
As a journalist with the staffer's name as "Katie O'Brian" but uncertain of the spelling, I want to call `GET /v1/staff-directories?q=Katie+O+Brian&committee=House+Energy+Commerce` and get fuzzy matches ranked by relevance so that I can find the right person even when I have the spelling slightly wrong.
As a lobbyist looking up "Marc Lampkin" across LDA filings, I want to call `GET /v1/lda-filings?q=Marc+Lampkin` and find the canonical registrant record so that I can pull every filing they've made, regardless of which client they represented at the time.
As a constituent asking about "House Ag", I want to call `GET /v1/committees?q=House+Ag` and get back the canonical committee record so that I learn the committee's Thomas ID and can fetch its members, hearings, and reports through structured endpoints.
Acceptance criteria (EARS)
- When a client sends `GET /v1/<registry-source>?q=<text>` against any of `legislators`, `committees`, `staff-directories`, `lda-filings`, `regulations-dot-gov-dockets`, `roll-call-votes`, `hearings`, or `topic-taxonomy`, the system shall execute an FTS5 match against the source's documented name/short-field index and return ranked records (best match first).
- Where a body-bearing source receives a `?q=` parameter on its list endpoint, the request shall be rejected with HTTP 400 `error.code='use_search_for_body'` and `hint.redirect_to='/v1/search?source=<source>&q=<text>'` — body-bearing sources route through `rest-api-search`, not here.
- When entity-resolution returns a result, each item shall be the resource's documented `card` fieldset (per `rest-api-conventions` §9), with the resource's `id` always present — NOT a search-card with snippet/highlights.
- Where the resolver returns multiple matches, the response shall include a `match_score` field per result (FTS5 BM25 score, smaller is better) and shall be ordered ascending by that score.
- When the resolver returns zero matches, the response shall be `{data: [], total: 0, has_more: false}` — never HTTP 404 (a 404 means the endpoint is wrong; an empty result means no entity matched).
- When `?q=` is combined with other filter parameters defined in `rest-api-resource-endpoints` (e.g., `?q=Markey&state=MA&congress=119`), the system shall apply structured filters BEFORE the FTS5 match (pre-filter) so the FTS5 candidate set is bounded.
- Where multiple structured filters are passed, the system shall AND them and apply the FTS5 ranking on the remaining candidates.
- When `?q=` is set, pagination shall use cursor (matching `rest-api-conventions` §1 for list endpoints, NOT offset — search-style offset is reserved for `rest-api-search`).
- Where a source has no FTS5 index in the substrate schema (e.g., `committee-memberships`, which is a pure join table), the resolver shall return HTTP 400 `error.code='source_has_no_resolver'` with `hint.use_endpoint` pointing at the typed lookup path (`/v1/committees/{id}/members?congress=`).
- When the resolver is called against a registry source with the v1 substrate populated, the response shall be returned within 100 ms p95 (FTS5 over short fields is cheap; this budget is tighter than the 500 ms search budget).
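The ordering and empty-result criteria above can be sketched as a small envelope builder. This is a minimal illustration, not the service's code: `build_envelope`, its `limit` parameter, and the placeholder IDs are hypothetical, and the input rows are assumed to already carry the `card` fields plus `match_score`. The only envelope keys taken from the spec are `data`, `total`, and `has_more`.

```python
from typing import Any


def build_envelope(rows: list[dict[str, Any]], limit: int) -> dict[str, Any]:
    """Wrap resolver rows in the list envelope, best match first.

    FTS5's bm25() assigns numerically lower scores to better matches,
    so an ascending sort puts the best match at index 0.
    """
    ranked = sorted(rows, key=lambda row: row["match_score"])
    return {
        "data": ranked[:limit],
        "total": len(ranked),
        "has_more": len(ranked) > limit,
    }


# Hypothetical fixture rows; the IDs are placeholders, not real bioguide IDs.
rows = [
    {"id": "X000002", "match_score": -1.1},
    {"id": "X000001", "match_score": -3.2},
]
envelope = build_envelope(rows, limit=20)
empty = build_envelope([], limit=20)
```

An empty input produces an ordinary envelope with an empty `data` list, which the router would return with HTTP 200 rather than 404.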
Success determiner
Path
Runner
Contract test against the live FastAPI app covering every AC:
- Happy path resolution per registry source with seeded fixtures (Markey/MA, House Ag, Pfizer, etc.).
- Composability: `?q=Markey&state=MA&congress=119` returns the right subset.
- Body-bearing rejection: `GET /v1/bills?q=climate` → 400 with `use_search_for_body` and redirect hint.
- No-resolver rejection: `GET /v1/committee-memberships?q=anything` → 400 with `source_has_no_resolver`.
- Empty result: noisy non-existent name returns `{data: [], total: 0}`, not 404.
- Pagination: cursor-based, `next_cursor` survives across pages.
- Latency: p95 < 100 ms against v1 fixtures (asserted).
- Round-trip with `mcp-server`'s `resolve_entity` tool: same path, same envelope.

Determiner currently fails because the per-source list endpoints don't yet accept `?q=` (tracked in `rest-api-resource-endpoints` t14, the new filter task this spec adds as a dependency).
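For the latency item, one way the determiner could assert the budget is a nearest-rank percentile over recorded request timings. The helper name `p95_ms` and the sample values are hypothetical; how timings are collected against the live app is left open here.

```python
def p95_ms(samples_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of latency samples, in milliseconds."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    # ceil(0.95 * n) computed with integers to avoid float rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical timings: 19 fast requests and one slow outlier.
timings = [10.0] * 19 + [500.0]
within_budget = p95_ms(timings) < 100.0
```

With 20 samples the nearest rank is 19, so a single slow outlier does not break the budget while two would.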
Clarifications needed
- Should the resolver accept multi-token queries with FTS5 native operators (`"Markey" AND MA`), or sanitize to plain phrase matching? Lean on plain phrase for v1 — agents pass natural text, not query DSL.
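- Trigram (pg_trgm-style) similarity beats FTS5 BM25 for true typo tolerance, but SQLite has no native pg_trgm-style similarity operator (FTS5's built-in `trigram` tokenizer, available since SQLite 3.34, supports substring matching rather than similarity scoring). Should we add a custom tokenizer or live with FTS5 for v1? Lean on FTS5: the field is short enough that BM25 prefix matching covers most typos, and adding a tokenizer is a build-step change.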
- Should `resolve_entity` accept multiple entity types in one call (`resolve_entity('Markey', type=['legislator','committee'])`)? Probably not — the agent picks the type from context. Single-type per call keeps the result shape uniform.
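If the first question resolves toward plain phrase matching, the sanitizing step could look like the sketch below. `to_fts5_phrase` is a hypothetical helper name. It relies on the FTS5 rule that a double-quoted string is treated as a single phrase, with embedded double quotes escaped by doubling, so operators such as `AND` inside the quotes are read as ordinary tokens.

```python
def to_fts5_phrase(raw: str) -> str:
    """Convert free text into one FTS5 phrase literal.

    Whitespace is collapsed, embedded double quotes are doubled, and the
    whole string is wrapped in double quotes so FTS5 does not interpret
    AND / OR / NOT / NEAR or column filters typed by the caller.
    """
    collapsed = " ".join(raw.split())
    if not collapsed:
        raise ValueError("q must contain at least one non-space character")
    escaped = collapsed.replace('"', '""')
    return f'"{escaped}"'


plain = to_fts5_phrase("  Marc   Lampkin ")
with_operators = to_fts5_phrase('"Markey" AND MA')
```

A phrase requires its tokens to appear adjacent and in order, which is stricter than matching each token independently; that trade-off is part of the open question.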
Out of scope
- Free-text search over body-bearing sources — owned by `rest-api-search`. Calling `?q=` on a body source redirects.
- Cross-source entity resolution (a single 'who is this' resolver across legislators + staff + lobbyists). Each entity type is a separate endpoint; the agent picks. A meta-resolver may make sense at v2 if the agent's tool-selection overhead is high.
- Entity disambiguation (when 'Markey' matches two senators across different congresses). Returns both; agent picks. No special UI for ambiguity at v1.
- Trigram indexes — see clarifications.
- Entity-linking against external authorities (Wikidata, VIAF). Substrate-native only.
Dependencies
Plan
## 1. Why this is a distinct spec (not just a filter param)
Three things make entity resolution a separate operation from list
filtering and from body search:
- Different intent. "Find me records about X" (body search) vs
"Find the canonical record FOR X" (entity resolution).
- Different result shape. Search returns hits with snippets +
highlights pointing into bodies. Resolution returns full records
with no body excerpt.
- Different relevance model. Body BM25 ranks long documents.
FTS5 over short structured fields is closer to typo-tolerant
string match — Postgres makes this split explicit with
tsvector vs pg_trgm.
The spec is small but the contract is load-bearing for agent
ergonomics: an agent should never need to "BM25-search for a senator"
to resolve a name — that's the wrong tool. The resolve_entity
MCP tool maps 1:1 onto this REST endpoint pattern.
## 2. The endpoint shape
Registry sources expose ?q=<text> on their list endpoint:
````
GET /v1/legislators?q=Markey&state=MA
GET /v1/committees?q=ways+and+means
GET /v1/staff-directories?q=Katie+O+Brian&committee=HSEN
GET /v1/lda-filings?q=Marc+Lampkin
GET /v1/roll-call-votes?q=fisa+reauthorization
GET /v1/hearings?q=hypersonics+procurement
GET /v1/regulations-dot-gov-dockets?q=EPA+PFAS
GET /v1/topic-taxonomy?q=climate
````
Per-source name fields are pinned by each ingester spec:
| Source | FTS5 columns |
|-----------------|----------------------------------------------------|
| legislators | official_full_name, last_name, first_name |
| committees | name, jurisdiction |
| staff-directories | name, role |
| lda-filings | client_name, registrant_name, lobbying_issues_text |
| roll-call-votes | question, vote_description |
| hearings | title, agenda, witnesses_text |
| regulations-dot-gov-dockets | title, agency, document_titles |
| topic-taxonomy | label, broader_labels |
The endpoint dispatches by URL path; the FTS5 column set is per-source.
## 3. The MCP wrapper
mcp-server exposes a single resolve_entity(query, entity_type) tool
that maps to one of these REST paths. entity_type is an enum derived
from the substrate schema at tools/list time; new registry sources
become resolvable automatically.
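A minimal sketch of that mapping, assuming a static table in place of the schema-derived enum: the entity type names below are illustrative (only `legislator` and `committee` appear elsewhere in this spec), and `resolve_entity_url` is a hypothetical helper, not the mcp-server implementation.

```python
from urllib.parse import urlencode

# Assumption: a hand-written stand-in for the enum that the real server
# derives from the substrate schema at tools/list time.
ENTITY_PATHS = {
    "legislator": "/v1/legislators",
    "committee": "/v1/committees",
    "staffer": "/v1/staff-directories",
    "lda_filing": "/v1/lda-filings",
}


def resolve_entity_url(query: str, entity_type: str, **filters: str) -> str:
    """Build the REST path that a resolve_entity call would hit."""
    if entity_type not in ENTITY_PATHS:
        raise ValueError(f"unknown entity_type: {entity_type!r}")
    params = {"q": query, **filters}
    # urlencode escapes spaces as '+', matching the examples in section 2.
    return f"{ENTITY_PATHS[entity_type]}?{urlencode(params)}"


url = resolve_entity_url("Markey", "legislator", state="MA")
```

Keeping the tool to one entity type per call means the returned URL always targets a single list endpoint, so the result shape stays uniform.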
## 4. Body sources reject ?q=
Calling GET /v1/bills?q=climate is a category error — the user
wants search, not resolution. The endpoint returns:
````
HTTP 400
{
  "error": {
    "type": "invalid_request",
    "code": "use_search_for_body",
    "message": "?q= on body-bearing sources is reserved for free-text search; use /v1/search?source=bills&q=climate",
    "hint": { "redirect_to": "/v1/search?source=bills&q=climate" }
  }
}
````
Same shape as other 400s in rest-api-conventions §3.
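The rejection body can be produced by one small helper shared across body-bearing routers. This is a sketch under assumptions: `use_search_for_body_error` is a hypothetical name, and returning a `(status, body)` tuple is a simplification of however the FastAPI app actually raises errors.

```python
from typing import Any
from urllib.parse import urlencode


def use_search_for_body_error(source: str, q: str) -> tuple[int, dict[str, Any]]:
    """Build the HTTP 400 payload for ?q= on a body-bearing source."""
    redirect = "/v1/search?" + urlencode({"source": source, "q": q})
    body = {
        "error": {
            "type": "invalid_request",
            "code": "use_search_for_body",
            "message": (
                "?q= on body-bearing sources is reserved for free-text "
                f"search; use {redirect}"
            ),
            "hint": {"redirect_to": redirect},
        }
    }
    return 400, body


status, payload = use_search_for_body_error("bills", "climate")
```

Building the redirect with `urlencode` keeps the hint usable as-is even when the original query contains spaces or punctuation.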
## 5. The pre-filter property
Filters apply before the FTS5 match. ?q=Markey&state=MA is:
````
SELECT l.* FROM legislators l
  JOIN legislators_fts fts ON l.rowid = fts.rowid
  JOIN legislator_terms lt ON l.bioguide_id = lt.bioguide_id
WHERE legislators_fts MATCH 'Markey'
  AND lt.state = 'MA'
  AND lt.is_current = 1
ORDER BY bm25(legislators_fts) ASC
LIMIT 20
````
Pre-filter keeps the FTS5 candidate set small; FTS5 ranks the result.
Composes cleanly with the new filter params from rest-api-resource-endpoints t14.
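The query above can be exercised end to end against an in-memory SQLite database. The sketch assumes a Python build whose `sqlite3` module has FTS5 enabled, uses a reduced schema with only the columns the query touches, and seeds made-up fixture rows (placeholder IDs, one invented second "Markey"). It also selects `bm25()` as `match_score` so the ranking value is visible; the function name `resolve_legislators` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE legislators (bioguide_id TEXT PRIMARY KEY, official_full_name TEXT);
    CREATE TABLE legislator_terms (bioguide_id TEXT, state TEXT, is_current INTEGER);
    CREATE VIRTUAL TABLE legislators_fts USING fts5(official_full_name);
    """
)

# Made-up fixtures; IDs are placeholders and the second person is invented.
fixtures = [(1, "X000001", "Edward J. Markey", "MA"), (2, "X000002", "Jordan Markey", "TX")]
for rowid, bioguide_id, name, state in fixtures:
    conn.execute("INSERT INTO legislators (rowid, bioguide_id, official_full_name) VALUES (?, ?, ?)",
                 (rowid, bioguide_id, name))
    conn.execute("INSERT INTO legislator_terms VALUES (?, ?, 1)", (bioguide_id, state))
    # Keep FTS rowids aligned with the base table so the join below works.
    conn.execute("INSERT INTO legislators_fts (rowid, official_full_name) VALUES (?, ?)", (rowid, name))

RESOLVE_SQL = """
SELECT l.bioguide_id, l.official_full_name, bm25(legislators_fts) AS match_score
FROM legislators_fts
JOIN legislators l ON l.rowid = legislators_fts.rowid
JOIN legislator_terms lt ON lt.bioguide_id = l.bioguide_id
WHERE legislators_fts MATCH ?
  AND lt.state = ?
  AND lt.is_current = 1
ORDER BY match_score ASC
LIMIT 20
"""


def resolve_legislators(db: sqlite3.Connection, phrase: str, state: str) -> list[tuple]:
    """Run the FTS5 match combined with the structured state filter."""
    return db.execute(RESOLVE_SQL, (phrase, state)).fetchall()


matches = resolve_legislators(conn, '"Markey"', "MA")
no_matches = resolve_legislators(conn, '"Nobody"', "MA")
```

Both fixture names match the term, and the `state` filter narrows the result to one row. The sketch shows the combined constraint only; it does not verify the order in which SQLite's planner evaluates the filter and the match.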
## 6. Implementation surface
````
josh-core/josh_core/
  routers/
    legislators.py           # extended: add ?q= support
    committees.py            # extended: add ?q= support
    staff_directories.py     # extended
    lda_filings.py           # extended
    roll_call_votes.py       # extended
    hearings.py              # extended
    regulations_dockets.py   # extended
    topic_taxonomy.py        # extended
  services/
    entity_resolver.py       # shared FTS5 query builder + pre-filter
  tests/
    test_entity_resolution.py
````
The per-router change is small: detect ?q=, build the FTS5 fragment,
pre-filter with structured args. Shared service handles the common
shape so the routers stay thin.
Tasks
0 of 11 done.
- t1 Shared entity_resolver service in josh_substrate (FTS5 query builder + pre-filter composer)
- t2 ?q= support on legislators router (pre-filter by state/chamber/party/congress)
- t3 ?q= support on committees router (pre-filter by chamber)
- t4 ?q= support on staff-directories, lda-filings, roll-call-votes routers
- t5 ?q= support on hearings, regulations-dot-gov-dockets, topic-taxonomy routers
- t6 Body-source rejection (HTTP 400 use_search_for_body) on /v1/bills, /v1/federal-register, /v1/crs-reports, /v1/public-laws, /v1/us-code, /v1/ecfr-and-cfr, /v1/hearing-transcripts, /v1/committee-reports, /v1/gao-reports, /v1/cbo-cost-estimates, /v1/statements-of-administration-policy, /v1/congressional-record
- t7 Source-without-resolver rejection (HTTP 400 source_has_no_resolver) on /v1/committee-memberships
- t8 Cursor pagination for ?q= results (consistent with rest-api-conventions §1 list-endpoint shape)
- t9 Latency budget: p95 < 100 ms against v1 fixtures (asserted in determiner)
- t10 test_entity_resolution.py covers every AC + cross-checks with mcp-server's resolve_entity tool
- t11 OpenAPI documents ?q= on every eligible list endpoint
Changelog
- 2026-05-13T15:00:00Z (planned→planned): New spec authored after the 64-query coverage analysis (https://docs.usejosh.com/operations/query-coverage/) and the prior research pass on production retrieval systems (which found the Postgres `tsvector` vs `pg_trgm` distinction is the canonical model for separating body lexical from fuzzy entity lookup). The `resolve_entity` MCP tool now has a dedicated REST backend rather than being a special case of `lexical_search` over registry FTS5 indexes. Body sources reject `?q=` with a redirect hint to `/v1/search`. The two operations are surface-disjoint and intent-disjoint, matching production conventions.