Query flows
Four ways a question becomes an answer in Josh. Three of them are not semantic search. The substrate is relational with FTS5 and vector layers stacked on top — picking the right layer for the right question is the whole game.
Why this page exists
An easy mistake when building any AI-facing data product is to assume every question is a semantic-search question. It isn't. A large fraction of questions an agent asks of Josh are citation-style lookups — questions with an identifier baked in, or whose answer is a SQL join, not a similarity score.
Two examples:
- "Does HR103 mention child protection?" — the bill ID pins the document. The question is "scan a fixed text for a phrase," which is a deterministic keyword scan or a column read, not a vector cosine over 5 million chunks.
- "Who is on the Ways and Means Committee?" — the answer is a SQL join on committee_memberships. Vector retrieval here would be malpractice.
Dense embeddings reliably miss exact identifiers and named entities — well documented in the legal-RAG literature. Routing identifier-bearing questions through the vector path produces fuzzy answers to questions that have exact ones. Josh's substrate is designed to keep those paths separate.
The four query flows
| # | Flow | Wraps | When the agent picks it |
|---|---|---|---|
| 1 | ID lookup / structured join | Plain SQL on relational tables | Question contains an identifier (hr:119:103, S000033, HSWM), or the answer is a join over normalized columns ("members of X committee"). |
| 2 | Lexical search (FTS5 BM25) | *_fts external-content tables | Question is a keyword scan ("bills mentioning child tax credit", "find PFAS in regulations"). Exact tokens matter; the user wrote the words they want matched. |
| 3 | Semantic search (vector) | *_chunks_vec0 BQ+rescore | Question is a paraphrase or conceptual match ("bills about reducing tailpipe emissions"). The user wouldn't necessarily write the exact tokens in the underlying document. |
| 4 | Hybrid search (BM25 + vector, RRF) | Both FTS5 + vec0, fused | Default when the agent isn't sure. Robust to both exact and paraphrased queries; the cost is two scans instead of one. |
Pre-filters (source, since, until) apply to all four flows. In flows 3 and 4, filters apply before vector scoring, not after — a hard requirement in the search spec.
Source eligibility matrix
Not every source supports every flow. Embedding a committee name or a senator's bioguide ID would be malpractice — the vector would tell you "John Smith" is similar to "John Smithson," which is worse than useless for citation work. The substrate splits sources into three shapes; the migration schema enforces it (presence of a <source>_chunks_vec0 table is the ground truth, derived dynamically by the search service — never hard-coded).
"Flow 1b" is the fuzzy-name use of FTS5 over short structured fields on registry sources — not real lexical search over a body. The lexical_search MCP tool accepts it (calling lexical_search({query:'ways and means', source:'committees'}) finds the committee by name); the semantic_search tool rejects it.
Shape A · Body-bearing
All four flows apply. The default search (hybrid) tool runs both BM25 and vector legs and fuses with RRF.
| Source | Flow 1 ID | Flow 2 Lexical | Flow 3 Semantic | Flow 4 Hybrid | Notes |
|---|---|---|---|---|---|
| bills (planned) | ✓ | ✓ | ✓ | ✓ | BILLSTATUS bodies up to 1.9 MB; chunks > 5K tokens routinely. |
| cbo-cost-estimates (planned) | ✓ | ✓ | ✓ | ✓ | PDFs → markdown via pdfplumber; broader analyses cross the chunk threshold. |
| committee-reports (planned) | ✓ | ✓ | ✓ | ✓ | GovInfo CRPT. Section-by-section legislative intent; most reports > 5K tokens. |
| congressional-record (planned) | ✓ | ✓ | ✓ | ✓ | CREC granules (speeches, motions, quorum calls); each granule has its own body. |
| crs-reports (verified) | ✓ | ✓ | ✓ | ✓ | First source through the full pipeline; the worked example everywhere else copies. |
| ecfr-and-cfr (planned) | ✓ | ✓ | ✓ | ✓ | XML → markdown per section. Long-lived regulatory text. |
| federal-register (planned) | ✓ | ✓ | ✓ | ✓ | NPRMs and rules 30K+ tokens; chunks enqueued > 5K. |
| gao-reports (planned) | ✓ | ✓ | ✓ | ✓ | PDF → markdown. FTS5 weights tuned for title / fast-facts / recommendations / body. |
| hearing-transcripts (planned) | ✓ | ✓ | ✓ | ✓ | Speaker-tagged markdown; almost every transcript exceeds 5K tokens. |
| public-laws (planned) | ✓ | ✓ | ✓ | ✓ | USLM XML → markdown. Enacted statute text. |
| us-code (planned) | ✓ | ✓ | ✓ | ✓ | USLM statutes; FTS5 on heading / chapeau / body for section-level retrieval. |
Shape B · Short-body
FTS5 over the full body but no chunks/vec0: documents are short enough that BM25 alone is sufficient. semantic_search rejects; search auto-downgrades to lexical.
| Source | Flow 1 ID | Flow 2 Lexical | Flow 3 Semantic | Flow 4 Hybrid | Notes |
|---|---|---|---|---|---|
| statements-of-administration-policy (planned) | ✓ | ✓ | − | − | SAPs are 1–3 pages; FTS5 over body is enough at v1. Per-doc embedding deferred to v1.x. |
Shape C · Registry / lookup
No body text. FTS5 indexes exist over short fields (names, jurisdictions, issue labels) for fuzzy ID lookup ("Flow 1b"). lexical_search works against these short fields; semantic_search rejects; search auto-downgrades. Class A typed tools (get_legislator, list_committee_members, etc.) are the preferred entry point.
| Source | Flow 1 ID | Flow 2 Lexical | Flow 3 Semantic | Flow 4 Hybrid | Notes |
|---|---|---|---|---|---|
| legislators (verified) | ✓ | name | − | − | FTS5 on official_full_name / last_name / first_name. Bioguide ID is the primary lookup. |
| committees (verified) | ✓ | name | − | − | FTS5 on name / jurisdiction. Thomas committee IDs are stable. |
| committee-memberships (verified) | ✓ | − | − | − | Pure join table: (committee_id, bioguide_id, congress, role). No FTS5 at all. |
| hearings (planned) | ✓ | name | − | − | Meeting metadata only. Transcript bodies live in the separate hearing-transcripts source. |
| lda-filings (planned) | ✓ | name | − | − | Lobbying disclosures. FTS5 on client_name / registrant_name / lobbying_issues_text. |
| regulations-dot-gov-dockets (planned) | ✓ | name | − | − | Docket metadata only. Comment-body text lives upstream at regulations.gov. |
| roll-call-votes (planned) | ✓ | name | − | − | FTS5 on the short question/vote_description text. Per-member positions are a structured join. |
| staff-directories (planned) | ✓ | name | − | − | FK to legislators + committees. FTS5 on name / role. |
| topic-taxonomy (planned) | ✓ | name | − | − | LCSH vocabulary. FTS5 on label / broader_labels. |

Behavior on mismatch: mode=semantic against any non-vector source returns HTTP 400 with error.code='source_not_searchable_semantically'; mode=hybrid against a registry source auto-downgrades to lexical-only and surfaces the change in a top-level degraded block in the response. The classification is derived from the live schema, so new body-bearing sources become semantically searchable automatically when they ship, with no separate config.
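The mismatch behavior can be sketched in a few lines. This is a minimal illustration, not the search service's code: the function names, the exception class, the fields inside the degraded block, and the hyphen-to-underscore table naming are all assumptions, and a plain table stands in for the real vec0 virtual table.

```python
import sqlite3


class SourceNotSearchableSemantically(Exception):
    """Stand-in for the HTTP 400 response described above."""
    code = "source_not_searchable_semantically"


def has_vector_index(conn, source):
    # Assumption: table names replace hyphens in the source slug with underscores.
    table = source.replace("-", "_") + "_chunks_vec0"
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE name = ?", (table,)
    ).fetchone()
    return row is not None


def resolve_mode(conn, source, mode):
    """Return (effective_mode, degraded); degraded is None or a dict."""
    if mode not in ("semantic", "hybrid") or has_vector_index(conn, source):
        return mode, None
    if mode == "semantic":
        raise SourceNotSearchableSemantically(source)
    # Hypothetical shape for the top-level degraded block.
    degraded = {
        "requested_mode": "hybrid",
        "effective_mode": "lexical",
        "reason": "source has no vector index",
    }
    return "lexical", degraded


conn = sqlite3.connect(":memory:")
# A plain table stands in for the vec0 virtual table in this sketch.
conn.execute("CREATE TABLE crs_reports_chunks_vec0 (chunk_id INTEGER PRIMARY KEY)")
conn.execute("CREATE TABLE committees (committee_id TEXT PRIMARY KEY, name TEXT)")

print(resolve_mode(conn, "crs-reports", "hybrid"))
print(resolve_mode(conn, "committees", "hybrid"))
```

Because eligibility is read from sqlite_master at call time, adding a new *_chunks_vec0 table changes the outcome without touching any configuration, which is the property the paragraph above relies on.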
The system in layers
Everything below sits behind one HTTPS endpoint (/v1/...) and one MCP endpoint (/mcp), both served by the same josh-core FastAPI process. MCP is a thin wrapper over REST — same handlers, same auth, same citation block.
josh-core — one process, two surfaces, four flows
[Architecture diagram, top to bottom]
- Agent hosts
- Transport: MCP (Streamable HTTP · /mcp) and REST (HTTPS · /v1/…)
- josh-core · FastAPI: the query router picks one of four flows by endpoint / tool
- Flows: Flow 1 ID lookup (SQL on relational tables); Flow 2 Lexical (FTS5 BM25); Flow 3 Semantic (vec0 · BQ + rescore); Flow 4 Hybrid (BM25 + vec · RRF)
- Storage: SQLite /data/josh.db + Corpus /data/corpus/<source>/…

Flow 1 — ID lookup / structured join
Pure SQL over normalized tables. No FTS5, no vector. The relational schema is the source of truth for who, what, where, and when — anything an ingester wrote into a typed column.
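A minimal sketch of a Flow 1 query, using the committee-membership case from this page. The table and column names follow the ones mentioned here (committee_memberships, legislators, bioguide_id, congress, role); the rank ordering column, the function signature, and all sample rows are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE legislators (
    bioguide_id TEXT PRIMARY KEY,
    official_full_name TEXT NOT NULL
);
CREATE TABLE committee_memberships (
    committee_id TEXT NOT NULL,
    bioguide_id TEXT NOT NULL REFERENCES legislators(bioguide_id),
    congress INTEGER NOT NULL,
    role TEXT NOT NULL,
    rank INTEGER NOT NULL,   -- assumed ordering column
    PRIMARY KEY (committee_id, bioguide_id, congress)
);
-- Placeholder rows, not real members.
INSERT INTO legislators VALUES
    ('X000001', 'Example Member One'),
    ('X000002', 'Example Member Two'),
    ('X000003', 'Example Member Three');
INSERT INTO committee_memberships VALUES
    ('HSWM', 'X000002', 119, 'member', 2),
    ('HSWM', 'X000001', 119, 'chair', 1),
    ('HSWM', 'X000003', 118, 'member', 1);
""")


def list_committee_members(conn, committee_id, congress, limit=50):
    """Deterministic join, bounded by LIMIT so the result set stays small."""
    return conn.execute(
        """
        SELECT l.bioguide_id, l.official_full_name, m.role
        FROM committee_memberships AS m
        JOIN legislators AS l ON l.bioguide_id = m.bioguide_id
        WHERE m.committee_id = ? AND m.congress = ?
        ORDER BY m.rank ASC
        LIMIT ?
        """,
        (committee_id, congress, limit),
    ).fetchall()


members = list_committee_members(conn, "HSWM", 119)
for row in members:
    print(row)
```

The same inputs always produce the same rows in the same order, and the congress predicate keeps the member from the 118th Congress out of the result.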
[Diagram: Flow 1 · ID lookup, structured join. Recoverable fragments: the congress=119 parameter, the SQL predicate AND congress=119, and ORDER BY rank ASC.]
Strengths & weaknesses
Strengths
- Exact, deterministic, reproducible. Same query → same answer every time. No "the vector said 0.82, did you want 0.81?"
- Sub-millisecond on indexed primary keys. The 32 GB host can run thousands of these per second without breaking a sweat.
- Citations are free: every row already carries source_url, published_at, citation_string from the ingester.
- Memory-safe. LIMIT-able. Never OOMs the box.
Weaknesses
- Only as good as the schema. If a fact isn't a column, you can't query it here.
- Requires the agent to know the identifier. "What's HR103 about?" works; "the bill about child protection" doesn't route here.
- No fuzziness. A typo in the ID returns 404, not "did you mean…?". Recovery is the agent's job (it can fall back to search).
- Many-to-many fan-out on deep joins (bill → cosponsors → bills sponsored by each cosponsor), bounded in v1 by pagination defaults.
Flow 2 — Lexical search (FTS5 BM25)
SQLite's built-in FTS5 with per-column BM25 weights (title 10×, abstract 4×, action 2×, body 1×). External-content tables: the index lives in *_fts, the records in the underlying bills / federal_register / etc. tables. Triggers keep them in sync.
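A minimal sketch of the external-content pattern with the per-column weights quoted above, assuming a Python build whose SQLite includes FTS5. The bills columns, the single insert trigger, the function name, and every sample row are illustrative assumptions, not the project's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bills (
    id INTEGER PRIMARY KEY,
    bill_id TEXT NOT NULL,
    congress INTEGER NOT NULL,
    title TEXT, abstract TEXT, action TEXT, body TEXT
);
-- External-content index: bills_fts holds the index, bills holds the text.
CREATE VIRTUAL TABLE bills_fts USING fts5(
    title, abstract, action, body,
    content='bills', content_rowid='id'
);
-- Insert trigger only; a complete setup also needs update and delete triggers.
CREATE TRIGGER bills_ai AFTER INSERT ON bills BEGIN
    INSERT INTO bills_fts(rowid, title, abstract, action, body)
    VALUES (new.id, new.title, new.abstract, new.action, new.body);
END;
""")

# Made-up rows: one title match, one body match, one match in another congress.
rows = [
    ("hr:119:1", 119, "Child Tax Credit Expansion Act", "A bill to amend the Internal Revenue Code."),
    ("hr:119:2", 119, "Family Relief Act", "Expands the child tax credit for working families."),
    ("hr:118:3", 118, "Child Tax Credit Study Act", "Directs a study of refundable credits."),
]
rows += [(f"hr:119:{n}", 119, f"Highway Bill {n}", "Unrelated text about road funding.") for n in range(4, 9)]
conn.executemany(
    "INSERT INTO bills (bill_id, congress, title, abstract, action, body) "
    "VALUES (?, ?, ?, '', '', ?)",
    rows,
)


def lexical_search(conn, query, congress, limit=20):
    # bm25() returns lower (more negative) values for better matches,
    # so ascending order puts the best hit first.
    return conn.execute(
        """
        SELECT b.bill_id, b.title,
               bm25(bills_fts, 10.0, 4.0, 2.0, 1.0) AS score
        FROM bills_fts
        JOIN bills AS b ON b.id = bills_fts.rowid
        WHERE bills_fts MATCH ? AND b.congress = ?
        ORDER BY score ASC
        LIMIT ?
        """,
        (query, congress, limit),
    ).fetchall()


hits = lexical_search(conn, '"child tax credit"', 119)
for bill_id, title, score in hits:
    print(bill_id, title, round(score, 3))
```

With the title weighted 10× over the body, the bill that carries the phrase in its title ranks ahead of the one that only mentions it in the body, and the congress filter removes the 118th-Congress match.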
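[Diagram: Flow 2 · Lexical, FTS5 BM25. Recoverable fragments: MCP tool arguments source='bills', filters={congress:119}; REST query &source=bills&congress=119&mode=lexical; column weights ending in action 2 · body 1; ORDER BY bm25(…) ASC LIMIT 20; each response row carries snippet {text, highlights} and a citation.]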
Strengths & weaknesses
Strengths
- Catches exact tokens — bill numbers, statute citations, named programs, named people, dollar amounts. Where dense embeddings drop the ball, BM25 nails it.
- No model required. Works from day one, no embedding backfill, no GPU. The very first deployment can run lexical search before any vectors exist.
- FTS5 native syntax supported: phrase matching ("medical device"), prefix (climat*), NEAR, AND/OR/NOT.
- BM25 scores are explainable. The agent can see why a result matched.
Weaknesses
- No paraphrase tolerance. "Cars" doesn't match "automobiles" unless an analyzer is added.
- Per-column weight tuning is per-source. Defaults are picked from the literature; corpus-specific tuning improves quality but needs eval fixtures.
- FTS5 syntax is user-facing. Malformed queries (for example, unbalanced quotes) raise errors, so the search endpoint sanitizes input against a small allowlist before passing it through.
- Stopwords and morphology are SQLite-default. No fancy stemming. Acceptable trade for "no infrastructure required."
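One way the sanitization step could look, as a hedged sketch. The allowlist below (word characters, whitespace, double quotes, the prefix star, parentheses, and commas) and the function name are assumptions, not the endpoint's actual rules. It repairs only the unbalanced-quote case and does not try to balance parentheses.

```python
import re

# Hypothetical allowlist: anything outside it is replaced with a space.
_DISALLOWED = re.compile(r'[^\w\s"*(),]')


def sanitize_fts_query(raw):
    cleaned = _DISALLOWED.sub(" ", raw)
    # An odd number of double quotes means one phrase was never closed;
    # drop the last quote so FTS5 does not raise a syntax error.
    if cleaned.count('"') % 2 == 1:
        idx = cleaned.rfind('"')
        cleaned = cleaned[:idx] + " " + cleaned[idx + 1:]
    # Collapse runs of whitespace left behind by the substitutions.
    return " ".join(cleaned.split())


print(sanitize_fts_query('"medical device'))
print(sanitize_fts_query('"medical device" climat*'))
print(sanitize_fts_query("PFAS; rules"))
```

Well-formed phrase and prefix queries pass through unchanged; only the stray quote and the disallowed punctuation are removed.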
Flow 3 — Semantic search (vector, pre-filtered)
Each chunk in the body of a text-bearing source carries a 1024-dim embedding stored in a *_chunks_vec0 virtual table. Retrieval is the BQ+rescore pattern: a brute-force scan of the bit-quantized column (1 bit per dimension) finds the top-K (K≈100) candidates, then a float-precision cosine rescore picks the final order. Filters (source, since, until) restrict the candidate set before scoring.
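The BQ+rescore pattern can be sketched in pure Python. This is an illustration under stated assumptions, not the vec0 implementation: quantization here is a simple sign bit per dimension, vectors are random toy data, and the even-numbered ID set stands in for whatever the source / since / until filters would select.

```python
import math
import random

DIM = 1024


def binarize(vec):
    """Pack sign bits into one int: bit i is set when vec[i] > 0."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits


def hamming(a, b):
    return bin(a ^ b).count("1")


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def bq_rescore_search(query, chunks, allowed_ids, k_candidates=100, k_final=20):
    """chunks maps chunk_id -> (float_vector, bit_vector)."""
    q_bits = binarize(query)
    # Stage 1: cheap Hamming scan, restricted to the pre-filtered IDs.
    candidates = sorted(
        allowed_ids, key=lambda cid: hamming(q_bits, chunks[cid][1])
    )[:k_candidates]
    # Stage 2: full-precision cosine rescore of the survivors.
    rescored = sorted(
        candidates, key=lambda cid: cosine(query, chunks[cid][0]), reverse=True
    )
    return rescored[:k_final]


random.seed(0)
chunks = {}
for cid in range(300):
    vec = [random.gauss(0.0, 1.0) for _ in range(DIM)]
    chunks[cid] = (vec, binarize(vec))

# Query: a lightly perturbed copy of chunk 42, so chunk 42 should win.
query = [x + random.gauss(0.0, 0.1) for x in chunks[42][0]]
allowed_ids = [cid for cid in chunks if cid % 2 == 0]  # stand-in for a filter

top = bq_rescore_search(query, chunks, allowed_ids)
print(top[:5])
```

The bit scan only has to get the right chunk into the candidate pool; the float rescore decides the final order. Restricting stage 1 to allowed_ids is what makes this a pre-filter rather than a post-filter.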
Flow 3 · Semantic — vector, pre-filtered
[Diagram: request path]
- Agent question: "Bills about reducing tailpipe emissions from passenger cars"
- MCP tool: semantic_search with query='…', source='bills', filters={congress:119}
- REST endpoint: GET /v1/search?q=reducing+tailpipe…&source=bills&congress=119&mode=semantic

1. Embed the query. Encoder reduces query text to a 1024-dim float vector. Same model as the chunks; query-time singleton in josh-core.
2. Pre-filter the candidate set. SELECT rowid FROM bills WHERE congress = 119 → N rowids. Filters apply before scoring, never after; otherwise top-K can be empty after filtering.
3. BQ scan for top-K candidates. Hamming-distance scan on bills_chunks_vec0 against the bit-quantized query, restricted to the filtered rowid set. LIMIT 100.
4. Float-precision rescore. Top-100 candidates reranked by full-precision cosine on the float[1024] column. This is the cheap-then-precise pattern: BQ for recall, float for ranking.
5. Join back & assemble response. Top-20 chunks join to their parent records. Citation block per row from citation_for(record). Snippet excerpted around the matched section.

Strengths & weaknesses
Strengths
- Paraphrase tolerance. "Tailpipe emissions" finds bills using "vehicle exhaust" or "Title II of the Clean Air Act."
- Conceptual similarity that lexical search can't see: "bills similar to HR103" works.
- Binary-quantized (BQ) storage is 32× smaller than float storage and ~20× faster to scan; the v1 corpus (≤ 5M chunks) fits comfortably.
- Encoder swap is a side-by-side rebuild + config flip. Every embedded row carries model_id and model_version (see embedding architecture).
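The 32× figure follows from the storage arithmetic. The sketch below assumes float32 (4 bytes per dimension) for the float column and counts raw vector payload only, ignoring any index or row overhead.

```python
DIM = 1024
CHUNKS = 5_000_000               # v1 upper bound quoted on this page

float_bytes_per_vec = DIM * 4    # assumed float32: 4 bytes per dimension
bit_bytes_per_vec = DIM // 8     # 1 bit per dimension

ratio = float_bytes_per_vec // bit_bytes_per_vec
float_total_gb = CHUNKS * float_bytes_per_vec / 1e9
bit_total_gb = CHUNKS * bit_bytes_per_vec / 1e9

print(ratio, float_total_gb, bit_total_gb)   # 32 20.48 0.64
```

At the 5M-chunk bound the bit-quantized column is well under a gigabyte, while the float column is about 20 GB (decimal).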
Weaknesses
- Misses exact tokens. "HR103" returns vaguely related bills, not HR103 itself. Identifier lookups must never enter this path — they route to Flow 1.
- Requires embeddings — a build step. New deployments need hours-to-days for a backfill at v1 scale. The lexical path (Flow 2) is the bridge.
- Pre-filter must be designed in. Post-filtering the top-K is the wrong shape — a date-restricted query could return zero results from a top-100 that's all out of the date range. Search spec enforces pre-filter.
- Scores are not human-interpretable. "0.81 cosine" is meaningful in aggregate, not for a single result.
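The pre-filter weakness is easiest to see on a toy corpus. In this sketch the document IDs, years, and similarity scores are all made up: 100 out-of-range documents outscore the single in-range one, so a post-filtered top-100 comes back empty.

```python
def top_k(scored, k):
    """scored is a list of (doc_id, score); higher score is better."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# (doc_id, year, similarity to the query). All values are invented.
corpus = [(f"doc{i}", 2024, 0.90 - i * 0.001) for i in range(100)]
corpus.append(("doc_2026", 2026, 0.40))   # the only in-range document


def post_filter(corpus, year, k):
    # Score everything, keep the top-k, then apply the date filter.
    best = top_k([(d, s) for d, _, s in corpus], k)
    years = {d: y for d, y, _ in corpus}
    return [d for d, _ in best if years[d] == year]


def pre_filter(corpus, year, k):
    # Apply the date filter first, then score only the eligible set.
    eligible = [(d, s) for d, y, s in corpus if y == year]
    return [d for d, _ in top_k(eligible, k)]


print(post_filter(corpus, 2026, 100))   # []
print(pre_filter(corpus, 2026, 100))    # ['doc_2026']
```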
Flow 4 — Hybrid search (BM25 + vector, RRF)
Default mode of /v1/search. Runs the FTS5 path and the vector path in parallel, then fuses the two ranked lists with Reciprocal Rank Fusion. RRF doesn't need the two scoring systems to be comparable — it works on ranks only.
Flow 4 · Hybrid — BM25 + vector fused via RRF
[Diagram: hybrid fusion]
- GET /v1/search ?q=… & mode=hybrid fans out to two legs.
- FTS5 · BM25 leg returns a top-K ranked list: [ A, B, C, D, … ]
- Vector · BQ + rescore leg returns a top-K ranked list: [ B, E, A, F, … ]
- Reciprocal Rank Fusion: score(d) = Σ_i 1 / (k + rank_i(d)), with k = 60
- Fused output: [ B, A, E, C, D, F, … ] · top-20 + citations

RRF intuition: a doc that appears in both lists gets contributions from both; a doc in one list still ranks if it's high enough there. k=60 is the community default: large enough that low-ranked items don't dominate, small enough that rank-1 in either list is a strong signal.
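The fusion step is small enough to sketch directly. The function name and the tie-breaking rule (first-seen order) are assumptions; the formula and k = 60 come from the diagram, and the two input lists are the diagram's example lists.

```python
def rrf(ranked_lists, k=60, top_n=20):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # sorted() is stable, so tied documents keep first-seen order.
    ordered = sorted(scores, key=lambda d: scores[d], reverse=True)
    return ordered[:top_n]


bm25_leg = ["A", "B", "C", "D"]
vector_leg = ["B", "E", "A", "F"]

fused = rrf([bm25_leg, vector_leg])
print(fused)   # ['B', 'A', 'E', 'C', 'D', 'F']
```

B and A lead because both legs returned them. B edges out A since 1/62 + 1/61 is larger than 1/61 + 1/63. D and F tie at 1/64 each, so their relative order depends on the tie-break.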
Strengths & weaknesses
Strengths
- Robust to ambiguity. Works whether the query has exact tokens or is a paraphrase. Strong "I don't know what shape the question is" default.
- RRF has no tunable score scaling — purely rank-based, can't be gamed by one side returning huge scores.
- Result diversity. Hybrid lists are less of a monoculture than either path alone: the union of the BM25 and vector candidate sets is larger than either leg's individual top-N.
- Drop-in safe. RRF is the lowest-risk way to add semantic to a lexical-only system, or vice versa — neither side dominates.
Weaknesses
- Two scans instead of one — roughly 1.5–2× the latency of either alone. Still inside the 500 ms p95 budget at v1 scale, but worth knowing.
- k is the one knob. Needs an offline eval to pick well for the corpus. Default 60 is safe but not optimal.
- Harder to explain a single result's rank. The agent can still see both contributing scores in the response, but it's noisier than a single BM25 number.
- Doesn't help when neither side has the right doc. Hybrid is not a magic upgrade; it's an averaging.
How the agent picks the right flow
Tool names + descriptions, not a learned classifier. The MCP server exposes three classes of tools in mcp-server:
| Class | Tools | Routes to |
|---|---|---|
| A. Typed structured lookups | get_bill, get_legislator, get_committee, list_committee_members, list_cosponsors, get_bill_text, list_member_votes | Flow 1 — SQL on relational tables |
| B. Text search | search (hybrid, default), lexical_search, semantic_search | Flow 2, 3, or 4 depending on tool |
| C. Universal fetch | fetch(id) — dispatches by ID prefix | Flow 1 — required by ChatGPT |
The agent sees the tool descriptions and picks. Three loose rules drilled into the descriptions:
- If the question contains an identifier, prefer Class A or C. Never Class B.
- If the question is a keyword lookup ("the words must appear"), prefer lexical_search.
- If the question is a concept or paraphrase, prefer semantic_search, or just search if unsure.
Why two redundant paths for IDs (Class A typed tools and Class C fetch)? fetch is mandatory for ChatGPT App compatibility (OpenAI's retrieval pipeline expects the name). The typed tools are sharper for Claude — the LLM picks get_bill('hr:119:103') over fetch('hr:119:103') when both are available because the typed name and description are more specific. Both call the same REST handler.
Worked examples
Three queries traced end-to-end.
"Does HR103 have language about child protection?"
This is the question that motivated the doc. Two paths to the same answer; the smart one uses both flows.
Wrong path: semantic_search({query: 'HR103 child protection'}).
Why it fails: the vector encoder doesn't know what HR103 is. It returns vaguely related bills, not HR103 itself, and the agent hallucinates.
Path 1 (Flow 1 only):
- get_bill_text(bill_id='hr:119:103') → returns body_text + citation, deterministic.
- Agent scans the returned text for "child protection" (or for any of: minor, juvenile, child welfare, <18 years) → finds the phrase or doesn't; cites the section if found.
Path 2 (Flow 1, then Flow 2):
- get_bill(bill_id='hr:119:103') → confirm bill exists, get title.
- lexical_search({query: 'child protection', source: 'bills', filters: {bill_id: 'hr:119:103'}}) → BM25 over the chunks of that one bill, returns matching sections with offsets for citing.
"Who is on the Ways and Means Committee?"
Pure structured. Vector retrieval would be malpractice.
list_committee_members(committee_id='HSWM', congress=119)
- JOIN committee_memberships × legislators
- Ordered list of legislators with role (chair / ranking member / member)
- Each row carries its own citation
Note: HSWM is the Thomas committee ID for House Ways and Means. The agent can look up the committee ID first via get_committee by name if it doesn't know it, but committee IDs are stable and worth pinning in agent prompts.
"EPA regulations on PFAS in 2026"
This is the canonical hybrid query. The agent doesn't have an ID; "PFAS" is an exact token (chemistry abbreviation, won't be paraphrased) but "regulations on" is conceptual.
search({query: 'EPA PFAS', source: 'federal-register', since: '2026-01-01'}) — mode=hybrid (default).
- FTS5 leg matches "PFAS" exactly in any document title or body.
- Vector leg matches semantically-similar docs ("perfluoroalkyl", "forever chemicals", "drinking water contaminant rules").
- RRF fusion — top-20 with both signals.
- Each result carries snippet + citation; the agent picks the most-cited / most-recent for the answer.
API surface summary
All four flows sit behind two endpoint families in josh-core (the per-resource endpoints and /v1/search), and the MCP server wraps both:
| Need | REST | MCP tool | Flow |
|---|---|---|---|
| Fetch a record by ID | GET /v1/<resource>/{id} | get_bill, get_legislator, …, fetch | 1 |
| List members of a committee | GET /v1/committees/{id}/members | list_committee_members | 1 |
| List cosponsors of a bill | GET /v1/bills/{id}/cosponsors | list_cosponsors | 1 |
| List votes by a legislator | GET /v1/legislators/{id}/votes | list_member_votes | 1 |
| Keyword scan | GET /v1/search?mode=lexical | lexical_search | 2 |
| Conceptual scan | GET /v1/search?mode=semantic | semantic_search | 3 |
| Mixed-shape question | GET /v1/search (default hybrid) | search | 4 |
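For a REST client, the three search rows differ only in the mode parameter. A minimal sketch of building the request URL, assuming a placeholder host (josh.example is hypothetical) and using the query parameters that appear in this page's examples (q, source, congress, mode):

```python
from urllib.parse import urlencode

BASE_URL = "https://josh.example"   # hypothetical host, not a real deployment


def search_url(q, mode="hybrid", **filters):
    """Build a /v1/search URL; filters become plain query parameters."""
    params = {"q": q, **filters, "mode": mode}
    return f"{BASE_URL}/v1/search?{urlencode(params)}"


url = search_url(
    "reducing tailpipe emissions", mode="semantic", source="bills", congress=119
)
print(url)
```

Switching the same question to a keyword scan or to the hybrid default is a one-argument change, which is why the MCP tools lexical_search, semantic_search, and search can share one handler.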
What this design depends on
A few load-bearing invariants. If any of them breaks, the routing logic above stops being safe.
- Type-prefixed IDs. Every public ID issued by the substrate has a unique type prefix (per rest-api-conventions §5). Without this, fetch(id) can't dispatch unambiguously and the agent has to guess. The ID grammar is enforced by ingester contract tests.
- Pre-filter, not post-filter. When a semantic query carries source / since / until, those filters apply before vector scoring. Otherwise a date-restricted query can come back empty even when matching documents exist. Enforced by an acceptance criterion in rest-api-search.
- Round-trip ID property. Every id returned by search is accepted by the matching resource endpoint. The agent can chain "search → fetch the top hit" without translating IDs. Contract-tested in both directions.
- Citation is a column read, not a re-derive. Every record carries source_url, published_at, citation_string, etc. The search response and the resource response both quote the same fields. No path produces a record without a citation.
- Memory-safe queries. Every flow bounds its result set (LIMIT, pagination, top-K). No unbounded scan over the substrate. The 32 GB host is the budget; runaway queries are not allowed.
Open questions
- RRF k parameter. 60 is the community default. The corpus is small enough (≤ 5M chunks at v1) that k=10–20 might be better. Pin via an offline eval with a fixed Q&A set before flipping the search spec to verified.
- MCP tool descriptions. Tool selection is description-driven. We need a small routing eval (~30 Q&A pairs covering all four flows) to confirm the LLM picks the right tool. Lives at josh-core/tests/fixtures/mcp_routing_eval.jsonl; tracked in mcp-server task t8.
- Per-source BM25 weights. Defaults are title 10, abstract 4, action 2, body 1. Likely fine for v1, but corpus-specific tuning beats the defaults. Eval after the first two sources ship.
- Snippet length & highlight policy. 200 chars is the search-spec default. May need per-source variation (a bill snippet wants more context than a FR digest snippet).
- "Find me bills similar to HR103" flow. Not yet a tool. It would be a Flow 1 → Flow 3 chain: fetch HR103's chunks → average or pick a centroid → semantic-search by that vector with source=bills filter. Worth a dedicated tool once a real consumer asks.
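The "similar bills" chain from the last item could look roughly like this. It is a sketch of the idea only, since no such tool exists yet: the function names are hypothetical, the bill IDs other than hr:119:103 are placeholders, the 3-dimensional vectors are toy data, and scoring each bill by its best-matching chunk is one possible choice among several.

```python
import math


def centroid(vectors):
    """Element-wise mean of a bill's chunk vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def similar_bills(bill_id, chunks_by_bill, top_n=5):
    """chunks_by_bill maps bill_id -> list of chunk vectors."""
    query = centroid(chunks_by_bill[bill_id])        # Flow 1: fetch, then average
    scored = []
    for other, chunks in chunks_by_bill.items():     # Flow 3: rank by similarity
        if other == bill_id:
            continue                                 # never return the seed bill
        best = max(cosine(query, c) for c in chunks)
        scored.append((other, best))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [b for b, _ in scored[:top_n]]


chunks_by_bill = {
    "hr:119:103": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "hr:119:200": [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]],
    "hr:119:300": [[0.0, 1.0, 0.0]],
}

print(similar_bills("hr:119:103", chunks_by_bill))
```

A real version would run the vector step through the same pre-filtered BQ+rescore path as any other semantic query rather than a full scan.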
Related
- rest-api-conventions — cross-cutting REST decisions every flow inherits (envelopes, IDs, errors, pagination).
- rest-api-search — the unified search endpoint backing flows 2, 3, 4.
- rest-api-resource-endpoints — the per-resource fetch endpoints backing flow 1.
- mcp-server — the MCP wrapper that turns these flows into agent tools.
- Embedding architecture — how chunks become vectors (the backbone of flows 3 and 4).
- Chunking — what gets embedded, and why.