
Query flows

Four ways a question becomes an answer in Josh. Three of them are not semantic search. The substrate is relational with FTS5 and vector layers stacked on top — picking the right layer for the right question is the whole game.

Why this page exists

An easy mistake when building any AI-facing data product is to assume every question is a semantic-search question. It isn't. A large fraction of questions an agent asks of Josh are citation-style lookups — questions with an identifier baked in, or whose answer is a SQL join, not a similarity score.

Two examples:

  • "Does HR103 mention child protection?" — the bill ID pins the document. The question is "scan a fixed text for a phrase," which is a deterministic keyword scan or a column read, not a vector cosine over 5 million chunks.
  • "Who is on the Ways and Means Committee?" — the answer is a SQL join on committee_memberships. Vector retrieval here would be malpractice.

Dense embeddings reliably miss exact identifiers and named entities — well documented in the legal-RAG literature. Routing identifier-bearing questions through the vector path produces fuzzy answers to questions that have exact ones. Josh's substrate is designed to keep those paths separate.

The four query flows

| # | Flow | Wraps | When the agent picks it |
|---|------|-------|-------------------------|
| 1 | ID lookup / structured join | Plain SQL on relational tables | Question contains an identifier (hr:119:103, S000033, HSWM), or the answer is a join over normalized columns ("members of X committee"). |
| 2 | Lexical search (FTS5 BM25) | *_fts external-content tables | Question is a keyword scan ("bills mentioning child tax credit", "find PFAS in regulations"). Exact tokens matter; the user wrote the words they want matched. |
| 3 | Semantic search (vector) | *_chunks_vec0 BQ+rescore | Question is a paraphrase or conceptual match ("bills about reducing tailpipe emissions"). The user wouldn't necessarily write the exact tokens in the underlying document. |
| 4 | Hybrid search (BM25 + vector, RRF) | Both FTS5 + vec0, fused | Default when the agent isn't sure. Robust to both exact and paraphrased queries; the cost is two scans instead of one. |

Pre-filters (source, since, until) apply to all four flows. In flows 3 and 4, filters apply before vector scoring, not after — a hard requirement in the search spec.

Source eligibility matrix

Not every source supports every flow. Embedding a committee name or a senator's bioguide ID would be malpractice — the vector would tell you "John Smith" is similar to "John Smithson," which is worse than useless for citation work. The substrate splits sources into three shapes; the migration schema enforces it (presence of a <source>_chunks_vec0 table is the ground truth, derived dynamically by the search service — never hard-coded).

Shape A · Body-bearing (11 sources). Free text long enough to chunk and embed (typical doc > 1K tokens). All four flows apply.
Flows: 1 · 2 · 3 · 4
Shape B · Short-body (1 source). Has a body in FTS5 but no chunks/vectors (docs are 1–3 pages; per-doc embedding deferred to v1.x).
Flows: 1 · 2 (not 3 or 4)
Shape C · Registry / lookup (9 sources). Structured metadata. FTS5 over short fields (names, jurisdictions) for fuzzy ID lookup, not body search.
Flows: 1 · 1b (not 2, 3, or 4)
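
The rule that table presence is the ground truth can be sketched as a schema probe. This is a minimal illustration, not the search service's actual code: `eligible_flows` is a hypothetical helper, the `<source>_fts` naming is assumed from the `*_fts` pattern used elsewhere on this page, and the demo uses plain tables as stand-ins for the FTS5 and vec0 virtual tables.

```python
import sqlite3

def eligible_flows(conn, source_table):
    """Derive the supported flows for one source from the live schema."""
    names = {row[0] for row in
             conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")}
    has_fts = f"{source_table}_fts" in names
    has_vec = f"{source_table}_chunks_vec0" in names
    flows = [1]                      # Flow 1 (plain SQL) is always available
    if has_fts:
        flows.append(2)              # lexical, or the "1b" name lookup on registries
    if has_vec:
        flows.append(3)              # semantic needs the vector table
    if has_fts and has_vec:
        flows.append(4)              # hybrid needs both legs
    return flows

# Stand-in schema: ordinary tables named like the real virtual tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bills (id INTEGER PRIMARY KEY);
CREATE TABLE bills_fts (x);
CREATE TABLE bills_chunks_vec0 (x);
CREATE TABLE committees (id INTEGER PRIMARY KEY);
CREATE TABLE committees_fts (x);
CREATE TABLE committee_memberships (committee_id TEXT);
""")

for table in ("bills", "committees", "committee_memberships"):
    print(table, eligible_flows(conn, table))
```

Because nothing is hard-coded, adding a `_chunks_vec0` table for a source is the only step needed for the probe to report flows 3 and 4.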

"Flow 1b" is the fuzzy-name use of FTS5 over short structured fields on registry sources — not real lexical search over a body. The lexical_search MCP tool accepts it (calling lexical_search({query:'ways and means', source:'committees'}) finds the committee by name); the semantic_search tool rejects it.

Shape A · Body-bearing

All four flows apply. The default search (hybrid) tool runs both BM25 and vector legs and fuses with RRF.

| Source | Status | Flow 1 (ID) | Flow 2 (Lexical) | Flow 3 (Semantic) | Flow 4 (Hybrid) | Notes |
|--------|--------|-------------|------------------|-------------------|-----------------|-------|
| bills | planned | ✓ | ✓ | ✓ | ✓ | BILLSTATUS bodies up to 1.9 MB; chunks > 5K tokens routinely. |
| cbo-cost-estimates | planned | ✓ | ✓ | ✓ | ✓ | PDFs → markdown via pdfplumber; broader analyses cross the chunk threshold. |
| committee-reports | planned | ✓ | ✓ | ✓ | ✓ | GovInfo CRPT. Section-by-section legislative intent; most reports > 5K tokens. |
| congressional-record | planned | ✓ | ✓ | ✓ | ✓ | CREC granules (speeches, motions, quorum calls); each granule has its own body. |
| crs-reports | verified | ✓ | ✓ | ✓ | ✓ | First source through the full pipeline; the worked example everywhere else copies. |
| ecfr-and-cfr | planned | ✓ | ✓ | ✓ | ✓ | XML → markdown per section. Long-lived regulatory text. |
| federal-register | planned | ✓ | ✓ | ✓ | ✓ | NPRMs and rules 30K+ tokens; chunks enqueued > 5K. |
| gao-reports | planned | ✓ | ✓ | ✓ | ✓ | PDF → markdown. FTS5 weights tuned for title / fast-facts / recommendations / body. |
| hearing-transcripts | planned | ✓ | ✓ | ✓ | ✓ | Speaker-tagged markdown; almost every transcript exceeds 5K tokens. |
| public-laws | planned | ✓ | ✓ | ✓ | ✓ | USLM XML → markdown. Enacted statute text. |
| us-code | planned | ✓ | ✓ | ✓ | ✓ | USLM statutes; FTS5 on heading / chapeau / body for section-level retrieval. |
<div class="matrix-section is-short">
<p class="heading"><span class="dot"></span>Shape B &middot; Short-body</p>
<p class="sub">FTS5 over the full body but no chunks/vec0 &mdash; documents are short enough that BM25 alone is sufficient. <code>semantic_search</code> rejects; <code>search</code> auto-downgrades to lexical.</p>
</div>
<table class="matrix-table">
<thead>
<tr>
<th>Source</th>
<th class="center">Flow 1<br><span style="font-weight:400;text-transform:none;letter-spacing:0">ID</span></th>
<th class="center">Flow 2<br><span style="font-weight:400;text-transform:none;letter-spacing:0">Lexical</span></th>
<th class="center">Flow 3<br><span style="font-weight:400;text-transform:none;letter-spacing:0">Semantic</span></th>
<th class="center">Flow 4<br><span style="font-weight:400;text-transform:none;letter-spacing:0">Hybrid</span></th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td><span class="source-name">statements-of-administration-policy</span><span class="status-pill">planned</span></td>
<td class="cell-flow is-yes">&check;</td>
<td class="cell-flow is-yes">&check;</td>
<td class="cell-flow is-no">&minus;</td>
<td class="cell-flow is-no">&minus;</td>
<td class="note">SAPs are 1&ndash;3 pages; FTS5 over body is enough at v1. Per-doc embedding deferred to v1.x.</td>
</tr>
</tbody>
</table>
<div class="matrix-section is-registry">
<p class="heading"><span class="dot"></span>Shape C &middot; Registry / lookup</p>
<p class="sub">No body text. FTS5 indexes exist over short fields (names, jurisdictions, issue labels) for fuzzy ID lookup &mdash; "Flow 1b." <code>lexical_search</code> works against these short fields; <code>semantic_search</code> rejects; <code>search</code> auto-downgrades. Class A typed tools (<code>get_legislator</code>, <code>list_committee_members</code>, etc.) are the preferred entry point.</p>
</div>
<table class="matrix-table">
<thead>
<tr>
<th>Source</th>
<th class="center">Flow 1<br><span style="font-weight:400;text-transform:none;letter-spacing:0">ID</span></th>
<th class="center">Flow 2<br><span style="font-weight:400;text-transform:none;letter-spacing:0">Lexical</span></th>
<th class="center">Flow 3<br><span style="font-weight:400;text-transform:none;letter-spacing:0">Semantic</span></th>
<th class="center">Flow 4<br><span style="font-weight:400;text-transform:none;letter-spacing:0">Hybrid</span></th>
<th>Notes</th>
</tr>
</thead>
<tbody>
<tr>
<td><span class="source-name">legislators</span><span class="status-pill is-verified">verified</span></td>
<td class="cell-flow is-yes">&check;</td>
<td class="cell-flow is-name">name</td>
<td class="cell-flow is-no">&minus;</td>
<td class="cell-flow is-no">&minus;</td>
<td class="note">FTS5 on official_full_name / last_name / first_name. Bioguide ID is the primary lookup.</td>
</tr>
<tr>
<td><span class="source-name">committees</span><span class="status-pill is-verified">verified</span></td>
<td class="cell-flow is-yes">&check;</td>
<td class="cell-flow is-name">name</td>
<td class="cell-flow is-no">&minus;</td>
<td class="cell-flow is-no">&minus;</td>
<td class="note">FTS5 on name / jurisdiction. Thomas committee IDs are stable.</td>
</tr>
<tr>
<td><span class="source-name">committee-memberships</span><span class="status-pill is-verified">verified</span></td>
<td class="cell-flow is-yes">&check;</td>
<td class="cell-flow is-no">&minus;</td>
<td class="cell-flow is-no">&minus;</td>
<td class="cell-flow is-no">&minus;</td>
<td class="note">Pure join table: (committee_id, bioguide_id, congress, role). No FTS5 at all.</td>
</tr>
<tr>
<td><span class="source-name">hearings</span><span class="status-pill">planned</span></td>
<td class="cell-flow is-yes">&check;</td>
<td class="cell-flow is-name">name</td>
<td class="cell-flow is-no">&minus;</td>
<td class="cell-flow is-no">&minus;</td>
<td class="note">Meeting metadata only. Transcript bodies live in the separate <code>hearing-transcripts</code> source.</td>
</tr>
<tr>
<td><span class="source-name">lda-filings</span><span class="status-pill">planned</span></td>
<td class="cell-flow is-yes">&check;</td>
<td class="cell-flow is-name">name</td>
<td class="cell-flow is-no">&minus;</td>
<td class="cell-flow is-no">&minus;</td>
<td class="note">Lobbying disclosures. FTS5 on client_name / registrant_name / lobbying_issues_text.</td>
</tr>
<tr>
<td><span class="source-name">regulations-dot-gov-dockets</span><span class="status-pill">planned</span></td>
<td class="cell-flow is-yes">&check;</td>
<td class="cell-flow is-name">name</td>
<td class="cell-flow is-no">&minus;</td>
<td class="cell-flow is-no">&minus;</td>
<td class="note">Docket metadata only. Comment-body text lives upstream at regulations.gov.</td>
</tr>
<tr>
<td><span class="source-name">roll-call-votes</span><span class="status-pill">planned</span></td>
<td class="cell-flow is-yes">&check;</td>
<td class="cell-flow is-name">name</td>
<td class="cell-flow is-no">&minus;</td>
<td class="cell-flow is-no">&minus;</td>
<td class="note">FTS5 on the short question/vote_description text. Per-member positions are a structured join.</td>
</tr>
<tr>
<td><span class="source-name">staff-directories</span><span class="status-pill">planned</span></td>
<td class="cell-flow is-yes">&check;</td>
<td class="cell-flow is-name">name</td>
<td class="cell-flow is-no">&minus;</td>
<td class="cell-flow is-no">&minus;</td>
<td class="note">FK to legislators + committees. FTS5 on name / role.</td>
</tr>
<tr>
<td><span class="source-name">topic-taxonomy</span><span class="status-pill">planned</span></td>
<td class="cell-flow is-yes">&check;</td>
<td class="cell-flow is-name">name</td>
<td class="cell-flow is-no">&minus;</td>
<td class="cell-flow is-no">&minus;</td>
<td class="note">LCSH vocabulary. FTS5 on label / broader_labels.</td>
</tr>
</tbody>
</table>

Behavior on mismatch: mode=semantic against any non-vector source returns HTTP 400 with error.code='source_not_searchable_semantically'; mode=hybrid against a short-body or registry source auto-downgrades to lexical-only and surfaces the change in a top-level degraded block in the response. The classification is derived from the live schema, so new body-bearing sources become semantically searchable automatically when they ship, with no separate config.
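
The mismatch rules above reduce to a small decision function. The sketch below is illustrative only: `resolve_mode` is a hypothetical name, and the field names inside the `degraded` block are assumptions, since this page specifies only that the block exists at the top level of the response.

```python
def resolve_mode(requested_mode, has_vectors):
    """Return (http_status, effective_mode, extra_fields) for a search request.

    has_vectors is True when the source has a <source>_chunks_vec0 table.
    """
    if requested_mode == "semantic" and not has_vectors:
        # Hard failure: the caller explicitly asked for something unsupported.
        error = {"error": {"code": "source_not_searchable_semantically"}}
        return 400, None, error
    if requested_mode == "hybrid" and not has_vectors:
        # Soft failure: run the lexical leg only and say so in the response.
        degraded = {"degraded": {"requested_mode": "hybrid",      # assumed field
                                 "effective_mode": "lexical"}}    # assumed field
        return 200, "lexical", degraded
    return 200, requested_mode, {}

print(resolve_mode("semantic", has_vectors=False))
print(resolve_mode("hybrid", has_vectors=False))
print(resolve_mode("hybrid", has_vectors=True))
```

The asymmetry is deliberate in the text above: an explicit semantic request fails loudly, while the hybrid default degrades and reports the change.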

The system in layers

Everything below sits behind one HTTPS endpoint (/v1/...) and one MCP endpoint (/mcp), both served by the same josh-core FastAPI process. MCP is a thin wrapper over REST — same handlers, same auth, same citation block.

josh-core — one process, two surfaces, four flows

Agent hosts

Claude.ai · connector
Cowork · plugin
ChatGPT · App
Claude Code · .mcp.json
curl · direct REST
<div class="arch-connector"></div>
<div class="arch-layer">
<p class="arch-layer-label">Transport</p>
<div class="arch-transports">
<div class="arch-transport">
<span class="name">MCP</span>
<span class="desc">Streamable HTTP · /mcp</span>
</div>
<div class="arch-transport">
<span class="name">REST</span>
<span class="desc">HTTPS · /v1/&hellip;</span>
</div>
</div>
</div>
<div class="arch-connector"></div>
<div class="arch-core">
<p class="arch-layer-label">josh-core · FastAPI</p>
<div class="arch-router">
Query router
<span class="sub">picks one of four flows by endpoint / tool</span>
</div>
<div class="arch-flows">
<div class="arch-flow is-1">
<span class="num">Flow 1</span>
<span class="label">ID lookup</span>
<span class="impl">SQL on relational tables</span>
</div>
<div class="arch-flow is-2">
<span class="num">Flow 2</span>
<span class="label">Lexical</span>
<span class="impl">FTS5 BM25</span>
</div>
<div class="arch-flow is-3">
<span class="num">Flow 3</span>
<span class="label">Semantic</span>
<span class="impl">vec0 · BQ + rescore</span>
</div>
<div class="arch-flow is-4">
<span class="num">Flow 4</span>
<span class="label">Hybrid</span>
<span class="impl">BM25 + vec · RRF</span>
</div>
</div>
<div class="arch-storage">
<strong>SQLite</strong> <span style="opacity:0.7">/data/josh.db</span>
&nbsp;+&nbsp;
<strong>Corpus</strong> <span style="opacity:0.7">/data/corpus/&lt;source&gt;/&hellip;</span>
</div>
</div>
</div>

Flow 1 — ID lookup / structured join

Pure SQL over normalized tables. No FTS5, no vector. The relational schema is the source of truth for who, what, where, and when — anything an ingester wrote into a typed column.

Flow 1 · ID lookup — structured join

Agent question: "Who is on Ways and Means in the 119th Congress?"
MCP tool: list_committee_members (committee_id='HSWM', congress=119)
REST endpoint: GET /v1/committees/HSWM/members?congress=119
SQL: JOIN committee_memberships × legislators WHERE committee_id='HSWM' AND congress=119 ORDER BY rank ASC
Response: { data: […], next_cursor: null } (card fieldset; citation block per row)
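
To make the join concrete, here is a minimal sketch against an in-memory SQLite database. The table and column names follow this page where it names them (`committee_memberships`, `legislators`, `committee_id`, `bioguide_id`, `congress`, `role`, `official_full_name`); the `rank` column on the membership table, the `limit` default, and the sample rows (including the made-up bioguide IDs) are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.executescript("""
CREATE TABLE legislators (
    bioguide_id        TEXT PRIMARY KEY,
    official_full_name TEXT NOT NULL
);
CREATE TABLE committee_memberships (
    committee_id TEXT    NOT NULL,
    bioguide_id  TEXT    NOT NULL REFERENCES legislators(bioguide_id),
    congress     INTEGER NOT NULL,
    role         TEXT    NOT NULL,
    rank         INTEGER NOT NULL,   -- assumed column, used for ordering
    PRIMARY KEY (committee_id, bioguide_id, congress)
);
-- Made-up sample rows, not real members.
INSERT INTO legislators VALUES
    ('X000001', 'Example Member One'),
    ('X000002', 'Example Member Two');
INSERT INTO committee_memberships VALUES
    ('HSWM', 'X000002', 119, 'member', 2),
    ('HSWM', 'X000001', 119, 'chair',  1),
    ('HSWM', 'X000001', 118, 'member', 5);
""")

def list_committee_members(conn, committee_id, congress, limit=50):
    """Flow 1: a parameterized, bounded join. No FTS5, no vectors."""
    sql = """
        SELECT l.bioguide_id, l.official_full_name, m.role
        FROM committee_memberships AS m
        JOIN legislators AS l ON l.bioguide_id = m.bioguide_id
        WHERE m.committee_id = ? AND m.congress = ?
        ORDER BY m.rank ASC
        LIMIT ?
    """
    return [dict(row) for row in conn.execute(sql, (committee_id, congress, limit))]

members = list_committee_members(conn, "HSWM", 119)
for member in members:
    print(member)
```

The same inputs always produce the same rows in the same order, which is the determinism the strengths list below relies on.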

Strengths & weaknesses

Strengths

  • Exact, deterministic, reproducible. Same query → same answer every time. No "the vector said 0.82, did you want 0.81?"
  • Sub-millisecond on indexed primary keys. The 32 GB host can run thousands of these per second without breaking a sweat.
  • Citations are free — every row already carries source_url, published_at, citation_string from the ingester.
  • Memory-safe. LIMIT-able. Never OOMs the box.

Weaknesses

  • Only as good as the schema. If a fact isn't a column, you can't query it here.
  • Requires the agent to know the identifier. "What's HR103 about?" works; "the bill about child protection" doesn't route here.
  • No fuzziness. A typo in the ID returns 404, not "did you mean…?". Recovery is the agent's job (it can fall back to search).
  • Many-to-many fan-out on deep joins (bill → cosponsors → bills sponsored by each cosponsor) — bounded in v1 by pagination defaults.

Flow 2 — Lexical search (FTS5 BM25)

SQLite's built-in FTS5 with per-column BM25 weights (title 10×, abstract 4×, action 2×, body 1×). External-content tables: the index lives in *_fts, the records in the underlying bills / federal_register / etc. tables. Triggers keep them in sync.

Flow 2 · Lexical — FTS5 BM25

Agent question: "Bills mentioning 'child tax credit' in the 119th Congress"
MCP tool: lexical_search (query='child tax credit', source='bills', filters={congress:119})
REST endpoint: GET /v1/search?q=child+tax+credit&source=bills&congress=119&mode=lexical
FTS5 BM25: bills_fts MATCH query; weights: title 10 · abstract 4 · action 2 · body 1; ORDER BY bm25(…) ASC LIMIT 20
Response: { data: […] } with id · title, snippet {text, highlights}, citation per row
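
A minimal sketch of the external-content pattern with weighted BM25, using Python's built-in `sqlite3` (this assumes the bundled SQLite was compiled with FTS5, which is typical). The column set mirrors the weights quoted above; the sample rows are invented, and only the insert trigger is shown, whereas a real schema would also need update and delete triggers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bills (
    id INTEGER PRIMARY KEY,
    title TEXT, abstract TEXT, action TEXT, body TEXT
);
-- External-content index: the text lives in bills, the index in bills_fts.
CREATE VIRTUAL TABLE bills_fts USING fts5(
    title, abstract, action, body,
    content='bills', content_rowid='id'
);
-- Keep the index in sync on insert (update/delete triggers omitted here).
CREATE TRIGGER bills_ai AFTER INSERT ON bills BEGIN
    INSERT INTO bills_fts(rowid, title, abstract, action, body)
    VALUES (new.id, new.title, new.abstract, new.action, new.body);
END;
""")

conn.executemany(
    "INSERT INTO bills (id, title, abstract, action, body) VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Child Tax Credit Expansion Act", "Expands the credit.",
         "Introduced", "Amends the revenue code."),
        (2, "Highway Funding Act", "Funds road repair.",
         "Introduced", "One rider mentions the child tax credit in passing."),
        (3, "Clean Water Act Amendments", "Updates discharge limits.",
         "Introduced", "Covers drinking water standards."),
    ],
)

# bm25() returns lower (more negative) values for better matches, so ASC
# puts the best hit first. Weights are per column, in declaration order.
rows = conn.execute(
    """
    SELECT bills.id, bills.title
    FROM bills_fts
    JOIN bills ON bills.id = bills_fts.rowid
    WHERE bills_fts MATCH ?
    ORDER BY bm25(bills_fts, 10.0, 4.0, 2.0, 1.0) ASC
    LIMIT 20
    """,
    ('"child tax credit"',),   # quoted so FTS5 treats it as a phrase
).fetchall()

for row in rows:
    print(row)
```

The title hit outranks the body-only hit because of the 10× title weight, even though both rows contain the exact phrase.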

Strengths & weaknesses

Strengths

  • Catches exact tokens — bill numbers, statute citations, named programs, named people, dollar amounts. Where dense embeddings drop the ball, BM25 nails it.
  • No model required. Works from day one, no embedding backfill, no GPU. The very first deployment can run lexical search before any vectors exist.
  • FTS5 native syntax supported: phrase matching ("medical device"), prefix (climat*), NEAR, AND/OR/NOT.
  • BM25 scores are explainable. The agent can see why a result matched.

Weaknesses

  • No paraphrase tolerance. "Cars" doesn't match "automobiles" unless a synonym-aware tokenizer is added.
  • Per-column weight tuning is per-source. Defaults are picked from the literature; corpus-specific tuning improves quality but needs eval fixtures.
  • FTS5 syntax is user-facing. Malformed queries (unbalanced quotes) raise errors, so the search endpoint sanitizes input against a small allowlist before passing it through.
  • Stopwords and morphology are SQLite-default. No fancy stemming. Acceptable trade for "no infrastructure required."

Flow 3 — Semantic search (vector, pre-filtered)

Each chunk in the body of a text-bearing source carries a 1024-dim embedding stored in a *_chunks_vec0 virtual table. Retrieval is the BQ+rescore pattern: a brute-force scan of the bit-quantized column (1 bit per dimension) finds the top-K (K≈100) candidates, then a float-precision cosine rescore picks the final order. Filters (source, since, until) restrict the candidate set before scoring.

Flow 3 · Semantic — vector, pre-filtered

<div class="pipeline" style="margin-bottom: 22px;">
<div class="pipeline-step">
<span class="step-label">Agent question</span>
<span class="step-title">"Bills about reducing tailpipe emissions from passenger cars"</span>
</div>
<div class="pipeline-step">
<span class="step-label">MCP tool</span>
<span class="step-title">semantic_search</span>
<span class="step-detail">query='&hellip;'<br>source='bills'<br>filters={congress:119}</span>
</div>
<div class="pipeline-step">
<span class="step-label">REST endpoint</span>
<span class="step-title">GET /v1/search</span>
<span class="step-detail">?q=reducing+tailpipe&hellip;<br>&amp;source=bills<br>&amp;congress=119<br>&amp;mode=semantic</span>
</div>
</div>
<ol class="steplist">
<li>
<strong>Embed the query</strong>
Encoder reduces query text to a 1024-dim float vector. Same model as the chunks; query-time singleton in <code>josh-core</code>.
</li>
<li>
<strong>Pre-filter the candidate set</strong>
<code>SELECT rowid FROM bills WHERE congress = 119</code> &rarr; N rowids. Filters apply <em>before</em> scoring, never after &mdash; otherwise top-K can be empty after filtering.
</li>
<li>
<strong>BQ scan for top-K candidates</strong>
Hamming-distance scan on <code>bills_chunks_vec0</code> against the bit-quantized query, restricted to the filtered rowid set. <code>LIMIT 100</code>.
</li>
<li>
<strong>Float-precision rescore</strong>
Top-100 candidates reranked by full-precision cosine on the <code>float[1024]</code> column. This is the cheap-then-precise pattern: BQ for recall, float for ranking.
</li>
<li>
<strong>Join back &amp; assemble response</strong>
Top-20 chunks join to their parent records. Citation block per row from <code>citation_for(record)</code>. Snippet excerpted around the matched section.
</li>
</ol>
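
The middle three steps (pre-filter, BQ scan, float rescore) can be sketched in pure Python. This is a toy illustration under stated assumptions, not the vec0 implementation: vectors are 4-dimensional instead of 1024, binarization uses a simple sign test, and `semantic_search`, `binarize`, and the sample data are hypothetical.

```python
import math

def binarize(vec):
    """Quantize to 1 bit per dimension: bit i is set when vec[i] > 0."""
    bits = 0
    for i, x in enumerate(vec):
        if x > 0:
            bits |= 1 << i
    return bits

def hamming(a, b):
    """Number of differing bits between two bit-packed vectors."""
    return bin(a ^ b).count("1")

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_search(query_vec, float_vecs, bit_vecs, allowed_ids, top_k=100, limit=20):
    # Pre-filter: only chunks that passed the SQL filters are ever scored.
    candidates = [cid for cid in float_vecs if cid in allowed_ids]
    # Stage 1: cheap Hamming scan over the bit-quantized vectors.
    q_bits = binarize(query_vec)
    coarse = sorted(candidates, key=lambda cid: hamming(q_bits, bit_vecs[cid]))[:top_k]
    # Stage 2: full-precision cosine rescore of the survivors.
    rescored = sorted(coarse, key=lambda cid: cosine(query_vec, float_vecs[cid]),
                      reverse=True)
    return rescored[:limit]

float_vecs = {
    "a": [0.9, 0.1, -0.2, 0.3],
    "b": [0.8, 0.2, -0.1, 0.4],
    "c": [-0.7, -0.3, 0.5, -0.1],
    "d": [1.0, 0.0, -0.1, 0.2],   # best match, but excluded by the filter
}
bit_vecs = {cid: binarize(vec) for cid, vec in float_vecs.items()}  # stored at index time

query = [1.0, 0.0, -0.1, 0.2]
result = semantic_search(query, float_vecs, bit_vecs,
                         allowed_ids={"a", "b", "c"}, top_k=2, limit=2)
print(result)
```

Chunk "d" is identical to the query yet never appears, because it was removed before scoring. Had the filter run after the top-K cut, "d" would have taken a candidate slot and then been discarded, which is the failure mode the pre-filter rule prevents.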

Strengths & weaknesses

Strengths

  • Paraphrase tolerance. "Tailpipe emissions" finds bills using "vehicle exhaust" or "Title II of the Clean Air Act."
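  • Conceptual similarity that lexical search can't see. A query by meaning finds related bills even when they share no keywords; "bills similar to HR103" becomes possible by searching on HR103's own chunk vectors (not yet a dedicated tool).
  • Binary quantization (BQ) is 32× smaller than float32 storage (128 bytes instead of 4,096 per 1024-dim vector) and ~20× faster to scan; the v1 corpus (≤ 5M chunks) fits comfortably.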
  • Encoder swap is a side-by-side rebuild + config flip. Every embedded row carries model_id and model_version — see embedding architecture.

Weaknesses

  • Misses exact tokens. "HR103" returns vaguely related bills, not HR103 itself. Identifier lookups must never enter this path — they route to Flow 1.
  • Requires embeddings — a build step. New deployments need hours-to-days for a backfill at v1 scale. The lexical path (Flow 2) is the bridge.
  • Pre-filter must be designed in. Post-filtering the top-K is the wrong shape — a date-restricted query could return zero results from a top-100 that's all out of the date range. Search spec enforces pre-filter.
  • Scores are not human-interpretable. "0.81 cosine" is meaningful in aggregate, not for a single result.

Flow 4 — Hybrid search (BM25 + vector, RRF)

Default mode of /v1/search. Runs the FTS5 path and the vector path in parallel, then fuses the two ranked lists with Reciprocal Rank Fusion. RRF doesn't need the two scoring systems to be comparable — it works on ranks only.

Flow 4 · Hybrid — BM25 + vector fused via RRF

Figure: Reciprocal Rank Fusion. The query splits into BM25 and vector retrieval; the two ranked lists merge through RRF into a final top-20.

  1. Query: GET /v1/search ?q=… & mode=hybrid
  2. FTS5 · BM25 returns a top-K ranked list [ A, B, C, D, … ]; Vector · BQ + rescore returns a top-K ranked list [ B, E, A, F, … ].
  3. Reciprocal Rank Fusion: score(d) = Σ_i 1 / (k + rank_i(d)), with k = 60.
  4. Final list: [ B, A, E, C, D, F, … ] · top-20 + citations

RRF intuition: a doc that appears in both lists gets contributions from both; a doc in one list still ranks if it's high enough there. k=60 is the community default — large enough that low-ranked items don't dominate, small enough that rank-1 in either list is a strong signal.
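
A short sketch of the fusion step, using the two example lists from the figure. The function name `rrf_fuse` and the `limit` parameter are illustrative; the scoring rule is the standard RRF sum of 1 / (k + rank) over the lists in which a document appears, with ranks starting at 1.

```python
def rrf_fuse(ranked_lists, k=60, limit=20):
    """Fuse ranked lists of doc IDs with Reciprocal Rank Fusion."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # sorted() is stable, so exact ties keep first-seen order.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)[:limit]

bm25_leg = ["A", "B", "C", "D"]
vector_leg = ["B", "E", "A", "F"]

fused = rrf_fuse([bm25_leg, vector_leg])
print(fused)  # ['B', 'A', 'E', 'C', 'D', 'F']
```

B edges out A because it sits at ranks 2 and 1 (1/62 + 1/61) while A sits at ranks 1 and 3 (1/61 + 1/63). Only ranks enter the formula, so the raw BM25 and cosine scores never need to be put on a common scale.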

Strengths & weaknesses

Strengths

  • Robust to ambiguity. Works whether the query has exact tokens or is a paraphrase. Strong "I don't know what shape the question is" default.
  • RRF has no tunable score scaling — purely rank-based, can't be gamed by one side returning huge scores.
  • Result diversity. Hybrid lists are less of a monoculture than either path alone: the union of the BM25 and vector candidates is larger than either leg's individual top-N.
  • Drop-in safe. RRF is the lowest-risk way to add semantic to a lexical-only system, or vice versa — neither side dominates.

Weaknesses

  • Two scans instead of one — roughly 1.5–2× the latency of either alone. Still inside the 500 ms p95 budget at v1 scale, but worth knowing.
  • k is the one knob. Needs an offline eval to pick well for the corpus. Default 60 is safe but not optimal.
  • Harder to explain a single result's rank. The agent can still see both contributing scores in the response, but it's noisier than a single BM25 number.
  • Doesn't help when neither side has the right doc. Hybrid is not a magic upgrade; it only fuses what the two legs already retrieved.

How the agent picks the right flow

Tool names + descriptions, not a learned classifier. The MCP server exposes three classes of tools in mcp-server:

| Class | Tools | Routes to |
|-------|-------|-----------|
| A. Typed structured lookups | get_bill, get_legislator, get_committee, list_committee_members, list_cosponsors, get_bill_text, list_member_votes | Flow 1 (SQL on relational tables) |
| B. Text search | search (hybrid, default), lexical_search, semantic_search | Flow 2, 3, or 4 depending on tool |
| C. Universal fetch | fetch(id), which dispatches by ID prefix | Flow 1 (required by ChatGPT) |

The agent sees the tool descriptions and picks. Three loose rules drilled into the descriptions:

  1. If the question contains an identifier, prefer Class A or C. Never Class B.
  2. If the question is a keyword lookup ("the words must appear"), prefer lexical_search.
  3. If the question is a concept or paraphrase, prefer semantic_search — or just search if unsure.

Why two redundant paths for IDs (Class A typed tools and Class C fetch)? fetch is mandatory for ChatGPT App compatibility (OpenAI's retrieval pipeline expects the name). The typed tools are sharper for Claude — the LLM picks get_bill('hr:119:103') over fetch('hr:119:103') when both are available because the typed name and description are more specific. Both call the same REST handler.
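
Prefix dispatch for `fetch(id)` could look roughly like the sketch below. Only the `hr:119:103` form appears on this page; the prefix-to-resource map, the other prefixes, and the function name `fetch_route` are hypothetical placeholders rather than the real ID grammar.

```python
# Hypothetical prefix map: only "hr" is grounded in an ID shown on this page.
RESOURCE_BY_PREFIX = {
    "hr": "bills",
    "legislator": "legislators",   # assumed prefix
    "committee": "committees",     # assumed prefix
}

def fetch_route(record_id):
    """Map a type-prefixed ID to the REST path its typed tool would call."""
    prefix, sep, _rest = record_id.partition(":")
    if not sep or prefix not in RESOURCE_BY_PREFIX:
        # No recognizable prefix: refuse rather than guess a resource type.
        raise ValueError(f"unrecognized ID prefix in {record_id!r}")
    return f"/v1/{RESOURCE_BY_PREFIX[prefix]}/{record_id}"

print(fetch_route("hr:119:103"))
```

Because the path is the same one the typed tool uses, `fetch` and `get_bill` end up at the same handler, as described above.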

Worked examples

Three queries traced end-to-end.

"Does HR103 have language about child protection?"

This is the question that motivated the doc. Two paths to the same answer; the smart one uses both flows.

Wrong
Route through semantic search: semantic_search({query: 'HR103 child protection'}). Why it fails: the vector encoder doesn't know what HR103 is — it returns vaguely-related bills, not HR103 itself, and the agent hallucinates.
Right
Two-step — Flow 1 then a scan inside the returned text.
  1. get_bill_text(bill_id='hr:119:103') → returns body_text + citation, deterministic.
  2. Agent scans the returned text for "child protection" (or for any of: minor, juvenile, child welfare, <18 years) → finds the phrase or doesn't; cites the section if found.
Alt
For very long bills, use Flow 1 then Flow 2 scoped to the one bill:
  1. get_bill(bill_id='hr:119:103') → confirm bill exists, get title.
  2. lexical_search({query: 'child protection', source: 'bills', filters: {bill_id: 'hr:119:103'}}) → BM25 over the chunks of that one bill, returns matching sections with offsets for citing.

"Who is on the Ways and Means Committee?"

Pure structured. Vector retrieval would be malpractice.

Right
list_committee_members(committee_id='HSWM', congress=119)
  1. JOIN committee_memberships × legislators
  2. Ordered list of legislators with role (chair / ranking member / member)
  3. Each row carries its own citation

Note: HSWM is the Thomas committee ID for House Ways and Means. The agent can look up the committee ID first via get_committee by name if it doesn't know it, but committee IDs are stable and worth pinning in agent prompts.

"EPA regulations on PFAS in 2026"

This is the canonical hybrid query. The agent doesn't have an ID; "PFAS" is an exact token (chemistry abbreviation, won't be paraphrased) but "regulations on" is conceptual.

Right
search({query: 'EPA PFAS', source: 'federal-register', since: '2026-01-01'})  —  mode=hybrid (default).
  1. FTS5 leg matches "PFAS" exactly in any document title or body.
  2. Vector leg matches semantically-similar docs ("perfluoroalkyl", "forever chemicals", "drinking water contaminant rules").
  3. RRF fusion — top-20 with both signals.
  4. Each result carries snippet + citation; the agent picks the most-cited / most-recent for the answer.

API surface summary

All four flows sit behind two REST endpoint families in josh-core (the resource routes and /v1/search), and the MCP server wraps both:

| Need | REST | MCP tool | Flow |
|------|------|----------|------|
| Fetch a record by ID | GET /v1/<resource>/{id} | get_bill, get_legislator, …, fetch | 1 |
| List members of a committee | GET /v1/committees/{id}/members | list_committee_members | 1 |
| List cosponsors of a bill | GET /v1/bills/{id}/cosponsors | list_cosponsors | 1 |
| List votes by a legislator | GET /v1/legislators/{id}/votes | list_member_votes | 1 |
| Keyword scan | GET /v1/search?mode=lexical | lexical_search | 2 |
| Conceptual scan | GET /v1/search?mode=semantic | semantic_search | 3 |
| Mixed-shape question | GET /v1/search (default hybrid) | search | 4 |
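
For callers going straight to REST, the three search modes differ only in one query parameter. The sketch below just builds the URL and makes no network call. The host is a placeholder, and treating `since` and `until` as REST query-parameter names is an assumption carried over from the filter names used on this page.

```python
from urllib.parse import urlencode

BASE_URL = "https://josh.example"   # placeholder host, not a real deployment

def search_url(q, mode="hybrid", source=None, since=None, until=None):
    """Build a /v1/search URL; mode selects Flow 2, 3, or 4."""
    if mode not in {"lexical", "semantic", "hybrid"}:
        raise ValueError(f"unknown mode: {mode!r}")
    params = {"q": q, "mode": mode}
    for key, value in (("source", source), ("since", since), ("until", until)):
        if value is not None:
            params[key] = value
    return f"{BASE_URL}/v1/search?{urlencode(params)}"

print(search_url("EPA PFAS", source="federal-register", since="2026-01-01"))
print(search_url("child tax credit", mode="lexical", source="bills"))
```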

What this design depends on

A few load-bearing invariants. If any of them breaks, the routing logic above stops being safe.

  1. Type-prefixed IDs. Every public ID issued by the substrate has a unique type prefix (per rest-api-conventions §5). Without this, fetch(id) can't dispatch unambiguously and the agent has to guess. The ID grammar is enforced by ingester contract tests.
  2. Pre-filter, not post-filter. When a semantic query carries source / since / until, those filters apply before vector scoring. Otherwise a date-restricted query can come back empty even when matching documents exist. Enforced by an acceptance criterion in rest-api-search.
  3. Round-trip ID property. Every id returned by search is accepted by the matching resource endpoint. The agent can chain "search → fetch the top hit" without translating IDs. Contract-tested in both directions.
  4. Citation is a column read, not a re-derive. Every record carries source_url, published_at, citation_string, etc. The search response and the resource response both quote the same fields. No path produces a record without a citation.
  5. Memory-safe queries. Every flow bounds its result set (LIMIT, pagination, top-K). No unbounded scan over the substrate. The 32 GB host is the budget; runaway queries are not allowed.

Open questions

  • RRF k parameter. 60 is the community default. The corpus is small enough (≤ 5M chunks at v1) that k=10–20 might be better. Pin via an offline eval with a fixed Q&A set before flipping the search spec to verified.
  • MCP tool descriptions. Tool selection is description-driven. We need a small routing eval (~30 Q&A pairs covering all four flows) to confirm the LLM picks the right tool. Lives at josh-core/tests/fixtures/mcp_routing_eval.jsonl; tracked in mcp-server task t8.
  • Per-source BM25 weights. Defaults are title 10, abstract 4, action 2, body 1. Likely fine for v1, but corpus-specific tuning beats the defaults. Eval after the first two sources ship.
  • Snippet length & highlight policy. 200 chars is the search-spec default. May need per-source variation (a bill snippet wants more context than a FR digest snippet).
  • "Find me bills similar to HR103" flow. Not yet a tool. It would be a Flow 1 → Flow 3 chain: fetch HR103's chunks → average or pick a centroid → semantic-search by that vector with source=bills filter. Worth a dedicated tool once a real consumer asks.
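
As a starting point for the "similar to HR103" idea, the averaging step might look like this. It is a sketch of one option only (mean of the chunk vectors, then L2-normalized so cosine comparisons behave); the function name and toy 2-dim vectors are illustrative, and whether a mean or a single representative chunk works better is exactly what an eval would need to settle.

```python
import math

def centroid(vectors):
    """Mean of equal-length vectors, L2-normalized; the query vector for Flow 3."""
    if not vectors:
        raise ValueError("need at least one chunk vector")
    dims = len(vectors[0])
    mean = [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dims)]
    norm = math.sqrt(sum(x * x for x in mean))
    # A zero mean has no direction; return it unchanged rather than divide by 0.
    return [x / norm for x in mean] if norm else mean

chunk_vectors = [[1.0, 0.0], [0.0, 1.0]]   # toy stand-ins for one bill's chunks
query_vec = centroid(chunk_vectors)
print(query_vec)
```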