
Chunking — design and experiment framework

How Josh splits source documents into the chunks that get embedded and stored as vectors, and how we decide which chunking strategy wins for which source.

This document covers two things that pair naturally:

  1. The production Chunker abstraction — the protocol every chunker implements, where chunkers live in the codebase, how the ingester picks one, how we evolve it without breaking existing data.
  2. The experiment framework — how we run head-to-head comparisons of chunking strategies on the same corpus, with the same eval set, against the same embedder, to make decisions backed by numbers rather than priors.

When this document disagrees with prose elsewhere, this document wins for chunking concerns.

Storage backend: SQLite + FTS5 + sqlite-vec. Embedding model: Snowflake Arctic-Embed-L-v2 (1024-dim, Apache 2.0 license). See CLAUDE.md "Storage backend decision" for context.

Where this document has been superseded. The evaluation architecture doc is now the source of truth for how chunker comparison runs. Four specific things below have been replaced:

  • Eval harness location. What this doc calls shared/josh_substrate/src/josh_substrate/eval/chunker_compare.py now lives at josh-eval/retrieval/runner.py. The chunker protocol and chunker implementations stay in the substrate package; only the eval runner moves out.
  • CLI shape. Bare josh-eval chunker --source=… is replaced by josh-eval retrieval run --eval-set=… --substrate-version=…, a subcommand under the unified eval CLI.
  • Eval-set artifact location. Eval sets move from evals/<source>/ in the substrate repo to retrieval/eval-sets/<source>/ inside josh-eval.
  • Snapshot pinning (new requirement). Each eval set's manifest.json declares the substrate version it was authored against. Re-runs against a different version are flagged in reports rather than silently producing drifty numbers.

Everything else in this doc — the Chunker protocol, the strategy comparison, query authoring, labeling rubric, decision rules — is still authoritative. Read this doc for chunker design; read evaluation architecture for how the runs are organized.


Retrieval quality is the dominant factor in how good Josh feels to an agent. Bad chunking doesn't just degrade vector retrieval; it caps the quality of everything downstream. If a chunk crosses a section boundary in an FR rule, the embedding represents two unrelated topics and its cosine similarity to any query is noisy. If a chunk is too small, it loses context the agent needs. If it's too large, it dilutes the signal.

The ingester previously shipped on fixed-size 4000-char / 800-char-overlap chunking as the default — a reasonable starting point but almost certainly not optimal for federal documents, which are heavily structured (FR XML has explicit <HD> headers and <SECTION> elements; bills have <section>; USC uses USLM with a deep hierarchy; CRS has HTML headings; hearings have speaker turns).

We need to know which chunking strategy actually wins, per source, before we commit to a default for Step 1's launch corpus.

Our predictions per source, stated explicitly so we can later check whether reality matched.

| Source | Predicted best | Confidence | Reasoning |
| --- | --- | --- | --- |
| Federal Register (rules, NPRMs) | Section-aware (HD-driven) | High | FR XML carries explicit <HD> headers with named sections (BACKGROUND, REGULATORY ANALYSIS, etc.) that map cleanly to logical units. Mixing sections in one chunk dilutes embeddings. |
| Bills | Section-aware (<section>-driven) with parent-child fallback | Medium-high | Bill sections are the citation atom. But many bills have boilerplate (short titles, definitions) that swamps signal in mixed chunks. |
| US Code | USLM hierarchical (section level) | High | USC is hierarchical by design; the section is the citation atom. |
| Public Laws | USLM hierarchical (section level) | High | Same shape as USC. |
| CRS reports | Recursive splitter with markdown header awareness | Medium | Less explicit XML structure; HTML rendering varies. Likely wants paragraph-level splitting with some heading awareness. |
| GAO reports | Recursive splitter | Medium | PDF-derived plaintext; structure varies. |
| Hearing transcripts | Speaker-turn-level | High | Speaker turns are the citation atom. Each turn is a coherent contribution. |
| Congressional Record granules | Speaker-turn-level | High | Same as hearing transcripts. |

Aggregate prior: section-aware (or speaker-turn-aware) chunking wins meaningfully for the structured sources, by a gut-feel estimate of 5–15% NDCG@10 over fixed-size char-based chunking. For unstructured sources (CRS, GAO), the expected win narrows to 0–5% over recursive splitting.

These predictions are checkable. Track them; if reality is consistently different, the priors were wrong and we should update our intuitions about the corpus.


The Step 1 candidate set. Six strategies cover the realistic design space:

| # | Strategy | Tunables | Implementation cost | Notes |
| --- | --- | --- | --- | --- |
| 1 | Fixed-size char | chunk_size, overlap | Trivial (current default) | Baseline. 4000/800 chars ≈ 1000/200 tokens. Source-agnostic. |
| 2 | Fixed-size token | chunk_size, overlap, tokenizer | Easy | Same as #1 but counted in tokens (Arctic uses BPE), more reliable across sources. |
| 3 | Recursive splitter | separator hierarchy, target_size | Easy | LangChain-style cascade: split on \n\n, then \n, then ., then space, until each chunk fits target size. Source-agnostic. |
| 4 | Section-aware | per-source structure rules | Medium (per source) | Use source-specific structure: FR <HD>, bills <section>, USC USLM, hearing speaker turns, CRS markdown headings. Falls back to recursive splitter inside oversize sections. |
| 5 | Hierarchical (parent-child) | parent_size, child_size, overlap | Medium | Split into small children for retrieval, larger parents for context. Two indexes: child for vec search, parent for context window. Adds complexity. |
| 6 | Late chunking | model max-length | High | Embed the full document with a long-context model, then mean-pool token embeddings within chunk boundaries. Requires the embedder to support full-document input. Changes both chunker AND embedder; can't be evaluated cleanly against just-a-chunker variants. |
Considered and set aside for Step 1:

  • Semantic boundary detection (using the embedder itself to find semantic breaks). Heavy, exotic, hard to debug.
  • Window-based chunking with attention to query terms. Query-aware chunking is a research direction; not Step 1.
  • Fine-grained token-level chunking with sliding windows. Diminishing returns.

We'll compare 1, 2, 3, 4 head-to-head as the first pass. If section-aware wins as expected, hierarchical (5) becomes the v1.x optimization. Late chunking (6) is a separate research thread because it changes the embedder.
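To make strategy 3 concrete, here is a minimal sketch of a recursive splitter. It measures size in characters for simplicity, and the function name and default separator tuple are illustrative assumptions rather than the shipped recursive_v1 behavior.

```python
def recursive_split(
    text: str,
    target_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split text so that every piece is at most target_size characters.

    Tries the coarsest separator first and only descends to finer ones for
    parts that are still too large. Separators at piece boundaries are dropped.
    """
    if len(text) <= target_size:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        pieces: list[str] = []
        buffer = ""
        for part in text.split(sep):
            candidate = f"{buffer}{sep}{part}" if buffer else part
            if len(candidate) <= target_size:
                buffer = candidate  # keep packing parts into the current piece
                continue
            if buffer:
                pieces.append(buffer)
                buffer = ""
            if len(part) <= target_size:
                buffer = part
            else:
                # This part alone is oversize: retry with the finer separators.
                pieces.extend(recursive_split(part, target_size, separators[i + 1:]))
        if buffer:
            pieces.append(buffer)
        return pieces
    # No separator occurs in the text: fall back to a hard cut.
    return [text[j:j + target_size] for j in range(0, len(text), target_size)]
```

A token-based variant would swap `len` for a tokenizer count; the cascade logic stays the same.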


Production abstraction. Lives at shared/josh_substrate/src/josh_substrate/chunking.py (or in the ingester package — TBD; lean toward shared so josh-core can chunk on the fly for ad-hoc use cases).

from typing import Protocol, AsyncIterator, runtime_checkable
from pydantic import BaseModel


class Chunk(BaseModel):
    """One chunk produced by a Chunker. Source-agnostic."""
    chunk_index: int          # 0-based, monotonic within a parent doc
    text: str                 # the chunk content
    char_start: int | None    # offset into the parsed source text
    char_end: int | None
    metadata: dict            # chunker-specific (section_path, speaker, heading_level, etc.)


class ParsedDoc(BaseModel):
    """Input to a Chunker. Source-agnostic."""
    doc_id: str               # canonical Josh ID (e.g. 'fr:2026-08558')
    source: str               # 'federal_register', 'bills', etc.
    text: str                 # cleaned plaintext body
    structure: dict | None    # source-specific structure hints (XML tree, markdown headings, etc.)


@runtime_checkable
class Chunker(Protocol):
    """A function from a parsed document to an ordered sequence of chunks."""
    name: str                 # 'fixed_char_4k', 'recursive_v1', 'section_aware_fr_v1', etc.
    version: int              # bump when behavior changes
    config: dict              # the parameters this instance was constructed with

    def chunk(self, doc: ParsedDoc) -> list[Chunk]:
        """Pure function — same input, same output. No I/O."""
        ...

Chunker IDs are stored alongside chunks so we know which strategy produced which row. Format: {strategy}_{variant}_v{version}.

Examples:

  • fixed_char_4k800_v1 — fixed 4000 chars / 800 overlap, version 1
  • fixed_token_512_50_v1 — fixed 512 tokens / 50 overlap
  • recursive_v1 — default recursive splitter
  • section_aware_fr_v1 — FR-specific section-aware
  • section_aware_bills_v1 — bills-specific
  • usc_uslm_section_v1 — USC USLM section-level
  • speaker_turn_v1 — for hearings + congressional record

Chunks live in <source>_chunks tables. Each row carries chunker_id (TEXT, the {strategy}_{variant}_v{version} string) so multiple chunkings can coexist temporarily for evaluation.

-- Example: fr_document_chunks
CREATE TABLE fr_document_chunks (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    fr_document_id  TEXT NOT NULL REFERENCES fr_documents(id) ON DELETE CASCADE,
    chunker_id      TEXT NOT NULL,  -- 'section_aware_fr_v1', etc.
    chunk_index     INTEGER NOT NULL,
    chunk_text      TEXT NOT NULL,
    char_start      INTEGER,
    char_end        INTEGER,
    metadata        TEXT,           -- JSON, chunker-specific
    UNIQUE (fr_document_id, chunker_id, chunk_index)
);

CREATE INDEX fr_document_chunks_doc ON fr_document_chunks (fr_document_id, chunker_id);

vec0 companion tables key vectors back to fr_document_chunks.id, so swapping the active chunker is a join change, not a schema change.

Production picks one canonical chunker per source


In production, one chunker_id is "live" per source — that's the one queries hit by default. The harness picks the live chunker via a configuration table (source_chunker_config) or via env, TBD. Eval runs spawn alternate chunker IDs alongside the live one; once an experiment concludes, the alternates are pruned.

This keeps the steady-state simple (one chunker per source, one set of vectors) but lets us evaluate alternates without forking the database.
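The following sketch shows how a live chunker could be resolved through a config table and how alternates could be pruned afterwards. The column layout of source_chunker_config is an assumption (the selection mechanism is still TBD), the chunks table is trimmed to the columns the example needs, and the vec0 companion table is omitted because it requires the sqlite-vec extension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE fr_document_chunks (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        fr_document_id TEXT NOT NULL,
        chunker_id TEXT NOT NULL,
        chunk_index INTEGER NOT NULL,
        chunk_text TEXT NOT NULL,
        UNIQUE (fr_document_id, chunker_id, chunk_index)
    );
    -- Hypothetical shape: one live chunker per source.
    CREATE TABLE source_chunker_config (
        source TEXT PRIMARY KEY,
        live_chunker_id TEXT NOT NULL
    );
    """
)
conn.executemany(
    "INSERT INTO fr_document_chunks (fr_document_id, chunker_id, chunk_index, chunk_text)"
    " VALUES (?, ?, ?, ?)",
    [
        ("fr:2026-08558", "fixed_char_4k800_v1", 0, "whole body in one window"),
        ("fr:2026-08558", "section_aware_fr_v1", 0, "Background text"),
        ("fr:2026-08558", "section_aware_fr_v1", 1, "Statutory Authority text"),
    ],
)
conn.execute(
    "INSERT INTO source_chunker_config VALUES ('federal_register', 'section_aware_fr_v1')"
)


def live_chunks(db: sqlite3.Connection, source: str, doc_id: str) -> list[tuple]:
    """Fetch only the chunks produced by the source's live chunker."""
    return db.execute(
        """
        SELECT c.chunk_index, c.chunk_text
        FROM fr_document_chunks AS c
        JOIN source_chunker_config AS cfg ON cfg.live_chunker_id = c.chunker_id
        WHERE cfg.source = ? AND c.fr_document_id = ?
        ORDER BY c.chunk_index
        """,
        (source, doc_id),
    ).fetchall()


def prune_alternates(db: sqlite3.Connection, source: str) -> int:
    """Delete rows from non-live chunkers; returns the number of rows removed."""
    cur = db.execute(
        "DELETE FROM fr_document_chunks WHERE chunker_id != "
        "(SELECT live_chunker_id FROM source_chunker_config WHERE source = ?)",
        (source,),
    )
    return cur.rowcount
```

Swapping the live chunker is then a single UPDATE on the config row; no chunk rows move.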


How we run a comparison.

The chunker eval depends on having the raw artifacts persistent on disk so we can re-chunk without re-fetching. This is already the convention: /data/corpus/federal_register/metadata/{YYYY}/{MM}/{number}.json and /data/corpus/federal_register/bodies/xml/{YYYY}/{MM}/{number}.xml.

For each source, the eval corpus is a defined slice of these raw files:

| Source | Eval corpus slice | Doc count | Notes |
| --- | --- | --- | --- |
| Federal Register | 2025-01-01 → 2025-12-31 (full year) | ~70K docs | Spans Rules, NPRMs, Notices, Presidential Documents; exercises every type. |
| Bills | 119th Congress, all bill types | ~10K bills | Current congress; mix of HRES, S, HR, etc. |
| US Code | Title 42 (Public Health) + Title 26 (IRC) | ~10K sections | Two contrasting titles. |

The eval reads these raw files, runs the parser → the chunker variant being tested → the embedder, indexes into a temp SQLite file, then runs the eval set against it. Production data is untouched.

We need labeled query-document relevance judgments. Build the eval set in three steps:

1. Query authoring. ~50 queries per source, drawn from agent-realistic patterns. Categories:

  • Single-doc retrieval ("the rule on medical device cybersecurity from 2025")
  • Topical (lookup) ("rules affecting HSA expansion")
  • Citation following ("rules amending 42 USC 1396a")
  • Comparative ("how does this NPRM differ from the prior 2018 proposed rule on the same topic")
  • Authority queries ("EPA actions on PFAS in 2025")
  • Section-level ("FDA premarket pathway for software in 21 CFR 820")

2. Candidate gathering. For each query, run a baseline retrieval (e.g., FTS5 BM25 with default weights, plus a brute-force semantic search) and gather top-30 candidates. Pool candidates from multiple chunkers so we don't bias the labels toward what one chunker happens to produce.

3. Labeling. Three approaches, in order of preference:

  • (a) Manual labeling — a human reads each candidate and assigns a relevance score (0 = not relevant, 1 = partially relevant, 2 = relevant, 3 = highly relevant / canonical answer). Slow but ground-truth.
  • (b) LLM-as-judge — Claude (or similar) reads the query + candidate and scores. Useful as a force multiplier; manual spot-check ~20% to validate correlation.
  • (c) Bootstrap from prior eval — the deleted Phase C eval had ~55 queries with ~10 manually labeled. The query templates are documented in the conversation log of 2026-05-04 / 2026-05-05; rebuild from there.

Realistic plan: (c) → (b) → (a) for a high-confidence subset. Aim for 50 queries × 30 candidates labeled = 1,500 labels per source. With LLM-as-judge as the main labor and human spot-check, this is days, not weeks.
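The spot-check step can be sketched as a reproducible sample plus a simple agreement measure between human and LLM grades on the 0–3 scale. The function names, the fixed seed, and the choice of exact and within-one agreement are assumptions for illustration; a calibration study might prefer a chance-corrected statistic.

```python
import random


def sample_for_spot_check(labels: list[dict], fraction: float = 0.2, seed: int = 0) -> list[dict]:
    """Pick a reproducible random subset of LLM-judged labels for human review."""
    rng = random.Random(seed)  # fixed seed so the same subset is drawn on re-runs
    k = max(1, round(len(labels) * fraction))
    return rng.sample(labels, k)


def agreement(pairs: list[tuple[int, int]]) -> dict[str, float]:
    """Compare (human_grade, llm_grade) pairs on the 0-3 relevance scale."""
    if not pairs:
        raise ValueError("need at least one labeled pair")
    exact = sum(h == m for h, m in pairs) / len(pairs)
    within_one = sum(abs(h - m) <= 1 for h, m in pairs) / len(pairs)
    return {"exact": exact, "within_one": within_one}
```

At 1,500 labels per source, a 20% sample is 300 human judgments.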

Eval set artifacts live in:

evals/
├── fr/
│   ├── queries.yaml      # 50 queries with categories
│   ├── candidates.jsonl  # one row per (query, doc) candidate pool
│   └── labels.jsonl      # one row per (query, doc) label with judge metadata
├── bills/
│   └── ...
└── usc/
    └── ...

(The previous evals/ directory was deleted in the SQLite-pivot rollback. The methodology described here recreates and improves on it.)

The metrics are the same as in the deleted Phase C eval, which were known good for retrieval evaluation:

| Metric | What it measures | Why we use it |
| --- | --- | --- |
| NDCG@10 | Normalized discounted cumulative gain at 10; rewards highly relevant docs early in the ranking | Primary metric. Sensitive to ranking quality, not just inclusion. |
| Recall@10 | Fraction of all relevant docs retrieved in top 10 | Sanity check. Don't ship a chunker that loses recall even if NDCG looks fine. |
| MRR | Mean reciprocal rank of the first relevant result | Tells us "did the first result the agent sees actually help." |
| Mean chunk count per doc | How fragmented the chunker is | Affects storage + scan time. Reported alongside quality metrics for cost awareness. |
| p50 / p95 chunk size (tokens) | Distribution of chunk sizes | Catches degenerate behavior, such as section-aware producing 50K-token chunks for one doc and 20-token chunks for another. |
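For reference, the three quality metrics can be computed per query as below. This sketch assumes chunk hits have already been collapsed to a ranked list of unique doc IDs, uses the linear-gain form of DCG (an exponential-gain form also exists), and treats any grade of 1 or higher as relevant for Recall and MRR; those choices are assumptions, not settled harness behavior.

```python
import math


def dcg(gains: list[int]) -> float:
    """Discounted cumulative gain with linear gains and a log2 rank discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg_at_k(ranked_ids: list[str], labels: dict[str, int], k: int = 10) -> float:
    """DCG of the actual ranking divided by the DCG of the ideal ranking."""
    gains = [labels.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(labels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranked_ids: list[str], labels: dict[str, int], k: int = 10, min_grade: int = 1) -> float:
    """Fraction of all relevant docs that appear in the top k."""
    relevant = {doc_id for doc_id, grade in labels.items() if grade >= min_grade}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


def reciprocal_rank(ranked_ids: list[str], labels: dict[str, int], min_grade: int = 1) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if labels.get(doc_id, 0) >= min_grade:
            return 1.0 / rank
    return 0.0
```

MRR is the mean of `reciprocal_rank` over all queries in the eval set; NDCG@10 and Recall@10 are averaged the same way.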

When does an experiment win? The honest threshold:

  • NDCG@10 improvement of ≥0.02 absolute (over the live chunker) on the eval set, with no Recall@10 regression worse than 0.02 absolute. That's roughly the noise floor of the labeled set; smaller gains aren't reliably real.
  • Mean chunk count within 2× of the live chunker. A win that triples storage + scan cost has to be evaluated against that cost.
  • p95 chunk size ≤ 2048 tokens (fits Arctic comfortably).
  • The chunker is reproducible (no LLM-in-the-loop nondeterminism, no version-drifting libraries).

If the candidate beats the live chunker on NDCG@10 by ≥0.02 with the other constraints satisfied, swap it in production for that source. Otherwise, keep the live chunker.
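The measurable parts of this decision rule fit in one predicate. The `RunMetrics` container and `candidate_wins` name are hypothetical; the reproducibility criterion is a review question and is not checked here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunMetrics:
    """Aggregate numbers for one chunker on one eval set (hypothetical container)."""
    ndcg_at_10: float
    recall_at_10: float
    mean_chunks_per_doc: float
    p95_chunk_tokens: int


def candidate_wins(
    live: RunMetrics,
    candidate: RunMetrics,
    min_ndcg_gain: float = 0.02,
    max_recall_drop: float = 0.02,
    max_chunk_ratio: float = 2.0,
    max_p95_tokens: int = 2048,
) -> bool:
    """True when the candidate clears every numeric threshold against the live chunker."""
    return (
        candidate.ndcg_at_10 - live.ndcg_at_10 >= min_ndcg_gain
        and live.recall_at_10 - candidate.recall_at_10 <= max_recall_drop
        and candidate.mean_chunks_per_doc <= max_chunk_ratio * live.mean_chunks_per_doc
        and candidate.p95_chunk_tokens <= max_p95_tokens
    )
```

Gains that sit exactly on the 0.02 boundary are sensitive to floating-point rounding, so a real harness would likely compare rounded values or report the margin alongside the verdict.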

A runnable Python module at shared/josh_substrate/src/josh_substrate/eval/chunker_compare.py (rough sketch):

from __future__ import annotations  # keeps the sketch importable without Chunker/Embedder in scope

from pathlib import Path


async def run_chunker_eval(
    source: str,              # 'federal_register'
    corpus_path: Path,        # /data/corpus/federal_register/
    eval_set_path: Path,      # evals/fr/
    chunkers: list[Chunker],  # variants to compare
    embedder: Embedder,       # the live embedder (Arctic-L-v2)
    output_path: Path,        # where to write the comparison report
) -> None:
    """For each chunker:
    1. Build a temp SQLite file
    2. Parse + chunk the corpus into it
    3. Embed all chunks via the embedder
    4. Run the eval set, scoring each query
    5. Aggregate metrics
    Write a comparison table + per-query breakdown to output_path.
    """

Output: a markdown report with the comparison table, per-query winners/losers, and the cost numbers (chunk count, storage, eval time). Comparable to the deleted evals/federal_register/runs/...comparison.md files.

CLI: josh-eval chunker --source=federal_register --chunkers=fixed_char_4k800_v1,recursive_v1,section_aware_fr_v1.


One chunker per source is the production rule. Different sources have different structures; different structures want different chunkers. Don't try to find a one-size-fits-all winner.

Federal Register. The XML body has a clear hierarchy: <RULE> / <PRORULE> / <NOTICE> / <PRESDOCU> → <PREAMB> (preamble) + <SUPLINF> (supplementary info) → <HD> headings (with a SOURCE attribute giving the heading level: HED, HD1, HD2).

A section-aware chunker walks the XML, treats each <HD>-bounded block as a chunk, and falls back to recursive splitting inside any block that exceeds ~2048 tokens. The chunk metadata captures section_path (e.g., ["SUPLINF", "Background", "Statutory Authority"]) so the agent can present "from the Background section" in citations.

Special-case for Presidential Documents: the wrapper element (<EXECORD>, <PROCLA>, <PRMEMO>, etc.) contains the full body text without internal sections in many cases. The chunker should fall back to recursive splitting at paragraph boundaries.
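A minimal sketch of the <HD>-bounded walk, using only the standard library. It assumes headings and paragraphs are direct children of the container element and that SOURCE values map to depths as shown; real FR XML is messier, and the oversize-block fallback to recursive splitting is left out.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from the SOURCE attribute to nesting depth under the container.
HEADING_DEPTH = {"HED": 1, "HD1": 2, "HD2": 3}


def hd_blocks(container: ET.Element) -> list[dict]:
    """Group a container's children into <HD>-bounded blocks with a section_path."""
    blocks: list[dict] = []
    path = [container.tag]
    current = {"section_path": list(path), "paragraphs": []}
    for child in container:
        if child.tag == "HD":
            if current["paragraphs"]:
                blocks.append(current)
            depth = HEADING_DEPTH.get(child.get("SOURCE", ""), 1)
            title = "".join(child.itertext()).strip()
            # Truncate the path to the heading's depth, then append its title.
            path = path[:depth] + [title]
            current = {"section_path": list(path), "paragraphs": []}
        else:
            text = " ".join("".join(child.itertext()).split())  # normalize whitespace
            if text:
                current["paragraphs"].append(text)
    if current["paragraphs"]:
        blocks.append(current)
    return [
        {"section_path": b["section_path"], "text": "\n\n".join(b["paragraphs"])}
        for b in blocks
    ]
```

A container with no <HD> children, as with many Presidential Documents, comes back as a single block whose path is just the wrapper tag; that block is what the paragraph-level recursive fallback would then split.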

Bills. BILLSTATUS XML has the metadata; bill TEXT (in BILLS or PLAW) has the body. For Step 1, focus on the enacted text for Public Laws (USLM 2.0) and the latest version text for active bills. Live bill XML uses the older "billres" DTD, which is less structured.

The section-aware chunker uses USLM <section> elements when available and falls back to recursive splitting on <subsection>/<paragraph> for older DTDs.

US Code. The format is USLM 2.0, and the section is the citation atom. A chunker that walks the USLM tree and emits one chunk per <section> element (with subsections inline) is the obvious play. Cross-section chunks dilute citation precision badly here: USC queries are almost always citation-anchored ("what does 42 USC 1396a(a)(10) say?").

CRS reports. The source is an HTML rendering with no formal section markup, so the chunker relies on <h1>/<h2>/<h3> headings and paragraph boundaries. A recursive splitter with markdown-style heading awareness (treats # Heading lines as boundaries) is the realistic v1 chunker.
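Heading awareness can be sketched as a first pass that cuts the text at `# Heading` lines before any size-based splitting. This assumes the HTML has already been converted to markdown-style text; the function name and output shape are illustrative.

```python
import re

_HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_on_headings(markdown_text: str) -> list[dict]:
    """Cut text into sections at markdown heading lines.

    Text before the first heading becomes a section with heading=None.
    """
    sections: list[dict] = []
    heading: str | None = None
    level = 0
    lines: list[str] = []

    def flush() -> None:
        body = "\n".join(lines).strip()
        if heading is not None or body:
            sections.append({"heading": heading, "level": level, "text": body})

    for line in markdown_text.splitlines():
        match = _HEADING.match(line)
        if match:
            flush()  # close the section that ended at this heading
            heading, level, lines = match.group(2), len(match.group(1)), []
        else:
            lines.append(line)
    flush()
    return sections
```

Each section's text would then go through the recursive splitter if it exceeds the target size, with the heading carried in chunk metadata.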

Hearing transcripts and Congressional Record. Speaker turns are the citation atom. For CHRG and CREC, the parser already extracts speaker-name-prefixed turns. The chunker is just "one chunk per turn" with metadata carrying speaker, role (witness / member / chairman), and timing where available.

Long turns (>2048 tokens) get split with a recursive splitter as a fallback. Single chunks under ~50 tokens get merged with adjacent same-speaker turns.
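The short-turn merge can be sketched as a single pass over the parsed turns. Whitespace word count stands in for a real tokenizer here, and the dict shape (`speaker`, `text`) is an assumption about the parser output; splitting of long turns is not shown.

```python
from typing import Callable


def merge_short_turns(
    turns: list[dict],
    min_tokens: int = 50,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> list[dict]:
    """Merge a turn into the previous one when both share a speaker and either is short.

    The default count_tokens is a whitespace word count, used only as a rough
    proxy for the embedder's tokenizer.
    """
    merged: list[dict] = []
    for turn in turns:
        prev = merged[-1] if merged else None
        same_speaker = prev is not None and prev["speaker"] == turn["speaker"]
        if same_speaker and (
            count_tokens(prev["text"]) < min_tokens
            or count_tokens(turn["text"]) < min_tokens
        ):
            prev["text"] = prev["text"] + "\n" + turn["text"]
        else:
            merged.append(dict(turn))  # copy so the caller's input is not mutated
    return merged
```

Short turns with no adjacent same-speaker neighbor (a one-word reply between two other speakers, for instance) stay as their own chunk under this rule.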

GAO reports. The text is PDF-derived plaintext with no useful structure markup, so the chunker is a recursive splitter with paragraph + sentence boundaries. Sources like this are likely the ones where chunking strategy makes the smallest difference; recursive is fine.


When does the chunker eval framework actually get built?

Phase A — protocol + baseline (during Phase 0.3 ingester rebuild)

  • Implement the Chunker protocol in shared/josh_substrate/
  • Implement fixed_char_4k800_v1 (the previous default) and recursive_v1
  • Wire chunker selection into the ingester load stage
  • Schema for <source>_chunks carries chunker_id

This is the minimum to ship Phase 0 with a chunker. No eval framework yet — just the abstraction.

Phase B — section-aware FR (during Phase 0.6 first FR ingestion)

  • Implement section_aware_fr_v1 walking the FR XML <HD> structure
  • Run it as the production chunker for FR (high prior; instrumented to compare retroactively)
  • No formal eval yet — committing on prior

Phase C — eval framework (before Step 1 launch, summer 2026)

  • Build the eval harness (josh-eval chunker)
  • Author the FR query set and label it (LLM-as-judge bootstrap, manual spot-check)
  • Run head-to-head: fixed_char_4k800_v1 vs recursive_v1 vs section_aware_fr_v1
  • Publish the results as part of the OSS launch's retrieval research blog post
  • If section-aware doesn't win on NDCG@10, that's a real and surprising finding worth investigating

Phase D — repeat per source (Phase 1+ ingestion)

  • For each new source as it lands, build the source-specific chunker (or pick recursive as fallback)
  • Run the eval against the new source's corpus
  • Lock in the source's production chunker
  • Document the result in the source's per-source spec

Open questions:

  1. Where does the chunker live in the codebase? shared/josh_substrate/chunking.py (alongside models + DB helpers, available to both ingester and core) or josh-ingester/ingester/chunking.py (ingester-private, smaller dependency footprint)? Lean shared, since josh-core may want to chunk uploaded files in the agent harness later (Step 2).

  2. LLM-as-judge model and protocol. Which model? Claude Opus is the gold standard for relevance judgments; Sonnet is cheaper and may be sufficient. What's the labeling prompt format that generalizes across queries? Worth a small calibration study (50 manual labels vs LLM labels, measure agreement).

  3. Single eval set or per-source eval sets? Lean per-source. Different sources have different query patterns. But a shared "cross-source citation-following" eval might also be useful (queries that traverse multiple sources).

  4. How often do we re-run the eval? On every chunker change. On every embedder change. As a CI gate? (Probably not — the eval takes minutes-to-hours per source. Run it on-demand when something changes.)

  5. Eval set versioning. The eval set is itself a research artifact. When we add queries or rejudge labels, version it (evals/fr/v1/, evals/fr/v2/). Old runs stay reproducible against the version they used.

  6. Late chunking research thread. Worth a separate doc + experiment if it becomes interesting. Requires confirming Arctic-L-v2's max sequence length (8192 tokens?) and whether mean-pooling token embeddings within chunk boundaries gives a real quality lift versus segment-then-embed. Defer to v1.x.

  7. Public chunker eval as a research artifact. The OSS launch's retrieval research blog post should include the chunker comparison results. That means the eval set + labels + run files all need to be reproducible from the OSS code. Decide what's in the OSS repo vs cloud-private (lean: queries + labels in OSS, the labeled cloud corpus stays private).

  8. Hierarchical (parent-child) chunking. Out of Phase C scope but a likely v1.x improvement. Children for retrieval (~256 tokens), parents for context window (~2048 tokens). Adds complexity to the schema (parent/child tables) and the agent's retrieval flow (which set does it query, how does it expand).
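To give item 8 some shape, here is a sketch of how child spans could map back to parent spans over a token sequence. It uses non-overlapping windows and the sizes quoted above; the function name and output layout are hypothetical, and none of this is a committed design.

```python
def parent_child_spans(
    n_tokens: int,
    parent_size: int = 2048,
    child_size: int = 256,
) -> list[dict]:
    """Return one entry per child span, each pointing at its enclosing parent span.

    Spans are half-open (start, end) token offsets. Children never cross a
    parent boundary, so expanding a child hit to its parent is a lookup.
    """
    spans: list[dict] = []
    for parent_index, p_start in enumerate(range(0, n_tokens, parent_size)):
        p_end = min(p_start + parent_size, n_tokens)
        for c_start in range(p_start, p_end, child_size):
            spans.append(
                {
                    "parent_index": parent_index,
                    "parent": (p_start, p_end),
                    "child": (c_start, min(c_start + child_size, p_end)),
                }
            )
    return spans
```

Retrieval would search the child vectors and then return the parent text for the context window, which is where the extra schema and retrieval-flow complexity comes from.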