Chunking — design and experiment framework
How Josh splits source documents into the chunks that get embedded and stored as vectors, and how we decide which chunking strategy wins for which source.
This document covers two things that pair naturally:
- The production Chunker abstraction: the protocol every chunker implements, where chunkers live in the codebase, how the ingester picks one, and how we evolve it without breaking existing data.
- The experiment framework: how we run head-to-head comparisons of chunking strategies on the same corpus, with the same eval set, against the same embedder, to make decisions backed by numbers rather than priors.
When this document disagrees with prose elsewhere, this document wins for chunking concerns.
Storage backend: SQLite + FTS5 + sqlite-vec. Embedding model: Snowflake Arctic-Embed-L-v2 (1024-dim, Apache-2.0 license). See CLAUDE.md "Storage backend decision" for context.
Where this document has been superseded. The evaluation architecture doc is now the source of truth for how chunker comparison runs. Four specific things below have been replaced:
- Eval harness location. What this doc calls shared/josh_substrate/src/josh_substrate/eval/chunker_compare.py now lives at josh-eval/retrieval/runner.py. The chunker protocol and chunker implementations stay in the substrate package; only the eval runner moves out.
- CLI shape. Bare josh-eval chunker --source=… is replaced by josh-eval retrieval run --eval-set=… --substrate-version=…, a subcommand under the unified eval CLI.
- Eval-set artifact location. Eval sets move from evals/<source>/ in the substrate repo to retrieval/eval-sets/<source>/ inside josh-eval.
- Snapshot pinning (new requirement). Each eval set's manifest.json declares the substrate version it was authored against. Re-runs against a different version are flagged in reports rather than silently producing drifting numbers.
Everything else in this doc (the Chunker protocol, the strategy comparison, query authoring, labeling rubric, decision rules) is still authoritative. Read this doc for chunker design; read the evaluation architecture doc for how the runs are organized.
Why this matters
Retrieval quality is the dominant factor in how good Josh feels to an agent. Bad chunking doesn't just degrade vector retrieval; it lowers the quality ceiling for everything downstream. If a chunk crosses a section boundary in an FR rule, the embedding represents two unrelated topics and its cosine similarity to any query is noisy. If a chunk is too small, it loses context the agent needs. If it's too large, it dilutes the signal.
The ingester previously shipped on fixed-size 4000-char / 800-char-overlap chunking as the default — a reasonable starting point but almost certainly not optimal for federal documents, which are heavily structured (FR XML has explicit <HD> headers and <SECTION> elements; bills have <section>; USC uses USLM with a deep hierarchy; CRS has HTML headings; hearings have speaker turns).
We need to know which chunking strategy actually wins, per source, before we commit to a default for Step 1's launch corpus.
Hypothesis (priors before measurement)
Stated explicitly so we can later check whether reality matched.
| Source | Predicted best | Confidence | Reasoning |
|---|---|---|---|
| Federal Register (rules, NPRMs) | Section-aware (HD-driven) | High | FR XML carries explicit <HD> headers with named sections (BACKGROUND, REGULATORY ANALYSIS, etc.) that map cleanly to logical units. Mixing sections in one chunk dilutes embeddings. |
| Bills | Section-aware (<section>-driven) with parent-child fallback | Medium-high | Bill sections are the citation atom. But many bills have boilerplate (short titles, definitions) that swamps signal in mixed chunks. |
| US Code | USLM hierarchical (section level) | High | USC is hierarchical by design; the section is the citation atom. |
| Public Laws | USLM hierarchical (section level) | High | Same shape as USC. |
| CRS reports | Recursive splitter with markdown header awareness | Medium | Less explicit XML structure; HTML rendering varies. Likely wants paragraph-level splitting with some heading awareness. |
| GAO reports | Recursive splitter | Medium | PDF-derived plaintext; structure varies. |
| Hearing transcripts | Speaker-turn-level | High | Speaker turns are the citation atom. Each turn is a coherent contribution. |
| Congressional Record granules | Speaker-turn-level | High | Same as hearing transcripts. |
Aggregate prior: section-aware (or speaker-turn-aware) chunking wins meaningfully for the structured sources, by a rough guess of 5–15% NDCG@10 over fixed-size char-based chunking. For unstructured sources (CRS, GAO), the expected win narrows to 0–5% over recursive splitting.
These predictions are checkable. Track them; if reality is consistently different, the priors were wrong and we should update our intuitions about the corpus.
Strategies to compare
The Step 1 candidate set. Six strategies cover the realistic design space:
| # | Strategy | Tunables | Implementation cost | Notes |
|---|---|---|---|---|
| 1 | Fixed-size char | chunk_size, overlap | Trivial (current default) | Baseline. 4000/800 chars ≈ 1000/200 tokens. Source-agnostic. |
| 2 | Fixed-size token | chunk_size, overlap, tokenizer | Easy | Same as #1 but counted in tokens from the embedder's own tokenizer, so sizes are more consistent across sources. |
| 3 | Recursive splitter | separator hierarchy, target_size | Easy | LangChain-style cascade: split on \n\n, then \n, then ., then space, until each chunk fits target size. Source-agnostic. |
| 4 | Section-aware | per-source structure rules | Medium (per source) | Use source-specific structure: FR <HD>, bills <section>, USC USLM, hearing speaker turns, CRS markdown headings. Falls back to recursive splitter inside oversize sections. |
| 5 | Hierarchical (parent-child) | parent_size, child_size, overlap | Medium | Split into small children for retrieval, larger parents for context. Two indexes: child for vec search, parent for context window. Adds complexity. |
| 6 | Late chunking | model max-length | High | Embed the full document with a long-context model, then mean-pool token embeddings within chunk boundaries. Requires the embedder to support full-document input. Changes both chunker AND embedder; can't be evaluated cleanly against just-a-chunker variants. |
Out of scope for the first comparison
- Semantic boundary detection (using the embedder itself to find semantic breaks). Heavy, exotic, hard to debug.
- Window-based chunking with attention to query terms. Query-aware chunking is a research direction; not Step 1.
- Fine-grained token-level chunking with sliding windows. Diminishing returns.
We'll compare 1, 2, 3, 4 head-to-head as the first pass. If section-aware wins as expected, hierarchical (5) becomes the v1.x optimization. Late chunking (6) is a separate research thread because it changes the embedder.
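For orientation, here is a minimal sketch of the baseline (strategy 1), fixed-size character windows with overlap. The function name `fixed_char_chunks` and its tuple return shape are illustrative assumptions, not the ingester's actual code.

```python
def fixed_char_chunks(
    text: str, chunk_size: int = 4000, overlap: int = 800
) -> list[tuple[int, int, str]]:
    """Return (char_start, char_end, chunk_text) windows over text.

    Hypothetical helper: each window starts chunk_size - overlap characters
    after the previous one, so neighbours share `overlap` characters.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append((start, end, text[start:end]))
        if end == len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks
```

Strategy 2 would follow the same windowing logic over token IDs instead of characters.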
The Chunker protocol
Production abstraction. Lives at shared/josh_substrate/src/josh_substrate/chunking.py (or in the ingester package; TBD, leaning toward shared so josh-core can chunk on the fly for ad-hoc use cases).
```python
from typing import Protocol, AsyncIterator, runtime_checkable
from pydantic import BaseModel


class Chunk(BaseModel):
    """One chunk produced by a Chunker. Source-agnostic."""

    chunk_index: int           # 0-based, monotonic within a parent doc
    text: str                  # the chunk content
    char_start: int | None     # offset into the parsed source text
    char_end: int | None
    metadata: dict             # chunker-specific (section_path, speaker, heading_level, etc.)


class ParsedDoc(BaseModel):
    """Input to a Chunker. Source-agnostic."""

    doc_id: str                # canonical Josh ID (e.g. 'fr:2026-08558')
    source: str                # 'federal_register', 'bills', etc.
    text: str                  # cleaned plaintext body
    structure: dict | None     # source-specific structure hints (XML tree, markdown headings, etc.)


@runtime_checkable
class Chunker(Protocol):
    """A function from a parsed document to an ordered sequence of chunks."""

    name: str                  # 'fixed_char_4k', 'recursive_v1', 'section_aware_fr_v1', etc.
    version: int               # bump when behavior changes
    config: dict               # the parameters this instance was constructed with

    def chunk(self, doc: ParsedDoc) -> list[Chunk]:
        """Pure function: same input, same output. No I/O."""
        ...
```
Naming convention for chunker IDs
Chunker IDs are stored alongside chunks so we know which strategy produced which row. Format: {strategy}_{variant}_v{version}.
Examples:
- fixed_char_4k800_v1: fixed 4000 chars / 800 overlap, version 1
- fixed_token_512_50_v1: fixed 512 tokens / 50 overlap
- recursive_v1: default recursive splitter
- section_aware_fr_v1: FR-specific section-aware
- section_aware_bills_v1: bills-specific
- usc_uslm_section_v1: USC USLM section-level
- speaker_turn_v1: for hearings + congressional record
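A small sketch of building and validating IDs in this format. The helper names `make_chunker_id` and `split_chunker_id` are hypothetical. Because both the strategy and the variant may contain underscores, the parser below only separates the trailing version and makes no attempt to recover the strategy/variant split.

```python
from __future__ import annotations

import re

# Lowercase name segments joined by underscores, then a _v<digits> suffix.
_CHUNKER_ID = re.compile(r"^(?P<name>[a-z0-9]+(?:_[a-z0-9]+)*)_v(?P<version>\d+)$")


def make_chunker_id(strategy: str, version: int, variant: str | None = None) -> str:
    """Assemble {strategy}_{variant}_v{version}; the variant is optional."""
    parts = [strategy] + ([variant] if variant else [])
    return "_".join(parts) + f"_v{version}"


def split_chunker_id(chunker_id: str) -> tuple[str, int]:
    """Return (name_without_version, version) or raise ValueError."""
    match = _CHUNKER_ID.match(chunker_id)
    if match is None:
        raise ValueError(f"not a valid chunker ID: {chunker_id!r}")
    return match["name"], int(match["version"])
```

Validating IDs at write time would keep malformed strings out of the chunker_id column.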
Chunk storage
Chunks live in <source>_chunks tables. Each row carries chunker_id (TEXT, the {strategy}_{variant}_v{version} string) so multiple chunkings can coexist temporarily for evaluation.
```sql
-- Example: fr_document_chunks
CREATE TABLE fr_document_chunks (
    id              INTEGER PRIMARY KEY AUTOINCREMENT,
    fr_document_id  TEXT NOT NULL REFERENCES fr_documents(id) ON DELETE CASCADE,
    chunker_id      TEXT NOT NULL,   -- 'section_aware_fr_v1', etc.
    chunk_index     INTEGER NOT NULL,
    chunk_text      TEXT NOT NULL,
    char_start      INTEGER,
    char_end        INTEGER,
    metadata        TEXT,            -- JSON, chunker-specific
    UNIQUE (fr_document_id, chunker_id, chunk_index)
);
CREATE INDEX fr_document_chunks_doc ON fr_document_chunks (fr_document_id, chunker_id);
```
vec0 companion tables key vectors back to fr_document_chunks.id, so swapping the active chunker is a join change, not a schema change.
Production picks one canonical chunker per source
In production, one chunker_id is "live" per source: that's the one queries hit by default. The harness picks the live chunker via a configuration table (source_chunker_config) or via env, TBD. Eval runs spawn alternate chunker IDs alongside the live one; once an experiment concludes, the alternates are pruned.
This keeps the steady-state simple (one chunker per source, one set of vectors) but lets us evaluate alternates without forking the database.
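To make the "live plus alternates" idea concrete, here is a sketch using Python's built-in `sqlite3` module. The column layout of `source_chunker_config` (`source`, `live_chunker_id`) is an assumption, since that table is still TBD. The chunk table is trimmed to the columns the example needs, and the helper names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fr_document_chunks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    fr_document_id TEXT NOT NULL,
    chunker_id TEXT NOT NULL,
    chunk_index INTEGER NOT NULL,
    chunk_text TEXT NOT NULL,
    UNIQUE (fr_document_id, chunker_id, chunk_index)
);
-- Assumed shape: one live chunker per source.
CREATE TABLE source_chunker_config (
    source TEXT PRIMARY KEY,
    live_chunker_id TEXT NOT NULL
);
""")
conn.executemany(
    "INSERT INTO fr_document_chunks (fr_document_id, chunker_id, chunk_index, chunk_text)"
    " VALUES (?, ?, ?, ?)",
    [
        ("fr:2026-08558", "fixed_char_4k800_v1", 0, "first window ..."),
        ("fr:2026-08558", "section_aware_fr_v1", 0, "Background ..."),
        ("fr:2026-08558", "section_aware_fr_v1", 1, "Statutory Authority ..."),
    ],
)
conn.execute(
    "INSERT INTO source_chunker_config VALUES ('federal_register', 'section_aware_fr_v1')"
)


def live_chunks(conn: sqlite3.Connection, doc_id: str) -> list[tuple[int, str]]:
    """Chunks of one FR document produced by the live chunker only."""
    return conn.execute(
        """
        SELECT c.chunk_index, c.chunk_text
        FROM fr_document_chunks AS c
        JOIN source_chunker_config AS cfg ON cfg.live_chunker_id = c.chunker_id
        WHERE cfg.source = 'federal_register' AND c.fr_document_id = ?
        ORDER BY c.chunk_index
        """,
        (doc_id,),
    ).fetchall()


def prune_alternates(conn: sqlite3.Connection) -> int:
    """Delete every FR chunk not produced by the live chunker; return rows removed."""
    cur = conn.execute(
        """
        DELETE FROM fr_document_chunks
        WHERE chunker_id != (
            SELECT live_chunker_id FROM source_chunker_config
            WHERE source = 'federal_register'
        )
        """
    )
    return cur.rowcount
```

Promoting a different chunker is then a one-row update to the config table, followed by a prune once the experiment is closed out.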
The experiment framework
How we run a comparison.
The corpus
The chunker eval depends on having the raw artifacts persisted on disk so we can re-chunk without re-fetching. This is already the convention: /data/corpus/federal_register/metadata/{YYYY}/{MM}/{number}.json and /data/corpus/federal_register/bodies/xml/{YYYY}/{MM}/{number}.xml.
For each source, the eval corpus is a defined slice of these raw files:
| Source | Eval corpus slice | Doc count | Notes |
|---|---|---|---|
| Federal Register | 2025-01-01 → 2025-12-31 (full year) | ~70K docs | Spans Rules, NPRMs, Notices, Presidential Documents — exercises every type. |
| Bills | 119th Congress, all bill types | ~10K bills | Current congress; mix of HRES, S, HR, etc. |
| US Code | Title 42 (Public Health) + Title 26 (IRC) | ~10K sections | Two contrasting titles. |
The eval reads these raw files, runs the parser → the chunker variant being tested → the embedder, indexes into a temp SQLite file, then runs the eval set against it. Production data is untouched.
The eval set
We need labeled query-document relevance judgments. Build the eval set in three steps:
1. Query authoring. ~50 queries per source, drawn from agent-realistic patterns. Categories:
- Single-doc retrieval ("the rule on medical device cybersecurity from 2025")
- Topical (lookup) ("rules affecting HSA expansion")
- Citation following ("rules amending 42 USC 1396a")
- Comparative ("how does this NPRM differ from the prior 2018 proposed rule on the same topic")
- Authority queries ("EPA actions on PFAS in 2025")
- Section-level ("FDA premarket pathway for software in 21 CFR 820")
2. Candidate gathering. For each query, run a baseline retrieval (e.g., FTS5 BM25 with default weights, plus a brute-force semantic search) and gather top-30 candidates. Pool candidates from multiple chunkers so we don't bias the labels toward what one chunker happens to produce.
3. Labeling. Three approaches, in order of preference:
- (a) Manual labeling — a human reads each candidate and assigns a relevance score (0 = not relevant, 1 = partially relevant, 2 = relevant, 3 = highly relevant / canonical answer). Slow but ground-truth.
- (b) LLM-as-judge — Claude (or similar) reads the query + candidate and scores. Useful as a force multiplier; manual spot-check ~20% to validate correlation.
- (c) Bootstrap from prior eval — the deleted Phase C eval had ~55 queries with ~10 manually labeled. The query templates are documented in the conversation log of 2026-05-04 / 2026-05-05; rebuild from there.
Realistic plan: (c) → (b) → (a) for a high-confidence subset. Aim for 50 queries × 30 candidates labeled = 1,500 labels per source. With LLM-as-judge as the main labor and human spot-check, this is days, not weeks.
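The pooling in step 2 can be sketched as a round-robin merge over several ranked lists. This assumes the 30-candidate budget applies to the merged pool, which the text leaves open. The function name `pool_candidates` and the retriever labels in the usage note are hypothetical.

```python
def pool_candidates(ranked_lists: dict[str, list[str]], limit: int = 30) -> list[str]:
    """Merge ranked doc-ID lists from several retrievers into one candidate pool.

    Takes rank 1 from every retriever, then rank 2, and so on, skipping
    duplicates, until `limit` distinct candidates are collected. Retrievers are
    visited in sorted-name order so the result is deterministic.
    """
    pooled: list[str] = []
    seen: set[str] = set()
    max_len = max((len(r) for r in ranked_lists.values()), default=0)
    for rank in range(max_len):
        for name in sorted(ranked_lists):
            results = ranked_lists[name]
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                pooled.append(results[rank])
                if len(pooled) == limit:
                    return pooled
    return pooled
```

For example, `pool_candidates({"bm25": [...], "dense_fixed_char": [...], "dense_section_aware": [...]})` gives every retriever an equal chance to contribute its top results, so the labels are not skewed toward one chunker's output.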
Eval set artifacts live in:
```
evals/
├── fr/
│   ├── queries.yaml      # 50 queries with categories
│   ├── candidates.jsonl  # one row per (query, doc) candidate pool
│   └── labels.jsonl      # one row per (query, doc) label with judge metadata
├── bills/
│   └── ...
└── usc/
    └── ...
```
(The previous evals/ directory was deleted in the SQLite-pivot rollback. The methodology described here recreates and improves on it.)
Metrics
Same as the deleted Phase C eval, and known to work well for retrieval evaluation:
| Metric | What it measures | Why we use it |
|---|---|---|
| NDCG@10 | Normalized discounted cumulative gain at 10: rewards highly relevant docs early in the ranking | Primary metric. Sensitive to ranking quality, not just inclusion. |
| Recall@10 | Fraction of all relevant docs retrieved in top 10 | Sanity check. Don't ship a chunker that loses recall even if NDCG looks fine. |
| MRR | Mean reciprocal rank of the first relevant result | Tells us "did the first result the agent sees actually help." |
| Mean chunk count per doc | How fragmented the chunker is | Affects storage + scan time. Reported alongside quality metrics for cost awareness. |
| p50 / p95 chunk size (tokens) | Distribution of chunk sizes | Catches degenerate behavior — section-aware producing 50K-token chunks for one doc and 20-token chunks for another. |
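
As a reference for how the three quality metrics could be computed from graded labels (0 to 3), here is a minimal sketch. It makes assumptions this doc does not pin down. Gain is linear in the grade (an exponential gain of 2^grade - 1 is a common alternative). A document counts as "relevant" for Recall and MRR at grade 2 or above. The ranked list is assumed to be already collapsed from chunks to unique doc IDs.

```python
import math


def dcg(gains: list[float]) -> float:
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked_ids: list[str], labels: dict[str, int], k: int = 10) -> float:
    """DCG of the top-k ranking divided by the DCG of the ideal top-k ranking."""
    gains = [labels.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(labels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(
    ranked_ids: list[str], labels: dict[str, int], k: int = 10, min_grade: int = 2
) -> float:
    """Fraction of all relevant docs (grade >= min_grade) found in the top k."""
    relevant = {doc_id for doc_id, grade in labels.items() if grade >= min_grade}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


def reciprocal_rank(ranked_ids: list[str], labels: dict[str, int], min_grade: int = 2) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if labels.get(doc_id, 0) >= min_grade:
            return 1.0 / rank
    return 0.0
```

MRR is the mean of `reciprocal_rank` over all queries in the eval set. NDCG@10 and Recall@10 are likewise averaged per query.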
Decision rule
When does an experiment win? The honest threshold:
- NDCG@10 improvement of ≥0.02 absolute (over the live chunker) on the eval set, with no Recall@10 regression worse than 0.02 absolute. That's roughly the noise floor of the labeled set; smaller gains aren't reliably real.
- Mean chunk count within 2× of the live chunker. A win that triples storage + scan cost has to be evaluated against that cost.
- p95 chunk size ≤ 2048 tokens (fits Arctic comfortably).
- The chunker is reproducible (no LLM-in-the-loop nondeterminism, no version-drifting libraries).
If the candidate beats the live chunker on NDCG@10 by ≥0.02 with the other constraints satisfied, swap it in production for that source. Otherwise, keep the live chunker.
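The rule above is mechanical enough to encode directly. A sketch follows, with a hypothetical `RunSummary` record and `should_swap` helper. "Within 2×" is read here as an upper bound on the candidate's mean chunk count, which is one interpretation of the rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    """Aggregate numbers from one chunker's eval run (illustrative fields)."""

    ndcg_at_10: float
    recall_at_10: float
    mean_chunks_per_doc: float
    p95_chunk_tokens: int
    reproducible: bool = True


def should_swap(
    live: RunSummary,
    candidate: RunSummary,
    min_ndcg_gain: float = 0.02,
    max_recall_drop: float = 0.02,
    max_chunk_ratio: float = 2.0,
    max_p95_tokens: int = 2048,
) -> tuple[bool, list[str]]:
    """Return (swap?, reasons it failed). An empty reason list means swap."""
    failures = []
    if candidate.ndcg_at_10 - live.ndcg_at_10 < min_ndcg_gain:
        failures.append("NDCG@10 gain below threshold")
    if live.recall_at_10 - candidate.recall_at_10 > max_recall_drop:
        failures.append("Recall@10 regression too large")
    if candidate.mean_chunks_per_doc > max_chunk_ratio * live.mean_chunks_per_doc:
        failures.append("mean chunk count exceeds allowed ratio")
    if candidate.p95_chunk_tokens > max_p95_tokens:
        failures.append("p95 chunk size too large")
    if not candidate.reproducible:
        failures.append("chunker is not reproducible")
    return (not failures, failures)
```

Returning the failure reasons alongside the verdict makes it easy to print them in the comparison report.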
The harness
A runnable Python module at shared/josh_substrate/src/josh_substrate/eval/chunker_compare.py (rough sketch):
```python
from __future__ import annotations  # Chunker / Embedder are referenced by name only in this sketch

from pathlib import Path


async def run_chunker_eval(
    source: str,              # 'federal_register'
    corpus_path: Path,        # /data/corpus/federal_register/
    eval_set_path: Path,      # evals/fr/
    chunkers: list[Chunker],  # variants to compare
    embedder: Embedder,       # the live embedder (Arctic-L-v2)
    output_path: Path,        # where to write the comparison report
) -> None:
    """For each chunker:
      1. Build a temp SQLite file
      2. Parse + chunk the corpus into it
      3. Embed all chunks via the embedder
      4. Run the eval set, scoring each query
      5. Aggregate metrics
    Write a comparison table + per-query breakdown to output_path.
    """
```
Output: a markdown report with the comparison table, per-query winners/losers, and the cost numbers (chunk count, storage, eval time). Comparable to the deleted evals/federal_register/runs/...comparison.md files.
CLI: josh-eval chunker --source=federal_register --chunkers=fixed_char_4k800_v1,recursive_v1,section_aware_fr_v1.
Per-source considerations
One chunker per source is the production rule. Different sources have different structures; different structures want different chunkers. Don't try to find a one-size-fits-all winner.
Federal Register
The XML body has a clear hierarchy: <RULE> / <PRORULE> / <NOTICE> / <PRESDOCU> → <PREAMB> (preamble) + <SUPLINF> (supplementary info) → <HD> headings (with SOURCE attribute giving heading level: HED, HD1, HD2).
A section-aware chunker walks the XML, treats each <HD>-bounded block as a chunk, and falls back to recursive splitting inside any block that exceeds ~2048 tokens. The chunk metadata captures section_path (e.g., ["SUPLINF", "Background", "Statutory Authority"]) so the agent can present "from the Background section" in citations.
Special-case for Presidential Documents: the wrapper element (<EXECORD>, <PROCLA>, <PRMEMO>, etc.) contains the full body text without internal sections in many cases. The chunker should fall back to recursive splitting at paragraph boundaries.
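A simplified sketch of the heading walk, using the standard library's `xml.etree.ElementTree`. It assumes headings and paragraphs are direct children of the container element, which real FR XML does not always honor. It also omits the oversize-block fallback. The function name `hd_blocks` and the output shape are hypothetical.

```python
import xml.etree.ElementTree as ET

# Heading depth by SOURCE attribute; unknown values are treated as deepest.
LEVELS = {"HED": 0, "HD1": 1, "HD2": 2}


def hd_blocks(container_xml: str) -> list[dict]:
    """Split one container element (e.g. <SUPLINF>) into <HD>-bounded blocks.

    Each block carries a section_path built from the container tag plus the
    stack of headings currently in scope.
    """
    root = ET.fromstring(container_xml)
    blocks: list[dict] = []
    path: list[tuple[int, str]] = []  # (level, heading text), strictly deepening
    current = None
    for child in root:
        text = "".join(child.itertext()).strip()
        if child.tag == "HD":
            level = LEVELS.get(child.get("SOURCE", ""), 3)
            # A new heading closes every heading at the same or deeper level.
            path = [(lvl, title) for lvl, title in path if lvl < level]
            path.append((level, text))
            current = {"section_path": [root.tag] + [t for _, t in path], "paragraphs": []}
            blocks.append(current)
        elif text:
            if current is None:  # body text before the first heading
                current = {"section_path": [root.tag], "paragraphs": []}
                blocks.append(current)
            current["paragraphs"].append(text)
    # Headings followed immediately by a sub-heading produce no text; drop them.
    return [
        {"section_path": b["section_path"], "text": "\n\n".join(b["paragraphs"])}
        for b in blocks
        if b["paragraphs"]
    ]
```

A production chunker would additionally pass any block over the size limit through the recursive splitter and copy `section_path` into each resulting chunk's metadata.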
Bills
BILLSTATUS XML has the metadata; bill TEXT (in BILLS or PLAW) has the body. For Step 1, focus on the enacted text for Public Laws (USLM 2.0) and the latest version text for active bills. Live bill XML uses the older "billres" DTD which is less structured.
Section-aware chunker uses USLM <section> elements when available, falls back to recursive splitting on <subsection>/<paragraph> for older DTDs.
US Code (USC) and Public Laws
USLM 2.0. The section is the citation atom. A chunker that walks the USLM tree and emits one chunk per <section> element (with subsections inline) is the obvious play. Cross-section chunks dilute citation precision badly here, because USC queries are almost always citation-anchored ("what does 42 USC 1396a(a)(10) say?").
CRS reports
HTML rendering. No formal section markup; rely on <h1>/<h2>/<h3> headings and paragraph boundaries. A recursive splitter with markdown-style heading awareness (treats # Heading lines as boundaries) is the realistic v1 chunker.
Hearings + Congressional Record
Speaker turns are the citation atom. For CHRG and CREC, the parser already extracts speaker-name-prefixed turns. The chunker is just "one chunk per turn" with metadata carrying speaker, role (witness / member / chairman), and timing where available.
Long turns (>2048 tokens) get split with a recursive splitter as a fallback. Single chunks under ~50 tokens get merged with adjacent same-speaker turns.
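The merge rule for very short turns might look like the sketch below. The `Turn` record and the whitespace word count standing in for a real tokenizer are both assumptions. The long-turn split is left to the recursive splitter and not shown.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    speaker: str
    text: str


def approx_tokens(text: str) -> int:
    """Whitespace word count, used here as a crude stand-in for a tokenizer."""
    return len(text.split())


def merge_short_turns(turns: list[Turn], min_tokens: int = 50) -> list[Turn]:
    """Merge a turn into the previous one when both share a speaker and either is short."""
    merged: list[Turn] = []
    for turn in turns:
        prev = merged[-1] if merged else None
        if (
            prev is not None
            and prev.speaker == turn.speaker
            and (approx_tokens(prev.text) < min_tokens or approx_tokens(turn.text) < min_tokens)
        ):
            merged[-1] = Turn(prev.speaker, prev.text + "\n" + turn.text)
        else:
            merged.append(Turn(turn.speaker, turn.text))
    return merged
```

Short turns with no adjacent same-speaker neighbour are left as they are, so a brief interjection between two other speakers stays its own chunk.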
GAO + CBO
PDF-derived plaintext. No useful structure markup. Recursive splitter with paragraph + sentence boundaries. These sources are likely the ones where chunking strategy makes the smallest difference; recursive is fine.
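Since the recursive splitter is both the plan for these sources and the fallback everywhere else, a compact sketch of the cascade may help. Sizes are measured in characters for simplicity, separators are dropped at chunk boundaries, and the function name and defaults are illustrative rather than the planned `recursive_v1` implementation.

```python
SEPARATORS = ("\n\n", "\n", ". ", " ")


def recursive_split(
    text: str, target_size: int = 1000, separators: tuple[str, ...] = SEPARATORS
) -> list[str]:
    """Split text into pieces of at most target_size characters.

    Tries the coarsest separator first, packs adjacent pieces greedily, and
    recurses with finer separators only for pieces that are still too large.
    """
    if len(text) <= target_size:
        return [text] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard cut.
        return [text[i:i + target_size] for i in range(0, len(text), target_size)]
    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        return recursive_split(text, target_size, finer)
    chunks: list[str] = []
    buffer = ""
    for piece in pieces:
        candidate = piece if not buffer else buffer + sep + piece
        if len(candidate) <= target_size:
            buffer = candidate
            continue
        if buffer.strip():
            chunks.append(buffer)
        if len(piece) <= target_size:
            buffer = piece
        else:
            chunks.extend(recursive_split(piece, target_size, finer))
            buffer = ""
    if buffer.strip():
        chunks.append(buffer)
    return chunks
```

Every returned chunk respects the size limit, which is what the section-aware chunkers rely on when they hand an oversize block to this fallback.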
Implementation phases
When does the chunker eval framework actually get built?
Phase A — protocol + baseline (during Phase 0.3 ingester rebuild)
- Implement the Chunker protocol in shared/josh_substrate/
- Implement fixed_char_4k800_v1 (the previous default) and recursive_v1
- Wire chunker selection into the ingester load stage
- Schema for <source>_chunks carries chunker_id
This is the minimum to ship Phase 0 with a chunker. No eval framework yet — just the abstraction.
Phase B — section-aware FR (during Phase 0.6 first FR ingestion)
- Implement section_aware_fr_v1 walking the FR XML <HD> structure
- Run it as the production chunker for FR (high prior; instrumented to compare retroactively)
- No formal eval yet — committing on prior
Phase C — eval framework (before Step 1 launch, summer 2026)
- Build the eval harness (josh-eval chunker)
- Author the FR query set and label it (LLM-as-judge bootstrap, manual spot-check)
- Run head-to-head: fixed_char_4k800_v1 vs recursive_v1 vs section_aware_fr_v1
- Publish the results as part of the OSS launch's retrieval research blog post
- If section-aware doesn't win on NDCG@10, that's a real and surprising finding worth investigating
Phase D — repeat per source (Phase 1+ ingestion)
- For each new source as it lands, build the source-specific chunker (or pick recursive as fallback)
- Run the eval against the new source's corpus
- Lock in the source's production chunker
- Document the result in the source's per-source spec
Open questions
- Where does the chunker live in the codebase? shared/josh_substrate/chunking.py (alongside models + DB helpers, available to both ingester and core) or josh-ingester/ingester/chunking.py (ingester-private, smaller dependency footprint)? Lean shared: josh-core may want to chunk uploaded files in the agent harness later (Step 2).
- LLM-as-judge model and protocol. Which model? Claude Opus is the gold standard for relevance judgments; Sonnet is cheaper and may be sufficient. What's the labeling prompt format that generalizes across queries? Worth a small calibration study (50 manual labels vs LLM labels, measure agreement).
- Single eval set or per-source eval sets? Lean per-source. Different sources have different query patterns. But a shared "cross-source citation-following" eval might also be useful (queries that traverse multiple sources).
- How often do we re-run the eval? On every chunker change. On every embedder change. As a CI gate? (Probably not: the eval takes minutes-to-hours per source. Run it on-demand when something changes.)
- Eval set versioning. The eval set is itself a research artifact. When we add queries or rejudge labels, version it (evals/fr/v1/, evals/fr/v2/). Old runs stay reproducible against the version they used.
- Late chunking research thread. Worth a separate doc + experiment if it becomes interesting. Requires confirming Arctic-L-v2's max sequence length (8192 tokens?) and whether mean-pooling token embeddings within chunk boundaries gives a real quality lift versus segment-then-embed. Defer to v1.x.
- Public chunker eval as a research artifact. The OSS launch's retrieval research blog post should include the chunker comparison results. That means the eval set + labels + run files all need to be reproducible from the OSS code. Decide what's in the OSS repo vs cloud-private (lean: queries + labels in OSS, the labeled cloud corpus stays private).
- Hierarchical (parent-child) chunking. Out of Phase C scope but a likely v1.x improvement. Children for retrieval (~256 tokens), parents for context window (~2048 tokens). Adds complexity to the schema (parent/child tables) and the agent's retrieval flow (which set does it query, how does it expand).
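To illustrate the parent-child shape from the last question, here is a rough sketch that splits a token list into parents and nests children inside each parent. It uses plain string tokens in place of real tokenizer output, ignores overlap, and all function names are hypothetical.

```python
def parent_child_chunks(
    tokens: list[str], parent_size: int = 2048, child_size: int = 256
) -> tuple[list[str], list[tuple[int, str]]]:
    """Return (parents, children), where each child is (parent_index, child_text).

    Children never straddle a parent boundary, so every retrieval hit maps to
    exactly one parent for context expansion.
    """
    parents: list[str] = []
    children: list[tuple[int, str]] = []
    for parent_index, start in enumerate(range(0, len(tokens), parent_size)):
        parent_tokens = tokens[start:start + parent_size]
        parents.append(" ".join(parent_tokens))
        for child_start in range(0, len(parent_tokens), child_size):
            child_tokens = parent_tokens[child_start:child_start + child_size]
            children.append((parent_index, " ".join(child_tokens)))
    return parents, children


def expand_to_parents(
    hit_child_indexes: list[int],
    children: list[tuple[int, str]],
    parents: list[str],
) -> list[str]:
    """Map retrieved child hits to their parents, deduplicated, in hit order."""
    parent_order: list[int] = []
    for child_index in hit_child_indexes:
        parent_index = children[child_index][0]
        if parent_index not in parent_order:
            parent_order.append(parent_index)
    return [parents[i] for i in parent_order]
```

The schema question then reduces to where the `parent_index` link is stored, and the retrieval-flow question to when `expand_to_parents` runs.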
Cross-references
- docs/operations/eval-architecture.html: the unified eval framework. Authoritative for runner location, CLI shape, eval-set artifact location, and substrate-snapshot pinning. The four supersessions noted at the top of this doc are spec'd in detail there.
- docs/operations/ingestion-architecture.html: where the chunker plugs into the load stage
- docs/sources/federal-register.html: the FR XML structure that drives section-aware chunking for FR
- docs/sources/bills.html: bill text structure (USLM for PLAW, billres for live bills)
- docs/sources/us-code.html: USC USLM hierarchy
- docs/sources/hearing-transcripts.html: speaker-turn structure
- CLAUDE.md "Storage backend decision": why SQLite + sqlite-vec, why Arctic-L-v2 as the embedder