Chunking — design and experiment framework

How Josh splits source documents into the chunks that get embedded and stored as vectors, and how we decide which chunking strategy wins for which source.

This document covers two things that pair naturally:

The production Chunker abstraction — the protocol every chunker implements, where chunkers live in the codebase, how the ingester picks one, how we evolve it without breaking existing data.
The experiment framework — how we run head-to-head comparisons of chunking strategies on the same corpus, with the same eval set, against the same embedder, to make decisions backed by numbers rather than priors.

When this document disagrees with prose elsewhere, this document wins for chunking concerns.

Storage backend: SQLite + FTS5 + sqlite-vec. Embedding model: Snowflake Arctic-Embed-M-v2 (1024-dim, MIT license). See CLAUDE.md "Storage backend decision" for context.

Where this document has been superseded. The evaluation architecture doc is now the source of truth for how chunker comparison runs. Four specific things below have been replaced:

Eval harness location. What this doc calls shared/josh_substrate/src/josh_substrate/eval/chunker_compare.py now lives at josh-eval/retrieval/runner.py. The chunker protocol and chunker implementations stay in the substrate package; only the eval runner moves out.

CLI shape. Bare josh-eval chunker --source=… is replaced by josh-eval retrieval run --eval-set=… --substrate-version=…, a subcommand under the unified eval CLI.

Eval-set artifact location. Eval sets move from evals/<source>/ in the substrate repo to retrieval/eval-sets/<source>/ inside josh-eval.

Snapshot pinning (new requirement). Each eval set's manifest.json declares the substrate version it was authored against. Re-runs against a different version are flagged in reports rather than silently producing drifty numbers.

Everything else in this doc — the Chunker protocol, the strategy comparison, query authoring, labeling rubric, decision rules — is still authoritative. Read this doc for chunker design; read evaluation architecture for how the runs are organized.

Why this matters

Retrieval quality is the dominant factor in how good Josh feels to an agent. Bad chunking doesn't just degrade vector retrieval — it floors the quality ceiling for everything downstream. If a chunk crosses a section boundary in an FR rule, the embedding represents two unrelated topics and the cosine similarity to any query is noisy. If a chunk is too small, it loses context the agent needs. If it's too large, it dilutes the signal.

The ingester previously shipped on fixed-size 4000-char / 800-char-overlap chunking as the default — a reasonable starting point but almost certainly not optimal for federal documents, which are heavily structured (FR XML has explicit <HD> headers and <SECTION> elements; bills have <section>; USC uses USLM with a deep hierarchy; CRS has HTML headings; hearings have speaker turns).

We need to know which chunking strategy actually wins, per source, before we commit to a default for Step 1's launch corpus.

Hypothesis (priors before measurement)

Stated explicitly so we can later check whether reality matched.

Source	Predicted best	Confidence	Reasoning
Federal Register (rules, NPRMs)	Section-aware (HD-driven)	High	FR XML carries explicit `<HD>` headers with named sections (BACKGROUND, REGULATORY ANALYSIS, etc.) that map cleanly to logical units. Mixing sections in one chunk dilutes embeddings.
Bills	Section-aware (`<section>`-driven) with parent-child fallback	Medium-high	Bill sections are the citation atom. But many bills have boilerplate (short titles, definitions) that swamps signal in mixed chunks.
US Code	USLM hierarchical (section level)	High	USC is hierarchical by design; the section is the citation atom.
Public Laws	USLM hierarchical (section level)	High	Same shape as USC.
CRS reports	Recursive splitter with markdown header awareness	Medium	Less explicit XML structure; HTML rendering varies. Likely wants paragraph-level splitting with some heading awareness.
GAO reports	Recursive splitter	Medium	PDF-derived plaintext; structure varies.
Hearing transcripts	Speaker-turn-level	High	Speaker turns are the citation atom. Each turn is a coherent contribution.
Congressional Record granules	Speaker-turn-level	High	Same as hearing transcripts.

Aggregate prior: section-aware (or speaker-turn-aware) wins meaningfully for the structured sources, by what feels like 5–15% NDCG@10 versus fixed-size char-based. For unstructured sources (CRS, GAO), the win narrows to 0–5% over recursive splitting.

These predictions are checkable. Track them; if reality is consistently different, the priors were wrong and we should update our intuitions about the corpus.

Strategies to compare

The Step 1 candidate set. Six strategies cover the realistic design space:

#	Strategy	Tunables	Implementation cost	Notes
1	Fixed-size char	chunk_size, overlap	Trivial (current default)	Baseline. 4000/800 chars ≈ 1000/200 tokens. Source-agnostic.
2	Fixed-size token	chunk_size, overlap, tokenizer	Easy	Same as #1 but counted in tokens (Arctic uses BPE), more reliable across sources.
3	Recursive splitter	separator hierarchy, target_size	Easy	LangChain-style cascade: split on `\n\n`, then `\n`, then `.`, then space, until each chunk fits target size. Source-agnostic.
4	Section-aware	per-source structure rules	Medium (per source)	Use source-specific structure: FR `<HD>`, bills `<section>`, USC USLM, hearing speaker turns, CRS markdown headings. Falls back to recursive splitter inside oversize sections.
5	Hierarchical (parent-child)	parent_size, child_size, overlap	Medium	Split into small children for retrieval, larger parents for context. Two indexes: child for vec search, parent for context window. Adds complexity.
6	Late chunking	model max-length	High	Embed the full document with a long-context model, then mean-pool token embeddings within chunk boundaries. Requires the embedder to support full-document input. Changes both chunker AND embedder; can't be evaluated cleanly against just-a-chunker variants.

Out of scope for the first comparison

Semantic boundary detection (using the embedder itself to find semantic breaks). Heavy, exotic, hard to debug.
Window-based chunking with attention to query terms. Query-aware chunking is a research direction; not Step 1.
Fine-grained token-level chunking with sliding windows. Diminishing returns.

We'll compare 1, 2, 3, 4 head-to-head as the first pass. If section-aware wins as expected, hierarchical (5) becomes the v1.x optimization. Late chunking (6) is a separate research thread because it changes the embedder.

The `Chunker` protocol

Production abstraction. Lives at shared/josh_substrate/src/josh_substrate/chunking.py (or in the ingester package — TBD; lean toward shared so josh-core can chunk on the fly for ad-hoc use cases).

from typing import Protocol, AsyncIterator, runtime_checkable
from pydantic import BaseModel


class Chunk(BaseModel):
    """One chunk produced by a Chunker. Source-agnostic."""

    chunk_index: int                       # 0-based, monotonic within a parent doc
    text: str                              # the chunk content
    char_start: int | None                 # offset into the parsed source text
    char_end: int | None
    metadata: dict                         # chunker-specific (section_path, speaker, heading_level, etc.)


class ParsedDoc(BaseModel):
    """Input to a Chunker. Source-agnostic."""

    doc_id: str                            # canonical Josh ID (e.g. 'fr:2026-08558')
    source: str                            # 'federal_register', 'bills', etc.
    text: str                              # cleaned plaintext body
    structure: dict | None                 # source-specific structure hints (XML tree, markdown headings, etc.)


@runtime_checkable
class Chunker(Protocol):
    """A function from a parsed document to an ordered sequence of chunks."""

    name: str                              # 'fixed_char_4k', 'recursive_v1', 'section_aware_fr_v1', etc.
    version: int                           # bump when behavior changes
    config: dict                           # the parameters this instance was constructed with

    def chunk(self, doc: ParsedDoc) -> list[Chunk]:
        """Pure function — same input, same output. No I/O."""
        ...

Naming convention for chunker IDs

Chunker IDs are stored alongside chunks so we know which strategy produced which row. Format: {strategy}_{variant}_v{version}.

Examples:

fixed_char_4k800_v1 — fixed 4000 chars / 800 overlap, version 1
fixed_token_512_50_v1 — fixed 512 tokens / 50 overlap
recursive_v1 — default recursive splitter
section_aware_fr_v1 — FR-specific section-aware
section_aware_bills_v1 — bills-specific
usc_uslm_section_v1 — USC USLM section-level
speaker_turn_v1 — for hearings + congressional record

Chunk storage

Chunks live in <source>_chunks tables. Each row carries chunker_id (TEXT, the {strategy}_{variant}_v{version} string) so multiple chunkings can coexist temporarily for evaluation.

-- Example: fr_document_chunks
CREATE TABLE fr_document_chunks (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    fr_document_id TEXT NOT NULL REFERENCES fr_documents(id) ON DELETE CASCADE,
    chunker_id TEXT NOT NULL,                 -- 'section_aware_fr_v1', etc.
    chunk_index INTEGER NOT NULL,
    chunk_text TEXT NOT NULL,
    char_start INTEGER,
    char_end INTEGER,
    metadata TEXT,                            -- JSON, chunker-specific
    UNIQUE (fr_document_id, chunker_id, chunk_index)
);
CREATE INDEX fr_document_chunks_doc ON fr_document_chunks (fr_document_id, chunker_id);

vec0 companion tables key vectors back to fr_document_chunks.id, so swapping the active chunker is a join change, not a schema change.

Production picks one canonical chunker per source

In production, one chunker_id is "live" per source — that's the one queries hit by default. The harness picks the live chunker via a configuration table (source_chunker_config) or via env, TBD. Eval runs spawn alternate chunker IDs alongside the live one; once an experiment concludes, the alternates are pruned.

This keeps the steady-state simple (one chunker per source, one set of vectors) but lets us evaluate alternates without forking the database.

The experiment framework

How we run a comparison.

The corpus

The chunker eval depends on having the raw artifacts persistent on disk so we can re-chunk without re-fetching. This is already the convention: /data/corpus/federal_register/metadata/{YYYY}/{MM}/{number}.json and /data/corpus/federal_register/bodies/xml/{YYYY}/{MM}/{number}.xml.

For each source, the eval corpus is a defined slice of these raw files:

Source	Eval corpus slice	Doc count	Notes
Federal Register	2025-01-01 → 2025-12-31 (full year)	~70K docs	Spans Rules, NPRMs, Notices, Presidential Documents — exercises every type.
Bills	119th Congress, all bill types	~10K bills	Current congress; mix of HRES, S, HR, etc.
US Code	Title 42 (Public Health) + Title 26 (IRC)	~10K sections	Two contrasting titles.

The eval reads these raw files, runs the parser → the chunker variant being tested → the embedder, indexes into a temp SQLite file, then runs the eval set against it. Production data is untouched.

The eval set

We need labeled query-document relevance judgments. Build the eval set in three steps:

1. Query authoring. ~50 queries per source, drawn from agent-realistic patterns. Categories:

Single-doc retrieval ("the rule on medical device cybersecurity from 2025")
Topical (lookup) ("rules affecting HSA expansion")
Citation following ("rules amending 42 USC 1396a")
Comparative ("how does this NPRM differ from the prior 2018 proposed rule on the same topic")
Authority queries ("EPA actions on PFAS in 2025")
Section-level ("FDA premarket pathway for software in 21 CFR 820")

2. Candidate gathering. For each query, run a baseline retrieval (e.g., FTS5 BM25 with default weights, plus a brute-force semantic search) and gather top-30 candidates. Pool candidates from multiple chunkers so we don't bias the labels toward what one chunker happens to produce.

3. Labeling. Three approaches, in order of preference:

(a) Manual labeling — a human reads each candidate and assigns a relevance score (0 = not relevant, 1 = partially relevant, 2 = relevant, 3 = highly relevant / canonical answer). Slow but ground-truth.
(b) LLM-as-judge — Claude (or similar) reads the query + candidate and scores. Useful as a force multiplier; manual spot-check ~20% to validate correlation.
(c) Bootstrap from prior eval — the deleted Phase C eval had ~55 queries with ~10 manually labeled. The query templates are documented in the conversation log of 2026-05-04 / 2026-05-05; rebuild from there.

Realistic plan: (c) → (b) → (a) for a high-confidence subset. Aim for 50 queries × 30 candidates labeled = 1,500 labels per source. With LLM-as-judge as the main labor and human spot-check, this is days, not weeks.

Eval set artifacts live in:

evals/
├── fr/
│   ├── queries.yaml              # 50 queries with categories
│   ├── candidates.jsonl          # one row per (query, doc) candidate pool
│   └── labels.jsonl              # one row per (query, doc) label with judge metadata
├── bills/
│   └── ...
└── usc/
    └── ...

(The previous evals/ directory was deleted in the SQLite-pivot rollback. The methodology described here recreates and improves on it.)

Metrics

Same as the deleted Phase C eval — known good for retrieval evaluation:

Metric	What it measures	Why we use it
NDCG@10	Discounted cumulative gain at 10 — rewards highly relevant docs early in the ranking	Primary metric. Sensitive to ranking quality, not just inclusion.
Recall@10	Fraction of all relevant docs retrieved in top 10	Sanity check. Don't ship a chunker that loses recall even if NDCG looks fine.
MRR	Mean reciprocal rank of the first relevant result	Tells us "did the first result the agent sees actually help."
Mean chunk count per doc	How fragmented the chunker is	Affects storage + scan time. Reported alongside quality metrics for cost awareness.
p50 / p95 chunk size (tokens)	Distribution of chunk sizes	Catches degenerate behavior — section-aware producing 50K-token chunks for one doc and 20-token chunks for another.

Decision rule

When does an experiment win? The honest threshold:

NDCG@10 improvement of ≥0.02 absolute (over the live chunker) on the eval set, with no Recall@10 regression worse than 0.02 absolute. That's roughly the noise floor of the labeled set; smaller gains aren't reliably real.
Mean chunk count within 2× of the live chunker. A win that triples storage + scan cost has to be evaluated against that cost.
p95 chunk size ≤ 2048 tokens (fits Arctic comfortably).
The chunker is reproducible (no LLM-in-the-loop nondeterminism, no version-drifting libraries).

If the candidate beats the live chunker on NDCG@10 by ≥0.02 with the other constraints satisfied, swap it in production for that source. Otherwise, keep the live chunker.

The harness

A runnable Python module at shared/josh_substrate/src/josh_substrate/eval/chunker_compare.py (rough sketch):

async def run_chunker_eval(
    source: str,                          # 'federal_register'
    corpus_path: Path,                    # /data/corpus/federal_register/
    eval_set_path: Path,                  # evals/fr/
    chunkers: list[Chunker],              # variants to compare
    embedder: Embedder,                   # the live embedder (Arctic-L-v2)
    output_path: Path,                    # where to write the comparison report
) -> None:
    """For each chunker:
       1. Build a temp SQLite file
       2. Parse + chunk the corpus into it
       3. Embed all chunks via the embedder
       4. Run the eval set, scoring each query
       5. Aggregate metrics
       Write a comparison table + per-query breakdown to output_path.
    """

Output: a markdown report with the comparison table, per-query winners/losers, and the cost numbers (chunk count, storage, eval time). Comparable to the deleted evals/federal_register/runs/...comparison.md files.

CLI: josh-eval chunker --source=federal_register --chunkers=fixed_char_4k800_v1,recursive_v1,section_aware_fr_v1.

Per-source considerations

One chunker per source is the production rule. Different sources have different structures; different structures want different chunkers. Don't try to find a one-size-fits-all winner.

Federal Register

The XML body has a clear hierarchy: <RULE> / <PRORULE> / <NOTICE> / <PRESDOCU> → <PREAMB> (preamble) + <SUPLINF> (supplementary info) → <HD> headings (with SOURCE attribute giving heading level: HED, HD1, HD2).

A section-aware chunker walks the XML, treats each <HD>-bounded block as a chunk, and falls back to recursive splitting inside any block that exceeds ~2048 tokens. The chunk metadata captures section_path (e.g., ["SUPLINF", "Background", "Statutory Authority"]) so the agent can present "from the Background section" in citations.

Special-case for Presidential Documents: the wrapper element (<EXECORD>, <PROCLA>, <PRMEMO>, etc.) contains the full body text without internal sections in many cases. The chunker should fall back to recursive splitting at paragraph boundaries.

Bills

BILLSTATUS XML has the metadata; bill TEXT (in BILLS or PLAW) has the body. For Step 1, focus on the enacted text for Public Laws (USLM 2.0) and the latest version text for active bills. Live bill XML uses the older "billres" DTD which is less structured.

Section-aware chunker uses USLM <section> elements when available, falls back to recursive splitting on <subsection>/<paragraph> for older DTDs.

US Code (USC) and Public Laws

USLM 2.0. The section is the citation atom. A chunker that walks the USLM tree and emits one chunk per <section> element (with subsections inline) is the obvious play. Cross-section chunks dilute citation precision badly here — USC queries are almost always citation-anchored ("what does 42 USC 1396a(a)(10) say?").

CRS reports

HTML rendering. No formal section markup; rely on <h1>/<h2>/<h3> headings and paragraph boundaries. A recursive splitter with markdown-style heading awareness (treats # Heading lines as boundaries) is the realistic v1 chunker.

Hearings + Congressional Record

Speaker turns are the citation atom. For CHRG and CREC, the parser already extracts speaker-name-prefixed turns. The chunker is just "one chunk per turn" with metadata carrying speaker, role (witness / member / chairman), and timing where available.

Long turns (>2048 tokens) get split with a recursive splitter as a fallback. Single chunks under ~50 tokens get merged with adjacent same-speaker turns.

GAO + CBO

PDF-derived plaintext. No useful structure markup. Recursive splitter with paragraph + sentence boundaries. These sources are likely the ones where chunking strategy makes the smallest difference — recursive is fine.

Implementation phases

When does the chunker eval framework actually get built?

Phase A — protocol + baseline (during Phase 0.3 ingester rebuild)

Implement the Chunker protocol in shared/josh_substrate/
Implement fixed_char_4k800_v1 (the previous default) and recursive_v1
Wire chunker selection into the ingester load stage
Schema for <source>_chunks carries chunker_id

This is the minimum to ship Phase 0 with a chunker. No eval framework yet — just the abstraction.

Phase B — section-aware FR (during Phase 0.6 first FR ingestion)

Implement section_aware_fr_v1 walking the FR XML <HD> structure
Run it as the production chunker for FR (high prior; instrumented to compare retroactively)
No formal eval yet — committing on prior

Phase C — eval framework (before Step 1 launch, summer 2026)

Build the eval harness (josh-eval chunker)
Author the FR query set and label it (LLM-as-judge bootstrap, manual spot-check)
Run head-to-head: fixed_char_4k800_v1 vs recursive_v1 vs section_aware_fr_v1
Publish the results as part of the OSS launch's retrieval research blog post
If section-aware doesn't win on NDCG@10, that's a real and surprising finding worth investigating

Phase D — repeat per source (Phase 1+ ingestion)

For each new source as it lands, build the source-specific chunker (or pick recursive as fallback)
Run the eval against the new source's corpus
Lock in the source's production chunker
Document the result in the source's per-source spec

Open questions

Where does the chunker live in the codebase? shared/josh_substrate/chunking.py (alongside models + DB helpers, available to both ingester and core) or josh-ingester/ingester/chunking.py (ingester-private, smaller dependency footprint)? Lean shared — josh-core may want to chunk uploaded files in the agent harness later (Step 2).
LLM-as-judge model and protocol. Which model? Claude Opus is the gold standard for relevance judgments; Sonnet is cheaper and may be sufficient. What's the labeling prompt format that generalizes across queries? Worth a small calibration study (50 manual labels vs LLM labels, measure agreement).
Single eval set or per-source eval sets? Lean per-source. Different sources have different query patterns. But a shared "cross-source citation-following" eval might also be useful (queries that traverse multiple sources).
How often do we re-run the eval? On every chunker change. On every embedder change. As a CI gate? (Probably not — the eval takes minutes-to-hours per source. Run it on-demand when something changes.)
Eval set versioning. The eval set is itself a research artifact. When we add queries or rejudge labels, version it (evals/fr/v1/, evals/fr/v2/). Old runs stay reproducible against the version they used.
Late chunking research thread. Worth a separate doc + experiment if it becomes interesting. Requires confirming Arctic-L-v2's max sequence length (8192 tokens?) and whether mean-pooling token embeddings within chunk boundaries gives a real quality lift versus segment-then-embed. Defer to v1.x.
Public chunker eval as a research artifact. The OSS launch's retrieval research blog post should include the chunker comparison results. That means the eval set + labels + run files all need to be reproducible from the OSS code. Decide what's in the OSS repo vs cloud-private (lean: queries + labels in OSS, the labeled cloud corpus stays private).
Hierarchical (parent-child) chunking. Out of Phase C scope but a likely v1.x improvement. Children for retrieval (~256 tokens), parents for context window (~2048 tokens). Adds complexity to the schema (parent/child tables) and the agent's retrieval flow (which set does it query, how does it expand).

Cross-references

docs/operations/eval-architecture.html — the unified eval framework. Authoritative for runner location, CLI shape, eval-set artifact location, and substrate-snapshot pinning. Where the four supersessions noted at the top of this doc are spec'd in detail.
docs/operations/ingestion-architecture.html — where the chunker plugs into the load stage
docs/sources/federal-register.html — the FR XML structure that drives section-aware chunking for FR
docs/sources/bills.html — bill text structure (USLM for PLAW, billres for live bills)
docs/sources/us-code.html — USC USLM hierarchy
docs/sources/hearing-transcripts.html — speaker-turn structure
CLAUDE.md "Storage backend decision" — why SQLite + sqlite-vec, why Arctic-L-v2 as the embedder