substrateshippedp0

Substrate body normalization

substrate-body-normalization · updated 2026-05-10T18:30:00Z · owner rritz

Header

Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.

Why

Every text-bearing source needs raw payload preserved on disk *and*
a normalized Markdown body the chunker, embedder, and FTS5 indexer
consume. Per-format Normalizer implementations (html_v1, text_v1
shipped; fr_xml_v1, uslm_v1, ecfr_xml_v1, pdf_v1 planned) all
conform to the Normalizer protocol in josh_substrate.protocols and
return a NormalizedBody (markdown + heading tree + artifacts +
warnings). The registry lookup (josh_substrate.normalizers.registry)
is the single dispatch point — loaders never import a specific
normalizer.

User stories

As a Source author, I want to declare the format my body comes in (`html_v1`, etc.) and have the runner do the rest so that I'm not re-implementing markdownify-with-lossy-table-fallback per source.

As a chunker, I want a Markdown body with a heading tree and stable anchors so that chunk locators include heading paths and `source_url#anchor` deep links.

As an OSS contributor adding a new format, I want one Protocol to satisfy and a registry entry so that the addition is local and tested in isolation.

Acceptance criteria (EARS)

When a `Normalizer` is invoked with raw bytes, it shall return a `NormalizedBody` whose `markdown`, `headings`, `artifacts`, and `warnings` fields are populated; the `markdown` shall be safe to write to a `body_text TEXT` column.
When a heading hierarchy exists in the source, the normalizer shall preserve `level`, `text`, and (when available) an `anchor` for each heading in `NormalizedBody.headings`.
When a fragment cannot be cleanly converted to Markdown (e.g., a complex table), it shall be captured as an `Artifact` instead of being silently dropped or HTML-bled into the markdown.
Where a Source has registered a normalizer version, loaders shall stamp `body_normalizer_version`, `body_text_sha256`, and `body_normalized_at` on the parent row at write time (see migration `0002_crs.py`).
When `josh_substrate.normalizers.registry.get(name)` is called with an unregistered name, the system shall raise rather than silently returning `None`.

Success determiner

kindtest_file

Path

shared/josh_substrate/tests/normalizers/test_html_v1.py

Runner

uv run pytest shared/josh_substrate/tests/normalizers -v

`html_v1` is the reference normalizer. Ten behavior-isolated tests cover heading hierarchy, citation/emphasis preservation, relative-link resolution, table → markdown, lossy-table → artifact, no HTML-tag bleed, script/nav stripping, near-empty input, and length-ratio sanity warnings. Pairs with the manual 12-sample stratified validation gate documented in `https://docs.usejosh.com/sources/crs-reports/` (Normalizer validation). Re-run when `html_v1.py` changes; add a new test file under `tests/normalizers/` when a new Normalizer ships.

Clarifications needed

None.

Out of scope

PDF normalization. Specced separately under `pdf_v1`; depends on a separate decision about OCR vs structured-text extraction.
LLM-based normalization. Not in v1 — deterministic, replayable normalizers only.
Body chunking. That's `substrate-vector-pipeline`'s concern; this spec stops at the Markdown body and heading tree.

Dependencies

substrate-sqlite-foundation

Plan

Implementation:

- Protocols: Normalizer, NormalizedBody, Heading, Artifact
in shared/josh_substrate/src/josh_substrate/protocols.py. The
Protocol is runtime_checkable so registry lookups can validate
shape at boot.
- Registry: shared/josh_substrate/src/josh_substrate/normalizers/registry.py
keeps a name → Normalizer dict; get(name) raises KeyError on miss.
- html_v1: normalizers/html_v1.py (290 LOC). Uses markdownify
with a custom converter set for table preservation, falls back to
Artifact for tables that exceed a complexity heuristic. Strips
<script>, <nav>, and footer chrome via lxml selectors.
- text_v1: trivial passthrough for already-Markdown sources.
- Stamping: loaders call the normalizer, hash the markdown, write
body_text + body_text_sha256 + body_normalizer_version +
body_normalized_at (the last via parent_update_extra in the
upsert helper — see CRS loader). Migration 0002_crs.py defines
those columns on crs_reports; analogous columns ship per source
that has bodies.
- Tests: tests/normalizers/test_html_v1.py (170 LOC, 10 tests,
all green via poe test).

Tasks

5 of 5 done.

t1 Protocol + NormalizedBody/Heading/Artifact dataclasses
t2 Registry with strict get() and module-import-time registration
t3 html_v1 normalizer + 10 unit tests
t4 text_v1 passthrough for already-Markdown sources
t5 body_normalizer_version + body_text_sha256 columns wired in CRS migration

Changelog

2026-05-10T18:30:00Z (new)→shipped Spec content backfilled retrospectively. The normalizer protocol, registry, and `html_v1` reference implementation have been live since the CRS Reports ingester landed (the worked-example for the whole pipeline). This commit replaces the [STUB] placeholders with acceptance criteria grounded in the `Normalizer` protocol and the ten passing unit tests under `tests/normalizers/`.

docs/spec/substrate-body-normalization.html · generated by bin/build-spec.py