Adding a new data source

How to add a new source to the Josh substrate — from per-source spec doc through full backfill on the production server. CRS Reports is the worked example: every step links to its concrete CRS counterpart.

This runbook is the contract between you and the substrate. Follow it in order, and the new source will:

Land in the unified SQLite file at /data/josh.db with full FTS5 + vec0 support.
Carry the standard 6 citation columns and 7 body columns — identical shape across every source.
Pass the same manual-validation gates that CRS passed.
Drop into the runner with no framework changes.

What changes vs. what doesn't

The framework's whole point is that adding a source touches a tightly bounded set of files. Calibrate your expectations accordingly:

Per source, you change:

The per-source ingester spec YAML (docs/spec/data/<source>-ingester.yaml) and rendered HTML.
The Alembic migration for this source.
The Pydantic models for this source.
The citation formatter for this source.
A normalizer file only if the source ingests a format we haven't seen before.
The source subpackage at josh-ingester/ingester/sources/<source>/ (__init__.py, discover.py, fetch.py, parse.py, load.py — and optionally cli.py for source-local maintenance commands).
Tests for the source module and the citation formatter.
Two inventory pages: docs/josh-data-sources.html and docs/data-status.html.

You do not change:

The global ingester CLI (josh-ingester/ingester/cli.py). Source-specific commands belong in the source's own cli.py; the global CLI auto-mounts them.
The runner (josh-ingester/ingester/runner.py) or the scheduler.
The state-table writers (josh-ingester/ingester/state.py) or the per-source advisory-lock helper.
The substrate's protocol definitions, body-normalization machinery, or chunking framework.
The Source protocol itself.

Steps 9–12 of this runbook exercise the new source via the existing generic CLI and runner — no CLI code edits required. If you find yourself reaching outside the bounded set, stop and re-read this callout: most of the time the right answer is to do the work inside the source subpackage, not in the framework.

What you're committing to

Adding one source is roughly 6–10 hours after the framework is in place (which it is, post-CRS). Per-source deliverables, in order:

Per-source spec doc at docs/sources/<name>.html
Alembic migration at shared/josh_substrate/src/josh_substrate/migrations/versions/<NNNN>_<source>.py
Pydantic models at shared/josh_substrate/src/josh_substrate/models/<source>.py
Citation formatter at shared/josh_substrate/src/josh_substrate/citations/formatters/<source>.py
Normalizer registry entry (or new normalizer file under shared/josh_substrate/src/josh_substrate/normalizers/ if the format is new)
Source subpackage at josh-ingester/ingester/sources/<source>/ (__init__.py + per-stage discover.py / fetch.py / parse.py / load.py + optional cli.py)
Tests at josh-ingester/tests/sources/test_<source>.py and (optionally) shared/josh_substrate/tests/citations/test_<source>.py
Manual HTML/XML→Markdown validation against ≥12 stratified samples — REQUIRED.
URL verification gate against 10 random records — REQUIRED.

Tools you have available

Use the existing kit; don't reinvent.

Need	Tool	Where
HTTP fetch + retries + rate-limit	`JoshHttpClient`	`josh-ingester/ingester/http.py` (constructed by the runner; passed into `Source.discover` and `Source.fetch`)
Anti-bot / JS-rendered pages	`browser-use` skill	Claude Code skill, available globally. Use for any source where plain HTTP returns 403/blocked
HTML → Markdown	`html_v1`	`shared/josh_substrate/src/josh_substrate/normalizers/html_v1.py`
Plain text → Markdown	`text_v1`	`shared/josh_substrate/src/josh_substrate/normalizers/text_v1.py`
XML / PDF	(none yet)	When you ingest the first such source, write `<source>_xml_v1` or use `pdf_v1` (TBD)
Substrate connection + sqlite-vec load	`josh_substrate.db.async_engine`	`shared/josh_substrate/src/josh_substrate/db.py`
Citation registry	`josh_substrate.citations`	Auto-discovers your `formatters/<source>.py`
Normalizer registry	`josh_substrate.normalizers`	Auto-discovers your normalizer module if you add one
Path conventions	`josh_substrate.corpus.raw_path`	`/data/corpus/<source>/bodies/raw/...` (markdown lives in DB column, not on disk)
State-table writers	`ingester.state`	Used by the runner; you don't call these directly
Per-source advisory lock	`ingester.concurrency.source_lock`	Used by the runner; you don't call these directly
Per-source CLI extension	Module-level `cli: typer.Typer` in `sources/<source>/cli.py`	Optional. The global ingester CLI auto-mounts your source's `cli` attribute under `josh-ingester <source-name-with-dashes>` (underscores → hyphens). Use for source-local maintenance commands (one-shot data fixes, etc.). See `crs_reports/cli.py` for the worked example (`josh-ingester crs-reports fix-types`).

Don't roll your own when these exist. The point of the framework is uniformity across sources.

Prerequisites

The substrate is provisioned per new host setup and migration 0001 (state tables) is applied.
The josh_substrate shared package is installed (pip install -e shared/josh_substrate for local dev; the Dockerfile handles it for production).
CRS is shipped and serves as the worked example — every step below has a "look at CRS for…" pointer.
You can SSH into josh (the production droplet) and your Kamal credentials are decrypted.

Step 0 — Open the spec YAML and drive it as you go

Every source ingester is governed by a spec item at docs/spec/data/<source>-ingester.yaml — that file is the contract for what "shipped" means for this source, and the place an agent (or you, next session) reads to know what's left.

If the spec doesn't exist yet, copy the template:

cp docs/spec/data/_templates/source.yaml docs/spec/data/<source>-ingester.yaml
# fill in: id, title, why, user_stories, acceptance_criteria, success_determiner
python bin/build-spec.py    # validates against _schema.json + regenerates HTML

Drive the spec through this runbook:

The 14 standard tasks in the template (t1–t14) intentionally mirror the 14 steps below. Tick each task done: true as you finish the corresponding step. Add source-specific tasks at t15+.
Flip status: planned → in_progress when you start coding (Step 4). The build script auto-prepends a changelog entry on every status transition.
Flip status: in_progress → verified when the success_determiner runs green against the deployed substrate (after Step 11).
Flip status: verified → shipped when the cron is registered and observed firing in production (after Step 12). Verified ≠ shipped — verified means the determiner passed; shipped means the change is live.
Re-run python bin/build-spec.py after each YAML edit (or use the in-browser editor on docs/spec/ in Chrome/Edge — the File System Access API writes back silently).

The same write-the-criteria-first discipline applies to the success_determiner. Pick kind: bash or kind: sql over kind: manual wherever possible — manual is a fallback, not a goal.

Step 1 — Author the per-source spec doc

File: docs/sources/<name>.html. Reference template: docs/sources/crs-reports.html. (Note: this is the per-source documentation page — endpoints, response shapes, indicative DDL, probe findings. The ingester spec — acceptance criteria, success determiner, task contract — is a separate file you set up in Step 0.)

Required sections:

Section	What goes here
Quick reference	A table at the top: source name, primary URL pattern, auth requirements, rate limits, format, indicative volume
Source priority decision	Which API/mirror you picked and why. If multiple options exist (Congress.gov vs EveryCRSReport mirror, etc.), document the tradeoff
Access notes	Bot walls, auth flows, gotchas. Note here if the source needs `browser-use`
Endpoints	Concrete URL templates with parameter placeholders
Vocabulary / enums	Type codes, status enums, anything an agent would benefit from understanding
Stable ID format	The prefix-namespaced ID (e.g., `crs:R48481`, `fr:2026-08558`)
Response shapes	Sample JSON/XML payloads for each endpoint. Capture during initial probe
Volume	Records all-time, records/year, expected on-disk size of raw + DB
Caching / incremental sync	Watermark strategy, ETag / If-Modified-Since support, RSS feed availability
Schema (indicative DDL)	The shape you intend to write. Migration is the source of truth — this is illustrative
Download / update strategy	Backfill plan + daily incremental plan
Failure modes	Known flake patterns + recovery
Open questions	Things you noticed but defer to v1.x

Add the source to the inventory table in docs/josh-data-sources.html.

Step 2 — Decide stable ID format

The Josh ID format is <source-prefix>:<source-native-key>. Conventions:

Property	Required
Lowercased prefix matching the source name	Yes (`crs:`, `fr:`, `bill:`)
Source-native key preserved verbatim	Yes (preserve case, separators)
Globally unique within source	Yes
Valid as a TEXT column value (no shell-special chars unless quoted)	Yes

Examples in production:

crs:R48481 — CRS report number
crs:R48481:c0 — chunk index zero of that report
crs:R48481:author:0 — first author of that report

Keep them human-readable. Agents will quote these in citations and logs; debugging is easier when IDs aren't UUIDs.

Step 3 — Pick or build the normalizer

Decision tree:

Source format	Action
HTML	Reuse `html_v1`. Add per-source cleanup (e.g., wrapper-noise stripping) in your source module's `parse()` stage, not in the normalizer
XML with new schema	Write `<source>_xml_v1` as a new file under `normalizers/` — `lxml`-based walker that preserves heading hierarchy and produces `Markdown + Heading[]`
Plain text	Reuse `text_v1`
PDF	Use `pdf_v1` (when shipped). PDF normalization is queued (slower than inline) — record loads with `body_text=NULL` and an entry in `ingestion_normalization_queue`; the worker fills it in later

REQUIRED — Manual normalizer validation gate

Per the locked decision in the plan, every source's normalizer must pass the 12-sample stratified validation gate before it ships. This is non-optional. The bar is binary: all 12 samples × all 12 criteria pass. No averaging.

Sample selection rule:

Select 12 fixtures stratified by an intrinsic source dimension (CRS uses report-type prefix; FR would use document type; bills would use bill type or congress).
Pick the most-recently-published instances of each stratum that exceed a minimum size (filters empty stubs).
Pin the sample IDs in your per-source spec doc's "Normalizer validation" section so the gate is reproducible.

Workflow per sample:

Open the source URL in a browser (or fetch via browser-use).
Run the converter: await normalizer.to_markdown(raw_path, source_hint=url).
Render the Markdown side-by-side (VS Code preview / glow / similar).
Walk the criteria table below; score pass/fail.

Criteria with pass bars:

Criterion	Pass bar
Heading hierarchy	All `<h1>`–`<h4>` map to `#`–`####` in correct nesting order. No skipped levels.
Section ordering	First-paragraph-of-section text in source order
Footnotes	Markers preserved; round-trip from marker to text without manual hunting
Tables	Every `<table>` becomes Markdown table OR an `Artifact` reference. Fail if a table silently becomes prose
Block quotes	`<blockquote>` → `>`; statutory excerpts visually distinct from analysis
Citations	Statutory citations preserved verbatim; no Unicode mangling of `§`; no whitespace collapse around abbreviations
Links	Every `<a href>` → `[text](url)`; URLs absolute (relative resolved against source URL)
Lists	`<ol>`/`<ul>` preserved with correct nesting; numbering preserved
Emphasis	`<em>`/`<strong>` → ``/`*`
Images / figures	`<img>` / `<figure>` → Markdown link with alt text or `Artifact` reference; captions preserved
No HTML tag bleed	Zero raw `<div>`, `<span>` (tables exempt where lossy)
Body length sanity	`len(markdown) / len(html)` ∈ [0.3, 0.8]

Promotion criteria: all 12 samples × all 12 criteria pass. One fail = file an issue, fix the converter, re-run all 12.

Document the run in the source's spec doc under a "Normalizer validation" section. CRS's docs/sources/crs-reports.html "Normalizer validation" section is the template — duplicate that structure for your source.

Step 4 — Write the migration

File: shared/josh_substrate/src/josh_substrate/migrations/versions/<NNNN>_<source>.py. Reference: 0002_crs.py.

Pattern: parent table → version/junction tables → <source>_chunks table → FTS5 virtual table + sync triggers → vec0 virtual table.

Required parent-table columns

Every record-bearing parent table includes these — copy verbatim:

-- Standard citation columns (Decision 6 — same on every source)
source_url TEXT NOT NULL,
source_org TEXT NOT NULL DEFAULT '<your source org>',  -- e.g. 'Federal Register', 'GAO'
published_at TEXT,
retrieved_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')),
raw_sha256 TEXT,
citation_string TEXT NOT NULL,

-- Body columns (Decision 5 — raw on disk, markdown in column)
body_raw_path TEXT,
body_raw_format TEXT,                  -- 'html' | 'xml' | 'pdf' | 'text'
body_raw_sha256 TEXT,
body_text TEXT,                        -- markdown — what FTS5 indexes
body_text_sha256 TEXT,
body_normalized_at TEXT,
body_normalizer_version TEXT,

-- Lifecycle
fetched_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')),
parsed_at TEXT,
inserted_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')),
updated_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')),
raw_json TEXT                          -- preserved upstream payload for re-parse

About raw_json: populate this with the upstream JSON metadata payload (or, for non-JSON sources, the parsed metadata serialized to JSON) at load time. It's the safety net that lets a future migration re-derive any column without re-fetching from the upstream API — same role body_raw_path plays for body text. CRS uses it to preserve the EveryCRSReport per-report JSON; see crs_reports.py.

FTS5 virtual table

Standard 4-column ordering: (title, abstract, action, body). The action column is the source-specific category that the agent might want to boost (type_id for CRS, agency for FR, chamber for votes, etc.).

BM25 weights are query-time, not schema. The standard tuple is (10.0, 4.0, 2.0, 1.0) for (title, abstract, action, body) — applied at the call site via bm25(<source>_fts, 10.0, 4.0, 2.0, 1.0). The search wrapper that applies these uniformly across sources lives with the rest-api-search spec; until that ships, source-spec acceptance criteria should assert "FTS5 returns BM25-ranked results" without pinning the tuple.

CREATE VIRTUAL TABLE <source>_fts USING fts5(
    title, abstract, action, body,
    content='<source>',
    content_rowid='rowid',
    tokenize='porter unicode61'
);

Plus three sync triggers (ai, au, ad for after-insert / -update / -delete). See 0002_crs.py for the exact pattern — copy it.

vec0 virtual table

Created empty in this migration (chunking + embedding wires up in Phase 8 of the plan):

CREATE VIRTUAL TABLE <source>_chunks_vec0 USING vec0(
    chunk_id TEXT PRIMARY KEY,
    embedding float[1024],         -- 1024-dim matches Arctic-L-v2
    embedding_bq bit[1024]         -- binary-quantized for fast brute-force scan
);

Verify

SUBSTRATE_DB_PATH=/tmp/josh-test.db alembic upgrade head
sqlite3 /tmp/josh-test.db ".tables"   # all your new tables show up
sqlite3 /tmp/josh-test.db "SELECT name, type FROM sqlite_master WHERE name LIKE '<source>%'"

Step 5 — Write Pydantic models

File: shared/josh_substrate/src/josh_substrate/models/<source>.py. Reference: models/crs.py.

One Pydantic v2 model per table. Include model_config = ConfigDict(from_attributes=True) on each so SQLAlchemy row → Pydantic round-trips work.

The parent model carries the citation columns and body columns. Children carry FK to parent + their own fields. For the chunk model, import ChunkLocator from josh_substrate.protocols — don't redefine.

Step 6 — Write the citation formatter

File: shared/josh_substrate/src/josh_substrate/citations/formatters/<source>.py. Reference: formatters/crs.py.

Export citation_for(record, version=None) -> str. Bluebook-flavored. Pragmatic — Josh isn't a legal product.

Source	Format example
FR document	`91 Fed. Reg. 12345 (Mar. 4, 2026).`
Public Law	`Pub. L. No. 119-21 (2026).`
US Code section	`42 U.S.C. § 1396a (2024).`
GAO report	`U.S. Gov't Accountability Off., GAO-25-107521, Title (Apr. 2025).`
Committee report	`H.R. Rep. No. 119-100 (2025).`
Bill	`H.R. 1234, 119th Cong. (2025).`
Hearing transcript (CHRG)	`Hearing Title: Hearing Before the H. Comm. on X, 119th Cong. (2025).`
Roll call vote	`Roll Call Vote No. 362 (House), 119th Cong., 1st Sess. (Mar. 4, 2025).`
LDA filing	`Lobbying Disclosure Act Filing {filing_uuid}, {client} via {registrant} (Q3 2025).`
SAP	`Statement of Administration Policy on H.R. 1234 (Jan. 22, 2025).`
Congressional Record	`171 Cong. Rec. H1234 (daily ed. Mar. 4, 2025) (statement of Rep. X).`

Add unit tests at shared/josh_substrate/tests/citations/test_<source>.py. At least 4: canonical record, version-parametric path, edge cases (missing date, missing title), and any source-specific quirks.

Step 7 — Register normalizer + citation formatter

Both registries auto-discover via pkgutil.iter_modules. Just dropping your file in the right location is enough — no manual registration.

For the normalizer: if you wrote a new one, the file must export a module-level normalizer instance (normalizer = MyNormalizer()).

For the citation formatter: the file must export a module-level citation_for callable.

Verify:

python -c "from josh_substrate.normalizers import all_normalizers; print(sorted(all_normalizers().keys()))"
python -c "from josh_substrate.citations.formatters import all_formatters; print(sorted(all_formatters().keys()))"

Step 8 — Write the source subpackage

Directory: josh-ingester/ingester/sources/<source>/. Reference: sources/crs_reports/.

Each concrete source is a Python subpackage — a directory containing __init__.py plus one file per pipeline stage. The single-file form (sources/<source>.py) is no longer accepted; the registry walks subpackages and the success_determiner of the ingester-modularity-pass spec asserts that every concrete source is a directory.

The skeleton:

sources/<source>/
├── __init__.py       # exposes `source = <Source>()` for the registry; thin
│                     # class delegating to the four stage modules. Optional:
│                     # re-export `cli` from cli.py.
├── discover.py       # async generator of FetchTasks; updates state.last_seen_key
├── fetch.py          # download payloads to /data/corpus/<source>/bodies/raw/…
├── parse.py          # bytes-on-disk → ParsedRecord; runs normalizer here
├── load.py           # transactional upsert
└── cli.py            # OPTIONAL: source-local Typer for maintenance commands

Always split into the four stages, even when a stage is short. Uniformity makes the architecture legible across sources and lets per-stage work proceed in parallel without merge collisions. If a helper is used by exactly one stage, put it in that stage's file; if it's used across stages or is purely shape/formatting logic, add a helpers.py alongside the stages.

The __init__.py implements the Source Protocol from josh_substrate.protocols by delegating to the stage modules:

# sources/<source>/__init__.py
from __future__ import annotations

from .discover import discover as _discover
from .fetch import fetch as _fetch
from .load import load as _load
from .parse import parse as _parse
# from .cli import cli   # uncomment if you wrote a source-local CLI


class <Source>Source:
    name = "<source>"
    schedule = "30 7 * * *"   # cron, or None for manual-only
    depends_on: list[str] = []  # other source names that must load first
    # cli = cli   # optional — uncomment alongside the import above

    async def discover(self, state, http):
        async for task in _discover(state, http):
            yield task

    async def fetch(self, task, http):
        return await _fetch(task, http)

    async def parse(self, raw_paths):
        async for record in _parse(raw_paths):
            yield record

    async def load(self, records, db):
        return await _load(records, db)


source = <Source>Source()  # module-level instance for the registry

Each stage module is a plain async function (async def discover(state, http), async def fetch(task, http) -> dict[str, Path], async def parse(raw_paths) -> AsyncIterator[ParsedRecord], async def load(records, db) -> LoadStats). The class is a thin shim so the registry's Source protocol stays satisfied — the actual logic lives in the stage files.

Optional source-local CLI

If your source needs a maintenance command (one-shot data fix, manual reindex, etc.), put it in sources/<source>/cli.py as a typer.Typer instance and re-export it from __init__.py as cli. The global ingester CLI auto-mounts it under josh-ingester <source-name-with-dashes> at startup. Don't add the command to the global cli.py. See crs_reports/cli.py for the worked example — its fix-types command became josh-ingester crs-reports fix-types after the modularity pass moved it out of the global CLI.

Source-specific cleanup (e.g., wrapper-noise stripping for CRS, <HD>-header recognition for FR XML) belongs in parse.py, not in the normalizer. Keep normalizers source-agnostic.

Step 9 — Smoke test

Local first, server next.

# Local (after `pip install -e josh-ingester`)
SUBSTRATE_DB_PATH=/tmp/josh-test.db CORPUS_DIR=/tmp/josh-corpus LOCK_DIR=/tmp/josh-locks \
  josh-ingester backfill <source> --limit 5
sqlite3 /tmp/josh-test.db "SELECT count(*) FROM <source>"   # 5 rows
sqlite3 /tmp/josh-test.db "SELECT * FROM <source>_fts WHERE <source>_fts MATCH 'something' LIMIT 1"

# Server (after first deploy)
kamal app exec --reuse 'josh-ingester backfill <source> --limit 100'
ssh josh 'sqlite3 /data/josh.db "SELECT count(*) FROM <source>"'
ssh josh 'ls /data/corpus/<source>/bodies/raw/ | wc -l'

(josh-ingester is the console script defined in josh-ingester/pyproject.toml; python -m ingester is the equivalent module-form invocation if the script isn't on PATH in your environment.)

Pass criteria: N rows in parent table, ≥N raw payload files on disk, FTS5 query returns hits, all 6 citation columns NOT NULL, no recurring errors in ingestion_logs.

Add tests at josh-ingester/tests/sources/test_<source>.py. Reference: tests/sources/test_crs_reports.py. Cover: helper functions; discover() defensive parsing (good rows + malformed rows + filter watermark + backfill mode); parse() against real fixtures asserting all citation+body columns populate.

Step 10 — URL verification gate

REQUIRED. Random-sample 10 records from the smoke backfill. For each:

Use the browser-use skill to fetch record.source_url. Preferred over curl because federal sources frequently bot-wall plain HTTP.
Pass = page loads (200) and the visible title matches the row's title.
Fail = bad URL pattern in your parser. Fix, re-backfill, re-verify.

Document the verification result in your spec doc's "URL verification" section (a 2-line table is plenty: 10/10 pass, date, sample IDs).

Step 11 — Full backfill

For backfills that fit comfortably in a 10-minute window, the synchronous form is fine:

kamal app exec --reuse 'josh-ingester backfill <source>'

For multi-hour backfills (CRS is ~6 hours; FR full backfill will be longer), the local Kamal client can be killed by client-side timeouts, taking the SSH session with it. Detach the work from any SSH session by using docker exec -d directly:

CONTAINER=$(ssh josh 'docker ps --format "{{.Names}}" | grep ^josh-ingester-workers-')
ssh josh "docker exec -d $CONTAINER sh -c 'josh-ingester backfill <source> > /data/<source>-backfill.log 2>&1'"

Then monitor progress periodically (the runner writes incremental counters to ingestion_runs and ingestion_tasks):

ssh josh "docker exec $CONTAINER josh-ingester status --source <source> --limit 1"
ssh josh 'tail /data/<source>-backfill.log'
ssh josh 'df -h /data'   # confirm disk headroom

The container's CMD is sleep infinity, so it stays alive after the detached job finishes. The log on /data/ (bind-mounted volume) is visible from the host without docker logs.

Pass criteria: ingestion_runs.status='success' (or 'partial' with error_count / discovered_count < 2%).

If errors > 2%, dig into ingestion_logs for that run, fix the root cause, re-backfill (idempotent — ON CONFLICT DO UPDATE handles repeats; fetch() skips files already on disk).

Step 12 — Schedule + deploy

Set the source's schedule to its cron string in the source-module file (schedule = "30 7 * * *" for daily 07:30 ET). Commit. Deploy:

cd josh-ingester
env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal deploy
kamal app exec --reuse 'josh-ingester schedule'   # confirm registry picks it up

(Note: as of this writing the scheduler isn't wired to a long-running daemon yet — schedule is informational. When the substrate-cron-scheduler spec ships, this command will register the cron job; until then, set the value correctly so it's ready to take effect on first daemon boot.)

Step 13 — Add to admin overview

No code change needed. The admin page at josh-core /admin/ingestion/sources (when shipped) reads the source registry directly. Your new source appears automatically.

Step 14 — Update the inventory + status (both pages)

Two hand-authored inventory pages must stay in sync any time a source's status changes — both on this commit:

docs/josh-data-sources.html — the v1/v2 inventory index. Bump the source's status from schema-drafted to ingestion-built.
docs/data-status.html — the live "what's loaded right now" view. Flip the source's row to shipped, fill the Notes column with the record count + load date, and bump the Last updated footer at the bottom.

This page is the canonical "what's loaded right now" surface until josh-core ships a live status endpoint. Don't let it drift.

And the spec YAML you opened in Step 0 — flip it now:

Verified: the success_determiner ran green against the deployed substrate. Flip status: in_progress → verified.
Shipped: the source is live, the cron is registered, the inventory pages reflect it. Flip status: verified → shipped. Tick the remaining tasks done: true. Run python bin/build-spec.py to regenerate the HTML.

Troubleshooting

Symptom	Likely cause	Fix
`csv.Error: line contains NUL`	Source CSV has a BOM or null byte	Read with `errors='replace'` and trust `csv.DictReader`
`csv.Error: field larger than field size limit`	One row has a giant cell	`csv.field_size_limit(sys.maxsize)` once at module load
Some rows skipped silently	Defensive parse exception eaten	Set `LOG_LEVEL=DEBUG` and look for `csv_row_malformed` in `ingestion_logs`
HTTP 403 on plain fetch	Anti-bot wall	Switch fetch to `browser-use` skill via the runner's HTTP client OR use TLS-fingerprint-spoofing client (`curl_cffi`, `tls-client`, `niquests`)
HTTP 502 / 503	Transient upstream	The runner's tenacity retries handle this. If recurring, raise rate-limit or pause
FK violation on insert	Cross-source FK to a record from a different source that hasn't loaded yet	Soft-FK pattern: omit the constraint from the migration, populate the column anyway, resolve at query time
`body_text` is empty after load	Normalizer error swallowed; or PDF queued and not yet processed	Check `body_normalizer_version` is set; check `ingestion_normalization_queue` for PDF sources
FTS5 query returns nothing	Triggers not firing	Verify with `SELECT count(*) FROM <source>_fts` — should equal parent count
`sqlite-vec` extension not loading	aiosqlite's connect path bypassed `_on_connect`	Ensure engine was built via `josh_substrate.db.async_engine`, not raw `create_async_engine`
Container has data but it's not on `/data`	Bind mount missing in `deploy.yml`	Check `volumes: ["/data:/data"]`; verify with `docker inspect` showing `(bind)` not `(volume)`
Discover yields zero tasks unexpectedly	`state.last_seen_key` is too high	For backfill, ensure runner-cleared (`mode='backfill'`); for incremental, check the watermark advanced sensibly last run
`OperationalError: database is locked` mid-backfill	SQLite single-writer + bounded fetch concurrency producing collisions on `ingestion_runs` / `ingestion_tasks` updates	At our scale (1 source at a time, fetch_concurrency=4), this hits ~0.05% of tasks during peak — acceptable. To reduce: drop `fetch_concurrency` to 1–2, or batch state-table writes through a single coroutine. Don't raise `busy_timeout` (already 10s).
Run finalized as `failed` despite most records loading	Top-level exception in runner's outer try/except (e.g., a SQLITE_BUSY hit during `mark_task` or `save_source_state` in the cleanup phase)	Inspect `error_summary` on the `ingestion_runs` row. If the per-record error rate is low and the substrate has the data, treat as effectively `partial` and move on — re-run incrementally if the watermark advanced too far.
Unexpectedly large `body_text` rows (50+ MB)	Source HTML is bloated (embedded large images, framework boilerplate, etc.)	Outliers are unavoidable for some sources. Add a `len(body_text) > N` check in the source's parse stage and either truncate, skip, or capture as `Artifact` if pathological
Stray raw files in `/data/corpus/<source>/bodies/raw/` not in DB	Fetch succeeded, parse/load failed for that record	Cross-check: `find /data/corpus/<source>/bodies/raw -name '*.html' \| while read f; do num=$(basename $f .html); echo "$num,$(sqlite3 /data/josh.db "SELECT 1 FROM <source> WHERE number='$num'")"; done` finds orphans
`wal` file growing very large during backfill	WAL hasn't checkpointed under sustained writer load	Periodically run `PRAGMA wal_checkpoint(TRUNCATE)` from a sidecar script during multi-hour backfills — or accept the WAL size during the run and let it auto-checkpoint when the writer rests

Reference

ingestion-architecture.html — framework spec (Source protocol, 4 stages, state tables, body normalization, citation metadata)
chunking.html — chunker protocol + experiment framework (Phase 8 work)
new-host-setup.html — provisioning a fresh server
migrations.html — Kamal pre-deploy hook for migrations
crs-reports.html — the worked example (per-source spec template)
sources/crs_reports.py — the worked example (Source implementation)
migrations/versions/0002_crs.py — the worked example (migration template)