Skip to content

Adding a new data source

How to add a new source to the Josh substrate — from per-source spec doc through full backfill on the production server. CRS Reports is the worked example: every step links to its concrete CRS counterpart.

This runbook is the contract between you and the substrate. Follow it in order, and the new source will:

  • Land in the unified SQLite file at /data/josh.db with full FTS5 + vec0 support.
  • Carry the standard 6 citation columns and 7 body columns — identical shape across every source.
  • Pass the same manual-validation gates that CRS passed.
  • Drop into the runner with no framework changes.

The framework's whole point is that adding a source touches a tightly bounded set of files. Calibrate your expectations accordingly:

Per source, you change:

  • The per-source ingester spec YAML (docs/spec/data/<source>-ingester.yaml) and rendered HTML.
  • The Alembic migration for this source.
  • The Pydantic models for this source.
  • The citation formatter for this source.
  • A normalizer file only if the source ingests a format we haven't seen before.
  • The source subpackage at josh-ingester/ingester/sources/<source>/ (__init__.py, discover.py, fetch.py, parse.py, load.py — and optionally cli.py for source-local maintenance commands).
  • Tests for the source module and the citation formatter.
  • Two inventory pages: docs/josh-data-sources.html and docs/data-status.html.

You do not change:

  • The global ingester CLI (josh-ingester/ingester/cli.py). Source-specific commands belong in the source's own cli.py; the global CLI auto-mounts them.
  • The runner (josh-ingester/ingester/runner.py) or the scheduler.
  • The state-table writers (josh-ingester/ingester/state.py) or the per-source advisory-lock helper.
  • The substrate's protocol definitions, body-normalization machinery, or chunking framework.
  • The Source protocol itself.

Steps 9–12 of this runbook exercise the new source via the existing generic CLI and runner — no CLI code edits required. If you find yourself reaching outside the bounded set, stop and re-read this callout: most of the time the right answer is to do the work inside the source subpackage, not in the framework.

Adding one source is roughly 6–10 hours after the framework is in place (which it is, post-CRS). Per-source deliverables, in order:

  1. Per-source spec doc at docs/sources/<name>.html
  2. Alembic migration at shared/josh_substrate/src/josh_substrate/migrations/versions/<NNNN>_<source>.py
  3. Pydantic models at shared/josh_substrate/src/josh_substrate/models/<source>.py
  4. Citation formatter at shared/josh_substrate/src/josh_substrate/citations/formatters/<source>.py
  5. Normalizer registry entry (or new normalizer file under shared/josh_substrate/src/josh_substrate/normalizers/ if the format is new)
  6. Source subpackage at josh-ingester/ingester/sources/<source>/ (__init__.py + per-stage discover.py / fetch.py / parse.py / load.py + optional cli.py)
  7. Tests at josh-ingester/tests/sources/test_<source>.py and (optionally) shared/josh_substrate/tests/citations/test_<source>.py
  8. Manual HTML/XML→Markdown validation against ≥12 stratified samples — REQUIRED.
  9. URL verification gate against 10 random records — REQUIRED.

Use the existing kit; don't reinvent.

NeedToolWhere
HTTP fetch + retries + rate-limitJoshHttpClientjosh-ingester/ingester/http.py (constructed by the runner; passed into Source.discover and Source.fetch)
Anti-bot / JS-rendered pagesbrowser-use skillClaude Code skill, available globally. Use for any source where plain HTTP returns 403/blocked
HTML → Markdownhtml_v1shared/josh_substrate/src/josh_substrate/normalizers/html_v1.py
Plain text → Markdowntext_v1shared/josh_substrate/src/josh_substrate/normalizers/text_v1.py
XML / PDF(none yet)When you ingest the first such source, write <source>_xml_v1 or use pdf_v1 (TBD)
Substrate connection + sqlite-vec loadjosh_substrate.db.async_engineshared/josh_substrate/src/josh_substrate/db.py
Citation registryjosh_substrate.citationsAuto-discovers your formatters/<source>.py
Normalizer registryjosh_substrate.normalizersAuto-discovers your normalizer module if you add one
Path conventionsjosh_substrate.corpus.raw_path/data/corpus/<source>/bodies/raw/... (markdown lives in DB column, not on disk)
State-table writersingester.stateUsed by the runner; you don't call these directly
Per-source advisory lockingester.concurrency.source_lockUsed by the runner; you don't call these directly
Per-source CLI extensionModule-level cli: typer.Typer in sources/<source>/cli.pyOptional. The global ingester CLI auto-mounts your source's cli attribute under josh-ingester <source-name-with-dashes> (underscores → hyphens). Use for source-local maintenance commands (one-shot data fixes, etc.). See crs_reports/cli.py for the worked example (josh-ingester crs-reports fix-types).

Don't roll your own when these exist. The point of the framework is uniformity across sources.

  1. The substrate is provisioned per new host setup and migration 0001 (state tables) is applied.
  2. The josh_substrate shared package is installed (pip install -e shared/josh_substrate for local dev; the Dockerfile handles it for production).
  3. CRS is shipped and serves as the worked example — every step below has a "look at CRS for…" pointer.
  4. You can SSH into josh (the production droplet) and your Kamal credentials are decrypted.

Step 0 — Open the spec YAML and drive it as you go

Section titled “Step 0 — Open the spec YAML and drive it as you go”

Every source ingester is governed by a spec item at docs/spec/data/<source>-ingester.yaml — that file is the contract for what "shipped" means for this source, and the place an agent (or you, next session) reads to know what's left.

If the spec doesn't exist yet, copy the template:

Terminal window
cp docs/spec/data/_templates/source.yaml docs/spec/data/<source>-ingester.yaml
# fill in: id, title, why, user_stories, acceptance_criteria, success_determiner
python bin/build-spec.py # validates against _schema.json + regenerates HTML

Drive the spec through this runbook:

  • The 14 standard tasks in the template (t1t14) intentionally mirror the 14 steps below. Tick each task done: true as you finish the corresponding step. Add source-specific tasks at t15+.
  • Flip status: planned → in_progress when you start coding (Step 4). The build script auto-prepends a changelog entry on every status transition.
  • Flip status: in_progress → verified when the success_determiner runs green against the deployed substrate (after Step 11).
  • Flip status: verified → shipped when the cron is registered and observed firing in production (after Step 12). Verified ≠ shipped — verified means the determiner passed; shipped means the change is live.
  • Re-run python bin/build-spec.py after each YAML edit (or use the in-browser editor on docs/spec/ in Chrome/Edge — the File System Access API writes back silently).

The same write-the-criteria-first discipline applies to the success_determiner. Pick kind: bash or kind: sql over kind: manual wherever possible — manual is a fallback, not a goal.

File: docs/sources/<name>.html. Reference template: docs/sources/crs-reports.html. (Note: this is the per-source documentation page — endpoints, response shapes, indicative DDL, probe findings. The ingester spec — acceptance criteria, success determiner, task contract — is a separate file you set up in Step 0.)

Required sections:

SectionWhat goes here
Quick referenceA table at the top: source name, primary URL pattern, auth requirements, rate limits, format, indicative volume
Source priority decisionWhich API/mirror you picked and why. If multiple options exist (Congress.gov vs EveryCRSReport mirror, etc.), document the tradeoff
Access notesBot walls, auth flows, gotchas. Note here if the source needs browser-use
EndpointsConcrete URL templates with parameter placeholders
Vocabulary / enumsType codes, status enums, anything an agent would benefit from understanding
Stable ID formatThe prefix-namespaced ID (e.g., crs:R48481, fr:2026-08558)
Response shapesSample JSON/XML payloads for each endpoint. Capture during initial probe
VolumeRecords all-time, records/year, expected on-disk size of raw + DB
Caching / incremental syncWatermark strategy, ETag / If-Modified-Since support, RSS feed availability
Schema (indicative DDL)The shape you intend to write. Migration is the source of truth — this is illustrative
Download / update strategyBackfill plan + daily incremental plan
Failure modesKnown flake patterns + recovery
Open questionsThings you noticed but defer to v1.x

Add the source to the inventory table in docs/josh-data-sources.html.

The Josh ID format is <source-prefix>:<source-native-key>. Conventions:

PropertyRequired
Lowercased prefix matching the source nameYes (crs:, fr:, bill:)
Source-native key preserved verbatimYes (preserve case, separators)
Globally unique within sourceYes
Valid as a TEXT column value (no shell-special chars unless quoted)Yes

Examples in production:

  • crs:R48481 — CRS report number
  • crs:R48481:c0 — chunk index zero of that report
  • crs:R48481:author:0 — first author of that report

Keep them human-readable. Agents will quote these in citations and logs; debugging is easier when IDs aren't UUIDs.

Decision tree:

Source formatAction
HTMLReuse html_v1. Add per-source cleanup (e.g., wrapper-noise stripping) in your source module's parse() stage, not in the normalizer
XML with new schemaWrite <source>_xml_v1 as a new file under normalizers/lxml-based walker that preserves heading hierarchy and produces Markdown + Heading[]
Plain textReuse text_v1
PDFUse pdf_v1 (when shipped). PDF normalization is queued (slower than inline) — record loads with body_text=NULL and an entry in ingestion_normalization_queue; the worker fills it in later

REQUIRED — Manual normalizer validation gate

Section titled “REQUIRED — Manual normalizer validation gate”

Per the locked decision in the plan, every source's normalizer must pass the 12-sample stratified validation gate before it ships. This is non-optional. The bar is binary: all 12 samples × all 12 criteria pass. No averaging.

Sample selection rule:

  • Select 12 fixtures stratified by an intrinsic source dimension (CRS uses report-type prefix; FR would use document type; bills would use bill type or congress).
  • Pick the most-recently-published instances of each stratum that exceed a minimum size (filters empty stubs).
  • Pin the sample IDs in your per-source spec doc's "Normalizer validation" section so the gate is reproducible.

Workflow per sample:

  1. Open the source URL in a browser (or fetch via browser-use).
  2. Run the converter: await normalizer.to_markdown(raw_path, source_hint=url).
  3. Render the Markdown side-by-side (VS Code preview / glow / similar).
  4. Walk the criteria table below; score pass/fail.

Criteria with pass bars:

CriterionPass bar
Heading hierarchyAll <h1><h4> map to ##### in correct nesting order. No skipped levels.
Section orderingFirst-paragraph-of-section text in source order
FootnotesMarkers preserved; round-trip from marker to text without manual hunting
TablesEvery <table> becomes Markdown table OR an Artifact reference. Fail if a table silently becomes prose
Block quotes<blockquote>>; statutory excerpts visually distinct from analysis
CitationsStatutory citations preserved verbatim; no Unicode mangling of §; no whitespace collapse around abbreviations
LinksEvery <a href>[text](url); URLs absolute (relative resolved against source URL)
Lists<ol>/<ul> preserved with correct nesting; numbering preserved
Emphasis<em>/<strong>*/**
Images / figures<img> / <figure> → Markdown link with alt text or Artifact reference; captions preserved
No HTML tag bleedZero raw <div>, <span> (tables exempt where lossy)
Body length sanitylen(markdown) / len(html) ∈ [0.3, 0.8]

Promotion criteria: all 12 samples × all 12 criteria pass. One fail = file an issue, fix the converter, re-run all 12.

Document the run in the source's spec doc under a "Normalizer validation" section. CRS's docs/sources/crs-reports.html "Normalizer validation" section is the template — duplicate that structure for your source.

File: shared/josh_substrate/src/josh_substrate/migrations/versions/<NNNN>_<source>.py. Reference: 0002_crs.py.

Pattern: parent table → version/junction tables → <source>_chunks table → FTS5 virtual table + sync triggers → vec0 virtual table.

Every record-bearing parent table includes these — copy verbatim:

-- Standard citation columns (Decision 6 — same on every source)
source_url TEXT NOT NULL,
source_org TEXT NOT NULL DEFAULT '<your source org>', -- e.g. 'Federal Register', 'GAO'
published_at TEXT,
retrieved_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')),
raw_sha256 TEXT,
citation_string TEXT NOT NULL,
-- Body columns (Decision 5 — raw on disk, markdown in column)
body_raw_path TEXT,
body_raw_format TEXT, -- 'html' | 'xml' | 'pdf' | 'text'
body_raw_sha256 TEXT,
body_text TEXT, -- markdown — what FTS5 indexes
body_text_sha256 TEXT,
body_normalized_at TEXT,
body_normalizer_version TEXT,
-- Lifecycle
fetched_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')),
parsed_at TEXT,
inserted_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')),
updated_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')),
raw_json TEXT -- preserved upstream payload for re-parse

About raw_json: populate this with the upstream JSON metadata payload (or, for non-JSON sources, the parsed metadata serialized to JSON) at load time. It's the safety net that lets a future migration re-derive any column without re-fetching from the upstream API — same role body_raw_path plays for body text. CRS uses it to preserve the EveryCRSReport per-report JSON; see crs_reports.py.

Standard 4-column ordering: (title, abstract, action, body). The action column is the source-specific category that the agent might want to boost (type_id for CRS, agency for FR, chamber for votes, etc.).

BM25 weights are query-time, not schema. The standard tuple is (10.0, 4.0, 2.0, 1.0) for (title, abstract, action, body) — applied at the call site via bm25(<source>_fts, 10.0, 4.0, 2.0, 1.0). The search wrapper that applies these uniformly across sources lives with the rest-api-search spec; until that ships, source-spec acceptance criteria should assert "FTS5 returns BM25-ranked results" without pinning the tuple.

CREATE VIRTUAL TABLE <source>_fts USING fts5(
title, abstract, action, body,
content='<source>',
content_rowid='rowid',
tokenize='porter unicode61'
);

Plus three sync triggers (ai, au, ad for after-insert / -update / -delete). See 0002_crs.py for the exact pattern — copy it.

Created empty in this migration (chunking + embedding wires up in Phase 8 of the plan):

CREATE VIRTUAL TABLE <source>_chunks_vec0 USING vec0(
chunk_id TEXT PRIMARY KEY,
embedding float[1024], -- 1024-dim matches Arctic-L-v2
embedding_bq bit[1024] -- binary-quantized for fast brute-force scan
);
Terminal window
SUBSTRATE_DB_PATH=/tmp/josh-test.db alembic upgrade head
sqlite3 /tmp/josh-test.db ".tables" # all your new tables show up
sqlite3 /tmp/josh-test.db "SELECT name, type FROM sqlite_master WHERE name LIKE '<source>%'"

File: shared/josh_substrate/src/josh_substrate/models/<source>.py. Reference: models/crs.py.

One Pydantic v2 model per table. Include model_config = ConfigDict(from_attributes=True) on each so SQLAlchemy row → Pydantic round-trips work.

The parent model carries the citation columns and body columns. Children carry FK to parent + their own fields. For the chunk model, import ChunkLocator from josh_substrate.protocols — don't redefine.

File: shared/josh_substrate/src/josh_substrate/citations/formatters/<source>.py. Reference: formatters/crs.py.

Export citation_for(record, version=None) -> str. Bluebook-flavored. Pragmatic — Josh isn't a legal product.

SourceFormat example
FR document91 Fed. Reg. 12345 (Mar. 4, 2026).
Public LawPub. L. No. 119-21 (2026).
US Code section42 U.S.C. § 1396a (2024).
GAO reportU.S. Gov't Accountability Off., GAO-25-107521, Title (Apr. 2025).
Committee reportH.R. Rep. No. 119-100 (2025).
BillH.R. 1234, 119th Cong. (2025).
Hearing transcript (CHRG)Hearing Title: Hearing Before the H. Comm. on X, 119th Cong. (2025).
Roll call voteRoll Call Vote No. 362 (House), 119th Cong., 1st Sess. (Mar. 4, 2025).
LDA filingLobbying Disclosure Act Filing {filing_uuid}, {client} via {registrant} (Q3 2025).
SAPStatement of Administration Policy on H.R. 1234 (Jan. 22, 2025).
Congressional Record171 Cong. Rec. H1234 (daily ed. Mar. 4, 2025) (statement of Rep. X).

Add unit tests at shared/josh_substrate/tests/citations/test_<source>.py. At least 4: canonical record, version-parametric path, edge cases (missing date, missing title), and any source-specific quirks.

Step 7 — Register normalizer + citation formatter

Section titled “Step 7 — Register normalizer + citation formatter”

Both registries auto-discover via pkgutil.iter_modules. Just dropping your file in the right location is enough — no manual registration.

For the normalizer: if you wrote a new one, the file must export a module-level normalizer instance (normalizer = MyNormalizer()).

For the citation formatter: the file must export a module-level citation_for callable.

Verify:

Terminal window
python -c "from josh_substrate.normalizers import all_normalizers; print(sorted(all_normalizers().keys()))"
python -c "from josh_substrate.citations.formatters import all_formatters; print(sorted(all_formatters().keys()))"

Directory: josh-ingester/ingester/sources/<source>/. Reference: sources/crs_reports/.

Each concrete source is a Python subpackage — a directory containing __init__.py plus one file per pipeline stage. The single-file form (sources/<source>.py) is no longer accepted; the registry walks subpackages and the success_determiner of the ingester-modularity-pass spec asserts that every concrete source is a directory.

The skeleton:

sources/<source>/
├── __init__.py # exposes `source = <Source>()` for the registry; thin
│ # class delegating to the four stage modules. Optional:
│ # re-export `cli` from cli.py.
├── discover.py # async generator of FetchTasks; updates state.last_seen_key
├── fetch.py # download payloads to /data/corpus/<source>/bodies/raw/…
├── parse.py # bytes-on-disk → ParsedRecord; runs normalizer here
├── load.py # transactional upsert
└── cli.py # OPTIONAL: source-local Typer for maintenance commands

Always split into the four stages, even when a stage is short. Uniformity makes the architecture legible across sources and lets per-stage work proceed in parallel without merge collisions. If a helper is used by exactly one stage, put it in that stage's file; if it's used across stages or is purely shape/formatting logic, add a helpers.py alongside the stages.

The __init__.py implements the Source Protocol from josh_substrate.protocols by delegating to the stage modules:

# sources/<source>/__init__.py
from __future__ import annotations
from .discover import discover as _discover
from .fetch import fetch as _fetch
from .load import load as _load
from .parse import parse as _parse
# from .cli import cli # uncomment if you wrote a source-local CLI
class <Source>Source:
name = "<source>"
schedule = "30 7 * * *" # cron, or None for manual-only
depends_on: list[str] = [] # other source names that must load first
# cli = cli # optional — uncomment alongside the import above
async def discover(self, state, http):
async for task in _discover(state, http):
yield task
async def fetch(self, task, http):
return await _fetch(task, http)
async def parse(self, raw_paths):
async for record in _parse(raw_paths):
yield record
async def load(self, records, db):
return await _load(records, db)
source = <Source>Source() # module-level instance for the registry

Each stage module is a plain async function (async def discover(state, http), async def fetch(task, http) -> dict[str, Path], async def parse(raw_paths) -> AsyncIterator[ParsedRecord], async def load(records, db) -> LoadStats). The class is a thin shim so the registry's Source protocol stays satisfied — the actual logic lives in the stage files.

If your source needs a maintenance command (one-shot data fix, manual reindex, etc.), put it in sources/<source>/cli.py as a typer.Typer instance and re-export it from __init__.py as cli. The global ingester CLI auto-mounts it under josh-ingester <source-name-with-dashes> at startup. Don't add the command to the global cli.py. See crs_reports/cli.py for the worked example — its fix-types command became josh-ingester crs-reports fix-types after the modularity pass moved it out of the global CLI.

Source-specific cleanup (e.g., wrapper-noise stripping for CRS, <HD>-header recognition for FR XML) belongs in parse.py, not in the normalizer. Keep normalizers source-agnostic.

Local first, server next.

Terminal window
# Local (after `pip install -e josh-ingester`)
SUBSTRATE_DB_PATH=/tmp/josh-test.db CORPUS_DIR=/tmp/josh-corpus LOCK_DIR=/tmp/josh-locks \
josh-ingester backfill <source> --limit 5
sqlite3 /tmp/josh-test.db "SELECT count(*) FROM <source>" # 5 rows
sqlite3 /tmp/josh-test.db "SELECT * FROM <source>_fts WHERE <source>_fts MATCH 'something' LIMIT 1"
# Server (after first deploy)
kamal app exec --reuse 'josh-ingester backfill <source> --limit 100'
ssh josh 'sqlite3 /data/josh.db "SELECT count(*) FROM <source>"'
ssh josh 'ls /data/corpus/<source>/bodies/raw/ | wc -l'

(josh-ingester is the console script defined in josh-ingester/pyproject.toml; python -m ingester is the equivalent module-form invocation if the script isn't on PATH in your environment.)

Pass criteria: N rows in parent table, ≥N raw payload files on disk, FTS5 query returns hits, all 6 citation columns NOT NULL, no recurring errors in ingestion_logs.

Add tests at josh-ingester/tests/sources/test_<source>.py. Reference: tests/sources/test_crs_reports.py. Cover: helper functions; discover() defensive parsing (good rows + malformed rows + filter watermark + backfill mode); parse() against real fixtures asserting all citation+body columns populate.

REQUIRED. Random-sample 10 records from the smoke backfill. For each:

  • Use the browser-use skill to fetch record.source_url. Preferred over curl because federal sources frequently bot-wall plain HTTP.
  • Pass = page loads (200) and the visible title matches the row's title.
  • Fail = bad URL pattern in your parser. Fix, re-backfill, re-verify.

Document the verification result in your spec doc's "URL verification" section (a 2-line table is plenty: 10/10 pass, date, sample IDs).

For backfills that fit comfortably in a 10-minute window, the synchronous form is fine:

Terminal window
kamal app exec --reuse 'josh-ingester backfill <source>'

For multi-hour backfills (CRS is ~6 hours; FR full backfill will be longer), the local Kamal client can be killed by client-side timeouts, taking the SSH session with it. Detach the work from any SSH session by using docker exec -d directly:

Terminal window
CONTAINER=$(ssh josh 'docker ps --format "{{.Names}}" | grep ^josh-ingester-workers-')
ssh josh "docker exec -d $CONTAINER sh -c 'josh-ingester backfill <source> > /data/<source>-backfill.log 2>&1'"

Then monitor progress periodically (the runner writes incremental counters to ingestion_runs and ingestion_tasks):

Terminal window
ssh josh "docker exec $CONTAINER josh-ingester status --source <source> --limit 1"
ssh josh 'tail /data/<source>-backfill.log'
ssh josh 'df -h /data' # confirm disk headroom

The container's CMD is sleep infinity, so it stays alive after the detached job finishes. The log on /data/ (bind-mounted volume) is visible from the host without docker logs.

Pass criteria: ingestion_runs.status='success' (or 'partial' with error_count / discovered_count < 2%).

If errors > 2%, dig into ingestion_logs for that run, fix the root cause, re-backfill (idempotent — ON CONFLICT DO UPDATE handles repeats; fetch() skips files already on disk).

Set the source's schedule to its cron string in the source-module file (schedule = "30 7 * * *" for daily 07:30 ET). Commit. Deploy:

Terminal window
cd josh-ingester
env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal deploy
kamal app exec --reuse 'josh-ingester schedule' # confirm registry picks it up

(Note: as of this writing the scheduler isn't wired to a long-running daemon yet — schedule is informational. When the substrate-cron-scheduler spec ships, this command will register the cron job; until then, set the value correctly so it's ready to take effect on first daemon boot.)

No code change needed. The admin page at josh-core /admin/ingestion/sources (when shipped) reads the source registry directly. Your new source appears automatically.

Step 14 — Update the inventory + status (both pages)

Section titled “Step 14 — Update the inventory + status (both pages)”

Two hand-authored inventory pages must stay in sync any time a source's status changes — both on this commit:

  • docs/josh-data-sources.html — the v1/v2 inventory index. Bump the source's status from schema-drafted to ingestion-built.
  • docs/data-status.html — the live "what's loaded right now" view. Flip the source's row to shipped, fill the Notes column with the record count + load date, and bump the Last updated footer at the bottom.

This page is the canonical "what's loaded right now" surface until josh-core ships a live status endpoint. Don't let it drift.

And the spec YAML you opened in Step 0 — flip it now:

  • Verified: the success_determiner ran green against the deployed substrate. Flip status: in_progress → verified.
  • Shipped: the source is live, the cron is registered, the inventory pages reflect it. Flip status: verified → shipped. Tick the remaining tasks done: true. Run python bin/build-spec.py to regenerate the HTML.

SymptomLikely causeFix
csv.Error: line contains NULSource CSV has a BOM or null byteRead with errors='replace' and trust csv.DictReader
csv.Error: field larger than field size limitOne row has a giant cellcsv.field_size_limit(sys.maxsize) once at module load
Some rows skipped silentlyDefensive parse exception eatenSet LOG_LEVEL=DEBUG and look for csv_row_malformed in ingestion_logs
HTTP 403 on plain fetchAnti-bot wallSwitch fetch to browser-use skill via the runner's HTTP client OR use TLS-fingerprint-spoofing client (curl_cffi, tls-client, niquests)
HTTP 502 / 503Transient upstreamThe runner's tenacity retries handle this. If recurring, raise rate-limit or pause
FK violation on insertCross-source FK to a record from a different source that hasn't loaded yetSoft-FK pattern: omit the constraint from the migration, populate the column anyway, resolve at query time
body_text is empty after loadNormalizer error swallowed; or PDF queued and not yet processedCheck body_normalizer_version is set; check ingestion_normalization_queue for PDF sources
FTS5 query returns nothingTriggers not firingVerify with SELECT count(*) FROM <source>_fts — should equal parent count
sqlite-vec extension not loadingaiosqlite's connect path bypassed _on_connectEnsure engine was built via josh_substrate.db.async_engine, not raw create_async_engine
Container has data but it's not on /dataBind mount missing in deploy.ymlCheck volumes: ["/data:/data"]; verify with docker inspect showing (bind) not (volume)
Discover yields zero tasks unexpectedlystate.last_seen_key is too highFor backfill, ensure runner-cleared (mode='backfill'); for incremental, check the watermark advanced sensibly last run
OperationalError: database is locked mid-backfillSQLite single-writer + bounded fetch concurrency producing collisions on ingestion_runs / ingestion_tasks updatesAt our scale (1 source at a time, fetch_concurrency=4), this hits ~0.05% of tasks during peak — acceptable. To reduce: drop fetch_concurrency to 1–2, or batch state-table writes through a single coroutine. Don't raise busy_timeout (already 10s).
Run finalized as failed despite most records loadingTop-level exception in runner's outer try/except (e.g., a SQLITE_BUSY hit during mark_task or save_source_state in the cleanup phase)Inspect error_summary on the ingestion_runs row. If the per-record error rate is low and the substrate has the data, treat as effectively partial and move on — re-run incrementally if the watermark advanced too far.
Unexpectedly large body_text rows (50+ MB)Source HTML is bloated (embedded large images, framework boilerplate, etc.)Outliers are unavoidable for some sources. Add a len(body_text) > N check in the source's parse stage and either truncate, skip, or capture as Artifact if pathological
Stray raw files in /data/corpus/<source>/bodies/raw/ not in DBFetch succeeded, parse/load failed for that recordCross-check: find /data/corpus/<source>/bodies/raw -name '*.html' | while read f; do num=$(basename $f .html); echo "$num,$(sqlite3 /data/josh.db "SELECT 1 FROM <source> WHERE number='$num'")"; done finds orphans
wal file growing very large during backfillWAL hasn't checkpointed under sustained writer loadPeriodically run PRAGMA wal_checkpoint(TRUNCATE) from a sidecar script during multi-hour backfills — or accept the WAL size during the run and let it auto-checkpoint when the writer rests
  • ingestion-architecture.html — framework spec (Source protocol, 4 stages, state tables, body normalization, citation metadata)
  • chunking.html — chunker protocol + experiment framework (Phase 8 work)
  • new-host-setup.html — provisioning a fresh server
  • migrations.html — Kamal pre-deploy hook for migrations
  • crs-reports.html — the worked example (per-source spec template)
  • sources/crs_reports.py — the worked example (Source implementation)
  • migrations/versions/0002_crs.py — the worked example (migration template)