Adding a new data source
How to add a new source to the Josh substrate — from per-source spec doc through full backfill on the production server. CRS Reports is the worked example: every step links to its concrete CRS counterpart.
This runbook is the contract between you and the substrate. Follow it in order, and the new source will:
- Land in the unified SQLite file at
/data/josh.dbwith full FTS5 + vec0 support. - Carry the standard 6 citation columns and 7 body columns — identical shape across every source.
- Pass the same manual-validation gates that CRS passed.
- Drop into the runner with no framework changes.
What changes vs. what doesn't
Section titled “What changes vs. what doesn't”The framework's whole point is that adding a source touches a tightly bounded set of files. Calibrate your expectations accordingly:
Per source, you change:
- The per-source ingester spec YAML (
docs/spec/data/<source>-ingester.yaml) and rendered HTML. - The Alembic migration for this source.
- The Pydantic models for this source.
- The citation formatter for this source.
- A normalizer file only if the source ingests a format we haven't seen before.
- The source subpackage at
josh-ingester/ingester/sources/<source>/(__init__.py,discover.py,fetch.py,parse.py,load.py— and optionallycli.pyfor source-local maintenance commands). - Tests for the source module and the citation formatter.
- Two inventory pages:
docs/josh-data-sources.htmlanddocs/data-status.html.
You do not change:
- The global ingester CLI (
josh-ingester/ingester/cli.py). Source-specific commands belong in the source's owncli.py; the global CLI auto-mounts them. - The runner (
josh-ingester/ingester/runner.py) or the scheduler. - The state-table writers (
josh-ingester/ingester/state.py) or the per-source advisory-lock helper. - The substrate's protocol definitions, body-normalization machinery, or chunking framework.
- The
Sourceprotocol itself.
Steps 9–12 of this runbook exercise the new source via the existing generic CLI and runner — no CLI code edits required. If you find yourself reaching outside the bounded set, stop and re-read this callout: most of the time the right answer is to do the work inside the source subpackage, not in the framework.
What you're committing to
Section titled “What you're committing to”Adding one source is roughly 6–10 hours after the framework is in place (which it is, post-CRS). Per-source deliverables, in order:
- Per-source spec doc at
docs/sources/<name>.html - Alembic migration at
shared/josh_substrate/src/josh_substrate/migrations/versions/<NNNN>_<source>.py - Pydantic models at
shared/josh_substrate/src/josh_substrate/models/<source>.py - Citation formatter at
shared/josh_substrate/src/josh_substrate/citations/formatters/<source>.py - Normalizer registry entry (or new normalizer file under
shared/josh_substrate/src/josh_substrate/normalizers/if the format is new) - Source subpackage at
josh-ingester/ingester/sources/<source>/(__init__.py+ per-stagediscover.py/fetch.py/parse.py/load.py+ optionalcli.py) - Tests at
josh-ingester/tests/sources/test_<source>.pyand (optionally)shared/josh_substrate/tests/citations/test_<source>.py - Manual HTML/XML→Markdown validation against ≥12 stratified samples — REQUIRED.
- URL verification gate against 10 random records — REQUIRED.
Tools you have available
Section titled “Tools you have available”Use the existing kit; don't reinvent.
| Need | Tool | Where |
|---|---|---|
| HTTP fetch + retries + rate-limit | JoshHttpClient | josh-ingester/ingester/http.py (constructed by the runner; passed into Source.discover and Source.fetch) |
| Anti-bot / JS-rendered pages | browser-use skill | Claude Code skill, available globally. Use for any source where plain HTTP returns 403/blocked |
| HTML → Markdown | html_v1 | shared/josh_substrate/src/josh_substrate/normalizers/html_v1.py |
| Plain text → Markdown | text_v1 | shared/josh_substrate/src/josh_substrate/normalizers/text_v1.py |
| XML / PDF | (none yet) | When you ingest the first such source, write <source>_xml_v1 or use pdf_v1 (TBD) |
| Substrate connection + sqlite-vec load | josh_substrate.db.async_engine | shared/josh_substrate/src/josh_substrate/db.py |
| Citation registry | josh_substrate.citations | Auto-discovers your formatters/<source>.py |
| Normalizer registry | josh_substrate.normalizers | Auto-discovers your normalizer module if you add one |
| Path conventions | josh_substrate.corpus.raw_path | /data/corpus/<source>/bodies/raw/... (markdown lives in DB column, not on disk) |
| State-table writers | ingester.state | Used by the runner; you don't call these directly |
| Per-source advisory lock | ingester.concurrency.source_lock | Used by the runner; you don't call these directly |
| Per-source CLI extension | Module-level cli: typer.Typer in sources/<source>/cli.py | Optional. The global ingester CLI auto-mounts your source's cli attribute under josh-ingester <source-name-with-dashes> (underscores → hyphens). Use for source-local maintenance commands (one-shot data fixes, etc.). See crs_reports/cli.py for the worked example (josh-ingester crs-reports fix-types). |
Don't roll your own when these exist. The point of the framework is uniformity across sources.
Prerequisites
Section titled “Prerequisites”- The substrate is provisioned per new host setup and migration
0001(state tables) is applied. - The
josh_substrateshared package is installed (pip install -e shared/josh_substratefor local dev; the Dockerfile handles it for production). - CRS is shipped and serves as the worked example — every step below has a "look at CRS for…" pointer.
- You can SSH into
josh(the production droplet) and your Kamal credentials are decrypted.
Step 0 — Open the spec YAML and drive it as you go
Section titled “Step 0 — Open the spec YAML and drive it as you go”Every source ingester is governed by a spec item at docs/spec/data/<source>-ingester.yaml — that file is the contract for what "shipped" means for this source, and the place an agent (or you, next session) reads to know what's left.
If the spec doesn't exist yet, copy the template:
cp docs/spec/data/_templates/source.yaml docs/spec/data/<source>-ingester.yaml# fill in: id, title, why, user_stories, acceptance_criteria, success_determinerpython bin/build-spec.py # validates against _schema.json + regenerates HTMLDrive the spec through this runbook:
- The 14 standard tasks in the template (
t1–t14) intentionally mirror the 14 steps below. Tick each taskdone: trueas you finish the corresponding step. Add source-specific tasks att15+. - Flip
status: planned → in_progresswhen you start coding (Step 4). The build script auto-prepends achangelogentry on every status transition. - Flip
status: in_progress → verifiedwhen thesuccess_determinerruns green against the deployed substrate (after Step 11). - Flip
status: verified → shippedwhen the cron is registered and observed firing in production (after Step 12). Verified ≠ shipped — verified means the determiner passed; shipped means the change is live. - Re-run
python bin/build-spec.pyafter each YAML edit (or use the in-browser editor ondocs/spec/in Chrome/Edge — the File System Access API writes back silently).
The same write-the-criteria-first discipline applies to the success_determiner. Pick kind: bash or kind: sql over kind: manual wherever possible — manual is a fallback, not a goal.
Step 1 — Author the per-source spec doc
Section titled “Step 1 — Author the per-source spec doc”File: docs/sources/<name>.html. Reference template: docs/sources/crs-reports.html. (Note: this is the per-source documentation page — endpoints, response shapes, indicative DDL, probe findings. The ingester spec — acceptance criteria, success determiner, task contract — is a separate file you set up in Step 0.)
Required sections:
| Section | What goes here |
|---|---|
| Quick reference | A table at the top: source name, primary URL pattern, auth requirements, rate limits, format, indicative volume |
| Source priority decision | Which API/mirror you picked and why. If multiple options exist (Congress.gov vs EveryCRSReport mirror, etc.), document the tradeoff |
| Access notes | Bot walls, auth flows, gotchas. Note here if the source needs browser-use |
| Endpoints | Concrete URL templates with parameter placeholders |
| Vocabulary / enums | Type codes, status enums, anything an agent would benefit from understanding |
| Stable ID format | The prefix-namespaced ID (e.g., crs:R48481, fr:2026-08558) |
| Response shapes | Sample JSON/XML payloads for each endpoint. Capture during initial probe |
| Volume | Records all-time, records/year, expected on-disk size of raw + DB |
| Caching / incremental sync | Watermark strategy, ETag / If-Modified-Since support, RSS feed availability |
| Schema (indicative DDL) | The shape you intend to write. Migration is the source of truth — this is illustrative |
| Download / update strategy | Backfill plan + daily incremental plan |
| Failure modes | Known flake patterns + recovery |
| Open questions | Things you noticed but defer to v1.x |
Add the source to the inventory table in docs/josh-data-sources.html.
Step 2 — Decide stable ID format
Section titled “Step 2 — Decide stable ID format”The Josh ID format is <source-prefix>:<source-native-key>. Conventions:
| Property | Required |
|---|---|
| Lowercased prefix matching the source name | Yes (crs:, fr:, bill:) |
| Source-native key preserved verbatim | Yes (preserve case, separators) |
| Globally unique within source | Yes |
| Valid as a TEXT column value (no shell-special chars unless quoted) | Yes |
Examples in production:
crs:R48481— CRS report numbercrs:R48481:c0— chunk index zero of that reportcrs:R48481:author:0— first author of that report
Keep them human-readable. Agents will quote these in citations and logs; debugging is easier when IDs aren't UUIDs.
Step 3 — Pick or build the normalizer
Section titled “Step 3 — Pick or build the normalizer”Decision tree:
| Source format | Action |
|---|---|
| HTML | Reuse html_v1. Add per-source cleanup (e.g., wrapper-noise stripping) in your source module's parse() stage, not in the normalizer |
| XML with new schema | Write <source>_xml_v1 as a new file under normalizers/ — lxml-based walker that preserves heading hierarchy and produces Markdown + Heading[] |
| Plain text | Reuse text_v1 |
Use pdf_v1 (when shipped). PDF normalization is queued (slower than inline) — record loads with body_text=NULL and an entry in ingestion_normalization_queue; the worker fills it in later |
REQUIRED — Manual normalizer validation gate
Section titled “REQUIRED — Manual normalizer validation gate”Per the locked decision in the plan, every source's normalizer must pass the 12-sample stratified validation gate before it ships. This is non-optional. The bar is binary: all 12 samples × all 12 criteria pass. No averaging.
Sample selection rule:
- Select 12 fixtures stratified by an intrinsic source dimension (CRS uses report-type prefix; FR would use document type; bills would use bill type or congress).
- Pick the most-recently-published instances of each stratum that exceed a minimum size (filters empty stubs).
- Pin the sample IDs in your per-source spec doc's "Normalizer validation" section so the gate is reproducible.
Workflow per sample:
- Open the source URL in a browser (or fetch via
browser-use). - Run the converter:
await normalizer.to_markdown(raw_path, source_hint=url). - Render the Markdown side-by-side (VS Code preview /
glow/ similar). - Walk the criteria table below; score pass/fail.
Criteria with pass bars:
| Criterion | Pass bar |
|---|---|
| Heading hierarchy | All <h1>–<h4> map to #–#### in correct nesting order. No skipped levels. |
| Section ordering | First-paragraph-of-section text in source order |
| Footnotes | Markers preserved; round-trip from marker to text without manual hunting |
| Tables | Every <table> becomes Markdown table OR an Artifact reference. Fail if a table silently becomes prose |
| Block quotes | <blockquote> → >; statutory excerpts visually distinct from analysis |
| Citations | Statutory citations preserved verbatim; no Unicode mangling of §; no whitespace collapse around abbreviations |
| Links | Every <a href> → [text](url); URLs absolute (relative resolved against source URL) |
| Lists | <ol>/<ul> preserved with correct nesting; numbering preserved |
| Emphasis | <em>/<strong> → */** |
| Images / figures | <img> / <figure> → Markdown link with alt text or Artifact reference; captions preserved |
| No HTML tag bleed | Zero raw <div>, <span> (tables exempt where lossy) |
| Body length sanity | len(markdown) / len(html) ∈ [0.3, 0.8] |
Promotion criteria: all 12 samples × all 12 criteria pass. One fail = file an issue, fix the converter, re-run all 12.
Document the run in the source's spec doc under a "Normalizer validation" section. CRS's docs/sources/crs-reports.html "Normalizer validation" section is the template — duplicate that structure for your source.
Step 4 — Write the migration
Section titled “Step 4 — Write the migration”File: shared/josh_substrate/src/josh_substrate/migrations/versions/<NNNN>_<source>.py. Reference: 0002_crs.py.
Pattern: parent table → version/junction tables → <source>_chunks table → FTS5 virtual table + sync triggers → vec0 virtual table.
Required parent-table columns
Section titled “Required parent-table columns”Every record-bearing parent table includes these — copy verbatim:
-- Standard citation columns (Decision 6 — same on every source)source_url TEXT NOT NULL,source_org TEXT NOT NULL DEFAULT '<your source org>', -- e.g. 'Federal Register', 'GAO'published_at TEXT,retrieved_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')),raw_sha256 TEXT,citation_string TEXT NOT NULL,
-- Body columns (Decision 5 — raw on disk, markdown in column)body_raw_path TEXT,body_raw_format TEXT, -- 'html' | 'xml' | 'pdf' | 'text'body_raw_sha256 TEXT,body_text TEXT, -- markdown — what FTS5 indexesbody_text_sha256 TEXT,body_normalized_at TEXT,body_normalizer_version TEXT,
-- Lifecyclefetched_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')),parsed_at TEXT,inserted_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')),updated_at TEXT NOT NULL DEFAULT (strftime('%Y-%m-%dT%H:%M:%fZ','now')),raw_json TEXT -- preserved upstream payload for re-parseAbout raw_json: populate this with the upstream JSON metadata payload (or, for non-JSON sources, the parsed metadata serialized to JSON) at load time. It's the safety net that lets a future migration re-derive any column without re-fetching from the upstream API — same role body_raw_path plays for body text. CRS uses it to preserve the EveryCRSReport per-report JSON; see crs_reports.py.
FTS5 virtual table
Section titled “FTS5 virtual table”Standard 4-column ordering: (title, abstract, action, body). The action column is the source-specific category that the agent might want to boost (type_id for CRS, agency for FR, chamber for votes, etc.).
BM25 weights are query-time, not schema. The standard tuple is (10.0, 4.0, 2.0, 1.0) for (title, abstract, action, body) — applied at the call site via bm25(<source>_fts, 10.0, 4.0, 2.0, 1.0). The search wrapper that applies these uniformly across sources lives with the rest-api-search spec; until that ships, source-spec acceptance criteria should assert "FTS5 returns BM25-ranked results" without pinning the tuple.
CREATE VIRTUAL TABLE <source>_fts USING fts5( title, abstract, action, body, content='<source>', content_rowid='rowid', tokenize='porter unicode61');Plus three sync triggers (ai, au, ad for after-insert / -update / -delete). See 0002_crs.py for the exact pattern — copy it.
vec0 virtual table
Section titled “vec0 virtual table”Created empty in this migration (chunking + embedding wires up in Phase 8 of the plan):
CREATE VIRTUAL TABLE <source>_chunks_vec0 USING vec0( chunk_id TEXT PRIMARY KEY, embedding float[1024], -- 1024-dim matches Arctic-L-v2 embedding_bq bit[1024] -- binary-quantized for fast brute-force scan);Verify
Section titled “Verify”SUBSTRATE_DB_PATH=/tmp/josh-test.db alembic upgrade headsqlite3 /tmp/josh-test.db ".tables" # all your new tables show upsqlite3 /tmp/josh-test.db "SELECT name, type FROM sqlite_master WHERE name LIKE '<source>%'"Step 5 — Write Pydantic models
Section titled “Step 5 — Write Pydantic models”File: shared/josh_substrate/src/josh_substrate/models/<source>.py. Reference: models/crs.py.
One Pydantic v2 model per table. Include model_config = ConfigDict(from_attributes=True) on each so SQLAlchemy row → Pydantic round-trips work.
The parent model carries the citation columns and body columns. Children carry FK to parent + their own fields. For the chunk model, import ChunkLocator from josh_substrate.protocols — don't redefine.
Step 6 — Write the citation formatter
Section titled “Step 6 — Write the citation formatter”File: shared/josh_substrate/src/josh_substrate/citations/formatters/<source>.py. Reference: formatters/crs.py.
Export citation_for(record, version=None) -> str. Bluebook-flavored. Pragmatic — Josh isn't a legal product.
| Source | Format example |
|---|---|
| FR document | 91 Fed. Reg. 12345 (Mar. 4, 2026). |
| Public Law | Pub. L. No. 119-21 (2026). |
| US Code section | 42 U.S.C. § 1396a (2024). |
| GAO report | U.S. Gov't Accountability Off., GAO-25-107521, Title (Apr. 2025). |
| Committee report | H.R. Rep. No. 119-100 (2025). |
| Bill | H.R. 1234, 119th Cong. (2025). |
| Hearing transcript (CHRG) | Hearing Title: Hearing Before the H. Comm. on X, 119th Cong. (2025). |
| Roll call vote | Roll Call Vote No. 362 (House), 119th Cong., 1st Sess. (Mar. 4, 2025). |
| LDA filing | Lobbying Disclosure Act Filing {filing_uuid}, {client} via {registrant} (Q3 2025). |
| SAP | Statement of Administration Policy on H.R. 1234 (Jan. 22, 2025). |
| Congressional Record | 171 Cong. Rec. H1234 (daily ed. Mar. 4, 2025) (statement of Rep. X). |
Add unit tests at shared/josh_substrate/tests/citations/test_<source>.py. At least 4: canonical record, version-parametric path, edge cases (missing date, missing title), and any source-specific quirks.
Step 7 — Register normalizer + citation formatter
Section titled “Step 7 — Register normalizer + citation formatter”Both registries auto-discover via pkgutil.iter_modules. Just dropping your file in the right location is enough — no manual registration.
For the normalizer: if you wrote a new one, the file must export a module-level normalizer instance (normalizer = MyNormalizer()).
For the citation formatter: the file must export a module-level citation_for callable.
Verify:
python -c "from josh_substrate.normalizers import all_normalizers; print(sorted(all_normalizers().keys()))"python -c "from josh_substrate.citations.formatters import all_formatters; print(sorted(all_formatters().keys()))"Step 8 — Write the source subpackage
Section titled “Step 8 — Write the source subpackage”Directory: josh-ingester/ingester/sources/<source>/. Reference: sources/crs_reports/.
Each concrete source is a Python subpackage — a directory containing __init__.py plus one file per pipeline stage. The single-file form (sources/<source>.py) is no longer accepted; the registry walks subpackages and the success_determiner of the ingester-modularity-pass spec asserts that every concrete source is a directory.
The skeleton:
sources/<source>/├── __init__.py # exposes `source = <Source>()` for the registry; thin│ # class delegating to the four stage modules. Optional:│ # re-export `cli` from cli.py.├── discover.py # async generator of FetchTasks; updates state.last_seen_key├── fetch.py # download payloads to /data/corpus/<source>/bodies/raw/…├── parse.py # bytes-on-disk → ParsedRecord; runs normalizer here├── load.py # transactional upsert└── cli.py # OPTIONAL: source-local Typer for maintenance commandsAlways split into the four stages, even when a stage is short. Uniformity makes the architecture legible across sources and lets per-stage work proceed in parallel without merge collisions. If a helper is used by exactly one stage, put it in that stage's file; if it's used across stages or is purely shape/formatting logic, add a helpers.py alongside the stages.
The __init__.py implements the Source Protocol from josh_substrate.protocols by delegating to the stage modules:
# sources/<source>/__init__.pyfrom __future__ import annotations
from .discover import discover as _discoverfrom .fetch import fetch as _fetchfrom .load import load as _loadfrom .parse import parse as _parse# from .cli import cli # uncomment if you wrote a source-local CLI
class <Source>Source: name = "<source>" schedule = "30 7 * * *" # cron, or None for manual-only depends_on: list[str] = [] # other source names that must load first # cli = cli # optional — uncomment alongside the import above
async def discover(self, state, http): async for task in _discover(state, http): yield task
async def fetch(self, task, http): return await _fetch(task, http)
async def parse(self, raw_paths): async for record in _parse(raw_paths): yield record
async def load(self, records, db): return await _load(records, db)
source = <Source>Source() # module-level instance for the registryEach stage module is a plain async function (async def discover(state, http), async def fetch(task, http) -> dict[str, Path], async def parse(raw_paths) -> AsyncIterator[ParsedRecord], async def load(records, db) -> LoadStats). The class is a thin shim so the registry's Source protocol stays satisfied — the actual logic lives in the stage files.
Optional source-local CLI
Section titled “Optional source-local CLI”If your source needs a maintenance command (one-shot data fix, manual reindex, etc.), put it in sources/<source>/cli.py as a typer.Typer instance and re-export it from __init__.py as cli. The global ingester CLI auto-mounts it under josh-ingester <source-name-with-dashes> at startup. Don't add the command to the global cli.py. See crs_reports/cli.py for the worked example — its fix-types command became josh-ingester crs-reports fix-types after the modularity pass moved it out of the global CLI.
Source-specific cleanup (e.g., wrapper-noise stripping for CRS, <HD>-header recognition for FR XML) belongs in parse.py, not in the normalizer. Keep normalizers source-agnostic.
Step 9 — Smoke test
Section titled “Step 9 — Smoke test”Local first, server next.
# Local (after `pip install -e josh-ingester`)SUBSTRATE_DB_PATH=/tmp/josh-test.db CORPUS_DIR=/tmp/josh-corpus LOCK_DIR=/tmp/josh-locks \ josh-ingester backfill <source> --limit 5sqlite3 /tmp/josh-test.db "SELECT count(*) FROM <source>" # 5 rowssqlite3 /tmp/josh-test.db "SELECT * FROM <source>_fts WHERE <source>_fts MATCH 'something' LIMIT 1"
# Server (after first deploy)kamal app exec --reuse 'josh-ingester backfill <source> --limit 100'ssh josh 'sqlite3 /data/josh.db "SELECT count(*) FROM <source>"'ssh josh 'ls /data/corpus/<source>/bodies/raw/ | wc -l'(josh-ingester is the console script defined in josh-ingester/pyproject.toml; python -m ingester is the equivalent module-form invocation if the script isn't on PATH in your environment.)
Pass criteria: N rows in parent table, ≥N raw payload files on disk, FTS5 query returns hits, all 6 citation columns NOT NULL, no recurring errors in ingestion_logs.
Add tests at josh-ingester/tests/sources/test_<source>.py. Reference: tests/sources/test_crs_reports.py. Cover: helper functions; discover() defensive parsing (good rows + malformed rows + filter watermark + backfill mode); parse() against real fixtures asserting all citation+body columns populate.
Step 10 — URL verification gate
Section titled “Step 10 — URL verification gate”REQUIRED. Random-sample 10 records from the smoke backfill. For each:
- Use the
browser-useskill to fetchrecord.source_url. Preferred overcurlbecause federal sources frequently bot-wall plain HTTP. - Pass = page loads (200) and the visible title matches the row's
title. - Fail = bad URL pattern in your parser. Fix, re-backfill, re-verify.
Document the verification result in your spec doc's "URL verification" section (a 2-line table is plenty: 10/10 pass, date, sample IDs).
Step 11 — Full backfill
Section titled “Step 11 — Full backfill”For backfills that fit comfortably in a 10-minute window, the synchronous form is fine:
kamal app exec --reuse 'josh-ingester backfill <source>'For multi-hour backfills (CRS is ~6 hours; FR full backfill will be longer), the local Kamal client can be killed by client-side timeouts, taking the SSH session with it. Detach the work from any SSH session by using docker exec -d directly:
CONTAINER=$(ssh josh 'docker ps --format "{{.Names}}" | grep ^josh-ingester-workers-')ssh josh "docker exec -d $CONTAINER sh -c 'josh-ingester backfill <source> > /data/<source>-backfill.log 2>&1'"Then monitor progress periodically (the runner writes incremental counters to ingestion_runs and ingestion_tasks):
ssh josh "docker exec $CONTAINER josh-ingester status --source <source> --limit 1"ssh josh 'tail /data/<source>-backfill.log'ssh josh 'df -h /data' # confirm disk headroomThe container's CMD is sleep infinity, so it stays alive after the detached job finishes. The log on /data/ (bind-mounted volume) is visible from the host without docker logs.
Pass criteria: ingestion_runs.status='success' (or 'partial' with error_count / discovered_count < 2%).
If errors > 2%, dig into ingestion_logs for that run, fix the root cause, re-backfill (idempotent — ON CONFLICT DO UPDATE handles repeats; fetch() skips files already on disk).
Step 12 — Schedule + deploy
Section titled “Step 12 — Schedule + deploy”Set the source's schedule to its cron string in the source-module file (schedule = "30 7 * * *" for daily 07:30 ET). Commit. Deploy:
cd josh-ingesterenv LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal deploykamal app exec --reuse 'josh-ingester schedule' # confirm registry picks it up(Note: as of this writing the scheduler isn't wired to a long-running daemon yet — schedule is informational. When the substrate-cron-scheduler spec ships, this command will register the cron job; until then, set the value correctly so it's ready to take effect on first daemon boot.)
Step 13 — Add to admin overview
Section titled “Step 13 — Add to admin overview”No code change needed. The admin page at josh-core /admin/ingestion/sources (when shipped) reads the source registry directly. Your new source appears automatically.
Step 14 — Update the inventory + status (both pages)
Section titled “Step 14 — Update the inventory + status (both pages)”Two hand-authored inventory pages must stay in sync any time a source's status changes — both on this commit:
docs/josh-data-sources.html— the v1/v2 inventory index. Bump the source's status fromschema-draftedtoingestion-built.docs/data-status.html— the live "what's loaded right now" view. Flip the source's row to shipped, fill the Notes column with the record count + load date, and bump theLast updatedfooter at the bottom.
This page is the canonical "what's loaded right now" surface until josh-core ships a live status endpoint. Don't let it drift.
And the spec YAML you opened in Step 0 — flip it now:
- Verified: the
success_determinerran green against the deployed substrate. Flipstatus: in_progress → verified. - Shipped: the source is live, the cron is registered, the inventory pages reflect it. Flip
status: verified → shipped. Tick the remaining tasksdone: true. Runpython bin/build-spec.pyto regenerate the HTML.
Troubleshooting
Section titled “Troubleshooting”| Symptom | Likely cause | Fix |
|---|---|---|
csv.Error: line contains NUL | Source CSV has a BOM or null byte | Read with errors='replace' and trust csv.DictReader |
csv.Error: field larger than field size limit | One row has a giant cell | csv.field_size_limit(sys.maxsize) once at module load |
| Some rows skipped silently | Defensive parse exception eaten | Set LOG_LEVEL=DEBUG and look for csv_row_malformed in ingestion_logs |
| HTTP 403 on plain fetch | Anti-bot wall | Switch fetch to browser-use skill via the runner's HTTP client OR use TLS-fingerprint-spoofing client (curl_cffi, tls-client, niquests) |
| HTTP 502 / 503 | Transient upstream | The runner's tenacity retries handle this. If recurring, raise rate-limit or pause |
| FK violation on insert | Cross-source FK to a record from a different source that hasn't loaded yet | Soft-FK pattern: omit the constraint from the migration, populate the column anyway, resolve at query time |
body_text is empty after load | Normalizer error swallowed; or PDF queued and not yet processed | Check body_normalizer_version is set; check ingestion_normalization_queue for PDF sources |
| FTS5 query returns nothing | Triggers not firing | Verify with SELECT count(*) FROM <source>_fts — should equal parent count |
sqlite-vec extension not loading | aiosqlite's connect path bypassed _on_connect | Ensure engine was built via josh_substrate.db.async_engine, not raw create_async_engine |
Container has data but it's not on /data | Bind mount missing in deploy.yml | Check volumes: ["/data:/data"]; verify with docker inspect showing (bind) not (volume) |
| Discover yields zero tasks unexpectedly | state.last_seen_key is too high | For backfill, ensure runner-cleared (mode='backfill'); for incremental, check the watermark advanced sensibly last run |
OperationalError: database is locked mid-backfill | SQLite single-writer + bounded fetch concurrency producing collisions on ingestion_runs / ingestion_tasks updates | At our scale (1 source at a time, fetch_concurrency=4), this hits ~0.05% of tasks during peak — acceptable. To reduce: drop fetch_concurrency to 1–2, or batch state-table writes through a single coroutine. Don't raise busy_timeout (already 10s). |
Run finalized as failed despite most records loading | Top-level exception in runner's outer try/except (e.g., a SQLITE_BUSY hit during mark_task or save_source_state in the cleanup phase) | Inspect error_summary on the ingestion_runs row. If the per-record error rate is low and the substrate has the data, treat as effectively partial and move on — re-run incrementally if the watermark advanced too far. |
Unexpectedly large body_text rows (50+ MB) | Source HTML is bloated (embedded large images, framework boilerplate, etc.) | Outliers are unavoidable for some sources. Add a len(body_text) > N check in the source's parse stage and either truncate, skip, or capture as Artifact if pathological |
Stray raw files in /data/corpus/<source>/bodies/raw/ not in DB | Fetch succeeded, parse/load failed for that record | Cross-check: find /data/corpus/<source>/bodies/raw -name '*.html' | while read f; do num=$(basename $f .html); echo "$num,$(sqlite3 /data/josh.db "SELECT 1 FROM <source> WHERE number='$num'")"; done finds orphans |
wal file growing very large during backfill | WAL hasn't checkpointed under sustained writer load | Periodically run PRAGMA wal_checkpoint(TRUNCATE) from a sidecar script during multi-hour backfills — or accept the WAL size during the run and let it auto-checkpoint when the writer rests |
Reference
Section titled “Reference”ingestion-architecture.html— framework spec (Source protocol, 4 stages, state tables, body normalization, citation metadata)chunking.html— chunker protocol + experiment framework (Phase 8 work)new-host-setup.html— provisioning a fresh servermigrations.html— Kamal pre-deploy hook for migrationscrs-reports.html— the worked example (per-source spec template)sources/crs_reports.py— the worked example (Source implementation)migrations/versions/0002_crs.py— the worked example (migration template)