Topic taxonomy ingester (LCSH subset)
Why
Pull a curated subset of the Library of Congress Subject Headings (LCSH)
into the substrate as a backbone vocabulary for cross-source topic
alignment. Each ingested source comes with its own native topic
vocabulary (bill_subjects, fr_document_topics, gao_report_topics, lda_lobbying_activities.general_issue_code, etc.), and the per-source
vocabularies don't align. LCSH is the authority-controlled, library-
jargony, well-defined vocabulary that gives us a stable backbone to
reference; the per-source taxonomies map *into* it.
Per the topic-taxonomy source doc, the full topic story has three
layers: (1) per-source native taxonomies — already ingested as part of
each source ingester, (2) LCSH backbone — this spec, (3) Josh
meta-taxonomy of ~30-50 hand-curated top-level topics with mappings
into source vocabularies — deferred to a separate spec when the
curation work is ready. This spec is intentionally narrow: load LCSH
data into the substrate; meta-taxonomy authoring is its own decision.
User stories
As a downstream agent doing cross-source topic queries, I want an LCSH `lcsh_subjects` table with the curated subset of headings the substrate needs so that per-source native topic values can be resolved to a shared LCSH backbone when the meta-taxonomy lands.
As an OSS self-hoster, I want to run `josh ingest topic-taxonomy` on a fresh substrate so that I can populate the LCSH subset from the LoC bulk download on my own instance without designing the meta-taxonomy myself.
As a future curator authoring the Josh meta-taxonomy, I want LCSH already loaded, with authority IDs I can reference in my mappings YAML, so that my meta-taxonomy work is unblocked when the curation spec ships.
Acceptance criteria (EARS)
- When `josh ingest topic-taxonomy` is run on the deployed substrate, the system shall populate `lcsh_subjects` rows from the curated LCSH subset, with the six citation columns populated on every row.
- Where the curated subset is defined (initial `topic_taxonomy_seed.yaml` ships ~5,000-10,000 authority IDs covering the per-source taxonomy values that exist in the substrate, plus 1-hop broader/narrower neighbors), the system shall fetch the per-heading JSON-LD and persist `label`, `broader_authorities`, `narrower_authorities`, `related_authorities`, and `raw_jsonld`.
- While the ingester is running, if a single LCSH heading fetch fails, then the system shall log the failure to `ingestion_logs` (with the offending `authority_id`) and continue with the next heading.
- The system shall enforce a UNIQUE constraint on `lcsh_subjects.authority_id` so re-fetching the same heading upserts rather than duplicates.
- When an FTS5 query is executed against `lcsh_subjects_fts`, the system shall return BM25-ranked results over `(label, broader_labels)`.
- When a yearly refresh runs, the system shall re-fetch the curated subset and update `lcsh_subjects.fetched_at` per row; deprecated headings shall be flagged with `is_deprecated=true` rather than deleted (preserves historical references).
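The log-and-continue criterion above can be sketched roughly as follows. This is a minimal illustration, not the ingester's actual code: `ingest_headings`, `IngestResult`, and the `fetch_heading` callable are hypothetical names, and the `failures` list stands in for rows written to `ingestion_logs`.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class IngestResult:
    loaded: list = field(default_factory=list)
    # Stand-in for ingestion_logs rows (hypothetical shape).
    failures: list = field(default_factory=list)


def ingest_headings(
    authority_ids: Iterable[str],
    fetch_heading: Callable[[str], dict],
) -> IngestResult:
    """Fetch each heading; record any failure and move on to the next one."""
    result = IngestResult()
    for authority_id in authority_ids:
        try:
            result.loaded.append(fetch_heading(authority_id))
        except Exception as exc:  # one bad heading must not abort the run
            result.failures.append(
                {
                    "authority_id": authority_id,
                    "error": f"{type(exc).__name__}: {exc}",
                }
            )
    return result
```

Catching the broad `Exception` is deliberate in this sketch, since the criterion says any single-heading failure should be logged and skipped; a real implementation might narrow it to network and parse errors.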
Success determiner
Command
set -euo pipefail
# Smoke ingest: cap the discover stage at 1,000 headings.
ssh josh 'docker exec josh-ingester josh ingest topic-taxonomy --max 1000'
# Count rows where the listed citation columns are all non-NULL.
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
SELECT COUNT(*) FROM lcsh_subjects
WHERE source_url IS NOT NULL AND source_org IS NOT NULL
AND retrieved_at IS NOT NULL AND raw_sha256 IS NOT NULL
AND citation_string IS NOT NULL;"'
# Confirm the FTS5 index answers a MATCH query.
ssh josh 'docker exec josh-core sqlite3 /data/josh.db \
"SELECT COUNT(*) FROM lcsh_subjects_fts WHERE lcsh_subjects_fts MATCH '\''health'\'';"'
Expect
Smoke run only — `--max 1000` caps the discover stage. Full curated subset (~5,000-10,000 authority IDs based on per-source taxonomy coverage) runs detached on the server per `add-a-new-source.html`. Bulk download is ~150 MB JSON-LD; we filter aggressively before inserting.
Clarifications needed
- Curation seed: hand-author the initial `topic_taxonomy_seed.yaml` (list of authority IDs) by running a one-off scan of all per-source taxonomy values present in the substrate, mapping each to LCSH best-guesses, plus 1-hop neighbors? Or seed with a vanilla LCSH top-N and grow on demand? Lean: scan-then-seed once we have multiple sources loaded — the seed is grounded in real coverage.
- Bulk-download vs per-heading fetch: download the full ~150 MB SKOS bundle and filter locally (one network pull, fast) vs per-authority-ID JSON-LD fetch (slower, simpler curation tracking). Lean: bulk-download + local filter — fewer network calls, easier reproducibility.
- Refresh cadence: yearly (per source-doc recommendation) vs quarterly. LCSH is updated weekly upstream but headings of interest to us churn slowly. Lean: yearly with a manual force-refresh CLI command for ad-hoc updates.
- Deprecation handling: LCSH retires/merges headings occasionally. Flag `is_deprecated=true` with an optional `replaced_by_authority_id` link to preserve graph integrity? Lean: yes — deprecation is a normal LCSH event.
- Josh meta-taxonomy ingester (Layer 3 from the source doc): track as a separate spec to ship later when curation is ready, or add as a v1.x scope expansion of this spec? Lean: separate spec — it's editorial work, not data-loading work, and pretending otherwise muddles the success determiner.
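If the deprecation lean above is adopted, a consumer could follow `replaced_by_authority_id` links to reach the current heading. The sketch below assumes the links have been loaded into a plain dict; `resolve_current` and that dict shape are hypothetical, not part of the spec.

```python
from typing import Dict, Optional


def resolve_current(
    authority_id: str, replaced_by: Dict[str, Optional[str]]
) -> str:
    """Follow replaced_by links until a heading has no replacement."""
    seen = {authority_id}
    current = authority_id
    while replaced_by.get(current):
        current = replaced_by[current]
        if current in seen:  # guard against a malformed replacement cycle
            raise ValueError(f"replacement cycle at {current}")
        seen.add(current)
    return current
```

Keeping deprecated rows in place is what makes this lookup possible: a mapping written against a retired authority ID still resolves instead of dangling.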
Out of scope
- Josh meta-taxonomy authoring (~30-50 hand-curated top-level topics with mappings into source vocabularies) — separate spec, deferred to v1.x when the curation work is ready.
- Per-source native taxonomy ingestion (`bill_subjects`, `fr_document_topics`, etc.) — already covered by each source ingester.
- Full LCSH ingest (hundreds of thousands of headings) without filtering; a curated subset is the v1 scope.
- Topic classification of substrate documents (assigning Josh meta-topics to bills / FR docs / etc.) — that's the meta-taxonomy's job once it ships.
- LLM-suggested topic mappings (embed unmapped values, find nearest LCSH neighbor) — v1.x prefilter for the curator.
- User-customizable per-tenant taxonomies — commercial-side concern.
- Real-time webhooks / push subscriptions — surface-layer concern.
Dependencies
Plan
Source module follows the standard Source protocol. Per-source spec at https://docs.usejosh.com/sources/topic-taxonomy/.
Discover/fetch/parse/load shape:
- discover: read shared/josh_substrate/data/topic_taxonomy_seed.yaml
listing the curated authority IDs of interest. Watermark in
ingestion_source_state.last_seen_key='lcsh-bulk-mtime' is the LoC
bulk download's Last-Modified so we only re-download when LoC has
actually re-published.
- fetch: bulk LCSH SKOS/JSON-LD from
https://id.loc.gov/download/ (~150 MB) to
/data/corpus/topic-taxonomy/bodies/raw/lcsh-bulk-{date}.jsonld.
SHA256 the download for raw_sha256 (per-row citation references
the bulk file).
- parse: stream-parse the bulk file with ijson; for each authority
ID in the seed (or matching seed criteria), emit an lcsh_subjects
row with label, broader/narrower/related authority lists, and
raw_jsonld snapshot.
- load: idempotent upsert on (authority_id). Deprecated headings
marked rather than deleted. FTS5 sync triggers fire on insert/update.
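The fetch and parse stages above might look roughly like the sketch below. It makes several assumptions that the spec does not confirm: the bulk file is treated as one expanded JSON-LD node per line (so stdlib `json` is enough here, where the spec names ijson for the real stream parse), link properties use full SKOS IRIs, and the helper names `sha256_file`, `needs_download`, and `parse_records` are hypothetical.

```python
import hashlib
import json
from typing import Iterable, Iterator, List, Optional, Set

SKOS = "http://www.w3.org/2004/02/skos/core#"


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash the download in chunks so the ~150 MB file is never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(last_seen: Optional[str], last_modified: str) -> bool:
    """Compare the stored watermark with the server's Last-Modified value."""
    return last_seen != last_modified


def _ids(node: dict, prop: str) -> List[str]:
    """Collect trailing authority IDs from a SKOS link property."""
    return [ref["@id"].rsplit("/", 1)[-1] for ref in node.get(SKOS + prop, [])]


def parse_records(lines: Iterable[str], seed: Set[str]) -> Iterator[dict]:
    """Yield one lcsh_subjects-shaped row per seeded heading (assumed input layout)."""
    for line in lines:
        if not line.strip():
            continue
        node = json.loads(line)
        authority_id = node["@id"].rsplit("/", 1)[-1]
        if authority_id not in seed:
            continue  # filter before building a row, not after
        labels = node.get(SKOS + "prefLabel", [])
        yield {
            "authority_id": authority_id,
            "label": labels[0]["@value"] if labels else None,
            "broader_authorities": _ids(node, "broader"),
            "narrower_authorities": _ids(node, "narrower"),
            "related_authorities": _ids(node, "related"),
            "raw_jsonld": line.strip(),
        }
```

Because `parse_records` is a generator over lines, it can be fed an open file handle directly, which keeps memory flat regardless of the bulk file's size.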
Migration shape: lcsh_subjects (with the six citation columns + authority_id UNIQUE + is_deprecated flag + replaced_by_authority_id
optional self-FK), lcsh_subjects_fts virtual + 3 sync triggers. No *_chunks_vec0, since labels are short and structured.
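A rough sketch of that migration shape, written against Python's built-in `sqlite3` rather than Alembic so it can run standalone. Everything beyond the names in the spec is an assumption: the column types, the `id` rowid key, storing `broader_labels` as a text column on the content table, the trigger names, and the omission of the citation columns for brevity. It also assumes an SQLite build with FTS5 and upsert support (3.24+).

```python
import sqlite3

SCHEMA = """
CREATE TABLE lcsh_subjects (
    id INTEGER PRIMARY KEY,
    authority_id TEXT NOT NULL UNIQUE,
    label TEXT NOT NULL,
    broader_labels TEXT NOT NULL DEFAULT '',
    is_deprecated INTEGER NOT NULL DEFAULT 0,
    replaced_by_authority_id TEXT REFERENCES lcsh_subjects(authority_id)
);
-- External-content FTS5 index over (label, broader_labels).
CREATE VIRTUAL TABLE lcsh_subjects_fts USING fts5(
    label, broader_labels, content='lcsh_subjects', content_rowid='id'
);
-- Three sync triggers keep the index aligned with the content table.
CREATE TRIGGER lcsh_subjects_ai AFTER INSERT ON lcsh_subjects BEGIN
    INSERT INTO lcsh_subjects_fts(rowid, label, broader_labels)
    VALUES (new.id, new.label, new.broader_labels);
END;
CREATE TRIGGER lcsh_subjects_ad AFTER DELETE ON lcsh_subjects BEGIN
    INSERT INTO lcsh_subjects_fts(lcsh_subjects_fts, rowid, label, broader_labels)
    VALUES ('delete', old.id, old.label, old.broader_labels);
END;
CREATE TRIGGER lcsh_subjects_au AFTER UPDATE ON lcsh_subjects BEGIN
    INSERT INTO lcsh_subjects_fts(lcsh_subjects_fts, rowid, label, broader_labels)
    VALUES ('delete', old.id, old.label, old.broader_labels);
    INSERT INTO lcsh_subjects_fts(rowid, label, broader_labels)
    VALUES (new.id, new.label, new.broader_labels);
END;
"""

UPSERT = """
INSERT INTO lcsh_subjects (authority_id, label, broader_labels)
VALUES (:authority_id, :label, :broader_labels)
ON CONFLICT(authority_id) DO UPDATE SET
    label = excluded.label,
    broader_labels = excluded.broader_labels
"""


def upsert_heading(conn: sqlite3.Connection, row: dict) -> None:
    """Insert a heading, or update it in place when authority_id already exists."""
    conn.execute(UPSERT, row)


def search(conn: sqlite3.Connection, query: str) -> list:
    """Return matching labels, best BM25 score first."""
    cur = conn.execute(
        "SELECT label FROM lcsh_subjects_fts "
        "WHERE lcsh_subjects_fts MATCH ? ORDER BY bm25(lcsh_subjects_fts)",
        (query,),
    )
    return [r[0] for r in cur.fetchall()]
```

With this layout, re-ingesting the same `authority_id` goes through the update trigger, so the index drops the old label and picks up the new one without a second row appearing.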
Initial seed work (one-time, before first ingest): scan all per-source
taxonomy tables present in the substrate (bill_subjects.subject, fr_document_topics.topic, gao_report_topics.topic, lda_lobbying_activities.general_issue_code, etc.), best-effort match
each to an LCSH authority ID via label match, and write the resulting topic_taxonomy_seed.yaml to the repo. As more sources land in the
substrate, the seed grows; refreshing the seed is a periodic curator task.
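The scan-then-seed step could be approximated as below. This is a sketch under stated assumptions: matching is exact after a simple normalization (the spec only says "label match"), the LCSH labels and 1-hop neighbor lists are already available as dicts, and `normalize` and `build_seed` are hypothetical helpers. Writing the YAML file itself is left out.

```python
import re
from typing import Dict, Iterable, List, Tuple


def normalize(label: str) -> str:
    """Lowercase and collapse punctuation so near-identical labels compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()


def build_seed(
    native_values: Iterable[str],
    lcsh_labels: Dict[str, str],      # authority_id -> preferred label
    neighbors: Dict[str, List[str]],  # authority_id -> broader/narrower IDs
) -> Tuple[List[str], List[str]]:
    """Return (seed authority IDs, native values that matched nothing)."""
    by_label = {normalize(label): aid for aid, label in lcsh_labels.items()}
    seed = set()
    unmatched = set()
    for value in native_values:
        authority_id = by_label.get(normalize(value))
        if authority_id is None:
            unmatched.add(value)  # left for the curator to map by hand
            continue
        seed.add(authority_id)
        seed.update(neighbors.get(authority_id, []))  # 1-hop expansion
    return sorted(seed), sorted(unmatched)
```

Returning the unmatched values alongside the seed gives the curator a concrete worklist each time the seed is refreshed.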
Schedule: yearly (LoC's update cadence is weekly upstream but headings
of interest to us churn slowly). Manual josh ingest topic-taxonomy --force-refresh available for ad-hoc updates.
Tasks
0 of 15 done.
- t1 Alembic migration applied locally + production (lcsh_subjects + FTS5)
- t2 Pydantic models in shared/josh_substrate/src/josh_substrate/models/lcsh.py
- t3 Source module implementing the `Source` protocol
- t4 Body normalization N/A (structured JSON-LD → structured rows; no Markdown body)
- t5 Citation metadata: all 6 columns populated; citation_for() formatter + 3-fixture unit tests
- t6 FTS5 virtual table + 3 sync triggers wired into migration
- t7 N/A — no body chunks, no vector pipeline
- t8 Unit tests for JSON-LD parser against captured fixtures (active heading, deprecated heading, deeply-nested broader hierarchy)
- t9 Integration test for full discover→fetch→parse→load pipeline against the LoC bulk download
- t10 Initial `topic_taxonomy_seed.yaml` authored from a substrate scan (~5,000-10,000 authority IDs covering per-source vocabularies + 1-hop neighbors)
- t11 Smoke backfill (1,000 headings from seed) clean per success_determiner
- t12 Production backfill complete (full curated seed)
- t13 Yearly cron registered and observed firing once
- t14 No recurring errors in `ingestion_logs` for the last completed run
- t15 Source doc updated; inventory rows in https://docs.usejosh.com/josh-data-sources/ + https://docs.usejosh.com/data-status/ flipped to ingestion-built (LCSH layer; meta-taxonomy explicitly deferred)
Changelog
No history yet.