source · planned · p0

Topic taxonomy ingester (LCSH subset)

topic-taxonomy-ingester · updated 2026-05-10T18:00:00Z

Pull a curated subset of the Library of Congress Subject Headings (LCSH)
into the substrate as a backbone vocabulary for cross-source topic
alignment. Each ingested source comes with its own native topic
vocabulary (bill_subjects, fr_document_topics, gao_report_topics,
lda_lobbying_activities.general_issue_code, etc.), and the per-source
vocabularies don't align. LCSH is the authority-controlled, library-
jargony, well-defined vocabulary that gives us a stable backbone to
reference; the per-source taxonomies map *into* it.

Per the topic-taxonomy source doc, the full topic story has three
layers: (1) per-source native taxonomies — already ingested as part of
each source ingester, (2) LCSH backbone — this spec, (3) Josh
meta-taxonomy of ~30-50 hand-curated top-level topics with mappings
into source vocabularies — deferred to a separate spec when the
curation work is ready. This spec is intentionally narrow: load LCSH
data into the substrate; meta-taxonomy authoring is its own decision.

As a downstream agent doing cross-source topic queries, I want an LCSH `lcsh_subjects` table with the curated subset of headings the substrate needs so that per-source native topic values can be resolved to a shared LCSH backbone when the meta-taxonomy lands.

As an OSS self-hoster, I want to run `josh ingest topic-taxonomy` on a fresh substrate so that I can populate the LCSH subset from the LoC bulk download on my own instance without designing the meta-taxonomy myself.

As a future curator authoring the Josh meta-taxonomy, I want LCSH already loaded, with authority IDs I can reference in my mappings YAML, so that my meta-taxonomy work is unblocked when the curation spec ships.

  1. When `josh ingest topic-taxonomy` is run on the deployed substrate, the system shall populate `lcsh_subjects` rows from the curated LCSH subset, with the six citation columns populated on every row.
  2. Where the curated subset is defined (initial `topic_taxonomy_seed.yaml` ships ~5,000-10,000 authority IDs covering the per-source taxonomy values that exist in the substrate, plus 1-hop broader/narrower neighbors), the system shall fetch the per-heading JSON-LD and persist `label`, `broader_authorities`, `narrower_authorities`, `related_authorities`, and `raw_jsonld`.
  3. While the ingester is running, if a single LCSH heading fetch fails, then the system shall log the failure to `ingestion_logs` (with the offending `authority_id`) and continue with the next heading.
  4. The system shall enforce a UNIQUE constraint on `lcsh_subjects.authority_id` so re-fetching the same heading upserts rather than duplicates.
  5. When an FTS5 query is executed against `lcsh_subjects_fts`, the system shall return BM25-ranked results over `(label, broader_labels)`.
  6. When a yearly refresh runs, the system shall re-fetch the curated subset and update `lcsh_subjects.fetched_at` per row; deprecated headings shall be flagged with `is_deprecated=true` rather than deleted (preserves historical references).
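
Criteria 3 and 4 can be illustrated together with a minimal sketch, using an in-memory SQLite database. The `ingestion_logs` column layout, the `fetch_heading` callable, and the `load_headings` helper are all assumptions made for illustration; the real Source module and log schema may differ.

```python
import sqlite3

SCHEMA = """
CREATE TABLE lcsh_subjects (
    id            INTEGER PRIMARY KEY,
    authority_id  TEXT NOT NULL UNIQUE,      -- criterion 4
    label         TEXT NOT NULL,
    is_deprecated INTEGER NOT NULL DEFAULT 0
);
-- Hypothetical log shape; the real ingestion_logs columns are not given here.
CREATE TABLE ingestion_logs (
    id           INTEGER PRIMARY KEY,
    authority_id TEXT,
    message      TEXT NOT NULL
);
"""

UPSERT = """
INSERT INTO lcsh_subjects (authority_id, label, is_deprecated)
VALUES (:authority_id, :label, :is_deprecated)
ON CONFLICT(authority_id) DO UPDATE SET
    label = excluded.label,
    is_deprecated = excluded.is_deprecated
"""


def load_headings(conn, authority_ids, fetch_heading):
    """Upsert each heading; log a failed fetch and move on to the next one."""
    loaded = 0
    for authority_id in authority_ids:
        try:
            heading = fetch_heading(authority_id)
        except Exception as exc:  # criterion 3: one bad heading must not stop the run
            conn.execute(
                "INSERT INTO ingestion_logs (authority_id, message) VALUES (?, ?)",
                (authority_id, f"fetch failed: {exc}"),
            )
            continue
        conn.execute(UPSERT, heading)
        loaded += 1
    conn.commit()
    return loaded


def demo():
    """Run the loader twice over the same IDs, one of which always fails."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)

    def fake_fetch(authority_id):
        if authority_id == "sh-bad":
            raise RuntimeError("HTTP 500")
        return {"authority_id": authority_id,
                "label": f"Label for {authority_id}",
                "is_deprecated": 0}

    ids = ["sh-a", "sh-bad", "sh-b"]
    load_headings(conn, ids, fake_fetch)
    load_headings(conn, ids, fake_fetch)  # re-run: upserts, no duplicates
    rows = conn.execute("SELECT COUNT(*) FROM lcsh_subjects").fetchone()[0]
    failures = conn.execute("SELECT COUNT(*) FROM ingestion_logs").fetchone()[0]
    return rows, failures
```

Running the loader twice leaves one row per successfully fetched heading, while each failed fetch adds its own log entry carrying the offending `authority_id`.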
kind: bash

Command

set -euo pipefail
# Smoke ingest: --max 1000 caps the discover stage.
ssh josh 'docker exec josh-ingester josh ingest topic-taxonomy --max 1000'
# Count rows whose citation columns are populated.
ssh josh 'docker exec josh-core sqlite3 /data/josh.db "
  SELECT COUNT(*) FROM lcsh_subjects
  WHERE source_url IS NOT NULL AND source_org IS NOT NULL
    AND retrieved_at IS NOT NULL AND raw_sha256 IS NOT NULL
    AND citation_string IS NOT NULL;"'
# Confirm the FTS5 index answers a MATCH query.
ssh josh 'docker exec josh-core sqlite3 /data/josh.db \
  "SELECT COUNT(*) FROM lcsh_subjects_fts WHERE lcsh_subjects_fts MATCH '\''health'\'';"'

Expect

>= 1000 fully-citation-stamped rows; FTS5 match returns >= 1.

Smoke run only — `--max 1000` caps the discover stage. Full curated subset (~5,000-10,000 authority IDs based on per-source taxonomy coverage) runs detached on the server per `add-a-new-source.html`. Bulk download is ~150 MB JSON-LD; we filter aggressively before inserting.

  • Curation seed: hand-author the initial `topic_taxonomy_seed.yaml` (list of authority IDs) by running a one-off scan of all per-source taxonomy values present in the substrate, mapping each to LCSH best-guesses, plus 1-hop neighbors? Or seed with a vanilla LCSH top-N and grow on demand? Lean: scan-then-seed once we have multiple sources loaded — the seed is grounded in real coverage.
  • Bulk-download vs per-heading fetch: download the full ~150 MB SKOS bundle and filter locally (one network pull, fast) vs per-authority-ID JSON-LD fetch (slower, simpler curation tracking). Lean: bulk-download + local filter — fewer network calls, easier reproducibility.
  • Refresh cadence: yearly (per source-doc recommendation) vs quarterly. LCSH is updated weekly upstream but headings of interest to us churn slowly. Lean: yearly with a manual force-refresh CLI command for ad-hoc updates.
  • Deprecation handling: LCSH retires/merges headings occasionally. Flag `is_deprecated=true` with an optional `replaced_by_authority_id` link to preserve graph integrity? Lean: yes — deprecation is a normal LCSH event.
  • Josh meta-taxonomy ingester (Layer 3 from the source doc): track as a separate spec to ship later when curation is ready, or add as a v1.x scope expansion of this spec? Lean: separate spec — it's editorial work, not data-loading work, and pretending otherwise muddles the success determiner.
  • Josh meta-taxonomy authoring (~30-50 hand-curated top-level topics with mappings into source vocabularies) — separate spec, deferred to v1.x when the curation work is ready.
  • Per-source native taxonomy ingestion (`bill_subjects`, `fr_document_topics`, etc.) — already covered by each source ingester.
  • Full, unfiltered LCSH ingest (every heading in the bulk download, far more than the ~5,000-10,000 in the curated subset): a curated subset is the v1 scope.
  • Topic classification of substrate documents (assigning Josh meta-topics to bills / FR docs / etc.) — that's the meta-taxonomy's job once it ships.
  • LLM-suggested topic mappings (embed unmapped values, find nearest LCSH neighbor) — v1.x prefilter for the curator.
  • User-customizable per-tenant taxonomies — commercial-side concern.
  • Real-time webhooks / push subscriptions — surface-layer concern.

Source module follows the standard Source protocol. Per-source spec at
https://docs.usejosh.com/sources/topic-taxonomy/.

Discover/fetch/parse/load shape:
- discover: read shared/josh_substrate/data/topic_taxonomy_seed.yaml
  listing the curated authority IDs of interest. Watermark in
  ingestion_source_state.last_seen_key='lcsh-bulk-mtime' is the LoC
  bulk download's Last-Modified so we only re-download when LoC has
  actually re-published.
- fetch: bulk LCSH SKOS/JSON-LD from
  https://id.loc.gov/download/ (~150 MB) to
  /data/corpus/topic-taxonomy/bodies/raw/lcsh-bulk-{date}.jsonld.
  SHA256 the download for raw_sha256 (per-row citation references
  the bulk file).
- parse: stream-parse the bulk file with ijson; for each authority
  ID in the seed (or matching seed criteria), emit an lcsh_subjects
  row with label, broader/narrower/related authority lists, and
  raw_jsonld snapshot.
- load: idempotent upsert on (authority_id). Deprecated headings
  marked rather than deleted. FTS5 sync triggers fire on insert/update.
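
The parse stage above can be sketched roughly as follows. This is an illustration under loud assumptions: it treats the bulk file as line-delimited JSON with a flattened record shape (`@id`, `prefLabel`, `broader`, `narrower`, `related`), which is a simplification of real LoC JSON-LD, and it uses the standard-library `json` module one line at a time as a stand-in for the ijson streaming the spec calls for. The function names are hypothetical.

```python
import json


def authority_id_from_uri(uri):
    """Take the last path segment of an authority URI as the authority ID."""
    return uri.rstrip("/").rsplit("/", 1)[-1]


def parse_bulk(lines, seed_ids):
    """Yield one lcsh_subjects-shaped dict per seed heading found in `lines`.

    `lines` is any iterable of JSON strings (for example an open file), so the
    whole bulk download never has to be held in memory at once.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        authority_id = authority_id_from_uri(record["@id"])
        if authority_id not in seed_ids:
            continue  # filter aggressively before anything reaches the load stage
        yield {
            "authority_id": authority_id,
            "label": record.get("prefLabel", ""),
            "broader_authorities": [
                authority_id_from_uri(u) for u in record.get("broader", [])
            ],
            "narrower_authorities": [
                authority_id_from_uri(u) for u in record.get("narrower", [])
            ],
            "related_authorities": [
                authority_id_from_uri(u) for u in record.get("related", [])
            ],
            "raw_jsonld": line,  # snapshot of the record exactly as read
        }
```

Because `parse_bulk` is a generator, the load stage can consume rows as they are produced and upsert them in batches.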

Migration shape: lcsh_subjects (with the six citation columns +
authority_id UNIQUE + is_deprecated flag + replaced_by_authority_id
optional self-FK), lcsh_subjects_fts virtual + 3 sync triggers. No
*_chunks_vec0 — labels are short and structured.
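
One way the migration shape could look in SQLite is sketched below, using an external-content FTS5 table kept in sync by three triggers. Several details are assumptions: `broader_labels` is modelled as a denormalized text column, only the five citation columns named in the smoke check appear (the sixth is not named in this spec), and the trigger names and `search` helper are hypothetical. The real Alembic migration may differ.

```python
import sqlite3

MIGRATION = """
CREATE TABLE lcsh_subjects (
    id                       INTEGER PRIMARY KEY,
    authority_id             TEXT NOT NULL UNIQUE,
    label                    TEXT NOT NULL,
    broader_labels           TEXT NOT NULL DEFAULT '',  -- assumed FTS helper column
    is_deprecated            INTEGER NOT NULL DEFAULT 0,
    replaced_by_authority_id TEXT REFERENCES lcsh_subjects(authority_id),
    source_url TEXT, source_org TEXT, retrieved_at TEXT,
    raw_sha256 TEXT, citation_string TEXT
);

CREATE VIRTUAL TABLE lcsh_subjects_fts USING fts5(
    label, broader_labels,
    content='lcsh_subjects', content_rowid='id'
);

CREATE TRIGGER lcsh_subjects_ai AFTER INSERT ON lcsh_subjects BEGIN
    INSERT INTO lcsh_subjects_fts(rowid, label, broader_labels)
    VALUES (new.id, new.label, new.broader_labels);
END;

CREATE TRIGGER lcsh_subjects_ad AFTER DELETE ON lcsh_subjects BEGIN
    INSERT INTO lcsh_subjects_fts(lcsh_subjects_fts, rowid, label, broader_labels)
    VALUES ('delete', old.id, old.label, old.broader_labels);
END;

CREATE TRIGGER lcsh_subjects_au AFTER UPDATE ON lcsh_subjects BEGIN
    INSERT INTO lcsh_subjects_fts(lcsh_subjects_fts, rowid, label, broader_labels)
    VALUES ('delete', old.id, old.label, old.broader_labels);
    INSERT INTO lcsh_subjects_fts(rowid, label, broader_labels)
    VALUES (new.id, new.label, new.broader_labels);
END;
"""


def open_db():
    """Create an in-memory database with the sketched schema applied."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(MIGRATION)
    return conn


def search(conn, query, limit=10):
    """Return authority IDs for an FTS5 query, best BM25 match first."""
    sql = """
        SELECT lcsh_subjects.authority_id
        FROM lcsh_subjects_fts
        JOIN lcsh_subjects ON lcsh_subjects.id = lcsh_subjects_fts.rowid
        WHERE lcsh_subjects_fts MATCH ?
        ORDER BY bm25(lcsh_subjects_fts)
        LIMIT ?
    """
    return [row[0] for row in conn.execute(sql, (query, limit))]
```

FTS5's `bm25()` returns lower values for better matches, so a plain ascending `ORDER BY` puts the most relevant heading first.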

Initial seed work (one-time, before first ingest): scan all per-source
taxonomy tables present in the substrate (bill_subjects.subject,
fr_document_topics.topic, gao_report_topics.topic,
lda_lobbying_activities.general_issue_code, etc.), best-effort match
each to an LCSH authority ID via label match, write the resulting
topic_taxonomy_seed.yaml to the repo. As more sources land in the
substrate, the seed grows; refreshing the seed is a periodic curator task.
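
The seed scan described above might look something like this minimal sketch. It assumes exact matching on whitespace- and case-normalized labels, which is only one possible reading of "best-effort"; the `normalize` and `build_seed` helpers and their inputs are hypothetical, and writing the result out as topic_taxonomy_seed.yaml is left out.

```python
def normalize(label):
    """Case-fold and collapse whitespace so trivial variants compare equal."""
    return " ".join(label.casefold().split())


def build_seed(source_values, lcsh_labels, neighbors):
    """Match per-source topic values to LCSH authority IDs by label.

    source_values: iterable of native topic strings from the per-source tables.
    lcsh_labels:   dict of authority_id -> LCSH label.
    neighbors:     dict of authority_id -> iterable of 1-hop broader/narrower IDs.

    Returns (sorted seed IDs, sorted unmatched source values).
    """
    index = {normalize(label): aid for aid, label in lcsh_labels.items()}
    matched, unmatched = set(), set()
    for value in source_values:
        aid = index.get(normalize(value))
        if aid is None:
            unmatched.add(value)  # left for the curator to resolve by hand
        else:
            matched.add(aid)
    seed = set(matched)
    for aid in matched:
        seed.update(neighbors.get(aid, ()))  # pull in 1-hop neighbors
    return sorted(seed), sorted(unmatched)
```

Returning the unmatched values alongside the seed gives the curator a worklist for the periodic seed refresh.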

Schedule: yearly (LoC's update cadence is weekly upstream but headings
of interest to us churn slowly). Manual
`josh ingest topic-taxonomy --force-refresh` available for ad-hoc updates.
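
How the Last-Modified watermark and the force-refresh flag could combine into a single re-download decision is sketched below. It assumes the watermark stores the raw HTTP-date string from the previous download; `should_download` is a hypothetical helper, not part of the documented CLI.

```python
from email.utils import parsedate_to_datetime


def should_download(stored_watermark, last_modified_header, force_refresh=False):
    """Decide whether the LCSH bulk file needs to be fetched again.

    stored_watermark:     HTTP-date string saved from the last download, or None.
    last_modified_header: HTTP-date string from the current Last-Modified header.
    force_refresh:        True when the operator asked for an ad-hoc refresh.
    """
    if force_refresh or stored_watermark is None:
        return True
    previous = parsedate_to_datetime(stored_watermark)
    current = parsedate_to_datetime(last_modified_header)
    return current > previous  # only re-download when upstream has re-published
```

Comparing parsed datetimes rather than raw strings avoids a spurious re-download if the header's formatting changes while the timestamp itself does not.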

0 of 15 done.

  • t1 Alembic migration applied locally + production (lcsh_subjects + FTS5)
  • t2 Pydantic models in shared/josh_substrate/src/josh_substrate/models/lcsh.py
  • t3 Source module implementing the `Source` protocol
  • t4 Body normalization N/A (structured JSON-LD → structured rows; no Markdown body)
  • t5 Citation metadata: all 6 columns populated; citation_for() formatter + 3-fixture unit tests
  • t6 FTS5 virtual table + 3 sync triggers wired into migration
  • t7 N/A — no body chunks, no vector pipeline
  • t8 Unit tests for JSON-LD parser against captured fixtures (active heading, deprecated heading, deeply-nested broader hierarchy)
  • t9 Integration test for full discover→fetch→parse→load pipeline against the LoC bulk download
  • t10 Initial `topic_taxonomy_seed.yaml` authored from a substrate scan (~5,000-10,000 authority IDs covering per-source vocabularies + 1-hop neighbors)
  • t11 Smoke backfill (1,000 headings from seed) clean per success_determiner
  • t12 Production backfill complete (full curated seed)
  • t13 Yearly cron registered and observed firing once
  • t14 No recurring errors in `ingestion_logs` for the last completed run
  • t15 Source doc updated; inventory rows in https://docs.usejosh.com/josh-data-sources/ + https://docs.usejosh.com/data-status/ flipped to ingestion-built (LCSH layer; meta-taxonomy explicitly deferred)

No history yet.

docs/spec/topic-taxonomy-ingester.html · generated by bin/build-spec.py