Topic / issue taxonomy
A normalized vocabulary for tagging the substrate's content by issue area. Used for: filtering ("show me bills tagged Health"), faceted browsing in the agent UI, and cross-source joins ("CRS reports about Defense" + "Federal Register rules tagged Defense" + "GAO reports tagged Defense").
The challenge: each ingested source comes with its own topic vocabulary, and they don't align. Bills use the Library of Congress legislative subjects (~1,000+ values per congress). Federal Register uses ~50 free-text TOC topics (Health, Banks, banking, etc.). GAO uses ~30 internal topics. CRS uses its own. CFR has agencies. LDA has issue codes (3-letter — TAX, HCR, etc.).
For Josh, we don't try to standardize all of these into one taxonomy. We do two things:
- Preserve each source's native taxonomy in source-specific tables (already in the per-source schemas: bill_subjects, fr_document_topics, gao_report_topics, etc.).
- Build a thin meta-taxonomy of ~30-50 high-level Josh-canonical topics with mappings into each source's native taxonomy. This lets the agent answer "show me everything Josh has on Health Care" by translating into per-source queries.
This is a thin doc — taxonomy is a small dataset where the engineering complexity is in the mapping, not the data shape.
Quick reference
| Source name | Topic / issue taxonomy |
| Publishers | Library of Congress (legislative subjects). Various federal agencies (per-source taxonomies). Josh itself (the meta-taxonomy). |
| License | LCSH: free; per-source taxonomies: public domain; Josh meta-taxonomy: Josh-authored, Apache 2.0 with the rest of the substrate. |
| Coverage | LCSH: thousands of headings. Federal Register topic enum: ~50. GAO ~30. CRS ~70. LDA ~80 issue codes. |
| Volume | <50 MB total |
| Storage estimate | <100 MB |
| Source layers | (1) Native taxonomies per source — already ingested. (2) Library of Congress Subject Headings (LCSH) — bulk download. (3) Josh meta-taxonomy — hand-curated YAML in this repo. |
| Auth | None |
| Stable ID format | Josh topic: topic:{slug} (e.g. topic:health-care); LCSH: lcsh:{authority_id} (e.g. lcsh:sh85060417) |
| Status | exploring — schema drafted, meta-taxonomy not yet curated |
Source priority decision
Primary: Josh meta-taxonomy (hand-curated). A YAML file in this repo (docs/data/topic-taxonomy.yaml) defining ~30-50 top-level topics, each with:
- A canonical slug (health-care, national-security, tax-policy).
- A display name and description.
- Mappings into each source's native taxonomy (e.g., health-care maps to bill subjects "Health", "Public health", "Medicare", "Medicaid", "Health insurance", "Mental health"; FR topics "Health", "Public health"; GAO topic "Health Care"; LDA issue code "HCR"; CFR titles 21, 42, 45).
This is us doing the editorial work of cross-source mapping — analogous to how Wikipedia categories or Schema.org vocabulary work.
Secondary: Library of Congress Subject Headings (LCSH). Authority-controlled vocabulary used by libraries. Provides a stable, well-defined backbone we can reference. Bulk download available from the Library of Congress Linked Data Service.
Skip: build our own large taxonomy. Re-creating ~10K topic terms is a years-long curation effort. Lean: use LCSH as the deep vocabulary; build only ~30-50 high-level Josh meta-topics that bridge sources.
What we're not doing
- Not classifying every bill / FR doc / report against the meta-taxonomy. The native taxonomies on each source already do that work; the meta-taxonomy is the navigation layer, not per-document tagging.
- Not exposing LCSH directly to users. LCSH is too granular and library-jargony. Use as backbone; surface our meta-taxonomy.
- Not maintaining LCSH ourselves — pull from LoC.
LCSH and per-source native taxonomies
LCSH is a library-controlled vocabulary published by the Library of Congress as bulk JSON-LD (also SKOS and MADS), updated weekly; we ingest a curated subset. Browse, bulk download, and per-heading JSON-LD all live under the Library of Congress Linked Data Service. The per-heading JSON-LD is what lcsh_subjects.raw_jsonld stores.
Per-source native taxonomies are already ingested and documented on each per-source page — we reference their values via topic_mappings, we don't re-describe them here:
- bill_subjects (Bills): LoC legislative subjects. The richest legislative taxonomy.
- fr_document_topics (Federal Register): FR's free-text topic list.
- gao_report_topics (GAO reports): GAO's internal categorization.
- crs_report_topics (CRS reports): CRS-assigned topics.
- LDA issue codes (lda_lobbying_activities.general_issue_code, see LDA filings).
Stable ID formats
Josh meta-topic: topic:{slug}, where the slug is lowercase kebab-case.
Examples:
- topic:health-care
- topic:national-security
- topic:tax-policy
- topic:agriculture
- topic:transportation
- topic:civil-rights
LCSH heading: lcsh:{authority_id} — using LoC's authority IDs (sh85060417, sh85042567).
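
The two ID formats above are simple enough to build and validate in one place. The following is a minimal sketch, not the project's actual helper: the function names topic_id and lcsh_id are hypothetical, and the "sh" + digits pattern is an assumption based on the example authority IDs shown here.

```python
import re

# Assumed shape of LoC subject authority IDs, based on examples such as
# "sh85060417". This pattern is an assumption, not an LoC specification.
_LCSH_AUTHORITY = re.compile(r"^sh\d+$")


def topic_id(name: str) -> str:
    """Build a Josh topic ID ("topic:{slug}") from a display name."""
    # Lowercase, then collapse every run of non-alphanumerics into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    if not slug:
        raise ValueError(f"cannot derive a slug from {name!r}")
    return f"topic:{slug}"


def lcsh_id(authority_id: str) -> str:
    """Build an LCSH stable ID ("lcsh:{authority_id}")."""
    if not _LCSH_AUTHORITY.match(authority_id):
        raise ValueError(f"unexpected LCSH authority ID: {authority_id!r}")
    return f"lcsh:{authority_id}"
```

For instance, topic_id("Health Care") yields topic:health-care and lcsh_id("sh85060417") yields lcsh:sh85060417.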
Schema (Postgres DDL)

    -- ============================================================
    -- Josh meta-taxonomy (hand-curated)
    -- ============================================================
    CREATE TABLE josh_topics (
      id           text PRIMARY KEY,                  -- 'topic:health-care'
      slug         text NOT NULL UNIQUE,
      name         text NOT NULL,                     -- 'Health Care'
      description  text,
      parent_id    text REFERENCES josh_topics(id),   -- optional hierarchy

      -- Display
      sort_order   int,
      icon         text,                              -- icon hint for UI
      color        text,                              -- hex color hint for UI

      -- Lifecycle
      is_active    boolean NOT NULL DEFAULT true,
      inserted_at  timestamptz NOT NULL DEFAULT now(),
      updated_at   timestamptz NOT NULL DEFAULT now()
    );

    CREATE INDEX josh_topics_parent ON josh_topics (parent_id) WHERE parent_id IS NOT NULL;


    -- ============================================================
    -- Cross-source mappings: Josh meta-topic → native taxonomy values
    -- ============================================================
    CREATE TABLE topic_mappings (
      id             bigserial PRIMARY KEY,
      josh_topic_id  text NOT NULL REFERENCES josh_topics(id) ON DELETE CASCADE,
      source_kind    text NOT NULL CHECK (source_kind IN (
        'bill_subject', 'fr_topic', 'gao_topic', 'crs_topic',
        'lda_issue_code', 'cfr_title', 'lcsh', 'agency_id'
      )),
      source_value   text NOT NULL,   -- e.g. 'Health', 'HCR', 'sh85060417', '42'
      confidence     text NOT NULL CHECK (confidence IN ('exact', 'broad', 'narrow', 'related')),
      UNIQUE (josh_topic_id, source_kind, source_value)
    );

    CREATE INDEX topic_mappings_source ON topic_mappings (source_kind, source_value);


    -- ============================================================
    -- Library of Congress Subject Headings (subset)
    -- ============================================================
    CREATE TABLE lcsh_subjects (
      id                    text PRIMARY KEY,       -- 'lcsh:sh85060417'
      authority_id          text NOT NULL UNIQUE,   -- 'sh85060417'
      label                 text NOT NULL,          -- 'Health care reform'
      broader_authorities   text[],                 -- parent authority IDs
      narrower_authorities  text[],                 -- child authority IDs
      related_authorities   text[],                 -- see-also
      raw_jsonld            jsonb,
      fetched_at            timestamptz NOT NULL,
      updated_at            timestamptz NOT NULL DEFAULT now()
    );

    CREATE INDEX lcsh_subjects_label ON lcsh_subjects USING gin (to_tsvector('english', label));
    CREATE INDEX lcsh_subjects_broader ON lcsh_subjects USING gin (broader_authorities);

Schema decisions worth flagging:
- josh_topics is small (~30-50 rows) and hand-curated. It lives in a YAML file in the repo (docs/data/topic-taxonomy.yaml) for code-review-style maintenance. A migration loads it into the DB at deploy time.
- topic_mappings is the cross-source bridge. One row per (josh_topic, source_kind, source_value). The confidence enum follows SKOS conventions: exact (1:1), broad (Josh topic is broader than source value), narrow (Josh topic is narrower), related (overlapping but not subset).
- LCSH is stored as a partial subset, not the full hierarchy. Pull the headings we actually reference from topic_mappings plus their immediate broader/narrower neighbors for browsing. Full LCSH is overkill.
- No per-document tagging at the Josh meta-topic layer. When a user browses "Health Care," we resolve to the source values and federate the query. (This is similar to how multi-source search engines work.)
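
To make the "resolve, then federate" step concrete, here is a rough sketch of the resolution half only, working on in-memory rows shaped like topic_mappings. The Mapping tuple and resolve_topic function are hypothetical names; the real federation logic is expected to live elsewhere and may look quite different.

```python
from collections import defaultdict
from typing import NamedTuple


class Mapping(NamedTuple):
    """One topic_mappings row (illustrative in-memory stand-in)."""
    josh_topic_id: str
    source_kind: str
    source_value: str
    confidence: str


def resolve_topic(topic_id: str, mappings: list[Mapping]) -> dict[str, list[str]]:
    """Group a topic's mapped native values by source_kind.

    Each key of the result corresponds to one per-source query the
    caller would run against that source's native taxonomy table.
    """
    by_kind: dict[str, list[str]] = defaultdict(list)
    for m in mappings:
        if m.josh_topic_id == topic_id:
            by_kind[m.source_kind].append(m.source_value)
    return dict(by_kind)


# Sample rows taken from the health-care example in this document.
SAMPLE = [
    Mapping("topic:health-care", "bill_subject", "Health", "exact"),
    Mapping("topic:health-care", "bill_subject", "Medicare", "narrow"),
    Mapping("topic:health-care", "lda_issue_code", "HCR", "exact"),
    Mapping("topic:national-security", "lda_issue_code", "DEF", "exact"),
]

health = resolve_topic("topic:health-care", SAMPLE)
```

With the sample rows, health holds two bill subjects and one LDA issue code, and the national-security row is left out. The confidence values on the sample rows are illustrative only.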
Curating the meta-taxonomy (the work)
The meta-taxonomy YAML follows this shape:

    - id: topic:health-care
      name: Health Care
      description: Health care policy, public health, Medicare, Medicaid, health insurance.
      parent: null
      mappings:
        bill_subject:
          - Health
          - Public health
          - Medicare
          - Medicaid
          - Health insurance
          - Mental health
          - Pharmaceuticals
        fr_topic:
          - Health
          - Health and public welfare
          - Public health
        gao_topic:
          - Health Care
        crs_topic:
          - Health Policy
        lda_issue_code:
          - HCR
          - PHA
          - MMM
        cfr_title:
          - "21"
          - "42"
          - "45"

    - id: topic:national-security
      name: National Security
      description: Defense, intelligence, homeland security, foreign affairs.
      parent: null
      mappings:
        bill_subject:
          - Armed forces and national security
          - Defense
          - Intelligence
          - Homeland security
          - Foreign trade and international finance
          - International affairs
        fr_topic:
          - Defense
          - World
        gao_topic:
          - Defense
          - Homeland Security
          - International Affairs
        lda_issue_code:
          - DEF
          - HOM
          - INT
        cfr_title:
          - "32"
          - "22"
          - "6"

The curation work is ~1-2 weeks initially, plus monthly maintenance.
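
As a sketch of what the deploy-time load might do with one parsed YAML entry, the function below flattens a topic into rows shaped like topic_mappings. It assumes the YAML has already been parsed into plain dicts (the parser is out of scope here). The YAML shape above carries no confidence field while the table requires one, so the sketch takes a default as a parameter; that default, and the name mapping_rows, are assumptions rather than decisions made in this document.

```python
# Values allowed by the CHECK constraints on topic_mappings.
ALLOWED_KINDS = {
    "bill_subject", "fr_topic", "gao_topic", "crs_topic",
    "lda_issue_code", "cfr_title", "lcsh", "agency_id",
}
ALLOWED_CONFIDENCE = {"exact", "broad", "narrow", "related"}


def mapping_rows(topic: dict, default_confidence: str = "related") -> list[tuple]:
    """Flatten one parsed topic into (topic_id, kind, value, confidence) rows."""
    if default_confidence not in ALLOWED_CONFIDENCE:
        raise ValueError(f"unknown confidence: {default_confidence!r}")
    rows = []
    for source_kind, values in (topic.get("mappings") or {}).items():
        if source_kind not in ALLOWED_KINDS:
            raise ValueError(f"{topic['id']}: unknown source_kind {source_kind!r}")
        for value in values:
            # source_value is a text column, so coerce e.g. an unquoted 42.
            rows.append((topic["id"], source_kind, str(value), default_confidence))
    return rows


# Abbreviated version of the health-care entry shown above.
EXAMPLE_TOPIC = {
    "id": "topic:health-care",
    "name": "Health Care",
    "parent": None,
    "mappings": {
        "gao_topic": ["Health Care"],
        "lda_issue_code": ["HCR", "PHA", "MMM"],
        "cfr_title": ["21", "42", "45"],
    },
}

example_rows = mapping_rows(EXAMPLE_TOPIC)
```

Rejecting unknown source_kind values in the loader surfaces typos in the YAML at review time instead of as a constraint violation during the migration.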
Volume
| Metric | Value |
|---|---|
| Josh meta-topics | ~30-50 |
| Mappings per topic | ~10-30 source values across all source kinds |
| Total mapping rows | ~1,000-3,000 |
| LCSH subset (rows we reference + neighbors) | ~5,000-10,000 |
| Bill subjects (see Bills) | ~1,000+ per congress |
| FR topics (see Federal Register) | ~50 |
| GAO topics | ~30 |
| LDA issue codes | ~80 |
Postgres footprint: <100 MB.
Caching / incremental sync
- Yearly: refresh LCSH subset. Pull the LCSH bulk download (~150 MB JSON-LD) from the Library of Congress Linked Data Service. Upsert headings we reference + neighbors.
- Per-deploy: refresh josh_topics and topic_mappings from the repo YAML.
- Quarterly: review per-source taxonomy drift. When a new bill subject appears that's not in any mapping, it falls into "uncategorized"; surface it in the admin UI for curator review.
Schema (continued — taxonomy review queue)

    -- A queue of unmapped source-taxonomy values for curator review
    CREATE TABLE topic_unmapped_values (
      id                  bigserial PRIMARY KEY,
      source_kind         text NOT NULL,
      source_value        text NOT NULL,
      first_seen_at       timestamptz NOT NULL,
      last_seen_at        timestamptz NOT NULL,
      occurrence_count    int NOT NULL DEFAULT 1,
      suggested_topic_id  text,   -- ML-suggested Josh topic
      reviewed_at         timestamptz,
      reviewed_decision   text CHECK (reviewed_decision IN ('mapped', 'rejected', 'pending') OR reviewed_decision IS NULL),
      UNIQUE (source_kind, source_value)
    );

    CREATE INDEX topic_unmapped_review ON topic_unmapped_values (occurrence_count DESC) WHERE reviewed_at IS NULL;

Download / update strategy
Curating the meta-taxonomy
Manual process. Initial ~30-50 topics curated by Robert + reviewed against actual user queries early on. Updates via PR to docs/data/topic-taxonomy.yaml.
LCSH ingestion
- Download the bulk LCSH JSON-LD from the Library of Congress Linked Data Service (~150 MB; format/size are upstream-owned and change without notice).
- Filter to subjects referenced by our topic_mappings + 1-hop broader/narrower.
- Upsert into lcsh_subjects.
- Yearly refresh.
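
The filter step above (referenced headings plus 1-hop neighbors) can be sketched as follows. This assumes the bulk download has already been parsed into a dict keyed by authority ID with "broader" and "narrower" lists; that intermediate shape, the function name lcsh_subset, and the short IDs in the example are all made up for illustration.

```python
def lcsh_subset(referenced: set[str], headings: dict[str, dict]) -> set[str]:
    """Return referenced authority IDs plus their 1-hop broader/narrower neighbors."""
    keep: set[str] = set()
    for authority_id in referenced:
        heading = headings.get(authority_id)
        if heading is None:
            # Referenced in topic_mappings but absent from the download,
            # e.g. a retired or merged heading. Skipped here; a real job
            # would probably log it for curator attention.
            continue
        keep.add(authority_id)
        neighbors = (*heading.get("broader", ()), *heading.get("narrower", ()))
        # Only keep neighbors that actually exist in the download.
        keep.update(n for n in neighbors if n in headings)
    return keep


# Tiny made-up hierarchy: sh0 > sh1 > sh2 > sh3.
HEADINGS = {
    "sh0": {"broader": [], "narrower": ["sh1"]},
    "sh1": {"broader": ["sh0"], "narrower": ["sh2"]},
    "sh2": {"broader": ["sh1"], "narrower": ["sh3"]},
    "sh3": {"broader": ["sh2"], "narrower": []},
}

subset = lcsh_subset({"sh1", "sh9"}, HEADINGS)
```

Here sh1 is referenced, so sh0 and sh2 come along as neighbors; sh3 is two hops away and is left out, and the unknown sh9 is skipped.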
Drift detection
- Daily: scan bill_subjects.subject etc. for values not in any topic_mappings.source_value.
- Insert into topic_unmapped_values with occurrence count.
- Surface high-occurrence unmapped values in admin UI for curator review.
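
The core of the daily scan is a set difference with counting. A minimal in-memory sketch, assuming the observed (source_kind, source_value) pairs have already been read from the per-source tables, one pair per tagged document; the function name unmapped_counts and the sample values are hypothetical.

```python
from collections import Counter
from typing import Iterable


def unmapped_counts(
    observed: Iterable[tuple[str, str]],
    mapped: set[tuple[str, str]],
) -> Counter:
    """Count occurrences of (source_kind, source_value) pairs with no mapping."""
    counts: Counter = Counter()
    for pair in observed:
        if pair not in mapped:
            counts[pair] += 1
    return counts


# Pairs already covered by topic_mappings.
MAPPED = {("bill_subject", "Health"), ("lda_issue_code", "HCR")}

# One entry per tagged document seen during the scan (made-up data).
OBSERVED = [
    ("bill_subject", "Health"),
    ("bill_subject", "Artificial intelligence"),
    ("bill_subject", "Artificial intelligence"),
    ("fr_topic", "Drones"),
]

drift = unmapped_counts(OBSERVED, MAPPED)
review_order = drift.most_common()  # highest occurrence first
```

Each entry in review_order would become one upsert into topic_unmapped_values, with the count feeding occurrence_count.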
State tracking
Source key: topic_taxonomy_curation. Manual flow.
Failure modes
- Curation lag. New bill subjects appear faster than we can curate. Drift detection mitigates; the agent surfaces "uncategorized" cleanly.
- Cross-source ambiguity. A bill tagged "Civil rights" and an FR topic "Civil rights" should both map to topic:civil-rights. Identical subject names across sources usually mean the same thing and map cleanly; document the edge cases where they do not.
- Conflicting confidence. A native value might be narrow to one Josh topic and broad to another. Allow multiple mappings; the agent's federation prioritizes exact > broad > narrow > related.
- LCSH versioning. LCSH headings get retired or merged occasionally. The bulk download tracks deprecations; honor them.
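
The priority order in the "Conflicting confidence" item amounts to a sort key. A small sketch, assuming the candidates for one native value arrive as (josh_topic_id, confidence) pairs; the name rank_topics and the sample pairs are illustrative.

```python
# Lower rank wins: exact > broad > narrow > related.
CONFIDENCE_RANK = {"exact": 0, "broad": 1, "narrow": 2, "related": 3}


def rank_topics(candidates: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Order (josh_topic_id, confidence) pairs by confidence priority.

    sorted() is stable, so candidates with equal confidence keep
    their input order.
    """
    return sorted(candidates, key=lambda c: CONFIDENCE_RANK[c[1]])


# One native value mapped to three Josh topics (made-up example).
CANDIDATES = [
    ("topic:tax-policy", "related"),
    ("topic:health-care", "narrow"),
    ("topic:medicare", "exact"),
]

ranked = rank_topics(CANDIDATES)
```

In this example topic:medicare comes first because its mapping is exact, followed by the narrow and then the related mapping.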
Open questions
These don't block ingestion but should be resolved before this source is "shipped":
- Top-level topic count. ~30-50 is a guess; the right number emerges from real user queries. Start narrow, expand.
- Hierarchy depth. Should topic:medicare be a child of topic:health-care, or both top-level? Lean: 2-level hierarchy max (parent + child). Deeper hierarchies are SEO-resistant.
- ML-suggested topic mappings. When a new source value appears, embed it and find the nearest existing Josh topic. Useful prefilter for curator review. Defer to v1.x.
- User-customizable taxonomies. Multi-tenant deployments might want their own taxonomies. Schema accommodates by adding a tenant_id to josh_topics (NULL = global default). Defer.
- Cross-source query federation. When the agent gets "show me Health Care content," it queries each source's native subjects via the mapping table and unions. The federation logic lives in josh-core, not here.
- LCSH licensing details. LCSH is freely available but the LoC reserves usage rules. Verify our use is in compliance.
- CFR title as topic mapping. CFR titles are agency-organized, not topic-organized: Title 21 is "Food and Drugs," but Title 26 is "Internal Revenue" (tax) and Title 17 is "Commodity and Securities Exchanges" (financial). The mapping is loose; treat with low confidence.
- Agency cross-mapping. Sources cite agencies (FR, regulations.gov, GAO recommendations). Should topic_mappings include agency_id mappings, or is that a separate concern? Schema allows; usage is open.
- Static UI taxonomy vs dynamic discovery. The agent UI may want a stable left-nav of topics; we could harden the meta-taxonomy as nav while letting native subjects power filters. Lean: yes, treat meta-taxonomy as nav backbone.
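
If the 2-level lean from the hierarchy-depth question is adopted, the YAML loader could enforce it with a check like the sketch below. It assumes the parent links have been collected into a dict of topic ID to parent ID (None for top-level topics); the function name check_depth and the max_levels parameter are hypothetical.

```python
from typing import Optional


def check_depth(parents: dict[str, Optional[str]], max_levels: int = 2) -> None:
    """Raise ValueError if any topic sits deeper than max_levels,
    points at an unknown parent, or is part of a cycle."""
    for topic in parents:
        seen = {topic}
        levels = 1
        current = parents[topic]
        while current is not None:
            if current in seen:
                raise ValueError(f"cycle in topic hierarchy at {current}")
            if current not in parents:
                raise ValueError(f"{topic} has unknown parent {current}")
            seen.add(current)
            levels += 1
            if levels > max_levels:
                raise ValueError(f"{topic} is nested deeper than {max_levels} levels")
            current = parents[current]


# Passes: one parent level and one child level.
check_depth({
    "topic:health-care": None,
    "topic:medicare": "topic:health-care",
})
```

Running this at load time would reject a third level (for example a child of topic:medicare) before it reaches josh_topics.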