Topic / issue taxonomy

A normalized vocabulary for tagging the substrate's content by issue area. Used for: filtering ("show me bills tagged Health"), faceted browsing in the agent UI, and cross-source joins ("CRS reports about Defense" + "Federal Register rules tagged Defense" + "GAO reports tagged Defense").

The challenge: each ingested source comes with its own topic vocabulary, and the vocabularies don't align. Bills use the Library of Congress legislative subjects (~1,000+ values per congress). The Federal Register uses ~50 free-text TOC topics (Health, Banks, banking, etc.). GAO uses ~30 internal topics. CRS uses its own list. The CFR is organized by title and issuing agency rather than by topic. LDA filings use three-letter issue codes (TAX, HCR, etc.).

For Josh, we don't try to standardize all of these into one taxonomy. We do two things:

Preserve each source's native taxonomy in source-specific tables (already in the per-source schemas — bill_subjects, fr_document_topics, gao_report_topics, etc.).
Build a thin meta-taxonomy of ~30-50 high-level Josh-canonical topics with mappings into each source's native taxonomy. This lets the agent answer "show me everything Josh has on Health Care" by translating into per-source queries.

This is a thin doc — taxonomy is a small dataset where the engineering complexity is in the mapping, not the data shape.

Quick reference


Source name	Topic / issue taxonomy
Publishers	Library of Congress (LCSH and legislative subjects). Various federal agencies (per-source taxonomies). Josh itself (the meta-taxonomy).
License	LCSH: free; per-source taxonomies: public domain; Josh meta-taxonomy: Josh-authored, Apache 2.0 with the rest of the substrate.
Coverage	LCSH: hundreds of thousands of headings (we ingest a subset). Bill legislative subjects: ~1,000+ per congress. Federal Register topics: ~50. GAO: ~30. CRS: ~70. LDA: ~80 issue codes.
Volume	<50 MB total
Storage estimate	<100 MB
Source layers	(1) Native taxonomies per source — already ingested. (2) Library of Congress Subject Headings (LCSH) — bulk download. (3) Josh meta-taxonomy — hand-curated YAML in this repo.
Auth	None
Stable ID format	Josh topic: `topic:{slug}` (e.g. `topic:health-care`); LCSH: `lcsh:{authority-id}` (e.g. `lcsh:sh85060417`)
Status	exploring — schema drafted, meta-taxonomy not yet curated

Source priority decision

Primary: Josh meta-taxonomy (hand-curated). A YAML file in this repo (docs/data/topic-taxonomy.yaml) defining ~30-50 top-level topics, each with:

A canonical slug (health-care, national-security, tax-policy).
A display name, description.
Mappings into each source's native taxonomy (e.g., health-care maps to bill subjects "Health", "Public health", "Medicare", "Medicaid", "Health insurance", "Mental health"; FR topics "Health", "Public health"; GAO topic "Health Care"; LDA issue code "HCR"; CFR titles 21, 42, 45).

This is us doing the editorial work of cross-source mapping — analogous to how Wikipedia categories or Schema.org vocabulary work.

Secondary: Library of Congress Subject Headings (LCSH). Authority-controlled vocabulary used by libraries. Provides a stable, well-defined backbone we can reference. Bulk download available from the Library of Congress Linked Data Service.

Skip: build our own large taxonomy. Re-creating ~10K topic terms is a years-long curation effort. Lean: use LCSH as the deep vocabulary; build only ~30-50 high-level Josh meta-topics that bridge sources.

What we're not doing

Not classifying every bill / FR doc / report against the meta-taxonomy. The native taxonomies on each source already do that work; the meta-taxonomy is the navigation layer, not a per-document tagging scheme.
Not exposing LCSH directly to users. LCSH is too granular and library-jargony. Use as backbone; surface our meta-taxonomy.
Not maintaining LCSH ourselves — pull from LoC.

LCSH and per-source native taxonomies

LCSH is a library-controlled vocabulary published by the Library of Congress as bulk JSON-LD (also SKOS and MADS), updated weekly; we ingest a curated subset. Browse, bulk download, and per-heading JSON-LD all live under the Library of Congress Linked Data Service. The per-heading JSON-LD is what lcsh_subjects.raw_jsonld stores.

Per-source native taxonomies are already ingested and documented on each per-source page — we reference their values via topic_mappings, we don't re-describe them here:

bill_subjects (Bills) — LoC legislative subjects. The richest legislative taxonomy.
fr_document_topics (Federal Register) — FR's free-text topic list.
gao_report_topics (GAO reports) — GAO's internal categorization.
crs_report_topics (CRS reports) — CRS-assigned topics.
LDA issue codes (lda_lobbying_activities.general_issue_code, see LDA filings).

Stable ID formats

Josh meta-topic: topic:{slug} — kebab-case lowercased.

Examples:

topic:health-care
topic:national-security
topic:tax-policy
topic:agriculture
topic:transportation
topic:civil-rights

LCSH heading: lcsh:{authority_id} — using LoC's authority IDs (sh85060417, sh85042567).
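
The two ID formats above can be generated mechanically. The following is a minimal sketch, not the substrate's actual code: the function names `topic_id` and `lcsh_id` are hypothetical, and the `sh` + digits pattern for authority IDs is an assumption based on the examples in this doc.

```python
import re

# Assumption: LCSH authority IDs look like 'sh' followed by digits,
# as in the examples in this doc (sh85060417, sh85042567).
LCSH_AUTHORITY_RE = re.compile(r"sh\d+")


def topic_id(name: str) -> str:
    """Build a Josh topic ID from a display name (lowercase kebab-case slug)."""
    slug = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    if not slug:
        raise ValueError(f"cannot derive a slug from {name!r}")
    return f"topic:{slug}"


def lcsh_id(authority_id: str) -> str:
    """Prefix a Library of Congress authority ID with the 'lcsh:' namespace."""
    if not LCSH_AUTHORITY_RE.fullmatch(authority_id):
        raise ValueError(f"unexpected LCSH authority ID: {authority_id!r}")
    return f"lcsh:{authority_id}"


print(topic_id("Health Care"))
print(lcsh_id("sh85060417"))
```

Rejecting malformed input early keeps bad IDs out of josh_topics.id and lcsh_subjects.id, which are both primary keys.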

Schema (Postgres DDL)

-- ============================================================
-- Josh meta-taxonomy (hand-curated)
-- ============================================================

CREATE TABLE josh_topics (
    id text PRIMARY KEY,                          -- 'topic:health-care'
    slug text NOT NULL UNIQUE,
    name text NOT NULL,                           -- 'Health Care'
    description text,
    parent_id text REFERENCES josh_topics(id),   -- optional hierarchy

    -- Display
    sort_order int,
    icon text,                                    -- icon hint for UI
    color text,                                   -- hex color hint for UI

    -- Lifecycle
    is_active boolean NOT NULL DEFAULT true,
    inserted_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX josh_topics_parent ON josh_topics (parent_id) WHERE parent_id IS NOT NULL;

-- ============================================================
-- Cross-source mappings: Josh meta-topic → native taxonomy values
-- ============================================================

CREATE TABLE topic_mappings (
    id bigserial PRIMARY KEY,
    josh_topic_id text NOT NULL REFERENCES josh_topics(id) ON DELETE CASCADE,
    source_kind text NOT NULL CHECK (source_kind IN (
        'bill_subject', 'fr_topic', 'gao_topic', 'crs_topic',
        'lda_issue_code', 'cfr_title', 'lcsh', 'agency_id'
    )),
    source_value text NOT NULL,                   -- e.g. 'Health', 'HCR', 'sh85060417', '42'
    confidence text NOT NULL CHECK (confidence IN ('exact', 'broad', 'narrow', 'related')),
    UNIQUE (josh_topic_id, source_kind, source_value)
);

CREATE INDEX topic_mappings_source ON topic_mappings (source_kind, source_value);

-- ============================================================
-- Library of Congress Subject Headings (subset)
-- ============================================================

CREATE TABLE lcsh_subjects (
    id text PRIMARY KEY,                          -- 'lcsh:sh85060417'
    authority_id text NOT NULL UNIQUE,            -- 'sh85060417'
    label text NOT NULL,                          -- 'Health care reform'
    broader_authorities text[],                   -- parent authority IDs
    narrower_authorities text[],                  -- child authority IDs
    related_authorities text[],                   -- see-also
    raw_jsonld jsonb,
    fetched_at timestamptz NOT NULL,
    updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX lcsh_subjects_label ON lcsh_subjects USING gin (to_tsvector('english', label));
CREATE INDEX lcsh_subjects_broader ON lcsh_subjects USING gin (broader_authorities);

Schema decisions worth flagging:

josh_topics is small (~30-50 rows) and hand-curated. Lives in a YAML file in the repo (docs/data/topic-taxonomy.yaml) for code-review-style maintenance. A migration loads it into the DB at deploy time.
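topic_mappings is the cross-source bridge. One row per (josh_topic, source_kind, source_value). The confidence enum is modeled on the SKOS mapping relations (exactMatch, broadMatch, narrowMatch, relatedMatch), with the direction stated from the Josh topic's side: exact (1:1), broad (the Josh topic is broader than the source value), narrow (the Josh topic is narrower than the source value), related (overlapping, but neither is a subset of the other).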
LCSH stored as a partial subset, not the full hierarchy. Pull the headings we actually reference from topic_mappings plus their immediate broader/narrower neighbors for browsing. Full LCSH is overkill.
No per-document tagging at the Josh meta-topic layer — when a user browses "Health Care," we resolve to the source values and federate the query. (This is similar to how multi-source search engines work.)
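
To make the "resolve, then federate" step concrete, here is a minimal sketch of the resolution half. It assumes the topic_mappings rows have already been read into memory as tuples; `resolve_topic` and the sample rows (including their confidence values) are illustrative, and the real federation logic lives in josh-core.

```python
from collections import defaultdict

# Hypothetical in-memory copy of topic_mappings rows:
# (josh_topic_id, source_kind, source_value, confidence)
MAPPINGS = [
    ("topic:health-care", "bill_subject", "Health", "exact"),
    ("topic:health-care", "bill_subject", "Medicare", "broad"),
    ("topic:health-care", "fr_topic", "Public health", "broad"),
    ("topic:health-care", "lda_issue_code", "HCR", "exact"),
    ("topic:national-security", "lda_issue_code", "DEF", "exact"),
]


def resolve_topic(topic_id, mappings):
    """Return {source_kind: [source_value, ...]} for one Josh topic."""
    by_kind = defaultdict(list)
    for josh_topic_id, source_kind, source_value, _confidence in mappings:
        if josh_topic_id == topic_id:
            by_kind[source_kind].append(source_value)
    return dict(by_kind)


filters = resolve_topic("topic:health-care", MAPPINGS)
for source_kind, values in filters.items():
    print(source_kind, values)
```

Each key in the result corresponds to one per-source query (for example, the bill_subject values filter bill_subjects), and the per-source results are then unioned.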

Curating the meta-taxonomy (the work)

The meta-taxonomy YAML follows this shape:

- id: topic:health-care
  name: Health Care
  description: Health care policy, public health, Medicare, Medicaid, health insurance.
  parent: null
  mappings:
    bill_subject:
      - Health
      - Public health
      - Medicare
      - Medicaid
      - Health insurance
      - Mental health
      - Pharmaceuticals
    fr_topic:
      - Health
      - Health and public welfare
      - Public health
    gao_topic:
      - Health Care
    crs_topic:
      - Health Policy
    lda_issue_code:
      - HCR
      - PHA
      - MMM
    cfr_title:
      - "21"
      - "42"
      - "45"

- id: topic:national-security
  name: National Security
  description: Defense, intelligence, homeland security, foreign affairs.
  parent: null
  mappings:
    bill_subject:
      - Armed forces and national security
      - Defense
      - Intelligence
      - Homeland security
      - Foreign trade and international finance
      - International affairs
    fr_topic:
      - Defense
      - World
    gao_topic:
      - Defense
      - Homeland Security
      - International Affairs
    lda_issue_code:
      - DEF
      - HOM
      - INT
    cfr_title:
      - "32"
      - "22"
      - "6"

Curation is estimated at ~1-2 weeks of initial work plus monthly maintenance.
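
The per-deploy load has to turn each YAML entry into one josh_topics row plus many topic_mappings rows. The sketch below shows one way that flattening could look, operating on a dict as a YAML parser would produce it. Two things here are assumptions, not decisions from this doc: the function name `flatten_topic`, and the `default_confidence` placeholder, which is needed because the YAML shape above carries no per-mapping confidence while topic_mappings.confidence is NOT NULL.

```python
# Mirrors the CHECK constraint on topic_mappings.source_kind.
ALLOWED_KINDS = {
    "bill_subject", "fr_topic", "gao_topic", "crs_topic",
    "lda_issue_code", "cfr_title", "lcsh", "agency_id",
}


def flatten_topic(entry, default_confidence="broad"):
    """Split one parsed YAML entry into (topic_row, mapping_rows).

    default_confidence is a placeholder assumption: curators would need
    to set the real value per mapping.
    """
    topic_id = entry["id"]
    if not topic_id.startswith("topic:"):
        raise ValueError(f"bad topic id: {topic_id!r}")
    topic_row = {
        "id": topic_id,
        "slug": topic_id[len("topic:"):],
        "name": entry["name"],
        "description": entry.get("description"),
        "parent_id": entry.get("parent"),
    }
    mapping_rows = []
    for source_kind, values in entry.get("mappings", {}).items():
        if source_kind not in ALLOWED_KINDS:
            raise ValueError(f"unknown source_kind: {source_kind!r}")
        for value in values:
            # source_value is a text column, so CFR titles stay strings.
            mapping_rows.append((topic_id, source_kind, str(value), default_confidence))
    return topic_row, mapping_rows


entry = {
    "id": "topic:health-care",
    "name": "Health Care",
    "description": "Health care policy, public health, Medicare, Medicaid, health insurance.",
    "parent": None,
    "mappings": {"gao_topic": ["Health Care"], "cfr_title": ["21", "42", "45"]},
}
topic_row, mapping_rows = flatten_topic(entry)
```

Validating source_kind at load time surfaces typos in the YAML during the deploy rather than as a constraint violation in Postgres.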

Volume

Metric	Value
Josh meta-topics	~30-50
Mappings per topic	~10-30 source values across all source kinds
Total mapping rows	~1,000-3,000
LCSH subset (rows we reference + neighbors)	~5,000-10,000
Bill subjects (see Bills)	~1,000+ per congress
FR topics (see Federal Register)	~50
GAO topics	~30
LDA issue codes	~80

Postgres footprint: <100 MB.

Caching / incremental sync

Yearly: refresh LCSH subset. Pull the LCSH bulk download (~150 MB JSON-LD) from the Library of Congress Linked Data Service. Upsert headings we reference + neighbors.
Per-deploy: refresh josh_topics and topic_mappings from the repo YAML.
Quarterly: review per-source taxonomy drift. When a new bill subject appears that's not in any mapping, it falls into "uncategorized" — surface in admin UI for curator review.

Schema (continued — taxonomy review queue)

-- A queue of unmapped source-taxonomy values for curator review
CREATE TABLE topic_unmapped_values (
    id bigserial PRIMARY KEY,
    source_kind text NOT NULL,
    source_value text NOT NULL,
    first_seen_at timestamptz NOT NULL,
    last_seen_at timestamptz NOT NULL,
    occurrence_count int NOT NULL DEFAULT 1,
    suggested_topic_id text REFERENCES josh_topics(id) ON DELETE SET NULL,  -- ML-suggested Josh topic
    reviewed_at timestamptz,
    reviewed_decision text CHECK (reviewed_decision IN ('mapped', 'rejected', 'pending') OR reviewed_decision IS NULL),
    UNIQUE (source_kind, source_value)
);

CREATE INDEX topic_unmapped_review ON topic_unmapped_values (occurrence_count DESC) WHERE reviewed_at IS NULL;

Download / update strategy

Curating the meta-taxonomy

Manual process. Initial ~30-50 topics curated by Robert + reviewed against actual user queries early on. Updates via PR to docs/data/topic-taxonomy.yaml.

LCSH ingestion

Download the bulk LCSH JSON-LD from the Library of Congress Linked Data Service (~150 MB; format/size are upstream-owned and change without notice).
Filter to subjects referenced by our topic_mappings + 1-hop broader/narrower.
Upsert into lcsh_subjects.
Yearly refresh.
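
The filter step (referenced headings plus 1-hop broader/narrower) can be sketched as follows. This assumes the bulk download has already been parsed into a dict keyed by authority ID; the dict shape, the name `select_lcsh_subset`, and the authority IDs in the example are all made up for illustration.

```python
def select_lcsh_subset(headings, referenced):
    """Pick the authority IDs to upsert into lcsh_subjects.

    headings:   {authority_id: {"broader": [...], "narrower": [...]}}
    referenced: authority IDs used in topic_mappings (source_kind = 'lcsh')
    """
    keep = set()
    for authority_id in referenced:
        heading = headings.get(authority_id)
        if heading is None:
            continue  # referenced heading is missing from this download
        keep.add(authority_id)
        neighbors = heading.get("broader", []) + heading.get("narrower", [])
        for neighbor in neighbors:
            if neighbor in headings:
                keep.add(neighbor)  # 1-hop only: neighbors are not expanded
    return keep


# Placeholder IDs, not real LCSH headings.
headings = {
    "sh0000001": {"broader": ["sh0000002"], "narrower": ["sh0000003", "sh0000009"]},
    "sh0000002": {"broader": ["sh0000004"], "narrower": ["sh0000001"]},
    "sh0000003": {"broader": ["sh0000001"], "narrower": []},
    "sh0000004": {"broader": [], "narrower": ["sh0000002"]},
}
subset = select_lcsh_subset(headings, {"sh0000001", "sh0000008"})
print(sorted(subset))
```

In the example, sh0000004 is two hops away and is left out, and the referenced-but-missing sh0000008 is skipped; a real job would probably want to log such misses, since they may indicate a retired heading.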

Drift detection

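Daily: scan each native taxonomy table (bill_subjects.subject, etc.) for values that have no matching (source_kind, source_value) row in topic_mappings.
Upsert into topic_unmapped_values, updating last_seen_at and occurrence_count (the table is unique on (source_kind, source_value), so a plain insert would fail on repeat sightings).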
Surface high-occurrence unmapped values in admin UI for curator review.
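
The scan itself amounts to a set difference over (source_kind, source_value) pairs plus a count. A minimal sketch, assuming both sides have been loaded into memory; `find_unmapped` and the sample values are hypothetical, and in production this would more likely be a SQL anti-join.

```python
from collections import Counter


def find_unmapped(observed, mapped):
    """Count observed taxonomy values that have no mapping.

    observed: iterable of (source_kind, source_value), one per tagged document
    mapped:   set of (source_kind, source_value) pairs present in topic_mappings
    Returns [((source_kind, source_value), count), ...], most frequent first.
    """
    counts = Counter(pair for pair in observed if pair not in mapped)
    return counts.most_common()


observed = [
    ("bill_subject", "Health"),
    ("bill_subject", "Artificial intelligence"),
    ("bill_subject", "Artificial intelligence"),
    ("fr_topic", "Health"),
]
mapped = {("bill_subject", "Health")}
unmapped = find_unmapped(observed, mapped)
```

Matching on the pair rather than on source_value alone matters: in the example, "Health" is mapped as a bill subject but still shows up as unmapped for fr_topic.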

State tracking

Source key: topic_taxonomy_curation. Manual flow.

Failure modes

Curation lag. New bill subjects appear faster than we can curate. Drift detection mitigates; the agent surfaces "uncategorized" cleanly.
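Cross-source ambiguity. A bill tagged "Civil rights" and an FR topic "Civil rights" should both map to topic:civil-rights. Identical labels across sources usually mean the same thing and map cleanly, since mappings are keyed by (source_kind, source_value); document the edge cases where the same label has a different scope in different sources.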
Conflicting confidence. A native value might be narrow to one Josh topic and broad to another. Allow multiple mappings; the agent's federation prioritizes exact > broad > narrow > related.
LCSH versioning. LCSH headings get retired or merged occasionally. The bulk download tracks deprecations; honor them.
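
For the conflicting-confidence case, the stated priority (exact > broad > narrow > related) can be applied with a simple rank lookup. This is a sketch under assumptions: the function name and sample rows are illustrative, and topic:medicare is the hypothetical child topic from the open questions, not a curated entry.

```python
# Lower rank wins, following exact > broad > narrow > related.
CONFIDENCE_RANK = {"exact": 0, "broad": 1, "narrow": 2, "related": 3}


def rank_topics_for_value(source_kind, source_value, mappings):
    """Return [(josh_topic_id, confidence), ...] for one native value, best match first."""
    matches = [
        (topic_id, confidence)
        for topic_id, kind, value, confidence in mappings
        if kind == source_kind and value == source_value
    ]
    return sorted(matches, key=lambda match: CONFIDENCE_RANK[match[1]])


rows = [
    ("topic:health-care", "bill_subject", "Medicare", "broad"),
    ("topic:medicare", "bill_subject", "Medicare", "exact"),
    ("topic:health-care", "bill_subject", "Health", "exact"),
]
ranked = rank_topics_for_value("bill_subject", "Medicare", rows)
print(ranked)
```

Because Python's sort is stable, mappings with the same confidence keep their input order, so a deterministic input order gives a deterministic ranking.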

Open questions

These don't block ingestion but should be resolved before this source is "shipped":

Top-level topic count. ~30-50 is a guess; the right number emerges from real user queries. Start narrow, expand.
Hierarchy depth. Should topic:medicare be a child of topic:health-care, or both top-level? Lean: 2-level hierarchy max (parent + child). Deeper hierarchies are SEO-resistant.
ML-suggested topic mappings. When a new source value appears, embed it and find nearest existing Josh topic. Useful prefilter for curator review. Defer to v1.x.
User-customizable taxonomies. Multi-tenant deployments might want their own taxonomies. Schema accommodates by adding a tenant_id to josh_topics (NULL = global default). Defer.
Cross-source query federation. When the agent gets "show me Health Care content," it queries each source's native subjects via the mapping table and unions. The federation logic lives in josh-core, not here.
LCSH licensing details. LCSH is freely available but the LoC reserves usage rules. Verify our use is in compliance.
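CFR title as topic mapping. CFR titles are broad subject areas organized internally by issuing agency, not a topic taxonomy. Some titles line up well with a topic (Title 21 is "Food and Drugs", Title 26 is "Internal Revenue", Title 17 is "Commodity and Securities Exchanges"), but one topic can span several titles (health-care maps to 21, 42, and 45) and one title can span several topics. The mapping is loose; treat with low confidence.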
Agency cross-mapping. Sources cite agencies (FR, regulations.gov, GAO recommendations) — should topic_mappings include agency_id mappings, or is that a separate concern? Schema allows; usage is open.
Static UI taxonomy vs dynamic discovery. The agent UI may want a stable left-nav of topics; we could harden the meta-taxonomy as nav while letting native subjects power filters. Lean: yes, treat meta-taxonomy as nav backbone.
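
If the two-level lean from the hierarchy-depth question is adopted, it could be enforced when the YAML is loaded. A minimal sketch, assuming the parent links are available as a dict; `check_two_level` is a hypothetical name, and topic:medicare and topic:medicare-part-d are illustrative topics, not curated entries.

```python
def check_two_level(parents):
    """Return a list of violations of the 'parent + child only' rule.

    parents: {topic_id: parent_id or None}
    """
    problems = []
    for topic_id, parent_id in parents.items():
        if parent_id is None:
            continue  # top-level topic
        if parent_id not in parents:
            problems.append(f"{topic_id}: unknown parent {parent_id}")
        elif parents[parent_id] is not None:
            problems.append(f"{topic_id}: parent {parent_id} is itself a child")
    return problems


parents = {
    "topic:health-care": None,
    "topic:medicare": "topic:health-care",
    "topic:medicare-part-d": "topic:medicare",
}
problems = check_two_level(parents)
print(problems)
```

A check like this also catches a topic listed as its own parent, since that parent is then by definition not top-level.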