Topic / issue taxonomy

A normalized vocabulary for tagging the substrate's content by issue area. Used for: filtering ("show me bills tagged Health"), faceted browsing in the agent UI, and cross-source joins ("CRS reports about Defense" + "Federal Register rules tagged Defense" + "GAO reports tagged Defense").

The challenge: each ingested source comes with its own topic vocabulary, and they don't align. Bills use the Library of Congress legislative subjects (~1,000+ values per congress). Federal Register uses ~50 free-text TOC topics ("Health", "Banks, banking", etc.). GAO uses ~30 internal topics. CRS uses its own set (~70). CFR is organized by title and agency rather than by topic. LDA uses three-letter issue codes (TAX, HCR, etc.).

For Josh, we don't try to standardize all of these into one taxonomy. We do two things:

  1. Preserve each source's native taxonomy in source-specific tables (already in the per-source schemas — bill_subjects, fr_document_topics, gao_report_topics, etc.).
  2. Build a thin meta-taxonomy of ~30-50 high-level Josh-canonical topics with mappings into each source's native taxonomy. This lets the agent answer "show me everything Josh has on Health Care" by translating into per-source queries.

This is a thin doc — taxonomy is a small dataset where the engineering complexity is in the mapping, not the data shape.

Source name: Topic / issue taxonomy
Publishers: Library of Congress (legislative subjects). Various federal agencies (per-source taxonomies). Josh itself (the meta-taxonomy).
License: LCSH is free; per-source taxonomies are public domain; the Josh meta-taxonomy is Josh-authored, Apache 2.0 with the rest of the substrate.
Coverage: LCSH, thousands of headings. Federal Register topic enum, ~50. GAO, ~30. CRS, ~70. LDA, ~80 issue codes.
Volume: <50 MB total
Storage estimate: <100 MB
Source layers: (1) Native taxonomies per source (already ingested). (2) Library of Congress Subject Headings (LCSH), via bulk download. (3) Josh meta-taxonomy, hand-curated YAML in this repo.
Auth: None
Stable ID format: Josh topic: topic:{slug} (e.g. topic:health-care); LCSH: lcsh:{authority-id} (e.g. lcsh:sh85060417)
Status: exploring (schema drafted, meta-taxonomy not yet curated)

Primary: Josh meta-taxonomy (hand-curated). A YAML file in this repo (docs/data/topic-taxonomy.yaml) defining ~30-50 top-level topics, each with:

  • A canonical slug (health-care, national-security, tax-policy).
  • A display name, description.
  • Mappings into each source's native taxonomy (e.g., health-care maps to bill subjects "Health", "Public health", "Medicare", "Medicaid", "Health insurance", "Mental health"; FR topics "Health", "Health and public welfare", "Public health"; GAO topic "Health Care"; LDA issue code "HCR"; CFR titles 21, 42, 45).

This is us doing the editorial work of cross-source mapping — analogous to how Wikipedia categories or Schema.org vocabulary work.

Secondary: Library of Congress Subject Headings (LCSH). Authority-controlled vocabulary used by libraries. Provides a stable, well-defined backbone we can reference. Bulk download available from the Library of Congress Linked Data Service.

Skip: build our own large taxonomy. Re-creating ~10K topic terms is a years-long curation effort. Lean: use LCSH as the deep vocabulary; build only ~30-50 high-level Josh meta-topics that bridge sources.

  • Not classifying every bill / FR doc / report against the meta-taxonomy. The native taxonomies on each source already do that work; the meta-taxonomy is the navigation layer, not a per-document tagging.
  • Not exposing LCSH directly to users. LCSH is too granular and library-jargony. Use as backbone; surface our meta-taxonomy.
  • Not maintaining LCSH ourselves — pull from LoC.

LCSH is a library-controlled vocabulary published by the Library of Congress as bulk JSON-LD (also SKOS and MADS), updated weekly; we ingest a curated subset. Browse, bulk download, and per-heading JSON-LD all live under the Library of Congress Linked Data Service. The per-heading JSON-LD is what lcsh_subjects.raw_jsonld stores.

Per-source native taxonomies are already ingested and documented on each per-source page — we reference their values via topic_mappings, we don't re-describe them here:

  • bill_subjects (Bills) — LoC legislative subjects. The richest legislative taxonomy.
  • fr_document_topics (Federal Register) — FR's free-text topic list.
  • gao_report_topics (GAO reports) — GAO's internal categorization.
  • crs_report_topics (CRS reports) — CRS-assigned topics.
  • LDA issue codes (lda_lobbying_activities.general_issue_code, see LDA filings).

Josh meta-topic: topic:{slug} — kebab-case lowercased.

Examples:

  • topic:health-care
  • topic:national-security
  • topic:tax-policy
  • topic:agriculture
  • topic:transportation
  • topic:civil-rights

LCSH heading: lcsh:{authority_id} — using LoC's authority IDs (sh85060417, sh85042567).
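
As a minimal sketch of the two ID formats above, the helpers below build and validate `topic:{slug}` and `lcsh:{authority_id}` strings. The function names are hypothetical, and the `sh` + digits pattern for authority IDs is an assumption inferred from the examples on this page, not a rule confirmed against LoC documentation.

```python
import re

# Kebab-case: lowercase alphanumeric runs joined by single hyphens.
_SLUG_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")
# Assumption: LCSH authority IDs look like 'sh' followed by digits.
_LCSH_RE = re.compile(r"^sh\d+$")


def slugify(name: str) -> str:
    """Lowercase a display name and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", name.lower()))


def topic_id(name_or_slug: str) -> str:
    """Return the stable Josh topic ID for a display name or an existing slug."""
    slug = slugify(name_or_slug)
    if not _SLUG_RE.match(slug):
        raise ValueError(f"cannot derive a slug from {name_or_slug!r}")
    return f"topic:{slug}"


def lcsh_id(authority_id: str) -> str:
    """Return the stable LCSH ID for a LoC authority ID."""
    if not _LCSH_RE.match(authority_id):
        raise ValueError(f"unexpected LCSH authority ID: {authority_id!r}")
    return f"lcsh:{authority_id}"


print(topic_id("Health Care"))   # topic:health-care
print(lcsh_id("sh85060417"))     # lcsh:sh85060417
```

Validating at the point of construction keeps malformed IDs out of josh_topics.id and lcsh_subjects.id, where they would be hard to repair later.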

-- ============================================================
-- Josh meta-taxonomy (hand-curated)
-- ============================================================
CREATE TABLE josh_topics (
    id          text PRIMARY KEY,                 -- 'topic:health-care'
    slug        text NOT NULL UNIQUE,
    name        text NOT NULL,                    -- 'Health Care'
    description text,
    parent_id   text REFERENCES josh_topics(id),  -- optional hierarchy

    -- Display
    sort_order  int,
    icon        text,                             -- icon hint for UI
    color       text,                             -- hex color hint for UI

    -- Lifecycle
    is_active   boolean NOT NULL DEFAULT true,
    inserted_at timestamptz NOT NULL DEFAULT now(),
    updated_at  timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX josh_topics_parent ON josh_topics (parent_id) WHERE parent_id IS NOT NULL;

-- ============================================================
-- Cross-source mappings: Josh meta-topic → native taxonomy values
-- ============================================================
CREATE TABLE topic_mappings (
    id            bigserial PRIMARY KEY,
    josh_topic_id text NOT NULL REFERENCES josh_topics(id) ON DELETE CASCADE,
    source_kind   text NOT NULL CHECK (source_kind IN (
        'bill_subject', 'fr_topic', 'gao_topic', 'crs_topic',
        'lda_issue_code', 'cfr_title', 'lcsh', 'agency_id'
    )),
    source_value  text NOT NULL,                  -- e.g. 'Health', 'HCR', 'sh85060417', '42'
    confidence    text NOT NULL CHECK (confidence IN ('exact', 'broad', 'narrow', 'related')),
    UNIQUE (josh_topic_id, source_kind, source_value)
);

CREATE INDEX topic_mappings_source ON topic_mappings (source_kind, source_value);

-- ============================================================
-- Library of Congress Subject Headings (subset)
-- ============================================================
CREATE TABLE lcsh_subjects (
    id                   text PRIMARY KEY,        -- 'lcsh:sh85060417'
    authority_id         text NOT NULL UNIQUE,    -- 'sh85060417'
    label                text NOT NULL,           -- 'Health care reform'
    broader_authorities  text[],                  -- parent authority IDs
    narrower_authorities text[],                  -- child authority IDs
    related_authorities  text[],                  -- see-also
    raw_jsonld           jsonb,
    fetched_at           timestamptz NOT NULL,
    updated_at           timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX lcsh_subjects_label ON lcsh_subjects USING gin (to_tsvector('english', label));
CREATE INDEX lcsh_subjects_broader ON lcsh_subjects USING gin (broader_authorities);

Schema decisions worth flagging:

  • josh_topics is small (~30-50 rows) and hand-curated. Lives in a YAML file in the repo (docs/data/topic-taxonomy.yaml) for code-review-style maintenance. A migration loads it into the DB at deploy time.
  • topic_mappings is the cross-source bridge. One row per (josh_topic, source_kind, source_value). The confidence enum follows SKOS conventions: exact (1:1), broad (Josh topic is broader than source value), narrow (Josh topic is narrower), related (overlapping but not subset).
  • LCSH stored as a partial subset, not the full hierarchy. Pull the headings we actually reference from topic_mappings plus their immediate broader/narrower neighbors for browsing. Full LCSH is overkill.
  • No per-document tagging at the Josh meta-topic layer — when a user browses "Health Care," we resolve to the source values and federate the query. (This is similar to how multi-source search engines work.)
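
The "resolve and federate" decision can be illustrated with a small, self-contained sketch. It uses an in-memory SQLite database rather than Postgres so it runs anywhere. The column names bill_id, document_id, and topic are hypothetical stand-ins for the per-source schemas, and the real federation logic lives in josh-core and may look quite different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE topic_mappings (
    josh_topic_id TEXT, source_kind TEXT, source_value TEXT, confidence TEXT
);
-- Simplified stand-ins for two native taxonomy tables (columns are assumptions).
CREATE TABLE bill_subjects (bill_id TEXT, subject TEXT);
CREATE TABLE fr_document_topics (document_id TEXT, topic TEXT);

INSERT INTO topic_mappings VALUES
    ('topic:health-care',       'bill_subject', 'Medicare',      'broad'),
    ('topic:health-care',       'fr_topic',     'Public health', 'broad'),
    ('topic:national-security', 'bill_subject', 'Defense',       'broad');
INSERT INTO bill_subjects VALUES ('hr-1', 'Medicare'), ('hr-2', 'Defense');
INSERT INTO fr_document_topics VALUES ('fr-9', 'Public health');
""")

# One branch per source kind; each joins the native table to the mapping rows
# for the requested meta-topic, and the branches are unioned.
FEDERATED_QUERY = """
SELECT 'bill' AS source, b.bill_id AS doc_id
  FROM bill_subjects b
  JOIN topic_mappings m
    ON m.source_kind = 'bill_subject' AND m.source_value = b.subject
 WHERE m.josh_topic_id = :topic
UNION
SELECT 'fr_document', f.document_id
  FROM fr_document_topics f
  JOIN topic_mappings m
    ON m.source_kind = 'fr_topic' AND m.source_value = f.topic
 WHERE m.josh_topic_id = :topic
ORDER BY source, doc_id
"""


def documents_for_topic(topic_id: str) -> list:
    """Resolve a meta-topic to documents without any per-document meta-topic tags."""
    return conn.execute(FEDERATED_QUERY, {"topic": topic_id}).fetchall()


print(documents_for_topic("topic:health-care"))
# [('bill', 'hr-1'), ('fr_document', 'fr-9')]
```

No document row carries a Josh topic; changing a mapping row changes what the topic returns on the next query, which is the point of keeping the meta-taxonomy as a navigation layer.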

The meta-taxonomy YAML follows this shape:

- id: topic:health-care
  name: Health Care
  description: Health care policy, public health, Medicare, Medicaid, health insurance.
  parent: null
  mappings:
    bill_subject:
      - Health
      - Public health
      - Medicare
      - Medicaid
      - Health insurance
      - Mental health
      - Pharmaceuticals
    fr_topic:
      - Health
      - Health and public welfare
      - Public health
    gao_topic:
      - Health Care
    crs_topic:
      - Health Policy
    lda_issue_code:
      - HCR
      - PHA
      - MMM
    cfr_title:
      - "21"
      - "42"
      - "45"
- id: topic:national-security
  name: National Security
  description: Defense, intelligence, homeland security, foreign affairs.
  parent: null
  mappings:
    bill_subject:
      - Armed forces and national security
      - Defense
      - Intelligence
      - Homeland security
      - Foreign trade and international finance
      - International affairs
    fr_topic:
      - Defense
      - World
    gao_topic:
      - Defense
      - Homeland Security
      - International Affairs
    lda_issue_code:
      - DEF
      - HOM
      - INT
    cfr_title:
      - "32"
      - "22"
      - "6"
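
A sketch of how one parsed YAML entry might be flattened into topic_mappings rows at deploy time. The entry is written as a Python dict (what a YAML parser would return) so the example needs no third-party library. The YAML shape above carries no confidence value while topic_mappings.confidence is NOT NULL, so the default used here ('related') is purely an assumption; the function name is hypothetical too.

```python
from typing import Iterator, Tuple

# Mirrors the CHECK constraint on topic_mappings.source_kind.
ALLOWED_KINDS = {
    "bill_subject", "fr_topic", "gao_topic", "crs_topic",
    "lda_issue_code", "cfr_title", "lcsh", "agency_id",
}

# Assumption: the YAML has no per-mapping confidence, so the loader supplies one.
DEFAULT_CONFIDENCE = "related"

# A trimmed version of the health-care entry, as a YAML parser would return it.
entry = {
    "id": "topic:health-care",
    "name": "Health Care",
    "parent": None,
    "mappings": {
        "gao_topic": ["Health Care"],
        "lda_issue_code": ["HCR", "PHA", "MMM"],
        "cfr_title": ["21", "42", "45"],
    },
}


def mapping_rows(
    entry: dict, confidence: str = DEFAULT_CONFIDENCE
) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (josh_topic_id, source_kind, source_value, confidence) tuples."""
    for kind, values in entry.get("mappings", {}).items():
        if kind not in ALLOWED_KINDS:
            raise ValueError(f"unknown source_kind {kind!r} in {entry['id']}")
        for value in values:
            # source_value is a text column, so CFR titles stay strings.
            yield (entry["id"], kind, str(value), confidence)


for row in mapping_rows(entry):
    print(row)
```

Rejecting unknown source kinds in the loader surfaces YAML typos during review or CI rather than as a constraint violation at deploy time.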

The curation work is ~1-2 weeks initial + monthly maintenance.

Metric                                         Value
Josh meta-topics                               ~30-50
Mappings per topic                             ~10-30 source values across all source kinds
Total mapping rows                             ~1,000-3,000
LCSH subset (rows we reference + neighbors)    ~5,000-10,000
Bill subjects (see Bills)                      ~1,000+ per congress
FR topics (see Federal Register)               ~50
GAO topics                                     ~30
LDA issue codes                                ~80

Postgres footprint: <100 MB.

  1. Yearly: refresh LCSH subset. Pull the LCSH bulk download (~150 MB JSON-LD) from the Library of Congress Linked Data Service. Upsert headings we reference + neighbors.
  2. Per-deploy: refresh josh_topics and topic_mappings from the repo YAML.
  3. Quarterly: review per-source taxonomy drift. When a new bill subject appears that's not in any mapping, it falls into "uncategorized" — surface in admin UI for curator review.
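
The "headings we reference + neighbors" selection in step 1 can be sketched as a one-hop expansion over the parsed bulk download. The in-memory shape (a dict keyed by authority ID with broader/narrower lists) and the function name are assumptions, and the authority IDs below are placeholders, not real LCSH records.

```python
def lcsh_subset(referenced_ids, headings):
    """Return referenced authority IDs plus their 1-hop broader/narrower neighbors.

    Only IDs present in the download are kept, so retired headings drop out.
    """
    keep = set()
    for authority_id in referenced_ids:
        record = headings.get(authority_id)
        if record is None:
            continue  # referenced in topic_mappings but missing upstream
        keep.add(authority_id)
        for neighbor in record["broader"] + record["narrower"]:
            if neighbor in headings:
                keep.add(neighbor)
    return keep


# Placeholder data: 1 -> broader 2 -> broader 4; 1 -> narrower 3; 5 is unrelated.
headings = {
    "sh0000001": {"broader": ["sh0000002"], "narrower": ["sh0000003"]},
    "sh0000002": {"broader": ["sh0000004"], "narrower": ["sh0000001"]},
    "sh0000003": {"broader": ["sh0000001"], "narrower": []},
    "sh0000004": {"broader": [], "narrower": ["sh0000002"]},
    "sh0000005": {"broader": [], "narrower": []},
}

print(sorted(lcsh_subset({"sh0000001"}, headings)))
# ['sh0000001', 'sh0000002', 'sh0000003']
```

The two-hop heading (sh0000004) and the unrelated one (sh0000005) are left out, which is what keeps the stored subset in the low thousands of rows rather than the full vocabulary.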

Schema (continued — taxonomy review queue)

-- A queue of unmapped source-taxonomy values for curator review
CREATE TABLE topic_unmapped_values (
    id                 bigserial PRIMARY KEY,
    source_kind        text NOT NULL,
    source_value       text NOT NULL,
    first_seen_at      timestamptz NOT NULL,
    last_seen_at       timestamptz NOT NULL,
    occurrence_count   int NOT NULL DEFAULT 1,
    suggested_topic_id text,                      -- ML-suggested Josh topic
    reviewed_at        timestamptz,
    reviewed_decision  text CHECK (reviewed_decision IN ('mapped', 'rejected', 'pending') OR reviewed_decision IS NULL),
    UNIQUE (source_kind, source_value)
);

CREATE INDEX topic_unmapped_review ON topic_unmapped_values (occurrence_count DESC) WHERE reviewed_at IS NULL;

Manual process. Initial ~30-50 topics curated by Robert + reviewed against actual user queries early on. Updates via PR to docs/data/topic-taxonomy.yaml.

LCSH subset ingestion:

  1. Download the bulk LCSH JSON-LD from the Library of Congress Linked Data Service (~150 MB; format/size are upstream-owned and change without notice).
  2. Filter to subjects referenced by our topic_mappings + 1-hop broader/narrower.
  3. Upsert into lcsh_subjects.
  4. Yearly refresh.

Drift detection:

  1. Daily: scan bill_subjects.subject etc. for values not in any topic_mappings.source_value.
  2. Insert into topic_unmapped_values with occurrence count.
  3. Surface high-occurrence unmapped values in admin UI for curator review.
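
The daily drift scan can be sketched in plain Python. The dict standing in for topic_unmapped_values, the function names, and the sample source values are all hypothetical. The sketch also assumes each scan covers the full source table, so occurrence_count is overwritten with the latest total instead of being accumulated.

```python
from collections import Counter
from datetime import datetime, timezone


def find_unmapped(observed, mapped):
    """Count observed (source_kind, source_value) pairs that have no mapping."""
    counts = Counter(observed)
    return {key: n for key, n in counts.items() if key not in mapped}


def record_unmapped(queue, unmapped, now):
    """Upsert scan results into an in-memory stand-in for topic_unmapped_values."""
    for key, n in unmapped.items():
        row = queue.get(key)
        if row is None:
            queue[key] = {"first_seen_at": now, "last_seen_at": now,
                          "occurrence_count": n, "reviewed_at": None}
        else:
            row["last_seen_at"] = now
            row["occurrence_count"] = n  # assumption: full rescan, so overwrite


def review_order(queue):
    """Unreviewed values, most frequent first (mirrors the partial index)."""
    pending = [k for k, row in queue.items() if row["reviewed_at"] is None]
    return sorted(pending, key=lambda k: queue[k]["occurrence_count"], reverse=True)


# Hypothetical sample values, not taken from any real source.
observed = [
    ("bill_subject", "Health"),
    ("bill_subject", "Example subject A"),
    ("bill_subject", "Example subject A"),
    ("fr_topic", "Example topic B"),
]
mapped = {("bill_subject", "Health")}

queue = {}
record_unmapped(queue, find_unmapped(observed, mapped), datetime.now(timezone.utc))
print(review_order(queue))
# [('bill_subject', 'Example subject A'), ('fr_topic', 'Example topic B')]
```

Keying on the (source_kind, source_value) pair matches the UNIQUE constraint on the queue table, so the same string appearing in two sources is reviewed as two separate items.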

Source key: topic_taxonomy_curation. Manual flow.

  • Curation lag. New bill subjects appear faster than we can curate. Drift detection mitigates; the agent surfaces "uncategorized" cleanly.
  • Cross-source ambiguity. A bill tagged "Civil rights" and an FR topic "Civil rights" should both map to topic:civil-rights. Identical subject names across sources usually mean the same thing, so most name collisions map cleanly; document the edge cases where they do not.
  • Conflicting confidence. A native value might be narrow to one Josh topic and broad to another. Allow multiple mappings; the agent's federation prioritizes exact > broad > narrow > related.
  • LCSH versioning. LCSH headings get retired or merged occasionally. The bulk download tracks deprecations; honor them.
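
The exact > broad > narrow > related priority from the "Conflicting confidence" item can be sketched as a stable sort over the mappings for one native value. The function name is hypothetical, and so are the sample rows: topic:medicare and topic:aging are illustrative topics, not part of the curated set.

```python
# Lower rank means higher priority during federation.
CONFIDENCE_RANK = {"exact": 0, "broad": 1, "narrow": 2, "related": 3}


def topics_for_value(mappings, source_kind, source_value):
    """Return the Josh topic IDs mapped to one native value, best match first."""
    matches = [
        m for m in mappings
        if m["source_kind"] == source_kind and m["source_value"] == source_value
    ]
    # sorted() is stable, so rows with equal confidence keep their input order.
    ranked = sorted(matches, key=lambda m: CONFIDENCE_RANK[m["confidence"]])
    return [m["josh_topic_id"] for m in ranked]


# Hypothetical rows: one bill subject mapped to three topics.
mappings = [
    {"josh_topic_id": "topic:aging", "source_kind": "bill_subject",
     "source_value": "Medicare", "confidence": "related"},
    {"josh_topic_id": "topic:health-care", "source_kind": "bill_subject",
     "source_value": "Medicare", "confidence": "broad"},
    {"josh_topic_id": "topic:medicare", "source_kind": "bill_subject",
     "source_value": "Medicare", "confidence": "exact"},
]

print(topics_for_value(mappings, "bill_subject", "Medicare"))
# ['topic:medicare', 'topic:health-care', 'topic:aging']
```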

These don't block ingestion but should be resolved before this source is "shipped":

  • Top-level topic count. ~30-50 is a guess; the right number emerges from real user queries. Start narrow, expand.
  • Hierarchy depth. Should topic:medicare be a child of topic:health-care, or both top-level? Lean: 2-level hierarchy max (parent + child). Deeper hierarchies are SEO-resistant.
  • ML-suggested topic mappings. When a new source value appears, embed it and find nearest existing Josh topic. Useful prefilter for curator review. Defer to v1.x.
  • User-customizable taxonomies. Multi-tenant deployments might want their own taxonomies. Schema accommodates by adding a tenant_id to josh_topics (NULL = global default). Defer.
  • Cross-source query federation. When the agent gets "show me Health Care content," it queries each source's native subjects via the mapping table and unions. The federation logic lives in josh-core, not here.
  • LCSH licensing details. LCSH is freely available but the LoC reserves usage rules. Verify our use is in compliance.
  • CFR title as topic mapping. CFR titles are broad regulatory groupings that largely follow agency jurisdiction rather than a topic vocabulary: Title 21 is "Food and Drugs," Title 26 is "Internal Revenue" (tax), and Title 17 is "Commodity and Securities Exchanges" (financial). The mapping is loose; treat with low confidence.
  • Agency cross-mapping. Sources cite agencies (FR, regulations.gov, GAO recommendations) — should topic_mappings include agency_id mappings, or is that a separate concern? Schema allows; usage is open.
  • Static UI taxonomy vs dynamic discovery. The agent UI may want a stable left-nav of topics; we could harden the meta-taxonomy as nav while letting native subjects power filters. Lean: yes, treat meta-taxonomy as nav backbone.