Data status

A quick at-a-glance view of which sources are loaded into the substrate and which still need work. Updated by hand as sources land — for the canonical inventory with per-source details, see the data sources index. A live status surface served from josh-core is on the roadmap.

  • shipped — source module written, schema applied, production backfill complete, queries return real data.
  • in progress — actively being built right now (spec written, code in flight, or a backfill underway).
  • not shipped — planned and (usually) spec-drafted, but no code yet, OR code is shipped but data is not currently loaded into a live substrate.
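
As a rough illustration of how the three labels relate, the rules above can be written as a small decision function. This is only a sketch: `SourceState`, its field names, and `classify` are hypothetical and not part of josh-core, and the page itself is still maintained by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceState:
    """Hypothetical flags describing one source (names are illustrative)."""
    module_written: bool      # source module exists
    schema_applied: bool      # schema applied to the substrate
    backfill_complete: bool   # production backfill finished
    data_live: bool           # queries return real data from a live substrate
    actively_building: bool   # spec, code, or a backfill is in flight right now


def classify(state: SourceState) -> str:
    """Map a source's state to one of the three status labels on this page."""
    if (state.module_written and state.schema_applied
            and state.backfill_complete and state.data_live):
        return "shipped"
    if state.actively_building:
        return "in progress"
    # Planned-only sources and code-shipped-but-unloaded sources both land here.
    return "not shipped"
```

Under this reading, a source whose code is verified but whose data was lost with its host classifies as not shipped until the re-ingest completes.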

| Source | Status | Notes |
| Bills | not shipped | Schema drafted, ingester spec planned. Loads after legislators (FK on sponsor bioguide). |
| Members of Congress | not shipped | Code shipped + verified (ingester spec is verified). 12,780-legislator load (2026-05-09) was on the destroyed DO droplet; re-ingest pending OVHcloud provision (per legislators-and-committees-ingester t10). |
| Committees + subcommittees | not shipped | Code shipped + verified. 230-committee load was on the destroyed DO droplet; re-ingest pending OVHcloud provision. |
| Committee assignments | not shipped | Code shipped + verified. 3,879-membership load was on the destroyed DO droplet; re-ingest pending OVHcloud provision. |
| Hearings + markup schedule | not shipped | Schema drafted, ingester spec planned (hearings-ingester). |
| Hearing transcripts | not shipped | Schema drafted, ingester spec planned (hearing-transcripts-ingester). Best-effort, partial coverage. |
| Roll-call votes | not shipped | Schema drafted, ingester spec planned (roll-call-votes-ingester). |

| Source | Status | Notes |
| U.S. Code | not shipped | Schema drafted, ingester spec planned (us-code-ingester). |
| Public Laws | not shipped | Schema drafted, ingester spec planned (public-laws-ingester). |

| Source | Status | Notes |
| Federal Register | not shipped | Schema drafted, ingester spec planned (federal-register-ingester). Recommended next source: clean shape, no companion-source dependency. |
| Regulations.gov dockets | not shipped | Schema drafted, ingester spec planned (regulations-dot-gov-dockets-ingester). Metadata + document list only; comment text deferred to v2. |
| eCFR & CFR | not shipped | Schema drafted, ingester spec planned (ecfr-and-cfr-ingester). |

| Source | Status | Notes |
| CRS reports | not shipped | Code shipped + verified (ingester spec is verified). 23,092-report load (2026-05-08) was on the destroyed DO droplet; re-ingest pending OVHcloud provision (per crs-reports-ingester t10). |
| GAO reports | not shipped | Schema drafted, ingester spec planned (gao-reports-ingester). Anti-bot wall on gao.gov; needs headed-browser scrape. |
| CBO cost estimates | not shipped | Schema drafted, ingester spec planned (cbo-cost-estimates-ingester). |
| Committee reports | not shipped | Schema drafted, ingester spec planned (committee-reports-ingester). |
| Congressional Record | not shipped | Schema drafted, ingester spec planned (congressional-record-ingester). |
| Statements of Administration Policy | not shipped | Schema drafted, ingester spec planned (statements-of-administration-policy-ingester). |

| Source | Status | Notes |
| Senate & House LDA filings | not shipped | Schema drafted, ingester spec planned (lda-filings-ingester). Unified at lda.gov API v1. |

| Source | Status | Notes |
| Congressional staff directory | not shipped | Schema drafted (partial coverage), ingester spec planned (staff-directories-ingester; a single ingester covers all three sub-sources). |
| Federal agency officials | not shipped | Schema drafted (partial coverage), tracked under staff-directories-ingester. |
| Caucus memberships | not shipped | Schema drafted (partial coverage), tracked under staff-directories-ingester. |

| Source | Status | Notes |
| Issue / topic taxonomy | not shipped | LCSH-loader ingester spec planned (topic-taxonomy-ingester). Per-source native taxonomies are part of each source ingester. Hand-curated Josh meta-taxonomy is editorial work; deferred to a separate spec. |

Storage shape & chunking — per-source decisions

For each source, the substrate makes two architectural choices at ingest time: storage shape (structured-only, or chunked + vectorized) and, where chunked, which chunker is live. Both are decided up-front and locked at the source's first bulk run; changing either later means re-embedding the corpus.

  • Structured-only: relational tables, with FTS5 on text columns where useful. No chunking, no vectors. Right for metadata-shaped sources where queries are keyword/SQL, not semantic.
  • Chunked + vectorized: body text is passed through a Chunker (see chunking.html), then embedded into a vec0 companion table. FTS5 still applies to the source's prose columns.
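
The structured-only shape can be sketched with Python's built-in sqlite3 module, assuming the underlying SQLite build includes FTS5. The `committee` table, its columns, and the sample rows are hypothetical placeholders, not the substrate's real schema. The vec0 companion table used by chunked sources is left out because it needs a loadable vector extension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical relational table for a metadata-shaped source.
conn.execute("CREATE TABLE committee (id INTEGER PRIMARY KEY, code TEXT, name TEXT)")

# FTS5 index over the text column, reading its content from the base table.
conn.execute(
    "CREATE VIRTUAL TABLE committee_fts "
    "USING fts5(name, content='committee', content_rowid='id')"
)

# Made-up sample rows, for illustration only.
rows = [
    (1, "X001", "Committee on Example Services"),
    (2, "X002", "Committee on Sample Finance"),
]
conn.executemany("INSERT INTO committee (id, code, name) VALUES (?, ?, ?)", rows)
# An external-content FTS5 table must be populated explicitly.
conn.executemany(
    "INSERT INTO committee_fts (rowid, name) VALUES (?, ?)",
    [(row_id, name) for row_id, _code, name in rows],
)


def search(term: str) -> list:
    """Keyword search via FTS5, joined back to the relational row."""
    sql = (
        "SELECT committee.code FROM committee_fts "
        "JOIN committee ON committee.id = committee_fts.rowid "
        "WHERE committee_fts MATCH ?"
    )
    return [code for (code,) in conn.execute(sql, (term,))]
```

Queries here are plain keyword matches joined to SQL rows, which is the access pattern the structured-only shape is meant for.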

"Predicted chunker" is the prior from chunking.html. "Eval status" tracks the per-source confirmation under substrate-retrieval-eval-per-source: v1 frozen means the (model × chunker) pair is locked from a labeled retrieval eval on that source; Phase 1 only means the source was tested with fixed_token_512_50 in the cross-source bake-off but has not had its chunker confirmed against structural alternatives; not yet run means no eval. See "On the production default" for the Phase 1 model pick.
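
For orientation, here is a minimal sketch of what a fixed-token chunker might look like, assuming the name fixed_token_512_50 means 512-token windows with a 50-token overlap (an interpretation, not confirmed by this page). The function name is hypothetical, and the input is any pre-tokenized list; a real chunker would presumably use the embedding model's tokenizer rather than whitespace splitting.

```python
from typing import List


def fixed_token_chunks(tokens: List[str], size: int = 512, overlap: int = 50) -> List[List[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window has reached the end, to avoid a redundant tail chunk.
        if start + size >= len(tokens):
            break
    return chunks
```

A structural chunker would instead cut at section or speaker boundaries, which is the alternative a Phase 2 eval compares against this baseline.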

| Source | Storage shape | Predicted chunker | Eval status |
| Legislators & committees | Structured-only | n/a (no body) | |
| Roll-call votes | Structured-only | n/a (no body) | |
| Hearings (metadata) | Structured-only | n/a (transcripts handled separately) | |
| Regulations.gov dockets | Structured-only (v1) | n/a (comment text deferred) | |
| LDA filings | Structured-only | n/a (short issue descriptions) | |
| Staff directories | Structured-only | n/a (partial coverage, no body) | |
| Topic taxonomy | Structured-only | n/a (LCSH label rows) | |
| CRS reports | Chunked + vectorized | fixed_token_512_50_v1 | v1 frozen (2026-05-14, 50 queries, 6,476 labels) |
| Federal Register | Chunked + vectorized | section_aware_fr_v1 (HD-driven) | Phase 1 only; Phase 2 required before bulk-embed |
| Bills | Chunked + vectorized | section_aware_bills_v1 (USLM <section>) | Phase 1 only; Phase 2 required before bulk-embed |
| U.S. Code | Chunked + vectorized | usc_uslm_section_v1 | Phase 1 only; Phase 2 required before bulk-embed (likely hybrid arm) |
| Public Laws | Chunked + vectorized | usc_uslm_section_v1 (USLM, same family) | Phase 1 only; Phase 2 reuses USC result |
| eCFR & CFR | Chunked + vectorized | section_aware_fr_v1 (likely reuse) | Phase 1 only; Phase 2 likely reuses FR result |
| Hearing transcripts | Chunked + vectorized | speaker_turn_v1 | Phase 1 only; Phase 2 required before bulk-embed |
| Congressional Record | Chunked + vectorized | speaker_turn_v1 (same as transcripts) | Phase 1 only; Phase 2 reuses transcripts result |
| Committee reports | Chunked + vectorized | section_aware_v1 (CRPT XML) | Phase 1 only; Phase 2 nice-to-have, not required |
| GAO reports | Chunked + vectorized | recursive_v1 | Phase 1 only; no Phase 2 (recursive baseline) |
| CBO cost estimates | Chunked + vectorized | recursive_v1 | Not yet run; Phase 1 deferred (DataDome wall) |
| Statements of Admin Policy | Chunked + vectorized | recursive_v1 (or whole-doc, since these are short) | Phase 1 only; no Phase 2 |

Production model across all chunked sources: Snowflake/snowflake-arctic-embed-l-v2.0 (1024-dim, 8192 max-seq). See "On the production default" for the Phase 1 result that picked it.

This page is hand-authored. When a source ships (production backfill complete and queries return real data), edit the relevant row: flip the status to shipped, update the notes column with the record count and load date, and update the Last updated footer below. Do the same when a source moves into in progress.

Keep the data sources index in sync with this page; both should reflect the same state. When a source's status changes, update both.

A live status surface served from josh-core (with row counts, last-run timestamps, error rates pulled directly from the substrate at request time) is on the roadmap. Until then, this page is the canonical "what's loaded right now" view.
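
As a hedged sketch of the row-count part of that roadmap item, a request-time check could look something like the following. It assumes the substrate is reachable as a SQLite connection; the `row_counts` helper and any table names passed to it are hypothetical, and last-run timestamps and error rates are not covered.

```python
import sqlite3
from typing import Dict, Iterable


def row_counts(conn: sqlite3.Connection, tables: Iterable[str]) -> Dict[str, int]:
    """Return the current row count for each named table."""
    counts: Dict[str, int] = {}
    for table in tables:
        # Table names cannot be bound as SQL parameters, so validate before interpolating.
        if not table.isidentifier():
            raise ValueError(f"unsafe table name: {table!r}")
        counts[table] = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    return counts
```

A zero count for a source whose code has shipped would correspond to the "code is shipped but data is not currently loaded" case that this page records as not shipped.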