A quick at-a-glance view of which sources are loaded into the substrate and which still need work. Updated by hand as sources land — for the canonical inventory with per-source details, see the data sources index. A live status surface served from josh-core is on the roadmap.
Code shipped + verified (ingester spec is verified). 12,780-legislator load (2026-05-09) was on the destroyed DO droplet — re-ingest pending OVHcloud provision (per legislators-and-committees-ingester t10).
Code shipped + verified (ingester spec is verified). 23,092-report load (2026-05-08) was on the destroyed DO droplet — re-ingest pending OVHcloud provision (per crs-reports-ingester t10).
LCSH-loader ingester spec planned (topic-taxonomy-ingester). Per-source native taxonomies are part of each source ingester. Hand-curated Josh meta-taxonomy is editorial work; deferred to a separate spec.
For each source, the substrate makes two architectural choices at ingest time: storage shape (structured-only, or chunked + vectorized) and, where chunked, which chunker is live. Both are decided up-front and locked at the source's first bulk run; changing either later means re-embedding the corpus.
Structured-only: relational tables, with FTS5 on text columns where useful. No chunking, no vectors. Right for metadata-shaped sources where queries are keyword/SQL, not semantic.
Chunked + vectorized: body text is passed through a Chunker (see chunking.html), then embedded into a vec0 companion table. FTS5 still applies to the source's prose columns.
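The two up-front choices can be pictured as a small per-source config record. The following is a minimal sketch, not the substrate's actual code: `StorageShape`, `SourceIngestConfig`, `requires_reembedding`, and the source and chunker names in the usage lines are all hypothetical names introduced for illustration. It only encodes the rules stated above: a chunker is named if and only if the source is chunked, and changing either choice after the first bulk run is treated as a re-embed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class StorageShape(Enum):
    # Hypothetical labels for the two storage shapes described above.
    STRUCTURED_ONLY = "structured_only"
    CHUNKED_VECTORIZED = "chunked_vectorized"


@dataclass(frozen=True)  # frozen: mirrors "locked at the first bulk run"
class SourceIngestConfig:
    source: str
    shape: StorageShape
    chunker: Optional[str] = None  # only meaningful for chunked sources

    def __post_init__(self) -> None:
        chunked = self.shape is StorageShape.CHUNKED_VECTORIZED
        if chunked and not self.chunker:
            raise ValueError("chunked sources must name a chunker")
        if not chunked and self.chunker:
            raise ValueError("structured-only sources take no chunker")


def requires_reembedding(old: SourceIngestConfig, new: SourceIngestConfig) -> bool:
    """Follow the rule in the text: changing shape or chunker means a re-embed."""
    return old.shape != new.shape or old.chunker != new.chunker


# Usage with placeholder names (not real sources or chunkers):
current = SourceIngestConfig(
    "example_source", StorageShape.CHUNKED_VECTORIZED, "fixed_token_512_50"
)
proposed = SourceIngestConfig(
    "example_source", StorageShape.CHUNKED_VECTORIZED, "hypothetical_structural"
)
print(requires_reembedding(current, proposed))
```

Making the record immutable is one way to make the "decided up-front" rule visible in code: a change has to be expressed as a new config, which is the point where the re-embedding cost can be checked.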
"Predicted chunker" is the prior from chunking.html. "Eval status" tracks the per-source confirmation under substrate-retrieval-eval-per-source and takes one of three values. v1 frozen: the (model × chunker) pair is locked from a labeled retrieval eval on that source. Phase 1 only: the source was tested with fixed_token_512_50 in the cross-source bake-off, but its chunker has not been confirmed against structural alternatives. not yet run: no eval has been run on the source. See "On the production default" for the Phase 1 model pick.
Production model across all chunked sources: Snowflake/snowflake-arctic-embed-l-v2.0 (1024-dim, 8192 max-seq). See "On the production default" for the Phase 1 result that picked it.
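For readers unfamiliar with fixed-token chunking, here is a rough sketch of what a chunker named fixed_token_512_50 might do. It assumes the name means 512-token windows with a 50-token overlap, which this page does not state; `fixed_token_chunks` is a hypothetical helper, and it takes an already-tokenized list rather than calling the production model's tokenizer.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def fixed_token_chunks(
    tokens: Sequence[T], size: int = 512, overlap: int = 50
) -> List[Sequence[T]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens
    with the previous window. Assumed reading of fixed_token_512_50."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    chunks: List[Sequence[T]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the input
    return chunks


# Usage: integers stand in for token ids.
chunks = fixed_token_chunks(list(range(1000)))
print(len(chunks), [len(c) for c in chunks])
```

A structural chunker would instead split on document boundaries such as sections or paragraphs, which is why a source tested only with the fixed-token baseline still counts as unconfirmed.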
This page is hand-authored. When a source ships (production backfill complete and queries return real data), edit the relevant row: flip the status to shipped, update the notes column with the record count and load date, and update the Last updated footer below. Do the same when a source moves into in progress.
The data sources index stays in sync with this page — both should reflect the same state. When a source's status changes, update both.
A live status surface served from josh-core (with row counts, last-run timestamps, error rates pulled directly from the substrate at request time) is on the roadmap. Until then, this page is the canonical "what's loaded right now" view.