LocalSTProvider — sentence-transformers in-process adapter
Header
Why
The default provider for two scenarios: (1) query-time embedding in
josh-core, where Modal cold-start (10–30s) is unacceptable for
interactive latency, and (2) bulk embedding when a deployment has no
GPU/remote backend and must fall back to CPU. Without LocalSTProvider,
Josh deployments require Modal credentials or a TEI server to do
anything; with it, the substrate is fully usable on a single VM.
User stories
As a small-deployment operator, I want an embedding provider that works without external services so that I can run Josh on a single $30/mo VM.
As a developer, I want query-time embedding under 50ms on CPU so that interactive search stays snappy without GPUs.
Acceptance criteria (EARS)
- When a `LocalSTProvider` is constructed, the system shall NOT load the model — load happens lazily on first `_ensure_loaded` call.
- When `embed_documents` is called on a fresh provider, the system shall load the model in `asyncio.to_thread` so the event loop is not blocked.
- Where the model is in `_MODELS_WITH_QUERY_PROMPT`, `embed_query` shall pass `prompt_name='query'`; otherwise it shall fall back to plain encoding with no prompt.
- When `embed_documents` is called with input longer than the model's max sequence length, the system shall truncate (sentence-transformers default) rather than raise.
- Where the publisher ships custom HF modeling code requiring `trust_remote_code=True` (e.g., Snowflake/snowflake-arctic-embed-m-v2.0; the production default Arctic-L uses standard XLM-Roberta and does NOT require the flag), the system shall accept the flag at construction and forward it.
- When `health_check` is called, the system shall load the model if not already loaded and return `True`; on load failure, return `False` (do not raise).
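The criteria above can be illustrated with a minimal sketch. It is not the actual `local_st.py`: the class name `LocalSTProviderSketch`, the `loader` parameter (a test seam so the example runs without torch), and the contents of `_MODELS_WITH_QUERY_PROMPT` are assumptions made for illustration. The lock that guards the one-time load is omitted here for brevity.

```python
import asyncio
from typing import Any, Callable, List, Optional, Sequence

# Illustrative contents only; the real set is defined in local_st.py.
_MODELS_WITH_QUERY_PROMPT = {"Snowflake/snowflake-arctic-embed-l-v2.0"}


class LocalSTProviderSketch:
    def __init__(
        self,
        model_name: str,
        trust_remote_code: bool = False,
        loader: Optional[Callable[..., Any]] = None,  # hypothetical test seam
    ) -> None:
        # Construction only records settings; the model is NOT loaded here.
        self.model_name = model_name
        self.trust_remote_code = trust_remote_code
        self._loader = loader
        self._model: Any = None

    async def _ensure_loaded(self) -> Any:
        if self._model is None:
            def _load() -> Any:
                if self._loader is not None:
                    return self._loader(
                        self.model_name, trust_remote_code=self.trust_remote_code
                    )
                # Local import: importing this module does not pull in torch.
                from sentence_transformers import SentenceTransformer
                return SentenceTransformer(
                    self.model_name, trust_remote_code=self.trust_remote_code
                )
            # Loading runs in a worker thread so the event loop stays free.
            self._model = await asyncio.to_thread(_load)
        return self._model

    async def embed_documents(self, texts: Sequence[str]) -> List[Any]:
        model = await self._ensure_loaded()
        # Over-long inputs are truncated by sentence-transformers, not rejected.
        return list(await asyncio.to_thread(model.encode, list(texts)))

    async def embed_query(self, text: str) -> Any:
        model = await self._ensure_loaded()
        # Prompt-aware models get the query prompt; others use plain encoding.
        kwargs = (
            {"prompt_name": "query"}
            if self.model_name in _MODELS_WITH_QUERY_PROMPT
            else {}
        )
        vectors = await asyncio.to_thread(model.encode, [text], **kwargs)
        return vectors[0]

    async def health_check(self) -> bool:
        try:
            await self._ensure_loaded()
        except Exception:
            return False  # load failure is reported, never raised
        return True
```

Passing a stub through `loader` is one way a contract test could exercise these paths without downloading a model; whether the real adapter exposes such a seam is not stated in this spec.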
Success determiner
Path
Runner
Clarifications needed
None.
Out of scope
None.
Dependencies
Plan
Single module local_st.py. The SentenceTransformer import is local to
`_ensure_loaded`, so the substrate's base import doesn't pull in torch
unconditionally. The josh-substrate[local-st] extra carries the deps.
The lock for the one-time model load is created lazily on first await to
avoid binding to a destroyed event loop in test contexts.
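The lazily created lock can be sketched as follows. This is an illustration of the pattern under stated assumptions, not the adapter's code: `LazyLoader`, `load_fn`, and `ensure_loaded` are hypothetical names, and `load_fn` stands in for the blocking model construction.

```python
import asyncio
from typing import Any, Callable, Optional


class LazyLoader:
    """Hypothetical helper showing a one-time load guarded by a lazy lock."""

    def __init__(self, load_fn: Callable[[], Any]) -> None:
        self._load_fn = load_fn
        self._model: Any = None
        # No asyncio.Lock() here: the constructor may run outside any event
        # loop, or in a loop that a test harness later tears down.
        self._lock: Optional[asyncio.Lock] = None

    async def ensure_loaded(self) -> Any:
        if self._model is not None:
            return self._model  # fast path once loaded
        if self._lock is None:
            # First await: create the lock inside the running loop.
            # No await between the check and the assignment, so no race.
            self._lock = asyncio.Lock()
        async with self._lock:
            # Re-check: another task may have finished loading while we waited.
            if self._model is None:
                self._model = await asyncio.to_thread(self._load_fn)
        return self._model
```

With this shape, several coroutines awaiting `ensure_loaded()` concurrently trigger a single call to `load_fn`, and all of them receive the same object.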
Tasks
5 of 5 done.
- t1 LocalSTProvider class implementing the Protocol
- t2 Lazy model load + asyncio bridge via to_thread
- t3 Asymmetric query/document encoding for known prompt-aware models
- t4 trust_remote_code support for Snowflake Arctic v2
- t5 Contract suite passes against all-MiniLM-L6-v2 (test factory)
Changelog
- 2026-05-27T00:00:00Z verified→verified: Acceptance criterion 5 clarified. The `trust_remote_code` example now notes that the production default (Arctic-L) does NOT require the flag; only models shipping custom HF modeling code (Arctic-M, etc.) do. No code change; documentation-only clarification after Phase 1 picked Arctic-L over Arctic-M.
- 2026-05-10T11:00:00Z planned→verified: Adapter written; 12 contract tests pass against MiniLM in CI.