Substrate observability defaults — logging + embedding operational knobs
Header
Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.
Why
Pin the cross-cutting observability conventions every Josh service
inherits — logging library, log format, log levels, request-id
propagation, sensitive-data redaction — so debugging a production
incident isn't four different grep patterns across four services.
Today there's drift: josh-ingester uses structlog whilejosh-embedder uses Python's stdlib logger; josh-core has no
convention; josh-cli (future) doesn't either. Without a substrate-
wide answer, observability dashboards have to handle multiple shapes
and agents debugging via log reads can't filter by stable fields.
This spec also closes the embedding-layer operational gaps the audit
flagged: Modal fan-out cap, HTTP provider timeout/retry, trust_remote_code
policy, and the missing readiness probe companion to /health/embedding.
Decisions-as-acceptance-criteria. Implementation is small per service
(~30 LOC) but the convergence is what makes operations coherent.
User stories
As an operator debugging a production incident, I want one log format with stable field names across all services so that my dashboard query and grep recipes work uniformly.
As an agent reading recent logs to triage a failed run, I want structured JSON I can filter by `event` and `level` so that I don't regex stderr text and break on next refactor.
As a Step-2 product feature builder, I want structured logs that an in-app observability panel could surface so that a built-in `/admin/logs` endpoint is a 200-LOC project, not a parser-and-storage rebuild.
As a security-conscious operator, I want API keys and tokens redacted from logs by default so that a stack trace dumped to journald doesn't leak credentials.
As a Kamal-proxy / k8s liveness probe, I want a separate `/health/embedding/ready` that returns 503 when unhealthy so that orchestrator restart logic doesn't depend on parsing JSON bodies.
Acceptance criteria (EARS)
- Where any Josh service emits log records, it shall use `structlog` configured to render `ConsoleRenderer` to a TTY and `JSONRenderer` otherwise; format shall not auto-switch on pipe (matches cli-conventions Q4 stderr-vs-stdout split).
- Where `josh-embedder` and `josh-core` are built, they shall use `structlog` for all log records; stdlib `logging` calls in those services shall be migrated.
- When the active log renderer is JSON, every record shall include at minimum: `timestamp` (ISO-8601 UTC ms-precision), `level`, `logger`, `event`, and `service` (auto-injected); records emitted within an HTTP request scope shall additionally include `request_id`.
- Where a record describes a milestone event (run started, batch completed, container started, queue polled), it shall emit at INFO level.
- Where a record describes a recoverable abnormality (5xx triggering retry, slow query, queue contention), it shall emit at WARNING level.
- Where a record describes an unrecoverable per-operation failure (run failed, provider unhealthy, migration failed), it shall emit at ERROR level.
- Where a record describes substrate-level danger (data corruption detected, disk full, can't open SQLite), it shall emit at CRITICAL level.
- When the `JOSH_LOG_LEVEL` env var is set, it shall override the substrate default INFO level for all services in that container.
- When `josh-core` accepts an HTTP request, the request-id middleware shall bind a UUIDv7 via `structlog.contextvars.bind_contextvars` at request entry and clear it at request exit.
- Where a service makes an outbound HTTP call to another Josh service, the `X-Request-Id` header shall be forwarded so the receiving service binds the same id; this is per rest-api-conventions Q3.
- When in-process code crosses a service boundary (e.g., the umbrella CLI imports `josh_ingester.runner` directly), the calling code shall propagate the active request-id via `bind_contextvars` so log lines correlate.
- Where a log record's processor pipeline runs, a default redactor shall replace values for keys matching the documented secret list (`api_key`, `JOSH_API_DATA_GOV_KEY`, `authorization`, `cookie`, `token`, `password`, `RESTIC_PASSWORD`, `set-cookie`) with `***REDACTED***`.
- Where additional secret keys are introduced by a service or feature, the maintaining service shall extend the redactor's allowlist; secrets shall not appear in production log output.
- Where the Modal embedding provider is configured, the worker shall hard-cap `JOSH_EMBED_CONCURRENCY` at 8; values exceeding 8 shall raise `ProviderConfigError` at worker startup with a hint that recommends 4 or below.
- Where the Modal app is deployed by an operator, the spec recommends `@app.function(max_containers=4)` to bound cost; this is documented as an operator responsibility (Josh cannot enforce it cross-vendor).
- Where the HTTP embedding provider issues a request, the connect timeout shall default to 10 seconds and the read timeout shall default to 120 seconds (longer than the ingestion HTTP default to allow for batch inference latency).
- Where the HTTP provider's request fails with a 5xx or network error, it shall retry up to 3 attempts with exponential backoff; 4xx responses shall not be retried.
- Where the operator wishes to override timeouts, the env vars `JOSH_HTTP_EMBED_CONNECT_TIMEOUT_SECONDS` and `JOSH_HTTP_EMBED_READ_TIMEOUT_SECONDS` shall be honored.
- Where `JOSH_EMBED_PROVIDER=local`, the substrate shall maintain a `TRUSTED_PUBLISHERS` allowlist (default: `Snowflake/`, `BAAI/`, `intfloat/`, `sentence-transformers/`).
- Where `JOSH_EMBED_MODEL_ID` does not start with an allowlisted publisher prefix, even with `JOSH_EMBED_TRUST_REMOTE_CODE=1` set, the provider shall raise `ProviderConfigError` at startup with a hint listing the trusted publishers and the override env var.
- Where the operator needs a non-allowlisted publisher (rare; novel model from a small lab), the env var `JOSH_EMBED_TRUSTED_PUBLISHERS=foo/,bar/` shall extend the allowlist for that deployment.
- When `JOSH_EMBED_TRUST_REMOTE_CODE` is unset or `0`, the provider shall instantiate `SentenceTransformer` with `trust_remote_code=False` regardless of model_id.
- Where `GET /health/embedding` is called, the response shall always be HTTP 200 with the documented JSON body (per embedding-observability spec; verified).
- Where `GET /health/embedding/ready` is called, the response shall be HTTP 200 if `provider.health_check()` passes AND substrate is reachable; HTTP 503 otherwise; the response body in both cases is the same JSON shape so dashboards can scrape either.
- Where Kamal-proxy or other orchestrator readiness probes are configured, the recommended target is `/health/embedding/ready`; `/health/embedding` is observability-only and shall not be used as a gating probe.
Success determiner
Path
Runner
Contract tests across the observability surface: - structlog config: ConsoleRenderer attaches when stderr is a TTY, JSONRenderer when not. Required fields present in JSON output. - Redactor: a log call carrying `api_key='secret'` emits `api_key='***REDACTED***'`. - Request-id propagation: a request that flows through `josh-core` middleware → in-process call → `josh-ingester` runner emits log records with the same `request_id` value across services. - Embedder: `JOSH_EMBED_CONCURRENCY=10` raises ProviderConfigError; `=4` works. `JOSH_EMBED_MODEL_ID=evilcorp/badmodel` with trust_remote_code=1 raises; same model with `JOSH_EMBED_TRUSTED_PUBLISHERS=evilcorp/` works. - HTTP provider: a request stub that returns 5xx is retried 3x; a request stub that returns 4xx is not retried. - Health probes: `/health/embedding` returns 200 even when provider.health_check is False; `/health/embedding/ready` returns 503 in that case. Determiner currently fails because: (a) shared/josh_substrate/tests/test_observability_defaults.py does not yet exist; (b) structlog migration is incomplete (josh-embedder + josh-core); (c) trust_remote_code allowlist is not implemented; (d) `/health/embedding/ready` route is not yet shipped. Will flip to passing as the implementation lands across all four services.
Clarifications needed
None.
Out of scope
- Centralized log aggregation infrastructure (Loki, Vector, Fluent Bit, Datadog) — out of scope for v1; structlog JSON output is compatible with all of them when an operator wants to add one.
- In-app `/admin/logs` or `/admin/events` endpoint — Step 2 product feature; the JSON-first decision now makes it cheap to build later.
- Prometheus exposition (`/metrics` endpoint) — deferred. `/health/embedding` and `/health/embedding/ready` are sufficient for v1.
- OpenTelemetry traces / spans — deferred. structlog's request-id propagation is the v1 substitute.
- Choice between Modal vs HF Inference Endpoints vs OpenAI embeddings — operator-side configuration, not a substrate spec decision. The HTTP provider's TEI/OpenAI-format support already covers all three.
- Per-source log volume budgeting / sampling — defer to ops as needed.
Dependencies
Plan
Decisions locked across two grilling-style questions, organized below.
## 1. Logging conventions (cross-cutting)
Locked: structlog substrate-wide; JSON in production; ConsoleRenderer
for TTY (matches cli-conventions Q4 stderr-vs-stdout pattern).
### Library
structlog (already in josh-ingester). 13-year-old library, mature,
pure Python, async-friendly via contextvars. Used in production by
Datadog, Mozilla, Stripe, Reddit. Pure-stdlib alternative
(python-json-logger) was considered; structlog wins on contextvars
binding (request-id propagation) and processor pipelines (redaction).
Migration:
- josh-embedder swaps stdlib logger to structlog (~30 LOC across
worker.py, cli.py, jobs.py, factory.py).
- josh-core adopts structlog from inception via
app/middleware/logging.py.
- josh-cli (future, per cli-conventions) uses structlog from
inception.
- josh-ingester already uses structlog; no migration needed.
### Format
```
TTY (developer terminal):
2026-05-10T18:30:00.123Z [info ] source.run.started
source=crs-reports mode=incremental run_id=01HF6...
Non-TTY (production journald, kamal app logs, container stdout):
{"timestamp":"2026-05-10T18:30:00.123Z","level":"info","logger":
"josh_ingester.runner","event":"source.run.started","source":
"crs-reports","mode":"incremental","run_id":"01HF6...","service":
"josh-ingester"}
```
Format never auto-switches on pipe (matches the explicit---json
pattern in cli-conventions Q4); JSON happens automatically when the
output stream isn't a TTY because the operator is running in a
container or the output is being captured by an aggregator.
### Required fields
Auto-injected on every record by the substrate's structlog config:
- timestamp (ISO-8601 UTC ms — matches rest-api-conventions Q6).
- level (DEBUG / INFO / WARNING / ERROR / CRITICAL).
- logger (module path: josh_ingester.runner, josh_embedder.worker, etc.).
- service (josh-core / josh-ingester / josh-embedder / josh-cli).
- request_id (when in HTTP request scope; via bind_contextvars).
Emitted by the caller per record:
- event — short snake_case identifier (source.run.started,
embedding.batch.completed). Operator dashboards aggregate by
this.
- Any event-specific kwargs (e.g., source=crs-reports, count=42).
### Log levels — when to use which
- DEBUG — verbose internals (per-row processing, per-request
timing). Off in production.
- INFO — milestone events. Default level in production.
- WARNING — recoverable abnormalities (5xx triggered retry, slow
query). Operator should glance.
- ERROR — unrecoverable per-operation failures. Operator should
investigate.
- CRITICAL — substrate-level danger (data corruption, disk full,
SQLite open failure). Page someone.
Default production level: INFO. Override via JOSH_LOG_LEVEL env var.
### Request-ID propagation
- HTTP cross-service: X-Request-Id header (already locked in
rest-api-conventions Q3; auto-emitted on every response).
- In-process cross-service: structlog.contextvars.bind_contextvars
+ the merge_contextvars processor pulls the bound id into every
log record.
- The umbrella CLI (per cli-conventions Q1) propagates the id when
importing josh_ingester.runner etc. — it generates a UUIDv7 at
invocation start and binds it.
Why this matters: an agent searching for "all log records from the
CRS run that failed at 14:30" can grep request_id=01HF6... across
every service's logs. Without context vars, you'd parse stderr text
and break on next refactor.
### Sensitive-data redaction
Default redactor (a structlog processor) replaces values for keys
matching:
``python``
REDACTED_KEYS = frozenset({
"api_key", "JOSH_API_DATA_GOV_KEY",
"authorization", "cookie", "set-cookie",
"token", "password",
"RESTIC_PASSWORD",
})
When a log call accidentally serializes a kwarg matching one of these
names, the value is replaced with *REDACTED* before rendering.
Per-service authors extend the list as new secret kinds are
introduced. Goal: zero secrets in production log output.
## 2. Embedding operational defaults
### Modal fan-out cap
- Worker hard-caps JOSH_EMBED_CONCURRENCY at 8; raise
ProviderConfigError at startup if exceeded.
- Modal-side: documented operator responsibility to set
@app.function(max_containers=4). Cross-vendor enforcement isn't
feasible.
### HTTP provider timeout/retry
- Connect: 10s (same as substrate-source-defaults; network reachability
is reachability).
- Read: 120s (2× ingestion default — batch embedding inference can
take 30-60s on busy backends).
- Retries: 3 attempts, exponential backoff, 5xx + network only (per
substrate-source-defaults convention).
- Override via JOSH_HTTP_EMBED_CONNECT_TIMEOUT_SECONDS /
_READ_TIMEOUT_SECONDS.
### trust_remote_code allowlist
``python``
TRUSTED_PUBLISHERS = frozenset({
"Snowflake/",
"BAAI/",
"intfloat/",
"sentence-transformers/",
})
Validation at provider startup:
1. If JOSH_EMBED_TRUST_REMOTE_CODE is unset or 0, instantiate
SentenceTransformer with trust_remote_code=False regardless.
2. If JOSH_EMBED_TRUST_REMOTE_CODE=1:
- Compare JOSH_EMBED_MODEL_ID against TRUSTED_PUBLISHERS plus
JOSH_EMBED_TRUSTED_PUBLISHERS (env-extended).
- If matches: instantiate with trust_remote_code=True.
- If doesn't match: ProviderConfigError with hint listing
trusted publishers and the override env var.
Closes the silent-arbitrary-code-execution surface for any model
the operator hasn't deliberately allowlisted.
### Health probes
- GET /health/embedding — verified, ships today, always 200, body
has details. Stays as the observability surface.
- GET /health/embedding/ready — new. Returns 200 if
provider.health_check() passes AND substrate reachable; 503
otherwise. Same JSON body shape regardless of status code.
Recommended target for Kamal-proxy / k8s readiness probes.
## Implementation surface
- shared/josh_substrate/observability/ (new subpackage):
logging.py (structlog config + processors), redaction.py
(default redactor), request_id.py (UUIDv7 generation +
contextvars helpers).
- josh-embedder/src/josh_embedder/ — swap stdlib logger to
structlog. ~30 LOC.
- josh-core/josh_core/middleware/logging.py — request-id middleware
+ structlog config bootstrap.
- josh-core/josh_core/routes/health_embedding_ready.py — new readiness
endpoint.
- shared/josh_substrate/embedding/providers/local_st.py — extend
with TRUSTED_PUBLISHERS validation.
- shared/josh_substrate/embedding/providers/http_provider.py —
add explicit timeout defaults + env-var overrides.
- josh-embedder/src/josh_embedder/settings.py — hard-cap
concurrency at 8 with raise.
- shared/josh_substrate/tests/test_observability_defaults.py —
contract test suite.
Tasks
1 of 13 done.
- t1 All cross-cutting decisions resolved with rritz; spec drafted
- t2 shared/josh_substrate/observability/ subpackage with structlog config + redactor + request-id helpers
- t3 josh-embedder migrates stdlib logger to structlog
- t4 josh-core adopts structlog via app/middleware/logging.py + request-id middleware
- t5 X-Request-Id forwarding in any cross-service HTTP plus contextvars binding for in-process boundaries
- t6 Default redactor + REDACTED_KEYS list applied to every service's structlog config
- t7 Worker concurrency hard-cap at 8 in settings.py validation
- t8 HTTP embedding provider explicit timeout (10s/120s) + env-var overrides
- t9 trust_remote_code allowlist validation in local_st.py
- t10 /health/embedding/ready endpoint shipped with 200/503 routing
- t11 embedding-provider-http + embedding-provider-modal + embedding-provider-local-st specs declare dependency on this spec
- t12 Contract test suite at shared/josh_substrate/tests/test_observability_defaults.py covers every acceptance criterion
- t13 Determiner runs green — every convention exercised across all four services
Changelog
-
2026-05-10T23:30:00Z
(new)→plannedSpec authored after a 2-question grilling-style decision pass with claude: Q1: structlog substrate-wide; ConsoleRenderer for TTY, JSONRenderer for non-TTY (production); standard log-level hierarchy with concrete rules; required fields (timestamp, level, logger, event, service, request_id-when-in-scope); request-id propagated via X-Request-Id header (HTTP) + contextvars (in-process); default redactor with documented secret-key list. josh-embedder + josh-core migrate from stdlib; josh-ingester already structlog; josh-cli (future) inherits from inception. Q2: Modal: hard-cap worker concurrency at 8 + documented operator responsibility to set max_containers=4. HTTP provider: 10s/120s timeout + 3-retry exp-backoff. trust_remote_code: allowlist (Snowflake/BAAI/intfloat/sentence-transformers/) + env-extension; non-allowlisted models raise even with opt-in env var. Add /health/embedding/ready as 200/503 gating probe; /health/embedding stays always-200 for observability. Three embedding-provider specs (http, local-st, modal) and the worker spec now declare dependency on this spec. Spec touches cross-cutting concerns shared across all four services (josh-core, josh-ingester, josh-embedder, future josh-cli). One operational footnote: HuggingFace Inference Endpoints can replace Modal for backfill bursts via the existing HTTP provider with `JOSH_HTTP_FORMAT=tei` — no new code required. Operators can A/B test Modal vs HF Endpoint when running large backfills; the spec doesn't mandate either.