substrateplannedp0

Substrate observability defaults — logging + embedding operational knobs

substrate-observability-defaults · updated 2026-05-10T23:30:00Z · owner rritz

Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.

Pin the cross-cutting observability conventions every Josh service
inherits — logging library, log format, log levels, request-id
propagation, sensitive-data redaction — so debugging a production
incident isn't four different grep patterns across four services.
Today there's drift: josh-ingester uses structlog while
josh-embedder uses Python's stdlib logger; josh-core has no
convention; josh-cli (future) doesn't either. Without a substrate-
wide answer, observability dashboards have to handle multiple shapes
and agents debugging via log reads can't filter by stable fields.

This spec also closes the embedding-layer operational gaps the audit
flagged: Modal fan-out cap, HTTP provider timeout/retry, trust_remote_code
policy, and the missing readiness probe companion to /health/embedding.

Decisions-as-acceptance-criteria. Implementation is small per service
(~30 LOC) but the convergence is what makes operations coherent.

As an operator debugging a production incident, I want one log format with stable field names across all services so that my dashboard query and grep recipes work uniformly.

As an agent reading recent logs to triage a failed run, I want structured JSON I can filter by `event` and `level` so that I don't regex stderr text and break on next refactor.

As a Step-2 product feature builder, I want structured logs that an in-app observability panel could surface so that a built-in `/admin/logs` endpoint is a 200-LOC project, not a parser-and-storage rebuild.

As a security-conscious operator, I want API keys and tokens redacted from logs by default so that a stack trace dumped to journald doesn't leak credentials.

As a Kamal-proxy / k8s liveness probe, I want a separate `/health/embedding/ready` that returns 503 when unhealthy so that orchestrator restart logic doesn't depend on parsing JSON bodies.

  1. Where any Josh service emits log records, it shall use `structlog` configured to render `ConsoleRenderer` to a TTY and `JSONRenderer` otherwise; format shall not auto-switch on pipe (matches cli-conventions Q4 stderr-vs-stdout split).
  2. Where `josh-embedder` and `josh-core` are built, they shall use `structlog` for all log records; stdlib `logging` calls in those services shall be migrated.
  3. When the active log renderer is JSON, every record shall include at minimum: `timestamp` (ISO-8601 UTC ms-precision), `level`, `logger`, `event`, and `service` (auto-injected); records emitted within an HTTP request scope shall additionally include `request_id`.
  4. Where a record describes a milestone event (run started, batch completed, container started, queue polled), it shall emit at INFO level.
  5. Where a record describes a recoverable abnormality (5xx triggering retry, slow query, queue contention), it shall emit at WARNING level.
  6. Where a record describes an unrecoverable per-operation failure (run failed, provider unhealthy, migration failed), it shall emit at ERROR level.
  7. Where a record describes substrate-level danger (data corruption detected, disk full, can't open SQLite), it shall emit at CRITICAL level.
  8. When the `JOSH_LOG_LEVEL` env var is set, it shall override the substrate default INFO level for all services in that container.
  9. When `josh-core` accepts an HTTP request, the request-id middleware shall bind a UUIDv7 via `structlog.contextvars.bind_contextvars` at request entry and clear it at request exit.
  10. Where a service makes an outbound HTTP call to another Josh service, the `X-Request-Id` header shall be forwarded so the receiving service binds the same id; this is per rest-api-conventions Q3.
  11. When in-process code crosses a service boundary (e.g., the umbrella CLI imports `josh_ingester.runner` directly), the calling code shall propagate the active request-id via `bind_contextvars` so log lines correlate.
  12. Where a log record's processor pipeline runs, a default redactor shall replace values for keys matching the documented secret list (`api_key`, `JOSH_API_DATA_GOV_KEY`, `authorization`, `cookie`, `token`, `password`, `RESTIC_PASSWORD`, `set-cookie`) with `***REDACTED***`.
  13. Where additional secret keys are introduced by a service or feature, the maintaining service shall extend the redactor's allowlist; secrets shall not appear in production log output.
  14. Where the Modal embedding provider is configured, the worker shall hard-cap `JOSH_EMBED_CONCURRENCY` at 8; values exceeding 8 shall raise `ProviderConfigError` at worker startup with a hint that recommends 4 or below.
  15. Where the Modal app is deployed by an operator, the spec recommends `@app.function(max_containers=4)` to bound cost; this is documented as an operator responsibility (Josh cannot enforce it cross-vendor).
  16. Where the HTTP embedding provider issues a request, the connect timeout shall default to 10 seconds and the read timeout shall default to 120 seconds (longer than the ingestion HTTP default to allow for batch inference latency).
  17. Where the HTTP provider's request fails with a 5xx or network error, it shall retry up to 3 attempts with exponential backoff; 4xx responses shall not be retried.
  18. Where the operator wishes to override timeouts, the env vars `JOSH_HTTP_EMBED_CONNECT_TIMEOUT_SECONDS` and `JOSH_HTTP_EMBED_READ_TIMEOUT_SECONDS` shall be honored.
  19. Where `JOSH_EMBED_PROVIDER=local`, the substrate shall maintain a `TRUSTED_PUBLISHERS` allowlist (default: `Snowflake/`, `BAAI/`, `intfloat/`, `sentence-transformers/`).
  20. Where `JOSH_EMBED_MODEL_ID` does not start with an allowlisted publisher prefix, even with `JOSH_EMBED_TRUST_REMOTE_CODE=1` set, the provider shall raise `ProviderConfigError` at startup with a hint listing the trusted publishers and the override env var.
  21. Where the operator needs a non-allowlisted publisher (rare; novel model from a small lab), the env var `JOSH_EMBED_TRUSTED_PUBLISHERS=foo/,bar/` shall extend the allowlist for that deployment.
  22. When `JOSH_EMBED_TRUST_REMOTE_CODE` is unset or `0`, the provider shall instantiate `SentenceTransformer` with `trust_remote_code=False` regardless of model_id.
  23. Where `GET /health/embedding` is called, the response shall always be HTTP 200 with the documented JSON body (per embedding-observability spec; verified).
  24. Where `GET /health/embedding/ready` is called, the response shall be HTTP 200 if `provider.health_check()` passes AND substrate is reachable; HTTP 503 otherwise; the response body in both cases is the same JSON shape so dashboards can scrape either.
  25. Where Kamal-proxy or other orchestrator readiness probes are configured, the recommended target is `/health/embedding/ready`; `/health/embedding` is observability-only and shall not be used as a gating probe.
kindtest_file

Path

shared/josh_substrate/tests/test_observability_defaults.py

Runner

uv run pytest shared/josh_substrate/tests/test_observability_defaults.py -v

Contract tests across the observability surface: - structlog config: ConsoleRenderer attaches when stderr is a TTY, JSONRenderer when not. Required fields present in JSON output. - Redactor: a log call carrying `api_key='secret'` emits `api_key='***REDACTED***'`. - Request-id propagation: a request that flows through `josh-core` middleware → in-process call → `josh-ingester` runner emits log records with the same `request_id` value across services. - Embedder: `JOSH_EMBED_CONCURRENCY=10` raises ProviderConfigError; `=4` works. `JOSH_EMBED_MODEL_ID=evilcorp/badmodel` with trust_remote_code=1 raises; same model with `JOSH_EMBED_TRUSTED_PUBLISHERS=evilcorp/` works. - HTTP provider: a request stub that returns 5xx is retried 3x; a request stub that returns 4xx is not retried. - Health probes: `/health/embedding` returns 200 even when provider.health_check is False; `/health/embedding/ready` returns 503 in that case. Determiner currently fails because: (a) shared/josh_substrate/tests/test_observability_defaults.py does not yet exist; (b) structlog migration is incomplete (josh-embedder + josh-core); (c) trust_remote_code allowlist is not implemented; (d) `/health/embedding/ready` route is not yet shipped. Will flip to passing as the implementation lands across all four services.

None.

  • Centralized log aggregation infrastructure (Loki, Vector, Fluent Bit, Datadog) — out of scope for v1; structlog JSON output is compatible with all of them when an operator wants to add one.
  • In-app `/admin/logs` or `/admin/events` endpoint — Step 2 product feature; the JSON-first decision now makes it cheap to build later.
  • Prometheus exposition (`/metrics` endpoint) — deferred. `/health/embedding` and `/health/embedding/ready` are sufficient for v1.
  • OpenTelemetry traces / spans — deferred. structlog's request-id propagation is the v1 substitute.
  • Choice between Modal vs HF Inference Endpoints vs OpenAI embeddings — operator-side configuration, not a substrate spec decision. The HTTP provider's TEI/OpenAI-format support already covers all three.
  • Per-source log volume budgeting / sampling — defer to ops as needed.

Decisions locked across two grilling-style questions, organized below.

## 1. Logging conventions (cross-cutting)

Locked: structlog substrate-wide; JSON in production; ConsoleRenderer
for TTY (matches cli-conventions Q4 stderr-vs-stdout pattern).

### Library

structlog (already in josh-ingester). 13-year-old library, mature,
pure Python, async-friendly via contextvars. Used in production by
Datadog, Mozilla, Stripe, Reddit. Pure-stdlib alternative
(python-json-logger) was considered; structlog wins on contextvars
binding (request-id propagation) and processor pipelines (redaction).

Migration:
- josh-embedder swaps stdlib logger to structlog (~30 LOC across
worker.py, cli.py, jobs.py, factory.py).
- josh-core adopts structlog from inception via
app/middleware/logging.py.
- josh-cli (future, per cli-conventions) uses structlog from
inception.
- josh-ingester already uses structlog; no migration needed.

### Format

```
TTY (developer terminal):
2026-05-10T18:30:00.123Z [info ] source.run.started
source=crs-reports mode=incremental run_id=01HF6...

Non-TTY (production journald, kamal app logs, container stdout):
{"timestamp":"2026-05-10T18:30:00.123Z","level":"info","logger":
"josh_ingester.runner","event":"source.run.started","source":
"crs-reports","mode":"incremental","run_id":"01HF6...","service":
"josh-ingester"}
```

Format never auto-switches on pipe (matches the explicit---json
pattern in cli-conventions Q4); JSON happens automatically when the
output stream isn't a TTY because the operator is running in a
container or the output is being captured by an aggregator.

### Required fields

Auto-injected on every record by the substrate's structlog config:
- timestamp (ISO-8601 UTC ms — matches rest-api-conventions Q6).
- level (DEBUG / INFO / WARNING / ERROR / CRITICAL).
- logger (module path: josh_ingester.runner, josh_embedder.worker, etc.).
- service (josh-core / josh-ingester / josh-embedder / josh-cli).
- request_id (when in HTTP request scope; via bind_contextvars).

Emitted by the caller per record:
- event — short snake_case identifier (source.run.started,
embedding.batch.completed). Operator dashboards aggregate by
this.
- Any event-specific kwargs (e.g., source=crs-reports, count=42).

### Log levels — when to use which

- DEBUG — verbose internals (per-row processing, per-request
timing). Off in production.
- INFO — milestone events. Default level in production.
- WARNING — recoverable abnormalities (5xx triggered retry, slow
query). Operator should glance.
- ERROR — unrecoverable per-operation failures. Operator should
investigate.
- CRITICAL — substrate-level danger (data corruption, disk full,
SQLite open failure). Page someone.

Default production level: INFO. Override via JOSH_LOG_LEVEL env var.

### Request-ID propagation

- HTTP cross-service: X-Request-Id header (already locked in
rest-api-conventions Q3; auto-emitted on every response).
- In-process cross-service: structlog.contextvars.bind_contextvars
+ the merge_contextvars processor pulls the bound id into every
log record.
- The umbrella CLI (per cli-conventions Q1) propagates the id when
importing josh_ingester.runner etc. — it generates a UUIDv7 at
invocation start and binds it.

Why this matters: an agent searching for "all log records from the
CRS run that failed at 14:30" can grep request_id=01HF6... across
every service's logs. Without context vars, you'd parse stderr text
and break on next refactor.

### Sensitive-data redaction

Default redactor (a structlog processor) replaces values for keys
matching:

``python
REDACTED_KEYS = frozenset({
"api_key", "JOSH_API_DATA_GOV_KEY",
"authorization", "cookie", "set-cookie",
"token", "password",
"RESTIC_PASSWORD",
})
``

When a log call accidentally serializes a kwarg matching one of these
names, the value is replaced with *REDACTED* before rendering.

Per-service authors extend the list as new secret kinds are
introduced. Goal: zero secrets in production log output.

## 2. Embedding operational defaults

### Modal fan-out cap

- Worker hard-caps JOSH_EMBED_CONCURRENCY at 8; raise
ProviderConfigError at startup if exceeded.
- Modal-side: documented operator responsibility to set
@app.function(max_containers=4). Cross-vendor enforcement isn't
feasible.

### HTTP provider timeout/retry

- Connect: 10s (same as substrate-source-defaults; network reachability
is reachability).
- Read: 120s (2× ingestion default — batch embedding inference can
take 30-60s on busy backends).
- Retries: 3 attempts, exponential backoff, 5xx + network only (per
substrate-source-defaults convention).
- Override via JOSH_HTTP_EMBED_CONNECT_TIMEOUT_SECONDS /
_READ_TIMEOUT_SECONDS.

### trust_remote_code allowlist

``python
TRUSTED_PUBLISHERS = frozenset({
"Snowflake/",
"BAAI/",
"intfloat/",
"sentence-transformers/",
})
``

Validation at provider startup:

1. If JOSH_EMBED_TRUST_REMOTE_CODE is unset or 0, instantiate
SentenceTransformer with trust_remote_code=False regardless.
2. If JOSH_EMBED_TRUST_REMOTE_CODE=1:
- Compare JOSH_EMBED_MODEL_ID against TRUSTED_PUBLISHERS plus
JOSH_EMBED_TRUSTED_PUBLISHERS (env-extended).
- If matches: instantiate with trust_remote_code=True.
- If doesn't match: ProviderConfigError with hint listing
trusted publishers and the override env var.

Closes the silent-arbitrary-code-execution surface for any model
the operator hasn't deliberately allowlisted.

### Health probes

- GET /health/embedding — verified, ships today, always 200, body
has details. Stays as the observability surface.
- GET /health/embedding/ready — new. Returns 200 if
provider.health_check() passes AND substrate reachable; 503
otherwise. Same JSON body shape regardless of status code.
Recommended target for Kamal-proxy / k8s readiness probes.

## Implementation surface

- shared/josh_substrate/observability/ (new subpackage):
logging.py (structlog config + processors), redaction.py
(default redactor), request_id.py (UUIDv7 generation +
contextvars helpers).
- josh-embedder/src/josh_embedder/ — swap stdlib logger to
structlog. ~30 LOC.
- josh-core/josh_core/middleware/logging.py — request-id middleware
+ structlog config bootstrap.
- josh-core/josh_core/routes/health_embedding_ready.py — new readiness
endpoint.
- shared/josh_substrate/embedding/providers/local_st.py — extend
with TRUSTED_PUBLISHERS validation.
- shared/josh_substrate/embedding/providers/http_provider.py
add explicit timeout defaults + env-var overrides.
- josh-embedder/src/josh_embedder/settings.py — hard-cap
concurrency at 8 with raise.
- shared/josh_substrate/tests/test_observability_defaults.py
contract test suite.

1 of 13 done.

  • t1 All cross-cutting decisions resolved with rritz; spec drafted
  • t2 shared/josh_substrate/observability/ subpackage with structlog config + redactor + request-id helpers
  • t3 josh-embedder migrates stdlib logger to structlog
  • t4 josh-core adopts structlog via app/middleware/logging.py + request-id middleware
  • t5 X-Request-Id forwarding in any cross-service HTTP plus contextvars binding for in-process boundaries
  • t6 Default redactor + REDACTED_KEYS list applied to every service's structlog config
  • t7 Worker concurrency hard-cap at 8 in settings.py validation
  • t8 HTTP embedding provider explicit timeout (10s/120s) + env-var overrides
  • t9 trust_remote_code allowlist validation in local_st.py
  • t10 /health/embedding/ready endpoint shipped with 200/503 routing
  • t11 embedding-provider-http + embedding-provider-modal + embedding-provider-local-st specs declare dependency on this spec
  • t12 Contract test suite at shared/josh_substrate/tests/test_observability_defaults.py covers every acceptance criterion
  • t13 Determiner runs green — every convention exercised across all four services
  • 2026-05-10T23:30:00Z (new)planned Spec authored after a 2-question grilling-style decision pass with claude: Q1: structlog substrate-wide; ConsoleRenderer for TTY, JSONRenderer for non-TTY (production); standard log-level hierarchy with concrete rules; required fields (timestamp, level, logger, event, service, request_id-when-in-scope); request-id propagated via X-Request-Id header (HTTP) + contextvars (in-process); default redactor with documented secret-key list. josh-embedder + josh-core migrate from stdlib; josh-ingester already structlog; josh-cli (future) inherits from inception. Q2: Modal: hard-cap worker concurrency at 8 + documented operator responsibility to set max_containers=4. HTTP provider: 10s/120s timeout + 3-retry exp-backoff. trust_remote_code: allowlist (Snowflake/BAAI/intfloat/sentence-transformers/) + env-extension; non-allowlisted models raise even with opt-in env var. Add /health/embedding/ready as 200/503 gating probe; /health/embedding stays always-200 for observability. Three embedding-provider specs (http, local-st, modal) and the worker spec now declare dependency on this spec. Spec touches cross-cutting concerns shared across all four services (josh-core, josh-ingester, josh-embedder, future josh-cli). One operational footnote: HuggingFace Inference Endpoints can replace Modal for backfill bursts via the existing HTTP provider with `JOSH_HTTP_FORMAT=tei` — no new code required. Operators can A/B test Modal vs HF Endpoint when running large backfills; the spec doesn't mandate either.

docs/spec/substrate-observability-defaults.html · generated by bin/build-spec.py