REST API — aggregations

Header

Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.

Why

Roughly 17% of realistic queries against the substrate (11 of 64 in
the https://docs.usejosh.com/operations/query-coverage/ analysis) are analytical:
counts, top-N rankings, group-by-and-tally, time-bucket histograms.
Examples that drove this spec:

- "Top 20 LDA registrants by total disclosed spending in 2025."
- "How many bills did each freshman House member introduce in the
119th Congress, ranked?"
- "Top 10 federal agencies by number of final rules published in 2025."
- "Monthly volume of bills mentioning 'AI' from 2018 through today."
- "What percent of roll-call votes in the 118th Senate were party-line?"

The substrate has the data. The current API surface (ID lookup,
list, search) doesn't expose it. Agents asked these questions today
would have to paginate through every record and count in their head
— wrong shape at the corpus's v1 scale (millions of rows), and a
great way to OOM the 32 GB host with naive pagination.

This spec adds declarative aggregate parameters to existing list
endpoints — Stripe/GitHub/Algolia faceting style, not arbitrary SQL.
Bounded by design: every aggregate query carries a result-size cap,
a query-cost ceiling, and a hard timeout. The substrate stays
memory-safe; the agent gets analytical answers in one call.

User stories

As a lobbyist, I want to call `GET /v1/lda-filings?aggregate=sum&sum_field=spend_amount&group_by=registrant_name&top=20&year=2025` so that I get the top 20 registrants by 2025 spend in one request, not by paginating through 50,000 filings.

As an academic researcher, I want to call `GET /v1/bills?aggregate=count&group_by=sponsor_bioguide_id&congress=119&sponsor_term_class=freshman` so that I can rank freshmen by bill-introduction volume without orchestrating dozens of list calls.

As a federal contractor, I want to call `GET /v1/federal-register?aggregate=count&group_by=agency&bucket_by_time=published_at&time_interval=quarter&since=2020-01-01` so that I see a multi-year, multi-agency activity histogram in one envelope.

As an OSS self-hoster reading docs, I want aggregate query semantics to be documented and predictable across every list endpoint so that I don't have to learn a per-resource analytical DSL.

As a substrate operator, I want every aggregate query to be bounded by row-touch budget and timeout so that a single ill-formed analytical query cannot exhaust the substrate host.

Acceptance criteria (EARS)

When a client sends `GET /v1/<resource>?aggregate=<op>` where `<op>` is one of `count`, `sum`, `avg`, `min`, `max`, the system shall execute the aggregate against the matched record set and return `{aggregate: <op>, value: <number>, query_cost: {rows_scanned: N, time_ms: M}}`.
When `?group_by=<field>` is added, the response shall be `{aggregate: <op>, buckets: [{key: <value>, value: <number>, count: <N>}, ...], total_buckets: <int>, query_cost: {...}}` — one bucket per distinct value of the grouped field.
When `?bucket_by_time=<field>&time_interval=<interval>` is added (interval one of `day`, `week`, `month`, `quarter`, `year`), the response shall return one bucket per time interval with `{key: <iso8601_start>, value: <number>, count: <N>}` covering the request's time range.
When both `?group_by=` and `?bucket_by_time=` are passed, the response shall be a nested histogram: `{aggregate: <op>, buckets: [{time_key, group_buckets: [...]}, ...]}`.
When `?top=<N>` is set, results shall be limited to the top N buckets by `value` descending; default 50, hard maximum 1000.
When `?sort=<field>` overrides the count-desc default, buckets shall be sorted by that field; only `key`, `value`, `count` are valid sort fields.
When structured filters (e.g., `?congress=119`, `?since=2025-01-01`, `?source=fr,bills`) are combined with aggregate params, the system shall apply filters BEFORE aggregation so the candidate set is bounded.
Where a field is requested via `?group_by=` or `?sum_field=` that is not registered in the per-resource aggregation plan (per `josh_substrate.aggregates.plans.<resource>`), the system shall return HTTP 400 with `error.code='field_not_aggregatable'` and `hint.valid_fields: [...]`.
Where a `sum`, `avg`, `min`, or `max` is requested on a non-numeric field, the system shall return HTTP 400 with `error.code='aggregate_op_invalid_for_field'` and `hint.valid_ops`.
When an aggregate query would touch more than 10,000,000 rows after filter application (estimated by SQLite `EXPLAIN QUERY PLAN`), the system shall return HTTP 400 with `error.code='aggregate_too_broad'` and `hint.estimated_rows_scanned`, `hint.suggest_narrowing` (list of filter dimensions that would help).
When an aggregate query exceeds a hard wall-clock timeout of 5000 ms, the system shall abort the query and return HTTP 408 with `error.code='aggregate_timeout'` and `hint.partial_buckets_returned: false`.
When an aggregate result returns buckets keyed on an ID column (`sponsor_bioguide_id`, `committee_id`, etc.), each `key` shall be a valid input to the corresponding `GET /v1/<resource>/{id}` endpoint (round-trip with `rest-api-resource-endpoints`).
When any aggregate query completes, the response shall include `query_cost: {rows_scanned: N, time_ms: M, used_index: <index_name_or_null>}` so the caller can reason about cost.

Success determiner

kindtest_file

Path

josh-core/tests/test_aggregations.py

Runner

uv run pytest josh-core/tests/test_aggregations.py -v

Contract test against the live FastAPI app with seeded analytical fixtures (multi-year bills, LDA filings spanning quarters, etc.): - Scalar aggregate: `?aggregate=count` returns `{value: N}`. - Group-by: `?aggregate=count&group_by=sponsor_party&congress=119` returns a bucket per party. - Sum + group: `?aggregate=sum&sum_field=spend_amount&group_by=registrant_name&top=20` against seeded LDA fixtures. - Time-bucket: `?bucket_by_time=published_at&time_interval=month&since=2024-01-01` returns 12 monthly buckets for 2024. - Nested histogram: `?group_by=agency&bucket_by_time=published_at&time_interval=quarter` returns nested structure. - Filter composition: structured filters apply first, aggregate runs over the filtered set. - Field eligibility: `?group_by=body_text` → 400 `field_not_aggregatable`. - Op eligibility: `?aggregate=sum&sum_field=title` → 400 `aggregate_op_invalid_for_field`. - Budget: an aggregate query estimated to scan > 10M rows → 400 `aggregate_too_broad`. - Timeout: an aggregate query that takes > 5s (artificial slowdown in test) → 408 `aggregate_timeout`. - Round-trip: each bucket key with an ID column is fetchable via the matching resource endpoint. - Query cost: every response includes `rows_scanned` and `time_ms`. Determiner currently fails because the aggregation router and per-resource aggregation plans don't exist. Flips to passing once the shared aggregation service ships.

Clarifications needed

Should aggregates support `having=` post-aggregation filters (e.g., `?aggregate=count&group_by=sponsor&having=count>10`)? Lean no for v1 — agents can filter the buckets client-side. Adds it later if a real use case emerges.
Should we expose `aggregate=distinct_count` (cardinality)? Useful for some questions ('how many distinct lobbyists registered for this client') but adds index pressure. Lean yes, but cap with a separate per-source budget.
Caching strategy for hot aggregates. A few queries are intrinsically broad (`bills?aggregate=count&group_by=congress` is a dashboard query). LRU cache keyed on the canonicalized query string, TTL 5 minutes? Or no cache for v1, revisit on production load?
Per-source aggregation plan must enumerate which fields are group-able and sum-able. Should the plan live in YAML (`shared/josh_substrate/aggregates/plans/<source>.yaml`) or as Python objects under `josh_substrate.aggregates.plans`? Lean Python — they reference schema columns and benefit from type checks.

Out of scope

Arbitrary SQL endpoint — rejected at design time. The 10M-row budget and the field-eligibility allow-list are explicit guard-rails against the kind of analytical query that could OOM the host.
JOIN-based aggregates across resources (e.g., 'count bills per sponsor's committee'). Lives in `rest-api-dossiers` for the bounded fan-out case; the analytical version is deferred to v2 if real demand appears.
Approximate aggregates (HyperLogLog distinct counts). Substrate is small enough at v1 that exact counts are cheap; deferred.
Window functions (running totals, rank within group). Deferred — most analytical needs are flat aggregates.
Aggregates over body-text content (e.g., word counts, average chunk length). Lives in a future eval/observability spec if needed.
Per-key drill-down endpoints (clicking a bucket to see its records) — that's just `GET /v1/<resource>?<filter>=<key>`, which already exists.

Dependencies

Plan

## 1. Why declarative params, not arbitrary SQL

An arbitrary SQL endpoint is tempting but fails three tests:

- Memory safety. Per the substrate's memory-safe-queries
invariant, every query must be bounded. Arbitrary SQL doesn't
compose with that.
- Stability. Schema changes shouldn't break callers. A
?group_by=sponsor param is stable across migrations; a raw
JOIN legislators ON … isn't.
- Discoverability. OpenAPI documents the params; agents see them
in the tool catalog. Raw SQL is opaque to introspection.

Declarative params hit the sweet spot: ~80% of analytical questions
in the coverage analysis are covered, the surface is small, and the
cost model is predictable.

## 2. The per-source aggregation plan

Each resource registers an AggregationPlan declaring which fields
are group-able and sum-able:

``python # shared/josh_substrate/aggregates/plans/bills.py PLAN = AggregationPlan( groupable={ "sponsor_bioguide_id", "sponsor_party", "sponsor_state", "sponsor_chamber", "congress", "type", "status", "introduced_year", "introduced_month", }, summable={}, # no numeric body field time_bucketable={"introduced_date", "last_action_date"}, indexed={"congress", "sponsor_bioguide_id", "type"}, )``

At request time, the router consults the plan: unknown fields → 400
with hint.valid_fields. Indexed fields are used for the EXPLAIN
query plan to estimate row-touch count.

## 3. Memory-safety: the 10M-row budget

Before executing, the service runs EXPLAIN QUERY PLAN (or a
cardinality estimate via index stats) to project how many rows the
aggregate will scan. If estimate > 10M, reject with 400 +
hint.suggest_narrowing:

``json { "error": { "code": "aggregate_too_broad", "hint": { "estimated_rows_scanned": 47000000, "max_rows_scanned": 10000000, "suggest_narrowing": ["congress", "since", "source"] } } }``

Agent retries with a tighter filter. v1 caps may need tuning under
real load; the constant lives in config.

## 4. Hard timeout

SQLite supports sqlite3_progress_handler and Python's
sqlite3.connect(..., timeout=...) for write-locks but not query
timeouts directly. Implementation uses a wrapper thread that sets
sqlite3_interrupt() after 5 seconds. Returns 408 with
aggregate_timeout.

## 5. Time bucketing

SQLite strftime covers day/week/month/quarter/year:

``sql -- ?bucket_by_time=published_at&time_interval=quarter SELECT strftime('%Y-%m', date(published_at, 'start of month', printf('-%d months', strftime('%m', published_at) % 3))) AS bucket, COUNT(*) AS value FROM federal_register WHERE published_at >= ? GROUP BY bucket ORDER BY bucket``

Time keys are ISO-8601 (2025-Q1, 2025-01, 2025-01-13) for
determinism.

## 6. MCP wrapper

mcp-server exposes Class E aggregation tools (typed wrappers
around this endpoint):

- count(resource, group_by?, filters?) — wraps ?aggregate=count
- sum(resource, sum_field, group_by?, filters?) — wraps ?aggregate=sum
- time_series(resource, time_field, interval, group_by?, filters?) — wraps ?bucket_by_time=

Each tool description includes the eligible group_by fields per
resource, derived from the live aggregation plan registry. Same
pattern as semantic_search's eligibility enumeration.

## 7. Implementation surface

```
shared/josh_substrate/
aggregates/
__init__.py
plans/
bills.py
lda_filings.py
federal_register.py
roll_call_votes.py
...one plan per source
query_builder.py # builds parameterized SQL from params
cost_estimator.py # row-touch projection via EXPLAIN
timeout.py # wrapper-thread interrupt

josh-core/josh_core/
services/
aggregate.py # router-facing service
tests/
test_aggregations.py # determiner
```

Per-resource routers consult the aggregation service when ?aggregate=
is present; otherwise they fall through to the list/singleton/?q=
paths as before.

Tasks

0 of 13 done.

Changelog

2026-05-13T15:00:00Z planned→planned New spec authored after the 64-query coverage analysis surfaced a ~17% aggregation gap (11 of 64 queries are counts, top-N, group-by, or time-series). Locked the declarative-params approach (Stripe / GitHub / Algolia faceting style) over both arbitrary SQL (memory unsafe) and pre-canned endpoints (too narrow). Per-source AggregationPlan registry enforces field-allowlist; cost estimator via EXPLAIN bounds row-touch at 10M; hard 5s timeout via sqlite3_interrupt. MCP Class E tools (count, sum, time_series) wrap the REST endpoint.