REST API — dossiers


Why

Cross-source fan-out from one ID is a high-volume agent shape. The
64-query coverage analysis (https://docs.usejosh.com/operations/query-coverage/)
surfaced 11 questions (17%) that need this — e.g.:

- "For HR 1, pull bill text, CBO score, every committee report,
witness lists of related hearings, and the SAP if any."
- "For every senator on the Banking Committee in the 119th, show
their roll-call vote on the CRA resolutions, plus LDA filings
naming them from financial-services clients."
- "Show every Senator who voted yes on the NDAA but whose floor
speech that week criticized provisions of it."

An agent can orchestrate 5+ tool calls per question, but in chains of 5
or more calls the LLM regularly drops one, which is the wrong shape for an agent surface.
Dossier endpoints collapse the most common fan-outs into one
envelope, computed server-side via FK joins. Round-trip preserved:
every nested record carries its id, and the agent can click
through to the full record via the resource endpoint if needed.

Not a replacement for ad-hoc joins. Dossiers are the curated
high-traffic shapes (bill → everything, legislator → everything,
committee → everything). The arbitrary "give me X about Y filtered
by Z" question still routes through rest-api-resource-endpoints +
rest-api-search + rest-api-aggregations.

User stories

As a Hill staffer prepping the boss, I want to call `GET /v1/bills/hr:119:1/dossier` and get bill text, sponsor info, cosponsors, actions, committee referrals, related CRS reports, the CBO estimate, related committee reports, hearings where it was discussed, and the SAP if any — all in one envelope so that I don't manage 8 tool calls and worry about which one to drop when the agent times out.

As a journalist building a profile of Senator X, I want to call `GET /v1/legislators/S000033/dossier?congress=119` and get terms, current committees, sponsored bills count, leadership roles, district offices, and recent floor speeches so that I have the senator's full activity surface for the current congress in one fetch.

As an academic researcher studying committee activity, I want to call `GET /v1/committees/HSWM/dossier?congress=119` and get full roster, subcommittees, recent hearings, related bills, and recent reports so that I can audit the committee's session-by-session activity without orchestrating 5 calls.

As a lobbyist tracking a public law's downstream implementation, I want to call `GET /v1/public-laws/pl:119:1/dossier` and get the originating bill, amended CFR sections, and implementing Federal Register rules so that I see the full regulatory cascade from a law in one fetch.

Acceptance criteria (EARS)

When a client requests `GET /v1/bills/{bill_id}/dossier`, the system shall return one JSON envelope including the bill record, sponsor (legislator card), cosponsors (list), actions (list), committee_referrals (list), related_crs_reports (list), cbo_estimate (singleton or null), committee_reports (list), hearings (list), sap (singleton or null), roll_call_votes (list), and related_amendments (list).
When a client requests `GET /v1/legislators/{bioguide_id}/dossier?congress=<n>`, the system shall return the legislator record, terms (full), current_committees (list with role), sponsored_bills_count (scalar), cosponsored_bills_count (scalar), leadership_roles (list), district_offices (list), and recent_floor_speeches (list of 10 most recent congressional record granules).
When a client requests `GET /v1/committees/{committee_id}/dossier?congress=<n>`, the system shall return the committee record, current members (list with role), subcommittees (list), recent_hearings (list of 20 most recent), related_bills (top 50 most recent by introduction date), and recent_reports (list of 10 most recent committee reports).
When a client requests `GET /v1/public-laws/{id}/dossier`, the system shall return the public law record, originating_bill (singleton), amended_cfr_sections (list), implementing_rules (list of related Federal Register documents).
Where a dossier nests a list, that list shall be bounded to the documented default (per-endpoint, between 10 and 50 items) and shall expose a `more_url` pointer to the canonical `rest-api-resource-endpoints` sub-resource list when more is available.
Where any nested list is truncated, the response shall include `truncated: true` and `total: <int>` for that list.
When a dossier returns a nested record, it shall use the `card` fieldset (per `rest-api-conventions` §9), NOT the full record. The `id` shall always be present so the caller can click through to `GET /v1/<resource>/{id}` for the full record.
Where a nested singleton (sponsor, cbo_estimate, sap) is present, it shall also be a `card` fieldset, not full.
When any dossier returns a nested record, that record's `id` shall be a valid input to its matching resource endpoint (round-trip with `rest-api-resource-endpoints`).
When a dossier is requested against the v1 substrate, the system shall return within 500 ms p95 (same budget as `rest-api-search`).
Where a dossier's underlying joins would exceed an internal row-touch budget (estimated via EXPLAIN), the system shall execute each nested fetch independently with bounded LIMIT to keep memory under control — never one giant join.
Where a dossier is requested twice for the same ID within the 5-minute cache TTL, the second request shall return from an in-memory LRU cache. The response shall include `cache: {hit: bool, age_seconds: <int>, ttl: 300}` so callers can audit freshness.
When the cache is invalidated by an ingester writing to one of the underlying tables (via `ingestion_events` table hook), the next dossier request for that ID shall miss cache and recompute.
Where this spec defines an endpoint, response shape (envelope, error format, datetime serialization, status codes, rate-limit headers) shall conform to `rest-api-conventions` without overrides.
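
The card and round-trip criteria above can be sketched as a projection that trims a full record to a few fields and always keeps `id`. This is a minimal sketch, not the substrate's implementation: the field names in `CARD_FIELDS` and both helper names are hypothetical, and the authoritative card fieldset is whatever `rest-api-conventions` §9 defines.

```python
# Hypothetical card fieldsets; the real lists are defined by rest-api-conventions §9.
CARD_FIELDS = {
    "legislators": ("id", "name", "party", "state"),
    "cbo_publications": ("id", "title", "published_at"),
}


def to_card(resource: str, record: dict) -> dict:
    """Project a full record down to its card fieldset, always keeping `id`."""
    if "id" not in record:
        raise ValueError(f"{resource} record has no id; cannot round-trip")
    fields = CARD_FIELDS.get(resource, ("id",))
    card = {name: record[name] for name in fields if name in record}
    card["id"] = record["id"]  # guarantee the click-through key survives
    return card


def resource_url(resource: str, card: dict) -> str:
    """Build the full-record URL a caller would follow from a card."""
    return f"/v1/{resource}/{card['id']}"
```

A caller holding a card from a dossier would pass it to `resource_url` to fetch the full record from the matching resource endpoint.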

Success determiner

Kind

test_file

Path

josh-core/tests/test_dossiers.py

Runner

uv run pytest josh-core/tests/test_dossiers.py -v

Contract test against the live FastAPI app with seeded fixtures for each dossier shape:

- Bill dossier: HR 1 (119th) with seeded cosponsors, actions, committee referrals, a CRS report, a CBO estimate, two committee reports, three hearings, a SAP, and roll-call votes. Assert all sections present with correct cardinalities and `card` shape.
- Legislator dossier: a seeded senator with terms, committees, bills, leadership, district offices, and floor speeches.
- Committee dossier: HSWM with roster, subcommittees, hearings, bills, reports.
- Public-law dossier: a seeded PL with originating bill, CFR sections, FR rules.
- Bounded fan-out: a seeded bill with > 50 cosponsors returns `truncated: true, total: N` + a valid `more_url`.
- Round-trip: every nested `id` is fetchable via the matching resource endpoint.
- Cache: two calls within 300s; first miss, second hit with `age_seconds` between 0 and 300.
- Cache invalidation: ingester writes to bill_cosponsors table → next call misses cache.
- Latency: p95 < 500 ms against v1 fixtures.

Determiner currently fails because dossier routers don't exist and the cache layer is unimplemented.
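
For the round-trip check, the test needs every nested `id` in an envelope. One way to collect them is a recursive walk, sketched below. `iter_nested_ids` is a hypothetical helper name. Mapping each path back to a resource endpoint is left out because it depends on the plan registry.

```python
from typing import Iterator


def iter_nested_ids(node, path: str = "") -> Iterator[tuple[str, str]]:
    """Yield (path, id) for every dict in the envelope that carries an `id`."""
    if isinstance(node, dict):
        if "id" in node:
            yield path or "<root>", node["id"]
        for key, value in node.items():
            yield from iter_nested_ids(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from iter_nested_ids(item, f"{path}[{index}]")
```

In the contract test, each yielded id would then be requested from its matching `GET /v1/<resource>/{id}` endpoint and asserted to succeed.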

Clarifications needed

Cache backend: in-process Python dict with LRU + TTL is the simplest. Multi-instance deployment would need Redis, but at v1 there's one josh-core container. Lean in-process.
Cache invalidation. Simplest path: write `ingestion_events` rows from ingester completion hooks, invalidate by source-table on dossier read. Cleaner but more work: pub/sub with explicit topics. Lean simple.
Nested-list defaults: 50 cosponsors, 20 actions, 10 floor speeches, etc. — pinned per dossier or globally? Lean per-dossier so each shape can tune.
Adding a new dossier endpoint when a new aggregation target appears. Should there be a public `DossierPlan` registry analogous to the AggregationPlan? Lean yes — each dossier is small but explicit, and the registry makes adding a new one a known motion.
Should dossiers accept a `?fields=` projection to exclude nested sections (e.g., `?fields=bill,sap,cosponsors` to skip everything else)? Useful for bandwidth-sensitive callers; defaults already cap size. Lean yes — same `?fields=` semantics from rest-api-conventions §9.
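
If the `?fields=` question resolves to yes, the projection could be a filter over top-level sections, as in the sketch below. Two behaviours here are assumptions, not decisions from this spec: the `cache` block is always returned, and unknown section names are rejected. `project_sections` is a hypothetical name.

```python
from typing import Optional

# Assumption: cache metadata is returned regardless of ?fields=.
ALWAYS_INCLUDED = {"cache"}


def project_sections(envelope: dict, fields: Optional[str]) -> dict:
    """Keep only the top-level sections named in a comma-separated ?fields= value."""
    if not fields:
        return envelope  # no projection requested: return every section
    wanted = {name.strip() for name in fields.split(",") if name.strip()}
    unknown = wanted - set(envelope)
    if unknown:
        # Assumption: unknown section names are an error rather than ignored.
        raise ValueError(f"unknown dossier sections: {sorted(unknown)}")
    return {k: v for k, v in envelope.items() if k in wanted or k in ALWAYS_INCLUDED}
```

Applying the filter before the section queries run, rather than after, would also save the skipped fetches.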

Out of scope

Arbitrary cross-source joins. Each dossier is a curated shape; ad-hoc fan-out routes through `rest-api-resource-endpoints` sub-resource lists plus agent orchestration.
Write-side dossiers (uploading a dossier). Substrate is read-only at v1.
Dossiers over registry sources only (e.g., `GET /v1/legislators/{id}/dossier` is in scope, but `GET /v1/staff-directories/{id}/dossier` is not — staffers don't have a big-enough relational footprint to justify a dossier shape).
Real-time push (websocket) updates of a dossier. Pull-only at v1.
Personalized dossiers (e.g., 'show me only the sections I've subscribed to'). User-state feature, not substrate.
Sub-dossier composition (calling one dossier from inside another). Each dossier is flat.

Dependencies

Plan

## 1. The four v1 dossiers

Picked from the coverage analysis as the most-requested fan-out
shapes. Each is a curated server-side join, NOT a generic graph
endpoint.

### Bill dossier — GET /v1/bills/{bill_id}/dossier

```json
{
  "bill": { id, title, status, sponsor, ... },
  "cosponsors": { data: [card, ...], total, truncated, more_url },
  "actions": { data: [card, ...], total, truncated, more_url },
  "committee_referrals": [card, ...],
  "related_crs_reports": { data: [card, ...], total, more_url },
  "cbo_estimate": card | null,
  "committee_reports": { data: [card, ...], total, more_url },
  "hearings": { data: [card, ...], total, more_url },
  "sap": card | null,
  "roll_call_votes": [card, ...],
  "related_amendments": { data: [card, ...], total, more_url },
  "cache": { hit: bool, age_seconds: int, ttl: 300 }
}
```

Default limits: cosponsors 50, actions 20, related_crs_reports 10,
committee_reports 10, hearings 10, roll_call_votes 10, related_amendments 20.
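
The `{data, total, truncated, more_url}` envelope used by the bounded sections can be sketched as a small wrapper. `bounded_section` is a hypothetical helper. Returning `more_url` as null when nothing was cut off is an assumption; the criteria only require the pointer when more is available.

```python
def bounded_section(rows: list, total: int, limit: int, more_url: str) -> dict:
    """Wrap up to `limit` cards in the {data, total, truncated, more_url} envelope."""
    truncated = total > limit
    return {
        "data": rows[:limit],
        "total": total,
        "truncated": truncated,
        # Assumption: more_url is null when the whole list fits in the dossier.
        "more_url": more_url if truncated else None,
    }
```

For a bill with 60 cosponsors and the default limit of 50, this yields 50 cards, `total: 60`, `truncated: true`, and a pointer to the sub-resource list.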

### Legislator dossier — GET /v1/legislators/{bioguide_id}/dossier?congress=<n>

Defaults: 10 most recent floor speeches, current congress only,
bills-count is scalar (not paginated list — use GET /v1/legislators/{id}/sponsored-bills to enumerate).

### Committee dossier — GET /v1/committees/{committee_id}/dossier?congress=<n>

Defaults: full member roster (capped 100), 20 most recent hearings,
top 50 recent bills by introduction date, 10 most recent reports.

### Public-law dossier — GET /v1/public-laws/{id}/dossier

Defaults: originating bill (singleton), CFR sections amended (full
list — typically small), Federal Register rules citing the public
law (top 20 by published_at).

## 2. The MCP wrapper

mcp-server exposes Class F dossier tools:

- get_bill_dossier(bill_id)
- get_legislator_dossier(bioguide_id, congress?)
- get_committee_dossier(committee_id, congress?)
- get_public_law_dossier(public_law_id)

Each tool description specifies what's in the envelope so the agent
knows what fields to expect without trial-and-error.
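
Each tool maps its arguments onto one REST path, which can be sketched with the standard library alone. MCP server registration is omitted. `_dossier_path` and the `*_path` function names are hypothetical and exist only to show the mapping.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def _dossier_path(resource: str, record_id: str, congress: Optional[int] = None) -> str:
    """Build the REST path a dossier tool would request."""
    # Keep ":" unescaped so ids like "hr:119:1" match the documented URLs.
    path = f"/v1/{resource}/{quote(record_id, safe=':')}/dossier"
    if congress is not None:
        path += "?" + urlencode({"congress": congress})
    return path


def bill_dossier_path(bill_id: str) -> str:
    return _dossier_path("bills", bill_id)


def legislator_dossier_path(bioguide_id: str, congress: Optional[int] = None) -> str:
    return _dossier_path("legislators", bioguide_id, congress)


def committee_dossier_path(committee_id: str, congress: Optional[int] = None) -> str:
    return _dossier_path("committees", committee_id, congress)


def public_law_dossier_path(public_law_id: str) -> str:
    return _dossier_path("public-laws", public_law_id)
```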

## 3. Why a registry, not ad-hoc

Each dossier has a DossierPlan analogous to the AggregationPlan
in rest-api-aggregations:

```python
# shared/josh_substrate/dossiers/plans/bill.py
PLAN = DossierPlan(
    resource="bills",
    id_field="bill_id",
    sections=[
        DossierSection(name="cosponsors", table="bill_cosponsors",
                       join_via="bill_id", limit=50, fieldset="card"),
        DossierSection(name="actions", table="bill_actions",
                       join_via="bill_id", limit=20, fieldset="card"),
        DossierSection(name="cbo_estimate", table="cbo_publications",
                       join_via="related_bill_id", limit=1, fieldset="card",
                       as_singleton=True),
        ...
    ],
    cache_ttl=300,
    cache_invalidation_tables={
        "bills", "bill_cosponsors", "bill_actions",
        "cbo_publications", "committee_reports", ...
    },
)
```

The plan registry is the single source of truth for what's in each
dossier. Adding a new dossier is a known motion: register the plan,
wire the route, add the cache key.
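
One possible shape for the plan types and the registry is sketched below, using plain dataclasses. The field names mirror the bill plan above. `DOSSIER_PLANS` and `register_plan` are hypothetical names, and the duplicate check is an assumption about how the registry should behave.

```python
from dataclasses import dataclass, field


@dataclass
class DossierSection:
    name: str
    table: str
    join_via: str
    limit: int
    fieldset: str = "card"
    as_singleton: bool = False


@dataclass
class DossierPlan:
    resource: str
    id_field: str
    sections: list = field(default_factory=list)
    cache_ttl: int = 300
    cache_invalidation_tables: set = field(default_factory=set)


# Hypothetical registry: one plan per resource name.
DOSSIER_PLANS: dict = {}


def register_plan(plan: DossierPlan) -> DossierPlan:
    """Add a plan to the registry, refusing silent overwrites."""
    if plan.resource in DOSSIER_PLANS:
        raise ValueError(f"dossier plan already registered for {plan.resource!r}")
    DOSSIER_PLANS[plan.resource] = plan
    return plan


bill_plan = register_plan(DossierPlan(
    resource="bills",
    id_field="bill_id",
    sections=[DossierSection("cosponsors", "bill_cosponsors", "bill_id", limit=50)],
    cache_invalidation_tables={"bills", "bill_cosponsors"},
))
```

With a registry like this, the route handler and the cache can both look up limits and invalidation tables from the plan instead of hard-coding them.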

## 4. Caching design

In-process LRU cache with a TTL, e.g. cachetools.TTLCache
(functools.lru_cache alone has no TTL and no per-key eviction, so it
would need a wrapper). Key is (resource, id, congress, fields).
TTL 300 seconds.

Invalidation: each DossierPlan declares
cache_invalidation_tables. The ingester layer writes an
ingestion_events row when any of those tables changes. On dossier
read, the cache checks the latest ingestion_event.timestamp for
the relevant tables; if newer than the cached entry's
cached_at, evict.

Trade-off: every dossier read does a small lookup against
ingestion_events. Cheap (indexed by (table, timestamp)),
bounded.
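
The read-time invalidation check can be sketched with an `OrderedDict`-based LRU, as a hand-rolled stand-in for a library cache. `DossierCache` is a hypothetical class. `latest_event_ts` stands in for the `ingestion_events` lookup, and the injectable `clock` is there only to make the sketch testable.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


class DossierCache:
    """LRU + TTL cache that evicts entries older than the latest ingestion event."""

    def __init__(self, latest_event_ts: Callable[[frozenset], float],
                 ttl: int = 300, maxsize: int = 1024,
                 clock: Callable[[], float] = time.time):
        self._latest_event_ts = latest_event_ts  # hypothetical ingestion_events lookup
        self._ttl, self._maxsize, self._clock = ttl, maxsize, clock
        self._entries: OrderedDict = OrderedDict()  # key -> (cached_at, value)

    def get(self, key: tuple, tables: frozenset) -> Optional[dict]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        cached_at, value = entry
        age = self._clock() - cached_at
        # Stale if the TTL elapsed or any underlying table changed after caching.
        if age >= self._ttl or self._latest_event_ts(tables) > cached_at:
            del self._entries[key]
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return {**value, "cache": {"hit": True, "age_seconds": int(age), "ttl": self._ttl}}

    def put(self, key: tuple, value: dict) -> None:
        self._entries[key] = (self._clock(), value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._maxsize:
            self._entries.popitem(last=False)  # drop least recently used
```

On a miss, the caller would build the dossier, attach `cache: {hit: false, age_seconds: 0, ttl: 300}`, and call `put` with the `(resource, id, congress, fields)` key.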

## 5. Performance — why nested fetches, not one big JOIN

A 12-section bill dossier as one SQL query would have monstrous
fan-out. Instead, the service runs each section as its own
parameterized query, in parallel where possible:

```python
import asyncio

async def build_bill_dossier(bill_id: str, fields: list[str]) -> dict:
    # Each fetch_* coroutine runs one bounded, parameterized section query.
    sections = await asyncio.gather(
        fetch_bill(bill_id),
        fetch_cosponsors(bill_id, limit=50),
        fetch_actions(bill_id, limit=20),
        fetch_cbo(bill_id),
        ...
    )
    return assemble_dossier(sections)
```

SQLite + WAL handles concurrent reads from the same connection pool.
Each section is bounded; total latency is the slowest section, not
the sum. p95 < 500 ms is the budget.
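
A single section fetch might look like the following sketch: one bounded query plus one count, both parameterized, against an in-memory SQLite database. It is synchronous for brevity. The schema (an `id` column on `bill_cosponsors`) and the `fetch_section` helper are assumptions made for illustration.

```python
import sqlite3


def fetch_section(conn: sqlite3.Connection, table: str, join_via: str,
                  record_id: str, limit: int) -> dict:
    """Run one bounded section query plus a count, instead of one wide JOIN."""
    # table and join_via come from a trusted plan, never from user input;
    # only the id and the limit are bound as query parameters.
    rows = conn.execute(
        f"SELECT id FROM {table} WHERE {join_via} = ? ORDER BY id LIMIT ?",
        (record_id, limit),
    ).fetchall()
    (total,) = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {join_via} = ?", (record_id,)
    ).fetchone()
    return {
        "data": [{"id": row[0]} for row in rows],
        "total": total,
        "truncated": total > limit,
    }


# Seed a hypothetical table with 60 cosponsors for one bill.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bill_cosponsors (id TEXT, bill_id TEXT)")
conn.executemany(
    "INSERT INTO bill_cosponsors VALUES (?, ?)",
    [(f"L{i:03d}", "hr:119:1") for i in range(60)],
)
section = fetch_section(conn, "bill_cosponsors", "bill_id", "hr:119:1", limit=50)
```

Because every section touches at most `limit` rows plus one count, memory stays bounded no matter how many cosponsors or actions a bill has.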

## 6. Implementation surface

```
shared/josh_substrate/
dossiers/
__init__.py
plans/
bill.py
legislator.py
committee.py
public_law.py
assembler.py # parallel section fetcher
cache.py # LRU + TTL + invalidation

josh-core/josh_core/
routers/
bills.py # extended: + /dossier
legislators.py # extended
committees.py # extended
public_laws.py # extended
tests/
test_dossiers.py # determiner
```

Per-router change is one new route per dossier; the heavy lifting is
in the shared assembler + plans.

Tasks

0 of 14 done.

Changelog

2026-05-13T15:00:00Z planned→planned New spec authored after the 64-query coverage analysis surfaced 11 cross-source fan-out questions (17%) where an agent would otherwise need to orchestrate 5+ tool calls. Locked four v1 dossiers (bill, legislator, committee, public-law) as curated server-side joins — not a generic graph endpoint. DossierPlan registry analogous to AggregationPlan; in-process LRU+TTL cache with table-level invalidation via ingestion_events hooks; parallel section fetching via asyncio; bounded per-section limits with `more_url` for truncated lists. MCP Class F tools (one per dossier shape) wrap the REST endpoints.