surface · planned · p0

REST API — dossiers

rest-api-dossiers · updated 2026-05-13T15:00:00Z · owner rritz

Cross-source fan-out from one ID is a high-volume agent shape. The
64-query coverage analysis (https://docs.usejosh.com/operations/query-coverage/)
surfaced 11 questions (17%) that need this — e.g.:

- "For HR 1, pull bill text, CBO score, every committee report,
witness lists of related hearings, and the SAP if any."
- "For every senator on the Banking Committee in the 119th, show
their roll-call vote on the CRA resolutions, plus LDA filings
naming them from financial-services clients."
- "Show every Senator who voted yes on the NDAA but whose floor
speech that week criticized provisions of it."

An agent can orchestrate 5+ tool calls per question, but in chains of
five or more the LLM regularly drops a call, which makes multi-call
orchestration the wrong shape for an agent surface. Dossier endpoints
collapse the most common fan-outs into one envelope, computed
server-side via FK joins. The round trip is preserved: every nested
record carries its id, so the agent can follow it to the full record
via the resource endpoint if needed.

Not a replacement for ad-hoc joins. Dossiers are the curated
high-traffic shapes (bill → everything, legislator → everything,
committee → everything). The arbitrary "give me X about Y filtered
by Z" question still routes through rest-api-resource-endpoints +
rest-api-search + rest-api-aggregations.

As a Hill staffer prepping the boss, I want to call `GET /v1/bills/hr:119:1/dossier` and get bill text, sponsor info, cosponsors, actions, committee referrals, related CRS reports, the CBO estimate, related committee reports, hearings where it was discussed, and the SAP if any — all in one envelope so that I don't manage 8 tool calls and worry about which one to drop when the agent times out.

As a journalist building a profile of Senator X, I want to call `GET /v1/legislators/S000033/dossier?congress=119` and get terms, current committees, sponsored bills count, leadership roles, district offices, and recent floor speeches so that I have the senator's full activity surface for the current congress in one fetch.

As an academic researcher studying committee activity, I want to call `GET /v1/committees/HSWM/dossier?congress=119` and get full roster, subcommittees, recent hearings, related bills, and recent reports so that I can audit the committee's session-by-session activity without orchestrating 5 calls.

As a lobbyist tracking a public law's downstream implementation, I want to call `GET /v1/public-laws/pl:119:1/dossier` and get the originating bill, amended CFR sections, and implementing Federal Register rules so that I see the full regulatory cascade from a law in one fetch.

  1. When a client requests `GET /v1/bills/{bill_id}/dossier`, the system shall return one JSON envelope including the bill record, sponsor (legislator card), cosponsors (list), actions (list), committee_referrals (list), related_crs_reports (list), cbo_estimate (singleton or null), committee_reports (list), hearings (list), sap (singleton or null), roll_call_votes (list), and related_amendments (list).
  2. When a client requests `GET /v1/legislators/{bioguide_id}/dossier?congress=<n>`, the system shall return the legislator record, terms (full), current_committees (list with role), sponsored_bills_count (scalar), cosponsored_bills_count (scalar), leadership_roles (list), district_offices (list), and recent_floor_speeches (list of the 10 most recent Congressional Record granules).
  3. When a client requests `GET /v1/committees/{committee_id}/dossier?congress=<n>`, the system shall return the committee record, current members (list with role), subcommittees (list), recent_hearings (list of 20 most recent), related_bills (top 50 most recent by introduction date), and recent_reports (list of 10 most recent committee reports).
  4. When a client requests `GET /v1/public-laws/{id}/dossier`, the system shall return the public law record, originating_bill (singleton), amended_cfr_sections (list), implementing_rules (list of related Federal Register documents).
  5. Where a dossier nests a list, that list shall be bounded to the documented default (per-endpoint, between 10 and 50 items) and shall expose a `more_url` pointer to the canonical `rest-api-resource-endpoints` sub-resource list when more is available.
  6. Where any nested list is truncated, the response shall include `truncated: true` and `total: <int>` for that list.
  7. When a dossier returns a nested record, it shall use the `card` fieldset (per `rest-api-conventions` §9), NOT the full record. The `id` shall always be present so the caller can click through to `GET /v1/<resource>/{id}` for the full record.
  8. Where a nested singleton (sponsor, cbo_estimate, sap) is present, it shall also be a `card` fieldset, not full.
  9. When any dossier returns a nested record, that record's `id` shall be a valid input to its matching resource endpoint (round-trip with `rest-api-resource-endpoints`).
  10. When a dossier is requested against the v1 substrate, the system shall return within 500 ms p95 (same budget as `rest-api-search`).
  11. Where a dossier's underlying joins would exceed an internal row-touch budget (estimated via EXPLAIN), the system shall execute each nested fetch independently with bounded LIMIT to keep memory under control — never one giant join.
  12. Where a dossier is requested twice for the same ID within the 5-minute cache TTL, the second request shall return from an in-memory LRU cache. The response shall include `cache: {hit: bool, age_seconds: <int>, ttl: 300}` so callers can audit freshness.
  13. When the cache is invalidated by an ingester writing to one of the underlying tables (via `ingestion_events` table hook), the next dossier request for that ID shall miss cache and recompute.
  14. Where this spec defines an endpoint, response shape (envelope, error format, datetime serialization, status codes, rate-limit headers) shall conform to `rest-api-conventions` without overrides.
Kind

test_file

Path

josh-core/tests/test_dossiers.py

Runner

uv run pytest josh-core/tests/test_dossiers.py -v

Contract test against the live FastAPI app with seeded fixtures for each dossier shape:

- Bill dossier: HR 1 (119th) with seeded cosponsors, actions, committee referrals, a CRS report, a CBO estimate, two committee reports, three hearings, a SAP, and roll-call votes; assert all sections present with correct cardinalities and `card` shape.
- Legislator dossier: a seeded senator with terms, committees, bills, leadership, district offices, and floor speeches.
- Committee dossier: HSWM with roster, subcommittees, hearings, bills, reports.
- Public-law dossier: a seeded PL with originating bill, CFR sections, FR rules.
- Bounded fan-out: a seeded bill with > 50 cosponsors returns `truncated: true, total: N` + a valid `more_url`.
- Round-trip: every nested `id` is fetchable via the matching resource endpoint.
- Cache: two calls within 300s; first miss, second hit with `age_seconds` between 0 and 300.
- Cache invalidation: ingester writes to bill_cosponsors table → next call misses cache.
- Latency: p95 < 500 ms against v1 fixtures.

Determiner currently fails because dossier routers don't exist and the cache layer is unimplemented.

  • Cache backend: in-process Python dict with LRU + TTL is the simplest. Multi-instance deployment would need Redis, but at v1 there's one josh-core container. Lean in-process.
  • Cache invalidation. Simplest path: write `ingestion_events` rows from ingester completion hooks, invalidate by source-table on dossier read. Cleaner but more work: pub/sub with explicit topics. Lean simple.
  • Nested-list defaults: 50 cosponsors, 20 actions, 10 floor speeches, etc. — pinned per dossier or globally? Lean per-dossier so each shape can tune.
  • Adding a new dossier endpoint when a new aggregation target appears. Should there be a public `DossierPlan` registry analogous to the AggregationPlan? Lean yes — each dossier is small but explicit, and the registry makes adding a new one a known motion.
  • Should dossiers accept a `?fields=` projection to exclude nested sections (e.g., `?fields=bill,sap,cosponsors` to skip everything else)? Useful for bandwidth-sensitive callers; defaults already cap size. Lean yes — same `?fields=` semantics from rest-api-conventions §9.
  • Arbitrary cross-source joins. Each dossier is a curated shape; ad-hoc fan-out routes through `rest-api-resource-endpoints` sub-resource lists plus agent orchestration.
  • Write-side dossiers (uploading a dossier). Substrate is read-only at v1.
  • Dossiers over registry sources only (e.g., `GET /v1/legislators/{id}/dossier` is in scope, but `GET /v1/staff-directories/{id}/dossier` is not — staffers don't have a big-enough relational footprint to justify a dossier shape).
  • Real-time push (websocket) updates of a dossier. Pull-only at v1.
  • Personalized dossiers (e.g., 'show me only the sections I've subscribed to'). User-state feature, not substrate.
  • Sub-dossier composition (calling one dossier from inside another). Each dossier is flat.
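
The `?fields=` projection floated among the open questions above could be sketched as a filter over the assembled envelope. This is only an illustration: the function name `project_dossier`, the choice to always keep the `cache` block, and the rejection of unknown section names are assumptions, since the actual `?fields=` semantics live in rest-api-conventions §9 and are not reproduced here.

```python
def project_dossier(dossier, fields_param, always_keep=("cache",)):
    """Return a copy of `dossier` restricted to the requested top-level sections.

    `fields_param` is the raw query-string value, e.g. "bill,sap,cosponsors".
    An empty or missing value means "no projection" (assumed default).
    """
    if not fields_param:
        return dict(dossier)
    requested = [name.strip() for name in fields_param.split(",") if name.strip()]
    unknown = [name for name in requested if name not in dossier]
    if unknown:
        # Hypothetical policy: reject unknown sections instead of ignoring them.
        raise ValueError("unknown dossier sections: " + ", ".join(unknown))
    keep = set(requested) | set(always_keep)
    # Preserve the envelope's original key order.
    return {key: value for key, value in dossier.items() if key in keep}


envelope = {
    "bill": {"id": "hr:119:1"},
    "cosponsors": {"data": [], "total": 0},
    "sap": None,
    "hearings": {"data": [], "total": 0},
    "cache": {"hit": False, "age_seconds": 0, "ttl": 300},
}
slim = project_dossier(envelope, "bill,sap,cosponsors")
```

Projecting after assembly is the simplest form; skipping the excluded section queries altogether would save more work but needs the projection threaded into the fetcher.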

## 1. The four v1 dossiers

Picked from the coverage analysis as the most-requested fan-out
shapes. Each is a curated server-side join, NOT a generic graph
endpoint.

### Bill dossier — GET /v1/bills/{bill_id}/dossier

```json
{
  "bill": { id, title, status, sponsor, ... },
  "cosponsors": { data: [card, ...], total, truncated, more_url },
  "actions": { data: [card, ...], total, truncated, more_url },
  "committee_referrals": [card, ...],
  "related_crs_reports": { data: [card, ...], total, more_url },
  "cbo_estimate": card | null,
  "committee_reports": { data: [card, ...], total, more_url },
  "hearings": { data: [card, ...], total, more_url },
  "sap": card | null,
  "roll_call_votes": [card, ...],
  "related_amendments": { data: [card, ...], total, more_url },
  "cache": { hit: bool, age_seconds: int, ttl: 300 }
}
```

Default limits: cosponsors 50, actions 20, related_crs_reports 10,
committee_reports 10, hearings 10, roll_call_votes 10, related_amendments 20.
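
A bounded section such as `cosponsors` can be produced by one `LIMIT` query plus one `COUNT(*)`. The sketch below runs against an in-memory SQLite database. The column names (`id`, `name`), the card contents, the helper names, and the `more_url` path are assumptions for illustration, not the actual schema or route.

```python
import sqlite3


def fetch_cosponsors_section(conn, bill_id, limit, more_url):
    """Build a {data, total, truncated, more_url} block for one bill."""
    total = conn.execute(
        "SELECT COUNT(*) FROM bill_cosponsors WHERE bill_id = ?", (bill_id,)
    ).fetchone()[0]
    rows = conn.execute(
        "SELECT id, name FROM bill_cosponsors WHERE bill_id = ? ORDER BY id LIMIT ?",
        (bill_id, limit),
    ).fetchall()
    truncated = total > len(rows)
    return {
        "data": [{"id": row[0], "name": row[1]} for row in rows],  # hypothetical card
        "total": total,
        "truncated": truncated,
        # Only point at the sub-resource list when there is more to fetch.
        "more_url": more_url if truncated else None,
    }


def demo_connection(n_cosponsors):
    """Seed an in-memory table with n fake cosponsors for bill hr:119:1."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bill_cosponsors (id TEXT, bill_id TEXT, name TEXT)")
    conn.executemany(
        "INSERT INTO bill_cosponsors VALUES (?, ?, ?)",
        [(f"M{i:06d}", "hr:119:1", f"Member {i}") for i in range(n_cosponsors)],
    )
    return conn


MORE_URL = "/v1/bills/hr:119:1/cosponsors"  # assumed sub-resource path
section = fetch_cosponsors_section(demo_connection(60), "hr:119:1", 50, MORE_URL)
```

With 60 seeded rows and a limit of 50, the block carries 50 cards, `total` 60, and `truncated` true; with fewer rows than the limit, `more_url` stays null.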

### Legislator dossier — GET /v1/legislators/{bioguide_id}/dossier?congress=<n>

Defaults: 10 most recent floor speeches, current congress only,
bills-count is scalar (not paginated list — use GET /v1/legislators/{id}/sponsored-bills to enumerate).

### Committee dossier — GET /v1/committees/{committee_id}/dossier?congress=<n>

Defaults: full member roster (capped 100), 20 most recent hearings,
top 50 recent bills by introduction date, 10 most recent reports.

### Public-law dossier — GET /v1/public-laws/{id}/dossier

Defaults: originating bill (singleton), CFR sections amended (full
list — typically small), Federal Register rules citing the public
law (top 20 by published_at).

## 2. The MCP wrapper

mcp-server exposes Class F dossier tools:

- get_bill_dossier(bill_id)
- get_legislator_dossier(bioguide_id, congress?)
- get_committee_dossier(committee_id, congress?)
- get_public_law_dossier(public_law_id)

Each tool description specifies what's in the envelope so the agent
knows what fields to expect without trial-and-error.
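
Since each tool wraps exactly one REST endpoint, the wrapper can be little more than a lookup table. The sketch below maps the four tool names to the paths defined in section 1. The table name `TOOL_ROUTES` and the helper `dossier_path` are hypothetical, and rejecting `congress` on tools that do not list it is an assumed policy.

```python
from urllib.parse import quote, urlencode

# tool name -> (path template, whether the tool accepts `congress`)
TOOL_ROUTES = {
    "get_bill_dossier": ("/v1/bills/{id}/dossier", False),
    "get_legislator_dossier": ("/v1/legislators/{id}/dossier", True),
    "get_committee_dossier": ("/v1/committees/{id}/dossier", True),
    "get_public_law_dossier": ("/v1/public-laws/{id}/dossier", False),
}


def dossier_path(tool, resource_id, congress=None):
    """Translate an MCP tool call into the REST path it wraps."""
    template, accepts_congress = TOOL_ROUTES[tool]
    # IDs such as hr:119:1 contain colons, which are kept literal here.
    path = template.format(id=quote(resource_id, safe=":"))
    if congress is not None:
        if not accepts_congress:
            raise ValueError(f"{tool} does not take a congress argument")
        path += "?" + urlencode({"congress": congress})
    return path


bill_path = dossier_path("get_bill_dossier", "hr:119:1")
legislator_path = dossier_path("get_legislator_dossier", "S000033", congress=119)
```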

## 3. Why a registry, not ad-hoc

Each dossier has a DossierPlan analogous to the AggregationPlan
in rest-api-aggregations:

```python
# shared/josh_substrate/dossiers/plans/bill.py
PLAN = DossierPlan(
    resource="bills",
    id_field="bill_id",
    sections=[
        DossierSection(name="cosponsors", table="bill_cosponsors",
                       join_via="bill_id", limit=50, fieldset="card"),
        DossierSection(name="actions", table="bill_actions",
                       join_via="bill_id", limit=20, fieldset="card"),
        DossierSection(name="cbo_estimate", table="cbo_publications",
                       join_via="related_bill_id", limit=1,
                       fieldset="card", as_singleton=True),
        ...
    ],
    cache_ttl=300,
    cache_invalidation_tables={
        "bills", "bill_cosponsors", "bill_actions",
        "cbo_publications", "committee_reports",
        ...
    },
)
```

The plan registry is the single source of truth for what's in each
dossier. Adding a new dossier is a known motion: register the plan,
wire the route, add the cache key.
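
One way the plan types and the registry could fit together is sketched below with frozen dataclasses. The field names follow the bill plan above; the `register` function, the `REGISTRY` dict, and the check that every section table also appears in `cache_invalidation_tables` are assumptions added for illustration, not the actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DossierSection:
    name: str
    table: str
    join_via: str
    limit: int
    fieldset: str = "card"
    as_singleton: bool = False


@dataclass(frozen=True)
class DossierPlan:
    resource: str
    id_field: str
    sections: tuple
    cache_ttl: int = 300
    cache_invalidation_tables: frozenset = frozenset()


REGISTRY = {}  # resource name -> DossierPlan


def register(plan):
    """Add a plan to the registry, refusing duplicates and inconsistent plans."""
    if plan.resource in REGISTRY:
        raise ValueError(f"dossier plan already registered for {plan.resource!r}")
    names = [section.name for section in plan.sections]
    if len(names) != len(set(names)):
        raise ValueError("duplicate section name in plan")
    # Assumed sanity check: a section whose table never invalidates the
    # cache could be served stale for the full TTL.
    missing = {section.table for section in plan.sections} - plan.cache_invalidation_tables
    if missing:
        raise ValueError(f"tables missing from cache_invalidation_tables: {sorted(missing)}")
    REGISTRY[plan.resource] = plan
    return plan


BILL_PLAN = register(DossierPlan(
    resource="bills",
    id_field="bill_id",
    sections=(
        DossierSection("cosponsors", "bill_cosponsors", "bill_id", limit=50),
        DossierSection("cbo_estimate", "cbo_publications", "related_bill_id",
                       limit=1, as_singleton=True),
    ),
    cache_invalidation_tables=frozenset({"bills", "bill_cosponsors", "cbo_publications"}),
))
```

Validating at registration time means a malformed plan fails at import rather than on the first request.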

## 4. Caching design

In-process LRU cache with a TTL (for example cachetools.TTLCache;
functools.lru_cache on its own has no TTL and no per-key eviction, so
it would need a wrapper). Key is (resource, id, congress, fields).
TTL 300 seconds.

Invalidation: each DossierPlan declares
cache_invalidation_tables. The ingester layer writes an
ingestion_events row when any of those tables changes. On dossier
read, the cache checks the latest ingestion_event.timestamp for
the relevant tables; if newer than the cached entry's
cached_at, evict.

Trade-off: every dossier read does a small lookup against
ingestion_events. Cheap (indexed by (table, timestamp)),
bounded.
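
The read-time invalidation check can be sketched with the standard library alone. The class below is a minimal illustration, not the actual `cache.py`: the name `DossierCache`, the injectable `clock`, and the caller passing in the newest `ingestion_events` timestamp for the plan's tables are all assumptions.

```python
import time
from collections import OrderedDict


class DossierCache:
    """In-process LRU + TTL cache with timestamp-based invalidation."""

    def __init__(self, maxsize=1024, ttl=300, clock=time.time):
        self._entries = OrderedDict()  # key -> (cached_at, value)
        self.maxsize = maxsize
        self.ttl = ttl
        self._clock = clock

    def get(self, key, latest_event_at=None):
        """Return (value, age_seconds) on a hit, or None on a miss.

        `latest_event_at` is the newest ingestion event timestamp for the
        tables this dossier depends on, or None if there are no events.
        """
        entry = self._entries.get(key)
        if entry is None:
            return None
        cached_at, value = entry
        age = self._clock() - cached_at
        expired = age >= self.ttl
        stale = latest_event_at is not None and latest_event_at > cached_at
        if expired or stale:
            del self._entries[key]  # evict so the caller recomputes
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value, int(age)

    def put(self, key, value):
        self._entries[key] = (self._clock(), value)
        self._entries.move_to_end(key)
        while len(self._entries) > self.maxsize:
            self._entries.popitem(last=False)  # drop least recently used
```

The key would be a hashable tuple such as `("bills", "hr:119:1", None, ("bill", "sap"))`, and the `age_seconds` returned on a hit feeds the `cache` block in the response.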

## 5. Performance — why nested fetches, not one big JOIN

A 12-section bill dossier as one SQL query would have monstrous
fan-out. Instead, the service runs each section as its own
parameterized query, in parallel where possible:

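```python
import asyncio

async def build_bill_dossier(bill_id: str, fields: list[str]) -> dict:
    # Each section is its own bounded query; gather runs them concurrently
    # and returns results in the order the fetchers were passed.
    sections = await asyncio.gather(
        fetch_bill(bill_id),
        fetch_cosponsors(bill_id, limit=50),
        fetch_actions(bill_id, limit=20),
        fetch_cbo(bill_id),
        ...
    )
    return assemble_dossier(sections)
```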

SQLite in WAL mode lets concurrent readers proceed without blocking on
a writer, so the section queries can run side by side on pooled
connections. Each section is bounded; total latency tracks the slowest
section, not the sum. p95 < 500 ms is the budget.
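
The "slowest section, not the sum" behaviour can be seen in a small timing demo. The fetchers here are `asyncio.sleep` stand-ins with made-up delays, not real queries. Note that the standard `sqlite3` module blocks, so real section fetchers would need an async driver or thread offloading (for example `asyncio.to_thread`) for `gather` to actually overlap them.

```python
import asyncio
import time


async def fake_section(name, delay, payload):
    """Stand-in for one bounded section query that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return name, payload


async def build_demo_dossier():
    # Delays sum to 0.35 s, but the sections overlap, so the wall-clock
    # time is governed by the slowest one (0.2 s).
    results = await asyncio.gather(
        fake_section("bill", 0.05, {"id": "hr:119:1"}),
        fake_section("cosponsors", 0.20, []),
        fake_section("actions", 0.10, []),
    )
    # gather preserves argument order, so the envelope order is stable.
    return dict(results)


start = time.perf_counter()
dossier = asyncio.run(build_demo_dossier())
elapsed = time.perf_counter() - start
```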

## 6. Implementation surface

```
shared/josh_substrate/
  dossiers/
    __init__.py
    plans/
      bill.py
      legislator.py
      committee.py
      public_law.py
    assembler.py         # parallel section fetcher
    cache.py             # LRU + TTL + invalidation

josh-core/
  josh_core/
    routers/
      bills.py           # extended: + /dossier
      legislators.py     # extended
      committees.py      # extended
      public_laws.py     # extended
  tests/
    test_dossiers.py     # determiner
```

Per-router change is one new route per dossier; the heavy lifting is
in the shared assembler + plans.

0 of 14 done.

  • t1 DossierPlan registry + four v1 plans (bill, legislator, committee, public-law)
  • t2 Parallel section fetcher (asyncio.gather), bounded per-section, card fieldset
  • t3 Bill dossier router + envelope shape (12 sections, all bounded)
  • t4 Legislator dossier router + congress filter
  • t5 Committee dossier router + congress filter
  • t6 Public-law dossier router + CFR/FR cascade
  • t7 In-memory LRU + TTL cache; cache block in every response
  • t8 ingestion_events table + ingester hook for cache invalidation
  • t9 Truncation surface (more_url per truncated section) routes to rest-api-resource-endpoints sub-resource lists
  • t10 Round-trip property: every nested `id` is fetchable via /v1/<resource>/{id}
  • t11 ?fields= projection to exclude sections
  • t12 Latency budget: p95 < 500 ms against v1 fixtures (asserted)
  • t13 test_dossiers.py covers every AC + cache hit/miss + invalidation
  • t14 OpenAPI documents each dossier shape with all sections
  • 2026-05-13T15:00:00Z · planned · New spec authored after the 64-query coverage analysis surfaced 11 cross-source fan-out questions (17%) where an agent would otherwise need to orchestrate 5+ tool calls. Locked four v1 dossiers (bill, legislator, committee, public-law) as curated server-side joins, not a generic graph endpoint. DossierPlan registry analogous to AggregationPlan; in-process LRU+TTL cache with table-level invalidation via ingestion_events hooks; parallel section fetching via asyncio; bounded per-section limits with `more_url` for truncated lists. MCP Class F tools (one per dossier shape) wrap the REST endpoints.

docs/spec/rest-api-dossiers.html · generated by bin/build-spec.py