REST API — dossiers
Why
Cross-source fan-out from one ID is a high-volume agent shape. The
64-query coverage analysis (https://docs.usejosh.com/operations/query-coverage/)
surfaced 11 questions (17%) that need this — e.g.:
- "For HR 1, pull bill text, CBO score, every committee report,
witness lists of related hearings, and the SAP if any."
- "For every senator on the Banking Committee in the 119th, show
their roll-call vote on the CRA resolutions, plus LDA filings
naming them from financial-services clients."
- "Show every Senator who voted yes on the NDAA but whose floor
speech that week criticized provisions of it."
An agent can orchestrate 5+ tool calls per question, but at length 5
the LLM regularly drops one — wrong shape for an agent surface.
Dossier endpoints collapse the most common fan-outs into one
envelope, computed server-side via FK joins. Round-trip preserved:
every nested record carries its id, and the agent can click
through to the full record via the resource endpoint if needed.
Not a replacement for ad-hoc joins. Dossiers are the curated
high-traffic shapes (bill → everything, legislator → everything,
committee → everything). The arbitrary "give me X about Y filtered
by Z" question still routes through rest-api-resource-endpoints + rest-api-search + rest-api-aggregations.
User stories
As a Hill staffer prepping the boss, I want to call `GET /v1/bills/hr:119:1/dossier` and get bill text, sponsor info, cosponsors, actions, committee referrals, related CRS reports, the CBO estimate, related committee reports, hearings where it was discussed, and the SAP if any — all in one envelope so that I don't manage 8 tool calls and worry about which one to drop when the agent times out.
As a journalist building a profile of Senator X, I want to call `GET /v1/legislators/S000033/dossier?congress=119` and get terms, current committees, sponsored bills count, leadership roles, district offices, and recent floor speeches so that I have the senator's full activity surface for the current congress in one fetch.
As an academic researcher studying committee activity, I want to call `GET /v1/committees/HSWM/dossier?congress=119` and get full roster, subcommittees, recent hearings, related bills, and recent reports so that I can audit the committee's session-by-session activity without orchestrating 5 calls.
As a lobbyist tracking a public law's downstream implementation, I want to call `GET /v1/public-laws/pl:119:1/dossier` and get the originating bill, amended CFR sections, and implementing Federal Register rules so that I see the full regulatory cascade from a law in one fetch.
Acceptance criteria (EARS)
- When a client requests `GET /v1/bills/{bill_id}/dossier`, the system shall return one JSON envelope including the bill record, sponsor (legislator card), cosponsors (list), actions (list), committee_referrals (list), related_crs_reports (list), cbo_estimate (singleton or null), committee_reports (list), hearings (list), sap (singleton or null), roll_call_votes (list), and related_amendments (list).
- When a client requests `GET /v1/legislators/{bioguide_id}/dossier?congress=<n>`, the system shall return the legislator record, terms (full), current_committees (list with role), sponsored_bills_count (scalar), cosponsored_bills_count (scalar), leadership_roles (list), district_offices (list), and recent_floor_speeches (list of 10 most recent congressional record granules).
- When a client requests `GET /v1/committees/{committee_id}/dossier?congress=<n>`, the system shall return the committee record, current members (list with role), subcommittees (list), recent_hearings (list of 20 most recent), related_bills (top 50 most recent by introduction date), and recent_reports (list of 10 most recent committee reports).
- When a client requests `GET /v1/public-laws/{id}/dossier`, the system shall return the public law record, originating_bill (singleton), amended_cfr_sections (list), implementing_rules (list of related Federal Register documents).
- Where a dossier nests a list, that list shall be bounded to the documented default (per-endpoint, between 10 and 50 items) and shall expose a `more_url` pointer to the canonical `rest-api-resource-endpoints` sub-resource list when more is available.
- Where any nested list is truncated, the response shall include `truncated: true` and `total: <int>` for that list.
- When a dossier returns a nested record, it shall use the `card` fieldset (per `rest-api-conventions` §9), NOT the full record. The `id` shall always be present so the caller can click through to `GET /v1/<resource>/{id}` for the full record.
- Where a nested singleton (sponsor, cbo_estimate, sap) is present, it shall also be a `card` fieldset, not full.
- When any dossier returns a nested record, that record's `id` shall be a valid input to its matching resource endpoint (round-trip with `rest-api-resource-endpoints`).
- When a dossier is requested against the v1 substrate, the system shall return within 500 ms p95 (same budget as `rest-api-search`).
- Where a dossier's underlying joins would exceed an internal row-touch budget (estimated via EXPLAIN), the system shall execute each nested fetch independently with bounded LIMIT to keep memory under control — never one giant join.
- Where a dossier is requested twice for the same ID within the 5-minute cache TTL, the second request shall return from an in-memory LRU cache. The response shall include `cache: {hit: bool, age_seconds: <int>, ttl: 300}` so callers can audit freshness.
- When the cache is invalidated by an ingester writing to one of the underlying tables (via `ingestion_events` table hook), the next dossier request for that ID shall miss cache and recompute.
- Where this spec defines an endpoint, response shape (envelope, error format, datetime serialization, status codes, rate-limit headers) shall conform to `rest-api-conventions` without overrides.
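The bounded-list criteria above (`truncated`, `total`, `more_url`) can be sketched as a small helper. This is a minimal illustration, not the service's code: the function name `bound_section` and the example `more_url` path are assumptions, and the sketch omits `more_url` when nothing was cut off.

```python
from typing import Any, Dict, List


def bound_section(rows: List[Dict[str, Any]], total: int, limit: int,
                  more_url: str) -> Dict[str, Any]:
    """Wrap already-fetched card rows in the bounded-list envelope.

    `bound_section` is a hypothetical helper name, not taken from the spec.
    """
    truncated = total > limit
    return {
        "data": rows[:limit],
        "total": total,
        "truncated": truncated,
        # Only point at the sub-resource list when there is more to fetch.
        "more_url": more_url if truncated else None,
    }


# 60 seeded cosponsors against the documented default limit of 50.
cosponsors = [{"id": f"X{i:06d}"} for i in range(60)]
section = bound_section(
    cosponsors,
    total=60,
    limit=50,
    more_url="/v1/bills/hr:119:1/cosponsors",  # assumed sub-resource path
)
```

Here `section["data"]` holds 50 cards, `truncated` is true and `total` is 60, so a caller can tell how much was left out before following `more_url`.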
Success determiner
Contract test against the live FastAPI app with seeded fixtures for each dossier shape:
- Bill dossier: HR 1 (119th) with seeded cosponsors, actions, committee referrals, a CRS report, a CBO estimate, two committee reports, three hearings, a SAP, and roll-call votes; assert all sections present with correct cardinalities and `card` shape.
- Legislator dossier: a seeded senator with terms, committees, bills, leadership, district offices, and floor speeches.
- Committee dossier: HSWM with roster, subcommittees, hearings, bills, reports.
- Public-law dossier: a seeded PL with originating bill, CFR sections, FR rules.
- Bounded fan-out: a seeded bill with > 50 cosponsors returns `truncated: true, total: N` + a valid `more_url`.
- Round-trip: every nested `id` is fetchable via the matching resource endpoint.
- Cache: two calls within 300s; first miss, second hit with `age_seconds` between 0 and 300.
- Cache invalidation: ingester writes to bill_cosponsors table → next call misses cache.
- Latency: p95 < 500 ms against v1 fixtures.

Determiner currently fails because dossier routers don't exist and the cache layer is unimplemented.
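The round-trip check could start by walking an envelope and turning every nested `id` into a resource URL to fetch. The sketch below is illustrative only: the `SECTION_RESOURCE` mapping and the resource names in it are hypothetical, and a real test would derive them from the dossier plans and then issue the requests.

```python
from typing import Any, Dict, Iterator, List, Tuple

# Hypothetical section -> resource mapping, for illustration only.
SECTION_RESOURCE = {
    "cosponsors": "legislators",
    "hearings": "hearings",
    "cbo_estimate": "cbo-estimates",
}


def iter_cards(envelope: Dict[str, Any]) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (section_name, card) for every nested card in a dossier envelope."""
    for name, value in envelope.items():
        if name not in SECTION_RESOURCE or value is None:
            continue
        if isinstance(value, dict) and "data" in value:
            cards = value["data"]      # bounded list: {data, total, ...}
        elif isinstance(value, list):
            cards = value              # plain list of cards
        else:
            cards = [value]            # singleton card
        for card in cards:
            yield name, card


def round_trip_urls(envelope: Dict[str, Any]) -> List[str]:
    """Build the resource-endpoint URL each nested id should resolve at."""
    return [
        f"/v1/{SECTION_RESOURCE[name]}/{card['id']}"
        for name, card in iter_cards(envelope)
    ]
```

A contract test would then request each returned URL and assert a successful response.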
Clarifications needed
- Cache backend: in-process Python dict with LRU + TTL is the simplest. Multi-instance deployment would need Redis, but at v1 there's one josh-core container. Lean in-process.
- Cache invalidation. Simplest path: write `ingestion_events` rows from ingester completion hooks, invalidate by source-table on dossier read. Cleaner but more work: pub/sub with explicit topics. Lean simple.
- Nested-list defaults: 50 cosponsors, 20 actions, 10 floor speeches, etc. — pinned per dossier or globally? Lean per-dossier so each shape can tune.
- Adding a new dossier endpoint when a new aggregation target appears. Should there be a public `DossierPlan` registry analogous to the AggregationPlan? Lean yes — each dossier is small but explicit, and the registry makes adding a new one a known motion.
- Should dossiers accept a `?fields=` projection to exclude nested sections (e.g., `?fields=bill,sap,cosponsors` to skip everything else)? Useful for bandwidth-sensitive callers; defaults already cap size. Lean yes — same `?fields=` semantics from rest-api-conventions §9.
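If the `?fields=` projection is adopted, section filtering could be as small as the sketch below. The function name `project_sections` is hypothetical, and always retaining the `cache` block is an assumption (the acceptance criteria require it in the response); the authoritative `?fields=` semantics live in rest-api-conventions §9.

```python
from typing import Any, Dict, Optional


def project_sections(envelope: Dict[str, Any],
                     fields: Optional[str]) -> Dict[str, Any]:
    """Keep only the top-level sections named in a comma-separated ?fields= value."""
    if not fields:
        return envelope  # no projection requested: return everything
    wanted = {name.strip() for name in fields.split(",") if name.strip()}
    wanted.add("cache")  # assumption: the cache block is never projected away
    return {key: value for key, value in envelope.items() if key in wanted}
```

For example, `project_sections(envelope, "bill,sap,cosponsors")` would drop every section other than those three and `cache`.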
Out of scope
- Arbitrary cross-source joins. Each dossier is a curated shape; ad-hoc fan-out routes through `rest-api-resource-endpoints` sub-resource lists plus agent orchestration.
- Write-side dossiers (uploading a dossier). Substrate is read-only at v1.
- Dossiers over registry sources only (e.g., `GET /v1/legislators/{id}/dossier` is in scope, but `GET /v1/staff-directories/{id}/dossier` is not — staffers don't have a big-enough relational footprint to justify a dossier shape).
- Real-time push (websocket) updates of a dossier. Pull-only at v1.
- Personalized dossiers (e.g., 'show me only the sections I've subscribed to'). User-state feature, not substrate.
- Sub-dossier composition (calling one dossier from inside another). Each dossier is flat.
Dependencies
- rest-api-conventions
- rest-api-resource-endpoints
- bills-ingester
- legislators-and-committees-ingester
- committee-reports-ingester
- hearings-ingester
- crs-reports-ingester
- cbo-cost-estimates-ingester
- statements-of-administration-policy-ingester
- roll-call-votes-ingester
- public-laws-ingester
- ecfr-and-cfr-ingester
- federal-register-ingester
Plan
## 1. The four v1 dossiers
Picked from the coverage analysis as the most-requested fan-out
shapes. Each is a curated server-side join, NOT a generic graph
endpoint.
### Bill dossier — GET /v1/bills/{bill_id}/dossier
```json
{
  "bill": { id, title, status, sponsor, ... },
  "cosponsors": { data: [card, ...], total, truncated, more_url },
  "actions": { data: [card, ...], total, truncated, more_url },
  "committee_referrals": [card, ...],
  "related_crs_reports": { data: [card, ...], total, more_url },
  "cbo_estimate": card | null,
  "committee_reports": { data: [card, ...], total, more_url },
  "hearings": { data: [card, ...], total, more_url },
  "sap": card | null,
  "roll_call_votes": [card, ...],
  "related_amendments": { data: [card, ...], total, more_url },
  "cache": { hit: bool, age_seconds: int, ttl: 300 }
}
```
Default limits: cosponsors 50, actions 20, related_crs_reports 10,
committee_reports 10, hearings 10, roll_call_votes 10, related_amendments 20.
### Legislator dossier — GET /v1/legislators/{bioguide_id}/dossier?congress=<n>
Defaults: 10 most recent floor speeches, current congress only,
bills-count is scalar (not paginated list — use GET /v1/legislators/{id}/sponsored-bills to enumerate).
### Committee dossier — GET /v1/committees/{committee_id}/dossier?congress=<n>
Defaults: full member roster (capped 100), 20 most recent hearings,
top 50 recent bills by introduction date, 10 most recent reports.
### Public-law dossier — GET /v1/public-laws/{id}/dossier
Defaults: originating bill (singleton), CFR sections amended (full
list — typically small), Federal Register rules citing the public
law (top 20 by published_at).
## 2. The MCP wrapper
mcp-server exposes Class F dossier tools:
- get_bill_dossier(bill_id)
- get_legislator_dossier(bioguide_id, congress?)
- get_committee_dossier(committee_id, congress?)
- get_public_law_dossier(public_law_id)
Each tool description specifies what's in the envelope so the agent
knows what fields to expect without trial-and-error.
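One way to picture the wrapper is as a thin mapping from each tool name to its REST path. The sketch below assumes a hypothetical helper `tool_to_path` and a `TOOL_ROUTES` table; only the tool names and URL shapes come from this spec, and the actual MCP registration is not shown.

```python
from typing import Optional
from urllib.parse import quote, urlencode

# Tool name -> REST resource segment, per the four v1 dossiers.
TOOL_ROUTES = {
    "get_bill_dossier": "bills",
    "get_legislator_dossier": "legislators",
    "get_committee_dossier": "committees",
    "get_public_law_dossier": "public-laws",
}


def tool_to_path(tool: str, record_id: str,
                 congress: Optional[int] = None) -> str:
    """Translate a dossier tool call into the REST path it wraps."""
    resource = TOOL_ROUTES[tool]
    # Keep ':' unescaped so ids such as hr:119:1 stay readable in the path.
    path = f"/v1/{resource}/{quote(record_id, safe=':')}/dossier"
    if congress is not None:
        path += "?" + urlencode({"congress": congress})
    return path
```

Keeping the mapping in one table means a new dossier tool is a one-line addition on the MCP side.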
## 3. Why a registry, not ad-hoc
Each dossier has a DossierPlan analogous to the AggregationPlan
in rest-api-aggregations:
```python
# shared/josh_substrate/dossiers/plans/bill.py
PLAN = DossierPlan(
    resource="bills",
    id_field="bill_id",
    sections=[
        DossierSection(name="cosponsors", table="bill_cosponsors",
                       join_via="bill_id", limit=50, fieldset="card"),
        DossierSection(name="actions", table="bill_actions",
                       join_via="bill_id", limit=20, fieldset="card"),
        DossierSection(name="cbo_estimate", table="cbo_publications",
                       join_via="related_bill_id", limit=1,
                       fieldset="card", as_singleton=True),
        ...
    ],
    cache_ttl=300,
    cache_invalidation_tables={
        "bills", "bill_cosponsors", "bill_actions",
        "cbo_publications", "committee_reports",
        ...
    },
)
```
The plan registry is the single source of truth for what's in each
dossier. Adding a new dossier is a known motion: register the plan,
wire the route, add the cache key.
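A minimal sketch of what the plan types and registry might look like, assuming plain dataclasses. The field names mirror the `bill.py` example in this section; the `REGISTRY` dict, the `register` helper and the defaults are assumptions rather than the project's actual definitions.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple


@dataclass(frozen=True)
class DossierSection:
    name: str
    table: str
    join_via: str
    limit: int
    fieldset: str = "card"
    as_singleton: bool = False


@dataclass(frozen=True)
class DossierPlan:
    resource: str
    id_field: str
    sections: Tuple[DossierSection, ...]
    cache_ttl: int = 300
    cache_invalidation_tables: FrozenSet[str] = frozenset()


# Hypothetical registry: one plan per resource, looked up by the router.
REGISTRY: Dict[str, DossierPlan] = {}


def register(plan: DossierPlan) -> DossierPlan:
    """Add a plan to the registry, refusing silent overwrites."""
    if plan.resource in REGISTRY:
        raise ValueError(f"dossier plan already registered: {plan.resource}")
    REGISTRY[plan.resource] = plan
    return plan
```

Rejecting duplicate registrations keeps the registry usable as the single source of truth: two modules cannot quietly define competing plans for the same resource.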
## 4. Caching design
In-process LRU cache (Python `functools.lru_cache` or `cachetools.TTLCache`; note that `functools.lru_cache` has no built-in expiry, so the TTL would have to be handled separately). Key is (resource, id, congress, fields).
TTL 300 seconds.
Invalidation: each DossierPlan declares `cache_invalidation_tables`. The ingester layer writes an `ingestion_events` row when any of those tables changes. On dossier
read, the cache checks the latest ingestion_event.timestamp for
the relevant tables; if newer than the cached entry's `cached_at`, evict.
Trade-off: every dossier read does a small lookup against `ingestion_events`. Cheap (indexed by (table, timestamp)),
bounded.
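The read-time check described above can be sketched with the standard library alone. This is an illustrative design, not the planned `cache.py`: the class name `DossierCache`, its method signatures and the injectable `clock` are assumptions, and the caller is assumed to have already looked up the newest `ingestion_events` timestamp for the plan's tables.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Dict, Hashable, Optional, Tuple


class DossierCache:
    """LRU + TTL cache whose entries are evicted by newer ingestion events."""

    def __init__(self, maxsize: int = 1024, ttl: float = 300.0,
                 clock: Callable[[], float] = time.time) -> None:
        self.maxsize = maxsize
        self.ttl = ttl
        self._clock = clock
        # key -> (cached_at, value); insertion order doubles as LRU order.
        self._entries: "OrderedDict[Hashable, Tuple[float, Any]]" = OrderedDict()

    def put(self, key: Hashable, value: Any) -> None:
        self._entries[key] = (self._clock(), value)
        self._entries.move_to_end(key)
        while len(self._entries) > self.maxsize:
            self._entries.popitem(last=False)  # drop least recently used

    def get(self, key: Hashable, latest_event_at: Optional[float] = None
            ) -> Optional[Tuple[Any, Dict[str, Any]]]:
        """Return (value, cache_block) on a hit, or None on a miss."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        cached_at, value = entry
        age = self._clock() - cached_at
        expired = age > self.ttl
        # An ingester wrote to a relevant table after this entry was cached.
        invalidated = latest_event_at is not None and latest_event_at > cached_at
        if expired or invalidated:
            del self._entries[key]
            return None
        self._entries.move_to_end(key)  # mark as recently used
        return value, {"hit": True, "age_seconds": int(age), "ttl": int(self.ttl)}
```

On a miss the router would rebuild the dossier, call `put`, and return a cache block with `hit` set to false. Injecting `clock` keeps the TTL path testable without sleeping.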
## 5. Performance — why nested fetches, not one big JOIN
A 12-section bill dossier as one SQL query would have monstrous
fan-out. Instead, the service runs each section as its own
parameterized query, in parallel where possible:
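```python
import asyncio


# fetch_* helpers and assemble_dossier are defined elsewhere in the assembler.
async def build_bill_dossier(bill_id: str, fields: list[str]) -> dict:
    # Each section is its own bounded query; gather runs them concurrently.
    sections = await asyncio.gather(
        fetch_bill(bill_id),
        fetch_cosponsors(bill_id, limit=50),
        fetch_actions(bill_id, limit=20),
        fetch_cbo(bill_id),
        # ... remaining section fetchers
    )
    return assemble_dossier(sections)
```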
SQLite + WAL handles concurrent reads from the same connection pool.
Each section is bounded; total latency is the slowest section, not
the sum. p95 < 500 ms is the budget.
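To make the "slowest section, not the sum" point concrete, here is a self-contained toy version with simulated fetchers. The names `fake_fetch` and `build_demo_dossier` and the delays are invented for the illustration; real section fetchers would run bounded SQL queries instead of sleeping.

```python
import asyncio
from typing import Any, Dict, Tuple


async def fake_fetch(name: str, delay: float, payload: Any) -> Tuple[str, Any]:
    """Stand-in for one bounded section query; sleeps instead of hitting SQL."""
    await asyncio.sleep(delay)
    return name, payload


async def build_demo_dossier(bill_id: str) -> Dict[str, Any]:
    # gather() runs the coroutines concurrently and returns results in
    # argument order, so the wall time is roughly the longest delay.
    results = await asyncio.gather(
        fake_fetch("bill", 0.01, {"id": bill_id}),
        fake_fetch("cosponsors", 0.03, [{"id": "S000033"}]),
        fake_fetch("cbo_estimate", 0.02, None),
    )
    return dict(results)


dossier = asyncio.run(build_demo_dossier("hr:119:1"))
```

Because `asyncio.gather` preserves argument order, the assembled dict keeps the section order of the call even though the fetches complete at different times.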
## 6. Implementation surface
```
shared/josh_substrate/
  dossiers/
    __init__.py
    plans/
      bill.py
      legislator.py
      committee.py
      public_law.py
    assembler.py          # parallel section fetcher
    cache.py              # LRU + TTL + invalidation
josh-core/josh_core/
  routers/
    bills.py              # extended: + /dossier
    legislators.py        # extended
    committees.py         # extended
    public_laws.py        # extended
  tests/
    test_dossiers.py      # determiner
```
Per-router change is one new route per dossier; the heavy lifting is
in the shared assembler + plans.
Tasks
0 of 14 done.
- t1 DossierPlan registry + four v1 plans (bill, legislator, committee, public-law)
- t2 Parallel section fetcher (asyncio.gather), bounded per-section, card fieldset
- t3 Bill dossier router + envelope shape (12 sections, all bounded)
- t4 Legislator dossier router + congress filter
- t5 Committee dossier router + congress filter
- t6 Public-law dossier router + CFR/FR cascade
- t7 In-memory LRU + TTL cache; cache block in every response
- t8 ingestion_events table + ingester hook for cache invalidation
- t9 Truncation surface (more_url per truncated section) routes to rest-api-resource-endpoints sub-resource lists
- t10 Round-trip property: every nested `id` is fetchable via /v1/<resource>/{id}
- t11 ?fields= projection to exclude sections
- t12 Latency budget: p95 < 500 ms against v1 fixtures (asserted)
- t13 test_dossiers.py covers every AC + cache hit/miss + invalidation
- t14 OpenAPI documents each dossier shape with all sections
Changelog
- 2026-05-13T15:00:00Z, planned→planned. New spec authored after the 64-query coverage analysis surfaced 11 cross-source fan-out questions (17%) where an agent would otherwise need to orchestrate 5+ tool calls. Locked four v1 dossiers (bill, legislator, committee, public-law) as curated server-side joins, not a generic graph endpoint. DossierPlan registry analogous to AggregationPlan; in-process LRU+TTL cache with table-level invalidation via ingestion_events hooks; parallel section fetching via asyncio; bounded per-section limits with `more_url` for truncated lists. MCP Class F tools (one per dossier shape) wrap the REST endpoints.