Substrate tombstone & supersession policy
Header
Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.
Why
Federal records get withdrawn, replaced, corrected, and superseded by
upstream sources. Without a substrate-wide policy, each ingester invents
its own response: bills hard-deletes, FR keeps the row, CRS bumpsupdated_at. Citations break inconsistently. Agents can't reason about
why one source's withdrawn record returns 404 while another returns 200.
This spec locks the substrate-wide answer: tombstone via four columns
on every document-bearing table, with status enum, tombstone_at,tombstone_reason, and successor_id. Identity and event tables
(legislators, committees, votes) keep their domain-specific lifecycle
fields and don't take tombstone columns.
This spec is also what makes the REST API's 410 Gone semantics fromrest-api-conventions Q9 implementable — without tombstone metadata,
the substrate can only return 404 (which loses the agent affordance of
"here's the successor").
User stories
As an agent resolving a year-old citation, I want 410 Gone with the successor's URL when the original was withdrawn so that I can show the user "the document you cited has been replaced — here's the new one" without a separate fetch.
As a downstream researcher, I want withdrawn records to remain in the substrate as evidence so that my historical analysis isn't silently corrupted by upstream re-publication.
As an ingester author for a new source, I want one substrate-wide convention for "what happens when upstream removes a record" so that I don't reinvent the policy per source.
As a search-API consumer, I want list and search results to default to active records so that my agent isn't surfacing withdrawn rows as if they're current.
Acceptance criteria (EARS)
- Where a record-bearing table represents documents (publications, reports, filings, dockets, granules), the migration shall include four tombstone columns: `status TEXT NOT NULL DEFAULT 'active' CHECK (status IN ('active','withdrawn','superseded','flagged'))`; `tombstone_at TEXT`; `tombstone_reason TEXT`; `successor_id TEXT`.
- Where a record-bearing table represents an identity (legislators, committees, organizations) or an immutable event (votes), the migration shall not include tombstone columns; lifecycle shall be modeled via domain-specific fields (`served_until`, `disbanded_on`, etc.).
- When migration `0006_tombstone_columns` runs, it shall add the four tombstone columns to every existing document-bearing table (`crs_reports` is the only such table at v1).
- Where the substrate enum admits a status value other than `active`, the value shall be one of: `withdrawn` (upstream removed the record entirely; no successor), `superseded` (a newer record replaces this one; `successor_id` populated), or `flagged` (operational flag for manual review; doesn't affect default API queries unless explicitly opted into).
- When an upstream source explicitly signals a withdrawal or correction (FR withdrawal document, regulations.gov `withdrawn_at` field, CRS report retired, etc.), the source's parse stage shall produce a `ParsedRecord` whose payload sets `status` to the appropriate non-active value and populates `tombstone_at`, `tombstone_reason`, and (where applicable) `successor_id`.
- When an upstream source's discover stage returns a result set that excludes a previously-seen record (implicit removal), the runner shall bump that record's `last_seen_at` watermark but shall not auto-tombstone it; the row remains `active` until an explicit signal arrives or operator action is taken.
- When `josh source janitor <name> --tombstone-stale=<N>` runs (N defaults to 30 days), the system shall identify records whose `last_seen_at` is older than N days and prompt the operator before applying tombstones.
- When a list or search endpoint is called without an explicit `?status=` parameter, the response shall include only rows where `status='active'`.
- When a list or search endpoint is called with `?status=<csv>` (per rest-api-conventions Q7 multi-value filter), the response shall include rows whose status is in the listed set; `?status=*` shall return rows of any status.
- When a singleton GET is called for a record whose status is non-active, the response shall be HTTP 410 Gone per rest-api-conventions Q9, with the full row body included AND `error.hint` populated with `tombstone_at`, `tombstone_reason`, `successor_id`, and `successor_url` (the latter computed from `successor_id` via canonical-route resolution).
- Where an FTS5 virtual table is paired with a document-bearing table, the FTS5 sync triggers (AFTER INSERT / AFTER UPDATE / AFTER DELETE per `substrate-fts5-pipeline`) shall index every row regardless of status; the search wrapper in `josh-core` shall filter by status at query time via a JOIN to the parent table.
- Where a vec0 virtual table is paired with a document-bearing table, the vec0 row shall persist regardless of status; the search wrapper shall filter by status at query time via a JOIN to the parent chunks + records tables.
- Where `successor_id` is set, its format shall follow the canonical ID convention from rest-api-conventions Q5 (e.g., `hr:119:1`, `crs:R47892-v2`, `fr:2026-08558`); cross-source pointers are allowed but expected to be rare.
- Where multiple successors exist for a single tombstoned record, the substrate at v1 shall record only the primary successor in `successor_id`; secondary successors shall be documented in `tombstone_reason` free text. A `successor_ids TEXT` JSON-array column is reserved for v1.x if a real multi-successor pattern surfaces.
- Where the `Source` protocol's `parse()` stage produces a `ParsedRecord` for a document-bearing table, the payload may include any of: `status`, `tombstone_at`, `tombstone_reason`, `successor_id`. Omitting them defaults to `status='active'` with the other fields NULL.
Success determiner
Path
Runner
Contract test that exercises every convention against a fresh substrate: - Migration `0006_tombstone_columns` adds the four columns to `crs_reports`; PRAGMA table_info confirms shape. - Insert + tombstone an `crs_reports` row; FTS5 still indexes it (post-tombstone MATCH returns the row). - Insert + tombstone a chunk's parent record; vec0 row persists. - Search wrapper filters by status: `?status=active` excludes the tombstoned row; `?status=*` includes it. - Singleton fetch of the tombstoned row returns the simulated 410 Gone payload with `successor_id` and `successor_url`. - Implicit-removal: parse stage doesn't set status; runner bumps `last_seen_at` only. - Janitor with `--tombstone-stale=30d` against a row whose `last_seen_at` is 31 days ago prompts (mocked stdin) before applying the tombstone. Determiner currently fails because: (a) `0006_tombstone_columns` migration not yet written; (b) Source protocol doesn't yet thread tombstone fields through `ParsedRecord`; (c) `josh-core` search wrapper doesn't yet filter by status; (d) `josh source janitor --tombstone-stale` not yet implemented. Will flip to passing as the implementation lands across `shared/josh_substrate/migrations/`, `josh-ingester/`, `josh-core/`, and `josh-cli/`.
Clarifications needed
None.
Out of scope
- Schema for non-document tables — legislators, committees, votes use domain-specific lifecycle fields (`served_until`, etc.). This spec defines the document-table convention; identity-table conventions live in their per-source ingester specs.
- Multi-successor support — reserved as `successor_ids TEXT` JSON column for v1.x. v1 records single primary successor only.
- Auto-tombstone on implicit removal — explicit operator confirmation required at v1. Automated heuristic-based tombstoning deferred.
- Restoration / un-tombstoning — supported (just flip status back to `active`) but no specific operator command at v1; operators write SQL or use a follow-up ingestion run that explicitly sets `active`.
- Cross-source successor URL resolution — the substrate stores the canonical-ID-format string; the API layer resolves to a route. The resolver implementation lives in `josh-core`, not here.
- Per-source migration timing — each source's migration adds its tombstone columns when the source's table is first created. This spec mandates the convention; doesn't enumerate per-source migration schedules.
Dependencies
Plan
Decisions locked across two grilling-style questions, organized below.
Each becomes one or more EARS-form acceptance criteria above and maps
to the contract test in the determiner.
## 1. Tombstone strategy
Locked: tombstone columns on document-bearing tables (Q1 Option A).
Hard delete and history-table approaches were considered and rejected;
the in-place-update default loses citation semantics entirely.
### Schema
Every document-bearing migration adds four columns:
``sql``
status TEXT NOT NULL DEFAULT 'active'
CHECK (status IN ('active','withdrawn','superseded','flagged'))
tombstone_at TEXT -- ISO-8601 UTC; set when status leaves 'active'
tombstone_reason TEXT -- free-text upstream rationale
successor_id TEXT -- canonical-ID format (rest-api-conventions Q5)
### Status enum semantics
| Value | Meaning |
|-------|---------|
| active | Record is current and authoritative. Default. |
| withdrawn | Upstream removed/withdrew the record. No successor. |
| superseded | A newer record replaces this; successor_id populated. |
| flagged | Operational flag for manual review (e.g., normalization failed). Not user-facing unless opted in. |
### Per-source applicability
- Document-bearing tables (CRS reports, FR documents, bills, public
laws, regulations.gov dockets, hearings, committee reports,
Congressional Record granules, GAO reports, CBO publications, LDA
filings, US Code sections, eCFR/CFR sections, SAPs, hearing
transcripts) — get the four columns. No exceptions.
- Identity tables (legislators, committees, organizations) — keep
their domain-specific lifecycle fields (served_from, served_until,
disbanded_on, etc.). No tombstone columns.
- Immutable event tables (roll-call votes) — historical record;
never withdraws. No tombstone columns.
- Reference / controlled-vocabulary tables (topic taxonomy / LCSH
subjects) — different lifecycle (deprecation flag); no tombstone
columns at v1.
## 2. Detection — explicit vs implicit removal
Locked: explicit-only at v1 (Q1 sub-decision 3).
- Explicit withdrawal — upstream sends a withdrawal signal (FR
"withdrawal document" type, regulations.gov withdrawn_at field, CRS
report retired with successor). Source's parse stage produces a
ParsedRecord with status set; runner upserts the tombstone metadata.
- Implicit removal — record was in last run's discover, isn't this
run's. Could be a withdrawal, could be a pagination glitch. Runner
bumps the row's last_seen_at watermark but does NOT auto-tombstone.
Operator can run josh source janitor <name> --tombstone-stale=30d
to retroactively tombstone records that haven't been seen for >30
days, with stdin confirmation.
This avoids the "upstream changed pagination, half the substrate gets
tombstoned overnight" failure mode.
## 3. Query layer
Locked: default status='active', opt-in via ?status= CSV filter
(Q2 Decision 1).
- List/search endpoints default-filter to active.
- ?status=active,superseded includes superseded rows.
- ?status=* (wildcard) returns all states.
- Singleton GET on non-active row → 410 Gone with full body + tombstone
metadata in error.hint (per rest-api-conventions Q9).
## 4. FTS5 + vec0 indexing
Locked: index everything; filter at query time (Q2 Decisions 3 & 4).
FTS5 sync triggers (per substrate-fts5-pipeline) fire on insert/update/
delete regardless of status — they don't know about tombstones. The
search wrapper in josh-core joins to the parent table and filters by
status at query time:
``sql``
SELECT r.id, r.title, r.status, r.successor_id,
bm25(crs_reports_fts) AS rank
FROM crs_reports_fts
JOIN crs_reports r ON crs_reports_fts.rowid = r.rowid
WHERE crs_reports_fts MATCH :query
AND r.status IN (:status_set) -- per ?status=
ORDER BY rank
LIMIT :limit OFFSET :offset;
Same pattern for vec0:
``sql``
SELECT v.chunk_id, v.distance, r.id, r.status, r.successor_id
FROM crs_report_chunks_vec0 v
JOIN crs_report_chunks c ON v.chunk_id = c.id
JOIN crs_reports r ON c.report_id = r.id
WHERE r.status IN (:status_set)
AND v.embedding MATCH :query_vec
ORDER BY v.distance
LIMIT :limit;
Filter-at-query has three benefits over index-active-only:
- FTS5 sync triggers stay simple (current substrate-fts5-pipeline
pattern unchanged).
- Status flip-then-flip-back doesn't trigger expensive re-indexing.
- Agents passing ?status=* get correct results without separate
FTS5/vec0 maintenance for tombstoned rows.
Cost: ~10-20% index bloat from indexing tombstoned content. Acceptable
at v1 scale; revisit only if disk/memory pressure surfaces.
## 5. Successor pointers
- successor_id follows the canonical-ID format from rest-api-conventions
Q5: e.g., crs:R47892-v2, fr:2026-04601. Cross-source pointers
allowed (rare; e.g., a docket consolidated into an FR rule).
- successor_url is computed at API time by parsing the successor_id
prefix and resolving to the canonical route. Logic in josh-core's
citation resolver.
- Multi-successor: single successor_id only at v1. Secondary
successors documented in tombstone_reason. successor_ids TEXT
JSON-array column reserved for v1.x.
## 6. Source protocol contract
ParsedRecord.payload (Pydantic from josh_substrate.protocols) may
include the four tombstone fields. The upsert helper writes them
through to the parent row. Source modules that don't set them get
default active behavior — no breaking change for sources that don't
care about tombstones.
## Implementation surface
- shared/josh_substrate/migrations/versions/0006_tombstone_columns.py
— new migration adding the four columns to crs_reports. Future
document-bearing migrations include the columns from inception (no
retroactive fix needed for sources that don't yet exist).
- shared/josh_substrate/protocols.py — already-Pydantic
ParsedRecord accepts arbitrary payload shapes; no change needed.
- josh-ingester/josh_ingester/state.py — last_seen_at watermark on
ingestion_source_state for implicit-removal tracking. Add column
if not already present.
- josh-ingester/josh_ingester/runner.py — discover-stage bookkeeping that
bumps last_seen_at for matched records.
- josh-cli/src/josh_cli/commands/source.py — josh source janitor subcommand (per cli-conventions Q2 group).
<name> --tombstone-stale=N
- josh-core/josh_core/routes/ — search wrapper applies ?status= filter
via JOIN; singleton route returns 410 Gone with hint metadata when
status is non-active.
- shared/josh_substrate/tests/test_tombstone_policy.py — contract
test suite that the success_determiner runs.
Tasks
1 of 14 done.
- t1 All cross-cutting decisions resolved with rritz; spec drafted
- t2 Migration 0006_tombstone_columns adds the four columns to crs_reports
- t3 Future document-bearing migrations (FR, bills, public laws, etc.) include tombstone columns from inception
- t4 ingestion_source_state.last_seen_at column added; runner bumps it on discover-stage match
- t5 Source protocol's ParsedRecord accepts status / tombstone_at / tombstone_reason / successor_id payload fields
- t6 Upsert helper writes tombstone fields through to parent row when present
- t7 josh-core search wrapper filters by status with default 'active' (?status= CSV per rest-api-conventions)
- t8 josh-core singleton-fetch route returns 410 Gone with hint metadata for non-active rows
- t9 FTS5 + vec0 search wrappers JOIN to parent and filter by status at query time
- t10 josh source janitor --tombstone-stale=N command implemented
- t11 Per-source ingester specs (16 document-bearing) declare dependency on this spec
- t12 Per-source ingester specs prune withdrawal/correction questions from clarifications_needed
- t13 Contract test suite at shared/josh_substrate/tests/test_tombstone_policy.py covers every acceptance criterion
- t14 Determiner runs green — every convention exercised against the live substrate
Changelog
-
2026-05-10T23:00:00Z
(new)→plannedSpec authored after a 2-question grilling-style decision pass with claude: Q1: Tombstone columns (`status` enum + tombstone_at + tombstone_reason + successor_id) on document-bearing tables; identity / event / vocabulary tables keep their domain-specific lifecycle fields. Per-source opt-in. Detection is explicit-only at v1; implicit-removal bumps `last_seen_at` and is operator-tombstoned via `josh source janitor --tombstone-stale=N`. Q2: List/search default-filter `status='active'`; opt-in via `?status=` CSV filter (with `*` wildcard). Singleton GET on non-active returns 410 Gone with full row + tombstone metadata in `error.hint`. FTS5 + vec0 index everything; search wrapper filters by status at query time via JOIN to parent. Per-source ingester specs (16 document-bearing) will declare dependency on this spec and prune resolved clarifications in a follow-up commit. Specs that don't tombstone (legislators, committees, votes, topic taxonomy) are explicitly out of scope.