substrateshippedp0

Substrate citation metadata

substrate-citation-metadata · updated 2026-05-10T18:30:00Z · owner rritz

Use the pencil to edit title, status, priority, and owner. Changing status auto-prepends a changelog entry.

Every record-bearing table in the substrate carries the same six citation
columns (source_url, source_org, published_at, retrieved_at,
raw_sha256, citation_string) so downstream agents can cite a result
without a second round-trip. Per-source citation_for() formatters live
in josh_substrate.citations.formatters.<source> and produce the
human-readable citation_string from the row's metadata using a single
shared style. Loader stamps citation columns at write time; API
serializers re-use the same formatters for response payloads.

As a downstream agent, I want a `citation_string` on every record I retrieve so that I can quote it back to the user without an extra fetch.

As a Source author, I want a `citation_for(record)` helper alongside the loader so that I'm not re-implementing date formatting and author lists per source.

As a policy analyst, I want citations that match what I'd see in a paper or report so that I can paste them straight into my own writing.

  1. Where a substrate table holds primary records, the migration shall include the six citation columns: `source_url TEXT`, `source_org TEXT`, `published_at TEXT`, `retrieved_at TEXT`, `raw_sha256 TEXT`, `citation_string TEXT NOT NULL` (per migrations `0002_crs.py`, `0003_legislators_committees.py`).
  2. When the loader writes a row, every citation column shall be non-null on success, and the `citation_string` shall be produced by the per-source `citation_for()` formatter — never assembled inline in the loader.
  3. When `citation_for()` is called with a record that's missing a date, it shall produce an `n.d.` placeholder rather than raising; missing identifier or title shall raise `ValueError` so the loader can route the record to `ingestion_logs`.
  4. When a month appears in a citation string, the formatter shall use the standard short-form abbreviations (`Jan`, `Feb`, …) without trailing periods (per `test_may_is_not_abbreviated_with_period`).
kindtest_file

Path

shared/josh_substrate/tests/citations/test_crs.py

Runner

uv run pytest shared/josh_substrate/tests/citations -v

The CRS formatter test is the contract — 10 tests covering the canonical citation, version-record date precedence, the May abbreviation edge case, the `n.d.` fallback, the missing-identifier raise, the untitled placeholder, ISO-string parsing, and a sweep across all month abbreviations. Future formatters (legislators, bills, FR, …) ship alongside their own per-source test files using the same shape; this determiner stays pinned to CRS as the reference because it exercises every branch of the formatter contract.

None.

  • Bibliographic styles other than the substrate's house style. We pick one and stick with it.
  • DOIs, URN-NID schemes, or other persistent identifiers — `source_url` is the substrate's identifier.
  • Programmatic citation export (BibTeX, RIS). Surface concern, not substrate concern.

Implementation:

- Migrations: every record-bearing migration declares the six
citation columns. CRS (0002_crs.py line 81-82, plus body_*
columns) and legislators/committees (0003_legislators_committees.py
line 204-205, 511-512 for the parent + member tables) both follow
this shape. New sources copy the same column block.
- Formatters: shared/josh_substrate/src/josh_substrate/citations/formatters/
holds crs.py (121 LOC), legislators_committees.py, plus a shared
__init__.py that re-exports the public surface. Each module exports
a citation_for(record) callable.
- Loader integration: per-source loaders pass the record through
citation_for() and pack the result into payload["citation_string"]
before calling upsert_with_children. The other five citation columns
are sourced from the parser stage.
- Tests: tests/citations/test_crs.py (155 LOC, 10 tests). Uses
types.SimpleNamespace and a small @dataclass to decouple from
Pydantic models.

4 of 4 done.

  • t1 Six citation columns enshrined in CRS + legislators migrations
  • t2 Per-source citation_for() formatters with shared style
  • t3 Loader stamps citation_string via citation_for(), never inline
  • t4 test_crs.py covers the formatter contract end-to-end
  • 2026-05-10T18:30:00Z (new)shipped Spec content backfilled retrospectively. The six-column citation pattern, per-source `citation_for()` formatters, and CRS reference tests have been live since the CRS Reports ingester landed and were tightened during the conventions-refactor. This commit replaces the [STUB] placeholders with acceptance criteria grounded in the migrations and the `test_crs.py` contract.

docs/spec/substrate-citation-metadata.html · generated by bin/build-spec.py