US Code

The United States Code is the official codification of permanent federal statutes, organized into 54 titles. When Congress passes a public law, the Office of the Law Revision Counsel (OLRC) integrates it into the appropriate title — moving sections, deleting repealed provisions, renumbering, and producing the consolidated body of law currently in effect.

For Josh, the US Code is the canonical statute layer the citation graph anchors on: bills cite USC sections; FR rules amend or implement USC sections; CRS reports cite USC; court decisions interpret USC. Every citation graph edge of the form "this rule implements 42 U.S.C. § 1396" needs a USC node to point to. Without USC, the citation graph is incomplete.

Quick reference


Source name	US Code
Publishers	Office of the Law Revision Counsel (OLRC), U.S. House of Representatives. Bulk distribution via GovInfo (Government Publishing Office).
License	Public domain
Coverage	Annual editions: 1925 – present (systematically online: 1994 – present). Release points (between annual editions): rolling; latest at time of writing is 119-36.
Volume	54 titles (Title 1 – Title 54, with Title 53 reserved). Each title ~100 KB to ~100 MB raw XML. ~3-5 GB total per edition.
Storage estimate	~3-5 GB per edition raw XML; ~1-2 GB extracted text. We ingest the latest edition + release points; older editions on demand.
Stable ID format	Title: `usc:{N}`. Section: `usc:{title}-{section}` (e.g. `usc:42-1396`). Subsection: `usc:{title}-{section}({sub})` (e.g. `usc:42-1396(a)(1)`).
Status	exploring — schema drafted, ingestion not built

Endpoint/URL templates, auth (none), rate limits (none observed; ~1-2 req/sec polite), and caching/incremental-sync serving behavior are upstream-serving details — see the GovInfo API docs for the GovInfo collection and the OLRC download page for release points.

Source priority decision

Primary (USLM full text): OLRC release points at uscode.house.gov. The release-point bundle xml_uscAll@{c}-{n}.zip is the only source of USLM 2.0 full-text XML, the structured representation we parse into sections. Release points are keyed by the most recent public law processed (119-36 = 119th Congress, Public Law 36) and appear on a rolling basis (roughly 10-30 per year) as OLRC integrates new laws, so they double as the currentness mechanism for a substrate that tracks law-as-of-now.

Companion (human-readable): GovInfo USCODE collection. Same access pattern as CHRG, CRPT, CREC, with a per-year sitemap and annual-edition packages keyed by title. The GovInfo per-title ZIP does NOT contain USLM full text: unzipping the 2023 title1/title5 packages shows only html/*.htm, pdf/*.pdf and metadata XML (MODS/PREMIS). We use it for the human-readable htm/pdf rendering and as a fallback for display, not for the structured parse.

The relationship: the OLRC release point xml_uscAll@119-36 carries the USLM 2.0 XML for "all titles as of Public Law 119-36." The GovInfo annual edition (e.g. USCODE-2024-title42) is the htm/pdf rendering of "Title 42 as of the annual cutoff." We parse from OLRC and layer GovInfo's companion artifacts on top for rendering — not the other way around.

Skip: scraping uscode.house.gov section-by-section pages. The release-point XML bulk download is the same content with structure. Don't pretend section pages are an "API" — they're rendered HTML.

Access notes

GovInfo: open. The HTML rendition (USCODE-{year}-title{N}.htm) is a <pre>-wrapped plaintext rendering; the package also ships a PDF and MODS/PREMIS metadata XML (mods.xml, premis.xml). It does not include the structured uscDoc, as noted under Source priority decision. Use GovInfo for the htm/pdf rendering only.

For the USLM 2.0 XML we parse, go to OLRC release points instead: the all-titles bundle https://uscode.house.gov/download/releasepoints/us/pl/{c}/{n}/xml_uscAll@{c}-{n}.zip contains the per-title usc{N}.xml uscDoc files. Download the release-point ZIP and extract.
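The URL template above can be filled in mechanically. A minimal sketch, assuming purely numeric congress and law components (the function and constant names are illustrative, not part of any OLRC tooling):

```python
# Template copied from the access note above; {congress}/{law} stand in for {c}/{n}.
RELEASE_POINT_URL = (
    "https://uscode.house.gov/download/releasepoints/us/pl/"
    "{congress}/{law}/xml_uscAll@{congress}-{law}.zip"
)


def release_point_url(congress: int, law: int) -> str:
    """Build the all-titles bundle URL for the release point keyed by a public law."""
    if congress <= 0 or law <= 0:
        raise ValueError("congress and law must be positive integers")
    return RELEASE_POINT_URL.format(congress=congress, law=law)


print(release_point_url(119, 36))
```

If OLRC ever publishes a release point whose key is not a plain congress-number pair, this helper would need to accept the key as an opaque string instead.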

OLRC at uscode.house.gov: variable bot-handling. The download landing page itself sometimes 403s with non-browser UAs, but the release-point ZIP URLs themselves usually serve cleanly. Test before committing to a fetcher path.

The full GovInfo and OLRC endpoint/URL-pattern tables (sitemap, content/pkg ZIP, pdf, htm, mods.xml, wssearch getContentDetail, release-point landing .htm, per-title xml_usc{N} ZIP templates) are upstream-serving reference — see the GovInfo API docs and the OLRC download page. Per-title sizes range from Title 1 (~200 KB PDF/HTML) to Title 42 (Public Health, ~50+ MB PDF).

USLM 2.0 schema — Josh's interpretation

USLM is the United States Legislative Markup XML schema, also used for enacted Public Laws (PLAW-{c}publ{n}). The full element catalog (uscDoc/meta/main/title/subtitle/chapter/subchapter/part/subpart/section/subsection/paragraph/subparagraph/...) and the generic XML element tree are documented upstream at github.com/usgpo/uslm. The hierarchy depth varies: some titles are <title>/<chapter>/<section> with no subtitle/subchapter; others run <title>/<subtitle>/<chapter>/<subchapter>/<part>/<subpart>/<section>.
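Because the depth varies by title, a parser cannot hard-code the path to a section. The sketch below walks a document and records which structural levels sit above each section. It matches on local element names so it does not depend on the exact USLM namespace URI; the sample documents, the placeholder namespace, and the reliance on an identifier attribute on section elements are assumptions for illustration.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET

# Structural levels named in the text above; other elements are passed through.
LEVELS = {"title", "subtitle", "chapter", "subchapter", "part", "subpart"}


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def section_ancestry(xml_text: str) -> dict[str, list[str]]:
    """Map each section's identifier to the structural levels above it."""
    found: dict[str, list[str]] = {}

    def walk(elem: ET.Element, levels: list[str]) -> None:
        name = local_name(elem.tag)
        if name == "section":
            ident = elem.get("identifier")
            if ident:
                found[ident] = levels
            return  # stop here; section internals belong to the section parser
        if name in LEVELS:
            levels = levels + [name]
        for child in elem:
            walk(child, levels)

    walk(ET.fromstring(xml_text), [])
    return found


# Hand-written miniature documents, not real OLRC data.
SHALLOW = (
    '<uscDoc xmlns="urn:example:uslm"><main><title><chapter>'
    '<section identifier="/us/usc/t1/s1"/>'
    "</chapter></title></main></uscDoc>"
)
DEEP = (
    '<uscDoc xmlns="urn:example:uslm"><main><title><subtitle><chapter>'
    "<subchapter><part><subpart>"
    '<section identifier="/us/usc/t26/s1"/>'
    "</subpart></part></subchapter></chapter></subtitle></title></main></uscDoc>"
)

print(section_ancestry(SHALLOW))
print(section_ancestry(DEEP))
```

The recorded level list maps directly onto the nullable chapter/subchapter/part/subpart columns: a level that is absent from the list simply stays NULL.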

What matters for Josh:

Section is the citation target. A user citation 42 U.S.C. § 1396a(b)(2)(A) resolves to:

Title 42, Section 1396a, Subsection (b), Paragraph (2), Subparagraph (A)
We model section-level rows; subsections/paragraphs/subparagraphs as nested structure inside section text + a flattened citation index.
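The resolution above can be sketched as a small parser that splits a citation into a section-level key plus a sub-path. The regex, the class, and the function name are illustrative assumptions; real-world citation strings vary more than this pattern covers.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# Title, "U.S.C.", optional section sign, section number (letter and hyphen
# suffixes allowed), then zero or more parenthesized path components.
CITATION_RE = re.compile(
    r"(?P<title>\d+)\s+U\.?S\.?C\.?\s*(?:§+\s*)?"
    r"(?P<section>\d+[A-Za-z]*(?:-\d+[A-Za-z]*)*)"
    r"(?P<path>(?:\([0-9A-Za-z]+\))*)"
)


@dataclass(frozen=True)
class UscCitation:
    title: int
    section: str            # kept as text so '1396a' and '1395dd' survive
    path: tuple[str, ...]   # ('b', '2', 'A') for (b)(2)(A)

    @property
    def section_id(self) -> str:
        """Section-level stable ID: the row we actually store."""
        return f"usc:{self.title}-{self.section}"

    @property
    def full_id(self) -> str:
        """Section ID with the sub-path appended."""
        return self.section_id + "".join(f"({p})" for p in self.path)


def parse_citation(text: str) -> UscCitation | None:
    """Return the first USC citation found in text, or None."""
    m = CITATION_RE.search(text)
    if m is None:
        return None
    path = tuple(re.findall(r"\(([0-9A-Za-z]+)\)", m.group("path")))
    return UscCitation(int(m.group("title")), m.group("section"), path)


cite = parse_citation("42 U.S.C. § 1396a(b)(2)(A)")
print(cite.section_id, cite.path)
```

Lookups go through section_id; the path tuple is only consulted when a caller asks for something narrower than the whole section.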

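<sourceCredit> is the legislative provenance: the inbound link from public laws to USC. Every section ends with a <sourceCredit> block listing the public laws that enacted/amended it (e.g. (July 1, 1944, ch. 373, title III, § 301, 58 Stat. 691; Aug. 5, 1947, ch. 502, § 1(31), 61 Stat. 786; ...)). The citation graph extractor parses <sourceCredit> to build edges. The 1944 entry above is section 301 of the Public Health Service Act, which is codified at 42 U.S.C. § 241 and was passed by the 78th Congress, so the edges read: usc:42-241 ← law:78-410 (the original 1944 enactment) ← law:80-340 (the 1947 amendment) etc. Older entries (roughly pre-1957) cite a Statutes at Large chapter (ch. 373) rather than a public law number, so mapping them to a law:{congress}-{number} ID needs a chapter-to-public-law lookup.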
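A minimal sketch of the <sourceCredit> extraction, assuming modern entries use the "Pub. L. {congress}-{number}" form and older entries cite a chapter instead. Both regexes and the function name are assumptions; the sample string reuses the credit quoted above and appends one made-up Pub. L. entry purely for illustration.

```python
import re

# "Pub. L. 99-999" with either a hyphen or an en dash between the numbers.
PUB_L_RE = re.compile(r"Pub\.\s*L\.\s*(\d+)\s*[-\u2013]\s*(\d+)")
# Older entries: "July 1, 1944, ch. 373" (date followed by a chapter number).
CHAPTER_RE = re.compile(r"([A-Z][a-z]+\.? \d{1,2}, \d{4}), ch\. (\d+)")


def parse_source_credit(text: str):
    """Split a sourceCredit string into law IDs and unresolved chapter cites."""
    law_ids = []
    for congress, number in PUB_L_RE.findall(text):
        law_id = f"law:{congress}-{number}"
        if law_id not in law_ids:  # keep first-seen order, drop repeats
            law_ids.append(law_id)
    # Chapter cites carry no public law number; they need a separate lookup.
    chapter_cites = [(date, int(ch)) for date, ch in CHAPTER_RE.findall(text)]
    return law_ids, chapter_cites


# The final Pub. L. entry below is hypothetical, not a real citation.
SAMPLE = (
    "(July 1, 1944, ch. 373, title III, § 301, 58 Stat. 691; "
    "Aug. 5, 1947, ch. 502, § 1(31), 61 Stat. 786; "
    "Pub. L. 99\u2013999, § 1, Jan. 1, 2000, 114 Stat. 1.)"
)

laws, chapters = parse_source_credit(SAMPLE)
print(laws)
print(chapters)
```

Only the first list can populate enacted_by_law_ids directly; the chapter cites stay unresolved until a chapter-to-public-law mapping is available.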

Title status

Status	Meaning
`positive_law`	Title has been enacted as positive law itself; the USC text is itself the authoritative source. ~27 of 54 titles.
`non_positive_law`	Title is editorial codification of underlying public laws. The Statutes at Large are authoritative; USC is a convenience. ~26 titles.
`reserved`	Title number is reserved (Title 53).

Positive Law: Yes/No is exposed in the GovInfo getContentDetail metadata. Important distinction for legal weight; CRS reports often note it, and it drives Josh's is_positive_law column.

Stable ID format

Section: usc:{title}-{section}, where section is the section number including any letter suffix (1396a, 1395dd).

Examples:

usc:42-1396 — 42 U.S.C. § 1396
usc:42-1396a — 42 U.S.C. § 1396a
usc:5-552 — 5 U.S.C. § 552 (FOIA)
usc:18-1030 — 18 U.S.C. § 1030 (CFAA)

For sub-section levels, append the path:

usc:42-1396a(b) — subsection
usc:42-1396a(b)(2)(A) — full path
We do not necessarily mint stable IDs for every subsection/paragraph row; instead, the section row contains the full structured XML and the citation lookup resolves paths.

Title: usc:{title}

Edition stamp: USC content is mutable across editions and release points. We add a usc_edition qualifier for forensic tracking — usc:42-1396@2024 or usc:42-1396@rp119-36. Default queries hit the latest edition; explicit qualifier hits historical.
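To illustrate how the qualifier composes with the base ID, here is a small parser for the forms listed in this section. The regex and the StableId class are assumptions drawn from the examples above (usc:42-1396@2024, usc:42-1396@rp119-36); title-only IDs such as usc:42 are deliberately out of scope.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

STABLE_ID_RE = re.compile(
    r"^usc:(?P<title>\d+)-(?P<section>[0-9A-Za-z-]+?)"
    r"(?P<path>(?:\([0-9A-Za-z]+\))*)"
    r"(?:@(?P<edition>\d{4}|rp\d+-\d+))?$"
)


@dataclass(frozen=True)
class StableId:
    title: int
    section: str
    path: str            # '(b)(2)(A)' or '' when the ID is section-level
    edition: str | None  # '2024', 'rp119-36', or None meaning "latest"

    @property
    def section_id(self) -> str:
        """Unqualified section-level ID, usable as the logical lookup key."""
        return f"usc:{self.title}-{self.section}"


def parse_stable_id(value: str) -> StableId:
    """Split a section-level or deeper stable ID into its parts."""
    m = STABLE_ID_RE.match(value)
    if m is None:
        raise ValueError(f"not a section-level USC stable ID: {value!r}")
    return StableId(int(m["title"]), m["section"], m["path"], m["edition"])


print(parse_stable_id("usc:42-1396a(b)(2)(A)@rp119-36"))
```

Treating a missing qualifier as "latest" keeps ordinary citations short while still letting a forensic query pin an exact edition or release point.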

Volume

Metric	Value
Titles	54 (53 reserved)
Total sections	~50,000-60,000 across all titles
USLM XML per title	100 KB - 100 MB (Title 26, the Internal Revenue Code, and Title 42, The Public Health and Welfare, are the largest)
Total USLM XML per edition	~3-5 GB
Annual edition releases	Once a year (e.g. `USCODE-2024`)
Release points per year	~10-30 (irregular; roughly one every 2-5 weeks)

Postgres footprint: ~3-5 GB per edition, with vector chunks adding ~1-2 GB. Older editions stored if needed for historical queries.

Caching / incremental sync

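Yearly: refresh from the new annual edition. When USCODE_{YYYY}_sitemap.xml appears (typically June of year YYYY+1), download the per-title GovInfo ZIPs and replace the htm/pdf companion artifacts and edition metadata. The structured parse still comes from the OLRC release point, since the GovInfo ZIPs carry no USLM full text.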
Per release point: per-title delta. Poll OLRC release points page (or sitemap if available). For each new release point, download per-title ZIPs only for titles that changed at that release point — the release point landing page lists changed titles with bold formatting.
Last-Modified per title package for conditional GET on annual edition refreshes.
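The conditional GET in the last item can be sketched with the standard library alone. Whether the upstream server honors If-Modified-Since for these packages is an assumption to verify; the function names and the User-Agent string are illustrative.

```python
from __future__ import annotations

import urllib.error
import urllib.request
from datetime import datetime, timezone
from email.utils import format_datetime


def conditional_request(
    url: str,
    last_modified: datetime | None = None,
    user_agent: str = "usc-ingest-sketch/0.1",
) -> urllib.request.Request:
    """Build a GET request that asks the server to skip unchanged packages."""
    headers = {"User-Agent": user_agent}
    if last_modified is not None:
        # HTTP dates are always expressed in GMT.
        stamp = last_modified.astimezone(timezone.utc)
        headers["If-Modified-Since"] = format_datetime(stamp, usegmt=True)
    return urllib.request.Request(url, headers=headers)


def fetch_if_changed(req: urllib.request.Request) -> bytes | None:
    """Return the body, or None when the server answers 304 Not Modified."""
    try:
        with urllib.request.urlopen(req, timeout=60) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # urllib surfaces 304 as an HTTPError
            return None
        raise


# Placeholder URL; the real package URL templates live in the upstream docs.
req = conditional_request(
    "https://example.org/USCODE-2024-title42.zip",
    datetime(2025, 1, 6, tzinfo=timezone.utc),
)
print(req.get_header("If-modified-since"))
```

Storing the Last-Modified value per title alongside fetched_at is enough state to drive this on the next refresh.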

Because USC is the citation target for many other sources, we want fast section lookup by (title, section) and stable IDs that don't change across editions (unless the section itself is renumbered, which rarely happens).

Schema (Postgres DDL)

-- ============================================================
-- US Code titles (one row per title)
-- ============================================================

CREATE TABLE usc_titles (
    id text PRIMARY KEY,                          -- 'usc:42'
    title_number int NOT NULL,                    -- 42 (not UNIQUE: prior editions repeat the number)
    title_name text NOT NULL,                     -- 'The Public Health and Welfare'
    is_positive_law boolean NOT NULL,
    edition_year int NOT NULL,                    -- 2024
    laws_in_effect_as_of date,                    -- e.g. 2025-01-06 from MODS
    govinfo_package_id text NOT NULL,             -- 'USCODE-2024-title42'

    -- Source URLs
    usc_zip_url text,
    pdf_url text,
    html_url text,
    mods_url text,

    -- Lifecycle
    raw_uslm_xml bytea,                           -- gzipped USLM XML
    fetched_at timestamptz NOT NULL,
    parsed_at timestamptz,
    inserted_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now()
);

CREATE INDEX usc_titles_edition ON usc_titles (edition_year DESC);

-- ============================================================
-- US Code sections (the citation target — one row per section)
-- ============================================================

CREATE TABLE usc_sections (
    id text PRIMARY KEY,                          -- 'usc:42-1396a'
    title_id text NOT NULL REFERENCES usc_titles(id) ON DELETE CASCADE,
    title_number int NOT NULL,
    section_number text NOT NULL,                 -- '1396a' (string — preserves letter suffixes)

    -- Hierarchy path (denormalized for filter speed)
    chapter text,                                 -- '6A'
    subchapter text,                              -- 'II'
    part text,                                    -- 'A'
    subpart text,                                 -- 'i'

    heading text,                                 -- section heading
    chapeau text,                                 -- introductory text before subsections

    -- Body
    body_text text NOT NULL,                      -- flat text rendering, includes subsections
    body_xml bytea,                               -- the section's USLM XML subtree, gzipped

    -- Provenance from <sourceCredit>
    source_credit_text text,                      -- the verbatim sourceCredit prose
    enacted_by_law_ids text[],                    -- ['law:78-410', 'law:80-340'] (extracted soft FKs)

    -- Status
    is_repealed boolean NOT NULL DEFAULT false,
    is_omitted boolean NOT NULL DEFAULT false,
    is_reserved boolean NOT NULL DEFAULT false,

    -- Lifecycle
    edition_year int NOT NULL,
    fetched_at timestamptz NOT NULL,
    inserted_at timestamptz NOT NULL DEFAULT now(),
    updated_at timestamptz NOT NULL DEFAULT now(),

    UNIQUE (title_number, section_number, edition_year)
);

CREATE INDEX usc_sections_title ON usc_sections (title_number, section_number);
CREATE INDEX usc_sections_active ON usc_sections (title_number) WHERE NOT (is_repealed OR is_omitted);
CREATE INDEX usc_sections_chapter ON usc_sections (title_number, chapter);

ALTER TABLE usc_sections ADD COLUMN search_tsv tsvector
    GENERATED ALWAYS AS (
        setweight(to_tsvector('english', coalesce(heading, '')), 'A') ||
        setweight(to_tsvector('english', coalesce(chapeau, '')), 'B') ||
        setweight(to_tsvector('english', coalesce(body_text, '')), 'D')
    ) STORED;
CREATE INDEX usc_sections_search ON usc_sections USING gin (search_tsv);

-- ============================================================
-- USC structure (chapters, subchapters etc — for nav, not retrieval)
-- ============================================================

CREATE TABLE usc_structure_nodes (
    id text PRIMARY KEY,                          -- 'usc-struct:42-ch6A-subII-pA'
    title_id text NOT NULL REFERENCES usc_titles(id) ON DELETE CASCADE,
    parent_id text REFERENCES usc_structure_nodes(id),
    level text NOT NULL CHECK (level IN ('subtitle', 'chapter', 'subchapter', 'part', 'subpart')),
    number text,                                  -- e.g. '6A', 'II', 'A'
    heading text NOT NULL,
    section_first text,                           -- first section in this node
    section_last text                             -- last section in this node
);

CREATE INDEX usc_structure_nodes_parent ON usc_structure_nodes (parent_id);

-- ============================================================
-- Vector chunks for similarity
-- ============================================================

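-- Requires the pgvector extension for the vector type and the hnsw index.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE usc_section_chunks (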
    id bigserial PRIMARY KEY,
    section_id text NOT NULL REFERENCES usc_sections(id) ON DELETE CASCADE,
    chunk_index int NOT NULL,
    chunk_text text NOT NULL,
    embedding vector(1024),
    UNIQUE (section_id, chunk_index)
);
CREATE INDEX usc_section_chunks_embedding
    ON usc_section_chunks USING hnsw (embedding vector_cosine_ops);

Schema decisions worth flagging:

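usc_sections.id uses the bare form (usc:42-1396a) for the current edition. Because id is the primary key, a historical edition loads as separate rows under the edition-qualified form (usc:42-1396a@2023), and the rows share the logical key (title_number, section_number) with edition_year telling them apart. Default queries filter by edition_year = (SELECT MAX(edition_year) FROM usc_titles).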
Sub-section structure stored as flattened text + body_xml rather than its own table. Most queries don't need to discriminate "subsection (a) paragraph (1)" — they want the whole section. When a query needs to cite a specific subsection, the citation resolver walks the body_xml.
enacted_by_law_ids extracted at parse time from <sourceCredit> — soft FK array to public laws. This is the inbound side of the citation graph (laws → USC sections they amend/enact). The outbound side (USC sections → other USC sections referenced in a section's text) lives in the global citation graph table.
is_repealed, is_omitted, is_reserved flags: USC retains repealed sections as placeholder rows ("Repealed by P.L. 117-...") with no body. We preserve them so a citation to a repealed section still resolves and explains its history.
raw_uslm_xml at title level, body_xml at section level — title-level raw lets us re-extract sections; section-level XML supports citation-path traversal for subsection retrieval without re-loading the full title.
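The body_xml traversal mentioned above could look like the following sketch. It assumes each nested level carries a num child whose value attribute holds the bare label ("b", "2", "A"); that attribute convention and the sample fragment are assumptions to check against real release-point XML.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def resolve_path(section_xml: str, path: list[str]) -> str | None:
    """Walk a section subtree along path and return the matching node's text."""
    node = ET.fromstring(section_xml)
    for label in path:
        for child in node:
            num = next((c for c in child if local_name(c.tag) == "num"), None)
            if num is not None and num.get("value") == label:
                node = child
                break
        else:
            return None  # no child at this level carries the requested label
    # Collapse whitespace so indentation in the XML does not leak into text.
    return " ".join("".join(node.itertext()).split())


# Hand-written fragment for illustration only; not real statutory text.
SAMPLE = """
<section>
  <num value="1396a">§ 1396a.</num><heading>Example heading</heading>
  <subsection><num value="a">(a)</num> <content>Alpha</content></subsection>
  <subsection><num value="b">(b)</num>
    <paragraph><num value="1">(1)</num> <content>One</content></paragraph>
    <paragraph><num value="2">(2)</num>
      <subparagraph><num value="A">(A)</num> <content>Target text</content></subparagraph>
    </paragraph>
  </subsection>
</section>
"""

print(resolve_path(SAMPLE, ["b", "2", "A"]))
```

Since the walk touches only one section's subtree, it avoids reloading the title-level XML, which is the reason for keeping body_xml on the section row.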

Download / update strategy

Backfill

Fetch the latest OLRC release-point all-titles ZIP (e.g. xml_uscAll@119-36.zip), the single ~1 GB bundle carrying the USLM 2.0 XML for all titles.
Save the raw release-point ZIP to /data/corpus/us-code/bodies/raw/ (once per release point, before parsing).
For each per-title usc{N}.xml extracted from the bundle:
- Parse USLM into one title row, its section rows, and its structure nodes.
- Extract <sourceCredit> law references.
Embed chunks.

One all-titles release-point ZIP ~1 GB downloaded; ~50K-60K sections parsed. Runs detached on the server per the add-a-new-source runbook.
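The extraction step of the backfill can be sketched with zipfile. The member-name pattern is an assumption that tolerates zero padding and a single letter suffix on top of the usc{N}.xml form given above; the demo bundle is built in memory so the example runs without a download.

```python
import io
import re
import zipfile

# Matches 'usc42.xml' or 'usc01.xml', with or without a leading directory.
TITLE_FILE_RE = re.compile(r"(?:^|/)usc(\d+[A-Za-z]?)\.xml$")


def iter_title_xml(source):
    """Yield (title_key, xml_bytes) for each per-title file in a bundle.

    source may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as bundle:
        for name in sorted(bundle.namelist()):
            match = TITLE_FILE_RE.search(name)
            if match:
                yield match.group(1), bundle.read(name)


# Build a tiny stand-in bundle; a real run would pass the saved ZIP path.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as z:
    z.writestr("usc01.xml", "<uscDoc/>")
    z.writestr("usc42.xml", "<uscDoc/>")
    z.writestr("readme.txt", "not a title file")
demo.seek(0)

for title_key, xml_bytes in iter_title_xml(demo):
    print(title_key, len(xml_bytes))
```

Reading one member at a time keeps peak memory near the size of the largest title rather than the whole bundle.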

Then layer the GovInfo companion (optional, for rendering):

Walk the GovInfo sitemap (USCODE_{latest}_sitemap.xml, one <loc> per title) to pull the per-title htm/pdf companion artifacts.
Attach the htm/pdf rendering to each title for human-readable display. (No USLM parsing here — the GovInfo ZIP has no USLM full text.)
Mark rows repealed / amended at parse time; preserve old text in a usc_section_history table if we need it (defer to v1.x).

Incremental updates

Monthly: poll OLRC release points for new release-point ZIPs. New release point → fetch per-title delta, update changed sections.
Yearly: full re-ingest of annual edition. Annual edition is the canonical edition stamp.

State tracking

Source key: usc. State stores latest ingested annual edition + latest release point.
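A sketch of how the stored release point could gate the poll: compare tags numerically, since a plain string sort would place 119-4 after 119-36. The function names are hypothetical, and tags are assumed to be simple {congress}-{law} pairs.

```python
from __future__ import annotations


def parse_release_point(tag: str) -> tuple[int, int]:
    """Turn '119-36' into (119, 36) so tags order numerically."""
    congress, law = tag.split("-", 1)
    return int(congress), int(law)


def pending_release_points(
    available: list[str], last_ingested: str | None
) -> list[str]:
    """Return tags newer than the stored state, oldest first."""
    cutoff = parse_release_point(last_ingested) if last_ingested else (0, 0)
    newer = {tag for tag in available if parse_release_point(tag) > cutoff}
    return sorted(newer, key=parse_release_point)


seen_upstream = ["119-36", "119-4", "118-250", "119-30"]
print(pending_release_points(seen_upstream, "119-30"))
```

Processing the result oldest-first means a crash mid-run leaves the stored state pointing at the last release point that was fully applied.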

Failure modes

OLRC bot wall. OLRC is the only USLM source, so a sustained 403 blocks the structured parse — GovInfo cannot substitute (no USLM full text). The landing page 403s more readily than the release-point ZIP URLs themselves; retry the direct xml_uscAll@{c}-{n}.zip URL with a browser-like UA before giving up. Fall back to the prior cached release point, not to GovInfo, for the parse.
USLM schema drift. Schema is stable (1.0 → 2.0 evolution was years ago) but could change. Defensive parser; preserve raw XML.
Section number with letter suffix (1395dd, 1396a) — store as text, not int. Already in schema.
Repealed section without <heading>: common. Keep heading nullable, and populate body_text (which is NOT NULL) with the repeal note.
Cross-title references in <sourceCredit> — sometimes a <sourceCredit> says "see also Title 26 § ...". Capture in citation graph but not as enacted_by_law_ids.

Open questions

These don't block ingestion but should be resolved before this source is "shipped":

Subsection-level retrieval granularity. The agent may need to retrieve "42 U.S.C. § 1396a(b)(2)(A)" specifically, not the whole section. Decide whether to maintain a usc_subsections table or rely on body XML traversal at query time. Lean: defer to v1.x; ship v1 with section-level granularity.
Historical edition retention. We ingest the latest annual edition + release point. Do we keep prior annual editions for "as of date X" queries? Disk: ~3-5 GB per edition × 30 years = roughly 90-150 GB. Lean: keep last 5 annual editions only.
Citation graph integration. USC is the most-cited node in the graph. The schema supports it; the global citation graph (citation_graph_edges) is built across sources. Document the join semantics carefully when we build that source.
OLRC release point backfill. ~30 release points per year × 5 years = 150 release-point ZIPs at ~1 GB each. Most overlap. Decide how granular section-level history should be — every release point, or just annual snapshots? Lean: annual + last release point only.
Cross-section references within a section's text. A section often references other sections ("as defined in section 1395 of this title"). Extract these as edges in the citation graph at parse time.
USC Appendix and "Tables" volumes. Title 50 has an Appendix; Title 18 has an Appendix. GovInfo packageIds for these are USCODE-{YYYY}-title{N}-appendix. Two options: separate usc_titles rows, or a flag on the title row. Lean: keep title_number as int and add an appendix_designation text column rather than overloading title_number with a string suffix. Add the column if needed.
Parallel-table hyperlink overlay. OLRC produces "popular name table," "tables of acts cited," etc. Useful navigation tools. Defer; v1 ships with section-level coverage.
Real-time legislative integration. When a public law is signed, OLRC takes 1-3 months to integrate. Until then, the affected USC section is technically out of date. We can't fix this without doing the integration work ourselves (out of scope). Document the lag honestly.