Repo structure
How the Josh monorepo is organized, why it's organized this way, and what to do when adding a new service.
This is the canonical reference for layout decisions. When CLAUDE.md or other docs disagree with this document, this document wins for repo-organization concerns.
Storage backend: SQLite + FTS5 + sqlite-vec. See "Storage backend decision" in CLAUDE.md. The substrate is a single file at /data/josh.db shared between containers via a host bind mount; there is no josh-postgres accessory.
The map
    josh/
    ├── CLAUDE.md
    ├── .dockerignore              # project-root, controls build-context uploads
    ├── .gitignore
    ├── .ruby-version              # pins ruby for kamal
    │
    ├── .kamal/                    # shared kamal config across services
    │   ├── secrets                # plaintext, gitignored
    │   ├── secrets.example        # committed template
    │   └── secrets.age            # encrypted, committed (when populated)
    │
    ├── shared/
    │   └── josh_substrate/        # shared Python package, see below
    │       ├── pyproject.toml
    │       ├── README.md
    │       ├── alembic.ini
    │       └── src/josh_substrate/
    │           ├── __init__.py
    │           ├── migrations/    # all Alembic migrations
    │           ├── models/        # (future) Pydantic models per table
    │           ├── ids.py         # (future) canonical ID helpers
    │           └── db.py          # (future) connection helpers (PRAGMAs, sqlite-vec load)
    │
    ├── josh-core/                 # FastAPI + LLM harness (currently placeholder)
    │   ├── Dockerfile             # build context = project root
    │   ├── README.md
    │   ├── pyproject.toml         # service-specific deps (alembic etc. via shared/)
    │   ├── app/
    │   │   └── main.py
    │   ├── config/deploy.yml      # kamal config (no accessory; volume mount /data)
    │   └── .kamal -> ../.kamal    # symlink so kamal finds shared secrets
    │
    ├── josh-ingester/             # (to be rebuilt) ETL workers, same shape as josh-core
    │
    ├── josh-web/                  # (later) Next.js, consumes /openapi.json
    │
    ├── docs/
    │   ├── josh-data-sources.md   # canonical inventory of v1 / v2 data sources
    │   ├── sources/               # per-data-source ingestion specs (probe findings, indicative schema)
    │   └── operations/            # operational runbooks (this directory)
    │
    └── design/                    # UI / brand design system

The previous iteration had a josh-postgres/ directory with extension init scripts; with the SQLite swap that directory is being removed. The substrate file lives on the host volume at /data/josh.db, mounted into each service container that needs it.
Why a shared package
Both josh-core (FastAPI + harness) and josh-ingester (ETL workers) need:
- Identical Pydantic models for the substrate tables
- The same Alembic migrations (one schema, one source of truth)
- The same DB connection helpers
- The same canonical-ID format helpers
We could duplicate these models across services and hand-sync. We don't, because:
- Drift is inevitable when two copies exist
- Refactors require changing N places instead of one
- The shape is genuinely shared, not coincidentally aligned
So: one Python package at shared/josh_substrate/, installable via pip install. Each service's Dockerfile copies it in and pip installs it before installing service-specific deps.
For the frontend (josh-web, when it exists), the chain is different — it's TypeScript, not Python. Approach: openapi-typescript http://localhost:8000/openapi.json -o src/types/api.ts. FastAPI auto-generates /openapi.json from the Pydantic models the API uses. So the same models drive backend types directly and frontend types via OpenAPI.
Why project-root build context
The single root Dockerfile uses the project root as Docker build context so it can reach across service boundaries:

    builder:
      context: .                # project root
      dockerfile: ./Dockerfile  # project-root Dockerfile

This lets the Dockerfile COPY shared/josh_substrate and each service's source tree in turn. The Dockerfile installs all three service packages into a single image; roles in config/deploy.yml pick which CMD to run.
The cost: a project-root .dockerignore is necessary to keep build-context uploads small. Things like docs/, design/, and .claude/ are excluded.
One image, multiple roles
The substrate ships as one Kamal service (josh) with multiple roles defined in config/deploy.yml:

    kamal setup                    # bring up every active role
    kamal deploy                   # deploy all active roles
    kamal deploy --roles=web       # surgical: web only
    kamal deploy --roles=ingester  # surgical: ingester only

Roles share the same image but use different CMDs (uvicorn josh_core.main:app … for web; josh-ingester daemon for ingester; josh-embedder daemon for embedder when activated). The web role is fronted by kamal-proxy on 80/443; the others set proxy: false.
The substrate SQLite file at /data/josh.db lives on a host directory bind-mounted at /data into every role. There's no accessory and no Docker network DB target — the file is the boundary, the volume mount is the wire. The root config/deploy.yml declares volumes: ["/data:/data"] globally, inherited by every role. See the substrate-single-image-deploy spec for the trade-offs (image bloat, activation path for deferred roles, etc.).
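To make "the file is the boundary" concrete, here is a minimal Python sketch in which two independent connections stand in for the ingester and web roles and open the same SQLite file. It uses the standard-library sqlite3 module and a temporary path in place of /data/josh.db. The helper name open_substrate, the documents table, and the WAL journal mode are illustrative assumptions, not the project's actual schema or settings.

```python
import os
import sqlite3
import tempfile


def open_substrate(path: str) -> sqlite3.Connection:
    """Open the shared substrate file (hypothetical helper, not josh_substrate.db)."""
    conn = sqlite3.connect(path, timeout=5.0)
    # Assumption: WAL mode, so readers are not blocked by the single writer.
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


# Temporary directory standing in for the /data bind mount.
data_dir = tempfile.mkdtemp()
db_path = os.path.join(data_dir, "josh.db")

ingester = open_substrate(db_path)  # plays the ingester role (writer)
web = open_substrate(db_path)       # plays the web role (reader)

# Illustrative table only; the real schema comes from the Alembic migrations.
ingester.execute("CREATE TABLE documents (id TEXT PRIMARY KEY, title TEXT)")
ingester.execute("INSERT INTO documents VALUES (?, ?)", ("doc-1", "Example"))
ingester.commit()

# The second connection sees the committed row through the shared file.
rows = web.execute("SELECT id, title FROM documents").fetchall()
print(rows)

ingester.close()
web.close()
```

In the deployed setup the two connections would live in different containers, and the path would be the bind-mounted /data/josh.db rather than a temporary file.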
Kamal lives at the project root
Kamal runs from the project root: config/deploy.yml, .kamal/secrets, and .kamal/hooks/ are all there. One service, one config, one secrets file shared across every role. No symlinks needed because kamal's cwd is the same place every time.
Two .dockerignore patterns
Project-root .dockerignore (the active one, controls build context uploads to the remote builder):
- Excludes .git/, node_modules/, __pycache__/, docs/, design/, .claude/, secrets
- Keeps the SSH-uploaded build context small and free of secrets
We don't have per-service .dockerignore files anymore. They were obsolete once the build context moved to project root — Docker only reads .dockerignore from the build-context root.
How services consume the shared package
The root Dockerfile installs shared/josh_substrate first (with the [all-providers] extra, which is the union of every role's substrate-side deps), then pip-installs each service package into the same image:

    FROM python:3.12-slim
    WORKDIR /app

    # Layer 1: substrate (heavy: torch, sentence-transformers, alembic, …).
    # Sits at the bottom of the cache stack because it changes rarely.
    COPY shared/josh_substrate /tmp/josh_substrate
    RUN pip install --no-cache-dir '/tmp/josh_substrate[all-providers]'

    # Alembic config: pre-deploy hook runs `alembic upgrade head` from /app
    # against the web role's container. script_location points into the
    # installed josh_substrate package.
    COPY shared/josh_substrate/alembic.ini /app/alembic.ini

    # Layer 2: josh-core (FastAPI). `josh_core/` is the top-level wheel package.
    COPY josh-core/pyproject.toml /tmp/josh-core/pyproject.toml
    COPY josh-core/josh_core /tmp/josh-core/josh_core
    RUN pip install --no-cache-dir /tmp/josh-core

    # Layer 3: josh-ingester. Installs `josh-ingester` console script.
    COPY josh-ingester/pyproject.toml /tmp/josh-ingester/pyproject.toml
    COPY josh-ingester/josh_ingester /tmp/josh-ingester/josh_ingester
    RUN pip install --no-cache-dir /tmp/josh-ingester

    # Layer 4: josh-embedder. Installs `josh-embedder` console script. Built
    # into the image even when the role is commented out in deploy.yml;
    # activation is config-only.
    COPY josh-embedder/pyproject.toml /tmp/josh-embedder/pyproject.toml
    COPY josh-embedder/src /tmp/josh-embedder/src
    RUN pip install --no-cache-dir /tmp/josh-embedder

    # Default CMD = web role; ingester/embedder override in config/deploy.yml.
    CMD ["uvicorn", "josh_core.main:app", "--host", "0.0.0.0", "--port", "8000"]

Layers are ordered for cache reuse: substrate at the bottom (changes rarely), then service packages.
The sqlite-vec PyPI package bundles the compiled extension binary. Connection helpers in josh_substrate.db load it via conn.load_extension(...) after enabling extension loading on the aiosqlite connection. If extension loading needs to happen on a connection type that aiosqlite doesn't expose, fall back to wrapping apsw connections instead — same SQL, different driver.
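As a rough illustration of that loading sequence, the sketch below opens a connection, applies a PRAGMA, and tries to load sqlite-vec, degrading gracefully when the package or extension loading is unavailable. To stay self-contained it uses the standard-library sqlite3 module instead of aiosqlite. The function name connect_substrate and the foreign_keys PRAGMA are assumptions for illustration, not the contents of josh_substrate.db.

```python
import sqlite3


def connect_substrate(path: str) -> tuple[sqlite3.Connection, bool]:
    """Open a connection and try to load sqlite-vec (hypothetical helper).

    Returns the connection and a flag saying whether the extension loaded.
    """
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys=ON")  # assumption: illustrative PRAGMA only
    vec_loaded = False
    try:
        import sqlite_vec  # PyPI package that bundles the compiled extension

        conn.enable_load_extension(True)
        try:
            sqlite_vec.load(conn)
            vec_loaded = True
        finally:
            # Re-disable extension loading once the extension is in place.
            conn.enable_load_extension(False)
    except (ImportError, AttributeError, sqlite3.Error):
        # Package not installed, or this Python build disallows extensions.
        vec_loaded = False
    return conn, vec_loaded


conn, vec_loaded = connect_substrate(":memory:")
if vec_loaded:
    print("sqlite-vec", conn.execute("SELECT vec_version()").fetchone()[0])
else:
    print("sqlite-vec not available; plain SQLite only")
```

Returning the flag instead of raising lets callers that only need plain SQL or FTS5 keep working on hosts where the extension cannot be loaded; a production helper might prefer to fail loudly.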
Local development
We don't ship a docker-compose.yml. Local dev runs against the deployed substrate (SSH tunnel into josh-core's /health, query the SQLite file remotely, etc.) or by running josh-core and josh-ingester directly via python -m against a local SQLite file. Provisioning a fresh test/staging server is fast; see new-host-setup.html.
When OSS Foundation packaging ships at Step 1 launch, we'll add an OSS-friendly path (single docker run, prebuilt image, or a fresh compose file) sized for self-hosters.
Adding a new service
When we add a service (next up: josh-ingester):
- Create the directory: mkdir -p josh-ingester/{ingester,config}
- Add the Dockerfile following the shared-package consumption pattern above.
- Add josh-ingester/pyproject.toml for service-specific deps (e.g., httpx, lxml, apscheduler, typer, tenacity, structlog) and register it under the workspace's [tool.uv.workspace] members in the root pyproject.toml.
- Add josh-ingester/config/deploy.yml:
  - service: josh-ingester
  - image: yourorg/josh-ingester
  - Same builder shape as josh-core (context: ../, dockerfile: ./Dockerfile)
  - No accessories. The substrate SQLite file is shared via the host volume mount.
  - Volume mount the /data host directory: volumes: ["/data:/data"]. Same as josh-core.
  - DB target env: SUBSTRATE_DB_PATH=/data/josh.db.
  - For non-web (worker) services, add primary_role: <your_role> at the top level (Kamal v2 defaults to expecting a web role; without this, deploy fails with primary_role web isn't defined).
  - For non-web services, opt out of kamal-proxy registration by setting proxy: false at the role level (top-level proxy: false is invalid; Kamal v2 expects proxy: to be a hash, not a boolean):

        servers:
          workers:
            hosts: [your.host.example]
            proxy: false  # workers have no HTTP; skip proxy registration

- Dockerfile shape. Workers should idle, not run a one-shot command:
  - For HTTP services (josh-core): CMD ["uvicorn", "josh_core.main:app", ...] (long-running server).
  - For workers (josh-ingester): CMD ["sleep", "infinity"]. The container idles; jobs run via kamal app exec --reuse 'python -m josh_ingester ...'. Don't use ENTRYPOINT for workers; Kamal's per-role cmd: override is appended to the entrypoint, so ENTRYPOINT ["python","-m","josh_ingester"] + cmd: tail -f /dev/null becomes python -m josh_ingester tail -f /dev/null and the container exits immediately.
- Add a role to config/deploy.yml at the project root: a new servers.<rolename> block with hosts, cmd, and proxy: false if it's a worker.
- Extend the root Dockerfile to COPY + pip install the new service package alongside the existing ones.
- Update .dockerignore if anything in the new service needs excluding.
- Deploy: env LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 kamal deploy --roles=<rolename> from project root.
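The ENTRYPOINT caveat in the Dockerfile-shape step above follows from how Docker combines an exec-form ENTRYPOINT with the command it is given: the two lists are concatenated into a single argv. The small Python model below only illustrates that rule; effective_argv is a hypothetical name and does not call Docker or Kamal.

```python
import shlex


def effective_argv(entrypoint: list[str], cmd: list[str]) -> list[str]:
    """Model of exec-form ENTRYPOINT + CMD: the command is appended as arguments."""
    return entrypoint + cmd


# Pitfall: an ENTRYPOINT is set and the role-level cmd is meant to idle the container.
bad = effective_argv(["python", "-m", "josh_ingester"], shlex.split("tail -f /dev/null"))

# Recommended shape: no ENTRYPOINT, so the idle command runs as written.
good = effective_argv([], ["sleep", "infinity"])

print(shlex.join(bad))
print(shlex.join(good))
```

In the first case the idle command ends up as arguments to the ingester module instead of running on its own, which is why the worker image relies on CMD alone.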
Long-running jobs in worker containers
kamal app exec --reuse 'python -m ingester backfill <source>' works for short jobs. For long jobs (the CRS full backfill is ~6 hours), the local Kamal client process can be killed by client-side timeouts, taking the SSH session with it. The right pattern: detach via docker exec -d so the work decouples from any SSH session:

    CONTAINER=$(ssh josh 'docker ps --filter label=service=josh --filter label=role=ingester --format "{{.Names}}"')
    ssh josh "docker exec -d $CONTAINER sh -c 'josh-ingester backfill <source> > /data/<source>-backfill.log 2>&1'"

    # Monitor progress periodically; the runner writes incremental counters
    ssh josh "docker exec $CONTAINER josh-ingester status --source=<source> --limit=1"

The container's CMD is sleep infinity, so it stays alive after the detached job finishes. The log lands on /data/ (the bind-mounted volume) so it's visible from the host.
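If the detach pattern ends up scripted from Python rather than typed by hand, the quoting is the fiddly part. The sketch below builds the ssh argv for a detached backfill using shlex.quote and does not execute anything. The function name detached_backfill_argv, the container name, and the source identifier "crs" are assumptions for illustration.

```python
import shlex


def detached_backfill_argv(host: str, container: str, source: str) -> list[str]:
    """Build the ssh argv that starts a detached backfill (hypothetical helper)."""
    log_path = f"/data/{source}-backfill.log"
    # Command run inside the container; output is redirected to the shared volume.
    inner = f"josh-ingester backfill {shlex.quote(source)} > {shlex.quote(log_path)} 2>&1"
    # docker exec -d returns immediately, so the job outlives the SSH session.
    remote = f"docker exec -d {shlex.quote(container)} sh -c {shlex.quote(inner)}"
    return ["ssh", host, remote]


# Hypothetical container name and source identifier, for illustration only.
argv = detached_backfill_argv("josh", "josh-ingester-abc", "crs")
print(argv[2])
```

The resulting list can be passed to subprocess.run(argv, check=True); because docker exec -d returns as soon as the job starts, the call finishes quickly even for a multi-hour backfill.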
Adding to the shared substrate package
When we add a new substrate concept (a new table, a new ID format, a new DB helper):
- Schema work: add the migration in shared/josh_substrate/src/josh_substrate/migrations/versions/.
- Models work: add Pydantic models in shared/josh_substrate/src/josh_substrate/models/.
- Helpers work: add to shared/josh_substrate/src/josh_substrate/db.py or ids.py.
- Bump version in shared/josh_substrate/pyproject.toml (semver: minor for additive, major for breaking).
- Rebuild and redeploy every consuming service. Kamal picks up the changed shared package on its next build because the COPY shared/josh_substrate layer invalidates.
- Apply the migration (currently manual via kamal app exec --reuse "alembic upgrade head"; eventually via the pre-deploy hook).
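The version-bump rule in the steps above (minor for additive changes, major for breaking ones) can be written down as a tiny function. This is only a sketch of the rule as stated; bump_version and the example version numbers are hypothetical, and nothing here edits pyproject.toml.

```python
def bump_version(version: str, change: str) -> str:
    """Return the next version: 'additive' bumps minor, 'breaking' bumps major."""
    major, minor, _patch = (int(part) for part in version.split("."))
    if change == "breaking":
        return f"{major + 1}.0.0"
    if change == "additive":
        return f"{major}.{minor + 1}.0"
    raise ValueError(f"unknown change kind: {change!r}")


# Example versions are made up for illustration.
print(bump_version("0.3.1", "additive"))
print(bump_version("0.3.1", "breaking"))
```

A new nullable column or a new helper would count as additive under this rule, while renaming or dropping a table that services read would count as breaking.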
Convenience scripts (planned)
When the multi-service workflow gets repetitive, add bin/ wrapper scripts:
- bin/ingester: wraps kamal app exec --reuse --roles=ingester "josh-ingester $@"
- bin/migrate: wraps kamal app exec --reuse --roles=web "alembic upgrade head"
Not added yet — wait until the friction is real.
What goes where
| Concern | Lives in |
|---|---|
| Pydantic models for substrate tables | shared/josh_substrate/src/josh_substrate/models/ |
| Alembic migrations | shared/josh_substrate/src/josh_substrate/migrations/versions/ |
| FastAPI routes, request/response shapes | josh-core/app/ |
| LLM harness, tool definitions | josh-core/app/harness/ (when added) |
| Source modules (Federal Register, Bills, etc.) | josh-ingester/ingester/sources/ (when added) |
| ETL framework (Source protocol, runner, scheduler) | josh-ingester/ingester/ (when added) |
| Agent UI components | josh-web/src/ (when added) |
| TypeScript types from API | josh-web/src/types/api.ts (generated) |
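| Postgres extension init scripts (previous iteration; directory being removed with the SQLite swap) | josh-postgres/init/ |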
| Kamal hooks (pre-deploy migration, etc.) | <service>/.kamal/hooks/ (per-service) |
| Per-data-source schemas + ingestion specs | docs/sources/ |
| Operational runbooks | docs/operations/ |
| Product / design docs | docs/ |
Public repos
Josh ships as three public repos: josh-foundation (the substrate, the ingester, and the FastAPI core), josh-substrate (the pip-installable schema package: Pydantic models, Alembic migrations, ID helpers, snapshot loader), and josh-eval (the benchmark). License: Apache 2.0 for foundation + substrate; MIT for the benchmark.