tailwart/CLAUDE.md
Wayne Hayes d292fb0307 docs(CLAUDE): drop stale 'container has no IPv6' claim; align with LESSONS 8-9
The sidecar gained real IPv6 egress (commit 34422ba / LESSONS.md 9), but the
outbound pitfall still asserted 'no IPv6 / no AAAA->A fallback'. Reword to
reflect the fix while keeping the tailnet-relay guidance.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-11 20:45:53 -04:00

# CLAUDE.md — tailwart
Guidance for Claude Code in this repo. Read before editing.
## What this is
A play deployment of **Stalwart** (all-in-one mail/JMAP/IMAP/SMTP server) wired,
gratuitously, into **three** shared backends — Postgres, Redis, and Garage S3 —
to see how far the federatedSocial Tailscale-sidecar pattern stretches past the
fediverse apps. Target domain: `infinidim.net` (may become real later).
It is **self-contained and outside** `/opt/federatedSocial` on purpose: that's
an upstream clone that `git pull` overwrites. tailwart owns its own `.env`,
compose, config, ACL snippet, and Caddy build, and only *reads from the tailnet*
(shared infra over MagicDNS) at runtime.
## Architecture — two ends of one wire
```
 public IP host (tag:reverse-proxy)            tailnet-only mailbox
┌───────────────────────────┐                ┌────────────────────────┐
│ caddy/ (caddy-l4)         │    tailnet     │ ts-stalwart sidecar    │
│ :25 :465 :587 :143 :993 ──┼───WireGuard───▶│ stalwart (no WAN, no   │
│ PROXY protocol v2         │                │   host ports)          │
└───────────────────────────┘                └───────────┬────────────┘
L7 JMAP vhost on the main Caddy                          │
mail.infinidim.net → :8080                       ┌───────┴───────┐
                                                 ▼       ▼       ▼
                                             Postgres  Redis  Garage S3
                                           (the-record)(slo-time)(garage)
```
- **Mailbox** (`docker-compose.yml`): Stalwart in a Tailscale sidecar via
`network_mode: service:ts-stalwart`. Binds nothing on the host. All mail
ports listen on the tailnet only.
- **Edge** (`caddy/`): a layer-4 TCP proxy (Caddy + `caddy-l4`, pulled prebuilt
from caddyserver.com — no local `xcaddy` build, per `~/docs/caddy.md`). Pure
pass-through; Stalwart owns TLS. **Can run on a different machine** than the
mailbox — the key idea.
- **Backends**: data+fts → Postgres, blob → Garage S3, lookup/in-memory →
Redis. One stalwart role/db, one Garage bucket, one Redis logical DB.
## The `.env` contract
`.env` (gitignored) is the whole operator surface; `.env.example` is the
template. Both compose files read it. Secrets reach Stalwart as env vars and are
referenced from `config/config.toml` via `%{env:NAME}%` so the toml stays
commit-safe. Never hardcode a value that belongs in `.env` — except the two
spots a static file forces it: `caddy/caddy.json` dial targets and any
MagicDNS host in the toml.
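
A small pre-flight check can catch a `.env` that has fallen behind its template. The sketch below is illustrative only: `parse_env` and `missing_keys` are hypothetical helpers, not part of this repo, and the parser assumes plain `KEY=VALUE` lines (no `export`, no multi-line values).

```python
def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding quotes so KEY="x" and KEY=x compare equal.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(example_text: str, env_text: str) -> list[str]:
    """Keys the template declares that the real .env leaves unset or empty."""
    example = parse_env(example_text)
    actual = parse_env(env_text)
    return sorted(key for key in example if not actual.get(key))


# Inline strings stand in for the contents of .env.example and .env.
template = "STALWART_DB_NAME=\nSTALWART_REDIS_DB=\nSTALWART_SMARTHOST=\n"
current = "STALWART_DB_NAME=stalwart\nSTALWART_REDIS_DB=3\n"
print(missing_keys(template, current))
```

In practice the two strings would come from reading `.env.example` and `.env`; any key printed here is one the operator still has to fill in.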
## Sidecar boilerplate
Identical to federatedSocial's (TS_ACCEPT_DNS true, kernel networking, 127.0.0.1
healthcheck, ephemeral OAuth auth). Don't drift it. Tag: `tag:stalwart`.
## Prerequisites (shared tailnet infra — already running for the fediverse)
1. Postgres role + db: `stalwart` / `STALWART_DB_NAME`. Create via the
federatedSocial `bootstrap.sh` flow or a one-off `CREATE ROLE … LOGIN; CREATE
DATABASE … OWNER …`.
2. Garage bucket `stalwart-mail` + grant the shared access key access to it.
3. Redis: nothing to create — just use a dedicated logical DB index
(`STALWART_REDIS_DB`) so we don't collide with the apps.
4. Admin console: assign `tag:stalwart` to the OAuth client (Devices/Core +
Keys/AuthKeys) and add `acl-snippet.hujson` to the policy.
## Pitfalls (some learned the hard way next door)
- **Mail edge is layer 4, not layer 7.** Don't try to give the L4 ports a
normal Caddy vhost. SNI/Host routing doesn't apply to `:25`.
- **PROXY protocol or your mail reputation dies.** Without it Stalwart sees the
proxy's tailnet IP as every client → SPF/DNSBL/greylisting break. Both ends
must agree (caddy.json `proxy_protocol: v2` ↔ config `[server.proxy]
trusted-networks`).
- **Stalwart config drifts between versions and migrates into the admin store
after first boot.** `config/config.toml` is a strawman — verify keys against
the pinned image tag before trusting them. Pin the tag once it works.
- **`POSTGRES_PASSWORD`/role passwords only apply on an empty volume.** If a
password "doesn't work," the stored credential drifted — `ALTER USER`, don't
re-init. And never test a password over `127.0.0.1` against these Postgres
containers: pg_hba `trust`s loopback and accepts ANY password. Test over the
tailnet (scram) or you'll fool yourself.
- **Outbound :25 is usually blocked on VPS.** Set `STALWART_SMARTHOST`, and
prefer relaying over the tailnet (`100.x:587`) — it bypasses the VPS SMTP-port
blocks and, having no AAAA, sidesteps the v6-first trap. The sidecar now has
its **own IPv6 egress** (LESSONS.md 9), so dual-stack targets resolve too;
before that fix an AAAA-only path would hang (`os error 101`) with no fallback
to A. See LESSONS.md 8-9.
- **Mail forces WAN ports.** `:25` must be world-reachable for inbound
federation — this is the one place the tailnet-only model can't hold. Keep
submission/IMAP tailnet-only if you want a tighter surface.
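
To make the PROXY protocol pitfall concrete, here is a minimal sketch of the v2 header that the edge prepends to each TCP stream and that the mailbox reads to recover the real client address. It covers only the TCP-over-IPv4 case, and the helper names (`build_v2_header`, `parse_v2_source`) are hypothetical; neither Caddy nor Stalwart exposes these functions.

```python
import ipaddress
import struct

# 12-byte magic that opens every PROXY protocol v2 header.
PP2_SIGNATURE = b"\r\n\r\n\x00\r\nQUIT\n"


def build_v2_header(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> bytes:
    """Build a v2 header for a proxied TCP/IPv4 connection."""
    addresses = (
        ipaddress.IPv4Address(src_ip).packed
        + ipaddress.IPv4Address(dst_ip).packed
        + struct.pack("!HH", src_port, dst_port)
    )
    # 0x21 = version 2 + PROXY command; 0x11 = AF_INET + STREAM.
    return PP2_SIGNATURE + struct.pack("!BBH", 0x21, 0x11, len(addresses)) + addresses


def parse_v2_source(header: bytes) -> tuple[str, int]:
    """Return the original client (ip, port) carried in the header."""
    if header[:12] != PP2_SIGNATURE:
        raise ValueError("not a PROXY v2 header")
    ver_cmd, fam_proto, length = struct.unpack("!BBH", header[12:16])
    if ver_cmd != 0x21 or fam_proto != 0x11:
        raise ValueError("this sketch only handles PROXY over TCP/IPv4")
    body = header[16:16 + length]
    src_ip = str(ipaddress.IPv4Address(body[0:4]))
    (src_port,) = struct.unpack("!H", body[8:10])
    return src_ip, src_port


# Example addresses are made up: a public client and a tailnet mailbox.
header = build_v2_header("203.0.113.7", 51234, "100.64.0.5", 25)
print(parse_v2_source(header))
```

Without this header the only address visible on the socket is the proxy's tailnet IP, which is why SPF and DNSBL checks would all evaluate the wrong host.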
## What not to do
- Don't put files in `/opt/federatedSocial`. Read its `.env` if you must; never
write there.
- Don't add `ports:` to the Stalwart container — the edge proxy is the only
public surface, and it lives in `caddy/`.
- Don't commit `.env` or a built Caddy binary (see `.gitignore`).
- Don't break the sidecar netns boundary with bridge networks or host ports.
## Lessons learned — v0.16 first real run (2026-06)
The pinned image is `stalwartlabs/stalwart:v0.16.7`, and v0.16 changed the config
model enough that most of the toml-era notes above are obsolete. Reality:
### Config model (supersedes the `.env`/`config.toml`/`%{env}%` notes above)
- Config is a single **JSON** file the image reads from `--config
/etc/stalwart/config.json`. It describes **only the datastore**. The root
object *is* the datastore:
```json
{ "@type": "PostgreSql", "host": "the-record-prod.tail7b1641.ts.net",
"port": 5432, "database": "stalwart", "authUsername": "stalwart",
"authSecret": { "@type": "EnvironmentVariable", "variableName": "STALWART_DB_PASSWORD" } }
```
- **TOML is gone. The `%{env:NAME}%` macro is gone.** Secrets use the
`EnvironmentVariable` secret type (field `variableName`); a literal uses the
`Value` type (field **`secret`**, not `value`). `config/config.toml` is dead —
kept only as historical reference.
- **Everything else lives in Postgres** (domains, accounts, listeners, ACME,
blob/redis store wiring, proxy trust, DKIM, spam) and is managed via the web
UI or the `x:` JMAP objects: `x:DataStore` `x:InMemoryStore` `x:BlobStore`
`x:NetworkListener` `x:SystemSettings` `x:Account` `x:AcmeProvider` `x:Action`.
All are JMAP `*/get`/`*/set` against `/jmap` with a Bearer token; singletons
use `ids:["singleton"]`.
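
The secret-type rules above are easy to get wrong by hand, so a lint pass over `config.json` may help. This is a sketch under stated assumptions: `check_datastore_config` is a hypothetical helper, it knows only the two secret types named in these notes (other types may exist), and it treats a bare-string `authSecret` as an error, which is this sketch's guess and not something verified against the image.

```python
import json

# Secret @type -> the field that must carry its payload (per the notes above).
SECRET_FIELDS = {"EnvironmentVariable": "variableName", "Value": "secret"}


def check_datastore_config(text: str) -> list[str]:
    """Return a list of human-readable problems; empty means it looks sane."""
    problems: list[str] = []
    if "%{env:" in text:
        problems.append("%{env:NAME}% macro found; v0.16 no longer expands it")
    root = json.loads(text)
    if not isinstance(root, dict) or "@type" not in root:
        problems.append("root must be the datastore object with an @type")
        return problems
    secret = root.get("authSecret")
    if isinstance(secret, dict):
        kind = secret.get("@type")
        field = SECRET_FIELDS.get(kind)
        if field is None:
            problems.append(f"secret @type {kind!r} is unknown to this sketch")
        elif field not in secret:
            problems.append(f"{kind} secret needs field {field!r}")
    elif secret is not None:
        # Assumption: secrets are always typed objects in v0.16.
        problems.append("authSecret should be a typed object")
    return problems


sample = json.dumps({
    "@type": "PostgreSql",
    "authSecret": {"@type": "Value", "value": "oops"},
})
print(check_datastore_config(sample))
```

The sample deliberately uses `value` in place of `secret`, the mistake called out above, so the check reports exactly one problem.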
### Persistence (this was the original "I keep losing settings" bug)
- Bind-mount `./config/config.json:/etc/stalwart/config.json`; make
`/var/lib/stalwart` a **named** volume. The image VOLUME-declares
`/etc/stalwart` + `/var/lib/stalwart`; left unmounted they become **anonymous
volumes that get orphaned on every recreate** → config/state vanishes.
### Store endpoints need a full FQDN + port
- Bare MagicDNS names silently fail. `http://garage` → `http://garage.tail7b1641.ts.net:3900`;
`redis://slo-time-prod` → `redis://slo-time-prod.tail7b1641.ts.net:6379/3`
(keep the `/3` logical-DB index). A wrong blob endpoint also blocks the web-UI
install (the SPA unpacks to S3) and all message-body storage.
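
Since a bare name fails silently, it can be worth checking store URLs before pasting them into the UI. The following is a rough sketch using only the standard library; `endpoint_problems` is a hypothetical helper, and the rule that a `redis://` URL must end in a numeric DB index reflects this deployment's convention, not a Redis requirement.

```python
from urllib.parse import urlsplit


def endpoint_problems(url: str) -> list[str]:
    """Flag store endpoints that lack an FQDN, a port, or a Redis DB index."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    problems: list[str] = []
    if "." not in host:
        problems.append(f"{url}: bare hostname, use the full MagicDNS FQDN")
    if parts.port is None:
        problems.append(f"{url}: no explicit port")
    if parts.scheme == "redis" and not parts.path.strip("/").isdigit():
        problems.append(f"{url}: no logical-DB index such as /3")
    return problems


for candidate in (
    "http://garage",
    "http://garage.tail7b1641.ts.net:3900",
    "redis://slo-time-prod",
    "redis://slo-time-prod.tail7b1641.ts.net:6379/3",
):
    print(candidate, "->", endpoint_problems(candidate) or "ok")
```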
### PROXY-protocol trust is PER-LISTENER, never global
- Set `overrideProxyTrustedNetworks` (`100.64.0.0/10` + `fd7a:115c:a1e0::/48`)
on the L4-fronted **mail** listeners only (25/465/587/143/993). Setting the
**global** `proxyTrustedNetworks` makes the `:8080` admin/HTTP listener demand
a PROXY header too → direct browser hits get `ERR_CONNECTION_RESET`.
- Adding/removing listeners (e.g. 143 IMAP-STARTTLS, 587 submission-STARTTLS,
not created by default) needs a **container restart** — a settings reload does
not rebind sockets.
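
The per-listener rule can be modelled in a few lines. This sketch only illustrates the decision logic described above; it is not how Stalwart implements it, and `is_trusted_proxy` and `expects_proxy_header` are hypothetical names.

```python
import ipaddress

# Tailnet ranges trusted to send a PROXY header (from the notes above).
TAILNET_RANGES = [
    ipaddress.ip_network("100.64.0.0/10"),
    ipaddress.ip_network("fd7a:115c:a1e0::/48"),
]

# Only the L4-fronted mail listeners carry the override.
L4_FRONTED_PORTS = {25, 465, 587, 143, 993}


def is_trusted_proxy(peer: str) -> bool:
    """True when the connecting peer sits inside a trusted tailnet range."""
    addr = ipaddress.ip_address(peer)
    return any(addr in net for net in TAILNET_RANGES)


def expects_proxy_header(listener_port: int, peer: str) -> bool:
    """A PROXY header is expected only on mail ports, from tailnet peers."""
    return listener_port in L4_FRONTED_PORTS and is_trusted_proxy(peer)


print(expects_proxy_header(25, "100.101.102.103"))
print(expects_proxy_header(8080, "100.101.102.103"))
```

With a global trust setting the second call would effectively become true as well, which is the situation that resets direct browser connections to `:8080`.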
### One data store ⇒ exactly one Stalwart instance
- Two instances on the same Postgres/Redis (a stray `docker run`, or
ephemeral-IP restart ghosts) cause ACME orders to go **INVALID**, corrupt
rate-limit/auto-ban state, and produce restart flapping. Ephemeral sidecar
nodes get a **new tailnet IP per restart**, leaving ghost idle Postgres
connections from dead incarnations (`pg_stat_activity` distinct `client_addr`
= a restart counter). Postgres being healthy ≠ Stalwart healthy.
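
One way to spot ghosts is to group `pg_stat_activity` by `client_addr` and look for addresses that hold only idle connections. The sketch below pairs a query with a pure-Python summary; the database name `stalwart` comes from the config above, while `connections_per_client` and `idle_only_clients` are hypothetical helpers, and "idle only" is a heuristic, since a live instance can also be momentarily idle.

```python
from collections import defaultdict

# Run this on the Postgres host with any client; rows feed the helpers below.
GHOST_QUERY = """
SELECT client_addr, state, count(*) AS conns
FROM pg_stat_activity
WHERE datname = 'stalwart'
GROUP BY client_addr, state
ORDER BY client_addr;
"""

Row = tuple[str, str, int]  # (client_addr, state, conns)


def connections_per_client(rows: list[Row]) -> dict[str, int]:
    """Total connections per tailnet address; more than one key is suspect."""
    totals: dict[str, int] = defaultdict(int)
    for addr, _state, conns in rows:
        totals[addr] += conns
    return dict(totals)


def idle_only_clients(rows: list[Row]) -> list[str]:
    """Addresses whose every connection is idle: likely dead incarnations."""
    states: dict[str, set[str]] = defaultdict(set)
    for addr, state, _conns in rows:
        states[addr].add(state)
    return sorted(addr for addr, seen in states.items() if seen == {"idle"})


# Made-up rows: one stale address and one live one.
rows = [("100.64.0.5", "idle", 3), ("100.64.0.9", "active", 1), ("100.64.0.9", "idle", 2)]
print(connections_per_client(rows))
print(idle_only_clients(rows))
```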
### Accounts / recovery
- Locked out? Add `STALWART_RECOVERY_MODE=1` + `STALWART_RECOVERY_ADMIN=admin:<pw>`,
restart. Serves only `:8080`, pauses MTA/tasks, and **does not wipe** a
native-v0.16 DB (the "wipe" warning is only for migrating a v0.15 store). Mint
a token, fix the account, then remove both env vars and restart.
- Normal web login is **OAuth/PKCE against the directory**; the recovery admin
is honoured only in recovery mode/bootstrap. Set a password via `x:Account/set`
`credentials` `@type:Password` with a **pre-hashed `$argon2id$…`** secret
(plaintext is stored as cleartext and rejected). Verify with **IMAP AUTH over
TLS**, not the web flow.
### ACME
- Account registration succeeds even when the challenge can't run — don't be
fooled. `dns-01` needs a DNS-provider API token; `http-01` needs the edge to
forward `:80` to Stalwart's HTTP listener. `INVALID` authorizations in the
store = challenges failing (often the multi-instance race above). Watch LE's
5-failed-validations/hour limit; test against staging.
### Backups
- `stalwart --export <dir>` (read-only) dumps the whole store per subspace;
`--import` restores. Plus `pg_dump` of the `stalwart` DB. Both land in
`_backup/` / `_validate/`, which are **gitignored** (real secrets + mail data).