diff --git a/.gitignore b/.gitignore
index 06e6538..f263abb 100644
--- a/.gitignore
+++ b/.gitignore
@@ -9,3 +9,16 @@ caddy/.env
 # Built Caddy binary (rebuild from caddy/Dockerfile instead of committing 50MB)
 caddy/caddy
 caddy/*.bin
+
+# Local operational artifacts — DB dumps, store exports, validation runs.
+# These contain REAL secrets + account/mail data. Never commit.
+_backup/
+_validate/
+*.dump
+# Stalwart store export/import dirs (stalwart --export/--import)
+export/
+*.export
+
+# NB: config/config.json IS committed on purpose — it's the v0.16 bootstrap
+# config and is secret-free (DB password comes from $STALWART_DB_PASSWORD via
+# the EnvironmentVariable secret type). Don't add it here.
diff --git a/CLAUDE.md b/CLAUDE.md
index 646bc19..0ce3cc1 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -94,3 +94,80 @@ healthcheck, ephemeral OAuth auth). Don't drift it. Tag: `tag:stalwart`.
   public surface, and it lives in `caddy/`.
 - Don't commit `.env` or a built Caddy binary (see `.gitignore`).
 - Don't break the sidecar netns boundary with bridge networks or host ports.
+
+## Lessons learned — v0.16 first real run (2026-06)
+
+The pinned image is `stalwartlabs/stalwart:v0.16.7`, and v0.16 changed the config
+model enough that most of the toml-era notes above are obsolete. Reality:
+
+### Config model (supersedes the `.env`/`config.toml`/`%{env}%` notes above)
+- Config is a single **JSON** file the image reads from `--config
+  /etc/stalwart/config.json`. It describes **only the datastore**. The root
+  object *is* the datastore:
+  ```json
+  { "@type": "PostgreSql", "host": "the-record-prod.tail7b1641.ts.net",
+    "port": 5432, "database": "stalwart", "authUsername": "stalwart",
+    "authSecret": { "@type": "EnvironmentVariable", "variableName": "STALWART_DB_PASSWORD" } }
+  ```
+- **TOML is gone. The `%{env:NAME}%` macro is gone.** Secrets use the
+  `EnvironmentVariable` secret type (field `variableName`); a literal uses the
+  `Value` type (field **`secret`**, not `value`). `config/config.toml` is dead —
+  kept only as historical reference.
+- **Everything else lives in Postgres** (domains, accounts, listeners, ACME,
+  blob/redis store wiring, proxy trust, DKIM, spam) and is managed via the web
+  UI or the `x:` JMAP objects: `x:DataStore` `x:InMemoryStore` `x:BlobStore`
+  `x:NetworkListener` `x:SystemSettings` `x:Account` `x:AcmeProvider` `x:Action`.
+  All are JMAP `*/get`/`*/set` against `/jmap` with a Bearer token; singletons
+  use `ids:["singleton"]`.
+
+### Persistence (this was the original "I keep losing settings" bug)
+- Bind-mount `./config/config.json:/etc/stalwart/config.json`; make
+  `/var/lib/stalwart` a **named** volume. The image VOLUME-declares
+  `/etc/stalwart` + `/var/lib/stalwart`; left unmounted they become **anonymous
+  volumes that get orphaned on every recreate** → config/state vanishes.
+
+### Store endpoints need a full FQDN + port
+- Bare MagicDNS names silently fail. `http://garage` → `http://garage.tail7b1641.ts.net:3900`;
+  `redis://slo-time-prod` → `redis://slo-time-prod.tail7b1641.ts.net:6379/3`
+  (keep the `/3` logical-DB index). A wrong blob endpoint also blocks the web-UI
+  install (the SPA unpacks to S3) and all message-body storage.
+
+### PROXY-protocol trust is PER-LISTENER, never global
+- Set `overrideProxyTrustedNetworks` (`100.64.0.0/10` + `fd7a:115c:a1e0::/48`)
+  on the L4-fronted **mail** listeners only (25/465/587/143/993). Setting the
+  **global** `proxyTrustedNetworks` makes the `:8080` admin/HTTP listener demand
+  a PROXY header too → direct browser hits get `ERR_CONNECTION_RESET`.
+- Adding/removing listeners (e.g. 143 IMAP-STARTTLS, 587 submission-STARTTLS,
+  not created by default) needs a **container restart** — a settings reload does
+  not rebind sockets.
+
+### One data store ⇒ exactly one Stalwart instance
+- Two instances on the same Postgres/Redis (a stray `docker run`, or
+  ephemeral-IP restart ghosts) cause ACME orders to go **INVALID**, corrupt
+  rate-limit/auto-ban state, and produce restart flapping. Ephemeral sidecar
+  nodes get a **new tailnet IP per restart**, leaving ghost idle Postgres
+  connections from dead incarnations (`pg_stat_activity` distinct `client_addr`
+  = a restart counter). Postgres being healthy ≠ Stalwart healthy.
+
+### Accounts / recovery
+- Locked out? Add `STALWART_RECOVERY_MODE=1` + `STALWART_RECOVERY_ADMIN=admin:`,
+  restart. Serves only `:8080`, pauses MTA/tasks, and **does not wipe** a
+  native-v0.16 DB (the "wipe" warning is only for migrating a v0.15 store). Mint
+  a token, fix the account, then remove both env vars and restart.
+- Normal web login is **OAuth/PKCE against the directory**; the recovery admin
+  is honoured only in recovery mode/bootstrap. Set a password via `x:Account/set`
+  `credentials` `@type:Password` with a **pre-hashed `$argon2id$…`** secret
+  (plaintext is stored as cleartext and rejected). Verify with **IMAP AUTH over
+  TLS**, not the web flow.
+
+### ACME
+- Account registration succeeds even when the challenge can't run — don't be
+  fooled. `dns-01` needs a DNS-provider API token; `http-01` needs the edge to
+  forward `:80` to Stalwart's HTTP listener. `INVALID` authorizations in the
+  store = challenges failing (often the multi-instance race above). Watch LE's
+  5-failed-validations/hour limit; test against staging.
+
+### Backups
+- `stalwart --export ` (read-only) dumps the whole store per subspace;
+  `--import` restores. Plus `pg_dump` of the `stalwart` DB. Both land in
+  `_backup/` / `_validate/` — **gitignored** (real secrets + mail data).
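
Note on the "Store endpoints need a full FQDN + port" lesson in the CLAUDE.md hunk: since bare MagicDNS names are described as failing silently, a small pre-flight check may be worth having. The sketch below is not part of this change and is not anything Stalwart provides; `is_fully_qualified_endpoint` is a hypothetical helper name, and "contains a dot" is only a rough stand-in for "is an FQDN".

```python
from urllib.parse import urlsplit


def is_fully_qualified_endpoint(url: str) -> bool:
    """Return True if the URL has a dotted hostname and an explicit port.

    Hypothetical helper: a heuristic pre-flight check, not a Stalwart API.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    try:
        port = parts.port  # raises ValueError if the port is malformed
    except ValueError:
        return False
    return "." in host and port is not None


if __name__ == "__main__":
    # Endpoint pairs taken from the notes above: bare name vs. FQDN + port.
    for url in (
        "http://garage",
        "http://garage.tail7b1641.ts.net:3900",
        "redis://slo-time-prod",
        "redis://slo-time-prod.tail7b1641.ts.net:6379/3",
    ):
        verdict = "ok" if is_fully_qualified_endpoint(url) else "REJECT"
        print(f"{verdict:6} {url}")
```

The check only looks at the URL's shape; it does not resolve the name or probe the port, so a passing endpoint can still be wrong.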
diff --git a/config/config.json b/config/config.json
new file mode 100644
index 0000000..72ca234
--- /dev/null
+++ b/config/config.json
@@ -0,0 +1,8 @@
+{
+  "@type": "PostgreSql",
+  "host": "the-record-prod.tail7b1641.ts.net",
+  "port": 5432,
+  "database": "stalwart",
+  "authUsername": "stalwart",
+  "authSecret": { "@type": "EnvironmentVariable", "variableName": "STALWART_DB_PASSWORD" }
+}
diff --git a/docker-compose.yml b/docker-compose.yml
index 89b5bbb..9223627 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -59,10 +59,15 @@ services:
       STALWART_SMARTHOST: ${STALWART_SMARTHOST}
       STALWART_FALLBACK_ADMIN_SECRET: ${STALWART_FALLBACK_ADMIN_SECRET}
     volumes:
-      - ./config/config.toml:/opt/stalwart-mail/etc/config.toml:ro
-      # Local working dir only (logs, ACME cache, queue spool). The bulk data
-      # lives in Postgres + Garage, not here — but Stalwart still wants a home.
-      - stalwart-data:/opt/stalwart-mail
+      # Bootstrap config (v0.16 JSON): tells Stalwart only where Postgres lives;
+      # all other settings live in the DB. Mounted at the image's default
+      # --config path (/etc/stalwart/config.json). Secret comes from the
+      # STALWART_DB_PASSWORD env above, referenced via the EnvironmentVariable
+      # secret type inside the file — so this stays commit-safe.
+      - ./config/config.json:/etc/stalwart/config.json:ro
+      # Working dir: ACME cert cache + outbound queue spool. Named volume (not
+      # anonymous) so a recreate doesn't orphan it and drop queued mail/certs.
+      - stalwart-data:/var/lib/stalwart
     depends_on:
       ts-stalwart:
         condition: service_healthy
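
Note on keeping `config/config.json` commit-safe: the file is committed on the premise that it carries no literal secret. One way to guard that premise is a check that flags any object using the `Value` secret type, which the CLAUDE.md notes describe as the inline-literal form. This is a minimal sketch and not part of the change; `literal_secrets` is a hypothetical helper, and it assumes every literal secret appears as an object with `"@type": "Value"`.

```python
import json

# The bootstrap config exactly as added in this diff.
BOOTSTRAP = """
{
  "@type": "PostgreSql",
  "host": "the-record-prod.tail7b1641.ts.net",
  "port": 5432,
  "database": "stalwart",
  "authUsername": "stalwart",
  "authSecret": { "@type": "EnvironmentVariable", "variableName": "STALWART_DB_PASSWORD" }
}
"""


def literal_secrets(node, path="$"):
    """Yield JSON paths of objects whose @type is "Value" (an inline literal).

    Hypothetical helper; assumes literals always use the "Value" secret type.
    """
    if isinstance(node, dict):
        if node.get("@type") == "Value":
            yield path
        for key, child in node.items():
            yield from literal_secrets(child, f"{path}.{key}")
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from literal_secrets(child, f"{path}[{index}]")


if __name__ == "__main__":
    offenders = list(literal_secrets(json.loads(BOOTSTRAP)))
    if offenders:
        raise SystemExit(f"literal secrets found at: {', '.join(offenders)}")
    print("bootstrap config is secret-free")
```

Pointed at the real file instead of the embedded string, a check like this could run as a pre-commit hook or CI step, so that an accidental switch from `EnvironmentVariable` to `Value` fails before it reaches the repository.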