Wayne Hayes 38ba2eb83d Harden mail edge: PG-race healthcheck gate, :443 SNI fan-out, docs

Fixes the root cause that was silently dropping Stalwart's cert/setting
writes, completes the public HTTPS endpoints, and captures the debugging
knowledge.

- docker-compose.yml: gate the ts-stalwart healthcheck on Postgres
  reachability (nc -z the-record-prod:5432) in addition to tailscaled
  health. Stalwart's depends_on: service_healthy can no longer release it
  into the window where the tailnet route to Postgres isn't up yet — which
  was failing table init and losing in-flight cert writes (-> rcgen).

- caddy/caddy.json + README: add the :443 SNI fan-out. mta-sts /
  autoconfig / autodiscover pass through to stalwart:443 (Stalwart
  terminates TLS with its wildcard cert; no proxy_protocol on :443).
  All other SNIs go to the box's web Caddy on :8443 (https_port 8443).
  L7 reverse_proxy is impossible here: CAA pins issuance to Stalwart's
  ACME account, so Caddy can't obtain its own cert for these names.

- acl-snippet.hujson: grant tcp:443 on reverse-proxy -> stalwart for the
  SNI pass-through.

- config/config.json: track the v0.16 bootstrap (commit-safe; the DB
  secret is an EnvironmentVariable reference, not inline).

- LESSONS.md: symptom -> cause -> fix notes (PG race, DNS-01/Spaceship
  dead key, auto-ban vs PROXY protocol, wildcard-requires-DNS-01, SNI
  pass-through, ephemeral sidecar IP, LE rate-limit checks).

- .gitignore: exclude _backup/ and _validate/ (DB dumps + an inline-secret
  config) and editor swap files. NEVER commit those.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-06-11 05:15:34 +01:00

4.8 KiB


tailwart — lessons learned

Hard-won notes from bringing the mail edge up. Each entry is symptom → cause → fix, ordered roughly by how much time each one cost. Read this before re-debugging.

1. Postgres startup race ate cert/setting writes

Symptom: TLS certs (manual import and ACME) would validate but never persist; Stalwart kept serving its rcgen self-signed fallback. Logs showed "Failed to create tables: error connecting to server" on most boots.

Cause: Stalwart shares the ts-stalwart sidecar's netns. Its depends_on only waited for the sidecar's own health (/healthz = "tailscaled up"), which flips green before the tailnet route to Postgres (the-record-prod:5432) is usable. Stalwart started into that gap, failed the DB connect, and any write in that window — including a freshly obtained cert — was silently lost.

Fix: the sidecar healthcheck now also requires Postgres to be reachable (nc -z … 5432), so depends_on: service_healthy can't release Stalwart into the race. See docker-compose.yml. First clean boot after this: zero PG errors, 4 live connections immediately.
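The gate described above can be sketched in Python. This is only an illustration of the logic, not the actual healthcheck (which uses nc -z in docker-compose.yml); the function names `tcp_reachable` and `sidecar_healthy` are hypothetical.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened (like `nc -z`)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def sidecar_healthy(tailscaled_up: bool, pg_host: str, pg_port: int = 5432) -> bool:
    """Healthy only when tailscaled is up AND Postgres answers over the tailnet."""
    return tailscaled_up and tcp_reachable(pg_host, pg_port)
```

The point of the second condition is that "tailscaled up" alone says nothing about whether the route to the-record-prod:5432 is usable yet.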

2. DNS-01 was blocked by a dead Spaceship API key

Symptom: "Failed to set DNS RRSet: Unauthorized" on every record; no cert issued; no _acme-challenge TXT ever set.

Cause: the cert design is ACME DNS-01 via the Spaceship provider (bundled in caddy/lego). The stored API key was invalid (recovery debris from an earlier config attempt). Note STALWART_ACME_PROVIDER / STALWART_ACME_TOKEN in .env are empty and not even passed through by compose — the provider + secret are entered in the admin UI (stored in the DB), not via env.

Gotcha: secret fields render blank in the Stalwart admin even when set (the S3 secret behaves identically). A blank field is not evidence it's unset.

Fix / how to verify a key directly (egresses the box's WAN IP, same as Stalwart):

curl -i 'https://spaceship.dev/api/v1/dns/records/<domain>?take=5&skip=0' \
  -H 'X-Api-Key: KEY' -H 'X-Api-Secret: SECRET'
# 401 application.unauthorized = bad key/secret or IP-restricted
# 200 = good

A fresh Spaceship key fixed it.

3. Stalwart's auto-ban vs PROXY protocol (the "8080 mystery")

Symptom: the edge box could relay mail fine but could not reach Stalwart's :8080 admin; connections were accepted and then immediately closed. It looked like "tagged devices rejected, user phone works."

Cause: Stalwart's fail2ban checks the proxied client IP (from the PROXY header) on the mail listeners, but the raw connection IP on the non-proxied admin listener. A banned edge-box IP therefore still relays mail (ban checked against the header IP) while direct →:8080 is dropped (checked against the box IP). Malformed probing of the mail ports re-arms the ban.

Fix: add 100.64.0.0/10 (and the box's WAN IP, which appears as the proxied client when you hit the box's own public hostname) to the fail2ban allow-list. Bans are in-memory — a Stalwart restart flushes them. Don't rapid-poll the mail ports to test.
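A toy model of the behaviour described above may make the asymmetry easier to see. It is a sketch of the observed logic, not Stalwart's code; `checked_ip`, `is_dropped`, and the 203.0.113.9 client address are hypothetical.

```python
import ipaddress

# Allow-list from the fix: the tailnet CGNAT range.
TAILNET_ALLOW = (ipaddress.ip_network("100.64.0.0/10"),)


def checked_ip(raw_peer, proxy_client, listener_uses_proxy):
    """Pick the address the ban check looks at for a given listener."""
    if listener_uses_proxy and proxy_client:
        return ipaddress.ip_address(proxy_client)  # client IP from the PROXY header
    return ipaddress.ip_address(raw_peer)  # raw TCP peer (the edge box)


def is_dropped(raw_peer, proxy_client, listener_uses_proxy, banned, allow=TAILNET_ALLOW):
    """Return True if the connection would be dropped by the ban list."""
    ip = checked_ip(raw_peer, proxy_client, listener_uses_proxy)
    if any(ip in net for net in allow):
        return False
    return ip in banned


edge = "100.79.87.80"
banned = {ipaddress.ip_address(edge)}
# Mail listener: the header IP is checked, so a banned edge box still relays.
mail_dropped = is_dropped(edge, "203.0.113.9", True, banned, allow=())
# Admin listener: the raw peer is checked, so the same box is dropped.
admin_dropped = is_dropped(edge, None, False, banned, allow=())
# With the allow-list in place the admin connection gets through.
admin_dropped_after_fix = is_dropped(edge, None, False, banned)
```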

4. The wildcard request required DNS-01 (why HTTP-01 was a dead end)

With "Additional Hostnames" left empty, Stalwart requests a wildcard (*.<domain>). Wildcards can only be issued via DNS-01 — HTTP-01 literally cannot satisfy them. We burned time on an HTTP-01 + Caddy-challenge-forwarding detour before realizing DNS-01 was the intended (and only viable) path. One wildcard cert then covers mail, mta-sts, autoconfig, autodiscover, etc.
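The constraint is simple enough to state as a rule. The sketch below encodes it, assuming Let's Encrypt's policy that wildcard identifiers are validated via DNS-01 only; `usable_challenges` is a hypothetical helper, not part of Stalwart or any ACME client.

```python
def usable_challenges(identifier: str) -> set[str]:
    """Return the ACME challenge types that can validate this identifier."""
    if identifier.startswith("*."):
        # Wildcards: DNS-01 is the only option.
        return {"dns-01"}
    return {"http-01", "tls-alpn-01", "dns-01"}


# A wildcard request rules out HTTP-01 before any challenge is attempted.
http01_possible = "http-01" in usable_challenges("*.example.org")
```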

5. `:443` web endpoints need SNI pass-through, not L7 proxy

MTA-STS / autoconfig / autodiscover serve over :443. You cannot L7 reverse_proxy them through Caddy, because the CAA record pins issuance to Stalwart's ACME account — Caddy can't get its own cert for those names. Stalwart holds the wildcard, so the edge passes TLS through by SNI. See caddy/README.md → "The HTTP side". Needed tcp:443 added to the reverse-proxy → stalwart ACL grant.
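The routing decision the edge makes on :443 can be sketched as a lookup on the SNI name. This is an illustration only; the real routing lives in caddy/caddy.json. The `localhost` host part for the web Caddy and the example.org names are assumptions.

```python
# Name prefixes that Stalwart serves itself with its wildcard cert.
MAIL_WEB_PREFIXES = ("mta-sts.", "autoconfig.", "autodiscover.")


def upstream_for_sni(sni: str) -> str:
    """Choose the TCP upstream for a TLS ClientHello based on its SNI name."""
    name = sni.lower().rstrip(".")
    if name.startswith(MAIL_WEB_PREFIXES):
        # Pass the TLS stream through untouched; Stalwart terminates it.
        return "stalwart:443"
    # Everything else goes to the box's web Caddy (https_port 8443).
    return "localhost:8443"
```

Because the decision uses only the ClientHello, the edge never needs a certificate for the mail names, which is what the CAA pin requires.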

6. The sidecar is ephemeral — never hardcode its tailnet IP

ts-stalwart runs with ?ephemeral=true, so its tailnet IP changes on re-registration (an ACL re-sync did this mid-debug: 100.112.26.122 → 100.79.87.80). Everything must use the MagicDNS name stalwart.tail7b1641.ts.net. A hardcoded IP will work until the next re-registration and then fail with "Network is unreachable".
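One way to catch this early is to scan config text for literal addresses in the tailnet range. The sketch below is a hypothetical lint helper, not something in the repo. Note that it would also flag the 100.64.0.0 in an allow-list entry such as 100.64.0.0/10, so treat hits as prompts to review rather than errors.

```python
import ipaddress
import re

TAILNET = ipaddress.ip_network("100.64.0.0/10")
IPV4_LIKE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def hardcoded_tailnet_ips(text: str) -> list[str]:
    """Return every literal IPv4 address in text that falls in the tailnet range."""
    hits = []
    for candidate in IPV4_LIKE.findall(text):
        try:
            addr = ipaddress.ip_address(candidate)
        except ValueError:
            continue  # looked like an IP but is not a valid one
        if addr in TAILNET:
            hits.append(candidate)
    return hits
```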

7. Don't trust crt.sh for rate-limit checks

crt.sh was flaky/empty all session. To gauge Let's Encrypt's weekly duplicate-cert limit, use certspotter instead: https://api.certspotter.com/v1/issuances?domain=<d>&include_subdomains=true. Also: LE limits have different windows. Failed validations are counted hourly (5/hr/host, the one a retry storm trips); issued duplicates are counted weekly (5/wk). A renewal task retrying every 10 min makes 6 attempts an hour, which is enough to trip the hourly limit; consolidate to a single task.
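Both limits can be checked with a little arithmetic. The sketch below assumes the issuance records have already been fetched and parsed from JSON; the field names `not_before` and `dns_names` are assumptions about the response shape (check the certspotter API docs), and `not_before` only approximates the actual issuance time. The function names are hypothetical.

```python
import math
from datetime import datetime, timedelta, timezone


def attempts_per_hour(interval_minutes: float) -> int:
    """Attempts that start within a 60-minute window at a fixed retry interval."""
    return math.ceil(60 / interval_minutes)


def duplicates_in_window(issuances, names, now, window=timedelta(days=7)):
    """Count issuances covering exactly `names` whose not_before is inside the window."""
    wanted = frozenset(n.lower() for n in names)
    count = 0
    for item in issuances:
        issued = datetime.fromisoformat(item["not_before"].replace("Z", "+00:00"))
        same_names = frozenset(n.lower() for n in item["dns_names"]) == wanted
        if same_names and timedelta(0) <= now - issued <= window:
            count += 1
    return count


# A 10-minute retry loop exceeds the 5/hr failed-validation limit.
too_fast = attempts_per_hour(10) > 5
```

A count of 5 from `duplicates_in_window` would mean the next identical request is likely to be refused until the oldest issuance ages out of the week.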
