# tailwart — lessons learned
Hard-won notes from bringing the mail edge up. Each entry is **symptom → cause →
fix**, ordered roughly by how long it cost. Read this before re-debugging.
## 1. Postgres startup race ate cert/setting writes

**Symptom:** TLS certs (manual import *and* ACME) would validate but never
persist; Stalwart kept serving its `rcgen` self-signed fallback. Logs showed
`Failed to create tables: error connecting to server` on most boots.

**Cause:** Stalwart shares the `ts-stalwart` sidecar's netns. Its `depends_on`
only waited for the sidecar's *own* health (`/healthz` = "tailscaled up"), which
flips green **before** the tailnet route to Postgres (`the-record-prod:5432`) is
usable. Stalwart started into that gap, failed the DB connect, and any write in
that window, including a freshly obtained cert, was silently lost.

**Fix:** the sidecar healthcheck now also requires Postgres to be reachable
(`nc -z … 5432`), so `depends_on: service_healthy` can't release Stalwart into
the race. See `docker-compose.yml`. First clean boot after this: zero PG errors,
4 live connections immediately.
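The combined check can be sketched roughly as follows. This is an illustration, not the script from `docker-compose.yml`: the `HEALTHZ_URL` default, the `PG_HOST`/`PG_PORT` variable names, the `sidecar_healthy` function name, and the presence of `curl` in the sidecar image are all assumptions.

```bash
#!/bin/sh
# Sketch of a combined sidecar healthcheck. HEALTHZ_URL and the variable names
# are assumptions for illustration, not values taken from docker-compose.yml.
HEALTHZ_URL="${HEALTHZ_URL:-http://127.0.0.1:9002/healthz}"
PG_HOST="${PG_HOST:-the-record-prod}"
PG_PORT="${PG_PORT:-5432}"

sidecar_healthy() {
  # 1. tailscaled itself must report healthy.
  curl -fsS -o /dev/null --max-time 3 "$HEALTHZ_URL" || return 1
  # 2. The tailnet route to Postgres must actually accept a TCP connection.
  nc -z -w 3 "$PG_HOST" "$PG_PORT" || return 1
}
```

Used as a healthcheck command, this would be `sidecar_healthy || exit 1`. The point of the second step is that the sidecar only turns healthy once the thing Stalwart needs at boot is reachable.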
## 2. DNS-01 was blocked by a dead Spaceship API key

**Symptom:** `Failed to set DNS RRSet: Unauthorized` on every record; no cert
issued; no `_acme-challenge` TXT ever set.

**Cause:** the cert design is ACME **DNS-01** via the **Spaceship** provider
(bundled in caddy/lego). The stored API key was invalid (recovery debris from an
earlier config attempt). Note that `STALWART_ACME_PROVIDER` / `STALWART_ACME_TOKEN`
in `.env` are **empty and not even passed through by compose**; the provider and
secret are entered in the **admin UI** (stored in the DB), not via env.

**Gotcha:** secret fields render **blank** in the Stalwart admin even when set
(the S3 secret behaves identically). A blank field is *not* evidence it's unset.

**Fix / how to verify a key directly** (the request egresses from the box's WAN
IP, same as Stalwart):
```bash
curl -i 'https://spaceship.dev/api/v1/dns/records/<domain>?take=5&skip=0' \
  -H 'X-Api-Key: KEY' -H 'X-Api-Secret: SECRET'
# 401 application.unauthorized = bad key/secret or IP-restricted
# 200 = good
```
A fresh Spaceship key fixed it.
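For repeated checks, the same request can be reduced to a one-line verdict. A minimal sketch: `spaceship_key_status` is a hypothetical helper name, the arguments are placeholders, and the status meanings are only the two recorded above.

```bash
# Sketch: wrap the key check and map the HTTP status to a verdict.
# Arguments: <domain> <api-key> <api-secret> (all placeholders).
spaceship_key_status() {
  domain="$1"; key="$2"; secret="$3"
  # -w '%{http_code}' prints only the status code; the body is discarded.
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    "https://spaceship.dev/api/v1/dns/records/${domain}?take=5&skip=0" \
    -H "X-Api-Key: ${key}" -H "X-Api-Secret: ${secret}")
  case "$code" in
    200) echo "ok" ;;
    401) echo "unauthorized: bad key/secret or IP-restricted"; return 1 ;;
    *)   echo "unexpected HTTP status: ${code}"; return 2 ;;
  esac
}
```

Run it from the box itself so the request leaves via the same WAN IP that Stalwart uses.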
## 3. Stalwart's auto-ban vs PROXY protocol (the "8080 mystery")

**Symptom:** the edge box could relay mail fine but could **not** reach
Stalwart's `:8080` admin: connections were accepted, then immediately closed. It
looked like "tagged devices rejected, user phone works."

**Cause:** Stalwart's fail2ban checks the **proxied client IP** (from the PROXY
header) on the mail listeners, but the **raw connection IP** on the non-proxied
admin listener. A banned edge-box IP therefore still relays mail (ban checked
against the header IP) while direct `→:8080` is dropped (checked against the box
IP). Malformed probing of the mail ports **re-arms** the ban.

**Fix:** add `100.64.0.0/10` (and the box's WAN IP, which appears as the proxied
client when you hit the box's own public hostname) to the fail2ban allow-list.
Bans are in-memory, so a Stalwart restart flushes them. **Don't rapid-poll the
mail ports** to test.
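To confirm that a given source IP is covered by the `100.64.0.0/10` allow-list entry, the range check is simple: first octet 100, second octet 64 to 127. A rough sketch follows; `in_tailnet_range` is a hypothetical helper name, and it only inspects the first two octets rather than validating the full address.

```bash
# Sketch: is an IPv4 address inside 100.64.0.0/10?
# Only the first two octets matter for a /10 on this boundary.
in_tailnet_range() {
  first=${1%%.*}       # first octet
  rest=${1#*.}
  second=${rest%%.*}   # second octet
  [ "$first" = 100 ] && [ "$second" -ge 64 ] 2>/dev/null && [ "$second" -le 127 ]
}
```

For example, `in_tailnet_range 100.79.87.80 && echo covered` succeeds, while the box's WAN IP falls outside the range and therefore needs its own allow-list entry.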
## 4. The wildcard request *required* DNS-01 (why HTTP-01 was a dead end)

With "Additional Hostnames" left empty, Stalwart requests a **wildcard**
(`*.<domain>`). Wildcards can **only** be issued via DNS-01; HTTP-01 literally
cannot satisfy them. We burned time on an HTTP-01 + Caddy-challenge-forwarding
detour before realizing DNS-01 was the intended (and only viable) path. One
wildcard cert then covers `mail`, `mta-sts`, `autoconfig`, `autodiscover`, etc.
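When deciding whether a new hostname is already covered, remember that a wildcard matches exactly one label. The sketch below illustrates that rule; `covered_by_wildcard` is a hypothetical helper name and the check is purely string-based.

```bash
# Sketch: is <host> covered by the wildcard *.<domain>?
# A wildcard matches exactly one extra label, so neither the bare domain
# nor a.b.<domain> qualifies.
covered_by_wildcard() {
  host="$1"; domain="$2"
  label="${host%".$domain"}"            # strip the ".<domain>" suffix
  [ "$label" != "$host" ] || return 1   # suffix absent: other domain or apex
  [ -n "$label" ] || return 1           # nothing in front of the suffix
  case "$label" in
    *.*) return 1 ;;                    # more than one label deep
    *)   return 0 ;;
  esac
}
```

So `covered_by_wildcard mta-sts.example.com example.com` succeeds, while `a.b.example.com` and the bare `example.com` do not.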
## 5. `:443` web endpoints need SNI pass-through, not L7 proxy

MTA-STS / autoconfig / autodiscover are served over **:443**. You cannot L7
`reverse_proxy` them through Caddy, because the **CAA** record pins issuance to
Stalwart's ACME account, so Caddy can't get its own cert for those names.
Stalwart holds the wildcard, so the edge **passes TLS through** by SNI. See
`caddy/README.md` → "The HTTP side". This also required adding `tcp:443` to the
`reverse-proxy → stalwart` ACL grant.
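One way to confirm pass-through is working is to look at which certificate the edge actually presents for one of these names. A minimal sketch, with assumptions: `EDGE_HOST` is a placeholder for the edge box's public name, the function names are hypothetical, and matching on the string `rcgen` presumes the fallback cert mentions it in its subject or issuer.

```bash
# Sketch: print subject and issuer of the cert served for a given SNI name.
# Usage: served_cert <sni-name> [edge-host]
served_cert() {
  sni="$1"; edge="${2:-$EDGE_HOST}"
  openssl s_client -connect "${edge}:443" -servername "$sni" </dev/null 2>/dev/null \
    | openssl x509 -noout -subject -issuer
}

# Assumption: the self-signed fallback names rcgen in its subject/issuer.
looks_like_fallback() {
  served_cert "$@" | grep -qi 'rcgen'
}
```

If the issuer is the public CA and the subject is the wildcard, TLS is reaching Stalwart end to end. A Caddy-issued or fallback cert means the edge is terminating or Stalwart never persisted its cert.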
## 6. The sidecar is ephemeral — never hardcode its tailnet IP

`ts-stalwart` runs with `?ephemeral=true`, so its tailnet IP **changes on
re-registration** (an ACL re-sync did this mid-debug:
`100.112.26.122 → 100.79.87.80`). Everything must use the MagicDNS name
`stalwart.tail7b1641.ts.net`. A hardcoded IP will mysteriously start failing
with `Network is unreachable`.
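A quick way to catch leftover hardcoded addresses is to grep the config tree for anything in the tailnet range (`100.64.0.0/10`). This is a sketch only; `find_hardcoded_tailnet_ips` is a hypothetical helper name and the paths you pass in are up to you.

```bash
# Sketch: list lines containing an IPv4 address in 100.64.0.0/10
# (second octet 64-127) so they can be replaced with the MagicDNS name.
# Usage: find_hardcoded_tailnet_ips <file-or-dir>...
find_hardcoded_tailnet_ips() {
  grep -rnE '(^|[^0-9.])100\.(6[4-9]|[7-9][0-9]|1[01][0-9]|12[0-7])\.[0-9]{1,3}\.[0-9]{1,3}([^0-9.]|$)' "$@"
}
```

Any hit is a candidate for replacement with `stalwart.tail7b1641.ts.net`; no output (exit status 1 from `grep`) means nothing in that range was found.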
## 7. Don't trust crt.sh for rate-limit checks

crt.sh was flaky/empty all session. To gauge Let's Encrypt's weekly
duplicate-cert limit, use **certspotter** instead:
`https://api.certspotter.com/v1/issuances?domain=<d>&include_subdomains=true`.
Also: LE rate limits are counted separately per type. **Failed validations** are
hourly (5/hr/host, the one a retry storm trips); **issued duplicates** are
weekly (5/wk). A renewal task hammering every 10 min trips the hourly one;
consolidate to a single task.
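The arithmetic behind the retry-storm point, as a small sketch. It assumes every attempt fails validation and uses the hourly figure quoted above (5) as the default limit; both function names are hypothetical.

```bash
# Sketch: attempts per hour for a fixed retry interval, compared against the
# hourly failed-validation limit (default 5, as quoted in these notes).
attempts_per_hour() {
  echo $(( 60 / $1 ))   # $1 = retry interval in minutes (integer division)
}

exceeds_hourly_limit() {
  [ "$(attempts_per_hour "$1")" -gt "${2:-5}" ]
}
```

A 10-minute interval gives 6 attempts per hour, which is over the limit; a 15-minute interval gives 4, which stays under it.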