Compare commits
No commits in common. "45e06ed524e2534764e33433d35713faa0e26c65" and "e9febd037cc2740eb6ea0e51a642241e724ed66c" have entirely different histories.
45e06ed524...e9febd037c
5
.gitignore
vendored
@@ -22,8 +22,3 @@ export/
 # NB: config/config.json IS committed on purpose — it's the v0.16 bootstrap
 # config and is secret-free (DB password comes from $STALWART_DB_PASSWORD via
 # the EnvironmentVariable secret type). Don't add it here.
-
-# Editor swap / backup files
-*.swp
-*.swo
-*~
96
LESSONS.md
@@ -1,96 +0,0 @@
-# tailwart — lessons learned
-
-Hard-won notes from bringing the mail edge up. Each entry is **symptom → cause →
-fix**, ordered roughly by how long it cost. Read this before re-debugging.
-
-## 1. Postgres startup race ate cert/setting writes
-
-**Symptom:** TLS certs (manual import *and* ACME) would validate but never
-persist — Stalwart kept serving its `rcgen` self-signed fallback. Logs showed
-`Failed to create tables: error connecting to server` on most boots.
-
-**Cause:** Stalwart shares the `ts-stalwart` sidecar's netns. Its `depends_on`
-only waited for the sidecar's *own* health (`/healthz` = "tailscaled up"), which
-flips green **before** the tailnet route to Postgres (`the-record-prod:5432`) is
-usable. Stalwart started into that gap, failed the DB connect, and any write in
-that window — including a freshly obtained cert — was silently lost.
-
-**Fix:** the sidecar healthcheck now also requires Postgres to be reachable
-(`nc -z … 5432`), so `depends_on: service_healthy` can't release Stalwart into
-the race. See `docker-compose.yml`. First clean boot after this: zero PG errors,
-4 live connections immediately.
-
-## 2. DNS-01 was blocked by a dead Spaceship API key
-
-**Symptom:** `Failed to set DNS RRSet: Unauthorized` on every record; no cert
-issued; no `_acme-challenge` TXT ever set.
-
-**Cause:** the cert design is ACME **DNS-01** via the **Spaceship** provider
-(bundled in caddy/lego). The stored API key was invalid (recovery debris from an
-earlier config attempt). Note `STALWART_ACME_PROVIDER` / `STALWART_ACME_TOKEN`
-in `.env` are **empty and not even passed through by compose** — the provider +
-secret are entered in the **admin UI** (stored in the DB), not via env.
-
-**Gotcha:** secret fields render **blank** in the Stalwart admin even when set
-(the S3 secret behaves identically). A blank field is *not* evidence it's unset.
-
-**Fix / how to verify a key directly (egresses the box's WAN IP, same as
-Stalwart):**
-```bash
-curl -i 'https://spaceship.dev/api/v1/dns/records/<domain>?take=5&skip=0' \
-  -H 'X-Api-Key: KEY' -H 'X-Api-Secret: SECRET'
-# 401 application.unauthorized = bad key/secret or IP-restricted
-# 200 = good
-```
-A fresh Spaceship key fixed it.
-
-## 3. Stalwart's auto-ban vs PROXY protocol (the "8080 mystery")
-
-**Symptom:** the edge box could relay mail fine but could **not** reach
-Stalwart's `:8080` admin — connections accept then immediately close. Looked like
-"tagged devices rejected, user phone works."
-
-**Cause:** Stalwart's fail2ban checks the **proxied client IP** (from the PROXY
-header) on the mail listeners, but the **raw connection IP** on the non-proxied
-admin listener. A banned edge-box IP therefore still relays mail (ban checked
-against the header IP) while direct `→:8080` is dropped (checked against the box
-IP). Malformed probing of the mail ports **re-arms** the ban.
-
-**Fix:** add `100.64.0.0/10` (and the box's WAN IP, which appears as the proxied
-client when you hit the box's own public hostname) to the fail2ban allow-list.
-Bans are in-memory — a Stalwart restart flushes them. **Don't rapid-poll the mail
-ports** to test.
-
-## 4. The wildcard request *required* DNS-01 (why HTTP-01 was a dead end)
-
-With "Additional Hostnames" left empty, Stalwart requests a **wildcard**
-(`*.<domain>`). Wildcards can **only** be issued via DNS-01 — HTTP-01 literally
-cannot satisfy them. We burned time on an HTTP-01 + Caddy-challenge-forwarding
-detour before realizing DNS-01 was the intended (and only viable) path. One
-wildcard cert then covers `mail`, `mta-sts`, `autoconfig`, `autodiscover`, etc.
-
-## 5. `:443` web endpoints need SNI pass-through, not L7 proxy
-
-MTA-STS / autoconfig / autodiscover serve over **:443**. You cannot L7
-`reverse_proxy` them through Caddy, because the **CAA** record pins issuance to
-Stalwart's ACME account — Caddy can't get its own cert for those names. Stalwart
-holds the wildcard, so the edge **passes TLS through** by SNI. See
-`caddy/README.md` → "The HTTP side". Needed `tcp:443` added to the
-`reverse-proxy → stalwart` ACL grant.
-
-## 6. The sidecar is ephemeral — never hardcode its tailnet IP
-
-`ts-stalwart` runs with `?ephemeral=true`, so its tailnet IP **changes on
-re-registration** (an ACL re-sync did this mid-debug: `100.112.26.122 →
-100.79.87.80`). Everything must use the MagicDNS name
-`stalwart.tail7b1641.ts.net`. A hardcoded IP will mysteriously go
-`Network is unreachable`.
-
-## 7. Don't trust crt.sh for rate-limit checks
-
-crt.sh was flaky/empty all session. To gauge Let's Encrypt's weekly
-duplicate-cert limit, use **certspotter** instead:
-`https://api.certspotter.com/v1/issuances?domain=<d>&include_subdomains=true`.
-Also: LE limits are dimensioned — **failed validations** are hourly (5/hr/host,
-the one a retry storm trips), **issued duplicates** are weekly (5/wk). A renewal
-task hammering every 10 min trips the hourly one; consolidate to a single task.
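Lesson 3 in the removed LESSONS.md turns on which address each listener feeds to the ban check. The toy model below is only a sketch of that described behaviour, not Stalwart's code: `ip_checked_for_ban`, `is_dropped`, and the sample client address `203.0.113.7` are hypothetical names and values chosen for illustration.

```python
import ipaddress
from typing import Iterable, Optional, Set


def ip_checked_for_ban(listener_uses_proxy_protocol: bool,
                       raw_ip: str,
                       proxied_ip: Optional[str]) -> str:
    """Pick the address the ban lookup sees, as lesson 3 describes it.

    Mail listeners (PROXY protocol) use the client IP from the PROXY header;
    the non-proxied admin listener uses the raw connection IP.
    """
    if listener_uses_proxy_protocol and proxied_ip is not None:
        return proxied_ip
    return raw_ip


def is_dropped(checked_ip: str,
               banned: Set[str],
               allow_list: Iterable[ipaddress.IPv4Network] = ()) -> bool:
    """Hypothetical ban decision: allow-listed ranges win over the ban set."""
    addr = ipaddress.ip_address(checked_ip)
    if any(addr in net for net in allow_list):
        return False
    return checked_ip in banned
```

With the edge box's tailnet address in the ban set, the proxied mail path still passes (the header carries the real client) while a direct connection to the admin listener is dropped, until `100.64.0.0/10` is allow-listed. That matches the symptom recorded in the notes.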
@@ -12,13 +12,10 @@
 
 // 3) grants — the edge proxy (tag:reverse-proxy) reaches the mailbox ports.
 // 8080 is the JMAP/admin HTTP tier (fronted by the main L7 Caddy).
-// 443 is Stalwart's HTTPS web listener; the edge L4-proxies the public
-// mta-sts/autoconfig/autodiscover SNIs to it (Stalwart terminates TLS with
-// its wildcard cert). PROXY protocol v2, same as the mail ports.
 {
   "src": ["tag:reverse-proxy"],
   "dst": ["tag:stalwart"],
-  "ip": ["tcp:25", "tcp:465", "tcp:587", "tcp:143", "tcp:993", "tcp:443", "tcp:8080"],
+  "ip": ["tcp:25", "tcp:465", "tcp:587", "tcp:143", "tcp:993", "tcp:8080"],
 },
 
 // 4) admin console (not this file): assign tag:stalwart to the same OAuth
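The only functional change in this hunk is that `tcp:443` drops out of the grant, which matters wherever the `:443` SNI pass-through is still in use. A minimal sketch of that check follows. `missing_ports` and the `required` list passed to it are hypothetical, introduced only for illustration; nothing like this exists in the repository.

```python
from typing import List, Sequence


def missing_ports(granted: Sequence[str], required: Sequence[str]) -> List[str]:
    """Return the required port specs (e.g. "tcp:443") absent from a grant's "ip" list."""
    have = set(granted)
    return [spec for spec in required if spec not in have]
```

Feeding it the two `"ip"` lists from the hunk, with `["tcp:443", "tcp:8080"]` as the required set, reports `tcp:443` as missing on the `+` side and nothing on the `-` side.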
@@ -34,43 +34,17 @@ binary. To add plugins, append `&p=<url-encoded module path>` to
 can't read `.env`; this is the one spot the MagicDNS name is hardcoded — same
 trade-off as pgAdmin's `servers.json`.)
 
-## The HTTP side (MTA-STS / autoconfig / autodiscover) — `:443` SNI fan-out
+## The HTTP side (JMAP / autoconfig / admin) is separate
 
-Stalwart publishes DNS that points public HTTPS names at this edge:
-`mta-sts.`, `autoconfig.`, `autodiscover.<domain>`. They serve the MTA-STS
-policy and mail-client autoconfig over **:443** — so the edge has to handle
-`:443` too, which is where a naive setup collides with a box that already runs a
-web Caddy.
-
-The fix is **not** an L7 `reverse_proxy` (terminate at Caddy). You can't: the
-domain's **CAA** record pins issuance to Stalwart's ACME account
-(`accounturi=…`), so Caddy can't obtain its own cert for `*.<domain>`. Stalwart
-already holds the wildcard. So we **pass TLS through** to it.
-
-The `web` server in `caddy.json` owns `:443` and fans out by SNI:
-
-- `mta-sts` / `autoconfig` / `autodiscover.<domain>` → `stalwart:443`
-  (pass-through; Stalwart terminates with its wildcard cert — **no** proxy
-  protocol on `:443`, unlike the mail ports).
-- every other SNI → `127.0.0.1:8443`, the box's own web Caddy.
-
-For that fallback to exist, move the web Caddy's HTTPS off `:443`:
+That part *is* ordinary layer 7. Don't put it here if this box already runs the
+main Caddy on :443 — you'll collide. Instead add a vhost to the existing Caddy:
 
 ```caddyfile
-{
-    https_port 8443 # web vhosts now listen here; the L4 :443 forwards to them
+mail.infinidim.net {
+    reverse_proxy stalwart.tail7b1641.ts.net:8080
 }
-
-your-web-site.example { reverse_proxy … }
 ```
 
-HTTP→HTTPS redirects still resolve to `:443` correctly. A **mail-only** edge (no
-web vhosts on the box) omits the `web` server entirely — keep just the mail
-ports above.
-
-> Note: `tag:reverse-proxy → tag:stalwart` must also grant **`tcp:443`** in the
-> Tailscale ACL (see `../acl-snippet.hujson`), on top of the mail ports.
-
 ## Prerequisites on the host running this
 
 - Joined to the tailnet, tagged `tag:reverse-proxy` (so the ACL lets it reach
@@ -45,30 +45,6 @@
         "proxy_protocol": "v2",
         "upstreams": [{ "dial": ["stalwart.tail7b1641.ts.net:993"] }]
       }]}]
-    },
-    "web": {
-      "//": "SNI fan-out on the public :443. Stalwart's HTTPS web endpoints",
-      "//2": "(MTA-STS policy, autoconfig, autodiscover) pass through to Stalwart,",
-      "//3": "which terminates TLS with its wildcard cert — NO proxy_protocol here,",
-      "//4": "unlike the mail ports above. Every other SNI falls to the box's own",
-      "//5": "web Caddy on :8443 (set `https_port 8443` there). A mail-only standalone",
-      "//6": "edge omits this server. See README — 'The HTTP side'.",
-      "listen": [":443"],
-      "routes": [
-        {
-          "match": [{ "tls": { "sni": ["mta-sts.infinidim.net", "autoconfig.infinidim.net", "autodiscover.infinidim.net"] } }],
-          "handle": [{
-            "handler": "proxy",
-            "upstreams": [{ "dial": ["stalwart.tail7b1641.ts.net:443"] }]
-          }]
-        },
-        {
-          "handle": [{
-            "handler": "proxy",
-            "upstreams": [{ "dial": ["127.0.0.1:8443"] }]
-          }]
-        }
-      ]
     }
   }
 }
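The two routes of the removed `web` server amount to a lookup on the TLS SNI: three names go to Stalwart untouched, everything else falls through to the local web Caddy. The sketch below restates that decision using the hostnames and upstreams from the hunk. `pick_upstream` is a hypothetical helper, and lowercasing the SNI before matching is an assumption of this sketch rather than documented matcher behaviour.

```python
from typing import Optional

# Values copied from the routes in the hunk above.
PASSTHROUGH_SNIS = frozenset({
    "mta-sts.infinidim.net",
    "autoconfig.infinidim.net",
    "autodiscover.infinidim.net",
})
STALWART_UPSTREAM = "stalwart.tail7b1641.ts.net:443"
FALLBACK_UPSTREAM = "127.0.0.1:8443"


def pick_upstream(sni: Optional[str]) -> str:
    """Choose where the raw TLS stream is forwarded; no TLS is terminated here."""
    # Assumption: hostnames compare case-insensitively.
    if sni is not None and sni.lower() in PASSTHROUGH_SNIS:
        return STALWART_UPSTREAM
    # Second route has no matcher, so it catches every other connection.
    return FALLBACK_UPSTREAM
```

Because the edge only forwards bytes, the certificate the client sees is whichever one the chosen upstream presents: Stalwart's wildcard for the three mail names, the web Caddy's own certificates for everything else.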
@@ -31,14 +31,9 @@ services:
       - NET_ADMIN
       - NET_RAW
     healthcheck:
-      # Healthy only when BOTH the tailnet link is up AND Postgres is reachable
-      # over it. The stalwart service gates on this (depends_on: service_healthy),
-      # so it can no longer start into the race where it tries the DB before the
-      # tailnet route exists — which logged "Failed to create tables" and dropped
-      # in-flight cert/setting writes (e.g. lost the ACME cert on 2026-06-10).
-      test: ["CMD-SHELL", "wget -qO- http://127.0.0.1:9002/healthz && nc -z -w3 ${DB_MAGIC_NAME}.${TS_TAILNET} 5432"]
+      test: ["CMD", "wget", "-qO-", "http://127.0.0.1:9002/healthz"]
       interval: 10s
-      timeout: 8s
+      timeout: 5s
       retries: 6
       start_period: 30s
     restart: unless-stopped
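The `-` side of this hunk gates health on two conditions, `/healthz` answering and Postgres accepting a TCP connection, while the `+` side checks only the first. A rough Python restatement of the two-part check is sketched below. The function names are hypothetical, and the helpers only approximate `wget -qO-` and `nc -z -w3`; the compose file itself uses the shell commands shown in the diff.

```python
import http.client
import socket
import urllib.request


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Approximate `nc -z -w3 host port`: True if a TCP connect completes."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Approximate `wget -qO- url`: True only on a 2xx response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (OSError, ValueError, http.client.HTTPException):
        # URLError and HTTPError are OSError subclasses; malformed replies
        # raise HTTPException.
        return False


def sidecar_healthy(healthz_url: str, db_host: str, db_port: int = 5432) -> bool:
    """Healthy only when BOTH probes pass, mirroring the `&&` in the `-` side."""
    return http_ok(healthz_url) and tcp_reachable(db_host, db_port)
```

The point of the conjunction is ordering: `depends_on: service_healthy` releases the dependent service as soon as the check passes, so anything the check does not cover can still be unavailable at that moment.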