tailwart/LESSONS.md
Wayne Hayes 3a9819c3ee docs: capture outbound-relay lessons (IPv6/AAAA trap, SMTP port block, sidecar ACL)
LESSONS.md gains 8-12: container has no IPv6 (AAAA fails before A, no
fallback), host IPv6 != container IPv6, VPS blocks all outbound SMTP
ports (relay over tailnet), sidecar needs a source ACL grant to
initiate, and MtaRoute changes only take effect on restart.

CLAUDE.md and .env.example warn that the smarthost address must be an
IPv4 literal or tailnet IP, never a dual-stack hostname. acl-snippet
adds the tag:stalwart -> tag:mail outbound grant.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-11 22:43:21 +01:00


tailwart — lessons learned

Hard-won notes from bringing the mail edge up. Each entry is symptom → cause → fix, ordered roughly by how much time it cost. Read this before re-debugging.

1. Postgres startup race ate cert/setting writes

Symptom: TLS certs (manual import and ACME) would validate but never persist — Stalwart kept serving its rcgen self-signed fallback. Logs showed Failed to create tables: error connecting to server on most boots.

Cause: Stalwart shares the ts-stalwart sidecar's netns. Its depends_on only waited for the sidecar's own health (/healthz = "tailscaled up"), which flips green before the tailnet route to Postgres (the-record-prod:5432) is usable. Stalwart started into that gap, failed the DB connect, and any write in that window — including a freshly obtained cert — was silently lost.

Fix: the sidecar healthcheck now also requires Postgres to be reachable (nc -z … 5432), so depends_on: service_healthy can't release Stalwart into the race. See docker-compose.yml. First clean boot after this: zero PG errors, 4 live connections immediately.
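The combined check described above can be sketched as a small shell function. This is an illustration only, not the healthcheck that ships in docker-compose.yml: the function names `check_tailscaled`, `check_postgres` and `sidecar_ready` are hypothetical, the `/healthz` URL and the Postgres host are passed in as arguments rather than assumed, and it presumes `wget` and `nc` exist in the sidecar image.

```shell
# Sketch: report healthy only when BOTH tailscaled and Postgres answer.
# All three function names are hypothetical, chosen for illustration.

check_tailscaled() {
  # $1 = URL of the sidecar's /healthz endpoint (assumed to be plain HTTP)
  wget -q -T 3 -O /dev/null "$1"
}

check_postgres() {
  # $1 = host, $2 = port. nc -z only tests that the TCP port accepts a
  # connection; it does not authenticate against the database.
  nc -z -w 3 "$1" "$2"
}

sidecar_ready() {
  # Usage: sidecar_ready HEALTHZ_URL PG_HOST PG_PORT
  check_tailscaled "$1" || { echo "tailscaled not ready"; return 1; }
  check_postgres "$2" "$3" || { echo "postgres unreachable"; return 1; }
  echo "ready"
}
```

A compose healthcheck would call `sidecar_ready` with the real URL, host and port, so a non-zero exit keeps the sidecar unhealthy until the tailnet route to Postgres is actually usable.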

2. DNS-01 was blocked by a dead Spaceship API key

Symptom: Failed to set DNS RRSet: Unauthorized on every record; no cert issued; no _acme-challenge TXT ever set.

Cause: the cert design is ACME DNS-01 via the Spaceship provider (bundled in caddy/lego). The stored API key was invalid (recovery debris from an earlier config attempt). Note STALWART_ACME_PROVIDER / STALWART_ACME_TOKEN in .env are empty and not even passed through by compose — the provider + secret are entered in the admin UI (stored in the DB), not via env.

Gotcha: secret fields render blank in the Stalwart admin even when set (the S3 secret behaves identically). A blank field is not evidence it's unset.

Fix / how to verify a key directly (egresses the box's WAN IP, same as Stalwart):

curl -i 'https://spaceship.dev/api/v1/dns/records/<domain>?take=5&skip=0' \
  -H 'X-Api-Key: KEY' -H 'X-Api-Secret: SECRET'
# 401 application.unauthorized = bad key/secret or IP-restricted
# 200 = good

A fresh Spaceship key fixed it.

3. Stalwart's auto-ban vs PROXY protocol (the "8080 mystery")

Symptom: the edge box could relay mail fine but could not reach Stalwart's :8080 admin: connections were accepted, then immediately closed. It looked like "tagged devices rejected, user phone works."

Cause: Stalwart's fail2ban checks the proxied client IP (from the PROXY header) on the mail listeners, but the raw connection IP on the non-proxied admin listener. A banned edge-box IP therefore still relays mail (ban checked against the header IP) while direct →:8080 is dropped (checked against the box IP). Malformed probing of the mail ports re-arms the ban.

Fix: add 100.64.0.0/10 (and the box's WAN IP, which appears as the proxied client when you hit the box's own public hostname) to the fail2ban allow-list. Bans are in-memory — a Stalwart restart flushes them. Don't rapid-poll the mail ports to test.

4. The wildcard request required DNS-01 (why HTTP-01 was a dead end)

With "Additional Hostnames" left empty, Stalwart requests a wildcard (*.<domain>). Wildcards can only be issued via DNS-01 — HTTP-01 literally cannot satisfy them. We burned time on an HTTP-01 + Caddy-challenge-forwarding detour before realizing DNS-01 was the intended (and only viable) path. One wildcard cert then covers mail, mta-sts, autoconfig, autodiscover, etc.
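Which names a single wildcard covers follows a simple rule: `*.<domain>` matches exactly one label below the domain, so it does not match the bare domain or deeper names. The sketch below encodes that rule. The function name `covered_by_wildcard` is hypothetical and `example.com` in the usage note is a placeholder, not this deployment's domain.

```shell
# covered_by_wildcard HOST DOMAIN
# Succeeds when HOST is exactly one label below DOMAIN, which is what a
# certificate for *.DOMAIN matches. Hypothetical helper for illustration.
covered_by_wildcard() {
  local host="$1"
  local domain="$2"
  local label="${host%".$domain"}"     # strip a trailing ".DOMAIN"
  [ "$label" != "$host" ] || return 1  # HOST did not end in .DOMAIN (apex included)
  [ -n "$label" ] || return 1          # nothing left in front of .DOMAIN
  case $label in
    *.*) return 1 ;;                   # more than one label deep
  esac
  return 0
}
```

For example, `covered_by_wildcard mta-sts.example.com example.com` succeeds, while `example.com` itself and `a.b.example.com` both fail and would need their own names on a certificate.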

5. :443 web endpoints need SNI pass-through, not L7 proxy

MTA-STS / autoconfig / autodiscover serve over :443. You cannot L7 reverse_proxy them through Caddy, because the CAA record pins issuance to Stalwart's ACME account, so Caddy can't get its own cert for those names. Stalwart holds the wildcard, so the edge passes TLS through by SNI. See caddy/README.md → "The HTTP side". This also required adding tcp:443 to the reverse-proxy → stalwart ACL grant.

6. The sidecar is ephemeral — never hardcode its tailnet IP

ts-stalwart runs with ?ephemeral=true, so its tailnet IP changes on re-registration (an ACL re-sync did this mid-debug: 100.112.26.122 → 100.79.87.80). Everything must use the MagicDNS name stalwart.tail7b1641.ts.net. A hardcoded IP will mysteriously go Network is unreachable.
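One way to catch a hardcoded address before it bites is to grep config files for IPv4 literals in 100.64.0.0/10, the range tailnet addresses come from. This is a minimal sketch: the function name `find_tailnet_ips` is hypothetical, and it assumes a `grep` that supports `-o`, `-w` and `-E` (GNU and BSD grep both do).

```shell
# find_tailnet_ips FILE...
# Prints file:line:match for every IPv4 literal inside 100.64.0.0/10
# (second octet 64-127). Exits non-zero when nothing is found.
# Hypothetical helper for illustration.
find_tailnet_ips() {
  grep -nHowE '100\.(6[4-9]|[7-9][0-9]|1[01][0-9]|12[0-7])\.[0-9]{1,3}\.[0-9]{1,3}' "$@"
}
```

Run it over the files that carry addresses (for instance `find_tailnet_ips .env docker-compose.yml`). Any hit pointing at the sidecar is a candidate for replacement with the MagicDNS name.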

7. Don't trust crt.sh for rate-limit checks

crt.sh was flaky or empty for the whole session. To gauge Let's Encrypt's weekly duplicate-cert limit, use certspotter instead: https://api.certspotter.com/v1/issuances?domain=<d>&include_subdomains=true. Also: LE limits are counted separately and over different windows. Failed validations are hourly (5/hr/host, the one a retry storm trips); issued duplicates are weekly (5/wk). A renewal task retrying every 10 min makes six attempts an hour and trips the hourly limit; consolidate to a single task.

8. The Stalwart container has no IPv6 — AAAA targets fail before IPv4 is tried

Symptom: Outbound delivery (and relay-to-smarthost) to any host with an AAAA record fails with I/O error: Network is unreachable (os error 101). Hosts that are IPv4-only deliver fine. Pointing a relay at a hostname that has both A and AAAA fails; pointing it at the raw IPv4 works.

Cause: Stalwart shares the ts-stalwart sidecar's netns, which has no global IPv6. When it resolves a dual-stack target it tries the AAAA first, gets ENETUNREACH immediately, and for a relay next-hop it does not fall back to the A record — it just records the v6 failure and backs off. So a single missing address family wedges all mail to dual-stack destinations.

Fix: Either (a) pin the relay/smarthost address to an IPv4 literal (no AAAA to trip on), or (b) give the container real IPv6. Note that relaying over the tailnet sidesteps this entirely — you connect to a tailnet 100.x address, which has no AAAA, so the v6-first trap never triggers.
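Option (a) is easy to enforce with a guard that rejects anything other than a dotted-quad IPv4 address. The sketch below is one way to write that check in plain shell. The function name `is_ipv4_literal` and the `SMARTHOST` variable in the usage note are hypothetical.

```shell
# is_ipv4_literal ADDR
# Succeeds only for a dotted-quad IPv4 address with each octet in 0-255.
# Hostnames and IPv6 literals are rejected. Hypothetical helper.
is_ipv4_literal() {
  local addr="$1"
  local octet
  case $addr in
    *[!0-9.]*|*..*|.*|*.) return 1 ;;  # stray characters or empty octets
  esac
  local IFS=.
  set -- $addr                         # split on dots into $1..$n
  [ $# -eq 4 ] || return 1
  for octet in "$@"; do
    [ ${#octet} -le 3 ] && [ "$octet" -le 255 ] || return 1
  done
}
```

A pre-flight step could then run `is_ipv4_literal "$SMARTHOST" || echo "smarthost must be an IPv4 literal or tailnet IP"` before the value reaches the relay configuration. A tailnet 100.x address passes the same check.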

9. Configuring IPv6 on the KVM host does NOT give the container IPv6

Symptom: ip -6 addr and ping6 google.com succeed on the KVM host, but Stalwart still dies with os error 101 on AAAA targets, and the box is still a broken IPv6 Tailscale exit node.

Cause: The host's eth0 and the container/sidecar netns are separate network stacks. Adding the provider's /64 to eth0 (ifupdown inet6 static + onlink default route, since the gateway is in a different /64) fixes the host, not the container. Docker doesn't hand IPv6 to containers by default, and the sidecar routes via Tailscale, not eth0.

Fix: Don't assume host IPv6 = container IPv6. Test from inside the container's netns. For mail egress, the IPv4-literal relay (Lesson 8) or the tailnet relay avoids needing container IPv6 at all. Enabling true container IPv6 (Docker IPv6 + routing the /64 in) is a separate, larger task.
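Testing from inside the container's netns can be done from the host with `nsenter`, which borrows the host's `ip` binary while looking at the container's network stack. The sketch below is an assumption-laden illustration: the function name `has_routable_ipv6` is hypothetical, it treats only 2000::/3 (addresses starting with 2 or 3) as routable so that ULA addresses such as a Tailscale interface's fd7a:… address do not count, and the `nsenter` call needs root.

```shell
# has_routable_ipv6
# Reads `ip -6 -o addr show scope global` output on stdin and succeeds if
# any address falls in 2000::/3, the global unicast range. ULA addresses
# (fc00::/7) also show as "scope global" on Linux, so they are excluded.
# Hypothetical helper for illustration.
has_routable_ipv6() {
  grep -Eq 'inet6 [23][0-9a-f]{0,3}:'
}

# Usage (not executed here; requires root and a running container):
#   pid=$(docker inspect -f '{{.State.Pid}}' tailwart-stalwart-1)
#   nsenter -t "$pid" -n ip -6 -o addr show scope global | has_routable_ipv6 \
#     && echo "container has routable IPv6" \
#     || echo "container has NO routable IPv6"
```

Running the same pipeline without `nsenter` checks the host, which makes the host-versus-container difference visible side by side.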

10. The VPS blocks ALL outbound SMTP ports — relay over the tailnet

Symptom: Direct MX delivery and relay-to-public-host both fail with Connection timed out (os error 110), and the SYN never arrives at the destination. Not just port 25 — 465, 587, even alt-port 2525 all time out.

Cause: The KVM provider blocks outbound SMTP on every port we tried (25, 465, 587 and 2525) to prevent spam. Only non-SMTP ports (443, etc.) egress. Confirmed with:

for p in 25 465 587 2525 443; do
  timeout 5 bash -c "exec 3<>/dev/tcp/<dst>/$p" && echo "$p OPEN" || echo "$p blocked"
done
# 443 OPEN, all SMTP ports timeout

Fix: Relay over the tailnet. Tailscale traffic travels as WireGuard over UDP (41641 by default) or through DERP relays over TCP 443, so it is unaffected by SMTP port filtering. Point the relay at the smarthost's tailnet IP (e.g. 100.x:587), not its public address. Long-term: ask the provider to unblock outbound 25/587 for verified use.

11. The sidecar can RECEIVE on the tailnet but can't INITIATE without an ACL grant

Symptom: The relay to <mailbox-tailnet-ip>:587 times out (os error 110), yet the KVM host (same physical machine) can reach that exact IP:port over the tailnet fine. Looks like a routing or transparent-proxy bug.

Cause: The Stalwart container rides the ts-stalwart sidecar, which is a separate tailnet node (tag:stalwart) from the KVM host. The tailwart ACL block only listed tag:stalwart as a destination ("dst": ["tag:stalwart"]). Tailscale ACLs are default-deny, so the sidecar could receive connections but could not initiate the relay to the mailbox; the packets were silently dropped and the connection timed out. The KVM host worked because it's a different, permitted identity, which masked the real cause.

Fix: Add an ACL rule granting tag:stalwart as a source:

{ "src": ["tag:stalwart"], "dst": ["tag:mail"], "ip": ["tcp:587"] }

(mailbox is tag:mail). Applies in seconds, no restart. See acl-snippet.hujson.

12. Stalwart only rebuilds its MTA route table at container startup

Symptom: You edit an MtaRoute (address, etc.) via API/UI, but delivery keeps using the old value. The datastore shows the new value; live delivery ignores it.

Cause: The routing_strategy map is built once when the process boots. The ReloadSettings action reloads the datastore but does not rebuild the SMTP route map. So route/strategy changes are invisible until restart.

Fix: After any MtaRoute / MtaOutboundStrategy change, docker restart tailwart-stalwart-1. (Side effect: the ephemeral sidecar gets a new tailnet IP each restart — anything addressing it by IP must rediscover it; use the MagicDNS name where possible.)