From 3a9819c3eebb7e917e45c737e074bf462ff95a96 Mon Sep 17 00:00:00 2001
From: Wayne Hayes
Date: Thu, 11 Jun 2026 22:43:21 +0100
Subject: [PATCH] docs: capture outbound-relay lessons (IPv6/AAAA trap, SMTP port block, sidecar ACL)

LESSONS.md gains 8-12: container has no IPv6 (AAAA fails before A, no
fallback), host IPv6 != container IPv6, VPS blocks all outbound SMTP
ports (relay over tailnet), sidecar needs a source ACL grant to
initiate, and MtaRoute changes only take effect on restart. CLAUDE.md
and .env.example warn that the smarthost address must be an IPv4
literal or tailnet IP, never a dual-stack hostname. acl-snippet adds
the tag:stalwart -> tag:mail outbound grant.

Co-Authored-By: Claude Opus 4.8
---
 .env.example       |  5 +++
 CLAUDE.md          |  5 ++-
 LESSONS.md         | 88 ++++++++++++++++++++++++++++++++++++++++++++++
 acl-snippet.hujson |  8 ++++-
 4 files changed, 104 insertions(+), 2 deletions(-)

diff --git a/.env.example b/.env.example
index 26a1b01..c58ba2f 100644
--- a/.env.example
+++ b/.env.example
@@ -68,6 +68,11 @@ STALWART_S3_BUCKET=stalwart-mail
 # ----------------------------------------------------------------------------
 # Most VPS providers block outbound :25. If yours does, relay through a
 # smarthost (host:port). Leave blank to attempt direct MX delivery.
+#
+# IMPORTANT: Use an IPv4 literal or a tailnet IP — never a dual-stack hostname.
+# The container has no IPv6 and will NOT fall back from AAAA to A; any host
+# with an AAAA record will fail immediately (os error 101). Relaying over the
+# tailnet (100.x:587) sidesteps this entirely and also bypasses VPS SMTP blocks.
 STALWART_SMARTHOST=
 
 # ----------------------------------------------------------------------------
diff --git a/CLAUDE.md b/CLAUDE.md
index 0ce3cc1..e7ca12c 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -81,7 +81,10 @@ healthcheck, ephemeral OAuth auth). Don't drift it. Tag: `tag:stalwart`.
   re-init. And never test a password over `127.0.0.1` against these Postgres
   containers: pg_hba `trust`s loopback and accepts ANY password. Test over the
   tailnet (scram) or you'll fool yourself.
-- **Outbound :25 is usually blocked on VPS.** Set `STALWART_SMARTHOST`.
+- **Outbound :25 is usually blocked on VPS.** Set `STALWART_SMARTHOST`. The
+  relay address **must be an IPv4 literal or a tailnet IP** — never a dual-stack
+  hostname. The container has no IPv6 and will not fall back from AAAA to A;
+  relaying over the tailnet (`100.x:587`) also bypasses all VPS SMTP port blocks.
 - **Mail forces WAN ports.** `:25` must be world-reachable for inbound
   federation — this is the one place the tailnet-only model can't hold. Keep
   submission/IMAP tailnet-only if you want a tighter surface.
diff --git a/LESSONS.md b/LESSONS.md
index b283f3d..485c320 100644
--- a/LESSONS.md
+++ b/LESSONS.md
@@ -94,3 +94,91 @@ duplicate-cert limit, use **certspotter** instead:
 Also: LE limits are dimensioned — **failed validations** are hourly (5/hr/host,
 the one a retry storm trips), **issued duplicates** are weekly (5/wk). A renewal
 task hammering every 10 min trips the hourly one; consolidate to a single task.
+
+## 8. The Stalwart container has no IPv6 — AAAA targets fail before IPv4 is tried
+
+**Symptom:** Outbound delivery (and relay-to-smarthost) to any host with an
+AAAA record fails with `I/O error: Network is unreachable (os error 101)`.
+Hosts that are IPv4-only deliver fine. Pointing a relay at a *hostname* that
+has both A and AAAA fails; pointing it at the raw IPv4 works.
+
+**Cause:** Stalwart shares the `ts-stalwart` sidecar's netns, which has no
+global IPv6. When it resolves a dual-stack target it tries the AAAA first,
+gets `ENETUNREACH` immediately, and for a **relay next-hop it does not fall
+back to the A record** — it just records the v6 failure and backs off. So a
+single missing address family wedges all mail to dual-stack destinations.
+
+**Fix:** Either (a) pin the relay/smarthost `address` to an **IPv4 literal**
+(no AAAA to trip on), or (b) give the container real IPv6. Note that relaying
+over the **tailnet** sidesteps this entirely — you connect to a tailnet
+`100.x` address, which has no AAAA, so the v6-first trap never triggers.
+
+## 9. Configuring IPv6 on the KVM host does NOT give the container IPv6
+
+**Symptom:** `ip -6 addr` and `ping6 google.com` succeed on the KVM host, but
+Stalwart still dies with `os error 101` on AAAA targets, and the box is still
+a broken IPv6 Tailscale exit node.
+
+**Cause:** The host's `eth0` and the container/sidecar netns are separate
+network stacks. Adding the provider's `/64` to `eth0` (ifupdown `inet6 static`
++ `onlink` default route, since the gateway is in a different /64) fixes the
+*host*, not the container. Docker doesn't hand IPv6 to containers by default,
+and the sidecar routes via Tailscale, not eth0.
+
+**Fix:** Don't assume host IPv6 = container IPv6. Test from *inside* the
+container's netns. For mail egress, the IPv4-literal relay (Lesson 8) or the
+tailnet relay avoids needing container IPv6 at all. Enabling true container
+IPv6 (Docker IPv6 + routing the /64 in) is a separate, larger task.
+
+## 10. The VPS blocks ALL outbound SMTP ports — relay over the tailnet
+
+**Symptom:** Direct MX delivery and relay-to-public-host both fail with
+`Connection timed out (os error 110)`, and the SYN never arrives at the
+destination. Not just port 25 — `465`, `587`, even alt-port `2525` all time out.
+
+**Cause:** The KVM provider blocks all outbound SMTP submission ports to prevent
+spam. Only non-SMTP ports (`443`, etc.) egress. Confirmed with:
+```bash
+for p in 25 465 587 2525 443; do
+  timeout 5 bash -c "exec 3<>/dev/tcp//$p" && echo "$p OPEN" || echo "$p blocked"
+done
+# 443 OPEN, all SMTP ports timeout
+```
+
+**Fix:** Relay over the **tailnet**. Tailscale rides WireGuard/DERP (UDP 41641 /
+443), so it's immune to SMTP port filtering. Point the relay at the smarthost's
+**tailnet IP** (e.g. `100.x:587`), not its public address. Long-term: ask the
+provider to unblock outbound 25/587 for verified use.
+
+## 11. The sidecar can RECEIVE on the tailnet but can't INITIATE without an ACL grant
+
+**Symptom:** The relay to `:587` times out (`os error 110`),
+yet the **KVM host** (same physical machine) can reach that exact IP:port over
+the tailnet fine. Looks like a routing or transparent-proxy bug.
+
+**Cause:** The Stalwart container rides the `ts-stalwart` sidecar — a **separate
+tailnet node** (`tag:stalwart`) from the KVM host. The `tailwart` ACL block only
+listed `tag:stalwart` as a **destination** (`"dst": ["tag:stalwart"]`). Tailnet
+is default-deny, so the sidecar could receive connections but could not
+*initiate* the relay back to the mailbox → silent drop → timeout. The KVM host
+worked because it's a different, permitted identity, which masked the real cause.
+
+**Fix:** Add an ACL rule granting `tag:stalwart` as a **source**:
+```json
+{ "src": ["tag:stalwart"], "dst": ["tag:mail"], "ip": ["tcp:587"] }
+```
+(mailbox is `tag:mail`). Applies in seconds, no restart. See `acl-snippet.hujson`.
+
+## 12. Stalwart only rebuilds its MTA route table at container startup
+
+**Symptom:** You edit an `MtaRoute` (address, etc.) via API/UI, but delivery keeps
+using the old value. The datastore shows the new value; live delivery ignores it.
+
+**Cause:** The `routing_strategy` map is built once when the process boots. The
+`ReloadSettings` action reloads the datastore but does **not** rebuild the SMTP
+route map. So route/strategy changes are invisible until restart.
+
+**Fix:** After any `MtaRoute` / `MtaOutboundStrategy` change,
+`docker restart tailwart-stalwart-1`. (Side effect: the ephemeral sidecar gets a
+new tailnet IP each restart — anything addressing it by IP must rediscover it;
+use the MagicDNS name where possible.)
diff --git a/acl-snippet.hujson b/acl-snippet.hujson
index 32b60e9..3700698 100644
--- a/acl-snippet.hujson
+++ b/acl-snippet.hujson
@@ -21,6 +21,12 @@
   "ip": ["tcp:25", "tcp:465", "tcp:587", "tcp:143", "tcp:993", "tcp:443", "tcp:8080"],
 },
 
-// 4) admin console (not this file): assign tag:stalwart to the same OAuth
+// 4) grant — Stalwart initiates outbound relay to the smarthost (tag:mail).
+// Without this, the sidecar can RECEIVE but cannot INITIATE over the tailnet
+// (default-deny; the KVM host is a different identity and won't mask this).
+// Adjust dst/port to match your smarthost's tag and submission port.
+{ "src": ["tag:stalwart"], "dst": ["tag:mail"], "ip": ["tcp:587"] },
+
+// 5) admin console (not this file): assign tag:stalwart to the same OAuth
 // client federatedSocial uses, on the Devices/Core + Keys/AuthKeys scopes.
 // Missing → 403 "calling actor does not have enough permissions" at boot.
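
The smarthost rule stated in the `.env.example` and `CLAUDE.md` hunks (IPv4 literal or tailnet IP, never a hostname) could be checked mechanically before deploy. The following is a minimal sketch and not part of this patch. It assumes `STALWART_SMARTHOST` is a `host:port` string and that the tailnet uses Tailscale's default `100.64.0.0/10` address range. The helper name `classify_smarthost` and its return labels are hypothetical.

```python
import ipaddress

# Assumption: the tailnet uses Tailscale's default CGNAT range, 100.64.0.0/10.
TAILNET_V4 = ipaddress.ip_network("100.64.0.0/10")


def classify_smarthost(value: str) -> str:
    """Classify a host:port smarthost string (hypothetical helper).

    Returns "tailnet-ip", "ipv4-literal", or "other". Only the first two
    satisfy the rule in Lesson 8; "other" covers hostnames (which may carry
    an AAAA record) and anything else that is not a plain IPv4 address.
    """
    host, sep, port = value.rpartition(":")
    if not sep or not port.isdigit():
        raise ValueError(f"expected host:port, got {value!r}")
    try:
        addr = ipaddress.IPv4Address(host)
    except ipaddress.AddressValueError:
        return "other"
    return "tailnet-ip" if addr in TAILNET_V4 else "ipv4-literal"


if __name__ == "__main__":
    for candidate in ("100.101.102.103:587", "203.0.113.10:587", "smtp.example.com:587"):
        print(candidate, "->", classify_smarthost(candidate))
```

A value classified as "other" is not necessarily broken, since an IPv4-only hostname would still deliver, but it is the case the lessons warn about and deserves a manual look. The check is purely syntactic and does no DNS lookup, so it runs the same inside or outside the container.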