Add enable_ipv6 + a ULA subnet to tailwart_default so the Stalwart container (sharing the ts-stalwart netns) gets working IPv6 egress. Because only egress is needed (inbound arrives via the edge/tailnet), a ULA + Docker masquerade suffices -- no routable prefix, ndppd, or host sysctl changes (Docker 29 enables ip6tables by default; host forwarding was already on). Verified: ping6 + TCP/443 to v6 literals from inside the netns; zero ENETUNREACH since boot. LESSONS: mark #8/#9 resolved with the ULA-masquerade recipe, and add #13 -- Spaceship's DNS API is RRSet-upsert (not zone-replace), so Stalwart/ACME did not eat custom AAAA records; a vanished AAAA is a provider-side loss, not Stalwart. Includes the safe read/verify flow and the "don't publish mail AAAA before edge v6 listeners" caveat. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
tailwart — lessons learned
Hard-won notes from bringing the mail edge up. Each entry is symptom → cause → fix, ordered roughly by how long it cost. Read this before re-debugging.
1. Postgres startup race ate cert/setting writes
Symptom: TLS certs (manual import and ACME) would validate but never
persist — Stalwart kept serving its rcgen self-signed fallback. Logs showed
Failed to create tables: error connecting to server on most boots.
Cause: Stalwart shares the ts-stalwart sidecar's netns. Its depends_on
only waited for the sidecar's own health (/healthz = "tailscaled up"), which
flips green before the tailnet route to Postgres (the-record-prod:5432) is
usable. Stalwart started into that gap, failed the DB connect, and any write in
that window — including a freshly obtained cert — was silently lost.
Fix: the sidecar healthcheck now also requires Postgres to be reachable
(nc -z … 5432), so depends_on: service_healthy can't release Stalwart into
the race. See docker-compose.yml. First clean boot after this: zero PG errors,
4 live connections immediately.
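The gating idea can be sketched as a small health script. This is a minimal illustration, not the healthcheck that actually lives in docker-compose.yml: the function names and the idea of passing the sidecar's own probe in as a command are assumptions made for the example.

```shell
#!/bin/sh
# Sketch only: function names and arguments are illustrative,
# not copied from docker-compose.yml.

# port_open HOST PORT: succeed if a TCP connect completes within 3 seconds.
# Uses nc -z, the same tool the fix relies on.
port_open() {
    nc -z -w 3 "$1" "$2" >/dev/null 2>&1
}

# sidecar_healthy PROBE_CMD HOST PORT: report healthy only if the sidecar's
# own probe passes AND the database port is reachable.
sidecar_healthy() {
    probe_cmd=$1
    host=$2
    port=$3
    $probe_cmd || return 1               # e.g. the "tailscaled up" check
    port_open "$host" "$port" || return 1
    return 0
}
```

With a gate like this as the healthcheck command, depends_on: service_healthy holds Stalwart back until both conditions are true, which is the property the fix depends on.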
2. DNS-01 was blocked by a dead Spaceship API key
Symptom: Failed to set DNS RRSet: Unauthorized on every record; no cert
issued; no _acme-challenge TXT ever set.
Cause: the cert design is ACME DNS-01 via the Spaceship provider
(bundled in caddy/lego). The stored API key was invalid (recovery debris from an
earlier config attempt). Note STALWART_ACME_PROVIDER / STALWART_ACME_TOKEN
in .env are empty and not even passed through by compose — the provider +
secret are entered in the admin UI (stored in the DB), not via env.
Gotcha: secret fields render blank in the Stalwart admin even when set (the S3 secret behaves identically). A blank field is not evidence it's unset.
Fix / how to verify a key directly (egresses the box's WAN IP, same as Stalwart):
curl -i 'https://spaceship.dev/api/v1/dns/records/<domain>?take=5&skip=0' \
  -H 'X-Api-Key: KEY' -H 'X-Api-Secret: SECRET'
# 401 application.unauthorized = bad key/secret or IP-restricted
# 200 = good
A fresh Spaceship key fixed it.
3. Stalwart's auto-ban vs PROXY protocol (the "8080 mystery")
Symptom: the edge box could relay mail fine but could not reach
Stalwart's :8080 admin: connections were accepted, then immediately closed. It looked like
"tagged devices rejected, user phone works."
Cause: Stalwart's fail2ban-style auto-ban checks the proxied client IP (from the PROXY
header) on the mail listeners, but the raw connection IP on the non-proxied
admin listener. A banned edge-box IP therefore still relays mail (the ban is checked
against the header IP) while a direct connection to :8080 is dropped (checked against the box
IP). Malformed probing of the mail ports re-arms the ban.
Fix: add 100.64.0.0/10 (and the box's WAN IP, which appears as the proxied
client when you hit the box's own public hostname) to the fail2ban allow-list.
Bans are in-memory — a Stalwart restart flushes them. Don't rapid-poll the mail
ports to test.
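The asymmetry is easier to reason about with a small model. The sketch below is not Stalwart's code. It restates the behavior described above as shell functions: which IP the ban check sees per listener, plus a plain IPv4 CIDR test for the 100.64.0.0/10 allow-list entry. The listener labels `mail` and `admin` and all function names are made up for the example.

```shell
#!/bin/sh
# Model of the behavior described above; names are illustrative.

# ip_to_int A.B.C.D: print the IPv4 address as a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# in_cidr IP NET/PREFIX: succeed if IP lies inside the IPv4 network.
in_cidr() {
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")
    bits=${2#*/}
    mask=$(( (4294967295 << (32 - bits)) & 4294967295 ))
    [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# ban_check_ip LISTENER RAW_IP PROXY_HEADER_IP: print the IP the ban is
# evaluated against for that listener type.
ban_check_ip() {
    case $1 in
        mail)  echo "$3" ;;   # PROXY-enabled listener: client IP from the header
        admin) echo "$2" ;;   # non-proxied listener: raw connection IP
        *)     return 2 ;;
    esac
}
```

For example, `in_cidr 100.79.87.80 100.64.0.0/10` succeeds, so a tailnet peer connecting straight to the admin listener is covered by the allow-list entry, while a public client IP carried in the PROXY header is not.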
4. The wildcard request required DNS-01 (why HTTP-01 was a dead end)
With "Additional Hostnames" left empty, Stalwart requests a wildcard
(*.<domain>). Wildcards can only be issued via DNS-01 — HTTP-01 literally
cannot satisfy them. We burned time on an HTTP-01 + Caddy-challenge-forwarding
detour before realizing DNS-01 was the intended (and only viable) path. One
wildcard cert then covers mail, mta-sts, autoconfig, autodiscover, etc.
5. :443 web endpoints need SNI pass-through, not L7 proxy
MTA-STS / autoconfig / autodiscover serve over :443. You cannot L7
reverse_proxy them through Caddy, because the CAA record pins issuance to
Stalwart's ACME account — Caddy can't get its own cert for those names. Stalwart
holds the wildcard, so the edge passes TLS through by SNI. See
caddy/README.md → "The HTTP side". This also required adding tcp:443 to the
reverse-proxy → stalwart ACL grant.
6. The sidecar is ephemeral — never hardcode its tailnet IP
ts-stalwart runs with ?ephemeral=true, so its tailnet IP changes on
re-registration (an ACL re-sync did this mid-debug: 100.112.26.122 → 100.79.87.80). Everything must use the MagicDNS name
stalwart.tail7b1641.ts.net. A hardcoded IP will sooner or later start failing with
Network is unreachable.
7. Don't trust crt.sh for rate-limit checks
crt.sh was flaky/empty all session. To gauge Let's Encrypt's weekly
duplicate-cert limit, use certspotter instead:
https://api.certspotter.com/v1/issuances?domain=<d>&include_subdomains=true.
Also: LE limits are tracked per dimension. Failed validations are hourly (5/hr/host,
the one a retry storm trips); issued duplicates are weekly (5/wk). A renewal
task retrying every 10 min makes 6 attempts an hour and trips the hourly limit; consolidate to a single task.
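The retry arithmetic is simple enough to check before scheduling a task. A minimal sketch, assuming every run fails validation (the worst case) and using the 5/hr figure quoted above as the default; the function names are illustrative.

```shell
#!/bin/sh
# attempts_per_hour INTERVAL_MIN: runs per hour for a fixed-interval task.
attempts_per_hour() {
    echo $(( 60 / $1 ))
}

# trips_hourly_limit INTERVAL_MIN [LIMIT]: succeed if a task that fails on
# every run would exceed LIMIT failed validations per hour (default 5).
trips_hourly_limit() {
    limit=${2:-5}
    [ "$(attempts_per_hour "$1")" -gt "$limit" ]
}
```

`trips_hourly_limit 10` succeeds (6 attempts against a limit of 5), while a 12-minute interval stays at exactly 5 and does not trip it.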
8. The Stalwart container has no IPv6 — AAAA targets fail before IPv4 is tried
Symptom: Outbound delivery (and relay-to-smarthost) to any host with an
AAAA record fails with I/O error: Network is unreachable (os error 101).
Hosts that are IPv4-only deliver fine. Pointing a relay at a hostname that
has both A and AAAA fails; pointing it at the raw IPv4 works.
Cause: Stalwart shares the ts-stalwart sidecar's netns, which has no
global IPv6. When it resolves a dual-stack target it tries the AAAA first,
gets ENETUNREACH immediately, and for a relay next-hop it does not fall
back to the A record — it just records the v6 failure and backs off. So a
single missing address family wedges all mail to dual-stack destinations.
Fix: Either (a) pin the relay/smarthost address to an IPv4 literal
(no AAAA to trip on), or (b) give the container real IPv6. Note that relaying
over the tailnet sidesteps this entirely — you connect to a tailnet
100.x address, which has no AAAA, so the v6-first trap never triggers.
RESOLVED (2026-06-11) — option (b) is now done. The container has real IPv6 egress; this trap no longer fires. See Lesson 9's fix for how.
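For setups that still lack container IPv6, a quick preflight on the relay target tells you whether option (a) is actually in effect. This is a rough sketch with hypothetical function names. It classifies by shape only (it does not validate octets), and it treats every hostname as a potential trap because the name may carry an AAAA record.

```shell
#!/bin/sh
# target_kind ADDR: print ipv4-literal, ipv6-literal, or hostname.
# Shape check only; "999.1.1.1" would still be reported as ipv4-literal.
target_kind() {
    case $1 in
        *:*)       echo ipv6-literal ;;
        *[!0-9.]*) echo hostname ;;
        *.*.*.*)   echo ipv4-literal ;;
        *)         echo hostname ;;
    esac
}

# v6_trap_possible ADDR: succeed unless the target is pinned to an IPv4
# literal, the only form that cannot resolve to an AAAA first.
v6_trap_possible() {
    [ "$(target_kind "$1")" != ipv4-literal ]
}
```

When `v6_trap_possible` succeeds for a hostname, follow up with `dig +short AAAA <name>` to see whether the name really has a v6 address.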
9. Configuring IPv6 on the KVM host does NOT give the container IPv6
Symptom: ip -6 addr and ping6 google.com succeed on the KVM host, but
Stalwart still dies with os error 101 on AAAA targets, and the box is still
a broken IPv6 Tailscale exit node.
Cause: The host's eth0 and the container/sidecar netns are separate
network stacks. Adding the provider's /64 to eth0 (ifupdown inet6 static with an
onlink default route, since the gateway is in a different /64) fixes the host, not the container. Docker doesn't hand IPv6 to containers by default, and the sidecar routes via Tailscale, not eth0.
Fix: Don't assume host IPv6 = container IPv6. Test from inside the container's netns. For mail egress, the IPv4-literal relay (Lesson 8) or the tailnet relay avoids needing container IPv6 at all. Enabling true container IPv6 (Docker IPv6 + routing the /64 in) is a separate, larger task.
RESOLVED (2026-06-11) — the easy way, no /64 routing or ndppd. Because the container only needs IPv6 egress (inbound arrives via the edge/tailnet, never v6), you don't need a routable prefix or NDP proxy at all — just a ULA subnet + masquerade, exactly like Docker does for v4:
# docker-compose.yml
networks:
  default:
    enable_ipv6: true
    ipam:
      config:
        - subnet: fd00:7a17:600d::/64
          gateway: fd00:7a17:600d::1
Docker 29 enables ip6tables by default and masquerades the ULA out the host's
global v6, so the sidecar netns (shared by Stalwart via network_mode) gets a
working v6 default route with zero host sysctl/daemon changes (host
net.ipv6.conf.all.forwarding was already 1 from the static-v6 setup). Verify
from inside the netns: ping6 google.com + a TCP connect to a v6 literal on
:443. Recreating the network (docker compose down && docker compose up -d) bounces the stack, and
the ephemeral sidecar gets a new tailnet IP; MagicDNS covers that (Lesson 6), and
the MTA route table is rebuilt on startup anyway (Lesson 12). This does not give inbound
v6; for that you would still need to publish AAAA records and make the edge listen on v6 (a separate task).
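Besides the active probes, the logs give a passive check: once v6 egress works, the error string from Lesson 8 should stop appearing. A minimal sketch, assuming Stalwart's log output is visible through docker logs (it may not be if logging goes to a file instead); the function name is illustrative.

```shell
#!/bin/sh
# count_enetunreach: read log text on stdin and print how many lines carry
# the error string quoted in Lesson 8.
count_enetunreach() {
    # grep -c prints 0 but exits 1 when nothing matches; mask that so the
    # function is safe to call under set -e.
    grep -c 'Network is unreachable (os error 101)' || true
}

# Usage (container name as in Lesson 12; adjust if yours differs):
#   docker logs tailwart-stalwart-1 2>&1 | count_enetunreach
```

A count of 0 after a period of real outbound traffic to dual-stack hosts is the signal to look for; a count of 0 right after boot with no mail sent proves nothing.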
10. The VPS blocks ALL outbound SMTP ports — relay over the tailnet
Symptom: Direct MX delivery and relay-to-public-host both fail with
Connection timed out (os error 110), and the SYN never arrives at the
destination. Not just port 25 — 465, 587, even alt-port 2525 all time out.
Cause: The KVM provider blocks all outbound SMTP submission ports to prevent
spam. Only non-SMTP ports (443, etc.) egress. Confirmed with:
for p in 25 465 587 2525 443; do
  timeout 5 bash -c "exec 3<>/dev/tcp/<dst>/$p" && echo "$p OPEN" || echo "$p blocked"
done
# 443 OPEN, all SMTP ports timeout
Fix: Relay over the tailnet. Tailscale rides WireGuard/DERP (UDP 41641 /
443), so it's immune to SMTP port filtering. Point the relay at the smarthost's
tailnet IP (e.g. 100.x:587), not its public address. Long-term: ask the
provider to unblock outbound 25/587 for verified use.
11. The sidecar can RECEIVE on the tailnet but can't INITIATE without an ACL grant
Symptom: The relay to <mailbox-tailnet-ip>:587 times out (os error 110),
yet the KVM host (same physical machine) can reach that exact IP:port over
the tailnet fine. Looks like a routing or transparent-proxy bug.
Cause: The Stalwart container rides the ts-stalwart sidecar — a separate
tailnet node (tag:stalwart) from the KVM host. The tailwart ACL block only
listed tag:stalwart as a destination ("dst": ["tag:stalwart"]). Tailnet
is default-deny, so the sidecar could receive connections but could not
initiate the relay back to the mailbox → silent drop → timeout. The KVM host
worked because it's a different, permitted identity, which masked the real cause.
Fix: Add an ACL rule granting tag:stalwart as a source:
{ "src": ["tag:stalwart"], "dst": ["tag:mail"], "ip": ["tcp:587"] }
(mailbox is tag:mail). Applies in seconds, no restart. See acl-snippet.hujson.
12. Stalwart only rebuilds its MTA route table at container startup
Symptom: You edit an MtaRoute (address, etc.) via API/UI, but delivery keeps
using the old value. The datastore shows the new value; live delivery ignores it.
Cause: The routing_strategy map is built once when the process boots. The
ReloadSettings action reloads the datastore but does not rebuild the SMTP
route map. So route/strategy changes are invisible until restart.
Fix: After any MtaRoute / MtaOutboundStrategy change,
docker restart tailwart-stalwart-1. (Side effect: the ephemeral sidecar gets a
new tailnet IP each restart — anything addressing it by IP must rediscover it;
use the MagicDNS name where possible.)
13. "Did Stalwart eat my custom DNS records?" — no; Spaceship is RRSet-upsert
Symptom: A manually-added record (e.g. an AAAA for the apex/mail) is
gone from the zone, and the suspicion is that Stalwart's ACME DNS-01 integration
overwrote it on a renewal.
Cause: Almost never Stalwart. Its only DNS-provider writes are
_acme-challenge.<name> TXT (the rotating challenge) and _validation-persist
TXT (the LE account-pinned persistent-validation record). It does not create
or modify A/AAAA/MX/SRV — those you add yourself from its "recommended records"
page. And the Spaceship API is RRSet-upsert keyed by (name, type), not a
whole-zone replace: a PUT /api/v1/dns/records/{domain} with
{"force":true,"items":[…]} only touches the RRSets named in items. Proof:
25 unrelated records coexist untouched through every rotating _acme-challenge
write; and adding one apex AAAA left the other 25 exactly intact (25→26).
So a vanished AAAA is far more likely a provider-side loss/rollback (e.g. during a data-center DDoS) or a manual edit — not Stalwart.
How to inspect / verify (read-only), creds in .env:
# -f2- keeps the whole value even if it contains '='
KEY=$(grep '^SPACESHIP_KEY=' .env | cut -d= -f2-)
SECRET=$(grep '^SPACESHIP_SECRET=' .env | cut -d= -f2-)
curl -s "https://spaceship.dev/api/v1/dns/records/<domain>?take=100&skip=0" \
  -H "X-Api-Key: $KEY" -H "X-Api-Secret: $SECRET" | python3 -m json.tool
To add a record, PUT the same endpoint with a single-item items array — it
won't disturb siblings. Snapshot the zone (GET) before any write and diff
after; snapshots land in _backup/ (gitignored). Always re-check at the
authoritative NS (dig +short AAAA <name> @launch1.spaceship.net), not a cache.
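The snapshot-and-diff step can be wrapped in two small helpers. This is a sketch under assumptions: the function names and file layout are made up, and `--sort-keys` is added only so that repeated snapshots diff cleanly. The GET itself is the same read-only call shown above and expects KEY and SECRET in the environment.

```shell
#!/bin/sh
# zone_snapshot DOMAIN OUTFILE: read-only GET of the zone, pretty-printed
# with sorted keys so later diffs are stable. Needs KEY and SECRET set.
zone_snapshot() {
    curl -s "https://spaceship.dev/api/v1/dns/records/$1?take=100&skip=0" \
        -H "X-Api-Key: $KEY" -H "X-Api-Secret: $SECRET" \
        | python3 -m json.tool --sort-keys > "$2"
}

# zone_unchanged BEFORE AFTER: succeed if the two snapshots are identical;
# otherwise print a unified diff and fail.
zone_unchanged() {
    diff -u "$1" "$2"
}

# Usage around a write (paths are examples):
#   zone_snapshot <domain> _backup/before.json
#   ... PUT the single-item change ...
#   zone_snapshot <domain> _backup/after.json
#   zone_unchanged _backup/before.json _backup/after.json
```

After adding one record, the expected diff is exactly that one item; anything else in the diff means the write touched more than intended. Note that take=100 caps the snapshot at 100 records, so a larger zone would need paging via skip.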
Caveat — don't publish mail AAAA before the edge listens on v6. Inbound
mail follows MX → mail.<domain>; an AAAA there with no v6 :25 listener on
the edge makes senders try v6 and some won't fall back → deferred/bounced mail.
An apex AAAA is safe (it doesn't affect MX routing). Do mail AAAA + edge
v6 listeners together.