tailwart — lessons learned
Hard-won notes from bringing the mail edge up. Each entry is symptom → cause → fix, ordered roughly by how much time it cost. Read this before re-debugging.
1. Postgres startup race ate cert/setting writes
Symptom: TLS certs (manual import and ACME) would validate but never
persist — Stalwart kept serving its rcgen self-signed fallback. Logs showed
Failed to create tables: error connecting to server on most boots.
Cause: Stalwart shares the ts-stalwart sidecar's netns. Its depends_on
only waited for the sidecar's own health (/healthz = "tailscaled up"), which
flips green before the tailnet route to Postgres (the-record-prod:5432) is
usable. Stalwart started into that gap, failed the DB connect, and any write in
that window — including a freshly obtained cert — was silently lost.
Fix: the sidecar healthcheck now also requires Postgres to be reachable
(nc -z … 5432), so depends_on: service_healthy can't release Stalwart into
the race. See docker-compose.yml. First clean boot after this: zero PG errors,
4 live connections immediately.
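The gate is nothing more than a TCP connect. Below is a minimal Python sketch of the same probe as nc -z, handy for running the check by hand from inside the sidecar's netns. The function name and the 3-second default timeout are illustrative assumptions; the real healthcheck in docker-compose.yml uses nc.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Same idea as `nc -z host port`: connect, then close immediately.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused, unreachable, timed out, and DNS resolution failures.
        return False
```

With this, tcp_reachable("the-record-prod", 5432) should be True before Stalwart is allowed to start; a False while /healthz is already green is exactly the race described above.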
2. DNS-01 was blocked by a dead Spaceship API key
Symptom: Failed to set DNS RRSet: Unauthorized on every record; no cert
issued; no _acme-challenge TXT ever set.
Cause: the cert design is ACME DNS-01 via the Spaceship provider
(bundled in caddy/lego). The stored API key was invalid (recovery debris from an
earlier config attempt). Note STALWART_ACME_PROVIDER / STALWART_ACME_TOKEN
in .env are empty and not even passed through by compose — the provider +
secret are entered in the admin UI (stored in the DB), not via env.
Gotcha: secret fields render blank in the Stalwart admin even when set (the S3 secret behaves identically). A blank field is not evidence it's unset.
Fix / how to verify a key directly (egresses the box's WAN IP, same as Stalwart):
curl -i 'https://spaceship.dev/api/v1/dns/records/<domain>?take=5&skip=0' \
-H 'X-Api-Key: KEY' -H 'X-Api-Secret: SECRET'
# 401 application.unauthorized = bad key/secret or IP-restricted
# 200 = good
A fresh Spaceship key fixed it.
3. Stalwart's auto-ban vs PROXY protocol (the "8080 mystery")
Symptom: the edge box could relay mail fine but could not reach
Stalwart's :8080 admin — connections accept then immediately close. Looked like
"tagged devices rejected, user phone works."
Cause: Stalwart's fail2ban checks the proxied client IP (from the PROXY
header) on the mail listeners, but the raw connection IP on the non-proxied
admin listener. A banned edge-box IP therefore still relays mail (ban checked
against the header IP) while direct →:8080 is dropped (checked against the box
IP). Malformed probing of the mail ports re-arms the ban.
Fix: add 100.64.0.0/10 (and the box's WAN IP, which appears as the proxied
client when you hit the box's own public hostname) to the fail2ban allow-list.
Bans are in-memory — a Stalwart restart flushes them. Don't rapid-poll the mail
ports to test.
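To sanity-check an allow-list entry before relying on it, a short Python sketch can confirm that a given address falls inside the listed ranges. The helper name is hypothetical, and the box's WAN IP is left out because its value is site-specific; this only models CIDR membership, not Stalwart's own ban logic.

```python
import ipaddress

# Allow-list from the fix: the tailnet CGNAT range. The box's WAN IP would be
# added as its own entry; it is omitted here because it is site-specific.
ALLOW = [ipaddress.ip_network("100.64.0.0/10")]


def is_allowlisted(ip: str, allow=ALLOW) -> bool:
    """Return True if ip falls inside any allow-listed network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in allow)
```

Both sidecar addresses seen in these notes (100.112.26.122 and 100.79.87.80) fall inside 100.64.0.0/10, so the single range keeps covering the sidecar when its tailnet IP changes.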
4. The wildcard request required DNS-01 (why HTTP-01 was a dead end)
With "Additional Hostnames" left empty, Stalwart requests a wildcard
(*.<domain>). Wildcards can only be issued via DNS-01 — HTTP-01 literally
cannot satisfy them. We burned time on an HTTP-01 + Caddy-challenge-forwarding
detour before realizing DNS-01 was the intended (and only viable) path. One
wildcard cert then covers mail, mta-sts, autoconfig, autodiscover, etc.
5. :443 web endpoints need SNI pass-through, not L7 proxy
MTA-STS / autoconfig / autodiscover serve over :443. You cannot L7
reverse_proxy them through Caddy, because the CAA record pins issuance to
Stalwart's ACME account — Caddy can't get its own cert for those names. Stalwart
holds the wildcard, so the edge passes TLS through by SNI. See
caddy/README.md → "The HTTP side". Needed tcp:443 added to the
reverse-proxy → stalwart ACL grant.
6. The sidecar is ephemeral — never hardcode its tailnet IP
ts-stalwart runs with ?ephemeral=true, so its tailnet IP changes on
re-registration (an ACL re-sync did this mid-debug: 100.112.26.122 → 100.79.87.80). Everything must use the MagicDNS name
stalwart.tail7b1641.ts.net. A hardcoded IP will eventually fail with
Network is unreachable once the sidecar re-registers.
7. Don't trust crt.sh for rate-limit checks
crt.sh was flaky/empty all session. To gauge Let's Encrypt's weekly
duplicate-cert limit, use certspotter instead:
https://api.certspotter.com/v1/issuances?domain=<d>&include_subdomains=true.
Also: LE limits are tracked per dimension. Failed validations are counted hourly
(5/hr/host, the one a retry storm trips); issued duplicates are counted weekly (5/wk). A renewal
task hammering every 10 min trips the hourly one; consolidate to a single task.
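As a rough way to read the weekly number off a certspotter response, the sketch below counts issuances in the last 7 days grouped by exact name set, which is how the duplicate limit is scoped. It assumes each item carries not_before (ISO 8601 with a Z suffix) and dns_names; those field names are assumptions about the response shape and should be checked against real output. Treat the result as an approximation, since not_before is not the exact issuance time.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def duplicates_in_window(issuances, now=None, days=7):
    """Count certs issued in the last `days`, grouped by exact name set.

    `issuances` is a list of dicts with `not_before` (ISO 8601 string) and
    `dns_names` (list of str). Both field names are assumed, not verified.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    counts = Counter()
    for item in issuances:
        # fromisoformat() rejects a trailing "Z" before Python 3.11.
        issued = datetime.fromisoformat(item["not_before"].replace("Z", "+00:00"))
        if issued >= cutoff:
            counts[frozenset(item["dns_names"])] += 1
    return counts
```

Any name set whose count is approaching 5 is close to the weekly duplicate limit.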
8. The Stalwart container has no IPv6 — AAAA targets fail before IPv4 is tried
Symptom: Outbound delivery (and relay-to-smarthost) to any host with an
AAAA record fails with I/O error: Network is unreachable (os error 101).
Hosts that are IPv4-only deliver fine. Pointing a relay at a hostname that
has both A and AAAA fails; pointing it at the raw IPv4 works.
Cause: Stalwart shares the ts-stalwart sidecar's netns, which has no
global IPv6. When it resolves a dual-stack target it tries the AAAA first,
gets ENETUNREACH immediately, and for a relay next-hop it does not fall
back to the A record — it just records the v6 failure and backs off. So a
single missing address family wedges all mail to dual-stack destinations.
Fix: Either (a) pin the relay/smarthost address to an IPv4 literal
(no AAAA to trip on), or (b) give the container real IPv6. Note that relaying
over the tailnet sidesteps this entirely — you connect to a tailnet
100.x address, which has no AAAA, so the v6-first trap never triggers.
RESOLVED (2026-06-11) — option (b) is now done. The container has real IPv6 egress; this trap no longer fires. See Lesson 9's fix for how.
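For option (a), the IPv4 literal can be looked up once and pinned. A minimal Python sketch, assuming a hypothetical helper name; it asks the resolver for A records only, so an AAAA on the same name never enters the picture. A pinned literal goes stale if the smarthost renumbers, so re-run it after any change on that side.

```python
import socket


def ipv4_literals(host: str) -> list[str]:
    """Resolve host to its IPv4 addresses only (A records, no AAAA)."""
    infos = socket.getaddrinfo(
        host, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr); for AF_INET
    # the sockaddr is an (address, port) pair.
    return sorted({info[4][0] for info in infos})
```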
9. Configuring IPv6 on the KVM host does NOT give the container IPv6
Symptom: ip -6 addr and ping6 google.com succeed on the KVM host, but
Stalwart still dies with os error 101 on AAAA targets, and the box is still
a broken IPv6 Tailscale exit node.
Cause: The host's eth0 and the container/sidecar netns are separate
network stacks. Adding the provider's /64 to eth0 (ifupdown inet6 static with an
onlink default route, since the gateway is in a different /64) fixes the host, not the container. Docker doesn't hand IPv6 to containers by default, and the sidecar routes via Tailscale, not eth0.
Fix: Don't assume host IPv6 = container IPv6. Test from inside the container's netns. For mail egress, the IPv4-literal relay (Lesson 8) or the tailnet relay avoids needing container IPv6 at all. Enabling true container IPv6 (Docker IPv6 + routing the /64 in) is a separate, larger task.
RESOLVED (2026-06-11) — the easy way, no /64 routing or ndppd. Because the container only needs IPv6 egress (inbound arrives via the edge/tailnet, never v6), you don't need a routable prefix or NDP proxy at all — just a ULA subnet + masquerade, exactly like Docker does for v4:
# docker-compose.yml
networks:
  default:
    enable_ipv6: true
    ipam:
      config:
        - subnet: fd00:7a17:600d::/64
          gateway: fd00:7a17:600d::1
Docker 29 enables ip6tables by default and masquerades the ULA out the host's
global v6, so the sidecar netns (shared by Stalwart via network_mode) gets a
working v6 default route with zero host sysctl/daemon changes (host
net.ipv6.conf.all.forwarding was already 1 from the static-v6 setup). Verify
from inside the netns: ping6 google.com + a TCP connect to a v6 literal on
:443. Recreating the network (docker compose down && docker compose up) bounces the stack and
the ephemeral sidecar gets a new tailnet IP — MagicDNS covers it (Lesson 6), and
the MTA route table rebuilds anyway (Lesson 12). This does not give inbound
v6; for that you'd still publish AAAA + make the edge listen on v6 (separate).
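A quick check that a chosen prefix really is a ULA (and therefore needs the masquerade to reach the internet) and that the gateway sits inside it can be sketched in Python. The function name is hypothetical; fc00::/7 is the RFC 4193 unique-local range.

```python
import ipaddress

ULA = ipaddress.ip_network("fc00::/7")  # RFC 4193 unique local addresses


def check_ula(subnet: str, gateway: str) -> bool:
    """True if subnet is an IPv6 ULA prefix and gateway lies inside it."""
    net = ipaddress.ip_network(subnet)
    gw = ipaddress.ip_address(gateway)
    # The version test comes first: subnet_of() raises on mixed v4/v6 input.
    return net.version == 6 and net.subnet_of(ULA) and gw in net
```

check_ula("fd00:7a17:600d::/64", "fd00:7a17:600d::1") holds for the values in the compose fragment above; a globally routable prefix fails the check.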
10. The VPS blocks ALL outbound SMTP ports — relay over the tailnet
Symptom: Direct MX delivery and relay-to-public-host both fail with
Connection timed out (os error 110), and the SYN never arrives at the
destination. Not just port 25 — 465, 587, even alt-port 2525 all time out.
Cause: The KVM provider blocks all outbound SMTP submission ports to prevent
spam. Only non-SMTP ports (443, etc.) egress. Confirmed with:
for p in 25 465 587 2525 443; do
  timeout 5 bash -c "exec 3<>/dev/tcp/<dst>/$p" && echo "$p OPEN" || echo "$p blocked"
done
# 443 OPEN, all SMTP ports timeout
Fix: Relay over the tailnet. Tailscale rides WireGuard (UDP 41641) and falls
back to DERP relays over TCP 443, so it's immune to SMTP port filtering. Point the relay at the smarthost's
tailnet IP (e.g. 100.x:587), not its public address. Long-term: ask the
provider to unblock outbound 25/587 for verified use.
11. The sidecar can RECEIVE on the tailnet but can't INITIATE without an ACL grant
Symptom: The relay to <mailbox-tailnet-ip>:587 times out (os error 110),
yet the KVM host (same physical machine) can reach that exact IP:port over
the tailnet fine. Looks like a routing or transparent-proxy bug.
Cause: The Stalwart container rides the ts-stalwart sidecar — a separate
tailnet node (tag:stalwart) from the KVM host. The tailwart ACL block only
listed tag:stalwart as a destination ("dst": ["tag:stalwart"]). Tailnet
is default-deny, so the sidecar could receive connections but could not
initiate the relay back to the mailbox → silent drop → timeout. The KVM host
worked because it's a different, permitted identity, which masked the real cause.
Fix: Add an ACL rule granting tag:stalwart as a source:
{ "src": ["tag:stalwart"], "dst": ["tag:mail"], "ip": ["tcp:587"] }
(mailbox is tag:mail). Applies in seconds, no restart. See acl-snippet.hujson.
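The asymmetry is easier to see in a toy model of default-deny evaluation. The Python sketch below is not Tailscale's evaluator (it ignores wildcards, groups, and port ranges), and "tag:reverse-proxy" with tcp:25 is a hypothetical stand-in for the existing edge grant; only the tag:stalwart → tag:mail rule is taken from the fix above.

```python
def allowed(grants, src, dst, port):
    """Default-deny: True only if some grant matches src, dst, and TCP port."""
    want = f"tcp:{port}"
    return any(
        src in g["src"] and dst in g["dst"] and want in g["ip"]
        for g in grants
    )


# Before the fix: tag:stalwart appears only as a destination.
# The source tag and port in this first grant are hypothetical placeholders.
BEFORE = [
    {"src": ["tag:reverse-proxy"], "dst": ["tag:stalwart"], "ip": ["tcp:25"]},
]

# After the fix: tag:stalwart is also granted as a source toward the mailbox.
AFTER = BEFORE + [
    {"src": ["tag:stalwart"], "dst": ["tag:mail"], "ip": ["tcp:587"]},
]
```

Under BEFORE, inbound to the sidecar is allowed while allowed(BEFORE, "tag:stalwart", "tag:mail", 587) is False, which matches the silent drop; under AFTER the same call is True.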
12. Stalwart only rebuilds its MTA route table at container startup
Symptom: You edit an MtaRoute (address, etc.) via API/UI, but delivery keeps
using the old value. The datastore shows the new value; live delivery ignores it.
Cause: The routing_strategy map is built once when the process boots. The
ReloadSettings action reloads the datastore but does not rebuild the SMTP
route map. So route/strategy changes are invisible until restart.
Fix: After any MtaRoute / MtaOutboundStrategy change,
docker restart tailwart-stalwart-1. (Side effect: the ephemeral sidecar gets a
new tailnet IP each restart — anything addressing it by IP must rediscover it;
use the MagicDNS name where possible.)
13. "Did Stalwart eat my custom DNS records?" — no; Spaceship is RRSet-upsert
Symptom: A manually-added record (e.g. an AAAA for the apex/mail) is
gone from the zone, and the suspicion is that Stalwart's ACME DNS-01 integration
overwrote it on a renewal.
Cause: Almost never Stalwart. Its only DNS-provider writes are
_acme-challenge.<name> TXT (the rotating challenge) and _validation-persist
TXT (the LE account-pinned persistent-validation record). It does not create
or modify A/AAAA/MX/SRV; those you add yourself from its "recommended records"
page. And a Spaceship PUT is not a whole-zone replace: a
PUT /api/v1/dns/records/{domain} with {"force":true,"items":[…]} only touches
the records named in items. (It is not a true upsert for an existing value, though: records are
keyed by name, type, and value; see #14.) Proof:
25 unrelated records coexist untouched through every rotating _acme-challenge
write, and adding one apex AAAA left the other 25 exactly intact (25→26).
So a vanished AAAA is far more likely a provider-side loss/rollback (e.g. during a data-center DDoS) or a manual edit, not Stalwart.
How to inspect / verify (read-only), creds in .env:
KEY=$(grep '^SPACESHIP_KEY=' .env | cut -d= -f2)
SECRET=$(grep '^SPACESHIP_SECRET=' .env | cut -d= -f2)
curl -s "https://spaceship.dev/api/v1/dns/records/<domain>?take=100&skip=0" \
-H "X-Api-Key: $KEY" -H "X-Api-Secret: $SECRET" | python3 -m json.tool
To add a record, PUT the same endpoint with a single-item items array — it
won't disturb siblings of a different name/type (but see #14 — for an existing
RRSet it appends, it does not replace). Snapshot the zone (GET) before any
write and diff after; snapshots land in _backup/ (gitignored). Always
re-check at the authoritative NS (dig +short AAAA <name> @launch1.spaceship.net),
not a cache.
Caveat — don't publish mail AAAA before the edge listens on v6. Inbound
mail follows MX → mail.<domain>; an AAAA there with no v6 :25 listener on
the edge makes senders try v6 and some won't fall back → deferred/bounced mail.
An apex AAAA is safe (it doesn't affect MX routing). Do mail AAAA + edge
v6 listeners together.
14. Spaceship PUT is an APPEND-by-value, not a replace — it can dupe an RRSet
Symptom: "Updating" the SPF record (PUT with force:true and the new
value) left the zone with two v=spf1 apex TXT records. Two SPF records is
an RFC 7208 permerror → SPF fails hard for everyone — worse than the typo
you were fixing.
Cause: Spaceship keys records by (name, type, value). A PUT whose value
differs from the existing record is a new record, so force:true adds
rather than replacing. (The earlier AAAA/SPF adds looked like clean "upserts"
only because there was no prior record at that name+type, or the value matched.)
Fix / correct pattern for an in-place value change: PUT the new value, then
DELETE the old one — and the DELETE body is a bare JSON array, not
{"items":[…]} (the latter 422s with Value is "object" but should be "array"):
curl -s -X DELETE "https://spaceship.dev/api/v1/dns/records/<domain>" \
-H "X-Api-Key: $KEY" -H "X-Api-Secret: $SECRET" -H 'Content-Type: application/json' \
-d '[{"type":"TXT","name":"@","value":"v=spf1 mx -all"}]'
Always GET-diff before/after (count + REMOVED/ADDED sets) to catch a stray dupe.
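The before/after check can be scripted against two saved snapshots. The Python sketch below assumes each snapshot is a list of record dicts with name and type plus type-specific fields (value for TXT, as in the DELETE body above); the field names for other record types and the ttl key are assumptions, so the identity key simply uses every remaining field.

```python
from collections import Counter


def record_key(rec):
    """Identity of a record: name, type, and all value-like fields."""
    rest = tuple(sorted(
        (k, str(v)) for k, v in rec.items() if k not in ("name", "type", "ttl")
    ))
    return (rec["name"], rec["type"], rest)


def diff_zone(before, after):
    """Return (removed, added) as sets of record keys."""
    b = {record_key(r) for r in before}
    a = {record_key(r) for r in after}
    return b - a, a - b


def is_spf(rec):
    """True for a TXT record whose value is an SPF policy (v=spf1 ...)."""
    if rec["type"] != "TXT":
        return False
    val = rec.get("value", "").strip('"').lower()
    return val == "v=spf1" or val.startswith("v=spf1 ")


def spf_dupes(records):
    """Names carrying more than one SPF record (an RFC 7208 permerror)."""
    counts = Counter(r["name"] for r in records if is_spf(r))
    return sorted(name for name, n in counts.items() if n > 1)
```

A clean in-place change shows exactly one removed and one added key and an empty spf_dupes() result; one added key with nothing removed is the append described above.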
15. ed25519 DKIM "fails" at Gmail with both ed25519+RSA — it's not your key
Symptom: DMARC aggregate reports show, per message, dkim=pass for the RSA
selector but dkim=fail for the ed25519 selector (v1-ed25519-…), on the same
intact message. Looks like a broken/mismatched ed25519 key.
Cause: Not the key. Verified cryptographically: the stored ed25519 seed
derives exactly the published p= (and the PKCS#8-v2 blob even embeds that same
pubkey). seed → pubkey → DNS all agree. It's the known Stalwart dual-signing
issue (discussion #2727):
when Stalwart applies both an ed25519 and an RSA signature, Gmail/Hotmail
mishandle the ed25519 one (fail, or neutral (no key)), while RSA passes. The
maintainer's own server runs with "ed25519 ignored, RSA passes." RSA carries
DMARC, so mail is unaffected — it's cosmetic, just noisy in reports.
How the key was proven (the seed lives in settings table s, PKCS#8 v2):
# 32-byte seed from the OCTET STRING in the stored PKCS#8; wrap as clean v0 DER:
printf '302e020100300506032b657004220420%s' "$SEED_HEX" | xxd -r -p > /tmp/ed.der
openssl pkey -inform DER -in /tmp/ed.der -pubout -outform DER | tail -c 32 | base64
# == the DNS p= value → key is correct
Fix (proper = RSA-only): the recommended cure is to stop emitting the ed25519 signature, not to republish anything. Two parts:
- DNS (done 2026-06-12): removed the v1-ed25519-20260604._domainkey TXT. This turns the report fail into a harmless "no key"; DMARC is still green via RSA.
- Stalwart (still TODO): disable the ed25519 signature in the admin UI / JMAP signing config so outbound stops carrying it (DB surgery on the serialized signature object is risky; do it through the supported surface). The fallback admin can't mint an API token non-interactively (only authorization_code/device_code grants; no ROPC), so this needs the web UI or a device-code login.
Aside discovered here: outbound is a catch-all smarthost relay to
mail.tail7b1641.ts.net (auth stalwart-relay@waynehayes.com), which re-emits
as mail.waynehayes.com (216.189.156.74 / 2602:ffc5:20::1:6b52). That relay
IP is why SPF needed include:waynehayes.com (#14 / the SPF fix).