# tailwart — lessons learned

Hard-won notes from bringing the mail edge up. Each entry is **symptom → cause → fix**, ordered roughly by how long it cost. Read this before re-debugging.

## 1. Postgres startup race ate cert/setting writes

**Symptom:** TLS certs (manual import *and* ACME) would validate but never persist — Stalwart kept serving its `rcgen` self-signed fallback. Logs showed `Failed to create tables: error connecting to server` on most boots.

**Cause:** Stalwart shares the `ts-stalwart` sidecar's netns. Its `depends_on` only waited for the sidecar's *own* health (`/healthz` = "tailscaled up"), which flips green **before** the tailnet route to Postgres (`the-record-prod:5432`) is usable. Stalwart started into that gap, failed the DB connect, and any write in that window — including a freshly obtained cert — was silently lost.

**Fix:** the sidecar healthcheck now also requires Postgres to be reachable (`nc -z … 5432`), so `depends_on: service_healthy` can't release Stalwart into the race. See `docker-compose.yml`. First clean boot after this: zero PG errors, 4 live connections immediately.

## 2. DNS-01 was blocked by a dead Spaceship API key

**Symptom:** `Failed to set DNS RRSet: Unauthorized` on every record; no cert issued; no `_acme-challenge` TXT ever set.

**Cause:** the cert design is ACME **DNS-01** via the **Spaceship** provider (bundled in caddy/lego). The stored API key was invalid (recovery debris from an earlier config attempt). Note `STALWART_ACME_PROVIDER` / `STALWART_ACME_TOKEN` in `.env` are **empty and not even passed through by compose** — the provider + secret are entered in the **admin UI** (stored in the DB), not via env.

**Gotcha:** secret fields render **blank** in the Stalwart admin even when set (the S3 secret behaves identically). A blank field is *not* evidence it's unset.

**Fix / how to verify a key directly (egresses the box's WAN IP, same as Stalwart):**

```bash
curl -i 'https://spaceship.dev/api/v1/dns/records/?take=5&skip=0' \
  -H 'X-Api-Key: KEY' -H 'X-Api-Secret: SECRET'
# 401 application.unauthorized = bad key/secret or IP-restricted
# 200 = good
```

A fresh Spaceship key fixed it.

## 3. Stalwart's auto-ban vs PROXY protocol (the "8080 mystery")

**Symptom:** the edge box could relay mail fine but could **not** reach Stalwart's `:8080` admin — connections accept then immediately close. Looked like "tagged devices rejected, user phone works."

**Cause:** Stalwart's fail2ban checks the **proxied client IP** (from the PROXY header) on the mail listeners, but the **raw connection IP** on the non-proxied admin listener. A banned edge-box IP therefore still relays mail (ban checked against the header IP) while direct `→:8080` is dropped (checked against the box IP). Malformed probing of the mail ports **re-arms** the ban.

**Fix:** add `100.64.0.0/10` (and the box's WAN IP, which appears as the proxied client when you hit the box's own public hostname) to the fail2ban allow-list. Bans are in-memory — a Stalwart restart flushes them. **Don't rapid-poll the mail ports** to test.

## 4. The wildcard request *required* DNS-01 (why HTTP-01 was a dead end)

With "Additional Hostnames" left empty, Stalwart requests a **wildcard** (`*.`). Wildcards can **only** be issued via DNS-01 — HTTP-01 literally cannot satisfy them. We burned time on an HTTP-01 + Caddy-challenge-forwarding detour before realizing DNS-01 was the intended (and only viable) path. One wildcard cert then covers `mail`, `mta-sts`, `autoconfig`, `autodiscover`, etc.

## 5. `:443` web endpoints need SNI pass-through, not L7 proxy

MTA-STS / autoconfig / autodiscover serve over **:443**. You cannot L7 `reverse_proxy` them through Caddy, because the **CAA** record pins issuance to Stalwart's ACME account — Caddy can't get its own cert for those names.
Stalwart holds the wildcard, so the edge **passes TLS through** by SNI. See `caddy/README.md` → "The HTTP side". Needed `tcp:443` added to the `reverse-proxy → stalwart` ACL grant.

## 6. The sidecar is ephemeral — never hardcode its tailnet IP

`ts-stalwart` runs with `?ephemeral=true`, so its tailnet IP **changes on re-registration** (an ACL re-sync did this mid-debug: `100.112.26.122 → 100.79.87.80`). Everything must use the MagicDNS name `stalwart.tail7b1641.ts.net`. A hardcoded IP will mysteriously go `Network is unreachable`.

## 7. Don't trust crt.sh for rate-limit checks

crt.sh was flaky/empty all session. To gauge Let's Encrypt's weekly duplicate-cert limit, use **certspotter** instead: `https://api.certspotter.com/v1/issuances?domain=&include_subdomains=true`.

Also: LE limits are dimensioned — **failed validations** are hourly (5/hr/host, the one a retry storm trips), **issued duplicates** are weekly (5/wk). A renewal task hammering every 10 min trips the hourly one; consolidate to a single task.

## 8. The Stalwart container has no IPv6 — AAAA targets fail before IPv4 is tried

**Symptom:** Outbound delivery (and relay-to-smarthost) to any host with an AAAA record fails with `I/O error: Network is unreachable (os error 101)`. Hosts that are IPv4-only deliver fine. Pointing a relay at a *hostname* that has both A and AAAA fails; pointing it at the raw IPv4 works.

**Cause:** Stalwart shares the `ts-stalwart` sidecar's netns, which has no global IPv6. When it resolves a dual-stack target it tries the AAAA first, gets `ENETUNREACH` immediately, and for a **relay next-hop it does not fall back to the A record** — it just records the v6 failure and backs off. So a single missing address family wedges all mail to dual-stack destinations.

**Fix:** Either (a) pin the relay/smarthost `address` to an **IPv4 literal** (no AAAA to trip on), or (b) give the container real IPv6.
Note that relaying over the **tailnet** sidesteps this entirely — you connect to a tailnet `100.x` address, which has no AAAA, so the v6-first trap never triggers.

> **RESOLVED (2026-06-11) — option (b) is now done.** The container has real
> IPv6 egress; this trap no longer fires. See Lesson 9's fix for how.

## 9. Configuring IPv6 on the KVM host does NOT give the container IPv6

**Symptom:** `ip -6 addr` and `ping6 google.com` succeed on the KVM host, but Stalwart still dies with `os error 101` on AAAA targets, and the box is still a broken IPv6 Tailscale exit node.

**Cause:** The host's `eth0` and the container/sidecar netns are separate network stacks. Adding the provider's `/64` to `eth0` (ifupdown `inet6 static` + `onlink` default route, since the gateway is in a different /64) fixes the *host*, not the container. Docker doesn't hand IPv6 to containers by default, and the sidecar routes via Tailscale, not eth0.

**Fix:** Don't assume host IPv6 = container IPv6. Test from *inside* the container's netns. For mail egress, the IPv4-literal relay (Lesson 8) or the tailnet relay avoids needing container IPv6 at all. Enabling true container IPv6 (Docker IPv6 + routing the /64 in) is a separate, larger task.

**RESOLVED (2026-06-11) — the easy way, no /64 routing or ndppd.** Because the container only needs IPv6 **egress** (inbound arrives via the edge/tailnet, never v6), you don't need a routable prefix or NDP proxy at all — just a **ULA subnet + masquerade**, exactly like Docker does for v4:

```yaml
# docker-compose.yml
networks:
  default:
    enable_ipv6: true
    ipam:
      config:
        - subnet: fd00:7a17:600d::/64
          gateway: fd00:7a17:600d::1
```

Docker 29 enables `ip6tables` by default and masquerades the ULA out the host's global v6, so the sidecar netns (shared by Stalwart via `network_mode`) gets a working v6 default route with **zero host sysctl/daemon changes** (host `net.ipv6.conf.all.forwarding` was already 1 from the static-v6 setup).
Verify from *inside* the netns: `ping6 google.com` + a TCP connect to a v6 literal on :443. Recreating the network (`docker compose down && up`) bounces the stack and the ephemeral sidecar gets a new tailnet IP — MagicDNS covers it (Lesson 6), and the MTA route table rebuilds anyway (Lesson 12). This does **not** give inbound v6; for that you'd still publish AAAA + make the edge listen on v6 (separate).

## 10. The VPS blocks ALL outbound SMTP ports — relay over the tailnet

**Symptom:** Direct MX delivery and relay-to-public-host both fail with `Connection timed out (os error 110)`, and the SYN never arrives at the destination. Not just port 25 — `465`, `587`, even alt-port `2525` all time out.

**Cause:** The KVM provider blocks all outbound SMTP submission ports to prevent spam. Only non-SMTP ports (`443`, etc.) egress. Confirmed with:

```bash
for p in 25 465 587 2525 443; do
  timeout 5 bash -c "exec 3<>/dev/tcp//$p" && echo "$p OPEN" || echo "$p blocked"
done
# 443 OPEN, all SMTP ports timeout
```

**Fix:** Relay over the **tailnet**. Tailscale rides WireGuard/DERP (UDP 41641 / 443), so it's immune to SMTP port filtering. Point the relay at the smarthost's **tailnet IP** (e.g. `100.x:587`), not its public address. Long-term: ask the provider to unblock outbound 25/587 for verified use.

## 11. The sidecar can RECEIVE on the tailnet but can't INITIATE without an ACL grant

**Symptom:** The relay to `:587` times out (`os error 110`), yet the **KVM host** (same physical machine) can reach that exact IP:port over the tailnet fine. Looks like a routing or transparent-proxy bug.

**Cause:** The Stalwart container rides the `ts-stalwart` sidecar — a **separate tailnet node** (`tag:stalwart`) from the KVM host. The `tailwart` ACL block only listed `tag:stalwart` as a **destination** (`"dst": ["tag:stalwart"]`). Tailnet is default-deny, so the sidecar could receive connections but could not *initiate* the relay back to the mailbox → silent drop → timeout.
The KVM host worked because it's a different, permitted identity, which masked the real cause.

**Fix:** Add an ACL rule granting `tag:stalwart` as a **source**:

```json
{ "src": ["tag:stalwart"], "dst": ["tag:mail"], "ip": ["tcp:587"] }
```

(mailbox is `tag:mail`). Applies in seconds, no restart. See `acl-snippet.hujson`.

## 12. Stalwart only rebuilds its MTA route table at container startup

**Symptom:** You edit an `MtaRoute` (address, etc.) via API/UI, but delivery keeps using the old value. The datastore shows the new value; live delivery ignores it.

**Cause:** The `routing_strategy` map is built once when the process boots. The `ReloadSettings` action reloads the datastore but does **not** rebuild the SMTP route map. So route/strategy changes are invisible until restart.

**Fix:** After any `MtaRoute` / `MtaOutboundStrategy` change, `docker restart tailwart-stalwart-1`. (Side effect: the ephemeral sidecar gets a new tailnet IP each restart — anything addressing it by IP must rediscover it; use the MagicDNS name where possible.)

## 13. "Did Stalwart eat my custom DNS records?" — no; Spaceship is RRSet-upsert

**Symptom:** A manually-added record (e.g. an `AAAA` for the apex/`mail`) is gone from the zone, and the suspicion is that Stalwart's ACME DNS-01 integration overwrote it on a renewal.

**Cause:** Almost never Stalwart. Its **only** DNS-provider writes are `_acme-challenge.` TXT (the rotating challenge) and `_validation-persist` TXT (the LE account-pinned persistent-validation record). It does **not** create or modify A/AAAA/MX/SRV — those you add yourself from its "recommended records" page. And the Spaceship API is **RRSet-upsert keyed by (name, type)**, not a whole-zone replace: a `PUT /api/v1/dns/records/{domain}` with `{"force":true,"items":[…]}` only touches the RRSets named in `items`. Proof: 25 unrelated records coexist untouched through every rotating `_acme-challenge` write; and adding one apex `AAAA` left the other 25 exactly intact (25→26).
So a vanished AAAA is far more likely a **provider-side loss/rollback** (e.g. during a data-center DDoS) or a manual edit — not Stalwart.

**How to inspect / verify (read-only), creds in `.env`:**

```bash
KEY=$(grep '^SPACESHIP_KEY=' .env | cut -d= -f2)
SECRET=$(grep '^SPACESHIP_SECRET=' .env | cut -d= -f2)
curl -s "https://spaceship.dev/api/v1/dns/records/?take=100&skip=0" \
  -H "X-Api-Key: $KEY" -H "X-Api-Secret: $SECRET" | python3 -m json.tool
```

To add a record, `PUT` the same endpoint with a single-item `items` array — it won't disturb siblings. **Snapshot the zone (GET) before any write** and diff after; snapshots land in `_backup/` (gitignored). Always re-check at the authoritative NS (`dig +short AAAA @launch1.spaceship.net`), not a cache.

**Caveat — don't publish `mail` AAAA before the edge listens on v6.** Inbound mail follows `MX → mail.`; an `AAAA` there with no v6 `:25` listener on the edge makes senders try v6 and some won't fall back → deferred/bounced mail. An **apex** `AAAA` is safe (it doesn't affect MX routing). Do `mail` AAAA + edge v6 listeners together.
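
**Sketch: diffing two zone snapshots.** The snapshot-before, diff-after habit from Lesson 13 can be wrapped in a small helper. This is a minimal sketch, not something that exists in the repo: the function name `zone_diff` and the snapshot file names are hypothetical, and it assumes each snapshot is the raw JSON saved from the read-only GET. It normalizes both files with `python3 -m json.tool --sort-keys`, so key order and whitespace don't show up as changes, then runs a plain `diff -u`.

```bash
# Sketch only: compare two saved zone snapshots (raw JSON from the read-only GET).
# zone_diff and the snapshot file names are hypothetical, not part of the repo.

# Prints a unified diff of the two snapshots.
# Returns 0 if identical, 1 if they differ, 2 if a file is missing or not valid JSON.
zone_diff() {
  before_norm=$(mktemp) || return 2
  after_norm=$(mktemp) || { rm -f "$before_norm"; return 2; }

  # Sort object keys and re-indent so formatting noise is not reported as a change.
  if python3 -m json.tool --sort-keys "$1" > "$before_norm" &&
     python3 -m json.tool --sort-keys "$2" > "$after_norm"; then
    diff -u "$before_norm" "$after_norm"
    status=$?
  else
    status=2
  fi

  rm -f "$before_norm" "$after_norm"
  return "$status"
}
```

Usage would look like `zone_diff _backup/before.json _backup/after.json` (file names assumed). Only object keys are sorted, not arrays, so if the provider returns the same records in a different order they will still appear as a diff. Treat the output as a prompt to look closer, not as proof that a record was lost.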