tailwart/LESSONS.md
Wayne Hayes 1e5fc982eb LESSONS: shared-infra readiness (#16 boot-order) + flapping consumer (#17 atuin)
2026-06-12 23:17:05 -04:00

tailwart — lessons learned

Hard-won notes from bringing the mail edge up. Each entry is symptom → cause → fix, ordered roughly by how much time it cost. Read this before re-debugging.

1. Postgres startup race ate cert/setting writes

Symptom: TLS certs (manual import and ACME) would validate but never persist — Stalwart kept serving its rcgen self-signed fallback. Logs showed Failed to create tables: error connecting to server on most boots.

Cause: Stalwart shares the ts-stalwart sidecar's netns. Its depends_on only waited for the sidecar's own health (/healthz = "tailscaled up"), which flips green before the tailnet route to Postgres (the-record-prod:5432) is usable. Stalwart started into that gap, failed the DB connect, and any write in that window — including a freshly obtained cert — was silently lost.

Fix: the sidecar healthcheck now also requires Postgres to be reachable (nc -z … 5432), so depends_on: service_healthy can't release Stalwart into the race. See docker-compose.yml. First clean boot after this: zero PG errors, 4 live connections immediately.

2. DNS-01 was blocked by a dead Spaceship API key

Symptom: Failed to set DNS RRSet: Unauthorized on every record; no cert issued; no _acme-challenge TXT ever set.

Cause: the cert design is ACME DNS-01 via the Spaceship provider (bundled in caddy/lego). The stored API key was invalid (recovery debris from an earlier config attempt). Note STALWART_ACME_PROVIDER / STALWART_ACME_TOKEN in .env are empty and not even passed through by compose — the provider + secret are entered in the admin UI (stored in the DB), not via env.

Gotcha: secret fields render blank in the Stalwart admin even when set (the S3 secret behaves identically). A blank field is not evidence it's unset.

Fix / how to verify a key directly (egresses the box's WAN IP, same as Stalwart):

curl -i 'https://spaceship.dev/api/v1/dns/records/<domain>?take=5&skip=0' \
  -H 'X-Api-Key: KEY' -H 'X-Api-Secret: SECRET'
# 401 application.unauthorized = bad key/secret or IP-restricted
# 200 = good

A fresh Spaceship key fixed it.

3. Stalwart's auto-ban vs PROXY protocol (the "8080 mystery")

Symptom: the edge box could relay mail fine but could not reach Stalwart's :8080 admin — connections accept then immediately close. Looked like "tagged devices rejected, user phone works."

Cause: Stalwart's fail2ban checks the proxied client IP (from the PROXY header) on the mail listeners, but the raw connection IP on the non-proxied admin listener. A banned edge-box IP therefore still relays mail (ban checked against the header IP) while direct →:8080 is dropped (checked against the box IP). Malformed probing of the mail ports re-arms the ban.

Fix: add 100.64.0.0/10 (and the box's WAN IP, which appears as the proxied client when you hit the box's own public hostname) to the fail2ban allow-list. Bans are in-memory — a Stalwart restart flushes them. Don't rapid-poll the mail ports to test.
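A minimal sketch for checking whether a client IP seen in the logs is already covered by the 100.64.0.0/10 allow-list entry or needs its own entry (as the box's WAN IP does). The function name is hypothetical, and it assumes a well-formed dotted-quad IPv4 address.

```shell
# in_tailnet_range IP -> exit 0 if IP falls inside 100.64.0.0/10
# (first octet 100, second octet 64-127). Assumes dotted-quad input.
in_tailnet_range() {
  first=${1%%.*}
  rest=${1#*.}
  second=${rest%%.*}
  [ "$first" = 100 ] && [ "$second" -ge 64 ] 2>/dev/null && [ "$second" -le 127 ]
}

# Usage:
#   in_tailnet_range 100.112.26.122 && echo "covered by 100.64.0.0/10"
#   in_tailnet_range 216.189.156.74 || echo "needs its own allow-list entry"
```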

4. The wildcard request required DNS-01 (why HTTP-01 was a dead end)

With "Additional Hostnames" left empty, Stalwart requests a wildcard (*.<domain>). Wildcards can only be issued via DNS-01 — HTTP-01 literally cannot satisfy them. We burned time on an HTTP-01 + Caddy-challenge-forwarding detour before realizing DNS-01 was the intended (and only viable) path. One wildcard cert then covers mail, mta-sts, autoconfig, autodiscover, etc.

5. :443 web endpoints need SNI pass-through, not L7 proxy

MTA-STS / autoconfig / autodiscover serve over :443. You cannot L7 reverse_proxy them through Caddy, because the CAA record pins issuance to Stalwart's ACME account — Caddy can't get its own cert for those names. Stalwart holds the wildcard, so the edge passes TLS through by SNI. See caddy/README.md → "The HTTP side". Needed tcp:443 added to the reverse-proxy → stalwart ACL grant.

6. The sidecar is ephemeral — never hardcode its tailnet IP

ts-stalwart runs with ?ephemeral=true, so its tailnet IP changes on re-registration (an ACL re-sync did this mid-debug: 100.112.26.122 → 100.79.87.80). Everything must use the MagicDNS name stalwart.tail7b1641.ts.net. A hardcoded IP will mysteriously go Network is unreachable.

7. Don't trust crt.sh for rate-limit checks

crt.sh was flaky or empty all session. To gauge Let's Encrypt's weekly duplicate-cert limit, use certspotter instead: https://api.certspotter.com/v1/issuances?domain=<d>&include_subdomains=true. Also: LE limits are counted over different windows. Failed validations are hourly (5/hr/host, the one a retry storm trips); issued duplicates are weekly (5/wk). A renewal task hammering every 10 min trips the hourly one; consolidate to a single task.
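The arithmetic behind "every 10 min trips the hourly one", as a small sketch. The function names are hypothetical; the limit of 5 failed validations per hour per host is the figure quoted above, and the worst case assumed here is that every run fails.

```shell
# attempts_per_hour INTERVAL_MIN -> how many times a task fires in one hour
attempts_per_hour() {
  echo $(( 60 / $1 ))
}

# trips_failed_validation_limit INTERVAL_MIN -> exit 0 if a task that fails on
# every run would exceed 5 failed validations per hour for one hostname
trips_failed_validation_limit() {
  [ "$(attempts_per_hour "$1")" -gt 5 ]
}

# Usage:
#   trips_failed_validation_limit 10 && echo "10-minute retries exceed the hourly limit"
```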

8. The Stalwart container has no IPv6 — AAAA targets fail before IPv4 is tried

Symptom: Outbound delivery (and relay-to-smarthost) to any host with an AAAA record fails with I/O error: Network is unreachable (os error 101). Hosts that are IPv4-only deliver fine. Pointing a relay at a hostname that has both A and AAAA fails; pointing it at the raw IPv4 works.

Cause: Stalwart shares the ts-stalwart sidecar's netns, which has no global IPv6. When it resolves a dual-stack target it tries the AAAA first, gets ENETUNREACH immediately, and for a relay next-hop it does not fall back to the A record — it just records the v6 failure and backs off. So a single missing address family wedges all mail to dual-stack destinations.

Fix: Either (a) pin the relay/smarthost address to an IPv4 literal (no AAAA to trip on), or (b) give the container real IPv6. Note that relaying over the tailnet sidesteps this entirely — you connect to a tailnet 100.x address, which has no AAAA, so the v6-first trap never triggers.

RESOLVED (2026-06-11) — option (b) is now done. The container has real IPv6 egress; this trap no longer fires. See Lesson 9's fix for how.

9. Configuring IPv6 on the KVM host does NOT give the container IPv6

Symptom: ip -6 addr and ping6 google.com succeed on the KVM host, but Stalwart still dies with os error 101 on AAAA targets, and the box is still a broken IPv6 Tailscale exit node.

Cause: The host's eth0 and the container/sidecar netns are separate network stacks. Adding the provider's /64 to eth0 (ifupdown inet6 static + onlink default route, since the gateway is in a different /64) fixes the host, not the container. Docker doesn't hand IPv6 to containers by default, and the sidecar routes via Tailscale, not eth0.

Fix: Don't assume host IPv6 = container IPv6. Test from inside the container's netns. For mail egress, the IPv4-literal relay (Lesson 8) or the tailnet relay avoids needing container IPv6 at all. Enabling true container IPv6 (Docker IPv6 + routing the /64 in) is a separate, larger task.

RESOLVED (2026-06-11) — the easy way, no /64 routing or ndppd. Because the container only needs IPv6 egress (inbound arrives via the edge/tailnet, never v6), you don't need a routable prefix or NDP proxy at all — just a ULA subnet + masquerade, exactly like Docker does for v4:

# docker-compose.yml
networks:
  default:
    enable_ipv6: true
    ipam:
      config:
        - subnet: fd00:7a17:600d::/64
          gateway: fd00:7a17:600d::1

Docker 29 enables ip6tables by default and masquerades the ULA out the host's global v6, so the sidecar netns (shared by Stalwart via network_mode) gets a working v6 default route with zero host sysctl/daemon changes (host net.ipv6.conf.all.forwarding was already 1 from the static-v6 setup). Verify from inside the netns: ping6 google.com plus a TCP connect to a v6 literal on :443. Recreating the network (docker compose down && docker compose up -d) bounces the stack, and the ephemeral sidecar gets a new tailnet IP; MagicDNS covers it (Lesson 6), and the MTA route table rebuilds anyway (Lesson 12). This does not give inbound v6; for that you'd still publish AAAA and make the edge listen on v6 (a separate task).

10. The VPS blocks ALL outbound SMTP ports — relay over the tailnet

Symptom: Direct MX delivery and relay-to-public-host both fail with Connection timed out (os error 110), and the SYN never arrives at the destination. Not just port 25 — 465, 587, even alt-port 2525 all time out.

Cause: The KVM provider blocks outbound SMTP on every port we tried (25, 465, 587, 2525) to prevent spam. Only non-SMTP ports (443, etc.) egress. Confirmed with:

for p in 25 465 587 2525 443; do
  timeout 5 bash -c "exec 3<>/dev/tcp/<dst>/$p" && echo "$p OPEN" || echo "$p blocked"
done
# 443 OPEN, all SMTP ports timeout

Fix: Relay over the tailnet. Tailscale rides WireGuard (UDP 41641) or falls back to DERP relays over HTTPS (TCP 443), so it's immune to SMTP port filtering. Point the relay at the smarthost's tailnet IP (e.g. 100.x:587), not its public address. Long-term: ask the provider to unblock outbound 25/587 for verified use.

11. The sidecar can RECEIVE on the tailnet but can't INITIATE without an ACL grant

Symptom: The relay to <mailbox-tailnet-ip>:587 times out (os error 110), yet the KVM host (same physical machine) can reach that exact IP:port over the tailnet fine. Looks like a routing or transparent-proxy bug.

Cause: The Stalwart container rides the ts-stalwart sidecar — a separate tailnet node (tag:stalwart) from the KVM host. The tailwart ACL block only listed tag:stalwart as a destination ("dst": ["tag:stalwart"]). Tailnet is default-deny, so the sidecar could receive connections but could not initiate the relay back to the mailbox → silent drop → timeout. The KVM host worked because it's a different, permitted identity, which masked the real cause.

Fix: Add an ACL rule granting tag:stalwart as a source:

{ "src": ["tag:stalwart"], "dst": ["tag:mail"], "ip": ["tcp:587"] }

(mailbox is tag:mail). Applies in seconds, no restart. See acl-snippet.hujson.
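Lessons 6 and 8 through 11 all surfaced as one of two OS errors, so a quick triage table helps. A minimal sketch; the function name is hypothetical, and the errno numbers are the Linux values seen in the logs above.

```shell
# explain_os_error CODE -> print which lessons in this file produced that errno
explain_os_error() {
  case "$1" in
    101) echo "ENETUNREACH: no route. Check for a hardcoded tailnet IP (6) or a v6-first AAAA target (8, 9)." ;;
    110) echo "ETIMEDOUT: packets dropped. Check provider SMTP filtering (10) or a missing tailnet ACL grant (11)." ;;
    *)   echo "not covered by these notes" ;;
  esac
}

# Usage:
#   explain_os_error 110
```

The split is worth remembering: 101 fails instantly and points at addressing, while 110 takes the full timeout and points at something silently dropping packets.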

12. Stalwart only rebuilds its MTA route table at container startup

Symptom: You edit an MtaRoute (address, etc.) via API/UI, but delivery keeps using the old value. The datastore shows the new value; live delivery ignores it.

Cause: The routing_strategy map is built once when the process boots. The ReloadSettings action reloads the datastore but does not rebuild the SMTP route map. So route/strategy changes are invisible until restart.

Fix: After any MtaRoute / MtaOutboundStrategy change, docker restart tailwart-stalwart-1. (Side effect: the ephemeral sidecar gets a new tailnet IP each restart — anything addressing it by IP must rediscover it; use the MagicDNS name where possible.)

13. "Did Stalwart eat my custom DNS records?" — no; Spaceship is RRSet-upsert

Symptom: A manually-added record (e.g. an AAAA for the apex/mail) is gone from the zone, and the suspicion is that Stalwart's ACME DNS-01 integration overwrote it on a renewal.

Cause: Almost never Stalwart. Its only DNS-provider writes are _acme-challenge.<name> TXT (the rotating challenge) and _validation-persist TXT (the LE account-pinned persistent-validation record). It does not create or modify A/AAAA/MX/SRV; those you add yourself from its "recommended records" page. And a Spaceship write is scoped to the records named in the request, not a whole-zone replace: a PUT /api/v1/dns/records/{domain} with {"force":true,"items":[…]} only touches the entries named in items. (At this point it looked like an upsert keyed by (name, type); #14 shows records are actually keyed by (name, type, value).) Proof: 25 unrelated records coexist untouched through every rotating _acme-challenge write, and adding one apex AAAA left the other 25 exactly intact (25→26).

So a vanished AAAA is far more likely a provider-side loss/rollback (e.g. during a data-center DDoS) or a manual edit — not Stalwart.

How to inspect / verify (read-only), creds in .env:

# -f2- keeps everything after the first "=", so a value containing "=" is not truncated
KEY=$(grep '^SPACESHIP_KEY=' .env | cut -d= -f2-)
SECRET=$(grep '^SPACESHIP_SECRET=' .env | cut -d= -f2-)
curl -s "https://spaceship.dev/api/v1/dns/records/<domain>?take=100&skip=0" \
  -H "X-Api-Key: $KEY" -H "X-Api-Secret: $SECRET" | python3 -m json.tool

To add a record, PUT the same endpoint with a single-item items array — it won't disturb siblings of a different name/type (but see #14 — for an existing RRSet it appends, it does not replace). Snapshot the zone (GET) before any write and diff after; snapshots land in _backup/ (gitignored). Always re-check at the authoritative NS (dig +short AAAA <name> @launch1.spaceship.net), not a cache.
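The snapshot-and-diff step can be sketched as below. It assumes each snapshot has first been flattened to one record per line (for example "TXT @ v=spf1 mx -all"); that line format and the function name are assumptions, and turning the API's JSON into such lines is left out because the response shape isn't recorded here.

```shell
# zone_diff BEFORE AFTER -> print REMOVED/ADDED records between two snapshots.
# Each file holds one record per line; the exact line format is an assumption.
zone_diff() {
  before=$(mktemp) && after=$(mktemp) || return 1
  sort "$1" > "$before"
  sort "$2" > "$after"
  # comm -23: lines only in BEFORE; comm -13: lines only in AFTER
  comm -23 "$before" "$after" | sed 's/^/REMOVED /'
  comm -13 "$before" "$after" | sed 's/^/ADDED /'
  rm -f "$before" "$after"
}

# Usage:
#   zone_diff _backup/before.txt _backup/after.txt
```

A clean single-record add prints exactly one ADDED line and nothing else; any REMOVED line means a sibling was disturbed.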

Caveat — don't publish mail AAAA before the edge listens on v6. Inbound mail follows MX → mail.<domain>; an AAAA there with no v6 :25 listener on the edge makes senders try v6 and some won't fall back → deferred/bounced mail. An apex AAAA is safe (it doesn't affect MX routing). Do mail AAAA + edge v6 listeners together.

14. Spaceship PUT is an APPEND-by-value, not a replace — it can dupe an RRSet

Symptom: "Updating" the SPF record (PUT with force:true and the new value) left the zone with two v=spf1 apex TXT records. Two SPF records on one name are an RFC 7208 permerror, so SPF fails hard for everyone, which is worse than the typo you were fixing.

Cause: Spaceship keys records by (name, type, value). A PUT whose value differs from the existing record is a new record, so force:true adds rather than replacing. (The earlier AAAA/SPF adds looked like clean "upserts" only because there was no prior record at that name+type, or the value matched.)

Fix / correct pattern for an in-place value change: PUT the new value, then DELETE the old one — and the DELETE body is a bare JSON array, not {"items":[…]} (the latter 422s with Value is "object" but should be "array"):

curl -s -X DELETE "https://spaceship.dev/api/v1/dns/records/<domain>" \
  -H "X-Api-Key: $KEY" -H "X-Api-Secret: $SECRET" -H 'Content-Type: application/json' \
  -d '[{"type":"TXT","name":"@","value":"v=spf1 mx -all"}]'

Always GET-diff before/after (count + REMOVED/ADDED sets) to catch a stray dupe.
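For SPF specifically, a stray dupe can also be caught at the authoritative NS by counting v=spf1 values. A minimal sketch; the function names are hypothetical, and the input is assumed to be one TXT value per line, as dig +short TXT prints them.

```shell
# count_spf -> count TXT values on stdin that start with v=spf1
# (a leading double quote, as printed by dig +short, is tolerated)
count_spf() {
  grep -c '^"*v=spf1'
}

# spf_is_single -> exit 0 only if stdin holds exactly one SPF record
spf_is_single() {
  [ "$(count_spf)" -eq 1 ]
}

# Usage (replace <domain> first):
#   dig +short TXT <domain> @launch1.spaceship.net | spf_is_single || echo "SPF missing or duplicated"
```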

15. ed25519 DKIM "fails" at Gmail with both ed25519+RSA — it's not your key

Symptom: DMARC aggregate reports show, per message, dkim=pass for the RSA selector but dkim=fail for the ed25519 selector (v1-ed25519-…), on the same intact message. Looks like a broken/mismatched ed25519 key.

Cause: Not the key. Verified cryptographically: the stored ed25519 seed derives exactly the published p= (and the PKCS#8-v2 blob even embeds that same pubkey). seed → pubkey → DNS all agree. It's the known Stalwart dual-signing issue (discussion #2727): when Stalwart applies both an ed25519 and an RSA signature, Gmail/Hotmail mishandle the ed25519 one (fail, or neutral (no key)), while RSA passes. The maintainer's own server runs with "ed25519 ignored, RSA passes." RSA carries DMARC, so mail is unaffected — it's cosmetic, just noisy in reports.

How the key was proven (the seed lives in settings table s, PKCS#8 v2):

# 32-byte seed from the OCTET STRING in the stored PKCS#8; wrap as clean v0 DER:
printf '302e020100300506032b657004220420%s' "$SEED_HEX" | xxd -r -p > /tmp/ed.der
openssl pkey -inform DER -in /tmp/ed.der -pubout -outform DER | tail -c 32 | base64
# == the DNS p= value  →  key is correct

Fix (proper = RSA-only): the recommended cure is to stop emitting the ed25519 signature, not republish anything. Two parts:

  1. DNS (done 2026-06-12): removed the v1-ed25519-20260604._domainkey TXT — turns the report fail into a harmless "no key", DMARC still green via RSA.
  2. Stalwart (still TODO): disable the ed25519 signature in the admin UI / JMAP signing config so outbound stops carrying it (DB surgery on the serialized signature object is risky — do it through the supported surface). The fallback admin can't mint an API token non-interactively (only authorization_code / device_code grants; no ROPC), so this needs the web UI or a device-code login.

Aside discovered here: outbound is a catch-all smarthost relay to mail.tail7b1641.ts.net (auth stalwart-relay@waynehayes.com), which re-emits as mail.waynehayes.com (216.189.156.74 / 2602:ffc5:20::1:6b52). That relay IP is why SPF needed include:waynehayes.com (#14 / the SPF fix).

16. After a reboot, Stalwart started before Garage — admin site 404'd (NOT a boot loop)

Symptom: Post-reboot, the Stalwart web admin / app assets wouldn't load (404 / blank), even though the container was running and not restart-looping.

Cause: the web UI (and other app assets) live in the S3 blob store (Garage) — Stalwart unpacks/serves them from S3. On reboot Stalwart came up before Garage was ready, so the asset fetch failed. Stalwart itself was fine (PG connected, listeners up); only the S3-backed content was missing. Easy to misread as "Stalwart is broken."

Fix: once Garage is up, restart Stalwart (or it picks them up on the next fetch). Quick confirm it's a backend-readiness issue, not Stalwart: running+healthy but assets 404 → probe the backend from the sidecar (nc -z garage.<tailnet> 3900).

Rule for the whole fleet (federatedSocial): every app must gate on its backends being live, not merely present. Model it on the Stalwart sidecar's healthcheck (depends_on: { <backend>: service_healthy } plus a check that actually probes PG/Redis/Garage over the tailnet; see #1, the PG-startup-race healthcheck). Don't assume shared infra boots first; make it a startup-ordering/readiness convention across all sidecars.
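One way to sketch that readiness gate as a sidecar healthcheck script. The function name and the PROBE override are assumptions for illustration; the default probe is the same nc -z check used in #1, and host:port targets are assumed to be names or IPv4 addresses (an IPv6 literal would need different parsing).

```shell
# probe_backends HOST:PORT... -> exit 0 only if every backend accepts a TCP connect.
# PROBE can be overridden (e.g. for testing); by default it is "nc -z -w 3".
probe_backends() {
  for target in "$@"; do
    host=${target%:*}
    port=${target##*:}
    # Unquoted on purpose so the default splits into a command plus its flags
    ${PROBE:-nc -z -w 3} "$host" "$port" || {
      echo "not ready: $target" >&2
      return 1
    }
  done
}

# Usage as a healthcheck command (replace <tailnet> first):
#   probe_backends the-record-prod:5432 garage.<tailnet>:3900
```

Because the sidecar only turns healthy once every listed backend answers, depends_on: service_healthy then holds the app back until its stores are actually reachable.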

17. A flapping shared-store consumer (atuin) looked like a Postgres problem

Symptom: "Postgres seems to be the cause / unstable." Actually atuin-server had RestartCount 6318, exit 1 — crash-looping for days and generating all the noise.

Cause: atuin couldn't reach/authenticate its DB and crash-looped under restart: unless-stopped. Postgres itself was healthy (6 days up, 0 restarts, 17/100 conns). atuin never even established a connection — no atuin lines in the PG log and no atuin rows in pg_stat_activity — i.e. it was dying before reaching PG.

Diagnosis (fast):

# which container is actually flapping (PG health != consumer health):
docker inspect <c> --format '{{.RestartCount}} exit={{.State.ExitCode}} oom={{.State.OOMKilled}}'
# is a consumer reconnect-storming the shared store? distinct/ghost client_addr = churn:
docker exec <pg> psql -U postgres -tAc \
  "SELECT client_addr, state, count(*) FROM pg_stat_activity GROUP BY 1,2 ORDER BY 1"

Ephemeral sidecar nodes get a new tailnet IP per restart, so successive incarnations leave ghost idle connections from dead IPs — a handy "how many times did it restart" fingerprint (we saw this with Stalwart too: 1 live IP + 2 ghosts).

Rule for the whole fleet: a shared Postgres/Redis/Garage is a blast-radius surface — one misconfigured consumer shouldn't be mistaken for a shared-infra outage. Confirm a consumer's creds + backend reachability before enabling restart: always/unless-stopped, and when something "looks like the DB," check the consumers first.
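"Check the consumers first" can be made a one-liner across the whole host. A minimal sketch; the function name and the threshold are hypothetical, and the input is assumed to be "name restart_count" pairs as produced by the docker inspect format shown in the usage comment.

```shell
# flag_flappers THRESHOLD -> read "name restart_count" lines on stdin and
# print the ones whose restart count exceeds THRESHOLD
flag_flappers() {
  awk -v max="$1" '$2 + 0 > max + 0 { print $1, $2 }'
}

# Usage:
#   docker ps -aq | xargs docker inspect --format '{{.Name}} {{.RestartCount}}' | flag_flappers 10
```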