# tailwart — lessons learned

Hard-won notes from bringing the mail edge up. Each entry is **symptom → cause →
fix**, ordered roughly by how much debugging time it cost. Read this before
re-debugging.

## 1. Postgres startup race ate cert/setting writes

**Symptom:** TLS certs (manual import *and* ACME) would validate but never
persist — Stalwart kept serving its `rcgen` self-signed fallback. Logs showed
`Failed to create tables: error connecting to server` on most boots.

**Cause:** Stalwart shares the `ts-stalwart` sidecar's netns. Its `depends_on`
only waited for the sidecar's *own* health (`/healthz` = "tailscaled up"), which
flips green **before** the tailnet route to Postgres (`the-record-prod:5432`) is
usable. Stalwart started into that gap, failed the DB connect, and any write in
that window — including a freshly obtained cert — was silently lost.

**Fix:** the sidecar healthcheck now also requires Postgres to be reachable
(`nc -z … 5432`), so `depends_on: service_healthy` can't release Stalwart into
the race. See `docker-compose.yml`. First clean boot after this: zero PG errors,
4 live connections immediately.
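
The gate can be sketched as a small shell helper that a healthcheck command could call. This is a sketch under assumptions, not the check that lives in `docker-compose.yml`: the function names `tcp_probe` and `all_reachable`, the `PROBE` override hook, and the `-w 2` timeout are illustrative additions, and the tailscaled `/healthz` probe is left out because its address depends on the sidecar config.

```bash
# Hypothetical helper names; only `nc -z <host> 5432` comes from the real check.

# Succeeds only if a TCP connect to host:port opens within 2 seconds.
tcp_probe() {
  nc -z -w 2 "$1" "$2"
}

# Succeeds only if every host:port target answers; fails on the first miss.
# PROBE can be pointed at a stub so the logic is testable without a network.
all_reachable() {
  for target in "$@"; do
    "${PROBE:-tcp_probe}" "${target%:*}" "${target##*:}" || return 1
  done
}

# Example healthcheck body:
#   all_reachable the-record-prod:5432 || exit 1
```

Keeping the dependency check inside the sidecar's own health status means `depends_on: service_healthy` needs no changes on the Stalwart side.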

## 2. DNS-01 was blocked by a dead Spaceship API key

**Symptom:** `Failed to set DNS RRSet: Unauthorized` on every record; no cert
issued; no `_acme-challenge` TXT ever set.

**Cause:** the cert design is ACME **DNS-01** via the **Spaceship** provider
(bundled in caddy/lego). The stored API key was invalid (recovery debris from an
earlier config attempt). Note `STALWART_ACME_PROVIDER` / `STALWART_ACME_TOKEN`
in `.env` are **empty and not even passed through by compose** — the provider +
secret are entered in the **admin UI** (stored in the DB), not via env.

**Gotcha:** secret fields render **blank** in the Stalwart admin even when set
(the S3 secret behaves identically). A blank field is *not* evidence it's unset.

**Fix / how to verify a key directly (egresses the box's WAN IP, same as
Stalwart):**
```bash
curl -i 'https://spaceship.dev/api/v1/dns/records/<domain>?take=5&skip=0' \
  -H 'X-Api-Key: KEY' -H 'X-Api-Secret: SECRET'
# 401 application.unauthorized = bad key/secret or IP-restricted
# 200 = good
```
A fresh Spaceship key fixed it.
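
For repeated checks, the same probe can be wrapped so it prints a verdict instead of raw headers. A minimal sketch, assuming only the endpoint and the two status meanings noted above; the function names `classify_status` and `check_spaceship_key` are hypothetical, and any status other than 200 or 401 is reported as-is rather than interpreted.

```bash
# Map the HTTP status from the records endpoint to a verdict.
classify_status() {
  case "$1" in
    200) echo "good" ;;
    401) echo "bad key/secret or IP-restricted" ;;
    *)   echo "unexpected status $1" ;;
  esac
}

# Fetch only the status code. Arguments: domain, API key, API secret.
# The subshell body keeps the `code` variable out of the caller's scope.
check_spaceship_key() (
  code=$(curl -s -o /dev/null -w '%{http_code}' \
    "https://spaceship.dev/api/v1/dns/records/$1?take=5&skip=0" \
    -H "X-Api-Key: $2" -H "X-Api-Secret: $3")
  classify_status "$code"
)

# Example: check_spaceship_key <domain> "$KEY" "$SECRET"
```

If curl cannot connect at all it reports status `000`, which falls through to the "unexpected status" branch instead of being mistaken for a bad key.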

## 3. Stalwart's auto-ban vs PROXY protocol (the "8080 mystery")

**Symptom:** the edge box could relay mail fine but could **not** reach
Stalwart's `:8080` admin — connections accept then immediately close. Looked like
"tagged devices rejected, user phone works."

**Cause:** Stalwart's built-in auto-ban (fail2ban-style) checks the **proxied
client IP** (from the PROXY header) on the mail listeners, but the **raw
connection IP** on the non-proxied admin listener. A banned edge-box IP
therefore still relays mail (ban checked against the header IP) while direct
`→:8080` is dropped (checked against the box IP). Malformed probing of the mail
ports **re-arms** the ban.

**Fix:** add `100.64.0.0/10` (and the box's WAN IP, which appears as the proxied
client when you hit the box's own public hostname) to the auto-ban allow-list.
Bans are in-memory, so a Stalwart restart flushes them. **Don't rapid-poll the mail
ports** to test.
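
When reading a ban list, it helps to tell at a glance whether an address is a tailnet peer (covered by the `100.64.0.0/10` entry) or a public IP that needs its own entry. A loose sketch of that check; `in_tailnet_range` is a hypothetical name, and the function assumes its input is already a well-formed IPv4 address, since it only inspects the first two octets.

```bash
# Returns 0 if the address falls inside 100.64.0.0/10, i.e. the first octet
# is 100 and the second is 64..127. The last two octets are not validated.
in_tailnet_range() {
  case "$1" in
    100.6[4-9].*.*|100.[7-9][0-9].*.*|100.1[01][0-9].*.*|100.12[0-7].*.*)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Example: in_tailnet_range 100.112.26.122 && echo "covered by the /10 entry"
```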

## 4. The wildcard request *required* DNS-01 (why HTTP-01 was a dead end)

With "Additional Hostnames" left empty, Stalwart requests a **wildcard**
(`*.<domain>`). Wildcards can **only** be issued via DNS-01 — HTTP-01 literally
cannot satisfy them. We burned time on an HTTP-01 + Caddy-challenge-forwarding
detour before realizing DNS-01 was the intended (and only viable) path. One
wildcard cert then covers `mail`, `mta-sts`, `autoconfig`, `autodiscover`, etc.
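
To confirm a DNS-01 attempt is actually landing, it helps to know which TXT name to query: for a wildcard, the `*.` label is dropped and the challenge sits at `_acme-challenge.<domain>`. A small sketch of that mapping; the function names are hypothetical and `example.org` stands in for the real domain.

```bash
# True if the requested name is a wildcard, which only DNS-01 can validate.
needs_dns01() {
  case "$1" in
    \*.*) return 0 ;;
    *)    return 1 ;;
  esac
}

# Print the TXT record name an ACME DNS-01 challenge uses for this name.
# A leading "*." is stripped, so the wildcard and the bare domain share a name.
acme_challenge_name() {
  printf '_acme-challenge.%s\n' "${1#\*.}"
}

# Example (replace the NS with the zone's authoritative server):
#   dig +short TXT "$(acme_challenge_name '*.example.org')" @<authoritative-ns>
```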

## 5. `:443` web endpoints need SNI pass-through, not L7 proxy

MTA-STS / autoconfig / autodiscover serve over **:443**. You cannot L7
`reverse_proxy` them through Caddy, because the **CAA** record pins issuance to
Stalwart's ACME account — Caddy can't get its own cert for those names. Stalwart
holds the wildcard, so the edge **passes TLS through** by SNI. See
`caddy/README.md` → "The HTTP side". Needed `tcp:443` added to the
`reverse-proxy → stalwart` ACL grant.

## 6. The sidecar is ephemeral — never hardcode its tailnet IP

`ts-stalwart` runs with `?ephemeral=true`, so its tailnet IP **changes on
re-registration** (an ACL re-sync did this mid-debug: `100.112.26.122 →
100.79.87.80`). Everything must use the MagicDNS name
`stalwart.tail7b1641.ts.net`. A hardcoded IP will mysteriously go
`Network is unreachable`.
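
A quick way to catch stale addresses is to grep the config for literal tailnet IPs. A rough sketch, with `find_tailnet_ips` as a hypothetical name; the pattern is not anchored, so it can also match inside a longer dotted number, and a hit is a prompt to review rather than proof of a bug (an address for a non-ephemeral node may be intentional).

```bash
# Print file:line:match for every literal 100.64.0.0/10 address in the files.
# Second octet 64..127 is spelled out as four alternatives.
# Exit status is 1 when nothing matches, as with plain grep.
find_tailnet_ips() {
  grep -HnoE '100\.(6[4-9]|[7-9][0-9]|1[01][0-9]|12[0-7])\.[0-9]{1,3}\.[0-9]{1,3}' "$@"
}

# Example: find_tailnet_ips docker-compose.yml .env
```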

## 7. Don't trust crt.sh for rate-limit checks

crt.sh was flaky/empty all session. To gauge Let's Encrypt's weekly
duplicate-cert limit, use **certspotter** instead:
`https://api.certspotter.com/v1/issuances?domain=<d>&include_subdomains=true`.
Also: LE limits are scoped differently. **Failed validations** are counted
hourly (5/hr/host, the one a retry storm trips), while **issued duplicates** are
counted weekly (5/wk). A renewal task hammering every 10 min trips the hourly
one; consolidate to a single task.

## 8. The Stalwart container has no IPv6 — AAAA targets fail before IPv4 is tried

**Symptom:** Outbound delivery (and relay-to-smarthost) to any host with an
AAAA record fails with `I/O error: Network is unreachable (os error 101)`.
Hosts that are IPv4-only deliver fine. Pointing a relay at a *hostname* that
has both A and AAAA fails; pointing it at the raw IPv4 works.

**Cause:** Stalwart shares the `ts-stalwart` sidecar's netns, which has no
global IPv6. When it resolves a dual-stack target it tries the AAAA first,
gets `ENETUNREACH` immediately, and for a **relay next-hop it does not fall
back to the A record** — it just records the v6 failure and backs off. So a
single missing address family wedges all mail to dual-stack destinations.

**Fix:** Either (a) pin the relay/smarthost `address` to an **IPv4 literal**
(no AAAA to trip on), or (b) give the container real IPv6. Note that relaying
over the **tailnet** sidesteps this entirely — you connect to a tailnet
`100.x` address, which has no AAAA, so the v6-first trap never triggers.

> **RESOLVED (2026-06-11) — option (b) is now done.** The container has real
> IPv6 egress; this trap no longer fires. See Lesson 9's fix for how.
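
Option (a) can still be guarded with a small check that the configured relay address really is an IPv4 literal and not a hostname that might carry an AAAA. A sketch with a hypothetical function name; it validates shape and octet range only, and `203.0.113.10` is a documentation address, not a real smarthost.

```bash
# Returns 0 only for a dotted-quad IPv4 literal with every octet in 0..255.
# Runs in a subshell so the IFS change and `exit` stay local to the function.
is_ipv4_literal() (
  case "$1" in
    ''|*[!0-9.]*|.*|*.|*..*) exit 1 ;;   # empty, non-digit, or empty octet
  esac
  IFS=.
  set -- $1                               # split on dots into $1..$n
  [ "$#" -eq 4 ] || exit 1
  for octet in "$@"; do
    [ "${#octet}" -le 3 ] && [ "$octet" -le 255 ] || exit 1
  done
)

# Example: is_ipv4_literal 203.0.113.10 || echo "relay address is not an IPv4 literal"
```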

## 9. Configuring IPv6 on the KVM host does NOT give the container IPv6

**Symptom:** `ip -6 addr` and `ping6 google.com` succeed on the KVM host, but
Stalwart still dies with `os error 101` on AAAA targets, and the box is still
a broken IPv6 Tailscale exit node.

**Cause:** The host's `eth0` and the container/sidecar netns are separate
network stacks. Adding the provider's `/64` to `eth0` (ifupdown `inet6 static`
+ `onlink` default route, since the gateway is in a different /64) fixes the
*host*, not the container. Docker doesn't hand IPv6 to containers by default,
and the sidecar routes via Tailscale, not eth0.

**Fix:** Don't assume host IPv6 = container IPv6. Test from *inside* the
container's netns. For mail egress, the IPv4-literal relay (Lesson 8) or the
tailnet relay avoids needing container IPv6 at all. Enabling true container
IPv6 (Docker IPv6 + routing the /64 in) is a separate, larger task.

**RESOLVED (2026-06-11) — the easy way, no /64 routing or ndppd.** Because the
container only needs IPv6 **egress** (inbound arrives via the edge/tailnet,
never v6), you don't need a routable prefix or NDP proxy at all — just a **ULA
subnet + masquerade**, exactly like Docker does for v4:
```yaml
# docker-compose.yml
networks:
  default:
    enable_ipv6: true
    ipam:
      config:
        - subnet: fd00:7a17:600d::/64
          gateway: fd00:7a17:600d::1
```
Docker 29 enables `ip6tables` by default and masquerades the ULA out the host's
global v6, so the sidecar netns (shared by Stalwart via `network_mode`) gets a
working v6 default route with **zero host sysctl/daemon changes** (host
`net.ipv6.conf.all.forwarding` was already 1 from the static-v6 setup). Verify
from *inside* the netns: `ping6 google.com` + a TCP connect to a v6 literal on
:443. Recreating the network (`docker compose down && docker compose up -d`)
bounces the stack, and the ephemeral sidecar gets a new tailnet IP; MagicDNS
covers it (Lesson 6), and
the MTA route table rebuilds anyway (Lesson 12). This does **not** give inbound
v6; for that you'd still publish AAAA + make the edge listen on v6 (separate).

## 10. The VPS blocks ALL outbound SMTP ports — relay over the tailnet

**Symptom:** Direct MX delivery and relay-to-public-host both fail with
`Connection timed out (os error 110)`, and the SYN never arrives at the
destination. Not just port 25 — `465`, `587`, even alt-port `2525` all time out.

**Cause:** The KVM provider blocks all outbound SMTP ports, relay and submission
alike, to prevent spam. Only non-SMTP ports (`443`, etc.) egress. Confirmed with:
```bash
for p in 25 465 587 2525 443; do
  timeout 5 bash -c "exec 3<>/dev/tcp/<dst>/$p" && echo "$p OPEN" || echo "$p blocked"
done
# 443 OPEN; every SMTP port times out
```

**Fix:** Relay over the **tailnet**. Tailscale rides WireGuard (UDP 41641) or,
as a fallback, DERP relays (TCP 443), so it's immune to SMTP port filtering.
Point the relay at the smarthost's **tailnet IP** (e.g. `100.x:587`), not its
public address. Long-term: ask the provider to unblock outbound 25/587 for
verified use.

## 11. The sidecar can RECEIVE on the tailnet but can't INITIATE without an ACL grant

**Symptom:** The relay to `<mailbox-tailnet-ip>:587` times out (`os error 110`),
yet the **KVM host** (same physical machine) can reach that exact IP:port over
the tailnet fine. Looks like a routing or transparent-proxy bug.

**Cause:** The Stalwart container rides the `ts-stalwart` sidecar — a **separate
tailnet node** (`tag:stalwart`) from the KVM host. The `tailwart` ACL block only
listed `tag:stalwart` as a **destination** (`"dst": ["tag:stalwart"]`). Tailnet
is default-deny, so the sidecar could receive connections but could not
*initiate* the relay back to the mailbox → silent drop → timeout. The KVM host
worked because it's a different, permitted identity, which masked the real cause.

**Fix:** Add an ACL rule granting `tag:stalwart` as a **source**:
```json
{ "src": ["tag:stalwart"], "dst": ["tag:mail"], "ip": ["tcp:587"] }
```
(mailbox is `tag:mail`). Applies in seconds, no restart. See `acl-snippet.hujson`.

## 12. Stalwart only rebuilds its MTA route table at container startup

**Symptom:** You edit an `MtaRoute` (address, etc.) via API/UI, but delivery keeps
using the old value. The datastore shows the new value; live delivery ignores it.

**Cause:** The `routing_strategy` map is built once when the process boots. The
`ReloadSettings` action reloads the datastore but does **not** rebuild the SMTP
route map. So route/strategy changes are invisible until restart.

**Fix:** After any `MtaRoute` / `MtaOutboundStrategy` change,
`docker restart tailwart-stalwart-1`. (Side effect: the ephemeral sidecar gets a
new tailnet IP each restart — anything addressing it by IP must rediscover it;
use the MagicDNS name where possible.)

## 13. "Did Stalwart eat my custom DNS records?" — no; Spaceship is RRSet-upsert

**Symptom:** A manually-added record (e.g. an `AAAA` for the apex/`mail`) is
gone from the zone, and the suspicion is that Stalwart's ACME DNS-01 integration
overwrote it on a renewal.

**Cause:** Almost never Stalwart. Its **only** DNS-provider writes are
`_acme-challenge.<name>` TXT (the rotating challenge) and `_validation-persist`
TXT (the LE account-pinned persistent-validation record). It does **not** create
or modify A/AAAA/MX/SRV — those you add yourself from its "recommended records"
page. And the Spaceship API is **RRSet-upsert keyed by (name, type)**, not a
whole-zone replace: a `PUT /api/v1/dns/records/{domain}` with
`{"force":true,"items":[…]}` only touches the RRSets named in `items`. Proof:
25 unrelated records coexist untouched through every rotating `_acme-challenge`
write; and adding one apex `AAAA` left the other 25 exactly intact (25→26).

So a vanished AAAA is far more likely a **provider-side loss/rollback** (e.g.
during a data-center DDoS) or a manual edit — not Stalwart.

**How to inspect / verify (read-only), creds in `.env`:**
```bash
# -f2- keeps the whole value even if it contains '=' characters
KEY=$(grep '^SPACESHIP_KEY=' .env | cut -d= -f2-)
SECRET=$(grep '^SPACESHIP_SECRET=' .env | cut -d= -f2-)
curl -s "https://spaceship.dev/api/v1/dns/records/<domain>?take=100&skip=0" \
  -H "X-Api-Key: $KEY" -H "X-Api-Secret: $SECRET" | python3 -m json.tool
```
To add a record, `PUT` the same endpoint with a single-item `items` array — it
won't disturb siblings. **Snapshot the zone (GET) before any write** and diff
after; snapshots land in `_backup/` (gitignored). Always re-check at the
authoritative NS (`dig +short AAAA <name> @launch1.spaceship.net`), not a cache.
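
The snapshot-then-diff habit can be sketched as two helpers. This assumes `KEY` and `SECRET` are already set as in the read-only example and that the zone fits in one `take=100` page; the function names and the example file names are hypothetical.

```bash
# Save a pretty-printed, key-sorted copy of the zone so snapshots diff cleanly.
# Arguments: domain, output file. Reads KEY and SECRET from the environment.
snapshot_zone() {
  curl -s "https://spaceship.dev/api/v1/dns/records/$1?take=100&skip=0" \
    -H "X-Api-Key: $KEY" -H "X-Api-Secret: $SECRET" \
    | python3 -m json.tool --sort-keys > "$2"
}

# Print a unified diff of two snapshots; exit status 0 means nothing changed.
zone_diff() {
  diff -u "$1" "$2"
}

# Example:
#   snapshot_zone <domain> _backup/zone-before.json
#   ...perform the single-item PUT...
#   snapshot_zone <domain> _backup/zone-after.json
#   zone_diff _backup/zone-before.json _backup/zone-after.json
```

Sorting keys before writing keeps the diff limited to real record changes rather than field-order noise in the API response.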

**Caveat — don't publish `mail` AAAA before the edge listens on v6.** Inbound
mail follows `MX → mail.<domain>`; an `AAAA` there with no v6 `:25` listener on
the edge makes senders try v6 and some won't fall back → deferred/bounced mail.
An **apex** `AAAA` is safe (it doesn't affect MX routing). Do `mail` AAAA + edge
v6 listeners together.