LESSONS: shared-infra readiness (#16 boot-order) + flapping consumer (#17 atuin)

#16 Stalwart-before-Garage on reboot → S3-backed admin SPA 404'd (not a boot
loop). Gate every app on backend *liveness* (depends_on service_healthy +
probe PG/Redis/Garage over the tailnet), don't assume shared infra boots first.

#17 atuin crash-looped 6318x (exit 1) and looked like a Postgres problem;
Postgres was healthy and atuin never even connected. PG health != consumer
health — check RestartCount and pg_stat_activity client_addr churn; confirm a
consumer's creds/reachability before restart:always.

Both generalize to federatedSocial (shared PG/Redis/Garage = blast radius).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Wayne Hayes 2026-06-12 23:17:05 -04:00
parent 5d162884e8
commit 1e5fc982eb

@@ -308,3 +308,50 @@ signature, not republish anything. Two parts:
`mail.tail7b1641.ts.net` (auth `stalwart-relay@waynehayes.com`), which re-emits
as `mail.waynehayes.com` (`216.189.156.74` / `2602:ffc5:20::1:6b52`). That relay
IP is why SPF needed `include:waynehayes.com` (#14 / the SPF fix).
## 16. After a reboot, Stalwart started before Garage — admin site 404'd (NOT a boot loop)
**Symptom:** Post-reboot, the Stalwart web admin / app assets wouldn't load (404 /
blank), even though the container was `running` and **not** restart-looping.
**Cause:** the web UI (and other app assets) live in the **S3 blob store (Garage)**;
Stalwart unpacks/serves them from S3. On reboot Stalwart came up *before* Garage was
ready, so the asset fetch failed. Stalwart itself was fine (PG connected, listeners up);
only the S3-backed content was missing. Easy to misread as "Stalwart is broken."
**Fix:** once Garage is up, restart Stalwart (or let it pick the assets up on its next fetch).
Quick way to confirm it's a backend-readiness issue and not Stalwart itself: if the container is
`running`+`healthy` but assets 404, probe the backend from the sidecar (`nc -z garage.<tailnet> 3900`).
**Rule for the whole fleet (federatedSocial):** every app must gate on its backends being
**live, not merely present**. Model it on the Stalwart sidecar's healthcheck:
`depends_on: { <backend>: { condition: service_healthy } }` plus a check that actually *probes* PG/Redis/
Garage over the tailnet (see #1, the PG-startup-race healthcheck). Don't assume shared
infra boots first; make it a startup-ordering/readiness convention across all sidecars.
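A minimal sketch of the "probe, don't assume" half of that rule, as a script a sidecar healthcheck could call. It swaps the `nc -z` probe for bash's built-in `/dev/tcp` so it has no dependency beyond bash and coreutils `timeout`. The function names and the example hostnames are placeholders for illustration, not the fleet's real values.

```bash
#!/usr/bin/env bash
# Sketch only: report healthy when every listed backend accepts a TCP connection.

# Succeeds only if host:port accepts a TCP connection within $3 seconds (default 2).
probe_tcp() {
  local host=$1 port=$2 wait_s=${3:-2}
  # Inside `bash -c`, $0 is the host and $1 is the port.
  timeout "$wait_s" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Probes each "host:port" argument; names the ones that are down on stderr
# and returns 1 if any probe failed, 0 otherwise.
check_backends() {
  local target rc=0
  for target in "$@"; do
    if ! probe_tcp "${target%:*}" "${target##*:}"; then
      echo "backend not ready: $target" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Example call (hypothetical hostnames):
#   check_backends pg.example.ts.net:5432 redis.example.ts.net:6379 garage.example.ts.net:3900
```

An open TCP port only shows the listener is up, not that the service can answer queries, so a stricter check (for example `pg_isready` for Postgres) may be worth layering on top where the client tools are available.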
## 17. A flapping shared-store consumer (atuin) looked like a Postgres problem
**Symptom:** "Postgres seems to be the cause / unstable." Actually `atuin-server` had
**RestartCount 6318, exit 1** — crash-looping for days and generating all the noise.
**Cause:** atuin couldn't reach/authenticate its DB and crash-looped under
`restart: unless-stopped`. **Postgres itself was healthy** (6 days up, 0 restarts,
17/100 conns). atuin never even established a connection — *no* atuin lines in the PG log
and *no* atuin rows in `pg_stat_activity` — i.e. it was dying **before** reaching PG.
**Diagnosis (fast):**
```bash
# which container is actually flapping (PG health != consumer health):
docker inspect <c> --format '{{.RestartCount}} exit={{.State.ExitCode}} oom={{.State.OOMKilled}}'
# is a consumer reconnect-storming the shared store? distinct/ghost client_addr = churn:
docker exec <pg> psql -U postgres -tAc \
"SELECT client_addr, state, count(*) FROM pg_stat_activity GROUP BY 1,2 ORDER BY 1"
```
Ephemeral sidecar nodes get a **new tailnet IP per restart**, so successive incarnations
leave **ghost idle connections** from dead IPs — a handy "how many times did it restart"
fingerprint (we saw this with Stalwart too: 1 live IP + 2 ghosts).
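To turn that fingerprint into a number, the three-column `client_addr|state|count` output of the `pg_stat_activity` query can be reduced to a count of distinct client addresses per state. This is a sketch under that assumption; `summarize_addrs` is a hypothetical helper name, and rows with an empty `client_addr` (local-socket and background sessions) are skipped.

```bash
# Reads `psql -tA` rows of the form client_addr|state|count on stdin and
# prints "<state> <number of distinct client addresses>" per state, sorted.
summarize_addrs() {
  awk -F'|' '
    # Count each (state, address) pair once; ignore rows without an address.
    $1 != "" && !seen[$2, $1]++ { n[$2]++ }
    END { for (s in n) printf "%s %d\n", s, n[s] }
  ' | sort
}

# Example: pipe the diagnosis query straight in (<pg> is the Postgres container):
#   docker exec <pg> psql -U postgres -tAc \
#     "SELECT client_addr, state, count(*) FROM pg_stat_activity GROUP BY 1,2" | summarize_addrs
```

Because the query is not filtered by role, the counts cover every consumer of the shared instance; adding `usename` to the query would be needed to attribute ghosts to one consumer.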
**Rule for the whole fleet:** a shared Postgres/Redis/Garage is a blast-radius surface —
one misconfigured consumer shouldn't be mistaken for a shared-infra outage. Confirm a
consumer's creds + backend reachability **before** enabling `restart: always` or `restart: unless-stopped`,
and when something "looks like the DB," check the *consumers* first.
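One way to "check the consumers first" is to scan every container's restart count rather than inspecting them one at a time. The sketch below assumes input lines of the form `name restart_count`; `flag_flappers` and the default threshold of 5 are illustrative choices, not an established convention.

```bash
# Reads "name restart_count" lines on stdin and prints the names whose
# restart count is at or above the threshold (first argument, default 5).
flag_flappers() {
  local threshold=${1:-5}
  awk -v t="$threshold" '$2 + 0 >= t + 0 { print $1 }'
}

# Example: feed it from Docker (names are printed with a leading "/"):
#   docker ps -aq | xargs -r docker inspect --format '{{.Name}} {{.RestartCount}}' | flag_flappers 5
```

For the other half of the rule, a one-off connection test with the consumer's own credentials, such as `PGCONNECT_TIMEOUT=5 psql "$DB_URI" -tAc 'SELECT 1'` (with `$DB_URI` standing in for whatever connection string the consumer uses), would surface bad creds or an unreachable backend before a restart policy can turn them into a crash loop.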