LESSONS: shared-infra readiness (#16 boot-order) + flapping consumer (#17 atuin)

#16 Stalwart-before-Garage on reboot → S3-backed admin SPA 404'd (not a boot loop). Gate every app on backend *liveness* (depends_on service_healthy + probe PG/Redis/Garage over the tailnet); don't assume shared infra boots first.

#17 atuin crash-looped 6318x (exit 1) and looked like a Postgres problem; Postgres was healthy and atuin never even connected. PG health != consumer health: check RestartCount and pg_stat_activity client_addr churn; confirm a consumer's creds/reachability before restart:always.

Both generalize to federatedSocial (shared PG/Redis/Garage = blast radius).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-12 23:17:05 -04:00 · 1e5fc982eb
commit 1e5fc982eb
parent 5d162884e8
1 changed file with 47 additions and 0 deletions
--- a/LESSONS.md
+++ b/LESSONS.md
@@ -308,3 +308,50 @@ signature, not republish anything. Two parts:
 `mail.tail7b1641.ts.net` (auth `stalwart-relay@waynehayes.com`), which re-emits
 as `mail.waynehayes.com` (`216.189.156.74` / `2602:ffc5:20::1:6b52`). That relay
 IP is why SPF needed `include:waynehayes.com` (#14 / the SPF fix).
+
+## 16. After a reboot, Stalwart started before Garage — admin site 404'd (NOT a boot loop)
+
+**Symptom:** Post-reboot, the Stalwart web admin / app assets wouldn't load (404 /
+blank), even though the container was `running` and **not** restart-looping.
+
+**Cause:** the web UI (and other app assets) live in the **S3 blob store (Garage)** —
+Stalwart unpacks/serves them from S3. On reboot Stalwart came up *before* Garage was
+ready, so the asset fetch failed. Stalwart itself was fine (PG connected, listeners up);
+only the S3-backed content was missing. Easy to misread as "Stalwart is broken."
+
+**Fix:** once Garage is up, restart Stalwart (or let it pick the assets up on its next fetch).
+Quick check that it's backend readiness, not Stalwart: `running` + `healthy` but assets
+404 → probe the backend from the sidecar (`nc -z garage.<tailnet> 3900`).
+
+**Rule for the whole fleet (federatedSocial):** every app must gate on its backends being
+**live, not merely present**. Model it on the Stalwart sidecar's healthcheck —
+`depends_on: { <backend>: service_healthy }` plus a check that actually *probes* PG/Redis/
+Garage over the tailnet (see #1, the PG-startup-race healthcheck). Don't assume shared
+infra boots first; make it a startup-ordering/readiness convention across all sidecars.
+
+## 17. A flapping shared-store consumer (atuin) looked like a Postgres problem
+
+**Symptom:** "Postgres seems to be the cause / unstable." Actually `atuin-server` had
+**RestartCount 6318, exit 1** — crash-looping for days and generating all the noise.
+
+**Cause:** atuin couldn't reach/authenticate its DB and crash-looped under
+`restart: unless-stopped`. **Postgres itself was healthy** (6 days up, 0 restarts,
+17/100 conns). atuin never even established a connection — *no* atuin lines in the PG log
+and *no* atuin rows in `pg_stat_activity` — i.e. it was dying **before** reaching PG.
+
+**Diagnosis (fast):**
+```bash
+# which container is actually flapping (PG health != consumer health):
+docker inspect <c> --format '{{.RestartCount}} exit={{.State.ExitCode}} oom={{.State.OOMKilled}}'
+# is a consumer reconnect-storming the shared store? distinct/ghost client_addr = churn:
+docker exec <pg> psql -U postgres -tAc \
+  "SELECT client_addr, state, count(*) FROM pg_stat_activity GROUP BY 1,2 ORDER BY 1"
+```
+Ephemeral sidecar nodes get a **new tailnet IP per restart**, so successive incarnations
+leave **ghost idle connections** from dead IPs — a handy "how many times did it restart"
+fingerprint (we saw this with Stalwart too: 1 live IP + 2 ghosts).
+
+**Rule for the whole fleet:** a shared Postgres/Redis/Garage is a blast-radius surface;
+one misconfigured consumer shouldn't be mistaken for a shared-infra outage. Confirm a
+consumer's creds + backend reachability **before** enabling `restart: always` or
+`restart: unless-stopped`, and when something "looks like the DB," check the *consumers* first.
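
A minimal sketch of the readiness gate from #16 (and the reachability half of the #17 rule), written as a POSIX shell wrapper. The function names `probe_backend` and `wait_for_backend`, the retry count, and the delay are illustrative assumptions, not the Stalwart sidecar's actual healthcheck. The `nc -z` probe mirrors the one-off check quoted in #16; the `-w` timeout flag is assumed to be supported by whichever `nc` build ships in the sidecar image.

```bash
#!/bin/sh
# Hypothetical readiness gate: block until a backend accepts TCP connections.

# One probe attempt; succeeds if host:port accepts a TCP connection.
# Assumes an nc that supports -z (scan only) and -w (timeout in seconds).
probe_backend() {
  nc -z -w 2 "$1" "$2" >/dev/null 2>&1
}

# wait_for_backend HOST PORT [TRIES] [DELAY_SECONDS]
# Returns 0 as soon as one probe succeeds, 1 if every attempt fails.
wait_for_backend() {
  wfb_host=$1
  wfb_port=$2
  wfb_tries=${3:-30}   # assumed default: 30 attempts
  wfb_delay=${4:-2}    # assumed default: 2 seconds between attempts
  wfb_i=0
  while [ "$wfb_i" -lt "$wfb_tries" ]; do
    if probe_backend "$wfb_host" "$wfb_port"; then
      return 0
    fi
    wfb_i=$((wfb_i + 1))
    sleep "$wfb_delay"
  done
  echo "backend $wfb_host:$wfb_port not reachable after $wfb_tries tries" >&2
  return 1
}
```

As an entrypoint wrapper this might be called as `wait_for_backend garage.<tailnet> 3900 && exec "$@"`, so the app only starts once Garage answers. A TCP connect shows only that the port is open, not that credentials work, so the creds check from #17 would still need its own step before turning on an aggressive restart policy.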