diff --git a/LESSONS.md b/LESSONS.md
index 368ef76..bc9dc21 100644
--- a/LESSONS.md
+++ b/LESSONS.md
@@ -308,3 +308,50 @@ signature, not republish anything. Two parts:
 `mail.tail7b1641.ts.net` (auth `stalwart-relay@waynehayes.com`), which re-emits as
 `mail.waynehayes.com` (`216.189.156.74` / `2602:ffc5:20::1:6b52`). That relay IP is why
 SPF needed `include:waynehayes.com` (#14 / the SPF fix).
+
+## 16. After a reboot, Stalwart started before Garage: admin site 404'd (NOT a boot loop)
+
+**Symptom:** Post-reboot, the Stalwart web admin / app assets wouldn't load (404 /
+blank), even though the container was `running` and **not** restart-looping.
+
+**Cause:** the web UI (and other app assets) live in the **S3 blob store (Garage)**;
+Stalwart unpacks/serves them from S3. On reboot Stalwart came up *before* Garage was
+ready, so the asset fetch failed. Stalwart itself was fine (PG connected, listeners up);
+only the S3-backed content was missing. Easy to misread as "Stalwart is broken."
+
+**Fix:** once Garage is up, restart Stalwart (or it picks them up on the next fetch).
+Quick confirm it's a backend-readiness issue, not Stalwart: `running`+`healthy` but assets
+404 → probe the backend from the sidecar (`nc -z garage. 3900`).
+
+**Rule for the whole fleet (federatedSocial):** every app must gate on its backends being
+**live, not merely present**. Model it on the Stalwart sidecar's healthcheck:
+`depends_on: { : service_healthy }` plus a check that actually *probes* PG/Redis/
+Garage over the tailnet (see #1, the PG-startup-race healthcheck). Don't assume shared
+infra boots first; make it a startup-ordering/readiness convention across all sidecars.
+
+## 17. A flapping shared-store consumer (atuin) looked like a Postgres problem
+
+**Symptom:** "Postgres seems to be the cause / unstable." Actually `atuin-server` had
+**RestartCount 6318, exit 1**, crash-looping for days and generating all the noise.
+
+**Cause:** atuin couldn't reach/authenticate its DB and crash-looped under
+`restart: unless-stopped`. **Postgres itself was healthy** (6 days up, 0 restarts,
+17/100 conns). atuin never even established a connection (*no* atuin lines in the PG log
+and *no* atuin rows in `pg_stat_activity`), i.e. it was dying **before** reaching PG.
+
+**Diagnosis (fast):**
+```bash
+# which container is actually flapping (PG health != consumer health):
+docker inspect --format '{{.RestartCount}} exit={{.State.ExitCode}} oom={{.State.OOMKilled}}'
+# is a consumer reconnect-storming the shared store? distinct/ghost client_addr = churn:
+docker exec psql -U postgres -tAc \
+  "SELECT client_addr, state, count(*) FROM pg_stat_activity GROUP BY 1,2 ORDER BY 1"
+```
+Ephemeral sidecar nodes get a **new tailnet IP per restart**, so successive incarnations
+leave **ghost idle connections** from dead IPs, a handy "how many times did it restart"
+fingerprint (we saw this with Stalwart too: 1 live IP + 2 ghosts).
+
+**Rule for the whole fleet:** a shared Postgres/Redis/Garage is a blast-radius surface;
+one misconfigured consumer shouldn't be mistaken for a shared-infra outage. Confirm a
+consumer's creds + backend reachability **before** enabling `restart: always/unless-stopped`,
+and when something "looks like the DB," check the *consumers* first.
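
Review note (not part of the patch): the "backend reachability" half of the rule in #17, and the "live, not merely present" gate in #16, could be sketched as a small preflight script that a consumer runs before its main process. This is only a sketch under assumptions: `probe_tcp` and `wait_for_backends` are hypothetical helper names, it relies on bash's `/dev/tcp` pseudo-device and the coreutils `timeout` command being available in the image, and a successful TCP connect says nothing about credentials, which still need their own check.

```bash
#!/usr/bin/env bash
# Hypothetical preflight helpers: succeed only if each backend accepts a TCP connection.
# Host:port values passed in are placeholders, not the fleet's real endpoints.

# probe_tcp HOST PORT [TIMEOUT_SECONDS]
# Returns 0 if a TCP connection to HOST:PORT can be opened within the timeout.
probe_tcp() {
  local host=$1 port=$2 timeout_s=${3:-2}
  # bash opens a TCP socket when redirecting to /dev/tcp/HOST/PORT;
  # `timeout` bounds the attempt so a black-holed address cannot hang startup.
  timeout "$timeout_s" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# wait_for_backends ATTEMPTS DELAY_SECONDS HOST:PORT...
# Probes each backend in turn, retrying up to ATTEMPTS times with DELAY_SECONDS between tries.
# Returns 1 as soon as one backend exhausts its attempts.
wait_for_backends() {
  local attempts=$1 delay=$2
  shift 2
  local target try
  for target in "$@"; do
    try=0
    # ${target%:*} is the host part, ${target##*:} is the port part.
    until probe_tcp "${target%:*}" "${target##*:}"; do
      try=$((try + 1))
      if [ "$try" -ge "$attempts" ]; then
        echo "backend not reachable: $target" >&2
        return 1
      fi
      sleep "$delay"
    done
  done
}
```

A consumer's entrypoint might then call something like `wait_for_backends 30 2 garage.example:3900 pg.example:5432 && exec "$@"` (both addresses are made-up placeholders), so the container exits with a clear "backend not reachable" message instead of crash-looping quietly under a restart policy.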