#16 Stalwart-before-Garage on reboot → S3-backed admin SPA 404'd (not a boot loop).
Gate every app on backend *liveness* (depends_on service_healthy + probe
PG/Redis/Garage over the tailnet), don't assume shared infra boots first.

#17 atuin crash-looped 6318x (exit 1) and looked like a Postgres problem; Postgres
was healthy and atuin never even connected. PG health != consumer health: check
RestartCount and pg_stat_activity client_addr churn; confirm a consumer's
creds/reachability before restart:always.

Both generalize to federatedSocial (shared PG/Redis/Garage = blast radius).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

parent 5d162884e8
commit 1e5fc982eb
47 LESSONS.md
@@ -308,3 +308,50 @@ signature, not republish anything. Two parts:
`mail.tail7b1641.ts.net` (auth `stalwart-relay@waynehayes.com`), which re-emits
as `mail.waynehayes.com` (`216.189.156.74` / `2602:ffc5:20::1:6b52`). That relay
IP is why SPF needed `include:waynehayes.com` (#14 / the SPF fix).

## 16. After a reboot, Stalwart started before Garage: admin site 404'd (NOT a boot loop)

**Symptom:** Post-reboot, the Stalwart web admin / app assets wouldn't load (404 /
blank), even though the container was `running` and **not** restart-looping.

**Cause:** the web UI (and other app assets) lives in the **S3 blob store (Garage)**;
Stalwart unpacks/serves them from S3. On reboot Stalwart came up *before* Garage was
ready, so the asset fetch failed. Stalwart itself was fine (PG connected, listeners up);
only the S3-backed content was missing. Easy to misread as "Stalwart is broken."

**Fix:** once Garage is up, restart Stalwart (or wait: it picks the assets up on the next fetch).
Quick confirm it's a backend-readiness issue, not Stalwart: if the container is
`running`+`healthy` but assets 404, probe the backend from the sidecar
(`nc -z garage.<tailnet> 3900`).

**Rule for the whole fleet (federatedSocial):** every app must gate on its backends being
**live, not merely present**. Model it on the Stalwart sidecar's healthcheck:
`depends_on: { <backend>: { condition: service_healthy } }` plus a check that actually
*probes* PG/Redis/Garage over the tailnet (see #1, the PG-startup-race healthcheck). Don't
assume shared infra boots first; make it a startup-ordering/readiness convention across
all sidecars.

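A minimal sketch of the "probe, don't assume" gate, assuming bash and coreutils `timeout` exist in the sidecar image. `probe_backend` and `wait_for_backends` are hypothetical helper names, not something Stalwart or the existing healthcheck ships. It uses bash's `/dev/tcp` rather than `nc` so it needs no extra package.

```bash
#!/usr/bin/env bash
# Sketch only: block until every backend accepts a TCP connection, or give up.
# Helper names are hypothetical; hosts/ports are whatever the caller passes in.

# probe_backend HOST PORT [TIMEOUT_SECS]
# Succeeds only if a TCP connect to HOST:PORT works within the timeout.
probe_backend() {
  local host=$1 port=$2 secs=${3:-2}
  # /dev/tcp/HOST/PORT is a bash pseudo-device; `timeout` bounds a hung connect.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# wait_for_backends DEADLINE_SECS HOST:PORT...
# Retries each target once per second until all are live or the deadline passes.
wait_for_backends() {
  local deadline=$(( SECONDS + $1 )); shift
  local target
  for target in "$@"; do
    until probe_backend "${target%:*}" "${target##*:}"; do
      if (( SECONDS >= deadline )); then
        echo "backend not live: $target" >&2
        return 1
      fi
      sleep 1
    done
  done
}
```

Usage might look like `wait_for_backends 60 "pg.<tailnet>:5432" "redis.<tailnet>:6379" "garage.<tailnet>:3900"` before exec'ing the app. Only 3900 comes from these notes; the PG/Redis ports are the stock defaults and an assumption here. A TCP connect proves the port is listening, not that the service is ready to serve, so treat this as a floor and keep the real healthcheck from #1.
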
## 17. A flapping shared-store consumer (atuin) looked like a Postgres problem

**Symptom:** "Postgres seems to be the cause / unstable." Actually `atuin-server` had
**RestartCount 6318, exit 1**: it had been crash-looping for days and generating all the noise.

**Cause:** atuin couldn't reach/authenticate its DB and crash-looped under
`restart: unless-stopped`. **Postgres itself was healthy** (6 days up, 0 restarts,
17/100 conns). atuin never even established a connection (*no* atuin lines in the PG log
and *no* atuin rows in `pg_stat_activity`), i.e. it was dying **before** reaching PG.

**Diagnosis (fast):**
```bash
# which container is actually flapping (PG health != consumer health):
docker inspect <c> --format '{{.RestartCount}} exit={{.State.ExitCode}} oom={{.State.OOMKilled}}'
# is a consumer reconnect-storming the shared store? distinct/ghost client_addr = churn:
docker exec <pg> psql -U postgres -tAc \
  "SELECT client_addr, state, count(*) FROM pg_stat_activity GROUP BY 1,2 ORDER BY 1"
```
Ephemeral sidecar nodes get a **new tailnet IP per restart**, so successive incarnations
leave **ghost idle connections** from dead IPs: a handy "how many times did it restart"
fingerprint (we saw this with Stalwart too: 1 live IP + 2 ghosts).

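To read the ghost fingerprint off the `pg_stat_activity` query, a small filter can roll the rows up per address. This is a sketch: `summarize_clients` is a hypothetical helper that assumes the `client_addr|state|count` layout `psql -tA` emits for that query. An all-idle address is only a ghost *candidate*, since a live pooled client can look the same.

```bash
#!/usr/bin/env bash
# Sketch only: summarize_clients is a hypothetical helper name.
# Reads "client_addr|state|count" rows (psql -tA output) on stdin and prints
# one "addr total idle" line per client address, sorted by address text.
summarize_clients() {
  awk -F'|' '
    # Rows with an empty client_addr (local sockets, background workers) are skipped.
    $1 != "" { total[$1] += $3; if ($2 == "idle") idle[$1] += $3 }
    END { for (a in total) printf "%s %d %d\n", a, total[a], idle[a] }
  ' | sort
}
```

Pipe the `docker exec <pg> psql ... -tAc` query into it. Addresses whose `total` equals `idle` and that don't match any current sidecar IP are the likely dead incarnations.
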
**Rule for the whole fleet:** a shared Postgres/Redis/Garage is a blast-radius surface;
one misconfigured consumer shouldn't be mistaken for a shared-infra outage. Confirm a
consumer's creds + backend reachability **before** enabling `restart: always` or
`restart: unless-stopped`, and when something "looks like the DB," check the *consumers* first.

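"Check the consumers first" can be made a one-liner by ranking every container by restart count. A sketch, with `rank_restarts` as a hypothetical helper and the default threshold of 5 an arbitrary assumption; the parsing is kept separate from Docker so it can be tried on pasted output.

```bash
#!/usr/bin/env bash
# Sketch only: rank_restarts is a hypothetical helper name.
# Reads "name restart_count exit_code" rows on stdin and prints the rows whose
# restart_count is at or above THRESHOLD (default 5), highest count first.
rank_restarts() {
  local threshold=${1:-5}
  awk -v t="$threshold" '$2 >= t' | sort -k2,2nr
}

# Feeding it from Docker (needs a running daemon; xargs -r is GNU xargs):
# docker ps -aq | xargs -r docker inspect \
#   --format '{{.Name}} {{.RestartCount}} {{.State.ExitCode}}' | rank_restarts 5
```

In the atuin incident this would have put `atuin-server` (6318 restarts, exit 1) at the top while Postgres, at 0 restarts, would not have appeared at all.
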