- Always check the exit code of `docker run` in shell scripts: it returns the exit code of the container's command, or 125 to 127 when Docker itself could not start or invoke that command. Don't swallow failures with `|| true`.
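- Use `docker compose up --exit-code-from <service>` in CI to propagate the test container's exit code to the CI job; the flag implies `--abort-on-container-exit`, so the remaining services are stopped once any container exits.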
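- Add `HEALTHCHECK` to every long-running service image: `HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 CMD curl -sf http://localhost:8080/health || exit 1`. This requires `curl` in the image. Docker itself only reports the resulting health status; Compose can gate startup on it via `depends_on` with `condition: service_healthy`, and Swarm replaces unhealthy tasks, but a standalone container is not restarted just because it is unhealthy.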
- Configure restart policies in `docker-compose.yml`: `restart: unless-stopped` for daemons and `restart: on-failure:5` for jobs, where the number caps restart attempts so a crashing job cannot loop forever. Avoid `restart: always` in dev, since automatic restarts make crashes harder to debug.
- Emit structured JSON logs to stdout/stderr — never write logs to files inside the container: `docker logs` captures stdout/stderr, not arbitrary file paths.
- Use the `--init` flag (`init: true` in compose) to run a minimal init process as PID 1 that reaps zombie child processes and forwards signals to your application; this prevents PID 1 signal-handling bugs.
- In multi-stage Dockerfiles, validate the binary works with a `RUN` smoke test before the final stage: `RUN /app/server --version` catches build-time failures early.
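
The exit-code advice above can be sketched as a small shell wrapper. This is a minimal illustration, not a standard tool: `run_checked` is a hypothetical helper name, and the messages for 125 to 127 assume the wrapped command is `docker run`, whose documentation reserves those codes for failures of Docker itself.

```shell
#!/usr/bin/env bash
set -euo pipefail

# run_checked is a hypothetical helper, not part of Docker: it runs the given
# command, explains a non-zero exit status on stderr, and returns that status
# unchanged so the calling script (and the CI job) still sees the failure.
run_checked() {
  local status=0
  "$@" || status=$?

  # The 125-127 messages assume the wrapped command is `docker run`.
  case "$status" in
    0)   ;;
    125) echo "exit 125: docker run itself failed (daemon or flag error)" >&2 ;;
    126) echo "exit 126: the container command could not be invoked" >&2 ;;
    127) echo "exit 127: the container command was not found" >&2 ;;
    *)   echo "exit ${status}: the command failed" >&2 ;;
  esac

  return "$status"
}
```

A call might look like `run_checked docker run --rm my-image:latest ./run-tests.sh`, where the image and script names are placeholders. Because the helper returns the original status, `set -e` still aborts the script on failure unless the caller handles the status explicitly.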