Fixing "healthcheck failed" and unhealthy Docker containers

What Docker healthchecks do
Quickstart: verify and debug fast
Minimal working example (MWE)
Common symptoms and likely fixes
Step-by-step diagnosis
Patterns for robust healthchecks
Pitfalls to avoid
Performance notes
When to disable healthchecks
Tiny FAQ
Checklist

What Docker healthchecks do

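A Docker healthcheck periodically runs a command inside the container to report whether the service is working. Exit code 0 counts as a success and exit code 1 as a failure (Docker reserves exit code 2, so probes should not return it). After --retries consecutive failures the container is marked unhealthy. Orchestrators and docker compose can gate dependencies on health.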

Health states: starting → healthy or unhealthy
Health is separate from running: a container can be running but unhealthy
Healthcheck runs in the container’s namespace, using its filesystem, network, and user
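
The transition rules above can be modelled in a few lines of POSIX shell. This is a simplified sketch, not Docker's implementation: next_health is a hypothetical helper, it ignores the start period, and it assumes the documented behaviour that any successful probe resets the failure streak and that the container turns unhealthy once the streak reaches the retry count.

```shell
# Simplified model of the health state machine (ignores --start-period).
# next_health STATUS STREAK RETRIES EXIT_CODE -> prints "NEW_STATUS NEW_STREAK"
next_health() {
  nh_status=$1
  nh_streak=$2
  nh_retries=$3
  nh_exit=$4
  if [ "$nh_exit" -eq 0 ]; then
    # Any successful probe marks the container healthy and resets the streak.
    echo "healthy 0"
    return 0
  fi
  nh_streak=$((nh_streak + 1))
  if [ "$nh_streak" -ge "$nh_retries" ]; then
    # Enough consecutive failures: the container becomes unhealthy.
    nh_status=unhealthy
  fi
  echo "$nh_status $nh_streak"
}

# Replay a sequence of probe exit codes with retries=3.
status=starting
streak=0
for code in 1 0 1 1 1; do
  set -- $(next_health "$status" "$streak" 3 "$code")
  status=$1
  streak=$2
  echo "exit=$code -> $status (failing streak $streak)"
done
```

In this replay the container only flips to unhealthy on the last probe, after three failures in a row; the single early failure leaves it in starting.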

Quickstart: verify and debug fast

Check status: docker ps shows health in the STATUS column, for example Up 2 minutes (healthy).
Inspect details: docker inspect --format '{{json .State.Health}}' <container>
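Read recent health logs: docker inspect <container> | jq '.[0].State.Health.Log[-5:]' (docker inspect prints a JSON array, hence the leading .[0])
Run the health command manually inside the container: docker exec -it <container> sh -c "<health command>" (plain sh -c matches how CMD-SHELL runs the probe; a login shell may load a different PATH)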
Check application logs for clues: docker logs <container>
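
The inspection commands above can be bundled into one helper. This is a minimal sketch: health_report is a hypothetical function name, and it uses docker inspect's Go templates instead of jq so it works on hosts without jq. It assumes the container has a healthcheck defined; without one, .State.Health is empty and docker inspect reports a template error.

```shell
# health_report CONTAINER
# Prints the current status and failing streak, then one line per logged probe.
health_report() {
  hr_ctr=$1
  docker inspect \
    --format 'status={{.State.Health.Status}} failing_streak={{.State.Health.FailingStreak}}' \
    "$hr_ctr" || return 1
  # Each log entry carries the probe's exit code and its captured output.
  docker inspect \
    --format '{{range .State.Health.Log}}exit={{.ExitCode}} output={{printf "%q" .Output}}{{println}}{{end}}' \
    "$hr_ctr"
}

# Usage: health_report <container>
```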

Minimal working example (MWE)

This container serves a static file via Python’s HTTP server and uses wget to verify it.

# Dockerfile
FROM python:3.12-alpine
WORKDIR /app
RUN printf 'ok' > index.html
EXPOSE 8000

# Healthcheck: succeed only if index contains 'ok'
HEALTHCHECK --interval=10s --timeout=3s --start-period=5s --retries=3 \
  CMD wget -qO- http://127.0.0.1:8000/ | grep -q ok || exit 1

CMD ["python", "-m", "http.server", "8000"]

Build and run:

docker build -t mwe-health .
docker run --name mwe --rm -p 8000:8000 mwe-health

In another terminal:

docker ps
# Wait a few seconds, then:
docker inspect --format '{{.State.Health.Status}}' mwe  # should print: healthy
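
Instead of waiting "a few seconds" by hand, a script can poll until the status settles. The following is a sketch under stated assumptions: wait_healthy is a hypothetical helper, the one-second poll and 60-second default timeout are arbitrary choices, and it treats unhealthy as final even though a container can recover later.

```shell
# wait_healthy CONTAINER [TIMEOUT_SECONDS]
# Polls once per second; returns 0 when healthy, 1 on unhealthy or timeout.
wait_healthy() {
  wh_ctr=$1
  wh_left=${2:-60}
  wh_status=unknown
  while [ "$wh_left" -gt 0 ]; do
    wh_status=$(docker inspect --format '{{.State.Health.Status}}' "$wh_ctr" 2>/dev/null)
    case $wh_status in
      healthy)   return 0 ;;
      unhealthy) echo "$wh_ctr is unhealthy" >&2; return 1 ;;
    esac
    sleep 1
    wh_left=$((wh_left - 1))
  done
  echo "timed out waiting for $wh_ctr (last status: ${wh_status:-unknown})" >&2
  return 1
}

# Usage: wait_healthy mwe 30 && echo "ready"
```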

docker compose example:

version: "3.9"
services:
  app:
    build: .
    ports:
      - "8000:8000"
    healthcheck:
      test: ["CMD-SHELL", "wget -qO- http://127.0.0.1:8000/ | grep -q ok || exit 1"]
      interval: 10s
      timeout: 3s
      retries: 3
      start_period: 5s
  db:
    image: postgres:16-alpine
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 10s
  api:
    build: ./api
    depends_on:
      db:
        condition: service_healthy

Common symptoms and likely fixes

Symptom	Likely cause	Practical fix
Healthcheck exits 127	Command not found	Install tool (apk add curl), or use busybox wget; verify PATH
Times out	App slow to start or probe too heavy	Increase timeout/start-period; reduce work in probe
Works manually, fails in healthcheck	Shell features not available	Use CMD-SHELL; quote properly; avoid bash-only syntax on sh
HTTP 200 but unhealthy	Probe does not exit 0	Ensure the final command in the probe returns 0 on success
DB-dependent app unhealthy	Dependency not ready	Add start-period; or make app probe only its own readiness; use depends_on: service_healthy
Port connection refused	App binds to wrong interface	Bind app to 0.0.0.0 inside container; probe 127.0.0.1 or localhost
TLS probe fails	Missing CA or self-signed	Use http for local probe; add ca-certificates; or `--insecure` if acceptable
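
When reading a failed probe's exit code, a small lookup helps separate "the probe said no" from "the probe never ran". This sketch uses a hypothetical helper name, explain_probe_exit; codes 126 and 127 follow the usual POSIX shell conventions, and the note on exit code 2 reflects Docker's documented reservation of that value.

```shell
# explain_probe_exit CODE: map a healthcheck exit code to a likely meaning.
explain_probe_exit() {
  case $1 in
    0)   echo "success: the probe passed" ;;
    1)   echo "unhealthy: the probe ran and reported failure" ;;
    2)   echo "reserved by Docker: do not return this code from a probe" ;;
    126) echo "command found but not executable: check file permissions" ;;
    127) echo "command not found: install the tool or fix PATH" ;;
    *)   echo "non-standard failure code $1: normalize it with '|| exit 1'" ;;
  esac
}

explain_probe_exit 127
```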

Step-by-step diagnosis

Inspect the health command
- Find it in Dockerfile or compose under healthcheck.test
- Confirm it uses the exec (JSON array) form for a single binary, or CMD-SHELL when it needs shell features such as pipes
Validate command availability
- Run docker exec <ctr> which curl wget nc to see which of these tools exist in the image
- If missing, install during build, or switch to a tool you have
Run the probe verbatim
- docker exec -it <ctr> sh -c "<probe>"; echo $? and confirm the exit code (plain sh -c, because the healthcheck does not run in a login shell)
Check app binding and endpoints
- Inside the container: ss -lntp or netstat -lnt to verify ports
- Test: wget -S -O- http://127.0.0.1:<port>/health
Tune timing
- If the app needs warmup, raise --start-period and --timeout; to reduce probe load, raise --interval so checks run less often
Make the probe cheap and deterministic
- Use a fast readiness endpoint; avoid full DB queries or migrations
Watch the logs
- docker inspect exposes .State.Health.Log; each entry records Start, End, ExitCode, and Output, so fix based on the captured errors
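
Two of the steps above, checking the configured probe and running it verbatim, can be scripted. This is a sketch with hypothetical helper names (show_probe_config, run_probe); it assumes the image ships /bin/sh, which is also what CMD-SHELL relies on.

```shell
# show_probe_config CONTAINER
# Prints the healthcheck Docker has on record (test command and timings) as JSON.
show_probe_config() {
  docker inspect --format '{{json .Config.Healthcheck}}' "$1"
}

# run_probe CONTAINER 'PROBE COMMAND'
# Runs the probe through sh -c inside the container and reports its exit code.
run_probe() {
  rp_ctr=$1
  rp_cmd=$2
  rp_code=0
  docker exec "$rp_ctr" sh -c "$rp_cmd" || rp_code=$?
  echo "probe exit code: $rp_code"
  return "$rp_code"
}

# Usage:
#   show_probe_config <ctr>
#   run_probe <ctr> 'wget -qO- http://127.0.0.1:8000/ | grep -q ok'
```

If run_probe reports 0 while the container stays unhealthy, compare the command you typed with the output of show_probe_config; the two often differ in quoting or port.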

Patterns for robust healthchecks

Keep probes local: target localhost or a UNIX socket if applicable
Exit 0 only when the service can handle traffic; non-zero otherwise
Use a stable, fast endpoint like /healthz or /readyz
Prefer exec arrays for simple binaries; use CMD-SHELL for pipes and redirection
Avoid heavy dependencies; busybox wget or nc often suffice

Examples:

# Simple TCP port probe
HEALTHCHECK CMD nc -z 127.0.0.1 8080 || exit 1

# HTTP probe with curl (ensure curl is installed)
HEALTHCHECK CMD curl -fsS http://127.0.0.1/healthz >/dev/null || exit 1

# App-provided check script (the file must be executable)
COPY healthcheck.sh /usr/local/bin/
HEALTHCHECK --interval=15s CMD ["/usr/local/bin/healthcheck.sh"]

healthcheck.sh should be small, fast, and end with exit 0 on success.
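
One possible shape for such a script is sketched below. Everything specific in it is an assumption for illustration: the HEALTH_URL variable, port 8080, the /healthz path, and the two-second timeout. It relies only on wget flags that both busybox and GNU wget accept.

```shell
#!/bin/sh
# Hypothetical healthcheck.sh: succeed only if a local endpoint answers in time.
# HEALTH_URL, the port, and the /healthz path are assumptions; adjust to your app.
HEALTH_URL=${HEALTH_URL:-http://127.0.0.1:8080/healthz}

check_http() {
  # -q: quiet, -T 2: give up after 2 seconds, -O /dev/null: discard the body.
  wget -q -T 2 -O /dev/null "$1"
}

healthcheck_main() {
  if check_http "$HEALTH_URL"; then
    exit 0
  fi
  echo "healthcheck: $HEALTH_URL did not respond successfully" >&2
  exit 1   # always 1 on failure; exit code 2 is reserved by Docker
}

# Run the check only when executed as healthcheck.sh, so the file can also be
# sourced (for example by a test) without exiting the calling shell.
if [ "${0##*/}" = "healthcheck.sh" ]; then
  healthcheck_main
fi
```

Keeping the check in a function makes it easy to add a second condition later without changing the exit-code contract.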

Pitfalls to avoid

Using hostnames/ports that are only valid from other containers; the healthcheck runs inside the target container
Depending on external services for health; prefer checking only what this container controls
Returning success on partial failures; keep semantics strict
Using bashisms on alpine busybox sh (e.g., [[); either install bash or rewrite for POSIX sh
Forgetting start-period for apps with long cold starts
Not failing explicitly; ensure the command ends with || exit 1 when using pipes
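
The pipe pitfall is easy to reproduce without Docker. In this sketch, fetch_ok and fetch_fail are stand-ins for a wget or curl call; the point is that in plain POSIX sh a pipeline reports only the exit status of its last command.

```shell
# Stand-ins for a probe's fetch step (hypothetical helpers for illustration).
fetch_ok()   { echo "status: ok"; }
fetch_fail() { echo "connection refused" >&2; return 1; }

# Fetch succeeds and grep matches: the pipeline exits 0.
fetch_ok | grep -q ok && echo "match found: pipeline exits 0"

# Fetch succeeds but grep finds nothing: the pipeline exits non-zero.
fetch_ok | grep -q ready || echo "no match: pipeline fails although the fetch succeeded"

# Fetch fails but the last command (cat) succeeds: the failure is hidden.
fetch_fail 2>/dev/null | cat && echo "fetch failed, yet the pipeline exits 0"
```

The third case is the dangerous one for healthchecks: make the last command in the pipe the one that decides success, and keep the trailing || exit 1 so that any failure is reported as exit code 1.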

Performance notes

Every healthcheck spawns a process; too-frequent checks waste CPU and I/O
Reasonable starting values: interval 10–30s, timeout 2–5s, retries 3 (Docker's own defaults are interval 30s, timeout 30s, retries 3)
Use lightweight tools (wget -q, nc -z) and cheap endpoints
Avoid disk writes, large payloads, or TLS handshakes unless necessary
In high-density hosts, stagger intervals across services to reduce bursts
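
The trade-off between probe cost and detection speed can be estimated with simple arithmetic. The sketch below uses hypothetical helper names and is an approximation: it assumes each probe starts one interval after the previous one finishes and that every failing probe runs for its full timeout.

```shell
# worst_case_detection INTERVAL TIMEOUT RETRIES   (whole seconds)
# Rough upper bound on how long a hard failure can go unreported:
# each of the RETRIES failing probes waits INTERVAL, then may run for TIMEOUT.
worst_case_detection() {
  echo $(( $3 * ($1 + $2) ))
}

# probes_per_hour INTERVAL: approximate number of probe processes per hour.
probes_per_hour() {
  echo $(( 3600 / $1 ))
}

echo "interval=10 timeout=3 retries=3 -> up to $(worst_case_detection 10 3 3)s before unhealthy"
echo "interval=10 -> about $(probes_per_hour 10) probes per hour"
```

With the MWE's settings this gives roughly 39 seconds to detection and 360 probes per hour; tripling the interval to 30s cuts the probe count to 120 but stretches detection to about 99 seconds.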

When to disable healthchecks

For pure batch/one-shot jobs (they exit anyway)
When an orchestrator already performs equivalent checks externally

Disable with:

HEALTHCHECK NONE
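
HEALTHCHECK NONE applies at build time. To switch off an image-defined healthcheck without rebuilding, docker run accepts --no-healthcheck, and a compose service can set disable: true under healthcheck. A minimal wrapper, with run_without_healthcheck as a hypothetical name:

```shell
# run_without_healthcheck IMAGE [ARGS...]
# Starts a container with any HEALTHCHECK inherited from the image disabled.
run_without_healthcheck() {
  docker run --no-healthcheck "$@"
}

# Usage: run_without_healthcheck mwe-health
```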

Tiny FAQ

My container runs but is unhealthy. What is the difference?
- Running means the process is alive. Healthy means your probe says it is ready. They are independent.
Should the probe check dependencies (DB, cache)?
- Prefer checking the container’s own readiness. If you must check dependencies, make it fast and resilient.
How do I gate startup on a healthy dependency in compose?
- Use depends_on: condition: service_healthy and define a healthcheck for the dependency.
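Why is the container unhealthy when the first command in the pipe succeeds?
- Without set -o pipefail, a pipeline's exit status is that of its last command, so a successful fetch piped into a grep that finds nothing still fails. Use CMD-SHELL and end with || exit 1 so every failure is reported as exit code 1.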
How do I view the last failures?
- docker inspect <ctr> | jq '.[0].State.Health.Log' shows entries with start and end timestamps, exit code, and output.