KhueApps

Fixing healthcheck failed and unhealthy Docker containers

Last updated: October 07, 2025

What Docker healthchecks do

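A Docker healthcheck periodically runs a command inside the container to report whether the service is working. Exit code 0 counts as a success and marks the container healthy. A non-zero exit counts as a failure, and once consecutive failures reach the retries setting (3 by default), the container is marked unhealthy. Docker Compose can gate dependent services on health, and Docker Swarm replaces tasks whose containers turn unhealthy.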

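  • Health states: starting, then healthy or unhealthy; a container can move between healthy and unhealthy as later checks pass or fail
  • Health is separate from running: a container can be running but unhealthy, and standalone Docker does not restart a container just because it is unhealthy
  • The healthcheck runs inside the container's namespaces, using its filesystem, network, environment, and user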
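To make the retries rule concrete, here is a simplified shell model of how consecutive probe results become a status. It is an illustration, not Docker's code: it ignores start-period and timeouts. simulate_health is a hypothetical helper that takes the retries value followed by a sequence of probe exit codes.

```shell
#!/bin/sh
# Simplified model of how Docker turns probe results into a health status.
# Illustrative only: start-period and timeouts are ignored.
# Usage: simulate_health RETRIES CODE [CODE...]  (each CODE is one probe's exit code)
simulate_health() {
  retries=$1
  shift
  state=starting
  streak=0   # consecutive failures (docker inspect reports this as FailingStreak)
  out=
  for code in "$@"; do
    if [ "$code" -eq 0 ]; then
      streak=0
      state=healthy          # a single success is enough
    else
      streak=$((streak + 1))
      if [ "$streak" -ge "$retries" ]; then
        state=unhealthy      # only after RETRIES failures in a row
      fi
    fi
    out="${out:+$out }$state"
  done
  echo "$out"
}

simulate_health 3 1 1 0 1 1 1   # prints: starting starting healthy healthy healthy unhealthy
```

One success is enough to turn healthy, but it takes three failures in a row (with retries=3) to turn unhealthy; a healthy container stays healthy through the first two failures.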

Quickstart: verify and debug fast

  1. Check status: docker ps shows health in the STATUS column, e.g. Up 2 minutes (healthy) or (health: starting).
  2. Inspect details: docker inspect --format '{{json .State.Health}}' <container>
  3. Read recent health logs: docker inspect <container> | jq '.[0].State.Health.Log[-5:]'
  4. Run the health command manually inside the container: docker exec -it <container> sh -c "<health command>" (plain sh -c, as CMD-SHELL uses, rather than a login shell that can change PATH)
  5. Check application logs for clues: docker logs <container>
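When you script these checks (for example in CI), a small polling loop saves retyping docker inspect. This is a sketch: wait_for_healthy and its 2-second poll step are assumptions, and it returns 2 when there is no health status at all (missing container or no healthcheck).

```shell
#!/bin/sh
# Poll a container's health until it settles or the deadline passes.
# Hypothetical helper; the 2-second poll step is an arbitrary choice.
# Usage: wait_for_healthy CONTAINER [TIMEOUT_SECONDS]
wait_for_healthy() {
  name=$1
  deadline=${2:-60}
  waited=0
  while [ "$waited" -lt "$deadline" ]; do
    # Same query as step 2; prints nothing if there is no healthcheck or container.
    health=$(docker inspect --format '{{.State.Health.Status}}' "$name" 2>/dev/null)
    case $health in
      healthy)   echo "healthy"; return 0 ;;
      unhealthy) echo "unhealthy"; return 1 ;;
      "")        echo "no health status for $name" >&2; return 2 ;;
    esac
    sleep 2            # still "starting": wait and retry
    waited=$((waited + 2))
  done
  echo "timed out while $health" >&2
  return 1
}
```

For example, after docker run -d --name mwe -p 8000:8000 mwe-health, run wait_for_healthy mwe 30. To list every unhealthy container at once, docker ps --filter health=unhealthy also works.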

Minimal working example (MWE)

This container serves a static file via Python’s HTTP server and uses BusyBox wget (bundled with Alpine) to verify it.

# Dockerfile
FROM python:3.12-alpine
WORKDIR /app
RUN printf 'ok' > index.html
EXPOSE 8000

# Healthcheck: succeed only if index contains 'ok'
HEALTHCHECK --interval=10s --timeout=3s --start-period=5s --retries=3 \
  CMD wget -qO- http://127.0.0.1:8000/ | grep -q ok || exit 1

CMD ["python", "-m", "http.server", "8000"]

Build and run:

docker build -t mwe-health .
docker run --name mwe --rm -p 8000:8000 mwe-health

In another terminal:

docker ps
# The first check runs about one interval (10s) after start, so wait a bit, then:
docker inspect --format '{{.State.Health.Status}}' mwe  # should print: healthy

docker compose example:

services:
  app:
    build: .
    ports:
      - "8000:8000"
    healthcheck:
      test: ["CMD-SHELL", "wget -qO- http://127.0.0.1:8000/ | grep -q ok || exit 1"]
      interval: 10s
      timeout: 3s
      retries: 3
      start_period: 5s
  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_PASSWORD: example  # placeholder; the image will not start without a password
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 10s
  api:
    build: ./api
    depends_on:
      db:
        condition: service_healthy

Common symptoms and likely fixes

Symptom | Likely cause | Practical fix
Healthcheck exits 127 | Command not found | Install the tool (apk add curl) or use BusyBox wget; verify PATH
Probe times out | App slow to start, or probe too heavy | Increase --timeout/--start-period; reduce the work the probe does
Works manually, fails in healthcheck | Shell features or environment differ | Use CMD-SHELL for pipes; quote properly; avoid bash-only syntax under sh
HTTP 200 but unhealthy | Probe does not exit 0 | Make sure the final command in the probe returns 0 on success
DB-dependent app unhealthy | Dependency not ready | Add a start period; probe only the app's own readiness; gate startup with depends_on and condition: service_healthy
Connection refused | App not listening yet, or on a different port, address, or IP family | Check listeners with ss/netstat; bind the app to 0.0.0.0; probe 127.0.0.1 explicitly, since localhost may resolve to ::1
TLS probe fails | Missing CA bundle or self-signed certificate | Probe plain HTTP locally; install ca-certificates; or skip verification (curl --insecure) if acceptable
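The ExitCode values in .State.Health.Log usually point at one of these rows. Below is a hypothetical triage helper, explain_exit, that maps common codes to a first guess. The -1 case assumes Docker's usual way of recording a probe killed by --timeout; anything unrecognized falls through to "run it by hand".

```shell
#!/bin/sh
# Map a health-log ExitCode to a first guess at the cause (heuristic, not exhaustive).
explain_exit() {
  case $1 in
    0)   echo "ok: probe succeeded" ;;
    -1)  echo "timeout: probe exceeded --timeout; raise it or make the probe cheaper" ;;
    126) echo "not executable: check the script's execute bit and shebang" ;;
    127) echo "command not found: install the tool or fix PATH" ;;
    *)   echo "probe failed (exit $1): run it by hand with docker exec and read its output" ;;
  esac
}
```

You could feed it codes from docker inspect mwe | jq '.[0].State.Health.Log[].ExitCode'.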

Step-by-step diagnosis

  1. Inspect the health command
    • Find it in Dockerfile or compose under healthcheck.test
    • Confirm it is either an array exec form or CMD-SHELL appropriately
  2. Validate command availability
    • docker exec <ctr> sh -c 'command -v curl wget nc' and ensure the tool exists (the which binary itself is often missing from minimal images)
    • If missing, install during build, or switch to a tool you have
  3. Run the probe verbatim
    • docker exec -it <ctr> sh -c "<probe>"; echo $? and confirm the exit code (docker exec passes through the command's exit status)
  4. Check app binding and endpoints
    • Inside the container: ss -lntp or netstat -lnt to verify ports
    • Test: wget -S -O- http://127.0.0.1:<port>/health
  5. Tune timing
    • If the app needs warmup, raise --start-period and --timeout, and lower the check frequency with a larger --interval
  6. Make the probe cheap and deterministic
    • Use a fast readiness endpoint; avoid full DB queries or migrations
  7. Watch the logs
    • docker inspect includes .State.Health.Log, whose entries record Start, End, ExitCode, and Output; fix based on the errors shown
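Steps 3 and 5 combine naturally: run the probe exactly as CMD-SHELL would (through sh -c), with the same budget you give --timeout, and look at the exit code. A minimal sketch, assuming GNU coreutils timeout, which exits 124 when the budget runs out (BusyBox's timeout may report differently); run_probe is a hypothetical name.

```shell
#!/bin/sh
# Run a probe string the way CMD-SHELL does (via sh -c) under a time budget
# and report its exit status. Assumes GNU coreutils `timeout` (exit 124 on expiry).
# Usage: run_probe SECONDS 'probe command'
run_probe() {
  budget=$1
  probe=$2
  rc=0
  # Probe output goes to stderr so the verdict line on stdout stays clean.
  timeout "$budget" sh -c "$probe" >&2 || rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "probe timed out after ${budget}s"
  else
    echo "probe exit=$rc"
  fi
  return "$rc"
}
```

Call it with the probe string from your Dockerfile, e.g. run_probe 3 'wget -qO- http://127.0.0.1:8000/ | grep -q ok || exit 1'; a timeout or a non-zero exit points you back to the symptom table.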

Patterns for robust healthchecks

  • Keep probes local: target localhost or a UNIX socket if applicable
  • Exit 0 only when the service can handle traffic; non-zero otherwise
  • Use a stable, fast endpoint like /healthz or /readyz
  • Prefer exec arrays for simple binaries; use CMD-SHELL for pipes and redirection
  • Avoid heavy dependencies; busybox wget or nc often suffice

Examples:

# Simple TCP port probe
HEALTHCHECK CMD nc -z 127.0.0.1 8080 || exit 1

# HTTP probe with curl (ensure curl is installed)
HEALTHCHECK CMD curl -fsS http://127.0.0.1/healthz >/dev/null || exit 1

# App-provided check script
COPY healthcheck.sh /usr/local/bin/
RUN chmod +x /usr/local/bin/healthcheck.sh
HEALTHCHECK --interval=15s CMD ["/usr/local/bin/healthcheck.sh"]

healthcheck.sh should be small and fast, exit 0 on success and 1 on failure; Docker reserves exit code 2, so avoid returning it.
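A minimal healthcheck.sh along those lines might look like this, assuming an HTTP endpoint. HEALTH_URL, its default http://127.0.0.1:8000/, and the 2-second budget are placeholders, not values from your service. It uses curl if present, falls back to wget, collapses every failure to exit 1, and only runs the check when invoked as healthcheck.sh, so the functions can also be sourced for testing.

```shell
#!/bin/sh
# healthcheck.sh: minimal sketch. HEALTH_URL and the 2s budget are assumptions.
HEALTH_URL=${HEALTH_URL:-http://127.0.0.1:8000/}

http_ok() {
  # Use whichever HTTP client the image has; fail clearly if neither exists.
  if command -v curl >/dev/null 2>&1; then
    curl -fsS --max-time 2 -o /dev/null "$1"
  elif command -v wget >/dev/null 2>&1; then
    wget -q -T 2 -O /dev/null "$1"
  else
    echo "healthcheck: neither curl nor wget found" >&2
    return 1
  fi
}

healthcheck() {
  # Collapse any failure code (curl has many) to 1; Docker reserves 2.
  if http_ok "$HEALTH_URL"; then return 0; else return 1; fi
}

# Only act when executed as healthcheck.sh, so tests can source the functions.
case $0 in
  *healthcheck.sh) healthcheck; exit $? ;;
esac
```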

Pitfalls to avoid

  • Using hostnames/ports that are only valid from other containers; the healthcheck runs inside the target container
  • Depending on external services for health; prefer checking only what this container controls
  • Returning success on partial failures; keep semantics strict
  • Using bash-only syntax (e.g., arrays or [[ ]]) under a minimal /bin/sh such as dash or BusyBox ash, which support it partially or not at all; either install bash or rewrite for POSIX sh
  • Forgetting start-period for apps with long cold starts
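  • Not failing explicitly: in a pipeline only the last command's exit status counts, so end the pipeline with the real check (e.g., grep -q) and append || exit 1 to normalize failures to exit code 1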
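For the bashism pitfall, the usual POSIX replacement for [[ $body == *ok* ]] is a case pattern. A small sketch; body_has is a hypothetical helper, and the response body is a stand-in string rather than a real request.

```shell
#!/bin/sh
# Bash-only:  [[ $body == *ok* ]]
# POSIX sh (dash, BusyBox ash): use a case pattern instead.
body_has() {
  case $1 in
    *"$2"*) return 0 ;;   # quoted "$2" is matched literally, not as a glob
    *)      return 1 ;;
  esac
}

body='status: ok'   # stand-in for a fetched response body
if body_has "$body" ok; then echo healthy; else echo unhealthy; fi   # prints: healthy
```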

Performance notes

  • Every healthcheck spawns a process; too-frequent checks waste CPU and I/O
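  • Reasonable starting values: interval 10–30s, timeout 2–5s, retries 3 (Docker's built-in defaults are a 30s interval, 30s timeout, 0s start period, and 3 retries)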
  • Use lightweight tools (wget -q, nc -z) and cheap endpoints
  • Avoid disk writes, large payloads, or TLS handshakes unless necessary
  • On high-density hosts, stagger intervals across services to reduce bursts
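One way to stagger without coordinating by hand is to derive each service's interval from its name. A sketch under assumptions: stagger_interval is hypothetical, base and spread are in seconds, and cksum is used only as a cheap deterministic hash.

```shell
#!/bin/sh
# Derive a stable per-service interval in [BASE, BASE+SPREAD) from the service name.
# Usage: stagger_interval NAME BASE SPREAD   (SPREAD must be > 0)
stagger_interval() {
  name=$1
  base=$2
  spread=$3
  hash=$(printf '%s' "$name" | cksum | cut -d ' ' -f 1)   # CRC of the name
  echo "$(( base + hash % spread ))s"
}

stagger_interval web 20 10   # some value from 20s to 29s, identical on every run
```

The result can feed the real docker run flag, e.g. --health-interval="$(stagger_interval web 20 10)".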

When to disable healthchecks

  • For pure batch/one-shot jobs (they exit anyway)
  • When an orchestrator already performs equivalent checks externally

Disable in a Dockerfile (this also turns off a healthcheck inherited from the base image) with:

HEALTHCHECK NONE

Tiny FAQ

  • My container runs but is unhealthy. What is the difference?

    • Running means the process is alive. Healthy means your probe says it is ready. They are independent.
  • Should the probe check dependencies (DB, cache)?

    • Prefer checking the container’s own readiness. If you must check dependencies, make it fast and resilient.
  • How do I gate startup on a healthy dependency in compose?

    • Use depends_on: condition: service_healthy and define a healthcheck for the dependency.
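  • Why does a piped probe pass even when its first command fails?

    • Without set -o pipefail, a pipeline's exit status is that of its last command. Use CMD-SHELL, make the last command the actual check (e.g., grep -q), and add || exit 1 to normalize failures to 1.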
  • How do I view the last failures?

    • docker inspect <ctr> | jq '.[0].State.Health.Log' shows timestamped entries with exit code and output.

Checklist

  • Command exists and exits 0 on success
  • Probe targets the right host/port (usually 127.0.0.1)
  • Timing tuned: start-period, interval, timeout, retries
  • Logs inspected; failures are actionable
  • Probe is lightweight and deterministic
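Parts of this checklist can be spot-checked from the Dockerfile text. A rough sketch: lint_healthcheck is a hypothetical helper that only greps the last HEALTHCHECK line (it ignores continuation lines and healthchecks inherited from a base image) and warns about timing flags left at Docker's defaults.

```shell
#!/bin/sh
# Heuristic check of a Dockerfile's HEALTHCHECK timing flags (text matching only).
# Usage: lint_healthcheck path/to/Dockerfile
lint_healthcheck() {
  line=$(grep -E '^[[:space:]]*HEALTHCHECK' "$1" | tail -n 1)
  if [ -z "$line" ]; then
    echo "warn: no HEALTHCHECK here (a base image may still define one)"
    return 1
  fi
  case $line in
    *HEALTHCHECK*NONE*) echo "ok: healthcheck explicitly disabled"; return 0 ;;
  esac
  missing=0
  for flag in --interval --timeout --start-period --retries; do
    case $line in
      *"$flag="*) ;;
      *) echo "warn: $flag not set, Docker's default applies"; missing=$((missing + 1)) ;;
    esac
  done
  [ "$missing" -eq 0 ] && echo "ok: timing flags set"
  [ "$missing" -eq 0 ]
}
```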

Series: Docker
