
Monitoring and Alerting for Dockerized Workloads: Detect Failures Before Users Do


Monitoring and alerting for Dockerized workloads is not just about drawing pretty graphs. The goal is to detect failures early, pinpoint root causes quickly, and reduce mean time to recovery (MTTR)—ideally before customers notice. This tutorial walks through a practical, production-oriented approach using real commands and a clear mental model: what to measure, how to collect it, how to visualize it, and how to alert on it.

You’ll build a monitoring stack around Docker containers, learn what “good” signals look like, and implement alerts that catch common failure modes: crashes, OOM kills, disk pressure, CPU throttling, latency spikes, and dependency failures.


Table of Contents

  1. What “good monitoring” means for containers
  2. Key failure modes in Dockerized systems
  3. Metrics, logs, and traces: what to collect
  4. Baseline: Docker’s built-in observability commands
  5. A practical monitoring stack for Docker
  6. Collecting container and host metrics with Prometheus + cAdvisor
  7. Visualizing in Grafana
  8. Alerting with Alertmanager: principles and examples
  9. Monitoring container health properly (healthchecks and uptime)
  10. Logs: from docker logs to centralized logging
  11. Detecting common container failures (recipes)
  12. On-call hygiene: noise reduction and actionable alerts
  13. Hardening and operating the monitoring stack
  14. Quick checklist

1. What “good monitoring” means for containers

A Docker container is not a VM. It’s a process (or a group of processes) with cgroup limits, namespaces, and a lifecycle controlled by Docker. Monitoring must reflect that reality: containers restart and get recreated, they hit per-container cgroup limits, and they share a host whose health matters as much as their own.

A useful way to structure observability is:

  • RED, per service: Rate, Errors, Duration — what users experience.
  • USE, per resource: Utilization, Saturation, Errors — why it is happening.

You will typically combine application metrics (RED) with container/host metrics (USE) to get both symptom detection and root-cause hints.


2. Key failure modes in Dockerized systems

These are common ways Dockerized workloads fail in production:

  1. Crash loops: process exits repeatedly; may be due to bad config, missing secrets, failed migrations.
  2. OOMKilled: container hits memory limit; kernel kills it.
  3. CPU throttling: container is CPU-limited and gets throttled, causing latency spikes.
  4. Disk full: logs, images, or volumes fill disk; Docker and apps misbehave.
  5. FD exhaustion: “too many open files” from leaks or high load.
  6. Network issues: DNS failures, dropped packets, connection saturation.
  7. Dependency failures: DB down, queue backlog, upstream API slow.
  8. Slow degradation: memory leaks, increasing GC time, growing latency.
  9. Silent failure: container is “running” but not serving (deadlock, stuck thread, hung event loop).

Your monitoring should detect each of these with either a direct cause signal (for example, an OOM-kill counter) or a user-facing symptom (error rate, latency) backed by a synthetic check.


3. Metrics, logs, and traces: what to collect

Metrics

Best for: trends, alerting, capacity planning.

Collect:

  • RED metrics from the application (request rate, error rate, latency histograms)
  • Container metrics from cAdvisor (CPU, memory, throttling, OOM events)
  • Host metrics from node-exporter (disk, inodes, load, network)

Logs

Best for: debugging, incident forensics.

Collect:

  • Container stdout/stderr, structured (JSON) where possible
  • Application error logs with request or correlation IDs
  • Docker daemon and system logs for host-level issues

Traces

Best for: latency root cause across services.

Collect:

  • Spans for inbound requests and outbound calls (DB, cache, upstream APIs)
  • Trace IDs propagated across service boundaries and into logs

This tutorial focuses primarily on metrics + alerting, with a practical logging section.


4. Baseline: Docker’s built-in observability commands

Before deploying a full stack, you should be fluent in Docker’s own tools. They are invaluable during incidents.

Container list and status

docker ps
docker ps -a

Resource usage (live)

docker stats
docker stats --no-stream
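The `--no-stream` output is easy to post-process. Below is a small helper — a sketch, where the `--format` string and the 80% threshold are assumptions to adjust — that flags containers whose memory percentage crosses a threshold:

```shell
# flag_hot_containers: read "NAME MEM%" lines (e.g. from
#   docker stats --no-stream --format '{{.Name}} {{.MemPerc}}')
# and print the names of containers above a memory threshold.
flag_hot_containers() {
  threshold="${1:-80}"
  awk -v t="$threshold" '{ p = $2; sub(/%/, "", p); if (p + 0 > t) print $1 }'
}

# Demo with sample data; on a real host, pipe docker stats output in:
printf 'web 91.2%%\napi 12.0%%\n' | flag_hot_containers 80
```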

Inspect container state (including OOMKilled)

docker inspect --format '{{json .State}}' my-container | jq

Look for:

  • OOMKilled: true — the kernel killed the process for exceeding its memory limit
  • ExitCode — non-zero means the process failed (137 usually means SIGKILL)
  • Error and FinishedAt — why and when the container last stopped

Logs

docker logs my-container
docker logs -f --tail 200 my-container

Events (often underused)

Docker emits events for restarts, kills, health status changes, etc.

docker events --since 1h
docker events --filter container=my-container

Check restart count

docker inspect --format '{{.RestartCount}}' my-container
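Combined with xargs, RestartCount becomes a quick crash-loop survey. The helper below is a sketch: it sorts "NAME COUNT" lines and keeps only containers that have restarted at least once:

```shell
# restarted_containers: read "NAME COUNT" lines, e.g. from
#   docker ps -q | xargs docker inspect --format '{{.Name}} {{.RestartCount}}'
# and print names with a nonzero restart count, highest first.
restarted_containers() {
  awk '$2 + 0 > 0 { print $2, $1 }' | sort -rn | awk '{ print $2 }'
}

# Demo with sample data:
printf '/web 0\n/worker 4\n/api 1\n' | restarted_containers
```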

These commands are not a monitoring system, but they teach you what signals exist and help you validate alerts.


5. A practical monitoring stack for Docker

A common, battle-tested stack:

  • cAdvisor — per-container metrics (CPU, memory, throttling, OOM events)
  • node-exporter — host metrics (disk, inodes, load, network)
  • Prometheus — scraping, storage, and alert-rule evaluation
  • Alertmanager — alert routing, grouping, and silencing
  • Grafana — dashboards

You can run all of these as containers. The tutorial uses docker run commands to keep things explicit.

Note: In Kubernetes, you’d do this differently (operators, ServiceMonitors, etc.). Here we focus on plain Docker hosts.


6. Collecting container and host metrics with Prometheus + cAdvisor

6.1 Create a Docker network for monitoring

docker network create monitoring

6.2 Run cAdvisor

cAdvisor needs access to host cgroups and Docker state.

docker run -d \
  --name cadvisor \
  --network monitoring \
  --restart unless-stopped \
  -p 8080:8080 \
  -v /:/rootfs:ro \
  -v /var/run:/var/run:rw \
  -v /sys:/sys:ro \
  -v /var/lib/docker/:/var/lib/docker:ro \
  gcr.io/cadvisor/cadvisor:v0.49.1

Verify:

curl -s http://localhost:8080/metrics | head

You should see Prometheus-style metrics like container_cpu_usage_seconds_total.
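Before writing alerts against a metric, confirm the exporter actually emits it. A tiny checker (a sketch; exact metric names vary by cAdvisor version):

```shell
# has_metric: read /metrics output on stdin and report whether a given
# metric family is present (Prometheus text format puts the family
# name at the start of the line).
has_metric() {
  if grep -q "^$1"; then echo "present: $1"; else echo "MISSING: $1"; fi
}

# Demo with a sample exposition line; on the host you would run:
#   curl -s http://localhost:8080/metrics | has_metric container_oom_events_total
printf 'container_cpu_usage_seconds_total{id="/"} 1\n' | has_metric container_cpu_usage_seconds_total
```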

6.3 Run node-exporter (host metrics)

docker run -d \
  --name node-exporter \
  --network monitoring \
  --restart unless-stopped \
  -p 9100:9100 \
  --pid=host \
  -v /:/host:ro,rslave \
  quay.io/prometheus/node-exporter:v1.8.2 \
  --path.rootfs=/host

Verify:

curl -s http://localhost:9100/metrics | head

6.4 Create a Prometheus config file

Create a directory:

mkdir -p /opt/monitoring/prometheus

Create the config:

cat > /opt/monitoring/prometheus/prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['prometheus:9090']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
EOF

6.5 Run Prometheus

docker run -d \
  --name prometheus \
  --network monitoring \
  --restart unless-stopped \
  -p 9090:9090 \
  -v /opt/monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro \
  -v prometheus-data:/prometheus \
  prom/prometheus:v2.55.1

Verify Prometheus targets: open http://localhost:9090/targets and confirm the prometheus, cadvisor, and node jobs all show state UP.

6.6 Why both cAdvisor and node-exporter?

If you only watch containers, you may miss “host disk 100% full” which kills everything. If you only watch host metrics, you may miss “one container is being throttled or OOMKilled”.


7. Visualizing in Grafana

7.1 Run Grafana

docker run -d \
  --name grafana \
  --network monitoring \
  --restart unless-stopped \
  -p 3000:3000 \
  -v grafana-data:/var/lib/grafana \
  grafana/grafana:11.2.0

Open http://localhost:3000 and log in (the default credentials are admin/admin; Grafana will prompt you to change them).

7.2 Add Prometheus as a data source

In the Grafana UI (menu paths vary slightly by version): go to Connections → Data sources → Add data source, choose Prometheus, set the URL to http://prometheus:9090 (the container name resolves on the monitoring network), and click Save & test.

7.3 Dashboards: what to build first

Start with 3 dashboards:

  1. Service Overview (Golden Signals)

    • Request rate (RPS)
    • Error rate (5xx, app errors)
    • Latency p95/p99
    • Saturation (CPU, memory, queue depth)
  2. Container Health

    • Restarts per container
    • OOM kills
    • CPU throttling
    • Memory working set vs limit
  3. Host Health

    • Disk usage %
    • inode usage %
    • load average
    • CPU iowait
    • network errors/drops

You can import community dashboards, but treat them as starting points. The most valuable dashboards are the ones that match your service names and your SLOs.
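To make the Container Health dashboard concrete, here are example panel queries. These are sketches: metric names and labels vary by cAdvisor version, so validate each one in the Prometheus Graph tab first.

```promql
# Memory working set as a fraction of the limit (only where a limit is set)
container_memory_working_set_bytes{name!=""}
  / (container_spec_memory_limit_bytes{name!=""} > 0)

# Per-container CPU usage, in cores
rate(container_cpu_usage_seconds_total{name!=""}[5m])

# CPU throttling ratio (fraction of CFS periods that were throttled)
rate(container_cpu_cfs_throttled_periods_total{name!=""}[5m])
  / rate(container_cpu_cfs_periods_total{name!=""}[5m])
```

The `name!=""` matcher drops cgroup aggregate series so panels show only real containers.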


8. Alerting with Alertmanager: principles and examples

Alerting is where monitoring becomes operationally useful—and where many setups fail due to noise.

8.1 Alerting principles

  1. Alert on user impact when possible (latency, errors, failed requests).
  2. Use symptoms first, causes second:
    • Symptom: error rate spike
    • Cause hints: OOM kills, CPU throttling, disk full
  3. Avoid flapping:
    • Use for: to require sustained failure (e.g., 5 minutes)
  4. Make alerts actionable:
    • Include what to check: container name, host, dashboard link, runbook steps
  5. Route by severity:
    • Page for urgent, ticket for non-urgent

8.2 Run Alertmanager

Create directories:

mkdir -p /opt/monitoring/alertmanager

Create a minimal config (email/Slack integration omitted here; you can add later):

cat > /opt/monitoring/alertmanager/alertmanager.yml <<'EOF'
global: {}

route:
  receiver: 'default'
  group_by: ['alertname', 'job', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'default'
EOF

Run:

docker run -d \
  --name alertmanager \
  --network monitoring \
  --restart unless-stopped \
  -p 9093:9093 \
  -v /opt/monitoring/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro \
  prom/alertmanager:v0.27.0

Open http://localhost:9093 to confirm Alertmanager is running.

8.3 Connect Prometheus to Alertmanager

Edit Prometheus config:

cat >> /opt/monitoring/prometheus/prometheus.yml <<'EOF'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
EOF

Restart Prometheus:

docker restart prometheus

8.4 Add alert rules

Create rules directory:

mkdir -p /opt/monitoring/prometheus/rules

Create alert rules file:

cat > /opt/monitoring/prometheus/rules/docker-alerts.yml <<'EOF'
groups:
- name: docker-and-host-alerts
  rules:

  - alert: HostDiskWillFillSoon
    expr: |
      (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.10
    for: 10m
    labels:
      severity: page
    annotations:
      summary: "Host disk low on {{ $labels.instance }}"
      description: "Filesystem has <10% available for 10m. Check docker images, logs, volumes."

  - alert: ContainerHighRestartRate
    expr: |
      changes(container_start_time_seconds{name!=""}[15m]) > 3
    for: 5m
    labels:
      severity: ticket
    annotations:
      summary: "Container restarting frequently"
      description: "A container has restarted more than 3 times in 15m (its start time changed). Investigate docker events and logs."

  - alert: ContainerOOMKilled
    expr: |
      increase(container_oom_events_total[10m]) > 0
    for: 0m
    labels:
      severity: page
    annotations:
      summary: "Container OOM kill detected"
      description: "Container had an OOM event in last 10m. Check memory usage and limits."

  - alert: ContainerCPUThrottlingHigh
    expr: |
      rate(container_cpu_cfs_throttled_periods_total[5m]) / rate(container_cpu_cfs_periods_total[5m]) > 0.25
    for: 10m
    labels:
      severity: ticket
    annotations:
      summary: "High CPU throttling"
      description: "More than 25% of CFS periods were throttled for 10m. Consider raising CPU limits or optimizing the workload."
EOF
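A static threshold alert fires only once the disk is already low. A trend-based variant using predict_linear catches slow fills earlier; the rule below is a sketch (tune the lookback window and horizon to your paging tolerance) that you can append to the same rules group:

```yaml
  - alert: HostDiskFullIn4h
    expr: |
      predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[1h], 4 * 3600) < 0
    for: 30m
    labels:
      severity: page
    annotations:
      summary: "Disk projected to fill within 4h on {{ $labels.instance }}"
      description: "Available bytes are trending toward zero. Check log growth and docker system df."
```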

Now include this rules file in the Prometheus config by adding a rule_files section to /opt/monitoring/prometheus/prometheus.yml. The snippet below inserts it only if it is not already present:

grep -q '^rule_files:' /opt/monitoring/prometheus/prometheus.yml || \
  sed -i '1irule_files:\n  - /etc/prometheus/rules/*.yml\n' /opt/monitoring/prometheus/prometheus.yml

Restart Prometheus with rules mounted:

docker rm -f prometheus

docker run -d \
  --name prometheus \
  --network monitoring \
  --restart unless-stopped \
  -p 9090:9090 \
  -v /opt/monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro \
  -v /opt/monitoring/prometheus/rules:/etc/prometheus/rules:ro \
  -v prometheus-data:/prometheus \
  prom/prometheus:v2.55.1

Check alerts: open http://localhost:9090/alerts and confirm the rules are loaded (state “inactive” is expected while nothing is failing).

Important: Alert expressions vary by cAdvisor version and environment. Always validate metric names in Prometheus “Graph” tab before relying on them.


9. Monitoring container health properly (healthchecks and uptime)

9.1 Docker HEALTHCHECK

A container being “running” only means its process exists. Add a healthcheck so Docker can report healthy/unhealthy.

Example Dockerfile snippet:

HEALTHCHECK --interval=10s --timeout=2s --retries=3 \
  CMD curl -fsS http://localhost:8080/health || exit 1

If you can’t rebuild images, you can still add healthchecks in docker run:

docker run -d --name myapp \
  --health-cmd='curl -fsS http://localhost:8080/health || exit 1' \
  --health-interval=10s --health-timeout=2s --health-retries=3 \
  myimage:latest

Check status:

docker inspect --format '{{json .State.Health}}' myapp | jq

9.2 Alerting on health status

cAdvisor does not always expose Docker health status as a metric. A common approach is to export health status yourself (for example, a cron script feeding node-exporter’s textfile collector) — or to sidestep container health entirely and probe the service from the outside.

A simple and robust method is to run blackbox-exporter and probe your services like a user would.

Run blackbox exporter:

docker run -d \
  --name blackbox \
  --network monitoring \
  --restart unless-stopped \
  -p 9115:9115 \
  prom/blackbox-exporter:v0.25.0

Add a probe job to the Prometheus scrape config (edit prometheus.yml, then restart Prometheus).
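A minimal probe job using the standard blackbox-exporter relabeling pattern looks like the following (the target URL is an assumed example — point it at your own service’s health endpoint):

```yaml
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://myapp:8080/health   # assumed example target
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox:9115
```

Alert on probe_success == 0 sustained for a few minutes.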

This is one of the best “detect before users do” techniques: synthetic checks.


10. Logs: from docker logs to centralized logging

10.1 Choose a logging driver intentionally

Docker defaults to json-file. It works, but can fill disks if unbounded.

Check current logging driver:

docker info --format '{{.LoggingDriver}}'

If using json-file, configure rotation at the daemon level in /etc/docker/daemon.json (takes effect for new containers after a daemon restart).
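For example, a bounded json-file configuration — the size and file-count values here are illustrative — that you would review and then copy to /etc/docker/daemon.json before restarting the daemon:

```shell
# daemon.json with bounded json-file logging (values illustrative).
# Applies only to containers created after the daemon restarts.
DAEMON_JSON='{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}'
printf '%s\n' "$DAEMON_JSON"
```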

Even without daemon changes, you can still mitigate by passing --log-opt max-size=10m --log-opt max-file=3 to individual docker run commands, and by pruning aggressively.

10.2 Quick log triage commands

Find noisy containers:

docker ps --format '{{.Names}}' | while read -r c; do
  echo "== $c =="; docker logs --tail 5 "$c" 2>/dev/null
done

Search logs (basic):

docker logs myapp 2>&1 | grep -i "error" | tail -n 50
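Going one step beyond grep, a severity tally gives a fast read on which problems dominate. The helper below is a sketch — the keywords are assumptions, so match them to your app’s actual log format:

```shell
# count_levels: tally log lines by severity keyword.
# Pipe `docker logs myapp 2>&1` into it.
count_levels() {
  grep -oiE 'error|warn|fatal' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn
}

# Demo with sample lines:
printf 'ERROR db timeout\nwarn slow query\nERROR db timeout\n' | count_levels
```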

10.3 Centralized logging options

Common approaches:

  • Grafana Loki with promtail (or Docker’s Loki logging driver)
  • ELK/EFK: Elasticsearch with Logstash or Fluentd, plus Kibana
  • A hosted provider (Datadog, CloudWatch Logs, etc.)

If you already run Grafana, Loki is often a pragmatic next step.


11. Detecting common container failures (recipes)

This section connects failure modes to specific metrics and alert ideas.

11.1 Crash loops / frequent restarts

Symptoms:

  • Service intermittently unavailable; errors arrive in bursts
  • docker ps repeatedly shows the container “Up X seconds”

Signals:

  • A climbing RestartCount; die/start pairs in docker events
  • The container’s start-time metric changing frequently

Investigation:

docker ps -a --filter "name=myapp"
docker inspect --format '{{.State.Status}} {{.State.ExitCode}} {{.State.Error}}' myapp
docker logs --tail 200 myapp
docker events --since 30m --filter container=myapp

Alerting: ticket when a container restarts more than a few times in 15 minutes (ContainerHighRestartRate above); page if a user-facing service is affected.

11.2 OOM kills

Symptoms:

  • The container vanishes and restarts; in-flight requests fail
  • Memory climbs toward the limit before each restart

Signals:

  • .State.OOMKilled is true; exit code 137
  • container_oom_events_total increasing

Investigation:

docker inspect --format '{{json .State}}' myapp | jq '.OOMKilled, .ExitCode, .FinishedAt'
docker stats --no-stream myapp

Fixes:

  • Raise the memory limit, or fix the leak
  • Bound caches, batch sizes, and worker concurrency
  • Set limits with headroom above the observed working set

Alerting: page on any OOM event for user-facing services (ContainerOOMKilled above).

11.3 CPU throttling (hidden latency killer)

Symptoms:

  • p95/p99 latency spikes while average CPU looks modest

Signals:

  • The ratio of throttled CFS periods to total periods climbing

Fixes:

  • Raise or remove the CPU limit (e.g., --cpus), add replicas, or optimize hot paths

Alerting: ticket when the throttling ratio stays above roughly 25% (ContainerCPUThrottlingHigh above); tune the threshold per workload.

11.4 Disk full (host)

Symptoms:

  • Containers fail to start; writes and log appends fail; Docker reports “no space left on device”

Signals:

  • node_filesystem_avail_bytes trending toward zero; inode exhaustion visible in df -i

Investigation:

df -h
df -i
docker system df
docker system prune --dry-run
sudo du -sh /var/lib/docker/* 2>/dev/null
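To spot which filesystems are in trouble at a glance, df output can be filtered. A sketch, assuming the POSIX `df -P` column layout (use-percentage in column 5, mount point in column 6):

```shell
# df_over: print mount points at or above a given use%.
# Pipe `df -P` output into it.
df_over() {
  t="${1:-90}"
  awk -v t="$t" 'NR > 1 { p = $5; sub(/%/, "", p); if (p + 0 >= t) print $6, $5 }'
}

# Demo with sample df -P output; on a real host: df -P | df_over 90
printf 'Filesystem 1K-blocks Used Avail Use%% Mounted\n/dev/sda1 100 95 5 95%% /\n/dev/sdb1 100 10 90 10%% /data\n' | df_over 90
```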

Fixes:

  • docker system prune (review what it will remove first)
  • Bound log sizes; move large volumes to a bigger disk
  • Alert on the trend before it bites (HostDiskWillFillSoon above)

11.5 Network and dependency failures

Symptoms:

  • Timeouts and connection errors while the container itself looks healthy

Signals:

  • Upstream latency and error metrics; DNS resolution failures; failing synthetic probes

Investigation (the example assumes an Alpine-based image; adjust the package manager for others):

docker exec -it myapp sh -lc 'apk add --no-cache curl bind-tools || true; nslookup db; curl -v http://upstream/health'

Alerting: probe dependencies with blackbox-exporter and alert on probe_success == 0; prefer alerting on user-facing symptoms over every intermediate hop.


12. On-call hygiene: noise reduction and actionable alerts

12.1 Severity levels

A simple scheme:

  • page — user impact right now; wake someone up
  • ticket — needs action within a day or two
  • info — no action required; dashboards and history only

12.2 Add context to alerts

Include:

  • The affected container and host
  • A dashboard link and the first runbook step
  • The most likely causes to check

Example annotation style:

summary: "High 5xx rate on {{ $labels.job }}"
description: "Error rate >5% for 10m. Dashboard: <link>. Runbook: check recent deploys, then container logs."

12.3 Use inhibition and grouping

Alertmanager can inhibit “symptom” alerts when a higher-level outage is firing (e.g., “HostDown” inhibits “ContainerDown”). This prevents cascades.

12.4 Test your alerts

Do not assume alerts work. Induce failures in a controlled environment:

  • Stop a container and confirm the restart alert fires
  • Fill a scratch filesystem and confirm the disk alert fires
  • Stop an exporter and confirm a target-down alert fires

Example: stop cAdvisor briefly and confirm you get “target down” alerts (if you add them).
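You can also exercise the routing path directly by posting a synthetic alert to Alertmanager’s v2 API (the /api/v2/alerts endpoint is standard; the host and port match this tutorial’s setup). A small payload builder:

```shell
# synthetic_alert_payload: build a JSON body for POSTing to
# Alertmanager's /api/v2/alerts, to verify grouping and routing
# without inducing a real failure.
synthetic_alert_payload() {
  printf '[{"labels":{"alertname":"%s","severity":"%s"}}]' "$1" "$2"
}

synthetic_alert_payload SyntheticTest ticket
# On the host:
#   curl -s -XPOST http://localhost:9093/api/v2/alerts \
#     -H 'Content-Type: application/json' \
#     -d "$(synthetic_alert_payload SyntheticTest ticket)"
```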


13. Hardening and operating the monitoring stack

13.1 Persist data

Prometheus data should be on a volume (prometheus-data). Grafana also uses a volume (grafana-data). You already did this.

Check volumes:

docker volume ls | grep -E 'prometheus-data|grafana-data'

13.2 Secure access

At minimum:

  • Do not expose Prometheus, Alertmanager, or Grafana ports publicly; bind to localhost or firewall them
  • Change Grafana’s default admin password
  • Put a TLS-terminating reverse proxy with authentication in front of anything that must be reachable

13.3 Monitor the monitors

Add alerts for:

  • Scrape targets down (up == 0)
  • Prometheus rule-evaluation failures
  • Alertmanager being unreachable from Prometheus

Example “target down” alert (add to rules):

# Fires when a scrape target is down for 5 minutes
# (Tune exclusions to avoid noisy ephemeral targets.)
up == 0
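Wrapped into a full rule — a sketch you can drop into the rules directory alongside the others:

```yaml
  - alert: TargetDown
    expr: up == 0
    for: 5m
    labels:
      severity: page
    annotations:
      summary: "Scrape target down: {{ $labels.job }} on {{ $labels.instance }}"
      description: "Prometheus has not scraped this target for 5m. The exporter or the network path is broken."
```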

13.4 Capacity planning

Prometheus performance depends on:

  • Active series count (label cardinality)
  • Scrape interval and number of targets
  • Retention period and disk speed

Practical advice:

  • Keep label cardinality low — never put user IDs or request IDs in labels
  • Start with 15s scrape intervals and tighten only where needed
  • Size retention to your disk via --storage.tsdb.retention.time


14. Quick checklist

Foundations

  • cAdvisor, node-exporter, Prometheus, Alertmanager, and Grafana running with restart policies and persistent volumes
  • Log rotation configured (daemon-level or per-container)

Metrics

  • Container CPU, memory, throttling, and OOM metrics scraped
  • Host disk, inode, load, and network metrics scraped

Alerting

  • Symptom-first alerts with for: durations and severity routing
  • Disk, OOM, restart, and throttling alerts in place

Validation

  • Every alert tested by inducing its failure mode
  • Metric names verified against your cAdvisor version


Next steps (practical upgrades)

If you want to go beyond “host + container metrics” and truly detect failures before users do, prioritize:

  1. Application instrumentation (Prometheus client libraries or OpenTelemetry metrics)
  2. Blackbox probing of critical endpoints (login, checkout, API health)
  3. Centralized logging (Loki/ELK) with correlation IDs
  4. SLO-based alerting (burn-rate alerts on error budget)

Finally, adapt the specifics to what you actually run: HTTP APIs, background workers, and databases each have their own failure modes, and the PromQL alerts and dashboards above should be tuned to match them — whether you deploy with Docker Compose or plain docker run.