Monitoring and Alerting for Dockerized Workloads: Detect Failures Before Users Do
Monitoring and alerting for Dockerized workloads is not just about drawing pretty graphs. The goal is to detect failures early, pinpoint root causes quickly, and reduce mean time to recovery (MTTR)—ideally before customers notice. This tutorial walks through a practical, production-oriented approach using real commands and a clear mental model: what to measure, how to collect it, how to visualize it, and how to alert on it.
You’ll build a monitoring stack around Docker containers, learn what “good” signals look like, and implement alerts that catch common failure modes: crashes, OOM kills, disk pressure, CPU throttling, latency spikes, and dependency failures.
Table of Contents
- 1. What “good monitoring” means for containers
- 2. Key failure modes in Dockerized systems
- 3. Metrics, logs, and traces: what to collect
- 4. Baseline: Docker’s built-in observability commands
- 5. A practical monitoring stack for Docker
- 6. Collecting container and host metrics with Prometheus + cAdvisor
- 7. Visualizing in Grafana
- 8. Alerting with Alertmanager: principles and examples
- 9. Monitoring container health properly (healthchecks and uptime)
- 10. Logs: from docker logs to centralized logging
- 11. Detecting common container failures (recipes)
- 12. On-call hygiene: noise reduction and actionable alerts
- 13. Hardening and operating the monitoring stack
- 14. Quick checklist
1. What “good monitoring” means for containers
A Docker container is not a VM. It’s a process (or a group of processes) with cgroup limits, namespaces, and a lifecycle controlled by Docker. Monitoring must reflect that reality:
- Containers can be rescheduled/recreated frequently; you need stable labels (service name, image, environment), not just container IDs.
- Resource limits matter: CPU throttling and memory OOM kills are common and often invisible if you only watch host CPU/RAM.
- Dependencies are everything: your container may be “up” but effectively down because DNS, database, or upstream APIs are failing.
- Alerting must be service-oriented: “container restarted” is not always a page, but “error rate increased” often is.
A useful way to structure observability is:
- Golden Signals (SRE): latency, traffic, errors, saturation.
- RED method (request-based services): Rate, Errors, Duration.
- USE method (resources): Utilization, Saturation, Errors.
You will typically combine application metrics (RED) with container/host metrics (USE) to get both symptom detection and root cause hints.
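As a concrete sketch: assuming your application exports Prometheus metrics named http_requests_total and http_request_duration_seconds_bucket (names vary by instrumentation library — treat these as placeholders), the RED signals map to PromQL queries like:

```promql
# Rate: requests per second, per service
sum by (service) (rate(http_requests_total[5m]))

# Errors: fraction of requests returning 5xx
sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
  / sum by (service) (rate(http_requests_total[5m]))

# Duration: p95 latency from a histogram
histogram_quantile(0.95,
  sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
```

These three queries, split by service, are usually enough to answer "is anyone hurting right now?"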
2. Key failure modes in Dockerized systems
These are common ways Dockerized workloads fail in production:
- Crash loops: process exits repeatedly; may be due to bad config, missing secrets, failed migrations.
- OOMKilled: container hits memory limit; kernel kills it.
- CPU throttling: container is CPU-limited and gets throttled, causing latency spikes.
- Disk full: logs, images, or volumes fill disk; Docker and apps misbehave.
- FD exhaustion: “too many open files” from leaks or high load.
- Network issues: DNS failures, dropped packets, connection saturation.
- Dependency failures: DB down, queue backlog, upstream API slow.
- Slow degradation: memory leaks, increasing GC time, growing latency.
- Silent failure: container is “running” but not serving (deadlock, stuck thread, hung event loop).
Your monitoring should detect each of these with either:
- direct signals (OOM kill counter, restart count), or
- indirect signals (latency/error rate), or ideally both.
3. Metrics, logs, and traces: what to collect
Metrics
Best for: trends, alerting, capacity planning.
Collect:
- Container CPU usage, CPU throttling
- Memory usage, memory working set
- OOM kills, restarts
- Network RX/TX, errors
- Disk I/O, filesystem usage
- Application metrics: request rate, error rate, latency, queue depth
Logs
Best for: debugging, incident forensics.
Collect:
- Container stdout/stderr
- App logs with request IDs
- Reverse proxy logs (nginx/traefik)
- Docker daemon logs (sometimes critical)
Traces
Best for: latency root cause across services.
Collect:
- Distributed traces (OpenTelemetry) when you have multiple services
This tutorial focuses primarily on metrics + alerting, with a practical logging section.
4. Baseline: Docker’s built-in observability commands
Before deploying a full stack, you should be fluent in Docker’s own tools. They are invaluable during incidents.
Container list and status
docker ps
docker ps -a
Resource usage (live)
docker stats
docker stats --no-stream
Inspect container state (including OOMKilled)
docker inspect --format '{{json .State}}' my-container | jq
Look for:
- OOMKilled: true
- ExitCode
- Error
- FinishedAt
- Health (if configured)
Logs
docker logs my-container
docker logs -f --tail 200 my-container
Events (often underused)
Docker emits events for restarts, kills, health status changes, etc.
docker events --since 1h
docker events --filter container=my-container
Check restart count
docker inspect --format '{{.RestartCount}}' my-container
These commands are not a monitoring system, but they teach you what signals exist and help you validate alerts.
5. A practical monitoring stack for Docker
A common, battle-tested stack:
- cAdvisor: exports container metrics (CPU, memory, network, filesystem) for Prometheus.
- node-exporter: exports host metrics (disk, CPU, memory, network, filesystem).
- Prometheus: scrapes metrics and stores time series.
- Grafana: dashboards.
- Alertmanager: routes Prometheus alerts to email/Slack/PagerDuty/etc.
You can run all of these as containers. The tutorial uses docker run commands to keep things explicit.
Note: In Kubernetes, you’d do this differently (operators, ServiceMonitors, etc.). Here we focus on plain Docker hosts.
6. Collecting container and host metrics with Prometheus + cAdvisor
6.1 Create a Docker network for monitoring
docker network create monitoring
6.2 Run cAdvisor
cAdvisor needs access to host cgroups and Docker state.
docker run -d \
--name cadvisor \
--network monitoring \
--restart unless-stopped \
-p 8080:8080 \
-v /:/rootfs:ro \
-v /var/run:/var/run:rw \
-v /sys:/sys:ro \
-v /var/lib/docker/:/var/lib/docker:ro \
gcr.io/cadvisor/cadvisor:v0.49.1
Verify:
curl -s http://localhost:8080/metrics | head
You should see Prometheus-style metrics like container_cpu_usage_seconds_total.
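Once cAdvisor is scraped, a few PromQL queries cover most per-container questions. Metric and label names can vary by cAdvisor version (the name label filter below assumes a recent release), so verify them in the Prometheus Graph tab:

```promql
# Per-container CPU usage, in cores
sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[5m]))

# Per-container memory working set (what the OOM killer cares about)
container_memory_working_set_bytes{name!=""}

# Per-container network receive throughput
sum by (name) (rate(container_network_receive_bytes_total{name!=""}[5m]))
```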
6.3 Run node-exporter (host metrics)
docker run -d \
--name node-exporter \
--network monitoring \
--restart unless-stopped \
-p 9100:9100 \
--pid=host \
-v /:/host:ro,rslave \
quay.io/prometheus/node-exporter:v1.8.2 \
--path.rootfs=/host
Verify:
curl -s http://localhost:9100/metrics | head
6.4 Create a Prometheus config file
Create a directory:
mkdir -p /opt/monitoring/prometheus
Create the config:
cat > /opt/monitoring/prometheus/prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['prometheus:9090']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
EOF
6.5 Run Prometheus
docker run -d \
--name prometheus \
--network monitoring \
--restart unless-stopped \
-p 9090:9090 \
-v /opt/monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro \
-v prometheus-data:/prometheus \
prom/prometheus:v2.55.1
Verify Prometheus targets:
- Open http://localhost:9090/targets
- Ensure the cadvisor and node targets are UP
6.6 Why both cAdvisor and node-exporter?
- cAdvisor answers: “What is each container doing?” (per-container CPU, memory, network)
- node-exporter answers: “What is the host doing?” (disk usage, filesystem fullness, CPU iowait, network errors)
If you only watch containers, you may miss “host disk 100% full” which kills everything. If you only watch host metrics, you may miss “one container is being throttled or OOMKilled”.
7. Visualizing in Grafana
7.1 Run Grafana
docker run -d \
--name grafana \
--network monitoring \
--restart unless-stopped \
-p 3000:3000 \
-v grafana-data:/var/lib/grafana \
grafana/grafana:11.2.0
Open http://localhost:3000. The default login is admin / admin (you will be prompted to change the password).
7.2 Add Prometheus as a data source
In Grafana UI:
- Connections → Data sources → Add data source → Prometheus
- URL: http://prometheus:9090
- Save & test
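If you prefer configuration as code, Grafana can also provision the data source from a file instead of UI clicks. A minimal sketch (the /opt/monitoring/grafana path is an assumption matching this tutorial's layout):

```yaml
# /opt/monitoring/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```

Mount the directory with -v /opt/monitoring/grafana/provisioning:/etc/grafana/provisioning:ro when starting Grafana, and it will create the data source on boot.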
7.3 Dashboards: what to build first
Start with 3 dashboards:
- Service Overview (Golden Signals)
  - Request rate (RPS)
  - Error rate (5xx, app errors)
  - Latency p95/p99
  - Saturation (CPU, memory, queue depth)
- Container Health
  - Restarts per container
  - OOM kills
  - CPU throttling
  - Memory working set vs limit
- Host Health
  - Disk usage %
  - inode usage %
  - load average
  - CPU iowait
  - network errors/drops
You can import community dashboards, but treat them as starting points. The most valuable dashboards are the ones that match your service names and your SLOs.
8. Alerting with Alertmanager: principles and examples
Alerting is where monitoring becomes operationally useful—and where many setups fail due to noise.
8.1 Alerting principles
- Alert on user impact when possible (latency, errors, failed requests).
- Use symptoms first, causes second:
- Symptom: error rate spike
- Cause hints: OOM kills, CPU throttling, disk full
- Avoid flapping:
  - Use for: to require sustained failure (e.g., 5 minutes)
- Make alerts actionable:
- Include what to check: container name, host, dashboard link, runbook steps
- Route by severity:
- Page for urgent, ticket for non-urgent
8.2 Run Alertmanager
Create directories:
mkdir -p /opt/monitoring/alertmanager
Create a minimal config (email/Slack integration omitted here; you can add later):
cat > /opt/monitoring/alertmanager/alertmanager.yml <<'EOF'
global: {}

route:
  receiver: 'default'
  group_by: ['alertname', 'job', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'default'
EOF
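When you are ready to deliver notifications, a Slack receiver is a common first step. A sketch of what the route/receivers sections might look like — the webhook URL and channel are placeholders you must replace with your own:

```yaml
route:
  receiver: 'slack-oncall'
  group_by: ['alertname', 'job', 'instance']

receivers:
  - name: 'slack-oncall'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK'
        channel: '#alerts'
        send_resolved: true
```

Restart Alertmanager after editing the config, and use amtool or a test alert to confirm delivery.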
Run:
docker run -d \
--name alertmanager \
--network monitoring \
--restart unless-stopped \
-p 9093:9093 \
-v /opt/monitoring/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro \
prom/alertmanager:v0.27.0
Open:
http://localhost:9093
8.3 Connect Prometheus to Alertmanager
Edit Prometheus config:
cat >> /opt/monitoring/prometheus/prometheus.yml <<'EOF'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
EOF
Restart Prometheus:
docker restart prometheus
8.4 Add alert rules
Create rules directory:
mkdir -p /opt/monitoring/prometheus/rules
Create alert rules file:
cat > /opt/monitoring/prometheus/rules/docker-alerts.yml <<'EOF'
groups:
  - name: docker-and-host-alerts
    rules:
      - alert: HostDiskWillFillSoon
        expr: |
          (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.10
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Host disk low on {{ $labels.instance }}"
          description: "Filesystem has <10% available for 10m. Check docker images, logs, volumes."

      - alert: ContainerHighRestartRate
        expr: |
          changes(container_start_time_seconds{name!=""}[15m]) > 3
        for: 5m
        labels:
          severity: ticket
        annotations:
          summary: "Container restarting frequently"
          description: "Container {{ $labels.name }} restarted more than 3 times in 15m (start time changed). Investigate docker events/logs."

      - alert: ContainerOOMKilled
        expr: |
          increase(container_oom_events_total[10m]) > 0
        for: 0m
        labels:
          severity: page
        annotations:
          summary: "Container OOM kill detected"
          description: "Container had an OOM event in the last 10m. Check memory usage and limits."

      - alert: ContainerCPUThrottlingHigh
        expr: |
          rate(container_cpu_cfs_throttled_seconds_total[5m]) / rate(container_cpu_cfs_periods_total[5m]) > 0.25
        for: 10m
        labels:
          severity: ticket
        annotations:
          summary: "High CPU throttling"
          description: "CPU throttling ratio >25% for 10m. Consider raising CPU limits or optimizing workload."
EOF
Now include this rules directory in the Prometheus config by adding a rule_files section to /opt/monitoring/prometheus/prometheus.yml. The snippet below does this idempotently:
grep -q '^rule_files:' /opt/monitoring/prometheus/prometheus.yml || \
sed -i '1irule_files:\n - /etc/prometheus/rules/*.yml\n' /opt/monitoring/prometheus/prometheus.yml
Restart Prometheus with rules mounted:
docker rm -f prometheus
docker run -d \
--name prometheus \
--network monitoring \
--restart unless-stopped \
-p 9090:9090 \
-v /opt/monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro \
-v /opt/monitoring/prometheus/rules:/etc/prometheus/rules:ro \
-v prometheus-data:/prometheus \
prom/prometheus:v2.55.1
Check alerts:
- Prometheus: http://localhost:9090/alerts
- Alertmanager: http://localhost:9093/#/alerts
Important: Alert expressions vary by cAdvisor version and environment. Always validate metric names in Prometheus “Graph” tab before relying on them.
9. Monitoring container health properly (healthchecks and uptime)
9.1 Docker HEALTHCHECK
A container being “running” only means its process exists. Add a healthcheck so Docker can report healthy/unhealthy.
Example Dockerfile snippet:
HEALTHCHECK --interval=10s --timeout=2s --retries=3 \
CMD curl -fsS http://localhost:8080/health || exit 1
If you can’t rebuild images, you can still add healthchecks in docker run:
docker run -d --name myapp \
--health-cmd='curl -fsS http://localhost:8080/health || exit 1' \
--health-interval=10s --health-timeout=2s --health-retries=3 \
myimage:latest
Check status:
docker inspect --format '{{json .State.Health}}' myapp | jq
9.2 Alerting on health status
cAdvisor does not always expose Docker health status as a metric. A common approach is:
- Export health status via a small sidecar exporter, or
- Prefer blackbox probing (HTTP/TCP checks) from Prometheus.
A simple and robust method is to run blackbox-exporter and probe your services like a user would.
Run blackbox exporter:
docker run -d \
--name blackbox \
--network monitoring \
--restart unless-stopped \
-p 9115:9115 \
prom/blackbox-exporter:v0.25.0
Add to Prometheus scrape config (edit prometheus.yml accordingly):
- Probe an HTTP endpoint like http://myapp:8080/health (if on the same Docker network)
- Or probe via a host port, e.g. http://host.docker.internal:8080/health (platform-dependent)
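Concretely, a blackbox scrape job follows a standard relabeling pattern: the target URL becomes a query parameter, and the actual scrape goes to the exporter. A sketch to append to prometheus.yml (myapp:8080/health is a placeholder for your own endpoint; restart Prometheus afterwards):

```yaml
scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://myapp:8080/health
    relabel_configs:
      # Pass the original target as the ?target= probe parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label for readable alerts
      - source_labels: [__param_target]
        target_label: instance
      # Actually scrape the blackbox exporter container
      - target_label: __address__
        replacement: blackbox:9115
```

Each probed endpoint then yields a probe_success metric (1 or 0) you can alert on directly.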
This is one of the best “detect before users do” techniques: synthetic checks.
10. Logs: from docker logs to centralized logging
10.1 Choose a logging driver intentionally
Docker defaults to json-file. It works, but can fill disks if unbounded.
Check current logging driver:
docker info --format '{{.LoggingDriver}}'
If using json-file, set rotation (daemon-level, requires Docker daemon config). Example (conceptual):
- max-size
- max-file
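Concretely, daemon-level rotation lives in /etc/docker/daemon.json (restart the Docker daemon after editing; the values below are illustrative starting points, not recommendations for every workload):

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```

The per-container equivalent is docker run --log-opt max-size=10m --log-opt max-file=3, which applies only to newly created containers.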
Even without daemon changes, you can still mitigate by:
- ensuring apps log to stdout/stderr (not to files in the container)
- shipping logs centrally
10.2 Quick log triage commands
Find noisy containers:
docker ps --format '{{.Names}}' | while read -r c; do
echo "== $c =="; docker logs --tail 5 "$c" 2>/dev/null
done
Search logs (basic):
docker logs myapp 2>&1 | grep -i "error" | tail -n 50
10.3 Centralized logging options
Common approaches:
- Loki + Promtail (Grafana ecosystem)
- ELK/EFK (Elasticsearch + Fluentd/Fluent Bit + Kibana)
- Cloud logging (CloudWatch, Stackdriver, etc.)
If you already run Grafana, Loki is often a pragmatic next step.
11. Detecting common container failures (recipes)
This section connects failure modes to specific metrics and alert ideas.
11.1 Crash loops / frequent restarts
Symptoms:
- container exits repeatedly
- service intermittently unavailable
Signals:
- Docker restart count
- container start time changes
- application error rate spikes
Investigation:
docker ps -a --filter "name=myapp"
docker inspect --format '{{.State.Status}} {{.State.ExitCode}} {{.State.Error}}' myapp
docker logs --tail 200 myapp
docker events --since 30m --filter container=myapp
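Exit codes from docker inspect carry a lot of signal during crash-loop triage. A small helper for interpreting the common ones — illustrative and not exhaustive; 137 in particular can mean either an OOM kill or an explicit kill:

```shell
# Map common container exit codes to likely causes (illustrative).
explain_exit_code() {
  case "$1" in
    0)   echo "clean exit" ;;
    1)   echo "application error (check logs)" ;;
    125) echo "docker run itself failed" ;;
    126) echo "command cannot be invoked (permissions?)" ;;
    127) echo "command not found (bad entrypoint?)" ;;
    137) echo "SIGKILL (often OOMKilled, or docker kill)" ;;
    139) echo "SIGSEGV (segmentation fault)" ;;
    143) echo "SIGTERM (graceful stop)" ;;
    *)   echo "unknown exit code: $1" ;;
  esac
}

explain_exit_code 137   # -> SIGKILL (often OOMKilled, or docker kill)
```

Pair it with docker inspect --format '{{.State.ExitCode}}' myapp to get the code itself.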
Alerting:
- If restarts exceed threshold in 15 minutes → ticket/page depending on criticality.
11.2 OOM kills
Symptoms:
- sudden restarts
- requests fail under load
- memory usage climbs over time
Signals:
- container_oom_events_total increase
- memory usage near limit
Investigation:
docker inspect --format '{{json .State}}' myapp | jq '.OOMKilled, .ExitCode, .FinishedAt'
docker stats --no-stream myapp
Fixes:
- raise memory limit
- fix memory leak
- reduce concurrency / batch sizes
- tune JVM/Node/Python memory behavior
Alerting:
- Page on any OOM kill for critical services.
11.3 CPU throttling (hidden latency killer)
Symptoms:
- high latency, timeouts
- CPU usage may look “fine” at host level
Signals:
- throttled seconds / periods ratio
- request duration p95/p99 increases
Fixes:
- increase CPU quota
- reduce CPU-heavy work
- move background jobs off request path
Alerting:
- Ticket if throttling ratio sustained > 20–30% for 10 minutes.
- Page only if it correlates with user-impact signals.
11.4 Disk full (host)
Symptoms:
- containers fail to start
- Docker pulls fail
- logs stop writing
- database corruption risk
Signals:
- node filesystem available %
- inode exhaustion
Investigation:
df -h
df -i
docker system df
docker system df -v
sudo du -sh /var/lib/docker/* 2>/dev/null
Fixes:
- rotate logs
- prune unused images/containers
- move volumes to larger disk
- set up disk alerts early (10–15% free)
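Once you have confirmed what is consuming space, a cautious cleanup sequence might look like the following. These commands are destructive, so review each output before proceeding; the until filter value (7 days) is an example, not a recommendation:

```shell
docker container prune -f                        # remove stopped containers
docker image prune -a -f --filter "until=168h"   # remove unused images older than 7 days
docker builder prune -f                          # remove build cache
```

Avoid blanket docker system prune --volumes on hosts with stateful containers: it can delete data volumes you still need.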
11.5 Network and dependency failures
Symptoms:
- timeouts, 5xx
- increased retries
- queue backlogs
Signals:
- blackbox probe failures
- app-level error counters
- latency increases
Investigation:
docker exec -it myapp sh -lc 'apk add --no-cache curl bind-tools || true; nslookup db; curl -v http://upstream/health'
Alerting:
- Page on sustained probe failure (e.g., 3/5 minutes) for critical endpoints.
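Assuming a blackbox-exporter scrape job named blackbox-http (adjust to your own job name), a probe-failure rule is a sketch like:

```yaml
- alert: EndpointProbeFailed
  expr: probe_success{job="blackbox-http"} == 0
  for: 3m
  labels:
    severity: page
  annotations:
    summary: "Probe failing for {{ $labels.instance }}"
    description: "Blackbox probe has failed for 3m. The endpoint is likely down for users too."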
12. On-call hygiene: noise reduction and actionable alerts
12.1 Severity levels
A simple scheme:
- page: immediate human action required (user impact likely/confirmed)
- ticket: needs attention during business hours
- info: for dashboards and trend tracking
12.2 Add context to alerts
Include:
- container/service name
- host
- suspected cause
- next command to run
- dashboard link (Grafana URL)
Example annotation style:
- “Check docker logs <container>”
- “Check docker inspect for OOMKilled”
- “Check host disk usage with df -h”
12.3 Use inhibition and grouping
Alertmanager can inhibit “symptom” alerts when a higher-level outage is firing (e.g., “HostDown” inhibits “ContainerDown”). This prevents cascades.
12.4 Test your alerts
Do not assume alerts work. Induce failures in a controlled environment:
- Stop a container
- Artificially limit memory
- Fill disk in a temp filesystem (careful)
- Introduce latency
Example: stop cAdvisor briefly and confirm you get “target down” alerts (if you add them).
13. Hardening and operating the monitoring stack
13.1 Persist data
Prometheus data should be on a volume (prometheus-data). Grafana also uses a volume (grafana-data). You already did this.
Check volumes:
docker volume ls | grep -E 'prometheus-data|grafana-data'
13.2 Secure access
At minimum:
- bind Prometheus/Grafana/Alertmanager to internal interfaces or VPN
- put them behind an authenticating reverse proxy
- set strong Grafana admin password
- consider TLS termination
13.3 Monitor the monitors
Add alerts for:
- Prometheus target down
- Alertmanager unreachable
- low disk on Prometheus volume
- high memory usage in Prometheus itself
Example “target down” alert (add to rules):
# Fires when a scrape target is down for 5 minutes
# (Tune exclusions to avoid noisy ephemeral targets.)
up == 0
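Wrapped into rule-file form, that expression might look like this (a sketch; tune the for duration and add label exclusions for ephemeral targets):

```yaml
- alert: TargetDown
  expr: up == 0
  for: 5m
  labels:
    severity: ticket
  annotations:
    summary: "Target {{ $labels.job }}/{{ $labels.instance }} is down"
    description: "Prometheus has been unable to scrape this target for 5m."
```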
13.4 Capacity planning
Prometheus performance depends on:
- scrape interval
- number of targets
- number of time series (cardinality)
- retention period
Practical advice:
- start with 15s scrape
- avoid high-cardinality labels (request IDs, full URLs)
- keep retention reasonable (e.g., 15–30 days) unless you have a long-term store
14. Quick checklist
Foundations
- You can answer: “Is the service down?” in < 60 seconds
- You can answer: “Why is it down?” in < 10 minutes (with metrics + logs)
Metrics
- Host: disk %, inode %, CPU iowait, memory pressure
- Container: restarts, OOM kills, CPU throttling, memory working set
- App: request rate, error rate, latency percentiles
Alerting
- Alerts map to user impact or imminent failure
- Alerts have for: durations to reduce flapping
- Alerts include runbook hints and context
- Alertmanager groups and routes by severity
Validation
- You tested at least one crash loop, one OOM, and one dependency failure scenario
- Dashboards reflect your service labels and SLOs
Next steps (practical upgrades)
If you want to go beyond “host + container metrics” and truly detect failures before users do, prioritize:
- Application instrumentation (Prometheus client libraries or OpenTelemetry metrics)
- Blackbox probing of critical endpoints (login, checkout, API health)
- Centralized logging (Loki/ELK) with correlation IDs
- SLO-based alerting (burn-rate alerts on error budget)
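As a taste of burn-rate alerting: assuming a 99.9% availability SLO and an http_requests_total metric (a placeholder name), a fast-burn condition fires when the error ratio over the last hour exceeds 14.4x the error budget rate, i.e. the pace at which a 30-day budget would be exhausted in roughly two days:

```promql
# 1h error ratio above 14.4x the 0.1% budget rate (fast burn)
(
  sum(rate(http_requests_total{code=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
) > (14.4 * 0.001)
```

In practice you combine a fast window (page) with a slow window (ticket) so both sudden outages and slow leaks are caught.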
If you share what kind of services you run (HTTP APIs, workers, databases) and whether you use Docker Compose or plain docker run, I can propose a tailored set of PromQL alerts and dashboards that match your real failure modes.