Monitoring and Alerting for Dockerized Workloads: Detect Failures Before Users Do
Monitoring and alerting for Dockerized workloads is not just about drawing pretty graphs. The goal is to detect failures early, pinpoint root causes quickly, and reduce mean time to recovery (MTTR)—ideally before customers notice. This tutorial walks through a practical, production-oriented approach using real commands and a clear mental model: what to measure, how to collect it, how to visualize it, and how to alert on it.
You’ll build a monitoring stack around Docker containers, learn what “good” signals look like, and implement alerts that catch common failure modes: crashes, OOM kills, disk pressure, CPU throttling, latency spikes, and dependency failures.
Table of Contents
- 1. What “good monitoring” means for containers
- 2. Key failure modes in Dockerized systems
- 3. Metrics, logs, and traces: what to collect
- 4. Baseline: Docker’s built-in observability commands
- 5. A practical monitoring stack for Docker
- 6. Collecting container and host metrics with Prometheus + cAdvisor
- 7. Visualizing in Grafana
- 8. Alerting with Alertmanager: principles and examples
- 9. Monitoring container health properly (healthchecks and uptime)
- 10. Logs: from docker logs to centralized logging
- 11. Detecting common container failures (recipes)
- 12. On-call hygiene: noise reduction and actionable alerts
- 13. Hardening and operating the monitoring stack
- 14. Quick checklist
1. What “good monitoring” means for containers
A Docker container is not a VM. It’s a process (or a group of processes) with cgroup limits, namespaces, and a lifecycle controlled by Docker. Monitoring must reflect that reality:
- Containers can be rescheduled/recreated frequently; you need stable labels (service name, image, environment), not just container IDs.
- Resource limits matter: CPU throttling and memory OOM kills are common and often invisible if you only watch host CPU/RAM.
- Dependencies are everything: your container may be “up” but effectively down because DNS, database, or upstream APIs are failing.
- Alerting must be service-oriented: “container restarted” is not always a page, but “error rate increased” often is.
A useful way to structure observability is:
- Golden Signals (SRE): latency, traffic, errors, saturation.
- RED method (request-based services): Rate, Errors, Duration.
- USE method (resources): Utilization, Saturation, Errors.
You will typically combine application metrics (RED) with container/host metrics (USE) to get both symptom detection and root cause hints.
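As a concrete sketch: assuming your application exports Prometheus metrics named http_requests_total and http_request_duration_seconds_bucket (names vary by instrumentation library — treat these as placeholders), the RED signals map to PromQL queries like:

```promql
# Rate: requests per second, per service
sum by (service) (rate(http_requests_total[5m]))

# Errors: fraction of requests returning 5xx
sum by (service) (rate(http_requests_total{code=~"5.."}[5m]))
  / sum by (service) (rate(http_requests_total[5m]))

# Duration: p95 latency from a histogram
histogram_quantile(0.95,
  sum by (service, le) (rate(http_request_duration_seconds_bucket[5m])))
```

These three queries, split by service, are usually enough to answer "is anyone hurting right now?"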
2. Key failure modes in Dockerized systems
These are common ways Dockerized workloads fail in production:
- Crash loops: process exits repeatedly; may be due to bad config, missing secrets, failed migrations.
- OOMKilled: container hits memory limit; kernel kills it.
- CPU throttling: container is CPU-limited and gets throttled, causing latency spikes.
- Disk full: logs, images, or volumes fill disk; Docker and apps misbehave.
- FD exhaustion: “too many open files” from leaks or high load.
- Network issues: DNS failures, dropped packets, connection saturation.
- Dependency failures: DB down, queue backlog, upstream API slow.
- Slow degradation: memory leaks, increasing GC time, growing latency.
- Silent failure: container is “running” but not serving (deadlock, stuck thread, hung event loop).
Your monitoring should detect each of these with either:
- direct signals (OOM kill counter, restart count), or
- indirect signals (latency/error rate), or ideally both.
3. Metrics, logs, and traces: what to collect
Metrics
Best for: trends, alerting, capacity planning.
Collect:
- Container CPU usage, CPU throttling
- Memory usage, memory working set
- OOM kills, restarts
- Network RX/TX, errors
- Disk I/O, filesystem usage
- Application metrics: request rate, error rate, latency, queue depth
Logs
Best for: debugging, incident forensics.
Collect:
- Container stdout/stderr
- App logs with request IDs
- Reverse proxy logs (nginx/traefik)
- Docker daemon logs (sometimes critical)
Traces
Best for: latency root cause across services.
Collect:
- Distributed traces (OpenTelemetry) when you have multiple services
This tutorial focuses primarily on metrics + alerting, with a practical logging section.
4. Baseline: Docker’s built-in observability commands
Before deploying a full stack, you should be fluent in Docker’s own tools. They are invaluable during incidents.
Container list and status
docker ps
docker ps -a
Resource usage (live)
docker stats
docker stats --no-stream
Inspect container state (including OOMKilled)
docker inspect --format '{{json .State}}' my-container | jq
Look for:
- OOMKilled: true
- ExitCode
- Error
- FinishedAt
- Health (if configured)
Logs
docker logs my-container
docker logs -f --tail 200 my-container
Events (often underused)
Docker emits events for restarts, kills, health status changes, etc.
docker events --since 1h
docker events --filter container=my-container
Check restart count
docker inspect --format '{{.RestartCount}}' my-container
These commands are not a monitoring system, but they teach you what signals exist and help you validate alerts.
5. A practical monitoring stack for Docker
A common, battle-tested stack:
- cAdvisor: exports container metrics (CPU, memory, network, filesystem) for Prometheus.
- node-exporter: exports host metrics (disk, CPU, memory, network, filesystem).
- Prometheus: scrapes metrics and stores time series.
- Grafana: dashboards.
- Alertmanager: routes Prometheus alerts to email/Slack/PagerDuty/etc.
You can run all of these as containers. The tutorial uses docker run commands to keep things explicit.
Note: In Kubernetes, you’d do this differently (operators, ServiceMonitors, etc.). Here we focus on plain Docker hosts.
6. Collecting container and host metrics with Prometheus + cAdvisor
6.1 Create a Docker network for monitoring
docker network create monitoring
6.2 Run cAdvisor
cAdvisor needs access to host cgroups and Docker state.
docker run -d \
--name cadvisor \
--network monitoring \
--restart unless-stopped \
-p 8080:8080 \
-v /:/rootfs:ro \
-v /var/run:/var/run:rw \
-v /sys:/sys:ro \
-v /var/lib/docker/:/var/lib/docker:ro \
gcr.io/cadvisor/cadvisor:v0.49.1
Verify:
curl -s http://localhost:8080/metrics | head
You should see Prometheus-style metrics like container_cpu_usage_seconds_total.
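Once cAdvisor is scraped, a few PromQL queries cover most per-container questions. Metric and label names can vary by cAdvisor version (the name label filter below assumes a recent release), so verify them in the Prometheus Graph tab:

```promql
# Per-container CPU usage, in cores
sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[5m]))

# Per-container memory working set (what the OOM killer cares about)
container_memory_working_set_bytes{name!=""}

# Per-container network receive throughput
sum by (name) (rate(container_network_receive_bytes_total{name!=""}[5m]))
```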
6.3 Run node-exporter (host metrics)
docker run -d \
--name node-exporter \
--network monitoring \
--restart unless-stopped \
-p 9100:9100 \
--pid=host \
-v /:/host:ro,rslave \
quay.io/prometheus/node-exporter:v1.8.2 \
--path.rootfs=/host
Verify:
curl -s http://localhost:9100/metrics | head
6.4 Create a Prometheus config file
Create a directory:
mkdir -p /opt/monitoring/prometheus
Create the config:
cat > /opt/monitoring/prometheus/prometheus.yml <<'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['prometheus:9090']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
EOF
6.5 Run Prometheus
docker run -d \
--name prometheus \
--network monitoring \
--restart unless-stopped \
-p 9090:9090 \
-v /opt/monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro \
-v prometheus-data:/prometheus \
prom/prometheus:v2.55.1
Verify Prometheus targets:
- Open http://localhost:9090/targets
- Ensure the cadvisor and node targets are UP
6.6 Why both cAdvisor and node-exporter?
- cAdvisor answers: “What is each container doing?” (per-container CPU, memory, network)
- node-exporter answers: “What is the host doing?” (disk usage, filesystem fullness, CPU iowait, network errors)
If you only watch containers, you may miss “host disk 100% full” which kills everything. If you only watch host metrics, you may miss “one container is being throttled or OOMKilled”.
7. Visualizing in Grafana
7.1 Run Grafana
docker run -d \
--name grafana \
--network monitoring \
--restart unless-stopped \
-p 3000:3000 \
-v grafana-data:/var/lib/grafana \
grafana/grafana:11.2.0
Open http://localhost:3000. The default login is admin / admin (you will be prompted to change the password).
7.2 Add Prometheus as a data source
In Grafana UI:
- Connections → Data sources → Add data source → Prometheus
- URL: http://prometheus:9090
- Save & test
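If you prefer configuration as code, Grafana can also provision the data source from a file instead of UI clicks. A minimal sketch (the /opt/monitoring/grafana path is an assumption matching this tutorial's layout):

```yaml
# /opt/monitoring/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```

Mount the directory with -v /opt/monitoring/grafana/provisioning:/etc/grafana/provisioning:ro when starting Grafana, and it will create the data source on boot.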
7.3 Dashboards: what to build first
Start with 3 dashboards:
- Service Overview (Golden Signals)
  - Request rate (RPS)
  - Error rate (5xx, app errors)
  - Latency p95/p99
  - Saturation (CPU, memory, queue depth)
- Container Health
  - Restarts per container
  - OOM kills
  - CPU throttling
  - Memory working set vs limit
- Host Health
  - Disk usage %
  - inode usage %
  - load average
  - CPU iowait
  - network errors/drops
You can import community dashboards, but treat them as starting points. The most valuable dashboards are the ones that match your service names and your SLOs.
8. Alerting with Alertmanager: principles and examples
Alerting is where monitoring becomes operationally useful—and where many setups fail due to noise.
8.1 Alerting principles
- Alert on user impact when possible (latency, errors, failed requests).
- Use symptoms first, causes second:
- Symptom: error rate spike
- Cause hints: OOM kills, CPU throttling, disk full
- Avoid flapping:
  - Use for: to require sustained failure (e.g., 5 minutes)
- Make alerts actionable:
- Include what to check: container name, host, dashboard link, runbook steps
- Route by severity:
- Page for urgent, ticket for non-urgent
8.2 Run Alertmanager
Create directories:
mkdir -p /opt/monitoring/alertmanager
Create a minimal config (email/Slack integration omitted here; you can add later):
cat > /opt/monitoring/alertmanager/alertmanager.yml <<'EOF'
global: {}

route:
  receiver: 'default'
  group_by: ['alertname', 'job', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

receivers:
  - name: 'default'
EOF
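When you are ready to deliver notifications, a Slack receiver is a common first step. A sketch of what the route/receivers sections might look like — the webhook URL and channel are placeholders you must replace with your own:

```yaml
route:
  receiver: 'slack-oncall'
  group_by: ['alertname', 'job', 'instance']

receivers:
  - name: 'slack-oncall'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK'
        channel: '#alerts'
        send_resolved: true
```

Restart Alertmanager after editing the config, and use amtool or a test alert to confirm delivery.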
Run:
docker run -d \
--name alertmanager \
--network monitoring \
--restart unless-stopped \
-p 9093:9093 \
-v /opt/monitoring/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro \
prom/alertmanager:v0.27.0
Open:
http://localhost:9093
8.3 Connect Prometheus to Alertmanager
Edit Prometheus config:
cat >> /opt/monitoring/prometheus/prometheus.yml <<'EOF'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
EOF
Restart Prometheus:
docker restart prometheus
8.4 Add alert rules
Create rules directory:
mkdir -p /opt/monitoring/prometheus/rules
Create alert rules file:
cat > /opt/monitoring/prometheus/rules/docker-alerts.yml <<'EOF'
groups:
  - name: docker-and-host-alerts
    rules:
      - alert: HostDiskWillFillSoon
        expr: |
          (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.10
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Host disk low on {{ $labels.instance }}"
          description: "Filesystem has <10% available for 10m. Check docker images, logs, volumes."

      - alert: ContainerHighRestartRate
        expr: |
          changes(container_start_time_seconds{name!=""}[15m]) > 3
        for: 5m
        labels:
          severity: ticket
        annotations:
          summary: "Container restarting frequently"
          description: "Container {{ $labels.name }} restarted more than 3 times in 15m (start time changed). Investigate docker events/logs."

      - alert: ContainerOOMKilled
        expr: |
          increase(container_oom_events_total[10m]) > 0
        for: 0m
        labels:
          severity: page
        annotations:
          summary: "Container OOM kill detected"
          description: "Container had an OOM event in the last 10m. Check memory usage and limits."

      - alert: ContainerCPUThrottlingHigh
        expr: |
          rate(container_cpu_cfs_throttled_seconds_total[5m]) / rate(container_cpu_cfs_periods_total[5m]) > 0.25
        for: 10m
        labels:
          severity: ticket
        annotations:
          summary: "High CPU throttling"
          description: "CPU throttling ratio >25% for 10m. Consider raising CPU limits or optimizing workload."
EOF
Now include this rules directory in the Prometheus config by adding a rule_files section to /opt/monitoring/prometheus/prometheus.yml. The snippet below does this idempotently:
grep -q '^rule_files:' /opt/monitoring/prometheus/prometheus.yml || \
sed -i '1irule_files:\n - /etc/prometheus/rules/*.yml\n' /opt/monitoring/prometheus/prometheus.yml
Restart Prometheus with rules mounted:
docker rm -f prometheus
docker run -d \
--name prometheus \
--network monitoring \
--restart unless-stopped \
-p 9090:9090 \
-v /opt/monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro \
-v /opt/monitoring/prometheus/rules:/etc/prometheus/rules:ro \
-v prometheus-data:/prometheus \
prom/prometheus:v2.55.1
Check alerts:
- Prometheus: http://localhost:9090/alerts
- Alertmanager: http://localhost:9093/#/alerts
Important: Alert expressions vary by cAdvisor version and environment. Always validate metric names in Prometheus “Graph” tab before relying on them.
9. Monitoring container health properly (healthchecks and uptime)
9.1 Docker HEALTHCHECK
A container being “running” only means its process exists. Add a healthcheck so Docker can report healthy/unhealthy.
Example Dockerfile snippet:
HEALTHCHECK --interval=10s --timeout=2s --retries=3 \
CMD curl -fsS http://localhost:8080/health || exit 1
If you can’t rebuild images, you can still add healthchecks in docker run:
docker run -d --name myapp \
--health-cmd='curl -fsS http://localhost:8080/health || exit 1' \
--health-interval=10s --health-timeout=2s --health-retries=3 \
myimage:latest
Check status:
docker inspect --format '{{json .State.Health}}' myapp | jq
9.2 Alerting on health status
cAdvisor does not always expose Docker health status as a metric. A common approach is:
- Export health status via a small sidecar exporter, or
- Prefer blackbox probing (HTTP/TCP checks) from Prometheus.
A simple and robust method is to run blackbox-exporter and probe your services like a user would.
Run blackbox exporter:
docker run -d \
--name blackbox \
--network monitoring \
--restart unless-stopped \
-p 9115:9115 \
prom/blackbox-exporter:v0.25.0
Add to Prometheus scrape config (edit prometheus.yml accordingly):
- Probe an HTTP endpoint like http://myapp:8080/health (if on the same Docker network)
- Or probe via a host port, e.g. http://host.docker.internal:8080/health (platform-dependent)
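Concretely, a blackbox scrape job follows a standard relabeling pattern: the target URL becomes a query parameter, and the actual scrape goes to the exporter. A sketch to append to prometheus.yml (myapp:8080/health is a placeholder for your own endpoint; restart Prometheus afterwards):

```yaml
scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - http://myapp:8080/health
    relabel_configs:
      # Pass the original target as the ?target= probe parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label for readable alerts
      - source_labels: [__param_target]
        target_label: instance
      # Actually scrape the blackbox exporter container
      - target_label: __address__
        replacement: blackbox:9115
```

Each probed endpoint then yields a probe_success metric (1 or 0) you can alert on directly.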
This is one of the best “detect before users do” techniques: synthetic checks.
10. Logs: from docker logs to centralized logging
10.1 Choose a logging driver intentionally
Docker defaults to json-file. It works, but can fill disks if unbounded.
Check current logging driver:
docker info --format '{{.LoggingDriver}}'
If using json-file, set rotation (daemon-level, requires Docker daemon config). Example (conceptual):
- max-size
- max-file
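Concretely, daemon-level rotation lives in /etc/docker/daemon.json (restart the Docker daemon after editing; the values below are illustrative starting points, not recommendations for every workload):

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}
```

The per-container equivalent is docker run --log-opt max-size=10m --log-opt max-file=3, which applies only to newly created containers.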
Even without daemon changes, you can still mitigate by:
- ensuring apps log to stdout/stderr (not to files in the container)
- shipping logs centrally
10.2 Quick log triage commands
Find noisy containers:
docker ps --format '{{.Names}}' | while read -r c; do
echo "== $c =="; docker logs --tail 5 "$c" 2>/dev/null
done
Search logs (basic):
docker logs myapp 2>&1 | grep -i "error" | tail -n 50
10.3 Centralized logging options
Common approaches:
- Loki + Promtail (Grafana ecosystem)
- ELK/EFK (Elasticsearch + Fluentd/Fluent Bit + Kibana)
- Cloud logging (CloudWatch, Stackdriver, etc.)
If you already run Grafana, Loki is often a pragmatic next step.
11. Detecting common container failures (recipes)
This section connects failure modes to specific metrics and alert ideas.
11.1 Crash loops / frequent restarts
Symptoms:
- container exits repeatedly
- service intermittently unavailable
Signals:
- Docker restart count
- container start time changes
- application error rate spikes
Investigation:
docker ps -a --filter "name=myapp"
docker inspect --format '{{.State.Status}} {{.State.ExitCode}} {{.State.Error}}' myapp
docker logs --tail 200 myapp
docker events --since 30m --filter container=myapp
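Exit codes from docker inspect carry a lot of signal during crash-loop triage. A small helper for interpreting the common ones — illustrative and not exhaustive; 137 in particular can mean either an OOM kill or an explicit kill:

```shell
# Map common container exit codes to likely causes (illustrative).
explain_exit_code() {
  case "$1" in
    0)   echo "clean exit" ;;
    1)   echo "application error (check logs)" ;;
    125) echo "docker run itself failed" ;;
    126) echo "command cannot be invoked (permissions?)" ;;
    127) echo "command not found (bad entrypoint?)" ;;
    137) echo "SIGKILL (often OOMKilled, or docker kill)" ;;
    139) echo "SIGSEGV (segmentation fault)" ;;
    143) echo "SIGTERM (graceful stop)" ;;
    *)   echo "unknown exit code: $1" ;;
  esac
}

explain_exit_code 137   # -> SIGKILL (often OOMKilled, or docker kill)
```

Pair it with docker inspect --format '{{.State.ExitCode}}' myapp to get the code itself.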
Alerting:
- If restarts exceed threshold in 15 minutes → ticket/page depending on criticality.
11.2 OOM kills
Symptoms:
- sudden restarts
- requests fail under load
- memory usage climbs over time
Signals:
- container_oom_events_total increase
- memory usage near limit
Investigation:
docker inspect --format '{{json .State}}' myapp | jq '.OOMKilled, .ExitCode, .FinishedAt'
docker stats --no-stream myapp
Fixes:
- raise memory limit
- fix memory leak
- reduce concurrency / batch sizes
- tune JVM/Node/Python memory behavior
Alerting:
- Page on any OOM kill for critical services.
11.3 CPU throttling (hidden latency killer)
Symptoms:
- high latency, timeouts
- CPU usage may look “fine” at host level
Signals:
- throttled seconds / periods ratio
- request duration p95/p99 increases
Fixes:
- increase CPU quota
- reduce CPU-heavy work
- move background jobs off request path
Alerting:
- Ticket if throttling ratio sustained > 20–30% for 10 minutes.
- Page only if it correlates with user-impact signals.
11.4 Disk full (host)
Symptoms:
- containers fail to start
- Docker pulls fail
- logs stop writing
- database corruption risk
Signals:
- node filesystem available %
- inode exhaustion
Investigation:
df -h
df -i
docker system df
docker system df -v
sudo du -sh /var/lib/docker/* 2>/dev/null
Fixes:
- rotate logs
- prune unused images/containers
- move volumes to larger disk
- set up disk alerts early (10–15% free)
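Once you have confirmed what is consuming space, a cautious cleanup sequence might look like the following. These commands are destructive, so review each output before proceeding; the until filter value (7 days) is an example, not a recommendation:

```shell
docker container prune -f                        # remove stopped containers
docker image prune -a -f --filter "until=168h"   # remove unused images older than 7 days
docker builder prune -f                          # remove build cache
```

Avoid blanket docker system prune --volumes on hosts with stateful containers: it can delete data volumes you still need.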
11.5 Network and dependency failures
Symptoms:
- timeouts, 5xx
- increased retries
- queue backlogs
Signals:
- blackbox probe failures
- app-level error counters
- latency increases
Investigation:
docker exec -it myapp sh -lc 'apk add --no-cache curl bind-tools || true; nslookup db; curl -v http://upstream/health'
Alerting:
- Page on sustained probe failure (e.g., 3/5 minutes) for critical endpoints.
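Assuming a blackbox-exporter scrape job named blackbox-http (adjust to your own job name), a probe-failure rule is a sketch like:

```yaml
- alert: EndpointProbeFailed
  expr: probe_success{job="blackbox-http"} == 0
  for: 3m
  labels:
    severity: page
  annotations:
    summary: "Probe failing for {{ $labels.instance }}"
    description: "Blackbox probe has failed for 3m. The endpoint is likely down for users too."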
12. On-call hygiene: noise reduction and actionable alerts
12.1 Severity levels
A simple scheme:
- page: immediate human action required (user impact likely/confirmed)
- ticket: needs attention during business hours
- info: for dashboards and trend tracking
12.2 Add context to alerts
Include:
- container/service name
- host
- suspected cause
- next command to run
- dashboard link (Grafana URL)
Example annotation style:
- “Check docker logs <container>”
- “Check docker inspect for OOMKilled”
- “Check host disk usage with df -h”
12.3 Use inhibition and grouping
Alertmanager can inhibit “symptom” alerts when a higher-level outage is firing (e.g., “HostDown” inhibits “ContainerDown”). This prevents cascades.
12.4 Test your alerts
Do not assume alerts work. Induce failures in a controlled environment:
- Stop a container
- Artificially limit memory
- Fill disk in a temp filesystem (careful)
- Introduce latency
Example: stop cAdvisor briefly and confirm you get “target down” alerts (if you add them).
13. Hardening and operating the monitoring stack
13.1 Persist data
Prometheus data should be on a volume (prometheus-data). Grafana also uses a volume (grafana-data). You already did this.
Check volumes:
docker volume ls | grep -E 'prometheus-data|grafana-data'
13.2 Secure access
At minimum:
- bind Prometheus/Grafana/Alertmanager to internal interfaces or VPN
- put them behind an authenticating reverse proxy
- set strong Grafana admin password
- consider TLS termination
13.3 Monitor the monitors
Add alerts for:
- Prometheus target down
- Alertmanager unreachable
- low disk on Prometheus volume
- high memory usage in Prometheus itself
Example “target down” alert (add to rules):
# Fires when a scrape target is down for 5 minutes
# (Tune exclusions to avoid noisy ephemeral targets.)
up == 0
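Wrapped into rule-file form, that expression might look like this (a sketch; tune the for duration and add label exclusions for ephemeral targets):

```yaml
- alert: TargetDown
  expr: up == 0
  for: 5m
  labels:
    severity: ticket
  annotations:
    summary: "Target {{ $labels.job }}/{{ $labels.instance }} is down"
    description: "Prometheus has been unable to scrape this target for 5m."
```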
13.4 Capacity planning
Prometheus performance depends on:
- scrape interval
- number of targets
- number of time series (cardinality)
- retention period
Practical advice:
- start with 15s scrape
- avoid high-cardinality labels (request IDs, full URLs)
- keep retention reasonable (e.g., 15–30 days) unless you have a long-term store
14. Quick checklist
Foundations
- You can answer: “Is the service down?” in < 60 seconds
- You can answer: “Why is it down?” in < 10 minutes (with metrics + logs)
Metrics
- Host: disk %, inode %, CPU iowait, memory pressure
- Container: restarts, OOM kills, CPU throttling, memory working set
- App: request rate, error rate, latency percentiles
Alerting
- Alerts map to user impact or imminent failure
- Alerts have for: durations to reduce flapping
- Alerts include runbook hints and context
- Alertmanager groups and routes by severity
Validation
- You tested at least one crash loop, one OOM, and one dependency failure scenario
- Dashboards reflect your service labels and SLOs
Next steps (practical upgrades)
If you want to go beyond “host + container metrics” and truly detect failures before users do, prioritize:
- Application instrumentation (Prometheus client libraries or OpenTelemetry metrics)
- Blackbox probing of critical endpoints (login, checkout, API health)
- Centralized logging (Loki/ELK) with correlation IDs
- SLO-based alerting (burn-rate alerts on error budget)
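As a taste of burn-rate alerting: assuming a 99.9% availability SLO and an http_requests_total metric (a placeholder name), a fast-burn condition fires when the error ratio over the last hour exceeds 14.4x the error budget rate, i.e. the pace at which a 30-day budget would be exhausted in roughly two days:

```promql
# 1h error ratio above 14.4x the 0.1% budget rate (fast burn)
(
  sum(rate(http_requests_total{code=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
) > (14.4 * 0.001)
```

In practice you combine a fast window (page) with a slow window (ticket) so both sudden outages and slow leaks are caught.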
If you share what kind of services you run (HTTP APIs, workers, databases) and whether you use Docker Compose or plain docker run, I can propose a tailored set of PromQL alerts and dashboards that match your real failure modes.