Out of Memory (OOM) in Docker: Diagnose and Fix Container Memory Crashes

Tags: docker, oom, container-memory, devops, kubernetes, linux, observability, performance-tuning

Running containers “in production” often means running them close to resource limits. When memory pressure hits, Linux will protect the host by killing processes. In Docker, that typically shows up as containers exiting unexpectedly, restarts looping, or logs that end abruptly. This tutorial explains what OOM is, how Docker and the Linux kernel enforce memory, and how to diagnose and fix memory crashes with real, copy‑pasteable commands.


1. What “OOM” means in Docker

OOM stands for Out Of Memory. In Linux, when available memory becomes too low, the kernel may kill one or more processes to recover memory. In a containerized environment, memory is controlled by cgroups (control groups). If a container exceeds its configured memory limit, the kernel can kill processes inside that container, which often results in the container exiting.

Typical outcomes:

  - The container exits with code 137 (128 + SIGKILL).
  - docker inspect shows OOMKilled=true.
  - A restart policy puts the container into a crash loop.
  - Logs stop abruptly with no shutdown message.


2. Linux memory, cgroups, and why containers get killed

Docker does not “manage memory” by itself; it asks the Linux kernel to enforce memory limits through cgroups.

2.1 cgroups v1 vs v2

Older systems use cgroups v1 (a separate hierarchy per controller, e.g. /sys/fs/cgroup/memory); newer distributions default to cgroups v2 (one unified hierarchy). The files you read in section 7 differ between the two.

You can check which is active (cgroup2fs means v2, tmpfs means v1):

stat -fc %T /sys/fs/cgroup

Or:

mount | grep cgroup

2.2 What counts as “memory”

Memory accounting includes:

  - Anonymous memory (your app's heap and stacks)
  - Page cache for files the container reads and writes
  - tmpfs mounts such as /dev/shm
  - Kernel memory attributed to the cgroup (sockets, slab, etc.)

This is why an app might claim “I only use 200MB” while the container shows 800MB: the container includes more than your app’s own heap.

2.3 OOM killer vs cgroup OOM

There are two major scenarios:

  1. Host OOM (system-wide): the entire host is out of memory. The kernel chooses a victim process across the system (could be dockerd, container processes, databases, etc.).
  2. cgroup OOM (container limit exceeded): the container hits its cgroup memory limit. The kernel kills one or more processes in that cgroup.
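The two cases leave different fingerprints in the kernel log. As an illustrative sketch (the exact message text varies by kernel version, and classify_oom is a hypothetical helper, not part of any tool), you could classify log lines like this:

```python
import re

def classify_oom(line: str) -> str:
    """Classify a kernel log line as a container-limit (cgroup) OOM,
    a host-wide OOM, or neither. Patterns are illustrative; exact
    wording differs across kernel versions."""
    if re.search(r"memory cgroup out of memory", line, re.IGNORECASE):
        return "cgroup"
    if re.search(r"out of memory: kill(ed)? process", line, re.IGNORECASE):
        return "host"
    return "other"
```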

In Docker troubleshooting, you want to know which one happened because the fixes differ:

  - Host OOM: reduce overall host pressure (add RAM or swap, set limits on every container, or move workloads elsewhere).
  - cgroup OOM: fix that one container (raise its limit, or make the app fit inside it).

3. Recognizing OOM symptoms

Common signs:

  - Exit code 137 and OOMKilled=true in docker inspect
  - Restart loops with no error in the application logs
  - Logs that end mid-request with no shutdown message
  - "Killed process" entries in the kernel log
  - Rising memory in docker stats right before each crash


4. Quick triage checklist

  1. Is the container OOMKilled?
  2. Did the host run out of memory (global OOM) or just the container?
  3. What was the container’s memory limit?
  4. What was the peak memory usage before crash?
  5. Is memory usage growing over time (leak) or spiking (batch/startup)?
  6. Is the runtime configured to respect container limits (JVM, Node)?
  7. Are there multiple processes in the container (sidecars, workers) sharing the same limit?

5. Diagnose OOM with Docker commands

5.1 Check container exit codes and OOMKilled flag

List stopped containers and exit codes:

docker ps -a --no-trunc

Inspect a specific container:

docker inspect <container_id_or_name> --format \
'Name={{.Name}} ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}} Error={{.State.Error}} FinishedAt={{.State.FinishedAt}}'

If OOMKilled=true, it’s a strong indicator of cgroup OOM.

Also inspect the configured memory limit:

docker inspect <container> --format \
'Memory={{.HostConfig.Memory}} MemorySwap={{.HostConfig.MemorySwap}} OomKillDisable={{.HostConfig.OomKillDisable}}'
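To interpret those raw values, here is a small hedged sketch (describe_limits is a hypothetical helper; it assumes both fields were explicitly set, in bytes, as docker inspect reports them):

```python
MIB = 1024 * 1024

def describe_limits(memory: int, memory_swap: int) -> str:
    """Interpret HostConfig.Memory / HostConfig.MemorySwap from docker inspect.
    Values are bytes; Memory=0 means no limit, MemorySwap=-1 means unlimited swap."""
    if memory == 0:
        return "no memory limit"
    if memory_swap == -1:
        return f"{memory // MIB} MiB RAM, unlimited swap"
    # MemorySwap is the TOTAL of memory + swap, so swap is the difference.
    return f"{memory // MIB} MiB RAM, {(memory_swap - memory) // MIB} MiB swap"
```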

Notes:

  - Values are in bytes; Memory=0 means no limit was configured.
  - MemorySwap is the total of memory plus swap; -1 means unlimited swap.
  - OomKillDisable=true means the kernel will not kill processes in this container when it hits the limit (see section 9.6).

5.2 Inspect restart loops and health checks

Restart loops can hide the initial failure.

docker inspect <container> --format \
'RestartCount={{.RestartCount}} Status={{.State.Status}} StartedAt={{.State.StartedAt}}'

Check events around the crash:

docker events --since 30m --until 0m | grep -E '<container_name>|<container_id>'

5.3 Live memory usage: docker stats

docker stats

This shows current usage and limit, e.g. 512MiB / 1GiB. For deeper diagnosis, watch it over time:

docker stats --no-stream

Or sample repeatedly:

while true; do
  date
  docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"
  sleep 5
done

If memory steadily climbs until it hits the limit, suspect a leak or unbounded cache.
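If you capture those samples, a rough heuristic can separate steady growth from spiky load. This is a sketch with assumed thresholds (looks_like_leak is a hypothetical helper, and the 1.5x growth ratio is an arbitrary default):

```python
def looks_like_leak(samples_mib: list, growth_ratio: float = 1.5) -> bool:
    """Heuristic: flag series that rise (almost) monotonically and end
    well above where they started; spiky traces suggest load, not a leak."""
    if len(samples_mib) < 3:
        return False
    # Count how often usage dropped between consecutive samples.
    drops = sum(1 for a, b in zip(samples_mib, samples_mib[1:]) if b < a)
    mostly_rising = drops <= len(samples_mib) // 4
    return mostly_rising and samples_mib[-1] >= samples_mib[0] * growth_ratio
```

Feed it MiB readings sampled at a fixed interval; a True result is a hint to profile, not proof of a leak.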

5.4 Container logs and “last words”

Get the last log lines before death:

docker logs --tail 200 <container>

If the process was SIGKILLed, you may see nothing helpful. That’s normal—SIGKILL does not allow cleanup.


6. Diagnose OOM from the host (kernel logs)

Docker’s OOMKilled flag is useful, but the most authoritative source is the kernel log.

6.1 dmesg and journald

On many systems:

sudo dmesg -T | grep -i -E 'oom|killed process|out of memory'

On systems using systemd journal:

sudo journalctl -k --since "1 hour ago" | grep -i -E 'oom|killed process|out of memory'

You’re looking for lines like:

  - Out of memory: Killed process <pid> (<name>) (host-wide OOM)
  - Memory cgroup out of memory: Killed process <pid> (<name>) (container limit hit)
  - oom-kill detail lines naming the constraint and cgroup (e.g. constraint=CONSTRAINT_MEMCG)

Exact wording varies by kernel version.

6.2 Identify which container/process was killed

Kernel logs often show a PID. You can map it back to a container.

If the container is still running (or quickly restarted), you can check its main PID:

docker inspect <container> --format 'PID={{.State.Pid}}'

To map an arbitrary PID to a container, inspect cgroup membership:

PID=12345
cat /proc/$PID/cgroup

Look for a path containing docker or kubepods (if Kubernetes). For Docker, you might see a container ID embedded.

If you have the container ID, you can correlate:

docker ps --no-trunc | grep <container_id_prefix>
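Extracting the container ID from /proc/<pid>/cgroup can be scripted. A sketch (container_id_from_cgroup is a hypothetical helper; cgroup paths vary by distro and cgroup driver):

```python
import re

def container_id_from_cgroup(cgroup_text: str):
    """Pull a 64-hex Docker container ID out of /proc/<pid>/cgroup contents.
    Covers the systemd driver layout (docker-<id>.scope) and the legacy
    cgroupfs layout (/docker/<id>); returns None if no ID is found."""
    m = re.search(r"docker[-/]([0-9a-f]{64})", cgroup_text)
    return m.group(1) if m else None
```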

7. Inspect cgroup memory settings and events

Sometimes Docker’s view is not enough; reading cgroup files shows the real limits and OOM counters.

7.1 Find the container’s cgroup path

Get the container’s init PID:

PID=$(docker inspect <container> --format '{{.State.Pid}}')
echo "$PID"

Then:

cat /proc/$PID/cgroup

7.2 Read memory limits and current usage (cgroups v2)

If your system uses cgroups v2, find the cgroup directory:

CGROUP_PATH=$(cat /proc/$PID/cgroup | awk -F: '{print $3}')
CGROUP_DIR="/sys/fs/cgroup${CGROUP_PATH}"
echo "$CGROUP_DIR"

Now read key files:

cat "$CGROUP_DIR/memory.max"
cat "$CGROUP_DIR/memory.current"
cat "$CGROUP_DIR/memory.events"
cat "$CGROUP_DIR/memory.stat" | head -n 50

Interpretation:

  - memory.max: the limit ("max" means unlimited)
  - memory.current: current total usage, including page cache
  - memory.events: counters; the oom_kill line counts kills in this cgroup
  - memory.stat: a breakdown (anon, file, kernel memory, ...)

If memory.events shows increasing oom_kill, you have confirmed cgroup-level OOM kills.
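Parsing memory.events is straightforward, since it is key-value lines. A minimal sketch (oom_kills is a hypothetical helper):

```python
def oom_kills(events_text: str) -> int:
    """Return the oom_kill counter from cgroup v2 memory.events contents."""
    for line in events_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "oom_kill":  # exact match, so "oom" does not shadow it
            return int(value)
    return 0
```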

7.3 Read memory limits and current usage (cgroups v1)

If using v1, the memory controller is typically:

MEM_CGROUP_PATH=$(cat /proc/$PID/cgroup | awk -F: '$2 ~ /memory/ {print $3}')
MEM_DIR="/sys/fs/cgroup/memory${MEM_CGROUP_PATH}"
echo "$MEM_DIR"

Read:

cat "$MEM_DIR/memory.limit_in_bytes"
cat "$MEM_DIR/memory.usage_in_bytes"
cat "$MEM_DIR/memory.max_usage_in_bytes"
cat "$MEM_DIR/memory.failcnt"
cat "$MEM_DIR/memory.stat" | head -n 50

8. Common root causes (and how to confirm them)

8.1 Memory leak

Pattern: memory usage grows steadily over time and never returns.

How to confirm:

  - Sample docker stats over hours; usage climbs steadily and never drops back after load subsides.
  - Restarting the container resets usage to baseline, then the climb repeats.

What to do:

  - Profile with your language's tooling (heap dumps, tracemalloc, pprof) and fix the leak.
  - As a stopgap, recycle workers or restart the container on a schedule.

8.2 Unbounded caches (JVM, Node, Python, Go)

Many runtimes use memory aggressively for performance (caches, JIT, arenas). In containers, this can exceed limits if not configured.

Confirm:

  - The runtime's ceiling (heap size, cache size) was derived from host memory or a default, not from the cgroup limit.
  - Usage plateaus near that default ceiling rather than near your app's actual working set.

Examples:

  - Older JVMs sized the default heap from host RAM, ignoring the container limit.
  - Node's default V8 old-space limit can exceed a small container's limit.
  - In-process caches (LRU maps, ORM identity maps) with no maximum size.

8.3 Too low memory limit / wrong sizing

Pattern: OOM happens under normal load, often after a deployment or traffic increase.

Confirm:

  - Peak usage under normal load sits at or just below the limit.
  - The OOMs started after a deployment, config change, or traffic increase.

Fix:

  - Raise the limit to measured peak plus headroom (see section 9.1), or shrink the workload handled per container.

8.4 Spikes during startup, compilation, or batch jobs

Pattern: container dies during startup or periodic tasks (cron-like jobs, report generation).

Confirm:

  - Crash timestamps line up with startup, scheduled jobs, or batch windows.
  - docker stats shows short spikes rather than steady growth.

Fix:

  - Stream or chunk the batch instead of loading everything into memory.
  - Stagger scheduled jobs, or run them in a separate container with its own limit.
  - Give the container enough headroom for the known spike.

8.5 Native memory (not visible in app-level metrics)

Your app may report low heap usage but still OOM due to:

  - Direct/byte buffers and memory-mapped files
  - Thread stacks (hundreds of threads add up)
  - JIT code cache and runtime metadata
  - Native libraries and allocator overhead (e.g. glibc malloc arenas)

Confirm:

  - The container's RSS (docker stats) is far above what your app-level heap metrics report.
  - For the JVM, Native Memory Tracking (-XX:NativeMemoryTracking=summary) shows where non-heap memory goes.

Fix:

  - Cap the native consumers (e.g. -XX:MaxDirectMemorySize, smaller thread pools, MALLOC_ARENA_MAX=2).
  - Budget the container limit for total process memory, not just heap.

8.6 Page cache and file I/O pressure

Heavy file reads/writes can increase page cache. Depending on cgroup and kernel behavior, this can contribute to memory pressure.

Confirm:

  - memory.stat shows a large file (page cache) component relative to anon.
  - Usage rises during heavy I/O and falls again when the kernel reclaims cache.

Fix:

  - Page cache is reclaimable, so it is rarely the sole killer, but it eats into headroom; leave room in the limit for I/O-heavy phases.
  - Avoid unnecessary re-reads, and keep scratch data out of tmpfs (tmpfs pages count against the limit and are not reclaimable like page cache).


9. Fix strategies

9.1 Raise the container memory limit (correctly)

Run a container with a 1GiB memory limit:

docker run --rm -m 1g --name myapp myimage:latest

If you also want to allow swap (more on that later):

docker run --rm -m 1g --memory-swap 2g myimage:latest

You can raise the limit on a running container with docker update (e.g. docker update -m 2g --memory-swap 2g myapp), but image or command changes still require recreating it. If you use Docker Compose, set:

services:
  myapp:
    image: myimage:latest
    deploy:
      resources:
        limits:
          memory: 1g

Verify the limit:

docker inspect myapp --format 'Memory={{.HostConfig.Memory}}'

Sizing advice:

  - Measure real peak usage (docker stats over a full load cycle, or memory.peak on recent cgroups v2 kernels), then add 20–30% headroom.
  - Budget for every process in the container, not just the main one.
  - Re-measure after dependency upgrades and traffic changes.
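One way to turn a measured peak into a limit, as a hedged sketch (suggest_limit_mib is hypothetical; the 30% headroom and 64 MiB rounding are assumptions, not Docker requirements):

```python
import math

def suggest_limit_mib(peak_mib: float, headroom: float = 0.30) -> int:
    """Suggest a container memory limit: observed peak plus headroom,
    rounded up to the next 64 MiB boundary for tidy -m values."""
    return math.ceil(peak_mib * (1 + headroom) / 64) * 64
```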

9.2 Add swap (carefully) and tune swappiness

Swap can prevent immediate OOM, but it can also cause severe latency. For some workloads (burst memory, background jobs), swap is a useful safety net.

Check if swap exists on the host:

swapon --show
free -h

Create a swapfile (example: 4GiB):

sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
swapon --show

Persist it (typical /etc/fstab entry):

echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

Docker swap behavior:

  - --memory-swap sets the TOTAL of memory plus swap, not the swap amount.
  - -m 1g --memory-swap 2g means 1GiB RAM plus 1GiB swap.
  - -m 1g --memory-swap 1g means no swap for the container.
  - --memory-swap -1 allows unlimited swap.
  - If -m is set and --memory-swap is not, the container may use swap up to twice the memory limit.

Example: 1GiB RAM + 1GiB swap for the container:

docker run --rm -m 1g --memory-swap 2g myimage:latest
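Because --memory-swap is a total rather than a swap amount, it is easy to mis-read. A small sketch of the arithmetic (swap_allowed_mib is a hypothetical helper):

```python
def swap_allowed_mib(memory_mib: int, memory_swap_mib: int):
    """Swap available to a container given -m and --memory-swap (both in MiB).
    --memory-swap is memory + swap, so swap is the difference; -1 means
    unlimited swap (returned as None)."""
    if memory_swap_mib == -1:
        return None
    return max(0, memory_swap_mib - memory_mib)
```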

9.3 Set sane language/runtime memory caps

Java (JVM)

Modern JVMs are container-aware, but you should still set explicit limits to avoid surprises (and to account for non-heap memory).

Common flags:

  - -Xms / -Xmx: initial and maximum heap
  - -XX:MaxRAMPercentage: size the heap as a percentage of the container limit
  - -XX:MaxMetaspaceSize, -XX:MaxDirectMemorySize, -Xss: caps for the main non-heap consumers

Remember: heap is not total memory. You also need headroom for:

  - Metaspace and class data
  - Thread stacks
  - JIT code cache
  - Direct/NIO buffers and GC overhead

A practical approach in containers: skip -Xmx and use -XX:MaxRAMPercentage=75.0, so the JVM derives the heap from the cgroup limit and the remaining ~25% is left for non-heap memory.

Node.js

Node’s V8 heap limit can be too high or too low depending on container size. Set it:

node --max-old-space-size=512 server.js

--max-old-space-size is in MB. If your container has 1GiB, you might choose 512–768MB depending on native usage.

Python

Python doesn’t have a simple “cap heap” flag. You can:

  - Recycle workers periodically (gunicorn --max-requests, celery --max-tasks-per-child)
  - Set a hard address-space cap as a guard rail via resource.setrlimit(resource.RLIMIT_AS, ...)
  - Profile growth with tracemalloc or a tool like memray

Example gunicorn pattern:

gunicorn app:app --workers 4 --max-requests 1000 --max-requests-jitter 100

This mitigates fragmentation/leaks by periodically restarting workers.

Go

Go’s GC can be tuned with GOGC (lower = more aggressive GC, lower memory, more CPU):

export GOGC=75
./my-go-service

Go 1.19+ also supports GOMEMLIMIT to cap memory target:

export GOMEMLIMIT=800MiB
./my-go-service

9.4 Reduce concurrency and batch sizes

If OOM correlates with traffic spikes:

  - Lower worker/thread counts so fewer requests are in flight at once
  - Bound queues and buffers instead of letting them grow with backlog
  - Process data in smaller batches, streaming instead of loading whole datasets

Examples:

  - gunicorn --workers 2 instead of 8 in a small container
  - Paginated database reads instead of selecting an entire table
  - Smaller consumer batch sizes for queue workers

This is often the fastest fix when you cannot immediately add memory.

9.5 Prevent OOM with proactive monitoring and alerts

At minimum, monitor:

  - Container memory usage as a percentage of its limit
  - OOMKilled events and restart counts
  - Host-level free memory and swap usage
  - The trend over time (steady growth vs spikes)

Useful commands for ad-hoc checks:

docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.RunningFor}}'
docker inspect <container> --format 'RestartCount={{.RestartCount}} OOMKilled={{.State.OOMKilled}}'

On cgroups v2, you can watch OOM counters:

watch -n 2 "cat $CGROUP_DIR/memory.events; echo; cat $CGROUP_DIR/memory.current; cat $CGROUP_DIR/memory.max"

In production, export metrics to Prometheus/Grafana or your monitoring stack. Key is to alert before hitting the limit (e.g., at 80–90% sustained usage).
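The “sustained usage” idea can be expressed directly in whatever alerting layer you use. A hedged sketch in plain Python (should_alert is hypothetical; the 85% threshold and 3-sample window are assumptions):

```python
def should_alert(samples_pct: list, threshold: float = 85.0,
                 sustained: int = 3) -> bool:
    """Alert when memory usage (% of limit) stays above threshold for
    `sustained` consecutive samples, ignoring single spikes."""
    run = 0
    for pct in samples_pct:
        run = run + 1 if pct >= threshold else 0
        if run >= sustained:
            return True
    return False
```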

9.6 Use --oom-score-adj and --oom-kill-disable (with caution)

Docker supports:

  - --oom-score-adj <n>: adjusts the kernel's victim preference for host-wide OOM (-1000 to 1000; lower means less likely to be killed)
  - --oom-kill-disable: tells the kernel not to OOM-kill processes in this container

Examples:

docker run --rm --oom-score-adj -500 myimage
docker run --rm --oom-kill-disable myimage

Cautions:

  - --oom-kill-disable with a memory limit can leave processes blocked on allocation (hung) instead of killed.
  - --oom-kill-disable without a limit can destabilize the whole host.
  - --oom-score-adj influences host OOM victim selection; it does not change cgroup limit enforcement.


10. Reproduce and test OOM safely

To confirm your detection pipeline, you can intentionally OOM a test container.

Example: allocate memory until killed:

docker run -m 100m --name oom-test python:3.12-slim \
  python -c "
import time
a = []
while True:
    a.append('x' * 10_000_000)
    time.sleep(0.1)
"

(No --rm here, so the exited container can still be inspected; clean up afterwards with docker rm oom-test.)

Observe:

docker ps -a | grep oom-test
docker inspect oom-test --format 'ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}}'

Check kernel logs:

sudo dmesg -T | tail -n 50

This validates that:

  - docker inspect reports OOMKilled=true and exit code 137
  - the kernel log shows a memory-cgroup kill for the test container
  - your monitoring/alerting actually picks the event up


11. Practical examples

11.1 Example: Node.js container OOM

Scenario: A Node API container has a 512MiB limit and restarts under load.

  1. Confirm OOM:

docker inspect node-api --format 'ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}}'

  2. Observe memory trend:

docker stats node-api

  3. Fix by setting a V8 heap cap and leaving headroom for native memory. If the container limit is 512MiB, set old space to ~256–320MB:

docker run -d --name node-api -m 512m \
my-node-image node --max-old-space-size=320 server.js

  4. Re-check stability with load testing and docker stats.

If still OOM:

  - Lower the old-space cap further, or raise the container limit.
  - Check native memory users (buffers, worker threads, native addons).

11.2 Example: Java (JVM) container OOM

Scenario: A Java service in a 2GiB container OOMs even though -Xmx is 1GiB.

  1. Confirm cgroup OOM via kernel log:

sudo journalctl -k --since "2 hours ago" | grep -i -E 'memory cgroup out of memory|killed process'

  2. Check whether non-heap usage is large: enable Native Memory Tracking (-XX:NativeMemoryTracking=summary), run jcmd <pid> VM.native_memory summary, and compare the RSS from docker stats with the reported heap.

  3. Fix by budgeting memory explicitly. For a 2GiB container: 1200MiB heap + 256MiB direct buffers + 256MiB metaspace leaves roughly 300MiB for thread stacks, JIT code cache, and GC overhead.

Command:

java \
  -Xms1200m -Xmx1200m \
  -XX:MaxDirectMemorySize=256m \
  -XX:MaxMetaspaceSize=256m \
  -Xss512k \
  -jar app.jar
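To sanity-check a budget like this before deploying, you can do the arithmetic explicitly. A sketch (jvm_headroom_mib is hypothetical; the 200MiB figure for stacks, code cache, and GC overhead is an assumed placeholder, not a measurement):

```python
def jvm_headroom_mib(container_mib: int, heap_mib: int, direct_mib: int,
                     metaspace_mib: int, other_mib: int) -> int:
    """Headroom left once the explicit JVM budgets are summed.
    A negative result means the budget already exceeds the container limit."""
    return container_mib - (heap_mib + direct_mib + metaspace_mib + other_mib)
```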

Then validate RSS vs limit:

docker stats java-service

And check cgroup stats (v2):

PID=$(docker inspect java-service --format '{{.State.Pid}}')
CGROUP_PATH=$(cat /proc/$PID/cgroup | awk -F: '{print $3}')
CGROUP_DIR="/sys/fs/cgroup${CGROUP_PATH}"
cat "$CGROUP_DIR/memory.current"
cat "$CGROUP_DIR/memory.events"

11.3 Example: Python memory growth

Scenario: A gunicorn-based Python service slowly grows and OOMs after 2–3 days.

  1. Confirm trend:

docker stats python-api

  2. Mitigate with worker recycling:

gunicorn app:app --workers 4 --max-requests 2000 --max-requests-jitter 200

  3. If using libraries that cache heavily (e.g., image processing), add explicit cache limits or clear caches.

  4. If the service must be long-lived, profile memory with tracemalloc, objgraph, or memray to find what is actually growing.

Even with recycling, you should still investigate the root cause.


12. Summary

Diagnosing Docker OOM issues is mostly about distinguishing container-limit OOM from host OOM, then confirming the cause using:

  - docker inspect (OOMKilled, exit code 137, configured limits)
  - docker stats trends over time
  - Kernel logs (dmesg / journalctl -k)
  - cgroup files (memory.events, memory.current, memory.failcnt)
Fixes generally fall into these buckets:

  1. Right-size memory limits (and consider swap carefully)
  2. Configure runtimes (JVM/Node/Go) to respect container constraints
  3. Reduce concurrency / batch sizes to avoid spikes
  4. Find and fix leaks or unbounded caches
  5. Monitor and alert before you hit the cliff

If you share (with teammates or in a bug report):

  - the OOMKilled flag and exit code,
  - the configured limit and a usage-over-time trace,
  - and the relevant kernel log lines,

you can usually pinpoint whether the crash is due to heap sizing, native memory, page cache, or a true leak within a few iterations.