Out of Memory (OOM) in Docker: Diagnose and Fix Container Memory Crashes
Running containers “in production” often means running them close to resource limits. When memory pressure hits, Linux will protect the host by killing processes. In Docker, that typically shows up as containers exiting unexpectedly, restarts looping, or logs that end abruptly. This tutorial explains what OOM is, how Docker and the Linux kernel enforce memory, and how to diagnose and fix memory crashes with real, copy‑pasteable commands.
Table of Contents
- 1. What “OOM” means in Docker
- 2. Linux memory, cgroups, and why containers get killed
- 3. Recognizing OOM symptoms
- 4. Quick triage checklist
- 5. Diagnose OOM with Docker commands
- 6. Diagnose OOM from the host (kernel logs)
- 7. Inspect cgroup memory settings and events
- 8. Common root causes (and how to confirm them)
- 9. Fix strategies
- 10. Reproduce and test OOM safely
- 11. Practical examples
- 12. Summary
1. What “OOM” means in Docker
OOM stands for Out Of Memory. In Linux, when available memory becomes too low, the kernel may kill one or more processes to recover memory. In a containerized environment, memory is controlled by cgroups (control groups). If a container exceeds its configured memory limit, the kernel can kill processes inside that container, which often results in the container exiting.
Typical outcomes:
- Container exits with code 137 (killed by SIGKILL) or sometimes 143 (killed by SIGTERM, e.g. when the runtime stops the container before the kernel steps in).
- docker inspect shows "OOMKilled": true.
- Logs stop abruptly with no stack trace (because SIGKILL cannot be caught).
- Container restarts repeatedly if a restart policy is set.
2. Linux memory, cgroups, and why containers get killed
Docker does not “manage memory” by itself; it asks the Linux kernel to enforce memory limits through cgroups.
2.1 cgroups v1 vs v2
- cgroups v1: separate controllers with their own hierarchies (e.g., memory, cpu).
- cgroups v2: unified hierarchy; modern distros increasingly default to v2.
You can check which is active:
stat -fc %T /sys/fs/cgroup
- cgroup2fs means v2.
- tmpfs often indicates v1 mounted controllers (depends on distro).
Or:
mount | grep cgroup
2.2 What counts as “memory”
Memory accounting includes:
- Anonymous memory: heap allocations, stacks, malloc/new, etc.
- File-backed memory: mapped files, shared libraries.
- Page cache (depending on cgroup version/settings): cached file data.
- Kernel memory (historically separate in v1; mostly unified in v2).
This is why an app might claim “I only use 200MB” while the container shows 800MB: the container includes more than your app’s own heap.
2.3 OOM killer vs cgroup OOM
There are two major scenarios:
- Host OOM (system-wide): the entire host is out of memory. The kernel chooses a victim process across the system (could be dockerd, container processes, databases, etc.).
- cgroup OOM (container limit exceeded): the container hits its cgroup memory limit. The kernel kills one or more processes in that cgroup.
In Docker troubleshooting, you want to know which one happened because the fixes differ:
- If it’s cgroup OOM, adjust container sizing/runtime behavior.
- If it’s host OOM, you must also consider other workloads, host memory, swap, and overall capacity.
3. Recognizing OOM symptoms
Common signs:
- Container exits with 137:
- 137 = 128 + 9, meaning “terminated by signal 9 (SIGKILL)”.
- docker ps -a shows Exited (137) or repeated restarts.
- Application logs end mid-line, no graceful shutdown.
- dmesg contains lines like:
  - Memory cgroup out of memory: Kill process 12345 (node) score 987 or sacrifice child
  - Killed process 12345 (java) total-vm:... anon-rss:...
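The 128 + signal rule from above is easy to wrap in a small helper for triage scripts. A sketch; decode_exit is a made-up name, not a Docker command:

```shell
# Map a container exit code to the signal that caused it, if any
# (codes above 128 mean "terminated by signal code-128").
decode_exit() {
  local code=$1
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "plain exit with code $code"
  fi
}

decode_exit 137   # killed by signal 9 (SIGKILL)
decode_exit 143   # killed by signal 15 (SIGTERM)
```

Feed it the ExitCode from docker inspect to quickly tell signal deaths from normal exits.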
4. Quick triage checklist
- Is the container OOMKilled?
- Did the host run out of memory (global OOM) or just the container?
- What was the container’s memory limit?
- What was the peak memory usage before crash?
- Is memory usage growing over time (leak) or spiking (batch/startup)?
- Is the runtime configured to respect container limits (JVM, Node)?
- Are there multiple processes in the container (sidecars, workers) sharing the same limit?
5. Diagnose OOM with Docker commands
5.1 Check container exit codes and OOMKilled flag
List stopped containers and exit codes:
docker ps -a --no-trunc
Inspect a specific container:
docker inspect <container_id_or_name> --format \
'Name={{.Name}} ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}} Error={{.State.Error}} FinishedAt={{.State.FinishedAt}}'
If OOMKilled=true, it’s a strong indicator of cgroup OOM.
Also inspect the configured memory limit:
docker inspect <container> --format \
'Memory={{.HostConfig.Memory}} MemorySwap={{.HostConfig.MemorySwap}} OomKillDisable={{.HostConfig.OomKillDisable}}'
Notes:
- Memory is in bytes; 0 means "no explicit limit".
- MemorySwap controls swap availability (details later).
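Because docker inspect reports these values in bytes, a one-line conversion helper makes them easier to read. A sketch; to_mib is a hypothetical name:

```shell
# Convert a byte count from docker inspect into MiB.
to_mib() { echo "$(( $1 / 1024 / 1024 )) MiB"; }

to_mib 1073741824   # a 1 GiB limit prints as: 1024 MiB
```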
5.2 Inspect restart loops and health checks
Restart loops can hide the initial failure.
docker inspect <container> --format \
'RestartCount={{.RestartCount}} Status={{.State.Status}} StartedAt={{.State.StartedAt}}'
Check events around the crash:
docker events --since 30m --until 0m | grep -E '<container_name>|<container_id>'
5.3 Live memory usage: docker stats
docker stats
This streams current usage and limit, e.g. 512MiB / 1GiB. For a single snapshot:
docker stats --no-stream
Or sample repeatedly:
while true; do
date
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"
sleep 5
done
If memory steadily climbs until it hits the limit, suspect a leak or unbounded cache.
5.4 Container logs and “last words”
Get the last log lines before death:
docker logs --tail 200 <container>
If the process was SIGKILLed, you may see nothing helpful. That’s normal—SIGKILL does not allow cleanup.
6. Diagnose OOM from the host (kernel logs)
Docker’s OOMKilled flag is useful, but the most authoritative source is the kernel log.
6.1 dmesg and journald
On many systems:
sudo dmesg -T | grep -i -E 'oom|killed process|out of memory'
On systems using systemd journal:
sudo journalctl -k --since "1 hour ago" | grep -i -E 'oom|killed process|out of memory'
You’re looking for:
- “Memory cgroup out of memory”
- “Killed process …”
- The process name (java/node/python/nginx)
- Memory stats (anon-rss, file-rss, shmem-rss)
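Once you have such a line, the interesting fields can be pulled out with sed. The log line below is a hypothetical example; the exact format varies by kernel version:

```shell
# Extract PID, process name, and anon-rss from a kernel OOM kill line.
line='Killed process 12345 (java) total-vm:4194304kB, anon-rss:1048576kB, file-rss:2048kB, shmem-rss:0kB'

pid=$(echo "$line"  | sed -n 's/.*Killed process \([0-9]*\).*/\1/p')
name=$(echo "$line" | sed -n 's/.*(\([^)]*\)).*/\1/p')
anon=$(echo "$line" | sed -n 's/.*anon-rss:\([0-9]*\)kB.*/\1/p')

echo "pid=$pid name=$name anon_rss_kb=$anon"
```

anon-rss is in kB here, so 1048576 kB is roughly 1 GiB of anonymous memory at the moment of the kill.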
6.2 Identify which container/process was killed
Kernel logs often show a PID. You can map it back to a container.
If the container is still running (or quickly restarted), you can check its main PID:
docker inspect <container> --format 'PID={{.State.Pid}}'
To map an arbitrary PID to a container, inspect cgroup membership:
PID=12345
cat /proc/$PID/cgroup
Look for a path containing docker or kubepods (if Kubernetes). For Docker, you might see a container ID embedded.
If you have the container ID, you can correlate:
docker ps --no-trunc | grep <container_id_prefix>
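On cgroups v2 hosts the container ID is typically embedded in the cgroup path as a 64-character hex string, so you can grep it out directly. The path below is a made-up sample of the common systemd/docker layout:

```shell
# Pull a Docker container ID (64 hex chars) out of a /proc/<pid>/cgroup line.
cgline='0::/system.slice/docker-3f4e8a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f.scope'

cid=$(echo "$cgline" | grep -oE '[0-9a-f]{64}')
echo "$cid"
# then correlate: docker ps --no-trunc | grep "$cid"
```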
7. Inspect cgroup memory settings and events
Sometimes Docker’s view is not enough; reading cgroup files shows the real limits and OOM counters.
7.1 Find the container’s cgroup path
Get the container’s init PID:
PID=$(docker inspect <container> --format '{{.State.Pid}}')
echo "$PID"
Then:
cat /proc/$PID/cgroup
- On cgroups v2, you’ll see a single line like 0::/docker/<id>
- On v1, you’ll see multiple controllers, including memory:/docker/<id>
7.2 Read memory limits and current usage (cgroups v2)
If your system uses cgroups v2, find the cgroup directory:
CGROUP_PATH=$(cat /proc/$PID/cgroup | awk -F: '{print $3}')
CGROUP_DIR="/sys/fs/cgroup${CGROUP_PATH}"
echo "$CGROUP_DIR"
Now read key files:
cat "$CGROUP_DIR/memory.max"
cat "$CGROUP_DIR/memory.current"
cat "$CGROUP_DIR/memory.events"
cat "$CGROUP_DIR/memory.stat" | head -n 50
Interpretation:
- memory.max: the limit (max means unlimited)
- memory.current: current usage in bytes
- memory.events: counters like oom, oom_kill, and high
- memory.stat: breakdown (anon, file, slab, etc.)
If memory.events shows increasing oom_kill, you have confirmed cgroup-level OOM kills.
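With memory.current and memory.max in hand, you can compute how close the container is running to its limit. The byte values below are sample numbers standing in for the two files' contents:

```shell
# How close is usage to the limit? ("max" in memory.max means no limit.)
current=424673280   # e.g. current=$(cat "$CGROUP_DIR/memory.current")
max=536870912       # e.g. max=$(cat "$CGROUP_DIR/memory.max")

if [ "$max" = "max" ]; then
  echo "no limit set"
else
  pct=$(( current * 100 / max ))
  echo "usage: ${pct}% of limit"
fi
```

Sustained values in the 90s mean the container is one allocation burst away from an oom_kill event.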
7.3 Read memory limits and current usage (cgroups v1)
If using v1, the memory controller is typically:
MEM_CGROUP_PATH=$(cat /proc/$PID/cgroup | awk -F: '$2 ~ /memory/ {print $3}')
MEM_DIR="/sys/fs/cgroup/memory${MEM_CGROUP_PATH}"
echo "$MEM_DIR"
Read:
cat "$MEM_DIR/memory.limit_in_bytes"
cat "$MEM_DIR/memory.usage_in_bytes"
cat "$MEM_DIR/memory.max_usage_in_bytes"
cat "$MEM_DIR/memory.failcnt"
cat "$MEM_DIR/memory.stat" | head -n 50
- memory.failcnt increments when allocations fail due to the limit.
- memory.max_usage_in_bytes shows historical peak usage.
8. Common root causes (and how to confirm them)
8.1 Memory leak
Pattern: memory usage grows steadily over time and never returns.
How to confirm:
- docker stats shows a consistent upward trend.
- Application-level metrics show heap growth (if instrumented).
- Heap dumps/profiling reveal retained objects.
What to do:
- Fix the leak in code or dependencies.
- Add periodic restarts only as a temporary mitigation (not a real fix).
8.2 Unbounded caches (JVM, Node, Python, Go)
Many runtimes use memory aggressively for performance (caches, JIT, arenas). In containers, this can exceed limits if not configured.
Confirm:
- Memory usage grows until plateauing near the limit.
- No obvious “leak” in heap, but RSS keeps increasing.
Examples:
- JVM: heap not capped relative to container; metaspace/direct buffers too large.
- Node.js: default --max-old-space-size may not match container size.
- Go: GC target may allow higher RSS; memory arenas may not return to OS quickly.
- Python: allocator fragmentation; memory not returned to OS.
8.3 Too low memory limit / wrong sizing
Pattern: OOM happens under normal load, often after a deployment or traffic increase.
Confirm:
- Container memory limit is small (e.g., 256MiB) compared to typical usage.
- OOM occurs even without leaks.
Fix:
- Increase limit and/or reduce workload per container instance.
- Scale horizontally (more replicas) rather than only vertical scaling.
8.4 Spikes during startup, compilation, or batch jobs
Pattern: container dies during startup or periodic tasks (cron-like jobs, report generation).
Confirm:
- OOM time aligns with a known job.
- Memory usage spikes quickly rather than slowly increasing.
Fix:
- Reduce concurrency, batch sizes.
- Stream data instead of buffering.
- Move heavy jobs to separate worker containers with different limits.
8.5 Native memory (not visible in app-level metrics)
Your app may report low heap usage but still OOM due to:
- Native libraries (image processing, ML, crypto)
- Thread stacks (many threads)
- Direct buffers (Java NIO)
- Memory-mapped files
Confirm:
- JVM heap looks fine, but RSS is high.
- Kernel OOM log shows high anon-rss or file-rss.
Fix:
- Cap native allocations (where possible).
- Reduce threads.
- Use runtime flags (see later sections).
8.6 Page cache and file I/O pressure
Heavy file reads/writes can increase page cache. Depending on cgroup and kernel behavior, this can contribute to memory pressure.
Confirm:
- memory.stat shows high file usage (cgroups v2) or cache (v1).
- Workload involves large file scans, backups, or data processing.
Fix:
- Stream files, avoid reading huge files into memory.
- Consider tuning application I/O patterns.
- Ensure limits are sized with cache behavior in mind.
9. Fix strategies
9.1 Raise the container memory limit (correctly)
Run a container with a 1GiB memory limit:
docker run --rm -m 1g --name myapp myimage:latest
If you also want to allow swap (more on that later):
docker run --rm -m 1g --memory-swap 2g myimage:latest
You can raise the limit on a running container with docker update -m 2g <container> (you may need to adjust --memory-swap as well), but for reproducibility most teams recreate the container with the new limit. If you use Docker Compose, set:
- mem_limit (standalone Compose; Compose v2's deploy.resources is honored mainly by Swarm, with some differences in standalone Compose)
- In practice, many teams manage this via orchestration (Kubernetes) or redeploy.
Verify the limit:
docker inspect myapp --format 'Memory={{.HostConfig.Memory}}'
Sizing advice:
- Start with observed peak usage + headroom (often 20–50%).
- Consider worst-case concurrency and request bursts.
- If you have multiple processes in the container, sum their needs.
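The sizing rule above reduces to simple arithmetic. A sketch, assuming an observed peak of 612 MiB and 30% headroom (both numbers are made up for illustration):

```shell
# Suggested limit = observed peak + headroom, rounded up to a 64 MiB boundary.
peak_mib=612
headroom_pct=30

raw=$(( peak_mib * (100 + headroom_pct) / 100 ))
limit_mib=$(( (raw + 63) / 64 * 64 ))

echo "suggested limit: ${limit_mib} MiB"
# then: docker run -m "${limit_mib}m" myimage:latest
```

Treat the result as a starting point; burst concurrency and multi-process containers need a larger margin.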
9.2 Add swap (carefully) and tune swappiness
Swap can prevent immediate OOM, but it can also cause severe latency. For some workloads (burst memory, background jobs), swap is a useful safety net.
Check if swap exists on the host:
swapon --show
free -h
Create a swapfile (example: 4GiB):
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
swapon --show
Persist it (typical /etc/fstab entry):
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
Docker swap behavior:
- If you set --memory-swap equal to --memory, swap is effectively disabled for that container.
- If you set --memory-swap to a larger value, the container can use swap up to that limit.
- If you set --memory-swap to -1, it can use unlimited swap (generally not recommended).
Example: 1GiB RAM + 1GiB swap for the container:
docker run --rm -m 1g --memory-swap 2g myimage:latest
9.3 Set sane language/runtime memory caps
Java (JVM)
Modern JVMs are container-aware, but you still should set explicit limits to avoid surprises (and account for non-heap).
Common flags:
- Cap heap by percentage of container memory:
  java -XX:MaxRAMPercentage=70 -XX:InitialRAMPercentage=50 -jar app.jar
- Or cap heap explicitly:
  java -Xms512m -Xmx512m -jar app.jar
Remember: heap is not total memory. You also need headroom for:
- metaspace (-XX:MaxMetaspaceSize=...)
- direct buffers (-XX:MaxDirectMemorySize=...)
- thread stacks (-Xss...)
- JIT/code cache
A practical approach in containers:
- Set -Xmx to ~50–75% of the container limit depending on workload.
- Monitor RSS, not only heap.
Node.js
Node’s V8 heap limit can be too high or too low depending on container size. Set it:
node --max-old-space-size=512 server.js
--max-old-space-size is in MB. If your container has 1GiB, you might choose 512–768MB depending on native usage.
Python
Python doesn’t have a simple “cap heap” flag. You can:
- Fix leaks and reduce caching.
- Use worker recycling (e.g., gunicorn --max-requests).
- Control concurrency.
Example gunicorn pattern:
gunicorn app:app --workers 4 --max-requests 1000 --max-requests-jitter 100
This mitigates fragmentation/leaks by periodically restarting workers.
Go
Go’s GC can be tuned with GOGC (lower = more aggressive GC, lower memory, more CPU):
export GOGC=75
./my-go-service
Go 1.19+ also supports GOMEMLIMIT, a soft memory limit the runtime's GC tries to stay under:
export GOMEMLIMIT=800MiB
./my-go-service
9.4 Reduce concurrency and batch sizes
If OOM correlates with traffic spikes:
- Reduce worker counts
- Reduce in-flight requests
- Add backpressure
- Limit queue sizes
Examples:
- Nginx: reduce worker_connections, tune buffering.
- App servers: limit thread pools.
- Background jobs: limit parallelism.
This is often the fastest fix when you cannot immediately add memory.
9.5 Prevent OOM with proactive monitoring and alerts
At minimum, monitor:
- Container memory usage vs limit
- OOM kill events
- Restart counts
Useful commands for ad-hoc checks:
docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.RunningFor}}'
docker inspect <container> --format 'RestartCount={{.RestartCount}} OOMKilled={{.State.OOMKilled}}'
On cgroups v2, you can watch OOM counters:
watch -n 2 "cat $CGROUP_DIR/memory.events; echo; cat $CGROUP_DIR/memory.current; cat $CGROUP_DIR/memory.max"
In production, export metrics to Prometheus/Grafana or your monitoring stack. Key is to alert before hitting the limit (e.g., at 80–90% sustained usage).
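The "alert before the limit" idea reduces to a single comparison. A minimal sketch (check_threshold is a made-up helper; in practice feed it values from docker stats or the cgroup files shown earlier):

```shell
# Print ALERT when usage crosses the given percentage of the limit.
check_threshold() {
  local usage=$1 limit=$2 threshold_pct=$3
  if [ $(( usage * 100 / limit )) -ge "$threshold_pct" ]; then
    echo "ALERT"
  else
    echo "OK"
  fi
}

check_threshold 950000000 1073741824 85   # ~88% of 1 GiB -> ALERT
check_threshold 500000000 1073741824 85   # ~46% of 1 GiB -> OK
```

Run something like this from cron or a sidecar as a stopgap until proper metrics export is in place.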
9.6 Use --oom-score-adj and --oom-kill-disable (with caution)
Docker supports:
- --oom-score-adj: influence which processes are killed in a host OOM scenario.
- --oom-kill-disable: attempt to disable OOM killing for the container.
Examples:
docker run --rm --oom-score-adj=-500 myimage
docker run --rm --oom-kill-disable myimage
Cautions:
- Disabling OOM kill can make the system unstable; the kernel may still kill something else or the container may hang.
- These are advanced levers; prefer correct sizing and app-level fixes.
10. Reproduce and test OOM safely
To confirm your detection pipeline, you can intentionally OOM a test container.
Example: allocate memory until killed:
docker run --rm -m 100m --name oom-test python:3.12-slim \
  python -c "
import time
a = []
while True:
    a.append('x' * 10_000_000)
    time.sleep(0.1)
"
Observe:
docker ps -a | grep oom-test
docker inspect oom-test --format 'ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}}'
Check kernel logs:
sudo dmesg -T | tail -n 50
This validates that:
- Your host logs capture OOM messages
- Docker reports OOMKilled=true
- Exit code is typically 137
11. Practical examples
11.1 Example: Node.js container OOM
Scenario: A Node API container has a 512MiB limit and restarts under load.
- Confirm OOM:
docker inspect node-api --format 'ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}}'
- Observe memory trend:
docker stats node-api
- Fix by setting V8 heap cap and leaving headroom for native memory:
If container limit is 512MiB, set old space to ~256–320MB:
docker run -d --name node-api -m 512m \
my-node-image node --max-old-space-size=320 server.js
- Re-check stability with load testing and docker stats.
If still OOM:
- Reduce concurrency
- Investigate leaks (heap snapshots)
- Increase container memory
11.2 Example: Java (JVM) container OOM
Scenario: A Java service in a 2GiB container OOMs even though -Xmx is 1GiB.
- Confirm cgroup OOM via kernel log:
sudo journalctl -k --since "2 hours ago" | grep -i -E 'memory cgroup out of memory|killed process'
- Check if non-heap is large:
- Many threads? Each thread stack may be 1MB+.
- Direct buffers? Netty? NIO?
- Fix by budgeting memory explicitly:
Example for a 2GiB container:
- Heap: 1200MiB
- Direct: 256MiB
- Metaspace: 256MiB
- Leave remainder for stacks, code cache, libc, etc.
Command:
java \
-Xms1200m -Xmx1200m \
-XX:MaxDirectMemorySize=256m \
-XX:MaxMetaspaceSize=256m \
-Xss512k \
-jar app.jar
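The budget above can be sanity-checked with quick arithmetic before deploying (the numbers match the example; stacks, code cache, and libc must fit in the remainder):

```shell
# Explicit JVM caps must sum to well under the container limit.
limit_mib=2048
heap_mib=1200; direct_mib=256; metaspace_mib=256

explicit=$(( heap_mib + direct_mib + metaspace_mib ))
remainder=$(( limit_mib - explicit ))

echo "explicit caps: ${explicit} MiB, remainder: ${remainder} MiB"
```

If the remainder is only a few tens of MiB, the container will still OOM even though every explicit cap is respected.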
Then validate RSS vs limit:
docker stats java-service
And check cgroup stats (v2):
PID=$(docker inspect java-service --format '{{.State.Pid}}')
CGROUP_PATH=$(cat /proc/$PID/cgroup | awk -F: '{print $3}')
CGROUP_DIR="/sys/fs/cgroup${CGROUP_PATH}"
cat "$CGROUP_DIR/memory.current"
cat "$CGROUP_DIR/memory.events"
11.3 Example: Python memory growth
Scenario: A gunicorn-based Python service slowly grows and OOMs after 2–3 days.
- Confirm trend:
docker stats python-api
- Mitigate with worker recycling:
gunicorn app:app --workers 4 --max-requests 2000 --max-requests-jitter 200
- If using libraries that cache heavily (e.g., image processing), add explicit cache limits or clear caches.
- If the service must be long-lived, profile memory:
  - tracemalloc
  - objgraph
  - memray (for deeper analysis)
Even with recycling, you should still investigate the root cause.
12. Summary
Diagnosing Docker OOM issues is mostly about distinguishing container-limit OOM from host OOM, then confirming the cause using:
- docker inspect (OOMKilled, exit code, memory settings)
- docker stats (trend and peaks)
- Kernel logs (dmesg, journalctl -k)
- cgroup files (memory.current, memory.max, memory.events, memory.stat)
Fixes generally fall into these buckets:
- Right-size memory limits (and consider swap carefully)
- Configure runtimes (JVM/Node/Go) to respect container constraints
- Reduce concurrency / batch sizes to avoid spikes
- Find and fix leaks or unbounded caches
- Monitor and alert before you hit the cliff
If you collect:
- your container memory limit,
- docker inspect ... output for HostConfig.Memory*,
- and the relevant dmesg / journalctl -k OOM lines,
you can usually pinpoint whether the crash is due to heap sizing, native memory, page cache, or a true leak within a few iterations.