Out of Memory (OOM) in Docker: Diagnose and Fix Container Memory Crashes
Running containers “in production” often means running them close to resource limits. When memory pressure hits, Linux will protect the host by killing processes. In Docker, that typically shows up as containers exiting unexpectedly, restarts looping, or logs that end abruptly. This tutorial explains what OOM is, how Docker and the Linux kernel enforce memory, and how to diagnose and fix memory crashes with real, copy‑pasteable commands.
Table of Contents
- 1. What “OOM” means in Docker
- 2. Linux memory, cgroups, and why containers get killed
- 3. Recognizing OOM symptoms
- 4. Quick triage checklist
- 5. Diagnose OOM with Docker commands
- 6. Diagnose OOM from the host (kernel logs)
- 7. Inspect cgroup memory settings and events
- 8. Common root causes (and how to confirm them)
- 9. Fix strategies
- 10. Reproduce and test OOM safely
- 11. Practical examples
- 12. Summary
1. What “OOM” means in Docker
OOM stands for Out Of Memory. In Linux, when available memory becomes too low, the kernel may kill one or more processes to recover memory. In a containerized environment, memory is controlled by cgroups (control groups). If a container exceeds its configured memory limit, the kernel can kill processes inside that container, which often results in the container exiting.
Typical outcomes:
- Container exits with code 137 (killed by SIGKILL) or sometimes 143 (killed by SIGTERM, e.g. when the runtime stops the container before the kernel steps in).
- docker inspect shows "OOMKilled": true.
- Logs stop abruptly with no stack trace (because SIGKILL cannot be caught).
- Container restarts repeatedly if a restart policy is set.
2. Linux memory, cgroups, and why containers get killed
Docker does not “manage memory” by itself; it asks the Linux kernel to enforce memory limits through cgroups.
2.1 cgroups v1 vs v2
- cgroups v1: separate controllers with their own hierarchies (e.g., memory, cpu).
- cgroups v2: unified hierarchy; modern distros increasingly default to v2.
You can check which is active:
stat -fc %T /sys/fs/cgroup
- cgroup2fs means v2.
- tmpfs often indicates v1 mounted controllers (depends on distro).
Or:
mount | grep cgroup
2.2 What counts as “memory”
Memory accounting includes:
- Anonymous memory: heap allocations, stacks, malloc/new, etc.
- File-backed memory: mapped files, shared libraries.
- Page cache (depending on cgroup version/settings): cached file data.
- Kernel memory (historically separate in v1; mostly unified in v2).
This is why an app might claim “I only use 200MB” while the container shows 800MB: the container includes more than your app’s own heap.
2.3 OOM killer vs cgroup OOM
There are two major scenarios:
- Host OOM (system-wide): the entire host is out of memory. The kernel chooses a victim process across the system (could be dockerd, container processes, databases, etc.).
- cgroup OOM (container limit exceeded): the container hits its cgroup memory limit. The kernel kills one or more processes in that cgroup.
In Docker troubleshooting, you want to know which one happened because the fixes differ:
- If it’s cgroup OOM, adjust container sizing/runtime behavior.
- If it’s host OOM, you must also consider other workloads, host memory, swap, and overall capacity.
3. Recognizing OOM symptoms
Common signs:
- Container exits with 137:
- 137 = 128 + 9, meaning “terminated by signal 9 (SIGKILL)”.
- docker ps -a shows Exited (137) or repeated restarts.
- Application logs end mid-line, no graceful shutdown.
- dmesg contains lines like:
  - Memory cgroup out of memory: Kill process 12345 (node) score 987 or sacrifice child
  - Killed process 12345 (java) total-vm:... anon-rss:...
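The 128 + signal rule from above is easy to wrap in a small helper for triage scripts. A sketch; decode_exit is a made-up name, not a Docker command:

```shell
# Map a container exit code to the signal that caused it, if any
# (codes above 128 mean "terminated by signal code-128").
decode_exit() {
  local code=$1
  if [ "$code" -gt 128 ]; then
    echo "killed by signal $((code - 128))"
  else
    echo "plain exit with code $code"
  fi
}

decode_exit 137   # killed by signal 9 (SIGKILL)
decode_exit 143   # killed by signal 15 (SIGTERM)
```

Feed it the ExitCode from docker inspect to quickly tell signal deaths from normal exits.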
4. Quick triage checklist
- Is the container OOMKilled?
- Did the host run out of memory (global OOM) or just the container?
- What was the container’s memory limit?
- What was the peak memory usage before crash?
- Is memory usage growing over time (leak) or spiking (batch/startup)?
- Is the runtime configured to respect container limits (JVM, Node)?
- Are there multiple processes in the container (sidecars, workers) sharing the same limit?
5. Diagnose OOM with Docker commands
5.1 Check container exit codes and OOMKilled flag
List stopped containers and exit codes:
docker ps -a --no-trunc
Inspect a specific container:
docker inspect <container_id_or_name> --format \
'Name={{.Name}} ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}} Error={{.State.Error}} FinishedAt={{.State.FinishedAt}}'
If OOMKilled=true, it’s a strong indicator of cgroup OOM.
Also inspect the configured memory limit:
docker inspect <container> --format \
'Memory={{.HostConfig.Memory}} MemorySwap={{.HostConfig.MemorySwap}} OomKillDisable={{.HostConfig.OomKillDisable}}'
Notes:
- Memory is in bytes; 0 means "no explicit limit".
- MemorySwap controls swap availability (details later).
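Because docker inspect reports these values in bytes, a one-line conversion helper makes them easier to read. A sketch; to_mib is a hypothetical name:

```shell
# Convert a byte count from docker inspect into MiB.
to_mib() { echo "$(( $1 / 1024 / 1024 )) MiB"; }

to_mib 1073741824   # a 1 GiB limit prints as: 1024 MiB
```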
5.2 Inspect restart loops and health checks
Restart loops can hide the initial failure.
docker inspect <container> --format \
'RestartCount={{.RestartCount}} Status={{.State.Status}} StartedAt={{.State.StartedAt}}'
Check events around the crash:
docker events --since 30m --until 0m | grep -E '<container_name>|<container_id>'
5.3 Live memory usage: docker stats
docker stats
This streams current usage and limit, e.g. 512MiB / 1GiB. For a single snapshot:
docker stats --no-stream
Or sample repeatedly:
while true; do
date
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"
sleep 5
done
If memory steadily climbs until it hits the limit, suspect a leak or unbounded cache.
5.4 Container logs and “last words”
Get the last log lines before death:
docker logs --tail 200 <container>
If the process was SIGKILLed, you may see nothing helpful. That’s normal—SIGKILL does not allow cleanup.
6. Diagnose OOM from the host (kernel logs)
Docker’s OOMKilled flag is useful, but the most authoritative source is the kernel log.
6.1 dmesg and journald
On many systems:
sudo dmesg -T | grep -i -E 'oom|killed process|out of memory'
On systems using systemd journal:
sudo journalctl -k --since "1 hour ago" | grep -i -E 'oom|killed process|out of memory'
You’re looking for:
- “Memory cgroup out of memory”
- “Killed process …”
- The process name (java/node/python/nginx)
- Memory stats (anon-rss, file-rss, shmem-rss)
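Once you have such a line, the interesting fields can be pulled out with sed. The log line below is a hypothetical example; the exact format varies by kernel version:

```shell
# Extract PID, process name, and anon-rss from a kernel OOM kill line.
line='Killed process 12345 (java) total-vm:4194304kB, anon-rss:1048576kB, file-rss:2048kB, shmem-rss:0kB'

pid=$(echo "$line"  | sed -n 's/.*Killed process \([0-9]*\).*/\1/p')
name=$(echo "$line" | sed -n 's/.*(\([^)]*\)).*/\1/p')
anon=$(echo "$line" | sed -n 's/.*anon-rss:\([0-9]*\)kB.*/\1/p')

echo "pid=$pid name=$name anon_rss_kb=$anon"
```

anon-rss is in kB here, so 1048576 kB is roughly 1 GiB of anonymous memory at the moment of the kill.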
6.2 Identify which container/process was killed
Kernel logs often show a PID. You can map it back to a container.
If the container is still running (or quickly restarted), you can check its main PID:
docker inspect <container> --format 'PID={{.State.Pid}}'
To map an arbitrary PID to a container, inspect cgroup membership:
PID=12345
cat /proc/$PID/cgroup
Look for a path containing docker or kubepods (if Kubernetes). For Docker, you might see a container ID embedded.
If you have the container ID, you can correlate:
docker ps --no-trunc | grep <container_id_prefix>
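On cgroups v2 hosts the container ID is typically embedded in the cgroup path as a 64-character hex string, so you can grep it out directly. The path below is a made-up sample of the common systemd/docker layout:

```shell
# Pull a Docker container ID (64 hex chars) out of a /proc/<pid>/cgroup line.
cgline='0::/system.slice/docker-3f4e8a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f.scope'

cid=$(echo "$cgline" | grep -oE '[0-9a-f]{64}')
echo "$cid"
# then correlate: docker ps --no-trunc | grep "$cid"
```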
7. Inspect cgroup memory settings and events
Sometimes Docker’s view is not enough; reading cgroup files shows the real limits and OOM counters.
7.1 Find the container’s cgroup path
Get the container’s init PID:
PID=$(docker inspect <container> --format '{{.State.Pid}}')
echo "$PID"
Then:
cat /proc/$PID/cgroup
- On cgroups v2, you’ll see a single line like 0::/docker/<id>
- On v1, you’ll see multiple controllers, including memory:/docker/<id>
7.2 Read memory limits and current usage (cgroups v2)
If your system uses cgroups v2, find the cgroup directory:
CGROUP_PATH=$(cat /proc/$PID/cgroup | awk -F: '{print $3}')
CGROUP_DIR="/sys/fs/cgroup${CGROUP_PATH}"
echo "$CGROUP_DIR"
Now read key files:
cat "$CGROUP_DIR/memory.max"
cat "$CGROUP_DIR/memory.current"
cat "$CGROUP_DIR/memory.events"
cat "$CGROUP_DIR/memory.stat" | head -n 50
Interpretation:
- memory.max: the limit (max means unlimited)
- memory.current: current usage in bytes
- memory.events: counters like oom, oom_kill, and high
- memory.stat: breakdown (anon, file, slab, etc.)
If memory.events shows increasing oom_kill, you have confirmed cgroup-level OOM kills.
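With memory.current and memory.max in hand, you can compute how close the container is running to its limit. The byte values below are sample numbers standing in for the two files' contents:

```shell
# How close is usage to the limit? ("max" in memory.max means no limit.)
current=424673280   # e.g. current=$(cat "$CGROUP_DIR/memory.current")
max=536870912       # e.g. max=$(cat "$CGROUP_DIR/memory.max")

if [ "$max" = "max" ]; then
  echo "no limit set"
else
  pct=$(( current * 100 / max ))
  echo "usage: ${pct}% of limit"
fi
```

Sustained values in the 90s mean the container is one allocation burst away from an oom_kill event.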
7.3 Read memory limits and current usage (cgroups v1)
If using v1, the memory controller is typically:
MEM_CGROUP_PATH=$(cat /proc/$PID/cgroup | awk -F: '$2 ~ /memory/ {print $3}')
MEM_DIR="/sys/fs/cgroup/memory${MEM_CGROUP_PATH}"
echo "$MEM_DIR"
Read:
cat "$MEM_DIR/memory.limit_in_bytes"
cat "$MEM_DIR/memory.usage_in_bytes"
cat "$MEM_DIR/memory.max_usage_in_bytes"
cat "$MEM_DIR/memory.failcnt"
cat "$MEM_DIR/memory.stat" | head -n 50
- memory.failcnt increments when allocations fail due to the limit.
- memory.max_usage_in_bytes shows historical peak usage.
8. Common root causes (and how to confirm them)
8.1 Memory leak
Pattern: memory usage grows steadily over time and never returns.
How to confirm:
- docker stats shows a consistent upward trend.
- Application-level metrics show heap growth (if instrumented).
- Heap dumps/profiling reveal retained objects.
What to do:
- Fix the leak in code or dependencies.
- Add periodic restarts only as a temporary mitigation (not a real fix).
8.2 Unbounded caches (JVM, Node, Python, Go)
Many runtimes use memory aggressively for performance (caches, JIT, arenas). In containers, this can exceed limits if not configured.
Confirm:
- Memory usage grows until plateauing near the limit.
- No obvious “leak” in heap, but RSS keeps increasing.
Examples:
- JVM: heap not capped relative to container; metaspace/direct buffers too large.
- Node.js: default --max-old-space-size may not match container size.
- Go: GC target may allow higher RSS; memory arenas may not return to OS quickly.
- Python: allocator fragmentation; memory not returned to OS.
8.3 Too low memory limit / wrong sizing
Pattern: OOM happens under normal load, often after a deployment or traffic increase.
Confirm:
- Container memory limit is small (e.g., 256MiB) compared to typical usage.
- OOM occurs even without leaks.
Fix:
- Increase limit and/or reduce workload per container instance.
- Scale horizontally (more replicas) rather than only vertical scaling.
8.4 Spikes during startup, compilation, or batch jobs
Pattern: container dies during startup or periodic tasks (cron-like jobs, report generation).
Confirm:
- OOM time aligns with a known job.
- Memory usage spikes quickly rather than slowly increasing.
Fix:
- Reduce concurrency, batch sizes.
- Stream data instead of buffering.
- Move heavy jobs to separate worker containers with different limits.
8.5 Native memory (not visible in app-level metrics)
Your app may report low heap usage but still OOM due to:
- Native libraries (image processing, ML, crypto)
- Thread stacks (many threads)
- Direct buffers (Java NIO)
- Memory-mapped files
Confirm:
- JVM heap looks fine, but RSS is high.
- Kernel OOM log shows high anon-rss or file-rss.
Fix:
- Cap native allocations (where possible).
- Reduce threads.
- Use runtime flags (see later sections).
8.6 Page cache and file I/O pressure
Heavy file reads/writes can increase page cache. Depending on cgroup and kernel behavior, this can contribute to memory pressure.
Confirm:
- memory.stat shows high file usage (cgroups v2) or cache (v1).
- Workload involves large file scans, backups, or data processing.
Fix:
- Stream files, avoid reading huge files into memory.
- Consider tuning application I/O patterns.
- Ensure limits are sized with cache behavior in mind.
9. Fix strategies
9.1 Raise the container memory limit (correctly)
Run a container with a 1GiB memory limit:
docker run --rm -m 1g --name myapp myimage:latest
If you also want to allow swap (more on that later):
docker run --rm -m 1g --memory-swap 2g myimage:latest
You can raise the limit on a running container with docker update -m 2g <container> (you may need to adjust --memory-swap as well), but for reproducibility most teams recreate the container with the new limit. If you use Docker Compose, set:
- mem_limit (standalone Compose; Compose v2's deploy.resources is honored mainly by Swarm, with some differences in standalone Compose)
- In practice, many teams manage this via orchestration (Kubernetes) or redeploy.
Verify the limit:
docker inspect myapp --format 'Memory={{.HostConfig.Memory}}'
Sizing advice:
- Start with observed peak usage + headroom (often 20–50%).
- Consider worst-case concurrency and request bursts.
- If you have multiple processes in the container, sum their needs.
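The sizing rule above reduces to simple arithmetic. A sketch, assuming an observed peak of 612 MiB and 30% headroom (both numbers are made up for illustration):

```shell
# Suggested limit = observed peak + headroom, rounded up to a 64 MiB boundary.
peak_mib=612
headroom_pct=30

raw=$(( peak_mib * (100 + headroom_pct) / 100 ))
limit_mib=$(( (raw + 63) / 64 * 64 ))

echo "suggested limit: ${limit_mib} MiB"
# then: docker run -m "${limit_mib}m" myimage:latest
```

Treat the result as a starting point; burst concurrency and multi-process containers need a larger margin.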
9.2 Add swap (carefully) and tune swappiness
Swap can prevent immediate OOM, but it can also cause severe latency. For some workloads (burst memory, background jobs), swap is a useful safety net.
Check if swap exists on the host:
swapon --show
free -h
Create a swapfile (example: 4GiB):
sudo fallocate -l 4G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
swapon --show
Persist it (typical /etc/fstab entry):
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
Docker swap behavior:
- If you set --memory-swap equal to --memory, swap is effectively disabled for that container.
- If you set --memory-swap to a larger value, the container can use swap up to that limit.
- If you set --memory-swap to -1, it can use unlimited swap (generally not recommended).
Example: 1GiB RAM + 1GiB swap for the container:
docker run --rm -m 1g --memory-swap 2g myimage:latest
9.3 Set sane language/runtime memory caps
Java (JVM)
Modern JVMs are container-aware, but you still should set explicit limits to avoid surprises (and account for non-heap).
Common flags:
- Cap heap by percentage of container memory:
  java -XX:MaxRAMPercentage=70 -XX:InitialRAMPercentage=50 -jar app.jar
- Or cap heap explicitly:
  java -Xms512m -Xmx512m -jar app.jar
Remember: heap is not total memory. You also need headroom for:
- metaspace (-XX:MaxMetaspaceSize=...)
- direct buffers (-XX:MaxDirectMemorySize=...)
- thread stacks (-Xss...)
- JIT/code cache
A practical approach in containers:
- Set -Xmx to ~50–75% of the container limit depending on workload.
- Monitor RSS, not only heap.
Node.js
Node’s V8 heap limit can be too high or too low depending on container size. Set it:
node --max-old-space-size=512 server.js
--max-old-space-size is in MB. If your container has 1GiB, you might choose 512–768MB depending on native usage.
Python
Python doesn’t have a simple “cap heap” flag. You can:
- Fix leaks and reduce caching.
- Use worker recycling (e.g., gunicorn --max-requests).
- Control concurrency.
Example gunicorn pattern:
gunicorn app:app --workers 4 --max-requests 1000 --max-requests-jitter 100
This mitigates fragmentation/leaks by periodically restarting workers.
Go
Go’s GC can be tuned with GOGC (lower = more aggressive GC, lower memory, more CPU):
export GOGC=75
./my-go-service
Go 1.19+ also supports GOMEMLIMIT, a soft memory limit the runtime's GC tries to stay under:
export GOMEMLIMIT=800MiB
./my-go-service
9.4 Reduce concurrency and batch sizes
If OOM correlates with traffic spikes:
- Reduce worker counts
- Reduce in-flight requests
- Add backpressure
- Limit queue sizes
Examples:
- Nginx: reduce worker_connections, tune buffering.
- App servers: limit thread pools.
- Background jobs: limit parallelism.
This is often the fastest fix when you cannot immediately add memory.
9.5 Prevent OOM with proactive monitoring and alerts
At minimum, monitor:
- Container memory usage vs limit
- OOM kill events
- Restart counts
Useful commands for ad-hoc checks:
docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.RunningFor}}'
docker inspect <container> --format 'RestartCount={{.RestartCount}} OOMKilled={{.State.OOMKilled}}'
On cgroups v2, you can watch OOM counters:
watch -n 2 "cat $CGROUP_DIR/memory.events; echo; cat $CGROUP_DIR/memory.current; cat $CGROUP_DIR/memory.max"
In production, export metrics to Prometheus/Grafana or your monitoring stack. Key is to alert before hitting the limit (e.g., at 80–90% sustained usage).
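The "alert before the limit" idea reduces to a single comparison. A minimal sketch (check_threshold is a made-up helper; in practice feed it values from docker stats or the cgroup files shown earlier):

```shell
# Print ALERT when usage crosses the given percentage of the limit.
check_threshold() {
  local usage=$1 limit=$2 threshold_pct=$3
  if [ $(( usage * 100 / limit )) -ge "$threshold_pct" ]; then
    echo "ALERT"
  else
    echo "OK"
  fi
}

check_threshold 950000000 1073741824 85   # ~88% of 1 GiB -> ALERT
check_threshold 500000000 1073741824 85   # ~46% of 1 GiB -> OK
```

Run something like this from cron or a sidecar as a stopgap until proper metrics export is in place.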
9.6 Use --oom-score-adj and --oom-kill-disable (with caution)
Docker supports:
- --oom-score-adj: influence which processes are killed in a host OOM scenario.
- --oom-kill-disable: attempt to disable OOM killing for the container.
Examples:
docker run --rm --oom-score-adj=-500 myimage
docker run --rm --oom-kill-disable myimage
Cautions:
- Disabling OOM kill can make the system unstable; the kernel may still kill something else or the container may hang.
- These are advanced levers; prefer correct sizing and app-level fixes.
10. Reproduce and test OOM safely
To confirm your detection pipeline, you can intentionally OOM a test container.
Example: allocate memory until killed:
docker run --rm -m 100m --name oom-test python:3.12-slim \
  python -c "
import time
a = []
while True:
    a.append('x' * 10_000_000)
    time.sleep(0.1)
"
Observe:
docker ps -a | grep oom-test
docker inspect oom-test --format 'ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}}'
Check kernel logs:
sudo dmesg -T | tail -n 50
This validates that:
- Your host logs capture OOM messages
- Docker reports OOMKilled=true
- Exit code is typically 137
11. Practical examples
11.1 Example: Node.js container OOM
Scenario: A Node API container has a 512MiB limit and restarts under load.
- Confirm OOM:
docker inspect node-api --format 'ExitCode={{.State.ExitCode}} OOMKilled={{.State.OOMKilled}}'
- Observe memory trend:
docker stats node-api
- Fix by setting V8 heap cap and leaving headroom for native memory:
If container limit is 512MiB, set old space to ~256–320MB:
docker run -d --name node-api -m 512m \
my-node-image node --max-old-space-size=320 server.js
- Re-check stability with load testing and docker stats.
If still OOM:
- Reduce concurrency
- Investigate leaks (heap snapshots)
- Increase container memory
11.2 Example: Java (JVM) container OOM
Scenario: A Java service in a 2GiB container OOMs even though -Xmx is 1GiB.
- Confirm cgroup OOM via kernel log:
sudo journalctl -k --since "2 hours ago" | grep -i -E 'memory cgroup out of memory|killed process'
- Check if non-heap is large:
- Many threads? Each thread stack may be 1MB+.
- Direct buffers? Netty? NIO?
- Fix by budgeting memory explicitly:
Example for a 2GiB container:
- Heap: 1200MiB
- Direct: 256MiB
- Metaspace: 256MiB
- Leave remainder for stacks, code cache, libc, etc.
Command:
java \
-Xms1200m -Xmx1200m \
-XX:MaxDirectMemorySize=256m \
-XX:MaxMetaspaceSize=256m \
-Xss512k \
-jar app.jar
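The budget above can be sanity-checked with quick arithmetic before deploying (the numbers match the example; stacks, code cache, and libc must fit in the remainder):

```shell
# Explicit JVM caps must sum to well under the container limit.
limit_mib=2048
heap_mib=1200; direct_mib=256; metaspace_mib=256

explicit=$(( heap_mib + direct_mib + metaspace_mib ))
remainder=$(( limit_mib - explicit ))

echo "explicit caps: ${explicit} MiB, remainder: ${remainder} MiB"
```

If the remainder is only a few tens of MiB, the container will still OOM even though every explicit cap is respected.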
Then validate RSS vs limit:
docker stats java-service
And check cgroup stats (v2):
PID=$(docker inspect java-service --format '{{.State.Pid}}')
CGROUP_PATH=$(cat /proc/$PID/cgroup | awk -F: '{print $3}')
CGROUP_DIR="/sys/fs/cgroup${CGROUP_PATH}"
cat "$CGROUP_DIR/memory.current"
cat "$CGROUP_DIR/memory.events"
11.3 Example: Python memory growth
Scenario: A gunicorn-based Python service slowly grows and OOMs after 2–3 days.
- Confirm trend:
docker stats python-api
- Mitigate with worker recycling:
gunicorn app:app --workers 4 --max-requests 2000 --max-requests-jitter 200
- If using libraries that cache heavily (e.g., image processing), add explicit cache limits or clear caches.
- If the service must be long-lived, profile memory:
  - tracemalloc
  - objgraph
  - memray (for deeper analysis)
Even with recycling, you should still investigate the root cause.
12. Summary
Diagnosing Docker OOM issues is mostly about distinguishing container-limit OOM from host OOM, then confirming the cause using:
- docker inspect (OOMKilled, exit code, memory settings)
- docker stats (trend and peaks)
- Kernel logs (dmesg, journalctl -k)
- cgroup files (memory.current, memory.max, memory.events, memory.stat)
Fixes generally fall into these buckets:
- Right-size memory limits (and consider swap carefully)
- Configure runtimes (JVM/Node/Go) to respect container constraints
- Reduce concurrency / batch sizes to avoid spikes
- Find and fix leaks or unbounded caches
- Monitor and alert before you hit the cliff
If you collect:
- your container memory limit,
- docker inspect ... output for HostConfig.Memory*,
- and the relevant dmesg / journalctl -k OOM lines,
you can usually pinpoint whether the crash is due to heap sizing, native memory, page cache, or a true leak within a few iterations.