Production Incident Walkthrough: Debugging a Memory Leak in a Dockerized API
This tutorial is a realistic, end-to-end walkthrough of diagnosing and fixing a memory leak in a Dockerized API running in production. It focuses on practical steps, real commands, and the reasoning behind each decision so you can adapt the approach to your own stack.
We’ll assume:
- The API runs in Docker containers (e.g., on a VM, Kubernetes, ECS, etc.).
- You have shell access to at least one host running the container.
- You can deploy a patched image after confirming the root cause.
- The API is written in Node.js (Express/Fastify-style), but most of the workflow applies to other runtimes.
1) Incident Symptoms and Initial Triage
Typical alert signals
A memory leak in production often first appears as one or more of:
- Container restarts due to OOM (Out Of Memory) kills
- Increasing latency and timeouts (GC pressure, swapping, CPU spikes)
- Node process memory steadily rising over time
- Host memory pressure if limits are not set correctly
You might see alerts like:
- “Container restarted > 5 times in 10 minutes”
- “p95 latency doubled”
- “Memory usage > 90% for 15 minutes”
Confirm what’s actually failing
On a Docker host, start by checking container restarts and recent exits:
docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.RunningFor}}'
docker ps -a --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}' | head
Note: docker ps has no {{.ExitCode}} or {{.FinishedAt}} placeholders; the Status column already includes the exit code, e.g. "Exited (137) 5 minutes ago".
If you suspect OOM, inspect the container state:
docker inspect api-prod-1 --format '{{json .State}}' | jq
Look for:
"OOMKilled": trueExitCodeoften137(SIGKILL)
Also check host logs:
dmesg -T | egrep -i 'killed process|out of memory|oom' | tail -n 50
If you’re on systemd-based Linux, you can also check journald for Docker:
journalctl -u docker --since "2 hours ago" | tail -n 200
Quick stabilization (don’t skip this in real life)
Before deep debugging, reduce blast radius:
- Scale out replicas (if possible) to reduce per-instance load.
- Increase memory limits temporarily (if safe) to buy time.
- Add rate limiting or shed load.
- Roll back to last known good version if the leak correlates with a recent deploy.
Example: temporarily increase memory limit for a container (standalone Docker):
docker update --memory 1.5g --memory-swap 1.5g api-prod-1
In Kubernetes you’d adjust resources.limits.memory and redeploy, but the concept is the same: stabilize first, then investigate.
2) Establish Baselines: Is Memory Actually Growing?
Use docker stats for a fast view
docker stats --no-stream
If you can watch over time:
docker stats api-prod-1
You’re looking for a pattern: memory usage rises steadily and never returns to baseline after traffic drops.
Confirm cgroup memory usage from inside the container
Exec into the container:
docker exec -it api-prod-1 sh
Depending on cgroups version:
cgroups v1:
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
cgroups v2:
cat /sys/fs/cgroup/memory.current
cat /sys/fs/cgroup/memory.max
This tells you what the kernel thinks the container is using, which is the authoritative source for OOM behavior.
Distinguish “heap leak” vs “RSS growth” vs “cache”
In Node.js, memory can grow for different reasons:
- V8 heap: JavaScript objects retained unintentionally (classic leak).
- RSS (resident set size): native allocations, buffers, C++ addons, fragmentation.
- Page cache / filesystem cache: can rise but is reclaimable; less likely to OOM-kill the container if properly limited, but can still contribute.
From inside the container:
ps aux | head
ps -o pid,ppid,cmd,%mem,%cpu,rss,vsz -p 1
If PID 1 is your Node process, rss is a key indicator: it’s the actual resident memory.
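From inside the Node process itself, process.memoryUsage() breaks that RSS number down, which is the quickest way to tell the three growth modes apart:

```javascript
// What the process.memoryUsage() fields mean when deciding between
// "V8 heap leak" and "native/RSS growth".
const mu = process.memoryUsage();

console.log({
  rss: mu.rss,             // total resident memory (heap + native + code)
  heapTotal: mu.heapTotal, // memory V8 has reserved for the JS heap
  heapUsed: mu.heapUsed,   // JS objects actually live on the heap
  external: mu.external,   // C++ memory bound to JS objects (e.g. Buffers)
});

// Rule of thumb: heapUsed climbing with rss => JS object retention;
// rss climbing while heapUsed stays flat => native memory or Buffers.
```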
3) Gather Evidence: Logs, Metrics, and a Repro Timeline
Correlate memory growth with traffic or endpoints
If you have access logs, identify whether memory growth correlates with:
- A specific endpoint (e.g., /reports/export)
- A request parameter (e.g., ?include=all)
- A tenant/customer
- A background job
If you have Prometheus/Grafana, look at:
- Request rate by route
- Response time by route
- Error rate by route
- Process memory metrics (if exported)
If you don’t have route-level metrics, you can still add temporary logging (carefully) to identify suspicious endpoints.
Enable Node runtime metrics (if not already)
If you can redeploy quickly, consider exposing basic process metrics. For example, with prom-client you can export:
- process_resident_memory_bytes
- process_heap_bytes
- nodejs_heap_size_used_bytes
- nodejs_gc_duration_seconds
But during an incident, you might not have that ready. So we’ll proceed with low-level tooling.
4) Decide on a Debug Strategy
You generally have two tracks:
- In-production forensic approach: attach to the running process and gather heap snapshots / profiles.
- Staging reproduction: replay traffic or run a load test to reproduce the leak safely.
Often you do both:
- Production: capture one or two heap snapshots at different times to confirm growth and identify retained objects.
- Staging: reproduce and iterate quickly to validate fixes.
5) Inspect Node Memory from the Outside
Check Node version and flags
From inside the container:
node -v
ps -ef
cat /proc/1/cmdline | tr '\0' ' '
You’re looking for flags like:
- --max-old-space-size=... (heap limit)
- --inspect (debug port)
- --trace-gc (GC logs; usually too noisy for prod)
If --max-old-space-size is too high relative to container limit, Node may be killed by the kernel before it can throw an out-of-memory error. A good practice is to set Node’s heap limit below the container memory limit to leave headroom for native allocations and overhead.
Example: if container limit is 512MiB, set Node heap to ~384MiB:
node --max-old-space-size=384 server.js
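The headroom calculation generalizes. A small helper, where the 75% ratio is a heuristic starting point rather than a rule:

```javascript
// Derive a Node heap limit (in MiB) from the container memory limit,
// leaving ~25% headroom for native memory, Buffers, and runtime overhead.
function heapLimitMiB(containerLimitBytes, ratio = 0.75) {
  const mib = containerLimitBytes / (1024 * 1024);
  return Math.floor(mib * ratio);
}

// 512 MiB container limit -> pass --max-old-space-size=384 to node
console.log(heapLimitMiB(512 * 1024 * 1024)); // 384
```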
6) Capture Heap Snapshots in a Running Container (Node.js)
Option A: Use the Node inspector (most common)
If your container exposes the inspector port, you can connect. But in production it often isn’t enabled.
You can still enable it by sending SIGUSR1 to Node (works on many Node versions/configurations). From inside the container:
kill -USR1 1
This typically starts the inspector on 127.0.0.1:9229. Confirm:
ss -lntp | grep 9229 || netstat -lntp | grep 9229
Now you need access to that port. If you’re on the Docker host:
docker exec -it api-prod-1 sh -lc "ss -lntp | grep 9229"
Now you need access to that port. The inspector binds to 127.0.0.1 inside the container, so from the host you can either run a temporary socat container in the same network namespace to forward the port, or more simply use docker exec plus curl to talk to the inspector's HTTP endpoints locally.
Option B: Trigger heap snapshot via the inspector HTTP endpoint
Once inspector is enabled, list targets:
curl -s http://127.0.0.1:9229/json/list | jq
You’ll see a webSocketDebuggerUrl. You can use tools like chrome://inspect (locally) by port-forwarding, but here’s a production-friendly approach:
- Use a small script to connect to the inspector WebSocket and request a heap snapshot.
- Or install a minimal tool like node-heapdump (npm package heapdump; requires a code change) — not ideal mid-incident.
A common incident workflow is to temporarily deploy a build with a safe admin endpoint that triggers heap snapshots to a mounted volume or object storage. If you can’t redeploy, you can still do it with inspector, but it’s more fiddly.
Practical approach: Redeploy with a controlled heapdump trigger (recommended)
Add heapdump and a protected endpoint. Example (Express):
npm install heapdump
Code snippet:
import heapdump from "heapdump";
import path from "path";
import express from "express";
import express from "express";
const app = express();
app.post("/admin/heapdump", async (req, res) => {
// Protect this endpoint with auth in real life.
const dir = process.env.HEAPDUMP_DIR || "/tmp";
const filename = path.join(dir, `heap-${Date.now()}.heapsnapshot`);
heapdump.writeSnapshot(filename, (err, filePath) => {
if (err) return res.status(500).json({ error: String(err) });
res.json({ ok: true, filePath });
});
});
Then mount a volume so the snapshot survives container restarts:
docker run -d --name api-prod-1 \
-p 8080:8080 \
-v /var/lib/api-heapdumps:/heapdumps \
-e HEAPDUMP_DIR=/heapdumps \
myorg/api:debug-heapdump
Trigger snapshots at two points in time:
- Shortly after deploy (baseline)
- After memory has grown significantly
curl -X POST http://localhost:8080/admin/heapdump
sleep 1800
curl -X POST http://localhost:8080/admin/heapdump
ls -lh /var/lib/api-heapdumps
Why two snapshots? A single snapshot shows what’s in memory, but not what’s growing. Comparing snapshots reveals which object types and retainers increase over time.
7) Analyze Heap Snapshots (Chrome DevTools)
Copy snapshots to your workstation:
scp user@prod-host:/var/lib/api-heapdumps/heap-*.heapsnapshot .
Open Chrome DevTools:
- Open chrome://inspect and click "Open dedicated DevTools for Node"
- Or open DevTools → Memory tab → Load snapshot file
What to look for
In the Memory panel:
- Summary: which constructor types dominate (e.g., Array, Object, Map, Buffer)
- Comparison (between two snapshots): which types increased
- Retainers: why those objects are still referenced (the critical part)
Common leak signatures:
- A Map that grows without bound (cache with no eviction)
- Arrays accumulating request objects
- EventEmitter listeners not removed
- Closures capturing large objects
- Buffers retained by pending promises/timeouts
8) Case Study: The Actual Leak
The scenario
The API has an endpoint:
GET /reports/export?tenantId=...
It generates a CSV report by fetching data from a database, then formatting it.
A recent change introduced an in-memory cache to speed up repeated exports. The cache key is tenantId, but the cached value includes the entire dataset and never expires.
Under production traffic (many tenants), memory grows steadily until the container is OOM-killed.
The buggy code
A simplified example:
const reportCache = new Map(); // key: tenantId, value: huge array of rows
export async function exportReport(req, res) {
const tenantId = req.query.tenantId;
if (reportCache.has(tenantId)) {
return res.json({ rows: reportCache.get(tenantId) });
}
const rows = await db.query("SELECT * FROM big_table WHERE tenant_id = ?", [tenantId]);
// Leak: caches unbounded data forever
reportCache.set(tenantId, rows);
res.json({ rows });
}
Why this leaks:
- The Map is strongly referenced at module scope, so it lives for the process lifetime.
- Each new tenant inserts a new entry.
- Each entry may be large (arrays of rows, strings, etc.).
- No TTL, no max size, no eviction policy.
In heap snapshots, you’d see:
- Growing counts of Array and Object
- Retainers pointing to reportCache → Map → entries
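The growth pattern is easy to demonstrate in isolation. A toy version of the cache above, using a stand-in for the query result:

```javascript
// Minimal simulation of the leak: every distinct tenant adds an entry
// that is never removed, so cache size tracks tenant cardinality.
const reportCache = new Map();

function fakeExport(tenantId) {
  if (!reportCache.has(tenantId)) {
    // Stand-in for the "huge array of rows" from the real query.
    reportCache.set(tenantId, new Array(1000).fill(tenantId));
  }
  return reportCache.get(tenantId);
}

for (let i = 0; i < 500; i++) fakeExport(`tenant-${i}`);
console.log(reportCache.size); // 500 — and it only ever grows
```

Repeat traffic from known tenants is free, but every new tenant permanently increases the footprint — exactly the monotonic RSS curve seen in production.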
9) Validate the Hypothesis with Live Instrumentation
Before changing code, confirm the cache growth in production (or staging).
Add a temporary metric/log:
setInterval(() => {
const mem = process.memoryUsage();
console.log(JSON.stringify({
msg: "mem",
rss: mem.rss,
heapUsed: mem.heapUsed,
heapTotal: mem.heapTotal,
external: mem.external,
reportCacheSize: reportCache.size
}));
}, 60000).unref();
In production logs you might see:
- reportCacheSize increasing steadily
- heapUsed and rss increasing in parallel
If heapUsed grows but rss grows even more, you may also have native/buffer pressure; still, the cache is likely the driver.
10) Fix: Add Bounded Caching (TTL + Max Entries)
Choose an eviction strategy
For caches in API processes, you generally want:
- Max size: prevents unbounded growth
- TTL: ensures stale data is removed
- LRU: evicts least-recently-used items when max size is reached
A popular library is lru-cache.
Install:
npm install lru-cache
Replace the unbounded Map:
import { LRUCache } from "lru-cache"; // named export in lru-cache v7+
const reportCache = new LRUCache({
max: 200, // max number of tenants cached
ttl: 5 * 60 * 1000, // 5 minutes
allowStale: false,
updateAgeOnGet: true
});
Use it:
export async function exportReport(req, res) {
const tenantId = req.query.tenantId;
const cached = reportCache.get(tenantId);
if (cached) {
return res.json({ rows: cached });
}
const rows = await db.query("SELECT * FROM big_table WHERE tenant_id = ?", [tenantId]);
reportCache.set(tenantId, rows);
res.json({ rows });
}
Why this works
- max ensures memory can't grow beyond a bounded number of cached tenants.
- ttl ensures that even frequently accessed tenants won't keep data forever if traffic patterns change.
- LRU eviction tends to keep "hot" entries and discard "cold" ones.
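For intuition about what the library is doing under the hood, here is a toy LRU with max-size and TTL — an illustrative sketch only, not a replacement for lru-cache:

```javascript
// Toy LRU cache: bounded size + TTL, exploiting the fact that a JS Map
// preserves insertion order (first key = least recently used).
class TinyLRU {
  constructor({ max, ttlMs }) {
    this.max = max;
    this.ttlMs = ttlMs;
    this.entries = new Map();
  }
  get(key) {
    const e = this.entries.get(key);
    if (!e) return undefined;
    if (Date.now() - e.at > this.ttlMs) {
      this.entries.delete(key); // expired: drop and miss
      return undefined;
    }
    // Re-insert to mark as most recently used.
    this.entries.delete(key);
    this.entries.set(key, e);
    return e.value;
  }
  set(key, value) {
    this.entries.delete(key);
    this.entries.set(key, { value, at: Date.now() });
    if (this.entries.size > this.max) {
      // Evict the least recently used entry (first in insertion order).
      const oldest = this.entries.keys().next().value;
      this.entries.delete(oldest);
    }
  }
}
```

The key property: entries.size can never exceed max, so memory is bounded no matter how many tenants arrive.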
Consider caching only what you need
Often the best fix is not just bounding the cache, but reducing what’s cached:
- Cache a compressed representation
- Cache IDs and re-fetch details
- Cache per-page results rather than full datasets
- Cache in an external system (Redis) with explicit limits and eviction
For report exports, caching the entire dataset is usually a red flag unless the dataset is small and bounded.
11) Confirm the Fix with Load Testing
Build and run locally with a memory limit
To simulate production constraints, run the container with a memory limit:
docker build -t myorg/api:leakfix .
docker run --rm -p 8080:8080 --memory 512m --memory-swap 512m myorg/api:leakfix
Generate traffic
Use wrk to hammer the endpoint with multiple tenants:
wrk -t4 -c50 -d2m "http://localhost:8080/reports/export?tenantId=tenant-$(date +%s)"
Note that $(date +%s) is expanded once by the shell before wrk starts, so this hammers a single tenant — useful for baseline throughput, but it won't reproduce a per-tenant leak. For that, cycle through many tenants.
Create a small script:
cat > hit.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
for i in $(seq 1 2000); do
tenant="tenant-$((i % 500))"
curl -s "http://localhost:8080/reports/export?tenantId=$tenant" > /dev/null &
if (( i % 50 == 0 )); then wait; fi
done
wait
EOF
chmod +x hit.sh
./hit.sh
Watch memory:
docker stats --no-stream
Inside the container, read the server process's RSS directly (note: node -e would start a new process and report its own memory, not the server's):
docker exec -it $(docker ps -q --filter ancestor=myorg/api:leakfix) sh -lc 'grep VmRSS /proc/1/status'
With the fix, memory may rise initially (warming cache) but should plateau rather than grow indefinitely.
12) Production Rollout Plan
Safe rollout steps
- Deploy to staging with production-like traffic (if possible).
- Enable additional memory logging temporarily.
- Deploy to a small percentage of production instances.
- Watch:
- Container memory
- Restart count
- Latency and error rate
- Roll out fully.
Verify OOM behavior is controlled
Even with a fix, you should ensure:
- Container memory limits are set
- Node heap max is set appropriately
Example Docker run:
docker run -d --name api-prod-1 \
--memory 512m --memory-swap 512m \
-e NODE_OPTIONS="--max-old-space-size=384" \
myorg/api:leakfix
Why set NODE_OPTIONS:
- Prevents Node from trying to expand heap too close to the cgroup limit
- Leaves room for native memory, buffers, and overhead
13) Postmortem: What Actually Happened and How to Prevent It
Root cause summary
- A new feature introduced an in-process cache keyed by tenant ID.
- The cache stored large report datasets and never evicted entries.
- As new tenants hit the endpoint, memory grew monotonically.
- The container hit its memory limit and was OOM-killed, causing restarts and elevated latency.
Contributing factors
- Lack of cache bounds (no TTL, no max entries)
- Insufficient memory monitoring at the process level (heap vs RSS)
- No load test that simulated many tenants
- Possibly no canary rollout, so leak impacted many instances quickly
Preventative measures
- Cache design checklist
  - Always define max size and TTL
  - Prefer external caches for large datasets
  - Avoid caching request/response objects directly
- Add memory dashboards
  - RSS, heap used, external memory
  - GC pause time
  - Container OOM kill events
- Automated regression tests
  - A soak test that runs for 30–60 minutes
  - A multi-tenant traffic pattern
  - Assert memory plateaus (within reason)
- Operational guardrails
  - Canary deployments
  - Automated rollback on restart spikes
  - Alert on monotonic memory growth (slope-based alerting)
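The slope-based alerting idea (and the soak-test plateau assertion) can be sketched as a least-squares fit over periodic RSS samples. The threshold values here are illustrative; tune them to your sampling interval and workload:

```javascript
// Fit a least-squares line through periodic memory samples and flag
// sustained growth. Slope units: bytes per sample interval.
function memorySlope(samples) {
  const n = samples.length;
  const meanX = (n - 1) / 2; // x values are just the indices 0..n-1
  const meanY = samples.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (i - meanX) * (samples[i] - meanY);
    den += (i - meanX) ** 2;
  }
  return num / den;
}

function looksLikeLeak(samples, bytesPerSampleThreshold) {
  return samples.length >= 2 && memorySlope(samples) > bytesPerSampleThreshold;
}
```

In a soak test you would collect RSS once a minute and assert looksLikeLeak(samples, threshold) is false after the warm-up window.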
14) Extra: When Heap Snapshots Don’t Show the Leak
Sometimes heap snapshots look normal but RSS keeps growing. That suggests:
- Native memory leak (C++ addon, image processing library, crypto, etc.)
- Buffer accumulation outside V8 heap
- Fragmentation or allocator behavior under churn
Useful commands:
Inside container:
cat /proc/1/status | egrep 'VmRSS|VmSize|Threads'
Using pmap inside the container (Alpine-based images may need procps installed first):
docker exec -it api-prod-1 sh -lc 'apk add --no-cache procps || true; pmap -x 1 | tail -n 20'
If you suspect file descriptor leaks (can indirectly cause memory issues):
docker exec -it api-prod-1 sh -lc 'ls /proc/1/fd | wc -l'
docker exec -it api-prod-1 sh -lc 'lsof -p 1 | head'
For deeper native analysis you may need:
- jemalloc profiling (if used)
- valgrind (rare in production)
- eBPF tools (bcc, bpftrace) on the host
But in many API incidents, the leak is in application-level object retention (like unbounded caches).
15) Quick Reference Command Cheat Sheet
Host-level:
docker ps
docker ps -a
docker stats
docker inspect <container> --format '{{json .State}}' | jq
dmesg -T | egrep -i 'oom|killed process'
journalctl -u docker --since "1 hour ago"
Container-level:
docker exec -it <container> sh
ps -o pid,cmd,rss,vsz -p 1
cat /proc/1/status | egrep 'VmRSS|VmSize'
ss -lntp
Node-level:
node -v
node -e "console.log(process.memoryUsage())"
kill -USR1 1 # enable inspector (often)
Heapdump approach (requires code change):
npm install heapdump
curl -X POST http://localhost:8080/admin/heapdump
Closing Notes
Debugging memory leaks in Dockerized production systems is less about a single magic tool and more about a disciplined process:
- Stabilize the incident.
- Confirm memory growth and whether it’s heap vs RSS.
- Capture evidence (snapshots/profiles) at multiple times.
- Identify retention paths and the specific code responsible.
- Fix with bounded resource usage and validate under load.
- Roll out safely and add guardrails to prevent recurrence.
If you want, share:
- your runtime (Node/Java/Python/Go),
- orchestration (Docker Compose/Kubernetes/ECS),
- and a few details about memory graphs or OOM logs, and I can adapt the commands and analysis workflow to your exact environment.