Production Incident Walkthrough: Debugging a Memory Leak in a Dockerized API
This tutorial is a realistic, end-to-end walkthrough of diagnosing and fixing a memory leak in a Dockerized API running in production. It focuses on practical steps, real commands, and the reasoning behind each decision so you can adapt the approach to your own stack.
We’ll assume:
- The API runs in Docker containers (e.g., on a VM, Kubernetes, ECS, etc.).
- You have shell access to at least one host running the container.
- You can deploy a patched image after confirming the root cause.
- The API is written in Node.js (Express/Fastify-style), but most of the workflow applies to other runtimes.
1) Incident Symptoms and Initial Triage
Typical alert signals
A memory leak in production often first appears as one or more of:
- Container restarts due to OOM (Out Of Memory) kills
- Increasing latency and timeouts (GC pressure, swapping, CPU spikes)
- Node process memory steadily rising over time
- Host memory pressure if limits are not set correctly
You might see alerts like:
- “Container restarted > 5 times in 10 minutes”
- “p95 latency doubled”
- “Memory usage > 90% for 15 minutes”
Confirm what’s actually failing
On a Docker host, start by checking container restarts and recent exits:
docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.RunningFor}}'
docker ps -a --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}' | head
Note: docker ps has no {{.ExitCode}} or {{.FinishedAt}} placeholders; the Status column already includes the exit code, e.g. "Exited (137) 5 minutes ago".
If you suspect OOM, inspect the container state:
docker inspect api-prod-1 --format '{{json .State}}' | jq
Look for:
"OOMKilled": trueExitCodeoften137(SIGKILL)
Also check host logs:
dmesg -T | egrep -i 'killed process|out of memory|oom' | tail -n 50
If you’re on systemd-based Linux, you can also check journald for Docker:
journalctl -u docker --since "2 hours ago" | tail -n 200
Quick stabilization (don’t skip this in real life)
Before deep debugging, reduce blast radius:
- Scale out replicas (if possible) to reduce per-instance load.
- Increase memory limits temporarily (if safe) to buy time.
- Add rate limiting or shed load.
- Roll back to last known good version if the leak correlates with a recent deploy.
Example: temporarily increase memory limit for a container (standalone Docker):
docker update --memory 1.5g --memory-swap 1.5g api-prod-1
In Kubernetes you’d adjust resources.limits.memory and redeploy, but the concept is the same: stabilize first, then investigate.
2) Establish Baselines: Is Memory Actually Growing?
Use docker stats for a fast view
docker stats --no-stream
If you can watch over time:
docker stats api-prod-1
You’re looking for a pattern: memory usage rises steadily and never returns to baseline after traffic drops.
Confirm cgroup memory usage from inside the container
Exec into the container:
docker exec -it api-prod-1 sh
Depending on cgroups version:
cgroups v1:
cat /sys/fs/cgroup/memory/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/memory.limit_in_bytes
cgroups v2:
cat /sys/fs/cgroup/memory.current
cat /sys/fs/cgroup/memory.max
This tells you what the kernel thinks the container is using, which is the authoritative source for OOM behavior.
Distinguish “heap leak” vs “RSS growth” vs “cache”
In Node.js, memory can grow for different reasons:
- V8 heap: JavaScript objects retained unintentionally (classic leak).
- RSS (resident set size): native allocations, buffers, C++ addons, fragmentation.
- Page cache / filesystem cache: can rise but is reclaimable; less likely to OOM-kill the container if properly limited, but can still contribute.
From inside the container:
ps aux | head
ps -o pid,ppid,cmd,%mem,%cpu,rss,vsz -p 1
If PID 1 is your Node process, rss is a key indicator: it’s the actual resident memory.
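From inside the Node process itself, process.memoryUsage() breaks that RSS number down, which is the quickest way to tell the three growth modes apart:

```javascript
// What the process.memoryUsage() fields mean when deciding between
// "V8 heap leak" and "native/RSS growth".
const mu = process.memoryUsage();

console.log({
  rss: mu.rss,             // total resident memory (heap + native + code)
  heapTotal: mu.heapTotal, // memory V8 has reserved for the JS heap
  heapUsed: mu.heapUsed,   // JS objects actually live on the heap
  external: mu.external,   // C++ memory bound to JS objects (e.g. Buffers)
});

// Rule of thumb: heapUsed climbing with rss => JS object retention;
// rss climbing while heapUsed stays flat => native memory or Buffers.
```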
3) Gather Evidence: Logs, Metrics, and a Repro Timeline
Correlate memory growth with traffic or endpoints
If you have access logs, identify whether memory growth correlates with:
- A specific endpoint (e.g., /reports/export)
- A request parameter (e.g., ?include=all)
- A tenant/customer
- A background job
If you have Prometheus/Grafana, look at:
- Request rate by route
- Response time by route
- Error rate by route
- Process memory metrics (if exported)
If you don’t have route-level metrics, you can still add temporary logging (carefully) to identify suspicious endpoints.
Enable Node runtime metrics (if not already)
If you can redeploy quickly, consider exposing basic process metrics. For example, with prom-client you can export:
- process_resident_memory_bytes
- process_heap_bytes
- nodejs_heap_size_used_bytes
- nodejs_gc_duration_seconds
But during an incident, you might not have that ready. So we’ll proceed with low-level tooling.
4) Decide on a Debug Strategy
You generally have two tracks:
- In-production forensic approach: attach to the running process and gather heap snapshots / profiles.
- Staging reproduction: replay traffic or run a load test to reproduce the leak safely.
Often you do both:
- Production: capture one or two heap snapshots at different times to confirm growth and identify retained objects.
- Staging: reproduce and iterate quickly to validate fixes.
5) Inspect Node Memory from the Outside
Check Node version and flags
From inside the container:
node -v
ps -ef
cat /proc/1/cmdline | tr '\0' ' '
You’re looking for flags like:
- --max-old-space-size=... (heap limit)
- --inspect (debug port)
- --trace-gc (GC logs; usually too noisy for prod)
If --max-old-space-size is too high relative to container limit, Node may be killed by the kernel before it can throw an out-of-memory error. A good practice is to set Node’s heap limit below the container memory limit to leave headroom for native allocations and overhead.
Example: if container limit is 512MiB, set Node heap to ~384MiB:
node --max-old-space-size=384 server.js
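The headroom calculation generalizes. A small helper, where the 75% ratio is a heuristic starting point rather than a rule:

```javascript
// Derive a Node heap limit (in MiB) from the container memory limit,
// leaving ~25% headroom for native memory, Buffers, and runtime overhead.
function heapLimitMiB(containerLimitBytes, ratio = 0.75) {
  const mib = containerLimitBytes / (1024 * 1024);
  return Math.floor(mib * ratio);
}

// 512 MiB container limit -> pass --max-old-space-size=384 to node
console.log(heapLimitMiB(512 * 1024 * 1024)); // 384
```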
6) Capture Heap Snapshots in a Running Container (Node.js)
Option A: Use the Node inspector (most common)
If your container exposes the inspector port, you can connect. But in production it often isn’t enabled.
You can still enable it by sending SIGUSR1 to Node (works on many Node versions/configurations). From inside the container:
kill -USR1 1
This typically starts the inspector on 127.0.0.1:9229. Confirm:
ss -lntp | grep 9229 || netstat -lntp | grep 9229
Now you need access to that port. If you’re on the Docker host:
docker exec -it api-prod-1 sh -lc "ss -lntp | grep 9229"
Now you need access to that port. The inspector binds to 127.0.0.1 inside the container, so from the host you can either run a temporary socat container in the same network namespace to forward the port, or more simply use docker exec plus curl to talk to the inspector's HTTP endpoints locally.
Option B: Trigger heap snapshot via the inspector HTTP endpoint
Once inspector is enabled, list targets:
curl -s http://127.0.0.1:9229/json/list | jq
You’ll see a webSocketDebuggerUrl. You can use tools like chrome://inspect (locally) by port-forwarding, but here’s a production-friendly approach:
- Use a small script to connect to the inspector WebSocket and request a heap snapshot.
- Or install a minimal tool like node-heapdump (npm package heapdump; requires a code change) — not ideal mid-incident.
A common incident workflow is to temporarily deploy a build with a safe admin endpoint that triggers heap snapshots to a mounted volume or object storage. If you can’t redeploy, you can still do it with inspector, but it’s more fiddly.
Practical approach: Redeploy with a controlled heapdump trigger (recommended)
Add heapdump and a protected endpoint. Example (Express):
npm install heapdump
Code snippet:
import heapdump from "heapdump";
import path from "path";
import express from "express";
import express from "express";
const app = express();
app.post("/admin/heapdump", async (req, res) => {
// Protect this endpoint with auth in real life.
const dir = process.env.HEAPDUMP_DIR || "/tmp";
const filename = path.join(dir, `heap-${Date.now()}.heapsnapshot`);
heapdump.writeSnapshot(filename, (err, filePath) => {
if (err) return res.status(500).json({ error: String(err) });
res.json({ ok: true, filePath });
});
});
Then mount a volume so the snapshot survives container restarts:
docker run -d --name api-prod-1 \
-p 8080:8080 \
-v /var/lib/api-heapdumps:/heapdumps \
-e HEAPDUMP_DIR=/heapdumps \
myorg/api:debug-heapdump
Trigger snapshots at two points in time:
- Shortly after deploy (baseline)
- After memory has grown significantly
curl -X POST http://localhost:8080/admin/heapdump
sleep 1800
curl -X POST http://localhost:8080/admin/heapdump
ls -lh /var/lib/api-heapdumps
Why two snapshots? A single snapshot shows what’s in memory, but not what’s growing. Comparing snapshots reveals which object types and retainers increase over time.
7) Analyze Heap Snapshots (Chrome DevTools)
Copy snapshots to your workstation:
scp user@prod-host:/var/lib/api-heapdumps/heap-*.heapsnapshot .
Open Chrome DevTools:
- Open chrome://inspect and click "Open dedicated DevTools for Node"
- Or open DevTools → Memory tab → Load snapshot file
What to look for
In the Memory panel:
- Summary: which constructor types dominate (e.g., Array, Object, Map, Buffer)
- Comparison (between two snapshots): which types increased
- Retainers: why those objects are still referenced (the critical part)
Common leak signatures:
- A Map that grows without bound (cache with no eviction)
- Arrays accumulating request objects
- EventEmitter listeners not removed
- Closures capturing large objects
- Buffers retained by pending promises/timeouts
8) Case Study: The Actual Leak
The scenario
The API has an endpoint:
GET /reports/export?tenantId=...
It generates a CSV report by fetching data from a database, then formatting it.
A recent change introduced an in-memory cache to speed up repeated exports. The cache key is tenantId, but the cached value includes the entire dataset and never expires.
Under production traffic (many tenants), memory grows steadily until the container is OOM-killed.
The buggy code
A simplified example:
const reportCache = new Map(); // key: tenantId, value: huge array of rows
export async function exportReport(req, res) {
const tenantId = req.query.tenantId;
if (reportCache.has(tenantId)) {
return res.json({ rows: reportCache.get(tenantId) });
}
const rows = await db.query("SELECT * FROM big_table WHERE tenant_id = ?", [tenantId]);
// Leak: caches unbounded data forever
reportCache.set(tenantId, rows);
res.json({ rows });
}
Why this leaks:
- The Map is strongly referenced at module scope, so it lives for the process lifetime.
- Each new tenant inserts a new entry.
- Each entry may be large (arrays of rows, strings, etc.).
- No TTL, no max size, no eviction policy.
In heap snapshots, you’d see:
- Growing counts of Array and Object
- Retainers pointing to reportCache → Map → entries
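The growth pattern is easy to demonstrate in isolation. A toy version of the cache above, using a stand-in for the query result:

```javascript
// Minimal simulation of the leak: every distinct tenant adds an entry
// that is never removed, so cache size tracks tenant cardinality.
const reportCache = new Map();

function fakeExport(tenantId) {
  if (!reportCache.has(tenantId)) {
    // Stand-in for the "huge array of rows" from the real query.
    reportCache.set(tenantId, new Array(1000).fill(tenantId));
  }
  return reportCache.get(tenantId);
}

for (let i = 0; i < 500; i++) fakeExport(`tenant-${i}`);
console.log(reportCache.size); // 500 — and it only ever grows
```

Repeat traffic from known tenants is free, but every new tenant permanently increases the footprint — exactly the monotonic RSS curve seen in production.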
9) Validate the Hypothesis with Live Instrumentation
Before changing code, confirm the cache growth in production (or staging).
Add a temporary metric/log:
setInterval(() => {
const mem = process.memoryUsage();
console.log(JSON.stringify({
msg: "mem",
rss: mem.rss,
heapUsed: mem.heapUsed,
heapTotal: mem.heapTotal,
external: mem.external,
reportCacheSize: reportCache.size
}));
}, 60000).unref();
In production logs you might see:
- reportCacheSize increasing steadily
- heapUsed and rss increasing in parallel
If heapUsed grows but rss grows even more, you may also have native/buffer pressure; still, the cache is likely the driver.
10) Fix: Add Bounded Caching (TTL + Max Entries)
Choose an eviction strategy
For caches in API processes, you generally want:
- Max size: prevents unbounded growth
- TTL: ensures stale data is removed
- LRU: evicts least-recently-used items when max size is reached
A popular library is lru-cache.
Install:
npm install lru-cache
Replace the unbounded Map:
import { LRUCache } from "lru-cache"; // named export in lru-cache v7+
const reportCache = new LRUCache({
max: 200, // max number of tenants cached
ttl: 5 * 60 * 1000, // 5 minutes
allowStale: false,
updateAgeOnGet: true
});
Use it:
export async function exportReport(req, res) {
const tenantId = req.query.tenantId;
const cached = reportCache.get(tenantId);
if (cached) {
return res.json({ rows: cached });
}
const rows = await db.query("SELECT * FROM big_table WHERE tenant_id = ?", [tenantId]);
reportCache.set(tenantId, rows);
res.json({ rows });
}
Why this works
- max ensures memory can't grow beyond a bounded number of cached tenants.
- ttl ensures that even frequently accessed tenants won't keep data forever if traffic patterns change.
- LRU eviction tends to keep "hot" entries and discard "cold" ones.
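For intuition about what the library is doing under the hood, here is a toy LRU with max-size and TTL — an illustrative sketch only, not a replacement for lru-cache:

```javascript
// Toy LRU cache: bounded size + TTL, exploiting the fact that a JS Map
// preserves insertion order (first key = least recently used).
class TinyLRU {
  constructor({ max, ttlMs }) {
    this.max = max;
    this.ttlMs = ttlMs;
    this.entries = new Map();
  }
  get(key) {
    const e = this.entries.get(key);
    if (!e) return undefined;
    if (Date.now() - e.at > this.ttlMs) {
      this.entries.delete(key); // expired: drop and miss
      return undefined;
    }
    // Re-insert to mark as most recently used.
    this.entries.delete(key);
    this.entries.set(key, e);
    return e.value;
  }
  set(key, value) {
    this.entries.delete(key);
    this.entries.set(key, { value, at: Date.now() });
    if (this.entries.size > this.max) {
      // Evict the least recently used entry (first in insertion order).
      const oldest = this.entries.keys().next().value;
      this.entries.delete(oldest);
    }
  }
}
```

The key property: entries.size can never exceed max, so memory is bounded no matter how many tenants arrive.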
Consider caching only what you need
Often the best fix is not just bounding the cache, but reducing what’s cached:
- Cache a compressed representation
- Cache IDs and re-fetch details
- Cache per-page results rather than full datasets
- Cache in an external system (Redis) with explicit limits and eviction
For report exports, caching the entire dataset is usually a red flag unless the dataset is small and bounded.
11) Confirm the Fix with Load Testing
Build and run locally with a memory limit
To simulate production constraints, run the container with a memory limit:
docker build -t myorg/api:leakfix .
docker run --rm -p 8080:8080 --memory 512m --memory-swap 512m myorg/api:leakfix
Generate traffic
Use wrk to hammer the endpoint with multiple tenants:
wrk -t4 -c50 -d2m "http://localhost:8080/reports/export?tenantId=tenant-$(date +%s)"
Note that $(date +%s) is expanded once by the shell before wrk starts, so this hammers a single tenant — useful for baseline throughput, but it won't reproduce a per-tenant leak. For that, cycle through many tenants.
Create a small script:
cat > hit.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
for i in $(seq 1 2000); do
tenant="tenant-$((i % 500))"
curl -s "http://localhost:8080/reports/export?tenantId=$tenant" > /dev/null &
if (( i % 50 == 0 )); then wait; fi
done
wait
EOF
chmod +x hit.sh
./hit.sh
Watch memory:
docker stats --no-stream
Inside the container, read the server process's RSS directly (note: node -e would start a new process and report its own memory, not the server's):
docker exec -it $(docker ps -q --filter ancestor=myorg/api:leakfix) sh -lc 'grep VmRSS /proc/1/status'
With the fix, memory may rise initially (warming cache) but should plateau rather than grow indefinitely.
12) Production Rollout Plan
Safe rollout steps
- Deploy to staging with production-like traffic (if possible).
- Enable additional memory logging temporarily.
- Deploy to a small percentage of production instances.
- Watch:
- Container memory
- Restart count
- Latency and error rate
- Roll out fully.
Verify OOM behavior is controlled
Even with a fix, you should ensure:
- Container memory limits are set
- Node heap max is set appropriately
Example Docker run:
docker run -d --name api-prod-1 \
--memory 512m --memory-swap 512m \
-e NODE_OPTIONS="--max-old-space-size=384" \
myorg/api:leakfix
Why set NODE_OPTIONS:
- Prevents Node from trying to expand heap too close to the cgroup limit
- Leaves room for native memory, buffers, and overhead
13) Postmortem: What Actually Happened and How to Prevent It
Root cause summary
- A new feature introduced an in-process cache keyed by tenant ID.
- The cache stored large report datasets and never evicted entries.
- As new tenants hit the endpoint, memory grew monotonically.
- The container hit its memory limit and was OOM-killed, causing restarts and elevated latency.
Contributing factors
- Lack of cache bounds (no TTL, no max entries)
- Insufficient memory monitoring at the process level (heap vs RSS)
- No load test that simulated many tenants
- Possibly no canary rollout, so leak impacted many instances quickly
Preventative measures
- Cache design checklist
  - Always define max size and TTL
  - Prefer external caches for large datasets
  - Avoid caching request/response objects directly
- Add memory dashboards
  - RSS, heap used, external memory
  - GC pause time
  - Container OOM kill events
- Automated regression tests
  - A soak test that runs for 30–60 minutes
  - A multi-tenant traffic pattern
  - Assert memory plateaus (within reason)
- Operational guardrails
  - Canary deployments
  - Automated rollback on restart spikes
  - Alert on monotonic memory growth (slope-based alerting)
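The slope-based alerting idea (and the soak-test plateau assertion) can be sketched as a least-squares fit over periodic RSS samples. The threshold values here are illustrative; tune them to your sampling interval and workload:

```javascript
// Fit a least-squares line through periodic memory samples and flag
// sustained growth. Slope units: bytes per sample interval.
function memorySlope(samples) {
  const n = samples.length;
  const meanX = (n - 1) / 2; // x values are just the indices 0..n-1
  const meanY = samples.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (i - meanX) * (samples[i] - meanY);
    den += (i - meanX) ** 2;
  }
  return num / den;
}

function looksLikeLeak(samples, bytesPerSampleThreshold) {
  return samples.length >= 2 && memorySlope(samples) > bytesPerSampleThreshold;
}
```

In a soak test you would collect RSS once a minute and assert looksLikeLeak(samples, threshold) is false after the warm-up window.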
14) Extra: When Heap Snapshots Don’t Show the Leak
Sometimes heap snapshots look normal but RSS keeps growing. That suggests:
- Native memory leak (C++ addon, image processing library, crypto, etc.)
- Buffer accumulation outside V8 heap
- Fragmentation or allocator behavior under churn
Useful commands:
Inside container:
cat /proc/1/status | egrep 'VmRSS|VmSize|Threads'
Using pmap inside the container (Alpine-based images may need procps installed first):
docker exec -it api-prod-1 sh -lc 'apk add --no-cache procps || true; pmap -x 1 | tail -n 20'
If you suspect file descriptor leaks (can indirectly cause memory issues):
docker exec -it api-prod-1 sh -lc 'ls /proc/1/fd | wc -l'
docker exec -it api-prod-1 sh -lc 'lsof -p 1 | head'
For deeper native analysis you may need:
- jemalloc profiling (if used)
- valgrind (rare in production)
- eBPF tools (bcc, bpftrace) on the host
But in many API incidents, the leak is in application-level object retention (like unbounded caches).
15) Quick Reference Command Cheat Sheet
Host-level:
docker ps
docker ps -a
docker stats
docker inspect <container> --format '{{json .State}}' | jq
dmesg -T | egrep -i 'oom|killed process'
journalctl -u docker --since "1 hour ago"
Container-level:
docker exec -it <container> sh
ps -o pid,cmd,rss,vsz -p 1
cat /proc/1/status | egrep 'VmRSS|VmSize'
ss -lntp
Node-level:
node -v
node -e "console.log(process.memoryUsage())"
kill -USR1 1 # enable inspector (often)
Heapdump approach (requires code change):
npm install heapdump
curl -X POST http://localhost:8080/admin/heapdump
Closing Notes
Debugging memory leaks in Dockerized production systems is less about a single magic tool and more about a disciplined process:
- Stabilize the incident.
- Confirm memory growth and whether it’s heap vs RSS.
- Capture evidence (snapshots/profiles) at multiple times.
- Identify retention paths and the specific code responsible.
- Fix with bounded resource usage and validate under load.
- Roll out safely and add guardrails to prevent recurrence.
If you want, share:
- your runtime (Node/Java/Python/Go),
- orchestration (Docker Compose/Kubernetes/ECS),
- and a few details about memory graphs or OOM logs, and I can adapt the commands and analysis workflow to your exact environment.