
Production Incident Walkthrough: Debugging a Memory Leak in a Dockerized API



This tutorial is a realistic, end-to-end walkthrough of diagnosing and fixing a memory leak in a Dockerized API running in production. It focuses on practical steps, real commands, and the reasoning behind each decision so you can adapt the approach to your own stack.

We’ll assume:

  • A Node.js (Express) API running in a Docker container on a Linux host
  • Shell access to the Docker host (and docker exec access to the container)
  • Basic observability: container logs and, ideally, some metrics

1) Incident Symptoms and Initial Triage

Typical alert signals

A memory leak in production often first appears as one or more of:

  • Containers restarting repeatedly with exit code 137 (OOM-killed)
  • Memory usage climbing steadily between deploys and never returning to baseline
  • Rising latency and longer GC pauses shortly before each crash

You might see alerts like:

  • “Container restart count > 3 in 10 minutes”
  • “Container memory > 90% of limit”

Confirm what’s actually failing

On a Docker host, start by checking container restarts and recent exits:

docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.RunningFor}}'
docker ps -a --format 'table {{.Names}}\t{{.Status}}' | head

The exit code appears in the Status column (for example, Exited (137) 2 minutes ago); docker ps has no separate {{.ExitCode}} or {{.FinishedAt}} placeholders — those live in docker inspect.

If you suspect OOM, inspect the container state:

docker inspect api-prod-1 --format '{{json .State}}' | jq

Look for:

  • "OOMKilled": true
  • "ExitCode": 137 (128 + SIGKILL, the kernel OOM killer’s signature)
  • FinishedAt timestamps that line up with your memory alerts

Also check host logs:

dmesg -T | egrep -i 'killed process|out of memory|oom' | tail -n 50

If you’re on systemd-based Linux, you can also check journald for Docker:

journalctl -u docker --since "2 hours ago" | tail -n 200

Quick stabilization (don’t skip this in real life)

Before deep debugging, reduce blast radius:

  • Restart the leaking container(s) to reset memory and buy time
  • Scale out (or confirm a restart policy exists) so a single OOM kill doesn’t cause an outage
  • If the host has headroom, temporarily raise the container’s memory limit

Example: temporarily increase memory limit for a container (standalone Docker):

docker update --memory 1.5g --memory-swap 1.5g api-prod-1

In Kubernetes you’d adjust resources.limits.memory and redeploy, but the concept is the same: stabilize first, then investigate.


2) Establish Baselines: Is Memory Actually Growing?

Use docker stats for a fast view

docker stats --no-stream

If you can watch over time:

docker stats api-prod-1

You’re looking for a pattern: memory usage rises steadily and never returns to baseline after traffic drops.

Confirm cgroup memory usage from inside the container

Exec into the container:

docker exec -it api-prod-1 sh

Depending on cgroups version:

cgroups v1:

cat /sys/fs/cgroup/memory/memory.usage_in_bytes
cat /sys/fs/cgroup/memory/memory.limit_in_bytes

cgroups v2:

cat /sys/fs/cgroup/memory.current
cat /sys/fs/cgroup/memory.max

This tells you what the kernel thinks the container is using, which is the authoritative source for OOM behavior.
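If you check this often, the same logic is easy to script. A minimal Node sketch (the cgroup root is parameterized so it can point at a test directory; the name readCgroupMemory is ours):

```javascript
import fs from "node:fs";
import path from "node:path";

// Read container memory usage and limit, handling cgroup v2 and v1 layouts.
// `root` defaults to the standard mount point; it is a parameter for testing.
function readCgroupMemory(root = "/sys/fs/cgroup") {
  const layouts = [
    { usage: "memory.current", limit: "memory.max", version: 2 },
    { usage: "memory/memory.usage_in_bytes", limit: "memory/memory.limit_in_bytes", version: 1 }
  ];
  for (const l of layouts) {
    const usagePath = path.join(root, l.usage);
    if (!fs.existsSync(usagePath)) continue;
    const usage = Number(fs.readFileSync(usagePath, "utf8").trim());
    const rawLimit = fs.readFileSync(path.join(root, l.limit), "utf8").trim();
    // cgroup v2 writes the literal string "max" when no limit is set.
    const limit = rawLimit === "max" ? Infinity : Number(rawLimit);
    return { usage, limit, version: l.version };
  }
  return null; // not running under cgroups (e.g. a macOS dev machine)
}
```

With a limit set, usage / limit approaching 1 means the kernel OOM killer is close.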

Distinguish “heap leak” vs “RSS growth” vs “cache”

In Node.js, memory can grow for different reasons:

  • V8 heap growth: JavaScript objects kept alive by lingering references (the classic leak)
  • External/native memory: Buffers, native addons, zlib/TLS internals — counted in rss but mostly outside the V8 heap
  • Allocator behavior: rss can stay high after the heap shrinks because freed pages aren’t returned to the OS

From inside the container:

ps aux | head
ps -o pid,ppid,cmd,%mem,%cpu,rss,vsz -p 1

If PID 1 is your Node process, rss is a key indicator: it’s the actual resident memory.
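To make the heap-vs-RSS distinction concrete, you can compare two process.memoryUsage() samples and see which component drives the growth. A rough sketch (the 70% threshold is an arbitrary heuristic of ours, not a standard):

```javascript
// Classify growth between two process.memoryUsage()-shaped samples.
// If most of the RSS increase is V8 heap, suspect a JavaScript-level leak;
// otherwise suspect native memory (Buffers, addons) or fragmentation.
function classifyGrowth(baseline, current) {
  const rssDelta = current.rss - baseline.rss;
  const heapDelta = current.heapUsed - baseline.heapUsed;
  if (rssDelta <= 0) return "no-growth";
  return heapDelta / rssDelta >= 0.7 ? "heap-driven" : "native-or-fragmentation";
}
```

For example, rss up 200MiB with heapUsed up only 10MiB points away from plain JavaScript object retention.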


3) Gather Evidence: Logs, Metrics, and a Repro Timeline

Correlate memory growth with traffic or endpoints

If you have access logs, identify whether memory growth correlates with:

  • Specific endpoints (report exports and file uploads are frequent offenders)
  • Overall request volume
  • Particular tenants or unusually large payloads

If you have Prometheus/Grafana, look at:

  • container_memory_working_set_bytes (or your equivalent container memory metric)
  • Request rate per route overlaid on the memory curve
  • Restart/OOM events plotted against memory growth

If you don’t have route-level metrics, you can still add temporary logging (carefully) to identify suspicious endpoints.

Enable Node runtime metrics (if not already)

If you can redeploy quickly, consider exposing basic process metrics. For example, prom-client’s default metrics include:

  • process_resident_memory_bytes
  • nodejs_heap_size_used_bytes and nodejs_heap_size_total_bytes
  • nodejs_external_memory_bytes

But during an incident, you might not have that ready. So we’ll proceed with low-level tooling.


4) Decide on a Debug Strategy

You generally have two tracks:

  1. In-production forensic approach: attach to the running process and gather heap snapshots / profiles.
  2. Staging reproduction: replay traffic or run a load test to reproduce the leak safely.

Often you do both:

  • Capture forensic evidence from production while the leak is live
  • Reproduce in staging, where you can iterate and take invasive measurements freely


5) Inspect Node Memory from the Outside

Check Node version and flags

From inside the container:

node -v
ps -ef
cat /proc/1/cmdline | tr '\0' ' '

You’re looking for flags like:

  • --max-old-space-size=<MiB> — caps V8’s old-generation heap
  • --inspect / --inspect-port — is the inspector already enabled?
  • Anything injected via the NODE_OPTIONS environment variable

If --max-old-space-size is too high relative to container limit, Node may be killed by the kernel before it can throw an out-of-memory error. A good practice is to set Node’s heap limit below the container memory limit to leave headroom for native allocations and overhead.

Example: if container limit is 512MiB, set Node heap to ~384MiB:

node --max-old-space-size=384 server.js
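The 384MiB figure is simply 75% of the 512MiB container limit. If you want to derive the flag value consistently, a tiny helper works (the 25% headroom fraction is a rule of thumb, not an official recommendation):

```javascript
// Suggest a --max-old-space-size value (MiB) from a container memory limit,
// reserving a fraction for native allocations, thread stacks, and code space.
function suggestHeapMiB(containerLimitMiB, headroomFraction = 0.25) {
  if (containerLimitMiB <= 0) throw new RangeError("limit must be positive");
  return Math.floor(containerLimitMiB * (1 - headroomFraction));
}

// suggestHeapMiB(512) === 384, matching the example above.
```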

6) Capture Heap Snapshots in a Running Container (Node.js)

Option A: Use the Node inspector (most common)

If your container exposes the inspector port, you can connect. But in production it often isn’t enabled.

You can still enable it after the fact: Node activates the inspector when it receives SIGUSR1 (on non-Windows platforms). From inside the container:

kill -USR1 1

This typically starts the inspector on 127.0.0.1:9229. Confirm:

ss -lntp | grep 9229 || netstat -lntp | grep 9229

Now you need access to that port. If you’re on the Docker host:

docker exec -it api-prod-1 sh -lc "ss -lntp | grep 9229"

You can reach that port by running a temporary socat container in the same network namespace, or more simply by using docker exec plus curl against the inspector’s local HTTP endpoints. Note that the HTTP endpoints only handle discovery (like /json/list); actually capturing a heap snapshot requires speaking the DevTools protocol over the WebSocket URL they return.

Option B: Trigger heap snapshot via the inspector HTTP endpoint

Once inspector is enabled, list targets:

curl -s http://127.0.0.1:9229/json/list | jq

You’ll see a webSocketDebuggerUrl. You can use tools like chrome://inspect (locally) by port-forwarding, but here’s a production-friendly approach:

A common incident workflow is to temporarily deploy a build with a safe admin endpoint that triggers heap snapshots to a mounted volume or object storage. If you can’t redeploy, you can still do it with inspector, but it’s more fiddly.

Add heapdump and a protected endpoint (on Node 11.13+, the built-in v8.writeHeapSnapshot() does the same job without a dependency). Example (Express):

npm install heapdump

Code snippet:

import heapdump from "heapdump";
import path from "path";
import express from "express";

const app = express();

app.post("/admin/heapdump", (req, res) => {
  // Protect this endpoint with auth in real life.
  const dir = process.env.HEAPDUMP_DIR || "/tmp";
  const filename = path.join(dir, `heap-${Date.now()}.heapsnapshot`);

  heapdump.writeSnapshot(filename, (err, filePath) => {
    if (err) return res.status(500).json({ error: String(err) });
    res.json({ ok: true, filePath });
  });
});

Then mount a volume so the snapshot survives container restarts:

docker run -d --name api-prod-1 \
  -p 8080:8080 \
  -v /var/lib/api-heapdumps:/heapdumps \
  -e HEAPDUMP_DIR=/heapdumps \
  myorg/api:debug-heapdump

Trigger snapshots at two points in time:

  1. Shortly after deploy (baseline)
  2. After memory has grown significantly

curl -X POST http://localhost:8080/admin/heapdump
sleep 1800
curl -X POST http://localhost:8080/admin/heapdump
ls -lh /var/lib/api-heapdumps

Why two snapshots? A single snapshot shows what’s in memory, but not what’s growing. Comparing snapshots reveals which object types and retainers increase over time.


7) Analyze Heap Snapshots (Chrome DevTools)

Copy snapshots to your workstation:

scp user@prod-host:/var/lib/api-heapdumps/heap-*.heapsnapshot .

Open Chrome DevTools:

  1. Open chrome://inspect
  2. “Open dedicated DevTools for Node”
  3. Or open DevTools → Memory tab → Load snapshot file

What to look for

In the Memory panel:

  • Load both snapshots and switch to the Comparison view, sorted by size delta
  • Look at Retained Size (what an object keeps alive), not just Shallow Size
  • Walk the Retainers pane upward to the variable or closure pinning the memory

Common leak signatures:

  • A Map, Object, or Array whose entry count keeps growing between snapshots
  • Closures retaining large request/response objects
  • Ever-growing collections of event listeners or timers


8) Case Study: The Actual Leak

The scenario

The API has an endpoint:

GET /reports/export?tenantId=<id>

It generates a CSV report by fetching data from a database, then formatting it.

A recent change introduced an in-memory cache to speed up repeated exports. The cache key is tenantId, but the cached value includes the entire dataset and never expires.

Under production traffic (many tenants), memory grows steadily until the container is OOM-killed.

The buggy code

A simplified example:

const reportCache = new Map(); // key: tenantId, value: huge array of rows

export async function exportReport(req, res) {
  const tenantId = req.query.tenantId;

  if (reportCache.has(tenantId)) {
    return res.json({ rows: reportCache.get(tenantId) });
  }

  const rows = await db.query("SELECT * FROM big_table WHERE tenant_id = ?", [tenantId]);

  // Leak: caches unbounded data forever
  reportCache.set(tenantId, rows);

  res.json({ rows });
}

Why this leaks:

  • Every distinct tenantId adds a cache entry that is never removed
  • Each entry retains the tenant’s entire result set, potentially thousands of rows
  • reportCache is module-level, so entries live for the life of the process

In heap snapshots, you’d see:

  • One Map whose retained size dominates the heap and grows between snapshots
  • Large arrays of row objects reachable only through that Map
  • Retainer chains terminating at the module-level reportCache variable

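The growth mechanics are easy to reproduce in isolation, with no database involved: replay requests from many distinct tenants against a module-level Map and watch the entry count climb with tenant cardinality (a toy sketch; simulateExport stands in for the real handler):

```javascript
// Toy reproduction of the leak: each distinct tenantId pins a full result set.
const cache = new Map();

function simulateExport(tenantId) {
  if (cache.has(tenantId)) return cache.get(tenantId);
  // Stand-in for db.query: 1,000 row objects per tenant.
  const rows = Array.from({ length: 1000 }, (_, i) => ({ id: i, tenantId }));
  cache.set(tenantId, rows); // never evicted -> unbounded growth
  return rows;
}

// 500 distinct tenants -> 500 entries -> 500,000 retained row objects.
for (let i = 0; i < 500; i++) simulateExport(`tenant-${i}`);
```

Repeat requests for known tenants don’t grow the cache; only new tenants do — which is exactly why low-cardinality test traffic hides the bug.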

9) Validate the Hypothesis with Live Instrumentation

Before changing code, confirm the cache growth in production (or staging).

Add a temporary metric/log:

setInterval(() => {
  const mem = process.memoryUsage();
  console.log(JSON.stringify({
    msg: "mem",
    rss: mem.rss,
    heapUsed: mem.heapUsed,
    heapTotal: mem.heapTotal,
    external: mem.external,
    reportCacheSize: reportCache.size
  }));
}, 60000).unref();

In production logs you might see something like (values illustrative):

{"msg":"mem","rss":912261120,"heapUsed":804913152,"heapTotal":851443712,"external":18874368,"reportCacheSize":1843}

with reportCacheSize and heapUsed climbing together, sample after sample.

If heapUsed grows but rss grows even more, you may also have native/buffer pressure; still, the cache is likely the driver.


10) Fix: Add Bounded Caching (TTL + Max Entries)

Choose an eviction strategy

For caches in API processes, you generally want:

  • A hard cap on entry count (or total bytes)
  • A TTL, so stale entries are dropped even under low traffic
  • LRU eviction, so hot keys survive when the cap is hit

A popular library is lru-cache.

Install:

npm install lru-cache

Replace the unbounded Map:

import { LRUCache } from "lru-cache"; // v7+; older versions used a default export

const reportCache = new LRUCache({
  max: 200,               // max number of tenants cached
  ttl: 5 * 60 * 1000,     // 5 minutes
  allowStale: false,
  updateAgeOnGet: true
});

Use it:

export async function exportReport(req, res) {
  const tenantId = req.query.tenantId;

  const cached = reportCache.get(tenantId);
  if (cached) {
    return res.json({ rows: cached });
  }

  const rows = await db.query("SELECT * FROM big_table WHERE tenant_id = ?", [tenantId]);
  reportCache.set(tenantId, rows);

  res.json({ rows });
}

Why this works

  • max bounds the worst case: memory is capped no matter how many tenants exist
  • ttl guarantees entries eventually expire, even ones never evicted by size
  • LRU eviction keeps hot tenants cached while cold ones are discarded

Consider caching only what you need

Often the best fix is not just bounding the cache, but reducing what’s cached:

  • Cache the rendered CSV string instead of raw row objects
  • Cache paginated query results rather than entire tables
  • Move large, shared datasets to an external cache (e.g. Redis) with its own eviction policy

For report exports, caching the entire dataset is usually a red flag unless the dataset is small and bounded.
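If pulling in a dependency mid-incident isn’t an option, the core idea — max entries, TTL, least-recently-used eviction — fits in a few lines of Map bookkeeping. A minimal sketch (no stale-serving or byte-size accounting, unlike lru-cache; timestamps are injectable for testing):

```javascript
// Minimal bounded cache: max entry count + TTL, least-recently-used eviction.
// Relies on Map preserving insertion order; re-inserting on get() marks recency.
class BoundedCache {
  constructor({ max = 200, ttlMs = 5 * 60 * 1000 } = {}) {
    this.max = max;
    this.ttlMs = ttlMs;
    this.map = new Map(); // key -> { value, expires }
  }
  get(key, now = Date.now()) {
    const entry = this.map.get(key);
    if (!entry) return undefined;
    if (entry.expires <= now) { this.map.delete(key); return undefined; }
    this.map.delete(key);       // move to the "most recently used" end
    this.map.set(key, entry);
    return entry.value;
  }
  set(key, value, now = Date.now()) {
    this.map.delete(key);
    this.map.set(key, { value, expires: now + this.ttlMs });
    // Oldest entry is first in iteration order; evict it when over the cap.
    if (this.map.size > this.max) {
      this.map.delete(this.map.keys().next().value);
    }
  }
}
```

Map iteration follows insertion order, so deleting and re-inserting on get() keeps the least-recently-used key first in line for eviction.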


11) Confirm the Fix with Load Testing

Build and run locally with a memory limit

To simulate production constraints, run the container with a memory limit:

docker build -t myorg/api:leakfix .
docker run --rm -p 8080:8080 --memory 512m --memory-swap 512m myorg/api:leakfix

Generate traffic

Use wrk to hammer the endpoint with multiple tenants:

wrk -t4 -c50 -d2m "http://localhost:8080/reports/export?tenantId=tenant-$(date +%s)"

The command above bakes a single tenantId into the URL when the shell expands $(date +%s), so every request hits the same tenant; better to cycle through many tenants.

Create a small script:

cat > hit.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail
for i in $(seq 1 2000); do
  tenant="tenant-$((i % 500))"
  curl -s "http://localhost:8080/reports/export?tenantId=$tenant" > /dev/null &
  if (( i % 50 == 0 )); then wait; fi
done
wait
EOF
chmod +x hit.sh
./hit.sh

Watch memory:

docker stats --no-stream

Inside the container, read the server process’s memory directly (running node -e via docker exec would measure the freshly spawned node process, not your server):

docker exec -it $(docker ps -q --filter ancestor=myorg/api:leakfix) sh -lc 'grep VmRSS /proc/1/status'

With the fix, memory may rise initially (warming cache) but should plateau rather than grow indefinitely.


12) Production Rollout Plan

Safe rollout steps

  1. Deploy to staging with production-like traffic (if possible).
  2. Enable additional memory logging temporarily.
  3. Deploy to a small percentage of production instances.
  4. Watch:
    • Container memory
    • Restart count
    • Latency and error rate
  5. Roll out fully.

Verify OOM behavior is controlled

Even with a fix, you should ensure:

  • The container has an explicit memory limit, so one runaway process can’t starve the host
  • Node’s heap limit sits below the container limit, leaving headroom for native memory
  • A restart policy (or your orchestrator) brings the container back cleanly if it does die

Example Docker run:

docker run -d --name api-prod-1 \
  --memory 512m --memory-swap 512m \
  -e NODE_OPTIONS="--max-old-space-size=384" \
  myorg/api:leakfix

Why set NODE_OPTIONS:

  • With --max-old-space-size=384 under a 512MiB container limit, V8 aborts with a “JavaScript heap out of memory” error and a stack trace before the kernel OOM killer fires
  • A Node-level abort is loggable and debuggable; a kernel OOM kill is a silent SIGKILL


13) Postmortem: What Actually Happened and How to Prevent It

Root cause summary

An in-memory report cache keyed by tenantId stored full result sets with no size bound and no TTL. Under multi-tenant production traffic, the cache grew monotonically until the kernel OOM-killed the container.

Contributing factors

  • The change was tested against a handful of tenants, so the growth never appeared locally
  • No route-level or cache-size metrics existed to surface the correlation early
  • Cache design (max size, TTL) wasn’t part of the code review checklist

Preventative measures

  1. Cache design checklist

    • Always define max size and TTL
    • Prefer external caches for large datasets
    • Avoid caching request/response objects directly
  2. Add memory dashboards

    • RSS, heap used, external memory
    • GC pause time
    • Container OOM kill events
  3. Automated regression tests

    • A soak test that runs for 30–60 minutes
    • A multi-tenant traffic pattern
    • Assert memory plateaus (within reason)
  4. Operational guardrails

    • Canary deployments
    • Automated rollback on restart spikes
    • Alert on monotonic memory growth (slope-based alerting)
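The last guardrail, slope-based alerting, can be approximated by fitting a least-squares line to recent memory samples and alerting when the slope stays above a threshold. A sketch (the threshold and minimum window size are illustrative, not standards):

```javascript
// Least-squares slope (bytes per second) over an array of { t, bytes } samples,
// where t is a timestamp in seconds.
function memorySlope(samples) {
  const n = samples.length;
  const meanT = samples.reduce((s, p) => s + p.t, 0) / n;
  const meanB = samples.reduce((s, p) => s + p.bytes, 0) / n;
  let num = 0, den = 0;
  for (const p of samples) {
    num += (p.t - meanT) * (p.bytes - meanB);
    den += (p.t - meanT) ** 2;
  }
  return den === 0 ? 0 : num / den;
}

// Alert if memory grows faster than `bytesPerSec` sustained across the window.
function shouldAlert(samples, bytesPerSec = 1024 * 1024) {
  return samples.length >= 5 && memorySlope(samples) > bytesPerSec;
}
```

Unlike a static threshold, a slope fit catches slow leaks long before the limit is hit, and ignores a high-but-flat baseline.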

14) Extra: When Heap Snapshots Don’t Show the Leak

Sometimes heap snapshots look normal but RSS keeps growing. That suggests:

  • Native memory growth: Buffers, native addons, zlib/crypto internals
  • Allocator fragmentation: freed heap pages that aren’t returned to the OS
  • Growth outside the main process: worker threads or child processes

Useful commands:

Inside container:

cat /proc/1/status | egrep 'VmRSS|VmSize|Threads'

On host, using pmap (if available):

docker exec -it api-prod-1 sh -lc 'apk add --no-cache procps || true; pmap -x 1 | tail -n 20'

If you suspect file descriptor leaks (can indirectly cause memory issues):

docker exec -it api-prod-1 sh -lc 'ls /proc/1/fd | wc -l'
docker exec -it api-prod-1 sh -lc 'lsof -p 1 | head'

For deeper native analysis you may need:

  • heaptrack or valgrind’s massif tool in a staging environment
  • Swapping in jemalloc and using its allocation profiling
  • node --trace-gc output over time, to separate GC behavior from native growth

But in many API incidents, the leak is in application-level object retention (like unbounded caches).


15) Quick Reference Command Cheat Sheet

Host-level:

docker ps
docker ps -a
docker stats
docker inspect <container> --format '{{json .State}}' | jq
dmesg -T | egrep -i 'oom|killed process'
journalctl -u docker --since "1 hour ago"

Container-level:

docker exec -it <container> sh
ps -o pid,cmd,rss,vsz -p 1
cat /proc/1/status | egrep 'VmRSS|VmSize'
ss -lntp

Node-level:

node -v
node -e "console.log(process.memoryUsage())"
kill -USR1 1   # enable inspector (often)

Heapdump approach (requires code change):

npm install heapdump
curl -X POST http://localhost:8080/admin/heapdump

Closing Notes

Debugging memory leaks in Dockerized production systems is less about a single magic tool and more about a disciplined process:

  1. Stabilize the incident.
  2. Confirm memory growth and whether it’s heap vs RSS.
  3. Capture evidence (snapshots/profiles) at multiple times.
  4. Identify retention paths and the specific code responsible.
  5. Fix with bounded resource usage and validate under load.
  6. Roll out safely and add guardrails to prevent recurrence.
