Handling Graceful Shutdowns: Fixing Stuck or Zombie Containers in Production
Production container platforms are optimized for starting and stopping workloads quickly. But “stop” is not a single action: it is a sequence of signals, timeouts, process behavior, and kernel mechanics. When that sequence breaks, you get containers that won’t die, containers that are “Exited” but still hold resources, or “zombie” processes accumulating inside a container. This tutorial explains why that happens and how to fix it—using real commands and production-safe patterns.
Table of Contents
- 1. What “graceful shutdown” means for containers
- 2. The signal flow: Docker, containerd, Kubernetes
- 3. Common failure modes that create stuck or zombie containers
- 4. Diagnosing a stuck container (host and inside-container)
- 5. Fixing zombie processes: PID 1, init systems, and reaping
- 6. Fixing containers that ignore SIGTERM
- 7. Fixing containers stuck in Stopping or unkillable (D state)
- 8. Kubernetes specifics: terminationGracePeriodSeconds, preStop, and probes
- 9. Practical hardening patterns (Dockerfile, entrypoint, app code)
- 10. Incident playbook: step-by-step commands
- 11. Prevention checklist
1. What “graceful shutdown” means for containers
A container is not a VM; it’s a set of Linux processes isolated by namespaces and controlled by cgroups. Stopping a container typically means:
- Send a “please exit” signal (usually SIGTERM) to the container’s main process (PID 1 inside the container).
- Wait for a grace period.
- If it hasn’t exited, send SIGKILL (force kill).
- Tear down networking, cgroups, mounts, and release resources.
A graceful shutdown is successful when:
- The application receives the termination signal.
- It stops accepting new work.
- It finishes or cancels in-flight work within the grace period.
- It flushes buffers, closes sockets, releases locks, and exits.
- The process tree is cleaned up (no zombies), and the runtime can remove the container.
When it fails, you may observe:
- docker stop hangs or takes the full timeout.
- docker rm -f fails or hangs.
- Kubernetes pods stuck in Terminating.
- Containers that “Exited” but still have child processes (rare but possible with misconfigured runtimes or host issues).
- Zombie processes inside the container (processes in Z state).
- “Unkillable” processes in D state (uninterruptible sleep), often due to kernel/I/O issues.
2. The signal flow: Docker, containerd, Kubernetes
Docker (classic behavior)
docker stop <container>:
- Sends SIGTERM to PID 1 in the container.
- Waits --time seconds (default 10).
- Sends SIGKILL if still running.
Commands:
docker stop --time 20 myapp
docker kill --signal=SIGTERM myapp
docker kill --signal=SIGKILL myapp
containerd / runc (under the hood)
Docker and Kubernetes ultimately rely on an OCI runtime (commonly runc). The runtime sends signals to the container process and manages cgroups and namespaces. If the runtime can’t signal or can’t reap, you can see “stuck” states.
Kubernetes
Kubernetes termination sequence (simplified):
- Pod gets a deletion timestamp.
- Endpoints are updated (pod removed from Service endpoints).
- If defined, the preStop hook runs.
- Kubelet asks the runtime to stop the container:
  - Sends SIGTERM.
  - Waits terminationGracePeriodSeconds.
  - Sends SIGKILL.
If your app needs 30 seconds to drain connections, but grace is 10 seconds, you’ll see forced kills and potentially corrupted work.
3. Common failure modes that create stuck or zombie containers
A) PID 1 doesn’t forward signals
Inside a container, PID 1 has special semantics: it may ignore some signals by default, and it is responsible for reaping orphaned child processes. If PID 1 is a shell script that doesn’t exec the real app, signals may never reach the app.
Bad pattern:
#!/bin/sh
myserver & # runs in background
wait # PID 1 waits, but signal handling is often wrong here
Better pattern:
#!/bin/sh
exec myserver
B) PID 1 doesn’t reap children → zombies
If your app spawns child processes and doesn’t wait() for them, they become zombies (STAT=Z). On a regular Linux host, orphaned children are reparented to systemd (PID 1), which reaps them when they exit. In a container, your app is PID 1 and must do that reaping itself, or you need a minimal init.
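The mechanism is easy to reproduce. A minimal sketch (Python on Linux, run as any ordinary process): spawn a child, let it exit, and never wait for it; the kernel keeps the child’s process-table entry in Z state until the parent collects the exit status.

```python
import subprocess
import time

# Spawn a short-lived child and deliberately never wait() for it.
child = subprocess.Popen(["true"])
time.sleep(0.5)  # give the child time to exit

# The third field of /proc/<pid>/stat is the process state;
# "Z" marks a zombie awaiting reaping by its parent.
with open(f"/proc/{child.pid}/stat") as f:
    state = f.read().split()[2]
print(state)  # "Z" until child.wait() collects the exit status
```

Calling child.wait() afterwards removes the entry; an app that never does this accumulates one zombie per exited child.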
C) App ignores SIGTERM or blocks shutdown
Common causes:
- Not registering signal handlers (or using frameworks incorrectly).
- Long blocking I/O without cancellation.
- Deadlocks on shutdown (e.g., waiting for a goroutine/thread that waits for a lock held by shutdown path).
- Not closing listeners, so the process never exits.
D) Uninterruptible sleep (D state)
If a process is stuck in kernel space (often I/O), SIGKILL won’t kill it. This is not a “container problem”; it’s a host/kernel/storage problem. Symptoms:
- docker kill --signal=SIGKILL has no effect.
- ps shows D state.
- Often related to NFS, hung disks, FUSE, overlayfs issues, or kernel bugs.
E) Runtime / cgroup cleanup issues
Sometimes the process exits but cgroup cleanup hangs due to kernel or runtime issues. You might see containers stuck in “Removing” or “Dead”.
4. Diagnosing a stuck container (host and inside-container)
4.1 Identify the container and state
docker ps -a --no-trunc
docker inspect -f '{{.State.Status}} {{.State.Running}} {{.State.Pid}} {{.State.FinishedAt}}' myapp
If .State.Pid is non-zero, the container still has a running init process on the host.
4.2 Check what PID 1 is doing (from the host)
Get the host PID:
PID=$(docker inspect -f '{{.State.Pid}}' myapp)
echo "$PID"
Inspect process state:
ps -o pid,ppid,stat,etime,cmd -p "$PID"
cat /proc/"$PID"/status | sed -n '1,40p'
If you see State: D (disk sleep) or STAT includes D, you likely have an unkillable process.
Check open files and what it’s waiting on:
sudo ls -l /proc/"$PID"/fd | head
sudo cat /proc/"$PID"/wchan
If wchan shows something like nfs_*, fuse_*, or block I/O wait, suspect storage.
4.3 Enter the container’s namespaces without relying on docker exec
If docker exec hangs (it can if the runtime is unhealthy), use nsenter:
sudo nsenter -t "$PID" -m -u -i -n -p -- bash -lc 'ps auxf'
If the image doesn’t have bash, use sh:
sudo nsenter -t "$PID" -m -u -i -n -p -- sh -lc 'ps -eo pid,ppid,stat,cmd --forest'
4.4 Look for zombies
Inside the container namespace:
ps -eo pid,ppid,stat,cmd | awk '$3 ~ /Z/ {print}'
Or a quick count:
ps -eo stat | grep -c Z
If zombies exist and PID 1 is not reaping, they will accumulate over time.
4.5 Check signal handling quickly
From the host, send SIGTERM and see if it exits:
docker kill --signal=SIGTERM myapp
sleep 2
docker inspect -f '{{.State.Running}}' myapp
If it stays running, either it ignores SIGTERM, is stuck, or PID 1 is not your app.
5. Fixing zombie processes: PID 1, init systems, and reaping
5.1 Why zombies happen in containers
A zombie process is a process that has exited but still has an entry in the process table because its parent hasn’t collected its exit status via wait().
In a container:
- If your application is PID 1 and spawns children, it must wait() for them.
- Many apps do not implement a proper reaper loop.
- Shell scripts used as entrypoints often mishandle child processes.
5.2 Use a minimal init: tini (recommended)
tini is a tiny init process that:
- Forwards signals to your app.
- Reaps zombie processes.
Docker run:
docker run --init myimage:latest
Docker’s --init uses tini under the hood on many installations.
Dockerfile approach (explicit):
FROM debian:stable-slim
RUN apt-get update && apt-get install -y --no-install-recommends tini ca-certificates \
&& rm -rf /var/lib/apt/lists/*
ENTRYPOINT ["/usr/bin/tini","--"]
CMD ["./myserver"]
5.3 If you must use a shell entrypoint, exec properly
Bad:
#!/bin/sh
./myserver
This keeps the shell as PID 1; signals go to the shell, not necessarily to myserver.
Good:
#!/bin/sh
exec ./myserver
Now myserver becomes PID 1 and receives signals directly.
5.4 For apps that spawn children: ensure reaping
If you’re writing the app, implement child reaping or avoid spawning unmanaged children. For example, in Go you typically don’t need to spawn OS processes for concurrency; use goroutines. If you do spawn processes, call Wait() and handle SIGCHLD.
If you can’t change the app, use tini or dumb-init.
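If changing the app is an option, the reaper can be as small as a SIGCHLD handler that drains every finished child with a non-blocking waitpid loop. A minimal Python sketch (the function name is illustrative):

```python
import os
import signal

def reap_children(signum, frame):
    # Collect exit statuses of all finished children so none
    # linger as zombies in the process table.
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            return  # no children left at all
        if pid == 0:
            return  # children exist, but none have exited yet

signal.signal(signal.SIGCHLD, reap_children)
```

The WNOHANG loop matters: a single SIGCHLD can stand in for several exited children, so the handler must keep reaping until there is nothing left to collect.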
6. Fixing containers that ignore SIGTERM
6.1 Confirm what signal is sent and what the app receives
Docker sends SIGTERM by default. Some apps only handle SIGINT (Ctrl+C) in dev setups. You can test:
docker kill --signal=SIGINT myapp
If SIGINT works but SIGTERM doesn’t, fix the app to handle SIGTERM correctly.
6.2 Ensure PID 1 is the app (not a wrapper)
Check:
docker exec myapp ps -p 1 -o pid,cmd
If PID 1 is sh, bash, python entrypoint.py, or a supervisor, ensure it forwards signals and exits when the child exits.
6.3 Increase stop timeout (as a mitigation)
If the app is slow but correct:
docker stop --time 60 myapp
For Compose:
docker compose stop -t 60
This is not a “fix” if the app never exits, but it prevents premature SIGKILL for workloads that legitimately need time to drain.
6.4 Application-level shutdown patterns (what “good” looks like)
A robust server shutdown generally does:
- Stop accepting new connections (close listener).
- Signal worker pools to stop.
- Set deadlines for in-flight requests.
- Flush logs/metrics.
- Exit with a clean code.
If you run HTTP services behind a load balancer, also consider:
- Draining keep-alive connections.
- Returning 503 quickly during the shutdown window.
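For illustration, here is a sketch of that sequence using Python’s stdlib http.server; the handler class is a placeholder, and a real service would also set deadlines on in-flight requests:

```python
import signal
import threading
from http.server import HTTPServer, BaseHTTPRequestHandler

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep the example quiet

# Port 0 asks the OS for any free port.
server = HTTPServer(("127.0.0.1", 0), Handler)
shutting_down = threading.Event()

def handle_term(signum, frame):
    shutting_down.set()
    # shutdown() makes serve_forever() return after the request it is
    # currently handling; run it off the signal handler so the handler
    # itself returns promptly.
    threading.Thread(target=server.shutdown).start()

signal.signal(signal.SIGTERM, handle_term)
signal.signal(signal.SIGINT, handle_term)
# In the real service: server.serve_forever(), then flush, server_close(), exit 0.
```

After serve_forever() returns, server.server_close() releases the listening socket so the process can exit with a clean code.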
7. Fixing containers stuck in Stopping or unkillable (D state)
7.1 First attempt: normal stop, then SIGKILL
docker stop --time 20 myapp
docker kill --signal=SIGKILL myapp
If docker kill returns success but the container remains running, the process may be in D state or the runtime is stuck.
7.2 Inspect host PID and process state
PID=$(docker inspect -f '{{.State.Pid}}' myapp)
ps -o pid,stat,wchan,cmd -p "$PID"
If stat includes D, you cannot kill it from userspace. Your options shift to fixing the underlying kernel wait condition.
7.3 Typical root causes of D state in production
- NFS mount hung (common with network storage hiccups).
- Block device latency/hang.
- overlayfs issues under heavy I/O.
- FUSE filesystem deadlock.
- Kernel bugs or resource exhaustion.
Check kernel logs:
dmesg -T | tail -n 200
journalctl -k --since "30 min ago"
Look for I/O errors, NFS timeouts, or hung task warnings.
7.4 If the container uses NFS or remote volumes
List mounts used by the process:
sudo cat /proc/"$PID"/mountinfo | head -n 50
sudo lsof -p "$PID" | head
If you suspect NFS, see NFS stats:
nfsstat -m 2>/dev/null || true
Mitigations:
- Fix the storage/network issue.
- Consider mounting NFS with options that fail faster (careful: this changes semantics).
- Avoid putting critical shutdown paths on remote storage (e.g., writing final state to NFS during SIGTERM).
7.5 When removal is stuck: restart runtime services (last resort)
On a Docker host (systemd-based), restarting Docker can release runtime deadlocks, but it can also disrupt running containers. Use extreme caution.
sudo systemctl status docker
sudo systemctl restart docker
On Kubernetes nodes with containerd:
sudo systemctl status containerd
sudo systemctl restart containerd
If a process is truly unkillable (D state), even restarting the runtime may not help. The process remains until the kernel wait resolves or the host reboots.
7.6 Host reboot decision
If you have confirmed:
- the PID is in D state,
- storage is hung or the kernel is wedged,
- the container blocks critical operations (e.g., node drain),
then a controlled node reboot may be the only resolution. In Kubernetes, cordon and drain first when possible:
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data --grace-period=60 --timeout=10m
If drain cannot complete due to stuck pods, you may need forced deletion (see Kubernetes section), but understand it may leave resources behind until reboot.
8. Kubernetes specifics: terminationGracePeriodSeconds, preStop, and probes
8.1 Understand the termination timeline
- SIGTERM is sent when the pod is terminating.
- The preStop hook runs before SIGTERM is sent, and it counts against the grace period.
- After the grace period, kubelet issues SIGKILL.
If your preStop sleeps 20 seconds and your grace period is 30 seconds, your app has at most ~10 seconds to shut down after preStop completes.
8.2 Configure a realistic grace period
Example:
kubectl get pod myapp -o jsonpath='{.spec.terminationGracePeriodSeconds}{"\n"}'
A typical web service might need 30–60 seconds depending on request duration and connection draining.
8.3 Use preStop to drain, not to “wait and hope”
A useful preStop might call an internal endpoint to start draining:
kubectl exec deploy/myapp -- curl -sf http://127.0.0.1:8080/drain
In a Pod spec, the hook could be:
- Execute a command that flips the app into “draining” mode.
- Then sleep briefly to allow endpoints to update.
Be careful: preStop failures can shorten your effective shutdown time.
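Put together, a sketch of a Pod spec using this pattern. The /drain endpoint, image name, and sleep length are assumptions about your app and environment; the field names (lifecycle.preStop, terminationGracePeriodSeconds) are standard Kubernetes API:

```yaml
spec:
  terminationGracePeriodSeconds: 45
  containers:
    - name: myapp
      image: myimage:latest
      lifecycle:
        preStop:
          exec:
            # Flip the app into draining mode, then pause briefly so
            # endpoint updates propagate before SIGTERM arrives.
            command: ["sh", "-c", "curl -sf http://127.0.0.1:8080/drain || true; sleep 5"]
```

Note the `|| true`: if the drain call fails, the hook still succeeds, so a flaky endpoint does not burn extra grace time on hook retries.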
8.4 Readiness probes and termination
A strong pattern:
- On SIGTERM, immediately fail readiness (or stop responding to readiness endpoint).
- This removes the pod from load balancer rotation quickly.
- Then finish in-flight work.
If readiness stays “ready” during shutdown, traffic may continue to hit the pod until it dies.
8.5 Pods stuck in Terminating
Get details:
kubectl get pod -n myns mypod -o wide
kubectl describe pod -n myns mypod
kubectl get pod -n myns mypod -o json | jq '.metadata.finalizers, .status.containerStatuses'
Common causes:
- Finalizers (e.g., PVC protection, custom controllers).
- Kubelet can’t kill container due to node/runtime issues.
- Volume unmount hangs (again often storage).
Force delete (dangerous; use when node is unhealthy and you accept cleanup later):
kubectl delete pod -n myns mypod --grace-period=0 --force
If the node is unreachable, Kubernetes will remove the API object, but the process may still run on the node until it recovers or reboots.
9. Practical hardening patterns (Dockerfile, entrypoint, app code)
9.1 Prefer exec-form ENTRYPOINT/CMD
Exec form avoids an extra shell and preserves signal delivery:
ENTRYPOINT ["./myserver"]
If you need arguments:
CMD ["--port=8080","--log-level=info"]
Avoid:
ENTRYPOINT ./myserver --port=8080
That uses a shell and can break signal handling.
9.2 Add an init for reaping
Use Docker --init in runtime config, or bake tini in the image (especially for Kubernetes where --init is not a Pod setting).
9.3 Ensure logs flush on shutdown
If you use buffered logging, flush on SIGTERM. Otherwise you’ll see truncated logs exactly when you need them most.
9.4 Avoid shutdown work that depends on fragile dependencies
Common mistake: on SIGTERM, write final state to an NFS mount or a remote DB and block indefinitely. Use timeouts and fallbacks.
9.5 Add explicit timeouts everywhere
- HTTP server shutdown timeout
- DB connection close timeout
- Queue consumer stop timeout
If the app can’t stop within the platform grace period, it will eventually be SIGKILLed.
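The same principle in miniature: stop a worker with a bounded join instead of an indefinite wait (names are illustrative):

```python
import threading
import time

stop = threading.Event()

def worker():
    while not stop.is_set():
        time.sleep(0.05)  # stand-in for a unit of real work

t = threading.Thread(target=worker, daemon=True)
t.start()

def shutdown(deadline_seconds=2.0):
    stop.set()
    t.join(timeout=deadline_seconds)  # bounded wait, never indefinite
    return not t.is_alive()  # True if the worker stopped in time
```

If shutdown() returns False, log it and exit anyway; a missed internal deadline should degrade into a forced exit you control, not a SIGKILL you don’t.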
10. Incident playbook: step-by-step commands
This section is a practical sequence you can run during an incident on a Docker host. Adjust names and be mindful of impact.
10.1 Identify the problem container
docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Image}}'
docker ps -a --no-trunc | grep -E 'Stopping|Dead|Exited'
10.2 Attempt graceful stop with longer timeout
docker stop --time 60 myapp
10.3 If still running, inspect PID and state
PID=$(docker inspect -f '{{.State.Pid}}' myapp)
ps -o pid,ppid,stat,etime,wchan,cmd -p "$PID"
10.4 Check container process tree via nsenter
sudo nsenter -t "$PID" -m -u -i -n -p -- sh -lc 'ps -eo pid,ppid,stat,cmd --forest | sed -n "1,200p"'
10.5 Look for zombies
sudo nsenter -t "$PID" -m -p -- sh -lc 'ps -eo pid,ppid,stat,cmd | awk "\$3 ~ /Z/ {print}"'
If zombies are present, plan a redeploy with tini/proper PID 1 behavior.
10.6 If ignoring SIGTERM, send SIGKILL
docker kill --signal=SIGKILL myapp
10.7 If SIGKILL doesn’t work: check for D state and kernel logs
ps -o pid,stat,wchan,cmd -p "$PID"
dmesg -T | tail -n 100
journalctl -k --since "15 min ago" | tail -n 200
If D state correlates with storage/network issues, engage the storage layer. If the node is wedged, prepare for reboot.
10.8 If Docker metadata is stuck (container “Dead”)
Sometimes Docker shows a container as Dead and it can’t be removed. Try:
docker rm -f myapp
If it hangs, you may need to restart Docker after assessing impact:
sudo systemctl restart docker
On containerd-based systems:
sudo systemctl restart containerd
If the underlying process is unkillable, runtime restarts won’t fix it—only resolution of the kernel wait or reboot will.
11. Prevention checklist
Use this as a pre-production and post-incident checklist.
Container image and entrypoint
- Use exec-form ENTRYPOINT/CMD.
- Ensure PID 1 is the actual app process (or a minimal init like tini).
- Avoid shell wrappers; if needed, exec the child.
- Add tini/dumb-init to reap zombies.
Application behavior
- Handle SIGTERM (and ideally SIGINT) explicitly.
- Stop accepting new work immediately on shutdown.
- Drain connections and stop background workers with timeouts.
- Avoid indefinite waits; always use deadlines.
- Flush logs/metrics on exit.
Platform configuration
- Set realistic stop/grace timeouts (docker stop -t, Kubernetes terminationGracePeriodSeconds).
- Ensure readiness fails quickly during shutdown (drain pattern).
- Use preStop only when necessary; remember it consumes grace time.
Storage and kernel realities
- Be cautious with NFS/remote mounts in critical paths.
- Monitor for hung tasks and I/O latency.
- Have a node reboot playbook for unkillable D-state processes.
Closing notes
“Zombie containers” in production usually boil down to two root causes:
- Bad PID 1 behavior (signals not forwarded, children not reaped) — fixable by using tini, exec-form entrypoints, and correct shutdown code.
- Kernel-level unkillable waits (D state) — not fixable by signals; requires resolving the underlying I/O issue or rebooting the node.
If you want, share:
- your Dockerfile and entrypoint,
- the output of docker inspect -f '{{.State.Pid}}' and ps -o pid,stat,wchan,cmd -p <PID>,
- and whether you’re on Docker or Kubernetes,
and I can suggest a targeted remediation plan for your specific shutdown behavior.