
Production Readiness Checklist for Dockerized Microservices (DevOps Guide)

Tags: devops, docker, microservices, production-readiness, kubernetes, ci-cd, observability, security, sre, reliability

This tutorial is a practical, command-heavy checklist for taking Dockerized microservices from “it runs on my laptop” to production-ready. It focuses on Linux hosts and common tooling (Docker Engine, Docker Compose, container registries, CI/CD, and observability stacks). It is written as a checklist, but each item includes the “why”, the “how”, and real commands you can run.


0) Baseline assumptions and goals

Assumptions

Production readiness goals


1) Image build hygiene: deterministic, minimal, and traceable

1.1 Use multi-stage builds and minimal base images

Why: Smaller images reduce attack surface and pull time. Multi-stage builds keep compilers/build tools out of runtime.

How (example Dockerfile skeleton):

# Build stage
FROM golang:1.22-alpine AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -trimpath -ldflags="-s -w" -o /out/service ./cmd/service

# Runtime stage
FROM gcr.io/distroless/static-debian12:nonroot
WORKDIR /
COPY --from=build /out/service /service
USER nonroot:nonroot
EXPOSE 8080
ENTRYPOINT ["/service"]

Checklist

1.2 Pin base images by digest

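Why: Tags such as alpine:latest (and even alpine:3.20) are mutable and can point to a different image after a rebuild. Digests are immutable, so pinning by digest makes builds reproducible.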

How:

docker pull alpine:3.20
docker image inspect alpine:3.20 --format '{{index .RepoDigests 0}}'
# Example output: alpine@sha256:...

Then in Dockerfile:

FROM alpine@sha256:...  # pinned digest

Checklist

1.3 Build with BuildKit and record provenance metadata

Why: BuildKit improves layer caching and, through buildx, can attach SBOM and provenance attestations to the image it pushes.

How:

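# BuildKit is the default builder since Docker Engine 23.0; the variable only matters on older engines
export DOCKER_BUILDKIT=1
docker build -t myorg/service:1.2.3 .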

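If you use buildx (the attestation flags record provenance and an SBOM alongside the pushed image):

docker buildx create --use --name prod-builder
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  --provenance=mode=max --sbom=true \
  -t registry.example.com/myorg/service:1.2.3 \
  --push .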

Checklist

1.4 Use .dockerignore aggressively

Why: Prevent leaking secrets, reduce build context size, speed builds.

Example .dockerignore:

.git
node_modules
dist
target
*.log
.env
secrets/
**/*_test.go

Checklist


2) Versioning, tagging, and artifact traceability

2.1 Use immutable tags and embed commit metadata

Why: You must be able to map a running container back to a source revision and build pipeline run.

Tagging strategy

Embed labels:

ARG VCS_REF
ARG BUILD_DATE
LABEL org.opencontainers.image.revision=$VCS_REF \
      org.opencontainers.image.created=$BUILD_DATE \
      org.opencontainers.image.source="https://github.com/myorg/service"

Build:

docker build \
  --build-arg VCS_REF="$(git rev-parse HEAD)" \
  --build-arg BUILD_DATE="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  -t registry.example.com/myorg/service:$(git rev-parse --short HEAD) .

Checklist

2.2 Generate SBOMs and store them with artifacts

Why: SBOMs help with vulnerability response and compliance.

Using Syft:

syft registry.example.com/myorg/service:1.2.3 -o spdx-json > sbom.spdx.json

Checklist


3) Vulnerability scanning and supply-chain security

3.1 Scan images in CI and block on severity thresholds

Why: Catch known CVEs before deployment.

Using Trivy:

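trivy image --exit-code 1 --ignore-unfixed --severity HIGH,CRITICAL \
  registry.example.com/myorg/service:1.2.3

Without --exit-code, Trivy reports findings but still exits 0. With --exit-code 1, any HIGH or CRITICAL finding fails the pipeline step.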

Checklist

3.2 Sign images and verify at deploy time

Why: Prevent tampering and ensure only trusted images run.

Using Cosign (keyless example requires OIDC-capable CI):

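cosign sign --yes registry.example.com/myorg/service:1.2.3
# cosign 2.x keyless verify must pin the signer; set both variables to your CI's identity and issuer
cosign verify --certificate-identity "$SIGNER_IDENTITY" \
  --certificate-oidc-issuer "$OIDC_ISSUER" registry.example.com/myorg/service:1.2.3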

Checklist

3.3 Keep secrets out of images

Why: Secrets baked into layers are hard to remove and are often leaked.

Anti-patterns

Better

Checklist


4) Runtime security: least privilege by default

4.1 Run as non-root

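Why: Unless user namespaces are remapped, root inside the container is UID 0 on the host. Running as an unprivileged user limits what a compromised process or a container breakout can do.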

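In Dockerfile (Alpine/BusyBox syntax; distroless :nonroot images already ship a nonroot user):

RUN addgroup -S app && adduser -S app -G app
USER app:app

Verify (inspect the image config; running id inside the container fails when the image has an ENTRYPOINT or no id binary):

# Prints the configured user; empty output means the image defaults to root
docker image inspect --format '{{.Config.User}}' myorg/service:1.2.3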

Checklist

4.2 Drop Linux capabilities and use read-only filesystem where possible

Why: Many services do not need extra capabilities. Read-only FS prevents persistence and some exploit chains.

Run example:

docker run --rm \
  --read-only \
  --cap-drop ALL \
  --security-opt no-new-privileges \
  -p 8080:8080 \
  myorg/service:1.2.3

If the app needs temp space:

docker run --rm \
  --read-only \
  --tmpfs /tmp:rw,noexec,nosuid,size=64m \
  --cap-drop ALL \
  --security-opt no-new-privileges \
  myorg/service:1.2.3

Checklist

4.3 Use seccomp and AppArmor/SELinux profiles

Why: System call filtering and MAC policies reduce kernel attack surface.

Check seccomp default:

docker info | grep -i seccomp

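On Ubuntu with AppArmor, confirm that profiles are loaded and enforced (Docker applies the docker-default profile unless overridden; aa-status needs root):

sudo aa-status | head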

Checklist

4.4 Network exposure and firewalling

Why: Only expose what is necessary; segment networks.

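How (bind the published port to loopback so it is not reachable from other hosts, then confirm the listener):

docker run -d -p 127.0.0.1:8080:8080 myorg/service:1.2.3
sudo ss -lntp | grep 8080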

Checklist


5) Configuration management: env vars, config files, and feature flags

5.1 Separate config from code

Why: Promotes the Twelve-Factor approach and enables environment-specific behavior without rebuilding images.

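Use environment variables. A bare -e NAME passes the value through from the calling environment, which keeps credentials such as a database URL out of the command line and shell history:

docker run --rm \
  -e LOG_LEVEL=info \
  -e DATABASE_URL \
  myorg/service:1.2.3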

Checklist

5.2 Validate configuration at startup

Why: Fail fast prevents partial outages and confusing runtime errors.

Pattern:
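A minimal sketch of the fail-fast idea as a POSIX shell entrypoint helper. The function name require_env and the variables it checks are illustrative assumptions, not part of any tool. Many services would do the same validation in their own language at startup.

```shell
# Hypothetical helper: fail if any named variable is unset or empty.
# Arguments must be valid shell variable names.
require_env() {
  missing=""
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    # Report every missing variable at once instead of one per restart.
    echo "config error: missing required variables:$missing" >&2
    return 1
  fi
}

# Usage in an entrypoint script (variable names are examples):
#   require_env LOG_LEVEL DATABASE_URL || exit 1
#   exec /service
```

Reporting all missing variables in one message avoids a fix-one-restart-repeat loop during a first deployment.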

Checklist

5.3 Feature flags for risky changes

Why: Allows safe rollout and quick disable without redeploy.

Checklist


6) Health checks, readiness, and graceful shutdown

6.1 Implement liveness and readiness endpoints

Why: Orchestrators need to know when to restart vs when to stop routing traffic.

Typical endpoints:

Test:

curl -fsS http://127.0.0.1:8080/healthz
curl -fsS http://127.0.0.1:8080/readyz

Add Docker HEALTHCHECK (useful even outside Kubernetes):

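# Requires a shell and wget in the runtime image; distroless images have neither,
# so there the probe must be a binary shipped with the service or done by the platform.
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
  CMD wget -qO- http://127.0.0.1:8080/healthz || exit 1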

Checklist

6.2 Graceful shutdown and termination signals

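Why: docker stop sends SIGTERM, waits for a grace period (10 seconds by default), then sends SIGKILL. The service must stop accepting new requests and finish in-flight work within that window.

Test locally (run detached so the same terminal can stop the container):

docker run -d --name svc -p 8080:8080 myorg/service:1.2.3
docker stop -t 20 svc
docker logs svc
docker inspect svc --format '{{.State.ExitCode}}'  # 137 means it was SIGKILLed after the grace period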

Checklist


7) Resource management: CPU, memory, file descriptors, and limits

7.1 Set container resource limits

Why: Prevent noisy-neighbor issues and OOM cascades.

Docker run example:

docker run --rm \
  --memory=512m --memory-swap=512m \
  --cpus=1.0 \
  --pids-limit=200 \
  myorg/service:1.2.3

Check runtime stats:

docker stats --no-stream

Checklist

7.2 Tune ulimits and file descriptors

Why: High concurrency services can exhaust file descriptors.

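Inspect current limits (the image defines an ENTRYPOINT, so override it; this needs a shell in the image and will not work on distroless):

docker run --rm --entrypoint sh myorg/service:1.2.3 -c 'ulimit -n && ulimit -u'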

Set ulimit:

docker run --rm --ulimit nofile=65535:65535 myorg/service:1.2.3

Checklist

7.3 JVM / runtime-specific memory settings (if applicable)

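Why: Some runtimes size heaps and thread pools from the host's totals rather than the container's cgroup limits unless configured to do so (for example, JVMs older than 8u191 ignore container memory limits by default).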

Checklist


8) Logging: structured, centralized, and privacy-aware

8.1 Log to stdout/stderr, not files

Why: Container platforms collect stdout/stderr easily; file logs complicate rotation and persistence.

Run and view:

docker logs -f <container>

Checklist

8.2 Use structured logging with correlation IDs

Why: JSON logs are queryable; correlation IDs connect services.

Example expectations:

Checklist

8.3 Centralize logs and define retention

Why: Debugging incidents requires historical logs.

Checklist


9) Metrics and alerting: what to measure and how to act

9.1 Expose service metrics (Prometheus/OpenMetrics)

Why: Metrics enable SLOs, capacity planning, and rapid detection.

Common metrics:

Example check:

curl -fsS http://127.0.0.1:8080/metrics | head

Checklist

9.2 Define SLOs and alerts based on user impact

Why: Alert fatigue happens when alerts don’t map to real problems.

Examples:
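One way to make an SLO actionable is to convert it into an error budget. The sketch below is plain arithmetic. The 99.9% and 99.5% targets and the 30-day window are assumed example values, and error_budget_minutes is a hypothetical helper name.

```shell
# Hypothetical helper: minutes of allowed unavailability for an availability SLO.
# Arguments: SLO as a percentage (e.g. 99.9) and the window length in days.
error_budget_minutes() {
  awk -v slo="$1" -v days="$2" \
    'BEGIN { printf "%.1f\n", (100 - slo) / 100 * days * 24 * 60 }'
}

error_budget_minutes 99.9 30   # prints 43.2
error_budget_minutes 99.5 30   # prints 216.0
```

An alert that fires when the budget is being consumed much faster than the window allows maps directly to user impact, which is the point of this section.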

Checklist


10) Tracing and dependency visibility

10.1 Distributed tracing with OpenTelemetry

Why: Microservices fail in the gaps—tracing shows where time and errors occur.

Checklist

Quick sanity check (varies by stack):

curl -H 'traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01' \
  -v http://127.0.0.1:8080/api

11) Data and state: databases, migrations, and backups

11.1 Database migrations: automated and safe

Why: Schema drift and manual migrations cause outages.

Checklist

Example (generic):

# Example using a migration tool; replace with your tooling
migrate -path ./migrations -database "$DATABASE_URL" up

11.2 Backups and restore drills

Why: Backups are useless until you test restores.

Checklist


12) Networking: timeouts, retries, and circuit breakers

12.1 Set explicit timeouts everywhere

Why: Default timeouts are often infinite, causing thread/connection exhaustion.

Checklist

12.2 Retries with jitter and budgets

Why: Naive retries amplify outages (retry storms).
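A minimal shell sketch of exponential backoff with full jitter, assuming a Linux host with /dev/urandom. The function name retry_with_jitter is hypothetical. The attempt cap is only a simple per-call stand-in for a real retry budget, which would normally limit retries as a fraction of total traffic.

```shell
# Hypothetical helper: retry a command with exponential backoff and full jitter.
# Arguments: max attempts, base delay in whole seconds, then the command to run.
retry_with_jitter() {
  max_attempts=$1
  base_delay=$2
  shift 2
  attempt=1
  while :; do
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    # Upper bound doubles each attempt: base, 2*base, 4*base, ...
    cap=$(( base_delay * (1 << (attempt - 1)) ))
    # Full jitter: sleep a random whole number of seconds in [0, cap].
    sleep $(( $(od -An -N2 -tu2 /dev/urandom) % (cap + 1) ))
    attempt=$(( attempt + 1 ))
  done
}

# Usage (URL is an example):
#   retry_with_jitter 4 1 curl -fsS --max-time 5 http://127.0.0.1:8080/readyz
```

Randomizing the whole delay, rather than adding a small offset to a fixed delay, spreads out clients that all failed at the same moment.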

Checklist


13) Deployment strategy: rollouts, rollbacks, and environment parity

13.1 Avoid snowflake servers: immutable infrastructure mindset

Why: If you “SSH and fix,” you can’t reproduce or audit changes.

Checklist

13.2 Blue/green or canary deployments

Why: Reduce blast radius and enable quick rollback.

Checklist

13.3 Environment parity and promotion

Why: “Works in staging” only helps if staging resembles prod.

Checklist


14) CI/CD pipeline essentials (with real commands)

14.1 Pipeline stages to include

Recommended stages

  1. Lint + unit tests
  2. Build image
  3. Generate SBOM
  4. Scan vulnerabilities
  5. Integration tests (spin up dependencies)
  6. Push immutable tags
  7. Sign image (signatures are stored in the registry, so sign after the push)
  8. Deploy to staging
  9. Smoke tests
  10. Promote to prod

14.2 Integration testing with Docker Compose

Why: Validate service behavior with real dependencies.

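Example commands (--wait blocks until services are running or healthy; logs are printed without -f so a CI step does not hang):

docker compose up -d --build --wait
docker compose ps
docker compose logs --no-color --tail 100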

Run smoke tests:

curl -fsS http://127.0.0.1:8080/readyz
curl -fsS http://127.0.0.1:8080/api/version
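Smoke tests like the ones above can race the service's start-up. A small polling helper is one way to avoid that. This is a sketch, and wait_until is a hypothetical name rather than a Docker or Compose feature.

```shell
# Hypothetical helper: poll a command once per second until it succeeds
# or the timeout (first argument, in seconds) has elapsed.
wait_until() {
  deadline=$(( $(date +%s) + $1 ))
  shift
  until "$@" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep 1
  done
}

# Usage before the smoke tests (URL is an example):
#   wait_until 60 curl -fsS http://127.0.0.1:8080/readyz || exit 1
```

Polling the readiness endpoint with a deadline is more reliable than a fixed sleep, and a timeout turns a hung start-up into a clear pipeline failure.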

Tear down:

docker compose down -v

Checklist


15) Host and runtime hardening (Docker Engine on Linux)

15.1 Keep Docker and OS patched

Why: Container isolation depends on kernel and runtime security.

Check versions:

docker version
uname -a

Checklist

15.2 Use a dedicated user and restrict Docker socket access

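Why: Anyone who can write to /var/run/docker.sock can start a privileged container and mount the host filesystem, so socket access is effectively root access on the host.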

Inspect socket permissions:

ls -l /var/run/docker.sock
getent group docker

Checklist

15.3 Configure log rotation for Docker

Why: Prevent disk exhaustion.

Inspect current logging driver:

docker info | grep -i "Logging Driver"

Example run with json-file options:

docker run --log-opt max-size=10m --log-opt max-file=3 myorg/service:1.2.3

Checklist


16) Secrets management: injection, rotation, and auditability

16.1 Inject secrets at runtime

Why: Secrets should be short-lived, rotated, and audited.

Options:

Checklist

16.2 Avoid passing secrets via command line

Why: Process args can leak via ps, logs, or crash reports.

Prefer environment variables or mounted secret files (depending on platform).
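A minimal sketch of reading a secret from a mounted file with an environment-variable fallback. The NAME_FILE convention is used by several official Docker Hub images. The helper name read_secret and the DB_PASSWORD variable are assumptions for illustration.

```shell
# Hypothetical helper: print the secret NAME, preferring the file whose path
# is in NAME_FILE over the plain NAME environment variable.
# NAME must be a valid shell variable name.
read_secret() {
  name=$1
  eval "file=\${${name}_FILE:-}"
  eval "value=\${$name:-}"
  if [ -n "$file" ]; then
    [ -r "$file" ] || { echo "secret file not readable: $file" >&2; return 1; }
    cat "$file"
  elif [ -n "$value" ]; then
    printf '%s\n' "$value"
  else
    echo "secret $name is not set" >&2
    return 1
  fi
}

# Usage in an entrypoint (path and variable name are examples):
#   DB_PASSWORD_FILE=/run/secrets/db_password
#   DB_PASSWORD="$(read_secret DB_PASSWORD)" || exit 1
```

Mounted files keep the secret out of docker inspect output and process arguments, while the fallback keeps local development simple.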

Checklist


17) Operational readiness: runbooks, on-call, and incident response

17.1 Runbooks for common failures

Why: Reduce MTTR and cognitive load during incidents.

Runbook should include:

Checklist

17.2 Debugging commands you should be able to run quickly

On a Docker host:

docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}'
docker logs --tail 200 <container>
docker inspect <container> --format '{{json .State}}' | jq
docker exec -it <container> sh
docker top <container>
docker stats --no-stream <container>

Network debugging (host):

ss -lntp
curl -v http://127.0.0.1:8080/readyz

Checklist


18) Testing for failure: chaos and resilience checks

18.1 Simulate dependency outages and latency

Why: Microservices must degrade gracefully.

Checklist

18.2 Load testing and capacity planning

Why: You need to know limits before users find them.

Checklist


19) Compliance and data protection basics

19.1 PII and sensitive data handling

Why: Legal and reputational risk.

Checklist

19.2 TLS and certificate management

Why: Prevent MITM and protect credentials.

Checklist


20) A practical “go/no-go” production checklist (copy/paste)

Use this as a final gate before production:

Build & artifacts

Security

Reliability

Observability

Deployment & operations


21) Example: end-to-end commands for a release

Below is a realistic sequence you can adapt. Replace registry/service names as needed.

# 1) Test
make test

# 2) Build with metadata
export DOCKER_BUILDKIT=1
VERSION="1.2.3"
GIT_SHA="$(git rev-parse HEAD)"
docker build \
  --build-arg VCS_REF="$GIT_SHA" \
  --build-arg BUILD_DATE="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  -t registry.example.com/myorg/service:$VERSION \
  -t registry.example.com/myorg/service:${GIT_SHA:0:12} \
  .

# 3) Scan
trivy image --exit-code 1 --ignore-unfixed --severity HIGH,CRITICAL \
  registry.example.com/myorg/service:$VERSION

# 4) SBOM
syft registry.example.com/myorg/service:$VERSION -o spdx-json > sbom.spdx.json

# 5) Push
docker push registry.example.com/myorg/service:$VERSION
docker push registry.example.com/myorg/service:${GIT_SHA:0:12}

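# 6) Sign
cosign sign --yes registry.example.com/myorg/service:$VERSION
# cosign 2.x keyless verify must pin the signer; set both variables to your CI's identity and issuer
cosign verify --certificate-identity "$SIGNER_IDENTITY" \
  --certificate-oidc-issuer "$OIDC_ISSUER" registry.example.com/myorg/service:$VERSION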

# 7) Deploy (example placeholder)
# Your deploy command depends on platform (Kubernetes/Swarm/Nomad/custom)
# Ensure you deploy the immutable tag, not 'latest'.
echo "Deploy registry.example.com/myorg/service:$VERSION"

# 8) Post-deploy smoke test
curl -fsS https://service.example.com/readyz
curl -fsS https://service.example.com/api/version
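
Step 7 above asks for an immutable tag rather than latest. A deploy script could enforce that with a guard like the following sketch. The name is_pinned_ref is hypothetical, and the check is purely syntactic: an explicit tag is only truly immutable if the registry forbids overwriting it, so a digest reference is the stronger guarantee.

```shell
# Hypothetical guard: succeed only for image references that carry a digest
# or an explicit tag other than 'latest'.
is_pinned_ref() {
  ref=$1
  case "$ref" in
    *@sha256:*) return 0 ;;   # pinned by digest
  esac
  # Keep only the last path component so a registry port is not read as a tag.
  name=${ref##*/}
  case "$name" in
    *:latest) return 1 ;;     # mutable by convention
    *:*) return 0 ;;          # explicit tag
    *) return 1 ;;            # no tag: Docker would default to 'latest'
  esac
}

# Usage in a deploy script:
#   is_pinned_ref "$IMAGE" || { echo "refusing to deploy $IMAGE" >&2; exit 1; }
```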

22) Common production pitfalls (and how to avoid them)

  1. Deploying latest

    • Fix: Use immutable tags; promote the same digest across environments.
  2. Health checks that hit the database

    • Fix: Keep liveness check process-only; readiness can check dependencies but must be fast and cached.
  3. No timeouts

    • Fix: Set explicit timeouts on servers and clients; enforce deadlines across request chains.
  4. Over-permissive containers

    • Fix: Non-root, drop capabilities, read-only FS, no-new-privileges, and MAC policies.
  5. Logs with secrets/PII

    • Fix: Redaction, structured logging, and strict review of log fields.
  6. No rollback plan

    • Fix: Blue/green or canary plus one-command rollback; keep previous versions available.

23) What “done” looks like

A Dockerized microservice is production-ready when:

Use the checklist sections above as gating criteria in your CI/CD pipeline and as a recurring audit (monthly/quarterly). Production readiness is not a one-time milestone—it is an operational habit.