Production Readiness Checklist for Dockerized Microservices (DevOps Guide)
This tutorial is a practical, command-heavy checklist for taking Dockerized microservices from “it runs on my laptop” to production-ready. It focuses on Linux hosts and common tooling (Docker Engine, Docker Compose, container registries, CI/CD, and observability stacks). It is written as a checklist, but each item includes the “why”, the “how”, and real commands you can run.
0) Baseline assumptions and goals
Assumptions
- You have one or more microservices packaged as Docker images.
- You deploy to Linux hosts (VMs or bare metal) or a managed container platform.
- You have a container registry (Docker Hub, GHCR, ECR, GCR, ACR, etc.).
- You want repeatable builds, safe rollouts, and fast incident response.
Production readiness goals
- Deterministic builds and traceable artifacts
- Secure runtime (least privilege, minimal attack surface)
- Predictable performance under load and failure
- Observability (logs, metrics, traces) and actionable alerts
- Safe deployment process (rollbacks, canaries/blue-green)
- Documented operations (runbooks, SLOs, ownership)
1) Image build hygiene: deterministic, minimal, and traceable
1.1 Use multi-stage builds and minimal base images
Why: Smaller images reduce attack surface and pull time. Multi-stage builds keep compilers/build tools out of runtime.
How (example Dockerfile skeleton):
# Build stage
FROM golang:1.22-alpine AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -trimpath -ldflags="-s -w" -o /out/service ./cmd/service
# Runtime stage
FROM gcr.io/distroless/static-debian12:nonroot
WORKDIR /
COPY --from=build /out/service /service
USER nonroot:nonroot
EXPOSE 8080
ENTRYPOINT ["/service"]
Checklist
- Multi-stage build used
- Runtime image does not contain build tools/package managers
- Prefer distroless or slim images when possible
1.2 Pin base images by digest
Why: Tags like alpine:latest change. Digests are immutable, enabling reproducibility.
How:
docker pull alpine:3.20
docker image inspect alpine:3.20 --format '{{index .RepoDigests 0}}'
# Example output: alpine@sha256:...
Then in the Dockerfile (note: Dockerfile comments must start at the beginning of a line, not trail an instruction):
# pinned digest
FROM alpine@sha256:...
Checklist
- Base images pinned to digest in production builds
1.3 Build with BuildKit and record provenance metadata
Why: BuildKit improves caching and supports SBOM/provenance with modern tooling.
How (BuildKit has been the default builder since Docker Engine 23.0; the export is only needed on older hosts):
export DOCKER_BUILDKIT=1
docker build -t myorg/service:1.2.3 .
If you use buildx:
docker buildx create --use --name prod-builder
docker buildx build \
--platform linux/amd64,linux/arm64 \
-t registry.example.com/myorg/service:1.2.3 \
--push .
Checklist
- BuildKit enabled
- Multi-arch builds supported if needed (amd64/arm64)
- Build outputs pushed to a registry, not built ad-hoc on servers
1.4 Use .dockerignore aggressively
Why: Prevent leaking secrets, reduce build context size, speed builds.
Example .dockerignore:
.git
node_modules
dist
target
*.log
.env
secrets/
**/*_test.go
Checklist
- .dockerignore exists and excludes secrets, VCS metadata, and bulky artifacts
2) Versioning, tagging, and artifact traceability
2.1 Use immutable tags and embed commit metadata
Why: You must be able to map a running container back to a source revision and build pipeline run.
Tagging strategy
- Immutable: service:<git_sha> or service:<semver>-<build>
- Mutable only for convenience: service:latest (never deploy latest to prod)
Embed labels:
ARG VCS_REF
ARG BUILD_DATE
LABEL org.opencontainers.image.revision=$VCS_REF \
org.opencontainers.image.created=$BUILD_DATE \
org.opencontainers.image.source="https://github.com/myorg/service"
Build:
docker build \
--build-arg VCS_REF="$(git rev-parse HEAD)" \
--build-arg BUILD_DATE="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
-t registry.example.com/myorg/service:$(git rev-parse --short HEAD) .
Checklist
- Every image tag maps to an immutable identifier (commit SHA)
- OCI labels include revision and source URL
2.2 Generate SBOMs and store them with artifacts
Why: SBOMs help with vulnerability response and compliance.
Using Syft:
syft registry.example.com/myorg/service:1.2.3 -o spdx-json > sbom.spdx.json
Checklist
- SBOM generated per build and stored (artifact store or registry attachment)
3) Vulnerability scanning and supply-chain security
3.1 Scan images in CI and block on severity thresholds
Why: Catch known CVEs before deployment.
Using Trivy:
trivy image --ignore-unfixed --severity HIGH,CRITICAL \
registry.example.com/myorg/service:1.2.3
Fail the pipeline if findings exceed policy.
Checklist
- Image scan runs on every build
- Policy defined for blocking builds (e.g., no CRITICAL)
3.2 Sign images and verify at deploy time
Why: Prevent tampering and ensure only trusted images run.
Using Cosign (keyless signing requires an OIDC-capable CI; with Cosign 2.x, keyless verification also requires identity flags, so adjust the identity and issuer to match your CI):
cosign sign --yes registry.example.com/myorg/service:1.2.3
cosign verify \
  --certificate-identity-regexp 'https://github.com/myorg/.*' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  registry.example.com/myorg/service:1.2.3
Checklist
- Image signing enabled
- Deploy step verifies signatures (or admission policy in orchestrator)
3.3 Keep secrets out of images
Why: Secrets baked into layers are hard to remove and are often leaked.
Anti-patterns
COPY .env /app/.env
ARG AWS_SECRET_ACCESS_KEY=...
Better
- Inject secrets at runtime via secret stores (Vault, cloud secret manager) or orchestrator secrets.
Checklist
- No secrets in Git history
- No secrets in image layers (verify with scanning, grep, or history inspection)
4) Runtime security: least privilege by default
4.1 Run as non-root
Why: Limits container breakout impact and reduces risk.
In Dockerfile:
RUN addgroup -S app && adduser -S app -G app
USER app:app
Verify (override the entrypoint if the image sets one; distroless images lack an id binary, so inspect the image config instead):
docker run --rm --entrypoint id myorg/service:1.2.3
docker image inspect myorg/service:1.2.3 --format '{{.Config.User}}'
Checklist
- Container runs as non-root
- File permissions support non-root operation
4.2 Drop Linux capabilities and use read-only filesystem where possible
Why: Many services do not need extra capabilities. Read-only FS prevents persistence and some exploit chains.
Run example:
docker run --rm \
--read-only \
--cap-drop ALL \
--security-opt no-new-privileges \
-p 8080:8080 \
myorg/service:1.2.3
If the app needs temp space:
docker run --rm \
--read-only \
--tmpfs /tmp:rw,noexec,nosuid,size=64m \
--cap-drop ALL \
--security-opt no-new-privileges \
myorg/service:1.2.3
Checklist
- --cap-drop ALL used unless justified
- no-new-privileges enabled
- Read-only root filesystem where feasible
- tmpfs mounts for writable paths (/tmp, cache dirs)
4.3 Use seccomp and AppArmor/SELinux profiles
Why: System call filtering and MAC policies reduce kernel attack surface.
Check seccomp default:
docker info | grep -i seccomp
On Ubuntu with AppArmor, confirm profiles are loaded (requires root):
sudo aa-status | head
Checklist
- Default seccomp profile enabled (or custom hardened profile)
- AppArmor/SELinux enforced in production
4.4 Network exposure and firewalling
Why: Only expose what is necessary; segment networks.
- Bind ports to localhost when using a reverse proxy:
docker run -p 127.0.0.1:8080:8080 myorg/service:1.2.3
- Confirm listening ports:
ss -lntp | grep 8080
Checklist
- Only required ports are published
- Host firewall rules exist (ufw/nftables/security groups)
- Service-to-service traffic is restricted (network policies in orchestrator)
5) Configuration management: env vars, config files, and feature flags
5.1 Separate config from code
Why: Promotes the Twelve-Factor approach and enables environment-specific behavior without rebuilding images.
Use environment variables:
docker run --rm \
-e LOG_LEVEL=info \
-e DATABASE_URL="postgres://user:pass@db:5432/app?sslmode=disable" \
myorg/service:1.2.3
Checklist
- All environment-specific config is injected at runtime
- Defaults are safe; missing config fails fast with clear errors
5.2 Validate configuration at startup
Why: Fail fast prevents partial outages and confusing runtime errors.
Pattern:
- Parse config
- Validate required fields
- Exit non-zero with clear message
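The pattern above can be sketched in Go (the Config fields and env var names are illustrative; adapt them to your service):

```go
package main

import (
	"fmt"
	"log"
	"net/url"
	"os"
)

// Config holds environment-derived settings; field names are illustrative.
type Config struct {
	DatabaseURL string
	LogLevel    string
}

// loadConfig parses and validates required settings, returning a clear
// error instead of letting the service limp along half-configured.
func loadConfig(getenv func(string) string) (Config, error) {
	cfg := Config{
		DatabaseURL: getenv("DATABASE_URL"),
		LogLevel:    getenv("LOG_LEVEL"),
	}
	if cfg.DatabaseURL == "" {
		return cfg, fmt.Errorf("DATABASE_URL is required")
	}
	if _, err := url.Parse(cfg.DatabaseURL); err != nil {
		return cfg, fmt.Errorf("DATABASE_URL is not a valid URL: %w", err)
	}
	if cfg.LogLevel == "" {
		cfg.LogLevel = "info" // safe default
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.Getenv)
	if err != nil {
		log.Fatalf("config: %v", err) // exit non-zero with a clear message
	}
	log.Printf("starting with log level %s", cfg.LogLevel)
}
```

Injecting getenv as a function keeps validation unit-testable without touching the real environment.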
Checklist
- Startup fails if required config missing/invalid
- Config validation covered by tests
5.3 Feature flags for risky changes
Why: Allows safe rollout and quick disable without redeploy.
Checklist
- Feature flags exist for high-risk behavior
- Flags are auditable and have ownership
6) Health checks, readiness, and graceful shutdown
6.1 Implement liveness and readiness endpoints
Why: Orchestrators need to know when to restart vs when to stop routing traffic.
Typical endpoints:
- /healthz (liveness): process is alive
- /readyz (readiness): dependencies reachable, warmed up
Test:
curl -fsS http://127.0.0.1:8080/healthz
curl -fsS http://127.0.0.1:8080/readyz
Add a Docker HEALTHCHECK (useful even outside Kubernetes; note that distroless images ship no shell or wget, so there you need a small compiled healthcheck binary or orchestrator probes instead):
HEALTHCHECK --interval=30s --timeout=3s --start-period=10s --retries=3 \
CMD wget -qO- http://127.0.0.1:8080/healthz || exit 1
Checklist
- Liveness and readiness endpoints implemented
- Health checks are lightweight and do not overload dependencies
6.2 Graceful shutdown and termination signals
Why: Containers are stopped with SIGTERM; you must stop accepting new requests and finish in-flight work.
Test locally:
docker run --name svc -p 8080:8080 myorg/service:1.2.3
docker stop --time 20 svc
docker logs svc
Checklist
- SIGTERM triggers graceful shutdown
- Server stops accepting new connections quickly
- Background workers drain queues safely
- Shutdown timeout is documented and aligned with orchestrator settings
7) Resource management: CPU, memory, file descriptors, and limits
7.1 Set container resource limits
Why: Prevent noisy-neighbor issues and OOM cascades.
Docker run example:
docker run --rm \
--memory=512m --memory-swap=512m \
--cpus=1.0 \
--pids-limit=200 \
myorg/service:1.2.3
Check runtime stats:
docker stats --no-stream
Checklist
- Memory limit set (and tested under load)
- CPU limit/requests defined (in orchestrator)
- PIDs limit set for defense-in-depth
7.2 Tune ulimits and file descriptors
Why: High concurrency services can exhaust file descriptors.
Inspect current limits (override the entrypoint if the image sets one; requires a shell in the image):
docker run --rm --entrypoint sh myorg/service:1.2.3 -c 'ulimit -n && ulimit -u'
Set ulimit:
docker run --rm --ulimit nofile=65535:65535 myorg/service:1.2.3
Checklist
- nofile tuned for expected concurrency
- Connection pools configured (DB, HTTP clients)
7.3 JVM / runtime-specific memory settings (if applicable)
Why: Some runtimes don’t automatically respect cgroup limits unless configured.
Checklist
- Java: set container-aware flags and heap sizing
- Node: set --max-old-space-size when needed
- Go: consider GOMEMLIMIT for tight memory budgets
8) Logging: structured, centralized, and privacy-aware
8.1 Log to stdout/stderr, not files
Why: Container platforms collect stdout/stderr easily; file logs complicate rotation and persistence.
Run and view:
docker logs -f <container>
Checklist
- Logs go to stdout/stderr
- No log files required for normal operation
8.2 Use structured logging with correlation IDs
Why: JSON logs are queryable; correlation IDs connect services.
Example expectations:
timestamp, level, service, trace_id, request_id, msg, latency_ms, status
Checklist
- JSON logs in production
- Request ID propagated across services (headers like X-Request-Id)
- PII is redacted; secrets never logged
8.3 Centralize logs and define retention
Why: Debugging incidents requires historical logs.
Checklist
- Logs shipped to a centralized system (ELK/OpenSearch, Loki, cloud logging)
- Retention meets compliance and cost constraints
- Access controls and audit trails exist
9) Metrics and alerting: what to measure and how to act
9.1 Expose service metrics (Prometheus/OpenMetrics)
Why: Metrics enable SLOs, capacity planning, and rapid detection.
Common metrics:
- Request rate, error rate, latency (p50/p95/p99)
- Saturation (CPU, memory, queue depth)
- Dependency errors (DB, cache)
Example check:
curl -fsS http://127.0.0.1:8080/metrics | head
Checklist
- /metrics endpoint exists (or sidecar exporter)
- Golden signals instrumented (latency, traffic, errors, saturation)
9.2 Define SLOs and alerts based on user impact
Why: Alert fatigue happens when alerts don’t map to real problems.
Examples:
- 99.9% successful requests over 30 days
- p95 latency < 300ms
Checklist
- SLOs documented per service
- Alerts are actionable with runbooks
- Paging alerts are tied to SLO burn rate or high-severity symptoms
10) Tracing and dependency visibility
10.1 Distributed tracing with OpenTelemetry
Why: Microservices fail in the gaps—tracing shows where time and errors occur.
Checklist
- Trace context propagated across HTTP/gRPC boundaries
- Spans include key attributes (route, status, db.system, peer.service)
- Sampling strategy defined (head-based/tail-based)
Quick sanity check (varies by stack):
- Confirm the traceparent header is accepted and forwarded.
curl -H 'traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01' \
-v http://127.0.0.1:8080/api
11) Data and state: databases, migrations, and backups
11.1 Database migrations: automated and safe
Why: Schema drift and manual migrations cause outages.
Checklist
- Migrations run automatically in CI/CD or as a controlled job
- Migrations are backward-compatible (expand/contract pattern)
- Rollback strategy defined (down migrations or forward fixes)
Example (generic):
# Example using a migration tool; replace with your tooling
migrate -path ./migrations -database "$DATABASE_URL" up
11.2 Backups and restore drills
Why: Backups are useless until you test restores.
Checklist
- Automated backups with encryption
- Restore procedure documented and rehearsed
- RPO/RTO targets defined
12) Networking: timeouts, retries, and circuit breakers
12.1 Set explicit timeouts everywhere
Why: Default timeouts are often infinite, causing thread/connection exhaustion.
Checklist
- HTTP client timeout set (connect + request)
- Server read/write timeouts set
- DB connection and query timeouts set
12.2 Retries with jitter and budgets
Why: Naive retries amplify outages (retry storms).
Checklist
- Retries only on safe operations (idempotent)
- Exponential backoff + jitter
- Retry budget and max attempts enforced
- Circuit breaker or bulkhead patterns used for dependencies
13) Deployment strategy: rollouts, rollbacks, and environment parity
13.1 Avoid snowflake servers: immutable infrastructure mindset
Why: If you “SSH and fix,” you can’t reproduce or audit changes.
Checklist
- Hosts are configured via IaC (Terraform, Ansible, etc.)
- Deployments are automated via CI/CD
- Manual changes are prohibited or tightly controlled
13.2 Blue/green or canary deployments
Why: Reduce blast radius and enable quick rollback.
Checklist
- Deployment supports incremental rollout
- Automated health gates (metrics-based) before full rollout
- Rollback is one command or one click
13.3 Environment parity and promotion
Why: “Works in staging” only helps if staging resembles prod.
Checklist
- Same container image promoted across environments (dev → staging → prod)
- Config differs, not code
- Load tests run in a prod-like environment
14) CI/CD pipeline essentials (with real commands)
14.1 Pipeline stages to include
Recommended stages
- Lint + unit tests
- Build image
- Generate SBOM
- Scan vulnerabilities
- Sign image
- Integration tests (spin up dependencies)
- Push immutable tags
- Deploy to staging
- Smoke tests
- Promote to prod
14.2 Integration testing with Docker Compose
Why: Validate service behavior with real dependencies.
Example commands:
docker compose up -d --build
docker compose ps
docker compose logs -f --no-color
Run smoke tests:
curl -fsS http://127.0.0.1:8080/readyz
curl -fsS http://127.0.0.1:8080/api/version
Tear down:
docker compose down -v
Checklist
- Integration tests run in CI
- Compose/test harness uses pinned dependency versions
- Tests fail fast and provide logs/artifacts
15) Host and runtime hardening (Docker Engine on Linux)
15.1 Keep Docker and OS patched
Why: Container isolation depends on kernel and runtime security.
Check versions:
docker version
uname -a
Checklist
- Regular patch cadence for OS and Docker
- Reboot strategy for kernel updates
15.2 Use a dedicated user and restrict Docker socket access
Why: Access to /var/run/docker.sock is effectively root.
Inspect socket permissions:
ls -l /var/run/docker.sock
getent group docker
Checklist
- Only trusted admins/automation can access Docker socket
- Consider rootless Docker where appropriate
15.3 Configure log rotation for Docker
Why: Prevent disk exhaustion.
Inspect current logging driver:
docker info | grep -i "Logging Driver"
Example run with json-file options:
docker run --log-opt max-size=10m --log-opt max-file=3 myorg/service:1.2.3
To set rotation globally, configure /etc/docker/daemon.json and restart the daemon:
{
  "log-driver": "json-file",
  "log-opts": { "max-size": "10m", "max-file": "3" }
}
Checklist
- Log rotation configured globally or per container
- Disk usage monitored and alerts set
16) Secrets management: injection, rotation, and auditability
16.1 Inject secrets at runtime
Why: Secrets should be short-lived, rotated, and audited.
Options:
- Orchestrator secrets (Swarm/Kubernetes)
- Vault agent injection
- Cloud secret managers
Checklist
- Secrets never stored in images
- Rotation process exists and is tested
- Access to secrets is least privilege and audited
16.2 Avoid passing secrets via command line
Why: Process args can leak via ps, logs, or crash reports.
Prefer environment variables or mounted secret files (depending on platform).
Checklist
- No secrets in CLI args
- Secret values are masked in logs and CI output
17) Operational readiness: runbooks, on-call, and incident response
17.1 Runbooks for common failures
Why: Reduce MTTR and cognitive load during incidents.
Runbook should include:
- Symptom
- Impact
- Diagnosis steps (commands, dashboards)
- Mitigation steps
- Rollback steps
- Escalation contacts
Checklist
- Runbook exists per service
- On-call rotation and escalation defined
- Postmortem process defined
17.2 Debugging commands you should be able to run quickly
On a Docker host:
docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}'
docker logs --tail 200 <container>
docker inspect <container> --format '{{json .State}}' | jq
docker exec -it <container> sh
docker top <container>
docker stats --no-stream <container>
Network debugging (host):
ss -lntp
curl -v http://127.0.0.1:8080/readyz
Checklist
- Operators have access and permissions to run diagnostics
- Debug tools exist (either in a debug image or via ephemeral toolbox containers)
18) Testing for failure: chaos and resilience checks
18.1 Simulate dependency outages and latency
Why: Microservices must degrade gracefully.
Checklist
- Service handles DB/cache downtime with clear errors
- Timeouts prevent resource exhaustion
- Retries do not create storms
18.2 Load testing and capacity planning
Why: You need to know limits before users find them.
Checklist
- Load tests run for key endpoints and workflows
- Scaling strategy documented (horizontal/vertical)
- Bottlenecks identified (DB, CPU, locks, GC)
19) Compliance and data protection basics
19.1 PII and sensitive data handling
Why: Legal and reputational risk.
Checklist
- Data classification documented
- PII redaction in logs
- Encryption in transit (TLS) and at rest where applicable
- Access controls and audit logs for sensitive operations
19.2 TLS and certificate management
Why: Prevent MITM and protect credentials.
Checklist
- TLS termination strategy defined (ingress/reverse proxy/service)
- Certificates rotated automatically
- Strong ciphers and minimum TLS versions enforced
20) A practical “go/no-go” production checklist (copy/paste)
Use this as a final gate before production:
Build & artifacts
- Multi-stage Dockerfile; minimal runtime image
- Base images pinned by digest
- Immutable tags (commit SHA) used for deployment
- OCI labels include revision and source URL
- SBOM generated and stored
Security
- Trivy (or equivalent) scan passes policy
- Image signed (Cosign) and verified at deploy
- No secrets in image or repo; runtime secret injection
- Runs as non-root; no-new-privileges; capabilities dropped
- Read-only FS where possible; tmpfs for writable paths
- Seccomp/AppArmor/SELinux enabled
Reliability
- /healthz and /readyz implemented and tested
- Graceful shutdown on SIGTERM verified
- Resource limits defined and tested (CPU/mem/pids/ulimits)
- Timeouts configured for server and clients
- Retries are bounded, jittered, and safe
Observability
- Structured logs with request/trace IDs
- Centralized log shipping and retention defined
- Metrics endpoint available; dashboards exist
- Alerts map to SLOs and have runbooks
- Tracing enabled across service boundaries
Deployment & operations
- CI/CD pipeline builds, scans, signs, and promotes the same image
- Rollout strategy supports canary/blue-green and quick rollback
- Staging is prod-like; smoke tests exist
- Backups and restore drills done (if stateful)
- On-call, runbooks, and incident process in place
21) Example: end-to-end commands for a release
Below is a realistic sequence you can adapt. Replace registry/service names as needed.
# 1) Test
make test
# 2) Build with metadata
export DOCKER_BUILDKIT=1
VERSION="1.2.3"
GIT_SHA="$(git rev-parse HEAD)"
docker build \
--build-arg VCS_REF="$GIT_SHA" \
--build-arg BUILD_DATE="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
-t registry.example.com/myorg/service:$VERSION \
-t registry.example.com/myorg/service:${GIT_SHA:0:12} \
.
# 3) Scan
trivy image --ignore-unfixed --severity HIGH,CRITICAL \
registry.example.com/myorg/service:$VERSION
# 4) SBOM
syft registry.example.com/myorg/service:$VERSION -o spdx-json > sbom.spdx.json
# 5) Push
docker push registry.example.com/myorg/service:$VERSION
docker push registry.example.com/myorg/service:${GIT_SHA:0:12}
# 6) Sign (keyless; Cosign 2.x verification needs identity flags matching your CI)
cosign sign --yes registry.example.com/myorg/service:$VERSION
cosign verify \
  --certificate-identity-regexp 'https://github.com/myorg/.*' \
  --certificate-oidc-issuer https://token.actions.githubusercontent.com \
  registry.example.com/myorg/service:$VERSION
# 7) Deploy (example placeholder)
# Your deploy command depends on platform (Kubernetes/Swarm/nomad/custom)
# Ensure you deploy the immutable tag, not 'latest'.
echo "Deploy registry.example.com/myorg/service:$VERSION"
# 8) Post-deploy smoke test
curl -fsS https://service.example.com/readyz
curl -fsS https://service.example.com/api/version
22) Common production pitfalls (and how to avoid them)
- Deploying latest
  - Fix: Use immutable tags; promote the same digest across environments.
- Health checks that hit the database
  - Fix: Keep liveness checks process-only; readiness can check dependencies but must be fast and cached.
- No timeouts
  - Fix: Set explicit timeouts on servers and clients; enforce deadlines across request chains.
- Over-permissive containers
  - Fix: Non-root, drop capabilities, read-only FS, no-new-privileges, and MAC policies.
- Logs with secrets/PII
  - Fix: Redaction, structured logging, and strict review of log fields.
- No rollback plan
  - Fix: Blue/green or canary plus one-command rollback; keep previous versions available.
23) What “done” looks like
A Dockerized microservice is production-ready when:
- You can rebuild the same artifact deterministically and prove what code it came from.
- You can deploy safely with controlled rollouts and fast rollbacks.
- The service is secure by default (least privilege, scanned, signed).
- You can detect issues quickly (metrics/logs/traces) and respond with documented runbooks.
- You have tested failure modes (dependency outages, load, restarts) and the service degrades predictably.
Use the checklist sections above as gating criteria in your CI/CD pipeline and as a recurring audit (monthly/quarterly). Production readiness is not a one-time milestone—it is an operational habit.