
DevOps Best Practices: CI/CD, Infrastructure as Code, and Observability

Tags: devops, cicd, infrastructure-as-code, observability, sre


DevOps is not a toolchain—it’s an operating model for delivering software safely and quickly. The most effective DevOps programs consistently invest in three pillars:

  1. CI/CD (how code becomes running software)
  2. Infrastructure as Code (IaC) (how environments are created and changed)
  3. Observability (how you understand and improve behavior in production)

This tutorial walks through best practices in each pillar with deep explanations and real commands you can run. Examples assume Linux/macOS shells; Windows users can use WSL.


1) CI/CD Best Practices (Continuous Integration / Continuous Delivery)

1.1 What “good CI” actually means

Continuous Integration is not “we run tests sometimes.” It means every change is merged into a shared mainline frequently (at least daily), every merge triggers an automated build and test run, and a broken build is fixed before new work continues.

Key outcomes: fast feedback on every change, a mainline that is always releasable, and small diffs that are easy to review and easy to revert.

Practical CI principles: keep the pipeline fast (minutes, not hours), make builds reproducible, and pin dependencies so installs are deterministic.

Commands for deterministic dependency installs:

Node.js:

npm ci

Python:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt --require-hashes

Go:

go mod download
go test ./...

1.2 Build once, promote the same artifact

A common anti-pattern is rebuilding the application separately for staging and production. That introduces “it passed staging but prod is different” failures.

Best practice: build a single immutable artifact (container image, JAR, binary) and promote it across environments by changing configuration, not code.

Example: build a Docker image once and tag it with the Git commit SHA.

GIT_SHA="$(git rev-parse --short HEAD)"
docker build -t myapp:${GIT_SHA} .
docker tag myapp:${GIT_SHA} registry.example.com/myapp:${GIT_SHA}
docker push registry.example.com/myapp:${GIT_SHA}

Then deploy that exact tag to dev/staging/prod.


1.3 Pipeline stages that map to risk

A robust pipeline typically has stages like:

  1. Lint + static checks (fast, cheap)
  2. Unit tests (fast)
  3. Build artifact (repeatable)
  4. Security scanning (dependencies, container image)
  5. Integration tests (slower, higher confidence)
  6. Deploy to staging (automated)
  7. Smoke tests (validate basic behavior)
  8. Progressive delivery to prod (canary/blue-green)
  9. Post-deploy verification (SLIs/SLOs, error budgets)

Example commands for common checks

Linting (JavaScript/TypeScript):

npm run lint

Unit tests with coverage:

npm test -- --coverage

Python formatting + linting:

python -m pip install ruff black
ruff check .
black --check .

Container image vulnerability scan (Trivy):

trivy image --severity HIGH,CRITICAL registry.example.com/myapp:${GIT_SHA}

Dependency vulnerability scan (Node):

npm audit --audit-level=high

1.4 Secrets management in CI/CD

Never store secrets in source control or bake them into images. Use your CI system’s secret store for pipeline credentials, a dedicated secrets manager (such as Vault or AWS Secrets Manager) for application secrets, and short-lived credentials issued at runtime wherever possible.

Anti-pattern: export AWS_SECRET_ACCESS_KEY=... in scripts committed to the repo.

Better: use OIDC to obtain cloud credentials at runtime. For AWS, many CI systems can assume a role without long-lived keys.
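
A minimal sketch of that exchange using the AWS CLI; ROLE_ARN and OIDC_TOKEN are placeholders for values your CI system would provide (GitHub Actions, for example, can issue an ID token to the job):

# Exchange a CI-issued OIDC token for short-lived AWS credentials (no stored keys)
aws sts assume-role-with-web-identity \
  --role-arn "$ROLE_ARN" \
  --role-session-name "ci-deploy" \
  --web-identity-token "$OIDC_TOKEN" \
  --duration-seconds 900

The response contains temporary credentials (access key, secret key, session token) that expire on their own.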


1.5 Progressive delivery: canary and blue/green

Deploying to production doesn’t have to be “all at once.” A canary release routes a small slice of traffic to the new version first and expands only if it stays healthy; a blue/green deployment runs old and new side by side and switches traffic once the new version is verified.

Why it matters: both approaches reduce blast radius and make rollback safer.

A simple Kubernetes canary approach might use two Deployments and a Service selector shift, or a service mesh/ingress controller with weighted routing. Even without a mesh, you can do controlled rollouts with Kubernetes’ rolling updates and careful monitoring.
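
A hedged sketch of the two-Deployment approach; the names myapp-stable and myapp-canary are hypothetical, and both Deployments are assumed to share the Service selector (app: myapp) so traffic splits roughly in proportion to ready replicas:

# Point the canary Deployment at the new image and give it a small share of traffic
kubectl -n prod set image deploy/myapp-canary myapp=registry.example.com/myapp:${GIT_SHA}
kubectl -n prod scale deploy/myapp-canary --replicas=1   # ~10% if myapp-stable runs 9 replicas
# Watch error rate and latency before increasing the canary's replica count.
# To abort, take the canary out of rotation:
kubectl -n prod scale deploy/myapp-canary --replicas=0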


1.6 Rollback strategy: plan it before you need it

A rollback should not be improvised as “git revert and redeploy” in the middle of an incident. You want a fast, predictable action: redeploy the previous known-good artifact, undo the rollout, or shift traffic back to the old version.

Kubernetes rollback example:

kubectl rollout history deploy/myapp
kubectl rollout undo deploy/myapp --to-revision=12
kubectl rollout status deploy/myapp

1.7 CI/CD design patterns that scale

As pipelines multiply, the patterns that keep them fast and maintainable are aggressive caching (dependencies and image layers), running independent stages in parallel, and sharing pipeline templates across repositories instead of copy-pasting configuration.

Example: Docker build caching with BuildKit

export DOCKER_BUILDKIT=1
docker build --progress=plain -t myapp:${GIT_SHA} .
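
For caching across CI runs, a registry-backed cache is a common next step. This sketch assumes a buildx builder using the docker-container driver (the default docker driver does not support registry cache export); the buildcache tag is a placeholder:

docker buildx build \
  --cache-from type=registry,ref=registry.example.com/myapp:buildcache \
  --cache-to type=registry,ref=registry.example.com/myapp:buildcache,mode=max \
  -t registry.example.com/myapp:${GIT_SHA} \
  --push .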

2) Infrastructure as Code (IaC) Best Practices

2.1 Why IaC is more than “automation”

IaC is the practice of managing infrastructure through code and version control. The real benefits are reproducible environments, peer-reviewed and auditable changes, and the ability to rebuild quickly after a failure.

A mature IaC workflow treats infrastructure changes like application changes: small pull requests, automated validation and a reviewable plan in CI, human approval, and a controlled apply.


2.2 Choose the right IaC tool and model

Common approaches:

You can mix tools, but do so intentionally and document boundaries (e.g., Terraform provisions EKS and networking; Helm deploys apps).


2.3 Terraform/OpenTofu workflow: validate → plan → apply

Install OpenTofu (a Terraform-compatible fork) or Terraform. The examples below use the tofu binary; substitute terraform and the workflow is the same.

Initialize:

tofu init

Format and validate:

tofu fmt -recursive
tofu validate

Plan (review the diff):

tofu plan -out=tfplan

Apply the reviewed plan:

tofu apply tfplan

Best practice: remote state + locking

Local state files don’t scale and can corrupt easily. Use remote state with locking (e.g., S3 + DynamoDB, Terraform Cloud, GCS).

Why locking matters: it prevents two engineers/pipelines from applying changes concurrently and corrupting state.
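
A minimal S3 + DynamoDB backend, written here as a heredoc purely for illustration; the bucket, key, and table names are placeholders:

cat > backend.tf <<'EOF'
terraform {
  backend "s3" {
    bucket         = "my-org-tfstate"
    key            = "prod/networking/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "tfstate-locks"
    encrypt        = true
  }
}
EOF
tofu init   # re-initialize after changing the backend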


2.4 Structure: modules, environments, and boundaries

A common, scalable structure separates reusable modules from per-environment root modules: modules/ holds shared building blocks, and environments/dev, environments/staging, and environments/prod each compose those modules with their own variables, backend, and state (see the sketch below).

Guidelines: keep modules small and focused, pin module and provider versions, give each environment its own state, and avoid dependencies that reach across environments.

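A minimal sketch of that layout, expressed as the commands that would create it (directory names are illustrative):

# Shared, reusable building blocks
mkdir -p modules/network modules/eks modules/app
# One small root module per environment, each with its own backend, variables, and state
mkdir -p environments/dev environments/staging environments/prod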

2.5 Immutable infrastructure vs configuration drift

Configuration drift happens when reality differs from code (manual console edits, ad-hoc changes). IaC reduces drift, but only if you enforce that all changes go through code, console access is read-only or break-glass, and plans are run regularly to surface divergence.

Detect drift:

tofu plan
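
In scheduled drift checks it helps to use the detailed exit code so automation can fail loudly when drift exists:

tofu plan -detailed-exitcode
# exit code 0: no changes, 1: error, 2: changes (drift or pending work) detected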

If the plan shows unexpected changes, investigate whether someone changed the resource manually, whether a provider upgrade or default changed behavior, or whether state is stale, and then fold the correction back into code.


2.6 IaC testing and policy enforcement

Treat infrastructure code as testable: run static analysis and security scanners in CI, enforce policies against plan output before apply, and, where practical, exercise modules in ephemeral environments.

Security scanning for Terraform

Using tfsec (or equivalents):

tfsec .

Using checkov:

checkov -d .

Policy as code with OPA (conceptual)

OPA (Open Policy Agent) can evaluate plans against policies. In practice, you export a plan to JSON and evaluate it.

Terraform/OpenTofu plan to JSON:

tofu show -json tfplan > tfplan.json

Then evaluate with OPA (example, policy not included here):

opa eval --data policy.rego --input tfplan.json "data.iac.deny"

2.7 Secrets in IaC: what not to do

Never put secrets in: .tf files, committed tfvars files, module defaults, or outputs. Remember that Terraform state can also contain secret values in plain text, so treat state itself as sensitive.

Instead: keep secrets in a dedicated secrets manager and let workloads fetch them at runtime using their own identity and permissions.

For example, store a database password in AWS Secrets Manager and have the app retrieve it using IAM permissions rather than embedding it.
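
A hedged sketch with the AWS CLI; the secret name and variables are placeholders, and in practice the application (not the pipeline) reads the value at runtime via its IAM role:

# Create the secret once, outside of Terraform state
aws secretsmanager create-secret --name prod/app/db-password --secret-string "$DB_PASS"
# What the app's runtime fetch looks like (requires secretsmanager:GetSecretValue on this secret)
aws secretsmanager get-secret-value --secret-id prod/app/db-password \
  --query SecretString --output text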


2.8 Database migrations: the hardest part of delivery

Infrastructure changes often involve data. The safest approach is usually expand/contract: make additive, backward-compatible schema changes first, deploy application code that can work with both shapes, and only remove the old structure once nothing depends on it.

A simple example with a SQL migration tool might be:

# Example using Flyway (conceptual)
flyway -url="jdbc:postgresql://db.example.com:5432/app" \
       -user="$DB_USER" -password="$DB_PASS" migrate

Best practice: run migrations as part of deployment with clear ownership and rollback strategy (often forward-only with compensating migrations).
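
A hypothetical expand/contract sequence, shown with psql for illustration; the table and column names and DATABASE_URL are placeholders:

# Expand: additive, backward-compatible change shipped before the new code
psql "$DATABASE_URL" -c 'ALTER TABLE orders ADD COLUMN status TEXT;'
# ...deploy the application version that writes both the old and new columns...
# Contract: remove the old column only after nothing reads it anymore
psql "$DATABASE_URL" -c 'ALTER TABLE orders DROP COLUMN legacy_status;'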


3) Observability Best Practices (Metrics, Logs, Traces)

3.1 Observability vs monitoring

Monitoring tells you about known failure modes (CPU high, disk full). Observability helps you understand unknown failure modes by making systems explain themselves.

Observability is typically built from three signals: metrics (aggregated numeric measurements), logs (discrete events with context), and traces (the path of a request across services).

A fourth signal often included: events, such as deploys and configuration changes, which give the other signals context.


3.2 Start with SLIs and SLOs (not dashboards)

Dashboards are useful, but the best teams start with SLIs (indicators that measure what users actually experience) and SLOs (explicit targets for those indicators, with an error budget for falling short).

Example SLI/SLO: the SLI is the fraction of requests that succeed and complete within 300 ms; the SLO is that 99.9% of requests meet this over a rolling 30 days, which defines the error budget.

Why this matters: it aligns engineering work with user experience and provides a rational basis for release velocity. If you’re burning error budget too fast, slow down releases and focus on reliability.


3.3 Instrumentation: make telemetry consistent

Logging best practices: emit structured (JSON) logs, use consistent field names across services, include request and trace IDs so logs can be correlated, and keep log levels meaningful rather than logging everything.

Example: structured logging from an app might emit:

{"level":"info","msg":"order_created","order_id":"123","user_id":"456","trace_id":"abc..."}

Metrics best practices: cover the basics first (request rate, error rate, latency histograms, saturation), keep label sets consistent, and avoid high-cardinality labels (like raw user IDs) in metrics; they can explode cost and degrade performance.

Tracing best practices: propagate trace context across every service hop, create spans around meaningful units of work (handlers, database queries, external calls), and sample deliberately so cost stays predictable.


3.4 OpenTelemetry: a practical standard

OpenTelemetry (OTel) is the de facto standard for generating and exporting telemetry.

A common architecture: applications export OTLP to an OpenTelemetry Collector (running as an agent or gateway), and the Collector batches, enriches, and forwards telemetry to your metrics, logging, and tracing backends or to a vendor.

Run an OpenTelemetry Collector locally (example): the collector container needs a config file (a minimal sketch follows the command); with one in place, the real command looks like:

docker run --rm -p 4317:4317 -p 4318:4318 \
  -v "$(pwd)/otel-collector-config.yaml:/etc/otelcol/config.yaml" \
  otel/opentelemetry-collector:latest \
  --config /etc/otelcol/config.yaml
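
A minimal otel-collector-config.yaml for local experimentation might look like the sketch below; it assumes a recent collector image, listens for OTLP on both published ports, and only prints telemetry with the debug exporter (real deployments add batching and a real backend):

cat > otel-collector-config.yaml <<'EOF'
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
exporters:
  debug:
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
EOF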

Your app can then export OTLP to localhost:4317 (gRPC) or localhost:4318 (HTTP), matching the ports published above.


3.5 Prometheus metrics: concrete queries and checks

If you use Prometheus-style metrics, you’ll typically define alerts and dashboards using PromQL.

Examples:

Request rate:

sum(rate(http_requests_total[5m]))

Error rate:

sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

p95 latency (histogram):

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
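
The same queries can be run ad hoc against the Prometheus HTTP API, which is handy in smoke-test scripts; the Prometheus URL here is a placeholder:

curl -fsS 'http://prometheus.example.com/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'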

Best practice: alert on symptoms (user impact), not causes. For example, alert when the error rate or p95 latency breaches its SLO threshold rather than when CPU crosses an arbitrary line.


3.6 Logs: from “search” to “investigation”

Logs are most valuable when they are structured, consistently labeled (service, environment, version), correlated with traces via a trace ID, and retained long enough to investigate incidents.

If you’re using a tool like Loki, Elasticsearch, or a cloud logging service, you’ll often query by fields.

Example grep-based local investigation (real command):

# Find errors in the last 2000 lines of a container log file
tail -n 2000 /var/log/myapp.log | grep -i "error"

Follow logs in Kubernetes:

kubectl logs -n prod deploy/myapp -f --tail=200

Include timestamps:

kubectl logs -n prod deploy/myapp --timestamps --tail=200
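
If the application emits structured JSON logs (as in section 3.3), jq turns these streams into quick investigations; the field names follow the earlier example and are otherwise assumptions:

# Show message and trace ID for error-level entries (non-JSON lines will make jq complain)
kubectl logs -n prod deploy/myapp --tail=2000 \
  | jq -r 'select(.level == "error") | [.msg, .trace_id] | @tsv'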

3.7 Traces: reduce MTTR dramatically

Distributed tracing answers questions like: which service in the request path is slow, where did this error originate, and is the latency in our code or in a downstream dependency?

If you have trace IDs in logs, you can pivot:

  1. Alert fires (latency high)
  2. Find trace exemplars (slow traces)
  3. Identify the slow span (e.g., DB query)
  4. Jump to logs for that trace ID
  5. Mitigate and confirm via metrics

Best practice: ensure consistent propagation of trace context across HTTP and gRPC calls, message queues, and background jobs; a break anywhere in the chain fragments the trace.


3.8 Alerting: actionable, owned, and tested

Bad alerts create noise; noise creates missed incidents.

Good alerts are actionable (someone can do something now), tied to user impact, owned by a specific team, linked to a runbook, and tested so they actually fire when they should.

Anti-pattern: alerting on CPU > 80% for 5 minutes for every service. Better: alert on high error rate or high latency relative to SLO.

Also implement severity tiers (page vs. ticket), periodic review of noisy alerts, and runbooks maintained alongside the alert definitions.


3.9 Incident response: observability meets process

When incidents happen, observability should support a clear workflow: detect (alert on symptoms), triage (assess impact and scope), mitigate (roll back, scale, or fail over), verify (watch the SLIs recover), and learn (a blameless postmortem).

Practical mitigation commands (Kubernetes examples):

Scale up temporarily:

kubectl -n prod scale deploy/myapp --replicas=10
kubectl -n prod rollout status deploy/myapp

Check resource usage:

kubectl -n prod top pods
kubectl -n prod top nodes

Describe a pod to see events:

kubectl -n prod describe pod <pod-name>

4) Putting It Together: A Reference Delivery Workflow

This section ties CI/CD, IaC, and observability into one coherent practice.

4.1 A typical change lifecycle

  1. Developer creates a small PR.
  2. CI runs lint/unit tests quickly.
  3. CI builds an immutable artifact and scans it.
  4. Merge to main triggers:
    • integration tests
    • deployment to staging
    • smoke tests
  5. Promotion to production:
    • progressive rollout (canary)
    • automated verification against SLIs
  6. Observability confirms health; release is completed.
  7. If SLIs degrade, rollout is halted and rolled back.

4.2 Verification after deploy (real checks)

Check HTTP endpoint:

curl -fsS https://myapp.example.com/health

Check a key user journey (simple smoke):

curl -fsS https://myapp.example.com/api/version
curl -fsS -X POST https://myapp.example.com/api/login \
  -H 'content-type: application/json' \
  -d '{"username":"smoke","password":"smoke"}'

Check Kubernetes rollout:

kubectl -n prod rollout status deploy/myapp --timeout=120s

Check recent errors:

kubectl -n prod logs deploy/myapp --tail=200 | grep -E "ERROR|Exception" || true

5) Security and Compliance as DevOps Multipliers (DevSecOps)

Security is not a gate at the end; it’s integrated into delivery.

5.1 Supply chain security

Generate an SBOM for a container image (Syft):

syft registry.example.com/myapp:${GIT_SHA} -o spdx-json > sbom.spdx.json

Sign an image (cosign):

cosign sign registry.example.com/myapp:${GIT_SHA}

Verify signature (with keyless signing, cosign 2.x also expects --certificate-identity and --certificate-oidc-issuer flags):

cosign verify registry.example.com/myapp:${GIT_SHA}

5.2 Least privilege everywhere

Scope every credential narrowly: CI jobs deploy with short-lived, per-environment roles; services get their own identities with only the permissions they need; and humans use time-bound, audited access instead of shared admin accounts.

6) Common Pitfalls and How to Avoid Them

Pitfall: “We have CI/CD” but deployments are still scary

Cause: poor test coverage, no progressive delivery, manual steps. Fix: invest in test strategy, canaries, automated verification, and rollbacks.

Pitfall: IaC exists but people still click in the console

Cause: missing features in code, slow pipeline, unclear ownership. Fix: define a break-glass process, improve IaC coverage, shorten feedback loops.

Pitfall: Lots of dashboards but no one knows what matters

Cause: no SLIs/SLOs, alert fatigue. Fix: define SLOs, alert on symptoms, link runbooks, measure error budget.

Pitfall: Observability is too expensive

Cause: high-cardinality metrics, verbose logs, too much retention. Fix: reduce cardinality, sample traces, structure logs, set retention tiers.


7) A Practical “Start Here” Checklist

If you want a concrete sequence to implement:

CI/CD

Run lint and unit tests on every PR, build one immutable artifact per commit and scan it, automate deploys to staging with smoke tests, and make rollback a single, rehearsed command.

Infrastructure as Code

Put every environment in version control, use remote state with locking, review plan output in pull requests, and add security and policy scanning to CI.

Observability

Define one or two SLOs per user-facing service, emit structured logs with trace IDs, adopt OpenTelemetry for traces, and alert on symptoms against the SLOs.


8) Example: A Minimal End-to-End Flow (Commands You Can Adapt)

This is a simplified flow you can run in a real project.

8.1 Local pre-flight checks

git checkout -b feature/small-change
npm ci
npm run lint
npm test

8.2 Build and scan a container

GIT_SHA="$(git rev-parse --short HEAD)"
docker build -t myapp:${GIT_SHA} .
trivy image myapp:${GIT_SHA}

8.3 Push and deploy (conceptual)

docker tag myapp:${GIT_SHA} registry.example.com/myapp:${GIT_SHA}
docker push registry.example.com/myapp:${GIT_SHA}

Deploy to Kubernetes by updating the image (example):

kubectl -n staging set image deploy/myapp myapp=registry.example.com/myapp:${GIT_SHA}
kubectl -n staging rollout status deploy/myapp

Smoke test:

curl -fsS https://staging-myapp.example.com/health

Promote to production (same image tag):

kubectl -n prod set image deploy/myapp myapp=registry.example.com/myapp:${GIT_SHA}
kubectl -n prod rollout status deploy/myapp

Verify:

curl -fsS https://myapp.example.com/health
kubectl -n prod logs deploy/myapp --tail=100

Rollback if needed:

kubectl -n prod rollout undo deploy/myapp
kubectl -n prod rollout status deploy/myapp

9) Closing Guidance: Optimize for Learning Speed

The best DevOps organizations optimize for learning speed: short feedback loops, small batches of change, honest and blameless reviews of incidents, and delivery metrics that show whether things are actually improving.

If you implement only one meta-practice: make every change small, observable, and reversible. That single idea drives safer releases, faster incident recovery, and a more sustainable engineering culture.