DevOps Best Practices: CI/CD, Infrastructure as Code, and Observability
DevOps is not a toolchain—it’s an operating model for delivering software safely and quickly. The most effective DevOps programs consistently invest in three pillars:
- CI/CD (how code becomes running software)
- Infrastructure as Code (IaC) (how environments are created and changed)
- Observability (how you understand and improve behavior in production)
This tutorial walks through best practices in each pillar with deep explanations and real commands you can run. Examples assume Linux/macOS shells; Windows users can use WSL.
1) CI/CD Best Practices (Continuous Integration / Continuous Delivery)
1.1 What “good CI” actually means
Continuous Integration is not “we run tests sometimes.” It means:
- Developers integrate to the mainline frequently (ideally daily).
- Every change is validated by an automated pipeline.
- The pipeline is fast enough to be used continuously.
- Failures are treated as urgent because they block safe delivery.
Key outcomes:
- Reduced merge conflicts
- Higher confidence in main branch
- Faster feedback loops (bugs found minutes after introduction, not weeks later)
Practical CI principles
- Trunk-based development: short-lived branches, frequent merges to main.
- Small changes: keep PRs small to reduce risk and review time.
- Test pyramid: many unit tests, fewer integration tests, minimal end-to-end tests.
- Deterministic builds: pin dependencies; avoid “works on my machine.”
Commands for deterministic dependency installs:
Node.js:
npm ci
Python:
python -m venv .venv
source .venv/bin/activate
# requirements.txt must include hashes (e.g., generated with pip-compile --generate-hashes)
pip install -r requirements.txt --require-hashes
Go:
go mod download
go test ./...
1.2 Build once, promote the same artifact
A common anti-pattern is rebuilding the application separately for staging and production. That introduces “it passed staging but prod is different” failures.
Best practice: build a single immutable artifact (container image, JAR, binary) and promote it across environments by changing configuration, not code.
Example: build a Docker image once and tag it with the Git commit SHA.
GIT_SHA="$(git rev-parse --short HEAD)"
docker build -t myapp:${GIT_SHA} .
docker tag myapp:${GIT_SHA} registry.example.com/myapp:${GIT_SHA}
docker push registry.example.com/myapp:${GIT_SHA}
Then deploy that exact tag to dev/staging/prod.
1.3 Pipeline stages that map to risk
A robust pipeline typically has stages like:
- Lint + static checks (fast, cheap)
- Unit tests (fast)
- Build artifact (repeatable)
- Security scanning (dependencies, container image)
- Integration tests (slower, higher confidence)
- Deploy to staging (automated)
- Smoke tests (validate basic behavior)
- Progressive delivery to prod (canary/blue-green)
- Post-deploy verification (SLIs/SLOs, error budgets)
Example commands for common checks
Linting (JavaScript/TypeScript):
npm run lint
Unit tests with coverage:
npm test -- --coverage
Python formatting + linting:
python -m pip install ruff black
ruff check .
black --check .
Container image vulnerability scan (Trivy):
trivy image --severity HIGH,CRITICAL registry.example.com/myapp:${GIT_SHA}
Dependency vulnerability scan (Node):
npm audit --audit-level=high
1.4 Secrets management in CI/CD
Never store secrets in source control or bake them into images. Use:
- CI secret stores (GitHub Actions Secrets, GitLab CI variables, Jenkins credentials)
- Dedicated secret managers (Vault, AWS Secrets Manager, GCP Secret Manager)
- Short-lived credentials via OIDC where possible
Anti-pattern: export AWS_SECRET_ACCESS_KEY=... in scripts committed to repo.
Better: use OIDC to obtain cloud credentials at runtime. For AWS, many CI systems can assume a role without long-lived keys.
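For example, a minimal sketch of exchanging a CI-provided OIDC token for short-lived AWS credentials (the token variable, role ARN, and account ID are assumptions; most CI systems offer a built-in integration that handles this step for you):
# $CI_OIDC_TOKEN is a hypothetical variable holding the OIDC token issued by your CI system
aws sts assume-role-with-web-identity \
  --role-arn arn:aws:iam::123456789012:role/ci-deploy \
  --role-session-name "ci-${GIT_SHA}" \
  --web-identity-token "$CI_OIDC_TOKEN" \
  --duration-seconds 900
The call returns temporary credentials that expire on their own, so there is nothing long-lived to leak or rotate.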
1.5 Progressive delivery: canary and blue/green
Deploying to production doesn’t have to be “all at once.”
- Blue/Green: maintain two identical environments (blue = live, green = new). Switch traffic when green is validated.
- Canary: roll out to a small percentage of users/traffic, observe metrics, then expand.
Why it matters: it reduces blast radius and makes rollback safer.
A simple Kubernetes canary approach might use two Deployments and a Service selector shift, or a service mesh/ingress controller with weighted routing. Even without a mesh, you can do controlled rollouts with Kubernetes’ rolling updates and careful monitoring.
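As a concrete sketch of the two-Deployment approach (the deployment names and replica counts are assumptions; a plain Service spreads traffic roughly in proportion to ready pods):
# Both Deployments carry the Service's selector label (e.g., app=myapp)
kubectl -n prod set image deploy/myapp-canary myapp=registry.example.com/myapp:${GIT_SHA}
kubectl -n prod scale deploy/myapp-canary --replicas=1    # ~10% of traffic if stable runs 9 replicas
# Watch error rate and latency, then expand the canary or abort
kubectl -n prod scale deploy/myapp-canary --replicas=5
kubectl -n prod scale deploy/myapp-canary --replicas=0    # abort: remove canary pods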
1.6 Rollback strategy: plan it before you need it
A rollback should not be improvised (“git revert and redeploy”) in the middle of an incident. You want a fast, predictable action:
- Roll back to the previous known-good artifact tag
- Roll back database changes safely (or use forward-only migrations)
- Keep a runbook: who does what, which commands, what validation
Kubernetes rollback example:
kubectl rollout history deploy/myapp
kubectl rollout undo deploy/myapp --to-revision=12
kubectl rollout status deploy/myapp
1.7 CI/CD design patterns that scale
- Pipeline as code: version your pipeline definitions.
- Reusable steps: shared scripts/actions to avoid copy-paste.
- Caching: speed matters; cache dependencies and build layers.
- Parallelization: run test suites in parallel.
- Fail fast: stop early on lint/test failures.
- Quality gates: require passing checks before merge.
Example: Docker build caching with BuildKit
export DOCKER_BUILDKIT=1
docker build --progress=plain -t myapp:${GIT_SHA} .
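If CI runners are ephemeral, you can keep the layer cache in the registry with buildx (a sketch; the buildcache tag is an assumption):
docker buildx build \
  --cache-from type=registry,ref=registry.example.com/myapp:buildcache \
  --cache-to type=registry,ref=registry.example.com/myapp:buildcache,mode=max \
  -t registry.example.com/myapp:${GIT_SHA} \
  --push .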
2) Infrastructure as Code (IaC) Best Practices
2.1 Why IaC is more than “automation”
IaC is the practice of managing infrastructure through code and version control. The real benefits are:
- Repeatability: recreate environments reliably
- Auditability: changes are reviewed, tracked, and attributable
- Safety: reduce manual, error-prone console clicking
- Scalability: manage many environments consistently
A mature IaC workflow treats infrastructure changes like application changes:
- pull requests
- automated validation
- plan review
- controlled apply
- post-change verification
2.2 Choose the right IaC tool and model
Common approaches:
- Terraform/OpenTofu: declarative, multi-cloud, strong ecosystem
- CloudFormation: AWS-native, deep integration
- Pulumi: IaC using general-purpose languages
- Ansible: configuration management, orchestration; best for OS/app config more than cloud primitives
- Kubernetes manifests/Helm/Kustomize: for cluster resources
You can mix tools, but do so intentionally and document boundaries (e.g., Terraform provisions EKS and networking; Helm deploys apps).
2.3 Terraform/OpenTofu workflow: validate → plan → apply
Install OpenTofu (a Terraform-compatible fork) or Terraform. The commands below use tofu; substitute terraform and the workflow is the same.
Initialize:
tofu init
Format and validate:
tofu fmt -recursive
tofu validate
Plan (review the diff):
tofu plan -out=tfplan
Apply the reviewed plan:
tofu apply tfplan
Best practice: remote state + locking
Local state files don’t scale and can corrupt easily. Use remote state with locking (e.g., S3 + DynamoDB, Terraform Cloud, GCS).
Why locking matters: it prevents two engineers/pipelines from applying changes concurrently and corrupting state.
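For example, assuming your configuration declares an s3 backend block, you can supply the backend details at init time (bucket, key, region, and table names are assumptions):
tofu init \
  -backend-config="bucket=my-tf-state" \
  -backend-config="key=envs/prod/terraform.tfstate" \
  -backend-config="region=eu-west-1" \
  -backend-config="dynamodb_table=tf-state-lock"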
2.4 Structure: modules, environments, and boundaries
A common, scalable structure:
- modules/: reusable building blocks (VPC, cluster, database)
- envs/dev, envs/staging, envs/prod: compositions that use those modules
Guidelines:
- Keep modules small and focused.
- Version modules (git tags or registry versions).
- Avoid environment-specific logic inside modules; pass variables instead.
- Keep blast radius small: separate state per environment and sometimes per component.
2.5 Immutable infrastructure vs configuration drift
Configuration drift happens when reality differs from code (manual console edits, ad-hoc changes). IaC reduces drift, but only if you enforce:
- No manual changes (or document break-glass procedures)
- Frequent reconciliation (plan regularly)
- Access controls (limit who can change infra outside CI)
Detect drift:
tofu plan
If the plan shows unexpected changes, investigate:
- Was something changed manually?
- Did a cloud provider default change?
- Did a module update alter behavior?
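A scheduled pipeline can turn this into automated drift detection using the plan exit code (a sketch; -detailed-exitcode returns 2 when changes are pending):
tofu plan -detailed-exitcode -out=tfplan
rc=$?
if [ "$rc" -eq 2 ]; then
  echo "Drift or pending changes detected: review tfplan"
elif [ "$rc" -ne 0 ]; then
  echo "Plan failed"; exit "$rc"
fi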
2.6 IaC testing and policy enforcement
Treat infrastructure code as testable:
- Static checks: formatting, linting, security scanning
- Policy as code: enforce rules like “no public S3 buckets,” “encryption required,” “approved regions only”
- Integration tests: create ephemeral environments and validate behavior
Security scanning for Terraform
Using tfsec (or equivalents):
tfsec .
Using checkov:
checkov -d .
Policy as code with OPA (conceptual)
OPA (Open Policy Agent) can evaluate plans against policies. In practice, you export a plan to JSON and evaluate it.
Terraform/OpenTofu plan to JSON:
tofu show -json tfplan > tfplan.json
Then evaluate with OPA (example, policy not included here):
opa eval --data policy.rego --input tfplan.json "data.iac.deny"
2.7 Secrets in IaC: what not to do
Never put secrets in:
- Terraform variables committed to git
- .tfvars files committed to the repo
- user data scripts in plaintext
- container images
Instead:
- Reference secret manager ARNs/paths
- Inject secrets at runtime (Kubernetes secrets from external secret stores)
- Use encryption (KMS) and access controls
For example, store a database password in AWS Secrets Manager and have the app retrieve it using IAM permissions rather than embedding it.
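A sketch of the retrieval side, assuming the secret name shown and that the caller's IAM role is allowed secretsmanager:GetSecretValue:
aws secretsmanager get-secret-value \
  --secret-id prod/myapp/db-password \
  --query SecretString --output text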
2.8 Database migrations: the hardest part of delivery
Infrastructure changes often involve data. The safest approach is usually:
- Backward-compatible migrations (expand/contract pattern)
- Deploy code that can handle both old and new schema
- Migrate data
- Remove old schema later
A simple example with a SQL migration tool might be:
# Example using Flyway (conceptual)
flyway -url="jdbc:postgresql://db.example.com:5432/app" \
-user="$DB_USER" -password="$DB_PASS" migrate
Best practice: run migrations as part of deployment with clear ownership and rollback strategy (often forward-only with compensating migrations).
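As an illustration of the expand phase (table and column names are hypothetical; the contract step that drops the old column happens only after every code path uses the new one):
# Expand: add the new column without touching the old one
psql "$DATABASE_URL" -c "ALTER TABLE orders ADD COLUMN customer_email text;"
# Backfill in batches, deploy code that reads/writes both columns,
# then drop the old column in a later, separate migration (contract)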
3) Observability Best Practices (Metrics, Logs, Traces)
3.1 Observability vs monitoring
Monitoring tells you known failure modes (CPU high, disk full). Observability helps you understand unknown failure modes by making systems explain themselves.
Observability is typically built from three signals:
- Metrics (aggregated numbers over time)
- Logs (discrete events with context)
- Traces (end-to-end request flow across services)
A fourth signal often included:
- Profiles (CPU/memory profiling over time)
3.2 Start with SLIs and SLOs (not dashboards)
Dashboards are useful, but the best teams start with:
- SLI (Service Level Indicator): what you measure (e.g., request success rate)
- SLO (Service Level Objective): target (e.g., 99.9% success rate over 30 days)
- Error budget: allowed unreliability (100% - SLO)
Example SLI/SLO:
- SLI: proportion of HTTP 2xx/3xx responses
- SLO: 99.9% over 30 days
- Error budget: 0.1% failures allowed (roughly 43 minutes of full downtime in a 30-day window)
Why this matters: it aligns engineering work with user experience and provides a rational basis for release velocity. If you’re burning error budget too fast, slow down releases and focus on reliability.
3.3 Instrumentation: make telemetry consistent
Logging best practices
- Use structured logs (JSON) rather than free-form strings.
- Include correlation IDs (request ID, trace ID).
- Avoid logging secrets or sensitive data.
- Log at appropriate levels (INFO for business events, WARN for recoverable issues, ERROR for failures).
Example: structured logging from an app might emit:
{"level":"info","msg":"order_created","order_id":"123","user_id":"456","trace_id":"abc..."}
Metrics best practices
- Use counters for events (requests, errors)
- Use histograms for latency (p50/p95/p99)
- Use gauges for current values (queue depth, memory)
Avoid high-cardinality labels (like raw user IDs) in metrics; they can explode cost and degrade performance.
Tracing best practices
- Propagate context across service boundaries.
- Sample intelligently (head-based or tail-based).
- Record spans around critical operations: DB calls, external APIs, queue operations.
3.4 OpenTelemetry: a practical standard
OpenTelemetry (OTel) is the de facto standard for generating and exporting telemetry.
A common architecture:
- App emits OTel metrics/logs/traces
- OTel Collector receives, batches, enriches, exports
- Backend stores/visualizes (Prometheus, Grafana, Tempo, Loki, Jaeger, etc.)
Run an OpenTelemetry Collector locally (example). The collector container needs a config file; the full config is not shown here, but the command looks like:
docker run --rm -p 4317:4317 -p 4318:4318 \
-v "$(pwd)/otel-collector-config.yaml:/etc/otelcol/config.yaml" \
otel/opentelemetry-collector:latest \
--config /etc/otelcol/config.yaml
Your app can then export OTLP to:
- http://localhost:4318 (OTLP HTTP)
- localhost:4317 (OTLP gRPC)
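Many OTel SDKs read the exporter endpoint from standard environment variables, so a minimal local setup might be:
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"
export OTEL_SERVICE_NAME="myapp"
# Start your instrumented app; telemetry flows to the local collector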
3.5 Prometheus metrics: concrete queries and checks
If you use Prometheus-style metrics, you’ll typically define alerts and dashboards using PromQL.
Examples:
Request rate:
sum(rate(http_requests_total[5m]))
Error rate:
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
p95 latency (histogram):
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Best practice: alert on symptoms (user impact), not causes. For example:
- Symptom: elevated 5xx error rate, high latency
- Cause: CPU high, DB connections exhausted (useful for debugging but not always paging)
3.6 Logs: from “search” to “investigation”
Logs are most valuable when they are:
- Queryable (structured fields)
- Correlated (trace IDs, request IDs)
- Retained appropriately (cost vs compliance)
- Redacted (no secrets)
If you’re using a tool like Loki, Elasticsearch, or a cloud logging service, you’ll often query by fields.
Example grep-based local investigation (real command):
# Find errors in the last 2000 lines of a container log file
tail -n 2000 /var/log/myapp.log | grep -i "error"
Follow logs in Kubernetes:
kubectl logs -n prod deploy/myapp -f --tail=200
Include timestamps:
kubectl logs -n prod deploy/myapp --timestamps --tail=200
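If the application emits structured JSON logs (as in 3.3), you can filter by field locally with jq (a sketch; the level field name is an assumption):
kubectl -n prod logs deploy/myapp --tail=2000 \
  | jq -cR 'fromjson? | select(.level == "error")'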
3.7 Traces: reduce MTTR dramatically
Distributed tracing answers questions like:
- Which service is slow?
- Is latency in DB, cache, or an external API?
- What percentage of requests hit a degraded dependency?
If you have trace IDs in logs, you can pivot:
- Alert fires (latency high)
- Find trace exemplars (slow traces)
- Identify the slow span (e.g., DB query)
- Jump to logs for that trace ID
- Mitigate and confirm via metrics
Best practice: ensure consistent propagation of trace context across:
- HTTP headers (traceparent)
- messaging systems (inject/extract context)
- background jobs
3.8 Alerting: actionable, owned, and tested
Bad alerts create noise; noise creates missed incidents.
Good alerts are:
- Actionable: someone knows what to do
- Owned: there is a team responsible
- Routed: goes to the right on-call rotation
- Tested: you verify alerts fire when expected
Anti-pattern: alerting on CPU > 80% for 5 minutes for every service. Better: alert on high error rate or high latency relative to SLO.
Also implement:
- Deduplication
- Rate limiting
- Maintenance windows
- Runbooks linked in alerts
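One way to satisfy “tested” is to fire a synthetic alert into Alertmanager and confirm it reaches the right channel (a sketch, assuming amtool is installed and Alertmanager is reachable at the URL shown):
# Create a short-lived synthetic alert to verify routing and notifications end to end
amtool alert add TestAlert severity=page service=myapp \
  --annotation=summary="Synthetic test alert - please ignore" \
  --alertmanager.url=http://alertmanager.example.com:9093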
3.9 Incident response: observability meets process
When incidents happen, observability should support a clear workflow:
- Detect (alerts)
- Triage (is it real user impact?)
- Mitigate (rollback, feature flag off, scale, failover)
- Communicate (status page, internal updates)
- Learn (postmortem with action items)
Practical mitigation commands (Kubernetes examples):
Scale up temporarily:
kubectl -n prod scale deploy/myapp --replicas=10
kubectl -n prod rollout status deploy/myapp
Check resource usage:
kubectl -n prod top pods
kubectl -n prod top nodes
Describe a pod to see events:
kubectl -n prod describe pod <pod-name>
4) Putting It Together: A Reference Delivery Workflow
This section ties CI/CD, IaC, and observability into one coherent practice.
4.1 A typical change lifecycle
- Developer creates a small PR.
- CI runs lint/unit tests quickly.
- CI builds an immutable artifact and scans it.
- Merge to main triggers:
- integration tests
- deployment to staging
- smoke tests
- Promotion to production:
- progressive rollout (canary)
- automated verification against SLIs
- Observability confirms health; release is completed.
- If SLIs degrade, rollout is halted and rolled back.
4.2 Verification after deploy (real checks)
Check HTTP endpoint:
curl -fsS https://myapp.example.com/health
Check a key user journey (simple smoke):
curl -fsS https://myapp.example.com/api/version
curl -fsS -X POST https://myapp.example.com/api/login \
-H 'content-type: application/json' \
-d '{"username":"smoke","password":"smoke"}'
Check Kubernetes rollout:
kubectl -n prod rollout status deploy/myapp --timeout=120s
Check recent errors:
kubectl -n prod logs deploy/myapp --tail=200 | grep -E "ERROR|Exception" || true
5) Security and Compliance as DevOps Multipliers (DevSecOps)
Security is not a gate at the end; it’s integrated into delivery.
5.1 Supply chain security
- Pin dependencies and verify integrity
- Generate SBOMs (Software Bill of Materials)
- Sign artifacts
- Enforce provenance
Generate an SBOM for a container image (Syft):
syft registry.example.com/myapp:${GIT_SHA} -o spdx-json > sbom.spdx.json
Sign an image (cosign):
cosign sign registry.example.com/myapp:${GIT_SHA}
Verify the signature (with recent cosign releases, keyless verification also requires --certificate-identity and --certificate-oidc-issuer flags):
cosign verify registry.example.com/myapp:${GIT_SHA}
5.2 Least privilege everywhere
- CI should have only the permissions it needs.
- Production access should be time-bound and audited.
- Use separate accounts/projects/subscriptions for environments.
6) Common Pitfalls and How to Avoid Them
Pitfall: “We have CI/CD” but deployments are still scary
Cause: poor test coverage, no progressive delivery, manual steps. Fix: invest in test strategy, canaries, automated verification, and rollbacks.
Pitfall: IaC exists but people still click in the console
Cause: missing features in code, slow pipeline, unclear ownership. Fix: define a break-glass process, improve IaC coverage, shorten feedback loops.
Pitfall: Lots of dashboards but no one knows what matters
Cause: no SLIs/SLOs, alert fatigue. Fix: define SLOs, alert on symptoms, link runbooks, measure error budget.
Pitfall: Observability is too expensive
Cause: high-cardinality metrics, verbose logs, too much retention. Fix: reduce cardinality, sample traces, structure logs, set retention tiers.
7) A Practical “Start Here” Checklist
If you want a concrete sequence to implement:
CI/CD
- Enforce main is always green (fix pipeline failures immediately)
- Build once, tag with commit SHA, promote the same artifact
- Add security scanning (dependencies + images)
- Add progressive delivery (canary or blue/green)
- Implement fast rollback (document and test it)
Infrastructure as Code
- Remote state + locking
- Separate state per environment
- Module boundaries and versioning
- Policy checks in CI (no public resources, encryption required)
- Drift detection (scheduled plans)
Observability
- Define SLIs/SLOs for critical services
- Structured logs with trace IDs
- Metrics for golden signals (latency, traffic, errors, saturation)
- Distributed tracing across services
- Alerts that are actionable and tied to runbooks
8) Example: A Minimal End-to-End Flow (Commands You Can Adapt)
This is a simplified flow you can run in a real project.
8.1 Local pre-flight checks
git checkout -b feature/small-change
npm ci
npm run lint
npm test
8.2 Build and scan a container
GIT_SHA="$(git rev-parse --short HEAD)"
docker build -t myapp:${GIT_SHA} .
trivy image myapp:${GIT_SHA}
8.3 Push and deploy (conceptual)
docker tag myapp:${GIT_SHA} registry.example.com/myapp:${GIT_SHA}
docker push registry.example.com/myapp:${GIT_SHA}
Deploy to Kubernetes by updating the image (example):
kubectl -n staging set image deploy/myapp myapp=registry.example.com/myapp:${GIT_SHA}
kubectl -n staging rollout status deploy/myapp
Smoke test:
curl -fsS https://staging-myapp.example.com/health
Promote to production (same image tag):
kubectl -n prod set image deploy/myapp myapp=registry.example.com/myapp:${GIT_SHA}
kubectl -n prod rollout status deploy/myapp
Verify:
curl -fsS https://myapp.example.com/health
kubectl -n prod logs deploy/myapp --tail=100
Rollback if needed:
kubectl -n prod rollout undo deploy/myapp
kubectl -n prod rollout status deploy/myapp
9) Closing Guidance: Optimize for Learning Speed
The best DevOps organizations optimize for learning speed:
- CI/CD shortens feedback loops.
- IaC makes environments reproducible and changes reviewable.
- Observability turns production into a source of truth, not mystery.
If you implement only one meta-practice: make every change small, observable, and reversible. That single idea drives safer releases, faster incident recovery, and a more sustainable engineering culture.