DevOps Best Practices: CI/CD, Infrastructure as Code, and Observability
DevOps is not a toolchain—it’s a set of practices that reduce lead time for changes, improve reliability, and make systems easier to operate. This tutorial focuses on three pillars that, when implemented together, form a strong foundation for modern delivery:
- CI/CD (Continuous Integration and Continuous Delivery/Deployment)
- Infrastructure as Code (IaC)
- Observability (metrics, logs, traces, and actionable alerting)
The goal is not “use tool X,” but to build repeatable, auditable workflows that scale with teams and complexity.
1) CI/CD Best Practices
1.1 Continuous Integration: What “good” looks like
A strong CI system ensures that every change is validated quickly and consistently. The best CI pipelines share these traits:
- Fast feedback: Most checks complete in minutes.
- Deterministic: Same input → same output (pin versions, lock dependencies).
- Hermetic where possible: Avoid relying on developer machines or mutable shared environments.
- Shift-left security: Run SAST, dependency scanning, secret scanning early.
- Artifact-based: Build once, promote the same artifact through environments.
Recommended CI stages (typical; a script sketch follows the list)
- Checkout + dependency restore
- Lint + formatting
- Unit tests
- Build artifact/container
- Security scans
- Integration tests (optional, but valuable)
- Publish artifact
- Trigger CD
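A minimal sketch of these stages as a single shell script, reusing the npm scripts and image names from the examples that follow (adapt to your CI system's native stages; names and registry are illustrative):
# Minimal CI sketch (illustrative)
set -euo pipefail
SHA=$(git rev-parse --short HEAD)
IMAGE="registry.example.com/myorg/myservice:$SHA"
npm ci                        # deterministic dependency restore
npm run lint                  # lint + formatting checks
npm test -- --coverage        # unit tests
docker build -t "$IMAGE" .    # build the artifact once
npm audit --audit-level=high                                  # dependency scan
trivy image --exit-code 1 --severity HIGH,CRITICAL "$IMAGE"   # image scan
docker push "$IMAGE"          # publish immutable artifact; CD takes over from here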
1.2 Example: A simple CI pipeline for a Node.js service
Assume a Node.js API that builds into a Docker image.
Local commands (what CI should run)
# Install dependencies deterministically
npm ci
# Lint and format checks
npm run lint
npm run format:check
# Unit tests with coverage
npm test -- --coverage
# Build production bundle (if applicable)
npm run build
Build a Docker image
docker build -t myorg/myservice:git-$(git rev-parse --short HEAD) .
Run container locally
docker run --rm -p 3000:3000 myorg/myservice:git-$(git rev-parse --short HEAD)
Best practice: CI should never rely on “latest” base images or floating tags. Pin base images by digest when possible.
# Example: pinning a base image digest (illustrative)
# FROM node:20-alpine@sha256:<digest>
1.3 Build once, promote the same artifact
A common anti-pattern is rebuilding separately for staging and production. That creates drift and makes “what is running?” hard to answer.
Better approach:
- Build once in CI.
- Tag with immutable identifiers (commit SHA).
- Push to registry.
- Deploy by referencing that exact tag/digest.
Tagging strategy
- myservice:<git-sha> (immutable)
- myservice:main (mutable convenience tag, optional)
- myservice:v1.4.2 (release tag)
Commands:
SHA=$(git rev-parse --short HEAD)
docker tag myorg/myservice:git-$SHA registry.example.com/myorg/myservice:$SHA
docker push registry.example.com/myorg/myservice:$SHA
To get the immutable digest after pushing:
docker pull registry.example.com/myorg/myservice:$SHA
docker inspect --format='{{index .RepoDigests 0}}' registry.example.com/myorg/myservice:$SHA
1.4 Test pyramid and where to spend time
A healthy test suite typically follows:
- Many unit tests: fast, deterministic
- Some integration tests: validate DB, queues, caches, external contracts
- Few end-to-end tests: expensive; keep them focused on critical flows
Integration test example with Docker Compose
If your service depends on Postgres, run an ephemeral DB in CI:
docker run --rm -d --name pg \
-e POSTGRES_PASSWORD=postgres \
-e POSTGRES_DB=mydb \
-p 5432:5432 postgres:16
# Run migrations and integration tests
export DATABASE_URL="postgres://postgres:postgres@localhost:5432/mydb"
npm run migrate
npm run test:integration
docker stop pg
Best practice: make integration tests self-contained and idempotent. Tests should create and clean up their own data.
1.5 Security in CI: practical checks
Security scanning is most effective when it is:
- automated,
- fast,
- and enforced with clear policies.
Dependency vulnerability scanning (example with npm audit)
npm audit --audit-level=high
Secret scanning with Gitleaks
Install and run:
gitleaks version
# Scan the repository
gitleaks detect --source . --no-git --redact
Container image scanning with Trivy
trivy image --severity HIGH,CRITICAL registry.example.com/myorg/myservice:$SHA
Best practice: define what fails the build (e.g., critical vulns only) and create an exception process that is documented and time-bound.
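A sketch of such a gate, using the scanners above and failing only on high/critical findings (the thresholds and ignore-file convention are illustrative):
# Fail the pipeline only on high/critical findings
npm audit --audit-level=high
trivy image --exit-code 1 --severity HIGH,CRITICAL registry.example.com/myorg/myservice:$SHA
# Time-bound exceptions can live in an ignore file (e.g., .trivyignore),
# reviewed and expired like any other change.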
1.6 Continuous Delivery vs Continuous Deployment
- Continuous Delivery: every change is deployable; production deploy may require approval.
- Continuous Deployment: every change that passes the pipeline deploys to production automatically.
A pragmatic approach:
- Auto-deploy to dev/staging on merge.
- Deploy to production via approval + change window (or progressive rollout) until confidence is high.
1.7 Deployment strategies: reduce risk
Blue/Green
Two environments (blue and green). Deploy to the idle one, switch traffic, keep rollback easy.
Canary
Send a small percentage of traffic to the new version and gradually increase.
Rolling update
Replace instances gradually. Common in Kubernetes.
Best practice: pair progressive delivery with strong observability and automated rollback conditions.
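A minimal sketch of an automated rollback condition for a canary, assuming Prometheus is reachable at $PROM_URL, the http_requests_total metric used later in this tutorial, and a Kubernetes Deployment named myservice (all illustrative):
# Watch the error rate for ~10 minutes; roll back if it exceeds 2% at any check
QUERY='sum(rate(http_requests_total{service="myservice",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="myservice"}[5m]))'
for i in $(seq 1 10); do
  if curl -fsS -G "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY" \
     | jq -e '(.data.result[0].value[1] // "0") | tonumber > 0.02' >/dev/null; then
    echo "Canary error rate above 2%, rolling back"
    kubectl rollout undo deployment/myservice
    exit 1
  fi
  sleep 60
done
echo "Canary healthy, continuing rollout"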
2) Infrastructure as Code (IaC) Best Practices
IaC means your infrastructure is:
- versioned (Git history),
- reviewed (PRs),
- repeatable (reprovision reliably),
- auditable (who changed what and why).
2.1 Choose declarative, keep modules small, enforce standards
Whether you use Terraform, Pulumi, CloudFormation, or others, good IaC tends to:
- avoid giant “god modules,”
- expose clear inputs/outputs,
- include validation,
- and keep environments consistent.
This tutorial uses Terraform for its examples because Terraform is widely used.
2.2 Terraform project structure
A common structure:
infra/
  modules/
    network/
    service/
  envs/
    dev/
    staging/
    prod/
- modules/ contains reusable building blocks.
- envs/ wires modules together with environment-specific values.
2.3 Remote state, locking, and state hygiene
Terraform state must be protected:
- stored remotely (not on laptops),
- locked during changes,
- backed up.
For AWS, a typical approach is S3 + DynamoDB locking. (Exact backend config varies.)
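One common pattern is partial backend configuration: declare an empty backend "s3" block in the Terraform code and supply the details at init time (bucket, key, and table names below are illustrative):
terraform init \
  -backend-config="bucket=myorg-terraform-state" \
  -backend-config="key=envs/dev/terraform.tfstate" \
  -backend-config="region=us-east-1" \
  -backend-config="dynamodb_table=terraform-locks" \
  -backend-config="encrypt=true"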
Commands you’ll use often:
terraform fmt -recursive
terraform validate
terraform init
terraform plan
terraform apply
Best practice: run terraform plan in CI and require approval before apply in production.
2.4 Example: Create a minimal AWS S3 bucket with Terraform
Create infra/envs/dev/main.tf:
terraform {
  required_version = ">= 1.6.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

resource "aws_s3_bucket" "app_bucket" {
  bucket = "myorg-dev-app-bucket-123456"
}

resource "aws_s3_bucket_versioning" "versioning" {
  bucket = aws_s3_bucket.app_bucket.id

  versioning_configuration {
    status = "Enabled"
  }
}
Run:
cd infra/envs/dev
terraform init
terraform fmt
terraform validate
terraform plan
terraform apply
Destroy when done:
terraform destroy
Best practice: enable encryption, block public access, and define lifecycle rules. Security defaults should be explicit.
2.5 Manage secrets correctly (don’t put them in state)
A critical IaC rule: avoid placing sensitive secrets directly in Terraform state. State files often end up accessible to many systems and people.
Preferred patterns:
- Store secrets in a dedicated secret manager (AWS Secrets Manager, GCP Secret Manager, Vault).
- Reference secret ARNs/paths in IaC, not secret values.
- Inject secrets at runtime (Kubernetes secrets via external secret operators, or sidecars).
Example (AWS Secrets Manager data source reference conceptually):
data "aws_secretsmanager_secret" "db_password" {
name = "prod/db_password"
}
Then your application runtime fetches it, or you pass only the secret reference.
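At runtime the application (or its entrypoint script) can then resolve the value itself, for example with the AWS CLI (the secret name matches the data source above; treat the exact mechanism as illustrative):
# Fetch the secret value at runtime; never write it to state, code, or logs
DB_PASSWORD=$(aws secretsmanager get-secret-value \
  --secret-id prod/db_password \
  --query SecretString \
  --output text)
export DB_PASSWORD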
2.6 Immutable infrastructure and configuration drift
Configuration drift happens when someone changes infrastructure manually in the console. IaC can detect drift, but only if you:
- run plans regularly,
- restrict manual changes via IAM policies,
- treat IaC as the source of truth.
Drift detection:
terraform plan -detailed-exitcode
Exit codes:
- 0: no changes
- 2: changes present (drift or intended updates)
- 1: error
Best practice: schedule drift detection (nightly) and alert on drift.
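A sketch of such a scheduled check, suitable for a nightly cron job or pipeline schedule (the notification step is a placeholder for whatever alerting tool you use):
cd infra/envs/prod
terraform init -input=false >/dev/null
terraform plan -detailed-exitcode -input=false
case $? in
  0) echo "No drift detected" ;;
  2) echo "Drift detected in prod"      # replace with your alerting command
     exit 1 ;;
  *) echo "Drift check itself failed"   # also worth alerting on
     exit 1 ;;
esac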
2.7 Policy as code and guardrails
Guardrails prevent risky changes (public S3 buckets, open security groups, unencrypted databases).
Tools and approaches:
- Terraform Cloud/Enterprise policies (Sentinel)
- Open Policy Agent (OPA) / Conftest
- Checkov / tfsec for static checks
Example with tfsec:
tfsec infra/envs/dev
Example with checkov:
checkov -d infra/envs/dev
Best practice: fail CI on high-severity policy violations, and require explicit exceptions with justification.
2.8 IaC reviews: what to look for in PRs
When reviewing Terraform PRs, focus on:
- Blast radius: which resources are replaced/destroyed?
- Networking changes: subnets, routes, security groups
- Data stores: encryption, backups, deletion protection
- IAM: least privilege, no wildcard permissions
- Cost: instance sizes, autoscaling bounds
- Rollout plan: can it be applied safely? any downtime?
Use:
terraform plan
And scrutinize:
- "-/+" (resource will be replaced)
- "-" (resource will be destroyed)
- changes to security rules and IAM
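To make the blast radius explicit in CI, the plan can be rendered as JSON and filtered for every resource that would be destroyed or replaced (a sketch; requires jq):
terraform plan -out=plan.tfplan
terraform show -json plan.tfplan \
  | jq -r '.resource_changes[]
           | select(.change.actions | index("delete"))
           | .address'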
3) Observability Best Practices
Observability answers: “What is happening inside my system?” It goes beyond monitoring by enabling you to ask new questions without shipping new code each time.
The three primary signals:
- Metrics (aggregated numeric time series)
- Logs (discrete events)
- Traces (request-level, distributed context)
A fourth, often overlooked component is alerting (actionable, low-noise notifications).
3.1 Golden signals and SLOs
A practical starting point is the Golden Signals:
- Latency
- Traffic
- Errors
- Saturation
Then define SLOs (Service Level Objectives) like:
- “99.9% of requests under 300ms over 30 days”
- “Error rate < 0.1% over 7 days”
SLOs shift the conversation from “CPU is high” to “users are impacted.”
Best practice: alert on symptoms that violate SLOs, not on every resource metric.
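One way to express this is an error-budget burn rate: divide the observed error rate by the budget the SLO allows (0.1% for 99.9% availability); values well above 1 mean the budget is being consumed too quickly. A sketch against the Prometheus HTTP API (URL and metric names are illustrative):
# Burn rate over the last hour for a 99.9% availability SLO (error budget 0.1%)
curl -fsS -G "$PROM_URL/api/v1/query" --data-urlencode 'query=(
  sum(rate(http_requests_total{service="myservice",status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total{service="myservice"}[1h]))
) / 0.001'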
3.2 Structured logging: make logs queryable
Logs should be:
- structured (JSON),
- consistent (fields like service, env, trace_id, request_id),
- leveled (DEBUG, INFO, WARN, ERROR),
- privacy-aware (no secrets, no unnecessary PII).
Example JSON log line:
{
  "timestamp": "2026-04-30T12:00:00Z",
  "level": "INFO",
  "service": "myservice",
  "env": "staging",
  "message": "request completed",
  "http_method": "GET",
  "path": "/api/orders",
  "status": 200,
  "duration_ms": 42,
  "trace_id": "3b1d2f0f2c1a4c0b",
  "request_id": "req_9f2c..."
}
Best practice: ensure every request log includes a correlation ID and trace ID.
3.3 Metrics: instrument what matters
Metrics should reflect:
- request duration histograms,
- request counts by route/status,
- error counts,
- dependency timings (DB, cache, external APIs),
- queue depth and consumer lag,
- saturation (CPU/memory), but as supporting context.
If you use Prometheus, you’ll typically scrape /metrics endpoints and visualize in Grafana.
Example: basic PromQL queries
- Request rate:
sum(rate(http_requests_total{service="myservice"}[5m]))
- Error rate:
sum(rate(http_requests_total{service="myservice",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="myservice"}[5m]))
- P95 latency:
histogram_quantile(
0.95,
sum by (le) (rate(http_request_duration_seconds_bucket{service="myservice"}[5m]))
)
Best practice: use histograms for latency, not averages. Averages hide tail latency.
3.4 Tracing: follow a request across services
Distributed tracing is essential once you have:
- multiple services,
- async messaging,
- or external dependencies.
With OpenTelemetry (OTel), you can propagate context across HTTP and messaging boundaries.
Key concepts:
- Span: a timed operation (e.g., “HTTP GET /orders”)
- Trace: a tree of spans for one request
- Context propagation: trace IDs passed via headers (e.g., W3C traceparent)
Best practice: sample intelligently. Full sampling in high-traffic production can be expensive. Use head-based sampling for baseline and tail-based sampling for errors/slow requests if your backend supports it.
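For a Node.js service like the one in this tutorial, a common zero-code starting point is OpenTelemetry auto-instrumentation configured through the standard environment variables (the package, endpoint, and sampling ratio below are illustrative, not the only option):
# One-time setup (sketch): npm install @opentelemetry/api @opentelemetry/auto-instrumentations-node
export OTEL_SERVICE_NAME="myservice"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector:4318"
export OTEL_TRACES_SAMPLER="parentbased_traceidratio"   # head-based sampling
export OTEL_TRACES_SAMPLER_ARG="0.1"                    # keep ~10% of traces
node --require @opentelemetry/auto-instrumentations-node/register server.js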
3.5 OpenTelemetry Collector: a practical architecture
Instead of sending telemetry directly from apps to multiple vendors, use the OpenTelemetry Collector as an agent/gateway:
- Apps export OTLP (OpenTelemetry Protocol) to the collector.
- Collector enriches, batches, samples, and forwards to:
- Prometheus/remote write for metrics,
- Loki/Elastic for logs,
- Tempo/Jaeger for traces,
- or a SaaS backend.
Benefits:
- vendor flexibility,
- centralized control of sampling and enrichment,
- fewer egress endpoints from workloads.
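A common way to run the collector as a local agent is the contrib distribution's container image with your pipeline configuration mounted in (image tag and file names are illustrative):
docker run --rm \
  -p 4317:4317 -p 4318:4318 \
  -v "$(pwd)/otel-collector.yaml:/etc/otelcol/config.yaml" \
  otel/opentelemetry-collector-contrib:<version> \
  --config=/etc/otelcol/config.yaml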
3.6 Alerting: reduce noise, increase actionability
Alerts should be:
- tied to user impact (SLO burn rate, error spikes),
- actionable (clear runbook and owner),
- deduplicated and routed properly.
Anti-patterns
- Alerting on CPU > 80% with no context
- Alerting on every 404
- Alerts without runbooks
Better patterns
- Alert on high 5xx error rate sustained for N minutes
- Alert on latency SLO burn rate
- Alert when queue lag exceeds a threshold and is increasing
Example Prometheus alert logic (conceptual query; actual alert rules depend on your setup):
(
sum(rate(http_requests_total{service="myservice",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="myservice"}[5m]))
) > 0.02
Best practice: include these in every alert:
- summary of impact,
- dashboard link,
- runbook link,
- recent deploy marker (if available).
3.7 Runbooks and incident response
A runbook is a step-by-step guide for common incidents:
- “DB connections exhausted”
- “Elevated 5xx after deploy”
- “Queue consumer lag increasing”
A good runbook includes:
- how to confirm the issue,
- likely causes,
- safe mitigations,
- rollback steps,
- escalation contacts,
- links to dashboards and logs.
Best practice: treat runbooks as code in the same repo as the service, reviewed and updated with changes.
4) Putting It Together: A Practical End-to-End Workflow
This section ties CI/CD + IaC + Observability into a cohesive delivery loop.
4.1 A reference flow
- Developer opens a PR.
- CI runs:
  - lint, tests
  - build container
  - scan dependencies and image
  - publish artifact tagged by commit SHA
- Merge to main triggers CD:
  - deploy to staging using the same artifact
  - run smoke tests
- Observability checks:
  - confirm error rate/latency stable
  - verify traces show healthy dependencies
- Promote to production:
  - canary rollout
  - automated rollback if SLO burn rate spikes
- IaC changes:
  - Terraform plan in CI
  - apply with approval
  - drift detection nightly
4.2 Smoke tests after deploy
Smoke tests should validate critical paths quickly.
Example with curl:
BASE_URL="https://staging.example.com"
curl -fsS "$BASE_URL/health"
curl -fsS "$BASE_URL/api/version"
Example with hey for quick load sampling:
hey -z 30s -c 20 https://staging.example.com/api/orders
Best practice: smoke tests should be fast and deterministic; deeper load tests can run on schedule or before major releases.
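Because freshly deployed instances can take a few seconds to become ready, a small bounded retry keeps smoke tests deterministic without hiding real failures (timeouts are illustrative):
# Wait up to ~60 seconds for the service to report healthy, then fail hard
for i in $(seq 1 30); do
  curl -fsS "$BASE_URL/health" >/dev/null && break
  [ "$i" -eq 30 ] && { echo "service never became healthy"; exit 1; }
  sleep 2
done
curl -fsS "$BASE_URL/api/version"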
4.3 Deployment verification using metrics
After deploying a canary, validate:
- Error rate did not increase
- P95 latency did not regress
- Saturation is stable (CPU/memory)
- Key business metrics (orders created, logins) are normal
If you track deploy markers, correlate changes:
- “Error spike started 2 minutes after deploy X”
Best practice: automatically annotate dashboards when deployments happen.
5) Common Pitfalls and How to Avoid Them
Pitfall: “CI is slow, so we skip tests”
Fix:
- parallelize tests,
- cache dependencies,
- separate fast checks from slow checks,
- run slow suites on merge or nightly.
Pitfall: “We rebuild for production”
Fix:
- build once, promote immutable artifacts.
Pitfall: “Terraform state is shared and unmanaged”
Fix:
- remote state + locking,
- least privilege access to state,
- separate state per environment.
Pitfall: “We have dashboards but still don’t know what’s wrong”
Fix:
- add tracing with consistent context propagation,
- improve structured logs,
- define SLOs and alert on burn rate.
Pitfall: “Alerts are noisy, everyone ignores them”
Fix:
- alert on symptoms (user impact),
- add runbooks,
- tune thresholds and windows,
- route alerts to the right owners.
6) A Minimal Checklist You Can Apply Immediately
CI/CD
- Deterministic builds (npm ci, lockfiles, pinned versions)
- Lint + unit tests on every PR
- Build once, push immutable artifact (SHA tag)
- Security scanning (deps, secrets, image)
- Progressive delivery (canary/rolling) + rollback plan
IaC
- All infra changes via PR
- Remote state + locking
- Policy checks (tfsec/checkov)
- Secrets not stored in state
- Drift detection scheduled
Observability
- Structured logs with trace_id and request_id
- Metrics for golden signals + histograms for latency
- Distributed tracing for key services
- SLOs defined and alerts tied to user impact
- Runbooks stored with code
7) Next Steps (Practical Improvements)
If you want to deepen maturity beyond the basics:
- Introduce feature flags for safer releases and experiments.
- Add contract testing (e.g., Pact) for critical service boundaries.
- Adopt GitOps (Argo CD / Flux) so deployments are reconciled from Git.
- Implement automatic rollback based on SLO burn rate.
- Standardize templates for new services (CI pipeline, Terraform module skeleton, OTel instrumentation, dashboards, and alerts).
Appendix: Useful Commands Reference
Git and tagging
git status
git diff
git rev-parse --short HEAD
git tag -a v1.2.3 -m "Release v1.2.3"
git push --tags
Docker
docker build -t myservice:local .
docker images
docker run --rm -p 3000:3000 myservice:local
docker logs -f <container_id>
docker exec -it <container_id> sh
Terraform
terraform init
terraform fmt -recursive
terraform validate
terraform plan
terraform apply
terraform destroy
Security scanning
gitleaks detect --source . --no-git --redact
trivy image --severity HIGH,CRITICAL myorg/myservice:git-<sha>
tfsec infra/
checkov -d infra/
HTTP smoke testing
curl -fsS https://example.com/health
curl -fsS https://example.com/api/version
By implementing CI/CD, IaC, and Observability as a unified system—rather than isolated initiatives—you get faster delivery, safer changes, and clearer operational insight. The biggest wins come from consistency: consistent pipelines, consistent infrastructure definitions, and consistent telemetry across every service and environment.