DevOps Best Practices: CI/CD, Infrastructure as Code, and Observability

Tags: devops, ci-cd, infrastructure-as-code, kubernetes, observability


DevOps is not a toolchain—it’s a set of practices that reduce lead time for changes, improve reliability, and make systems easier to operate. This tutorial focuses on three pillars that, when implemented together, form a strong foundation for modern delivery:

  1. CI/CD (Continuous Integration and Continuous Delivery/Deployment)
  2. Infrastructure as Code (IaC)
  3. Observability (metrics, logs, traces, and actionable alerting)

The goal is not “use tool X”; it is to build repeatable, auditable workflows that scale with team size and system complexity.


1) CI/CD Best Practices

1.1 Continuous Integration: What “good” looks like

A strong CI system ensures that every change is validated quickly and consistently. A typical pipeline runs these stages in order (a shell sketch follows the list):

  1. Checkout + dependency restore
  2. Lint + formatting
  3. Unit tests
  4. Build artifact/container
  5. Security scans
  6. Integration tests (optional, but valuable)
  7. Publish artifact
  8. Trigger CD
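
A minimal shell sketch of those stages, assuming the npm scripts and image name used elsewhere in this tutorial (a real pipeline would express this in your CI system's own config):

#!/usr/bin/env bash
set -euo pipefail                          # fail fast: any failing stage fails the build

npm ci                                     # deterministic dependency restore
npm run lint                               # lint
npm run format:check                       # formatting
npm test -- --coverage                     # unit tests

SHA=$(git rev-parse --short HEAD)
docker build -t myorg/myservice:git-"$SHA" .   # build the container once

npm audit --audit-level=high               # dependency scan (more in 1.5)
docker push myorg/myservice:git-"$SHA"     # publish the artifact tagged by commit
# integration tests and the CD trigger (stages 6 and 8) would follow here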

1.2 Example: A simple CI pipeline for a Node.js service

Assume a Node.js API that builds into a Docker image.

Local commands (what CI should run)

# Install dependencies deterministically
npm ci

# Lint and format checks
npm run lint
npm run format:check

# Unit tests with coverage
npm test -- --coverage

# Build production bundle (if applicable)
npm run build

Build a Docker image

docker build -t myorg/myservice:git-$(git rev-parse --short HEAD) .

Run container locally

docker run --rm -p 3000:3000 myorg/myservice:git-$(git rev-parse --short HEAD)

Best practice: CI should never rely on “latest” base images or floating tags. Pin base images by digest when possible.

# Example: pinning a base image digest (illustrative)
# FROM node:20-alpine@sha256:<digest>
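
One way to look up the digest for a tag, assuming Docker Buildx is available:

# Print the manifest digest for a tag without pulling the image
docker buildx imagetools inspect node:20-alpine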

1.3 Build once, promote the same artifact

A common anti-pattern is rebuilding separately for staging and production. That creates drift and makes “what is running?” hard to answer.

Better approach: build the artifact once per commit and promote that exact artifact, identified by an immutable tag or digest, through staging and production.

Tagging strategy

Tag images with the commit SHA rather than a mutable tag like latest, and record the registry digest after pushing so every environment can pin the exact bytes it runs.

Commands:

SHA=$(git rev-parse --short HEAD)
docker tag myorg/myservice:git-$SHA registry.example.com/myorg/myservice:$SHA
docker push registry.example.com/myorg/myservice:$SHA

To get the immutable digest after pushing:

docker pull registry.example.com/myorg/myservice:$SHA
docker inspect --format='{{index .RepoDigests 0}}' registry.example.com/myorg/myservice:$SHA
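
With the digest in hand, one way to promote is to add an environment tag that points at the same immutable image, with no rebuild (a sketch assuming Buildx and push access to the registry):

# Resolve the pushed image to its immutable digest reference
DIGEST=$(docker inspect --format='{{index .RepoDigests 0}}' registry.example.com/myorg/myservice:$SHA)

# Create a staging tag that references exactly the same image
docker buildx imagetools create -t registry.example.com/myorg/myservice:staging "$DIGEST"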

1.4 Test pyramid and where to spend time

A healthy test suite typically follows the test pyramid: many fast unit tests at the base, fewer integration tests in the middle, and a small number of end-to-end tests at the top. Spend most of your effort keeping the base fast and reliable, and reserve slow end-to-end tests for critical user journeys.

Integration test example with an ephemeral Postgres container

If your service depends on Postgres, run an ephemeral DB in CI:

docker run --rm -d --name pg \
  -e POSTGRES_PASSWORD=postgres \
  -e POSTGRES_DB=mydb \
  -p 5432:5432 postgres:16

# Wait until Postgres accepts connections before migrating
until docker exec pg pg_isready -U postgres >/dev/null 2>&1; do sleep 1; done

# Run migrations and integration tests
export DATABASE_URL="postgres://postgres:postgres@localhost:5432/mydb"
npm run migrate
npm run test:integration

docker stop pg

Best practice: make integration tests self-contained and idempotent. Tests should create and clean up their own data.


1.5 Security in CI: practical checks

Security scanning is most effective when it is:

  1. Fast enough to run on every PR
  2. Automated, with explicit thresholds for failing the build
  3. Actionable, so findings reach the people who can fix them

Dependency vulnerability scanning (example with npm audit)

# Fail when vulnerabilities of severity high or above are found
npm audit --audit-level=high

Secret scanning with Gitleaks

Install and run:

# Confirm the binary is available
gitleaks version

# Scan the working tree as plain files (drop --no-git to scan git history)
gitleaks detect --source . --no-git --redact

Container image scanning with Trivy

trivy image --severity HIGH,CRITICAL registry.example.com/myorg/myservice:$SHA

Best practice: define what fails the build (e.g., critical vulns only) and create an exception process that is documented and time-bound.


1.6 Continuous Delivery vs Continuous Deployment

Continuous delivery: every change that passes CI yields a deployable artifact, and production deploys are a one-click, human-approved step. Continuous deployment: changes that pass all automated gates reach production with no manual step.

A pragmatic approach: start with continuous delivery, then move individual services to continuous deployment once their test coverage, observability, and rollback automation can catch bad releases without a human.


1.7 Deployment strategies: reduce risk

Blue/Green

Two environments (blue and green). Deploy to the idle one, switch traffic, keep rollback easy.

Canary

Send a small percentage of traffic to the new version and gradually increase.

Rolling update

Replace instances gradually. Common in Kubernetes.
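
For example, a rolling update and rollback in Kubernetes might look like this (a sketch assuming a Deployment and container both named myservice):

# Point the Deployment at the new image; Kubernetes rolls pods gradually
kubectl set image deployment/myservice myservice=registry.example.com/myorg/myservice:$SHA

# Wait for the rollout; exits non-zero if it stalls
kubectl rollout status deployment/myservice --timeout=120s

# Roll back to the previous revision if verification fails
kubectl rollout undo deployment/myservice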

Best practice: pair progressive delivery with strong observability and automated rollback conditions.


2) Infrastructure as Code (IaC) Best Practices

IaC means your infrastructure is:

  1. Versioned in Git alongside application code
  2. Reviewable through pull requests
  3. Reproducible across environments
  4. Auditable (who changed what, when, and why)

2.1 Choose declarative, keep modules small, enforce standards

Whether you use Terraform, Pulumi, CloudFormation, or others, good IaC tends to:

  1. Prefer declarative definitions over imperative scripts
  2. Keep modules small, composable, and versioned
  3. Enforce standards automatically (formatting, validation, policy checks)

This tutorial uses Terraform examples because it is widely used.


2.2 Terraform project structure

A common structure:

infra/
  modules/
    network/
    service/
  envs/
    dev/
    staging/
    prod/

2.3 Remote state, locking, and state hygiene

Terraform state must be protected:

  1. Stored in a remote backend, not on laptops or in Git
  2. Encrypted at rest and access-controlled
  3. Locked during operations to prevent concurrent writes

For AWS, a typical approach is S3 + DynamoDB locking. (Exact backend config varies.)

Commands you’ll use often:

terraform fmt -recursive
terraform validate
terraform init
terraform plan
terraform apply

Best practice: run terraform plan in CI and require approval before apply in production.
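
A common way to implement that gate is to save the plan as an artifact in CI, then apply exactly the plan that was reviewed:

# In CI: write the plan to a file for review
terraform plan -input=false -out=tfplan

# After approval: apply exactly the reviewed plan, nothing else
terraform apply -input=false tfplan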


2.4 Example: Create a minimal AWS S3 bucket with Terraform

Create infra/envs/dev/main.tf:

terraform {
  required_version = ">= 1.6.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

resource "aws_s3_bucket" "app_bucket" {
  bucket = "myorg-dev-app-bucket-123456"
}

resource "aws_s3_bucket_versioning" "versioning" {
  bucket = aws_s3_bucket.app_bucket.id
  versioning_configuration {
    status = "Enabled"
  }
}

Run:

cd infra/envs/dev
terraform init
terraform fmt
terraform validate
terraform plan
terraform apply

Destroy when done:

terraform destroy

Best practice: enable encryption, block public access, and define lifecycle rules. Security defaults should be explicit.
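
A sketch of those defaults for the bucket above (resource names are illustrative; your org's requirements may differ):

resource "aws_s3_bucket_public_access_block" "app_bucket" {
  bucket                  = aws_s3_bucket.app_bucket.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_s3_bucket_server_side_encryption_configuration" "app_bucket" {
  bucket = aws_s3_bucket.app_bucket.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256" # or "aws:kms" with a customer-managed key
    }
  }
}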


2.5 Manage secrets correctly (don’t put them in state)

A critical IaC rule: avoid placing sensitive secrets directly in Terraform state. State files often end up accessible to many systems and people.

Preferred patterns:

  1. Keep secrets in a dedicated manager (AWS Secrets Manager, HashiCorp Vault)
  2. Reference secrets by name or ARN in IaC, never by value
  3. Inject values into the application at runtime

Example (AWS Secrets Manager data source reference conceptually):

data "aws_secretsmanager_secret" "db_password" {
  name = "prod/db_password"
}

Then your application runtime fetches it, or you pass only the secret reference.
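
For example, an entrypoint script might fetch the value at startup with the AWS CLI, so the secret never lands in Terraform state or the image (assumes the runtime has IAM permission to read the secret):

# Fetch the secret value at runtime; only the reference lives in IaC
export DB_PASSWORD=$(aws secretsmanager get-secret-value \
  --secret-id prod/db_password \
  --query SecretString \
  --output text)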


2.6 Immutable infrastructure and configuration drift

Configuration drift happens when someone changes infrastructure manually in the console. IaC can detect drift, but only if you:

  1. Route all changes through code review and apply
  2. Restrict manual write access to production consoles
  3. Run terraform plan against live infrastructure on a schedule

Drift detection:

terraform plan -detailed-exitcode

Exit codes:

  • 0: succeeded with no changes
  • 1: error
  • 2: succeeded, but changes are present (drift or pending updates)

Best practice: schedule drift detection (nightly) and alert on drift.
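
A minimal nightly drift check might look like this (a sketch; wire the alert into whatever notifier your team uses):

#!/usr/bin/env bash
# -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes present (drift)
terraform plan -detailed-exitcode -input=false -no-color > plan.log 2>&1
status=$?

if [ "$status" -eq 2 ]; then
  echo "Drift detected; see plan.log"    # send this to your alerting channel
  exit 1
elif [ "$status" -eq 1 ]; then
  echo "terraform plan failed; see plan.log"
  exit 1
fi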


2.7 Policy as code and guardrails

Guardrails prevent risky changes (public S3 buckets, open security groups, unencrypted databases).

Tools and approaches:

  • Static analysis of IaC (tfsec, checkov)
  • Policy engines (Open Policy Agent, HashiCorp Sentinel)
  • Provider-level guardrails (e.g., account-wide S3 public access blocks)

Example with tfsec:

tfsec infra/envs/dev

Example with checkov:

checkov -d infra/envs/dev

Best practice: fail CI on high-severity policy violations, and require explicit exceptions with justification.


2.8 IaC reviews: what to look for in PRs

When reviewing Terraform PRs, focus on:

  • Blast radius: what does the plan create, modify, or destroy?
  • Security-sensitive changes: IAM, security groups, public access, encryption
  • Consistency: naming, tagging, and module usage match team conventions

Use:

terraform plan

And scrutinize:

  • Resources marked for destroy or replace (data loss risk)
  • Unexpected changes to resources the PR did not intend to touch


3) Observability Best Practices

Observability answers: “What is happening inside my system?” It goes beyond monitoring by enabling you to ask new questions without shipping new code each time.

The three primary signals:

  1. Metrics (aggregated numeric time series)
  2. Logs (discrete events)
  3. Traces (request-level, distributed context)

A fourth component, often overlooked, is alerting: actionable, low-noise notifications.


3.1 Golden signals and SLOs

A practical starting point is the Golden Signals:

  1. Latency: how long requests take, including the tail
  2. Traffic: demand on the system (e.g., requests per second)
  3. Errors: the rate of failed requests
  4. Saturation: how full your most constrained resources are

Then define SLOs (Service Level Objectives) like “99.9% of requests succeed over a rolling 30 days” or “95% of requests complete in under 300 ms.”

SLOs shift the conversation from “CPU is high” to “users are impacted.”

Best practice: alert on symptoms that violate SLOs, not on every resource metric.


3.2 Structured logging: make logs queryable

Logs should be:

  1. Structured (e.g., JSON), so fields are machine-queryable
  2. Consistent in field names across services
  3. Correlated, carrying request and trace IDs
  4. Leveled sensibly (DEBUG/INFO/WARN/ERROR), with noise kept out of production

Example JSON log line:

{
  "timestamp": "2026-04-30T12:00:00Z",
  "level": "INFO",
  "service": "myservice",
  "env": "staging",
  "message": "request completed",
  "http_method": "GET",
  "path": "/api/orders",
  "status": 200,
  "duration_ms": 42,
  "trace_id": "3b1d2f0f2c1a4c0b",
  "request_id": "req_9f2c..."
}

Best practice: ensure every request log includes a correlation ID and trace ID.


3.3 Metrics: instrument what matters

Metrics should reflect:

  1. User-facing behavior: request rate, error rate, latency distributions
  2. Saturation of constrained resources: CPU, memory, connection pools, queues
  3. Key business events, where a counter is cheap and useful

If you use Prometheus, you’ll typically scrape /metrics endpoints and visualize in Grafana.

Example: basic PromQL queries

# Request rate
sum(rate(http_requests_total{service="myservice"}[5m]))

# Error ratio (share of requests returning 5xx)
sum(rate(http_requests_total{service="myservice",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="myservice"}[5m]))

# p95 latency from a histogram
histogram_quantile(
  0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket{service="myservice"}[5m]))
)

Best practice: use histograms for latency, not averages. Averages hide tail latency.


3.4 Tracing: follow a request across services

Distributed tracing is essential once you have:

  1. Requests that cross multiple services
  2. Asynchronous hops through queues or message brokers
  3. Latency problems that no single service's metrics can explain

With OpenTelemetry (OTel), you can propagate context across HTTP and messaging boundaries.

Key concepts:

  1. Trace: the end-to-end record of one request
  2. Span: a single timed operation within a trace
  3. Context propagation: passing trace and span IDs across process boundaries
  4. Sampling: deciding which traces to keep

Best practice: sample intelligently. Full sampling in high-traffic production can be expensive. Use head-based sampling for baseline and tail-based sampling for errors/slow requests if your backend supports it.
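
With standard OpenTelemetry SDKs, head-based sampling can be configured through environment variables (a sketch; tail-based sampling lives in the Collector, not the SDK):

# Sample 10% of new traces, but always follow the parent's decision downstream
export OTEL_SERVICE_NAME="myservice"
export OTEL_TRACES_SAMPLER="parentbased_traceidratio"
export OTEL_TRACES_SAMPLER_ARG="0.1"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector:4317"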


3.5 OpenTelemetry Collector: a practical architecture

Instead of sending telemetry directly from apps to multiple vendors, use the OpenTelemetry Collector as an agent/gateway: applications export OTLP to a local Collector, which processes the data and fans it out to one or more backends.

Benefits:

  1. Applications stay vendor-agnostic; exporters are swapped in Collector config
  2. Central processing: batching, filtering, redaction, and sampling in one place
  3. Per-signal routing (metrics, logs, and traces can go to different backends)
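
To experiment locally, the Collector can run as a container (a sketch assuming a config.yaml in the current directory; 4317 and 4318 are the standard OTLP gRPC/HTTP ports):

docker run --rm \
  -p 4317:4317 -p 4318:4318 \
  -v "$(pwd)/config.yaml:/etc/otelcol-contrib/config.yaml" \
  otel/opentelemetry-collector-contrib:0.98.0   # tag is illustrative; pin your own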


3.6 Alerting: reduce noise, increase actionability

Alerts should be:

  1. Actionable: a human can and should do something now
  2. Symptom-based: tied to user impact, usually via SLOs
  3. Rare enough that every page is taken seriously

Anti-patterns

  • Paging on raw resource metrics (CPU, memory) with static thresholds
  • Alerts with no runbook or owner
  • Duplicate alerts for the same underlying failure

Better patterns

  • SLO burn-rate alerts evaluated over multiple windows
  • Severity tiers: page for user impact, ticket for everything else
  • Every alert links to a runbook and a dashboard

Example Prometheus alert logic (conceptual query; actual alert rules depend on your setup):

(
  sum(rate(http_requests_total{service="myservice",status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total{service="myservice"}[5m]))
) > 0.02

Best practice: include these in every alert:

  1. What is affected (service, environment)
  2. Likely user impact
  3. A link to the runbook
  4. A link to the relevant dashboard or query


3.7 Runbooks and incident response

A runbook is a step-by-step guide for handling a class of incident: elevated error rate, saturated database, failed deploy, and so on.

A good runbook includes:

  1. Symptoms, and the alerts that point to this runbook
  2. Diagnostic steps: the exact queries and commands to run
  3. Mitigations, including how to roll back
  4. An escalation path for when mitigation fails

Best practice: treat runbooks as code in the same repo as the service, reviewed and updated with changes.


4) Putting It Together: A Practical End-to-End Workflow

This section ties CI/CD + IaC + Observability into a cohesive delivery loop.

4.1 A reference flow

  1. Developer opens a PR.
  2. CI runs:
    • lint, tests
    • build container
    • scan dependencies and image
    • publish artifact tagged by commit SHA
  3. Merge to main triggers CD:
    • deploy to staging using the same artifact
    • run smoke tests
  4. Observability checks:
    • confirm error rate/latency stable
    • verify traces show healthy dependencies
  5. Promote to production:
    • canary rollout
    • automated rollback if SLO burn rate spikes
  6. IaC changes:
    • Terraform plan in CI
    • apply with approval
    • drift detection nightly

4.2 Smoke tests after deploy

Smoke tests should validate critical paths quickly.

Example with curl:

BASE_URL="https://staging.example.com"

# -f: fail on HTTP error status; -sS: quiet, but still print errors
curl -fsS "$BASE_URL/health"
curl -fsS "$BASE_URL/api/version"

Example with hey for quick load sampling:

# 30 seconds of load (-z) at 20 concurrent workers (-c)
hey -z 30s -c 20 https://staging.example.com/api/orders

Best practice: smoke tests should be fast and deterministic; deeper load tests can run on schedule or before major releases.


4.3 Deployment verification using metrics

After deploying a canary, validate:

  1. Error rate for the canary vs the stable version
  2. Latency percentiles (p95/p99), not just averages
  3. Saturation of downstream dependencies

If you track deploy markers, correlate changes: compare the same metrics immediately before and after the marker, and roll back if the canary regresses.

Best practice: automatically annotate dashboards when deployments happen.
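
For example, if you use Grafana, deploys can be annotated through its HTTP API (a sketch; the URL and token are assumptions):

# Drop a deploy annotation that dashboards can render (requires an API token)
curl -fsS -X POST "https://grafana.example.com/api/annotations" \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"tags\": [\"deploy\", \"myservice\"], \"text\": \"Deployed $SHA\"}"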


5) Common Pitfalls and How to Avoid Them

Pitfall: “CI is slow, so we skip tests”

Fix: make CI fast instead of optional: cache dependencies, parallelize stages, and push checks down the test pyramid so the slow suites run less often.

Pitfall: “We rebuild for production”

Fix: build one artifact per commit, push it once, and promote the same digest through every environment (see 1.3).

Pitfall: “Terraform state is shared and unmanaged”

Fix: move state to a remote backend with locking, encrypt it, and restrict access (see 2.3).

Pitfall: “We have dashboards but still don’t know what’s wrong”

Fix: instrument the golden signals, adopt structured logs with trace IDs, and define SLOs so dashboards answer “are users impacted?” (see section 3).

Pitfall: “Alerts are noisy, everyone ignores them”

Fix: delete or demote non-actionable alerts, alert on SLO symptoms instead of raw resource metrics, and require a runbook link on every page (see 3.6).


6) A Minimal Checklist You Can Apply Immediately

CI/CD

  • Every PR runs lint, tests, and security scans
  • Artifacts are built once, tagged by commit SHA, and promoted by digest
  • Production deploys use a progressive strategy with a tested rollback path

IaC

  • All infrastructure changes go through code review and terraform plan
  • State lives in a locked, encrypted remote backend
  • Policy checks (tfsec/checkov) and nightly drift detection run automatically

Observability

  • Every service emits structured logs with trace and request IDs
  • Golden-signal metrics and SLOs exist for every user-facing service
  • Alerts are symptom-based, and each one links to a runbook


7) Next Steps (Practical Improvements)

If you want to deepen maturity beyond the basics:

  1. Introduce feature flags for safer releases and experiments.
  2. Add contract testing (e.g., Pact) for critical service boundaries.
  3. Adopt GitOps (Argo CD / Flux) so deployments are reconciled from Git.
  4. Implement automatic rollback based on SLO burn rate.
  5. Standardize templates for new services (CI pipeline, Terraform module skeleton, OTel instrumentation, dashboards, and alerts).

Appendix: Useful Commands Reference

Git and tagging

git status
git diff
git rev-parse --short HEAD
git tag -a v1.2.3 -m "Release v1.2.3"
git push --tags

Docker

docker build -t myservice:local .
docker images
docker run --rm -p 3000:3000 myservice:local
docker logs -f <container_id>
docker exec -it <container_id> sh

Terraform

terraform init
terraform fmt -recursive
terraform validate
terraform plan
terraform apply
terraform destroy

Security scanning

gitleaks detect --source . --no-git --redact
trivy image --severity HIGH,CRITICAL myorg/myservice:git-<sha>
tfsec infra/
checkov -d infra/

HTTP smoke testing

curl -fsS https://example.com/health
curl -fsS https://example.com/api/version

By implementing CI/CD, IaC, and Observability as a unified system—rather than isolated initiatives—you get faster delivery, safer changes, and clearer operational insight. The biggest wins come from consistency: consistent pipelines, consistent infrastructure definitions, and consistent telemetry across every service and environment.