DevOps Best Practices: CI/CD, Infrastructure as Code, and Observability
DevOps is not a toolchain—it’s a set of practices that reduce lead time for changes, improve reliability, and make systems easier to operate. This tutorial focuses on three pillars that, when implemented together, form a strong foundation for modern delivery:
- CI/CD (Continuous Integration and Continuous Delivery/Deployment)
- Infrastructure as Code (IaC)
- Observability (metrics, logs, traces, and actionable alerting)
The goal is not “use tool X,” but to build repeatable, auditable workflows that scale with teams and complexity.
1) CI/CD Best Practices
1.1 Continuous Integration: What “good” looks like
A strong CI system ensures that every change is validated quickly and consistently. The best CI pipelines share these traits:
- Fast feedback: Most checks complete in minutes.
- Deterministic: Same input → same output (pin versions, lock dependencies).
- Hermetic where possible: Avoid relying on developer machines or mutable shared environments.
- Shift-left security: Run SAST, dependency scanning, secret scanning early.
- Artifact-based: Build once, promote the same artifact through environments.
Recommended CI stages (typical; a script sketch follows the list)
- Checkout + dependency restore
- Lint + formatting
- Unit tests
- Build artifact/container
- Security scans
- Integration tests (optional, but valuable)
- Publish artifact
- Trigger CD
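A minimal sketch of these stages as a single shell script, reusing the npm scripts and image names from the examples that follow (adapt to your CI system's native stages; names and registry are illustrative):
# Minimal CI sketch (illustrative)
set -euo pipefail
SHA=$(git rev-parse --short HEAD)
IMAGE="registry.example.com/myorg/myservice:$SHA"
npm ci                        # deterministic dependency restore
npm run lint                  # lint + formatting checks
npm test -- --coverage        # unit tests
docker build -t "$IMAGE" .    # build the artifact once
npm audit --audit-level=high                                  # dependency scan
trivy image --exit-code 1 --severity HIGH,CRITICAL "$IMAGE"   # image scan
docker push "$IMAGE"          # publish immutable artifact; CD takes over from here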
1.2 Example: A simple CI pipeline for a Node.js service
Assume a Node.js API that builds into a Docker image.
Local commands (what CI should run)
# Install dependencies deterministically
npm ci
# Lint and format checks
npm run lint
npm run format:check
# Unit tests with coverage
npm test -- --coverage
# Build production bundle (if applicable)
npm run build
Build a Docker image
docker build -t myorg/myservice:git-$(git rev-parse --short HEAD) .
Run container locally
docker run --rm -p 3000:3000 myorg/myservice:git-$(git rev-parse --short HEAD)
Best practice: CI should never rely on “latest” base images or floating tags. Pin base images by digest when possible.
# Example: pinning a base image digest (illustrative)
# FROM node:20-alpine@sha256:<digest>
1.3 Build once, promote the same artifact
A common anti-pattern is rebuilding separately for staging and production. That creates drift and makes “what is running?” hard to answer.
Better approach:
- Build once in CI.
- Tag with immutable identifiers (commit SHA).
- Push to registry.
- Deploy by referencing that exact tag/digest.
Tagging strategy
- myservice:<git-sha> (immutable)
- myservice:main (mutable convenience tag, optional)
- myservice:v1.4.2 (release tag)
Commands:
SHA=$(git rev-parse --short HEAD)
docker tag myorg/myservice:git-$SHA registry.example.com/myorg/myservice:$SHA
docker push registry.example.com/myorg/myservice:$SHA
To get the immutable digest after pushing:
docker pull registry.example.com/myorg/myservice:$SHA
docker inspect --format='{{index .RepoDigests 0}}' registry.example.com/myorg/myservice:$SHA
1.4 Test pyramid and where to spend time
A healthy test suite typically follows:
- Many unit tests: fast, deterministic
- Some integration tests: validate DB, queues, caches, external contracts
- Few end-to-end tests: expensive; keep them focused on critical flows
Integration test example with Docker Compose
If your service depends on Postgres, run an ephemeral DB in CI:
docker run --rm -d --name pg \
-e POSTGRES_PASSWORD=postgres \
-e POSTGRES_DB=mydb \
-p 5432:5432 postgres:16
# Run migrations and integration tests
export DATABASE_URL="postgres://postgres:postgres@localhost:5432/mydb"
npm run migrate
npm run test:integration
docker stop pg
Best practice: make integration tests self-contained and idempotent. Tests should create and clean up their own data.
1.5 Security in CI: practical checks
Security scanning is most effective when it is:
- automated,
- fast,
- and enforced with clear policies.
Dependency vulnerability scanning (example with npm audit)
npm audit --audit-level=high
Secret scanning with Gitleaks
Install and run:
gitleaks version
# Scan the repository
gitleaks detect --source . --no-git --redact
Container image scanning with Trivy
trivy image --severity HIGH,CRITICAL registry.example.com/myorg/myservice:$SHA
Best practice: define what fails the build (e.g., critical vulns only) and create an exception process that is documented and time-bound.
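A sketch of such a gate, using the scanners above and failing only on high/critical findings (the thresholds and ignore-file convention are illustrative):
# Fail the pipeline only on high/critical findings
npm audit --audit-level=high
trivy image --exit-code 1 --severity HIGH,CRITICAL registry.example.com/myorg/myservice:$SHA
# Time-bound exceptions can live in an ignore file (e.g., .trivyignore),
# reviewed and expired like any other change.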
1.6 Continuous Delivery vs Continuous Deployment
- Continuous Delivery: every change is deployable; production deploy may require approval.
- Continuous Deployment: every change that passes the pipeline deploys to production automatically.
A pragmatic approach:
- Auto-deploy to dev/staging on merge.
- Deploy to production via approval + change window (or progressive rollout) until confidence is high.
1.7 Deployment strategies: reduce risk
Blue/Green
Two environments (blue and green). Deploy to the idle one, switch traffic, keep rollback easy.
Canary
Send a small percentage of traffic to the new version and gradually increase.
Rolling update
Replace instances gradually. Common in Kubernetes.
Best practice: pair progressive delivery with strong observability and automated rollback conditions.
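A minimal sketch of an automated rollback condition for a canary, assuming Prometheus is reachable at $PROM_URL, the http_requests_total metric used later in this tutorial, and a Kubernetes Deployment named myservice (all illustrative):
# Watch the error rate for ~10 minutes; roll back if it exceeds 2% at any check
QUERY='sum(rate(http_requests_total{service="myservice",status=~"5.."}[5m])) / sum(rate(http_requests_total{service="myservice"}[5m]))'
for i in $(seq 1 10); do
  if curl -fsS -G "$PROM_URL/api/v1/query" --data-urlencode "query=$QUERY" \
     | jq -e '(.data.result[0].value[1] // "0") | tonumber > 0.02' >/dev/null; then
    echo "Canary error rate above 2%, rolling back"
    kubectl rollout undo deployment/myservice
    exit 1
  fi
  sleep 60
done
echo "Canary healthy, continuing rollout"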
2) Infrastructure as Code (IaC) Best Practices
IaC means your infrastructure is:
- versioned (Git history),
- reviewed (PRs),
- repeatable (reprovision reliably),
- auditable (who changed what and why).
2.1 Choose declarative, keep modules small, enforce standards
Whether you use Terraform, Pulumi, CloudFormation, or others, good IaC tends to:
- avoid giant “god modules,”
- expose clear inputs/outputs,
- include validation,
- and keep environments consistent.
This tutorial uses Terraform for its examples because Terraform is widely used.
2.2 Terraform project structure
A common structure:
infra/
  modules/
    network/
    service/
  envs/
    dev/
    staging/
    prod/
- modules/ contains reusable building blocks.
- envs/ wires modules together with environment-specific values.
2.3 Remote state, locking, and state hygiene
Terraform state must be protected:
- stored remotely (not on laptops),
- locked during changes,
- backed up.
For AWS, a typical approach is S3 + DynamoDB locking. (Exact backend config varies.)
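One common pattern is partial backend configuration: declare an empty backend "s3" block in the Terraform code and supply the details at init time (bucket, key, and table names below are illustrative):
terraform init \
  -backend-config="bucket=myorg-terraform-state" \
  -backend-config="key=envs/dev/terraform.tfstate" \
  -backend-config="region=us-east-1" \
  -backend-config="dynamodb_table=terraform-locks" \
  -backend-config="encrypt=true"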
Commands you’ll use often:
terraform fmt -recursive
terraform validate
terraform init
terraform plan
terraform apply
Best practice: run terraform plan in CI and require approval before apply in production.
2.4 Example: Create a minimal AWS S3 bucket with Terraform
Create infra/envs/dev/main.tf:
terraform {
  required_version = ">= 1.6.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

resource "aws_s3_bucket" "app_bucket" {
  bucket = "myorg-dev-app-bucket-123456"
}

resource "aws_s3_bucket_versioning" "versioning" {
  bucket = aws_s3_bucket.app_bucket.id

  versioning_configuration {
    status = "Enabled"
  }
}
Run:
cd infra/envs/dev
terraform init
terraform fmt
terraform validate
terraform plan
terraform apply
Destroy when done:
terraform destroy
Best practice: enable encryption, block public access, and define lifecycle rules. Security defaults should be explicit.
2.5 Manage secrets correctly (don’t put them in state)
A critical IaC rule: avoid placing sensitive secrets directly in Terraform state. State files often end up accessible to many systems and people.
Preferred patterns:
- Store secrets in a dedicated secret manager (AWS Secrets Manager, GCP Secret Manager, Vault).
- Reference secret ARNs/paths in IaC, not secret values.
- Inject secrets at runtime (Kubernetes secrets via external secret operators, or sidecars).
Example (AWS Secrets Manager data source reference conceptually):
data "aws_secretsmanager_secret" "db_password" {
name = "prod/db_password"
}
Then your application runtime fetches it, or you pass only the secret reference.
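At runtime the application (or its entrypoint script) can then resolve the value itself, for example with the AWS CLI (the secret name matches the data source above; treat the exact mechanism as illustrative):
# Fetch the secret value at runtime; never write it to state, code, or logs
DB_PASSWORD=$(aws secretsmanager get-secret-value \
  --secret-id prod/db_password \
  --query SecretString \
  --output text)
export DB_PASSWORD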
2.6 Immutable infrastructure and configuration drift
Configuration drift happens when someone changes infrastructure manually in the console. IaC can detect drift, but only if you:
- run plans regularly,
- restrict manual changes via IAM policies,
- treat IaC as the source of truth.
Drift detection:
terraform plan -detailed-exitcode
Exit codes:
- 0: no changes
- 2: changes present (drift or intended updates)
- 1: error
Best practice: schedule drift detection (nightly) and alert on drift.
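A sketch of such a scheduled check, suitable for a nightly cron job or pipeline schedule (the notification step is a placeholder for whatever alerting tool you use):
cd infra/envs/prod
terraform init -input=false >/dev/null
terraform plan -detailed-exitcode -input=false
case $? in
  0) echo "No drift detected" ;;
  2) echo "Drift detected in prod"      # replace with your alerting command
     exit 1 ;;
  *) echo "Drift check itself failed"   # also worth alerting on
     exit 1 ;;
esac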
2.7 Policy as code and guardrails
Guardrails prevent risky changes (public S3 buckets, open security groups, unencrypted databases).
Tools and approaches:
- Terraform Cloud/Enterprise policies (Sentinel)
- Open Policy Agent (OPA) / Conftest
- Checkov / tfsec for static checks
Example with tfsec:
tfsec infra/envs/dev
Example with checkov:
checkov -d infra/envs/dev
Best practice: fail CI on high-severity policy violations, and require explicit exceptions with justification.
2.8 IaC reviews: what to look for in PRs
When reviewing Terraform PRs, focus on:
- Blast radius: which resources are replaced/destroyed?
- Networking changes: subnets, routes, security groups
- Data stores: encryption, backups, deletion protection
- IAM: least privilege, no wildcard permissions
- Cost: instance sizes, autoscaling bounds
- Rollout plan: can it be applied safely? any downtime?
Use:
terraform plan
And scrutinize:
- "-/+" (resource will be replaced)
- "-" (resource will be destroyed)
- changes to security rules and IAM
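To make the blast radius explicit in CI, the plan can be rendered as JSON and filtered for every resource that would be destroyed or replaced (a sketch; requires jq):
terraform plan -out=plan.tfplan
terraform show -json plan.tfplan \
  | jq -r '.resource_changes[]
           | select(.change.actions | index("delete"))
           | .address'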
3) Observability Best Practices
Observability answers: “What is happening inside my system?” It goes beyond monitoring by enabling you to ask new questions without shipping new code each time.
The three primary signals:
- Metrics (aggregated numeric time series)
- Logs (discrete events)
- Traces (request-level, distributed context)
A fourth, often overlooked component is alerting (actionable, low-noise notifications).
3.1 Golden signals and SLOs
A practical starting point is the Golden Signals:
- Latency
- Traffic
- Errors
- Saturation
Then define SLOs (Service Level Objectives) like:
- “99.9% of requests under 300ms over 30 days”
- “Error rate < 0.1% over 7 days”
SLOs shift the conversation from “CPU is high” to “users are impacted.”
Best practice: alert on symptoms that violate SLOs, not on every resource metric.
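One way to express this is an error-budget burn rate: divide the observed error rate by the budget the SLO allows (0.1% for 99.9% availability); values well above 1 mean the budget is being consumed too quickly. A sketch against the Prometheus HTTP API (URL and metric names are illustrative):
# Burn rate over the last hour for a 99.9% availability SLO (error budget 0.1%)
curl -fsS -G "$PROM_URL/api/v1/query" --data-urlencode 'query=(
  sum(rate(http_requests_total{service="myservice",status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total{service="myservice"}[1h]))
) / 0.001'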
3.2 Structured logging: make logs queryable
Logs should be:
- structured (JSON),
- consistent (fields like service, env, trace_id, request_id),
- leveled (DEBUG, INFO, WARN, ERROR),
- privacy-aware (no secrets, no unnecessary PII).
Example JSON log line:
{
  "timestamp": "2026-04-30T12:00:00Z",
  "level": "INFO",
  "service": "myservice",
  "env": "staging",
  "message": "request completed",
  "http_method": "GET",
  "path": "/api/orders",
  "status": 200,
  "duration_ms": 42,
  "trace_id": "3b1d2f0f2c1a4c0b",
  "request_id": "req_9f2c..."
}
Best practice: ensure every request log includes a correlation ID and trace ID.
3.3 Metrics: instrument what matters
Metrics should reflect:
- request duration histograms,
- request counts by route/status,
- error counts,
- dependency timings (DB, cache, external APIs),
- queue depth and consumer lag,
- saturation (CPU/memory), but as supporting context.
If you use Prometheus, you’ll typically scrape /metrics endpoints and visualize in Grafana.
Example: basic PromQL queries
- Request rate:
sum(rate(http_requests_total{service="myservice"}[5m]))
- Error rate:
sum(rate(http_requests_total{service="myservice",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="myservice"}[5m]))
- P95 latency:
histogram_quantile(
0.95,
sum by (le) (rate(http_request_duration_seconds_bucket{service="myservice"}[5m]))
)
Best practice: use histograms for latency, not averages. Averages hide tail latency.
3.4 Tracing: follow a request across services
Distributed tracing is essential once you have:
- multiple services,
- async messaging,
- or external dependencies.
With OpenTelemetry (OTel), you can propagate context across HTTP and messaging boundaries.
Key concepts:
- Span: a timed operation (e.g., “HTTP GET /orders”)
- Trace: a tree of spans for one request
- Context propagation: trace IDs passed via headers (e.g., W3C traceparent)
Best practice: sample intelligently. Full sampling in high-traffic production can be expensive. Use head-based sampling for baseline and tail-based sampling for errors/slow requests if your backend supports it.
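For a Node.js service like the one in this tutorial, a common zero-code starting point is OpenTelemetry auto-instrumentation configured through the standard environment variables (the package, endpoint, and sampling ratio below are illustrative, not the only option):
# One-time setup (sketch): npm install @opentelemetry/api @opentelemetry/auto-instrumentations-node
export OTEL_SERVICE_NAME="myservice"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://otel-collector:4318"
export OTEL_TRACES_SAMPLER="parentbased_traceidratio"   # head-based sampling
export OTEL_TRACES_SAMPLER_ARG="0.1"                    # keep ~10% of traces
node --require @opentelemetry/auto-instrumentations-node/register server.js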
3.5 OpenTelemetry Collector: a practical architecture
Instead of sending telemetry directly from apps to multiple vendors, use the OpenTelemetry Collector as an agent/gateway:
- Apps export OTLP (OpenTelemetry Protocol) to the collector.
- Collector enriches, batches, samples, and forwards to:
- Prometheus/remote write for metrics,
- Loki/Elastic for logs,
- Tempo/Jaeger for traces,
- or a SaaS backend.
Benefits:
- vendor flexibility,
- centralized control of sampling and enrichment,
- fewer egress endpoints from workloads.
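A common way to run the collector as a local agent is the contrib distribution's container image with your pipeline configuration mounted in (image tag and file names are illustrative):
docker run --rm \
  -p 4317:4317 -p 4318:4318 \
  -v "$(pwd)/otel-collector.yaml:/etc/otelcol/config.yaml" \
  otel/opentelemetry-collector-contrib:<version> \
  --config=/etc/otelcol/config.yaml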
3.6 Alerting: reduce noise, increase actionability
Alerts should be:
- tied to user impact (SLO burn rate, error spikes),
- actionable (clear runbook and owner),
- deduplicated and routed properly.
Anti-patterns
- Alerting on CPU > 80% with no context
- Alerting on every 404
- Alerts without runbooks
Better patterns
- Alert on high 5xx error rate sustained for N minutes
- Alert on latency SLO burn rate
- Alert when queue lag exceeds a threshold and is increasing
Example Prometheus alert logic (conceptual query; actual alert rules depend on your setup):
(
sum(rate(http_requests_total{service="myservice",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{service="myservice"}[5m]))
) > 0.02
Best practice: include these in every alert:
- summary of impact,
- dashboard link,
- runbook link,
- recent deploy marker (if available).
3.7 Runbooks and incident response
A runbook is a step-by-step guide for common incidents:
- “DB connections exhausted”
- “Elevated 5xx after deploy”
- “Queue consumer lag increasing”
A good runbook includes:
- how to confirm the issue,
- likely causes,
- safe mitigations,
- rollback steps,
- escalation contacts,
- links to dashboards and logs.
Best practice: treat runbooks as code in the same repo as the service, reviewed and updated with changes.
4) Putting It Together: A Practical End-to-End Workflow
This section ties CI/CD + IaC + Observability into a cohesive delivery loop.
4.1 A reference flow
- Developer opens a PR.
- CI runs:
  - lint, tests
  - build container
  - scan dependencies and image
  - publish artifact tagged by commit SHA
- Merge to main triggers CD:
  - deploy to staging using the same artifact
  - run smoke tests
- Observability checks:
  - confirm error rate/latency stable
  - verify traces show healthy dependencies
- Promote to production:
  - canary rollout
  - automated rollback if SLO burn rate spikes
- IaC changes:
  - Terraform plan in CI
  - apply with approval
  - drift detection nightly
4.2 Smoke tests after deploy
Smoke tests should validate critical paths quickly.
Example with curl:
BASE_URL="https://staging.example.com"
curl -fsS "$BASE_URL/health"
curl -fsS "$BASE_URL/api/version"
Example with hey for quick load sampling:
hey -z 30s -c 20 https://staging.example.com/api/orders
Best practice: smoke tests should be fast and deterministic; deeper load tests can run on schedule or before major releases.
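Because freshly deployed instances can take a few seconds to become ready, a small bounded retry keeps smoke tests deterministic without hiding real failures (timeouts are illustrative):
# Wait up to ~60 seconds for the service to report healthy, then fail hard
for i in $(seq 1 30); do
  curl -fsS "$BASE_URL/health" >/dev/null && break
  [ "$i" -eq 30 ] && { echo "service never became healthy"; exit 1; }
  sleep 2
done
curl -fsS "$BASE_URL/api/version"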
4.3 Deployment verification using metrics
After deploying a canary, validate:
- Error rate did not increase
- P95 latency did not regress
- Saturation is stable (CPU/memory)
- Key business metrics (orders created, logins) are normal
If you track deploy markers, correlate changes:
- “Error spike started 2 minutes after deploy X”
Best practice: automatically annotate dashboards when deployments happen.
5) Common Pitfalls and How to Avoid Them
Pitfall: “CI is slow, so we skip tests”
Fix:
- parallelize tests,
- cache dependencies,
- separate fast checks from slow checks,
- run slow suites on merge or nightly.
Pitfall: “We rebuild for production”
Fix:
- build once, promote immutable artifacts.
Pitfall: “Terraform state is shared and unmanaged”
Fix:
- remote state + locking,
- least privilege access to state,
- separate state per environment.
Pitfall: “We have dashboards but still don’t know what’s wrong”
Fix:
- add tracing with consistent context propagation,
- improve structured logs,
- define SLOs and alert on burn rate.
Pitfall: “Alerts are noisy, everyone ignores them”
Fix:
- alert on symptoms (user impact),
- add runbooks,
- tune thresholds and windows,
- route alerts to the right owners.
6) A Minimal Checklist You Can Apply Immediately
CI/CD
- Deterministic builds (npm ci, lockfiles, pinned versions)
- Lint + unit tests on every PR
- Build once, push immutable artifact (SHA tag)
- Security scanning (deps, secrets, image)
- Progressive delivery (canary/rolling) + rollback plan
IaC
- All infra changes via PR
- Remote state + locking
- Policy checks (tfsec/checkov)
- Secrets not stored in state
- Drift detection scheduled
Observability
- Structured logs with trace_id and request_id
- Metrics for golden signals + histograms for latency
- Distributed tracing for key services
- SLOs defined and alerts tied to user impact
- Runbooks stored with code
7) Next Steps (Practical Improvements)
If you want to deepen maturity beyond the basics:
- Introduce feature flags for safer releases and experiments.
- Add contract testing (e.g., Pact) for critical service boundaries.
- Adopt GitOps (Argo CD / Flux) so deployments are reconciled from Git.
- Implement automatic rollback based on SLO burn rate.
- Standardize templates for new services (CI pipeline, Terraform module skeleton, OTel instrumentation, dashboards, and alerts).
Appendix: Useful Commands Reference
Git and tagging
git status
git diff
git rev-parse --short HEAD
git tag -a v1.2.3 -m "Release v1.2.3"
git push --tags
Docker
docker build -t myservice:local .
docker images
docker run --rm -p 3000:3000 myservice:local
docker logs -f <container_id>
docker exec -it <container_id> sh
Terraform
terraform init
terraform fmt -recursive
terraform validate
terraform plan
terraform apply
terraform destroy
Security scanning
gitleaks detect --source . --no-git --redact
trivy image --severity HIGH,CRITICAL myorg/myservice:git-<sha>
tfsec infra/
checkov -d infra/
HTTP smoke testing
curl -fsS https://example.com/health
curl -fsS https://example.com/api/version
By implementing CI/CD, IaC, and Observability as a unified system—rather than isolated initiatives—you get faster delivery, safer changes, and clearer operational insight. The biggest wins come from consistency: consistent pipelines, consistent infrastructure definitions, and consistent telemetry across every service and environment.