DevOps Best Practices: CI/CD, Infrastructure as Code, and Observability
DevOps is not a toolchain—it’s an operating model for delivering software safely and quickly. The most effective DevOps programs consistently invest in three pillars:
- CI/CD (how code becomes running software)
- Infrastructure as Code (IaC) (how environments are created and changed)
- Observability (how you understand and improve behavior in production)
This tutorial walks through best practices in each pillar with deep explanations and real commands you can run. Examples assume Linux/macOS shells; Windows users can use WSL.
1) CI/CD Best Practices (Continuous Integration / Continuous Delivery)
1.1 What “good CI” actually means
Continuous Integration is not “we run tests sometimes.” It means:
- Developers integrate to the mainline frequently (ideally daily).
- Every change is validated by an automated pipeline.
- The pipeline is fast enough to be used continuously.
- Failures are treated as urgent because they block safe delivery.
Key outcomes:
- Reduced merge conflicts
- Higher confidence in main branch
- Faster feedback loops (bugs found minutes after introduction, not weeks later)
Practical CI principles
- Trunk-based development: short-lived branches, frequent merges to main.
- Small changes: keep PRs small to reduce risk and review time.
- Test pyramid: many unit tests, fewer integration tests, minimal end-to-end tests.
- Deterministic builds: pin dependencies; avoid “works on my machine.”
Commands for deterministic dependency installs:
Node.js:
npm ci
Python:
python -m venv .venv
source .venv/bin/activate
# requirements.txt must include hashes (e.g., generated with pip-compile --generate-hashes)
pip install -r requirements.txt --require-hashes
Go:
go mod download
go test ./...
1.2 Build once, promote the same artifact
A common anti-pattern is rebuilding the application separately for staging and production. That introduces “it passed staging but prod is different” failures.
Best practice: build a single immutable artifact (container image, JAR, binary) and promote it across environments by changing configuration, not code.
Example: build a Docker image once and tag it with the Git commit SHA.
GIT_SHA="$(git rev-parse --short HEAD)"
docker build -t myapp:${GIT_SHA} .
docker tag myapp:${GIT_SHA} registry.example.com/myapp:${GIT_SHA}
docker push registry.example.com/myapp:${GIT_SHA}
Then deploy that exact tag to dev/staging/prod.
1.3 Pipeline stages that map to risk
A robust pipeline typically has stages like:
- Lint + static checks (fast, cheap)
- Unit tests (fast)
- Build artifact (repeatable)
- Security scanning (dependencies, container image)
- Integration tests (slower, higher confidence)
- Deploy to staging (automated)
- Smoke tests (validate basic behavior)
- Progressive delivery to prod (canary/blue-green)
- Post-deploy verification (SLIs/SLOs, error budgets)
Example commands for common checks
Linting (JavaScript/TypeScript):
npm run lint
Unit tests with coverage:
npm test -- --coverage
Python formatting + linting:
python -m pip install ruff black
ruff check .
black --check .
Container image vulnerability scan (Trivy):
trivy image --severity HIGH,CRITICAL registry.example.com/myapp:${GIT_SHA}
Dependency vulnerability scan (Node):
npm audit --audit-level=high
1.4 Secrets management in CI/CD
Never store secrets in source control or bake them into images. Use:
- CI secret stores (GitHub Actions Secrets, GitLab CI variables, Jenkins credentials)
- Dedicated secret managers (Vault, AWS Secrets Manager, GCP Secret Manager)
- Short-lived credentials via OIDC where possible
Anti-pattern: export AWS_SECRET_ACCESS_KEY=... in scripts committed to repo.
Better: use OIDC to obtain cloud credentials at runtime. For AWS, many CI systems can assume a role without long-lived keys.
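For example, a minimal sketch of exchanging a CI-provided OIDC token for short-lived AWS credentials (the token variable, role ARN, and account ID are assumptions; most CI systems offer a built-in integration that handles this step for you):
# $CI_OIDC_TOKEN is a hypothetical variable holding the OIDC token issued by your CI system
aws sts assume-role-with-web-identity \
  --role-arn arn:aws:iam::123456789012:role/ci-deploy \
  --role-session-name "ci-${GIT_SHA}" \
  --web-identity-token "$CI_OIDC_TOKEN" \
  --duration-seconds 900
The call returns temporary credentials that expire on their own, so there is nothing long-lived to leak or rotate.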
1.5 Progressive delivery: canary and blue/green
Deploying to production doesn’t have to be “all at once.”
- Blue/Green: maintain two identical environments (blue = live, green = new). Switch traffic when green is validated.
- Canary: roll out to a small percentage of users/traffic, observe metrics, then expand.
Why it matters: it reduces blast radius and makes rollback safer.
A simple Kubernetes canary approach might use two Deployments and a Service selector shift, or a service mesh/ingress controller with weighted routing. Even without a mesh, you can do controlled rollouts with Kubernetes’ rolling updates and careful monitoring.
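As a concrete sketch of the two-Deployment approach (the deployment names and replica counts are assumptions; a plain Service spreads traffic roughly in proportion to ready pods):
# Both Deployments carry the Service's selector label (e.g., app=myapp)
kubectl -n prod set image deploy/myapp-canary myapp=registry.example.com/myapp:${GIT_SHA}
kubectl -n prod scale deploy/myapp-canary --replicas=1    # ~10% of traffic if stable runs 9 replicas
# Watch error rate and latency, then expand the canary or abort
kubectl -n prod scale deploy/myapp-canary --replicas=5
kubectl -n prod scale deploy/myapp-canary --replicas=0    # abort: remove canary pods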
1.6 Rollback strategy: plan it before you need it
A rollback should not be improvised (“git revert and redeploy”) in the middle of an incident. You want a fast, predictable action:
- Roll back to the previous known-good artifact tag
- Roll back database changes safely (or use forward-only migrations)
- Keep a runbook: who does what, which commands, what validation
Kubernetes rollback example:
kubectl rollout history deploy/myapp
kubectl rollout undo deploy/myapp --to-revision=12
kubectl rollout status deploy/myapp
1.7 CI/CD design patterns that scale
- Pipeline as code: version your pipeline definitions.
- Reusable steps: shared scripts/actions to avoid copy-paste.
- Caching: speed matters; cache dependencies and build layers.
- Parallelization: run test suites in parallel.
- Fail fast: stop early on lint/test failures.
- Quality gates: require passing checks before merge.
Example: Docker build caching with BuildKit
export DOCKER_BUILDKIT=1
docker build --progress=plain -t myapp:${GIT_SHA} .
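If CI runners are ephemeral, you can keep the layer cache in the registry with buildx (a sketch; the buildcache tag is an assumption):
docker buildx build \
  --cache-from type=registry,ref=registry.example.com/myapp:buildcache \
  --cache-to type=registry,ref=registry.example.com/myapp:buildcache,mode=max \
  -t registry.example.com/myapp:${GIT_SHA} \
  --push .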
2) Infrastructure as Code (IaC) Best Practices
2.1 Why IaC is more than “automation”
IaC is the practice of managing infrastructure through code and version control. The real benefits are:
- Repeatability: recreate environments reliably
- Auditability: changes are reviewed, tracked, and attributable
- Safety: reduce manual, error-prone console clicking
- Scalability: manage many environments consistently
A mature IaC workflow treats infrastructure changes like application changes:
- pull requests
- automated validation
- plan review
- controlled apply
- post-change verification
2.2 Choose the right IaC tool and model
Common approaches:
- Terraform/OpenTofu: declarative, multi-cloud, strong ecosystem
- CloudFormation: AWS-native, deep integration
- Pulumi: IaC using general-purpose languages
- Ansible: configuration management, orchestration; best for OS/app config more than cloud primitives
- Kubernetes manifests/Helm/Kustomize: for cluster resources
You can mix tools, but do so intentionally and document boundaries (e.g., Terraform provisions EKS and networking; Helm deploys apps).
2.3 Terraform/OpenTofu workflow: validate → plan → apply
Install OpenTofu (a Terraform-compatible fork) or Terraform. The commands below use tofu; substitute terraform and the workflow is the same.
Initialize:
tofu init
Format and validate:
tofu fmt -recursive
tofu validate
Plan (review the diff):
tofu plan -out=tfplan
Apply the reviewed plan:
tofu apply tfplan
Best practice: remote state + locking
Local state files don’t scale and can corrupt easily. Use remote state with locking (e.g., S3 + DynamoDB, Terraform Cloud, GCS).
Why locking matters: it prevents two engineers/pipelines from applying changes concurrently and corrupting state.
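For example, assuming your configuration declares an s3 backend block, you can supply the backend details at init time (bucket, key, region, and table names are assumptions):
tofu init \
  -backend-config="bucket=my-tf-state" \
  -backend-config="key=envs/prod/terraform.tfstate" \
  -backend-config="region=eu-west-1" \
  -backend-config="dynamodb_table=tf-state-lock"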
2.4 Structure: modules, environments, and boundaries
A common, scalable structure:
- modules/: reusable building blocks (VPC, cluster, database)
- envs/dev, envs/staging, envs/prod: compositions that use those modules
Guidelines:
- Keep modules small and focused.
- Version modules (git tags or registry versions).
- Avoid environment-specific logic inside modules; pass variables instead.
- Keep blast radius small: separate state per environment and sometimes per component.
2.5 Immutable infrastructure vs configuration drift
Configuration drift happens when reality differs from code (manual console edits, ad-hoc changes). IaC reduces drift, but only if you enforce:
- No manual changes (or document break-glass procedures)
- Frequent reconciliation (plan regularly)
- Access controls (limit who can change infra outside CI)
Detect drift:
tofu plan
If the plan shows unexpected changes, investigate:
- Was something changed manually?
- Did a cloud provider default change?
- Did a module update alter behavior?
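A scheduled pipeline can turn this into automated drift detection using the plan exit code (a sketch; -detailed-exitcode returns 2 when changes are pending):
tofu plan -detailed-exitcode -out=tfplan
rc=$?
if [ "$rc" -eq 2 ]; then
  echo "Drift or pending changes detected: review tfplan"
elif [ "$rc" -ne 0 ]; then
  echo "Plan failed"; exit "$rc"
fi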
2.6 IaC testing and policy enforcement
Treat infrastructure code as testable:
- Static checks: formatting, linting, security scanning
- Policy as code: enforce rules like “no public S3 buckets,” “encryption required,” “approved regions only”
- Integration tests: create ephemeral environments and validate behavior
Security scanning for Terraform
Using tfsec (or equivalents):
tfsec .
Using checkov:
checkov -d .
Policy as code with OPA (conceptual)
OPA (Open Policy Agent) can evaluate plans against policies. In practice, you export a plan to JSON and evaluate it.
Terraform/OpenTofu plan to JSON:
tofu show -json tfplan > tfplan.json
Then evaluate with OPA (example, policy not included here):
opa eval --data policy.rego --input tfplan.json "data.iac.deny"
2.7 Secrets in IaC: what not to do
Never put secrets in:
- Terraform variables committed to git
- .tfvars files committed to the repo
- user data scripts in plaintext
- container images
Instead:
- Reference secret manager ARNs/paths
- Inject secrets at runtime (Kubernetes secrets from external secret stores)
- Use encryption (KMS) and access controls
For example, store a database password in AWS Secrets Manager and have the app retrieve it using IAM permissions rather than embedding it.
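A sketch of the retrieval side, assuming the secret name shown and that the caller's IAM role is allowed secretsmanager:GetSecretValue:
aws secretsmanager get-secret-value \
  --secret-id prod/myapp/db-password \
  --query SecretString --output text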
2.8 Database migrations: the hardest part of delivery
Infrastructure changes often involve data. The safest approach is usually:
- Backward-compatible migrations (expand/contract pattern)
- Deploy code that can handle both old and new schema
- Migrate data
- Remove old schema later
A simple example with a SQL migration tool might be:
# Example using Flyway (conceptual)
flyway -url="jdbc:postgresql://db.example.com:5432/app" \
-user="$DB_USER" -password="$DB_PASS" migrate
Best practice: run migrations as part of deployment with clear ownership and rollback strategy (often forward-only with compensating migrations).
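As an illustration of the expand phase (table and column names are hypothetical; the contract step that drops the old column happens only after every code path uses the new one):
# Expand: add the new column without touching the old one
psql "$DATABASE_URL" -c "ALTER TABLE orders ADD COLUMN customer_email text;"
# Backfill in batches, deploy code that reads/writes both columns,
# then drop the old column in a later, separate migration (contract)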
3) Observability Best Practices (Metrics, Logs, Traces)
3.1 Observability vs monitoring
Monitoring tells you known failure modes (CPU high, disk full). Observability helps you understand unknown failure modes by making systems explain themselves.
Observability is typically built from three signals:
- Metrics (aggregated numbers over time)
- Logs (discrete events with context)
- Traces (end-to-end request flow across services)
A fourth signal often included:
- Profiles (CPU/memory profiling over time)
3.2 Start with SLIs and SLOs (not dashboards)
Dashboards are useful, but the best teams start with:
- SLI (Service Level Indicator): what you measure (e.g., request success rate)
- SLO (Service Level Objective): target (e.g., 99.9% success rate over 30 days)
- Error budget: allowed unreliability (100% - SLO)
Example SLI/SLO:
- SLI: proportion of HTTP 2xx/3xx responses
- SLO: 99.9% over 30 days
- Error budget: 0.1% failures allowed (roughly 43 minutes of full downtime in a 30-day window)
Why this matters: it aligns engineering work with user experience and provides a rational basis for release velocity. If you’re burning error budget too fast, slow down releases and focus on reliability.
3.3 Instrumentation: make telemetry consistent
Logging best practices
- Use structured logs (JSON) rather than free-form strings.
- Include correlation IDs (request ID, trace ID).
- Avoid logging secrets or sensitive data.
- Log at appropriate levels (INFO for business events, WARN for recoverable issues, ERROR for failures).
Example: structured logging from an app might emit:
{"level":"info","msg":"order_created","order_id":"123","user_id":"456","trace_id":"abc..."}
Metrics best practices
- Use counters for events (requests, errors)
- Use histograms for latency (p50/p95/p99)
- Use gauges for current values (queue depth, memory)
Avoid high-cardinality labels (like raw user IDs) in metrics; they can explode cost and degrade performance.
Tracing best practices
- Propagate context across service boundaries.
- Sample intelligently (head-based or tail-based).
- Record spans around critical operations: DB calls, external APIs, queue operations.
3.4 OpenTelemetry: a practical standard
OpenTelemetry (OTel) is the de facto standard for generating and exporting telemetry.
A common architecture:
- App emits OTel metrics/logs/traces
- OTel Collector receives, batches, enriches, exports
- Backend stores/visualizes (Prometheus, Grafana, Tempo, Loki, Jaeger, etc.)
Run an OpenTelemetry Collector locally (example). The collector container needs a config file; the full config is not shown here, but the command looks like:
docker run --rm -p 4317:4317 -p 4318:4318 \
-v "$(pwd)/otel-collector-config.yaml:/etc/otelcol/config.yaml" \
otel/opentelemetry-collector:latest \
--config /etc/otelcol/config.yaml
Your app can then export OTLP to:
- http://localhost:4318 (OTLP HTTP)
- localhost:4317 (OTLP gRPC)
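Many OTel SDKs read the exporter endpoint from standard environment variables, so a minimal local setup might be:
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"
export OTEL_SERVICE_NAME="myapp"
# Start your instrumented app; telemetry flows to the local collector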
3.5 Prometheus metrics: concrete queries and checks
If you use Prometheus-style metrics, you’ll typically define alerts and dashboards using PromQL.
Examples:
Request rate:
sum(rate(http_requests_total[5m]))
Error rate:
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
p95 latency (histogram):
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Best practice: alert on symptoms (user impact), not causes. For example:
- Symptom: elevated 5xx error rate, high latency
- Cause: CPU high, DB connections exhausted (useful for debugging but not always paging)
3.6 Logs: from “search” to “investigation”
Logs are most valuable when they are:
- Queryable (structured fields)
- Correlated (trace IDs, request IDs)
- Retained appropriately (cost vs compliance)
- Redacted (no secrets)
If you’re using a tool like Loki, Elasticsearch, or a cloud logging service, you’ll often query by fields.
Example grep-based local investigation (real command):
# Find errors in the last 2000 lines of a container log file
tail -n 2000 /var/log/myapp.log | grep -i "error"
Follow logs in Kubernetes:
kubectl logs -n prod deploy/myapp -f --tail=200
Include timestamps:
kubectl logs -n prod deploy/myapp --timestamps --tail=200
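If the application emits structured JSON logs (as in 3.3), you can filter by field locally with jq (a sketch; the level field name is an assumption):
kubectl -n prod logs deploy/myapp --tail=2000 \
  | jq -cR 'fromjson? | select(.level == "error")'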
3.7 Traces: reduce MTTR dramatically
Distributed tracing answers questions like:
- Which service is slow?
- Is latency in DB, cache, or an external API?
- What percentage of requests hit a degraded dependency?
If you have trace IDs in logs, you can pivot:
- Alert fires (latency high)
- Find trace exemplars (slow traces)
- Identify the slow span (e.g., DB query)
- Jump to logs for that trace ID
- Mitigate and confirm via metrics
Best practice: ensure consistent propagation of trace context across:
- HTTP headers (traceparent)
- messaging systems (inject/extract context)
- background jobs
3.8 Alerting: actionable, owned, and tested
Bad alerts create noise; noise creates missed incidents.
Good alerts are:
- Actionable: someone knows what to do
- Owned: there is a team responsible
- Routed: goes to the right on-call rotation
- Tested: you verify alerts fire when expected
Anti-pattern: alerting on CPU > 80% for 5 minutes for every service. Better: alert on high error rate or high latency relative to SLO.
Also implement:
- Deduplication
- Rate limiting
- Maintenance windows
- Runbooks linked in alerts
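One way to satisfy “tested” is to fire a synthetic alert into Alertmanager and confirm it reaches the right channel (a sketch, assuming amtool is installed and Alertmanager is reachable at the URL shown):
# Create a short-lived synthetic alert to verify routing and notifications end to end
amtool alert add TestAlert severity=page service=myapp \
  --annotation=summary="Synthetic test alert - please ignore" \
  --alertmanager.url=http://alertmanager.example.com:9093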
3.9 Incident response: observability meets process
When incidents happen, observability should support a clear workflow:
- Detect (alerts)
- Triage (is it real user impact?)
- Mitigate (rollback, feature flag off, scale, failover)
- Communicate (status page, internal updates)
- Learn (postmortem with action items)
Practical mitigation commands (Kubernetes examples):
Scale up temporarily:
kubectl -n prod scale deploy/myapp --replicas=10
kubectl -n prod rollout status deploy/myapp
Check resource usage:
kubectl -n prod top pods
kubectl -n prod top nodes
Describe a pod to see events:
kubectl -n prod describe pod <pod-name>
4) Putting It Together: A Reference Delivery Workflow
This section ties CI/CD, IaC, and observability into one coherent practice.
4.1 A typical change lifecycle
- Developer creates a small PR.
- CI runs lint/unit tests quickly.
- CI builds an immutable artifact and scans it.
- Merge to main triggers:
- integration tests
- deployment to staging
- smoke tests
- Promotion to production:
- progressive rollout (canary)
- automated verification against SLIs
- Observability confirms health; release is completed.
- If SLIs degrade, rollout is halted and rolled back.
4.2 Verification after deploy (real checks)
Check HTTP endpoint:
curl -fsS https://myapp.example.com/health
Check a key user journey (simple smoke):
curl -fsS https://myapp.example.com/api/version
curl -fsS -X POST https://myapp.example.com/api/login \
-H 'content-type: application/json' \
-d '{"username":"smoke","password":"smoke"}'
Check Kubernetes rollout:
kubectl -n prod rollout status deploy/myapp --timeout=120s
Check recent errors:
kubectl -n prod logs deploy/myapp --tail=200 | grep -E "ERROR|Exception" || true
5) Security and Compliance as DevOps Multipliers (DevSecOps)
Security is not a gate at the end; it’s integrated into delivery.
5.1 Supply chain security
- Pin dependencies and verify integrity
- Generate SBOMs (Software Bill of Materials)
- Sign artifacts
- Enforce provenance
Generate an SBOM for a container image (Syft):
syft registry.example.com/myapp:${GIT_SHA} -o spdx-json > sbom.spdx.json
Sign an image (cosign):
cosign sign registry.example.com/myapp:${GIT_SHA}
Verify the signature (with recent cosign releases, keyless verification also requires --certificate-identity and --certificate-oidc-issuer flags):
cosign verify registry.example.com/myapp:${GIT_SHA}
5.2 Least privilege everywhere
- CI should have only the permissions it needs.
- Production access should be time-bound and audited.
- Use separate accounts/projects/subscriptions for environments.
6) Common Pitfalls and How to Avoid Them
Pitfall: “We have CI/CD” but deployments are still scary
Cause: poor test coverage, no progressive delivery, manual steps. Fix: invest in test strategy, canaries, automated verification, and rollbacks.
Pitfall: IaC exists but people still click in the console
Cause: missing features in code, slow pipeline, unclear ownership. Fix: define a break-glass process, improve IaC coverage, shorten feedback loops.
Pitfall: Lots of dashboards but no one knows what matters
Cause: no SLIs/SLOs, alert fatigue. Fix: define SLOs, alert on symptoms, link runbooks, measure error budget.
Pitfall: Observability is too expensive
Cause: high-cardinality metrics, verbose logs, too much retention. Fix: reduce cardinality, sample traces, structure logs, set retention tiers.
7) A Practical “Start Here” Checklist
If you want a concrete sequence to implement:
CI/CD
- Enforce main is always green (fix pipeline failures immediately)
- Build once, tag with commit SHA, promote the same artifact
- Add security scanning (dependencies + images)
- Add progressive delivery (canary or blue/green)
- Implement fast rollback (document and test it)
Infrastructure as Code
- Remote state + locking
- Separate state per environment
- Module boundaries and versioning
- Policy checks in CI (no public resources, encryption required)
- Drift detection (scheduled plans)
Observability
- Define SLIs/SLOs for critical services
- Structured logs with trace IDs
- Metrics for golden signals (latency, traffic, errors, saturation)
- Distributed tracing across services
- Alerts that are actionable and tied to runbooks
8) Example: A Minimal End-to-End Flow (Commands You Can Adapt)
This is a simplified flow you can run in a real project.
8.1 Local pre-flight checks
git checkout -b feature/small-change
npm ci
npm run lint
npm test
8.2 Build and scan a container
GIT_SHA="$(git rev-parse --short HEAD)"
docker build -t myapp:${GIT_SHA} .
trivy image myapp:${GIT_SHA}
8.3 Push and deploy (conceptual)
docker tag myapp:${GIT_SHA} registry.example.com/myapp:${GIT_SHA}
docker push registry.example.com/myapp:${GIT_SHA}
Deploy to Kubernetes by updating the image (example):
kubectl -n staging set image deploy/myapp myapp=registry.example.com/myapp:${GIT_SHA}
kubectl -n staging rollout status deploy/myapp
Smoke test:
curl -fsS https://staging-myapp.example.com/health
Promote to production (same image tag):
kubectl -n prod set image deploy/myapp myapp=registry.example.com/myapp:${GIT_SHA}
kubectl -n prod rollout status deploy/myapp
Verify:
curl -fsS https://myapp.example.com/health
kubectl -n prod logs deploy/myapp --tail=100
Rollback if needed:
kubectl -n prod rollout undo deploy/myapp
kubectl -n prod rollout status deploy/myapp
9) Closing Guidance: Optimize for Learning Speed
The best DevOps organizations optimize for learning speed:
- CI/CD shortens feedback loops.
- IaC makes environments reproducible and changes reviewable.
- Observability turns production into a source of truth, not mystery.
If you implement only one meta-practice: make every change small, observable, and reversible. That single idea drives safer releases, faster incident recovery, and a more sustainable engineering culture.