DevOps Resources: CI/CD, Infrastructure as Code, Observability & Automation
This tutorial is a practical, command-heavy guide to core DevOps capabilities: CI/CD, Infrastructure as Code (IaC), observability, and automation. It’s written to be used as a reference you can copy from while building real pipelines and systems.
Table of Contents
- 1. What “DevOps” Means in Practice
- 2. CI/CD: Build, Test, Package, Release
- 3. Infrastructure as Code (IaC)
- 4. Observability: Metrics, Logs, Traces, and SLOs
- 5. Automation: Repeatability at Scale
- 6. Security Essentials: Supply Chain, Secrets, and Least Privilege
- 7. A Practical End-to-End Example (Local)
- 8. Curated Resource List
1. What “DevOps” Means in Practice
DevOps is less a job title and more a set of operational outcomes:
- Fast, safe delivery (CI/CD)
- Repeatable infrastructure (IaC)
- High-quality signals about system health (observability)
- Eliminating toil through automation (scripts, runbooks, self-service)
A useful mental model is a feedback loop:
- Code changes are proposed (PR).
- CI runs tests, security checks, and builds artifacts.
- CD deploys to environments using consistent mechanisms.
- Observability detects regressions quickly.
- Automation accelerates response and prevents repeated manual work.
The goal is not “deploy more” at any cost; it’s “deploy more safely” with measurable reliability.
2. CI/CD: Build, Test, Package, Release
2.1 CI/CD design principles
A robust pipeline usually follows these principles:
- Reproducibility: builds are deterministic; dependencies are pinned.
- Fast feedback: run cheap checks early (lint, unit tests), slow checks later (integration tests).
- Artifact immutability: build once, promote the same artifact across environments.
- Policy as code: security and compliance checks are automated.
- Environment parity: dev/staging/prod are as similar as possible.
- Progressive delivery: release gradually, observe, then expand.
A common anti-pattern is “deploy from a developer machine.” Instead, the pipeline should be the only path to production.
2.2 A minimal CI pipeline (GitHub Actions)
CI systems themselves are typically configured in YAML. The following is a minimal GitHub Actions workflow that:
- checks out code
- sets up Node.js
- installs dependencies
- runs tests
- builds
Create .github/workflows/ci.yml:
name: ci

on:
  pull_request:
  push:
    branches: [ "main" ]

jobs:
  test-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: "npm"
      - name: Install
        run: npm ci
      - name: Test
        run: npm test -- --ci
      - name: Build
        run: npm run build
Key details:
- npm ci is preferred in CI because it installs strictly from package-lock.json.
- Caching speeds up runs but should not compromise correctness: cache only dependency downloads, not build outputs, unless you know what you're doing.
- The workflow triggers on PRs and on pushes to main.
To run the same steps locally (a best practice for developer experience):
npm ci
npm test -- --ci
npm run build
2.3 Build artifacts, versioning, and SBOM
Artifacts are outputs of CI that you can deploy: a container image, a zip, a binary, etc. A key DevOps rule:
Build once; deploy many times.
Versioning
A practical approach is Semantic Versioning plus build metadata:
- 1.4.0 (release)
- 1.4.0-rc.1 (release candidate)
- 1.4.0+git.<sha> (build metadata)
In CI, you can generate a version string:
git rev-parse --short HEAD
git describe --tags --always
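Combining the two, a version string with build metadata can be composed in shell. This is a sketch; compose_version is a hypothetical helper name, and in CI the SHA would come from git rev-parse rather than a literal:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Compose a SemVer string with git build metadata, e.g. 1.4.0+git.abc1234.
compose_version() {
  local base="$1" sha="$2"
  printf '%s+git.%s\n' "$base" "$sha"
}

# In CI you would use: compose_version "1.4.0" "$(git rev-parse --short HEAD)"
compose_version "1.4.0" "abc1234"
```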
SBOM (Software Bill of Materials)
An SBOM lists components included in your build. Many organizations require it for supply chain security.
Example using syft (works well for containers and directories):
# Install (macOS)
brew install syft
# Generate SBOM for a container image
syft your-image:tag -o spdx-json > sbom.spdx.json
# Or for a local directory
syft dir:. -o cyclonedx-json > sbom.cdx.json
You can store SBOMs as build artifacts and attach them to releases.
2.4 Container image build & push (Docker)
A typical pipeline builds an image and pushes it to a registry.
Build locally
docker build -t myapp:dev .
docker run --rm -p 8080:8080 myapp:dev
Tag with commit SHA
SHA="$(git rev-parse --short HEAD)"
docker tag myapp:dev "ghcr.io/yourorg/myapp:${SHA}"
Login and push (GitHub Container Registry example)
echo "$GITHUB_TOKEN" | docker login ghcr.io -u youruser --password-stdin
docker push "ghcr.io/yourorg/myapp:${SHA}"
Best practices:
- Use multi-stage builds to keep images small.
- Pin base images (e.g., node:20-alpine), but also keep them updated.
- Avoid baking secrets into images (use runtime secrets).
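A multi-stage build keeps toolchains and dev dependencies out of the runtime image. This is a sketch assuming a Node app whose build output lands in dist/ with a dist/server.js entrypoint (both paths are hypothetical):

```dockerfile
# Build stage: full dependency set, compile the app
FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage: production dependencies and build output only
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY --from=build /app/dist ./dist
USER node
CMD ["node", "dist/server.js"]
```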
2.5 Deployment strategies: rolling, blue/green, canary
How you deploy matters as much as what you deploy.
Rolling deployment
Replace instances gradually. Pros: simple; Cons: mixed versions during rollout.
Blue/green
Two environments (blue=live, green=next). Switch traffic after validation. Pros: fast rollback; Cons: higher cost.
Canary
Release to a small percentage of traffic, observe, then expand. Pros: safest at scale; Cons: requires routing/metrics maturity.
A canary mindset depends on observability: you must measure errors/latency and compare canary vs baseline.
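The comparison step can be sketched in plain shell: given measured error rates for canary and baseline, decide whether to promote. The 1.5x tolerance here is illustrative, not a standard threshold:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Return success (0) if the canary error rate is within tolerance of baseline.
# Inputs are plain fractions (errors / requests); awk does the float math.
canary_healthy() {
  local canary_rate="$1" baseline_rate="$2"
  awk -v c="$canary_rate" -v b="$baseline_rate" \
    'BEGIN { exit !(c <= b * 1.5 + 0.001) }'
}

if canary_healthy 0.012 0.010; then
  echo "promote"
else
  echo "rollback"
fi
```

In a real pipeline the two rates would come from your metrics backend (e.g., a PromQL query against the canary and baseline deployments).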
3. Infrastructure as Code (IaC)
IaC is about managing infrastructure with the same discipline as software:
- code review
- version control
- automated testing
- repeatable provisioning
Two broad categories:
- Provisioning (Terraform, Pulumi, CloudFormation): create cloud resources.
- Configuration management (Ansible, Chef, Puppet): configure OS and apps.
In modern setups, Kubernetes and managed services reduce the need for heavy configuration management, but it still matters for VMs, edge cases, and bootstrapping.
3.1 Terraform fundamentals
Terraform describes desired infrastructure in code and reconciles it via:
- terraform init (download providers, set up backend)
- terraform plan (show changes)
- terraform apply (make changes)
- terraform destroy (tear down)
Basic workflow:
terraform fmt -recursive
terraform validate
terraform init
terraform plan -out tfplan
terraform apply tfplan
Important concepts:
- Providers: plugins for AWS/GCP/Azure/Kubernetes/etc.
- Resources: actual infrastructure objects.
- Data sources: read existing infrastructure.
- State: Terraform’s record of what it manages.
State is critical: losing it can cause drift and accidental recreation.
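These building blocks can be seen without any cloud account. The following hypothetical sketch uses the hashicorp/local provider just to show the shape of provider, resource, and output blocks:

```hcl
terraform {
  required_providers {
    local = {
      source  = "hashicorp/local"
      version = "~> 2.0"
    }
  }
}

# A resource: Terraform will create and track this file in state.
resource "local_file" "motd" {
  filename = "${path.module}/motd.txt"
  content  = "managed by terraform\n"
}

# An output: a value exposed after apply.
output "motd_path" {
  value = local_file.motd.filename
}
```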
3.2 Remote state, locking, and environments
Why remote state?
Local state (terraform.tfstate on a laptop) is dangerous:
- not shared across team
- no locking (two applies can collide)
- harder to secure
Use a remote backend (S3 + DynamoDB locking on AWS, GCS on GCP, Terraform Cloud, etc.).
Even without showing provider-specific backend config, the operational commands look the same:
terraform init -reconfigure
terraform plan
terraform apply
Environments: dev/staging/prod
Avoid copy-pasting entire Terraform directories. Prefer:
- modules for reusable components
- separate workspaces or separate state backends per environment
- environment-specific variable files
Example usage:
terraform workspace new dev
terraform workspace select dev
terraform plan -var-file=env/dev.tfvars
terraform apply -var-file=env/dev.tfvars
Note: many teams prefer separate state per environment directory rather than workspaces, because it’s harder to accidentally apply to the wrong workspace when you’re tired.
3.3 Example: provisioning a VM (conceptual) + best practices
Terraform code varies by cloud, but the structure is consistent:
- network
- compute
- security rules
- outputs
Best practices you can apply everywhere:
- Small modules with clear inputs/outputs
- No secrets in state
- Use terraform plan in CI and require approval for apply
- Tag resources (owner, cost center, environment)
- Policy checks (OPA/Conftest, Sentinel, or cloud-native policies)
A common CI pattern:
terraform fmt -check -recursive
terraform validate
terraform plan -no-color -out tfplan
Then, in a protected environment step (manual approval):
terraform apply -no-color tfplan
3.4 Configuration management: Ansible basics
Ansible is useful for:
- configuring VMs
- installing packages
- templating config files
- running repeatable operational tasks
Install:
python3 -m pip install --user ansible
ansible --version
Inventory example (inventory.ini):
[web]
10.0.0.10
10.0.0.11
Ping hosts:
ansible -i inventory.ini web -m ping
Run a command:
ansible -i inventory.ini web -a "uname -a"
Run a playbook:
ansible-playbook -i inventory.ini site.yml
Operational best practices:
- Use idempotent tasks (safe to run repeatedly).
- Use roles for reusable configuration.
- Store secrets in Ansible Vault or an external secret manager.
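A minimal site.yml for the inventory above might look like the following sketch (the package and service names are illustrative; both tasks are idempotent, so re-running the playbook is safe):

```yaml
- name: Configure web hosts
  hosts: web
  become: true
  tasks:
    - name: Install nginx
      ansible.builtin.package:
        name: nginx
        state: present

    - name: Ensure nginx is running and enabled
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true
```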
4. Observability: Metrics, Logs, Traces, and SLOs
Observability answers: “What’s happening inside the system?”—not just “Is it up?”
Three pillars:
- Metrics: numeric time series (latency, error rate, CPU)
- Logs: event records (errors, requests, audits)
- Traces: per-request journey across services
A fourth pillar often included in practice:
- Profiling: CPU/memory hotspots (continuous profiling)
4.1 What to measure and why
Start with the Golden Signals (common SRE practice):
- Latency: how long requests take
- Traffic: request rate, throughput
- Errors: error rate, failed requests
- Saturation: resource utilization (CPU, memory, queue depth)
For APIs, also track:
- p50/p95/p99 latency (tail latency matters)
- HTTP status code counts
- dependency latency (DB, cache, external APIs)
A good metric is:
- actionable (you know what to do when it changes)
- stable (not too noisy)
- tied to user impact
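With a Prometheus-style setup, the API signals above translate into queries. This sketch assumes a hypothetical histogram http_request_duration_seconds and a counter http_requests_total with a code label:

```promql
# p95 latency over the last 5 minutes (requires histogram buckets)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# error ratio: 5xx responses divided by all responses
sum(rate(http_requests_total{code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
```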
4.2 Prometheus + Grafana quickstart (local)
You can run Prometheus and Grafana locally using Docker. This section uses real commands and focuses on the operational flow.
Start Grafana quickly
docker run -d --name grafana -p 3000:3000 grafana/grafana:latest
Open http://localhost:3000 (default login is admin / admin, then change it).
Run a node exporter (host metrics)
docker run -d --name node-exporter -p 9100:9100 prom/node-exporter:latest
curl -s http://localhost:9100/metrics | head
Run Prometheus
Prometheus needs a config file. Create prometheus.yml:
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["host.docker.internal:9100"]

Note: host.docker.internal resolves on Docker Desktop (macOS/Windows). On Linux, add --add-host=host.docker.internal:host-gateway to the Prometheus docker run command below.
Run Prometheus:
docker run -d --name prometheus \
-p 9090:9090 \
-v "$PWD/prometheus.yml:/etc/prometheus/prometheus.yml:ro" \
prom/prometheus:latest
Open http://localhost:9090.
Try a query:
up
rate(node_cpu_seconds_total[5m])
What you just built:
- an exporter emits metrics at /metrics
- Prometheus scrapes and stores them
- Grafana visualizes them
This is the same pattern you’ll use in Kubernetes and production, just with service discovery and more robust storage.
4.3 Logging with structured JSON and correlation IDs
Logs become dramatically more useful when they are:
- structured (JSON)
- include context (service name, environment, request id)
- consistent across services
A simple example of emitting JSON logs from a shell script:
REQUEST_ID="$(uuidgen | tr '[:upper:]' '[:lower:]')"
printf '{"level":"info","msg":"request started","request_id":"%s","service":"payments","env":"dev"}\n' "$REQUEST_ID"
In application code, you typically:
- generate or propagate a request_id (or trace_id)
- include it in every log line
- include it in HTTP response headers for debugging
When logs are centralized (ELK/OpenSearch, Loki, Cloud Logging), you can search by request_id to reconstruct user journeys.
4.4 Distributed tracing with OpenTelemetry
Distributed tracing is essential once you have multiple services. OpenTelemetry (OTel) is the industry standard for instrumentation.
Concepts:
- Trace: the whole request
- Span: one operation (HTTP call, DB query)
- Context propagation: passing trace IDs between services
A practical approach:
- instrument services with OpenTelemetry SDK
- export traces to a collector
- send to a backend (Jaeger, Tempo, Honeycomb, etc.)
Run Jaeger locally:
docker run -d --name jaeger \
-p 16686:16686 \
-p 4317:4317 \
-p 4318:4318 \
jaegertracing/all-in-one:latest
- UI: http://localhost:16686
- OTLP gRPC: 4317
- OTLP HTTP: 4318
If your app exports OTLP to http://localhost:4318, you can view traces in Jaeger.
Why tracing matters operationally:
- find the slow dependency causing p95 latency spikes
- detect retry storms
- understand fan-out patterns (one request triggers 20 downstream calls)
4.5 SLOs, error budgets, and alerting
SLIs are measurements (e.g., “% of requests under 300ms”). SLOs are targets (e.g., “99.9% under 300ms over 30 days”). SLAs are contracts with users/customers.
Example SLI/SLO:
- SLI: successful requests / total requests
- SLO: 99.95% success over 28 days
Error budget:
- If SLO is 99.95%, allowed error is 0.05%
- Over 28 days, that’s the “budget” you can spend on incidents and risky changes
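The arithmetic is worth making concrete. This shell sketch (error_budget_minutes is a hypothetical helper) converts an SLO percentage and a window into the allowed minutes of full outage:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Minutes of total downtime an SLO permits over a rolling window.
# 99.95% over 28 days -> roughly 20 minutes of budget.
error_budget_minutes() {
  local slo_percent="$1" window_days="$2"
  awk -v slo="$slo_percent" -v days="$window_days" \
    'BEGIN { printf "%.1f\n", days * 24 * 60 * (100 - slo) / 100 }'
}

error_budget_minutes 99.95 28
```

Partial degradation spends the budget proportionally: serving 1% errors for an hour costs the same as about 36 seconds of full outage.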
Alerting guidance:
- Alert on user impact, not on every CPU spike.
- Prefer multi-window, multi-burn-rate alerts for SLOs.
- Use dashboards for investigation, alerts for action.
A simple Prometheus-style alert query conceptually looks like:
- “error rate over last 5m is above threshold”
- “latency p95 above threshold”
Even if your tooling differs, the principle is the same: alerts should be actionable and tied to SLOs.
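As a concrete (hypothetical) illustration, an "error rate above threshold" rule in Prometheus alerting-rule syntax might look like the following; the metric name, threshold, and windows should be tuned to your own SLO and burn-rate policy:

```yaml
groups:
  - name: slo
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 1% for 5 minutes"
```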
5. Automation: Repeatability at Scale
Automation is how you remove manual, error-prone steps. It’s also how you scale operations without scaling headcount linearly.
Targets for automation:
- environment provisioning
- deployments
- backups and restores
- incident response runbooks
- access requests (with approvals)
- routine maintenance (rotating keys, patching)
5.1 Makefiles and task runners
A Makefile is a simple, effective way to standardize local workflows.
Example Makefile:
SHELL := /bin/bash
.PHONY: test build run docker-build docker-run fmt

fmt:
	npm run fmt

test:
	npm test

build:
	npm run build

run:
	npm start

docker-build:
	docker build -t myapp:local .

docker-run:
	docker run --rm -p 8080:8080 myapp:local
Now developers can run:
make test
make docker-build
make docker-run
This reduces “works on my machine” problems by making the happy path consistent.
5.2 Shell scripting patterns for safe automation
Shell scripts are powerful but can be dangerous without guardrails.
Use strict mode:
set -euo pipefail
IFS=$'\n\t'
Add logging and validation:
#!/usr/bin/env bash
set -euo pipefail
log() { printf '%s %s\n' "$(date -u +%FT%TZ)" "$*"; }
: "${ENVIRONMENT:?ENVIRONMENT is required}"
: "${IMAGE_TAG:?IMAGE_TAG is required}"
log "Deploying ${IMAGE_TAG} to ${ENVIRONMENT}"
Dry-run patterns:
DRY_RUN="${DRY_RUN:-0}"
run() {
  if [[ "$DRY_RUN" == "1" ]]; then
    echo "[dry-run] $*"
  else
    # Execute the arguments directly; this avoids the quoting pitfalls of eval.
    "$@"
  fi
}
run echo "Deploy step here"
Idempotency matters: scripts should be safe to re-run after partial failure.
5.3 GitOps workflows
GitOps is an operational model where:
- Git is the source of truth for desired state
- changes are made via pull requests
- an agent reconciles actual state to match Git
Benefits:
- auditability (who changed what, when)
- rollback via git revert
- consistent deployments
Typical flow:
- CI builds and pushes an image, myapp:<sha>
- CI updates the deployment config repo to reference <sha>
- a GitOps controller applies the change to the cluster
- Observability confirms health
Even outside Kubernetes, the model applies: treat operational state as code, reconcile continuously.
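The "CI updates the config repo" step often amounts to rewriting an image tag and opening a PR. A minimal sketch, assuming a manifest referencing ghcr.io/yourorg/myapp (the file name, tag format, and SHA are hypothetical):

```shell
#!/usr/bin/env bash
set -euo pipefail

NEW_SHA="abc1234"

# Stand-in for a checked-out GitOps config repo file.
cat > deployment.yaml <<'EOF'
image: ghcr.io/yourorg/myapp:old0000
EOF

# Rewrite the tag after the image name; -i.bak works on both GNU and BSD sed.
sed -i.bak -E "s|(ghcr.io/yourorg/myapp:)[A-Za-z0-9._-]+|\1${NEW_SHA}|" deployment.yaml

cat deployment.yaml
```

In a real pipeline this would be followed by git commit, git push, and a PR; the GitOps controller takes over from there.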
6. Security Essentials: Supply Chain, Secrets, and Least Privilege
DevOps without security becomes “fast failure.” Modern DevOps integrates security into pipelines and daily workflows.
6.1 Secrets management
Rules:
- never commit secrets to Git
- never bake secrets into container images
- rotate secrets and limit blast radius
- use least privilege (scoped tokens, short-lived credentials)
Practical local check: scan for accidental secrets before pushing:
git diff --cached | grep -Ei "api_key|secret|password|token" || true
Better: use dedicated scanners (e.g., gitleaks):
brew install gitleaks
gitleaks detect --source . --no-git
At runtime, use:
- cloud secret managers (AWS Secrets Manager, GCP Secret Manager, Azure Key Vault)
- Vault
- Kubernetes secrets (preferably encrypted at rest and accessed via workload identity)
6.2 Container scanning and signing
Scan images for vulnerabilities:
brew install trivy
trivy image myapp:local
Sign images (conceptually) with Sigstore Cosign:
brew install cosign
cosign version
In real pipelines, you’d sign the pushed image and verify signatures during deployment admission.
7. A Practical End-to-End Example (Local)
This section ties together CI-like steps, containerization, and basic observability locally.
Step 1: Build and test
npm ci
npm test
npm run build
Step 2: Build a container image
docker build -t myapp:local .
docker run --rm -p 8080:8080 myapp:local
Step 3: Add a basic health check endpoint
If your app supports it, expose:
- GET /healthz (returns 200 if the process is alive)
- GET /readyz (returns 200 if dependencies are ready)
Then you can validate:
curl -i http://localhost:8080/healthz
curl -i http://localhost:8080/readyz
Step 4: Emit metrics (conceptually) and scrape them
If your app exposes /metrics in Prometheus format:
curl -s http://localhost:8080/metrics | head
Then configure Prometheus to scrape it (add a job in prometheus.yml) and query in Prometheus:
up{job="myapp"}
Step 5: Add request correlation in logs
Have your reverse proxy or app add a request ID header, then log it. Validate by making a request and checking logs:
curl -H "X-Request-Id: test-123" http://localhost:8080/
docker logs <container_id> | tail -n 50
This is the smallest “full loop” that resembles production: build → run → observe.
8. Curated Resource List
Below is a focused list of high-value resources by category.
CI/CD
- GitHub Actions documentation: https://docs.github.com/actions
- GitLab CI/CD documentation: https://docs.gitlab.com/ee/ci/
- Google SRE book (release engineering & reliability): https://sre.google/books/
Infrastructure as Code
- Terraform docs: https://developer.hashicorp.com/terraform/docs
- Terraform best practices (community): search “terraform module structure”, “remote state locking”
- Ansible docs: https://docs.ansible.com/
Observability
- Prometheus docs: https://prometheus.io/docs/
- Grafana docs: https://grafana.com/docs/
- OpenTelemetry docs: https://opentelemetry.io/docs/
- Jaeger docs: https://www.jaegertracing.io/docs/
Security / Supply Chain
- SLSA framework: https://slsa.dev/
- Sigstore/Cosign: https://docs.sigstore.dev/
- Trivy: https://aquasecurity.github.io/trivy/
- OWASP Top 10: https://owasp.org/www-project-top-ten/
Automation & Operations
- The Twelve-Factor App: https://12factor.net/
- Incident management basics (PagerDuty resources): https://www.pagerduty.com/resources/
Closing Notes
A mature DevOps practice is built from small, repeatable building blocks:
- a pipeline that enforces quality gates
- infrastructure defined and reviewed as code
- telemetry that makes failures obvious and diagnosable
- automation that eliminates manual, error-prone tasks