DevOps Resources: CI/CD, Infrastructure as Code, Observability & Automation
This tutorial is a practical, command-heavy guide to core DevOps capabilities: CI/CD, Infrastructure as Code (IaC), observability, and automation. It’s written to be used as a reference you can copy from while building real pipelines and systems.
Table of Contents
- 1. What “DevOps” Means in Practice
- 2. CI/CD: Build, Test, Package, Release
- 3. Infrastructure as Code (IaC)
- 4. Observability: Metrics, Logs, Traces, and SLOs
- 5. Automation: Repeatability at Scale
- 6. Security Essentials: Supply Chain, Secrets, and Least Privilege
- 7. A Practical End-to-End Example (Local)
- 8. Curated Resource List
1. What “DevOps” Means in Practice
DevOps is less a job title and more a set of operational outcomes:
- Fast, safe delivery (CI/CD)
- Repeatable infrastructure (IaC)
- High-quality signals about system health (observability)
- Eliminating toil through automation (scripts, runbooks, self-service)
A useful mental model is a feedback loop:
- Code changes are proposed (PR).
- CI runs tests, security checks, and builds artifacts.
- CD deploys to environments using consistent mechanisms.
- Observability detects regressions quickly.
- Automation accelerates response and prevents repeated manual work.
The goal is not “deploy more” at any cost; it’s “deploy more safely” with measurable reliability.
2. CI/CD: Build, Test, Package, Release
2.1 CI/CD design principles
A robust pipeline usually follows these principles:
- Reproducibility: builds are deterministic; dependencies are pinned.
- Fast feedback: run cheap checks early (lint, unit tests), slow checks later (integration tests).
- Artifact immutability: build once, promote the same artifact across environments.
- Policy as code: security and compliance checks are automated.
- Environment parity: dev/staging/prod are as similar as possible.
- Progressive delivery: release gradually, observe, then expand.
A common anti-pattern is “deploy from a developer machine.” Instead, the pipeline should be the only path to production.
2.2 A minimal CI pipeline (GitHub Actions)
CI systems themselves are typically configured in YAML. The following is a minimal GitHub Actions workflow that:
- checks out code
- sets up Node.js
- installs dependencies
- runs tests
- builds
Create .github/workflows/ci.yml:
name: ci

on:
  pull_request:
  push:
    branches: [ "main" ]

jobs:
  test-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: "npm"
      - name: Install
        run: npm ci
      - name: Test
        run: npm test -- --ci
      - name: Build
        run: npm run build
Key details:
- npm ci is preferred in CI because it installs strictly from package-lock.json.
- Caching speeds up runs but should not compromise correctness: cache only dependency downloads, not build outputs, unless you know what you're doing.
- The workflow triggers on PRs and on pushes to main.
To run the same steps locally (a best practice for developer experience):
npm ci
npm test -- --ci
npm run build
2.3 Build artifacts, versioning, and SBOM
Artifacts are outputs of CI that you can deploy: a container image, a zip, a binary, etc. A key DevOps rule:
Build once; deploy many times.
Versioning
A practical approach is Semantic Versioning plus build metadata:
- 1.4.0 (release)
- 1.4.0-rc.1 (release candidate)
- 1.4.0+git.<sha> (build metadata)
In CI, you can generate a version string:
git rev-parse --short HEAD
git describe --tags --always
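Combining the two, a version string with build metadata can be composed in shell. This is a sketch; compose_version is a hypothetical helper name, and in CI the SHA would come from git rev-parse rather than a literal:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Compose a SemVer string with git build metadata, e.g. 1.4.0+git.abc1234.
compose_version() {
  local base="$1" sha="$2"
  printf '%s+git.%s\n' "$base" "$sha"
}

# In CI you would use: compose_version "1.4.0" "$(git rev-parse --short HEAD)"
compose_version "1.4.0" "abc1234"
```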
SBOM (Software Bill of Materials)
An SBOM lists components included in your build. Many organizations require it for supply chain security.
Example using syft (works well for containers and directories):
# Install (macOS)
brew install syft
# Generate SBOM for a container image
syft your-image:tag -o spdx-json > sbom.spdx.json
# Or for a local directory
syft dir:. -o cyclonedx-json > sbom.cdx.json
You can store SBOMs as build artifacts and attach them to releases.
2.4 Container image build & push (Docker)
A typical pipeline builds an image and pushes it to a registry.
Build locally
docker build -t myapp:dev .
docker run --rm -p 8080:8080 myapp:dev
Tag with commit SHA
SHA="$(git rev-parse --short HEAD)"
docker tag myapp:dev "ghcr.io/yourorg/myapp:${SHA}"
Login and push (GitHub Container Registry example)
echo "$GITHUB_TOKEN" | docker login ghcr.io -u youruser --password-stdin
docker push "ghcr.io/yourorg/myapp:${SHA}"
Best practices:
- Use multi-stage builds to keep images small.
- Pin base images (e.g., node:20-alpine), but also keep them updated.
- Avoid baking secrets into images (use runtime secrets).
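A multi-stage build keeps toolchains and dev dependencies out of the runtime image. This is a sketch assuming a Node app whose build output lands in dist/ with a dist/server.js entrypoint (both paths are hypothetical):

```dockerfile
# Build stage: full dependency set, compile the app
FROM node:20-alpine AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Runtime stage: production dependencies and build output only
FROM node:20-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY --from=build /app/dist ./dist
USER node
CMD ["node", "dist/server.js"]
```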
2.5 Deployment strategies: rolling, blue/green, canary
How you deploy matters as much as what you deploy.
Rolling deployment
Replace instances gradually. Pros: simple; Cons: mixed versions during rollout.
Blue/green
Two environments (blue=live, green=next). Switch traffic after validation. Pros: fast rollback; Cons: higher cost.
Canary
Release to a small percentage of traffic, observe, then expand. Pros: safest at scale; Cons: requires routing/metrics maturity.
A canary mindset depends on observability: you must measure errors/latency and compare canary vs baseline.
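The comparison step can be sketched in plain shell: given measured error rates for canary and baseline, decide whether to promote. The 1.5x tolerance here is illustrative, not a standard threshold:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Return success (0) if the canary error rate is within tolerance of baseline.
# Inputs are plain fractions (errors / requests); awk does the float math.
canary_healthy() {
  local canary_rate="$1" baseline_rate="$2"
  awk -v c="$canary_rate" -v b="$baseline_rate" \
    'BEGIN { exit !(c <= b * 1.5 + 0.001) }'
}

if canary_healthy 0.012 0.010; then
  echo "promote"
else
  echo "rollback"
fi
```

In a real pipeline the two rates would come from your metrics backend (e.g., a PromQL query against the canary and baseline deployments).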
3. Infrastructure as Code (IaC)
IaC is about managing infrastructure with the same discipline as software:
- code review
- version control
- automated testing
- repeatable provisioning
Two broad categories:
- Provisioning (Terraform, Pulumi, CloudFormation): create cloud resources.
- Configuration management (Ansible, Chef, Puppet): configure OS and apps.
In modern setups, Kubernetes and managed services reduce the need for heavy configuration management, but it still matters for VMs, edge cases, and bootstrapping.
3.1 Terraform fundamentals
Terraform describes desired infrastructure in code and reconciles it via:
- terraform init (download providers, set up backend)
- terraform plan (show changes)
- terraform apply (make changes)
- terraform destroy (tear down)
Basic workflow:
terraform fmt -recursive
terraform validate
terraform init
terraform plan -out tfplan
terraform apply tfplan
Important concepts:
- Providers: plugins for AWS/GCP/Azure/Kubernetes/etc.
- Resources: actual infrastructure objects.
- Data sources: read existing infrastructure.
- State: Terraform’s record of what it manages.
State is critical: losing it can cause drift and accidental recreation.
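These building blocks can be seen without any cloud account. The following hypothetical sketch uses the hashicorp/local provider just to show the shape of provider, resource, and output blocks:

```hcl
terraform {
  required_providers {
    local = {
      source  = "hashicorp/local"
      version = "~> 2.0"
    }
  }
}

# A resource: Terraform will create and track this file in state.
resource "local_file" "motd" {
  filename = "${path.module}/motd.txt"
  content  = "managed by terraform\n"
}

# An output: a value exposed after apply.
output "motd_path" {
  value = local_file.motd.filename
}
```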
3.2 Remote state, locking, and environments
Why remote state?
Local state (terraform.tfstate on a laptop) is dangerous:
- not shared across team
- no locking (two applies can collide)
- harder to secure
Use a remote backend (S3 + DynamoDB locking on AWS, GCS on GCP, Terraform Cloud, etc.).
Even without showing provider-specific backend config, the operational commands look the same:
terraform init -reconfigure
terraform plan
terraform apply
Environments: dev/staging/prod
Avoid copy-pasting entire Terraform directories. Prefer:
- modules for reusable components
- separate workspaces or separate state backends per environment
- environment-specific variable files
Example usage:
terraform workspace new dev
terraform workspace select dev
terraform plan -var-file=env/dev.tfvars
terraform apply -var-file=env/dev.tfvars
Note: many teams prefer separate state per environment directory rather than workspaces, because it’s harder to accidentally apply to the wrong workspace when you’re tired.
3.3 Example: provisioning a VM (conceptual) + best practices
Terraform code varies by cloud, but the structure is consistent:
- network
- compute
- security rules
- outputs
Best practices you can apply everywhere:
- Small modules with clear inputs/outputs
- No secrets in state
- Use terraform plan in CI and require approval for apply
- Tag resources (owner, cost center, environment)
- Policy checks (OPA/Conftest, Sentinel, or cloud-native policies)
A common CI pattern:
terraform fmt -check -recursive
terraform validate
terraform plan -no-color -out tfplan
Then, in a protected environment step (manual approval):
terraform apply -no-color tfplan
3.4 Configuration management: Ansible basics
Ansible is useful for:
- configuring VMs
- installing packages
- templating config files
- running repeatable operational tasks
Install:
python3 -m pip install --user ansible
ansible --version
Inventory example (inventory.ini):
[web]
10.0.0.10
10.0.0.11
Ping hosts:
ansible -i inventory.ini web -m ping
Run a command:
ansible -i inventory.ini web -a "uname -a"
Run a playbook:
ansible-playbook -i inventory.ini site.yml
Operational best practices:
- Use idempotent tasks (safe to run repeatedly).
- Use roles for reusable configuration.
- Store secrets in Ansible Vault or an external secret manager.
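A minimal site.yml for the inventory above might look like the following sketch (the package and service names are illustrative; both tasks are idempotent, so re-running the playbook is safe):

```yaml
- name: Configure web hosts
  hosts: web
  become: true
  tasks:
    - name: Install nginx
      ansible.builtin.package:
        name: nginx
        state: present

    - name: Ensure nginx is running and enabled
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true
```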
4. Observability: Metrics, Logs, Traces, and SLOs
Observability answers: “What’s happening inside the system?”—not just “Is it up?”
Three pillars:
- Metrics: numeric time series (latency, error rate, CPU)
- Logs: event records (errors, requests, audits)
- Traces: per-request journey across services
A fourth pillar often included in practice:
- Profiling: CPU/memory hotspots (continuous profiling)
4.1 What to measure and why
Start with the Golden Signals (common SRE practice):
- Latency: how long requests take
- Traffic: request rate, throughput
- Errors: error rate, failed requests
- Saturation: resource utilization (CPU, memory, queue depth)
For APIs, also track:
- p50/p95/p99 latency (tail latency matters)
- HTTP status code counts
- dependency latency (DB, cache, external APIs)
A good metric is:
- actionable (you know what to do when it changes)
- stable (not too noisy)
- tied to user impact
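With a Prometheus-style setup, the API signals above translate into queries. This sketch assumes a hypothetical histogram http_request_duration_seconds and a counter http_requests_total with a code label:

```promql
# p95 latency over the last 5 minutes (requires histogram buckets)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# error ratio: 5xx responses divided by all responses
sum(rate(http_requests_total{code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
```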
4.2 Prometheus + Grafana quickstart (local)
You can run Prometheus and Grafana locally using Docker. This section uses real commands and focuses on the operational flow.
Start Grafana quickly
docker run -d --name grafana -p 3000:3000 grafana/grafana:latest
Open http://localhost:3000 (default login is admin / admin, then change it).
Run a node exporter (host metrics)
docker run -d --name node-exporter -p 9100:9100 prom/node-exporter:latest
curl -s http://localhost:9100/metrics | head
Run Prometheus
Prometheus needs a config file. Create prometheus.yml:
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "node"
    static_configs:
      - targets: ["host.docker.internal:9100"]

Note: host.docker.internal resolves on Docker Desktop (macOS/Windows). On Linux, add --add-host=host.docker.internal:host-gateway to the Prometheus docker run command below.
Run Prometheus:
docker run -d --name prometheus \
-p 9090:9090 \
-v "$PWD/prometheus.yml:/etc/prometheus/prometheus.yml:ro" \
prom/prometheus:latest
Open http://localhost:9090.
Try a query:
up
rate(node_cpu_seconds_total[5m])
What you just built:
- an exporter emits metrics at /metrics
- Prometheus scrapes and stores them
- Grafana visualizes them
This is the same pattern you’ll use in Kubernetes and production, just with service discovery and more robust storage.
4.3 Logging with structured JSON and correlation IDs
Logs become dramatically more useful when they are:
- structured (JSON)
- include context (service name, environment, request id)
- consistent across services
A simple example of emitting JSON logs from a shell script:
REQUEST_ID="$(uuidgen | tr '[:upper:]' '[:lower:]')"
printf '{"level":"info","msg":"request started","request_id":"%s","service":"payments","env":"dev"}\n' "$REQUEST_ID"
In application code, you typically:
- generate or propagate a request_id (or trace_id)
- include it in every log line
- include it in HTTP response headers for debugging
When logs are centralized (ELK/OpenSearch, Loki, Cloud Logging), you can search by request_id to reconstruct user journeys.
4.4 Distributed tracing with OpenTelemetry
Distributed tracing is essential once you have multiple services. OpenTelemetry (OTel) is the industry standard for instrumentation.
Concepts:
- Trace: the whole request
- Span: one operation (HTTP call, DB query)
- Context propagation: passing trace IDs between services
A practical approach:
- instrument services with OpenTelemetry SDK
- export traces to a collector
- send to a backend (Jaeger, Tempo, Honeycomb, etc.)
Run Jaeger locally:
docker run -d --name jaeger \
-p 16686:16686 \
-p 4317:4317 \
-p 4318:4318 \
jaegertracing/all-in-one:latest
- UI: http://localhost:16686
- OTLP gRPC: 4317
- OTLP HTTP: 4318
If your app exports OTLP to http://localhost:4318, you can view traces in Jaeger.
Why tracing matters operationally:
- find the slow dependency causing p95 latency spikes
- detect retry storms
- understand fan-out patterns (one request triggers 20 downstream calls)
4.5 SLOs, error budgets, and alerting
SLIs are measurements (e.g., “% of requests under 300ms”). SLOs are targets (e.g., “99.9% under 300ms over 30 days”). SLAs are contracts with users/customers.
Example SLI/SLO:
- SLI: successful requests / total requests
- SLO: 99.95% success over 28 days
Error budget:
- If SLO is 99.95%, allowed error is 0.05%
- Over 28 days, that’s the “budget” you can spend on incidents and risky changes
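The arithmetic is worth making concrete. This shell sketch (error_budget_minutes is a hypothetical helper) converts an SLO percentage and a window into the allowed minutes of full outage:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Minutes of total downtime an SLO permits over a rolling window.
# 99.95% over 28 days -> roughly 20 minutes of budget.
error_budget_minutes() {
  local slo_percent="$1" window_days="$2"
  awk -v slo="$slo_percent" -v days="$window_days" \
    'BEGIN { printf "%.1f\n", days * 24 * 60 * (100 - slo) / 100 }'
}

error_budget_minutes 99.95 28
```

Partial degradation spends the budget proportionally: serving 1% errors for an hour costs the same as about 36 seconds of full outage.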
Alerting guidance:
- Alert on user impact, not on every CPU spike.
- Prefer multi-window, multi-burn-rate alerts for SLOs.
- Use dashboards for investigation, alerts for action.
A simple Prometheus-style alert query conceptually looks like:
- “error rate over last 5m is above threshold”
- “latency p95 above threshold”
Even if your tooling differs, the principle is the same: alerts should be actionable and tied to SLOs.
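As a concrete (hypothetical) illustration, an "error rate above threshold" rule in Prometheus alerting-rule syntax might look like the following; the metric name, threshold, and windows should be tuned to your own SLO and burn-rate policy:

```yaml
groups:
  - name: slo
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error rate above 1% for 5 minutes"
```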
5. Automation: Repeatability at Scale
Automation is how you remove manual, error-prone steps. It’s also how you scale operations without scaling headcount linearly.
Targets for automation:
- environment provisioning
- deployments
- backups and restores
- incident response runbooks
- access requests (with approvals)
- routine maintenance (rotating keys, patching)
5.1 Makefiles and task runners
A Makefile is a simple, effective way to standardize local workflows.
Example Makefile:
SHELL := /bin/bash
.PHONY: test build run docker-build docker-run fmt

fmt:
	npm run fmt

test:
	npm test

build:
	npm run build

run:
	npm start

docker-build:
	docker build -t myapp:local .

docker-run:
	docker run --rm -p 8080:8080 myapp:local
Now developers can run:
make test
make docker-build
make docker-run
This reduces “works on my machine” problems by making the happy path consistent.
5.2 Shell scripting patterns for safe automation
Shell scripts are powerful but can be dangerous without guardrails.
Use strict mode:
set -euo pipefail
IFS=$'\n\t'
Add logging and validation:
#!/usr/bin/env bash
set -euo pipefail
log() { printf '%s %s\n' "$(date -u +%FT%TZ)" "$*"; }
: "${ENVIRONMENT:?ENVIRONMENT is required}"
: "${IMAGE_TAG:?IMAGE_TAG is required}"
log "Deploying ${IMAGE_TAG} to ${ENVIRONMENT}"
Dry-run patterns:
DRY_RUN="${DRY_RUN:-0}"
run() {
  if [[ "$DRY_RUN" == "1" ]]; then
    echo "[dry-run] $*"
  else
    # Execute the arguments directly; this avoids the quoting pitfalls of eval.
    "$@"
  fi
}
run echo "Deploy step here"
Idempotency matters: scripts should be safe to re-run after partial failure.
5.3 GitOps workflows
GitOps is an operational model where:
- Git is the source of truth for desired state
- changes are made via pull requests
- an agent reconciles actual state to match Git
Benefits:
- auditability (who changed what, when)
- rollback via git revert
- consistent deployments
Typical flow:
- CI builds and pushes an image, myapp:<sha>
- CI updates the deployment config repo to reference <sha>
- a GitOps controller applies the change to the cluster
- Observability confirms health
Even outside Kubernetes, the model applies: treat operational state as code, reconcile continuously.
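The "CI updates the config repo" step often amounts to rewriting an image tag and opening a PR. A minimal sketch, assuming a manifest referencing ghcr.io/yourorg/myapp (the file name, tag format, and SHA are hypothetical):

```shell
#!/usr/bin/env bash
set -euo pipefail

NEW_SHA="abc1234"

# Stand-in for a checked-out GitOps config repo file.
cat > deployment.yaml <<'EOF'
image: ghcr.io/yourorg/myapp:old0000
EOF

# Rewrite the tag after the image name; -i.bak works on both GNU and BSD sed.
sed -i.bak -E "s|(ghcr.io/yourorg/myapp:)[A-Za-z0-9._-]+|\1${NEW_SHA}|" deployment.yaml

cat deployment.yaml
```

In a real pipeline this would be followed by git commit, git push, and a PR; the GitOps controller takes over from there.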
6. Security Essentials: Supply Chain, Secrets, and Least Privilege
DevOps without security becomes “fast failure.” Modern DevOps integrates security into pipelines and daily workflows.
6.1 Secrets management
Rules:
- never commit secrets to Git
- never bake secrets into container images
- rotate secrets and limit blast radius
- use least privilege (scoped tokens, short-lived credentials)
Practical local check: scan for accidental secrets before pushing:
git diff --cached | grep -Ei "api_key|secret|password|token" || true
Better: use dedicated scanners (e.g., gitleaks):
brew install gitleaks
gitleaks detect --source . --no-git
At runtime, use:
- cloud secret managers (AWS Secrets Manager, GCP Secret Manager, Azure Key Vault)
- Vault
- Kubernetes secrets (preferably encrypted at rest and accessed via workload identity)
6.2 Container scanning and signing
Scan images for vulnerabilities:
brew install trivy
trivy image myapp:local
Sign images (conceptually) with Sigstore Cosign:
brew install cosign
cosign version
In real pipelines, you’d sign the pushed image and verify signatures during deployment admission.
7. A Practical End-to-End Example (Local)
This section ties together CI-like steps, containerization, and basic observability locally.
Step 1: Build and test
npm ci
npm test
npm run build
Step 2: Build a container image
docker build -t myapp:local .
docker run --rm -p 8080:8080 myapp:local
Step 3: Add a basic health check endpoint
If your app supports it, expose:
- GET /healthz (returns 200 if the process is alive)
- GET /readyz (returns 200 if dependencies are ready)
Then you can validate:
curl -i http://localhost:8080/healthz
curl -i http://localhost:8080/readyz
Step 4: Emit metrics (conceptually) and scrape them
If your app exposes /metrics in Prometheus format:
curl -s http://localhost:8080/metrics | head
Then configure Prometheus to scrape it (add a job in prometheus.yml) and query in Prometheus:
up{job="myapp"}
Step 5: Add request correlation in logs
Have your reverse proxy or app add a request ID header, then log it. Validate by making a request and checking logs:
curl -H "X-Request-Id: test-123" http://localhost:8080/
docker logs <container_id> | tail -n 50
This is the smallest “full loop” that resembles production: build → run → observe.
8. Curated Resource List
Below is a focused list of high-value resources by category.
CI/CD
- GitHub Actions documentation: https://docs.github.com/actions
- GitLab CI/CD documentation: https://docs.gitlab.com/ee/ci/
- Google SRE book (release engineering & reliability): https://sre.google/books/
Infrastructure as Code
- Terraform docs: https://developer.hashicorp.com/terraform/docs
- Terraform best practices (community): search “terraform module structure”, “remote state locking”
- Ansible docs: https://docs.ansible.com/
Observability
- Prometheus docs: https://prometheus.io/docs/
- Grafana docs: https://grafana.com/docs/
- OpenTelemetry docs: https://opentelemetry.io/docs/
- Jaeger docs: https://www.jaegertracing.io/docs/
Security / Supply Chain
- SLSA framework: https://slsa.dev/
- Sigstore/Cosign: https://docs.sigstore.dev/
- Trivy: https://aquasecurity.github.io/trivy/
- OWASP Top 10: https://owasp.org/www-project-top-ten/
Automation & Operations
- The Twelve-Factor App: https://12factor.net/
- Incident management basics (PagerDuty resources): https://www.pagerduty.com/resources/
Closing Notes
A mature DevOps practice is built from small, repeatable building blocks:
- a pipeline that enforces quality gates
- infrastructure defined and reviewed as code
- telemetry that makes failures obvious and diagnosable
- automation that eliminates manual, error-prone tasks