Docker in CI Pipelines: Fixing Common Build and Cache Failures
Docker builds that work perfectly on a developer laptop often fail (or become painfully slow) in CI. The reasons are rarely “Docker is flaky” and almost always come down to how CI runners handle caching, networking, permissions, and build isolation.
This tutorial focuses on common Docker build and cache failures in CI pipelines, how to diagnose them with real commands, and how to fix them reliably. Examples assume Linux-based CI runners (GitHub Actions, GitLab CI, Jenkins agents, Buildkite, etc.), but the principles apply broadly.
Table of Contents
- 1. Mental model: what’s different in CI?
- 2. Essential diagnostics you should always capture
- 3. BuildKit vs legacy builder: why it matters
- 4. “No space left on device” and disk pressure failures
- 5. Cache misses: why every build starts from scratch
- 6. Corrupted cache / “failed to compute cache key” / checksum errors
- 7. Registry auth and rate limits: pulls that fail only in CI
- 8. Network and DNS failures during apt, pip, npm, go, etc.
- 9. Permission problems: rootless, Docker socket, and file ownership
- 10. Multi-arch builds: QEMU, emulation, and cache pitfalls
- 11. Reproducible builds: pinning base images and dependencies
- 12. A practical “CI-proof” Dockerfile pattern
- 13. A repeatable troubleshooting checklist
1. Mental model: what’s different in CI?
On a laptop, Docker benefits from:
- A warm local image cache (base images and intermediate layers already present).
- Stable DNS and network.
- Persistent disk and build cache across builds.
- Consistent user permissions.
- No concurrent builds competing for disk and CPU.
In CI, you often have the opposite:
- Runners are ephemeral: every job starts with an empty cache.
- Multiple jobs share the same host: caches can be evicted or corrupted under pressure.
- Outbound network can be restricted or unstable.
- Docker daemon may run in a different context (Docker-in-Docker, remote daemon, rootless).
- You may build on ARM runners (or cross-build), changing cache behavior.
Your goal is to make builds:
- Deterministic (same inputs → same outputs),
- Cache-friendly (stable layer boundaries and explicit cache export/import),
- Resilient (retries, mirrors, pinned versions, and good diagnostics).
2. Essential diagnostics you should always capture
Before “fixing” anything, capture enough context to know what the runner is doing.
Run these commands early in the job (or when a failure occurs):
set -euxo pipefail
docker version
docker info
df -h
df -i
docker system df
docker builder ls || true
docker buildx version || true
docker buildx ls || true
If you suspect network/DNS issues:
cat /etc/resolv.conf || true
getent hosts registry-1.docker.io || true
getent hosts github.com || true
curl -I https://registry-1.docker.io/v2/ || true
curl -I https://google.com || true
If your CI uses Docker-in-Docker (DinD), also check:
ps aux | grep -i dockerd || true
docker context ls || true
docker context show || true
These outputs often immediately reveal the real cause: disk full, wrong daemon, no BuildKit, DNS misconfigured, etc.
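The commands above can be bundled into a single guarded step that writes everything to one artifact file. This is a sketch; the filename ci-docker-diag.log is an arbitrary choice, and each command is guarded so a missing tool or unreachable daemon does not fail the job.

```shell
# Collect diagnostics into one log file; "|| true" guards mean a missing
# tool or daemon never fails the CI job itself.
LOG=ci-docker-diag.log
{
  echo "== disk =="
  df -h || true
  df -i || true
  echo "== docker =="
  docker version || true
  docker info || true
  docker system df || true
  echo "== builders =="
  docker buildx version || true
  docker buildx ls || true
} > "$LOG" 2>&1
echo "diagnostics written to $LOG"
```

Upload the file as a CI artifact so failed jobs keep their diagnostics even after the runner is destroyed.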
3. BuildKit vs legacy builder: why it matters
Most modern CI setups should use BuildKit because it provides:
- Better caching and parallelism
- --mount=type=cache for package managers
- --secret mounts for credentials
- Cache export/import (inline, registry, local)
- More reliable multi-platform builds via buildx
Check if BuildKit is enabled:
docker buildx version
docker buildx ls
If docker build is using legacy mode, you might see older output formatting and missing features. You can force BuildKit:
export DOCKER_BUILDKIT=1
docker build --progress=plain -t myapp:ci .
For buildx:
docker buildx create --name ci-builder --use
docker buildx inspect --bootstrap
Key point: many cache-related “mysteries” are simply because the pipeline is using the legacy builder in one environment and BuildKit in another, producing different behavior and cache compatibility.
4. “No space left on device” and disk pressure failures
Symptoms
- no space left on device
- failed to register layer: write ... no space left on device
- Random failures extracting layers
- Build succeeds sometimes, fails under concurrency
Diagnose
Check disk usage and Docker storage:
df -h
docker system df
docker info | sed -n '/Storage Driver/,$p'
Look for:
- Very small root disks (common on hosted runners)
- Large build contexts being sent to Docker
- Many dangling images/layers
Fixes
A) Reduce build context size with .dockerignore
A huge context can fill the disk and also hurt caching, because the context changes frequently and must be re-sent to the daemon on every build.
Create .dockerignore:
.git
node_modules
dist
target
coverage
*.log
.env
.DS_Store
Measure the approximate context size (note: tar does not honor .dockerignore, so the real build context may be smaller):
tar -czf /tmp/context.tgz .
ls -lh /tmp/context.tgz
B) Prune safely (especially on shared runners)
On ephemeral runners, pruning is usually safe:
docker system prune -af --volumes
docker builder prune -af
On shared runners, pruning can break other jobs. Prefer runner-level isolation or dedicated build nodes.
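On shared hosts, one compromise is to prune only when free space actually drops below a threshold. A sketch; the 10 GiB threshold is an arbitrary example value:

```shell
# Prune only under disk pressure; the 10 GiB threshold is an example.
free_kb=$(df -Pk / | awk 'NR==2 {print $4}')
threshold_kb=$((10 * 1024 * 1024))
if [ "$free_kb" -lt "$threshold_kb" ]; then
  echo "low disk ($free_kb KiB free): pruning"
  docker system prune -af --volumes || true
  docker builder prune -af || true
else
  echo "disk ok ($free_kb KiB free): skipping prune"
fi
```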
C) Use multi-stage builds to keep images small
If your final image contains compilers, package caches, and build artifacts, it grows quickly.
Use a builder stage and a runtime stage:
FROM golang:1.22 AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /out/app ./cmd/app
FROM gcr.io/distroless/static-debian12
COPY --from=build /out/app /app
ENTRYPOINT ["/app"]
D) Avoid creating large layers
Combine cleanup into the same RUN layer:
RUN apt-get update \
&& apt-get install -y --no-install-recommends ca-certificates curl \
&& rm -rf /var/lib/apt/lists/*
If you do cleanup in a later layer, the earlier layer still contains the files and consumes space.
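You can verify where the space went with docker history, which prints the size of each layer; a bloated RUN layer shows its full size even if a later layer "cleaned up". The image name myapp:ci is a placeholder, and the block is guarded so it degrades gracefully when Docker or the image is absent:

```shell
# Show per-layer sizes; layers "cleaned up" later still appear at full size.
out=$(
  if command -v docker >/dev/null 2>&1; then
    docker history --format '{{.Size}}\t{{.CreatedBy}}' myapp:ci 2>/dev/null \
      || echo "image myapp:ci not present"
  else
    echo "docker not installed"
  fi
)
echo "$out"
```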
5. Cache misses: why every build starts from scratch
Symptoms
- CI builds are always slow
- Every run downloads base images and dependencies again
- CACHED rarely appears in build output
Root causes
- Ephemeral runners: no persistent local cache.
- Cache not exported/imported: BuildKit cache exists only locally unless you export it.
- Unstable Dockerfile layer boundaries: copying the whole repo too early invalidates dependency layers.
- Build args or timestamps change: ARG values or generated files change each run.
- Different builder instances: build cache is tied to the builder.
Fix: restructure Dockerfile for stable caching
Bad pattern:
COPY . .
RUN npm ci
RUN npm run build
Any change in the repo invalidates the npm ci layer.
Better pattern:
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build
Now dependency install is cached unless package*.json changes.
Fix: use BuildKit cache mounts for package managers
For example, apt cache:
RUN --mount=type=cache,target=/var/cache/apt \
--mount=type=cache,target=/var/lib/apt \
apt-get update && apt-get install -y --no-install-recommends curl
For npm:
RUN --mount=type=cache,target=/root/.npm npm ci
For pip:
RUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements.txt
These caches speed up rebuilds within the same builder cache. To persist across CI runs, you also need cache export/import.
Fix: export/import cache via registry (works well in CI)
With buildx, you can push cache to a registry:
docker buildx build \
--builder ci-builder \
--progress=plain \
--cache-from type=registry,ref=registry.example.com/myapp:buildcache \
--cache-to type=registry,ref=registry.example.com/myapp:buildcache,mode=max \
-t registry.example.com/myapp:ci \
--push \
.
Notes:
- mode=max stores more cache metadata (bigger but better reuse).
- You can also use type=inline to embed cache into the image manifest, but registry cache refs are often more flexible.
If you don’t want to push the image (only cache), you can still export cache:
docker buildx build \
--cache-to type=registry,ref=registry.example.com/myapp:buildcache,mode=max \
--output type=docker \
-t myapp:ci \
.
(Behavior depends on driver; often you’ll still need --push for remote builders.)
Fix: ensure the same builder is used
List builders:
docker buildx ls
If your CI creates a new builder each time without importing cache, you’ll never hit cache. Create and reuse a named builder within the job, and import cache from registry.
6. Corrupted cache / “failed to compute cache key” / checksum errors
Symptoms
- failed to compute cache key: failed to walk ... no such file or directory
- rpc error: code = Unknown desc = failed to solve ...
- unexpected EOF pulling layers
- checksum mismatch or content digest ... not found
Root causes
- Build context contains symlinks or paths that behave differently in CI.
- Concurrent builds share the same local cache directory (especially with Docker-in-Docker volumes).
- Files are generated during build in a way that conflicts with cache expectations.
- Registry cache ref is overwritten by multiple branches simultaneously.
Fixes
A) Use --progress=plain and isolate the failing step
docker buildx build --progress=plain -t myapp:debug .
The plain output shows exactly which step fails and whether it’s during context transfer, cache lookup, or layer extraction.
B) Avoid non-deterministic file generation in early layers
If you run commands that generate timestamps, random IDs, or embed build time, do it late:
# Bad: invalidates cache constantly
RUN echo "Built at $(date)" > /app/buildinfo.txt
# Better: pass build metadata via labels at the end
ARG VCS_REF
LABEL org.opencontainers.image.revision=$VCS_REF
C) Separate cache refs per branch (or use a “main only” cache)
If every branch writes to the same cache ref, they can thrash it.
Use a cache key strategy:
- myapp:buildcache-main for the main branch
- myapp:buildcache-pr-123 for PRs (optional)
- Or read from the main cache but write to a branch cache
Example pattern:
CACHE_FROM="type=registry,ref=registry.example.com/myapp:buildcache-main"
CACHE_TO="type=registry,ref=registry.example.com/myapp:buildcache-${CI_BRANCH},mode=max"
docker buildx build \
--cache-from "$CACHE_FROM" \
--cache-to "$CACHE_TO" \
-t registry.example.com/myapp:${CI_SHA} \
--push \
.
D) When in doubt, prune builder cache (ephemeral runners)
docker builder prune -af
If this “fixes” it, you likely have a shared cache corruption issue—move to isolated runners or registry-backed cache.
7. Registry auth and rate limits: pulls that fail only in CI
Symptoms
- toomanyrequests: You have reached your pull rate limit (Docker Hub)
- unauthorized: authentication required
- denied: requested access to the resource is denied
- Pulls succeed locally but fail in CI
Diagnose
Check what image is failing:
docker pull ubuntu:24.04
docker pull node:20
If it fails in CI but not locally, you’re likely authenticated locally but not in CI.
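A quick way to confirm that difference is to check whether the CI job has any registry credentials at all. Docker stores login state in config.json; a fresh CI runner usually has no such file. The block only reports the file's presence, never its contents:

```shell
# A missing config.json explains "pull works locally, fails in CI":
# the local machine is logged in, the runner is pulling anonymously.
cfg="${DOCKER_CONFIG:-$HOME/.docker}/config.json"
if [ -f "$cfg" ]; then
  echo "registry credentials file found: $cfg"
else
  echo "no credentials file at $cfg -- this job is pulling anonymously"
fi
```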
Fix: login explicitly in CI
Docker Hub:
echo "$DOCKERHUB_TOKEN" | docker login -u "$DOCKERHUB_USERNAME" --password-stdin
GHCR:
echo "$GHCR_TOKEN" | docker login ghcr.io -u "$GHCR_USERNAME" --password-stdin
Then pull:
docker pull ghcr.io/org/base:latest
Fix: mirror base images to your registry
If Docker Hub rate limits are a recurring problem, mirror critical bases:
docker pull ubuntu:24.04
docker tag ubuntu:24.04 registry.example.com/mirror/ubuntu:24.04
docker push registry.example.com/mirror/ubuntu:24.04
Then in Dockerfile:
FROM registry.example.com/mirror/ubuntu:24.04
Fix: pin by digest for reproducibility
Tags can move. Pinning avoids “works yesterday, fails today”:
docker pull ubuntu:24.04
docker inspect --format='{{index .RepoDigests 0}}' ubuntu:24.04
Use digest:
FROM ubuntu@sha256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
8. Network and DNS failures during apt, pip, npm, go, etc.
Symptoms
- Temporary failure resolving 'deb.debian.org'
- Could not resolve host: pypi.org
- npm ERR! network timeout
- TLS handshake timeout
Diagnose inside the build
Build steps run in an isolated environment. A host-level curl might succeed while build-time DNS fails.
Use a temporary debug container:
docker run --rm -it ubuntu:24.04 bash -lc \
"apt-get update && apt-get install -y curl dnsutils && \
cat /etc/resolv.conf && \
dig pypi.org +short && \
curl -I https://pypi.org"
If this fails, your runner network/DNS is the issue, not your Dockerfile.
Fixes
A) Configure Docker daemon DNS (runner-level)
If you control the daemon, set DNS servers (e.g., 1.1.1.1, 8.8.8.8) in /etc/docker/daemon.json:
{
"dns": ["1.1.1.1", "8.8.8.8"]
}
Restart Docker:
sudo systemctl restart docker
B) Add retries for flaky package installs
For apt:
RUN set -eux; \
apt-get update; \
apt-get install -y --no-install-recommends ca-certificates curl; \
rm -rf /var/lib/apt/lists/*
If you need stronger retries:
RUN set -eux; \
for i in 1 2 3; do apt-get update && break || sleep 5; done; \
for i in 1 2 3; do apt-get install -y --no-install-recommends curl && break || sleep 5; done; \
rm -rf /var/lib/apt/lists/*
For npm, prefer npm ci and consider:
npm config set fetch-retries 5
npm config set fetch-retry-mintimeout 20000
npm config set fetch-retry-maxtimeout 120000
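The same retry pattern can be factored into a small helper usable for any flaky network command. A sketch; the attempt count of 3 and 5-second default delay are arbitrary choices:

```shell
# retry CMD...: run CMD up to 3 times, sleeping RETRY_DELAY seconds
# (default 5) between attempts; returns the last exit code on failure.
retry() {
  attempt=1
  while true; do
    "$@" && return 0
    rc=$?
    [ "$attempt" -ge 3 ] && return "$rc"
    echo "attempt $attempt failed (exit $rc); retrying" >&2
    sleep "${RETRY_DELAY:-5}"
    attempt=$((attempt + 1))
  done
}

# Example usage inside a CI step:
# retry apt-get update
# retry npm ci
```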
C) Use internal mirrors/proxies
In enterprise CI, outbound access may be restricted. Use:
- apt mirror
- pip index URL
- npm registry proxy (Verdaccio, Artifactory, Nexus)
- Go proxy (GOPROXY)
Example for pip:
ARG PIP_INDEX_URL
ENV PIP_INDEX_URL=$PIP_INDEX_URL
RUN pip install -r requirements.txt
Then build with:
docker build --build-arg PIP_INDEX_URL=https://pypi.mycorp.local/simple -t myapp:ci .
9. Permission problems: rootless, Docker socket, and file ownership
Symptoms
- permission denied accessing /var/run/docker.sock
- Build fails when trying to write to mounted directories
- Files created in workspace become owned by root, breaking later CI steps
Diagnose
Check socket permissions:
ls -l /var/run/docker.sock || true
id
groups
Try:
docker ps
If you can’t access the daemon, your CI user isn’t in the docker group (or the daemon is remote).
Fix: add CI user to docker group (self-hosted runners)
sudo usermod -aG docker "$USER"
newgrp docker
docker ps
Fix: avoid root-owned artifacts in bind mounts
If you mount the workspace into a container and build artifacts inside, they may be owned by root.
Run container with the host UID/GID:
docker run --rm \
-u "$(id -u):$(id -g)" \
-v "$PWD:/work" -w /work \
node:20 bash -lc "npm ci && npm test"
Fix: prefer building inside Dockerfile (not in mounted workspace)
Instead of running build tools in a container with a bind mount, bake the build into the Dockerfile and export artifacts via multi-stage or docker cp. This avoids permission mismatches and makes caching more predictable.
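A sketch of that workflow: build the image, create a stopped container from it, and copy artifacts out with docker cp. The image name and the /app/dist path are examples, and the block only runs where a Docker daemon and a Dockerfile are present:

```shell
# Extract build artifacts without bind mounts, avoiding root-owned files
# appearing in the CI workspace.
if command -v docker >/dev/null 2>&1 && [ -f Dockerfile ]; then
  docker build -t myapp:artifacts .
  cid=$(docker create myapp:artifacts)
  docker cp "$cid":/app/dist ./dist
  docker rm -v "$cid"
  status="artifacts extracted to ./dist"
else
  status="skipped: no docker daemon or Dockerfile here"
fi
echo "$status"
```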
10. Multi-arch builds: QEMU, emulation, and cache pitfalls
Symptoms
- exec format error
- Builds are extremely slow in CI
- Cache doesn’t seem to apply between architectures
Diagnose
Check platform:
uname -m
docker version --format '{{.Server.Arch}}'
List buildx platforms:
docker buildx inspect --bootstrap
Fix: use buildx with explicit platforms
docker buildx build \
--platform linux/amd64,linux/arm64 \
-t registry.example.com/myapp:latest \
--push \
.
Fix: install QEMU (if cross-building on amd64)
On many CI systems you can run:
docker run --privileged --rm tonistiigi/binfmt --install all
Then:
docker buildx create --use --name multi
docker buildx inspect --bootstrap
Cache note for multi-arch
Cache is often platform-specific. A layer built for linux/amd64 is not reusable for linux/arm64. You can still share some metadata, but expect less reuse.
To avoid cache thrash, keep architecture-dependent steps later (e.g., compilation) and maximize earlier architecture-independent steps (e.g., downloading source, installing pure scripts).
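One common way to apply this is to run the architecture-independent steps on the build platform and cross-compile only at the end. A sketch for Go, assuming your toolchain supports cross-compilation; stage names and paths are illustrative:

```dockerfile
# syntax=docker/dockerfile:1.7

# Architecture-independent: runs on the build platform, so this layer
# cache is shared across all target architectures.
FROM --platform=$BUILDPLATFORM golang:1.22 AS deps
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download

# Architecture-dependent: cross-compile per target at the very end.
FROM deps AS build
ARG TARGETOS
ARG TARGETARCH
COPY . .
RUN CGO_ENABLED=0 GOOS=$TARGETOS GOARCH=$TARGETARCH go build -o /out/app ./cmd/app
```

TARGETOS and TARGETARCH are automatic platform build args provided by BuildKit when --platform is set.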
11. Reproducible builds: pinning base images and dependencies
CI failures often appear “random” because upstream dependencies changed.
Pin base image by digest
As shown earlier:
FROM node@sha256:...
Pin OS packages (when possible)
For Debian/Ubuntu, exact versions can be pinned:
RUN apt-get update && apt-get install -y --no-install-recommends \
curl=7.88.1-10+deb12u5 \
&& rm -rf /var/lib/apt/lists/*
Be careful: pinning too aggressively can break when mirrors rotate. A more common approach is:
- Pin the base image digest
- Keep OS packages minimal
- Use a stable mirror snapshot if you need strict reproducibility
Pin language dependencies
- Node: package-lock.json and npm ci
- Python: requirements.txt with hashes (pip-compile --generate-hashes)
- Go: go.sum, and GONOSUMDB/GOPROXY as needed
- Rust: Cargo.lock
These reduce CI flakiness and improve caching because dependency graphs change less frequently.
12. A practical “CI-proof” Dockerfile pattern
Below is a pattern that addresses many CI cache issues for a Node app. It uses:
- Stable layer ordering
- BuildKit cache mounts
- Minimal runtime image
- No leaking secrets into layers
Requires BuildKit (DOCKER_BUILDKIT=1) and preferably buildx.
# syntax=docker/dockerfile:1.7
FROM node:20-bookworm AS deps
WORKDIR /app
# Copy only dependency manifests first for caching
COPY package.json package-lock.json ./
# Cache npm downloads between builds (within BuildKit cache)
RUN --mount=type=cache,target=/root/.npm \
npm ci
FROM node:20-bookworm AS build
WORKDIR /app
# Reuse node_modules from deps stage
COPY --from=deps /app/node_modules ./node_modules
COPY . .
RUN npm run build
FROM node:20-bookworm-slim AS runtime
WORKDIR /app
ENV NODE_ENV=production
# Copy only what you need at runtime
COPY --from=build /app/dist ./dist
COPY --from=build /app/package.json ./package.json
# If you need production deps only:
COPY --from=deps /app/node_modules ./node_modules
EXPOSE 3000
CMD ["node", "dist/server.js"]
Build with registry-backed cache:
docker buildx create --name ci-builder --use || docker buildx use ci-builder
docker buildx inspect --bootstrap
docker buildx build \
--progress=plain \
--cache-from type=registry,ref=registry.example.com/myapp:buildcache \
--cache-to type=registry,ref=registry.example.com/myapp:buildcache,mode=max \
-t registry.example.com/myapp:${GIT_SHA} \
-t registry.example.com/myapp:latest \
--push \
.
If you want to verify cache hits, run the same command twice and look for CACHED steps in the output.
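To make that check scriptable, save the plain-progress output and count the CACHED steps. build.log is an assumed artifact from a prior buildx run piped through tee:

```shell
# Count cache hits in a saved build log; a second identical build should
# show a much higher count than the first.
if [ -f build.log ]; then
  hits=$(grep -c 'CACHED' build.log || true)
else
  hits=0
fi
echo "cached steps: $hits"
```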
13. A repeatable troubleshooting checklist
When a Docker build fails in CI, walk through this list in order.
A) Confirm environment and builder
docker version
docker info
docker buildx ls
docker buildx inspect --bootstrap
- Are you using BuildKit?
- Are you building on the platform you think you are?
B) Check disk and cache pressure
df -h
docker system df
If disk is tight, prune (ephemeral runners):
docker system prune -af --volumes
docker builder prune -af
C) Inspect build context size and .dockerignore
ls -la
test -f .dockerignore && cat .dockerignore
If .dockerignore is missing, add one.
D) Re-run with plain progress to pinpoint the failing step
docker buildx build --progress=plain -t myapp:debug .
E) Validate registry auth and rate limits
docker login registry.example.com
docker pull your-base-image:tag
F) Validate network inside containers
docker run --rm alpine:3.20 sh -lc "apk add --no-cache curl && curl -I https://example.com"
If this fails, fix runner DNS/proxy settings.
G) Fix caching explicitly (don’t rely on “local cache” in CI)
Use registry cache:
docker buildx build \
--cache-from type=registry,ref=registry.example.com/myapp:buildcache \
--cache-to type=registry,ref=registry.example.com/myapp:buildcache,mode=max \
-t registry.example.com/myapp:ci \
--push \
.
H) If cache is unstable, isolate or partition it
- Separate cache refs per branch
- Or only allow main to write the cache and have other branches read from it
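That second policy can be expressed as a small wrapper: every branch reads the main cache, and only main writes it back. CI_BRANCH is an assumed CI-provided variable and the registry ref is an example; the final build command is shown commented out:

```shell
# Read from the main cache everywhere; write it back only on main.
CI_BRANCH="${CI_BRANCH:-feature-x}"
CACHE_ARGS="--cache-from type=registry,ref=registry.example.com/myapp:buildcache-main"
if [ "$CI_BRANCH" = "main" ]; then
  CACHE_ARGS="$CACHE_ARGS --cache-to type=registry,ref=registry.example.com/myapp:buildcache-main,mode=max"
fi
echo "cache args: $CACHE_ARGS"
# docker buildx build $CACHE_ARGS -t registry.example.com/myapp:ci --push .
```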
Closing notes
Most CI Docker build failures become straightforward once you treat caching and environment as explicit inputs rather than “whatever Docker happens to have locally.” The most reliable pattern is:
- Use BuildKit (buildx) consistently.
- Keep Dockerfile layers stable and cache-friendly.
- Export/import cache via registry.
- Reduce context size with .dockerignore.
- Pin base images and dependency graphs.
- Collect diagnostics (disk, network, builder) whenever something fails.
If you share your Dockerfile and the exact CI error output (with --progress=plain), you can usually map the failure to one of the categories above in minutes.