
Docker in CI Pipelines: Fixing Common Build and Cache Failures

Tags: docker · ci-cd · devops · buildkit · docker-build · caching · container-registry · github-actions · gitlab-ci · jenkins


Docker builds that work perfectly on a developer laptop often fail (or become painfully slow) in CI. The reasons are rarely “Docker is flaky” and almost always come down to how CI runners handle caching, networking, permissions, and build isolation.

This tutorial focuses on common Docker build and cache failures in CI pipelines, how to diagnose them with real commands, and how to fix them reliably. Examples assume Linux-based CI runners (GitHub Actions, GitLab CI, Jenkins agents, Buildkite, etc.), but the principles apply broadly.



1. Mental model: what’s different in CI?

On a laptop, Docker benefits from:

  - a long-lived daemon with a warm local layer cache,
  - a single builder reused across builds,
  - stable DNS and unrestricted network access, and
  - a user with full permissions on the Docker socket.

In CI, you often have the opposite:

  - ephemeral runners that start with an empty cache,
  - a fresh builder (or a different daemon) on every job,
  - proxies, rate limits, and restricted egress, and
  - limited disk space and locked-down permissions.

Your goal is to make builds:

  1. Deterministic (same inputs → same outputs),
  2. Cache-friendly (stable layer boundaries and explicit cache export/import),
  3. Resilient (retries, mirrors, pinned versions, and good diagnostics).

2. Essential diagnostics you should always capture

Before “fixing” anything, capture enough context to know what the runner is doing.

Run these commands early in the job (or when a failure occurs):

set -euxo pipefail

docker version
docker info

df -h
df -i

docker system df
docker builder ls || true
docker buildx version || true
docker buildx ls || true

If you suspect network/DNS issues:

cat /etc/resolv.conf || true
getent hosts registry-1.docker.io || true
getent hosts github.com || true

curl -I https://registry-1.docker.io/v2/ || true
curl -I https://google.com || true

If your CI uses Docker-in-Docker (DinD), also check:

ps aux | grep -i dockerd || true
docker context ls || true
docker context show || true

These outputs often immediately reveal the real cause: disk full, wrong daemon, no BuildKit, DNS misconfigured, etc.
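On most CI systems it is worth wrapping these commands in a small script that writes everything to one file you upload as a job artifact. A minimal sketch (the filename and section labels are arbitrary; each command is allowed to fail so partial output is still captured):

```shell
#!/bin/sh
# Collect runner/daemon diagnostics into one file for upload as a CI artifact.
# Every command tolerates failure so the script still produces partial output.
out="ci-diagnostics.txt"
{
  echo "== disk ==";        df -h 2>&1 || true
  echo "== inodes ==";      df -i 2>&1 || true
  echo "== docker info =="; docker info 2>&1 || true
  echo "== builders ==";    docker buildx ls 2>&1 || true
} > "$out"
echo "wrote $out"
```

Upload ci-diagnostics.txt with your CI system's artifact mechanism so it survives the failed job.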


3. BuildKit vs legacy builder: why it matters

Most modern CI setups should use BuildKit because it provides:

  - parallel execution of independent build steps,
  - cache mounts (RUN --mount=type=cache) plus secret and SSH mounts,
  - explicit cache export/import (--cache-from / --cache-to), and
  - multi-platform builds via buildx.

Check if BuildKit is enabled:

docker buildx version
docker buildx ls

If docker build is using legacy mode, you might see older output formatting and missing features. You can force BuildKit:

export DOCKER_BUILDKIT=1
docker build --progress=plain -t myapp:ci .

For buildx:

docker buildx create --name ci-builder --use
docker buildx inspect --bootstrap

Key point: many cache-related “mysteries” are simply because the pipeline is using the legacy builder in one environment and BuildKit in another, producing different behavior and cache compatibility.
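For example, in GitHub Actions the official docker/setup-buildx-action and docker/build-push-action handle builder creation and caching for you. A sketch (the image tag and the choice of the GitHub Actions cache backend are illustrative; registry cache works here too):

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Creates and selects a BuildKit (docker-container) builder
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v6
        with:
          context: .
          tags: myapp:ci
          # GitHub-hosted BuildKit cache backend
          cache-from: type=gha
          cache-to: type=gha,mode=max
```

This guarantees every job uses the same builder type, which avoids the legacy-vs-BuildKit mismatch described above.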


4. “No space left on device” and disk pressure failures

Symptoms

  - Builds or pulls fail with “no space left on device”.
  - Layer extraction fails partway through a pull.
  - Jobs fail only after several builds have run on the same runner.

Diagnose

Check disk usage and Docker storage:

df -h
docker system df
docker info | sed -n '/Storage Driver/,$p'

Look for:

  - a nearly full filesystem in df -h (especially wherever /var/lib/docker lives),
  - large “Build Cache” or “Images” totals in docker system df,
  - exhausted inodes in df -i.

Fixes

A) Reduce build context size with .dockerignore

A huge context can fill the disk and also hurts caching, because a frequently changing context invalidates layers early.

Create .dockerignore:

.git
node_modules
dist
target
coverage
*.log
.env
.DS_Store

Approximate the context size (note that tar does not honor .dockerignore, so this is an upper bound):

tar -czf /tmp/context.tgz .
ls -lh /tmp/context.tgz

B) Prune safely (especially on shared runners)

On ephemeral runners, pruning is usually safe:

docker system prune -af --volumes
docker builder prune -af

On shared runners, pruning can break other jobs. Prefer runner-level isolation or dedicated build nodes.

C) Use multi-stage builds to keep images small

If your final image contains compilers, package caches, and build artifacts, it grows quickly.

Use a builder stage and a runtime stage:

FROM golang:1.22 AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /out/app ./cmd/app

FROM gcr.io/distroless/static-debian12
COPY --from=build /out/app /app
ENTRYPOINT ["/app"]

D) Avoid creating large layers

Combine cleanup into the same RUN layer:

RUN apt-get update \
 && apt-get install -y --no-install-recommends ca-certificates curl \
 && rm -rf /var/lib/apt/lists/*

If you do cleanup in a later layer, the earlier layer still contains the files and consumes space.


5. Cache misses: why every build starts from scratch

Symptoms

  - Every build runs every step; no CACHED lines appear in the output.
  - Dependency installs (npm ci, pip install, go mod download) repeat on every run even when lockfiles haven’t changed.
  - Builds are fast locally but consistently slow in CI.

Root causes

  1. Ephemeral runners: no persistent local cache.
  2. Cache not exported/imported: BuildKit cache exists only locally unless you export it.
  3. Unstable Dockerfile layer boundaries: copying the whole repo too early invalidates dependency layers.
  4. Build args or timestamps change: ARG values or generated files change each run.
  5. Different builder instances: build cache is tied to the builder.

Fix: restructure Dockerfile for stable caching

Bad pattern:

COPY . .
RUN npm ci
RUN npm run build

Any change in the repo invalidates the npm ci layer.

Better pattern:

WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build

Now dependency install is cached unless package*.json changes.

Fix: use BuildKit cache mounts for package managers

For example, apt cache:

RUN --mount=type=cache,target=/var/cache/apt \
    --mount=type=cache,target=/var/lib/apt \
    apt-get update && apt-get install -y --no-install-recommends curl

For npm:

RUN --mount=type=cache,target=/root/.npm npm ci

For pip:

RUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements.txt

These cache mounts speed up rebuilds on the same builder. Note that --cache-to exports layer cache, not the contents of cache mounts, so on ephemeral runners you still need stable layer boundaries plus cache export/import to benefit across runs.

Fix: export/import cache via registry (works well in CI)

With buildx, you can push cache to a registry:

docker buildx build \
  --builder ci-builder \
  --progress=plain \
  --cache-from type=registry,ref=registry.example.com/myapp:buildcache \
  --cache-to   type=registry,ref=registry.example.com/myapp:buildcache,mode=max \
  -t registry.example.com/myapp:ci \
  --push \
  .

Notes:

  - mode=max exports cache for all intermediate layers, not just the layers of the final image (mode=min).
  - The cache ref is an ordinary artifact in your registry; the job needs push access to it.
  - The first run is a full cache miss; subsequent runs pull only the matching cache layers.

If you don’t want to push the image (only cache), you can still export cache:

docker buildx build \
  --cache-to type=registry,ref=registry.example.com/myapp:buildcache,mode=max \
  --output type=docker \
  -t myapp:ci \
  .

(Behavior depends on the driver: with the docker-container driver, --output type=docker loads the image into the local daemon while the cache export still goes to the registry.)

Fix: ensure the same builder is used

List builders:

docker buildx ls

If your CI creates a new builder each time without importing cache, you’ll never hit cache. Create and reuse a named builder within the job, and import cache from registry.


6. Corrupted cache / “failed to compute cache key” / checksum errors

Symptoms

  - “failed to compute cache key” errors during build,
  - “failed to calculate checksum” or layer digest mismatch errors,
  - builds that fail intermittently and succeed after pruning.

Root causes

  - partially written or corrupted cache (e.g. interrupted jobs on a shared builder),
  - non-deterministic files in the build context or early layers,
  - concurrent jobs writing to the same cache ref,
  - mixing legacy-builder and BuildKit caches across environments.

Fixes

A) Use --progress=plain and isolate the failing step

docker buildx build --progress=plain -t myapp:debug .

The plain output shows exactly which step fails and whether it’s during context transfer, cache lookup, or layer extraction.

B) Avoid non-deterministic file generation in early layers

If you run commands that generate timestamps, random IDs, or embed build time, do it late:

# Bad: invalidates cache constantly
RUN echo "Built at $(date)" > /app/buildinfo.txt

# Better: pass build metadata via labels at the end
ARG VCS_REF
LABEL org.opencontainers.image.revision=$VCS_REF

C) Separate cache refs per branch (or use a “main only” cache)

If every branch writes to the same cache ref, they can thrash it.

Use a cache key strategy:

  - read (--cache-from) from a long-lived ref, such as the main-branch cache,
  - write (--cache-to) to a branch-specific ref, and
  - optionally update the shared ref only from main-branch builds.

Example pattern:

CACHE_FROM="type=registry,ref=registry.example.com/myapp:buildcache-main"
CACHE_TO="type=registry,ref=registry.example.com/myapp:buildcache-${CI_BRANCH},mode=max"

docker buildx build \
  --cache-from "$CACHE_FROM" \
  --cache-to "$CACHE_TO" \
  -t registry.example.com/myapp:${CI_SHA} \
  --push \
  .
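One practical wrinkle: branch names can contain characters that are not valid in image tags (most commonly “/”). A small helper sketch to derive a tag-safe suffix from ${CI_BRANCH} (the sanitize_tag name is ours; it does not handle every edge case, e.g. a leading “.” or “-”):

```shell
#!/bin/sh
# Turn an arbitrary branch name into a tag-safe string:
# lowercase, any char outside [a-z0-9._-] replaced with '-', max 128 chars.
sanitize_tag() {
  printf '%s' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed 's/[^a-z0-9._-]/-/g' \
    | cut -c1-128
}

sanitize_tag "feature/My-Branch"; echo   # -> feature-my-branch
```

You would then build the cache ref as, e.g., buildcache-$(sanitize_tag "$CI_BRANCH").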

D) When in doubt, prune builder cache (ephemeral runners)

docker builder prune -af

If this “fixes” it, you likely have a shared cache corruption issue; move to isolated runners or registry-backed cache.


7. Registry auth and rate limits: pulls that fail only in CI

Symptoms

  - pulls fail with “unauthorized: authentication required” or “pull access denied”,
  - Docker Hub returns 429 “toomanyrequests” rate-limit errors,
  - the same pull works on a developer laptop but fails in CI.

Diagnose

Check what image is failing:

docker pull ubuntu:24.04
docker pull node:20

If it fails in CI but not locally, you’re likely authenticated locally but not in CI.

Fix: login explicitly in CI

Docker Hub:

echo "$DOCKERHUB_TOKEN" | docker login -u "$DOCKERHUB_USERNAME" --password-stdin

GHCR:

echo "$GHCR_TOKEN" | docker login ghcr.io -u "$GHCR_USERNAME" --password-stdin

Then pull:

docker pull ghcr.io/org/base:latest

Fix: mirror base images to your registry

If Docker Hub rate limits are a recurring problem, mirror critical bases:

docker pull ubuntu:24.04
docker tag ubuntu:24.04 registry.example.com/mirror/ubuntu:24.04
docker push registry.example.com/mirror/ubuntu:24.04

Then in Dockerfile:

FROM registry.example.com/mirror/ubuntu:24.04

Fix: pin by digest for reproducibility

Tags can move. Pinning avoids “works yesterday, fails today”:

docker pull ubuntu:24.04
docker inspect --format='{{index .RepoDigests 0}}' ubuntu:24.04

Use digest:

FROM ubuntu@sha256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
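Once you have a digest, the FROM line can be rewritten mechanically. A sketch (pin_from is our own helper name; the image name is interpolated into a sed pattern, so plain names work but unusual names may need escaping):

```shell
#!/bin/sh
# pin_from FILE IMAGE DIGEST: rewrite "FROM IMAGE[:tag]" lines to "FROM IMAGE@DIGEST".
pin_from() {
  file="$1"; image="$2"; digest="$3"
  sed -E -i.bak "s|^FROM ${image}(:[^ ]+)?|FROM ${image}@${digest}|" "$file"
}

# Demo on a throwaway Dockerfile (digest shortened for illustration):
printf 'FROM ubuntu:24.04\nRUN echo hi\n' > Dockerfile.demo
pin_from Dockerfile.demo ubuntu sha256:0000
head -n1 Dockerfile.demo   # -> FROM ubuntu@sha256:0000
```

The -i.bak keeps a backup of the original file, which is handy when the script runs as a CI step.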

8. Network and DNS failures during apt, pip, npm, go, etc.

Symptoms

  - apt-get update times out or reports “Temporary failure resolving …”,
  - pip/npm/go fetches fail with timeouts or TLS errors,
  - downloads fail only inside docker build while the host network works.

Diagnose inside the build

Build steps run in an isolated environment. A host-level curl might succeed while build-time DNS fails.

Use a temporary debug container:

docker run --rm -it ubuntu:24.04 bash -lc \
  "apt-get update && apt-get install -y curl dnsutils && \
   cat /etc/resolv.conf && \
   dig pypi.org +short && \
   curl -I https://pypi.org"

If this fails, your runner network/DNS is the issue, not your Dockerfile.

Fixes

A) Configure Docker daemon DNS (runner-level)

If you control the daemon, set DNS servers (e.g., 1.1.1.1, 8.8.8.8) in /etc/docker/daemon.json:

{
  "dns": ["1.1.1.1", "8.8.8.8"]
}

Restart Docker:

sudo systemctl restart docker

B) Add retries for flaky package installs

For apt:

RUN set -eux; \
    apt-get update; \
    apt-get install -y --no-install-recommends ca-certificates curl; \
    rm -rf /var/lib/apt/lists/*

If you need stronger retries:

RUN set -eux; \
    for i in 1 2 3; do apt-get update && break || sleep 5; done; \
    for i in 1 2 3; do apt-get install -y --no-install-recommends curl && break || sleep 5; done; \
    rm -rf /var/lib/apt/lists/*

For npm, prefer npm ci and consider:

npm config set fetch-retries 5
npm config set fetch-retry-mintimeout 20000
npm config set fetch-retry-maxtimeout 120000
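The same retry idea can be wrapped in a small shell function for any flaky command in a CI step. A sketch (the retry helper name and the fixed 2-second pause are our own choices):

```shell
#!/bin/sh
# retry N CMD [ARGS...]: run CMD up to N times with a short pause in between.
retry() {
  max="$1"; shift
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "retry: giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep 2
  done
}

# Usage in a CI step, e.g.:
#   retry 3 apt-get update
retry 3 true && echo "ok"
```

Keep the attempt count small; if a registry or mirror is down, failing fast with a clear log beats a 30-minute hang.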

C) Use internal mirrors/proxies

In enterprise CI, outbound access may be restricted. Use:

  - an internal package mirror (e.g. Artifactory, Nexus, devpi, Verdaccio),
  - an HTTP(S) proxy passed in via build args or daemon configuration, and
  - a pull-through registry cache for base images.

Example for pip:

ARG PIP_INDEX_URL
ENV PIP_INDEX_URL=$PIP_INDEX_URL
RUN pip install -r requirements.txt

Then build with:

docker build --build-arg PIP_INDEX_URL=https://pypi.mycorp.local/simple -t myapp:ci .

9. Permission problems: rootless, Docker socket, and file ownership

Symptoms

  - “permission denied while trying to connect to the Docker daemon socket”,
  - root-owned files left in the workspace after a containerized build step,
  - failures that only occur on rootless Docker or locked-down runners.

Diagnose

Check socket permissions:

ls -l /var/run/docker.sock || true
id
groups

Try:

docker ps

If you can’t access the daemon, your CI user isn’t in the docker group (or the daemon is remote).

Fix: add CI user to docker group (self-hosted runners)

sudo usermod -aG docker "$USER"
newgrp docker
docker ps

Fix: avoid root-owned artifacts in bind mounts

If you mount the workspace into a container and build artifacts inside, they may be owned by root.

Run container with the host UID/GID:

docker run --rm \
  -u "$(id -u):$(id -g)" \
  -v "$PWD:/work" -w /work \
  node:20 bash -lc "npm ci && npm test"

Fix: prefer building inside Dockerfile (not in mounted workspace)

Instead of running build tools in a container with a bind mount, bake the build into the Dockerfile and export artifacts via multi-stage or docker cp. This avoids permission mismatches and makes caching more predictable.


10. Multi-arch builds: QEMU, emulation, and cache pitfalls

Symptoms

  - “exec format error” when running an image built for the wrong architecture,
  - very slow builds under QEMU emulation,
  - cache hits on one architecture but full rebuilds on another.

Diagnose

Check platform:

uname -m
docker version --format '{{.Server.Arch}}'

List buildx platforms:

docker buildx inspect --bootstrap

Fix: use buildx with explicit platforms

docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -t registry.example.com/myapp:latest \
  --push \
  .

Fix: install QEMU (if cross-building on amd64)

On many CI systems you can run:

docker run --privileged --rm tonistiigi/binfmt --install all

Then:

docker buildx create --use --name multi
docker buildx inspect --bootstrap

Cache note for multi-arch

Cache is often platform-specific. A layer built for linux/amd64 is not reusable for linux/arm64. You can still share some metadata, but expect less reuse.

To avoid cache thrash, keep architecture-dependent steps later (e.g., compilation) and maximize earlier architecture-independent steps (e.g., downloading source, installing pure scripts).
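One effective pattern for compiled languages is to run the build stage natively and only target the other architecture at compile time, avoiding QEMU entirely for the expensive steps. BuildKit provides the BUILDPLATFORM, TARGETOS, and TARGETARCH build args automatically. A sketch for a Go app (paths like ./cmd/app mirror the earlier example and are illustrative):

```dockerfile
# Build stage runs on the native build platform: no emulation for compile steps
FROM --platform=$BUILDPLATFORM golang:1.22 AS build
ARG TARGETOS TARGETARCH
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
# Cross-compile for the requested target platform
RUN CGO_ENABLED=0 GOOS=$TARGETOS GOARCH=$TARGETARCH go build -o /out/app ./cmd/app

FROM gcr.io/distroless/static-debian12
COPY --from=build /out/app /app
ENTRYPOINT ["/app"]
```

With this layout, `docker buildx build --platform linux/amd64,linux/arm64` runs the compiler natively for each target instead of under emulation.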


11. Reproducible builds: pinning base images and dependencies

CI failures often appear “random” because upstream dependencies changed.

Pin base image by digest

As shown earlier:

FROM node@sha256:...

Pin OS packages (when possible)

For Debian/Ubuntu, exact versions can be pinned:

RUN apt-get update && apt-get install -y --no-install-recommends \
    curl=7.88.1-10+deb12u5 \
 && rm -rf /var/lib/apt/lists/*

Be careful: pinning exact package versions can break later when mirrors rotate old versions out. A more common approach is:

  - pin the base image by digest and tolerate package drift within it, or
  - bake required OS packages into a versioned internal base image you control.

Pin language dependencies

Use lockfiles and exact installs:

  - Node: package-lock.json with npm ci
  - Python: hash-pinned requirements.txt (pip-compile) or poetry.lock
  - Go: go.mod and go.sum
  - Rust: Cargo.lock

These reduce CI flakiness and improve caching because dependency graphs change less frequently.


12. A practical “CI-proof” Dockerfile pattern

Below is a pattern that addresses many CI cache issues for a Node app. It uses:

  - multi-stage builds (deps, build, runtime),
  - dependency manifests copied before the rest of the source,
  - a BuildKit cache mount for npm downloads, and
  - a slim runtime image containing only what the app needs.

Requires BuildKit (DOCKER_BUILDKIT=1) and preferably buildx.

# syntax=docker/dockerfile:1.7

FROM node:20-bookworm AS deps
WORKDIR /app

# Copy only dependency manifests first for caching
COPY package.json package-lock.json ./

# Cache npm downloads between builds (within BuildKit cache)
RUN --mount=type=cache,target=/root/.npm \
    npm ci

FROM node:20-bookworm AS build
WORKDIR /app

# Reuse node_modules from deps stage
COPY --from=deps /app/node_modules ./node_modules
COPY . .

RUN npm run build

FROM node:20-bookworm-slim AS runtime
WORKDIR /app
ENV NODE_ENV=production

# Copy only what you need at runtime
COPY --from=build /app/dist ./dist
COPY --from=build /app/package.json ./package.json

# If you need production deps only:
COPY --from=deps /app/node_modules ./node_modules

EXPOSE 3000
CMD ["node", "dist/server.js"]

Build with registry-backed cache:

docker buildx create --name ci-builder --use || docker buildx use ci-builder
docker buildx inspect --bootstrap

docker buildx build \
  --progress=plain \
  --cache-from type=registry,ref=registry.example.com/myapp:buildcache \
  --cache-to type=registry,ref=registry.example.com/myapp:buildcache,mode=max \
  -t registry.example.com/myapp:${GIT_SHA} \
  -t registry.example.com/myapp:latest \
  --push \
  .

If you want to verify cache hits, run the same command twice and look for CACHED steps in the output.
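As a concrete wiring example, a GitLab CI job using Docker-in-Docker might look like this (a sketch: it relies on GitLab's predefined CI_REGISTRY* and CI_COMMIT_SHA variables, and the docker:27 image tags are illustrative):

```yaml
build:
  image: docker:27
  services:
    - docker:27-dind
  variables:
    DOCKER_TLS_CERTDIR: "/certs"
  script:
    - echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin "$CI_REGISTRY"
    - docker buildx create --name ci-builder --use || docker buildx use ci-builder
    - >
      docker buildx build
      --cache-from type=registry,ref=$CI_REGISTRY_IMAGE:buildcache
      --cache-to type=registry,ref=$CI_REGISTRY_IMAGE:buildcache,mode=max
      -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
      --push .
```

Because the DinD daemon is destroyed with the job, the registry-backed cache is what carries layer reuse from one pipeline to the next.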


13. A repeatable troubleshooting checklist

When a Docker build fails in CI, walk through this list in order.

A) Confirm environment and builder

docker version
docker info
docker buildx ls
docker buildx inspect --bootstrap

B) Check disk and cache pressure

df -h
docker system df

If disk is tight, prune (ephemeral runners):

docker system prune -af --volumes
docker builder prune -af

C) Inspect build context size and .dockerignore

ls -la
test -f .dockerignore && cat .dockerignore

If .dockerignore is missing, add one.

D) Re-run with plain progress to pinpoint the failing step

docker buildx build --progress=plain -t myapp:debug .

E) Validate registry auth and rate limits

docker login registry.example.com
docker pull your-base-image:tag

F) Validate network inside containers

docker run --rm alpine:3.20 sh -lc "apk add --no-cache curl && curl -I https://example.com"

If this fails, fix runner DNS/proxy settings.

G) Fix caching explicitly (don’t rely on “local cache” in CI)

Use registry cache:

docker buildx build \
  --cache-from type=registry,ref=registry.example.com/myapp:buildcache \
  --cache-to type=registry,ref=registry.example.com/myapp:buildcache,mode=max \
  -t registry.example.com/myapp:ci \
  --push \
  .

H) If cache is unstable, isolate or partition it

  - Use separate cache refs per branch or per pipeline.
  - Avoid concurrent jobs writing to the same cache ref.
  - Prefer dedicated or ephemeral builders over shared mutable state.


Closing notes

Most CI Docker build failures become straightforward once you treat caching and environment as explicit inputs rather than “whatever Docker happens to have locally.” The most reliable pattern is:

  1. Use BuildKit (buildx) consistently.
  2. Keep Dockerfile layers stable and cache-friendly.
  3. Export/import cache via registry.
  4. Reduce context size with .dockerignore.
  5. Pin base images and dependency graphs.
  6. Collect diagnostics (disk, network, builder) whenever something fails.

If you share your Dockerfile and the exact CI error output (with --progress=plain), you can usually map the failure to one of the categories above in minutes.