Docker in CI Pipelines: Fixing Common Build and Cache Failures
Docker builds that work perfectly on a developer laptop often fail (or become painfully slow) in CI. The reasons are rarely “Docker is flaky” and almost always come down to how CI runners handle caching, networking, permissions, and build isolation.
This tutorial focuses on common Docker build and cache failures in CI pipelines, how to diagnose them with real commands, and how to fix them reliably. Examples assume Linux-based CI runners (GitHub Actions, GitLab CI, Jenkins agents, Buildkite, etc.), but the principles apply broadly.
Table of Contents
- 1. Mental model: what’s different in CI?
- 2. Essential diagnostics you should always capture
- 3. BuildKit vs legacy builder: why it matters
- 4. “No space left on device” and disk pressure failures
- 5. Cache misses: why every build starts from scratch
- 6. Corrupted cache / “failed to compute cache key” / checksum errors
- 7. Registry auth and rate limits: pulls that fail only in CI
- 8. Network and DNS failures during apt, pip, npm, go, etc.
- 9. Permission problems: rootless, Docker socket, and file ownership
- 10. Multi-arch builds: QEMU, emulation, and cache pitfalls
- 11. Reproducible builds: pinning base images and dependencies
- 12. A practical “CI-proof” Dockerfile pattern
- 13. A repeatable troubleshooting checklist
1. Mental model: what’s different in CI?
On a laptop, Docker benefits from:
- A warm local image cache (base images and intermediate layers already present).
- Stable DNS and network.
- Persistent disk and build cache across builds.
- Consistent user permissions.
- No concurrent builds competing for disk and CPU.
In CI, you often have the opposite:
- Runners are ephemeral: every job starts with an empty cache.
- Multiple jobs share the same host: caches can be evicted or corrupted under pressure.
- Outbound network can be restricted or unstable.
- Docker daemon may run in a different context (Docker-in-Docker, remote daemon, rootless).
- You may build on ARM runners (or cross-build), changing cache behavior.
Your goal is to make builds:
- Deterministic (same inputs → same outputs),
- Cache-friendly (stable layer boundaries and explicit cache export/import),
- Resilient (retries, mirrors, pinned versions, and good diagnostics).
2. Essential diagnostics you should always capture
Before “fixing” anything, capture enough context to know what the runner is doing.
Run these commands early in the job (or when a failure occurs):
set -euxo pipefail
docker version
docker info
df -h
df -i
docker system df
docker builder ls || true
docker buildx version || true
docker buildx ls || true
If you suspect network/DNS issues:
cat /etc/resolv.conf || true
getent hosts registry-1.docker.io || true
getent hosts github.com || true
curl -I https://registry-1.docker.io/v2/ || true
curl -I https://google.com || true
If your CI uses Docker-in-Docker (DinD), also check:
ps aux | grep -i dockerd || true
docker context ls || true
docker context show || true
These outputs often immediately reveal the real cause: disk full, wrong daemon, no BuildKit, DNS misconfigured, etc.
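The commands above can be bundled into a single guarded step that writes everything to one artifact file. This is a sketch; the filename ci-docker-diag.log is an arbitrary choice, and each command is guarded so a missing tool or unreachable daemon does not fail the job.

```shell
# Collect diagnostics into one log file; "|| true" guards mean a missing
# tool or daemon never fails the CI job itself.
LOG=ci-docker-diag.log
{
  echo "== disk =="
  df -h || true
  df -i || true
  echo "== docker =="
  docker version || true
  docker info || true
  docker system df || true
  echo "== builders =="
  docker buildx version || true
  docker buildx ls || true
} > "$LOG" 2>&1
echo "diagnostics written to $LOG"
```

Upload the file as a CI artifact so failed jobs keep their diagnostics even after the runner is destroyed.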
3. BuildKit vs legacy builder: why it matters
Most modern CI setups should use BuildKit because it provides:
- Better caching and parallelism
- --mount=type=cache for package managers
- --secret mounts for credentials
- Cache export/import (inline, registry, local)
- More reliable multi-platform builds via buildx
Check if BuildKit is enabled:
docker buildx version
docker buildx ls
If docker build is using legacy mode, you might see older output formatting and missing features. You can force BuildKit:
export DOCKER_BUILDKIT=1
docker build --progress=plain -t myapp:ci .
For buildx:
docker buildx create --name ci-builder --use
docker buildx inspect --bootstrap
Key point: many cache-related “mysteries” are simply because the pipeline is using the legacy builder in one environment and BuildKit in another, producing different behavior and cache compatibility.
4. “No space left on device” and disk pressure failures
Symptoms
- no space left on device
- failed to register layer: write ... no space left on device
- Random failures extracting layers
- Build succeeds sometimes, fails under concurrency
Diagnose
Check disk usage and Docker storage:
df -h
docker system df
docker info | sed -n '/Storage Driver/,$p'
Look for:
- Very small root disks (common on hosted runners)
- Large build contexts being sent to Docker
- Many dangling images/layers
Fixes
A) Reduce build context size with .dockerignore
A huge context can fill the disk and also hurt caching, because the context changes frequently and must be re-sent to the daemon on every build.
Create .dockerignore:
.git
node_modules
dist
target
coverage
*.log
.env
.DS_Store
Measure the approximate context size (note: tar does not honor .dockerignore, so the real build context may be smaller):
tar -czf /tmp/context.tgz .
ls -lh /tmp/context.tgz
B) Prune safely (especially on shared runners)
On ephemeral runners, pruning is usually safe:
docker system prune -af --volumes
docker builder prune -af
On shared runners, pruning can break other jobs. Prefer runner-level isolation or dedicated build nodes.
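On shared hosts, one compromise is to prune only when free space actually drops below a threshold. A sketch; the 10 GiB threshold is an arbitrary example value:

```shell
# Prune only under disk pressure; the 10 GiB threshold is an example.
free_kb=$(df -Pk / | awk 'NR==2 {print $4}')
threshold_kb=$((10 * 1024 * 1024))
if [ "$free_kb" -lt "$threshold_kb" ]; then
  echo "low disk ($free_kb KiB free): pruning"
  docker system prune -af --volumes || true
  docker builder prune -af || true
else
  echo "disk ok ($free_kb KiB free): skipping prune"
fi
```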
C) Use multi-stage builds to keep images small
If your final image contains compilers, package caches, and build artifacts, it grows quickly.
Use a builder stage and a runtime stage:
FROM golang:1.22 AS build
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /out/app ./cmd/app
FROM gcr.io/distroless/static-debian12
COPY --from=build /out/app /app
ENTRYPOINT ["/app"]
D) Avoid creating large layers
Combine cleanup into the same RUN layer:
RUN apt-get update \
&& apt-get install -y --no-install-recommends ca-certificates curl \
&& rm -rf /var/lib/apt/lists/*
If you do cleanup in a later layer, the earlier layer still contains the files and consumes space.
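You can verify where the space went with docker history, which prints the size of each layer; a bloated RUN layer shows its full size even if a later layer "cleaned up". The image name myapp:ci is a placeholder, and the block is guarded so it degrades gracefully when Docker or the image is absent:

```shell
# Show per-layer sizes; layers "cleaned up" later still appear at full size.
out=$(
  if command -v docker >/dev/null 2>&1; then
    docker history --format '{{.Size}}\t{{.CreatedBy}}' myapp:ci 2>/dev/null \
      || echo "image myapp:ci not present"
  else
    echo "docker not installed"
  fi
)
echo "$out"
```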
5. Cache misses: why every build starts from scratch
Symptoms
- CI builds are always slow
- Every run downloads base images and dependencies again
- CACHED rarely appears in build output
Root causes
- Ephemeral runners: no persistent local cache.
- Cache not exported/imported: BuildKit cache exists only locally unless you export it.
- Unstable Dockerfile layer boundaries: copying the whole repo too early invalidates dependency layers.
- Build args or timestamps change: ARG values or generated files change each run.
- Different builder instances: build cache is tied to the builder.
Fix: restructure Dockerfile for stable caching
Bad pattern:
COPY . .
RUN npm ci
RUN npm run build
Any change in the repo invalidates the npm ci layer.
Better pattern:
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build
Now dependency install is cached unless package*.json changes.
Fix: use BuildKit cache mounts for package managers
For example, apt cache:
RUN --mount=type=cache,target=/var/cache/apt \
--mount=type=cache,target=/var/lib/apt \
apt-get update && apt-get install -y --no-install-recommends curl
For npm:
RUN --mount=type=cache,target=/root/.npm npm ci
For pip:
RUN --mount=type=cache,target=/root/.cache/pip pip install -r requirements.txt
These caches speed up rebuilds within the same builder cache. To persist across CI runs, you also need cache export/import.
Fix: export/import cache via registry (works well in CI)
With buildx, you can push cache to a registry:
docker buildx build \
--builder ci-builder \
--progress=plain \
--cache-from type=registry,ref=registry.example.com/myapp:buildcache \
--cache-to type=registry,ref=registry.example.com/myapp:buildcache,mode=max \
-t registry.example.com/myapp:ci \
--push \
.
Notes:
- mode=max stores more cache metadata (bigger but better reuse).
- You can also use type=inline to embed cache into the image manifest, but registry cache refs are often more flexible.
If you don’t want to push the image (only cache), you can still export cache:
docker buildx build \
--cache-to type=registry,ref=registry.example.com/myapp:buildcache,mode=max \
--output type=docker \
-t myapp:ci \
.
(Behavior depends on driver; often you’ll still need --push for remote builders.)
Fix: ensure the same builder is used
List builders:
docker buildx ls
If your CI creates a new builder each time without importing cache, you’ll never hit cache. Create and reuse a named builder within the job, and import cache from registry.
6. Corrupted cache / “failed to compute cache key” / checksum errors
Symptoms
- failed to compute cache key: failed to walk ... no such file or directory
- rpc error: code = Unknown desc = failed to solve ...
- unexpected EOF pulling layers
- checksum mismatch or content digest ... not found
Root causes
- Build context contains symlinks or paths that behave differently in CI.
- Concurrent builds share the same local cache directory (especially with Docker-in-Docker volumes).
- Files are generated during build in a way that conflicts with cache expectations.
- Registry cache ref is overwritten by multiple branches simultaneously.
Fixes
A) Use --progress=plain and isolate the failing step
docker buildx build --progress=plain -t myapp:debug .
The plain output shows exactly which step fails and whether it’s during context transfer, cache lookup, or layer extraction.
B) Avoid non-deterministic file generation in early layers
If you run commands that generate timestamps, random IDs, or embed build time, do it late:
# Bad: invalidates cache constantly
RUN echo "Built at $(date)" > /app/buildinfo.txt
# Better: pass build metadata via labels at the end
ARG VCS_REF
LABEL org.opencontainers.image.revision=$VCS_REF
C) Separate cache refs per branch (or use a “main only” cache)
If every branch writes to the same cache ref, they can thrash it.
Use a cache key strategy:
- myapp:buildcache-main for the main branch
- myapp:buildcache-pr-123 for PRs (optional)
- Or read from the main cache but write to a branch cache
Example pattern:
CACHE_FROM="type=registry,ref=registry.example.com/myapp:buildcache-main"
CACHE_TO="type=registry,ref=registry.example.com/myapp:buildcache-${CI_BRANCH},mode=max"
docker buildx build \
--cache-from "$CACHE_FROM" \
--cache-to "$CACHE_TO" \
-t registry.example.com/myapp:${CI_SHA} \
--push \
.
D) When in doubt, prune builder cache (ephemeral runners)
docker builder prune -af
If this “fixes” it, you likely have a shared cache corruption issue—move to isolated runners or registry-backed cache.
7. Registry auth and rate limits: pulls that fail only in CI
Symptoms
- toomanyrequests: You have reached your pull rate limit (Docker Hub)
- unauthorized: authentication required
- denied: requested access to the resource is denied
- Pulls succeed locally but fail in CI
Diagnose
Check what image is failing:
docker pull ubuntu:24.04
docker pull node:20
If it fails in CI but not locally, you’re likely authenticated locally but not in CI.
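A quick way to confirm that difference is to check whether the CI job has any registry credentials at all. Docker stores login state in config.json; a fresh CI runner usually has no such file. The block only reports the file's presence, never its contents:

```shell
# A missing config.json explains "pull works locally, fails in CI":
# the local machine is logged in, the runner is pulling anonymously.
cfg="${DOCKER_CONFIG:-$HOME/.docker}/config.json"
if [ -f "$cfg" ]; then
  echo "registry credentials file found: $cfg"
else
  echo "no credentials file at $cfg -- this job is pulling anonymously"
fi
```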
Fix: login explicitly in CI
Docker Hub:
echo "$DOCKERHUB_TOKEN" | docker login -u "$DOCKERHUB_USERNAME" --password-stdin
GHCR:
echo "$GHCR_TOKEN" | docker login ghcr.io -u "$GHCR_USERNAME" --password-stdin
Then pull:
docker pull ghcr.io/org/base:latest
Fix: mirror base images to your registry
If Docker Hub rate limits are a recurring problem, mirror critical bases:
docker pull ubuntu:24.04
docker tag ubuntu:24.04 registry.example.com/mirror/ubuntu:24.04
docker push registry.example.com/mirror/ubuntu:24.04
Then in Dockerfile:
FROM registry.example.com/mirror/ubuntu:24.04
Fix: pin by digest for reproducibility
Tags can move. Pinning avoids “works yesterday, fails today”:
docker pull ubuntu:24.04
docker inspect --format='{{index .RepoDigests 0}}' ubuntu:24.04
Use digest:
FROM ubuntu@sha256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
8. Network and DNS failures during apt, pip, npm, go, etc.
Symptoms
- Temporary failure resolving 'deb.debian.org'
- Could not resolve host: pypi.org
- npm ERR! network timeout
- TLS handshake timeout
Diagnose inside the build
Build steps run in an isolated environment. A host-level curl might succeed while build-time DNS fails.
Use a temporary debug container:
docker run --rm -it ubuntu:24.04 bash -lc \
"apt-get update && apt-get install -y curl dnsutils && \
cat /etc/resolv.conf && \
dig pypi.org +short && \
curl -I https://pypi.org"
If this fails, your runner network/DNS is the issue, not your Dockerfile.
Fixes
A) Configure Docker daemon DNS (runner-level)
If you control the daemon, set DNS servers (e.g., 1.1.1.1, 8.8.8.8) in /etc/docker/daemon.json:
{
"dns": ["1.1.1.1", "8.8.8.8"]
}
Restart Docker:
sudo systemctl restart docker
B) Add retries for flaky package installs
For apt:
RUN set -eux; \
apt-get update; \
apt-get install -y --no-install-recommends ca-certificates curl; \
rm -rf /var/lib/apt/lists/*
If you need stronger retries:
RUN set -eux; \
for i in 1 2 3; do apt-get update && break || sleep 5; done; \
for i in 1 2 3; do apt-get install -y --no-install-recommends curl && break || sleep 5; done; \
rm -rf /var/lib/apt/lists/*
For npm, prefer npm ci and consider:
npm config set fetch-retries 5
npm config set fetch-retry-mintimeout 20000
npm config set fetch-retry-maxtimeout 120000
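The same retry pattern can be factored into a small helper usable for any flaky network command. A sketch; the attempt count of 3 and 5-second default delay are arbitrary choices:

```shell
# retry CMD...: run CMD up to 3 times, sleeping RETRY_DELAY seconds
# (default 5) between attempts; returns the last exit code on failure.
retry() {
  attempt=1
  while true; do
    "$@" && return 0
    rc=$?
    [ "$attempt" -ge 3 ] && return "$rc"
    echo "attempt $attempt failed (exit $rc); retrying" >&2
    sleep "${RETRY_DELAY:-5}"
    attempt=$((attempt + 1))
  done
}

# Example usage inside a CI step:
# retry apt-get update
# retry npm ci
```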
C) Use internal mirrors/proxies
In enterprise CI, outbound access may be restricted. Use:
- apt mirror
- pip index URL
- npm registry proxy (Verdaccio, Artifactory, Nexus)
- Go proxy (GOPROXY)
Example for pip:
ARG PIP_INDEX_URL
ENV PIP_INDEX_URL=$PIP_INDEX_URL
RUN pip install -r requirements.txt
Then build with:
docker build --build-arg PIP_INDEX_URL=https://pypi.mycorp.local/simple -t myapp:ci .
9. Permission problems: rootless, Docker socket, and file ownership
Symptoms
- permission denied accessing /var/run/docker.sock
- Build fails when trying to write to mounted directories
- Files created in workspace become owned by root, breaking later CI steps
Diagnose
Check socket permissions:
ls -l /var/run/docker.sock || true
id
groups
Try:
docker ps
If you can’t access the daemon, your CI user isn’t in the docker group (or the daemon is remote).
Fix: add CI user to docker group (self-hosted runners)
sudo usermod -aG docker "$USER"
newgrp docker
docker ps
Fix: avoid root-owned artifacts in bind mounts
If you mount the workspace into a container and build artifacts inside, they may be owned by root.
Run container with the host UID/GID:
docker run --rm \
-u "$(id -u):$(id -g)" \
-v "$PWD:/work" -w /work \
node:20 bash -lc "npm ci && npm test"
Fix: prefer building inside Dockerfile (not in mounted workspace)
Instead of running build tools in a container with a bind mount, bake the build into the Dockerfile and export artifacts via multi-stage or docker cp. This avoids permission mismatches and makes caching more predictable.
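A sketch of that workflow: build the image, create a stopped container from it, and copy artifacts out with docker cp. The image name and the /app/dist path are examples, and the block only runs where a Docker daemon and a Dockerfile are present:

```shell
# Extract build artifacts without bind mounts, avoiding root-owned files
# appearing in the CI workspace.
if command -v docker >/dev/null 2>&1 && [ -f Dockerfile ]; then
  docker build -t myapp:artifacts .
  cid=$(docker create myapp:artifacts)
  docker cp "$cid":/app/dist ./dist
  docker rm -v "$cid"
  status="artifacts extracted to ./dist"
else
  status="skipped: no docker daemon or Dockerfile here"
fi
echo "$status"
```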
10. Multi-arch builds: QEMU, emulation, and cache pitfalls
Symptoms
- exec format error
- Builds are extremely slow in CI
- Cache doesn’t seem to apply between architectures
Diagnose
Check platform:
uname -m
docker version --format '{{.Server.Arch}}'
List buildx platforms:
docker buildx inspect --bootstrap
Fix: use buildx with explicit platforms
docker buildx build \
--platform linux/amd64,linux/arm64 \
-t registry.example.com/myapp:latest \
--push \
.
Fix: install QEMU (if cross-building on amd64)
On many CI systems you can run:
docker run --privileged --rm tonistiigi/binfmt --install all
Then:
docker buildx create --use --name multi
docker buildx inspect --bootstrap
Cache note for multi-arch
Cache is often platform-specific. A layer built for linux/amd64 is not reusable for linux/arm64. You can still share some metadata, but expect less reuse.
To avoid cache thrash, keep architecture-dependent steps later (e.g., compilation) and maximize earlier architecture-independent steps (e.g., downloading source, installing pure scripts).
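One common way to apply this is to run the architecture-independent steps on the build platform and cross-compile only at the end. A sketch for Go, assuming your toolchain supports cross-compilation; stage names and paths are illustrative:

```dockerfile
# syntax=docker/dockerfile:1.7

# Architecture-independent: runs on the build platform, so this layer
# cache is shared across all target architectures.
FROM --platform=$BUILDPLATFORM golang:1.22 AS deps
WORKDIR /src
COPY go.mod go.sum ./
RUN go mod download

# Architecture-dependent: cross-compile per target at the very end.
FROM deps AS build
ARG TARGETOS
ARG TARGETARCH
COPY . .
RUN CGO_ENABLED=0 GOOS=$TARGETOS GOARCH=$TARGETARCH go build -o /out/app ./cmd/app
```

TARGETOS and TARGETARCH are automatic platform build args provided by BuildKit when --platform is set.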
11. Reproducible builds: pinning base images and dependencies
CI failures often appear “random” because upstream dependencies changed.
Pin base image by digest
As shown earlier:
FROM node@sha256:...
Pin OS packages (when possible)
For Debian/Ubuntu, exact versions can be pinned:
RUN apt-get update && apt-get install -y --no-install-recommends \
curl=7.88.1-10+deb12u5 \
&& rm -rf /var/lib/apt/lists/*
Be careful: pinning too aggressively can break when mirrors rotate. A more common approach is:
- Pin the base image digest
- Keep OS packages minimal
- Use a stable mirror snapshot if you need strict reproducibility
Pin language dependencies
- Node: package-lock.json and npm ci
- Python: requirements.txt with hashes (pip-compile --generate-hashes)
- Go: go.sum, and GONOSUMDB/GOPROXY as needed
- Rust: Cargo.lock
These reduce CI flakiness and improve caching because dependency graphs change less frequently.
12. A practical “CI-proof” Dockerfile pattern
Below is a pattern that addresses many CI cache issues for a Node app. It uses:
- Stable layer ordering
- BuildKit cache mounts
- Minimal runtime image
- No leaking secrets into layers
Requires BuildKit (DOCKER_BUILDKIT=1) and preferably buildx.
# syntax=docker/dockerfile:1.7
FROM node:20-bookworm AS deps
WORKDIR /app
# Copy only dependency manifests first for caching
COPY package.json package-lock.json ./
# Cache npm downloads between builds (within BuildKit cache)
RUN --mount=type=cache,target=/root/.npm \
npm ci
FROM node:20-bookworm AS build
WORKDIR /app
# Reuse node_modules from deps stage
COPY --from=deps /app/node_modules ./node_modules
COPY . .
RUN npm run build
FROM node:20-bookworm-slim AS runtime
WORKDIR /app
ENV NODE_ENV=production
# Copy only what you need at runtime
COPY --from=build /app/dist ./dist
COPY --from=build /app/package.json ./package.json
# If you need production deps only:
COPY --from=deps /app/node_modules ./node_modules
EXPOSE 3000
CMD ["node", "dist/server.js"]
Build with registry-backed cache:
docker buildx create --name ci-builder --use || docker buildx use ci-builder
docker buildx inspect --bootstrap
docker buildx build \
--progress=plain \
--cache-from type=registry,ref=registry.example.com/myapp:buildcache \
--cache-to type=registry,ref=registry.example.com/myapp:buildcache,mode=max \
-t registry.example.com/myapp:${GIT_SHA} \
-t registry.example.com/myapp:latest \
--push \
.
If you want to verify cache hits, run the same command twice and look for CACHED steps in the output.
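To make that check scriptable, save the plain-progress output and count the CACHED steps. build.log is an assumed artifact from a prior buildx run piped through tee:

```shell
# Count cache hits in a saved build log; a second identical build should
# show a much higher count than the first.
if [ -f build.log ]; then
  hits=$(grep -c 'CACHED' build.log || true)
else
  hits=0
fi
echo "cached steps: $hits"
```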
13. A repeatable troubleshooting checklist
When a Docker build fails in CI, walk through this list in order.
A) Confirm environment and builder
docker version
docker info
docker buildx ls
docker buildx inspect --bootstrap
- Are you using BuildKit?
- Are you building on the platform you think you are?
B) Check disk and cache pressure
df -h
docker system df
If disk is tight, prune (ephemeral runners):
docker system prune -af --volumes
docker builder prune -af
C) Inspect build context size and .dockerignore
ls -la
test -f .dockerignore && cat .dockerignore
If .dockerignore is missing, add one.
D) Re-run with plain progress to pinpoint the failing step
docker buildx build --progress=plain -t myapp:debug .
E) Validate registry auth and rate limits
docker login registry.example.com
docker pull your-base-image:tag
F) Validate network inside containers
docker run --rm alpine:3.20 sh -lc "apk add --no-cache curl && curl -I https://example.com"
If this fails, fix runner DNS/proxy settings.
G) Fix caching explicitly (don’t rely on “local cache” in CI)
Use registry cache:
docker buildx build \
--cache-from type=registry,ref=registry.example.com/myapp:buildcache \
--cache-to type=registry,ref=registry.example.com/myapp:buildcache,mode=max \
-t registry.example.com/myapp:ci \
--push \
.
H) If cache is unstable, isolate or partition it
- Separate cache refs per branch
- Or only allow main to write the cache and have other branches read from it
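That second policy can be expressed as a small wrapper: every branch reads the main cache, and only main writes it back. CI_BRANCH is an assumed CI-provided variable and the registry ref is an example; the final build command is shown commented out:

```shell
# Read from the main cache everywhere; write it back only on main.
CI_BRANCH="${CI_BRANCH:-feature-x}"
CACHE_ARGS="--cache-from type=registry,ref=registry.example.com/myapp:buildcache-main"
if [ "$CI_BRANCH" = "main" ]; then
  CACHE_ARGS="$CACHE_ARGS --cache-to type=registry,ref=registry.example.com/myapp:buildcache-main,mode=max"
fi
echo "cache args: $CACHE_ARGS"
# docker buildx build $CACHE_ARGS -t registry.example.com/myapp:ci --push .
```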
Closing notes
Most CI Docker build failures become straightforward once you treat caching and environment as explicit inputs rather than “whatever Docker happens to have locally.” The most reliable pattern is:
- Use BuildKit (buildx) consistently.
- Keep Dockerfile layers stable and cache-friendly.
- Export/import cache via registry.
- Reduce context size with .dockerignore.
- Pin base images and dependency graphs.
- Collect diagnostics (disk, network, builder) whenever something fails.
If you share your Dockerfile and the exact CI error output (with --progress=plain), you can usually map the failure to one of the categories above in minutes.