CI/CD Infrastructure Pitfalls & Learnings
STATUS: ACTIVE (updated during CI/CD pipeline implementation)
OWNER: Alex (Infrastructure), Koen (Quality)
CATEGORY: Infrastructure, CI/CD
This page captures hard-won learnings from setting up GE's CI/CD pipeline. Every issue encountered during implementation is documented here so no agent or session repeats the same mistakes.
Registry Issues
PITFALL-CICD-001: GitLab Registry pods stuck in Init:0/2
Date: 2026-04-01
Symptom: All 3 gitlab-registry pods stuck in Init:0/2 for 13+ days.
Root cause: MountVolume.SetUp failed for volume "registry-secrets" : secret "gitlab-registry-storage" not found
Analysis: The GitLab Helm chart expects a secret named gitlab-registry-storage containing storage backend credentials (S3/Minio). This secret was never created during initial GitLab deployment, or was deleted.
Fix: Created gitlab-registry-storage secret with Minio S3 config:
kubectl create secret generic gitlab-registry-storage -n ge-gitlab \
  --from-literal=storage=$'s3:\n accesskey: <from gitlab-minio-secret>\n secretkey: <from gitlab-minio-secret>\n bucket: registry\n regionendpoint: http://gitlab-minio-svc.ge-gitlab.svc.cluster.local:9000\n secure: false\n v4auth: true'
# $'...' (bash ANSI-C quoting) turns each \n into a real newline; inside plain double quotes \n would stay literal
kubectl delete pods -n ge-gitlab -l app=registry
Result: Registry pods came up within 20 seconds. 2 replicas running.
Lesson: Always verify ALL init container dependencies after Helm upgrade. A missing secret silently blocks pods forever with no obvious error in kubectl get pods — you must kubectl describe to find it. The Helm values (registry.storage.secret) define the expected secret name.
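One way to catch a missing secret before pods hang is to list every Secret a pod's volumes reference and compare that against what exists in the namespace. The sketch below is a hypothetical helper (the function names are illustrative, not part of the GE tooling) that works on the dictionary parsed from kubectl get pod <name> -o json. It assumes secrets are mounted either as plain secret volumes or inside projected volumes.

```python
def referenced_secrets(pod: dict) -> set:
    """Return the names of all Secrets a pod mounts through its volumes."""
    names = set()
    for volume in pod.get("spec", {}).get("volumes", []):
        # Plain secret volume: {"secret": {"secretName": "..."}}
        secret = volume.get("secret")
        if secret and "secretName" in secret:
            names.add(secret["secretName"])
        # Projected volumes can bundle several secrets under "sources".
        for source in volume.get("projected", {}).get("sources", []):
            if "secret" in source and "name" in source["secret"]:
                names.add(source["secret"]["name"])
    return names


def missing_secrets(pod: dict, existing: set) -> set:
    """Secrets the pod needs that are not present in the namespace."""
    return referenced_secrets(pod) - set(existing)
```

Feeding it the names from kubectl get secrets -n ge-gitlab would have flagged gitlab-registry-storage immediately instead of after a kubectl describe.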
Runner Issues
PITFALL-CICD-002: Runners online but not picking up jobs (tag mismatch)
Date: 2026-04-01
Symptom: Runners registered and online, but CI jobs stay in "pending" forever.
Root cause: CI template jobs had tags: [docker, linux] but runners had NO tags. GitLab only dispatches tagged jobs to runners with matching tags.
Fix: Added tags to all runners via API:
curl -X PUT "http://gitlab-webservice-default:8080/api/v4/runners/$ID" \
-H "PRIVATE-TOKEN: $TOKEN" -d "tag_list=docker,linux"
Lesson: kubectl get pods shows runners as Running and gitlab-runner verify shows them as connected, but they silently refuse tagged jobs whose tags they lack. Check the job detail via API to see tag_list.
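The matching rule can be sketched as a small predicate. This is a simplified model for reasoning about pending jobs, not GitLab's implementation; the function name and the run_untagged parameter name are chosen here for illustration.

```python
def runner_can_take(job_tags, runner_tags, run_untagged=False):
    """Return True if a runner with runner_tags would be offered the job."""
    job, runner = set(job_tags), set(runner_tags)
    if not job:
        # Untagged jobs only go to runners that accept untagged work.
        return run_untagged
    # A tagged job needs every one of its tags present on the runner.
    return job <= runner
```

With tags: [docker, linux] on the job and no tags on the runner, the predicate is False, which is exactly the "pending forever" state described above.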
PITFALL-CICD-003: Runner DNS resolution failure inside pods
Date: 2026-04-01
Symptom: Runners log dial tcp: lookup gitlab.ge.internal on 10.43.0.10:53: server misbehaving
Root cause: CoreDNS NodeHosts didn't include gitlab.ge.internal (only had gitlab.internal.ge-ops.local). The Helm chart registered runners with gitlab.ge.internal URL.
Fix: Patched CoreDNS configmap to add 10.43.71.244 gitlab.ge.internal registry.ge.internal minio.ge.internal kas.ge.internal to NodeHosts. Then restarted CoreDNS and runners.
Lesson: When GitLab is deployed via Helm with custom hosts (.global.hosts.domain), ALL in-cluster services must be able to resolve those hostnames. CoreDNS NodeHosts is the place to add them. This is the same k3s networking issue documented in memory — ClusterIP doesn't work from host, DNS doesn't work from pods unless explicitly configured.
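A quick audit of the NodeHosts block can be sketched as follows. The helper names are hypothetical; the input is assumed to be the hosts-file style text stored under NodeHosts in the CoreDNS configmap (one IP followed by one or more names per line).

```python
def hosts_in_nodehosts(node_hosts: str) -> dict:
    """Parse hosts-file style text into a {hostname: ip} mapping."""
    mapping = {}
    for line in node_hosts.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping[name] = ip
    return mapping


def missing_hosts(node_hosts: str, required) -> list:
    """Hostnames from `required` that NodeHosts cannot resolve."""
    known = hosts_in_nodehosts(node_hosts)
    return [host for host in required if host not in known]
```

Running it with the hostnames a chart registers (for example gitlab.ge.internal and registry.ge.internal) shows which entries still need to be patched in.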
Pipeline Issues
PITFALL-CICD-004: First pipeline run reveals 1,279 lint errors
Date: 2026-04-01
Symptom: lint:python job fails with 1,279 ruff errors, 686 auto-fixable.
Root cause: The codebase was never run through the full CI pipeline before. Legacy code, dead imports, unused variables accumulated over months.
Resolution strategy: NOT a blocker. The pipeline correctly catches issues. Fix in phases:
1. Auto-fix the 686 fixable errors with ruff check --fix
2. Address remaining 593 manually or suppress with # noqa where appropriate
3. Set select in the ruff config (or pass --select on the command line) to start with critical rules only, then expand over time
Lesson: First CI activation on an existing codebase ALWAYS produces a flood of findings. Plan for a "clean-up sprint" and don't block all development. Use allow_failure: true initially, then tighten.
PITFALL-CICD-005: YAML comments inside script blocks break GitLab CI parsing
Date: 2026-04-01
Symptom: Pipeline fails with "script config should be a string or a nested array of strings up to 10 levels deep". No jobs created.
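Root cause: The # comments inside the script: array left entries that GitLab could not read as strings. YAML itself permits comments, but a list item that holds only a comment (for example "- # build step") parses as null, and the CI validator accepts only strings or nested arrays of strings. This is the most likely mechanism for the error above.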
Fix: Remove all # comments from inside script: blocks. Use echo "description" lines instead.
Lesson: GitLab CI script: arrays are strict — strings only. Put comments ABOVE the job definition, not inside the script block. Also: use the /api/v4/projects/{id}/ci/lint endpoint to validate YAML before pushing.
PITFALL-CICD-006: Hardcoded repo paths break in CI runners
Date: 2026-04-01
Symptom: 43 tests fail in CI with AssertionError: main.py not found at /home/claude/ge-bootstrap/... — but pass locally.
Root cause: Test files hardcode /home/claude/ge-bootstrap as repo root. CI runner checks out at /builds/growing-europe/ge-bootstrap.
Fix: Made tests/conftest.py detect repo root dynamically:
1. Check CI_PROJECT_DIR env var (GitLab CI sets this)
2. Walk up from test file looking for CLAUDE.md marker
3. Fallback to /home/claude/ge-bootstrap
All test files updated to import tests.conftest.GE_ROOT_PATH.
Lesson: NEVER hardcode absolute paths in tests. Always use dynamic detection or fixtures. CI runners have different checkout paths than local dev. Use the conftest pattern: detect root once, share via fixture.
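The three-step detection described in the fix can be sketched like this. It is a minimal reconstruction from the steps listed, not the actual contents of tests/conftest.py; the function name find_repo_root and the env parameter are assumptions added to keep the logic testable.

```python
import os
from pathlib import Path

MARKER = "CLAUDE.md"                          # file that marks the repo root
FALLBACK = Path("/home/claude/ge-bootstrap")  # last-resort local path


def find_repo_root(start: Path, env=os.environ) -> Path:
    """Resolve the repo root: CI variable, then marker walk, then fallback."""
    ci_dir = env.get("CI_PROJECT_DIR")
    if ci_dir:
        # GitLab CI exports the checkout directory in this variable.
        return Path(ci_dir)
    for candidate in [start, *start.parents]:
        if (candidate / MARKER).is_file():
            return candidate
    return FALLBACK


# In tests/conftest.py this would be evaluated once at import time:
# GE_ROOT_PATH = find_repo_root(Path(__file__).resolve().parent)
```

Tests then build paths from GE_ROOT_PATH (for example GE_ROOT_PATH / "main.py") instead of embedding an absolute prefix.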
PITFALL-CICD-007: Build stage fails on monorepo without pyproject.toml
Date: 2026-04-01
Symptom: build:backend fails: "Source does not appear to be a Python project: no pyproject.toml or setup.py"
Root cause: CI template assumed a packaged Python project with python -m build. GE is a monorepo.
Fix: Replaced build step with import verification: python -c "import ge_orchestrator".
Lesson: Monorepos don't need python -m build. The "build" step should verify that imports resolve and dependencies install — not package a wheel.
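When more than one package has to be checked, the single python -c import can be generalized. The sketch below is illustrative only: the helper name is hypothetical, and the list of modules to verify is something each project would supply.

```python
import importlib


def failed_imports(module_names) -> dict:
    """Try to import each module; return {name: error message} for failures."""
    failures = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError as exc:
            # Covers missing modules and missing transitive dependencies.
            failures[name] = str(exc)
    return failures
```

A build job could call failed_imports(["ge_orchestrator"]) and exit non-zero if the returned dictionary is not empty.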
PITFALL-CICD-008: PEP 668 blocks pip install in Python 3.12 slim containers
Date: 2026-04-01
Symptom: pip install ruff fails with "externally-managed-environment" error in CI jobs using python:3.12-slim.
Root cause: PEP 668 lets a distribution mark its system Python as externally managed. pip then refuses to install into that environment, to avoid breaking OS-managed packages.
Fix: Add --break-system-packages to ALL pip install commands in CI. Or use python -m venv (slower in CI).
Lesson: When upgrading runner images to Python 3.12+, every pip install needs --break-system-packages. Search-and-replace across all CI templates. This will affect EVERY new project.
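Whether a given image is affected can be checked from inside the container. PEP 668 specifies a marker file named EXTERNALLY-MANAGED in the interpreter's stdlib directory, and the check is skipped inside a virtual environment. The helper names below are hypothetical; this is a sketch of that rule, not pip's own code.

```python
import sys
import sysconfig
from pathlib import Path


def is_externally_managed() -> bool:
    """True if the interpreter's stdlib directory carries the PEP 668 marker."""
    marker = Path(sysconfig.get_path("stdlib")) / "EXTERNALLY-MANAGED"
    return marker.is_file()


def pip_install_would_be_refused() -> bool:
    """The marker only applies outside virtual environments."""
    in_venv = sys.prefix != sys.base_prefix
    return is_externally_managed() and not in_venv
```

Running this once in a candidate runner image tells you up front whether the CI templates need --break-system-packages or a venv.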
PITFALL-CICD-009: Custom runner image not pullable by k8s executor
Date: 2026-04-01
Symptom: All pipeline jobs fail after switching to custom ge-ci-runner:latest image. Jobs can't start.
Root cause: GitLab k8s executor creates fresh pods per job. These pods try to pull the image from a registry. A local-only image (k3s ctr images import) is not pullable by new pods — only by the host containerd.
Fix: Push the image to GitLab's built-in container registry (registry.ge.internal), use the full registry path in CI template.
Steps:
1. docker login registry.ge.internal -u root -p <password>
2. docker tag ge-ci-runner:latest registry.ge.internal/growing-europe/ge-bootstrap/ci-runner:latest
3. docker push registry.ge.internal/growing-europe/ge-bootstrap/ci-runner:latest
4. Update CI template: image: registry.ge.internal/growing-europe/ge-bootstrap/ci-runner:latest
Lesson: k8s executor ≠ local Docker. Every CI job is a fresh pod that must pull its image. Always use a registry path. Also needed: Docker daemon insecure-registries config for non-TLS registries, and /etc/hosts entry for custom registry hostnames.
PITFALL-CICD-011: Shell executor runner picking up k8s-targeted jobs
Date: 2026-04-01
Symptom: Pipeline jobs fail with ruff: command not found despite using custom image.
Root cause: A host-installed GitLab runner (FK-DEV, shell executor) was also registered with docker,linux tags. It picked up jobs meant for the k8s executor runners and ran them directly on the host (where ruff isn't installed).
Fix: Pause the shell executor runner via API: PUT /api/v4/runners/14 -d paused=true
Lesson: Multiple runner types (shell, docker, k8s) with overlapping tags cause unpredictable job assignment. Either use distinct tags per executor type, or pause runners you don't want active.
PITFALL-CICD-012: k3s restart re-registers runners without tags
Date: 2026-04-01
Symptom: After k3s restart, pipeline jobs stay pending forever.
Root cause: When k3s restarts, runner pods re-register with GitLab as NEW runners (new IDs) without the docker,linux tags that were set via API on the old runners.
Fix: Re-add tags via API after every k3s restart, or configure tags in the runner Helm values/ConfigMap so they persist.
Lesson: Runner tags set via API are per-registration, not per-config. If the runner re-registers (pod restart, k3s restart), tags are lost. Set tags in the runner template config for persistence.
PITFALL-CICD-013: Pipeline progress check intervals
Date: 2026-04-01
Learning: Never wait 300s between pipeline checks. Use 15s intervals until the first successful full build. After that, set the interval to the observed pipeline duration + 30s buffer.
Rule: Until a project's CI pipeline has passed completely at least once, check every 15 seconds. After first success, calibrate to real timing.
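The interval rule reduces to a few lines. The function and parameter names are illustrative; only the 15s and 30s values come from the rule above.

```python
def poll_interval_seconds(last_success_duration=None, initial=15, buffer=30):
    """Seconds to wait between pipeline status checks.

    last_success_duration is the observed duration (in seconds) of the first
    fully green pipeline, or None if the pipeline has never passed.
    """
    if last_success_duration is None:
        return initial
    return last_success_duration + buffer
```

For example, a pipeline that first passed after 240 seconds would be polled every 270 seconds from then on.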
PITFALL-CICD-014: ESLint passes locally, tsc fails in CI
Date: 2026-04-01
Symptom: lint:typescript stage runs both ESLint AND tsc --noEmit. ESLint passed after fixes but tsc found 47 type errors (null vs undefined, Drizzle ORM enum mismatches).
Root cause: The lint:typescript stage chains two commands — ESLint for style and tsc for types. Local ESLint pass ≠ CI stage pass.
Lesson: Always run BOTH npx eslint . AND npx tsc --noEmit locally before declaring lint:typescript fixed.
PITFALL-CICD-015: pyright needs libatomic.so.1 in python:3.12-slim
Date: 2026-04-02
Symptom: types:python job fails with libatomic.so.1: cannot open shared object file. pyright installs Node via nodeenv which requires libatomic.
Fix: Add libatomic1 to apt-get install in the runner Dockerfile.
Lesson: pyright-python downloads its own Node binary. python:3.12-slim does not include libatomic1 (needed by Node on x86_64). Always test new tools inside the actual runner image, not just on the host.
PITFALL-CICD-016: pip install to user site-packages not on PATH in runner image
Date: 2026-04-02
Symptom: vulture: command not found / checkov: command not found despite pip install succeeding in the CI job script.
Root cause: Runner image runs as UID 1000 (non-root). pip install --break-system-packages installs to /home/runner/.local/bin which is not on PATH.
Fix: Pre-install ALL Python tools in the Dockerfile (before USER runner) so they land in /usr/local/bin.
Lesson: NEVER pip install at runtime in CI jobs. Pre-install everything in the runner image. Runtime installs waste time AND may land in unpredictable locations. The runner image is the single source of truth for tool versions.
PITFALL-CICD-017: Stryker needs ps command (procps)
Date: 2026-04-02
Symptom: mutation:typescript fails with Error: spawn ps ENOENT.
Root cause: Stryker uses ps to manage child processes. node:20-slim does not include procps.
Fix: apt-get install -y procps in the CI job script (or use node:20 full image).
Lesson: -slim images are minimal. Always verify that tools' runtime dependencies are available, not just the tools themselves.
PITFALL-CICD-018: verify:health can't reach ClusterIP across k3s namespaces
Date: 2026-04-02
Symptom: Health check fails with Connection refused when targeting admin-ui.ge-system.svc.cluster.local from a CI runner pod in ge-gitlab namespace.
Root cause: Same k3s ClusterIP networking issue documented in main memory. CI runner pods cannot reach services in other namespaces via ClusterIP.
Fix: Use host IP (192.168.1.85) with Host header for Traefik routing instead of ClusterIP.
Lesson: k3s single-node ClusterIP is unreliable across namespaces. For CI jobs that need to reach application services, use the host IP + Traefik ingress. Document the Host header needed for each service.
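A health probe that targets the host IP while presenting the service's hostname to Traefik could look like the sketch below. The function name, the /health path, and the Host value used in the usage note are assumptions for illustration; the real Host header for each service has to come from its ingress rule.

```python
import urllib.request


def build_health_request(host_ip: str, host_header: str, path: str = "/"):
    """Build a GET request to the host IP that carries an explicit Host header."""
    url = f"http://{host_ip}{path}"
    # Traefik routes on the Host header, not on the IP in the URL.
    return urllib.request.Request(url, headers={"Host": host_header}, method="GET")


def is_healthy(request, timeout: float = 5.0) -> bool:
    """True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError, and socket timeouts all derive from OSError.
        return False
```

A job would call is_healthy(build_health_request("192.168.1.85", "<service hostname>", "/health")) and fail the stage when it returns False.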
PITFALL-CICD-019: Green line bias: || true hides real failures
Date: 2026-04-02 (incident from 2026-04-01)
Symptom: Pipeline shows all green, but multiple stages are swallowing errors with || true or 2>&1 || true. Real failures (Playwright test failures, Stryker ps ENOENT) are invisible.
Rule: NEVER use || true on the primary tool invocation in a CI stage. Use it ONLY for optional/advisory steps (e.g., artifact upload). If a stage's primary tool fails, the stage MUST fail. allow_failure: true is acceptable during initial rollout but must be removed within one sprint.
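Finding the offending lines can be automated. The sketch below scans a list of script lines (for example the script: entries of a parsed CI template) for commands that end in || true. The helper name is hypothetical, and deciding which hits are legitimately optional steps remains a manual review.

```python
import re

# Matches "|| true" at the end of a command, with optional whitespace.
SWALLOW = re.compile(r"\|\|\s*true\s*$")


def swallowed_commands(script_lines) -> list:
    """Return the script lines whose exit status is discarded by '|| true'."""
    return [line for line in script_lines if SWALLOW.search(line.strip())]
```

Each returned line is a candidate for review: keep || true only where the step is advisory.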
PITFALL-CICD-020: Excluding tests instead of running them properly
Date: 2026-04-02
Symptom: After removing || true from test:unit:frontend, 55 integration tests failed (ECONNREFUSED PostgreSQL).
Tempting fix: Exclude integration tests from the unit stage. This hides the problem.
Correct fix: Exclude from unit stage AND create test:integration:frontend stage with PostgreSQL service container. Tests must run somewhere — never silently drop them.
Rule: When moving tests out of a stage, always verify they run in another stage. grep -r "vitest\|pytest" config/gitlab-ci-template.yaml to audit coverage.
PITFALL-CICD-021: Checkov --check conflicts with skip-check from an auto-discovered .checkov.yaml
Date: 2026-04-02
Symptom: checkov: error: The check ids specified for '--check' and '--skip-check' must be mutually exclusive
Root cause: Checkov auto-discovers .checkov.yaml in the working directory. If .checkov.yaml has skip-check: and the CLI uses --check, they conflict.
Fix: Use --check (positive list) OR .checkov.yaml with skip-check (negative list). Never both. For app-level scanning, a positive list of ~30 relevant checks is cleaner than trying to skip 80+ cluster-level checks.
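Also learned: In this setup Checkov exited 0 even with FAILED checks until --hard-fail-on was set. Upstream Checkov returns a non-zero exit code on failed checks unless soft-fail is enabled, so also check the config file and CLI for a soft-fail setting. Use --hard-fail-on HIGH or --hard-fail-on CRITICAL and confirm the exit code with a known-bad input.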
PITFALL-CICD-023: GitLab CI variables don't support bash ${VAR:-default} syntax
Date: 2026-04-02
Symptom: CI job variable SMOKE_URL: "${GE_SMOKE_URL:-http://192.168.1.85}" resolves to the literal string including :- syntax. Playwright receives the raw string, not the resolved value.
Root cause: GitLab CI variable expansion only supports $VAR or ${VAR}. Bash default syntax (${VAR:-default}) is NOT supported in the variables: block — it's passed literally.
Fix: Use $GE_SMOKE_URL (without default) and ensure the variable is always set in the generated .gitlab-ci.yml. Handle defaults in the script block where bash actually runs.
Lesson: GitLab CI variables: is NOT bash. Only $VAR expansion works. Defaults must be handled in script: blocks.
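A template check for this mistake can be sketched with a regular expression. The helper name is hypothetical, and the input is assumed to be the variables: mapping of a parsed CI file. The pattern covers the common bash parameter-expansion operators (:-, -, :=, :?, :+), not every bash form.

```python
import re

# ${NAME<operator>...} where the operator is -, =, ? or +, optionally after ':'
BASH_DEFAULT = re.compile(r"\$\{\w+:?[-=?+][^}]*\}")


def unsupported_variables(variables: dict) -> dict:
    """Return the variables whose values use bash-only expansion syntax."""
    return {
        name: value
        for name, value in variables.items()
        if isinstance(value, str) and BASH_DEFAULT.search(value)
    }
```

Plain $VAR and ${VAR} references pass through untouched; only values that GitLab would not expand as intended are reported.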
PITFALL-CICD-024: Non-root containers can't write /etc/hosts
Date: 2026-04-02
Symptom: DAST containers (ZAP, Nuclei) fail with /etc/hosts: Permission denied when trying to add hostname entries.
Root cause: Security-focused images run as non-root by default. /etc/hosts is owned by root.
Fix: Don't modify /etc/hosts. Use IP addresses directly as scan targets. DAST tools don't need hostname routing — they scan any URL.
Lesson: Never assume you can write to system files in CI containers. Use environment variables and IP addresses instead.
PITFALL-CICD-025: Nuclei v3 changed -json flag to -jsonl
Date: 2026-04-02
Symptom: flag provided but not defined: -json
Fix: Use -jsonl (JSON Lines format) instead of -json.
Lesson: Always check tool changelogs when pulling :latest tags. Pin versions in production.
PITFALL-CICD-026: OWASP ZAP image moved from Docker Hub to GHCR
Date: 2026-04-02
Symptom: ErrImagePull for owasp/zap2docker-stable:latest
Fix: Use ghcr.io/zaproxy/zaproxy:stable. Import to k3s: docker save | k3s ctr images import.
Lesson: Check image source URLs periodically. Docker Hub deprecations happen silently.
PITFALL-CICD-022: Kaniko can't resolve registry.ge.internal
Date: 2026-04-02
Symptom: dial tcp: lookup registry.ge.internal on 10.43.0.10:53: no such host
Root cause: CoreDNS NodeHosts had registry.internal.ge-ops.local but not registry.ge.internal. The CI_REGISTRY variable uses registry.ge.internal.
Fix: Add 10.43.148.219 registry.ge.internal to CoreDNS configmap NodeHosts. Also add --insecure --skip-tls-verify to kaniko (internal registry uses self-signed certs).
Lesson: Every new hostname used in CI jobs must be in CoreDNS NodeHosts. The Helm chart registers services with .ge.internal domain but CoreDNS was configured with .internal.ge-ops.local. Keep a single domain convention.
PITFALL-CICD-027: ArgoCD repo secret needs type: git
Date: 2026-04-02
Symptom: ArgoCD repo connection "Failed" — authentication required: HTTP Basic: Access denied.
Root cause: ArgoCD repo secret (repo-694135144) had an empty type field. Without type: git, ArgoCD stores the credentials but never passes them to git operations.
Fix: kubectl patch secret repo-694135144 -n argocd --type='json' -p='[{"op":"replace","path":"/data/type","value":"Z2l0"}]' (Z2l0 = base64 of "git"). Restart argocd-repo-server.
Lesson: ArgoCD repo secrets have 5 required fields: url, username, password, insecure, and type. Missing type fails silently — credentials are stored but never used.
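Values under a Secret's data field must be base64-encoded, which is where Z2l0 in the patch comes from. The sketch below reproduces that value and assembles the same JSON patch body; the helper names are illustrative.

```python
import base64
import json


def secret_data_value(plain: str) -> str:
    """Base64-encode a string the way Kubernetes expects under .data."""
    return base64.b64encode(plain.encode("utf-8")).decode("ascii")


def type_patch(repo_type: str = "git") -> str:
    """JSON patch body that sets the secret's type field."""
    return json.dumps(
        [{"op": "replace", "path": "/data/type", "value": secret_data_value(repo_type)}]
    )
```

secret_data_value("git") yields Z2l0, matching the value used in the kubectl patch command above.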
PITFALL-CICD-028: Semgrep Docker detection blocks CI scans
Date: 2026-04-02
Symptom: Exception: Detected Docker environment without a code volume — semgrep refuses to scan.
Root cause: Semgrep's returntocorp/semgrep:latest image detects it's in Docker and demands -v "${PWD}:/src". In k8s CI runners, there's no Docker volume mount — the code is git-cloned into the pod.
Fix: Use semgrep scan ... . (explicit path argument) and set SEMGREP_SEND_METRICS=off. The . tells semgrep where the code is.
Discovery: This bug was hidden by || true on the semgrep invocation. The crash detection pattern from W1 exposed it — semgrep was NEVER actually scanning in any previous pipeline run.
Lesson: || true on security tools doesn't just hide crashes — it can hide tools that never ran at all. Always verify tools produce their expected output file.
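The "verify the output file" check can be sketched as a small gate that runs after the tool. It assumes the tool was asked to write a JSON report; the function name and the report filename in the usage note are hypothetical.

```python
import json
from pathlib import Path


def report_is_usable(path) -> bool:
    """True only if the report exists, is non-empty, and parses as JSON."""
    report = Path(path)
    if not report.is_file() or report.stat().st_size == 0:
        return False
    try:
        json.loads(report.read_text())
    except ValueError:
        # JSONDecodeError and UnicodeDecodeError are both ValueError subclasses.
        return False
    return True
```

A stage would fail when report_is_usable("semgrep-report.json") is False, so a scanner that crashed or never ran cannot produce a green job.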
PITFALL-CICD-029: Cosign sign:image fails against internal registry (3 layered issues)
Date: 2026-04-11
Symptom: sign:image job fails with cosign unable to sign container images.
Root cause: Three compounding issues, each masked by the previous one:
Layer 1 — Wrong port (i/o timeout):
dial tcp 10.43.148.219:443: i/o timeout
The GitLab Container Registry ClusterIP (gitlab-registry service in ge-gitlab namespace) listens on port 5000, not 443. Cosign defaults to HTTPS on 443 when no port is specified.
Fix: Use REGISTRY_HOST=registry.ge.internal:5000 and build IMAGE_REF with explicit port. Must match kaniko's REGISTRY_HOST in build-image.yaml.
Layer 2 — HTTPS vs HTTP (protocol mismatch):
http: server gave HTTP response to HTTPS client
The --allow-insecure-registry flag allows self-signed TLS certificates, but does NOT downgrade to plain HTTP. The internal registry serves HTTP only.
Fix: Add --allow-http-registry flag to both cosign sign and cosign attest commands.
Layer 3 — Missing auth (access forbidden):
GET http://gitlab.ge.internal/jwt/auth?scope=...&service=container_registry: DENIED: access forbidden
Cosign reads registry credentials from ~/.docker/config.json. Without it, the GitLab registry JWT auth endpoint rejects the request. Kaniko sets up the same file in /kaniko/.docker/config.json.
Fix: Add auth setup step before signing:
mkdir -p ~/.docker
echo "{\"auths\":{\"${REGISTRY_HOST}\":{\"username\":\"${CI_REGISTRY_USER}\",\"password\":\"${CI_REGISTRY_PASSWORD}\"}}}" > ~/.docker/config.json
Additional fix — empty project name:
GitLab CI variables: block does NOT support bash-style ${VAR:-default} syntax. Using ${GE_PROJECT:-ge-bootstrap} in variables produced an empty string. Fix: Use CI_PROJECT_NAME (GitLab built-in) and build IMAGE_REF in the script: block where bash expansion works.
Working sign:image configuration (2026-04-11):
variables:
  REGISTRY_HOST: "registry.ge.internal:5000"
script:
  - IMAGE_REF="${REGISTRY_HOST}/growing-europe/${CI_PROJECT_NAME}:${CI_COMMIT_SHORT_SHA}"
  - mkdir -p ~/.docker
  - echo "{\"auths\":{\"${REGISTRY_HOST}\":{\"username\":\"${CI_REGISTRY_USER}\",\"password\":\"${CI_REGISTRY_PASSWORD}\"}}}" > ~/.docker/config.json
  - COSIGN_PASSWORD="" cosign sign --key "$COSIGN_PRIVATE_KEY" --yes --allow-insecure-registry --allow-http-registry "${IMAGE_REF}"
  - COSIGN_PASSWORD="" cosign attest --key "$COSIGN_PRIVATE_KEY" --yes --allow-insecure-registry --allow-http-registry --predicate sbom.json --type cyclonedx "${IMAGE_REF}"
Lesson: Internal registries compound three issues that don't exist with Docker Hub or GHCR: non-standard port, plain HTTP, and separate auth. Each layer masks the next — you must fix them in order (port → protocol → auth) or you'll never see the real error. Always check build-image.yaml for the working registry configuration and mirror it exactly in sign-image.yaml.
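The hand-escaped echo that writes ~/.docker/config.json breaks if the password contains a double quote or backslash. A possible alternative is to generate the file with a JSON serializer. This is a sketch that mirrors the structure used in the working configuration above; the function name is hypothetical.

```python
import json


def docker_config(registry_host: str, username: str, password: str) -> str:
    """Render the auths document with the same shape as the echo command."""
    config = {
        "auths": {
            registry_host: {"username": username, "password": password}
        }
    }
    # json.dumps handles quoting, so special characters survive intact.
    return json.dumps(config)
```

In a job, the result would be written to ~/.docker/config.json with the values taken from REGISTRY_HOST, CI_REGISTRY_USER, and CI_REGISTRY_PASSWORD.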
Agent-CI Bridge Issues
(Reserved for future learnings)