CI/CD Infrastructure Pitfalls & Learnings

STATUS: ACTIVE (updated during CI/CD pipeline implementation)
OWNER: Alex (Infrastructure), Koen (Quality)
CATEGORY: Infrastructure, CI/CD

This page captures hard-won learnings from setting up GE's CI/CD pipeline. Every issue encountered during implementation is documented here so no agent or session repeats the same mistakes.

Registry Issues

PITFALL-CICD-001: GitLab Registry pods stuck in Init:0/2

Date: 2026-04-01
Symptom: All 3 gitlab-registry pods stuck in Init:0/2 for 13+ days.
Root cause: MountVolume.SetUp failed for volume "registry-secrets" : secret "gitlab-registry-storage" not found
Analysis: The GitLab Helm chart expects a secret named gitlab-registry-storage containing the storage backend credentials (S3/Minio). This secret was either never created during the initial GitLab deployment or was deleted afterwards.
Fix: Created the gitlab-registry-storage secret with the Minio S3 config:

# $'...' makes bash turn \n into real newlines; inside "..." the \n would stay literal
kubectl create secret generic gitlab-registry-storage -n ge-gitlab \
  --from-literal=storage=$'s3:\n  accesskey: <from gitlab-minio-secret>\n  secretkey: <from gitlab-minio-secret>\n  bucket: registry\n  regionendpoint: http://gitlab-minio-svc.ge-gitlab.svc.cluster.local:9000\n  secure: false\n  v4auth: true'

Then delete the stuck pods: kubectl delete pods -n ge-gitlab -l app=registry
Result: Registry pods came up within 20 seconds. 2 replicas running.
Lesson: Always verify ALL init container dependencies after a Helm upgrade. A missing secret blocks pods indefinitely, and kubectl get pods shows only the Init:0/2 status; you must run kubectl describe pod to see the mount error. The Helm value registry.storage.secret defines the expected secret name.

Runner Issues

PITFALL-CICD-002: Runners online but not picking up jobs (tag mismatch)

Date: 2026-04-01
Symptom: Runners are registered and online, but CI jobs stay in "pending" indefinitely.
Root cause: CI template jobs had tags: [docker, linux] but the runners had NO tags. GitLab dispatches a tagged job only to runners that carry all of the job's tags.
Fix: Added tags to all runners via the API:

curl -X PUT "http://gitlab-webservice-default:8080/api/v4/runners/$ID" \
  -H "PRIVATE-TOKEN: $TOKEN" -d "tag_list=docker,linux"

Lesson: Always verify that runner tags match job tags. kubectl get pods shows the runners as Running and gitlab-runner verify shows them as connected, yet they silently skip tagged jobs they do not match. Check the job detail via the API to see its tag_list.
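The matching rule behind this pitfall can be sketched in a few lines of Python. This is a simplified model for reasoning about pending jobs, not GitLab's implementation; the function name runner_accepts and the run_untagged parameter are illustrative names chosen for this sketch.

```python
def runner_accepts(job_tags, runner_tags, run_untagged=False):
    """Return True if a runner with runner_tags may pick up a job with job_tags.

    Simplified model: a tagged job needs all of its tags on the runner;
    an untagged job is taken only by runners that opt in to untagged work.
    """
    job = set(job_tags)
    if not job:
        return run_untagged
    # Subset check: every job tag must be present on the runner.
    return job <= set(runner_tags)


# The situation from this pitfall: tagged job, runner with no tags.
print(runner_accepts(["docker", "linux"], []))
# After the fix: runner carries both tags.
print(runner_accepts(["docker", "linux"], ["docker", "linux"]))
```

Running the same check against the tag_list of each registered runner shows at a glance which runners are eligible for a stuck job.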

PITFALL-CICD-003: Runner DNS resolution failure inside pods

Date: 2026-04-01
Symptom: Runners log dial tcp: lookup gitlab.ge.internal on 10.43.0.10:53: server misbehaving
Root cause: CoreDNS NodeHosts did not include gitlab.ge.internal (it only had gitlab.internal.ge-ops.local), while the Helm chart registered the runners with the gitlab.ge.internal URL.
Fix: Patched the CoreDNS configmap to add 10.43.71.244 gitlab.ge.internal registry.ge.internal minio.ge.internal kas.ge.internal to NodeHosts, then restarted CoreDNS and the runners.
Lesson: When GitLab is deployed via Helm with custom hosts (.global.hosts.domain), ALL in-cluster services must be able to resolve those hostnames. CoreDNS NodeHosts is the place to add them. This is the same k3s networking issue documented in memory: ClusterIP doesn't work from the host, and DNS doesn't work from pods unless explicitly configured.

Pipeline Issues

PITFALL-CICD-004: First pipeline run reveals 1,279 lint errors

Date: 2026-04-01
Symptom: lint:python job fails with 1,279 ruff errors, 686 of them auto-fixable.
Root cause: The codebase had never been run through the full CI pipeline before. Legacy code, dead imports, and unused variables accumulated over months.
Resolution strategy: NOT a blocker. The pipeline correctly catches issues. Fix in phases:
1. Auto-fix the 686 fixable errors with ruff check --fix
2. Address the remaining 593 manually, or suppress with # noqa where appropriate
3. Add --select to the ruff config to start with critical rules only, then expand over time
Lesson: First CI activation on an existing codebase ALWAYS produces a flood of findings. Plan for a "clean-up sprint" and don't block all development. Use allow_failure: true initially, then tighten.

PITFALL-CICD-005: YAML comments inside script blocks break GitLab CI parsing

Date: 2026-04-01
Symptom: Pipeline fails with "script config should be a string or a nested array of strings up to 10 levels deep". No jobs are created.
Root cause: # comments placed inside the script: array. GitLab requires every script: entry to be a string, and a list item that holds only a comment (- # note) parses as an empty (null) entry instead of a string, which is the likely trigger here.
Fix: Remove all # comments from inside script: blocks. Use echo "description" lines instead.
Lesson: GitLab CI script: arrays are strict: strings only. Put comments ABOVE the job definition, not inside the script block. Also use the /api/v4/projects/{id}/ci/lint endpoint to validate the YAML before pushing.

PITFALL-CICD-006: Hardcoded repo paths break in CI runners

Date: 2026-04-01
Symptom: 43 tests fail in CI with AssertionError: main.py not found at /home/claude/ge-bootstrap/... but pass locally.
Root cause: Test files hardcode /home/claude/ge-bootstrap as the repo root. The CI runner checks out at /builds/growing-europe/ge-bootstrap.
Fix: Made tests/conftest.py detect the repo root dynamically:
1. Check the CI_PROJECT_DIR env var (GitLab CI sets this)
2. Walk up from the test file looking for the CLAUDE.md marker
3. Fall back to /home/claude/ge-bootstrap
All test files were updated to import tests.conftest.GE_ROOT_PATH.
Lesson: NEVER hardcode absolute paths in tests. Always use dynamic detection or fixtures. CI runners have different checkout paths than local dev. Use the conftest pattern: detect the root once, share it via a fixture.
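The three-step detection described for tests/conftest.py could look roughly like the following. This is a sketch of the pattern, not the actual file: the function name detect_repo_root and its env parameter are illustrative, while CI_PROJECT_DIR, CLAUDE.md, the fallback path, and GE_ROOT_PATH come from the text above.

```python
import os
from pathlib import Path

FALLBACK_ROOT = Path("/home/claude/ge-bootstrap")  # local-dev checkout named in the text
MARKER = "CLAUDE.md"  # marker file named in the text


def detect_repo_root(start, env=None):
    """Resolve the repo root: CI variable first, then marker walk, then fallback."""
    env = os.environ if env is None else env

    # 1. GitLab CI exports CI_PROJECT_DIR for every job.
    ci_dir = env.get("CI_PROJECT_DIR")
    if ci_dir:
        return Path(ci_dir)

    # 2. Walk up from the starting path until a directory holds the marker.
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / MARKER).is_file():
            return candidate

    # 3. Last resort: the known local-dev path.
    return FALLBACK_ROOT


# In tests/conftest.py this would be evaluated once at import time, e.g.:
#   GE_ROOT_PATH = detect_repo_root(__file__)
```

Passing env explicitly keeps the function testable without touching the real environment.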

PITFALL-CICD-007: Build stage fails on monorepo without pyproject.toml

Date: 2026-04-01
Symptom: build:backend fails: "Source does not appear to be a Python project: no pyproject.toml or setup.py"
Root cause: The CI template assumed a packaged Python project built with python -m build. GE is a monorepo.
Fix: Replaced the build step with import verification: python -c "import ge_orchestrator".
Lesson: Monorepos don't need python -m build. The "build" step should verify that imports resolve and dependencies install, not package a wheel.

PITFALL-CICD-008: PEP 668 blocks pip install in Python 3.12 slim containers

Date: 2026-04-01
Symptom: pip install ruff fails with an "externally-managed-environment" error in CI jobs using python:3.12-slim.
Root cause: PEP 668 lets a distribution mark its system Python as externally managed. pip then refuses to install into it so that it cannot break OS packages.
Fix: Add --break-system-packages to ALL pip install commands in CI, or install into a virtual environment created with python -m venv (slower in CI).
Lesson: When upgrading runner images to Python 3.12+, every pip install needs --break-system-packages. Search-and-replace across all CI templates. This will affect EVERY new project.

PITFALL-CICD-009: Custom runner image not pullable by k8s executor

Date: 2026-04-01
Symptom: All pipeline jobs fail after switching to the custom ge-ci-runner:latest image. Jobs can't start.
Root cause: The GitLab k8s executor creates a fresh pod per job. These pods try to pull the image from a registry. A local-only image (k3s ctr images import) is not pullable by new pods, only by the host containerd.
Fix: Push the image to GitLab's built-in container registry (registry.ge.internal) and use the full registry path in the CI template. Steps:
1. docker login registry.ge.internal -u root -p <password>
2. docker tag ge-ci-runner:latest registry.ge.internal/growing-europe/ge-bootstrap/ci-runner:latest
3. docker push registry.ge.internal/growing-europe/ge-bootstrap/ci-runner:latest
4. Update the CI template: image: registry.ge.internal/growing-europe/ge-bootstrap/ci-runner:latest
Lesson: k8s executor ≠ local Docker. Every CI job is a fresh pod that must pull its image. Always use a registry path. Also needed: Docker daemon insecure-registries config for non-TLS registries, and an /etc/hosts entry for custom registry hostnames.

PITFALL-CICD-011: Shell executor runner picking up k8s-targeted jobs

Date: 2026-04-01
Symptom: Pipeline jobs fail with ruff: command not found despite using the custom image.
Root cause: A host-installed GitLab runner (FK-DEV, shell executor) was also registered with the docker,linux tags. It picked up jobs meant for the k8s executor runners and ran them directly on the host (where ruff isn't installed).
Fix: Pause the shell executor runner via the API: PUT /api/v4/runners/14 -d paused=true
Lesson: Multiple runner types (shell, docker, k8s) with overlapping tags cause unpredictable job assignment. Either use distinct tags per executor type, or pause the runners you don't want active.

PITFALL-CICD-012: k3s restart re-registers runners without tags

Date: 2026-04-01
Symptom: After a k3s restart, pipeline jobs stay pending indefinitely.
Root cause: When k3s restarts, runner pods re-register with GitLab as NEW runners (new IDs) without the docker,linux tags that were set via the API on the old runners.
Fix: Re-add tags via the API after every k3s restart, or configure tags in the runner Helm values/ConfigMap so they persist.
Lesson: Runner tags set via the API are per-registration, not per-config. If the runner re-registers (pod restart, k3s restart), the tags are lost. Set tags in the runner template config for persistence.

PITFALL-CICD-013: Pipeline progress check intervals

Date: 2026-04-01
Learning: Never wait 300s between pipeline checks. Use 15s intervals until the first fully successful build. After that, set the interval to the observed pipeline duration plus a 30s buffer.
Rule: Until a project's CI pipeline has passed completely at least once, check every 15 seconds. After the first success, calibrate to real timing.
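The interval rule is simple enough to encode directly, which keeps agents from falling back to a long default wait. A minimal sketch, assuming the caller tracks whether the pipeline has ever passed and how long the last full run took; the function and constant names are illustrative.

```python
FIRST_PASS_INTERVAL_S = 15  # polling interval before the first full success
BUFFER_S = 30               # safety margin added to the observed duration


def check_interval(has_passed_once, observed_duration_s=None):
    """Return how many seconds to wait before the next pipeline check."""
    if not has_passed_once or observed_duration_s is None:
        # No calibrated timing yet: poll quickly.
        return FIRST_PASS_INTERVAL_S
    return observed_duration_s + BUFFER_S


print(check_interval(False))       # before the first green pipeline
print(check_interval(True, 240))   # pipeline known to take about 240s
```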

PITFALL-CICD-014: ESLint passes locally, tsc fails in CI

Date: 2026-04-01
Symptom: The lint:typescript stage runs both ESLint AND tsc --noEmit. ESLint passed after fixes, but tsc found 47 type errors (null vs undefined, Drizzle ORM enum mismatches).
Root cause: The lint:typescript stage chains two commands: ESLint for style and tsc for types. A local ESLint pass ≠ a CI stage pass.
Lesson: Always run BOTH npx eslint . AND npx tsc --noEmit locally before declaring lint:typescript fixed.

PITFALL-CICD-015: pyright needs libatomic.so.1 in python:3.12-slim

Date: 2026-04-02
Symptom: types:python job fails with libatomic.so.1: cannot open shared object file. pyright installs Node via nodeenv, which requires libatomic.
Fix: Add libatomic1 to apt-get install in the runner Dockerfile.
Lesson: pyright-python downloads its own Node binary. python:3.12-slim does not include libatomic1 (needed by Node on x86_64). Always test new tools inside the actual runner image, not just on the host.

PITFALL-CICD-016: pip install to user site-packages not on PATH in runner image

Date: 2026-04-02
Symptom: vulture: command not found / checkov: command not found despite pip install succeeding in the CI job script.
Root cause: The runner image runs as UID 1000 (non-root). pip install --break-system-packages installs to /home/runner/.local/bin, which is not on PATH.
Fix: Pre-install ALL Python tools in the Dockerfile (before USER runner) so they land in /usr/local/bin.
Lesson: NEVER pip install at runtime in CI jobs. Pre-install everything in the runner image. Runtime installs waste time AND may land in unpredictable locations. The runner image is the single source of truth for tool versions.

PITFALL-CICD-017: Stryker needs `ps` command (procps)

Date: 2026-04-02
Symptom: mutation:typescript fails with Error: spawn ps ENOENT.
Root cause: Stryker uses ps to manage child processes. node:20-slim does not include procps.
Fix: apt-get install -y procps in the CI job script (or use the full node:20 image).
Lesson: -slim images are minimal. Always verify that a tool's runtime dependencies are available, not just the tool itself.

PITFALL-CICD-018: verify:health can't reach ClusterIP across k3s namespaces

Date: 2026-04-02
Symptom: Health check fails with Connection refused when targeting admin-ui.ge-system.svc.cluster.local from a CI runner pod in the ge-gitlab namespace.
Root cause: Same k3s ClusterIP networking issue documented in main memory. CI runner pods cannot reach services in other namespaces via ClusterIP.
Fix: Use the host IP (192.168.1.85) with a Host header for Traefik routing instead of the ClusterIP.
Lesson: k3s single-node ClusterIP is unreliable across namespaces. For CI jobs that need to reach application services, use the host IP + Traefik ingress. Document the Host header needed for each service.

PITFALL-CICD-019: Green line bias: || true hides real failures

Date: 2026-04-02 (incident from 2026-04-01)
Symptom: Pipeline shows all green, but multiple stages are swallowing errors with || true or 2>&1 || true. Real failures (Playwright test failures, Stryker ps ENOENT) are invisible.
Rule: NEVER use || true on the primary tool invocation in a CI stage. Use it ONLY for optional/advisory steps (e.g., artifact upload). If a stage's primary tool fails, the stage MUST fail. allow_failure: true is acceptable during initial rollout but must be removed within one sprint.
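One way to enforce this rule is a small audit that lists every command ending in || true so a reviewer can decide whether each one is a primary tool or an advisory step. The following is a line-based sketch, not a YAML parser; the function name is illustrative, and it deliberately flags all occurrences rather than guessing which ones are legitimate.

```python
import re

# A command whose last token sequence is "|| true" (this also covers "2>&1 || true").
SWALLOW = re.compile(r"\|\|\s*true$")


def find_swallowed_failures(ci_yaml_text):
    """Return (line_number, line) pairs for commands that end in '|| true'."""
    hits = []
    for number, line in enumerate(ci_yaml_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip comment lines
        # Drop a closing quote so quoted script entries are matched too.
        candidate = stripped.rstrip("'\"").rstrip()
        if SWALLOW.search(candidate):
            hits.append((number, stripped))
    return hits


sample = """test:e2e:
  script:
    - npx playwright test || true
    - npx stryker run 2>&1 || true
    - echo done
"""
for number, line in find_swallowed_failures(sample):
    print(f"line {number}: {line}")
```

Run against the CI template in a pre-merge check, the output gives a short review list instead of a silent green pipeline.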

PITFALL-CICD-020: Excluding tests instead of running them properly

Date: 2026-04-02
Symptom: After removing || true from test:unit:frontend, 55 integration tests failed (ECONNREFUSED PostgreSQL).
Tempting fix: Exclude integration tests from the unit stage. This hides the problem.
Correct fix: Exclude them from the unit stage AND create a test:integration:frontend stage with a PostgreSQL service container. Tests must run somewhere; never silently drop them.
Rule: When moving tests out of a stage, always verify they run in another stage. Run grep -r "vitest\|pytest" config/gitlab-ci-template.yaml to audit coverage.

PITFALL-CICD-021: Checkov --check conflicts with skip-check in .checkov.yaml

Date: 2026-04-02
Symptom: checkov: error: The check ids specified for '--check' and '--skip-check' must be mutually exclusive
Root cause: Checkov auto-discovers .checkov.yaml in the working directory. If .checkov.yaml has skip-check: and the CLI uses --check, they conflict.
Fix: Use --check (positive list) OR .checkov.yaml with skip-check (negative list). Never both. For app-level scanning, a positive list of ~30 relevant checks is cleaner than trying to skip 80+ cluster-level checks.
Also learned: In this setup, Checkov exited 0 even with FAILED checks until --hard-fail-on was set. Always use --hard-fail-on HIGH or --hard-fail-on CRITICAL to get proper exit codes.

PITFALL-CICD-023: GitLab CI variables don't support bash ${VAR:-default} syntax

Date: 2026-04-02
Symptom: CI job variable SMOKE_URL: "${GE_SMOKE_URL:-http://192.168.1.85}" resolves to the literal string, including the :- syntax. Playwright receives the raw string, not the resolved value.
Root cause: GitLab CI variable expansion only supports $VAR or ${VAR}. Bash default syntax (${VAR:-default}) is NOT supported in the variables: block; it is passed literally.
Fix: Use $GE_SMOKE_URL (without a default) and ensure the variable is always set in the generated .gitlab-ci.yml. Handle defaults in the script block, where bash actually runs.
Lesson: GitLab CI variables: is NOT bash. Only $VAR expansion works. Defaults must be handled in script: blocks.
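Because the bad value is passed through without any warning, a pre-push check that looks for bash-style defaults in KEY: value lines can catch it early. This is a line-based sketch with an illustrative function name; it matches YAML mapping lines only, so defaults used inside script: list items (lines starting with "- ") are left alone.

```python
import re

# A "KEY: value" mapping line whose value contains ${NAME:-...}.
BASH_DEFAULT_IN_VALUE = re.compile(
    r"^\s*(?P<key>[A-Za-z_][A-Za-z0-9_]*):\s+.*\$\{[A-Za-z_][A-Za-z0-9_]*:-"
)


def find_bash_defaults(ci_yaml_text):
    """Return the keys whose values use ${VAR:-default}, which GitLab will not expand."""
    keys = []
    for line in ci_yaml_text.splitlines():
        match = BASH_DEFAULT_IN_VALUE.match(line)
        if match:
            keys.append(match.group("key"))
    return keys


sample = """variables:
  SMOKE_URL: "${GE_SMOKE_URL:-http://192.168.1.85}"
  BASE_URL: "$GE_SMOKE_URL"
script:
  - export SMOKE_URL="${GE_SMOKE_URL:-http://192.168.1.85}"
"""
print(find_bash_defaults(sample))
```

The last line of the sample shows where the default belongs: inside script:, where bash performs the expansion.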

PITFALL-CICD-024: Non-root containers can't write /etc/hosts

Date: 2026-04-02
Symptom: DAST containers (ZAP, Nuclei) fail with /etc/hosts: Permission denied when trying to add hostname entries.
Root cause: Security-focused images run as non-root by default. /etc/hosts is owned by root.
Fix: Don't modify /etc/hosts. Use IP addresses directly as scan targets. DAST tools don't need hostname routing; they scan any URL.
Lesson: Never assume you can write to system files in CI containers. Use environment variables and IP addresses instead.

PITFALL-CICD-025: Nuclei v3 changed -json flag to -jsonl

Date: 2026-04-02
Symptom: flag provided but not defined: -json
Fix: Use -jsonl (JSON Lines format) instead of -json.
Lesson: Always check tool changelogs when pulling :latest tags. Pin versions in production.

PITFALL-CICD-026: OWASP ZAP image moved from Docker Hub to GHCR

Date: 2026-04-02
Symptom: ErrImagePull for owasp/zap2docker-stable:latest
Fix: Use ghcr.io/zaproxy/zaproxy:stable. Import to k3s: docker save ghcr.io/zaproxy/zaproxy:stable | k3s ctr images import -
Lesson: Check image source URLs periodically. Docker Hub deprecations happen silently.

PITFALL-CICD-022: Kaniko can't resolve registry.ge.internal

Date: 2026-04-02
Symptom: dial tcp: lookup registry.ge.internal on 10.43.0.10:53: no such host
Root cause: CoreDNS NodeHosts had registry.internal.ge-ops.local but not registry.ge.internal. The CI_REGISTRY variable uses registry.ge.internal.
Fix: Add 10.43.148.219 registry.ge.internal to the CoreDNS configmap NodeHosts. Also add --insecure --skip-tls-verify to kaniko (the internal registry uses self-signed certs).
Lesson: Every new hostname used in CI jobs must be in CoreDNS NodeHosts. The Helm chart registers services with the .ge.internal domain, but CoreDNS was configured with .internal.ge-ops.local. Keep a single domain convention.
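Since the same class of failure appeared twice (runners, then Kaniko), it may help to check the NodeHosts content against the list of hostnames the CI jobs use before a job fails on it. A minimal sketch, assuming NodeHosts follows the usual hosts-file layout of an IP followed by one or more names; the function name and the way the text is obtained are left to the caller.

```python
def missing_hostnames(node_hosts_text, required):
    """Return the required hostnames that have no entry in a hosts-format text."""
    known = set()
    for line in node_hosts_text.splitlines():
        # Drop trailing comments, then split "IP name1 name2 ..." on whitespace.
        fields = line.split("#", 1)[0].split()
        known.update(fields[1:])  # the first field is the IP address
    return [name for name in required if name not in known]


node_hosts = """10.43.148.219 registry.internal.ge-ops.local
10.43.71.244 gitlab.ge.internal kas.ge.internal
"""
print(missing_hostnames(node_hosts, ["gitlab.ge.internal", "registry.ge.internal"]))
```

An empty result means every hostname the pipeline relies on is resolvable through NodeHosts.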

PITFALL-CICD-027: ArgoCD repo secret needs `type: git`

Date: 2026-04-02
Symptom: ArgoCD repo connection "Failed" with authentication required: HTTP Basic: Access denied.
Root cause: The ArgoCD repo secret (repo-694135144) had an empty type field. Without type: git, ArgoCD stores the credentials but never passes them to git operations.
Fix: kubectl patch secret repo-694135144 -n argocd --type='json' -p='[{"op":"replace","path":"/data/type","value":"Z2l0"}]' (Z2l0 = base64 of "git"). Restart argocd-repo-server.
Lesson: ArgoCD repo secrets have 5 required fields: url, username, password, insecure, and type. A missing type fails silently: credentials are stored but never used.
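Values under a Secret's data key must be base64-encoded, which is why the patch carries Z2l0 rather than git. The sketch below builds the same JSON patch programmatically so the encoding step is not done by hand; the helper name is illustrative and the output is meant to be passed to kubectl patch --type=json -p.

```python
import base64
import json


def secret_field_patch(field, value):
    """Build a JSON-patch document that replaces one base64-encoded Secret field."""
    encoded = base64.b64encode(value.encode("utf-8")).decode("ascii")
    return json.dumps(
        [{"op": "replace", "path": f"/data/{field}", "value": encoded}]
    )


print(secret_field_patch("type", "git"))
```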

PITFALL-CICD-028: Semgrep Docker detection blocks CI scans

Date: 2026-04-02
Symptom: Exception: Detected Docker environment without a code volume. Semgrep refuses to scan.
Root cause: Semgrep's returntocorp/semgrep:latest image detects that it is in Docker and demands -v "${PWD}:/src". In k8s CI runners there is no Docker volume mount; the code is git-cloned into the pod.
Fix: Use semgrep scan ... . (explicit path argument) and set SEMGREP_SEND_METRICS=off. The . tells semgrep where the code is.
Discovery: This bug was hidden by || true on the semgrep invocation. The crash detection pattern from W1 exposed it: semgrep was NEVER actually scanning in any previous pipeline run.
Lesson: || true on security tools doesn't just hide crashes; it can hide tools that never ran at all. Always verify that tools produce their expected output file.
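Verifying the expected output file can be a one-function gate at the end of the job. A minimal sketch, assuming each tool writes a report to a known path; the function name and the example file name semgrep-report.json are hypothetical.

```python
from pathlib import Path


def report_is_present(path, min_bytes=1):
    """Return True only if the report exists as a file and is not empty."""
    report = Path(path)
    return report.is_file() and report.stat().st_size >= min_bytes


# Possible use at the end of a scan job (file name is a placeholder):
#   import sys
#   sys.exit(0 if report_is_present("semgrep-report.json") else 1)
```

A non-zero exit from this check fails the stage even when the tool itself exited cleanly without scanning anything.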

PITFALL-CICD-029: Cosign sign:image fails against internal registry (3 layered issues)

Date: 2026-04-11
Symptom: The sign:image job fails because cosign is unable to sign container images.
Root cause: Three compounding issues, each masked by the previous one:

Layer 1: Wrong port (i/o timeout)
dial tcp 10.43.148.219:443: i/o timeout
The GitLab Container Registry ClusterIP (gitlab-registry service in the ge-gitlab namespace) listens on port 5000, not 443. Cosign defaults to HTTPS on 443 when no port is specified.
Fix: Use REGISTRY_HOST=registry.ge.internal:5000 and build IMAGE_REF with the explicit port. It must match kaniko's REGISTRY_HOST in build-image.yaml.

Layer 2: HTTPS vs HTTP (protocol mismatch)
http: server gave HTTP response to HTTPS client
The --allow-insecure-registry flag allows self-signed TLS certificates, but does NOT downgrade to plain HTTP. The internal registry serves HTTP only.
Fix: Add the --allow-http-registry flag to both the cosign sign and cosign attest commands.

Layer 3: Missing auth (access forbidden)
GET http://gitlab.ge.internal/jwt/auth?scope=...&service=container_registry: DENIED: access forbidden
Cosign reads registry credentials from ~/.docker/config.json. Without it, the GitLab registry JWT auth endpoint rejects the request. Kaniko sets up the same file in /kaniko/.docker/config.json.
Fix: Add an auth setup step before signing:

mkdir -p ~/.docker
echo "{\"auths\":{\"${REGISTRY_HOST}\":{\"username\":\"${CI_REGISTRY_USER}\",\"password\":\"${CI_REGISTRY_PASSWORD}\"}}}" > ~/.docker/config.json

Additional fix (empty project name): The GitLab CI variables: block does NOT support bash-style ${VAR:-default} syntax. Using ${GE_PROJECT:-ge-bootstrap} in variables produced an empty string.
Fix: Use CI_PROJECT_NAME (GitLab built-in) and build IMAGE_REF in the script: block, where bash expansion works.

Working sign:image configuration (2026-04-11):

variables:
  REGISTRY_HOST: "registry.ge.internal:5000"
script:
  - IMAGE_REF="${REGISTRY_HOST}/growing-europe/${CI_PROJECT_NAME}:${CI_COMMIT_SHORT_SHA}"
  - mkdir -p ~/.docker
  - echo "{\"auths\":{\"${REGISTRY_HOST}\":{\"username\":\"${CI_REGISTRY_USER}\",\"password\":\"${CI_REGISTRY_PASSWORD}\"}}}" > ~/.docker/config.json
  - COSIGN_PASSWORD="" cosign sign --key "$COSIGN_PRIVATE_KEY" --yes --allow-insecure-registry --allow-http-registry "${IMAGE_REF}"
  - COSIGN_PASSWORD="" cosign attest --key "$COSIGN_PRIVATE_KEY" --yes --allow-insecure-registry --allow-http-registry --predicate sbom.json --type cyclonedx "${IMAGE_REF}"

Lesson: Internal registries compound three issues that don't exist with Docker Hub or GHCR: non-standard port, plain HTTP, and separate auth. Each layer masks the next — you must fix them in order (port → protocol → auth) or you'll never see the real error. Always check build-image.yaml for the working registry configuration and mirror it exactly in sign-image.yaml.
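The echo-based auth step works as long as the credentials contain no double quotes or backslashes; a value with either would produce invalid JSON. Where Python is available in the signing image (an assumption, since the image used by sign:image is not described here), the same file can be written with a JSON serializer. The helper name and the home parameter are illustrative; the auths/username/password layout mirrors the echo command above.

```python
import json
from pathlib import Path


def write_docker_config(registry_host, username, password, home=None):
    """Write <home>/.docker/config.json with credentials for one registry."""
    home = Path(home) if home is not None else Path.home()
    config_dir = home / ".docker"
    config_dir.mkdir(parents=True, exist_ok=True)

    payload = {
        "auths": {
            registry_host: {"username": username, "password": password}
        }
    }
    config_path = config_dir / "config.json"
    # json.dumps escapes quotes and backslashes that a hand-built string would not.
    config_path.write_text(json.dumps(payload))
    config_path.chmod(0o600)  # credentials file: owner read/write only
    return config_path
```

In a job this would be called with os.environ["REGISTRY_HOST"], os.environ["CI_REGISTRY_USER"], and os.environ["CI_REGISTRY_PASSWORD"] before the cosign commands run.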

Agent-CI Bridge Issues

(Reserved for future learnings)