Kubernetes — Pitfalls

OWNER: gerco (dev), thijmen (staging), rutger (production)
ALSO_USED_BY: arjan, alex, tjitte
LAST_VERIFIED: 2026-03-26
GE_STACK_VERSION: k3s v1.34.x (Zone 1), UpCloud Managed K8s (Zones 2+3)


Overview

Known failure modes, sharp edges, and hard-won lessons from running Kubernetes
in GE. Every item here has caused real downtime or wasted real money.
This page grows organically — agents add items when they hit new pitfalls.


k3s ClusterIP Broken from Inside Pods

Severity: CRITICAL
Zone: 1 (k3s)

k3s ClusterIP 10.43.0.1:443 returns connection refused when called from
inside pods. This breaks any code that uses in-cluster Kubernetes client config.

IF: pod needs Kubernetes API access in Zone 1
THEN: mount kubeconfig and kubectl as hostPath volumes
THEN: do NOT use the default in-cluster service account token flow

volumes:
  - name: kubeconfig
    hostPath:
      path: /etc/rancher/k3s/k3s.yaml
      type: File
  - name: kubectl
    hostPath:
      path: /usr/local/bin/kubectl
      type: File
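
The volumes above need matching mounts in the container spec, plus a KUBECONFIG
env var pointing at the mounted file. A minimal sketch — container name, image,
and mount paths are illustrative, not taken from a real GE manifest:

```yaml
containers:
  - name: ge-example              # illustrative name
    image: ge-example:latest
    env:
      - name: KUBECONFIG
        value: /etc/rancher/k3s/k3s.yaml   # kubectl and client libraries read this
    volumeMounts:
      - name: kubeconfig
        mountPath: /etc/rancher/k3s/k3s.yaml
        readOnly: true
      - name: kubectl
        mountPath: /usr/local/bin/kubectl
        readOnly: true
```

Note that k3s.yaml typically sets the server address to https://127.0.0.1:6443,
which from inside a pod refers to the pod itself; depending on pod networking it
may need to be rewritten to the node's address.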

This does NOT affect Zones 2+3 (UpCloud Managed K8s).

ADDED_FROM: admin-ui-infrastructure-page-2026-02, k8s health endpoint unreachable


hostNetwork Causes Port Conflicts

Severity: HIGH
Zone: ALL

Setting hostNetwork: true binds the pod's ports directly to the node's network
namespace. During a rolling update, the old and new pods can land on the same
node; both then try to bind the same port, and the new pod fails to start until
the old one terminates.

ANTI_PATTERN: hostNetwork: true in pod spec
FIX: use ClusterIP Services + Ingress for external access
FIX: use NodePort for LAN access (Zone 1 only)
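
A minimal ClusterIP Service replacing a hostNetwork binding could look like this
(names, namespace, and ports are illustrative):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ge-example
  namespace: ge-example
spec:
  type: ClusterIP        # the default; no host port is ever bound
  selector:
    app: ge-example
  ports:
    - port: 80           # port used by Ingress / other pods
      targetPort: 8080   # containerPort in the pod spec
```

Rolling updates then overlap safely, because each pod gets its own IP and the
Service load-balances across whichever replicas are ready.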

ADDED_FROM: executor-scaling-2026-02, port 8080 conflict during rollout


Image Pull Failures (Zone 1)

Severity: MEDIUM
Zone: 1 (k3s)

k3s uses containerd directly. Images must be imported via k3s ctr images import.
If the import step is skipped, pods fail with ImagePullBackOff because there is
no registry to pull from.

IF: pod is in ImagePullBackOff in Zone 1
THEN: check if the image was imported
RUN: sudo k3s ctr images list | grep {image-name}

IF: image is missing
THEN: rebuild and import
RUN: docker build -t {image}:latest . && docker save {image}:latest | sudo k3s ctr images import -

CHECK: imagePullPolicy: Never or imagePullPolicy: IfNotPresent in Zone 1 manifests
ANTI_PATTERN: imagePullPolicy: Always in Zone 1 — there is no registry to pull from
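
In a Zone 1 container spec that check looks like the following sketch (image
name illustrative). Beware that with a :latest tag and no explicit policy,
Kubernetes defaults imagePullPolicy to Always — which always fails in Zone 1:

```yaml
containers:
  - name: ge-example
    image: ge-example:latest
    imagePullPolicy: Never   # image must already be imported via k3s ctr
```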

ADDED_FROM: executor-deployment-2026-02, missing image after rebuild


Resource Starvation

Severity: HIGH
Zone: ALL

If pods have no resource requests, the scheduler cannot make informed
placement decisions. If pods have no resource limits, a single pod
can consume all node resources and starve everything else.

IF: pods are being evicted or node is unresponsive
THEN: check resource usage
RUN: kubectl top pods --all-namespaces --sort-by=memory
RUN: kubectl top nodes

CHECK: every container has both resources.requests and resources.limits
CHECK: total requests across all pods do not exceed 80% of node capacity
READ_ALSO: wiki/docs/stack/kubernetes/manifests.md
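
A baseline sketch with both fields set (values are placeholders to be tuned per
service, not GE standards):

```yaml
resources:
  requests:             # what the scheduler reserves on the node
    cpu: 100m
    memory: 128Mi
  limits:               # hard ceiling; exceeding memory => OOMKilled
    cpu: 500m
    memory: 256Mi
```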

ADDED_FROM: redis-oom-2026-02, redis pod consumed all available memory


OOMKilled

Severity: HIGH
Zone: ALL

When a container exceeds its memory limit, the kernel OOM killer terminates it
and Kubernetes records the reason as OOMKilled. The pod restarts and, if it
keeps hitting the limit, enters CrashLoopBackOff.

IF: pod is OOMKilled
THEN: check the memory limit in the manifest
RUN: kubectl describe pod {pod-name} -n ge-{namespace} | grep -A 5 "Last State"

IF: memory limit is already generous
THEN: investigate the memory leak in the application code
THEN: do not just raise limits indefinitely

Common OOMKilled causes in GE:
- Python executor loading large LLM context into memory
- Node.js heap growth in long-running admin-ui processes
- Redis dataset exceeding configured maxmemory

ANTI_PATTERN: raising memory limits without investigating the leak
FIX: profile first, fix the leak, then set appropriate limits
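
For the Redis case above, one sketch is to keep the container limit comfortably
above Redis's own memory ceiling, so Redis evicts keys before the kernel kills
the process (values illustrative):

```yaml
# Redis config (illustrative): maxmemory 200mb, maxmemory-policy allkeys-lru
resources:
  limits:
    memory: 256Mi   # headroom above maxmemory for overhead and fragmentation
```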

ADDED_FROM: executor-oom-2026-02, agent session context loaded entire wiki


CrashLoopBackOff

Severity: MEDIUM
Zone: ALL

CrashLoopBackOff means the container starts, crashes, restarts, crashes again.
Kubernetes applies exponential backoff (10s, 20s, 40s... up to 5 minutes).

IF: pod is in CrashLoopBackOff
THEN: check logs from the PREVIOUS container instance
RUN: kubectl logs {pod-name} -n ge-{namespace} --previous

Common causes in GE:
- Missing environment variable or secret
- Redis connection refused (wrong port — must be 6381)
- Database migration not applied
- Python import error (module not in container image)
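
The first two causes are usually manifest-level. A sketch of wiring the Redis
port and a secret explicitly (names are assumptions, not real GE resources):

```yaml
containers:
  - name: ge-example
    image: ge-example:latest
    env:
      - name: REDIS_PORT
        value: "6381"                # GE Redis runs on 6381, not the default 6379
    envFrom:
      - secretRef:
          name: ge-example-secrets   # if absent: CreateContainerConfigError
```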

The dolly orchestrator spent 30 days in CrashLoopBackOff with 8509 restarts
before being decommissioned. Do not ignore CrashLoopBackOff.

ADDED_FROM: dolly-decommission-2026-03, 8509 restarts over 30 days


Python Module Caching

Severity: HIGH
Zone: 1 (k3s)

Python caches every imported module in sys.modules for the lifetime of the
process. Using kubectl cp to patch files in a running pod therefore has no
effect: the already-imported module code stays in memory.

ANTI_PATTERN: using kubectl cp to hot-patch Python code
FIX: ALWAYS rebuild the container image and redeploy
RUN: bash ge-ops/infrastructure/local/k3s/executor/build-executor.sh
RUN: kubectl rollout restart deployment/ge-executor -n ge-agents

ADDED_FROM: executor-hotpatch-failure-2026-02, patched file had no effect


Stale Consumer Groups (Redis Streams)

Severity: MEDIUM
Zone: ALL

When executor pods restart, old consumer group entries may remain.
The ge-orchestrator has 2-phase orphan claiming and stale consumer cleanup,
but manual intervention may be needed.

IF: messages stuck in Redis Streams pending list
THEN: check pending entries
RUN: redis-cli -p 6381 -a $REDIS_PASSWORD XPENDING triggers.{agent} ge-executor-group - + 10

IF: entries are stuck with a dead consumer
THEN: claim them
RUN: redis-cli -p 6381 -a $REDIS_PASSWORD XCLAIM triggers.{agent} ge-executor-group {new-consumer} 300000 {message-id}

ADDED_FROM: orchestrator-deployment-2026-03, orphaned messages after pod restart


Secret Sync Delays

Severity: MEDIUM
Zone: ALL

External Secrets Operator syncs from Vault on a schedule (default 1h).
When a secret is rotated in Vault, pods may use the old value until
the next sync cycle.

IF: secret was rotated but pods still have old value
THEN: force sync
RUN: kubectl annotate externalsecret ge-{service}-secrets -n ge-{namespace} force-sync=$(date +%s) --overwrite

IF: need faster sync
THEN: reduce refreshInterval on the ExternalSecret (minimum 15m recommended)
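
On the ExternalSecret itself that setting looks like the sketch below (resource
and store names are illustrative):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: ge-example-secrets
  namespace: ge-example
spec:
  refreshInterval: 15m        # down from the 1h default; recommended minimum
  secretStoreRef:
    name: vault-backend       # illustrative SecretStore name
    kind: ClusterSecretStore
  target:
    name: ge-example-secrets  # Kubernetes Secret that pods consume
```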

ADDED_FROM: api-key-rotation-2026-03, pods used expired API key for 45 minutes


Traefik Ingress Not Updating

Severity: LOW
Zone: 1 (k3s)

Traefik in k3s occasionally takes 30-60 seconds to pick up new Ingress resources.
This is normal but can be confusing during debugging.

IF: Ingress created but route returns 404
THEN: wait 60 seconds
THEN: check Traefik logs
RUN: kubectl logs -n kube-system -l app.kubernetes.io/name=traefik --tail=50
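
For reference, a minimal Ingress that k3s Traefik picks up without extra
annotations (host, service name, and namespace are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ge-example
  namespace: ge-example
spec:
  ingressClassName: traefik
  rules:
    - host: example.ge.local      # illustrative hostname
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ge-example  # must match an existing Service
                port:
                  number: 80
```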

ADDED_FROM: wiki-ingress-2026-02, 404 resolved after 45-second wait


etcd Upgrade Path

Severity: HIGH
Zone: 1 (k3s)

k3s v1.34+ bundles etcd 3.6. There is NO safe direct upgrade path from
etcd 3.5 to 3.6 — you MUST upgrade to v3.5.26 first.

IF: upgrading k3s from v1.33 or earlier
THEN: ensure etcd is at v3.5.26 before upgrading k3s
THEN: back up etcd data before the upgrade

ADDED_FROM: k3s-release-notes-2026-03, etcd data loss risk


Cross-References

READ_ALSO: wiki/docs/stack/kubernetes/index.md
READ_ALSO: wiki/docs/stack/kubernetes/operations.md
READ_ALSO: wiki/docs/stack/kubernetes/security.md
READ_ALSO: wiki/docs/stack/kubernetes/checklist.md