
Infrastructure Pitfalls

Redis Port

WRONG: Assuming the default port 6379
RIGHT: Read config/ports.yaml for the actual port (currently 6381 as of 2026-02-14)
IMPACT: Connection refused; agent triggers fail silently
AUTHORITY: config/ports.yaml is the ONLY source of truth for port assignments
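A lookup can be scripted instead of hardcoded. This is a sketch only: it assumes a flat `redis: 6381`-style mapping, which may not match the real layout of config/ports.yaml.

```shell
# redis_port FILE -- print the port assigned to redis in a ports file.
# ASSUMPTION: a flat "redis: 6381"-style YAML mapping; adjust the key
# and parsing to match the real config/ports.yaml.
redis_port() {
  awk -F': *' '$1 == "redis" { print $2 }' "$1"
}

# Example (hypothetical file contents):
#   redis_port config/ports.yaml
```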

k3s ClusterIP

WRONG: Using 10.43.0.1:443 from inside pods
RIGHT: Use service DNS (e.g., admin-ui.ge-system.svc.cluster.local)
IMPACT: Connection refused from inside pods. ClusterIP is BROKEN on this cluster.
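The in-cluster name always follows the same `service.namespace.svc.cluster.local` pattern, so it can be derived rather than remembered; a trivial helper (names in the example are the ones from this page):

```shell
# svc_dns SERVICE NAMESPACE -- print the in-cluster DNS name to use
# instead of a raw ClusterIP.
svc_dns() {
  echo "$1.$2.svc.cluster.local"
}

# svc_dns admin-ui ge-system
#   -> admin-ui.ge-system.svc.cluster.local
```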

DNS from Executor Pods

ISSUE: DNS resolution to admin-ui is unreliable from executor pods
WORKAROUND: Host cron + scripts/k8s-health-dump.sh -> public/k8s-health.json
IMPACT: Agents trying to curl the discussion API fail with "Could not resolve host"
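The host-side cron entry might look like the following; only the script name comes from this page, while the schedule and repository path are assumptions.

```
# Host crontab entry (sketch). Refresh the health dump so agents can
# read public/k8s-health.json instead of resolving admin-ui in-pod.
# ASSUMPTIONS: 5-minute schedule, /path/to/repo placeholder.
*/5 * * * * /path/to/repo/scripts/k8s-health-dump.sh
```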

Code Deployment

WRONG: kubectl cp files into running pods
RIGHT: Rebuild the executor image with ge-ops/infrastructure/local/k3s/executor/build-executor.sh, then kubectl rollout restart deployment/ge-executor -n ge-agents
IMPACT: kubectl cp changes are lost on pod restart. Silent regression.
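The rebuild-and-restart cycle can be wrapped in one helper so neither step is skipped. A sketch using the paths from this page; it assumes kubectl is configured for the cluster and that the script is run from the repo root. The rollout-status timeout is an assumption.

```shell
# deploy_executor -- rebuild the executor image, then roll the
# deployment so pods pick up the new image. Run from the repo root.
deploy_executor() {
  ge-ops/infrastructure/local/k3s/executor/build-executor.sh || return 1
  kubectl rollout restart deployment/ge-executor -n ge-agents || return 1
  # Block until the new pods are ready (600s timeout is an assumption).
  kubectl rollout status deployment/ge-executor -n ge-agents --timeout=600s
}

# deploy_executor   # invoke only against the real cluster
```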

Executor Import Priority

ISSUE: /app/ (the WORKDIR) takes import priority over PYTHONPATH=/home/claude/ge-bootstrap
IMPACT: Baked-in code in the container image wins over hostPath-mounted code
FIX: Always rebuild the image to deploy code changes. The image IS the deployment artifact.
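The shadowing is standard Python behavior, reproducible anywhere: the entry point's directory (or the current directory for `-c`) is sys.path[0], ahead of every PYTHONPATH entry, which is why code baked into /app wins over the mount. A minimal demonstration (the /tmp/ge-demo path is just an illustrative PYTHONPATH entry):

```shell
# sys.path[0] is the script/current directory; PYTHONPATH entries come
# after it, so a module in the WORKDIR shadows the mounted copy.
PYTHONPATH=/tmp/ge-demo python3 -c '
import sys
# The PYTHONPATH entry is present, but never first:
print(sys.path.index("/tmp/ge-demo") > 0)'
```

This prints `True`: the PYTHONPATH directory is on the path, just not in first position.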

Vault After Restart

ISSUE: Vault seals itself after a pod restart
RECOVERY: A CronJob (vault-unseal, every 2 min) handles this automatically
IMPACT: If the CronJob fails, all secret-dependent operations fail
LOCATION: vault.keys at /home/claude/ge-bootstrap/vault.keys on the host
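For orientation, the CronJob might be shaped like the sketch below. Only the name vault-unseal and the 2-minute schedule come from this page; the namespace, image, and command are illustrative assumptions.

```yaml
# Sketch of the vault-unseal CronJob; namespace, image, and command
# are assumptions, not the deployed manifest.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: vault-unseal
  namespace: ge-system          # assumption
spec:
  schedule: "*/2 * * * *"       # every 2 minutes, per this page
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: unseal
              image: hashicorp/vault                     # assumption
              command: ["sh", "-c", "vault operator unseal ..."]  # sketch
```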

Legacy Docker Compose

ISSUE: docker-compose.yml still exists in the repo from the pre-k3s era
REALITY: The system is fully on k3s. docker-compose is archived/legacy; do not use it.
IMPACT: Referencing docker-compose for system configuration will give you outdated information

Per-Agent Deployments (Scaled to 0)

ISSUE: Individual agent pods (anna, annegreet, ron, etc.) exist as k8s deployments, but ALL are scaled to 0
REALITY: All agents run through the shared executor (ge-executor). The per-agent deployments are legacy.
IMPACT: Restarting a per-agent pod does nothing. Agent work goes through the executor.

Liveness Probe Kills Active Pods

WRONG: HTTP liveness probe on the health server endpoint
RIGHT: exec kill -0 1 (a process check that doesn't depend on the event loop)
REASON: The health server shares the asyncio event loop with agent execution. During PTY sessions (1-5 min), the loop blocks and HTTP probes time out.
FIX: Startup probe for boot, exec liveness for steady state, HTTP readiness for traffic routing.
LOCATION: k8s/base/agents/executor.yaml
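The three-probe split might look like the fragment below. The probe types follow the fix described above; the port, path, periods, and thresholds are illustrative assumptions, not the values in executor.yaml.

```yaml
# Probe split (sketch); port, path, and thresholds are assumptions.
startupProbe:                 # generous window for boot only
  httpGet: { path: /healthz, port: 8080 }
  failureThreshold: 30
  periodSeconds: 10
livenessProbe:                # event-loop-independent process check
  exec:
    command: ["kill", "-0", "1"]
  periodSeconds: 30
readinessProbe:               # gates traffic; may go unready during PTY sessions
  httpGet: { path: /healthz, port: 8080 }
  periodSeconds: 10
  timeoutSeconds: 5
```

The point of the split: a blocked event loop makes the pod unready (traffic stops) without being killed (liveness still passes via the exec check).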

Redis Authentication

WRONG: Assuming Redis is open (no password), or referencing a standalone redis-password secret
RIGHT: Redis uses --requirepass. The password is in the ge-secrets secret, key redis-password (exists in both the ge-system and ge-agents namespaces)
IMPACT: AuthenticationError: Authentication required; pods crashloop
NOTE: Liveness/readiness probes that connect to Redis also need the password via an env var. The executor gets away without explicit auth in REDIS_URL, but new deployments need REDIS_PASSWORD.
DISCOVERED: 2026-03-19 during ge-orchestrator deployment
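Wiring the password into a new deployment might look like this. The secret name and key come from this page and the env var name follows the note above; the probe command is an illustrative sketch.

```yaml
# Pull the Redis password from the ge-secrets secret (sketch).
env:
  - name: REDIS_PASSWORD
    valueFrom:
      secretKeyRef:
        name: ge-secrets
        key: redis-password

# Probes that talk to Redis need the password too, e.g. (illustrative):
# livenessProbe:
#   exec:
#     command: ["sh", "-c", "redis-cli -a \"$REDIS_PASSWORD\" ping"]
```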

Previously Disabled Agents (Re-enabled 2026-02-15)

HISTORY: ron, annegreet, mira, eltjo were excluded from the executor due to hook cross-trigger loops
STATUS: All 4 re-enabled after the hook-loop fix was deployed (3-layer protection)
DETAILS: See Hook Loops for the full story