Infrastructure Pitfalls¶
Redis Port¶
WRONG: Assuming default port 6379
RIGHT: Read config/ports.yaml for the actual port (currently 6381 as of 2026-02-14)
IMPACT: Connection refused, agent triggers fail silently
AUTHORITY: config/ports.yaml is the ONLY source of truth for port assignments
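A minimal sketch of resolving the port from config instead of hard-coding 6379. The flat `redis: <port>` key layout is an assumption — check config/ports.yaml for the real structure:

```python
import re

def redis_port(ports_yaml_text: str) -> int:
    """Extract the Redis port from the contents of config/ports.yaml.

    Assumes a flat `redis: <port>` entry (hypothetical -- verify the real
    key layout). Raises instead of falling back to a default, so a stale
    6379 can never sneak back in.
    """
    match = re.search(r"^redis:\s*(\d+)\s*$", ports_yaml_text, re.MULTILINE)
    if match is None:
        raise ValueError("no redis port found in config/ports.yaml")
    return int(match.group(1))
```

Typical call site: `redis_port(open("config/ports.yaml").read())`.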
k3s ClusterIP¶
WRONG: Using 10.43.0.1:443 from inside pods
RIGHT: Use service DNS (e.g., admin-ui.ge-system.svc.cluster.local)
IMPACT: Connection refused from inside pods. ClusterIP is BROKEN on this cluster.
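A small helper to make the right choice the easy one — build the in-cluster DNS name rather than ever touching a ClusterIP (the helper itself is a sketch, not existing code):

```python
def service_dns(service: str, namespace: str) -> str:
    # Standard Kubernetes in-cluster service DNS form.
    # Use this instead of a ClusterIP, which does not route on this cluster.
    return f"{service}.{namespace}.svc.cluster.local"
```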
DNS from Executor Pods¶
ISSUE: DNS resolution to admin-ui is unreliable from executor pods
WORKAROUND: Host cron + scripts/k8s-health-dump.sh -> public/k8s-health.json
IMPACT: Agents trying to curl discussion API fail with "Could not resolve host"
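One way the host-cron workaround could be wired up. The five-minute interval and the repo path are assumptions (the real interval and crontab are not documented here); the script and output file are from above:

```
# Host crontab (crontab -e) -- interval and /path/to/repo are placeholders.
# Assumes k8s-health-dump.sh writes public/k8s-health.json itself.
*/5 * * * * cd /path/to/repo && scripts/k8s-health-dump.sh
```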
Code Deployment¶
WRONG: kubectl cp files into running pods
RIGHT: Rebuild executor image with ge-ops/infrastructure/local/k3s/executor/build-executor.sh, then kubectl rollout restart deployment/ge-executor -n ge-agents
IMPACT: kubectl cp changes are lost on pod restart. Silent regression.
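The correct deploy loop as a copy-paste sequence. Both commands are stated above; running from the repo root and the final status check are assumptions:

```
# Rebuild the executor image, then restart so pods pick up the new image.
ge-ops/infrastructure/local/k3s/executor/build-executor.sh
kubectl rollout restart deployment/ge-executor -n ge-agents
# Optional: block until the rollout completes before declaring success.
kubectl rollout status deployment/ge-executor -n ge-agents
```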
Executor Import Priority¶
ISSUE: /app/ (WORKDIR) takes import priority over PYTHONPATH=/home/claude/ge-bootstrap
IMPACT: Baked-in code in the container image wins over hostPath-mounted code
FIX: Always rebuild the image to deploy code changes. The image IS the deployment artifact.
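A toy model of why /app wins: Python's import machinery walks sys.path in order and the first directory containing the module wins, and the script/app directory sits ahead of PYTHONPATH entries. This function only illustrates that first-match rule, it is not real executor code:

```python
def import_origin(search_path, dirs_with_module):
    """Return the directory a module would be imported from, given an
    ordered search path -- first match wins, like Python's path finder."""
    for entry in search_path:
        if entry in dirs_with_module:
            return entry
    return None
```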
Vault After Restart¶
ISSUE: Vault seals itself after pod restart
RECOVERY: CronJob (vault-unseal, every 2min) handles this automatically
IMPACT: If the CronJob fails, all secret-dependent operations fail
LOCATION: vault.keys at /home/claude/ge-bootstrap/vault.keys on host
Legacy Docker Compose¶
ISSUE: docker-compose.yml still exists in the repo from the pre-k3s era
REALITY: The system is fully on k3s. Docker-compose is archived/legacy. Do not use it.
IMPACT: Referencing docker-compose for system configuration will give you outdated information
Per-Agent Deployments (Scaled to 0)¶
ISSUE: Individual agent pods (anna, annegreet, ron, etc.) exist as k8s deployments but ALL are scaled to 0
REALITY: All agents run through the shared executor (ge-executor). The per-agent deployments are legacy.
IMPACT: Restarting a per-agent pod won't do anything. Agent work goes through the executor.
Liveness Probe Kills Active Pods¶
WRONG: HTTP liveness probe on the health server endpoint
RIGHT: exec kill -0 1 (process check that doesn't depend on event loop)
REASON: The health server shares the asyncio event loop with agent execution. During PTY sessions (1-5 min), the loop blocks and HTTP probes timeout.
FIX: Startup probe for boot, exec liveness for steady state, HTTP readiness for traffic routing.
LOCATION: k8s/base/agents/executor.yaml
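A sketch of the three-probe split described above. The probe field names are standard Kubernetes; the port, path, and timing values are assumptions — the authoritative manifest is k8s/base/agents/executor.yaml:

```
# Sketch only -- port/path/timings are guesses, see executor.yaml for real values.
startupProbe:            # tolerates slow boot without killing the pod
  httpGet: { path: /healthz, port: 8080 }
  failureThreshold: 30
  periodSeconds: 5
livenessProbe:           # pure process check; survives a blocked asyncio loop
  exec:
    command: ["kill", "-0", "1"]
  periodSeconds: 30
readinessProbe:          # HTTP only gates traffic routing, never restarts the pod
  httpGet: { path: /healthz, port: 8080 }
  periodSeconds: 10
```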
Redis Authentication¶
WRONG: Assuming Redis is open (no password), or referencing a redis-password secret
RIGHT: Redis uses --requirepass. Password is in ge-secrets secret, key redis-password (exists in both ge-system and ge-agents namespaces)
IMPACT: AuthenticationError: Authentication required — pods crashloop
NOTE: Liveness/readiness probes that connect to Redis also need the password via an env var. The executor currently gets away without explicit auth in REDIS_URL, but new deployments need REDIS_PASSWORD.
DISCOVERED: 2026-03-19 during ge-orchestrator deployment
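A sketch of building an authenticated connection URL from the env var, assuming REDIS_PASSWORD is wired into the pod from the ge-secrets secret (key redis-password); the helper name and URL shape are illustrative, not existing code:

```python
import os

def redis_url(host: str, port: int, db: int = 0) -> str:
    """Build an authenticated Redis URL from the REDIS_PASSWORD env var.

    Assumes the env var is populated from the ge-secrets secret
    (key redis-password) in the pod spec.
    """
    # Fail loudly at startup if the secret is missing, instead of
    # crashlooping later on AuthenticationError.
    password = os.environ["REDIS_PASSWORD"]
    return f"redis://:{password}@{host}:{port}/{db}"
```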
Previously Disabled Agents (Re-enabled 2026-02-15)¶
HISTORY: ron, annegreet, mira, eltjo were excluded from the executor due to hook cross-trigger loops
STATUS: All 4 re-enabled after the hook-loop fix (3-layer protection) was deployed
DETAILS: See Hook Loops for the full story