# Kubernetes Troubleshooting Guide
This guide documents critical issues discovered during the GE K8s migration and their resolutions.
## Critical Issue 1: Shared Executor Startup Race Condition

### Problem Description
During the K8s migration on 2026-01-29, shared executors and dedicated agents became stuck at startup with the following symptoms:
- Pods stuck at "Scanning for orphaned executions..." log message
- No Redis ping result logged (neither success nor error)
- Health server started successfully but main event loop never entered
- Process appeared frozen with only 2 threads visible
- Pods that started 5-6 minutes later (like annegreet) worked normally
### Symptoms to Watch For

```bash
# Stuck pod shows:
kubectl logs -n ge-agents <pod> --timestamps | tail -20
# Output shows:
# [timestamp] INFO Starting health server on 0.0.0.0:8080
# [timestamp] DEBUG Scanning for orphaned executions...
# ...then nothing (no Redis ping result)

# Working pod shows:
# [timestamp] INFO Starting health server on 0.0.0.0:8080
# [timestamp] DEBUG Scanning for orphaned executions...
# [timestamp] INFO Redis ping successful
# [timestamp] INFO Listening on channel: ...
```
### Diagnostic Commands
Check if pods are stuck:
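For example, using the same commands shown elsewhere in this guide (pod names are placeholders):

```bash
# Stuck pods still show Running here; what matters is whether their
# logs have progressed past the orphan scan
kubectl get pods -n ge-agents -o wide

# A stuck pod's log ends at the orphan scan with no Redis ping result
kubectl logs -n ge-agents <pod> --timestamps | tail -5
```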
Check TCP connections (should show Redis connection):
```bash
kubectl exec -n ge-agents <pod> -- cat /proc/1/net/tcp | wc -l
# Stuck pod: ~2-3 connections (health server only)
# Working pod: ~4-5 connections (includes Redis)
```
Test Redis connectivity from inside pod:
```bash
kubectl exec -n ge-agents <pod> -- python3 -c "
import redis
r = redis.Redis(host='redis', port=6379, db=0)
print(r.ping())
"
# Should return: True
```
### Root Cause Analysis
Race condition during pod startup:
- Network timing: Agent containers started before network/DNS was fully initialized
- Bad connection state: The async Redis connection was created but was in an invalid state
- No error raised: The broken connection caused `redis_client.ping()` to hang indefinitely
- No timeout configured: Neither a connection timeout nor a ping timeout was set
- Async gather blocking: `asyncio.gather()` running both `health_server` and `listen_for_triggers` meant both were blocked
Evidence:

- Pods started at 23:35 UTC showed the stuck behavior
- Annegreet started at 23:41 UTC, got "Error 111 connecting to redis", and continued normally
- A manual test inside a stuck pod showed Redis working fine (fresh connection)
- Restarting stuck pods immediately resolved the issue
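A minimal sketch of the failing startup pattern, reconstructed from the behavior above (simplified, illustrative code, not the actual executor; `start_health_server` is a hypothetical stand-in):

```python
import asyncio

import redis.asyncio as redis


async def start_health_server():
    # Stand-in for the real health endpoint on port 8080; it keeps the pod
    # looking "alive" even while the other coroutine is stuck.
    while True:
        await asyncio.sleep(30)


async def listen_for_triggers(redis_client: redis.Redis):
    print("Scanning for orphaned executions...")
    # No timeout: if the connection object ended up in a bad state during an
    # early-startup network/DNS race, this await can hang forever without raising.
    await redis_client.ping()
    print("Redis ping successful")
    # ...subscribe to the trigger channel and process messages...


async def main():
    redis_client = redis.from_url("redis://redis:6379/0", decode_responses=True)
    # gather() only returns when all coroutines finish, so a hang inside
    # listen_for_triggers never surfaces as an error or a crash.
    await asyncio.gather(
        start_health_server(),
        listen_for_triggers(redis_client),
    )


if __name__ == "__main__":
    asyncio.run(main())
```

Because the health server keeps responding while the trigger listener hangs, the pod passes liveness checks and the failure only shows up as missing log lines.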
### Resolution Steps
Immediate fix (applied during incident):
```bash
# Rolling restart of affected pods
kubectl rollout restart deployment/shared-executor -n ge-agents
kubectl rollout restart deployment/<dedicated-agent> -n ge-agents

# Monitor restart
kubectl get pods -n ge-agents -w
```
Code fix required (permanent solution):
The executor code needs these changes:
```python
# 1. Add connection timeouts when creating the client
# (assumes the asyncio client, since ping() is awaited below)
import redis.asyncio as redis

redis_client = redis.from_url(
    redis_url,
    decode_responses=True,
    socket_connect_timeout=5,   # ADD THIS
    socket_timeout=10,          # ADD THIS
    retry_on_timeout=True,
)

# 2. Add a timeout wrapper around the startup ping
import asyncio

try:
    ping_result = await asyncio.wait_for(
        redis_client.ping(),
        timeout=5.0,  # 5 second timeout
    )
    logger.info(f"Redis ping successful: {ping_result}")
except asyncio.TimeoutError:
    logger.error("Redis ping timed out")
    return  # Early exit
except Exception as e:
    logger.error(f"Redis connection error: {e}")
    return  # Early exit (already handled)
```
### Prevention Checklist
- [ ] Add connection timeouts to all `redis.from_url()` calls
- [ ] Add a timeout wrapper around all `redis_client.ping()` calls
- [ ] Consider adding a startup delay or readiness probe dependencies (see the sketch after this list)
- [ ] Test pod restart scenarios in the dev environment
- [ ] Document the expected startup sequence in the runbook
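A hedged sketch of what a stricter readiness/startup probe could look like. The `/ready` path and thresholds are assumptions; the only detail taken from this incident is the health server on port 8080, and the probe only helps if that endpoint reports ready after a successful Redis ping:

```yaml
# Illustrative only: assumes the executor's health server exposes /ready
# and reports ready only once Redis has answered a ping.
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
startupProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 30   # tolerate slow network/DNS initialization at pod start
```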
### Related Files

- Executor code: `/ge-bootstrap/executors/shared/executor.py`
- Deployment manifests: `/ge-bootstrap/k8s/base/executors/`
## Critical Issue 2: Network Policy Missing Dedicated Agents

### Problem Description
Dedicated agents could not reach the Anthropic API after deployment, while shared executors worked fine.
### Symptoms to Watch For
- Executions showed "0+0 tokens" in logs
- API calls failed with connection errors
- curl to api.anthropic.com from dedicated agent pods returned exit code 7 (failed to connect)
- curl from shared executor pods returned HTTP 405 (Method Not Allowed; expected, and confirms the API is reachable)
### Diagnostic Commands

Test API connectivity from different pod types:

```bash
# Test from shared executor (should work)
kubectl exec -n ge-agents deployment/shared-executor -- \
  curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
  https://api.anthropic.com/v1/messages
# Expected: 405

# Test from dedicated agent (was failing)
kubectl exec -n ge-agents deployment/annegreet -- \
  curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
  https://api.anthropic.com/v1/messages
# Before fix: exit code 7 (failed to connect)
# After fix: 405
```
Check network policy coverage:
```bash
kubectl get pods -n ge-agents -o custom-columns=\
"NAME:.metadata.name,COMPONENT:.metadata.labels.component"
```
Check which network policies apply to a pod:
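There is no single built-in command for this; a practical approach is to compare each policy's `podSelector` with the pod's labels:

```bash
# List policies and their pod selectors
kubectl get networkpolicies -n ge-agents \
  -o custom-columns="NAME:.metadata.name,SELECTOR:.spec.podSelector"

# Show the pod's labels to compare against those selectors
kubectl get pod -n ge-agents <pod> --show-labels

# Inspect a specific policy in detail
kubectl describe networkpolicy executors-external-https -n ge-agents
```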
### Root Cause Analysis
The network policy `executors-external-https` only matched pods with the label `component: shared-executor`:
```yaml
# BEFORE (incorrect)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: executors-external-https
  namespace: ge-agents
spec:
  podSelector:
    matchLabels:
      component: shared-executor  # ❌ Only matches shared executors
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 443
```
But dedicated agents and the other agent pod types carry different component labels:

- `component: dedicated-agent` (annegreet, arjan, corne, etc.)
- `component: watcher` (watchers)
- `component: guardian` (ron)
- `component: orchestrator` (dolly)
### Resolution Steps

Fix applied:

Updated the network policy to use `matchExpressions` instead of `matchLabels`:
```yaml
# AFTER (correct)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: executors-external-https
  namespace: ge-agents
spec:
  podSelector:
    matchExpressions:
      - key: component
        operator: In
        values:
          - shared-executor
          - dedicated-agent
          - watcher
          - guardian
          - orchestrator
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector: {}
      ports:
        - protocol: TCP
          port: 443
```
File location: `/home/claude/ge-bootstrap/k8s/base/core/network-policies.yaml`
Apply the fix:
```bash
kubectl apply -f /home/claude/ge-bootstrap/k8s/base/core/network-policies.yaml

# Verify policy updated
kubectl describe networkpolicy executors-external-https -n ge-agents

# Test API connectivity again
kubectl exec -n ge-agents deployment/annegreet -- \
  curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
  https://api.anthropic.com/v1/messages
# Should now return: 405
```
### Prevention Checklist

- [ ] When adding new pod types, update network policies to include them
- [ ] Use `matchExpressions` with `operator: In` for multi-value matching
- [ ] Test API connectivity from all pod types after network policy changes
- [ ] Document which component types need external HTTPS access
- [ ] Add an integration test for API connectivity across all agent types (see the sketch after this list)
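A minimal sketch of such a check, assuming every agent runs as a Deployment in `ge-agents` and its image ships `curl` (this is not an existing script):

```bash
#!/usr/bin/env bash
# Verify that every deployment in ge-agents can reach the Anthropic API.
# A bare GET to /v1/messages should return HTTP 405 (Method Not Allowed).
set -u

namespace="ge-agents"
failures=0

for deploy in $(kubectl get deployments -n "$namespace" -o name); do
  code=$(kubectl exec -n "$namespace" "$deploy" -- \
    curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
    https://api.anthropic.com/v1/messages 2>/dev/null)
  if [ "$code" = "405" ]; then
    echo "OK   $deploy"
  else
    echo "FAIL $deploy (got '${code:-none}')"
    failures=$((failures + 1))
  fi
done

exit "$failures"
```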
### Related Files

- Network policies: `/ge-bootstrap/k8s/base/core/network-policies.yaml`
- Pod labels: `/ge-bootstrap/k8s/base/agents/*.yaml`, `/ge-bootstrap/k8s/base/executors/*.yaml`
## General Troubleshooting Commands

### Pod Status and Logs

```bash
# Get all pods with status
kubectl get pods -n ge-agents -o wide

# Get logs with timestamps
kubectl logs -n ge-agents <pod> --timestamps

# Get logs from previous container (if crashed)
kubectl logs -n ge-agents <pod> --previous

# Follow logs in real-time
kubectl logs -n ge-agents <pod> -f

# Get logs from specific container in multi-container pod
kubectl logs -n ge-agents <pod> -c <container-name>
```
### Network Diagnostics

```bash
# Test DNS resolution
kubectl exec -n ge-agents <pod> -- nslookup redis
kubectl exec -n ge-agents <pod> -- nslookup api.anthropic.com

# Test TCP connectivity
kubectl exec -n ge-agents <pod> -- nc -zv redis 6379
kubectl exec -n ge-agents <pod> -- nc -zv api.anthropic.com 443

# Check active connections
kubectl exec -n ge-agents <pod> -- netstat -an | grep ESTABLISHED

# Test HTTP/HTTPS endpoint
kubectl exec -n ge-agents <pod> -- curl -v https://api.anthropic.com/v1/messages
```
### Resource Inspection

```bash
# Check pod resource usage
kubectl top pod -n ge-agents

# Check pod events
kubectl get events -n ge-agents --sort-by='.lastTimestamp'

# Describe pod (includes events)
kubectl describe pod -n ge-agents <pod>

# Check network policies affecting pod
kubectl get networkpolicies -n ge-agents
kubectl describe networkpolicy <policy-name> -n ge-agents
```
### Redis Connectivity

```bash
# Test Redis from pod (single quotes inside the script avoid breaking
# the double-quoted shell argument)
kubectl exec -n ge-agents <pod> -- python3 -c "
import redis
r = redis.Redis(host='redis', port=6379, db=0)
print('Ping:', r.ping())
print('Version:', r.info()['redis_version'])
"

# Check Redis service
kubectl get svc redis -n ge-agents
kubectl describe svc redis -n ge-agents

# Check Redis endpoints
kubectl get endpoints redis -n ge-agents
```
## Escalation Path

If troubleshooting doesn't resolve the issue:

- Gather diagnostics using the commands above
- Check recent git history for infrastructure changes (see the example below)
- Contact the infrastructure team:
  - Arjan (infrastructure provisioner)
  - Tijmen (Kubernetes specialist)
- Create an incident report in `/ge-ops/system/outbox/pending/`
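For the git history check, something like this works, assuming the manifests live in the `ge-bootstrap` checkout referenced above (adjust the path to your clone):

```bash
# Recent commits touching the K8s manifests
git -C /home/claude/ge-bootstrap log --oneline --since="3 days ago" -- k8s/

# Inspect what a suspicious commit changed
git -C /home/claude/ge-bootstrap show <commit-sha> -- k8s/
```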
Last updated: 2026-01-29 by Annegreet

Source: K8s migration debugging session