Kubernetes Troubleshooting Guide

This guide documents critical issues discovered during the GE K8s migration and their resolutions.

Critical Issue 1: Shared Executor Startup Race Condition

Problem Description

During the K8s migration on 2026-01-29, shared executors and dedicated agents became stuck at startup with the following symptoms:

  • Pods stuck at "Scanning for orphaned executions..." log message
  • No Redis ping result logged (neither success nor error)
  • Health server started successfully but main event loop never entered
  • Process appeared frozen with only 2 threads visible
  • Pods that started 5-6 minutes later (like annegreet) worked normally

Symptoms to Watch For

# Stuck pod shows:
kubectl logs -n ge-agents <pod> --timestamps | tail -20
# Output shows:
# [timestamp] INFO Starting health server on 0.0.0.0:8080
# [timestamp] DEBUG Scanning for orphaned executions...
# ...then nothing (no Redis ping result)

# Working pod shows:
# [timestamp] INFO Starting health server on 0.0.0.0:8080
# [timestamp] DEBUG Scanning for orphaned executions...
# [timestamp] INFO Redis ping successful
# [timestamp] INFO Listening on channel: ...

Diagnostic Commands

Check if pods are stuck:

kubectl logs -n ge-agents <pod> --timestamps | grep -E "(orphan|DEBUG|Recovery|ping)"

Check TCP connections (should show Redis connection):

kubectl exec -n ge-agents <pod> -- cat /proc/1/net/tcp | wc -l
# Stuck pod: ~2-3 connections (health server only)
# Working pod: ~4-5 connections (includes Redis)
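
To sweep the whole namespace at once, a loop like the one below prints the TCP table line count for every pod. The counts are raw /proc/net/tcp lines (including the table header), so treat them as a relative signal against a known-good pod rather than an exact connection count.

# Rough sweep: print TCP table line counts for every pod in ge-agents.
# Low counts (health server only) suggest a pod stuck before the Redis connect.
for pod in $(kubectl get pods -n ge-agents -o jsonpath='{.items[*].metadata.name}'); do
  count=$(kubectl exec -n ge-agents "$pod" -- cat /proc/1/net/tcp 2>/dev/null | wc -l)
  echo "$pod: $count"
done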

Test Redis connectivity from inside pod:

kubectl exec -n ge-agents <pod> -- python3 -c "
import redis
r = redis.Redis(host='redis', port=6379, db=0)
print(r.ping())
"
# Should return: True
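
The test above uses the synchronous client; to exercise the async code path that actually hangs, a sketch like the following can be run in the pod. It assumes redis-py 4.2+ (the redis.asyncio module) is installed in the image and applies the same timeouts as the permanent fix below.

kubectl exec -n ge-agents <pod> -- python3 -c "
import asyncio
import redis.asyncio as redis

async def main():
    # Same timeouts as the permanent fix: never let ping() hang forever.
    r = redis.from_url('redis://redis:6379/0', socket_connect_timeout=5, socket_timeout=10)
    print(await asyncio.wait_for(r.ping(), timeout=5.0))

asyncio.run(main())
"
# Should return: True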

Root Cause Analysis

Race condition during pod startup:

  1. Network timing: Agent containers started before network/DNS was fully initialized
  2. Bad connection state: The async Redis connection was created but in an invalid state
  3. No error raised: The broken connection caused redis_client.ping() to hang indefinitely
  4. No timeout configured: Neither connection timeout nor ping timeout was set
  5. Async gather masking the hang: asyncio.gather() ran health_server and listen_for_triggers together, so the hung ping stalled the trigger listener while the health server kept the process up and looking alive (sketched below)

Evidence:

  • Pods started at 23:35 UTC showed the stuck behavior
  • Annegreet, started at 23:41 UTC, got "Error 111 connecting to redis" instead of hanging and continued normally
  • A manual test inside a stuck pod showed Redis working fine (fresh connection)
  • Restarting the stuck pods immediately resolved the issue
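
A minimal sketch of the startup pattern described in steps 3-5, with simplified stand-ins for the real health_server and listen_for_triggers coroutines in executor.py:

import asyncio

async def health_server():
    # Stand-in: the real server binds 0.0.0.0:8080 and keeps answering probes.
    while True:
        await asyncio.sleep(1)

async def listen_for_triggers(redis_client):
    # With no timeout configured, a ping against a bad connection can await
    # forever, so the subscribe/listen loop below is never reached.
    await redis_client.ping()
    # ... subscribe and process triggers ...

async def main(redis_client):
    # gather() keeps the process alive as long as health_server runs, so the
    # pod looks healthy from outside even though the listener is stuck.
    await asyncio.gather(health_server(), listen_for_triggers(redis_client))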

Resolution Steps

Immediate fix (applied during incident):

# Rolling restart of affected pods
kubectl rollout restart deployment/shared-executor -n ge-agents
kubectl rollout restart deployment/<dedicated-agent> -n ge-agents

# Monitor restart
kubectl get pods -n ge-agents -w
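
If every executor and agent deployment in the namespace was affected, a loop like this restarts them all and waits for each rollout (assuming they are all safe to restart together):

for d in $(kubectl get deployments -n ge-agents -o jsonpath='{.items[*].metadata.name}'); do
  kubectl rollout restart "deployment/${d}" -n ge-agents
  kubectl rollout status "deployment/${d}" -n ge-agents --timeout=300s
done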

Code fix required (permanent solution):

The executor code needs these changes:

# 1. Add connection timeouts (redis_client is the async client, redis.asyncio)
redis_client = redis.from_url(
    redis_url,
    decode_responses=True,
    socket_connect_timeout=5,  # ADD THIS: fail fast on connect
    socket_timeout=10,         # ADD THIS: bound individual commands
    retry_on_timeout=True
)

# 2. Add timeout wrapper around ping
import asyncio

try:
    ping_result = await asyncio.wait_for(
        redis_client.ping(),
        timeout=5.0  # 5 second timeout
    )
    logger.info(f"Redis ping successful: {ping_result}")
except asyncio.TimeoutError:
    logger.error("Redis ping timed out")
    return  # Early exit
except Exception as e:
    logger.error(f"Redis connection error: {e}")
    return  # Early exit (already handled)

Prevention Checklist

  • [ ] Add connection timeout to all redis.from_url() calls
  • [ ] Add timeout wrapper around all redis_client.ping() calls
  • [ ] Consider adding a startup delay or readiness/startup probe dependencies (see the probe sketch after this list)
  • [ ] Test pod restart scenarios in dev environment
  • [ ] Document expected startup sequence in runbook
Related files:

  • Executor code: /ge-bootstrap/executors/shared/executor.py
  • Deployment manifests: /ge-bootstrap/k8s/base/executors/
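
For the probe item in the checklist above, one option is a probe that only passes once the health server can confirm Redis is reachable. The fragment below is a hypothetical sketch for the executor Deployment: the /healthz path, and the assumption that it reflects Redis readiness, are not confirmed by the current code; only the 8080 port comes from the startup logs.

# Hypothetical probe fragment for the shared-executor container spec.
# Assumes /healthz on 8080 returns non-200 until the Redis ping has succeeded.
containers:
- name: shared-executor
  startupProbe:
    httpGet:
      path: /healthz
      port: 8080
    periodSeconds: 5
    failureThreshold: 30   # allow up to ~150s for Redis to become reachable
  readinessProbe:
    httpGet:
      path: /healthz
      port: 8080
    periodSeconds: 10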

Critical Issue 2: Network Policy Missing Dedicated Agents

Problem Description

Dedicated agents could not reach the Anthropic API after deployment, while shared executors worked fine.

Symptoms to Watch For

  • Executions showed "0+0 tokens" in logs
  • API calls failed with connection errors
  • curl to api.anthropic.com from dedicated agent pods exited with code 7 (failed to connect)
  • curl from shared executor pods returned HTTP 405 (Method Not Allowed - expected, means API is reachable)

Diagnostic Commands

Test API connectivity from different pod types:

# Test from shared executor (should work)
kubectl exec -n ge-agents deployment/shared-executor -- \
  curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
  https://api.anthropic.com/v1/messages
# Expected: 405

# Test from dedicated agent (was failing)
kubectl exec -n ge-agents deployment/annegreet -- \
  curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
  https://api.anthropic.com/v1/messages
# Before fix: curl exits with code 7 (failed to connect)
# After fix: 405

Check network policy coverage:

kubectl get pods -n ge-agents -o custom-columns=\
"NAME:.metadata.name,COMPONENT:.metadata.labels.component"

Check which network policies select a pod (kubectl describe pod does not list them, so compare the pod's labels against each policy's podSelector):

kubectl get pod <pod-name> -n ge-agents --show-labels
kubectl describe networkpolicy -n ge-agents | grep -E "^Name:|PodSelector"

Root Cause Analysis

The network policy executors-external-https only matched pods with label component: shared-executor:

# BEFORE (incorrect)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: executors-external-https
  namespace: ge-agents
spec:
  podSelector:
    matchLabels:
      component: shared-executor  # ❌ Only matches shared executors
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector: {}
    ports:
    - protocol: TCP
      port: 443

But dedicated agents have different component labels:

  • component: dedicated-agent (annegreet, arjan, corne, etc.)
  • component: watcher (watchers)
  • component: guardian (ron)
  • component: orchestrator (dolly)
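To enumerate the component values that actually exist in the cluster (and therefore need to be covered by the selector), something like:

kubectl get pods -n ge-agents \
  -o jsonpath='{range .items[*]}{.metadata.labels.component}{"\n"}{end}' | sort -u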

Resolution Steps

Fix applied:

Updated network policy to use matchExpressions instead of matchLabels:

# AFTER (correct)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: executors-external-https
  namespace: ge-agents
spec:
  podSelector:
    matchExpressions:
    - key: component
      operator: In
      values:
      - shared-executor
      - dedicated-agent
      - watcher
      - guardian
      - orchestrator
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector: {}
    ports:
    - protocol: TCP
      port: 443

File location: /home/claude/ge-bootstrap/k8s/base/core/network-policies.yaml

Apply the fix:

kubectl apply -f /home/claude/ge-bootstrap/k8s/base/core/network-policies.yaml

# Verify policy updated
kubectl describe networkpolicy executors-external-https -n ge-agents

# Test API connectivity again
kubectl exec -n ge-agents deployment/annegreet -- \
  curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
  https://api.anthropic.com/v1/messages
# Should now return: 405

Prevention Checklist

  • [ ] When adding new pod types, update network policies to include them
  • [ ] Use matchExpressions with operator: In for multi-value matching
  • [ ] Test API connectivity from all pod types after network policy changes (see the loop sketch after this list)
  • [ ] Document which component types need external HTTPS access
  • [ ] Add integration test for API connectivity across all agent types
Related files:

  • Network policies: /ge-bootstrap/k8s/base/core/network-policies.yaml
  • Pod labels: /ge-bootstrap/k8s/base/agents/*.yaml, /ge-bootstrap/k8s/base/executors/*.yaml
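
The connectivity check from the list above can be scripted as a loop over every deployment in the namespace; a hedged sketch (405 means the Anthropic API is reachable, anything else needs a closer look):

for d in $(kubectl get deployments -n ge-agents -o jsonpath='{.items[*].metadata.name}'); do
  code=$(kubectl exec -n ge-agents "deployment/${d}" -- \
    curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
    https://api.anthropic.com/v1/messages 2>/dev/null)
  echo "${d}: ${code:-exec failed}"
done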

General Troubleshooting Commands

Pod Status and Logs

# Get all pods with status
kubectl get pods -n ge-agents -o wide

# Get logs with timestamps
kubectl logs -n ge-agents <pod> --timestamps

# Get logs from previous container (if crashed)
kubectl logs -n ge-agents <pod> --previous

# Follow logs in real-time
kubectl logs -n ge-agents <pod> -f

# Get logs from specific container in multi-container pod
kubectl logs -n ge-agents <pod> -c <container-name>

Network Diagnostics

# Test DNS resolution
kubectl exec -n ge-agents <pod> -- nslookup redis
kubectl exec -n ge-agents <pod> -- nslookup api.anthropic.com

# Test TCP connectivity
kubectl exec -n ge-agents <pod> -- nc -zv redis 6379
kubectl exec -n ge-agents <pod> -- nc -zv api.anthropic.com 443

# Check active connections
kubectl exec -n ge-agents <pod> -- netstat -an | grep ESTABLISHED

# Test HTTP/HTTPS endpoint
kubectl exec -n ge-agents <pod> -- curl -v https://api.anthropic.com/v1/messages
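
If the agent images don't ship nc, netstat, or curl, an ephemeral debug container provides the tooling without modifying the pod (requires a cluster recent enough to support kubectl debug; the netshoot image is one common choice, not a project standard):

# Attach a throwaway troubleshooting container that shares the pod's network namespace
kubectl debug -n ge-agents <pod> -it --image=nicolaka/netshoot -- bash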

Resource Inspection

# Check pod resource usage
kubectl top pod -n ge-agents

# Check pod events
kubectl get events -n ge-agents --sort-by='.lastTimestamp'

# Describe pod (includes events)
kubectl describe pod -n ge-agents <pod>

# Check network policies affecting pod
kubectl get networkpolicies -n ge-agents
kubectl describe networkpolicy <policy-name> -n ge-agents

Redis Connectivity

# Test Redis from pod
kubectl exec -n ge-agents <pod> -- python3 -c "
import redis
r = redis.Redis(host='redis', port=6379, db=0)
print('Ping:', r.ping())
print('Redis version:', r.info()['redis_version'])
"

# Check Redis service
kubectl get svc redis -n ge-agents
kubectl describe svc redis -n ge-agents

# Check Redis endpoints
kubectl get endpoints redis -n ge-agents
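
If the Redis server image ships redis-cli, the server side can be checked directly as well (this assumes the Redis workload is a Deployment named redis, which is not confirmed here):

kubectl exec -n ge-agents deployment/redis -- redis-cli ping
# Expected: PONG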

Escalation Path

If troubleshooting doesn't resolve the issue:

  1. Gather diagnostics using the commands above
  2. Check the git log for recent infrastructure changes
  3. Contact the infrastructure team:
     • Arjan (infrastructure provisioner)
     • Tijmen (Kubernetes specialist)
  4. Create an incident report in /ge-ops/system/outbox/pending/

Last updated: 2026-01-29 by Annegreet
Source: K8s migration debugging session