Kubernetes Troubleshooting Guide

This guide documents critical issues discovered during the GE K8s migration and their resolutions.

Critical Issue 1: Shared Executor Startup Race Condition

Problem Description

During the K8s migration on 2026-01-29, shared executors and dedicated agents became stuck at startup with the following symptoms:

  • Pods stuck at "Scanning for orphaned executions..." log message
  • No Redis ping result logged (neither success nor error)
  • Health server started successfully but main event loop never entered
  • Process appeared frozen with only 2 threads visible
  • Pods that started 5-6 minutes later (like annegreet) worked normally

Symptoms to Watch For

# Stuck pod shows:
kubectl logs -n ge-agents <pod> --timestamps | tail -20
# Output shows:
# [timestamp] INFO Starting health server on 0.0.0.0:8080
# [timestamp] DEBUG Scanning for orphaned executions...
# ...then nothing (no Redis ping result)

# Working pod shows:
# [timestamp] INFO Starting health server on 0.0.0.0:8080
# [timestamp] DEBUG Scanning for orphaned executions...
# [timestamp] INFO Redis ping successful
# [timestamp] INFO Listening on channel: ...

Diagnostic Commands

Check if pods are stuck:

kubectl logs -n ge-agents <pod> --timestamps | grep -E "(orphan|DEBUG|Recovery|ping)"

Check TCP connections (should show Redis connection):

kubectl exec -n ge-agents <pod> -- cat /proc/1/net/tcp | wc -l
# Stuck pod: ~2-3 connections (health server only)
# Working pod: ~4-5 connections (includes Redis)
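
To sweep the whole namespace at once, a loop like the one below prints the TCP table line count for every pod. The counts are raw /proc/net/tcp lines (including the table header), so treat them as a relative signal against a known-good pod rather than an exact connection count.

# Rough sweep: print TCP table line counts for every pod in ge-agents.
# Low counts (health server only) suggest a pod stuck before the Redis connect.
for pod in $(kubectl get pods -n ge-agents -o jsonpath='{.items[*].metadata.name}'); do
  count=$(kubectl exec -n ge-agents "$pod" -- cat /proc/1/net/tcp 2>/dev/null | wc -l)
  echo "$pod: $count"
done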

Test Redis connectivity from inside pod:

kubectl exec -n ge-agents <pod> -- python3 -c "
import redis
r = redis.Redis(host='redis', port=6379, db=0)
print(r.ping())
"
# Should return: True
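
The test above uses the synchronous client; to exercise the async code path that actually hangs, a sketch like the following can be run in the pod. It assumes redis-py 4.2+ (the redis.asyncio module) is installed in the image and applies the same timeouts as the permanent fix below.

kubectl exec -n ge-agents <pod> -- python3 -c "
import asyncio
import redis.asyncio as redis

async def main():
    # Same timeouts as the permanent fix: never let ping() hang forever.
    r = redis.from_url('redis://redis:6379/0', socket_connect_timeout=5, socket_timeout=10)
    print(await asyncio.wait_for(r.ping(), timeout=5.0))

asyncio.run(main())
"
# Should return: True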

Root Cause Analysis

Race condition during pod startup:

  1. Network timing: Agent containers started before network/DNS was fully initialized
  2. Bad connection state: The async Redis connection was created but in an invalid state
  3. No error raised: The broken connection caused redis_client.ping() to hang indefinitely
  4. No timeout configured: Neither connection timeout nor ping timeout was set
  5. Async gather masking the hang: asyncio.gather() ran health_server and listen_for_triggers together, so the hung ping stalled the trigger listener while the health server kept the process up and looking alive (sketched below)

Evidence:

  • Pods started at 23:35 UTC showed the stuck behavior
  • Annegreet, started at 23:41 UTC, got "Error 111 connecting to redis" instead of hanging and continued normally
  • A manual test inside a stuck pod showed Redis working fine (fresh connection)
  • Restarting the stuck pods immediately resolved the issue
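
A minimal sketch of the startup pattern described in steps 3-5, with simplified stand-ins for the real health_server and listen_for_triggers coroutines in executor.py:

import asyncio

async def health_server():
    # Stand-in: the real server binds 0.0.0.0:8080 and keeps answering probes.
    while True:
        await asyncio.sleep(1)

async def listen_for_triggers(redis_client):
    # With no timeout configured, a ping against a bad connection can await
    # forever, so the subscribe/listen loop below is never reached.
    await redis_client.ping()
    # ... subscribe and process triggers ...

async def main(redis_client):
    # gather() keeps the process alive as long as health_server runs, so the
    # pod looks healthy from outside even though the listener is stuck.
    await asyncio.gather(health_server(), listen_for_triggers(redis_client))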

Resolution Steps

Immediate fix (applied during incident):

# Rolling restart of affected pods
kubectl rollout restart deployment/shared-executor -n ge-agents
kubectl rollout restart deployment/<dedicated-agent> -n ge-agents

# Monitor restart
kubectl get pods -n ge-agents -w
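
If every executor and agent deployment in the namespace was affected, a loop like this restarts them all and waits for each rollout (assuming they are all safe to restart together):

for d in $(kubectl get deployments -n ge-agents -o jsonpath='{.items[*].metadata.name}'); do
  kubectl rollout restart "deployment/${d}" -n ge-agents
  kubectl rollout status "deployment/${d}" -n ge-agents --timeout=300s
done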

Code fix required (permanent solution):

The executor code needs these changes:

# 1. Add connection timeouts (redis_client is the async client, redis.asyncio)
redis_client = redis.from_url(
    redis_url,
    decode_responses=True,
    socket_connect_timeout=5,  # ADD THIS: fail fast on connect
    socket_timeout=10,         # ADD THIS: bound individual commands
    retry_on_timeout=True
)

# 2. Add timeout wrapper around ping
import asyncio

try:
    ping_result = await asyncio.wait_for(
        redis_client.ping(),
        timeout=5.0  # 5 second timeout
    )
    logger.info(f"Redis ping successful: {ping_result}")
except asyncio.TimeoutError:
    logger.error("Redis ping timed out")
    return  # Early exit
except Exception as e:
    logger.error(f"Redis connection error: {e}")
    return  # Early exit (already handled)

Prevention Checklist

  • [ ] Add connection timeout to all redis.from_url() calls
  • [ ] Add timeout wrapper around all redis_client.ping() calls
  • [ ] Consider adding a startup delay or readiness/startup probe dependencies (see the probe sketch after this list)
  • [ ] Test pod restart scenarios in dev environment
  • [ ] Document expected startup sequence in runbook
Related files:

  • Executor code: /ge-bootstrap/executors/shared/executor.py
  • Deployment manifests: /ge-bootstrap/k8s/base/executors/
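
For the probe item in the checklist above, one option is a probe that only passes once the health server can confirm Redis is reachable. The fragment below is a hypothetical sketch for the executor Deployment: the /healthz path, and the assumption that it reflects Redis readiness, are not confirmed by the current code; only the 8080 port comes from the startup logs.

# Hypothetical probe fragment for the shared-executor container spec.
# Assumes /healthz on 8080 returns non-200 until the Redis ping has succeeded.
containers:
- name: shared-executor
  startupProbe:
    httpGet:
      path: /healthz
      port: 8080
    periodSeconds: 5
    failureThreshold: 30   # allow up to ~150s for Redis to become reachable
  readinessProbe:
    httpGet:
      path: /healthz
      port: 8080
    periodSeconds: 10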

Critical Issue 2: Network Policy Missing Dedicated Agents

Problem Description

Dedicated agents could not reach the Anthropic API after deployment, while shared executors worked fine.

Symptoms to Watch For

  • Executions showed "0+0 tokens" in logs
  • API calls failed with connection errors
  • curl to api.anthropic.com from dedicated agent pods exited with code 7 (failed to connect)
  • curl from shared executor pods returned HTTP 405 (Method Not Allowed - expected, means API is reachable)

Diagnostic Commands

Test API connectivity from different pod types:

# Test from shared executor (should work)
kubectl exec -n ge-agents deployment/shared-executor -- \
  curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
  https://api.anthropic.com/v1/messages
# Expected: 405

# Test from dedicated agent (was failing)
kubectl exec -n ge-agents deployment/annegreet -- \
  curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
  https://api.anthropic.com/v1/messages
# Before fix: curl exits with code 7 (failed to connect)
# After fix: 405

Check network policy coverage:

kubectl get pods -n ge-agents -o custom-columns=\
"NAME:.metadata.name,COMPONENT:.metadata.labels.component"

Check which network policies select a pod (kubectl describe pod does not list them, so compare the pod's labels against each policy's podSelector):

kubectl get pod <pod-name> -n ge-agents --show-labels
kubectl describe networkpolicy -n ge-agents | grep -E "^Name:|PodSelector"

Root Cause Analysis

The network policy executors-external-https only matched pods with label component: shared-executor:

# BEFORE (incorrect)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: executors-external-https
  namespace: ge-agents
spec:
  podSelector:
    matchLabels:
      component: shared-executor  # ❌ Only matches shared executors
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector: {}
    ports:
    - protocol: TCP
      port: 443

But dedicated agents have different component labels:

  • component: dedicated-agent (annegreet, arjan, corne, etc.)
  • component: watcher (watchers)
  • component: guardian (ron)
  • component: orchestrator (dolly)
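To enumerate the component values that actually exist in the cluster (and therefore need to be covered by the selector), something like:

kubectl get pods -n ge-agents \
  -o jsonpath='{range .items[*]}{.metadata.labels.component}{"\n"}{end}' | sort -u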

Resolution Steps

Fix applied:

Updated network policy to use matchExpressions instead of matchLabels:

# AFTER (correct)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: executors-external-https
  namespace: ge-agents
spec:
  podSelector:
    matchExpressions:
    - key: component
      operator: In
      values:
      - shared-executor
      - dedicated-agent
      - watcher
      - guardian
      - orchestrator
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector: {}
    ports:
    - protocol: TCP
      port: 443

File location: /home/claude/ge-bootstrap/k8s/base/core/network-policies.yaml

Apply the fix:

kubectl apply -f /home/claude/ge-bootstrap/k8s/base/core/network-policies.yaml

# Verify policy updated
kubectl describe networkpolicy executors-external-https -n ge-agents

# Test API connectivity again
kubectl exec -n ge-agents deployment/annegreet -- \
  curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
  https://api.anthropic.com/v1/messages
# Should now return: 405

Prevention Checklist

  • [ ] When adding new pod types, update network policies to include them
  • [ ] Use matchExpressions with operator: In for multi-value matching
  • [ ] Test API connectivity from all pod types after network policy changes (see the loop sketch after this list)
  • [ ] Document which component types need external HTTPS access
  • [ ] Add integration test for API connectivity across all agent types
Related files:

  • Network policies: /ge-bootstrap/k8s/base/core/network-policies.yaml
  • Pod labels: /ge-bootstrap/k8s/base/agents/*.yaml, /ge-bootstrap/k8s/base/executors/*.yaml
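
The connectivity check from the list above can be scripted as a loop over every deployment in the namespace; a hedged sketch (405 means the Anthropic API is reachable, anything else needs a closer look):

for d in $(kubectl get deployments -n ge-agents -o jsonpath='{.items[*].metadata.name}'); do
  code=$(kubectl exec -n ge-agents "deployment/${d}" -- \
    curl -s -o /dev/null -w "%{http_code}" --max-time 10 \
    https://api.anthropic.com/v1/messages 2>/dev/null)
  echo "${d}: ${code:-exec failed}"
done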

General Troubleshooting Commands

Pod Status and Logs

# Get all pods with status
kubectl get pods -n ge-agents -o wide

# Get logs with timestamps
kubectl logs -n ge-agents <pod> --timestamps

# Get logs from previous container (if crashed)
kubectl logs -n ge-agents <pod> --previous

# Follow logs in real-time
kubectl logs -n ge-agents <pod> -f

# Get logs from specific container in multi-container pod
kubectl logs -n ge-agents <pod> -c <container-name>

Network Diagnostics

# Test DNS resolution
kubectl exec -n ge-agents <pod> -- nslookup redis
kubectl exec -n ge-agents <pod> -- nslookup api.anthropic.com

# Test TCP connectivity
kubectl exec -n ge-agents <pod> -- nc -zv redis 6379
kubectl exec -n ge-agents <pod> -- nc -zv api.anthropic.com 443

# Check active connections
kubectl exec -n ge-agents <pod> -- netstat -an | grep ESTABLISHED

# Test HTTP/HTTPS endpoint
kubectl exec -n ge-agents <pod> -- curl -v https://api.anthropic.com/v1/messages
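
If the agent images don't ship nc, netstat, or curl, an ephemeral debug container provides the tooling without modifying the pod (requires a cluster recent enough to support kubectl debug; the netshoot image is one common choice, not a project standard):

# Attach a throwaway troubleshooting container that shares the pod's network namespace
kubectl debug -n ge-agents <pod> -it --image=nicolaka/netshoot -- bash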

Resource Inspection

# Check pod resource usage
kubectl top pod -n ge-agents

# Check pod events
kubectl get events -n ge-agents --sort-by='.lastTimestamp'

# Describe pod (includes events)
kubectl describe pod -n ge-agents <pod>

# Check network policies affecting pod
kubectl get networkpolicies -n ge-agents
kubectl describe networkpolicy <policy-name> -n ge-agents

Redis Connectivity

# Test Redis from pod
kubectl exec -n ge-agents <pod> -- python3 -c "
import redis
r = redis.Redis(host='redis', port=6379, db=0)
print('Ping:', r.ping())
print('Redis version:', r.info()['redis_version'])
"

# Check Redis service
kubectl get svc redis -n ge-agents
kubectl describe svc redis -n ge-agents

# Check Redis endpoints
kubectl get endpoints redis -n ge-agents
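
If the Redis server image ships redis-cli, the server side can be checked directly as well (this assumes the Redis workload is a Deployment named redis, which is not confirmed here):

kubectl exec -n ge-agents deployment/redis -- redis-cli ping
# Expected: PONG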

Escalation Path

If troubleshooting doesn't resolve the issue:

  1. Gather diagnostics using the commands above
  2. Check the git log for recent infrastructure changes
  3. Contact the infrastructure team:
     • Arjan (infrastructure provisioner)
     • Tijmen (Kubernetes specialist)
  4. Create an incident report in /ge-ops/system/outbox/pending/

Last updated: 2026-01-29 by Annegreet
Source: K8s migration debugging session