Authentication Failures Troubleshooting Guide

Last Updated: 2026-03-18
Status: Active
Maintained by: GE Infrastructure Team


Overview

This guide provides systematic troubleshooting procedures for authentication-related failures in the GE agent platform. Authentication issues are the most common cause of service disruption after platform restarts or deployments.

Covered Scenarios:
- Redis authentication failures ("NOAUTH" errors)
- Vault authentication failures ("authentication failed")
- Pods in CrashLoopBackOff due to auth issues
- Executors unable to access external APIs



Quick Diagnosis

Identify Which Service is Failing

# Check all pods status
kubectl get pods -A

# Look for:
# - CrashLoopBackOff (pod repeatedly restarting)
# - Error (pod failed to start)
# - Running but frequent restarts (high RESTARTS count)

# Get logs from failing pod
kubectl logs -n <namespace> <pod-name> --tail=100

# Common error patterns:
# - "NOAUTH Authentication required" → Redis auth failure
# - "Vault authentication failed" → Vault AppRole issue
# - "Connection refused to localhost" → Environment variable mismatch
# - "401 Unauthorized" → API key invalid or missing
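
The pattern-to-cause mapping above can be applied mechanically to a log dump. The following is a minimal sketch, assuming the four error strings appear verbatim in the logs; `classify_auth_error` is a hypothetical helper name, not part of the platform tooling.

```shell
# Hypothetical helper (not part of the GE tooling): reads log text on stdin
# and prints one category per matched error pattern, or "unknown".
classify_auth_error() {
  local logs found=0
  logs=$(cat)
  if printf '%s\n' "$logs" | grep -qF 'NOAUTH Authentication required'; then
    echo "redis-auth"; found=1
  fi
  if printf '%s\n' "$logs" | grep -qF 'Vault authentication failed'; then
    echo "vault-approle"; found=1
  fi
  if printf '%s\n' "$logs" | grep -qF 'Connection refused to localhost'; then
    echo "env-var-mismatch"; found=1
  fi
  if printf '%s\n' "$logs" | grep -qF '401 Unauthorized'; then
    echo "api-key"; found=1
  fi
  if [ "$found" -eq 0 ]; then
    echo "unknown"
  fi
}
```

One possible use is to pipe pod logs into it: `kubectl logs -n <namespace> <pod-name> --tail=100 | classify_auth_error`. A result of `unknown` only means none of the four patterns matched; it does not rule out an auth problem.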

Check Secrets Exist

# List all secrets in ge-agents namespace
kubectl get secrets -n ge-agents

# Expected secrets:
# - ge-executor-secrets (Vault AppRole credentials)
# - redis-credentials (Redis password)
# - anthropic-api-key (Anthropic API key)

# Verify secrets contain data (not empty); values are shown base64-encoded
kubectl get secret ge-executor-secrets -n ge-agents -o jsonpath='{.data}' | jq .
kubectl get secret redis-credentials -n ge-agents -o jsonpath='{.data}' | jq .
kubectl get secret anthropic-api-key -n ge-agents -o jsonpath='{.data}' | jq .

# If any secret is empty or missing → Run bootstrap script

Check Vault Storage Files

# Verify Vault storage files exist
ls -la /home/claude/ge-bootstrap/ge-ops/system/vault/secrets/ge/redis/
ls -la /home/claude/ge-bootstrap/ge-ops/system/vault/approles/

# Verify files contain data
cat /home/claude/ge-bootstrap/ge-ops/system/vault/secrets/ge/redis/credentials.json
cat /home/claude/ge-bootstrap/ge-ops/system/vault/approles/ge-agents-credentials.json

# Expected: JSON files with non-empty password, role_id, secret_id fields
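
The existence and non-emptiness check above can be done without printing the credentials to the terminal. This is a rough sketch: `check_nonempty_files` is a hypothetical helper, and it only tests that each file exists and has a non-zero size. It does not validate the JSON fields inside.

```shell
# Hypothetical helper: report whether each given file exists and is non-empty.
# Returns 0 only if every file passes.
check_nonempty_files() {
  local status=0 f
  for f in "$@"; do
    if [ ! -e "$f" ]; then
      echo "MISSING $f"; status=1
    elif [ ! -s "$f" ]; then
      echo "EMPTY   $f"; status=1
    else
      echo "OK      $f"
    fi
  done
  return "$status"
}
```

For example, it could be called with the two credential files listed above as arguments. A non-zero exit status indicates that at least one file needs attention.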

Redis Authentication Failures

Symptoms

Error Messages:
- NOAUTH Authentication required
- ERR AUTH <password> called without any password configured for the default user
- Connection refused to redis.ge-system.svc.cluster.local:6381

Where Errors Appear:
- Dolly orchestrator logs
- Executor logs
- Any service attempting to connect to Redis

Diagnosis

Step 1: Check Redis pod is running

kubectl get pods -n ge-system | grep redis

# Expected: redis-0 Running
# If not running: kubectl logs -n ge-system redis-0 --tail=100

Step 2: Check Redis password secret exists

# Check in ge-agents namespace (executor secret)
kubectl get secret redis-credentials -n ge-agents -o jsonpath='{.data.password}' | base64 -d

# Check in ge-system namespace (Redis config)
kubectl get secret ge-secrets -n ge-system -o jsonpath='{.data.redis-password}' | base64 -d

# Both commands should return the same non-empty password

Step 3: Test Redis authentication manually

# Get password
REDIS_PASSWORD=$(kubectl get secret redis-credentials -n ge-agents -o jsonpath='{.data.password}' | base64 -d)

# Test connection from executor pod
kubectl exec -n ge-agents deployment/ge-executor -- \
  redis-cli -h redis.ge-system.svc.cluster.local -p 6381 -a "$REDIS_PASSWORD" PING

# Expected output: PONG
# If "NOAUTH": Password is incorrect or missing
# If "Connection refused": Redis not running or network issue

Step 4: Compare passwords in Vault storage vs K8s

# Vault storage password
VAULT_PASSWORD=$(jq -r .password /home/claude/ge-bootstrap/ge-ops/system/vault/secrets/ge/redis/credentials.json)

# K8s secret password
K8S_PASSWORD=$(kubectl get secret redis-credentials -n ge-agents -o jsonpath='{.data.password}' | base64 -d)

# Compare
if [ "$VAULT_PASSWORD" = "$K8S_PASSWORD" ]; then
  echo "Passwords match ✓"
else
  echo "PASSWORD MISMATCH - Run bootstrap script"
fi
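
One caveat with the comparison above: if both lookups fail, both variables are empty strings and the test reports a match. The sketch below treats empty values as a failure, and also treats the literal string `null`, which `jq -r` prints for a missing field. `secrets_in_sync` is a hypothetical helper name.

```shell
# Hypothetical helper: compare two secret values without printing them.
# $1: value from Vault storage, $2: value from the K8s secret.
# Exit codes: 0 = match, 1 = mismatch, 2 = one or both values empty.
secrets_in_sync() {
  local a="$1" b="$2"
  # jq -r prints the literal string "null" when the field is absent
  [ "$a" = "null" ] && a=""
  [ "$b" = "null" ] && b=""
  if [ -z "$a" ] || [ -z "$b" ]; then
    echo "EMPTY VALUE"
    return 2
  fi
  if [ "$a" = "$b" ]; then
    echo "match"
    return 0
  fi
  echo "MISMATCH"
  return 1
}
```

Usage would be `secrets_in_sync "$VAULT_PASSWORD" "$K8S_PASSWORD"`. The same helper could be applied to the AppRole role_id comparison later in this guide.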

Resolution

Case 1: Passwords mismatch or K8s secret empty

# Re-run bootstrap to sync from Vault storage
cd /home/claude/ge-bootstrap/ge-ops/infrastructure/local/k3s/scripts/
./bootstrap-secrets.sh

# Restart executors and orchestrator
kubectl rollout restart deployment/ge-executor -n ge-agents
kubectl rollout restart deployment/ge-orchestrator -n ge-agents

# Verify logs show successful connection
kubectl logs -n ge-agents deployment/ge-executor --tail=50 | grep -i redis

Case 2: Vault storage password missing or empty

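# Generate new password and store it in Vault storage
# (this overwrites credentials.json; back it up first if it holds other fields)
NEW_PASSWORD=$(openssl rand -base64 32)
jq -n --arg password "$NEW_PASSWORD" '{password: $password}' > /home/claude/ge-bootstrap/ge-ops/system/vault/secrets/ge/redis/credentials.json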

# Update K8s secrets
cd /home/claude/ge-bootstrap/ge-ops/infrastructure/local/k3s/scripts/
./bootstrap-secrets.sh --force

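# Update Redis configuration
# Caution: the generated manifest contains only redis-password. If ge-secrets
# holds other keys, confirm they are preserved before applying.
kubectl create secret generic ge-secrets \
  --namespace=ge-system \
  --from-literal=redis-password="$NEW_PASSWORD" \
  --dry-run=client -o yaml | kubectl apply -f -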

# Restart Redis
kubectl rollout restart statefulset/redis -n ge-system

# Wait for Redis ready
kubectl rollout status statefulset/redis -n ge-system

# Restart executors and orchestrator
kubectl rollout restart deployment/ge-executor -n ge-agents
kubectl rollout restart deployment/ge-orchestrator -n ge-agents

Case 3: Redis pod not running

# Check Redis logs for errors
kubectl logs -n ge-system redis-0 --tail=100

# Common issues:
# - PVC not available: Check PersistentVolumeClaim
# - Resource limits: Check CPU/memory limits
# - Config error: Check ConfigMap

# Restart Redis
kubectl delete pod redis-0 -n ge-system

# Wait for new pod to start
kubectl get pods -n ge-system -w

Vault Authentication Failures

Symptoms

Error Messages:
- Vault authentication failed
- Failed to retrieve secret from Vault
- AppRole authentication failed: invalid role_id or secret_id

Where Errors Appear:
- Executor logs during startup
- Dolly logs during initialization

Diagnosis

Step 1: Check Vault AppRole secret exists

kubectl get secret ge-executor-secrets -n ge-agents -o jsonpath='{.data}' | jq .

# Expected output:
# {
#   "vault-role-id": "base64encodedvalue",
#   "vault-secret-id": "base64encodedvalue"
# }

# Decode and verify not empty
kubectl get secret ge-executor-secrets -n ge-agents -o jsonpath='{.data.vault-role-id}' | base64 -d
kubectl get secret ge-executor-secrets -n ge-agents -o jsonpath='{.data.vault-secret-id}' | base64 -d

# Should return non-empty UUID-like strings
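
The "UUID-like" expectation can be checked mechanically instead of by eye. This is a minimal sketch, assuming the standard 8-4-4-4-12 hexadecimal layout; `is_uuid` is a hypothetical helper. If the secret_id in this setup is not UUID-formatted, apply the check to the role_id only.

```shell
# Hypothetical helper: succeed only if $1 is a single line in the
# 8-4-4-4-12 hexadecimal UUID layout.
is_uuid() {
  printf '%s\n' "$1" | grep -Eqx \
    '[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}'
}
```

For example, capture the decoded role_id in a variable and run `is_uuid "$ROLE_ID" && echo "format ok" || echo "unexpected format"`. This reports on the format without echoing the value itself.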

Step 2: Check Vault storage files

cat /home/claude/ge-bootstrap/ge-ops/system/vault/approles/ge-agents-credentials.json

# Expected output:
# {
#   "role_id": "uuid-format-string",
#   "secret_id": "long-base64-string"
# }

Step 3: Compare Vault storage vs K8s secret

# Vault storage role_id
VAULT_ROLE_ID=$(jq -r .role_id /home/claude/ge-bootstrap/ge-ops/system/vault/approles/ge-agents-credentials.json)

# K8s secret role_id
K8S_ROLE_ID=$(kubectl get secret ge-executor-secrets -n ge-agents -o jsonpath='{.data.vault-role-id}' | base64 -d)

# Compare
if [ "$VAULT_ROLE_ID" = "$K8S_ROLE_ID" ]; then
  echo "Role IDs match ✓"
else
  echo "ROLE ID MISMATCH - Run bootstrap script"
fi

Step 4: Check executor logs for specific error

kubectl logs -n ge-agents deployment/ge-executor --tail=100 | grep -i vault

# Look for:
# - "Vault credentials not configured" → Env vars not set in deployment
# - "Authentication failed" → Credentials invalid or expired
# - "Access denied" → Policy doesn't grant access to secret paths

Resolution

Case 1: AppRole credentials missing or empty

# Generate new AppRole credentials
cd /home/claude/ge-bootstrap/ge-ops/infrastructure/local/k3s/scripts/
./bootstrap-secrets.sh --force

# Restart executors and orchestrator
kubectl rollout restart deployment/ge-executor -n ge-agents
kubectl rollout restart deployment/ge-orchestrator -n ge-agents

# Watch logs for successful authentication
kubectl logs -n ge-agents deployment/ge-executor -f | grep -i "authenticated"

Case 2: Credentials mismatch between Vault and K8s

# Re-sync from Vault storage (preserves existing credentials)
cd /home/claude/ge-bootstrap/ge-ops/infrastructure/local/k3s/scripts/
./bootstrap-secrets.sh

# Restart executors
kubectl rollout restart deployment/ge-executor -n ge-agents

# Verify authentication
kubectl logs -n ge-agents deployment/ge-executor --tail=50 | grep -i vault

Case 3: Environment variables not injected into pod

# Check deployment env vars configuration
kubectl get deployment ge-executor -n ge-agents -o yaml | grep -A20 "env:"

# Expected:
# env:
# - name: VAULT_ROLE_ID
#   valueFrom:
#     secretKeyRef:
#       name: ge-executor-secrets
#       key: vault-role-id
# - name: VAULT_SECRET_ID
#   valueFrom:
#     secretKeyRef:
#       name: ge-executor-secrets
#       key: vault-secret-id

# If missing: Deployment manifest needs update
# See: k8s/base/agents/executor.yaml

Case 4: Vault pod not running (future: actual Vault deployment)

# Current implementation: File-based Vault storage (no pod)
# Future: HashiCorp Vault pod in K8s

# If Vault pod deployed:
kubectl get pods -n ge-system | grep vault

# Check if sealed
kubectl exec -n ge-system deployment/vault -- vault status

# If sealed, see: Platform Recovery Runbook - Vault Sealed Scenarios

CrashLoopBackOff Due to Auth Issues

Symptoms

Pod Status:
- Pod status: CrashLoopBackOff
- High restart count (RESTARTS: 5+)
- Pod repeatedly starts, runs for <30 seconds, then crashes

Common Causes:
1. Environment variable mismatch (app expects different var names)
2. Secrets not injected into pod (K8s secret missing)
3. Authentication fails on startup, app exits with error code

Diagnosis

Step 1: Check current logs

kubectl logs -n ge-agents <pod-name> --tail=100

# Look for errors during initialization phase

Step 2: Check previous crash logs

kubectl logs -n ge-agents <pod-name> --previous

# This shows logs from the LAST crashed instance
# Critical for diagnosing startup failures

Step 3: Check pod events

kubectl describe pod -n ge-agents <pod-name>

# Events section shows:
# - CrashLoopBackOff events
# - Error exit codes
# - Back-off restart messages

Step 4: Inspect application code for env var expectations

# If code is in hostPath mount:
grep -n "os.getenv\|os.environ" /home/claude/ge-bootstrap/ge_agent/runner.py

# Identify what environment variables the application expects
# Common patterns:
# - REDIS_URL vs REDIS_HOST + REDIS_PORT
# - VAULT_ADDR vs VAULT_HOST
# - ANTHROPIC_API_KEY vs API_KEY

Step 5: Review deployment manifest env vars

kubectl get deployment <deployment-name> -n ge-agents -o yaml | grep -A30 "env:"

# Compare manifest env var names to code expectations
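
The comparison in Steps 4 and 5 can be partly automated. The sketch below makes two assumptions: the application reads variables via `os.getenv("NAME")`, `os.environ["NAME"]`, or `os.environ.get("NAME")` with a literal name, and both lists are passed as newline-separated strings. Both function names are hypothetical.

```shell
# Hypothetical helper: read Python source on stdin and print the sorted,
# de-duplicated names of environment variables it reads with a literal name.
extract_env_names() {
  grep -oE "os\.(getenv\(|environ\[|environ\.get\()[\"'][A-Za-z_][A-Za-z0-9_]*" \
    | sed -E "s/^.*[\"']//" \
    | sort -u
}

# Hypothetical helper: print every name in $1 (expected by the code) that is
# absent from $2 (provided by the manifest). Both are newline-separated lists.
missing_env_vars() {
  local expected="$1" provided="$2" name
  printf '%s\n' "$expected" | while IFS= read -r name; do
    [ -n "$name" ] || continue
    printf '%s\n' "$provided" | grep -qxF -- "$name" || printf '%s\n' "$name"
  done
}
```

The expected list could come from `extract_env_names < /home/claude/ge-bootstrap/ge_agent/runner.py`. The provided list could likely be produced with a kubectl JSONPath query such as `-o jsonpath='{range .spec.template.spec.containers[*].env[*]}{.name}{"\n"}{end}'`. Variables supplied through `envFrom`, or read in code via a computed name, are not covered by this sketch.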

Resolution

Case 1: Environment variable name mismatch

Example: App expects REDIS_URL, manifest provides REDIS_HOST + REDIS_PORT.

Fix deployment manifest:

# Option 1: Update manifest to provide REDIS_URL
# (the "add" op with path .../env/- fails if the container has no env array yet)
kubectl patch deployment ge-executor -n ge-agents --type='json' -p='[
  {
    "op": "add",
    "path": "/spec/template/spec/containers/0/env/-",
    "value": {
      "name": "REDIS_URL",
      "value": "redis://redis.ge-system.svc.cluster.local:6381"
    }
  }
]'

# Option 2: Update application code to read separate env vars
# (Requires code change, not recommended for quick fix)
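
Note that the REDIS_URL in the patch above carries no password. If the application takes its Redis credentials from the URL, the password has to be embedded and percent-encoded, since a base64-generated password may contain `/`, `+`, and `=`. The sketch below assumes the client accepts the common `redis://:password@host:port` form, which this guide does not confirm. `urlencode` and `redis_url` are hypothetical helpers and handle ASCII input only.

```shell
# Hypothetical helper: percent-encode every character except the RFC 3986
# unreserved set (letters, digits, ".", "_", "~", "-"). ASCII input only.
urlencode() {
  local s="$1" out="" c
  while [ -n "$s" ]; do
    c=${s%"${s#?}"}   # first character of the remaining string
    s=${s#?}          # drop that character
    case $c in
      [A-Za-z0-9._~-]) out="$out$c" ;;
      *) out="$out$(printf '%%%02X' "'$c")" ;;
    esac
  done
  printf '%s\n' "$out"
}

# Hypothetical helper: build a password-bearing Redis URL.
# $1: password, $2: host, $3: port
redis_url() {
  printf 'redis://:%s@%s:%s\n' "$(urlencode "$1")" "$2" "$3"
}
```

If a URL with an embedded password is used, it is better supplied through a secret reference than as a literal `value` in the manifest, so the password does not appear in the deployment YAML.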

For detailed procedure, see: CrashLoopBackOff Environment Variables Pattern

Case 2: Secrets not injected

# Check if secret exists
kubectl get secret ge-executor-secrets -n ge-agents

# If missing:
cd /home/claude/ge-bootstrap/ge-ops/infrastructure/local/k3s/scripts/
./bootstrap-secrets.sh

# If exists but not injected:
# Verify deployment references correct secret name and keys
kubectl get deployment ge-executor -n ge-agents -o yaml | grep -B5 -A10 "secretKeyRef"

Case 3: Authentication fails on startup

# Check logs for auth failure pattern
kubectl logs -n ge-agents <pod-name> --previous | grep -i "auth\|failed\|error"

# If Redis auth fails: See "Redis Authentication Failures" section
# If Vault auth fails: See "Vault Authentication Failures" section

Anthropic API Authentication

Symptoms

Error Messages:
- 401 Unauthorized when calling Anthropic API
- ANTHROPIC_API_KEY not set
- Invalid API key format

Where Errors Appear:
- Executor logs during Claude API calls
- Task execution failures

Diagnosis

Step 1: Check API key secret exists

kubectl get secret anthropic-api-key -n ge-agents -o jsonpath='{.data.api_key}' | base64 -d

# Expected: String starting with "sk-ant-api03-"
# If empty or wrong format: API key invalid
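
The prefix check above can be done without echoing the key to the terminal. This is a minimal sketch using the `sk-ant-api03-` prefix stated above; `describe_api_key` is a hypothetical helper. It checks only the prefix, not whether the key is valid.

```shell
# Hypothetical helper: classify an API key by prefix without printing it.
describe_api_key() {
  case "$1" in
    "")             echo "empty" ;;
    sk-ant-api03-*) echo "ok (expected prefix, length ${#1})" ;;
    *)              echo "unexpected-format" ;;
  esac
}
```

Usage would be to store the decoded secret in a variable and call `describe_api_key "$API_KEY"`. A key that passes this check can still be revoked or expired, so the manual API test remains the definitive check.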

Step 2: Verify environment variable injection

# Check deployment references secret
kubectl get deployment ge-executor -n ge-agents -o yaml | grep -B5 -A5 "ANTHROPIC_API_KEY"

# Expected:
# - name: ANTHROPIC_API_KEY
#   valueFrom:
#     secretKeyRef:
#       name: anthropic-api-key
#       key: api_key

Step 3: Test API key manually

# Get API key
API_KEY=$(kubectl get secret anthropic-api-key -n ge-agents -o jsonpath='{.data.api_key}' | base64 -d)

# Test API call (prints the HTTP status code after the response body)
curl -s -w '\nHTTP status: %{http_code}\n' https://api.anthropic.com/v1/messages \
  -H "x-api-key: $API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-3-haiku-20240307","max_tokens":10,"messages":[{"role":"user","content":"hi"}]}'

# Expected: HTTP status 200 and a JSON response with a message
# If 401: API key invalid or expired

Resolution

Case 1: API key missing or empty

# Set environment variable
export ANTHROPIC_API_KEY="sk-ant-api03-YOUR_KEY_HERE"

# Re-run bootstrap
cd /home/claude/ge-bootstrap/ge-ops/infrastructure/local/k3s/scripts/
./bootstrap-secrets.sh --force

# Restart executors
kubectl rollout restart deployment/ge-executor -n ge-agents

Case 2: API key invalid or expired

# Get new API key from Anthropic dashboard
# https://console.anthropic.com/settings/keys

# Update environment variable
export ANTHROPIC_API_KEY="sk-ant-api03-NEW_KEY_HERE"

# Force regenerate secret
cd /home/claude/ge-bootstrap/ge-ops/infrastructure/local/k3s/scripts/
./bootstrap-secrets.sh --force

# Restart executors
kubectl rollout restart deployment/ge-executor -n ge-agents

Case 3: Secret exists but not injected

# Verify deployment manifest
kubectl get deployment ge-executor -n ge-agents -o yaml | grep -B10 -A10 "ANTHROPIC_API_KEY"

# If missing env var reference:
# Edit deployment: kubectl edit deployment ge-executor -n ge-agents
# Add env var section referencing anthropic-api-key secret

Preventive Measures

Before Platform Restarts

Always run pre-flight checks:

# 1. Verify secrets exist in Vault storage
ls -la /home/claude/ge-bootstrap/ge-ops/system/vault/secrets/ge/redis/
ls -la /home/claude/ge-bootstrap/ge-ops/system/vault/approles/

# 2. Backup current K8s secrets
kubectl get secrets -n ge-agents -o yaml > /tmp/secrets-backup-$(date +%Y%m%d-%H%M%S).yaml

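# 3. Verify ANTHROPIC_API_KEY is set (without printing the key)
[ -n "$ANTHROPIC_API_KEY" ] && echo "ANTHROPIC_API_KEY is set" || echo "ANTHROPIC_API_KEY is NOT set"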

# 4. Run bootstrap check
cd /home/claude/ge-bootstrap/ge-ops/infrastructure/local/k3s/scripts/
./bootstrap-secrets.sh --check

After Platform Restarts

Always verify services:

# 1. Check all pods running
kubectl get pods -A

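# 2. Test Redis authentication (same check as Redis Diagnosis, Step 3)
# A bare "redis-cli PING" sends no password, so it returns NOAUTH when auth is enabled
REDIS_PASSWORD=$(kubectl get secret redis-credentials -n ge-agents -o jsonpath='{.data.password}' | base64 -d)
kubectl exec -n ge-agents deployment/ge-executor -- \
  redis-cli -h redis.ge-system.svc.cluster.local -p 6381 -a "$REDIS_PASSWORD" PING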

# 3. Check executor logs
kubectl logs -n ge-agents deployment/ge-executor --tail=50

# 4. Verify agent tasks executing
kubectl logs -n ge-agents deployment/ge-executor -f
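
Scanning the `kubectl get pods -A` output by eye is error-prone on a busy cluster. The sketch below flags pods whose status is neither Running nor Completed, or whose restart count reaches a threshold. The default of 5 follows the "RESTARTS: 5+" symptom described earlier. `flag_unhealthy_pods` is a hypothetical helper and assumes the default column order NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE.

```shell
# Hypothetical helper: read `kubectl get pods -A` output on stdin and print
# pods that are not Running/Completed or have restarted at least $1 times.
flag_unhealthy_pods() {
  local max_restarts="${1:-5}"
  awk -v max="$max_restarts" '
    NR == 1 { next }   # skip the header row
    ($4 != "Running" && $4 != "Completed") || ($5 + 0 >= max) {
      printf "%s/%s: status=%s restarts=%s\n", $1, $2, $4, $5
    }
  '
}
```

Usage: `kubectl get pods -A | flag_unhealthy_pods`, or pass a different threshold as the first argument. A pod that is Running but not Ready (for example READY 0/1) is not flagged by this sketch.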

Regular Maintenance

Weekly:
- Check secret rotation schedule
- Review authentication failure logs
- Verify backup procedures

Monthly:
- Rotate secrets (currently manual, see bootstrap-secrets.sh --force)
- Audit secret access patterns
- Test recovery procedures

Quarterly:
- Full security audit
- Update documentation
- Review and improve procedures



Emergency Contacts

| Issue Type | Contact | Method |
|---|---|---|
| Redis auth failures | Victoria (Security) | ge-ops/agents/victoria/inbox/ |
| Vault auth failures | Piotr (Secrets Manager) | ge-ops/agents/piotr/inbox/ |
| CrashLoopBackOff | Marije (Infrastructure) | ge-ops/agents/marije/inbox/ |
| API key issues | Human escalation | ge-ops/notifications/human/ |
| Critical incidents | Dirk-Jan (CEO) | ge-ops/notifications/human/ |

Document Version: 1.0
Last Updated: 2026-03-18
Maintained by: Annegreet (Knowledge Curator)
Next Review: 2026-04-18