Authentication Failures Troubleshooting Guide
Last Updated: 2026-03-18
Status: Active
Maintained by: GE Infrastructure Team
Overview
This guide provides systematic troubleshooting procedures for authentication-related failures in the GE agent platform. Authentication issues are the most common cause of service disruption after platform restarts or deployments.
Covered Scenarios:
- Redis authentication failures ("NOAUTH" errors)
- Vault authentication failures ("authentication failed")
- Pods in CrashLoopBackOff due to auth issues
- Executors unable to access external APIs
Table of Contents
- Quick Diagnosis
- Redis Authentication Failures
- Vault Authentication Failures
- CrashLoopBackOff Due to Auth Issues
- Anthropic API Authentication
- Preventive Measures
Quick Diagnosis
Identify Which Service is Failing
# Check all pods status
kubectl get pods -A
# Look for:
# - CrashLoopBackOff (pod repeatedly restarting)
# - Error (pod failed to start)
# - Running but frequent restarts (high RESTARTS count)
# Get logs from failing pod
kubectl logs -n <namespace> <pod-name> --tail=100
# Common error patterns:
# - "NOAUTH Authentication required" → Redis auth failure
# - "Vault authentication failed" → Vault AppRole issue
# - "Connection refused to localhost" → Environment variable mismatch
# - "401 Unauthorized" → API key invalid or missing
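The pattern list above can be turned into a small triage filter. The following is a minimal sketch, not part of the platform tooling: the function name `classify_auth_error` and the short labels it prints are illustrative assumptions, and it only recognizes the four patterns listed in this guide.

```shell
# classify_auth_error: read log text on stdin and print a label for the first
# line that matches one of the error patterns listed in this guide.
# The function name and labels are illustrative, not platform conventions.
classify_auth_error() {
  awk '
    /NOAUTH Authentication required/  { print "redis";        found = 1; exit }
    /Vault authentication failed/     { print "vault";        found = 1; exit }
    /Connection refused to localhost/ { print "env-mismatch"; found = 1; exit }
    /401 Unauthorized/                { print "api-key";      found = 1; exit }
    END { if (!found) print "unknown" }
  '
}
```

A typical call would pipe pod logs into it, for example `kubectl logs -n <namespace> <pod-name> --tail=100 | classify_auth_error`, and then jump to the matching section.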
Check Secrets Exist
# List all secrets in ge-agents namespace
kubectl get secrets -n ge-agents
# Expected secrets:
# - ge-executor-secrets (Vault AppRole credentials)
# - redis-credentials (Redis password)
# - anthropic-api-key (Anthropic API key)
# Verify secrets contain data (not empty)
kubectl get secret ge-executor-secrets -n ge-agents -o jsonpath='{.data}' | jq
kubectl get secret redis-credentials -n ge-agents -o jsonpath='{.data}' | jq
kubectl get secret anthropic-api-key -n ge-agents -o jsonpath='{.data}' | jq
# If any secret is empty or missing → Run bootstrap script
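The existence check can be looped over the three expected secrets. This sketch assumes nothing beyond `kubectl get secret`; the `KUBECTL` variable is a hypothetical override hook added here so the function can be pointed at a stub, and the function only tests existence, not whether the secret holds data.

```shell
# missing_secrets NAMESPACE NAME...: print each named secret that kubectl
# cannot find in NAMESPACE. Existence only; emptiness is not checked.
# KUBECTL is a hypothetical override (defaults to kubectl) for testing.
missing_secrets() (
  ns=$1; shift
  for name in "$@"; do
    "${KUBECTL:-kubectl}" get secret "$name" -n "$ns" >/dev/null 2>&1 || echo "$name"
  done
)
```

For example, `missing_secrets ge-agents ge-executor-secrets redis-credentials anthropic-api-key` prints nothing when all three secrets are present.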
Check Vault Storage Files
# Verify Vault storage files exist
ls -la /home/claude/ge-bootstrap/ge-ops/system/vault/secrets/ge/redis/
ls -la /home/claude/ge-bootstrap/ge-ops/system/vault/approles/
# Verify files contain data
cat /home/claude/ge-bootstrap/ge-ops/system/vault/secrets/ge/redis/credentials.json
cat /home/claude/ge-bootstrap/ge-ops/system/vault/approles/ge-agents-credentials.json
# Expected: JSON files with non-empty password, role_id, secret_id fields
Redis Authentication Failures
Symptoms
Error Messages:
- NOAUTH Authentication required
- ERR AUTH <password> called without any password configured for the default user
- Connection refused to redis.ge-system.svc.cluster.local:6381
Where Errors Appear:
- Dolly orchestrator logs
- Executor logs
- Any service attempting to connect to Redis
Diagnosis
Step 1: Check Redis pod is running
kubectl get pods -n ge-system | grep redis
# Expected: redis-0 Running
# If not running: kubectl logs -n ge-system redis-0 --tail=100
Step 2: Check Redis password secret exists
# Check in ge-agents namespace (executor secret)
kubectl get secret redis-credentials -n ge-agents -o jsonpath='{.data.password}' | base64 -d
# Check in ge-system namespace (Redis config)
kubectl get secret ge-secrets -n ge-system -o jsonpath='{.data.redis-password}' | base64 -d
# Both should return same non-empty password
Step 3: Test Redis authentication manually
# Get password
REDIS_PASSWORD=$(kubectl get secret redis-credentials -n ge-agents -o jsonpath='{.data.password}' | base64 -d)
# Test connection from executor pod
kubectl exec -n ge-agents deployment/ge-executor -- \
redis-cli -h redis.ge-system.svc.cluster.local -p 6381 -a "$REDIS_PASSWORD" PING
# Expected output: PONG
# If "NOAUTH": Password is incorrect or missing
# If "Connection refused": Redis not running or network issue
Step 4: Compare passwords in Vault storage vs K8s
# Vault storage password
VAULT_PASSWORD=$(cat /home/claude/ge-bootstrap/ge-ops/system/vault/secrets/ge/redis/credentials.json | jq -r .password)
# K8s secret password
K8S_PASSWORD=$(kubectl get secret redis-credentials -n ge-agents -o jsonpath='{.data.password}' | base64 -d)
# Compare
if [ "$VAULT_PASSWORD" = "$K8S_PASSWORD" ]; then
  echo "Passwords match ✓"
else
  echo "PASSWORD MISMATCH - Run bootstrap script"
fi
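When a mismatch needs to be shared in a ticket or chat, comparing short digests avoids pasting either password. This is a sketch under the assumption that GNU coreutils `sha256sum` is available on the host; the helper name `fingerprint` is illustrative.

```shell
# fingerprint VALUE: print the first 12 hex characters of the SHA-256 digest
# of VALUE, so two secrets can be compared without echoing either one.
# Assumes sha256sum (GNU coreutils) is installed.
fingerprint() {
  printf '%s' "$1" | sha256sum | cut -c1-12
}
```

Running `fingerprint "$VAULT_PASSWORD"` and `fingerprint "$K8S_PASSWORD"` after Step 4 should print the same 12 characters when the two values match.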
Resolution
Case 1: Passwords mismatch or K8s secret empty
# Re-run bootstrap to sync from Vault storage
cd /home/claude/ge-bootstrap/ge-ops/infrastructure/local/k3s/scripts/
./bootstrap-secrets.sh
# Restart executors
kubectl rollout restart deployment/ge-executor -n ge-agents
kubectl rollout restart deployment/ge-orchestrator -n ge-agents
# Verify logs show successful connection
kubectl logs -n ge-agents deployment/ge-executor --tail=50 | grep -i redis
Case 2: Vault storage password missing or empty
# Generate new password and store in Vault
NEW_PASSWORD=$(openssl rand -base64 32)
echo "{\"password\": \"$NEW_PASSWORD\"}" > /home/claude/ge-bootstrap/ge-ops/system/vault/secrets/ge/redis/credentials.json
# Update K8s secrets
cd /home/claude/ge-bootstrap/ge-ops/infrastructure/local/k3s/scripts/
./bootstrap-secrets.sh --force
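# Update Redis configuration
# Caution: this manifest contains only redis-password. If ge-secrets holds other keys,
# applying it may drop them; in that case update the single key rather than replacing the secret.
kubectl create secret generic ge-secrets \
--namespace=ge-system \
--from-literal=redis-password="$NEW_PASSWORD" \
--dry-run=client -o yaml | kubectl apply -f -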
# Restart Redis
kubectl rollout restart statefulset/redis -n ge-system
# Wait for Redis ready
kubectl rollout status statefulset/redis -n ge-system
# Restart executors
kubectl rollout restart deployment/ge-executor -n ge-agents
kubectl rollout restart deployment/ge-orchestrator -n ge-agents
Case 3: Redis pod not running
# Check Redis logs for errors
kubectl logs -n ge-system redis-0 --tail=100
# Common issues:
# - PVC not available: Check PersistentVolumeClaim
# - Resource limits: Check CPU/memory limits
# - Config error: Check ConfigMap
# Restart Redis
kubectl delete pod redis-0 -n ge-system
# Wait for new pod to start
kubectl get pods -n ge-system -w
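After a Redis restart it can help to poll the authenticated PING from Step 3 instead of watching pods by hand. The sketch below is an assumption-laden helper, not platform tooling: `retry` and `redis_ping_ok` are hypothetical names, and `redis_ping_ok` expects `REDIS_PASSWORD` to be set as in Step 3.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND...: run COMMAND until it succeeds,
# sleeping DELAY_SECONDS between tries; fail if every attempt fails.
retry() (
  attempts=$1; delay=$2; shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    "$@" && exit 0
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  exit 1
)

# redis_ping_ok: succeed only when the authenticated PING from Step 3 prints
# PONG. Requires REDIS_PASSWORD to be set in the calling shell.
redis_ping_ok() {
  kubectl exec -n ge-agents deployment/ge-executor -- \
    redis-cli -h redis.ge-system.svc.cluster.local -p 6381 -a "$REDIS_PASSWORD" PING 2>/dev/null |
    grep -q PONG
}
```

With both defined, `retry 20 3 redis_ping_ok` waits up to roughly a minute for Redis to accept the current password.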
Vault Authentication Failures
Symptoms
Error Messages:
- Vault authentication failed
- Failed to retrieve secret from Vault
- AppRole authentication failed: invalid role_id or secret_id
Where Errors Appear:
- Executor logs during startup
- Dolly logs during initialization
Diagnosis
Step 1: Check Vault AppRole secret exists
kubectl get secret ge-executor-secrets -n ge-agents -o jsonpath='{.data}' | jq
# Expected output:
# {
# "vault-role-id": "base64encodedvalue",
# "vault-secret-id": "base64encodedvalue"
# }
# Decode and verify not empty
kubectl get secret ge-executor-secrets -n ge-agents -o jsonpath='{.data.vault-role-id}' | base64 -d
kubectl get secret ge-executor-secrets -n ge-agents -o jsonpath='{.data.vault-secret-id}' | base64 -d
# Should return non-empty UUID-like strings
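The "UUID-like" check can be made mechanical. This is a minimal sketch that tests only the 8-4-4-4-12 hexadecimal shape; the helper name `is_uuid` is illustrative, and it is best applied to the role_id, since the storage file example in Step 2 describes the secret_id differently.

```shell
# is_uuid VALUE: succeed if VALUE has the 8-4-4-4-12 hexadecimal UUID shape.
# Shape check only; it does not prove the credential is valid.
is_uuid() {
  printf '%s\n' "$1" |
    grep -Eq '^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$'
}
```

For example, decode the role ID into a variable and run `is_uuid "$ROLE_ID" || echo "role_id has unexpected format"`.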
Step 2: Check Vault storage files
cat /home/claude/ge-bootstrap/ge-ops/system/vault/approles/ge-agents-credentials.json
# Expected output:
# {
# "role_id": "uuid-format-string",
# "secret_id": "long-base64-string"
# }
Step 3: Compare Vault storage vs K8s secret
# Vault storage role_id
VAULT_ROLE_ID=$(cat /home/claude/ge-bootstrap/ge-ops/system/vault/approles/ge-agents-credentials.json | jq -r .role_id)
# K8s secret role_id
K8S_ROLE_ID=$(kubectl get secret ge-executor-secrets -n ge-agents -o jsonpath='{.data.vault-role-id}' | base64 -d)
# Compare
if [ "$VAULT_ROLE_ID" = "$K8S_ROLE_ID" ]; then
  echo "Role IDs match ✓"
else
  echo "ROLE ID MISMATCH - Run bootstrap script"
fi
Step 4: Check executor logs for specific error
kubectl logs -n ge-agents deployment/ge-executor --tail=100 | grep -i vault
# Look for:
# - "Vault credentials not configured" → Env vars not set in deployment
# - "Authentication failed" → Credentials invalid or expired
# - "Access denied" → Policy doesn't grant access to secret paths
Resolution
Case 1: AppRole credentials missing or empty
# Generate new AppRole credentials
cd /home/claude/ge-bootstrap/ge-ops/infrastructure/local/k3s/scripts/
./bootstrap-secrets.sh --force
# Restart executors
kubectl rollout restart deployment/ge-executor -n ge-agents
kubectl rollout restart deployment/ge-orchestrator -n ge-agents
# Watch logs for successful authentication
kubectl logs -n ge-agents deployment/ge-executor -f | grep -i "authenticated"
Case 2: Credentials mismatch between Vault and K8s
# Re-sync from Vault storage (preserves existing credentials)
cd /home/claude/ge-bootstrap/ge-ops/infrastructure/local/k3s/scripts/
./bootstrap-secrets.sh
# Restart executors
kubectl rollout restart deployment/ge-executor -n ge-agents
# Verify authentication
kubectl logs -n ge-agents deployment/ge-executor --tail=50 | grep -i vault
Case 3: Environment variables not injected into pod
# Check deployment env vars configuration
kubectl get deployment ge-executor -n ge-agents -o yaml | grep -A20 "env:"
# Expected:
# env:
# - name: VAULT_ROLE_ID
# valueFrom:
# secretKeyRef:
# name: ge-executor-secrets
# key: vault-role-id
# - name: VAULT_SECRET_ID
# valueFrom:
# secretKeyRef:
# name: ge-executor-secrets
# key: vault-secret-id
# If missing: Deployment manifest needs update
# See: k8s/base/agents/executor.yaml
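The manifest can look correct while the running container still lacks the variables, so it is worth checking the live environment too. The sketch below reads `env` output and reports required names that are absent or empty without printing any values; the function name `missing_env_names` is a hypothetical helper.

```shell
# missing_env_names NAME...: read NAME=value lines (as printed by `env`) on
# stdin and print each required NAME that is absent or has an empty value.
# Values are never printed.
missing_env_names() {
  awk -v wanted="$*" '
    {
      i = index($0, "=")
      if (i > 1 && length($0) > i) seen[substr($0, 1, i - 1)] = 1
    }
    END {
      n = split(wanted, names, " ")
      for (k = 1; k <= n; k++) if (!(names[k] in seen)) print names[k]
    }
  '
}
```

A possible invocation is `kubectl exec -n ge-agents deployment/ge-executor -- env | missing_env_names VAULT_ROLE_ID VAULT_SECRET_ID`; empty output means both variables reached the container.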
Case 4: Vault pod not running (future: actual Vault deployment)
# Current implementation: File-based Vault storage (no pod)
# Future: HashiCorp Vault pod in K8s
# If Vault pod deployed:
kubectl get pods -n ge-system | grep vault
# Check if sealed
kubectl exec -n ge-system deployment/vault -- vault status
# If sealed, see: Platform Recovery Runbook - Vault Sealed Scenarios
CrashLoopBackOff Due to Auth Issues
Symptoms
Pod Status:
- Pod status: CrashLoopBackOff
- High restart count (RESTARTS: 5+)
- Pod repeatedly starts, runs for <30 seconds, then crashes
Common Causes:
1. Environment variable mismatch (app expects different var names)
2. Secrets not injected into pod (K8s secret missing)
3. Authentication fails on startup, app exits with error code
Diagnosis
Step 1: Check current logs
kubectl logs -n ge-agents <pod-name> --tail=100
Step 2: Check previous crash logs
kubectl logs -n ge-agents <pod-name> --previous
# This shows logs from the LAST crashed instance
# Critical for diagnosing startup failures
Step 3: Check pod events
kubectl describe pod -n ge-agents <pod-name>
# Events section shows:
# - CrashLoopBackOff events
# - Error exit codes
# - Back-off restart messages
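The exit code in the events helps separate an application-level auth failure from a pod that was killed for other reasons. The mapping below is a rough sketch based on the usual 128+signal convention; the function name `explain_exit_code` and the wording of its messages are illustrative.

```shell
# explain_exit_code CODE: print a short reading of a container exit code,
# using the common convention that codes above 128 mean "killed by signal".
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly" ;;
    137) echo "killed by SIGKILL (128+9), often the OOM killer" ;;
    143) echo "terminated by SIGTERM (128+15)" ;;
    *)
      if [ "$1" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $(( $1 - 128 ))"
      else
        echo "application error, check --previous logs"
      fi
      ;;
  esac
}
```

The last exit code can be read with `kubectl get pod <pod-name> -n ge-agents -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'`. An auth failure at startup usually falls into the "application error" branch, which points back to the `--previous` logs from Step 2.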
Step 4: Inspect application code for env var expectations
# If code is in hostPath mount:
grep -n "os.getenv\|os.environ" /home/claude/ge-bootstrap/ge_agent/runner.py
# Identify what environment variables the application expects
# Common patterns:
# - REDIS_URL vs REDIS_HOST + REDIS_PORT
# - VAULT_ADDR vs VAULT_HOST
# - ANTHROPIC_API_KEY vs API_KEY
Step 5: Review deployment manifest env vars
kubectl get deployment <deployment-name> -n ge-agents -o yaml | grep -A30 "env:"
# Compare manifest env var names to code expectations
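The comparison in Step 5 can be done with `comm` once both name lists are written one per line. This is a sketch: the function name and the two input files are hypothetical, with the expected names collected by hand from the Step 4 grep output.

```shell
# names_missing_from_manifest EXPECTED_FILE PROVIDED_FILE: print names listed
# in EXPECTED_FILE (one per line) that do not appear in PROVIDED_FILE.
names_missing_from_manifest() (
  tmp_expected=$(mktemp); tmp_provided=$(mktemp)
  sort -u "$1" > "$tmp_expected"
  sort -u "$2" > "$tmp_provided"
  comm -23 "$tmp_expected" "$tmp_provided"   # lines unique to the expected list
  rm -f "$tmp_expected" "$tmp_provided"
)
```

The provided names can be extracted with `kubectl get deployment <deployment-name> -n ge-agents -o jsonpath='{range .spec.template.spec.containers[0].env[*]}{.name}{"\n"}{end}'` and saved to a file before calling the function.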
Resolution
Case 1: Environment variable name mismatch
Example: App expects REDIS_URL, manifest provides REDIS_HOST + REDIS_PORT.
Fix deployment manifest:
# Option 1: Update manifest to provide REDIS_URL
kubectl patch deployment ge-executor -n ge-agents --type='json' -p='[
  {
    "op": "add",
    "path": "/spec/template/spec/containers/0/env/-",
    "value": {
      "name": "REDIS_URL",
      "value": "redis://redis.ge-system.svc.cluster.local:6381"
    }
  }
]'
# Option 2: Update application code to read separate env vars
# (Requires code change, not recommended for quick fix)
For detailed procedure, see: CrashLoopBackOff Environment Variables Pattern
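Because Redis on this platform requires a password, a REDIS_URL without credentials may simply move the failure from "connection refused" to "NOAUTH". The sketch below builds a URL in the widely used `redis://:password@host:port` form; whether the GE application parses that form is an assumption, and both helper names are hypothetical. The encoder handles ASCII passwords only, which covers the base64 passwords generated earlier in this guide.

```shell
# urlencode VALUE: percent-encode every character outside A-Z a-z 0-9 - . _ ~
# ASCII input only (sufficient for base64-generated passwords).
urlencode() (
  s=$1; out=
  while [ -n "$s" ]; do
    rest=${s#?}            # everything after the first character
    c=${s%"$rest"}         # the first character itself
    case "$c" in
      [A-Za-z0-9._~-]) out="$out$c" ;;
      *) out="$out$(printf '%%%02X' "'$c")" ;;
    esac
    s=$rest
  done
  printf '%s\n' "$out"
)

# redis_url HOST PORT PASSWORD: print redis://:<encoded password>@HOST:PORT
redis_url() {
  printf 'redis://:%s@%s:%s\n' "$(urlencode "$3")" "$1" "$2"
}
```

A URL that embeds the password should come from a Kubernetes secret rather than a literal `value` in the manifest, so that it does not appear in `kubectl get deployment -o yaml` output.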
Case 2: Secrets not injected
# Check if secret exists
kubectl get secret ge-executor-secrets -n ge-agents
# If missing:
cd /home/claude/ge-bootstrap/ge-ops/infrastructure/local/k3s/scripts/
./bootstrap-secrets.sh
# If exists but not injected:
# Verify deployment references correct secret name and keys
kubectl get deployment ge-executor -n ge-agents -o yaml | grep -B5 -A10 "secretKeyRef"
Case 3: Authentication fails on startup
# Check logs for auth failure pattern
kubectl logs -n ge-agents <pod-name> --previous | grep -i "auth\|failed\|error"
# If Redis auth fails: See "Redis Authentication Failures" section
# If Vault auth fails: See "Vault Authentication Failures" section
Anthropic API Authentication
Symptoms
Error Messages:
- 401 Unauthorized when calling Anthropic API
- ANTHROPIC_API_KEY not set
- Invalid API key format
Where Errors Appear:
- Executor logs during Claude API calls
- Task execution failures
Diagnosis
Step 1: Check API key secret exists
kubectl get secret anthropic-api-key -n ge-agents -o jsonpath='{.data.api_key}' | base64 -d
# Expected: String starting with "sk-ant-api03-"
# If empty or wrong format: API key invalid
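The format check can be done without leaving the full key in terminal scrollback. This sketch tests only the prefix named above and reports the key length; the function name `check_api_key_format` is illustrative, and a key that passes may still be revoked.

```shell
# check_api_key_format KEY: report whether KEY starts with the prefix named in
# this guide, printing only the prefix and the total length, never the key.
check_api_key_format() {
  case "$1" in
    sk-ant-api03-?*) printf 'ok sk-ant-api03-...(%d chars)\n' "${#1}" ;;
    "")              echo "empty" ;;
    *)               echo "unexpected format" ;;
  esac
}
```

For example, decode the secret into a variable as in Step 3 and run `check_api_key_format "$API_KEY"`.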
Step 2: Verify environment variable injection
# Check deployment references secret
kubectl get deployment ge-executor -n ge-agents -o yaml | grep -B5 -A5 "ANTHROPIC_API_KEY"
# Expected:
# - name: ANTHROPIC_API_KEY
# valueFrom:
# secretKeyRef:
# name: anthropic-api-key
# key: api_key
Step 3: Test API key manually
# Get API key
API_KEY=$(kubectl get secret anthropic-api-key -n ge-agents -o jsonpath='{.data.api_key}' | base64 -d)
# Test API call
curl -s https://api.anthropic.com/v1/messages \
-H "x-api-key: $API_KEY" \
-H "anthropic-version: 2023-06-01" \
-H "content-type: application/json" \
-d '{"model":"claude-3-haiku-20240307","max_tokens":10,"messages":[{"role":"user","content":"hi"}]}'
# Expected: JSON response with message
# If the response is an error object of type "authentication_error" (HTTP 401): API key invalid or expired
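Since `curl -s` does not show the HTTP status, capturing it separately makes the result easier to read. The mapping below is a sketch that covers the status codes most relevant to key problems; the function name `explain_anthropic_status` and the message wording are illustrative.

```shell
# explain_anthropic_status CODE: print a short reading of the HTTP status
# returned by the test call. Only key-related cases are spelled out.
explain_anthropic_status() {
  case "$1" in
    200) echo "key accepted" ;;
    401) echo "authentication_error: key invalid or revoked" ;;
    403) echo "permission_error: key lacks access to this resource" ;;
    429) echo "rate_limit_error: key is valid but throttled" ;;
    5??) echo "server-side error: retry before suspecting the key" ;;
    *)   echo "unexpected status $1" ;;
  esac
}
```

The status can be captured by adding `-o /dev/null -w '%{http_code}'` to the curl call in Step 3 and passing the result to the function.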
Resolution
Case 1: API key missing or empty
# Set environment variable
export ANTHROPIC_API_KEY="sk-ant-api03-YOUR_KEY_HERE"
# Re-run bootstrap
cd /home/claude/ge-bootstrap/ge-ops/infrastructure/local/k3s/scripts/
./bootstrap-secrets.sh --force
# Restart executors
kubectl rollout restart deployment/ge-executor -n ge-agents
Case 2: API key invalid or expired
# Get new API key from Anthropic dashboard
# https://console.anthropic.com/settings/keys
# Update environment variable
export ANTHROPIC_API_KEY="sk-ant-api03-NEW_KEY_HERE"
# Force regenerate secret
cd /home/claude/ge-bootstrap/ge-ops/infrastructure/local/k3s/scripts/
./bootstrap-secrets.sh --force
# Restart executors
kubectl rollout restart deployment/ge-executor -n ge-agents
Case 3: Secret exists but not injected
# Verify deployment manifest
kubectl get deployment ge-executor -n ge-agents -o yaml | grep -B10 -A10 "ANTHROPIC_API_KEY"
# If missing env var reference:
# Edit deployment: kubectl edit deployment ge-executor -n ge-agents
# Add env var section referencing anthropic-api-key secret
Preventive Measures
Before Platform Restarts
Always run pre-flight checks:
# 1. Verify secrets exist in Vault storage
ls -la /home/claude/ge-bootstrap/ge-ops/system/vault/secrets/ge/redis/
ls -la /home/claude/ge-bootstrap/ge-ops/system/vault/approles/
# 2. Backup current K8s secrets
kubectl get secrets -n ge-agents -o yaml > /tmp/secrets-backup-$(date +%Y%m%d-%H%M%S).yaml
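# 3. Verify ANTHROPIC_API_KEY is set (without printing the key)
[ -n "$ANTHROPIC_API_KEY" ] && echo "ANTHROPIC_API_KEY is set" || echo "ANTHROPIC_API_KEY is NOT set"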
# 4. Run bootstrap check
cd /home/claude/ge-bootstrap/ge-ops/infrastructure/local/k3s/scripts/
./bootstrap-secrets.sh --check
After Platform Restarts
Always verify services:
# 1. Check all pods running
kubectl get pods -A
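# 2. Test Redis authentication (an unauthenticated PING returns NOAUTH when a password is set)
REDIS_PASSWORD=$(kubectl get secret redis-credentials -n ge-agents -o jsonpath='{.data.password}' | base64 -d)
kubectl exec -n ge-agents deployment/ge-executor -- \
redis-cli -h redis.ge-system.svc.cluster.local -p 6381 -a "$REDIS_PASSWORD" PING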
# 3. Check executor logs
kubectl logs -n ge-agents deployment/ge-executor --tail=50
# 4. Verify agent tasks executing
kubectl logs -n ge-agents deployment/ge-executor -f
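The first check is easier to scan when healthy pods are filtered out. This sketch parses `kubectl get pods -A --no-headers` output; the function name `unhealthy_pods` is hypothetical, and the default threshold of 5 restarts follows the "RESTARTS: 5+" symptom described earlier.

```shell
# unhealthy_pods [MIN_RESTARTS]: read `kubectl get pods -A --no-headers`
# output on stdin and print "namespace/pod status restarts" for every pod
# that is not Running/Completed or has at least MIN_RESTARTS restarts
# (default 5).
unhealthy_pods() {
  awk -v min="${1:-5}" '
    ($4 != "Running" && $4 != "Completed") || $5 + 0 >= min + 0 {
      print $1 "/" $2, $4, $5
    }
  '
}
```

For example, `kubectl get pods -A --no-headers | unhealthy_pods` prints nothing when every pod is healthy.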
Regular Maintenance
Weekly:
- Check secret rotation schedule
- Review authentication failure logs
- Verify backup procedures
Monthly:
- Rotate secrets (currently manual, see bootstrap-secrets.sh --force)
- Audit secret access patterns
- Test recovery procedures
Quarterly:
- Full security audit
- Update documentation
- Review and improve procedures
Related Documentation
- Platform Recovery Runbook
- Secrets Management Architecture
- Secrets Bootstrap Guide
- CrashLoopBackOff Pattern
Emergency Contacts
| Issue Type | Contact | Method |
|---|---|---|
| Redis auth failures | Victoria (Security) | ge-ops/agents/victoria/inbox/ |
| Vault auth failures | Piotr (Secrets Manager) | ge-ops/agents/piotr/inbox/ |
| CrashLoopBackOff | Marije (Infrastructure) | ge-ops/agents/marije/inbox/ |
| API key issues | Human escalation | ge-ops/notifications/human/ |
| Critical incidents | Dirk-Jan (CEO) | ge-ops/notifications/human/ |
Document Version: 1.0
Last Updated: 2026-03-18
Maintained by: Annegreet (Knowledge Curator)
Next Review: 2026-04-18