
Platform Recovery Runbook

Last Updated: 2026-03-18
Status: Active
Maintained by: GE Infrastructure Team
Estimated Time: 30-60 minutes


Overview

This runbook provides step-by-step procedures for recovering the GE agent platform after a full shutdown, crash, or infrastructure failure. Following this runbook ensures secrets are preserved, services are started in the correct order, and all components are verified operational.

Recovery Scenarios Covered:
- Full platform restart (planned maintenance)
- K3s cluster failure recovery
- Secrets management system recovery
- Individual service failures



Pre-Recovery Checklist

CRITICAL: Complete these checks BEFORE starting recovery to prevent data loss.

1. Verify Secrets Exist

Purpose: Prevent accidental secret overwrites that cause authentication failures.

# Check Vault storage
ls -la /home/claude/ge-bootstrap/ge-ops/system/vault/secrets/ge/redis/
ls -la /home/claude/ge-bootstrap/ge-ops/system/vault/approles/

# Verify Redis password exists
cat /home/claude/ge-bootstrap/ge-ops/system/vault/secrets/ge/redis/credentials.json

# Verify AppRole credentials exist
cat /home/claude/ge-bootstrap/ge-ops/system/vault/approles/ge-agents-credentials.json

Expected Output:
- Redis credentials file exists with a non-empty password field
- AppRole credentials file exists with non-empty role_id and secret_id fields

If secrets are missing or empty: STOP. Do not proceed with recovery. See Emergency Secret Recovery.
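
The cat commands above print the credentials to the terminal. A minimal sketch of the same check that confirms the fields are non-empty without echoing their values follows. The helper name check_json_fields is hypothetical, and the pattern match assumes flat JSON with string values; it is not a JSON parser.

```shell
# Hypothetical helper: succeed only if FILE exists and every named field
# holds a non-empty string. Values are never printed.
# Assumption: flat JSON with string values (pattern match, not a parser).
check_json_fields() {
  local file="$1"; shift
  if [ ! -s "$file" ]; then
    echo "MISSING OR EMPTY FILE: $file" >&2
    return 1
  fi
  local field
  for field in "$@"; do
    if ! grep -Eq "\"${field}\"[[:space:]]*:[[:space:]]*\"[^\"]+\"" "$file"; then
      echo "EMPTY OR MISSING FIELD: $field in $file" >&2
      return 1
    fi
  done
  echo "OK: $file"
}

# Usage (paths from this runbook):
# VAULT_DIR=/home/claude/ge-bootstrap/ge-ops/system/vault
# check_json_fields "$VAULT_DIR/secrets/ge/redis/credentials.json" password
# check_json_fields "$VAULT_DIR/approles/ge-agents-credentials.json" role_id secret_id
```

A non-zero exit status from either call means the STOP condition applies.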

2. Check Environment Variables

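# Verify critical environment variables are set
# (print only the key prefix so the full API key does not end up in terminal scrollback)
echo "${ANTHROPIC_API_KEY:0:13}"
echo "$KUBECONFIG"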

Expected Output:
- ANTHROPIC_API_KEY: Should start with sk-ant-api03-
- KUBECONFIG: Should be /etc/rancher/k3s/k3s.yaml or similar

If missing: Export required variables:

export ANTHROPIC_API_KEY="sk-ant-api03-YOUR_KEY_HERE"
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml

3. Verify K3s Cluster Status

# Check K3s service
sudo systemctl status k3s

# If down, start it
sudo systemctl start k3s

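# Wait for the cluster to become ready (typically 30-60 seconds)
# Re-run this command if the API server is not reachable yet
kubectl wait --for=condition=Ready nodes --all --timeout=120s

# Verify node is Ready
kubectl get nodes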

Expected Output:

NAME             STATUS   ROLES                  AGE   VERSION
fort-knox-dev    Ready    control-plane,master   45d   v1.28.5+k3s1
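
Right after K3s starts, the API server may refuse connections for a short time, so a single readiness check can fail early. One way to cover the 30-60 second window is a small polling loop, sketched below. The function names retry_until and node_ready are hypothetical, and the timeout values are assumptions to adjust.

```shell
# Hypothetical helper: run a command every INTERVAL seconds until it
# succeeds or roughly TIMEOUT seconds of waiting have accumulated.
retry_until() {
  local timeout="$1" interval="$2"; shift 2
  local waited=0
  until "$@"; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "Timed out after ${waited}s: $*" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Assumed readiness probe: every node reports the Ready condition.
node_ready() {
  kubectl wait --for=condition=Ready nodes --all --timeout=5s >/dev/null 2>&1
}

# Usage:
# retry_until 120 5 node_ready && kubectl get nodes
```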

4. Back Up Current State

# Backup current K8s secrets (if cluster is up)
kubectl get secrets -A -o yaml > /tmp/k8s-secrets-backup-$(date +%Y%m%d-%H%M%S).yaml

# Backup Vault storage
cp -r /home/claude/ge-bootstrap/ge-ops/system/vault /tmp/vault-backup-$(date +%Y%m%d-%H%M%S)
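
These backups contain secrets in readable form, and files created in /tmp with a default umask are usually readable by other local users. A sketch of a copy that restricts the backup to its owner is given below. The function name backup_dir is hypothetical; the directory naming follows the commands above.

```shell
# Hypothetical helper: copy the contents of SRC into
# DEST_ROOT/<NAME>-<timestamp> with owner-only permissions,
# then print the backup path.
backup_dir() {
  local src="$1" dest_root="$2" name="$3"
  local dest="${dest_root}/${name}-$(date +%Y%m%d-%H%M%S)"
  (
    umask 077   # new files and directories: owner access only
    mkdir -p "$dest" && cp -R "$src/." "$dest/"
  ) || return 1
  chmod -R go-rwx "$dest"   # belt and braces for files copied with wider modes
  echo "$dest"
}

# Usage:
# backup_dir /home/claude/ge-bootstrap/ge-ops/system/vault /tmp vault-backup
```

The same concern applies to the K8s secrets dump; running umask 077 in the shell before the kubectl redirect has a similar effect.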

Full Platform Recovery

Phase 1: Infrastructure Prerequisites

Goal: Ensure K3s cluster and networking are operational.

# 1. Start K3s (if not running)
sudo systemctl start k3s

# 2. Verify cluster access
kubectl cluster-info

# 3. Check namespace readiness
kubectl get namespaces | grep ge-

# Expected namespaces:
# - ge-agents
# - ge-system
# - ge-monitoring

If namespaces don't exist: They will be created in Phase 3.

Phase 2: Secrets Bootstrap

Goal: Restore secrets from Vault storage to K8s secrets WITHOUT overwriting.

IMPORTANT: This step uses the secrets bootstrap script to safely restore secrets without generating new ones (unless forced).

# Navigate to bootstrap scripts
cd /home/claude/ge-bootstrap/ge-ops/infrastructure/local/k3s/scripts/

# Run bootstrap in check mode first (verify secrets exist)
./bootstrap-secrets.sh --check

# If check passes, run bootstrap (idempotent, safe)
./bootstrap-secrets.sh

# Verify K8s secrets created
kubectl get secrets -n ge-agents

Expected Output:

NAME                  TYPE     DATA   AGE
ge-executor-secrets   Opaque   2      1m
redis-credentials     Opaque   1      1m
anthropic-api-key     Opaque   1      1m

CRITICAL: If bootstrap fails with "secrets already exist", this is EXPECTED. The script is idempotent and skips existing secrets. Verify that the secrets contain data:

# Verify secrets contain data (not just keys)
# Decode one key at a time: the whole .data map is JSON, not base64
kubectl get secret ge-executor-secrets -n ge-agents -o jsonpath='{.data.vault-role-id}' | base64 -d
kubectl get secret redis-credentials -n ge-agents -o jsonpath='{.data.password}' | base64 -d

If secrets exist but are empty: STOP. Delete the empty secrets and re-run bootstrap:

kubectl delete secret ge-executor-secrets -n ge-agents
kubectl delete secret redis-credentials -n ge-agents
kubectl delete secret anthropic-api-key -n ge-agents

# Re-run bootstrap
./bootstrap-secrets.sh
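
The "exists but empty" case can be detected without printing any secret value. The sketch below reports only the decoded length of a key. The function name secret_key_nonempty is hypothetical; the key names password and vault-role-id are the ones used elsewhere in this runbook.

```shell
# Hypothetical helper: succeed if KEY of secret NAME in namespace NS
# decodes to a non-empty value. Only the decoded length is reported.
secret_key_nonempty() {
  local ns="$1" name="$2" key="$3" encoded decoded
  encoded=$(kubectl get secret "$name" -n "$ns" -o "jsonpath={.data.${key}}" 2>/dev/null) || {
    echo "MISSING: $ns/$name" >&2
    return 1
  }
  decoded=$(printf '%s' "$encoded" | base64 -d 2>/dev/null)
  if [ -z "$decoded" ]; then
    echo "EMPTY: $ns/$name key=$key" >&2
    return 1
  fi
  echo "OK: $ns/$name key=$key (${#decoded} chars)"
}

# Usage:
# secret_key_nonempty ge-agents redis-credentials password
# secret_key_nonempty ge-agents ge-executor-secrets vault-role-id
```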

Phase 3: Core Services

Goal: Deploy Redis, monitoring, and base infrastructure.

# Deploy core services using startup script
cd /home/claude/ge-bootstrap/tools/

# Run startup script (phases: namespaces, secrets, core, monitoring)
./ge-platform-startup.sh --phase core

# Verify Redis is running
kubectl get pods -n ge-system | grep redis

# Expected: redis-0 Running

Wait for Redis readiness:

# Watch Redis pod until Running
kubectl get pods -n ge-system -w

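# Test Redis connection (expect PONG; a NOAUTH error also shows the server is up
# but requires the password, see Post-Recovery Verification step 2)
kubectl exec -n ge-system redis-0 -- redis-cli PING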

If Redis fails to start: See Troubleshooting Redis Authentication.
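
When Redis requires a password, an unauthenticated PING returns an error even though the server is healthy. The sketch below separates "server down" from "server up, authentication needed" based on the reply text. The function name classify_redis_reply and its labels are hypothetical; the matched strings are the usual Redis error prefixes.

```shell
# Hypothetical helper: interpret the text printed by `redis-cli PING`.
# Exit status: 0 = ready, 2 = up but authentication problem, 1 = unreachable.
classify_redis_reply() {
  case "$1" in
    PONG*)       echo "ready" ;;
    *NOAUTH*)    echo "up-auth-required"; return 2 ;;
    *WRONGPASS*) echo "up-wrong-password"; return 2 ;;
    *)           echo "unreachable"; return 1 ;;
  esac
}

# Usage:
# reply=$(kubectl exec -n ge-system redis-0 -- redis-cli PING 2>&1)
# classify_redis_reply "$reply"
```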

Phase 4: Agent Executors

Goal: Deploy agent execution pods (Dolly, GE Executor).

# Deploy agents (run from the tools directory used in Phase 3)
cd /home/claude/ge-bootstrap/tools/
./ge-platform-startup.sh --phase agents

# Verify executor pods
kubectl get pods -n ge-agents

# Expected:
# - ge-orchestrator-* Running
# - ge-executor-* Running (1+ pods)

Wait for executor readiness:

# Watch until all pods Running
kubectl get pods -n ge-agents -w

# Check executor logs for successful Vault authentication
kubectl logs -n ge-agents -l app=ge-executor --tail=50

# Expected log line:
# "Successfully authenticated to Vault using AppRole"

If executors fail: See Troubleshooting Executor CrashLoopBackOff.

Phase 5: Monitoring and Ingress

Goal: Deploy Grafana, Traefik, and monitoring services.

# Deploy monitoring
./ge-platform-startup.sh --phase monitoring

# Deploy ingress
./ge-platform-startup.sh --phase ingress

# Verify Grafana
kubectl get pods -n ge-monitoring | grep grafana

# Verify Traefik
kubectl get pods -n ge-ingress | grep traefik

Phase 6: Client Workloads (If Applicable)

Goal: Restore any client hosting workloads.

# List existing client namespaces
kubectl get namespaces | grep -E 'sh-|ded-'

# Deploy clients using startup script
./ge-platform-startup.sh --phase clients

# Or deploy manually:
cd /home/claude/ge-bootstrap/k8s/overlays/
kubectl apply -k <client-namespace>/
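
Phases 3 through 6 call the startup script in a fixed order, and a later phase should not run if an earlier one failed. A sketch of a wrapper that enforces this is shown below. The function name run_phases is hypothetical; the phase names are the ones used in this runbook. It does not replace the verification steps between phases.

```shell
# Hypothetical wrapper: run the given phases in order and stop at the
# first failure so later phases never start on a broken base.
run_phases() {
  local script="$1"; shift
  local phase
  for phase in "$@"; do
    echo "==> phase: $phase"
    if ! "$script" --phase "$phase"; then
      echo "Phase '$phase' failed; stopping before later phases" >&2
      return 1
    fi
  done
}

# Usage (from /home/claude/ge-bootstrap/tools/):
# run_phases ./ge-platform-startup.sh core agents monitoring ingress clients
```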

Post-Recovery Verification

1. Service Health Checks

# Check all namespaces
kubectl get pods -A

# Expected: All pods Running or Completed
# Not expected: CrashLoopBackOff, Error, ImagePullBackOff
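
On a busy cluster the full pod list is long, so it can help to print only the pods that violate the expectation above. The sketch below filters the output of kubectl get pods -A --no-headers. The function name unhealthy_pods is hypothetical, and it looks only at the STATUS column, so a Running pod that is not yet ready is not flagged.

```shell
# Hypothetical filter: read `kubectl get pods -A --no-headers` on stdin and
# print namespace/pod and status for anything not Running or Completed.
# Exits non-zero if at least one such pod was found.
unhealthy_pods() {
  awk 'NF >= 4 && $4 != "Running" && $4 != "Completed" {
         print $1 "/" $2 " " $4
         bad = 1
       }
       END { exit bad + 0 }'
}

# Usage:
# kubectl get pods -A --no-headers | unhealthy_pods
```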

2. Redis Connectivity

# Test Redis authentication from executor
# (the password is read locally and quoted so special characters survive)
REDIS_PASSWORD="$(kubectl get secret redis-credentials -n ge-agents -o jsonpath='{.data.password}' | base64 -d)"
kubectl exec -n ge-agents deployment/ge-executor -- \
  redis-cli -h redis.ge-system.svc.cluster.local -p 6381 -a "$REDIS_PASSWORD" PING

# Expected output: PONG

3. Vault Authentication

# Check executor logs for Vault authentication success
kubectl logs -n ge-agents -l app=ge-executor --tail=100 | grep -i vault

# Expected log lines:
# "Successfully authenticated to Vault"
# "Retrieved secrets from Vault: ['redis', 'anthropic']"

4. ge-orchestrator

# Check ge-orchestrator logs for Redis connection
kubectl logs -n ge-agents deployment/ge-orchestrator --tail=100

# Expected log lines:
# "Connected to Redis at redis.ge-system.svc.cluster.local:6381"
# "ge-orchestrator started successfully"

5. API Endpoints

# Test admin-ui health endpoint
curl -I https://office.growing-europe.com/health

# Expected: HTTP 200 OK

# Test client workload (if deployed)
curl -I https://test-acme.hosting.growing-europe.com

# Expected: HTTP 200 OK
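
Reading curl -I output by eye is fine for one endpoint; for several, a status-code comparison is easier to script. The sketch below does that. The function names http_status and expect_http_200 are hypothetical, and the 10 second timeout is an assumption.

```shell
# Hypothetical helpers: fetch only the HTTP status code and compare it to 200.
http_status() {
  curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1"
}

expect_http_200() {
  local code
  code=$(http_status "$1")
  if [ "$code" = "200" ]; then
    echo "OK $1"
  else
    echo "FAIL $1 (HTTP ${code:-none})" >&2
    return 1
  fi
}

# Usage:
# expect_http_200 https://office.growing-europe.com/health
# expect_http_200 https://test-acme.hosting.growing-europe.com
```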

6. Agent Task Execution

# Monitor agent execution
kubectl logs -n ge-agents -l app=ge-executor -f

# Expected: Agent tasks being picked up and executed
# Watch for completion signals and handoffs

Recovery Complete: All services are operational. Platform is ready for production use.


Known Issues and Workarounds

Issue 1: k8s/base/core/secrets.yaml Contains Empty Placeholders

Problem: The secrets manifest file contains placeholder values (empty strings). If kubectl apply is run directly, it overwrites K8s secrets with empty values, breaking authentication.

Impact: Agents cannot authenticate to Redis or Vault, all tasks fail.

Workaround:
- ALWAYS use bootstrap-secrets.sh to populate secrets from Vault storage
- NEVER run kubectl apply -f k8s/base/core/secrets.yaml directly
- The startup script now checks for existing secrets and skips them if present

Root Cause: Secrets manifest is designed as a template, not for direct application. The bootstrap script is the correct entry point.
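
A pre-apply guard can catch a template manifest before it overwrites live secrets. The sketch below flags any line whose value is an empty quoted string. The function name has_empty_placeholders is hypothetical, and the pattern assumes the placeholders are written as "" or ''; it does not parse YAML.

```shell
# Hypothetical guard: succeed (exit 0) if FILE has a non-comment line of the
# form `key: ""` or `key: ''`, i.e. it looks like an unfilled template.
has_empty_placeholders() {
  grep -Eq "^[^#]*:[[:space:]]*(\"\"|'')[[:space:]]*\$" "$1"
}

# Usage:
# if has_empty_placeholders k8s/base/core/secrets.yaml; then
#   echo "Template manifest detected; use bootstrap-secrets.sh instead" >&2
# fi
```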

Status: Working as designed. Documentation updated (2026-01-29).

Issue 2: Vault AppRole Needs Configuration

Problem: Initial deployment does not configure Vault AppRole automatically. Executors expect vault-role-id and vault-secret-id in K8s secrets.

Impact: Executors cannot authenticate to Vault, fall back to env vars (less secure).

Workaround:
- Use bootstrap-secrets.sh to generate AppRole credentials
- Script generates role-id and secret-id, stores in Vault storage
- Subsequent runs reuse existing credentials (idempotent)

Status: Resolved by bootstrap script (2026-01-29).

Issue 3: Dolly Fallback to REDIS_PASSWORD Env Var

Problem: When Vault authentication fails, Dolly falls back to reading REDIS_PASSWORD env var directly from K8s secrets.

Impact: Secrets exposed via environment variables (standard K8s practice, but less secure than Vault).

Workaround:
- Ensure Vault authentication succeeds (AppRole configured)
- Monitor Dolly logs for fallback warnings
- Fallback is acceptable for cold start, but should not persist

Status: Monitoring required. Victoria tracking (INB-20260129-secrets-security-review).

Issue 4: Grafana Cannot Reach External Hosts

Problem: Network policies block Grafana egress to grafana.com (plugin installation).

Impact: Cannot install Grafana plugins at runtime. Pre-installed plugins work fine.

Workaround:
- Pre-install required plugins in Grafana Docker image
- Or adjust network policy to allow egress to grafana.com (security review required)

Status: Low priority. Monitoring as-is (2026-01-29).


Troubleshooting Common Failures

Troubleshooting Redis Authentication

Symptom: Redis logs show "NOAUTH Authentication required" errors.

Cause: Redis expects a password, but clients are not providing it.

Diagnosis:

# Check Redis logs
kubectl logs -n ge-system redis-0 --tail=100

# Check if redis-credentials secret exists and has data
kubectl get secret redis-credentials -n ge-system -o yaml

# Verify password is not empty
kubectl get secret redis-credentials -n ge-system -o jsonpath='{.data.password}' | base64 -d

Solution:

# If secret missing or empty:
# (--force generates new secrets, so every Redis client must be restarted afterwards)
cd /home/claude/ge-bootstrap/ge-ops/infrastructure/local/k3s/scripts/
./bootstrap-secrets.sh --force

# Restart Redis
kubectl rollout restart statefulset/redis -n ge-system

# Restart executors
kubectl rollout restart deployment/ge-executor -n ge-agents
kubectl rollout restart deployment/ge-orchestrator -n ge-agents

Troubleshooting Executor CrashLoopBackOff

Symptom: Executor pods repeatedly crash and restart.

Cause: Environment variable mismatch, missing secrets, or Vault authentication failure.

Diagnosis:

# Check executor logs
kubectl logs -n ge-agents -l app=ge-executor --tail=100

# Check previous crash logs
kubectl logs -n ge-agents <pod-name> --previous

# Check pod events
kubectl describe pod -n ge-agents <pod-name>

Common Log Patterns:

  1. "Connection refused to localhost:6379"
     - Application expects REDIS_URL env var
     - Deployment provides REDIS_HOST + REDIS_PORT separately
     - See pattern: CrashLoopBackOff Environment Variables

  2. "Vault authentication failed"
     - AppRole credentials missing or incorrect
     - Run ./bootstrap-secrets.sh to regenerate
     - Verify ge-executor-secrets exists with data

  3. "ANTHROPIC_API_KEY not set"
     - Environment variable not set before bootstrap
     - Export ANTHROPIC_API_KEY and re-run bootstrap
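
The log patterns above can be matched mechanically when scanning many lines of executor output. The sketch below maps each known message to its first remediation hint. The function name executor_log_hint is hypothetical; the matched strings and hints come from the list above.

```shell
# Hypothetical helper: map a log line to a remediation hint from this runbook.
# Returns non-zero when the line matches none of the known patterns.
executor_log_hint() {
  case "$1" in
    *"Connection refused to localhost:6379"*)
      echo "Check REDIS_URL vs REDIS_HOST/REDIS_PORT in the deployment env" ;;
    *"Vault authentication failed"*)
      echo "Check AppRole credentials in ge-executor-secrets; re-run bootstrap-secrets.sh" ;;
    *"ANTHROPIC_API_KEY not set"*)
      echo "Export ANTHROPIC_API_KEY and re-run bootstrap-secrets.sh" ;;
    *)
      return 1 ;;
  esac
}

# Usage:
# kubectl logs -n ge-agents -l app=ge-executor --tail=100 |
#   while IFS= read -r line; do executor_log_hint "$line"; done | sort -u
```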

Solution:

# Fix 1: Verify secrets exist
kubectl get secret ge-executor-secrets -n ge-agents -o yaml

# Fix 2: Re-run bootstrap
cd /home/claude/ge-bootstrap/ge-ops/infrastructure/local/k3s/scripts/
./bootstrap-secrets.sh

# Fix 3: Check deployment env vars
kubectl get deployment ge-executor -n ge-agents -o yaml | grep -A10 "env:"

# Fix 4: Restart deployment
kubectl rollout restart deployment/ge-executor -n ge-agents

Troubleshooting Vault Authentication Failures

Symptom: Logs show "Vault authentication failed" or "Failed to retrieve secret from Vault".

Cause: AppRole credentials missing, expired, or incorrect.

Diagnosis:

# Check AppRole credentials exist in Vault storage
cat /home/claude/ge-bootstrap/ge-ops/system/vault/approles/ge-agents-credentials.json

# Check K8s secret (values are shown base64-encoded)
kubectl get secret ge-executor-secrets -n ge-agents -o jsonpath='{.data}' | jq .

# Verify base64 encoding is correct
kubectl get secret ge-executor-secrets -n ge-agents -o jsonpath='{.data.vault-role-id}' | base64 -d

Solution:

# Regenerate AppRole credentials
cd /home/claude/ge-bootstrap/ge-ops/infrastructure/local/k3s/scripts/
./bootstrap-secrets.sh --force

# Restart executor pods
kubectl rollout restart deployment/ge-executor -n ge-agents
kubectl rollout restart deployment/ge-orchestrator -n ge-agents

# Watch logs for successful authentication
kubectl logs -n ge-agents -l app=ge-executor -f | grep -i vault

Emergency Secret Recovery

Scenario: Secrets were accidentally overwritten or deleted during recovery.

CRITICAL: Do NOT proceed with normal recovery if secrets are lost. Follow these steps:

  1. Check backups:

    ls -la /tmp/vault-backup-*
    ls -la /tmp/k8s-secrets-backup-*
    

  2. Restore from backup (if available):

    # Restore Vault storage
    cp -r /tmp/vault-backup-YYYYMMDD-HHMMSS/* /home/claude/ge-bootstrap/ge-ops/system/vault/
    
    # Restore K8s secrets
    kubectl apply -f /tmp/k8s-secrets-backup-YYYYMMDD-HHMMSS.yaml
    

  3. If no backup available:

    # Contact human for secret recovery
    # Secrets may need to be regenerated from secure storage
    # This requires:
    # - New Redis password (forces Redis restart)
    # - New Vault AppRole credentials
    # - Re-entry of ANTHROPIC_API_KEY
    

  4. Regenerate all secrets (last resort):

    # WARNING: This breaks all existing connections
    export ANTHROPIC_API_KEY="sk-ant-api03-YOUR_KEY"
    
    cd /home/claude/ge-bootstrap/ge-ops/infrastructure/local/k3s/scripts/
    ./bootstrap-secrets.sh --force
    
    # Restart all services
    kubectl delete pods -n ge-system --all
    kubectl delete pods -n ge-agents --all
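
For steps 1 and 2, the YYYYMMDD-HHMMSS suffix sorts chronologically, so the most recent backup is the last match of the glob. A sketch for picking it is shown below. The function name latest_backup is hypothetical; the prefixes are the ones created in the Pre-Recovery Checklist.

```shell
# Hypothetical helper: print the newest entry named DIR/PREFIX-<timestamp>.
# Relies on the timestamp format sorting chronologically; returns non-zero
# when no backup exists.
latest_backup() {
  local dir="$1" prefix="$2" latest="" entry
  for entry in "$dir"/"$prefix"-*; do
    [ -e "$entry" ] || continue
    latest="$entry"   # glob results are sorted, so the last match is newest
  done
  [ -n "$latest" ] && echo "$latest"
}

# Usage:
# latest_backup /tmp vault-backup
# latest_backup /tmp k8s-secrets-backup
```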
    


Partial Recovery Procedures

Recover Only Redis

# Restart Redis statefulset
kubectl rollout restart statefulset/redis -n ge-system

# Wait for ready
kubectl rollout status statefulset/redis -n ge-system

# Verify
kubectl exec -n ge-system redis-0 -- redis-cli PING

Recover Only Executors

# Restart executor deployment
kubectl rollout restart deployment/ge-executor -n ge-agents
kubectl rollout restart deployment/ge-orchestrator -n ge-agents

# Wait for ready
kubectl rollout status deployment/ge-executor -n ge-agents

# Verify logs
kubectl logs -n ge-agents -l app=ge-executor --tail=50

Recover Only Monitoring

# Restart Grafana
kubectl rollout restart deployment/grafana -n ge-monitoring

# Verify Grafana UI
# (assumes Grafana is reachable on localhost:3000, for example through a port-forward)
curl -I http://localhost:3000/

Emergency Contacts

| Role                | Agent           | Contact Method                | Purpose                                |
|---------------------|-----------------|-------------------------------|----------------------------------------|
| Secrets Manager     | Piotr           | ge-ops/agents/piotr/inbox/    | AppRole issues, secret rotation        |
| Security Lead       | Victoria        | ge-ops/agents/victoria/inbox/ | Security incidents, secret exposure    |
| Infrastructure Lead | Marije          | ge-ops/agents/marije/inbox/   | K3s issues, networking                 |
| Knowledge Curator   | Annegreet       | ge-ops/agents/annegreet/inbox/ | Documentation updates, pattern capture |
| Human Escalation    | Dirk-Jan (CEO)  | ge-ops/notifications/human/   | Critical failures, data loss           |


Document Version: 1.0 Last Updated: 2026-03-18 Maintained by: Annegreet (Knowledge Curator) Next Review: 2026-04-18