Platform Recovery Runbook
Last Updated: 2026-03-18
Status: Active
Maintained by: GE Infrastructure Team
Estimated Time: 30-60 minutes
Overview
This runbook provides step-by-step procedures for recovering the GE agent platform after a full shutdown, crash, or infrastructure failure. Following this runbook ensures secrets are preserved, services are started in the correct order, and all components are verified operational.
Recovery Scenarios Covered:
- Full platform restart (planned maintenance)
- K3s cluster failure recovery
- Secrets management system recovery
- Individual service failures
Table of Contents
- Pre-Recovery Checklist
- Full Platform Recovery
- Post-Recovery Verification
- Known Issues and Workarounds
- Troubleshooting Common Failures
- Partial Recovery Procedures
- Emergency Contacts
Pre-Recovery Checklist
CRITICAL: Complete these checks BEFORE starting recovery to prevent data loss.
1. Verify Secrets Exist
Purpose: Prevent accidental secret overwrites that cause authentication failures.
# Check Vault storage
ls -la /home/claude/ge-bootstrap/ge-ops/system/vault/secrets/ge/redis/
ls -la /home/claude/ge-bootstrap/ge-ops/system/vault/approles/
# Verify Redis password exists
cat /home/claude/ge-bootstrap/ge-ops/system/vault/secrets/ge/redis/credentials.json
# Verify AppRole credentials exist
cat /home/claude/ge-bootstrap/ge-ops/system/vault/approles/ge-agents-credentials.json
Expected Output:
- Redis credentials file exists with password field (not empty)
- AppRole credentials file exists with role_id and secret_id (not empty)
If secrets are missing or empty: STOP. Do not proceed with recovery. See Emergency Secret Recovery.
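The `cat` commands above print the secret values to the terminal. Where that is undesirable, the same check can be done without echoing them. A minimal sketch, assuming `jq` is installed and that the credential files are flat JSON objects with the field names listed above; `check_json_field` is a hypothetical helper, not part of the platform tooling:

```shell
# Hypothetical helper: report whether a JSON file has a non-empty field,
# without printing the field's value. Requires jq.
check_json_field() {
  local file="$1" field="$2"
  if [ ! -s "$file" ]; then
    echo "MISSING: $file"
    return 1
  fi
  # jq -e exits non-zero when the final expression is false or null.
  if jq -e --arg f "$field" '(.[$f] // "") | tostring | length > 0' "$file" >/dev/null 2>&1; then
    echo "OK: $field present in $file"
  else
    echo "EMPTY: $field in $file"
    return 1
  fi
}
```

For example, `check_json_field /home/claude/ge-bootstrap/ge-ops/system/vault/secrets/ge/redis/credentials.json password` reports OK, EMPTY, or MISSING and returns non-zero in the latter two cases.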
2. Check Environment Variables
Expected Output:
- ANTHROPIC_API_KEY: Should start with sk-ant-api03-
- KUBECONFIG: Should be /etc/rancher/k3s/k3s.yaml or similar
If missing: Export required variables:
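One way to run this check without printing the API key is sketched below. `check_platform_env` is a hypothetical helper; the prefix and the kubeconfig path are the ones listed under Expected Output, and the export lines are placeholders rather than real values:

```shell
# Hypothetical helper: verify the two variables this runbook relies on.
# The API key is only matched against its expected prefix, never printed.
check_platform_env() {
  local status=0
  case "${ANTHROPIC_API_KEY:-}" in
    sk-ant-api03-*) echo "ANTHROPIC_API_KEY: set" ;;
    "") echo "ANTHROPIC_API_KEY: missing"; status=1 ;;
    *) echo "ANTHROPIC_API_KEY: unexpected prefix"; status=1 ;;
  esac
  if [ -n "${KUBECONFIG:-}" ]; then
    echo "KUBECONFIG: $KUBECONFIG"
  else
    echo "KUBECONFIG: missing"
    status=1
  fi
  return "$status"
}

# If something is missing, export it in the current shell, for example:
#   export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
#   export ANTHROPIC_API_KEY=<your-key>   # placeholder; never commit the real value
```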
3. Verify K3s Cluster Status
# Check K3s service
sudo systemctl status k3s
# If down, start it
sudo systemctl start k3s
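# Wait for the node to report Ready (typically 30-60 seconds)
kubectl wait --for=condition=Ready node --all --timeout=120s
# Verify node is Ready
kubectl get nodes
Expected Output: the node is listed with STATUS Ready.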
4. Backup Current State (Optional but Recommended)
# Backup current K8s secrets (if cluster is up)
kubectl get secrets -A -o yaml > /tmp/k8s-secrets-backup-$(date +%Y%m%d-%H%M%S).yaml
# Backup Vault storage
cp -r /home/claude/ge-bootstrap/ge-ops/system/vault /tmp/vault-backup-$(date +%Y%m%d-%H%M%S)
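Because these backups contain secret material, it may be worth creating them with owner-only permissions. A minimal sketch of the Vault storage backup under that assumption; `backup_vault_storage` is a hypothetical helper, and the naming follows the `vault-backup-<timestamp>` pattern used above:

```shell
# Hypothetical helper: copy a directory to <dest_root>/vault-backup-<timestamp>
# with owner-only permissions, and print the path of the new backup.
backup_vault_storage() {
  local src="$1" dest_root="$2" stamp dest
  stamp=$(date +%Y%m%d-%H%M%S)
  dest="$dest_root/vault-backup-$stamp"
  # The subshell keeps the restrictive umask from leaking into the caller.
  ( umask 077 && mkdir -p "$dest_root" && cp -r "$src" "$dest" ) || return 1
  echo "$dest"
}
```

Usage: `backup_vault_storage /home/claude/ge-bootstrap/ge-ops/system/vault /tmp`. On many systems `/tmp` is cleared on reboot, so a persistent destination may be a safer choice if the host might restart during recovery.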
Full Platform Recovery
Phase 1: Infrastructure Prerequisites
Goal: Ensure K3s cluster and networking are operational.
# 1. Start K3s (if not running)
sudo systemctl start k3s
# 2. Verify cluster access
kubectl cluster-info
# 3. Check namespace readiness
kubectl get namespaces | grep ge-
# Expected namespaces:
# - ge-agents
# - ge-system
# - ge-monitoring
If namespaces don't exist: They will be created in Phase 3.
Phase 2: Secrets Bootstrap
Goal: Restore secrets from Vault storage to K8s secrets WITHOUT overwriting.
IMPORTANT: This step uses the secrets bootstrap script to safely restore secrets without generating new ones (unless forced).
# Navigate to bootstrap scripts
cd /home/claude/ge-bootstrap/ge-ops/infrastructure/local/k3s/scripts/
# Run bootstrap in check mode first (verify secrets exist)
./bootstrap-secrets.sh --check
# If check passes, run bootstrap (idempotent, safe)
./bootstrap-secrets.sh
# Verify K8s secrets created
kubectl get secrets -n ge-agents
Expected Output:
NAME TYPE DATA AGE
ge-executor-secrets Opaque 2 1m
redis-credentials Opaque 1 1m
anthropic-api-key Opaque 1 1m
CRITICAL: If bootstrap fails with "secrets already exist" - this is EXPECTED. The script is idempotent and will skip existing secrets. Verify the secrets have data:
# Verify secrets contain data (not just keys); decode one key at a time,
# because '{.data}' returns a JSON map that base64 cannot decode
kubectl get secret ge-executor-secrets -n ge-agents -o jsonpath='{.data.vault-role-id}' | base64 -d
kubectl get secret redis-credentials -n ge-agents -o jsonpath='{.data.password}' | base64 -d
If secrets exist but are empty: STOP. Delete the empty secrets and re-run bootstrap:
kubectl delete secret ge-executor-secrets -n ge-agents
kubectl delete secret redis-credentials -n ge-agents
kubectl delete secret anthropic-api-key -n ge-agents
# Re-run bootstrap
./bootstrap-secrets.sh
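To tell populated secrets from empty ones without decoding them to the terminal, the size of each encoded value can be listed instead. A minimal sketch, assuming `jq` is available; `secret_key_sizes` is a hypothetical helper:

```shell
# Hypothetical helper: print "<key> <length of base64 value>" for every key
# in a Kubernetes secret. A length of 0 means the value is empty.
secret_key_sizes() {
  local name="$1" namespace="$2"
  kubectl get secret "$name" -n "$namespace" -o json |
    jq -r '(.data // {}) | to_entries[] | "\(.key) \(.value | length)"'
}
```

For example, `secret_key_sizes ge-executor-secrets ge-agents` prints one line per key; any key reported with length 0 is a candidate for the delete-and-rerun procedure above.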
Phase 3: Core Services
Goal: Deploy Redis, monitoring, and base infrastructure.
# Deploy core services using startup script
cd /home/claude/ge-bootstrap/tools/
# Run startup script (phases: namespaces, secrets, core, monitoring)
./ge-platform-startup.sh --phase core
# Verify Redis is running
kubectl get pods -n ge-system | grep redis
# Expected: redis-0 Running
Wait for Redis readiness:
# Watch Redis pod until Running
kubectl get pods -n ge-system -w
# Test Redis connection (expect PONG; a NOAUTH reply means Redis is up
# but this unauthenticated command needs the password)
kubectl exec -n ge-system redis-0 -- redis-cli PING
If Redis fails to start: See Troubleshooting Redis Authentication.
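Watching with `-w` requires someone to interrupt it manually. For unattended recovery, a bounded retry loop is one alternative. A minimal sketch; `retry` is a hypothetical helper, not part of the platform tooling:

```shell
# Hypothetical helper: run a command up to <attempts> times, sleeping
# <delay> seconds between tries. Returns 0 on the first success.
retry() {
  local attempts="$1" delay="$2" n=1
  shift 2
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$n" -ge "$attempts" ]; then
      echo "gave up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}
```

For example, `retry 12 5 kubectl wait --for=condition=Ready pod/redis-0 -n ge-system --timeout=10s` keeps trying for roughly three minutes, which also covers the window before the pod object exists.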
Phase 4: Agent Executors
Goal: Deploy agent execution pods (Dolly, GE Executor).
# Deploy agents
./ge-platform-startup.sh --phase agents
# Verify executor pods
kubectl get pods -n ge-agents
# Expected:
# - ge-orchestrator-* Running
# - ge-executor-* Running (1+ pods)
Wait for executor readiness:
# Watch until all pods Running
kubectl get pods -n ge-agents -w
# Check executor logs for successful Vault authentication
kubectl logs -n ge-agents -l app=ge-executor --tail=50
# Expected log line:
# "Successfully authenticated to Vault using AppRole"
If executors fail: See Troubleshooting Executor CrashLoopBackOff.
Phase 5: Monitoring and Ingress
Goal: Deploy Grafana, Traefik, and monitoring services.
# Deploy monitoring
./ge-platform-startup.sh --phase monitoring
# Deploy ingress
./ge-platform-startup.sh --phase ingress
# Verify Grafana
kubectl get pods -n ge-monitoring | grep grafana
# Verify Traefik
kubectl get pods -n ge-ingress | grep traefik
Phase 6: Client Workloads (If Applicable)
Goal: Restore any client hosting workloads.
# List existing client namespaces
kubectl get namespaces | grep -E 'sh-|ded-'
# Deploy clients using startup script
./ge-platform-startup.sh --phase clients
# Or deploy manually:
cd /home/claude/ge-bootstrap/k8s/overlays/
kubectl apply -k <client-namespace>/
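The phase order matters, so when the phases are scripted it helps to stop at the first failure rather than continue into later phases. A minimal sketch, assuming `ge-platform-startup.sh` exits non-zero when a phase fails (this runbook does not confirm that); `run_phases` is a hypothetical helper:

```shell
# Hypothetical helper: run "<script> --phase <name>" for each phase in order
# and stop as soon as one of them fails.
run_phases() {
  local script="$1" phase
  shift
  for phase in "$@"; do
    echo "==> phase: $phase"
    if ! "$script" --phase "$phase"; then
      echo "phase '$phase' failed; stopping" >&2
      return 1
    fi
  done
}
```

Run from `/home/claude/ge-bootstrap/tools/` after the Phase 2 secrets bootstrap, this could look like `run_phases ./ge-platform-startup.sh core agents monitoring ingress clients`. The verification steps between phases still apply.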
Post-Recovery Verification
1. Service Health Checks
# Check all namespaces
kubectl get pods -A
# Expected: All pods Running or Completed
# Not expected: CrashLoopBackOff, Error, ImagePullBackOff
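On a busy cluster it is easier to list only the pods that need attention. A minimal sketch that filters on the STATUS column of `kubectl get pods -A`; `unhealthy_pods` is a hypothetical helper:

```shell
# Hypothetical helper: print "<namespace>/<pod>: <status>" for every pod whose
# STATUS is neither Running nor Completed. Prints nothing when all is well.
unhealthy_pods() {
  # Columns with -A: NAMESPACE NAME READY STATUS RESTARTS AGE
  kubectl get pods -A --no-headers |
    awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 ": " $4 }'
}
```

Empty output matches the expected state above. A pod can still be Running with containers that are not ready, so the READY column is worth a look as well.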
2. Redis Connectivity
# Test Redis authentication from executor
kubectl exec -n ge-agents deployment/ge-executor -- \
  redis-cli -h redis.ge-system.svc.cluster.local -p 6381 -a "$(kubectl get secret redis-credentials -n ge-agents -o jsonpath='{.data.password}' | base64 -d)" PING
# Expected output: PONG
3. Vault Authentication
# Check executor logs for Vault authentication success
kubectl logs -n ge-agents -l app=ge-executor --tail=100 | grep -i vault
# Expected log lines:
# "Successfully authenticated to Vault"
# "Retrieved secrets from Vault: ['redis', 'anthropic']"
4. ge-orchestrator
# Check ge-orchestrator logs for Redis connection
kubectl logs -n ge-agents deployment/ge-orchestrator --tail=100
# Expected log lines:
# "Connected to Redis at redis.ge-system.svc.cluster.local:6381"
# "ge-orchestrator started successfully"
5. API Endpoints
# Test admin-ui health endpoint
curl -I https://office.growing-europe.com/health
# Expected: HTTP 200 OK
# Test client workload (if deployed)
curl -I https://test-acme.hosting.growing-europe.com
# Expected: HTTP 200 OK
6. Agent Task Execution
# Monitor agent execution
kubectl logs -n ge-agents -l app=ge-executor -f
# Expected: Agent tasks being picked up and executed
# Watch for completion signals and handoffs
Recovery Complete: All services are operational. Platform is ready for production use.
Known Issues and Workarounds
Issue 1: k8s/base/core/secrets.yaml Contains Empty Placeholders
Problem: The secrets manifest file contains placeholder values (empty strings). If kubectl apply is run directly, it overwrites K8s secrets with empty values, breaking authentication.
Impact: Agents cannot authenticate to Redis or Vault, all tasks fail.
Workaround:
- ALWAYS use bootstrap-secrets.sh to populate secrets from Vault storage
- NEVER run kubectl apply -f k8s/base/core/secrets.yaml directly
- The startup script now checks for existing secrets and skips if present
Root Cause: Secrets manifest is designed as a template, not for direct application. The bootstrap script is the correct entry point.
Status: Working as designed. Documentation updated (2026-01-29).
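A pre-apply guard is one way to catch this mistake before it reaches the cluster. A minimal sketch that flags empty quoted values in a manifest; `has_empty_placeholders` is a hypothetical helper, and the assumption that placeholders are written as `""` or `''` is not confirmed by this runbook:

```shell
# Hypothetical helper: succeed (and print the offending lines with line
# numbers) when a YAML file contains values that are empty quoted strings.
has_empty_placeholders() {
  grep -nE -e ':[[:space:]]*""[[:space:]]*$' -e ":[[:space:]]*''[[:space:]]*\$" "$1"
}
```

For example, `has_empty_placeholders k8s/base/core/secrets.yaml && echo "template only: do not apply"` prints the warning when the file still holds placeholders.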
Issue 2: Vault AppRole Needs Configuration
Problem: Initial deployment does not configure Vault AppRole automatically. Executors expect vault-role-id and vault-secret-id in K8s secrets.
Impact: Executors cannot authenticate to Vault, fall back to env vars (less secure).
Workaround:
- Use bootstrap-secrets.sh to generate AppRole credentials
- Script generates role-id and secret-id, stores in Vault storage
- Subsequent runs reuse existing credentials (idempotent)
Status: Resolved by bootstrap script (2026-01-29).
Issue 3: Dolly Fallback to REDIS_PASSWORD Env Var
Problem: When Vault authentication fails, Dolly falls back to reading REDIS_PASSWORD env var directly from K8s secrets.
Impact: Secrets exposed via environment variables (standard K8s practice, but less secure than Vault).
Workaround:
- Ensure Vault authentication succeeds (AppRole configured)
- Monitor Dolly logs for fallback warnings
- Fallback is acceptable for cold start, but should not persist
Status: Monitoring required. Victoria tracking (INB-20260129-secrets-security-review).
Issue 4: Grafana Cannot Reach External Hosts
Problem: Network policies block Grafana egress to grafana.com (plugin installation).
Impact: Cannot install Grafana plugins at runtime. Pre-installed plugins work fine.
Workaround:
- Pre-install required plugins in Grafana Docker image
- Or adjust network policy to allow egress to grafana.com (security review required)
Status: Low priority. Monitoring as-is (2026-01-29).
Troubleshooting Common Failures
Troubleshooting Redis Authentication
Symptom: Redis logs show "NOAUTH Authentication required" errors.
Cause: Redis expects password, but clients not providing it.
Diagnosis:
# Check Redis logs
kubectl logs -n ge-system redis-0 --tail=100
# Check if redis-credentials secret exists and has data
kubectl get secret redis-credentials -n ge-system -o yaml
# Verify password is not empty
kubectl get secret redis-credentials -n ge-system -o jsonpath='{.data.password}' | base64 -d
Solution:
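# If secret missing or empty:
# (--force generates new secrets; try a plain ./bootstrap-secrets.sh first
# and use --force only if that does not restore the secret)
cd /home/claude/ge-bootstrap/ge-ops/infrastructure/local/k3s/scripts/
./bootstrap-secrets.sh --force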
# Restart Redis
kubectl rollout restart statefulset/redis -n ge-system
# Restart executors
kubectl rollout restart deployment/ge-executor -n ge-agents
kubectl rollout restart deployment/ge-orchestrator -n ge-agents
Troubleshooting Executor CrashLoopBackOff
Symptom: Executor pods repeatedly crash and restart.
Cause: Environment variable mismatch, missing secrets, or Vault authentication failure.
Diagnosis:
# Check executor logs
kubectl logs -n ge-agents -l app=ge-executor --tail=100
# Check previous crash logs
kubectl logs -n ge-agents <pod-name> --previous
# Check pod events
kubectl describe pod -n ge-agents <pod-name>
Common Log Patterns:
- "Connection refused to localhost:6379"
  - Application expects REDIS_URL env var
  - Deployment provides REDIS_HOST + REDIS_PORT separately
  - See pattern: CrashLoopBackOff Environment Variables
- "Vault authentication failed"
  - AppRole credentials missing or incorrect
  - Run ./bootstrap-secrets.sh to regenerate
  - Verify ge-executor-secrets exists with data
- "ANTHROPIC_API_KEY not set"
  - Environment variable not set before bootstrap
  - Export ANTHROPIC_API_KEY and re-run bootstrap
Solution:
# Fix 1: Verify secrets exist
kubectl get secret ge-executor-secrets -n ge-agents -o yaml
# Fix 2: Re-run bootstrap (from the scripts directory)
cd /home/claude/ge-bootstrap/ge-ops/infrastructure/local/k3s/scripts/
./bootstrap-secrets.sh
# Fix 3: Check deployment env vars
kubectl get deployment ge-executor -n ge-agents -o yaml | grep -A10 "env:"
# Fix 4: Restart deployment
kubectl rollout restart deployment/ge-executor -n ge-agents
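For the "Connection refused to localhost:6379" pattern, the application wants a single `REDIS_URL` while the deployment supplies host and port separately. A minimal sketch of how the two forms relate, using the standard `redis://` URL scheme; `build_redis_url` is a hypothetical helper, and whether the fix belongs in the deployment manifest or the application is not specified here:

```shell
# Hypothetical helper: compose a redis:// URL from host, port, and an
# optional password. The password is inserted as-is, so one containing
# characters such as '@' or '/' would need URL-encoding first.
build_redis_url() {
  local host="$1" port="$2" password="${3:-}"
  if [ -n "$password" ]; then
    printf 'redis://:%s@%s:%s\n' "$password" "$host" "$port"
  else
    printf 'redis://%s:%s\n' "$host" "$port"
  fi
}
```

With the service address used elsewhere in this runbook, `build_redis_url redis.ge-system.svc.cluster.local 6381` yields `redis://redis.ge-system.svc.cluster.local:6381`.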
Troubleshooting Vault Authentication Failures
Symptom: Logs show "Vault authentication failed" or "Failed to retrieve secret from Vault".
Cause: AppRole credentials missing, expired, or incorrect.
Diagnosis:
# Check AppRole credentials exist in Vault storage
cat /home/claude/ge-bootstrap/ge-ops/system/vault/approles/ge-agents-credentials.json
# Check K8s secret (values are shown base64-encoded)
kubectl get secret ge-executor-secrets -n ge-agents -o jsonpath='{.data}' | jq .
# Verify base64 encoding is correct
kubectl get secret ge-executor-secrets -n ge-agents -o jsonpath='{.data.vault-role-id}' | base64 -d
Solution:
# Regenerate AppRole credentials
cd /home/claude/ge-bootstrap/ge-ops/infrastructure/local/k3s/scripts/
./bootstrap-secrets.sh --force
# Restart executor pods
kubectl rollout restart deployment/ge-executor -n ge-agents
kubectl rollout restart deployment/ge-orchestrator -n ge-agents
# Watch logs for successful authentication
kubectl logs -n ge-agents -l app=ge-executor -f | grep -i vault
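Before regenerating credentials with `--force`, it can help to confirm whether the role ID in Vault storage and the one in the K8s secret actually differ. A minimal sketch that compares them without printing either value, assuming `jq` is available and that the file uses the `role_id` field and the secret uses the `vault-role-id` key as described in this runbook; `approle_role_id_matches` is a hypothetical helper:

```shell
# Hypothetical helper: return 0 when the role_id in the credentials file
# equals the decoded vault-role-id in the ge-executor-secrets secret,
# 1 when they differ or the file value is empty, 2 on read errors.
approle_role_id_matches() {
  local file="$1" from_file from_k8s
  from_file=$(jq -r '.role_id // empty' "$file") || return 2
  from_k8s=$(kubectl get secret ge-executor-secrets -n ge-agents \
    -o jsonpath='{.data.vault-role-id}' | base64 -d) || return 2
  [ -n "$from_file" ] && [ "$from_file" = "$from_k8s" ]
}
```

If the values match, the AppRole mismatch is probably not the cause and a plain restart of the executors may be enough.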
Emergency Secret Recovery
Scenario: Secrets were accidentally overwritten or deleted during recovery.
CRITICAL: Do NOT proceed with normal recovery if secrets are lost. Follow these steps:
- Check backups:
- Restore from backup (if available):
- If no backup available:
- Regenerate all secrets (last resort):
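The backup check can be sketched as follows, assuming backups were taken with the timestamped names from the Pre-Recovery Checklist (`/tmp/vault-backup-<timestamp>` and `/tmp/k8s-secrets-backup-<timestamp>.yaml`). `latest_backup` is a hypothetical helper, and the restore lines are comments only because the right restore procedure depends on what was lost:

```shell
# Hypothetical helper: print the newest path that starts with the given
# prefix, relying on the sortable timestamp in the name. Returns non-zero
# when nothing matches.
latest_backup() {
  local newest
  newest=$(ls -1d "$1"* 2>/dev/null | sort | tail -n 1)
  [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Example (review the backup contents before copying anything back):
#   latest_backup /tmp/vault-backup-
#   latest_backup /tmp/k8s-secrets-backup-
#   kubectl apply -f "$(latest_backup /tmp/k8s-secrets-backup-)"
```

On many systems `/tmp` does not survive a reboot, so an empty result after a host restart does not necessarily mean no backup was ever taken.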
Partial Recovery Procedures
Recover Only Redis
# Restart Redis statefulset
kubectl rollout restart statefulset/redis -n ge-system
# Wait for ready
kubectl rollout status statefulset/redis -n ge-system
# Verify
kubectl exec -n ge-system redis-0 -- redis-cli PING
Recover Only Executors
# Restart executor deployment
kubectl rollout restart deployment/ge-executor -n ge-agents
kubectl rollout restart deployment/ge-orchestrator -n ge-agents
# Wait for ready
kubectl rollout status deployment/ge-executor -n ge-agents
# Verify logs
kubectl logs -n ge-agents -l app=ge-executor --tail=50
Recover Only Monitoring
# Restart Grafana
kubectl rollout restart deployment/grafana -n ge-monitoring
# Verify Grafana UI (assumes Grafana is reachable on localhost:3000,
# for example through kubectl port-forward)
curl -I http://localhost:3000/
Emergency Contacts
| Role | Agent | Contact Method | Purpose |
|---|---|---|---|
| Secrets Manager | Piotr | ge-ops/agents/piotr/inbox/ | AppRole issues, secret rotation |
| Security Lead | Victoria | ge-ops/agents/victoria/inbox/ | Security incidents, secret exposure |
| Infrastructure Lead | Marije | ge-ops/agents/marije/inbox/ | K3s issues, networking |
| Knowledge Curator | Annegreet | ge-ops/agents/annegreet/inbox/ | Documentation updates, pattern capture |
| Human Escalation | Dirk-Jan (CEO) | ge-ops/notifications/human/ | Critical failures, data loss |
References
- Secrets Bootstrap Guide
- Vault-Only Secrets Architecture
- CrashLoopBackOff Pattern
- Platform Startup Script
- Bootstrap Secrets Script
Document Version: 1.0
Last Updated: 2026-03-18
Maintained by: Annegreet (Knowledge Curator)
Next Review: 2026-04-18