Platform Startup Documentation¶
Last Updated: 2026-03-18
Status: Active
Maintained by: GE Infrastructure Team
Estimated Time: 15-30 minutes (full startup)
Overview¶
This document describes the unified platform startup process for the GE infrastructure, including Kubernetes, agents, monitoring, and client hosting environments. The startup script orchestrates all components in the correct order with proper dependency management.
Script: /home/claude/ge-bootstrap/tools/ge-platform-startup.sh
Table of Contents¶
- Unified Startup Script Overview
- Startup Phases
- Phase Dependencies
- Full Startup Procedure
- Partial Startup Options
- Status Verification
- Platform Shutdown
- Troubleshooting Common Issues
- Deprecated Scripts
Unified Startup Script Overview¶
Purpose¶
The ge-platform-startup.sh script provides a unified interface for starting all GE platform components with:
- Dependency Management: Ensures components start in correct order
- Health Checks: Verifies each component before proceeding
- Error Handling: Stops on failures, provides clear error messages
- Flexibility: Supports full startup, single phases, or partial startup
- Logging: Records all operations to log files
Features¶
- Phase-based architecture (9 distinct phases)
- Prerequisites checking before startup
- Automatic Vault unsealing
- Docker image import for K3s
- Status verification commands
- Partial startup from specific phase
- Comprehensive error logging
Basic Usage¶
cd /home/claude/ge-bootstrap/tools
# Full startup (all phases)
./ge-platform-startup.sh --full
# Show platform status
./ge-platform-startup.sh --status
# Run specific phase only
./ge-platform-startup.sh --phase ingress
# Start from specific phase
./ge-platform-startup.sh --from agents
# Stop platform
./ge-platform-startup.sh --stop
# List available phases
./ge-platform-startup.sh --list-phases
Startup Phases¶
The platform startup is divided into 9 phases that run sequentially:
Phase 1: Prerequisites¶
Purpose: Validate environment before starting any services
Checks:
- K3s service is running
- kubectl is available and cluster is reachable
- Docker is available (optional)
- Kustomize is available
- jq is installed
Duration: 5-10 seconds
Example Output:
========================================
PHASE: Prerequisites Check
========================================
[OK] K3s is running
[OK] kubectl available: v1.28.2
[OK] K8s cluster is reachable
[OK] Docker available
[OK] Kustomize available
[OK] jq available
[OK] All prerequisites satisfied
Failure Actions:
- If K3s not running: Start with sudo systemctl start k3s
- If kubectl missing: Install kubectl
- If cluster unreachable: Check K3s status and logs
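The tool checks in this phase can also be approximated by hand before running the script. The sketch below is not taken from ge-platform-startup.sh; the function name `missing_tools` is hypothetical, and it only checks whether a command is on PATH, not whether the cluster is reachable.

```shell
#!/usr/bin/env bash
# Sketch only: approximates the Phase 1 tool checks; not the script's actual code.

# missing_tools CMD... : print each command that cannot be found on PATH.
# Returns 0 when every command is present, 1 otherwise.
missing_tools() {
  local tool status=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      status=1
    fi
  done
  return "$status"
}

# Example with the required tools listed above (Docker is optional, so it is left out):
# missing_tools kubectl kustomize jq || echo "install the tools printed above"
```

Keeping the check as a function that only prints names makes it easy to reuse in other pre-flight scripts.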
Phase 2: Namespaces¶
Purpose: Create all required Kubernetes namespaces
Creates:
- ge-system - Core infrastructure
- ge-agents - Agent platform
- ge-monitoring - Observability stack
- ge-ingress - Ingress controller
- ge-hosting - Shared hosting pool
- ge-wiki - Wiki brain (MkDocs)
Actions:
- Apply namespace manifests
- Add required labels for network policies
- Label ingress namespace for selectors
Duration: 5-10 seconds
Example Output:
========================================
PHASE: Creating Namespaces
========================================
namespace/ge-system created
namespace/ge-agents created
namespace/ge-monitoring created
namespace/ge-ingress created
namespace/ge-hosting created
[OK] Namespaces created and labeled
Manual Verification:
kubectl get namespaces | grep ^ge-
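One way to confirm that every namespace listed under "Creates" exists is sketched below. The function name `missing_namespaces` is hypothetical and the code is not part of the startup script; it reads the output of `kubectl get namespaces -o name` on stdin so it can be tried without a cluster.

```shell
#!/usr/bin/env bash
# Sketch only: compares expected namespaces against a list read from stdin.

# Namespaces listed in the "Creates" section above.
EXPECTED_NAMESPACES="ge-system ge-agents ge-monitoring ge-ingress ge-hosting ge-wiki"

# missing_namespaces : read `kubectl get namespaces -o name` output on stdin
# (lines such as "namespace/ge-system") and print every expected namespace
# that does not appear. Returns 1 if any are missing.
missing_namespaces() {
  local present ns status=0
  present="$(sed 's|^namespace/||')"
  for ns in $EXPECTED_NAMESPACES; do
    if ! printf '%s\n' "$present" | grep -qx -- "$ns"; then
      echo "$ns"
      status=1
    fi
  done
  return "$status"
}

# Usage against a live cluster:
# kubectl get namespaces -o name | missing_namespaces
```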
Phase 3: Secrets¶
Purpose: Verify or create required secrets in all namespaces
Checks:
- ge-secrets exists in ge-agents
- ge-secrets exists in ge-system
- ge-secrets exists in ge-monitoring
Actions:
- If secrets missing and environment variables set, creates secrets
- If secrets missing and no environment variables, provides instructions
Duration: 5-10 seconds
Example Output:
========================================
PHASE: Verifying Secrets
========================================
[OK] Secret ge-secrets exists in ge-agents
[OK] Secret ge-secrets exists in ge-system
[OK] Secret ge-secrets exists in ge-monitoring
[OK] Secrets verified
Manual Secret Creation:
# If script cannot create secrets automatically
kubectl create secret generic ge-secrets \
-n ge-agents \
--from-literal=redis-password="<password>" \
--from-literal=anthropic-api-key="<key>"
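The decision described under "Actions" for this phase can be sketched as a small function. This is an illustration, not the script's code: the name `secrets_action` is hypothetical, and it assumes both `REDIS_PASSWORD` and `ANTHROPIC_API_KEY` must be set before secrets can be created automatically.

```shell
#!/usr/bin/env bash
# Sketch only: mirrors the decision described under "Actions"; not the script's actual code.

# secrets_action EXISTS : EXISTS is "yes" if ge-secrets is already present, "no" otherwise.
# Prints one of: skip, create, instruct.
secrets_action() {
  local exists="$1"
  if [ "$exists" = "yes" ]; then
    echo "skip"
  elif [ -n "${REDIS_PASSWORD:-}" ] && [ -n "${ANTHROPIC_API_KEY:-}" ]; then
    # Assumption: both variables are required for automatic creation.
    echo "create"
  else
    echo "instruct"
  fi
}

# Usage against a live cluster (kubectl exits non-zero when the secret is absent):
# if kubectl get secret ge-secrets -n ge-agents >/dev/null 2>&1; then
#   secrets_action yes
# else
#   secrets_action no
# fi
```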
Phase 4: Core Infrastructure¶
Purpose: Deploy Redis and Vault (core services)
Deploys:
- Redis (ge-system namespace)
- Vault (ge-system namespace)
- ConfigMaps and core secrets
- Network policies
Actions:
- Deploy Redis and wait for ready state
- Deploy Vault and wait for pod creation
- Attempt automatic Vault unsealing
- Apply network policies
Duration: 1-2 minutes
Example Output:
========================================
PHASE: Deploying Core Infrastructure
========================================
[INFO] Deploying Redis...
pod/redis condition met
[OK] Redis is ready
[INFO] Deploying Vault...
[WARN] Vault may need manual unsealing
[INFO] Checking Vault seal status...
[OK] Vault unsealed successfully
[OK] Network policies applied
[OK] Core infrastructure deployed
Manual Vault Unseal:
# If automatic unseal fails
kubectl exec -n ge-system vault-0 -- vault operator unseal <key1>
kubectl exec -n ge-system vault-0 -- vault operator unseal <key2>
kubectl exec -n ge-system vault-0 -- vault operator unseal <key3>
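Before and after unsealing, it can help to read the seal state programmatically. The sketch below parses `vault status` text from stdin; the function name `vault_seal_state` is hypothetical, and it accepts both the tabular form (`Sealed   true`) and the `Sealed: true` form used in comments elsewhere in this document.

```shell
#!/usr/bin/env bash
# Sketch only: reads `vault status` text on stdin and reports the seal state.

# vault_seal_state : print "sealed", "unsealed", or "unknown".
# Accepts both "Sealed   true" (table output) and "Sealed: true".
vault_seal_state() {
  awk '
    { key = $1; sub(/:$/, "", key) }
    key == "Sealed" { state = ($2 == "true") ? "sealed" : "unsealed" }
    END { print (state == "" ? "unknown" : state) }
  '
}

# Usage against a live cluster. `vault status` may exit non-zero while sealed,
# so its exit code is deliberately ignored here:
# kubectl exec -n ge-system vault-0 -- vault status 2>/dev/null | vault_seal_state
```

An "unknown" result usually means the command produced no status table at all, for example because the pod is not yet running.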
Phase 5: Ingress¶
Purpose: Deploy Traefik IngressController
Deploys:
- Traefik RBAC (ServiceAccount, ClusterRole, ClusterRoleBinding)
- Traefik ConfigMap (static configuration)
- Traefik Deployment (2 replicas with HA)
- Traefik Service (ClusterIP)
- IngressClass (default)
- Network policies
Actions:
- Apply all ingress resources via kustomize
- Wait for Traefik pods to be ready
- Verify service creation
Duration: 1-3 minutes
Example Output:
========================================
PHASE: Deploying Ingress Controller (Traefik)
========================================
[INFO] Applying Traefik IngressController...
serviceaccount/traefik created
clusterrole.rbac.authorization.k8s.io/traefik-ingress-controller created
clusterrolebinding.rbac.authorization.k8s.io/traefik-ingress-controller created
configmap/traefik-config created
deployment.apps/traefik created
service/traefik created
ingressclass.networking.k8s.io/traefik created
[INFO] Waiting for Traefik pods...
pod/traefik-abc123 condition met
pod/traefik-def456 condition met
[OK] Traefik is ready
[INFO] Traefik LoadBalancer IP: pending
[OK] Ingress controller deployed
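Critical Note: The Traefik service is of type ClusterIP, not LoadBalancer, so the "LoadBalancer IP: pending" line in the output above is expected. External ingress is handled by the Docker-hosted Traefik instance.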
Phase 6: Agents¶
Purpose: Deploy GE agent platform
Deploys:
- ConfigMaps (constitution, routing config, execution config)
- ge-orchestrator (event-driven routing, replaces legacy Dolly monolith)
- Shared executor (all 54 active agents run through this)
- PodDisruptionBudget for executors
Note: All per-agent deployments (arjan, annegreet, etc.) are scaled to 0. All agents run through the shared executor.
Actions:
- Create ConfigMaps from files
- Import Docker images to K3s (if available)
- Deploy ge-orchestrator and wait for ready
- Deploy shared executor and wait for ready
- Apply PDB
Duration: 3-5 minutes
Example Output:
========================================
PHASE: Deploying Agent Platform
========================================
[INFO] Creating ConfigMaps...
configmap/constitution created
configmap/routing-config created
[INFO] Checking Docker images...
[INFO] Importing ge-bootstrap-agent-executor to K3s...
[OK] Imported ge-bootstrap-agent-executor
[INFO] Deploying ge-orchestrator...
[OK] ge-orchestrator is ready
[INFO] Deploying shared executor...
[OK] Executors are ready
[OK] Agent platform deployed
Verification:
kubectl get pods -n ge-agents
# Expected: All Running
Phase 7: Monitoring¶
Purpose: Deploy observability stack
Deploys:
- Loki (log aggregation)
- Promtail (log collection)
- Grafana (visualization)
Actions:
- Apply Loki stack manifests
- Wait for Loki to be ready
- Wait for Grafana to be ready
Duration: 1-2 minutes
Example Output:
========================================
PHASE: Deploying Monitoring Stack
========================================
deployment.apps/loki created
daemonset.apps/promtail created
deployment.apps/grafana created
[INFO] Waiting for Loki...
pod/loki-abc123 condition met
[OK] Loki is ready
[INFO] Waiting for Grafana...
pod/grafana-def456 condition met
[OK] Grafana is ready
[OK] Monitoring stack deployed
Access Grafana:
Phase 8: Hosting¶
Purpose: Verify shared hosting pool is ready
Checks:
- ge-hosting namespace exists
- Hosting landing page is deployed
Actions:
- Verify namespace
- Check for landing page deployment
- Report status
Duration: 5-10 seconds
Example Output:
========================================
PHASE: Deploying Shared Hosting Pool
========================================
[OK] Hosting namespace exists
[OK] Hosting landing page deployed
[OK] Shared hosting pool ready
Note: The hosting namespace and landing page are created during the ingress phase.
Phase 9: Clients¶
Purpose: Deploy client environments from registry
Reads:
- /home/claude/ge-bootstrap/config/clients.yaml
Actions:
- Parse clients.yaml
- Deploy each registered client
- Report deployment status
Duration: Variable (depends on number of clients)
Example Output:
========================================
PHASE: Deploying Client Environments
========================================
[INFO] Deploying 3 client environment(s)...
[WARN] Client deployment not yet implemented - create clients manually with create-client.sh
[OK] Client environments phase complete
Manual Client Creation:
/home/claude/ge-bootstrap/tools/create-client.sh \
--type shared \
--name acme-corp \
--resources small
Phase Dependencies¶
Phases must run in order due to dependencies:
flowchart TD
Prerequisites[1. Prerequisites] --> Namespaces[2. Namespaces]
Namespaces --> Secrets[3. Secrets]
Secrets --> Core[4. Core<br/>Redis, Vault]
Core --> Ingress[5. Ingress<br/>Traefik]
Ingress --> Agents[6. Agents<br/>ge-orchestrator, Executors]
Agents --> Monitoring[7. Monitoring<br/>Loki, Grafana]
Core --> Monitoring
Monitoring --> Hosting[8. Hosting<br/>Shared Pool]
Hosting --> Clients[9. Clients<br/>Client Envs]
Ingress --> Clients
Dependency Matrix¶
| Phase | Depends On | Reason |
|---|---|---|
| Namespaces | Prerequisites | Need kubectl and cluster access |
| Secrets | Namespaces | Secrets created in namespaces |
| Core | Secrets | Redis and Vault need secrets |
| Ingress | Namespaces | Traefik deployed to ge-ingress namespace |
| Agents | Core, Secrets | Agents need Redis and secrets |
| Monitoring | Core | Loki needs storage from core infrastructure |
| Hosting | Ingress | Needs IngressClass for routing |
| Clients | Ingress, Hosting | Clients need ingress and hosting namespace |
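The matrix above can be encoded and checked mechanically. The sketch below is illustrative only: `deps_of` and `check_order` are hypothetical names, and the phase spellings other than secrets, core, ingress and agents (which appear in commands in this document) are assumptions; `--list-phases` shows the real names.

```shell
#!/usr/bin/env bash
# Sketch only: encodes the dependency matrix above; several phase names are assumed spellings.

# deps_of PHASE : print the phases that PHASE depends on (space-separated).
deps_of() {
  case "$1" in
    namespaces) echo "prerequisites" ;;
    secrets)    echo "namespaces" ;;
    core)       echo "secrets" ;;
    ingress)    echo "namespaces" ;;
    agents)     echo "core secrets" ;;
    monitoring) echo "core" ;;
    hosting)    echo "ingress" ;;
    clients)    echo "ingress hosting" ;;
    *)          echo "" ;;
  esac
}

# check_order PHASE... : verify every phase appears after all of its dependencies.
# Prints the first violation and returns 1, or returns 0 if the order is valid.
check_order() {
  local seen=" " phase dep
  for phase in "$@"; do
    for dep in $(deps_of "$phase"); do
      case "$seen" in
        *" $dep "*) ;;
        *) echo "$phase requires $dep first"; return 1 ;;
      esac
    done
    seen="$seen$phase "
  done
  return 0
}

# Example: check_order prerequisites namespaces secrets agents core
# reports that agents requires core first.
```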
Why Order Matters¶
Example 1: Agents Before Core
❌ WRONG: Deploy agents before Redis
Result: Agents crash, cannot connect to Redis
Error: "Connection refused: redis.ge-system:6381"
Example 2: Clients Before Ingress
❌ WRONG: Deploy clients before Traefik
Result: Ingress resources not processed, no routing
Error: "IngressClass 'traefik' not found"
Example 3: Secrets Before Namespaces
❌ WRONG: Create secrets before namespaces
Result: Secret creation fails
Error: "namespace 'ge-agents' not found"
Full Startup Procedure¶
Starting from Stopped State¶
Scenario: Server just booted, all services stopped
Step 1: Start K3s
# Check K3s status
sudo systemctl status k3s
# Start if stopped
sudo systemctl start k3s
# Verify
sudo systemctl is-active k3s
# Expected: active
Step 2: Run Full Startup
Step 3: Monitor Progress
The script reports progress through all 9 phases. Watch for:
- ✅ Green [OK] messages indicate success
- ⚠️ Yellow [WARN] messages indicate non-critical issues
- ❌ Red [ERROR] messages indicate failures
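When the output scrolls past quickly, a tally of the status markers can make failures easier to spot. The sketch below counts lines containing `[OK]`, `[WARN]` and `[ERROR]` from stdin; `summarize_markers` is a hypothetical helper, not part of the startup script, and it matches markers anywhere in the line because colored output may prefix them with escape codes.

```shell
#!/usr/bin/env bash
# Sketch only: tallies the [OK]/[WARN]/[ERROR] markers in startup output read from stdin.

# summarize_markers : print "ok=N warn=N error=N"; return 1 if any [ERROR] line was seen.
summarize_markers() {
  awk '
    index($0, "[OK]") > 0    { ok++ }
    index($0, "[WARN]") > 0  { warn++ }
    index($0, "[ERROR]") > 0 { error++ }
    END {
      printf "ok=%d warn=%d error=%d\n", ok, warn, error
      exit (error > 0 ? 1 : 0)
    }
  '
}

# Usage (the tee keeps the full output visible on the terminal's stderr):
# ./ge-platform-startup.sh --full 2>&1 | tee /dev/stderr | summarize_markers
```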
Step 4: Verify Completion
./ge-platform-startup.sh --status
Expected Duration:
- Minimal system: 5-10 minutes
- Full system: 15-20 minutes
- With many clients: 20-30 minutes
Starting from Partial State¶
Scenario: Some components running, some stopped
Step 1: Check Current State
./ge-platform-startup.sh --status
Step 2: Identify Missing Components
Example output showing partial state:
=== Core Infrastructure (ge-system) ===
NAME READY STATUS RESTARTS AGE
pod/redis-0 1/1 Running 0 5d
pod/vault-0 1/1 Running 0 5d
=== Agent Platform (ge-agents) ===
No resources found in ge-agents namespace.
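Step 3: Start from Missing Phase
# The agent platform is missing in the example above, so resume from the agents phase
./ge-platform-startup.sh --from agents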
Partial Startup Options¶
Single Phase Execution¶
Run only one specific phase:
# Syntax
./ge-platform-startup.sh --phase PHASE_NAME
# Examples
./ge-platform-startup.sh --phase core
./ge-platform-startup.sh --phase ingress
./ge-platform-startup.sh --phase agents
Use Cases:
- Restarting a single failed component
- Testing a specific phase
- Selective updates
Example:
Start from Specific Phase¶
Run all phases starting from a specific one:
# Syntax
./ge-platform-startup.sh --from PHASE_NAME
# Examples
./ge-platform-startup.sh --from core
./ge-platform-startup.sh --from agents
Use Cases:
- Partial system recovery
- Skip already-running components
- Faster startup when core is healthy
Example:
Phase Selection Strategy¶
Scenario: Need to redeploy agents after code change
Strategy: --phase agents
Reason: Only agents need updating
Scenario: Core crashed, need to restart everything dependent on it
Strategy: --from core
Reason: Core, agents, and monitoring all need to restart
Scenario: Fresh install
Strategy: --full
Reason: All phases need to run
Scenario: K3s restarted, everything needs to come back up
Strategy: --full
Reason: All phases need to run in order
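To see which phases a `--from` choice would cover, the selection can be sketched as follows. This assumes the phases run strictly in the documented order; `phases_from` is a hypothetical helper, and the phase spellings other than secrets, core, ingress and agents are assumptions (use `--list-phases` for the real names).

```shell
#!/usr/bin/env bash
# Sketch only: shows how a --from style option can resolve to an ordered phase list.
# Phase spellings are partly assumed; run --list-phases for the real names.

PHASES="prerequisites namespaces secrets core ingress agents monitoring hosting clients"

# phases_from START : print START and every later phase, one per line.
# Returns 1 if START is not a known phase.
phases_from() {
  local start="$1" phase found=0
  for phase in $PHASES; do
    if [ "$phase" = "$start" ]; then
      found=1
    fi
    if [ "$found" -eq 1 ]; then
      echo "$phase"
    fi
  done
  [ "$found" -eq 1 ]
}

# Example: phases_from monitoring
# prints monitoring, hosting and clients, one per line.
```

Under this assumption, `--from core` also re-runs ingress, hosting and clients, which is worth keeping in mind when only agents and monitoring need a restart.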
Status Verification¶
Using Status Command¶
./ge-platform-startup.sh --status
Output Sections:
- Namespaces
- Ingress (ge-ingress)
- Core Infrastructure (ge-system)
- Agent Platform (ge-agents)
- Monitoring (ge-monitoring)
- Hosting (ge-hosting)
- Ingress Routes
- Access Points
Manual Verification Commands¶
Quick Health Check:
# All namespaces
kubectl get namespaces | grep ^ge-
# All pods in all GE namespaces
kubectl get pods -A | grep ^ge-
# All services
kubectl get svc -A | grep ^ge-
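The pod listing above can be narrowed to only the pods that need attention. The sketch below filters `kubectl get pods -A --no-headers` output read from stdin; `unhealthy_ge_pods` is a hypothetical helper name, and treating `Completed` as healthy is an assumption that suits job-style pods.

```shell
#!/usr/bin/env bash
# Sketch only: filters `kubectl get pods -A --no-headers` output read from stdin.

# unhealthy_ge_pods : print "namespace/pod status" for every pod in a ge-* namespace
# whose STATUS column is neither Running nor Completed. Returns 1 if any are found.
unhealthy_ge_pods() {
  awk '
    $1 ~ /^ge-/ && $4 != "Running" && $4 != "Completed" {
      print $1 "/" $2, $4
      bad = 1
    }
    END { exit (bad ? 1 : 0) }
  '
}

# Usage against a live cluster:
# kubectl get pods -A --no-headers | unhealthy_ge_pods
```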
Component-Specific Checks:
# Redis
kubectl exec -n ge-system redis-0 -- redis-cli -a "$REDIS_PASSWORD" ping
# Expected: PONG
# Vault (check seal status)
kubectl exec -n ge-system vault-0 -- vault status
# Expected: Sealed: false
# Traefik
kubectl get pods -n ge-ingress -l app=traefik
# Expected: All Running
# Agents
kubectl get pods -n ge-agents
# Expected: All Running
# Monitoring
kubectl get pods -n ge-monitoring
# Expected: All Running
Endpoint Testing:
# Admin UI
curl -I https://office.growing-europe.com
# Expected: HTTP/2 200
# Hosting landing
curl -I https://hosting.growing-europe.com
# Expected: HTTP/2 200
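# Traefik dashboard (internal)
kubectl port-forward -n ge-ingress svc/traefik-dashboard 8080:8080 &
PF_PID=$!
sleep 2  # give the port-forward a moment to establish
curl -I http://localhost:8080/dashboard/
# Expected: HTTP/1.1 200
kill "$PF_PID"  # stop the background port-forward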
Platform Shutdown¶
Graceful Shutdown¶
Step 1: Stop Agents and Monitoring
./ge-platform-startup.sh --stop
Output:
========================================
Stopping GE Platform
========================================
[WARN] Stopping agents and monitoring...
deployment.apps "ge-orchestrator" deleted
deployment.apps "ge-executor" deleted
[...]
[WARN] Core infrastructure (Redis, Vault, Ingress) NOT stopped
[INFO] To stop everything: kubectl delete -k /home/claude/ge-bootstrap/k8s/base/
[OK] Platform stopped (core services preserved)
What Gets Stopped:
- ✅ All agents (ge-orchestrator, executors, dedicated agents, watchers)
- ✅ Monitoring stack (Loki, Promtail, Grafana)
- ❌ Core services (Redis, Vault): preserved
- ❌ Ingress (Traefik): preserved
Why Core is Preserved:
- Redis may hold important cached data
- Vault contains secrets
- Traefik handles production traffic
- Quick recovery if agents need a restart
Full Shutdown¶
Step 1: Stop All K8s Resources
kubectl delete -k /home/claude/ge-bootstrap/k8s/base/
# Or manually by namespace
kubectl delete namespace ge-agents
kubectl delete namespace ge-monitoring
kubectl delete namespace ge-ingress
kubectl delete namespace ge-system
kubectl delete namespace ge-hosting
Step 2: Stop K3s (Optional)
sudo systemctl stop k3s
Emergency Shutdown¶
In case of critical issues:
# Force delete all GE resources
kubectl delete all --all -n ge-agents --force --grace-period=0
kubectl delete all --all -n ge-monitoring --force --grace-period=0
kubectl delete all --all -n ge-ingress --force --grace-period=0
kubectl delete all --all -n ge-system --force --grace-period=0
# Stop K3s
sudo systemctl stop k3s
Troubleshooting Common Issues¶
K3s Not Running¶
Symptoms:
[ERROR] K3s is not running. Start with: sudo systemctl start k3s
[ERROR] Cannot connect to K8s cluster
Solution:
# Check K3s status
sudo systemctl status k3s
# If failed, check logs
sudo journalctl -u k3s -n 100
# Common fixes:
# 1. Not enough memory
free -h
# Solution: Increase memory or reduce workloads
# 2. Port conflict
sudo netstat -tulpn | grep :6443
# Solution: Stop conflicting service
# 3. Start K3s
sudo systemctl start k3s
# 4. Enable auto-start
sudo systemctl enable k3s
Vault Sealed¶
Symptoms:
[WARN] Vault may need manual unsealing
[INFO] Vault is sealed, attempting auto-unseal...
[WARN] Failed to unseal Vault - manual intervention needed
Solution:
# Check Vault status
kubectl exec -n ge-system vault-0 -- vault status
# Output shows:
# Sealed: true
# Manual unseal (requires 3 keys)
kubectl exec -n ge-system vault-0 -- vault operator unseal <key1>
kubectl exec -n ge-system vault-0 -- vault operator unseal <key2>
kubectl exec -n ge-system vault-0 -- vault operator unseal <key3>
# Verify unsealed
kubectl exec -n ge-system vault-0 -- vault status
# Sealed: false
Where to Find Keys:
# Keys should be in:
/home/claude/ge-bootstrap/vault/VAULT_KEYS.txt
# If file doesn't exist, Vault needs reinitialization
# WARNING: This will lose all existing secrets!
Secret Missing¶
Symptoms:
[WARN] Secret ge-secrets missing in ge-agents
[ERROR] Cannot create secrets - REDIS_PASSWORD and ANTHROPIC_API_KEY not set
Solution:
# Option 1: Set environment variables
export REDIS_PASSWORD="<password>"
export ANTHROPIC_API_KEY="<key>"
# Run secrets phase again
./ge-platform-startup.sh --phase secrets
# Option 2: Create manually
kubectl create secret generic ge-secrets \
-n ge-agents \
--from-literal=redis-password="<password>" \
--from-literal=anthropic-api-key="<key>"
kubectl create secret generic ge-secrets \
-n ge-system \
--from-literal=redis-password="<password>"
kubectl create secret generic ge-secrets \
-n ge-monitoring \
--from-literal=redis-password="<password>"
Certificate Not Issued¶
Symptoms:
curl https://office.growing-europe.com
# SSL certificate problem: unable to get local issuer certificate
Diagnosis:
# Check Docker Traefik logs
docker logs traefik 2>&1 | grep -i acme | tail -20
# Check acme.json
sudo ls -lh /home/claude/ge-bootstrap/traefik/acme.json
# Check DNS
dig office.growing-europe.com +short
Solutions:
- Wait for Let's Encrypt
- DNS Not Resolving
- Rate Limited
- Port 80 Blocked
Pods Not Starting¶
Symptoms:
Diagnosis:
# Replace <pod-name> with the failing pod from: kubectl get pods -n ge-agents
# Check pod events
kubectl describe pod <pod-name> -n ge-agents
# Check logs
kubectl logs <pod-name> -n ge-agents
# Check previous container logs
kubectl logs <pod-name> -n ge-agents --previous
Common Causes and Fixes:
- Image Pull Error
- Missing ConfigMap
- Secret Not Found
- Resource Limits
Traefik API Connection Errors¶
Symptoms:
Diagnosis:
# Check RBAC permissions
kubectl get clusterrole traefik-ingress-controller
kubectl get clusterrolebinding traefik-ingress-controller
# Check ServiceAccount
kubectl get sa traefik -n ge-ingress
Solutions:
- RBAC Not Applied
- Network Policy Blocking
- ServiceAccount Token Issue
Deprecated Scripts¶
ge-startup-k8s.sh¶
Location: /home/claude/ge-bootstrap/tools/ge-startup-k8s.sh
Status: DEPRECATED - Superseded by ge-platform-startup.sh
Reason for Deprecation:
- Limited to K8s resources only
- No Docker service integration
- No phase-based architecture
- Less flexible than unified script
- No Vault unsealing
- No status verification
Migration:
When to Use (Legacy):
- Only if ge-platform-startup.sh is broken
- For minimal K8s-only deployments
- For compatibility with old scripts
Recommendation: Use ge-platform-startup.sh for all new deployments.
Related Documentation¶
- Architecture Overview - Hosting architecture
- Traefik Migration - Ingress architecture details
- Client Onboarding - Creating clients
- Zero-Downtime Deployments - Update procedures
Startup Checklist¶
Use this checklist for platform startup:
PRE-STARTUP
□ K3s service is running
□ Server has adequate resources (check free -h, df -h)
□ Docker is running (if using Docker services)
□ Environment variables set (.env loaded)
□ Vault keys available (VAULT_KEYS.txt)
STARTUP
□ Run prerequisites check
□ Start full platform or specific phases
□ Monitor progress for errors
□ Wait for all pods to be ready
POST-STARTUP
□ Run status verification
□ Check all namespaces created
□ Verify core services (Redis, Vault)
□ Verify ingress (Traefik)
□ Verify agents running
□ Verify monitoring stack
□ Test external access (office.growing-europe.com)
□ Check logs for errors
VALIDATION
□ All pods in Running state
□ No CrashLoopBackOff
□ Secrets present in all namespaces
□ Vault unsealed
□ Ingress routing working
□ SSL certificates valid
□ No errors in logs
This documentation is maintained by the GE Infrastructure Team. For updates or questions, contact the infrastructure lead or create an issue in the ge-ops repository.