DOMAIN:INFRASTRUCTURE:DEPLOYMENT_STRATEGIES¶
OWNER: leon (deployment coordinator) UPDATED: 2026-03-24 SCOPE: all deployment strategies across GE's three-zone architecture AGENTS: leon (coordinator), arjan (infra), thijmen (Zone 2), rutger (Zone 3), stef (network), karel (CDN), gerco (Zone 1)
DEPLOYMENT:OVERVIEW¶
PURPOSE: define when and how to deploy code changes to GE client environments PRINCIPLE: every deployment must be zero-downtime unless explicitly agreed with client ZONES: Zone 1 (dev/k3s) -> Zone 2 (staging/UpCloud MKE) -> Zone 3 (production/UpCloud MKE) COORDINATOR: leon orchestrates the deploy chain — agents execute their zone-specific steps
DEPLOYMENT:STRATEGIES¶
ROLLING_UPDATE¶
HOW_IT_WORKS: 1. New pods created alongside old pods 2. Readiness probe confirms new pod healthy 3. Old pod terminated 4. Repeat until all pods replaced
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # create 1 extra pod during the rollout
      maxUnavailable: 0  # zero-downtime — never kill old before new is ready
USE_WHEN: - Standard stateless application deployments (most GE client apps) - Backend API servers - Frontend SSR servers - Any service where new and old versions can coexist briefly
DO_NOT_USE_WHEN: - Database schema migration requires breaking API changes - Stateful services with exclusive locks - Service cannot handle two versions running simultaneously
ZERO_DOWNTIME_CHECKLIST: - [ ] maxUnavailable: 0 (never kill before replacement ready) - [ ] readinessProbe configured and tested - [ ] graceful shutdown handler (SIGTERM -> drain connections -> exit) - [ ] database migrations are backward-compatible (additive only) - [ ] no breaking API changes (old clients must work with new server) - [ ] PDB configured if replicas >= 2 - [ ] connection draining timeout >= longest expected request
GE_DEFAULT: rolling update with maxSurge=1, maxUnavailable=0
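The graceful-shutdown item on the checklist can be sketched as follows. This is a minimal sketch, not a GE-standard implementation: it assumes an app that tracks in-flight requests itself, and all names (`shutting_down`, `drain`, the timeout value) are illustrative.

```python
import signal
import threading
import time

# Minimal graceful-shutdown sketch: on SIGTERM, stop accepting new work,
# drain in-flight requests, then let the process exit. The in-flight
# counter stands in for whatever request tracking the real app uses.

shutting_down = threading.Event()
in_flight = 0
lock = threading.Lock()

def handle_sigterm(signum, frame):
    # From here on the readiness probe should fail, so Kubernetes
    # stops routing new traffic to this pod while it drains.
    shutting_down.set()

signal.signal(signal.SIGTERM, handle_sigterm)

def drain(timeout_s: float = 30.0) -> bool:
    """Wait until in-flight requests reach zero or the timeout expires.

    timeout_s should be >= the longest expected request duration
    (see the connection-draining item on the checklist).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        with lock:
            if in_flight == 0:
                return True
        time.sleep(0.05)
    return False
```

Pair this with a `terminationGracePeriodSeconds` on the pod that is longer than the drain timeout, otherwise Kubernetes sends SIGKILL before draining finishes.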
BLUE_GREEN¶
HOW_IT_WORKS: 1. "Blue" = current production (serving traffic) 2. "Green" = new version deployed alongside (not serving traffic) 3. Full test suite runs against green 4. Traffic switched from blue to green (DNS, Ingress, or service selector) 5. Blue kept alive for instant rollback 6. After confidence period, blue torn down
IMPLEMENTATION_IN_K8S:
# Green deployment (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {app}-green
  labels:
    app: {app}
    version: green
spec:
  replicas: {same-as-blue}
  selector:
    matchLabels:
      app: {app}
      version: green
  template:
    metadata:
      labels:
        app: {app}
        version: green
    spec:
      containers:
      - name: {app}
        image: {image}:{new-tag}
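The cutover works because both deployments sit behind one Service whose selector includes the version label. A sketch of that Service, using the same placeholders as above:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: {app}
  namespace: {namespace}
spec:
  selector:
    app: {app}
    version: blue   # patched to "green" at cutover
  ports:
    - port: 80
      targetPort: {container-port}
```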
TRAFFIC_SWITCH (service selector update):
TOOL: kubectl
RUN: kubectl patch service {app} -n {namespace} -p '{"spec":{"selector":{"version":"green"}}}'
VERIFY: kubectl get endpoints {app} -n {namespace}
ROLLBACK (instant — switch back to blue):
TOOL: kubectl
RUN: kubectl patch service {app} -n {namespace} -p '{"spec":{"selector":{"version":"blue"}}}'
USE_WHEN: - Major version releases with significant changes - Client has zero tolerance for issues (contractual SLA) - Deployment includes database migration that needs validation - Client wants to approve new version before cutover
DO_NOT_USE_WHEN: - Simple feature additions or bug fixes - Resources are constrained (blue-green needs 2x resources temporarily) - Changes are trivially backward-compatible
COST: 2x resource consumption during deployment window BENEFIT: instant rollback, full pre-production validation
CANARY¶
HOW_IT_WORKS: 1. Small percentage of traffic routed to new version (canary) 2. Monitor error rates, latency, business metrics 3. Gradually increase canary percentage (5% -> 25% -> 50% -> 100%) 4. If metrics degrade, route all traffic back to stable version
IMPLEMENTATION (Traefik weighted routing):
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: {app}-canary
  namespace: {namespace}
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`{domain}`)
      kind: Rule
      services:
        - name: {app}-stable
          port: 80
          weight: 95
        - name: {app}-canary
          port: 80
          weight: 5
PROGRESSIVE_ROLLOUT:
PHASE_1: 5% traffic to canary — monitor for 15 minutes
CHECK: error rate delta < 0.1%, latency p99 delta < 100ms
PHASE_2: 25% traffic — monitor for 30 minutes
CHECK: same metrics + business metrics (conversion, signups)
PHASE_3: 50% traffic — monitor for 1 hour
CHECK: all metrics stable
PHASE_4: 100% traffic — canary becomes stable
THEN: remove old stable deployment
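The phase checks above can be expressed as a small gate function. This is a sketch only: the thresholds come from the phase table, but the metric inputs and function names are illustrative, and in practice the numbers come from the monitoring stack.

```python
# Canary gate sketch: compare canary metrics against stable and decide
# whether the canary may advance to the next traffic phase. Thresholds
# mirror PHASE_1 above (error rate delta < 0.1%, p99 delta < 100ms).

def canary_gate(stable_err: float, canary_err: float,
                stable_p99_ms: float, canary_p99_ms: float,
                max_err_delta: float = 0.001,     # 0.1 percentage points
                max_p99_delta_ms: float = 100.0) -> bool:
    """Return True if the canary may take more traffic."""
    err_ok = (canary_err - stable_err) < max_err_delta
    latency_ok = (canary_p99_ms - stable_p99_ms) < max_p99_delta_ms
    return err_ok and latency_ok

PHASES = [5, 25, 50, 100]  # canary traffic percentages from the rollout plan

def next_weight(current: int) -> int:
    """Next canary percentage, or 100 once the rollout is complete."""
    i = PHASES.index(current)
    return PHASES[min(i + 1, len(PHASES) - 1)]
```

If the gate returns False at any phase, set the canary weight back to 0 in the IngressRoute and investigate before retrying.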
USE_WHEN: - High-traffic client applications where risk must be minimized - Feature flag-driven releases - A/B testing new features - Production validation needed before full rollout
DO_NOT_USE_WHEN: - Low-traffic applications (not enough traffic for meaningful canary metrics) - Database schema changes (both versions hit same DB) - Changes that are all-or-nothing (cannot work at partial rollout)
DEPLOYMENT:STRATEGY_DECISION_TREE¶
START: new deployment needed
Q1: Is this a simple bug fix or minor feature?
YES → ROLLING UPDATE
NO → continue
Q2: Does the client have strict zero-downtime SLA?
YES → BLUE-GREEN
NO → continue
Q3: Is this a high-risk change to a high-traffic service?
YES → CANARY
NO → continue
Q4: Does the change include database migration?
YES + backward-compatible → ROLLING UPDATE
YES + breaking → BLUE-GREEN (validate green with new schema first)
NO → ROLLING UPDATE
DEFAULT: ROLLING UPDATE (covers 90%+ of GE deployments)
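The decision tree can be captured in a small helper for automation or review. A sketch, mirroring Q1 through Q4 above; the flag names are made up for illustration and are not a GE tool.

```python
# Strategy decision tree sketch, one branch per question Q1-Q4.

def pick_strategy(simple_change: bool,
                  strict_sla: bool,
                  high_risk_high_traffic: bool,
                  has_migration: bool,
                  migration_backward_compatible: bool = True) -> str:
    if simple_change:
        return "rolling-update"   # Q1: bug fix / minor feature
    if strict_sla:
        return "blue-green"       # Q2: zero-downtime SLA
    if high_risk_high_traffic:
        return "canary"           # Q3: high-risk, high-traffic
    if has_migration and not migration_backward_compatible:
        return "blue-green"       # Q4: breaking migration — validate green first
    return "rolling-update"       # Q4 backward-compatible, or default
```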
DEPLOYMENT:THREE_ZONE_PROMOTION¶
FLOW: ZONE 1 (DEV) -> ZONE 2 (STAGING) -> ZONE 3 (PRODUCTION)¶
ZONE_1 (dev, gerco):
→ developer/agent builds and tests locally
→ image built and verified in k3s
→ automated tests pass
→ GATE: all tests green, no lint errors
ZONE_2 (staging, thijmen):
→ same image deployed to UpCloud MKE staging
→ integration tests against staging environment
→ client-facing staging URL for preview
→ GATE: QA passed, client approved (if applicable)
ZONE_3 (production, rutger):
→ same verified image deployed to production
→ Victoria security review for infrastructure changes
→ zero-downtime deployment strategy applied
→ GATE: health checks passing, metrics nominal, no error spike
LEON'S DEPLOY CHAIN¶
Leon coordinates the full deploy chain. Each agent executes their phase and confirms to Leon before the next agent starts.
LEON (triggers deploy chain)
→ ARJAN: infrastructure ready? (new resources if needed)
→ GERCO: Zone 1 build + test
→ THIJMEN: Zone 2 staging deploy + verify
→ RUTGER: Zone 3 production deploy + verify
→ STEF: DNS + TLS + firewall verified
→ KAREL: CDN cache purge + edge config
→ LEON: cutover confirmed, deployment complete
EACH_AGENT_CONFIRMS:
TOOL: completion report
FORMAT:
AGENT: {name}
PHASE: {deploy-chain-phase}
STATUS: READY | BLOCKED | FAILED
DETAILS: {what was done}
BLOCKERS: {if any}
IF_ANY_AGENT_REPORTS_BLOCKED_OR_FAILED: 1. Leon halts the deploy chain 2. Root cause investigated 3. Fix applied and agent re-confirms 4. Chain resumes from that point 5. IF unresolvable: rollback all zones to last known good
IMAGE_PROMOTION¶
RULE: exact same image artifact moves through zones — never rebuild per zone FLOW: 1. Build image in Zone 1 (docker build + test) 2. Push to container registry (tag with git SHA) 3. Deploy same SHA to Zone 2 (staging) 4. After staging verification, deploy same SHA to Zone 3 (production)
TAG_CONVENTION:
- {image}:{git-sha} — immutable tag for traceability
- {image}:staging — mutable tag pointing to current staging
- {image}:production — mutable tag pointing to current production
- NEVER use latest in Zone 2 or Zone 3
DEPLOYMENT:ROLLBACK_PROCEDURES¶
IMMEDIATE_ROLLBACK (production issue detected)¶
TRIGGER: error rate spike, health probe failure, client report, monitoring alert
STEP_1: Leon initiates rollback
STEP_2: Rutger rolls back Zone 3:
TOOL: kubectl
RUN: kubectl rollout undo deployment/{app} -n {namespace}
VERIFY: kubectl rollout status deployment/{app} -n {namespace}
STEP_3: Karel purges CDN cache for rolled-back paths:
TOOL: bunny API
RUN: curl -X POST "https://api.bunny.net/pullzone/{id}/purgeCache" \
-H "AccessKey: {key}"
STEP_4: Stef verifies DNS and TLS still valid (no action usually needed)
STEP_5: Leon confirms rollback complete
STEP_6: Mira opens incident if client-impacting (incident-response domain)
ROLLBACK_WITH_DATABASE_MIGRATION¶
COMPLEXITY: high — database changes may not be reversible
IF: migration was additive only (new columns, new tables)
THEN: application rollback is safe — old code ignores new columns
ACTION: rollback application, leave database as-is
IF: migration was destructive (dropped columns, renamed tables)
THEN: application rollback may break
ACTION: restore database from pre-migration backup (otto's domain)
WARNING: data created between migration and rollback is lost (document RPO gap)
RULE: destructive migrations MUST be split into phases: 1. Phase 1: add new column (additive, safe) 2. Phase 2: deploy code that writes to both old and new columns 3. Phase 3: migrate data from old to new column 4. Phase 4: deploy code that reads only from new column 5. Phase 5: drop old column (only after Phase 4 is stable in production)
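Phase 2 of the split (dual-write) is the step most often done wrong, so here is a sketch of what it looks like in application code. The example assumes a hypothetical rename of `users.fullname` to `users.display_name`; all table, column, and function names are invented for illustration, and a plain dict stands in for the database row.

```python
# Phase 2 dual-write sketch for an expand/contract migration renaming
# users.fullname -> users.display_name. A dict stands in for the DB row.

def save_user(row: dict, name: str) -> dict:
    """Write the value to BOTH the old and the new column (Phase 2)."""
    row["fullname"] = name       # old column — still read by old pods
    row["display_name"] = name   # new column — read by Phase 4 code
    return row

def load_user(row: dict) -> str:
    """Phase 2 still reads the old column; Phase 4 flips this to
    display_name, and only after Phase 4 is stable may Phase 5 drop
    the old column."""
    return row["fullname"]
```

Because both columns stay in sync during Phase 2, a rollback to the previous application version at any point before Phase 5 is safe.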
DEPLOYMENT:ZERO_DOWNTIME_CHECKLIST¶
PRE_DEPLOYMENT: - [ ] Database migrations are backward-compatible (additive only) - [ ] Application handles graceful shutdown (SIGTERM -> drain -> exit) - [ ] Health probes configured (readiness gates traffic, liveness restarts) - [ ] PDB configured for services with >= 2 replicas - [ ] maxUnavailable: 0 in rolling update strategy - [ ] Connection draining timeout >= longest request duration - [ ] Feature flags ready for instant disable if needed - [ ] Rollback plan documented and tested
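The PDB item on the pre-deployment checklist might look like the following sketch, with the same placeholder names used elsewhere on this page:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: {app}-pdb
  namespace: {namespace}
spec:
  minAvailable: 1        # keep at least one pod up during voluntary disruptions
  selector:
    matchLabels:
      app: {app}
```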
DURING_DEPLOYMENT: - [ ] Monitor error rates during rollout - [ ] Monitor latency p50/p95/p99 - [ ] Monitor CPU/memory on new pods - [ ] Verify readiness probe passes before old pods killed - [ ] Watch for connection reset errors (sign of premature pod termination)
POST_DEPLOYMENT: - [ ] All health probes passing for 5+ minutes - [ ] Error rate returned to baseline - [ ] No degradation in latency - [ ] CDN cache purged for updated assets - [ ] DNS verified (if changed) - [ ] Client notified of successful deployment (if applicable) - [ ] Deployment recorded in change log
DEPLOYMENT:AGENT_WORKFLOW¶
FOR_LEON (deploy coordinator)¶
ON_DEPLOY_TRIGGER: 1. READ this page for strategy selection 2. DETERMINE strategy using decision tree 3. TRIGGER deploy chain: arjan -> gerco -> thijmen -> rutger -> stef -> karel 4. COLLECT confirmations from each agent 5. IF any agent blocked/failed: HALT chain, investigate 6. CONFIRM cutover when all agents report READY 7. MONITOR for 15 minutes post-deploy 8. IF issues: initiate rollback procedure 9. LOG deployment to wiki
FOR_RUTGER (production deployment)¶
ON_PRODUCTION_DEPLOY: 1. RECEIVE verified image SHA from thijmen 2. CHECK: victoria security review passed? 3. APPLY deployment using selected strategy 4. VERIFY: health probes, metrics, error rates 5. CONFIRM to leon: Zone 3 READY or FAILED 6. IF rollback needed: execute immediately, notify leon
DEPLOYMENT:ANTI_PATTERNS¶
BEFORE_EVERY_DEPLOYMENT:
1. Am I deploying untested code directly to production? (NEVER — Zone 2 staging first)
2. Am I using latest tag in production? (NEVER — use immutable SHA tags)
3. Am I skipping health probes? (NEVER — mandatory for zero-downtime)
4. Am I deploying during peak hours without client agreement? (NEVER)
5. Am I deploying a destructive migration without a phased plan? (NEVER)
6. Am I hot-patching a running pod? (NEVER — rebuild image and redeploy)
7. Am I deploying without a rollback plan? (NEVER — always have a rollback ready)
8. Am I deploying to Zone 3 without Victoria's review? (NEVER for infra changes)
DEPLOYMENT:CROSS_REFERENCES¶
KUBERNETES_OPERATIONS: domains/infrastructure/kubernetes-operations.md — kubectl rollout commands TERRAFORM_PATTERNS: domains/infrastructure/terraform-patterns.md — infrastructure for blue-green BACKUP_DR: domains/infrastructure/backup-disaster-recovery.md — pre-migration backups DNS_MANAGEMENT: domains/networking/dns-management.md — DNS changes during deploy CDN_EDGE: domains/networking/cdn-edge.md — cache purge post-deploy INCIDENT_RESPONSE: domains/incident-response/index.md — when deployments go wrong TLS_CERTIFICATES: domains/networking/tls-certificates.md — cert verification during deploy