
Zero-Downtime Deployment Runbook

Last Updated: 2026-01-29
Status: Active
Maintained by: GE Infrastructure Team
Estimated Time: 10-20 minutes per deployment


Overview

This runbook provides step-by-step procedures for deploying updates to client workloads without service interruption. Zero-downtime deployments are critical for maintaining service availability and meeting SLA commitments.

Deployment Strategies Covered:

  • Rolling updates for shared hosting (default)
  • Blue-green deployments for dedicated hosting (optional)
  • Canary deployments for gradual rollouts (optional)



Rolling Update Strategy

Overview

Rolling updates gradually replace old pods with new pods, maintaining availability throughout the update process.

How It Works:

  1. Create new pod with updated image/config
  2. Wait for new pod to become Ready
  3. Terminate one old pod
  4. Repeat until all pods are updated

Configuration

Shared Hosting Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: sh-acme-corp
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1           # Allow 1 extra pod during update
      maxUnavailable: 0     # Ensure all replicas available

Dedicated Hosting Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: ded-bigcorp
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 100%        # Double capacity during update
      maxUnavailable: 0     # Ensure all replicas available

Strategy Parameters

Parameter        Description               Shared Hosting        Dedicated Hosting
maxSurge         Extra pods during update  1 (33-50%)            100% (double capacity)
maxUnavailable   Pods that can be down     0 (always available)  0 (always available)

maxSurge Examples:

  • Shared (1 replica): maxSurge: 1
    • During update: 2 pods total (1 old + 1 new)
    • After new pod ready: 1 pod (old terminated)

  • Shared (2 replicas): maxSurge: 1
    • During update: 3 pods total (2 old + 1 new)
    • Rolling: gradually replace old with new

  • Dedicated (3 replicas): maxSurge: 100%
    • During update: 6 pods total (3 old + 3 new)
    • All new pods start before any old pods terminate
    • Maximum capacity during update

maxUnavailable = 0:

  • Always maintains desired replica count or higher
  • No service degradation during update
  • Critical for the zero-downtime requirement
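Kubernetes converts a percentage maxSurge to a pod count by rounding up (maxUnavailable rounds down). The surge arithmetic behind the examples above can be sketched in shell:

```shell
#!/bin/sh
# Effective extra pods during an update: percentage maxSurge rounds UP.
surge_pods() {  # usage: surge_pods <replicas> <maxSurge: "N" or "N%">
  replicas=$1
  surge=$2
  case "$surge" in
    *%) echo $(( (replicas * ${surge%\%} + 99) / 100 )) ;;  # ceil(replicas * pct / 100)
    *)  echo "$surge" ;;                                    # absolute value used as-is
  esac
}

surge_pods 2 1      # shared: 1 extra pod (3 total during update)
surge_pods 3 100%   # dedicated: 3 extra pods (6 total during update)
```

This is only the rounding rule, not a kubectl feature; the real values are reported by `kubectl describe deployment`.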

Update Timeline

Shared Hosting (2 replicas):

sequenceDiagram
    participant K8s
    participant OldPod1
    participant OldPod2
    participant NewPod1
    participant NewPod2

    Note over K8s: Start rolling update
    K8s->>NewPod1: Create (maxSurge +1)
    Note over NewPod1: Starting...
    NewPod1->>K8s: Ready
    K8s->>OldPod1: Terminate
    Note over OldPod1: Shutting down
    K8s->>NewPod2: Create
    NewPod2->>K8s: Ready
    K8s->>OldPod2: Terminate
    Note over K8s: Update complete

Duration: 1-3 minutes (depends on startup time)

Dedicated Hosting (3 replicas, maxSurge 100%):

sequenceDiagram
    participant K8s
    participant Old as Old Pods (3)
    participant New as New Pods (3)

    Note over K8s: Start rolling update
    K8s->>New: Create all 3 new pods (maxSurge 100%)
    Note over New: Starting...
    New->>K8s: All ready
    Note over K8s,New: 6 pods serving traffic
    K8s->>Old: Terminate all old pods
    Note over Old: Graceful shutdown (30s)
    Note over K8s: Update complete

Duration: 30-60 seconds (parallel startup + graceful shutdown)
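The durations above can be sanity-checked with a back-of-the-envelope formula (illustrative only: batches = ceil(replicas / surge pods), each batch costing roughly one pod startup plus graceful shutdown):

```shell
#!/bin/sh
# Rough rollout duration in seconds, assuming batches proceed serially.
rollout_estimate() {  # usage: rollout_estimate <replicas> <surge_pods> <secs_per_batch>
  r=$1; s=$2; t=$3
  echo $(( ((r + s - 1) / s) * t ))   # ceil(r / s) batches * t seconds
}

rollout_estimate 2 1 45   # shared, maxSurge 1: 2 batches -> 90s
rollout_estimate 3 3 45   # dedicated, maxSurge 100%: 1 batch -> 45s
```

Actual times depend on image pull, readiness probe delay, and terminationGracePeriodSeconds.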


Blue-Green Deployment

Overview

Blue-green deployments maintain two identical environments (blue = current, green = new). Traffic switches instantly after the green environment is validated.

Use Case: Dedicated hosting with critical uptime requirements

Procedure

Step 1: Deploy Green Environment

# Create green deployment (new version)
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-green
  namespace: ded-bigcorp
  labels:
    app: web
    version: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
      version: green
  template:
    metadata:
      labels:
        app: web
        version: green
    spec:
      containers:
        - name: nginx
          image: bigcorp/webapp:v2.0.0
          # ... rest of spec
EOF

Step 2: Wait for Green to be Ready

kubectl wait --for=condition=available \
  deployment/web-green -n ded-bigcorp --timeout=180s

Step 3: Test Green Environment

# Port-forward to green pods for testing
kubectl port-forward -n ded-bigcorp deploy/web-green 8080:80

# Test in browser or with curl
curl http://localhost:8080/health
curl http://localhost:8080/api/version

Step 4: Switch Traffic to Green

# Update service selector to point to green
kubectl patch service web -n ded-bigcorp -p '{"spec":{"selector":{"version":"green"}}}'
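The patch above assumes the Service selects pods by both the app and version labels, so cutover is just a selector change. A sketch of that assumed Service (field values are illustrative, not taken from the cluster):

```yaml
# Assumed Service shape for the blue-green switch: traffic follows the
# `version` label, which Step 4 patches from blue to green.
apiVersion: v1
kind: Service
metadata:
  name: web
  namespace: ded-bigcorp
spec:
  selector:
    app: web
    version: blue    # patched to "green" during cutover
  ports:
    - port: 80
      targetPort: 80
```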

Step 5: Verify Traffic Switch

# Check service endpoints
kubectl get endpoints web -n ded-bigcorp

# Should show green pod IPs
# Test production URL
curl https://bigcorp.hosting.growing-europe.com/health

Step 6: Monitor Green

# Watch pods
kubectl get pods -n ded-bigcorp -l version=green -w

# Monitor logs
kubectl logs -n ded-bigcorp -l version=green --tail=50 -f

# Wait 10-15 minutes, monitor for errors

Step 7: Remove Blue (Old)

# After successful verification, remove old deployment
kubectl delete deployment web-blue -n ded-bigcorp

# Or keep for quick rollback
# Delete after longer observation period (1 hour)

Blue-Green Rollback

If issues are found after switching:

# Instantly switch back to blue
kubectl patch service web -n ded-bigcorp -p '{"spec":{"selector":{"version":"blue"}}}'

# Verify rollback
kubectl get endpoints web -n ded-bigcorp
curl https://bigcorp.hosting.growing-europe.com/health

Pre-Deployment Checklist

Use this checklist before every deployment:

Health Check Verification

CLIENT_NAME="acme-corp"
NAMESPACE="sh-${CLIENT_NAME}"

# 1. Current deployment is healthy
kubectl get deployment web -n "$NAMESPACE"
# Verify: READY shows X/X (all replicas ready)

# 2. Pods are running
kubectl get pods -n "$NAMESPACE"
# Verify: All pods show Running, no CrashLoopBackOff

# 3. Health endpoints responding
POD=$(kubectl get pod -n "$NAMESPACE" -l app=web -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n "$NAMESPACE" "$POD" -- wget -O- http://localhost:8080/health
# Verify: Returns 200 OK

# 4. No recent errors in logs
kubectl logs -n "$NAMESPACE" -l app=web --tail=100 | grep -i error
# Verify: No critical errors

Resource Availability

# 1. Node capacity
kubectl top nodes
# Verify: Nodes have available CPU/memory

# 2. Current resource usage
kubectl top pods -n "$NAMESPACE"
# Verify: Pods not hitting limits

# 3. Storage availability
kubectl get pvc -n "$NAMESPACE"
# Verify: No PVCs in Pending state

# 4. Image availability
IMAGE="acme-corp/webapp:v1.2.3"
docker pull "$IMAGE"
# Verify: Image can be pulled

Database Migrations (If Applicable)

# 1. Check if migrations are needed
# Review application changelog

# 2. Test migrations in staging
# Run migration scripts in staging environment

# 3. Backup production database
# Create database backup before deployment

# 4. Plan migration execution
# Decide: before deployment, during, or after?
# Ensure migrations are backward compatible
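One common pattern for the "before deployment" option is a one-off Kubernetes Job that runs the migration and must complete before the image update proceeds. A hedged sketch (the Job name, command, and migration tooling are assumptions, not part of this platform's tooling):

```yaml
# Hypothetical pre-deployment migration Job; the image and command
# depend entirely on the application's migration tooling.
apiVersion: batch/v1
kind: Job
metadata:
  name: webapp-migrate-v1-2-3
  namespace: sh-acme-corp
spec:
  backoffLimit: 0          # fail fast; never retry a partial migration
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: acme-corp/webapp:v1.2.3
          command: ["./migrate", "--up"]   # illustrative entrypoint
          envFrom:
            - secretRef:
                name: ge-secrets
```

Gating the rollout would then look like `kubectl wait --for=condition=complete job/webapp-migrate-v1-2-3 -n sh-acme-corp --timeout=300s` before deploying the new image.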

Secret Validation

# 1. Secrets exist in namespace
kubectl get secrets -n "$NAMESPACE"
# Verify: ge-secrets exists

# 2. Secrets have required keys
kubectl get secret ge-secrets -n "$NAMESPACE" -o jsonpath='{.data}' | jq 'keys'
# Verify: Contains required keys (redis-password, api-key, etc.)

# 3. Secrets are in Vault
vault kv get secret/clients/acme-corp
# Verify: Vault contains all required secrets

Network Connectivity

# 1. Ingress is healthy
kubectl get ingress web -n "$NAMESPACE"
# Verify: Address is populated

# 2. DNS resolves
dig acme-corp.hosting.growing-europe.com
# Verify: Points to correct IP

# 3. SSL certificate is valid
echo | openssl s_client -servername acme-corp.hosting.growing-europe.com \
  -connect acme-corp.hosting.growing-europe.com:443 2>/dev/null | \
  openssl x509 -noout -dates
# Verify: Not expired

# 4. Traefik is routing
kubectl logs -n ge-ingress deploy/traefik | grep acme-corp | tail -5
# Verify: Recent traffic logs

Communication

  • [ ] Notify stakeholders of deployment window
  • [ ] Schedule deployment during low-traffic period (if possible)
  • [ ] Prepare rollback plan
  • [ ] Have incident response team on standby (for critical deployments)

Deployment Procedure

Using Immutable Packages

Recommended method for production deployments.

Step 1: Verify Package

cd /home/claude/ge-bootstrap/tools

# Verify package integrity
./verify-package.sh ../packages/acme-corp-v1.2.3

Step 2: Dry Run

# Preview changes without applying
./deploy-package.sh \
  --package ../packages/acme-corp-v1.2.3 \
  --dry-run

Step 3: Deploy

# Deploy to cluster
./deploy-package.sh --package ../packages/acme-corp-v1.2.3

# Deployment prompts for confirmation:
# Proceed with deployment? (y/N)

Step 4: Monitor Rollout

NAMESPACE="sh-acme-corp"

# Watch rollout status
kubectl rollout status deployment/web -n "$NAMESPACE"

# Watch pods updating
kubectl get pods -n "$NAMESPACE" -w

Expected Output:

NAME                  READY   STATUS              RESTARTS   AGE
web-7d4f8b-old        1/1     Running             0          5m
web-9f2k1a-new        0/1     ContainerCreating   0          5s

# After new pod ready:
web-7d4f8b-old        1/1     Running             0          5m
web-9f2k1a-new        1/1     Running             0          20s

# Old pod terminating:
web-7d4f8b-old        1/1     Terminating         0          5m
web-9f2k1a-new        1/1     Running             0          30s

# Update complete:
web-9f2k1a-new        1/1     Running             0          45s

Using kubectl (Manual)

Step 1: Update Image

NAMESPACE="sh-acme-corp"
IMAGE="acme-corp/webapp:v1.2.3"

# Update deployment image
kubectl set image deployment/web \
  nginx="$IMAGE" \
  -n "$NAMESPACE"

Step 2: Monitor Rollout

# Watch rollout
kubectl rollout status deployment/web -n "$NAMESPACE"

# If issues, pause rollout
kubectl rollout pause deployment/web -n "$NAMESPACE"

# Fix issues, then resume
kubectl rollout resume deployment/web -n "$NAMESPACE"

Using Kustomize (GitOps)

Step 1: Update Kustomization

cd /home/claude/ge-bootstrap/k8s/clients/acme-corp

# Update image in kustomization.yaml
kustomize edit set image nginx=acme-corp/webapp:v1.2.3

Step 2: Apply Changes

kubectl apply -k .

Step 3: Monitor

kubectl rollout status deployment/web -n sh-acme-corp

Post-Deployment Verification

Immediate Checks (0-5 minutes)

1. Deployment Status

NAMESPACE="sh-acme-corp"

# Check deployment condition
kubectl get deployment web -n "$NAMESPACE" -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'
# Expected: True

# Check replica count
kubectl get deployment web -n "$NAMESPACE" -o jsonpath='{.status.readyReplicas}/{.spec.replicas}'
# Expected: 2/2 (or configured replica count)

2. Pod Health

# All pods running
kubectl get pods -n "$NAMESPACE" -l app=web
# Verify: All show Running, READY 1/1

# No restarts
kubectl get pods -n "$NAMESPACE" -l app=web -o jsonpath='{.items[*].status.containerStatuses[*].restartCount}'
# Expected: 0 or low number

# Check events
kubectl get events -n "$NAMESPACE" --sort-by='.lastTimestamp' | tail -10
# Verify: No error events

3. Application Health

# Test health endpoint
POD=$(kubectl get pod -n "$NAMESPACE" -l app=web -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n "$NAMESPACE" "$POD" -- wget -qO- http://localhost:8080/health
# Expected: {"status": "healthy"} or similar

# Check application logs
kubectl logs -n "$NAMESPACE" -l app=web --tail=50
# Verify: No errors, successful startup messages

4. Ingress Routing

# Test external access
curl -I https://acme-corp.hosting.growing-europe.com
# Expected: HTTP/2 200

# Test full page load
curl -s https://acme-corp.hosting.growing-europe.com | head -20
# Verify: Correct content returned

Extended Monitoring (5-30 minutes)

1. Performance Metrics

# CPU and memory usage
kubectl top pods -n "$NAMESPACE"
# Verify: Within expected ranges

# Request rate (from Traefik)
kubectl logs -n ge-ingress deploy/traefik | grep acme-corp | tail -20
# Verify: 200 responses, no 5xx errors

2. Error Rate

# Application errors
kubectl logs -n "$NAMESPACE" -l app=web --tail=500 | grep -i error
# Verify: No new errors

# Container restarts
kubectl get pods -n "$NAMESPACE" -l app=web -o jsonpath='{.items[*].status.containerStatuses[*].restartCount}'
# Verify: No increase in restarts

3. Dependency Connectivity

# If app uses Redis
POD=$(kubectl get pod -n "$NAMESPACE" -l app=web -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n "$NAMESPACE" "$POD" -- nc -zv redis.ge-system.svc.cluster.local 6381
# Expected: Connection successful

# If app uses external APIs
kubectl logs -n "$NAMESPACE" -l app=web | grep "API connection"
# Verify: Successful connections

4. User Impact Check

# Check for error reports (from monitoring/alerting)
# Check user-facing metrics (response time, error rate)
# Review support tickets (if any)
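Where no dashboard is available, a quick manual sample works: collect status codes with `curl -s -o /dev/null -w '%{http_code}' "$URL"` and tally the non-200s. The tally helper below is a pure-shell sketch (the sampling itself is environment-specific):

```shell
#!/bin/sh
# Tally non-200 status codes from a manual sample. In practice each code
# would come from: curl -s -o /dev/null -w '%{http_code}' "$URL"
error_tally() {  # usage: error_tally <code> [<code> ...]
  total=0; fail=0
  for c in "$@"; do
    total=$((total + 1))
    [ "$c" = "200" ] || fail=$((fail + 1))
  done
  echo "$fail/$total non-200"
}

error_tally 200 200 503 200 200   # -> 1/5 non-200
```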

Rollback Procedures

When to Rollback

Rollback immediately if:

  • Pods are crashing (CrashLoopBackOff)
  • Health checks failing consistently
  • Error rate significantly increased
  • Application not starting
  • Critical functionality broken

Quick Rollback (kubectl)

Method 1: Rollback to Previous Version

NAMESPACE="sh-acme-corp"

# Immediate rollback
kubectl rollout undo deployment/web -n "$NAMESPACE"

# Monitor rollback
kubectl rollout status deployment/web -n "$NAMESPACE"

# Verify
kubectl get pods -n "$NAMESPACE" -l app=web

Method 2: Rollback to Specific Revision

# Check rollout history
kubectl rollout history deployment/web -n "$NAMESPACE"

# Example output:
# REVISION  CHANGE-CAUSE
# 1         Initial deployment
# 2         Update to v1.2.2
# 3         Update to v1.2.3

# Rollback to specific revision
kubectl rollout undo deployment/web -n "$NAMESPACE" --to-revision=2

# Verify
kubectl rollout status deployment/web -n "$NAMESPACE"

Rollback Using Previous Package

Preferred method for critical production environments.

cd /home/claude/ge-bootstrap/tools

# Identify previous working package
ls -lt ../packages/ | grep acme-corp

# Deploy previous package
./deploy-package.sh --package ../packages/acme-corp-v1.2.2

# Monitor
kubectl rollout status deployment/web -n sh-acme-corp

Rollback Verification

NAMESPACE="sh-acme-corp"

# 1. Pods running with old image
kubectl get pods -n "$NAMESPACE" -o jsonpath='{.items[*].spec.containers[*].image}'
# Expected: Previous image version

# 2. Health checks passing
kubectl get pods -n "$NAMESPACE" -l app=web -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}'
# Expected: True for all pods

# 3. Application responding
curl -I https://acme-corp.hosting.growing-europe.com
# Expected: HTTP/2 200

# 4. No errors in logs
kubectl logs -n "$NAMESPACE" -l app=web --tail=50 | grep -i error
# Expected: No critical errors

Post-Rollback Actions

  1. Document Rollback

    # Add annotation to deployment
    kubectl annotate deployment web -n "$NAMESPACE" \
      rollback.timestamp="$(date -Iseconds)" \
      rollback.reason="Pods crashing on startup" \
      rollback.from-version="v1.2.3" \
      rollback.to-version="v1.2.2"

  2. Investigate Root Cause

    • Review deployment logs
    • Check application logs
    • Analyze pod events
    • Test in staging environment

  3. Notify Stakeholders

    • Inform team of rollback
    • Provide status update
    • Share investigation plan

  4. Plan Remediation

    • Fix issues in new version
    • Test thoroughly in staging
    • Schedule redeployment

Incident Response During Deployments

Deployment Failure Scenarios

Scenario 1: Pods Won't Start

Symptoms:

  • Pods in CrashLoopBackOff
  • ImagePullBackOff status

Response:

NAMESPACE="sh-acme-corp"

# 1. Pause rollout immediately
kubectl rollout pause deployment/web -n "$NAMESPACE"

# 2. Check pod status
kubectl get pods -n "$NAMESPACE" -l app=web

# 3. Describe problematic pod
kubectl describe pod <failing-pod> -n "$NAMESPACE"

# 4. Check logs
kubectl logs <failing-pod> -n "$NAMESPACE"

# 5. Common fixes:
# - ImagePullBackOff: Verify image exists
docker pull <image-name>

# - CrashLoopBackOff: Check application logs
kubectl logs <failing-pod> -n "$NAMESPACE" --previous

# 6. If unfixable, rollback
kubectl rollout undo deployment/web -n "$NAMESPACE"

Scenario 2: High Error Rate

Symptoms:

  • Increased 500 errors from Traefik
  • Application logging errors
  • Health checks intermittently failing

Response:

# 1. Check error rate
kubectl logs -n ge-ingress deploy/traefik | grep "acme-corp" | grep -E "500|502|503" | tail -20

# 2. Check application logs
kubectl logs -n "$NAMESPACE" -l app=web --tail=100 | grep -i error

# 3. If error rate unacceptable, rollback
kubectl rollout undo deployment/web -n "$NAMESPACE"

# 4. If tolerable, investigate and monitor
kubectl logs -n "$NAMESPACE" -l app=web -f

Scenario 3: Database Migration Failed

Symptoms:

  • Application logs show database errors
  • Migration script failed

Response:

# 1. Pause deployment
kubectl rollout pause deployment/web -n "$NAMESPACE"

# 2. Check migration status
# Access application pod
POD=$(kubectl get pod -n "$NAMESPACE" -l app=web -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it -n "$NAMESPACE" "$POD" -- /bin/sh

# 3. Manual migration rollback (if possible)
# Run migration rollback script

# 4. Restore database from backup (if necessary)
# Contact DBA or use backup tool

# 5. Rollback deployment
kubectl rollout undo deployment/web -n "$NAMESPACE"

Scenario 4: Secret Missing or Invalid

Symptoms:

  • Pods show error: "secret not found"
  • Application can't connect to external services

Response:

# 1. Pause rollout
kubectl rollout pause deployment/web -n "$NAMESPACE"

# 2. Check secrets
kubectl get secrets -n "$NAMESPACE"

# 3. Verify secret content
kubectl get secret ge-secrets -n "$NAMESPACE" -o jsonpath='{.data}' | jq 'keys'

# 4. Recreate missing secrets
vault kv get secret/clients/acme-corp

kubectl create secret generic ge-secrets \
  -n "$NAMESPACE" \
  --from-literal=api-key="..." \
  --dry-run=client -o yaml | kubectl apply -f -

# 5. Resume deployment
kubectl rollout resume deployment/web -n "$NAMESPACE"

# 6. If issues persist, rollback
kubectl rollout undo deployment/web -n "$NAMESPACE"

Escalation Procedures

Level 1: Self-Service (Operator)

  • Pause rollout
  • Check logs and events
  • Attempt quick fixes
  • Rollback if necessary

Level 2: Team Lead

  • Deployment repeatedly fails
  • Root cause unclear
  • Issue affects multiple clients

Level 3: Infrastructure Team

  • Cluster-wide issues
  • Network or ingress problems
  • Storage or node failures

Level 4: Emergency

  • Complete service outage
  • Data loss risk
  • Security breach


PodDisruptionBudget Explanation

What is PodDisruptionBudget (PDB)

A PDB ensures a minimum number of pods remain available during voluntary disruptions:

  • Node maintenance
  • Cluster upgrades
  • Deployment updates
  • Node draining

File: /home/claude/ge-bootstrap/k8s/templates/dedicated/pdb.yaml

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web
  namespace: ded-bigcorp
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: web

PDB Configurations

Configuration        Description                      Use Case
minAvailable: 1      At least 1 pod always running    Minimum availability
minAvailable: 2      At least 2 pods always running   Higher availability
maxUnavailable: 1    Only 1 pod can be down           Gradual updates
maxUnavailable: 50%  Half of pods can be down         Faster updates
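For a minAvailable PDB, the ALLOWED DISRUPTIONS value kubectl reports is simply the current healthy pod count minus the floor, clamped at zero. A shell sketch of that rule:

```shell
#!/bin/sh
# Voluntary disruptions currently permitted by a minAvailable PDB.
allowed_disruptions() {  # usage: allowed_disruptions <healthy_pods> <minAvailable>
  d=$(( $1 - $2 ))
  [ "$d" -lt 0 ] && d=0   # never negative
  echo "$d"
}

allowed_disruptions 3 1   # 2: two pods may be evicted right now
allowed_disruptions 1 1   # 0: a node drain would block on this PDB
```

When this value is 0, `kubectl drain` waits (or fails with a disruption budget error) until more pods become healthy.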

PDB and Rolling Updates

Without PDB (Shared Hosting):

  • Rolling update proceeds based on deployment strategy alone
  • No additional protection

With PDB (Dedicated Hosting):

  • Rolling update must respect PDB constraints
  • K8s ensures minAvailable pods are always ready
  • Update may take longer but guarantees availability

Example:

# Deployment: 3 replicas, maxUnavailable: 0
# PDB: minAvailable: 2

# Rolling update process:
# 1. Start with 3 old pods running
# 2. Create 1 new pod (maxSurge)
# 3. Wait for new pod ready (4 pods total)
# 4. Terminate 1 old pod (3 pods remain, 2 old + 1 new)
# 5. PDB ensures 2+ pods always available
# 6. Repeat until all pods updated

Checking PDB Status

# List PDBs
kubectl get pdb -n ded-bigcorp

# Example output:
# NAME   MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
# web    1               N/A               2                     5d

# Describe PDB
kubectl describe pdb web -n ded-bigcorp

# Key fields:
# - DisruptionsAllowed: How many pods can be terminated now
# - Current: Current number of healthy pods
# - Desired: Desired number of healthy pods (minAvailable)

PDB Best Practices

  1. Set appropriate minAvailable

    • Too high: updates may be blocked
    • Too low: insufficient availability protection

  2. Use with adequate replicas

    • A PDB with 1 replica is useless
    • Recommended: minAvailable = replicas - 1

  3. Monitor disruptions

    kubectl get pdb -n ded-bigcorp -w

  4. Test during maintenance

    # Drain a node to test PDB effectiveness
    kubectl drain <node-name> --ignore-daemonsets

Deployment Strategies Comparison

Feature          Rolling Update           Blue-Green                  Canary
Downtime         None                     None                        None
Resource Usage   +33-100% during update   Double during switch        +10-50% during rollout
Rollback Speed   Fast (30-60s)            Instant (service switch)    Fast (route change)
Risk             Low                      Very Low                    Very Low
Complexity       Low                      Medium                      High
Testing          Limited                  Full testing before switch  Gradual validation
Use Case         Standard deployments     Critical services           High-risk changes

When to Use Each Strategy

Rolling Update (Default):

  • Standard application updates
  • Low-risk changes
  • Resource-constrained environments
  • Shared hosting

Blue-Green:

  • Critical production services
  • Major version upgrades
  • When extensive testing is needed
  • Dedicated hosting with available capacity

Canary (Advanced):

  • High-risk deployments
  • A/B testing
  • Gradual feature rollout
  • Large-scale services
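Canary is only summarized in this runbook. One minimal approach, sketched under the assumption that the Service selects on `app: web` alone (so traffic splits by pod count), runs a small canary deployment next to the stable one; all names and the image tag below are illustrative:

```yaml
# Hypothetical canary: 1 canary pod alongside 3 stable pods receives
# roughly 25% of traffic, assuming the Service selector matches only
# `app: web` (not `version`).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-canary
  namespace: ded-bigcorp
  labels:
    app: web
    version: canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
      version: canary
  template:
    metadata:
      labels:
        app: web
        version: canary
    spec:
      containers:
        - name: nginx
          image: bigcorp/webapp:v2.1.0-rc1
```

Scaling `web-canary` up while scaling the stable deployment down shifts the ratio gradually; precise weighted routing would instead need ingress-level support (e.g. Traefik weighted services).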


Troubleshooting

Deployment Stuck in Progress

Symptoms:

kubectl rollout status deployment/web -n sh-acme-corp
# Output: Waiting for deployment "web" rollout to finish: 1 old replicas are pending termination...

Diagnosis:

# Check pod status
kubectl get pods -n sh-acme-corp -l app=web

# Check pod events
kubectl get events -n sh-acme-corp --sort-by='.lastTimestamp' | tail -20

# Check for PDB blocking termination
kubectl get pdb -n sh-acme-corp

Solutions:

  1. Old Pod Stuck in Terminating

    # Force delete pod
    kubectl delete pod <pod-name> -n sh-acme-corp --force --grace-period=0
    

  2. PDB Preventing Termination

    # Temporarily relax PDB
    kubectl patch pdb web -n sh-acme-corp -p '{"spec":{"minAvailable":0}}'
    
    # After deployment completes, restore PDB
    kubectl patch pdb web -n sh-acme-corp -p '{"spec":{"minAvailable":1}}'
    

  3. New Pod Not Ready

    # Check readiness probe
    kubectl describe pod <new-pod> -n sh-acme-corp | grep -A 10 "Readiness"
    
    # Check probe endpoint
    POD=$(kubectl get pod -n sh-acme-corp -l app=web -o jsonpath='{.items[0].metadata.name}')
    kubectl exec -n sh-acme-corp "$POD" -- wget -O- http://localhost:8080/health
    

Rollback Doesn't Fix Issue

Symptoms:

  • Rollback completed successfully
  • Application still not working

Diagnosis:

# Check if issue is environmental, not code-related
# 1. Secrets changed?
kubectl get secret ge-secrets -n sh-acme-corp -o yaml

# 2. ConfigMaps changed?
kubectl get cm -n sh-acme-corp

# 3. External dependencies down?
kubectl logs -n sh-acme-corp -l app=web | grep "connection refused"

# 4. Network policy issue?
kubectl get networkpolicy -n sh-acme-corp

Solutions:

  1. Restore Secrets

    vault kv get secret/clients/acme-corp
    # Recreate secrets with correct values
    

  2. Restore ConfigMaps

    # Restore from Git
    git checkout <previous-commit> -- k8s/clients/acme-corp/configmap.yaml
    kubectl apply -f k8s/clients/acme-corp/configmap.yaml
    

  3. Check External Dependencies

    # Test from pod
    kubectl run -it --rm debug --image=alpine -n sh-acme-corp -- sh
    # Inside pod:
    wget -O- https://external-api.example.com/health
    

Image Not Found

Symptoms:

ImagePullBackOff: failed to pull image "acme-corp/webapp:v1.2.3": not found

Solutions:

# 1. Verify image exists
docker pull acme-corp/webapp:v1.2.3

# 2. Check image name spelling
kubectl get deployment web -n sh-acme-corp -o jsonpath='{.spec.template.spec.containers[*].image}'

# 3. If private registry, check image pull secrets
kubectl get secret -n sh-acme-corp | grep docker

# 4. Create image pull secret if missing
kubectl create secret docker-registry regcred \
  -n sh-acme-corp \
  --docker-server=registry.example.com \
  --docker-username=<user> \
  --docker-password=<password>

# 5. Add to deployment
kubectl patch deployment web -n sh-acme-corp -p '{"spec":{"template":{"spec":{"imagePullSecrets":[{"name":"regcred"}]}}}}'


Deployment Checklist

Print and use this checklist for production deployments:

PRE-DEPLOYMENT
□ Package verified (verify-package.sh)
□ Dry run executed and reviewed
□ Current deployment is healthy
□ Nodes have available resources
□ Secrets validated in Vault
□ Database backup taken (if applicable)
□ Stakeholders notified
□ Rollback plan prepared

DEPLOYMENT
□ Deployment started
□ Rollout monitored (kubectl rollout status)
□ New pods created successfully
□ New pods passed readiness checks
□ Old pods terminated gracefully
□ Rollout completed successfully

POST-DEPLOYMENT (0-5 min)
□ All replicas ready
□ Pods are running without restarts
□ Health endpoints responding
□ Ingress routing working
□ No errors in logs

POST-DEPLOYMENT (5-30 min)
□ Performance metrics normal
□ Error rate within acceptable range
□ Dependencies connected
□ No user-reported issues

COMPLETION
□ Deployment documented
□ Team notified of success
□ Monitoring reviewed
□ Old package archived (if applicable)

This runbook is maintained by the GE Infrastructure Team. For updates or questions, contact the infrastructure lead or create an issue in the ge-ops repository.