
Zero-Downtime Deployment Runbook

Last Updated: 2026-01-29
Status: Active
Maintained by: GE Infrastructure Team
Estimated Time: 10-20 minutes per deployment


Overview

This runbook provides step-by-step procedures for deploying updates to client workloads without service interruption. Zero-downtime deployments are critical for maintaining service availability and meeting SLA commitments.

Deployment Strategies Covered:

  • Rolling updates for shared hosting (default)
  • Blue-green deployments for dedicated hosting (optional)
  • Canary deployments for gradual rollouts (optional)



Rolling Update Strategy

Overview

Rolling updates gradually replace old pods with new pods, maintaining availability throughout the update process.

How It Works:

  1. Create new pod with updated image/config
  2. Wait for new pod to become Ready
  3. Terminate one old pod
  4. Repeat until all pods are updated

Configuration

Shared Hosting Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: sh-acme-corp
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1           # Allow 1 extra pod during update
      maxUnavailable: 0     # Ensure all replicas available

Dedicated Hosting Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: ded-bigcorp
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 100%        # Double capacity during update
      maxUnavailable: 0     # Ensure all replicas available

Strategy Parameters

Parameter        Description               Shared Hosting        Dedicated Hosting
maxSurge         Extra pods during update  1 (33-50%)            100% (double capacity)
maxUnavailable   Pods that can be down     0 (always available)  0 (always available)

maxSurge Examples:

  • Shared (1 replica): maxSurge: 1
    • During update: 2 pods total (1 old + 1 new)
    • After new pod ready: 1 pod (old terminated)

  • Shared (2 replicas): maxSurge: 1
    • During update: 3 pods total (2 old + 1 new)
    • Rolling: gradually replace old with new

  • Dedicated (3 replicas): maxSurge: 100%
    • During update: 6 pods total (3 old + 3 new)
    • All new pods start before any old pods terminate
    • Maximum capacity during update

maxUnavailable = 0:

  • Always maintains desired replica count or higher
  • No service degradation during update
  • Critical for the zero-downtime requirement
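Kubernetes converts a percentage maxSurge to a pod count by rounding up (maxUnavailable rounds down). The surge arithmetic behind the examples above can be sketched in shell:

```shell
#!/bin/sh
# Effective extra pods during an update: percentage maxSurge rounds UP.
surge_pods() {  # usage: surge_pods <replicas> <maxSurge: "N" or "N%">
  replicas=$1
  surge=$2
  case "$surge" in
    *%) echo $(( (replicas * ${surge%\%} + 99) / 100 )) ;;  # ceil(replicas * pct / 100)
    *)  echo "$surge" ;;                                    # absolute value used as-is
  esac
}

surge_pods 2 1      # shared: 1 extra pod (3 total during update)
surge_pods 3 100%   # dedicated: 3 extra pods (6 total during update)
```

This is only the rounding rule, not a kubectl feature; the real values are reported by `kubectl describe deployment`.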

Update Timeline

Shared Hosting (2 replicas):

sequenceDiagram
    participant K8s
    participant OldPod1
    participant OldPod2
    participant NewPod1
    participant NewPod2

    Note over K8s: Start rolling update
    K8s->>NewPod1: Create (maxSurge +1)
    Note over NewPod1: Starting...
    NewPod1->>K8s: Ready
    K8s->>OldPod1: Terminate
    Note over OldPod1: Shutting down
    K8s->>NewPod2: Create
    NewPod2->>K8s: Ready
    K8s->>OldPod2: Terminate
    Note over K8s: Update complete

Duration: 1-3 minutes (depends on startup time)

Dedicated Hosting (3 replicas, maxSurge 100%):

sequenceDiagram
    participant K8s
    participant Old as Old Pods (3)
    participant New as New Pods (3)

    Note over K8s: Start rolling update
    K8s->>New: Create all 3 new pods (maxSurge 100%)
    Note over New: Starting...
    New->>K8s: All ready
    Note over K8s,New: 6 pods serving traffic
    K8s->>Old: Terminate all old pods
    Note over Old: Graceful shutdown (30s)
    Note over K8s: Update complete

Duration: 30-60 seconds (parallel startup + graceful shutdown)
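The durations above can be sanity-checked with a back-of-the-envelope formula (illustrative only: batches = ceil(replicas / surge pods), each batch costing roughly one pod startup plus graceful shutdown):

```shell
#!/bin/sh
# Rough rollout duration in seconds, assuming batches proceed serially.
rollout_estimate() {  # usage: rollout_estimate <replicas> <surge_pods> <secs_per_batch>
  r=$1; s=$2; t=$3
  echo $(( ((r + s - 1) / s) * t ))   # ceil(r / s) batches * t seconds
}

rollout_estimate 2 1 45   # shared, maxSurge 1: 2 batches -> 90s
rollout_estimate 3 3 45   # dedicated, maxSurge 100%: 1 batch -> 45s
```

Actual times depend on image pull, readiness probe delay, and terminationGracePeriodSeconds.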


Blue-Green Deployment

Overview

Blue-green deployments maintain two identical environments (blue = current, green = new). Traffic switches instantly after the green environment is validated.

Use Case: Dedicated hosting with critical uptime requirements

Procedure

Step 1: Deploy Green Environment

# Create green deployment (new version)
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-green
  namespace: ded-bigcorp
  labels:
    app: web
    version: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
      version: green
  template:
    metadata:
      labels:
        app: web
        version: green
    spec:
      containers:
        - name: nginx
          image: bigcorp/webapp:v2.0.0
          # ... rest of spec
EOF

Step 2: Wait for Green to be Ready

kubectl wait --for=condition=available \
  deployment/web-green -n ded-bigcorp --timeout=180s

Step 3: Test Green Environment

# Port-forward to green pods for testing
kubectl port-forward -n ded-bigcorp deploy/web-green 8080:80

# Test in browser or with curl
curl http://localhost:8080/health
curl http://localhost:8080/api/version

Step 4: Switch Traffic to Green

# Update service selector to point to green
kubectl patch service web -n ded-bigcorp -p '{"spec":{"selector":{"version":"green"}}}'
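The patch above assumes the Service selects pods by both the app and version labels, so cutover is just a selector change. A sketch of that assumed Service (field values are illustrative, not taken from the cluster):

```yaml
# Assumed Service shape for the blue-green switch: traffic follows the
# `version` label, which Step 4 patches from blue to green.
apiVersion: v1
kind: Service
metadata:
  name: web
  namespace: ded-bigcorp
spec:
  selector:
    app: web
    version: blue    # patched to "green" during cutover
  ports:
    - port: 80
      targetPort: 80
```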

Step 5: Verify Traffic Switch

# Check service endpoints
kubectl get endpoints web -n ded-bigcorp

# Should show green pod IPs
# Test production URL
curl https://bigcorp.hosting.growing-europe.com/health

Step 6: Monitor Green

# Watch pods
kubectl get pods -n ded-bigcorp -l version=green -w

# Monitor logs
kubectl logs -n ded-bigcorp -l version=green --tail=50 -f

# Wait 10-15 minutes, monitor for errors

Step 7: Remove Blue (Old)

# After successful verification, remove old deployment
kubectl delete deployment web-blue -n ded-bigcorp

# Or keep for quick rollback
# Delete after longer observation period (1 hour)

Blue-Green Rollback

If issues are found after switching:

# Instantly switch back to blue
kubectl patch service web -n ded-bigcorp -p '{"spec":{"selector":{"version":"blue"}}}'

# Verify rollback
kubectl get endpoints web -n ded-bigcorp
curl https://bigcorp.hosting.growing-europe.com/health

Pre-Deployment Checklist

Use this checklist before every deployment:

Health Check Verification

CLIENT_NAME="acme-corp"
NAMESPACE="sh-${CLIENT_NAME}"

# 1. Current deployment is healthy
kubectl get deployment web -n "$NAMESPACE"
# Verify: READY shows X/X (all replicas ready)

# 2. Pods are running
kubectl get pods -n "$NAMESPACE"
# Verify: All pods show Running, no CrashLoopBackOff

# 3. Health endpoints responding
POD=$(kubectl get pod -n "$NAMESPACE" -l app=web -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n "$NAMESPACE" "$POD" -- wget -O- http://localhost:8080/health
# Verify: Returns 200 OK

# 4. No recent errors in logs
kubectl logs -n "$NAMESPACE" -l app=web --tail=100 | grep -i error
# Verify: No critical errors

Resource Availability

# 1. Node capacity
kubectl top nodes
# Verify: Nodes have available CPU/memory

# 2. Current resource usage
kubectl top pods -n "$NAMESPACE"
# Verify: Pods not hitting limits

# 3. Storage availability
kubectl get pvc -n "$NAMESPACE"
# Verify: No PVCs in Pending state

# 4. Image availability
IMAGE="acme-corp/webapp:v1.2.3"
docker pull "$IMAGE"
# Verify: Image can be pulled

Database Migrations (If Applicable)

# 1. Check if migrations are needed
# Review application changelog

# 2. Test migrations in staging
# Run migration scripts in staging environment

# 3. Backup production database
# Create database backup before deployment

# 4. Plan migration execution
# Decide: before deployment, during, or after?
# Ensure migrations are backward compatible
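One common pattern for the "before deployment" option is a one-off Kubernetes Job that runs the migration and must complete before the image update proceeds. A hedged sketch (the Job name, command, and migration tooling are assumptions, not part of this platform's tooling):

```yaml
# Hypothetical pre-deployment migration Job; the image and command
# depend entirely on the application's migration tooling.
apiVersion: batch/v1
kind: Job
metadata:
  name: webapp-migrate-v1-2-3
  namespace: sh-acme-corp
spec:
  backoffLimit: 0          # fail fast; never retry a partial migration
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: acme-corp/webapp:v1.2.3
          command: ["./migrate", "--up"]   # illustrative entrypoint
          envFrom:
            - secretRef:
                name: ge-secrets
```

Gating the rollout would then look like `kubectl wait --for=condition=complete job/webapp-migrate-v1-2-3 -n sh-acme-corp --timeout=300s` before deploying the new image.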

Secret Validation

# 1. Secrets exist in namespace
kubectl get secrets -n "$NAMESPACE"
# Verify: ge-secrets exists

# 2. Secrets have required keys
kubectl get secret ge-secrets -n "$NAMESPACE" -o jsonpath='{.data}' | jq 'keys'
# Verify: Contains required keys (redis-password, api-key, etc.)

# 3. Secrets are in Vault
vault kv get secret/clients/acme-corp
# Verify: Vault contains all required secrets

Network Connectivity

# 1. Ingress is healthy
kubectl get ingress web -n "$NAMESPACE"
# Verify: Address is populated

# 2. DNS resolves
dig acme-corp.hosting.growing-europe.com
# Verify: Points to correct IP

# 3. SSL certificate is valid
echo | openssl s_client -servername acme-corp.hosting.growing-europe.com \
  -connect acme-corp.hosting.growing-europe.com:443 2>/dev/null | \
  openssl x509 -noout -dates
# Verify: Not expired

# 4. Traefik is routing
kubectl logs -n ge-ingress deploy/traefik | grep acme-corp | tail -5
# Verify: Recent traffic logs

Communication

  • [ ] Notify stakeholders of deployment window
  • [ ] Schedule deployment during low-traffic period (if possible)
  • [ ] Prepare rollback plan
  • [ ] Have incident response team on standby (for critical deployments)

Deployment Procedure

Using Immutable Packages

Recommended method for production deployments.

Step 1: Verify Package

cd /home/claude/ge-bootstrap/tools

# Verify package integrity
./verify-package.sh ../packages/acme-corp-v1.2.3

Step 2: Dry Run

# Preview changes without applying
./deploy-package.sh \
  --package ../packages/acme-corp-v1.2.3 \
  --dry-run

Step 3: Deploy

# Deploy to cluster
./deploy-package.sh --package ../packages/acme-corp-v1.2.3

# Deployment prompts for confirmation:
# Proceed with deployment? (y/N)

Step 4: Monitor Rollout

NAMESPACE="sh-acme-corp"

# Watch rollout status
kubectl rollout status deployment/web -n "$NAMESPACE"

# Watch pods updating
kubectl get pods -n "$NAMESPACE" -w

Expected Output:

NAME                  READY   STATUS              RESTARTS   AGE
web-7d4f8b-old        1/1     Running             0          5m
web-9f2k1a-new        0/1     ContainerCreating   0          5s

# After new pod ready:
web-7d4f8b-old        1/1     Running             0          5m
web-9f2k1a-new        1/1     Running             0          20s

# Old pod terminating:
web-7d4f8b-old        1/1     Terminating         0          5m
web-9f2k1a-new        1/1     Running             0          30s

# Update complete:
web-9f2k1a-new        1/1     Running             0          45s

Using kubectl (Manual)

Step 1: Update Image

NAMESPACE="sh-acme-corp"
IMAGE="acme-corp/webapp:v1.2.3"

# Update deployment image
kubectl set image deployment/web \
  nginx="$IMAGE" \
  -n "$NAMESPACE"

Step 2: Monitor Rollout

# Watch rollout
kubectl rollout status deployment/web -n "$NAMESPACE"

# If issues, pause rollout
kubectl rollout pause deployment/web -n "$NAMESPACE"

# Fix issues, then resume
kubectl rollout resume deployment/web -n "$NAMESPACE"

Using Kustomize (GitOps)

Step 1: Update Kustomization

cd /home/claude/ge-bootstrap/k8s/clients/acme-corp

# Update image in kustomization.yaml
kustomize edit set image nginx=acme-corp/webapp:v1.2.3

Step 2: Apply Changes

kubectl apply -k .

Step 3: Monitor

kubectl rollout status deployment/web -n sh-acme-corp

Post-Deployment Verification

Immediate Checks (0-5 minutes)

1. Deployment Status

NAMESPACE="sh-acme-corp"

# Check deployment condition
kubectl get deployment web -n "$NAMESPACE" -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'
# Expected: True

# Check replica count
kubectl get deployment web -n "$NAMESPACE" -o jsonpath='{.status.readyReplicas}/{.spec.replicas}'
# Expected: 2/2 (or configured replica count)

2. Pod Health

# All pods running
kubectl get pods -n "$NAMESPACE" -l app=web
# Verify: All show Running, READY 1/1

# No restarts
kubectl get pods -n "$NAMESPACE" -l app=web -o jsonpath='{.items[*].status.containerStatuses[*].restartCount}'
# Expected: 0 or low number

# Check events
kubectl get events -n "$NAMESPACE" --sort-by='.lastTimestamp' | tail -10
# Verify: No error events

3. Application Health

# Test health endpoint
POD=$(kubectl get pod -n "$NAMESPACE" -l app=web -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n "$NAMESPACE" "$POD" -- wget -qO- http://localhost:8080/health
# Expected: {"status": "healthy"} or similar

# Check application logs
kubectl logs -n "$NAMESPACE" -l app=web --tail=50
# Verify: No errors, successful startup messages

4. Ingress Routing

# Test external access
curl -I https://acme-corp.hosting.growing-europe.com
# Expected: HTTP/2 200

# Test full page load
curl -s https://acme-corp.hosting.growing-europe.com | head -20
# Verify: Correct content returned

Extended Monitoring (5-30 minutes)

1. Performance Metrics

# CPU and memory usage
kubectl top pods -n "$NAMESPACE"
# Verify: Within expected ranges

# Request rate (from Traefik)
kubectl logs -n ge-ingress deploy/traefik | grep acme-corp | tail -20
# Verify: 200 responses, no 5xx errors

2. Error Rate

# Application errors
kubectl logs -n "$NAMESPACE" -l app=web --tail=500 | grep -i error
# Verify: No new errors

# Container restarts
kubectl get pods -n "$NAMESPACE" -l app=web -o jsonpath='{.items[*].status.containerStatuses[*].restartCount}'
# Verify: No increase in restarts

3. Dependency Connectivity

# If app uses Redis
POD=$(kubectl get pod -n "$NAMESPACE" -l app=web -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n "$NAMESPACE" "$POD" -- nc -zv redis.ge-system.svc.cluster.local 6381
# Expected: Connection successful

# If app uses external APIs
kubectl logs -n "$NAMESPACE" -l app=web | grep "API connection"
# Verify: Successful connections

4. User Impact Check

# Check for error reports (from monitoring/alerting)
# Check user-facing metrics (response time, error rate)
# Review support tickets (if any)
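Where no dashboard is available, a quick manual sample works: collect status codes with `curl -s -o /dev/null -w '%{http_code}' "$URL"` and tally the non-200s. The tally helper below is a pure-shell sketch (the sampling itself is environment-specific):

```shell
#!/bin/sh
# Tally non-200 status codes from a manual sample. In practice each code
# would come from: curl -s -o /dev/null -w '%{http_code}' "$URL"
error_tally() {  # usage: error_tally <code> [<code> ...]
  total=0; fail=0
  for c in "$@"; do
    total=$((total + 1))
    [ "$c" = "200" ] || fail=$((fail + 1))
  done
  echo "$fail/$total non-200"
}

error_tally 200 200 503 200 200   # -> 1/5 non-200
```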

Rollback Procedures

When to Rollback

Rollback immediately if:

  • Pods are crashing (CrashLoopBackOff)
  • Health checks failing consistently
  • Error rate significantly increased
  • Application not starting
  • Critical functionality broken

Quick Rollback (kubectl)

Method 1: Rollback to Previous Version

NAMESPACE="sh-acme-corp"

# Immediate rollback
kubectl rollout undo deployment/web -n "$NAMESPACE"

# Monitor rollback
kubectl rollout status deployment/web -n "$NAMESPACE"

# Verify
kubectl get pods -n "$NAMESPACE" -l app=web

Method 2: Rollback to Specific Revision

# Check rollout history
kubectl rollout history deployment/web -n "$NAMESPACE"

# Example output:
# REVISION  CHANGE-CAUSE
# 1         Initial deployment
# 2         Update to v1.2.2
# 3         Update to v1.2.3

# Rollback to specific revision
kubectl rollout undo deployment/web -n "$NAMESPACE" --to-revision=2

# Verify
kubectl rollout status deployment/web -n "$NAMESPACE"

Rollback Using Previous Package

Preferred method for critical production environments.

cd /home/claude/ge-bootstrap/tools

# Identify previous working package
ls -lt ../packages/ | grep acme-corp

# Deploy previous package
./deploy-package.sh --package ../packages/acme-corp-v1.2.2

# Monitor
kubectl rollout status deployment/web -n sh-acme-corp

Rollback Verification

NAMESPACE="sh-acme-corp"

# 1. Pods running with old image
kubectl get pods -n "$NAMESPACE" -o jsonpath='{.items[*].spec.containers[*].image}'
# Expected: Previous image version

# 2. Health checks passing
kubectl get pods -n "$NAMESPACE" -l app=web -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}'
# Expected: True for all pods

# 3. Application responding
curl -I https://acme-corp.hosting.growing-europe.com
# Expected: HTTP/2 200

# 4. No errors in logs
kubectl logs -n "$NAMESPACE" -l app=web --tail=50 | grep -i error
# Expected: No critical errors

Post-Rollback Actions

  1. Document Rollback

    # Add annotation to deployment
    kubectl annotate deployment web -n "$NAMESPACE" \
      rollback.timestamp="$(date -Iseconds)" \
      rollback.reason="Pods crashing on startup" \
      rollback.from-version="v1.2.3" \
      rollback.to-version="v1.2.2"

  2. Investigate Root Cause

    • Review deployment logs
    • Check application logs
    • Analyze pod events
    • Test in staging environment

  3. Notify Stakeholders

    • Inform team of rollback
    • Provide status update
    • Share investigation plan

  4. Plan Remediation

    • Fix issues in new version
    • Test thoroughly in staging
    • Schedule redeployment

Incident Response During Deployments

Deployment Failure Scenarios

Scenario 1: Pods Won't Start

Symptoms:

  • Pods in CrashLoopBackOff
  • ImagePullBackOff status

Response:

NAMESPACE="sh-acme-corp"

# 1. Pause rollout immediately
kubectl rollout pause deployment/web -n "$NAMESPACE"

# 2. Check pod status
kubectl get pods -n "$NAMESPACE" -l app=web

# 3. Describe problematic pod
kubectl describe pod <failing-pod> -n "$NAMESPACE"

# 4. Check logs
kubectl logs <failing-pod> -n "$NAMESPACE"

# 5. Common fixes:
# - ImagePullBackOff: Verify image exists
docker pull <image-name>

# - CrashLoopBackOff: Check application logs
kubectl logs <failing-pod> -n "$NAMESPACE" --previous

# 6. If unfixable, rollback
kubectl rollout undo deployment/web -n "$NAMESPACE"

Scenario 2: High Error Rate

Symptoms:

  • Increased 500 errors from Traefik
  • Application logging errors
  • Health checks intermittently failing

Response:

# 1. Check error rate
kubectl logs -n ge-ingress deploy/traefik | grep "acme-corp" | grep -E "500|502|503" | tail -20

# 2. Check application logs
kubectl logs -n "$NAMESPACE" -l app=web --tail=100 | grep -i error

# 3. If error rate unacceptable, rollback
kubectl rollout undo deployment/web -n "$NAMESPACE"

# 4. If tolerable, investigate and monitor
kubectl logs -n "$NAMESPACE" -l app=web -f

Scenario 3: Database Migration Failed

Symptoms:

  • Application logs show database errors
  • Migration script failed

Response:

# 1. Pause deployment
kubectl rollout pause deployment/web -n "$NAMESPACE"

# 2. Check migration status
# Access application pod
POD=$(kubectl get pod -n "$NAMESPACE" -l app=web -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it -n "$NAMESPACE" "$POD" -- /bin/sh

# 3. Manual migration rollback (if possible)
# Run migration rollback script

# 4. Restore database from backup (if necessary)
# Contact DBA or use backup tool

# 5. Rollback deployment
kubectl rollout undo deployment/web -n "$NAMESPACE"

Scenario 4: Secret Missing or Invalid

Symptoms:

  • Pods show error: "secret not found"
  • Application can't connect to external services

Response:

# 1. Pause rollout
kubectl rollout pause deployment/web -n "$NAMESPACE"

# 2. Check secrets
kubectl get secrets -n "$NAMESPACE"

# 3. Verify secret content
kubectl get secret ge-secrets -n "$NAMESPACE" -o jsonpath='{.data}' | jq 'keys'

# 4. Recreate missing secrets
vault kv get secret/clients/acme-corp

kubectl create secret generic ge-secrets \
  -n "$NAMESPACE" \
  --from-literal=api-key="..." \
  --dry-run=client -o yaml | kubectl apply -f -

# 5. Resume deployment
kubectl rollout resume deployment/web -n "$NAMESPACE"

# 6. If issues persist, rollback
kubectl rollout undo deployment/web -n "$NAMESPACE"

Escalation Procedures

Level 1: Self-Service (Operator)

  • Pause rollout
  • Check logs and events
  • Attempt quick fixes
  • Rollback if necessary

Level 2: Team Lead

  • Deployment repeatedly fails
  • Root cause unclear
  • Issue affects multiple clients

Level 3: Infrastructure Team

  • Cluster-wide issues
  • Network or ingress problems
  • Storage or node failures

Level 4: Emergency

  • Complete service outage
  • Data loss risk
  • Security breach


PodDisruptionBudget Explanation

What is PodDisruptionBudget (PDB)

A PDB ensures a minimum number of pods remain available during voluntary disruptions:

  • Node maintenance
  • Cluster upgrades
  • Deployment updates
  • Node draining

File: /home/claude/ge-bootstrap/k8s/templates/dedicated/pdb.yaml

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web
  namespace: ded-bigcorp
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: web

PDB Configurations

Configuration        Description                      Use Case
minAvailable: 1      At least 1 pod always running    Minimum availability
minAvailable: 2      At least 2 pods always running   Higher availability
maxUnavailable: 1    Only 1 pod can be down           Gradual updates
maxUnavailable: 50%  Half of pods can be down         Faster updates
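For a minAvailable PDB, the ALLOWED DISRUPTIONS value kubectl reports is simply the current healthy pod count minus the floor, clamped at zero. A shell sketch of that rule:

```shell
#!/bin/sh
# Voluntary disruptions currently permitted by a minAvailable PDB.
allowed_disruptions() {  # usage: allowed_disruptions <healthy_pods> <minAvailable>
  d=$(( $1 - $2 ))
  [ "$d" -lt 0 ] && d=0   # never negative
  echo "$d"
}

allowed_disruptions 3 1   # 2: two pods may be evicted right now
allowed_disruptions 1 1   # 0: a node drain would block on this PDB
```

When this value is 0, `kubectl drain` waits (or fails with a disruption budget error) until more pods become healthy.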

PDB and Rolling Updates

Without PDB (Shared Hosting):

  • Rolling update proceeds based on deployment strategy alone
  • No additional protection

With PDB (Dedicated Hosting):

  • Rolling update must respect PDB constraints
  • K8s ensures minAvailable pods are always ready
  • Update may take longer but guarantees availability

Example:

# Deployment: 3 replicas, maxUnavailable: 0
# PDB: minAvailable: 2

# Rolling update process:
# 1. Start with 3 old pods running
# 2. Create 1 new pod (maxSurge)
# 3. Wait for new pod ready (4 pods total)
# 4. Terminate 1 old pod (3 pods remain, 2 old + 1 new)
# 5. PDB ensures 2+ pods always available
# 6. Repeat until all pods updated

Checking PDB Status

# List PDBs
kubectl get pdb -n ded-bigcorp

# Example output:
# NAME   MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
# web    1               N/A               2                     5d

# Describe PDB
kubectl describe pdb web -n ded-bigcorp

# Key fields:
# - DisruptionsAllowed: How many pods can be terminated now
# - Current: Current number of healthy pods
# - Desired: Desired number of healthy pods (minAvailable)

PDB Best Practices

  1. Set appropriate minAvailable

    • Too high: updates may be blocked
    • Too low: insufficient availability protection

  2. Use with adequate replicas

    • A PDB with 1 replica is useless
    • Recommended: minAvailable = replicas - 1

  3. Monitor disruptions

    kubectl get pdb -n ded-bigcorp -w

  4. Test during maintenance

    # Drain a node to test PDB effectiveness
    kubectl drain <node-name> --ignore-daemonsets

Deployment Strategies Comparison

Feature          Rolling Update           Blue-Green                  Canary
Downtime         None                     None                        None
Resource Usage   +33-100% during update   Double during switch        +10-50% during rollout
Rollback Speed   Fast (30-60s)            Instant (service switch)    Fast (route change)
Risk             Low                      Very Low                    Very Low
Complexity       Low                      Medium                      High
Testing          Limited                  Full testing before switch  Gradual validation
Use Case         Standard deployments     Critical services           High-risk changes

When to Use Each Strategy

Rolling Update (Default):

  • Standard application updates
  • Low-risk changes
  • Resource-constrained environments
  • Shared hosting

Blue-Green:

  • Critical production services
  • Major version upgrades
  • When extensive testing is needed
  • Dedicated hosting with available capacity

Canary (Advanced):

  • High-risk deployments
  • A/B testing
  • Gradual feature rollout
  • Large-scale services
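Canary is only summarized in this runbook. One minimal approach, sketched under the assumption that the Service selects on `app: web` alone (so traffic splits by pod count), runs a small canary deployment next to the stable one; all names and the image tag below are illustrative:

```yaml
# Hypothetical canary: 1 canary pod alongside 3 stable pods receives
# roughly 25% of traffic, assuming the Service selector matches only
# `app: web` (not `version`).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-canary
  namespace: ded-bigcorp
  labels:
    app: web
    version: canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
      version: canary
  template:
    metadata:
      labels:
        app: web
        version: canary
    spec:
      containers:
        - name: nginx
          image: bigcorp/webapp:v2.1.0-rc1
```

Scaling `web-canary` up while scaling the stable deployment down shifts the ratio gradually; precise weighted routing would instead need ingress-level support (e.g. Traefik weighted services).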


Troubleshooting

Deployment Stuck in Progress

Symptoms:

kubectl rollout status deployment/web -n sh-acme-corp
# Output: Waiting for deployment "web" rollout to finish: 1 old replicas are pending termination...

Diagnosis:

# Check pod status
kubectl get pods -n sh-acme-corp -l app=web

# Check pod events
kubectl get events -n sh-acme-corp --sort-by='.lastTimestamp' | tail -20

# Check for PDB blocking termination
kubectl get pdb -n sh-acme-corp

Solutions:

  1. Old Pod Stuck in Terminating

    # Force delete pod
    kubectl delete pod <pod-name> -n sh-acme-corp --force --grace-period=0
    

  2. PDB Preventing Termination

    # Temporarily relax PDB
    kubectl patch pdb web -n sh-acme-corp -p '{"spec":{"minAvailable":0}}'
    
    # After deployment completes, restore PDB
    kubectl patch pdb web -n sh-acme-corp -p '{"spec":{"minAvailable":1}}'
    

  3. New Pod Not Ready

    # Check readiness probe
    kubectl describe pod <new-pod> -n sh-acme-corp | grep -A 10 "Readiness"
    
    # Check probe endpoint
    POD=$(kubectl get pod -n sh-acme-corp -l app=web -o jsonpath='{.items[0].metadata.name}')
    kubectl exec -n sh-acme-corp "$POD" -- wget -O- http://localhost:8080/health
    

Rollback Doesn't Fix Issue

Symptoms:

  • Rollback completed successfully
  • Application still not working

Diagnosis:

# Check if issue is environmental, not code-related
# 1. Secrets changed?
kubectl get secret ge-secrets -n sh-acme-corp -o yaml

# 2. ConfigMaps changed?
kubectl get cm -n sh-acme-corp

# 3. External dependencies down?
kubectl logs -n sh-acme-corp -l app=web | grep "connection refused"

# 4. Network policy issue?
kubectl get networkpolicy -n sh-acme-corp

Solutions:

  1. Restore Secrets

    vault kv get secret/clients/acme-corp
    # Recreate secrets with correct values
    

  2. Restore ConfigMaps

    # Restore from Git
    git checkout <previous-commit> -- k8s/clients/acme-corp/configmap.yaml
    kubectl apply -f k8s/clients/acme-corp/configmap.yaml
    

  3. Check External Dependencies

    # Test from pod
    kubectl run -it --rm debug --image=alpine -n sh-acme-corp -- sh
    # Inside pod:
    wget -O- https://external-api.example.com/health
    

Image Not Found

Symptoms:

ImagePullBackOff: failed to pull image "acme-corp/webapp:v1.2.3": not found

Solutions:

# 1. Verify image exists
docker pull acme-corp/webapp:v1.2.3

# 2. Check image name spelling
kubectl get deployment web -n sh-acme-corp -o jsonpath='{.spec.template.spec.containers[*].image}'

# 3. If private registry, check image pull secrets
kubectl get secret -n sh-acme-corp | grep docker

# 4. Create image pull secret if missing
kubectl create secret docker-registry regcred \
  -n sh-acme-corp \
  --docker-server=registry.example.com \
  --docker-username=<user> \
  --docker-password=<password>

# 5. Add to deployment
kubectl patch deployment web -n sh-acme-corp -p '{"spec":{"template":{"spec":{"imagePullSecrets":[{"name":"regcred"}]}}}}'


Deployment Checklist

Print and use this checklist for production deployments:

PRE-DEPLOYMENT
□ Package verified (verify-package.sh)
□ Dry run executed and reviewed
□ Current deployment is healthy
□ Nodes have available resources
□ Secrets validated in Vault
□ Database backup taken (if applicable)
□ Stakeholders notified
□ Rollback plan prepared

DEPLOYMENT
□ Deployment started
□ Rollout monitored (kubectl rollout status)
□ New pods created successfully
□ New pods passed readiness checks
□ Old pods terminated gracefully
□ Rollout completed successfully

POST-DEPLOYMENT (0-5 min)
□ All replicas ready
□ Pods are running without restarts
□ Health endpoints responding
□ Ingress routing working
□ No errors in logs

POST-DEPLOYMENT (5-30 min)
□ Performance metrics normal
□ Error rate within acceptable range
□ Dependencies connected
□ No user-reported issues

COMPLETION
□ Deployment documented
□ Team notified of success
□ Monitoring reviewed
□ Old package archived (if applicable)

This runbook is maintained by the GE Infrastructure Team. For updates or questions, contact the infrastructure lead or create an issue in the ge-ops repository.