Zero-Downtime Deployment Runbook¶
Last Updated: 2026-01-29 · Status: Active · Maintained by: GE Infrastructure Team · Estimated Time: 10-20 minutes per deployment
Overview¶
This runbook provides step-by-step procedures for deploying updates to client workloads without service interruption. Zero-downtime deployments are critical for maintaining service availability and meeting SLA commitments.
Deployment Strategies Covered:
- Rolling updates for shared hosting (default)
- Blue-green deployments for dedicated hosting (optional)
- Canary deployments for gradual rollouts (optional)
Table of Contents¶
- Rolling Update Strategy
- Blue-Green Deployment
- Pre-Deployment Checklist
- Deployment Procedure
- Post-Deployment Verification
- Rollback Procedures
- Incident Response During Deployments
- PodDisruptionBudget Explanation
- Deployment Strategies Comparison
- Troubleshooting
Rolling Update Strategy¶
Overview¶
Rolling updates gradually replace old pods with new pods, maintaining availability throughout the update process.
How It Works:
1. Create a new pod with the updated image/config
2. Wait for the new pod to become Ready
3. Terminate one old pod
4. Repeat until all pods are updated
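Step 2 is what makes this zero-downtime: the Service only routes traffic to pods that pass their readiness probe, so an accurate probe is a prerequisite. A sketch of the container-level probes this assumes (path, port, and timings are illustrative; match them to the actual application):

```yaml
# Illustrative readiness/liveness probes (container spec fragment).
# The /health endpoint on port 8080 matches the health checks used
# elsewhere in this runbook, but verify against your app.
containers:
  - name: nginx
    image: acme-corp/webapp:v1.2.3
    readinessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 5
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 10
```

Without a readiness probe, Kubernetes marks new pods Ready as soon as the container starts, and the rolling update can route traffic to pods that are not yet serving.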
Configuration¶
Shared Hosting Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: sh-acme-corp
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Allow 1 extra pod during update
      maxUnavailable: 0  # Ensure all replicas available
Dedicated Hosting Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: ded-bigcorp
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 100%     # Double capacity during update
      maxUnavailable: 0  # Ensure all replicas available
Strategy Parameters¶
| Parameter | Description | Shared Hosting | Dedicated Hosting |
|---|---|---|---|
| maxSurge | Extra pods during update | 1 (33-50%) | 100% (double capacity) |
| maxUnavailable | Pods that can be down | 0 (always available) | 0 (always available) |
maxSurge Examples:
- Shared (1 replica), maxSurge: 1
  - During update: 2 pods total (1 old + 1 new)
  - After new pod ready: 1 pod (old terminated)
- Shared (2 replicas), maxSurge: 1
  - During update: 3 pods total (2 old + 1 new)
  - Rolling: gradually replace old with new
- Dedicated (3 replicas), maxSurge: 100%
  - During update: 6 pods total (3 old + 3 new)
  - All new pods start before any old pods terminate
  - Maximum capacity during update
maxUnavailable = 0:
- Always maintains the desired replica count or higher
- No service degradation during update
- Critical for the zero-downtime requirement
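The pod counts above follow directly from replicas plus surge, where a percentage surge is taken against the replica count and rounded up. A quick sketch of the arithmetic (the `peak_pods` helper is illustrative, not part of the platform tooling):

```shell
# Hypothetical helper: peak pod count during a rolling update.
# Mirrors the Kubernetes rule: a bare number is absolute,
# "N%" is a percentage of replicas, rounded up.
peak_pods() {
  local replicas=$1 surge=$2
  if [[ "$surge" == *% ]]; then
    local pct=${surge%\%}
    # ceiling division: (replicas * pct + 99) / 100
    echo $(( replicas + (replicas * pct + 99) / 100 ))
  else
    echo $(( replicas + surge ))
  fi
}

peak_pods 2 1      # shared hosting example: prints 3
peak_pods 3 100%   # dedicated hosting example: prints 6
```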
Update Timeline¶
Shared Hosting (2 replicas):
sequenceDiagram
participant K8s
participant OldPod1
participant OldPod2
participant NewPod1
participant NewPod2
Note over K8s: Start rolling update
K8s->>NewPod1: Create (maxSurge +1)
Note over NewPod1: Starting...
NewPod1->>K8s: Ready
K8s->>OldPod1: Terminate
Note over OldPod1: Shutting down
K8s->>NewPod2: Create
NewPod2->>K8s: Ready
K8s->>OldPod2: Terminate
Note over K8s: Update complete
Duration: 1-3 minutes (depends on startup time)
Dedicated Hosting (3 replicas, maxSurge 100%):
sequenceDiagram
participant K8s
participant Old as Old Pods (3)
participant New as New Pods (3)
Note over K8s: Start rolling update
K8s->>New: Create all 3 new pods (maxSurge 100%)
Note over New: Starting...
New->>K8s: All ready
Note over K8s,New: 6 pods serving traffic
K8s->>Old: Terminate all old pods
Note over Old: Graceful shutdown (30s)
Note over K8s: Update complete
Duration: 30-60 seconds (parallel startup + graceful shutdown)
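The 30-second graceful shutdown in the timeline comes from the pod's termination grace period. A common zero-downtime refinement is a short preStop sleep, so endpoint controllers remove the pod from the Service before the container receives SIGTERM. A sketch (values are illustrative, not prescribed by this runbook):

```yaml
# Illustrative graceful-shutdown settings (pod spec fragment)
spec:
  terminationGracePeriodSeconds: 30
  containers:
    - name: nginx
      lifecycle:
        preStop:
          exec:
            # Give load balancers/endpoint controllers time to stop
            # routing to this pod before shutdown begins.
            command: ["sleep", "5"]
```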
Blue-Green Deployment¶
Overview¶
Blue-green deployments maintain two identical environments (blue = current, green = new). Traffic switches instantly after the green environment is validated.
Use Case: Dedicated hosting with critical uptime requirements
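The instant switch works because the Service selects pods by a version label. A sketch of the assumed Service (names taken from the runbook; ports are illustrative):

```yaml
# Assumed Service backing the blue-green switch: the version label
# in the selector is what Step 4 patches from blue to green.
apiVersion: v1
kind: Service
metadata:
  name: web
  namespace: ded-bigcorp
spec:
  selector:
    app: web
    version: blue   # patched to "green" during cutover
  ports:
    - port: 80
      targetPort: 80
```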
Procedure¶
Step 1: Deploy Green Environment
# Create green deployment (new version)
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-green
  namespace: ded-bigcorp
  labels:
    app: web
    version: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
      version: green
  template:
    metadata:
      labels:
        app: web
        version: green
    spec:
      containers:
        - name: nginx
          image: bigcorp/webapp:v2.0.0
          # ... rest of spec
EOF
Step 2: Wait for Green to be Ready
# Wait for all green replicas to become Ready
kubectl rollout status deployment/web-green -n ded-bigcorp
kubectl get pods -n ded-bigcorp -l version=green
Step 3: Test Green Environment
# Port-forward to green pods for testing
kubectl port-forward -n ded-bigcorp deploy/web-green 8080:80
# Test in browser or with curl
curl http://localhost:8080/health
curl http://localhost:8080/api/version
Step 4: Switch Traffic to Green
# Update service selector to point to green
kubectl patch service web -n ded-bigcorp -p '{"spec":{"selector":{"version":"green"}}}'
Step 5: Verify Traffic Switch
# Check service endpoints
kubectl get endpoints web -n ded-bigcorp
# Should show green pod IPs
# Test production URL
curl https://bigcorp.hosting.growing-europe.com/health
Step 6: Monitor Green
# Watch pods
kubectl get pods -n ded-bigcorp -l version=green -w
# Monitor logs
kubectl logs -n ded-bigcorp -l version=green --tail=50 -f
# Wait 10-15 minutes, monitor for errors
Step 7: Remove Blue (Old)
# After successful verification, remove old deployment
kubectl delete deployment web-blue -n ded-bigcorp
# Or keep for quick rollback
# Delete after longer observation period (1 hour)
Blue-Green Rollback¶
If issues are found after switching:
# Instantly switch back to blue
kubectl patch service web -n ded-bigcorp -p '{"spec":{"selector":{"version":"blue"}}}'
# Verify rollback
kubectl get endpoints web -n ded-bigcorp
curl https://bigcorp.hosting.growing-europe.com/health
Pre-Deployment Checklist¶
Use this checklist before every deployment:
Health Check Verification¶
CLIENT_NAME="acme-corp"
NAMESPACE="sh-${CLIENT_NAME}"
# 1. Current deployment is healthy
kubectl get deployment web -n "$NAMESPACE"
# Verify: READY shows X/X (all replicas ready)
# 2. Pods are running
kubectl get pods -n "$NAMESPACE"
# Verify: All pods show Running, no CrashLoopBackOff
# 3. Health endpoints responding
POD=$(kubectl get pod -n "$NAMESPACE" -l app=web -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n "$NAMESPACE" "$POD" -- wget -O- http://localhost:8080/health
# Verify: Returns 200 OK
# 4. No recent errors in logs
kubectl logs -n "$NAMESPACE" -l app=web --tail=100 | grep -i error
# Verify: No critical errors
Resource Availability¶
# 1. Node capacity
kubectl top nodes
# Verify: Nodes have available CPU/memory
# 2. Current resource usage
kubectl top pods -n "$NAMESPACE"
# Verify: Pods not hitting limits
# 3. Storage availability
kubectl get pvc -n "$NAMESPACE"
# Verify: No PVCs in Pending state
# 4. Image availability
IMAGE="acme-corp/webapp:v1.2.3"
docker pull "$IMAGE"
# Verify: Image can be pulled
Database Migrations (If Applicable)¶
# 1. Check if migrations are needed
# Review application changelog
# 2. Test migrations in staging
# Run migration scripts in staging environment
# 3. Backup production database
# Create database backup before deployment
# 4. Plan migration execution
# Decide: before deployment, during, or after?
# Ensure migrations are backward compatible
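One way to decouple migration execution from the rollout (a sketch only; the image tag, `./migrate` entrypoint, and Job name are assumptions, not platform conventions) is a Kubernetes Job that must complete before the new version is deployed:

```yaml
# Hypothetical pre-deployment migration Job: run it, verify it
# completes, then proceed with the application rollout.
apiVersion: batch/v1
kind: Job
metadata:
  name: webapp-migrate-v1-2-3
  namespace: sh-acme-corp
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: acme-corp/webapp:v1.2.3
          command: ["./migrate", "--forward-only"]   # hypothetical entrypoint
          envFrom:
            - secretRef:
                name: ge-secrets
```

Because the old pods keep serving during the rollout, the migration must be backward compatible, as the checklist above requires.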
Secret Validation¶
# 1. Secrets exist in namespace
kubectl get secrets -n "$NAMESPACE"
# Verify: ge-secrets exists
# 2. Secrets have required keys
kubectl get secret ge-secrets -n "$NAMESPACE" -o jsonpath='{.data}' | jq 'keys'
# Verify: Contains required keys (redis-password, api-key, etc.)
# 3. Secrets are in Vault
vault kv get secret/clients/acme-corp
# Verify: Vault contains all required secrets
Network Connectivity¶
# 1. Ingress is healthy
kubectl get ingress web -n "$NAMESPACE"
# Verify: Address is populated
# 2. DNS resolves
dig acme-corp.hosting.growing-europe.com
# Verify: Points to correct IP
# 3. SSL certificate is valid
echo | openssl s_client -servername acme-corp.hosting.growing-europe.com \
-connect acme-corp.hosting.growing-europe.com:443 2>/dev/null | \
openssl x509 -noout -dates
# Verify: Not expired
# 4. Traefik is routing
kubectl logs -n ge-ingress deploy/traefik | grep acme-corp | tail -5
# Verify: Recent traffic logs
Communication¶
- [ ] Notify stakeholders of deployment window
- [ ] Schedule deployment during low-traffic period (if possible)
- [ ] Prepare rollback plan
- [ ] Have incident response team on standby (for critical deployments)
Deployment Procedure¶
Using Immutable Packages¶
Recommended method for production deployments.
Step 1: Verify Package
cd /home/claude/ge-bootstrap/tools
# Verify package integrity
./verify-package.sh ../packages/acme-corp-v1.2.3
Step 2: Dry Run
# Preview changes without applying
./deploy-package.sh \
--package ../packages/acme-corp-v1.2.3 \
--dry-run
Step 3: Deploy
# Deploy to cluster
./deploy-package.sh --package ../packages/acme-corp-v1.2.3
# Deployment prompts for confirmation:
# Proceed with deployment? (y/N)
Step 4: Monitor Rollout
NAMESPACE="sh-acme-corp"
# Watch rollout status
kubectl rollout status deployment/web -n "$NAMESPACE"
# Watch pods updating
kubectl get pods -n "$NAMESPACE" -w
Expected Output:
NAME READY STATUS RESTARTS AGE
web-7d4f8b-old 1/1 Running 0 5m
web-9f2k1a-new 0/1 ContainerCreating 0 5s
# After new pod ready:
web-7d4f8b-old 1/1 Running 0 5m
web-9f2k1a-new 1/1 Running 0 20s
# Old pod terminating:
web-7d4f8b-old 1/1 Terminating 0 5m
web-9f2k1a-new 1/1 Running 0 30s
# Update complete:
web-9f2k1a-new 1/1 Running 0 45s
Using kubectl (Manual)¶
Step 1: Update Image
NAMESPACE="sh-acme-corp"
IMAGE="acme-corp/webapp:v1.2.3"
# Update deployment image
kubectl set image deployment/web \
nginx="$IMAGE" \
-n "$NAMESPACE"
Step 2: Monitor Rollout
# Watch rollout
kubectl rollout status deployment/web -n "$NAMESPACE"
# If issues, pause rollout
kubectl rollout pause deployment/web -n "$NAMESPACE"
# Fix issues, then resume
kubectl rollout resume deployment/web -n "$NAMESPACE"
Using Kustomize (GitOps)¶
Step 1: Update Kustomization
cd /home/claude/ge-bootstrap/k8s/clients/acme-corp
# Update image in kustomization.yaml
kustomize edit set image nginx=acme-corp/webapp:v1.2.3
Step 2: Apply Changes
# Apply the updated kustomization
kubectl apply -k .
Step 3: Monitor
kubectl rollout status deployment/web -n sh-acme-corp
Post-Deployment Verification¶
Immediate Checks (0-5 minutes)¶
1. Deployment Status
NAMESPACE="sh-acme-corp"
# Check deployment condition
kubectl get deployment web -n "$NAMESPACE" -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'
# Expected: True
# Check replica count
kubectl get deployment web -n "$NAMESPACE" -o jsonpath='{.status.readyReplicas}/{.spec.replicas}'
# Expected: 2/2 (or configured replica count)
2. Pod Health
# All pods running
kubectl get pods -n "$NAMESPACE" -l app=web
# Verify: All show Running, READY 1/1
# No restarts
kubectl get pods -n "$NAMESPACE" -l app=web -o jsonpath='{.items[*].status.containerStatuses[*].restartCount}'
# Expected: 0 or low number
# Check events
kubectl get events -n "$NAMESPACE" --sort-by='.lastTimestamp' | tail -10
# Verify: No error events
3. Application Health
# Test health endpoint
POD=$(kubectl get pod -n "$NAMESPACE" -l app=web -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n "$NAMESPACE" "$POD" -- wget -qO- http://localhost:8080/health
# Expected: {"status": "healthy"} or similar
# Check application logs
kubectl logs -n "$NAMESPACE" -l app=web --tail=50
# Verify: No errors, successful startup messages
4. Ingress Routing
# Test external access
curl -I https://acme-corp.hosting.growing-europe.com
# Expected: HTTP/2 200
# Test full page load
curl -s https://acme-corp.hosting.growing-europe.com | head -20
# Verify: Correct content returned
Extended Monitoring (5-30 minutes)¶
1. Performance Metrics
# CPU and memory usage
kubectl top pods -n "$NAMESPACE"
# Verify: Within expected ranges
# Request rate (from Traefik)
kubectl logs -n ge-ingress deploy/traefik | grep acme-corp | tail -20
# Verify: 200 responses, no 5xx errors
2. Error Rate
# Application errors
kubectl logs -n "$NAMESPACE" -l app=web --tail=500 | grep -i error
# Verify: No new errors
# Container restarts
kubectl get pods -n "$NAMESPACE" -l app=web -o jsonpath='{.items[*].status.containerStatuses[*].restartCount}'
# Verify: No increase in restarts
3. Dependency Connectivity
# If app uses Redis
POD=$(kubectl get pod -n "$NAMESPACE" -l app=web -o jsonpath='{.items[0].metadata.name}')
kubectl exec -n "$NAMESPACE" "$POD" -- nc -zv redis.ge-system.svc.cluster.local 6381
# Expected: Connection successful
# If app uses external APIs
kubectl logs -n "$NAMESPACE" -l app=web | grep "API connection"
# Verify: Successful connections
4. User Impact Check
# Check for error reports (from monitoring/alerting)
# Check user-facing metrics (response time, error rate)
# Review support tickets (if any)
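The error-rate part of the check above can be rough-cut from the Traefik access logs already being tailed. A hypothetical helper (assumes one request per line with the HTTP status as a whitespace-separated field; not part of the platform tooling):

```shell
# Hypothetical helper: integer 5xx percentage from access-log lines
# on stdin. A real setup would use metrics, not log grepping.
error_rate() {
  awk '{ total++ }
       $0 ~ / 5[0-9][0-9]($| )/ { errors++ }
       END { if (total) printf "%d\n", 100 * errors / total; else print 0 }'
}

# Example with fake log lines: 1 error out of 4 requests -> prints 25
printf '%s\n' 'GET /health 200' 'GET /api 500' 'GET / 200' 'GET / 200' | error_rate
```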
Rollback Procedures¶
When to Rollback¶
Rollback immediately if:
- Pods are crashing (CrashLoopBackOff)
- Health checks failing consistently
- Error rate significantly increased
- Application not starting
- Critical functionality broken
Quick Rollback (kubectl)¶
Method 1: Rollback to Previous Version
NAMESPACE="sh-acme-corp"
# Immediate rollback
kubectl rollout undo deployment/web -n "$NAMESPACE"
# Monitor rollback
kubectl rollout status deployment/web -n "$NAMESPACE"
# Verify
kubectl get pods -n "$NAMESPACE" -l app=web
Method 2: Rollback to Specific Revision
# Check rollout history
kubectl rollout history deployment/web -n "$NAMESPACE"
# Example output:
# REVISION CHANGE-CAUSE
# 1 Initial deployment
# 2 Update to v1.2.2
# 3 Update to v1.2.3
# Rollback to specific revision
kubectl rollout undo deployment/web -n "$NAMESPACE" --to-revision=2
# Verify
kubectl rollout status deployment/web -n "$NAMESPACE"
Rollback Using Previous Package¶
Preferred method for critical production environments.
cd /home/claude/ge-bootstrap/tools
# Identify previous working package
ls -lt ../packages/ | grep acme-corp
# Deploy previous package
./deploy-package.sh --package ../packages/acme-corp-v1.2.2
# Monitor
kubectl rollout status deployment/web -n sh-acme-corp
Rollback Verification¶
NAMESPACE="sh-acme-corp"
# 1. Pods running with old image
kubectl get pods -n "$NAMESPACE" -o jsonpath='{.items[*].spec.containers[*].image}'
# Expected: Previous image version
# 2. Health checks passing
kubectl get pods -n "$NAMESPACE" -l app=web -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}'
# Expected: True for all pods
# 3. Application responding
curl -I https://acme-corp.hosting.growing-europe.com
# Expected: HTTP/2 200
# 4. No errors in logs
kubectl logs -n "$NAMESPACE" -l app=web --tail=50 | grep -i error
# Expected: No critical errors
Post-Rollback Actions¶
1. Document Rollback
2. Investigate Root Cause
   - Review deployment logs
   - Check application logs
   - Analyze pod events
   - Test in staging environment
3. Notify Stakeholders
   - Inform team of rollback
   - Provide status update
   - Share investigation plan
4. Plan Remediation
   - Fix issues in new version
   - Test thoroughly in staging
   - Schedule redeployment
Incident Response During Deployments¶
Deployment Failure Scenarios¶
Scenario 1: Pods Won't Start
Symptoms:
- Pods in CrashLoopBackOff
- ImagePullBackOff status
Response:
NAMESPACE="sh-acme-corp"
# 1. Pause rollout immediately
kubectl rollout pause deployment/web -n "$NAMESPACE"
# 2. Check pod status
kubectl get pods -n "$NAMESPACE" -l app=web
# 3. Describe problematic pod
kubectl describe pod <failing-pod> -n "$NAMESPACE"
# 4. Check logs
kubectl logs <failing-pod> -n "$NAMESPACE"
# 5. Common fixes:
# - ImagePullBackOff: Verify image exists
docker pull <image-name>
# - CrashLoopBackOff: Check application logs
kubectl logs <failing-pod> -n "$NAMESPACE" --previous
# 6. If unfixable, rollback
kubectl rollout undo deployment/web -n "$NAMESPACE"
Scenario 2: High Error Rate
Symptoms:
- Increased 500 errors from Traefik
- Application logging errors
- Health checks intermittently failing
Response:
# 1. Check error rate
kubectl logs -n ge-ingress deploy/traefik | grep "acme-corp" | grep -E "500|502|503" | tail -20
# 2. Check application logs
kubectl logs -n "$NAMESPACE" -l app=web --tail=100 | grep -i error
# 3. If error rate unacceptable, rollback
kubectl rollout undo deployment/web -n "$NAMESPACE"
# 4. If tolerable, investigate and monitor
kubectl logs -n "$NAMESPACE" -l app=web -f
Scenario 3: Database Migration Failed
Symptoms:
- Application logs show database errors
- Migration script failed
Response:
# 1. Pause deployment
kubectl rollout pause deployment/web -n "$NAMESPACE"
# 2. Check migration status
# Access application pod
POD=$(kubectl get pod -n "$NAMESPACE" -l app=web -o jsonpath='{.items[0].metadata.name}')
kubectl exec -it -n "$NAMESPACE" "$POD" -- /bin/sh
# 3. Manual migration rollback (if possible)
# Run migration rollback script
# 4. Restore database from backup (if necessary)
# Contact DBA or use backup tool
# 5. Rollback deployment
kubectl rollout undo deployment/web -n "$NAMESPACE"
Scenario 4: Secret Missing or Invalid
Symptoms:
- Pods show error: "secret not found"
- Application can't connect to external services
Response:
# 1. Pause rollout
kubectl rollout pause deployment/web -n "$NAMESPACE"
# 2. Check secrets
kubectl get secrets -n "$NAMESPACE"
# 3. Verify secret content
kubectl get secret ge-secrets -n "$NAMESPACE" -o jsonpath='{.data}' | jq 'keys'
# 4. Recreate missing secrets
vault kv get secret/clients/acme-corp
kubectl create secret generic ge-secrets \
-n "$NAMESPACE" \
--from-literal=api-key="..." \
--dry-run=client -o yaml | kubectl apply -f -
# 5. Resume deployment
kubectl rollout resume deployment/web -n "$NAMESPACE"
# 6. If issues persist, rollback
kubectl rollout undo deployment/web -n "$NAMESPACE"
Escalation Procedures¶
Level 1: Self-Service (Operator)
- Pause rollout
- Check logs and events
- Attempt quick fixes
- Rollback if necessary

Level 2: Team Lead
- If deployment repeatedly fails
- If root cause unclear
- If issue affects multiple clients

Level 3: Infrastructure Team
- Cluster-wide issues
- Network or ingress problems
- Storage or node failures

Level 4: Emergency
- Complete service outage
- Data loss risk
- Security breach
PodDisruptionBudget Explanation¶
What is PodDisruptionBudget (PDB)¶
A PDB ensures a minimum number of pods remain available during voluntary disruptions:
- Node maintenance
- Cluster upgrades
- Deployment updates
- Node draining
File: /home/claude/ge-bootstrap/k8s/templates/dedicated/pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web
  namespace: ded-bigcorp
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: web
PDB Configurations¶
| Configuration | Description | Use Case |
|---|---|---|
| minAvailable: 1 | At least 1 pod always running | Minimum availability |
| minAvailable: 2 | At least 2 pods always running | Higher availability |
| maxUnavailable: 1 | Only 1 pod can be down | Gradual updates |
| maxUnavailable: 50% | Half of pods can be down | Faster updates |
PDB and Rolling Updates¶
Without PDB (Shared Hosting):
- Rolling update proceeds based on deployment strategy
- No additional protection
With PDB (Dedicated Hosting):
- Rolling update must respect PDB constraints
- K8s ensures minAvailable pods are always ready
- Update may take longer but guarantees availability
Example:
# Deployment: 3 replicas, maxUnavailable: 0
# PDB: minAvailable: 2
# Rolling update process:
# 1. Start with 3 old pods running
# 2. Create 1 new pod (maxSurge)
# 3. Wait for new pod ready (4 pods total)
# 4. Terminate 1 old pod (3 pods remain, 2 old + 1 new)
# 5. PDB ensures 2+ pods always available
# 6. Repeat until all pods updated
Checking PDB Status¶
# List PDBs
kubectl get pdb -n ded-bigcorp
# Example output:
# NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
# web 1 N/A 2 5d
# Describe PDB
kubectl describe pdb web -n ded-bigcorp
# Key fields:
# - DisruptionsAllowed: How many pods can be terminated now
# - Current: Current number of healthy pods
# - Desired: Desired number of healthy pods (minAvailable)
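DisruptionsAllowed is essentially healthy pods minus minAvailable, floored at zero; in the sample output above, minAvailable 1 with 3 healthy pods allows 2 disruptions. A sketch of the arithmetic (the helper function is illustrative only):

```shell
# Hypothetical helper: disruptions allowed under a minAvailable PDB.
allowed_disruptions() {
  local healthy=$1 min_available=$2
  local allowed=$(( healthy - min_available ))
  # A PDB never allows "negative" disruptions
  if (( allowed < 0 )); then allowed=0; fi
  echo "$allowed"
}

allowed_disruptions 3 1   # matches the sample output above: prints 2
```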
PDB Best Practices¶
- Set appropriate minAvailable
- Too high: Updates may be blocked
-
Too low: Insufficient availability protection
-
Use with adequate replicas
- PDB with 1 replica is useless
-
Recommended: minAvailable = replicas - 1
-
Monitor disruptions
-
Test during maintenance
- Drain a node to test PDB effectiveness
Deployment Strategies Comparison¶
| Feature | Rolling Update | Blue-Green | Canary |
|---|---|---|---|
| Downtime | None | None | None |
| Resource Usage | +33-100% during update | Double during switch | +10-50% during rollout |
| Rollback Speed | Fast (30-60s) | Instant (service switch) | Fast (route change) |
| Risk | Low | Very Low | Very Low |
| Complexity | Low | Medium | High |
| Testing | Limited | Full testing before switch | Gradual validation |
| Use Case | Standard deployments | Critical services | High-risk changes |
When to Use Each Strategy¶
Rolling Update (Default):
- Standard application updates
- Low-risk changes
- Resource-constrained environments
- Shared hosting

Blue-Green:
- Critical production services
- Major version upgrades
- When extensive testing is needed
- Dedicated hosting with available capacity

Canary (Advanced):
- High-risk deployments
- A/B testing
- Gradual feature rollout
- Large-scale services
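This runbook does not prescribe a canary procedure. One minimal sketch, assuming the existing Service selects on `app: web` only, runs a single canary replica of the candidate image beside the stable deployment, so roughly 1/(N+1) of traffic reaches it:

```yaml
# Minimal canary sketch (assumption: the stable Service selector is
# just app=web, so traffic splits by replica ratio). A production
# canary would normally use weighted routing in the ingress instead.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-canary
  namespace: sh-acme-corp
  labels:
    app: web
    track: canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
      track: canary
  template:
    metadata:
      labels:
        app: web
        track: canary
    spec:
      containers:
        - name: nginx
          image: acme-corp/webapp:v1.2.3   # candidate version
```

Promote by updating the stable deployment and deleting `web-canary`; roll back by deleting `web-canary` alone.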
Troubleshooting¶
Deployment Stuck in Progress¶
Symptoms:
kubectl rollout status deployment/web -n sh-acme-corp
# Output: Waiting for deployment "web" rollout to finish: 1 old replicas are pending termination...
Diagnosis:
# Check pod status
kubectl get pods -n sh-acme-corp -l app=web
# Check pod events
kubectl get events -n sh-acme-corp --sort-by='.lastTimestamp' | tail -20
# Check for PDB blocking termination
kubectl get pdb -n sh-acme-corp
Solutions:
1. Old Pod Stuck in Terminating
# Force-delete only as a last resort (skips graceful shutdown)
kubectl delete pod <stuck-pod> -n sh-acme-corp --grace-period=0 --force
2. PDB Preventing Termination
# Check allowed disruptions; scale up or relax the PDB if it blocks the update
kubectl get pdb -n sh-acme-corp
3. New Pod Not Ready
# Inspect readiness probe failures and events
kubectl describe pod <new-pod> -n sh-acme-corp
Rollback Doesn't Fix Issue¶
Symptoms: - Rollback completed successfully - Application still not working
Diagnosis:
# Check if issue is environmental, not code-related
# 1. Secrets changed?
kubectl get secret ge-secrets -n sh-acme-corp -o yaml
# 2. ConfigMaps changed?
kubectl get cm -n sh-acme-corp
# 3. External dependencies down?
kubectl logs -n sh-acme-corp -l app=web | grep "connection refused"
# 4. Network policy issue?
kubectl get networkpolicy -n sh-acme-corp
Solutions:
1. Restore Secrets (re-sync from Vault; see Secret Validation above)
2. Restore ConfigMaps (re-apply from the deployment package)
3. Check External Dependencies (Redis, databases, external APIs)
Image Not Found¶
Symptoms:
- Pods stuck in ErrImagePull or ImagePullBackOff
Solutions:
# 1. Verify image exists
docker pull acme-corp/webapp:v1.2.3
# 2. Check image name spelling
kubectl get deployment web -n sh-acme-corp -o jsonpath='{.spec.template.spec.containers[*].image}'
# 3. If private registry, check image pull secrets
kubectl get secret -n sh-acme-corp | grep docker
# 4. Create image pull secret if missing
kubectl create secret docker-registry regcred \
-n sh-acme-corp \
--docker-server=registry.example.com \
--docker-username=<user> \
--docker-password=<password>
# 5. Add to deployment
kubectl patch deployment web -n sh-acme-corp -p '{"spec":{"template":{"spec":{"imagePullSecrets":[{"name":"regcred"}]}}}}'
Related Documentation¶
- Architecture Overview - Hosting architecture
- Deployment Packages - Immutable packages
- Client Onboarding - Creating clients
- Platform Startup - Platform management
Deployment Checklist¶
Print and use this checklist for production deployments:
PRE-DEPLOYMENT
□ Package verified (verify-package.sh)
□ Dry run executed and reviewed
□ Current deployment is healthy
□ Nodes have available resources
□ Secrets validated in Vault
□ Database backup taken (if applicable)
□ Stakeholders notified
□ Rollback plan prepared
DEPLOYMENT
□ Deployment started
□ Rollout monitored (kubectl rollout status)
□ New pods created successfully
□ New pods passed readiness checks
□ Old pods terminated gracefully
□ Rollout completed successfully
POST-DEPLOYMENT (0-5 min)
□ All replicas ready
□ Pods are running without restarts
□ Health endpoints responding
□ Ingress routing working
□ No errors in logs
POST-DEPLOYMENT (5-30 min)
□ Performance metrics normal
□ Error rate within acceptable range
□ Dependencies connected
□ No user-reported issues
COMPLETION
□ Deployment documented
□ Team notified of success
□ Monitoring reviewed
□ Old package archived (if applicable)
This runbook is maintained by the GE Infrastructure Team. For updates or questions, contact the infrastructure lead or create an issue in the ge-ops repository.