
Kubernetes — Operations

OWNER: gerco (dev), thijmen (staging), rutger (production)
ALSO_USED_BY: arjan, alex, tjitte
LAST_VERIFIED: 2026-03-26
GE_STACK_VERSION: k3s v1.34.x (Zone 1), UpCloud Managed K8s (Zones 2+3)


Overview

Day-to-day Kubernetes operations in GE: rolling updates, rollbacks,
debugging, log viewing, resource monitoring, and node maintenance.
Most operations target Zone 1 (k3s dev) during development.
Operations in Zones 2 and 3 require explicit approval from thijmen/rutger.


Rolling Updates

GE Deployments use RollingUpdate strategy with maxUnavailable: 0
to guarantee zero-downtime deployments.

IF: deploying a new image
THEN: update the image tag on the Deployment and wait for the rollout
RUN: kubectl set image deployment/ge-{service} ge-{service}=ge-bootstrap-{service}:latest -n ge-{namespace}
RUN: kubectl rollout status deployment/ge-{service} -n ge-{namespace}

IF: the image tag is unchanged (e.g. always :latest)
THEN: kubectl set image is a no-op; force a restart instead
RUN: kubectl rollout restart deployment/ge-{service} -n ge-{namespace}

IF: deploying updated Python code to executor
THEN: ALWAYS rebuild the container image first
RUN: bash ge-ops/infrastructure/local/k3s/executor/build-executor.sh
RUN: kubectl rollout restart deployment/ge-executor -n ge-agents

CHECK: rollout status shows "successfully rolled out"
IF: rollout hangs
THEN: check pod events
RUN: kubectl describe pod -l app.kubernetes.io/name=ge-{service} -n ge-{namespace}
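The update-then-verify flow above can be sketched as one helper. This is a hypothetical script, not an existing GE tool; it assumes the ge-bootstrap-{service}:{tag} image naming convention and falls back to describing pods if the rollout does not complete.

```shell
#!/bin/sh
# Pure helper: construct the image reference for a GE service.
image_ref() {
  svc="$1"; tag="${2:-latest}"
  printf 'ge-bootstrap-%s:%s\n' "$svc" "$tag"
}

# Roll out a new image and wait for completion (requires cluster access).
deploy() {
  svc="$1"; ns="$2"; tag="${3:-latest}"
  kubectl set image "deployment/ge-$svc" "ge-$svc=$(image_ref "$svc" "$tag")" -n "$ns"
  # If the rollout hangs past the timeout, dump pod events for diagnosis.
  kubectl rollout status "deployment/ge-$svc" -n "$ns" --timeout=120s ||
    kubectl describe pod -l "app.kubernetes.io/name=ge-$svc" -n "$ns"
}

# Example (requires cluster access):
# deploy executor ge-agents v1.2.3
```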


Rollback

IF: new deployment is broken
THEN: rollback to previous revision
RUN: kubectl rollout undo deployment/ge-{service} -n ge-{namespace}

IF: need to rollback to a specific revision
THEN: check history first
RUN: kubectl rollout history deployment/ge-{service} -n ge-{namespace}
RUN: kubectl rollout undo deployment/ge-{service} --to-revision={N} -n ge-{namespace}

CHECK: after rollback, verify pods are healthy
RUN: kubectl get pods -l app.kubernetes.io/name=ge-{service} -n ge-{namespace}
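The three rollback steps can be combined into a sketch like the one below. The `rollback` and `undo_flag` helpers are hypothetical; when no revision is given, `kubectl rollout undo` reverts to the previous one.

```shell
#!/bin/sh
# Pure helper: build the optional --to-revision flag.
undo_flag() { [ -n "${1:-}" ] && printf '%s' "--to-revision=$1"; }

# Show history, undo, then verify pod health (requires cluster access).
rollback() {
  svc="$1"; ns="$2"; rev="${3:-}"
  kubectl rollout history "deployment/ge-$svc" -n "$ns"
  # shellcheck disable=SC2046  # undo_flag emits zero or one word
  kubectl rollout undo "deployment/ge-$svc" -n "$ns" $(undo_flag "$rev")
  kubectl get pods -l "app.kubernetes.io/name=ge-$svc" -n "$ns"
}

# Example: rollback executor ge-agents 3
```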


Exec into Pods

IF: need to debug a running pod
RUN: kubectl exec -it deployment/ge-{service} -n ge-{namespace} -- /bin/sh

IF: pod has multiple containers
THEN: specify the container
RUN: kubectl exec -it {pod-name} -c {container-name} -n ge-{namespace} -- /bin/sh

CHECK: exec is for debugging only — never use exec to modify running state
ANTI_PATTERN: using kubectl exec to install packages or patch code
FIX: rebuild the image and redeploy


Log Viewing

IF: checking current logs
RUN: kubectl logs deployment/ge-{service} -n ge-{namespace} --tail=100

IF: following logs in real time
RUN: kubectl logs deployment/ge-{service} -n ge-{namespace} -f --tail=50

IF: viewing logs from a crashed pod
RUN: kubectl logs {pod-name} -n ge-{namespace} --previous

IF: viewing logs across all pods of a Deployment
RUN: kubectl logs -l app.kubernetes.io/name=ge-{service} -n ge-{namespace} --tail=50

ANTI_PATTERN: leaving kubectl logs -f running indefinitely
FIX: use --tail=N and stop the stream when done; an abandoned -f keeps an API server connection open
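One way to avoid leaving -f running forever is to bound it with coreutils `timeout`. The `follow_for` wrapper is a hypothetical convenience, not an existing GE alias; the kubectl line in the example assumes the ge-executor deployment from earlier sections.

```shell
#!/bin/sh
# Follow a command's output for at most N seconds, then stop it.
follow_for() {
  secs="$1"; shift
  timeout "$secs" "$@"
}

# Example (requires cluster access): follow executor logs for 60 seconds
# follow_for 60 kubectl logs deployment/ge-executor -n ge-agents -f --tail=50
```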


Resource Monitoring

IF: checking resource usage per pod
RUN: kubectl top pods -n ge-{namespace}

IF: checking node-level resource usage
RUN: kubectl top nodes

IF: checking resource requests vs limits vs actual usage
RUN: kubectl describe node | grep -A 10 "Allocated resources"

CHECK: CPU utilisation stays below 80% on the node
CHECK: memory utilisation stays below 85% on the node
IF: approaching limits
THEN: review pod resource requests — some may be over-provisioned
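The 80% CPU / 85% memory checks above can be automated by parsing the `kubectl top nodes` table. This sketch assumes the standard column layout (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%); pipe the output in, e.g. `kubectl top nodes | check_node_headroom`.

```shell
#!/bin/sh
# Warn when any node exceeds the CPU/memory thresholds from the runbook.
# Reads a `kubectl top nodes` table on stdin; skips the header row.
check_node_headroom() {
  awk 'NR > 1 {
    cpu = $3 + 0; mem = $5 + 0   # awk numeric coercion strips the % sign
    if (cpu >= 80) printf "WARN %s CPU at %d%%\n", $1, cpu
    if (mem >= 85) printf "WARN %s memory at %d%%\n", $1, mem
  }'
}
```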


Pod Troubleshooting

IF: pod is in CrashLoopBackOff
THEN: check logs from the previous crash
RUN: kubectl logs {pod-name} -n ge-{namespace} --previous
RUN: kubectl describe pod {pod-name} -n ge-{namespace}

IF: pod is in Pending state
THEN: check events for scheduling failures
RUN: kubectl describe pod {pod-name} -n ge-{namespace} | tail -20

Common Pending causes:
- Insufficient CPU or memory on node
- PersistentVolumeClaim not bound
- Node selector or affinity mismatch
- ImagePullBackOff (image not found or auth failure)

IF: pod is in ImagePullBackOff
THEN: verify the image exists locally (Zone 1)
RUN: sudo k3s ctr images list | grep ge-bootstrap

IF: pod is OOMKilled
THEN: increase memory limits in the manifest
THEN: investigate the memory leak — OOMKilled is a symptom, not the root cause
READ_ALSO: wiki/docs/stack/kubernetes/pitfalls.md
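The troubleshooting table above maps cleanly to a small dispatcher: given a pod status from `kubectl get pods`, print the first diagnostic command to run. `triage_cmd` is a hypothetical helper for illustration.

```shell
#!/bin/sh
# Map a pod status to the first diagnostic step from this runbook.
triage_cmd() {
  pod="$1"; ns="$2"; status="$3"
  case "$status" in
    CrashLoopBackOff) echo "kubectl logs $pod -n $ns --previous" ;;
    Pending)          echo "kubectl describe pod $pod -n $ns | tail -20" ;;
    ImagePullBackOff) echo "sudo k3s ctr images list | grep ge-bootstrap" ;;
    OOMKilled)        echo "review memory limits and investigate the leak" ;;
    *)                echo "kubectl describe pod $pod -n $ns" ;;
  esac
}

# Example: triage_cmd ge-executor-abc12 ge-agents CrashLoopBackOff
```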


CronJob Operations

IF: checking CronJob status
RUN: kubectl get cronjobs -n ge-{namespace}

IF: manually triggering a CronJob
RUN: kubectl create job --from=cronjob/{cronjob-name} {manual-run-name} -n ge-{namespace}

IF: CronJob is suspended
THEN: check if it was intentionally suspended
RUN: kubectl get cronjob {name} -n ge-{namespace} -o jsonpath='{.spec.suspend}'

CHECK: all GE CronJobs are unsuspended and active (as of 2026-02-15)
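`kubectl create job` fails if the Job name already exists, so repeated manual triggers need unique names. A sketch, with a hypothetical timestamp-based naming helper:

```shell
#!/bin/sh
# Pure helper: unique, timestamped name for a manual CronJob run.
manual_job_name() {
  printf '%s-manual-%s' "$1" "$(date +%Y%m%d%H%M%S)"
}

# Trigger a CronJob manually (requires cluster access).
trigger_cronjob() {
  cj="$1"; ns="$2"
  kubectl create job --from="cronjob/$cj" "$(manual_job_name "$cj")" -n "$ns"
}

# Example: trigger_cronjob ge-nightly-sync ge-agents
```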


Health Check Infrastructure

The scripts/k8s-health-dump.sh script runs via host cron in Zone 1.
It collects cluster health data and writes to public/k8s-health.json.
The Admin UI infrastructure page reads this file.

IF: infrastructure page shows stale data
THEN: check the cron job on the host
RUN: crontab -l | grep k8s-health

ANTI_PATTERN: running health checks from inside pods in Zone 1
FIX: k3s ClusterIP is broken from inside pods — use host cron instead
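Staleness of public/k8s-health.json can be checked by file age before digging into the cron job. This is a sketch; the 10-minute threshold in the example is an assumption, and should match the actual cron interval.

```shell
#!/bin/sh
# Return success (0) if the file is missing or older than max_age_secs.
is_stale() {
  file="$1"; max_age_secs="$2"
  [ ! -f "$file" ] && return 0   # missing output counts as stale
  age=$(( $(date +%s) - $(stat -c %Y "$file") ))
  [ "$age" -gt "$max_age_secs" ]
}

# Example (assumed 10-minute threshold):
# is_stale public/k8s-health.json 600 && echo "health data is stale"
```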


Node Maintenance (Zone 1)

IF: performing maintenance on the k3s node
THEN: cordon the node first
RUN: kubectl cordon {node-name}
RUN: kubectl drain {node-name} --ignore-daemonsets --delete-emptydir-data

IF: maintenance is complete
RUN: kubectl uncordon {node-name}

CHECK: all pods rescheduled after uncordon
RUN: kubectl get pods --all-namespaces -o wide
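The cordon/drain/uncordon cycle can be wrapped with a dry-run mode that prints each command before anything touches the node. A hypothetical helper; the flags match the commands above.

```shell
#!/bin/sh
# Walk the maintenance cycle; default mode only prints the commands.
maintain() {
  node="$1"; mode="${2:-dry-run}"
  run() { if [ "$mode" = apply ]; then "$@"; else echo "DRY-RUN: $*"; fi; }
  run kubectl cordon "$node"
  run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  echo "perform maintenance, then:"
  run kubectl uncordon "$node"
}

# Example: preview first, then apply for real
# maintain k3s-node-1
# maintain k3s-node-1 apply
```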


Useful Aliases

alias k='kubectl'  
alias kgp='kubectl get pods'  
alias kgs='kubectl get svc'  
alias kgd='kubectl get deployments'  
alias kga='kubectl get all'  
alias kl='kubectl logs'  
alias ke='kubectl exec -it'  
alias kd='kubectl describe'  

Cross-References

READ_ALSO: wiki/docs/stack/kubernetes/index.md
READ_ALSO: wiki/docs/stack/kubernetes/manifests.md
READ_ALSO: wiki/docs/stack/kubernetes/pitfalls.md
READ_ALSO: wiki/docs/stack/kubernetes/checklist.md