DOMAIN:INFRASTRUCTURE:KUBERNETES_OPERATIONS¶
OWNER: gerco (Zone 1), thijmen (Zone 2), rutger (Zone 3) UPDATED: 2026-03-24 SCOPE: all k8s operations across GE's three-zone architecture AGENTS: gerco (k3s sysadmin), thijmen (UpCloud MKE staging), rutger (UpCloud MKE production), arjan (cluster provisioning), leon (deploy coordination)
K8S:ZONE_ARCHITECTURE¶
ZONE_1: k3s on Minisforum 790 Pro (fort-knox-dev)¶
OWNER: gerco RUNTIME: k3s (lightweight single-node Kubernetes) HOST: Minisforum 790 Pro — AMD Ryzen 9 7940HS, 64GB DDR5, 2TB NVMe KUBECONFIG: /etc/rancher/k3s/k3s.yaml API_ENDPOINT: https://127.0.0.1:6443 PURPOSE: development environment, agent execution, GE platform itself INGRESS: Traefik (bundled with k3s, custom config in ge-ingress namespace)
CRITICAL_PITFALL: k3s ClusterIP (10.43.0.1:443) is BROKEN from inside pods — connection refused FIX: use service DNS names instead (e.g., kubernetes.default.svc.cluster.local) PITFALL: do NOT use hostNetwork: true — causes port conflicts on rolling updates
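A minimal illustration of the DNS-name fix above, as a pod env fragment (the variable name KUBERNETES_API_URL is an assumption, not a GE convention):

```yaml
# Sketch: point in-pod clients at the API server via service DNS,
# not the broken ClusterIP (10.43.0.1:443). Variable name is assumed.
env:
  - name: KUBERNETES_API_URL
    value: "https://kubernetes.default.svc.cluster.local:443"
```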
ZONE_2: UpCloud Managed Kubernetes (staging)¶
OWNER: thijmen RUNTIME: UpCloud Managed Kubernetes Engine (MKE) LOCATION: de-fra1 (Frankfurt, EU) PURPOSE: client staging environments, pre-production validation PROVISIONED_BY: arjan (Terraform)
ZONE_3: UpCloud Managed Kubernetes (production)¶
OWNER: rutger RUNTIME: UpCloud Managed Kubernetes Engine (MKE) LOCATION: de-fra1 (Frankfurt, EU), DR: nl-ams1 (Amsterdam) PURPOSE: client production workloads PROVISIONED_BY: arjan (Terraform) RULE: ALL Zone 3 changes require Victoria security review
K8S:DEPLOYMENTS¶
CREATING_DEPLOYMENTS¶
RULE: every deployment MUST have resource requests AND limits RULE: every deployment MUST have health probes (liveness + readiness, startup optional) RULE: every deployment MUST have a security context (runAsNonRoot preferred) RULE: every container MUST have PYTHONUNBUFFERED=1 or equivalent for log visibility
TEMPLATE: minimal deployment for GE workloads
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {service}
  namespace: {namespace}
  labels:
    app: {service}
    component: {component-type}
spec:
  replicas: {N}
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # zero-downtime default
  selector:
    matchLabels:
      app: {service}
  template:
    metadata:
      labels:
        app: {service}
        component: {component-type}
    spec:
      securityContext:
        runAsNonRoot: true
        fsGroup: 1004
      containers:
        - name: {service}
          image: {image}:{tag}
          imagePullPolicy: IfNotPresent
          env:
            - name: PYTHONUNBUFFERED  # mandatory per rules above (log visibility)
              value: "1"
          resources:
            requests:
              memory: "{X}Mi"
              cpu: "{X}m"
            limits:
              memory: "{X}Mi"
              cpu: "{X}m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
ROLLOUT_STRATEGIES¶
ROLLING_UPDATE (default for GE): - maxSurge: 1 (create 1 extra pod before killing old) - maxUnavailable: 0 (zero-downtime) - USE_WHEN: standard deployments, stateless services
RECREATE: - Kills all old pods before creating new ones - USE_WHEN: singleton services that cannot have two instances (e.g., services with exclusive resource locks) - ANTI_PATTERN: using Recreate for services that could use RollingUpdate
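A sketch of the Recreate strategy for the singleton case above:

```yaml
# Sketch: singleton with Recreate; all old pods stop before new ones start
spec:
  replicas: 1
  strategy:
    type: Recreate
```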
CHECK rollout status:
TOOL: kubectl
RUN: kubectl rollout status deployment/{name} -n {namespace} --timeout=120s
EXPECT: "deployment successfully rolled out"
IF_STUCK: kubectl describe deployment/{name} -n {namespace}
ROLLBACKS¶
IMMEDIATE_ROLLBACK (when deployment is failing):
TOOL: kubectl
RUN: kubectl rollout undo deployment/{name} -n {namespace}
VERIFY: kubectl rollout status deployment/{name} -n {namespace}
ROLLBACK_TO_SPECIFIC_REVISION:
TOOL: kubectl
RUN: kubectl rollout history deployment/{name} -n {namespace}
RUN: kubectl rollout undo deployment/{name} -n {namespace} --to-revision={N}
RULE: after rollback, immediately investigate root cause — rollback is mitigation, not resolution RULE: record rollback in incident log if it affects clients
K8S:HEALTH_PROBES¶
PROBE_TYPES¶
STARTUP_PROBE: - PURPOSE: allow slow-starting containers to boot (Vault init, CLI installs, model downloads) - FIRES: only during startup, before liveness/readiness take over - GE_EXECUTOR_EXAMPLE: initialDelaySeconds=10, periodSeconds=10, failureThreshold=12 (120s max boot) - USE_WHEN: container needs >30s to start
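The GE_EXECUTOR_EXAMPLE values above, sketched as a manifest fragment (the /health path and port 8080 follow the deployment template; treat them as assumptions for other services):

```yaml
startupProbe:
  httpGet:
    path: /health   # endpoint/port taken from the deployment template
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 12   # 12 x 10s = up to 120s allowed for boot
```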
LIVENESS_PROBE:
- PURPOSE: detect deadlocked/hung processes, trigger pod restart
- FIRES: continuously after startup probe succeeds
- ANTI_PATTERN: HTTP liveness on executor pods — event loop blocks during PTY execution
- FIX: use exec probe with kill -0 1 (checks process alive without HTTP)
- GE_EXECUTOR_CONFIG: exec probe, periodSeconds=30, failureThreshold=6 (180s tolerance for long sessions)
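The exec-based liveness config above, as a manifest fragment:

```yaml
livenessProbe:
  exec:
    command: ["kill", "-0", "1"]   # succeeds if PID 1 is alive; no HTTP needed
  periodSeconds: 30
  failureThreshold: 6              # 6 x 30s = 180s tolerance for long sessions
```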
READINESS_PROBE: - PURPOSE: control traffic routing — pod only receives traffic when ready - FIRES: continuously, can toggle pod in/out of service - USE_WHEN: service needs warm-up, or should stop receiving work during overload - GE_EXECUTOR_CONFIG: httpGet /ready, periodSeconds=10, failureThreshold=3
PROBE_DECISION_TREE¶
IF: service has HTTP endpoint
THEN: httpGet probe preferred
IF: service blocks event loop during work (like executor PTY sessions)
THEN: exec probe for liveness (`kill -0 1`), httpGet for readiness
IF: service uses TCP only (Redis, databases)
THEN: tcpSocket probe
IF: service takes >30s to start
THEN: add startupProbe with generous failureThreshold
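A sketch of the tcpSocket branch of the tree (port 6379 assumes Redis):

```yaml
livenessProbe:
  tcpSocket:
    port: 6379   # assumed Redis port; use the service's actual TCP port
  periodSeconds: 10
```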
ANTI_PATTERN: liveness probe timeout shorter than the longest expected operation LESSON: GE learned this the hard way — executor liveness kills during 3-min agent sessions caused restart loops and wasted tokens
K8S:HPA (Horizontal Pod Autoscaler)¶
CONFIGURATION¶
GE_EXECUTOR_HPA (metadata and scaleTargetRef filled in from the names used elsewhere on this page):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ge-executor-hpa
  namespace: ge-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ge-executor
  minReplicas: 2
  maxReplicas: 5  # HARD LIMIT — never exceed (token burn prevention)
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5min before scaling down
    scaleUp:
      stabilizationWindowSeconds: 120  # wait 2min before scaling up
RULE: maxReplicas MUST NOT exceed 5 for executor (token burn prevention)
RULE: scaleUp stabilizationWindowSeconds MUST be >= 120s
RULE: before scaling executors, run bash scripts/verify-executor-safety.sh — MUST exit 0
CHECK_HPA_STATUS:
TOOL: kubectl
RUN: kubectl get hpa -n ge-agents
RUN: kubectl describe hpa ge-executor-hpa -n ge-agents
K8S:PDB (Pod Disruption Budget)¶
PURPOSE: ensure minimum availability during voluntary disruptions (node drain, rolling update)
GE_EXECUTOR_PDB: minAvailable=2 (always keep 2 executor pods running)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ge-executor-pdb
  namespace: ge-agents
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: ge-executor
WHEN_TO_USE_PDB: - Services with >= 2 replicas that must maintain availability - Stateful services where disruption = data risk
WHEN_NOT_TO_USE: - Singletons with Recreate strategy (dolly, orchestrator singletons) - Stateless batch jobs that can restart safely
K8S:RESOURCE_MANAGEMENT¶
NAMESPACE_ORGANIZATION¶
| Namespace | Purpose | Owner |
|---|---|---|
| ge-agents | Agent executor, orchestrator, dedicated agents | gerco |
| ge-system | Redis, Vault, admin-ui, core infrastructure | gerco |
| ge-ingress | Traefik IngressController | stef/gerco |
| ge-monitoring | Grafana, Loki, Prometheus, Promtail | ron |
| ge-wiki | MkDocs wiki brain | gerco |
| ge-hosting | Client hosting (shared) | rutger |
| sh-{client} | Shared hosting per client | rutger |
| ded-{client} | Dedicated hosting per client | rutger |
RESOURCE_TIERS (client workloads)¶
| Tier | CPU Req | CPU Limit | Mem Req | Mem Limit | Replicas |
|---|---|---|---|---|---|
| small | 10m | 100m | 32Mi | 128Mi | 1 |
| medium | 50m | 250m | 64Mi | 256Mi | 2 |
| large | 100m | 500m | 128Mi | 512Mi | 2 |
| xlarge | 200m | 1000m | 256Mi | 1Gi | 3 |
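The medium tier from the table above, sketched as container resources:

```yaml
resources:
  requests:
    cpu: "50m"
    memory: "64Mi"
  limits:
    cpu: "250m"
    memory: "256Mi"
```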
CHECK resource usage:
TOOL: kubectl
RUN: kubectl top pods -n {namespace}
RUN: kubectl top nodes
RUN: kubectl describe node | grep -A 10 "Allocated resources"
K8S:CRONJOBS_AND_JOBS¶
CRONJOB_PATTERNS¶
GE uses CronJobs for scheduled maintenance:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: {job-name}
  namespace: {namespace}
spec:
  schedule: "{cron-expression}"
  concurrencyPolicy: Forbid        # never run concurrent instances
  successfulJobsHistoryLimit: 3    # keep last 3 successful
  failedJobsHistoryLimit: 3        # keep last 3 failed
  startingDeadlineSeconds: 120     # skip if >2min late
  jobTemplate:
    spec:
      backoffLimit: 2              # max 2 retries
      activeDeadlineSeconds: 300   # kill after 5min
      template:
        spec:
          restartPolicy: OnFailure
ACTIVE_CRONJOBS: - cost-monitor: agent cost tracking and alerting - pod-refresh: periodic pod recycling for memory hygiene - k8s-health-dump: host cron (scripts/k8s-health-dump.sh) for infrastructure page
CHECK_CRONJOB_STATUS:
TOOL: kubectl
RUN: kubectl get cronjobs -n ge-agents
RUN: kubectl get jobs -n ge-agents --sort-by=.metadata.creationTimestamp
RUN: kubectl logs job/{job-name} -n ge-agents
SUSPEND/UNSUSPEND:
TOOL: kubectl
RUN: kubectl patch cronjob {name} -n {namespace} -p '{"spec":{"suspend":true}}'
RUN: kubectl patch cronjob {name} -n {namespace} -p '{"spec":{"suspend":false}}'
K8S:DEBUGGING¶
ESSENTIAL_KUBECTL_PATTERNS¶
POD_STATUS:
TOOL: kubectl
RUN: kubectl get pods -n {namespace} -o wide
RUN: kubectl describe pod {pod-name} -n {namespace}
RUN: kubectl get events -n {namespace} --sort-by='.lastTimestamp' | tail -20
LOGS:
TOOL: kubectl
RUN: kubectl logs {pod-name} -n {namespace} --tail=100
RUN: kubectl logs {pod-name} -n {namespace} --previous # logs from crashed container
RUN: kubectl logs -l app={label} -n {namespace} --tail=50 # logs by label
EXEC_INTO_POD:
TOOL: kubectl
RUN: kubectl exec -it {pod-name} -n {namespace} -- /bin/sh
RULE: for debugging only — NEVER hot-patch running pods
RESOURCE_INSPECTION:
TOOL: kubectl
RUN: kubectl get all -n {namespace}
RUN: kubectl get endpoints -n {namespace}
RUN: kubectl get networkpolicy -n {namespace}
COMMON_FAILURE_PATTERNS¶
CRASHLOOPBACKOFF:
1. CHECK: kubectl logs {pod} -n {ns} --previous
2. CHECK: kubectl describe pod {pod} -n {ns} (look at Events section)
3. COMMON_CAUSES: missing secrets, wrong image, OOMKilled, failed health probe
4. IF: OOMKilled THEN: increase memory limits
5. IF: probe failure THEN: check probe config matches actual endpoint
IMAGEPULLBACKOFF:
1. CHECK: is image name/tag correct?
2. CHECK: for local images, is imagePullPolicy=IfNotPresent?
3. CHECK: was image imported to k3s? (sudo k3s ctr images ls | grep {image})
PENDING (pod won't schedule):
1. CHECK: kubectl describe pod — look for scheduling failures
2. COMMON: insufficient CPU/memory, node selector mismatch, PVC pending
3. CHECK: kubectl describe node — see allocatable vs allocated
EVICTED:
1. CHECK: node under disk/memory pressure?
2. CHECK: kubectl get events --field-selector reason=Evicted -n {ns}
3. FIX: increase node resources or reduce pod requests
LOG_AGGREGATION¶
STACK: Promtail (collector) -> Loki (storage) -> Grafana (visualization) NAMESPACE: ge-monitoring
Promtail collects logs from all pods automatically via DaemonSet. Loki stores logs with label-based indexing. Grafana provides LogQL query interface.
QUERY_EXAMPLE (LogQL in Grafana):
{namespace="ge-agents", app="ge-executor"} |= "error"
{namespace="ge-agents"} | json | level="error" | line_format "{{.msg}}"
DIRECT_LOKI_QUERY:
TOOL: curl
RUN: curl -sG "https://loki:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={namespace="ge-agents"} |= "error"' \
  --data-urlencode "start=$(date -d '1 hour ago' +%s)000000000" \
  -k
NOTE: start/end take a Unix epoch (seconds or nanoseconds) or an RFC3339 timestamp — "1hr" is not a valid value
K8S:IMAGE_MANAGEMENT¶
BUILD_AND_DEPLOY (Zone 1 k3s)¶
CRITICAL: NEVER use kubectl cp or hot-patch running pods — Python caches modules in sys.modules at startup
BUILD_EXECUTOR:
This script: 1. docker build -t ge-bootstrap-agent-executor:latest 2. docker save | k3s ctr images import 3. Image available to k3s immediately
DEPLOY_AFTER_BUILD:
TOOL: kubectl
RUN: kubectl rollout restart deployment/ge-executor -n ge-agents
VERIFY: kubectl rollout status deployment/ge-executor -n ge-agents --timeout=120s
BUILD_AND_DEPLOY (Zone 2/3 UpCloud MKE)¶
FOR_UPCLOUD: images pushed to container registry (not local import) FLOW: build -> tag -> push to registry -> kubectl set image or rollout restart
K8S:K3S_SPECIFIC¶
K3S_VS_FULL_K8S¶
DIFFERENCES:
- k3s bundles Traefik as default ingress (GE uses custom Traefik config)
- k3s uses SQLite/etcd for cluster state (single-node = SQLite)
- k3s has lighter resource footprint (~512MB RAM for control plane)
- k3s images imported via k3s ctr images import (no registry needed for Zone 1)
- k3s kubeconfig at /etc/rancher/k3s/k3s.yaml (not ~/.kube/config)
GOTCHAS: - ClusterIP broken from inside pods (GE-specific finding, documented in MEMORY.md) - k3s auto-updates Traefik — pin version in HelmChartConfig if needed - k3s uses flannel CNI by default — NetworkPolicies require --flannel-backend=none + separate CNI
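A hedged sketch of the HelmChartConfig pin mentioned in GOTCHAS (the image tag is an assumption; validate against the k3s release notes before pinning):

```yaml
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: traefik
  namespace: kube-system   # k3s deploys its bundled Traefik chart here
spec:
  valuesContent: |-
    image:
      tag: "v2.10.7"   # assumed tag; pin to the version you validated
```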
RESTART_K3S:
TOOL: bash
RUN: sudo systemctl restart k3s
VERIFY: sudo systemctl status k3s
VERIFY: kubectl get nodes
UPCLOUD_MKE_SPECIFIC¶
DIFFERENCES_FROM_K3S: - UpCloud manages control plane (API server, etcd, scheduler) - Node pools managed via Terraform (arjan provisions) - Uses Calico CNI (full NetworkPolicy support out of box) - Container registry integration via UpCloud Container Registry - LoadBalancer service type creates UpCloud Load Balancer automatically
K8S:AGENT_RULES¶
FOR_GERCO (Zone 1)¶
ON_DEPLOYMENT_TASK:
1. READ this page for k8s patterns
2. WRITE manifest with mandatory fields (resources, probes, security context)
3. CHECK NetworkPolicy allows required traffic
4. APPLY manifest: kubectl apply -f {manifest} -n {namespace}
5. VERIFY pod running and healthy
6. NEVER touch Zone 2 or Zone 3
FOR_THIJMEN (Zone 2 staging)¶
ON_WORKLOAD_TASK: 1. READ this page for k8s patterns 2. RECEIVE cluster credentials from arjan 3. DEPLOY workloads to staging cluster 4. VERIFY with full integration test 5. HAND OFF to rutger for production promotion 6. NEVER provision clusters (arjan's domain)
FOR_RUTGER (Zone 3 production)¶
ON_PRODUCTION_TASK: 1. READ this page for k8s patterns 2. RECEIVE validated workload from thijmen 3. APPLY to production with zero-downtime strategy 4. VERIFY health and performance 5. HAND OFF to stef for network configuration 6. ALL changes require Victoria security review
K8S:CROSS_REFERENCES¶
DEPLOYMENT_STRATEGIES: domains/infrastructure/deployment-strategies.md — when to use rolling vs blue-green BACKUP: domains/infrastructure/backup-disaster-recovery.md — etcd backup, pod state recovery NETWORK_SECURITY: domains/networking/network-security.md — NetworkPolicies, Traefik config TLS: domains/networking/tls-certificates.md — cert-manager integration MONITORING: domains/monitoring/index.md — Prometheus, Grafana, alerting INCIDENT_RESPONSE: domains/incident-response/index.md — when deployments go wrong