
DOMAIN:INFRASTRUCTURE:KUBERNETES_OPERATIONS

OWNER: gerco (Zone 1), thijmen (Zone 2), rutger (Zone 3)
UPDATED: 2026-03-24
SCOPE: all k8s operations across GE's three-zone architecture
AGENTS: gerco (k3s sysadmin), thijmen (UpCloud MKE staging), rutger (UpCloud MKE production), arjan (cluster provisioning), leon (deploy coordination)


K8S:ZONE_ARCHITECTURE

ZONE_1: k3s on Minisforum 790 Pro (fort-knox-dev)

OWNER: gerco
RUNTIME: k3s (lightweight single-node Kubernetes)
HOST: Minisforum 790 Pro — AMD Ryzen 9 7940HS, 64GB DDR5, 2TB NVMe
KUBECONFIG: /etc/rancher/k3s/k3s.yaml
API_ENDPOINT: https://127.0.0.1:6443
PURPOSE: development environment, agent execution, GE platform itself
INGRESS: Traefik (bundled with k3s, custom config in ge-ingress namespace)

CRITICAL_PITFALL: k3s ClusterIP (10.43.0.1:443) is BROKEN from inside pods — connection refused
FIX: use service DNS names instead (e.g., kubernetes.default.svc.cluster.local)
PITFALL: do NOT use hostNetwork: true — causes port conflicts on rolling updates
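A minimal sketch of the workaround: point in-cluster clients at the API via the service DNS name rather than the ClusterIP. The env var name here is illustrative, not a GE convention:

```yaml
# Hypothetical container env fragment — the variable name is an example only
env:
- name: KUBERNETES_API_URL
  value: "https://kubernetes.default.svc.cluster.local:443"  # DNS name works; 10.43.0.1 does not answer from pods
```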

ZONE_2: UpCloud Managed Kubernetes (staging)

OWNER: thijmen
RUNTIME: UpCloud Managed Kubernetes Engine (MKE)
LOCATION: de-fra1 (Frankfurt, EU)
PURPOSE: client staging environments, pre-production validation
PROVISIONED_BY: arjan (Terraform)

ZONE_3: UpCloud Managed Kubernetes (production)

OWNER: rutger
RUNTIME: UpCloud Managed Kubernetes Engine (MKE)
LOCATION: de-fra1 (Frankfurt, EU), DR: nl-ams1 (Amsterdam)
PURPOSE: client production workloads
PROVISIONED_BY: arjan (Terraform)
RULE: ALL Zone 3 changes require Victoria security review


K8S:DEPLOYMENTS

CREATING_DEPLOYMENTS

RULE: every deployment MUST have resource requests AND limits
RULE: every deployment MUST have health probes (liveness + readiness, startup optional)
RULE: every deployment MUST have a security context (runAsNonRoot preferred)
RULE: every container MUST have PYTHONUNBUFFERED=1 or equivalent for log visibility

TEMPLATE: minimal deployment for GE workloads

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {service}
  namespace: {namespace}
  labels:
    app: {service}
    component: {component-type}
spec:
  replicas: {N}
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # zero-downtime default
  selector:
    matchLabels:
      app: {service}
  template:
    metadata:
      labels:
        app: {service}
        component: {component-type}
    spec:
      securityContext:
        runAsNonRoot: true
        fsGroup: 1004
      containers:
      - name: {service}
        image: {image}:{tag}
        imagePullPolicy: IfNotPresent
        resources:
          requests:
            memory: "{X}Mi"
            cpu: "{X}m"
          limits:
            memory: "{X}Mi"
            cpu: "{X}m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5

ROLLOUT_STRATEGIES

ROLLING_UPDATE (default for GE):
- maxSurge: 1 (create 1 extra pod before killing old)
- maxUnavailable: 0 (zero-downtime)
- USE_WHEN: standard deployments, stateless services

RECREATE:
- Kills all old pods before creating new ones
- USE_WHEN: singleton services that cannot have two instances (e.g., services with exclusive resource locks)
- ANTI_PATTERN: using Recreate for services that could use RollingUpdate
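For singleton services, the strategy block replaces the RollingUpdate settings shown in the deployment template; a minimal sketch:

```yaml
spec:
  replicas: 1
  strategy:
    type: Recreate   # old pod is killed before the new one starts; a brief downtime window is expected
```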

CHECK rollout status:

TOOL: kubectl
RUN: kubectl rollout status deployment/{name} -n {namespace} --timeout=120s
EXPECT: "deployment successfully rolled out"
IF_STUCK: kubectl describe deployment/{name} -n {namespace}

ROLLBACKS

IMMEDIATE_ROLLBACK (when deployment is failing):

TOOL: kubectl
RUN: kubectl rollout undo deployment/{name} -n {namespace}
VERIFY: kubectl rollout status deployment/{name} -n {namespace}

ROLLBACK_TO_SPECIFIC_REVISION:

TOOL: kubectl
RUN: kubectl rollout history deployment/{name} -n {namespace}
RUN: kubectl rollout undo deployment/{name} -n {namespace} --to-revision={N}

RULE: after rollback, immediately investigate root cause — rollback is mitigation, not resolution
RULE: record rollback in incident log if it affects clients


K8S:HEALTH_PROBES

PROBE_TYPES

STARTUP_PROBE:
- PURPOSE: allow slow-starting containers to boot (Vault init, CLI installs, model downloads)
- FIRES: only during startup, before liveness/readiness take over
- GE_EXECUTOR_EXAMPLE: initialDelaySeconds=10, periodSeconds=10, failureThreshold=12 (120s max boot)
- USE_WHEN: container needs >30s to start
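The GE executor numbers translate to a probe block like this; the path and port follow the template's /health convention and are assumptions here:

```yaml
startupProbe:
  httpGet:
    path: /health        # assumed to match the liveness endpoint
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 12   # 12 x 10s = 120s maximum boot window
```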

LIVENESS_PROBE:
- PURPOSE: detect deadlocked/hung processes, trigger pod restart
- FIRES: continuously after startup probe succeeds
- ANTI_PATTERN: HTTP liveness on executor pods — event loop blocks during PTY execution
- FIX: use exec probe with kill -0 1 (checks process alive without HTTP)
- GE_EXECUTOR_CONFIG: exec probe, periodSeconds=30, failureThreshold=6 (180s tolerance for long sessions)
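The exec-based liveness config above, sketched as a probe block; this assumes PID 1 is the service process:

```yaml
livenessProbe:
  exec:
    command: ["kill", "-0", "1"]   # signal 0: checks PID 1 exists without sending a signal
  periodSeconds: 30
  failureThreshold: 6              # 6 x 30s = 180s tolerance for long PTY sessions
```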

READINESS_PROBE:
- PURPOSE: control traffic routing — pod only receives traffic when ready
- FIRES: continuously, can toggle pod in/out of service
- USE_WHEN: service needs warm-up, or should stop receiving work during overload
- GE_EXECUTOR_CONFIG: httpGet /ready, periodSeconds=10, failureThreshold=3

PROBE_DECISION_TREE

IF: service has HTTP endpoint
  THEN: httpGet probe preferred
IF: service blocks event loop during work (like executor PTY sessions)
  THEN: exec probe for liveness (`kill -0 1`), httpGet for readiness
IF: service uses TCP only (Redis, databases)
  THEN: tcpSocket probe
IF: service takes >30s to start
  THEN: add startupProbe with generous failureThreshold
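For the TCP-only case in the decision tree, a minimal sketch; port 6379 is the standard Redis port, used here purely as an illustration:

```yaml
readinessProbe:
  tcpSocket:
    port: 6379     # probe succeeds if a TCP connection can be opened
  periodSeconds: 10
```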

ANTI_PATTERN: liveness probe tolerance shorter than the longest legitimate operation
FIX: size failureThreshold x periodSeconds to exceed the longest operation — GE learned this the hard way: executor liveness kills during 3-min agent sessions caused restart loops and wasted tokens


K8S:HPA (Horizontal Pod Autoscaler)

CONFIGURATION

GE_EXECUTOR_HPA:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ge-executor-hpa
  namespace: ge-agents
spec:
  scaleTargetRef:           # required field, absent in the original snippet
    apiVersion: apps/v1
    kind: Deployment
    name: ge-executor
  minReplicas: 2
  maxReplicas: 5      # HARD LIMIT — never exceed (token burn prevention)
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300    # wait 5min before scaling down
    scaleUp:
      stabilizationWindowSeconds: 120    # wait 2min before scaling up

RULE: maxReplicas MUST NOT exceed 5 for executor (token burn prevention)
RULE: scaleUp stabilizationWindowSeconds MUST be >= 120s
RULE: before scaling executors, run bash scripts/verify-executor-safety.sh — MUST exit 0

CHECK_HPA_STATUS:

TOOL: kubectl
RUN: kubectl get hpa -n ge-agents
RUN: kubectl describe hpa ge-executor-hpa -n ge-agents

K8S:PDB (Pod Disruption Budget)

PURPOSE: ensure minimum availability during voluntary disruptions (node drain, rolling update)

GE_EXECUTOR_PDB: minAvailable=2 (always keep 2 executor pods running)

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ge-executor-pdb
  namespace: ge-agents
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: ge-executor

WHEN_TO_USE_PDB:
- Services with >= 2 replicas that must maintain availability
- Stateful services where disruption = data risk

WHEN_NOT_TO_USE:
- Singletons with Recreate strategy (dolly, orchestrator singletons)
- Stateless batch jobs that can restart safely


K8S:RESOURCE_MANAGEMENT

NAMESPACE_ORGANIZATION

Namespace      Purpose                                         Owner
ge-agents      Agent executor, orchestrator, dedicated agents  gerco
ge-system      Redis, Vault, admin-ui, core infrastructure     gerco
ge-ingress     Traefik IngressController                       stef/gerco
ge-monitoring  Grafana, Loki, Prometheus, Promtail             ron
ge-wiki        MkDocs wiki brain                               gerco
ge-hosting     Client hosting (shared)                         rutger
sh-{client}    Shared hosting per client                       rutger
ded-{client}   Dedicated hosting per client                    rutger

RESOURCE_TIERS (client workloads)

Tier    CPU Req  CPU Limit  Mem Req  Mem Limit  Replicas
small   10m      100m       32Mi     128Mi      1
medium  50m      250m       64Mi     256Mi      2
large   100m     500m       128Mi    512Mi      2
xlarge  200m     1000m      256Mi    1Gi        3
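The medium tier row, expressed as the resources block a workload manifest would carry:

```yaml
resources:
  requests:
    memory: "64Mi"
    cpu: "50m"
  limits:
    memory: "256Mi"
    cpu: "250m"
```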

CHECK resource usage:

TOOL: kubectl
RUN: kubectl top pods -n {namespace}
RUN: kubectl top nodes
RUN: kubectl describe node | grep -A 10 "Allocated resources"

K8S:CRONJOBS_AND_JOBS

CRONJOB_PATTERNS

GE uses CronJobs for scheduled maintenance:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: {job-name}
  namespace: {namespace}
spec:
  schedule: "{cron-expression}"
  concurrencyPolicy: Forbid          # never run concurrent instances
  successfulJobsHistoryLimit: 3       # keep last 3 successful
  failedJobsHistoryLimit: 3           # keep last 3 failed
  startingDeadlineSeconds: 120        # skip if >2min late
  jobTemplate:
    spec:
      backoffLimit: 2                 # max 2 retries
      activeDeadlineSeconds: 300      # kill after 5min
      template:
        spec:
          restartPolicy: OnFailure

ACTIVE_CRONJOBS:
- cost-monitor: agent cost tracking and alerting
- pod-refresh: periodic pod recycling for memory hygiene
- k8s-health-dump: host cron (scripts/k8s-health-dump.sh) for infrastructure page

CHECK_CRONJOB_STATUS:

TOOL: kubectl
RUN: kubectl get cronjobs -n ge-agents
RUN: kubectl get jobs -n ge-agents --sort-by=.metadata.creationTimestamp
RUN: kubectl logs job/{job-name} -n ge-agents

SUSPEND/UNSUSPEND:

TOOL: kubectl
RUN: kubectl patch cronjob {name} -n {namespace} -p '{"spec":{"suspend":true}}'
RUN: kubectl patch cronjob {name} -n {namespace} -p '{"spec":{"suspend":false}}'

K8S:DEBUGGING

ESSENTIAL_KUBECTL_PATTERNS

POD_STATUS:

TOOL: kubectl
RUN: kubectl get pods -n {namespace} -o wide
RUN: kubectl describe pod {pod-name} -n {namespace}
RUN: kubectl get events -n {namespace} --sort-by='.lastTimestamp' | tail -20

LOGS:

TOOL: kubectl
RUN: kubectl logs {pod-name} -n {namespace} --tail=100
RUN: kubectl logs {pod-name} -n {namespace} --previous   # logs from crashed container
RUN: kubectl logs -l app={label} -n {namespace} --tail=50  # logs by label

EXEC_INTO_POD:

TOOL: kubectl
RUN: kubectl exec -it {pod-name} -n {namespace} -- /bin/sh
RULE: for debugging only — NEVER hot-patch running pods

RESOURCE_INSPECTION:

TOOL: kubectl
RUN: kubectl get all -n {namespace}
RUN: kubectl get endpoints -n {namespace}
RUN: kubectl get networkpolicy -n {namespace}

COMMON_FAILURE_PATTERNS

CRASHLOOPBACKOFF:
1. CHECK: kubectl logs {pod} -n {ns} --previous
2. CHECK: kubectl describe pod {pod} -n {ns} (look at Events section)
3. COMMON_CAUSES: missing secrets, wrong image, OOMKilled, failed health probe
4. IF: OOMKilled THEN: increase memory limits
5. IF: probe failure THEN: check probe config matches actual endpoint

IMAGEPULLBACKOFF:
1. CHECK: is image name/tag correct?
2. CHECK: for local images, is imagePullPolicy=IfNotPresent?
3. CHECK: was image imported to k3s? (sudo k3s ctr images ls | grep {image})

PENDING (pod won't schedule):
1. CHECK: kubectl describe pod — look for scheduling failures
2. COMMON: insufficient CPU/memory, node selector mismatch, PVC pending
3. CHECK: kubectl describe node — see allocatable vs allocated

EVICTED:
1. CHECK: node under disk/memory pressure?
2. CHECK: kubectl get events --field-selector reason=Evicted -n {ns}
3. FIX: increase node resources or reduce pod requests

LOG_AGGREGATION

STACK: Promtail (collector) -> Loki (storage) -> Grafana (visualization)
NAMESPACE: ge-monitoring

Promtail collects logs from all pods automatically via DaemonSet. Loki stores logs with label-based indexing. Grafana provides LogQL query interface.

QUERY_EXAMPLE (LogQL in Grafana):

{namespace="ge-agents", app="ge-executor"} |= "error"
{namespace="ge-agents"} | json | level="error" | line_format "{{.msg}}"

DIRECT_LOKI_QUERY:

TOOL: curl
RUN: curl -sG "https://loki:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={namespace="ge-agents"} |= "error"' \
  --data-urlencode "start=$(date -d '1 hour ago' +%s)000000000" \
  -k
NOTE: -G sends the parameters as a GET query string; start/end must be RFC3339 or Unix-epoch-nanosecond timestamps, relative values like "1hr" are not accepted

K8S:IMAGE_MANAGEMENT

BUILD_AND_DEPLOY (Zone 1 k3s)

CRITICAL: NEVER use kubectl cp or hot-patch running pods — Python caches modules in sys.modules at startup

BUILD_EXECUTOR:

TOOL: bash
RUN: bash ge-ops/infrastructure/local/k3s/executor/build-executor.sh

This script:
1. docker build -t ge-bootstrap-agent-executor:latest
2. docker save | k3s ctr images import
3. Image available to k3s immediately

DEPLOY_AFTER_BUILD:

TOOL: kubectl
RUN: kubectl rollout restart deployment/ge-executor -n ge-agents
VERIFY: kubectl rollout status deployment/ge-executor -n ge-agents --timeout=120s

BUILD_AND_DEPLOY (Zone 2/3 UpCloud MKE)

FOR_UPCLOUD: images pushed to container registry (not local import)
FLOW: build -> tag -> push to registry -> kubectl set image or rollout restart
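The flow above, sketched as a dry run that only prints the commands it would execute; REGISTRY, SERVICE, and TAG are illustrative placeholders, not real GE values:

```shell
# Dry-run sketch of the Zone 2/3 release flow (prints commands, runs nothing)
REGISTRY="registry.example.upcloud.com/ge"   # placeholder registry
SERVICE="ge-executor"
TAG="v1.2.3"                                 # placeholder tag
IMAGE_REF="${REGISTRY}/${SERVICE}:${TAG}"

echo "docker build -t ${IMAGE_REF} ."
echo "docker push ${IMAGE_REF}"
echo "kubectl set image deployment/${SERVICE} ${SERVICE}=${IMAGE_REF} -n ge-agents"
```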


K8S:K3S_SPECIFIC

K3S_VS_FULL_K8S

DIFFERENCES:
- k3s bundles Traefik as default ingress (GE uses custom Traefik config)
- k3s uses SQLite/etcd for cluster state (single-node = SQLite)
- k3s has lighter resource footprint (~512MB RAM for control plane)
- k3s images imported via k3s ctr images import (no registry needed for Zone 1)
- k3s kubeconfig at /etc/rancher/k3s/k3s.yaml (not ~/.kube/config)

GOTCHAS:
- ClusterIP broken from inside pods (GE-specific finding, documented in MEMORY.md)
- k3s auto-updates Traefik — pin version in HelmChartConfig if needed
- k3s uses flannel CNI by default — NetworkPolicies require --flannel-backend=none + separate CNI
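Pinning the bundled Traefik goes through k3s's HelmChartConfig mechanism; a sketch, where the tag value is an example only, check the installed chart before pinning:

```yaml
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: traefik
  namespace: kube-system   # k3s manages the bundled chart here
spec:
  valuesContent: |-
    image:
      tag: "v2.10.5"       # example pin, not a GE-verified version
```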

RESTART_K3S:

TOOL: bash
RUN: sudo systemctl restart k3s
VERIFY: sudo systemctl status k3s
VERIFY: kubectl get nodes

UPCLOUD_MKE_SPECIFIC

DIFFERENCES_FROM_K3S:
- UpCloud manages control plane (API server, etcd, scheduler)
- Node pools managed via Terraform (arjan provisions)
- Uses Calico CNI (full NetworkPolicy support out of box)
- Container registry integration via UpCloud Container Registry
- LoadBalancer service type creates UpCloud Load Balancer automatically


K8S:AGENT_RULES

FOR_GERCO (Zone 1)

ON_DEPLOYMENT_TASK:
1. READ this page for k8s patterns
2. WRITE manifest with mandatory fields (resources, probes, security context)
3. CHECK NetworkPolicy allows required traffic
4. APPLY manifest: kubectl apply -f {manifest} -n {namespace}
5. VERIFY pod running and healthy
6. NEVER touch Zone 2 or Zone 3

FOR_THIJMEN (Zone 2 staging)

ON_WORKLOAD_TASK:
1. READ this page for k8s patterns
2. RECEIVE cluster credentials from arjan
3. DEPLOY workloads to staging cluster
4. VERIFY with full integration test
5. HAND OFF to rutger for production promotion
6. NEVER provision clusters (arjan's domain)

FOR_RUTGER (Zone 3 production)

ON_PRODUCTION_TASK:
1. READ this page for k8s patterns
2. RECEIVE validated workload from thijmen
3. APPLY to production with zero-downtime strategy
4. VERIFY health and performance
5. HAND OFF to stef for network configuration
6. ALL changes require Victoria security review


K8S:CROSS_REFERENCES

DEPLOYMENT_STRATEGIES: domains/infrastructure/deployment-strategies.md — when to use rolling vs blue-green
BACKUP: domains/infrastructure/backup-disaster-recovery.md — etcd backup, pod state recovery
NETWORK_SECURITY: domains/networking/network-security.md — NetworkPolicies, Traefik config
TLS: domains/networking/tls-certificates.md — cert-manager integration
MONITORING: domains/monitoring/index.md — Prometheus, Grafana, alerting
INCIDENT_RESPONSE: domains/incident-response/index.md — when deployments go wrong