DOMAIN:INFRASTRUCTURE:KUBERNETES_OPERATIONS¶
OWNER: gerco (Zone 1), thijmen (Zone 2), rutger (Zone 3) UPDATED: 2026-03-24 SCOPE: all k8s operations across GE's three-zone architecture AGENTS: gerco (k3s sysadmin), thijmen (UpCloud MKE staging), rutger (UpCloud MKE production), arjan (cluster provisioning), leon (deploy coordination)
K8S:ZONE_ARCHITECTURE¶
ZONE_1: k3s on Minisforum 790 Pro (fort-knox-dev)¶
OWNER: gerco RUNTIME: k3s (lightweight single-node Kubernetes) HOST: Minisforum 790 Pro — AMD Ryzen 9 7940HS, 64GB DDR5, 2TB NVMe KUBECONFIG: /etc/rancher/k3s/k3s.yaml API_ENDPOINT: https://127.0.0.1:6443 PURPOSE: development environment, agent execution, GE platform itself INGRESS: Traefik (bundled with k3s, custom config in ge-ingress namespace)
CRITICAL_PITFALL: k3s ClusterIP (10.43.0.1:443) is BROKEN from inside pods — connection refused FIX: use service DNS names instead (e.g., kubernetes.default.svc.cluster.local) PITFALL: do NOT use hostNetwork: true — causes port conflicts on rolling updates
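A minimal illustration of the DNS-name fix above, as a pod env fragment (the variable name KUBERNETES_API_URL is an assumption, not a GE convention):

```yaml
# Sketch: point in-pod clients at the API server via service DNS,
# not the broken ClusterIP (10.43.0.1:443). Variable name is assumed.
env:
  - name: KUBERNETES_API_URL
    value: "https://kubernetes.default.svc.cluster.local:443"
```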
ZONE_2: UpCloud Managed Kubernetes (staging)¶
OWNER: thijmen RUNTIME: UpCloud Managed Kubernetes Engine (MKE) LOCATION: de-fra1 (Frankfurt, EU) PURPOSE: client staging environments, pre-production validation PROVISIONED_BY: arjan (Terraform)
ZONE_3: UpCloud Managed Kubernetes (production)¶
OWNER: rutger RUNTIME: UpCloud Managed Kubernetes Engine (MKE) LOCATION: de-fra1 (Frankfurt, EU), DR: nl-ams1 (Amsterdam) PURPOSE: client production workloads PROVISIONED_BY: arjan (Terraform) RULE: ALL Zone 3 changes require Victoria security review
K8S:DEPLOYMENTS¶
CREATING_DEPLOYMENTS¶
RULE: every deployment MUST have resource requests AND limits RULE: every deployment MUST have health probes (liveness + readiness, startup optional) RULE: every deployment MUST have a security context (runAsNonRoot preferred) RULE: every container MUST have PYTHONUNBUFFERED=1 or equivalent for log visibility
TEMPLATE: minimal deployment for GE workloads
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {service}
  namespace: {namespace}
  labels:
    app: {service}
    component: {component-type}
spec:
  replicas: {N}
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0  # zero-downtime default
  selector:
    matchLabels:
      app: {service}
  template:
    metadata:
      labels:
        app: {service}
        component: {component-type}
    spec:
      securityContext:
        runAsNonRoot: true
        fsGroup: 1004
      containers:
        - name: {service}
          image: {image}:{tag}
          imagePullPolicy: IfNotPresent
          env:
            - name: PYTHONUNBUFFERED  # mandatory per rules above (log visibility)
              value: "1"
          resources:
            requests:
              memory: "{X}Mi"
              cpu: "{X}m"
            limits:
              memory: "{X}Mi"
              cpu: "{X}m"
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
ROLLOUT_STRATEGIES¶
ROLLING_UPDATE (default for GE): - maxSurge: 1 (create 1 extra pod before killing old) - maxUnavailable: 0 (zero-downtime) - USE_WHEN: standard deployments, stateless services
RECREATE: - Kills all old pods before creating new ones - USE_WHEN: singleton services that cannot have two instances (e.g., services with exclusive resource locks) - ANTI_PATTERN: using Recreate for services that could use RollingUpdate
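A sketch of the Recreate strategy for the singleton case above:

```yaml
# Sketch: singleton with Recreate; all old pods stop before new ones start
spec:
  replicas: 1
  strategy:
    type: Recreate
```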
CHECK rollout status:
TOOL: kubectl
RUN: kubectl rollout status deployment/{name} -n {namespace} --timeout=120s
EXPECT: "deployment successfully rolled out"
IF_STUCK: kubectl describe deployment/{name} -n {namespace}
ROLLBACKS¶
IMMEDIATE_ROLLBACK (when deployment is failing):
TOOL: kubectl
RUN: kubectl rollout undo deployment/{name} -n {namespace}
VERIFY: kubectl rollout status deployment/{name} -n {namespace}
ROLLBACK_TO_SPECIFIC_REVISION:
TOOL: kubectl
RUN: kubectl rollout history deployment/{name} -n {namespace}
RUN: kubectl rollout undo deployment/{name} -n {namespace} --to-revision={N}
RULE: after rollback, immediately investigate root cause — rollback is mitigation, not resolution RULE: record rollback in incident log if it affects clients
K8S:HEALTH_PROBES¶
PROBE_TYPES¶
STARTUP_PROBE: - PURPOSE: allow slow-starting containers to boot (Vault init, CLI installs, model downloads) - FIRES: only during startup, before liveness/readiness take over - GE_EXECUTOR_EXAMPLE: initialDelaySeconds=10, periodSeconds=10, failureThreshold=12 (120s max boot) - USE_WHEN: container needs >30s to start
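The GE_EXECUTOR_EXAMPLE values above, sketched as a manifest fragment (the /health path and port 8080 follow the deployment template; treat them as assumptions for other services):

```yaml
startupProbe:
  httpGet:
    path: /health   # endpoint/port taken from the deployment template
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 12   # 12 x 10s = up to 120s allowed for boot
```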
LIVENESS_PROBE:
- PURPOSE: detect deadlocked/hung processes, trigger pod restart
- FIRES: continuously after startup probe succeeds
- ANTI_PATTERN: HTTP liveness on executor pods — event loop blocks during PTY execution
- FIX: use exec probe with kill -0 1 (checks process alive without HTTP)
- GE_EXECUTOR_CONFIG: exec probe, periodSeconds=30, failureThreshold=6 (180s tolerance for long sessions)
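The exec-based liveness config above, as a manifest fragment:

```yaml
livenessProbe:
  exec:
    command: ["kill", "-0", "1"]   # succeeds if PID 1 is alive; no HTTP needed
  periodSeconds: 30
  failureThreshold: 6              # 6 x 30s = 180s tolerance for long sessions
```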
READINESS_PROBE: - PURPOSE: control traffic routing — pod only receives traffic when ready - FIRES: continuously, can toggle pod in/out of service - USE_WHEN: service needs warm-up, or should stop receiving work during overload - GE_EXECUTOR_CONFIG: httpGet /ready, periodSeconds=10, failureThreshold=3
PROBE_DECISION_TREE¶
IF: service has HTTP endpoint
THEN: httpGet probe preferred
IF: service blocks event loop during work (like executor PTY sessions)
THEN: exec probe for liveness (`kill -0 1`), httpGet for readiness
IF: service uses TCP only (Redis, databases)
THEN: tcpSocket probe
IF: service takes >30s to start
THEN: add startupProbe with generous failureThreshold
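A sketch of the tcpSocket branch of the tree (port 6379 assumes Redis):

```yaml
livenessProbe:
  tcpSocket:
    port: 6379   # assumed Redis port; use the service's actual TCP port
  periodSeconds: 10
```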
ANTI_PATTERN: liveness probe timeout shorter than the longest expected operation LESSON: GE learned this the hard way — executor liveness kills during 3-min agent sessions caused restart loops and wasted tokens
K8S:HPA (Horizontal Pod Autoscaler)¶
CONFIGURATION¶
GE_EXECUTOR_HPA (metadata and scaleTargetRef filled in from the names used elsewhere on this page):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ge-executor-hpa
  namespace: ge-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ge-executor
  minReplicas: 2
  maxReplicas: 5  # HARD LIMIT — never exceed (token burn prevention)
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5min before scaling down
    scaleUp:
      stabilizationWindowSeconds: 120  # wait 2min before scaling up
RULE: maxReplicas MUST NOT exceed 5 for executor (token burn prevention)
RULE: scaleUp stabilizationWindowSeconds MUST be >= 120s
RULE: before scaling executors, run bash scripts/verify-executor-safety.sh — MUST exit 0
CHECK_HPA_STATUS:
TOOL: kubectl
RUN: kubectl get hpa -n ge-agents
RUN: kubectl describe hpa ge-executor-hpa -n ge-agents
K8S:PDB (Pod Disruption Budget)¶
PURPOSE: ensure minimum availability during voluntary disruptions (node drain, rolling update)
GE_EXECUTOR_PDB: minAvailable=2 (always keep 2 executor pods running)
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ge-executor-pdb
  namespace: ge-agents
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: ge-executor
WHEN_TO_USE_PDB: - Services with >= 2 replicas that must maintain availability - Stateful services where disruption = data risk
WHEN_NOT_TO_USE: - Singletons with Recreate strategy (dolly, orchestrator singletons) - Stateless batch jobs that can restart safely
K8S:RESOURCE_MANAGEMENT¶
NAMESPACE_ORGANIZATION¶
| Namespace | Purpose | Owner |
|---|---|---|
| ge-agents | Agent executor, orchestrator, dedicated agents | gerco |
| ge-system | Redis, Vault, admin-ui, core infrastructure | gerco |
| ge-ingress | Traefik IngressController | stef/gerco |
| ge-monitoring | Grafana, Loki, Prometheus, Promtail | ron |
| ge-wiki | MkDocs wiki brain | gerco |
| ge-hosting | Client hosting (shared) | rutger |
| sh-{client} | Shared hosting per client | rutger |
| ded-{client} | Dedicated hosting per client | rutger |
RESOURCE_TIERS (client workloads)¶
| Tier | CPU Req | CPU Limit | Mem Req | Mem Limit | Replicas |
|---|---|---|---|---|---|
| small | 10m | 100m | 32Mi | 128Mi | 1 |
| medium | 50m | 250m | 64Mi | 256Mi | 2 |
| large | 100m | 500m | 128Mi | 512Mi | 2 |
| xlarge | 200m | 1000m | 256Mi | 1Gi | 3 |
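The medium tier from the table above, sketched as container resources:

```yaml
resources:
  requests:
    cpu: "50m"
    memory: "64Mi"
  limits:
    cpu: "250m"
    memory: "256Mi"
```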
CHECK resource usage:
TOOL: kubectl
RUN: kubectl top pods -n {namespace}
RUN: kubectl top nodes
RUN: kubectl describe node | grep -A 10 "Allocated resources"
K8S:CRONJOBS_AND_JOBS¶
CRONJOB_PATTERNS¶
GE uses CronJobs for scheduled maintenance:
apiVersion: batch/v1
kind: CronJob
metadata:
  name: {job-name}
  namespace: {namespace}
spec:
  schedule: "{cron-expression}"
  concurrencyPolicy: Forbid        # never run concurrent instances
  successfulJobsHistoryLimit: 3    # keep last 3 successful
  failedJobsHistoryLimit: 3        # keep last 3 failed
  startingDeadlineSeconds: 120     # skip if >2min late
  jobTemplate:
    spec:
      backoffLimit: 2              # max 2 retries
      activeDeadlineSeconds: 300   # kill after 5min
      template:
        spec:
          restartPolicy: OnFailure
ACTIVE_CRONJOBS: - cost-monitor: agent cost tracking and alerting - pod-refresh: periodic pod recycling for memory hygiene - k8s-health-dump: host cron (scripts/k8s-health-dump.sh) for infrastructure page
CHECK_CRONJOB_STATUS:
TOOL: kubectl
RUN: kubectl get cronjobs -n ge-agents
RUN: kubectl get jobs -n ge-agents --sort-by=.metadata.creationTimestamp
RUN: kubectl logs job/{job-name} -n ge-agents
SUSPEND/UNSUSPEND:
TOOL: kubectl
RUN: kubectl patch cronjob {name} -n {namespace} -p '{"spec":{"suspend":true}}'
RUN: kubectl patch cronjob {name} -n {namespace} -p '{"spec":{"suspend":false}}'
K8S:DEBUGGING¶
ESSENTIAL_KUBECTL_PATTERNS¶
POD_STATUS:
TOOL: kubectl
RUN: kubectl get pods -n {namespace} -o wide
RUN: kubectl describe pod {pod-name} -n {namespace}
RUN: kubectl get events -n {namespace} --sort-by='.lastTimestamp' | tail -20
LOGS:
TOOL: kubectl
RUN: kubectl logs {pod-name} -n {namespace} --tail=100
RUN: kubectl logs {pod-name} -n {namespace} --previous # logs from crashed container
RUN: kubectl logs -l app={label} -n {namespace} --tail=50 # logs by label
EXEC_INTO_POD:
TOOL: kubectl
RUN: kubectl exec -it {pod-name} -n {namespace} -- /bin/sh
RULE: for debugging only — NEVER hot-patch running pods
RESOURCE_INSPECTION:
TOOL: kubectl
RUN: kubectl get all -n {namespace}
RUN: kubectl get endpoints -n {namespace}
RUN: kubectl get networkpolicy -n {namespace}
COMMON_FAILURE_PATTERNS¶
CRASHLOOPBACKOFF:
1. CHECK: kubectl logs {pod} -n {ns} --previous
2. CHECK: kubectl describe pod {pod} -n {ns} (look at Events section)
3. COMMON_CAUSES: missing secrets, wrong image, OOMKilled, failed health probe
4. IF: OOMKilled THEN: increase memory limits
5. IF: probe failure THEN: check probe config matches actual endpoint
IMAGEPULLBACKOFF:
1. CHECK: is image name/tag correct?
2. CHECK: for local images, is imagePullPolicy=IfNotPresent?
3. CHECK: was image imported to k3s? (sudo k3s ctr images ls | grep {image})
PENDING (pod won't schedule):
1. CHECK: kubectl describe pod — look for scheduling failures
2. COMMON: insufficient CPU/memory, node selector mismatch, PVC pending
3. CHECK: kubectl describe node — see allocatable vs allocated
EVICTED:
1. CHECK: node under disk/memory pressure?
2. CHECK: kubectl get events --field-selector reason=Evicted -n {ns}
3. FIX: increase node resources or reduce pod requests
LOG_AGGREGATION¶
STACK: Promtail (collector) -> Loki (storage) -> Grafana (visualization) NAMESPACE: ge-monitoring
Promtail collects logs from all pods automatically via DaemonSet. Loki stores logs with label-based indexing. Grafana provides LogQL query interface.
QUERY_EXAMPLE (LogQL in Grafana):
{namespace="ge-agents", app="ge-executor"} |= "error"
{namespace="ge-agents"} | json | level="error" | line_format "{{.msg}}"
DIRECT_LOKI_QUERY:
TOOL: curl
RUN: curl -sG "https://loki:3100/loki/api/v1/query_range" \
  --data-urlencode 'query={namespace="ge-agents"} |= "error"' \
  --data-urlencode "start=$(date -d '1 hour ago' +%s)000000000" \
  -k
NOTE: start/end take a Unix epoch (seconds or nanoseconds) or an RFC3339 timestamp — "1hr" is not a valid value
K8S:IMAGE_MANAGEMENT¶
BUILD_AND_DEPLOY (Zone 1 k3s)¶
CRITICAL: NEVER use kubectl cp or hot-patch running pods — Python caches modules in sys.modules at startup
BUILD_EXECUTOR:
This script: 1. docker build -t ge-bootstrap-agent-executor:latest 2. docker save | k3s ctr images import 3. Image available to k3s immediately
DEPLOY_AFTER_BUILD:
TOOL: kubectl
RUN: kubectl rollout restart deployment/ge-executor -n ge-agents
VERIFY: kubectl rollout status deployment/ge-executor -n ge-agents --timeout=120s
BUILD_AND_DEPLOY (Zone 2/3 UpCloud MKE)¶
FOR_UPCLOUD: images pushed to container registry (not local import) FLOW: build -> tag -> push to registry -> kubectl set image or rollout restart
K8S:K3S_SPECIFIC¶
K3S_VS_FULL_K8S¶
DIFFERENCES:
- k3s bundles Traefik as default ingress (GE uses custom Traefik config)
- k3s uses SQLite/etcd for cluster state (single-node = SQLite)
- k3s has lighter resource footprint (~512MB RAM for control plane)
- k3s images imported via k3s ctr images import (no registry needed for Zone 1)
- k3s kubeconfig at /etc/rancher/k3s/k3s.yaml (not ~/.kube/config)
GOTCHAS: - ClusterIP broken from inside pods (GE-specific finding, documented in MEMORY.md) - k3s auto-updates Traefik — pin version in HelmChartConfig if needed - k3s uses flannel CNI by default — NetworkPolicies require --flannel-backend=none + separate CNI
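A hedged sketch of the HelmChartConfig pin mentioned in GOTCHAS (the image tag is an assumption; validate against the k3s release notes before pinning):

```yaml
apiVersion: helm.cattle.io/v1
kind: HelmChartConfig
metadata:
  name: traefik
  namespace: kube-system   # k3s deploys its bundled Traefik chart here
spec:
  valuesContent: |-
    image:
      tag: "v2.10.7"   # assumed tag; pin to the version you validated
```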
RESTART_K3S:
TOOL: bash
RUN: sudo systemctl restart k3s
VERIFY: sudo systemctl status k3s
VERIFY: kubectl get nodes
UPCLOUD_MKE_SPECIFIC¶
DIFFERENCES_FROM_K3S: - UpCloud manages control plane (API server, etcd, scheduler) - Node pools managed via Terraform (arjan provisions) - Uses Calico CNI (full NetworkPolicy support out of box) - Container registry integration via UpCloud Container Registry - LoadBalancer service type creates UpCloud Load Balancer automatically
K8S:AGENT_RULES¶
FOR_GERCO (Zone 1)¶
ON_DEPLOYMENT_TASK:
1. READ this page for k8s patterns
2. WRITE manifest with mandatory fields (resources, probes, security context)
3. CHECK NetworkPolicy allows required traffic
4. APPLY manifest: kubectl apply -f {manifest} -n {namespace}
5. VERIFY pod running and healthy
6. NEVER touch Zone 2 or Zone 3
FOR_THIJMEN (Zone 2 staging)¶
ON_WORKLOAD_TASK: 1. READ this page for k8s patterns 2. RECEIVE cluster credentials from arjan 3. DEPLOY workloads to staging cluster 4. VERIFY with full integration test 5. HAND OFF to rutger for production promotion 6. NEVER provision clusters (arjan's domain)
FOR_RUTGER (Zone 3 production)¶
ON_PRODUCTION_TASK: 1. READ this page for k8s patterns 2. RECEIVE validated workload from thijmen 3. APPLY to production with zero-downtime strategy 4. VERIFY health and performance 5. HAND OFF to stef for network configuration 6. ALL changes require Victoria security review
K8S:CROSS_REFERENCES¶
DEPLOYMENT_STRATEGIES: domains/infrastructure/deployment-strategies.md — when to use rolling vs blue-green BACKUP: domains/infrastructure/backup-disaster-recovery.md — etcd backup, pod state recovery NETWORK_SECURITY: domains/networking/network-security.md — NetworkPolicies, Traefik config TLS: domains/networking/tls-certificates.md — cert-manager integration MONITORING: domains/monitoring/index.md — Prometheus, Grafana, alerting INCIDENT_RESPONSE: domains/incident-response/index.md — when deployments go wrong