DOMAIN:SYSTEM_INTEGRITY — HEALTH_CHECKS

OWNER: ron
ALSO_USED_BY: gerco, thijmen, rutger, annegreet
UPDATED: 2026-03-26
SCOPE: system health check patterns, probe types, cascading health, dependency aggregation, SLA measurement


HEALTH_CHECKS:CORE_PRINCIPLE

PURPOSE: continuously verify that all system components are functioning correctly and meeting service level expectations

RULE: health checks MUST be deterministic — same state produces same result
RULE: health checks MUST be fast — a slow health check is itself a health problem
RULE: health checks MUST NOT modify state — read-only operations only
RULE: health checks MUST report structured results, not just pass/fail

CHECK: does the health check itself have a timeout?
IF: no THEN: a hanging health check blocks the monitoring loop — always set timeouts
IF: yes THEN: timeout should be < 50% of the check interval
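A minimal sketch of enforcing these rules in a monitoring loop, written in Python for illustration; `run_with_timeout` and the interval constant are hypothetical names, not part of any GE codebase:

```python
import concurrent.futures

CHECK_INTERVAL_S = 10
TIMEOUT_S = CHECK_INTERVAL_S * 0.5  # timeout < 50% of the check interval

def run_with_timeout(check, timeout_s=TIMEOUT_S):
    """Run one health-check callable and return a structured result.

    A hanging check is reported as a failure instead of blocking the loop.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(check)
    try:
        return {"status": "pass", "detail": future.result(timeout=timeout_s)}
    except concurrent.futures.TimeoutError:
        return {"status": "fail", "detail": f"timed out after {timeout_s}s"}
    except Exception as exc:
        return {"status": "fail", "detail": str(exc)}
    finally:
        pool.shutdown(wait=False)  # never wait on a hung check thread
```

Returning a `{"status", "detail"}` dict rather than a bare boolean satisfies the structured-results rule above.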


HEALTH_CHECKS:KUBERNETES_PROBE_TYPES

LIVENESS_PROBE

PURPOSE: detect if a container is stuck and needs to be restarted

RULE: liveness probes answer "is this process alive and not deadlocked?"
RULE: liveness probes should NOT check dependencies — only the process itself
RULE: if the liveness probe fails, kubelet kills and restarts the container

CHECK: does the liveness probe check external dependencies (DB, Redis)?
IF: yes THEN: ANTI_PATTERN — a Redis outage will cascade-restart all pods
FIX: liveness probes check only the local process health

CONFIGURATION:

livenessProbe:  
  httpGet:  
    path: /healthz  
    port: 3000  
  initialDelaySeconds: 30  
  periodSeconds: 10  
  timeoutSeconds: 5  
  failureThreshold: 3  
  successThreshold: 1  

RULE: initialDelaySeconds must account for application startup time
RULE: failureThreshold * periodSeconds = time before restart (3 * 10s = 30s in the example above)
RULE: set timeoutSeconds generously enough to avoid false positives under load

ANTI_PATTERN: setting failureThreshold to 1
FIX: one slow response should not trigger a restart — use threshold >= 3

ANTI_PATTERN: liveness probe that allocates memory or opens connections
FIX: probe handler should be minimal — return 200 if main loop is responsive

READINESS_PROBE

PURPOSE: detect if a container is ready to receive traffic

RULE: readiness probes answer "can this container handle requests right now?"
RULE: readiness probes SHOULD check critical dependencies (DB connection, cache warmth)
RULE: if the readiness probe fails, pod is removed from Service endpoints (no traffic routed)
RULE: pod is NOT restarted — it stays running and is re-added when probe passes

CONFIGURATION:

readinessProbe:  
  httpGet:  
    path: /ready  
    port: 3000  
  initialDelaySeconds: 5  
  periodSeconds: 5  
  timeoutSeconds: 3  
  failureThreshold: 2  
  successThreshold: 1  

CHECK: does the readiness probe verify database connectivity?
IF: no THEN: pod may receive requests it cannot serve — add DB health check
IF: yes THEN: ensure the DB check has its own timeout shorter than probe timeout

CHECK: does the readiness probe verify Redis connectivity?
IF: pod depends on Redis for task dispatch THEN: yes, check Redis
IF: pod can function without Redis (degraded mode) THEN: no, keep probe passing

ANTI_PATTERN: readiness probe identical to liveness probe
FIX: readiness checks dependencies, liveness checks only the process
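The split between the two probes can be sketched as one handler with two distinct paths; `check_db` and `check_redis` are hypothetical stand-ins for real dependency checks with their own timeouts:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_db():     # hypothetical: replace with a real, timeout-bounded DB ping
    return True

def check_redis():  # hypothetical: replace with a real, timeout-bounded Redis PING
    return True

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # liveness: process is up and serving -- no dependency checks
            self._reply(200, {"status": "pass"})
        elif self.path == "/ready":
            # readiness: verify critical dependencies before accepting traffic
            deps = {"db": check_db(), "redis": check_redis()}
            ok = all(deps.values())
            self._reply(200 if ok else 503,
                        {"status": "pass" if ok else "fail", "deps": deps})
        else:
            self._reply(404, {"status": "unknown path"})

    def _reply(self, code, body):
        payload = json.dumps(body).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):  # keep probe traffic out of the logs
        pass
```

A failing `/ready` returns 503 so the pod drops out of Service endpoints, while `/healthz` keeps passing and the pod is not restarted.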

STARTUP_PROBE

PURPOSE: give slow-starting containers time to initialize before liveness/readiness kick in

RULE: startup probes disable liveness and readiness checks until the startup probe succeeds
RULE: use startup probes for containers that need to load large models, warm caches, or run migrations

CONFIGURATION:

startupProbe:  
  httpGet:  
    path: /healthz  
    port: 3000  
  periodSeconds: 5  
  failureThreshold: 30  
  # Total startup budget: 5 * 30 = 150 seconds  

CHECK: does the container take > 30 seconds to start?
IF: yes THEN: startup probe prevents premature liveness kills during initialization
IF: no THEN: startup probe is optional — liveness initialDelaySeconds may suffice

ANTI_PATTERN: no startup probe on containers that load ML models or large configs
FIX: add a startup probe; without one, liveness kills the container before it finishes loading


HEALTH_CHECKS:GE_COMPONENT_HEALTH

ADMIN_UI_HEALTH

TOOL: curl
RUN: curl -sf http://admin-ui.ge-system.svc.cluster.local:3000/api/health

CHECK: HTTP 200 response within 5 seconds
CHECK: response includes DB connectivity status
CHECK: response includes Redis connectivity status
CHECK: response includes uptime and version

IF: DB unreachable THEN: admin-ui cannot serve data — readiness should fail
IF: Redis unreachable THEN: task dispatch broken — alert but do not restart

GE_ORCHESTRATOR_HEALTH

TOOL: redis-cli -p 6381 -a $REDIS_PASSWORD
RUN: GET ge:orchestrator:heartbeat

CHECK: heartbeat timestamp is within last 60 seconds
CHECK: both replicas are reporting heartbeats
CHECK: consumer group ge-orchestrator exists on ge:work:incoming

IF: heartbeat stale > 60s THEN: orchestrator may be hung — check pod logs
IF: only 1 replica reporting THEN: HA degraded — investigate missing replica
IF: consumer group missing THEN: CRITICAL — orchestrator cannot consume work
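A staleness classifier for the heartbeat value, assuming the key holds an ISO-8601 UTC timestamp (the actual format written by ge-orchestrator may differ; adjust the parsing accordingly):

```python
from datetime import datetime, timezone, timedelta

MAX_HEARTBEAT_AGE = timedelta(seconds=60)

def heartbeat_status(heartbeat_iso, now=None):
    """Classify a heartbeat timestamp as fresh or stale.

    Assumes an ISO-8601 UTC timestamp such as "2026-03-26T11:59:30+00:00".
    """
    now = now or datetime.now(timezone.utc)
    age = now - datetime.fromisoformat(heartbeat_iso)
    return {"stale": age > MAX_HEARTBEAT_AGE, "age_seconds": age.total_seconds()}
```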

GE_EXECUTOR_HEALTH

TOOL: kubectl
RUN: kubectl get pods -n ge-agents -l app=ge-executor -o json | jq '.items[] | {name: .metadata.name, phase: .status.phase, restarts: .status.containerStatuses[0].restartCount, ready: .status.containerStatuses[0].ready}'

CHECK: all executor pods are in Running phase
CHECK: restart count is not rapidly increasing
CHECK: ready status is true

IF: restarts > 10 in last hour THEN: crashloop — check logs for root cause
IF: not ready THEN: executor cannot receive work — check readiness probe failures

WIKI_BRAIN_HEALTH

TOOL: curl
RUN: curl -sf http://192.168.1.85:30080/

CHECK: HTTP 200 response
CHECK: page content includes MkDocs markup
CHECK: search index is available

IF: wiki unreachable THEN: knowledge layer offline — agents cannot access learnings
IF: search broken THEN: JIT injection cannot find relevant learnings

REDIS_HEALTH

TOOL: redis-cli -p 6381 -a $REDIS_PASSWORD
RUN: PING
RUN: INFO memory
RUN: INFO clients

CHECK: PING returns PONG
CHECK: used_memory_rss < 80% of maxmemory
CHECK: connected_clients < 100 (reasonable for single-node)
CHECK: rejected_connections = 0

IF: PING fails THEN: CRITICAL — Redis is down, all task dispatch halted
IF: memory > 80% THEN: HIGH — OOM risk, check stream lengths
IF: rejected_connections > 0 THEN: connection limit hit — investigate client leaks

POSTGRESQL_HEALTH

TOOL: psql or admin-ui API
RUN: psql -h localhost -U ge_admin -d ge_production -c "SELECT 1"

CHECK: query returns within 1 second
CHECK: connection pool is not exhausted
CHECK: replication lag is acceptable (if replicated)
CHECK: disk usage < 80%

IF: connection fails THEN: CRITICAL — DB is SSOT, everything depends on it
IF: query slow > 5s THEN: HIGH — performance degradation, check active queries


HEALTH_CHECKS:CASCADING_HEALTH

DEPENDENCY_GRAPH

PURPOSE: understand how component failures cascade through the system

DEPENDENCY_MAP:

admin-ui → PostgreSQL (SSOT)  
admin-ui → Redis (task dispatch)  
ge-orchestrator → Redis (stream routing)  
ge-orchestrator → PostgreSQL (team routing, DAG)  
ge-executor → Redis (task consumption)  
ge-executor → LLM providers (Claude, OpenAI, Gemini)  
wiki → filesystem (MkDocs content)  
all agents → ge-executor → Redis → ge-orchestrator  

RULE: a component is HEALTHY only if it AND all its critical dependencies are healthy
RULE: a component is DEGRADED if non-critical dependencies are down
RULE: a component is UNHEALTHY if any critical dependency is down

CHECK: does a single Redis failure cascade to all components?
IF: yes THEN: Redis is a single point of failure — monitor with highest priority

HEALTH_AGGREGATION

PURPOSE: roll up individual health checks into a system-wide health score

TECHNIQUE:

1. check each component individually  
2. classify: HEALTHY (all checks pass), DEGRADED (some checks fail),  
   UNHEALTHY (critical checks fail)  
3. propagate: if dependency is UNHEALTHY, dependent is at most DEGRADED  
4. aggregate: system health = worst component health  
5. report: structured JSON with per-component status  
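The five steps above can be sketched as a small aggregation function; the component list and critical-dependency map are abridged from the dependency map, and the numeric status encoding is illustrative:

```python
# Health levels ordered worst-first so min() yields the worst status.
UNHEALTHY, DEGRADED, HEALTHY = 0, 1, 2
LABELS = {2: "HEALTHY", 1: "DEGRADED", 0: "UNHEALTHY"}

# Critical dependencies per component (abridged from the dependency map).
CRITICAL_DEPS = {
    "admin-ui": ["postgresql", "redis"],
    "ge-orchestrator": ["redis", "postgresql"],
    "ge-executor": ["redis"],
    "postgresql": [],
    "redis": [],
}

def aggregate(own_status):
    """Propagate dependency health and compute system health.

    own_status maps component -> its own check result (HEALTHY/DEGRADED/UNHEALTHY).
    """
    effective = {}
    for comp, deps in CRITICAL_DEPS.items():
        status = own_status[comp]
        # propagate: if a critical dependency is UNHEALTHY, dependent is at most DEGRADED
        if any(own_status[d] == UNHEALTHY for d in deps):
            status = min(status, DEGRADED)
        effective[comp] = status
    system = min(effective.values())  # aggregate: system health = worst component
    return {"system": LABELS[system],
            "components": {c: LABELS[s] for c, s in effective.items()}}
```

With this encoding, a Redis outage marks Redis UNHEALTHY, downgrades admin-ui, ge-orchestrator, and ge-executor to DEGRADED, and makes the system UNHEALTHY, mirroring the cascade described above.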

TOOL: scripts/k8s-health-dump.sh
RUN: bash scripts/k8s-health-dump.sh
OUTPUT: public/k8s-health.json — consumed by admin-ui infrastructure page

CHECK: health dump runs on host cron (not inside pods)
IF: health dump fails THEN: admin-ui shows stale data — monitor the cron job itself

NOTE: the k8s ClusterIP (10.43.0.1:443) is BROKEN from inside pods — that is why the health dump runs on the host

ANTI_PATTERN: each component only checks itself, no dependency awareness
FIX: health endpoint should report dependency status alongside own status

ANTI_PATTERN: health aggregation that marks system HEALTHY when a critical component is DEGRADED
FIX: system health = min(component_health) for critical path components


HEALTH_CHECKS:HEALTH_DASHBOARDS

DASHBOARD_DESIGN_PRINCIPLES

RULE: dashboards show current state, not historical data (use metrics for history)
RULE: traffic-light pattern: green (healthy), yellow (degraded), red (unhealthy)
RULE: dashboard must load in < 2 seconds — a slow dashboard is useless during incidents
RULE: show dependency relationships visually — failures cascade, dashboards should show the cascade

COMPONENTS_TO_DISPLAY:

ROW 1: System Health (overall) | Active Incidents | Error Budget Remaining  
ROW 2: PostgreSQL | Redis | Admin-UI | Orchestrator | Wiki  
ROW 3: Executor Pods (per-pod status) | Agent Status (active/unavailable count)  
ROW 4: Stream Depths | Cost Today | Hook Depth Violations  

CHECK: dashboard data source is the health dump JSON, not live k8s queries
IF: dashboard queries k8s directly from browser THEN: CORS issues and auth complexity
FIX: use the host cron → JSON → static serve pattern (already implemented)

ADMIN_UI_INFRASTRUCTURE_PAGE

TOOL: admin-ui
PATH: /infrastructure
DATA_SOURCE: public/k8s-health.json

CHECK: JSON file is refreshed every minute by host cron
IF: file age > 5 minutes THEN: cron may have failed — check crontab
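A file-age check matching this rule, usable from a watchdog script; the 5-minute threshold is the one described above, the function name is illustrative:

```python
import time
from pathlib import Path

MAX_AGE_S = 5 * 60  # alert if the dump is older than 5 minutes

def dump_is_stale(path, now=None, max_age_s=MAX_AGE_S):
    """Return True if the health dump file is missing or older than max_age_s."""
    p = Path(path)
    if not p.exists():
        return True  # missing file counts as stale
    now = now if now is not None else time.time()
    return (now - p.stat().st_mtime) > max_age_s
```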


HEALTH_CHECKS:SLA_MEASUREMENT

SERVICE_LEVEL_INDICATORS

PURPOSE: measure actual service performance against SLA targets

SLI_DEFINITIONS:

AVAILABILITY:     successful_requests / total_requests (target: 99.5%)  
TASK_LATENCY:     time from task creation to completion (target: p95 < 5 min)  
DISPATCH_LATENCY: time from XADD to executor pickup (target: p95 < 30s)  
HEALTH_CHECK_PASS_RATE: passing_checks / total_checks (target: > 98%)  
MTTR:             mean time from incident detection to resolution (target: < 30 min)  

TOOL: query session_learnings and work items for timing data

SELECT  
    DATE_TRUNC('hour', created_at) as hour,  
    COUNT(*) FILTER (WHERE outcome = 'success') as successful,  
    COUNT(*) as total,  
    ROUND(100.0 * COUNT(*) FILTER (WHERE outcome = 'success') / COUNT(*), 2) as availability_pct  
FROM session_learnings  
WHERE created_at > NOW() - INTERVAL '24 hours'  
GROUP BY hour  
ORDER BY hour;  

ERROR_BUDGET

PURPOSE: track how much downtime or failure is permitted within the SLA window

FORMULA:

error_budget = 1 - SLO_target  
monthly_budget_minutes = error_budget * 30 * 24 * 60  

Example for 99.5% SLO:  
error_budget = 0.005  
monthly_budget_minutes = 0.005 * 43200 = 216 minutes (3.6 hours)  

CHECK: how much error budget remains this month?
IF: < 25% remaining THEN: freeze non-critical changes, reduce blast radius
IF: exhausted THEN: focus entirely on reliability, no new features

RULE: error budget resets monthly
RULE: error budget consumption rate determines urgency of reliability work

ANTI_PATTERN: ignoring error budget until it is exhausted
FIX: track burn rate daily — a 2x burn rate means budget exhausts in 15 days
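The budget and burn-rate arithmetic as a sketch; the function names are illustrative. At a 2x burn rate (14.4 min/day against the 7.2 min/day baseline for 99.5%), a full 216-minute budget lasts 15 days, matching the FIX above:

```python
SLO_TARGET = 0.995     # 99.5% availability
DAYS_IN_WINDOW = 30

def monthly_budget_minutes(slo=SLO_TARGET):
    """error_budget = 1 - SLO; budget in minutes over a 30-day window."""
    return (1 - slo) * DAYS_IN_WINDOW * 24 * 60

def days_to_exhaustion(remaining_minutes, burn_minutes_per_day):
    """How long the remaining budget lasts at the current burn rate."""
    if burn_minutes_per_day <= 0:
        return float("inf")
    return remaining_minutes / burn_minutes_per_day
```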

SLA_REPORTING

RULE: generate weekly SLA report covering all SLIs
RULE: report includes trend (improving/stable/degrading)
RULE: report includes top incidents that consumed error budget
RULE: report is stored in wiki at wiki/docs/development/reports/

ANTI_PATTERN: measuring SLA only when asked
FIX: continuous measurement — SLA data must be always available

ANTI_PATTERN: SLA covers only availability, not latency or correctness
FIX: multi-dimensional SLA captures real user experience better