DOMAIN:SYSTEM_INTEGRITY — HEALTH_CHECKS¶
OWNER: ron
ALSO_USED_BY: gerco, thijmen, rutger, annegreet
UPDATED: 2026-03-26
SCOPE: system health check patterns, probe types, cascading health, dependency aggregation, SLA measurement
HEALTH_CHECKS:CORE_PRINCIPLE¶
PURPOSE: continuously verify that all system components are functioning correctly and meeting service level expectations
RULE: health checks MUST be deterministic — same state produces same result
RULE: health checks MUST be fast — a slow health check is itself a health problem
RULE: health checks MUST NOT modify state — read-only operations only
RULE: health checks MUST report structured results, not just pass/fail
CHECK: does the health check itself have a timeout?
IF: no THEN: a hanging health check blocks the monitoring loop — always set timeouts
IF: yes THEN: timeout should be < 50% of the check interval
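EXAMPLE: a minimal Python sketch of the timeout rules above. The `run_check` helper and the structured result shape are illustrative, not an existing GE utility; the point is that a hung check returns a failure instead of blocking the monitoring loop, and the result is structured per the RULE above, not a bare pass/fail.

```python
import concurrent.futures

CHECK_INTERVAL_S = 10
# RULE: check timeout < 50% of the check interval
CHECK_TIMEOUT_S = CHECK_INTERVAL_S * 0.5

def run_check(check_fn, timeout=CHECK_TIMEOUT_S):
    """Run one health check with a hard timeout; always return a structured result."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(check_fn)
    try:
        return {"status": "pass", "detail": future.result(timeout=timeout)}
    except concurrent.futures.TimeoutError:
        # A hanging check must not block the monitoring loop.
        return {"status": "fail", "detail": f"timed out after {timeout}s"}
    except Exception as exc:
        return {"status": "fail", "detail": str(exc)}
    finally:
        # Do not wait for a hung check thread before returning.
        pool.shutdown(wait=False)
```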
HEALTH_CHECKS:KUBERNETES_PROBE_TYPES¶
LIVENESS_PROBE¶
PURPOSE: detect if a container is stuck and needs to be restarted
RULE: liveness probes answer "is this process alive and not deadlocked?"
RULE: liveness probes should NOT check dependencies — only the process itself
RULE: if the liveness probe fails, kubelet kills and restarts the container
CHECK: does the liveness probe check external dependencies (DB, Redis)?
IF: yes THEN: ANTI_PATTERN — a Redis outage will cascade-restart all pods
FIX: liveness probes check only the local process health
CONFIGURATION:
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
  successThreshold: 1
RULE: initialDelaySeconds must account for application startup time
RULE: failureThreshold * periodSeconds = time before restart (30s in example above)
RULE: set timeoutSeconds generously enough to avoid false positives under load
ANTI_PATTERN: setting failureThreshold to 1
FIX: one slow response should not trigger a restart — use threshold >= 3
ANTI_PATTERN: liveness probe that allocates memory or opens connections
FIX: probe handler should be minimal — return 200 if main loop is responsive
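EXAMPLE: a process-only /healthz handler sketched with Python's stdlib http.server. The handler name is illustrative; the port matches the probe config above. It touches no dependencies and allocates nothing beyond the response, per the FIX above.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class LivenessHandler(BaseHTTPRequestHandler):
    """Process-only liveness: no DB or Redis calls, no extra allocations."""

    def do_GET(self):
        if self.path == "/healthz":
            # Reaching this handler means the serving loop is responsive.
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep probe traffic out of the logs

# To serve on the probe port:
#   HTTPServer(("", 3000), LivenessHandler).serve_forever()
```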
READINESS_PROBE¶
PURPOSE: detect if a container is ready to receive traffic
RULE: readiness probes answer "can this container handle requests right now?"
RULE: readiness probes SHOULD check critical dependencies (DB connection, cache warmth)
RULE: if the readiness probe fails, pod is removed from Service endpoints (no traffic routed)
RULE: pod is NOT restarted — it stays running and is re-added when probe passes
CONFIGURATION:
readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2
  successThreshold: 1
CHECK: does the readiness probe verify database connectivity?
IF: no THEN: pod may receive requests it cannot serve — add DB health check
IF: yes THEN: ensure the DB check has its own timeout shorter than probe timeout
CHECK: does the readiness probe verify Redis connectivity?
IF: pod depends on Redis for task dispatch THEN: yes, check Redis
IF: pod can function without Redis (degraded mode) THEN: no, keep probe passing
ANTI_PATTERN: readiness probe identical to liveness probe
FIX: readiness checks dependencies, liveness checks only the process
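EXAMPLE: the readiness decision above as a sketch. Critical dependencies gate readiness; a failed non-critical dependency (the degraded-mode Redis case above) leaves the probe passing. Dependency names and the critical set are illustrative.

```python
def readiness(dep_status, critical=frozenset({"postgres"})):
    """Decide the /ready outcome from dependency check results.

    dep_status: {"postgres": True, "redis": False, ...}
    A failed critical dependency fails readiness; a failed non-critical
    one keeps the pod serving in degraded mode.
    """
    failed = {name for name, ok in dep_status.items() if not ok}
    return {
        "ready": not (failed & critical),
        "degraded": sorted(failed - critical),
        "failed_critical": sorted(failed & critical),
    }
```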
STARTUP_PROBE¶
PURPOSE: give slow-starting containers time to initialize before liveness/readiness kick in
RULE: startup probes disable liveness and readiness checks until the startup probe succeeds
RULE: use startup probes for containers that need to load large models, warm caches, or run migrations
CONFIGURATION:
startupProbe:
  httpGet:
    path: /healthz
    port: 3000
  periodSeconds: 5
  failureThreshold: 30
  # Total startup budget: 5 * 30 = 150 seconds
CHECK: does the container take > 30 seconds to start?
IF: yes THEN: startup probe prevents premature liveness kills during initialization
IF: no THEN: startup probe is optional — liveness initialDelaySeconds may suffice
ANTI_PATTERN: no startup probe on containers that load ML models or large configs
FIX: add a startup probe with a failureThreshold sized to worst-case load time; without it, liveness kills the container before it finishes loading
HEALTH_CHECKS:GE_COMPONENT_HEALTH¶
ADMIN_UI_HEALTH¶
TOOL: curl
RUN: curl -sf http://admin-ui.ge-system.svc.cluster.local:3000/api/health
CHECK: HTTP 200 response within 5 seconds
CHECK: response includes DB connectivity status
CHECK: response includes Redis connectivity status
CHECK: response includes uptime and version
IF: DB unreachable THEN: admin-ui cannot serve data — readiness should fail
IF: Redis unreachable THEN: task dispatch broken — alert but do not restart
GE_ORCHESTRATOR_HEALTH¶
TOOL: redis-cli -p 6381 -a $REDIS_PASSWORD
RUN: GET ge:orchestrator:heartbeat
CHECK: heartbeat timestamp is within last 60 seconds
CHECK: both replicas are reporting heartbeats
CHECK: consumer group ge-orchestrator exists on ge:work:incoming
IF: heartbeat stale > 60s THEN: orchestrator may be hung — check pod logs
IF: only 1 replica reporting THEN: HA degraded — investigate missing replica
IF: consumer group missing THEN: CRITICAL — orchestrator cannot consume work
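EXAMPLE: classifying the heartbeat check above, assuming the key stores a unix timestamp in seconds (the actual encoding may differ; the helper is illustrative).

```python
import time

HEARTBEAT_STALE_S = 60  # CHECK: heartbeat within last 60 seconds

def heartbeat_status(heartbeat_ts, now=None):
    """Classify ge:orchestrator:heartbeat freshness.

    heartbeat_ts: unix-seconds value read from Redis, or None if the
    key is missing entirely.
    """
    if heartbeat_ts is None:
        return "CRITICAL"  # no heartbeat key at all
    age = (now or time.time()) - float(heartbeat_ts)
    return "OK" if age <= HEARTBEAT_STALE_S else "STALE"
```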
GE_EXECUTOR_HEALTH¶
TOOL: kubectl
RUN: kubectl get pods -n ge-agents -l app=ge-executor -o json | jq '.items[] | {name: .metadata.name, phase: .status.phase, restarts: .status.containerStatuses[0].restartCount, ready: .status.containerStatuses[0].ready}'
CHECK: all executor pods are in Running phase
CHECK: restart count is not rapidly increasing
CHECK: ready status is true
IF: restarts > 10 in last hour THEN: crashloop — check logs for root cause
IF: not ready THEN: executor cannot receive work — check readiness probe failures
WIKI_BRAIN_HEALTH¶
TOOL: curl
RUN: curl -sf http://192.168.1.85:30080/
CHECK: HTTP 200 response
CHECK: page content includes MkDocs markup
CHECK: search index is available
IF: wiki unreachable THEN: knowledge layer offline — agents cannot access learnings
IF: search broken THEN: JIT injection cannot find relevant learnings
REDIS_HEALTH¶
TOOL: redis-cli -p 6381 -a $REDIS_PASSWORD
RUN: PING
RUN: INFO memory
RUN: INFO clients
CHECK: PING returns PONG
CHECK: used_memory_rss < 80% of maxmemory
CHECK: connected_clients < 100 (reasonable for single-node)
CHECK: rejected_connections = 0
IF: PING fails THEN: CRITICAL — Redis is down, all task dispatch halted
IF: memory > 80% THEN: HIGH — OOM risk, check stream lengths
IF: rejected_connections > 0 THEN: connection limit hit — investigate client leaks
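EXAMPLE: applying the 80% memory rule to redis-cli INFO output. The field names (used_memory_rss, maxmemory) are real INFO fields; the parsing and threshold helpers are illustrative.

```python
def parse_info(info_text):
    """Parse redis-cli INFO output (key:value lines) into a dict."""
    out = {}
    for line in info_text.splitlines():
        if ":" in line and not line.startswith("#"):
            key, _, val = line.partition(":")
            out[key.strip()] = val.strip()
    return out

def redis_memory_check(info, threshold=0.8):
    """Apply the 80%-of-maxmemory rule; return (severity, ratio)."""
    maxmem = int(info.get("maxmemory", 0))
    if maxmem == 0:
        return ("UNKNOWN", None)  # maxmemory unset means no eviction limit
    ratio = int(info["used_memory_rss"]) / maxmem
    return ("HIGH" if ratio > threshold else "OK", round(ratio, 2))
```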
POSTGRESQL_HEALTH¶
TOOL: psql or admin-ui API
RUN: psql -h localhost -U ge_admin -d ge_production -c "SELECT 1"
CHECK: query returns within 1 second
CHECK: connection pool is not exhausted
CHECK: replication lag is acceptable (if replicated)
CHECK: disk usage < 80%
IF: connection fails THEN: CRITICAL — DB is SSOT, everything depends on it
IF: query slow > 5s THEN: HIGH — performance degradation, check active queries
HEALTH_CHECKS:CASCADING_HEALTH¶
DEPENDENCY_GRAPH¶
PURPOSE: understand how component failures cascade through the system
DEPENDENCY_MAP:
admin-ui → PostgreSQL (SSOT)
admin-ui → Redis (task dispatch)
ge-orchestrator → Redis (stream routing)
ge-orchestrator → PostgreSQL (team routing, DAG)
ge-executor → Redis (task consumption)
ge-executor → LLM providers (Claude, OpenAI, Gemini)
wiki → filesystem (MkDocs content)
all agents → ge-executor → Redis → ge-orchestrator
RULE: a component is HEALTHY only if it AND all its critical dependencies are healthy
RULE: a component is DEGRADED if non-critical dependencies are down
RULE: a component is UNHEALTHY if any critical dependency is down
CHECK: does a single Redis failure cascade to all components?
IF: yes THEN: Redis is a single point of failure — monitor with highest priority
HEALTH_AGGREGATION¶
PURPOSE: roll up individual health checks into a system-wide health score
TECHNIQUE:
1. check each component individually
2. classify: HEALTHY (all checks pass), DEGRADED (some checks fail), UNHEALTHY (critical checks fail)
3. propagate: if a critical dependency is UNHEALTHY, the dependent is UNHEALTHY; if only non-critical dependencies fail, the dependent is at most DEGRADED
4. aggregate: system health = worst component health
5. report: structured JSON with per-component status
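EXAMPLE: the technique above as a sketch, following the DEPENDENCY_GRAPH rules (a critical dependency's failure propagates; a non-critical failure caps the dependent at DEGRADED; system health is the worst component). Component names and the critical set are illustrative.

```python
RANK = {"HEALTHY": 0, "DEGRADED": 1, "UNHEALTHY": 2}

def worst(*statuses):
    return max(statuses, key=RANK.__getitem__)

def aggregate(own_status, deps, critical):
    """Roll up per-component health into a system-wide status.

    own_status: each component's local check result
    deps:       component -> list of its dependencies
    critical:   set of (component, dependency) pairs that are critical
    """
    effective = {}

    def resolve(name, seen=()):
        if name in effective:
            return effective[name]
        status = own_status[name]
        for dep in deps.get(name, []):
            if dep in seen:
                continue  # cycle guard; the GE graph is acyclic
            dep_status = resolve(dep, seen + (name,))
            if dep_status == "HEALTHY":
                continue
            if (name, dep) in critical:
                status = worst(status, dep_status)      # inherit critical failure
            else:
                status = worst(status, "DEGRADED")      # cap non-critical failure
        effective[name] = status
        return status

    for name in own_status:
        resolve(name)
    # aggregate: system health = worst component health
    return {"system": worst(*effective.values()), "components": effective}
```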
TOOL: scripts/k8s-health-dump.sh
RUN: bash scripts/k8s-health-dump.sh
OUTPUT: public/k8s-health.json — consumed by admin-ui infrastructure page
CHECK: health dump runs on host cron (not inside pods)
IF: health dump fails THEN: admin-ui shows stale data — monitor the cron job itself
NOTE: k8s ClusterIP (10.43.0.1:443) is BROKEN from inside pods — that is why health dump runs on host
ANTI_PATTERN: each component only checks itself, no dependency awareness
FIX: health endpoint should report dependency status alongside own status
ANTI_PATTERN: health aggregation that marks system HEALTHY when a critical component is DEGRADED
FIX: system health = min(component_health) for critical path components
HEALTH_CHECKS:HEALTH_DASHBOARDS¶
DASHBOARD_DESIGN_PRINCIPLES¶
RULE: dashboards show current state, not historical data (use metrics for history)
RULE: traffic-light pattern: green (healthy), yellow (degraded), red (unhealthy)
RULE: dashboard must load in < 2 seconds — a slow dashboard is useless during incidents
RULE: show dependency relationships visually — failures cascade, dashboards should show the cascade
COMPONENTS_TO_DISPLAY:
ROW 1: System Health (overall) | Active Incidents | Error Budget Remaining
ROW 2: PostgreSQL | Redis | Admin-UI | Orchestrator | Wiki
ROW 3: Executor Pods (per-pod status) | Agent Status (active/unavailable count)
ROW 4: Stream Depths | Cost Today | Hook Depth Violations
CHECK: dashboard data source is the health dump JSON, not live k8s queries
IF: dashboard queries k8s directly from browser THEN: CORS issues and auth complexity
FIX: use the host cron → JSON → static serve pattern (already implemented)
ADMIN_UI_INFRASTRUCTURE_PAGE¶
TOOL: admin-ui
PATH: /infrastructure
DATA_SOURCE: public/k8s-health.json
CHECK: JSON file is refreshed every minute by host cron
IF: file age > 5 minutes THEN: cron may have failed — check crontab
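EXAMPLE: the file-age check above as a sketch (helper name is illustrative; a missing file counts as stale, since that means the cron never wrote it).

```python
import os
import time

MAX_AGE_S = 300  # IF: file age > 5 minutes THEN: cron may have failed

def health_dump_fresh(path, max_age=MAX_AGE_S, now=None):
    """Return (fresh, age_seconds) for the health dump JSON."""
    if not os.path.exists(path):
        return (False, None)  # missing file: cron never ran
    age = (now or time.time()) - os.path.getmtime(path)
    return (age <= max_age, age)
```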
HEALTH_CHECKS:SLA_MEASUREMENT¶
SERVICE_LEVEL_INDICATORS¶
PURPOSE: measure actual service performance against SLA targets
SLI_DEFINITIONS:
AVAILABILITY: successful_requests / total_requests (target: 99.5%)
TASK_LATENCY: time from task creation to completion (target: p95 < 5 min)
DISPATCH_LATENCY: time from XADD to executor pickup (target: p95 < 30s)
HEALTH_CHECK_PASS_RATE: passing_checks / total_checks (target: > 98%)
MTTR: mean time from incident detection to resolution (target: < 30 min)
TOOL: query session_learnings and work items for timing data
SELECT
  DATE_TRUNC('hour', created_at) AS hour,
  COUNT(*) FILTER (WHERE outcome = 'success') AS successful,
  COUNT(*) AS total,
  ROUND(100.0 * COUNT(*) FILTER (WHERE outcome = 'success') / COUNT(*), 2) AS availability_pct
FROM session_learnings
WHERE created_at > NOW() - INTERVAL '24 hours'
GROUP BY hour
ORDER BY hour;
ERROR_BUDGET¶
PURPOSE: track how much downtime or failure is permitted within the SLA window
FORMULA:
error_budget = 1 - SLO_target
monthly_budget_minutes = error_budget * 30 * 24 * 60
Example for 99.5% SLO:
error_budget = 0.005
monthly_budget_minutes = 0.005 * 43200 = 216 minutes (3.6 hours)
CHECK: how much error budget remains this month?
IF: < 25% remaining THEN: freeze non-critical changes, reduce blast radius
IF: exhausted THEN: focus entirely on reliability, no new features
RULE: error budget resets monthly
RULE: error budget consumption rate determines urgency of reliability work
ANTI_PATTERN: ignoring error budget until it is exhausted
FIX: track burn rate daily — a 2x burn rate means budget exhausts in 15 days
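EXAMPLE: the FORMULA block and the burn-rate rule above as a sketch (function names are illustrative).

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed failure in the SLA window.

    error_budget = 1 - SLO_target; budget_minutes = error_budget * days * 24 * 60
    """
    return (1 - slo) * window_days * 24 * 60

def days_to_exhaustion(remaining_fraction, burn_rate, window_days=30):
    """Days until the budget runs out at the current burn rate.

    burn_rate 1.0 means consuming exactly one window's budget per window;
    2.0 means twice that, so a full budget exhausts in half the window.
    """
    if burn_rate <= 0:
        return float("inf")
    return remaining_fraction * window_days / burn_rate
```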
SLA_REPORTING¶
RULE: generate weekly SLA report covering all SLIs
RULE: report includes trend (improving/stable/degrading)
RULE: report includes top incidents that consumed error budget
RULE: report is stored in wiki at wiki/docs/development/reports/
ANTI_PATTERN: measuring SLA only when asked
FIX: continuous measurement — SLA data must be always available
ANTI_PATTERN: SLA covers only availability, not latency or correctness
FIX: multi-dimensional SLA captures real user experience better