DOMAIN:MONITORING — OBSERVABILITY_PATTERNS

OWNER: eltjo, annegreet
ALSO_USED_BY: ron, mira, nessa
UPDATED: 2026-03-26
SCOPE: three pillars of observability, structured logging, correlation IDs, distributed tracing, metric naming, dashboard design


OBSERVABILITY_PATTERNS:CORE_PRINCIPLE

PURPOSE: make system behavior observable so that failures can be diagnosed from telemetry data alone, without requiring code changes or redeployment

RULE: observability answers "why is this broken?" — monitoring answers "is this broken?"
RULE: all three pillars (logs, metrics, traces) must share a common correlation ID
RULE: observability data must be structured, not free-text
RULE: instrument once with OpenTelemetry — export to any backend

CHECK: can you diagnose the last incident using only telemetry data?
IF: no THEN: observability is insufficient — add missing signals
IF: yes THEN: observability is working — focus on reducing noise


OBSERVABILITY_PATTERNS:THREE_PILLARS

PILLAR_1_LOGS

PURPOSE: immutable record of discrete events with full context

RULE: all logs MUST be structured (JSON format)
RULE: every log entry MUST include: timestamp, level, service, message, correlation_id
RULE: NEVER log sensitive data (API keys, passwords, PII)
RULE: log at the appropriate level — see LOG_LEVELS section below

STRUCTURED_LOG_FORMAT:

{  
  "timestamp": "2026-03-26T14:30:00.123Z",  
  "level": "error",  
  "service": "ge-orchestrator",  
  "correlation_id": "task-abc123",  
  "agent": "boris",  
  "message": "Failed to route task to agent",  
  "error_class": "ConnectionRefusedError",  
  "error_message": "Redis connection refused on port 6381",  
  "context": {  
    "work_type": "code_review",  
    "stream": "triggers.boris",  
    "retry_count": 2  
  }  
}  

CHECK: are logs structured JSON?
IF: plain text THEN: logs are unsearchable at scale — convert to structured format
IF: JSON THEN: verify all required fields are present

ANTI_PATTERN: logging entire request/response bodies
FIX: log metadata (size, status code, duration) not payloads — payloads bloat storage

ANTI_PATTERN: using string interpolation in log messages (f"Error for {user}")
FIX: use structured fields — {"message": "Error for user", "user": user} — enables aggregation

ANTI_PATTERN: inconsistent timestamp formats across services
FIX: ISO 8601 with timezone (Z or +00:00) everywhere — no local time, no epoch-only
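The structured-log rules above can be sketched as a minimal stdlib-only JSON formatter. The `JsonFormatter` class and the `extra={...}` convention are illustrative assumptions, not an existing GE API; field names follow the STRUCTURED_LOG_FORMAT example.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Render each record as one structured JSON line with the required fields."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            # ISO 8601 with explicit UTC offset, per the timestamp rule
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        # Structured fields instead of string interpolation:
        # pass extra={"context": {...}} at the call site
        if hasattr(record, "context"):
            entry["context"] = record.context
        return json.dumps(entry)
```

Call sites then pass structured fields rather than interpolating them into the message, e.g. `logger.error("Failed to route task to agent", extra={"correlation_id": "task-abc123", "context": {"retry_count": 2}})`.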

LOG_LEVELS

RULE: use consistent log levels across all GE services

LEVEL_DEFINITIONS:

FATAL:   system cannot continue, requires immediate intervention  
         Example: database connection permanently lost, cannot recover  
ERROR:   operation failed, but system continues  
         Example: task dispatch failed for one agent, other agents unaffected  
WARN:    unexpected condition that may become an error  
         Example: Redis memory at 75%, approaching threshold  
INFO:    significant business events (normal operation)  
         Example: task dispatched to agent, session completed  
DEBUG:   detailed diagnostic information for development  
         Example: parsed routing rule, calculated confidence score  
TRACE:   extremely detailed, per-step execution flow  
         Example: entering function X, parameter Y = Z  

RULE: production systems run at INFO level by default
RULE: DEBUG and TRACE are enabled per-service when investigating issues
RULE: FATAL and ERROR always trigger log aggregation alerts
RULE: WARN is reviewed in daily digests

CHECK: is the log level appropriate for the message?
IF: normal operation logged as ERROR THEN: noise — demote to INFO
IF: failure condition logged as INFO THEN: missed alert — promote to ERROR
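The level policy (INFO default in production, per-service DEBUG when investigating) can be sketched with an environment-variable override. The env var naming scheme here is an assumption for illustration; note that stdlib logging has no built-in TRACE level, so TRACE would need `logging.addLevelName`.

```python
import logging
import os

def configure_log_level(service: str) -> int:
    """Production default is INFO; a per-service override such as
    GE_LOG_LEVEL_GE_ORCHESTRATOR=DEBUG enables deeper levels only while
    investigating. The env var name is illustrative, not a GE standard."""
    env_key = "GE_LOG_LEVEL_" + service.upper().replace("-", "_")
    level_name = os.environ.get(env_key, "INFO")
    level = logging.getLevelName(level_name.upper())  # int for known names
    if not isinstance(level, int):
        level = logging.INFO  # unknown names fall back to the production default
    logging.getLogger(service).setLevel(level)
    return level
```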

PILLAR_2_METRICS

PURPOSE: numerical measurements of system behavior over time, enabling trends and alerts

RULE: metrics are aggregated, not individual events — use logs for individual events
RULE: metric names follow a consistent naming convention
RULE: every metric has labels/dimensions for slicing (service, agent, work_type)
RULE: counters only go up; use gauges for values that go up and down

METRIC_TYPES:

COUNTER:    monotonically increasing value (total requests, total errors)  
GAUGE:      point-in-time value (current queue depth, memory usage)  
HISTOGRAM:  distribution of values (request latency, task duration)  
SUMMARY:    like a histogram, but with quantiles pre-calculated client-side (cannot be aggregated across instances)  

NAMING_CONVENTION:

{namespace}_{subsystem}_{metric_name}_{unit}  

Examples:  
ge_orchestrator_tasks_routed_total          (counter)  
ge_executor_task_duration_seconds           (histogram)  
ge_redis_stream_depth_messages              (gauge)  
ge_agent_session_cost_dollars               (histogram)  
ge_admin_ui_http_requests_total             (counter)  
ge_admin_ui_http_request_duration_seconds   (histogram)  

RULE: always include the unit in the metric name (_seconds, _bytes, _total, _messages)
RULE: use _total suffix for counters
RULE: use base units (seconds not milliseconds, bytes not kilobytes)

ANTI_PATTERN: metric names with variable components (agent name IN the metric name)
FIX: use labels — ge_executor_tasks_total{agent="boris"} not ge_executor_boris_tasks_total

ANTI_PATTERN: high-cardinality labels (session_id, request_id as labels)
FIX: unbounded label values multiply the number of time series until storage and queries collapse — use logs for per-request data
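The naming convention and the labels-not-names rule can be sketched with a small validator and a Prometheus-exposition-style renderer. Both functions are illustrative sketches, not GE tooling, and the unit whitelist is assumed from the examples above.

```python
import re

# {namespace}_{subsystem}_{metric_name}_{unit} — counters end in _total,
# other metrics end in a base unit. Unit list assumed from the examples.
METRIC_NAME = re.compile(r"^ge_[a-z][a-z0-9_]*_(total|seconds|bytes|messages|dollars)$")

def valid_metric_name(name: str) -> bool:
    """True when the name follows the GE naming convention."""
    return METRIC_NAME.fullmatch(name) is not None

def render_metric(name: str, labels: dict, value: float) -> str:
    """Render one sample in Prometheus exposition style. Variable parts
    (agent, work_type) go in labels, never in the metric name itself."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"
```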

PILLAR_3_TRACES

PURPOSE: follow a single request across multiple services to identify where time is spent and where failures occur

RULE: every incoming request gets a unique trace_id
RULE: each service call within the request is a span with its own span_id
RULE: spans have parent-child relationships forming a tree
RULE: trace context propagates across service boundaries via HTTP headers or Redis fields

GE_TRACE_FLOW:

trace_id: auto-generated at request origin  

SPAN 1: admin-ui receives task creation request  
  ├── SPAN 2: TaskService writes to PostgreSQL  
  └── SPAN 3: TaskService XADDs to ge:work:incoming  
       └── SPAN 4: ge-orchestrator consumes from stream  
            ├── SPAN 5: router.py selects agent  
            └── SPAN 6: XADDs to triggers.{agent}  
                 └── SPAN 7: ge-executor consumes task  
                      ├── SPAN 8: LLM provider API call  
                      └── SPAN 9: agent writes COMP file  

TRACE_CONTEXT_PROPAGATION:

HTTP: W3C Trace Context headers (traceparent, tracestate)  
Redis: add trace_id and span_id as fields in XADD payload  
DB: include trace_id in query comments for slow query correlation  

CHECK: does the trace survive across Redis stream boundaries?
IF: no THEN: trace_id must be included in XADD payload fields
IF: yes THEN: verify span parent-child relationships are correct

ANTI_PATTERN: tracing only within a single service
FIX: distributed tracing requires context propagation across ALL service boundaries

ANTI_PATTERN: tracing every single operation including health checks
FIX: sample traces — 100% tracing generates excessive data. Sample at 10-20% for normal traffic, 100% for errors
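Carrying trace context across a Redis stream boundary can be sketched stdlib-only: the producer injects trace fields into the XADD payload, the consumer starts a child span from them. The helper names and the `parent_span_id` field are illustrative; in production the W3C Trace Context format via OpenTelemetry propagators would be used instead.

```python
import uuid

def new_trace_context() -> dict:
    """Generated once at the request origin (e.g. admin-ui)."""
    return {"trace_id": uuid.uuid4().hex, "span_id": uuid.uuid4().hex}

def inject_into_payload(ctx: dict, payload: dict) -> dict:
    """Producer side: add trace fields to the XADD payload so the
    consumer can continue the same trace."""
    return {**payload, "trace_id": ctx["trace_id"], "parent_span_id": ctx["span_id"]}

def extract_child_context(payload: dict) -> dict:
    """Consumer side: same trace_id, fresh span_id, parent link preserved —
    this is what forms the span tree across the async boundary."""
    return {
        "trace_id": payload["trace_id"],
        "span_id": uuid.uuid4().hex,
        "parent_span_id": payload["parent_span_id"],
    }
```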


OBSERVABILITY_PATTERNS:CORRELATION

CORRELATION_ID_STRATEGY

PURPOSE: link logs, metrics, and traces for the same logical operation

RULE: the trace_id serves as the primary correlation ID
RULE: every log entry includes the correlation_id field
RULE: every metric emission includes the trace context where applicable
RULE: correlation IDs propagate across async boundaries (Redis streams)

GE_CORRELATION_FIELDS:

trace_id:       unique per end-to-end request  
work_item_id:   unique per work item (persisted in DB)  
session_id:     unique per agent execution session  
agent_name:     which agent is executing  
task_id:        which task triggered the work  

CHECK: given a work_item_id, can you find all logs, metrics, and traces?
IF: no THEN: correlation is broken — ensure all telemetry includes work_item_id
IF: yes THEN: correlation is working — you can reconstruct the full story

ANTI_PATTERN: different correlation ID formats across services
FIX: standardize on UUID v4 for all correlation IDs

ANTI_PATTERN: correlation ID generated midway through the request
FIX: generate at the entry point (admin-ui or API gateway) and propagate everywhere
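Generating the correlation ID once at the entry point and attaching it to every log entry can be sketched with `contextvars`, so async code paths inherit it automatically. The function names and the field set shown are illustrative; the fields follow GE_CORRELATION_FIELDS.

```python
import contextvars
import uuid

# Set once at the entry point, read by every telemetry emitter downstream.
_correlation: contextvars.ContextVar[dict] = contextvars.ContextVar(
    "ge_correlation", default={}
)

def start_request(work_item_id: str, task_id: str) -> dict:
    """Entry point (admin-ui / API gateway): generate the trace_id once, as UUID v4."""
    ctx = {
        "trace_id": str(uuid.uuid4()),
        "work_item_id": work_item_id,
        "task_id": task_id,
    }
    _correlation.set(ctx)
    return ctx

def with_correlation(entry: dict) -> dict:
    """Merge the current correlation fields into a log entry before emission."""
    return {**_correlation.get(), **entry}
```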


OBSERVABILITY_PATTERNS:DASHBOARD_DESIGN

DASHBOARD_PRINCIPLES

RULE: dashboards are for humans — optimize for comprehension, not data density
RULE: every dashboard has a clear purpose statement at the top
RULE: use consistent color coding: green=healthy, yellow=warning, red=critical
RULE: show rate of change, not just current values — trends reveal problems before thresholds

DASHBOARD_HIERARCHY:

LEVEL 1 — Executive Dashboard (1 screen):  
  system health (green/yellow/red), cost today, active agents, active incidents  

LEVEL 2 — Service Dashboards (per service):  
  request rate, error rate, latency percentiles, resource utilization  

LEVEL 3 — Investigation Dashboards (per incident):  
  detailed metrics, log search, trace waterfall, dependency status  

RULE: Level 1 is always visible — shows "do I need to care right now?"
RULE: Level 2 is for daily operations — shows "what needs attention today?"
RULE: Level 3 is for incident response — shows "what exactly is happening?"

CHECK: can an operator diagnose a typical incident from Level 2 alone?
IF: no THEN: Level 2 is missing critical data — add error breakdowns and latency histograms
IF: yes THEN: Level 2 is well-designed

GE_DASHBOARD_LAYOUT

ADMIN_UI_INFRASTRUCTURE_PAGE:

DATA_SOURCE: public/k8s-health.json (updated by a host cron job)  
REFRESH: auto-refresh every 60 seconds (matches cron interval)  

SECTIONS:  
  1. Cluster Overview: node status, total pods, namespace breakdown  
  2. Agent Status: active/unavailable/maintenance counts  
  3. Stream Health: depth per agent stream, system stream  
  4. Cost: hourly, daily totals from session_learnings  
  5. Recent Events: last 20 k8s events  

ANTI_PATTERN: dashboard that requires scrolling to see critical information
FIX: critical info (health, active incidents) always above the fold

ANTI_PATTERN: dashboard with 50+ panels
FIX: split into focused dashboards by domain — no one reads a 50-panel wall


OBSERVABILITY_PATTERNS:OPENTELEMETRY

INSTRUMENTATION_STRATEGY

PURPOSE: vendor-neutral instrumentation using OpenTelemetry SDK

RULE: instrument with OpenTelemetry — export to any backend without code changes
RULE: auto-instrumentation for HTTP, DB, and Redis where available
RULE: manual instrumentation for business-logic spans (routing decisions, cost calculations)

PYTHON_INSTRUMENTATION (executor, orchestrator):

from opentelemetry import trace  
from opentelemetry.sdk.trace import TracerProvider  

# install the SDK provider once at startup — without it, get_tracer returns a no-op tracer  
trace.set_tracer_provider(TracerProvider())  
tracer = trace.get_tracer("ge-orchestrator")  

with tracer.start_as_current_span("route_task") as span:  
    span.set_attribute("work_type", task.work_type)  
    span.set_attribute("agent", selected_agent)  
    # routing logic here  

NODE_INSTRUMENTATION (admin-ui):

import { trace } from '@opentelemetry/api';  

const tracer = trace.getTracer('admin-ui');  
const span = tracer.startSpan('create_task');  
try {  
  span.setAttribute('work_type', task.workType);  
  // task creation logic  
} finally {  
  span.end();  // always end the span, even when the logic throws  
}  

RULE: set meaningful span names — route_task not doWork
RULE: add business-relevant attributes — work_type, agent, team
RULE: record exceptions on spans — span.recordException(error)

ANTI_PATTERN: instrumenting everything with maximum detail
FIX: instrument boundaries (service entry/exit, external calls) — internal details go in logs

ANTI_PATTERN: creating spans for trivial operations (JSON parse, string format)
FIX: spans represent meaningful units of work, not every function call