DOMAIN:MONITORING — OBSERVABILITY_PATTERNS¶
OWNER: eltjo, annegreet
ALSO_USED_BY: ron, mira, nessa
UPDATED: 2026-03-26
SCOPE: three pillars of observability, structured logging, correlation IDs, distributed tracing, metric naming, dashboard design
OBSERVABILITY_PATTERNS:CORE_PRINCIPLE¶
PURPOSE: make system behavior observable so that failures can be diagnosed from telemetry data alone, without requiring code changes or redeployment
RULE: observability answers "why is this broken?" — monitoring answers "is this broken?"
RULE: all three pillars (logs, metrics, traces) must share a common correlation ID
RULE: observability data must be structured, not free-text
RULE: instrument once with OpenTelemetry — export to any backend
CHECK: can you diagnose the last incident using only telemetry data?
IF: no THEN: observability is insufficient — add missing signals
IF: yes THEN: observability is working — focus on reducing noise
OBSERVABILITY_PATTERNS:THREE_PILLARS¶
PILLAR_1_LOGS¶
PURPOSE: immutable record of discrete events with full context
RULE: all logs MUST be structured (JSON format)
RULE: every log entry MUST include: timestamp, level, service, message, correlation_id
RULE: NEVER log sensitive data (API keys, passwords, PII)
RULE: log at the appropriate level — see LOG_LEVELS section below
STRUCTURED_LOG_FORMAT:
{
"timestamp": "2026-03-26T14:30:00.123Z",
"level": "error",
"service": "ge-orchestrator",
"correlation_id": "task-abc123",
"agent": "boris",
"message": "Failed to route task to agent",
"error_class": "ConnectionRefusedError",
"error_message": "Redis connection refused on port 6381",
"context": {
"work_type": "code_review",
"stream": "triggers.boris",
"retry_count": 2
}
}
CHECK: are logs structured JSON?
IF: plain text THEN: logs are unsearchable at scale — convert to structured format
IF: JSON THEN: verify all required fields are present
ANTI_PATTERN: logging entire request/response bodies
FIX: log metadata (size, status code, duration) not payloads — payloads bloat storage
ANTI_PATTERN: using string interpolation in log messages (f"Error for {user}")
FIX: use structured fields — {"message": "Error for user", "user": user} — enables aggregation
ANTI_PATTERN: inconsistent timestamp formats across services
FIX: ISO 8601 with timezone (Z or +00:00) everywhere — no local time, no epoch-only
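The structured-log rules above can be sketched with a stdlib-only JSON formatter. This is a minimal illustration, not the GE logging implementation; the field names follow the schema in STRUCTURED_LOG_FORMAT, and the "ge-orchestrator" service name is just the example from this document.

```python
# Minimal structured-log formatter using only the standard library.
# Emits the required fields: timestamp, level, service, message, correlation_id.
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            # ISO 8601 with explicit UTC offset, per the timestamp rule
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "correlation_id": getattr(record, "correlation_id", None),
            "message": record.getMessage(),
        }
        # Structured fields instead of string interpolation in the message
        if hasattr(record, "context"):
            entry["context"] = record.context
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="ge-orchestrator"))
logger = logging.getLogger("ge-orchestrator")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Failed to route task to agent",
    extra={"correlation_id": "task-abc123", "context": {"retry_count": 2}},
)
```

Passing dynamic values via `extra` (rather than interpolating them into the message) keeps the message string constant, which is what makes aggregation by message possible.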
LOG_LEVELS¶
RULE: use consistent log levels across all GE services
LEVEL_DEFINITIONS:
FATAL: system cannot continue, requires immediate intervention
Example: database connection permanently lost, cannot recover
ERROR: operation failed, but system continues
Example: task dispatch failed for one agent, other agents unaffected
WARN: unexpected condition that may become an error
Example: Redis memory at 75%, approaching threshold
INFO: significant business events (normal operation)
Example: task dispatched to agent, session completed
DEBUG: detailed diagnostic information for development
Example: parsed routing rule, calculated confidence score
TRACE: extremely detailed, per-step execution flow
Example: entering function X, parameter Y = Z
RULE: production systems run at INFO level by default
RULE: DEBUG and TRACE are enabled per-service when investigating issues
RULE: FATAL and ERROR always trigger log aggregation alerts
RULE: WARN is reviewed in daily digests
CHECK: is the log level appropriate for the message?
IF: normal operation logged as ERROR THEN: noise — demote to INFO
IF: failure condition logged as INFO THEN: missed alert — promote to ERROR
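One way to honor "INFO by default, DEBUG/TRACE per-service when investigating" is to read the level from the environment at startup. A sketch, assuming a `LOG_LEVEL` variable name (not a GE convention from this document); TRACE is mapped to a custom numeric level because the Python stdlib defines none.

```python
# Derive the service log level from an environment variable so DEBUG/TRACE
# can be enabled per-service without redeploying. LOG_LEVEL is an assumed name.
import logging
import os

TRACE = 5  # below DEBUG (10); custom level, the stdlib has no TRACE
logging.addLevelName(TRACE, "TRACE")

def configure_level(default: str = "INFO") -> int:
    name = os.environ.get("LOG_LEVEL", default).upper()
    level = TRACE if name == "TRACE" else logging.getLevelName(name)
    if not isinstance(level, int):  # unknown name -> fall back to the default
        level = logging.getLevelName(default)
    logging.getLogger().setLevel(level)
    return level
```

Falling back to the default on an unknown value keeps a typo in deployment config from silently running a service at the root logger's implicit WARNING level.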
PILLAR_2_METRICS¶
PURPOSE: numerical measurements of system behavior over time, enabling trends and alerts
RULE: metrics are aggregated, not individual events — use logs for individual events
RULE: metric names follow a consistent naming convention
RULE: every metric has labels/dimensions for slicing (service, agent, work_type)
RULE: counters only go up; use gauges for values that go up and down
METRIC_TYPES:
COUNTER: monotonically increasing value (total requests, total errors)
GAUGE: point-in-time value (current queue depth, memory usage)
HISTOGRAM: distribution of values (request latency, task duration)
SUMMARY: like histogram, but quantiles are pre-calculated in the client — harder to aggregate across instances
NAMING_CONVENTION:
{namespace}_{subsystem}_{metric_name}_{unit}
Examples:
ge_orchestrator_tasks_routed_total (counter)
ge_executor_task_duration_seconds (histogram)
ge_redis_stream_depth_messages (gauge)
ge_agent_session_cost_dollars (histogram)
ge_admin_ui_http_requests_total (counter)
ge_admin_ui_http_request_duration_seconds (histogram)
RULE: always include the unit in the metric name (_seconds, _bytes, _total, _messages)
RULE: use _total suffix for counters
RULE: use base units (seconds not milliseconds, bytes not kilobytes)
ANTI_PATTERN: metric names with variable components (agent name IN the metric name)
FIX: use labels — ge_executor_tasks_total{agent="boris"} not ge_executor_boris_tasks_total
ANTI_PATTERN: high-cardinality labels (session_id, request_id as labels)
FIX: unbounded label values cause a cardinality explosion in the metrics backend — use logs for per-request data
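The naming convention above can be enforced mechanically. A stdlib-only sketch (a hypothetical helper, not a GE API) that assembles `{namespace}_{subsystem}_{metric_name}_{unit}` names and checks the `_total` rule for counters:

```python
# Assemble metric names per {namespace}_{subsystem}_{metric_name}_{unit}
# and enforce the counter "_total" suffix rule. Illustrative helper only.
import re

VALID = re.compile(r"^[a-z][a-z0-9_]*$")

def metric_name(namespace: str, subsystem: str, name: str, unit: str,
                metric_type: str = "gauge") -> str:
    for part in (namespace, subsystem, name, unit):
        if not VALID.match(part):
            raise ValueError(f"invalid metric component: {part!r}")
    full = "_".join((namespace, subsystem, name, unit))
    # Counters must carry the _total suffix
    if metric_type == "counter" and not full.endswith("_total"):
        raise ValueError("counter metrics must use the _total suffix")
    return full

print(metric_name("ge", "executor", "task_duration", "seconds",
                  metric_type="histogram"))
# ge_executor_task_duration_seconds
```

Per-agent slicing would then be expressed as a label on the registered metric (e.g. `{agent="boris"}` in Prometheus notation), never as a component of the name itself.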
PILLAR_3_TRACES¶
PURPOSE: follow a single request across multiple services to identify where time is spent and where failures occur
RULE: every incoming request gets a unique trace_id
RULE: each service call within the request is a span with its own span_id
RULE: spans have parent-child relationships forming a tree
RULE: trace context propagates across service boundaries via HTTP headers or Redis fields
GE_TRACE_FLOW:
trace_id: auto-generated at request origin
SPAN 1: admin-ui receives task creation request
└── SPAN 2: TaskService writes to PostgreSQL
└── SPAN 3: TaskService XADDs to ge:work:incoming
└── SPAN 4: ge-orchestrator consumes from stream
└── SPAN 5: router.py selects agent
└── SPAN 6: XADDs to triggers.{agent}
└── SPAN 7: ge-executor consumes task
└── SPAN 8: LLM provider API call
└── SPAN 9: agent writes COMP file
TRACE_CONTEXT_PROPAGATION:
HTTP: W3C Trace Context headers (traceparent, tracestate)
Redis: add trace_id and span_id as fields in XADD payload
DB: include trace_id in query comments for slow query correlation
CHECK: does the trace survive across Redis stream boundaries?
IF: no THEN: trace_id must be included in XADD payload fields
IF: yes THEN: verify span parent-child relationships are correct
ANTI_PATTERN: tracing only within a single service
FIX: distributed tracing requires context propagation across ALL service boundaries
ANTI_PATTERN: tracing every single operation including health checks
FIX: sample traces — 100% tracing generates excessive data. Sample at 10-20% for normal traffic, 100% for errors
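Carrying the trace across the Redis stream boundary and sampling it can be sketched without any tracing SDK: embed the IDs as XADD fields on the producer, read them back on the consumer, and derive a deterministic sampling decision from the trace_id. The field names match this document; the payload dicts stand in for the actual XADD/XREADGROUP calls, and the error-aware keep shown here assumes the error is known at decision time (otherwise tail-based sampling is needed).

```python
# Propagate trace context across a Redis stream boundary by embedding
# trace_id/span_id as XADD fields, then sample deterministically.
import uuid

def make_payload(task: dict, trace_id: str, span_id: str) -> dict:
    # Producer side: merge trace context into the XADD field map
    return {**task, "trace_id": trace_id, "span_id": span_id}

def extract_context(fields: dict) -> tuple[str, str]:
    # Consumer side: recover the parent context before starting a new span
    return fields["trace_id"], fields["span_id"]

def should_sample(trace_id: str, ratio: float = 0.1, is_error: bool = False) -> bool:
    # Keep 100% of errors, ~ratio of normal traffic. Deriving the decision
    # from the trace_id keeps it consistent across services for one trace.
    if is_error:
        return True
    return int(trace_id[:8], 16) / 0xFFFFFFFF < ratio

trace_id = uuid.uuid4().hex  # generated once, at the request origin
payload = make_payload({"work_type": "code_review"}, trace_id, uuid.uuid4().hex[:16])
tid, sid = extract_context(payload)
```

Because every service hashes the same trace_id, either all spans of a trace are kept or none are, which avoids half-sampled trace trees.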
OBSERVABILITY_PATTERNS:CORRELATION¶
CORRELATION_ID_STRATEGY¶
PURPOSE: link logs, metrics, and traces for the same logical operation
RULE: the trace_id serves as the primary correlation ID
RULE: every log entry includes the correlation_id field
RULE: every metric emission includes the trace context where applicable
RULE: correlation IDs propagate across async boundaries (Redis streams)
GE_CORRELATION_FIELDS:
trace_id: unique per end-to-end request
work_item_id: unique per work item (persisted in DB)
session_id: unique per agent execution session
agent_name: which agent is executing
task_id: which task triggered the work
CHECK: given a work_item_id, can you find all logs, metrics, and traces?
IF: no THEN: correlation is broken — ensure all telemetry includes work_item_id
IF: yes THEN: correlation is working — you can reconstruct the full story
ANTI_PATTERN: different correlation ID formats across services
FIX: standardize on UUID v4 for all correlation IDs
ANTI_PATTERN: correlation ID generated midway through the request
FIX: generate at the entry point (admin-ui or API gateway) and propagate everywhere
OBSERVABILITY_PATTERNS:DASHBOARD_DESIGN¶
DASHBOARD_PRINCIPLES¶
RULE: dashboards are for humans — optimize for comprehension, not data density
RULE: every dashboard has a clear purpose statement at the top
RULE: use consistent color coding: green=healthy, yellow=warning, red=critical
RULE: show rate of change, not just current values — trends reveal problems before thresholds
DASHBOARD_HIERARCHY:
LEVEL 1 — Executive Dashboard (1 screen):
system health (green/yellow/red), cost today, active agents, active incidents
LEVEL 2 — Service Dashboards (per service):
request rate, error rate, latency percentiles, resource utilization
LEVEL 3 — Investigation Dashboards (per incident):
detailed metrics, log search, trace waterfall, dependency status
RULE: Level 1 is always visible — shows "do I need to care right now?"
RULE: Level 2 is for daily operations — shows "what needs attention today?"
RULE: Level 3 is for incident response — shows "what exactly is happening?"
CHECK: can an operator diagnose a typical incident from Level 2 alone?
IF: no THEN: Level 2 is missing critical data — add error breakdowns and latency histograms
IF: yes THEN: Level 2 is well-designed
GE_DASHBOARD_LAYOUT¶
ADMIN_UI_INFRASTRUCTURE_PAGE:
DATA_SOURCE: public/k8s-health.json (updated by a host cron job)
REFRESH: auto-refresh every 60 seconds (matches cron interval)
SECTIONS:
1. Cluster Overview: node status, total pods, namespace breakdown
2. Agent Status: active/unavailable/maintenance counts
3. Stream Health: depth per agent stream, system stream
4. Cost: hourly, daily totals from session_learnings
5. Recent Events: last 20 k8s events
ANTI_PATTERN: dashboard that requires scrolling to see critical information
FIX: critical info (health, active incidents) always above the fold
ANTI_PATTERN: dashboard with 50+ panels
FIX: split into focused dashboards by domain — no one reads a 50-panel wall
OBSERVABILITY_PATTERNS:OPENTELEMETRY¶
INSTRUMENTATION_STRATEGY¶
PURPOSE: vendor-neutral instrumentation using OpenTelemetry SDK
RULE: instrument with OpenTelemetry — export to any backend without code changes
RULE: auto-instrumentation for HTTP, DB, and Redis where available
RULE: manual instrumentation for business-logic spans (routing decisions, cost calculations)
PYTHON_INSTRUMENTATION (executor, orchestrator):
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    trace.set_tracer_provider(TracerProvider())  # register once at startup
    tracer = trace.get_tracer("ge-orchestrator")
    with tracer.start_as_current_span("route_task") as span:
        span.set_attribute("work_type", task.work_type)
        span.set_attribute("agent", selected_agent)
        # routing logic here
NODE_INSTRUMENTATION (admin-ui):
    import { trace } from '@opentelemetry/api';
    const tracer = trace.getTracer('admin-ui');
    const span = tracer.startSpan('create_task');
    try {
        span.setAttribute('work_type', task.workType);
        // task creation logic
    } catch (err) {
        span.recordException(err);
        throw err;
    } finally {
        span.end();  // always end the span, even on failure
    }
RULE: set meaningful span names — route_task not doWork
RULE: add business-relevant attributes — work_type, agent, team
RULE: record exceptions on spans — span.recordException(error)
ANTI_PATTERN: instrumenting everything with maximum detail
FIX: instrument boundaries (service entry/exit, external calls) — internal details go in logs
ANTI_PATTERN: creating spans for trivial operations (JSON parse, string format)
FIX: spans represent meaningful units of work, not every function call