DOMAIN:MONITORING — LOG_ANALYSIS¶
OWNER: eltjo, annegreet
ALSO_USED_BY: ron, mira, nessa
UPDATED: 2026-03-26
SCOPE: log aggregation, structured logging, log levels, log rotation, sensitive data redaction, search patterns, error fingerprinting
LOG_ANALYSIS:CORE_PRINCIPLE¶
PURPOSE: aggregate, structure, and analyze logs from all GE components to detect errors, understand system behavior, and extract learnings
RULE: logs are the most detailed telemetry signal — they contain the full story
RULE: logs MUST be structured (JSON) to be searchable at scale
RULE: logs MUST be centralized — per-pod logs are useless after pod restart
RULE: logs MUST be redacted — no secrets, no PII, no API keys
CHECK: can you search logs across all services for a specific correlation_id?
IF: no THEN: log aggregation is broken — fix centralization first
IF: yes THEN: aggregation is working — focus on log quality
LOG_ANALYSIS:LOG_AGGREGATION¶
CENTRALIZED_LOG_COLLECTION¶
PURPOSE: collect logs from all pods and persist them beyond pod lifecycle
TECHNIQUE: sidecar or DaemonSet log collector
1. each pod writes structured JSON to stdout/stderr
2. k8s captures stdout/stderr to node filesystem
3. log collector (Fluentd/Fluent Bit/Vector) reads from node logs
4. collector forwards to centralized storage (Loki, Elasticsearch, or PostgreSQL)
5. centralized storage enables cross-service search
GE_LOG_SOURCES:
admin-ui: Next.js server logs (stdout)
ge-orchestrator: Python asyncio logs (stdout)
ge-executor: Python execution logs + PTY capture (stdout)
Redis: slow log, connection log
PostgreSQL: query log, error log
k8s events: event stream from API server
host cron: script output from health dumps
CHECK: are logs from all sources reaching centralized storage?
IF: missing source THEN: check collector config — pod may not be matched by selector
IF: all present THEN: verify retention period is sufficient
RULE: retain logs for minimum 30 days (operational) and 90 days (compliance)
RULE: compress archived logs — JSON compresses well (10:1 typical)
ANTI_PATTERN: relying on kubectl logs for historical analysis
FIX: kubectl logs only shows current pod — logs are lost on restart without centralization
ANTI_PATTERN: each service writing to its own log file instead of stdout
FIX: k8s expects stdout/stderr — file-based logging requires extra volume mounts and collectors
LOG_INDEXING¶
PURPOSE: enable fast search across large volumes of log data
RULE: index on: timestamp, level, service, correlation_id, agent, error_class
RULE: do NOT index on: full message text (too expensive), raw stack traces
RULE: full-text search is a fallback, not the primary search method
CHECK: can you find all ERROR logs for agent "boris" in the last hour in < 5 seconds?
IF: no THEN: indexing is insufficient — add agent and level to index
IF: yes THEN: indexing is working
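If PostgreSQL is the log store (one of the storage options listed under LOG_AGGREGATION), the indexing rules above translate roughly to the following sketch. The table and column names are assumptions, not the actual schema:

```sql
-- Hypothetical schema: a "logs" table with one row per entry.
-- Composite index covers the common "service + level + time window" lookups.
CREATE INDEX idx_logs_service_level_ts ON logs (service, level, timestamp);
-- Correlation-id lookups span services, so index it alone.
CREATE INDEX idx_logs_correlation ON logs (correlation_id);
-- Partial index keeps the "ERROR logs for agent X in the last hour" query fast
-- without indexing the (much larger) non-error volume.
CREATE INDEX idx_logs_agent_errors ON logs (agent, timestamp)
    WHERE level = 'error';
```

The full message text is deliberately left unindexed, matching the rule above.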
LOG_ANALYSIS:STRUCTURED_LOGGING_PATTERNS¶
JSON_LOG_FORMAT¶
PURPOSE: consistent structured format across all GE services
REQUIRED_FIELDS:
{
"timestamp": "ISO 8601 with timezone",
"level": "error|warn|info|debug|trace",
"service": "service name (ge-orchestrator, admin-ui, etc.)",
"correlation_id": "trace_id or work_item_id",
"message": "human-readable description of the event"
}
OPTIONAL_FIELDS:
{
"agent": "agent name if applicable",
"work_type": "task work type",
"error_class": "exception class name",
"error_message": "exception message (redacted)",
"stack_trace": "first 10 lines of stack trace",
"duration_ms": "operation duration in milliseconds",
"context": { "additional_key": "additional_value" }
}
RULE: required fields are ALWAYS present — missing fields break aggregation
RULE: optional fields are present when relevant — do not pad with nulls
RULE: context object holds domain-specific data — keeps top level clean
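A concrete entry conforming to the schemas above (values illustrative; the correlation_id format and context keys are assumptions):

```json
{
  "timestamp": "2026-03-26T09:14:02.113+00:00",
  "level": "error",
  "service": "ge-orchestrator",
  "correlation_id": "wi-000123",
  "message": "task routing failed",
  "agent": "boris",
  "error_class": "TimeoutError",
  "duration_ms": 5021,
  "context": { "retry_count": 2 }
}
```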
PYTHON_STRUCTURED_LOGGING¶
PURPOSE: structured logging in orchestrator and executor
TECHNIQUE: python-json-logger or structlog
import structlog
logger = structlog.get_logger()
logger.info("task_routed",
    correlation_id=task.work_item_id,
    agent=selected_agent,
    work_type=task.work_type,
    routing_reason="explicit_agent_id",
)
logger.error("task_routing_failed",
    correlation_id=task.work_item_id,
    error_class=type(e).__name__,
    error_message=str(e),
    work_type=task.work_type,
)
RULE: use keyword arguments for structured fields — never f-strings in messages
RULE: bind common fields (service, hostname) once at logger creation
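Where structlog is unavailable, the same required-field shape can be produced with only the stdlib. A minimal sketch, not the project's actual logger — the "fields" extra key is a convention invented here:

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line with the required fields."""

    def __init__(self, service: str, **common):
        super().__init__()
        # Bind common fields once at creation, like structlog's bind().
        self.common = {"service": service, **common}

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            # Map stdlib WARNING to the schema's "warn".
            "level": {"WARNING": "warn"}.get(record.levelname, record.levelname).lower(),
            "message": record.getMessage(),
            **self.common,
            # Structured fields passed via extra={"fields": {...}}.
            **getattr(record, "fields", {}),
        }
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter(service="ge-orchestrator"))
logger = logging.getLogger("ge")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("task_routed", extra={"fields": {"correlation_id": "wi-000123", "agent": "boris"}})
```

Writing to stdout keeps the k8s capture path above intact.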
TYPESCRIPT_STRUCTURED_LOGGING¶
PURPOSE: structured logging in admin-ui
TECHNIQUE: pino or custom JSON logger
import pino from 'pino';
const logger = pino({ level: 'info' });
logger.info({
  correlationId: task.workItemId,
  agent: task.agentName,
  workType: task.workType,
}, 'Task dispatched to agent');
logger.error({
  correlationId: task.workItemId,
  err: error,
  workType: task.workType,
}, 'Task dispatch failed');
RULE: pass structured object as first argument, message as second
RULE: use err field for Error objects — pino serializes stack traces automatically
LOG_ANALYSIS:LOG_ROTATION¶
ROTATION_POLICY¶
PURPOSE: prevent log storage from growing unbounded
RULE: rotate logs by size (100MB per file) or time (daily), whichever comes first
RULE: retain 7 days of uncompressed logs for fast access
RULE: retain 30 days of compressed logs for operational queries
RULE: retain 90 days of archived logs for compliance (ISO 27001)
CHECK: is total log storage growing faster than expected?
IF: yes THEN: check for DEBUG/TRACE level enabled in production — disable
IF: yes THEN: check for high-frequency logging in a loop — fix the loop
TECHNIQUE: k8s container log rotation
# kubelet handles rotation automatically
# configure in kubelet config:
containerLogMaxSize: 100Mi
containerLogMaxFiles: 5
ANTI_PATTERN: no log rotation — disk fills up and pod crashes
FIX: always configure rotation — unbounded logs are a ticking time bomb
ANTI_PATTERN: rotating too aggressively (keep only 1 hour)
FIX: 1 hour is not enough for post-incident analysis — keep minimum 7 days
LOG_VOLUME_ESTIMATION¶
PURPOSE: predict storage needs and detect anomalous log volume
FORMULA:
daily_log_volume = avg_log_lines_per_second * 86400 * avg_bytes_per_line
monthly_storage = daily_log_volume * retention_days * compression_ratio
Example:
50 lines/sec * 86400 sec * 500 bytes = ~2.2 GB/day uncompressed
2.2 GB * 30 days * 0.1 (compression) = ~6.5 GB/month compressed
CHECK: actual log volume matches estimate?
IF: > 2x estimate THEN: log volume anomaly — investigate cause (debug logging, error storm)
IF: < 0.5x estimate THEN: logs may be silently lost — check collector health
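The formula and the anomaly thresholds from the CHECK above, as a small sketch:

```python
def daily_log_volume_bytes(lines_per_sec: float, avg_bytes_per_line: float) -> float:
    """daily_log_volume = avg_log_lines_per_second * 86400 * avg_bytes_per_line"""
    return lines_per_sec * 86400 * avg_bytes_per_line

def storage_bytes(daily_volume: float, retention_days: int,
                  compression_ratio: float = 0.1) -> float:
    """monthly_storage = daily_log_volume * retention_days * compression_ratio"""
    return daily_volume * retention_days * compression_ratio

def volume_anomaly(actual_daily: float, estimated_daily: float) -> str:
    """>2x estimate: debug logging or error storm. <0.5x: logs may be silently lost."""
    if actual_daily > 2 * estimated_daily:
        return "anomaly:high"
    if actual_daily < 0.5 * estimated_daily:
        return "anomaly:low"
    return "normal"
```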
LOG_ANALYSIS:SENSITIVE_DATA_REDACTION¶
WHAT_TO_REDACT¶
PURPOSE: prevent sensitive data from appearing in logs
MUST_REDACT:
API keys: replace with "***API_KEY***"
Passwords: replace with "***PASSWORD***"
Bearer tokens: replace with "Bearer ***TOKEN***"
Redis passwords: replace with "***REDIS_PASS***"
Database URLs: redact password portion of connection string
PII: email addresses, IP addresses (if applicable)
Session tokens: replace with first 8 chars + "***"
CHECK: do any log entries contain unredacted secrets?
TOOL: grep
RUN: grep -rn 'ANTHROPIC_API_KEY\|sk-ant-\|sk-proj-\|password=' /var/log/pods/ | head -5
IF: matches found THEN: CRITICAL — secrets in logs, fix redaction immediately
REDACTION_TECHNIQUES¶
TECHNIQUE: logger-level redaction (preferred)
import re

REDACTION_PATTERNS = [
    (re.compile(r'sk-ant-[a-zA-Z0-9_-]+'), '***ANTHROPIC_KEY***'),
    (re.compile(r'sk-proj-[a-zA-Z0-9_-]+'), '***OPENAI_KEY***'),
    (re.compile(r'password=\S+'), 'password=***REDACTED***'),
    (re.compile(r'Bearer\s+\S+'), 'Bearer ***TOKEN***'),
]

def redact(message: str) -> str:
    for pattern, replacement in REDACTION_PATTERNS:
        message = pattern.sub(replacement, message)
    return message
RULE: redaction runs BEFORE log emission — never write then redact
RULE: test redaction patterns against known secret formats
RULE: redaction must not break JSON structure
ANTI_PATTERN: redacting at aggregation time instead of emission time
FIX: secrets should never leave the process — redact at source
ANTI_PATTERN: logging full HTTP request headers (includes Authorization)
FIX: log only safe headers (Content-Type, Content-Length) — never Authorization
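One way to enforce the redact-before-emission rule with the stdlib is a logging.Filter that rewrites each record before any handler sees it. A sketch; the pattern list repeats a subset of the REDACTION_PATTERNS shapes above:

```python
import logging
import re

# Subset of the redaction patterns defined above, inlined for self-containment.
PATTERNS = [
    (re.compile(r'sk-ant-[a-zA-Z0-9_-]+'), '***ANTHROPIC_KEY***'),
    (re.compile(r'password=\S+'), 'password=***REDACTED***'),
    (re.compile(r'Bearer\s+\S+'), 'Bearer ***TOKEN***'),
]

class RedactionFilter(logging.Filter):
    """Rewrites record messages in-process, before emission to stdout."""

    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()
        for pattern, replacement in PATTERNS:
            msg = pattern.sub(replacement, msg)
        # Freeze the redacted message so handlers cannot re-interpolate args.
        record.msg, record.args = msg, ()
        return True
```

Attach with logger.addFilter(RedactionFilter()) so secrets never leave the process.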
LOG_ANALYSIS:SEARCH_PATTERNS¶
COMMON_SEARCH_QUERIES¶
PURPOSE: standard queries for operational investigation
FIND_ALL_ERRORS_FOR_AGENT:
FIND_TASK_LIFECYCLE:
FIND_ERROR_CLUSTERS:
FIND_SLOW_OPERATIONS:
FIND_COST_ANOMALIES:
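The query bodies above are store-specific and left unspecified. As an illustration only, the first of these searches expressed over raw JSON log lines in Python — field names follow JSON_LOG_FORMAT; the slow-operation threshold is an assumption:

```python
import json
from typing import Iterable, Iterator

def parse(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSON log lines, skipping anything malformed."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # a bad line should not fail the whole search

def errors_for_agent(entries, agent):
    return [e for e in entries
            if e.get("level") == "error" and e.get("agent") == agent]

def task_lifecycle(entries, correlation_id):
    # the full story of one task: every line sharing its correlation_id, in order
    return sorted(
        (e for e in entries if e.get("correlation_id") == correlation_id),
        key=lambda e: e.get("timestamp", ""),
    )

def slow_operations(entries, threshold_ms=1000):  # threshold is illustrative
    return [e for e in entries if e.get("duration_ms", 0) > threshold_ms]
```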
ERROR_FINGERPRINTING¶
PURPOSE: group similar errors together for pattern detection
RULE: fingerprint = sha256(normalized_error_class + normalized_message_template)
RULE: normalize before hashing — strip variable parts
NORMALIZATION_PIPELINE:
1. strip ANSI codes: s/\x1b\[[0-9;]*m//g
2. replace paths: s|/home/claude/[^\s:]+|<PATH>|g
3. replace session IDs: s/sess-[0-9]{8}-[0-9]{6}-[a-f0-9]{6}/<SESSION>/g
4. replace UUIDs: s/[a-f0-9]{8}(-[a-f0-9]{4}){3}-[a-f0-9]{12}/<UUID>/g
5. replace timestamps: s/\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}[^\s]*/<TS>/g
6. replace large numbers: s/\b\d{4,}\b/<N>/g
7. collapse whitespace: s/\s+/ /g
8. compute sha256 of result
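The eight steps above as a sketch, with the regexes transcribed from the pipeline:

```python
import hashlib
import re

# Ordered to match the pipeline: specific tokens replaced before generic ones.
_NORMALIZATION_STEPS = [
    (re.compile(r'\x1b\[[0-9;]*m'), ''),                                  # 1. ANSI codes
    (re.compile(r'/home/claude/[^\s:]+'), '<PATH>'),                       # 2. paths
    (re.compile(r'sess-[0-9]{8}-[0-9]{6}-[a-f0-9]{6}'), '<SESSION>'),      # 3. session IDs
    (re.compile(r'[a-f0-9]{8}(-[a-f0-9]{4}){3}-[a-f0-9]{12}'), '<UUID>'),  # 4. UUIDs
    (re.compile(r'\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}[^\s]*'), '<TS>'), # 5. timestamps
    (re.compile(r'\b\d{4,}\b'), '<N>'),                                    # 6. large numbers
    (re.compile(r'\s+'), ' '),                                             # 7. whitespace
]

def fingerprint(error_class: str, message: str) -> str:
    """fingerprint = sha256(normalized_error_class + normalized_message_template)"""
    normalized = f'{error_class}: {message}'
    for pattern, replacement in _NORMALIZATION_STEPS:
        normalized = pattern.sub(replacement, normalized)
    return hashlib.sha256(normalized.strip().encode()).hexdigest()      # 8. sha256
```

Two occurrences differing only in path, timestamp, and counters hash to the same fingerprint.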
CHECK: is the fingerprint matching too broadly (unrelated errors grouped)?
IF: yes THEN: normalization is too aggressive — keep more specific tokens
CHECK: is the fingerprint matching too narrowly (same error gets multiple fingerprints)?
IF: yes THEN: normalization is too conservative — strip more variable parts
ANTI_PATTERN: fingerprinting the raw error including timestamps and line numbers
FIX: normalize first — timestamps and line numbers change between occurrences
ANTI_PATTERN: separate fingerprints for the same error on different ports
FIX: replace port numbers with <PORT> unless port is the diagnostic signal
LOG_ANALYSIS:ERROR_CLASSIFICATION¶
ERROR_TAXONOMY¶
PURPOSE: classify errors by domain for targeted investigation and pattern detection
CATEGORIES:
infra:network — connection refused, timeout, DNS failure
infra:resource — OOM, disk full, CPU throttle
infra:permission — EACCES, EPERM, 403, RBAC denied
runtime:import — ModuleNotFoundError, cannot find module
runtime:type — TypeError, AttributeError, undefined
runtime:state — race condition, stale data, missing key
api:auth — 401, token expired, invalid credentials
api:validation — 400, Zod error (.issues not .errors), schema mismatch
api:ratelimit — 429, quota exceeded
cost:burn — session > $5, agent > $10/hr, daily > $100
loop:hook — hook re-trigger, chain depth exceeded
loop:infinite — MAX_TURNS hit, same tool called 10+ times
RULE: classify by root cause, not symptom
EXAMPLE: a 403 from missing RBAC is infra:permission, not api:auth
EXAMPLE: a timeout from Redis OOM is infra:resource, not infra:network
CHECK: does the error match multiple categories?
IF: yes THEN: assign primary by root cause, tag secondary categories
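A hedged sketch of a pattern-based classifier over this taxonomy. The pattern list is illustrative and incomplete (cost:* and loop:* omitted); ordering approximates the root-cause rule, e.g. resource patterns are checked before network so a Redis OOM timeout lands in infra:resource:

```python
import re

# First match wins, so order from most diagnostic to most generic.
CATEGORY_PATTERNS = [
    ("infra:resource", re.compile(r"OOM|out of memory|disk full|No space left", re.I)),
    ("infra:permission", re.compile(r"EACCES|EPERM|RBAC|403")),
    ("infra:network", re.compile(r"connection refused|timed? ?out|DNS|ECONNREFUSED", re.I)),
    ("runtime:import", re.compile(r"ModuleNotFoundError|cannot find module")),
    ("runtime:type", re.compile(r"TypeError|AttributeError|undefined")),
    ("api:ratelimit", re.compile(r"429|rate limit|quota exceeded", re.I)),
    ("api:auth", re.compile(r"401|token expired|invalid credentials", re.I)),
    ("api:validation", re.compile(r"400|schema mismatch|ZodError", re.I)),
]

def classify(error_class: str, message: str) -> str:
    """Return the primary category, or 'unclassified' if nothing matches."""
    text = f"{error_class}: {message}"
    for category, pattern in CATEGORY_PATTERNS:
        if pattern.search(text):
            return category
    return "unclassified"
```

Real classification should weigh error_class over message text and allow secondary tags, per the CHECK above.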