DOMAIN:MONITORING — LOG_ANALYSIS¶
OWNER: eltjo, annegreet
ALSO_USED_BY: ron, mira, nessa
UPDATED: 2026-03-26
SCOPE: log aggregation, structured logging, log levels, log rotation, sensitive data redaction, search patterns, error fingerprinting
LOG_ANALYSIS:CORE_PRINCIPLE¶
PURPOSE: aggregate, structure, and analyze logs from all GE components to detect errors, understand system behavior, and extract learnings
RULE: logs are the most detailed telemetry signal — they contain the full story
RULE: logs MUST be structured (JSON) to be searchable at scale
RULE: logs MUST be centralized — per-pod logs are useless after pod restart
RULE: logs MUST be redacted — no secrets, no PII, no API keys
CHECK: can you search logs across all services for a specific correlation_id?
IF: no THEN: log aggregation is broken — fix centralization first
IF: yes THEN: aggregation is working — focus on log quality
LOG_ANALYSIS:LOG_AGGREGATION¶
CENTRALIZED_LOG_COLLECTION¶
PURPOSE: collect logs from all pods and persist them beyond pod lifecycle
TECHNIQUE: sidecar or DaemonSet log collector
1. each pod writes structured JSON to stdout/stderr
2. k8s captures stdout/stderr to node filesystem
3. log collector (Fluentd/Fluent Bit/Vector) reads from node logs
4. collector forwards to centralized storage (Loki, Elasticsearch, or PostgreSQL)
5. centralized storage enables cross-service search
GE_LOG_SOURCES:
admin-ui: Next.js server logs (stdout)
ge-orchestrator: Python asyncio logs (stdout)
ge-executor: Python execution logs + PTY capture (stdout)
Redis: slow log, connection log
PostgreSQL: query log, error log
k8s events: event stream from API server
host cron: script output from health dumps
CHECK: are logs from all sources reaching centralized storage?
IF: missing source THEN: check collector config — pod may not be matched by selector
IF: all present THEN: verify retention period is sufficient
RULE: retain logs for minimum 30 days (operational) and 90 days (compliance)
RULE: compress archived logs — JSON compresses well (10:1 typical)
ANTI_PATTERN: relying on kubectl logs for historical analysis
FIX: kubectl logs only shows current pod — logs are lost on restart without centralization
ANTI_PATTERN: each service writing to its own log file instead of stdout
FIX: k8s expects stdout/stderr — file-based logging requires extra volume mounts and collectors
LOG_INDEXING¶
PURPOSE: enable fast search across large volumes of log data
RULE: index on: timestamp, level, service, correlation_id, agent, error_class
RULE: do NOT index on: full message text (too expensive), raw stack traces
RULE: full-text search is a fallback, not the primary search method
CHECK: can you find all ERROR logs for agent "boris" in the last hour in < 5 seconds?
IF: no THEN: indexing is insufficient — add agent and level to index
IF: yes THEN: indexing is working
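If PostgreSQL is the log store (one of the storage options listed under LOG_AGGREGATION), the indexing rules above translate roughly to the following sketch. The table and column names are assumptions, not the actual schema:

```sql
-- Hypothetical schema: a "logs" table with one row per entry.
-- Composite index covers the common "service + level + time window" lookups.
CREATE INDEX idx_logs_service_level_ts ON logs (service, level, timestamp);
-- Correlation-id lookups span services, so index it alone.
CREATE INDEX idx_logs_correlation ON logs (correlation_id);
-- Partial index keeps the "ERROR logs for agent X in the last hour" query fast
-- without indexing the (much larger) non-error volume.
CREATE INDEX idx_logs_agent_errors ON logs (agent, timestamp)
    WHERE level = 'error';
```

The full message text is deliberately left unindexed, matching the rule above.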
LOG_ANALYSIS:STRUCTURED_LOGGING_PATTERNS¶
JSON_LOG_FORMAT¶
PURPOSE: consistent structured format across all GE services
REQUIRED_FIELDS:
{
"timestamp": "ISO 8601 with timezone",
"level": "error|warn|info|debug|trace",
"service": "service name (ge-orchestrator, admin-ui, etc.)",
"correlation_id": "trace_id or work_item_id",
"message": "human-readable description of the event"
}
OPTIONAL_FIELDS:
{
"agent": "agent name if applicable",
"work_type": "task work type",
"error_class": "exception class name",
"error_message": "exception message (redacted)",
"stack_trace": "first 10 lines of stack trace",
"duration_ms": "operation duration in milliseconds",
"context": { "additional_key": "additional_value" }
}
RULE: required fields are ALWAYS present — missing fields break aggregation
RULE: optional fields are present when relevant — do not pad with nulls
RULE: context object holds domain-specific data — keeps top level clean
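A concrete entry conforming to the schemas above (values illustrative; the correlation_id format and context keys are assumptions):

```json
{
  "timestamp": "2026-03-26T09:14:02.113+00:00",
  "level": "error",
  "service": "ge-orchestrator",
  "correlation_id": "wi-000123",
  "message": "task routing failed",
  "agent": "boris",
  "error_class": "TimeoutError",
  "duration_ms": 5021,
  "context": { "retry_count": 2 }
}
```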
PYTHON_STRUCTURED_LOGGING¶
PURPOSE: structured logging in orchestrator and executor
TECHNIQUE: python-json-logger or structlog
import structlog
logger = structlog.get_logger()
logger.info("task_routed",
    correlation_id=task.work_item_id,
    agent=selected_agent,
    work_type=task.work_type,
    routing_reason="explicit_agent_id",
)
logger.error("task_routing_failed",
    correlation_id=task.work_item_id,
    error_class=type(e).__name__,
    error_message=str(e),
    work_type=task.work_type,
)
RULE: use keyword arguments for structured fields — never f-strings in messages
RULE: bind common fields (service, hostname) once at logger creation
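Where structlog is unavailable, the same required-field shape can be produced with only the stdlib. A minimal sketch, not the project's actual logger — the "fields" extra key is a convention invented here:

```python
import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line with the required fields."""

    def __init__(self, service: str, **common):
        super().__init__()
        # Bind common fields once at creation, like structlog's bind().
        self.common = {"service": service, **common}

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            # Map stdlib WARNING to the schema's "warn".
            "level": {"WARNING": "warn"}.get(record.levelname, record.levelname).lower(),
            "message": record.getMessage(),
            **self.common,
            # Structured fields passed via extra={"fields": {...}}.
            **getattr(record, "fields", {}),
        }
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter(service="ge-orchestrator"))
logger = logging.getLogger("ge")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("task_routed", extra={"fields": {"correlation_id": "wi-000123", "agent": "boris"}})
```

Writing to stdout keeps the k8s capture path above intact.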
TYPESCRIPT_STRUCTURED_LOGGING¶
PURPOSE: structured logging in admin-ui
TECHNIQUE: pino or custom JSON logger
import pino from 'pino';
const logger = pino({ level: 'info' });
logger.info({
  correlationId: task.workItemId,
  agent: task.agentName,
  workType: task.workType,
}, 'Task dispatched to agent');
logger.error({
  correlationId: task.workItemId,
  err: error,
  workType: task.workType,
}, 'Task dispatch failed');
RULE: pass structured object as first argument, message as second
RULE: use err field for Error objects — pino serializes stack traces automatically
LOG_ANALYSIS:LOG_ROTATION¶
ROTATION_POLICY¶
PURPOSE: prevent log storage from growing unbounded
RULE: rotate logs by size (100MB per file) or time (daily), whichever comes first
RULE: retain 7 days of uncompressed logs for fast access
RULE: retain 30 days of compressed logs for operational queries
RULE: retain 90 days of archived logs for compliance (ISO 27001)
CHECK: is total log storage growing faster than expected?
IF: yes THEN: check for DEBUG/TRACE level enabled in production — disable
IF: yes THEN: check for high-frequency logging in a loop — fix the loop
TECHNIQUE: k8s container log rotation
# kubelet handles rotation automatically
# configure in kubelet config:
containerLogMaxSize: 100Mi
containerLogMaxFiles: 5
ANTI_PATTERN: no log rotation — disk fills up and pod crashes
FIX: always configure rotation — unbounded logs are a ticking time bomb
ANTI_PATTERN: rotating too aggressively (keep only 1 hour)
FIX: 1 hour is not enough for post-incident analysis — keep minimum 7 days
LOG_VOLUME_ESTIMATION¶
PURPOSE: predict storage needs and detect anomalous log volume
FORMULA:
daily_log_volume = avg_log_lines_per_second * 86400 * avg_bytes_per_line
monthly_storage = daily_log_volume * retention_days * compression_ratio
Example:
50 lines/sec * 86400 sec * 500 bytes = ~2.2 GB/day uncompressed
2.2 GB * 30 days * 0.1 (compression) = ~6.5 GB/month compressed
CHECK: actual log volume matches estimate?
IF: > 2x estimate THEN: log volume anomaly — investigate cause (debug logging, error storm)
IF: < 0.5x estimate THEN: logs may be silently lost — check collector health
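The formula and the anomaly thresholds from the CHECK above, as a small sketch:

```python
def daily_log_volume_bytes(lines_per_sec: float, avg_bytes_per_line: float) -> float:
    """daily_log_volume = avg_log_lines_per_second * 86400 * avg_bytes_per_line"""
    return lines_per_sec * 86400 * avg_bytes_per_line

def storage_bytes(daily_volume: float, retention_days: int,
                  compression_ratio: float = 0.1) -> float:
    """monthly_storage = daily_log_volume * retention_days * compression_ratio"""
    return daily_volume * retention_days * compression_ratio

def volume_anomaly(actual_daily: float, estimated_daily: float) -> str:
    """>2x estimate: debug logging or error storm. <0.5x: logs may be silently lost."""
    if actual_daily > 2 * estimated_daily:
        return "anomaly:high"
    if actual_daily < 0.5 * estimated_daily:
        return "anomaly:low"
    return "normal"
```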
LOG_ANALYSIS:SENSITIVE_DATA_REDACTION¶
WHAT_TO_REDACT¶
PURPOSE: prevent sensitive data from appearing in logs
MUST_REDACT:
API keys: replace with "***API_KEY***"
Passwords: replace with "***PASSWORD***"
Bearer tokens: replace with "Bearer ***TOKEN***"
Redis passwords: replace with "***REDIS_PASS***"
Database URLs: redact password portion of connection string
PII: email addresses, IP addresses (if applicable)
Session tokens: replace with first 8 chars + "***"
CHECK: do any log entries contain unredacted secrets?
TOOL: grep
RUN: grep -rn 'ANTHROPIC_API_KEY\|sk-ant-\|sk-proj-\|password=' /var/log/pods/ | head -5
IF: matches found THEN: CRITICAL — secrets in logs, fix redaction immediately
REDACTION_TECHNIQUES¶
TECHNIQUE: logger-level redaction (preferred)
import re

REDACTION_PATTERNS = [
    (re.compile(r'sk-ant-[a-zA-Z0-9_-]+'), '***ANTHROPIC_KEY***'),
    (re.compile(r'sk-proj-[a-zA-Z0-9_-]+'), '***OPENAI_KEY***'),
    (re.compile(r'password=\S+'), 'password=***REDACTED***'),
    (re.compile(r'Bearer\s+\S+'), 'Bearer ***TOKEN***'),
]

def redact(message: str) -> str:
    for pattern, replacement in REDACTION_PATTERNS:
        message = pattern.sub(replacement, message)
    return message
RULE: redaction runs BEFORE log emission — never write then redact
RULE: test redaction patterns against known secret formats
RULE: redaction must not break JSON structure
ANTI_PATTERN: redacting at aggregation time instead of emission time
FIX: secrets should never leave the process — redact at source
ANTI_PATTERN: logging full HTTP request headers (includes Authorization)
FIX: log only safe headers (Content-Type, Content-Length) — never Authorization
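One way to enforce the redact-before-emission rule with the stdlib is a logging.Filter that rewrites each record before any handler sees it. A sketch; the pattern list repeats a subset of the REDACTION_PATTERNS shapes above:

```python
import logging
import re

# Subset of the redaction patterns defined above, inlined for self-containment.
PATTERNS = [
    (re.compile(r'sk-ant-[a-zA-Z0-9_-]+'), '***ANTHROPIC_KEY***'),
    (re.compile(r'password=\S+'), 'password=***REDACTED***'),
    (re.compile(r'Bearer\s+\S+'), 'Bearer ***TOKEN***'),
]

class RedactionFilter(logging.Filter):
    """Rewrites record messages in-process, before emission to stdout."""

    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()
        for pattern, replacement in PATTERNS:
            msg = pattern.sub(replacement, msg)
        # Freeze the redacted message so handlers cannot re-interpolate args.
        record.msg, record.args = msg, ()
        return True
```

Attach with logger.addFilter(RedactionFilter()) so secrets never leave the process.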
LOG_ANALYSIS:SEARCH_PATTERNS¶
COMMON_SEARCH_QUERIES¶
PURPOSE: standard queries for operational investigation
FIND_ALL_ERRORS_FOR_AGENT:
FIND_TASK_LIFECYCLE:
FIND_ERROR_CLUSTERS:
FIND_SLOW_OPERATIONS:
FIND_COST_ANOMALIES:
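The query bodies above are store-specific and left unspecified. As an illustration only, the first of these searches expressed over raw JSON log lines in Python — field names follow JSON_LOG_FORMAT; the slow-operation threshold is an assumption:

```python
import json
from typing import Iterable, Iterator

def parse(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSON log lines, skipping anything malformed."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # a bad line should not fail the whole search

def errors_for_agent(entries, agent):
    return [e for e in entries
            if e.get("level") == "error" and e.get("agent") == agent]

def task_lifecycle(entries, correlation_id):
    # the full story of one task: every line sharing its correlation_id, in order
    return sorted(
        (e for e in entries if e.get("correlation_id") == correlation_id),
        key=lambda e: e.get("timestamp", ""),
    )

def slow_operations(entries, threshold_ms=1000):  # threshold is illustrative
    return [e for e in entries if e.get("duration_ms", 0) > threshold_ms]
```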
ERROR_FINGERPRINTING¶
PURPOSE: group similar errors together for pattern detection
RULE: fingerprint = sha256(normalized_error_class + normalized_message_template)
RULE: normalize before hashing — strip variable parts
NORMALIZATION_PIPELINE:
1. strip ANSI codes: s/\x1b\[[0-9;]*m//g
2. replace paths: s|/home/claude/[^\s:]+|<PATH>|g
3. replace session IDs: s/sess-[0-9]{8}-[0-9]{6}-[a-f0-9]{6}/<SESSION>/g
4. replace UUIDs: s/[a-f0-9]{8}(-[a-f0-9]{4}){3}-[a-f0-9]{12}/<UUID>/g
5. replace timestamps: s/\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}[^\s]*/<TS>/g
6. replace large numbers: s/\b\d{4,}\b/<N>/g
7. collapse whitespace: s/\s+/ /g
8. compute sha256 of result
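The eight steps above as a sketch, with the regexes transcribed from the pipeline:

```python
import hashlib
import re

# Ordered to match the pipeline: specific tokens replaced before generic ones.
_NORMALIZATION_STEPS = [
    (re.compile(r'\x1b\[[0-9;]*m'), ''),                                  # 1. ANSI codes
    (re.compile(r'/home/claude/[^\s:]+'), '<PATH>'),                       # 2. paths
    (re.compile(r'sess-[0-9]{8}-[0-9]{6}-[a-f0-9]{6}'), '<SESSION>'),      # 3. session IDs
    (re.compile(r'[a-f0-9]{8}(-[a-f0-9]{4}){3}-[a-f0-9]{12}'), '<UUID>'),  # 4. UUIDs
    (re.compile(r'\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}[^\s]*'), '<TS>'), # 5. timestamps
    (re.compile(r'\b\d{4,}\b'), '<N>'),                                    # 6. large numbers
    (re.compile(r'\s+'), ' '),                                             # 7. whitespace
]

def fingerprint(error_class: str, message: str) -> str:
    """fingerprint = sha256(normalized_error_class + normalized_message_template)"""
    normalized = f'{error_class}: {message}'
    for pattern, replacement in _NORMALIZATION_STEPS:
        normalized = pattern.sub(replacement, normalized)
    return hashlib.sha256(normalized.strip().encode()).hexdigest()      # 8. sha256
```

Two occurrences differing only in path, timestamp, and counters hash to the same fingerprint.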
CHECK: is the fingerprint matching too broadly (unrelated errors grouped)?
IF: yes THEN: normalization is too aggressive — keep more specific tokens
CHECK: is the fingerprint matching too narrowly (same error gets multiple fingerprints)?
IF: yes THEN: normalization is too conservative — strip more variable parts
ANTI_PATTERN: fingerprinting the raw error including timestamps and line numbers
FIX: normalize first — timestamps and line numbers change between occurrences
ANTI_PATTERN: separate fingerprints for the same error on different ports
FIX: replace port numbers with <PORT> unless port is the diagnostic signal
LOG_ANALYSIS:ERROR_CLASSIFICATION¶
ERROR_TAXONOMY¶
PURPOSE: classify errors by domain for targeted investigation and pattern detection
CATEGORIES:
infra:network — connection refused, timeout, DNS failure
infra:resource — OOM, disk full, CPU throttle
infra:permission — EACCES, EPERM, 403, RBAC denied
runtime:import — ModuleNotFoundError, cannot find module
runtime:type — TypeError, AttributeError, undefined
runtime:state — race condition, stale data, missing key
api:auth — 401, token expired, invalid credentials
api:validation — 400, Zod error (.issues not .errors), schema mismatch
api:ratelimit — 429, quota exceeded
cost:burn — session > $5, agent > $10/hr, daily > $100
loop:hook — hook re-trigger, chain depth exceeded
loop:infinite — MAX_TURNS hit, same tool called 10+ times
RULE: classify by root cause, not symptom
EXAMPLE: a 403 from missing RBAC is infra:permission, not api:auth
EXAMPLE: a timeout from Redis OOM is infra:resource, not infra:network
CHECK: does the error match multiple categories?
IF: yes THEN: assign primary by root cause, tag secondary categories
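A hedged sketch of a pattern-based classifier over this taxonomy. The pattern list is illustrative and incomplete (cost:* and loop:* omitted); ordering approximates the root-cause rule, e.g. resource patterns are checked before network so a Redis OOM timeout lands in infra:resource:

```python
import re

# First match wins, so order from most diagnostic to most generic.
CATEGORY_PATTERNS = [
    ("infra:resource", re.compile(r"OOM|out of memory|disk full|No space left", re.I)),
    ("infra:permission", re.compile(r"EACCES|EPERM|RBAC|403")),
    ("infra:network", re.compile(r"connection refused|timed? ?out|DNS|ECONNREFUSED", re.I)),
    ("runtime:import", re.compile(r"ModuleNotFoundError|cannot find module")),
    ("runtime:type", re.compile(r"TypeError|AttributeError|undefined")),
    ("api:ratelimit", re.compile(r"429|rate limit|quota exceeded", re.I)),
    ("api:auth", re.compile(r"401|token expired|invalid credentials", re.I)),
    ("api:validation", re.compile(r"400|schema mismatch|ZodError", re.I)),
]

def classify(error_class: str, message: str) -> str:
    """Return the primary category, or 'unclassified' if nothing matches."""
    text = f"{error_class}: {message}"
    for category, pattern in CATEGORY_PATTERNS:
        if pattern.search(text):
            return category
    return "unclassified"
```

Real classification should weigh error_class over message text and allow secondary tags, per the CHECK above.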