DOMAIN:MONITORING¶
OWNER: eltjo
UPDATED: 2026-03-24
SCOPE: cross-session learning pipeline, pattern detection, knowledge synthesis
AGENTS: eltjo (Cross-Session Learning Analyst), annegreet (Session Reporter)
MONITORING:CROSS_SESSION_PATTERN_DETECTION¶
PURPOSE: identify recurring error patterns across multiple agent sessions to extract reusable learnings
ERROR_FINGERPRINTING¶
RULE: fingerprint = sha256(normalized_error_class + normalized_message_template + triggering_file_extension)
RULE: ALWAYS strip variable parts before hashing (timestamps, session IDs, file paths, line numbers, UUIDs)
RULE: normalize error messages by replacing numeric values with <N>, paths with <PATH>, hashes with <HASH>
TECHNIQUE: regex-based normalization pipeline
1. strip ANSI escape codes: s/\x1b\[[0-9;]*m//g
2. replace absolute paths: s|/home/claude/[^\s:]+|<PATH>|g
3. replace session IDs: s/sess-[0-9]{8}-[0-9]{6}-[a-f0-9]{6}/<SESSION>/g
4. replace UUIDs: s/[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}/<UUID>/g
5. replace timestamps: s/\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}[^\s]*/<TIMESTAMP>/g
6. replace numeric literals: s/\b\d{4,}\b/<N>/g
7. collapse whitespace: s/\s+/ /g
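A minimal Python sketch of the seven passes above, applied in order, plus the fingerprint rule from this section. Regexes are copied from the steps; this is illustrative, not the shipped normalizer.

```python
import hashlib
import re

# Ordered normalization passes mirroring steps 1-7 above. Order matters:
# timestamps must be replaced before the long-numeric pass eats the year.
_PASSES = [
    (re.compile(r'\x1b\[[0-9;]*m'), ''),                  # 1. ANSI escape codes
    (re.compile(r'/home/claude/[^\s:]+'), '<PATH>'),      # 2. absolute paths
    (re.compile(r'sess-[0-9]{8}-[0-9]{6}-[a-f0-9]{6}'), '<SESSION>'),
    (re.compile(r'[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-'
                r'[a-f0-9]{4}-[a-f0-9]{12}'), '<UUID>'),
    (re.compile(r'\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}\S*'), '<TIMESTAMP>'),
    (re.compile(r'\b\d{4,}\b'), '<N>'),                   # 6. long numeric literals
    (re.compile(r'\s+'), ' '),                            # 7. collapse whitespace
]

def normalize(message: str) -> str:
    for pattern, repl in _PASSES:
        message = pattern.sub(repl, message)
    return message.strip()

def fingerprint(error_class: str, message: str, file_ext: str) -> str:
    # fingerprint = sha256(class + message template + extension), per the rule above
    template = normalize(message)
    return hashlib.sha256(f"{error_class}|{template}|{file_ext}".encode()).hexdigest()
```

Two errors that differ only in variable parts hash to the same fingerprint, which is the whole point of normalizing before hashing.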
CHECK: does the fingerprint match an existing pattern in session_learnings table?
IF: match found with confidence >= 0.8 THEN: skip re-analysis, increment hit_count
ELSE: create new pattern entry, set confidence = 0.1
ANTI_PATTERN: hashing the full raw error string including timestamps
FIX: normalize first, then hash the template only
ANTI_PATTERN: treating stack trace line numbers as part of the fingerprint
FIX: strip line numbers from stack traces before fingerprinting (code moves between commits)
ANTI_PATTERN: separate fingerprints for "Connection refused" on port 6381 vs 6379
FIX: replace port numbers with <PORT> unless port is the diagnostic signal itself
PATTERN_CLASSIFICATION¶
RULE: classify patterns by error domain before aggregation
CATEGORIES:
- infra:network — connection refused, DNS resolution, timeout
- infra:resource — OOM, disk full, CPU throttle
- infra:permission — EACCES, EPERM, 403, RBAC denied
- runtime:import — ModuleNotFoundError, cannot find module
- runtime:type — TypeError, AttributeError, undefined is not a function
- runtime:state — race condition, stale data, missing expected key
- api:auth — 401, token expired, invalid credentials
- api:validation — 400, Zod error, schema mismatch
- api:ratelimit — 429, quota exceeded
- cost:burn — session exceeded $5, agent exceeded $10/hr
- loop:hook — hook re-trigger, chain depth exceeded
- loop:infinite — MAX_TURNS hit, same tool called 10+ times
CHECK: does error match multiple categories?
IF: yes THEN: assign primary category by root cause, tag secondary categories
RULE: root cause wins — a 403 caused by missing RBAC is infra:permission not api:auth
CROSS_SESSION_AGGREGATION¶
RULE: aggregate on 3 time windows simultaneously
- HOT: last 1 hour — detect active incidents (same error 3+ times in 1h = active incident)
- WARM: last 24 hours — detect persistent problems
- COLD: last 7 days — detect chronic patterns
TOOL: query session_learnings table
SELECT fingerprint, COUNT(*) as occurrences,
COUNT(DISTINCT agent_name) as affected_agents,
MIN(created_at) as first_seen,
MAX(created_at) as last_seen,
array_agg(DISTINCT agent_name) as agents
FROM session_learnings
WHERE created_at > NOW() - INTERVAL '7 days'
GROUP BY fingerprint
HAVING COUNT(*) >= 3
ORDER BY occurrences DESC;
CHECK: is the same fingerprint hitting multiple agents?
IF: 3+ agents affected THEN: escalate to system-wide pattern, tag scope:system
IF: single agent only THEN: tag scope:agent:{name}, may be agent-specific misconfiguration
RULE: cross-agent patterns are 3x higher priority than single-agent patterns
CONFIDENCE_SCORING¶
RULE: confidence = f(validation_count, recency, agent_diversity, resolution_success_rate)
FORMULA:
base_confidence = min(1.0, occurrences / 10)
recency_factor = 1.0 if last_seen < 24h, 0.8 if < 7d, 0.5 if < 30d, 0.2 if > 30d
diversity_factor = min(1.0, unique_agents / 5)
resolution_factor = successful_resolutions / total_occurrences # 0.0 if never resolved
confidence = (base_confidence * 0.4) + (recency_factor * 0.2) + (diversity_factor * 0.2) + (resolution_factor * 0.2)
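The formula above, written out as a function. Recency is taken as days-since-last-seen for simplicity; weights and thresholds are exactly those stated.

```python
def score_confidence(occurrences: int, days_since_last_seen: float,
                     unique_agents: int, successful_resolutions: int) -> float:
    """Weighted confidence per the 0.4/0.2/0.2/0.2 formula above."""
    base = min(1.0, occurrences / 10)
    if days_since_last_seen < 1:
        recency = 1.0
    elif days_since_last_seen < 7:
        recency = 0.8
    elif days_since_last_seen < 30:
        recency = 0.5
    else:
        recency = 0.2
    diversity = min(1.0, unique_agents / 5)
    # 0.0 if never resolved, per the resolution_factor definition
    resolution = successful_resolutions / occurrences if occurrences else 0.0
    return round(base * 0.4 + recency * 0.2 + diversity * 0.2
                 + resolution * 0.2, 3)
```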
CHECK: confidence < 0.3 AND last_seen > 30 days?
IF: yes THEN: mark as STALE, exclude from JIT injection
IF: confidence >= 0.8 AND resolution_factor >= 0.7 THEN: promote to VALIDATED learning
RULE: STALE learnings are never deleted, only excluded from injection
RULE: a STALE learning can be revived if a new matching error occurs (recalculate confidence)
LEARNING_STALENESS_DETECTION¶
TRIGGERS for staleness review:
1. learning references a file path that no longer exists
2. learning references a dependency version that has been upgraded
3. learning has not been matched by any session in 60 days
4. codebase has been refactored in the area the learning covers (detected by git log on referenced files)
TOOL: staleness check query
SELECT id, symptom, solution, confidence, last_matched_at,
EXTRACT(DAY FROM NOW() - last_matched_at) as days_since_match
FROM knowledge_patterns
WHERE confidence > 0 AND (
last_matched_at < NOW() - INTERVAL '60 days'
OR last_matched_at IS NULL
)
ORDER BY confidence DESC;
RULE: do NOT auto-delete stale learnings — demote to confidence 0.1 and exclude from JIT
RULE: weekly staleness sweep (CronJob, not file watcher)
MONITORING:LEARNING_PIPELINE_ARCHITECTURE¶
TIER_1_AGENT_SELF_REPORT¶
WHAT: agent writes structured completion report at end of session
FORMAT: COMP-*.md files in agent outbox
FIELDS: task_id, outcome (success/failure/partial), errors_encountered[], tools_used[], files_modified[], cost, tokens, turns
RULE: executor writes COMP files — agent does not publish to Redis on completion
RULE: host cron syncs COMP files to DB via /api/system/sync-completions (1-min delay)
CHECK: is the COMP file well-formed?
IF: missing required fields THEN: log warning, still ingest partial data
ELSE: parse and insert into session_learnings table
TIER_2_POST_SESSION_ANALYSIS¶
WHAT: automated analysis of session transcript after completion
TOOL: session_summarizer.py — Claude Haiku, ~$0.01-0.03/session
WHEN: triggered by COMP file sync, runs inline
STRUCTURED_OUTPUT_FORMAT:
symptom: "exact error message or observable behavior"
context: "what the agent was doing when it happened"
tried:
- "first approach attempted"
- "second approach attempted"
failed_because: "root cause diagnosis"
solution: "what actually worked (or null if unresolved)"
confidence: 0.1-1.0
tags: ["infra:network", "scope:system"]
fingerprint: "sha256:abc123..."
RULE: ALWAYS use structured output format — free-text learnings are unsearchable
RULE: one learning per distinct error — do not bundle unrelated issues
RULE: cost cap: $0.05/session max for Tier 2 analysis. Skip if session had 0 errors.
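A minimal validation sketch for Tier 2 learning records, checking a required-field subset plus the confidence range and fingerprint prefix shown in the format above (the chosen required set is an assumption; `tried` and `solution` may legitimately be empty or null):

```python
# Required-field subset is an assumption; solution may be null per the format.
REQUIRED = {"symptom", "context", "failed_because", "confidence", "tags", "fingerprint"}

def validate_learning(rec: dict) -> list[str]:
    """Return a list of problems; empty list means the record is well-formed."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED - rec.keys())]
    conf = rec.get("confidence")
    if conf is not None and not (0.1 <= conf <= 1.0):
        problems.append("confidence out of range 0.1-1.0")
    if not str(rec.get("fingerprint", "")).startswith("sha256:"):
        problems.append("fingerprint must be sha256-prefixed")
    return problems
```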
ANTI_PATTERN: running Tier 2 analysis on every session regardless of outcome
FIX: skip analysis if session outcome=success AND cost < $1.00 AND turns < 20
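The two skip rules (zero-error sessions, and the uneventful-success FIX) combine into one gate. Field names mirror the COMP fields; the function signature is a sketch, not the shipped interface.

```python
def should_run_tier2(outcome: str, cost: float, turns: int,
                     error_count: int) -> bool:
    """Gate Tier 2 analysis per the rules above."""
    if error_count == 0:
        return False  # nothing to learn from — skip per the cost-cap rule
    if outcome == "success" and cost < 1.00 and turns < 20:
        return False  # unremarkable success — not worth $0.01-0.03 of Haiku
    return True
```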
TIER_3_CROSS_SESSION_PATTERNS¶
WHAT: knowledge_synthesizer.py aggregates Tier 2 learnings into system patterns
WHEN: 6-hourly CronJob
OUTPUT: updates knowledge_patterns table, feeds back to LEARNINGS.md files
PROCESS:
1. query all Tier 2 learnings from last 24 hours
2. group by fingerprint
3. for each group with 3+ occurrences: create/update knowledge_pattern
4. recalculate confidence scores for all active patterns
5. identify newly VALIDATED patterns (confidence crossed 0.8 threshold)
6. write validated patterns to wiki brain: wiki/docs/learnings/patterns/
7. flag STALE patterns (see staleness detection above)
RULE: Tier 3 NEVER calls an LLM for individual error analysis — only for synthesis summaries
RULE: synthesis budget: $0.10 max per 6-hour cycle
JIT_INJECTION¶
WHAT: selecting relevant learnings to inject into agent system prompt at boot time
WHEN: agent receives task from triggers.{agent} stream, executor prepares prompt
SELECTION_ALGORITHM:
1. extract keywords from task description + work_type
2. query knowledge_patterns WHERE confidence >= 0.5 AND status != 'STALE'
3. rank by: keyword_match_score * 0.4 + confidence * 0.3 + recency * 0.2 + agent_relevance * 0.1
4. take top 5 learnings (token budget: ~500 tokens max)
5. format as LEARNING blocks in system prompt
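Steps 1-4 as a sketch. The `Pattern` shape, the precomputed `recency`/`agent_relevance` scores, and the keyword matcher are assumptions about the knowledge_patterns row; the weights, 0.5 confidence floor, STALE exclusion, 2x own-agent boost, and top-5 cap come from this section.

```python
from dataclasses import dataclass

@dataclass
class Pattern:
    symptom: str
    confidence: float
    recency: float          # 0.0-1.0, derived from last_seen (assumption)
    agent_relevance: float  # 0.0-1.0 tag/agent affinity (assumption)
    own_agent: bool = False
    stale: bool = False

def keyword_match_score(keywords: list[str], p: Pattern) -> float:
    text = p.symptom.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return min(1.0, hits / max(1, len(keywords)))

def select_learnings(keywords: list[str], patterns: list[Pattern],
                     top_n: int = 5) -> list[Pattern]:
    # confidence >= 0.5 AND status != 'STALE', per the query in step 2
    eligible = [p for p in patterns if p.confidence >= 0.5 and not p.stale]
    def score(p: Pattern) -> float:
        s = (keyword_match_score(keywords, p) * 0.4 + p.confidence * 0.3
             + p.recency * 0.2 + p.agent_relevance * 0.1)
        return s * 2 if p.own_agent else s  # agent's own learnings: 2x boost
    return sorted(eligible, key=score, reverse=True)[:top_n]
```

Note how the boost lets a weaker own-agent learning outrank a stronger system-wide one.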
CHECK: is the injected learning relevant to the task type?
IF: task is code_review THEN: prioritize learnings tagged runtime:* and api:*
IF: task is infrastructure THEN: prioritize learnings tagged infra:*
IF: task is security_audit THEN: prioritize learnings tagged with security domains
RULE: max 5 learnings per injection — more causes context dilution
RULE: max 500 tokens total for injected learnings — agents have limited context budget
RULE: agent's own learnings get 2x weight boost (they learned it themselves)
RULE: NEVER inject learnings with confidence < 0.5 — false positives poison agent behavior
ANTI_PATTERN: injecting all 200+ learnings into every agent prompt
FIX: JIT selection by task relevance, hard cap at 5
ANTI_PATTERN: injecting learnings from 6 months ago about a bug that was fixed
FIX: staleness detection excludes learnings that reference resolved issues
PIPELINE_HEALTH_METRICS¶
MONITOR: reporting_rate = sessions_with_COMP / total_sessions (target: >95%)
MONITOR: learning_extraction_rate = sessions_with_learnings / sessions_with_errors (target: >80%)
MONITOR: learning_reuse_rate = injections_matched_to_task / total_injections (target: >60%)
MONITOR: false_positive_rate = learnings_marked_invalid / total_learnings (target: <10%)
MONITOR: staleness_ratio = stale_patterns / total_patterns (target: <30%)
MONITOR: tier2_cost_per_session = total_tier2_cost / sessions_analyzed (target: <$0.03)
MONITOR: tier3_cost_per_cycle = total_tier3_cost / cycles_run (target: <$0.10)
CHECK: reporting_rate < 90%?
IF: yes THEN: executor may be failing to write COMP files — check pod logs and disk space
CHECK: false_positive_rate > 20%?
IF: yes THEN: fingerprint normalization is over-aggressive — distinct errors are collapsing into one pattern; tighten the normalization regexes
MONITORING:GE_SPECIFIC_INTEGRATION¶
REDIS_STREAM_MONITORING¶
TOOL: redis-cli -p 6381 -a $REDIS_PASSWORD
RUN: XINFO STREAM triggers.{agent} — check length, first/last entry
RUN: XINFO GROUPS triggers.{agent} — check consumer lag
RUN: XLEN ge:work:incoming — system stream depth
CHECK: stream length > MAXLEN (100 for per-agent, 1000 for system)?
IF: yes THEN: XADD without MAXLEN is happening — CRITICAL, find and fix the source
CHECK: consumer lag > 50 entries?
IF: yes THEN: executor is falling behind — check pod health, consider scaling (max 5 replicas)
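The two checks above can be evaluated from XINFO output with a pure function, keeping the Redis call separate. The `lag` field in XINFO GROUPS output requires Redis 7+; the thresholds (MAXLEN and lag 50) come from the checks above.

```python
def stream_alerts(stream_len: int, maxlen: int, groups: list[dict]) -> list[str]:
    """Evaluate stream health from XLEN and XINFO GROUPS results."""
    alerts = []
    if stream_len > maxlen:
        alerts.append(f"CRITICAL: stream length {stream_len} > MAXLEN {maxlen}"
                      " — an XADD without MAXLEN is in play, find the source")
    for g in groups:
        lag = g.get("lag") or 0  # 'lag' reported by Redis 7+ XINFO GROUPS
        if lag > 50:
            alerts.append(f"group {g['name']} lag {lag} > 50"
                          " — executor falling behind, check pod health")
    return alerts

# Feeding it from redis-py would look roughly like (assumption, not shown live):
#   r = redis.Redis(port=6381, password=os.environ["REDIS_PASSWORD"])
#   stream_alerts(r.xlen("triggers.eltjo"), 100, r.xinfo_groups("triggers.eltjo"))
```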
COST_MONITORING¶
TOOL: query session_learnings for cost aggregation
SELECT agent_name,
SUM(cost) as total_cost,
COUNT(*) as session_count,
AVG(cost) as avg_cost,
MAX(cost) as max_cost
FROM session_learnings
WHERE created_at > NOW() - INTERVAL '1 hour'
GROUP BY agent_name
ORDER BY total_cost DESC;
CHECK: any agent > $10/hr?
IF: yes THEN: cost_gate should have blocked this — check ge_agent/execution/cost_gate.py enforcement
CHECK: system total > $100/day?
IF: yes THEN: HALT all non-critical work, escalate to mira
HOOK_LOOP_DETECTION¶
CHECK: Redis sorted set ge:hook:graph:* for cycles
RUN: redis-cli -p 6381 -a $REDIS_PASSWORD ZRANGEBYSCORE ge:hook:graph:{agent} -inf +inf WITHSCORES
IF: same agent appears as both trigger source and target within 30-min window THEN: potential loop
IF: hook_origin_depth >= 2 THEN: chain too deep, hooks.py should block
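The 30-minute source/target overlap check, sketched over already-parsed edges. How trigger edges are encoded in the ge:hook:graph sorted set members is an assumption here; this takes (source, target, score) tuples after parsing.

```python
def find_potential_loops(edges: list[tuple[str, str, float]],
                         window_s: float = 1800) -> set[str]:
    """edges: (source_agent, target_agent, unix_ts score from the sorted set).
    Returns agents appearing as both trigger source and target within the
    window — potential loops per the check above."""
    suspects = set()
    for src, _, t1 in edges:
        for _, dst, t2 in edges:
            if src == dst and abs(t1 - t2) <= window_s:
                suspects.add(src)
    return suspects
```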
ANTI_PATTERN: monitoring agent triggers another monitoring agent
FIX: monitoring_agent_isolation in hooks.py blocks annegreet/eltjo/victoria/nessa/mira from cross-triggering