DOMAIN:MONITORING — PATTERN_DETECTION¶
OWNER: eltjo, annegreet
ALSO_USED_BY: ron, mira, nessa
UPDATED: 2026-03-26
SCOPE: cross-session pattern recognition, error clustering, anomaly detection, trend analysis, learning extraction, confidence scoring
PATTERN_DETECTION:CORE_PRINCIPLE¶
PURPOSE: automatically identify recurring error patterns, systemic issues, and improvement opportunities from aggregated session data
RULE: pattern detection operates on aggregated data, not individual sessions
RULE: patterns must reach a confidence threshold before being treated as knowledge
RULE: patterns can be validated, stale, or invalidated — never blindly trusted
RULE: human review is required before patterns influence agent behavior (JIT injection)
CHECK: is the pattern detection pipeline producing new patterns?
IF: no patterns in 7 days THEN: pipeline may be broken — check data flow
IF: producing patterns THEN: review confidence distribution — a flood of low-confidence patterns is noise
PATTERN_DETECTION:CROSS_SESSION_RECOGNITION¶
FINGERPRINT_BASED_GROUPING¶
PURPOSE: group related errors across different agent sessions using fingerprints
TECHNIQUE:
1. each session produces structured learnings with error fingerprints
2. fingerprint = sha256(normalized_error_class + normalized_message_template)
3. group all learnings by fingerprint
4. count occurrences, unique agents, time span
5. groups with 3+ occurrences become candidate patterns
TOOL: SQL aggregation on session_learnings
SELECT
fingerprint,
COUNT(*) as occurrences,
COUNT(DISTINCT agent_name) as affected_agents,
array_agg(DISTINCT agent_name) as agents,
MIN(created_at) as first_seen,
MAX(created_at) as last_seen,
MODE() WITHIN GROUP (ORDER BY error_class) as primary_error_class
FROM session_learnings
WHERE created_at > NOW() - INTERVAL '7 days'
AND fingerprint IS NOT NULL
GROUP BY fingerprint
HAVING COUNT(*) >= 3
ORDER BY occurrences DESC;
CHECK: are fingerprints grouping correctly?
IF: same error gets different fingerprints THEN: normalization too narrow — strip more variable parts
IF: different errors get same fingerprint THEN: normalization too broad — keep more specific tokens
RULE: review fingerprint quality monthly — adjust normalization rules as needed
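The fingerprinting technique above can be sketched in Python. This is a minimal illustration: the normalization substitutions shown are assumptions, not the production rule set (which is whatever the monthly fingerprint-quality review converges on).

```python
import hashlib
import re

def normalize_message(message: str) -> str:
    """Strip variable parts so the same error yields the same template.
    These substitutions are illustrative examples, not the real rules."""
    msg = re.sub(r"0x[0-9a-fA-F]+", "<hex>", message)  # memory addresses
    msg = re.sub(r"\d+", "<num>", msg)                  # counts, ids, ports
    msg = re.sub(r"'[^']*'", "'<val>'", msg)            # quoted values
    return msg.strip().lower()

def fingerprint(error_class: str, message: str) -> str:
    """fingerprint = sha256(normalized_error_class + normalized_message_template)"""
    normalized = error_class.strip().lower() + "|" + normalize_message(message)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Two occurrences of the same error with different hosts or durations should hash identically; a different error class must not.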
SCOPE_CLASSIFICATION¶
PURPOSE: determine whether a pattern is system-wide or agent-specific
RULE: 3+ unique agents affected = system-wide pattern (tag scope:system)
RULE: single agent only = agent-specific pattern (tag scope:agent:{name})
RULE: 2 agents = inconclusive — wait for more data before classifying
CHECK: is the pattern scope correctly assigned?
IF: system-wide pattern caused by single root cause THEN: correct
IF: agent-specific patterns share same fingerprint THEN: may be system-wide — check normalization
RULE: system-wide patterns are 3x higher priority than agent-specific patterns
RULE: system-wide patterns are candidates for infrastructure fixes
RULE: agent-specific patterns are candidates for agent configuration changes
PATTERN_DETECTION:ERROR_CLUSTERING¶
TEMPORAL_CLUSTERING¶
PURPOSE: detect bursts of errors that indicate an active incident
TECHNIQUE: sliding window count
1. query errors in last 1 hour, grouped by 5-minute buckets
2. compute mean error rate per bucket
3. if any bucket exceeds 3x the mean: temporal cluster detected
4. temporal cluster = likely active incident
TOOL: SQL bucketed aggregation (5-minute buckets via date_bin, PostgreSQL 14+; window functions cannot appear in HAVING, so aggregate per bucket in a CTE first)
WITH buckets AS (
SELECT
date_bin('5 minutes', created_at, TIMESTAMP '2000-01-01') as bucket,
COUNT(*) as error_count
FROM session_learnings
WHERE level = 'error'
AND created_at > NOW() - INTERVAL '1 hour'
GROUP BY bucket
)
SELECT
bucket,
error_count,
(SELECT AVG(error_count) FROM buckets) as mean_rate
FROM buckets
WHERE error_count > 3 * (SELECT AVG(error_count) FROM buckets)
ORDER BY bucket;
CHECK: is there a temporal cluster right now?
IF: yes THEN: active incident — correlate with recent deployments or config changes
IF: no THEN: normal error rate — continue routine monitoring
AGENT_CLUSTERING¶
PURPOSE: detect when multiple agents experience the same class of error simultaneously
TECHNIQUE:
1. query errors in last 30 minutes, grouped by error_class
2. for each error_class: count unique agents affected
3. if 3+ agents share the same error_class: agent cluster detected
4. agent cluster = likely infrastructure issue (not agent misconfiguration)
CHECK: are multiple agents failing with the same error?
IF: yes THEN: shared dependency failure — investigate infrastructure (Redis, DB, network)
IF: no THEN: isolated issue — investigate individual agent or task
RULE: agent clustering overrides individual agent alerts — suppress agent alerts, emit cluster alert
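The agent-clustering steps above can be sketched as a small grouping pass — illustrative only; the function name and input shape are assumptions.

```python
from collections import defaultdict

def detect_agent_clusters(errors, min_agents=3):
    """errors: iterable of (agent_name, error_class) from the last 30 minutes.
    Returns {error_class: sorted agent list} for classes shared by
    min_agents+ distinct agents — a likely infrastructure issue."""
    agents_by_class = defaultdict(set)
    for agent, error_class in errors:
        agents_by_class[error_class].add(agent)
    return {cls: sorted(agents)
            for cls, agents in agents_by_class.items()
            if len(agents) >= min_agents}
```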
CATEGORY_CLUSTERING¶
PURPOSE: detect when errors from the same category spike across the system
CATEGORIES (from error taxonomy):
infra:* — infrastructure failures
runtime:* — code-level errors
api:* — API communication issues
cost:* — budget violations
loop:* — infinite loop / hook loop
CHECK: is one error category dominating the last hour?
IF: > 70% of errors are in one category THEN: systematic issue in that domain
IF: errors spread across categories THEN: multiple unrelated issues or noisy period
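The 70% dominance check above, sketched over top-level category prefixes (hypothetical helper; the taxonomy prefixes are the ones listed above):

```python
from collections import Counter

def dominant_category(error_categories, threshold=0.7):
    """error_categories: 'category:subcategory' tags from the last hour.
    Returns the top-level category if it accounts for more than
    `threshold` of all errors, else None (spread = multiple issues)."""
    if not error_categories:
        return None
    top_level = [c.split(":", 1)[0] for c in error_categories]
    category, count = Counter(top_level).most_common(1)[0]
    return category if count / len(top_level) > threshold else None
```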
PATTERN_DETECTION:ANOMALY_DETECTION¶
BASELINE_DEVIATION¶
PURPOSE: detect when current behavior significantly differs from historical baseline
TECHNIQUE: statistical baseline
1. compute 7-day rolling average for key metrics:
- errors per hour
- session duration
- cost per session
- task completion rate
2. compute standard deviation
3. current value > mean + 2*stddev = anomaly (warning)
4. current value > mean + 3*stddev = anomaly (critical)
METRICS_TO_MONITOR:
error_rate_per_hour: sudden increase = incident
avg_session_duration: sudden increase = performance regression
avg_cost_per_session: sudden increase = cost burn
task_completion_rate: sudden decrease = systemic failure
avg_tokens_per_session: sudden increase = prompt bloat or loop
CHECK: any metric outside 2-sigma?
IF: yes THEN: anomaly detected — investigate
IF: all within bounds THEN: system operating normally
ANTI_PATTERN: setting static thresholds for anomaly detection
FIX: use statistical baselines — static thresholds do not adapt to system growth
ANTI_PATTERN: anomaly detection on less than 7 days of data
FIX: insufficient baseline data produces false positives — wait for enough history
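The 2-sigma / 3-sigma rules above can be expressed directly (illustrative sketch; `history` stands in for the 7-day rolling sample of one metric):

```python
import statistics

def classify_anomaly(history, current):
    """history: 7 days of hourly samples for one metric; current: latest value.
    mean + 2*stddev = warning, mean + 3*stddev = critical."""
    mean = statistics.mean(history)
    stddev = statistics.pstdev(history)  # population stddev of the baseline
    if current > mean + 3 * stddev:
        return "critical"
    if current > mean + 2 * stddev:
        return "warning"
    return "normal"
```

This is the statistical baseline the ANTI_PATTERN warns about: the thresholds move with the data instead of being hard-coded.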
CHANGE_POINT_DETECTION¶
PURPOSE: detect when a metric permanently shifts to a new level (not a spike, a regime change)
TECHNIQUE:
1. compare mean of last 24 hours vs previous 7-day mean
2. if shift > 2*stddev AND sustained for > 6 hours: change point
3. change point = permanent shift, not transient anomaly
4. investigate: was there a deployment, config change, or dependency upgrade?
CHECK: has error rate permanently shifted up after last deployment?
IF: yes THEN: deployment introduced a regression — rollback or fix
IF: no (temporary spike that recovered) THEN: transient issue, monitor
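The change-point technique above, as a sketch for an upward shift (e.g. error rate; mirror the comparison for metrics that regress downward). Function name and input shape are assumptions.

```python
import statistics

def is_change_point(baseline_hourly, recent_hourly, sustained_hours=6):
    """baseline_hourly: previous 7 days of hourly values.
    recent_hourly: last 24 hours, oldest first.
    Change point = last-24h mean shifted > 2*stddev above baseline AND the
    last `sustained_hours` samples all stayed above the 2-sigma line."""
    mean = statistics.mean(baseline_hourly)
    stddev = statistics.pstdev(baseline_hourly)
    threshold = mean + 2 * stddev
    shifted = statistics.mean(recent_hourly) > threshold
    sustained = all(v > threshold for v in recent_hourly[-sustained_hours:])
    return shifted and sustained
```

A spike that recovers fails the `sustained` check and is treated as a transient anomaly, not a regime change.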
PATTERN_DETECTION:TREND_ANALYSIS¶
SHORT_TERM_TRENDS¶
PURPOSE: detect deteriorating or improving patterns over hours/days
TECHNIQUE: linear regression on hourly error counts
1. collect hourly error counts for last 48 hours
2. fit linear regression: error_count = slope * hour + intercept
3. if slope > 0 (statistically significant): deteriorating trend
4. if slope < 0 (statistically significant): improving trend
5. if slope ~ 0: stable
CHECK: is the error trend deteriorating?
IF: positive slope > 1 error/hour increase THEN: warn — investigate before it becomes critical
IF: stable or improving THEN: no action needed
LONG_TERM_TRENDS¶
PURPOSE: detect systemic improvements or degradations over weeks/months
METRICS_FOR_LONG_TERM:
weekly_error_count: decreasing = system maturing
weekly_mttr: decreasing = operations improving
weekly_cost_per_task: decreasing = efficiency improving
weekly_learning_reuse: increasing = knowledge pipeline working
weekly_false_positive_rate: decreasing = fingerprinting improving
RULE: report long-term trends in weekly SLA report
RULE: deteriorating long-term trends trigger planning discussions, not immediate alerts
PATTERN_DETECTION:LEARNING_EXTRACTION¶
FROM_PATTERNS_TO_LEARNINGS¶
PURPOSE: promote validated patterns into actionable learnings for agent JIT injection
PROMOTION_CRITERIA:
1. confidence >= 0.8 (see confidence scoring below)
2. resolution_rate >= 0.7 (pattern has known solution that works)
3. recency: last_seen within 30 days
4. diversity: affects 3+ agents OR is critical severity
CHECK: does the pattern meet all promotion criteria?
IF: all met THEN: promote to VALIDATED learning in knowledge_patterns table
IF: not met THEN: keep as candidate — continue collecting data
RULE: promoted learnings are written to wiki at wiki/docs/learnings/patterns/
RULE: promoted learnings are eligible for JIT injection into agent prompts
RULE: max 5 learnings injected per session (500 token budget)
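The four promotion criteria above, as one predicate (illustrative; the `pattern` dict keys mirror the fields used by the criteria, and the helper name is hypothetical):

```python
from datetime import date, timedelta

def meets_promotion_criteria(pattern, today=None):
    """All four criteria must hold: confidence >= 0.8,
    resolution_rate >= 0.7, last_seen within 30 days,
    and 3+ agents affected OR critical severity."""
    today = today or date.today()
    return (pattern["confidence"] >= 0.8
            and pattern["resolution_rate"] >= 0.7
            and (today - pattern["last_seen"]) <= timedelta(days=30)
            and (len(pattern["agents_affected"]) >= 3
                 or pattern["severity"] == "critical"))
```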
LEARNING_FORMAT¶
PURPOSE: structured format for learnings that can be injected into agent context
FORMAT:
symptom: "exact error message template (normalized)"
context: "what the agent was typically doing when this occurs"
root_cause: "underlying reason for the error"
solution: "specific steps to resolve"
prevention: "what to do differently to avoid this"
confidence: 0.0-1.0
tags: ["category:subcategory", "scope:system|agent:name"]
agents_affected: ["boris", "gerco", "thijmen"]
first_seen: "2026-03-01"
last_seen: "2026-03-25"
occurrences: 47
RULE: one learning per distinct root cause
RULE: solution field must be specific and actionable
RULE: learnings with null solution are still valuable (known issues without fix)
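A minimal validator sketch for the format above — the field set comes from the FORMAT block; the validator itself is hypothetical. Note that `solution` may be null (known issue without a fix) but must still be present.

```python
REQUIRED_FIELDS = {"symptom", "context", "root_cause", "solution",
                   "prevention", "confidence", "tags", "agents_affected",
                   "first_seen", "last_seen", "occurrences"}

def validate_learning(learning: dict) -> list[str]:
    """Return a list of problems; an empty list means valid."""
    problems = [f"missing field: {f}"
                for f in sorted(REQUIRED_FIELDS - learning.keys())]
    conf = learning.get("confidence")
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        problems.append("confidence must be in [0.0, 1.0]")
    return problems
```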
PATTERN_DETECTION:CONFIDENCE_SCORING¶
CONFIDENCE_FORMULA¶
PURPOSE: quantify how trustworthy a detected pattern is
FORMULA:
base_confidence = min(1.0, occurrences / 10)
recency_factor = 1.0 if last_seen < 24h
0.8 if last_seen < 7d
0.5 if last_seen < 30d
0.2 if last_seen > 30d
diversity_factor = min(1.0, unique_agents / 5)
resolution_factor = successful_resolutions / total_occurrences
confidence = (base_confidence * 0.4) +
(recency_factor * 0.2) +
(diversity_factor * 0.2) +
(resolution_factor * 0.2)
RULE: confidence is recalculated on every new occurrence
RULE: confidence < 0.3 AND last_seen > 30 days = STALE
RULE: confidence >= 0.8 AND resolution >= 0.7 = VALIDATED
RULE: STALE learnings excluded from JIT injection but never deleted
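The formula above, implemented term by term (the function name is hypothetical; weights and breakpoints are exactly those in the formula):

```python
def pattern_confidence(occurrences, hours_since_last_seen, unique_agents,
                       successful_resolutions):
    """confidence = 0.4*base + 0.2*recency + 0.2*diversity + 0.2*resolution"""
    base = min(1.0, occurrences / 10)
    if hours_since_last_seen < 24:          # < 24h
        recency = 1.0
    elif hours_since_last_seen < 24 * 7:    # < 7d
        recency = 0.8
    elif hours_since_last_seen < 24 * 30:   # < 30d
        recency = 0.5
    else:                                   # > 30d
        recency = 0.2
    diversity = min(1.0, unique_agents / 5)
    resolution = successful_resolutions / occurrences
    return base * 0.4 + recency * 0.2 + diversity * 0.2 + resolution * 0.2
```

A pattern with 10+ occurrences, seen today, across 5 agents, with a 70% resolution rate scores 0.94 — well past the VALIDATED threshold; an old single-agent pattern with no resolutions lands near the STALE line.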
STALENESS_DETECTION¶
TRIGGERS for staleness review:
1. learning references a file path that no longer exists
2. learning references a dependency version that has been upgraded
3. learning has not matched any session error in 60 days
4. codebase area the learning covers has been refactored (git log check)
TOOL: staleness sweep query
SELECT id, symptom, solution, confidence, last_matched_at,
EXTRACT(DAY FROM NOW() - last_matched_at) as days_since_match
FROM knowledge_patterns
WHERE confidence > 0
AND (last_matched_at < NOW() - INTERVAL '60 days'
OR last_matched_at IS NULL)
ORDER BY confidence DESC;
RULE: do NOT delete stale learnings — demote to confidence 0.1
RULE: stale learnings can be revived if a new matching error occurs
RULE: weekly staleness sweep (CronJob, not file watcher — NEVER use file watchers)
ANTI_PATTERN: keeping confidence scores from initial detection forever
FIX: recalculate on every match — confidence should reflect current relevance
ANTI_PATTERN: treating confidence as binary (trusted/not trusted)
FIX: confidence is a spectrum — use thresholds for different decisions