# Pitfalls in Agentic Engineering

## Why This Page Exists
Every entry on this page represents a real failure that GE experienced, a failure documented in academic literature, or both. Multi-agent systems fail at rates between 41% and 86.7% in research settings. In production, the failures are more expensive. This page documents the most dangerous failure modes, their symptoms, and their mitigations.
RULE: Every agent should receive relevant entries from this page as JIT knowledge when working in the corresponding domain. Prevention is cheaper than repair.
## 1. Token Burn

### What It Is
Uncontrolled consumption of tokens through system bugs, feedback loops, or misconfiguration. Token burn is the most financially dangerous failure mode in agentic systems because it happens at machine speed — a feedback loop can consume hundreds of dollars in minutes before anyone notices.
### Forms of Token Burn
File watcher feedback loops. A file watcher detects a change made by an agent. The watcher triggers a re-read or re-process. The processing modifies a file. The watcher detects the modification. The loop continues at machine speed.
GE INCIDENT: File watchers (chokidar) caused $100/hour token burn. Permanently banned from the codebase.
ANTI_PATTERN: Any file watcher in a system that also writes files. FIX: Use polling, event-driven triggers, or cron-based checks. Never use reactive file watchers in production.
Hook chain loops. Post-completion hooks trigger new work. That work completes and triggers more hooks. Without depth limiting, this creates an infinite chain of agent executions.
GE INCIDENT: Monitoring agents Annegreet and Eltjo triggered hooks on each other's completions. The loop consumed multiple agent sessions before the cost gate intervened.
ANTI_PATTERN: Post-completion hook with condition "always" at no_block tier.
FIX: hook_origin_depth field, maximum depth of 2, per-agent rate limit of 20 hooks/hour.
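A minimal sketch of that fix, assuming an in-process gate with the depth and rate values named above (the real hooks.py may structure this differently):

```python
import time
from collections import defaultdict

MAX_HOOK_DEPTH = 2    # maximum depth from the fix above
HOOKS_PER_HOUR = 20   # per-agent rate limit from the fix above

_hook_times = defaultdict(list)  # agent -> timestamps of recent hook firings

def may_fire_hook(agent, hook_origin_depth, now=None):
    """Gate a hook firing on chain depth and a per-agent hourly rate limit."""
    now = now if now is not None else time.time()
    if hook_origin_depth >= MAX_HOOK_DEPTH:
        return False  # the chain is already at maximum depth
    recent = [t for t in _hook_times[agent] if now - t < 3600]
    _hook_times[agent] = recent
    if len(recent) >= HOOKS_PER_HOUR:
        return False  # agent exhausted its hourly hook budget
    _hook_times[agent].append(now)
    return True
```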
Double delivery. The same work item is published to multiple streams, causing it to be executed multiple times.
GE INCIDENT: Task service published to both triggers.{agent} and ge:work:incoming. The orchestrator picked up from ge:work:incoming and routed to triggers.{agent}. Every task executed twice.
ANTI_PATTERN: Publishing the same task to multiple streams. FIX: Publish to exactly one stream. Use work_item_id deduplication with a 5-minute window as a safety net.
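The deduplication safety net can be sketched as a TTL map keyed by work_item_id (illustrative only, not the task service's actual code):

```python
import time

DEDUP_WINDOW_SECONDS = 300  # the 5-minute safety net from the fix above

_seen = {}  # work_item_id -> time first seen

def is_duplicate(work_item_id, now=None):
    """True if this work item was already delivered within the dedup window."""
    now = now if now is not None else time.time()
    # Drop expired entries so the map does not grow without bound.
    for wid in [w for w, t in _seen.items() if now - t >= DEDUP_WINDOW_SECONDS]:
        del _seen[wid]
    if work_item_id in _seen:
        return True
    _seen[work_item_id] = now
    return False
```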
Unbounded streams. Redis streams without MAXLEN grow until they consume all available memory. When memory is exhausted, Redis begins evicting data, which can corrupt the stream state.
ANTI_PATTERN: `XADD streamname * field value` without MAXLEN.
FIX: Always use `XADD streamname MAXLEN ~ 100 * field value` (per-agent) or `MAXLEN ~ 1000` (system).
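The trimming behavior can be illustrated in-process with a bounded buffer: once the cap is reached, the oldest entries are discarded and the stream can never grow without bound. This is a stand-in for the Redis semantics, not Redis itself:

```python
from collections import deque

# In-process stand-in for a capped stream: like `MAXLEN ~ 100`, old
# entries are evicted once the cap is reached.
def make_capped_stream(maxlen=100):
    return deque(maxlen=maxlen)

def xadd_capped(stream, entry):
    stream.append(entry)  # deque silently evicts the oldest entry at capacity
    return len(stream)
```

With the redis-py client, the same cap is expressed as `r.xadd(name, fields, maxlen=100, approximate=True)`; the `approximate=True` flag corresponds to the `~` in the raw command.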
Subagent Haiku burn. Claude Code uses Haiku internally for subagent processing. Launching a broad exploratory subagent task generates dozens of Haiku requests with 50K-63K input tokens each, consuming significant budget in seconds.
GE INCIDENT: 35 tool calls in an Explore subagent produced 10 Haiku requests totaling over 500K input tokens in 12 seconds.
ANTI_PATTERN: Using Task(subagent_type=Explore) for broad searches.
FIX: Use direct Glob/Grep/Read calls. Only use Explore subagents for genuinely necessary deep research, and warn the human first.
### Detection
- Monitor total token consumption per minute. Spikes above 3x normal indicate potential burn.
- The cost gate (`cost_gate.py`) enforces per-session ($5), per-agent/hour ($10), and daily ($100) limits.
- The safety verification script (`scripts/verify-executor-safety.sh`) checks all burn prevention mechanisms.
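A minimal sketch of the limit check, assuming cost_gate.py tracks dollar spend per scope (the real implementation may differ):

```python
# Limits from the detection notes above; structure is illustrative.
LIMITS = {"session": 5.0, "agent_hour": 10.0, "daily": 100.0}

def cost_gate_allows(spend):
    """spend maps each scope ('session', 'agent_hour', 'daily') to dollars
    already consumed; deny as soon as any scope reaches its limit."""
    return all(spend.get(scope, 0.0) < limit for scope, limit in LIMITS.items())
```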
## 2. Hallucination Patterns

### What It Is
LLMs generating confident, plausible, but incorrect output. In agentic systems, hallucinations are especially dangerous because they propagate through the pipeline — downstream agents treat hallucinated output as fact.
### Forms of Hallucination
Confident wrong answers. The model states something as fact that is incorrect. There is no hedging, no uncertainty marker, no indication that the answer might be wrong. Research shows that LLM confidence scores are poorly calibrated — a model that says "I'm confident this is correct" provides almost no information about actual correctness.
ANTI_PATTERN: Trusting agent self-assessment of correctness. FIX: Independent verification. Contract checks. Test execution. Never ask "is this right?" — ask the test suite.
Plausible but incorrect code. The code looks syntactically correct, follows reasonable patterns, and might even pass a cursory review. But it contains logical errors: wrong comparison operators, off-by-one errors, incorrect null handling, or security vulnerabilities. Research found that 29-45% of AI-generated code contains security vulnerabilities.
ANTI_PATTERN: Code review by the same agent (or same model) that generated the code. FIX: Independent review (different agent, different model). Deterministic quality gates (type checking, linting). Test execution. Mutation testing.
Phantom dependencies. The model generates code that imports libraries that do not exist, calls APIs that are not available, or references configuration that has not been defined. Nearly 20% of LLM package recommendations reference non-existent libraries.
ANTI_PATTERN: Generating dependency declarations without verifying they resolve. FIX: Contract checks that verify all imports resolve. CI that fails on unresolvable dependencies.
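One such contract check can be written with the standard library: resolve every declared import in the current environment and fail on any that is missing:

```python
import importlib.util

def unresolvable_imports(module_names):
    """Contract check: return the declared imports that do not resolve in the
    current environment, so CI can fail before a hallucinated dependency ships."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]
```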
Context poisoning. A hallucination enters the agent's context (conversation history, tool output summary, or previous task result). Subsequent reasoning anchors on the hallucinated content and compounds the error. The agent cannot recover because the false information is now part of its "ground truth."
ANTI_PATTERN: Long-running sessions without context validation. FIX: Session restart with fresh context. Summarize progress (verified facts only) and begin a new session. Do not carry forward unverified assumptions.
Specification drift. The agent solves a different problem than the one specified. The solution is internally consistent and may be well-implemented, but it does not match the requirements. This is especially common when specifications are vague or when the agent's training data contains a similar but different problem.
ANTI_PATTERN: Evaluating implementation quality without comparing to specification. FIX: Reconciliation (Jasper compares spec to implementation). Spec-first test generation (tests encode the requirement, not the implementation).
## 3. Context Window Overflow

### What It Is
The agent's context window fills up, causing degraded performance, lost instructions, and eventually truncated output.
### Symptoms

Early warning (60-75% capacity):

- Agent repeats actions it has already taken
- Agent ignores instructions from the middle of the system prompt
- Output becomes more generic, less specific to the task

Critical (75-90% capacity):

- Agent references files or functions that do not exist in context
- Agent contradicts its own earlier statements
- Agent forgets role boundaries and begins acting outside its scope

Terminal (>90% capacity):

- Output is truncated mid-response
- Agent abandons its identity entirely
- Agent confabulates — describes actions it did not take
### Root Causes
Large tool outputs. An agent calls a function that returns 20,000 tokens of JSON. The output fills the context window, displacing the system prompt and conversation history.
Accumulated conversation history. Multi-turn sessions accumulate history with every turn: each turn adds both the agent's output and the human/system input. After 40-50 turns, the history alone may consume most of the context window.
Uncompressed error logs. When an agent encounters errors, it often reads log files or error traces. These can be enormous and consume context budget that is needed for reasoning.
### Mitigation
CHECK: Monitor context utilization during long sessions.
IF: Context exceeds 60% of window capacity
THEN: Summarize conversation history, keeping only recent exchanges in full.
IF: A tool output exceeds 10,000 tokens
THEN: Truncate to relevant sections before it enters context.
IF: Session exceeds 40 turns
THEN: Consider restarting with a fresh context and progress summary.
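The three rules above can be collapsed into a single decision function. Thresholds come from the text; the function and action names are illustrative:

```python
def context_action(utilization, turns, last_tool_tokens):
    """Map the mitigation thresholds to a single recommended action.
    utilization is the fraction of the context window in use."""
    if last_tool_tokens > 10_000:
        return "truncate_tool_output"   # trim before it enters context
    if turns > 40:
        return "restart_with_summary"   # fresh context plus progress summary
    if utilization > 0.60:
        return "summarize_history"      # keep only recent exchanges in full
    return "continue"
```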
## 4. Identity Bleed

### What It Is
An agent adopts the personality, behaviors, or knowledge of another agent whose output appears in its context. This happens when Agent A reads output produced by Agent B and begins behaving like Agent B.
### How It Happens
- Agent A reviews code written by Agent B. Agent B's code comments contain first-person statements ("I chose this approach because..."). Agent A begins using Agent B's reasoning style.
- Agent A reads a discussion where Agent B expressed a strong opinion. Agent A adopts that opinion as its own, even when it conflicts with Agent A's role.
- Agent A processes a task that includes Agent B's identity information (leaked through a task description or handoff document). Agent A begins behaving according to Agent B's identity.
### Symptoms
- Agent uses first-person references that match another agent's name or role
- Agent makes decisions outside its authority tier that would be appropriate for another agent
- Agent's communication style changes mid-session to match another agent
- Agent references expertise or experience that belongs to another agent's profile
### Mitigation
RULE: JIT knowledge injection must never include other agents' identity files. Only include other agents' outputs (code, reviews, test results) with agent attribution removed.
RULE: Task descriptions must not include "Agent X said..." or "According to Agent X's analysis..." formulations. Instead, state the information directly: "The security review found..." or "The test suite requires..."
RULE: When an agent reviews another agent's work, the system prompt should reinforce: "You are {your name}. You are reviewing work produced by another agent. Maintain your own perspective and judgment."
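A hypothetical scrubber for the second rule, rewriting attribution phrases into direct statements before a task description enters another agent's context. The patterns are illustrative, not exhaustive:

```python
import re

# Illustrative attribution patterns; a real scrubber would need a
# maintained list of agent names and phrase shapes.
PATTERNS = [
    re.compile(r"Agent\s+\w+\s+(?:said|found|reported)\s+that\s+", re.IGNORECASE),
    re.compile(r"According to\s+Agent\s+\w+(?:'s\s+\w+)?,?\s+", re.IGNORECASE),
]

def strip_attribution(text):
    """Remove 'Agent X said...' framings so the fact stands on its own."""
    for pat in PATTERNS:
        text = pat.sub("", text)
    return text
```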
## 5. Infinite Loops

### What It Is
A pattern where agents create tasks that create more tasks in an unbounded chain. Unlike token burn (which is usually a system-level bug), infinite loops are often a design-level problem where the task decomposition logic has no termination condition.
### Forms
Task spawning loops. Agent A completes a task and creates a follow-up task. The follow-up task creates another follow-up. Each iteration seems reasonable in isolation, but the chain never terminates.
Review loops. A reviewer finds issues. The developer fixes them. The reviewer finds new issues introduced by the fix. The developer fixes those, introducing yet more issues. Without a "good enough" threshold, this continues indefinitely.
Monitoring loops. A monitoring agent detects an anomaly. It creates a task to investigate. The investigation agent completes its analysis and notifies the monitoring agent. The monitoring agent detects the notification as new activity and creates another investigation task.
### Mitigation
| Control | Value | Enforced By |
|---|---|---|
| Maximum chain depth | 3 | Orchestrator (scanner.py) |
| Maximum hook depth | 2 (1 for no_block) | Hooks (hooks.py) |
| Maximum parallel tasks per agent | 1 | Orchestrator |
| Maximum total parallel tasks | 5 | Orchestrator |
| Agent rate limit | 10 tasks/min/agent | Orchestrator |
RULE: Every task creation must include a chain_depth field. If chain_depth >= 3, the task is rejected and a human notification is created instead.
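A sketch of that rule at the task-creation boundary (function and parameter names are illustrative, not the orchestrator's API):

```python
MAX_CHAIN_DEPTH = 3  # from the control table above

def create_task(queue, description, chain_depth, notify_human):
    """Reject tasks at or beyond the depth limit and notify a human instead."""
    if chain_depth >= MAX_CHAIN_DEPTH:
        notify_human(f"chain depth {chain_depth} reached; task rejected: {description}")
        return None
    task = {"description": description, "chain_depth": chain_depth}
    queue.append(task)
    return task
```

A spawned follow-up task must carry `chain_depth + 1`, so the chain terminates mechanically even when every individual step looks reasonable.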
## 6. Premature Scaling

### What It Is
Adding more agents, more replicas, or more features before the underlying pipeline is proven correct. Scaling a broken system does not produce more output — it produces more errors, faster, at higher cost.
### Symptoms
- Agent count grows but output quality stays flat or decreases
- Agents wait idle because the pipeline bottleneck is upstream
- New agents create work for existing agents, overloading them
- Cost increases linearly with agent count but value does not
### GE Experience
GE's initial deployment included monitoring agents (Annegreet, Ron, Mira, Eltjo) before the hook system had loop prevention. When these agents were enabled, they immediately created a loop that consumed significant budget. The agents were correct individually — each one was doing its job — but the system-level interaction was broken.
RULE: Before adding a new agent, verify:
1. The pipeline stages this agent participates in are proven correct.
2. The agent's handoff points are tested.
3. The cost gate and safety mechanisms cover the new agent's execution patterns.
4. verify-executor-safety.sh passes with the new agent included.
## 7. Over-Autonomy

### What It Is
Agents making decisions that exceed their authority tier. This is the most dangerous non-financial failure mode because the consequences are often invisible until they cause real damage — a wrong architectural decision, an unauthorized client communication, a security policy violation.
### How It Happens
- The agent's identity does not include explicit negative boundaries ("what you do NOT do")
- The task description is vague, giving the agent implicit permission to decide anything
- The agent is running a long session and has gradually drifted from its role
- The agent encounters a decision point not covered by its identity or the Constitution
### Symptoms
- Agent modifies files outside its owned directories
- Agent makes architectural decisions without initiating a discussion
- Agent communicates in a way that implies authority it does not have
- Agent approves its own work (self-certification)
### Mitigation
CHECK: When reviewing agent output.
IF: The agent made a decision that is not in its Tier 1 (autonomous) authority
THEN: The decision is invalid. Escalate to the appropriate tier.
IF: The agent modified files outside its owned directories
THEN: Revert the changes. Investigate why the agent exceeded its boundaries.
IF: The agent's identity lacks negative boundaries
THEN: Add them. Every identity needs explicit "do not" statements.
## 8. Under-Specification

### What It Is

Providing vague instructions to agents and expecting them to "figure it out." LLMs cannot read between the lines, infer unstated context, or ask clarifying questions during execution. Every ambiguity becomes a decision point where the agent picks whichever interpretation seems most plausible, which may not be correct.
### Examples
Vague: "Improve the performance of the API." Result: Agent optimizes a random endpoint, adds caching where it is not needed, and removes validation checks that it perceives as "slow."
Specific: "Reduce the p95 response time of GET /api/tasks from 800ms to under 200ms. Do not modify the authentication middleware. The bottleneck is in the database query — add an index on tasks.agent_id." Result: Agent adds the specified index and measures the improvement.
### The Ambiguity Tax
Every ambiguity in a task specification has a cost:
| Ambiguity Type | Cost |
|---|---|
| Missing acceptance criteria | Agent produces wrong output, rework required |
| Unclear scope | Agent does too much or too little |
| Undefined boundaries | Agent modifies things it should not touch |
| Absent priority | Agent spends equal time on critical and trivial elements |
| Missing context | Agent hallucinates the missing context |
### Mitigation

RULE: Every task must include:

- [ ] What to do (specific, measurable)
- [ ] What NOT to do (explicit boundaries)
- [ ] Acceptance criteria (how to verify completion)
- [ ] Relevant context (files, specifications, prior decisions)
- [ ] Priority (what matters most if trade-offs are necessary)
RULE: If a task description is shorter than 100 words, it is probably under-specified. If it is longer than 1,000 words, it should probably be split into multiple tasks.
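The word-count heuristic is easy to enforce mechanically. Thresholds come from the rule above; the label strings are illustrative:

```python
def spec_length_flag(description):
    """Heuristic check: under 100 words is likely under-specified,
    over 1,000 words likely needs splitting into multiple tasks."""
    words = len(description.split())
    if words < 100:
        return "possibly_under_specified"
    if words > 1000:
        return "consider_splitting"
    return "ok"
```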
## 9. Cascading Failures

### What It Is
A failure in one agent or stage propagates through the pipeline, causing failures in downstream stages. This is especially dangerous when errors are silent — the downstream agent receives incorrect input but does not detect that it is incorrect.
### How It Happens
- Agent A produces output with a subtle error (wrong constant, incorrect API path, missing null check)
- Agent B receives A's output and uses it as ground truth
- Agent B's output is correct given its input, but incorrect given the original requirement
- The error propagates until it reaches a stage with mechanical verification (tests, type checking) or a human reviewer
### Mitigation
- Verification at every stage. Do not assume upstream output is correct. Each stage should verify its inputs as well as its outputs.
- Circuit breakers. If an agent detects suspicious input (malformed data, missing required fields, inconsistent state), it should HALT rather than attempt to proceed.
- Independent verification. The reconciliation pattern (Jasper) catches drift between specification and implementation by comparing independently produced work products.
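The circuit-breaker idea can be sketched as an input contract that halts instead of guessing; the required field names here are illustrative:

```python
class HaltPipeline(Exception):
    """Raised when a stage refuses to proceed on suspicious input."""

REQUIRED_FIELDS = {"work_item_id", "description", "chain_depth"}  # illustrative

def verify_input(item):
    """Circuit breaker: halt on malformed input rather than pass the
    error downstream as ground truth."""
    if not isinstance(item, dict):
        raise HaltPipeline(f"expected a dict, got {type(item).__name__}")
    missing = REQUIRED_FIELDS - item.keys()
    if missing:
        raise HaltPipeline(f"missing required fields: {sorted(missing)}")
    return item
```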
## 10. Tool Output Overflow

### What It Is
An agent calls a tool (file read, web search, database query) that returns an enormous result, flooding the context window and displacing critical instructions and history.
### How It Happens
- Reading an entire large file instead of the relevant section
- Running a search query that returns hundreds of matches
- Querying a database table without LIMIT
- Fetching a web page that includes extensive boilerplate
### Mitigation

RULE: Always scope tool calls to the minimum necessary:

- Read specific line ranges, not entire files
- Limit search results to the top 10-20 matches
- Use LIMIT and WHERE clauses in database queries
- Extract relevant content from web pages before injecting into context
RULE: If a tool output exceeds 10,000 tokens, truncate or summarize before it enters the agent's context window.
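A rough budget guard for the second rule, approximating tokens as characters divided by four (an assumption, not a real tokenizer):

```python
def truncate_output(text, max_tokens=10_000, chars_per_token=4):
    """Cut a tool output to an approximate token budget before it
    enters the agent's context window."""
    budget = max_tokens * chars_per_token
    if len(text) <= budget:
        return text
    return text[:budget] + "\n...[truncated]"
```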
## Prevention Checklist
Before deploying any new agent, pipeline stage, or system feature, verify:
- [ ] No file watchers in any component
- [ ] All XADD calls include MAXLEN
- [ ] No dual-publish to multiple streams for the same task
- [ ] Hook chain depth is limited
- [ ] Cost gate covers the new component
- [ ] Task creation includes chain_depth tracking
- [ ] Agent identity includes negative boundaries
- [ ] Task specifications include acceptance criteria
- [ ] Tool outputs are bounded in size
- [ ] `verify-executor-safety.sh` passes