Cost Optimization¶
The Economics of Agentic Engineering¶
Running 60 LLM-powered agents is expensive. Without disciplined cost management, a multi-agent system can consume thousands of dollars per day in API costs — and most of that spend will be waste. GE learned this through direct, painful experience: a feedback loop caused by file watchers generated a $100/hour token burn that required emergency intervention.
Cost optimization in agentic systems is not about being cheap. It is about being efficient — directing spending where it produces value and eliminating spending where it does not. The most expensive model is not always the best choice. The cheapest model is not always the worst. The right model is the one that produces correct output for the task at the lowest total cost, including the cost of errors.
Model Selection Framework¶
GE uses three model tiers. Each tier has a specific role based on the type of reasoning required.
Tier 1: Opus (Judgment)¶
Cost: $5/$25 per million tokens (input/output)
Strength: Deep reasoning, complex analysis, nuanced judgment
Weakness: Slow, expensive, overkill for routine tasks
Use when:
- The task requires judgment, not just execution
- The task involves regulatory interpretation, legal analysis, or compliance
- The task requires understanding and resolving conflicting requirements
- The task involves architectural decisions with long-term consequences
- The cost of getting it wrong exceeds the cost of using the premium model
- Client-facing scoping and creative direction (Aimee's work)
Do NOT use when:
- The task has a clear specification and the agent just needs to implement it
- The task is routing, classification, or simple decision-making
- The task is mechanical transformation (format conversion, boilerplate generation)
Tier 2: Sonnet (Production)¶
Cost: $3/$15 per million tokens (input/output)
Strength: Good balance of capability and speed, handles most production work
Weakness: May struggle with the most complex reasoning tasks
Use when:
- The task is well-specified implementation work
- The task requires good coding ability but not deep architectural reasoning
- The task involves code review with established criteria
- The task requires understanding existing code and making changes
- This is the default tier — use Sonnet unless there is a specific reason to use Opus or Haiku
Do NOT use when:
- The task requires only simple classification or routing (use Haiku)
- The task requires deep judgment on ambiguous inputs (use Opus)
Tier 3: Haiku (Routing)¶
Cost: $1/$5 per million tokens (input/output)
Strength: Fast, cheap, good enough for simple tasks
Weakness: Limited reasoning depth, prone to errors on complex tasks
Use when:
- Routing decisions (which agent should handle this work?)
- Simple monitoring checks (is this service responding?)
- Classification tasks (what type of work item is this?)
- Summarization of structured data
- Health checks and status reports
- Log analysis for known patterns
Do NOT use when:
- The task requires producing code that will run in production
- The task requires understanding complex specifications
- The task involves security-sensitive decisions
- The error cost exceeds the savings from using the cheaper model
Model Selection Decision Tree¶
CHECK: What type of reasoning does this task require?
IF: Judgment on ambiguous inputs, regulatory/legal interpretation, architecture
THEN: Opus
IF: Well-specified implementation, code review, testing, documentation
THEN: Sonnet
IF: Routing, classification, monitoring, simple checks
THEN: Haiku
IF: Uncertain which tier is appropriate
THEN: Start with Sonnet. If output quality is insufficient, escalate to Opus.
Never start with Opus "just to be safe" — this is the most common cost waste.
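The decision tree above can be sketched as a small routing helper. The task-type labels below are illustrative assumptions, not GE's actual taxonomy:

```python
# Sketch of the tier-selection logic from the decision tree above.
# Task-type labels are illustrative, not GE's actual taxonomy.

JUDGMENT_TASKS = {"regulatory", "legal", "architecture", "scoping"}
ROUTINE_TASKS = {"routing", "classification", "monitoring", "health_check"}

def select_model(task_type: str) -> str:
    """Map a task type to a model tier, defaulting to Sonnet when uncertain."""
    if task_type in JUDGMENT_TASKS:
        return "opus"
    if task_type in ROUTINE_TASKS:
        return "haiku"
    # Well-specified implementation work, and anything uncertain:
    # start with Sonnet; escalate to Opus only if quality proves insufficient.
    return "sonnet"
```

Note the default branch: uncertainty resolves downward to Sonnet, never upward to Opus.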
Token Burn Prevention¶
Token burn is the uncontrolled consumption of tokens through system bugs, design flaws, or misconfiguration. GE has experienced multiple token burn incidents. Each one produced specific learnings that are now encoded as hard rules.
The $100/Hour File Watcher Incident¶
What happened: File watchers (chokidar) were configured to watch the workspace directory. When an agent modified a file, the watcher detected the change and triggered a re-read. The re-read consumed tokens. The token consumption caused logging, which modified files, which triggered the watcher again. The feedback loop ran at machine speed.
Cost: $100/hour in API costs before the loop was detected and killed.
Rule created: NEVER use file watchers (chokidar, fs.watch) in production. Removed from package.json. This is a permanent ban, not a temporary workaround.
The Double Delivery Incident¶
What happened: The task service was publishing work items to both triggers.{agent} (the per-agent queue) and ge:work:incoming (the system queue). The orchestrator picked up items from ge:work:incoming and routed them to triggers.{agent}. Every task was executed twice.
Cost: 2x execution cost for every task until discovered.
Rule created: NEVER XADD to both triggers.{agent} AND ge:work:incoming for the same task.
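A minimal sketch of the corrected publish path, assuming a redis-py-style client: one XADD, to the per-agent stream only, and always with a MAXLEN cap (the 10,000 cap is an assumption, not a documented GE value):

```python
def publish_task(redis_client, agent: str, task: dict) -> str:
    """Publish a task to exactly one stream, with a bounded stream length."""
    stream = f"triggers.{agent}"
    # Single destination: never also XADD the same task to ge:work:incoming.
    # maxlen with approximate=True issues XADD ... MAXLEN ~ 10000, which
    # caps the stream cheaply instead of trimming on every add.
    return redis_client.xadd(stream, task, maxlen=10_000, approximate=True)
```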
The Hook Loop Incident¶
What happened: Post-completion hooks allowed monitoring agents (Annegreet, Eltjo) to trigger work on each other's completions. Agent A completed, triggering Agent B. Agent B completed, triggering Agent A. The loop ran until the cost gate killed it.
Cost: Multiple agent sessions consumed before detection.
Rules created:
- hook_origin_depth field prevents infinite hook chains.
- no_block hooks only fire at depth 0.
- Maximum hook chain depth: 2.
- Per-agent hook rate limit: 20 hooks/agent/hour.
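Taken together, these rules amount to a small guard that runs before any hook fires. A hedged sketch (names other than hook_origin_depth are assumptions; the real logic lives in hooks.py):

```python
import time
from collections import defaultdict

MAX_HOOK_DEPTH = 2           # maximum hook chain depth
HOOKS_PER_AGENT_HOUR = 20    # per-agent hook rate limit

_hook_times = defaultdict(list)  # agent -> timestamps of fired hooks

def may_fire_hook(agent, hook_origin_depth, is_no_block, now=None):
    """Return True if a post-completion hook is allowed to fire."""
    now = time.time() if now is None else now
    if hook_origin_depth >= MAX_HOOK_DEPTH:
        return False                          # chain too deep: discard hook
    if is_no_block and hook_origin_depth != 0:
        return False                          # no_block hooks fire only at depth 0
    recent = [t for t in _hook_times[agent] if now - t < 3600]
    if len(recent) >= HOOKS_PER_AGENT_HOUR:
        return False                          # rate limit: 20 hooks/agent/hour
    _hook_times[agent] = recent + [now]
    return True
```

The depth check alone would have stopped the Annegreet/Eltjo loop at its second hop.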
GE's Cost Gate System¶
The cost gate (ge_agent/execution/cost_gate.py) enforces hard limits at three levels:
| Level | Limit | Action When Exceeded |
|---|---|---|
| Per-session | $5 | Session terminated immediately |
| Per-agent/hour | $10 | Agent blocked for remainder of the hour |
| Daily system | $100 | All non-essential agents halted, human notified |
The cost gate runs both pre-execution (before a task starts) and mid-execution (at regular intervals during a session). It cannot be bypassed by agent instructions or task configurations.
RULE: Before scaling executors or enabling new agents, run bash scripts/verify-executor-safety.sh and confirm it exits 0. The script verifies that all cost protection mechanisms are in place.
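A minimal sketch of the three-level check; the real implementation lives in ge_agent/execution/cost_gate.py, and the function shape here is an assumption:

```python
# Hard limits from the table above, in USD.
LIMITS = {"session": 5.00, "agent_hour": 10.00, "daily_system": 100.00}

def check_cost_gate(session_usd, agent_hour_usd, system_day_usd):
    """Return the list of tripped actions; an empty list means proceed.

    Intended to run both pre-execution and at intervals mid-execution.
    """
    tripped = []
    if session_usd >= LIMITS["session"]:
        tripped.append("terminate_session")
    if agent_hour_usd >= LIMITS["agent_hour"]:
        tripped.append("block_agent_for_hour")
    if system_day_usd >= LIMITS["daily_system"]:
        tripped.append("halt_non_essential_agents")
    return tripped
```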
Caching Strategies¶
Prompt Caching¶
Anthropic's prompt caching charges 1.25x the base rate for cache writes but only 0.1x for subsequent reads. For GE agents that share a common system prompt (Constitution + CORE identity), caching saves approximately 90% on repeated prompt processing.
What to cache:
- Constitution (identical across all agents)
- CORE identity (identical across all sessions for a given agent)
- Frequently used wiki pages (if the content is stable)
What NOT to cache:
- Task specifications (unique per task)
- JIT knowledge (varies by task domain)
- Working context (changes every turn)
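With Anthropic's API, caching is opted into by placing cache_control breakpoints on the stable prefix. A sketch of how GE's split could look (the function and its inputs are illustrative, not GE's actual code):

```python
def build_system_blocks(constitution, core_identity, task_spec):
    """Mark the stable prefix for caching; leave per-task content uncached."""
    return [
        # Identical across all agents: one cache write, cheap reads after.
        {"type": "text", "text": constitution,
         "cache_control": {"type": "ephemeral"}},
        # Identical across all sessions for a given agent: also cached.
        {"type": "text", "text": core_identity,
         "cache_control": {"type": "ephemeral"}},
        # Unique per task: never cached.
        {"type": "text", "text": task_spec},
    ]
```

Order matters: caching works on prefixes, so the stable blocks must come before the per-task ones.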
Context Compression¶
When conversation history grows, compress rather than truncate. A 50-message conversation history can often be summarized in 500 tokens. The summary preserves decision-relevant information while reducing token consumption by 90-95%.
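A sketch of the compress-not-truncate policy. The summarize callable stands in for a cheap-model summarization call and is an assumption, as are the thresholds:

```python
def compress_history(messages, summarize, threshold=50, keep_recent=10):
    """Replace older messages with a summary once history exceeds threshold."""
    if len(messages) <= threshold:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(old)  # e.g. a ~500-token summary preserving decisions
    header = {"role": "user",
              "content": f"[Summary of earlier conversation] {summary}"}
    return [header] + recent
```

The recent tail is kept verbatim because the model usually needs exact wording only for the turns it is actively working on.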
Batch Processing¶
Non-time-sensitive work (nightly reports, periodic analysis, learning extraction) runs through batch APIs that offer 50% discounts. GE schedules these tasks during off-peak hours.
The Cost of Errors vs Cost of Prevention¶
The cheapest model is not always the most cost-effective. A Haiku-generated code change that introduces a bug costs more than a Sonnet-generated change that gets it right the first time, because the bug must be detected (consuming reviewer tokens), diagnosed (consuming developer tokens), and fixed (consuming more developer tokens).
Cost Calculation Framework¶
Total cost = Generation cost + Review cost + Error cost
Where:
Generation cost = tokens * model price
Review cost = reviewer tokens * reviewer model price
Error cost = P(error) * (diagnosis tokens + fix tokens + re-review tokens) * model prices
For simple, well-specified tasks, P(error) is similar across models, so the cheaper model wins. For complex, ambiguous tasks, P(error) is much lower with better models, and the error cost dominates. This is why Opus is sometimes the cheapest option for architectural decisions — it gets them right the first time.
Real-World Example¶
| Scenario | Haiku | Sonnet | Opus |
|---|---|---|---|
| Generation cost | $0.02 | $0.10 | $0.50 |
| P(error) for complex task | 40% | 15% | 5% |
| Expected error cost | $0.80 | $0.30 | $0.10 |
| Total expected cost | $0.82 | $0.40 | $0.60 |
In this example, Sonnet is the most cost-effective choice. Note that Opus ($0.60 total) still beats Haiku ($0.82 total) despite a 25x higher generation cost, because its lower error rate dominates on complex work. This analysis drives GE's default of Sonnet for most production work.
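The table's figures follow from the framework if each error costs a flat $2.00 to diagnose, fix, and re-review — a figure inferred from the table, not stated in the source:

```python
ERROR_HANDLING_COST = 2.00  # diagnosis + fix + re-review; inferred from the table

# (generation cost, P(error)) per model for this complex-task scenario
SCENARIOS = {"haiku": (0.02, 0.40), "sonnet": (0.10, 0.15), "opus": (0.50, 0.05)}

def expected_total(generation_cost, p_error):
    """Total expected cost = generation cost + expected error-handling cost."""
    return generation_cost + p_error * ERROR_HANDLING_COST

totals = {m: round(expected_total(g, p), 2) for m, (g, p) in SCENARIOS.items()}
# totals == {"haiku": 0.82, "sonnet": 0.4, "opus": 0.6} -> Sonnet wins
```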
Multi-Model Cost Optimization¶
GE uses multiple LLM providers (Claude, OpenAI, Gemini) strategically:
Provider-Specific Strengths¶
| Provider | Best For | GE Usage |
|---|---|---|
| Claude (Opus) | Complex reasoning, judgment, architecture | Scoping, specification, security review |
| Claude (Sonnet) | General-purpose production work | Most development, testing, code review |
| Claude (Haiku) | Fast classification, routing, monitoring | Orchestrator, health checks, log analysis |
| OpenAI | Alternative perspective, specific task profiles | Select agents (margot, benjamin, jouke, dinand) |
| Gemini | Cost-effective for specific tasks | Evaluation ongoing |
The Benchmark Opportunity¶
GE's multi-team architecture (Team Alfa, Team Bravo) creates a natural A/B testing environment. Teams using different provider configurations can be compared on delivery speed, quality metrics, and total cost. This produces empirical data on which model/provider combinations are most effective for specific types of work.
ROI Metrics¶
GE tracks cost-effectiveness through these metrics:
| Metric | What It Measures | Target |
|---|---|---|
| Cost per work item | Total token cost to complete a work item | Trending down over time |
| Cost per line of shipped code | Token cost divided by lines of code that reach production | Below industry average |
| Rework rate | Percentage of work items that require correction after initial completion | <15% |
| First-attempt success rate | Percentage of work items completed correctly on the first attempt | >70% |
| Model tier distribution | Percentage of tokens consumed by each model tier | 10% Opus, 60% Sonnet, 30% Haiku |
| Daily total cost | Total API spend across all agents | Within budget |
The Learning Dividend¶
GE's wiki brain creates a compounding cost advantage. As learnings accumulate, agents make fewer mistakes, require fewer retries, and produce correct output faster. The cost per work item decreases over time as institutional knowledge grows. This is the economic justification for the learning pipeline's operating cost.
Operational Rules¶
These rules are enforced by the system and violations trigger alerts:
| Rule | Enforced By | Consequence of Violation |
|---|---|---|
| Every XADD includes MAXLEN | Code review + grep check | PR rejected |
| No dual-publish to triggers + incoming | Code review | PR rejected |
| Cost gate cannot be bypassed | cost_gate.py (system-level) | Session terminated |
| HPA max replicas = 5 | executor.yaml | Scaling blocked |
| Hook chain depth max = 2 | hooks.py | Hook discarded |
| Verify safety before scaling | verify-executor-safety.sh | Scaling blocked |
| Per-session limit = $5 | cost_gate.py | Session terminated |
| Per-agent/hour limit = $10 | cost_gate.py | Agent blocked |
| Daily system limit = $100 | cost_gate.py | Non-essential agents halted |
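The "every XADD includes MAXLEN" rule is enforced by code review plus a grep check. A crude, line-based Python equivalent of that check (the real script is not shown in the source):

```python
import re

XADD_CALL = re.compile(r"\bxadd\s*\(", re.IGNORECASE)

def find_unbounded_xadds(source: str):
    """Return 1-based line numbers of xadd calls with no maxlen on that line.

    Deliberately crude, like grep: single-line calls only, no parsing.
    """
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if XADD_CALL.search(line) and "maxlen" not in line.lower()
    ]
```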