Cost Optimization¶
The Economics of Agentic Engineering¶
Running 60 LLM-powered agents is expensive. Without disciplined cost management, a multi-agent system can consume thousands of dollars per day in API costs — and most of that spend will be waste. GE learned this through direct, painful experience: a feedback loop caused by file watchers generated a $100/hour token burn that required emergency intervention.
Cost optimization in agentic systems is not about being cheap. It is about being efficient — directing spending where it produces value and eliminating spending where it does not. The most expensive model is not always the best choice. The cheapest model is not always the worst. The right model is the one that produces correct output for the task at the lowest total cost, including the cost of errors.
Model Selection Framework¶
GE uses three model tiers. Each tier has a specific role based on the type of reasoning required.
Tier 1: Opus (Judgment)¶
Cost: $5/$25 per million tokens (input/output)
Strength: Deep reasoning, complex analysis, nuanced judgment
Weakness: Slow, expensive, overkill for routine tasks
Use when:
- The task requires judgment, not just execution
- The task involves regulatory interpretation, legal analysis, or compliance
- The task requires understanding and resolving conflicting requirements
- The task involves architectural decisions with long-term consequences
- The cost of getting it wrong exceeds the cost of using the premium model
- Client-facing scoping and creative direction (Aimee's work)
Do NOT use when:
- The task has a clear specification and the agent just needs to implement it
- The task is routing, classification, or simple decision-making
- The task is mechanical transformation (format conversion, boilerplate generation)
Tier 2: Sonnet (Production)¶
Cost: $3/$15 per million tokens (input/output)
Strength: Good balance of capability and speed, handles most production work
Weakness: May struggle with the most complex reasoning tasks
Use when:
- The task is well-specified implementation work
- The task requires good coding ability but not deep architectural reasoning
- The task involves code review with established criteria
- The task requires understanding existing code and making changes
- This is the default tier — use Sonnet unless there is a specific reason to use Opus or Haiku
Do NOT use when:
- The task requires only simple classification or routing (use Haiku)
- The task requires deep judgment on ambiguous inputs (use Opus)
Tier 3: Haiku (Routing)¶
Cost: $1/$5 per million tokens (input/output)
Strength: Fast, cheap, good enough for simple tasks
Weakness: Limited reasoning depth, prone to errors on complex tasks
Use when:
- Routing decisions (which agent should handle this work?)
- Simple monitoring checks (is this service responding?)
- Classification tasks (what type of work item is this?)
- Summarization of structured data
- Health checks and status reports
- Log analysis for known patterns
Do NOT use when:
- The task requires producing code that will run in production
- The task requires understanding complex specifications
- The task involves security-sensitive decisions
- The error cost exceeds the savings from using the cheaper model
Model Selection Decision Tree¶
CHECK: What type of reasoning does this task require?
IF: Judgment on ambiguous inputs, regulatory/legal interpretation, architecture
THEN: Opus
IF: Well-specified implementation, code review, testing, documentation
THEN: Sonnet
IF: Routing, classification, monitoring, simple checks
THEN: Haiku
IF: Uncertain which tier is appropriate
THEN: Start with Sonnet. If output quality is insufficient, escalate to Opus.
Never start with Opus "just to be safe" — this is the most common cost waste.
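The decision tree above can be sketched as a small routing helper. The task-type labels below are illustrative assumptions, not GE's actual taxonomy:

```python
# Sketch of the tier-selection logic from the decision tree above.
# Task-type labels are illustrative, not GE's actual taxonomy.

JUDGMENT_TASKS = {"regulatory", "legal", "architecture", "scoping"}
ROUTINE_TASKS = {"routing", "classification", "monitoring", "health_check"}

def select_model(task_type: str) -> str:
    """Map a task type to a model tier, defaulting to Sonnet when uncertain."""
    if task_type in JUDGMENT_TASKS:
        return "opus"
    if task_type in ROUTINE_TASKS:
        return "haiku"
    # Well-specified implementation work, and anything uncertain:
    # start with Sonnet; escalate to Opus only if quality proves insufficient.
    return "sonnet"
```

Note the default branch: uncertainty resolves downward to Sonnet, never upward to Opus.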
Token Burn Prevention¶
Token burn is the uncontrolled consumption of tokens through system bugs, design flaws, or misconfiguration. GE has experienced multiple token burn incidents. Each one produced specific learnings that are now encoded as hard rules.
The $100/Hour File Watcher Incident¶
What happened: File watchers (chokidar) were configured to watch the workspace directory. When an agent modified a file, the watcher detected the change and triggered a re-read. The re-read consumed tokens. The token consumption caused logging, which modified files, which triggered the watcher again. The feedback loop ran at machine speed.
Cost: $100/hour in API costs before the loop was detected and killed.
Rule created: NEVER use file watchers (chokidar, fs.watch) in production. Removed from package.json. This is a permanent ban, not a temporary workaround.
The Double Delivery Incident¶
What happened: The task service was publishing work items to both triggers.{agent} (the per-agent queue) and ge:work:incoming (the system queue). The orchestrator picked up items from ge:work:incoming and routed them to triggers.{agent}. Every task was executed twice.
Cost: 2x execution cost for every task until discovered.
Rule created: NEVER XADD to both triggers.{agent} AND ge:work:incoming for the same task.
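A minimal sketch of the corrected publish path, assuming a redis-py-style client: one XADD, to the per-agent stream only, and always with a MAXLEN cap (the 10,000 cap is an assumption, not a documented GE value):

```python
def publish_task(redis_client, agent: str, task: dict) -> str:
    """Publish a task to exactly one stream, with a bounded stream length."""
    stream = f"triggers.{agent}"
    # Single destination: never also XADD the same task to ge:work:incoming.
    # maxlen with approximate=True issues XADD ... MAXLEN ~ 10000, which
    # caps the stream cheaply instead of trimming on every add.
    return redis_client.xadd(stream, task, maxlen=10_000, approximate=True)
```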
The Hook Loop Incident¶
What happened: Post-completion hooks allowed monitoring agents (Annegreet, Eltjo) to trigger work on each other's completions. Agent A completed, triggering Agent B. Agent B completed, triggering Agent A. The loop ran until the cost gate killed it.
Cost: Multiple agent sessions consumed before detection.
Rules created:
- hook_origin_depth field prevents infinite hook chains.
- no_block hooks only fire at depth 0.
- Maximum hook chain depth: 2.
- Per-agent hook rate limit: 20 hooks/agent/hour.
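Taken together, these rules amount to a small guard that runs before any hook fires. A hedged sketch (names other than hook_origin_depth are assumptions; the real logic lives in hooks.py):

```python
import time
from collections import defaultdict

MAX_HOOK_DEPTH = 2           # maximum hook chain depth
HOOKS_PER_AGENT_HOUR = 20    # per-agent hook rate limit

_hook_times = defaultdict(list)  # agent -> timestamps of fired hooks

def may_fire_hook(agent, hook_origin_depth, is_no_block, now=None):
    """Return True if a post-completion hook is allowed to fire."""
    now = time.time() if now is None else now
    if hook_origin_depth >= MAX_HOOK_DEPTH:
        return False                          # chain too deep: discard hook
    if is_no_block and hook_origin_depth != 0:
        return False                          # no_block hooks fire only at depth 0
    recent = [t for t in _hook_times[agent] if now - t < 3600]
    if len(recent) >= HOOKS_PER_AGENT_HOUR:
        return False                          # rate limit: 20 hooks/agent/hour
    _hook_times[agent] = recent + [now]
    return True
```

The depth check alone would have stopped the Annegreet/Eltjo loop at its second hop.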
GE's Cost Gate System¶
The cost gate (ge_agent/execution/cost_gate.py) enforces hard limits at three levels:
| Level | Limit | Action When Exceeded |
|---|---|---|
| Per-session | $5 | Session terminated immediately |
| Per-agent/hour | $10 | Agent blocked for remainder of the hour |
| Daily system | $100 | All non-essential agents halted, human notified |
The cost gate runs both pre-execution (before a task starts) and mid-execution (at regular intervals during a session). It cannot be bypassed by agent instructions or task configurations.
RULE: Before scaling executors or enabling new agents, run bash scripts/verify-executor-safety.sh and confirm it exits 0. The script verifies that all cost protection mechanisms are in place.
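A minimal sketch of the three-level check; the real implementation lives in ge_agent/execution/cost_gate.py, and the function shape here is an assumption:

```python
# Hard limits from the table above, in USD.
LIMITS = {"session": 5.00, "agent_hour": 10.00, "daily_system": 100.00}

def check_cost_gate(session_usd, agent_hour_usd, system_day_usd):
    """Return the list of tripped actions; an empty list means proceed.

    Intended to run both pre-execution and at intervals mid-execution.
    """
    tripped = []
    if session_usd >= LIMITS["session"]:
        tripped.append("terminate_session")
    if agent_hour_usd >= LIMITS["agent_hour"]:
        tripped.append("block_agent_for_hour")
    if system_day_usd >= LIMITS["daily_system"]:
        tripped.append("halt_non_essential_agents")
    return tripped
```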
Caching Strategies¶
Prompt Caching¶
Anthropic's prompt caching charges 1.25x the base rate for cache writes but only 0.1x for subsequent reads. For GE agents that share a common system prompt (Constitution + CORE identity), caching saves approximately 90% on repeated prompt processing.
What to cache:
- Constitution (identical across all agents)
- CORE identity (identical across all sessions for a given agent)
- Frequently used wiki pages (if the content is stable)
What NOT to cache:
- Task specifications (unique per task)
- JIT knowledge (varies by task domain)
- Working context (changes every turn)
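With Anthropic's API, caching is opted into by placing cache_control breakpoints on the stable prefix. A sketch of how GE's split could look (the function and its inputs are illustrative, not GE's actual code):

```python
def build_system_blocks(constitution, core_identity, task_spec):
    """Mark the stable prefix for caching; leave per-task content uncached."""
    return [
        # Identical across all agents: one cache write, cheap reads after.
        {"type": "text", "text": constitution,
         "cache_control": {"type": "ephemeral"}},
        # Identical across all sessions for a given agent: also cached.
        {"type": "text", "text": core_identity,
         "cache_control": {"type": "ephemeral"}},
        # Unique per task: never cached.
        {"type": "text", "text": task_spec},
    ]
```

Order matters: caching works on prefixes, so the stable blocks must come before the per-task ones.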
Context Compression¶
When conversation history grows, compress rather than truncate. A 50-message conversation history can often be summarized in 500 tokens. The summary preserves decision-relevant information while reducing token consumption by 90-95%.
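A sketch of the compress-not-truncate policy. The summarize callable stands in for a cheap-model summarization call and is an assumption, as are the thresholds:

```python
def compress_history(messages, summarize, threshold=50, keep_recent=10):
    """Replace older messages with a summary once history exceeds threshold."""
    if len(messages) <= threshold:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = summarize(old)  # e.g. a ~500-token summary preserving decisions
    header = {"role": "user",
              "content": f"[Summary of earlier conversation] {summary}"}
    return [header] + recent
```

The recent tail is kept verbatim because the model usually needs exact wording only for the turns it is actively working on.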
Batch Processing¶
Non-time-sensitive work (nightly reports, periodic analysis, learning extraction) runs through batch APIs that offer 50% discounts. GE schedules these tasks during off-peak hours.
The Cost of Errors vs Cost of Prevention¶
The cheapest model is not always the most cost-effective. A Haiku-generated code change that introduces a bug costs more than a Sonnet-generated change that gets it right the first time, because the bug must be detected (consuming reviewer tokens), diagnosed (consuming developer tokens), and fixed (consuming more developer tokens).
Cost Calculation Framework¶
Total cost = Generation cost + Review cost + Error cost
Where:
Generation cost = tokens * model price
Review cost = reviewer tokens * reviewer model price
Error cost = P(error) * (diagnosis tokens + fix tokens + re-review tokens) * model prices
For simple, well-specified tasks, P(error) is similar across models, so the cheaper model wins. For complex, ambiguous tasks, P(error) is much lower with better models, and the error cost dominates. This is why Opus is sometimes the cheapest option for architectural decisions — it gets them right the first time.
Real-World Example¶
| Scenario | Haiku | Sonnet | Opus |
|---|---|---|---|
| Generation cost | $0.02 | $0.10 | $0.50 |
| P(error) for complex task | 40% | 15% | 5% |
| Expected error cost | $0.80 | $0.30 | $0.10 |
| Total expected cost | $0.82 | $0.40 | $0.60 |
In this example, Sonnet is the most cost-effective choice. Note that Opus ($0.60 total) still beats Haiku ($0.82 total) despite a 25x higher generation cost, because its lower error rate dominates on complex work. This analysis drives GE's default of Sonnet for most production work.
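The table's figures follow from the framework if each error costs a flat $2.00 to diagnose, fix, and re-review — a figure inferred from the table, not stated in the source:

```python
ERROR_HANDLING_COST = 2.00  # diagnosis + fix + re-review; inferred from the table

# (generation cost, P(error)) per model for this complex-task scenario
SCENARIOS = {"haiku": (0.02, 0.40), "sonnet": (0.10, 0.15), "opus": (0.50, 0.05)}

def expected_total(generation_cost, p_error):
    """Total expected cost = generation cost + expected error-handling cost."""
    return generation_cost + p_error * ERROR_HANDLING_COST

totals = {m: round(expected_total(g, p), 2) for m, (g, p) in SCENARIOS.items()}
# totals == {"haiku": 0.82, "sonnet": 0.4, "opus": 0.6} -> Sonnet wins
```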
Multi-Model Cost Optimization¶
GE uses multiple LLM providers (Claude, OpenAI, Gemini) strategically:
Provider-Specific Strengths¶
| Provider | Best For | GE Usage |
|---|---|---|
| Claude (Opus) | Complex reasoning, judgment, architecture | Scoping, specification, security review |
| Claude (Sonnet) | General-purpose production work | Most development, testing, code review |
| Claude (Haiku) | Fast classification, routing, monitoring | Orchestrator, health checks, log analysis |
| OpenAI | Alternative perspective, specific task profiles | Select agents (margot, benjamin, jouke, dinand) |
| Gemini | Cost-effective for specific tasks | Evaluation ongoing |
The Benchmark Opportunity¶
GE's multi-team architecture (Team Alfa, Team Bravo) creates a natural A/B testing environment. Teams using different provider configurations can be compared on delivery speed, quality metrics, and total cost. This produces empirical data on which model/provider combinations are most effective for specific types of work.
ROI Metrics¶
GE tracks cost-effectiveness through these metrics:
| Metric | What It Measures | Target |
|---|---|---|
| Cost per work item | Total token cost to complete a work item | Trending down over time |
| Cost per line of shipped code | Token cost divided by lines of code that reach production | Below industry average |
| Rework rate | Percentage of work items that require correction after initial completion | <15% |
| First-attempt success rate | Percentage of work items completed correctly on the first attempt | >70% |
| Model tier distribution | Percentage of tokens consumed by each model tier | 10% Opus, 60% Sonnet, 30% Haiku |
| Daily total cost | Total API spend across all agents | Within budget |
The Learning Dividend¶
GE's wiki brain creates a compounding cost advantage. As learnings accumulate, agents make fewer mistakes, require fewer retries, and produce correct output faster. The cost per work item decreases over time as institutional knowledge grows. This is the economic justification for the learning pipeline's operating cost.
Operational Rules¶
These rules are enforced by the system and violations trigger alerts:
| Rule | Enforced By | Consequence of Violation |
|---|---|---|
| Every XADD includes MAXLEN | Code review + grep check | PR rejected |
| No dual-publish to triggers + incoming | Code review | PR rejected |
| Cost gate cannot be bypassed | cost_gate.py (system-level) | Session terminated |
| HPA max replicas = 5 | executor.yaml | Scaling blocked |
| Hook chain depth max = 2 | hooks.py | Hook discarded |
| Verify safety before scaling | verify-executor-safety.sh | Scaling blocked |
| Per-session limit = $5 | cost_gate.py | Session terminated |
| Per-agent/hour limit = $10 | cost_gate.py | Agent blocked |
| Daily system limit = $100 | cost_gate.py | Non-essential agents halted |
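The "every XADD includes MAXLEN" rule is enforced by code review plus a grep check. A crude, line-based Python equivalent of that check (the real script is not shown in the source):

```python
import re

XADD_CALL = re.compile(r"\bxadd\s*\(", re.IGNORECASE)

def find_unbounded_xadds(source: str):
    """Return 1-based line numbers of xadd calls with no maxlen on that line.

    Deliberately crude, like grep: single-line calls only, no parsing.
    """
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if XADD_CALL.search(line) and "maxlen" not in line.lower()
    ]
```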