
Cost Optimization

The Economics of Agentic Engineering

Running 60 LLM-powered agents is expensive. Without disciplined cost management, a multi-agent system can consume thousands of dollars per day in API costs — and most of that spend will be waste. GE learned this through direct, painful experience: a feedback loop caused by file watchers generated a $100/hour token burn that required emergency intervention.

Cost optimization in agentic systems is not about being cheap. It is about being efficient — directing spending where it produces value and eliminating spending where it does not. The most expensive model is not always the best choice. The cheapest model is not always the worst. The right model is the one that produces correct output for the task at the lowest total cost, including the cost of errors.


Model Selection Framework

GE uses three model tiers. Each tier has a specific role based on the type of reasoning required.

Tier 1: Opus (Judgment)

Cost: $5/$25 per million tokens (input/output)
Strength: Deep reasoning, complex analysis, nuanced judgment
Weakness: Slow, expensive, overkill for routine tasks

Use when:

  • The task requires judgment, not just execution
  • The task involves regulatory interpretation, legal analysis, or compliance
  • The task requires understanding and resolving conflicting requirements
  • The task involves architectural decisions with long-term consequences
  • The cost of getting it wrong exceeds the cost of using the premium model
  • Client-facing scoping and creative direction (Aimee's work)

Do NOT use when:

  • The task has a clear specification and the agent just needs to implement it
  • The task is routing, classification, or simple decision-making
  • The task is mechanical transformation (format conversion, boilerplate generation)

Tier 2: Sonnet (Production)

Cost: $3/$15 per million tokens (input/output)
Strength: Good balance of capability and speed, handles most production work
Weakness: May struggle with the most complex reasoning tasks

Use when:

  • The task is well-specified implementation work
  • The task requires good coding ability but not deep architectural reasoning
  • The task involves code review with established criteria
  • The task requires understanding existing code and making changes
  • This is the default tier — use Sonnet unless there is a specific reason to use Opus or Haiku

Do NOT use when:

  • The task requires only simple classification or routing (use Haiku)
  • The task requires deep judgment on ambiguous inputs (use Opus)

Tier 3: Haiku (Routing)

Cost: $1/$5 per million tokens (input/output)
Strength: Fast, cheap, good enough for simple tasks
Weakness: Limited reasoning depth, prone to errors on complex tasks

Use when:

  • Routing decisions (which agent should handle this work?)
  • Simple monitoring checks (is this service responding?)
  • Classification tasks (what type of work item is this?)
  • Summarization of structured data
  • Health checks and status reports
  • Log analysis for known patterns

Do NOT use when:

  • The task requires producing code that will run in production
  • The task requires understanding complex specifications
  • The task involves security-sensitive decisions
  • The error cost exceeds the savings from using the cheaper model

Model Selection Decision Tree

CHECK: What type of reasoning does this task require?

IF: Judgment on ambiguous inputs, regulatory/legal interpretation, architecture
THEN: Opus

IF: Well-specified implementation, code review, testing, documentation
THEN: Sonnet

IF: Routing, classification, monitoring, simple checks
THEN: Haiku

IF: Uncertain which tier is appropriate
THEN: Start with Sonnet. If output quality is insufficient, escalate to Opus.
      Never start with Opus "just to be safe" — this is the most common cost waste.
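The decision tree can be expressed as a routing helper. This is a minimal sketch; the task-type labels and function name are illustrative, not GE's actual taxonomy:

```python
def select_model_tier(task_type: str) -> str:
    """Map a task type to a model tier, following the decision tree above.

    Task-type labels are illustrative, not GE's actual taxonomy.
    """
    judgment_tasks = {"regulatory", "legal", "architecture", "scoping", "ambiguous"}
    routine_tasks = {"routing", "classification", "monitoring", "health_check"}

    if task_type in judgment_tasks:
        return "opus"
    if task_type in routine_tasks:
        return "haiku"
    # Default: well-specified implementation, review, testing, documentation.
    # If Sonnet's output proves insufficient, escalate to Opus afterward;
    # never start with Opus "just to be safe".
    return "sonnet"
```

Note the fall-through: anything not clearly judgment work or clearly routine work lands on Sonnet, which encodes the "default tier" rule directly.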

Token Burn Prevention

Token burn is the uncontrolled consumption of tokens through system bugs, design flaws, or misconfiguration. GE has experienced multiple token burn incidents. Each one produced specific learnings that are now encoded as hard rules.

The $100/Hour File Watcher Incident

What happened: File watchers (chokidar) were configured to watch the workspace directory. When an agent modified a file, the watcher detected the change and triggered a re-read. The re-read consumed tokens. The token consumption caused logging, which modified files, which triggered the watcher again. The feedback loop ran at machine speed.

Cost: $100/hour in API costs before the loop was detected and killed.

Rule created: NEVER use file watchers (chokidar, fs.watch) in production. Removed from package.json. This is a permanent ban, not a temporary workaround.

The Double Delivery Incident

What happened: The task service was publishing work items to both triggers.{agent} (the per-agent queue) and ge:work:incoming (the system queue). The orchestrator picked up items from ge:work:incoming and routed them to triggers.{agent}. Every task was executed twice.

Cost: 2x execution cost for every task until discovered.

Rule created: NEVER XADD to both triggers.{agent} AND ge:work:incoming for the same task.
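One way to make this rule mechanical is to publish through a single chokepoint that refuses a second stream for the same task. The class below is a hypothetical guard, not GE's actual task service; the stream names follow the incident description, and the injected `xadd` stands in for a Redis client:

```python
class DualPublishError(RuntimeError):
    pass


class WorkPublisher:
    """Hypothetical chokepoint: each task ID may be XADDed to exactly one stream."""

    def __init__(self, xadd):
        self._xadd = xadd        # e.g. a redis-py client's xadd method
        self._published = {}     # task_id -> stream it was sent to

    def publish_work_item(self, task_id: str, stream: str, fields: dict):
        seen = self._published.get(task_id)
        if seen is not None and seen != stream:
            # Sending the same task to both triggers.{agent} and
            # ge:work:incoming is exactly the double-delivery bug:
            # every task executes twice.
            raise DualPublishError(f"{task_id} already published to {seen}")
        self._published[task_id] = stream
        # MAXLEN caps the stream so a runaway producer cannot grow it unbounded.
        self._xadd(stream, fields, maxlen=10_000, approximate=True)
```

Retrying the same task to the same stream is allowed (idempotent); only a second, different destination raises.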

The Hook Loop Incident

What happened: Post-completion hooks allowed monitoring agents (Annegreet, Eltjo) to trigger work on each other's completions. Agent A completed, triggering Agent B. Agent B completed, triggering Agent A. The loop ran until the cost gate killed it.

Cost: Multiple agent sessions consumed before detection.

Rules created:

  • hook_origin_depth field prevents infinite hook chains.
  • no_block hooks only fire at depth 0.
  • Maximum hook chain depth: 2.
  • Per-agent hook rate limit: 20 hooks/agent/hour.
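Those guards compose into a single gate check. The sketch below is illustrative (the function shape and `hook_type` argument are assumptions; the depth and rate limits are the rules GE encoded), and it interprets "maximum chain depth 2" as: a hook may fire at depth 0 or 1, so chains stop at two links:

```python
MAX_HOOK_DEPTH = 2            # rule: maximum hook chain depth is 2
HOOKS_PER_AGENT_PER_HOUR = 20  # rule: per-agent hook rate limit

def should_fire_hook(hook_type: str, origin_depth: int, fired_this_hour: int) -> bool:
    """Sketch of the hook-loop guards; not the actual hooks.py."""
    if origin_depth >= MAX_HOOK_DEPTH:
        return False   # breaks A -> B -> A -> ... chains after two links
    if hook_type == "no_block" and origin_depth != 0:
        return False   # no_block hooks fire only at depth 0
    if fired_this_hour >= HOOKS_PER_AGENT_PER_HOUR:
        return False   # per-agent rate limit catches slow loops too
    return True
```

The depth check alone would have stopped the Annegreet/Eltjo loop at its second hop; the rate limit is a backstop for loops that cycle through more than two agents.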

GE's Cost Gate System

The cost gate (ge_agent/execution/cost_gate.py) enforces hard limits at three levels:

Level          | Limit | Action When Exceeded
Per-session    | $5    | Session terminated immediately
Per-agent/hour | $10   | Agent blocked for remainder of the hour
Daily system   | $100  | All non-essential agents halted, human notified

The cost gate runs both pre-execution (before a task starts) and mid-execution (at regular intervals during a session). It cannot be bypassed by agent instructions or task configurations.

RULE: Before scaling executors or enabling new agents, run bash scripts/verify-executor-safety.sh. It must exit 0. This script verifies all cost protection mechanisms are in place.
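The three limits can be pictured as one pre/mid-execution check. This is a simplified model, not the actual cost_gate.py; the method names and in-memory bookkeeping are assumptions:

```python
class CostGate:
    """Simplified model of GE's three-level cost gate (not the real cost_gate.py)."""

    SESSION_LIMIT = 5.00      # USD per session
    AGENT_HOUR_LIMIT = 10.00  # USD per agent per hour
    DAILY_LIMIT = 100.00      # USD per day, system-wide

    def __init__(self):
        self.session_spend = {}     # session_id -> USD
        self.agent_hour_spend = {}  # (agent, hour) -> USD
        self.daily_spend = 0.0

    def record(self, session_id, agent, hour, usd):
        self.session_spend[session_id] = self.session_spend.get(session_id, 0.0) + usd
        key = (agent, hour)
        self.agent_hour_spend[key] = self.agent_hour_spend.get(key, 0.0) + usd
        self.daily_spend += usd

    def check(self, session_id, agent, hour):
        """Run pre-execution and at intervals mid-execution; returns the action."""
        if self.session_spend.get(session_id, 0.0) >= self.SESSION_LIMIT:
            return "terminate_session"
        if self.agent_hour_spend.get((agent, hour), 0.0) >= self.AGENT_HOUR_LIMIT:
            return "block_agent"
        if self.daily_spend >= self.DAILY_LIMIT:
            return "halt_non_essential"
        return "ok"
```

Because `check` runs mid-execution as well as pre-execution, a session that crosses $5 partway through is terminated at the next interval rather than running to completion.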


Caching Strategies

Prompt Caching

Anthropic's prompt caching charges 1.25x the base rate for cache writes but only 0.1x for subsequent reads. For GE agents that share a common system prompt (Constitution + CORE identity), caching saves approximately 90% on repeated prompt processing.

What to cache:

  • Constitution (identical across all agents)
  • CORE identity (identical across all sessions for a given agent)
  • Frequently used wiki pages (if the content is stable)

What NOT to cache:

  • Task specifications (unique per task)
  • JIT knowledge (varies by task domain)
  • Working context (changes every turn)
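Under those rates (1.25x base for a cache write, 0.1x base for a cache read), the savings are easy to estimate. The sketch below assumes one cache write followed by N cache hits within the cache's lifetime:

```python
def caching_savings(prefix_tokens: int, reads: int, base_rate: float) -> float:
    """Fraction saved on shared-prefix processing vs. no caching.

    Assumes one cache write (1.25x base rate) then `reads` cache hits
    (0.1x base rate), all landing within the cache's lifetime.
    """
    uncached = (reads + 1) * prefix_tokens * base_rate
    cached = prefix_tokens * base_rate * (1.25 + 0.1 * reads)
    return 1 - cached / uncached
```

With around 100 sessions reusing the same Constitution + CORE prefix, this lands near the ~90% figure above; note the result depends only on the read count, since token count and base rate cancel out.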

Context Compression

When conversation history grows, compress rather than truncate. A 50-message conversation history can often be summarized in 500 tokens. The summary preserves decision-relevant information while reducing token consumption by 90-95%.

Batch Processing

Non-time-sensitive work (nightly reports, periodic analysis, learning extraction) runs through batch APIs that offer 50% discounts. GE schedules these tasks during off-peak hours.


The Cost of Errors vs Cost of Prevention

The cheapest model is not always the most cost-effective. A Haiku-generated code change that introduces a bug costs more than a Sonnet-generated change that gets it right the first time, because the bug must be detected (consuming reviewer tokens), diagnosed (consuming developer tokens), and fixed (consuming more developer tokens).

Cost Calculation Framework

Total cost = Generation cost + Review cost + Error cost

Where:
  Generation cost = tokens * model price
  Review cost = reviewer tokens * reviewer model price
  Error cost = P(error) * (diagnosis tokens + fix tokens + re-review tokens) * model prices

For simple, well-specified tasks, P(error) is similar across models, so the cheaper model wins. For complex, ambiguous tasks, P(error) is much lower with better models, and the error cost dominates. This is why Opus is sometimes the cheapest option for architectural decisions — it gets them right the first time.

Real-World Example

Scenario                  | Haiku | Sonnet | Opus
Generation cost           | $0.02 | $0.10  | $0.50
P(error) for complex task | 40%   | 15%    | 5%
Expected error cost       | $0.80 | $0.30  | $0.10
Total expected cost       | $0.82 | $0.40  | $0.60

In this example, Sonnet is the most cost-effective choice. Opus costs more than Sonnet in expectation, but is still cheaper than Haiku once error costs are included. This analysis drives GE's default of Sonnet for most production work.
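The table's totals follow from the framework if each error is assumed to cost a flat $2.00 to diagnose, fix, and re-review. That flat figure is inferred from the table, not a stated GE constant:

```python
def expected_total_cost(generation_cost: float, p_error: float,
                        error_handling_cost: float = 2.00) -> float:
    """Expected cost = generation + P(error) * downstream error cost.

    The $2.00 downstream figure is inferred from the example table,
    not an official GE constant.
    """
    return generation_cost + p_error * error_handling_cost


# (generation cost, P(error)) per tier, from the table above
scenarios = {
    "haiku": (0.02, 0.40),
    "sonnet": (0.10, 0.15),
    "opus": (0.50, 0.05),
}
totals = {m: expected_total_cost(g, p) for m, (g, p) in scenarios.items()}
```

Varying `error_handling_cost` shows the crossover: as downstream error costs grow, the ranking shifts further toward the more capable models.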


Multi-Model Cost Optimization

GE uses multiple LLM providers (Claude, OpenAI, Gemini) strategically:

Provider-Specific Strengths

Provider        | Best For                                        | GE Usage
Claude (Opus)   | Complex reasoning, judgment, architecture       | Scoping, specification, security review
Claude (Sonnet) | General-purpose production work                 | Most development, testing, code review
Claude (Haiku)  | Fast classification, routing, monitoring        | Orchestrator, health checks, log analysis
OpenAI          | Alternative perspective, specific task profiles | Select agents (margot, benjamin, jouke, dinand)
Gemini          | Cost-effective for specific tasks               | Evaluation ongoing

The Benchmark Opportunity

GE's multi-team architecture (Team Alfa, Team Bravo) creates a natural A/B testing environment. Teams using different provider configurations can be compared on delivery speed, quality metrics, and total cost. This produces empirical data on which model/provider combinations are most effective for specific types of work.


ROI Metrics

GE tracks cost-effectiveness through these metrics:

Metric                        | What It Measures                                                           | Target
Cost per work item            | Total token cost to complete a work item                                   | Trending down over time
Cost per line of shipped code | Token cost divided by lines of code that reach production                  | Below industry average
Rework rate                   | Percentage of work items that require correction after initial completion | <15%
First-attempt success rate    | Percentage of work items completed correctly on the first attempt          | >70%
Model tier distribution       | Percentage of tokens consumed by each model tier                           | 10% Opus, 60% Sonnet, 30% Haiku
Daily total cost              | Total API spend across all agents                                          | Within budget

The Learning Dividend

GE's wiki brain creates a compounding cost advantage. As learnings accumulate, agents make fewer mistakes, require fewer retries, and produce correct output faster. The cost per work item decreases over time as institutional knowledge grows. This is the economic justification for the learning pipeline's operating cost.


Operational Rules

These rules are enforced by the system and violations trigger alerts:

Rule                                   | Enforced By                 | Consequence of Violation
Every XADD includes MAXLEN             | Code review + grep check    | PR rejected
No dual-publish to triggers + incoming | Code review                 | PR rejected
Cost gate cannot be bypassed           | cost_gate.py (system-level) | Session terminated
HPA max replicas = 5                   | executor.yaml               | Scaling blocked
Hook chain depth max = 2               | hooks.py                    | Hook discarded
Verify safety before scaling           | verify-executor-safety.sh   | Scaling blocked
Per-session limit = $5                 | cost_gate.py                | Session terminated
Per-agent/hour limit = $10             | cost_gate.py                | Agent blocked
Daily system limit = $100              | cost_gate.py                | Non-essential agents halted