DOMAIN:INNOVATION — LLM_LANDSCAPE

OWNER: joshua
UPDATED: 2026-03-24
PURPOSE: comprehensive view of the LLM provider landscape, model comparison, selection criteria, and GE's multi-provider strategy
SCOPE: all major LLM providers relevant to GE's agent execution and client project delivery
RELEVANCE: GE's 59 agents run on multiple LLM providers — model selection directly impacts quality, cost, and reliability


LLM:PROVIDER_OVERVIEW

GE uses a multi-provider strategy: Claude (Anthropic) as primary, OpenAI and Gemini as supplementary providers. Each agent has a configured provider and model in the agents table (provider + provider_model columns). Provider configuration lives in config/providers/*.yaml and AGENT-REGISTRY.json.
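A minimal sketch of how an executor might resolve an agent's provider and model from AGENT-REGISTRY.json — the registry layout shown here is illustrative, and the real file may nest entries differently:

```python
import json


def load_provider_config(registry_path: str, agent_name: str) -> tuple[str, str]:
    """Resolve an agent's (provider, provider_model) pair from the registry.

    Assumes a flat {agent_name: {"provider": ..., "provider_model": ...}}
    layout, mirroring the agents table's provider + provider_model columns.
    """
    with open(registry_path) as f:
        registry = json.load(f)
    entry = registry[agent_name]
    return entry["provider"], entry["provider_model"]
```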


LLM:CLAUDE (Anthropic)

COMPANY

FOUNDED: 2021 by Dario Amodei, Daniela Amodei (ex-OpenAI)
FOCUS: AI safety, constitutional AI, responsible scaling
DIFFERENTIATOR: safety-first approach, strong reasoning, excellent instruction following
GE_RELATIONSHIP: primary provider — majority of GE agents run on Claude

MODEL_LINEUP (as of early 2026)

OPUS: highest capability, strongest reasoning, best for complex multi-step tasks
- context: 200K tokens
- pricing: premium tier ($15/$75 per MTok typical for latest)
- GE usage: Aimee (scoping, requires deep reasoning), complex evaluation tasks
- strength: unmatched reasoning depth, nuanced understanding, long-form analysis

SONNET: balanced capability and cost, strong coding, fast
- context: 200K tokens
- pricing: mid tier ($3/$15 per MTok typical for latest)
- GE usage: most development agents (urszula, maxim, sandro, etc.), general-purpose tasks
- strength: best cost/quality ratio for coding tasks, fast response times

HAIKU: fastest, lowest cost, good for routine tasks
- context: 200K tokens
- pricing: low tier ($0.25/$1.25 per MTok typical for latest)
- GE usage: session_summarizer.py (learning extraction), routine classification, lightweight tasks
- strength: speed and cost — ideal for high-volume low-complexity operations
- WARNING: Claude Code uses Haiku internally for subagent processing — Explore subagent calls burn Haiku tokens rapidly

CLAUDE_STRENGTHS_FOR_GE

INSTRUCTION_FOLLOWING: Claude excels at following complex system prompts — critical for GE's tiered identity system
CODE_QUALITY: consistently generates clean, well-structured TypeScript — matches GE's standards
SAFETY: refuses harmful outputs reliably — aligns with GE's constitution
TOOL_USE: excellent function calling and tool use — foundation for agent execution
LONG_CONTEXT: 200K context handles large codebases — GE projects can be substantial

CLAUDE_WEAKNESSES_FOR_GE

COST_AT_SCALE: Opus is expensive for high-volume tasks — GE's cost_gate.py exists because of this
RATE_LIMITS: can be restrictive during high-volume agent execution — need careful batching
API_CHANGES: Anthropic iterates rapidly — API version changes require monitoring (Joshua's job)
VISION: capable but not best-in-class for design-to-code tasks


LLM:GPT (OpenAI)

COMPANY

FOUNDED: 2015 by Sam Altman, Greg Brockman, Ilya Sutskever et al.
FOCUS: AGI development, broad AI capabilities, enterprise adoption
DIFFERENTIATOR: largest ecosystem, broadest tooling, aggressive release cadence
GE_RELATIONSHIP: supplementary provider — specific agents configured for OpenAI

MODEL_LINEUP (as of early 2026)

GPT-4o: multimodal flagship, strong all-around performance
- context: 128K tokens
- pricing: mid tier ($2.50/$10 per MTok typical)
- GE usage: agents requiring strong multimodal capabilities (image understanding, design review)
- strength: multimodal integration, broad knowledge, fast

GPT-4o-mini: cost-efficient, fast, good for routine tasks
- context: 128K tokens
- pricing: low tier ($0.15/$0.60 per MTok typical)
- GE usage: high-volume routine tasks where Claude Haiku is not preferred
- strength: very low cost, fast, surprisingly capable for simple tasks

o1/o3 REASONING_MODELS: specialized reasoning models for complex problem-solving
- context: varies by model
- pricing: premium tier
- GE usage: potential for complex architectural decisions, proof-of-concept evaluation
- strength: deep reasoning, chain-of-thought, mathematical and logical tasks
- weakness: slow, expensive, not suitable for routine code generation

GPT_STRENGTHS_FOR_GE

ECOSYSTEM: largest ecosystem of tools, libraries, and integrations
FUNCTION_CALLING: mature function calling with parallel tool use
MULTIMODAL: strong vision capabilities for design-related tasks
FINE_TUNING: available for custom model training (future GE consideration)

GPT_WEAKNESSES_FOR_GE

INSTRUCTION_FOLLOWING: less precise than Claude for complex multi-step system prompts
CODE_QUALITY: tends toward verbose code, more boilerplate than Claude
SAFETY: more permissive than Claude — requires additional guardrails
CONSISTENCY: output quality varies more between requests than Claude


LLM:GEMINI (Google)

COMPANY

FOUNDED: 2023 as Google DeepMind (merger of Google Brain and DeepMind)
FOCUS: multimodal AI, integration with Google ecosystem, long context
DIFFERENTIATOR: massive context windows, strong multimodal, Google infrastructure
GE_RELATIONSHIP: supplementary provider — specific agents configured for Gemini

MODEL_LINEUP (as of early 2026)

GEMINI_2.0_PRO: strong reasoning, very long context, multimodal
- context: 1M+ tokens (largest available)
- pricing: competitive mid tier
- GE usage: agents needing very long context (large codebase analysis, comprehensive reviews)
- strength: context window is unmatched — can process entire codebases

GEMINI_2.0_FLASH: fast, efficient, good for routine tasks
- context: 1M tokens
- pricing: very low
- GE usage: high-volume tasks where extreme context length is valuable
- strength: long context at low cost — best tokens-per-dollar for large context needs

GEMINI_STRENGTHS_FOR_GE

CONTEXT_WINDOW: 1M+ tokens — can process entire GE projects in a single prompt
MULTIMODAL: native multimodal (text, image, video, audio) — relevant for GE's media pipeline
COST: competitive pricing, aggressive free tiers
GOOGLE_INTEGRATION: potential for Google Workspace integration in client projects

GEMINI_WEAKNESSES_FOR_GE

CODE_QUALITY: less consistent than Claude for TypeScript, more hallucination on newer frameworks
INSTRUCTION_FOLLOWING: less precise for complex system prompts than Claude
API_STABILITY: Google has a history of deprecating products — risk factor
TOOL_USE: function calling less mature than Claude or GPT


LLM:OPEN_SOURCE_MODELS

LLAMA (Meta)

WHAT: Meta's open-source LLM family
LATEST: Llama 3.1 (8B, 70B, 405B parameters)
STRENGTHS: truly open (weights available), strong community, fine-tunable, self-hostable
WEAKNESSES: requires GPU infrastructure to self-host, smaller context than commercial models, code quality below Claude/GPT for TypeScript
GE_RELEVANCE: potential for self-hosted routine tasks to reduce API costs — assess when GE's volume justifies infrastructure
VERDICT: assess — monitor for cost optimization use case, not suitable as primary provider

MISTRAL

WHAT: European AI company, strong open-source models
LATEST: Mistral Large, Mixtral (mixture of experts), Codestral (code-focused)
STRENGTHS: European (data residency), efficient architecture, strong code models (Codestral), competitive performance
WEAKNESSES: smaller ecosystem than OpenAI/Anthropic, less training data, company is young
GE_RELEVANCE: Codestral is interesting for code-focused agents, European data residency aligns with GE's EU client base
VERDICT: assess — Codestral specifically worth evaluating for cost-efficient code generation

QWEN (Alibaba)

WHAT: Alibaba's open-source LLM family
LATEST: Qwen 2.5 series (various sizes)
STRENGTHS: strong multilingual (especially CJK), competitive benchmarks, open weights
WEAKNESSES: Alibaba dependency (geopolitical consideration), less tested in Western dev ecosystems, limited TypeScript training data
GE_RELEVANCE: low — GE's client base is European, Qwen's advantages are in Asian language markets
VERDICT: hold — monitor but no immediate relevance

DEEPSEEK

WHAT: Chinese AI lab, known for efficient training and strong coding models
LATEST: DeepSeek V3, DeepSeek-Coder
STRENGTHS: excellent code generation, efficient training methodology (MoE), competitive at fraction of compute
WEAKNESSES: geopolitical concerns (data handling, censorship), limited enterprise support, less tested in production
GE_RELEVANCE: DeepSeek-Coder benchmarks are impressive — potential for cost-efficient code generation if geopolitical concerns are acceptable
VERDICT: hold — impressive technically but geopolitical risk and data handling concerns conflict with GE's ISO 27001 requirements


LLM:MODEL_COMPARISON_MATRIX

Dimension              Claude Sonnet   GPT-4o   Gemini Pro   Llama 405B      Mistral Large
TypeScript Quality     5               4        3            3               3
Instruction Following  5               4        3            3               3
Reasoning Depth        5               4        4            3               3
Context Window         200K            128K     1M+          128K            128K
Speed                  4               5        4            3 (self-host)   4
Cost Efficiency        3               4        5            5 (self-host)   4
Tool Use               5               5        3            2               3
Vision                 4               5        5            3               3
Safety                 5               3        4            2               3
API Maturity           5               5        3            2               3

SCORING: 1-5 (1=poor, 5=excellent)
NOTE: scores are relative to GE's specific use case (agentic code generation with complex system prompts), not general benchmarks


LLM:AGENT_TYPE_TO_MODEL_MAPPING

WHEN_TO_USE_CLAUDE_OPUS

  • complex scoping and architectural decisions (Aimee)
  • nuanced evaluation requiring deep reasoning
  • tasks where instruction following precision is critical
  • long-form analysis and strategy documents
  • COST: justify the premium — only when Sonnet's reasoning is insufficient

WHEN_TO_USE_CLAUDE_SONNET

  • all standard development tasks (code generation, testing, review)
  • most agent execution (default for development agents)
  • tasks requiring strong TypeScript and good tool use
  • balance of quality and cost
  • DEFAULT: if unsure, use Sonnet

WHEN_TO_USE_CLAUDE_HAIKU

  • session summarization (session_summarizer.py)
  • classification and routing tasks
  • high-volume low-complexity operations
  • tasks where speed matters more than depth
  • WARNING: never for code generation in production — quality gap is real

WHEN_TO_USE_GPT_4o

  • multimodal tasks (image understanding, design review)
  • tasks where GPT's training data has better coverage (niche libraries)
  • agents configured for OpenAI (margot, benjamin, jouke, dinand)
  • parallel tool calling (GPT handles multiple function calls well)

WHEN_TO_USE_GEMINI

  • very large codebase analysis (leverage 1M+ context)
  • cost-sensitive high-volume tasks
  • multimodal tasks involving video/audio (Gemini's strength)
  • tasks where extreme context length compensates for slightly lower instruction following

MODEL_SELECTION_DECISION_TREE

Is this a complex reasoning/scoping task?
  YES → Claude Opus
  NO ↓
Is this a code generation task?
  YES → Claude Sonnet (default) or GPT-4o (if agent configured for OpenAI)
  NO ↓
Is this a high-volume routine task?
  YES → Claude Haiku or Gemini Flash (cost optimization)
  NO ↓
Does it require multimodal input (images, video)?
  YES → GPT-4o (images) or Gemini Pro (video)
  NO ↓
Does it require very long context (>200K tokens)?
  YES → Gemini Pro
  NO ↓
Default → Claude Sonnet
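The decision tree above can be sketched as a routing function. The task attributes and model identifiers here are illustrative placeholders, not GE's actual schema:

```python
def select_model(task: dict) -> str:
    """Walk the model-selection decision tree top to bottom."""
    if task.get("complex_reasoning"):
        return "claude-opus"
    if task.get("code_generation"):
        # Sonnet is the default; GPT-4o only when the agent is configured for OpenAI.
        return "gpt-4o" if task.get("agent_provider") == "openai" else "claude-sonnet"
    if task.get("high_volume_routine"):
        return "claude-haiku"  # or gemini-flash for cost optimization
    if task.get("multimodal"):
        return "gemini-pro" if task.get("has_video") else "gpt-4o"
    if task.get("context_tokens", 0) > 200_000:
        return "gemini-pro"  # only provider with 1M+ context
    return "claude-sonnet"  # default
```

Encoding the tree as code keeps the routing auditable: each branch maps one-to-one to a question in the tree above.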

LLM:EVALUATING_NEW_MODEL_RELEASES

EVALUATION_PROTOCOL

STEP_1: read the announcement carefully — identify claimed improvements, pricing changes, deprecation notices
STEP_2: check independent benchmarks — SWE-bench, HumanEval, MBPP, MMLU, LiveCodeBench
STEP_3: run GE-specific test suite — standardized prompts that test GE's actual use cases
STEP_4: compare cost — calculate cost-per-task for typical GE workloads, not just per-token pricing
STEP_5: test instruction following — GE's system prompts are complex, test with actual agent identities
STEP_6: test tool use — verify function calling works correctly with GE's tool definitions
STEP_7: write evaluation report with recommendation (switch/wait/ignore)
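STEP_4's cost-per-task comparison is simple arithmetic over per-MTok pricing. A sketch using the "typical" prices quoted in the lineups above (actual prices change; always recheck):

```python
# (input $/MTok, output $/MTok) -- typical figures from the lineups above.
PRICING = {
    "claude-opus":   (15.00, 75.00),
    "claude-sonnet": (3.00, 15.00),
    "claude-haiku":  (0.25, 1.25),
    "gpt-4o":        (2.50, 10.00),
    "gpt-4o-mini":   (0.15, 0.60),
}


def cost_per_task(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task -- compare models on realistic workloads,
    not just per-token pricing."""
    in_price, out_price = PRICING[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000
```

For example, a code-generation task with 20K input / 2K output tokens costs $0.06 + $0.03 = $0.09 on Sonnet, versus $1.80 on Opus — which is why STEP_4 works in cost-per-task, not cost-per-token.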

GE_SPECIFIC_TEST_SUITE

TEST_1: generate a Hono API route with Drizzle ORM query, Zod validation, and error handling (TypeScript quality)
TEST_2: follow a 3-tier identity system prompt and maintain character consistency across 10 turns (instruction following)
TEST_3: use 5 tools in sequence to investigate and fix a bug in a 500-line file (tool use)
TEST_4: read a 50K-token codebase and identify architectural issues (long context)
TEST_5: generate a React component matching shadcn/ui patterns with Tailwind (frontend quality)
TEST_6: write Vitest tests for a given function, including edge cases (testing quality)
TEST_7: review a PR diff and provide actionable feedback (code review quality)

EVALUATION_REPORT_TEMPLATE

MODEL: [provider] [model name] [version]
RELEASE_DATE: [date]
EVALUATED_BY: joshua
EVALUATION_DATE: [date]

CLAIMED_IMPROVEMENTS: [from announcement]
PRICING: [input/output per MTok, comparison to current]
CONTEXT_WINDOW: [size, comparison to current]

GE_TEST_RESULTS:
- TypeScript quality: [score/5] [notes]
- Instruction following: [score/5] [notes]
- Tool use: [score/5] [notes]
- Long context: [score/5] [notes]
- Frontend quality: [score/5] [notes]
- Testing quality: [score/5] [notes]
- Code review quality: [score/5] [notes]

COST_COMPARISON:
- Typical GE task cost (old model): $[X]
- Typical GE task cost (new model): $[X]
- Monthly projected cost change: [+/- $X] ([+/- %])

RECOMMENDATION: [switch immediately | trial for specific agents | wait for next iteration | ignore]
RATIONALE: [why]
AFFECTED_AGENTS: [which agents would benefit from switching]
MIGRATION_EFFORT: [low/medium/high] — [description]

LLM:MULTI_PROVIDER_STRATEGY

WHY_MULTI_PROVIDER

RESILIENCE: no single point of failure — if Anthropic has an outage, agents can fall back to OpenAI/Gemini
COST_OPTIMIZATION: use the cheapest model that meets quality requirements for each task type
CAPABILITY_MATCHING: different models excel at different tasks — match model to task
LEVERAGE: no vendor can unilaterally raise prices or change terms — GE has alternatives
COMPLIANCE: EU data residency requirements may mandate European providers (Mistral) for certain clients

IMPLEMENTATION_IN_GE

DATABASE: agents table has provider (enum: anthropic, openai, google) and provider_model columns
REGISTRY: AGENT-REGISTRY.json contains provider config per agent
EXECUTOR: agent_runner.py reads provider config and routes to appropriate API
CHAT: claude-chat.ts handles multi-provider chat (system prompt includes "You are powered by {provider}/{model}")
ADMIN_UI: provider and model configurable per agent in admin interface

PROVIDER_DISTRIBUTION (current)

ANTHROPIC: ~80% of agents (primary provider, strongest for GE's use case)
OPENAI: ~15% of agents (margot, benjamin, jouke, dinand — specific roles)
GOOGLE: ~5% of agents (experimental, long-context tasks)

FALLBACK_STRATEGY

LEVEL_1: retry with same provider (transient errors)
LEVEL_2: fall back to alternative model from same provider (e.g., Opus → Sonnet)
LEVEL_3: fall back to different provider (e.g., Anthropic → OpenAI)
RULE: fallback must maintain quality floor — do not fall back to a model that cannot handle the task
RULE: log all fallbacks for monitoring — frequent fallbacks indicate a systemic issue
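The three fallback levels can be sketched as a retry loop over a per-model chain. The chain contents below are illustrative, not GE's production config, and the quality-floor rule is expressed by simply omitting unacceptable models from a chain:

```python
# Same-provider fallback first (LEVEL_2), then cross-provider (LEVEL_3).
# Chains encode the quality floor: models that can't handle the task
# are simply absent.
FALLBACK_CHAIN = {
    "claude-opus":   ["claude-sonnet", "gpt-4o"],
    "claude-sonnet": ["gpt-4o"],
    "claude-haiku":  ["gpt-4o-mini", "gemini-flash"],
}


def run_with_fallback(model: str, call, retries: int = 2, log=print):
    """LEVEL_1: retry the same model; then walk the fallback chain.

    Every fallback attempt is logged so frequent fallbacks surface as a
    systemic issue.
    """
    for candidate in [model] + FALLBACK_CHAIN.get(model, []):
        for attempt in range(retries):
            try:
                return call(candidate)
            except Exception as exc:  # transient error or provider outage
                log(f"fallback: {candidate} attempt {attempt + 1} failed: {exc}")
    raise RuntimeError(f"all fallbacks exhausted for {model}")
```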


LLM:PRICING_AWARENESS

COST_TRACKING

GE tracks LLM costs at multiple levels: per-session ($5 limit), per-agent-per-hour ($10 limit), per-day ($100 system limit)
cost_gate.py enforces these limits in real-time during execution
Joshua monitors pricing announcements and models cost impact of provider changes
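The three limits above reduce to a simple threshold check before each execution step. This is a minimal sketch; cost_gate.py's real enforcement logic may differ:

```python
# Limits from COST_TRACKING: per-session, per-agent-per-hour, per-day.
LIMITS = {"session": 5.00, "agent_hour": 10.00, "day": 100.00}


def check_cost_gate(spend: dict[str, float]) -> list[str]:
    """Return the scopes whose limits the current spend has reached.

    An empty list means the execution may proceed; any breach should
    halt the session before the next LLM call is made.
    """
    return [scope for scope, limit in LIMITS.items()
            if spend.get(scope, 0.0) >= limit]
```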

TREND_1: input token costs are dropping faster than output token costs — favor models with strong reasoning (fewer output tokens needed)
TREND_2: context window pricing is flattening (same per-token rate at 200K as at 100K) — the marginal cost of long context is shrinking
TREND_3: competition is driving prices down roughly 50% annually — GE's margins improve over time
TREND_4: prompt caching is reducing costs for repeated system prompts — GE benefits (same agent identity across sessions)

COST_OPTIMIZATION_TECHNIQUES

TECHNIQUE_1: use Haiku/Flash for classification and routing, Sonnet/Pro for execution (tiered model usage)
TECHNIQUE_2: cache system prompts where providers support it (Anthropic prompt caching)
TECHNIQUE_3: minimize unnecessary context (don't send entire codebase when editing one file)
TECHNIQUE_4: batch related tasks to amortize system prompt cost
TECHNIQUE_5: monitor and kill runaway sessions (cost_gate.py enforcement)
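TECHNIQUE_2's savings can be estimated with Anthropic's published prompt-caching multipliers at the time of writing (cache write ~1.25x base input price, cache read ~0.1x) — verify current multipliers before relying on these figures:

```python
def cached_prompt_cost(base_in_per_mtok: float, prompt_tokens: int, calls: int,
                       write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Input cost of a system prompt reused across `calls` requests with caching.

    First call writes the cache (write_mult x base input price);
    subsequent calls read it (read_mult x base input price).
    """
    per_tok = base_in_per_mtok / 1_000_000
    write = prompt_tokens * per_tok * write_mult
    reads = prompt_tokens * per_tok * read_mult * (calls - 1)
    return write + reads
```

For example, a 10K-token agent identity on Sonnet ($3/MTok input) across 100 calls costs $3.00 uncached but roughly $0.33 cached — a ~90% saving on the system-prompt portion, which is why GE's stable agent identities benefit directly.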


LLM:FUTURE_CONSIDERATIONS

LOCAL_MODELS: as open-source models improve, GE may self-host for routine tasks — monitor Llama, Mistral, DeepSeek quality-to-cost ratio
FINE_TUNING: provider fine-tuning APIs could create GE-specialized models — evaluate when agent volume justifies the investment
MULTIMODAL_AGENTS: as vision and audio capabilities improve, GE's design and media agents can leverage direct multimodal input
REASONING_MODELS: o1/o3 class models may improve architectural decision quality — evaluate for Aimee and Joshua's use cases
COST_FLOOR: at some point, LLM costs become negligible relative to orchestration costs — plan for this transition