DOMAIN:INNOVATION — LLM_LANDSCAPE

OWNER: joshua
UPDATED: 2026-03-24
PURPOSE: comprehensive view of the LLM provider landscape, model comparison, selection criteria, and GE's multi-provider strategy
SCOPE: all major LLM providers relevant to GE's agent execution and client project delivery
RELEVANCE: GE's 59 agents run on multiple LLM providers — model selection directly impacts quality, cost, and reliability


LLM:PROVIDER_OVERVIEW

GE uses a multi-provider strategy: Claude (Anthropic) as primary, OpenAI and Gemini as supplementary providers. Each agent has a configured provider and model in the agents table (provider + provider_model columns). Provider configuration lives in config/providers/*.yaml and AGENT-REGISTRY.json.
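A minimal sketch of how an executor might resolve an agent's provider and model from AGENT-REGISTRY.json — the registry layout shown here is illustrative, and the real file may nest entries differently:

```python
import json


def load_provider_config(registry_path: str, agent_name: str) -> tuple[str, str]:
    """Resolve an agent's (provider, provider_model) pair from the registry.

    Assumes a flat {agent_name: {"provider": ..., "provider_model": ...}}
    layout, mirroring the agents table's provider + provider_model columns.
    """
    with open(registry_path) as f:
        registry = json.load(f)
    entry = registry[agent_name]
    return entry["provider"], entry["provider_model"]
```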


LLM:CLAUDE (Anthropic)

COMPANY

FOUNDED: 2021 by Dario Amodei, Daniela Amodei (ex-OpenAI)
FOCUS: AI safety, constitutional AI, responsible scaling
DIFFERENTIATOR: safety-first approach, strong reasoning, excellent instruction following
GE_RELATIONSHIP: primary provider — majority of GE agents run on Claude

MODEL_LINEUP (as of early 2026)

OPUS: highest capability, strongest reasoning, best for complex multi-step tasks
- context: 200K tokens
- pricing: premium tier ($15/$75 per MTok typical for latest)
- GE usage: Aimee (scoping, requires deep reasoning), complex evaluation tasks
- strength: unmatched reasoning depth, nuanced understanding, long-form analysis

SONNET: balanced capability and cost, strong coding, fast
- context: 200K tokens
- pricing: mid tier ($3/$15 per MTok typical for latest)
- GE usage: most development agents (urszula, maxim, sandro, etc.), general-purpose tasks
- strength: best cost/quality ratio for coding tasks, fast response times

HAIKU: fastest, lowest cost, good for routine tasks
- context: 200K tokens
- pricing: low tier ($0.25/$1.25 per MTok typical for latest)
- GE usage: session_summarizer.py (learning extraction), routine classification, lightweight tasks
- strength: speed and cost — ideal for high-volume low-complexity operations
- WARNING: Claude Code uses Haiku internally for subagent processing — Explore subagent calls burn Haiku tokens rapidly

CLAUDE_STRENGTHS_FOR_GE

INSTRUCTION_FOLLOWING: Claude excels at following complex system prompts — critical for GE's tiered identity system
CODE_QUALITY: consistently generates clean, well-structured TypeScript — matches GE's standards
SAFETY: refuses harmful outputs reliably — aligns with GE's constitution
TOOL_USE: excellent function calling and tool use — foundation for agent execution
LONG_CONTEXT: 200K context handles large codebases — GE projects can be substantial

CLAUDE_WEAKNESSES_FOR_GE

COST_AT_SCALE: Opus is expensive for high-volume tasks — GE's cost_gate.py exists because of this
RATE_LIMITS: can be restrictive during high-volume agent execution — need careful batching
API_CHANGES: Anthropic iterates rapidly — API version changes require monitoring (Joshua's job)
VISION: capable but not best-in-class for design-to-code tasks


LLM:GPT (OpenAI)

COMPANY

FOUNDED: 2015 by Sam Altman, Greg Brockman, Ilya Sutskever et al.
FOCUS: AGI development, broad AI capabilities, enterprise adoption
DIFFERENTIATOR: largest ecosystem, broadest tooling, aggressive release cadence
GE_RELATIONSHIP: supplementary provider — specific agents configured for OpenAI

MODEL_LINEUP (as of early 2026)

GPT-4o: multimodal flagship, strong all-around performance
- context: 128K tokens
- pricing: mid tier ($2.50/$10 per MTok typical)
- GE usage: agents requiring strong multimodal capabilities (image understanding, design review)
- strength: multimodal integration, broad knowledge, fast

GPT-4o-mini: cost-efficient, fast, good for routine tasks
- context: 128K tokens
- pricing: low tier ($0.15/$0.60 per MTok typical)
- GE usage: high-volume routine tasks where Claude Haiku is not preferred
- strength: very low cost, fast, surprisingly capable for simple tasks

o1/o3 REASONING_MODELS: specialized reasoning models for complex problem-solving
- context: varies by model
- pricing: premium tier
- GE usage: potential for complex architectural decisions, proof-of-concept evaluation
- strength: deep reasoning, chain-of-thought, mathematical and logical tasks
- weakness: slow, expensive, not suitable for routine code generation

GPT_STRENGTHS_FOR_GE

ECOSYSTEM: largest ecosystem of tools, libraries, and integrations
FUNCTION_CALLING: mature function calling with parallel tool use
MULTIMODAL: strong vision capabilities for design-related tasks
FINE_TUNING: available for custom model training (future GE consideration)

GPT_WEAKNESSES_FOR_GE

INSTRUCTION_FOLLOWING: less precise than Claude for complex multi-step system prompts
CODE_QUALITY: tends toward verbose code, more boilerplate than Claude
SAFETY: more permissive than Claude — requires additional guardrails
CONSISTENCY: output quality varies more between requests than Claude


LLM:GEMINI (Google)

COMPANY

FOUNDED: 2023 as Google DeepMind (merger of Google Brain and DeepMind)
FOCUS: multimodal AI, integration with Google ecosystem, long context
DIFFERENTIATOR: massive context windows, strong multimodal, Google infrastructure
GE_RELATIONSHIP: supplementary provider — specific agents configured for Gemini

MODEL_LINEUP (as of early 2026)

GEMINI_2.0_PRO: strong reasoning, very long context, multimodal
- context: 1M+ tokens (largest available)
- pricing: competitive mid tier
- GE usage: agents needing very long context (large codebase analysis, comprehensive reviews)
- strength: context window is unmatched — can process entire codebases

GEMINI_2.0_FLASH: fast, efficient, good for routine tasks
- context: 1M tokens
- pricing: very low
- GE usage: high-volume tasks where extreme context length is valuable
- strength: long context at low cost — best tokens-per-dollar for large context needs

GEMINI_STRENGTHS_FOR_GE

CONTEXT_WINDOW: 1M+ tokens — can process entire GE projects in a single prompt
MULTIMODAL: native multimodal (text, image, video, audio) — relevant for GE's media pipeline
COST: competitive pricing, aggressive free tiers
GOOGLE_INTEGRATION: potential for Google Workspace integration in client projects

GEMINI_WEAKNESSES_FOR_GE

CODE_QUALITY: less consistent than Claude for TypeScript, more hallucination on newer frameworks
INSTRUCTION_FOLLOWING: less precise for complex system prompts than Claude
API_STABILITY: Google has a history of deprecating products — risk factor
TOOL_USE: function calling less mature than Claude or GPT


LLM:OPEN_SOURCE_MODELS

LLAMA (Meta)

WHAT: Meta's open-source LLM family
LATEST: Llama 3.1 (8B, 70B, 405B parameters)
STRENGTHS: truly open (weights available), strong community, fine-tunable, self-hostable
WEAKNESSES: requires GPU infrastructure to self-host, smaller context than commercial models, code quality below Claude/GPT for TypeScript
GE_RELEVANCE: potential for self-hosted routine tasks to reduce API costs — assess when GE's volume justifies infrastructure
VERDICT: assess — monitor for cost optimization use case, not suitable as primary provider

MISTRAL

WHAT: European AI company, strong open-source models
LATEST: Mistral Large, Mixtral (mixture of experts), Codestral (code-focused)
STRENGTHS: European (data residency), efficient architecture, strong code models (Codestral), competitive performance
WEAKNESSES: smaller ecosystem than OpenAI/Anthropic, less training data, company is young
GE_RELEVANCE: Codestral is interesting for code-focused agents, European data residency aligns with GE's EU client base
VERDICT: assess — Codestral specifically worth evaluating for cost-efficient code generation

QWEN (Alibaba)

WHAT: Alibaba's open-source LLM family
LATEST: Qwen 2.5 series (various sizes)
STRENGTHS: strong multilingual (especially CJK), competitive benchmarks, open weights
WEAKNESSES: Alibaba dependency (geopolitical consideration), less tested in Western dev ecosystems, limited TypeScript training data
GE_RELEVANCE: low — GE's client base is European, Qwen's advantages are in Asian language markets
VERDICT: hold — monitor but no immediate relevance

DEEPSEEK

WHAT: Chinese AI lab, known for efficient training and strong coding models
LATEST: DeepSeek V3, DeepSeek-Coder
STRENGTHS: excellent code generation, efficient training methodology (MoE), competitive at fraction of compute
WEAKNESSES: geopolitical concerns (data handling, censorship), limited enterprise support, less tested in production
GE_RELEVANCE: DeepSeek-Coder benchmarks are impressive — potential for cost-efficient code generation if geopolitical concerns are acceptable
VERDICT: hold — impressive technically but geopolitical risk and data handling concerns conflict with GE's ISO 27001 requirements


LLM:MODEL_COMPARISON_MATRIX

Dimension              Claude Sonnet   GPT-4o   Gemini Pro   Llama 405B      Mistral Large
TypeScript Quality     5               4        3            3               3
Instruction Following  5               4        3            3               3
Reasoning Depth        5               4        4            3               3
Context Window         200K            128K     1M+          128K            128K
Speed                  4               5        4            3 (self-host)   4
Cost Efficiency        3               4        5            5 (self-host)   4
Tool Use               5               5        3            2               3
Vision                 4               5        5            3               3
Safety                 5               3        4            2               3
API Maturity           5               5        3            2               3

SCORING: 1-5 (1=poor, 5=excellent)
NOTE: scores are relative to GE's specific use case (agentic code generation with complex system prompts), not general benchmarks


LLM:AGENT_TYPE_TO_MODEL_MAPPING

WHEN_TO_USE_CLAUDE_OPUS

  • complex scoping and architectural decisions (Aimee)
  • nuanced evaluation requiring deep reasoning
  • tasks where instruction following precision is critical
  • long-form analysis and strategy documents
  • COST: justify the premium — only when Sonnet's reasoning is insufficient

WHEN_TO_USE_CLAUDE_SONNET

  • all standard development tasks (code generation, testing, review)
  • most agent execution (default for development agents)
  • tasks requiring strong TypeScript and good tool use
  • balance of quality and cost
  • DEFAULT: if unsure, use Sonnet

WHEN_TO_USE_CLAUDE_HAIKU

  • session summarization (session_summarizer.py)
  • classification and routing tasks
  • high-volume low-complexity operations
  • tasks where speed matters more than depth
  • WARNING: never for code generation in production — quality gap is real

WHEN_TO_USE_GPT_4o

  • multimodal tasks (image understanding, design review)
  • tasks where GPT's training data has better coverage (niche libraries)
  • agents configured for OpenAI (margot, benjamin, jouke, dinand)
  • parallel tool calling (GPT handles multiple function calls well)

WHEN_TO_USE_GEMINI

  • very large codebase analysis (leverage 1M+ context)
  • cost-sensitive high-volume tasks
  • multimodal tasks involving video/audio (Gemini's strength)
  • tasks where extreme context length compensates for slightly lower instruction following

MODEL_SELECTION_DECISION_TREE

Is this a complex reasoning/scoping task?
  YES → Claude Opus
  NO ↓
Is this a code generation task?
  YES → Claude Sonnet (default) or GPT-4o (if agent configured for OpenAI)
  NO ↓
Is this a high-volume routine task?
  YES → Claude Haiku or Gemini Flash (cost optimization)
  NO ↓
Does it require multimodal input (images, video)?
  YES → GPT-4o (images) or Gemini Pro (video)
  NO ↓
Does it require very long context (>200K tokens)?
  YES → Gemini Pro
  NO ↓
Default → Claude Sonnet
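The decision tree above can be sketched as a routing function. The task attributes and model identifiers here are illustrative placeholders, not GE's actual schema:

```python
def select_model(task: dict) -> str:
    """Walk the model-selection decision tree top to bottom."""
    if task.get("complex_reasoning"):
        return "claude-opus"
    if task.get("code_generation"):
        # Sonnet is the default; GPT-4o only when the agent is configured for OpenAI.
        return "gpt-4o" if task.get("agent_provider") == "openai" else "claude-sonnet"
    if task.get("high_volume_routine"):
        return "claude-haiku"  # or gemini-flash for cost optimization
    if task.get("multimodal"):
        return "gemini-pro" if task.get("has_video") else "gpt-4o"
    if task.get("context_tokens", 0) > 200_000:
        return "gemini-pro"  # only provider with 1M+ context
    return "claude-sonnet"  # default
```

Encoding the tree as code keeps the routing auditable: each branch maps one-to-one to a question in the tree above.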

LLM:EVALUATING_NEW_MODEL_RELEASES

EVALUATION_PROTOCOL

STEP_1: read the announcement carefully — identify claimed improvements, pricing changes, deprecation notices
STEP_2: check independent benchmarks — SWE-bench, HumanEval, MBPP, MMLU, LiveCodeBench
STEP_3: run GE-specific test suite — standardized prompts that test GE's actual use cases
STEP_4: compare cost — calculate cost-per-task for typical GE workloads, not just per-token pricing
STEP_5: test instruction following — GE's system prompts are complex, test with actual agent identities
STEP_6: test tool use — verify function calling works correctly with GE's tool definitions
STEP_7: write evaluation report with recommendation (switch/wait/ignore)
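STEP_4's cost-per-task comparison is simple arithmetic over per-MTok pricing. A sketch using the "typical" prices quoted in the lineups above (actual prices change; always recheck):

```python
# (input $/MTok, output $/MTok) -- typical figures from the lineups above.
PRICING = {
    "claude-opus":   (15.00, 75.00),
    "claude-sonnet": (3.00, 15.00),
    "claude-haiku":  (0.25, 1.25),
    "gpt-4o":        (2.50, 10.00),
    "gpt-4o-mini":   (0.15, 0.60),
}


def cost_per_task(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one task -- compare models on realistic workloads,
    not just per-token pricing."""
    in_price, out_price = PRICING[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000
```

For example, a code-generation task with 20K input / 2K output tokens costs $0.06 + $0.03 = $0.09 on Sonnet, versus $1.80 on Opus — which is why STEP_4 works in cost-per-task, not cost-per-token.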

GE_SPECIFIC_TEST_SUITE

TEST_1: generate a Hono API route with Drizzle ORM query, Zod validation, and error handling (TypeScript quality)
TEST_2: follow a 3-tier identity system prompt and maintain character consistency across 10 turns (instruction following)
TEST_3: use 5 tools in sequence to investigate and fix a bug in a 500-line file (tool use)
TEST_4: read a 50K-token codebase and identify architectural issues (long context)
TEST_5: generate a React component matching shadcn/ui patterns with Tailwind (frontend quality)
TEST_6: write Vitest tests for a given function, including edge cases (testing quality)
TEST_7: review a PR diff and provide actionable feedback (code review quality)

EVALUATION_REPORT_TEMPLATE

MODEL: [provider] [model name] [version]
RELEASE_DATE: [date]
EVALUATED_BY: joshua
EVALUATION_DATE: [date]

CLAIMED_IMPROVEMENTS: [from announcement]
PRICING: [input/output per MTok, comparison to current]
CONTEXT_WINDOW: [size, comparison to current]

GE_TEST_RESULTS:
- TypeScript quality: [score/5] [notes]
- Instruction following: [score/5] [notes]
- Tool use: [score/5] [notes]
- Long context: [score/5] [notes]
- Frontend quality: [score/5] [notes]
- Testing quality: [score/5] [notes]
- Code review quality: [score/5] [notes]

COST_COMPARISON:
- Typical GE task cost (old model): $[X]
- Typical GE task cost (new model): $[X]
- Monthly projected cost change: [+/- $X] ([+/- %])

RECOMMENDATION: [switch immediately | trial for specific agents | wait for next iteration | ignore]
RATIONALE: [why]
AFFECTED_AGENTS: [which agents would benefit from switching]
MIGRATION_EFFORT: [low/medium/high] — [description]

LLM:MULTI_PROVIDER_STRATEGY

WHY_MULTI_PROVIDER

RESILIENCE: no single point of failure — if Anthropic has an outage, agents can fall back to OpenAI/Gemini
COST_OPTIMIZATION: use the cheapest model that meets quality requirements for each task type
CAPABILITY_MATCHING: different models excel at different tasks — match model to task
LEVERAGE: no vendor can unilaterally raise prices or change terms — GE has alternatives
COMPLIANCE: EU data residency requirements may mandate European providers (Mistral) for certain clients

IMPLEMENTATION_IN_GE

DATABASE: agents table has provider (enum: anthropic, openai, google) and provider_model columns
REGISTRY: AGENT-REGISTRY.json contains provider config per agent
EXECUTOR: agent_runner.py reads provider config and routes to appropriate API
CHAT: claude-chat.ts handles multi-provider chat (system prompt includes "You are powered by {provider}/{model}")
ADMIN_UI: provider and model configurable per agent in admin interface

PROVIDER_DISTRIBUTION (current)

ANTHROPIC: ~80% of agents (primary provider, strongest for GE's use case)
OPENAI: ~15% of agents (margot, benjamin, jouke, dinand — specific roles)
GOOGLE: ~5% of agents (experimental, long-context tasks)

FALLBACK_STRATEGY

LEVEL_1: retry with same provider (transient errors)
LEVEL_2: fall back to alternative model from same provider (e.g., Opus → Sonnet)
LEVEL_3: fall back to different provider (e.g., Anthropic → OpenAI)
RULE: fallback must maintain quality floor — do not fall back to a model that cannot handle the task
RULE: log all fallbacks for monitoring — frequent fallbacks indicate a systemic issue
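The three fallback levels can be sketched as a retry loop over a per-model chain. The chain contents below are illustrative, not GE's production config, and the quality-floor rule is expressed by simply omitting unacceptable models from a chain:

```python
# Same-provider fallback first (LEVEL_2), then cross-provider (LEVEL_3).
# Chains encode the quality floor: models that can't handle the task
# are simply absent.
FALLBACK_CHAIN = {
    "claude-opus":   ["claude-sonnet", "gpt-4o"],
    "claude-sonnet": ["gpt-4o"],
    "claude-haiku":  ["gpt-4o-mini", "gemini-flash"],
}


def run_with_fallback(model: str, call, retries: int = 2, log=print):
    """LEVEL_1: retry the same model; then walk the fallback chain.

    Every fallback attempt is logged so frequent fallbacks surface as a
    systemic issue.
    """
    for candidate in [model] + FALLBACK_CHAIN.get(model, []):
        for attempt in range(retries):
            try:
                return call(candidate)
            except Exception as exc:  # transient error or provider outage
                log(f"fallback: {candidate} attempt {attempt + 1} failed: {exc}")
    raise RuntimeError(f"all fallbacks exhausted for {model}")
```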


LLM:PRICING_AWARENESS

COST_TRACKING

GE tracks LLM costs at multiple levels: per-session ($5 limit), per-agent-per-hour ($10 limit), per-day ($100 system limit)
cost_gate.py enforces these limits in real-time during execution
Joshua monitors pricing announcements and models cost impact of provider changes
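The three limits above reduce to a simple threshold check before each execution step. This is a minimal sketch; cost_gate.py's real enforcement logic may differ:

```python
# Limits from COST_TRACKING: per-session, per-agent-per-hour, per-day.
LIMITS = {"session": 5.00, "agent_hour": 10.00, "day": 100.00}


def check_cost_gate(spend: dict[str, float]) -> list[str]:
    """Return the scopes whose limits the current spend has reached.

    An empty list means the execution may proceed; any breach should
    halt the session before the next LLM call is made.
    """
    return [scope for scope, limit in LIMITS.items()
            if spend.get(scope, 0.0) >= limit]
```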

TREND_1: input token costs are dropping faster than output token costs — favor models with strong reasoning (fewer output tokens needed)
TREND_2: context window pricing is flattening (same per-token rate at 200K as at 100K) — the marginal cost of long context is shrinking
TREND_3: competition is driving prices down roughly 50% annually — GE's margins improve over time
TREND_4: prompt caching is reducing costs for repeated system prompts — GE benefits (same agent identity across sessions)

COST_OPTIMIZATION_TECHNIQUES

TECHNIQUE_1: use Haiku/Flash for classification and routing, Sonnet/Pro for execution (tiered model usage)
TECHNIQUE_2: cache system prompts where providers support it (Anthropic prompt caching)
TECHNIQUE_3: minimize unnecessary context (don't send entire codebase when editing one file)
TECHNIQUE_4: batch related tasks to amortize system prompt cost
TECHNIQUE_5: monitor and kill runaway sessions (cost_gate.py enforcement)
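TECHNIQUE_2's savings can be estimated with Anthropic's published prompt-caching multipliers at the time of writing (cache write ~1.25x base input price, cache read ~0.1x) — verify current multipliers before relying on these figures:

```python
def cached_prompt_cost(base_in_per_mtok: float, prompt_tokens: int, calls: int,
                       write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Input cost of a system prompt reused across `calls` requests with caching.

    First call writes the cache (write_mult x base input price);
    subsequent calls read it (read_mult x base input price).
    """
    per_tok = base_in_per_mtok / 1_000_000
    write = prompt_tokens * per_tok * write_mult
    reads = prompt_tokens * per_tok * read_mult * (calls - 1)
    return write + reads
```

For example, a 10K-token agent identity on Sonnet ($3/MTok input) across 100 calls costs $3.00 uncached but roughly $0.33 cached — a ~90% saving on the system-prompt portion, which is why GE's stable agent identities benefit directly.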


LLM:FUTURE_CONSIDERATIONS

LOCAL_MODELS: as open-source models improve, GE may self-host for routine tasks — monitor Llama, Mistral, DeepSeek quality-to-cost ratio
FINE_TUNING: provider fine-tuning APIs could create GE-specialized models — evaluate when agent volume justifies the investment
MULTIMODAL_AGENTS: as vision and audio capabilities improve, GE's design and media agents can leverage direct multimodal input
REASONING_MODELS: o1/o3 class models may improve architectural decision quality — evaluate for Aimee and Joshua's use cases
COST_FLOOR: at some point, LLM costs become negligible relative to orchestration costs — plan for this transition