MCP Integration Pitfalls
STATUS: active
OWNER: alexander, floris, floor (any agent using MCP tools)
CATEGORY: pitfall
ADDED: 2026-03-27
MCP CONFIG IN EXECUTOR
Per-Agent MCP Configs
Agent-specific MCP configs live at /ge-ops/master/mcp-configs/{agent_name}.json.
The executor auto-detects these and passes --mcp-config to the Claude Code CLI.
Only the Claude provider supports MCP. The OpenAI Codex and Gemini CLI providers do not have MCP support, so if an agent uses a non-Claude provider, its MCP config is silently ignored.
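The detection and provider gating described above can be sketched as follows. This is a hypothetical helper, not the executor's actual code: the function name and the provider identifiers ("claude", "codex", "gemini") are assumptions. Only the directory layout and the --mcp-config flag come from this page.

```python
from pathlib import Path

# Config location documented above; the helper below is illustrative only.
MCP_CONFIG_DIR = Path("/ge-ops/master/mcp-configs")


def build_mcp_args(agent_name: str, provider: str,
                   config_dir: Path = MCP_CONFIG_DIR) -> list[str]:
    """Return extra CLI args for MCP, or [] when MCP does not apply."""
    if provider != "claude":
        # Non-Claude providers have no MCP support: skip silently,
        # which is exactly why a misassigned provider goes unnoticed.
        return []
    config_path = config_dir / f"{agent_name}.json"
    if not config_path.is_file():
        # No per-agent config: run without MCP.
        return []
    return ["--mcp-config", str(config_path)]
```

Returning an empty list instead of raising mirrors the "silently ignored" behavior; if you would rather catch a non-Claude agent that has an MCP config, logging a warning in the first branch is the obvious place.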
Env Var Inheritance
MCP server processes inherit environment variables from the parent executor process.
Do NOT put API keys in the MCP config JSON. Instead:
1. Add the key to the ge-secrets Kubernetes secret.
2. Add the corresponding env var to executor-deployment.yaml.
3. The MCP server inherits it automatically.
Example: Stitch MCP needs STITCH_API_KEY. It's set in the executor pod env,
and the @_davideast/stitch-mcp process inherits it.
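A minimal sketch of the inheritance rule, assuming the MCP server is launched like a default subprocess (no custom env passed), so it receives the parent's environment unchanged. The helper and the variable name GE_DEMO_INHERITED_VAR are hypothetical stand-ins for the executor and STITCH_API_KEY.

```python
import os
import subprocess
import sys


def child_sees_var(name: str) -> bool:
    """Spawn a child Python process and report whether it can read env var `name`.

    No env= argument is passed, so the child inherits os.environ, the same way
    an MCP server launched without an env override inherits the pod env.
    """
    probe = f"import os, sys; sys.exit(0 if os.environ.get({name!r}) else 1)"
    result = subprocess.run([sys.executable, "-c", probe])
    return result.returncode == 0
```

Setting os.environ["GE_DEMO_INHERITED_VAR"] before the call makes child_sees_var return True; deleting it makes it return False. No config file is involved, which is the point: the secret only needs to exist in the parent's environment.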
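MCP Config Format
Do NOT include an env block unless you need to override inherited vars.
Claude Code does not do ${VAR} shell expansion in MCP config env blocks, so a value such as "${STITCH_API_KEY}" is passed to the server as that literal string, not as the key.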
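A hedged sketch of a pre-commit lint for these two rules. It assumes the usual Claude Code config shape (a top-level "mcpServers" object whose entries have "command", "args", and optional "env"); the linter itself is hypothetical and not part of the executor.

```python
import json
import re

# Shell-style references such as $VAR or ${VAR}, which are not expanded.
SHELL_REF = re.compile(r"\$\{?[A-Za-z_][A-Za-z0-9_]*\}?")


def lint_mcp_config(text: str) -> list[str]:
    """Return warnings about env blocks in an MCP config JSON document."""
    warnings = []
    servers = json.loads(text).get("mcpServers", {})
    for name, server in servers.items():
        env = server.get("env")
        if env is None:
            continue  # preferred: the server inherits the executor env
        warnings.append(f"{name}: env block present; only needed to override inherited vars")
        for key, value in env.items():
            if SHELL_REF.search(str(value)):
                warnings.append(
                    f"{name}: env {key}={value!r} is not expanded and will be passed literally"
                )
    return warnings
```

A config such as {"mcpServers": {"stitch": {"command": "npx", "args": ["-y", "@_davideast/stitch-mcp"]}}} produces no warnings (the server name and args here are illustrative assumptions), while adding "env": {"STITCH_API_KEY": "${STITCH_API_KEY}"} produces two.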
Testing MCP Tools
Always test with the doctor command first:
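The exact doctor invocation is not reproduced on this page. As a hedged sketch, a hypothetical preflight helper can wrap whatever diagnostic command you use and treat a non-zero exit, a timeout, or a missing binary as failure before the server is wired into an agent.

```python
import subprocess


def preflight(cmd: list[str], timeout_s: float = 60.0) -> tuple[bool, str]:
    """Run a diagnostic command; return (ok, combined stdout+stderr)."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # OSError covers a missing binary (FileNotFoundError).
        return False, str(exc)
    return result.returncode == 0, result.stdout + result.stderr
```

Capturing the output keeps the diagnostic text available for an incident note when the check fails.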
Stitch MCP Specifics
- Stitch MCP CLI version at time of integration: 0.5.1
- Stitch is mobile-first — landing page asks "what mobile app are we designing today?"
- API endpoint: stitch.googleapis.com
- Auth: API key (separate from Gemini API key — different service, different key)
- Stitch is free (Google Labs) — track usage for analytics only
- Generates HTML/CSS + screenshots — not native SwiftUI/React
- For iOS: screenshots serve as visual references, devs build native implementations
npx downloads the package on demand, so no global install is required, but the executor Dockerfile installs it globally for faster startup.
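The "global install for speed, npx as fallback" idea can be sketched as a launcher lookup. This is an illustration only: the binary name a global install of @_davideast/stitch-mcp exposes (assumed here to be stitch-mcp) and the helper name are assumptions, not taken from the executor.

```python
import shutil


def resolve_launcher(bin_name: str, package: str) -> list[str]:
    """Prefer a globally installed binary; fall back to an on-demand npx download."""
    path = shutil.which(bin_name)
    if path is not None:
        return [path]  # fast path: already installed in the image
    # --yes / -y skips npx's install prompt, which would hang a headless pod.
    return ["npx", "-y", package]
```

For example, resolve_launcher("stitch-mcp", "@_davideast/stitch-mcp") returns the installed binary's path when it is on PATH and the npx form otherwise.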
Image Rebuild Required
After changing the Dockerfile or MCP configs, rebuild the image and restart the deployment:
bash ge-ops/infrastructure/local/k3s/executor/build-executor.sh
kubectl rollout restart deployment/ge-executor -n ge-agents
MCP configs at /ge-ops/master/mcp-configs/ are included in the executor image via the ge-ops COPY, so changes to MCP config files require an image rebuild.
IMAGE CORRUPTION: SESSION KILLER (2026-03-27 INCIDENT)
Severity: CRITICAL (discovered during the Abby PA dogfood)
When an MCP tool (e.g., Stitch) generates or imports an image that the Claude API
cannot process, the image data becomes stuck in the conversation context. Every
subsequent API call re-sends the corrupted image, causing persistent 400 errors.
The session cannot recover on its own; the only fixes are /clear (which discards the
conversation context) or killing the process.
How It Happens
- Agent uses Stitch MCP to generate a design
- Stitch returns image data (base64 or file reference)
- Image enters the Claude conversation context
- Claude API returns 400 "unable to process image"
- Every subsequent message includes the same broken image → infinite 400 loop
- Tokens burn on each failed retry until budget exhausted or session killed
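A toy model of why the loop never ends: chat-style APIs receive the full history on every call, so one unprocessable image block poisons every later request until the history is cleared. FakeSession and its block format are hypothetical; they only model the mechanism described above.

```python
from dataclasses import dataclass, field


@dataclass
class FakeSession:
    """Toy session: every call re-sends the whole history, like a chat API."""
    history: list[dict] = field(default_factory=list)

    def add_tool_result(self, block: dict) -> None:
        # Tool output (e.g. an MCP image) lands in the context permanently.
        self.history.append(block)

    def call_api(self, user_text: str) -> int:
        self.history.append({"type": "text", "text": user_text})
        # The fake API rejects the request if ANY block is an unprocessable image.
        if any(b.get("type") == "image" and not b.get("valid") for b in self.history):
            return 400
        return 200

    def clear(self) -> None:
        # Equivalent of /clear: drop the context, including the bad image.
        self.history.clear()
```

After add_tool_result({"type": "image", "valid": False}), every call_api returns 400 no matter what text is sent; only clear() restores a 200. Retrying, rephrasing, or waiting cannot help because the bad block is part of every request.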
Mitigation (Implemented)
The executor now has a resilience module (ge_agent/execution/resilience.py):
- Detects error storms (3 or more identical errors within a 30-second window)
- Detects corruption patterns (an API 400 error combined with image-related keywords)
- Kills the session immediately to stop token burn
- Writes a formal incident record (ISO/SOC tracked) to ge-ops/system/incidents/
- Notifies human inbox at ge-ops/system/inbox/pending/
- Records learning to ge-ops/system/learnings/
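The two detection rules can be sketched as a sliding-window counter plus a keyword check. This is not the code in ge_agent/execution/resilience.py; the class name, the keyword list, and the use of the raw error string as the "identical" key are assumptions. The kill, incident, inbox, and learning steps are left out.

```python
import time
from collections import deque
from typing import Optional

IMAGE_KEYWORDS = ("image", "base64", "media_type")  # hypothetical keyword list


class StormDetector:
    """Flag `threshold` identical errors inside a sliding window (default 3 in 30 s)."""

    def __init__(self, threshold: int = 3, window_s: float = 30.0):
        self.threshold = threshold
        self.window_s = window_s
        self.events: dict[str, deque] = {}

    def record(self, error: str, now: Optional[float] = None) -> bool:
        """Record one occurrence of `error`; return True once it forms a storm."""
        now = time.monotonic() if now is None else now
        q = self.events.setdefault(error, deque())
        q.append(now)
        # Drop timestamps that have fallen out of the window.
        while q and now - q[0] > self.window_s:
            q.popleft()
        return len(q) >= self.threshold


def looks_like_image_corruption(status: int, message: str) -> bool:
    """An API 400 whose message mentions image-related terms."""
    text = message.lower()
    return status == 400 and any(k in text for k in IMAGE_KEYWORDS)
```

Using time.monotonic keeps the window immune to wall-clock jumps; the optional now argument exists so the logic can be tested deterministically.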
Prevention For Agents
Agents using image-generating MCP tools MUST:
1. Validate image output before letting it enter the conversation context.
2. Use file references instead of inline base64 when possible.
3. Keep image operations as the LAST step, so corruption doesn't kill earlier work.
4. When generating multiple designs, save each to disk before proceeding to the next.
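Rules 1, 2, and 4 can be combined in one hypothetical helper: decode the base64 output, check magic bytes and size, write it to disk, and hand back only the path. The 5 MB cap and the format list are assumptions; a magic-byte check is a cheap heuristic that catches garbage or non-image payloads but not every truncated file, so decoding with an image library is a stronger option if one is available.

```python
import base64
import binascii
import hashlib
from pathlib import Path
from typing import Optional

# Magic-byte prefixes for common formats.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpg",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}
MAX_BYTES = 5 * 1024 * 1024  # assumed cap; tune to the API's real limit


def sniff_format(data: bytes) -> Optional[str]:
    for magic, ext in SIGNATURES.items():
        if data.startswith(magic):
            return ext
    return None


def save_image_reference(b64: str, out_dir: Path) -> Optional[Path]:
    """Decode, validate, and persist an image; return its path, or None if rejected.

    Only the returned path, never the image bytes, should go back into the
    conversation context.
    """
    try:
        data = base64.b64decode(b64, validate=True)
    except binascii.Error:
        return None  # not valid base64
    ext = sniff_format(data)
    if ext is None or len(data) > MAX_BYTES:
        return None
    out_dir.mkdir(parents=True, exist_ok=True)
    # Content-addressed name: each design is saved once before moving on.
    path = out_dir / f"{hashlib.sha256(data).hexdigest()[:16]}.{ext}"
    path.write_bytes(data)
    return path
```

A None result is the signal to report the failure as text instead of forwarding the payload, which keeps a bad image out of the context entirely.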
RELATED
- Stitch integration doc: development/integrations/google-stitch.md
- Executor build: ge-ops/infrastructure/local/k3s/executor/build-executor.sh
- Provider config: ge_agent/execution/provider_config.py
- Claude provider: ge_agent/execution/providers/claude.py