Context Management¶
The Most Critical Challenge¶
Context management is the single most important discipline in agentic software engineering. Every other methodology — testing, quality, collaboration — depends on agents having the right information in their context window at the right time, without exceeding their token budget.
An agent with too little context produces hallucinated or irrelevant output. An agent with too much context loses focus, forgets instructions, and burns tokens processing information it does not need. The optimal amount of context is the minimum necessary to complete the current task correctly.
This is harder than it sounds. Current LLMs (as of early 2026) offer context windows ranging from 128K to 1M tokens, but research consistently shows that performance degrades well before the window fills. Attention concentrates on the beginning and end of the input — information in the middle gets unreliable processing. This is called "context rot," and it means that a 200K context window does not give you 200K tokens of useful capacity.
Token Budget Architecture¶
Every agent task in GE operates within a token budget. The budget is not a single number — it is an allocation across multiple categories:
Budget Categories¶
| Category | Typical Allocation | Purpose |
|---|---|---|
| System prompt (identity) | 1,200-2,500 tokens | Who the agent is, how it behaves |
| Constitutional principles | ~800 tokens | Shared rules all agents follow |
| Task specification | 500-3,000 tokens | What to do, acceptance criteria |
| JIT knowledge | 1,000-5,000 tokens | Relevant learnings from wiki brain |
| Working context | Remainder | Code, conversation, tool output |
| Output reserve | 4,000-8,000 tokens | Space for the agent's response |
The working context is what remains after all fixed allocations. For a 200K context window, the fixed allocations total roughly 10,000-20,000 tokens, leaving over 180,000 tokens on paper. In practice, because of context rot, the effective working budget is closer to 30,000-50,000 tokens before quality starts to degrade.
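The remainder arithmetic can be sketched as a simple allocation check. The category ceilings below mirror the table above; the 200K `WINDOW` is an assumed model size, not a GE constant.

```python
# Sketch of the budget split described above. The numbers are the
# table's upper-bound allocations; WINDOW assumes a 200K-token model.
WINDOW = 200_000

FIXED_ALLOCATIONS = {
    "system_prompt": 2_500,   # identity
    "constitution": 800,      # shared principles
    "task_spec": 3_000,       # what to do, acceptance criteria
    "jit_knowledge": 5_000,   # wiki-brain learnings
    "output_reserve": 8_000,  # space for the response
}

def working_context_budget(window: int = WINDOW) -> int:
    """Tokens left for code, conversation, and tool output."""
    return window - sum(FIXED_ALLOCATIONS.values())

print(working_context_budget())  # 180700
```

Note that this nominal remainder is far larger than the 30,000-50,000 tokens that stay useful in practice once context rot sets in.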
Budget Monitoring¶
CHECK: Before injecting additional context into an agent's session.
IF: The total context would exceed 60% of the model's window
THEN: Summarize or prune existing context before adding more.
IF: The agent has been running for more than 40 turns
THEN: Performance is likely degrading. Consider restarting with a fresh context and a summary of progress.
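A minimal sketch of this check, assuming a session tracker that knows tokens used and turn count; the function names and thresholds-as-constants are illustrative:

```python
# Hypothetical monitor for the CHECK above: prune when the window
# passes 60% full, restart after 40 turns.
PRUNE_THRESHOLD = 0.60
MAX_TURNS = 40

def should_prune(tokens_used: int, window: int) -> bool:
    """True when additional context should wait for a prune/summarize pass."""
    return tokens_used / window > PRUNE_THRESHOLD

def should_restart(turn_count: int) -> bool:
    """True when the session is old enough that restarting is safer."""
    return turn_count > MAX_TURNS

# A 200K-window session at 130K tokens and turn 12:
assert should_prune(130_000, 200_000)  # 65% > 60%: prune before injecting
assert not should_restart(12)          # still within the turn budget
```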
The Three-Tier Identity System¶
GE agents have three tiers of identity, each serving a different purpose and consuming a different amount of the token budget.
Tier 1: CORE Identity (~1,200 tokens)¶
The CORE identity is what the agent always carries. It answers three questions:
- Who am I? Name, role, team assignment.
- How do I behave? Communication style, decision-making approach.
- What are my boundaries? What I must never do.
The CORE identity is deliberately small. It must fit in every context window, alongside every task, without consuming meaningful budget. If the CORE identity exceeds 1,500 tokens, it contains information that belongs in the ROLE tier.
RULE: The CORE identity contains behavior, not knowledge. It defines how the agent operates, not what it knows.
Example of CORE content:
- "You are Antje, Senior QA Engineer on Team Alfa."
- "You write tests from specifications, never from implementation code."
- "You never modify production code directly."
- "You escalate to Koen when deterministic quality gates are needed."
Example of what does NOT belong in CORE:
- Detailed testing framework documentation
- History of previous test suites
- Technical knowledge about specific domains
Tier 2: ROLE Identity (~2,500 tokens)¶
The ROLE identity provides task-relevant detail about the agent's function. It includes:
- Detailed responsibilities and workflows
- Specific tools and technologies the agent uses
- Interaction protocols with other agents
- Decision-making frameworks for common situations
The ROLE identity is loaded when an agent starts a task. It is medium-sized because it needs to provide enough detail for the agent to operate independently, but not so much that it crowds out working context.
RULE: The ROLE identity contains function-specific knowledge that is always relevant to the agent's work. Domain-specific knowledge that is only sometimes relevant goes in the wiki brain.
Tier 3: REFERENCE Identity (~3,500 tokens)¶
The REFERENCE identity is the complete agent profile. It includes everything in CORE and ROLE, plus:
- Personality traits and communication nuances
- Detailed background and expertise areas
- Relationships with other agents
- Historical context about how the role evolved
The REFERENCE identity is used during commissioning (when the agent is being set up or aligned) and during deep alignment reviews. It is too large for routine task injection — loading it in full would consume context budget that is better spent on task-specific information.
Why Three Tiers?¶
A single identity file creates a dilemma: make it short and lose behavioral detail, or make it long and waste context on every task. The three-tier system resolves this:
- CORE is always loaded (behavior guardrails)
- ROLE is loaded for tasks (functional knowledge)
- REFERENCE is loaded for commissioning (full alignment)
This mirrors how human teams work. A developer does not re-read their job description before every task. They carry their core habits (CORE), apply their skill set (ROLE), and only revisit their full role definition (REFERENCE) during performance reviews or role changes.
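The tier-loading rules can be sketched as a lookup from session phase to identity files. The file paths and `phase` values are hypothetical, chosen only to illustrate the three tiers:

```python
# Illustrative mapping from session phase to the identity tiers loaded.
# CORE is always present; ROLE adds task detail; REFERENCE is the full
# profile used only during commissioning and alignment reviews.
TIER_FILES = {
    "CORE": "identity/core.md",            # ~1,200 tokens, always loaded
    "ROLE": "identity/role.md",            # ~2,500 tokens, per task
    "REFERENCE": "identity/reference.md",  # ~3,500 tokens, commissioning
}

def tiers_for(phase: str) -> list[str]:
    """Which tiers to inject for a given session phase."""
    if phase == "task":
        return ["CORE", "ROLE"]
    if phase == "commissioning":
        return ["CORE", "ROLE", "REFERENCE"]
    return ["CORE"]  # any other interaction: behavior guardrails only

assert tiers_for("task") == ["CORE", "ROLE"]
```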
JIT Knowledge Injection¶
JIT (Just-In-Time) knowledge injection is GE's solution to the "everything preloaded" anti-pattern. Instead of giving agents all possible knowledge at boot time, GE injects only the knowledge relevant to the current task.
How It Works¶
- An agent receives a task (e.g., "implement Redis stream consumer for work package routing").
- Before the agent begins, the system queries the wiki brain for relevant learnings.
- Search terms are derived from the task description: "Redis," "stream," "consumer," "routing," "work package."
- Matching wiki pages are ranked by relevance and recency.
- The top results are injected into the agent's context as JIT learnings.
- The agent begins work with institutional knowledge about Redis streams, known pitfalls, and established patterns.
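The six steps above can be sketched roughly as follows. The in-memory wiki "index", stopword list, and tag-overlap scoring are stand-in assumptions for a real wiki-brain search:

```python
# Sketch of the JIT pipeline. The wiki "index" is a plain list of
# (title, tags, body) pages here; a real wiki brain would use search.
STOPWORDS = {"implement", "for", "the", "a", "of"}

def derive_terms(task: str) -> set[str]:
    """Step 3: derive search terms from the task description."""
    return {w.strip(",.").lower() for w in task.split()} - STOPWORDS

def rank_pages(pages, terms):
    """Step 4: rank pages by term overlap (recency tie-break omitted)."""
    scored = [(len(terms & tags), title, body) for title, tags, body in pages]
    return [p for p in sorted(scored, reverse=True) if p[0] > 0]

wiki = [
    ("Redis Streams pitfalls", {"redis", "stream", "maxlen"}, "..."),
    ("K8s networking", {"kubernetes", "dns"}, "..."),
]
terms = derive_terms("implement Redis stream consumer for work package routing")
top = rank_pages(wiki, terms)
# Only the Redis page scores above zero; the Kubernetes page is excluded.
```

Steps 5-6 would then format the top results as JIT learnings and prepend them to the agent's task context.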
What Gets Injected¶
| Source | Content | Priority |
|---|---|---|
| Pitfalls | Known failure modes in the task domain | Highest |
| Standards | Coding conventions and patterns | High |
| Contracts | Interface definitions the task touches | High |
| Learnings | Lessons from previous sessions | Medium |
| Handoff | Current state of related work | Medium |
What Does NOT Get Injected¶
- Learnings from unrelated domains (if the task is about Redis, do not inject Kubernetes networking pitfalls)
- Historical discussions that have been superseded by newer decisions
- Full wiki pages when a summary would suffice
- Agent identity information for other agents (causes identity bleed)
Budget Control¶
JIT injection has a hard budget: 5,000 tokens maximum. If the relevant learnings exceed this budget, they are ranked and truncated. The system prefers:
- Pitfalls over learnings (preventing known mistakes is more valuable than optimization tips)
- Recent over old (newer learnings reflect the current state of the codebase)
- Specific over general (a learning about "Redis Streams MAXLEN" is more useful than a general "Redis best practices" page)
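A sketch of this ranking-and-truncation pass, assuming each candidate carries a source category, a recency stamp, and a precomputed token count (all field names illustrative):

```python
# Sketch of the 5,000-token JIT cap: sort by source priority, then
# recency, and greedily pack items until the budget is exhausted.
JIT_BUDGET = 5_000
PRIORITY = ["pitfalls", "standards", "contracts", "learnings", "handoff"]

def select_jit(candidates: list[dict]) -> list[dict]:
    """Pack highest-priority, most recent items into the budget."""
    order = sorted(
        candidates,
        key=lambda c: (PRIORITY.index(c["source"]), -c["recency"]),
    )
    chosen, used = [], 0
    for item in order:
        if used + item["tokens"] <= JIT_BUDGET:
            chosen.append(item)
            used += item["tokens"]
    return chosen

picked = select_jit([
    {"source": "learnings", "tokens": 2_000, "recency": 10},
    {"source": "pitfalls", "tokens": 2_500, "recency": 5},
    {"source": "standards", "tokens": 3_000, "recency": 8},
])
# Pitfalls claim the budget first (2,500); the 3,000-token standards
# page no longer fits, but the smaller learnings item still does.
```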
Context Window Saturation Symptoms¶
When an agent's context window is saturated, behavior degrades in predictable ways. Recognizing these symptoms is critical for operators.
Early Symptoms (60-75% capacity)¶
- Instruction drift. The agent begins ignoring instructions from the system prompt, particularly instructions in the middle of the prompt.
- Repetition. The agent repeats actions it has already taken, suggesting it has lost track of what it has done.
- Reduced precision. Output becomes more generic, less specific to the task at hand.
Late Symptoms (75-90% capacity)¶
- Hallucinated context. The agent references files, functions, or conversations that do not exist in its context.
- Role confusion. The agent begins adopting behaviors from other agents whose output appears in its context.
- Contradictions. The agent produces output that contradicts its own earlier statements within the same session.
- Instruction amnesia. The agent forgets critical constraints (e.g., "do not modify production code") and violates them.
Terminal Symptoms (>90% capacity)¶
- Truncation damage. The model's output is cut off mid-response, losing work.
- Complete role abandonment. The agent stops behaving according to its identity entirely.
- Confabulation. The agent confidently describes actions it did not take and results it did not achieve.
Mitigation¶
CHECK: If any early symptom appears.
IF: The session has been running for more than 30 turns
THEN: Summarize the session so far and restart with fresh context.
IF: The agent is processing large tool outputs (>10,000 tokens per call)
THEN: Truncate or summarize tool output before it enters the context.
IF: The agent is in a multi-turn conversation with many prior messages
THEN: Compress history, keeping only the most recent 5-10 exchanges in full and summarizing the rest.
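The history-compression mitigation can be sketched as follows. The `summarize` stub stands in for a model call, and the keep-8 window is one point inside the 5-10 exchange range given above:

```python
# Sketch of history compression: keep the newest exchanges verbatim and
# collapse everything older into one summary message.
KEEP_FULL = 8  # within the 5-10 exchange range from the text

def summarize(messages: list[str]) -> str:
    """Stub: a real system would ask a model for a structured summary."""
    return f"[summary of {len(messages)} earlier messages]"

def compress_history(messages: list[str]) -> list[str]:
    if len(messages) <= KEEP_FULL:
        return messages
    older, recent = messages[:-KEEP_FULL], messages[-KEEP_FULL:]
    return [summarize(older)] + recent

history = [f"msg {i}" for i in range(30)]
compressed = compress_history(history)
# 30 messages become 1 summary plus the 8 most recent, verbatim.
```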
Structuring Information for LLM Consumption¶
Not all text formats are equally effective for LLMs. GE uses a specific format — called "agentic format" — designed to minimize ambiguity when processed by language models.
Agentic Format Principles¶
1. Imperative over descriptive.
Bad: "The system generally prefers upserts over insert-then-update patterns."
Good: "RULE: Use upsert (ON CONFLICT DO UPDATE). Never use select-then-insert."
2. Conditional logic as CHECK/IF/THEN blocks.
Bad: "When you encounter a failing test, you should think about whether the test or the code is wrong."
Good:
CHECK: When a test fails after implementation.
IF: The test was derived from the formal specification
THEN: The implementation is wrong. Fix the code, not the test.
IF: The test was written without a specification
THEN: Escalate to Antje for specification alignment.
3. Anti-patterns explicitly named.
Bad: "Be careful with file watchers."
Good: "ANTI_PATTERN: File watchers (chokidar, fs.watch) in production. Caused $100/hr token burn via feedback loop. NEVER re-enable."
4. Concrete over abstract.
Bad: "Use appropriate error handling."
Good: "RULE: Every Redis XADD call includes MAXLEN ~100 (per-agent) or ~1000 (system). Unbounded streams caused memory exhaustion on 2026-02-12."
The Decision: Identity vs Wiki vs Task Context¶
A common question: where should a piece of information live?
| Information Type | Location | Rationale |
|---|---|---|
| Behavioral rules (always apply) | Identity (CORE) | Must be in every context window |
| Functional knowledge (role-specific) | Identity (ROLE) | Relevant to all tasks this agent performs |
| Domain knowledge (sometimes relevant) | Wiki brain (JIT injected) | Only loaded when the task domain matches |
| Task-specific data (one task only) | Task context | Injected with the specific task, discarded after |
| Historical decisions | Wiki brain | Searchable, not preloaded |
| Pitfalls and failure modes | Wiki brain (high-priority JIT) | Injected whenever the domain matches |
RULE: If you are unsure where information belongs, default to the wiki brain. Information in the wiki is searchable and injectable. Information in the identity file is loaded every time, even when irrelevant.
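The table reduces to a small routing decision. The question flags and location labels below are simplifications for illustration, not a GE API:

```python
# Illustrative router for the decision table above. Each flag is a
# yes/no reading of the table's "Information Type" column.
def route(always_applies: bool, role_specific: bool, single_task: bool) -> str:
    """Decide where a piece of information should live."""
    if always_applies:
        return "identity:CORE"   # must be in every context window
    if role_specific:
        return "identity:ROLE"   # relevant to all of this agent's tasks
    if single_task:
        return "task-context"    # injected with the task, then discarded
    return "wiki-brain"          # searchable and JIT-injectable default

assert route(True, False, False) == "identity:CORE"
assert route(False, False, False) == "wiki-brain"
```

The fall-through to the wiki brain encodes the RULE above: when unsure, prefer the searchable location over the always-loaded one.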
Context Compression Strategies¶
When context grows too large, compression is necessary. GE uses these strategies in order of preference:
1. Selective Loading¶
Load only the relevant parts of large files. If the task involves a specific function, load that function and its immediate dependencies, not the entire file.
2. Summary Replacement¶
Replace detailed conversation history with structured summaries. A 50-message conversation can often be summarized in 500 tokens without losing decision-relevant information.
3. Tool Output Truncation¶
Tool outputs (especially from file reads and searches) can be enormous. Truncate to the relevant sections before they enter the context window.
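A minimal truncation helper, approximating token counts with whitespace-split words (a real system would use the model's tokenizer); the 10,000 limit echoes the mitigation threshold above:

```python
# Sketch of tool-output truncation. Token counting is approximated by
# whitespace-split words; swap in the model's tokenizer for accuracy.
MAX_TOOL_TOKENS = 10_000

def truncate_tool_output(text: str, limit: int = MAX_TOOL_TOKENS) -> str:
    """Clip oversized tool output and note how much was dropped."""
    words = text.split()
    if len(words) <= limit:
        return text
    head = " ".join(words[:limit])
    return head + f"\n[... truncated {len(words) - limit} tokens ...]"

big = "word " * 12_000
clipped = truncate_tool_output(big)
# Keeps the first 10,000 words and marks the 2,000 that were dropped.
```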
4. Progressive Disclosure¶
Start with high-level information. Only load detail when the agent specifically needs it. This is the JIT principle applied within a single task.
5. Session Restart¶
When all else fails, end the current session and start a new one with a clean context. Include a structured summary of what was accomplished and what remains. This is expensive (it loses in-context learning) but it is better than continuing with a degraded context window.
Metrics¶
GE tracks the following context management metrics:
| Metric | Target | Alert Threshold |
|---|---|---|
| Identity token count (CORE) | <1,500 | >2,000 |
| JIT injection size | <5,000 tokens | >8,000 tokens |
| Session length (turns) | <40 | >60 |
| Context utilization at task end | <60% | >75% |
| Instruction compliance rate | >95% | <90% |
These metrics are tracked through PTY capture and session analysis. Degradation in instruction compliance is the strongest signal that context management has failed.