Test-Driven Development in GE

Why TDD Is Mandatory

RULE: Every feature in GE begins with a failing test. No exceptions.
RATIONALE: In a 60-agent system, code that "looks right" is worthless. Only code that provably satisfies a specification is shippable.

GE is not a team of humans who can rely on shared context, hallway conversations, or institutional memory to catch mistakes. GE is a fleet of LLM-powered agents. Each agent operates in isolation, with a limited context window and no persistent memory beyond what the wiki and specifications provide. Without TDD, agent-generated code drifts from requirements silently and confidently.

CHECK: Before any implementation work begins, ask:
IF: A formal specification exists for this feature
THEN: Tests MUST be derived from that spec by Antje before any developer writes a single line of implementation
IF: No formal specification exists
THEN: STOP. Escalate to Anna. No spec means no tests means no implementation.
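The gate above can be sketched as a small decision function. This is an illustration only: `Feature`, `gate_implementation`, and the returned action strings are hypothetical names chosen for this example, not part of GE's tooling.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    """Hypothetical record for a requested feature (illustration only)."""
    name: str
    has_formal_spec: bool


def gate_implementation(feature: Feature) -> str:
    """Return the next action required by the spec-first gate."""
    if feature.has_formal_spec:
        # A spec exists: Antje derives tests before any implementation starts.
        return "antje_derives_tests"
    # No spec: stop and escalate to Anna. No tests, no implementation.
    return "stop_escalate_to_anna"
```

Under these assumptions, a feature without a spec can never reach a developer agent, because the only non-escalation branch leads to test derivation.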

Spec-First, Not Test-First

Most TDD literature describes "test-first" development: the developer writes a test, then writes code to pass it. GE goes further.

RULE: GE practices spec-first development. Tests are derived from Anna's formal specification, not from developer intuition.

The distinction matters:

Approach	Test Source	Risk
Test-first (classic TDD)	Developer's mental model	Developer encodes their assumptions, not the requirement
Spec-first (GE TDD)	Anna's formal specification	Tests encode the verified requirement, independent of implementation bias

In classic TDD, the developer who writes the test also writes the code. Test and code share the same mental model, the same blind spots, the same misunderstandings. In GE, the three roles are deliberately separated:

Anna produces the formal specification (what the system MUST do)
Antje derives tests from the specification (how to verify it)
Developers write code to pass the tests (how to implement it)

This three-way separation eliminates the single-brain bias that undermines conventional TDD.

The Red-Green-Refactor Cycle

GE follows the standard TDD cycle, adapted for multi-agent execution:

Phase 1: RED — Write Failing Tests

OWNER: Antje
INPUT: Anna's formal specification
OUTPUT: Test suite where every test fails (no implementation exists yet)

CHECK: Every test assertion maps to a specific invariant, edge case, or post-condition from the formal spec.
IF: A test cannot be traced to a spec element
THEN: The test is speculative and MUST be flagged for review.
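A minimal sketch of this traceability check follows. It assumes each test carries a reference to the spec element it verifies. The `DerivedTest` record, the `find_speculative_tests` helper, and identifiers such as `INV-1` are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass
class DerivedTest:
    """Hypothetical record linking one test to the spec element it verifies."""
    name: str
    spec_ref: Optional[str]  # e.g. "INV-1"; None when no spec element is cited


def find_speculative_tests(tests: Iterable[DerivedTest], spec_ids: Set[str]) -> List[str]:
    """Return names of tests that cannot be traced to a known spec element."""
    return [t.name for t in tests if t.spec_ref is None or t.spec_ref not in spec_ids]


# Hypothetical suite and spec element IDs, for illustration only.
suite = [
    DerivedTest("test_balance_never_negative", "INV-1"),
    DerivedTest("test_empty_cart_total_is_zero", "EDGE-2"),
    DerivedTest("test_handles_emoji_names", None),
]
flagged = find_speculative_tests(suite, {"INV-1", "EDGE-2", "POST-1"})
```

Here `flagged` contains only the test with no spec reference, which is the one that would be sent for review.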

ANTI_PATTERN: Writing tests that pass immediately.
FIX: If a test passes before implementation, either the test is wrong or the feature already exists. Investigate both.
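One way to detect this anti-pattern is to run every new check against a stub that has no behavior and flag any check that still passes. The sketch below assumes checks are plain functions that receive the function under test. All names (`stub_apply_discount`, `check_vacuous`, `passing_before_implementation`) are hypothetical.

```python
from typing import Callable, List


def stub_apply_discount(price_cents: int, percent: int) -> int:
    """Placeholder: no implementation exists during the RED phase."""
    raise NotImplementedError


def check_ten_percent_off(apply_discount: Callable[[int, int], int]) -> None:
    assert apply_discount(10_000, 10) == 9_000


def check_vacuous(apply_discount: Callable[[int, int], int]) -> None:
    # Anti-pattern: never exercises the function, so it passes immediately.
    assert True


def passing_before_implementation(checks, stub) -> List[str]:
    """Return names of checks that pass even though nothing is implemented."""
    suspicious = []
    for check in checks:
        try:
            check(stub)
        except (AssertionError, NotImplementedError):
            continue  # expected in RED: the check fails
        suspicious.append(check.__name__)
    return suspicious
```

Running `passing_before_implementation([check_ten_percent_off, check_vacuous], stub_apply_discount)` flags only `check_vacuous`, which would then be investigated as either a wrong test or an already existing feature.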

Phase 2: GREEN — Make Tests Pass

OWNER: Developer agents (team-specific)
INPUT: Failing test suite from Antje
OUTPUT: Minimal implementation that makes all tests pass

RULE: Write the minimum code to pass the tests. Nothing more.
RATIONALE: LLMs are prone to over-engineering. A constrained test suite forces focused implementation.

CHECK: After implementation, run full test suite.
IF: All tests pass
THEN: Proceed to refactor.
IF: Any test fails
THEN: Fix implementation. Do NOT modify the test without Antje's approval.
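The GREEN phase can be illustrated with a deliberately small implementation and a helper that turns suite results into the next action. This is a sketch under stated assumptions: `apply_discount`, the two checks, `run_suite`, and `green_phase_next_action` are hypothetical names, and a real suite would run under whatever test runner the team uses.

```python
from typing import Callable, Dict, List, Tuple


def apply_discount(price_cents: int, percent: int) -> int:
    """Minimal implementation: only what the checks below require."""
    return price_cents * (100 - percent) // 100


def check_ten_percent_off() -> None:
    assert apply_discount(10_000, 10) == 9_000


def check_zero_percent_is_identity() -> None:
    assert apply_discount(8_000, 0) == 8_000


def run_suite(checks: List[Callable[[], None]]) -> Dict[str, bool]:
    """Run each check and record whether it passed."""
    results = {}
    for check in checks:
        try:
            check()
            results[check.__name__] = True
        except AssertionError:
            results[check.__name__] = False
    return results


def green_phase_next_action(results: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Map suite results to the next step; tests themselves are never edited here."""
    failing = sorted(name for name, passed in results.items() if not passed)
    if failing:
        return "fix_implementation", failing
    return "proceed_to_refactor", []
```

Note that the failing branch only ever points back at the implementation. Changing a test is not an available action in this sketch, which mirrors the rule that test changes need Antje's approval.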

Phase 3: REFACTOR — Clean Without Breaking

OWNER: Developer agents + Koen (quality gate)
INPUT: Passing implementation
OUTPUT: Clean, idiomatic code that still passes all tests

RULE: Refactoring must not change behavior. Tests must still pass after every refactor step.
RULE: Koen's deterministic quality gates (lint, typecheck, dead code analysis) run during this phase.
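As a small illustration of a behavior-preserving refactor, the sketch below keeps the spec-derived cases fixed and checks that the code satisfies them both before and after the cleanup. The functions `order_total_before`, `order_total_after`, and `behavior_preserved`, and the cases themselves, are hypothetical. Koen's lint and typecheck gates are not modeled here.

```python
from typing import Callable, List, Tuple

Item = Tuple[int, int]  # (unit price in cents, quantity)


def order_total_before(items: List[Item]) -> int:
    """Passing but clumsy implementation from the GREEN phase."""
    total = 0
    for i in range(len(items)):
        total = total + items[i][0] * items[i][1]
    return total


def order_total_after(items: List[Item]) -> int:
    """Refactored version: same behavior, more idiomatic."""
    return sum(price * qty for price, qty in items)


# Hypothetical spec-derived cases: (input, expected total in cents).
SPEC_CASES = [
    ([], 0),
    ([(250, 2)], 500),
    ([(100, 1), (50, 3)], 250),
]


def behavior_preserved(before: Callable, after: Callable, cases) -> bool:
    """True only if both versions satisfy every spec-derived case."""
    return all(
        before(arg) == expected and after(arg) == expected
        for arg, expected in cases
    )
```

In practice the same check would run after every refactor step, so a regression is caught at the step that introduced it.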

Why TDD Matters More for Agentic Development

Published research suggests that LLMs generate higher-quality code when tests exist first. A 2024 arXiv study on TDD for code generation found that including test information "significantly bolsters the performance of code generation systems" and "not only elevates the accuracy of generated code but also enhances its alignment with specified requirements."

This aligns with GE's experience:

Tests constrain the output space. Without tests, an LLM can choose among an effectively unbounded set of plausible implementations. With tests, the search narrows to implementations that satisfy the specified behavior.
Tests provide immediate feedback. An agent can run tests after each code change and know instantly whether it's on track.
Tests contain hallucination. A feature that an LLM invents cannot count as delivered unless the test suite confirms it.
Tests are the specification language. For an LLM, executable tests are more precise than natural language requirements.
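To illustrate the last point, consider a natural-language requirement such as "usernames are normalized". That sentence leaves open whether whitespace is trimmed, whether case is folded, and whether the operation is idempotent. The sketch below pins those questions down with executable checks. The function `normalize_username` and the chosen behavior are hypothetical examples, not GE requirements.

```python
def normalize_username(raw: str) -> str:
    """Hypothetical function under test: trim whitespace and lower-case."""
    return raw.strip().lower()


def check_normalization_examples() -> None:
    # Each assertion removes one ambiguity from the prose requirement.
    assert normalize_username("  Alice ") == "alice"  # surrounding whitespace is trimmed
    assert normalize_username("BOB") == "bob"  # case is folded to lower-case
    once = normalize_username(" Carol")
    assert normalize_username(once) == once  # applying it twice changes nothing


check_normalization_examples()
```

An agent reading these checks has no room to pick a different interpretation of "normalized" and still pass.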

GE's TDD Pipeline — Overview

Aimee (Scope)
  → Anna (Formal Specification)
    → Antje (Test Generation from Spec)
      → Developers (Implement to Pass Tests)
        → Koen (Deterministic Quality Gates)
          → Marije/Judith (Integration/E2E Testing)
            → Jasper (TDD vs Post-Impl Reconciliation)

Each handoff is documented in detail in workflow.md. Domain-specific patterns are in patterns.md. Agentic-specific considerations are in agentic-tdd.md. Common mistakes are in pitfalls.md.
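The pipeline order can also be held as plain data, so that the next handoff is looked up rather than remembered. This is a sketch only: the owner names and stage labels come from the overview above, while `PIPELINE` and `next_owner` are hypothetical names.

```python
from typing import List, Optional, Tuple

# (owner, responsibility), in handoff order, as listed in the overview.
PIPELINE: List[Tuple[str, str]] = [
    ("Aimee", "Scope"),
    ("Anna", "Formal Specification"),
    ("Antje", "Test Generation from Spec"),
    ("Developers", "Implement to Pass Tests"),
    ("Koen", "Deterministic Quality Gates"),
    ("Marije/Judith", "Integration/E2E Testing"),
    ("Jasper", "TDD vs Post-Impl Reconciliation"),
]


def next_owner(current: str) -> Optional[str]:
    """Return who receives the handoff after `current`, or None at the end."""
    owners = [owner for owner, _ in PIPELINE]
    idx = owners.index(current)  # raises ValueError for an unknown owner
    return owners[idx + 1] if idx + 1 < len(owners) else None
```

For example, `next_owner("Anna")` returns `"Antje"`, and `next_owner("Jasper")` returns `None` because reconciliation is the final stage.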

Core Principles

1. Tests Are Derived from Specs, Not from Code

RULE: Tests must be written by reading the specification, never by reading the implementation.
RATIONALE: If you read the code first, you test what it does, not what it should do.

2. Test and Implementation Agents Must Be Different

RULE: Antje writes tests. Developer agents write code. They never share context about each other's work.
RATIONALE: Oracle independence. The test oracle must be independent of the system under test.

3. A Failing Test Is a Feature Request

Every failing test is a clear, unambiguous statement of work for a developer agent. The test defines what "done" means.

4. Tests Are Not Optional Documentation

RULE: Tests are executable contracts. They run in CI. They block deployment. They are never "just documentation."

5. Test Modification Requires Spec Change

IF: A developer agent needs to change a test to make their code pass
THEN: The developer MUST NOT change the test. They escalate to Antje.
IF: Antje agrees the test is wrong
THEN: Antje traces back to Anna's spec to verify whether the spec or the test is incorrect.
IF: The spec is incorrect
THEN: Anna revises the spec, Antje regenerates the test.

This chain of custody ensures that requirements flow in one direction: spec → test → code.
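The escalation chain can be sketched as a function that returns the ordered steps for a disputed test. Two branches are assumptions added for completeness and are not stated in the rules above: what happens when Antje does not agree the test is wrong, and what happens when the spec turns out to be correct. All names are hypothetical.

```python
from typing import List


def resolve_test_dispute(antje_agrees_test_is_wrong: bool, spec_is_incorrect: bool) -> List[str]:
    """Return the ordered steps for a test that a developer believes is wrong."""
    # The developer never edits the test; the dispute always starts with Antje.
    steps = ["developer_escalates_to_antje"]
    if not antje_agrees_test_is_wrong:
        # ASSUMPTION (not stated in the source): the test stands, code must change.
        steps.append("developer_fixes_implementation")
        return steps
    steps.append("antje_traces_to_spec")
    if spec_is_incorrect:
        steps += ["anna_revises_spec", "antje_regenerates_test"]
    else:
        # ASSUMPTION (not stated in the source): the spec is right, so Antje
        # corrects the test against the unchanged spec.
        steps.append("antje_corrects_test_against_spec")
    return steps
```

In every branch the change originates upstream of the code, which keeps the direction spec → test → code intact.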

TDD Workflow — Full pipeline with decision trees
TDD Patterns — Domain-specific testing patterns
Agentic TDD — AI-specific TDD considerations
TDD Pitfalls — Common mistakes and anti-patterns
Formal Specification — Anna's specification methodology
Testing Standards — Enforcement rules
Anna Agent — Specification agent details