Skip to content

Agentic TDD — How AI Agents Do TDD Differently

Why Agentic TDD Is Not Just TDD

Traditional TDD was designed for human developers. A human writes a test, then writes code to pass it. The human's judgment, intuition, and understanding of the domain guide both activities.

LLM-powered agents operate differently. They have no persistent intuition. They have no domain understanding beyond what is in their context window. They generate code by predicting the most likely next token, not by reasoning from first principles. This changes TDD in fundamental ways.

RULE: Every TDD principle must be re-evaluated through the lens of "what happens when the test writer and code writer are both LLMs?"


Tests Constrain the Output Space

The single most important insight for agentic TDD:

An LLM generates higher quality code when tests exist first.

Research from 2024 (arXiv: "Test-Driven Development for Code Generation") demonstrated that including test information "significantly bolsters the performance of code generation systems" across MBPP, HumanEval, and CodeChef benchmarks. The inclusion of tests "not only elevates the accuracy of generated code but also enhances its alignment with specified requirements."

Why this works:

  1. Without tests, an LLM has an infinite space of valid completions. Many are plausible but wrong.
  2. With tests, the LLM can evaluate its output against concrete assertions. The search space collapses.
  3. Tests act as a reward signal during iterative code generation — the agent can run tests after each attempt.

CHECK: Is the developer agent generating code with tests already present? IF: Yes — tests constrain the output space, quality is higher IF: No — the agent is generating unconstrained code. This is "vibe coding." STOP.

ANTI_PATTERN: Generating code first, then writing tests to match the code. FIX: This is backwards. The test must exist before the code. Always.


Test Suites as Specification Language

For an LLM, executable tests are a more precise specification than natural language.

Consider two ways to specify a function:

Natural language: "Create a function that calculates the total price of an order, applying a 10% discount for orders over $100, and adding 21% VAT."

Test suite:

expect(calculateTotal(90)).toBe(108.90);   // 90 * 1.21 = 108.90
expect(calculateTotal(100)).toBe(121.00);  // 100 * 1.21 = 121.00
expect(calculateTotal(101)).toBe(109.89);  // 101 * 0.9 * 1.21 = 109.989 → 109.89
expect(calculateTotal(0)).toBe(0);
expect(() => calculateTotal(-1)).toThrow();

The test suite is unambiguous. It specifies: - The exact rounding behavior (2 decimal places, round down) - Whether the threshold is inclusive or exclusive (>100, not >=100) - The order of operations (discount first, then VAT) - Edge cases (zero, negative)

An LLM reading the natural language spec might get the threshold wrong (>= vs >), the rounding wrong, or the operation order wrong. An LLM reading the tests cannot get these wrong — the tests enforce correctness.

RULE: Anna's formal spec is the source of truth for WHAT the system does. Antje's tests are the executable encoding of that truth. Together they are more precise than either alone.


Oracle Independence — Why Antje and Developers MUST Be Different Agents

The Problem

If the same LLM writes both the test and the code, a dangerous failure mode emerges: the LLM encodes the same misunderstanding in both artifacts.

Example: - Spec says: "Discount applies to orders over $100" - LLM interprets "over" as >= 100 (incorrect — should be > 100) - LLM writes test: expect(calculateTotal(100)).toBe(108.90) (discount applied at 100) - LLM writes code: if (amount >= 100) applyDiscount() (matches the test) - All tests pass. The code is wrong. Nobody catches it.

The Solution: Oracle Independence

In testing theory, the "test oracle" is the mechanism that determines whether output is correct. The oracle MUST be independent of the system under test.

RULE: Antje writes tests from Anna's specification. Developer agents write code from Antje's tests. These are different agents with different context windows, different sessions, and no shared state.

This separation ensures: 1. Antje's interpretation of the spec is independent of the developer's interpretation 2. If both arrive at the same behavior, confidence is high 3. If they disagree (test fails), the disagreement surfaces immediately

CHECK: Are the test-writing agent and the code-writing agent the same agent? IF: Yes — oracle independence is violated. This is a CRITICAL VIOLATION. THEN: Reassign. Tests and code MUST come from different agents.

CHECK: Did the developer agent read the test code before writing the implementation? IF: Yes — this is expected and correct. The developer reads the tests as their spec. IF: The developer is also reading Antje's reasoning or internal notes THEN: This leaks oracle context. The developer should only see the test assertions, not Antje's interpretation notes.


Mutation Testing Catches LLM-Specific Failures

The "Passes by Coincidence" Problem

LLMs are excellent at pattern matching. They can generate code that passes tests without actually implementing the correct logic. Common failure modes:

  1. Hardcoded return values — The LLM notices the test expects 108.90 and returns 108.90 as a literal
  2. Lookup tables — The LLM encodes the test inputs/outputs as a map instead of computing them
  3. Shallow pattern matching — The code passes the specific test cases but fails on any other input

How Mutation Testing Catches This

Mutation testing introduces small changes to the code (mutations) and checks whether tests detect them. If a mutation does NOT cause a test to fail, the test suite has a gap.

Common mutations: - Change > to >= or < - Change + to - - Remove a conditional branch - Replace a constant with zero - Remove a function call

CHECK: Did Koen run mutation testing on the implementation? IF: Surviving mutants exist in business logic THEN: Escalate to Antje — the test suite does not adequately constrain the implementation

Meta's research on mutation-guided test generation (2025) demonstrated that LLM-based mutation testing at scale is practical. Their ACH system generated 9,095 mutants across 10,795 classes and produced 571 actionable test cases. The key innovation: using LLMs to generate semantically meaningful mutants (not random changes) and to detect equivalent mutants automatically.

RULE: Mutation testing is MANDATORY for all business logic in GE. RATIONALE: Without mutation testing, an LLM can generate "test-passing but semantically wrong" code that looks correct in CI but fails in production.

GE's Mutation Testing Flow

Koen runs mutation testing on developer's code
  → Surviving mutants identified
    → Antje reviews each surviving mutant
      CHECK: Does the mutation violate a spec requirement?
        IF: Yes — Antje writes a test that kills it
        IF: No — mutant is equivalent (semantically identical), document and accept

LLM-Specific Test Quality Metrics

Beyond standard test quality metrics, GE tracks metrics specific to agentic development:

1. Specification Coverage Ratio

SCR = (spec elements with corresponding tests) / (total spec elements)

TARGET: 100%. No spec element without a test. OWNER: Jasper measures this during reconciliation (Stage 6).

2. Mutation Kill Rate

MKR = (killed mutants) / (total non-equivalent mutants)

TARGET: > 90% for business logic. > 80% for utility code. OWNER: Koen measures this during quality gating (Stage 4).

3. Oracle Independence Score

OIS = 1 - (shared context tokens between test agent and dev agent) / (total context tokens)

TARGET: > 95%. Test and dev agents should share almost no context beyond the test file itself. OWNER: Platform team monitors this.

4. First-Pass Rate

FPR = (tests that pass on developer's first implementation attempt) / (total tests)

TARGET: 70-90%. If FPR is 100%, tests may be too easy. If FPR is < 50%, specs may be too ambiguous. OWNER: Jasper tracks this across projects.


The Agent Context Window Problem

LLMs have finite context windows. This affects TDD in specific ways:

Problem: Large Test Suites Exceed Context

IF: A test file exceeds 500 lines THEN: Split into focused test modules, one per spec function RATIONALE: The developer agent needs to hold the test + implementation in context simultaneously

Problem: Developer Loses Track of Which Tests Remain

IF: Developer has passed 20 of 30 tests and loses context on the remaining 10 THEN: Run the test suite, capture failing test names, re-inject as context FIX: Developer agents should run tests frequently (after every function) to maintain orientation

Problem: Antje Generates Redundant Tests

IF: Multiple tests assert the same behavior with different inputs THEN: Use parameterized tests (it.each) to reduce test count without reducing coverage RATIONALE: Fewer tests = smaller context window usage = better agent performance


Research Foundation

GE's agentic TDD practices are grounded in current research:

  • Test-Driven Development for Code Generation (arXiv, 2024): Tests improve LLM code generation accuracy across multiple benchmarks. TDD is "a better development model than just using a problem statement."

  • Meta's Mutation-Guided Test Generation (FSE 2025, EuroSTAR 2025): LLM-based mutation testing is practical at scale. ACH generated 9,095 mutants and 571 actionable tests across 10,795 classes.

  • Spec-Driven Development (arXiv, 2025): "An AI agent is like a brilliant intern who never says 'I don't understand.' If your requirements are fuzzy, it will confidently generate a clean, professional-looking solution that could be catastrophically wrong."

  • FlowGen Multi-Agent TDD (Peking University, 2025): Agent architectures that simulate TDD with separate "developer" and "tester" roles show measurable quality improvements: 18.9% improvement in pass@1 and logical error rate reduced from 38.2% to 19.3%.

  • MuTAP (2024): Augmenting LLM prompts with surviving mutants from mutation testing detects up to 28% more faulty code snippets, achieving 93.57% mutation score.


Summary of Agentic TDD Rules

Rule Rationale
Tests before code, always Constrains LLM output space
Test writer != code writer Oracle independence
Tests from spec, not intuition Prevents shared blind spots
Mutation testing mandatory Catches "passes by coincidence"
Split large test files Context window management
Run tests frequently during dev Maintains agent orientation
No shared context between Antje and devs Prevents oracle contamination