
TDD Pitfalls

How to Use This Page

Each pitfall follows the same structure:

  • ANTI_PATTERN: What the mistake looks like
  • WHY IT HAPPENS: Root cause, especially in agentic context
  • DETECTION: How to catch it
  • FIX: How to correct it

RULE: If you recognize any of these patterns in your own work, STOP and fix before proceeding.


Pitfall 1: Testing Implementation, Not Behavior

ANTI_PATTERN: Tests that verify HOW code works instead of WHAT it does.

// BAD: Tests internal state
it('sets the _discountApplied flag to true', () => {
  const order = new Order(150);
  order.calculateTotal();
  expect(order._discountApplied).toBe(true);
});

// GOOD: Tests observable behavior
it('applies 10% discount for orders over $100', () => {
  const order = new Order(150);
  expect(order.calculateTotal()).toBe(163.35); // 150 * 0.9 * 1.21
});

WHY IT HAPPENS: LLMs are trained on code that often includes implementation-detail tests. When generating tests, they mimic this pattern.

DETECTION: Koen flags tests that reference private fields, internal methods, or implementation-specific data structures.

FIX: Every test assertion must correspond to a spec element (invariant, edge case, pre/post-condition). If the assertion cannot be traced to the spec, it is testing implementation.


Pitfall 2: Brittle Tests

ANTI_PATTERN: Tests that break when implementation changes but behavior stays the same.

// BAD: Tied to specific HTML structure
expect(container.querySelector('div.invoice-row > span.amount')).toHaveTextContent('$100');

// GOOD: Uses semantic queries
expect(screen.getByRole('cell', { name: '$100.00' })).toBeInTheDocument();

WHY IT HAPPENS: LLMs generate precise CSS selectors and DOM queries because they match training data patterns. They do not distinguish between stable and unstable selectors.

DETECTION: Tests that break on refactoring without behavior change. Jasper catches these during reconciliation.

FIX: Use semantic queries (roles, labels, test IDs) not structural queries (CSS selectors, DOM hierarchy).


Pitfall 3: Over-Mocking

ANTI_PATTERN: Mocking so much that the test verifies nothing real.

// BAD: Everything is mocked — what is this even testing?
jest.mock('./database');
jest.mock('./redis');
jest.mock('./auth');
jest.mock('./validation');

it('creates an invoice', async () => {
  const result = await createInvoice(mockData);
  expect(database.insert).toHaveBeenCalledWith(mockData);
});

WHY IT HAPPENS: LLMs default to mocking because mocked tests are simpler to generate and always pass. Mocking eliminates the hard part — real integration.

DETECTION: Koen checks mock-to-assertion ratio. If a test has more mock setup lines than assertion lines, it is likely over-mocked.

FIX: Mock only external boundaries (third-party APIs, time, randomness). Never mock your own database, your own Redis, your own auth layer.

RULE: GE's integration tests use real PostgreSQL, real Redis, real network paths. See Testing Standards.


Pitfall 4: Testing Trivial Code

ANTI_PATTERN: Writing tests for getters, setters, simple assignments, or framework boilerplate.

// BAD: Tests that a getter returns the value
it('returns the name', () => {
  const user = new User('Alice');
  expect(user.getName()).toBe('Alice');
});

// BAD: Tests that a constant is constant
it('has the correct API version', () => {
  expect(API_VERSION).toBe('v2');
});

WHY IT HAPPENS: LLMs optimize for test count. More tests look like better coverage. Trivial tests inflate the count without adding safety.

DETECTION: Koen's dead code analysis flags tests with no meaningful assertions. Mutation testing identifies tests where no mutation of the tested code can cause failure.

FIX: Only test behavior that could be wrong. If the code is a direct assignment with no logic, it does not need a test.

RULE: Test code with logic, branching, calculations, or side effects. Do not test code that is a simple passthrough.


Pitfall 5: Test Coupling

ANTI_PATTERN: Tests that depend on execution order or shared state.

// BAD: Test 2 depends on test 1's side effect
it('creates a user', async () => {
  await createUser({ name: 'Alice' });
});

it('finds the created user', async () => {
  const user = await findUser('Alice'); // Depends on test above
  expect(user).toBeDefined();
});

WHY IT HAPPENS: LLMs generate tests sequentially and naturally carry context forward. They treat the test file as a narrative rather than independent specifications.

DETECTION: Run tests in random order. If any test fails, it was coupled.

FIX: Each test sets up its own state and tears it down. Use beforeEach for shared setup, not previous tests.

// GOOD: Each test is independent
it('finds a user by name', async () => {
  await createUser({ name: 'Alice' }); // Own setup
  const user = await findUser('Alice');
  expect(user).toBeDefined();
});

Pitfall 6: Generating Tests and Code Together (LLM-Specific)

ANTI_PATTERN: An LLM agent generates both the test file and the implementation file in the same session.

WHY IT HAPPENS: It is faster. The agent can see both files and make them match perfectly. The temptation is strong.

WHY IT IS DANGEROUS: This defeats the entire purpose of TDD. The test and code share the same mental model (context window). If the LLM misunderstands the requirement, BOTH the test and code will encode the misunderstanding. All tests pass. The code is wrong.

DETECTION: Git history. If the test file and implementation file are committed in the same commit by the same agent, oracle independence is violated.

FIX: Antje writes tests in session A. Developer writes code in session B. Different agents, different sessions, different context windows. No exceptions.

RULE: This is the most important pitfall in agentic TDD. If you remember only one thing from this page, remember this.


Pitfall 7: Tests That Pass by Coincidence (LLM-Specific)

ANTI_PATTERN: Tests that pass for the wrong reason.

// The test passes because the function happens to return the right value
// for THIS specific input, but the logic is wrong for all other inputs
it('calculates tax', () => {
  expect(calculateTax(100)).toBe(21); // 100 * 0.21 = 21
});
// But the implementation is: return input - 79; // Works for 100, fails for everything else

WHY IT HAPPENS: LLMs can generate "shortcut" implementations that satisfy specific test cases without implementing general logic. This is especially common when test cases use round numbers.

DETECTION: Mutation testing. Property-based testing. Testing with non-trivial inputs (not just 0, 1, 100).

FIX: Include diverse test inputs including non-round numbers, large values, and boundary values. Use parameterized tests with many inputs.

// GOOD: Multiple diverse inputs make shortcuts impossible
it.each([
  [100, 21],
  [157.50, 33.075],
  [0.01, 0.0021],
  [99999.99, 20999.9979],
])('calculates 21%% tax on %s as %s', (amount, expected) => {
  expect(calculateTax(amount)).toBeCloseTo(expected, 4);
});

Pitfall 8: Assertion-Free Tests (LLM-Specific)

ANTI_PATTERN: Tests that execute code but assert nothing.

// BAD: No assertion — test passes if the function doesn't throw
it('processes the order', async () => {
  await processOrder(mockOrder);
});

WHY IT HAPPENS: LLMs sometimes generate "smoke tests" that verify a function can be called without crashing. This looks like a test but proves nothing about correctness.

DETECTION: Lint rule: every it() block must contain at least one expect(). Koen enforces this.
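If the project uses eslint-plugin-jest, this lint rule can be enforced with its expect-expect rule; a minimal config fragment (assuming the plugin is installed):

```json
{
  "plugins": ["jest"],
  "rules": {
    "jest/expect-expect": "error"
  }
}
```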

FIX: Every test must have at least one assertion that verifies a specific spec element.

// GOOD: Explicit assertions
it('processes the order and sets status to confirmed', async () => {
  const result = await processOrder(mockOrder);
  expect(result.status).toBe('confirmed');
  expect(result.processedAt).toBeInstanceOf(Date);
});

Pitfall 9: Snapshot Overuse

ANTI_PATTERN: Using snapshot tests as the primary testing strategy.

// BAD: Snapshot test — approves whatever the code produces
it('renders correctly', () => {
  const tree = render(<InvoiceList invoices={mockInvoices} />);
  expect(tree).toMatchSnapshot();
});

WHY IT HAPPENS: Snapshots are trivially easy for LLMs to generate. One line tests the entire output. But snapshots test "the output has not changed" rather than "the output is correct."

DETECTION: Count of snapshot tests vs behavioral tests. If snapshots dominate, the test suite is weak.

FIX: Use snapshots sparingly — only for large serialized outputs where manual assertions are impractical. For everything else, write specific assertions.


Pitfall 10: Testing the Happy Path Only

ANTI_PATTERN: All tests verify successful behavior. No tests verify error handling, edge cases, or boundary conditions.

WHY IT HAPPENS: LLMs are optimistic code generators. They produce the "golden path" and stop. Error handling is an afterthought.

DETECTION: Jasper checks spec edge cases against test coverage during reconciliation. Anna's specs include error conditions — if no tests map to them, there is a gap.

FIX: For every happy-path test, write at least:

  • One invalid input test
  • One authorization failure test
  • One boundary condition test
  • One error propagation test

RULE: At least 30% of tests should be negative tests (testing what SHOULD fail).


Pitfall 11: Flaky Tests

ANTI_PATTERN: Tests that sometimes pass and sometimes fail without code changes.

Common causes in agentic development:

  • Time-dependent assertions (expect(timestamp).toBe(now))
  • Unordered collection comparisons (expect(results).toEqual([a, b]) when order is not guaranteed)
  • Network-dependent tests without retry/timeout
  • Race conditions in async code

DETECTION: Run test suite 5 times. If any test has inconsistent results, it is flaky.

FIX:

  • Use toBeCloseTo for time comparisons
  • Use toContainEqual or sort before comparing collections
  • Use waitFor with timeouts for async assertions
  • Isolate test environments to prevent cross-test interference

RULE: A flaky test is worse than no test. It erodes trust in the entire test suite. Fix or delete.


Pitfall Summary

#   Pitfall                              Severity  Who Catches It
1   Testing implementation not behavior  HIGH      Koen, Jasper
2   Brittle tests                        MEDIUM    Jasper
3   Over-mocking                         HIGH      Koen
4   Testing trivial code                 LOW       Koen
5   Test coupling                        MEDIUM    Random-order run
6   Tests + code in same session         CRITICAL  Git history audit
7   Passes by coincidence                HIGH      Mutation testing
8   Assertion-free tests                 HIGH      Lint rule
9   Snapshot overuse                     MEDIUM    Jasper
10  Happy path only                      HIGH      Jasper, spec coverage
11  Flaky tests                          HIGH      CI stability