DOMAIN:TESTING — PITFALLS

OWNER: marije, judith
ALSO_USED_BY: all testing agents, all developers
UPDATED: 2026-03-24
SCOPE: testing anti-patterns for all GE client projects


CRITICAL_PITFALLS

These are the most damaging testing mistakes. Each has caused real problems in production systems.
Every testing agent must know these by heart.


PITFALL_1: TESTING_IMPLEMENTATION_DETAILS

WHAT: tests that break when IMPLEMENTATION changes but BEHAVIOR doesn't.
WHY_BAD: blocks refactoring, provides no real safety, creates maintenance burden.
FREQUENCY: extremely common, especially in LLM-generated tests.

EXAMPLES

BAD: testing internal state

it('adds item to cart', () => {
  const cart = new Cart();
  cart.addItem({ id: '1', name: 'Widget', price: 10 });

  // BAD — testing internal data structure
  expect(cart._items[0].id).toBe('1');
  expect(cart._items.length).toBe(1);
});

GOOD: testing observable behavior

it('adds item to cart', () => {
  const cart = new Cart();
  cart.addItem({ id: '1', name: 'Widget', price: 10 });

  // GOOD — testing what the user sees
  expect(cart.getItemCount()).toBe(1);
  expect(cart.getTotal()).toBe(10);
  expect(cart.containsItem('1')).toBe(true);
});

BAD: testing specific function calls

it('saves user', async () => {
  const mockDb = vi.fn();
  const service = new UserService(mockDb);

  await service.createUser({ name: 'Test' });

  // BAD — testing HOW it saves, not WHAT it saves
  expect(mockDb).toHaveBeenCalledWith(
    'INSERT INTO users (name) VALUES ($1)',
    ['Test']
  );
});

GOOD: testing the result

it('saves user', async () => {
  const service = new UserService(testDb);

  const user = await service.createUser({ name: 'Test' });

  // GOOD — testing what was created
  expect(user.name).toBe('Test');
  expect(user.id).toBeDefined();

  // Verify it persisted
  const found = await service.getUser(user.id);
  expect(found.name).toBe('Test');
});

DETECTION: if refactoring a function (same inputs/outputs, different internal logic) breaks tests,
those tests are testing implementation details.
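
The detection rule can be made concrete. Below is a hedged sketch (hypothetical CartV1/CartV2, not from any real codebase) of a behavior-preserving refactor: internal storage changes from an array to a Map, so tests that reach into _items break, while behavior-level tests stay green for both versions.

```typescript
// Hypothetical refactor: CartV1 stores items in an array, CartV2 in a Map.
// Behavior assertions (getItemCount, getTotal) hold for both versions;
// assertions on the internal _items array would survive only CartV1.
interface Item { id: string; name: string; price: number }

class CartV1 {
  private items: Item[] = [];
  addItem(item: Item): void { this.items.push(item); }
  getItemCount(): number { return this.items.length; }
  getTotal(): number { return this.items.reduce((sum, i) => sum + i.price, 0); }
}

class CartV2 {
  private items = new Map<string, Item>();
  addItem(item: Item): void { this.items.set(item.id, item); }
  getItemCount(): number { return this.items.size; }
  getTotal(): number {
    let total = 0;
    for (const i of this.items.values()) total += i.price;
    return total;
  }
}

// The same behavior checks pass against both implementations:
for (const cart of [new CartV1(), new CartV2()]) {
  cart.addItem({ id: '1', name: 'Widget', price: 10 });
  console.log(cart.getItemCount(), cart.getTotal());  // 1 10, for both
}
```

A behavior-level suite written against CartV1 certifies CartV2 for free; that is the safety net refactoring needs.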


PITFALL_2: MOCKING_EVERYTHING

WHAT: tests that mock so many dependencies that they test nothing real.
WHY_BAD: mocked tests pass even when the real system is broken.

EXAMPLE

// BAD — mocking everything, testing nothing
it('processes order', async () => {
  const mockDb = vi.fn().mockResolvedValue({ id: '1' });
  const mockPayment = vi.fn().mockResolvedValue({ success: true });
  const mockEmail = vi.fn().mockResolvedValue(true);
  const mockInventory = vi.fn().mockResolvedValue(true);

  const service = new OrderService(mockDb, mockPayment, mockEmail, mockInventory);
  const result = await service.processOrder({ itemId: '1', quantity: 1 });

  expect(result.success).toBe(true);
  // This test tells you NOTHING — you mocked the entire universe
  // If the real DB, payment, email, or inventory API changes, this test still passes
});

// GOOD — mock external boundaries, use real internals
it('processes order', async () => {
  // Real DB (test database)
  // Real service logic
  // Mock ONLY external payment API (we don't own it)
  const mockPayment = vi.fn().mockResolvedValue({ success: true, txId: 'TX123' });

  const service = new OrderService(testDb, mockPayment, emailService, inventoryService);
  const result = await service.processOrder({ itemId: '1', quantity: 1 });

  expect(result.success).toBe(true);
  expect(result.paymentTxId).toBe('TX123');

  // Verify real DB state
  const [order] = await testDb.select().from(orders).where(eq(orders.id, result.id));
  expect(order).toBeDefined();
  expect(order.status).toBe('completed');
});

RULE: mock what you DON'T own (external APIs, payment gateways).
RULE: use REAL implementations for what you DO own (DB, services, business logic).
RULE: if your test has more mocks than assertions, rethink the test.


PITFALL_3: FLAKY_TESTS

WHAT: tests that sometimes pass and sometimes fail with no code changes.
WHY_BAD: erode trust in the test suite. Team starts ignoring failures. Actual bugs slip through.

COMMON_CAUSES

CAUSE_1: TIMING_DEPENDENCIES

// BAD — depends on timing
it('shows toast after save', async () => {
  await page.click('#save');
  await page.waitForTimeout(500);  // Maybe 500ms isn't enough
  expect(await page.isVisible('.toast')).toBe(true);
});

// GOOD — wait for the actual element
it('shows toast after save', async ({ page }) => {
  await page.getByRole('button', { name: 'Save' }).click();
  await expect(page.getByRole('alert')).toBeVisible();  // Auto-waits
});

CAUSE_2: SHARED_STATE

// BAD — depends on other tests' state
it('counts users', async () => {
  const [{ count }] = await db.select({ count: sql`count(*)` }).from(users);
  expect(Number(count)).toBe(3);  // Breaks if another test adds users
});

// GOOD — controls its own state
it('counts users', async () => {
  await cleanTestDb(db);
  await db.insert(users).values([{ name: 'A' }, { name: 'B' }, { name: 'C' }]);
  const [{ count }] = await db.select({ count: sql`count(*)` }).from(users);
  expect(Number(count)).toBe(3);
});

CAUSE_3: RANDOM_DATA_WITHOUT_SEEDS

// BAD — random data with no seed, different every run
it('sorts users', () => {
  const users = Array.from({ length: 10 }, () => ({
    name: Math.random().toString(),
  }));
  // No stable expected value exists for random input, so any assertion
  // here might pass or fail depending on the generated data
});

// GOOD — deterministic or seeded
it('sorts users', () => {
  const users = [
    { name: 'Charlie' },
    { name: 'Alice' },
    { name: 'Bob' },
  ];
  expect(sortUsers(users)).toEqual([
    { name: 'Alice' },
    { name: 'Bob' },
    { name: 'Charlie' },
  ]);
});

CAUSE_4: DATE/TIME_DEPENDENCIES

// BAD — depends on current time
it('shows greeting', () => {
  const greeting = getGreeting();
  expect(greeting).toBe('Good morning');  // Fails after noon
});

// GOOD — inject time
it('shows morning greeting before noon', () => {
  vi.useFakeTimers();
  vi.setSystemTime(new Date('2026-01-15T09:00:00'));
  try {
    expect(getGreeting()).toBe('Good morning');
  } finally {
    vi.useRealTimers();  // always restore, even when the assertion throws
  }
});

POLICY: flaky test detected → IMMEDIATELY quarantine (skip with reason) → fix within 48 hours → un-skip.
NEVER leave a flaky test running — it poisons the entire suite.
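
A minimal sketch of the quarantine step, assuming Vitest's it.skip (which accepts the same arguments as it). The local stand-in for `it` exists only so this snippet runs on its own; the date, owner, and ticket reference are placeholders.

```typescript
// Stand-in for Vitest's `it`, only so this sketch is self-contained.
// In a real suite, use the `it` provided by Vitest and delete this block.
const it = {
  skip: (name: string, _fn: () => void | Promise<void>): void => {
    console.log(`skipped: ${name}`);
  },
};

// QUARANTINED 2026-03-24: flaky on CI, toast sometimes renders late.
// Owner: <placeholder>. Ticket: <placeholder>. Fix within 48h, then un-skip.
it.skip('shows toast after save', async () => {
  // original body left unchanged, so the eventual fix is verified in place
});
```

The skip reason lives next to the test, so anyone reading the suite sees why it is quarantined and when it must come back.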


PITFALL_4: TEST_THEATER

WHAT: tests that CANNOT fail, providing false confidence.
WHY_BAD: they count toward coverage metrics but catch zero bugs.

FORMS_OF_TEST_THEATER

FORM_1: TAUTOLOGICAL_TESTS

// BAD — always passes, tests nothing
it('returns data', async () => {
  const result = await fetchData();
  expect(result).toBeDefined();  // Even null is defined. Even an error object is defined.
});

FORM_2: TESTS_THAT_TEST_THE_MOCK

// BAD — testing the mock, not the code
vi.mock('./api', () => ({
  fetchUsers: vi.fn().mockResolvedValue([{ id: 1 }]),
}));

it('fetches users', async () => {
  const users = await fetchUsers();
  expect(users).toEqual([{ id: 1 }]);  // Of course — that's what the mock returns!
});

FORM_3: EMPTY_TESTS

// BAD — test body does nothing meaningful
it('handles error', async () => {
  try {
    await riskyOperation();
  } catch (e) {
    // Test "passes" whether it throws or not
  }
});

FORM_4: ALWAYS_TRUE_ASSERTIONS

// BAD — assertion is always true regardless of behavior
it('validates input', () => {
  const result = validate('anything');
  expect(typeof result).toBe('object');  // Even error objects are objects
});

DETECTION: run mutation testing (Koen). If mutation score is low but coverage is high = test theater.
DETECTION: invert an assertion. If the test still passes, it's theater.
DETECTION: delete the function under test. If the test still passes, it's theater.
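
The first detection heuristic applied to FORM_1, as a sketch with a hypothetical broken fetchData. Plain booleans stand in for expect calls so the point is visible without a test runner: the theater assertion is satisfied even by the broken result, while a real assertion is not.

```typescript
// Hypothetical broken implementation: returns an error object, not data.
function fetchData(): unknown {
  return { error: 'service unavailable' };
}

const result = fetchData();

// The theater assertion from FORM_1 (`expect(result).toBeDefined()`):
console.log(result !== undefined);   // true even for the broken result

// A real assertion (e.g. `expect(Array.isArray(result)).toBe(true)`):
console.log(Array.isArray(result));  // false, so the bug is caught
```

If inverting an assertion cannot make the test fail against a broken implementation, the assertion is theater.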


PITFALL_5: SNAPSHOT_ABUSE

WHAT: over-reliance on snapshot tests, especially for complex objects.
WHY_BAD: reviewers approve snapshot updates blindly. Snapshots with unstable fields break constantly.

WHEN_SNAPSHOTS_ARE_BAD

BAD: large objects with many fields — reviewer won't catch subtle changes
BAD: objects with timestamps, IDs, or random values — break on every run
BAD: as the ONLY test for a behavior — no human verified the snapshot is correct
BAD: for API responses — use toMatchObject with the fields that matter

WHEN_SNAPSHOTS_ARE_OK

OK: small inline snapshots for serialized output
OK: visual regression screenshots (with proper diff tooling)
OK: configuration files (schema validation)
OK: as SUPPLEMENTARY to specific assertions

FIX_PATTERN

// BAD — snapshot everything
it('creates user', async () => {
  const user = await createUser({ name: 'Test' });
  expect(user).toMatchSnapshot();  // What fields matter? Who knows.
});

// GOOD — specific assertions
it('creates user', async () => {
  const user = await createUser({ name: 'Test' });
  expect(user).toMatchObject({
    name: 'Test',
    status: 'active',
  });
  expect(user.id).toMatch(/^usr_/);
  expect(user.createdAt).toBeInstanceOf(Date);
});


PITFALL_6: TESTING_LIBRARY_INTERNALS

WHAT: testing the behavior of libraries/frameworks instead of your own code.
WHY_BAD: you're testing someone else's code. If it breaks, it's THEIR problem.

// BAD — testing that Zod validates correctly (that's Zod's job)
it('rejects invalid email', () => {
  const schema = z.string().email();
  expect(schema.safeParse('not-email').success).toBe(false);
});

// GOOD — testing that YOUR code uses Zod correctly
it('rejects user with invalid email', async () => {
  const result = await createUser({ name: 'Test', email: 'not-email' });
  expect(result.error).toContain('email');
});

RULE: trust that libraries work. Test YOUR integration with them, not their internals.
EXCEPTION: if you find a library bug, write a regression test with a comment explaining why.
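
A sketch of the EXCEPTION. parseAmount stands in for a hypothetical third-party function with a known upstream bug; the library name, version, and issue reference are placeholders, not a real API. The comment is the important part: it tells a future reader why a library's behavior is pinned by a test.

```typescript
// Stand-in for a hypothetical third-party parser; imagine the body is
// a library call rather than this local function.
function parseAmount(s: string): number {
  return Number(s);
}

// REGRESSION: <library>@<version> returns NaN for comma-decimal input
// like "10,5" instead of parsing it. Upstream issue: <placeholder>.
// This test pins the behavior we currently work around; remove it once
// the dependency is upgraded past the fix.
const pinned = parseAmount('10,5');
console.log(Number.isNaN(pinned));  // true while the bug exists
```

When the pinned behavior changes, this test fails, which is the signal that the workaround can be removed.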


PITFALL_7: LLM_SPECIFIC — TAUTOLOGICAL_TEST_GENERATION

WHAT: LLMs generating tests that mirror the implementation.
WHY_BAD: if the code has a bug, the LLM-generated test has the same bug.
THIS IS THE #1 LLM TESTING RISK.

THE_PROBLEM

When an LLM reads implementation code and generates tests, it often:
1. Copies the logic into the test (calculating expected values the same way)
2. Tests the CURRENT behavior, not the CORRECT behavior
3. Generates tests that pass by construction, not by verification

EXAMPLE

Implementation:

function calculateTax(price: number): number {
  return price * 0.19;  // BUG: should be 0.21 (Netherlands VAT)
}

LLM-generated test (BAD):

it('calculates tax', () => {
  expect(calculateTax(100)).toBe(19);  // LLM copied the wrong rate!
});

Human test (GOOD):

it('calculates 21% Dutch VAT', () => {
  expect(calculateTax(100)).toBe(21);  // Expected from domain knowledge
});

GE_DEFENSE

This is precisely why GE runs TWO independent test-writing phases, then reconciles them:
1. Antje writes TDD tests from the SPEC (before code exists) — cannot copy implementation
2. Marije/Judith write post-impl tests from BEHAVIOR — may mirror implementation
3. Jasper reconciles the two suites — if Antje says 21 and Marije says 19, the bug is found

ADDITIONAL_DEFENSE: Koen's mutation testing catches tautological tests.
If mutating 0.19 to 0.20 doesn't break any test, the test is tautological.

RULE_FOR_POST_IMPL_AGENTS: derive expected values from the SPEC, not from running the code.
RULE_FOR_POST_IMPL_AGENTS: if you're unsure of the expected value, CHECK THE SPEC first.
RULE_FOR_ALL_AGENTS: never use code output as expected test value without independent verification.


PITFALL_8: NOT_TESTING_ERROR_PATHS

WHAT: testing only the happy path, ignoring errors, edge cases, and failure modes.
WHY_BAD: production errors happen on the SAD path. That's where bugs actually hurt.

THE_PATTERN

// Typical: 5 happy path tests, 0 error tests
describe('createUser', () => {
  it('creates with valid data', () => { /* ... */ });
  it('creates with all fields', () => { /* ... */ });
  it('creates with minimum fields', () => { /* ... */ });
  it('creates and returns ID', () => { /* ... */ });
  it('creates with correct timestamp', () => { /* ... */ });
  // WHERE ARE THE ERROR TESTS?
});

WHAT_TO_ADD

describe('createUser error handling', () => {
  it('rejects missing required fields', () => { /* ... */ });
  it('rejects invalid email format', () => { /* ... */ });
  it('rejects duplicate email', () => { /* ... */ });
  it('handles database connection failure', () => { /* ... */ });
  it('handles database timeout', () => { /* ... */ });
  it('rolls back on partial failure', () => { /* ... */ });
  it('returns meaningful error messages', () => { /* ... */ });
  it('does not leak internal errors to client', () => { /* ... */ });
});

RULE: for every happy path test, write at least ONE error path test.
RULE: test what happens when dependencies fail (DB down, API timeout, disk full).
RULE: test validation of EVERY input field — invalid, missing, wrong type, too long, malicious.


PITFALL_9: TEST_INTERDEPENDENCE

WHAT: tests that depend on execution order or shared mutable state.
WHY_BAD: tests pass in sequence but fail in parallel or random order. Nightmare to debug.

// BAD — test 2 depends on test 1
let sharedUser: User;

it('creates a user', async () => {
  sharedUser = await createUser({ name: 'Test' });
  expect(sharedUser.id).toBeDefined();
});

it('updates the user', async () => {
  // FAILS if test 1 didn't run first
  await updateUser(sharedUser.id, { name: 'Updated' });
});

// GOOD — each test is self-contained
it('creates a user', async () => {
  const user = await createUser({ name: 'Test' });
  expect(user.id).toBeDefined();
});

it('updates a user', async () => {
  const user = await createUser({ name: 'Test' });  // Create its own user
  const updated = await updateUser(user.id, { name: 'Updated' });
  expect(updated.name).toBe('Updated');
});

RULE: every test sets up its own preconditions.
RULE: every test tears down its own state (or uses automatic cleanup).
RULE: tests must pass when run in ANY order, including reverse.


PITFALL_10: OVER_SPECIFYING_TESTS

WHAT: testing TOO MANY details in a single assertion, making tests brittle.
WHY_BAD: test fails for irrelevant reasons, hard to determine WHICH behavior broke.

// BAD — over-specified
it('creates user response', async () => {
  const res = await app.request('/api/users', {
    method: 'POST',
    body: JSON.stringify({ name: 'Test', email: 'test@example.com' }),
    headers: { 'Content-Type': 'application/json' },
  });

  expect(res.status).toBe(201);
  expect(await res.json()).toEqual({
    id: expect.any(String),
    name: 'Test',
    email: 'test@example.com',
    createdAt: expect.any(String),
    updatedAt: expect.any(String),
    role: 'user',
    avatar: null,
    bio: null,
    lastLoginAt: null,
    preferences: {},
    // If ANY field is added/removed/renamed, this test breaks
  });
});

// GOOD — test what matters
it('creates user with correct data', async () => {
  const res = await app.request('/api/users', {
    method: 'POST',
    body: JSON.stringify({ name: 'Test', email: 'test@example.com' }),
    headers: { 'Content-Type': 'application/json' },
  });

  expect(res.status).toBe(201);

  const body = await res.json();
  expect(body).toMatchObject({
    name: 'Test',
    email: 'test@example.com',
  });
  expect(body.id).toBeDefined();
});

RULE: test the MINIMUM needed to verify the behavior.
RULE: use toMatchObject for partial matching.
RULE: each test checks ONE behavior — if it checks two, split it.


PITFALL_CHECKLIST

Before committing any test, verify:

  • [ ] Does this test verify BEHAVIOR, not implementation?
  • [ ] Can this test FAIL? (Try inverting an assertion mentally.)
  • [ ] Does this test run in isolation? (No dependency on other tests.)
  • [ ] Are expected values derived from SPEC, not from code?
  • [ ] Is there at least one error path test per happy path test?
  • [ ] Are mocks limited to external boundaries only?
  • [ ] Are there no timing dependencies (waitForTimeout, sleep)?
  • [ ] Does the test name describe the BEHAVIOR, not the code?
  • [ ] Would a refactoring that preserves behavior keep this test green?
  • [ ] Does mutation testing confirm this test catches real bugs?

CROSS_REFERENCES

THOUGHT_LEADERS: domains/testing/thought-leaders.md — philosophy behind these rules
VITEST: domains/testing/vitest-patterns.md — correct patterns to use instead
TDD: domains/testing/tdd-methodology.md — TDD-specific pitfall prevention
MUTATION: domains/testing/mutation-testing.md — detecting test theater with Stryker
RECONCILIATION: domains/testing/test-reconciliation.md — catching gaps between test suites