DOMAIN:TESTING — CALIBRATION_EXAMPLES

OWNER: marije, judith
ALSO_USED_BY: antje (TDD cross-reference), jasper (reconciliation input)
UPDATED: 2026-03-24
SCOPE: calibration examples for test quality evaluation — JIT injected before every evaluation task
PURPOSE: anchor scoring consistency across evaluator agents by showing concrete good/bad examples with detailed breakdowns


HOW_TO_USE_THIS_PAGE

Read these examples BEFORE evaluating any test suite.
They define what each score means in practice.
If your evaluation does not align with these examples, recalibrate.
Scores are out of 10. Anything below 6 blocks the work item.

SCORING_PRINCIPLES:
- A test's purpose is to catch bugs that would reach users
- Tests that cannot fail are worse than no tests (false confidence)
- Tests that break on refactors waste developer time
- Good tests document behavior, not implementation
- Edge cases matter more than happy paths (happy paths rarely hide bugs)


EXAMPLE_1: LOOKS PASSING BUT MISSES CRITICAL EDGE CASE

SCORE: 4/10

CONTEXT

Spec requires: "createInvoice() must reject negative amounts, zero amounts, and amounts exceeding the client's credit limit"
Developer submitted:

describe('createInvoice', () => {
  it('should create an invoice with a valid amount', async () => {
    const invoice = await createInvoice({ clientId: 'c1', amount: 100.00 });
    expect(invoice).toBeDefined();
    expect(invoice.amount).toBe(100.00);
    expect(invoice.status).toBe('draft');
  });

  it('should reject negative amounts', async () => {
    await expect(createInvoice({ clientId: 'c1', amount: -50 }))
      .rejects.toThrow('Invalid amount');
  });

  it('should reject zero amounts', async () => {
    await expect(createInvoice({ clientId: 'c1', amount: 0 }))
      .rejects.toThrow('Invalid amount');
  });
});

WHY_THIS_SCORE

POSITIVE_FACTORS:
- Tests negative amount rejection (+1)
- Tests zero amount rejection (+1)
- Assertions are specific (checks error message, not just "throws") (+1)
- Test structure is clean and readable (+1)

NEGATIVE_FACTORS:
- MISSING: credit limit check — the spec explicitly requires it (-3)
- MISSING: boundary values — what about 0.01? What about Number.MAX_SAFE_INTEGER? (-1)
- MISSING: currency precision — what about 100.999? Floating point edge case (-1)

CRITICAL_MISS: Leaving the credit limit check untested means a business rule violation can ship undetected. A client could generate invoices exceeding their credit limit, causing real financial exposure. This is not a cosmetic gap.

EVALUATOR_ACTION: Return score 4, flag the credit limit gap as BLOCKING, list boundary values as SHOULD_ADD.
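
The gaps listed above can be written as a table of boundary cases. The following is a minimal sketch, not the project's code: classifyAmount() is a hypothetical stand-in for the validation inside createInvoice(), the credit limit of 1000 is invented, and the 'Credit limit exceeded' and 'Invalid precision' outcomes plus the two-decimal rule are assumptions (only 'Invalid amount' appears in the submission).

```typescript
// Hypothetical stand-in for the validation inside createInvoice().
// Only 'Invalid amount' comes from the submitted tests; the other outcomes
// and the two-decimal rule are assumptions made for this sketch.
function classifyAmount(amount: number, creditLimit: number): string {
  if (!Number.isFinite(amount) || amount <= 0) {
    return 'Invalid amount';
  }
  if (amount > creditLimit) {
    return 'Credit limit exceeded';
  }
  // Assumed rule: currency amounts carry at most two decimal places.
  if (Math.round(amount * 100) / 100 !== amount) {
    return 'Invalid precision';
  }
  return 'ok';
}

const creditLimit = 1000; // invented value for the sketch

// One row per gap named in NEGATIVE_FACTORS.
const boundaryCases: Array<{ name: string; amount: number; expected: string }> = [
  { name: 'smallest positive amount', amount: 0.01, expected: 'ok' },
  { name: 'exactly at the credit limit', amount: 1000, expected: 'ok' },
  { name: 'one cent over the credit limit', amount: 1000.01, expected: 'Credit limit exceeded' },
  { name: 'Number.MAX_SAFE_INTEGER', amount: Number.MAX_SAFE_INTEGER, expected: 'Credit limit exceeded' },
  { name: 'sub-cent precision', amount: 100.999, expected: 'Invalid precision' },
];

// Names of the cases whose actual outcome differs from the expected one.
const failures: string[] = boundaryCases
  .filter((c) => classifyAmount(c.amount, creditLimit) !== c.expected)
  .map((c) => c.name);
```

In a Vitest suite the same rows could feed it.each, so each boundary case reports as its own test.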


EXAMPLE_2: BRITTLE TEST THAT TESTS IMPLEMENTATION DETAILS

SCORE: 3/10

CONTEXT

Spec requires: "User search returns matching users sorted by relevance"
Developer submitted:

describe('searchUsers', () => {
  it('should search users correctly', async () => {
    const mockPrisma = {
      user: {
        findMany: vi.fn().mockResolvedValue([
          { id: '1', name: 'Alice Johnson', email: 'alice@example.com' },
          { id: '2', name: 'Bob Johnson', email: 'bob@example.com' },
        ]),
      },
    };

    const result = await searchUsers('Johnson', { prisma: mockPrisma });

    expect(mockPrisma.user.findMany).toHaveBeenCalledWith({
      where: {
        OR: [
          { name: { contains: 'Johnson', mode: 'insensitive' } },
          { email: { contains: 'Johnson', mode: 'insensitive' } },
        ],
      },
      orderBy: { name: 'asc' },
      take: 20,
    });
    expect(result).toHaveLength(2);
  });
});

WHY_THIS_SCORE

POSITIVE_FACTORS:
- Test exists and runs (+1)
- Verifies result count (+1)
- Uses dependency injection for testability (+1)

NEGATIVE_FACTORS:
- Tests the QUERY STRUCTURE, not the BEHAVIOR (-3)
- If developer changes from Prisma contains to full-text search, test breaks even though behavior is identical (-2)
- Mock returns hardcoded data — test never exercises actual sorting logic (-1)
- Spec says "sorted by relevance" but the test asserts orderBy: { name: 'asc' }, which is alphabetical, NOT relevance (-1)

WHY_3_NOT_1: It is not test theater: it does verify something. But what it verifies is an implementation contract with Prisma, not a behavioral contract with the spec. The developer could change how the query is built (for example, by switching to full-text search) and this test would fail despite behavior being correct.

BETTER_APPROACH:

it('should return users matching the search term', async () => {
  await seedUser({ name: 'Alice Johnson' });
  await seedUser({ name: 'Bob Smith' });
  const result = await searchUsers('Johnson');
  expect(result.map(u => u.name)).toContain('Alice Johnson');
  expect(result.map(u => u.name)).not.toContain('Bob Smith');
});

EVALUATOR_ACTION: Return score 3, flag as IMPLEMENTATION_COUPLING, recommend rewrite to test behavior.
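
The spec also requires relevance ordering, which a behavioral test can check on the returned list. The following is a sketch under stated assumptions: searchUsersInMemory() is a hypothetical stand-in for the real search, and the rule "a whole-word match outranks a partial match" is an assumed definition of relevance, since the spec does not define it. The seed names are chosen so that alphabetical order and relevance order disagree.

```typescript
type User = { name: string; email: string };

// Hypothetical in-memory stand-in for searchUsers(). The scoring rule is an
// assumption for this sketch; the real relevance rule must come from the spec.
function searchUsersInMemory(term: string, users: User[]): User[] {
  const t = term.toLowerCase();
  const score = (u: User): number => {
    const name = u.name.toLowerCase();
    if (name.split(' ').includes(t)) return 2; // whole-word match
    if (name.includes(t) || u.email.toLowerCase().includes(t)) return 1; // partial match
    return 0; // no match
  };
  return users
    .map((u) => ({ user: u, relevance: score(u) }))
    .filter((x) => x.relevance > 0)
    .sort((a, b) => b.relevance - a.relevance)
    .map((x) => x.user);
}

// Alphabetical order would put Aaron first; relevance order puts Zoe first.
const seeded: User[] = [
  { name: 'Bob Smith', email: 'bob@example.com' },
  { name: 'Aaron Johnsonville', email: 'aaron@example.com' },
  { name: 'Zoe Johnson', email: 'zoe@example.com' },
];

const rankedNames: string[] = searchUsersInMemory('Johnson', seeded).map((u) => u.name);
```

A test that asserts on rankedNames fails if the implementation sorts alphabetically, and keeps passing if the query mechanism changes but the ranking does not.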


EXAMPLE_3: GOOD INTEGRATION TEST COVERING THE RIGHT BOUNDARIES

SCORE: 9/10

CONTEXT

Spec requires: "Payment webhook handler must process Stripe events idempotently, update order status, and notify the client"
Developer submitted:

describe('Stripe webhook handler', () => {
  let testDb: TestDatabase;

  beforeEach(async () => {
    testDb = await createTestDatabase();
    await testDb.seed({
      orders: [{ id: 'order-1', status: 'pending', amount: 5000 }],
      clients: [{ id: 'client-1', email: 'test@example.com' }],
    });
  });

  afterEach(async () => {
    await testDb.cleanup();
  });

  it('should mark order as paid on payment_intent.succeeded', async () => {
    const event = createStripeEvent('payment_intent.succeeded', {
      metadata: { orderId: 'order-1' },
      amount_received: 5000,
    });

    const response = await handler(event);

    expect(response.status).toBe(200);
    const order = await testDb.orders.findById('order-1');
    expect(order.status).toBe('paid');
    expect(order.paidAt).toBeInstanceOf(Date);
  });

  it('should be idempotent — processing same event twice has no side effects', async () => {
    const event = createStripeEvent('payment_intent.succeeded', {
      metadata: { orderId: 'order-1' },
      amount_received: 5000,
    });

    await handler(event);
    await handler(event);

    const order = await testDb.orders.findById('order-1');
    expect(order.status).toBe('paid');
    const notifications = await testDb.notifications.findByOrderId('order-1');
    expect(notifications).toHaveLength(1); // NOT 2
  });

  it('should reject events with mismatched amounts', async () => {
    const event = createStripeEvent('payment_intent.succeeded', {
      metadata: { orderId: 'order-1' },
      amount_received: 3000, // order expects 5000
    });

    const response = await handler(event);

    expect(response.status).toBe(400);
    const order = await testDb.orders.findById('order-1');
    expect(order.status).toBe('pending'); // unchanged
  });

  it('should handle unknown order IDs gracefully', async () => {
    const event = createStripeEvent('payment_intent.succeeded', {
      metadata: { orderId: 'nonexistent' },
      amount_received: 5000,
    });

    const response = await handler(event);

    expect(response.status).toBe(404);
  });

  it('should send client notification on successful payment', async () => {
    const event = createStripeEvent('payment_intent.succeeded', {
      metadata: { orderId: 'order-1' },
      amount_received: 5000,
    });

    await handler(event);

    const notifications = await testDb.notifications.findByOrderId('order-1');
    expect(notifications).toHaveLength(1);
    expect(notifications[0].recipientEmail).toBe('test@example.com');
    expect(notifications[0].type).toBe('payment_received');
  });
});

WHY_THIS_SCORE

POSITIVE_FACTORS:
- Tests BEHAVIOR across a real boundary (HTTP event → DB state → notification) (+2)
- Idempotency test is exactly right — checks side effect count, not just return code (+2)
- Amount mismatch test catches a real fraud/error scenario (+1)
- Unknown order ID test prevents silent data loss (+1)
- Uses real database, not mocks — catches schema issues (+1)
- Proper setup/teardown prevents test pollution (+1)
- Notification assertion verifies the full chain, not just "was called" (+1)

WHY_NOT_10:
- Missing: what happens if notification service is down? Does the payment still succeed?
- Missing: concurrent webhook delivery (Stripe can send duplicates simultaneously)

EVALUATOR_ACTION: Return score 9, flag notification-service failure handling and concurrent delivery as NICE_TO_HAVE (not blocking).
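
The concurrent delivery gap from WHY_NOT_10 could be covered by delivering the same event twice without waiting in between. This is a rough single-process sketch: handleEvent(), the event shape, and the in-memory stores are all hypothetical, and deduplicating on an event id is an assumption about how the real handler achieves idempotency.

```typescript
type WebhookEvent = { id: string; orderId: string };

// Hypothetical in-memory stores standing in for the database.
const processedEventIds = new Set<string>();
const sentNotifications: string[] = [];

async function handleEvent(event: WebhookEvent): Promise<'processed' | 'duplicate'> {
  // Claim the event id before the first await, so a second delivery that
  // starts while this one is still in flight already sees the claim.
  if (processedEventIds.has(event.id)) {
    return 'duplicate';
  }
  processedEventIds.add(event.id);
  await Promise.resolve(); // yield control, standing in for an awaited database write
  sentNotifications.push(event.orderId);
  return 'processed';
}

// Deliver the same event twice concurrently and report what happened.
async function deliverTwiceConcurrently(): Promise<{ results: string[]; notificationCount: number }> {
  const event: WebhookEvent = { id: 'evt_1', orderId: 'order-1' };
  const results = await Promise.all([handleEvent(event), handleEvent(event)]);
  return { results, notificationCount: sentNotifications.length };
}
```

With several server instances an in-memory set is not enough; the claim would typically need to be enforced by the database, for example through a unique constraint on the event id.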


EXAMPLE_4: E2E TEST THAT CATCHES A REAL USER FLOW BUG

SCORE: 10/10

CONTEXT

Spec requires: "User can complete checkout: add items → enter address → select shipping → pay → receive confirmation"
Developer submitted:

test('complete checkout flow from cart to confirmation', async ({ page }) => {
  // Seed a product and authenticate a user
  const { user, product } = await seedCheckoutData();
  await loginAs(page, user);

  // Add item to cart
  await page.goto(`/products/${product.slug}`);
  await page.getByRole('button', { name: 'Add to cart' }).click();
  await expect(page.getByTestId('cart-count')).toHaveText('1');

  // Go to checkout
  await page.getByRole('link', { name: 'Checkout' }).click();
  await expect(page).toHaveURL('/checkout');

  // Enter shipping address
  await page.getByLabel('Street').fill('123 Main St');
  await page.getByLabel('City').fill('Amsterdam');
  await page.getByLabel('Postal code').fill('1012 AB');
  await page.getByLabel('Country').selectOption('NL');
  await page.getByRole('button', { name: 'Continue to shipping' }).click();

  // Select shipping method — wait for options to load from API
  await expect(page.getByText('Standard delivery')).toBeVisible({ timeout: 5000 });
  await page.getByLabel('Standard delivery (€4.95)').check();
  await page.getByRole('button', { name: 'Continue to payment' }).click();

  // Payment — use Stripe test card
  const stripeFrame = page.frameLocator('iframe[name*="stripe"]');
  await stripeFrame.getByPlaceholder('Card number').fill('4242424242424242');
  await stripeFrame.getByPlaceholder('MM / YY').fill('12/30');
  await stripeFrame.getByPlaceholder('CVC').fill('123');
  await page.getByRole('button', { name: 'Pay now' }).click();

  // Confirmation page
  await expect(page).toHaveURL(/\/orders\/[a-z0-9-]+\/confirmation/);
  await expect(page.getByText('Thank you for your order')).toBeVisible();
  await expect(page.getByText('123 Main St')).toBeVisible();
  await expect(page.getByText('€4.95')).toBeVisible();

  // Verify backend state
  const orders = await getOrdersForUser(user.id);
  expect(orders).toHaveLength(1);
  expect(orders[0].status).toBe('paid');
  expect(orders[0].shippingAddress.city).toBe('Amsterdam');
});

test('checkout preserves cart across page refresh', async ({ page }) => {
  const { user, product } = await seedCheckoutData();
  await loginAs(page, user);

  await page.goto(`/products/${product.slug}`);
  await page.getByRole('button', { name: 'Add to cart' }).click();

  // User refreshes mid-checkout
  await page.goto('/checkout');
  await page.reload();

  // Cart should survive
  await expect(page.getByTestId('cart-item')).toHaveCount(1);
  await expect(page.getByText(product.name)).toBeVisible();
});

test('checkout shows error on expired card', async ({ page }) => {
  const { user, product } = await seedCheckoutData();
  await loginAs(page, user);

  await page.goto(`/products/${product.slug}`);
  await page.getByRole('button', { name: 'Add to cart' }).click();
  await page.getByRole('link', { name: 'Checkout' }).click();

  // Fill address and shipping
  await fillShippingAddress(page, { city: 'Amsterdam', postal: '1012 AB' });
  await selectShipping(page, 'standard');

  // Use Stripe's expired-card test number with a future expiry date, so the
  // decline comes from Stripe rather than from client-side date validation
  const stripeFrame = page.frameLocator('iframe[name*="stripe"]');
  await stripeFrame.getByPlaceholder('Card number').fill('4000000000000069');
  await stripeFrame.getByPlaceholder('MM / YY').fill('12/30');
  await stripeFrame.getByPlaceholder('CVC').fill('123');
  await page.getByRole('button', { name: 'Pay now' }).click();

  // Error shown, not redirected
  await expect(page.getByText(/expired/i)).toBeVisible();
  await expect(page).toHaveURL('/checkout');

  // Order NOT created in backend
  const orders = await getOrdersForUser(user.id);
  expect(orders).toHaveLength(0);
});

WHY_THIS_SCORE

POSITIVE_FACTORS:
- Tests the COMPLETE user journey, not fragments (+2)
- Uses real Stripe test cards and iframe interaction — catches integration bugs (+2)
- Verifies BOTH frontend state (URL, visible text) AND backend state (DB query) (+2)
- Page refresh test catches a common real-world bug (cart lost on reload) (+1)
- Expired card test ensures error path doesn't create ghost orders (+1)
- Uses accessible selectors (getByRole, getByLabel) — resilient to CSS changes (+1)
- Timeout on shipping options loading — accounts for async API (+1)

WHY_THIS_IS_10:
- This test suite would catch a regression where: checkout redirects before payment completes, cart state is client-only, shipping options fail to load, or declined cards still create orders.
- Every assertion maps to a real user expectation.
- No implementation coupling: selectors rely on roles, labels, and visible text, so the tests survive a frontend rewrite that keeps the user-facing UI and URLs intact.

EVALUATOR_ACTION: Return score 10. No blocking issues. Flag as exemplary for the client project test library.


EXAMPLE_5: TEST THEATER — TEST THAT CAN NEVER FAIL

SCORE: 1/10

CONTEXT

Spec requires: "Email sending service must validate recipient addresses and handle SMTP failures"
Developer submitted:

describe('EmailService', () => {
  it('should send emails', () => {
    const service = new EmailService();
    expect(service).toBeDefined();
  });

  it('should have a send method', () => {
    const service = new EmailService();
    expect(typeof service.send).toBe('function');
  });

  it('should handle email sending', async () => {
    const service = new EmailService();
    const mockTransport = {
      sendMail: vi.fn().mockResolvedValue({ messageId: '123' }),
    };
    service.transport = mockTransport;

    await service.send({
      to: 'test@example.com',
      subject: 'Test',
      body: 'Hello',
    });

    expect(mockTransport.sendMail).toHaveBeenCalled();
  });

  it('should validate emails', () => {
    const service = new EmailService();
    const result = service.validateEmail('test@example.com');
    expect(result).toBeTruthy();
  });
});

WHY_THIS_SCORE

POSITIVE_FACTORS:
- Tests exist and they run (+1)

NEGATIVE_FACTORS:
- Test 1: "service is defined" — can never fail unless constructor throws. Tests nothing. (-1)
- Test 2: "has a send method" — this is a TypeScript compile check, not a behavior test. (-1)
- Test 3: mock returns success, then asserts mock was called. This tests the test setup, not the code. The mock ALWAYS returns success. What happens on SMTP failure? Never tested. (-2)
- Test 4: only tests a valid email. What about "notanemail", "", null, "a@b", "user@.com"? The one case tested is the least likely to have bugs. (-2)
- ZERO tests for: SMTP connection failure, timeout, invalid recipient, rate limiting, retry logic (-2)
- The spec explicitly mentions "handle SMTP failures" — completely ignored (-1)

WHY_1_NOT_0:
- 0 is reserved for tests that actively harm (e.g., tests that delete production data). These tests are merely useless, not harmful. They do establish that the class exists and instantiates.

DANGER: These tests execute the happy path, so they can report high line coverage and trick a coverage gate into treating the code as well-tested. This is worse than having no tests, because it creates false confidence.

EVALUATOR_ACTION: Return score 1, flag as TEST_THEATER, mark as REWRITE_REQUIRED. Do NOT pass this through the pipeline.
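
A rewrite would start from the inputs the review lists for Test 4. The following is a sketch only: validateEmailSketch() is a hypothetical validator, not EmailService.validateEmail, and the expected outcomes are assumptions (in particular, rejecting "a@b" assumes the spec wants a dotted domain).

```typescript
// Hypothetical validator used only to make the case table executable.
// Assumed rule: exactly one '@', a non-empty local part, and a domain made of
// at least two non-empty dot-separated labels.
function validateEmailSketch(input: unknown): boolean {
  if (typeof input !== 'string') return false;
  const parts = input.split('@');
  if (parts.length !== 2) return false;
  const [local, domain] = parts;
  if (local.length === 0) return false;
  const labels = domain.split('.');
  return labels.length >= 2 && labels.every((label) => label.length > 0);
}

// The invalid inputs come from the review above; the expected values are assumptions.
const emailCases: Array<{ input: unknown; valid: boolean }> = [
  { input: 'test@example.com', valid: true },
  { input: 'notanemail', valid: false },
  { input: '', valid: false },
  { input: null, valid: false },
  { input: 'a@b', valid: false },
  { input: 'user@.com', valid: false },
];

// Cases where the validator disagrees with the expected outcome.
const mismatches = emailCases.filter((c) => validateEmailSketch(c.input) !== c.valid);
```

Five of the six rows exercise rejection paths, which is where validation bugs usually hide. The SMTP failure requirement still needs its own tests with a transport that rejects.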


SCORING_SUMMARY_TABLE

Score | Meaning | Action
10 | Exemplary: catches real bugs, covers edge cases, resilient to refactors | Ship
9 | Strong: minor gaps that are nice-to-have, not blocking | Ship
7-8 | Adequate: covers core paths, some edge cases missing | Ship with notes
6 | Minimum viable: happy path covered, critical edges present | Ship with improvement ticket
5 | Below threshold: significant gaps in coverage or quality | Block, return to developer
3-4 | Weak: misses critical spec requirements or tests wrong things | Block, flag specific gaps
1-2 | Test theater: provides false confidence | Block, require full rewrite
0 | Harmful: test that masks bugs or damages test infrastructure | Block, escalate
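
The score bands can be expressed as a lookup. This is a minimal sketch: verdictForScore() and the Verdict type are hypothetical names, and the action strings are copied from the bands listed above.

```typescript
type Verdict = { blocked: boolean; action: string };

// Maps an integer score (0 to 10) to the action listed for its band.
function verdictForScore(score: number): Verdict {
  if (!Number.isInteger(score) || score < 0 || score > 10) {
    throw new RangeError(`score must be an integer from 0 to 10, got ${score}`);
  }
  if (score >= 9) return { blocked: false, action: 'Ship' };
  if (score >= 7) return { blocked: false, action: 'Ship with notes' };
  if (score === 6) return { blocked: false, action: 'Ship with improvement ticket' };
  if (score === 5) return { blocked: true, action: 'Block, return to developer' };
  if (score >= 3) return { blocked: true, action: 'Block, flag specific gaps' };
  if (score >= 1) return { blocked: true, action: 'Block, require full rewrite' };
  return { blocked: true, action: 'Block, escalate' };
}
```

For example, verdictForScore(4) is blocked with 'Block, flag specific gaps', matching EXAMPLE_1, while verdictForScore(6) is the lowest score that ships.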

EVALUATOR_CHECKLIST

Before returning a score, verify:

  1. SPEC_COVERAGE: Does every spec requirement have at least one test?
  2. EDGE_CASES: Are boundary values, null/undefined, and error paths tested?
  3. ASSERTIONS: Are assertions specific (not just "toBeDefined" or "toBeTruthy")?
  4. ISOLATION: Does each test clean up after itself? Can tests run in any order?
  5. RESILIENCE: Would these tests survive a refactor that preserves behavior?
  6. REAL_BUGS: Would these tests catch a regression that affects users?
  7. READABILITY: Can a new developer understand what each test verifies?

If any of items 1, 2, 6 are NO — score cannot exceed 5.
If item 5 is NO (brittle) — score cannot exceed 4.
If NONE of items 1-7 are convincingly YES — score is 1.
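
The three cap rules can be sketched as a function that returns the highest score a suite may receive. The Checklist field names and scoreCap() are hypothetical; each field is true when the checklist item is satisfied, so resilience is false for a brittle suite. Treating the "score is 1" rule as a cap of 1 is a simplification made for this sketch.

```typescript
// One flag per checklist item; true means the item is satisfied.
type Checklist = {
  specCoverage: boolean; // item 1
  edgeCases: boolean;    // item 2
  assertions: boolean;   // item 3
  isolation: boolean;    // item 4
  resilience: boolean;   // item 5 (false = brittle)
  realBugs: boolean;     // item 6
  readability: boolean;  // item 7
};

// Returns the maximum score allowed by the cap rules.
function scoreCap(c: Checklist): number {
  const items = [
    c.specCoverage, c.edgeCases, c.assertions, c.isolation,
    c.resilience, c.realBugs, c.readability,
  ];
  if (!items.some(Boolean)) return 1; // nothing is convincingly satisfied
  let cap = 10;
  if (!c.specCoverage || !c.edgeCases || !c.realBugs) cap = Math.min(cap, 5);
  if (!c.resilience) cap = Math.min(cap, 4);
  return cap;
}

// EXAMPLE_2 as a checklist: readable and isolated, but brittle and off-spec.
const example2: Checklist = {
  specCoverage: false, edgeCases: false, assertions: true, isolation: true,
  resilience: false, realBugs: false, readability: true,
};
const example2Cap = scoreCap(example2);
```

Here example2Cap is 4, which is consistent with the 3/10 given to EXAMPLE_2.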