DOMAIN:TESTING — TEST_RECONCILIATION

OWNER: jasper
ALSO_USED_BY: marije, judith (gap reports), antje (spec clarification), koen (mutation data input)
UPDATED: 2026-03-24
SCOPE: reconciling TDD and post-implementation test suites for all GE client projects


RECONCILIATION_OVERVIEW

Jasper's job: compare Antje's TDD test suite against Marije/Judith's post-implementation test suite.
These two suites were written INDEPENDENTLY with DIFFERENT information.
Because neither suite was derived from the other, each difference between them points to a spec gap, an implementation gap, or a weak test.

INPUT_1: Antje's TDD tests (written from Anna's spec, BEFORE implementation)
INPUT_2: Marije/Judith's post-impl tests (written AFTER implementation exists)
INPUT_3: Koen's mutation testing results (which tests actually catch bugs)
INPUT_4: Anna's spec (source of truth for arbitration)
OUTPUT: reconciliation report with coverage gaps, contradictions, and recommendations

WHY_RECONCILIATION_MATTERS:
- TDD tests verify SPEC compliance (what should happen)
- Post-impl tests verify ACTUAL behavior (what does happen)
- If TDD says X and post-impl says Y, at least one of them is WRONG
- The spec arbitrates — but sometimes the spec itself is ambiguous
- Gaps between the two suites are where real bugs hide


RECONCILIATION_PROCESS

PHASE_1: COVERAGE_MAPPING

Map both test suites to the features/behaviors they cover.

STEP 1: Extract test descriptions from both suites

# Extract test names from TDD suite
npx vitest list --project unit __tests__/tdd/ 2>/dev/null

# Extract test names from post-impl suite
npx vitest list --project unit __tests__/unit/ 2>/dev/null
npx vitest list --project integration __tests__/integration/ 2>/dev/null

STEP 2: Group tests by feature/module they cover
STEP 3: Create coverage matrix showing which behaviors are tested by which suite
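
STEP 2 can be partly automated. The sketch below groups test names by their enclosing suite. It assumes each line of the extracted output has the form "file > suite > test name"; check the real output of the list command before relying on that, since the separator and layout are an assumption here, and `groupBySuite` is a hypothetical helper.

```typescript
// Assumption: every useful line looks like "file > suite > test name".
// Lines with fewer than three segments are skipped.
function groupBySuite(listOutput: string): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const raw of listOutput.split("\n")) {
    const parts = raw.split(" > ").map((p) => p.trim());
    if (parts.length < 3) continue;
    const [, suite, ...rest] = parts;
    if (!suite) continue;
    const bucket = groups.get(suite) ?? [];
    bucket.push(rest.join(" > ")); // keep nested suite names in the test label
    groups.set(suite, bucket);
  }
  return groups;
}

const sampleOutput = [
  "users.test.ts > User creation > accepts valid input",
  "users.test.ts > User creation > rejects duplicate email",
  "",
  "session.test.ts > Session timeout > expires after inactivity",
].join("\n");

const grouped = groupBySuite(sampleOutput);
console.log(Array.from(grouped.keys()));
```

Run it once per suite; the union of the two key sets gives the rows of the coverage matrix.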

COVERAGE_MATRIX_FORMAT:

| Behavior                    | TDD | Post-Impl | Mutation Score | Status     |
|-----------------------------|-----|-----------|----------------|------------|
| User creation (valid)       | YES | YES       | 92%            | COVERED    |
| User creation (duplicate)   | YES | YES       | 88%            | COVERED    |
| User deletion               | YES | NO        | N/A            | GAP_POST   |
| Session timeout             | NO  | YES       | 75%            | GAP_TDD    |
| Password reset flow         | YES | YES       | 45%            | WEAK_TESTS |
| Rate limiting               | NO  | NO        | N/A            | UNTESTED   |

STATUS_DEFINITIONS:
- COVERED: both suites test it, mutation score acceptable
- GAP_TDD: post-impl tests it, TDD doesn't — possible spec gap
- GAP_POST: TDD tests it, post-impl doesn't — possible implementation gap
- WEAK_TESTS: both test it, but mutation score is low — tests don't catch bugs
- UNTESTED: neither suite covers it — critical gap
- CONTRADICTION: suites expect different behavior — needs arbitration
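
The status definitions above can be applied mechanically once each behavior has its flags. This is a minimal sketch, not a GE standard: the 60% cut-off for an "acceptable" mutation score is a placeholder assumption, and `BehaviorRow` and `classify` are hypothetical names.

```typescript
type Status =
  | "COVERED" | "GAP_TDD" | "GAP_POST"
  | "WEAK_TESTS" | "UNTESTED" | "CONTRADICTION";

interface BehaviorRow {
  behavior: string;
  inTdd: boolean;
  inPostImpl: boolean;
  mutationScore: number | null; // percent; null when no data (N/A)
  contradicts: boolean;         // suites expect different behavior
}

// Assumed threshold: replace with whatever the team agrees is acceptable.
const ASSUMED_MIN_MUTATION_SCORE = 60;

function classify(row: BehaviorRow): Status {
  if (!row.inTdd && !row.inPostImpl) return "UNTESTED";
  if (row.inTdd && !row.inPostImpl) return "GAP_POST";
  if (!row.inTdd && row.inPostImpl) return "GAP_TDD";
  // From here on, both suites cover the behavior.
  if (row.contradicts) return "CONTRADICTION";
  if (row.mutationScore !== null && row.mutationScore < ASSUMED_MIN_MUTATION_SCORE) {
    return "WEAK_TESTS";
  }
  // Missing mutation data is not treated as weak here; request it from Koen.
  return "COVERED";
}

const passwordReset: BehaviorRow = {
  behavior: "Password reset flow",
  inTdd: true,
  inPostImpl: true,
  mutationScore: 45,
  contradicts: false,
};
console.log(classify(passwordReset));
```

Contradiction is checked before mutation score because a contradiction blocks arbitration regardless of how strong the tests are.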


PHASE_2: GAP_ANALYSIS

GAP_TDD (post-impl has it, TDD doesn't)

MEANING: Marije/Judith found behavior worth testing that Antje missed.
ROOT_CAUSES:
1. Anna's spec was incomplete — the behavior wasn't specified
2. Antje misinterpreted the spec — the behavior was implied but not caught
3. Developer added behavior beyond spec — feature creep
4. Integration behavior — only visible when components connect

ACTION:
- IF spec was incomplete → escalate to Anna for spec update → Antje adds TDD test
- IF Antje missed it → Antje adds TDD test
- IF feature creep → escalate to Faye/Sytske (PM) — is this feature approved?
- IF integration-only → acceptable gap — TDD can't test integration behavior

GAP_POST (TDD has it, post-impl doesn't)

MEANING: Antje wrote a test for a spec requirement that Marije/Judith didn't test.
ROOT_CAUSES:
1. Marije/Judith assumed the TDD test was sufficient
2. The behavior is tested indirectly via a higher-level test
3. The behavior was implemented but not explicitly tested
4. The developer scaffolded it — code exists but doesn't actually work

ACTION:
- IF sufficient via TDD → acceptable IF mutation score is high
- IF tested indirectly → verify with mutation testing — if mutants survive, add explicit test
- IF not explicitly tested → Marije/Judith adds post-impl test
- IF scaffolded → CRITICAL — the code may not actually work. Run TDD tests to verify.

UNTESTED (neither suite covers it)

MEANING: a behavior exists that NO tests cover.
THIS IS THE MOST DANGEROUS GAP.
ROOT_CAUSES:
1. Both teams missed the same requirement
2. The behavior is in "glue code" between tested components
3. The behavior was added late and testing wasn't updated
4. Error handling paths that both teams assumed were "obvious"

ACTION:
- ALWAYS escalate untested behavior
- Determine if it's in the spec (Anna checks)
- If in spec: both Antje AND Marije/Judith write tests
- If not in spec: determine if it should be, escalate to PM


PHASE_3: CONTRADICTION_DETECTION

CONTRADICTION: TDD test expects behavior A, post-impl test expects behavior B.

EXAMPLE:

TDD test: "returns empty array when no results found"
Post-impl test: "returns null when no results found"

ARBITRATION_PROCESS:
1. Check Anna's spec — what does it say?
2. IF spec says A → post-impl test is WRONG → Marije/Judith fix their test AND the developer fixes the code
3. IF spec says B → TDD test is WRONG → Antje fixes the TDD test AND escalates (spec interpretation error)
4. IF spec is ambiguous → escalate to Anna for clarification → BOTH tests wait for spec update
5. IF spec says neither → escalate to PM — this is an unspecified behavior

RULE: the spec ALWAYS wins. Never resolve contradictions by looking at the code.
RULE: if the code matches one test but not the other, it doesn't matter — check the SPEC.
RULE: contradictions are HIGH PRIORITY — they indicate either a spec gap or an implementation bug.
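
The arbitration steps can be written as a small lookup. This is a sketch under assumptions: `SpecVerdict`, `Resolution`, and `arbitrate` are hypothetical names, and the action strings paraphrase the steps above.

```typescript
// "A" = spec matches the TDD expectation, "B" = spec matches the post-impl
// expectation, following the labels used in the arbitration steps.
type SpecVerdict = "A" | "B" | "AMBIGUOUS" | "NEITHER";

interface Resolution {
  wrongSuite: "tdd" | "post-impl" | null; // null: cannot be decided yet
  actions: string[];
}

function arbitrate(verdict: SpecVerdict): Resolution {
  switch (verdict) {
    case "A":
      return {
        wrongSuite: "post-impl",
        actions: ["Marije/Judith fix their test", "developer fixes the code"],
      };
    case "B":
      return {
        wrongSuite: "tdd",
        actions: ["Antje fixes the TDD test", "escalate: spec interpretation error"],
      };
    case "AMBIGUOUS":
      return {
        wrongSuite: null,
        actions: ["escalate to Anna for clarification", "both tests wait for spec update"],
      };
    case "NEITHER":
      return {
        wrongSuite: null,
        actions: ["escalate to PM: unspecified behavior"],
      };
  }
}

// The TDD/post-impl example above, assuming the spec asks for an empty array.
console.log(arbitrate("A").actions.join("; "));
```

The function takes only the spec verdict as input. What the code currently does is deliberately not a parameter, which mirrors the rule that the spec always wins.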


PHASE_4: TEST_THEATER_DETECTION

Test theater: tests that CANNOT fail, providing false confidence.

INDICATORS_OF_TEST_THEATER

  1. LOW_MUTATION_SCORE: tests exist but mutants survive — tests don't actually catch bugs
  2. ALWAYS_TRUE_ASSERTIONS: expect(true).toBe(true), or expect(result).toBeDefined() on a value that can never be undefined
  3. TAUTOLOGICAL_TESTS: test mirrors implementation — if code is wrong, test is wrong too
  4. MOCK_EVERYTHING: test mocks so much that it's testing mocks, not code
  5. NO_ERROR_TESTS: only happy path tested — errors are where bugs hide
  6. SNAPSHOT_ONLY: snapshot tests with no assertion about WHAT the snapshot contains

DETECTION_METHODS

METHOD_1: Mutation score analysis

IF mutation_score < 50% AND line_coverage > 80%
THEN likely test theater — tests run code but don't verify behavior
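
As a minimal sketch, the METHOD_1 rule is a single predicate. Both inputs are percentages, and `looksLikeTestTheater` is a hypothetical name.

```typescript
// Thresholds are taken directly from the METHOD_1 rule: < 50% and > 80%.
function looksLikeTestTheater(mutationScore: number, lineCoverage: number): boolean {
  return mutationScore < 50 && lineCoverage > 80;
}

console.log(looksLikeTestTheater(45, 85)); // high coverage, most mutants survive
console.log(looksLikeTestTheater(92, 85)); // high coverage, mutants are killed
```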

METHOD_2: Test inversion
Temporarily negate a key assertion. If tests still pass, the assertion is worthless.
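
To illustrate why inversion works, the sketch below models a test as a function that receives its assertion, then swaps in the negated assertion. Everything here is hypothetical (`findUsers`, the two example tests, `survivesInversion`). A test that passes both ways never actually evaluates its assertion.

```typescript
type Check = (actual: unknown, expected: unknown) => void;

const assertEqual: Check = (actual, expected) => {
  if (actual !== expected) throw new Error("expected values to be equal");
};
const assertNotEqual: Check = (actual, expected) => {
  if (actual === expected) throw new Error("expected values to differ");
};

// Hypothetical code under test: unknown names yield no users.
function findUsers(name: string): string[] {
  return name === "alice" ? ["alice"] : [];
}

// The assertion sits in a loop body that never runs for an empty result.
function unreachedAssertionTest(check: Check): void {
  for (const user of findUsers("nobody")) {
    check(user, "nobody");
  }
}

// The assertion is always evaluated.
function realTest(check: Check): void {
  check(findUsers("nobody").length, 0);
}

function passes(test: (check: Check) => void, check: Check): boolean {
  try {
    test(check);
    return true;
  } catch {
    return false;
  }
}

// True when the test passes with the original AND the negated assertion.
function survivesInversion(test: (check: Check) => void): boolean {
  return passes(test, assertEqual) && passes(test, assertNotEqual);
}

console.log(survivesInversion(unreachedAssertionTest), survivesInversion(realTest));
```

In a real suite the same check is done by hand: flip the matcher, run the test, confirm it goes red, then restore it.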

METHOD_3: Assertion density

IF average assertions_per_test < 1.0
THEN tests may not be verifying anything meaningful

METHOD_4: Mock ratio

IF mock_count > assertion_count in a test file
THEN tests are likely testing mocks, not real behavior
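
METHOD_3 and METHOD_4 can be approximated by counting patterns in a test file's source text. This is a rough heuristic sketch, not a parser: it assumes Vitest-style `it(`/`test(`, `expect(`, and `vi.fn(`/`vi.mock(`/`vi.spyOn(` calls, and it will miss variants such as `it.each`. `countStats` and `flagFile` are hypothetical names.

```typescript
interface FileStats {
  tests: number;
  assertions: number;
  mocks: number;
}

function countStats(source: string): FileStats {
  const count = (re: RegExp): number => (source.match(re) ?? []).length;
  return {
    tests: count(/\b(?:it|test)\s*\(/g),
    assertions: count(/\bexpect\s*\(/g),
    mocks: count(/\bvi\.(?:fn|mock|spyOn)\s*\(/g),
  };
}

function flagFile(stats: FileStats): { lowDensity: boolean; mockHeavy: boolean } {
  const density = stats.tests === 0 ? 0 : stats.assertions / stats.tests;
  return {
    lowDensity: density < 1.0,                // METHOD_3
    mockHeavy: stats.mocks > stats.assertions, // METHOD_4
  };
}

// Source text of a hypothetical test file with two tests and one assertion.
const sampleSource = `
vi.mock("./db");
it("creates a user", () => {
  const save = vi.fn();
  createUser(save);
});
it("is defined", () => {
  expect(createUser).toBeDefined();
});
`;

console.log(flagFile(countStats(sampleSource)));
```

Treat a flag as a prompt to read the file, not as proof of test theater; a single strong assertion can be worth more than five weak ones.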

REPORTING_TEST_THEATER

When test theater is detected:
1. Document the specific tests and WHY they're theater
2. Report to test owner (Antje for TDD, Marije/Judith for post-impl)
3. Recommend specific improvements (add assertion, test error path, reduce mocks)
4. Track improvement over time — test theater should decrease sprint over sprint


RECONCILIATION_REPORT_FORMAT

# Test Reconciliation Report
Feature: [feature name]
Date: [date]
TDD Author: antje
Post-impl Author: marije/judith
Reconciled by: jasper

## Coverage Summary
| Metric | TDD Suite | Post-Impl Suite | Combined |
|--------|-----------|-----------------|----------|
| Behaviors covered | X | Y | Z |
| Mutation score | X% | Y% | Z% |
| Test count | X | Y | Z |

## Gaps Found
### GAP_TDD (missing from TDD)
- [behavior]: [reason] → [action]

### GAP_POST (missing from post-impl)
- [behavior]: [reason] → [action]

### UNTESTED (missing from both)
- [behavior]: [severity] → [action]

## Contradictions
- [behavior]: TDD expects [A], post-impl expects [B]
  Spec says: [C]
  Resolution: [fix TDD/fix post-impl/escalate]

## Test Theater
- [file:line]: [why it's theater] → [fix]

## Recommendations
1. [specific action for antje]
2. [specific action for marije/judith]
3. [spec update needed from anna]
4. [escalation to PM if needed]

AUTOMATED_RECONCILIATION_TOOLS

COVERAGE_DIFF_SCRIPT

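// tools/reconcile-coverage.ts
import { readFileSync } from 'fs';

// Per-file hit counts, keyed by line number (as a string).
interface CoverageData {
  [file: string]: {
    lines: { [line: string]: number };
    branches: { [branch: string]: number };
  };
}

const EMPTY: CoverageData[string] = { lines: {}, branches: {} };

function findGaps(tddCoverage: CoverageData, postImplCoverage: CoverageData) {
  const gaps = {
    tddOnly: [] as string[],      // Lines covered by TDD only
    postImplOnly: [] as string[],  // Lines covered by post-impl only
    uncovered: [] as string[],     // Lines covered by neither
  };

  // Walk the union of files and lines so that anything seen only by the
  // post-impl run is still compared.
  const files = new Set([...Object.keys(tddCoverage), ...Object.keys(postImplCoverage)]);
  for (const file of files) {
    const tdd = tddCoverage[file] ?? EMPTY;
    const post = postImplCoverage[file] ?? EMPTY;

    for (const line of new Set([...Object.keys(tdd.lines), ...Object.keys(post.lines)])) {
      const tddHits = tdd.lines[line] ?? 0;
      const postHits = post.lines[line] ?? 0;

      if (tddHits > 0 && postHits === 0) gaps.tddOnly.push(`${file}:${line}`);
      if (tddHits === 0 && postHits > 0) gaps.postImplOnly.push(`${file}:${line}`);
      if (tddHits === 0 && postHits === 0) gaps.uncovered.push(`${file}:${line}`);
    }
  }

  return gaps;
}

// coverage-final.json uses Istanbul's shape: statement locations in
// `statementMap`, hit counts in `s`. Collapse it to per-line hit counts.
// Branch data is not used by findGaps and is left empty here.
function loadCoverage(path: string): CoverageData {
  const raw = JSON.parse(readFileSync(path, 'utf8'));
  const data: CoverageData = {};
  for (const file of Object.keys(raw)) {
    const lines: { [line: string]: number } = {};
    for (const id of Object.keys(raw[file].statementMap)) {
      const line = String(raw[file].statementMap[id].start.line);
      lines[line] = Math.max(lines[line] ?? 0, raw[file].s[id] ?? 0);
    }
    data[file] = { lines, branches: {} };
  }
  return data;
}

const [tddPath, postImplPath] = process.argv.slice(2);
if (!tddPath || !postImplPath) {
  console.error('Usage: reconcile-coverage.ts <tdd-coverage.json> <post-impl-coverage.json>');
  process.exit(1);
}
console.log(JSON.stringify(findGaps(loadCoverage(tddPath), loadCoverage(postImplPath)), null, 2));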

USING_VITEST_COVERAGE_FOR_RECONCILIATION

# Generate coverage for TDD suite
npx vitest run __tests__/tdd/ \
  --coverage --coverage.reporter=json \
  --coverage.reportsDirectory=coverage/tdd

# Generate coverage for post-impl suite
npx vitest run __tests__/unit/ __tests__/integration/ \
  --coverage --coverage.reporter=json \
  --coverage.reportsDirectory=coverage/post-impl

# Compare the two coverage reports
npx ts-node tools/reconcile-coverage.ts \
  coverage/tdd/coverage-final.json \
  coverage/post-impl/coverage-final.json

ESCALATION_RULES

ESCALATE_TO_ANNA (spec):
- Spec is ambiguous and arbitration is impossible
- Untested behavior that may need to be added to spec
- Contradictions that spec doesn't resolve

ESCALATE_TO_PM (Faye/Sytske):
- Feature creep detected (behavior exists but not in spec)
- Untested critical behavior (auth, payment, data)
- Significant quality gaps that affect delivery timeline

ESCALATE_TO_KOEN:
- Need targeted mutation testing on specific areas
- Mutation score data is stale or missing

ESCALATE_TO_ASHLEY:
- Reconciliation reveals edge cases neither suite covers
- Adversarial testing needed for discovered gaps

RULE: every escalation includes specific context — NEVER "please check this"
RULE: escalations via GE discussion system (Redis-backed, tracked)
RULE: if contradictions affect a live feature, escalation is URGENT


TIMING_IN_PIPELINE

Anna (spec)
Antje (TDD tests — from spec)
Developers (implementation — makes TDD tests green)
Koen (deterministic checks — lint, types, format)
Marije/Judith (post-impl tests — from implementation + spec)
Koen (mutation testing — on both suites)
★ JASPER (reconciliation — compares suites, finds gaps) ★
Marco (conflict resolution — if contradictions are complex)
Ashley (adversarial testing — from user perspective)
Jaap (SSOT verification)
Marta (merge)

RULE: Jasper works AFTER both test suites are complete and mutation data is available
RULE: Jasper works BEFORE Ashley — reconciliation gaps inform adversarial testing targets
RULE: Jasper's report is an INPUT to Marco if contradictions need complex resolution


CROSS_REFERENCES

TDD: domains/testing/tdd-methodology.md — how Antje writes pre-impl tests
VITEST: domains/testing/vitest-patterns.md — shared test runner
MUTATION: domains/testing/mutation-testing.md — Koen's mutation data feeds reconciliation
ADVERSARIAL: domains/testing/adversarial-testing.md — Ashley uses reconciliation gaps
PITFALLS: domains/testing/pitfalls.md — test theater patterns in detail