
DOMAIN:TESTING — TEST_QUALITY_RUBRIC

OWNER: marije, judith
ALSO_USED_BY: antje (TDD quality gate), jasper (reconciliation input), marta (merge gate input)
UPDATED: 2026-03-24
SCOPE: scoring rubric for test suite quality assessment — used by Marije and Judith to evaluate every test suite before it enters the merge pipeline
PURPOSE: define what "good enough to ship" means for tests, with objective criteria that eliminate subjective judgments


SCORING_OVERVIEW

Every test suite receives a score from 0 to 10.
Threshold: >= 6 to ship. < 6 requires rework.
The score is the weighted sum of six criterion scores (weights sum to 100%).
SEE: testing/calibration-examples.md for scored examples at each level.


CRITERIA_TABLE

#  Criterion          Weight  What It Measures
1  Coverage           20%     Are spec-required behaviors tested?
2  Edge cases         25%     Are boundaries, nulls, errors, and unusual inputs tested?
3  Isolation          15%     Do tests run independently without shared state?
4  Readability        10%     Can a new developer understand what each test verifies?
5  Assertion quality  20%     Are assertions specific, meaningful, and behavior-focused?
6  Flakiness risk     10%     Is the test likely to pass/fail inconsistently?

FORMULA: Score = SUM(criterion_score * weight). Each criterion_score is 0-10 and the weights sum to 100%, so a Coverage score of 8 contributes 8 * 0.20 = 1.6.
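The formula is a straight weighted sum. A minimal sketch (the helper name computeScore and the object shape are illustrative, not part of any tooling):

```javascript
// Weights from CRITERIA_TABLE (must sum to 1.0).
const WEIGHTS = {
  coverage: 0.20,
  edgeCases: 0.25,
  isolation: 0.15,
  readability: 0.10,
  assertionQuality: 0.20,
  flakinessRisk: 0.10,
};

// Each criterion is scored 0-10; the total is the weighted sum,
// rounded to two decimals to avoid floating-point noise.
function computeScore(scores) {
  let total = 0;
  for (const [criterion, weight] of Object.entries(WEIGHTS)) {
    total += scores[criterion] * weight;
  }
  return Math.round(total * 100) / 100;
}

// First worked example from SCORE_CALCULATION_EXAMPLE:
const authSuite = {
  coverage: 8, edgeCases: 7, isolation: 9,
  readability: 6, assertionQuality: 8, flakinessRisk: 9,
};
console.log(computeScore(authSuite)); // 7.8, i.e. SHIP READY (>= 6)
```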


CRITERION_1: COVERAGE (20%)

Measures: Does the test suite cover the behaviors specified in Anna's spec?

Score Definition
10 Every spec requirement has at least one test. Critical paths have multiple tests (happy + error).
8 90%+ spec requirements covered. Gaps are cosmetic only.
6 75-89% covered. Missing items are non-critical. All critical paths tested.
4 50-74% covered. Some critical paths missing.
2 25-49% covered. Most critical paths missing.
0 < 25% covered or tests do not map to spec at all.

IMPORTANT: Coverage means BEHAVIORAL coverage (spec requirements tested), NOT line coverage.
100% line coverage with 0% behavioral coverage is score 0.
60% line coverage with 100% behavioral coverage is score 10.

HOW_TO_MEASURE:
1. List all requirements from Anna's spec
2. For each requirement, find the test(s) that verify it
3. Count: tested / total = coverage percentage
4. Weight critical-path requirements 2x in the count
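The four steps above can be sketched as a small script, assuming an illustrative requirement shape (an id, a critical flag, and the list of tests that map to it):

```javascript
// Sketch of HOW_TO_MEASURE. A requirement counts as tested when at least
// one test maps to it (step 2); critical-path requirements count 2x in
// both the numerator and the denominator (step 4).
function behavioralCoverage(requirements) {
  let tested = 0;
  let total = 0;
  for (const req of requirements) {
    const weight = req.critical ? 2 : 1;
    total += weight;
    if (req.tests.length > 0) tested += weight;
  }
  return total === 0 ? 0 : (tested / total) * 100; // step 3: percentage
}

// Hypothetical spec requirements for a login feature:
const specRequirements = [
  { id: 'login-success', critical: true, tests: ['should log in with valid credentials'] },
  { id: 'login-bad-password', critical: true, tests: ['should reject invalid password'] },
  { id: 'remember-me-checkbox', critical: false, tests: [] },
];
console.log(behavioralCoverage(specRequirements)); // 80, i.e. the 75-89% band (score 6)
```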


CRITERION_2: EDGE CASES (25%)

Measures: Does the test suite probe boundaries, error paths, and unusual inputs?

Score Definition
10 Boundary values tested for every numeric input. Null/undefined handled. Error paths for every external call. Empty collections, max-length strings, concurrent access.
8 Most boundaries tested. Error paths for critical external calls (DB, API). Missing only exotic edge cases.
6 Core happy path + obvious error paths. Some boundary testing. Missing: concurrent access, empty states, overflow.
4 Happy path only with 1-2 error cases. No boundary testing.
2 Happy path only. No error paths tested.
0 No meaningful test cases.

EDGE_CASE_CHECKLIST (verify these exist):
- NUMERIC: 0, -1, MAX_SAFE_INTEGER, NaN, Infinity, fractional amounts (0.1 + 0.2)
- STRINGS: empty string, null, undefined, single char, max length, unicode, HTML injection
- COLLECTIONS: empty array, single item, max items, duplicates
- ASYNC: timeout, rejection, concurrent calls, out-of-order responses
- AUTH: expired token, invalid token, missing token, wrong role
- DATA: nonexistent ID, deleted record, stale data

WHY_25_PERCENT_WEIGHT:
Edge cases are where bugs hide. Happy paths work because developers test them manually during development. Edge cases only surface in production under unusual conditions. This is the highest-weighted criterion because it has the highest bug-catching potential.


CRITERION_3: ISOLATION (15%)

Measures: Can each test run independently, in any order, without affecting other tests?

Score Definition
10 Each test creates its own data, cleans up after itself. No shared mutable state. Tests pass in any order, including randomized.
8 Tests use shared setup (beforeEach) but each test gets fresh state. Cleanup is reliable.
6 Tests use shared setup. Cleanup exists but is not comprehensive (some state leaks possible).
4 Tests depend on execution order. Reordering causes failures.
2 Tests share mutable state across test cases. One test's side effects affect another's results.
0 Tests modify global state (environment variables, singletons) without restoration.

ISOLATION_RED_FLAGS:
- beforeAll that seeds data used by multiple tests (use beforeEach instead)
- Tests that reference specific IDs created by other tests
- Tests that call process.env.X = 'value' without restoring
- Database tests without transaction rollback or table truncation
- Tests that depend on external services being in a specific state
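The first red flag (beforeAll-seeded shared data) can be sketched without a test framework, using a hypothetical makeCart factory to show the fresh-state-per-test pattern:

```javascript
// Anti-pattern (beforeAll-style): one object seeded once and shared by
// every test, so one test's mutation leaks into the next.
// Pattern (beforeEach-style): a factory hands each test a fresh instance,
// so tests pass in any order, including randomized.
function makeCart() {
  return {
    items: [],
    add(item) { this.items.push(item); },
  };
}

// Each "test" starts from its own clean cart:
const cartA = makeCart();
cartA.add('book');
const cartB = makeCart();        // unaffected by cartA's mutation
console.log(cartB.items.length); // 0, no leaked state
```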


CRITERION_4: READABILITY (10%)

Measures: Can a developer unfamiliar with the code understand what each test verifies?

Score Definition
10 Test names describe the behavior being verified. Arrange-Act-Assert pattern is clear. No mystery values. Helper functions have descriptive names.
8 Test names are good. Most tests follow AAA. Occasional magic number or unclear helper.
6 Test names exist but are vague ("should work correctly"). AAA is mostly followed.
4 Test names are unclear. Tests mix setup, action, and assertion without clear boundaries.
2 Tests are unreadable without studying the implementation. No pattern.
0 Tests are actively misleading (test name says X, test verifies Y).

READABILITY_PATTERNS:
- GOOD: it('should reject payment when daily limit exceeded')
- BAD: it('should handle edge case')
- GOOD: it('should return 404 for deleted user')
- BAD: it('test 3')

WHY_ONLY_10_PERCENT:
Readability matters for maintenance, not for catching bugs. A poorly named test that catches real bugs is better than a beautifully named test that catches nothing. But unreadable tests get deleted during refactors because developers don't understand what they protect.


CRITERION_5: ASSERTION QUALITY (20%)

Measures: Do assertions verify meaningful behavior, or are they superficial?

Score Definition
10 Assertions verify specific values, state transitions, and side effects. Multiple assertions per test targeting different aspects of the behavior.
8 Assertions are specific. Occasional use of loose matchers where tight ones would be better.
6 Mix of specific and loose assertions. Some toBeDefined() where toBe(specificValue) is warranted.
4 Mostly loose assertions. Heavy use of toBeDefined, toBeTruthy, toHaveBeenCalled.
2 Assertions check only that code runs without throwing. No value verification.
0 No assertions, or assertions that can never fail (test theater).

ASSERTION_HIERARCHY (best to worst):

BEST:  expect(result.amount).toBe(150.00)                          // exact value
GOOD:  expect(result.items).toHaveLength(3)                        // specific structure
OK:    expect(result.items).toEqual(expect.arrayContaining([...])) // partial match
WEAK:  expect(result).toBeDefined()                                // existence only
BAD:   expect(result).toBeTruthy()                                 // anything non-falsy passes
WORST: expect(true).toBe(true)                                     // always passes

SIDE_EFFECT_ASSERTIONS:
Tests that verify side effects (email sent, record created, event emitted) are high-value assertions. A test that only checks the return value but ignores side effects misses half the behavior.


CRITERION_6: FLAKINESS RISK (10%)

Measures: Is the test likely to produce inconsistent results across runs?

Score Definition
10 No timing dependencies, no external service calls, deterministic data, fixed seeds for randomness.
8 Minor timing dependency (e.g., setTimeout in test) but with adequate wait/poll.
6 Uses real clock or real network but with reasonable timeouts and retry logic.
4 Depends on timing without adequate tolerance. OR uses real external service without mock.
2 Multiple flakiness vectors: timing + network + shared state.
0 Test has known flakiness that is ignored (.retry(3) without fixing root cause).

FLAKINESS_RED_FLAGS:
- setTimeout / sleep in tests without a condition to wait for
- Assertions on timestamps without tolerance (toBe instead of toBeCloseTo)
- Tests that pass on powerful CI but fail on developer laptops
- Tests that fail on first run but pass on retry
- Port binding without dynamic port allocation
- File system tests without temp directory isolation
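One concrete fix for the randomness vector named in the score-10 row: derive test data from a fixed-seed PRNG instead of Math.random(), so any failure reproduces exactly. The sketch uses mulberry32, a common tiny generator; the seed values are arbitrary:

```javascript
// Fixed-seed PRNG (mulberry32). Math.random() in tests is a flakiness
// vector: a "random" input occasionally hits an untested case that
// cannot be reproduced on retry.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // value in [0, 1)
  };
}

// Same seed produces the same sequence on every run:
const rng = mulberry32(42);
const quantities = Array.from({ length: 3 }, () => Math.floor(rng() * 100));
const rng2 = mulberry32(42);
const again = Array.from({ length: 3 }, () => Math.floor(rng2() * 100));
console.log(JSON.stringify(quantities) === JSON.stringify(again)); // true
```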


SHIP_READY_VS_NEEDS_WORK

SHIP_READY (score >= 6)

A test suite is ship-ready when:
- All spec-required critical paths are tested (criterion 1 >= 6)
- Error paths for external dependencies are tested (criterion 2 >= 6)
- Tests don't depend on each other (criterion 3 >= 6)
- Assertions verify actual behavior, not just existence (criterion 5 >= 6)
- No known flakiness vectors (criterion 6 >= 6)

The suite may have gaps in cosmetic areas, readability could be better, and some exotic edge cases may be missing. These are acceptable for shipping and can be addressed in follow-up.

NEEDS_WORK (score < 6)

A test suite needs work when ANY of:
- Critical path has zero tests (criterion 1 < 4)
- No error paths tested (criterion 2 < 4)
- Tests depend on execution order (criterion 3 < 4)
- Assertions are all superficial (criterion 5 < 4)
- Known flakiness is ignored (criterion 6 < 4)

Return to developer with specific items to address. Do not list everything — prioritize the highest-impact fixes that would bring the suite to ship-ready.


SCORE_CALCULATION_EXAMPLE

Test suite for: User authentication module

Coverage:          8/10 * 20% = 1.6
Edge cases:        7/10 * 25% = 1.75
Isolation:         9/10 * 15% = 1.35
Readability:       6/10 * 10% = 0.6
Assertion quality: 8/10 * 20% = 1.6
Flakiness risk:    9/10 * 10% = 0.9
                              -----
TOTAL:                         7.8 → SHIP READY

Test suite for: Shopping cart module

Coverage:          5/10 * 20% = 1.0
Edge cases:        3/10 * 25% = 0.75
Isolation:         7/10 * 15% = 1.05
Readability:       8/10 * 10% = 0.8
Assertion quality: 4/10 * 20% = 0.8
Flakiness risk:    6/10 * 10% = 0.6
                              -----
TOTAL:                         5.0 → NEEDS WORK

Feedback: "Edge cases critically low (3/10) — no boundary testing for
quantities, no empty cart handling, no concurrent add/remove. Assertion
quality (4/10) — 60% of assertions are toBeDefined(). Priority fixes:
1) Add boundary tests for quantity (0, -1, max). 2) Add empty cart tests.
3) Replace toBeDefined with specific value checks."

REPORTING_FORMAT

When returning a test quality assessment, use this format:

TEST_QUALITY_ASSESSMENT
Suite: [name]
Score: [X.X]/10 — [SHIP READY | NEEDS WORK]

Criteria breakdown:
  Coverage:          [X]/10 — [one-line rationale]
  Edge cases:        [X]/10 — [one-line rationale]
  Isolation:         [X]/10 — [one-line rationale]
  Readability:       [X]/10 — [one-line rationale]
  Assertion quality: [X]/10 — [one-line rationale]
  Flakiness risk:    [X]/10 — [one-line rationale]

[If NEEDS WORK: prioritized list of specific fixes]
[If SHIP READY: any non-blocking improvement suggestions]