Testing Standards
TDD Approach
RULE: Write the test before the implementation. The test defines what "done" means.
RATIONALE: Prevents scaffolding without integration. The test is the first call site for any new code.
Test Levels
Unit Tests (supplementary)
- Test individual functions in isolation
- Fast, numerous, focused
- NOT sufficient for feature verification
- NOT proof of life
Integration Tests (required for features)
- Test the actual system path end-to-end
- Exercise real network calls, real database operations, real Redis streams
- This IS proof of life
- Every new feature must have at least one integration test
Regression Tests (scheduled)
- Run on schedule (post-deployment or daily)
- Exercise all existing features through actual system paths
- A failing regression test blocks new feature work (Principle 10)
What Tests Must Prove
For a new API endpoint:
- The route is registered and reachable (curl from outside the pod)
- The handler processes a real request
- The response matches the contract schema
- The side effects (DB writes, Redis publishes) actually occurred
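
The four checks for an API endpoint can be sketched with the Python standard library alone. The `/health` route, the `{"status": "ok"}` contract, and the in-memory `WRITES` list standing in for a database are assumptions made for illustration. A real test would call the deployed service from outside the pod and query the real datastore.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical stand-in for a real datastore; a real test would query the DB.
WRITES = []


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":  # hypothetical route
            self.send_error(404)
            return
        WRITES.append("health-checked")  # observable side effect
        body = json.dumps({"status": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep test output quiet
        pass


def check_endpoint():
    """Call the route over a real socket; return (status, payload, writes)."""
    server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick a free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
        conn.request("GET", "/health")
        resp = conn.getresponse()
        status, payload = resp.status, json.loads(resp.read())
        conn.close()
        return status, payload, list(WRITES)
    finally:
        server.shutdown()
        server.server_close()


def test_health_endpoint():
    status, payload, writes = check_endpoint()
    assert status == 200                # route is registered and reachable
    assert payload == {"status": "ok"}  # response matches the contract
    assert "health-checked" in writes   # side effect actually occurred
```

The test never calls `Handler.do_GET` directly. It goes through a socket, which is what separates it from a unit test of the function body.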
For a new agent capability:
- The trigger reaches the executor via Redis Stream
- The executor spawns a real CLI session (Claude/Codex/Gemini)
- The CLI produces real output (verified via PTY capture)
- The completion file is written to ge-ops/system/completions/
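
The last check, that a completion file really appears on disk, could be sketched as a polling helper. The helper name, the file name `task-123.md`, and the timeout values are hypothetical. Only the directory `ge-ops/system/completions/` comes from the text above, and the example uses a temporary directory in its place so it runs anywhere.

```python
import tempfile
import time
from pathlib import Path


def wait_for_completion(directory, name, timeout=5.0, interval=0.05):
    """Poll until directory/name exists and is non-empty, else raise TimeoutError."""
    target = Path(directory) / name
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if target.is_file() and target.stat().st_size > 0:
            return target
        time.sleep(interval)
    raise TimeoutError(f"no completion file at {target} after {timeout}s")


def test_completion_file_written():
    # Stand-in directory; a real test would point at ge-ops/system/completions/
    with tempfile.TemporaryDirectory() as tmp:
        completions = Path(tmp)
        (completions / "task-123.md").write_text("done\n")  # hypothetical name
        found = wait_for_completion(completions, "task-123.md", timeout=1.0)
        assert found.read_text() == "done\n"
```

Polling with a deadline keeps the test from hanging when the executor never writes the file, and the `TimeoutError` message names the path that was expected.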
NOT ACCEPTABLE: Tests that only verify the function body without verifying it's callable through the real path.
ENFORCEMENT: Marije/Judith run test suites. Koen/Eric verify test coverage in code review.
Testing Tools
webapp-testing Skill (Playwright)
The webapp-testing skill (installed in .claude/skills/, source: anthropics/skills) provides Playwright-based web app testing patterns. Marije and Judith use it to author E2E and integration tests during Phase 8 (Integration).
The skill auto-activates when agents work on test files targeting web applications. It provides structured patterns for:
- Page object models
- Test fixtures and setup/teardown
- Assertion patterns for UI state
- Network interception and mocking
- Visual regression testing
See also: Playwright integration, Anthropic skills and plugins
CI Pipeline Testing Standards
General Rules
- All tests must pass in CI; no `allow_failure` on test stages
- Test paths must use dynamic `GE_ROOT` detection via `tests.conftest.GE_ROOT_PATH`; never hardcode `/home/claude/ge-bootstrap` in test files
- Test fixtures must be self-contained and clean up after themselves
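
The source names `tests.conftest.GE_ROOT_PATH` but does not show how it is computed. The following is one possible sketch, not the project's actual implementation. It assumes that a `GE_ROOT` environment variable may override detection and that the root is otherwise marked by a `.git` entry. `find_ge_root` and `scratch_root` are hypothetical helper names.

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path


def find_ge_root(start, marker=".git"):
    """Return the nearest ancestor of `start` (inclusive) that contains `marker`.

    Assumption: the root is identifiable by a marker entry; the real
    conftest may use a different rule.
    """
    override = os.environ.get("GE_ROOT")  # assumed override variable
    if override:
        return Path(override).resolve()
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise RuntimeError(f"could not locate {marker!r} above {start}")


@contextmanager
def scratch_root(marker=".git"):
    """Self-contained fixture: build a throwaway root, remove it on exit."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp).resolve()
        (root / marker).mkdir()
        yield root


# In tests/conftest.py this might be exposed as:
# GE_ROOT_PATH = find_ge_root(Path(__file__).parent)
```

Because the root is derived from the test file's own location, the same tests run unchanged on a developer machine and on a CI runner with a different checkout path.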
Mutation Testing
- Mutation testing threshold: 80% on new code (enforced by the `test:mutation` CI stage)
- Mutation testing threshold: 60% on existing code (tracked, not yet blocking)
- Tools: Stryker (TypeScript), mutmut (Python)
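
The two thresholds behave differently: one blocks, the other only reports. Below is a minimal sketch of such a gate, using a simplified score of killed mutants over killed plus survived. Stryker and mutmut each have their own reporting formats and score definitions, so `MutationResult` and `gate` are hypothetical and do not mirror either tool's API.

```python
from dataclasses import dataclass

NEW_CODE_THRESHOLD = 80.0       # blocking, per the rule above
EXISTING_CODE_THRESHOLD = 60.0  # tracked, not yet blocking


@dataclass(frozen=True)
class MutationResult:
    killed: int
    survived: int

    @property
    def score(self):
        """Percentage of mutants killed (simplified definition)."""
        total = self.killed + self.survived
        # With no mutants there is nothing to fail on; treat as 100%.
        return 100.0 * self.killed / total if total else 100.0


def gate(new, existing):
    """Return (passed, messages). Only the new-code threshold can fail the gate."""
    passed = True
    messages = []
    if new.score < NEW_CODE_THRESHOLD:
        passed = False
        messages.append(f"FAIL new code: {new.score:.1f}% < {NEW_CODE_THRESHOLD}%")
    if existing.score < EXISTING_CODE_THRESHOLD:
        messages.append(
            f"WARN existing code: {existing.score:.1f}% < {EXISTING_CODE_THRESHOLD}%"
        )
    return passed, messages
```

For example, `gate(MutationResult(85, 15), MutationResult(50, 50))` passes with one warning, while `gate(MutationResult(79, 21), MutationResult(70, 30))` fails.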
Adversarial Testing
- Property-based testing with Hypothesis (Python) and fast-check (TypeScript) in the `test:adversarial` CI stage
- Fuzz testing on the condition evaluator and critical path functions
- All 7 attack categories (type confusion, boundary, resource exhaustion, injection, concurrency, precision, unicode) must be covered
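
To make the idea concrete, here is a small fuzz sketch. It uses the standard library's `random` module rather than Hypothesis so that it runs without third-party packages. In the `test:adversarial` stage the same property could be expressed with Hypothesis strategies. The `evaluate` function is a hypothetical stand-in for the condition evaluator, and the value list covers only a handful of inputs per category. Concurrency is not exercised by this single-threaded sketch.

```python
import math
import random


def evaluate(op, left, right):
    """Hypothetical stand-in for the condition evaluator."""
    if op not in {"==", "<", ">"}:
        raise ValueError(f"unsupported operator: {op!r}")
    for value in (left, right):
        # bool is a subclass of int; reject it to avoid type confusion
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"unsupported operand: {value!r}")
        if isinstance(value, float) and not math.isfinite(value):
            raise ValueError("non-finite operand")
    if op == "==":
        return left == right
    return left < right if op == "<" else left > right


# A few adversarial values per category; far from exhaustive.
ADVERSARIAL = [
    True, None, "1", [1], {},                # type confusion
    0, -1, 2**63, -(2**63) - 1,              # boundary
    float("nan"), float("inf"), 0.1 + 0.2,   # precision
    "'; DROP TABLE x; --", "${jndi:x}",      # injection
    "\u202e", "e\u0301", "\x00",             # unicode
    "A" * 10_000,                            # resource exhaustion (small)
]


def fuzz(iterations=2000, seed=0):
    """Property: evaluate() returns a bool or raises ValueError, nothing else."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    ops = ["==", "<", ">", "!=", "", "\u202e"]
    accepted = 0
    for _ in range(iterations):
        pool = ADVERSARIAL + [rng.randint(-10**6, 10**6), rng.uniform(-1e9, 1e9)]
        op, left, right = rng.choice(ops), rng.choice(pool), rng.choice(pool)
        try:
            result = evaluate(op, left, right)
        except ValueError:
            continue  # clean rejection is acceptable
        assert isinstance(result, bool), (op, left, right, result)
        accepted += 1
    return accepted
```

Any exception other than `ValueError` propagates out of `fuzz` and fails the run, which is the point: hostile input must be rejected cleanly rather than crash the evaluator.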
CI Job Reference
| CI Job | Stage | What it verifies |
|---|---|---|
| `tdd:red-gate` | TDD | All TDD tests are red before implementation |
| `tdd:green-gate` | TDD | All TDD tests turn green after implementation |
| `tdd:oracle-check` | TDD | Oracle independence: tests don't import implementation |
| `build:backend` | Build | Implementation compiles and builds |
| `lint:python` | Quality | Ruff linting (zero errors) |
| `lint:secrets` | Quality | Gitleaks secret detection |
| `security:bandit` | Security | Python SAST |
| `security:semgrep` | Security | Multi-language static analysis |
| `security:dependency-scan` | Security | Dependency vulnerability audit |
| `test:unit:backend` | Testing | Backend unit test suite |
| `test:integration` | Testing | Full integration test suite |
| `test:reconciliation` | Testing | TDD vs post-impl test suite comparison |
| `test:adversarial` | Testing | Fuzz and property-based tests |
| `test:contract` | SSOT | API contract verification + `verify_ssot.sh` |
| `test:mutation` | Quality | Mutation testing thresholds |
| `review:gate` | Merge | Manual merge approval (future: automated scoring) |
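
The table says `tdd:oracle-check` verifies that tests don't import the implementation, but not how. One plausible approach is to parse each test file and inspect its import statements. The sketch below assumes that approach. The function names and the `app` package prefix are hypothetical, and relative imports (`from . import x`) are ignored for brevity.

```python
import ast


def imported_modules(source):
    """Return the dotted names of all modules imported by `source`."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module)
    return modules


def oracle_violations(source, forbidden_prefixes):
    """Return imports that fall under any forbidden implementation package."""
    return {
        module
        for module in imported_modules(source)
        if any(
            module == prefix or module.startswith(prefix + ".")
            for prefix in forbidden_prefixes
        )
    }


# Example: a test file that reaches into a hypothetical `app` package.
sample = "import json\nfrom app.services import billing\nimport appx\n"
violations = oracle_violations(sample, ("app",))
```

Here `violations` contains only `app.services`: `json` is unrelated, and `appx` merely shares a prefix string, which is why the check matches on whole dotted segments.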