Pipeline Stages¶
The anti-LLM pipeline has 10 stages. For every stage: what it does, what failure class it catches, who owns it, what tools it uses, what its output is, and what blocks progression.
Stage 1: Formal Specification¶
Owner: Anna (Formal Specification Agent)
Phase: Pre-implementation
Input: Functional specification from Aimee, technical design from project lead
Output: Formal specification document with acceptance criteria, API contracts, data model constraints, and behavioral rules
What it does¶
Anna translates human-readable requirements into precise, testable specifications. Every behavior is defined with:
- Pre-conditions (what must be true before the operation)
- Post-conditions (what must be true after the operation)
- Invariants (what must always be true)
- Edge cases explicitly enumerated
- Error conditions with expected responses
- API contracts in OpenAPI format
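The pre-condition/post-condition/invariant structure above can be made concrete in code. The following is a minimal sketch, not GE's actual spec format: a hypothetical `withdraw` operation whose contract is expressed as executable checks.

```typescript
// Hypothetical example: a spec for a "withdraw" operation expressed as
// pre-conditions, post-conditions, and an invariant.

interface Account {
  balance: number; // invariant: balance >= 0 at all times
}

function withdraw(account: Account, amount: number): Account {
  // Pre-conditions: what must be true before the operation
  if (amount <= 0) throw new RangeError("amount must be positive");
  if (amount > account.balance) throw new RangeError("insufficient funds");

  const next: Account = { balance: account.balance - amount };

  // Post-condition: balance decreased by exactly `amount`
  if (next.balance !== account.balance - amount) throw new Error("post-condition violated");
  // Invariant: balance never goes negative
  if (next.balance < 0) throw new Error("invariant violated");

  return next;
}
```

The point of writing the spec this precisely is that "what happens when amount exceeds balance" is decided by Anna, not improvised by a dev agent at implementation time.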
What failure class it catches¶
Ambiguous requirements. LLMs fill ambiguity with plausible assumptions. If the spec says "the user can update their profile," an LLM will decide what "update" means, what fields are updatable, what validation applies, and what happens on conflict. Anna removes this ambiguity before any code is written.
Without this stage, every subsequent stage validates against the LLM's interpretation — not the actual requirements.
What blocks progression¶
- Incomplete acceptance criteria
- Missing error conditions
- Undefined edge cases
- API contract without response schemas
- Specification not yet reviewed and approved by the project lead
Tools¶
- Markdown specification templates
- OpenAPI schema definition
- Zod schema stubs (pre-generated for dev agents)
- Cross-reference check against existing specs (prevents contradictions)
Stage 2: Test-Driven Development¶
Owner: Antje (Test Agent — TDD)
Phase: Pre-implementation
Input: Anna's formal specification
Output: Test suite covering all acceptance criteria, edge cases, and error conditions
What it does¶
Antje writes tests from the specification — not from the code. This is the critical distinction. The tests are written before any implementation exists. They define what "correct" means in executable form.
Test coverage includes:
- Happy path for every acceptance criterion
- Edge cases explicitly listed in the spec
- Error conditions with expected error responses
- Boundary values (min, max, empty, null, overflow)
- State transitions and their constraints
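A sketch of what spec-derived boundary-value tests look like. The spec rule and function name are hypothetical, and a reference implementation is included only so the cases run; in the real flow Antje's test table exists before any implementation does.

```typescript
// Hypothetical spec rule: "page size is clamped to the range 1..100;
// a missing value defaults to 20." The cases below come from that sentence,
// not from any implementation.

function clampPageSize(requested?: number): number {
  if (requested === undefined || Number.isNaN(requested)) return 20; // default
  return Math.min(100, Math.max(1, Math.floor(requested)));
}

// Boundary values straight from the spec: empty, below min, min, max, overflow.
const cases: Array<[number | undefined, number]> = [
  [undefined, 20], // empty -> default
  [0, 1],          // below min -> clamped to min
  [1, 1],          // min boundary
  [100, 100],      // max boundary
  [101, 100],      // above max -> clamped to max
];
for (const [input, expected] of cases) {
  if (clampPageSize(input) !== expected) {
    throw new Error(`clampPageSize(${input}) should be ${expected}`);
  }
}
```

Note that the cases assert behavior only — nothing about how clamping is implemented — which is exactly the "behavior, not structure" rule that blocks progression below.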
What failure class it catches¶
Test and code from the same brain. When the same agent writes both tests and code, the tests validate the agent's understanding — not the requirements. Antje has never seen the implementation. Her tests come exclusively from Anna's spec. If the dev agent misinterprets the spec, Antje's tests catch it.
What blocks progression¶
- Test suite does not cover all acceptance criteria in the spec
- Missing edge case tests for enumerated edge cases
- Missing error condition tests
- Tests reference implementation details (they must not — they test behavior, not structure)
Tools¶
- Vitest / Jest for unit tests
- Playwright / Testing Library for UI tests
- Supertest for API integration tests
- Coverage reporting (branch coverage, not just line coverage)
Stage 3: Implementation¶
Owner: Dev agents (Urszula/Maxim team leads, team developers)
Phase: Implementation
Input: Anna's specification + Antje's test suite
Output: Implementation code that passes all of Antje's tests
What it does¶
Dev agents write the implementation to pass Antje's tests. They receive:
- The formal specification (what to build)
- The test suite (how to know it is correct)
- Existing codebase context (where it fits)
Their job is to make the tests green without modifying the tests. If a test seems wrong, the dev agent raises a discussion — they do not change the test.
What failure class it catches¶
This stage does not catch failure classes — it produces code. All subsequent stages exist because this stage's output cannot be trusted on its own.
What blocks progression¶
- Any of Antje's tests failing
- New files not following codebase standards
- Missing type annotations (TypeScript strict mode)
Tools¶
- Claude Code / Codex / Gemini (per agent provider config)
- TypeScript compiler (strict mode)
- ESLint with GE configuration
- Prettier for formatting
Stage 4: Deterministic Quality¶
Owner: Koen (Code Quality Automation)
Phase: Post-implementation
Input: Implementation code from dev agents
Output: Quality report with pass/fail per check
What it does¶
Koen runs deterministic checks that do not require LLM judgment. These are binary — pass or fail, no interpretation:
- Type checking: `tsc --noEmit` (TypeScript strict)
- Linting: ESLint with GE rules (no warnings allowed)
- Formatting: Prettier check (no format drift)
- Dependency audit: `npm audit` (no critical/high vulnerabilities)
- Bundle analysis: size limits per route
- Dead code detection: unused exports, unreachable branches
- Import validation: no circular dependencies, no banned imports
- Spec drift: OpenAPI spec matches implementation schemas
- Naming conventions: snake_case in API, camelCase in code
- File allocation: code in correct directories per CODEBASE-STANDARDS.md
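To illustrate what "binary, no interpretation" means in practice, here is a sketch of one such check — an API field-name convention scan. This is a hypothetical illustration, not one of the actual GE lint rules; a real rule would read field names from the OpenAPI schemas.

```typescript
// Sketch of one deterministic check: API field names must be snake_case.
// The verdict is binary — a field either matches the pattern or it does not.

const SNAKE_CASE = /^[a-z][a-z0-9]*(_[a-z0-9]+)*$/;

function checkApiFieldNames(fields: string[]): { pass: boolean; violations: string[] } {
  const violations = fields.filter((f) => !SNAKE_CASE.test(f));
  return { pass: violations.length === 0, violations };
}

// "userId" fails (camelCase in an API payload); "user_id" and "created_at" pass.
const report = checkApiFieldNames(["user_id", "created_at", "userId"]);
```

Because the rule is a regex, there is nothing for Koen to weigh: the report either lists violations or it does not.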
What failure class it catches¶
Confident hallucination (structural). LLMs confidently import packages that do not exist, use types that are not exported, create circular dependencies, and produce code that looks correct but does not compile in the full project context. Deterministic tools catch all of these without needing to understand the code's purpose.
What blocks progression¶
- Any check failing
- Koen does not apply judgment — if the tool says fail, it fails
- No manual overrides without human approval
Tools¶
- TypeScript compiler
- ESLint + Prettier
- npm audit
- Custom GE lint rules (spec drift, file allocation, naming)
- Bundlephobia for dependency size checks
Stage 5: Integration Testing¶
Owner: Marije (Testing Lead, Alfa) / Judith (Testing Lead, Bravo)
Phase: Post-implementation
Input: Implementation code that passed Stage 4
Output: Integration test results — pass/fail with failure details
What it does¶
Marije and Judith run the code in the context of the full system. Where Antje's tests verify behavior in isolation, integration tests verify behavior when connected to real databases, real API endpoints, real message queues, and real file systems.
Integration test scope:
- API integration: Full request/response cycle through the stack
- Database integration: Migrations, queries, transactions, rollbacks
- Service integration: Cross-service communication via Redis Streams
- Authentication flow: Full auth cycle including token refresh
- Data flow: End-to-end data transformation across pipeline stages
- Concurrency: Multiple agents accessing the same resource
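The concurrency item deserves a concrete shape. The sketch below is a deliberately simplified, deterministic illustration (the store and version scheme are hypothetical, not GE's Redis layer): two agents perform a read-modify-write on the same resource, and an optimistic version check rejects the stale write that would otherwise silently lose an update.

```typescript
// Minimal sketch of the lost-update problem an integration test must provoke.

class Store {
  value = 0;
  version = 0;
  read() { return { value: this.value, version: this.version }; }
  // Optimistic write: rejected if someone else wrote since our read.
  write(next: number, expectedVersion: number): boolean {
    if (expectedVersion !== this.version) return false;
    this.value = next;
    this.version++;
    return true;
  }
}

// The classic lost-update schedule: both agents read before either writes.
const store = new Store();
const a = store.read();
const b = store.read();
store.write(a.value + 1, a.version);             // agent A's increment lands
const bOk = store.write(b.value + 1, b.version); // agent B is rejected: stale version
// Without the version check, B would overwrite A and the count would end at 1
// while both agents believe they incremented it.
```

A unit test with a mocked store never exercises this schedule; only a test against the real shared resource does.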
What failure class it catches¶
Works in isolation, breaks in integration. LLMs generate code in a bounded context. The code works perfectly in a unit test that mocks all dependencies. It breaks when the mocked behavior does not match the real dependency. Integration tests use real dependencies — if it passes here, it works with the system.
Specific catches:
- SQL that works in SQLite but fails in PostgreSQL
- Redis commands that assume a data structure that changed
- API calls that use the wrong authentication method
- Timing-dependent behavior (race conditions, timeouts)
- File paths that work on the dev machine but not in the container
What blocks progression¶
- Any integration test failing
- Test environment not matching production topology
- Missing teardown (tests that leave dirty state)
Tools¶
- Vitest with real database (PostgreSQL test instance)
- Redis test instance (port 6381, matching production)
- k3s test namespace for container-level integration
- Docker Compose for local multi-service testing
Stage 6: Test Reconciliation¶
Owner: Jasper (Test Reconciliation Analyst)
Phase: Post-testing
Input: Test results from Antje (Stage 2), Marije/Judith (Stage 5), and dev agent local tests
Output: Reconciliation report — discrepancies flagged
What it does¶
Jasper compares test results across stages and looks for inconsistencies:
- Tests that pass in isolation but fail in integration (or vice versa)
- Test coverage gaps — code paths exercised by no test suite
- Conflicting assertions — two tests asserting different behavior for the same input
- Flaky tests — tests that sometimes pass and sometimes fail
- Mock/reality drift — mocked behavior that diverges from real behavior
- Coverage regression — branches covered in the previous version but not in this version
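The core mechanism — cross-referencing verdicts from independent suites — can be sketched in a few lines. The report shape is hypothetical; Jasper's actual reconciliation engine consumes full test reports, not flat verdict maps.

```typescript
// Sketch: compare per-test verdicts from two independent suites and flag
// any test name where they disagree (passes in one, fails in the other).

type Verdict = "pass" | "fail";
type Report = Record<string, Verdict>; // test name -> verdict

function reconcile(unit: Report, integration: Report): string[] {
  const discrepancies: string[] = [];
  for (const name of Object.keys(unit)) {
    if (name in integration && unit[name] !== integration[name]) {
      discrepancies.push(name);
    }
  }
  return discrepancies;
}

const flagged = reconcile(
  { "sorts ascending": "pass", "rejects empty input": "pass" },
  { "sorts ascending": "fail", "rejects empty input": "pass" },
);
// "sorts ascending" is flagged: the mocked dependency and the real one disagree,
// which is exactly the mock/reality drift case described above.
```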
What failure class it catches¶
Plausible but wrong (hidden by passing tests). Tests can pass for the wrong reason. A test might assert that a function returns an array, but not check the array contents. A mock might return the expected value regardless of input. Jasper catches these discrepancies by cross-referencing test results across independent test suites.
What blocks progression¶
- Unresolved discrepancies between test suites
- Coverage regression without explicit approval
- Flaky tests (must be fixed or quarantined)
- Mock behavior that diverges from integration test observations
Tools¶
- Custom reconciliation engine (compares test reports)
- Coverage diff tool (compares branch coverage across runs)
- Flaky test detector (tracks pass/fail across multiple runs)
Stage 7: Conflict Detection¶
Owner: Marco (Conflict Detection Agent)
Phase: Post-testing
Input: The change set (PR diff) in context of all concurrent changes
Output: Conflict report — semantic conflicts flagged
What it does¶
Marco detects conflicts that git cannot see. Git detects textual conflicts — two people editing the same line. Marco detects semantic conflicts — two changes that do not touch the same line but break each other:
- Interface conflicts: Agent A adds a required parameter to a shared function. Agent B calls that function without the new parameter. No git conflict. Runtime error.
- State conflicts: Agent A changes the database schema. Agent B writes queries against the old schema. No git conflict. SQL error.
- Behavioral conflicts: Agent A changes the sort order of a list API. Agent B's test depends on the old sort order. No git conflict. Test failure.
- Resource conflicts: Agent A allocates port 3001. Agent B also allocates port 3001. No git conflict. Port collision.
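The interface-conflict case can be reduced to a small check. This is a simplified sketch with hypothetical names (`sendEmail`, the signature/call-site shapes); Marco's real analysis works on AST-level diffs and a dependency graph rather than hand-built records.

```typescript
// Sketch: agent A's PR raises a function's required arity; agent B's
// concurrent PR still calls it with the old arity. No line overlaps, so
// git sees no conflict — but the merged result fails at runtime.

interface Signature { name: string; requiredParams: number; }
interface CallSite { callee: string; argCount: number; }

function findInterfaceConflicts(newSigs: Signature[], concurrentCalls: CallSite[]): string[] {
  const byName = new Map(newSigs.map((s) => [s.name, s] as const));
  const conflicts: string[] = [];
  for (const call of concurrentCalls) {
    const sig = byName.get(call.callee);
    if (sig && call.argCount < sig.requiredParams) {
      conflicts.push(`${call.callee}: called with ${call.argCount} args, now requires ${sig.requiredParams}`);
    }
  }
  return conflicts;
}

// Agent A changes sendEmail(to, subject) -> sendEmail(to, subject, locale).
// Agent B's open PR still calls it with two arguments.
const conflicts = findInterfaceConflicts(
  [{ name: "sendEmail", requiredParams: 3 }],
  [{ callee: "sendEmail", argCount: 2 }],
);
```

The key design point is the input: the check runs against *all open PRs*, not just the one under review, which is why this stage and no other can see the collision.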
What failure class it catches¶
Works in isolation, breaks in integration (concurrent). LLMs generate code without awareness of what other agents are building simultaneously. Each agent's code works. Together, they conflict. Marco is the only stage that considers concurrent changes.
What blocks progression¶
- Unresolved semantic conflicts
- Interface changes without downstream update plan
- Schema changes without migration compatibility check
Tools¶
- AST-level diff analysis (not textual — structural)
- Dependency graph traversal (traces impact of changes)
- Concurrent PR awareness (checks all open PRs, not just this one)
- Database schema comparison (current vs proposed)
Stage 8: Adversarial Testing¶
Owner: Ashley (Adversarial Agent — Chaos Monkey)
Phase: Post-testing
Input: Implementation code — Ashley receives NO codebase context
Output: Attack report — vulnerabilities and failures found
What it does¶
Ashley attacks the code with zero prior knowledge of the codebase. This is deliberate. Ashley does not read the implementation, does not read the tests, does not read the spec. Ashley sees only the public interface (API endpoints, UI forms, CLI commands) and tries to break it.
Attack categories:
- Input fuzzing: Invalid types, oversized payloads, Unicode edge cases, null bytes, SQL injection, XSS payloads, path traversal
- State manipulation: Out-of-order operations, repeated submissions, concurrent modifications, expired tokens, revoked permissions
- Resource exhaustion: Large uploads, many concurrent requests, deep pagination, expensive queries
- Authentication bypass: Token manipulation, role escalation, IDOR (insecure direct object reference), CSRF
- Error provocation: Network timeouts, database unavailability, disk full, out of memory
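A minimal sketch of the input-fuzzing category, with an assumed handler (`handleUsernameUpdate` is hypothetical, and a real harness would hit the HTTP interface, not a function): every hostile input must yield a structured error response, and an uncaught throw counts as a failed gate.

```typescript
// Sketch: feed hostile inputs to a public-interface handler and require a
// proper error response (4xx) for every one — never a crash or a 5xx.

type Response = { status: number; body: string };

function handleUsernameUpdate(input: unknown): Response {
  if (typeof input !== "string") return { status: 400, body: "expected string" };
  if (input.length === 0 || input.length > 64) return { status: 400, body: "bad length" };
  if (!/^[a-z0-9_]+$/i.test(input)) return { status: 400, body: "bad characters" };
  return { status: 200, body: "ok" };
}

const hostileInputs: unknown[] = [
  null, undefined, 12345, {}, [],
  "", "a".repeat(10_000),                 // empty and oversized
  "Robert'); DROP TABLE users;--",        // SQL-injection shape
  "<script>alert(1)</script>",            // XSS shape
  "../../etc/passwd", "user\u0000name",   // path traversal, null byte
];

const crashes = hostileInputs.filter((input) => {
  try { return handleUsernameUpdate(input).status >= 500; }
  catch { return true; } // an uncaught throw is a crash — blocks progression
});
```

Note that nothing in the harness depends on how the handler is implemented — consistent with Ashley's zero-codebase-knowledge rule below.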
What failure class it catches¶
Works for the happy path. LLMs generate code trained on examples that demonstrate success. Error handling is an afterthought. Ashley never takes the happy path. Every request is designed to trigger failure modes the dev agent did not consider.
Confident hallucination (security). LLMs confidently implement "authentication" that checks a token but not its expiry, "authorization" that checks a role but not the resource owner, "validation" that checks the type but not the range. Ashley finds these gaps.
Why zero codebase knowledge matters¶
If Ashley read the code, Ashley would test what the code does. By not reading the code, Ashley tests what the code should do — from an attacker's perspective. This catches failures that code-aware testing misses: the gap between what the implementation handles and what an attacker can actually send.
What blocks progression¶
- Critical vulnerabilities (auth bypass, data leak, injection)
- Unhandled crash on any input (must return proper error response)
- Resource exhaustion without rate limiting
Tools¶
- Custom fuzzing harness
- OWASP ZAP for automated security scanning
- Burp Suite patterns for manual-style attacks
- Load testing tools (k6, artillery) for resource exhaustion
- No access to source code (by design)
Stage 9: SSOT Enforcement¶
Owner: Jaap (SSOT Enforcer)
Phase: Pre-merge
Input: Full change set after all testing stages
Output: SSOT compliance report — pass/fail
What it does¶
Jaap verifies that the code matches every declared source of truth in the system:
- OpenAPI spec: Does the implementation match the API contract?
- Database schema: Do queries match the current schema?
- Configuration: Are values read from config, not hardcoded?
- Agent registry: Are agent names, streams, and roles correct?
- File allocation: Is code in the right directory per standards?
- Naming conventions: Do names follow GE conventions?
- Constitution: Does the change comply with the 10 principles?
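To make the "read from config, not hardcoded" check concrete, here is a deliberately crude sketch. The config values are hypothetical, and a real enforcer would walk the AST and consult the config authority map; a plain string scan like this over-flags but shows the rule.

```typescript
// Sketch: flag code that duplicates a literal whose source of truth is config.

const config = { redisPort: 6381, maxUploadMb: 25 }; // hypothetical config SSOT

function findHardcodedConfigValues(source: string): string[] {
  return Object.entries(config)
    .filter(([, value]) => source.includes(String(value)))
    .map(([key]) => key); // report which config key was duplicated inline
}

// This snippet hardcodes 6381 instead of reading config.redisPort.
const offending = findHardcodedConfigValues(
  "const client = connect('redis://localhost:6381')",
);
```

The check compares against the *declared* source of truth, not against what "looks right" — which is precisely why it catches pattern mimicry.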
What failure class it catches¶
Pattern mimicry without understanding. LLMs copy patterns from their training data. They hardcode values that should come from config. They put files in directories that "look right" but violate GE file allocation rules. They use naming conventions from other codebases. Jaap catches all of these by checking against the actual declared sources of truth — not against patterns.
What blocks progression¶
- Any SSOT violation
- Hardcoded values that should come from config
- Code in wrong directory
- API implementation that does not match spec
- Configuration that contradicts config authority map
Tools¶
- `verify_ssot.sh` — automated SSOT verification script
- Config authority map comparison
- OpenAPI spec validator
- File allocation checker against CODEBASE-STANDARDS.md
- Agent registry cross-reference
Stage 10: Merge Gate¶
Owner: Marta (Change Intelligence Engineer, Alfa) / Iwona (Change Intelligence Engineer, Bravo)
Phase: Merge decision
Input: All reports from Stages 1-9
Output: Merge approval or rejection with reasoning
What it does¶
Marta and Iwona make the final merge decision. They review all stage reports and apply judgment that no individual stage can:
- Holistic assessment: Do all stages agree? Are there contradictions between stage reports?
- Risk evaluation: What is the blast radius of this change? High-risk changes (auth, payments, data model) get extra scrutiny.
- Trend analysis: Is this agent's code quality improving or degrading over time? Recurring issues trigger escalation.
- Dependency check: Does this change depend on other changes that have not merged yet?
- Rollback readiness: If this change breaks production, can it be reverted cleanly?
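The risk-evaluation step can be partly mechanized before judgment is applied. The sketch below is a hypothetical blast-radius score — the paths, weights, and threshold are illustrative, not GE's actual risk model, which also feeds in agent quality trends and stage-report contradictions.

```typescript
// Sketch: weight changed files by how risky their area is; above a threshold,
// the change requires human review regardless of stage results.

const RISK_WEIGHTS: Array<[RegExp, number]> = [
  [/^src\/auth\//, 5],      // auth: high blast radius
  [/^src\/payments\//, 5],  // payments: high blast radius
  [/^migrations\//, 4],     // data model changes
  [/^src\//, 1],            // everything else in source
];

function riskScore(changedFiles: string[]): number {
  return changedFiles.reduce((score, file) => {
    const match = RISK_WEIGHTS.find(([pattern]) => pattern.test(file));
    return score + (match ? match[1] : 0);
  }, 0);
}

function needsHumanReview(changedFiles: string[], threshold = 5): boolean {
  return riskScore(changedFiles) >= threshold;
}

// A UI tweak scores low; touching auth plus a migration crosses the threshold.
const low = needsHumanReview(["src/ui/Button.tsx"]);
const high = needsHumanReview(["src/auth/session.ts", "migrations/0042_add_roles.sql"]);
```

Scoring only routes the decision; the approval or rejection itself stays with Marta and Iwona.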
What failure class it catches¶
Pipeline theater. Individual stages can rubber-stamp. A test suite can have 100% coverage but test nothing meaningful. A security scan can pass because it checked the wrong endpoints. Marta and Iwona look at the full picture and catch cases where individual stages passed but the overall quality is insufficient.
What blocks progression¶
- Any critical issue from any stage unresolved
- Missing stage reports (all 10 must complete)
- Rollback procedure not documented for high-risk changes
- Dependency on unmerged changes without explicit tracking
- Human review required for critical-complexity changes
Tools¶
- Stage report aggregation dashboard
- Change risk scoring model
- Agent quality trend tracker
- Dependency graph for concurrent changes
- Rollback procedure template
Stage Interaction Diagram¶
┌─────────────┐
│ Stage 1 │ Anna: Formal Spec
│ SPEC │ → Catches: ambiguity
└──────┬──────┘
│
┌──────▼──────┐
│ Stage 2 │ Antje: TDD (tests from spec)
│ TESTS │ → Catches: same-brain tests
└──────┬──────┘
│
┌──────▼──────┐
│ Stage 3 │ Dev agents: Implementation
│ CODE │ → Produces code (not a gate)
└──────┬──────┘
│
┌──────▼──────┐
│ Stage 4 │ Koen: Deterministic quality
│ LINT │ → Catches: hallucinated imports, types
└──────┬──────┘
│
┌──────▼──────┐
│ Stage 5 │ Marije/Judith: Integration tests
│ INTEGRATE │ → Catches: isolation-only code
└──────┬──────┘
│
┌──────▼──────┐
│ Stage 6 │ Jasper: Reconciliation
│ RECONCILE │ → Catches: hidden test gaps
└──────┬──────┘
│
┌──────▼──────┐
│ Stage 7 │ Marco: Conflict detection
│ CONFLICTS │ → Catches: concurrent semantic breaks
└──────┬──────┘
│
┌──────▼──────┐
│ Stage 8 │ Ashley: Adversarial testing
│ ATTACK │ → Catches: happy-path-only code
└──────┬──────┘
│
┌──────▼──────┐
│ Stage 9 │ Jaap: SSOT enforcement
│ SSOT │ → Catches: pattern mimicry
└──────┬──────┘
│
┌──────▼──────┐
│ Stage 10 │ Marta/Iwona: Merge gate
│ MERGE │ → Catches: pipeline theater
└─────────────┘
Ownership¶
| Stage | Agent | LLM Failure Class |
|---|---|---|
| 1. Formal Spec | Anna | Ambiguous requirements |
| 2. TDD | Antje | Same-brain tests |
| 3. Implementation | Dev agents | (producer, not gate) |
| 4. Deterministic | Koen | Hallucinated structure |
| 5. Integration | Marije / Judith | Isolation-only code |
| 6. Reconciliation | Jasper | Hidden test gaps |
| 7. Conflict | Marco | Concurrent semantic breaks |
| 8. Adversarial | Ashley | Happy-path-only code |
| 9. SSOT | Jaap | Pattern mimicry |
| 10. Merge Gate | Marta / Iwona | Pipeline theater |