
Pipeline Stages

The anti-LLM pipeline has 10 stages. For every stage, this page covers: what it does, what failure class it catches, who owns it, what tools it uses, what its output is, and what blocks progression.


Stage 1: Formal Specification

Owner: Anna (Formal Specification Agent)
Phase: Pre-implementation
Input: Functional specification from Aimee, technical design from project lead
Output: Formal specification document with acceptance criteria, API contracts, data model constraints, and behavioral rules

What it does

Anna translates human-readable requirements into precise, testable specifications. Every behavior is defined with:

  • Pre-conditions (what must be true before the operation)
  • Post-conditions (what must be true after the operation)
  • Invariants (what must always be true)
  • Edge cases explicitly enumerated
  • Error conditions with expected responses
  • API contracts in OpenAPI format
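
The pre/post-condition discipline can be made executable. A minimal TypeScript sketch for the hypothetical "the user can update their profile" operation discussed below (the Profile fields and limits are illustrative, not from a real spec):

```typescript
// Hypothetical spec clauses made executable. Names (Profile, preCondition)
// and constraints (1-64 chars, 500 chars) are illustrative only.
interface Profile {
  displayName: string;
  bio: string;
  email: string; // spec: not updatable via this operation
}

type ProfileUpdate = Partial<Pick<Profile, "displayName" | "bio">>;

// Pre-condition: the update touches at least one updatable field
// and every supplied value satisfies its declared constraint.
function preCondition(update: ProfileUpdate): boolean {
  if (Object.keys(update).length === 0) return false;
  if (update.displayName !== undefined &&
      (update.displayName.length < 1 || update.displayName.length > 64)) return false;
  if (update.bio !== undefined && update.bio.length > 500) return false;
  return true;
}

// Post-condition: fields the spec declares non-updatable are unchanged.
function postCondition(before: Profile, after: Profile): boolean {
  return before.email === after.email;
}
```

Encoding the clauses as predicates means Antje can test them and dev agents cannot reinterpret them.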

What failure class it catches

Ambiguous requirements. LLMs fill ambiguity with plausible assumptions. If the spec says "the user can update their profile," an LLM will decide what "update" means, what fields are updatable, what validation applies, and what happens on conflict. Anna removes this ambiguity before any code is written.

Without this stage, every subsequent stage validates against the LLM's interpretation — not the actual requirements.

What blocks progression

  • Incomplete acceptance criteria
  • Missing error conditions
  • Undefined edge cases
  • API contract without response schemas
  • Specification not yet reviewed and approved by project lead

Tools

  • Markdown specification templates
  • OpenAPI schema definition
  • Zod schema stubs (pre-generated for dev agents)
  • Cross-reference check against existing specs (prevents contradictions)

Stage 2: Test-Driven Development

Owner: Antje (Test Agent — TDD)
Phase: Pre-implementation
Input: Anna's formal specification
Output: Test suite covering all acceptance criteria, edge cases, and error conditions

What it does

Antje writes tests from the specification — not from the code. This is the critical distinction. The tests are written before any implementation exists. They define what "correct" means in executable form.

Test coverage includes:

  • Happy path for every acceptance criterion
  • Edge cases explicitly listed in the spec
  • Error conditions with expected error responses
  • Boundary values (min, max, empty, null, overflow)
  • State transitions and their constraints
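
Boundary-value cases derived purely from a spec clause can be sketched like this (the "displayName is 1-64 characters" constraint and all names are hypothetical). Note that the cases describe accept/reject behavior only and treat any future implementation as a black box:

```typescript
// Hypothetical: test cases written before any implementation exists,
// derived only from the spec constraint "displayName is 1-64 characters".
interface Case { input: string; accept: boolean; }

const displayNameCases: Case[] = [
  { input: "", accept: false },             // empty (below min)
  { input: "a", accept: true },             // min boundary
  { input: "a".repeat(64), accept: true },  // max boundary
  { input: "a".repeat(65), accept: false }  // overflow
];

// Any future implementation is plugged in as a black box:
function run(validate: (s: string) => boolean): boolean {
  return displayNameCases.every(c => validate(c.input) === c.accept);
}
```

Because the cases never reference structure, a dev agent cannot satisfy them by matching the test's internals.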

What failure class it catches

Test and code from the same brain. When the same agent writes both tests and code, the tests validate the agent's understanding — not the requirements. Antje has never seen the implementation. Her tests come exclusively from Anna's spec. If the dev agent misinterprets the spec, Antje's tests catch it.

What blocks progression

  • Test suite does not cover all acceptance criteria in the spec
  • Missing edge case tests for enumerated edge cases
  • Missing error condition tests
  • Tests reference implementation details (they must not — they test behavior, not structure)

Tools

  • Vitest / Jest for unit tests
  • Playwright / Testing Library for UI tests
  • Supertest for API integration tests
  • Coverage reporting (branch coverage, not just line coverage)

Stage 3: Implementation

Owner: Dev agents (Urszula/Maxim team leads, team developers)
Phase: Implementation
Input: Anna's specification + Antje's test suite
Output: Implementation code that passes all of Antje's tests

What it does

Dev agents write the implementation to pass Antje's tests. They receive:

  1. The formal specification (what to build)
  2. The test suite (how to know it is correct)
  3. Existing codebase context (where it fits)

Their job is to make the tests green without modifying the tests. If a test seems wrong, the dev agent raises a discussion — they do not change the test.

What failure class it catches

This stage does not catch failure classes — it produces code. All subsequent stages exist because this stage's output cannot be trusted on its own.

What blocks progression

  • Any of Antje's tests failing
  • New files not following codebase standards
  • Missing type annotations (TypeScript strict mode)

Tools

  • Claude Code / Codex / Gemini (per agent provider config)
  • TypeScript compiler (strict mode)
  • ESLint with GE configuration
  • Prettier for formatting

Stage 4: Deterministic Quality

Owner: Koen (Code Quality Automation)
Phase: Post-implementation
Input: Implementation code from dev agents
Output: Quality report with pass/fail per check

What it does

Koen runs deterministic checks that do not require LLM judgment. These are binary — pass or fail, no interpretation:

  • Type checking: tsc --noEmit (TypeScript strict)
  • Linting: ESLint with GE rules (no warnings allowed)
  • Formatting: Prettier check (no format drift)
  • Dependency audit: npm audit (no critical/high vulnerabilities)
  • Bundle analysis: Size limits per route
  • Dead code detection: Unused exports, unreachable branches
  • Import validation: No circular dependencies, no banned imports
  • Spec drift: OpenAPI spec matches implementation schemas
  • Naming conventions: snake_case in API, camelCase in code
  • File allocation: Code in correct directories per CODEBASE-STANDARDS.md
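
One of the deterministic checks above, naming conventions, can be sketched as a pure pass/fail function (a simplified illustration, not Koen's actual rule set):

```typescript
// Hypothetical sketch of one binary check: API field names must be
// snake_case. Pass or fail, no interpretation.
const SNAKE_CASE = /^[a-z][a-z0-9]*(_[a-z0-9]+)*$/;

function checkApiFieldNames(fields: string[]): { pass: boolean; violations: string[] } {
  const violations = fields.filter(f => !SNAKE_CASE.test(f));
  return { pass: violations.length === 0, violations };
}
```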

What failure class it catches

Confident hallucination (structural). LLMs confidently import packages that do not exist, use types that are not exported, create circular dependencies, and produce code that looks correct but does not compile in the full project context. Deterministic tools catch all of these without needing to understand the code's purpose.

What blocks progression

  • Any check failing
  • Koen does not apply judgment — if the tool says fail, it fails
  • No manual overrides without human approval

Tools

  • TypeScript compiler
  • ESLint + Prettier
  • npm audit
  • Custom GE lint rules (spec drift, file allocation, naming)
  • Bundlephobia for dependency size checks

Stage 5: Integration Testing

Owner: Marije (Testing Lead, Alfa) / Judith (Testing Lead, Bravo)
Phase: Post-implementation
Input: Implementation code that passed Stage 4
Output: Integration test results — pass/fail with failure details

What it does

Marije and Judith run the code in the context of the full system. Where Antje's tests verify behavior in isolation, integration tests verify behavior when connected to real databases, real API endpoints, real message queues, and real file systems.

Integration test scope:

  • API integration: Full request/response cycle through the stack
  • Database integration: Migrations, queries, transactions, rollbacks
  • Service integration: Cross-service communication via Redis Streams
  • Authentication flow: Full auth cycle including token refresh
  • Data flow: End-to-end data transformation across pipeline stages
  • Concurrency: Multiple agents accessing the same resource

What failure class it catches

Works in isolation, breaks in integration. LLMs generate code in a bounded context. The code works perfectly in a unit test that mocks all dependencies. It breaks when the mocked behavior does not match the real dependency. Integration tests use real dependencies — if it passes here, it works with the system.

Specific catches:

  • SQL that works in SQLite but fails in PostgreSQL
  • Redis commands that assume a data structure that changed
  • API calls that use the wrong authentication method
  • Timing-dependent behavior (race conditions, timeouts)
  • File paths that work on the dev machine but not in the container
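
The race-condition catch is typically a lost update: two agents doing read-modify-write on the same resource. A unit test with one caller never hits this; an integration test with concurrent callers does. A minimal synchronous sketch of the bad interleaving (names and numbers are illustrative):

```typescript
// Hypothetical: two agents decrement shared stock without locking.
let stock = 10;

// Interleaving: both agents read before either writes.
const readA = stock;   // agent A reads 10
const readB = stock;   // agent B reads 10 (A has not written yet)
stock = readA - 3;     // A writes 7
stock = readB - 3;     // B writes 7 — A's decrement is lost

// stock is now 7, not the correct 4
```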

What blocks progression

  • Any integration test failing
  • Test environment not matching production topology
  • Missing teardown (tests that leave dirty state)

Tools

  • Vitest with real database (PostgreSQL test instance)
  • Redis test instance (port 6381, matching production)
  • k3s test namespace for container-level integration
  • Docker Compose for local multi-service testing

Stage 6: Test Reconciliation

Owner: Jasper (Test Reconciliation Analyst)
Phase: Post-testing
Input: Test results from Antje (Stage 2), Marije/Judith (Stage 5), and dev agent local tests
Output: Reconciliation report — discrepancies flagged

What it does

Jasper compares test results across stages and looks for inconsistencies:

  • Tests that pass in isolation but fail in integration (or vice versa)
  • Test coverage gaps — code paths exercised by no test suite
  • Conflicting assertions — two tests asserting different behavior for the same input
  • Flaky tests — tests that sometimes pass and sometimes fail
  • Mock/reality drift — mocked behavior that diverges from real behavior
  • Coverage regression — branches covered in the previous version but not in this version
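
The cross-suite comparison can be sketched as a pure function over two test reports (a simplified model of the reconciliation engine, not its real interface):

```typescript
// Hypothetical: a report maps test name to result. The same test
// passing in one suite and failing in another is a discrepancy.
type Report = Record<string, "pass" | "fail">;

function discrepancies(unit: Report, integration: Report): string[] {
  return Object.keys(unit).filter(
    name => name in integration && unit[name] !== integration[name]
  );
}
```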

What failure class it catches

Plausible but wrong (hidden by passing tests). Tests can pass for the wrong reason. A test might assert that a function returns an array, but not check the array contents. A mock might return the expected value regardless of input. Jasper catches these discrepancies by cross-referencing test results across independent test suites.

What blocks progression

  • Unresolved discrepancies between test suites
  • Coverage regression without explicit approval
  • Flaky tests (must be fixed or quarantined)
  • Mock behavior that diverges from integration test observations

Tools

  • Custom reconciliation engine (compares test reports)
  • Coverage diff tool (compares branch coverage across runs)
  • Flaky test detector (tracks pass/fail across multiple runs)

Stage 7: Conflict Detection

Owner: Marco (Conflict Detection Agent)
Phase: Post-testing
Input: The change set (PR diff) in context of all concurrent changes
Output: Conflict report — semantic conflicts flagged

What it does

Marco detects conflicts that git cannot see. Git detects textual conflicts — two people editing the same line. Marco detects semantic conflicts — two changes that do not touch the same line but break each other:

  • Interface conflicts: Agent A adds a required parameter to a shared function. Agent B calls that function without the new parameter. No git conflict. Runtime error.
  • State conflicts: Agent A changes the database schema. Agent B writes queries against the old schema. No git conflict. SQL error.
  • Behavioral conflicts: Agent A changes the sort order of a list API. Agent B's test depends on the old sort order. No git conflict. Test failure.
  • Resource conflicts: Agent A allocates port 3001. Agent B also allocates port 3001. No git conflict. Port collision.
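
The resource-conflict case can be sketched as a check over the declared allocations of all open PRs (the OpenPr shape is hypothetical; the real analysis works at the AST and dependency-graph level):

```typescript
// Hypothetical: each open PR declares the ports it allocates. A collision
// across PRs is flagged even though no two PRs touch the same line.
interface OpenPr { id: string; ports: number[]; }

function portConflicts(prs: OpenPr[]): Array<{ port: number; prs: string[] }> {
  const byPort = new Map<number, string[]>();
  for (const pr of prs)
    for (const port of pr.ports)
      byPort.set(port, [...(byPort.get(port) ?? []), pr.id]);
  return [...byPort.entries()]
    .filter(([, ids]) => ids.length > 1)
    .map(([port, ids]) => ({ port, prs: ids }));
}
```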

What failure class it catches

Works in isolation, breaks in integration (concurrent). LLMs generate code without awareness of what other agents are building simultaneously. Each agent's code works. Together, they conflict. Marco is the only stage that considers concurrent changes.

What blocks progression

  • Unresolved semantic conflicts
  • Interface changes without downstream update plan
  • Schema changes without migration compatibility check

Tools

  • AST-level diff analysis (not textual — structural)
  • Dependency graph traversal (traces impact of changes)
  • Concurrent PR awareness (checks all open PRs, not just this one)
  • Database schema comparison (current vs proposed)

Stage 8: Adversarial Testing

Owner: Ashley (Adversarial Agent — Chaos Monkey)
Phase: Post-testing
Input: Implementation code — Ashley receives NO codebase context
Output: Attack report — vulnerabilities and failures found

What it does

Ashley attacks the code with zero prior knowledge of the codebase. This is deliberate. Ashley does not read the implementation, does not read the tests, does not read the spec. Ashley sees only the public interface (API endpoints, UI forms, CLI commands) and tries to break it.

Attack categories:

  • Input fuzzing: Invalid types, oversized payloads, Unicode edge cases, null bytes, SQL injection, XSS payloads, path traversal
  • State manipulation: Out-of-order operations, repeated submissions, concurrent modifications, expired tokens, revoked permissions
  • Resource exhaustion: Large uploads, many concurrent requests, deep pagination, expensive queries
  • Authentication bypass: Token manipulation, role escalation, IDOR (insecure direct object reference), CSRF
  • Error provocation: Network timeouts, database unavailability, disk full, out of memory
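
The input-fuzzing category can be sketched as a small hostile corpus plus the invariant that the public interface must reject every item without throwing (corpus and handler shape are illustrative, not Ashley's actual harness):

```typescript
// Hypothetical fuzzing sketch: every hostile input must produce a
// structured rejection — never an acceptance, never an unhandled throw.
const hostileInputs: unknown[] = [
  null,
  "A".repeat(1_000_000),         // oversized payload
  "caf\u00e9\u0000",             // unicode edge case + null byte
  "' OR '1'='1",                 // SQL injection shape
  "<script>alert(1)</script>",   // XSS payload
  "../../etc/passwd"             // path traversal
];

function survivesFuzzing(handler: (input: unknown) => { ok: boolean }): boolean {
  return hostileInputs.every(input => {
    try {
      return handler(input).ok === false; // must reject, not accept
    } catch {
      return false; // an unhandled throw is a failure
    }
  });
}
```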

What failure class it catches

Works for the happy path. LLMs generate code trained on examples that demonstrate success. Error handling is an afterthought. Ashley never takes the happy path. Every request is designed to trigger failure modes the dev agent did not consider.

Confident hallucination (security). LLMs confidently implement "authentication" that checks a token but not its expiry, "authorization" that checks a role but not the resource owner, "validation" that checks the type but not the range. Ashley finds these gaps.

Why zero codebase knowledge matters

If Ashley read the code, Ashley would test what the code does. By not reading the code, Ashley tests what the code should do — from an attacker's perspective. This catches the gaps that code-aware testing misses: tests derived from the implementation can only confirm behavior the implementation already has, never expose the behavior it lacks.

What blocks progression

  • Critical vulnerabilities (auth bypass, data leak, injection)
  • Unhandled crash on any input (must return proper error response)
  • Resource exhaustion without rate limiting

Tools

  • Custom fuzzing harness
  • OWASP ZAP for automated security scanning
  • Burp Suite patterns for manual-style attacks
  • Load testing tools (k6, artillery) for resource exhaustion
  • No access to source code (by design)

Stage 9: SSOT Enforcement

Owner: Jaap (SSOT Enforcer)
Phase: Pre-merge
Input: Full change set after all testing stages
Output: SSOT compliance report — pass/fail

What it does

Jaap verifies that the code matches every declared source of truth in the system:

  • OpenAPI spec: Does the implementation match the API contract?
  • Database schema: Do queries match the current schema?
  • Configuration: Are values read from config, not hardcoded?
  • Agent registry: Are agent names, streams, and roles correct?
  • File allocation: Is code in the right directory per standards?
  • Naming conventions: Do names follow GE conventions?
  • Constitution: Does the change comply with the 10 principles?
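
One SSOT check, flagging literals that duplicate values the config authority map owns, can be sketched like this (the owned-value list and the numeric-literal heuristic are illustrative, not the real verify_ssot.sh logic):

```typescript
// Hypothetical: values the config authority map owns (e.g. Redis ports)
// must never be hardcoded in source.
const configOwnedValues = [6379, 6381]; // illustrative, not the real map

function findHardcoded(source: string): number[] {
  const literals = [...source.matchAll(/\b\d{4,5}\b/g)].map(m => Number(m[0]));
  return literals.filter(n => configOwnedValues.includes(n));
}
```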

What failure class it catches

Pattern mimicry without understanding. LLMs copy patterns from their training data. They hardcode values that should come from config. They put files in directories that "look right" but violate GE file allocation rules. They use naming conventions from other codebases. Jaap catches all of these by checking against the actual declared sources of truth — not against patterns.

What blocks progression

  • Any SSOT violation
  • Hardcoded values that should come from config
  • Code in wrong directory
  • API implementation that does not match spec
  • Configuration that contradicts config authority map

Tools

  • verify_ssot.sh — automated SSOT verification script
  • Config authority map comparison
  • OpenAPI spec validator
  • File allocation checker against CODEBASE-STANDARDS.md
  • Agent registry cross-reference

Stage 10: Merge Gate

Owner: Marta (Change Intelligence Engineer, Alfa) / Iwona (Change Intelligence Engineer, Bravo)
Phase: Merge decision
Input: All reports from Stages 1-9
Output: Merge approval or rejection with reasoning

What it does

Marta and Iwona make the final merge decision. They review all stage reports and apply judgment that no individual stage can:

  • Holistic assessment: Do all stages agree? Are there contradictions between stage reports?
  • Risk evaluation: What is the blast radius of this change? High-risk changes (auth, payments, data model) get extra scrutiny.
  • Trend analysis: Is this agent's code quality improving or degrading over time? Recurring issues trigger escalation.
  • Dependency check: Does this change depend on other changes that have not merged yet?
  • Rollback readiness: If this change breaks production, can it be reverted cleanly?
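
The risk evaluation can be sketched as a simple path-weighted score (weights and path patterns are illustrative, not the real scoring model):

```typescript
// Hypothetical blast-radius scoring: changes in high-risk areas
// (auth, payments, data model) score higher and trigger extra scrutiny.
const riskWeights: Array<[RegExp, number]> = [
  [/auth|payment/, 5],      // high blast radius
  [/schema|migration/, 3],  // data model changes
  [/docs\//, 0]             // documentation-only
];

function riskScore(changedPaths: string[]): number {
  return changedPaths.reduce((score, path) => {
    const hit = riskWeights.find(([re]) => re.test(path));
    return score + (hit ? hit[1] : 1); // unknown areas get a baseline of 1
  }, 0);
}
```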

What failure class it catches

Pipeline theater. Individual stages can rubber-stamp. A test suite can have 100% coverage but test nothing meaningful. A security scan can pass because it checked the wrong endpoints. Marta and Iwona look at the full picture and catch cases where individual stages passed but the overall quality is insufficient.

What blocks progression

  • Any critical issue from any stage unresolved
  • Missing stage reports (Stages 1-9 must all have completed)
  • Rollback procedure not documented for high-risk changes
  • Dependency on unmerged changes without explicit tracking
  • Human review required for critical-complexity changes

Tools

  • Stage report aggregation dashboard
  • Change risk scoring model
  • Agent quality trend tracker
  • Dependency graph for concurrent changes
  • Rollback procedure template

Stage Interaction Diagram

┌─────────────┐
│  Stage 1    │  Anna: Formal Spec
│  SPEC       │  → Catches: ambiguity
└──────┬──────┘
┌──────▼──────┐
│  Stage 2    │  Antje: TDD (tests from spec)
│  TESTS      │  → Catches: same-brain tests
└──────┬──────┘
┌──────▼──────┐
│  Stage 3    │  Dev agents: Implementation
│  CODE       │  → Produces code (not a gate)
└──────┬──────┘
┌──────▼──────┐
│  Stage 4    │  Koen: Deterministic quality
│  LINT       │  → Catches: hallucinated imports, types
└──────┬──────┘
┌──────▼──────┐
│  Stage 5    │  Marije/Judith: Integration tests
│  INTEGRATE  │  → Catches: isolation-only code
└──────┬──────┘
┌──────▼──────┐
│  Stage 6    │  Jasper: Reconciliation
│  RECONCILE  │  → Catches: hidden test gaps
└──────┬──────┘
┌──────▼──────┐
│  Stage 7    │  Marco: Conflict detection
│  CONFLICTS  │  → Catches: concurrent semantic breaks
└──────┬──────┘
┌──────▼──────┐
│  Stage 8    │  Ashley: Adversarial testing
│  ATTACK     │  → Catches: happy-path-only code
└──────┬──────┘
┌──────▼──────┐
│  Stage 9    │  Jaap: SSOT enforcement
│  SSOT       │  → Catches: pattern mimicry
└──────┬──────┘
┌──────▼──────┐
│  Stage 10   │  Marta/Iwona: Merge gate
│  MERGE      │  → Catches: pipeline theater
└─────────────┘

Ownership

Stage               Agent             LLM Failure Class
1. Formal Spec      Anna              Ambiguous requirements
2. TDD              Antje             Same-brain tests
3. Implementation   Dev agents        (producer, not gate)
4. Deterministic    Koen              Hallucinated structure
5. Integration      Marije / Judith   Isolation-only code
6. Reconciliation   Jasper            Hidden test gaps
7. Conflict         Marco             Concurrent semantic breaks
8. Adversarial      Ashley            Happy-path-only code
9. SSOT             Jaap              Pattern mimicry
10. Merge Gate      Marta / Iwona     Pipeline theater