Pipeline Stages¶
The anti-LLM pipeline has 10 stages. For every stage: what it does, what failure class it catches, who owns it, what tools it uses, what its output is, and what blocks progression.
Stage 1: Formal Specification¶
Owner: Anna (Formal Specification Agent)
Phase: Pre-implementation
Input: Functional specification from Aimee, technical design from project lead
Output: Formal specification document with acceptance criteria, API contracts, data model constraints, and behavioral rules
What it does¶
Anna translates human-readable requirements into precise, testable specifications. Every behavior is defined with:
- Pre-conditions (what must be true before the operation)
- Post-conditions (what must be true after the operation)
- Invariants (what must always be true)
- Edge cases explicitly enumerated
- Error conditions with expected responses
- API contracts in OpenAPI format
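The pre-condition/post-condition/invariant structure above can be made concrete in code. The following is a minimal sketch, not GE's actual spec format: a hypothetical `withdraw` operation whose contract is expressed as executable checks.

```typescript
// Hypothetical example: a spec for a "withdraw" operation expressed as
// pre-conditions, post-conditions, and an invariant.

interface Account {
  balance: number; // invariant: balance >= 0 at all times
}

function withdraw(account: Account, amount: number): Account {
  // Pre-conditions: what must be true before the operation
  if (amount <= 0) throw new RangeError("amount must be positive");
  if (amount > account.balance) throw new RangeError("insufficient funds");

  const next: Account = { balance: account.balance - amount };

  // Post-condition: balance decreased by exactly `amount`
  if (next.balance !== account.balance - amount) throw new Error("post-condition violated");
  // Invariant: balance never goes negative
  if (next.balance < 0) throw new Error("invariant violated");

  return next;
}
```

The point of writing the spec this precisely is that "what happens when amount exceeds balance" is decided by Anna, not improvised by a dev agent at implementation time.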
What failure class it catches¶
Ambiguous requirements. LLMs fill ambiguity with plausible assumptions. If the spec says "the user can update their profile," an LLM will decide what "update" means, what fields are updatable, what validation applies, and what happens on conflict. Anna removes this ambiguity before any code is written.
Without this stage, every subsequent stage validates against the LLM's interpretation — not the actual requirements.
What blocks progression¶
- Incomplete acceptance criteria
- Missing error conditions
- Undefined edge cases
- API contract without response schemas
- Specification not yet reviewed and approved by the project lead
Tools¶
- Markdown specification templates
- OpenAPI schema definition
- Zod schema stubs (pre-generated for dev agents)
- Cross-reference check against existing specs (prevents contradictions)
Stage 2: Test-Driven Development¶
Owner: Antje (Test Agent — TDD)
Phase: Pre-implementation
Input: Anna's formal specification
Output: Test suite covering all acceptance criteria, edge cases, and error conditions
What it does¶
Antje writes tests from the specification — not from the code. This is the critical distinction. The tests are written before any implementation exists. They define what "correct" means in executable form.
Test coverage includes:
- Happy path for every acceptance criterion
- Edge cases explicitly listed in the spec
- Error conditions with expected error responses
- Boundary values (min, max, empty, null, overflow)
- State transitions and their constraints
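A sketch of what spec-derived boundary-value tests look like. The spec rule and function name are hypothetical, and a reference implementation is included only so the cases run; in the real flow Antje's test table exists before any implementation does.

```typescript
// Hypothetical spec rule: "page size is clamped to the range 1..100;
// a missing value defaults to 20." The cases below come from that sentence,
// not from any implementation.

function clampPageSize(requested?: number): number {
  if (requested === undefined || Number.isNaN(requested)) return 20; // default
  return Math.min(100, Math.max(1, Math.floor(requested)));
}

// Boundary values straight from the spec: empty, below min, min, max, overflow.
const cases: Array<[number | undefined, number]> = [
  [undefined, 20], // empty -> default
  [0, 1],          // below min -> clamped to min
  [1, 1],          // min boundary
  [100, 100],      // max boundary
  [101, 100],      // above max -> clamped to max
];
for (const [input, expected] of cases) {
  if (clampPageSize(input) !== expected) {
    throw new Error(`clampPageSize(${input}) should be ${expected}`);
  }
}
```

Note that the cases assert behavior only — nothing about how clamping is implemented — which is exactly the "behavior, not structure" rule that blocks progression below.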
What failure class it catches¶
Test and code from the same brain. When the same agent writes both tests and code, the tests validate the agent's understanding — not the requirements. Antje has never seen the implementation. Her tests come exclusively from Anna's spec. If the dev agent misinterprets the spec, Antje's tests catch it.
What blocks progression¶
- Test suite does not cover all acceptance criteria in the spec
- Missing edge case tests for enumerated edge cases
- Missing error condition tests
- Tests reference implementation details (they must not — they test behavior, not structure)
Tools¶
- Vitest / Jest for unit tests
- Playwright / Testing Library for UI tests
- Supertest for API integration tests
- Coverage reporting (branch coverage, not just line coverage)
Stage 3: Implementation¶
Owner: Dev agents (Urszula/Maxim team leads, team developers)
Phase: Implementation
Input: Anna's specification + Antje's test suite
Output: Implementation code that passes all of Antje's tests
What it does¶
Dev agents write the implementation to pass Antje's tests. They receive:
- The formal specification (what to build)
- The test suite (how to know it is correct)
- Existing codebase context (where it fits)
Their job is to make the tests green without modifying the tests. If a test seems wrong, the dev agent raises a discussion — they do not change the test.
What failure class it catches¶
This stage does not catch failure classes — it produces code. All subsequent stages exist because this stage's output cannot be trusted on its own.
What blocks progression¶
- Any of Antje's tests failing
- New files not following codebase standards
- Missing type annotations (TypeScript strict mode)
Tools¶
- Claude Code / Codex / Gemini (per agent provider config)
- TypeScript compiler (strict mode)
- ESLint with GE configuration
- Prettier for formatting
Stage 4: Deterministic Quality¶
Owner: Koen (Code Quality Automation)
Phase: Post-implementation
Input: Implementation code from dev agents
Output: Quality report with pass/fail per check
What it does¶
Koen runs deterministic checks that do not require LLM judgment. These are binary — pass or fail, no interpretation:
- Type checking: `tsc --noEmit` (TypeScript strict)
- Linting: ESLint with GE rules (no warnings allowed)
- Formatting: Prettier check (no format drift)
- Dependency audit: `npm audit` (no critical/high vulnerabilities)
- Bundle analysis: size limits per route
- Dead code detection: unused exports, unreachable branches
- Import validation: no circular dependencies, no banned imports
- Spec drift: OpenAPI spec matches implementation schemas
- Naming conventions: snake_case in API, camelCase in code
- File allocation: code in correct directories per CODEBASE-STANDARDS.md
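To illustrate what "binary, no interpretation" means in practice, here is a sketch of one such check — an API field-name convention scan. This is a hypothetical illustration, not one of the actual GE lint rules; a real rule would read field names from the OpenAPI schemas.

```typescript
// Sketch of one deterministic check: API field names must be snake_case.
// The verdict is binary — a field either matches the pattern or it does not.

const SNAKE_CASE = /^[a-z][a-z0-9]*(_[a-z0-9]+)*$/;

function checkApiFieldNames(fields: string[]): { pass: boolean; violations: string[] } {
  const violations = fields.filter((f) => !SNAKE_CASE.test(f));
  return { pass: violations.length === 0, violations };
}

// "userId" fails (camelCase in an API payload); "user_id" and "created_at" pass.
const report = checkApiFieldNames(["user_id", "created_at", "userId"]);
```

Because the rule is a regex, there is nothing for Koen to weigh: the report either lists violations or it does not.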
What failure class it catches¶
Confident hallucination (structural). LLMs confidently import packages that do not exist, use types that are not exported, create circular dependencies, and produce code that looks correct but does not compile in the full project context. Deterministic tools catch all of these without needing to understand the code's purpose.
What blocks progression¶
- Any check failing
- Koen does not apply judgment — if the tool says fail, it fails
- No manual overrides without human approval
Tools¶
- TypeScript compiler
- ESLint + Prettier
- npm audit
- Custom GE lint rules (spec drift, file allocation, naming)
- Bundlephobia for dependency size checks
Stage 5: Integration Testing¶
Owner: Marije (Testing Lead, Alfa) / Judith (Testing Lead, Bravo)
Phase: Post-implementation
Input: Implementation code that passed Stage 4
Output: Integration test results — pass/fail with failure details
What it does¶
Marije and Judith run the code in the context of the full system. Where Antje's tests verify behavior in isolation, integration tests verify behavior when connected to real databases, real API endpoints, real message queues, and real file systems.
Integration test scope:
- API integration: Full request/response cycle through the stack
- Database integration: Migrations, queries, transactions, rollbacks
- Service integration: Cross-service communication via Redis Streams
- Authentication flow: Full auth cycle including token refresh
- Data flow: End-to-end data transformation across pipeline stages
- Concurrency: Multiple agents accessing the same resource
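The concurrency item deserves a concrete shape. The sketch below is a deliberately simplified, deterministic illustration (the store and version scheme are hypothetical, not GE's Redis layer): two agents perform a read-modify-write on the same resource, and an optimistic version check rejects the stale write that would otherwise silently lose an update.

```typescript
// Minimal sketch of the lost-update problem an integration test must provoke.

class Store {
  value = 0;
  version = 0;
  read() { return { value: this.value, version: this.version }; }
  // Optimistic write: rejected if someone else wrote since our read.
  write(next: number, expectedVersion: number): boolean {
    if (expectedVersion !== this.version) return false;
    this.value = next;
    this.version++;
    return true;
  }
}

// The classic lost-update schedule: both agents read before either writes.
const store = new Store();
const a = store.read();
const b = store.read();
store.write(a.value + 1, a.version);             // agent A's increment lands
const bOk = store.write(b.value + 1, b.version); // agent B is rejected: stale version
// Without the version check, B would overwrite A and the count would end at 1
// while both agents believe they incremented it.
```

A unit test with a mocked store never exercises this schedule; only a test against the real shared resource does.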
What failure class it catches¶
Works in isolation, breaks in integration. LLMs generate code in a bounded context. The code works perfectly in a unit test that mocks all dependencies. It breaks when the mocked behavior does not match the real dependency. Integration tests use real dependencies — if it passes here, it works with the system.
Specific catches:
- SQL that works in SQLite but fails in PostgreSQL
- Redis commands that assume a data structure that changed
- API calls that use the wrong authentication method
- Timing-dependent behavior (race conditions, timeouts)
- File paths that work on the dev machine but not in the container
What blocks progression¶
- Any integration test failing
- Test environment not matching production topology
- Missing teardown (tests that leave dirty state)
Tools¶
- Vitest with real database (PostgreSQL test instance)
- Redis test instance (port 6381, matching production)
- k3s test namespace for container-level integration
- Docker Compose for local multi-service testing
Stage 6: Test Reconciliation¶
Owner: Jasper (Test Reconciliation Analyst)
Phase: Post-testing
Input: Test results from Antje (Stage 2), Marije/Judith (Stage 5), and dev agent local tests
Output: Reconciliation report — discrepancies flagged
What it does¶
Jasper compares test results across stages and looks for inconsistencies:
- Tests that pass in isolation but fail in integration (or vice versa)
- Test coverage gaps — code paths exercised by no test suite
- Conflicting assertions — two tests asserting different behavior for the same input
- Flaky tests — tests that sometimes pass and sometimes fail
- Mock/reality drift — mocked behavior that diverges from real behavior
- Coverage regression — branches covered in the previous version but not in this version
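The core mechanism — cross-referencing verdicts from independent suites — can be sketched in a few lines. The report shape is hypothetical; Jasper's actual reconciliation engine consumes full test reports, not flat verdict maps.

```typescript
// Sketch: compare per-test verdicts from two independent suites and flag
// any test name where they disagree (passes in one, fails in the other).

type Verdict = "pass" | "fail";
type Report = Record<string, Verdict>; // test name -> verdict

function reconcile(unit: Report, integration: Report): string[] {
  const discrepancies: string[] = [];
  for (const name of Object.keys(unit)) {
    if (name in integration && unit[name] !== integration[name]) {
      discrepancies.push(name);
    }
  }
  return discrepancies;
}

const flagged = reconcile(
  { "sorts ascending": "pass", "rejects empty input": "pass" },
  { "sorts ascending": "fail", "rejects empty input": "pass" },
);
// "sorts ascending" is flagged: the mocked dependency and the real one disagree,
// which is exactly the mock/reality drift case described above.
```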
What failure class it catches¶
Plausible but wrong (hidden by passing tests). Tests can pass for the wrong reason. A test might assert that a function returns an array, but not check the array contents. A mock might return the expected value regardless of input. Jasper catches these discrepancies by cross-referencing test results across independent test suites.
What blocks progression¶
- Unresolved discrepancies between test suites
- Coverage regression without explicit approval
- Flaky tests (must be fixed or quarantined)
- Mock behavior that diverges from integration test observations
Tools¶
- Custom reconciliation engine (compares test reports)
- Coverage diff tool (compares branch coverage across runs)
- Flaky test detector (tracks pass/fail across multiple runs)
Stage 7: Conflict Detection¶
Owner: Marco (Conflict Detection Agent)
Phase: Post-testing
Input: The change set (PR diff) in context of all concurrent changes
Output: Conflict report — semantic conflicts flagged
What it does¶
Marco detects conflicts that git cannot see. Git detects textual conflicts — two people editing the same line. Marco detects semantic conflicts — two changes that do not touch the same line but break each other:
- Interface conflicts: Agent A adds a required parameter to a shared function. Agent B calls that function without the new parameter. No git conflict. Runtime error.
- State conflicts: Agent A changes the database schema. Agent B writes queries against the old schema. No git conflict. SQL error.
- Behavioral conflicts: Agent A changes the sort order of a list API. Agent B's test depends on the old sort order. No git conflict. Test failure.
- Resource conflicts: Agent A allocates port 3001. Agent B also allocates port 3001. No git conflict. Port collision.
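The interface-conflict case can be reduced to a small check. This is a simplified sketch with hypothetical names (`sendEmail`, the signature/call-site shapes); Marco's real analysis works on AST-level diffs and a dependency graph rather than hand-built records.

```typescript
// Sketch: agent A's PR raises a function's required arity; agent B's
// concurrent PR still calls it with the old arity. No line overlaps, so
// git sees no conflict — but the merged result fails at runtime.

interface Signature { name: string; requiredParams: number; }
interface CallSite { callee: string; argCount: number; }

function findInterfaceConflicts(newSigs: Signature[], concurrentCalls: CallSite[]): string[] {
  const byName = new Map(newSigs.map((s) => [s.name, s] as const));
  const conflicts: string[] = [];
  for (const call of concurrentCalls) {
    const sig = byName.get(call.callee);
    if (sig && call.argCount < sig.requiredParams) {
      conflicts.push(`${call.callee}: called with ${call.argCount} args, now requires ${sig.requiredParams}`);
    }
  }
  return conflicts;
}

// Agent A changes sendEmail(to, subject) -> sendEmail(to, subject, locale).
// Agent B's open PR still calls it with two arguments.
const conflicts = findInterfaceConflicts(
  [{ name: "sendEmail", requiredParams: 3 }],
  [{ callee: "sendEmail", argCount: 2 }],
);
```

The key design point is the input: the check runs against *all open PRs*, not just the one under review, which is why this stage and no other can see the collision.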
What failure class it catches¶
Works in isolation, breaks in integration (concurrent). LLMs generate code without awareness of what other agents are building simultaneously. Each agent's code works. Together, they conflict. Marco is the only stage that considers concurrent changes.
What blocks progression¶
- Unresolved semantic conflicts
- Interface changes without downstream update plan
- Schema changes without migration compatibility check
Tools¶
- AST-level diff analysis (not textual — structural)
- Dependency graph traversal (traces impact of changes)
- Concurrent PR awareness (checks all open PRs, not just this one)
- Database schema comparison (current vs proposed)
Stage 8: Adversarial Testing¶
Owner: Ashley (Adversarial Agent — Chaos Monkey)
Phase: Post-testing
Input: Implementation code — Ashley receives NO codebase context
Output: Attack report — vulnerabilities and failures found
What it does¶
Ashley attacks the code with zero prior knowledge of the codebase. This is deliberate. Ashley does not read the implementation, does not read the tests, does not read the spec. Ashley sees only the public interface (API endpoints, UI forms, CLI commands) and tries to break it.
Attack categories:
- Input fuzzing: Invalid types, oversized payloads, Unicode edge cases, null bytes, SQL injection, XSS payloads, path traversal
- State manipulation: Out-of-order operations, repeated submissions, concurrent modifications, expired tokens, revoked permissions
- Resource exhaustion: Large uploads, many concurrent requests, deep pagination, expensive queries
- Authentication bypass: Token manipulation, role escalation, IDOR (insecure direct object reference), CSRF
- Error provocation: Network timeouts, database unavailability, disk full, out of memory
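A minimal sketch of the input-fuzzing category, with an assumed handler (`handleUsernameUpdate` is hypothetical, and a real harness would hit the HTTP interface, not a function): every hostile input must yield a structured error response, and an uncaught throw counts as a failed gate.

```typescript
// Sketch: feed hostile inputs to a public-interface handler and require a
// proper error response (4xx) for every one — never a crash or a 5xx.

type Response = { status: number; body: string };

function handleUsernameUpdate(input: unknown): Response {
  if (typeof input !== "string") return { status: 400, body: "expected string" };
  if (input.length === 0 || input.length > 64) return { status: 400, body: "bad length" };
  if (!/^[a-z0-9_]+$/i.test(input)) return { status: 400, body: "bad characters" };
  return { status: 200, body: "ok" };
}

const hostileInputs: unknown[] = [
  null, undefined, 12345, {}, [],
  "", "a".repeat(10_000),                 // empty and oversized
  "Robert'); DROP TABLE users;--",        // SQL-injection shape
  "<script>alert(1)</script>",            // XSS shape
  "../../etc/passwd", "user\u0000name",   // path traversal, null byte
];

const crashes = hostileInputs.filter((input) => {
  try { return handleUsernameUpdate(input).status >= 500; }
  catch { return true; } // an uncaught throw is a crash — blocks progression
});
```

Note that nothing in the harness depends on how the handler is implemented — consistent with Ashley's zero-codebase-knowledge rule below.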
What failure class it catches¶
Works for the happy path. LLMs generate code trained on examples that demonstrate success. Error handling is an afterthought. Ashley never takes the happy path. Every request is designed to trigger failure modes the dev agent did not consider.
Confident hallucination (security). LLMs confidently implement "authentication" that checks a token but not its expiry, "authorization" that checks a role but not the resource owner, "validation" that checks the type but not the range. Ashley finds these gaps.
Why zero codebase knowledge matters¶
If Ashley read the code, Ashley would test what the code does. By not reading the code, Ashley tests what the code should do — from an attacker's perspective. This catches failures that code-aware testing misses: the gap between what the implementation handles and what an attacker can actually send.
What blocks progression¶
- Critical vulnerabilities (auth bypass, data leak, injection)
- Unhandled crash on any input (must return proper error response)
- Resource exhaustion without rate limiting
Tools¶
- Custom fuzzing harness
- OWASP ZAP for automated security scanning
- Burp Suite patterns for manual-style attacks
- Load testing tools (k6, artillery) for resource exhaustion
- No access to source code (by design)
Stage 9: SSOT Enforcement¶
Owner: Jaap (SSOT Enforcer)
Phase: Pre-merge
Input: Full change set after all testing stages
Output: SSOT compliance report — pass/fail
What it does¶
Jaap verifies that the code matches every declared source of truth in the system:
- OpenAPI spec: Does the implementation match the API contract?
- Database schema: Do queries match the current schema?
- Configuration: Are values read from config, not hardcoded?
- Agent registry: Are agent names, streams, and roles correct?
- File allocation: Is code in the right directory per standards?
- Naming conventions: Do names follow GE conventions?
- Constitution: Does the change comply with the 10 principles?
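To make the "read from config, not hardcoded" check concrete, here is a deliberately crude sketch. The config values are hypothetical, and a real enforcer would walk the AST and consult the config authority map; a plain string scan like this over-flags but shows the rule.

```typescript
// Sketch: flag code that duplicates a literal whose source of truth is config.

const config = { redisPort: 6381, maxUploadMb: 25 }; // hypothetical config SSOT

function findHardcodedConfigValues(source: string): string[] {
  return Object.entries(config)
    .filter(([, value]) => source.includes(String(value)))
    .map(([key]) => key); // report which config key was duplicated inline
}

// This snippet hardcodes 6381 instead of reading config.redisPort.
const offending = findHardcodedConfigValues(
  "const client = connect('redis://localhost:6381')",
);
```

The check compares against the *declared* source of truth, not against what "looks right" — which is precisely why it catches pattern mimicry.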
What failure class it catches¶
Pattern mimicry without understanding. LLMs copy patterns from their training data. They hardcode values that should come from config. They put files in directories that "look right" but violate GE file allocation rules. They use naming conventions from other codebases. Jaap catches all of these by checking against the actual declared sources of truth — not against patterns.
What blocks progression¶
- Any SSOT violation
- Hardcoded values that should come from config
- Code in wrong directory
- API implementation that does not match spec
- Configuration that contradicts config authority map
Tools¶
- `verify_ssot.sh` — automated SSOT verification script
- Config authority map comparison
- OpenAPI spec validator
- File allocation checker against CODEBASE-STANDARDS.md
- Agent registry cross-reference
Stage 10: Merge Gate¶
Owner: Marta (Change Intelligence Engineer, Alfa) / Iwona (Change Intelligence Engineer, Bravo)
Phase: Merge decision
Input: All reports from Stages 1-9
Output: Merge approval or rejection with reasoning
What it does¶
Marta and Iwona make the final merge decision. They review all stage reports and apply judgment that no individual stage can:
- Holistic assessment: Do all stages agree? Are there contradictions between stage reports?
- Risk evaluation: What is the blast radius of this change? High-risk changes (auth, payments, data model) get extra scrutiny.
- Trend analysis: Is this agent's code quality improving or degrading over time? Recurring issues trigger escalation.
- Dependency check: Does this change depend on other changes that have not merged yet?
- Rollback readiness: If this change breaks production, can it be reverted cleanly?
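The risk-evaluation step can be partly mechanized before judgment is applied. The sketch below is a hypothetical blast-radius score — the paths, weights, and threshold are illustrative, not GE's actual risk model, which also feeds in agent quality trends and stage-report contradictions.

```typescript
// Sketch: weight changed files by how risky their area is; above a threshold,
// the change requires human review regardless of stage results.

const RISK_WEIGHTS: Array<[RegExp, number]> = [
  [/^src\/auth\//, 5],      // auth: high blast radius
  [/^src\/payments\//, 5],  // payments: high blast radius
  [/^migrations\//, 4],     // data model changes
  [/^src\//, 1],            // everything else in source
];

function riskScore(changedFiles: string[]): number {
  return changedFiles.reduce((score, file) => {
    const match = RISK_WEIGHTS.find(([pattern]) => pattern.test(file));
    return score + (match ? match[1] : 0);
  }, 0);
}

function needsHumanReview(changedFiles: string[], threshold = 5): boolean {
  return riskScore(changedFiles) >= threshold;
}

// A UI tweak scores low; touching auth plus a migration crosses the threshold.
const low = needsHumanReview(["src/ui/Button.tsx"]);
const high = needsHumanReview(["src/auth/session.ts", "migrations/0042_add_roles.sql"]);
```

Scoring only routes the decision; the approval or rejection itself stays with Marta and Iwona.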
What failure class it catches¶
Pipeline theater. Individual stages can rubber-stamp. A test suite can have 100% coverage but test nothing meaningful. A security scan can pass because it checked the wrong endpoints. Marta and Iwona look at the full picture and catch cases where individual stages passed but the overall quality is insufficient.
What blocks progression¶
- Any critical issue from any stage unresolved
- Missing stage reports (all 10 must complete)
- Rollback procedure not documented for high-risk changes
- Dependency on unmerged changes without explicit tracking
- Human review required for critical-complexity changes
Tools¶
- Stage report aggregation dashboard
- Change risk scoring model
- Agent quality trend tracker
- Dependency graph for concurrent changes
- Rollback procedure template
Stage Interaction Diagram¶
┌─────────────┐
│ Stage 1 │ Anna: Formal Spec
│ SPEC │ → Catches: ambiguity
└──────┬──────┘
│
┌──────▼──────┐
│ Stage 2 │ Antje: TDD (tests from spec)
│ TESTS │ → Catches: same-brain tests
└──────┬──────┘
│
┌──────▼──────┐
│ Stage 3 │ Dev agents: Implementation
│ CODE │ → Produces code (not a gate)
└──────┬──────┘
│
┌──────▼──────┐
│ Stage 4 │ Koen: Deterministic quality
│ LINT │ → Catches: hallucinated imports, types
└──────┬──────┘
│
┌──────▼──────┐
│ Stage 5 │ Marije/Judith: Integration tests
│ INTEGRATE │ → Catches: isolation-only code
└──────┬──────┘
│
┌──────▼──────┐
│ Stage 6 │ Jasper: Reconciliation
│ RECONCILE │ → Catches: hidden test gaps
└──────┬──────┘
│
┌──────▼──────┐
│ Stage 7 │ Marco: Conflict detection
│ CONFLICTS │ → Catches: concurrent semantic breaks
└──────┬──────┘
│
┌──────▼──────┐
│ Stage 8 │ Ashley: Adversarial testing
│ ATTACK │ → Catches: happy-path-only code
└──────┬──────┘
│
┌──────▼──────┐
│ Stage 9 │ Jaap: SSOT enforcement
│ SSOT │ → Catches: pattern mimicry
└──────┬──────┘
│
┌──────▼──────┐
│ Stage 10 │ Marta/Iwona: Merge gate
│ MERGE │ → Catches: pipeline theater
└─────────────┘
Ownership¶
| Stage | Agent | LLM Failure Class |
|---|---|---|
| 1. Formal Spec | Anna | Ambiguous requirements |
| 2. TDD | Antje | Same-brain tests |
| 3. Implementation | Dev agents | (producer, not gate) |
| 4. Deterministic | Koen | Hallucinated structure |
| 5. Integration | Marije / Judith | Isolation-only code |
| 6. Reconciliation | Jasper | Hidden test gaps |
| 7. Conflict | Marco | Concurrent semantic breaks |
| 8. Adversarial | Ashley | Happy-path-only code |
| 9. SSOT | Jaap | Pattern mimicry |
| 10. Merge Gate | Marta / Iwona | Pipeline theater |