Anti-LLM Pipeline

A 10-stage quality pipeline designed specifically to catch the failure modes of AI-generated code. No single stage is sufficient. Every stage catches a different failure class.


Why This Pipeline Exists

LLMs write code that looks right. It compiles. It passes the tests the LLM wrote alongside it. It handles the happy path gracefully. And then it breaks in production in ways no human developer would have produced.

This is not a tooling problem. It is a fundamental property of how LLMs generate code. Understanding these failure modes is the first step to building defenses against them.


LLM Failure Modes

1. Plausible but wrong

LLMs generate code that reads well and follows patterns from their training data. But "reads well" is not "works correctly." An LLM will confidently use an API that does not exist, reference a library method with the wrong signature, or implement an algorithm that is subtly incorrect but structurally convincing.

Human developers make typos and logic errors. LLMs make errors that pass code review because they look intentional.

2. Works for the happy path

LLMs are trained on code that mostly demonstrates success cases. Error handling, edge cases, concurrent access, partial failure, and graceful degradation are underrepresented in training data. The result: code that works perfectly when everything goes right and fails catastrophically when anything goes wrong.

3. Test and code from the same brain

When an LLM writes both the implementation and the tests, the tests validate the LLM's assumptions — not the requirements. If the LLM misunderstands the spec, both the code and the tests will agree on the wrong behavior. The tests pass. The feature is broken.

This is the most dangerous failure mode because it is invisible to anyone looking only at test results.

4. Works in isolation, breaks in integration

LLMs generate code in a bounded context window. They see the file they are editing, maybe a few related files. They do not see the full system. The code works perfectly as a standalone unit and breaks the moment it interacts with the rest of the codebase — naming conflicts, state assumptions, concurrency issues, import cycles.

5. Confident hallucination

LLMs do not say "I am not sure." They generate code with the same confidence whether the approach is correct or fabricated. A hallucinated API endpoint, a non-existent configuration option, a made-up library — all presented with the same syntactic certainty as correct code.

6. Pattern mimicry without understanding

LLMs replicate patterns they have seen. They will copy an authentication pattern from a different context, apply a caching strategy that makes no sense for the data model, or use a design pattern that adds complexity without solving the actual problem. The code looks professional. The architecture is wrong.


The Principle

No single verification stage can catch all these failure modes. A linter catches syntax issues but not semantic errors. Unit tests catch logic bugs but not integration failures. Code review catches design problems but not the subtle wrongness that looks intentional.

The anti-LLM pipeline chains 10 stages. Each stage is owned by a different agent with a different perspective. Each catches a specific failure class. Code must pass all 10 stages before it reaches production.


Pipeline Overview

  1. Anna (Formal Spec)
  2. Antje (TDD — tests from spec, NOT from code)
  3. Dev Agents (Implementation)
  4. Koen (Deterministic Quality)
  5. Marije / Judith (Integration Testing)
  6. Jasper (Test Reconciliation)
  7. Marco (Conflict Detection)
  8. Ashley (Adversarial Testing)
  9. Jaap (SSOT Enforcement)
  10. Marta / Iwona (Merge Gate)

Why This Order Matters

The pipeline is ordered from specification to deployment. Each stage builds on the guarantees of the previous stage:

  1. Anna defines what correct behavior means — before any code exists
  2. Antje writes tests from the spec — tests are independent of the implementation
  3. Dev agents write code to pass Antje's tests — not their own tests
  4. Koen checks code quality deterministically — no LLM judgment involved
  5. Marije/Judith test integration — does the code work with the system?
  6. Jasper reconciles test results — do the numbers add up?
  7. Marco detects conflicts — does this change break other changes?
  8. Ashley attacks the code — does it survive adversarial input?
  9. Jaap enforces SSOT — does the code match the single source of truth?
  10. Marta/Iwona make the merge decision — is everything clean?

Skipping a stage means a failure class goes undetected. The pipeline is only as strong as its weakest stage.
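The chaining described above can be sketched as an ordered sequence of checks that halts at the first failure. This is a hypothetical illustration, not the actual orchestrator: the `Stage` type, the stub checks, and the change-description dict are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class Stage:
    name: str
    # A check takes a change description and reports pass/fail.
    check: Callable[[dict], bool]

def run_pipeline(change: dict, stages: list) -> Tuple[bool, Optional[str]]:
    """Run every stage in order; stop at the first stage that fails."""
    for stage in stages:
        if not stage.check(change):
            return False, stage.name  # the failure class caught at this stage
    return True, None  # all stages passed; the change may merge

# Example wiring with two stub stages (real checks would invoke the agents).
stages = [
    Stage("Koen (deterministic quality)", lambda c: c.get("lint_clean", False)),
    Stage("Ashley (adversarial testing)", lambda c: c.get("fuzz_clean", False)),
]
ok, failed_at = run_pipeline({"lint_clean": True, "fuzz_clean": False}, stages)
```

Ordering falls out of the list: a stage never runs until every earlier stage has passed, which mirrors the guarantee chain above.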


Key Insight: Separation of Concerns Between Test and Code

The single most important design decision in this pipeline: the agent that writes the tests is not the agent that writes the code.

Anna writes the spec. Antje writes the tests from the spec. Dev agents write the implementation. Three different agents, three different perspectives, three different sets of assumptions.

If the same agent wrote both tests and code, the tests would validate the agent's interpretation of the spec — not the spec itself. This separation is what makes the pipeline effective against the "test and code from the same brain" failure mode.


Complexity-Based Routing

Not every change needs all 10 stages. The pipeline supports complexity-based routing:

| Complexity | Stages | Example |
| --- | --- | --- |
| Trivial | Koen → Marta/Iwona | Typo fix, config change |
| Simple | Antje → Dev → Koen → Marta/Iwona | Single-file feature |
| Standard | All 10 stages | Multi-file feature |
| Critical | All 10 + human review | Auth, payments, data model |

The orchestrator determines complexity based on:

  • Number of files changed
  • Which files are changed (auth, DB, config = critical)
  • Whether new dependencies are introduced
  • Whether the change touches multiple services
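A routing rule over those four signals might look like the sketch below. The path prefixes, the 3-line threshold for "trivial", and the function signature are all assumptions for illustration, not the orchestrator's actual logic.

```python
# Paths whose changes always route to the critical tier (assumed layout).
CRITICAL_PATHS = ("auth/", "payments/", "models/")

def classify(files: list, lines_changed: int,
             new_deps: bool, services_touched: int) -> str:
    """Map a change description to a complexity tier."""
    if any(f.startswith(CRITICAL_PATHS) for f in files):
        return "critical"   # all 10 stages plus human review
    if len(files) > 1 or new_deps or services_touched > 1:
        return "standard"   # multi-file or cross-service: all 10 stages
    if lines_changed <= 3:
        return "trivial"    # e.g. a typo fix: Koen -> Marta/Iwona
    return "simple"         # single-file feature: Antje -> Dev -> Koen -> merge gate

tier = classify(["auth/login.py"], lines_changed=12,
                new_deps=False, services_touched=1)
```

Checking the critical paths first matters: a one-line change to an auth file must still take the critical route, so the size-based rules only apply after that check.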

Metrics

The pipeline tracks three core metrics per stage:

| Metric | Definition | Target |
| --- | --- | --- |
| Defect escape rate | Defects that pass through this stage undetected | < 5% |
| Time-to-merge | Time from PR open to merge | < 4 hours (standard) |
| False positive rate | Clean code blocked by this stage | < 10% |

Joshua (Chief Innovation Officer) audits these metrics quarterly. A stage that consistently shows a defect escape rate below 1% combined with a false positive rate above 20% is a candidate for removal or recalibration.
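The two rate metrics are simple ratios over a stage's audit window. The sketch below shows one plausible way to compute them; the field names and the sample numbers are assumptions, not the pipeline's actual telemetry schema.

```python
def stage_metrics(caught: int, escaped: int, false_blocks: int,
                  clean_passed: int, merge_hours: list) -> dict:
    """Compute the three per-stage metrics from raw audit counts."""
    total_defects = caught + escaped       # defective changes that reached this stage
    total_clean = false_blocks + clean_passed  # clean changes that reached this stage
    return {
        # defects that slipped past this stage / all defects it saw
        "defect_escape_rate": escaped / total_defects if total_defects else 0.0,
        # clean changes wrongly blocked / all clean changes it saw
        "false_positive_rate": false_blocks / total_clean if total_clean else 0.0,
        # mean PR-open-to-merge time, in hours
        "avg_time_to_merge_h": sum(merge_hours) / len(merge_hours),
    }

m = stage_metrics(caught=95, escaped=5, false_blocks=8, clean_passed=92,
                  merge_hours=[2.0, 3.0, 4.0])
```

With these sample counts the stage shows a 5% escape rate, an 8% false positive rate, and a 3-hour average time-to-merge, so it sits within all three targets in the table above.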


Further Reading