# Self-Verification

## The Problem with Self-Assessment
An LLM asked to verify its own work will almost always conclude that its work is correct. This is not a bug in any specific model — it is an inherent property of autoregressive language generation. The same process that produced the original output will produce a justification for that output. The model is not "checking" its work. It is generating plausible text about its work, which is a fundamentally different operation.
Research confirms this: studies on LLM-generated code find that 29-45% contains security vulnerabilities, yet models consistently rate their own output as correct. The confidence-correctness correlation in self-assessment approaches zero for non-trivial tasks.
GE's verification philosophy: never rely on the same agent (or the same type of verification) to check work that it produced. Verification requires oracle independence — the verifier must be independent of the producer.
## The Anti-LLM Quality Pipeline
GE's quality pipeline is specifically designed to counter the failure modes of LLM-generated code. It is called the "anti-LLM pipeline" because each stage targets a known LLM weakness.
### The 10 Stages
| Stage | Owner | What It Catches | LLM Weakness Targeted |
|---|---|---|---|
| 1. Formal specification | Anna | Ambiguous requirements | LLMs fill ambiguity with plausible but wrong assumptions |
| 2. Spec-first test generation | Antje | Requirements misinterpretation | LLMs test what code does, not what it should do |
| 3. Implementation | Developer | N/A (production stage) | N/A |
| 4. Deterministic quality gates | Koen | Syntax errors, type violations, dead code | LLMs generate syntactically plausible but incorrect code |
| 5. Contract checks | Koen | Interface violations | LLMs change function signatures without updating callers |
| 6. Security review | Victoria/Ron | Credential exposure, injection, authz bypass | LLMs hardcode secrets, ignore auth, trust user input |
| 7. Integration testing | Marije | Cross-module failures | LLMs optimize for the unit, not the system |
| 8. Adversarial testing | Ashley | Edge cases, malformed input | LLMs assume happy-path inputs |
| 9. Reconciliation | Jasper | Drift between spec and implementation | LLMs solve a different problem than specified |
| 10. Performance review | Nessa | Resource waste, scaling issues | LLMs ignore algorithmic complexity and memory allocation |
RULE: No code reaches production without passing all 10 stages. Stages cannot be skipped, reordered, or combined.
### Why 10 Stages?
Each stage catches errors that previous stages miss. Removing any stage creates a gap. The stages are ordered by cost — cheap checks first (linting, type checking), expensive checks last (integration testing, performance review). This minimizes the cost of catching errors: most defects are caught in stages 1-5, before expensive testing begins.
## Contract Checks (Mechanical Verification)
Contract checks are deterministic verifications that do not require LLM judgment. They are the fastest, cheapest, and most reliable form of verification.
### What Contracts Check
| Check | What It Verifies | Tool |
|---|---|---|
| Type checking | All types are consistent | TypeScript compiler, mypy |
| Lint | Code follows style rules | ESLint, ruff |
| Import resolution | All imports resolve to real files | Compiler/bundler |
| Dead code | No exports without consumers | Custom analysis |
| Schema validation | API payloads match defined schemas | Zod, JSON Schema |
| Migration consistency | Database schema matches ORM definitions | Drizzle check |
| Config integrity | No hardcoded values that should come from config | Custom grep |
| MAXLEN presence | Every Redis XADD includes MAXLEN | Custom grep |
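A "custom grep" check like the MAXLEN rule above can be sketched as a simple line scan. This is an illustrative sketch, not GE's actual tooling; the function name and reporting format are hypothetical, and a real check would walk the repository rather than scan a string.

```typescript
// Hypothetical contract check: flag Redis XADD calls that omit MAXLEN.
// Returns 1-indexed line numbers of violations for reporting.
function findUncappedXadds(source: string): number[] {
  const violations: number[] = [];
  source.split("\n").forEach((line, i) => {
    // An XADD call that does not mention MAXLEN on the same line is a violation.
    if (/\bxadd\s*\(/i.test(line) && !/maxlen/i.test(line)) {
      violations.push(i + 1);
    }
  });
  return violations;
}

const ok = `await redis.xadd("events", "MAXLEN", "~", "1000", "*", "k", "v");`;
const bad = `await redis.xadd("events", "*", "k", "v");`;

console.log(findUncappedXadds(ok));              // [] — capped stream, passes
console.log(findUncappedXadds(ok + "\n" + bad)); // [2] — violation on line 2
```

The check is deliberately dumb: no parsing, no LLM judgment, just a deterministic pattern that either matches or does not.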
### When Contracts Run
Contracts run at two points:
- Pre-completion. Before an agent declares its work done, all relevant contracts are checked automatically. If any contract fails, the agent must fix the violation before completing.
- CI gate. All contracts run in CI on every commit. A failing contract blocks the merge.
### Why Contracts, Not Reviews
A human (or LLM) reviewer might miss a type error. A type checker will not. A reviewer might overlook a missing MAXLEN on an XADD call. A grep pattern will not. Contracts are not replacements for reviews — they are a layer that catches mechanical errors before reviews begin, freeing reviewers to focus on logic and design.
RULE: If a defect can be caught by a deterministic check, it must be caught by a deterministic check. LLM review time is expensive. Linting time is free.
## Learning Protocols (Structured Self-Reflection)
While LLMs cannot reliably verify the correctness of their own output, they can extract patterns from their experience. GE's learning protocols capture these patterns in a structured format.
### Session Learning Extraction
After every agent session, the PTY capture output is analyzed to extract learnings:
- What was attempted? Task description and scope.
- What succeeded? Approaches that worked, with specific details.
- What failed? Approaches that did not work, with error messages and root causes.
- What patterns emerged? Recurring themes, techniques, or pitfalls.
- What should future agents know? Actionable advice for the next agent working in this domain.
These learnings are written to the wiki brain and become available for JIT injection in future sessions.
### The Struggle Detector
The struggle detector scores sessions across five dimensions:
| Dimension | What It Measures | High Score Indicates |
|---|---|---|
| Cost | Token consumption relative to task complexity | Inefficient prompting or excessive retries |
| Turns | Number of turns relative to task complexity | Confusion, context loss, or wrong approach |
| Failures | Count of failed attempts within the session | Misunderstanding requirements or unfamiliar domain |
| Tokens | Raw token volume (input + output) | Over-reading, over-generating, or context bloat |
| Outcome | Whether the task was completed successfully | Fundamental capability gap |
Sessions with high struggle scores are flagged for human review. The patterns in struggled sessions become high-priority learnings — they represent exactly the situations where agents need additional knowledge.
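The five dimensions above can be combined into a single score. The sketch below is illustrative only: the field names, weights, and normalization are assumptions, not GE's actual detector. The idea it demonstrates is that each dimension is normalized against an expectation for the task's complexity, and the outcome dimension dominates because an incomplete task is the strongest struggle signal.

```typescript
// Hypothetical struggle score: each dimension contributes 0 (fine) to 1 (severe),
// combined with illustrative weights into a 0..1 score.
interface SessionMetrics {
  cost: number;           // tokens consumed
  expectedCost: number;   // baseline for this task complexity
  turns: number;
  expectedTurns: number;
  failures: number;       // failed attempts within the session
  tokens: number;         // raw input + output volume
  expectedTokens: number;
  completed: boolean;
}

function struggleScore(m: SessionMetrics): number {
  const clamp = (x: number) => Math.min(1, Math.max(0, x));
  // Ratio-based dimensions score 0 until the expectation is exceeded.
  const cost = clamp(m.cost / m.expectedCost - 1);
  const turns = clamp(m.turns / m.expectedTurns - 1);
  const failures = clamp(m.failures / 5); // five or more failures saturates
  const tokens = clamp(m.tokens / m.expectedTokens - 1);
  const outcome = m.completed ? 0 : 1;
  // Outcome weighted most heavily: incompletion is the strongest signal.
  return 0.15 * cost + 0.15 * turns + 0.2 * failures + 0.1 * tokens + 0.4 * outcome;
}

const smooth = struggleScore({
  cost: 900, expectedCost: 1000, turns: 4, expectedTurns: 5,
  failures: 0, tokens: 8000, expectedTokens: 10000, completed: true,
});
console.log(smooth); // 0 — under budget on every dimension
```

A session over this score's flagging threshold would then be queued for the human review described above.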
## Checklists (Machine-Parseable, Organically Growing)
GE agents use checklists as structured verification aids. Unlike free-form self-assessment ("does this look right?"), checklists force specific checks.
### Checklist Design Principles
1. Machine-parseable. Checklists use checkbox markdown (`- [ ]` / `- [x]`) so completion can be verified automatically.
2. Specific, not general. Bad: `- [ ] Code is correct.` Good: `- [ ] All Redis XADD calls include MAXLEN.`
3. Organically growing. When a new failure mode is discovered, a new checklist item is added. Checklists accrete knowledge over time. They are never "complete."
4. Role-specific. Each agent has checklists relevant to their function. A security reviewer's checklist differs from a backend developer's checklist.
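Principle 1 can be enforced mechanically. A minimal sketch (the function name is hypothetical, not GE's actual parser): extract the unticked items from a markdown checklist, so completion can be blocked while any remain.

```typescript
// Hypothetical checklist gate: return the text of every unticked checkbox.
// An empty result means the agent may declare completion.
function uncheckedItems(markdown: string): string[] {
  return markdown
    .split("\n")
    .filter((line) => /^\s*- \[ \]/.test(line))       // unticked boxes only
    .map((line) => line.replace(/^\s*- \[ \]\s*/, "")); // strip the checkbox prefix
}

const checklist = [
  "- [x] All Redis XADD calls include MAXLEN",
  "- [ ] No hardcoded ports, URLs, or credentials",
].join("\n");

console.log(uncheckedItems(checklist));
// ["No hardcoded ports, URLs, or credentials"] — completion is blocked
```

Because the format is this regular, the check needs no LLM judgment at all.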
### Example: Backend Developer Pre-Completion Checklist
- [ ] All new functions have TypeScript types (no `any`)
- [ ] All database queries use parameterized inputs (no string concatenation)
- [ ] All Redis XADD calls include MAXLEN (~100 per-agent, ~1000 system)
- [ ] No hardcoded ports, URLs, or credentials
- [ ] All new exports have at least one consumer
- [ ] Error paths return structured errors, not string messages
- [ ] All async functions have error handling (try/catch or .catch())
- [ ] New API endpoints have Zod schema validation on input
- [ ] No direct file system writes in request handlers
- [ ] Config values read from config files, not constants
### Checklist Evolution
Checklists grow through two mechanisms:
- Incident response. When a production incident is traced to a class of defect, a checklist item is added to prevent recurrence.
- Learning extraction. When multiple sessions struggle with the same issue, a checklist item is added to preempt the struggle.
RULE: Checklists are never shortened. Items can be retired (marked as automated by a contract check) but the check still runs — it just runs as code instead of as a manual verification.
## Mutation Testing as LLM-Specific Quality Gate
Mutation testing introduces small changes (mutations) to the code and verifies that the test suite catches them. If a mutation survives (tests still pass despite the change), the test suite has a gap.
### Why Mutation Testing Matters for LLM Code
LLMs tend to produce tests that verify the happy path but miss edge cases. A test suite with 100% line coverage but 40% mutation score provides a false sense of security. Mutation testing reveals the difference between "code was executed" (coverage) and "code behavior was verified" (mutation score).
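A tiny illustration of that gap (an assumed example, not GE code): a boundary mutation that a happy-path test suite executes fully yet never detects.

```typescript
// Original: free shipping at or above the threshold.
const freeShipping = (total: number): boolean => total >= 50;

// Mutant: ">=" changed to ">". Behavior differs only at exactly 50.
const freeShippingMutant = (total: number): boolean => total > 50;

// A happy-path suite with 100% line coverage passes against BOTH versions:
console.log(freeShipping(80) === true && freeShipping(20) === false);             // true
console.log(freeShippingMutant(80) === true && freeShippingMutant(20) === false); // true — mutant survives

// The test that kills the mutant probes the boundary itself:
console.log(freeShipping(50));       // true
console.log(freeShippingMutant(50)); // false — mutant caught
```

Line coverage reports both versions as fully exercised; only the boundary assertion distinguishes them, which is exactly what a mutation score measures and coverage does not.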
### GE's Mutation Testing Protocol
1. Developer completes implementation and all tests pass.
2. Mutation testing tool runs against the test suite.
3. Surviving mutants are categorized:
    - Equivalent mutants (mutation does not change behavior) — excluded.
    - Trivial mutants (mutation is caught by the type system) — excluded.
    - Real mutants (mutation changes behavior, yet tests still pass) — test gap identified.
4. For each real mutant, Antje writes a new test that kills it.
5. Mutation testing runs again until the mutation score meets the minimum for the code's category.
### Mutation Score Targets
| Code Category | Minimum Mutation Score |
|---|---|
| Security-critical (auth, crypto) | 90% |
| Business logic (billing, routing) | 80% |
| Infrastructure (deployment, config) | 70% |
| UI components | 60% |
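Following the categorization in the protocol above, the score compares killed mutants against only the relevant ones: equivalent and trivial mutants are excluded from the denominator before checking the category threshold. The function and field names below are an illustrative sketch, not the output format of any particular mutation tool.

```typescript
// Hypothetical mutation-score calculation per GE's categorization rules.
interface MutationRun {
  total: number;      // all mutants generated
  equivalent: number; // behavior unchanged — excluded
  trivial: number;    // caught by the type system — excluded
  killed: number;     // mutants the test suite detected
}

function mutationScore(run: MutationRun): number {
  const relevant = run.total - run.equivalent - run.trivial;
  // With no relevant mutants there is nothing left to kill.
  return relevant === 0 ? 1 : run.killed / relevant;
}

const score = mutationScore({ total: 120, equivalent: 10, trivial: 10, killed: 85 });
console.log(score);        // 0.85
console.log(score >= 0.8); // true — meets the business-logic target
console.log(score >= 0.9); // false — below the security-critical target
```

The same run can therefore pass or fail depending on the code category, which is why the targets table is keyed by category rather than a single global threshold.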
## Reconciliation (Independent Work Product Comparison)
Reconciliation is GE's most novel verification technique. Two agents independently produce work products from the same specification. Their outputs are compared. Discrepancies indicate either ambiguity in the specification or errors in one (or both) outputs.
### How It Works
1. Anna produces a formal specification.
2. Antje derives tests from the specification (without seeing any implementation).
3. A developer implements the feature (without seeing Antje's tests beyond what is needed to pass them).
4. Jasper compares what the tests verify against what the implementation does:
    - Any behavior that the implementation provides but the tests do not verify is suspicious.
    - Any behavior that the tests expect but the implementation does not provide is a bug.
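If behaviors are represented as labels, the comparison step reduces to two set differences. The data model below is an illustrative assumption, not how Jasper actually represents work products.

```typescript
// Hypothetical reconciliation: compare behaviors verified by tests against
// behaviors provided by the implementation.
function reconcile(testedBehaviors: string[], implementedBehaviors: string[]) {
  const tested = new Set(testedBehaviors);
  const implemented = new Set(implementedBehaviors);
  return {
    // Implemented but never verified by a test — suspicious.
    suspicious: implementedBehaviors.filter((b) => !tested.has(b)),
    // Expected by tests but not implemented — a bug.
    bugs: testedBehaviors.filter((b) => !implemented.has(b)),
  };
}

const result = reconcile(
  ["rejects expired token", "returns 404 on missing user"],
  ["rejects expired token", "logs raw request body"],
);
console.log(result.suspicious); // ["logs raw request body"]
console.log(result.bugs);       // ["returns 404 on missing user"]
```

The hard part in practice is producing the behavior labels from two artifacts that were never designed to be compared; the set arithmetic itself is trivial.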
### Why Reconciliation Works
This is oracle independence applied to the full pipeline. The test writer and the code writer both work from the same specification but bring different interpretations, different blind spots, and different assumptions. Their disagreements are the most valuable signal in the pipeline — they reveal the places where the specification was ambiguous or where an agent went wrong.
A single agent verifying its own work will not find these discrepancies. It has one interpretation, one set of blind spots, and one set of assumptions. Reconciliation provides the perspective that self-verification cannot.
## The Independence Principle
RULE: For any verification to be meaningful, the verifier must be independent of the producer.
| Verification Type | Independence Level | Reliability |
|---|---|---|
| Same agent checks own work | None | Unreliable |
| Same model checks another instance's work | Low | Slightly better |
| Different model checks work | Medium | Moderate |
| Deterministic tool checks work | High | Reliable |
| Independent agent from spec checks work | High | Reliable |
| Human checks work | Highest | Most reliable (but expensive) |
GE's pipeline stacks these independence levels. Deterministic checks catch mechanical errors. Independent agents catch logical errors. Reconciliation catches specification ambiguity. Human review catches everything else, applied selectively at the highest-risk points.
The goal is not to eliminate all errors — that is impossible. The goal is to make errors progressively harder to survive through the pipeline, so that the errors that reach production are rare, minor, and quickly detected.