DOMAIN:INNOVATION — EVALUATION_FRAMEWORKS

OWNER: joshua
UPDATED: 2026-03-24
PURPOSE: define how GE evaluates new technologies for adoption — proof-of-concept design, benchmarking, cost analysis, risk assessment, migration planning, and the decision tree
SCOPE: any technology Joshua recommends moving from Assess to Trial, or any technology Dirk-Jan asks to evaluate
AUTHORITY: Joshua designs and runs evaluations, but adoption decisions require Dirk-Jan approval or discussion consensus


EVAL:DECISION_TREE

The "Should GE adopt X?" decision tree — every technology evaluation starts here:

STEP 1: Does it solve a REAL GE problem?
  NO → STOP. Do not evaluate further. Log as "no current need" in radar.
  YES ↓

STEP 2: Does GE already have a working solution for this problem?
  NO → High priority evaluation (unmet need)
  YES ↓

STEP 3: Is the current solution causing measurable pain?
  NO → Low priority. Log in radar as Assess, revisit quarterly.
  YES ↓

STEP 4: Is the new technology measurably better on GE's scoring criteria?
  (run preliminary scoring from tech-radar-methodology.md)
  NO → STOP. Current solution is adequate despite pain.
  YES ↓

STEP 5: Is the migration cost justified by the improvement?
  (run cost analysis below)
  NO → STOP. Log as "better but not worth switching" in radar.
  YES ↓

STEP 6: Are there blocking risks?
  (run risk assessment below)
  YES, UNMITIGABLE → STOP. Log blockers in radar, revisit when resolved.
  YES, MITIGABLE → Proceed with risk mitigation plan.
  NO ↓

STEP 7: Can we run a meaningful PoC in ≤ 2 weeks?
  NO → Redesign PoC scope, or defer to future evaluation cycle.
  YES ↓

STEP 8: Design and run PoC (see below).
  PoC FAILS → STOP. Document learnings. Move to Hold or keep in Assess.
  PoC SUCCEEDS ↓

STEP 9: Present evaluation report to Dirk-Jan.
  APPROVED → Move to Trial. Assign to 1-2 projects.
  DISCUSSION_REQUIRED → Trigger agent discussion. Majority vote decides.
  REJECTED → Move to Hold with documented rationale.

RULE: every evaluation produces a written report, even if stopped at Step 1 — this prevents re-evaluating the same technology repeatedly
RULE: "solve a real GE problem" means a concrete, current problem — not "this might be useful someday"
RULE: the decision tree is sequential — do not skip steps, do not jump to PoC without completing Steps 1-6
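Steps 1-7 are a sequence of pre-PoC gates, which makes them easy to express as a pure function. The sketch below is illustrative only — the type and field names (`EvalInput`, `preScreen`, etc.) are assumptions for this example, not an existing GE API:

```typescript
// Pre-PoC gates (Steps 1-7) as a pure function. Each early return corresponds
// to a STOP branch in the tree; "mitigable" risks proceed with a mitigation plan.
type Outcome =
  | { stop: true; logAs: string }
  | { stop: false; next: "poc" };

interface EvalInput {
  solvesRealProblem: boolean;        // Step 1
  hasWorkingSolution: boolean;       // Step 2
  currentSolutionPainful: boolean;   // Step 3
  scoresBetter: boolean;             // Step 4
  migrationCostJustified: boolean;   // Step 5
  blockingRisks: "none" | "mitigable" | "unmitigable"; // Step 6
  pocFitsTwoWeeks: boolean;          // Step 7
}

function preScreen(e: EvalInput): Outcome {
  if (!e.solvesRealProblem) return { stop: true, logAs: "no current need" };
  // No working solution means an unmet need: skip the pain check, high priority.
  if (e.hasWorkingSolution && !e.currentSolutionPainful)
    return { stop: true, logAs: "Assess, revisit quarterly" };
  if (!e.scoresBetter) return { stop: true, logAs: "current solution adequate" };
  if (!e.migrationCostJustified)
    return { stop: true, logAs: "better but not worth switching" };
  if (e.blockingRisks === "unmitigable")
    return { stop: true, logAs: "blocked, revisit when resolved" };
  if (!e.pocFitsTwoWeeks)
    return { stop: true, logAs: "defer, redesign PoC scope" };
  return { stop: false, next: "poc" };
}
```

Note that every STOP branch still produces a `logAs` string — matching the rule that even a Step 1 stop yields a written record.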


EVAL:PROOF_OF_CONCEPT_DESIGN

POC_PRINCIPLES

PRINCIPLE_1: a PoC must answer a specific question, not "explore" a technology
PRINCIPLE_2: a PoC must be time-boxed (maximum 2 weeks elapsed, maximum 40 agent-hours)
PRINCIPLE_3: a PoC must have pre-defined success criteria (measurable, not subjective)
PRINCIPLE_4: a PoC must test GE's actual use case, not the technology's best-case demo
PRINCIPLE_5: a PoC must include failure testing — what happens when the technology fails?

POC_DESIGN_TEMPLATE

POC: [technology name] Evaluation
QUESTION: [specific question this PoC answers]
HYPOTHESIS: [what we expect to find]

SCOPE:
- IN: [what the PoC covers]
- OUT: [what the PoC explicitly does NOT cover]

SUCCESS_CRITERIA:
- MUST: [non-negotiable requirements — PoC fails if any unmet]
  - criterion 1: [measurable]
  - criterion 2: [measurable]
- SHOULD: [desirable but not blocking]
  - criterion 3: [measurable]
- NICE: [bonus points]
  - criterion 4: [measurable]

IMPLEMENTATION_PLAN:
- Day 1-2: [setup, basic integration]
- Day 3-5: [core functionality test]
- Day 6-8: [failure testing, edge cases]
- Day 9-10: [documentation, report writing]

ASSIGNED_TO: [agent name + team]
BUDGET: [maximum agent-hours] [maximum API cost]
DELIVERABLE: evaluation report (see template below)

ROLLBACK: [how to cleanly remove PoC code/config if evaluation is negative]
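The MUST/SHOULD/NICE split in the template makes the pass/fail verdict mechanical rather than subjective. A minimal sketch of that rule — field names are assumptions for illustration, not an existing GE schema:

```typescript
// PRINCIPLE_3 in code: a PoC passes iff every MUST criterion is met.
// SHOULD and NICE influence the recommendation, never the pass/fail verdict.
interface Criterion {
  name: string;
  met: boolean;
}

interface PocResult {
  must: Criterion[];
  should: Criterion[];
  nice: Criterion[];
}

function pocPassed(r: PocResult): boolean {
  return r.must.every((c) => c.met);
}
```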

POC_ANTI_PATTERNS

SCOPE_CREEP: PoC expands beyond original question — enforce time box strictly
HAPPY_PATH_ONLY: PoC tests the demo scenario but not GE's real-world edge cases
SUNK_COST: PoC is going poorly but team continues because "we've already invested time" — fail fast
PRODUCTION_SNEAK: PoC code sneaks into production without proper evaluation — PoC code is disposable by definition
TECHNOLOGY_CRUSH: evaluator falls in love with the technology and loses objectivity — use pre-defined criteria only


EVAL:BENCHMARKING_METHODOLOGY

WHAT_TO_BENCHMARK

PERFORMANCE: response time, throughput, resource usage, startup time
QUALITY: output correctness, consistency, error rate
COST: per-operation cost, monthly projected cost at GE's volume
DEVELOPER_EXPERIENCE: time-to-first-result, learning curve, documentation quality
LLM_INTERACTION: how well LLMs generate code using this technology (GE-specific)

BENCHMARK_DESIGN_RULES

RULE_1: benchmark GE's actual workload, not synthetic benchmarks — GE's use case may differ from published benchmarks
RULE_2: benchmark against the current solution, not against nothing — the question is "is this better?" not "does this work?"
RULE_3: run benchmarks multiple times (minimum 10 runs) and report p50, p95, p99 — single runs are unreliable
RULE_4: benchmark under realistic load — GE's agents run concurrent tasks, not sequential benchmarks
RULE_5: include cold start benchmarks — GE agents boot frequently (k3s pod restarts)
RULE_6: document environment precisely — hardware, network, OS, runtime versions, configuration
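RULE_3's p50/p95/p99 reporting can be computed with a nearest-rank percentile over the sorted samples. The percentile method itself is a choice (nearest-rank is assumed here; interpolated methods give slightly different values) — whichever is used, document it alongside the results:

```typescript
// Nearest-rank percentile: sort ascending, take the sample at rank ceil(p% * n).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Summary in the shape the benchmark table expects.
function summarize(samples: number[]) {
  return {
    p50: percentile(samples, 50),
    p95: percentile(samples, 95),
    p99: percentile(samples, 99),
  };
}
```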

BENCHMARK_TEMPLATE

BENCHMARK: [technology] vs [current solution]
ENVIRONMENT: [hardware, OS, runtime versions]
WORKLOAD: [description of test workload — must match GE's actual use]
RUNS: [number of runs per test]

| Metric | Current Solution | New Technology | Delta | Significant? |
|--------|-----------------|----------------|-------|-------------|
| p50 latency | X ms | Y ms | -Z% | yes/no |
| p95 latency | X ms | Y ms | -Z% | yes/no |
| p99 latency | X ms | Y ms | -Z% | yes/no |
| throughput | X ops/s | Y ops/s | +Z% | yes/no |
| memory usage | X MB | Y MB | -Z% | yes/no |
| cold start | X ms | Y ms | -Z% | yes/no |
| error rate | X% | Y% | -Z% | yes/no |
| LLM code quality | X/5 | Y/5 | +Z | yes/no |
| cost per operation | $X | $Y | -Z% | yes/no |

NOTES: [any caveats, environmental factors, or anomalies]
CONCLUSION: [new technology is faster/slower/cheaper/more expensive by X — significant/not significant for GE]

EVAL:COST_ANALYSIS

TOTAL_COST_OF_OWNERSHIP

Direct costs alone do not tell the full story. GE must evaluate total cost of ownership:

DIRECT_COSTS:
- licensing/subscription fees
- API usage costs (per-call, per-token, per-seat)
- infrastructure costs (hosting, compute, storage)
- support/SLA costs (if enterprise tier needed)

INDIRECT_COSTS:
- learning curve — agent identity updates, system prompt changes, wiki page creation
- migration effort — code changes, testing, deployment
- maintenance overhead — updates, security patches, monitoring
- integration effort — connecting to GE's existing stack (Redis, PostgreSQL, k3s)
- opportunity cost — what the team could be building instead of migrating

HIDDEN_COSTS:
- LLM training data lag — if LLMs cannot generate correct code for the technology, agents waste tokens on errors
- debugging overhead — unfamiliar technology means longer debug cycles
- vendor lock-in — future switching costs if the technology is abandoned or pricing changes
- ecosystem dependency — if the technology depends on a provider ecosystem (e.g., AWS, Google Cloud)

COST_ANALYSIS_TEMPLATE

COST ANALYSIS: [technology name]
PERIOD: monthly projection at current GE volume
COMPARISON: vs [current solution]

DIRECT COSTS:
| Item | Current | Proposed | Delta |
|------|---------|----------|-------|
| Licensing | $X/mo | $Y/mo | +/- $Z |
| API usage | $X/mo | $Y/mo | +/- $Z |
| Infrastructure | $X/mo | $Y/mo | +/- $Z |
| Support | $X/mo | $Y/mo | +/- $Z |
| SUBTOTAL | $X/mo | $Y/mo | +/- $Z |

INDIRECT COSTS (one-time):
| Item | Estimated Hours | Agent Cost | Notes |
|------|----------------|-----------|-------|
| Learning/training | X hrs | $Y | agent identity updates |
| Migration code changes | X hrs | $Y | code modification |
| Testing | X hrs | $Y | regression testing |
| Wiki documentation | X hrs | $Y | knowledge base update |
| SUBTOTAL | X hrs | $Y | |

HIDDEN COSTS (estimated monthly):
| Item | Estimated | Notes |
|------|-----------|-------|
| LLM error overhead | $X/mo | tokens wasted correcting hallucinated or incorrect generated code |
| Debug overhead | $X/mo | unfamiliar technology debugging |
| SUBTOTAL | $X/mo | |

BREAK-EVEN: [months until one-time costs are recouped by monthly savings]
5-YEAR TCO: current=$X, proposed=$Y, delta=$Z

VERDICT: [cost-justified | marginal | not cost-justified]
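The BREAK-EVEN line is simple arithmetic over the template's subtotals: one-time indirect costs divided by monthly savings. A sketch (it ignores discounting and assumes savings stay constant) — for example, $4,000 of one-time migration cost against $500/mo of savings breaks even at 8 months:

```typescript
// Months until one-time costs are recouped by monthly savings.
// Returns null when there are no monthly savings: the switch never breaks even.
function breakEvenMonths(
  oneTimeCost: number,
  monthlySavings: number
): number | null {
  if (monthlySavings <= 0) return null;
  return oneTimeCost / monthlySavings;
}
```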

COST_AT_SCALE

GE is built for hyperscale (1 → 100k users). Cost analysis must consider:
- does pricing scale linearly, sub-linearly, or super-linearly with volume?
- are there volume discounts or enterprise tiers that change the math at scale?
- does the technology require infrastructure that scales differently than GE's current stack?
- at what volume does self-hosting become cheaper than SaaS? (relevant for LLM providers)


EVAL:RISK_ASSESSMENT

RISK_CATEGORIES

TECHNICAL_RISK:
- maturity: is the technology production-ready? (version, stability, API surface)
- compatibility: does it work with GE's stack? (TypeScript, k3s, Redis, PostgreSQL)
- performance: does it meet GE's performance requirements?
- reliability: what is the uptime/availability track record?

VENDOR_RISK:
- company viability: is the company funded? profitable? at risk of acquisition or shutdown?
- pricing stability: has the vendor changed pricing dramatically? (e.g., Redis/Elastic license changes)
- lock-in: how difficult is it to migrate away? (proprietary formats, APIs, data)
- support: what level of support is available? community-only or enterprise support?

SECURITY_RISK:
- vulnerability history: how many CVEs? how quickly patched?
- data handling: where is data processed/stored? (EU data residency for GDPR)
- supply chain: how deep is the dependency tree? any known compromised dependencies?
- compliance: does it align with ISO 27001, SOC 2 Type II requirements?

OPERATIONAL_RISK:
- complexity: does it add operational complexity to GE's infrastructure?
- monitoring: can GE monitor it with existing tools?
- recovery: what happens when it fails? graceful degradation or total failure?
- team knowledge: does the team have expertise? learning curve?

RISK_ASSESSMENT_TEMPLATE

RISK ASSESSMENT: [technology name]
DATE: [date]
ASSESSOR: joshua

| Risk | Category | Likelihood | Impact | Score | Mitigation |
|------|----------|-----------|--------|-------|-----------|
| [risk 1] | technical | low/med/high | low/med/high | 1-9 | [mitigation] |
| [risk 2] | vendor | low/med/high | low/med/high | 1-9 | [mitigation] |
| [risk 3] | security | low/med/high | low/med/high | 1-9 | [mitigation] |
| [risk 4] | operational | low/med/high | low/med/high | 1-9 | [mitigation] |

SCORING: likelihood (1-3) × impact (1-3) = score (1-9)
THRESHOLD: any risk scoring 6+ must have a documented mitigation plan
BLOCKER: any risk scoring 9 (high likelihood × high impact) is an automatic blocker

OVERALL_RISK_LEVEL: [low | medium | high | blocking]
RECOMMENDATION: [proceed | proceed with mitigations | defer until risks reduced | reject]
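The SCORING, THRESHOLD, and BLOCKER rules above reduce to a small lookup and two comparisons — sketched here with illustrative names:

```typescript
// Likelihood (1-3) × impact (1-3) = score (1-9), per the template's SCORING rule.
type Level = "low" | "med" | "high";
const LEVEL: Record<Level, number> = { low: 1, med: 2, high: 3 };

function riskScore(likelihood: Level, impact: Level): number {
  return LEVEL[likelihood] * LEVEL[impact];
}

// 9 = automatic blocker; 6+ = documented mitigation plan required.
function classify(score: number): "ok" | "needs-mitigation" | "blocker" {
  if (score === 9) return "blocker";
  if (score >= 6) return "needs-mitigation";
  return "ok";
}
```

Note that with a 3×3 grid the only reachable scores are 1, 2, 3, 4, 6, and 9, so "6+" in practice means exactly 6 or 9.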

EVAL:MIGRATION_PLANNING

WHEN_MIGRATION_IS_NEEDED

- moving from Hold-tier technology to Adopt-tier replacement (e.g., Express → Hono)
- upgrading to major version with breaking changes (e.g., framework v4 → v5)
- responding to vendor changes (license change, pricing change, deprecation)
- adopting technology that replaces part of GE's current stack

MIGRATION_PLAN_TEMPLATE

MIGRATION PLAN: [from] → [to]
REASON: [why migrating]
OWNER: [agent/team responsible]
TIMELINE: [start date] → [target completion]

PHASE 1: PREPARATION
- [ ] document all usage of current technology (files, patterns, volume)
- [ ] create migration guide (pattern-by-pattern replacement map)
- [ ] set up new technology in development environment
- [ ] update agent identities/system prompts if needed
- [ ] write migration tests (verify behavior equivalence)

PHASE 2: MIGRATION
- [ ] migrate shared/library code first
- [ ] migrate project-by-project (start with least-critical)
- [ ] run parallel operation where possible (old + new)
- [ ] verify each migrated component against migration tests
- [ ] update wiki documentation as components are migrated

PHASE 3: VERIFICATION
- [ ] run full test suite against migrated code
- [ ] performance benchmark (compare to pre-migration baseline)
- [ ] security scan of new dependencies
- [ ] code review of migration changes
- [ ] update tech radar position

PHASE 4: CLEANUP
- [ ] remove old technology dependencies from package.json/requirements.txt
- [ ] remove old configuration files
- [ ] update CI/CD pipelines
- [ ] archive migration guide (for reference)
- [ ] close migration tracking issue

ROLLBACK_PLAN:
- trigger: [what conditions trigger rollback]
- method: [git revert / feature flag / blue-green deployment]
- timeline: [how long rollback takes]
- data: [any data migration that needs reversing]

ROLLBACK_STRATEGY

RULE: every migration must have a rollback plan before starting
RULE: rollback must be testable — practice the rollback before it is needed
RULE: rollback window is defined upfront — after the window closes, rollback becomes a new forward migration

ROLLBACK_TYPES:
- GIT_REVERT: simplest — revert the migration commits (works when migration is code-only)
- FEATURE_FLAG: run old and new in parallel, toggle flag to switch (works when behavior is swappable)
- BLUE_GREEN: deploy new version alongside old, switch traffic (works for services)
- DATA_MIGRATION_REVERSAL: if data schema changed, must reverse that too (most complex, avoid if possible)
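The FEATURE_FLAG type is the one that makes the rollback trivially testable: old and new implementations stay deployed side by side, and a single flag flips traffic back. A minimal sketch — all names are illustrative, not GE code, and in practice the flag would come from config or an environment variable rather than a plain parameter:

```typescript
// Both implementations remain deployed; only the flag decides which one runs.
function oldImplementation(input: string): string {
  return `old:${input}`;
}

function newImplementation(input: string): string {
  return `new:${input}`;
}

// Rollback = setting the flag to false. No deploy, no git revert.
function handler(input: string, useNewImpl: boolean): string {
  return useNewImpl ? newImplementation(input) : oldImplementation(input);
}
```

This is also what makes "practice the rollback before it is needed" cheap: toggling the flag in a staging environment exercises the exact rollback path.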


EVAL:INTEGRATION_EFFORT_ESTIMATION

ESTIMATION_CATEGORIES

TRIVIAL (< 4 hours):
- drop-in replacement with identical API
- configuration change only (e.g., switching CDN provider)
- adding a new tool that does not touch existing code

LOW (4-16 hours):
- library swap with similar but not identical API
- adding new dependency with clear integration points
- updating agent identity to include new tool knowledge

MEDIUM (16-40 hours):
- framework migration with pattern changes (e.g., Express → Hono)
- new infrastructure component requiring deployment changes
- migration affecting multiple projects/agents

HIGH (40-80 hours):
- fundamental architecture change (e.g., ORM migration, database migration)
- new technology requiring significant learning curve
- migration affecting all agents or all client projects

EXTREME (80+ hours):
- full stack migration
- provider switch affecting agent execution model
- anything requiring downtime or data migration

ESTIMATION_TEMPLATE

INTEGRATION EFFORT: [technology]
ESTIMATOR: joshua
DATE: [date]

| Component | Hours | Complexity | Notes |
|-----------|-------|-----------|-------|
| Code changes | X | low/med/high | [description] |
| Configuration | X | low/med/high | [description] |
| Agent identity updates | X | low/med/high | [N agents affected] |
| Wiki documentation | X | low/med/high | [N pages] |
| Testing | X | low/med/high | [test types needed] |
| CI/CD pipeline | X | low/med/high | [description] |
| Deployment | X | low/med/high | [description] |
| Training/learning | X | low/med/high | [description] |
| TOTAL | X | [overall] | |

CATEGORY: [trivial | low | medium | high | extreme]
CONFIDENCE: [high | medium | low] — [basis for confidence level]
RISK_BUFFER: +[X]% — [why buffer is needed]
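Applying the RISK_BUFFER to the raw total is plain arithmetic, sketched below for completeness — e.g., a MEDIUM-category estimate of 30 hours with a +25% buffer is planned as 37.5 hours:

```typescript
// Planned hours = raw estimate scaled up by the risk buffer percentage.
function bufferedEstimate(rawHours: number, bufferPercent: number): number {
  return rawHours * (1 + bufferPercent / 100);
}
```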

EVAL:EVALUATION_REPORT_TEMPLATE

The final deliverable of any technology evaluation:

# TECHNOLOGY EVALUATION REPORT
## [Technology Name] — [Date]

### EXECUTIVE SUMMARY
[2-3 sentences: what was evaluated, key finding, recommendation]

### DECISION TREE RESULT
[Which step of the decision tree was reached, and the outcome]

### SCORING (from tech-radar-methodology.md)
| Dimension | Score | Notes |
|-----------|-------|-------|
| Maturity | X/5 | [notes] |
| Community | X/5 | [notes] |
| LLM Compatibility | X/5 | [notes] |
| GE Stack Fit | X/5 | [notes] |
| Cost | X/5 | [notes] |
| Security | X/5 | [notes] |
| Performance | X/5 | [notes] |
| Migration Path | X/5 | [notes] |
| COMPOSITE | X.X/5 | weighted |

### POC RESULTS (if PoC was run)
[Summary of PoC outcomes against pre-defined success criteria]

### BENCHMARK RESULTS (if benchmarks were run)
[Summary of benchmark data]

### COST ANALYSIS
[TCO summary, break-even analysis]

### RISK ASSESSMENT
[Key risks and mitigations]

### INTEGRATION EFFORT
[Estimation and category]

### RECOMMENDATION
[adopt | trial | assess (continue monitoring) | hold (do not adopt)]

### RATIONALE
[Why this recommendation, addressing both benefits and concerns]

### NEXT STEPS (if recommended for trial/adopt)
[Specific actions, owners, timeline]

### APPENDICES
- Raw benchmark data
- PoC code location
- Reference materials consulted

EVAL:PROCESS_GOVERNANCE

WHO_CAN_REQUEST_EVALUATION: Dirk-Jan (any technology), Joshua (within monitoring scope), any agent (via discussion proposal)
WHO_RUNS_EVALUATION: Joshua designs the evaluation, relevant team agents execute PoC
WHO_DECIDES: Dirk-Jan for Adopt decisions, discussion consensus for Trial decisions, Joshua for Assess/Hold positioning
CADENCE: evaluations are demand-driven (not scheduled) but limited to 2 concurrent evaluations to prevent overload
DOCUMENTATION: all evaluations documented in wiki under docs/development/evaluations/[technology-name].md
HISTORY: past evaluations are never deleted — they provide historical context for re-evaluation