DOMAIN:INNOVATION — EVALUATION_FRAMEWORKS¶
OWNER: joshua
UPDATED: 2026-03-24
PURPOSE: define how GE evaluates new technologies for adoption — proof-of-concept design, benchmarking, cost analysis, risk assessment, migration planning, and the decision tree
SCOPE: any technology Joshua recommends moving from Assess to Trial, or any technology Dirk-Jan asks to evaluate
AUTHORITY: Joshua designs and runs evaluations, but adoption decisions require Dirk-Jan approval or discussion consensus
EVAL:DECISION_TREE¶
The "Should GE adopt X?" decision tree — every technology evaluation starts here:
STEP 1: Does it solve a REAL GE problem?
NO → STOP. Do not evaluate further. Log as "no current need" in radar.
YES ↓
STEP 2: Does GE already have a working solution for this problem?
NO → High priority evaluation (unmet need)
YES ↓
STEP 3: Is the current solution causing measurable pain?
NO → Low priority. Log in radar as Assess, revisit quarterly.
YES ↓
STEP 4: Is the new technology measurably better on GE's scoring criteria?
(run preliminary scoring from tech-radar-methodology.md)
NO → STOP. Current solution is adequate despite pain.
YES ↓
STEP 5: Is the migration cost justified by the improvement?
(run cost analysis below)
NO → STOP. Log as "better but not worth switching" in radar.
YES ↓
STEP 6: Are there blocking risks?
(run risk assessment below)
YES, UNMITIGABLE → STOP. Log blockers in radar, revisit when resolved.
YES, MITIGABLE → Proceed with risk mitigation plan.
NO ↓
STEP 7: Can we run a meaningful PoC in ≤ 2 weeks?
NO → Redesign PoC scope, or defer to future evaluation cycle.
YES ↓
STEP 8: Design and run PoC (see below).
PoC FAILS → STOP. Document learnings. Move to Hold or keep in Assess.
PoC SUCCEEDS ↓
STEP 9: Present evaluation report to Dirk-Jan.
APPROVED → Move to Trial. Assign to 1-2 projects.
DISCUSSION_REQUIRED → Trigger agent discussion. Majority vote decides.
REJECTED → Move to Hold with documented rationale.
RULE: every evaluation produces a written report, even if stopped at Step 1 — the report prevents the same technology from being re-evaluated from scratch later
RULE: "solve a real GE problem" means a concrete, current problem — not "this might be useful someday"
RULE: the decision tree is sequential — do not skip steps, do not jump to PoC without completing Steps 1-6
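The sequential gates of Steps 1-6 can be sketched as a single function (Steps 7-9 involve humans and a PoC, so they are out of scope here). The field names and the routing of the unmet-need branch from Step 2 are illustrative assumptions, not a prescribed schema:

```typescript
// Sketch of the Step 1-6 gates. Field names are illustrative, and the
// unmet-need branch (Step 2 = NO) skipping Steps 3-4 is an assumption —
// the tree only marks that branch "high priority".
interface EvalInput {
  solvesRealProblem: boolean;      // Step 1
  hasWorkingSolution: boolean;     // Step 2
  currentSolutionPainful: boolean; // Step 3
  scoresBetter: boolean;           // Step 4 (tech-radar scoring)
  migrationCostJustified: boolean; // Step 5
  blockingRisks: "none" | "mitigable" | "unmitigable"; // Step 6
}

type Outcome =
  | { proceed: true; priority: "high" | "normal"; withMitigation: boolean }
  | { proceed: false; logAs: string };

function decisionTree(e: EvalInput): Outcome {
  if (!e.solvesRealProblem) return { proceed: false, logAs: "no current need" };
  const highPriority = !e.hasWorkingSolution; // unmet need
  if (!highPriority) {
    if (!e.currentSolutionPainful) return { proceed: false, logAs: "Assess, revisit quarterly" };
    if (!e.scoresBetter) return { proceed: false, logAs: "current solution adequate" };
  }
  if (!e.migrationCostJustified) return { proceed: false, logAs: "better but not worth switching" };
  if (e.blockingRisks === "unmitigable") return { proceed: false, logAs: "blocked, revisit when resolved" };
  return { proceed: true, priority: highPriority ? "high" : "normal", withMitigation: e.blockingRisks === "mitigable" };
}
```

Encoding the gates this way makes the "sequential, no skipping" rule mechanical: a later gate is simply unreachable until every earlier gate passes.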
EVAL:PROOF_OF_CONCEPT_DESIGN¶
POC_PRINCIPLES¶
PRINCIPLE_1: a PoC must answer a specific question, not "explore" a technology
PRINCIPLE_2: a PoC must be time-boxed (maximum 2 weeks elapsed, maximum 40 agent-hours)
PRINCIPLE_3: a PoC must have pre-defined success criteria (measurable, not subjective)
PRINCIPLE_4: a PoC must test GE's actual use case, not the technology's best-case demo
PRINCIPLE_5: a PoC must include failure testing — what happens when the technology fails?
POC_DESIGN_TEMPLATE¶
POC: [technology name] Evaluation
QUESTION: [specific question this PoC answers]
HYPOTHESIS: [what we expect to find]
SCOPE:
- IN: [what the PoC covers]
- OUT: [what the PoC explicitly does NOT cover]
SUCCESS_CRITERIA:
- MUST: [non-negotiable requirements — PoC fails if any is unmet]
  - criterion 1: [measurable]
  - criterion 2: [measurable]
- SHOULD: [desirable but not blocking]
  - criterion 3: [measurable]
- NICE: [bonus points]
  - criterion 4: [measurable]
IMPLEMENTATION_PLAN:
- Day 1-2: [setup, basic integration]
- Day 3-5: [core functionality test]
- Day 6-8: [failure testing, edge cases]
- Day 9-10: [documentation, report writing]
ASSIGNED_TO: [agent name + team]
BUDGET: [maximum agent-hours] [maximum API cost]
DELIVERABLE: evaluation report (see template below)
ROLLBACK: [how to cleanly remove PoC code/config if evaluation is negative]
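The MUST/SHOULD/NICE verdict logic in the template above can be sketched mechanically — a PoC passes only if every MUST criterion is met, while SHOULD and NICE results only colour the report. The type and field names are illustrative:

```typescript
// Minimal sketch of PoC pass/fail evaluation. Names are illustrative.
interface Criterion {
  tier: "MUST" | "SHOULD" | "NICE";
  name: string;
  met: boolean;
}

function pocVerdict(criteria: Criterion[]): { pass: boolean; unmetMust: string[]; shouldSummary: string } {
  const unmetMust = criteria
    .filter((c) => c.tier === "MUST" && !c.met)
    .map((c) => c.name);
  const should = criteria.filter((c) => c.tier === "SHOULD");
  const shouldMet = should.filter((c) => c.met).length;
  return {
    pass: unmetMust.length === 0, // any unmet MUST fails the PoC
    unmetMust,
    shouldSummary: `${shouldMet}/${should.length} SHOULD criteria met`,
  };
}
```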
POC_ANTI_PATTERNS¶
SCOPE_CREEP: PoC expands beyond original question — enforce time box strictly
HAPPY_PATH_ONLY: PoC tests the demo scenario but not GE's real-world edge cases
SUNK_COST: PoC is going poorly but team continues because "we've already invested time" — fail fast
PRODUCTION_SNEAK: PoC code sneaks into production without proper evaluation — PoC code is disposable by definition
TECHNOLOGY_CRUSH: evaluator falls in love with the technology and loses objectivity — use pre-defined criteria only
EVAL:BENCHMARKING_METHODOLOGY¶
WHAT_TO_BENCHMARK¶
PERFORMANCE: response time, throughput, resource usage, startup time
QUALITY: output correctness, consistency, error rate
COST: per-operation cost, monthly projected cost at GE's volume
DEVELOPER_EXPERIENCE: time-to-first-result, learning curve, documentation quality
LLM_INTERACTION: how well LLMs generate code using this technology (GE-specific)
BENCHMARK_DESIGN_RULES¶
RULE_1: benchmark GE's actual workload, not synthetic benchmarks — GE's use case may differ from published benchmarks
RULE_2: benchmark against the current solution, not against nothing — the question is "is this better?" not "does this work?"
RULE_3: run benchmarks multiple times (minimum 10 runs) and report p50, p95, p99 — single runs are unreliable
RULE_4: benchmark under realistic load — GE's agents run concurrent tasks, not sequential benchmarks
RULE_5: include cold start benchmarks — GE agents boot frequently (k3s pod restarts)
RULE_6: document environment precisely — hardware, network, OS, runtime versions, configuration
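RULE_3's percentile reporting can be sketched with a nearest-rank percentile over the collected runs. This is a minimal sketch; a real harness would also discard warmup runs and account for concurrency (RULE_4):

```typescript
// Nearest-rank percentile over repeated benchmark runs (RULE_3: >= 10 runs,
// report p50/p95/p99). Minimal sketch — no warmup handling, no outlier policy.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

function summarize(latenciesMs: number[]) {
  return {
    runs: latenciesMs.length,
    p50: percentile(latenciesMs, 50),
    p95: percentile(latenciesMs, 95),
    p99: percentile(latenciesMs, 99),
  };
}
```

With fewer than ~100 runs, p99 is effectively the maximum observed value — one more reason single runs (or small run counts) are unreliable for tail latency.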
BENCHMARK_TEMPLATE¶
BENCHMARK: [technology] vs [current solution]
ENVIRONMENT: [hardware, OS, runtime versions]
WORKLOAD: [description of test workload — must match GE's actual use]
RUNS: [number of runs per test]
| Metric | Current Solution | New Technology | Delta | Significant? |
|--------|-----------------|----------------|-------|-------------|
| p50 latency | X ms | Y ms | -Z% | yes/no |
| p95 latency | X ms | Y ms | -Z% | yes/no |
| p99 latency | X ms | Y ms | -Z% | yes/no |
| throughput | X ops/s | Y ops/s | +Z% | yes/no |
| memory usage | X MB | Y MB | -Z% | yes/no |
| cold start | X ms | Y ms | -Z% | yes/no |
| error rate | X% | Y% | -Z% | yes/no |
| LLM code quality | X/5 | Y/5 | +Z | yes/no |
| cost per operation | $X | $Y | -Z% | yes/no |
NOTES: [any caveats, environmental factors, or anomalies]
CONCLUSION: [new technology is faster/slower/cheaper/more expensive by X — significant/not significant for GE]
EVAL:COST_ANALYSIS¶
TOTAL_COST_OF_OWNERSHIP¶
Direct costs alone do not tell the full story. GE must evaluate total cost of ownership:
DIRECT_COSTS:
- licensing/subscription fees
- API usage costs (per-call, per-token, per-seat)
- infrastructure costs (hosting, compute, storage)
- support/SLA costs (if enterprise tier needed)
INDIRECT_COSTS:
- learning curve — agent identity updates, system prompt changes, wiki page creation
- migration effort — code changes, testing, deployment
- maintenance overhead — updates, security patches, monitoring
- integration effort — connecting to GE's existing stack (Redis, PostgreSQL, k3s)
- opportunity cost — what the team could be building instead of migrating
HIDDEN_COSTS:
- LLM training data lag — if LLMs cannot generate correct code for the technology, agents waste tokens on errors
- debugging overhead — unfamiliar technology means longer debug cycles
- vendor lock-in — future switching costs if the technology is abandoned or pricing changes
- ecosystem dependency — if the technology depends on a provider ecosystem (e.g., AWS, Google Cloud)
COST_ANALYSIS_TEMPLATE¶
COST ANALYSIS: [technology name]
PERIOD: monthly projection at current GE volume
COMPARISON: vs [current solution]
DIRECT COSTS:
| Item | Current | Proposed | Delta |
|------|---------|----------|-------|
| Licensing | $X/mo | $Y/mo | +/- $Z |
| API usage | $X/mo | $Y/mo | +/- $Z |
| Infrastructure | $X/mo | $Y/mo | +/- $Z |
| Support | $X/mo | $Y/mo | +/- $Z |
| SUBTOTAL | $X/mo | $Y/mo | +/- $Z |
INDIRECT COSTS (one-time):
| Item | Estimated Hours | Agent Cost | Notes |
|------|----------------|-----------|-------|
| Learning/training | X hrs | $Y | agent identity updates |
| Migration code changes | X hrs | $Y | code modification |
| Testing | X hrs | $Y | regression testing |
| Wiki documentation | X hrs | $Y | knowledge base update |
| SUBTOTAL | X hrs | $Y | |
HIDDEN COSTS (estimated monthly):
| Item | Estimated | Notes |
|------|-----------|-------|
| LLM error overhead | $X/mo | tokens wasted on hallucination |
| Debug overhead | $X/mo | unfamiliar technology debugging |
| SUBTOTAL | $X/mo | |
BREAK-EVEN: [months until one-time costs are recouped by monthly savings]
5-YEAR TCO: current=$X, proposed=$Y, delta=$Z
VERDICT: [cost-justified | marginal | not cost-justified]
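The BREAK-EVEN and 5-YEAR TCO lines in the template follow directly from the subtotals. A sketch, assuming one-time indirect costs and a constant monthly delta (real pricing rarely stays flat for five years):

```typescript
// Break-even: months until one-time costs are recouped by monthly savings.
// Assumes constant monthly costs — an illustrative simplification.
function breakEvenMonths(
  oneTimeCost: number,
  currentMonthly: number,
  proposedMonthly: number
): number | null {
  const monthlySavings = currentMonthly - proposedMonthly;
  if (monthlySavings <= 0) return null; // never breaks even on cost alone
  return Math.ceil(oneTimeCost / monthlySavings);
}

// 5-year TCO = one-time costs + 60 months of running cost.
function fiveYearTco(monthly: number, oneTime = 0): number {
  return oneTime + monthly * 60;
}
```

For example, $1,200 of one-time migration cost against a $100/mo saving breaks even at month 12; if the proposed solution costs more per month, break-even is never reached and the switch must be justified on non-cost grounds.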
COST_AT_SCALE¶
GE is built for hyperscale (1 → 100k users). Cost analysis must consider:
- does pricing scale linearly, sub-linearly, or super-linearly with volume?
- are there volume discounts or enterprise tiers that change the math at scale?
- does the technology require infrastructure that scales differently than GE's current stack?
- at what volume does self-hosting become cheaper than SaaS? (relevant for LLM providers)
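The self-hosting question above reduces to a crossover calculation. A sketch under a deliberately simple model — SaaS priced purely per unit, self-hosting a fixed monthly cost plus a smaller per-unit cost; real pricing usually has tiers and discounts that shift the crossover:

```typescript
// Volume at which self-hosting becomes cheaper than SaaS, assuming a
// linear cost model (an illustrative simplification).
function selfHostCrossover(
  saasPerUnit: number,
  selfHostFixedMonthly: number,
  selfHostPerUnit: number
): number | null {
  const marginalSavings = saasPerUnit - selfHostPerUnit;
  if (marginalSavings <= 0) return null; // SaaS cheaper at any volume
  return Math.ceil(selfHostFixedMonthly / marginalSavings);
}
```

For example, at $0.01/call SaaS vs $500/mo fixed + $0.002/call self-hosted, self-hosting wins above 62,500 calls/month — the kind of threshold GE should know before hitting hyperscale volume.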
EVAL:RISK_ASSESSMENT¶
RISK_CATEGORIES¶
TECHNICAL_RISK:
- maturity: is the technology production-ready? (version, stability, API surface)
- compatibility: does it work with GE's stack? (TypeScript, k3s, Redis, PostgreSQL)
- performance: does it meet GE's performance requirements?
- reliability: what is the uptime/availability track record?
VENDOR_RISK:
- company viability: is the company funded? profitable? at risk of acquisition or shutdown?
- pricing stability: has the vendor changed pricing dramatically? (e.g., Redis/Elastic license changes)
- lock-in: how difficult is it to migrate away? (proprietary formats, APIs, data)
- support: what level of support is available? community-only or enterprise support?
SECURITY_RISK:
- vulnerability history: how many CVEs? how quickly patched?
- data handling: where is data processed/stored? (EU data residency for GDPR)
- supply chain: how deep is the dependency tree? any known compromised dependencies?
- compliance: does it align with ISO 27001, SOC 2 Type II requirements?
OPERATIONAL_RISK:
- complexity: does it add operational complexity to GE's infrastructure?
- monitoring: can GE monitor it with existing tools?
- recovery: what happens when it fails? graceful degradation or total failure?
- team knowledge: does the team have expertise? learning curve?
RISK_ASSESSMENT_TEMPLATE¶
RISK ASSESSMENT: [technology name]
DATE: [date]
ASSESSOR: joshua
| Risk | Category | Likelihood | Impact | Score | Mitigation |
|------|----------|-----------|--------|-------|-----------|
| [risk 1] | technical | low/med/high | low/med/high | 1-9 | [mitigation] |
| [risk 2] | vendor | low/med/high | low/med/high | 1-9 | [mitigation] |
| [risk 3] | security | low/med/high | low/med/high | 1-9 | [mitigation] |
| [risk 4] | operational | low/med/high | low/med/high | 1-9 | [mitigation] |
SCORING: likelihood (1-3) × impact (1-3) = score (1-9)
THRESHOLD: any risk scoring 6+ must have a documented mitigation plan
BLOCKER: any risk scoring 9 (high likelihood × high impact) is an automatic blocker
OVERALL_RISK_LEVEL: [low | medium | high | blocking]
RECOMMENDATION: [proceed | proceed with mitigations | defer until risks reduced | reject]
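The scoring and threshold rules above are simple enough to encode directly, which keeps assessments consistent between evaluations:

```typescript
// likelihood (1-3) x impact (1-3) = score (1-9);
// 6+ requires a documented mitigation plan, 9 is an automatic blocker.
type Level = 1 | 2 | 3; // low / medium / high

function riskScore(likelihood: Level, impact: Level) {
  const score = likelihood * impact;
  return {
    score,
    needsMitigationPlan: score >= 6,
    blocker: score === 9, // high likelihood x high impact
  };
}
```

Note the multiplicative scale only produces the values 1, 2, 3, 4, 6, and 9, so the 6+ threshold catches exactly the med×high and high×high combinations.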
EVAL:MIGRATION_PLANNING¶
WHEN_MIGRATION_IS_NEEDED¶
- moving from Hold-tier technology to Adopt-tier replacement (e.g., Express → Hono)
- upgrading to major version with breaking changes (e.g., framework v4 → v5)
- responding to vendor changes (license change, pricing change, deprecation)
- adopting technology that replaces part of GE's current stack
MIGRATION_PLAN_TEMPLATE¶
MIGRATION PLAN: [from] → [to]
REASON: [why migrating]
OWNER: [agent/team responsible]
TIMELINE: [start date] → [target completion]
PHASE 1: PREPARATION
- [ ] document all usage of current technology (files, patterns, volume)
- [ ] create migration guide (pattern-by-pattern replacement map)
- [ ] set up new technology in development environment
- [ ] update agent identities/system prompts if needed
- [ ] write migration tests (verify behavior equivalence)
PHASE 2: MIGRATION
- [ ] migrate shared/library code first
- [ ] migrate project-by-project (start with least-critical)
- [ ] run parallel operation where possible (old + new)
- [ ] verify each migrated component against migration tests
- [ ] update wiki documentation as components are migrated
PHASE 3: VERIFICATION
- [ ] run full test suite against migrated code
- [ ] performance benchmark (compare to pre-migration baseline)
- [ ] security scan of new dependencies
- [ ] code review of migration changes
- [ ] update tech radar position
PHASE 4: CLEANUP
- [ ] remove old technology dependencies from package.json/requirements.txt
- [ ] remove old configuration files
- [ ] update CI/CD pipelines
- [ ] archive migration guide (for reference)
- [ ] close migration tracking issue
ROLLBACK_PLAN:
- trigger: [what conditions trigger rollback]
- method: [git revert / feature flag / blue-green deployment]
- timeline: [how long rollback takes]
- data: [any data migration that needs reversing]
ROLLBACK_STRATEGY¶
RULE: every migration must have a rollback plan before starting
RULE: rollback must be testable — practice the rollback before it is needed
RULE: rollback window is defined upfront — after the window closes, rollback becomes a new forward migration
ROLLBACK_TYPES:
- GIT_REVERT: simplest — revert the migration commits (works when migration is code-only)
- FEATURE_FLAG: run old and new in parallel, toggle flag to switch (works when behavior is swappable)
- BLUE_GREEN: deploy new version alongside old, switch traffic (works for services)
- DATA_MIGRATION_REVERSAL: if data schema changed, must reverse that too (most complex, avoid if possible)
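The FEATURE_FLAG type can be sketched as a wrapper that keeps both implementations deployed and picks one per call, so rollback is a config change rather than a redeploy. The fallback-on-error behaviour is a design choice layered on top of the plain flag, not part of the rollback type itself:

```typescript
// Feature-flag rollback sketch: flag off -> old path; flag on -> new path,
// falling back to the old path if the new one throws. Names are illustrative.
type Impl<T> = () => T;

function withRollback<T>(flagOn: () => boolean, newImpl: Impl<T>, oldImpl: Impl<T>): Impl<T> {
  return () => {
    if (!flagOn()) return oldImpl();
    try {
      return newImpl();
    } catch {
      // Fail open to the old path so a broken migration degrades gracefully.
      return oldImpl();
    }
  };
}
```

This only works when the two implementations are behaviourally swappable (as the list above notes); if they write different data shapes, a flag flip alone is not a clean rollback.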
EVAL:INTEGRATION_EFFORT_ESTIMATION¶
ESTIMATION_CATEGORIES¶
TRIVIAL (< 4 hours):
- drop-in replacement with identical API
- configuration change only (e.g., switching CDN provider)
- adding a new tool that does not touch existing code
LOW (4-16 hours):
- library swap with similar but not identical API
- adding new dependency with clear integration points
- updating agent identity to include new tool knowledge
MEDIUM (16-40 hours):
- framework migration with pattern changes (e.g., Express → Hono)
- new infrastructure component requiring deployment changes
- migration affecting multiple projects/agents
HIGH (40-80 hours):
- fundamental architecture change (e.g., ORM migration, database migration)
- new technology requiring significant learning curve
- migration affecting all agents or all client projects
EXTREME (80+ hours):
- full stack migration
- provider switch affecting agent execution model
- anything requiring downtime or data migration
ESTIMATION_TEMPLATE¶
INTEGRATION EFFORT: [technology]
ESTIMATOR: joshua
DATE: [date]
| Component | Hours | Complexity | Notes |
|-----------|-------|-----------|-------|
| Code changes | X | low/med/high | [description] |
| Configuration | X | low/med/high | [description] |
| Agent identity updates | X | low/med/high | [N agents affected] |
| Wiki documentation | X | low/med/high | [N pages] |
| Testing | X | low/med/high | [test types needed] |
| CI/CD pipeline | X | low/med/high | [description] |
| Deployment | X | low/med/high | [description] |
| Training/learning | X | low/med/high | [description] |
| TOTAL | X | [overall] | |
CATEGORY: [trivial | low | medium | high | extreme]
CONFIDENCE: [high | medium | low] — [basis for confidence level]
RISK_BUFFER: +[X]% — [why buffer is needed]
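The TOTAL, RISK_BUFFER, and CATEGORY lines of the template combine mechanically. A sketch using the band edges from the category list above (trivial < 4, low 4-16, medium 16-40, high 40-80, extreme 80+); exact boundary handling is an assumption:

```typescript
// Map buffered hours into the estimation bands above.
function effortCategory(hours: number): string {
  if (hours < 4) return "trivial";
  if (hours < 16) return "low";
  if (hours < 40) return "medium";
  if (hours < 80) return "high";
  return "extreme";
}

// Sum component estimates, apply the risk buffer, and categorize the result.
function bufferedEstimate(componentHours: number[], riskBufferPct: number) {
  const raw = componentHours.reduce((a, b) => a + b, 0);
  const total = raw * (1 + riskBufferPct / 100);
  return { raw, total, category: effortCategory(total) };
}
```

Categorizing the buffered total rather than the raw sum matters near band edges: a 20-hour raw estimate with a 25% buffer is a MEDIUM migration, not a LOW one.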
EVAL:EVALUATION_REPORT_TEMPLATE¶
The final deliverable of any technology evaluation:
# TECHNOLOGY EVALUATION REPORT
## [Technology Name] — [Date]
### EXECUTIVE SUMMARY
[2-3 sentences: what was evaluated, key finding, recommendation]
### DECISION TREE RESULT
[Which step of the decision tree was reached, and the outcome]
### SCORING (from tech-radar-methodology.md)
| Dimension | Score | Notes |
|-----------|-------|-------|
| Maturity | X/5 | [notes] |
| Community | X/5 | [notes] |
| LLM Compatibility | X/5 | [notes] |
| GE Stack Fit | X/5 | [notes] |
| Cost | X/5 | [notes] |
| Security | X/5 | [notes] |
| Performance | X/5 | [notes] |
| Migration Path | X/5 | [notes] |
| COMPOSITE | X.X/5 | weighted |
### POC RESULTS (if PoC was run)
[Summary of PoC outcomes against pre-defined success criteria]
### BENCHMARK RESULTS (if benchmarks were run)
[Summary of benchmark data]
### COST ANALYSIS
[TCO summary, break-even analysis]
### RISK ASSESSMENT
[Key risks and mitigations]
### INTEGRATION EFFORT
[Estimation and category]
### RECOMMENDATION
[adopt | trial | assess (continue monitoring) | hold (do not adopt)]
### RATIONALE
[Why this recommendation, addressing both benefits and concerns]
### NEXT STEPS (if recommended for trial/adopt)
[Specific actions, owners, timeline]
### APPENDICES
- Raw benchmark data
- PoC code location
- Reference materials consulted
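The COMPOSITE row in the report's scoring table is a weighted average of the dimension scores. A sketch — the weights here are illustrative placeholders, since the real weights live in tech-radar-methodology.md:

```typescript
// Weighted composite of dimension scores, rounded to one decimal (X.X/5).
// Weights are illustrative; the canonical weights are defined elsewhere.
function compositeScore(
  scores: Record<string, number>,
  weights: Record<string, number>
): number {
  let weighted = 0;
  let totalWeight = 0;
  for (const [dim, w] of Object.entries(weights)) {
    if (!(dim in scores)) throw new Error(`missing score for ${dim}`);
    weighted += scores[dim] * w;
    totalWeight += w;
  }
  return Math.round((weighted / totalWeight) * 10) / 10;
}
```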
EVAL:PROCESS_GOVERNANCE¶
WHO_CAN_REQUEST_EVALUATION: Dirk-Jan (any technology), Joshua (within monitoring scope), any agent (via discussion proposal)
WHO_RUNS_EVALUATION: Joshua designs the evaluation, relevant team agents execute PoC
WHO_DECIDES: Dirk-Jan for Adopt decisions, discussion consensus for Trial decisions, Joshua for Assess/Hold positioning
CADENCE: evaluations are demand-driven (not scheduled) but limited to 2 concurrent evaluations to prevent overload
DOCUMENTATION: all evaluations documented in wiki under docs/development/evaluations/[technology-name].md
HISTORY: past evaluations are never deleted — they provide historical context for re-evaluation