
DOMAIN:PERFORMANCE — CALIBRATION_EXAMPLES

OWNER: nessa
ALSO_USED_BY: marta (merge gate input), iwona (release readiness input)
UPDATED: 2026-03-24
SCOPE: calibration examples for performance evaluation — JIT injected before every performance assessment
PURPOSE: distinguish real regressions from measurement noise, establish baseline protocols, and calibrate pass/fail decisions


HOW_TO_USE_THIS_PAGE

Read these examples BEFORE evaluating any performance test result.
Nessa's job is to protect users from slow software — but false alarms erode trust and delay delivery.
Every performance decision must distinguish signal from noise.

PERFORMANCE_EVALUATION_PRINCIPLES:
- A single measurement is never conclusive — require multiple runs
- Percentiles matter more than averages (p99 catches tail latency, mean hides it)
- Regressions are relative to baseline — without baseline, you cannot regress
- Client SLOs override internal budgets (if client requires 1s LCP, our 2s budget is irrelevant)
- Context matters: a 3x regression on a health check is different from 3x on checkout


EXAMPLE_1: REAL REGRESSION — P99 LATENCY 3X SLOWER ON KEY ENDPOINT

DECISION: FAIL — BLOCK RELEASE

CONFIDENCE: HIGH

MEASUREMENTS

Endpoint: POST /api/orders (order creation — critical path)
Baseline (last release, 5 runs):

p50: 120ms  p95: 180ms  p99: 210ms  max: 340ms

Current build (5 runs):

p50: 150ms  p95: 380ms  p99: 640ms  max: 1200ms

ANALYSIS

REGRESSION_DETECTED:
- p50: +25% (120ms → 150ms) — noticeable but not alarming
- p95: +111% (180ms → 380ms) — significant
- p99: +205% (210ms → 640ms) — critical regression
- max: +253% (340ms → 1200ms) — tail latency exploded

WHY_THIS_IS_REAL_AND_NOT_NOISE:
- 5 runs with consistent results (p99 standard deviation: 42ms across runs)
- Regression is monotonic across all percentiles (not just a single outlier)
- p99/p50 ratio went from 1.75x to 4.27x — tail distribution widened significantly
- This endpoint is on the critical purchase path — users feel this directly

ROOT_CAUSE_HINTS:
- The p50 only increased 25% but p99 increased 205% — suggests a conditional slow path
- Likely: a new database query that only fires for certain order types (e.g., orders with discounts)
- The max of 1200ms suggests occasional lock contention or missing index

EVALUATOR_ACTION: FAIL. Block release. Return specific numbers. Recommend: profile the endpoint with and without discounts, check for new queries added in this PR, verify database indexes.
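
The delta arithmetic and the 2x-p99 / 5x-max alert rule used on this page can be applied mechanically. A minimal illustrative sketch in Python; the dict layout and function names are assumptions, not an existing tool:

```python
# Sketch: compare current percentiles against a baseline and flag regressions.
# The 2x p99 / 5x max thresholds follow the alert rule in MEASUREMENT_PROTOCOL.

def percentile_deltas(baseline: dict, current: dict) -> dict:
    """Relative change per percentile, e.g. 0.25 for +25%."""
    return {k: (current[k] - baseline[k]) / baseline[k] for k in baseline}

def is_regression(baseline: dict, current: dict) -> bool:
    """Flag when p99 exceeds 2x baseline or max exceeds 5x baseline."""
    return (current["p99"] > 2 * baseline["p99"]
            or current["max"] > 5 * baseline["max"])

# Example 1's numbers (ms):
baseline = {"p50": 120, "p95": 180, "p99": 210, "max": 340}
current = {"p50": 150, "p95": 380, "p99": 640, "max": 1200}

deltas = percentile_deltas(baseline, current)
assert round(deltas["p99"] * 100) == 205  # +205%, far past the 2x threshold
assert is_regression(baseline, current)
```

Computing deltas per percentile rather than on a single summary number is what exposes the widening tail here: p50 alone (+25%) would not have triggered a block.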


EXAMPLE_2: MEASUREMENT NOISE — 5% VARIANCE BETWEEN RUNS

DECISION: PASS — WITHIN EXPECTED RANGE

CONFIDENCE: HIGH

MEASUREMENTS

Endpoint: GET /api/dashboard (authenticated, data aggregation)
Baseline (5 runs):

p50: 245ms  p95: 310ms  p99: 355ms

Current build (5 runs):

Run 1: p50: 252ms  p95: 318ms  p99: 362ms
Run 2: p50: 238ms  p95: 305ms  p99: 348ms
Run 3: p50: 258ms  p95: 322ms  p99: 370ms
Run 4: p50: 241ms  p95: 298ms  p99: 342ms
Run 5: p50: 250ms  p95: 315ms  p99: 360ms
Average: p50: 248ms  p95: 312ms  p99: 356ms

ANALYSIS

NO_REGRESSION:
- p50 delta: +3ms (+1.2%) — within noise
- p95 delta: +2ms (+0.6%) — within noise
- p99 delta: +1ms (+0.3%) — within noise
- Run-to-run variance: ~5% — this is normal for a benchmark running on shared infrastructure

WHY_5_PERCENT_IS_NOISE:
- k3s single-node has background processes (kubelet, containerd, system services)
- GC pauses in Node.js cause ~10ms jitter on individual requests
- Network stack introduces 1-3ms variance per request
- Database connection pool warm/cold state affects first few requests
- Expected variance on this infrastructure: 3-8% between identical runs

NOISE_THRESHOLD_RULES:

- < 5% variance with no percentile exceeding 10% → NOISE, pass
- 5-15% variance → INCONCLUSIVE, run 5 more times
- > 15% consistent variance → LIKELY REAL, investigate
- Any single percentile > 2x baseline → REAL regardless of other percentiles
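
These rules can be encoded directly. A minimal sketch; the function name, argument layout, and return labels are illustrative assumptions:

```python
# Sketch: classify a measurement per the noise-threshold rules above.
# variance: run-to-run variance as a fraction (0.05 == 5%)
# deltas:   per-percentile change vs baseline as a fraction
# ratios:   per-percentile current/baseline ratio

def classify(variance: float, deltas: dict, ratios: dict) -> str:
    if any(r > 2.0 for r in ratios.values()):
        return "REAL"           # any single percentile > 2x baseline wins
    if variance < 0.05 and all(abs(d) <= 0.10 for d in deltas.values()):
        return "NOISE"          # pass
    if variance <= 0.15:
        return "INCONCLUSIVE"   # run 5 more times
    return "LIKELY_REAL"        # investigate

# Example 2's numbers: ~4% run variance, deltas of +1.2% / +0.6% / +0.3%
deltas = {"p50": 0.012, "p95": 0.006, "p99": 0.003}
ratios = {"p50": 1.012, "p95": 1.006, "p99": 1.003}
assert classify(0.04, deltas, ratios) == "NOISE"
```

The 2x check runs first because the rules make it override the variance bands.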

EVALUATOR_ACTION: PASS. Note: "Dashboard endpoint stable. +1.2% p50 within expected measurement variance."


EXAMPLE_3: NEW ENDPOINT WITH NO BASELINE

DECISION: ESTABLISH BASELINE — CONDITIONAL PASS

CONFIDENCE: MEDIUM

SCENARIO

New feature: GET /api/projects/:id/analytics — first time this endpoint exists.
No baseline exists because the endpoint is new.

Current measurements (5 runs):

p50: 420ms  p95: 580ms  p99: 780ms  max: 1100ms

ANALYSIS

NO_BASELINE_AVAILABLE:
- Cannot determine regression because there is nothing to regress from
- Must establish whether these numbers are acceptable for this endpoint's purpose

BASELINE_ESTABLISHMENT_PROTOCOL:
1. CATEGORIZE the endpoint by type:
- Read-only data fetch: budget p99 < 500ms
- Data aggregation (this one): budget p99 < 1000ms
- Write operation: budget p99 < 300ms
- File upload: budget p99 < 2000ms

2. COMPARE to similar endpoints:
- GET /api/dashboard (similar aggregation): p99 = 355ms
- GET /api/reports/monthly (heavier aggregation): p99 = 890ms
- This endpoint's p99 of 780ms falls between the two — reasonable

3. CHECK against client SLO:
- Client has not specified an SLO for this endpoint
- Default SLO: p99 < 1000ms for data aggregation → 780ms passes

4. RECORD as baseline:

  ENDPOINT: GET /api/projects/:id/analytics
  BASELINE_DATE: 2026-03-24
  BASELINE_VALUES: p50=420ms p95=580ms p99=780ms
  CATEGORY: data-aggregation
  SLO: p99 < 1000ms (default)
  RUNS: 5
  CONFIDENCE: initial (needs 3 release cycles to stabilize)

WHY_CONDITIONAL_PASS:
- Numbers are within category budget
- But initial baselines have low confidence — the endpoint has never been tested under load
- Recommend: run load test (50 concurrent users) before calling this baseline stable

EVALUATOR_ACTION: CONDITIONAL PASS. Record baseline. Flag for load testing. Re-evaluate after 3 release cycles.
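
The category budgets and default-SLO check from the protocol above can be sketched as a lookup. Table contents mirror the protocol; the function name is an assumption:

```python
# Sketch: judge a new endpoint (no baseline) against its category's p99 budget.
DEFAULT_P99_BUDGET_MS = {
    "read-only": 500,
    "data-aggregation": 1000,
    "write": 300,
    "file-upload": 2000,
}

def check_new_endpoint(category: str, p99_ms: float) -> str:
    budget = DEFAULT_P99_BUDGET_MS[category]
    # With no baseline there is nothing to regress from; the category
    # budget is the only yardstick, hence CONDITIONAL rather than PASS.
    return "CONDITIONAL_PASS" if p99_ms < budget else "FAIL"

assert check_new_endpoint("data-aggregation", 780) == "CONDITIONAL_PASS"
assert check_new_endpoint("write", 780) == "FAIL"
```
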


EXAMPLE_4: CLIENT SLO VIOLATION — 2S LCP ON MOBILE

DECISION: FAIL — BUT NEEDS CONTEXT

CONFIDENCE: MEDIUM

MEASUREMENTS

Page: Client project landing page (/)
Client SLO: LCP < 1.5s on mobile (specified in contract)

Lab measurements (Lighthouse, mobile throttling):

LCP: 2.1s (FAIL — exceeds 1.5s SLO)
INP: 85ms (PASS — under 200ms)
CLS: 0.02 (PASS — under 0.1)
FCP: 0.8s (OK)
TTFB: 220ms (OK)

Field data (if available from previous release):

LCP p75: 1.8s (already above SLO before this PR)

ANALYSIS

SLO_VIOLATION:
- LCP 2.1s exceeds client's 1.5s SLO by 40%
- This is a contractual requirement, not an internal preference

CONTEXT_REQUIRED:
- Was LCP already above SLO before this PR? → CHECK FIELD DATA
- Field data shows p75 LCP was 1.8s — the SLO was ALREADY violated before this PR
- This PR made it worse: 1.8s → 2.1s (+17%)

TWO_SEPARATE_ISSUES:
1. PRE-EXISTING: LCP was 1.8s, already above 1.5s SLO → existing tech debt, not this PR's fault
2. THIS_PR: Added 300ms to LCP → this PR made a bad situation worse

DECISION_NUANCE:
- Block this PR for the 300ms regression (this PR caused real degradation)
- File separate ticket for the pre-existing SLO violation (not this PR's responsibility)
- Do NOT block this PR for the entire 600ms gap — only for the 300ms it added

WHAT_TO_INVESTIGATE:
- What changed in this PR that adds 300ms? (new images? unoptimized bundle? render-blocking script?)
- Is the LCP element the same as before, or did the LCP element change? (hero image vs text)
- Can the 300ms be recovered with lazy loading, compression, or CDN caching?

EVALUATOR_ACTION: FAIL for the regression. Block this PR. Note: "Pre-existing SLO violation (1.8s vs 1.5s target) is separate from this PR's +300ms regression. Both need fixing, but this PR is responsible for the delta only."
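
The delta attribution can be made explicit. A sketch using this example's numbers; note it mixes a field baseline with a lab measurement, as the example itself does:

```python
# Sketch: split the total SLO gap into pre-existing debt vs this PR's delta.
SLO_MS = 1500
field_lcp_before_ms = 1800   # p75 field LCP before this PR (already over SLO)
lab_lcp_after_ms = 2100      # lab LCP with this PR applied

pr_delta_ms = lab_lcp_after_ms - field_lcp_before_ms   # block the PR for this
preexisting_gap_ms = field_lcp_before_ms - SLO_MS      # file a separate ticket

assert pr_delta_ms == 300
assert preexisting_gap_ms == 300   # together they make up the 600ms total gap
```
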


MEASUREMENT_PROTOCOL

MINIMUM_REQUIREMENTS

  • Run count: minimum 5 runs per measurement
  • Warm-up: discard first run (cold start bias)
  • Environment: same hardware, same load, same time of day (avoid peak hours)
  • Isolation: no other benchmarks running simultaneously
  • Data: use representative test data (not empty DB, not 10M rows if prod has 10K)

PERCENTILE_SELECTION

  • REPORT: p50, p95, p99, max
  • DECIDE ON: p99 (this is what users in the tail experience)
  • ALERT ON: p99 > 2x baseline OR max > 5x baseline

VARIANCE_HANDLING

  • Calculate coefficient of variation (CV = stddev / mean) across runs
  • CV < 0.05: stable measurement, high confidence
  • CV 0.05-0.15: moderate variance, increase run count to 10
  • CV > 0.15: unstable measurement, investigate environment before concluding
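
A minimal CV sketch using Python's statistics module; the band labels are illustrative:

```python
import statistics

# Sketch: coefficient of variation across runs, bucketed per the bands above.
def cv(samples: list) -> float:
    return statistics.stdev(samples) / statistics.mean(samples)

def stability(samples: list) -> str:
    c = cv(samples)
    if c < 0.05:
        return "stable"      # high confidence
    if c <= 0.15:
        return "moderate"    # increase run count to 10
    return "unstable"        # investigate environment before concluding

p99_runs = [362, 348, 370, 342, 360]  # Example 2's five p99 values
assert stability(p99_runs) == "stable"
```
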

COMMON_PITFALLS

PITFALL_1: Testing on empty database
- A query that's fast on 100 rows may be slow on 100K rows
- Always seed representative data volume before benchmarking

PITFALL_2: Ignoring warm-up
- First request warms JIT, connection pools, caches
- First-request latency can be 5-10x steady-state latency
- Discard the first run or first 10 requests within a run

PITFALL_3: Comparing lab to field
- Lab measurements (Lighthouse) use simulated throttling
- Field measurements (CrUX, RUM) reflect real user conditions
- A lab pass does not guarantee a field pass; a lab fail under throttling is a strong warning, but confirm with field data before assuming real users are affected

PITFALL_4: Average instead of percentiles
- Average of [100ms, 100ms, 100ms, 100ms, 5000ms] = 1080ms
- p50 = 100ms, p99 = 5000ms
- The average hides that 1 in 5 users waits 5 seconds
- NEVER use average as the primary metric
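
The arithmetic in PITFALL_4, worked through:

```python
import statistics

# Sketch: the same five latencies through mean vs percentiles (ms).
latencies = [100, 100, 100, 100, 5000]

mean = statistics.mean(latencies)   # 1080 — describes no actual request
p50 = statistics.median(latencies)  # 100 — the typical user

assert mean == 1080
assert p50 == 100
assert max(latencies) == 5000       # the tail user: 1 in 5 waits 5 seconds
```
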