
DOMAIN:PERFORMANCE — PERFORMANCE_RUBRIC

OWNER: nessa
ALSO_USED_BY: marta (merge gate criterion 5), iwona (release readiness input)
UPDATED: 2026-03-24
SCOPE: performance pass/fail rubric with per-metric thresholds and decision trees — JIT injected before every performance evaluation
PURPOSE: standardize performance decisions across all client projects with clear thresholds, escalation paths, and no-baseline protocols


METRIC_THRESHOLDS

WEB_VITALS (client-facing pages)

Metric                            Good      Needs Improvement   Poor      Unit
LCP (Largest Contentful Paint)    < 1.5s    1.5s - 2.5s         > 2.5s    seconds
INP (Interaction to Next Paint)   < 200ms   200ms - 500ms       > 500ms   milliseconds
CLS (Cumulative Layout Shift)     < 0.1     0.1 - 0.25          > 0.25    unitless
FCP (First Contentful Paint)      < 1.0s    1.0s - 1.8s         > 1.8s    seconds
TTFB (Time to First Byte)         < 200ms   200ms - 500ms       > 500ms   milliseconds

SOURCE: Based on Google CrUX Web Vitals bands, with intentionally stricter GOOD thresholds (e.g. CrUX considers LCP ≤ 2.5s "good"; this rubric requires < 1.5s).
OVERRIDE: Client SLOs take precedence. If a client contract specifies LCP < 1.0s, that is the threshold.
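The banding above can be sketched as a small classifier. This is a minimal illustration, not shared tooling; the function and dictionary names are made up here, and the values are copied from the table.

```python
# Illustrative: rate a measured Web Vital against the rubric's bands.
WEB_VITAL_THRESHOLDS = {
    # metric: (good_upper_bound, poor_lower_bound), in the metric's own unit
    "LCP":  (1.5, 2.5),    # seconds
    "INP":  (200, 500),    # milliseconds
    "CLS":  (0.1, 0.25),   # unitless
    "FCP":  (1.0, 1.8),    # seconds
    "TTFB": (200, 500),    # milliseconds
}

def rate_web_vital(metric: str, value: float) -> str:
    """Return GOOD, NEEDS_IMPROVEMENT, or POOR for a measured value."""
    good, poor = WEB_VITAL_THRESHOLDS[metric]
    if value < good:
        return "GOOD"
    if value > poor:
        return "POOR"
    return "NEEDS_IMPROVEMENT"
```

Per the OVERRIDE rule, a client-contract SLO would replace the tuple for that metric before classification.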

API_ENDPOINTS (backend)

Metric        Good        Needs Improvement   Poor       Unit
p50 latency   < 100ms     100ms - 300ms       > 300ms    milliseconds
p95 latency   < 250ms     250ms - 500ms       > 500ms    milliseconds
p99 latency   < 500ms     500ms - 1000ms      > 1000ms   milliseconds
Error rate    < 0.1%      0.1% - 1%           > 1%       percentage
Throughput    > 100 rps   50 - 100 rps        < 50 rps   requests/second

ENDPOINT_TYPE_ADJUSTMENTS:
- Read endpoints (GET): use standard thresholds above
- Write endpoints (POST/PUT/DELETE): multiply latency thresholds by 1.5x (writes are inherently slower)
- Aggregation endpoints (reports, dashboards): multiply latency thresholds by 3x
- File upload endpoints: multiply latency thresholds by 5x, measure separately from other endpoints
- Health check endpoints: p99 must be < 50ms (used by k8s probes, affects pod restarts)
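The multipliers above can be applied mechanically to the standard latency bands. A minimal sketch (names illustrative; health checks are excluded because they use a fixed p99 < 50ms bound rather than a multiplier):

```python
# Standard API latency bands: percentile -> (good_upper, poor_lower), in ms.
STANDARD_LATENCY_MS = {"p50": (100, 300), "p95": (250, 500), "p99": (500, 1000)}

# Endpoint-type multipliers from the list above.
MULTIPLIER = {"READ": 1.0, "WRITE": 1.5, "AGGREGATION": 3.0, "UPLOAD": 5.0}

def adjusted_thresholds(endpoint_type: str) -> dict:
    """Scale the (good, poor) latency bounds by the endpoint-type multiplier."""
    m = MULTIPLIER[endpoint_type]
    return {p: (good * m, poor * m)
            for p, (good, poor) in STANDARD_LATENCY_MS.items()}
```

For example, an aggregation endpoint's p95 "good" budget becomes 750ms and its "poor" boundary 1500ms.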

BUNDLE_SIZE (frontend)

Metric                   Good                 Needs Improvement   Poor
Initial JS bundle        < 150 KB (gzipped)   150 - 300 KB        > 300 KB
Total JS (lazy-loaded)   < 500 KB             500 KB - 1 MB       > 1 MB
CSS bundle               < 50 KB (gzipped)    50 - 100 KB         > 100 KB
Largest single chunk     < 100 KB             100 - 200 KB        > 200 KB
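A sketch of how these bands might be checked against built assets. The band keys and the helper names are inventions for this example; real projects would wire this to their build output.

```python
import gzip
from pathlib import Path

# (good_upper, poor_lower), gzipped KB, from the table above.
BUNDLE_BANDS_KB = {
    "initial_js": (150, 300),
    "css": (50, 100),
    "chunk": (100, 200),
}

def gzipped_kb(path: Path) -> float:
    """Size of a built asset after gzip compression, in KB."""
    return len(gzip.compress(path.read_bytes())) / 1024

def rate_size(kind: str, size_kb: float) -> str:
    """Rate a gzipped size against the bundle-size bands."""
    good, poor = BUNDLE_BANDS_KB[kind]
    if size_kb < good:
        return "GOOD"
    if size_kb > poor:
        return "POOR"
    return "NEEDS_IMPROVEMENT"
```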

DECISION_TREE

PRIMARY_DECISION: BLOCK, WARN, OR PASS

START: Compare current metrics to baseline (or thresholds if no baseline)

1. Is any metric in the POOR range?
   ├─ YES → Is it a critical endpoint (auth, payment, checkout)?
   │   ├─ YES → BLOCK RELEASE
   │   └─ NO → Is the regression > 50% from baseline?
   │       ├─ YES → BLOCK RELEASE
   │       └─ NO → WARN (pass with mandatory follow-up ticket)
   └─ NO → Continue

2. Is any metric in the NEEDS IMPROVEMENT range?
   ├─ YES → Was it previously in the GOOD range?
   │   ├─ YES → Is the regression > 25% from baseline?
   │   │   ├─ YES → WARN (pass with follow-up ticket)
   │   │   └─ NO → PASS (note the change)
   │   └─ NO → PASS (already known, not a new regression)
   └─ NO → Continue

3. All metrics in GOOD range
   └─ PASS
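The tree above reduces to a short function. This is a sketch: the per-metric input shape ("rating", "critical", "regression_pct", "was_good") is an assumption of this example, not a defined schema.

```python
def primary_decision(metrics: list[dict]) -> str:
    """Walk the primary decision tree: PASS | WARN | BLOCK.

    Each metric dict carries its band ("GOOD"/"NEEDS_IMPROVEMENT"/"POOR"),
    whether the endpoint is critical, the % regression from baseline,
    and whether the metric was previously in the GOOD band.
    """
    decision = "PASS"
    for m in metrics:
        if m["rating"] == "POOR":
            # Critical endpoint or >50% regression blocks the release outright.
            if m["critical"] or m["regression_pct"] > 50:
                return "BLOCK"
            decision = "WARN"  # pass with mandatory follow-up ticket
        elif m["rating"] == "NEEDS_IMPROVEMENT":
            # WARN only on a fresh regression: previously GOOD and >25% worse.
            if m["was_good"] and m["regression_pct"] > 25:
                decision = "WARN"
    return decision
```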

CLIENT_SLO_OVERRIDE

Does the client have explicit SLOs in their contract?
├─ YES → Use client SLOs instead of standard thresholds
│   └─ Any client SLO violated?
│       ├─ YES → Was the SLO already violated before this PR?
│       │   ├─ YES → Did this PR make it WORSE?
│       │   │   ├─ YES → BLOCK (regression on already-bad metric)
│       │   │   └─ NO → WARN (pre-existing issue, not this PR's fault)
│       │   └─ NO → BLOCK (this PR caused the SLO violation)
│       └─ NO → Use standard decision tree above
└─ NO → Use standard thresholds
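The override logic above, as a sketch. "DEFER" here is an invented sentinel meaning "fall through to the standard decision tree"; the three boolean inputs mirror the tree's questions.

```python
def slo_decision(violated: bool, violated_before_pr: bool,
                 pr_made_worse: bool) -> str:
    """Apply the client-SLO override: BLOCK, WARN, or DEFER to standard tree."""
    if not violated:
        return "DEFER"  # no SLO violation: use the standard decision tree
    if violated_before_pr:
        # Pre-existing violation: block only if this PR made it worse.
        return "BLOCK" if pr_made_worse else "WARN"
    return "BLOCK"  # this PR caused the SLO violation
```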

NO_BASELINE_PROTOCOL

When evaluating a new endpoint or page with no historical data:

STEP_1: CATEGORIZE

Determine the endpoint type from the list:
- READ: simple data fetch (single record or small list)
- AGGREGATION: data computation (reports, dashboards, analytics)
- WRITE: data mutation (create, update, delete)
- UPLOAD: file handling
- HEALTH: system status check

STEP_2: APPLY_CATEGORY_THRESHOLDS

Use the standard API thresholds with the endpoint-type multiplier. For example, an aggregation endpoint gets 3x the standard latency budget.

STEP_3: COMPARE_TO_PEERS

Find similar endpoints in the same project or in other GE client projects:
- Same category and similar data volume → closest peer
- If the peer is 2x faster or more, investigate why the new endpoint is slower
- If the peer's latency is similar, accept the new endpoint's numbers as reasonable

STEP_4: RECORD_INITIAL_BASELINE

BASELINE_RECORD:
  endpoint: [path]
  date: [date]
  category: [type]
  p50: [value]  p95: [value]  p99: [value]
  data_volume: [approximate record count]
  confidence: initial
  next_review: [date + 3 release cycles]

STEP_5: CONDITIONAL_PASS

New endpoints with no baseline receive a CONDITIONAL PASS:
- Metric must be within category thresholds
- Baseline is recorded for future comparison
- Load test recommended before calling baseline stable
- Re-evaluate after 3 release cycles with production traffic data
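The BASELINE_RECORD template from STEP_4 could be represented as a small dataclass. The field names follow the template; the 14-day release-cycle length in next_review is an assumption of this example, not a documented value.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class BaselineRecord:
    endpoint: str           # path
    recorded: date
    category: str           # READ / AGGREGATION / WRITE / UPLOAD / HEALTH
    p50_ms: float
    p95_ms: float
    p99_ms: float
    data_volume: int        # approximate record count
    confidence: str = "initial"

    def next_review(self, release_cycle_days: int = 14) -> date:
        """Review after 3 release cycles (cycle length assumed, not specified)."""
        return self.recorded + timedelta(days=3 * release_cycle_days)
```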


REGRESSION_CLASSIFICATION

Regression Size   Classification   Action
< 5%              Noise            PASS, within measurement variance
5 - 15%           Minor            PASS with note, monitor next release
15 - 25%          Moderate         WARN, pass with follow-up investigation ticket
25 - 50%          Significant      WARN on non-critical, BLOCK on critical endpoints
50 - 100%         Major            BLOCK, investigate before merge
> 100% (2x+)      Critical         BLOCK, likely a bug rather than a tuning issue

CRITICAL_ENDPOINTS (always BLOCK at 25%+):
- Authentication and session management
- Payment processing
- Order creation and checkout
- Data export and download
- Health check endpoints (affects k8s pod lifecycle)
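The classification table and critical-endpoint rule above can be sketched as one function (names illustrative; boundary values taken directly from the table):

```python
def classify_regression(regression_pct: float, critical: bool) -> tuple[str, str]:
    """Map a % regression from baseline to (classification, action)."""
    if regression_pct < 5:
        return ("Noise", "PASS")
    if regression_pct < 15:
        return ("Minor", "PASS")
    if regression_pct < 25:
        return ("Moderate", "WARN")
    if regression_pct < 50:
        # Critical endpoints (auth, payment, checkout, export, health) BLOCK at 25%+.
        return ("Significant", "BLOCK" if critical else "WARN")
    if regression_pct < 100:
        return ("Major", "BLOCK")
    return ("Critical", "BLOCK")
```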


MEASUREMENT_STANDARDS

MINIMUM_RUN_COUNT

Decision                     Minimum Runs
PASS (no concerns)           5 runs
WARN (marginal)              10 runs
BLOCK (confirm regression)   10 runs with environment verification

ENVIRONMENT_REQUIREMENTS

  • Benchmark must run on dedicated environment (not shared dev cluster)
  • Database must be seeded with representative data volume
  • No other benchmarks running concurrently
  • First run discarded (warm-up)
  • Same hardware and configuration as previous baseline

VARIANCE_ACCEPTANCE

Coefficient of Variation   Interpretation
< 5%                       Stable, high confidence in results
5 - 10%                    Acceptable, results are usable
10 - 20%                   Noisy, increase run count and check environment
> 20%                      Unreliable, fix environment before drawing conclusions
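The coefficient of variation is the sample standard deviation divided by the mean, expressed as a percentage. A stdlib-only sketch (function names are illustrative):

```python
import statistics

def coefficient_of_variation(runs: list[float]) -> float:
    """Sample stdev over mean, as a percentage, across benchmark runs."""
    return statistics.stdev(runs) / statistics.mean(runs) * 100

def variance_band(cv_pct: float) -> str:
    """Map a CV percentage to the acceptance bands above."""
    if cv_pct < 5:
        return "Stable"
    if cv_pct < 10:
        return "Acceptable"
    if cv_pct < 20:
        return "Noisy"
    return "Unreliable"
```

For example, runs of 95ms, 100ms, and 105ms have a CV of exactly 5%, which lands in the Acceptable band.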

REPORTING_FORMAT

When returning a performance assessment, use this format:

PERFORMANCE_ASSESSMENT
Build: [build ID or PR number]
Decision: [PASS | WARN | BLOCK]

Metrics:
  LCP:     [value] ([GOOD|NEEDS_IMPROVEMENT|POOR]) [delta from baseline]
  INP:     [value] ([GOOD|NEEDS_IMPROVEMENT|POOR]) [delta from baseline]
  CLS:     [value] ([GOOD|NEEDS_IMPROVEMENT|POOR]) [delta from baseline]
  p99:     [value] ([GOOD|NEEDS_IMPROVEMENT|POOR]) [delta from baseline]

[If WARN: specific metric(s) that need follow-up]
[If BLOCK: specific regression with baseline comparison and recommended investigation]
[If no baseline: initial baseline recorded, conditional pass conditions]

ESCALATION_PATH

ESCALATE_TO_MARTA when:
- Performance block would delay a client delivery
- Client SLO conflict (SLO is unreasonable given the feature complexity)
- Infrastructure bottleneck suspected (not a code issue)

ESCALATE_TO_DEVELOPER when:
- Specific query or function identified as regression source
- Bundle analysis shows unexpected large dependency
- Missing index or N+1 query pattern detected

ESCALATE_TO_ANNA when:
- Spec requires a feature that is inherently slow (real-time aggregation of large dataset)
- Performance and functionality are in direct conflict
- Client SLO needs renegotiation based on technical constraints