
DOMAIN:PERFORMANCE — PERFORMANCE_RUBRIC

OWNER: nessa
ALSO_USED_BY: marta (merge gate criterion 5), iwona (release readiness input)
UPDATED: 2026-03-24
SCOPE: performance pass/fail rubric with per-metric thresholds and decision trees — JIT injected before every performance evaluation
PURPOSE: standardize performance decisions across all client projects with clear thresholds, escalation paths, and no-baseline protocols


METRIC_THRESHOLDS

WEB_VITALS (client-facing pages)

Metric                            Good      Needs Improvement   Poor      Unit
LCP (Largest Contentful Paint)    < 1.5s    1.5s - 2.5s         > 2.5s    seconds
INP (Interaction to Next Paint)   < 200ms   200ms - 500ms       > 500ms   milliseconds
CLS (Cumulative Layout Shift)     < 0.1     0.1 - 0.25          > 0.25    unitless
FCP (First Contentful Paint)      < 1.0s    1.0s - 1.8s         > 1.8s    seconds
TTFB (Time to First Byte)         < 200ms   200ms - 500ms       > 500ms   milliseconds

SOURCE: Based on Google CrUX Web Vitals bands, with intentionally stricter GOOD thresholds (e.g. CrUX considers LCP ≤ 2.5s "good"; this rubric requires < 1.5s).
OVERRIDE: Client SLOs take precedence. If a client contract specifies LCP < 1.0s, that is the threshold.
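The banding above can be sketched as a small classifier. This is a minimal illustration, not shared tooling; the function and dictionary names are made up here, and the values are copied from the table.

```python
# Illustrative: rate a measured Web Vital against the rubric's bands.
WEB_VITAL_THRESHOLDS = {
    # metric: (good_upper_bound, poor_lower_bound), in the metric's own unit
    "LCP":  (1.5, 2.5),    # seconds
    "INP":  (200, 500),    # milliseconds
    "CLS":  (0.1, 0.25),   # unitless
    "FCP":  (1.0, 1.8),    # seconds
    "TTFB": (200, 500),    # milliseconds
}

def rate_web_vital(metric: str, value: float) -> str:
    """Return GOOD, NEEDS_IMPROVEMENT, or POOR for a measured value."""
    good, poor = WEB_VITAL_THRESHOLDS[metric]
    if value < good:
        return "GOOD"
    if value > poor:
        return "POOR"
    return "NEEDS_IMPROVEMENT"
```

Per the OVERRIDE rule, a client-contract SLO would replace the tuple for that metric before classification.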

API_ENDPOINTS (backend)

Metric        Good        Needs Improvement   Poor       Unit
p50 latency   < 100ms     100ms - 300ms       > 300ms    milliseconds
p95 latency   < 250ms     250ms - 500ms       > 500ms    milliseconds
p99 latency   < 500ms     500ms - 1000ms      > 1000ms   milliseconds
Error rate    < 0.1%      0.1% - 1%           > 1%       percentage
Throughput    > 100 rps   50 - 100 rps        < 50 rps   requests/second

ENDPOINT_TYPE_ADJUSTMENTS:
- Read endpoints (GET): use standard thresholds above
- Write endpoints (POST/PUT/DELETE): multiply latency thresholds by 1.5x (writes are inherently slower)
- Aggregation endpoints (reports, dashboards): multiply latency thresholds by 3x
- File upload endpoints: multiply latency thresholds by 5x, measure separately from other endpoints
- Health check endpoints: p99 must be < 50ms (used by k8s probes, affects pod restarts)
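The multipliers above can be applied mechanically to the standard latency bands. A minimal sketch (names illustrative; health checks are excluded because they use a fixed p99 < 50ms bound rather than a multiplier):

```python
# Standard API latency bands: percentile -> (good_upper, poor_lower), in ms.
STANDARD_LATENCY_MS = {"p50": (100, 300), "p95": (250, 500), "p99": (500, 1000)}

# Endpoint-type multipliers from the list above.
MULTIPLIER = {"READ": 1.0, "WRITE": 1.5, "AGGREGATION": 3.0, "UPLOAD": 5.0}

def adjusted_thresholds(endpoint_type: str) -> dict:
    """Scale the (good, poor) latency bounds by the endpoint-type multiplier."""
    m = MULTIPLIER[endpoint_type]
    return {p: (good * m, poor * m)
            for p, (good, poor) in STANDARD_LATENCY_MS.items()}
```

For example, an aggregation endpoint's p95 "good" budget becomes 750ms and its "poor" boundary 1500ms.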

BUNDLE_SIZE (frontend)

Metric                   Good                 Needs Improvement   Poor
Initial JS bundle        < 150 KB (gzipped)   150 - 300 KB        > 300 KB
Total JS (lazy-loaded)   < 500 KB             500 KB - 1 MB       > 1 MB
CSS bundle               < 50 KB (gzipped)    50 - 100 KB         > 100 KB
Largest single chunk     < 100 KB             100 - 200 KB        > 200 KB
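A sketch of how these bands might be checked against built assets. The band keys and the helper names are inventions for this example; real projects would wire this to their build output.

```python
import gzip
from pathlib import Path

# (good_upper, poor_lower), gzipped KB, from the table above.
BUNDLE_BANDS_KB = {
    "initial_js": (150, 300),
    "css": (50, 100),
    "chunk": (100, 200),
}

def gzipped_kb(path: Path) -> float:
    """Size of a built asset after gzip compression, in KB."""
    return len(gzip.compress(path.read_bytes())) / 1024

def rate_size(kind: str, size_kb: float) -> str:
    """Rate a gzipped size against the bundle-size bands."""
    good, poor = BUNDLE_BANDS_KB[kind]
    if size_kb < good:
        return "GOOD"
    if size_kb > poor:
        return "POOR"
    return "NEEDS_IMPROVEMENT"
```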

DECISION_TREE

PRIMARY_DECISION: BLOCK, WARN, OR PASS

START: Compare current metrics to baseline (or thresholds if no baseline)

1. Is any metric in the POOR range?
   ├─ YES → Is it a critical endpoint (auth, payment, checkout)?
   │   ├─ YES → BLOCK RELEASE
   │   └─ NO → Is the regression > 50% from baseline?
   │       ├─ YES → BLOCK RELEASE
   │       └─ NO → WARN (pass with mandatory follow-up ticket)
   └─ NO → Continue

2. Is any metric in the NEEDS IMPROVEMENT range?
   ├─ YES → Was it previously in the GOOD range?
   │   ├─ YES → Is the regression > 25% from baseline?
   │   │   ├─ YES → WARN (pass with follow-up ticket)
   │   │   └─ NO → PASS (note the change)
   │   └─ NO → PASS (already known, not a new regression)
   └─ NO → Continue

3. All metrics in GOOD range
   └─ PASS
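The tree above reduces to a short function. This is a sketch: the per-metric input shape ("rating", "critical", "regression_pct", "was_good") is an assumption of this example, not a defined schema.

```python
def primary_decision(metrics: list[dict]) -> str:
    """Walk the primary decision tree: PASS | WARN | BLOCK.

    Each metric dict carries its band ("GOOD"/"NEEDS_IMPROVEMENT"/"POOR"),
    whether the endpoint is critical, the % regression from baseline,
    and whether the metric was previously in the GOOD band.
    """
    decision = "PASS"
    for m in metrics:
        if m["rating"] == "POOR":
            # Critical endpoint or >50% regression blocks the release outright.
            if m["critical"] or m["regression_pct"] > 50:
                return "BLOCK"
            decision = "WARN"  # pass with mandatory follow-up ticket
        elif m["rating"] == "NEEDS_IMPROVEMENT":
            # WARN only on a fresh regression: previously GOOD and >25% worse.
            if m["was_good"] and m["regression_pct"] > 25:
                decision = "WARN"
    return decision
```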

CLIENT_SLO_OVERRIDE

Does the client have explicit SLOs in their contract?
├─ YES → Use client SLOs instead of standard thresholds
│   └─ Any client SLO violated?
│       ├─ YES → Was the SLO already violated before this PR?
│       │   ├─ YES → Did this PR make it WORSE?
│       │   │   ├─ YES → BLOCK (regression on already-bad metric)
│       │   │   └─ NO → WARN (pre-existing issue, not this PR's fault)
│       │   └─ NO → BLOCK (this PR caused the SLO violation)
│       └─ NO → Use standard decision tree above
└─ NO → Use standard thresholds
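The override logic above, as a sketch. "DEFER" here is an invented sentinel meaning "fall through to the standard decision tree"; the three boolean inputs mirror the tree's questions.

```python
def slo_decision(violated: bool, violated_before_pr: bool,
                 pr_made_worse: bool) -> str:
    """Apply the client-SLO override: BLOCK, WARN, or DEFER to standard tree."""
    if not violated:
        return "DEFER"  # no SLO violation: use the standard decision tree
    if violated_before_pr:
        # Pre-existing violation: block only if this PR made it worse.
        return "BLOCK" if pr_made_worse else "WARN"
    return "BLOCK"  # this PR caused the SLO violation
```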

NO_BASELINE_PROTOCOL

When evaluating a new endpoint or page with no historical data:

STEP_1: CATEGORIZE

Determine the endpoint type from the list:
- READ: simple data fetch (single record or small list)
- AGGREGATION: data computation (reports, dashboards, analytics)
- WRITE: data mutation (create, update, delete)
- UPLOAD: file handling
- HEALTH: system status check

STEP_2: APPLY_CATEGORY_THRESHOLDS

Use the standard API thresholds with the endpoint-type multiplier. For example, an aggregation endpoint gets 3x the standard latency budget.

STEP_3: COMPARE_TO_PEERS

Find similar endpoints in the same project or in other GE client projects:
- Same category and similar data volume → closest peer
- If the peer is 2x faster or more, investigate why the new endpoint is slower
- If the peer's latency is similar, accept the new endpoint's numbers as reasonable

STEP_4: RECORD_INITIAL_BASELINE

BASELINE_RECORD:
  endpoint: [path]
  date: [date]
  category: [type]
  p50: [value]  p95: [value]  p99: [value]
  data_volume: [approximate record count]
  confidence: initial
  next_review: [date + 3 release cycles]

STEP_5: CONDITIONAL_PASS

New endpoints with no baseline receive a CONDITIONAL PASS:
- Metric must be within category thresholds
- Baseline is recorded for future comparison
- Load test recommended before calling baseline stable
- Re-evaluate after 3 release cycles with production traffic data
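The BASELINE_RECORD template from STEP_4 could be represented as a small dataclass. The field names follow the template; the 14-day release-cycle length in next_review is an assumption of this example, not a documented value.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class BaselineRecord:
    endpoint: str           # path
    recorded: date
    category: str           # READ / AGGREGATION / WRITE / UPLOAD / HEALTH
    p50_ms: float
    p95_ms: float
    p99_ms: float
    data_volume: int        # approximate record count
    confidence: str = "initial"

    def next_review(self, release_cycle_days: int = 14) -> date:
        """Review after 3 release cycles (cycle length assumed, not specified)."""
        return self.recorded + timedelta(days=3 * release_cycle_days)
```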


REGRESSION_CLASSIFICATION

Regression Size   Classification   Action
< 5%              Noise            PASS, within measurement variance
5 - 15%           Minor            PASS with note, monitor next release
15 - 25%          Moderate         WARN, pass with follow-up investigation ticket
25 - 50%          Significant      WARN on non-critical, BLOCK on critical endpoints
50 - 100%         Major            BLOCK, investigate before merge
> 100% (2x+)      Critical         BLOCK, likely a bug rather than a tuning issue

CRITICAL_ENDPOINTS (always BLOCK at 25%+):
- Authentication and session management
- Payment processing
- Order creation and checkout
- Data export and download
- Health check endpoints (affects k8s pod lifecycle)
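The classification table and critical-endpoint rule above can be sketched as one function (names illustrative; boundary values taken directly from the table):

```python
def classify_regression(regression_pct: float, critical: bool) -> tuple[str, str]:
    """Map a % regression from baseline to (classification, action)."""
    if regression_pct < 5:
        return ("Noise", "PASS")
    if regression_pct < 15:
        return ("Minor", "PASS")
    if regression_pct < 25:
        return ("Moderate", "WARN")
    if regression_pct < 50:
        # Critical endpoints (auth, payment, checkout, export, health) BLOCK at 25%+.
        return ("Significant", "BLOCK" if critical else "WARN")
    if regression_pct < 100:
        return ("Major", "BLOCK")
    return ("Critical", "BLOCK")
```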


MEASUREMENT_STANDARDS

MINIMUM_RUN_COUNT

Decision                     Minimum Runs
PASS (no concerns)           5 runs
WARN (marginal)              10 runs
BLOCK (confirm regression)   10 runs with environment verification

ENVIRONMENT_REQUIREMENTS

  • Benchmark must run on dedicated environment (not shared dev cluster)
  • Database must be seeded with representative data volume
  • No other benchmarks running concurrently
  • First run discarded (warm-up)
  • Same hardware and configuration as previous baseline

VARIANCE_ACCEPTANCE

Coefficient of Variation   Interpretation
< 5%                       Stable, high confidence in results
5 - 10%                    Acceptable, results are usable
10 - 20%                   Noisy, increase run count and check environment
> 20%                      Unreliable, fix environment before drawing conclusions
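The coefficient of variation is the sample standard deviation divided by the mean, expressed as a percentage. A stdlib-only sketch (function names are illustrative):

```python
import statistics

def coefficient_of_variation(runs: list[float]) -> float:
    """Sample stdev over mean, as a percentage, across benchmark runs."""
    return statistics.stdev(runs) / statistics.mean(runs) * 100

def variance_band(cv_pct: float) -> str:
    """Map a CV percentage to the acceptance bands above."""
    if cv_pct < 5:
        return "Stable"
    if cv_pct < 10:
        return "Acceptable"
    if cv_pct < 20:
        return "Noisy"
    return "Unreliable"
```

For example, runs of 95ms, 100ms, and 105ms have a CV of exactly 5%, which lands in the Acceptable band.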

REPORTING_FORMAT

When returning a performance assessment, use this format:

PERFORMANCE_ASSESSMENT
Build: [build ID or PR number]
Decision: [PASS | WARN | BLOCK]

Metrics:
  LCP:     [value] ([GOOD|NEEDS_IMPROVEMENT|POOR]) [delta from baseline]
  INP:     [value] ([GOOD|NEEDS_IMPROVEMENT|POOR]) [delta from baseline]
  CLS:     [value] ([GOOD|NEEDS_IMPROVEMENT|POOR]) [delta from baseline]
  p99:     [value] ([GOOD|NEEDS_IMPROVEMENT|POOR]) [delta from baseline]

[If WARN: specific metric(s) that need follow-up]
[If BLOCK: specific regression with baseline comparison and recommended investigation]
[If no baseline: initial baseline recorded, conditional pass conditions]

ESCALATION_PATH

ESCALATE_TO_MARTA when:
- Performance block would delay a client delivery
- Client SLO conflict (SLO is unreasonable given the feature complexity)
- Infrastructure bottleneck suspected (not a code issue)

ESCALATE_TO_DEVELOPER when:
- Specific query or function identified as regression source
- Bundle analysis shows unexpected large dependency
- Missing index or N+1 query pattern detected

ESCALATE_TO_ANNA when:
- Spec requires a feature that is inherently slow (real-time aggregation of large dataset)
- Performance and functionality are in direct conflict
- Client SLO needs renegotiation based on technical constraints