DOMAIN:BACKEND:NODE_PRODUCTION

OWNER: urszula (Team Alfa), maxim (Team Bravo)
UPDATED: 2026-03-24
SCOPE: Node.js production operations for all GE client projects
ALSO_USED_BY: sandro (hotfix/debugging), mira (incident response)


NODE:RUNTIME_CONFIG

RULE: Node.js 22 LTS for new projects (maintenance until April 2027)
RULE: Node.js 20 LTS acceptable for existing projects (maintenance until April 2026)
RULE: always pin Node.js version in .nvmrc and Dockerfile

# Dockerfile
FROM node:22-alpine AS base
WORKDIR /app

FROM base AS deps
COPY package.json package-lock.json ./
RUN npm ci --omit=dev

FROM base AS build
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build

FROM base AS runner
ENV NODE_ENV=production
COPY --from=deps /app/node_modules ./node_modules
COPY --from=build /app/dist ./dist
USER node
EXPOSE 3000
CMD ["node", "dist/index.js"]

# .nvmrc
22.14.0

NODE:MEMORY_MANAGEMENT

HEAP_SIZING

# Set max old space size based on container memory limit
# Rule: 75% of container memory for heap, rest for stack + native addons
# Container with 512MB → --max-old-space-size=384
# Container with 1GB → --max-old-space-size=768
# Container with 2GB → --max-old-space-size=1536

NODE_OPTIONS="--max-old-space-size=768"

RULE: always set --max-old-space-size explicitly in containers
WHY: Node.js sizes its default heap from total system memory, and in a container it sees the host's memory, not the cgroup limit
RULE: set to 75% of the container memory LIMIT, not the host memory
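The 75% rule can be enforced at startup. The sketch below is an illustrative check, not GE-mandated code: `recommendedHeapMB` applies the sizing rule, and `checkHeapCap` reads the cgroup v2 limit file (`/sys/fs/cgroup/memory.max`, the standard path; cgroup v1 paths differ) and warns when the V8 heap limit was not capped against it.

```typescript
// startup-check.ts — warn if the heap cap is missing or oversized (sketch; assumes cgroup v2)
import { readFileSync } from 'node:fs';
import { getHeapStatistics } from 'node:v8';

// 75% of the container memory limit, in MB (matches the sizing rule above)
export function recommendedHeapMB(containerLimitBytes: number): number {
  return Math.floor((containerLimitBytes * 0.75) / 1024 / 1024);
}

export function checkHeapCap(): void {
  let limitBytes: number | null = null;
  try {
    // cgroup v2 exposes the memory limit here; the literal string "max" means unlimited
    const raw = readFileSync('/sys/fs/cgroup/memory.max', 'utf8').trim();
    if (raw !== 'max') limitBytes = Number(raw);
  } catch {
    return; // no cgroup v2 memory file; likely not running in a container
  }
  if (limitBytes === null) return;

  // heap_size_limit reflects --max-old-space-size plus the other heap spaces
  const heapLimit = getHeapStatistics().heap_size_limit;
  if (heapLimit > limitBytes * 0.8) {
    console.warn(
      `V8 heap limit ${Math.round(heapLimit / 1048576)}MB is close to the container limit; ` +
      `set NODE_OPTIONS="--max-old-space-size=${recommendedHeapMB(limitBytes)}"`,
    );
  }
}
```

Call `checkHeapCap()` once at boot, before accepting traffic.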

MEMORY_LEAK_DETECTION

COMMON_LEAK_SOURCES:
- Closures retaining references to large objects
- Event listeners added but never removed
- Unbounded caches (Map/Set growing forever)
- Unclosed database connections
- Circular references preventing GC
- Global variables accumulating data across requests

// Monitor memory usage periodically
import { memoryUsage } from 'node:process';

const MEMORY_CHECK_INTERVAL = 60_000; // 1 minute
const HEAP_THRESHOLD = 0.85; // 85% of max heap

setInterval(() => {
  const mem = memoryUsage();
  const heapUsedMB = Math.round(mem.heapUsed / 1024 / 1024);
  const heapTotalMB = Math.round(mem.heapTotal / 1024 / 1024);
  const rssMB = Math.round(mem.rss / 1024 / 1024);

  logger.info({ heapUsedMB, heapTotalMB, rssMB }, 'Memory usage');

  if (mem.heapUsed / mem.heapTotal > HEAP_THRESHOLD) {
    logger.warn({ heapUsedMB, heapTotalMB }, 'Heap usage above threshold');
  }
}, MEMORY_CHECK_INTERVAL);

BOUNDED_CACHES

// NEVER: unbounded Map as cache
const cache = new Map(); // grows forever = memory leak

// ALWAYS: LRU cache with max size
import { LRUCache } from 'lru-cache';

const cache = new LRUCache<string, unknown>({
  max: 1000,           // max entries
  ttl: 1000 * 60 * 5,  // 5 min TTL
  maxSize: 50_000_000,  // 50MB max
  sizeCalculation: (value) => JSON.stringify(value).length,
});

ANTI_PATTERN: using global Map/Set as unbounded in-memory cache
FIX: use LRU cache with max entries and TTL, or use Redis


NODE:EVENT_LOOP_MONITORING

WHY_IT_MATTERS

FACT: Node.js runs JavaScript on a single thread — a blocked event loop blocks ALL requests
FACT: event loop lag > 20ms means performance degradation
FACT: event loop lag > 100ms means user-visible latency

MONITORING

// Built-in event loop lag monitoring (Node.js 16+)
import { monitorEventLoopDelay } from 'node:perf_hooks';

const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();

// Report periodically
setInterval(() => {
  const p50 = histogram.percentile(50) / 1e6; // nanoseconds to ms
  const p95 = histogram.percentile(95) / 1e6;
  const p99 = histogram.percentile(99) / 1e6;
  const max = histogram.max / 1e6;

  logger.info({ eventLoopLag: { p50, p95, p99, max } }, 'Event loop delay');

  if (p99 > 50) {
    logger.warn({ p99 }, 'Event loop lag above 50ms at p99');
  }

  histogram.reset();
}, 30_000);

COMMON_BLOCKERS

BLOCKER: JSON.parse/stringify of large objects (> 1MB)
FIX: stream JSON parsing, or move to worker thread

BLOCKER: synchronous file I/O (fs.readFileSync in request path)
FIX: use async fs.readFile, or read at startup and cache

BLOCKER: CPU-intensive computation (crypto, image processing, sorting large arrays)
FIX: use worker_threads for CPU work

BLOCKER: RegExp with catastrophic backtracking on user input
FIX: use re2 library for user-provided regex, or validate regex complexity

// Worker thread for CPU-intensive work
import { Worker, isMainThread, parentPort } from 'node:worker_threads';

if (isMainThread) {
  async function runInWorker<T>(workerPath: string, data: unknown): Promise<T> {
    return new Promise((resolve, reject) => {
      const worker = new Worker(workerPath, { workerData: data });
      worker.on('message', resolve);
      worker.on('error', reject);
      worker.on('exit', (code) => {
        if (code !== 0) reject(new Error(`Worker exited with code ${code}`));
      });
    });
  }
}
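The snippet above covers only the main-thread side. A matching worker file might look like the sketch below (the `workers/hash.ts` file name and the PBKDF2 task are hypothetical examples of CPU-heavy work):

```typescript
// workers/hash.ts — worker-side counterpart (hypothetical file name)
import { parentPort, workerData } from 'node:worker_threads';
import { pbkdf2Sync } from 'node:crypto';

// CPU-heavy key derivation; fine to run synchronously off the main event loop
export function deriveKey(password: string, salt: string): string {
  return pbkdf2Sync(password, salt, 100_000, 32, 'sha256').toString('hex');
}

// Only runs when this file is loaded as a worker via `new Worker(path, { workerData })`
if (parentPort) {
  const { password, salt } = workerData as { password: string; salt: string };
  // postMessage resolves the main thread's `runInWorker` promise
  parentPort.postMessage(deriveKey(password, salt));
}
```

Called from the main thread as something like `await runInWorker<string>('./workers/hash.js', { password, salt })`.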

NODE:GRACEFUL_SHUTDOWN

RULE: every GE backend service handles SIGTERM gracefully
RULE: stop accepting new connections, finish in-flight requests, close DB/Redis, then exit
RULE: force exit after timeout (10 seconds default)

// lib/shutdown.ts
import { logger } from './logger';

type ShutdownHandler = () => Promise<void>;
const handlers: ShutdownHandler[] = [];

export function onShutdown(handler: ShutdownHandler) {
  handlers.push(handler);
}

export function setupGracefulShutdown(server: ReturnType<typeof import('@hono/node-server').serve>) {
  let isShuttingDown = false;

  const shutdown = async (signal: string) => {
    if (isShuttingDown) return;
    isShuttingDown = true;

    logger.info({ signal }, 'Shutdown signal received, starting graceful shutdown');

    // Stop accepting new connections and wait for in-flight requests to finish
    // before closing DB/Redis — handlers may still need those connections
    await new Promise<void>((resolve) => {
      server.close(() => {
        logger.info('HTTP server closed');
        resolve();
      });
    });

    // Run shutdown handlers (DB, Redis, etc.)
    for (const handler of handlers) {
      try {
        await handler();
      } catch (err) {
        logger.error({ err }, 'Shutdown handler failed');
      }
    }

    logger.info('Graceful shutdown complete');
    process.exit(0);
  };

  // Force exit after timeout
  const forceExit = () => {
    logger.fatal('Forced shutdown after timeout');
    process.exit(1);
  };

  process.on('SIGTERM', () => {
    shutdown('SIGTERM');
    setTimeout(forceExit, 10_000).unref();
  });

  process.on('SIGINT', () => {
    shutdown('SIGINT');
    setTimeout(forceExit, 5_000).unref();
  });
}

// Usage in main entry
import { serve } from '@hono/node-server';
import { onShutdown, setupGracefulShutdown } from './lib/shutdown';
import { closeDatabase } from './db';
import { closeRedis } from './redis';
const server = serve({ fetch: app.fetch, port: 3000 });
onShutdown(closeDatabase);
onShutdown(closeRedis);
setupGracefulShutdown(server);

ANTI_PATTERN: process.exit(0) without closing connections
FIX: drain connections first, then exit

ANTI_PATTERN: no SIGTERM handler — k8s kills pod after 30s grace period with SIGKILL
FIX: handle SIGTERM, close within 10 seconds
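One shutdown wrinkle worth knowing: `server.close()` stops new connections but waits for every open socket, and idle keep-alive connections can hold it open until the force-exit timer fires. Since Node.js 18.2, `http.Server` can sever them. A minimal sketch, using a standalone `node:http` server for illustration:

```typescript
import { createServer } from 'node:http';

const server = createServer((req, res) => res.end('ok'));

// Close the HTTP server without being held hostage by idle keep-alive sockets
function shutdownHttp(timeoutMs: number): Promise<void> {
  return new Promise((resolve) => {
    // Stop accepting new connections; callback fires once all sockets are closed
    server.close(() => resolve());
    // Drop sockets with no in-flight request (Node.js 18.2+)
    server.closeIdleConnections();
    // Last resort: sever everything still open after the timeout
    setTimeout(() => server.closeAllConnections(), timeoutMs).unref();
  });
}
```

In-flight requests still get `timeoutMs` to finish; only idle connections are dropped immediately.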


NODE:STRUCTURED_LOGGING

PINO_SETUP

// lib/logger.ts
import pino from 'pino';

export const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  // Pretty print ONLY in development
  ...(process.env.NODE_ENV === 'development' && {
    transport: { target: 'pino-pretty' },
  }),
  // Redact sensitive fields
  redact: {
    paths: ['req.headers.authorization', 'body.password', 'body.token'],
    censor: '[REDACTED]',
  },
  // Standard fields
  base: {
    service: process.env.SERVICE_NAME ?? 'api',
    env: process.env.NODE_ENV ?? 'development',
  },
});

CHILD_LOGGERS_FOR_REQUEST_CONTEXT

// middleware/logger.ts
import { createMiddleware } from 'hono/factory';
import { logger as rootLogger } from '../lib/logger';

export const requestLogger = createMiddleware(async (c, next) => {
  const requestId = c.get('requestId') ?? crypto.randomUUID();
  const childLogger = rootLogger.child({
    requestId,
    method: c.req.method,
    path: c.req.path,
  });

  c.set('logger', childLogger);

  const start = Date.now();
  await next();
  const duration = Date.now() - start;

  childLogger.info({
    status: c.res.status,
    duration,
  }, 'Request completed');
});

// In handlers — use contextual logger
app.get('/users', (c) => {
  const log = c.get('logger');
  log.info('Fetching users');
  // All logs from this request share requestId automatically
  return c.json({ users: [] });
});

LOGGING_RULES

RULE: use structured fields, not string interpolation
RULE: log at appropriate levels: error (broken), warn (degraded), info (normal ops), debug (development)
RULE: include requestId in all request-scoped logs
RULE: never log PII (emails, names) at info level or above
RULE: redact authorization headers, passwords, tokens
RULE: pino-pretty in development ONLY — never in production

// BAD
logger.info(`User ${userId} created project ${projectId}`);

// GOOD
logger.info({ userId, projectId }, 'Project created');

CRASH_HANDLERS

// Handle fatal errors with a guaranteed log flush. pino v7+ removed pino.final();
// the default pino destination writes synchronously, so logger.fatal is safe here.
process.on('uncaughtException', (err) => {
  logger.fatal({ err }, 'Uncaught exception — process will exit');
  process.exit(1);
});

process.on('unhandledRejection', (reason) => {
  logger.fatal({ reason }, 'Unhandled promise rejection — process will exit');
  process.exit(1);
});

RULE: always handle uncaughtException and unhandledRejection
RULE: if using an async destination or transport, call logger.flush() before process.exit
RULE: exit after uncaughtException — process state is corrupt


NODE:HEALTH_CHECKS

LIVENESS_VS_READINESS

// /health — liveness: is the process alive?
// Checks: can the event loop respond? That's it.
app.get('/health', (c) => c.json({ status: 'ok', uptime: process.uptime() }));

// /ready — readiness: can we handle traffic?
// Checks: DB connected? Redis connected? External deps reachable?
app.get('/ready', async (c) => {
  const checks: Record<string, 'ok' | 'fail'> = {};

  try {
    await db.execute(sql`SELECT 1`);
    checks.database = 'ok';
  } catch { checks.database = 'fail'; }

  try {
    await redis.ping();
    checks.redis = 'ok';
  } catch { checks.redis = 'fail'; }

  const allOk = Object.values(checks).every((v) => v === 'ok');
  return c.json({ status: allOk ? 'ready' : 'degraded', checks }, allOk ? 200 : 503);
});

K8S_PROBE_CONFIG

livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 2

startupProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 2
  periodSeconds: 2
  failureThreshold: 30

NODE:METRICS

PROMETHEUS_METRICS

// lib/metrics.ts
import { createMiddleware } from 'hono/factory';
import { collectDefaultMetrics, register, Counter, Histogram } from 'prom-client';

// Default Node.js metrics (CPU, memory, event loop, GC)
collectDefaultMetrics({ prefix: 'app_' });

// Custom metrics
export const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status'],
});

export const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path', 'status'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});

// Middleware to collect metrics
export const metricsMiddleware = createMiddleware(async (c, next) => {
  const start = Date.now();
  await next();
  const duration = (Date.now() - start) / 1000;

  httpRequestsTotal.inc({ method: c.req.method, path: c.req.routePath, status: c.res.status });
  httpRequestDuration.observe({ method: c.req.method, path: c.req.routePath, status: c.res.status }, duration);
});

// Metrics endpoint
app.get('/metrics', async (c) => {
  const metrics = await register.metrics();
  return c.text(metrics, 200, { 'Content-Type': register.contentType });
});

RULE: expose /metrics endpoint for Prometheus scraping
RULE: use path templates in labels (e.g., /users/:id) not actual paths (e.g., /users/abc-123) — unbounded cardinality
RULE: collect default Node.js metrics (CPU, memory, event loop, GC)
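The cardinality rule deserves a guard: `c.req.routePath` can be unavailable for unmatched requests (404s, probes, scanners), and letting raw paths through mints a new label value per URL. A sketch of one way to clamp it (the route set and `pathLabel` helper are hypothetical):

```typescript
// Label guard for the metrics-cardinality rule (sketch)
// Route templates registered with the router; anything else collapses into one bucket
const KNOWN_ROUTES = new Set(['/users', '/users/:id', '/projects/:id']);

export function pathLabel(routePath: string | undefined): string {
  // Unmatched requests must not create unbounded Prometheus label values
  if (!routePath || !KNOWN_ROUTES.has(routePath)) return 'unmatched';
  return routePath;
}
```

Use it where the middleware builds labels, e.g. `httpRequestsTotal.inc({ path: pathLabel(c.req.routePath), ... })`.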


NODE:CLUSTERING_VS_HORIZONTAL_SCALING

CHECK: do you need more throughput from a single machine?
IF: running in k8s THEN: use horizontal pod autoscaling (HPA), NOT clustering
IF: running on bare metal THEN: use Node.js cluster module or PM2

GE_DECISION:
- All client projects run on k8s
- Use HPA for scaling (1-5 replicas, capped per GE token burn rules)
- Do NOT use Node.js cluster module in k8s — let k8s manage replicas
- Each pod = 1 Node.js process = 1 event loop
- Scale horizontally with multiple pods, not multiple processes per pod

ANTI_PATTERN: using cluster module inside k8s containers
FIX: let k8s HPA handle scaling — cluster module complicates health checks and signal handling


NODE:DEBUGGING_PRODUCTION

REMOTE_DEBUGGING (EMERGENCY ONLY)

# Enable inspector on a running process (no restart needed)
kill -USR1 <pid>
# Inspector binds to 127.0.0.1:9229 — tunnel in (e.g. ssh -L 9229:localhost:9229)
# and connect Chrome DevTools via chrome://inspect

RULE: never expose debug port in production permanently — only for active debugging sessions
RULE: use --inspect flag only in development

HEAP_SNAPSHOTS

// Trigger heap snapshot via API (admin-only endpoint)
import { writeHeapSnapshot } from 'node:v8';

app.post('/admin/heap-snapshot', requireRole('admin'), async (c) => {
  const filename = writeHeapSnapshot();
  return c.json({ success: true, data: { filename } });
});

CPU_PROFILING

// Trigger CPU profile via API (admin-only endpoint)
import { Session } from 'node:inspector/promises';

app.post('/admin/cpu-profile', requireRole('admin'), async (c) => {
  const session = new Session();
  session.connect();
  await session.post('Profiler.enable');
  await session.post('Profiler.start');

  // Profile for 10 seconds
  await new Promise((r) => setTimeout(r, 10_000));

  const { profile } = await session.post('Profiler.stop');
  session.disconnect();

  return c.json({ success: true, data: profile });
});

NODE:SECURITY_HARDENING

RULE: run as non-root user in container (USER node)
RULE: use --dns-result-order=ipv4first to avoid IPv6 resolution issues
RULE: set NODE_ENV=production — disables dev-only features, enables optimizations
RULE: freeze prototype chain if handling untrusted JSON: Object.freeze(Object.prototype) (advanced)
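A gentler alternative to freezing Object.prototype is to strip pollution vectors at the parse boundary. A sketch (not GE-mandated; `safeJsonParse` is a hypothetical helper) using JSON.parse's reviver, which deletes a property when the reviver returns undefined:

```typescript
// Drop prototype-pollution vectors while parsing untrusted JSON (sketch)
const FORBIDDEN_KEYS = new Set(['__proto__', 'constructor', 'prototype']);

export function safeJsonParse(text: string): unknown {
  return JSON.parse(text, (key, value) => {
    if (FORBIDDEN_KEYS.has(key)) return undefined; // property is deleted from the result
    return value;
  });
}
```

JSON.parse itself creates `__proto__` as a plain own property, but a later naive deep-merge of that object can pollute Object.prototype; dropping the key up front removes that path.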

// Helmet-like security headers in Hono
import { secureHeaders } from 'hono/secure-headers';

app.use('*', secureHeaders({
  contentSecurityPolicy: {
    defaultSrc: ["'self'"],
    scriptSrc: ["'self'"],
  },
  crossOriginEmbedderPolicy: 'require-corp',
  crossOriginOpenerPolicy: 'same-origin',
  crossOriginResourcePolicy: 'same-origin',
  referrerPolicy: 'strict-origin-when-cross-origin',
  strictTransportSecurity: 'max-age=31536000; includeSubDomains',
  xContentTypeOptions: 'nosniff',
  xFrameOptions: 'DENY',
}));

NODE:AGENTIC_CHECKLIST

ON_DEPLOYING_TO_PRODUCTION:
1. CHECK: is NODE_ENV=production set?
2. CHECK: is --max-old-space-size configured for container memory limit?
3. CHECK: is graceful shutdown handler present (SIGTERM + SIGINT)?
4. CHECK: is pino configured with JSON output (no pino-pretty)?
5. CHECK: are uncaughtException and unhandledRejection handlers present?
6. CHECK: are /health and /ready endpoints implemented?
7. CHECK: is memory monitoring in place (process.memoryUsage)?
8. CHECK: are event loop lag metrics being collected?
9. CHECK: is the container running as non-root user?
10. CHECK: are security headers set (secureHeaders middleware)?
11. RUN: load test to verify no memory leaks under sustained traffic
12. RUN: verify graceful shutdown works (send SIGTERM, check in-flight requests complete)