DOMAIN:BACKEND:NODE_PRODUCTION¶
OWNER: urszula (Team Alfa), maxim (Team Bravo)
UPDATED: 2026-03-24
SCOPE: Node.js production operations for all GE client projects
ALSO_USED_BY: sandro (hotfix/debugging), mira (incident response)
NODE:RUNTIME_CONFIG¶
RULE: Node.js 22 LTS for new projects (supported until April 2027)
RULE: Node.js 20 LTS acceptable for existing projects (maintenance until April 2026)
RULE: always pin Node.js version in .nvmrc and Dockerfile
# Dockerfile
FROM node:22-alpine AS base
WORKDIR /app
FROM base AS deps
COPY package.json package-lock.json ./
RUN npm ci --omit=dev
FROM base AS build
COPY package.json package-lock.json ./
RUN npm ci
COPY . .
RUN npm run build
FROM base AS runner
ENV NODE_ENV=production
COPY --from=deps /app/node_modules ./node_modules
COPY --from=build /app/dist ./dist
USER node
EXPOSE 3000
CMD ["node", "dist/index.js"]
NODE:MEMORY_MANAGEMENT¶
HEAP_SIZING¶
# Set max old space size based on container memory limit
# Rule: 75% of container memory for heap, rest for stack + native addons
# Container with 512MB → --max-old-space-size=384
# Container with 1GB → --max-old-space-size=768
# Container with 2GB → --max-old-space-size=1536
NODE_OPTIONS="--max-old-space-size=768"
RULE: always set --max-old-space-size explicitly in containers
WHY: Node.js sizes its default heap from total system memory, and inside a container it sees the host's memory, not the cgroup limit
RULE: set to 75% of the container memory LIMIT, not the host memory
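To confirm the flag actually took effect inside the container, the effective limit can be read back at startup with the built-in node:v8 API (a minimal sketch):

```typescript
// Log the effective V8 heap limit at startup to verify --max-old-space-size was applied
import { getHeapStatistics } from 'node:v8';

const heapLimitMB = Math.round(getHeapStatistics().heap_size_limit / 1024 / 1024);
console.log(`V8 heap limit: ${heapLimitMB} MB`);
```

If the logged value does not match the configured flag, NODE_OPTIONS is not reaching the process.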
MEMORY_LEAK_DETECTION¶
COMMON_LEAK_SOURCES:
- Closures retaining references to large objects
- Event listeners added but never removed
- Unbounded caches (Map/Set growing forever)
- Unclosed database connections
- Circular references preventing GC
- Global variables accumulating data across requests
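The listener-leak case from the list above can be sketched with node:events: adding a fresh closure per request grows the listener array forever, while registering once and removing after use stays bounded. A hypothetical illustration:

```typescript
import { EventEmitter } from 'node:events';

const bus = new EventEmitter();
bus.setMaxListeners(0); // silence the max-listeners warning for this demo

// LEAK: a new closure is added on every call and never removed
function handleRequestLeaky(): void {
  bus.on('data', (chunk: unknown) => { /* process chunk */ });
}

// FIXED: keep a reference and remove the listener when the request finishes
function handleRequestFixed(): void {
  const onData = (chunk: unknown) => { /* process chunk */ };
  bus.on('data', onData);
  bus.off('data', onData);
}

for (let i = 0; i < 100; i++) {
  handleRequestLeaky();
  handleRequestFixed();
}
console.log(bus.listenerCount('data')); // grows with the leaky calls only
```

`listenerCount` is a cheap way to confirm this class of leak in a heap-growth investigation.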
// Monitor memory usage periodically
import { memoryUsage } from 'node:process';
import { logger } from './logger';
const MEMORY_CHECK_INTERVAL = 60_000; // 1 minute
const HEAP_THRESHOLD = 0.85; // 85% of max heap
setInterval(() => {
const mem = memoryUsage();
const heapUsedMB = Math.round(mem.heapUsed / 1024 / 1024);
const heapTotalMB = Math.round(mem.heapTotal / 1024 / 1024);
const rssMB = Math.round(mem.rss / 1024 / 1024);
logger.info({ heapUsedMB, heapTotalMB, rssMB }, 'Memory usage');
if (mem.heapUsed / mem.heapTotal > HEAP_THRESHOLD) {
logger.warn({ heapUsedMB, heapTotalMB }, 'Heap usage above threshold');
}
}, MEMORY_CHECK_INTERVAL);
BOUNDED_CACHES¶
// NEVER: unbounded Map as cache
const cache = new Map(); // grows forever = memory leak
// ALWAYS: LRU cache with max size
import { LRUCache } from 'lru-cache';
const cache = new LRUCache<string, unknown>({
max: 1000, // max entries
ttl: 1000 * 60 * 5, // 5 min TTL
maxSize: 50_000_000, // 50MB max
sizeCalculation: (value) => JSON.stringify(value).length,
});
ANTI_PATTERN: using a global Map/Set as an unbounded in-memory cache
FIX: use LRU cache with max entries and TTL, or use Redis
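If pulling in lru-cache is not an option, a bounded cache can be sketched with a plain Map, exploiting its insertion-order iteration (entry-count bound only, no TTL or byte sizing; lru-cache remains the recommended choice):

```typescript
// Minimal LRU sketch: a Map iterates in insertion order, so the first key is least recently used
class TinyLru<K, V> {
  private map = new Map<K, V>();
  constructor(private readonly max: number) {}

  get(key: K): V | undefined {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key)!;
    this.map.delete(key);       // re-insert to mark as most recently used
    this.map.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.max) {
      // evict the least recently used entry (first key in insertion order)
      this.map.delete(this.map.keys().next().value!);
    }
  }
}

const lru = new TinyLru<string, number>(2);
lru.set('a', 1);
lru.set('b', 2);
lru.get('a');      // touch 'a' so 'b' becomes least recent
lru.set('c', 3);   // evicts 'b'
console.log(lru.get('b')); // undefined
```

TinyLru is an illustrative name; the point is that bounding is a few lines, so there is no excuse for an unbounded Map.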
NODE:EVENT_LOOP_MONITORING¶
WHY_IT_MATTERS¶
FACT: Node.js is single-threaded — a blocked event loop blocks ALL requests
FACT: event loop lag > 20ms means performance degradation
FACT: event loop lag > 100ms means user-visible latency
MONITORING¶
// Built-in event loop lag monitoring (Node.js 16+)
import { monitorEventLoopDelay } from 'node:perf_hooks';
const histogram = monitorEventLoopDelay({ resolution: 20 });
histogram.enable();
// Report periodically
setInterval(() => {
const p50 = histogram.percentile(50) / 1e6; // nanoseconds to ms
const p95 = histogram.percentile(95) / 1e6;
const p99 = histogram.percentile(99) / 1e6;
const max = histogram.max / 1e6;
logger.info({ eventLoopLag: { p50, p95, p99, max } }, 'Event loop delay');
if (p99 > 50) {
logger.warn({ p99 }, 'Event loop lag above 50ms at p99');
}
histogram.reset();
}, 30_000);
COMMON_BLOCKERS¶
BLOCKER: JSON.parse/stringify of large objects (> 1MB)
FIX: stream JSON parsing, or move to worker thread
BLOCKER: synchronous file I/O (fs.readFileSync in request path)
FIX: use async fs.readFile, or read at startup and cache
BLOCKER: CPU-intensive computation (crypto, image processing, sorting large arrays)
FIX: use worker_threads for CPU work
BLOCKER: RegExp with catastrophic backtracking on user input
FIX: use re2 library for user-provided regex, or validate regex complexity
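For the RegExp blocker above, when re2 is unavailable a cheap mitigation is to bound the input before matching; length capping reduces exposure to backtracking blowups, though it is not a full substitute for re2 on truly untrusted patterns. A hedged sketch (safeTest is a hypothetical helper, not a library API):

```typescript
// Hypothetical guard: refuse to run a regex over oversized user input
function safeTest(re: RegExp, input: string, maxLength = 1_000): boolean | null {
  if (input.length > maxLength) return null; // caller treats null as "rejected, not matched"
  return re.test(input);
}

const emailish = /^[^@\s]+@[^@\s]+$/;
console.log(safeTest(emailish, 'user@example.com')); // true
console.log(safeTest(emailish, 'x'.repeat(5_000)));  // null (input too large)
```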
// Worker thread for CPU-intensive work
import { Worker, isMainThread, parentPort } from 'node:worker_threads';
if (isMainThread) {
async function runInWorker<T>(workerPath: string, data: unknown): Promise<T> {
return new Promise((resolve, reject) => {
const worker = new Worker(workerPath, { workerData: data });
worker.on('message', resolve);
worker.on('error', reject);
worker.on('exit', (code) => {
if (code !== 0) reject(new Error(`Worker exited with code ${code}`));
});
});
}
}
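The snippet above shows only the main-thread side; the file passed as workerPath must read workerData and post its result back. A minimal sketch of that counterpart (hypothetical worker.ts, with fib standing in for real CPU-heavy work):

```typescript
// worker.ts (hypothetical) — counterpart to the runInWorker helper above
import { parentPort, workerData } from 'node:worker_threads';

// Stand-in for real CPU-intensive work
function fib(n: number): number {
  return n < 2 ? n : fib(n - 1) + fib(n - 2);
}

// parentPort is null when this file is loaded on the main thread,
// so optional chaining makes the module safe to import anywhere
parentPort?.postMessage(fib(workerData as number));
```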
NODE:GRACEFUL_SHUTDOWN¶
RULE: every GE backend service handles SIGTERM gracefully
RULE: stop accepting new connections, finish in-flight requests, close DB/Redis, then exit
RULE: force exit after timeout (10 seconds default)
// lib/shutdown.ts
import { logger } from './logger';
type ShutdownHandler = () => Promise<void>;
const handlers: ShutdownHandler[] = [];
export function onShutdown(handler: ShutdownHandler) {
handlers.push(handler);
}
export function setupGracefulShutdown(server: ReturnType<typeof import('@hono/node-server').serve>) {
let isShuttingDown = false;
const shutdown = async (signal: string) => {
if (isShuttingDown) return;
isShuttingDown = true;
logger.info({ signal }, 'Shutdown signal received, starting graceful shutdown');
// Stop accepting new connections, then wait for in-flight requests to finish
await new Promise<void>((resolve) => server.close(() => resolve()));
logger.info('HTTP server closed');
// Run shutdown handlers (DB, Redis, etc.)
for (const handler of handlers) {
try {
await handler();
} catch (err) {
logger.error({ err }, 'Shutdown handler failed');
}
}
logger.info('Graceful shutdown complete');
process.exit(0);
};
// Force exit after timeout
const forceExit = () => {
logger.fatal('Forced shutdown after timeout');
process.exit(1);
};
process.on('SIGTERM', () => {
shutdown('SIGTERM');
setTimeout(forceExit, 10_000).unref();
});
process.on('SIGINT', () => {
shutdown('SIGINT');
setTimeout(forceExit, 5_000).unref();
});
}
// Usage in main entry
import { serve } from '@hono/node-server';
import { onShutdown, setupGracefulShutdown } from './lib/shutdown';
import { closeDatabase } from './db';
import { closeRedis } from './redis';
const server = serve({ fetch: app.fetch, port: 3000 });
onShutdown(closeDatabase);
onShutdown(closeRedis);
setupGracefulShutdown(server);
ANTI_PATTERN: process.exit(0) without closing connections
FIX: drain connections first, then exit
ANTI_PATTERN: no SIGTERM handler — k8s kills pod after 30s grace period with SIGKILL
FIX: handle SIGTERM, close within 10 seconds
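The 10-second force-exit above must fit inside the pod's termination grace period, or kubelet will SIGKILL the process mid-drain. The k8s default of 30 seconds leaves headroom, but it is worth pinning explicitly (values assumed, sketch only):

```yaml
spec:
  terminationGracePeriodSeconds: 30  # must exceed the app's 10s force-exit timeout
```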
NODE:STRUCTURED_LOGGING¶
PINO_SETUP¶
// lib/logger.ts
import pino from 'pino';
export const logger = pino({
level: process.env.LOG_LEVEL ?? 'info',
// Pretty print ONLY in development
...(process.env.NODE_ENV === 'development' && {
transport: { target: 'pino-pretty' },
}),
// Redact sensitive fields
redact: {
paths: ['req.headers.authorization', 'body.password', 'body.token'],
censor: '[REDACTED]',
},
// Standard fields
base: {
service: process.env.SERVICE_NAME ?? 'api',
env: process.env.NODE_ENV ?? 'development',
},
});
CHILD_LOGGERS_FOR_REQUEST_CONTEXT¶
// middleware/logger.ts
import { createMiddleware } from 'hono/factory';
import { logger as rootLogger } from '../lib/logger';
export const requestLogger = createMiddleware(async (c, next) => {
const requestId = c.get('requestId') ?? crypto.randomUUID();
const childLogger = rootLogger.child({
requestId,
method: c.req.method,
path: c.req.path,
});
c.set('logger', childLogger);
const start = Date.now();
await next();
const duration = Date.now() - start;
childLogger.info({
status: c.res.status,
duration,
}, 'Request completed');
});
// In handlers — use contextual logger
app.get('/users', (c) => {
const log = c.get('logger');
log.info('Fetching users');
// All logs from this request share requestId automatically
return c.json({ users: [] });
});
LOGGING_RULES¶
RULE: use structured fields, not string interpolation
RULE: log at appropriate levels: error (broken), warn (degraded), info (normal ops), debug (development)
RULE: include requestId in all request-scoped logs
RULE: never log PII (emails, names) at info level or above
RULE: redact authorization headers, passwords, tokens
RULE: pino-pretty in development ONLY — never in production
// BAD
logger.info(`User ${userId} created project ${projectId}`);
// GOOD
logger.info({ userId, projectId }, 'Project created');
PINO_FINAL_FOR_CRASHES¶
// Handle uncaught exceptions with guaranteed log flush
// NOTE: pino.final() was removed in pino v7; fatal-level logs on the default destination flush synchronously
process.on('uncaughtException', (err) => {
logger.fatal({ err }, 'Uncaught exception — process will exit');
process.exit(1);
});
process.on('unhandledRejection', (reason) => {
logger.fatal({ reason }, 'Unhandled promise rejection — process will exit');
process.exit(1);
});
RULE: always handle uncaughtException and unhandledRejection
RULE: ensure fatal logs flush before exit (synchronous by default in pino v7+; pino.final() exists only in pino v6)
RULE: exit after uncaughtException — process state is corrupt
NODE:HEALTH_CHECKS¶
LIVENESS_VS_READINESS¶
// /health — liveness: is the process alive?
// Checks: can the event loop respond? That's it.
app.get('/health', (c) => c.json({ status: 'ok', uptime: process.uptime() }));
// /ready — readiness: can we handle traffic?
// Checks: DB connected? Redis connected? External deps reachable?
app.get('/ready', async (c) => {
const checks: Record<string, 'ok' | 'fail'> = {};
try {
await db.execute(sql`SELECT 1`);
checks.database = 'ok';
} catch { checks.database = 'fail'; }
try {
await redis.ping();
checks.redis = 'ok';
} catch { checks.redis = 'fail'; }
const allOk = Object.values(checks).every((v) => v === 'ok');
return c.json({ status: allOk ? 'ready' : 'degraded', checks }, allOk ? 200 : 503);
});
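One gap in the readiness handler above: a hung dependency makes the probe itself hang until kubelet's own timeout, so each check should carry its own deadline. A sketch (withTimeout is a hypothetical helper, not part of any library here):

```typescript
// Hypothetical helper: reject if a dependency check exceeds its deadline
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`check timed out after ${ms}ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Usage inside /ready, e.g.: await withTimeout(db.execute(sql`SELECT 1`), 2_000);
```

With a 2s per-check deadline, /ready answers 503 for a hung database instead of stalling the probe.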
K8S_PROBE_CONFIG¶
livenessProbe:
httpGet:
path: /health
port: 3000
initialDelaySeconds: 5
periodSeconds: 10
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready
port: 3000
initialDelaySeconds: 10
periodSeconds: 5
failureThreshold: 2
startupProbe:
httpGet:
path: /health
port: 3000
initialDelaySeconds: 2
periodSeconds: 2
failureThreshold: 30
NODE:METRICS¶
PROMETHEUS_METRICS¶
// lib/metrics.ts
import { collectDefaultMetrics, register, Counter, Histogram } from 'prom-client';
import { createMiddleware } from 'hono/factory';
// Default Node.js metrics (CPU, memory, event loop, GC)
collectDefaultMetrics({ prefix: 'app_' });
// Custom metrics
export const httpRequestsTotal = new Counter({
name: 'http_requests_total',
help: 'Total HTTP requests',
labelNames: ['method', 'path', 'status'],
});
export const httpRequestDuration = new Histogram({
name: 'http_request_duration_seconds',
help: 'HTTP request duration in seconds',
labelNames: ['method', 'path', 'status'],
buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});
// Middleware to collect metrics
export const metricsMiddleware = createMiddleware(async (c, next) => {
const start = Date.now();
await next();
const duration = (Date.now() - start) / 1000;
httpRequestsTotal.inc({ method: c.req.method, path: c.req.routePath, status: c.res.status });
httpRequestDuration.observe({ method: c.req.method, path: c.req.routePath, status: c.res.status }, duration);
});
// Metrics endpoint
app.get('/metrics', async (c) => {
const metrics = await register.metrics();
return c.text(metrics, 200, { 'Content-Type': register.contentType });
});
RULE: expose /metrics endpoint for Prometheus scraping
RULE: use path templates in labels (e.g., /users/:id) not actual paths (e.g., /users/abc-123) — unbounded cardinality
RULE: collect default Node.js metrics (CPU, memory, event loop, GC)
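When a framework route template is unavailable (c.req.routePath covers Hono), raw paths can be normalized before labeling. A hypothetical sketch that collapses numeric IDs and UUIDs into placeholders:

```typescript
// Hypothetical fallback: collapse high-cardinality path segments into placeholders
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function normalizePath(path: string): string {
  return path
    .split('/')
    .map((seg) => {
      if (/^\d+$/.test(seg)) return ':id';
      if (UUID_RE.test(seg)) return ':id';
      return seg;
    })
    .join('/');
}

console.log(normalizePath('/users/123/projects/9f8b4c1a-2d3e-4f5a-8b6c-7d8e9f0a1b2c'));
// → /users/:id/projects/:id
```

This keeps label cardinality proportional to the number of routes, not the number of entities.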
NODE:CLUSTERING_VS_HORIZONTAL_SCALING¶
CHECK: do you need more throughput from a single machine?
IF: running in k8s THEN: use horizontal pod autoscaling (HPA), NOT clustering
IF: running on bare metal THEN: use Node.js cluster module or PM2
GE_DECISION:
- All client projects run on k8s
- Use HPA for scaling (1-5 replicas, capped per GE token burn rules)
- Do NOT use Node.js cluster module in k8s — let k8s manage replicas
- Each pod = 1 Node.js process = 1 event loop
- Scale horizontally with multiple pods, not multiple processes per pod
ANTI_PATTERN: using cluster module inside k8s containers
FIX: let k8s HPA handle scaling — cluster module complicates health checks and signal handling
NODE:DEBUGGING_PRODUCTION¶
REMOTE_DEBUGGING (EMERGENCY ONLY)¶
# Enable inspector on running process (no restart needed)
kill -USR1 <pid>
# Inspector listens on 127.0.0.1:9229 by default; connect via chrome://inspect (SSH-tunnel the port for remote hosts)
RULE: never expose debug port in production permanently — only for active debugging sessions
RULE: use --inspect flag only in development
HEAP_SNAPSHOTS¶
// Trigger heap snapshot via API (admin-only endpoint)
import { writeHeapSnapshot } from 'node:v8';
app.post('/admin/heap-snapshot', requireRole('admin'), async (c) => {
const filename = writeHeapSnapshot();
return c.json({ success: true, data: { filename } });
});
CPU_PROFILING¶
// Trigger CPU profile via API (admin-only endpoint)
import { Session } from 'node:inspector/promises';
app.post('/admin/cpu-profile', requireRole('admin'), async (c) => {
const session = new Session();
session.connect();
await session.post('Profiler.enable');
await session.post('Profiler.start');
// Profile for 10 seconds
await new Promise((r) => setTimeout(r, 10_000));
const { profile } = await session.post('Profiler.stop');
session.disconnect();
return c.json({ success: true, data: profile });
});
NODE:SECURITY_HARDENING¶
RULE: run as non-root user in container (USER node)
RULE: use --dns-result-order=ipv4first to avoid IPv6 resolution issues
RULE: set NODE_ENV=production — disables dev-only features, enables optimizations
RULE: freeze prototype chain if handling untrusted JSON: Object.freeze(Object.prototype) (advanced)
// Helmet-like security headers in Hono
import { secureHeaders } from 'hono/secure-headers';
app.use('*', secureHeaders({
contentSecurityPolicy: {
defaultSrc: ["'self'"],
scriptSrc: ["'self'"],
},
crossOriginEmbedderPolicy: 'require-corp',
crossOriginOpenerPolicy: 'same-origin',
crossOriginResourcePolicy: 'same-origin',
referrerPolicy: 'strict-origin-when-cross-origin',
strictTransportSecurity: 'max-age=31536000; includeSubDomains',
xContentTypeOptions: 'nosniff',
xFrameOptions: 'DENY',
}));
NODE:AGENTIC_CHECKLIST¶
ON_DEPLOYING_TO_PRODUCTION:
1. CHECK: is NODE_ENV=production set?
2. CHECK: is --max-old-space-size configured for container memory limit?
3. CHECK: is graceful shutdown handler present (SIGTERM + SIGINT)?
4. CHECK: is pino configured with JSON output (no pino-pretty)?
5. CHECK: are uncaughtException and unhandledRejection handlers present?
6. CHECK: are /health and /ready endpoints implemented?
7. CHECK: is memory monitoring in place (process.memoryUsage)?
8. CHECK: are event loop lag metrics being collected?
9. CHECK: is the container running as non-root user?
10. CHECK: are security headers set (secureHeaders middleware)?
11. RUN: load test to verify no memory leaks under sustained traffic
12. RUN: verify graceful shutdown works (send SIGTERM, check in-flight requests complete)