The Ultimate Node.js Production Checklist

Why a checklist matters more than you think

Most Node.js outages aren't caused by clever bugs in business logic. They're caused by the boring stuff: a missing health check that prevents a load balancer from routing correctly, an unhandled promise rejection that silently crashes the process, a forgotten console.log that floods log storage, or a database connection pool that gets exhausted under real traffic. These aren't edge cases — they're the norm for apps that haven't been hardened for production.

This checklist exists because the gap between "it works in staging" and "it runs reliably in production" is filled with configuration details that nobody talks about until something breaks at 2am. Work through it once before your first production deploy, then revisit it before major releases.

1. Health check endpoint

Every production Node.js service needs a GET /health endpoint. Load balancers, container orchestrators (Kubernetes, ECS), uptime monitors, and deployment systems all need to know whether your app is ready to receive traffic. Without this endpoint, your platform has no way to distinguish a healthy instance from one that's hanging.

A good health check does two things: verifies the process is alive, and verifies that its dependencies (database, cache, etc.) are reachable. Return 200 when healthy and 503 when degraded.

app.get('/health', async (req, res) => {
  const start = Date.now();
  let dbStatus = 'ok';
  let dbLatencyMs = 0;

  try {
    const dbStart = Date.now();
    await prisma.$queryRaw`SELECT 1`;
    dbLatencyMs = Date.now() - dbStart;
  } catch {
    dbStatus = 'error';
  }

  const healthy = dbStatus === 'ok';
  res.status(healthy ? 200 : 503).json({
    status: healthy ? 'ok' : 'degraded',
    uptime: process.uptime(),
    responseTimeMs: Date.now() - start,
    dependencies: {
      database: { status: dbStatus, latencyMs: dbLatencyMs },
    },
    version: process.env.APP_VERSION ?? 'unknown',
  });
});

Important: Keep health checks lightweight. If the check itself is slow, the load balancer may mark instances as unhealthy under load. Avoid complex logic and cache the dependency check result for a few seconds if needed.
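
One way to cache the dependency check is sketched below. The cachedProbe helper and its parameters are hypothetical names for illustration, not part of Prisma or Express. It reuses the last result for a few seconds and lets concurrent callers share a single in-flight probe.

```typescript
// Hypothetical helper (not from any library): caches a dependency probe for ttlMs.
type ProbeResult = { status: 'ok' | 'error'; latencyMs: number };

function cachedProbe(
  probe: () => Promise<unknown>,
  ttlMs: number,
  now: () => number = Date.now, // injectable clock, handy in tests
): () => Promise<ProbeResult> {
  let cached: { at: number; result: ProbeResult } | undefined;
  let inFlight: Promise<ProbeResult> | undefined;

  return () => {
    // Serve the cached result while it is still fresh
    if (cached && now() - cached.at < ttlMs) return Promise.resolve(cached.result);
    // Concurrent callers share one probe instead of each hitting the dependency
    if (inFlight) return inFlight;

    const started = now();
    const pending = probe()
      .then(() => 'ok' as const, () => 'error' as const)
      .then((status): ProbeResult => {
        const result: ProbeResult = { status, latencyMs: now() - started };
        cached = { at: now(), result };
        inFlight = undefined;
        return result;
      });
    inFlight = pending;
    return pending;
  };
}

// Possible usage in the /health handler (prisma as in the example above):
// const checkDb = cachedProbe(() => prisma.$queryRaw`SELECT 1`, 5_000);
// const db = await checkDb();
```

A short TTL (a few seconds) keeps the check cheap under frequent polling while still surfacing a database outage quickly.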

2. Graceful shutdown

When a container orchestrator wants to stop your instance (during a deploy, scale-in, or node replacement), it sends SIGTERM first, then waits for a grace period (typically 30 seconds), then sends SIGKILL. If your app doesn't handle SIGTERM, the process is killed immediately and any in-flight requests are dropped.

import { PrismaClient } from '@prisma/client';
const prisma = new PrismaClient();
const server = app.listen(PORT);

let shuttingDown = false;

// While shutting down, answer every request with 503 so the load balancer
// takes this instance out of rotation before the server closes.
// Register this middleware before your routes (including /health).
app.use((req, res, next) => {
  if (shuttingDown) {
    res.setHeader('Connection', 'close');
    return res.status(503).json({ error: 'Server shutting down' });
  }
  next();
});

async function shutdown(signal: string) {
  if (shuttingDown) return; // ignore repeated signals
  console.log(`Received ${signal}, starting graceful shutdown`);
  shuttingDown = true;

  // Stop accepting new connections; the callback runs once open connections have ended
  server.close(async () => {
    // Disconnect database
    await prisma.$disconnect();
    console.log('Graceful shutdown complete');
    process.exit(0);
  });

  // Force exit after 25 seconds (before SIGKILL at 30s).
  // unref() keeps this timer from holding the process open by itself.
  setTimeout(() => {
    console.error('Forced shutdown after timeout');
    process.exit(1);
  }, 25_000).unref();
}

process.on('SIGTERM', () => shutdown('SIGTERM'));
process.on('SIGINT', () => shutdown('SIGINT'));

3. Never store secrets in code or version control

This sounds obvious, but it's one of the most common mistakes in production Node.js apps. The failure modes are subtle: a developer commits a .env file to a private repo that later becomes public, or hard-codes an API key during a late-night debug session and forgets to remove it.

The rules:

Add .env* to .gitignore (all of them, not just .env), followed by a !.env.example exception so the example file can still be committed
Commit a .env.example with placeholder values and comments
In production, inject secrets via your platform's vault or secrets manager
Validate that required env vars are present at startup — fail fast if not

// config.ts — validate at startup, not at usage time
const required = ['DATABASE_URL', 'JWT_SECRET', 'STRIPE_SECRET_KEY'];

for (const key of required) {
  if (!process.env[key]) {
    throw new Error(`Missing required environment variable: ${key}`);
  }
}

export const config = {
  databaseUrl: process.env.DATABASE_URL!,
  jwtSecret: process.env.JWT_SECRET!,
  stripeSecretKey: process.env.STRIPE_SECRET_KEY!,
  port: parseInt(process.env.PORT ?? '3000', 10),
  nodeEnv: process.env.NODE_ENV ?? 'development',
};

4. Structured logging — never console.log in production

console.log outputs unstructured strings. When you have hundreds of concurrent requests, an unstructured log is impossible to search, filter, or correlate. Structured JSON logs let you filter by user ID, request ID, status code, or any other field instantly — in Datadog, Loki, CloudWatch, or any log aggregator.

import pino from 'pino';

export const log = pino({
  level: process.env.LOG_LEVEL || 'info',
  // Pretty print in development, JSON in production
  transport: process.env.NODE_ENV !== 'production'
    ? { target: 'pino-pretty', options: { colorize: true } }
    : undefined,
  // Redact sensitive fields before they ever reach the log sink
  redact: {
    paths: ['req.headers.authorization', 'req.body.password', 'user.password'],
    censor: '[REDACTED]',
  },
  base: {
    pid: process.pid,
    service: process.env.SERVICE_NAME ?? 'api',
    version: process.env.APP_VERSION,
  },
});

// Correct usage:
log.info({ userId: '123', action: 'login' }, 'User logged in');
log.error({ err, userId }, 'Payment failed');

5. Global unhandled error handlers

Node.js has two error events that will crash your process if unhandled: unhandledRejection (a promise was rejected with no .catch()) and uncaughtException (a synchronous throw escaped all try/catch blocks). In both cases you should log the error and exit — a process in an unknown state is dangerous to keep running.

process.on('unhandledRejection', (reason: unknown) => {
  log.fatal({ reason }, 'Unhandled promise rejection');
  process.exit(1);
});

process.on('uncaughtException', (err: Error) => {
  log.fatal({ err }, 'Uncaught exception');
  process.exit(1);
});

Note: Do not catch these errors and continue running. The process is in an indeterminate state. Exit cleanly, let your orchestrator restart you, and fix the root cause.

6. Request timeouts

Without timeouts, a single slow upstream dependency (a database query that takes 60 seconds, a third-party API that hangs) will hold an HTTP connection open indefinitely. Under load, this exhausts your connection pool and causes cascading failures.

import { createServer } from 'http';

const server = createServer(app);

// Server-level: close connections that are idle too long
server.keepAliveTimeout = 65_000; // slightly more than ALB's 60s

// Per-request timeout in Express
app.use((req, res, next) => {
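  res.setTimeout(30_000, () => {
    log.warn({ path: req.path }, 'Request timeout');
    // The handler may already have started responding; only send if headers are unsent
    if (!res.headersSent) {
      res.status(503).json({ error: 'Request timed out' });
    }
  });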
  next();
});

// Outbound HTTP client timeout
const response = await fetch('https://api.stripe.com/v1/charges', {
  signal: AbortSignal.timeout(10_000), // 10 second timeout
});
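
The same idea applies to work that has no built-in timeout option, such as a slow database query. Here is a minimal sketch, assuming a hypothetical withTimeout helper (the name and the TimeoutError class are illustrative, not from any library) that races a promise against a timer:

```typescript
// Hypothetical error type so callers can tell timeouts apart from other failures
class TimeoutError extends Error {
  constructor(label: string, ms: number) {
    super(`${label} timed out after ${ms}ms`);
    this.name = 'TimeoutError';
  }
}

// Rejects with TimeoutError if `work` has not settled within `ms` milliseconds
function withTimeout<T>(work: Promise<T>, ms: number, label = 'operation'): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new TimeoutError(label, ms)), ms);
  });
  // Whichever settles first wins; always clear the timer so it cannot keep the process alive
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer));
}

// Possible usage (prisma as in the earlier examples):
// const users = await withTimeout(prisma.user.findMany(), 5_000, 'user query');
```

Note that this only stops waiting; it does not cancel the underlying work. Where the driver or database offers a real timeout (for example Postgres's statement_timeout), prefer that and use a wrapper like this as a backstop.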

7. Rate limiting on every public endpoint

Without rate limiting, a single misbehaving client — whether a bug in a mobile app or a deliberate attacker — can saturate your server. At minimum, apply rate limiting to authentication endpoints (brute force prevention) and your general API (abuse prevention).

import rateLimit from 'express-rate-limit';
import RedisStore from 'rate-limit-redis';

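// General API rate limit.
// `redis` is assumed to be a connected node-redis client. rate-limit-redis v3+
// takes a sendCommand function; older versions accepted `client` instead.
app.use('/api/', rateLimit({
  windowMs: 60 * 1000,
  max: 200,
  standardHeaders: true,
  legacyHeaders: false,
  store: new RedisStore({
    prefix: 'rl:api:', // separate key space so the two limiters don't share counters
    sendCommand: (...args: string[]) => redis.sendCommand(args),
  }),
}));

// Stricter limit for auth endpoints
app.use('/auth/', rateLimit({
  windowMs: 15 * 60 * 1000,
  max: 10,
  message: { error: 'Too many attempts. Try again in 15 minutes.' },
  store: new RedisStore({
    prefix: 'rl:auth:',
    sendCommand: (...args: string[]) => redis.sendCommand(args),
  }),
}));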

8. Security headers with Helmet

A single middleware call sets around a dozen security-relevant HTTP headers (the exact set depends on your Helmet version), including Content-Security-Policy, X-Content-Type-Options, X-Frame-Options, and Strict-Transport-Security. These aren't optional if your app is public-facing.

import helmet from 'helmet';

app.use(helmet({
  contentSecurityPolicy: {
    directives: {
      defaultSrc: ["'self'"],
      styleSrc: ["'self'", "'unsafe-inline'"],
      scriptSrc: ["'self'"],
      imgSrc: ["'self'", 'data:', 'https:'],
    },
  },
  hsts: {
    maxAge: 31536000,
    includeSubDomains: true,
    preload: true,
  },
}));

9. Set --max-old-space-size

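V8's default heap limit depends on the Node.js version and on how much memory it detects. Older releases capped the old generation at around 1.5GB on 64-bit systems, and newer ones size it from available memory, which may not match your container limit. If your container only has 512MB of RAM and Node tries to grow its heap well beyond that, the Linux OOM killer will kill your process with no warning and no useful log message. Always set this flag explicitly, to roughly your container limit minus ~100MB for overhead.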

# In your Dockerfile CMD or startup script:
CMD ["node", "--max-old-space-size=400", "dist/index.js"]

# Or in package.json scripts:
"start": "node --max-old-space-size=400 dist/index.js"
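
The "container limit minus ~100MB" arithmetic can be sketched as a small helper, together with a startup log of the limit V8 actually applied. The maxOldSpaceMb function is a hypothetical name used for illustration; v8.getHeapStatistics() is the standard Node.js API.

```typescript
import { getHeapStatistics } from 'node:v8';

// Hypothetical helper: derive a --max-old-space-size value (in MB) from the container limit
function maxOldSpaceMb(containerLimitMb: number, overheadMb = 100): number {
  if (containerLimitMb <= overheadMb) {
    throw new RangeError('Container limit must exceed the reserved overhead');
  }
  return Math.floor(containerLimitMb - overheadMb);
}

// For a 512MB container with 100MB reserved, this yields 412
const heapFlag = `--max-old-space-size=${maxOldSpaceMb(512)}`;

// At startup, log the heap limit V8 is actually running with
const effectiveLimitMb = Math.round(getHeapStatistics().heap_size_limit / 1024 / 1024);
console.log({ heapFlag, effectiveLimitMb });
```

The reported heap_size_limit covers more than the old generation, so it may come out somewhat higher than the flag value. The point of logging it is to catch the case where the flag was never applied at all.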

10. Connection pool sizing

Every Node.js container that runs your app opens its own database connection pool. If you have 10 replicas and a pool size of 20, you can end up holding 200 Postgres connections. Postgres handles high connection counts poorly; each connection uses ~5-10MB of RAM on the server. Size each pool so that pool size multiplied by your maximum replica count stays comfortably below the database's max_connections.

// schema.prisma
datasource db {
  provider = "postgresql"
  url      = env("DATABASE_URL")
}

generator client {
  provider = "prisma-client-js"
}

// In your app: Prisma reads the pool settings from the connection string, e.g.
// DATABASE_URL=postgresql://USER:PASSWORD@HOST:5432/DB?connection_limit=5&pool_timeout=10
const prisma = new PrismaClient({
  datasources: {
    db: {
      url: process.env.DATABASE_URL,
    },
  },
});
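
One way to turn replica count into a per-instance pool size is sketched below. The poolSizePerReplica helper, the reserved-connection figure, and the example hostname are assumptions for illustration. The value of 100 is the Postgres default for max_connections, so check your own server's setting.

```typescript
// Hypothetical helper: split the database's connection budget across replicas
function poolSizePerReplica(
  maxConnections: number,      // Postgres max_connections
  reservedConnections: number, // headroom for migrations, admin sessions, cron jobs
  maxReplicas: number,         // peak replica count, including surge during rolling deploys
): number {
  const perReplica = Math.floor((maxConnections - reservedConnections) / maxReplicas);
  if (perReplica < 1) {
    throw new RangeError('Not enough connections for this many replicas');
  }
  return perReplica;
}

// (100 - 10) / 10 replicas = 9 connections per instance
const connectionLimit = poolSizePerReplica(100, 10, 10);

// Append the limit to a placeholder connection string
const dbUrl = new URL('postgresql://user:pass@db.example.internal:5432/app');
dbUrl.searchParams.set('connection_limit', String(connectionLimit));
dbUrl.searchParams.set('pool_timeout', '10');
console.log(dbUrl.searchParams.toString());
```

If the numbers don't work out (many replicas, small database), a connection pooler such as PgBouncer in front of Postgres is the usual next step.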

11. Enable HTTP compression

JSON responses compress exceptionally well: ratios of 5:1 to 10:1 are common, so a 100KB response becomes 10-20KB. This reduces your bandwidth bill, cuts transfer time for users on slower connections, and reduces the load on your CDN.

import compression from 'compression';

app.use(compression({
  level: 6,           // Default: good balance of speed vs compression ratio
  threshold: 1024,    // Only compress responses > 1KB
  filter: (req, res) => {
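    // Let clients opt out with a header; otherwise fall back to the default
    // filter, which skips content types that don't benefit from compression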
    if (req.headers['x-no-compression']) return false;
    return compression.filter(req, res);
  },
}));
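
To see what compression does to one of your own payloads, you can measure it directly with Node's built-in zlib module. The sketch below uses a made-up list of user records as sample data; the actual ratio depends heavily on how repetitive your payload is, so treat the printed number as a measurement of this sample only.

```typescript
import { gzipSync } from 'node:zlib';

// Made-up sample payload: 500 similar records, as a typical list endpoint might return
const sampleRows = Array.from({ length: 500 }, (_, i) => ({
  id: i,
  email: `user${i}@example.com`,
  status: i % 3 === 0 ? 'active' : 'inactive',
  createdAt: new Date(1_700_000_000_000 + i * 60_000).toISOString(),
}));

const rawBody = Buffer.from(JSON.stringify(sampleRows));
// Level 6 matches the compression middleware setting above
const gzippedBody = gzipSync(rawBody, { level: 6 });

const compressionRatio = rawBody.length / gzippedBody.length;
console.log(
  `${rawBody.length} bytes -> ${gzippedBody.length} bytes (${compressionRatio.toFixed(1)}:1)`,
);
```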

12. Automated dependency auditing

Add npm audit --audit-level=high to your CI pipeline. This fails the build if any dependency has a known high or critical vulnerability. Also configure Dependabot or Renovate to automatically open PRs when dependency updates are available — security patches often come through minor version bumps.

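# .github/workflows/ci.yml
# Fails the build on high or critical advisories; lower severities are reported but don't fail it
- name: Audit dependencies
  run: npm audit --audit-level=high

# Fails the build if any dependency uses a license outside this allowlist
- name: License check
  run: npx license-checker --onlyAllow 'MIT;Apache-2.0;BSD-2-Clause;BSD-3-Clause;ISC'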

13. Minimal container image

Large Docker images are slow to pull, expose more attack surface, and consume more registry storage. A well-optimized Node.js image should be under 200MB. The two biggest levers: use Alpine Linux base images, and use multi-stage builds to keep dev dependencies and build tooling out of the runtime image.

# Multi-stage build — build tools stay in the builder stage
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
# Drop devDependencies so they are not copied into the runtime image
RUN npm prune --omit=dev

FROM node:20-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production
# Copy only what's needed at runtime
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY package.json .
EXPOSE 3000
CMD ["node", "dist/index.js"]

14. Monitoring, metrics, and alerting

You need to know about problems before your users do. The minimum viable monitoring stack for a Node.js API:

Uptime monitoring — ping your /health endpoint every 30 seconds from an external service (Betterstack, UptimeRobot). Alert if it returns non-200.
Error rate — alert if your HTTP 5xx rate exceeds 1% over a 5-minute window
Latency — alert if p95 response time exceeds your SLA threshold
Memory — alert if heap usage exceeds 80% of container limit

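import { monitorEventLoopDelay } from 'perf_hooks';

// Sample event loop delay in the background
const loopDelay = monitorEventLoopDelay({ resolution: 20 });
loopDelay.enable();

// Mean event loop delay in milliseconds (the histogram reports nanoseconds)
const getEventLoopLag = () => Math.round(loopDelay.mean / 1e6);

// Expose basic metrics for scraping (keep this route internal)
app.get('/metrics', (req, res) => {
  const mem = process.memoryUsage();
  res.json({
    uptime: process.uptime(),
    heapUsedMB: Math.round(mem.heapUsed / 1024 / 1024),
    heapTotalMB: Math.round(mem.heapTotal / 1024 / 1024),
    rssMB: Math.round(mem.rss / 1024 / 1024),
    eventLoopLagMs: getEventLoopLag(),
  });
});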
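
The error-rate and latency alert rules from the list above can be expressed as a couple of small functions. This is a sketch under assumptions: the Sample type and function names are hypothetical, and in practice your monitoring platform evaluates these rules over its own time windows rather than your app doing it.

```typescript
// Hypothetical shape for one observed request in the evaluation window
type Sample = { status: number; durationMs: number };

// Fraction of requests in the window that returned a 5xx status
function errorRate(samples: Sample[]): number {
  if (samples.length === 0) return 0;
  const errors = samples.filter((s) => s.status >= 500).length;
  return errors / samples.length;
}

// Nearest-rank percentile: the smallest value with at least p% of values at or below it
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

// Evaluate the two rules: 5xx rate above 1%, p95 latency above the SLA threshold
function evaluateAlerts(samples: Sample[], slaMs: number) {
  return {
    errorRateAlert: errorRate(samples) > 0.01,
    latencyAlert: percentile(samples.map((s) => s.durationMs), 95) > slaMs,
  };
}
```

Percentiles matter here because averages hide tail latency: a handful of multi-second requests barely moves the mean but shows up clearly at p95.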

15. Zero-downtime deployment strategy

Your deployment pipeline should never cause visible downtime. Rolling deploys (replace instances one at a time) work well for most APIs. The prerequisites: your app must handle SIGTERM gracefully (covered in #2), your health check endpoint must work immediately after startup, and your container orchestrator must verify health before routing traffic to new instances.

Test your zero-downtime setup by running a load test such as autocannon -c 50 -d 120 http://localhost:3000/api/ping during a deployment (the -d flag sets the duration in seconds, so pick a value that spans the whole rollout). Any non-2xx responses or connection errors during the transition indicate a gap in your graceful shutdown or readiness probe configuration.
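
The prerequisites above boil down to a small state machine: an instance should report ready only after startup finishes and before shutdown begins. A minimal sketch, assuming a hypothetical Readiness class (the name and phases are illustrative):

```typescript
// Lifecycle phases of one instance during a rolling deploy
type Phase = 'starting' | 'ready' | 'draining';

// Hypothetical helper that a readiness endpoint could consult
class Readiness {
  private phase: Phase = 'starting';

  // Call once startup work (DB connection, cache warmup) has finished
  markReady(): void {
    if (this.phase === 'starting') this.phase = 'ready';
  }

  // Call from the SIGTERM handler; draining is final, so markReady cannot undo it
  markDraining(): void {
    this.phase = 'draining';
  }

  // 200 only while ready; 503 tells the orchestrator not to route traffic here
  statusCode(): 200 | 503 {
    return this.phase === 'ready' ? 200 : 503;
  }
}

// Possible wiring (app as in the earlier examples):
// const readiness = new Readiness();
// app.get('/health', (req, res) => res.sendStatus(readiness.statusCode()));
```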

The bottom line

None of these items require advanced Node.js knowledge. They're configuration decisions that compound over time — an app that handles graceful shutdown, has proper health checks, logs structured JSON, and enforces rate limiting will weather incidents that take down less-prepared services. Build this foundation once and you'll spend far more time shipping features and far less time debugging production fires.