Zero-Downtime Deployments Explained

Why deployments cause downtime

The simplest deployment stops the old process and then starts the new one. The gap between stop and start is downtime: in-flight requests are dropped and new ones fail. Even 2 seconds of 502s is enough to fail upstream health checks, breach an SLA, and anger users who happened to hit your API at the wrong moment.

What your app needs regardless of strategy

No deployment strategy works without these two things in your Node.js app:

1. A health check endpoint. The orchestrator needs to know when the new version is ready before routing traffic to it.

2. Graceful shutdown. The old version needs time to finish serving in-flight requests before it exits.

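const express = require('express');

const app = express();
const PORT = process.env.PORT || 3000; // 3000 is an arbitrary default
let isShuttingDown = false;

// Report 503 once shutdown starts so the load balancer stops routing here.
app.get('/health', (req, res) => {
  if (isShuttingDown) return res.status(503).send('shutting down');
  res.status(200).json({ ok: true });
});

const server = app.listen(PORT);

process.on('SIGTERM', () => {
  isShuttingDown = true;
  // Stop accepting new connections; exit once in-flight requests finish.
  server.close(() => process.exit(0));
  // Open keep-alive connections can keep close() waiting, so force an exit
  // after a grace period (10 seconds here is an arbitrary choice).
  setTimeout(() => process.exit(1), 10000).unref();
});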

Rolling updates

The most common strategy. Replace instances one at a time (or in small batches). At any point during the deployment, some instances are running the old version and some are running the new version.

Pros: No extra infrastructure needed. Works with most container platforms out of the box.

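Cons: Both versions run simultaneously, so the new code must be backward-compatible with the old code's data, and the old code must tolerate anything the new code writes. Rollback is not instant: it is another rolling deployment, either of the previous version or of a fix.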

Best for: Most web applications with simple, additive changes.
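
To make the batch-by-batch behavior concrete, here is a minimal sketch of a rolling update loop. The replaceInstance and isHealthy callbacks are hypothetical stand-ins for whatever your platform actually provides; in practice the orchestrator runs this loop for you.

```javascript
// Sketch of a rolling update loop. `replaceInstance` and `isHealthy` are
// hypothetical callbacks standing in for your platform's API.
async function rollingUpdate(instances, newVersion, options) {
  const { batchSize = 1, replaceInstance, isHealthy } = options;
  const updated = [];

  for (let i = 0; i < instances.length; i += batchSize) {
    const batch = instances.slice(i, i + batchSize);

    // Replace only this batch; every other instance keeps serving traffic.
    const replacements = await Promise.all(
      batch.map((instance) => replaceInstance(instance, newVersion))
    );

    // Check each new instance once before moving on. A real rollout would
    // poll the health endpoint with a timeout instead.
    for (const instance of replacements) {
      if (!(await isHealthy(instance))) {
        throw new Error(`Instance ${instance.id} is unhealthy; halting rollout`);
      }
    }
    updated.push(...replacements);
  }
  return updated;
}
```

Halting on the first unhealthy instance leaves the remaining old instances untouched, which is what keeps a bad release from taking down the whole fleet.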

Blue/Green

Maintain two identical environments (blue = current production, green = new version). Deploy to green, run smoke tests, then switch the load balancer from blue to green in one atomic step.

Pros: Instant rollback — just switch the load balancer back to blue. No mixed-version period.

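Cons: Requires double the infrastructure during the deployment window. Database migrations become complex: blue and green usually share one database, so the schema has to work with both versions or switching back stops being safe.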

Best for: Regulated environments, major version changes, and any case where instant rollback is critical.
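
The cutover and rollback can be sketched as a small router object. Everything here is illustrative: the BlueGreenRouter class, the environment URLs, and the smokeTest callback are assumptions, and a real setup would change a load balancer or DNS target rather than a field in memory.

```javascript
// Hypothetical router: `live` names the environment that receives traffic.
class BlueGreenRouter {
  constructor(environments, live) {
    this.environments = environments; // e.g. { blue: url, green: url }
    this.live = live;
    this.previous = null;
  }

  // URL that production traffic should be sent to right now.
  target() {
    return this.environments[this.live];
  }

  // Switch traffic to `candidate` only if its smoke test passes.
  async cutOver(candidate, smokeTest) {
    if (!Object.prototype.hasOwnProperty.call(this.environments, candidate)) {
      throw new Error(`Unknown environment: ${candidate}`);
    }
    const passed = await smokeTest(this.environments[candidate]);
    if (!passed) return false; // the live environment is left untouched

    this.previous = this.live;
    this.live = candidate; // the single switch that moves all traffic
    return true;
  }

  // Point traffic back at the environment that was live before the cutover.
  rollBack() {
    if (this.previous === null) throw new Error('Nothing to roll back to');
    [this.live, this.previous] = [this.previous, this.live];
  }
}
```

Because the old environment keeps running after the switch, rollBack() is just the same pointer change in reverse.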

Canary deployments

Route a small percentage of traffic (e.g. 5%) to the new version while the rest stays on the old version. Monitor error rates and latency. If healthy, gradually increase the canary percentage.

Pros: Limits blast radius of a bad deployment to a fraction of users. Real-world validation before full rollout.

Cons: Complex to implement. Requires weighted traffic splitting plus per-version metrics so the canary can be compared against the old version. Mixed-version period can last minutes to hours.

Best for: High-traffic consumer-facing APIs where even a 0.1% error rate is significant.
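
The two moving parts, splitting traffic and deciding whether to widen the canary, can be sketched as below. The function names, the step schedule, and the 0.1% tolerance over the baseline are all assumptions chosen for illustration; in practice the split usually lives in the load balancer or service mesh rather than in application code.

```javascript
const crypto = require('node:crypto');

// Assign a user to 'canary' or 'stable'. Hashing the user ID keeps each
// user on the same version for the whole rollout.
function pickVersion(userId, canaryPercent) {
  const hash = crypto.createHash('sha256').update(String(userId)).digest();
  const bucket = hash.readUInt32BE(0) % 100; // 0..99
  return bucket < canaryPercent ? 'canary' : 'stable';
}

// Hypothetical promotion schedule, in percent of traffic.
const STEPS = [5, 25, 50, 100];

// Return the next canary percentage, or 0 to abort the rollout.
function nextCanaryPercent(current, { canaryErrorRate, baselineErrorRate }) {
  // Abort if the canary is more than 0.1 percentage points worse.
  if (canaryErrorRate > baselineErrorRate + 0.001) return 0;
  const next = STEPS.find((step) => step > current);
  return next === undefined ? 100 : next;
}
```

Comparing the canary against the old version's error rate, rather than against a fixed number, avoids aborting a healthy release because of background errors both versions share.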

Database migrations: the hard part

Zero-downtime deploys become complicated when they involve schema changes. The rule: migrations must be backward-compatible with the running old version. The pattern is expand-contract:

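1. Expand: Add the new column/table without touching the old one. Both old and new code work.

2. Deploy the new code, which uses the new column/table.

3. Contract: Once no running instance uses the old column/table, remove it in a separate migration.

Never rename or drop a column in the same deployment that removes the code that uses it. Old instances are still running during the rollout and would start failing.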
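
As an illustration of the expand-contract pattern, suppose a users table is moving from a name column to a display_name column. The table, the column names, and both helper functions are hypothetical. During the transition, the new code can tolerate both schemas like this:

```javascript
// Phase 1 (expand), run as a migration before deploying this code:
//   ALTER TABLE users ADD COLUMN display_name TEXT;

// Read path: prefer the new column, fall back to the old one for rows
// that have not been backfilled yet.
function readDisplayName(row) {
  return row.display_name ?? row.name ?? null;
}

// Write path: populate both columns so old instances that still read
// `name` keep seeing fresh data while the rollout is in progress.
function buildUserWrite(displayName) {
  return { name: displayName, display_name: displayName };
}

// Phase 3 (contract), run only after no running instance reads `name`:
//   ALTER TABLE users DROP COLUMN name;
```

Writing to both columns is one common way to keep old instances correct during the mixed-version period; it also means rolling back the code never requires rolling back the schema.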

Verifying zero downtime

Use autocannon or k6 to generate continuous load against your app during a deployment. Watch the error rate. Any errors during the swap point to a problem with graceful shutdown or with when the health check starts reporting ready.
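
If you want a quick check before setting up a real load tool, a small probe built on Node's http module can show whether any requests fail during the swap. This is a stand-in sketch, not the autocannon or k6 API: it sends requests one at a time, so it catches outright outages but will not reproduce production concurrency. The function names and the default duration and interval are assumptions.

```javascript
const http = require('node:http');

// Send one GET and resolve to the status code, or 0 on a connection
// error or timeout. `agent: false` opens a fresh connection per request.
function probeOnce(url, timeoutMs = 2000) {
  return new Promise((resolve) => {
    const req = http.get(url, { agent: false, timeout: timeoutMs }, (res) => {
      res.resume(); // discard the body
      resolve(res.statusCode);
    });
    req.on('timeout', () => req.destroy(new Error('timeout')));
    req.on('error', () => resolve(0));
  });
}

// Probe `url` repeatedly for `durationMs` and count failed requests.
async function probeDuring(url, { durationMs = 60000, intervalMs = 100 } = {}) {
  const counts = { total: 0, failed: 0 };
  const deadline = Date.now() + durationMs;
  while (Date.now() < deadline) {
    const status = await probeOnce(url);
    counts.total += 1;
    // Connection errors (0) and 4xx/5xx responses count as failures.
    if (status < 200 || status >= 400) counts.failed += 1;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return counts;
}
```

Start probeDuring against your app's public URL, trigger the deployment, and check that failed is still 0 when it returns.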