
Load Balancing: How Modern Systems Distribute Traffic at Scale

Kirtesh Admute
April 5, 2026
8 min read

Every time you open Netflix, book a flight, or send a message, your request doesn't land on a single server. It's intelligently routed to one of many — the one best positioned to handle it right now. That intelligence is load balancing.

Without it, a single overloaded server would become your system's Achilles heel. With it, your system becomes elastic, resilient, and fast — regardless of traffic spikes.


What Is a Load Balancer?

A load balancer is a component that sits between clients and your backend servers, distributing incoming requests across a pool of servers based on defined rules.

Client → Load Balancer → [ Server 1 | Server 2 | Server 3 ]

Its three core responsibilities:

  1. Traffic distribution — spread load across healthy servers
  2. Health checking — route away from unhealthy instances
  3. Session management — optionally maintain sticky sessions

Why Load Balancing Is Non-Negotiable

Eliminate single points of failure. If one server crashes, the load balancer routes around it automatically. Your users see nothing.

Horizontal scalability. Add more servers behind the balancer instead of upgrading to a bigger machine (vertical scaling). This is cheaper and more resilient.

Improve response times. Requests go to the least-busy server, reducing queue depth and latency.

Zero-downtime deployments. Rolling deploys work by taking servers out of rotation one at a time, updating them, then re-adding — users experience no interruption.


Types of Load Balancers

Layer 4 (Transport Layer)

Routes based on IP and TCP/UDP information — fast but limited context.

  • Operates at the transport layer (TCP/UDP)
  • Cannot inspect request content (no URL, no headers)
  • Very low latency overhead
  • Best for: high-throughput TCP connections, database proxies, gaming servers

Layer 7 (Application Layer)

Routes based on HTTP attributes — flexible and powerful.

  • Can inspect URL path, headers, cookies, body
  • Enables content-based routing (/api → API servers, /static → CDN origin)
  • Supports A/B testing, canary deployments, and header-based routing
  • Best for: web apps, APIs, microservices
| Feature               | Layer 4      | Layer 7            |
|-----------------------|--------------|--------------------|
| Routing basis         | IP + port    | URL, headers, body |
| Speed                 | Faster       | Slightly slower    |
| TLS termination       | Pass-through | Yes                |
| Content-based routing | No           | Yes                |
| Use case              | TCP/UDP apps | HTTP/HTTPS apps    |

Load Balancing Algorithms

Choosing the right algorithm is critical for optimal distribution.

Round Robin

Requests are distributed sequentially across servers.

Request 1 → Server A
Request 2 → Server B
Request 3 → Server C
Request 4 → Server A (cycle repeats)

✅ Simple, predictable
❌ Ignores server capacity or current load
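
The cycle above can be sketched in a few lines. This is a minimal illustration (the server names are made up, and real balancers track far more state):

```typescript
// Minimal round-robin selector: cycle through the pool in order.
class RoundRobin {
  private next = 0;
  constructor(private servers: string[]) {}

  pick(): string {
    const server = this.servers[this.next];
    this.next = (this.next + 1) % this.servers.length; // wrap back to the start
    return server;
  }
}

const lb = new RoundRobin(['server-a', 'server-b', 'server-c']);
console.log(lb.pick(), lb.pick(), lb.pick(), lb.pick());
// → server-a server-b server-c server-a (cycle repeats)
```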

Weighted Round Robin

Servers with more capacity get proportionally more traffic.

Server A (weight: 3) → 3 out of every 5 requests
Server B (weight: 2) → 2 out of every 5 requests

✅ Accounts for heterogeneous hardware
❌ Still ignores real-time load
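
One simple way to realize those weights is to expand the pool into a rotation list, then round-robin over it. A sketch (names illustrative; production balancers such as NGINX use a "smooth" weighted variant that interleaves servers rather than grouping them):

```typescript
// Expand { server: weight } into a rotation list for round-robin traversal.
function buildRotation(weights: Record<string, number>): string[] {
  const rotation: string[] = [];
  for (const [server, weight] of Object.entries(weights)) {
    for (let i = 0; i < weight; i++) rotation.push(server); // weight copies each
  }
  return rotation;
}

const rotation = buildRotation({ 'server-a': 3, 'server-b': 2 });
console.log(rotation);
// → ['server-a', 'server-a', 'server-a', 'server-b', 'server-b']
// Out of every 5 requests, 3 reach server-a and 2 reach server-b.
```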

Least Connections

Routes to the server with the fewest active connections.

Server A: 150 connections
Server B: 60 connections  ← new request goes here
Server C: 200 connections

✅ Dynamically adapts to load
✅ Great for long-lived connections (WebSockets, streaming)
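
The selection itself is a single scan for the minimum. A sketch, assuming the balancer already tracks a connection count per backend (the numbers mirror the example above):

```typescript
// Pick the backend with the fewest active connections.
interface Backend {
  name: string;
  activeConnections: number;
}

function leastConnections(backends: Backend[]): Backend {
  return backends.reduce((best, b) =>
    b.activeConnections < best.activeConnections ? b : best,
  );
}

const pool: Backend[] = [
  { name: 'server-a', activeConnections: 150 },
  { name: 'server-b', activeConnections: 60 },
  { name: 'server-c', activeConnections: 200 },
];
console.log(leastConnections(pool).name); // → 'server-b'
```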

IP Hash

Hashes the client's IP to always route to the same server.

hash(client_ip) % num_servers → consistent server selection

✅ Achieves session stickiness without cookies
❌ Uneven distribution if many clients share an IP (e.g., behind NAT)
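
The `hash(client_ip) % num_servers` formula can be made concrete with any deterministic string hash; FNV-1a is used below purely to keep the example self-contained:

```typescript
// FNV-1a: a simple, deterministic 32-bit string hash.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep unsigned
  }
  return hash;
}

function pickByIp(clientIp: string, servers: string[]): string {
  return servers[fnv1a(clientIp) % servers.length];
}

const servers = ['server-a', 'server-b', 'server-c'];
const choice = pickByIp('203.0.113.7', servers);
// Repeat calls with the same IP always return the same server.
```

Note a further drawback of the plain modulo form: when `num_servers` changes, most clients remap to a different server. Consistent hashing exists precisely to limit that reshuffling.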

Least Response Time

Routes to the server with the fewest active connections and the fastest average response time.

✅ Most sophisticated dynamic algorithm
✅ Available in NGINX Plus (the least_time method); AWS ALB offers the related least outstanding requests algorithm
❌ Requires tracking response time metrics
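
Exact scoring formulas differ by product; one plausible heuristic is to rank backends by active connections multiplied by average latency, so both signals count. A sketch under that assumption (field names are illustrative):

```typescript
// Score = active connections × average response time; lower score wins.
// This is one heuristic, not any specific product's formula.
interface BackendStats {
  name: string;
  active: number;        // current in-flight requests
  avgResponseMs: number; // rolling average response time
}

function leastResponseTime(backends: BackendStats[]): BackendStats {
  return backends.reduce((best, b) =>
    b.active * b.avgResponseMs < best.active * best.avgResponseMs ? b : best,
  );
}

const winner = leastResponseTime([
  { name: 'server-a', active: 10, avgResponseMs: 120 },
  { name: 'server-b', active: 8, avgResponseMs: 40 },
]);
console.log(winner.name); // → 'server-b'
```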


Health Checks

A load balancer is only as smart as its health checking.

Every 10 seconds:
  GET /health → HTTP 200 → ✅ healthy
  GET /health → HTTP 500 → ❌ remove from pool
  GET /health → timeout  → ❌ remove from pool

Types of health checks:

| Type     | What it checks                      |
|----------|-------------------------------------|
| TCP      | Can connect on the port             |
| HTTP     | Returns expected status code        |
| Custom   | Returns specific body or JSON field |
| Database | App can reach its dependencies      |

Best practice: your /health endpoint should verify database connectivity, cache availability, and any critical dependency — not just return 200 OK unconditionally.

```typescript
// Example health check endpoint (Next.js API route)
export async function GET() {
  try {
    await db.query('SELECT 1'); // verify DB is reachable
    return Response.json({ status: 'ok' }, { status: 200 });
  } catch {
    return Response.json({ status: 'error' }, { status: 500 });
  }
}
```
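
The balancer's side of that loop can be sketched too. The probe function is injected here purely so the polling logic stays self-contained; the `Pool` and `Probe` types are illustrative, not any particular product's API:

```typescript
type Pool = Map<string, boolean>;              // backend URL → currently healthy?
type Probe = (url: string) => Promise<number>; // returns HTTP status, throws on timeout

// One polling pass: probe every backend in parallel and update its health flag.
async function checkOnce(pool: Pool, probe: Probe): Promise<void> {
  await Promise.all(
    [...pool.keys()].map(async (url) => {
      try {
        pool.set(url, (await probe(url)) === 200); // HTTP 200 → healthy
      } catch {
        pool.set(url, false); // timeout / connection refused → remove from pool
      }
    }),
  );
}

// A real probe would be an HTTP GET with a deadline, e.g. on Node 18+:
// const probe: Probe = async (url) =>
//   (await fetch(`${url}/health`, { signal: AbortSignal.timeout(2000) })).status;
// setInterval(() => checkOnce(pool, probe), 10_000); // every 10 seconds
```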

Session Persistence (Sticky Sessions)

Some applications require a user to always reach the same server — for example, when session data is stored in-memory (not in Redis).

Cookie-based stickiness: The load balancer sets a cookie (e.g., AWSALB) on the first response. Subsequent requests from that client carry the cookie, allowing the balancer to route to the same server.

Problems with sticky sessions:

  • Uneven distribution — one server gets all requests from a heavy user
  • If the server dies, the session is lost anyway
  • Contradicts the purpose of horizontal scaling

✅ Better approach: store session state in a shared layer (Redis, database) so any server can handle any request.
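
The shared-layer pattern needs very little code on the application side. In the sketch below the Map-backed class is only a stand-in for a real Redis client, so the interface (and the names) are illustrative:

```typescript
// Any server can handle any request if sessions live behind a shared store.
interface SessionStore {
  get(id: string): Promise<string | null>;
  set(id: string, data: string): Promise<void>;
}

// In-memory stand-in for development; in production this would wrap a shared
// Redis instance so the data outlives any single app server.
class MapStore implements SessionStore {
  private data = new Map<string, string>();
  async get(id: string) { return this.data.get(id) ?? null; }
  async set(id: string, data: string) { this.data.set(id, data); }
}

const store: SessionStore = new MapStore();
await store.set('sess-42', JSON.stringify({ user: 'kirtesh' }));
// Whichever server receives the next request can now load 'sess-42'.
```

Because every server talks to the same store, the balancer is free to use any algorithm it likes, and losing a server loses no sessions.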


Deployment Patterns

Single Load Balancer (Simple)

              ┌──────────┐
Client ──────▶│    LB    │──▶ [ Server 1 | Server 2 | Server 3 ]
              └──────────┘

⚠️ The load balancer itself becomes a single point of failure.

Active-Passive HA Pair

              ┌──────────────┐
              │  Active LB   │──▶ [ Servers... ]
Client ──────▶│              │
              │  Passive LB  │ (takes over if active fails)
              └──────────────┘

✅ High availability — if the active node fails, passive takes over via VIP (virtual IP).

DNS-Based Load Balancing

Multiple A records point to different load balancers or server IPs; DNS resolvers pick among them (often rotating the answer order), spreading clients across endpoints.

api.example.com → 203.0.113.1
api.example.com → 203.0.113.2
api.example.com → 203.0.113.3

✅ Global geographic distribution
❌ TTL means clients may hold stale IPs for minutes
❌ No real-time health awareness

Global Server Load Balancing (GSLB)

Routes users to the geographically closest healthy data center.

User in Mumbai    → Asia-Pacific region servers
User in New York  → US-East region servers
User in Frankfurt → EU-West region servers

Used by: Cloudflare, AWS Route 53, Akamai


Real-World Tools

| Tool          | Type          | Best For                                |
|---------------|---------------|-----------------------------------------|
| NGINX         | L7 (software) | Web apps, reverse proxy, TLS offloading |
| HAProxy       | L4/L7         | High-performance TCP/HTTP balancing     |
| AWS ALB       | L7 (managed)  | AWS-native apps, path-based routing     |
| AWS NLB       | L4 (managed)  | Ultra-low latency, TCP/UDP              |
| Cloudflare LB | L7 (global)   | Global distribution, DDoS protection    |
| Traefik       | L7 (software) | Kubernetes, Docker-native dynamic LB    |
| Envoy         | L7 (software) | Service meshes (Istio), microservices   |

Load Balancing in Microservices

In a microservices architecture, load balancing happens at two levels:

External (north-south): An API gateway or edge load balancer handles traffic from outside the cluster.

Internal (east-west): Service-to-service calls are load-balanced via a service mesh (e.g., Istio with Envoy sidecars) or client-side load balancing (e.g., Ribbon in Spring Cloud).

External Client
      │  (north-south)
      ▼
API Gateway (Nginx/ALB)
      │
      ▼
Service A ──(east-west LB via Envoy)──▶ Service B

Common Pitfalls

Not monitoring backend health granularly. A server that responds to /health with 200 OK but has an exhausted DB connection pool will still fail real requests. Make health checks reflect actual readiness.

Ignoring connection draining. When removing a server from rotation (deploy/scale-in), connections in-flight should complete gracefully before the server is terminated. Most managed LBs support a configurable drain timeout (e.g., 30 seconds).
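
The drain step reduces to: stop accepting new work, then wait for in-flight requests to finish or a deadline to pass. A self-contained sketch of that logic (the `Drainer` class is illustrative; managed load balancers implement this for you via a configurable deregistration delay):

```typescript
// Track in-flight requests; refuse new ones while draining, and report
// whether the pool emptied before the timeout.
class Drainer {
  private inFlight = 0;
  private draining = false;

  tryAccept(): boolean {
    if (this.draining) return false; // taken out of rotation: reject new work
    this.inFlight++;
    return true;
  }

  finish(): void {
    this.inFlight--; // a request completed
  }

  async drain(timeoutMs: number): Promise<boolean> {
    this.draining = true;
    const deadline = Date.now() + timeoutMs;
    while (this.inFlight > 0 && Date.now() < deadline) {
      await new Promise((r) => setTimeout(r, 10)); // poll until idle or timeout
    }
    return this.inFlight === 0; // true → safe to terminate the server
  }
}
```

Only when `drain()` resolves `true` (or the timeout is accepted as a forced cutoff) should the instance actually be terminated.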

Choosing the wrong algorithm for long-lived connections. Round robin works poorly for WebSocket or gRPC streaming connections — least-connections or least-response-time is a better fit.

SSL termination confusion. Decide upfront where TLS terminates. Terminating at the load balancer is simpler (backend uses plain HTTP); re-encryption (LB → backend also uses HTTPS) is more secure but adds CPU overhead.


Choosing the Right Strategy

| Scenario                                  | Recommended Approach                    |
|-------------------------------------------|-----------------------------------------|
| Simple web app, uniform servers           | Round robin                             |
| Mixed hardware capacities                 | Weighted round robin                    |
| Long-lived connections (WebSocket, gRPC)  | Least connections                       |
| High variability in response times        | Least response time                     |
| Stateful app (legacy, in-memory sessions) | IP hash or cookie-based sticky sessions |
| Global multi-region                       | GSLB + Route 53 / Cloudflare            |
| Kubernetes / microservices                | Envoy / Traefik with service mesh       |

Conclusion

Load balancing is not a single technology — it's a discipline. Getting it right means understanding your traffic patterns, connection types, hardware topology, and failure modes.

The best load balancers are invisible: users never think about them because they just work. But behind the scenes, they're the quiet orchestrators keeping your system responsive, resilient, and ready to scale to 10x traffic without breaking a sweat.

A well-designed load balancing strategy is the difference between a system that survives Black Friday and one that collapses under it.

Written by

Kirtesh Admute

Full-stack engineer and digital architect — building scalable, production-grade systems with real-world impact.
