Load Balancing: How Modern Systems Distribute Traffic at Scale

Every time you open Netflix, book a flight, or send a message, your request doesn't land on a single server. It's intelligently routed to one of many — the one best positioned to handle it right now. That intelligence is load balancing.
Without it, a single overloaded server would become your system's Achilles heel. With it, your system becomes elastic, resilient, and fast — regardless of traffic spikes.
What Is a Load Balancer?
A load balancer is a component that sits between clients and your backend servers, distributing incoming requests across a pool of servers based on defined rules.
Client → Load Balancer → [ Server 1 | Server 2 | Server 3 ]
Its three core responsibilities:
- Traffic distribution — spread load across healthy servers
- Health checking — route away from unhealthy instances
- Session management — optionally maintain sticky sessions
Why Load Balancing Is Non-Negotiable
Eliminate single points of failure. If one server crashes, the load balancer routes around it automatically. Your users see nothing.
Horizontal scalability. Add more servers behind the balancer instead of upgrading to a bigger machine (vertical scaling). This is cheaper and more resilient.
Improve response times. Requests go to the least-busy server, reducing queue depth and latency.
Zero-downtime deployments. Rolling deploys work by taking servers out of rotation one at a time, updating them, then re-adding — users experience no interruption.
Types of Load Balancers
Layer 4 (Transport Layer)
Routes based on IP and TCP/UDP information — fast but limited context.
- Operates at the transport layer (TCP/UDP)
- Cannot inspect request content (no URL, no headers)
- Very low latency overhead
- Best for: high-throughput TCP connections, database proxies, gaming servers
Layer 7 (Application Layer)
Routes based on HTTP attributes — flexible and powerful.
- Can inspect URL path, headers, cookies, body
- Enables content-based routing (/api → API servers, /static → CDN origin)
- Supports A/B testing, canary deployments, and header-based routing
- Best for: web apps, APIs, microservices
| Feature | Layer 4 | Layer 7 |
|---|---|---|
| Routing basis | IP + port | URL, headers, body |
| Speed | Faster | Slightly slower |
| TLS termination | Pass-through | Yes |
| Content-based routing | No | Yes |
| Use case | TCP/UDP apps | HTTP/HTTPS apps |
Load Balancing Algorithms
Choosing the right algorithm is critical for optimal distribution.
Round Robin
Requests are distributed sequentially across servers.
Request 1 → Server A
Request 2 → Server B
Request 3 → Server C
Request 4 → Server A (cycle repeats)
✅ Simple, predictable
❌ Ignores server capacity or current load
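The cycle above can be sketched in a few lines (server names here are placeholders):

```typescript
// Round-robin sketch: cycle through the server list in order.
const rrServers = ["Server A", "Server B", "Server C"];
let rrNext = 0;

function pickRoundRobin(): string {
  const server = rrServers[rrNext];
  rrNext = (rrNext + 1) % rrServers.length; // wrap back to the first server
  return server;
}
```

The fourth call wraps back to Server A, exactly as in the trace above — and nothing in the picker looks at how busy each server is.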
Weighted Round Robin
Servers with more capacity get proportionally more traffic.
Server A (weight: 3) → 3 out of every 5 requests
Server B (weight: 2) → 2 out of every 5 requests
✅ Accounts for heterogeneous hardware
❌ Still ignores real-time load
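One naive way to implement the 3:2 split is to expand each server into the rotation by its weight (a sketch, not how production balancers do it):

```typescript
// Weighted round-robin sketch: each server appears in the rotation
// as many times as its weight (A:3, B:2 → A, A, A, B, B).
const pool = [
  { name: "Server A", weight: 3 },
  { name: "Server B", weight: 2 },
];

const rotation: string[] = pool.flatMap((s) => Array(s.weight).fill(s.name));
let wrrNext = 0;

function pickWeighted(): string {
  const server = rotation[wrrNext];
  wrrNext = (wrrNext + 1) % rotation.length;
  return server;
}
```

Real implementations (e.g. NGINX's smooth weighted round robin) interleave the picks instead of clustering them, but the long-run ratio is the same.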
Least Connections
Routes to the server with the fewest active connections.
Server A: 150 connections
Server B: 60 connections ← new request goes here
Server C: 200 connections
✅ Dynamically adapts to load
✅ Great for long-lived connections (WebSockets, streaming)
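A minimal sketch of the selection logic, using the connection counts from the example above:

```typescript
// Least-connections sketch: route to the server with the fewest
// active connections, then count the new request as active.
const activeConns = new Map<string, number>([
  ["Server A", 150],
  ["Server B", 60],
  ["Server C", 200],
]);

function pickLeastConnections(): string {
  let best = "";
  let fewest = Infinity;
  for (const [server, count] of activeConns) {
    if (count < fewest) {
      fewest = count;
      best = server;
    }
  }
  activeConns.set(best, fewest + 1); // the routed request is now active
  return best;
}
```

A real balancer would also decrement the count when a connection closes — that feedback loop is what makes the algorithm adapt to load.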
IP Hash
Hashes the client's IP to always route to the same server.
hash(client_ip) % num_servers → consistent server selection
✅ Achieves session stickiness without cookies
❌ Uneven distribution if many clients share an IP (e.g., behind NAT)
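The hash(client_ip) % num_servers idea can be sketched with a simple rolling string hash (the hash function here is illustrative, not what any particular balancer uses):

```typescript
// IP-hash sketch: hash the client IP to a fixed server index,
// so the same IP always reaches the same server.
const hashServers = ["Server A", "Server B", "Server C"];

function hashIp(ip: string): number {
  let h = 0;
  for (const ch of ip) {
    h = (h * 31 + ch.charCodeAt(0)) >>> 0; // keep h an unsigned 32-bit int
  }
  return h;
}

function pickByIp(ip: string): string {
  return hashServers[hashIp(ip) % hashServers.length];
}
```

Note that with plain modulo, adding or removing a server changes num_servers and remaps most clients to a different server, breaking stickiness at exactly the moment you scale.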
Least Response Time
Routes to the server with the fewest active connections and the fastest average response time.
✅ Most sophisticated dynamic algorithm
✅ Offered by NGINX Plus (as least_time) and AWS ALB (as the closely related least outstanding requests)
❌ Requires tracking response time metrics
Health Checks
A load balancer is only as smart as its health checking.
Every 10 seconds:
GET /health → HTTP 200 → ✅ healthy
GET /health → HTTP 500 → ❌ remove from pool
GET /health → timeout → ❌ remove from pool
Types of health checks:
| Type | What it checks |
|---|---|
| TCP | Can connect on the port |
| HTTP | Returns expected status code |
| Custom | Returns specific body or JSON field |
| Database | App can reach its dependencies |
Best practice: your /health endpoint should verify database connectivity, cache availability, and any critical dependency — not just return 200 OK unconditionally.
```typescript
// Example health check endpoint (Next.js API route)
// `db` is assumed to be your app's database client
export async function GET() {
  try {
    await db.query('SELECT 1'); // verify DB is reachable
    return Response.json({ status: 'ok' }, { status: 200 });
  } catch {
    return Response.json({ status: 'error' }, { status: 500 });
  }
}
```
Session Persistence (Sticky Sessions)
Some applications require a user to always reach the same server — for example, when session data is stored in-memory (not in Redis).
Cookie-based stickiness:
The load balancer sets a cookie (e.g., AWSALB) on the first response. Subsequent requests from that client carry the cookie, allowing the balancer to route to the same server.
Problems with sticky sessions:
- Uneven distribution — one server gets all requests from a heavy user
- If the server dies, the session is lost anyway
- Contradicts the purpose of horizontal scaling
✅ Better approach: store session state in a shared layer (Redis, database) so any server can handle any request.
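A minimal sketch of the shared-session approach — here a Map stands in for Redis, but a real Redis client exposes the same get/set shape (plus TTLs):

```typescript
// Shared session store sketch: session data lives outside the web
// servers, so any server behind the balancer can handle any request.
// The Map is a stand-in for Redis in this example.
type Session = { userId: string };

const sessionStore = new Map<string, Session>();

function saveSession(sessionId: string, data: Session): void {
  sessionStore.set(sessionId, data); // e.g. Redis SET with an expiry
}

function loadSession(sessionId: string): Session | undefined {
  return sessionStore.get(sessionId); // e.g. Redis GET
}
```

Because every server reads and writes the same store, the balancer is free to use any algorithm — no stickiness required.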
Deployment Patterns
Single Load Balancer (Simple)
┌──────────┐
Client ──────▶│    LB    │──▶ [ Server 1 | Server 2 | Server 3 ]
              └──────────┘
⚠️ The load balancer itself becomes a single point of failure.
Active-Passive HA Pair
┌──────────────┐
│ Active LB │──▶ [ Servers... ]
Client ──────▶│ │
│ Passive LB │ (takes over if active fails)
              └──────────────┘
✅ High availability — if the active node fails, the passive node takes over via a VIP (virtual IP).
DNS-Based Load Balancing
Multiple A records point to different load balancers or server IPs. DNS resolver distributes clients across them.
api.example.com → 203.0.113.1
api.example.com → 203.0.113.2
api.example.com → 203.0.113.3
✅ Global geographic distribution
❌ TTL means clients may hold stale IPs for minutes
❌ No real-time health awareness
Global Server Load Balancing (GSLB)
Routes users to the geographically closest healthy data center.
User in Mumbai → Asia-Pacific region servers
User in New York → US-East region servers
User in Frankfurt → EU-West region servers
Used by: Cloudflare, AWS Route 53, Akamai
Real-World Tools
| Tool | Type | Best For |
|---|---|---|
| NGINX | L7 (software) | Web apps, reverse proxy, TLS offloading |
| HAProxy | L4/L7 | High-performance TCP/HTTP balancing |
| AWS ALB | L7 (managed) | AWS-native apps, path-based routing |
| AWS NLB | L4 (managed) | Ultra-low latency, TCP/UDP |
| Cloudflare LB | L7 (global) | Global distribution, DDoS protection |
| Traefik | L7 (software) | Kubernetes, Docker-native dynamic LB |
| Envoy | L7 (software) | Service meshes (Istio), microservices |
Load Balancing in Microservices
In a microservices architecture, load balancing happens at two levels:
External (north-south): An API gateway or edge load balancer handles traffic from outside the cluster.
Internal (east-west): Service-to-service calls are load-balanced via a service mesh (e.g., Istio with Envoy sidecars) or client-side load balancing (e.g., Ribbon in Spring Cloud).
External Client
│
▼
API Gateway (Nginx/ALB)
│
▼
Service A ──(east-west LB via Envoy)──▶ Service B
Common Pitfalls
Not monitoring backend health granularly. A server that responds to /health with 200 OK but has an exhausted DB connection pool will still fail real requests. Make health checks reflect actual readiness.
Ignoring connection draining. When removing a server from rotation (deploy/scale-in), connections in-flight should complete gracefully before the server is terminated. Most managed LBs support a configurable drain timeout (e.g., 30 seconds).
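The drain behavior reduces to a flag plus an in-flight counter — a sketch with hypothetical helper names (a real Node server would use server.close() and a SIGTERM handler):

```typescript
// Connection-draining sketch: once draining starts, refuse new requests
// (the LB has removed this server from rotation) but let in-flight
// requests finish before the process is terminated.
let inFlight = 0;
let draining = false;

function onRequestStart(): boolean {
  if (draining) return false; // new requests are refused during drain
  inFlight++;
  return true;
}

function onRequestEnd(): void {
  inFlight--;
}

function startDrain(): void {
  draining = true; // triggered by deploy/scale-in, e.g. on SIGTERM
}

function safeToTerminate(): boolean {
  return draining && inFlight === 0; // all in-flight work has completed
}
```

The drain timeout mentioned above simply puts an upper bound on how long to wait for safeToTerminate() before killing the process anyway.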
Choosing the wrong algorithm for long-lived connections. Round robin works poorly for WebSocket or gRPC streaming connections — least-connections or least-response-time is a better fit.
SSL termination confusion. Decide upfront where TLS terminates. Terminating at the load balancer is simpler (backend uses plain HTTP); re-encryption (LB → backend also uses HTTPS) is more secure but adds CPU overhead.
Choosing the Right Strategy
| Scenario | Recommended Approach |
|---|---|
| Simple web app, uniform servers | Round robin |
| Mixed hardware capacities | Weighted round robin |
| Long-lived connections (WebSocket, gRPC) | Least connections |
| High variability in response times | Least response time |
| Stateful app (legacy, in-memory sessions) | IP hash or cookie-based sticky sessions |
| Global multi-region | GSLB + Route 53 / Cloudflare |
| Kubernetes / microservices | Envoy / Traefik with service mesh |
Conclusion
Load balancing is not a single technology — it's a discipline. Getting it right means understanding your traffic patterns, connection types, hardware topology, and failure modes.
The best load balancers are invisible: users never think about them because they just work. But behind the scenes, they're the quiet orchestrators keeping your system responsive, resilient, and ready to scale to 10x traffic without breaking a sweat.
A well-designed load balancing strategy is the difference between a system that survives Black Friday and one that collapses under it.
Written by
Kirtesh Admute
Full-stack engineer and digital architect — building scalable, production-grade systems with real-world impact.

