Load Balancing Explained: How Big Websites Stay Up When Millions Log On

Sam Rivera, May 22, 2024

Let's start with a thought experiment. Imagine you open a very popular bakery. On a normal Tuesday morning, one baker can handle everything just fine. But then a local news segment airs, calls your croissants "the best in the city," and suddenly a thousand people show up at once. One baker, completely overwhelmed, can't keep up. Orders get delayed. Some people wait so long they leave. A few customers get served raw croissants because everything is rushed. It's a disaster.

What do you do? You hire more bakers. And you put someone at the front who directs customers to whichever baker has the shortest line.

That's load balancing. And it's exactly how the internet's biggest services stay up and responsive when millions of users hit them simultaneously.

The Fundamental Problem

A single web server has limits. It has a finite amount of CPU, RAM, and network bandwidth. It can handle a certain number of simultaneous connections before it slows down, and beyond that it starts dropping requests or failing entirely.

For a small website, this is fine. Most personal blogs and small business sites never come close to saturating a single server. But for Netflix, Google, Amazon, Instagram, or any service with millions of users — a single server is laughably insufficient. Even a modest website that goes viral can instantly overwhelm a single server.

The solution is to run many servers in parallel, and use a load balancer to distribute incoming requests across them.

What a Load Balancer Does

A load balancer sits in front of a group of servers (called a server pool, server farm, or backend) and acts as the single entry point for all incoming traffic. Clients connect to the load balancer's IP address. The load balancer receives each request, forwards it to one of the backend servers, and returns that server's response to the client.

From the client's perspective, they're talking to one server at the load balancer's address. They have no idea that behind the scenes, their request might be handled by any one of dozens or hundreds of servers.

The load balancer's job is to make sure no single backend server gets overwhelmed while others sit idle: to distribute the work as evenly and efficiently as possible.

Load Balancing Algorithms

The core question for any load balancer: *which server should handle this request?* Different algorithms make different tradeoffs:

Round Robin: The simplest approach. Requests are distributed to servers in a rotating sequence: server 1, server 2, server 3, server 1, server 2, server 3... Each server gets the same number of requests over time.

Pros: Dead simple, easy to understand, fair distribution of request count.

Cons: Doesn't account for request complexity. If request #1 takes 5ms and request #2 takes 5 seconds, servers that have received the same number of requests can end up carrying very different amounts of actual load.
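
To make the rotation concrete, here is a minimal sketch in Python. The class name and server names are made up for illustration, and a real load balancer would also deal with servers joining and leaving the pool.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands out backends in a fixed rotating order (illustrative only)."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        # cycle() repeats the list forever: 1, 2, 3, 1, 2, 3, ...
        self._rotation = cycle(list(servers))

    def pick(self):
        # Each call returns the next server in the rotation.
        return next(self._rotation)


balancer = RoundRobinBalancer(["server1", "server2", "server3"])
picks = [balancer.pick() for _ in range(6)]
```

Six picks walk through the three servers twice, in order, regardless of how long each request takes.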

Weighted Round Robin: Like round robin, but servers are assigned weights based on their capacity. A server with 4x the CPU gets 4x the requests. Useful when servers in the pool have different hardware specs.
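
One simple way to sketch weighting is to repeat each server in the rotation as many times as its weight. The names and weights below are assumptions for illustration. Real balancers often interleave the picks more smoothly so the big server doesn't receive its share in one burst.

```python
from collections import Counter
from itertools import cycle


def weighted_rotation(weights):
    """Build a rotation where each server appears `weight` times per cycle."""
    expanded = [name for name, weight in weights.items() for _ in range(weight)]
    if not expanded:
        raise ValueError("at least one server needs a positive weight")
    return cycle(expanded)


# Hypothetical pool: "big" has 4x the capacity of "small".
rotation = weighted_rotation({"big": 4, "small": 1})
picks = [next(rotation) for _ in range(10)]
counts = Counter(picks)
```

Over two full cycles (ten picks), "big" receives four times as many requests as "small".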

Least Connections: Routes each new request to the server with the fewest active connections at that moment. This is smarter than round robin because it naturally handles slow requests — if a server is busy processing a long request, it will have more active connections and receive fewer new ones.
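
A rough sketch of the bookkeeping involved, assuming the balancer is told when each request starts and finishes (the class and method names are hypothetical):

```python
class LeastConnectionsBalancer:
    """Tracks active connections per server and picks the least busy one."""

    def __init__(self, servers):
        self.active = {server: 0 for server in servers}

    def acquire(self):
        # Choose the server with the fewest active connections right now.
        # Ties go to whichever server was listed first.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        # Call this when the request finishes.
        self.active[server] -= 1


lb = LeastConnectionsBalancer(["a", "b"])
slow = lb.acquire()          # a long-running request lands on "a"
quick_1 = lb.acquire()       # "b" is idle, so it gets this one
lb.release(quick_1)          # the quick request finishes
quick_2 = lb.acquire()       # "a" is still busy, so "b" is picked again
```

While the slow request is still occupying "a", new requests keep flowing to "b".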

Weighted Least Connections: Combines least connections with server weights. Handles both server capacity differences and uneven request complexity.
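
One common way to combine the two ideas is to compare connections per unit of weight rather than raw connection counts. This is a sketch under that assumption, with made-up numbers:

```python
def pick_weighted_least_connections(active, weights):
    """Pick the server with the lowest active-connections-to-weight ratio."""
    return min(active, key=lambda server: active[server] / weights[server])


# Hypothetical snapshot: "big" has more connections but 4x the capacity.
active = {"big": 6, "small": 2}
weights = {"big": 4, "small": 1}
choice = pick_weighted_least_connections(active, weights)
```

Here "big" is at 1.5 connections per unit of weight and "small" is at 2.0, so "big" gets the next request even though it has more connections in absolute terms.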

IP Hash: The client's IP address is hashed to determine which server handles their requests. As long as the pool doesn't change, every request from the same IP goes to the same server. This provides session persistence (also called "sticky sessions"), which is useful for applications that store session state locally on the server.
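
A minimal sketch of the idea, using SHA-256 only because it is stable across runs (the function name and addresses are illustrative, and real balancers use their own hash functions):

```python
import hashlib


def pick_by_ip(client_ip, servers):
    """Map a client IP to a server using a stable hash."""
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into an index into the pool.
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


servers = ["server1", "server2", "server3"]
first = pick_by_ip("203.0.113.7", servers)
second = pick_by_ip("203.0.113.7", servers)
```

The same IP always lands on the same server while the pool is unchanged. With this simple modulo scheme, adding or removing a server remaps most clients, which is why consistent hashing is a common refinement.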

Least Response Time: Routes to the server with the lowest average response time. Requires the load balancer to actively track response times. Smart, but more complex.
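
One way to track response times is an exponentially weighted moving average, so recent requests count more than old ones. This is a sketch under that assumption. The class name and the smoothing factor `alpha` are illustrative choices, not something any particular load balancer prescribes.

```python
class ResponseTimeTracker:
    """Keeps a moving average of response times and picks the fastest server."""

    def __init__(self, servers, alpha=0.2):
        self.alpha = alpha
        self.average_ms = {server: None for server in servers}

    def record(self, server, elapsed_ms):
        previous = self.average_ms[server]
        if previous is None:
            self.average_ms[server] = float(elapsed_ms)
        else:
            # Blend the new sample into the running average.
            self.average_ms[server] = (
                self.alpha * elapsed_ms + (1 - self.alpha) * previous
            )

    def pick(self):
        # Servers with no measurements yet are tried first.
        return min(
            self.average_ms,
            key=lambda s: -1.0 if self.average_ms[s] is None else self.average_ms[s],
        )


tracker = ResponseTimeTracker(["a", "b"])
tracker.record("a", 100)
tracker.record("b", 20)
fast_choice = tracker.pick()     # "b" is faster so far
tracker.record("b", 500)         # "b" hits a very slow request
later_choice = tracker.pick()    # its average is now above "a"
```

After the slow request, the average for "b" rises above 100 ms and traffic shifts back to "a".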

Random: Literally picks a random server. Surprisingly effective at large scale due to the law of large numbers — with many requests, random selection approximates even distribution.
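
A quick simulation shows how even random selection becomes at volume. The pool size, request count, and seed below are arbitrary choices for the experiment.

```python
import random
from collections import Counter


def simulate_random(servers, requests, seed=0):
    """Assign each request to a uniformly random server and count the results."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    return Counter(rng.choice(servers) for _ in range(requests))


servers = [f"server{i}" for i in range(1, 5)]
total = 100_000
counts = simulate_random(servers, total)
# Gap between the busiest and quietest server, as a fraction of all requests.
spread = (max(counts.values()) - min(counts.values())) / total
```

With 100,000 requests across four servers, each server ends up close to a quarter of the traffic, and `spread` is a small fraction of the total. With only a handful of requests, the gap would be proportionally much larger.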

Session Persistence (Sticky Sessions)

This is a nuanced topic worth understanding. Many web applications are stateful — they store information about a user's session on the server itself (in memory or in local files). This was the original and still common pattern.

For such applications, if a user's requests are sent to different servers, each server sees the user as a stranger with no session history. The user might get logged out, lose their shopping cart, or see inconsistent data.

Sticky sessions solve this by ensuring all requests from a specific client always go to the same backend server, typically based on a cookie that the load balancer sets or the client's IP address.

The cleaner architectural solution is stateless design: store all session state in a shared external store (like Redis or a database) that all servers can access. Then it doesn't matter which server handles which request — they all have access to the same session information. This is much more flexible and resilient. Modern application architectures strongly favor stateless design for this reason.
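
The sketch below illustrates the stateless pattern. A plain dictionary stands in for the shared store (in practice this would be something like Redis or a database), and the class and method names are hypothetical.

```python
class SharedSessionStore:
    """Stand-in for an external session store; here it is just a dict."""

    def __init__(self):
        self._data = {}

    def get(self, session_id):
        return self._data.get(session_id, {})

    def put(self, session_id, session):
        self._data[session_id] = session


class AppServer:
    """A backend that keeps no session state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def add_to_cart(self, session_id, item):
        session = self.store.get(session_id)
        cart = session.get("cart", []) + [item]
        self.store.put(session_id, {**session, "cart": cart})

    def view_cart(self, session_id):
        return self.store.get(session_id).get("cart", [])


store = SharedSessionStore()
server_a = AppServer("a", store)
server_b = AppServer("b", store)

server_a.add_to_cart("session-123", "croissant")   # first request hits server a
cart_seen_by_b = server_b.view_cart("session-123")  # next request hits server b
```

Because both servers read and write the same store, the cart survives even though the two requests were handled by different backends.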

Types of Load Balancers

Layer 4 Load Balancers (Transport Layer):

These operate at the TCP/UDP level and make routing decisions based on IP addresses and port numbers — without looking at the content of the traffic. They're very fast and efficient because they don't need to parse application-level protocols. The tradeoff is they can't make decisions based on the content of requests (you can't say "send all requests for /api to server group A and all requests for /static to server group B" at Layer 4).

Layer 7 Load Balancers (Application Layer):

These operate at the HTTP level and can make routing decisions based on the actual content of requests — the URL, headers, cookies, request body. This enables sophisticated routing:

Send requests for `/api/*` to the API server pool

Send requests for `/images/*` to the image server pool

Route mobile users to a mobile-optimized backend

A/B test by routing a percentage of traffic to a new version

Layer 7 load balancers are more flexible but require more processing, making them slightly slower than Layer 4.
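
As a rough illustration of content-based routing, the function below picks a pool from the request path and the User-Agent header. The pool names, route table, and the simple "Mobile" substring check are all assumptions made for the example.

```python
# Hypothetical route table: path prefix -> backend pool name.
ROUTES = [
    ("/api/", "api-pool"),
    ("/images/", "image-pool"),
]
DEFAULT_POOL = "web-pool"


def choose_pool(path, user_agent=""):
    """Pick a backend pool based on what the HTTP request contains."""
    for prefix, pool in ROUTES:
        if path.startswith(prefix):
            return pool
    # Very crude mobile detection, for illustration only.
    if "Mobile" in user_agent:
        return "mobile-pool"
    return DEFAULT_POOL


api_choice = choose_pool("/api/orders/42")
image_choice = choose_pool("/images/logo.png")
mobile_choice = choose_pool("/home", user_agent="Mozilla/5.0 (iPhone) Mobile")
default_choice = choose_pool("/home")
```

None of these decisions are possible at Layer 4, because the path and headers are only visible once the HTTP request has been parsed.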

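Many load balancers in front of web applications are Layer 7 because the flexibility is usually worth the cost. Layer 4 remains a common choice where raw throughput or non-HTTP protocols matter most.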

Health Checks: Knowing When a Server Is Down

An intelligent load balancer doesn't just distribute traffic blindly. It continuously monitors the health of backend servers through health checks.

A health check is a periodic request (every few seconds) that the load balancer sends to each backend server. This might be a simple TCP connection check ("is port 80 accepting connections?"), an HTTP request to a specific endpoint (`/health` that returns a 200 OK if the server is fine), or a more sophisticated check that verifies the server can actually connect to its database.

If a server fails a health check (or several consecutive checks), the load balancer automatically stops sending traffic to it and distributes its load among the remaining healthy servers. When the server recovers, it's added back to the pool.
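
The "several consecutive checks" logic can be sketched as a small counter per server. The threshold of three and the class name are assumptions. The actual probe (a TCP connect or an HTTP request) is left out so the example stays self-contained.

```python
class HealthTracker:
    """Marks a server unhealthy after N consecutive failed checks."""

    def __init__(self, servers, fail_threshold=3):
        self.fail_threshold = fail_threshold
        self.failures = {server: 0 for server in servers}

    def record(self, server, passed):
        # A passing check resets the counter; a failing one increments it.
        self.failures[server] = 0 if passed else self.failures[server] + 1

    def healthy(self):
        return [s for s, n in self.failures.items() if n < self.fail_threshold]


tracker = HealthTracker(["a", "b"])
for _ in range(3):
    tracker.record("a", passed=True)
    tracker.record("b", passed=False)   # "b" fails three checks in a row
pool_during_outage = tracker.healthy()

tracker.record("b", passed=True)        # "b" recovers
pool_after_recovery = tracker.healthy()
```

In this sketch a single passing check restores a server. Real load balancers often require several consecutive successes before sending traffic again, so that a flapping server isn't added back too eagerly.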

This is a critical feature for high availability. With a load balancer and health checks, the failure of one or several backend servers is handled automatically, without manual intervention, and usually without users noticing more than a few slow or failed requests while the failure is being detected.

Content Delivery Networks (CDNs): Load Balancing at Global Scale

When we talk about load balancing at the largest scale — serving users across the entire world with minimal latency — we're talking about CDNs (Content Delivery Networks).

A CDN is a geographically distributed network of servers (called edge servers or points of presence, PoPs) located in data centers around the world. When a user requests content (a video, an image, a web page), the CDN routes them to the closest edge server rather than sending all requests to a central origin server.

For a user in Tokyo, their request goes to a CDN edge server in Tokyo or nearby in Asia. For a user in London, they hit a European edge server. The round-trip time is dramatically shorter, and the load is distributed globally instead of hammering a single location.
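
As a simplified illustration, the sketch below picks the edge server with the shortest great-circle distance to the user. The edge locations and coordinates are approximate and chosen for the example. Real CDNs decide based on network topology and measured performance rather than pure geography.

```python
import math

# Hypothetical edge locations as (latitude, longitude), approximate.
EDGE_SERVERS = {
    "tokyo": (35.68, 139.69),
    "frankfurt": (50.11, 8.68),
    "ashburn": (39.04, -77.49),
}


def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def nearest_edge(user_location):
    return min(EDGE_SERVERS, key=lambda name: haversine_km(user_location, EDGE_SERVERS[name]))


edge_for_osaka = nearest_edge((34.69, 135.50))    # a user in Osaka
edge_for_london = nearest_edge((51.51, -0.13))    # a user in London
```

The user in Osaka is sent to the Tokyo edge and the user in London to the Frankfurt edge, so neither request has to cross an ocean.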

CDN providers like Cloudflare, Akamai, Fastly, and Amazon CloudFront steer users with techniques such as Anycast routing and DNS-based steering. With Anycast, multiple servers share the same IP address globally, and the network routes each user to the one that is nearest in network terms, which is usually also geographically close.

Load Balancing in Cloud Environments

Modern cloud platforms make load balancing almost trivially easy to deploy:

**AWS** offers Elastic Load Balancing (ELB) in several flavors: Application Load Balancer (Layer 7), Network Load Balancer (Layer 4), and Gateway Load Balancer

**Google Cloud** offers Cloud Load Balancing, which can distribute traffic globally across regions from a single IP address

**Azure** offers Azure Load Balancer and Application Gateway

These managed services handle all the operational complexity — hardware maintenance, software updates, scaling — and integrate with auto-scaling groups that automatically add or remove backend servers based on load.

Auto-Scaling: The Partner Technology

Load balancing works best in combination with auto-scaling. Instead of provisioning enough servers to handle your peak load at all times (expensive), auto-scaling automatically adjusts the number of servers based on current demand.

During a traffic spike, auto-scaling launches new server instances. They are registered with the load balancer, which begins distributing traffic to them once they pass health checks. When the spike subsides, extra instances are terminated to save cost. Ideally each one is drained first: the load balancer stops sending it new requests and lets in-flight requests finish before the instance shuts down.
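
The scaling decision itself can be sketched as a small calculation. The per-instance capacity and the minimum and maximum bounds below are assumed values. Real auto-scalers typically work from metrics like CPU utilization and add cooldown periods to avoid thrashing.

```python
import math


def desired_instances(requests_per_second, capacity_per_instance,
                      min_instances=2, max_instances=20):
    """Return how many instances are needed for the current load, within bounds."""
    needed = math.ceil(requests_per_second / capacity_per_instance)
    # Never go below the floor (for redundancy) or above the ceiling (for cost).
    return max(min_instances, min(max_instances, needed))


quiet_night = desired_instances(50, capacity_per_instance=100)
normal_day = desired_instances(450, capacity_per_instance=100)
viral_spike = desired_instances(10_000, capacity_per_instance=100)
```

A quiet night stays at the minimum of 2 instances, a normal day needs 5, and a viral spike is capped at the maximum of 20.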

This combination — load balancer + auto-scaling + cloud infrastructure — is the foundation of how modern web services handle massive, variable traffic loads economically and reliably.

The Invisible Infrastructure

Load balancers are essentially invisible from a user's perspective. You never see them. They don't appear in a page's HTML. When Netflix serves millions of concurrent streams, you see Netflix. You don't see the thousands of servers behind dozens of load balancers that are serving your stream along with everyone else's.

But the moment a major website goes down — really goes down, not just one server but the whole thing — it's often a problem with either the load balancer itself, or with something that cascades across all the backend servers simultaneously (a bad deployment, a database outage, an infrastructure failure).

Understanding load balancing helps you understand not just how the internet handles scale, but why it sometimes fails dramatically. When everything behind the load balancer breaks at once, the load balancer faithfully distributes the failure to all your users equally. It's not a character flaw — it's just doing its job.

Load balancing is one of the unglamorous, invisible pillars that holds the modern internet up. And now you know what it is.