
Rate Limiting & Throttling: Protecting Your APIs at Scale

Kirtesh Admute
April 13, 2026
8 min read

"An API without rate limiting is an open invitation to abuse — whether by a bad actor, a buggy client, or your own traffic spike."

Introduction

You've built a great API. It's fast, well-documented, and handles thousands of requests per second in testing. Then one day a misconfigured client sends 10,000 requests per second, your database falls over, and everything goes down.

This is exactly what rate limiting prevents.

Rate limiting is one of those foundational topics that separates hobby projects from production systems. In this article, we'll cover the core algorithms, when to use each one, and how to implement rate limiting in real systems.


What Is Rate Limiting?

Rate limiting controls how many requests a client can make to your API within a given time window. When a client exceeds the limit, subsequent requests are rejected (usually with an HTTP 429 Too Many Requests response) until the window resets.

Throttling is closely related — instead of hard-rejecting requests, you slow them down. Think of it as a speed governor rather than a hard stop.

Why You Need It

  • Prevent abuse — stop malicious clients or scrapers from hammering your API
  • Ensure fairness — one heavy user shouldn't degrade the experience for everyone else
  • Protect downstream services — your database, cache, and third-party APIs all have limits
  • Cost control — especially important when you pay per API call (OpenAI, Stripe, etc.)
  • SLA enforcement — enforce different limits for free vs. paid tiers

The Five Core Algorithms

1. Fixed Window Counter

The simplest approach: count requests in a fixed time window (e.g., 100 requests per minute). At the start of each window, the counter resets.

Window: 00:00 → 01:00
Counter: 0 → 100 → BLOCKED
Reset at 01:00 → counter back to 0

Implementation:

java
public boolean isAllowed(String clientId) {
    String key = clientId + ":" + getCurrentWindowStart(); // e.g., "user123:202604131430"
    long count = redisClient.incr(key);
    if (count == 1) {
        redisClient.expire(key, 60); // set TTL on first request
    }
    return count <= RATE_LIMIT;
}

Problem: The "boundary burst" issue. A client can make 100 requests at 00:59 and another 100 at 01:01 — effectively 200 requests in 2 seconds.
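The boundary burst is easy to reproduce with a minimal in-memory fixed-window counter. This is a single-node sketch (the clock is passed in explicitly so the behavior is deterministic; a real limiter would use a shared store, as above):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal in-memory fixed-window limiter, used here only to
// demonstrate the boundary-burst weakness.
public class FixedWindowLimiter {
    private final int limit;
    private final long windowMs;
    private final Map<Long, Integer> counters = new HashMap<>();

    public FixedWindowLimiter(int limit, long windowMs) {
        this.limit = limit;
        this.windowMs = windowMs;
    }

    public boolean isAllowed(long nowMs) {
        long window = nowMs / windowMs; // window index; resets each period
        int count = counters.merge(window, 1, Integer::sum);
        return count <= limit;
    }
}
```

With a limit of 100 per minute, 100 requests at 00:59 and 100 more at 01:01 land in different windows, so all 200 pass within two seconds.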


2. Sliding Window Log

Track the timestamp of every request in a sorted set. To check the limit, count requests within the last N seconds.

Current time: 01:30
Window (60s): 00:30 → 01:30
Count requests with timestamp > 00:30

Implementation:

java
public boolean isAllowed(String clientId) {
    long now = System.currentTimeMillis();
    long windowStart = now - WINDOW_SIZE_MS; // e.g., 60_000ms
    String key = "ratelimit:" + clientId;

    // Remove old entries outside window
    redisClient.zremrangeByScore(key, 0, windowStart);

    // Count requests in current window
    long count = redisClient.zcard(key);

    if (count >= RATE_LIMIT) {
        return false;
    }

    // Add current request
    redisClient.zadd(key, now, UUID.randomUUID().toString());
    redisClient.expire(key, WINDOW_SIZE_SECONDS + 1);
    return true;
}

Advantage: No boundary bursts.
Disadvantage: High memory usage — every request timestamp is stored. Not practical at massive scale.


3. Sliding Window Counter

A hybrid that approximates a sliding window using two fixed window counters:

Previous window count: 80 requests
Current window count: 40 requests (40% into the window)

Weighted count = (80 × 0.6) + (40 × 1.0) = 48 + 40 = 88

It estimates how many requests occurred in the sliding window by weighting the previous window based on how far through the current window you are.

This is the approach Cloudflare popularized for rate limiting at the edge. It's memory-efficient and accurate enough for most production use cases.
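No implementation is shown above for this algorithm, so here is a minimal single-node sketch. It keeps only two counters per client and takes the clock as a parameter; in production the counters would live in a shared store like Redis:

```java
// In-memory sliding window counter: weights the previous window's count
// by how much of it still overlaps the sliding window.
public class SlidingWindowCounter {
    private final int limit;
    private final long windowMs;
    private long currentWindowStart;
    private int currentCount;
    private int previousCount;

    public SlidingWindowCounter(int limit, long windowMs) {
        this.limit = limit;
        this.windowMs = windowMs;
    }

    public synchronized boolean isAllowed(long nowMs) {
        long windowStart = (nowMs / windowMs) * windowMs;
        if (windowStart != currentWindowStart) {
            // Roll forward; if more than one full window passed, previous is 0
            previousCount =
                (windowStart - currentWindowStart == windowMs) ? currentCount : 0;
            currentCount = 0;
            currentWindowStart = windowStart;
        }
        double elapsedFraction = (nowMs - windowStart) / (double) windowMs;
        double weighted = previousCount * (1.0 - elapsedFraction) + currentCount;
        if (weighted >= limit) {
            return false;
        }
        currentCount++;
        return true;
    }
}
```

Reusing the numbers above: with 80 requests in the previous window and the client 40% into the current one, the weighted count starts at 80 × 0.6 = 48, so a limit of 100 still admits 52 more requests.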


4. Token Bucket

Imagine a bucket that fills with tokens at a constant rate. Each request consumes one token. If the bucket is empty, the request is rejected.

Bucket capacity: 100 tokens
Refill rate: 10 tokens/second
Current tokens: 5

Request arrives → consume 1 token → 4 remaining
Burst of 10 requests → 4 more allowed, bucket empty → 6 BLOCKED (wait for refill)

Key properties:

  • Handles burst traffic gracefully (up to the bucket capacity)
  • Smooth, predictable refill rate
  • Used by AWS, Stripe, and most major API gateways
java
public class TokenBucket {
    private final long capacity;
    private final double refillRatePerMs;
    private double tokens;
    private long lastRefillTime;

    public TokenBucket(long capacity, double refillRatePerMs) {
        this.capacity = capacity;
        this.refillRatePerMs = refillRatePerMs;
        this.tokens = capacity; // start with a full bucket
        this.lastRefillTime = System.currentTimeMillis();
    }

    public synchronized boolean tryConsume() {
        refill();
        if (tokens >= 1) {
            tokens -= 1;
            return true;
        }
        return false;
    }

    private void refill() {
        long now = System.currentTimeMillis();
        double tokensToAdd = (now - lastRefillTime) * refillRatePerMs;
        tokens = Math.min(capacity, tokens + tokensToAdd);
        lastRefillTime = now;
    }
}

5. Leaky Bucket

The inverse of token bucket. Requests go into a queue (the "bucket") and are processed at a constant rate, regardless of how fast they arrive. Excess requests overflow and are dropped.

Queue capacity: 100
Processing rate: 10 req/second

Burst of 150 requests arrives:
→ 100 go into queue
→ 50 overflow → rejected
→ Queue drains at 10 req/s

Use case: When you need a smooth output rate — e.g., calling a downstream API that charges per request and has no burst tolerance.
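The queue-and-drain behavior can be sketched as a minimal single-node class (the drain is computed from elapsed time rather than a background thread, and the clock is a parameter for determinism; a production version would share the queue across instances):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// In-memory leaky bucket: requests queue up and drain at a fixed rate;
// arrivals that find the queue full overflow and are rejected.
public class LeakyBucket {
    private final int capacity;
    private final double leakRatePerMs; // requests drained per millisecond
    private final Deque<Long> queue = new ArrayDeque<>();
    private long lastLeakTime;
    private double leakCarry; // fractional drains carried between calls

    public LeakyBucket(int capacity, double leakRatePerMs, long startMs) {
        this.capacity = capacity;
        this.leakRatePerMs = leakRatePerMs;
        this.lastLeakTime = startMs;
    }

    public synchronized boolean tryEnqueue(long nowMs) {
        leak(nowMs);
        if (queue.size() >= capacity) {
            return false; // overflow → reject
        }
        queue.addLast(nowMs);
        return true;
    }

    private void leak(long nowMs) {
        leakCarry += (nowMs - lastLeakTime) * leakRatePerMs;
        while (leakCarry >= 1 && !queue.isEmpty()) {
            queue.removeFirst(); // one request leaves at the fixed rate
            leakCarry -= 1;
        }
        if (queue.isEmpty()) {
            leakCarry = 0; // an empty bucket has nothing left to leak
        }
        lastLeakTime = nowMs;
    }
}
```

With capacity 100 and a leak rate of 10 req/s (0.01 per ms), a burst of 150 admits exactly 100, and one second later 10 slots have drained free, matching the diagram above.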


Algorithm Comparison

| Algorithm | Burst Handling | Memory Usage | Smoothness | Best For |
|---|---|---|---|---|
| Fixed Window | Poor | Low | Low | Simple limits, coarse control |
| Sliding Window Log | Excellent | High | High | Precise limits, low traffic |
| Sliding Window Counter | Good | Low | Good | General purpose, production |
| Token Bucket | Good | Low | Medium | Burst-tolerant APIs |
| Leaky Bucket | None | Medium | High | Smooth downstream rate control |

Where to Implement Rate Limiting

Option 1: API Gateway (Recommended)

Handle rate limiting at the edge — before requests hit your application servers.

Client → API Gateway (rate limit check) → Application Servers → Database
          Redis (counter store)

Tools: Kong, AWS API Gateway, NGINX, Envoy, Traefik

This is the most scalable approach. Your app servers never see blocked requests.

Option 2: Application Layer

Implement rate limiting in your application code using a shared Redis instance.

java
// Spring Boot with Resilience4j's @RateLimiter annotation
@RateLimiter(name = "apiRateLimit")
@GetMapping("/api/products")
public ResponseEntity<List<Product>> getProducts() {
    // ...
}

Good for fine-grained per-endpoint control or when you need business logic in your rate limiting (e.g., different limits per user tier).

Option 3: Middleware

In frameworks like Express.js or Spring, rate limiting middleware sits between the request and your handler:

javascript
// Express.js with express-rate-limit
import rateLimit from 'express-rate-limit';

const limiter = rateLimit({
  windowMs: 60 * 1000,      // 1 minute
  max: 100,                 // 100 requests per window
  standardHeaders: true,    // Return RateLimit-* headers
  legacyHeaders: false,
  keyGenerator: (req) => req.user?.id || req.ip, // Per-user or per-IP
});

app.use('/api/', limiter);

Distributed Rate Limiting with Redis

In a multi-instance deployment, you need a shared counter store. Redis is the standard choice — it's fast, atomic, and supports TTL natively.

Instance A ─┐
Instance B ─┼─→ Redis (shared counter) → Allow/Deny
Instance C ─┘

Lua script for atomic sliding window counter (prevents race conditions):

lua
local key = KEYS[1]
local now = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])

-- Remove expired entries
redis.call('ZREMRANGEBYSCORE', key, 0, now - window)

-- Count current entries
local count = redis.call('ZCARD', key)

if count < limit then
    -- Add current request
    redis.call('ZADD', key, now, now .. math.random())
    redis.call('EXPIRE', key, math.ceil(window / 1000) + 1)
    return 1  -- allowed
else
    return 0  -- rejected
end

Using a Lua script ensures the check-and-increment is atomic — no race conditions between distributed instances.


API Response Headers

Always communicate rate limit status to clients via response headers:

HTTP/1.1 200 OK
RateLimit-Limit: 100
RateLimit-Remaining: 73
RateLimit-Reset: 1712967600
Retry-After: 42

When a client is rate limited, return:

HTTP/1.1 429 Too Many Requests
RateLimit-Limit: 100
RateLimit-Remaining: 0
RateLimit-Reset: 1712967600
Retry-After: 42
Content-Type: application/json

{
  "error": "rate_limit_exceeded",
  "message": "Too many requests. Please retry after 42 seconds.",
  "retryAfter": 42
}

This allows well-behaved clients to implement exponential backoff correctly.
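From the client's side, "backing off correctly" means honoring Retry-After when the server sends it and only falling back to exponential backoff with jitter otherwise. A minimal sketch (the class and constant names here are illustrative, not from any particular library):

```java
import java.util.Optional;
import java.util.concurrent.ThreadLocalRandom;

// Client-side retry delay: prefer the server's Retry-After hint,
// otherwise use capped exponential backoff with jitter.
public class BackoffPolicy {
    private static final long BASE_DELAY_MS = 500;
    private static final long MAX_DELAY_MS = 30_000;

    // retryAfterSeconds holds the Retry-After header value, if present
    public static long nextDelayMs(int attempt, Optional<Long> retryAfterSeconds) {
        if (retryAfterSeconds.isPresent()) {
            return retryAfterSeconds.get() * 1000; // the server knows best
        }
        long exp = Math.min(MAX_DELAY_MS,
                            BASE_DELAY_MS * (1L << Math.min(attempt, 10)));
        // Jitter spreads retries out so clients don't stampede in sync
        return ThreadLocalRandom.current().nextLong(exp / 2, exp + 1);
    }
}
```

For the 429 response above, a client on any attempt would simply wait the advertised 42 seconds instead of guessing.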


Tiered Rate Limits

Real-world APIs use different limits for different user tiers:

| Tier | Requests/min | Burst | Cost |
|---|---|---|---|
| Free | 60 | 10 | Free |
| Pro | 600 | 100 | $29/month |
| Enterprise | Unlimited | 1000 | Custom |
java
public RateLimit getRateLimitForUser(User user) {
    return switch (user.getTier()) {
        case FREE       -> new RateLimit(60, Duration.ofMinutes(1));
        case PRO        -> new RateLimit(600, Duration.ofMinutes(1));
        case ENTERPRISE -> new RateLimit(Integer.MAX_VALUE, Duration.ofMinutes(1));
    };
}

Common Mistakes to Avoid

1. Rate Limiting by IP Only

IP-based limiting is easy to bypass (VPNs, rotating proxies). Always rate limit by authenticated user ID when possible. Use IP as a fallback for unauthenticated endpoints.
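The fallback logic is a one-liner worth getting right. A sketch, where the Request record is a stand-in for whatever your framework actually provides:

```java
// Key selection: prefer the authenticated user id, fall back to
// the client IP for anonymous traffic.
public class RateLimitKey {
    public record Request(String userId, String remoteIp) {}

    public static String keyFor(Request req) {
        if (req.userId() != null && !req.userId().isEmpty()) {
            return "user:" + req.userId(); // stable across IP changes
        }
        return "ip:" + req.remoteIp();     // anonymous fallback
    }
}
```

Prefixing the key ("user:" vs "ip:") keeps the two namespaces from colliding in the shared counter store.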

2. No Retry-After Header

Without a Retry-After header, clients don't know when to retry. They either give up or hammer you in a retry storm — making the problem worse.

3. Forgetting Internal Services

Rate limiting isn't just for public APIs. Internal service-to-service calls can also cause cascading failures. Rate limit at every service boundary.

4. Single Point of Failure

If your Redis instance for rate limiting goes down, decide in advance: fail open (allow all traffic) or fail closed (block all traffic). For most APIs, fail open is safer for availability but riskier for abuse.

java
public boolean isAllowed(String clientId) {
    try {
        return redisRateLimiter.tryAcquire(clientId);
    } catch (RedisException e) {
        log.warn("Rate limiter unavailable, failing open");
        return true; // fail open
    }
}

Real-World Example: Stripe's Approach

Stripe uses a token bucket algorithm with these characteristics:

  • Limits are per API key, not per IP
  • Different limits per endpoint (list operations are cheaper than write operations)
  • Burst allowance for legitimate traffic spikes
  • Clear error messages with retry guidance
  • Separate limits for test mode vs. live mode

Their approach is a good model: be generous with limits for legitimate use, clear about why you're limiting, and easy to retry correctly.


Conclusion

Rate limiting is not a nice-to-have — it's a fundamental part of any production API. The right algorithm depends on your use case:

  • Token bucket for most APIs that need to tolerate legitimate bursts
  • Sliding window counter for precise per-user fairness
  • Leaky bucket when you need to protect a slow downstream service

Implement at the API gateway layer for efficiency, use Redis for distributed counters, and always return proper 429 responses with Retry-After headers.

"Rate limiting is the difference between an API that survives Black Friday and one that doesn't."

Build it in from day one — retrofitting it later is much harder.

Written by

Kirtesh Admute

Full-stack engineer and digital architect — building scalable, production-grade systems with real-world impact.

