Rate Limiting: Algorithms and Real-World Implementation

By Vishalini Devarajan

Last updated: Jul 23, 2026

Quick TL;DR
Introduction
What Is Rate Limiting?
Why Rate Limiting Matters in Production Systems
Rate Limiting Algorithms Explained
Comparison of Rate Limiting Algorithms
Common Mistakes When Implementing Rate Limiting
Conclusion
FAQ

What is rate limiting in APIs?
What is the difference between Token Bucket vs Leaky Bucket?
What HTTP status code should be returned when rate limiting a request?
What is the best rate limiting algorithm for a production API?
Why Redis over in-memory storage?
What is sliding window rate limiting?
How do Stripe and Cloudflare implement it?
Can rate limiting provide security?

Quick TL;DR

Rate limiting algorithms are techniques used to control how many requests a client can make to a server within a defined time window.
Common rate limiting algorithms include Token Bucket, Leaky Bucket, Fixed Window Counter, Sliding Window Log, and Sliding Window Counter.
Rate limiting protects APIs and backend systems from abuse, prevents server overload, and ensures fair usage across all clients.
It is a core concept in system design and a frequently asked topic in senior developer and backend engineering interviews.

Introduction

Many developers build APIs without thinking about what happens when a single client sends thousands of requests per second, either intentionally or due to a bug. Rate limiting algorithms are the mechanism that prevents this from bringing your entire system down. Understanding how to choose and implement the right rate limiting strategy is a skill every backend engineer and system designer needs in 2026.

What Is Rate Limiting?

Rate limiting is a technique that restricts the number of requests a client can make to a server within a specified time period. Once a client exceeds the allowed limit, the server rejects further requests until the time window resets.

Rate limiting is used to:

Protect APIs from abuse and brute force attacks
Prevent a single client from consuming all available server resources
Ensure fair usage across all users on a shared platform
Reduce infrastructure costs by preventing unnecessary load
Comply with third-party API usage agreements

Why Rate Limiting Matters in Production Systems

Without rate limiting, a single misbehaving client, whether a bot, a buggy script, or a malicious actor, can flood your server with requests and cause downtime for every other user on the platform.

Rate limiting also plays a critical role in:

API monetisation: Paid tiers can offer higher rate limits, which makes rate limiting a direct revenue mechanism for SaaS products.
Security: Limiting requests per IP address slows down brute force login attempts.
Cost control: Capping request volume stops runaway scripts from generating unexpected cloud infrastructure bills.

Now let’s understand the most widely used rate limiting algorithms and how each one works.

💡 Did You Know?

Cloudflare handles massive global traffic volumes, processing trillions of DNS queries every month across its edge network. To manage that scale and protect against abuse, large distributed systems like this rely on efficient strategies such as the Sliding Window Counter, which tracks real traffic more accurately than a simple fixed-window counter. In distributed environments, rate-limit state is typically kept consistent across many servers with a fast in-memory data store such as Redis, so that each request can be checked with low latency.

Rate Limiting Algorithms Explained

Token Bucket

The Token Bucket algorithm is one of the most widely used rate limiting approaches. Imagine a bucket that holds tokens. Tokens are added to the bucket at a fixed rate up to a maximum capacity. Each incoming request consumes one token. If the bucket has tokens, the request is allowed. If the bucket is empty, the request is rejected.

Key properties:

Allows short bursts of traffic up to the bucket capacity
Smooths out traffic over time through the token refill rate
Simple to implement and memory efficient

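import time

class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity        # maximum number of tokens the bucket can hold
        self.tokens = capacity          # start full so an initial burst is allowed
        self.refill_rate = refill_rate  # tokens added per second
        # A monotonic clock is not affected by system clock adjustments.
        self.last_refill = time.monotonic()

    def allow_request(self):
        # Refill lazily: add the tokens earned since the last call, capped at capacity.
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now

        # Each request costs one token.
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(capacity=10, refill_rate=2)
print(bucket.allow_request())  # Output: True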

Best for: APIs that need to allow occasional bursts while maintaining an average rate limit over time.

Leaky Bucket

The Leaky Bucket algorithm processes requests at a fixed output rate regardless of how fast they arrive. Incoming requests are added to a queue. Requests leak out of the queue at a constant rate. If the queue is full, new requests are dropped.

Unlike Token Bucket, Leaky Bucket enforces a strictly uniform output rate with no bursting allowed.
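The queue-and-drain behaviour described above can be sketched as follows. This is a minimal illustration rather than a production implementation: the `submit` method name and the injectable `clock` parameter are assumptions made for the example, and a real system would hand each drained request to a worker instead of discarding it.

```python
import time
from collections import deque

class LeakyBucket:
    """Bounded queue that drains at a constant rate (illustrative sketch)."""

    def __init__(self, capacity, leak_rate, clock=time.monotonic):
        self.capacity = capacity    # maximum number of queued requests
        self.leak_rate = leak_rate  # requests drained per second
        self.clock = clock          # injectable clock, an assumption for testability
        self.queue = deque()
        self.last_leak = clock()

    def _leak(self):
        # Drain as many whole requests as the elapsed time allows.
        now = self.clock()
        leaked = int((now - self.last_leak) * self.leak_rate)
        if leaked > 0:
            for _ in range(min(leaked, len(self.queue))):
                self.queue.popleft()  # a real system would process the request here
            # Advance only by the time the whole leaks account for,
            # so fractional progress carries over to the next call.
            self.last_leak += leaked / self.leak_rate

    def submit(self, request):
        """Return True if the request was queued, False if it was dropped."""
        self._leak()
        if len(self.queue) < self.capacity:
            self.queue.append(request)
            return True
        return False
```

With `capacity=3` and `leak_rate=1`, four requests submitted at the same instant result in three being queued and the fourth dropped; room for new requests only appears as time passes and the queue drains.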

Best for: Systems that require smooth, consistent traffic flow such as network packet scheduling and video streaming pipelines.

Fixed Window Counter

The Fixed Window Counter divides time into fixed windows of a set duration, for example one minute. A counter tracks how many requests a client has made in the current window. Once the counter reaches the limit, further requests are rejected until the next window begins.

import time

class FixedWindowCounter:
    def __init__(self, limit, window_size):
        self.limit = limit              # maximum requests allowed per window
        self.window_size = window_size  # window length in seconds
        self.counter = 0
        self.window_start = time.monotonic()

    def allow_request(self):
        now = time.monotonic()

        # Start a fresh window once the current one has expired.
        if now - self.window_start >= self.window_size:
            self.counter = 0
            self.window_start = now

        # Count the request only if the window still has room.
        if self.counter < self.limit:
            self.counter += 1
            return True
        return False

Best for: Simple use cases where approximate rate limiting is acceptable and implementation simplicity is a priority.
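The reason this algorithm is only approximate is the window boundary: a client can spend a full quota at the very end of one window and another full quota at the start of the next. The sketch below replays a list of request timestamps through the same reset-and-count logic to show the effect. The helper name `replay_fixed_window` and the timestamps are made up for illustration.

```python
def replay_fixed_window(timestamps, limit, window_size):
    """Replay ascending request timestamps (in seconds) through a fixed
    window counter and return the timestamps that were allowed."""
    allowed = []
    window_start, counter = 0, 0
    for ts in timestamps:
        # Same rule as the class above: reset once the window has expired.
        if ts - window_start >= window_size:
            window_start, counter = ts, 0
        if counter < limit:
            counter += 1
            allowed.append(ts)
    return allowed

# Limit: 5 requests per 60-second window.
# Five requests arrive at t=59 and five more at t=60.
timestamps = [59] * 5 + [60] * 5
allowed = replay_fixed_window(timestamps, limit=5, window_size=60)
print(len(allowed))  # 10
```

All ten requests pass within about one second, which is double the intended rate of five per minute. The sliding window algorithms below exist to close this gap.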

Sliding Window Log

The Sliding Window Log keeps a timestamped log of every request made by a client. When a new request arrives, the algorithm removes all entries older than the window and allows the request only if the number of remaining entries is below the limit. Because every request is stored, memory use grows with the limit and with the number of clients.
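A minimal sketch of this log-based approach, in the same style as the earlier classes. It assumes that only accepted requests are logged (some variants also record rejected ones), and the `clock` parameter is an addition for testability rather than part of the algorithm.

```python
import time
from collections import deque

class SlidingWindowLog:
    def __init__(self, limit, window_size, clock=time.monotonic):
        self.limit = limit              # maximum requests per window
        self.window_size = window_size  # window length in seconds
        self.clock = clock              # injectable clock (assumption)
        self.log = deque()              # timestamps of accepted requests, oldest first

    def allow_request(self):
        now = self.clock()

        # Evict timestamps that have fallen out of the window.
        while self.log and now - self.log[0] >= self.window_size:
            self.log.popleft()

        # Accept only while fewer than `limit` requests remain in the window.
        if len(self.log) < self.limit:
            self.log.append(now)
            return True
        return False
```

Because the window is measured back from each request rather than from a fixed start time, there is no boundary at which the count resets all at once.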

Best for: Applications requiring precise rate limiting with no boundary attack vulnerability.

Sliding Window Counter

The Sliding Window Counter is a hybrid of Fixed Window Counter and Sliding Window Log. It uses two fixed window counters, the current and previous window, and calculates a weighted count based on how far into the current window the request arrives.

This approximates a true sliding window while using far less memory than the Sliding Window Log.
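One common way to compute the weighted count is previous_count * (1 - elapsed_fraction) + current_count, where elapsed_fraction is how far the current window has progressed. The sketch below applies that formula; it assumes windows aligned to multiples of `window_size` on the clock, and the class layout and `clock` parameter are illustrative choices, not a reference implementation.

```python
import time

class SlidingWindowCounter:
    def __init__(self, limit, window_size, clock=time.time):
        self.limit = limit
        self.window_size = window_size
        self.clock = clock  # injectable clock (assumption)
        self.current_window = int(clock() // window_size)
        self.current_count = 0
        self.previous_count = 0

    def allow_request(self):
        now = self.clock()
        window = int(now // self.window_size)

        if window != self.current_window:
            # Roll over: the old count matters only if the windows are adjacent.
            if window == self.current_window + 1:
                self.previous_count = self.current_count
            else:
                self.previous_count = 0
            self.current_count = 0
            self.current_window = window

        # Fraction of the current window that has already elapsed (0.0 to <1.0).
        elapsed_fraction = (now % self.window_size) / self.window_size

        # Weight the previous window by the part of it still inside the sliding window.
        estimated = self.previous_count * (1 - elapsed_fraction) + self.current_count

        if estimated < self.limit:
            self.current_count += 1
            return True
        return False
```

For example, with a limit of 10 per 60 seconds and 10 requests in the previous window, a request arriving halfway through the current window sees an estimate of 10 * 0.5 = 5, so five more requests fit before the limit is reached. Only two counters are stored per client, regardless of the limit.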

Best for: High-traffic production systems that need accurate rate limiting with low memory overhead. Cloudflare has publicly described using this approach for its rate limiting.

Comparison of Rate Limiting Algorithms

Algorithm	Burst Handling	Memory Usage	Accuracy	Best Use Case
Token Bucket	Allows bursts	Low	High	General API rate limiting
Leaky Bucket	No bursts	Low	High	Uniform output rate systems
Fixed Window Counter	Allows boundary burst	Very low	Approximate	Simple low-stakes rate limiting
Sliding Window Log	No boundary burst	High	Exact	Precise per-client limiting
Sliding Window Counter	Partial bursts	Low	Near-exact	High-traffic production APIs

💡 Did You Know?

Modern API platforms such as Stripe implement rate limiting at multiple layers, including both per-endpoint and per-API-key limits, each with its own defined usage budget. This layered approach helps protect backend systems from overload while ensuring fair usage across clients and services. When a limit is exceeded, the API typically responds with a 429 Too Many Requests status code, along with a Retry-After header that tells the client when it can safely retry the request. This pattern has become an industry standard for handling rate-limited APIs, enabling predictable backoff behavior and more resilient distributed systems.
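As a rough sketch of the 429 plus Retry-After pattern, the function below builds a framework-agnostic response. The function name, the JSON body fields, and the choice to round up to at least one second are assumptions for illustration; they are not taken from Stripe's or any other provider's API.

```python
import json
import math

def rate_limit_response(seconds_until_reset):
    """Build a (status, headers, body) triple for a throttled request."""
    # Retry-After takes whole seconds, so round up and never advertise 0.
    retry_after = max(1, math.ceil(seconds_until_reset))
    headers = {
        "Content-Type": "application/json",
        "Retry-After": str(retry_after),
    }
    # Hypothetical body shape; real APIs define their own error format.
    body = json.dumps({"error": "rate_limited", "retry_after_seconds": retry_after})
    return 429, headers, body

status, headers, body = rate_limit_response(12.3)
print(status, headers["Retry-After"])  # 429 13
```

A well-behaved client reads the Retry-After value and waits at least that long before retrying, instead of retrying immediately.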

Common Mistakes When Implementing Rate Limiting

1. Implementing at the Application Layer: When every microservice reimplements rate limiting with its own unshared state, limits become inconsistent and easy to bypass. Enforce limits at the API gateway or in shared middleware backed by a shared store such as Redis.

2. Wrong HTTP Status Code: Returning 400 or 500 for rate limits is incorrect. Use 429 Too Many Requests with a Retry-After header indicating when to retry.

3. Single Global Rate Limit: Treating all clients the same penalizes paying customers. Implement tiered limits based on user plan, API key type, or client identity.

4. No Distributed Rate Limiting: Local memory counters work on a single server but break at scale. Each instance keeps its own counter, so clients can bypass limits by hitting different servers. Use a shared distributed store such as Redis.

5. Ignoring Internal Service Calls: Rate limiting only external APIs leaves internal endpoints unprotected. A misconfigured internal service can generate thousands of requests per second and cause the same overload you were trying to prevent.
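Mistakes 3 and 4 can be addressed together with a counter kept in a shared store and a limit looked up per plan. The sketch below is illustrative only: the plan names and limits are made up, and `InMemoryStore` is a stand-in exposing just the `incr` and `expire` operations that a Redis client also provides. It uses a simple fixed window for brevity, and in a real deployment the increment and expiry would need to be made atomic.

```python
import time

# Hypothetical plan names and per-minute limits, for illustration only.
PLAN_LIMITS = {"free": 60, "pro": 600, "enterprise": 6000}

class InMemoryStore:
    """Stand-in for a shared store such as Redis; exposes only incr/expire."""

    def __init__(self):
        self.data = {}

    def incr(self, key):
        self.data[key] = self.data.get(key, 0) + 1
        return self.data[key]

    def expire(self, key, seconds):
        pass  # no-op here; a real store would delete the key after `seconds`

def allow_request(store, client_id, plan, window_size=60, now=None):
    """Fixed-window check against a store shared by all server instances."""
    now = time.time() if now is None else now
    window = int(now // window_size)

    # One counter per client per window, visible to every instance.
    key = f"ratelimit:{client_id}:{window}"
    count = store.incr(key)
    if count == 1:
        # First hit in this window: schedule cleanup of the counter.
        store.expire(key, window_size * 2)

    return count <= PLAN_LIMITS[plan]
```

Because every instance increments the same key, it no longer matters which server a client reaches, and changing a customer's plan changes the limit without touching the counting logic.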

Conclusion

As backend systems scale to serve millions of users and API ecosystems become more complex, rate limiting algorithms are no longer optional. They are a fundamental layer of protection, fairness, and reliability in every production system.

Understanding when to use Token Bucket over Sliding Window Counter, how to implement distributed rate limiting with Redis, and how to communicate limits clearly to API consumers will set you apart as a backend engineer.

FAQ

1. What is rate limiting in APIs?

Rate limiting restricts how many requests a client can make in a defined period. Requests over the limit receive a 429 Too Many Requests response until the window resets.

2. What is the difference between Token Bucket vs Leaky Bucket?

Token Bucket allows traffic bursts up to the bucket capacity, while Leaky Bucket enforces a strictly uniform output rate with no bursting. Token Bucket is the more flexible choice for user-facing APIs; Leaky Bucket suits systems that need smooth traffic.

3. What HTTP status code should be returned when rate limiting a request?

Return 429 Too Many Requests, and include a Retry-After header that tells the client how many seconds to wait before retrying.

4. What is the best rate limiting algorithm for a production API?

Sliding Window Counter is usually the strongest default: it is near-exact, uses little memory, and largely removes the boundary burst of fixed windows. Token Bucket is a close second when short bursts are acceptable.

5. Why Redis over in-memory storage?

In-memory counters exist per server, so clients can bypass limits by hitting different servers. Redis provides a shared store, so one limit is enforced across all instances.

6. What is sliding window rate limiting?

It counts requests over a continuously moving time window rather than fixed intervals. This removes the boundary attack, in which a client sends up to double the allowed rate by straddling two fixed windows.

7. How do Stripe and Cloudflare implement it?

Stripe applies separate limits per API key and per endpoint and returns 429 Too Many Requests when a limit is exceeded. Cloudflare has described using a Sliding Window Counter approach across its edge network, where each check must add very little latency to the request.

8. Can rate limiting provide security?

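Yes. It is an effective first line of defence against brute force and credential stuffing attacks, and it helps absorb some denial-of-service traffic. Common uses include limiting login attempts per IP, restricting password resets, and capping OTP verification attempts.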
