Introduction
Rate limiting is one of the most critical mechanisms for protecting APIs from abuse, ensuring fair usage, and maintaining service reliability under load. Without rate limiting, a single misbehaving client can overwhelm your backend services, degrading performance for all users. Rate limiting controls the number of requests a client can make within a given time window, rejecting or throttling excess requests. It's the foundation of API security, cost management, and quality of service.
Every major API provider, from Stripe and GitHub to AWS and Google Cloud, implements rate limiting as a core infrastructure component. Stripe limits live-mode traffic to 100 requests per second, GitHub allows 5,000 requests per hour for authenticated users, and AWS API Gateway applies a default account-level throttle of 10,000 requests per second with a burst capacity of 5,000 requests. These limits aren't arbitrary; they're calibrated to protect infrastructure while enabling legitimate use cases.
The choice of rate limiting algorithm significantly impacts the user experience and system behavior. Token bucket allows burst traffic while maintaining average rates. Sliding window provides smooth, consistent limiting. Fixed window is simple but can allow traffic spikes at window boundaries. Understanding these algorithms and their trade-offs is essential for implementing rate limiting that protects your system without unnecessarily restricting legitimate users.
Modern APIs use distributed rate limiting — coordinating limits across multiple server instances, data centers, and regions. This requires shared state (typically in Redis) and careful handling of network partitions and clock skew. This guide covers the algorithms, implementation patterns, and production considerations for building robust rate limiting systems.
Why Rate Limiting Matters
Protection Against Abuse
Without rate limiting, a single client can send thousands of requests per second, consuming CPU, memory, database connections, and network bandwidth. This degrades service for all users. Rate limiting acts as a safeguard, preventing any single client from monopolizing shared resources.
Cost Control
For APIs that call expensive downstream services — LLMs like GPT-4, payment processors like Stripe, or external databases — each request has a real cost. A runaway script making 10,000 API calls to an LLM endpoint could cost hundreds of dollars in minutes. Rate limiting caps this exposure.
Fair Resource Allocation
In multi-tenant SaaS platforms, rate limiting ensures that no single tenant can monopolize shared infrastructure. A free-tier user shouldn't be able to consume the same resources as an enterprise customer paying 100x more.
DDoS Mitigation
Rate limiting is the first line of defense against distributed denial-of-service attacks. While it won't stop a sophisticated DDoS attack alone, it prevents simple volumetric attacks from overwhelming your infrastructure.
Rate Limiting Algorithms Deep Dive
The Token Bucket Algorithm
The token bucket algorithm is the most widely used rate limiting algorithm in production systems. It works by maintaining a bucket that holds tokens. Tokens are added to the bucket at a fixed rate (the refill rate). Each incoming request must consume one or more tokens to proceed. If the bucket is empty, the request is rejected or queued.
The key parameters are capacity (maximum burst size) and refill rate (tokens added per second). A capacity of 10 with a refill rate of 2 tokens/second allows bursts of 10 requests while maintaining an average of 2 requests per second over time.
How it works step by step:
- A bucket starts with capacity tokens
- Every second, refillRate tokens are added (up to the capacity)
- When a request arrives, check if enough tokens exist
- If yes: subtract tokens, allow request
- If no: reject request (or queue it)
Advantages:
- Allows burst traffic up to the bucket capacity
- Smooth average rate over time
- Memory efficient — only two values to track per client
- Simple to implement and reason about
Disadvantages:
- Burst behavior can surprise backends not designed for it
- Doesn't provide precise per-second limiting
Real-world usage: Amazon API Gateway documents its throttling as a token bucket, and Stripe has described its rate limiters as token-bucket based. Token-bucket-style limiters are also common in load balancers and proxies.
class TokenBucket {
private tokens: number;
private lastRefill: number;
constructor(
private capacity: number,
private refillRate: number // tokens per second
) {
this.tokens = capacity;
this.lastRefill = Date.now();
}
consume(tokens: number = 1): { allowed: boolean; remaining: number; resetIn: number } {
this.refill();
if (this.tokens >= tokens) {
this.tokens -= tokens;
return {
allowed: true,
remaining: Math.floor(this.tokens),
resetIn: 0,
};
}
const waitTime = ((tokens - this.tokens) / this.refillRate) * 1000;
return {
allowed: false,
remaining: 0,
resetIn: Math.ceil(waitTime),
};
}
private refill(): void {
const now = Date.now();
const elapsed = (now - this.lastRefill) / 1000;
this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillRate);
this.lastRefill = now;
}
}
// Usage
const bucket = new TokenBucket(10, 2); // 10 burst, 2/sec sustained
const result = bucket.consume(1);
console.log(result); // { allowed: true, remaining: 9, resetIn: 0 }
The Sliding Window Algorithm
The sliding window algorithm tracks requests within a rolling time window. Instead of resetting counters at fixed intervals, it considers the current time and counts all requests within the preceding window period. This eliminates the boundary spike problem of fixed window algorithms.
Two variants exist:
- Sliding Window Counter: Combines the current partial window with a weighted portion of the previous window. If you're 30% into the current minute, you count 70% of the previous minute's count plus 100% of the current minute's count.
- Sliding Window Log: Stores a timestamp for each request in a sorted set. Counts requests within the window by querying the sorted set. More precise but uses more memory.
The weighted formula:
effectiveCount = (previousWindowCount × overlapPercentage) + currentWindowCount
For example, with a 1-minute window and limit of 100:
- Previous minute: 80 requests
- Current minute: 20 requests, 30 seconds elapsed (50% overlap)
- Effective count: (80 × 0.5) + 20 = 60 requests
Advantages:
- No boundary spike problem
- Smooth, predictable behavior
- Memory efficient (counter variant)
- Easy to understand and debug
Disadvantages:
- Slightly more complex than fixed window
- Approximate in the counter variant
- Log variant uses O(n) memory per client
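The log variant described above can be sketched in memory to make its behaviour concrete. This is a minimal illustration rather than a production implementation: the class name `SlidingWindowLog`, the per-client `Map`, and the injectable `now` parameter are all assumptions introduced here so the example can be checked deterministically.

```typescript
// In-memory sliding window log: one timestamp is stored per accepted request.
class SlidingWindowLog {
  // Hypothetical per-client store; a multi-instance deployment needs shared state.
  private logs = new Map<string, number[]>();

  constructor(private limit: number, private windowMs: number) {}

  // `now` is injectable (an assumption of this sketch) to allow deterministic checks.
  isAllowed(key: string, now: number = Date.now()): boolean {
    const windowStart = now - this.windowMs;
    // Drop timestamps that have fallen out of the rolling window.
    const recent = (this.logs.get(key) ?? []).filter((t) => t > windowStart);
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(now);
    this.logs.set(key, recent);
    return allowed;
  }
}

const log = new SlidingWindowLog(2, 1000); // 2 requests per rolling second
const decisions = [0, 100, 200, 1050, 1150].map((t) => log.isAllowed("client-a", t));
console.log(decisions); // [ true, true, false, true, true ]
```

The third request is rejected because two requests already sit inside the window. By 1050 ms the request from time 0 has aged out, so capacity frees up gradually instead of resetting all at once.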
function slidingWindowCounter(
previousCount: number,
currentCount: number,
windowMs: number,
elapsedMs: number
): number {
const overlapPercent = 1 - (elapsedMs / windowMs);
return Math.floor(previousCount * overlapPercent) + currentCount;
}
The Fixed Window Algorithm
The simplest approach: count requests in fixed time intervals (per second, per minute, per hour). When the count exceeds the limit, reject requests until the window resets. Simple to implement but allows up to 2x the intended rate at window boundaries.
The boundary problem:
Imagine a limit of 100 requests per minute. A client sends 100 requests at 11:00:59 (end of window) and 100 more at 11:01:00 (start of next window). That's 200 requests in 2 seconds — double the intended rate.
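The boundary scenario above can be reproduced with a small simulation. The sketch below assumes windows aligned to multiples of `windowMs` on the clock; the function name `countAcceptedFixedWindow` and the synthetic timestamps are hypothetical and exist only for this illustration.

```typescript
// Count how many requests a clock-aligned fixed window limiter would accept
// for a list of request timestamps given in milliseconds.
function countAcceptedFixedWindow(timestamps: number[], limit: number, windowMs: number): number {
  const counts = new Map<number, number>(); // window index -> accepted count
  let accepted = 0;
  for (const t of timestamps) {
    const windowIndex = Math.floor(t / windowMs);
    const used = counts.get(windowIndex) ?? 0;
    if (used < limit) {
      counts.set(windowIndex, used + 1);
      accepted++;
    }
  }
  return accepted;
}

// 100 requests in the last second of one minute, 100 in the first second of the next.
const endOfWindow = Array.from({ length: 100 }, (_, i) => 59000 + i * 10);
const startOfNext = Array.from({ length: 100 }, (_, i) => 60000 + i * 10);
const acceptedAtBoundary = countAcceptedFixedWindow([...endOfWindow, ...startOfNext], 100, 60000);
console.log(acceptedAtBoundary); // 200
```

All 200 requests are accepted even though they arrive within two seconds, because each batch is counted against a different window.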
When to use: Simple internal services, rate limiting non-critical endpoints, or when you need the simplest possible implementation and can tolerate boundary spikes.
class FixedWindowRateLimiter {
private windows: Map<string, { count: number; resetAt: number }> = new Map();
isAllowed(key: string, limit: number, windowMs: number): boolean {
const now = Date.now();
const window = this.windows.get(key);
if (!window || now >= window.resetAt) {
this.windows.set(key, { count: 1, resetAt: now + windowMs });
return true;
}
if (window.count >= limit) {
return false;
}
window.count++;
return true;
}
}
The Leaky Bucket Algorithm
Requests enter a queue (bucket) and are processed at a fixed rate. If the queue is full, requests are rejected. This produces perfectly smooth output traffic regardless of input burstiness, but adds latency as requests wait in the queue.
Key difference from token bucket: Leaky bucket shapes outgoing traffic to a fixed rate, while token bucket allows bursts. Leaky bucket is a traffic shaper; token bucket is a rate limiter.
Use cases: Network traffic shaping, API gateways that need to protect fragile backends that can't handle any burst traffic, and systems where consistent throughput matters more than latency.
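A rough sketch of the queue-based leaky bucket may help contrast it with the token bucket. Everything here is illustrative: the `LeakyBucket` class, its `offer` and `drain` methods, and the polling-style draining with an injectable clock are assumptions of this example, not a reference implementation.

```typescript
// Leaky bucket as a bounded FIFO queue that is drained at a fixed rate.
class LeakyBucket<T> {
  private queue: T[] = [];
  private lastLeak = 0;

  constructor(
    private capacity: number, // maximum number of queued requests
    private leakRate: number  // requests released per second
  ) {}

  // Try to enqueue a request; returns false when the bucket is full.
  offer(item: T, now: number = Date.now()): boolean {
    if (this.queue.length >= this.capacity) return false;
    // Restart the leak clock when the bucket was empty, so idle time
    // does not turn into a burst later.
    if (this.queue.length === 0) this.lastLeak = now;
    this.queue.push(item);
    return true;
  }

  // Return the requests that are due for processing at `now`.
  drain(now: number = Date.now()): T[] {
    const intervalMs = 1000 / this.leakRate;
    const due = Math.floor((now - this.lastLeak) / intervalMs);
    if (due <= 0) return [];
    // Advance by whole intervals only, so fractional time is carried over.
    this.lastLeak += due * intervalMs;
    return this.queue.splice(0, due);
  }
}

const leaky = new LeakyBucket<string>(3, 2); // queue of 3, drained at 2 requests/sec
const offers = ["a", "b", "c", "d"].map((id) => leaky.offer(id, 0));
const at400 = leaky.drain(400);
const at500 = leaky.drain(500);
const at1600 = leaky.drain(1600);
console.log(offers, at400, at500, at1600);
// [ true, true, true, false ] [] [ 'a' ] [ 'b', 'c' ]
```

The fourth request is rejected because the queue is full, and accepted requests leave at most one per 500 ms interval regardless of how quickly they arrived. A caller that polls `drain` late receives every request that has become due since the last call.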
Distributed Rate Limiting with Redis
In production, your API runs on multiple servers. In-memory rate limiting on each server is inconsistent — if you have 10 servers each allowing 100 requests/minute, a client can actually make 1,000 requests/minute by distributing across servers. You need a shared state.
Why Redis?
Redis is the de facto standard for distributed rate limiting because:
- Sub-millisecond latency: Rate limit checks add <1ms to each request
- Atomic operations: Lua scripts ensure check-and-increment happen atomically
- Built-in data structures: Sorted sets are perfect for sliding window logs
- TTL support: Keys automatically expire, preventing memory leaks
- Cluster support: Redis Cluster provides high availability for production
Sliding Window with Redis Sorted Sets
import Redis from 'ioredis';
const redis = new Redis(process.env.REDIS_URL);
async function slidingWindowRateLimit(
key: string,
limit: number,
windowMs: number
): Promise<{ allowed: boolean; remaining: number; resetIn: number }> {
const now = Date.now();
const windowStart = now - windowMs;
const pipeline = redis.pipeline();
// Remove expired entries outside the window
pipeline.zremrangebyscore(key, 0, windowStart);
// Count requests still in the window
pipeline.zcard(key);
// Add current request timestamp as a unique member
const member = `${now}:${Math.random()}`;
pipeline.zadd(key, now.toString(), member);
// Auto-expire the key to prevent memory leaks
pipeline.expire(key, Math.ceil(windowMs / 1000));
const results = await pipeline.exec();
const count = results![1][1] as number;
if (count >= limit) {
// Find when the oldest request in the window expires
const oldest = await redis.zrange(key, 0, 0, 'WITHSCORES');
const resetIn = oldest.length >= 2
? parseInt(oldest[1], 10) + windowMs - now
: windowMs;
// Remove the exact member added above, since this request is over the limit
await redis.zrem(key, member);
return { allowed: false, remaining: 0, resetIn: Math.ceil(resetIn) };
}
return { allowed: true, remaining: limit - count - 1, resetIn: 0 };
}
Atomic Rate Limiting with Lua Scripts
For production systems, use Lua scripts to make the rate limit check atomic. This prevents race conditions where two concurrent requests both check the count, see it's under the limit, and both increment — exceeding the limit.
-- Lua script for atomic sliding window rate limiting
-- KEYS[1] = rate limit key
-- ARGV[1] = window size in milliseconds
-- ARGV[2] = max requests allowed
-- ARGV[3] = current timestamp
local key = KEYS[1]
local window = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local windowStart = now - window
-- Remove expired entries
redis.call('ZREMRANGEBYSCORE', key, 0, windowStart)
-- Count current entries
local count = redis.call('ZCARD', key)
if count < limit then
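-- Add new entry. Note: on Redis versions that seed math.random() identically for
-- every script run, two requests in the same millisecond can produce the same member;
-- passing a unique ID from the client as an extra ARGV avoids that.
redis.call('ZADD', key, now, now .. ':' .. math.random())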
redis.call('EXPIRE', key, math.ceil(window / 1000))
return {1, limit - count - 1}
else
-- Get oldest entry for reset time
local oldest = redis.call('ZRANGE', key, 0, 0, 'WITHSCORES')
local resetIn = window
if #oldest >= 2 then
resetIn = tonumber(oldest[2]) + window - now
end
return {0, 0, resetIn}
end
// Load and execute the Lua script
const RATE_LIMIT_SCRIPT = `
-- (Lua script from above)
`;
async function atomicRateLimit(
key: string,
limit: number,
windowMs: number
): Promise<{ allowed: boolean; remaining: number; resetIn?: number }> {
const now = Date.now();
const result = await redis.eval(
RATE_LIMIT_SCRIPT,
1,
key,
windowMs.toString(),
limit.toString(),
now.toString()
) as number[];
return {
allowed: result[0] === 1,
remaining: result[1],
resetIn: result[2] || undefined,
};
}
Architecture Patterns
Gateway-Level Rate Limiting
Implement rate limiting at the API gateway level (Kong, AWS API Gateway, NGINX, Envoy). This protects all backend services uniformly and centralizes rate limit configuration.
# NGINX rate limiting configuration
http {
# Define rate limit zones
limit_req_zone $binary_remote_addr zone=api_general:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=api_expensive:10m rate=1r/s;
limit_req_zone $http_api_key zone=api_by_key:10m rate=100r/s;
server {
location /api/ {
# General API: 10 req/sec with burst of 20
limit_req zone=api_general burst=20 nodelay;
limit_req_status 429;
# Add rate limit headers
add_header X-RateLimit-Limit 10 always;
}
location /api/llm/ {
# Expensive endpoints: 1 req/sec
limit_req zone=api_expensive burst=5 nodelay;
}
}
}
Kong Gateway rate limiting:
# Kong rate limiting plugin configuration
plugins:
  - name: rate-limiting
    config:
      minute: 100
      hour: 5000
      policy: redis
      redis:
        host: redis-cluster.internal
        port: 6379
      hide_client_headers: false
      fault_tolerant: true
Application-Level Rate Limiting
Implement rate limiting within your application code for finer-grained control based on business logic.
import express from 'express';
const app = express();
// Rate limit middleware with tiered limits
function rateLimitMiddleware(limit: number, windowMs: number) {
return async (req: express.Request, res: express.Response, next: express.NextFunction) => {
// req.baseUrl is the mount path (e.g. /api/search), so every sub-path shares one counter
const key = `ratelimit:${req.user?.id || req.ip}:${req.baseUrl}`;
const result = await atomicRateLimit(key, limit, windowMs);
// Always set rate limit headers
res.set('X-RateLimit-Limit', limit.toString());
res.set('X-RateLimit-Remaining', result.remaining.toString());
if (!result.allowed) {
res.set('Retry-After', Math.ceil((result.resetIn || 1000) / 1000).toString());
return res.status(429).json({
error: 'Too many requests',
retryAfter: result.resetIn,
});
}
next();
};
}
// Different limits for different endpoints
app.use('/api/search', rateLimitMiddleware(10, 1000)); // 10/sec
app.use('/api/data', rateLimitMiddleware(100, 60000)); // 100/min
app.use('/api/llm/generate', rateLimitMiddleware(1, 1000)); // 1/sec
Tiered Rate Limiting
Implement different rate limits for different client tiers. This enables monetization while protecting the system.
interface RateLimitTier {
name: string;
requestsPerHour: number;
burstLimit: number;
costMultiplier: number;
}
const TIERS: Record<string, RateLimitTier> = {
free: { name: 'free', requestsPerHour: 100, burstLimit: 10, costMultiplier: 1 },
pro: { name: 'pro', requestsPerHour: 10000, burstLimit: 100, costMultiplier: 0.5 },
enterprise: { name: 'enterprise', requestsPerHour: 100000, burstLimit: 1000, costMultiplier: 0.1 },
};
async function tieredRateLimit(req: express.Request, res: express.Response, next: express.NextFunction) {
const userTier = req.user?.tier || 'free';
const tier = TIERS[userTier];
if (!tier) {
return res.status(400).json({ error: 'Invalid tier' });
}
const key = `ratelimit:${req.user?.id || req.ip}`;
const result = await atomicRateLimit(key, tier.requestsPerHour, 3600000);
res.set('X-RateLimit-Limit', tier.requestsPerHour.toString());
res.set('X-RateLimit-Remaining', result.remaining.toString());
res.set('X-RateLimit-Tier', tier.name);
if (!result.allowed) {
res.set('Retry-After', Math.ceil((result.resetIn || 60000) / 1000).toString());
return res.status(429).json({
error: `Rate limit exceeded for ${tier.name} tier`,
limit: tier.requestsPerHour,
upgradeUrl: '/pricing',
});
}
next();
}
Cost-Based Rate Limiting
Instead of simple request counts, implement cost-based rate limiting where each API call has a cost based on compute, storage, or external service usage.
const ENDPOINT_COSTS: Record<string, number> = {
'/api/search': 1,
'/api/data/export': 10,
'/api/llm/generate': 50,
'/api/llm/analyze': 100,
};
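async function costBasedRateLimit(req: express.Request, res: express.Response, next: express.NextFunction) {
const cost = ENDPOINT_COSTS[req.path] || 1;
const budgetKey = `budget:${req.user?.id || req.ip}`;
const budget = 1000; // Cost units per hour
const windowMs = 3600000;
// Charge this request's cost against the budget. This sketch uses a simple
// fixed one-hour window via INCRBY rather than the sliding window above,
// and rejected requests are still charged.
const used = await redis.incrby(budgetKey, cost);
if (used === cost) {
// First charge in this window: start the one-hour expiry
await redis.pexpire(budgetKey, windowMs);
}
if (used > budget) {
return res.status(429).json({
error: 'Cost budget exceeded',
message: `Each request to ${req.path} costs ${cost} units`,
});
}
next();
}
Comparison of Rate Limiting Algorithms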
| Algorithm | Burst Handling | Smoothness | Memory | Complexity | Best For |
|---|---|---|---|---|---|
| Token Bucket | Allows bursts | Moderate | Low (2 values) | Low | General purpose, API gateways |
| Sliding Window | Smooth | High | Medium | Medium | Production APIs |
| Fixed Window | Boundary spikes | Low | Low (1 counter) | Low | Simple internal services |
| Leaky Bucket | No bursts | Very High | Medium (queue) | Medium | Traffic shaping |
| Sliding Window Log | Smooth | Very High | High (O(n)) | High | Precise limiting |
Best Practices for Production
1. Return Clear Rate Limit Headers
Always include standard rate limit headers in every response — not just when limits are hit. This lets clients implement proactive backoff.
// Header names from the IETF RateLimit header fields draft
res.set('RateLimit-Limit', '100'); // Max requests in window
res.set('RateLimit-Remaining', '42'); // Requests remaining
res.set('RateLimit-Reset', '30'); // Seconds until the window resets (a delta, not a Unix timestamp)
res.set('Retry-After', '30'); // Seconds to wait (on 429)
2. Use Redis for Distributed State
In-memory rate limiting doesn't work across multiple instances. Redis provides consistent, shared state with sub-millisecond latency. Use Redis Cluster for high availability.
3. Set Per-Endpoint Limits
Read endpoints can have higher limits than write endpoints. Expensive operations (LLM calls, data exports) should have lower limits.
4. Implement Graceful Degradation
Instead of hard 429 rejection, consider:
- Throttling: Slow responses instead of rejecting
- Cached responses: Serve stale data when approaching limits
- Queue and retry: Queue expensive requests and process them when capacity allows
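As one possible shape for the throttling option, the sketch below maps the remaining quota to an artificial delay. The helper name `throttleDelayMs`, the 20% soft threshold, and the 2000 ms ceiling are assumptions chosen for illustration, not recommended values.

```typescript
// Hypothetical helper: translate remaining quota into a response delay.
function throttleDelayMs(remaining: number, limit: number, maxDelayMs: number = 2000): number {
  const softThreshold = 0.2; // start slowing down in the last 20% of the quota (assumed value)
  const fractionLeft = Math.max(0, remaining) / limit;
  if (fractionLeft >= softThreshold) return 0;
  // Scale linearly from 0 ms at the threshold to maxDelayMs at zero remaining.
  return Math.round((1 - fractionLeft / softThreshold) * maxDelayMs);
}

console.log(throttleDelayMs(50, 100)); // 0
console.log(throttleDelayMs(10, 100)); // 1000
console.log(throttleDelayMs(0, 100));  // 2000
```

A middleware could wait for this delay before calling the next handler, so clients close to their limit slow down before they start receiving 429 responses.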
5. Monitor Rate Limit Events
Track which clients hit limits most often. This identifies abusive clients and legitimate users who might need higher limits.
// Log rate limit events for monitoring
function logRateLimitEvent(key: string, allowed: boolean, remaining: number) {
metrics.increment('rate_limit.total', { allowed: allowed.toString() });
if (!allowed) {
metrics.increment('rate_limit.rejected', { key });
logger.warn('Rate limit exceeded', { key, remaining });
}
}
6. Handle Clock Skew
In distributed systems, server clocks can drift. Use Redis server time instead of local time for consistent rate limiting across instances.
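The Redis TIME command returns the server's clock as a pair of Unix seconds and microseconds. A small conversion helper, sketched below, turns that reply into the millisecond timestamp used by the limiters in this guide. The function name `redisTimeToMs` is hypothetical, and the input is modelled as strings or numbers because client libraries differ in how they type the reply.

```typescript
// Convert a Redis TIME reply ([unix seconds, microseconds]) to epoch milliseconds.
function redisTimeToMs(reply: [string | number, string | number]): number {
  const seconds = Number(reply[0]);
  const micros = Number(reply[1]);
  return seconds * 1000 + Math.floor(micros / 1000);
}

console.log(redisTimeToMs(["1700000000", "123456"])); // 1700000000123
```

Every application instance that derives its timestamp this way shares one clock, at the cost of an extra round trip per check unless the time is read inside the Lua script itself.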
Common Pitfalls and Solutions
| Pitfall | Impact | Solution |
|---|---|---|
| Fixed window boundary spikes | 2x intended rate at boundaries | Use sliding window algorithm |
| In-memory rate limiting | Inconsistent across instances | Use Redis for distributed state |
| Missing rate limit headers | Clients can't implement backoff | Always include standard headers |
| Same limits for all endpoints | Expensive operations underprotected | Set per-endpoint limits |
| No monitoring | Can't identify abuse or capacity issues | Log all rate limit events |
| Hard rejection only | Poor user experience | Implement throttling and cached fallbacks |
| Clock skew in distributed systems | Inconsistent limiting | Use Redis server time, not client time |
| Race conditions | Limits exceeded under concurrency | Use Lua scripts for atomic operations |
Advanced Patterns
Adaptive Rate Limiting
Dynamically adjust rate limits based on system load. When CPU usage exceeds 80%, reduce rate limits by 50%. When error rates spike, throttle aggressively. This prevents cascading failures under extreme load.
async function adaptiveRateLimit(key: string, baseLimit: number): Promise<boolean> {
const cpuUsage = await getSystemCpuUsage();
const errorRate = await getRecentErrorRate();
let adjustedLimit = baseLimit;
// Reduce limits when system is under stress
if (cpuUsage > 0.8) adjustedLimit *= 0.5;
else if (cpuUsage > 0.6) adjustedLimit *= 0.75;
// Reduce limits when error rate is high
if (errorRate > 0.1) adjustedLimit *= 0.5;
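// atomicRateLimit resolves to an object, so return only the allowed flag;
// keep the limit at 1 or more so small base limits do not block all traffic
const result = await atomicRateLimit(key, Math.max(1, Math.floor(adjustedLimit)), 60000);
return result.allowed;
}
Geographic Rate Limiting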
Apply different rate limits based on client geography. Clients in regions closer to your data centers might get higher limits due to lower latency costs, while distant regions get lower limits to manage bandwidth.
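One simple way to express this is a per-region multiplier applied to a base limit. The sketch below is purely illustrative: the region names, multiplier values, and the `limitForRegion` helper are hypothetical, and how a request is mapped to a region is left out.

```typescript
// Hypothetical region multipliers; the values are illustrative only.
const REGION_MULTIPLIERS: Record<string, number> = {
  "us-east": 1.0,
  "eu-west": 1.0,
  "ap-south": 0.5,
};
const DEFAULT_MULTIPLIER = 0.25; // unknown regions get the most conservative limit

function limitForRegion(baseLimit: number, region: string | undefined): number {
  const configured = region !== undefined ? REGION_MULTIPLIERS[region] : undefined;
  const multiplier = configured ?? DEFAULT_MULTIPLIER;
  // Never drop below one request per window.
  return Math.max(1, Math.floor(baseLimit * multiplier));
}

console.log(limitForRegion(100, "us-east"));  // 100
console.log(limitForRegion(100, "ap-south")); // 50
console.log(limitForRegion(100, undefined));  // 25
```

The resulting number can be passed as the limit argument to whichever limiter is in use.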
Rate Limit Budgets with Refill
Implement a credit-based system where clients receive a budget that refills over time. This is more flexible than simple windowed limits and maps naturally to API pricing models.
class RateLimitBudget {
constructor(
private redis: Redis,
private refillRate: number, // tokens per second
private maxBudget: number
) {}
async consume(key: string, cost: number): Promise<{ allowed: boolean; remaining: number }> {
const script = `
local key = KEYS[1]
local cost = tonumber(ARGV[1])
local refill_rate = tonumber(ARGV[2])
local max_budget = tonumber(ARGV[3])
local now = tonumber(ARGV[4])
local data = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or max_budget
local last_refill = tonumber(data[2]) or now
-- Refill tokens based on elapsed time
local elapsed = now - last_refill
tokens = math.min(max_budget, tokens + elapsed * refill_rate)
if tokens >= cost then
tokens = tokens - cost
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, 3600)
return { 1, tokens }
else
redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', key, 3600)
return { 0, tokens }
end
`;
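// Redis converts Lua numbers to integers in replies, so remaining is a whole number
const result = (await this.redis.eval(script, 1, key, cost, this.refillRate, this.maxBudget, Date.now() / 1000)) as [number, number];
return { allowed: result[0] === 1, remaining: result[1] };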
}
}
Client-Side Rate Limiting
Server-side rate limiting protects your infrastructure, but client-side rate limiting improves user experience by preventing requests that will be rejected. Implement a client-side rate limiter that tracks request counts and queues or drops requests before they hit the network.
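class ClientRateLimiter {
private queue: Array<{ resolve: Function; reject: Function }> = [];
private tokens: number;
private lastRefill: number;
private timer: ReturnType<typeof setTimeout> | null = null;
constructor(
private maxTokens: number,
private refillRate: number // tokens per second
) {
this.tokens = maxTokens;
this.lastRefill = Date.now();
}
async acquire(): Promise<void> {
this.refill();
if (this.tokens >= 1) {
this.tokens -= 1;
return;
}
// Queue the request until tokens are available
return new Promise((resolve, reject) => {
this.queue.push({ resolve, reject });
this.scheduleRefill();
});
}
private refill(): void {
const now = Date.now();
const elapsed = (now - this.lastRefill) / 1000;
this.tokens = Math.min(this.maxTokens, this.tokens + elapsed * this.refillRate);
this.lastRefill = now;
// Process queued requests in arrival order
while (this.queue.length > 0 && this.tokens >= 1) {
this.tokens -= 1;
const { resolve } = this.queue.shift()!;
resolve();
}
}
// Wake up when the next token is due, so queued requests are released
// even if no further acquire() calls arrive
private scheduleRefill(): void {
if (this.timer !== null || this.queue.length === 0) return;
const waitMs = Math.max(1, Math.ceil(((1 - this.tokens) / this.refillRate) * 1000));
this.timer = setTimeout(() => {
this.timer = null;
this.refill();
this.scheduleRefill();
}, waitMs);
}
}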
// Usage
const limiter = new ClientRateLimiter(100, 10); // 100 burst, 10/sec sustained
async function callAPI(endpoint: string) {
await limiter.acquire();
return fetch(endpoint);
}
Conclusion
Rate limiting is a fundamental API protection mechanism that ensures reliability, fairness, and security. The choice of algorithm and implementation strategy significantly impacts both system protection and user experience.
Key takeaways:
- Use token bucket for general-purpose rate limiting — it handles bursts gracefully and is memory efficient
- Use sliding window when you need smooth, predictable limiting without boundary spikes
- Implement distributed rate limiting with Redis for multi-instance deployments using Lua scripts for atomicity
- Return clear rate limit headers (RateLimit-*, Retry-After) in every response
- Implement tiered rate limits aligned with your business model
- Set per-endpoint limits based on operation cost and resource usage
- Monitor rate limit events to identify abuse and capacity issues
- Provide graceful degradation (throttling, cached responses) instead of hard rejection
- Use cost-based limiting for APIs with heterogeneous endpoint costs
- Consider adaptive rate limiting that adjusts based on system health metrics
Start by implementing basic rate limiting at the API gateway level with a simple sliding window algorithm. Once working, add tiered limits, per-endpoint configuration, cost-based budgets, and monitoring. The investment in rate limiting pays dividends in system reliability, cost control, and fair resource allocation.