API Rate Limiting: Algorithms and Implementation

Introduction

Rate limiting is one of the most critical mechanisms for protecting APIs from abuse, ensuring fair usage, and maintaining service reliability under load. Without rate limiting, a single misbehaving client can overwhelm your backend services, degrading performance for all users. Rate limiting controls the number of requests a client can make within a given time window, rejecting or throttling excess requests. It's the foundation of API security, cost management, and quality of service.

Every major API provider, from Stripe and GitHub to AWS and Google Cloud, implements rate limiting as a core infrastructure component. Stripe limits to 100 requests per second in live mode, GitHub allows 5,000 requests per hour for authenticated users, and AWS API Gateway applies a default account-level throttle of 10,000 requests per second plus a separate burst allowance. These limits aren't arbitrary: they're calibrated to protect infrastructure while enabling legitimate use cases.

The choice of rate limiting algorithm significantly impacts the user experience and system behavior. Token bucket allows burst traffic while maintaining average rates. Sliding window provides smooth, consistent limiting. Fixed window is simple but can allow traffic spikes at window boundaries. Understanding these algorithms and their trade-offs is essential for implementing rate limiting that protects your system without unnecessarily restricting legitimate users.

Modern APIs use distributed rate limiting — coordinating limits across multiple server instances, data centers, and regions. This requires shared state (typically in Redis) and careful handling of network partitions and clock skew. This guide covers the algorithms, implementation patterns, and production considerations for building robust rate limiting systems.

Why Rate Limiting Matters

Protection Against Abuse

Without rate limiting, a single client can send thousands of requests per second, consuming CPU, memory, database connections, and network bandwidth. This degrades service for all users. Rate limiting acts as a safeguard, preventing any single client from monopolizing shared resources.

Cost Control

For APIs that call expensive downstream services — LLMs like GPT-4, payment processors like Stripe, or external databases — each request has a real cost. A runaway script making 10,000 API calls to an LLM endpoint could cost hundreds of dollars in minutes. Rate limiting caps this exposure.

Fair Resource Allocation

In multi-tenant SaaS platforms, rate limiting ensures that no single tenant can monopolize shared infrastructure. A free-tier user shouldn't be able to consume the same resources as an enterprise customer paying 100x more.

DDoS Mitigation

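Rate limiting is an early line of defense against distributed denial-of-service attacks. It won't stop a sophisticated DDoS attack on its own, since traffic spread across many sources can stay under per-client limits and network-level floods have to be absorbed upstream, but it does keep simple request floods from a handful of sources from overwhelming your application.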

Rate Limiting Algorithms Deep Dive

The Token Bucket Algorithm

The token bucket algorithm is the most widely used rate limiting algorithm in production systems. It works by maintaining a bucket that holds tokens. Tokens are added to the bucket at a fixed rate (the refill rate). Each incoming request must consume one or more tokens to proceed. If the bucket is empty, the request is rejected or queued.

The key parameters are capacity (maximum burst size) and refill rate (tokens added per second). A capacity of 10 with a refill rate of 2 tokens/second allows bursts of 10 requests while maintaining an average of 2 requests per second over time.
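As a rough check on those numbers: the most a token bucket can admit over any interval is the burst capacity plus whatever refills during that interval. The helper below is only an illustrative sketch. The name maxAdmitted is hypothetical, and it assumes every request costs one token.

```typescript
// Upper bound on requests a token bucket admits in any interval of
// `seconds`, assuming each request costs one token. Hypothetical helper.
function maxAdmitted(capacity: number, refillRate: number, seconds: number): number {
  return Math.floor(capacity + refillRate * seconds);
}

const perSecond = maxAdmitted(10, 2, 1);  // 12: a full burst plus one second of refill
const perMinute = maxAdmitted(10, 2, 60); // 130: the burst matters less over long intervals
```

Over one second the burst dominates (12 requests against a nominal 2), while over a minute the total stays close to the sustained rate.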

How it works step by step:

A bucket starts with capacity tokens
Every second, refillRate tokens are added (up to the capacity)
When a request arrives, check if enough tokens exist
If yes: subtract tokens, allow request
If no: reject request (or queue it)

Advantages:

Allows burst traffic up to the bucket capacity
Smooth average rate over time
Memory efficient — only two values to track per client
Simple to implement and reason about

Disadvantages:

Burst behavior can surprise backends not designed for it
Doesn't provide precise per-second limiting

Real-world usage: Amazon API Gateway documents its throttling as a token bucket, and Stripe has described its request rate limiter as token bucket based. Many load balancers and proxies offer token-bucket-style limits as well.

class TokenBucket {
  private tokens: number;
  private lastRefill: number;
 
  constructor(
    private capacity: number,
    private refillRate: number // tokens per second
  ) {
    this.tokens = capacity;
    this.lastRefill = Date.now();
  }
 
  consume(tokens: number = 1): { allowed: boolean; remaining: number; resetIn: number } {
    this.refill();
 
    if (this.tokens >= tokens) {
      this.tokens -= tokens;
      return {
        allowed: true,
        remaining: Math.floor(this.tokens),
        resetIn: 0,
      };
    }
 
    const waitTime = ((tokens - this.tokens) / this.refillRate) * 1000;
    return {
      allowed: false,
      remaining: 0,
      resetIn: Math.ceil(waitTime),
    };
  }
 
  private refill(): void {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.refillRate);
    this.lastRefill = now;
  }
}
 
// Usage
const bucket = new TokenBucket(10, 2); // 10 burst, 2/sec sustained
const result = bucket.consume(1);
console.log(result); // { allowed: true, remaining: 9, resetIn: 0 }

The Sliding Window Algorithm

The sliding window algorithm tracks requests within a rolling time window. Instead of resetting counters at fixed intervals, it considers the current time and counts all requests within the preceding window period. This eliminates the boundary spike problem of fixed window algorithms.

Two variants exist:

Sliding Window Counter: Combines the current partial window with a weighted portion of the previous window. If you're 30% into the current minute, you count 70% of the previous minute's count plus 100% of the current minute's count.
Sliding Window Log: Stores a timestamp for each request in a sorted set. Counts requests within the window by querying the sorted set. More precise but uses more memory.

The weighted formula:

effectiveCount = (previousWindowCount × overlapPercentage) + currentWindowCount

For example, with a 1-minute window and limit of 100:

Previous minute: 80 requests
Current minute: 20 requests, 30 seconds elapsed (50% overlap)
Effective count: (80 × 0.5) + 20 = 60 requests

Advantages:

No boundary spike problem
Smooth, predictable behavior
Memory efficient (counter variant)
Easy to understand and debug

Disadvantages:

Slightly more complex than fixed window
Approximate in the counter variant
Log variant uses O(n) memory per client

function slidingWindowCounter(
  previousCount: number,
  currentCount: number,
  windowMs: number,
  elapsedMs: number
): number {
  const overlapPercent = 1 - (elapsedMs / windowMs);
  return Math.floor(previousCount * overlapPercent) + currentCount;
}
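The function above covers the counter variant. For comparison, the log variant can be sketched in memory for a single client, assuming timestamps in milliseconds. The class name SlidingWindowLog and the injectable `now` parameter are illustrative choices, not part of any library.

```typescript
// In-memory sliding window log for a single client. Sketch only: a real
// deployment would keep one log per key and share it across instances.
class SlidingWindowLog {
  private timestamps: number[] = [];

  constructor(private limit: number, private windowMs: number) {}

  // `now` is injectable so the behaviour can be checked without real clocks.
  isAllowed(now: number = Date.now()): boolean {
    const windowStart = now - this.windowMs;
    // Drop timestamps that have fallen out of the window (oldest first).
    while (this.timestamps.length > 0 && this.timestamps[0] <= windowStart) {
      this.timestamps.shift();
    }
    if (this.timestamps.length >= this.limit) return false;
    this.timestamps.push(now);
    return true;
  }
}

const log = new SlidingWindowLog(2, 1000); // 2 requests per rolling second
const decisions = [log.isAllowed(0), log.isAllowed(400), log.isAllowed(900), log.isAllowed(1000)];
// decisions: [true, true, false, true]
```

The third request is rejected because two requests already sit inside the rolling second. The fourth is admitted because the request at t=0 has just left the window. Memory grows with the limit, which is the O(n) cost noted in the disadvantages.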

The Fixed Window Algorithm

The simplest approach: count requests in fixed time intervals (per second, per minute, per hour). When the count exceeds the limit, reject requests until the window resets. Simple to implement but allows up to 2x the intended rate at window boundaries.

The boundary problem:

Imagine a limit of 100 requests per minute. A client sends 100 requests at 11:00:59 (end of window) and 100 more at 11:01:00 (start of next window). That's 200 requests in about one second, and double the intended limit for any one-minute span that straddles the boundary.
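The boundary effect is easy to reproduce with a small simulation. This sketch assumes windows aligned to multiples of the window length and request times given in milliseconds. The function name admittedByFixedWindow is made up for the example.

```typescript
// Counts how many requests a fixed-window counter admits, given request
// times in milliseconds. Windows are aligned to multiples of `windowMs`.
function admittedByFixedWindow(times: number[], limit: number, windowMs: number): number {
  const counts = new Map<number, number>();
  let admitted = 0;
  for (const t of times) {
    const windowId = Math.floor(t / windowMs);
    const used = counts.get(windowId) ?? 0;
    if (used < limit) {
      counts.set(windowId, used + 1);
      admitted++;
    }
  }
  return admitted;
}

// 100 requests in the last second of one minute, 100 in the first second of the next.
const burst = [
  ...Array.from({ length: 100 }, () => 59000),
  ...Array.from({ length: 100 }, () => 60000),
];
const admittedAcrossBoundary = admittedByFixedWindow(burst, 100, 60000); // 200
```

All 200 requests pass even though they arrive one second apart, because each half lands in a different window.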

When to use: Simple internal services, rate limiting non-critical endpoints, or when you need the simplest possible implementation and can tolerate boundary spikes.

class FixedWindowRateLimiter {
  private windows: Map<string, { count: number; resetAt: number }> = new Map();
 
  isAllowed(key: string, limit: number, windowMs: number): boolean {
    const now = Date.now();
    const window = this.windows.get(key);
 
    if (!window || now >= window.resetAt) {
      this.windows.set(key, { count: 1, resetAt: now + windowMs });
      return true;
    }
 
    if (window.count >= limit) {
      return false;
    }
 
    window.count++;
    return true;
  }
}

The Leaky Bucket Algorithm

Requests enter a queue (bucket) and are processed at a fixed rate. If the queue is full, requests are rejected. This produces perfectly smooth output traffic regardless of input burstiness, but adds latency as requests wait in the queue.

Key difference from token bucket: Leaky bucket shapes outgoing traffic to a fixed rate, while token bucket allows bursts. Leaky bucket is a traffic shaper; token bucket is a rate limiter.

Use cases: Network traffic shaping, API gateways that need to protect fragile backends that can't handle any burst traffic, and systems where consistent throughput matters more than latency.
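To make the queue-and-drain behaviour concrete, here is a minimal in-memory sketch. It assumes the caller drives the bucket by calling leak() periodically (for example from a timer) and passes the current time in milliseconds. The class and method names are illustrative.

```typescript
// Leaky bucket as a bounded queue drained at a fixed rate. Sketch only:
// `now` is passed in (milliseconds) so the example needs no timers.
class LeakyBucket<T> {
  private queue: T[] = [];
  private lastLeak = 0;

  constructor(
    private capacity: number, // maximum queued requests
    private leakRate: number  // requests released per second
  ) {}

  // Returns false when the bucket is full and the request is rejected.
  offer(item: T, now: number = Date.now()): boolean {
    if (this.queue.length >= this.capacity) return false;
    // An idle bucket earns no credit: restart the leak clock on first arrival.
    if (this.queue.length === 0) this.lastLeak = now;
    this.queue.push(item);
    return true;
  }

  // Releases the requests that are due at `now`, oldest first.
  leak(now: number = Date.now()): T[] {
    const due = Math.floor(((now - this.lastLeak) / 1000) * this.leakRate);
    if (due <= 0) return [];
    // Advance the clock only by the time covered by released slots, so
    // fractional progress toward the next release is not lost.
    this.lastLeak += (due / this.leakRate) * 1000;
    return this.queue.splice(0, due);
  }
}

const leaky = new LeakyBucket<string>(3, 2); // hold 3, release 2 per second
const accepted = ['a', 'b', 'c', 'd'].map((r) => leaky.offer(r, 0)); // [true, true, true, false]
const firstSecond = leaky.leak(1000); // ['a', 'b']
const nextHalf = leaky.leak(1500);    // ['c']
```

Requests leave at a steady two per second regardless of how they arrived, and the fourth request is rejected outright because the queue is full. The queued requests pay for that smoothness in added latency.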

Distributed Rate Limiting with Redis

In production, your API typically runs on multiple servers. In-memory rate limiting on each server is inconsistent: if you have 10 servers each allowing 100 requests/minute, a client can make up to 1,000 requests/minute when its traffic is spread across them. You need shared state.
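The arithmetic can be checked with a small simulation. It assumes round-robin load balancing and a private counter per instance. The function name admittedAcrossInstances is hypothetical.

```typescript
// Each instance enforces the limit against its own private counter.
function admittedAcrossInstances(
  instances: number,
  limitPerInstance: number,
  requests: number
): number {
  const counters = new Array<number>(instances).fill(0);
  let admitted = 0;
  for (let i = 0; i < requests; i++) {
    const target = i % instances; // round-robin load balancing (assumed)
    if (counters[target] < limitPerInstance) {
      counters[target]++;
      admitted++;
    }
  }
  return admitted;
}

const admittedInOneMinute = admittedAcrossInstances(10, 100, 5000); // 1000, not 100
```

With a single shared counter the same client would be capped at 100.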

Why Redis?

Redis is the de facto standard for distributed rate limiting because:

Sub-millisecond latency: Rate limit checks add <1ms to each request
Atomic operations: Lua scripts ensure check-and-increment happen atomically
Built-in data structures: Sorted sets are perfect for sliding window logs
TTL support: Keys automatically expire, preventing memory leaks
Cluster support: Redis Cluster provides high availability for production

Sliding Window with Redis Sorted Sets

import Redis from 'ioredis';
 
const redis = new Redis(process.env.REDIS_URL);
 
async function slidingWindowRateLimit(
  key: string,
  limit: number,
  windowMs: number
): Promise<{ allowed: boolean; remaining: number; resetIn: number }> {
  const now = Date.now();
  const windowStart = now - windowMs;
  // Generate the member once so the same value can be removed on rejection
  const member = `${now}:${Math.random()}`;

  const pipeline = redis.pipeline();

  // Remove expired entries outside the window
  pipeline.zremrangebyscore(key, 0, windowStart);

  // Count requests still in the window
  pipeline.zcard(key);

  // Add current request timestamp as a unique member
  pipeline.zadd(key, now.toString(), member);

  // Auto-expire the key to prevent memory leaks
  pipeline.expire(key, Math.ceil(windowMs / 1000));

  const results = await pipeline.exec();
  const count = results![1][1] as number;

  if (count >= limit) {
    // Remove the request we just added since it's over limit
    await redis.zrem(key, member);

    // Find when the oldest request in the window expires
    const oldest = await redis.zrange(key, 0, 0, 'WITHSCORES');
    const resetIn = oldest.length >= 2
      ? parseInt(oldest[1], 10) + windowMs - now
      : windowMs;

    return { allowed: false, remaining: 0, resetIn: Math.ceil(resetIn) };
  }
 
  return { allowed: true, remaining: limit - count - 1, resetIn: 0 };
}

Atomic Rate Limiting with Lua Scripts

For production systems, use Lua scripts to make the rate limit check atomic. This prevents race conditions where two concurrent requests both check the count, see it's under the limit, and both increment — exceeding the limit.

-- Lua script for atomic sliding window rate limiting
-- KEYS[1] = rate limit key
-- ARGV[1] = window size in milliseconds
-- ARGV[2] = max requests allowed
-- ARGV[3] = current timestamp in milliseconds
-- ARGV[4] = (optional) unique request ID supplied by the caller

local key = KEYS[1]
local window = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local windowStart = now - window

-- Remove expired entries
redis.call('ZREMRANGEBYSCORE', key, 0, windowStart)

-- Count current entries
local count = redis.call('ZCARD', key)

if count < limit then
  -- Members must be unique per request. Older Redis versions seed
  -- math.random identically on every script run, so two requests in the
  -- same millisecond could collide; prefer a caller-supplied ID if given.
  local member = ARGV[4] or (now .. ':' .. math.random())
  redis.call('ZADD', key, now, member)
  redis.call('PEXPIRE', key, window)
  return {1, limit - count - 1}
else
  -- Get oldest entry for reset time
  local oldest = redis.call('ZRANGE', key, 0, 0, 'WITHSCORES')
  local resetIn = window
  if #oldest >= 2 then
    resetIn = tonumber(oldest[2]) + window - now
  end
  return {0, 0, resetIn}
end

// Load and execute the Lua script
const RATE_LIMIT_SCRIPT = `
  -- (Lua script from above)
`;
 
async function atomicRateLimit(
  key: string,
  limit: number,
  windowMs: number
): Promise<{ allowed: boolean; remaining: number; resetIn?: number }> {
  const now = Date.now();
  const result = await redis.eval(
    RATE_LIMIT_SCRIPT,
    1,
    key,
    windowMs.toString(),
    limit.toString(),
    now.toString()
  ) as number[];
 
  return {
    allowed: result[0] === 1,
    remaining: result[1],
    resetIn: result[2] || undefined,
  };
}

Architecture Patterns

Gateway-Level Rate Limiting

Implement rate limiting at the API gateway level (Kong, AWS API Gateway, NGINX, Envoy). This protects all backend services uniformly and centralizes rate limit configuration.

# NGINX rate limiting configuration
http {
    # Define rate limit zones
    limit_req_zone $binary_remote_addr zone=api_general:10m rate=10r/s;
    limit_req_zone $binary_remote_addr zone=api_expensive:10m rate=1r/s;
    limit_req_zone $http_api_key zone=api_by_key:10m rate=100r/s;
 
    server {
        location /api/ {
            # General API: 10 req/sec with burst of 20
            limit_req zone=api_general burst=20 nodelay;
            limit_req_status 429;
 
            # Add rate limit headers
            add_header X-RateLimit-Limit 10 always;
        }
 
        location /api/llm/ {
            # Expensive endpoints: 1 req/sec with burst of 5
            limit_req zone=api_expensive burst=5 nodelay;
            # Without this, rejected requests get the default 503
            limit_req_status 429;
        }
    }
}

Kong Gateway rate limiting:

# Kong rate limiting plugin configuration
plugins:
  - name: rate-limiting
    config:
      minute: 100
      hour: 5000
      policy: redis
      redis:
        host: redis-cluster.internal
        port: 6379
      hide_client_headers: false
      fault_tolerant: true

Application-Level Rate Limiting

Implement rate limiting within your application code for finer-grained control based on business logic.

import express from 'express';
 
const app = express();
 
// Rate limit middleware with tiered limits
function rateLimitMiddleware(limit: number, windowMs: number) {
  return async (req: express.Request, res: express.Response, next: express.NextFunction) => {
    const key = `ratelimit:${req.user?.id || req.ip}:${req.path}`;
    const result = await atomicRateLimit(key, limit, windowMs);
 
    // Always set rate limit headers
    res.set('X-RateLimit-Limit', limit.toString());
    res.set('X-RateLimit-Remaining', result.remaining.toString());
 
    if (!result.allowed) {
      res.set('Retry-After', Math.ceil((result.resetIn || 1000) / 1000).toString());
      return res.status(429).json({
        error: 'Too many requests',
        retryAfter: result.resetIn,
      });
    }
 
    next();
  };
}
 
// Different limits for different endpoints
app.use('/api/search', rateLimitMiddleware(10, 1000));      // 10/sec
app.use('/api/data', rateLimitMiddleware(100, 60000));       // 100/min
app.use('/api/llm/generate', rateLimitMiddleware(1, 1000));  // 1/sec

Tiered Rate Limiting

Implement different rate limits for different client tiers. This enables monetization while protecting the system.

interface RateLimitTier {
  name: string;
  requestsPerHour: number;
  burstLimit: number;
  costMultiplier: number;
}
 
const TIERS: Record<string, RateLimitTier> = {
  free: { name: 'free', requestsPerHour: 100, burstLimit: 10, costMultiplier: 1 },
  pro: { name: 'pro', requestsPerHour: 10000, burstLimit: 100, costMultiplier: 0.5 },
  enterprise: { name: 'enterprise', requestsPerHour: 100000, burstLimit: 1000, costMultiplier: 0.1 },
};
 
async function tieredRateLimit(req: express.Request, res: express.Response, next: express.NextFunction) {
  const userTier = req.user?.tier || 'free';
  const tier = TIERS[userTier];
 
  if (!tier) {
    return res.status(400).json({ error: 'Invalid tier' });
  }
 
  const key = `ratelimit:${req.user?.id || req.ip}`;
  const result = await atomicRateLimit(key, tier.requestsPerHour, 3600000);
 
  res.set('X-RateLimit-Limit', tier.requestsPerHour.toString());
  res.set('X-RateLimit-Remaining', result.remaining.toString());
  res.set('X-RateLimit-Tier', tier.name);
 
  if (!result.allowed) {
    res.set('Retry-After', Math.ceil((result.resetIn || 60000) / 1000).toString());
    return res.status(429).json({
      error: `Rate limit exceeded for ${tier.name} tier`,
      limit: tier.requestsPerHour,
      upgradeUrl: '/pricing',
    });
  }
 
  next();
}

Cost-Based Rate Limiting

Instead of simple request counts, implement cost-based rate limiting where each API call has a cost based on compute, storage, or external service usage.

const ENDPOINT_COSTS: Record<string, number> = {
  '/api/search': 1,
  '/api/data/export': 10,
  '/api/llm/generate': 50,
  '/api/llm/analyze': 100,
};
 
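// Uses the ioredis client `redis` created earlier.
async function costBasedRateLimit(req: express.Request, res: express.Response, next: express.NextFunction) {
  const cost = ENDPOINT_COSTS[req.path] || 1;
  const budget = 1000; // Cost units per hour
  const windowSec = 3600;

  // Fixed one-hour window: the window number is part of the key
  const windowId = Math.floor(Date.now() / (windowSec * 1000));
  const budgetKey = `budget:${req.user?.id || req.ip}:${windowId}`;

  // Charge the endpoint's cost, not a flat 1 per request. INCRBY is atomic.
  const spent = await redis.incrby(budgetKey, cost);
  if (spent === cost) {
    await redis.expire(budgetKey, windowSec); // first charge in this window
  }

  if (spent > budget) {
    await redis.decrby(budgetKey, cost); // refund the rejected request
    return res.status(429).json({
      error: 'Cost budget exceeded',
      message: `Each request to ${req.path} costs ${cost} units`,
    });
  }

  next();
}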

Comparison of Rate Limiting Algorithms

Algorithm	Burst Handling	Smoothness	Memory	Complexity	Best For
Token Bucket	Allows bursts	Moderate	Low (2 values)	Low	General purpose, API gateways
Sliding Window	Smooth	High	Medium	Medium	Production APIs
Fixed Window	Boundary spikes	Low	Low (1 counter)	Low	Simple internal services
Leaky Bucket	No bursts	Very High	Medium (queue)	Medium	Traffic shaping
Sliding Window Log	Smooth	Very High	High (O(n))	High	Precise limiting

Best Practices for Production

1. Return Clear Rate Limit Headers

Always include standard rate limit headers in every response — not just when limits are hit. This lets clients implement proactive backoff.

// Header names from the IETF RateLimit header fields draft
res.set('RateLimit-Limit', '100');           // Max requests in window
res.set('RateLimit-Remaining', '42');         // Requests remaining
res.set('RateLimit-Reset', '30');             // Seconds until the window resets (a delay, not a Unix timestamp)
res.set('Retry-After', '30');                 // Seconds to wait (on 429)

2. Use Redis for Distributed State

In-memory rate limiting doesn't work across multiple instances. Redis provides consistent, shared state with sub-millisecond latency. Use Redis Cluster for high availability.

3. Set Per-Endpoint Limits

Read endpoints can have higher limits than write endpoints. Expensive operations (LLM calls, data exports) should have lower limits.

4. Implement Graceful Degradation

Instead of hard 429 rejection, consider:

Throttling: Slow responses instead of rejecting
Cached responses: Serve stale data when approaching limits
Queue and retry: Queue expensive requests and process them when capacity allows

5. Monitor Rate Limit Events

Track which clients hit limits most often. This identifies abusive clients and legitimate users who might need higher limits.

// Log rate limit events for monitoring
// Assumes `metrics` and `logger` are your application's existing clients
function logRateLimitEvent(key: string, allowed: boolean, remaining: number) {
  metrics.increment('rate_limit.total', { allowed: allowed.toString() });
  if (!allowed) {
    metrics.increment('rate_limit.rejected', { key });
    logger.warn('Rate limit exceeded', { key, remaining });
  }
}

6. Handle Clock Skew

In distributed systems, server clocks can drift. Use Redis server time instead of local time for consistent rate limiting across instances.
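One way to do this is to read the clock from Redis with the TIME command, which replies with a pair of Unix seconds and microseconds. The conversion helper below is a sketch. The name redisTimeToMs and the RedisTimeReply type are illustrative, and the helper accepts strings or numbers because clients differ in how they surface the reply.

```typescript
// Redis TIME replies with [unixSeconds, microseconds]. Depending on the
// client, the elements may arrive as strings or numbers, so accept both.
type RedisTimeReply = [string | number, string | number];

function redisTimeToMs(reply: RedisTimeReply): number {
  const seconds = Number(reply[0]);
  const micros = Number(reply[1]);
  return seconds * 1000 + Math.floor(micros / 1000);
}

const nowMs = redisTimeToMs(['1640000000', '123456']); // 1640000000123
```

With ioredis the reply would come from `await redis.time()`. On recent Redis versions a Lua script can also call `redis.call('TIME')` itself, which avoids the extra round trip and keeps the timestamp inside the atomic section.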

Common Pitfalls and Solutions

Pitfall	Impact	Solution
Fixed window boundary spikes	2x intended rate at boundaries	Use sliding window algorithm
In-memory rate limiting	Inconsistent across instances	Use Redis for distributed state
Missing rate limit headers	Clients can't implement backoff	Always include standard headers
Same limits for all endpoints	Expensive operations underprotected	Set per-endpoint limits
No monitoring	Can't identify abuse or capacity issues	Log all rate limit events
Hard rejection only	Poor user experience	Implement throttling and cached fallbacks
Clock skew in distributed systems	Inconsistent limiting	Use Redis server time, not client time
Race conditions	Limits exceeded under concurrency	Use Lua scripts for atomic operations

Advanced Patterns

Adaptive Rate Limiting

Dynamically adjust rate limits based on system load. When CPU usage exceeds 80%, reduce rate limits by 50%. When error rates spike, throttle aggressively. This prevents cascading failures under extreme load.

async function adaptiveRateLimit(key: string, baseLimit: number): Promise<boolean> {
  const cpuUsage = await getSystemCpuUsage();
  const errorRate = await getRecentErrorRate();
 
  let adjustedLimit = baseLimit;
 
  // Reduce limits when system is under stress
  if (cpuUsage > 0.8) adjustedLimit *= 0.5;
  else if (cpuUsage > 0.6) adjustedLimit *= 0.75;
 
  // Reduce limits when error rate is high
  if (errorRate > 0.1) adjustedLimit *= 0.5;
 
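  // Never drop below one request per window, and unwrap the limiter result
  const result = await atomicRateLimit(key, Math.max(1, Math.floor(adjustedLimit)), 60000);
  return result.allowed;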
}

Geographic Rate Limiting

Apply different rate limits based on client geography. Clients in regions closer to your data centers might get higher limits due to lower latency costs, while distant regions get lower limits to manage bandwidth.
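A simple way to express this is a per-region multiplier applied to a base limit. The sketch below is only illustrative: the region codes, multiplier values, and the limitForRegion helper are all hypothetical. It also assumes the region has already been resolved elsewhere, for example from a CDN header or a GeoIP lookup.

```typescript
// Hypothetical per-region multipliers; the region codes and values are
// illustrative and would come from your own capacity planning.
const REGION_MULTIPLIERS = new Map<string, number>([
  ['us-east', 1.0],
  ['eu-west', 1.0],
  ['ap-south', 0.5],
]);

// Unknown or missing regions get the most conservative multiplier.
const FALLBACK_MULTIPLIER = 0.5;

function limitForRegion(baseLimit: number, region: string | undefined): number {
  const multiplier =
    (region !== undefined ? REGION_MULTIPLIERS.get(region) : undefined) ?? FALLBACK_MULTIPLIER;
  // Keep at least one request per window so no region is locked out entirely.
  return Math.max(1, Math.floor(baseLimit * multiplier));
}

const nearLimit = limitForRegion(100, 'us-east');   // 100
const farLimit = limitForRegion(100, 'ap-south');   // 50
const unknownLimit = limitForRegion(100, undefined); // 50
```

The resulting number can be passed as the limit argument to whichever limiter you already use.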

Rate Limit Budgets with Refill

Implement a credit-based system where clients receive a budget that refills over time. This is more flexible than simple windowed limits and maps naturally to API pricing models.

class RateLimitBudget {
  constructor(
    private redis: Redis,
    private refillRate: number, // tokens per second
    private maxBudget: number
  ) {}
 
  async consume(key: string, cost: number): Promise<{ allowed: boolean; remaining: number }> {
    const script = `
      local key = KEYS[1]
      local cost = tonumber(ARGV[1])
      local refill_rate = tonumber(ARGV[2])
      local max_budget = tonumber(ARGV[3])
      local now = tonumber(ARGV[4])
 
      local data = redis.call('HMGET', key, 'tokens', 'last_refill')
      local tokens = tonumber(data[1]) or max_budget
      local last_refill = tonumber(data[2]) or now
 
      -- Refill tokens based on elapsed time
      local elapsed = now - last_refill
      tokens = math.min(max_budget, tokens + elapsed * refill_rate)
 
      if tokens >= cost then
        tokens = tokens - cost
        redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
        redis.call('EXPIRE', key, 3600)
        return { 1, tokens }
      else
        redis.call('HMSET', key, 'tokens', tokens, 'last_refill', now)
        redis.call('EXPIRE', key, 3600)
        return { 0, tokens }
      end
    `;
 
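    // `now` is sent in seconds to match refill_rate (tokens per second).
    // Redis truncates Lua numbers to integers in replies, so `remaining` is whole.
    const result = (await this.redis.eval(
      script, 1, key, cost, this.refillRate, this.maxBudget, Date.now() / 1000
    )) as [number, number];
    return { allowed: result[0] === 1, remaining: result[1] };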
  }
}

Client-Side Rate Limiting

Server-side rate limiting protects your infrastructure, but client-side rate limiting improves user experience by preventing requests that will be rejected. Implement a client-side rate limiter that tracks request counts and queues or drops requests before they hit the network.

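class ClientRateLimiter {
  private queue: Array<{ resolve: () => void; reject: (reason?: unknown) => void }> = [];
  private tokens: number;
  private lastRefill: number;
  private timer?: ReturnType<typeof setTimeout>;

  constructor(
    private maxTokens: number,
    private refillRate: number // tokens per second
  ) {
    this.tokens = maxTokens;
    this.lastRefill = Date.now();
  }

  async acquire(): Promise<void> {
    this.refill();

    if (this.tokens >= 1) {
      this.tokens -= 1;
      return;
    }

    // Queue the request until tokens are available
    return new Promise<void>((resolve, reject) => {
      this.queue.push({ resolve, reject });
      this.scheduleDrain();
    });
  }

  // Without a timer, queued callers would only be released by a later acquire()
  private scheduleDrain(): void {
    if (this.timer !== undefined) return;
    const waitMs = Math.max(1, Math.ceil(((1 - this.tokens) / this.refillRate) * 1000));
    this.timer = setTimeout(() => {
      this.timer = undefined;
      this.refill();
      if (this.queue.length > 0) this.scheduleDrain();
    }, waitMs);
  }

  private refill(): void {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.maxTokens, this.tokens + elapsed * this.refillRate);
    this.lastRefill = now;

    // Process queued requests
    while (this.queue.length > 0 && this.tokens >= 1) {
      this.tokens -= 1;
      const { resolve } = this.queue.shift()!;
      resolve();
    }
  }
}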
 
// Usage
const limiter = new ClientRateLimiter(100, 10); // 100 burst, 10/sec sustained
 
async function callAPI(endpoint: string) {
  await limiter.acquire();
  return fetch(endpoint);
}

Conclusion

Rate limiting is a fundamental API protection mechanism that ensures reliability, fairness, and security. The choice of algorithm and implementation strategy significantly impacts both system protection and user experience.

Key takeaways:

Use token bucket for general-purpose rate limiting — it handles bursts gracefully and is memory efficient
Use sliding window when you need smooth, predictable limiting without boundary spikes
Implement distributed rate limiting with Redis for multi-instance deployments using Lua scripts for atomicity
Return clear rate limit headers (RateLimit-*, Retry-After) in every response
Implement tiered rate limits aligned with your business model
Set per-endpoint limits based on operation cost and resource usage
Monitor rate limit events to identify abuse and capacity issues
Provide graceful degradation (throttling, cached responses) instead of hard rejection
Use cost-based limiting for APIs with heterogeneous endpoint costs
Consider adaptive rate limiting that adjusts based on system health metrics

Start by implementing basic rate limiting at the API gateway level with a simple sliding window algorithm. Once working, add tiered limits, per-endpoint configuration, cost-based budgets, and monitoring. The investment in rate limiting pays dividends in system reliability, cost control, and fair resource allocation.

Minh Vo
