Rate Limiting: Introduction


We were just two days post-launch. A few thousand users had signed up, things were finally moving, and the overall feeling was excitement.
Then my phone rang at 4:00 AM on Saturday.
Support said the system wasn't responding properly.
I opened my laptop and logged into the dashboards. The picture was immediately clear: CPU was maxed out, database connection pools were exhausted, and downstream APIs were starting to fail.
At first glance, it looked bad. But after digging in, it didn't take long to rule out an attack.
This was normal user behaviour — clicking buttons, refreshing screens, retrying when things felt slow.
The problem was simpler and more uncomfortable: we had no rate limiting.
Requests kept arriving faster than the system could process them. Arrival rate kept climbing while processing rate quietly dropped. Latency crept from 80 ms to 300 ms and then into seconds.
Nothing crashed outright. The system was just slowly suffocating under perfectly valid traffic.
The fix was straightforward: Rate Limiting.
We stepped in as gatekeepers, controlling how fast requests were allowed in so the system could actually keep up with the work it had already accepted.
That morning was a quiet reminder that success can be its own bottleneck.
Without balancing arrival rate and processing rate, a growing system doesn't fail dramatically — it breaks itself slowly by trying to serve everyone at once.
At its core, rate limiting enforces a maximum number of requests a system is willing to accept over time.
In simple terms, it ensures that the arrival rate never exceeds the processing rate.
When you say a service allows 100 requests per second, you are not describing traffic — you are declaring capacity.
Every system, regardless of scale or architecture, is bounded by three hard physical constraints:
- CPU: every request consumes compute time, and cores are finite.
- I/O: connection pools, disks, and network bandwidth can only move so much.
- External dependencies: downstream services impose their own quotas.
These limits are non-negotiable. You can scale them, but you cannot remove them.
Problems begin when requests arrive faster than the system can complete them.
As load increases:
- queues build up faster than they drain,
- latency climbs from milliseconds into seconds,
- slow responses trigger retries, which add even more load.
The system doesn't fail because of a single bad request — it fails because too much valid work was admitted at once.
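To see how quietly this failure builds, here's a back-of-the-envelope sketch in Python. The arrival and processing rates are made-up numbers for illustration, not measurements from the system above.

```python
# Sketch of unbounded backlog growth when arrival rate exceeds processing rate.
# The rates below are illustrative assumptions.

ARRIVAL_RPS = 120      # requests arriving per second
PROCESSING_RPS = 100   # requests the system can complete per second

def queue_depth_after(seconds: int) -> int:
    """Backlog size after `seconds` of sustained overload."""
    return max(0, (ARRIVAL_RPS - PROCESSING_RPS) * seconds)

# Every second, 20 requests join a backlog that never drains.
for t in (1, 10, 60):
    backlog = queue_depth_after(t)
    wait = backlog / PROCESSING_RPS  # extra queueing delay for a new arrival
    print(f"after {t:>3}s: backlog={backlog:>5} requests, added latency={wait:.1f}s")
```

Nothing here is broken in the conventional sense; the backlog simply compounds, which is exactly the slow suffocation described above.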
Rate limiting prevents this by controlling how much work is allowed in, not how much traffic exists outside.
Rate limiting serves three fundamental purposes:
- Protecting capacity, so the system never admits more work than it can finish.
- Preserving invariants, so correctness guarantees survive overload.
- Prioritising under stress, so someone decides explicitly who gets served.
When you configure a limit like 100 RPS, you are implicitly declaring a contract:
For any time window T, the total admitted work must be ≤ T × Capacity
Where Capacity is constrained by the weakest link:
Capacity = min( what the CPU can sustain, what I/O can sustain, what external dependencies allow )
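To make the contract concrete, here's a small sketch. Every per-resource figure and the `within_contract` helper are illustrative assumptions, not real numbers.

```python
# Hedged sketch: deriving effective capacity from the weakest resource.
# All per-resource numbers below are made up for illustration.

CPU_CORES = 8
CPU_SECONDS_PER_REQUEST = 0.005   # 5 ms of CPU per request
DB_CONNECTIONS = 50
DB_SECONDS_PER_REQUEST = 0.02     # 20 ms holding a connection
DOWNSTREAM_QUOTA_RPS = 300        # quota imposed by a partner API

cpu_capacity = CPU_CORES / CPU_SECONDS_PER_REQUEST     # 1600 rps
io_capacity = DB_CONNECTIONS / DB_SECONDS_PER_REQUEST  # 2500 rps
capacity = min(cpu_capacity, io_capacity, DOWNSTREAM_QUOTA_RPS)

print(f"effective capacity = {capacity:.0f} rps")  # the downstream quota wins

def within_contract(admitted: int, window_seconds: float) -> bool:
    """The contract: admitted work over any window T must be <= T * Capacity."""
    return admitted <= window_seconds * capacity
```

Note that neither CPU nor the database is the bottleneck here; the external quota is, which is exactly why capacity must be computed as a minimum across all three constraints.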
Rate limiting is not traffic counting.
It is capacity enforcement.
It is the system saying:
"I only accept as much work as I can finish reliably."
Everything that follows — algorithms, buckets, leaky queues — exists to uphold this single principle.
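As a preview of what those mechanisms look like, here's a minimal token bucket sketch that upholds the principle by refilling admission capacity at the sustained rate. It's a single-process illustration, not a production limiter, and the rate and burst numbers are arbitrary.

```python
import time

class TokenBucket:
    """Minimal token bucket: admits work only as fast as capacity refills.

    A sketch for illustration; a production limiter also needs locking,
    distributed state, and careful clock handling.
    """

    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec   # sustained capacity
        self.capacity = burst      # how much burst we tolerate
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_sec=100, burst=10)
admitted = sum(bucket.allow() for _ in range(50))
print(f"admitted {admitted} of 50 instantaneous requests")  # roughly the burst size
```

The bucket never promises to absorb everything; it promises to admit no more than the system declared it could finish, plus a bounded burst.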
Not all systems are created equal, and neither are their rate limits.
Consider two broad classes of systems:
System A — Search, Feed, Authentication
Requests are cheap, mostly stateless, highly cacheable, and tolerant to short bursts. A sudden spike might increase latency, but the system usually recovers quickly.
System B — Payments, Video Start, ML Inference
Requests are expensive, often stateful, have side effects, and are not burst tolerant. A spike here doesn't just slow things down — it can corrupt state, exhaust downstream dependencies, or cause irreversible failures.
Using the same rate limiting strategy for both is a fast way to kill one of them.
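One way to express that difference is a per-endpoint policy table. The endpoint names, limits, and strategy labels below are hypothetical, chosen only to contrast the two classes.

```python
# Hypothetical per-endpoint policies reflecting the two system classes above.
# All names and numbers are illustrative assumptions.

RATE_LIMIT_POLICIES = {
    # System A: cheap, stateless, burst-tolerant -> generous limit, allow bursts.
    "/search":    {"strategy": "token_bucket", "rate_rps": 500, "burst": 200},
    "/feed":      {"strategy": "token_bucket", "rate_rps": 300, "burst": 100},
    # System B: expensive, stateful, side effects -> strict, smoothed admission.
    "/payments":  {"strategy": "leaky_bucket", "rate_rps": 20, "burst": 0},
    "/inference": {"strategy": "leaky_bucket", "rate_rps": 5,  "burst": 0},
}

def policy_for(path: str) -> dict:
    # Unknown endpoints default to the strictest policy: fail safe, not open.
    return RATE_LIMIT_POLICIES.get(
        path, {"strategy": "leaky_bucket", "rate_rps": 1, "burst": 0}
    )
```

The design choice worth noticing is the default: when in doubt, an endpoint inherits System B's strictness rather than System A's generosity.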
Rate limiting algorithms are not interchangeable knobs.
Each one protects a different invariant, and the choice of algorithm should be driven by which invariant must never break.
In production systems, there are three core invariants a rate limiter is asked to uphold.
Here's the uncomfortable truth: you cannot guarantee all three simultaneously under failure conditions.
Rate limiting is the act of deciding which invariant you are willing to violate first.
This is why we have multiple rate limiting algorithms:
- fixed and sliding window counters,
- token buckets,
- leaky buckets and queues.
Each makes a different trade-off.
Choosing the wrong algorithm doesn't just reduce effectiveness — it actively harms the system you're trying to protect.
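A quick sketch of that harm: a naive fixed-window counter declared at 100 requests per second can admit nearly double its capacity across a window boundary. The request timings below are illustrative.

```python
# Sketch: the fixed-window boundary problem. A counter that resets every
# second can admit ~2x its declared limit in a short span straddling a reset.

LIMIT = 100  # declared capacity: 100 requests per second

def fixed_window_admits(request_times, window=1.0):
    """Count admissions under a naive fixed-window counter."""
    admitted, counts = 0, {}
    for t in request_times:
        window_id = int(t // window)
        if counts.get(window_id, 0) < LIMIT:
            counts[window_id] = counts.get(window_id, 0) + 1
            admitted += 1
    return admitted

# 100 requests just before the 1 s boundary and 100 just after:
# all 200 land within ~0.2 s of wall clock, double the declared capacity.
burst = [0.9 + i * 0.001 for i in range(100)] + [1.0 + i * 0.001 for i in range(100)]
print(fixed_window_admits(burst))
```

If the invariant you care about is "never exceed capacity in any one-second span", this algorithm silently violates it, which is the sense in which a wrong choice actively harms the system.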
In real systems, rate limiting is never applied in one place.
It is layered, with each layer protecting a different failure domain:
- at the edge, per-IP limits absorb abusive or broken clients,
- at the gateway, per-user limits enforce fairness,
- at the service, per-endpoint limits protect its own capacity,
- at the dependency boundary, limits shield databases and downstream quotas.
Each layer exists because failure propagates inward, and protection must exist at every boundary.
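Conceptually, layered admission means a request must pass every boundary before it touches real work. The layer names and limits in this sketch are illustrative assumptions, and the toy counter stands in for whatever algorithm each layer actually runs.

```python
# Sketch of layered admission: a request must pass every boundary.
# Layer names and limits are illustrative assumptions.

class CountingLimiter:
    """Toy per-process counter standing in for one layer's limiter."""

    def __init__(self, name: str, limit: int):
        self.name, self.limit, self.count = name, limit, 0

    def allow(self) -> bool:
        if self.count >= self.limit:
            return False
        self.count += 1
        return True

# Inner layers are tighter: each protects a smaller failure domain.
LAYERS = [
    CountingLimiter("edge (per IP)", limit=100),
    CountingLimiter("gateway (per user)", limit=50),
    CountingLimiter("service (per endpoint)", limit=20),
    CountingLimiter("dependency (db pool)", limit=10),
]

def admit(layers) -> bool:
    # Reject at the first layer that is out of budget (all() short-circuits).
    return all(layer.allow() for layer in layers)

admitted = sum(admit(LAYERS) for _ in range(30))
print(f"admitted {admitted} of 30 requests")  # bounded by the tightest layer
```

End to end, throughput is governed by the tightest layer, which is the point: the innermost boundary, the one closest to irreversible work, gets the final say.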
Rate limiting is not about counting requests.
It is not about throttling traffic.
It is about enforcing capacity, preserving invariants, and deciding who gets served when the system is under stress.
If you don't make those decisions explicitly, your system will make them implicitly — and usually at the worst possible time.
In the next post, we'll go deep into each rate limiting algorithm, examine their trade-offs, and show how to choose the right one for the invariant you care about most.
Until then, remember: rate limiting is a policy decision first, and an algorithm second.