Rate Limiting: Introduction


We were just two days post-launch. A few thousand users had signed up, things were finally moving, and the overall feeling was excitement.
Then my phone rang at 4:00 AM on Saturday.
Support said the system wasn't responding properly.
I opened my laptop and logged into the dashboards. The picture was immediately clear: CPU was maxed out, database connection pools were exhausted, and downstream APIs were starting to fail.
At first glance, it looked bad. But after digging in, it didn't take long to rule out an attack.
This was normal user behaviour — clicking buttons, refreshing screens, retrying when things felt slow.
The problem was simpler and more uncomfortable: we had no rate limiting.
Requests kept arriving faster than the system could process them. Arrival rate kept climbing while processing rate quietly dropped. Latency crept from 80 ms to 300 ms and then into seconds.
Nothing crashed outright. The system was just slowly suffocating under perfectly valid traffic.
The fix was straightforward: Rate Limiting.
We stepped in as gatekeepers, controlling how fast requests were allowed in so the system could actually keep up with the work it had already accepted.
That morning was a quiet reminder that success can be its own bottleneck.
Without balancing arrival rate and processing rate, a growing system doesn't fail dramatically — it breaks itself slowly by trying to serve everyone at once.
At its core, rate limiting enforces a maximum number of requests a system is willing to accept over time.
In simple terms, it ensures that the arrival rate never exceeds the processing rate.
When you say a service allows 100 requests per second, you are not describing traffic — you are declaring capacity.
Every system, regardless of scale or architecture, is bounded by three hard physical constraints:
- CPU: every request consumes compute time, and cores are finite.
- I/O: connection pools, disks, and network bandwidth can only move so much.
- External dependencies: downstream services impose their own quotas.
These limits are non-negotiable. You can scale them, but you cannot remove them.
Problems begin when requests arrive faster than the system can complete them.
As load increases:
- queues build up faster than they drain,
- latency climbs from milliseconds into seconds,
- slow responses trigger retries, which add even more load.
The system doesn't fail because of a single bad request — it fails because too much valid work was admitted at once.
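To see how quietly this failure builds, here's a back-of-the-envelope sketch in Python. The arrival and processing rates are made-up numbers for illustration, not measurements from the system above.

```python
# Sketch of unbounded backlog growth when arrival rate exceeds processing rate.
# The rates below are illustrative assumptions.

ARRIVAL_RPS = 120      # requests arriving per second
PROCESSING_RPS = 100   # requests the system can complete per second

def queue_depth_after(seconds: int) -> int:
    """Backlog size after `seconds` of sustained overload."""
    return max(0, (ARRIVAL_RPS - PROCESSING_RPS) * seconds)

# Every second, 20 requests join a backlog that never drains.
for t in (1, 10, 60):
    backlog = queue_depth_after(t)
    wait = backlog / PROCESSING_RPS  # extra queueing delay for a new arrival
    print(f"after {t:>3}s: backlog={backlog:>5} requests, added latency={wait:.1f}s")
```

Nothing here is broken in the conventional sense; the backlog simply compounds, which is exactly the slow suffocation described above.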
Rate limiting prevents this by controlling how much work is allowed in, not how much traffic exists outside.
Rate limiting serves three fundamental purposes:
- Protecting capacity, so the system never admits more work than it can finish.
- Preserving invariants, so correctness guarantees survive overload.
- Prioritising under stress, so someone decides explicitly who gets served.
When you configure a limit like 100 RPS, you are implicitly declaring a contract:
For any time window T, the total admitted work must be ≤ T × Capacity
Where Capacity is constrained by the weakest link:
Capacity = min( what the CPU can sustain, what I/O can sustain, what external dependencies allow )
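To make the contract concrete, here's a small sketch. Every per-resource figure and the `within_contract` helper are illustrative assumptions, not real numbers.

```python
# Hedged sketch: deriving effective capacity from the weakest resource.
# All per-resource numbers below are made up for illustration.

CPU_CORES = 8
CPU_SECONDS_PER_REQUEST = 0.005   # 5 ms of CPU per request
DB_CONNECTIONS = 50
DB_SECONDS_PER_REQUEST = 0.02     # 20 ms holding a connection
DOWNSTREAM_QUOTA_RPS = 300        # quota imposed by a partner API

cpu_capacity = CPU_CORES / CPU_SECONDS_PER_REQUEST     # 1600 rps
io_capacity = DB_CONNECTIONS / DB_SECONDS_PER_REQUEST  # 2500 rps
capacity = min(cpu_capacity, io_capacity, DOWNSTREAM_QUOTA_RPS)

print(f"effective capacity = {capacity:.0f} rps")  # the downstream quota wins

def within_contract(admitted: int, window_seconds: float) -> bool:
    """The contract: admitted work over any window T must be <= T * Capacity."""
    return admitted <= window_seconds * capacity
```

Note that neither CPU nor the database is the bottleneck here; the external quota is, which is exactly why capacity must be computed as a minimum across all three constraints.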
Rate limiting is not traffic counting.
It is capacity enforcement.
It is the system saying:
"I only accept as much work as I can finish reliably."
Everything that follows — algorithms, buckets, leaky queues — exists to uphold this single principle.
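As a preview of what those mechanisms look like, here's a minimal token bucket sketch that upholds the principle by refilling admission capacity at the sustained rate. It's a single-process illustration, not a production limiter, and the rate and burst numbers are arbitrary.

```python
import time

class TokenBucket:
    """Minimal token bucket: admits work only as fast as capacity refills.

    A sketch for illustration; a production limiter also needs locking,
    distributed state, and careful clock handling.
    """

    def __init__(self, rate_per_sec: float, burst: float):
        self.rate = rate_per_sec   # sustained capacity
        self.capacity = burst      # how much burst we tolerate
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate_per_sec=100, burst=10)
admitted = sum(bucket.allow() for _ in range(50))
print(f"admitted {admitted} of 50 instantaneous requests")  # roughly the burst size
```

The bucket never promises to absorb everything; it promises to admit no more than the system declared it could finish, plus a bounded burst.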
Not all systems are created equal, and neither are their rate limits.
Consider two broad classes of systems:
System A — Search, Feed, Authentication
Requests are cheap, mostly stateless, highly cacheable, and tolerant to short bursts. A sudden spike might increase latency, but the system usually recovers quickly.
System B — Payments, Video Start, ML Inference
Requests are expensive, often stateful, have side effects, and are not burst tolerant. A spike here doesn't just slow things down — it can corrupt state, exhaust downstream dependencies, or cause irreversible failures.
Using the same rate limiting strategy for both is a fast way to kill one of them.
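One way to express that difference is a per-endpoint policy table. The endpoint names, limits, and strategy labels below are hypothetical, chosen only to contrast the two classes.

```python
# Hypothetical per-endpoint policies reflecting the two system classes above.
# All names and numbers are illustrative assumptions.

RATE_LIMIT_POLICIES = {
    # System A: cheap, stateless, burst-tolerant -> generous limit, allow bursts.
    "/search":    {"strategy": "token_bucket", "rate_rps": 500, "burst": 200},
    "/feed":      {"strategy": "token_bucket", "rate_rps": 300, "burst": 100},
    # System B: expensive, stateful, side effects -> strict, smoothed admission.
    "/payments":  {"strategy": "leaky_bucket", "rate_rps": 20, "burst": 0},
    "/inference": {"strategy": "leaky_bucket", "rate_rps": 5,  "burst": 0},
}

def policy_for(path: str) -> dict:
    # Unknown endpoints default to the strictest policy: fail safe, not open.
    return RATE_LIMIT_POLICIES.get(
        path, {"strategy": "leaky_bucket", "rate_rps": 1, "burst": 0}
    )
```

The design choice worth noticing is the default: when in doubt, an endpoint inherits System B's strictness rather than System A's generosity.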
Rate limiting algorithms are not interchangeable knobs.
Each one protects a different invariant, and the choice of algorithm should be driven by which invariant must never break.
In production systems, there are three core invariants a rate limiter is asked to uphold.
Here's the uncomfortable truth: you cannot guarantee all three simultaneously under failure conditions.
Rate limiting is the act of deciding which invariant you are willing to violate first.
This is why we have multiple rate limiting algorithms:
- fixed and sliding window counters,
- token buckets,
- leaky buckets and queues.
Each makes a different trade-off.
Choosing the wrong algorithm doesn't just reduce effectiveness — it actively harms the system you're trying to protect.
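A quick sketch of that harm: a naive fixed-window counter declared at 100 requests per second can admit nearly double its capacity across a window boundary. The request timings below are illustrative.

```python
# Sketch: the fixed-window boundary problem. A counter that resets every
# second can admit ~2x its declared limit in a short span straddling a reset.

LIMIT = 100  # declared capacity: 100 requests per second

def fixed_window_admits(request_times, window=1.0):
    """Count admissions under a naive fixed-window counter."""
    admitted, counts = 0, {}
    for t in request_times:
        window_id = int(t // window)
        if counts.get(window_id, 0) < LIMIT:
            counts[window_id] = counts.get(window_id, 0) + 1
            admitted += 1
    return admitted

# 100 requests just before the 1 s boundary and 100 just after:
# all 200 land within ~0.2 s of wall clock, double the declared capacity.
burst = [0.9 + i * 0.001 for i in range(100)] + [1.0 + i * 0.001 for i in range(100)]
print(fixed_window_admits(burst))
```

If the invariant you care about is "never exceed capacity in any one-second span", this algorithm silently violates it, which is the sense in which a wrong choice actively harms the system.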
In real systems, rate limiting is never applied in one place.
It is layered, with each layer protecting a different failure domain:
- at the edge, per-IP limits absorb abusive or broken clients,
- at the gateway, per-user limits enforce fairness,
- at the service, per-endpoint limits protect its own capacity,
- at the dependency boundary, limits shield databases and downstream quotas.
Each layer exists because failure propagates inward, and protection must exist at every boundary.
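Conceptually, layered admission means a request must pass every boundary before it touches real work. The layer names and limits in this sketch are illustrative assumptions, and the toy counter stands in for whatever algorithm each layer actually runs.

```python
# Sketch of layered admission: a request must pass every boundary.
# Layer names and limits are illustrative assumptions.

class CountingLimiter:
    """Toy per-process counter standing in for one layer's limiter."""

    def __init__(self, name: str, limit: int):
        self.name, self.limit, self.count = name, limit, 0

    def allow(self) -> bool:
        if self.count >= self.limit:
            return False
        self.count += 1
        return True

# Inner layers are tighter: each protects a smaller failure domain.
LAYERS = [
    CountingLimiter("edge (per IP)", limit=100),
    CountingLimiter("gateway (per user)", limit=50),
    CountingLimiter("service (per endpoint)", limit=20),
    CountingLimiter("dependency (db pool)", limit=10),
]

def admit(layers) -> bool:
    # Reject at the first layer that is out of budget (all() short-circuits).
    return all(layer.allow() for layer in layers)

admitted = sum(admit(LAYERS) for _ in range(30))
print(f"admitted {admitted} of 30 requests")  # bounded by the tightest layer
```

End to end, throughput is governed by the tightest layer, which is the point: the innermost boundary, the one closest to irreversible work, gets the final say.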
Rate limiting is not about counting requests.
It is not about throttling traffic.
It is about enforcing capacity, preserving invariants, and deciding who gets served when the system is under stress.
If you don't make those decisions explicitly, your system will make them implicitly — and usually at the worst possible time.
In the next post, we'll go deep into each rate limiting algorithm, examine their trade-offs, and show how to choose the right one for the invariant you care about most.
Until then, remember: rate limiting is a policy decision first, and an algorithm second.