Understanding Context Window in LLMs


I was working on a feature and using an LLM alongside me for quick help — generating code, fixing bugs, and refining logic. Initially, everything was smooth. It understood my context, followed my instructions, and gave accurate responses.
But after some time, things started to feel off.
It began suggesting changes that didn't match my code. It forgot constraints I had clearly mentioned earlier. Sometimes it even gave completely unrelated answers. I found myself wondering — why is this happening?
At first, I thought I wasn't prompting it properly. Maybe I missed something. But the more I retried, the more inconsistent it became.
That's when I realized something important: I was treating the LLM like it remembers everything I told it — like a human would.
But it doesn't.
It only works with a limited "view" of the conversation. And once that limit is crossed, it simply starts forgetting the earlier parts — which explains the wrong answers, the confusion, and what we often call hallucination.
The answer lies in a fundamental concept called the Context Window.
But before we get there, we need to understand something even more basic — tokens. Because when we say "context window," we're not measuring it in words or sentences. We're measuring it in tokens.
LLMs don't read text the way we do. They don't see words. They see tokens — small chunks of text that the model uses as its basic unit of processing.
A token can be a full word like "hello", a part of a word like "un" or "ing", or even a single character like "!". The sentence "I love building apps" might become four tokens: ["I", " love", " building", " apps"]. But a less common word like "tokenization" might get split into ["token", "ization"].
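To make this concrete, here is a toy greedy longest-match tokenizer over a tiny, made-up vocabulary. Real tokenizers learn their vocabulary with BPE (covered below) rather than hand-picking entries, but the output has the same shape: a list of subword strings.

```python
# Toy greedy tokenizer over a tiny, hypothetical vocabulary.
# Real tokenizers (e.g. OpenAI's tiktoken) use a learned BPE vocabulary instead.

VOCAB = {"token", "ization", "I", " love", " building", " apps", "!"}

def tokenize(text: str) -> list[str]:
    """Split text into the longest matching vocabulary entries, left to right.
    Anything not in the vocabulary falls back to single-character tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first.
        for end in range(len(text), i, -1):
            if text[i:end] in VOCAB:
                tokens.append(text[i:end])
                i = end
                break
        else:
            tokens.append(text[i])  # fallback: one character
            i += 1
    return tokens

print(tokenize("I love building apps"))  # ['I', ' love', ' building', ' apps']
print(tokenize("tokenization"))          # ['token', 'ization']
```

Note how "tokenization" splits into two known subwords even though the full word isn't in the vocabulary — that fallback behavior is the whole point of subword tokenization.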
So why not just use whole words? Because every language has thousands of rare words, slang, technical terms, and typos. If the model tried to keep a separate entry for every possible word, the vocabulary would be impossibly large. Tokens give us a middle ground — a compact vocabulary that can still represent any text.
Most modern LLMs — including GPT-4, Claude, and Llama — use an algorithm called Byte Pair Encoding (BPE) to build their token vocabulary.
The idea is simple: start with individual characters, then repeatedly merge the most frequently occurring pair into a new token. Do this thousands of times over a massive text corpus, and you end up with a vocabulary of commonly used subwords.
Here's how it works step by step:
Step 1: Start with every character as its own token.
"low" → ["l", "o", "w"]
"lower" → ["l", "o", "w", "e", "r"]
"newest" → ["n", "e", "w", "e", "s", "t"]
Step 2: Count all adjacent pairs across the entire corpus and find the most frequent one. Let's say ("l", "o") appears the most.
Step 3: Merge that pair into a new token: "lo".
"low" → ["lo", "w"]
"lower" → ["lo", "w", "e", "r"]
Step 4: Repeat. Next most frequent pair might be ("lo", "w") → merge into "low".
"low" → ["low"]
"lower" → ["low", "e", "r"]
This keeps going until the vocabulary reaches a target size — GPT-2 uses about 50,000 tokens, GPT-4 uses around 100,000, and Llama 3 uses about 128,000.
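The merge loop above can be sketched in a few lines of Python. This is a simplified trainer, not a production tokenizer — it ignores word frequencies and byte-level details — but it reproduces the exact merges from the walkthrough.

```python
from collections import Counter

def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    corpus = [list(w) for w in words]          # start: one token per character
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in corpus:
            for a, b in zip(toks, toks[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent pair (ties: first seen)
        merges.append(best)
        merged = best[0] + best[1]
        for toks in corpus:                    # apply the merge everywhere
            i = 0
            while i < len(toks) - 1:
                if (toks[i], toks[i + 1]) == best:
                    toks[i:i + 2] = [merged]
                else:
                    i += 1
    return merges

print(train_bpe(["low", "lower", "newest"], num_merges=2))
# [('l', 'o'), ('lo', 'w')] — the same two merges as in the steps above
```

Running it on the three example words learns ("l", "o") first and ("lo", "w") second, matching the walkthrough.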
The result? Common words like "the" or "function" become a single token. Rare or technical words get split into recognizable pieces. And the model can handle any input — even words it has never seen before — by breaking them down into known subwords.
Quick aside: This is also why LLMs sometimes struggle with character-level tasks like "How many r's are in strawberry?" — the word might be a single token to the model, so it literally doesn't "see" the individual letters inside it.
For English text, 1 token ≈ 0.75 words (or roughly 4 characters). So 1,000 words is approximately 1,300 tokens. It's not exact — code and non-English text tend to use more tokens per word — but it's useful for estimation.
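These rules of thumb are easy to turn into a quick estimator. The numbers below are rough heuristics only — accurate counts come from running the model's actual tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for English prose: ~4 characters per token.
    Code and non-English text typically use more tokens than this suggests."""
    return max(1, round(len(text) / 4))

def estimate_tokens_from_words(word_count: int) -> int:
    """~0.75 words per token, so tokens ≈ words / 0.75."""
    return round(word_count / 0.75)

print(estimate_tokens_from_words(1000))  # 1333 — close to the ~1,300 figure above
```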
Now that we know what tokens are, let's talk about the real constraint.
The context window is the maximum number of tokens an LLM can process in a single request. It's the model's working memory — everything it can "see" at once.

And here's the critical part: the context window includes both input and output.
This is the mental model that made everything click for me. Forget chatbots and conversations for a moment. At its core, an LLM is just a function:
output = LLM(input)
You give it input (your prompt, system instructions, conversation history, documents — all converted to tokens). It produces output (the response, also in tokens). And the context window is the fixed-size container that holds both.
Context Window = Input Tokens + Output Tokens
Think of it like a function with a fixed-size argument buffer. The function can only accept and return data that fits within that buffer. If your input is too large, there's no room for a meaningful output. If you ask for a very long output, your input needs to be shorter.
For any LLM request to work well:
Context Window >= Input Tokens + Output Tokens
If the left side is smaller than the right, something has to give. The model might truncate older messages, cut off its own response mid-sentence, or — worse — silently lose important context and start producing confused or inaccurate answers.
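As a sanity check before sending a request, you can compute the remaining output budget yourself. A trivial sketch — in a real application the input token count would come from the model's tokenizer:

```python
def output_budget(context_window: int, input_tokens: int) -> int:
    """Tokens left for the model's response after the input is accounted for."""
    return max(0, context_window - input_tokens)

# Example: a 128K-token window with 125K tokens of input.
remaining = output_budget(128_000, 125_000)
print(remaining)        # 3000 tokens
print(remaining * 0.75) # ~2250 words — tight for a complex answer
```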
This is exactly what happened to me. My conversation grew so long that the accumulated input tokens ate up the entire context window, leaving no room for the model to generate a coherent response — or even remember what I'd told it earlier.
To give you a sense of scale, here are the context window sizes for some widely used models today:
| Model | Context Window |
|---|---|
| GPT-4o | 128K tokens |
| GPT-4.1 | 1M tokens (API) |
| Claude Sonnet 4 | 200K tokens |
| Gemini 2.5 Pro | 1M tokens |
| Llama 4 Scout | 10M tokens |
These numbers are impressive, but bigger doesn't always mean better. A 200K context window that maintains consistent accuracy throughout is more useful than a 1M window where the model loses track of information buried in the middle. This is known as the "lost in the middle" problem — models tend to recall information from the beginning and end of the context much better than from the middle.
Understanding the formula Context Window >= Input + Output changes how you work with LLMs. Here are a few practical habits:
Be intentional with your input. Don't paste an entire codebase when the model only needs two files. Don't include your full conversation history if only the last few exchanges are relevant. Every unnecessary token is a token stolen from the model's ability to reason and respond.
Leave room for output. If your model has a 128K context window and your input is 125K tokens, the model has only 3K tokens (~2,000 words) to respond. That might not be enough for a complex answer. Some APIs even have a separate, smaller output cap — GPT-4o, for example, caps output at 16K tokens regardless of the context window size.
Watch for silent degradation. Most models won't throw an error when you're close to the limit. They'll just quietly drop older context or produce shallower responses. If the model starts "forgetting" things or giving vague answers, your context window is likely full.
Summarize aggressively in long conversations. If you're in a long back-and-forth session, periodically summarize the conversation state and restart with that summary as your input. This is effectively manual "context management" — and it works surprisingly well.
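That manual context management can be sketched as a small helper. Everything here is hypothetical scaffolding: `count_tokens` stands in for the model's tokenizer, and `summarize` stands in for a call that asks the LLM itself to condense the dropped messages.

```python
# A sketch of manual context management: keep the system prompt and the most
# recent messages, and fold everything older into a single summary message.
# `count_tokens` and `summarize` are hypothetical callables you would supply.

def trim_history(messages, budget_tokens, count_tokens, summarize):
    """Evict oldest messages until the history fits, replacing them with a summary.
    Assumes the summary itself is short relative to what it replaces."""
    system, rest = messages[0], messages[1:]
    dropped = []
    while rest and sum(count_tokens(m) for m in [system] + rest) > budget_tokens:
        dropped.append(rest.pop(0))  # evict the oldest message first
    if dropped:
        rest.insert(0, {"role": "user", "content": summarize(dropped)})
    return [system] + rest
```

Usage is straightforward: call `trim_history` on your message list before each request, keeping the system prompt pinned at the top so the model never loses its core instructions.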
The context window is one of those deceptively simple concepts that explains a lot of LLM behavior once you internalize it. To recap:
Tokens are the unit of measurement — not words, not characters. Most LLMs use BPE to split text into subword tokens.
The context window is a fixed-size container measured in tokens. It holds everything — your input and the model's output.
The rule is straightforward: Context Window >= Input Tokens + Output Tokens. Violate it, and things break quietly.
And the analogy that sticks: an LLM is a function. The context window is its argument buffer. Work within it, and you get reliable, high-quality responses. Exceed it, and you get confusion, hallucination, and the frustrating experience I described at the start.
Once you see LLMs through this lens, a lot of "mysterious" behavior stops being mysterious. The model isn't broken. It isn't dumb. It's just working within its window — and now you know how to work within it too.