Understanding Context Window in LLMs


I was working on a feature and using an LLM alongside me for quick help — generating code, fixing bugs, and refining logic. Initially, everything was smooth. It understood my context, followed my instructions, and gave accurate responses.
But after some time, things started to feel off.
It began suggesting changes that didn't match my code. It forgot constraints I had clearly mentioned earlier. Sometimes it even gave completely unrelated answers. I found myself wondering — why is this happening?
At first, I thought I wasn't prompting it properly. Maybe I missed something. But the more I retried, the more inconsistent it became.
That's when I realized something important: I was treating the LLM like it remembers everything I told it — like a human would.
But it doesn't.
It only works with a limited "view" of the conversation. And once that limit is crossed, it simply starts forgetting the earlier parts — which explains the wrong answers, the confusion, and what we often call hallucination.
The answer lies in a fundamental concept called the Context Window.
But before we get there, we need to understand something even more basic — tokens. Because when we say "context window," we're not measuring it in words or sentences. We're measuring it in tokens.
LLMs don't read text the way we do. They don't see words. They see tokens — small chunks of text that the model uses as its basic unit of processing.
A token can be a full word like "hello", a part of a word like "un" or "ing", or even a single character like "!". The sentence "I love building apps" might become four tokens: ["I", " love", " building", " apps"]. But a less common word like "tokenization" might get split into ["token", "ization"].
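To make this concrete, here is a toy greedy longest-match tokenizer over a tiny, made-up vocabulary. Real tokenizers learn their vocabulary with BPE (covered below) rather than hand-picking entries, but the output has the same shape: a list of subword strings.

```python
# Toy greedy tokenizer over a tiny, hypothetical vocabulary.
# Real tokenizers (e.g. OpenAI's tiktoken) use a learned BPE vocabulary instead.

VOCAB = {"token", "ization", "I", " love", " building", " apps", "!"}

def tokenize(text: str) -> list[str]:
    """Split text into the longest matching vocabulary entries, left to right.
    Anything not in the vocabulary falls back to single-character tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first.
        for end in range(len(text), i, -1):
            if text[i:end] in VOCAB:
                tokens.append(text[i:end])
                i = end
                break
        else:
            tokens.append(text[i])  # fallback: one character
            i += 1
    return tokens

print(tokenize("I love building apps"))  # ['I', ' love', ' building', ' apps']
print(tokenize("tokenization"))          # ['token', 'ization']
```

Note how "tokenization" splits into two known subwords even though the full word isn't in the vocabulary — that fallback behavior is the whole point of subword tokenization.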
So why not just use whole words? Because every language has thousands of rare words, slang, technical terms, and typos. If the model tried to keep a separate entry for every possible word, the vocabulary would be impossibly large. Tokens give us a middle ground — a compact vocabulary that can still represent any text.
Most modern LLMs — including GPT-4, Claude, and Llama — use an algorithm called Byte Pair Encoding (BPE) to build their token vocabulary.
The idea is simple: start with individual characters, then repeatedly merge the most frequently occurring pair into a new token. Do this thousands of times over a massive text corpus, and you end up with a vocabulary of commonly used subwords.
Here's how it works step by step:
Step 1: Start with every character as its own token.
"low" → ["l", "o", "w"]
"lower" → ["l", "o", "w", "e", "r"]
"newest" → ["n", "e", "w", "e", "s", "t"]
Step 2: Count all adjacent pairs across the entire corpus and find the most frequent one. Let's say ("l", "o") appears the most.
Step 3: Merge that pair into a new token: "lo".
"low" → ["lo", "w"]
"lower" → ["lo", "w", "e", "r"]
Step 4: Repeat. Next most frequent pair might be ("lo", "w") → merge into "low".
"low" → ["low"]
"lower" → ["low", "e", "r"]
This keeps going until the vocabulary reaches a target size — GPT-2 uses about 50,000 tokens, GPT-4 uses around 100,000, and Llama 3 uses about 128,000.
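The merge loop above can be sketched in a few lines of Python. This is a simplified trainer, not a production tokenizer — it ignores word frequencies and byte-level details — but it reproduces the exact merges from the walkthrough.

```python
from collections import Counter

def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    corpus = [list(w) for w in words]          # start: one token per character
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in corpus:
            for a, b in zip(toks, toks[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)       # most frequent pair (ties: first seen)
        merges.append(best)
        merged = best[0] + best[1]
        for toks in corpus:                    # apply the merge everywhere
            i = 0
            while i < len(toks) - 1:
                if (toks[i], toks[i + 1]) == best:
                    toks[i:i + 2] = [merged]
                else:
                    i += 1
    return merges

print(train_bpe(["low", "lower", "newest"], num_merges=2))
# [('l', 'o'), ('lo', 'w')] — the same two merges as in the steps above
```

Running it on the three example words learns ("l", "o") first and ("lo", "w") second, matching the walkthrough.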
The result? Common words like "the" or "function" become a single token. Rare or technical words get split into recognizable pieces. And the model can handle any input — even words it has never seen before — by breaking them down into known subwords.
Quick aside: This is also why LLMs sometimes struggle with character-level tasks like "How many r's are in strawberry?" — the word might be a single token to the model, so it literally doesn't "see" the individual letters inside it.
For English text, 1 token ≈ 0.75 words (or roughly 4 characters). So 1,000 words is approximately 1,300 tokens. It's not exact — code and non-English text tend to use more tokens per word — but it's useful for estimation.
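These rules of thumb are easy to turn into a quick estimator. The numbers below are rough heuristics only — accurate counts come from running the model's actual tokenizer.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for English prose: ~4 characters per token.
    Code and non-English text typically use more tokens than this suggests."""
    return max(1, round(len(text) / 4))

def estimate_tokens_from_words(word_count: int) -> int:
    """~0.75 words per token, so tokens ≈ words / 0.75."""
    return round(word_count / 0.75)

print(estimate_tokens_from_words(1000))  # 1333 — close to the ~1,300 figure above
```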
Now that we know what tokens are, let's talk about the real constraint.
The context window is the maximum number of tokens an LLM can process in a single request. It's the model's working memory — everything it can "see" at once.

And here's the critical part: the context window includes both input and output.
This is the mental model that made everything click for me. Forget chatbots and conversations for a moment. At its core, an LLM is just a function:
output = LLM(input)
You give it input (your prompt, system instructions, conversation history, documents — all converted to tokens). It produces output (the response, also in tokens). And the context window is the fixed-size container that holds both.
Context Window = Input Tokens + Output Tokens
Think of it like a function with a fixed-size argument buffer. The function can only accept and return data that fits within that buffer. If your input is too large, there's no room for a meaningful output. If you ask for a very long output, your input needs to be shorter.
For any LLM request to work well:
Context Window >= Input Tokens + Output Tokens
If the left side is smaller than the right, something has to give. The model might truncate older messages, cut off its own response mid-sentence, or — worse — silently lose important context and start producing confused or inaccurate answers.
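As a sanity check before sending a request, you can compute the remaining output budget yourself. A trivial sketch — in a real application the input token count would come from the model's tokenizer:

```python
def output_budget(context_window: int, input_tokens: int) -> int:
    """Tokens left for the model's response after the input is accounted for."""
    return max(0, context_window - input_tokens)

# Example: a 128K-token window with 125K tokens of input.
remaining = output_budget(128_000, 125_000)
print(remaining)        # 3000 tokens
print(remaining * 0.75) # ~2250 words — tight for a complex answer
```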
This is exactly what happened to me. My conversation grew so long that the accumulated input tokens ate up the entire context window, leaving no room for the model to generate a coherent response — or even remember what I'd told it earlier.
To give you a sense of scale, here are the context window sizes for some widely used models today:
| Model | Context Window |
|---|---|
| GPT-4o | 128K tokens |
| GPT-4.1 | 1M tokens (API) |
| Claude Sonnet 4 | 200K tokens |
| Gemini 2.5 Pro | 1M tokens |
| Llama 4 Scout | 10M tokens |
These numbers are impressive, but bigger doesn't always mean better. A 200K context window that maintains consistent accuracy throughout is more useful than a 1M window where the model loses track of information buried in the middle. This is known as the "lost in the middle" problem — models tend to recall information from the beginning and end of the context much better than from the middle.
Understanding the formula Context Window >= Input + Output changes how you work with LLMs. Here are a few practical habits:
Be intentional with your input. Don't paste an entire codebase when the model only needs two files. Don't include your full conversation history if only the last few exchanges are relevant. Every unnecessary token is a token stolen from the model's ability to reason and respond.
Leave room for output. If your model has a 128K context window and your input is 125K tokens, the model has only 3K tokens (~2,000 words) to respond. That might not be enough for a complex answer. Some APIs even have a separate, smaller output cap — GPT-4o, for example, caps output at 16K tokens regardless of the context window size.
Watch for silent degradation. Most models won't throw an error when you're close to the limit. They'll just quietly drop older context or produce shallower responses. If the model starts "forgetting" things or giving vague answers, your context window is likely full.
Summarize aggressively in long conversations. If you're in a long back-and-forth session, periodically summarize the conversation state and restart with that summary as your input. This is effectively manual "context management" — and it works surprisingly well.
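That manual context management can be sketched as a small helper. Everything here is hypothetical scaffolding: `count_tokens` stands in for the model's tokenizer, and `summarize` stands in for a call that asks the LLM itself to condense the dropped messages.

```python
# A sketch of manual context management: keep the system prompt and the most
# recent messages, and fold everything older into a single summary message.
# `count_tokens` and `summarize` are hypothetical callables you would supply.

def trim_history(messages, budget_tokens, count_tokens, summarize):
    """Evict oldest messages until the history fits, replacing them with a summary.
    Assumes the summary itself is short relative to what it replaces."""
    system, rest = messages[0], messages[1:]
    dropped = []
    while rest and sum(count_tokens(m) for m in [system] + rest) > budget_tokens:
        dropped.append(rest.pop(0))  # evict the oldest message first
    if dropped:
        rest.insert(0, {"role": "user", "content": summarize(dropped)})
    return [system] + rest
```

Usage is straightforward: call `trim_history` on your message list before each request, keeping the system prompt pinned at the top so the model never loses its core instructions.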
The context window is one of those deceptively simple concepts that explains a lot of LLM behavior once you internalize it. To recap:
Tokens are the unit of measurement — not words, not characters. Most LLMs use BPE to split text into subword tokens.
The context window is a fixed-size container measured in tokens. It holds everything — your input and the model's output.
The rule is straightforward: Context Window >= Input Tokens + Output Tokens. Violate it, and things break quietly.
And the analogy that sticks: an LLM is a function. The context window is its argument buffer. Work within it, and you get reliable, high-quality responses. Exceed it, and you get confusion, hallucination, and the frustrating experience I described at the start.
Once you see LLMs through this lens, a lot of "mysterious" behavior stops being mysterious. The model isn't broken. It isn't dumb. It's just working within its window — and now you know how to work within it too.