Understanding Context Windows: A Practical Guide

Bigger is not always better. Learn when 128K, 200K, or 2M tokens actually matter.

Every LLM has a context window — the maximum number of tokens it can process in a single request. But context windows are not created equal. A model that claims 2M tokens might only use the last 128K effectively. This guide separates marketing from mechanics.

What Is a Context Window Really?

The context window includes everything: system instructions, conversation history, uploaded documents, and the prompt itself. When you exceed the limit, the provider either:

  • Truncates from the beginning (oldest messages dropped)
  • Rejects the request with a 400 error
  • Bills at a higher long-context rate once you cross a threshold (e.g. Gemini 1.5 Pro's long-context tier)
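Because truncation and rejection both happen server-side, it is safer to enforce the budget yourself before sending. The sketch below assumes a rough 4-characters-per-token heuristic; real tokenizers vary by model, so treat the numbers as estimates, not guarantees.

```python
# Client-side budget enforcement. The 4-chars-per-token ratio is a
# rough heuristic (assumption), not a real tokenizer.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def trim_to_window(messages: list[str], limit: int) -> list[str]:
    """Drop the oldest messages until the estimated total fits the window."""
    kept = list(messages)
    while kept and sum(estimate_tokens(m) for m in kept) > limit:
        kept.pop(0)  # truncate from the beginning, oldest first
    return kept
```

Mirroring the provider's oldest-first truncation locally at least makes the behavior predictable; you could equally raise an error here if silent dropping is unacceptable.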

Provider Context Limits (April 2026)

Model               Claimed Window   Effective Window   Cost Multiplier
GPT-4o              128K             ~90K               1.0×
Claude 3.5 Sonnet   200K             ~180K              1.0×
Gemini 1.5 Pro      2M               ~1.5M              1.0×
Gemini 1.5 Flash    1M               ~800K              0.06×
DeepSeek V3         64K              ~55K               0.028×

The "Lost in the Middle" Problem

Research on long-context retrieval shows that models struggle to recall information from the middle of long contexts. A fact buried at token 50,000 in a 128K prompt is less likely to be recalled than one at token 1,000 or token 127,000. The effect has been observed in every model tested to date, though its severity varies.

Practical Impact

  • Document Q&A: If you dump a 200-page PDF into Claude, questions about page 100 tend to be answered less accurately than questions about page 1 or page 200.
  • Code Review: A 10,000-line file at the start of the context gets less attention than a 500-line file at the end.

Strategies for Effective Context Usage

1. Chunking with Summaries

Break long documents into chunks and summarize each chunk into a short metadata header. At query time, the model sees the summaries plus only the relevant chunk, not the full document.
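A minimal sketch of this pipeline, splitting on paragraph boundaries. The `summarize` callable is a placeholder (assumption) for whatever cheap model call produces the header:

```python
def chunk_document(text: str, chunk_size: int = 1000) -> list[str]:
    """Split on paragraph boundaries, packing up to chunk_size characters."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > chunk_size:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

def with_summary_header(chunk: str, summarize) -> str:
    """Prefix a chunk with a model-written summary header.
    `summarize` stands in for a call to a cheap summarization model."""
    return f"[SUMMARY] {summarize(chunk)}\n[CHUNK] {chunk}"
```

Splitting on paragraph boundaries keeps chunks semantically coherent; a production system would usually also overlap adjacent chunks slightly.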

2. Re-ranking

Use a cheaper model (e.g. Gemini 1.5 Flash) to identify which chunks are relevant to the user's question. Then feed only those chunks to the expensive model (e.g. Claude 3.5 Sonnet).
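The shape of the step is simple: score every chunk against the question, keep the top few. In the sketch below, word-overlap scoring stands in for the cheap-model relevance call (an assumption for testability); in practice `score` would prompt the cheap model to rate each chunk.

```python
def rerank(question: str, chunks: list[str], top_k: int = 3) -> list[str]:
    """Keep only the top_k chunks most relevant to the question.
    Word-overlap scoring is a stand-in for a cheap-model relevance call."""
    q_words = set(question.lower().split())

    def score(chunk: str) -> int:
        return len(q_words & set(chunk.lower().split()))

    return sorted(chunks, key=score, reverse=True)[:top_k]
```

Only the surviving `top_k` chunks are forwarded to the expensive model, which is where both the cost and the lost-in-the-middle savings come from.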

3. Hierarchical Prompting

Place the most important instructions at the end of the system prompt, not the beginning. The model pays more attention to recent tokens.
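One way to operationalize this is to assemble the system prompt in two tiers, with the must-follow rules last. The function and marker string below are illustrative assumptions, not a provider API:

```python
def build_system_prompt(background: list[str], critical: list[str]) -> str:
    """Assemble a system prompt with background first and the
    must-follow rules last, where recency bias works in your favor."""
    parts = background + ["CRITICAL - follow these rules exactly:"] + critical
    return "\n\n".join(parts)
```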

4. Window Management in Chat

In conversational UIs, summarize old turns instead of keeping full history. A rolling summary plus the last 5 full messages almost always outperforms 50 full messages, at a fraction of the token cost.
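A sketch of the rolling-summary pattern, again with `summarize` as a placeholder (assumption) for a cheap summarization call:

```python
def manage_history(turns: list[str], summarize, keep_last: int = 5) -> list[str]:
    """Keep the last `keep_last` turns verbatim; collapse everything
    older into a single rolling-summary message."""
    if len(turns) <= keep_last:
        return list(turns)
    old, recent = turns[:-keep_last], turns[-keep_last:]
    summary = summarize("\n".join(old))  # would call a cheap model
    return [f"[Conversation so far] {summary}"] + recent
```

In a real chat loop you would cache the summary and fold each newly evicted turn into it, rather than re-summarizing the whole prefix on every request.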

Context Window vs Cost

Gemini 1.5 Pro's 2M window sounds revolutionary, but filling it costs $2.50 per million input tokens × 2 million tokens = $5.00 per request. For most use cases, chunking + a 128K window is cheaper and more accurate.
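The arithmetic is worth making explicit. The 30K-token chunked figure below is an illustrative assumption for a re-ranked request, not a measured number:

```python
def request_cost(input_tokens: int, price_per_million: float) -> float:
    """Input-side cost of a single request, in dollars."""
    return input_tokens / 1_000_000 * price_per_million

full_window = request_cost(2_000_000, 2.50)  # the full 2M window: $5.00
chunked = request_cost(30_000, 2.50)         # ~30K of re-ranked chunks: cents
```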

Key Takeaway

Treat the context window as a budget, not a storage container. The best-performing prompts use 20–40% of the available window, place critical information near the end, and never assume the model "read" the middle of a long document.