Context Window
The context window is the maximum amount of text, measured in tokens, that a language model can process in a single request, including both the input prompt and the generated output. It defines the hard limit on how much information you can give the model to work with at one time. GPT-4o supports 128K tokens, Claude supports up to 200K, and Gemini 1.5 Pro extends to 2M. As a rule of thumb, each token is roughly three-quarters of a word in English.

The context window is one of the most important constraints in AI product design because it determines how much retrieved content, conversation history, system instructions, and tool output you can fit into a single request. A RAG system that retrieves ten web pages averaging 5,000 tokens each has already consumed 50,000 tokens of context before the model generates a single word of response. Add system prompts, conversation history, and tool definitions, and you can hit the limit quickly.

Context window management is an active engineering discipline. Common techniques include chunking (breaking documents into smaller pieces and including only the most relevant chunks), summarization (compressing retrieved content before injection), sliding windows (maintaining a rolling buffer of recent conversation), and hierarchical retrieval (using a cheap model to filter results before sending the best ones to an expensive model with a large context window).

For product builders working with web data, the context window creates a direct relationship between content quality and cost. Every token of irrelevant content, such as navigation menus, ads, and boilerplate, is a token that could have been used for useful information. This is why LLM-ready data matters: clean, well-extracted content lets you fit more relevant information into the same token budget. It is also why AI search APIs that return pre-cleaned content can be more cost-effective than raw web scraping, even if the per-query price is higher.
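The sliding-window technique above can be sketched as a small token-budget loop. This is a minimal illustration, not a production implementation: the four-characters-per-token estimate is a crude heuristic, and real systems would count tokens with the model's actual tokenizer (for example, tiktoken for OpenAI models).

```python
# Sketch of a sliding conversation window under a token budget.
# Assumes ~4 characters per English token, a rough heuristic only;
# production code should use the model's real tokenizer.

def estimate_tokens(text: str) -> int:
    """Very rough token count: about 4 characters per token."""
    return max(1, len(text) // 4)

def sliding_window(messages: list[str], budget: int) -> list[str]:
    """Keep the most recent messages whose combined estimate fits the budget."""
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):       # walk from newest to oldest
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break                        # older messages are dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))          # restore chronological order

history = [
    "User: What is a context window?",
    "Assistant: The maximum number of tokens a model can process per request.",
    "User: How do I manage it?",
    "Assistant: Chunking, summarization, sliding windows, hierarchical retrieval.",
]
print(sliding_window(history, budget=40))
```

With a budget of 40 estimated tokens, only the two most recent messages survive; raise the budget and the whole history fits. The same pattern generalizes to any rolling buffer where the oldest content is evicted first.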
Larger context windows do not eliminate the need for good retrieval. Research consistently shows that models perform worse when given large amounts of irrelevant context, even when the relevant information is present. More context is not always better — more relevant context is.
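The point about relevant context can be made concrete with a small filtering sketch. Here a keyword-overlap score stands in for whatever cheap relevance signal a real pipeline would use (an embedding similarity or a small filtering model); the chunk texts and function names are illustrative.

```python
# Sketch of relevance filtering before context injection: rank retrieved
# chunks by a cheap relevance score and keep only the top ones, instead
# of packing the window with everything that was retrieved.
# The word-overlap scorer is a stand-in for embeddings or a filter model.

def relevance(query: str, chunk: str) -> float:
    """Fraction of query words that also appear in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q)

def select_chunks(query: str, chunks: list[str], top_k: int) -> list[str]:
    """Return the top_k chunks most relevant to the query."""
    ranked = sorted(chunks, key=lambda ch: relevance(query, ch), reverse=True)
    return ranked[:top_k]

chunks = [
    "The context window limits total tokens per request.",
    "Our newsletter ships every Tuesday. Subscribe now!",
    "Summarization compresses tokens before injection into the context window budget.",
]
print(select_chunks("context window tokens", chunks, top_k=2))
```

The boilerplate chunk scores zero and is dropped, so the token budget is spent on the two chunks that actually address the query.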