Agent Engineering Glossary
Core Concepts

Context Window

The maximum number of tokens a language model can process in a single forward pass — encompassing the system prompt, conversation history, retrieved documents, tool results, and the model's own generated output.

Definition

The context window is the total token budget available to a language model during a single inference call. Every token that the model can "see" and reason about must fit within this limit. Exceeding it requires truncation, summarization, or retrieval-based strategies.

Modern frontier models have dramatically extended context windows — from GPT-3's 4K tokens to 128K (GPT-4o), 200K (Claude 3), and beyond — but the context window remains a fundamental design constraint in every agent system.

What Counts Against the Limit?

Component                      Typical Token Cost
System prompt                  500 – 2,000 tokens
Conversation history           Grows linearly with turns
Retrieved RAG chunks           500 – 2,000 per retrieval
Tool schemas                   100 – 500 per tool
Tool call results              Varies widely
Model output (completion)      500 – 4,000 tokens typical
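
As a rough illustration, these line items can be summed into a simple budget check. The component sizes, the 128K limit, and the output reservation below are illustrative assumptions, not fixed values; real counts come from the model's tokenizer.

```python
# Rough context-budget check. All numbers are illustrative assumptions.
CONTEXT_LIMIT = 128_000  # e.g. a 128K-token model

components = {
    "system_prompt": 1_200,
    "conversation_history": 18_000,
    "rag_chunks": 4 * 1_500,   # four retrieved chunks
    "tool_schemas": 6 * 300,   # six registered tools
    "tool_results": 5_000,
}
reserved_for_output = 4_000    # leave room for the completion

used = sum(components.values())
remaining = CONTEXT_LIMIT - used - reserved_for_output
print(f"used={used}, remaining for new input={remaining}")
```

Keeping an explicit reservation for the completion matters: a prompt that fills the entire window leaves the model no room to generate output.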

Context Window vs. Memory

The context window is not the same as agent memory. The context window is ephemeral — it exists only for a single inference call. Agent memory systems (vector stores, key-value caches, episode summarization) persist information across calls and selectively load relevant pieces into the context window when needed.
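
A minimal sketch of this distinction, using a toy keyword-overlap retriever (a hypothetical stand-in; real memory systems would use embeddings and a vector store):

```python
# Toy agent memory: persists across calls, while only the most relevant
# entries are loaded into the (ephemeral) context for each inference.
class AgentMemory:
    def __init__(self):
        self.episodes = []  # persists across inference calls

    def store(self, text):
        self.episodes.append(text)

    def load_relevant(self, query, k=2):
        # Crude relevance score: keyword overlap with the query.
        words = set(query.lower().split())
        scored = sorted(
            self.episodes,
            key=lambda e: len(words & set(e.lower().split())),
            reverse=True,
        )
        return scored[:k]

memory = AgentMemory()
memory.store("User prefers metric units")
memory.store("User is based in Berlin")
memory.store("Project deadline is Friday")

# Only the relevant pieces enter the context window for this call.
context_slice = memory.load_relevant("what units should I use?")
```

The memory object outlives any single call; the `context_slice` is what actually spends tokens in the window.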

Strategies for Managing Context

  • Sliding window — drop the oldest messages when the window fills up.
  • Summarization — periodically compress prior turns into a compact summary.
  • RAG / selective retrieval — store long documents externally and retrieve only relevant chunks.
  • Tool result truncation — trim or summarize verbose tool outputs before passing them back to the model.
  • Structured state — maintain agent state as a typed object rather than raw conversation history; serialize only what the model needs.
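
The first strategy above, a sliding window, can be sketched in a few lines. The word-count tokenizer here is a crude stand-in for a real one, and pinning the system prompt while evicting the oldest turns is one common policy among several:

```python
# Sliding-window sketch: drop the oldest non-system messages until the
# history fits the budget. Word count stands in for a real tokenizer.
def count_tokens(message):
    return len(message["content"].split())

def fit_to_budget(messages, budget):
    system, history = messages[0], list(messages[1:])
    while history and count_tokens(system) + sum(map(count_tokens, history)) > budget:
        history.pop(0)  # evict the oldest turn first
    return [system] + history

msgs = [
    {"role": "system", "content": "You are a helpful agent"},
    {"role": "user", "content": "first question about the project plan"},
    {"role": "assistant", "content": "a fairly long answer with many words here"},
    {"role": "user", "content": "follow up question"},
]
trimmed = fit_to_budget(msgs, budget=15)
```

Note that the system prompt is never evicted; dropping it would change the agent's behavior mid-conversation.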

Performance Implications

Inference cost scales roughly linearly with total context tokens (input + output). Latency, particularly time-to-first-token, also increases with context length. For production agents, context management is one of the primary levers for controlling cost and speed.
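
The linear cost relationship can be made concrete with a back-of-the-envelope estimate. The per-token prices below are placeholders, not any provider's actual rates:

```python
# Back-of-the-envelope inference cost; prices are illustrative placeholders.
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000    # $3 per million input tokens
PRICE_PER_OUTPUT_TOKEN = 15.00 / 1_000_000  # $15 per million output tokens

def estimate_cost(input_tokens, output_tokens):
    return (input_tokens * PRICE_PER_INPUT_TOKEN
            + output_tokens * PRICE_PER_OUTPUT_TOKEN)

# Doubling the input roughly doubles the input-side cost.
small = estimate_cost(10_000, 1_000)
large = estimate_cost(20_000, 1_000)
```

Because output tokens are typically priced higher than input tokens, trimming verbose context and capping completion length attack different parts of the bill.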
