Retrieval-Augmented Generation (RAG)
A technique that enhances LLM outputs by retrieving relevant documents or data from an external knowledge source and injecting them into the model's context before generation.
Definition
Retrieval-Augmented Generation (RAG) addresses a core limitation of language models: their knowledge is frozen at training time. RAG dynamically retrieves up-to-date, domain-specific, or private information at inference time and includes it in the prompt, allowing the model to generate grounded responses without retraining.
The RAG Pipeline
- Index — Documents are split into chunks, each chunk is embedded as a vector, and the vectors are stored in a vector database (e.g. Pinecone, Weaviate, pgvector, Chroma).
- Retrieve — A user query is embedded and compared against indexed chunks; the top-k most similar chunks are returned.
- Augment — Retrieved chunks are inserted into the prompt as context alongside the user query.
- Generate — The LLM generates a response grounded in the retrieved context.
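The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the `embed` function is a toy bag-of-words counter standing in for a real embedding model, the in-memory list stands in for a vector database, and `llm` is a hypothetical callable that takes a prompt string and returns text. All function names here are assumptions made for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: term counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Index: store each chunk next to its vector (stand-in for a vector database)."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, query, k=2):
    """Retrieve: embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def augment(query, chunks):
    """Augment: place the retrieved chunks in the prompt alongside the query."""
    context = "\n".join(f"- {c}" for c in chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


def generate(prompt, llm):
    """Generate: hand the augmented prompt to the model (llm is a hypothetical callable)."""
    return llm(prompt)


# Usage with three made-up chunks.
chunks = [
    "The refund window is 30 days from delivery.",
    "Support is available by email on weekdays.",
    "Shipping to Canada takes 5 to 7 business days.",
]
index = build_index(chunks)
question = "How long is the refund window?"
top = retrieve(index, question, k=1)
prompt = augment(question, top)
```

Swapping `embed` for a real embedding model and the list for a vector database changes the quality of retrieval, but the shape of the pipeline stays the same.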
RAG in Agentic Systems
In agent architectures, RAG is typically implemented as a memory tool — the agent calls a retrieval function when it needs to look something up, rather than RAG running automatically on every turn. This gives the agent control over when to retrieve and what to query for, enabling more efficient use of context window space.
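As a rough sketch of retrieval-as-a-tool, the example below lets the agent decide per turn whether to call a retrieval function. Everything named here is hypothetical: `decide` is a toy rule standing in for the model's own tool-use decision, `search_memory` is a keyword lookup standing in for a real retriever, and `Agent` is not taken from any particular framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ToolCall:
    name: str
    query: str


@dataclass
class Agent:
    # decide: stand-in for the LLM choosing whether (and what) to retrieve.
    decide: Callable[[str], Optional[ToolCall]]
    tools: dict[str, Callable[[str], list[str]]]
    context: list[str] = field(default_factory=list)

    def step(self, user_message: str) -> list[str]:
        """Process one turn; retrieval runs only if the agent asks for it."""
        self.context.append(f"user: {user_message}")
        call = self.decide(user_message)
        if call is not None:
            results = self.tools[call.name](call.query)
            self.context.extend(f"retrieved: {r}" for r in results)
        return self.context


# Made-up memory store and a keyword-match retriever.
NOTES = {"refund": "The refund window is 30 days from delivery."}


def search_memory(query: str) -> list[str]:
    return [text for key, text in NOTES.items() if key in query.lower()]


def decide(message: str) -> Optional[ToolCall]:
    # Toy rule: only look something up when the user asks a question.
    if message.strip().endswith("?"):
        return ToolCall(name="search_memory", query=message)
    return None


agent = Agent(decide=decide, tools={"search_memory": search_memory})
agent.step("Hello there")                 # no retrieval, context grows by one line
agent.step("What is the refund policy?")  # retrieval is triggered
```

The greeting adds nothing retrieved to the context, while the question pulls in one note. That difference is the context-window saving described above.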
Limitations
- Retrieval quality depends heavily on chunking strategy and embedding model choice.
- Retrieved context may contradict the model's parametric knowledge (what it learned during training), and it is not always predictable which of the two the model will follow.
- Long retrieved passages consume context window tokens that could otherwise be used for reasoning.
- Plain top-k retrieval handles queries that require synthesis across many documents poorly unless the results are re-ranked.
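One common way to manage the context-window cost noted above is to cap retrieved text at a fixed token budget. The sketch below is an illustration under stated assumptions: `pack_chunks` is a hypothetical helper, the chunks are assumed to arrive already ranked best-first, and tokens are approximated by whitespace-separated words, where a real system would use the model's own tokenizer.

```python
def pack_chunks(ranked_chunks, budget, count_tokens=lambda s: len(s.split())):
    """Keep ranked chunks, best first, while the running total stays within budget.

    Returns the kept chunks and the number of (approximate) tokens used.
    """
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            continue  # too large for the remaining budget; a later, smaller chunk may fit
        kept.append(chunk)
        used += cost
    return kept, used


# Usage: a budget of 7 "tokens" keeps the first and third chunk.
ranked = ["a b c d", "e f g h i j", "k l"]
kept, used = pack_chunks(ranked, budget=7)
```

Skipping an oversized chunk instead of stopping at it is a design choice: it fills the budget more fully, at the cost of sometimes including a lower-ranked chunk in place of a higher-ranked one.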