Retrieval-Augmented Generation (RAG)
Retrieval-augmented generation, or RAG, is an architecture pattern in which a language model's response is informed by external documents retrieved at query time. Instead of relying solely on knowledge baked into the model during training — which may be months or years out of date — a RAG system first searches a knowledge base, retrieves relevant passages, and injects them into the model's prompt as context. The model then generates its answer grounded in those retrieved facts.

The pattern was introduced by researchers at Facebook AI Research (now Meta AI) in 2020 and has since become the dominant approach for building LLM applications that need accurate, up-to-date, or domain-specific information. It is simpler and cheaper than fine-tuning a model on proprietary data, and it lets you update the knowledge base without retraining anything.

A typical RAG pipeline has three stages. First, ingestion: documents are chunked, embedded into vectors, and stored in a vector database. Second, retrieval: when a user asks a question, the query is embedded and matched against the stored vectors to find the most relevant chunks. Third, generation: the retrieved chunks are inserted into the LLM prompt, and the model produces an answer that cites or synthesizes those sources.

For AI product builders, RAG is often the first architecture you reach for when your product needs to answer questions about data the model was not trained on — company docs, product catalogs, legal filings, or live web content. The quality of a RAG system depends heavily on the quality of the retrieval step: poor chunking, weak embeddings, or irrelevant search results produce poor answers regardless of how capable the underlying model is.

Web-connected RAG — where the retrieval step queries the live internet rather than a static document store — is an increasingly common pattern.
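The three stages can be sketched end to end in a few dozen lines. This is a toy illustration, not a production recipe: the term-frequency "embedding" and the fixed-size word chunker below are stand-ins for a real embedding model and chunking strategy, and the assembled prompt would be sent to an LLM rather than used directly.

```python
import math
import re
from collections import Counter

def chunk(text, size=40):
    """Ingestion step 1: split a document into fixed-size word chunks
    (real pipelines usually add overlap and respect sentence boundaries)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Toy bag-of-words 'embedding': a term-frequency Counter.
    A real pipeline would call an embedding model here instead."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, index, k=2):
    """Retrieval: rank stored chunks by similarity to the query embedding."""
    q = embed(query)
    return sorted(index, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, chunks):
    """Generation input: inject the retrieved chunks into the LLM prompt."""
    context = "\n---\n".join(chunks)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

# Ingestion: chunk and index two toy documents (a list stands in for a vector DB).
docs = ["The refund window for hardware purchases is 30 days.",
        "Software subscriptions renew automatically every month."]
index = [c for d in docs for c in chunk(d)]

# Retrieval + prompt assembly; an LLM call would complete the generation stage.
question = "How long do I have to return hardware?"
top = retrieve(question, index, k=1)
prompt = build_prompt(question, top)
```

Swapping the toy pieces for a real embedding model and vector store changes the components but not the shape of the pipeline, which is the point of the sketch.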
AI search APIs like Exa and Tavily exist specifically to serve as the retrieval layer for these web-augmented RAG pipelines, handling the complexity of web search, content extraction, and relevance scoring so your application does not have to.
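From the application's point of view, that retrieval layer only needs to satisfy a small contract: query in, ranked snippets out. The sketch below shows that pluggable shape with a hypothetical search-function signature and an offline stub — it is not Exa's or Tavily's actual client API, which you would consult directly when wiring in a real provider.

```python
from typing import Callable, Dict, List

# Hypothetical retrieval contract: any web search backend that maps a query
# and a result count to a list of {"title", "snippet"} dicts can plug in here.
SearchFn = Callable[[str, int], List[Dict[str, str]]]

def answer_from_web(query: str, search: SearchFn, k: int = 3) -> str:
    """Assemble an LLM prompt from live search results
    (the generation call itself is omitted)."""
    results = search(query, k)
    context = "\n---\n".join(f"{r['title']}: {r['snippet']}" for r in results)
    return f"Using the sources below, answer: {query}\n\n{context}"

def stub_search(query: str, k: int) -> List[Dict[str, str]]:
    """Offline stub standing in for a real AI search API client."""
    return [{"title": "Example source",
             "snippet": f"A snippet relevant to '{query}'."}][:k]

prompt = answer_from_web("What changed in the latest product release?", stub_search)
```

Keeping the search backend behind a narrow interface like this makes it straightforward to swap providers, or to substitute a static vector store, without touching the rest of the pipeline.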