ARTICLE · 17 MIN READ · FEBRUARY 26, 2026
Chapter 14: Knowledge Retrieval (RAG)
LLMs know a lot — but their knowledge stopped the day training ended. RAG is the bridge between static model weights and the live, private, specific knowledge that makes agents actually useful.
The Knowledge Cutoff Problem
Key terms used throughout this chapter:

- **Embedding:** A numerical representation of text as a vector (list of numbers). An embedding converts "What is the capital of France?" into something like [0.23, -0.81, 0.44, ...] with hundreds or thousands of dimensions. The crucial property: texts with similar meanings get similar vectors. "What is the capital of France?" and "Which city leads France?" would have very similar embeddings even though they share only a few words.
- **Vector:** An ordered list of numbers. A 3D vector might be [1.5, -2.3, 0.7]. In embeddings, vectors typically have 768, 1,536, or 3,072 dimensions. The position in this high-dimensional space encodes meaning — similar concepts cluster together, different concepts are far apart.
- **Cosine similarity:** A measure of how similar two vectors are, regardless of their magnitude. Returns a value from -1 (opposite) to 1 (identical). Used in RAG to find which stored document chunks are most similar to the user's query. If the query embedding and a document chunk embedding have high cosine similarity, they're semantically related. (A short code sketch follows these definitions.)
- **Hallucination:** When an LLM generates text that sounds confident and plausible but is factually incorrect. Happens when the model doesn't know the answer but generates something statistically likely based on context. RAG reduces hallucination by giving the model actual facts to work with, rather than relying on pattern completion.
- **Chunking:** Splitting large documents into smaller pieces for embedding and storage. A 50-page document becomes hundreds of 300-500 word chunks. Each chunk is embedded separately so you can retrieve just the relevant section rather than the whole document.
- **Vector database:** A specialized database built to store embeddings and perform fast similarity search across millions of them. Unlike SQL databases that search by exact value matches, vector databases search by semantic similarity — finding the closest vectors to a query vector. Examples: Pinecone, Weaviate, Chroma, Milvus, pgvector.
- **Augmentation:** In RAG, the process of adding retrieved document chunks to the LLM's prompt. The original query is "augmented" with relevant context before being sent to the model. This is why it's called Retrieval-Augmented Generation.
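To ground the vector and cosine-similarity definitions, here is a minimal sketch using numpy; the three-dimensional vectors are toy stand-ins for real embeddings with hundreds or thousands of dimensions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: a·b / (|a| |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" — real ones have 768+ dimensions
query     = np.array([0.23, -0.81, 0.44])
similar   = np.array([0.25, -0.78, 0.40])   # paraphrase of the query
unrelated = np.array([-0.90, 0.10, 0.35])   # different topic

print(cosine_similarity(query, similar))    # ≈ 1.0 → semantically close
print(cosine_similarity(query, unrelated))  # ≈ -0.15 → unrelated
```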
Every LLM in the world has a training cutoff date — the point at which its training data ended. GPT-4’s cutoff is early 2024. Gemini Pro’s is mid-2024. Whatever happened after that date simply doesn’t exist in the model’s knowledge.
But the more important limitation isn’t the cutoff date — it’s what was never in the training data at all:
- Your company’s internal documentation. The model has never seen your employee handbook, your product specs, your customer contracts, or your proprietary research.
- Your customers’ specific situations. The model can’t look up customer #7823’s account history, their past support tickets, or their current subscription plan.
- Real-time data. The model can’t check today’s stock price, last hour’s server logs, or this morning’s news.
An LLM without access to this information can still answer general questions reasonably well — but for the specific, grounded, current answers that make AI agents genuinely useful, it will either guess (hallucinate) or admit ignorance.
Retrieval-Augmented Generation (RAG) is the solution. Instead of asking the LLM to recall knowledge it doesn’t have, RAG gives it a system to look things up — converting a closed-book test into an open-book one.
How RAG Works: The Six Steps
Every RAG system runs the same six-step loop — step 1 happens offline, steps 2-6 at query time (the pipeline walkthrough later in this chapter traces them on a real question):
1. Index — chunk the documents, embed each chunk, and store the vectors in a vector database (done once, offline)
2. Query — the user asks a question
3. Embed — convert the question into a vector with the same embedding model used for indexing
4. Retrieve — find the K stored chunks most similar to the query vector
5. Augment — insert the retrieved chunks into the prompt alongside the question
6. Generate — the LLM synthesizes an answer grounded in the retrieved context
The key insight: the LLM's job is to reason and synthesize, not to memorize facts. Give it the relevant facts (retrieved documents), and it can synthesize an excellent answer. Without those facts, it either hallucinates or refuses to answer.
The Four Core Concepts
Concept 1: Embeddings — How Meaning Becomes Numbers
An embedding model is a neural network trained to convert text into vectors such that semantically similar texts produce geometrically close vectors. This sounds abstract, so let’s ground it.
Imagine a 2D map where every word is a dot. Words that often appear in similar contexts end up near each other. “Doctor” and “physician” are close. “Cat” and “kitten” are close. “Bank” (financial) might be near “loan” and “interest,” while “bank” (river) is near “river” and “shore” — the same word appearing in two different clusters based on how it’s typically used.
Real embedding models work in 768 or 1,536 dimensions rather than 2, allowing for vastly more nuanced representation. But the principle is identical: proximity in vector space = similarity in meaning.
Why does this enable semantic search? If you embed the query “What is our vacation policy?” and it produces a vector that’s close to the chunk “Employees are entitled to 15 days of paid leave annually, plus public holidays…” — even though “vacation” never appears in the chunk — the similarity will still be high because the concepts are related. This is impossible with keyword search.
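The same idea in runnable form — a minimal sketch assuming the open-source sentence-transformers library and its all-MiniLM-L6-v2 model (any embedding model works the same way):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dim vectors

query = "What is our vacation policy?"
chunk = "Employees are entitled to 15 days of paid leave annually."
noise = "The server logs are rotated every 24 hours."

query_vec, chunk_vec, noise_vec = model.encode([query, chunk, noise])

# High similarity despite zero keyword overlap: 'vacation' ≈ 'paid leave'
print(util.cos_sim(query_vec, chunk_vec))
# Low similarity: unrelated concepts
print(util.cos_sim(query_vec, noise_vec))
```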
Concept 2: Chunking — Breaking Documents for Retrieval
You can’t embed a 50-page document as a single unit and hope to retrieve relevant sections. The embedding of the entire document would average across all topics, making similarity search imprecise. Instead, you chunk documents into pieces that are:
- Small enough that each chunk is topically coherent (not mixing multiple subjects)
- Large enough that each chunk contains sufficient context to be useful
Common chunking strategies:
| Strategy | How it works | Best for |
|---|---|---|
| Fixed-size | Split every N characters/words, with overlap | Simple baseline |
| Sentence | Split at sentence boundaries | Conversational content |
| Paragraph | Split at paragraph breaks | Structured documents |
| Semantic | Split when topic changes (detected by embedding distance shift) | Dense, topic-switching documents |
| Recursive | Try paragraph splits, fall back to sentence, then fixed-size | General purpose |
Chunk overlap is a critical parameter. If you split a document at every 500 words with no overlap, a sentence that spans the boundary between chunk 3 and chunk 4 may be split in half — losing meaning in both chunks. Overlapping chunks by 50-100 words ensures boundary context is preserved in adjacent chunks.
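A minimal word-based sketch of fixed-size chunking with overlap (character-based splitters work the same way):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 75) -> list[str]:
    """Split text into chunks of `chunk_size` words; consecutive chunks
    share `overlap` words so sentences spanning a boundary stay intact."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # final chunk reached; don't emit a redundant tail
    return chunks

document = "word " * 1200  # stand-in for a long document
print(len(chunk_text(document)))  # → 3 overlapping chunks
```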
Concept 3: Retrieval Methods
Once your knowledge base is embedded and indexed, how do you find the relevant chunks?
Vector/Semantic Search: Embed the query, find the K nearest vectors in the database by cosine similarity. This finds chunks that are semantically related even without keyword overlap. Fast implementations use approximate nearest neighbor algorithms like HNSW (Hierarchical Navigable Small World) — a graph-based structure that allows sub-linear search across millions of vectors.
BM25 (Best Match 25): A classical information retrieval algorithm that scores documents based on term frequency and inverse document frequency (TF-IDF with improvements). It finds documents that contain the same words as the query. Very precise for exact keyword matches but blind to synonyms.
Hybrid Search: Combines both approaches. BM25 catches exact keyword matches; semantic search catches conceptually related content. A fusion algorithm (like Reciprocal Rank Fusion) combines the two ranked lists. This is the production standard for most enterprise RAG systems.
```python
# Conceptual hybrid search
def hybrid_search(query: str, k: int = 5) -> list:
    # Get semantic candidates (top-20 by embedding similarity)
    semantic_results = vector_db.similarity_search(query, k=20)

    # Get keyword candidates (top-20 by BM25)
    bm25_results = bm25_index.search(query, k=20)

    # Combine using Reciprocal Rank Fusion: each document earns
    # 1 / (60 + rank) per list it appears in
    scores = {}
    for rank, doc in enumerate(semantic_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (60 + rank)
    for rank, doc in enumerate(bm25_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (60 + rank)

    # Return top-K document ids by combined score
    return sorted(scores, key=scores.get, reverse=True)[:k]
```
Concept 4: Vector Databases
A vector database is purpose-built for one operation: given a query vector, find the most similar stored vectors as fast as possible.
Traditional SQL databases compare row by row using exact value matching. They can’t natively compute “which of these 10 million vectors is geometrically closest to this query vector?” — that would require computing a similarity score for every single stored vector, which is too slow.
Vector databases solve this with specialized index structures:
- HNSW (used by Weaviate, Qdrant): builds a hierarchical graph structure that enables navigating toward nearest neighbors without checking every vector
- IVF (Inverted File Index) (used by Faiss): clusters vectors into groups; search checks only the most relevant clusters
- LSH (Locality Sensitive Hashing): maps similar vectors to the same hash buckets for fast filtering
The result: searching 10 million vectors takes milliseconds rather than seconds.
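A scaled-down sketch of both an exact and an IVF index using Meta's FAISS library (assuming the faiss-cpu package, with random vectors standing in for real embeddings):

```python
import numpy as np
import faiss

d = 128  # embedding dimensionality
vectors = np.random.random((100_000, d)).astype("float32")
query = np.random.random((1, d)).astype("float32")

# Exact (brute-force) search: compares the query against every stored vector
flat = faiss.IndexFlatL2(d)
flat.add(vectors)
distances, ids = flat.search(query, 5)

# IVF: cluster vectors into 100 cells; search only the 10 most relevant cells
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 100)
ivf.train(vectors)  # learn the cluster centroids
ivf.add(vectors)
ivf.nprobe = 10     # cells to probe — the accuracy/speed dial
distances, ids = ivf.search(query, 5)  # approximate, far faster at scale
```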
Popular options by category:
| Category | Examples | Best for |
|---|---|---|
| Managed cloud | Pinecone, Weaviate Cloud | Production, no infra management |
| Open source | Chroma, Milvus, Qdrant | Self-hosted, customizable |
| Extensions | pgvector (Postgres), Redis | Adding vector search to existing DB |
| Research libraries | FAISS (Meta), ScaNN (Google) | Maximum performance, embedded use |
The RAG Pipeline: Live Demonstration
Without RAG, the LLM would have to answer from its training data — which contains no information about your company's specific policies. It would either refuse ("I don't have information about your company's policies") or hallucinate a plausible-sounding but invented policy.
User query: "What is our company's policy on reimbursing home office equipment for remote employees?"query_vector = embedding_model.encode("What is our company's policy...")
# → [0.23, -0.81, 0.44, 0.17, ... ] (1,536 numbers)results = vector_db.similarity_search(query_vector, k=5)
# Returns chunks ranked by similarity:
# 1. HR Policy Manual, Section 4.2: "Remote Work Equipment" (similarity: 0.94)
# 2. Employee Benefits FAQ: "Home Office Setup" (similarity: 0.88)
# 3. IT Equipment Request Process (similarity: 0.72)
# 4. Remote Work Agreement Template (similarity: 0.68)
# 5. New Employee Onboarding Guide (similarity: 0.61)chunk_1_text = """Remote Work Equipment Reimbursement (Section 4.2):
Full-time remote employees are eligible for a one-time home office
equipment reimbursement of up to $1,500. This includes monitors,
keyboards, mice, webcams, and ergonomic equipment. Contractors
are not eligible for this benefit. Submit receipts within 30 days
of purchase via the expense portal..."""augmented_prompt = f"""
You are a helpful HR assistant. Answer the user's question based ONLY
on the provided context. If the answer isn't in the context, say so.
Cite the specific document section you're referencing.
CONTEXT:
{retrieved_chunks}
QUESTION: {user_query}
ANSWER:"""LLM Response:
"According to Section 4.2 of the HR Policy Manual, full-time remote
employees are eligible for a one-time home office equipment
reimbursement of up to $1,500, covering monitors, keyboards,
webcams, and ergonomic equipment. Note: contractors are not eligible
for this benefit. Submit receipts within 30 days via the expense portal."GraphRAG: When Relationships Matter
Standard RAG treats each document chunk independently. It retrieves relevant chunks and feeds them to the LLM. This works well for localized questions (“What is our vacation policy?”) but struggles with questions that require synthesizing relationships across multiple documents (“How does our acquisition of Company X affect our healthcare benefits for employees who transfer?”).
GraphRAG replaces the vector database with a knowledge graph — a network of entities (nodes) connected by typed relationships (edges).
When asked “How does the Company X acquisition affect transferred employees’ healthcare?”, GraphRAG:
- Finds the “Company X” node
- Traverses the `acquired_by` edge to “Our Company”
- Traverses the `has_policy` edge to “Benefits Policy v3”
- Traverses the `covers` edge to “Healthcare Benefits”
- Checks the `applies_to` relationship with “Transferred Employees”
- Retrieves the specific healthcare clauses that apply
Standard RAG would struggle because no single document chunk contains this full chain of relationships. GraphRAG navigates the graph to synthesize it.
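A minimal sketch of that traversal over a hand-built networkx graph (a real GraphRAG system extracts these entities and typed relationships automatically from documents):

```python
import networkx as nx

# Toy knowledge graph: entities as nodes, typed relationships as edge data
g = nx.DiGraph()
g.add_edge("Company X", "Our Company", relation="acquired_by")
g.add_edge("Our Company", "Benefits Policy v3", relation="has_policy")
g.add_edge("Benefits Policy v3", "Healthcare Benefits", relation="covers")
g.add_edge("Healthcare Benefits", "Transferred Employees", relation="applies_to")

def follow(node: str, relation: str) -> str:
    """Follow the outgoing edge with the given relation type."""
    for _, target, data in g.out_edges(node, data=True):
        if data["relation"] == relation:
            return target
    raise KeyError(f"no '{relation}' edge from '{node}'")

# The multi-hop chain behind the healthcare question
company  = follow("Company X", "acquired_by")  # → "Our Company"
policy   = follow(company, "has_policy")       # → "Benefits Policy v3"
benefit  = follow(policy, "covers")            # → "Healthcare Benefits"
audience = follow(benefit, "applies_to")       # → "Transferred Employees"
print(f"{policy}: {benefit} applies to {audience}")
```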
GraphRAG trade-offs:
| | Standard RAG | GraphRAG |
|---|---|---|
| Setup cost | Low — just embed documents | Very high — build and maintain the graph |
| Query speed | Fast (vector search) | Slower (graph traversal) |
| Best for | Localized, factual questions | Multi-hop, relationship questions |
| Maintenance | Re-embed changed docs | Update graph schema + relationships |
| Use case | 80% of enterprise RAG needs | Complex financial, scientific, legal analysis |
Agentic RAG: When Retrieval Needs Intelligence
Standard RAG has a fundamental architectural limitation: it retrieves passively. Query comes in → retrieve K chunks → send to LLM. There’s no reasoning about whether the retrieved chunks are good, complete, or contradictory.
Agentic RAG adds a reasoning agent that acts as a critical evaluator and orchestrator of the retrieval process.
The four capabilities Agentic RAG adds over standard RAG, sketched in code after the list:
1. Source validation and date-awareness. A standard RAG retrieves the highest-similarity chunks regardless of whether they’re a 2020 blog post or the 2025 official policy. An agent reads document metadata and prioritizes authoritative, current sources.
2. Conflict resolution. When two retrieved chunks disagree (initial proposal says €50K, final report says €65K), an agent reasons about which source is more authoritative rather than presenting both to the LLM and hoping it figures it out.
3. Multi-step decomposition. Complex questions that require combining information from multiple sources are decomposed into sub-queries, each retrieved separately, then synthesized.
4. Knowledge gap detection and tool activation. When the knowledge base doesn’t contain the answer (the internal knowledge base is updated weekly but the user asks about something that happened yesterday), the agent recognizes the gap and activates external tools (web search, live APIs) rather than producing an answer based on stale data.
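What this looks like in code — a conceptual sketch in the style of the hybrid search example above, where vector_db, web_search, llm, and the helper functions are all hypothetical stand-ins rather than any specific framework's API:

```python
# Conceptual Agentic RAG loop (all helpers are hypothetical)
def agentic_retrieve(query: str, k: int = 5) -> list:
    chunks = vector_db.similarity_search(query, k=k)

    # 1. Source validation: drop stale or unofficial documents
    chunks = [c for c in chunks if not is_stale(c.metadata)]

    # 2. Conflict resolution: prefer the more authoritative source
    if has_conflicts(chunks):
        chunks = resolve_by_authority(chunks)  # e.g. final report over draft

    # 4. Gap detection: fall back to live tools when the KB can't answer
    if not chunks or not covers_question(query, chunks):
        chunks += web_search(query, k=k)

    return chunks

def answer(question: str) -> str:
    # 3. Multi-step decomposition: split the question into sub-queries,
    # retrieve evidence for each, then synthesize a single answer
    sub_queries = decompose(question)
    evidence = [c for q in sub_queries for c in agentic_retrieve(q)]
    return llm.synthesize(question, evidence)
```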
Hands-On Code: Three Implementations
Implementation 1: Google Search as RAG (Simplest)
```python
from google.adk.agents import Agent
from google.adk.tools import google_search

search_agent = Agent(
    name="research_assistant",
    model="gemini-2.0-flash-exp",
    instruction="""You help users research topics accurately.
    When asked about current events, company information, or factual topics,
    use the Google Search tool to retrieve current information.
    Always cite the sources you found.""",
    tools=[google_search],
)
```
Why is Google Search a form of RAG? It implements the core RAG pattern: retrieve relevant documents (web search results) → augment the prompt → generate response. The difference from a custom RAG system is that Google Search retrieves from the public web rather than a private knowledge base. Use this for public information that changes frequently (news, current events, general facts). Use a custom vector database RAG for private, proprietary, or domain-specific knowledge.
Implementation 2: Vertex AI RAG Corpus
```python
from google.adk.memory import VertexAiRagMemoryService

# Resource identifier for your Vertex AI RAG Corpus
RAG_CORPUS_RESOURCE_NAME = (
    "projects/your-gcp-project-id"
    "/locations/us-central1"
    "/ragCorpora/your-corpus-id"
)

memory_service = VertexAiRagMemoryService(
    rag_corpus=RAG_CORPUS_RESOURCE_NAME,
    similarity_top_k=5,             # retrieve the 5 most similar chunks
    vector_distance_threshold=0.7,  # only chunks with similarity >= 0.7
)
```
`similarity_top_k=5`: Retrieve the 5 chunks with the highest similarity to the query. More chunks means more context for the LLM, but also more tokens consumed and more risk of irrelevant information diluting relevant content. 3-7 is a typical range.
`vector_distance_threshold=0.7`: Only return chunks with cosine similarity of at least 0.7 to the query. If no chunks meet this threshold, return nothing rather than returning irrelevant chunks. Without this threshold, RAG always returns K chunks even if none are relevant — which can cause the LLM to make up answers based on unrelated context (a subtle but common failure mode).
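The guard this threshold implements, as a conceptual sketch (search and generate are hypothetical stand-ins for the retrieval and generation calls):

```python
# Conceptual threshold guard (hypothetical helpers)
results = search(query, k=5)  # [(chunk, similarity), ...]
relevant = [(chunk, sim) for chunk, sim in results if sim >= 0.7]

if not relevant:
    # Admit ignorance instead of letting the LLM improvise from noise
    answer = "I don't have information on this."
else:
    answer = generate(query, [chunk for chunk, _ in relevant])
```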
What is a Vertex AI RAG Corpus? A managed service that handles the full RAG pipeline infrastructure: document ingestion, chunking, embedding, and storage in Google’s vector search infrastructure. You upload documents; the service handles everything else. The `VertexAiRagMemoryService` in ADK connects your agent to this corpus for retrieval.
Implementation 3: Custom LangChain + LangGraph RAG Pipeline
```python
from typing import List, TypedDict

import weaviate
from weaviate.embedded import EmbeddedOptions

from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Weaviate
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langgraph.graph import StateGraph, END
```
Why LangGraph instead of a simple chain? A simple chain runs retrieval → generation sequentially with no flexibility. LangGraph defines the pipeline as a graph where each step is a node and transitions are explicit edges. This enables: conditional routing (if retrieval finds no relevant documents, take a different path), loops (retry with a refined query if the first retrieval was insufficient), and parallel branches (retrieve from multiple sources simultaneously). For production RAG, you almost always need this flexibility.
```python
# Step 1: Prepare the knowledge base (done once)
loader = TextLoader("./company_docs.txt")
documents = loader.load()

text_splitter = CharacterTextSplitter(
    chunk_size=500,    # characters per chunk
    chunk_overlap=50,  # overlap prevents losing context at boundaries
)
chunks = text_splitter.split_documents(documents)
```
`chunk_size=500` and `chunk_overlap=50`: These are critical tuning parameters. Smaller chunks (200-300 chars) give more precise retrieval but lose context. Larger chunks (1,000+ chars) preserve context but make retrieval less precise because each chunk covers multiple topics. The overlap of 50 characters ensures that a sentence split across chunk boundaries appears in both chunks — preventing the common failure where the most important sentence in a document is the first/last line of a chunk and loses its context.
```python
# Step 2: Embed and store in Weaviate
client = weaviate.Client(embedded_options=EmbeddedOptions())

vectorstore = Weaviate.from_documents(
    client=client,
    documents=chunks,
    embedding=OpenAIEmbeddings(),
    by_text=False,
)

retriever = vectorstore.as_retriever()
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
```
`EmbeddedOptions()`: Runs Weaviate in-process (embedded mode) without needing a separate Weaviate server. Perfect for development and testing. For production, point to a Weaviate instance running as a separate service.
`by_text=False`: Use pre-computed embeddings rather than Weaviate’s built-in vectorization. This gives you explicit control over which embedding model is used, ensuring consistency between embedding during indexing and embedding at query time. Always use the same embedding model for both — mixing models produces meaningless similarity scores.
```python
# Step 3: Define graph state
class RAGGraphState(TypedDict):
    question: str
    documents: List[Document]
    generation: str
```
Why a TypedDict for state? LangGraph passes state between nodes as a dictionary. Using `TypedDict` documents the expected keys and their types — it’s not strictly required but makes the graph’s data flow explicit and enables type checking. Each node function receives the full state dict and returns a partial update (only the keys it modifies).
```python
# Step 4: Define nodes
def retrieve_documents_node(state: RAGGraphState) -> RAGGraphState:
    """Retrieves relevant document chunks for the user's question."""
    documents = retriever.invoke(state["question"])
    return {"documents": documents, "question": state["question"], "generation": ""}

def generate_response_node(state: RAGGraphState) -> RAGGraphState:
    """Generates an answer using retrieved context."""
    template = """Answer the following question based ONLY on the provided context.
If you cannot answer from the context, say "I don't have enough information."

Context: {context}

Question: {question}

Answer:"""
    context = "\n\n".join([doc.page_content for doc in state["documents"]])
    prompt = ChatPromptTemplate.from_template(template)
    chain = prompt | llm | StrOutputParser()
    generation = chain.invoke({"context": context, "question": state["question"]})
    return {"question": state["question"], "documents": state["documents"], "generation": generation}
```
“Answer ONLY from the context”: This instruction is critical for RAG correctness. Without it, the LLM might supplement retrieved context with its own training knowledge, producing a mixed answer that’s hard to attribute or verify. The strict instruction forces the LLM to either use the retrieved evidence or admit it doesn’t know — making the system more honest and auditable.
```python
# Step 5: Build and compile the graph
workflow = StateGraph(RAGGraphState)
workflow.add_node("retrieve", retrieve_documents_node)
workflow.add_node("generate", generate_response_node)
workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "generate")
workflow.add_edge("generate", END)
app = workflow.compile()

# Step 6: Run queries
result = app.invoke({"question": "What is our company's vacation policy?"})
print(result["generation"])
```
Why `app.invoke()` instead of calling the functions directly? `invoke()` runs the full LangGraph state machine — it manages state initialization, passes state between nodes, handles the edge routing, and returns the final state. Calling the node functions directly would bypass the graph infrastructure, losing LangGraph’s benefits (streaming, checkpointing, conditional edges).
RAG’s Challenges: What Can Go Wrong
Challenge 1: Multi-document synthesis. If the answer to a question spans 5 different documents, and retrieval only returns 3 of them, the LLM will produce an incomplete answer — worse, it may produce a confident but incomplete one without signaling that information is missing. Mitigation: increase k, use multi-query retrieval (generate multiple variations of the query and merge results), or use Agentic RAG to detect gaps.
Challenge 2: Chunk boundary problems. A crucial sentence that spans two chunk boundaries may appear incomplete in both chunks. Mitigation: chunk overlap, recursive chunking, or sentence-aware splitting.
Challenge 3: Retrieval noise. If irrelevant chunks are retrieved (false positives), they pollute the LLM’s context with confusing or contradictory information. The LLM may hallucinate to reconcile the conflict. Mitigation: use a distance threshold, re-rank retrieved chunks using a cross-encoder model before sending to LLM.
Challenge 4: Index staleness. Your knowledge base is outdated. A policy changed last week but the vector store hasn’t been re-indexed. The LLM confidently answers from the old policy. Mitigation: incremental indexing pipelines, document versioning, freshness metadata, Agentic RAG with web search fallback for time-sensitive queries.
Challenge 5: Query-document vocabulary mismatch. The user asks about “vacation days” but your policy document uses “annual leave.” Pure semantic search helps but isn’t perfect. Mitigation: hybrid search (BM25 + semantic), query expansion (generate synonyms and related terms before retrieval).
Challenge 6: Context window limits. Retrieving 10 long chunks may exceed the LLM’s context window. Mitigation: summarize chunks before passing to LLM, reduce k, use a model with a larger context window.
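Several of these mitigations share one mechanism — issue multiple query variants and merge the ranked results. A conceptual sketch of multi-query retrieval (reusing the RRF idea from the hybrid search example; expand_query is a hypothetical LLM call that generates paraphrases):

```python
# Conceptual multi-query retrieval (mitigates Challenges 1 and 5)
def multi_query_search(query: str, k: int = 5) -> list:
    # e.g. "vacation days" → ["annual leave", "paid time off", ...]
    variants = [query] + expand_query(query, n=3)

    scores = {}
    for variant in variants:
        for rank, doc in enumerate(vector_db.similarity_search(variant, k=20)):
            # Reciprocal Rank Fusion across the per-variant result lists
            scores[doc.id] = scores.get(doc.id, 0) + 1 / (60 + rank)

    return sorted(scores, key=scores.get, reverse=True)[:k]
```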
Four Practical Applications
Enterprise Knowledge Base Q&A
HR chatbots answering policy questions, IT helpdesks troubleshooting from documentation, legal teams searching contracts for specific clauses — all powered by RAG over internal document stores.
HR · IT · Legal · Operations
Customer Support Automation
Agents that answer product-specific questions from support tickets, FAQs, and product manuals. Unlike a fine-tuned model, RAG-based support agents can be updated by uploading new documentation without retraining.
SaaS · E-commerce · Telecom
News and Research Summarization
LLMs connected to live news feeds can summarize recent developments on any topic. The RAG system retrieves today's articles and the LLM synthesizes them — combining recency with synthesis capability.
Media · Finance · Research
Personalized Recommendations
Instead of keyword matching, RAG retrieves products/content that are semantically aligned with a user's expressed preferences, past behavior, and current context — enabling genuine semantic recommendations.
E-commerce · Content · EdTech
At a Glance
**What it is:** A pattern that connects LLMs to external knowledge bases, enabling them to retrieve relevant information at query time rather than relying solely on training data. Converts an LLM from a closed-book test to an open-book one.
**Why it matters:** LLM training data is static, doesn't include private knowledge, and has a cutoff date. RAG provides access to current, specific, and proprietary information — grounding outputs in verifiable facts and dramatically reducing hallucination.
**When to use it:** Use RAG whenever the LLM needs to answer from specific, private, or current information not in its training data. Start with simple vector search RAG; add hybrid search for better precision; add agentic reasoning for complex multi-source synthesis.
Key Takeaways
- **RAG is a closed-book to open-book transformation.** The LLM stops guessing from training data and starts reading from actual documents. This is the single most effective intervention for reducing hallucination and increasing factual accuracy.
- **Embeddings encode meaning, not words.** The same semantic concept expressed in different words produces similar vectors. This enables finding relevant content even when the user’s phrasing doesn’t match the document’s vocabulary.
- **Chunking strategy matters as much as retrieval strategy.** Poorly chunked documents break context at wrong boundaries, produce retrieval noise, and lose the information the user needs. Use sentence/paragraph-aware chunking with overlap.
- **Hybrid search outperforms either BM25 or semantic search alone.** BM25 catches exact keyword matches; semantic search catches conceptual relevance. Production systems use both, combined with Reciprocal Rank Fusion.
- **The `vector_distance_threshold` prevents confident wrong answers.** Without a minimum similarity threshold, RAG always returns K chunks even when none are relevant — causing the LLM to generate answers from unrelated context. Better to say “I don’t have information on this” than to confidently answer from the wrong context.
- **GraphRAG handles relationship-heavy queries; standard RAG handles localized queries.** If your questions are mostly “what does this policy say?”, standard RAG is sufficient. If they’re “how does X relate to Y across our organization?”, GraphRAG is worth the added complexity.
- **Agentic RAG adds source validation, conflict resolution, and gap detection.** For high-stakes RAG applications (legal, medical, financial), the agent layer ensures retrieved context is current, authoritative, and complete before being passed to the LLM.
- **Vertex AI RAG Corpus handles infrastructure; LangGraph handles pipeline logic.** Use managed services for the embedding/storage infrastructure. Use graph-based orchestration (LangGraph) for complex multi-step retrieval pipelines with conditional logic.
This concludes the 14-chapter Agentic AI series. From the simplest prompt chain (Chapter 1) to the most sophisticated retrieval-augmented generation system (Chapter 14), you now have the complete map of how intelligent AI agents are built, deployed, and made reliable in the real world.