ARTICLE · 17 MIN READ · FEBRUARY 26, 2026
Chapter 14: Knowledge Retrieval (RAG)
LLMs know a lot — but their knowledge stopped the day training ended. RAG is the bridge between static model weights and the live, private, specific knowledge that makes agents actually useful.
The Knowledge Cutoff Problem
Key terms used throughout this chapter:

- **Embedding:** A numerical representation of text as a vector (list of numbers). An embedding converts "What is the capital of France?" into something like [0.23, -0.81, 0.44, ...] with hundreds or thousands of dimensions. The crucial property: texts with similar meanings get similar vectors. "What is the capital of France?" and "Which city leads France?" would have very similar embeddings even though they share only a few words.
- **Vector:** An ordered list of numbers. A 3D vector might be [1.5, -2.3, 0.7]. In embeddings, vectors typically have 768, 1,536, or 3,072 dimensions. The position in this high-dimensional space encodes meaning — similar concepts cluster together, different concepts are far apart.
- **Cosine similarity:** A measure of how similar two vectors are, regardless of their magnitude. Returns a value from -1 (opposite) to 1 (identical). Used in RAG to find which stored document chunks are most similar to the user's query. If the query embedding and a document chunk embedding have high cosine similarity, they're semantically related. (A short code sketch follows these definitions.)
- **Hallucination:** When an LLM generates text that sounds confident and plausible but is factually incorrect. Happens when the model doesn't know the answer but generates something statistically likely based on context. RAG reduces hallucination by giving the model actual facts to work with, rather than relying on pattern completion.
- **Chunking:** Splitting large documents into smaller pieces for embedding and storage. A 50-page document becomes hundreds of 300-500 word chunks. Each chunk is embedded separately so you can retrieve just the relevant section rather than the whole document.
- **Vector database:** A specialized database built to store embeddings and perform fast similarity search across millions of them. Unlike SQL databases that search by exact value matches, vector databases search by semantic similarity — finding the closest vectors to a query vector. Examples: Pinecone, Weaviate, Chroma, Milvus, pgvector.
- **Augmentation:** In RAG, the process of adding retrieved document chunks to the LLM's prompt. The original query is "augmented" with relevant context before being sent to the model. This is why it's called Retrieval-Augmented Generation.
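To ground the vector and cosine-similarity definitions, here is a minimal sketch using numpy; the three-dimensional vectors are toy stand-ins for real embeddings with hundreds or thousands of dimensions:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: a·b / (|a| |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" — real ones have 768+ dimensions
query     = np.array([0.23, -0.81, 0.44])
similar   = np.array([0.25, -0.78, 0.40])   # paraphrase of the query
unrelated = np.array([-0.90, 0.10, 0.35])   # different topic

print(cosine_similarity(query, similar))    # ≈ 1.0 → semantically close
print(cosine_similarity(query, unrelated))  # ≈ -0.15 → unrelated
```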
Every LLM in the world has a training cutoff date — the point at which its training data ended. GPT-4’s cutoff is early 2024. Gemini Pro’s is mid-2024. Whatever happened after that date simply doesn’t exist in the model’s knowledge.
But the more important limitation isn’t the cutoff date — it’s what was never in the training data at all:
- Your company’s internal documentation. The model has never seen your employee handbook, your product specs, your customer contracts, or your proprietary research.
- Your customers’ specific situations. The model can’t look up customer #7823’s account history, their past support tickets, or their current subscription plan.
- Real-time data. The model can’t check today’s stock price, last hour’s server logs, or this morning’s news.
An LLM without access to this information can still answer general questions reasonably well — but for the specific, grounded, current answers that make AI agents genuinely useful, it will either guess (hallucinate) or admit ignorance.
Retrieval-Augmented Generation (RAG) is the solution. Instead of asking the LLM to recall knowledge it doesn’t have, RAG gives it a system to look things up — converting a closed-book test into an open-book one.
How RAG Works: The Six Steps
Every RAG system runs the same six-step loop — step 1 happens offline, steps 2-6 at query time (the pipeline walkthrough later in this chapter traces them on a real question):
1. Index — chunk the documents, embed each chunk, and store the vectors in a vector database (done once, offline)
2. Query — the user asks a question
3. Embed — convert the question into a vector with the same embedding model used for indexing
4. Retrieve — find the K stored chunks most similar to the query vector
5. Augment — insert the retrieved chunks into the prompt alongside the question
6. Generate — the LLM synthesizes an answer grounded in the retrieved context
The key insight: the LLM's job is to reason and synthesize, not to memorize facts. Give it the relevant facts (retrieved documents), and it can synthesize an excellent answer. Without those facts, it either hallucinates or refuses to answer.
The Four Core Concepts
Concept 1: Embeddings — How Meaning Becomes Numbers
An embedding model is a neural network trained to convert text into vectors such that semantically similar texts produce geometrically close vectors. This sounds abstract, so let’s ground it.
Imagine a 2D map where every word is a dot. Words that often appear in similar contexts end up near each other. “Doctor” and “physician” are close. “Cat” and “kitten” are close. “Bank” (financial) might be near “loan” and “interest,” while “bank” (river) is near “river” and “shore” — the same word appearing in two different clusters based on how it’s typically used.
Real embedding models work in 768 or 1,536 dimensions rather than 2, allowing for vastly more nuanced representation. But the principle is identical: proximity in vector space = similarity in meaning.
Why does this enable semantic search? If you embed the query “What is our vacation policy?” and it produces a vector that’s close to the chunk “Employees are entitled to 15 days of paid leave annually, plus public holidays…” — even though “vacation” never appears in the chunk — the similarity will still be high because the concepts are related. This is impossible with keyword search.
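The same idea in runnable form — a minimal sketch assuming the open-source sentence-transformers library and its all-MiniLM-L6-v2 model (any embedding model works the same way):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # produces 384-dim vectors

query = "What is our vacation policy?"
chunk = "Employees are entitled to 15 days of paid leave annually."
noise = "The server logs are rotated every 24 hours."

query_vec, chunk_vec, noise_vec = model.encode([query, chunk, noise])

# High similarity despite zero keyword overlap: 'vacation' ≈ 'paid leave'
print(util.cos_sim(query_vec, chunk_vec))
# Low similarity: unrelated concepts
print(util.cos_sim(query_vec, noise_vec))
```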
Concept 2: Chunking — Breaking Documents for Retrieval
You can’t embed a 50-page document as a single unit and hope to retrieve relevant sections. The embedding of the entire document would average across all topics, making similarity search imprecise. Instead, you chunk documents into pieces that are:
- Small enough that each chunk is topically coherent (not mixing multiple subjects)
- Large enough that each chunk contains sufficient context to be useful
Common chunking strategies:
| Strategy | How it works | Best for |
|---|---|---|
| Fixed-size | Split every N characters/words, with overlap | Simple baseline |
| Sentence | Split at sentence boundaries | Conversational content |
| Paragraph | Split at paragraph breaks | Structured documents |
| Semantic | Split when topic changes (detected by embedding distance shift) | Dense, topic-switching documents |
| Recursive | Try paragraph splits, fall back to sentence, then fixed-size | General purpose |
Chunk overlap is a critical parameter. If you split a document at every 500 words with no overlap, a sentence that spans the boundary between chunk 3 and chunk 4 may be split in half — losing meaning in both chunks. Overlapping chunks by 50-100 words ensures boundary context is preserved in adjacent chunks.
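A minimal word-based sketch of fixed-size chunking with overlap (character-based splitters work the same way):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 75) -> list[str]:
    """Split text into chunks of `chunk_size` words; consecutive chunks
    share `overlap` words so sentences spanning a boundary stay intact."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # final chunk reached; don't emit a redundant tail
    return chunks

document = "word " * 1200  # stand-in for a long document
print(len(chunk_text(document)))  # → 3 overlapping chunks
```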
Concept 3: Retrieval Methods
Once your knowledge base is embedded and indexed, how do you find the relevant chunks?
Vector/Semantic Search: Embed the query, find the K nearest vectors in the database by cosine similarity. This finds chunks that are semantically related even without keyword overlap. Fast implementations use approximate nearest neighbor algorithms like HNSW (Hierarchical Navigable Small World) — a graph-based structure that allows sub-linear search across millions of vectors.
BM25 (Best Match 25): A classical information retrieval algorithm that scores documents based on term frequency and inverse document frequency (TF-IDF with improvements). It finds documents that contain the same words as the query. Very precise for exact keyword matches but blind to synonyms.
Hybrid Search: Combines both approaches. BM25 catches exact keyword matches; semantic search catches conceptually related content. A fusion algorithm (like Reciprocal Rank Fusion) combines the two ranked lists. This is the production standard for most enterprise RAG systems.
```python
# Conceptual hybrid search
def hybrid_search(query: str, k: int = 5) -> list:
    # Get semantic candidates (top-20 by embedding similarity)
    semantic_results = vector_db.similarity_search(query, k=20)

    # Get keyword candidates (top-20 by BM25)
    bm25_results = bm25_index.search(query, k=20)

    # Combine using Reciprocal Rank Fusion: each document earns
    # 1 / (60 + rank) per list it appears in
    scores = {}
    for rank, doc in enumerate(semantic_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (60 + rank)
    for rank, doc in enumerate(bm25_results):
        scores[doc.id] = scores.get(doc.id, 0) + 1 / (60 + rank)

    # Return top-K document ids by combined score
    return sorted(scores, key=scores.get, reverse=True)[:k]
```
Concept 4: Vector Databases
A vector database is purpose-built for one operation: given a query vector, find the most similar stored vectors as fast as possible.
Traditional SQL databases compare row by row using exact value matching. They can’t natively compute “which of these 10 million vectors is geometrically closest to this query vector?” — that would require computing a similarity score for every single stored vector, which is too slow.
Vector databases solve this with specialized index structures:
- HNSW (used by Weaviate, Qdrant): builds a hierarchical graph structure that enables navigating toward nearest neighbors without checking every vector
- IVF (Inverted File Index) (used by Faiss): clusters vectors into groups; search checks only the most relevant clusters
- LSH (Locality Sensitive Hashing): maps similar vectors to the same hash buckets for fast filtering
The result: searching 10 million vectors takes milliseconds rather than seconds.
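A scaled-down sketch of both an exact and an IVF index using Meta's FAISS library (assuming the faiss-cpu package, with random vectors standing in for real embeddings):

```python
import numpy as np
import faiss

d = 128  # embedding dimensionality
vectors = np.random.random((100_000, d)).astype("float32")
query = np.random.random((1, d)).astype("float32")

# Exact (brute-force) search: compares the query against every stored vector
flat = faiss.IndexFlatL2(d)
flat.add(vectors)
distances, ids = flat.search(query, 5)

# IVF: cluster vectors into 100 cells; search only the 10 most relevant cells
quantizer = faiss.IndexFlatL2(d)
ivf = faiss.IndexIVFFlat(quantizer, d, 100)
ivf.train(vectors)  # learn the cluster centroids
ivf.add(vectors)
ivf.nprobe = 10     # cells to probe — the accuracy/speed dial
distances, ids = ivf.search(query, 5)  # approximate, far faster at scale
```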
Popular options by category:
| Category | Examples | Best for |
|---|---|---|
| Managed cloud | Pinecone, Weaviate Cloud | Production, no infra management |
| Open source | Chroma, Milvus, Qdrant | Self-hosted, customizable |
| Extensions | pgvector (Postgres), Redis | Adding vector search to existing DB |
| Research libraries | FAISS (Meta), ScaNN (Google) | Maximum performance, embedded use |
The RAG Pipeline: Live Demonstration
Without RAG, the LLM would have to answer from its training data — which contains no information about your company's specific policies. It would either refuse ("I don't have information about your company's policies") or hallucinate a plausible-sounding but invented policy.
User query: "What is our company's policy on reimbursing home office equipment for remote employees?"query_vector = embedding_model.encode("What is our company's policy...")
# → [0.23, -0.81, 0.44, 0.17, ... ] (1,536 numbers)results = vector_db.similarity_search(query_vector, k=5)
# Returns chunks ranked by similarity:
# 1. HR Policy Manual, Section 4.2: "Remote Work Equipment" (similarity: 0.94)
# 2. Employee Benefits FAQ: "Home Office Setup" (similarity: 0.88)
# 3. IT Equipment Request Process (similarity: 0.72)
# 4. Remote Work Agreement Template (similarity: 0.68)
# 5. New Employee Onboarding Guide (similarity: 0.61)chunk_1_text = """Remote Work Equipment Reimbursement (Section 4.2):
Full-time remote employees are eligible for a one-time home office
equipment reimbursement of up to $1,500. This includes monitors,
keyboards, mice, webcams, and ergonomic equipment. Contractors
are not eligible for this benefit. Submit receipts within 30 days
of purchase via the expense portal..."""augmented_prompt = f"""
You are a helpful HR assistant. Answer the user's question based ONLY
on the provided context. If the answer isn't in the context, say so.
Cite the specific document section you're referencing.
CONTEXT:
{retrieved_chunks}
QUESTION: {user_query}
ANSWER:"""LLM Response:
"According to Section 4.2 of the HR Policy Manual, full-time remote
employees are eligible for a one-time home office equipment
reimbursement of up to $1,500, covering monitors, keyboards,
webcams, and ergonomic equipment. Note: contractors are not eligible
for this benefit. Submit receipts within 30 days via the expense portal."GraphRAG: When Relationships Matter
Standard RAG treats each document chunk independently. It retrieves relevant chunks and feeds them to the LLM. This works well for localized questions (“What is our vacation policy?”) but struggles with questions that require synthesizing relationships across multiple documents (“How does our acquisition of Company X affect our healthcare benefits for employees who transfer?”).
GraphRAG replaces the vector database with a knowledge graph — a network of entities (nodes) connected by typed relationships (edges).
When asked “How does the Company X acquisition affect transferred employees’ healthcare?”, GraphRAG:
- Finds the “Company X” node
- Traverses the `acquired_by` edge to “Our Company”
- Traverses the `has_policy` edge to “Benefits Policy v3”
- Traverses the `covers` edge to “Healthcare Benefits”
- Checks the `applies_to` relationship with “Transferred Employees”
- Retrieves the specific healthcare clauses that apply
Standard RAG would struggle because no single document chunk contains this full chain of relationships. GraphRAG navigates the graph to synthesize it.
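A minimal sketch of that traversal over a hand-built networkx graph (a real GraphRAG system extracts these entities and typed relationships automatically from documents):

```python
import networkx as nx

# Toy knowledge graph: entities as nodes, typed relationships as edge data
g = nx.DiGraph()
g.add_edge("Company X", "Our Company", relation="acquired_by")
g.add_edge("Our Company", "Benefits Policy v3", relation="has_policy")
g.add_edge("Benefits Policy v3", "Healthcare Benefits", relation="covers")
g.add_edge("Healthcare Benefits", "Transferred Employees", relation="applies_to")

def follow(node: str, relation: str) -> str:
    """Follow the outgoing edge with the given relation type."""
    for _, target, data in g.out_edges(node, data=True):
        if data["relation"] == relation:
            return target
    raise KeyError(f"no '{relation}' edge from '{node}'")

# The multi-hop chain behind the healthcare question
company  = follow("Company X", "acquired_by")  # → "Our Company"
policy   = follow(company, "has_policy")       # → "Benefits Policy v3"
benefit  = follow(policy, "covers")            # → "Healthcare Benefits"
audience = follow(benefit, "applies_to")       # → "Transferred Employees"
print(f"{policy}: {benefit} applies to {audience}")
```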
GraphRAG trade-offs:
| | Standard RAG | GraphRAG |
|---|---|---|
| Setup cost | Low — just embed documents | Very high — build and maintain the graph |
| Query speed | Fast (vector search) | Slower (graph traversal) |
| Best for | Localized, factual questions | Multi-hop, relationship questions |
| Maintenance | Re-embed changed docs | Update graph schema + relationships |
| Use case | 80% of enterprise RAG needs | Complex financial, scientific, legal analysis |
Agentic RAG: When Retrieval Needs Intelligence
Standard RAG has a fundamental architectural limitation: it retrieves passively. Query comes in → retrieve K chunks → send to LLM. There’s no reasoning about whether the retrieved chunks are good, complete, or contradictory.
Agentic RAG adds a reasoning agent that acts as a critical evaluator and orchestrator of the retrieval process.
The four capabilities Agentic RAG adds over standard RAG, sketched in code after the list:
1. Source validation and date-awareness. A standard RAG retrieves the highest-similarity chunks regardless of whether they’re a 2020 blog post or the 2025 official policy. An agent reads document metadata and prioritizes authoritative, current sources.
2. Conflict resolution. When two retrieved chunks disagree (initial proposal says €50K, final report says €65K), an agent reasons about which source is more authoritative rather than presenting both to the LLM and hoping it figures it out.
3. Multi-step decomposition. Complex questions that require combining information from multiple sources are decomposed into sub-queries, each retrieved separately, then synthesized.
4. Knowledge gap detection and tool activation. When the knowledge base doesn’t contain the answer (the internal knowledge base is updated weekly but the user asks about something that happened yesterday), the agent recognizes the gap and activates external tools (web search, live APIs) rather than producing an answer based on stale data.
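What this looks like in code — a conceptual sketch in the style of the hybrid search example above, where vector_db, web_search, llm, and the helper functions are all hypothetical stand-ins rather than any specific framework's API:

```python
# Conceptual Agentic RAG loop (all helpers are hypothetical)
def agentic_retrieve(query: str, k: int = 5) -> list:
    chunks = vector_db.similarity_search(query, k=k)

    # 1. Source validation: drop stale or unofficial documents
    chunks = [c for c in chunks if not is_stale(c.metadata)]

    # 2. Conflict resolution: prefer the more authoritative source
    if has_conflicts(chunks):
        chunks = resolve_by_authority(chunks)  # e.g. final report over draft

    # 4. Gap detection: fall back to live tools when the KB can't answer
    if not chunks or not covers_question(query, chunks):
        chunks += web_search(query, k=k)

    return chunks

def answer(question: str) -> str:
    # 3. Multi-step decomposition: split the question into sub-queries,
    # retrieve evidence for each, then synthesize a single answer
    sub_queries = decompose(question)
    evidence = [c for q in sub_queries for c in agentic_retrieve(q)]
    return llm.synthesize(question, evidence)
```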
Hands-On Code: Three Implementations
Implementation 1: Google Search as RAG (Simplest)
```python
from google.adk.agents import Agent
from google.adk.tools import google_search

search_agent = Agent(
    name="research_assistant",
    model="gemini-2.0-flash-exp",
    instruction="""You help users research topics accurately.
    When asked about current events, company information, or factual topics,
    use the Google Search tool to retrieve current information.
    Always cite the sources you found.""",
    tools=[google_search],
)
```
Why is Google Search a form of RAG? It implements the core RAG pattern: retrieve relevant documents (web search results) → augment the prompt → generate response. The difference from a custom RAG system is that Google Search retrieves from the public web rather than a private knowledge base. Use this for public information that changes frequently (news, current events, general facts). Use a custom vector database RAG for private, proprietary, or domain-specific knowledge.
Implementation 2: Vertex AI RAG Corpus
```python
from google.adk.memory import VertexAiRagMemoryService

# Resource identifier for your Vertex AI RAG Corpus
RAG_CORPUS_RESOURCE_NAME = (
    "projects/your-gcp-project-id"
    "/locations/us-central1"
    "/ragCorpora/your-corpus-id"
)

memory_service = VertexAiRagMemoryService(
    rag_corpus=RAG_CORPUS_RESOURCE_NAME,
    similarity_top_k=5,             # retrieve the 5 most similar chunks
    vector_distance_threshold=0.7,  # only chunks with similarity >= 0.7
)
```
`similarity_top_k=5`: Retrieve the 5 chunks with the highest similarity to the query. More chunks means more context for the LLM, but also more tokens consumed and more risk of irrelevant information diluting relevant content. 3-7 is a typical range.
`vector_distance_threshold=0.7`: Only return chunks with cosine similarity of at least 0.7 to the query. If no chunks meet this threshold, return nothing rather than returning irrelevant chunks. Without this threshold, RAG always returns K chunks even if none are relevant — which can cause the LLM to make up answers based on unrelated context (a subtle but common failure mode).
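The guard this threshold implements, as a conceptual sketch (search and generate are hypothetical stand-ins for the retrieval and generation calls):

```python
# Conceptual threshold guard (hypothetical helpers)
results = search(query, k=5)  # [(chunk, similarity), ...]
relevant = [(chunk, sim) for chunk, sim in results if sim >= 0.7]

if not relevant:
    # Admit ignorance instead of letting the LLM improvise from noise
    answer = "I don't have information on this."
else:
    answer = generate(query, [chunk for chunk, _ in relevant])
```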
What is a Vertex AI RAG Corpus? A managed service that handles the full RAG pipeline infrastructure: document ingestion, chunking, embedding, and storage in Google’s vector search infrastructure. You upload documents; the service handles everything else. The `VertexAiRagMemoryService` in ADK connects your agent to this corpus for retrieval.
Implementation 3: Custom LangChain + LangGraph RAG Pipeline
```python
from typing import List, TypedDict

import weaviate
from weaviate.embedded import EmbeddedOptions

from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import Weaviate
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langgraph.graph import StateGraph, END
```
Why LangGraph instead of a simple chain? A simple chain runs retrieval → generation sequentially with no flexibility. LangGraph defines the pipeline as a graph where each step is a node and transitions are explicit edges. This enables: conditional routing (if retrieval finds no relevant documents, take a different path), loops (retry with a refined query if the first retrieval was insufficient), and parallel branches (retrieve from multiple sources simultaneously). For production RAG, you almost always need this flexibility.
```python
# Step 1: Prepare the knowledge base (done once)
loader = TextLoader("./company_docs.txt")
documents = loader.load()

text_splitter = CharacterTextSplitter(
    chunk_size=500,    # characters per chunk
    chunk_overlap=50,  # overlap prevents losing context at boundaries
)
chunks = text_splitter.split_documents(documents)
```
`chunk_size=500` and `chunk_overlap=50`: These are critical tuning parameters. Smaller chunks (200-300 chars) give more precise retrieval but lose context. Larger chunks (1,000+ chars) preserve context but make retrieval less precise because each chunk covers multiple topics. The overlap of 50 characters ensures that a sentence split across chunk boundaries appears in both chunks — preventing the common failure where the most important sentence in a document is the first/last line of a chunk and loses its context.
```python
# Step 2: Embed and store in Weaviate
client = weaviate.Client(embedded_options=EmbeddedOptions())

vectorstore = Weaviate.from_documents(
    client=client,
    documents=chunks,
    embedding=OpenAIEmbeddings(),
    by_text=False,
)

retriever = vectorstore.as_retriever()
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
```
`EmbeddedOptions()`: Runs Weaviate in-process (embedded mode) without needing a separate Weaviate server. Perfect for development and testing. For production, point to a Weaviate instance running as a separate service.
`by_text=False`: Use pre-computed embeddings rather than Weaviate’s built-in vectorization. This gives you explicit control over which embedding model is used, ensuring consistency between embedding during indexing and embedding at query time. Always use the same embedding model for both — mixing models produces meaningless similarity scores.
```python
# Step 3: Define graph state
class RAGGraphState(TypedDict):
    question: str
    documents: List[Document]
    generation: str
```
Why a TypedDict for state? LangGraph passes state between nodes as a dictionary. Using `TypedDict` documents the expected keys and their types — it’s not strictly required but makes the graph’s data flow explicit and enables type checking. Each node function receives the full state dict and returns a partial update (only the keys it modifies).
```python
# Step 4: Define nodes
def retrieve_documents_node(state: RAGGraphState) -> RAGGraphState:
    """Retrieves relevant document chunks for the user's question."""
    documents = retriever.invoke(state["question"])
    return {"documents": documents, "question": state["question"], "generation": ""}

def generate_response_node(state: RAGGraphState) -> RAGGraphState:
    """Generates an answer using retrieved context."""
    template = """Answer the following question based ONLY on the provided context.
If you cannot answer from the context, say "I don't have enough information."

Context: {context}

Question: {question}

Answer:"""
    context = "\n\n".join([doc.page_content for doc in state["documents"]])
    prompt = ChatPromptTemplate.from_template(template)
    chain = prompt | llm | StrOutputParser()
    generation = chain.invoke({"context": context, "question": state["question"]})
    return {"question": state["question"], "documents": state["documents"], "generation": generation}
```
“Answer ONLY from the context”: This instruction is critical for RAG correctness. Without it, the LLM might supplement retrieved context with its own training knowledge, producing a mixed answer that’s hard to attribute or verify. The strict instruction forces the LLM to either use the retrieved evidence or admit it doesn’t know — making the system more honest and auditable.
```python
# Step 5: Build and compile the graph
workflow = StateGraph(RAGGraphState)
workflow.add_node("retrieve", retrieve_documents_node)
workflow.add_node("generate", generate_response_node)
workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "generate")
workflow.add_edge("generate", END)
app = workflow.compile()

# Step 6: Run queries
result = app.invoke({"question": "What is our company's vacation policy?"})
print(result["generation"])
```
Why `app.invoke()` instead of calling the functions directly? `invoke()` runs the full LangGraph state machine — it manages state initialization, passes state between nodes, handles the edge routing, and returns the final state. Calling the node functions directly would bypass the graph infrastructure, losing LangGraph’s benefits (streaming, checkpointing, conditional edges).
RAG’s Challenges: What Can Go Wrong
Challenge 1: Multi-document synthesis. If the answer to a question spans 5 different documents, and retrieval only returns 3 of them, the LLM will produce an incomplete answer — worse, it may produce a confident but incomplete one without signaling that information is missing. Mitigation: increase k, use multi-query retrieval (generate multiple variations of the query and merge results), or use Agentic RAG to detect gaps.
Challenge 2: Chunk boundary problems. A crucial sentence that spans two chunk boundaries may appear incomplete in both chunks. Mitigation: chunk overlap, recursive chunking, or sentence-aware splitting.
Challenge 3: Retrieval noise. If irrelevant chunks are retrieved (false positives), they pollute the LLM’s context with confusing or contradictory information. The LLM may hallucinate to reconcile the conflict. Mitigation: use a distance threshold, re-rank retrieved chunks using a cross-encoder model before sending to LLM.
Challenge 4: Index staleness. Your knowledge base is outdated. A policy changed last week but the vector store hasn’t been re-indexed. The LLM confidently answers from the old policy. Mitigation: incremental indexing pipelines, document versioning, freshness metadata, Agentic RAG with web search fallback for time-sensitive queries.
Challenge 5: Query-document vocabulary mismatch. The user asks about “vacation days” but your policy document uses “annual leave.” Pure semantic search helps but isn’t perfect. Mitigation: hybrid search (BM25 + semantic), query expansion (generate synonyms and related terms before retrieval).
Challenge 6: Context window limits. Retrieving 10 long chunks may exceed the LLM’s context window. Mitigation: summarize chunks before passing to LLM, reduce k, use a model with a larger context window.
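Several of these mitigations share one mechanism — issue multiple query variants and merge the ranked results. A conceptual sketch of multi-query retrieval (reusing the RRF idea from the hybrid search example; expand_query is a hypothetical LLM call that generates paraphrases):

```python
# Conceptual multi-query retrieval (mitigates Challenges 1 and 5)
def multi_query_search(query: str, k: int = 5) -> list:
    # e.g. "vacation days" → ["annual leave", "paid time off", ...]
    variants = [query] + expand_query(query, n=3)

    scores = {}
    for variant in variants:
        for rank, doc in enumerate(vector_db.similarity_search(variant, k=20)):
            # Reciprocal Rank Fusion across the per-variant result lists
            scores[doc.id] = scores.get(doc.id, 0) + 1 / (60 + rank)

    return sorted(scores, key=scores.get, reverse=True)[:k]
```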
Four Practical Applications
Enterprise Knowledge Base Q&A
HR chatbots answering policy questions, IT helpdesks troubleshooting from documentation, legal teams searching contracts for specific clauses — all powered by RAG over internal document stores.
HR · IT · Legal · Operations
Customer Support Automation
Agents that answer product-specific questions from support tickets, FAQs, and product manuals. Unlike a fine-tuned model, RAG-based support agents can be updated by uploading new documentation without retraining.
SaaS · E-commerce · Telecom
News and Research Summarization
LLMs connected to live news feeds can summarize recent developments on any topic. The RAG system retrieves today's articles and the LLM synthesizes them — combining recency with synthesis capability.
Media · Finance · Research
Personalized Recommendations
Instead of keyword matching, RAG retrieves products/content that are semantically aligned with a user's expressed preferences, past behavior, and current context — enabling genuine semantic recommendations.
E-commerce · Content · EdTech
At a Glance
**What it is:** A pattern that connects LLMs to external knowledge bases, enabling them to retrieve relevant information at query time rather than relying solely on training data. Converts an LLM from a closed-book test to an open-book one.
**Why it matters:** LLM training data is static, doesn't include private knowledge, and has a cutoff date. RAG provides access to current, specific, and proprietary information — grounding outputs in verifiable facts and dramatically reducing hallucination.
**When to use it:** Use RAG whenever the LLM needs to answer from specific, private, or current information not in its training data. Start with simple vector search RAG; add hybrid search for better precision; add agentic reasoning for complex multi-source synthesis.
Key Takeaways
- **RAG is a closed-book to open-book transformation.** The LLM stops guessing from training data and starts reading from actual documents. This is the single most effective intervention for reducing hallucination and increasing factual accuracy.
- **Embeddings encode meaning, not words.** The same semantic concept expressed in different words produces similar vectors. This enables finding relevant content even when the user’s phrasing doesn’t match the document’s vocabulary.
- **Chunking strategy matters as much as retrieval strategy.** Poorly chunked documents break context at wrong boundaries, produce retrieval noise, and lose the information the user needs. Use sentence/paragraph-aware chunking with overlap.
- **Hybrid search outperforms either BM25 or semantic search alone.** BM25 catches exact keyword matches; semantic search catches conceptual relevance. Production systems use both, combined with Reciprocal Rank Fusion.
- **The `vector_distance_threshold` prevents confident wrong answers.** Without a minimum similarity threshold, RAG always returns K chunks even when none are relevant — causing the LLM to generate answers from unrelated context. Better to say “I don’t have information on this” than to confidently answer from the wrong context.
- **GraphRAG handles relationship-heavy queries; standard RAG handles localized queries.** If your questions are mostly “what does this policy say?”, standard RAG is sufficient. If they’re “how does X relate to Y across our organization?”, GraphRAG is worth the added complexity.
- **Agentic RAG adds source validation, conflict resolution, and gap detection.** For high-stakes RAG applications (legal, medical, financial), the agent layer ensures retrieved context is current, authoritative, and complete before being passed to the LLM.
- **Vertex AI RAG Corpus handles infrastructure; LangGraph handles pipeline logic.** Use managed services for the embedding/storage infrastructure. Use graph-based orchestration (LangGraph) for complex multi-step retrieval pipelines with conditional logic.
This concludes the 14-chapter Agentic AI series. From the simplest prompt chain (Chapter 1) to the most sophisticated retrieval-augmented generation system (Chapter 14), you now have the complete map of how intelligent AI agents are built, deployed, and made reliable in the real world.