ARTICLE · 16 MIN READ · FEBRUARY 02, 2026
Chapter 8: Memory Management
Without memory, every conversation starts from zero. Memory management gives agents the ability to remember — short-term context within a session, long-term knowledge across sessions.
The Stateless Agent Problem
Stateless: A system with no memory between requests. Each call is independent — the system doesn't remember previous interactions. A basic LLM API call is stateless: call it twice with the same prompt and it has no idea the first call happened.
Persistent vs ephemeral: Persistent data survives when the program stops (saved to disk/database). Ephemeral data exists only while the program runs (in RAM) and disappears when it stops. The context window is ephemeral — conversation history in a database is persistent.
Vector database: A database that stores text as vectors (lists of numbers representing meaning) and supports "semantic search" — finding entries similar in meaning, not just matching keywords. Used for long-term agent memory because you want to find relevant past info by topic, not by exact words.
Semantic search: Finding things by meaning rather than keywords. Searching "car" would semantically find "automobile" and "vehicle" too. Vector databases do this by comparing the mathematical distance between word/sentence embeddings.
Session: One complete conversation thread between a user and the agent. Like a phone call — starts when the user first messages, ends when they're done. Multiple sessions with the same user build up long-term memory over time.
Every agent pattern in this series — chaining, routing, planning, multi-agent — has one silent assumption: the agent has context. It knows what the user said, what it did before, what tools it called.
But where does that context live?
Without explicit memory management, agents are stateless. Each call to the LLM starts fresh. Ask it “what’s my order status?” and it has no idea who you are, what you ordered, or what you discussed five messages ago. That’s not an agent — it’s a very expensive autocomplete.
Memory is what transforms a stateless LLM call into an agent that can maintain context, track progress, personalize responses, and learn from past interactions.
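To make the statelessness concrete, here is a minimal sketch (call_llm is a hypothetical stand-in for any LLM API) showing that memory exists only if you explicitly re-send the history on every call:
def call_llm(messages: list[dict]) -> str:
    """Hypothetical stand-in for a stateless LLM API call."""
    return "..."  # the model only ever sees what's in `messages`

history: list[dict] = []

def chat(user_message: str) -> str:
    # Without appending to `history`, every call would start from zero.
    history.append({"role": "user", "content": user_message})
    reply = call_llm(history)  # the model sees the whole conversation so far
    history.append({"role": "assistant", "content": reply})
    return reply

chat("My order number is 12345.")
chat("What's my order status?")  # works only because turn 1 was re-sent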
Why the context window isn’t enough. You might be thinking: “But the LLM already has a context window — isn’t that memory?” The context window is short-term memory within a single session. If you send 10 messages in a conversation and they all fit in the context window, the LLM sees all of them and can refer back to the first one. That’s working memory.
The problems start at the session boundary. When the conversation ends (the user closes the app, the session times out, the server restarts), the context window contents disappear. The next time the user comes back, the LLM has no idea who they are, what they discussed before, or what preferences they expressed. Every session starts from zero.
The second problem is scale. The context window has a fixed size. If a user has had 200 conversations with your agent, you can’t fit all of them into one context window call — it would be millions of tokens, prohibitively expensive and slow. You need a way to selectively retrieve the most relevant past information rather than including everything every time.
The two-tier memory architecture. The solution is a two-tier system:
- Short-term memory (session context): Everything in the current conversation window. Fast, always available, but ephemeral.
- Long-term memory (external storage): Key facts, preferences, and summaries extracted from past conversations. Persistent, semantic-searchable, but requires explicit retrieval.
This mirrors how human memory works. You don’t consciously replay every past conversation before responding to someone — you just know relevant facts about them because they’ve been encoded into long-term memory over time. When something is relevant to the current conversation, it surfaces automatically.
Building this two-tier system for AI agents is exactly what this chapter covers — using Google ADK’s Session/State/MemoryService architecture and LangChain’s ConversationBufferMemory and InMemoryStore.
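As a rough sketch of how the two tiers compose on each turn (the names short_term and long_term are illustrative, not from either framework):
def build_context(user_id: str, user_message: str,
                  short_term: list[dict], long_term) -> list[dict]:
    # Tier 2: explicit semantic retrieval from persistent storage
    memories = long_term.search(user_id=user_id, query=user_message, top_k=5)
    memory_block = "Known facts about this user:\n" + "\n".join(m.text for m in memories)
    # Tier 1: the current conversation window, always included as-is
    return [
        {"role": "system", "content": memory_block},
        *short_term,
        {"role": "user", "content": user_message},
    ]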
There are two fundamentally different problems:
- Short-term: remember what was said in this conversation
- Long-term: remember what matters across conversations
Both require different mechanisms. Neither is optional for production agents.
Two Types of Agent Memory
Short-Term Memory
Short-term memory is the context window — everything the model sees on a single call: the system prompt, the conversation history, tool results, agent thoughts.
It’s fast, has zero retrieval cost, and is always 100% relevant to the current conversation. But it’s ephemeral: when the session ends, it’s gone. And it has capacity limits — even a 1M token window fills up eventually, and processing the full history on every call is expensive.
Long-Term Memory
Long-term memory is external storage — databases, vector stores, knowledge graphs — that persists across sessions. When an agent needs something from the past, it queries the store, retrieves the relevant data, and injects it into the current context.
Vector databases are the dominant storage type here because they support semantic search: the agent can find relevant memories by meaning, not just exact keyword match.
Google ADK: Session, State, and MemoryService
The ADK structures memory into three explicit components with distinct responsibilities.
Session: The Conversation Thread
from google.adk.sessions import InMemorySessionService
session_service = InMemorySessionService()
A Session is one conversation thread. It holds:
- id — unique identifier for this thread
- events — ordered list of all messages, agent replies, and tool calls
- state — temporary key-value data for this conversation
- last_update_time — timestamp of the most recent activity
You never create Session objects directly — the SessionService manages their lifecycle: create, retrieve, append events, delete.
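A minimal sketch of that lifecycle (method names match the ADK SessionService interface; exact signatures may vary between ADK versions):
session = await session_service.create_session(app_name="my_app", user_id="sarah_123")
print(session.id)  # unique thread id, generated for you

# Later: retrieve the same thread and continue it
session = await session_service.get_session(
    app_name="my_app", user_id="sarah_123", session_id=session.id,
)

# When the conversation is truly over
await session_service.delete_session(
    app_name="my_app", user_id="sarah_123", session_id=session.id,
)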
Three SessionService Implementations
from google.adk.sessions import InMemorySessionService
session_service = InMemorySessionService()
- ✓ Zero setup — no database needed
- ✓ Fast for testing
- ✗ Data lost on app restart
- ✗ Single process only
from google.adk.sessions import DatabaseSessionService
session_service = DatabaseSessionService(db_url="sqlite:///./agent.db")
- ✓ Persistent across restarts
- ✓ Supports SQLite, PostgreSQL, MySQL
- ✓ You control the database
- ✗ Requires infra management
from google.adk.sessions import VertexAiSessionService
session_service = VertexAiSessionService(project=PROJECT_ID, location="us-central1")
- ✓ Fully managed by Google Cloud
- ✓ Scales automatically
- ✓ Integrated with Reasoning Engine
- ✗ Requires GCP setup
State: The Session’s Scratchpad
session.state is a dictionary that lives within a Session. Think of it as a whiteboard for one conversation — agents read from it and write to it throughout the session.
# The state has 4 key namespaces:
session.state["preference"] # session-scoped (cleared when session ends)
session.state["user:name"] # user-scoped (persists across this user's sessions)
session.state["app:config"] # app-scoped (shared across all users)
session.state["temp:validation"] # turn-scoped (cleared after each processing turn)
Key prefixes determine persistence. Without a prefix, data lives only in the current session. With user:, it follows the user across all their sessions. With app:, it's global to the application. With temp:, it's cleared after each processing turn — useful for ephemeral flags.
The right way to update state:
# ✓ Method 1: output_key on the agent (simplest — for agent text replies)
from google.adk.agents import LlmAgent

greeting_agent = LlmAgent(
    name = "Greeter",
    model = "gemini-2.0-flash",
    instruction = "Generate a short, friendly greeting.",
    output_key = "last_greeting",  # runner auto-saves the reply to state["last_greeting"]
)
Why output_key? The Runner intercepts the agent's final response and writes it to session.state["last_greeting"] through the normal append_event flow. This means the change is recorded in the event history, properly persisted by the SessionService, and timestamped. It's the least error-prone option.
# ✓ Method 2: tool that updates state via ToolContext (for complex updates)
from google.adk.tools.tool_context import ToolContext
import time
def log_user_login(tool_context: ToolContext) -> dict:
    """Tracks a user login. Updates login count, status, and timestamp."""
    state = tool_context.state
    login_count = state.get("user:login_count", 0) + 1
    state["user:login_count"] = login_count
    state["task_status"] = "active"
    state["user:last_login_ts"] = time.time()
    state["temp:validation_needed"] = True
    return {"status": "success", "message": f"Login #{login_count} tracked."}
Why use a tool for state updates? Tools have access to ToolContext, which wraps session.state with proper event-tracking. Changes made through tool_context.state get recorded in the event log and properly persisted.
Never do this:
session = session_service.get_session(app_name, user_id, session_id)
session.state["key"] = "value"  # ← bypasses event tracking, may not persist
Direct dictionary writes bypass the append_event mechanism. They won't be persisted by DatabaseSessionService or VertexAiSessionService and won't appear in the event history.
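A third path, referenced in the takeaways below, is appending an event that carries an explicit state_delta. A hedged sketch (Event and EventActions field names may differ slightly across ADK versions):
# ✓ Method 3: append an event carrying a state_delta (sketch)
from google.adk.events import Event, EventActions

actions = EventActions(state_delta={
    "user:name": "Sam",         # user-scoped: persists across sessions
    "temp:needs_review": True,  # turn-scoped: cleared after this turn
})
event = Event(author="system", invocation_id="inv-001", actions=actions)
await session_service.append_event(session, event)  # logged, persisted, timestamped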
MemoryService: Long-Term Knowledge
# Development
from google.adk.memory import InMemoryMemoryService
memory_service = InMemoryMemoryService()
# Production (Vertex AI RAG)
from google.adk.memory import VertexAiRagMemoryService
memory_service = VertexAiRagMemoryService(
    rag_corpus = "projects/my-project/locations/us-central1/ragCorpora/my-corpus",
    similarity_top_k = 5,  # retrieve top 5 most relevant memories
    vector_distance_threshold = 0.7,  # drop results farther than this distance from the query
)
VertexAiRagMemoryService stores memories as vector embeddings in a Vertex AI RAG Corpus. When an agent searches memory, it sends a semantic query and gets back the most similar stored facts — not exact keyword matches, but conceptually related information.
similarity_top_k=5: retrieve at most 5 memories per query. Higher numbers give more context but consume more of the context window.
vector_distance_threshold=0.7: only return memories whose embedding is within this distance of the query; more distant (less similar) results are dropped. This prevents irrelevant memories from polluting the context.
# After a session ends: extract facts and store in long-term memory
await memory_service.add_session_to_memory(session)
# When a new session starts: retrieve relevant past memories
relevant_memories = await memory_service.search_memory(
    app_name = "my_app",
    user_id = "sarah_123",
    query = "user preferences and past interactions",
)
Session lifecycle in ADK
LangChain & LangGraph Memory
Short-Term: ConversationBufferMemory
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(
    memory_key = "chat_history",  # must match the variable name in your prompt
    return_messages = True,  # return list of message objects, not a string
)
memory.save_context(
    {"input": "What's the weather like?"},
    {"output": "It's sunny today."},
)
memory_key="chat_history"— this string must exactly match the placeholder in yourChatPromptTemplate. If your prompt hasMessagesPlaceholder(variable_name="chat_history"), LangChain injects the conversation history there automatically.
return_messages=True— returns a list ofHumanMessage/AIMessageobjects instead of a single formatted string. Use this with chat models (ChatOpenAI, ChatGemini). The raw string format works for non-chat LLMs but loses role metadata.
from langchain_openai import ChatOpenAI
from langchain.chains import LLMChain
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a friendly travel assistant."),
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{question}"),
])
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
conversation = LLMChain(llm=ChatOpenAI(), prompt=prompt, memory=memory)
response = conversation.predict(question="I want to book a flight.") # turn 1
response = conversation.predict(question="My name is Sam, by the way.") # turn 2
response = conversation.predict(question="What was my name again?") # turn 3 — knows it's Sam
The memory object maintains the conversation buffer. On each predict() call, it: 1) loads the existing history into {chat_history}, 2) generates the response, 3) saves the new exchange to the buffer. Turn 3 works because turns 1 and 2 are in the buffer.
Long-Term: LangGraph InMemoryStore
LangGraph provides three long-term memory types, each mapped to a different human memory analogy:
- Semantic memory: facts about the user or world ("User likes short, direct responses")
- Episodic memory: records of past tasks and how they were solved
- Procedural memory: the agent's own instructions, updated over time via Reflection
LangGraph Store: put, get, search
from langgraph.store.memory import InMemoryStore
def embed(texts: list[str]) -> list[list[float]]:
    return [[1.0, 2.0] for _ in texts]  # replace with a real embedding model in production
store = InMemoryStore(index={"embed": embed, "dims": 2})
# Namespace: (user_id, context_type) — like a folder structure
namespace = ("sarah_123", "preferences")
# Store a memory
store.put(
    namespace,
    "a-memory",  # key — like a filename
    {
        "rules": ["User likes short, direct responses", "User speaks English and Python"],
        "timezone": "Asia/Tokyo",
    },
)
# Retrieve by key
item = store.get(namespace, "a-memory")
# Semantic search within the namespace
results = store.search(
    namespace,
    filter = {"timezone": "Asia/Tokyo"},  # metadata filter
    query = "language preferences",  # semantic similarity query
)
Namespace = (user_id, context_type) — think of it as a two-level folder structure. The first level is the user, the second is the type of memory (preferences, episodes, instructions). This prevents memory from one user bleeding into another's.
store.put(namespace, key, data) — the key is like a filename. If a memory with that key already exists, it's overwritten. Use descriptive keys like "user-profile" or "booking-preferences".
store.search() with both filter and query — filter does exact metadata matching (deterministic); query does vector similarity search (semantic). Using both together is the most precise retrieval.
Procedural Memory: Self-Updating Instructions
from langgraph.store.base import BaseStore

def update_instructions(state: State, store: BaseStore):
    """Reflection node: agent rewrites its own system prompt based on conversation."""
    namespace = ("agent_instructions",)
    current_instructions = store.search(namespace)[0]
    # Ask the LLM to review its behavior and propose improvements
    prompt = prompt_template.format(
        instructions = current_instructions.value["instructions"],
        conversation = state["messages"],
    )
    output = llm.invoke(prompt)  # llm is assumed to return structured output (a dict)
    new_instructions = output["new_instructions"]
    # Overwrite stored instructions with the improved version (same namespace as the read)
    store.put(namespace, "agent_a", {"instructions": new_instructions})
This is the Reflection pattern (Chapter 4) applied to procedural memory. The agent reads its own instructions, sees how the conversation went, and rewrites its prompt to do better next time. Each session, it gets slightly smarter.
Vertex Memory Bank
For teams using Google Cloud, Memory Bank (part of Vertex AI Agent Engine) is a fully managed long-term memory service that works across frameworks — ADK, LangGraph, and CrewAI.
from google.adk.memory import VertexAiMemoryBankService
memory_service = VertexAiMemoryBankService(
    project = "my-gcp-project",
    location = "us-central1",
    agent_engine_id = agent_engine_id,
)
# After a session completes: extract and store memories
session = await session_service.get_session(app_name=app_name, user_id="USER_ID", session_id=session.id)
await memory_service.add_session_to_memory(session)
How Memory Bank works: After add_session_to_memory(), Gemini asynchronously analyzes the conversation history, extracts key facts and user preferences, and stores them persistently. On the next session, the agent retrieves relevant memories via similarity search — getting only the facts that matter for the current conversation, not the entire history.
What makes it different from VertexAiRagMemoryService: Memory Bank actively uses Gemini to understand and consolidate memories — resolving contradictions, merging related facts, and updating existing preferences. It's not just a vector store; it's a managed memory intelligence layer.
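Retrieval then mirrors the MemoryService interface shown earlier. A hedged sketch (the shape of the returned response object is an assumption):
# Sketch: pull relevant Memory Bank facts at the start of a new session
response = await memory_service.search_memory(
    app_name = app_name,
    user_id = "USER_ID",
    query = "user preferences and ongoing tasks",
)
for memory in response.memories:  # response shape is an assumption
    print(memory)  # inject these into the agent's instruction or context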
Storage Backend Comparison
- InMemorySessionService / InMemoryMemoryService: RAM only, zero setup, data lost on restart. Best for development and testing.
- DatabaseSessionService: SQLite, PostgreSQL, or MySQL; persists across restarts; you manage the infrastructure. Best for self-hosted production.
- VertexAiSessionService + VertexAiRagMemoryService: fully managed on Google Cloud, scales automatically, requires GCP setup. Best for managed production.
- Vertex Memory Bank: managed extraction and consolidation of memories by Gemini; works across ADK, LangGraph, and CrewAI.
At a Glance
A dual-component system: short-term memory (session context window) for the current conversation, and long-term memory (external vector store) for knowledge that persists across sessions.
Without memory, every conversation starts from zero. Agents can't maintain context, track progress, personalize responses, or learn from past interactions — making them useless for anything beyond single-turn Q&A.
Use short-term memory (Session + State) for any agent handling multi-turn conversations. Add long-term memory (MemoryService) when users expect personalization or continuity across sessions.
Why Vector Databases Are Used for Long-Term Memory
You might wonder: why not just store memories in a regular SQL database and retrieve them with SQL queries? The answer reveals something fundamental about how language and meaning work.
The keyword matching problem. SQL requires exact string matching. If a user once said “I prefer short answers,” you might store this under the key "user_preferences". But what if next session you want to retrieve “things that affect response length”? The string “response length” doesn’t match “user_preferences” — so you’d miss it entirely.
Language has meaning that transcends exact words. “I dislike verbose responses” and “please be concise” and “keep it brief” all mean the same thing, but share no keywords.
How vector databases solve this. Instead of storing raw text, vector databases store text as embeddings — lists of numbers (typically 1,536 or more numbers) that encode the meaning of the text in a high-dimensional space. Sentences with similar meanings end up with similar embeddings (similar numbers), regardless of which exact words they use.
When you search a vector database with the query “things that affect response length,” the system converts your query to an embedding and finds the stored texts whose embeddings are mathematically closest to your query’s embedding — even if they share no keywords. This is called semantic search — search by meaning rather than by keywords.
How embeddings work intuitively. Think of placing words and sentences in a 3D space (in reality it’s 1,536-dimensional, but the concept is the same). Words with similar meanings are placed near each other. “Car,” “automobile,” and “vehicle” cluster together. “Dog” and “puppy” cluster together, not near “car.” Sentences with similar meanings also land near each other. When you embed “I prefer brief responses” and “keep answers short,” they land near each other in this space. A search for “response length preferences” also lands near both — the database returns them as relevant matches.
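To make that intuition concrete, here is a toy sketch with made-up 3-D vectors (real embeddings come from an embedding model and have 1,536 or more dimensions):
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings: similar meanings get similar vectors
emb = {
    "I prefer brief responses": [0.90, 0.10, 0.20],
    "keep answers short":       [0.85, 0.15, 0.25],
    "dogs make great pets":     [0.10, 0.90, 0.30],
}
query = [0.88, 0.12, 0.22]  # pretend embedding of "response length preferences"

# Rank stored texts by closeness to the query: the two "short answers"
# sentences win despite sharing no keywords with the query.
for text, vec in sorted(emb.items(), key=lambda kv: -cosine_similarity(query, kv[1])):
    print(f"{cosine_similarity(query, vec):.3f}  {text}")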
The practical flow in an agent. Here is how memory retrieval works in a real ADK agent with VertexAiRagMemoryService: (1) User starts a new session with no prior context. (2) Before processing the user’s first message, the system runs: memories = await memory_service.search_memory(app_name, user_id, query=user_message). (3) The memory service converts user_message to an embedding and searches for semantically similar past memories. (4) The top-K most relevant memories are retrieved and injected into the agent’s context: “From past conversations: User prefers brief responses. User is a data scientist in Tokyo.” (5) The agent now has personalized context without the user having to repeat themselves. (6) After the session ends: await memory_service.add_session_to_memory(session) — key facts are extracted and stored for future retrieval.
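In code, that flow looks roughly like this (format_memories and the instruction wiring are illustrative, not ADK APIs):
async def run_personalized_session(user_id: str, first_message: str):
    # Steps 1-3: new session, semantic lookup of relevant past memories
    memories = await memory_service.search_memory(
        app_name="my_app", user_id=user_id, query=first_message,
    )
    # Step 4: inject retrieved facts into the agent's context
    context = "From past conversations: " + format_memories(memories)  # hypothetical helper
    session = await session_service.create_session(app_name="my_app", user_id=user_id)
    # Step 5: run the agent with `context` prepended to its instruction (omitted)
    # Step 6: after the session ends, extract and store key facts for next time
    await memory_service.add_session_to_memory(session)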
This is what makes agents feel “intelligent” about user preferences — they’re not actually remembering in the human sense, they’re retrieving relevant records from a database.
Common Mistakes When Implementing Memory
Mistake 1: Storing everything in long-term memory. Not every message deserves long-term storage. “What time is it?” and “Tell me a joke” don’t contain information worth remembering across sessions. Use selective extraction — store only facts, preferences, and task-relevant context. Otherwise, your memory service fills with noise that makes semantic search less precise.
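One hedged way to implement selective extraction is to gate every candidate memory through a cheap classifier call (the prompt and YES/NO protocol here are illustrative):
EXTRACT_PROMPT = (
    "Does this message contain a lasting fact or preference about the user "
    "worth remembering across sessions? Answer YES or NO.\n\nMessage: {msg}"
)

def worth_remembering(llm, message: str) -> bool:
    # llm is assumed to be a chat model returning a message with .content
    verdict = llm.invoke(EXTRACT_PROMPT.format(msg=message))
    return verdict.content.strip().upper().startswith("YES")

# Usage: only gate-passing messages reach long-term memory
# if worth_remembering(llm, user_message):
#     store.put(namespace, key, {"fact": user_message})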
Mistake 2: Not scoping state keys correctly. Using session.state["name"] (no prefix) when you mean session.state["user:name"] (user-scoped) means the name disappears after the session ends. Always think carefully about the lifecycle you need: session-only, user-persistent, app-wide, or turn-only.
Mistake 3: Reading state directly instead of through events. session.state["key"] = value bypasses the event log and may not persist. Always update state through output_key on agents or through EventActions.state_delta when appending events. This ensures changes are logged, persisted, and timestamped correctly.
Mistake 4: Using InMemorySessionService in production. InMemorySessionService stores all session data in RAM. When the server restarts (which happens regularly in cloud deployments), all session data is lost. Use DatabaseSessionService or VertexAiSessionService for any production deployment where data persistence matters.
Mistake 5: No memory retrieval at session start. The memory service only helps if you actually query it at the beginning of each session and inject the results into the agent’s context. If you create the MemoryService but never call search_memory, you have a write-only memory — agents store information but never benefit from past sessions.
Key Takeaways
- Memory has two fundamentally different problems: short-term (within a conversation) and long-term (across conversations). They require different storage mechanisms and have different access patterns.
- ADK's three primitives: Session tracks the conversation thread. State is the session's temporary scratchpad (with namespace prefixes for scope). MemoryService is the searchable long-term knowledge store.
- Update state through append_event, not direct dict writes. session.state["key"] = value bypasses event tracking and may not persist. Use output_key on agents or EventActions.state_delta in events.
- State key prefixes determine persistence: user: persists across sessions for that user. app: is global. temp: clears each turn. No prefix = session-only.
- Long-term memory uses vector search. Semantic retrieval finds relevant memories by meaning, not keyword. similarity_top_k and vector_distance_threshold control how much you retrieve and how relevant it must be.
- LangChain's ConversationBufferMemory handles short-term automatically — loads history into the prompt, saves each turn. Use return_messages=True for chat models.
- LangGraph's InMemoryStore supports three long-term memory types: Semantic (facts), Episodic (past task solutions), and Procedural (self-updating instructions via Reflection).
- Vertex Memory Bank is the fully managed option — Gemini analyzes conversations, extracts facts, resolves contradictions, and provides personalized retrieval across ADK, LangGraph, and CrewAI.
Next up — Chapter 9: Human-in-the-Loop, where agents pause, ask for approval, and incorporate human judgment at critical decision points.