ARTICLE · 16 MIN READ · FEBRUARY 02, 2026
Chapter 8: Memory Management
Without memory, every conversation starts from zero. Memory management gives agents the ability to remember — short-term context within a session, long-term knowledge across sessions.
The Stateless Agent Problem
Stateless: A system with no memory between requests. Each call is independent — the system doesn't remember previous interactions. A basic LLM API call is stateless: call it twice with the same prompt and it has no idea the first call happened.
Persistent vs ephemeral: Persistent data survives when the program stops (saved to disk/database). Ephemeral data exists only while the program runs (in RAM) and disappears when it stops. The context window is ephemeral — conversation history in a database is persistent.
Vector database: A database that stores text as vectors (lists of numbers representing meaning) and supports "semantic search" — finding entries similar in meaning, not just matching keywords. Used for long-term agent memory because you want to find relevant past info by topic, not by exact words.
Semantic search: Finding things by meaning rather than keywords. Searching "car" would semantically find "automobile" and "vehicle" too. Vector databases do this by comparing the mathematical distance between word/sentence embeddings.
Session: One complete conversation thread between a user and the agent. Like a phone call — starts when the user first messages, ends when they're done. Multiple sessions with the same user build up long-term memory over time.
Every agent pattern in this series — chaining, routing, planning, multi-agent — has one silent assumption: the agent has context. It knows what the user said, what it did before, what tools it called.
But where does that context live?
Without explicit memory management, agents are stateless. Each call to the LLM starts fresh. Ask it “what’s my order status?” and it has no idea who you are, what you ordered, or what you discussed five messages ago. That’s not an agent — it’s a very expensive autocomplete.
Memory is what transforms a stateless LLM call into an agent that can maintain context, track progress, personalize responses, and learn from past interactions.
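To make the statelessness concrete, here is a minimal sketch (call_llm is a hypothetical stand-in for any LLM API) showing that memory exists only if you explicitly re-send the history on every call:
def call_llm(messages: list[dict]) -> str:
    """Hypothetical stand-in for a stateless LLM API call."""
    return "..."  # the model only ever sees what's in `messages`

history: list[dict] = []

def chat(user_message: str) -> str:
    # Without appending to `history`, every call would start from zero.
    history.append({"role": "user", "content": user_message})
    reply = call_llm(history)  # the model sees the whole conversation so far
    history.append({"role": "assistant", "content": reply})
    return reply

chat("My order number is 12345.")
chat("What's my order status?")  # works only because turn 1 was re-sent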
Why the context window isn’t enough. You might be thinking: “But the LLM already has a context window — isn’t that memory?” The context window is short-term memory within a single session. If you send 10 messages in a conversation and they all fit in the context window, the LLM sees all of them and can refer back to the first one. That’s working memory.
The problems start at the session boundary. When the conversation ends (the user closes the app, the session times out, the server restarts), the context window contents disappear. The next time the user comes back, the LLM has no idea who they are, what they discussed before, or what preferences they expressed. Every session starts from zero.
The second problem is scale. The context window has a fixed size. If a user has had 200 conversations with your agent, you can’t fit all of them into one context window call — it would be millions of tokens, prohibitively expensive and slow. You need a way to selectively retrieve the most relevant past information rather than including everything every time.
The two-tier memory architecture. The solution is a two-tier system:
- Short-term memory (session context): Everything in the current conversation window. Fast, always available, but ephemeral.
- Long-term memory (external storage): Key facts, preferences, and summaries extracted from past conversations. Persistent, semantic-searchable, but requires explicit retrieval.
This mirrors how human memory works. You don’t consciously replay every past conversation before responding to someone — you just know relevant facts about them because they’ve been encoded into long-term memory over time. When something is relevant to the current conversation, it surfaces automatically.
Building this two-tier system for AI agents is exactly what this chapter covers — using Google ADK’s Session/State/MemoryService architecture and LangChain’s ConversationBufferMemory and InMemoryStore.
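As a rough sketch of how the two tiers compose on each turn (the names short_term and long_term are illustrative, not from either framework):
def build_context(user_id: str, user_message: str,
                  short_term: list[dict], long_term) -> list[dict]:
    # Tier 2: explicit semantic retrieval from persistent storage
    memories = long_term.search(user_id=user_id, query=user_message, top_k=5)
    memory_block = "Known facts about this user:\n" + "\n".join(m.text for m in memories)
    # Tier 1: the current conversation window, always included as-is
    return [
        {"role": "system", "content": memory_block},
        *short_term,
        {"role": "user", "content": user_message},
    ]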
There are two fundamentally different problems:
- Short-term: remember what was said in this conversation
- Long-term: remember what matters across conversations
Both require different mechanisms. Neither is optional for production agents.
Two Types of Agent Memory
Short-Term Memory
Short-term memory is the context window — everything the model sees on a single call: the system prompt, the conversation history, tool results, agent thoughts.
It’s fast, has zero retrieval cost, and is always 100% relevant to the current conversation. But it’s ephemeral: when the session ends, it’s gone. And it has capacity limits — even a 1M token window fills up eventually, and processing the full history on every call is expensive.
Long-Term Memory
Long-term memory is external storage — databases, vector stores, knowledge graphs — that persists across sessions. When an agent needs something from the past, it queries the store, retrieves the relevant data, and injects it into the current context.
Vector databases are the dominant storage type here because they support semantic search: the agent can find relevant memories by meaning, not just exact keyword match.
Google ADK: Session, State, and MemoryService
The ADK structures memory into three explicit components with distinct responsibilities.
Session: The Conversation Thread
from google.adk.sessions import InMemorySessionService
session_service = InMemorySessionService()
A Session is one conversation thread. It holds:
- id — unique identifier for this thread
- events — ordered list of all messages, agent replies, and tool calls
- state — temporary key-value data for this conversation
- last_update_time — timestamp of the most recent activity
You never create Session objects directly — the SessionService manages their lifecycle: create, retrieve, append events, delete.
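A minimal sketch of that lifecycle (method names match the ADK SessionService interface; exact signatures may vary between ADK versions):
session = await session_service.create_session(app_name="my_app", user_id="sarah_123")
print(session.id)  # unique thread id, generated for you

# Later: retrieve the same thread and continue it
session = await session_service.get_session(
    app_name="my_app", user_id="sarah_123", session_id=session.id,
)

# When the conversation is truly over
await session_service.delete_session(
    app_name="my_app", user_id="sarah_123", session_id=session.id,
)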
Three SessionService Implementations
from google.adk.sessions import InMemorySessionService
session_service = InMemorySessionService()
- ✓ Zero setup — no database needed
- ✓ Fast for testing
- ✗ Data lost on app restart
- ✗ Single process only
from google.adk.sessions import DatabaseSessionService
session_service = DatabaseSessionService(db_url="sqlite:///./agent.db")
- ✓ Persistent across restarts
- ✓ Supports SQLite, PostgreSQL, MySQL
- ✓ You control the database
- ✗ Requires infra management
from google.adk.sessions import VertexAiSessionService
session_service = VertexAiSessionService(project=PROJECT_ID, location="us-central1")
- ✓ Fully managed by Google Cloud
- ✓ Scales automatically
- ✓ Integrated with Reasoning Engine
- ✗ Requires GCP setup
State: The Session’s Scratchpad
session.state is a dictionary that lives within a Session. Think of it as a whiteboard for one conversation — agents read from it and write to it throughout the session.
# The state has 4 key namespaces:
session.state["preference"] # session-scoped (cleared when session ends)
session.state["user:name"] # user-scoped (persists across this user's sessions)
session.state["app:config"] # app-scoped (shared across all users)
session.state["temp:validation"] # turn-scoped (cleared after each processing turn)
Key prefixes determine persistence. Without a prefix, data lives only in the current session. With user:, it follows the user across all their sessions. With app:, it's global to the application. With temp:, it's cleared after each processing turn — useful for ephemeral flags.
The right way to update state:
# ✓ Method 1: output_key on the agent (simplest — for agent text replies)
from google.adk.agents import LlmAgent

greeting_agent = LlmAgent(
    name = "Greeter",
    model = "gemini-2.0-flash",
    instruction = "Generate a short, friendly greeting.",
    output_key = "last_greeting",  # runner auto-saves the reply to state["last_greeting"]
)
Why output_key? The Runner intercepts the agent's final response and writes it to session.state["last_greeting"] through the normal append_event flow. This means the change is recorded in the event history, properly persisted by the SessionService, and timestamped. It's the least error-prone option.
# ✓ Method 2: tool that updates state via ToolContext (for complex updates)
from google.adk.tools.tool_context import ToolContext
import time
def log_user_login(tool_context: ToolContext) -> dict:
    """Tracks a user login. Updates login count, status, and timestamp."""
    state = tool_context.state
    login_count = state.get("user:login_count", 0) + 1
    state["user:login_count"] = login_count
    state["task_status"] = "active"
    state["user:last_login_ts"] = time.time()
    state["temp:validation_needed"] = True
    return {"status": "success", "message": f"Login #{login_count} tracked."}
Why use a tool for state updates? Tools have access to ToolContext, which wraps session.state with proper event-tracking. Changes made through tool_context.state get recorded in the event log and properly persisted.
Never do this:
session = session_service.get_session(app_name, user_id, session_id)
session.state["key"] = "value"  # ← bypasses event tracking, may not persist
Direct dictionary writes bypass the append_event mechanism. They won't be persisted by DatabaseSessionService or VertexAiSessionService and won't appear in the event history.
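A third path, referenced in the takeaways below, is appending an event that carries an explicit state_delta. A hedged sketch (Event and EventActions field names may differ slightly across ADK versions):
# ✓ Method 3: append an event carrying a state_delta (sketch)
from google.adk.events import Event, EventActions

actions = EventActions(state_delta={
    "user:name": "Sam",         # user-scoped: persists across sessions
    "temp:needs_review": True,  # turn-scoped: cleared after this turn
})
event = Event(author="system", invocation_id="inv-001", actions=actions)
await session_service.append_event(session, event)  # logged, persisted, timestamped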
MemoryService: Long-Term Knowledge
# Development
from google.adk.memory import InMemoryMemoryService
memory_service = InMemoryMemoryService()
# Production (Vertex AI RAG)
from google.adk.memory import VertexAiRagMemoryService
memory_service = VertexAiRagMemoryService(
    rag_corpus = "projects/my-project/locations/us-central1/ragCorpora/my-corpus",
    similarity_top_k = 5,  # retrieve top 5 most relevant memories
    vector_distance_threshold = 0.7,  # drop results farther than this distance from the query
)
VertexAiRagMemoryService stores memories as vector embeddings in a Vertex AI RAG Corpus. When an agent searches memory, it sends a semantic query and gets back the most similar stored facts — not exact keyword matches, but conceptually related information.
similarity_top_k=5: retrieve at most 5 memories per query. Higher numbers give more context but consume more of the context window.
vector_distance_threshold=0.7: only return memories whose embedding is within this distance of the query; more distant (less similar) results are dropped. This prevents irrelevant memories from polluting the context.
# After a session ends: extract facts and store in long-term memory
await memory_service.add_session_to_memory(session)
# When a new session starts: retrieve relevant past memories
relevant_memories = await memory_service.search_memory(
    app_name = "my_app",
    user_id = "sarah_123",
    query = "user preferences and past interactions",
)
Session lifecycle in ADK
LangChain & LangGraph Memory
Short-Term: ConversationBufferMemory
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(
    memory_key = "chat_history",  # must match the variable name in your prompt
    return_messages = True,  # return list of message objects, not a string
)
memory.save_context(
    {"input": "What's the weather like?"},
    {"output": "It's sunny today."},
)
memory_key="chat_history"— this string must exactly match the placeholder in yourChatPromptTemplate. If your prompt hasMessagesPlaceholder(variable_name="chat_history"), LangChain injects the conversation history there automatically.
return_messages=True— returns a list ofHumanMessage/AIMessageobjects instead of a single formatted string. Use this with chat models (ChatOpenAI, ChatGemini). The raw string format works for non-chat LLMs but loses role metadata.
from langchain_openai import ChatOpenAI
from langchain.chains import LLMChain
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a friendly travel assistant."),
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{question}"),
])
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
conversation = LLMChain(llm=ChatOpenAI(), prompt=prompt, memory=memory)
response = conversation.predict(question="I want to book a flight.") # turn 1
response = conversation.predict(question="My name is Sam, by the way.") # turn 2
response = conversation.predict(question="What was my name again?") # turn 3 — knows it's Sam
The memory object maintains the conversation buffer. On each predict() call, it: 1) loads the existing history into {chat_history}, 2) generates the response, 3) saves the new exchange to the buffer. Turn 3 works because turns 1 and 2 are in the buffer.
Long-Term: LangGraph InMemoryStore
LangGraph provides three long-term memory types, each mapped to a different human memory analogy:
- Semantic memory: facts about the user or world ("User likes short, direct responses")
- Episodic memory: records of past tasks and how they were solved
- Procedural memory: the agent's own instructions, updated over time via Reflection
LangGraph Store: put, get, search
from langgraph.store.memory import InMemoryStore
def embed(texts: list[str]) -> list[list[float]]:
    return [[1.0, 2.0] for _ in texts]  # replace with a real embedding model in production
store = InMemoryStore(index={"embed": embed, "dims": 2})
# Namespace: (user_id, context_type) — like a folder structure
namespace = ("sarah_123", "preferences")
# Store a memory
store.put(
    namespace,
    "a-memory",  # key — like a filename
    {
        "rules": ["User likes short, direct responses", "User speaks English and Python"],
        "timezone": "Asia/Tokyo",
    },
)
# Retrieve by key
item = store.get(namespace, "a-memory")
# Semantic search within the namespace
results = store.search(
    namespace,
    filter = {"timezone": "Asia/Tokyo"},  # metadata filter
    query = "language preferences",  # semantic similarity query
)
Namespace = (user_id, context_type) — think of it as a two-level folder structure. The first level is the user, the second is the type of memory (preferences, episodes, instructions). This prevents memory from one user bleeding into another's.
store.put(namespace, key, data) — the key is like a filename. If a memory with that key already exists, it's overwritten. Use descriptive keys like "user-profile" or "booking-preferences".
store.search() with both filter and query — filter does exact metadata matching (deterministic); query does vector similarity search (semantic). Using both together is the most precise retrieval.
Procedural Memory: Self-Updating Instructions
from langgraph.store.base import BaseStore

def update_instructions(state: State, store: BaseStore):
    """Reflection node: agent rewrites its own system prompt based on conversation."""
    namespace = ("agent_instructions",)
    current_instructions = store.search(namespace)[0]
    # Ask the LLM to review its behavior and propose improvements
    prompt = prompt_template.format(
        instructions = current_instructions.value["instructions"],
        conversation = state["messages"],
    )
    output = llm.invoke(prompt)  # llm is assumed to return structured output (a dict)
    new_instructions = output["new_instructions"]
    # Overwrite stored instructions with the improved version (same namespace as the read)
    store.put(namespace, "agent_a", {"instructions": new_instructions})
This is the Reflection pattern (Chapter 4) applied to procedural memory. The agent reads its own instructions, sees how the conversation went, and rewrites its prompt to do better next time. Each session, it gets slightly smarter.
Vertex Memory Bank
For teams using Google Cloud, Memory Bank (part of Vertex AI Agent Engine) is a fully managed long-term memory service that works across frameworks — ADK, LangGraph, and CrewAI.
from google.adk.memory import VertexAiMemoryBankService
memory_service = VertexAiMemoryBankService(
    project = "my-gcp-project",
    location = "us-central1",
    agent_engine_id = agent_engine_id,
)
# After a session completes: extract and store memories
session = await session_service.get_session(app_name=app_name, user_id="USER_ID", session_id=session.id)
await memory_service.add_session_to_memory(session)
How Memory Bank works: After add_session_to_memory(), Gemini asynchronously analyzes the conversation history, extracts key facts and user preferences, and stores them persistently. On the next session, the agent retrieves relevant memories via similarity search — getting only the facts that matter for the current conversation, not the entire history.
What makes it different from VertexAiRagMemoryService: Memory Bank actively uses Gemini to understand and consolidate memories — resolving contradictions, merging related facts, and updating existing preferences. It's not just a vector store; it's a managed memory intelligence layer.
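Retrieval then mirrors the MemoryService interface shown earlier. A hedged sketch (the shape of the returned response object is an assumption):
# Sketch: pull relevant Memory Bank facts at the start of a new session
response = await memory_service.search_memory(
    app_name = app_name,
    user_id = "USER_ID",
    query = "user preferences and ongoing tasks",
)
for memory in response.memories:  # response shape is an assumption
    print(memory)  # inject these into the agent's instruction or context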
Storage Backend Comparison
- InMemorySessionService / InMemoryMemoryService: RAM only, zero setup, data lost on restart. Best for development and testing.
- DatabaseSessionService: SQLite, PostgreSQL, or MySQL; persists across restarts; you manage the infrastructure. Best for self-hosted production.
- VertexAiSessionService + VertexAiRagMemoryService: fully managed on Google Cloud, scales automatically, requires GCP setup. Best for managed production.
- Vertex Memory Bank: managed extraction and consolidation of memories by Gemini; works across ADK, LangGraph, and CrewAI.
At a Glance
A dual-component system: short-term memory (session context window) for the current conversation, and long-term memory (external vector store) for knowledge that persists across sessions.
Without memory, every conversation starts from zero. Agents can't maintain context, track progress, personalize responses, or learn from past interactions — making them useless for anything beyond single-turn Q&A.
Use short-term memory (Session + State) for any agent handling multi-turn conversations. Add long-term memory (MemoryService) when users expect personalization or continuity across sessions.
Why Vector Databases Are Used for Long-Term Memory
You might wonder: why not just store memories in a regular SQL database and retrieve them with SQL queries? The answer reveals something fundamental about how language and meaning work.
The keyword matching problem. SQL requires exact string matching. If a user once said “I prefer short answers,” you might store this under the key "user_preferences". But what if next session you want to retrieve “things that affect response length”? The string “response length” doesn’t match “user_preferences” — so you’d miss it entirely.
Language has meaning that transcends exact words. “I dislike verbose responses” and “please be concise” and “keep it brief” all mean the same thing, but share no keywords.
How vector databases solve this. Instead of storing raw text, vector databases store text as embeddings — lists of numbers (typically 1,536 or more numbers) that encode the meaning of the text in a high-dimensional space. Sentences with similar meanings end up with similar embeddings (similar numbers), regardless of which exact words they use.
When you search a vector database with the query “things that affect response length,” the system converts your query to an embedding and finds the stored texts whose embeddings are mathematically closest to your query’s embedding — even if they share no keywords. This is called semantic search — search by meaning rather than by keywords.
How embeddings work intuitively. Think of placing words and sentences in a 3D space (in reality it’s 1,536-dimensional, but the concept is the same). Words with similar meanings are placed near each other. “Car,” “automobile,” and “vehicle” cluster together. “Dog” and “puppy” cluster together, not near “car.” Sentences with similar meanings also land near each other. When you embed “I prefer brief responses” and “keep answers short,” they land near each other in this space. A search for “response length preferences” also lands near both — the database returns them as relevant matches.
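To make that intuition concrete, here is a toy sketch with made-up 3-D vectors (real embeddings come from an embedding model and have 1,536 or more dimensions):
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings: similar meanings get similar vectors
emb = {
    "I prefer brief responses": [0.90, 0.10, 0.20],
    "keep answers short":       [0.85, 0.15, 0.25],
    "dogs make great pets":     [0.10, 0.90, 0.30],
}
query = [0.88, 0.12, 0.22]  # pretend embedding of "response length preferences"

# Rank stored texts by closeness to the query: the two "short answers"
# sentences win despite sharing no keywords with the query.
for text, vec in sorted(emb.items(), key=lambda kv: -cosine_similarity(query, kv[1])):
    print(f"{cosine_similarity(query, vec):.3f}  {text}")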
The practical flow in an agent. Here is how memory retrieval works in a real ADK agent with VertexAiRagMemoryService: (1) User starts a new session with no prior context. (2) Before processing the user’s first message, the system runs: memories = await memory_service.search_memory(app_name, user_id, query=user_message). (3) The memory service converts user_message to an embedding and searches for semantically similar past memories. (4) The top-K most relevant memories are retrieved and injected into the agent’s context: “From past conversations: User prefers brief responses. User is a data scientist in Tokyo.” (5) The agent now has personalized context without the user having to repeat themselves. (6) After the session ends: await memory_service.add_session_to_memory(session) — key facts are extracted and stored for future retrieval.
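In code, that flow looks roughly like this (format_memories and the instruction wiring are illustrative, not ADK APIs):
async def run_personalized_session(user_id: str, first_message: str):
    # Steps 1-3: new session, semantic lookup of relevant past memories
    memories = await memory_service.search_memory(
        app_name="my_app", user_id=user_id, query=first_message,
    )
    # Step 4: inject retrieved facts into the agent's context
    context = "From past conversations: " + format_memories(memories)  # hypothetical helper
    session = await session_service.create_session(app_name="my_app", user_id=user_id)
    # Step 5: run the agent with `context` prepended to its instruction (omitted)
    # Step 6: after the session ends, extract and store key facts for next time
    await memory_service.add_session_to_memory(session)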
This is what makes agents feel “intelligent” about user preferences — they’re not actually remembering in the human sense, they’re retrieving relevant records from a database.
Common Mistakes When Implementing Memory
Mistake 1: Storing everything in long-term memory. Not every message deserves long-term storage. “What time is it?” and “Tell me a joke” don’t contain information worth remembering across sessions. Use selective extraction — store only facts, preferences, and task-relevant context. Otherwise, your memory service fills with noise that makes semantic search less precise.
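One hedged way to implement selective extraction is to gate every candidate memory through a cheap classifier call (the prompt and YES/NO protocol here are illustrative):
EXTRACT_PROMPT = (
    "Does this message contain a lasting fact or preference about the user "
    "worth remembering across sessions? Answer YES or NO.\n\nMessage: {msg}"
)

def worth_remembering(llm, message: str) -> bool:
    # llm is assumed to be a chat model returning a message with .content
    verdict = llm.invoke(EXTRACT_PROMPT.format(msg=message))
    return verdict.content.strip().upper().startswith("YES")

# Usage: only gate-passing messages reach long-term memory
# if worth_remembering(llm, user_message):
#     store.put(namespace, key, {"fact": user_message})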
Mistake 2: Not scoping state keys correctly. Using session.state["name"] (no prefix) when you mean session.state["user:name"] (user-scoped) means the name disappears after the session ends. Always think carefully about the lifecycle you need: session-only, user-persistent, app-wide, or turn-only.
Mistake 3: Reading state directly instead of through events. session.state["key"] = value bypasses the event log and may not persist. Always update state through output_key on agents or through EventActions.state_delta when appending events. This ensures changes are logged, persisted, and timestamped correctly.
Mistake 4: Using InMemorySessionService in production. InMemorySessionService stores all session data in RAM. When the server restarts (which happens regularly in cloud deployments), all session data is lost. Use DatabaseSessionService or VertexAiSessionService for any production deployment where data persistence matters.
Mistake 5: No memory retrieval at session start. The memory service only helps if you actually query it at the beginning of each session and inject the results into the agent’s context. If you create the MemoryService but never call search_memory, you have a write-only memory — agents store information but never benefit from past sessions.
Key Takeaways
- Memory has two fundamentally different problems: short-term (within a conversation) and long-term (across conversations). They require different storage mechanisms and have different access patterns.
- ADK's three primitives: Session tracks the conversation thread. State is the session's temporary scratchpad (with namespace prefixes for scope). MemoryService is the searchable long-term knowledge store.
- Update state through append_event, not direct dict writes. session.state["key"] = value bypasses event tracking and may not persist. Use output_key on agents or EventActions.state_delta in events.
- State key prefixes determine persistence: user: persists across sessions for that user. app: is global. temp: clears each turn. No prefix = session-only.
- Long-term memory uses vector search. Semantic retrieval finds relevant memories by meaning, not keyword. similarity_top_k and vector_distance_threshold control how much you retrieve and how relevant it must be.
- LangChain's ConversationBufferMemory handles short-term automatically — loads history into the prompt, saves each turn. Use return_messages=True for chat models.
- LangGraph's InMemoryStore supports three long-term memory types: Semantic (facts), Episodic (past task solutions), and Procedural (self-updating instructions via Reflection).
- Vertex Memory Bank is the fully managed option — Gemini analyzes conversations, extracts facts, resolves contradictions, and provides personalized retrieval across ADK, LangGraph, and CrewAI.
Next up — Chapter 9: Human-in-the-Loop, where agents pause, ask for approval, and incorporate human judgment at critical decision points.