ARTICLE · 18 MIN READ · JANUARY 17, 2026
Chapter 4: Reflection
First drafts are rarely final. Reflection gives agents the ability to critique their own outputs, find what's wrong, and iterate toward better results — before returning anything to you.
The Problem with First Drafts
Key terms used in this chapter:

- System prompt: A set of instructions sent to the LLM before the user's message. It defines the AI's persona, rules, and behavior, e.g., "You are a senior code reviewer. Be critical and look for bugs." The same model behaves very differently with different system prompts.
- Feedback loop: A cycle where the output of one step becomes input to the same (or an earlier) step. Reflection is a feedback loop: generate → evaluate → the evaluation feeds back into the generation.
- Deterministic vs probabilistic: Deterministic means the same input always gives the same output (like a calculator). Probabilistic means the same input might give different outputs (like an LLM). Temperature 0 makes LLMs more deterministic but never fully so.
- Iteration: One complete cycle of "try → evaluate → try again." Three iterations means the agent attempted to improve the output three times.
In Chapters 1 through 3, every pattern shares one assumption: once a step produces output, that output moves forward. The chain trusts it. The router acts on it. The synthesizer combines it.
But what if the output is wrong?
LLMs make mistakes. They miss edge cases, hallucinate facts, generate code with bugs, write summaries that lose key details. A single-pass pipeline has no way to catch any of this. It just passes the mistake downstream — and by the time it reaches the user, the error is baked in.
The fix is to add a feedback loop: make the agent evaluate its own output before declaring it done.
That’s reflection — one of the most powerful patterns in agentic AI.
The mechanism behind why it works. When you give the LLM a different system prompt for the critique step — “You are a senior software engineer performing a meticulous code review” — you’re not just changing words. You’re changing the entire context in which the model generates its next tokens. The model has learned, from millions of examples of code reviews, what senior engineers look for: off-by-one errors, missing edge cases, documentation gaps, security vulnerabilities, inefficient algorithms. That behavioral pattern is encoded in the model’s weights. The system prompt activates it.
This is why the same model — using the exact same underlying neural network — can produce dramatically better results as a two-call system than as a single call: the critic’s system prompt activates a different behavioral mode, one specifically trained to find problems, rather than the creator’s mode that’s trained to generate solutions. The bias switches from “make something plausible” to “find what’s wrong with this.”
Why self-review is cognitively difficult. When you generate something, you’ve already committed to a mental model of how it works. Reviewing it immediately afterward, you tend to read what you intended to write rather than what you actually wrote. Your brain auto-corrects the errors before you consciously notice them. This is why writers are told to wait a day before editing — fresh eyes catch what tired eyes miss. The same phenomenon applies to LLMs: a separate “critic” call with a fresh context has no prior commitment to the generated output and approaches it with genuine scrutiny.
The feedback loop that matters. What separates reflection from a simple two-step chain is the iterative loop. After the critic identifies problems and the producer corrects them, the critic runs again on the corrected output. This continues until the critic is satisfied or a maximum iteration limit is reached. Each iteration should produce a measurably better output — you can see this clearly in the quality chart below.
What Reflection Is
Reflection is the pattern where an agent:
- Executes — produces an initial output
- Evaluates — critiques that output against specific criteria
- Refines — generates an improved version based on the critique
- Repeats — until the output meets a quality bar or a max iteration count is hit
The key difference from a simple chain: the arrow goes backward. Output becomes input again. This is a feedback loop, not a pipeline.
The core insight: A model reviewing its own work with a different system prompt — "you are a senior code reviewer" vs "you are a code generator" — behaves fundamentally differently. The reviewer prompt surfaces errors the generator prompt wouldn't.
The Feedback Loop, Step by Step
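Before the framework-specific implementations later in this chapter, here is a minimal, framework-agnostic sketch of the loop. The parameter names (`generate`, `critique`, `is_done`, `refine`) are illustrative placeholders for whatever LLM calls or checks you plug in, not part of any library; the point is the shape of the control flow.

from typing import Callable

def reflect(
    task: str,
    generate: Callable[[str], str],           # produces a first draft from the task
    critique: Callable[[str, str], str],      # evaluates a draft against the task
    is_done: Callable[[str], bool],           # reads the critique and decides "good enough?"
    refine: Callable[[str, str, str], str],   # rewrites the draft using the critique
    max_iterations: int = 3,
) -> str:
    """Generic reflection loop: generate, evaluate, refine, repeat."""
    output = generate(task)                    # step 1: initial output
    for _ in range(max_iterations):
        review = critique(task, output)        # step 2: evaluate against specific criteria
        if is_done(review):
            break                              # quality bar met: stop early
        output = refine(task, output, review)  # step 3: improved version based on the critique
    return output

The LangChain example later in the chapter fills these slots with LLM calls inside an explicit for loop; the ADK example implements the generate and critique steps as separate agents.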
Self-Reflection vs Producer-Critic
There are two ways to implement reflection. The choice changes both the quality of the output and the architecture of the system. Understanding when to choose each approach is as important as understanding how to implement them.
Approach 1: Self-Reflection
A single agent generates output, then switches roles to critique it.
Simpler to implement. One model, two system prompts. But the same model that generated the output is also evaluating it — it tends to be less critical of its own work.
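A minimal sketch of self-reflection in LangChain terms (the task and prompts here are illustrative, not from the fuller example later in the chapter): one model, one conversation, and the critique is just another turn appended to the same thread.

from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage

llm = ChatOpenAI(model="gpt-4o", temperature=0.1)

history = [HumanMessage(content="Write a Python function that merges two sorted lists.")]
draft = llm.invoke(history)          # role 1: generator
history.append(draft)

history.append(HumanMessage(content=(
    "Now act as a strict code reviewer. List every bug, missing edge case, "
    "and style problem in the code above."
)))
self_critique = llm.invoke(history)  # role 2: critic, same model, same thread
history.append(self_critique)

history.append(HumanMessage(content="Rewrite the code, fixing every issue you listed."))
revised = llm.invoke(history)        # refined draft, shaped by its own critique

Because the critique happens inside the same conversation that produced the draft, the model carries its original assumptions into the review, which is exactly the objectivity limitation noted above.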
Approach 2: Producer-Critic
Two distinct agents with separate roles and personas.
More powerful. The Critic has a completely different system prompt — “You are a senior software engineer”, “You are a meticulous fact-checker” — and approaches the output with a fresh lens. It doesn’t have the generator’s blind spots.
| | Self-Reflection | Producer-Critic |
|---|---|---|
| Agents | 1 (two roles) | 2 (dedicated) |
| Objectivity | Lower — same model bias | Higher — separate perspective |
| Cost | Lower | Higher |
| Critique quality | General | Specialized |
| Best for | Quick refinement | High-stakes, quality-critical tasks |
Quality Improves with Each Iteration
(Chart: output quality by iteration, with the critique found at each point.) Notice the diminishing returns past iteration 3 — this is why reflection loops always need a stopping condition.
The Live Demo: A Bug Found and Fixed
This is a concrete reflection cycle. The initial code has a real bug; the critique finds it, and the refined version fixes it.
The initial draft:

def calculate_factorial(n):
    result = 1
    for i in range(1, n):  # ← bug: misses n itself
        result *= i
    return result

What the Critic found:

- Bug in `range(1, n)`: `calculate_factorial(5)` returns 24 instead of 120. `range(1, n)` excludes n; for n=5 it produces 1×2×3×4=24, not 120. Must be `range(1, n+1)`.
- `calculate_factorial(0)` returns 1 only by coincidence (empty loop). This is mathematically correct but undocumented.
- Negative input is not handled; the spec requires raising a `ValueError`.
- No type hints; use `def calculate_factorial(n: int) -> int` for clarity.

The refined version:

def calculate_factorial(n: int) -> int:
    """Calculate the factorial of a non-negative integer n.

    Args:
        n: A non-negative integer.

    Returns:
        The factorial of n (n!). Returns 1 when n is 0.

    Raises:
        ValueError: If n is negative.
    """
    if n < 0:
        raise ValueError(f"Input must be non-negative, got {n}.")
    if n == 0:
        return 1
    result = 1
    for i in range(1, n + 1):  # ← fixed
        result *= i
    return result

Six Situations Where Reflection Pays Off
Code Generation
Write code, run static analysis or tests, feed results back — the agent fixes its own bugs before you see the output.
Benefit: catches runtime errors, logic bugs, style issues.

Long-Form Content
Generate a draft, critique for tone, flow, and clarity, rewrite. Repeat until the piece reads like something an editor approved.
Benefit: polished prose without human editing rounds.

Summarization
Generate a summary, compare against the source document for missed key points, refine until complete and accurate.
Benefit: reduces hallucinations and key-point omissions.

Planning
Propose a plan, evaluate feasibility and constraint violations, revise. Don't hand over a plan that fails on day one.
Benefit: more realistic, executable plans.

Fact-Checking
A Critic agent with a "fact-checker" persona reviews every claim in the draft and flags anything that needs sourcing or correction.
Benefit: structural defense against hallucination.

Complex Reasoning
Propose a reasoning step, evaluate whether it leads closer to the solution or introduces contradictions, backtrack if needed.
Benefit: enables multi-step problem solving.

The LangChain Way: LCEL Reflection Loop
The LangChain implementation uses conversation history as the state that carries context between generation and critique cycles. Each iteration appends messages to a growing list, so the model always has full context on what it produced and what the critique said.
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

load_dotenv()  # load OPENAI_API_KEY from a local .env file
Why `messages` instead of `ChatPromptTemplate`? This example uses raw message lists rather than prompt templates. That's because the conversation history needs to grow dynamically — each iteration appends a new critique and a new refinement. Template-based prompts have fixed slots; message lists are append-only, which is exactly what a growing feedback loop needs.
llm = ChatOpenAI(model="gpt-4o", temperature=0.1)
`temperature=0.1` — very low. For code generation and code review, you want consistent, near-deterministic outputs. Creativity hurts here; precision helps.
The Task Definition
task_prompt = """
Your task is to create a Python function named `calculate_factorial`.
Requirements:
1. Accept a single integer `n` as input.
2. Calculate its factorial (n!).
3. Include a clear docstring.
4. Handle edge cases: factorial of 0 is 1.
5. Raise ValueError for negative input.
"""
This is the source of truth that both the Generator and the Critic reference. The Critic compares every output against this spec — so the spec must be complete and unambiguous from the start.
The Reflection Loop
max_iterations = 3
current_code = ""
message_history = [HumanMessage(content=task_prompt)]  # starts with just the task

for i in range(max_iterations):
    # ── STAGE 1: GENERATE (or REFINE) ──────────────────────────────────
    if i == 0:
        # First pass: just the task prompt
        response = llm.invoke(message_history)
    else:
        # Subsequent passes: task + previous code + critique
        message_history.append(
            HumanMessage(content="Refine the code using the critiques provided.")
        )
        response = llm.invoke(message_history)

    current_code = response.content
    message_history.append(response)  # add generated code to history
Why append to history? The model receives the entire conversation on each call. By appending both the generated code and the critique, the model knows:
- The original task (what was asked)
- What it previously generated (so it doesn’t repeat it)
- What the critic said (so it knows what to fix)
Without this, each iteration would generate from scratch with no awareness of previous attempts.
    # ── STAGE 2: REFLECT (CRITIQUE) ────────────────────────────────────
    reflector_messages = [
        SystemMessage(content="""
You are a senior software engineer. Perform a meticulous code review.
Evaluate the code against the original task requirements.
Check for: bugs, missing edge cases, style issues, incomplete docstrings.
If the code is perfect, respond with exactly: CODE_IS_PERFECT
Otherwise, provide a bulleted list of specific critiques.
"""),
        HumanMessage(content=f"Task:\n{task_prompt}\n\nCode:\n{current_code}")
    ]

    critique_response = llm.invoke(reflector_messages)
    critique = critique_response.content
Why a separate system prompt for the Critic? This is the Producer-Critic split within a single LangChain call. By giving the same model a completely different system prompt (“senior software engineer performing a code review”), you get a different reasoning stance. The model is no longer generating — it’s scrutinizing.
Why not add the Critic’s system prompt to the main conversation history? Because you want the Critic to always evaluate from a fresh perspective, not a perspective shaped by the previous generation attempts. Each critique call is independent.
    # ── STOPPING CONDITION ──────────────────────────────────────────────
    if "CODE_IS_PERFECT" in critique:
        break  # stop early — quality bar met

    # Add critique to history so next iteration can fix it
    message_history.append(
        HumanMessage(content=f"Critique:\n{critique}")
    )
Early stopping matters. Without a stopping condition, the loop runs all `max_iterations` even when the output is already good — wasting API calls. The `CODE_IS_PERFECT` sentinel lets the Critic signal satisfaction explicitly.
Message History at Iteration 2
message_history after 2 iterations:
┌──────────────────────────────────────────────────────────┐
│ [0] Human: "Write a function that calculates factorial…" │ ← task
│ [1] AI: "def calculate_factorial(n):\n result=1…" │ ← v1 code
│ [2] Human: "Critique:\n• Bug in range(1,n)…\n• No doc…" │ ← critique 1
│ [3] Human: "Refine the code using the critiques." │ ← trigger
│ [4] AI: "def calculate_factorial(n: int) -> int:\n…" │ ← v2 code
└──────────────────────────────────────────────────────────┘
The model at iteration 2 sees the full thread and knows exactly what changed and why.
(Diagram: the full reflection data flow.)
The Google ADK Way: Generator-Critic
The ADK version uses session state (key-value store) instead of message history for passing data between agents. The architecture is simpler to read but less flexible for complex multi-turn loops.
from google.adk.agents import SequentialAgent, LlmAgent
Why only `SequentialAgent` and `LlmAgent`? The ADK has a `LoopAgent` for true iterative loops, but the core reflection concept is demonstrated here in a single generate → critique cycle using `SequentialAgent`. Two agents, run in order, sharing state via `output_key`.
The Producer Agent
generator = LlmAgent(
    name="DraftWriter",
    description="Generates initial draft content on a given subject.",
    instruction="Write a short, informative paragraph about the user's subject.",
    output_key="draft_text",  # stores output in session state
)
`output_key`: When `DraftWriter` completes, its output is stored as `session_state["draft_text"]`. Any subsequent agent can read this by referencing `{draft_text}` in its instruction template.

Why not just pass the output directly? ADK agents are independent workers. They don't pass return values to each other — they communicate through shared session state. This decouples the producer from the consumer; you can add more critics, rearrange them, or branch without modifying the producer.
The Critic Agent
reviewer = LlmAgent(
    name="FactChecker",
    description="Reviews text for factual accuracy and provides structured critique.",
    instruction="""
    You are a meticulous fact-checker.
    1. Read the text provided in the state key 'draft_text':
       {draft_text}
    2. Carefully verify the factual accuracy of all claims.
    3. Your final output must be a dictionary with two keys:
       - "status": "ACCURATE" or "INACCURATE"
       - "reasoning": A clear explanation citing specific issues if any.
    """,
    output_key="review_output",  # stores critique in session state
)
Why `{draft_text}` in the instruction? The ADK automatically fills `{key}` placeholders from session state before calling the model. So the Critic's actual prompt at runtime contains the full text that `DraftWriter` produced — without any manual wiring.

Why structured output (a dict with `status` + `reasoning`)? Structured output is machine-readable. A downstream agent or your own application code can parse `session_state["review_output"]["status"]` to decide whether to trigger another iteration — without parsing free-form text.
The Pipeline
review_pipeline = SequentialAgent(
    name="WriteAndReviewPipeline",
    sub_agents=[generator, reviewer],
)
`SequentialAgent` guarantees `generator` completes before `reviewer` starts. This matters because `reviewer` reads from `session_state["draft_text"]` — which must already exist.
Execution State at Each Step
Session State Evolution:
┌─────────────────────────────────────────────────────┐
│ Before: {} │
│ │
│ After DraftWriter: │
│ { "draft_text": "Solar panels convert sunlight…" }│
│ │
│ After FactChecker: │
│ { "draft_text": "Solar panels convert sunlight…", │
│ "review_output": { │
│ "status": "INACCURATE", │
│ "reasoning": "Claim about 40% efficiency │
│ is incorrect — max ~26%…" │
│ } │
│ } │
└─────────────────────────────────────────────────────┘
ADK Orchestration
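The `SequentialAgent` above runs exactly one generate → critique pass. Turning that into orchestration means having application code (or a wrapping `LoopAgent`) read the structured critique and decide whether to trigger another cycle. The sketch below is an illustration under assumptions: it presumes the pipeline has already run (how you run it, e.g. via an ADK `Runner`, depends on your setup) and that the session state looks like the diagram above; the helper names are my own, not ADK APIs.

def needs_another_cycle(state: dict) -> bool:
    """True when the FactChecker marked the draft inaccurate."""
    review = state.get("review_output", {})
    return review.get("status") == "INACCURATE"

def build_revision_request(state: dict) -> str:
    """Turn the structured critique into an instruction for the next draft."""
    review = state["review_output"]
    return (
        f"Revise the draft to fix these issues: {review['reasoning']}\n\n"
        f"Draft:\n{state['draft_text']}"
    )

# Illustrative session state, mirroring the diagram above
session_state = {
    "draft_text": "Solar panels convert sunlight…",
    "review_output": {
        "status": "INACCURATE",
        "reasoning": "Claim about 40% efficiency is incorrect.",
    },
}

if needs_another_cycle(session_state):
    revision_request = build_revision_request(session_state)
    # feed revision_request back to DraftWriter as the next user message

This is the payoff of the structured critique: the decision to iterate is a dictionary lookup, not a parse of free-form prose.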
Side by Side: LangChain vs ADK
| | LangChain (LCEL) | Google ADK |
|---|---|---|
| State mechanism | Growing message_history list | Key-value session state (output_key) |
| Iteration control | Explicit for loop + early-break | LoopAgent (or manual SequentialAgent chains) |
| Critique format | Free-form string in conversation | Structured dict (better for programmatic use) |
| Role switching | New SystemMessage per call | Separate LlmAgent with dedicated instruction |
| Context window risk | Higher — history grows each iteration | Lower — only relevant state keys passed |
| Stopping condition | Sentinel string (CODE_IS_PERFECT) | Status field in structured output |
| Best for | Iterative loops where history matters | Single-cycle generate → critique → act |
The Trade-offs You Can’t Ignore
Reflection is not free. Every iteration adds:
Cost
Each iteration is at least 2 LLM calls (generator + critic). A 3-iteration loop = 6 API calls. At GPT-4o pricing, this adds up fast for high-volume tasks.
Latency
Iterations are sequential — each must complete before the next begins. At roughly 3 s per call, a 3-iteration loop is 6 sequential calls, so about 18 s minimum. Unsuitable for real-time use cases.
Context Growth
Each iteration appends to the conversation history. After 4-5 cycles on a long document, you may approach the model's context window limit.
Infinite Loops
Without a firm max_iterations cap and a clear stopping condition, the agent can loop indefinitely — especially if the Critic keeps finding minor issues.
Rule of thumb: Use reflection when quality and accuracy matter more than speed and cost. Don’t use it as the default — use it selectively for the parts of your pipeline where errors are expensive.
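One way to apply that rule in code: gate the reflection loop behind a cheap pre-check so most outputs never pay for it (the "quality pre-check" mentioned again in the takeaways). The callables below are illustrative placeholders, not library APIs.

def maybe_reflect(output: str, cheap_check, reflect_fn):
    """Run the expensive reflection loop only when a cheap pre-check fails."""
    if cheap_check(output):       # e.g., a linter pass, a schema or length check
        return output             # low-risk output: skip reflection entirely
    return reflect_fn(output)     # errors are likely and expensive: pay for the loop

# usage sketch: only reflect on drafts that fail a trivial sanity check
# final = maybe_reflect(draft, cheap_check=lambda s: "TODO" not in s, reflect_fn=run_reflection)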
At a Glance
What it is: A feedback loop where an agent evaluates its own output — or a dedicated Critic agent evaluates it — and uses the critique to generate an improved version.

Why it matters: First-pass outputs from LLMs are often incomplete, inaccurate, or non-compliant. Reflection adds a self-correction layer — catching errors before they reach the user.

When to use it: Use it when output quality matters more than speed. Use a separate Critic agent (not self-reflection) when the task requires specialized evaluation — code review, fact-checking, compliance.
Common Mistakes When Implementing Reflection
Mistake 1: No maximum iteration limit. Without a max_iterations cap, a reflection loop can run indefinitely if the critic always finds something to improve. Always set a firm limit. Three iterations is usually enough for most tasks; five is the practical maximum before diminishing returns dominate.
Mistake 2: Vague critique criteria. A critic system prompt that says “review the code for quality” will produce inconsistent, general feedback. Be specific: “Check for off-by-one errors, missing input validation, undocumented edge cases, and incorrect handling of empty inputs.” Specific criteria produce actionable, concrete critiques that the producer can actually act on.
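To make that concrete, compare a vague critic prompt with a specific one (both are made-up examples, not from the code above); only the second reliably yields critiques the producer can act on.

# Vague: produces generic, inconsistent feedback
vague_critic_prompt = "You are a code reviewer. Review the code for quality."

# Specific: names the exact failure modes to hunt for
specific_critic_prompt = """
You are a senior Python reviewer. Check the code for:
1. Off-by-one errors in loops and slicing.
2. Missing input validation (None, negative numbers, wrong types).
3. Undocumented edge cases (empty inputs, zero, very large values).
4. Docstring completeness: Args, Returns, Raises.
Respond with a numbered list of concrete issues, or exactly OK if none.
"""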
Mistake 3: Producer doesn’t read the critique. The producer’s refinement prompt must explicitly include the critique and instruct the model to address every point. Simply passing “here’s the critique, now rewrite” sometimes leads the model to make superficial changes while ignoring specific issues. Add: “Address each point in the critique above specifically, and explain what you changed.”
Mistake 4: Not tracking which iteration produced which output. When debugging, you need to know whether the quality improved with each iteration. Log the output and the critique at every step. This also helps you identify when to stop — sometimes the output peaks at iteration 2 and gets worse at iteration 3 (over-optimization is real).
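A small trace object is enough to keep that record; the class below is a sketch of my own (not part of either framework) meant to slot into the LangChain loop shown earlier, with `trace.log(i, current_code, critique)` called right after the critique step.

from dataclasses import dataclass, field

@dataclass
class IterationRecord:
    iteration: int
    output: str
    critique: str

@dataclass
class ReflectionTrace:
    records: list[IterationRecord] = field(default_factory=list)

    def log(self, iteration: int, output: str, critique: str) -> None:
        self.records.append(IterationRecord(iteration, output, critique))

    def dump(self) -> None:
        # print the critique at each iteration so you can see where quality peaked
        for r in self.records:
            print(f"--- iteration {r.iteration} ---")
            print(r.critique)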
Mistake 5: Using reflection for everything. Reflection adds significant latency (multiple LLM calls) and cost. Don’t use it for simple, low-stakes outputs. Use it when: (1) the output quality directly affects user experience, (2) the task involves verifiable correctness (code, factual content), or (3) the cost of a wrong answer is high.
Key Takeaways
- Reflection = feedback loop. The output doesn’t move forward until it passes evaluation. Generate → critique → refine → repeat.
- Producer-Critic beats self-reflection for quality-critical tasks. A separate system prompt creates a genuinely different perspective; the same model reviewing its own work has blind spots.
- The stopping condition is mandatory. Always pair a `max_iterations` cap with an explicit quality signal (sentinel string, structured status field). Without both, loops run forever.
- LangChain uses growing message history — the full conversation thread (task, code, critique, refinement) is passed on each call. Natural for iterative loops; risks context window overflow at scale.
- ADK uses session state — agents write to named keys, other agents read from them. Cleaner data handoff; structured output is easier to act on programmatically.
- Cost scales with iterations. Each reflection cycle is 2+ LLM calls. Budget for it or gate reflection behind a quality pre-check.
- Connections forward: Reflection pairs naturally with memory (Chapter 8) — a Critic that remembers what it critiqued before can give progressively sharper feedback. It also anchors to goal-setting (Chapter 11) — the goal is the benchmark the Critic evaluates against.
Next up — Chapter 5: Tool Use, where agents stop reasoning in text and start taking real actions in the world.