ARTICLE  ·  18 MIN READ  ·  JANUARY 17, 2026

Chapter 4: Reflection

First drafts are rarely final. Reflection gives agents the ability to critique their own outputs, find what's wrong, and iterate toward better results — before returning anything to you.


The Problem with First Drafts

Before You Start — Key Terms Explained

System prompt: A set of instructions sent to the LLM before the user's message. It defines the AI's persona, rules, and behavior. e.g., "You are a senior code reviewer. Be critical and look for bugs." The same model behaves very differently with different system prompts.

Feedback loop: A cycle where the output of one step becomes input to the same (or earlier) step. Reflection is a feedback loop: generate → evaluate → the evaluation feeds back into the generation.

Deterministic vs probabilistic: Deterministic means the same input always gives the same output (like a calculator). Probabilistic means the same input might give different outputs (like an LLM). Temperature 0 makes LLMs more deterministic but never fully so.

Iteration: One complete cycle of "try → evaluate → try again." Three iterations means the agent attempted to improve the output three times.

In Chapters 1 through 3, every pattern shares one assumption: once a step produces output, that output moves forward. The chain trusts it. The router acts on it. The synthesizer combines it.

But what if the output is wrong?

LLMs make mistakes. They miss edge cases, hallucinate facts, generate code with bugs, write summaries that lose key details. A single-pass pipeline has no way to catch any of this. It just passes the mistake downstream — and by the time it reaches the user, the error is baked in.

The fix is to add a feedback loop: make the agent evaluate its own output before declaring it done.

That’s reflection — one of the most powerful patterns in agentic AI.

The mechanism behind why it works. When you give the LLM a different system prompt for the critique step — “You are a senior software engineer performing a meticulous code review” — you’re not just changing words. You’re changing the entire context in which the model generates its next tokens. The model has learned, from millions of examples of code reviews, what senior engineers look for: off-by-one errors, missing edge cases, documentation gaps, security vulnerabilities, inefficient algorithms. That behavioral pattern is encoded in the model’s weights. The system prompt activates it.

This is why the same model — using the exact same underlying neural network — can produce dramatically better results as a two-call system than as a single call: the critic’s system prompt activates a different behavioral mode, one specifically trained to find problems, rather than the creator’s mode that’s trained to generate solutions. The bias switches from “make something plausible” to “find what’s wrong with this.”

Why self-review is cognitively difficult. When you generate something, you’ve already committed to a mental model of how it works. Reviewing it immediately afterward, you tend to read what you intended to write rather than what you actually wrote. Your brain auto-corrects the errors before you consciously notice them. This is why writers are told to wait a day before editing — fresh eyes catch what tired eyes miss. The same phenomenon applies to LLMs: a separate “critic” call with a fresh context has no prior commitment to the generated output and approaches it with genuine scrutiny.

The feedback loop that matters. What separates reflection from a simple two-step chain is the iterative loop. After the critic identifies problems and the producer corrects them, the critic runs again on the corrected output. This continues until the critic is satisfied or a maximum iteration limit is reached. Each iteration should produce a measurably better output — though the gains shrink with each pass.

What Reflection Is

Reflection is the pattern where an agent:

  1. Executes — produces an initial output
  2. Evaluates — critiques that output against specific criteria
  3. Refines — generates an improved version based on the critique
  4. Repeats — until the output meets a quality bar or a max iteration count is hit
REFLECTION FEEDBACK LOOP

Task (initial prompt or goal)
  → Execute — Producer Agent: generates initial output (code, text, plan)
  → Evaluate — Critic Agent: reviews against criteria, finds bugs, gaps, inaccuracies
  → Satisfactory?
      Needs work → Refine: producer rewrites using the critique → back to Evaluate
      Approved   → Final Output: quality bar met, done

The key difference from a simple chain: the arrow goes backward. Output becomes input again. This is a feedback loop, not a pipeline.

The core insight: A model reviewing its own work with a different system prompt — “you are a senior code reviewer” vs “you are a code generator” — behaves fundamentally differently. The second prompt surfaces errors the first one wouldn’t.
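Framework aside, the pattern reduces to a short loop. Here is a minimal sketch — `generate`, `critique`, and `is_satisfactory` are placeholders for your own LLM calls and quality check, not library functions:

```python
def reflect(task, generate, critique, is_satisfactory, max_iterations=3):
    """Generic reflection loop: execute -> evaluate -> refine -> repeat."""
    output = generate(task, None)           # Execute: first draft, no feedback yet
    for _ in range(max_iterations):
        review = critique(task, output)     # Evaluate against the task criteria
        if is_satisfactory(review):         # Quality bar met -> stop early
            break
        output = generate(task, review)     # Refine: critique feeds back in
    return output
```

Both implementations later in this chapter — LangChain's message-history loop and the ADK's state-passing pipeline — are elaborations of this skeleton.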

The Feedback Loop, Step by Step

01 — EXECUTE: Producer generates initial output (code, text, plan…)
02 — EVALUATE: Critic checks output against criteria — finds bugs, gaps, inaccuracies
03 — REFINE: Producer rewrites using the critique as a guide, then returns to step 02 — typically capped at 3 iterations

Self-Reflection vs Producer-Critic

There are two ways to implement reflection. The choice changes both the quality of the output and the architecture of the system. Understanding when to choose each approach is as important as understanding how to implement them.

Approach 1: Self-Reflection

A single agent generates output, then switches roles to critique it.

SELF-REFLECTION — one agent, two system prompts, one loop

Task
  → Same Agent — Generator (system prompt: "write code")
  → Same Agent — Critic (system prompt: "review code")
  → Needs work → loops back to Generator
  → Approved   → Output

Simpler to implement. One model, two system prompts. But the same model that generated the output is also evaluating it — it tends to be less critical of its own work.

Approach 2: Producer-Critic

Two distinct agents with separate roles and personas.

PRODUCER-CRITIC — two separate agents, objective review

Task (what needs to be created)
  → Producer Agent: generates code, text, or plan — first draft
  → Draft Output: passed to the critic for review
  → Critic Agent: different system prompt · fresh perspective · finds issues
  → Approved?
      No  → back to Producer: critique used to improve the next draft
      Yes → Final Output: quality bar met

More powerful. The Critic has a completely different system prompt — “You are a senior software engineer”, “You are a meticulous fact-checker” — and approaches the output with a fresh lens. It doesn’t have the generator’s blind spots.

                  Self-Reflection            Producer-Critic
Agents            1 (two roles)              2 (dedicated)
Objectivity       Lower — same model bias    Higher — separate perspective
Cost              Lower                      Higher
Critique quality  General                    Specialized
Best for          Quick refinement           High-stakes, quality-critical tasks

Quality Improves with Each Iteration

Output quality climbs steeply over the first two or three critique-and-refine cycles, then flattens. These diminishing returns past iteration 3 are exactly why reflection loops always need a stopping condition.

The Live Demo: A Bug Found and Fixed

This is a concrete reflection cycle. The initial code has a real bug; below, the critic finds it and the producer fixes it.

PRODUCER-CRITIC LIVE DEMO — Factorial Function
PRODUCER OUTPUT — initial draft
def calculate_factorial(n):
    result = 1
    for i in range(1, n):  # ← bug: misses n itself
        result *= i
    return result
5 lines. No docstring. No edge cases. Hidden off-by-one bug in range(1, n): calculate_factorial(5) returns 24 instead of 120.
CRITIC OUTPUT — structured review
✗ Off-by-one error — range(1, n) excludes n. For n=5: produces 1×2×3×4=24, not 120. Must be range(1, n+1).
✗ Missing edge case — calculate_factorial(0) returns 1 only by coincidence (empty loop). This is mathematically correct but undocumented.
! No input validation — negative input (e.g., n=-3) runs silently and returns 1. Should raise ValueError.
! No docstring — function contract, args, return value, and exceptions are undocumented.
• No type hints — consider def calculate_factorial(n: int) -> int for clarity.
✗ NEEDS REVISION — 2 critical errors, 2 warnings found
PRODUCER OUTPUT — refined v2
def calculate_factorial(n: int) -> int:
    """Calculate the factorial of a non-negative integer n.

    Args:
        n: A non-negative integer.
    Returns:
        The factorial of n (n!). Returns 1 when n is 0.
    Raises:
        ValueError: If n is negative.
    """
    if n < 0:
        raise ValueError(f"Input must be non-negative, got {n}.")
    if n == 0:
        return 1
    result = 1
    for i in range(1, n + 1):  # ← fixed
        result *= i
    return result
✓ APPROVED — all issues addressed
All 5 critique points addressed. Bug fixed. Docstring complete. Edge cases handled. Type hints added.

Six Situations Where Reflection Pays Off

01 · Code Generation
Write code, run static analysis or tests, feed results back — the agent fixes its own bugs before you see the output.
→ Catches runtime errors, logic bugs, style issues

02 · Long-Form Content
Generate a draft, critique for tone, flow, and clarity, rewrite. Repeat until the piece reads like something an editor approved.
→ Polished prose without human editing rounds

03 · Summarization
Generate a summary, compare against the source document for missed key points, refine until complete and accurate.
→ Reduces hallucinations and key-point omissions

04 · Planning
Propose a plan, evaluate feasibility and constraint violations, revise. Don't hand over a plan that fails on day one.
→ More realistic, executable plans

05 · Fact-Checking
A Critic agent with a "fact-checker" persona reviews every claim in the draft and flags anything that needs sourcing or correction.
→ Structural defense against hallucination

06 · Complex Reasoning
Propose a reasoning step, evaluate whether it leads closer to the solution or introduces contradictions, backtrack if needed.
→ Enables multi-step problem solving

The LangChain Way: LCEL Reflection Loop

The LangChain implementation uses conversation history as the state that carries context between generation and critique cycles. Each iteration appends messages to a growing list, so the model always has full context on what it produced and what the critique said.

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.messages import SystemMessage, HumanMessage

load_dotenv()  # load OPENAI_API_KEY from a local .env file

Why messages instead of ChatPromptTemplate? This example uses raw message lists rather than prompt templates. That’s because the conversation history needs to grow dynamically — each iteration appends a new critique and a new refinement. Template-based prompts have fixed slots; message lists are append-only, which is exactly what a growing feedback loop needs.

llm = ChatOpenAI(model="gpt-4o", temperature=0.1)

temperature=0.1 — very low. For code generation and code review, you want deterministic, consistent outputs. Creativity hurts here; precision helps.

The Task Definition

task_prompt = """
Your task is to create a Python function named `calculate_factorial`.
Requirements:
1. Accept a single integer `n` as input.
2. Calculate its factorial (n!).
3. Include a clear docstring.
4. Handle edge cases: factorial of 0 is 1.
5. Raise ValueError for negative input.
"""

This is the source of truth that both the Generator and the Critic reference. The Critic compares every output against this spec — so the spec must be complete and unambiguous from the start.

The Reflection Loop

max_iterations = 3
current_code   = ""
message_history = [HumanMessage(content=task_prompt)]   # starts with just the task

for i in range(max_iterations):

    # ── STAGE 1: GENERATE (or REFINE) ──────────────────────────────────
    if i == 0:
        # First pass: just the task prompt
        response = llm.invoke(message_history)
    else:
        # Subsequent passes: task + previous code + critique
        message_history.append(
            HumanMessage(content="Refine the code using the critiques provided.")
        )
        response = llm.invoke(message_history)

    current_code = response.content
    message_history.append(response)        # add generated code to history

Why append to history? The model receives the entire conversation on each call. By appending both the generated code and the critique, the model knows:

  1. The original task (what was asked)
  2. What it previously generated (so it doesn’t repeat it)
  3. What the critic said (so it knows what to fix)

Without this, each iteration would generate from scratch with no awareness of previous attempts.

    # ── STAGE 2: REFLECT (CRITIQUE) ────────────────────────────────────
    reflector_messages = [
        SystemMessage(content="""
You are a senior software engineer. Perform a meticulous code review.
Evaluate the code against the original task requirements.
Check for: bugs, missing edge cases, style issues, incomplete docstrings.
If the code is perfect, respond with exactly: CODE_IS_PERFECT
Otherwise, provide a bulleted list of specific critiques.
"""),
        HumanMessage(content=f"Task:\n{task_prompt}\n\nCode:\n{current_code}")
    ]
    critique_response = llm.invoke(reflector_messages)
    critique = critique_response.content

Why a separate system prompt for the Critic? This is the Producer-Critic split implemented with a single model: giving that same model a completely different system prompt (“senior software engineer performing a code review”) in a fresh, separate call produces a different reasoning stance. The model is no longer generating — it’s scrutinizing.

Why not add the Critic’s system prompt to the main conversation history? Because you want the Critic to always evaluate from a fresh perspective, not a perspective shaped by the previous generation attempts. Each critique call is independent.

    # ── STOPPING CONDITION ───────────────────────────────────────────────
    if "CODE_IS_PERFECT" in critique:
        break                                   # stop early — quality bar met

    # Add critique to history so next iteration can fix it
    message_history.append(
        HumanMessage(content=f"Critique:\n{critique}")
    )

Early stopping matters. Without a stopping condition, the loop runs all max_iterations even when the output is already good — wasting API calls. The CODE_IS_PERFECT sentinel lets the Critic signal satisfaction explicitly.
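One caveat worth hedging against: the substring check `if "CODE_IS_PERFECT" in critique` can misfire if the critique merely mentions the sentinel (“this is not CODE_IS_PERFECT yet”). A stricter illustrative check — `quality_bar_met` is a hypothetical helper, not a library function — compares the stripped response to the sentinel:

```python
def quality_bar_met(critique: str, sentinel: str = "CODE_IS_PERFECT") -> bool:
    """True only when the critic's entire response is the approval sentinel."""
    # A substring test would also match critiques that merely *mention* the
    # sentinel; comparing the stripped response avoids that false positive.
    return critique.strip() == sentinel
```

This works because the Critic's system prompt instructs it to respond with exactly the sentinel when satisfied.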

Message History at Iteration 2

message_history after 2 iterations:
┌──────────────────────────────────────────────────────────┐
│ [0] Human: "Write a function that calculates factorial…" │  ← task
│ [1] AI:    "def calculate_factorial(n):\n    result=1…"  │  ← v1 code
│ [2] Human: "Critique:\n• Bug in range(1,n)…\n• No doc…" │  ← critique 1
│ [3] Human: "Refine the code using the critiques."        │  ← trigger
│ [4] AI:    "def calculate_factorial(n: int) -> int:\n…"  │  ← v2 code
└──────────────────────────────────────────────────────────┘

The model at iteration 2 sees the full thread and knows exactly what changed and why.

Full Reflection Data Flow

LANGCHAIN REFLECTION DATA FLOW — how code and critique pass between roles

task_prompt + message_history (growing conversation thread — task, code, critique, refinement)
  → LLM — Generator Role: writes or improves the code
  → current_code: added back to message_history for context
  → LLM — Critic Role: different system prompt · reviews code · produces critique
  → CODE_IS_PERFECT?
      No — has issues → add critique to history; generator reads it on the next iteration
      Yes — approved  → Final Output: loop exits

The Google ADK Way: Generator-Critic

The ADK version uses session state (key-value store) instead of message history for passing data between agents. The architecture is simpler to read but less flexible for complex multi-turn loops.

from google.adk.agents import SequentialAgent, LlmAgent

Why only SequentialAgent and LlmAgent? The ADK has a LoopAgent for true iterative loops, but the core reflection concept is demonstrated here in a single generate → critique cycle using SequentialAgent. Two agents, run in order, sharing state via output_key.

The Producer Agent

generator = LlmAgent(
    name        = "DraftWriter",
    description = "Generates initial draft content on a given subject.",
    instruction = "Write a short, informative paragraph about the user's subject.",
    output_key  = "draft_text",    # stores output in session state
)

output_key: When DraftWriter completes, its output is stored as session_state["draft_text"]. Any subsequent agent can read this by referencing {draft_text} in its instruction template.

Why not just pass the output directly? ADK agents are independent workers. They don’t pass return values to each other — they communicate through shared session state. This decouples the producer from the consumer; you can add more critics, rearrange them, or branch without modifying the producer.

The Critic Agent

reviewer = LlmAgent(
    name        = "FactChecker",
    description = "Reviews text for factual accuracy and provides structured critique.",
    instruction = """
You are a meticulous fact-checker.
1. Read the draft text below (injected from the state key 'draft_text'):
   {draft_text}
2. Carefully verify the factual accuracy of all claims.
3. Your final output must be a dictionary with two keys:
   - "status":    "ACCURATE" or "INACCURATE"
   - "reasoning": A clear explanation citing specific issues if any.
""",
    output_key  = "review_output",  # stores critique in session state
)

Why {draft_text} in the instruction? The ADK automatically fills {key} placeholders from session state before calling the model. So the Critic’s actual prompt at runtime contains the full text that DraftWriter produced — without any manual wiring.

Why structured output (dict with status + reasoning)? Structured output is machine-readable. A downstream agent or your own application code can parse session_state["review_output"]["status"] to decide whether to trigger another iteration — without parsing free-form text.
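As a sketch of that idea — `needs_another_pass` is a hypothetical helper, and since models sometimes return the dict as raw JSON text rather than a parsed object, both forms are handled:

```python
import json

def needs_another_pass(review_output) -> bool:
    """Decide whether the critic's structured verdict demands a revision."""
    if isinstance(review_output, str):        # model returned raw JSON text
        review_output = json.loads(review_output)
    return review_output.get("status") != "ACCURATE"
```

Application code can then call this on `session_state["review_output"]` to decide whether to re-run the producer.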

The Pipeline

review_pipeline = SequentialAgent(
    name       = "WriteAndReviewPipeline",
    sub_agents = [generator, reviewer],
)

SequentialAgent guarantees generator completes before reviewer starts. This matters because reviewer reads from session_state["draft_text"] — which must already exist.
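To make the mechanics concrete, here is a toy re-implementation of the handoff — not the real ADK, just a sketch of how {key} placeholders and output_key interact through shared state:

```python
class ToyAgent:
    """Illustrative stand-in for LlmAgent: fills {key} placeholders from
    shared state, calls a model function, writes the result to output_key."""

    def __init__(self, name, instruction, output_key, model_fn):
        self.name, self.instruction = name, instruction
        self.output_key, self.model_fn = output_key, model_fn

    def run(self, state):
        prompt = self.instruction.format(**state)   # fill {key} placeholders
        state[self.output_key] = self.model_fn(prompt)

def run_sequential(agents, state):
    for agent in agents:      # strict order: later agents can read
        agent.run(state)      # the keys earlier agents wrote
    return state
```

Running a writer-then-checker pair shows the dependency: the checker's prompt only makes sense because `draft_text` already exists when its turn comes.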

Execution State at Each Step

Session State Evolution:
┌─────────────────────────────────────────────────────┐
│ Before:  {}                                         │
│                                                     │
│ After DraftWriter:                                  │
│   { "draft_text": "Solar panels convert sunlight…" }│
│                                                     │
│ After FactChecker:                                  │
│   { "draft_text": "Solar panels convert sunlight…", │
│     "review_output": {                              │
│       "status":    "INACCURATE",                    │
│       "reasoning": "Claim about 40% efficiency      │
│                     is incorrect — max ~26%…"       │
│     }                                               │
│   }                                                 │
└─────────────────────────────────────────────────────┘

ADK Orchestration

ADK REFLECTION — SequentialAgent passing state between writer and checker

User Input (e.g. "Write a paragraph about Mars")
  → SequentialAgent: runs DraftWriter first, then FactChecker
  → DraftWriter LlmAgent: writes paragraph · output_key "draft_text" → saved to Session State
  → Session State: shared memory between agents · FactChecker reads draft_text from here
  → FactChecker LlmAgent: reads draft_text · outputs status "ACCURATE" or "INACCURATE" + reasoning
  → Critique Result: status + reasoning saved to the review_output key

Side by Side: LangChain vs ADK

                      LangChain (LCEL)                        Google ADK
State mechanism       Growing message_history list            Key-value session state (output_key)
Iteration control     Explicit for loop + early break         LoopAgent (or manual SequentialAgent chains)
Critique format       Free-form string in conversation        Structured dict (better for programmatic use)
Role switching        New SystemMessage per call              Separate LlmAgent with dedicated instruction
Context window risk   Higher — history grows each iteration   Lower — only relevant state keys passed
Stopping condition    Sentinel string (CODE_IS_PERFECT)       Status field in structured output
Best for              Iterative loops where history matters   Single-cycle generate → critique → act

The Trade-offs You Can’t Ignore

Reflection is not free. Every iteration adds:

Cost — Each iteration is at least 2 LLM calls (generator + critic). A 3-iteration loop = 6 API calls. At GPT-4o pricing, this adds up fast for high-volume tasks.

Latency — Calls are sequential: each must complete before the next begins. At ~3 s per call, a 3-iteration loop (6 sequential calls) takes ~18 s minimum. Unsuitable for real-time use cases.

Context Growth — Each iteration appends to the conversation history. After 4–5 cycles on a long document, you may approach the model's context window limit.

Infinite Loops — Without a firm max_iterations cap and a clear stopping condition, the agent can loop indefinitely — especially if the Critic keeps finding minor issues.

Rule of thumb: Use reflection when quality and accuracy matter more than speed and cost. Don’t use it as the default — use it selectively for the parts of your pipeline where errors are expensive.
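It helps to make that arithmetic explicit before turning reflection on. A rough estimator — the per-call latency and cost figures below are placeholders, not real pricing:

```python
def reflection_overhead(iterations: int, calls_per_iteration: int = 2,
                        seconds_per_call: float = 3.0,
                        cents_per_call: int = 2) -> dict:
    """Worst-case overhead of a reflection loop that never stops early.

    All calls are sequential (generator, then critic, each iteration),
    so latency is the full sum — not just the generator's share.
    """
    calls = iterations * calls_per_iteration
    return {
        "llm_calls": calls,
        "latency_seconds": calls * seconds_per_call,
        "cost_cents": calls * cents_per_call,
    }
```

For a 3-iteration loop at the placeholder rates, that is 6 calls, ~18 s, and ~12¢ per task — before retries or early stops.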

At a Glance

WHAT

A feedback loop where an agent evaluates its own output — or a dedicated Critic agent evaluates it — and uses the critique to generate an improved version.

WHY

First-pass outputs from LLMs are often incomplete, inaccurate, or non-compliant. Reflection adds a self-correction layer — catching errors before they reach the user.

RULE OF THUMB

Use when output quality matters more than speed. Use a separate Critic agent (not self-reflection) when the task requires specialized evaluation — code review, fact-checking, compliance.

Common Mistakes When Implementing Reflection

Mistake 1: No maximum iteration limit. Without a max_iterations cap, a reflection loop can run indefinitely if the critic always finds something to improve. Always set a firm limit. Three iterations is usually enough for most tasks; five is the practical maximum before diminishing returns dominate.

Mistake 2: Vague critique criteria. A critic system prompt that says “review the code for quality” will produce inconsistent, general feedback. Be specific: “Check for off-by-one errors, missing input validation, undocumented edge cases, and incorrect handling of empty inputs.” Specific criteria produce actionable, concrete critiques that the producer can actually act on.

Mistake 3: Producer doesn’t read the critique. The producer’s refinement prompt must explicitly include the critique and instruct the model to address every point. Simply passing “here’s the critique, now rewrite” sometimes leads the model to make superficial changes while ignoring specific issues. Add: “Address each point in the critique above specifically, and explain what you changed.”

Mistake 4: Not tracking which iteration produced which output. When debugging, you need to know whether the quality improved with each iteration. Log the output and the critique at every step. This also helps you identify when to stop — sometimes the output peaks at iteration 2 and gets worse at iteration 3 (over-optimization is real).
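A lightweight way to keep that record is to log every cycle and return the best-scoring draft rather than the last one — `generate`, `critique`, and `score_fn` below are placeholders for your own calls and evaluation metric:

```python
def reflect_with_log(task, generate, critique, score_fn, max_iterations=3):
    """Run a reflection loop while logging every draft, critique, and score,
    then return the best-scoring draft even if a later iteration regressed."""
    log, feedback = [], None
    for i in range(max_iterations):
        draft = generate(task, feedback)
        feedback = critique(task, draft)
        log.append({"iteration": i, "draft": draft,
                    "critique": feedback, "score": score_fn(draft)})
    best = max(log, key=lambda entry: entry["score"])
    return best["draft"], log
```

Inspecting the log per iteration is what surfaces a quality peak at iteration 2 followed by over-optimization at iteration 3.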

Mistake 5: Using reflection for everything. Reflection adds significant latency (multiple LLM calls) and cost. Don’t use it for simple, low-stakes outputs. Use it when: (1) the output quality directly affects user experience, (2) the task involves verifiable correctness (code, factual content), or (3) the cost of a wrong answer is high.

Key Takeaways

  • Reflection = feedback loop. The output doesn’t move forward until it passes evaluation. Generate → critique → refine → repeat.
  • Producer-Critic beats self-reflection for quality-critical tasks. A separate system prompt creates a genuinely different perspective; the same model reviewing its own work has blind spots.
  • The stopping condition is mandatory. Always pair a max_iterations cap with an explicit quality signal (sentinel string, structured status field). Without both, loops run forever.
  • LangChain uses growing message history — the full conversation thread (task, code, critique, refinement) is passed on each call. Natural for iterative loops; risks context window overflow at scale.
  • ADK uses session state — agents write to named keys, other agents read from them. Cleaner data handoff; structured output is easier to act on programmatically.
  • Cost scales with iterations. Each reflection cycle is 2+ LLM calls. Budget for it or gate reflection behind a quality pre-check.
  • Connections forward: Reflection pairs naturally with memory (Chapter 8) — a Critic that remembers what it critiqued before can give progressively sharper feedback. It also anchors to goal-setting (Chapter 11) — the goal is the benchmark the Critic evaluates against.

Next up — Chapter 5: Tool Use, where agents stop reasoning in text and start taking real actions in the world.



