ARTICLE · 15 MIN READ · MARCH 10, 2026
Chapter 17: Reasoning Techniques
How does an AI agent actually think? This chapter reveals the techniques that transform LLMs from pattern-matchers into deliberate problem-solvers: CoT, ToT, ReAct, self-correction, RLVR, and more.
Why Reasoning Matters
Chain-of-Thought (CoT): A prompting technique that instructs the LLM to generate explicit intermediate reasoning steps before producing a final answer. Instead of "What is 17 × 24? → 408," it produces "17 × 24 = 17 × 20 + 17 × 4 = 340 + 68 = 408." The intermediate steps improve accuracy on complex problems.
Tree-of-Thought (ToT): An extension of CoT that explores multiple reasoning paths simultaneously rather than committing to one sequence. Like a chess player considering several moves ahead, ToT branches into alternatives, evaluates each path, and backtracks from dead ends.
ReAct (Reasoning + Acting): A framework that interleaves the LLM's reasoning steps with actual tool calls. The loop is: Thought (what should I do?) → Action (call a tool) → Observation (what did the tool return?) → Thought (what should I do next?) → ... until done.
Self-correction: When an agent evaluates its own output against specified criteria and iteratively refines it without external feedback. Related to the Reflection pattern (Chapter 4) but applied specifically within a reasoning chain.
RLVR (Reinforcement Learning with Verifiable Rewards): A training technique for reasoning models. The model is trained on problems with known correct answers (math, code). It generates reasoning chains, its final answers are checked against the known solutions, and it learns which types of reasoning chains lead to correct answers. This is the technique behind models like o1, o3, and DeepSeek-R1.
Scaling Inference Law: The observation that more computation at inference time (letting the model "think longer") predictably improves output quality — even for the same model. A smaller model given a large thinking budget can outperform a larger model with a small thinking budget.
PALM (Program-Aided Language Model): An approach where the LLM generates code (Python, SQL, etc.) and executes it to get precise answers, rather than computing in natural language. "17 × 24 = ?" → generate `print(17 * 24)` → execute → 408. Eliminates arithmetic errors.
An LLM without structured reasoning is like a student answering a complex exam question by writing the first thing that comes to mind. Sometimes it’s right. Often it’s not. The question is: how do you make the model think rather than just respond?
The answer is a collection of techniques that make the model’s internal reasoning process explicit, systematic, and verifiable. Some of these are prompting strategies (CoT, ToT). Some are architectural patterns (ReAct). Some are training approaches (RLVR). All of them share one principle: allocating more computational “thinking” — more steps, more paths, more iterations — produces significantly better results on complex problems.
This isn’t obvious. Intuitively, you might think a more capable model always produces better results. But the Scaling Inference Law shows something counterintuitive: a smaller model that reasons carefully through multiple steps can outperform a larger model that answers in a single pass. Computation at inference time is a powerful lever — often more cost-effective than using a bigger model.
This chapter covers the full spectrum of reasoning techniques, from the simple (CoT prompting) to the frontier (MASS-optimized multi-agent debates).
Reasoning Technique Overview
[Technique overview table omitted.] GoD mechanism: arguments form a dynamic graph; the conclusion is the most robustly supported argument cluster, not the last one in a sequence.
Chain-of-Thought in Depth
CoT is the foundational technique. Everything else in this chapter builds on it. Let’s see exactly how it works with a full example.
Without CoT: The prompt says “Explain the differences between classical and quantum computers.” The LLM retrieves and assembles an answer in one forward pass. It might be mostly correct, might miss key points, and the user has no way to verify the reasoning.
With CoT: The prompt instructs the model to follow explicit reasoning steps. The model’s thought process becomes a structured sequence:
You are an Information Retrieval Agent. Answer comprehensively by thinking step-by-step.
Process to follow:
1. Analyze the Query: identify core subject, key entities, type of information needed
2. Formulate Search Queries: list precise queries you'd use to retrieve information
3. Simulate Retrieval: for each query, consider what relevant snippets would look like
4. Synthesize: combine findings into a coherent answer
5. Review: critically evaluate accuracy, completeness, clarity
User Query: "Explain the main differences between classical and quantum computers,
and briefly describe a potential application."
The model then produces visible reasoning:
Thought 1 (Analyze): User wants (1) differences, (2) one application.
Keywords: classical computers, quantum computers, bits, qubits, superposition.
Thought 2 (Search Queries):
- "classical vs quantum computer architecture"
- "superposition entanglement explained"
- "quantum computing applications 2025"
Thought 3 (Simulate Retrieval):
From "classical vs quantum": bits (0 or 1), transistors, binary logic
vs qubits, superposition, entanglement, probabilistic computation.
Thought 4 (Synthesize): Classical computers process information as bits — each
bit is deterministically 0 or 1. Quantum computers use qubits, which can
exist in superposition (0 and 1 simultaneously) and become entangled, enabling
exponentially more complex computations for specific problem classes.
Thought 5 (Review): Covers bits vs qubits ✓, superposition ✓, entanglement ✓,
practical implication ✓. Application included ✓. Concise ✓.
Why does making reasoning explicit improve accuracy? Three reasons:
- Error containment. Each step is a checkpoint. A mistake in Step 2 is visible and doesn’t automatically corrupt Steps 3-5. In single-pass generation, one wrong assumption silently propagates through the entire response.
- Attention anchoring. By writing out the reasoning steps, the model forces its own attention to remain focused on the relevant concepts. Long single-pass answers let attention drift.
- Self-verification. The model can “read” its own intermediate steps and notice if they’re inconsistent — this is the foundation of self-correction.
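In code, CoT is just prompt construction plus answer extraction. A minimal sketch, assuming a hypothetical `call_llm` function (stubbed here so it runs deterministically):

```python
# Minimal CoT wrapper sketch. `call_llm` is a stand-in for any chat-model API;
# the stub returns a fixed completion so the example is runnable.
def call_llm(prompt: str) -> str:
    # Stub: a real implementation would call your model provider here.
    return ("Step 1: 17 x 20 = 340.\n"
            "Step 2: 17 x 4 = 68.\n"
            "Step 3: 340 + 68 = 408.\n"
            "Final answer: 408")

def cot_answer(question: str) -> tuple[str, str]:
    """Prepend a step-by-step instruction, then split reasoning from answer."""
    prompt = (
        "Think step by step. Number each step, then give the result on a "
        "final line starting with 'Final answer:'.\n\nQuestion: " + question
    )
    completion = call_llm(prompt)
    reasoning, _, answer = completion.rpartition("Final answer:")
    return reasoning.strip(), answer.strip()

reasoning, answer = cot_answer("What is 17 x 24?")
print(answer)  # the extracted answer; the intermediate steps are kept for audit
```

The intermediate steps stay available as `reasoning`, which is exactly what makes the chain auditable.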
Tree-of-Thought: When the First Path Fails
Linear CoT has a fundamental weakness: it commits to one reasoning path at the start and follows it to conclusion. If the initial approach is wrong, the model may not backtrack — it just follows the wrong path confidently.
Tree-of-Thought solves this by maintaining multiple candidate reasoning paths simultaneously:
How ToT works in practice:
- At each decision point, generate N alternative next steps (typically 3-5)
- Score each alternative (using the LLM itself to evaluate: “Is this reasoning direction promising?”)
- Expand the highest-scoring path
- If a path reaches a dead end or low score, backtrack and try the next best alternative
- Continue until a satisfactory solution is found or compute budget is exhausted
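The procedure above can be sketched as a greedy beam search, with `propose` and `score` as stubs standing in for the two LLM calls (generate alternative next steps, judge a partial path):

```python
# Greedy beam-search ToT sketch. `propose` and `score` are stubs for LLM calls.
def propose(path):
    # An LLM would generate alternative next reasoning steps here.
    return [path + [step] for step in ("A", "B", "C")]

def score(path):
    # An LLM-as-judge would rate the partial reasoning here. Stub heuristic:
    # steps labelled "B" are promising, the others are dead ends.
    return path.count("B") / len(path)

def tree_of_thought(depth=4, beam=3):
    frontier = [[]]                                # start from an empty path
    for _ in range(depth):
        candidates = [c for p in frontier for c in propose(p)]
        candidates.sort(key=score, reverse=True)   # evaluate every alternative
        frontier = candidates[:beam]               # expand the best paths only;
                                                   # the rest are backtracked from
    return max(frontier, key=score)

print(tree_of_thought())  # the all-"B" path survives the search
```

Keeping `beam` paths alive at each depth is what distinguishes this from linear CoT, which is the special case `beam=1` with no re-scoring.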
The computational cost. Exploring a tree with branching factor 3 and depth 4 requires evaluating 3⁴ = 81 paths. This is 10-80× more LLM calls than linear CoT. ToT is appropriate for genuinely hard problems where errors are expensive (strategic planning, complex code generation, research synthesis) — not for everyday queries.
ReAct: The Core Agentic Loop
ReAct is arguably the most important reasoning technique for agents because it’s the one that connects reasoning to the real world. Without ReAct (or a variant), an agent is just a sophisticated text generator. With ReAct, it can gather information, execute code, call APIs, and adapt its plan based on what it finds.
The loop:
Thought: I need to find the current GDP of France.
Action: search("France GDP 2025")
Observation: "France GDP in 2025 estimated at $3.1 trillion, 0.8% growth YoY"
Thought: Good. Now I need to compare this with Germany.
Action: search("Germany GDP 2025")
Observation: "Germany GDP in 2025 approximately $4.2 trillion, -0.3% growth"
Thought: I have both figures. France GDP $3.1T (+0.8%) vs Germany $4.2T (-0.3%).
I can now answer the question with current data.
Action: finish("France's GDP in 2025 is $3.1 trillion (growing 0.8%), while
Germany's is $4.2 trillion (contracting 0.3%). France shows positive
momentum despite being smaller. Source: search results.")
Why the observation step is critical. In linear CoT, the model imagines what search results would say. In ReAct, it actually gets them. This means the model’s reasoning is grounded in real, current data — not hallucinated simulations. Every Observation is a reality check that can confirm or falsify the model’s current hypothesis.
The frequency of thoughts. For knowledge-intensive tasks (fact-checking, research), thoughts appear before every action — the model explicitly reasons about each piece of information before acting. For decision-making tasks requiring many actions (navigating an environment, executing a long workflow), thoughts are used more sparingly — the model acts more on intuition and only stops to reason when facing genuine ambiguity.
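The loop reduces to a small controller. In this sketch the model's Thought/Action choices are scripted so it runs without an LLM; `TOOLS` and the GDP figures are illustrative stand-ins:

```python
# Minimal ReAct controller sketch. A real agent would prompt the model each
# turn for the next Thought/Action and feed every Observation back to it.
TOOLS = {
    "search": lambda q: {
        "France GDP 2025": "$3.1 trillion, +0.8% YoY",
        "Germany GDP 2025": "$4.2 trillion, -0.3% YoY",
    }.get(q, "no results"),
}

SCRIPTED = [  # stub for the model's decisions: (thought, action, argument)
    ("I need France's GDP.", "search", "France GDP 2025"),
    ("Now compare with Germany.", "search", "Germany GDP 2025"),
    ("I have both figures.", "finish", "France $3.1T (+0.8%) vs Germany $4.2T (-0.3%)"),
]

def react_loop():
    transcript = []
    for thought, action, arg in SCRIPTED:
        transcript.append(f"Thought: {thought}")
        if action == "finish":                  # terminal action ends the loop
            return arg, transcript
        observation = TOOLS[action](arg)        # a real tool call, not an
        transcript.append(f"Observation: {observation}")  # imagined result
    return None, transcript

answer, transcript = react_loop()
print(answer)
```

The `transcript` is the grounded reasoning trace: each Observation entry comes from an actual tool call rather than the model's imagination.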
PALM: Offloading Computation to Code
LLMs are probabilistic — they approximate. For calculations like “What is 17.8% of $2,847.50?” the model might answer $507.05 when the correct answer is $506.86. Not a huge error, but in finance or science, even small arithmetic errors compound into significant mistakes.
Program-Aided Language Models solve this by never doing arithmetic in natural language:
# The LLM generates this code when asked "What is 17.8% of $2,847.50?"
# Decimal avoids binary floating-point rounding surprises with currency.
from decimal import Decimal, ROUND_HALF_UP

amount = Decimal("2847.50")
percentage = Decimal("17.8")
result = (amount * percentage / 100).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(f"${result}")
# Execution returns: $506.86 ← exact, deterministic
The ADK implementation uses BuiltInCodeExecutor:
from google.adk.agents import Agent
from google.adk.code_executors import BuiltInCodeExecutor

coding_agent = Agent(
    model='gemini-2.0-flash',
    name='CodeAgent',
    instruction="For any mathematical computation, generate Python code and "
                "execute it. Never calculate in natural language.",
    code_executor=BuiltInCodeExecutor(),  # sandboxed Python interpreter
)
Why a sandboxed executor? The code runs in an isolated environment that can’t affect the host system — it can’t read your files, make network requests, or consume excessive resources. This is the same principle as the Docker containerization discussed in Chapter 9 (SICA), applied at the code execution level.
The Self-Correction Agent in Action
Self-correction is CoT applied to quality improvement. The agent generates output, then acts as its own critic:
ORIGINAL REQUIREMENTS: "Write a short, engaging social media post (max 150 chars)
announcing GreenTech Gadgets, a new eco-friendly product line."
INITIAL DRAFT: "We have new products. They are green and techy. Buy GreenTech Gadgets now!"
Self-Correction Process:
Thought 1 (Requirements check):
- Max 150 chars: ✓ (74 chars — well under)
- Short: ✓
- Engaging: ✗ — "We have new products" is completely flat
- Announces GreenTech: ✓ (mentioned at end)
- Eco-friendly emphasis: ✗ — "green" is vague, not compelling
Thought 2 (Identify weaknesses):
- No excitement about the launch
- "eco-friendly" is the USP but barely mentioned
- Call to action "Buy now!" is weak
- No visual elements (emojis) for social media engagement
- Hashtags missing — crucial for discoverability
Thought 3 (Propose improvements):
- Open with an eco-friendly signal emoji
- Lead with the key benefit (eco-friendly + innovative)
- Stronger CTA with "Shop now!"
- Add relevant hashtags
Thought 4 (Revised version):
"🌱 Discover GreenTech Gadgets! Our new eco-friendly line blends innovation
with sustainability. Go green, go smart! Shop now! #EcoFriendly #GreenTech"
(148 chars — within limit ✓)
The improvement is dramatic — from a generic, flat announcement to an engaging, hashtag-equipped post that leads with the product’s key differentiator. The same LLM produced both, but the second pass had structured criteria to evaluate against.
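The critique-and-revise loop reduces to: draft, check against measurable criteria, revise, repeat. A sketch with the revision step stubbed (a real agent would re-prompt the LLM with the failed criteria):

```python
# Self-correction sketch: evaluate a draft against explicit criteria and
# revise until all pass. The draft text and criteria here are illustrative.
MAX_CHARS = 150

def check(post: str) -> list[str]:
    """Return the list of failed requirements (empty list = all pass)."""
    failures = []
    if len(post) > MAX_CHARS:
        failures.append("too long")
    if "GreenTech" not in post:
        failures.append("missing product name")
    if "#" not in post:
        failures.append("missing hashtags")
    return failures

def revise(post: str, failures: list[str]) -> str:
    # Stub: a real implementation would ask the LLM to fix each failure.
    if "missing hashtags" in failures:
        post += " #EcoFriendly #GreenTech"
    return post[:MAX_CHARS]

draft = "Discover GreenTech Gadgets! Eco-friendly innovation. Shop now!"
for _ in range(3):               # bounded number of correction passes
    failures = check(draft)
    if not failures:
        break
    draft = revise(draft, failures)

print(check(draft))  # → [] once all criteria pass
```

The key design point is that `check` returns machine-readable failures, so each revision pass targets specific gaps rather than rewriting blindly.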
RLVR: How Modern Reasoning Models Learn to Think
RLVR is the training technique behind OpenAI’s o-series models, Google’s Gemini 2.5 “thinking” mode, and DeepSeek-R1. Understanding it explains why these models reason so differently from standard LLMs.
The problem with standard fine-tuning for reasoning. Standard supervised fine-tuning trains a model by showing it (question, correct_answer) pairs and minimizing the difference between what the model generates and the correct answer. This teaches the model to imitate correct answers but doesn’t teach it how to reason to those answers.
What RLVR does differently:
1. Collect problems with verifiable correct answers — math problems, coding problems, logical puzzles. These are the training problems where you know definitively whether the answer is right or wrong.
2. Let the model generate its answer plus a long reasoning chain. The model isn’t given the correct answer — it generates its own reasoning trajectory.
3. Check the final answer against the known correct answer. If right, give a positive reward. If wrong, give a negative reward.
4. Update the model’s weights to favor the types of reasoning chains that led to correct answers. The model learns which reasoning patterns work — exploring alternatives, self-checking, backtracking — through trial and error.
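These steps can be sketched end to end for one problem. The chain sampling and the weight update are stubs (real RLVR trains with policy gradients over large batches), and the numbers are illustrative:

```python
# RLVR sketch for a single verifiable problem. `sample_chain` stands in for
# sampling reasoning chains from the policy model.
import random

GOLD = 408  # the verifiable answer to "17 x 24 = ?"

def sample_chain(rng):
    # Stub: pretend the model sampled a chain; some chains end in a wrong answer.
    answer = rng.choice([408, 408, 398])
    return {"chain": f"...reasoning steps ending in {answer}", "answer": answer}

rng = random.Random(0)
chains = [sample_chain(rng) for _ in range(8)]    # step 2: sample trajectories
rewards = [1 if c["answer"] == GOLD else -1       # step 3: binary, objective
           for c in chains]                       # reward, nothing to game

# Step 4 (sketch): reinforce the chains that earned +1.
reinforced = [c for c, r in zip(chains, rewards) if r == 1]
print(f"{len(reinforced)}/{len(chains)} sampled chains reinforced")
```

Note what is absent: no reward model, no human judgment. The reward signal is a direct comparison against `GOLD`.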
Why “verifiable rewards” specifically? In RLHF (reinforcement learning from human feedback), a reward model judges answer quality. But a reward model is itself an LLM that can be “gamed” — the reasoning model learns to produce text that scores well on the reward model rather than text that’s actually correct. This is the reward hacking problem from Chapter 9 (Learning and Adaptation).
With verifiable rewards, there’s no reward model to hack. The reward is binary and objective: is the final answer correct? The model can’t game math. Either 17 × 24 = 408 or it doesn’t.
What RLVR-trained models do. After RLVR training, models like o3 and DeepSeek-R1 produce extended reasoning traces with behaviors they were never explicitly taught:
- Planning: “Let me think about the overall approach before diving in…”
- Self-monitoring: “Wait, I made an error in step 3. Let me recalculate…”
- Backtracking: “That approach isn’t working. Let me try a different method…”
- Verification: “Let me check my answer by working backwards…”
These behaviors emerged from trial-and-error training — the model discovered that they increase the probability of getting the right answer, so it learned to do them.
The Scaling Inference Law: More Thinking = Better Results
The Scaling Inference Law states: for a given model, performance improves predictably as more computational resources are allocated at inference time.
This seems obvious — of course more compute helps. But the law’s practical implication is counterintuitive:
A smaller model with a large thinking budget can outperform a larger model with a small thinking budget.
The effect is easiest to see on a thinking-budget curve: as the budget increases, every model improves, but the smaller model improves more steeply — it was previously bottlenecked by not having enough steps to reason through the problem. Given adequate budget, it catches up to and surpasses the larger model’s single-pass performance.
Practical implications for system design:
- Adaptive compute allocation. Detect query complexity at routing time (Chapter 16 technique). For easy queries, use a small budget. For hard queries, allocate a large budget. Don’t waste compute on questions that don’t need extended thinking.
- Test-time compute vs training compute. Increasing a model’s thinking budget at inference is often more cost-effective than training a larger model. A 1B parameter model with 100 thinking steps can match a 10B parameter model with 1 thinking step on many tasks.
- Problem-specific tuning. Some problem types benefit more from extended thinking (complex math, novel reasoning) than others (factual recall, simple extraction). Profile your use case before defaulting to maximum thinking budget.
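One concrete way to buy accuracy with inference compute is self-consistency: sample the same model several times and take a majority vote. A stub sketch (the 60% accuracy figure is illustrative, not measured):

```python
# Self-consistency sketch: spending more inference compute on the same model.
import random
from collections import Counter

def small_model_answer(rng):
    return 408 if rng.random() < 0.6 else 398   # stub model: right 60% of the time

def answer_with_budget(n_samples, seed=0):
    rng = random.Random(seed)
    votes = Counter(small_model_answer(rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]           # majority vote over the samples

print(answer_with_budget(1))    # single pass: wrong a large fraction of the time
print(answer_with_budget(25))   # 25x the budget: the majority vote is far more reliable
```

The model is unchanged between the two calls; only the sampling budget differs, which is exactly the lever the Scaling Inference Law describes.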
Deep Research: All Techniques Combined
Google’s Deep Research (covered in Chapter 6 from the planning perspective) is the clearest real-world demonstration of all these reasoning techniques working together:
graph TD
Q([User Research Question]) --> GQ[Generate initial search queries — CoT decomposes the question]
GQ --> WR[Web Research — ReAct: Action=search, Observation=results]
WR --> RF[Reflection — Self-correction: identifies gaps and contradictions]
RF -->|gaps remain| GQ
RF -->|coverage complete| FA[Finalize Answer — synthesize with citations]
FA --> OUT([Comprehensive Research Report])
style GQ fill:#141b2d,stroke:#2698ba,color:#e0e0e0
style WR fill:#141b2d,stroke:#e6a817,color:#e0e0e0
style RF fill:#141b2d,stroke:#c97af2,color:#e0e0e0
style FA fill:#141b2d,stroke:#4fc97e,color:#e0e0e0
The LangGraph implementation (from gemini-fullstack-langgraph-quickstart):
from langgraph.graph import StateGraph, START, END
# OverallState, Configuration, and the node functions registered below are
# defined elsewhere in the quickstart repo.
builder = StateGraph(OverallState, config_schema=Configuration)
# CoT: decompose the question into targeted search queries
builder.add_node("generate_query", generate_query)
# ReAct: actually execute the searches (Action), observe results
builder.add_node("web_research", web_research)
# Self-correction: evaluate what was found, identify gaps
builder.add_node("reflection", reflection)
# Synthesis: produce the final report
builder.add_node("finalize_answer", finalize_answer)
# The flow: generate → search in parallel → reflect → [more research if needed] → finalize
builder.add_edge(START, "generate_query")
builder.add_conditional_edges("generate_query", continue_to_web_research, ["web_research"])
builder.add_edge("web_research", "reflection")
builder.add_conditional_edges("reflection", evaluate_research, ["web_research", "finalize_answer"])
builder.add_edge("finalize_answer", END)
graph = builder.compile(name="pro-search-agent")
The `add_conditional_edges` after reflection: the `evaluate_research` function decides whether enough information has been gathered or significant knowledge gaps remain. If gaps remain, the graph loops back to `generate_query` with refined search terms (self-correction in action). If coverage is sufficient, it proceeds to `finalize_answer`. This loop implements the Scaling Inference Law — the agent keeps allocating more inference compute (more search iterations) until quality is sufficient.
`continue_to_web_research` generates parallel branches: the initial query decomposition produces multiple sub-queries. `add_conditional_edges` can spawn multiple simultaneous `web_research` nodes — one per sub-query — implementing the Parallelization pattern (Chapter 3) within the reasoning graph.
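For illustration only, the reflection gate might be shaped like this. The state keys used here (`knowledge_gaps`, `research_loops`, `max_research_loops`) are assumed names, not the quickstart's actual schema:

```python
# Hypothetical sketch of the reflection gate; the quickstart's real
# `evaluate_research` differs in its state fields and return handling.
def evaluate_research(state: dict) -> str:
    budget_spent = state["research_loops"] >= state["max_research_loops"]
    if state["knowledge_gaps"] and not budget_spent:
        return "web_research"        # gaps remain: allocate more inference compute
    return "finalize_answer"         # coverage sufficient, or budget exhausted

print(evaluate_research({"knowledge_gaps": ["per-capita figures missing"],
                         "research_loops": 1, "max_research_loops": 3}))
# → web_research
```

The loop cap (`max_research_loops`) is what keeps iterative deepening bounded: without it, a stubborn gap could consume compute indefinitely.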
MASS: Automating Multi-Agent System Design
For most teams, the hard part of multi-agent reasoning is not the implementation — it’s figuring out the right agent topology and right prompts. This is a vast search space. MASS (Multi-Agent System Search) automates this.
The three-stage MASS optimization: (1) optimize each agent’s prompt in isolation, (2) search the topology space using those optimized agents as building blocks, (3) re-optimize all prompts jointly for the selected topology.
Key finding from MASS research: The most effective multi-agent systems discovered by MASS consistently have three properties:
- Individual agents with high-quality, specialized prompts (not generic)
- Topologies that combine iterative self-correction with external validation (not just debate without grounding)
- Prompts that are re-optimized for the full workflow after the topology is fixed
This means the order matters: optimize agents → find topology → re-optimize together. Skipping any stage degrades final performance.
Practical Applications of Reasoning Techniques
Complex Q&A
Multi-hop questions that require combining information from multiple sources with logical deduction. "Given these financial reports, which company grew faster in emerging markets?"
Strategic Planning
Problems where the first approach often fails and exploration of multiple strategies is needed. Architecture design, resource allocation, creative problem-solving.
Research & Analysis
Any task requiring external data. The agent's iterative Thought-Action-Observation loop retrieves, synthesizes, and adapts — the foundation of all data-grounded agents.
Mathematical Computation
Financial calculations, statistical analysis, physics simulations — anywhere precise arithmetic matters. The LLM formulates the problem; Python computes it exactly.
Content Generation
Marketing copy, legal documents, technical reports — any output with defined quality criteria. Multiple passes of critique-and-revise dramatically improve polish.
High-Stakes Decisions
Medical diagnosis, legal analysis, investment decisions — where multiple perspectives reduce error and bias. The debate structure forces reasoning to withstand peer scrutiny.
Key Takeaways
- Reasoning techniques make thinking explicit and auditable. When an agent’s reasoning is a visible chain of steps, errors are detectable, decisions are explainable, and the system is debuggable. Opaque single-pass generation offers none of these properties.
- CoT is the foundation. “Think step by step” transforms difficult problems into sequences of easier ones. Every other technique in this chapter builds on CoT — ToT extends it to trees, ReAct combines it with tool calls, Self-Correction applies it to quality improvement.
- ToT handles problems where the first approach fails. When the search space is large and initial attempts are likely to be wrong, branching exploration with backtracking dramatically outperforms linear reasoning. The cost: 10-80× more LLM calls.
- ReAct is how agents interact with the real world. The Thought-Action-Observation loop is the fundamental operating pattern of virtually every production AI agent. Without it, agents reason about what tools might return; with it, they actually use the tools and ground their reasoning in real results.
- PALM eliminates arithmetic errors. Never have the LLM compute — have it write code and execute it. The LLM handles language and logic; Python handles numbers. Combine their strengths.
- RLVR explains why modern reasoning models are so different. Models trained with RLVR (o1, o3, DeepSeek-R1, Gemini 2.5 thinking) spontaneously exhibit planning, self-monitoring, and backtracking — behaviors they discovered through trial-and-error training, not from explicit instruction.
- The Scaling Inference Law challenges “bigger is better.” Giving a smaller model more thinking time often outperforms a larger model with minimal thinking. Inference compute is a tunable resource — allocate it where it matters most.
- Deep Research demonstrates all techniques working together: CoT for query decomposition, ReAct for search + observation, Self-Correction for gap detection, Scaling Inference Law for iterative deepening, Parallelization for simultaneous sub-queries.
- MASS shows that MAS design can be automated. The right agent topology and prompts are hard to find manually in a vast search space. MASS’s three-stage optimization (individual → topology → integrated) consistently discovers configurations that outperform hand-crafted designs.