ARTICLE · 15 MIN READ · MARCH 10, 2026
Chapter 17: Reasoning Techniques
How does an AI agent actually think? This chapter reveals the techniques that transform LLMs from pattern-matchers into deliberate problem-solvers: CoT, ToT, ReAct, self-correction, RLVR, and more.
Why Reasoning Matters
Chain-of-Thought (CoT): A prompting technique that instructs the LLM to generate explicit intermediate reasoning steps before producing a final answer. Instead of "What is 17 × 24? → 408," it produces "17 × 24 = 17 × 20 + 17 × 4 = 340 + 68 = 408." The intermediate steps improve accuracy on complex problems.
Tree-of-Thought (ToT): An extension of CoT that explores multiple reasoning paths simultaneously rather than committing to one sequence. Like a chess player considering several moves ahead, ToT branches into alternatives, evaluates each path, and backtracks from dead ends.
ReAct (Reasoning + Acting): A framework that interleaves the LLM's reasoning steps with actual tool calls. The loop is: Thought (what should I do?) → Action (call a tool) → Observation (what did the tool return?) → Thought (what should I do next?) → ... until done.
Self-correction: When an agent evaluates its own output against specified criteria and iteratively refines it without external feedback. Related to the Reflection pattern (Chapter 4) but applied specifically within a reasoning chain.
RLVR (Reinforcement Learning with Verifiable Rewards): A training technique for reasoning models. The model is trained on problems with known correct answers (math, code). It generates reasoning chains, its final answers are checked against the known solutions, and it learns which types of reasoning chains lead to correct answers. This is the technique behind models like o1, o3, and DeepSeek-R1.
Scaling Inference Law: The observation that more computation at inference time (letting the model "think longer") predictably improves output quality — even for the same model. A smaller model given a large thinking budget can outperform a larger model with a small thinking budget.
PALM (Program-Aided Language Model): An approach where the LLM generates code (Python, SQL, etc.) and executes it to get precise answers, rather than computing in natural language. "17 × 24 = ?" → generate `print(17 * 24)` → execute → 408. Eliminates arithmetic errors.
An LLM without structured reasoning is like a student answering a complex exam question by writing the first thing that comes to mind. Sometimes it’s right. Often it’s not. The question is: how do you make the model think rather than just respond?
The answer is a collection of techniques that make the model’s internal reasoning process explicit, systematic, and verifiable. Some of these are prompting strategies (CoT, ToT). Some are architectural patterns (ReAct). Some are training approaches (RLVR). All of them share one principle: allocating more computational “thinking” — more steps, more paths, more iterations — produces significantly better results on complex problems.
This isn’t obvious. Intuitively, you might think a more capable model always produces better results. But the Scaling Inference Law shows something counterintuitive: a smaller model that reasons carefully through multiple steps can outperform a larger model that answers in a single pass. Computation at inference time is a powerful lever — often more cost-effective than using a bigger model.
This chapter covers the full spectrum of reasoning techniques, from the simple (CoT prompting) to the frontier (MASS-optimized multi-agent debates).
Reasoning Technique Overview
[Technique overview table omitted.] GoD mechanism: arguments form a dynamic graph; the conclusion is the most robustly supported argument cluster, not the last one in a sequence.
Chain-of-Thought in Depth
CoT is the foundational technique. Everything else in this chapter builds on it. Let’s see exactly how it works with a full example.
Without CoT: The prompt says “Explain the differences between classical and quantum computers.” The LLM retrieves and assembles an answer in one forward pass. It might be mostly correct, might miss key points, and the user has no way to verify the reasoning.
With CoT: The prompt instructs the model to follow explicit reasoning steps. The model’s thought process becomes a structured sequence:
You are an Information Retrieval Agent. Answer comprehensively by thinking step-by-step.
Process to follow:
1. Analyze the Query: identify core subject, key entities, type of information needed
2. Formulate Search Queries: list precise queries you'd use to retrieve information
3. Simulate Retrieval: for each query, consider what relevant snippets would look like
4. Synthesize: combine findings into a coherent answer
5. Review: critically evaluate accuracy, completeness, clarity
User Query: "Explain the main differences between classical and quantum computers,
and briefly describe a potential application."
The model then produces visible reasoning:
Thought 1 (Analyze): User wants (1) differences, (2) one application.
Keywords: classical computers, quantum computers, bits, qubits, superposition.
Thought 2 (Search Queries):
- "classical vs quantum computer architecture"
- "superposition entanglement explained"
- "quantum computing applications 2025"
Thought 3 (Simulate Retrieval):
From "classical vs quantum": bits (0 or 1), transistors, binary logic
vs qubits, superposition, entanglement, probabilistic computation.
Thought 4 (Synthesize): Classical computers process information as bits — each
bit is deterministically 0 or 1. Quantum computers use qubits, which can
exist in superposition (0 and 1 simultaneously) and become entangled, enabling
exponentially more complex computations for specific problem classes.
Thought 5 (Review): Covers bits vs qubits ✓, superposition ✓, entanglement ✓,
practical implication ✓. Application included ✓. Concise ✓.
Why does making reasoning explicit improve accuracy? Three reasons:
- Error containment. Each step is a checkpoint. A mistake in Step 2 is visible and doesn’t automatically corrupt Steps 3-5. In single-pass generation, one wrong assumption silently propagates through the entire response.
- Attention anchoring. By writing out the reasoning steps, the model forces its own attention to remain focused on the relevant concepts. Long single-pass answers let attention drift.
- Self-verification. The model can “read” its own intermediate steps and notice if they’re inconsistent — this is the foundation of self-correction.
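In code, CoT is just prompt construction plus answer extraction. A minimal sketch, assuming a hypothetical `call_llm` function (stubbed here so it runs deterministically):

```python
# Minimal CoT wrapper sketch. `call_llm` is a stand-in for any chat-model API;
# the stub returns a fixed completion so the example is runnable.
def call_llm(prompt: str) -> str:
    # Stub: a real implementation would call your model provider here.
    return ("Step 1: 17 x 20 = 340.\n"
            "Step 2: 17 x 4 = 68.\n"
            "Step 3: 340 + 68 = 408.\n"
            "Final answer: 408")

def cot_answer(question: str) -> tuple[str, str]:
    """Prepend a step-by-step instruction, then split reasoning from answer."""
    prompt = (
        "Think step by step. Number each step, then give the result on a "
        "final line starting with 'Final answer:'.\n\nQuestion: " + question
    )
    completion = call_llm(prompt)
    reasoning, _, answer = completion.rpartition("Final answer:")
    return reasoning.strip(), answer.strip()

reasoning, answer = cot_answer("What is 17 x 24?")
print(answer)  # the extracted answer; the intermediate steps are kept for audit
```

The intermediate steps stay available as `reasoning`, which is exactly what makes the chain auditable.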
Tree-of-Thought: When the First Path Fails
Linear CoT has a fundamental weakness: it commits to one reasoning path at the start and follows it to conclusion. If the initial approach is wrong, the model may not backtrack — it just follows the wrong path confidently.
Tree-of-Thought solves this by maintaining multiple candidate reasoning paths simultaneously:
How ToT works in practice:
- At each decision point, generate N alternative next steps (typically 3-5)
- Score each alternative (using the LLM itself to evaluate: “Is this reasoning direction promising?”)
- Expand the highest-scoring path
- If a path reaches a dead end or low score, backtrack and try the next best alternative
- Continue until a satisfactory solution is found or compute budget is exhausted
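The procedure above can be sketched as a greedy beam search, with `propose` and `score` as stubs standing in for the two LLM calls (generate alternative next steps, judge a partial path):

```python
# Greedy beam-search ToT sketch. `propose` and `score` are stubs for LLM calls.
def propose(path):
    # An LLM would generate alternative next reasoning steps here.
    return [path + [step] for step in ("A", "B", "C")]

def score(path):
    # An LLM-as-judge would rate the partial reasoning here. Stub heuristic:
    # steps labelled "B" are promising, the others are dead ends.
    return path.count("B") / len(path)

def tree_of_thought(depth=4, beam=3):
    frontier = [[]]                                # start from an empty path
    for _ in range(depth):
        candidates = [c for p in frontier for c in propose(p)]
        candidates.sort(key=score, reverse=True)   # evaluate every alternative
        frontier = candidates[:beam]               # expand the best paths only;
                                                   # the rest are backtracked from
    return max(frontier, key=score)

print(tree_of_thought())  # the all-"B" path survives the search
```

Keeping `beam` paths alive at each depth is what distinguishes this from linear CoT, which is the special case `beam=1` with no re-scoring.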
The computational cost. Exploring a tree with branching factor 3 and depth 4 requires evaluating 3⁴ = 81 paths. This is 10-80× more LLM calls than linear CoT. ToT is appropriate for genuinely hard problems where errors are expensive (strategic planning, complex code generation, research synthesis) — not for everyday queries.
ReAct: The Core Agentic Loop
ReAct is arguably the most important reasoning technique for agents because it’s the one that connects reasoning to the real world. Without ReAct (or a variant), an agent is just a sophisticated text generator. With ReAct, it can gather information, execute code, call APIs, and adapt its plan based on what it finds.
The loop:
Thought: I need to find the current GDP of France.
Action: search("France GDP 2025")
Observation: "France GDP in 2025 estimated at $3.1 trillion, 0.8% growth YoY"
Thought: Good. Now I need to compare this with Germany.
Action: search("Germany GDP 2025")
Observation: "Germany GDP in 2025 approximately $4.2 trillion, -0.3% growth"
Thought: I have both figures. France GDP $3.1T (+0.8%) vs Germany $4.2T (-0.3%).
I can now answer the question with current data.
Action: finish("France's GDP in 2025 is $3.1 trillion (growing 0.8%), while
Germany's is $4.2 trillion (contracting 0.3%). France shows positive
momentum despite being smaller. Source: search results.")
Why the observation step is critical. In linear CoT, the model imagines what search results would say. In ReAct, it actually gets them. This means the model’s reasoning is grounded in real, current data — not hallucinated simulations. Every Observation is a reality check that can confirm or falsify the model’s current hypothesis.
The frequency of thoughts. For knowledge-intensive tasks (fact-checking, research), thoughts appear before every action — the model explicitly reasons about each piece of information before acting. For decision-making tasks requiring many actions (navigating an environment, executing a long workflow), thoughts are used more sparingly — the model acts more on intuition and only stops to reason when facing genuine ambiguity.
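The loop reduces to a small controller. In this sketch the model's Thought/Action choices are scripted so it runs without an LLM; `TOOLS` and the GDP figures are illustrative stand-ins:

```python
# Minimal ReAct controller sketch. A real agent would prompt the model each
# turn for the next Thought/Action and feed every Observation back to it.
TOOLS = {
    "search": lambda q: {
        "France GDP 2025": "$3.1 trillion, +0.8% YoY",
        "Germany GDP 2025": "$4.2 trillion, -0.3% YoY",
    }.get(q, "no results"),
}

SCRIPTED = [  # stub for the model's decisions: (thought, action, argument)
    ("I need France's GDP.", "search", "France GDP 2025"),
    ("Now compare with Germany.", "search", "Germany GDP 2025"),
    ("I have both figures.", "finish", "France $3.1T (+0.8%) vs Germany $4.2T (-0.3%)"),
]

def react_loop():
    transcript = []
    for thought, action, arg in SCRIPTED:
        transcript.append(f"Thought: {thought}")
        if action == "finish":                  # terminal action ends the loop
            return arg, transcript
        observation = TOOLS[action](arg)        # a real tool call, not an
        transcript.append(f"Observation: {observation}")  # imagined result
    return None, transcript

answer, transcript = react_loop()
print(answer)
```

The `transcript` is the grounded reasoning trace: each Observation entry comes from an actual tool call rather than the model's imagination.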
PALM: Offloading Computation to Code
LLMs are probabilistic — they approximate. For calculations like “What is 17.8% of $2,847.50?” the model might answer $507.05 when the correct answer is $506.86. Not a huge error, but in finance or science, even small arithmetic errors compound into significant mistakes.
Program-Aided Language Models solve this by never doing arithmetic in natural language:
# The LLM generates this code when asked "What is 17.8% of $2,847.50?"
# Decimal avoids binary floating-point rounding surprises with currency.
from decimal import Decimal, ROUND_HALF_UP

amount = Decimal("2847.50")
percentage = Decimal("17.8")
result = (amount * percentage / 100).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
print(f"${result}")
# Execution returns: $506.86 ← exact, deterministic
The ADK implementation uses BuiltInCodeExecutor:
from google.adk.agents import Agent
from google.adk.code_executors import BuiltInCodeExecutor

coding_agent = Agent(
    model='gemini-2.0-flash',
    name='CodeAgent',
    instruction="For any mathematical computation, generate Python code and "
                "execute it. Never calculate in natural language.",
    code_executor=BuiltInCodeExecutor(),  # sandboxed Python interpreter
)
Why a sandboxed executor? The code runs in an isolated environment that can’t affect the host system — it can’t read your files, make network requests, or consume excessive resources. This is the same principle as the Docker containerization discussed in Chapter 9 (SICA), applied at the code execution level.
The Self-Correction Agent in Action
Self-correction is CoT applied to quality improvement. The agent generates output, then acts as its own critic:
ORIGINAL REQUIREMENTS: "Write a short, engaging social media post (max 150 chars)
announcing GreenTech Gadgets, a new eco-friendly product line."
INITIAL DRAFT: "We have new products. They are green and techy. Buy GreenTech Gadgets now!"
Self-Correction Process:
Thought 1 (Requirements check):
- Max 150 chars: ✓ (74 chars — well under)
- Short: ✓
- Engaging: ✗ — "We have new products" is completely flat
- Announces GreenTech: ✓ (mentioned at end)
- Eco-friendly emphasis: ✗ — "green" is vague, not compelling
Thought 2 (Identify weaknesses):
- No excitement about the launch
- "eco-friendly" is the USP but barely mentioned
- Call to action "Buy now!" is weak
- No visual elements (emojis) for social media engagement
- Hashtags missing — crucial for discoverability
Thought 3 (Propose improvements):
- Open with an eco-friendly signal emoji
- Lead with the key benefit (eco-friendly + innovative)
- Stronger CTA with "Shop now!"
- Add relevant hashtags
Thought 4 (Revised version):
"🌱 Discover GreenTech Gadgets! Our new eco-friendly line blends innovation
with sustainability. Go green, go smart! Shop now! #EcoFriendly #GreenTech"
(148 chars — within limit ✓)
The improvement is dramatic — from a generic, flat announcement to an engaging, hashtag-equipped post that leads with the product’s key differentiator. The same LLM produced both, but the second pass had structured criteria to evaluate against.
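The critique-and-revise loop reduces to: draft, check against measurable criteria, revise, repeat. A sketch with the revision step stubbed (a real agent would re-prompt the LLM with the failed criteria):

```python
# Self-correction sketch: evaluate a draft against explicit criteria and
# revise until all pass. The draft text and criteria here are illustrative.
MAX_CHARS = 150

def check(post: str) -> list[str]:
    """Return the list of failed requirements (empty list = all pass)."""
    failures = []
    if len(post) > MAX_CHARS:
        failures.append("too long")
    if "GreenTech" not in post:
        failures.append("missing product name")
    if "#" not in post:
        failures.append("missing hashtags")
    return failures

def revise(post: str, failures: list[str]) -> str:
    # Stub: a real implementation would ask the LLM to fix each failure.
    if "missing hashtags" in failures:
        post += " #EcoFriendly #GreenTech"
    return post[:MAX_CHARS]

draft = "Discover GreenTech Gadgets! Eco-friendly innovation. Shop now!"
for _ in range(3):               # bounded number of correction passes
    failures = check(draft)
    if not failures:
        break
    draft = revise(draft, failures)

print(check(draft))  # → [] once all criteria pass
```

The key design point is that `check` returns machine-readable failures, so each revision pass targets specific gaps rather than rewriting blindly.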
RLVR: How Modern Reasoning Models Learn to Think
RLVR is the training technique behind OpenAI’s o-series models, Google’s Gemini 2.5 “thinking” mode, and DeepSeek-R1. Understanding it explains why these models reason so differently from standard LLMs.
The problem with standard fine-tuning for reasoning. Standard supervised fine-tuning trains a model by showing it (question, correct_answer) pairs and minimizing the difference between what the model generates and the correct answer. This teaches the model to imitate correct answers but doesn’t teach it how to reason to those answers.
What RLVR does differently:
1. Collect problems with verifiable correct answers — math problems, coding problems, logical puzzles. These are the training problems where you know definitively whether the answer is right or wrong.
2. Let the model generate its answer plus a long reasoning chain. The model isn’t given the correct answer — it generates its own reasoning trajectory.
3. Check the final answer against the known correct answer. If right, give a positive reward. If wrong, give a negative reward.
4. Update the model’s weights to favor the types of reasoning chains that led to correct answers. The model learns which reasoning patterns work — exploring alternatives, self-checking, backtracking — through trial and error.
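These steps can be sketched end to end for one problem. The chain sampling and the weight update are stubs (real RLVR trains with policy gradients over large batches), and the numbers are illustrative:

```python
# RLVR sketch for a single verifiable problem. `sample_chain` stands in for
# sampling reasoning chains from the policy model.
import random

GOLD = 408  # the verifiable answer to "17 x 24 = ?"

def sample_chain(rng):
    # Stub: pretend the model sampled a chain; some chains end in a wrong answer.
    answer = rng.choice([408, 408, 398])
    return {"chain": f"...reasoning steps ending in {answer}", "answer": answer}

rng = random.Random(0)
chains = [sample_chain(rng) for _ in range(8)]    # step 2: sample trajectories
rewards = [1 if c["answer"] == GOLD else -1       # step 3: binary, objective
           for c in chains]                       # reward, nothing to game

# Step 4 (sketch): reinforce the chains that earned +1.
reinforced = [c for c, r in zip(chains, rewards) if r == 1]
print(f"{len(reinforced)}/{len(chains)} sampled chains reinforced")
```

Note what is absent: no reward model, no human judgment. The reward signal is a direct comparison against `GOLD`.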
Why “verifiable rewards” specifically? In RLHF (reinforcement learning from human feedback), a reward model judges answer quality. But a reward model is itself an LLM that can be “gamed” — the reasoning model learns to produce text that scores well on the reward model rather than text that’s actually correct. This is the reward hacking problem from Chapter 9 (Learning and Adaptation).
With verifiable rewards, there’s no reward model to hack. The reward is binary and objective: is the final answer correct? The model can’t game math. Either 17 × 24 = 408 or it doesn’t.
What RLVR-trained models do. After RLVR training, models like o3 and DeepSeek-R1 produce extended reasoning traces with behaviors they were never explicitly taught:
- Planning: “Let me think about the overall approach before diving in…”
- Self-monitoring: “Wait, I made an error in step 3. Let me recalculate…”
- Backtracking: “That approach isn’t working. Let me try a different method…”
- Verification: “Let me check my answer by working backwards…”
These behaviors emerged from trial-and-error training — the model discovered that they increase the probability of getting the right answer, so it learned to do them.
The Scaling Inference Law: More Thinking = Better Results
The Scaling Inference Law states: for a given model, performance improves predictably as more computational resources are allocated at inference time.
This seems obvious — of course more compute helps. But the law’s practical implication is counterintuitive:
A smaller model with a large thinking budget can outperform a larger model with a small thinking budget.
The effect is easiest to see on a thinking-budget curve: as the budget increases, every model improves, but the smaller model improves more steeply — it was previously bottlenecked by not having enough steps to reason through the problem. Given adequate budget, it catches up to and surpasses the larger model’s single-pass performance.
Practical implications for system design:
- Adaptive compute allocation. Detect query complexity at routing time (Chapter 16 technique). For easy queries, use a small budget. For hard queries, allocate a large budget. Don’t waste compute on questions that don’t need extended thinking.
- Test-time compute vs training compute. Increasing a model’s thinking budget at inference is often more cost-effective than training a larger model. A 1B parameter model with 100 thinking steps can match a 10B parameter model with 1 thinking step on many tasks.
- Problem-specific tuning. Some problem types benefit more from extended thinking (complex math, novel reasoning) than others (factual recall, simple extraction). Profile your use case before defaulting to maximum thinking budget.
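One concrete way to buy accuracy with inference compute is self-consistency: sample the same model several times and take a majority vote. A stub sketch (the 60% accuracy figure is illustrative, not measured):

```python
# Self-consistency sketch: spending more inference compute on the same model.
import random
from collections import Counter

def small_model_answer(rng):
    return 408 if rng.random() < 0.6 else 398   # stub model: right 60% of the time

def answer_with_budget(n_samples, seed=0):
    rng = random.Random(seed)
    votes = Counter(small_model_answer(rng) for _ in range(n_samples))
    return votes.most_common(1)[0][0]           # majority vote over the samples

print(answer_with_budget(1))    # single pass: wrong a large fraction of the time
print(answer_with_budget(25))   # 25x the budget: the majority vote is far more reliable
```

The model is unchanged between the two calls; only the sampling budget differs, which is exactly the lever the Scaling Inference Law describes.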
Deep Research: All Techniques Combined
Google’s Deep Research (covered in Chapter 6 from the planning perspective) is the clearest real-world demonstration of all these reasoning techniques working together:
graph TD
Q([User Research Question]) --> GQ[Generate initial search queries — CoT decomposes the question]
GQ --> WR[Web Research — ReAct: Action=search, Observation=results]
WR --> RF[Reflection — Self-correction: identifies gaps and contradictions]
RF -->|gaps remain| GQ
RF -->|coverage complete| FA[Finalize Answer — synthesize with citations]
FA --> OUT([Comprehensive Research Report])
style GQ fill:#141b2d,stroke:#2698ba,color:#e0e0e0
style WR fill:#141b2d,stroke:#e6a817,color:#e0e0e0
style RF fill:#141b2d,stroke:#c97af2,color:#e0e0e0
style FA fill:#141b2d,stroke:#4fc97e,color:#e0e0e0
The LangGraph implementation (from gemini-fullstack-langgraph-quickstart):
from langgraph.graph import StateGraph, START, END
# OverallState, Configuration, and the node functions registered below are
# defined elsewhere in the quickstart repo.
builder = StateGraph(OverallState, config_schema=Configuration)
# CoT: decompose the question into targeted search queries
builder.add_node("generate_query", generate_query)
# ReAct: actually execute the searches (Action), observe results
builder.add_node("web_research", web_research)
# Self-correction: evaluate what was found, identify gaps
builder.add_node("reflection", reflection)
# Synthesis: produce the final report
builder.add_node("finalize_answer", finalize_answer)
# The flow: generate → search in parallel → reflect → [more research if needed] → finalize
builder.add_edge(START, "generate_query")
builder.add_conditional_edges("generate_query", continue_to_web_research, ["web_research"])
builder.add_edge("web_research", "reflection")
builder.add_conditional_edges("reflection", evaluate_research, ["web_research", "finalize_answer"])
builder.add_edge("finalize_answer", END)
graph = builder.compile(name="pro-search-agent")
The `add_conditional_edges` after reflection: the `evaluate_research` function decides whether enough information has been gathered or significant knowledge gaps remain. If gaps remain, the graph loops back to `generate_query` with refined search terms (self-correction in action). If coverage is sufficient, it proceeds to `finalize_answer`. This loop implements the Scaling Inference Law — the agent keeps allocating more inference compute (more search iterations) until quality is sufficient.
`continue_to_web_research` generates parallel branches: the initial query decomposition produces multiple sub-queries. `add_conditional_edges` can spawn multiple simultaneous `web_research` nodes — one per sub-query — implementing the Parallelization pattern (Chapter 3) within the reasoning graph.
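For illustration only, the reflection gate might be shaped like this. The state keys used here (`knowledge_gaps`, `research_loops`, `max_research_loops`) are assumed names, not the quickstart's actual schema:

```python
# Hypothetical sketch of the reflection gate; the quickstart's real
# `evaluate_research` differs in its state fields and return handling.
def evaluate_research(state: dict) -> str:
    budget_spent = state["research_loops"] >= state["max_research_loops"]
    if state["knowledge_gaps"] and not budget_spent:
        return "web_research"        # gaps remain: allocate more inference compute
    return "finalize_answer"         # coverage sufficient, or budget exhausted

print(evaluate_research({"knowledge_gaps": ["per-capita figures missing"],
                         "research_loops": 1, "max_research_loops": 3}))
# → web_research
```

The loop cap (`max_research_loops`) is what keeps iterative deepening bounded: without it, a stubborn gap could consume compute indefinitely.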
MASS: Automating Multi-Agent System Design
For most teams, the hard part of multi-agent reasoning is not the implementation — it’s figuring out the right agent topology and right prompts. This is a vast search space. MASS (Multi-Agent System Search) automates this.
The three-stage MASS optimization: (1) optimize each agent’s prompt in isolation, (2) search the topology space using those optimized agents as building blocks, (3) re-optimize all prompts jointly for the selected topology.
Key finding from MASS research: The most effective multi-agent systems discovered by MASS consistently have three properties:
- Individual agents with high-quality, specialized prompts (not generic)
- Topologies that combine iterative self-correction with external validation (not just debate without grounding)
- Prompts that are re-optimized for the full workflow after the topology is fixed
This means the order matters: optimize agents → find topology → re-optimize together. Skipping any stage degrades final performance.
Practical Applications of Reasoning Techniques
Complex Q&A
Multi-hop questions that require combining information from multiple sources with logical deduction. "Given these financial reports, which company grew faster in emerging markets?"
Strategic Planning
Problems where the first approach often fails and exploration of multiple strategies is needed. Architecture design, resource allocation, creative problem-solving.
Research & Analysis
Any task requiring external data. The agent's iterative Thought-Action-Observation loop retrieves, synthesizes, and adapts — the foundation of all data-grounded agents.
Mathematical Computation
Financial calculations, statistical analysis, physics simulations — anywhere precise arithmetic matters. The LLM formulates the problem; Python computes it exactly.
Content Generation
Marketing copy, legal documents, technical reports — any output with defined quality criteria. Multiple passes of critique-and-revise dramatically improve polish.
High-Stakes Decisions
Medical diagnosis, legal analysis, investment decisions — where multiple perspectives reduce error and bias. The debate structure forces reasoning to withstand peer scrutiny.
Key Takeaways
- Reasoning techniques make thinking explicit and auditable. When an agent’s reasoning is a visible chain of steps, errors are detectable, decisions are explainable, and the system is debuggable. Opaque single-pass generation offers none of these properties.
- CoT is the foundation. “Think step by step” transforms difficult problems into sequences of easier ones. Every other technique in this chapter builds on CoT — ToT extends it to trees, ReAct combines it with tool calls, Self-Correction applies it to quality improvement.
- ToT handles problems where the first approach fails. When the search space is large and initial attempts are likely to be wrong, branching exploration with backtracking dramatically outperforms linear reasoning. The cost: 10-80× more LLM calls.
- ReAct is how agents interact with the real world. The Thought-Action-Observation loop is the fundamental operating pattern of virtually every production AI agent. Without it, agents reason about what tools might return; with it, they actually use the tools and ground their reasoning in real results.
- PALM eliminates arithmetic errors. Never have the LLM compute — have it write code and execute it. The LLM handles language and logic; Python handles numbers. Combine their strengths.
- RLVR explains why modern reasoning models are so different. Models trained with RLVR (o1, o3, DeepSeek-R1, Gemini 2.5 thinking) spontaneously exhibit planning, self-monitoring, and backtracking — behaviors they discovered through trial-and-error training, not from explicit instruction.
- The Scaling Inference Law challenges “bigger is better.” Giving a smaller model more thinking time often outperforms a larger model with minimal thinking. Inference compute is a tunable resource — allocate it where it matters most.
- Deep Research demonstrates all techniques working together: CoT for query decomposition, ReAct for search + observation, Self-Correction for gap detection, Scaling Inference Law for iterative deepening, Parallelization for simultaneous sub-queries.
- MASS shows that MAS design can be automated. The right agent topology and prompts are hard to find manually in a vast search space. MASS’s three-stage optimization (individual → topology → integrated) consistently discovers configurations that outperform hand-crafted designs.