ARTICLE · 17 MIN READ · JANUARY 25, 2026
Chapter 6: Planning
Reacting to a single input is easy. Achieving a complex goal across many unknown steps is not. Planning is how agents develop foresight — decomposing a destination into a route before they move.
When Reacting Isn’t Enough
Decomposition: Breaking a complex problem into smaller, more manageable sub-problems. "Plan a conference" → book venue, invite speakers, arrange catering, set up registration. Each sub-task is simpler than the whole.
Constraint: A condition that limits possible solutions. "Keep it under $15k" and "everyone must be able to attend" are constraints. A good planner finds solutions that satisfy all constraints.
Fixed workflow vs planning agent: A fixed workflow is like a recipe — you follow the same steps every time. A planning agent is like hiring a chef — it figures out the steps itself based on what's available. Use fixed when you know the exact steps. Use planning when you don't.
Every pattern so far — chaining, routing, parallelization, reflection, tool use — processes one input and produces one output. The “how” is determined by the developer at design time. The agent executes what was pre-wired.
That works when you know exactly what needs to happen. But real goals are often messier:
“Organize a team offsite for 30 people in Q3, keep it under $15k, and make sure everyone can attend.”
There’s no fixed sequence here. You don’t know in advance whether you’ll need to try three venues before one is available, renegotiate the catering, or push the dates by a week. The “how” needs to be discovered from the goal — not hard-coded before execution begins.
That’s planning: the agent first generates the route, then follows it — adapting when the terrain changes.
The difference from all previous patterns. In Chapters 1-5, the developer decides the structure before deployment:
- Prompt chaining: you write the sequence of prompts
- Routing: you write the routing rules
- Parallelization: you identify the independent tasks
- Reflection: you specify the critique criteria
- Tool use: you define which tools exist
Planning is different. The agent itself decides the structure at runtime, in response to the specific goal it’s been given. The developer provides the goal and available tools — the agent figures out the rest.
This is the leap from automation (following a pre-defined script) to autonomy (discovering the script on the fly). It’s a qualitatively different kind of system, and it comes with qualitatively different challenges.
What good planning looks like mechanically. When an LLM-based planning agent receives a goal, it:
- Parses the goal to understand what success looks like
- Identifies what information and actions are needed to get there
- Orders those actions based on dependencies (what needs to happen before what)
- Generates a concrete sequence of steps
- Executes each step, often calling tools along the way
- Checks whether each step succeeded and adjusts if not
- Synthesizes all intermediate results into a final output
The LLM is doing genuine reasoning here — not pattern-matching to a familiar template, but constructing a novel solution path for a novel problem. This is why planning agents are among the most impressive demonstrations of what modern LLMs can do, and also among the most likely to fail in subtle ways.
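To make this loop concrete, here is a minimal sketch of the plan → execute → re-plan cycle. It is illustrative only: call_llm, generate_plan, and run_step are hypothetical placeholders for a real model call and real tool execution, not any particular framework's API.

# A minimal sketch of the plan -> execute -> check loop described above.
# `call_llm` and `run_step` are hypothetical placeholders, not a real framework API.

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (OpenAI, Gemini, etc.)."""
    raise NotImplementedError

def generate_plan(goal: str) -> list[str]:
    """Ask the model to decompose the goal into ordered, actionable steps."""
    response = call_llm(f"Decompose this goal into ordered steps, one per line:\n{goal}")
    return [line.strip("- ").strip() for line in response.splitlines() if line.strip()]

def run_step(step: str) -> tuple[bool, str]:
    """Execute one step (tool call or reasoning) and report success plus result."""
    result = call_llm(f"Execute this step and report the result:\n{step}")
    return True, result

def plan_and_execute(goal: str, max_replans: int = 3) -> str:
    plan = generate_plan(goal)          # 1. generate the route
    results: list[str] = []
    replans = 0
    while plan:
        step = plan.pop(0)
        ok, result = run_step(step)     # 2. execute each step
        if ok:
            results.append(result)      # 3. keep intermediate results
        elif replans < max_replans:
            replans += 1                # 4. re-plan from the failure point
            plan = generate_plan(
                f"{goal}\nFailed step: {step}\nCompleted so far: {results}"
            )
        else:
            break
    # 5. synthesize intermediate results into a final output
    return call_llm(f"Synthesize a final answer for '{goal}' from: {results}")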
What Planning Is
A planning agent receives a high-level objective and does two things before acting:
- Decomposes the goal into a sequence of smaller, actionable steps
- Adapts when steps fail, constraints change, or new information arrives
The critical loop is the feedback arc from “Goal met?” back to the Planning Agent. When an obstacle blocks a step — a venue is booked, a search returns nothing useful, a dependency fails — a capable planner doesn’t stop. It re-evaluates and generates a revised plan.
Static vs Dynamic: The Core Decision
Before choosing planning, ask one question: does the “how” need to be discovered, or is it already known?
Anatomy of a Generated Plan
When a planning agent decomposes a complex task into executable steps, the steps are not uniform: some call tools (search, APIs), others are pure LLM reasoning steps, and some trigger actions. Planning isn't a monolith; it's an orchestration of all the prior patterns working in sequence.
Use Cases Where Planning Is Essential
- Research & Synthesis: multi-phase tasks (gather sources, extract findings, identify gaps, refine, write) where each phase depends on the previous one's output. Examples: research reports, literature reviews.
- Workflow Automation: business processes with ordered, interdependent steps — each must complete successfully before the next begins. Examples: employee onboarding, invoice processing.
- Robotics & Navigation: physical or virtual agents must plan a path through state space — optimizing for constraints while avoiding obstacles. Examples: autonomous vehicles, warehouse robots.
- Competitive Intelligence: gather data from multiple sources, cross-reference, identify gaps, synthesize into a structured report with citations. Examples: market analysis, product benchmarking.
- Customer Support (Multi-step): complex issues requiring diagnosis, solution search, implementation, and verification — not a single-turn response. Examples: technical troubleshooting, escalations.
- Content Production: plan → outline → draft → research → refine → publish; each phase is distinct and builds on the previous output. Examples: blog posts, documentation, reports.

CrewAI: Plan-then-Execute
CrewAI implements planning by having a single agent receive a two-part task: first produce the plan, then execute it. The key is in how the task description and expected_output are structured.
from crewai import Agent, Task, Crew, Process
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4-turbo")
Why gpt-4-turbo for planning? Planning requires multi-step reasoning, coherent goal decomposition, and the ability to maintain context across a long task. More capable models produce more reliable, logical plans. Use the strongest model your budget allows for the planning stage — the cost is justified by the quality improvement.
The Planning Agent
planner_writer_agent = Agent(
    role = 'Article Planner and Writer',
    goal = 'Plan and then write a concise, engaging summary on a specified topic.',
    backstory = (
        'You are an expert technical writer and content strategist. '
        'Your strength lies in creating a clear, actionable plan before writing, '
        'ensuring the final summary is both informative and easy to digest.'
    ),
    verbose = True,
    allow_delegation = False,
    llm = llm,
)
The backstory shapes the plan. In CrewAI, the backstory is injected into the system prompt. Saying "your strength lies in creating a clear, actionable plan before writing" directly nudges the model to produce a plan first rather than diving straight into content. This is prompt engineering embedded in the agent definition.
The Two-Stage Task
topic = "The importance of Reinforcement Learning in AI"

task = Task(
    description = (
        f"1. Create a bullet-point plan for a summary on: '{topic}'.\n"
        f"2. Write the summary based on your plan, keeping it around 200 words."
    ),
    expected_output = (
        "A final report with two distinct sections:\n\n"
        "### Plan\n"
        "- A bulleted list of the main points.\n\n"
        "### Summary\n"
        "- A concise, well-structured summary."
    ),
    agent = planner_writer_agent,
)
The two-part task description is the planning mechanism. Step 1 forces the agent to create a plan. Step 2 forces it to execute the plan it just created. This isn’t magic — it’s a structured prompt that makes the “plan first” behavior explicit and verifiable.
expected_output with section headers. Specifying ### Plan and ### Summary as required headings makes the output machine-readable. Downstream agents, parsers, or quality checks can verify that both sections are present and non-empty.
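Because the headings are contractual, a downstream check can enforce them. A minimal sketch of such a validator (the function name and regex are my own, not part of CrewAI) — run it over the crew's final output string before passing it downstream:

import re

# Hypothetical validator: confirm both required sections exist and are non-empty.
def validate_output(text: str) -> dict[str, str]:
    sections = {}
    for header in ("Plan", "Summary"):
        match = re.search(rf"### {header}\n(.+?)(?=\n### |\Z)", text, re.DOTALL)
        if not match or not match.group(1).strip():
            raise ValueError(f"Missing or empty section: {header}")
        sections[header] = match.group(1).strip()
    return sections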
Execution Flow
crew = Crew(
    agents = [planner_writer_agent],
    tasks = [task],
    process = Process.sequential,
)

result = crew.kickoff()
Process.sequential: ensures tasks run in defined order. Relevant when a multi-agent crew has a planner agent followed by an executor agent — the plan must exist before execution begins.
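For reference, a hypothetical two-agent variant with a dedicated planner and executor. It assumes CrewAI's context field on Task, which passes a prior task's output forward — treat this as a sketch under that assumption, not a drop-in recipe:

# Hypothetical planner/executor split of the single-agent example above.
plan_task = Task(
    description = f"Create a bullet-point plan for a summary on: '{topic}'.",
    expected_output = "### Plan\n- A bulleted list of the main points.",
    agent = planner_writer_agent,
)

executor_agent = Agent(
    role = 'Summary Writer',
    goal = 'Write the summary by following the provided plan exactly.',
    backstory = 'You execute plans faithfully, step by step.',
    llm = llm,
)

execute_task = Task(
    description = 'Write the ~200-word summary following the plan from the previous task.',
    expected_output = 'A concise summary that covers every point in the plan.',
    agent = executor_agent,
    context = [plan_task],  # receives the planner task's output
)

crew = Crew(
    agents = [planner_writer_agent, executor_agent],
    tasks = [plan_task, execute_task],
    process = Process.sequential,  # the plan must exist before execution begins
)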
Google Deep Research: Planning at Scale
Google’s Deep Research is a production planning agent built on Gemini. It doesn’t just generate a plan and execute it once — it continuously re-plans based on what each search returns.
Key Properties
| Property | Detail |
|---|---|
| User review step | Plan is shown to the user before execution. You can edit research questions before the agent searches. |
| Adaptive queries | Doesn’t run predefined searches — formulates queries dynamically based on what’s been found so far |
| Gap detection | After each round, identifies what’s missing and generates targeted follow-up searches |
| Asynchronous | Long-running; user can disengage and is notified on completion |
| Source transparency | Returns full citation list with direct links to all consulted sources |
| Private doc integration | Can combine uploaded documents with web research in a single synthesis |
OpenAI Deep Research API
The OpenAI Deep Research API gives programmatic access to the same plan → search → synthesize pipeline, with full visibility into every intermediate step.
from openai import OpenAI
client = OpenAI(api_key="YOUR_OPENAI_API_KEY")
The API Call
system_message = """You are a professional researcher preparing a structured report.
Focus on data-rich insights, use reliable sources, and include inline citations."""

user_query = "Research the economic impact of semaglutide on global healthcare systems."

response = client.responses.create(
    model = "o3-deep-research-2025-06-26",
    input = [
        {"role": "developer", "content": [{"type": "input_text", "text": system_message}]},
        {"role": "user", "content": [{"type": "input_text", "text": user_query}]},
    ],
    reasoning = {"summary": "auto"},
    tools = [{"type": "web_search_preview"}],
)
o3-deep-research-2025-06-26 — the model trained specifically for long-horizon research tasks. It internally manages: query decomposition, web search invocation, result analysis, gap identification, and synthesis. The o4-mini-deep-research-2025-06-26 variant trades quality for speed.
reasoning={"summary": "auto"} — tells the API to include summaries of the model's internal reasoning steps in the response. This exposes the agent's "thinking" — what it was trying to figure out before each search — making the process auditable.
tools=[{"type": "web_search_preview"}] — enables the model to call web search as a tool. Without this, it's just a standard completion. With it, the model can issue multiple search queries and incorporate live web results.
Accessing the Report
# Final synthesized report
final_report = response.output[-1].content[0].text
print(final_report)
response.output[-1] — the output is a list of events (reasoning steps, search calls, code executions, final response). The last item is always the final response. This event-based structure is what enables transparency into intermediate steps.
Extracting Citations
annotations = response.output[-1].content[0].annotations

for i, citation in enumerate(annotations):
    cited_text = final_report[citation.start_index : citation.end_index]
    print(f"Citation {i+1}:")
    print(f"  Cited Text: {cited_text}")
    print(f"  Title: {citation.title}")
    print(f"  URL: {citation.url}")
    print(f"  Position: chars {citation.start_index}–{citation.end_index}")
Inline citations are the critical feature for enterprise use. Every factual claim in the report has start_index and end_index pointing to the exact text span in the report. citation.url links to the original source. This means every claim is verifiable — not just a confident-sounding hallucination.
Inspecting Intermediate Steps
# The agent's internal reasoning (what it was planning)
reasoning_step = next(
    (item for item in response.output if item.type == "reasoning"), None
)
if reasoning_step:
    for part in reasoning_step.summary:
        print(f"  Reasoning: {part.text}")

# The exact search queries it executed
search_step = next(
    (item for item in response.output if item.type == "web_search_call"), None
)
if search_step:
    print(f"  Query executed: '{search_step.action['query']}'")
    print(f"  Status: {search_step.status}")

# Any code it ran (only present if the code_interpreter tool was included)
code_step = next(
    (item for item in response.output if item.type == "code_interpreter_call"), None
)
if code_step:
    print(f"  Code ran:\n{code_step.input}")
    print(f"  Output: {code_step.output}")
This is the key advantage over ChatGPT Deep Research. The API exposes every intermediate step:
- Reasoning steps: the model’s planning narrative — what it was trying to determine
- Search calls: the exact query strings submitted to the web, and their status
- Code calls: any Python it executed for data analysis or computation
This makes debugging possible. If the report gets something wrong, you can trace exactly which search query returned bad data.
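A small audit helper makes that trace easy to read. The function below is hypothetical, assembled only from the event fields shown in this section; it prints every intermediate step in execution order:

# Hypothetical audit helper: print the full event trace for debugging.
def dump_trace(response) -> None:
    for item in response.output:
        if item.type == "reasoning":
            for part in item.summary:
                print(f"[reasoning] {part.text}")
        elif item.type == "web_search_call":
            print(f"[search] {item.action['query']!r} ({item.status})")
        elif item.type == "code_interpreter_call":
            print(f"[code] {item.input}")
        elif item.type == "message":
            print(f"[final] report of {len(item.content[0].text)} chars")

dump_trace(response)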
Full Response Structure
response.output = [
  { type: "reasoning",       summary: [{text: "I'll start by..."}] },
  { type: "web_search_call", action: {query: "semaglutide cost analysis"}, status: "completed" },
  { type: "web_search_call", action: {query: "GLP-1 drugs healthcare budget 2024"}, status: "completed" },
  { type: "reasoning",       summary: [{text: "I found conflicting data on..."}] },
  { type: "web_search_call", action: {query: "..."}, status: "completed" },
  ... (more searches as needed) ...
  { type: "message",         content: [{text: "## Economic Impact of Semaglutide\n\n..."}] }
]
Planning Systems Compared
| | CrewAI Plan-then-Execute | Google Deep Research | OpenAI Deep Research API |
|---|---|---|---|
| Access | Open source | Gemini app (UI) | REST API |
| Transparency | Verbose agent steps | Plan + citations | Full: reasoning + queries + code |
| Adaptability | Manual re-plan | Fully autonomous | Autonomous |
| Customization | Full control | Minimal | System prompt + MCP tools |
| Use case | Custom workflows | Ad-hoc research | Production apps |
| Output | Task result | Structured report | Structured report + metadata |
At a Glance
An agent receives a high-level goal and generates a sequence of steps to achieve it — before acting. The plan is not known in advance; it's created in response to the request and adapted when obstacles arise.
Complex goals have unknown "how." Planning transforms a reactive system into a strategic executor — capable of handling multi-step tasks, dependencies, and dynamic obstacles.
Use planning when the solution to a problem needs to be discovered, not just executed. If the steps are already known and repeatable — use a fixed workflow instead.
How a Planning Agent Decides What to Do
When you give a planning agent a complex goal, how does it actually produce a plan? The answer is more nuanced than “the LLM just figures it out.”
The role of the system prompt in planning. In CrewAI, the agent’s backstory and goal form its system prompt. This prompt shapes how the LLM approaches the planning task. A backstory that says “You are an expert technical writer who always starts by outlining before writing” directly influences the model to produce an outline (a plan) before generating content. This is why the description and expected_output fields in CrewAI tasks matter — they’re part of the prompt that shapes what the LLM produces.
How decomposition actually happens mechanically. The LLM has seen millions of examples of task decomposition in its training data — project plans, tutorials, how-to guides, step-by-step instructions. When it encounters a new goal, it applies learned patterns to decompose it. This is not algorithmic search (like classical AI planning). It’s pattern matching on learned examples. This is why LLM-based planning is so flexible (it handles novel domains) but also why it can be unreliable (it might apply the wrong pattern to an unusual goal).
Why planning agents fail in production. Planning agents fail in characteristic ways: (1) Incorrect decomposition — wrong steps or wrong order. (2) Missing steps — critical steps omitted because they’re not obvious from the goal. (3) Hallucinated capabilities — planning to use a tool or data source that doesn’t exist. (4) Over-planning — 20-step plan for a 3-step task. (5) Adaptation failure — when a step fails, the model doesn’t adapt effectively.
Google Deep Research vs. a simple planning agent. The key difference between a toy planning agent (like the CrewAI example) and a production system like Google Deep Research is iterative refinement of the plan itself. A toy agent generates one plan and executes it. Deep Research continuously updates its plan based on what it discovers — if a search returns unexpected results, the plan changes. If a knowledge gap is found, new searches are added to the plan. This makes it far more robust but far more complex to build.
Common Mistakes When Building Planning Systems
Mistake 1: No clear success criteria. A planning agent needs to know what “done” means. “Research AI in healthcare” can go on forever. “Research AI in healthcare and produce a 5-section report covering: current applications, key companies, clinical trial results, regulatory landscape, and 2026 predictions” has a clear stopping point. Always specify the exact form and scope of the expected output.
Mistake 2: No intermediate checkpoints. Complex plans can go off track at step 3 of 12 and not show obvious failure until step 12. Add intermediate validation: after the research phase, have the agent summarize what it found and verify it covers the required topics before proceeding to drafting.
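One way to implement such a checkpoint is a coverage check between phases. The topic list and function below are illustrative, not from any framework:

# Hypothetical checkpoint between the research phase and the drafting phase.
REQUIRED_TOPICS = ["applications", "companies", "trials", "regulation", "predictions"]

def missing_topics(research_summary: str) -> list[str]:
    """Return every required topic the research summary never mentions."""
    text = research_summary.lower()
    return [topic for topic in REQUIRED_TOPICS if topic not in text]

gaps = missing_topics("Covered current applications, key companies, and clinical trials...")
if gaps:
    # Send the agent back for targeted follow-up research instead of drafting.
    print(f"Coverage gaps before drafting: {gaps}")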
Mistake 3: Trusting planning agents with irreversible actions. If your planning agent can “send emails,” “book flights,” or “execute database changes,” a planning error can cause real-world harm. Use human-in-the-loop checkpoints for irreversible actions, especially in early development.
Mistake 4: No cost limit. A planning agent with access to web search and LLM calls can rack up significant API costs chasing comprehensive coverage. Always set a maximum number of iterations and tool calls.
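A minimal budget guard (names are my own) that caps both loop iterations and tool calls:

# Hypothetical hard budget for an agent loop: cap iterations and tool calls.
class BudgetExceeded(RuntimeError):
    pass

class Budget:
    def __init__(self, max_iterations: int = 10, max_tool_calls: int = 25):
        self.iterations = 0
        self.tool_calls = 0
        self.max_iterations = max_iterations
        self.max_tool_calls = max_tool_calls

    def tick_iteration(self) -> None:
        self.iterations += 1
        if self.iterations > self.max_iterations:
            raise BudgetExceeded("iteration budget exhausted")

    def tick_tool_call(self) -> None:
        self.tool_calls += 1
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("tool-call budget exhausted")

Call tick_iteration() at the top of each planning-loop pass and tick_tool_call() before each tool invocation; when BudgetExceeded fires, return the best partial result rather than spending further.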
Key Takeaways
- Planning separates goal from method. The user defines what. The agent discovers how. This is what enables autonomous behavior on complex, open-ended tasks.
- Static vs dynamic is the central decision. Known, repeatable processes → fixed workflow. Unknown, context-dependent processes → planning agent.
- Plans are starting points, not scripts. A capable planning agent re-plans when obstacles appear — venue is booked, search returns nothing, a step fails. Adaptability is the feature.
- CrewAI implements planning through task structure. The two-part description (plan, then execute) and the expected_output with explicit section headers force explicit plan-then-act behavior.
- Google Deep Research is an iterative planner. It presents the plan for user review, then runs a search loop that continuously re-formulates queries based on what it finds — combining planning, parallelization, and reflection.
- The OpenAI Deep Research API exposes every step. Unlike the UI, the API returns reasoning summaries, search queries, and code executions as inspectable objects — enabling debugging, auditing, and downstream integration.
- Citations are non-negotiable for trust. Both deep research systems link every factual claim to a source. For planning agents in enterprise use, verifiability is a requirement, not a feature.
Next up — Chapter 7: Multi-Agent Systems, where individual agents become coordinated teams — each specialized, each accountable, working in parallel toward shared goals.