ARTICLE · 17 MIN READ · JANUARY 25, 2026
Chapter 6: Planning
Reacting to a single input is easy. Achieving a complex goal across many unknown steps is not. Planning is how agents develop foresight — decomposing a destination into a route before they move.
When Reacting Isn’t Enough
Decomposition: Breaking a complex problem into smaller, more manageable sub-problems. "Plan a conference" → book venue, invite speakers, arrange catering, set up registration. Each sub-task is simpler than the whole.
Constraint: A condition that limits possible solutions. "Keep it under $15k" and "everyone must be able to attend" are constraints. A good planner finds solutions that satisfy all constraints.
Fixed workflow vs planning agent: A fixed workflow is like a recipe — you follow the same steps every time. A planning agent is like hiring a chef — it figures out the steps itself based on what's available. Use fixed when you know the exact steps. Use planning when you don't.
Every pattern so far — chaining, routing, parallelization, reflection, tool use — processes one input and produces one output. The “how” is determined by the developer at design time. The agent executes what was pre-wired.
That works when you know exactly what needs to happen. But real goals are often messier:
“Organize a team offsite for 30 people in Q3, keep it under $15k, and make sure everyone can attend.”
There’s no fixed sequence here. You don’t know in advance whether you’ll need to try three venues before one is available, renegotiate the catering, or push the dates by a week. The “how” needs to be discovered from the goal — not hard-coded before execution begins.
That’s planning: the agent first generates the route, then follows it — adapting when the terrain changes.
The difference from all previous patterns. In Chapters 1-5, the developer decides the structure before deployment:
- Prompt chaining: you write the sequence of prompts
- Routing: you write the routing rules
- Parallelization: you identify the independent tasks
- Reflection: you specify the critique criteria
- Tool use: you define which tools exist
Planning is different. The agent itself decides the structure at runtime, in response to the specific goal it’s been given. The developer provides the goal and available tools — the agent figures out the rest.
This is the leap from automation (following a pre-defined script) to autonomy (discovering the script on the fly). It’s a qualitatively different kind of system, and it comes with qualitatively different challenges.
What good planning looks like mechanically. When an LLM-based planning agent receives a goal, it:
- Parses the goal to understand what success looks like
- Identifies what information and actions are needed to get there
- Orders those actions based on dependencies (what needs to happen before what)
- Generates a concrete sequence of steps
- Executes each step, often calling tools along the way
- Checks whether each step succeeded and adjusts if not
- Synthesizes all intermediate results into a final output
The LLM is doing genuine reasoning here — not pattern-matching to a familiar template, but constructing a novel solution path for a novel problem. This is why planning agents are among the most impressive demonstrations of what modern LLMs can do, and also among the most likely to fail in subtle ways.
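To make this loop concrete, here is a minimal sketch of the plan → execute → re-plan cycle. It is illustrative only: call_llm, generate_plan, and run_step are hypothetical placeholders for a real model call and real tool execution, not any particular framework's API.

# A minimal sketch of the plan -> execute -> check loop described above.
# `call_llm` and `run_step` are hypothetical placeholders, not a real framework API.

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (OpenAI, Gemini, etc.)."""
    raise NotImplementedError

def generate_plan(goal: str) -> list[str]:
    """Ask the model to decompose the goal into ordered, actionable steps."""
    response = call_llm(f"Decompose this goal into ordered steps, one per line:\n{goal}")
    return [line.strip("- ").strip() for line in response.splitlines() if line.strip()]

def run_step(step: str) -> tuple[bool, str]:
    """Execute one step (tool call or reasoning) and report success plus result."""
    result = call_llm(f"Execute this step and report the result:\n{step}")
    return True, result

def plan_and_execute(goal: str, max_replans: int = 3) -> str:
    plan = generate_plan(goal)          # 1. generate the route
    results: list[str] = []
    replans = 0
    while plan:
        step = plan.pop(0)
        ok, result = run_step(step)     # 2. execute each step
        if ok:
            results.append(result)      # 3. keep intermediate results
        elif replans < max_replans:
            replans += 1                # 4. re-plan from the failure point
            plan = generate_plan(
                f"{goal}\nFailed step: {step}\nCompleted so far: {results}"
            )
        else:
            break
    # 5. synthesize intermediate results into a final output
    return call_llm(f"Synthesize a final answer for '{goal}' from: {results}")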
What Planning Is
A planning agent receives a high-level objective and does two things before acting:
- Decomposes the goal into a sequence of smaller, actionable steps
- Adapts when steps fail, constraints change, or new information arrives
The critical loop is the feedback arc from “Goal met?” back to the Planning Agent. When an obstacle blocks a step — a venue is booked, a search returns nothing useful, a dependency fails — a capable planner doesn’t stop. It re-evaluates and generates a revised plan.
Static vs Dynamic: The Core Decision
Before choosing planning, ask one question: does the “how” need to be discovered, or is it already known?
Anatomy of a Generated Plan
When a planning agent decomposes a complex task into executable steps, the steps are not uniform: some call tools (search, APIs), others are pure LLM reasoning steps, and some trigger actions. Planning isn't a monolith; it's an orchestration of all the prior patterns working in sequence.
Use Cases Where Planning Is Essential
- Research & Synthesis: multi-phase tasks (gather sources, extract findings, identify gaps, refine, write) where each phase depends on the previous one's output. Examples: research reports, literature reviews.
- Workflow Automation: business processes with ordered, interdependent steps — each must complete successfully before the next begins. Examples: employee onboarding, invoice processing.
- Robotics & Navigation: physical or virtual agents must plan a path through state space — optimizing for constraints while avoiding obstacles. Examples: autonomous vehicles, warehouse robots.
- Competitive Intelligence: gather data from multiple sources, cross-reference, identify gaps, synthesize into a structured report with citations. Examples: market analysis, product benchmarking.
- Customer Support (Multi-step): complex issues requiring diagnosis, solution search, implementation, and verification — not a single-turn response. Examples: technical troubleshooting, escalations.
- Content Production: plan → outline → draft → research → refine → publish; each phase is distinct and builds on the previous output. Examples: blog posts, documentation, reports.

CrewAI: Plan-then-Execute
CrewAI implements planning by having a single agent receive a two-part task: first produce the plan, then execute it. The key is in how the task description and expected_output are structured.
from crewai import Agent, Task, Crew, Process
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(model="gpt-4-turbo")
Why gpt-4-turbo for planning? Planning requires multi-step reasoning, coherent goal decomposition, and the ability to maintain context across a long task. More capable models produce more reliable, logical plans. Use the strongest model your budget allows for the planning stage — the cost is justified by the quality improvement.
The Planning Agent
planner_writer_agent = Agent(
    role = 'Article Planner and Writer',
    goal = 'Plan and then write a concise, engaging summary on a specified topic.',
    backstory = (
        'You are an expert technical writer and content strategist. '
        'Your strength lies in creating a clear, actionable plan before writing, '
        'ensuring the final summary is both informative and easy to digest.'
    ),
    verbose = True,
    allow_delegation = False,
    llm = llm,
)
The backstory shapes the plan. In CrewAI, the backstory is injected into the system prompt. Saying "your strength lies in creating a clear, actionable plan before writing" directly nudges the model to produce a plan first rather than diving straight into content. This is prompt engineering embedded in the agent definition.
The Two-Stage Task
topic = "The importance of Reinforcement Learning in AI"

task = Task(
    description = (
        f"1. Create a bullet-point plan for a summary on: '{topic}'.\n"
        f"2. Write the summary based on your plan, keeping it around 200 words."
    ),
    expected_output = (
        "A final report with two distinct sections:\n\n"
        "### Plan\n"
        "- A bulleted list of the main points.\n\n"
        "### Summary\n"
        "- A concise, well-structured summary."
    ),
    agent = planner_writer_agent,
)
The two-part task description is the planning mechanism. Step 1 forces the agent to create a plan. Step 2 forces it to execute the plan it just created. This isn’t magic — it’s a structured prompt that makes the “plan first” behavior explicit and verifiable.
expected_output with section headers. Specifying ### Plan and ### Summary as required headings makes the output machine-readable. Downstream agents, parsers, or quality checks can verify that both sections are present and non-empty.
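Because the headings are contractual, a downstream check can enforce them. A minimal sketch of such a validator (the function name and regex are my own, not part of CrewAI) — run it over the crew's final output string before passing it downstream:

import re

# Hypothetical validator: confirm both required sections exist and are non-empty.
def validate_output(text: str) -> dict[str, str]:
    sections = {}
    for header in ("Plan", "Summary"):
        match = re.search(rf"### {header}\n(.+?)(?=\n### |\Z)", text, re.DOTALL)
        if not match or not match.group(1).strip():
            raise ValueError(f"Missing or empty section: {header}")
        sections[header] = match.group(1).strip()
    return sections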
Execution Flow
crew = Crew(
    agents = [planner_writer_agent],
    tasks = [task],
    process = Process.sequential,
)

result = crew.kickoff()
Process.sequential: ensures tasks run in defined order. Relevant when a multi-agent crew has a planner agent followed by an executor agent — the plan must exist before execution begins.
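For reference, a hypothetical two-agent variant with a dedicated planner and executor. It assumes CrewAI's context field on Task, which passes a prior task's output forward — treat this as a sketch under that assumption, not a drop-in recipe:

# Hypothetical planner/executor split of the single-agent example above.
plan_task = Task(
    description = f"Create a bullet-point plan for a summary on: '{topic}'.",
    expected_output = "### Plan\n- A bulleted list of the main points.",
    agent = planner_writer_agent,
)

executor_agent = Agent(
    role = 'Summary Writer',
    goal = 'Write the summary by following the provided plan exactly.',
    backstory = 'You execute plans faithfully, step by step.',
    llm = llm,
)

execute_task = Task(
    description = 'Write the ~200-word summary following the plan from the previous task.',
    expected_output = 'A concise summary that covers every point in the plan.',
    agent = executor_agent,
    context = [plan_task],  # receives the planner task's output
)

crew = Crew(
    agents = [planner_writer_agent, executor_agent],
    tasks = [plan_task, execute_task],
    process = Process.sequential,  # the plan must exist before execution begins
)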
Google Deep Research: Planning at Scale
Google’s Deep Research is a production planning agent built on Gemini. It doesn’t just generate a plan and execute it once — it continuously re-plans based on what each search returns.
Key Properties
| Property | Detail |
|---|---|
| User review step | Plan is shown to the user before execution. You can edit research questions before the agent searches. |
| Adaptive queries | Doesn’t run predefined searches — formulates queries dynamically based on what’s been found so far |
| Gap detection | After each round, identifies what’s missing and generates targeted follow-up searches |
| Asynchronous | Long-running; user can disengage and is notified on completion |
| Source transparency | Returns full citation list with direct links to all consulted sources |
| Private doc integration | Can combine uploaded documents with web research in a single synthesis |
OpenAI Deep Research API
The OpenAI Deep Research API gives programmatic access to the same plan → search → synthesize pipeline, with full visibility into every intermediate step.
from openai import OpenAI
client = OpenAI(api_key="YOUR_OPENAI_API_KEY")
The API Call
system_message = """You are a professional researcher preparing a structured report.
Focus on data-rich insights, use reliable sources, and include inline citations."""

user_query = "Research the economic impact of semaglutide on global healthcare systems."

response = client.responses.create(
    model = "o3-deep-research-2025-06-26",
    input = [
        {"role": "developer", "content": [{"type": "input_text", "text": system_message}]},
        {"role": "user", "content": [{"type": "input_text", "text": user_query}]},
    ],
    reasoning = {"summary": "auto"},
    tools = [{"type": "web_search_preview"}],
)
o3-deep-research-2025-06-26 — the model trained specifically for long-horizon research tasks. It internally manages: query decomposition, web search invocation, result analysis, gap identification, and synthesis. The o4-mini-deep-research-2025-06-26 variant trades quality for speed.
reasoning={"summary": "auto"} — tells the API to include summaries of the model's internal reasoning steps in the response. This exposes the agent's "thinking" — what it was trying to figure out before each search — making the process auditable.
tools=[{"type": "web_search_preview"}] — enables the model to call web search as a tool. Without this, it's just a standard completion. With it, the model can issue multiple search queries and incorporate live web results.
Accessing the Report
# Final synthesized report
final_report = response.output[-1].content[0].text
print(final_report)
response.output[-1] — the output is a list of events (reasoning steps, search calls, code executions, final response). The last item is always the final response. This event-based structure is what enables transparency into intermediate steps.
Extracting Citations
annotations = response.output[-1].content[0].annotations

for i, citation in enumerate(annotations):
    cited_text = final_report[citation.start_index : citation.end_index]
    print(f"Citation {i+1}:")
    print(f"  Cited Text: {cited_text}")
    print(f"  Title: {citation.title}")
    print(f"  URL: {citation.url}")
    print(f"  Position: chars {citation.start_index}–{citation.end_index}")
Inline citations are the critical feature for enterprise use. Every factual claim in the report has start_index and end_index pointing to the exact text span in the report. citation.url links to the original source. This means every claim is verifiable — not just a confident-sounding hallucination.
Inspecting Intermediate Steps
# The agent's internal reasoning (what it was planning)
reasoning_step = next(
    (item for item in response.output if item.type == "reasoning"), None
)
if reasoning_step:
    for part in reasoning_step.summary:
        print(f"  Reasoning: {part.text}")

# The exact search queries it executed
search_step = next(
    (item for item in response.output if item.type == "web_search_call"), None
)
if search_step:
    print(f"  Query executed: '{search_step.action['query']}'")
    print(f"  Status: {search_step.status}")

# Any code it ran (only present if the code_interpreter tool was included)
code_step = next(
    (item for item in response.output if item.type == "code_interpreter_call"), None
)
if code_step:
    print(f"  Code ran:\n{code_step.input}")
    print(f"  Output: {code_step.output}")
This is the key advantage over ChatGPT Deep Research. The API exposes every intermediate step:
- Reasoning steps: the model’s planning narrative — what it was trying to determine
- Search calls: the exact query strings submitted to the web, and their status
- Code calls: any Python it executed for data analysis or computation
This makes debugging possible. If the report gets something wrong, you can trace exactly which search query returned bad data.
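A small audit helper makes that trace easy to read. The function below is hypothetical, assembled only from the event fields shown in this section; it prints every intermediate step in execution order:

# Hypothetical audit helper: print the full event trace for debugging.
def dump_trace(response) -> None:
    for item in response.output:
        if item.type == "reasoning":
            for part in item.summary:
                print(f"[reasoning] {part.text}")
        elif item.type == "web_search_call":
            print(f"[search] {item.action['query']!r} ({item.status})")
        elif item.type == "code_interpreter_call":
            print(f"[code] {item.input}")
        elif item.type == "message":
            print(f"[final] report of {len(item.content[0].text)} chars")

dump_trace(response)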
Full Response Structure
response.output = [
  { type: "reasoning",       summary: [{text: "I'll start by..."}] },
  { type: "web_search_call", action: {query: "semaglutide cost analysis"}, status: "completed" },
  { type: "web_search_call", action: {query: "GLP-1 drugs healthcare budget 2024"}, status: "completed" },
  { type: "reasoning",       summary: [{text: "I found conflicting data on..."}] },
  { type: "web_search_call", action: {query: "..."}, status: "completed" },
  ... (more searches as needed) ...
  { type: "message",         content: [{text: "## Economic Impact of Semaglutide\n\n..."}] }
]
Planning Systems Compared
| | CrewAI Plan-then-Execute | Google Deep Research | OpenAI Deep Research API |
|---|---|---|---|
| Access | Open source | Gemini app (UI) | REST API |
| Transparency | Verbose agent steps | Plan + citations | Full: reasoning + queries + code |
| Adaptability | Manual re-plan | Fully autonomous | Autonomous |
| Customization | Full control | Minimal | System prompt + MCP tools |
| Use case | Custom workflows | Ad-hoc research | Production apps |
| Output | Task result | Structured report | Structured report + metadata |
At a Glance
An agent receives a high-level goal and generates a sequence of steps to achieve it — before acting. The plan is not known in advance; it's created in response to the request and adapted when obstacles arise.
Complex goals have unknown "how." Planning transforms a reactive system into a strategic executor — capable of handling multi-step tasks, dependencies, and dynamic obstacles.
Use planning when the solution to a problem needs to be discovered, not just executed. If the steps are already known and repeatable — use a fixed workflow instead.
How a Planning Agent Decides What to Do
When you give a planning agent a complex goal, how does it actually produce a plan? The answer is more nuanced than “the LLM just figures it out.”
The role of the system prompt in planning. In CrewAI, the agent’s backstory and goal form its system prompt. This prompt shapes how the LLM approaches the planning task. A backstory that says “You are an expert technical writer who always starts by outlining before writing” directly influences the model to produce an outline (a plan) before generating content. This is why the description and expected_output fields in CrewAI tasks matter — they’re part of the prompt that shapes what the LLM produces.
How decomposition actually happens mechanically. The LLM has seen millions of examples of task decomposition in its training data — project plans, tutorials, how-to guides, step-by-step instructions. When it encounters a new goal, it applies learned patterns to decompose it. This is not algorithmic search (like classical AI planning). It’s pattern matching on learned examples. This is why LLM-based planning is so flexible (it handles novel domains) but also why it can be unreliable (it might apply the wrong pattern to an unusual goal).
Why planning agents fail in production. Planning agents fail in characteristic ways: (1) Incorrect decomposition — wrong steps or wrong order. (2) Missing steps — critical steps omitted because they’re not obvious from the goal. (3) Hallucinated capabilities — planning to use a tool or data source that doesn’t exist. (4) Over-planning — 20-step plan for a 3-step task. (5) Adaptation failure — when a step fails, the model doesn’t adapt effectively.
Google Deep Research vs. a simple planning agent. The key difference between a toy planning agent (like the CrewAI example) and a production system like Google Deep Research is iterative refinement of the plan itself. A toy agent generates one plan and executes it. Deep Research continuously updates its plan based on what it discovers — if a search returns unexpected results, the plan changes. If a knowledge gap is found, new searches are added to the plan. This makes it far more robust but far more complex to build.
Common Mistakes When Building Planning Systems
Mistake 1: No clear success criteria. A planning agent needs to know what “done” means. “Research AI in healthcare” can go on forever. “Research AI in healthcare and produce a 5-section report covering: current applications, key companies, clinical trial results, regulatory landscape, and 2026 predictions” has a clear stopping point. Always specify the exact form and scope of the expected output.
Mistake 2: No intermediate checkpoints. Complex plans can go off track at step 3 of 12 and not show obvious failure until step 12. Add intermediate validation: after the research phase, have the agent summarize what it found and verify it covers the required topics before proceeding to drafting.
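One way to implement such a checkpoint is a coverage check between phases. The topic list and function below are illustrative, not from any framework:

# Hypothetical checkpoint between the research phase and the drafting phase.
REQUIRED_TOPICS = ["applications", "companies", "trials", "regulation", "predictions"]

def missing_topics(research_summary: str) -> list[str]:
    """Return every required topic the research summary never mentions."""
    text = research_summary.lower()
    return [topic for topic in REQUIRED_TOPICS if topic not in text]

gaps = missing_topics("Covered current applications, key companies, and clinical trials...")
if gaps:
    # Send the agent back for targeted follow-up research instead of drafting.
    print(f"Coverage gaps before drafting: {gaps}")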
Mistake 3: Trusting planning agents with irreversible actions. If your planning agent can “send emails,” “book flights,” or “execute database changes,” a planning error can cause real-world harm. Use human-in-the-loop checkpoints for irreversible actions, especially in early development.
Mistake 4: No cost limit. A planning agent with access to web search and LLM calls can rack up significant API costs chasing comprehensive coverage. Always set a maximum number of iterations and tool calls.
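A minimal budget guard (names are my own) that caps both loop iterations and tool calls:

# Hypothetical hard budget for an agent loop: cap iterations and tool calls.
class BudgetExceeded(RuntimeError):
    pass

class Budget:
    def __init__(self, max_iterations: int = 10, max_tool_calls: int = 25):
        self.iterations = 0
        self.tool_calls = 0
        self.max_iterations = max_iterations
        self.max_tool_calls = max_tool_calls

    def tick_iteration(self) -> None:
        self.iterations += 1
        if self.iterations > self.max_iterations:
            raise BudgetExceeded("iteration budget exhausted")

    def tick_tool_call(self) -> None:
        self.tool_calls += 1
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("tool-call budget exhausted")

Call tick_iteration() at the top of each planning-loop pass and tick_tool_call() before each tool invocation; when BudgetExceeded fires, return the best partial result rather than spending further.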
Key Takeaways
- Planning separates goal from method. The user defines what. The agent discovers how. This is what enables autonomous behavior on complex, open-ended tasks.
- Static vs dynamic is the central decision. Known, repeatable processes → fixed workflow. Unknown, context-dependent processes → planning agent.
- Plans are starting points, not scripts. A capable planning agent re-plans when obstacles appear — venue is booked, search returns nothing, a step fails. Adaptability is the feature.
- CrewAI implements planning through task structure. The two-part description (plan, then execute) and the expected_output with explicit section headers force explicit plan-then-act behavior.
- Google Deep Research is an iterative planner. It presents the plan for user review, then runs a search loop that continuously re-formulates queries based on what it finds — combining planning, parallelization, and reflection.
- The OpenAI Deep Research API exposes every step. Unlike the UI, the API returns reasoning summaries, search queries, and code executions as inspectable objects — enabling debugging, auditing, and downstream integration.
- Citations are non-negotiable for trust. Both deep research systems link every factual claim to a source. For planning agents in enterprise use, verifiability is a requirement, not a feature.
Next up — Chapter 7: Multi-Agent Systems, where individual agents become coordinated teams — each specialized, each accountable, working in parallel toward shared goals.