ARTICLE · 18 MIN READ · JANUARY 13, 2026
Chapter 3: Parallelization
Sequential is clean. Parallel is fast. The art is knowing which tasks can run at the same time — and wiring the plumbing to make it happen.
The Problem with Waiting
Latency: The time it takes for a response to arrive. If an API takes 2 seconds to respond, its latency is 2 seconds. In agents, total latency = sum of all sequential steps. Parallelization reduces this.
I/O-bound vs CPU-bound: "I/O bound" means the code spends most of its time waiting for input/output (network responses, disk reads). "CPU bound" means it's actively computing. Async (asyncio) helps I/O-bound tasks — it's useless for CPU-bound work.
async/await: Python keywords for writing code that can pause while waiting (e.g., for an API response) and let other code run in the meantime. async def marks a function as async. await pauses that function until a result is ready. This is NOT the same as running on multiple CPU cores.
Event loop: The engine that powers asyncio. It keeps a list of tasks, runs one until it hits an await (waiting point), then switches to another task. It's like a chef managing multiple dishes — not cooking two things simultaneously, but switching attention efficiently.
GIL (Global Interpreter Lock): A Python rule that only allows one thread to execute Python code at a time. This is why Python threads don't give true CPU parallelism. But for I/O-bound work (like LLM API calls), the GIL barely matters because the thread is just waiting, not executing.
In Chapter 1 we chained steps sequentially. In Chapter 2 we added decision-making. Both assume the same thing: one step runs, finishes, then the next begins.
That’s the right model when each step genuinely needs the previous step’s output. But often, it isn’t necessary — it’s just the default.
Imagine your agent needs to research a company. It pulls:
| Task | Simulated latency |
|---|---|
| Search recent news | 1.2 s |
| Fetch stock price data | 0.9 s |
| Check social media mentions | 0.7 s |
| Query internal company database | 1.5 s |
| Synthesize all findings | 0.8 s |
Sequential total: 5.1 seconds. Every task waits for the one before it — even though none of them depend on each other.
Now think about it differently. News search doesn’t need stock data to start. Social media check doesn’t need news results. All four lookups are completely independent. So fire them all at once. Wait for the slowest (1.5 s), then synthesize.
Parallel total: 2.3 seconds. Same answer. 2.2× faster.
That’s parallelization: identify the independent tasks, fire them concurrently, wait for everything to land, then continue.
The fundamental principle: independence. Two tasks are “independent” if neither one needs the other’s result to start. In the company research example, “fetch stock data” doesn’t need “search recent news” to finish first — they can run simultaneously. But “synthesize all findings” does need all four lookups to finish — it can only run after they all complete.
The speed formula. For a set of N independent tasks each taking time T₁, T₂, …, Tₙ:
- Sequential time = T₁ + T₂ + … + Tₙ + T_synthesis (you wait for each in turn, then synthesize)
- Parallel time = max(T₁, T₂, …, Tₙ) + T_synthesis (you wait only for the slowest, then synthesize)
This is why parallelization is most valuable when tasks have similar latencies. If you have three tasks that each take 2 seconds, sequential takes 6 seconds, parallel takes 2 seconds — a 3× speedup. If one task takes 10 seconds and two take 0.1 seconds, parallel barely helps because you’re dominated by the slow task regardless.
Where the time goes in AI systems. In LLM-based agents, almost all the time is spent waiting for API responses. The Python code runs in microseconds. The network round-trip to the LLM API takes 1-5 seconds. This is called “I/O-bound” work — your program is mostly waiting for input/output, not actively computing. This is the ideal scenario for parallelization, because while your program is waiting for API response A, it can fire off requests for B and C simultaneously.
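To make the numbers concrete, here is a minimal, self-contained sketch that simulates the research example with asyncio.sleep standing in for the network calls (the task names and latencies simply mirror the table above):
import asyncio
import time

# A stand-in for a network call; the latencies mirror the table above.
async def fetch(name: str, seconds: float) -> str:
    await asyncio.sleep(seconds)
    return f"{name} result"

async def sequential() -> float:
    start = time.perf_counter()
    for name, sec in [("news", 1.2), ("stocks", 0.9), ("social", 0.7), ("database", 1.5)]:
        await fetch(name, sec)          # each lookup waits for the previous one
    await asyncio.sleep(0.8)            # synthesis
    return time.perf_counter() - start  # ~5.1 s

async def parallel() -> float:
    start = time.perf_counter()
    await asyncio.gather(               # fire all four independent lookups at once
        fetch("news", 1.2),
        fetch("stocks", 0.9),
        fetch("social", 0.7),
        fetch("database", 1.5),
    )
    await asyncio.sleep(0.8)            # synthesis still runs after the fan-in
    return time.perf_counter() - start  # ~2.3 s

if __name__ == "__main__":
    print(f"sequential: {asyncio.run(sequential()):.1f} s")
    print(f"parallel:   {asyncio.run(parallel()):.1f} s")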
The Core Concept
Parallelization rests on one rule:
If Task B doesn’t need Task A’s output to start, Task B can start the moment Task A starts.
The input fans out to independent tasks. They run simultaneously. Their outputs converge at a single synthesis step.
Notice: the synthesis step is still sequential — it must wait for all parallel tasks before it can begin. You’re not removing sequential dependencies; you’re removing unnecessary ones.
The Time Difference, Visualized
When to Use It: Seven Scenarios
Information Gathering
Query multiple APIs simultaneously — news, stock data, social feeds, databases — instead of fetching them one by one.
↑ 3–5× faster research agents
Data Analysis
Run sentiment analysis, keyword extraction, categorization, and urgency scoring on the same batch of text — all at once.
↑ Multi-faceted output in one pass
Multi-API Orchestration
A travel agent checking flights, hotels, events, and restaurants simultaneously. Four calls, not four round-trips.
↑ Complete plan, not a drip feed
Content Generation
Generate subject line, body copy, image prompt, and CTA text for an email — in parallel, then assemble.
↑ Faster creative pipelines
Input Validation
Check email format, phone validity, address lookup, and profanity filter simultaneously — return all issues at once.
↑ Sub-second validation feedback
Multi-Modal Processing
Analyze the text and the image in a social post at the same time. Merge insights from both modalities at the end.
↑ No wasted latency on modalities
A/B Option Generation
Generate three different headlines simultaneously using slightly varied prompts. Pick the best one automatically.
↑ More options, same wall-clock time
How It Actually Works: asyncio
Before writing any code, one important nuance needs to be addressed — because it trips up almost everyone.
asyncio does not run code in parallel on multiple CPU cores. It runs on a single thread, using Python’s event loop.
Here’s how it works:
Event Loop (single thread):
┌────────────────────────────────────────────────────────┐
│ │
│ 1. Start Task A (send HTTP request) │
│ 2. While waiting for A's response: │
│ → Start Task B (send HTTP request) │
│ → Start Task C (send HTTP request) │
│ 3. A's response arrives → resume Task A │
│ 4. B's response arrives → resume Task B │
│ 5. C's response arrives → resume Task C │
│ 6. All three done → proceed │
│ │
└────────────────────────────────────────────────────────┘
The key word is waiting. When Task A is waiting for a network response, that’s idle time — the CPU is doing nothing for Task A. The event loop fills that idle time by starting Task B and C.
This means:
| Scenario | asyncio helps? |
|---|---|
| Multiple API calls / network requests | Yes — I/O bound, lots of waiting |
| Multiple LLM calls (external API) | Yes — network I/O dominates |
| Heavy CPU computation (matrix ops) | No — CPU bound, no idle time to exploit |
| Reading many files | Yes — disk I/O has wait time |
For agentic AI — where tasks are overwhelmingly LLM API calls and web requests — asyncio is exactly the right tool. The Python GIL (Global Interpreter Lock) is largely irrelevant here because the threads aren’t fighting for CPU; they’re waiting for network.
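A small counter-example, in case the table is not convincing: wrapping CPU-bound work in coroutines buys nothing, because there is no waiting for the event loop to fill. This sketch is illustrative only; the crunch function stands in for any heavy computation:
import asyncio
import time

def crunch(n: int) -> int:
    # CPU-bound: pure computation, no network or disk wait to exploit
    return sum(i * i for i in range(n))

async def crunch_task(n: int) -> int:
    return crunch(n)  # no await inside, so this coroutine never yields control

async def main() -> None:
    start = time.perf_counter()
    # gather() starts three coroutines, but each runs to completion before
    # the event loop can switch, so total time is the same as a plain loop.
    await asyncio.gather(*(crunch_task(2_000_000) for _ in range(3)))
    print(f"CPU-bound 'parallel' run: {time.perf_counter() - start:.2f} s")

asyncio.run(main())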
asyncio Explained: The Single-Thread Concurrency Model
asyncio is Python’s library for writing concurrent code using the async/await syntax. Understanding it properly requires understanding a key concept: the event loop.
What is the event loop? The event loop is a scheduler — a program that maintains a queue of tasks and decides which one to run next. It’s running on a single thread, meaning there’s no true parallelism at the CPU level. Instead, it exploits the fact that most I/O operations (network requests, file reads) involve waiting — and while you’re waiting, the CPU could be doing something else.
Here’s the step-by-step execution model:
- Your main() function calls await asyncio.gather(task_A(), task_B(), task_C()).
- The event loop starts task_A. task_A sends an HTTP request to the LLM API and then hits await response — a waiting point.
- Since task_A is now waiting (not using the CPU), the event loop switches to task_B. Same thing happens — it sends its request and hits a waiting point.
- Same for task_C. All three requests are now "in flight" over the network simultaneously.
- Eventually, the LLM API responds to one of them. The event loop wakes up that task, it processes the response, and continues.
- When all three tasks complete, asyncio.gather collects their results and returns.
The async def keyword. When you write async def run_query(text), you’re declaring that this function is a coroutine — a function that can be paused and resumed by the event loop. Without async def, you can’t use await inside the function.
The await keyword. await suspends the current coroutine and yields control back to the event loop. The event loop is free to run another coroutine while this one is waiting. Think of await as: “I’m going to wait for this — while I wait, feel free to do other things.”
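A tiny, self-contained illustration of that hand-off, using asyncio.sleep to stand in for network waits (the task names are arbitrary):
import asyncio

async def call(name: str, wait: float) -> None:
    print(f"{name}: request sent")
    await asyncio.sleep(wait)   # hand control back to the event loop while "waiting"
    print(f"{name}: response received")

async def main() -> None:
    # All three "requests" go out back-to-back; responses arrive in latency order,
    # so the output is: A sent, B sent, C sent, C received, B received, A received.
    await asyncio.gather(call("A", 1.0), call("B", 0.5), call("C", 0.2))

asyncio.run(main())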
asyncio.gather() vs running tasks sequentially. Without gather:
result_A = await run_query(text_A) # wait for A to finish
result_B = await run_query(text_B) # only then start B
result_C = await run_query(text_C) # only then start C
Total time: T_A + T_B + T_C.
With gather:
result_A, result_B, result_C = await asyncio.gather(
run_query(text_A),
run_query(text_B),
run_query(text_C)
)
Total time: max(T_A, T_B, T_C).
For LLM API calls that each take ~2 seconds, sequential takes ~6 seconds. Parallel takes ~2 seconds. Same results, 3× faster.
The asyncio.run() entry point. Python scripts are synchronous by default — they don’t have an event loop running. asyncio.run(main()) creates a new event loop, runs the main() coroutine to completion in that loop, and then closes the loop. This is always the pattern for running async code from a synchronous script’s if __name__ == "__main__" block.
Common mistake: using regular invoke inside an async context. If you call chain.invoke() (the synchronous version) inside an async function, it blocks the event loop for the entire duration of the API call. No other coroutine can run during that time. You’ve effectively serialized your “parallel” calls. Always use chain.ainvoke() (async version) inside async def functions.
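To make the mistake concrete, here is a minimal sketch. The chain is a placeholder built the same way as the chains in the next section; only the contrast between the two functions matters:
import asyncio
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Placeholder chain, built the same way as the chains in the next section.
chain = (
    ChatPromptTemplate.from_messages([("user", "Summarize: {text}")])
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

async def blocked(texts: list[str]) -> list[str]:
    # WRONG inside async code: invoke() is synchronous, so each call blocks
    # the event loop; the three requests run one after another.
    return [chain.invoke({"text": t}) for t in texts]

async def concurrent(texts: list[str]) -> list[str]:
    # RIGHT: ainvoke() returns a coroutine, so gather() keeps all three
    # requests in flight while each one waits on the network.
    return await asyncio.gather(*(chain.ainvoke({"text": t}) for t in texts))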
The LangChain Way: RunnableParallel
LangChain implements parallelization through RunnableParallel — a construct that takes a dictionary of named chains and runs all of them at once, returning a dictionary of results.
import os
import asyncio
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
Why these imports?
- asyncio — Python's built-in library for writing concurrent code using async/await
- ChatOpenAI — the LangChain wrapper for OpenAI's chat models (swappable for any other provider)
- ChatPromptTemplate — structures messages into system + user roles (what the model expects)
- StrOutputParser — converts the raw message object from the LLM into a plain Python string
- RunnableParallel — the key component that executes multiple chains simultaneously
- RunnablePassthrough — passes the input through unchanged, so downstream steps can still access the original value
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)
temperature=0.7 — a mid-range value. Lower (0.0) makes outputs deterministic and consistent; higher (1.0) adds creativity. For research summaries we want some flexibility, so 0.7 is appropriate.
Defining Three Independent Chains
summarize_chain = (
ChatPromptTemplate.from_messages([
("system", "Summarize the following topic concisely:"),
("user", "{topic}")
])
| llm
| StrOutputParser()
)
questions_chain = (
ChatPromptTemplate.from_messages([
("system", "Generate three interesting questions about the following topic:"),
("user", "{topic}")
])
| llm
| StrOutputParser()
)
terms_chain = (
ChatPromptTemplate.from_messages([
("system", "Identify 5–10 key terms from the following topic, separated by commas:"),
("user", "{topic}")
])
| llm
| StrOutputParser()
)
Each chain is a complete pipeline: prompt → LLM → string output. They all take {topic} as input and return a string. None of them depend on each other's output — which makes them perfect candidates for parallel execution.
Building the Parallel Block
map_chain = RunnableParallel({
"summary": summarize_chain,
"questions": questions_chain,
"key_terms": terms_chain,
"topic": RunnablePassthrough(), # ← keep the original input available downstream
})
How RunnableParallel works: When you call map_chain.invoke("space exploration"):
- LangChain sends "space exploration" to summarize_chain, questions_chain, terms_chain, and RunnablePassthrough() at the same time
- Each chain runs concurrently (via a thread pool for invoke, or as async tasks in the event loop for ainvoke)
- Once all four return, RunnableParallel packages their outputs into a single dictionary: {"summary": "...", "questions": "...", "key_terms": "...", "topic": "space exploration"}
Why RunnablePassthrough()? The synthesis step needs the original topic text — not just the processed outputs. Without it, the original string would be consumed and discarded by the parallel step. RunnablePassthrough() passes the input through unchanged so the next step can reference it.
Data flow through map_chain:
Input: "space exploration"
│
├──→ summarize_chain ──→ "A summary of space exploration..."
│
├──→ questions_chain ──→ "1. What year... 2. Who... 3. Why..."
│
├──→ terms_chain ──→ "NASA, Apollo, orbit, rocket..."
│
└──→ RunnablePassthrough() ──→ "space exploration"
Output: { "summary": ..., "questions": ..., "key_terms": ..., "topic": ... }
The Synthesis Step
synthesis_prompt = ChatPromptTemplate.from_messages([
("system", """Based on the following information:
Summary: {summary}
Related Questions: {questions}
Key Terms: {key_terms}
Synthesize a comprehensive answer."""),
("user", "Original topic: {topic}")
])
full_parallel_chain = map_chain | synthesis_prompt | llm | StrOutputParser()
The | pipe connects map_chain's dictionary output directly into synthesis_prompt. LangChain automatically fills {summary}, {questions}, {key_terms}, and {topic} from the dictionary keys. This is why the dictionary keys in RunnableParallel must match the variable names in the synthesis prompt exactly.
Running It Asynchronously
async def run_parallel_example(topic: str) -> None:
    response = await full_parallel_chain.ainvoke(topic)
    print(response)

if __name__ == "__main__":
    asyncio.run(run_parallel_example("The history of space exploration"))
ainvoke vs invoke: ainvoke is the async version. It allows the event loop to switch between the parallel tasks while they're waiting for API responses. Calling the synchronous invoke from inside an async function would block the event loop for the full duration of the call and defeat the purpose.
asyncio.run(): This is the standard entry point for running async code from a synchronous context (like a script's __main__ block). It creates an event loop, runs the coroutine, and then closes the loop.
Full Data Flow
The Google ADK Way: ParallelAgent
The Google ADK takes a different approach. Instead of wiring chains together, you define agents and declare their relationships using ParallelAgent and SequentialAgent. The framework handles the scheduling.
from google.adk.agents import LlmAgent, ParallelAgent, SequentialAgent
from google.adk.tools import google_search
GEMINI_MODEL = "gemini-2.0-flash"
Why these imports?
- LlmAgent — a single agent powered by an LLM. You give it an instruction and optional tools.
- ParallelAgent — an orchestrator that runs its sub_agents concurrently, waiting until all complete before proceeding.
- SequentialAgent — an orchestrator that runs its sub_agents one after another. Used to chain the ParallelAgent with the synthesis agent.
- google_search — a built-in ADK tool that gives agents access to live web search.
Three Researcher Agents (the parallel workers)
researcher_agent_1 = LlmAgent(
name = "RenewableEnergyResearcher",
model = GEMINI_MODEL,
instruction = """You are a research assistant specializing in energy.
Research the latest advancements in 'renewable energy sources'.
Use the Google Search tool provided.
Summarize your key findings concisely (1–2 sentences).
Output *only* the summary.""",
description = "Researches renewable energy sources.",
tools = [google_search],
output_key = "renewable_energy_result", # ← stores result in session state
)
Why docstring-style instructions? The ADK uses the instruction field as the agent's system prompt. Being explicit about:
- What tool to use (Use the Google Search tool)
- How much to write (1–2 sentences)
- What to output (Output *only* the summary)
…prevents the agent from adding preamble, caveats, or asking clarifying questions.
Why output_key? This is how parallel agents share results. When researcher_agent_1 finishes, it stores its output string into the session state under the key "renewable_energy_result". The synthesis agent can then read from {renewable_energy_result} in its instruction template. Without output_key, the parallel agents' outputs would be lost.
Researchers 2 and 3 are identical in structure, covering EV technology (output_key="ev_technology_result") and carbon capture (output_key="carbon_capture_result").
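For reference, here is what the second researcher might look like. The agent name and instruction wording are illustrative (the chapter only specifies the structure and output keys); the third researcher follows the same pattern with output_key="carbon_capture_result":
researcher_agent_2 = LlmAgent(
    name = "EVTechnologyResearcher",   # illustrative name
    model = GEMINI_MODEL,
    instruction = """You are a research assistant specializing in transportation.
Research the latest advancements in 'electric vehicle technology'.
Use the Google Search tool provided.
Summarize your key findings concisely (1–2 sentences).
Output *only* the summary.""",
    description = "Researches electric vehicle technology.",
    tools = [google_search],
    output_key = "ev_technology_result",   # ← read by the synthesis agent via {ev_technology_result}
)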
The ParallelAgent (runs all three at once)
parallel_research_agent = ParallelAgent(
name = "ParallelWebResearchAgent",
sub_agents = [researcher_agent_1, researcher_agent_2, researcher_agent_3],
description = "Runs multiple research agents in parallel to gather information.",
)
This is the entire parallelization mechanism in ADK — just declare sub_agents inside a ParallelAgent. The framework:
- Starts all three LlmAgents concurrently
- Has each agent perform its search and write its result to session state via output_key
- Completes the ParallelAgent once all sub-agents have finished
No async code, no event loop management, no callback hell — the framework handles all of it.
The Synthesis Agent
merger_agent = LlmAgent(
name = "SynthesisAgent",
model = GEMINI_MODEL,
instruction = """You are responsible for combining research findings into a structured report.
**Input Summaries:**
* Renewable Energy: {renewable_energy_result}
* Electric Vehicles: {ev_technology_result}
* Carbon Capture: {carbon_capture_result}
**CRITICAL RULE:** Base your entire response *exclusively* on the Input Summaries above.
Do NOT add external knowledge not present in these summaries.
**Output Format:**
## Summary of Recent Sustainable Technology Advancements
### Renewable Energy Findings
[Synthesize only the renewable energy input summary]
### Electric Vehicle Findings
[Synthesize only the EV input summary]
### Carbon Capture Findings
[Synthesize only the carbon capture input summary]
### Overall Conclusion
[1–2 sentences connecting only the findings above]
Output *only* the structured report.""",
description = "Combines research findings into a structured, cited report.",
)
Why {renewable_energy_result} in the instruction? The ADK automatically fills these {key} placeholders from the session state. Since the three researcher agents stored their outputs under exactly these keys, the synthesis agent receives all three summaries injected directly into its prompt.
Why the "CRITICAL RULE"? Without it, LLMs will use their pre-trained world knowledge to supplement the research, making the output non-deterministic and potentially inconsistent with what was actually found in the search. The explicit constraint forces the agent to stay grounded.
The SequentialAgent (orchestrates everything)
sequential_pipeline_agent = SequentialAgent(
name = "ResearchAndSynthesisPipeline",
sub_agents = [parallel_research_agent, merger_agent],
description = "Coordinates parallel research and synthesizes the results.",
)
root_agent = sequential_pipeline_agent
The SequentialAgent runs parallel_research_agent first (which internally runs the three researchers in parallel), waits for it to complete, then runs merger_agent. This gives you parallelism where possible, sequencing where necessary — exactly the right structure for fan-out / fan-in workflows.
ADK Orchestration Flow
Side by Side: LangChain vs ADK
| | LangChain (LCEL) | Google ADK |
|---|---|---|
| Parallelism primitive | RunnableParallel dict | ParallelAgent |
| Sequencing primitive | \| pipe operator | SequentialAgent |
| How results are shared | Dict keys flow through the pipeline | output_key writes to session state |
| Async model | asyncio via ainvoke / astream | Managed by ADK framework |
| Code verbosity | Lower — functional chain composition | Higher — agent class definitions |
| Observability | LangSmith tracing | ADK built-in tracing |
| Best for | Tight, composable chains where you control the data flow | Multi-agent systems where agents are independent workers |
The fundamental difference: LangChain is data-flow (inputs pipe through transforms), ADK is agent-flow (agents communicate via shared state). Both achieve parallelism, but the mental model is different.
At a Glance
Independent tasks that don't need each other's output are executed simultaneously instead of one at a time.
Sequential execution adds all latencies together. Parallel execution takes only the longest. For I/O-bound work (API calls, LLM requests), this is a 2–5× speedup with zero additional cost.
Use when a workflow contains multiple independent lookups, computations, or content-generation tasks that each produce a piece of a larger whole.
Key Takeaways
- The core rule: Tasks that don't depend on each other's output can run in parallel. Tasks that do must remain sequential.
- The gain: For I/O-bound work (LLM calls, API requests, database queries), parallelism reduces total time from the sum of all durations to the max of the parallel durations plus the sequential tail.
- asyncio is concurrency, not CPU parallelism. It works by filling idle network-wait time with other tasks. This is exactly what agentic workflows need.
- LangChain uses RunnableParallel — wrap a dictionary of chains and the LCEL runtime fires them all concurrently, collecting results into a dict for the next step.
- ADK uses ParallelAgent — declare sub-agents in a ParallelAgent, use output_key to write results to session state, and a downstream synthesis agent reads from state via {key} placeholders in its instruction.
- The synthesis step is always sequential. Parallelization is a fan-out / fan-in pattern: spread out, work in parallel, reconverge.
- Added complexity is real. Parallel workflows are harder to debug, log, and reason about than sequential ones. Use it when the latency gain is significant — not as a default architecture.
Next up — Chapter 4: Orchestration, where we combine chaining, routing, and parallelization into full multi-agent systems.