ARTICLE · 18 MIN READ · JANUARY 13, 2026
Chapter 3: Parallelization
Sequential is clean. Parallel is fast. The art is knowing which tasks can run at the same time — and wiring the plumbing to make it happen.
The Problem with Waiting
Latency: The time it takes for a response to arrive. If an API takes 2 seconds to respond, its latency is 2 seconds. In agents, total latency = sum of all sequential steps. Parallelization reduces this.
I/O-bound vs CPU-bound: "I/O bound" means the code spends most of its time waiting for input/output (network responses, disk reads). "CPU bound" means it's actively computing. Async (asyncio) helps I/O-bound tasks — it's useless for CPU-bound work.
async/await: Python keywords for writing code that can pause while waiting (e.g., for an API response) and let other code run in the meantime. async def marks a function as async. await pauses that function until a result is ready. This is NOT the same as running on multiple CPU cores.
Event loop: The engine that powers asyncio. It keeps a list of tasks, runs one until it hits an await (waiting point), then switches to another task. It's like a chef managing multiple dishes — not cooking two things simultaneously, but switching attention efficiently.
GIL (Global Interpreter Lock): A Python rule that only allows one thread to execute Python code at a time. This is why Python threads don't give true CPU parallelism. But for I/O-bound work (like LLM API calls), the GIL barely matters because the thread is just waiting, not executing.
In Chapter 1 we chained steps sequentially. In Chapter 2 we added decision-making. Both assume the same thing: one step runs, finishes, then the next begins.
That’s the right model when each step genuinely needs the previous step’s output. But often, it isn’t necessary — it’s just the default.
Imagine your agent needs to research a company. It pulls:
| Task | Simulated latency |
|---|---|
| Search recent news | 1.2 s |
| Fetch stock price data | 0.9 s |
| Check social media mentions | 0.7 s |
| Query internal company database | 1.5 s |
| Synthesize all findings | 0.8 s |
Sequential total: 5.1 seconds. Every task waits for the one before it — even though none of them depend on each other.
Now think about it differently. News search doesn’t need stock data to start. Social media check doesn’t need news results. All four lookups are completely independent. So fire them all at once. Wait for the slowest (1.5 s), then synthesize.
Parallel total: 2.3 seconds. Same answer. 2.2× faster.
That’s parallelization: identify the independent tasks, fire them concurrently, wait for everything to land, then continue.
The fundamental principle: independence. Two tasks are “independent” if neither one needs the other’s result to start. In the company research example, “fetch stock data” doesn’t need “search recent news” to finish first — they can run simultaneously. But “synthesize all findings” does need all four lookups to finish — it can only run after they all complete.
The speed formula. For a set of N independent tasks each taking time T₁, T₂, …, Tₙ:
- Sequential time = T₁ + T₂ + … + Tₙ + T_synthesis (you wait for each in turn, then synthesize)
- Parallel time = max(T₁, T₂, …, Tₙ) + T_synthesis (you wait only for the slowest, then synthesize)
This is why parallelization is most valuable when tasks have similar latencies. If you have three tasks that each take 2 seconds, sequential takes 6 seconds, parallel takes 2 seconds — a 3× speedup. If one task takes 10 seconds and two take 0.1 seconds, parallel barely helps because you’re dominated by the slow task regardless.
Where the time goes in AI systems. In LLM-based agents, almost all the time is spent waiting for API responses. The Python code runs in microseconds. The network round-trip to the LLM API takes 1-5 seconds. This is called “I/O-bound” work — your program is mostly waiting for input/output, not actively computing. This is the ideal scenario for parallelization, because while your program is waiting for API response A, it can fire off requests for B and C simultaneously.
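To make the numbers concrete, here is a minimal, self-contained sketch that simulates the research example with asyncio.sleep standing in for the network calls (the task names and latencies simply mirror the table above):
import asyncio
import time

# A stand-in for a network call; the latencies mirror the table above.
async def fetch(name: str, seconds: float) -> str:
    await asyncio.sleep(seconds)
    return f"{name} result"

async def sequential() -> float:
    start = time.perf_counter()
    for name, sec in [("news", 1.2), ("stocks", 0.9), ("social", 0.7), ("database", 1.5)]:
        await fetch(name, sec)          # each lookup waits for the previous one
    await asyncio.sleep(0.8)            # synthesis
    return time.perf_counter() - start  # ~5.1 s

async def parallel() -> float:
    start = time.perf_counter()
    await asyncio.gather(               # fire all four independent lookups at once
        fetch("news", 1.2),
        fetch("stocks", 0.9),
        fetch("social", 0.7),
        fetch("database", 1.5),
    )
    await asyncio.sleep(0.8)            # synthesis still runs after the fan-in
    return time.perf_counter() - start  # ~2.3 s

if __name__ == "__main__":
    print(f"sequential: {asyncio.run(sequential()):.1f} s")
    print(f"parallel:   {asyncio.run(parallel()):.1f} s")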
The Core Concept
Parallelization rests on one rule:
If Task B doesn’t need Task A’s output to start, Task B can start the moment Task A starts.
The input fans out to independent tasks. They run simultaneously. Their outputs converge at a single synthesis step.
Notice: the synthesis step is still sequential — it must wait for all parallel tasks before it can begin. You’re not removing sequential dependencies; you’re removing unnecessary ones.
The Time Difference, Visualized
When to Use It: Seven Scenarios
Information Gathering
Query multiple APIs simultaneously — news, stock data, social feeds, databases — instead of fetching them one by one.
↑ 3–5× faster research agents
Data Analysis
Run sentiment analysis, keyword extraction, categorization, and urgency scoring on the same batch of text — all at once.
↑ Multi-faceted output in one pass
Multi-API Orchestration
A travel agent checking flights, hotels, events, and restaurants simultaneously. Four calls, not four round-trips.
↑ Complete plan, not a drip feed
Content Generation
Generate subject line, body copy, image prompt, and CTA text for an email — in parallel, then assemble.
↑ Faster creative pipelines
Input Validation
Check email format, phone validity, address lookup, and profanity filter simultaneously — return all issues at once.
↑ Sub-second validation feedback
Multi-Modal Processing
Analyze the text and the image in a social post at the same time. Merge insights from both modalities at the end.
↑ No wasted latency on modalities
A/B Option Generation
Generate three different headlines simultaneously using slightly varied prompts. Pick the best one automatically.
↑ More options, same wall-clock time
How It Actually Works: asyncio
Before writing any code, one important nuance needs to be addressed — because it trips up almost everyone.
asyncio does not run code in parallel on multiple CPU cores. It runs on a single thread, using Python’s event loop.
Here’s how it works:
Event Loop (single thread):
┌────────────────────────────────────────────────────────┐
│ │
│ 1. Start Task A (send HTTP request) │
│ 2. While waiting for A's response: │
│ → Start Task B (send HTTP request) │
│ → Start Task C (send HTTP request) │
│ 3. A's response arrives → resume Task A │
│ 4. B's response arrives → resume Task B │
│ 5. C's response arrives → resume Task C │
│ 6. All three done → proceed │
│ │
└────────────────────────────────────────────────────────┘
The key word is waiting. When Task A is waiting for a network response, that’s idle time — the CPU is doing nothing for Task A. The event loop fills that idle time by starting Task B and C.
This means:
| Scenario | asyncio helps? |
|---|---|
| Multiple API calls / network requests | Yes — I/O bound, lots of waiting |
| Multiple LLM calls (external API) | Yes — network I/O dominates |
| Heavy CPU computation (matrix ops) | No — CPU bound, no idle time to exploit |
| Reading many files | Yes — disk I/O has wait time |
For agentic AI — where tasks are overwhelmingly LLM API calls and web requests — asyncio is exactly the right tool. The Python GIL (Global Interpreter Lock) is largely irrelevant here because the threads aren’t fighting for CPU; they’re waiting for network.
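A small counter-example, in case the table is not convincing: wrapping CPU-bound work in coroutines buys nothing, because there is no waiting for the event loop to fill. This sketch is illustrative only; the crunch function stands in for any heavy computation:
import asyncio
import time

def crunch(n: int) -> int:
    # CPU-bound: pure computation, no network or disk wait to exploit
    return sum(i * i for i in range(n))

async def crunch_task(n: int) -> int:
    return crunch(n)  # no await inside, so this coroutine never yields control

async def main() -> None:
    start = time.perf_counter()
    # gather() starts three coroutines, but each runs to completion before
    # the event loop can switch, so total time is the same as a plain loop.
    await asyncio.gather(*(crunch_task(2_000_000) for _ in range(3)))
    print(f"CPU-bound 'parallel' run: {time.perf_counter() - start:.2f} s")

asyncio.run(main())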
asyncio Explained: The Single-Thread Concurrency Model
asyncio is Python’s library for writing concurrent code using the async/await syntax. Understanding it properly requires understanding a key concept: the event loop.
What is the event loop? The event loop is a scheduler — a program that maintains a queue of tasks and decides which one to run next. It’s running on a single thread, meaning there’s no true parallelism at the CPU level. Instead, it exploits the fact that most I/O operations (network requests, file reads) involve waiting — and while you’re waiting, the CPU could be doing something else.
Here’s the step-by-step execution model:
- Your main() function calls await asyncio.gather(task_A(), task_B(), task_C()).
- The event loop starts task_A. task_A sends an HTTP request to the LLM API and then hits await response — a waiting point.
- Since task_A is now waiting (not using the CPU), the event loop switches to task_B. Same thing happens — it sends its request and hits a waiting point.
- Same for task_C. All three requests are now "in flight" over the network simultaneously.
- Eventually, the LLM API responds to one of them. The event loop wakes up that task, it processes the response, and continues.
- When all three tasks complete, asyncio.gather collects their results and returns.
The async def keyword. When you write async def run_query(text), you’re declaring that this function is a coroutine — a function that can be paused and resumed by the event loop. Without async def, you can’t use await inside the function.
The await keyword. await suspends the current coroutine and yields control back to the event loop. The event loop is free to run another coroutine while this one is waiting. Think of await as: “I’m going to wait for this — while I wait, feel free to do other things.”
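A tiny, self-contained illustration of that hand-off, using asyncio.sleep to stand in for network waits (the task names are arbitrary):
import asyncio

async def call(name: str, wait: float) -> None:
    print(f"{name}: request sent")
    await asyncio.sleep(wait)   # hand control back to the event loop while "waiting"
    print(f"{name}: response received")

async def main() -> None:
    # All three "requests" go out back-to-back; responses arrive in latency order,
    # so the output is: A sent, B sent, C sent, C received, B received, A received.
    await asyncio.gather(call("A", 1.0), call("B", 0.5), call("C", 0.2))

asyncio.run(main())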
asyncio.gather() vs running tasks sequentially. Without gather:
result_A = await run_query(text_A) # wait for A to finish
result_B = await run_query(text_B) # only then start B
result_C = await run_query(text_C) # only then start C
Total time: T_A + T_B + T_C.
With gather:
result_A, result_B, result_C = await asyncio.gather(
run_query(text_A),
run_query(text_B),
run_query(text_C)
)
Total time: max(T_A, T_B, T_C).
For LLM API calls that each take ~2 seconds, sequential takes ~6 seconds. Parallel takes ~2 seconds. Same results, 3× faster.
The asyncio.run() entry point. Python scripts are synchronous by default — they don’t have an event loop running. asyncio.run(main()) creates a new event loop, runs the main() coroutine to completion in that loop, and then closes the loop. This is always the pattern for running async code from a synchronous script’s if __name__ == "__main__" block.
Common mistake: using regular invoke inside an async context. If you call chain.invoke() (the synchronous version) inside an async function, it blocks the event loop for the entire duration of the API call. No other coroutine can run during that time. You’ve effectively serialized your “parallel” calls. Always use chain.ainvoke() (async version) inside async def functions.
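To make the mistake concrete, here is a minimal sketch. The chain is a placeholder built the same way as the chains in the next section; only the contrast between the two functions matters:
import asyncio
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Placeholder chain, built the same way as the chains in the next section.
chain = (
    ChatPromptTemplate.from_messages([("user", "Summarize: {text}")])
    | ChatOpenAI(model="gpt-4o-mini")
    | StrOutputParser()
)

async def blocked(texts: list[str]) -> list[str]:
    # WRONG inside async code: invoke() is synchronous, so each call blocks
    # the event loop; the three requests run one after another.
    return [chain.invoke({"text": t}) for t in texts]

async def concurrent(texts: list[str]) -> list[str]:
    # RIGHT: ainvoke() returns a coroutine, so gather() keeps all three
    # requests in flight while each one waits on the network.
    return await asyncio.gather(*(chain.ainvoke({"text": t}) for t in texts))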
The LangChain Way: RunnableParallel
LangChain implements parallelization through RunnableParallel — a construct that takes a dictionary of named chains and runs all of them at once, returning a dictionary of results.
import os
import asyncio
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
Why these imports?
- asyncio — Python's built-in library for writing concurrent code using async/await
- ChatOpenAI — the LangChain wrapper for OpenAI's chat models (swappable for any other provider)
- ChatPromptTemplate — structures messages into system + user roles (what the model expects)
- StrOutputParser — converts the raw message object from the LLM into a plain Python string
- RunnableParallel — the key component that executes multiple chains simultaneously
- RunnablePassthrough — passes the input through unchanged, so downstream steps can still access the original value
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)
temperature=0.7 — a mid-range value. Lower (0.0) makes outputs deterministic and consistent; higher (1.0) adds creativity. For research summaries we want some flexibility, so 0.7 is appropriate.
Defining Three Independent Chains
summarize_chain = (
ChatPromptTemplate.from_messages([
("system", "Summarize the following topic concisely:"),
("user", "{topic}")
])
| llm
| StrOutputParser()
)
questions_chain = (
ChatPromptTemplate.from_messages([
("system", "Generate three interesting questions about the following topic:"),
("user", "{topic}")
])
| llm
| StrOutputParser()
)
terms_chain = (
ChatPromptTemplate.from_messages([
("system", "Identify 5–10 key terms from the following topic, separated by commas:"),
("user", "{topic}")
])
| llm
| StrOutputParser()
)
Each chain is a complete pipeline: prompt → LLM → string output. They all take {topic} as input and return a string. None of them depend on each other's output — which makes them perfect candidates for parallel execution.
Building the Parallel Block
map_chain = RunnableParallel({
"summary": summarize_chain,
"questions": questions_chain,
"key_terms": terms_chain,
"topic": RunnablePassthrough(), # ← keep the original input available downstream
})
How RunnableParallel works: When you call map_chain.invoke("space exploration"):
- LangChain sends "space exploration" to summarize_chain, questions_chain, terms_chain, and RunnablePassthrough() at the same time
- Each chain runs concurrently (via a thread pool for invoke, or as async tasks in the event loop for ainvoke)
- Once all four return, RunnableParallel packages their outputs into a single dictionary: {"summary": "...", "questions": "...", "key_terms": "...", "topic": "space exploration"}
Why RunnablePassthrough()? The synthesis step needs the original topic text — not just the processed outputs. Without it, the original string would be consumed and discarded by the parallel step. RunnablePassthrough() passes the input through unchanged so the next step can reference it.
Data flow through map_chain:
Input: "space exploration"
│
├──→ summarize_chain ──→ "A summary of space exploration..."
│
├──→ questions_chain ──→ "1. What year... 2. Who... 3. Why..."
│
├──→ terms_chain ──→ "NASA, Apollo, orbit, rocket..."
│
└──→ RunnablePassthrough() ──→ "space exploration"
Output: { "summary": ..., "questions": ..., "key_terms": ..., "topic": ... }
The Synthesis Step
synthesis_prompt = ChatPromptTemplate.from_messages([
("system", """Based on the following information:
Summary: {summary}
Related Questions: {questions}
Key Terms: {key_terms}
Synthesize a comprehensive answer."""),
("user", "Original topic: {topic}")
])
full_parallel_chain = map_chain | synthesis_prompt | llm | StrOutputParser()
The | pipe connects map_chain's dictionary output directly into synthesis_prompt. LangChain automatically fills {summary}, {questions}, {key_terms}, and {topic} from the dictionary keys. This is why the dictionary keys in RunnableParallel must match the variable names in the synthesis prompt exactly.
Running It Asynchronously
async def run_parallel_example(topic: str) -> None:
    response = await full_parallel_chain.ainvoke(topic)
    print(response)

if __name__ == "__main__":
    asyncio.run(run_parallel_example("The history of space exploration"))
ainvoke vs invoke: ainvoke is the async version. It allows the event loop to switch between the parallel tasks while they're waiting for API responses. Calling the synchronous invoke from inside an async function would block the event loop for the full duration of the call and defeat the purpose.
asyncio.run(): This is the standard entry point for running async code from a synchronous context (like a script's __main__ block). It creates an event loop, runs the coroutine, and then closes the loop.
Full Data Flow
The Google ADK Way: ParallelAgent
The Google ADK takes a different approach. Instead of wiring chains together, you define agents and declare their relationships using ParallelAgent and SequentialAgent. The framework handles the scheduling.
from google.adk.agents import LlmAgent, ParallelAgent, SequentialAgent
from google.adk.tools import google_search
GEMINI_MODEL = "gemini-2.0-flash"
Why these imports?
- LlmAgent — a single agent powered by an LLM. You give it an instruction and optional tools.
- ParallelAgent — an orchestrator that runs its sub_agents concurrently, waiting until all complete before proceeding.
- SequentialAgent — an orchestrator that runs its sub_agents one after another. Used to chain the ParallelAgent with the synthesis agent.
- google_search — a built-in ADK tool that gives agents access to live web search.
Three Researcher Agents (the parallel workers)
researcher_agent_1 = LlmAgent(
name = "RenewableEnergyResearcher",
model = GEMINI_MODEL,
instruction = """You are a research assistant specializing in energy.
Research the latest advancements in 'renewable energy sources'.
Use the Google Search tool provided.
Summarize your key findings concisely (1–2 sentences).
Output *only* the summary.""",
description = "Researches renewable energy sources.",
tools = [google_search],
output_key = "renewable_energy_result", # ← stores result in session state
)
Why docstring-style instructions? The ADK uses the instruction field as the agent's system prompt. Being explicit about:
- What tool to use (Use the Google Search tool)
- How much to write (1–2 sentences)
- What to output (Output *only* the summary)
…prevents the agent from adding preamble, caveats, or asking clarifying questions.
Why output_key? This is how parallel agents share results. When researcher_agent_1 finishes, it stores its output string into the session state under the key "renewable_energy_result". The synthesis agent can then read from {renewable_energy_result} in its instruction template. Without output_key, the parallel agents' outputs would be lost.
Researchers 2 and 3 are identical in structure, covering EV technology (output_key="ev_technology_result") and carbon capture (output_key="carbon_capture_result").
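For reference, here is what the second researcher might look like. The agent name and instruction wording are illustrative (the chapter only specifies the structure and output keys); the third researcher follows the same pattern with output_key="carbon_capture_result":
researcher_agent_2 = LlmAgent(
    name = "EVTechnologyResearcher",   # illustrative name
    model = GEMINI_MODEL,
    instruction = """You are a research assistant specializing in transportation.
Research the latest advancements in 'electric vehicle technology'.
Use the Google Search tool provided.
Summarize your key findings concisely (1–2 sentences).
Output *only* the summary.""",
    description = "Researches electric vehicle technology.",
    tools = [google_search],
    output_key = "ev_technology_result",   # ← read by the synthesis agent via {ev_technology_result}
)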
The ParallelAgent (runs all three at once)
parallel_research_agent = ParallelAgent(
name = "ParallelWebResearchAgent",
sub_agents = [researcher_agent_1, researcher_agent_2, researcher_agent_3],
description = "Runs multiple research agents in parallel to gather information.",
)
This is the entire parallelization mechanism in ADK — just declare sub_agents inside a ParallelAgent. The framework:
- Starts all three LlmAgents concurrently
- Has each agent perform its search and write its result to session state via output_key
- Completes the ParallelAgent once all sub-agents have finished
No async code, no event loop management, no callback hell — the framework handles all of it.
The Synthesis Agent
merger_agent = LlmAgent(
name = "SynthesisAgent",
model = GEMINI_MODEL,
instruction = """You are responsible for combining research findings into a structured report.
**Input Summaries:**
* Renewable Energy: {renewable_energy_result}
* Electric Vehicles: {ev_technology_result}
* Carbon Capture: {carbon_capture_result}
**CRITICAL RULE:** Base your entire response *exclusively* on the Input Summaries above.
Do NOT add external knowledge not present in these summaries.
**Output Format:**
## Summary of Recent Sustainable Technology Advancements
### Renewable Energy Findings
[Synthesize only the renewable energy input summary]
### Electric Vehicle Findings
[Synthesize only the EV input summary]
### Carbon Capture Findings
[Synthesize only the carbon capture input summary]
### Overall Conclusion
[1–2 sentences connecting only the findings above]
Output *only* the structured report.""",
description = "Combines research findings into a structured, cited report.",
)
Why {renewable_energy_result} in the instruction? The ADK automatically fills these {key} placeholders from the session state. Since the three researcher agents stored their outputs under exactly these keys, the synthesis agent receives all three summaries injected directly into its prompt.
Why the "CRITICAL RULE"? Without it, LLMs will use their pre-trained world knowledge to supplement the research, making the output non-deterministic and potentially inconsistent with what was actually found in the search. The explicit constraint forces the agent to stay grounded.
The SequentialAgent (orchestrates everything)
sequential_pipeline_agent = SequentialAgent(
name = "ResearchAndSynthesisPipeline",
sub_agents = [parallel_research_agent, merger_agent],
description = "Coordinates parallel research and synthesizes the results.",
)
root_agent = sequential_pipeline_agent
The SequentialAgent runs parallel_research_agent first (which internally runs the three researchers in parallel), waits for it to complete, then runs merger_agent. This gives you parallelism where possible, sequencing where necessary — exactly the right structure for fan-out / fan-in workflows.
ADK Orchestration Flow
Side by Side: LangChain vs ADK
| | LangChain (LCEL) | Google ADK |
|---|---|---|
| Parallelism primitive | RunnableParallel dict | ParallelAgent |
| Sequencing primitive | \| pipe operator | SequentialAgent |
| How results are shared | Dict keys flow through the pipeline | output_key writes to session state |
| Async model | asyncio via ainvoke / astream | Managed by ADK framework |
| Code verbosity | Lower — functional chain composition | Higher — agent class definitions |
| Observability | LangSmith tracing | ADK built-in tracing |
| Best for | Tight, composable chains where you control the data flow | Multi-agent systems where agents are independent workers |
The fundamental difference: LangChain is data-flow (inputs pipe through transforms), ADK is agent-flow (agents communicate via shared state). Both achieve parallelism, but the mental model is different.
At a Glance
Independent tasks that don't need each other's output are executed simultaneously instead of one at a time.
Sequential execution adds all latencies together. Parallel execution takes only the longest. For I/O-bound work (API calls, LLM requests), this is a 2–5× speedup with zero additional cost.
Use when a workflow contains multiple independent lookups, computations, or content-generation tasks that each produce a piece of a larger whole.
Key Takeaways
- The core rule: Tasks that don't depend on each other's output can run in parallel. Tasks that do must remain sequential.
- The gain: For I/O-bound work (LLM calls, API requests, database queries), parallelism reduces total time from the sum of all durations to the max of the parallel durations plus the sequential tail.
- asyncio is concurrency, not CPU parallelism. It works by filling idle network-wait time with other tasks. This is exactly what agentic workflows need.
- LangChain uses RunnableParallel — wrap a dictionary of chains and the LCEL runtime fires them all concurrently, collecting results into a dict for the next step.
- ADK uses ParallelAgent — declare sub-agents in a ParallelAgent, use output_key to write results to session state, and a downstream synthesis agent reads from state via {key} placeholders in its instruction.
- The synthesis step is always sequential. Parallelization is a fan-out / fan-in pattern: spread out, work in parallel, reconverge.
- Added complexity is real. Parallel workflows are harder to debug, log, and reason about than sequential ones. Use it when the latency gain is significant — not as a default architecture.
Next up — Chapter 4: Orchestration, where we combine chaining, routing, and parallelization into full multi-agent systems.